Buffer usage in EXPLAIN and pg_stat_statements (review)

Started by Euler Taveira de Oliveira, over 16 years ago (67 messages)
1 attachment(s)

Hi Itagaki-san,

I'm reviewing your patch. It is in good shape: it applies cleanly and
everything builds as intended (including the two contrib modules). It
doesn't include docs, so I wrote them. I produced another patch (attached)
that corrects some minor gripes and includes the docs. My comments are
in-line.

static bool auto_explain_log_analyze = false;
static bool auto_explain_log_verbose = false;
+ static bool auto_explain_log_buffer = false;

Rename it to auto_explain_log_buffers, because I renamed the option to the
plural form (see below).

es.verbose = auto_explain_log_verbose;
+ es.buffer = auto_explain_log_buffer;

Change this assignment to look at es.analyze too, so that es.buffers is
enabled only if es.analyze is enabled as well.
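
A minimal sketch of that gating, assuming a cut-down stand-in for ExplainState.
The struct and function names below are hypothetical and only illustrate the
rule; they are not the actual auto_explain code.

```c
#include <stdbool.h>

/* Hypothetical, cut-down stand-in for ExplainState: only the flags discussed here. */
typedef struct ExplainFlagsSketch
{
	bool		analyze;
	bool		verbose;
	bool		buffers;
} ExplainFlagsSketch;

/*
 * Derive the EXPLAIN flags from the auto_explain settings.  Buffer counters
 * are only collected while the query is instrumented, so "buffers" is
 * forced off unless "analyze" ends up enabled.
 */
ExplainFlagsSketch
sketch_explain_flags(bool do_instrument, bool log_analyze,
					 bool log_verbose, bool log_buffers)
{
	ExplainFlagsSketch es;

	es.analyze = do_instrument && log_analyze;
	es.verbose = log_verbose;
	es.buffers = es.analyze && log_buffers;
	return es;
}
```

With this shape, turning on the buffers setting alone has no effect; both
settings must be on and the query must actually be instrumented.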

+ 	/* Build a tuple descriptor for our result type */
+ 	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ 		elog(ERROR, "return type must be a row type");
+ 
per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
oldcontext = MemoryContextSwitchTo(per_query_ctx);

! tupdesc = CreateTupleDescCopy(tupdesc);

Out of curiosity, any reason for this change?

else if (strcmp(opt->defname, "costs") == 0)
es.costs = defGetBoolean(opt);
+ 		else if (strcmp(opt->defname, "buffer") == 0)
+ 			es.buffer = defGetBoolean(opt);

I decided to change "buffer" to "buffers". That's because we already have
"costs" and the statistics will not be about one kind of buffer so plural form
sounds more natural.
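
To illustrate how the renamed option would be recognized, here is a
self-contained sketch of the option loop. OptionSketch, ExplainOptsSketch and
sketch_apply_options are hypothetical names, and the integer return codes stand
in for the ereport() calls used in explain.c. The "BUFFERS requires ANALYZE"
rule is taken from the patch posted later in this thread.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for one parsed "name value" EXPLAIN option. */
typedef struct OptionSketch
{
	const char *name;
	bool		value;
} OptionSketch;

/* Hypothetical stand-in for the option flags kept in ExplainState. */
typedef struct ExplainOptsSketch
{
	bool		analyze;
	bool		costs;
	bool		buffers;
} ExplainOptsSketch;

/*
 * Apply the options in order.  Returns 0 on success, -1 for an unrecognized
 * option name, and -2 when BUFFERS is requested without ANALYZE.
 */
int
sketch_apply_options(const OptionSketch *opts, int nopts, ExplainOptsSketch *es)
{
	int			i;

	es->analyze = false;
	es->costs = true;			/* costs are printed unless turned off */
	es->buffers = false;

	for (i = 0; i < nopts; i++)
	{
		if (strcmp(opts[i].name, "analyze") == 0)
			es->analyze = opts[i].value;
		else if (strcmp(opts[i].name, "costs") == 0)
			es->costs = opts[i].value;
		else if (strcmp(opts[i].name, "buffers") == 0)
			es->buffers = opts[i].value;
		else
			return -1;			/* e.g. the old singular "buffer" */
	}

	if (es->buffers && !es->analyze)
		return -2;
	return 0;
}
```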

+ 			if (es->format == EXPLAIN_FORMAT_TEXT)
+ 			{
+ 				appendStringInfo(es->str, " (gets=%ld reads=%ld temp=%ld)",
+ 					num_gets, num_reads, num_temp);

Rename "gets" and "reads" to "hit" and "read". Maybe we could prefix it with
"buf_" or something else.

Rename "num_gets" and "num_reads" to "num_hit" and "num_read". The latter
terminology is used all over the code.
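
For reference, a small sketch of the text-format suffix with the renamed
fields. sketch_format_buffers is a hypothetical helper that writes into a plain
char buffer instead of the StringInfo used by explain.c.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Format the per-node buffer counters the way the text EXPLAIN output would
 * show them after the rename.  Returns the value of snprintf(), i.e. the
 * number of characters that the full string needs.
 */
int
sketch_format_buffers(char *dst, size_t dstlen,
					  long num_hit, long num_read, long num_temp)
{
	return snprintf(dst, dstlen, " (hit=%ld read=%ld temp=%ld)",
					num_hit, num_read, num_temp);
}
```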

+ 			}
+ 			else
+ 			{
+ 				ExplainPropertyLong("Buffer Gets", num_gets, es);
+ 				ExplainPropertyLong("Buffer Reads", num_reads, es);
+ 				ExplainPropertyLong("Buffer Temp", num_temp, es);

I didn't like this terminology. I came up with "Hit Buffers", "Read
Buffers", and "Temp Buffers", but I confess that I don't like the last two either.
"Read Buffers"? We're reading from disk blocks. "Read Blocks"? "Read Disk
Blocks"? "Read Data Blocks"?
"Temp Buffers"? It could be temporary sort files, hash files (?), or temporary
relations. "Hit Local Buffers"? "Local Buffers"? "Hit Temp Buffers"?

#include "parser/parsetree.h"
+ #include "storage/buf_internals.h"

It's not used. Removed.

+ 		CurrentInstrument = instr->prev;
+ 	}
+ 	else
+ 		elog(WARNING, "Instrumentation stack is broken");

WARNING? I changed it to DEBUG2 and made it return immediately (as is done a
few lines above).
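
The stack discipline behind that message can be sketched as follows. This
assumes a simplified node with a single counter; InstrSketch, sketch_start and
sketch_stop are hypothetical names, and the boolean return value stands in for
the elog() call.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified instrumentation node. */
typedef struct InstrSketch
{
	long		buffers_hit;
	struct InstrSketch *prev;	/* node that was current before this one */
} InstrSketch;

InstrSketch *sketch_current = NULL;	/* top of the stack */

/* Enter a node: remember the previous top and make this node current. */
void
sketch_start(InstrSketch *instr)
{
	instr->prev = sketch_current;
	sketch_current = instr;
}

/*
 * Leave a node.  Only the current top may be popped; anything else means the
 * start/stop calls were not properly nested, so return without touching the
 * counters (this is where the patch logs its message).
 */
bool
sketch_stop(InstrSketch *instr, long hits_since_start)
{
	if (instr != sketch_current)
		return false;

	instr->buffers_hit += hits_since_start;
	sketch_current = instr->prev;
	return true;
}
```

Returning early on a mismatch keeps a broken stack from corrupting the counters
of an unrelated node.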

+ /* for log_[parser|planner|executor|statement]_stats */
+ static long GlobalReadBufferCount;
+ static long GlobalReadLocalBufferCount;
+ static long GlobalBufferHitCount;
+ static long GlobalLocalBufferHitCount;
+ static long GlobalBufferFlushCount;
+ static long GlobalLocalBufferFlushCount;
+ static long GlobalBufFileReadCount;
+ static long GlobalBufFileWriteCount;
+ 

I'm not sure this is the right place for these counters. Maybe we should
put them in buf_init.c. Ideas?
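
Wherever they end up, the idea seems to be two tiers of counters: per-node
ones that are cleared often, and Global* ones that keep running totals for the
log_*_stats output. A minimal sketch under that assumption (all names are
hypothetical, and only two of the eight counters are shown):

```c
/* Hypothetical pair of counters; the patch tracks eight of them. */
typedef struct CountersSketch
{
	long		read_count;
	long		hit_count;
} CountersSketch;

CountersSketch sketch_local = {0, 0};	/* cleared after every plan node */
CountersSketch sketch_global = {0, 0};	/* running totals for log_*_stats */

/*
 * Fold the local counters into the global totals, then clear them.  The
 * order matters: clearing first would add zeros to the totals.
 */
void
sketch_reset_local(void)
{
	sketch_global.read_count += sketch_local.read_count;
	sketch_global.hit_count += sketch_local.hit_count;

	sketch_local.read_count = 0;
	sketch_local.hit_count = 0;
}

/* Clear both tiers, as a full reset would. */
void
sketch_reset_all(void)
{
	sketch_local.read_count = 0;
	sketch_local.hit_count = 0;
	sketch_global.read_count = 0;
	sketch_global.hit_count = 0;
}
```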

bool costs; /* print costs */
+ bool buffer; /* print buffer stats */

Rename it to "buffers".

+ 	/* Buffer usage */
+ 	long		buffer_gets;	/* # of buffer reads */
+ 	long		buffer_reads;	/* # of disk reads */
+ 	long		buffer_temp;	/* # of temp file reads */

Rename them to "buffers_hit", "buffers_read", and "buffers_temp".

I didn't test these changes with "big" queries because I don't have any at
hand right now. Also, I didn't notice any slowdowns caused by the patch.
Comments? If there are none, it is ready for a committer.

--
Euler Taveira de Oliveira
http://www.timbira.com/

Attachments:

buffer_usage-20090928.diff.gz (application/x-gzip)

#2 Itagaki Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Euler Taveira de Oliveira (#1)
Re: Buffer usage in EXPLAIN and pg_stat_statements (review)

Euler Taveira de Oliveira <euler@timbira.com> wrote:

I'm reviewing your patch. Your patch is in good shape. It applies cleanly. All
of the things are built as intended (including the two contrib modules). It
doesn't include docs but I wrote it. Basically, I produced another patch (that
are attached) correcting some minor gripes; docs are included too. The
comments are in-line.

Thanks. Apart from the choice of words, can I assume the basic architecture of
the patch is OK? The code that handles global variables like ReadBufferCount
and GlobalReadBufferCount could be rewritten in a cleaner way if we could
drop support for log_{parser|planner|executor|statement}_stats.

+ 	/* Build a tuple descriptor for our result type */
+ 	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ 		elog(ERROR, "return type must be a row type");
+ 
per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
oldcontext = MemoryContextSwitchTo(per_query_ctx);

! tupdesc = CreateTupleDescCopy(tupdesc);

Out of curiosity, any reason for this change?

That's because the new code is cleaner, I think. Since the result tuple
type is defined in OUT parameters, we don't have to re-define the result
with CreateTemplateTupleDesc().

+ 				appendStringInfo(es->str, " (gets=%ld reads=%ld temp=%ld)",
+ 					num_gets, num_reads, num_temp);

Rename "gets" and "reads" to "hit" and "read". Maybe we could prefix it with
"buf_" or something else.

Rename "num_gets" and "num_reads" to "num_hit" and "num_read". The later
terminology is used all over the code.

We should define the meanings of "get" and "hit" before renaming them.
I'd like to propose the following meanings:
* "get" : number of page accesses (= hit + read)
* "hit" : number of cache reads (no disk read)
* "read" : number of disk reads (= number of read() calls)

But there is some confusion in postgres: ReadBufferCount and
BufferHitCount are used for "get" and "hit", but "heap_blks_read"
and "heap_blks_hit" are used for "read" and "hit" in pg_statio_all_tables.
Can I rename ReadBufferCount to BufferGetCount? And which values should
we show in EXPLAIN and pg_stat_statements? (two of the three are enough)
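
The three definitions proposed above can be sketched as below. The parameter
names follow the existing counters as described in this message
(ReadBufferCount counting every page access); BufferUsageSketch and
sketch_buffer_usage are hypothetical.

```c
/* Hypothetical container for the three quantities under discussion. */
typedef struct BufferUsageSketch
{
	long		get;			/* page accesses (= hit + read) */
	long		hit;			/* accesses served from cache */
	long		read;			/* accesses that needed a disk read */
} BufferUsageSketch;

/*
 * Derive get/hit/read from the two counters that are maintained.
 * read_buffer_count is treated as "get", so "read" has to be computed.
 */
BufferUsageSketch
sketch_buffer_usage(long read_buffer_count, long buffer_hit_count)
{
	BufferUsageSketch u;

	u.get = read_buffer_count;
	u.hit = buffer_hit_count;
	u.read = u.get - u.hit;
	return u;
}
```

Since any one of the three follows from the other two, showing two of them
loses no information.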

I didn't like these terminologies. I came up with "Hit Buffers", "Read
Buffers", and "Temp Buffers". I confess that I don't like the last ones.
"Read Buffers"? We're reading from disk blocks. "Read Blocks"? "Read Disk
Blocks"? "Read Data Blocks"?
"Temp Buffers"? It could be temporary sort files, hash files (?), or temporary
relations. "Hit Local Buffers"? "Local Buffers"? "Hit Temp Buffers"?

I borrowed the terms "Buffer Gets" and "Buffer Reads" from the STATSPACK report
in Oracle Database, but I'm willing to rename them if appropriate.
http://www.oracle.com/apps_benchmark/doc/awrrpt_20090325b_900.html#600

Presently "Temp Buffers" covers temporary sort files, hash files, and
materialized executor plans. Local buffer statistics for temp tables are
not included here but merged with shared buffer statistics. Is there
any better way to group them?

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#3 Euler Taveira de Oliveira
euler@timbira.com
In reply to: Itagaki Takahiro (#2)
Re: Buffer usage in EXPLAIN and pg_stat_statements (review)

Itagaki Takahiro wrote:

Thanks. Except choice of words, can I think the basic architecture of
the patch is ok? The codes to handle global variables like ReadBufferCount
and GlobalReadBufferCount could be rewrite in cleaner way if we could
drop supports of log_{parser|planner|executor|statement}_stats.

Yes, it is. I'm afraid someone is relying on that piece of code. We can ask
people if it is OK to deprecate it, but it should only be removed after two
releases or so. BTW, doesn't it make sense to move the Global* variables to
buf_init.c?

We should define the meanings of "get" and "hit" before rename them.
I'd like to propose the meanings as following:
* "get" : number of page access (= hit + read)
* "hit" : number of cache read (no disk read)
* "read" : number of disk read (= number of read() calls)

+1.

But there are some confusions in postgres; ReadBufferCount and
BufferHitCount are used for "get" and "hit", but "heap_blks_read"
and "heap_blks_hit" are used for "read" and "hit" in pg_statio_all_tables.

I see. :(

Can I rename ReadBufferCount to BufferGetCount? And which values should
we show in EXPLAIN and pg_stat_statements? (two of the three are enough)

Do you want to include the number of page accesses in the stats? If not, we don't
need to rename the variables for now (maybe in a separate patch?). And IMHO we
should include "hit" and "read" because "get" is just their sum.

I borrowed the terms "Buffer Gets" and "Buffer Reads" from STATSPACK report
in Oracle Database. But I'm willing to rename them if appropriate.
http://www.oracle.com/apps_benchmark/doc/awrrpt_20090325b_900.html#600

Hmm... I don't have a strong opinion about those names, as I said. So if you
want to revert to the old names...

Presently "Temp Buffers" contains temporary sort files, hash files, and
materialized executor plan. Local buffer statistics for temp tables are
not included here but merged with shared buffer statistics. Are there
any better way to group them?

Are you sure? Looking at ReadBuffer_common() function I see an 'if
(isLocalBuf)' test.

As I said, your patch is in good shape, and once we settle the terminology it
is ready for a committer.

Would you care to submit another patch if you want to make some modifications?

--
Euler Taveira de Oliveira
http://www.timbira.com/

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Euler Taveira de Oliveira (#3)
Re: Buffer usage in EXPLAIN and pg_stat_statements (review)

Euler Taveira de Oliveira <euler@timbira.com> writes:

Itagaki Takahiro wrote:

Thanks. Except choice of words, can I think the basic architecture of
the patch is ok? The codes to handle global variables like ReadBufferCount
and GlobalReadBufferCount could be rewrite in cleaner way if we could
drop supports of log_{parser|planner|executor|statement}_stats.

Yes, it is. I'm afraid someone is relying on that piece of code.

If we have a better substitute, I think there'd be nothing wrong with
removing those features. They were never anything but pretty crufty
anyway, and they are *not* a compatibility issue because applications
have no direct way to access those stats. However, you'd have to be
sure that the substitute covers all the use-cases for the old stats
... which strikes me as a lot more territory than this patch has
proposed to cover.

regards, tom lane

#5 Itagaki Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Euler Taveira de Oliveira (#3)
1 attachment(s)
Re: Buffer usage in EXPLAIN and pg_stat_statements (review)

Euler Taveira de Oliveira <euler@timbira.com> wrote:

But there are some confusions in postgres; ReadBufferCount and
BufferHitCount are used for "get" and "hit", but "heap_blks_read"
and "heap_blks_hit" are used for "read" and "hit" in pg_statio_all_tables.

I see. :(

I fixed the confusion between get, hit and read in your patch.
long num_hit = ReadBufferCount + ReadLocalBufferCount;
long num_read = num_hit - BufferHitCount - LocalBufferHitCount;
should be
long num_get = ReadBufferCount + ReadLocalBufferCount;
long num_hit = BufferHitCount + LocalBufferHitCount;
long num_read = num_get - num_hit;

ReadBufferCount means "number of buffer accesses" :(

Patch attached.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Attachments:

buffer_usage_20091001.patch (application/octet-stream)
diff -cprN head/contrib/auto_explain/auto_explain.c work/contrib/auto_explain/auto_explain.c
*** head/contrib/auto_explain/auto_explain.c	2009-08-10 14:46:49.000000000 +0900
--- work/contrib/auto_explain/auto_explain.c	2009-10-01 11:17:24.503897062 +0900
*************** PG_MODULE_MAGIC;
*** 22,27 ****
--- 22,28 ----
  static int	auto_explain_log_min_duration = -1; /* msec or -1 */
  static bool auto_explain_log_analyze = false;
  static bool auto_explain_log_verbose = false;
+ static bool auto_explain_log_buffers = false;
  static int	auto_explain_log_format = EXPLAIN_FORMAT_TEXT;
  static bool auto_explain_log_nested_statements = false;
  
*************** _PG_init(void)
*** 92,97 ****
--- 93,108 ----
  							 NULL,
  							 NULL);
  
+ 	DefineCustomBoolVariable("auto_explain.log_buffers",
+ 							 "Log buffer usage.",
+ 							 NULL,
+ 							 &auto_explain_log_buffers,
+ 							 false,
+ 							 PGC_SUSET,
+ 							 0,
+ 							 NULL,
+ 							 NULL);
+ 
  	DefineCustomEnumVariable("auto_explain.log_format",
  							 "EXPLAIN format to be used for plan logging.",
  							 NULL,
*************** explain_ExecutorEnd(QueryDesc *queryDesc
*** 220,225 ****
--- 231,237 ----
  			ExplainInitState(&es);
  			es.analyze = (queryDesc->doInstrument && auto_explain_log_analyze);
  			es.verbose = auto_explain_log_verbose;
+ 			es.buffers = (es.analyze && auto_explain_log_buffers);
  			es.format = auto_explain_log_format;
  
  			ExplainPrintPlan(&es, queryDesc);
diff -cprN head/contrib/pg_stat_statements/pg_stat_statements.c work/contrib/pg_stat_statements/pg_stat_statements.c
*** head/contrib/pg_stat_statements/pg_stat_statements.c	2009-07-27 13:09:55.000000000 +0900
--- work/contrib/pg_stat_statements/pg_stat_statements.c	2009-10-01 11:17:24.504906936 +0900
***************
*** 26,31 ****
--- 26,32 ----
  #include "catalog/pg_type.h"
  #include "executor/executor.h"
  #include "executor/instrument.h"
+ #include "funcapi.h"
  #include "mb/pg_wchar.h"
  #include "miscadmin.h"
  #include "pgstat.h"
*************** PG_MODULE_MAGIC;
*** 43,49 ****
  #define PGSS_DUMP_FILE	"global/pg_stat_statements.stat"
  
  /* This constant defines the magic number in the stats file header */
! static const uint32 PGSS_FILE_HEADER = 0x20081202;
  
  /* XXX: Should USAGE_EXEC reflect execution time and/or buffer usage? */
  #define USAGE_EXEC(duration)	(1.0)
--- 44,50 ----
  #define PGSS_DUMP_FILE	"global/pg_stat_statements.stat"
  
  /* This constant defines the magic number in the stats file header */
! static const uint32 PGSS_FILE_HEADER = 0x20090928;
  
  /* XXX: Should USAGE_EXEC reflect execution time and/or buffer usage? */
  #define USAGE_EXEC(duration)	(1.0)
*************** typedef struct Counters
*** 77,82 ****
--- 78,86 ----
  	int64		calls;			/* # of times executed */
  	double		total_time;		/* total execution time in seconds */
  	int64		rows;			/* total # of retrieved or affected rows */
+ 	int64		hit;			/* total # of buffer hits */
+ 	int64		read;			/* total # of disk blocks read */
+ 	int64		temp;			/* total # of local buffer read */
  	double		usage;			/* usage factor */
  } Counters;
  
*************** pgss_store(const char *query, const Inst
*** 633,638 ****
--- 637,645 ----
  		e->counters.calls += 1;
  		e->counters.total_time += instr->total;
  		e->counters.rows += rows;
+ 		e->counters.hit += instr->buffers_hit;
+ 		e->counters.read += instr->buffers_read;
+ 		e->counters.temp += instr->buffers_temp;
  		e->counters.usage += usage;
  		SpinLockRelease(&e->mutex);
  	}
*************** pg_stat_statements_reset(PG_FUNCTION_ARG
*** 654,660 ****
  	PG_RETURN_VOID();
  }
  
! #define PG_STAT_STATEMENTS_COLS		6
  
  /*
   * Retrieve statement statistics.
--- 661,667 ----
  	PG_RETURN_VOID();
  }
  
! #define PG_STAT_STATEMENTS_COLS		9
  
  /*
   * Retrieve statement statistics.
*************** pg_stat_statements(PG_FUNCTION_ARGS)
*** 688,709 ****
  				 errmsg("materialize mode required, but it is not " \
  						"allowed in this context")));
  
  	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
  	oldcontext = MemoryContextSwitchTo(per_query_ctx);
  
! 	tupdesc = CreateTemplateTupleDesc(PG_STAT_STATEMENTS_COLS, false);
! 	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "userid",
! 					   OIDOID, -1, 0);
! 	TupleDescInitEntry(tupdesc, (AttrNumber) 2, "dbid",
! 					   OIDOID, -1, 0);
! 	TupleDescInitEntry(tupdesc, (AttrNumber) 3, "query",
! 					   TEXTOID, -1, 0);
! 	TupleDescInitEntry(tupdesc, (AttrNumber) 4, "calls",
! 					   INT8OID, -1, 0);
! 	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "total_time",
! 					   FLOAT8OID, -1, 0);
! 	TupleDescInitEntry(tupdesc, (AttrNumber) 6, "rows",
! 					   INT8OID, -1, 0);
  
  	tupstore = tuplestore_begin_heap(true, false, work_mem);
  	rsinfo->returnMode = SFRM_Materialize;
--- 695,708 ----
  				 errmsg("materialize mode required, but it is not " \
  						"allowed in this context")));
  
+ 	/* Build a tuple descriptor for our result type */
+ 	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ 		elog(ERROR, "return type must be a row type");
+ 
  	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
  	oldcontext = MemoryContextSwitchTo(per_query_ctx);
  
! 	tupdesc = CreateTupleDescCopy(tupdesc);
  
  	tupstore = tuplestore_begin_heap(true, false, work_mem);
  	rsinfo->returnMode = SFRM_Materialize;
*************** pg_stat_statements(PG_FUNCTION_ARGS)
*** 757,762 ****
--- 756,764 ----
  		values[i++] = Int64GetDatumFast(tmp.calls);
  		values[i++] = Float8GetDatumFast(tmp.total_time);
  		values[i++] = Int64GetDatumFast(tmp.rows);
+ 		values[i++] = Int64GetDatumFast(tmp.hit);
+ 		values[i++] = Int64GetDatumFast(tmp.read);
+ 		values[i++] = Int64GetDatumFast(tmp.temp);
  
  		Assert(i == PG_STAT_STATEMENTS_COLS);
  
diff -cprN head/contrib/pg_stat_statements/pg_stat_statements.sql.in work/contrib/pg_stat_statements/pg_stat_statements.sql.in
*** head/contrib/pg_stat_statements/pg_stat_statements.sql.in	2009-01-05 07:19:59.000000000 +0900
--- work/contrib/pg_stat_statements/pg_stat_statements.sql.in	2009-10-01 11:17:24.504906936 +0900
*************** CREATE FUNCTION pg_stat_statements(
*** 15,21 ****
      OUT query text,
      OUT calls int8,
      OUT total_time float8,
!     OUT rows int8
  )
  RETURNS SETOF record
  AS 'MODULE_PATHNAME'
--- 15,24 ----
      OUT query text,
      OUT calls int8,
      OUT total_time float8,
!     OUT rows int8,
!     OUT bufs_hit int8,
!     OUT bufs_read int8,
!     OUT bufs_temp int8
  )
  RETURNS SETOF record
  AS 'MODULE_PATHNAME'
diff -cprN head/doc/src/sgml/auto-explain.sgml work/doc/src/sgml/auto-explain.sgml
*** head/doc/src/sgml/auto-explain.sgml	2009-08-10 14:46:50.000000000 +0900
--- work/doc/src/sgml/auto-explain.sgml	2009-10-01 11:17:24.505661275 +0900
*************** LOAD 'auto_explain';
*** 104,109 ****
--- 104,128 ----
  
     <varlistentry>
      <term>
+      <varname>auto_explain.log_buffers</varname> (<type>boolean</type>)
+     </term>
+     <indexterm>
+      <primary><varname>auto_explain.log_buffers</> configuration parameter</primary>
+     </indexterm>
+     <listitem>
+      <para>
+       <varname>auto_explain.log_buffers</varname> causes <command>EXPLAIN
+       (ANALYZE, BUFFERS)</> output, rather than just <command>EXPLAIN</> 
+       output, to be printed when an execution plan is logged. This parameter is 
+       off by default. Only superusers can change this setting. Also, this
+       parameter only has effect if <varname>auto_explain.log_analyze</>
+       parameter is set.
+      </para>
+     </listitem>
+    </varlistentry>
+ 
+    <varlistentry>
+     <term>
       <varname>auto_explain.log_format</varname> (<type>enum</type>)
      </term>
      <indexterm>
diff -cprN head/doc/src/sgml/pgstatstatements.sgml work/doc/src/sgml/pgstatstatements.sgml
*** head/doc/src/sgml/pgstatstatements.sgml	2009-05-18 20:08:24.000000000 +0900
--- work/doc/src/sgml/pgstatstatements.sgml	2009-10-01 11:17:24.505661275 +0900
***************
*** 85,90 ****
--- 85,111 ----
        <entry>Total number of rows retrieved or affected by the statement</entry>
       </row>
  
+      <row>
+       <entry><structfield>bufs_hit</structfield></entry>
+       <entry><type>bigint</type></entry>
+       <entry></entry>
+       <entry>Total number of buffer hits by the statement</entry>
+      </row>
+ 
+      <row>
+       <entry><structfield>bufs_read</structfield></entry>
+       <entry><type>bigint</type></entry>
+       <entry></entry>
+       <entry>Total number of disk blocks read by the statement</entry>
+      </row>
+ 
+      <row>
+       <entry><structfield>bufs_temp</structfield></entry>
+       <entry><type>bigint</type></entry>
+       <entry></entry>
+       <entry>Total number of local buffer read by the statement</entry>
+      </row>
+ 
      </tbody>
     </tgroup>
    </table>
diff -cprN head/doc/src/sgml/ref/explain.sgml work/doc/src/sgml/ref/explain.sgml
*** head/doc/src/sgml/ref/explain.sgml	2009-08-10 14:46:50.000000000 +0900
--- work/doc/src/sgml/ref/explain.sgml	2009-10-01 11:17:24.505661275 +0900
*************** PostgreSQL documentation
*** 31,37 ****
  
   <refsynopsisdiv>
  <synopsis>
! EXPLAIN [ ( { ANALYZE <replaceable class="parameter">boolean</replaceable> | VERBOSE <replaceable class="parameter">boolean</replaceable> | COSTS <replaceable class="parameter">boolean</replaceable> | FORMAT { TEXT | XML | JSON } } [, ...] ) ] <replaceable class="parameter">statement</replaceable>
  EXPLAIN [ ANALYZE ] [ VERBOSE ] <replaceable class="parameter">statement</replaceable>
  </synopsis>
   </refsynopsisdiv>
--- 31,37 ----
  
   <refsynopsisdiv>
  <synopsis>
! EXPLAIN [ ( { ANALYZE <replaceable class="parameter">boolean</replaceable> | VERBOSE <replaceable class="parameter">boolean</replaceable> | COSTS <replaceable class="parameter">boolean</replaceable> | BUFFERS <replaceable class="parameter">boolean</replaceable> | FORMAT { TEXT | XML | JSON } } [, ...] ) ] <replaceable class="parameter">statement</replaceable>
  EXPLAIN [ ANALYZE ] [ VERBOSE ] <replaceable class="parameter">statement</replaceable>
  </synopsis>
   </refsynopsisdiv>
*************** ROLLBACK;
*** 140,145 ****
--- 140,157 ----
     </varlistentry>
  
     <varlistentry>
+     <term><literal>BUFFERS</literal></term>
+     <listitem>
+      <para>
+       Include information on buffer usage. Specifically, include the number of
+       buffer hits, number of disk blocks read, and number of local buffers read.
+       This parameter may only be used together with the <literal>ANALYZE</literal>
+       parameter. It defaults to <literal>FALSE</literal>.
+      </para>
+     </listitem>
+    </varlistentry>
+ 
+    <varlistentry>
      <term><literal>FORMAT</literal></term>
      <listitem>
       <para>
diff -cprN head/src/backend/commands/explain.c work/src/backend/commands/explain.c
*** head/src/backend/commands/explain.c	2009-08-22 11:06:32.000000000 +0900
--- work/src/backend/commands/explain.c	2009-10-01 11:17:24.506655332 +0900
*************** ExplainQuery(ExplainStmt *stmt, const ch
*** 127,132 ****
--- 127,134 ----
  			es.verbose = defGetBoolean(opt);
  		else if (strcmp(opt->defname, "costs") == 0)
  			es.costs = defGetBoolean(opt);
+ 		else if (strcmp(opt->defname, "buffers") == 0)
+ 			es.buffers = defGetBoolean(opt);
  		else if (strcmp(opt->defname, "format") == 0)
  		{
  			char   *p = defGetString(opt);
*************** ExplainQuery(ExplainStmt *stmt, const ch
*** 150,155 ****
--- 152,162 ----
  							opt->defname)));
  	}
  
+ 	if (es.buffers && !es.analyze)
+ 		ereport(ERROR,
+ 			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ 			 errmsg("EXPLAIN option BUFFERS requires ANALYZE")));
+ 
  	/* Convert parameter type data to the form parser wants */
  	getParamListTypes(params, &param_types, &num_params);
  
*************** ExplainNode(Plan *plan, PlanState *plans
*** 923,928 ****
--- 930,954 ----
  			ExplainPropertyFloat("Actual Rows", rows, 0, es);
  			ExplainPropertyFloat("Actual Loops", nloops, 0, es);
  		}
+ 
+ 		if (es->buffers)
+ 		{
+ 			long	num_hit = planstate->instrument->buffers_hit;
+ 			long	num_read = planstate->instrument->buffers_read;
+ 			long	num_temp = planstate->instrument->buffers_temp;
+ 
+ 			if (es->format == EXPLAIN_FORMAT_TEXT)
+ 			{
+ 				appendStringInfo(es->str, " (hit=%ld read=%ld temp=%ld)",
+ 					num_hit, num_read, num_temp);
+ 			}
+ 			else
+ 			{
+ 				ExplainPropertyLong("Hit Buffers", num_hit, es);
+ 				ExplainPropertyLong("Read Buffers", num_read, es);
+ 				ExplainPropertyLong("Temp Buffers", num_temp, es);
+ 			}
+ 		}
  	}
  	else if (es->analyze)
  	{
diff -cprN head/src/backend/executor/execMain.c work/src/backend/executor/execMain.c
*** head/src/backend/executor/execMain.c	2009-09-28 05:09:57.000000000 +0900
--- work/src/backend/executor/execMain.c	2009-10-01 11:17:24.507627637 +0900
*************** standard_ExecutorRun(QueryDesc *queryDes
*** 267,272 ****
--- 267,273 ----
  	DestReceiver *dest;
  	bool		sendTuples;
  	MemoryContext oldcontext;
+ 	Instrumentation *save_TopInstrument = NULL;
  
  	/* sanity checks */
  	Assert(queryDesc != NULL);
*************** standard_ExecutorRun(QueryDesc *queryDes
*** 282,288 ****
--- 283,293 ----
  
  	/* Allow instrumentation of ExecutorRun overall runtime */
  	if (queryDesc->totaltime)
+ 	{
  		InstrStartNode(queryDesc->totaltime);
+ 		save_TopInstrument = TopInstrument;
+ 		TopInstrument = queryDesc->totaltime;
+ 	}
  
  	/*
  	 * extract information from the query descriptor and the query feature.
*************** standard_ExecutorRun(QueryDesc *queryDes
*** 320,326 ****
--- 325,340 ----
  		(*dest->rShutdown) (dest);
  
  	if (queryDesc->totaltime)
+ 	{
  		InstrStopNode(queryDesc->totaltime, estate->es_processed);
+ 		if (save_TopInstrument)
+ 		{
+ 			save_TopInstrument->buffers_hit += queryDesc->totaltime->buffers_hit;
+ 			save_TopInstrument->buffers_read += queryDesc->totaltime->buffers_read;
+ 			save_TopInstrument->buffers_temp += queryDesc->totaltime->buffers_temp;
+ 		}
+ 		TopInstrument = save_TopInstrument;
+ 	}
  
  	MemoryContextSwitchTo(oldcontext);
  }
diff -cprN head/src/backend/executor/instrument.c work/src/backend/executor/instrument.c
*** head/src/backend/executor/instrument.c	2009-01-02 02:23:41.000000000 +0900
--- work/src/backend/executor/instrument.c	2009-10-01 11:19:17.666239902 +0900
***************
*** 16,22 ****
--- 16,26 ----
  #include <unistd.h>
  
  #include "executor/instrument.h"
+ #include "storage/buf_internals.h"
+ #include "storage/bufmgr.h"
  
+ Instrumentation *CurrentInstrument = NULL;
+ Instrumentation *TopInstrument = NULL;
  
  /* Allocate new instrumentation structure(s) */
  Instrumentation *
*************** InstrStartNode(Instrumentation *instr)
*** 37,42 ****
--- 41,50 ----
  		INSTR_TIME_SET_CURRENT(instr->starttime);
  	else
  		elog(DEBUG2, "InstrStartNode called twice in a row");
+ 
+ 	/* push stack */
+ 	instr->prev = CurrentInstrument;
+ 	CurrentInstrument = instr;
  }
  
  /* Exit from a plan node */
*************** InstrStopNode(Instrumentation *instr, do
*** 45,50 ****
--- 53,88 ----
  {
  	instr_time	endtime;
  
+ 	if (instr == CurrentInstrument)
+ 	{
+ 		long	num_get = ReadBufferCount + ReadLocalBufferCount;
+ 		long	num_hit = BufferHitCount + LocalBufferHitCount;
+ 		long	num_read = num_get - num_hit;
+ 		long	num_temp = BufFileReadCount;
+ 
+ 		/* count buffer usage per plan node */
+ 		instr->buffers_hit += num_hit;
+ 		instr->buffers_read += num_read;
+ 		instr->buffers_temp += num_temp;
+ 
+ 		/* accumulate per-node buffer statistics into top node */
+ 		if (TopInstrument && TopInstrument != CurrentInstrument)
+ 		{
+ 			TopInstrument->buffers_hit += num_hit;
+ 			TopInstrument->buffers_read += num_read;
+ 			TopInstrument->buffers_temp += num_temp;
+ 		}
+ 
+ 		/* reset buffer usage and pop stack */
+ 		ResetLocalBufferUsage();
+ 		CurrentInstrument = instr->prev;
+ 	}
+ 	else
+ 	{
+ 		elog(DEBUG2, "Instrumentation stack is broken");
+ 		return;
+ 	}
+ 
  	/* count the returned tuples */
  	instr->tuplecount += nTuples;
  
diff -cprN head/src/backend/storage/buffer/bufmgr.c work/src/backend/storage/buffer/bufmgr.c
*** head/src/backend/storage/buffer/bufmgr.c	2009-06-11 23:49:01.000000000 +0900
--- work/src/backend/storage/buffer/bufmgr.c	2009-10-01 11:17:24.508671491 +0900
*************** static bool IsForInput;
*** 79,84 ****
--- 79,94 ----
  /* local state for LockBufferForCleanup */
  static volatile BufferDesc *PinCountWaitBuf = NULL;
  
+ /* statistics counters for log_[parser|planner|executor|statement]_stats */
+ static long GlobalReadBufferCount;
+ static long GlobalReadLocalBufferCount;
+ static long GlobalBufferHitCount;
+ static long GlobalLocalBufferHitCount;
+ static long GlobalBufferFlushCount;
+ static long GlobalLocalBufferFlushCount;
+ static long GlobalBufFileReadCount;
+ static long GlobalBufFileWriteCount;
+ 
  
  static Buffer ReadBuffer_common(SMgrRelation reln, bool isLocalBuf,
  				  ForkNumber forkNum, BlockNumber blockNum,
*************** ShowBufferUsage(void)
*** 1620,1646 ****
  	float		hitrate;
  	float		localhitrate;
  
  	initStringInfo(&str);
  
! 	if (ReadBufferCount == 0)
  		hitrate = 0.0;
  	else
! 		hitrate = (float) BufferHitCount *100.0 / ReadBufferCount;
  
! 	if (ReadLocalBufferCount == 0)
  		localhitrate = 0.0;
  	else
! 		localhitrate = (float) LocalBufferHitCount *100.0 / ReadLocalBufferCount;
  
  	appendStringInfo(&str,
  	"!\tShared blocks: %10ld read, %10ld written, buffer hit rate = %.2f%%\n",
! 				ReadBufferCount - BufferHitCount, BufferFlushCount, hitrate);
  	appendStringInfo(&str,
  	"!\tLocal  blocks: %10ld read, %10ld written, buffer hit rate = %.2f%%\n",
! 					 ReadLocalBufferCount - LocalBufferHitCount, LocalBufferFlushCount, localhitrate);
  	appendStringInfo(&str,
  					 "!\tDirect blocks: %10ld read, %10ld written\n",
! 					 BufFileReadCount, BufFileWriteCount);
  
  	return str.data;
  }
--- 1630,1658 ----
  	float		hitrate;
  	float		localhitrate;
  
+ 	ResetLocalBufferUsage();
+ 
  	initStringInfo(&str);
  
! 	if (GlobalReadBufferCount == 0)
  		hitrate = 0.0;
  	else
! 		hitrate = (float) GlobalBufferHitCount *100.0 / GlobalReadBufferCount;
  
! 	if (GlobalReadLocalBufferCount == 0)
  		localhitrate = 0.0;
  	else
! 		localhitrate = (float) GlobalLocalBufferHitCount *100.0 / GlobalReadLocalBufferCount;
  
  	appendStringInfo(&str,
  	"!\tShared blocks: %10ld read, %10ld written, buffer hit rate = %.2f%%\n",
! 		GlobalReadBufferCount - GlobalBufferHitCount, GlobalBufferFlushCount, hitrate);
  	appendStringInfo(&str,
  	"!\tLocal  blocks: %10ld read, %10ld written, buffer hit rate = %.2f%%\n",
! 		GlobalReadLocalBufferCount - GlobalLocalBufferHitCount, GlobalLocalBufferFlushCount, localhitrate);
  	appendStringInfo(&str,
  					 "!\tDirect blocks: %10ld read, %10ld written\n",
! 					 GlobalBufFileReadCount, GlobalBufFileWriteCount);
  
  	return str.data;
  }
*************** ResetBufferUsage(void)
*** 1656,1661 ****
--- 1668,1704 ----
  	LocalBufferFlushCount = 0;
  	BufFileReadCount = 0;
  	BufFileWriteCount = 0;
+ 
+ 	GlobalBufferHitCount = 0;
+ 	GlobalReadBufferCount = 0;
+ 	GlobalBufferFlushCount = 0;
+ 	GlobalLocalBufferHitCount = 0;
+ 	GlobalReadLocalBufferCount = 0;
+ 	GlobalLocalBufferFlushCount = 0;
+ 	GlobalBufFileReadCount = 0;
+ 	GlobalBufFileWriteCount = 0;
+ }
+ 
+ void
+ ResetLocalBufferUsage(void)
+ {
+ 	GlobalReadBufferCount += ReadBufferCount;
+ 	GlobalReadLocalBufferCount += ReadLocalBufferCount;
+ 	GlobalBufferHitCount += BufferHitCount;
+ 	GlobalLocalBufferHitCount += LocalBufferHitCount;
+ 	GlobalBufferFlushCount += BufferFlushCount;
+ 	GlobalLocalBufferFlushCount += LocalBufferFlushCount;
+ 	GlobalBufFileReadCount += BufFileReadCount;
+ 	GlobalBufFileWriteCount += BufFileWriteCount;
+ 
+ 	BufferHitCount = 0;
+ 	ReadBufferCount = 0;
+ 	BufferFlushCount = 0;
+ 	LocalBufferHitCount = 0;
+ 	ReadLocalBufferCount = 0;
+ 	LocalBufferFlushCount = 0;
+ 	BufFileReadCount = 0;
+ 	BufFileWriteCount = 0;
  }
  
  /*
diff -cprN head/src/backend/tcop/postgres.c work/src/backend/tcop/postgres.c
*** head/src/backend/tcop/postgres.c	2009-09-01 11:54:51.000000000 +0900
--- work/src/backend/tcop/postgres.c	2009-10-01 11:17:24.510655527 +0900
***************
*** 44,49 ****
--- 44,50 ----
  #include "catalog/pg_type.h"
  #include "commands/async.h"
  #include "commands/prepare.h"
+ #include "executor/instrument.h"
  #include "libpq/libpq.h"
  #include "libpq/pqformat.h"
  #include "libpq/pqsignal.h"
*************** PostgresMain(int argc, char *argv[], con
*** 3482,3487 ****
--- 3483,3492 ----
  		 */
  		doing_extended_query_message = false;
  
+ 		/* Reset buffer usage counters */
+ 		CurrentInstrument = TopInstrument = NULL;
+ 		ResetLocalBufferUsage();
+ 
  		/*
  		 * Release storage left over from prior query cycle, and create a new
  		 * query input buffer in the cleared MessageContext.
diff -cprN head/src/include/commands/explain.h work/src/include/commands/explain.h
*** head/src/include/commands/explain.h	2009-08-10 14:46:50.000000000 +0900
--- work/src/include/commands/explain.h	2009-10-01 11:17:24.510655527 +0900
*************** typedef struct ExplainState
*** 29,34 ****
--- 29,35 ----
  	bool		verbose;		/* be verbose */
  	bool		analyze;		/* print actual times */
  	bool		costs;			/* print costs */
+ 	bool		buffers;		/* print buffer usage */
  	ExplainFormat format;		/* output format */
  	/* other states */
  	PlannedStmt *pstmt;			/* top of plan */
diff -cprN head/src/include/executor/instrument.h work/src/include/executor/instrument.h
*** head/src/include/executor/instrument.h	2009-01-02 02:23:59.000000000 +0900
--- work/src/include/executor/instrument.h	2009-10-01 11:17:24.510655527 +0900
*************** typedef struct Instrumentation
*** 29,36 ****
--- 29,45 ----
  	double		total;			/* Total total time (in seconds) */
  	double		ntuples;		/* Total tuples produced */
  	double		nloops;			/* # of run cycles for this node */
+ 	/* Buffer usage */
+ 	long		buffers_hit;	/* # of buffer hits */
+ 	long		buffers_read;	/* # of disk blocks read */
+ 	long		buffers_temp;	/* # of local buffer read */
+ 	/* previous node in stack */
+ 	struct Instrumentation *prev;
  } Instrumentation;
  
+ extern Instrumentation *CurrentInstrument;
+ extern Instrumentation *TopInstrument;
+ 
  extern Instrumentation *InstrAlloc(int n);
  extern void InstrStartNode(Instrumentation *instr);
  extern void InstrStopNode(Instrumentation *instr, double nTuples);
diff -cprN head/src/include/storage/bufmgr.h work/src/include/storage/bufmgr.h
*** head/src/include/storage/bufmgr.h	2009-06-11 23:49:12.000000000 +0900
--- work/src/include/storage/bufmgr.h	2009-10-01 11:17:24.511663344 +0900
*************** extern void InitBufferPoolAccess(void);
*** 175,180 ****
--- 175,181 ----
  extern void InitBufferPoolBackend(void);
  extern char *ShowBufferUsage(void);
  extern void ResetBufferUsage(void);
+ extern void ResetLocalBufferUsage(void);
  extern void AtEOXact_Buffers(bool isCommit);
  extern void PrintBufferLeakWarning(Buffer buffer);
  extern void CheckPointBuffers(int flags);
In reply to: Itagaki Takahiro (#5)
Re: Buffer usage in EXPLAIN and pg_stat_statements (review)

Itagaki Takahiro wrote:

I fixed the confusions of get, hit and read in your patch.

Works for me. Will mark it ready for a committer.

PS> BTW, your patch (20091001112006.9C36.52131E4D@oss.ntt.co.jp) doesn't seem
to be in archives.p.o. though I've received a copy from the server.

--
Euler Taveira de Oliveira
http://www.timbira.com/

#7 Alvaro Herrera
alvherre@commandprompt.com
In reply to: Euler Taveira de Oliveira (#6)
Re: Buffer usage in EXPLAIN and pg_stat_statements (review)

Euler Taveira de Oliveira wrote:

Itagaki Takahiro wrote:

I fixed the confusions of get, hit and read in your patch.

Works for me. Will mark it ready for a committer.

PS> BTW, your patch (20091001112006.9C36.52131E4D@oss.ntt.co.jp) doesn't seem
to be in archives.p.o. though I've received a copy from the server.

That's indeed very strange -- I have it locally and I wasn't CCed, so
Majordomo must have delivered it.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

#8 Alvaro Herrera
alvherre@commandprompt.com
In reply to: Alvaro Herrera (#7)
Re: Buffer usage in EXPLAIN and pg_stat_statements (review)

Alvaro Herrera wrote:

Euler Taveira de Oliveira wrote:

Itagaki Takahiro wrote:

I fixed the confusions of get, hit and read in your patch.

Works for me. Will mark it ready for a committer.

PS> BTW, your patch (20091001112006.9C36.52131E4D@oss.ntt.co.jp) doesn't seem
to be in archives.p.o. though I've received a copy from the server.

That's indeed very strange -- I have it locally and I wasn't CCed, so
Majordomo must have delivered it.

Something was wrong with last month's archive. For some reason it had
96000 files in that directory. I have rerun mhonarc on it and it has
normalized now (~2100 files).

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

#9 Robert Haas
robertmhaas@gmail.com
In reply to: Itagaki Takahiro (#5)
Re: Buffer usage in EXPLAIN and pg_stat_statements (review)

On Wed, Sep 30, 2009 at 10:40 PM, Itagaki Takahiro
<itagaki.takahiro@oss.ntt.co.jp> wrote:

Euler Taveira de Oliveira <euler@timbira.com> wrote:

But there are some confusions in postgres; ReadBufferCount and
BufferHitCount are used for "get" and "hit", but "heap_blks_read"
and "heap_blks_hit" are used for "read" and "hit" in pg_statio_all_tables.

I see. :(

I fixed the confusions of get, hit and read in your patch.
   long        num_hit = ReadBufferCount + ReadLocalBufferCount;
   long        num_read = num_hit - BufferHitCount - LocalBufferHitCount;
should be
   long        num_get = ReadBufferCount + ReadLocalBufferCount;
   long        num_hit = BufferHitCount + LocalBufferHitCount;
   long        num_read = num_get - num_hit;

ReadBufferCount means "number of buffer access" :(

Patch attached.
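
The corrected relationship between the three values can be sketched as a standalone function. This is only an illustration, assuming the counter semantics described above (the Read*Count counters record every buffer access, the *HitCount counters record the accesses found in the buffer pool). BlockCounts and derive_block_counts are hypothetical names, not part of the patch.

```c
/* Hypothetical holder for the three derived values. */
typedef struct BlockCounts
{
	long		num_get;	/* all buffer accesses, shared + local */
	long		num_hit;	/* accesses satisfied from the buffer pool */
	long		num_read;	/* accesses that had to read a block */
} BlockCounts;

/* Derive get/hit/read from the four raw counters. */
static BlockCounts
derive_block_counts(long read_buffer_count, long read_local_buffer_count,
					long buffer_hit_count, long local_buffer_hit_count)
{
	BlockCounts c;

	c.num_get = read_buffer_count + read_local_buffer_count;
	c.num_hit = buffer_hit_count + local_buffer_hit_count;
	c.num_read = c.num_get - c.num_hit;
	return c;
}
```

For example, 100 shared and 20 local accesses with 70 and 5 hits give num_get = 120, num_hit = 75 and num_read = 45.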

I took a look at this today and I have a couple of comments. The
basic functionality looks useful, but I think the terminology is too
terse. Specific comments:

1. In the EXPLAIN output, I think that the buffers information should
be output on its own line, rather than appended to the line that
already contains costs and execution times. The current output
doesn't include the word "buffers" or "blocks" anywhere, which seems
to me to be a critical flaw. I would suggest something like "Blocks
Read: %ld Hit: %ld Temp Read: %ld\n". See the way we handle output
of sort type and space usage, for example.

2. Similarly, in pg_stat_statements, the Counters structure could
easily use the same names for the structure members that we already
use in e.g. pg_stat_database - blks_hit, blks_read, and, say,
blks_temp_read. In fact I tend to think we should stick with "blocks"
rather than "buffers" overall, for consistency with what the system
does elsewhere.

3. With respect to the doc changes in explain.sgml, we consistently
use "disk" rather than "disc" in the documentation; but it may not be
necessary to use that word at all, and I think the paragraph can be
tightened up a bit: "Include information on the number of blocks read,
the number of those that are hits (already in shared buffers and do
not need to be read in), and the number of those that are reads on
temporary, backend-local buffers. This parameter requires that the
<literal>ANALYZE</literal> parameter also be used. This parameter
defaults to <literal>FALSE</literal>".

4. "Instrumentation stack is broken" doesn't seem terribly helpful in
understanding what has gone wrong.
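
Item 1 above, printing the block counters on a line of their own, could be sketched roughly as follows. This is a plain-C illustration using snprintf rather than the backend's StringInfo routines; format_blocks_line and its parameters are hypothetical, and the two-spaces-per-level indentation is an assumption.

```c
#include <stdio.h>

/*
 * Write the block counters as a separate, indented line.
 * Returns whatever snprintf returns (the length that was needed).
 */
static int
format_blocks_line(char *buf, size_t buflen, int indent,
				   long num_read, long num_hit, long num_temp_read)
{
	/* "%*s" with an empty string emits indent * 2 spaces. */
	return snprintf(buf, buflen,
					"%*sBlocks Read: %ld Hit: %ld Temp Read: %ld\n",
					indent * 2, "",
					num_read, num_hit, num_temp_read);
}
```

Keeping the counters on their own labelled line means the word "Blocks" always appears next to the numbers, which is the point of the suggestion.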

...Robert

#10 Itagaki Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Robert Haas (#9)
Re: Buffer usage in EXPLAIN and pg_stat_statements (review)

Robert Haas <robertmhaas@gmail.com> wrote:

1. I would suggest something like "Blocks
Read: %ld Hit: %ld Temp Read: %ld\n". See the way we handle output
of sort type and space usage, for example.

I have some questions:
* Did you use single space and double spaces in your example intentionally?
* Should we use lower cases here?
* Can I use "temp" instead of "Temp Read" to shorten the name?

2. Similarly, in pg_stat_statements, the Counters structure could
easily use the same names for the structure members that we already
use in e.g. pg_stat_database - blks_hit, blks_read, and, say,
blks_temp_read. In fact I tend to think we should stick with "blocks"
rather than "buffers" overall, for consistency with what the system
does elsewhere.

I agree to rename them into blks_*, but EXPLAIN (blocks) might be
misleading; EXPLAIN (buffer) can be interpreted as "buffer usage",
but normally we don't call it "block usage".

My suggestion is:
* EXPLAIN (buffers) prints (blocks read: %ld hit: %ld temp: %ld)
* auto_explain.log_buffers are not changed
* pg_stat_statements uses blks_hit and blks_read

4. "Instrumentation stack is broken" doesn't seem terribly helpful in
understanding what has gone wrong.

This message is only for hackers and should not occur.
Assert() might be ok instead.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#11 Robert Haas
robertmhaas@gmail.com
In reply to: Itagaki Takahiro (#10)
Re: Buffer usage in EXPLAIN and pg_stat_statements (review)

On Sun, Oct 4, 2009 at 11:22 PM, Itagaki Takahiro
<itagaki.takahiro@oss.ntt.co.jp> wrote:

Robert Haas <robertmhaas@gmail.com> wrote:

1. I would suggest something like "Blocks
Read: %ld  Hit:  %ld  Temp Read: %ld\n".  See the way we handle output
of sort type and space usage, for example.

I have some questions:
 * Did you use single space and double spaces in your example intentionally?

No, that was unintentional.

 * Should we use lower cases here?

No. We don't anywhere else in explain.c.

 * Can I use "temp" instead of "Temp Read" to shorten the name?

I can't tell what that means without reading the source code. I think
clarity should take precedence over brevity.

2. Similarly, in pg_stat_statements, the Counters structure could
easily use the same names for the structure members that we already
use in e.g. pg_stat_database - blks_hit, blks_read, and, say,
blks_temp_read.  In fact I tend to think we should stick with "blocks"
rather than "buffers" overall, for consistency with what the system
does elsewhere.

I agree to rename them into blks_*, but EXPLAIN (blocks) might be
misleading; EXPLAIN (buffer) can be interpreted as "buffer usage",
but normally we don't call it "block usage".

My suggestion is:
   * EXPLAIN (buffers) prints (blocks read: %ld hit: %ld temp: %ld)
   * auto_explain.log_buffers are not changed
   * pg_stat_statements uses blks_hit and blks_read

I agree.

4. "Instrumentation stack is broken" doesn't seem terribly helpful in
understanding what has gone wrong.

This message is only for hackers and should not occur.
Assert() might be ok instead.

Hmm, I think I like the idea of an Assert(). Logging a cryptic
message at DEBUG2 doesn't seem sufficient for a can't-happen condition
that probably indicates a serious bug in the code.
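
The stack discipline being discussed can be sketched in isolation. This is a simplified illustration, not the patch itself: Instr, current_instr, instr_start and instr_stop are hypothetical stand-ins for the patch's Instrumentation, CurrentInstrument, InstrStartNode and InstrStopNode, and the standard assert() stands in for PostgreSQL's Assert() macro.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the patch's Instrumentation struct. */
typedef struct Instr
{
	long		blks_hit;
	struct Instr *prev;		/* previous node in stack */
} Instr;

static Instr *current_instr = NULL;

/* Enter a node: push it onto the stack. */
static void
instr_start(Instr *instr)
{
	instr->prev = current_instr;
	current_instr = instr;
}

/* Leave a node: it must be the one on top of the stack. */
static void
instr_stop(Instr *instr)
{
	/* A mismatch is a can't-happen condition, so fail loudly. */
	assert(instr == current_instr);
	current_instr = instr->prev;
}
```

With an assertion, stopping a node that is not on top of the stack aborts immediately in assert-enabled builds instead of leaving a message that is only visible at DEBUG2.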

...Robert

#12 Itagaki Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Robert Haas (#11)
1 attachment(s)
Re: Buffer usage in EXPLAIN and pg_stat_statements (review)

Here is an updated version of the buffer usage patch.
* All buffers_* and bufs_* are renamed to blks_*.
* 'disc' => 'disk' in documentation
* Replace the debug log with Assert().
* Fix a bug in ResetLocalBufferUsage(). log_xxx_stats did not work.
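
Comparing the two versions of the patch in this thread, the ResetLocalBufferUsage() fix appears to be an ordering change: the earlier version zeroed the per-node counters before adding them to the Global* totals, so the totals never grew. The sketch below reproduces that with a single counter; the *Sketch variables and both function names are hypothetical.

```c
/* Simplified stand-ins for one per-node counter and its running total. */
static long ReadBufferCountSketch = 0;
static long GlobalReadBufferCountSketch = 0;

/* Order used by the earlier patch: the total always receives zero. */
static void
reset_then_accumulate(void)
{
	ReadBufferCountSketch = 0;
	GlobalReadBufferCountSketch += ReadBufferCountSketch;
}

/* Order used by the updated patch: fold into the total, then zero. */
static void
accumulate_then_reset(void)
{
	GlobalReadBufferCountSketch += ReadBufferCountSketch;
	ReadBufferCountSketch = 0;
}
```

Starting from a counter value of 7, the first ordering leaves the total at 0, while the second leaves the total at 7 and the counter at 0.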

Robert Haas <robertmhaas@gmail.com> wrote:

 * Can I use "temp" instead of "Temp Read" to shorten the name?

I can't tell what that means without reading the source code. I think
clarity should take precedence over brevity.

I used temp_blks_read because we have idx_blks_read in pg_statio_xxx.

=# \d pg_stat_statements
     View "public.pg_stat_statements"
     Column     |       Type       | Modifiers
----------------+------------------+-----------
 userid         | oid              |
 dbid           | oid              |
 query          | text             |
 calls          | bigint           |
 total_time     | double precision |
 rows           | bigint           |
 blks_hit       | bigint           |
 blks_read      | bigint           |
 temp_blks_read | bigint           |

=# SET work_mem = '1MB';
=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM pgbench_accounts ORDER BY bid;
                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=21913.32..22163.33 rows=100005 width=97) (actual time=81.345..99.054 rows=100000 loops=1)
   Sort Key: bid
   Sort Method:  external sort  Disk: 10472kB
   Blocks Hit: 0  Read: 0  Temp Read: 1309
   ->  Seq Scan on pgbench_accounts  (cost=0.00..2667.05 rows=100005 width=97) (actual time=0.018..23.129 rows=100000 loops=1)
         Blocks Hit: 74  Read: 1694  Temp Read: 0
 Total runtime: 105.238 ms
(7 rows)

=# SET work_mem = '18MB';
=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM pgbench_accounts ORDER BY bid;
                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=10972.32..11222.33 rows=100005 width=97) (actual time=35.437..43.069 rows=100000 loops=1)
   Sort Key: bid
   Sort Method:  quicksort  Memory: 17916kB
   Blocks Hit: 0  Read: 0  Temp Read: 0
   ->  Seq Scan on pgbench_accounts  (cost=0.00..2667.05 rows=100005 width=97) (actual time=0.028..15.030 rows=100000 loops=1)
         Blocks Hit: 32  Read: 1635  Temp Read: 0
 Total runtime: 52.026 ms
(7 rows)
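
The numbers in these plans can be cross-checked with a little arithmetic. The helpers below are a sketch only: blocks_to_kb and hit_rate_percent are hypothetical names, and the 8 kB block size is an assumption (it is PostgreSQL's default BLCKSZ, but a build may use another value).

```c
/* Assumed block size in kB; 8 is PostgreSQL's default BLCKSZ (8192 bytes). */
#define ASSUMED_BLOCK_KB 8

/* Convert a block count to kilobytes. */
static long
blocks_to_kb(long nblocks)
{
	return nblocks * ASSUMED_BLOCK_KB;
}

/* Percentage of accesses served from the buffer pool; 0 if no accesses. */
static double
hit_rate_percent(long blks_hit, long blks_read)
{
	long		total = blks_hit + blks_read;

	if (total == 0)
		return 0.0;
	return (double) blks_hit * 100.0 / (double) total;
}
```

In the first plan, 1309 temp blocks at 8 kB each come to 10472 kB, which matches the "Disk: 10472kB" reported by the external sort. The Seq Scan there has 74 hits out of 1768 accesses, a hit rate of about 4.2%.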

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Attachments:

buffer_usage_20091005.patch (application/octet-stream)
diff -cprN head/contrib/auto_explain/auto_explain.c work/contrib/auto_explain/auto_explain.c
*** head/contrib/auto_explain/auto_explain.c	Mon Aug 10 14:46:49 2009
--- work/contrib/auto_explain/auto_explain.c	Mon Oct  5 11:50:46 2009
*************** PG_MODULE_MAGIC;
*** 22,27 ****
--- 22,28 ----
  static int	auto_explain_log_min_duration = -1; /* msec or -1 */
  static bool auto_explain_log_analyze = false;
  static bool auto_explain_log_verbose = false;
+ static bool auto_explain_log_buffers = false;
  static int	auto_explain_log_format = EXPLAIN_FORMAT_TEXT;
  static bool auto_explain_log_nested_statements = false;
  
*************** _PG_init(void)
*** 92,97 ****
--- 93,108 ----
  							 NULL,
  							 NULL);
  
+ 	DefineCustomBoolVariable("auto_explain.log_buffers",
+ 							 "Log buffers usage.",
+ 							 NULL,
+ 							 &auto_explain_log_buffers,
+ 							 false,
+ 							 PGC_SUSET,
+ 							 0,
+ 							 NULL,
+ 							 NULL);
+ 
  	DefineCustomEnumVariable("auto_explain.log_format",
  							 "EXPLAIN format to be used for plan logging.",
  							 NULL,
*************** explain_ExecutorEnd(QueryDesc *queryDesc
*** 220,225 ****
--- 231,237 ----
  			ExplainInitState(&es);
  			es.analyze = (queryDesc->doInstrument && auto_explain_log_analyze);
  			es.verbose = auto_explain_log_verbose;
+ 			es.buffers = (es.analyze && auto_explain_log_buffers);
  			es.format = auto_explain_log_format;
  
  			ExplainPrintPlan(&es, queryDesc);
diff -cprN head/contrib/pg_stat_statements/pg_stat_statements.c work/contrib/pg_stat_statements/pg_stat_statements.c
*** head/contrib/pg_stat_statements/pg_stat_statements.c	Mon Jul 27 13:09:55 2009
--- work/contrib/pg_stat_statements/pg_stat_statements.c	Mon Oct  5 14:13:16 2009
***************
*** 26,31 ****
--- 26,32 ----
  #include "catalog/pg_type.h"
  #include "executor/executor.h"
  #include "executor/instrument.h"
+ #include "funcapi.h"
  #include "mb/pg_wchar.h"
  #include "miscadmin.h"
  #include "pgstat.h"
*************** PG_MODULE_MAGIC;
*** 43,49 ****
  #define PGSS_DUMP_FILE	"global/pg_stat_statements.stat"
  
  /* This constant defines the magic number in the stats file header */
! static const uint32 PGSS_FILE_HEADER = 0x20081202;
  
  /* XXX: Should USAGE_EXEC reflect execution time and/or buffer usage? */
  #define USAGE_EXEC(duration)	(1.0)
--- 44,50 ----
  #define PGSS_DUMP_FILE	"global/pg_stat_statements.stat"
  
  /* This constant defines the magic number in the stats file header */
! static const uint32 PGSS_FILE_HEADER = 0x20090928;
  
  /* XXX: Should USAGE_EXEC reflect execution time and/or buffer usage? */
  #define USAGE_EXEC(duration)	(1.0)
*************** typedef struct Counters
*** 77,82 ****
--- 78,86 ----
  	int64		calls;			/* # of times executed */
  	double		total_time;		/* total execution time in seconds */
  	int64		rows;			/* total # of retrieved or affected rows */
+ 	int64		blks_hit;		/* total # of buffer hits */
+ 	int64		blks_read;		/* total # of disk blocks read */
+ 	int64		temp_blks_read;	/* total # of temp blocks read */
  	double		usage;			/* usage factor */
  } Counters;
  
*************** pgss_store(const char *query, const Inst
*** 633,638 ****
--- 637,645 ----
  		e->counters.calls += 1;
  		e->counters.total_time += instr->total;
  		e->counters.rows += rows;
+ 		e->counters.blks_hit += instr->blks_hit;
+ 		e->counters.blks_read += instr->blks_read;
+ 		e->counters.temp_blks_read += instr->temp_blks_read;
  		e->counters.usage += usage;
  		SpinLockRelease(&e->mutex);
  	}
*************** pg_stat_statements_reset(PG_FUNCTION_ARG
*** 654,660 ****
  	PG_RETURN_VOID();
  }
  
! #define PG_STAT_STATEMENTS_COLS		6
  
  /*
   * Retrieve statement statistics.
--- 661,667 ----
  	PG_RETURN_VOID();
  }
  
! #define PG_STAT_STATEMENTS_COLS		9
  
  /*
   * Retrieve statement statistics.
*************** pg_stat_statements(PG_FUNCTION_ARGS)
*** 688,709 ****
  				 errmsg("materialize mode required, but it is not " \
  						"allowed in this context")));
  
  	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
  	oldcontext = MemoryContextSwitchTo(per_query_ctx);
  
! 	tupdesc = CreateTemplateTupleDesc(PG_STAT_STATEMENTS_COLS, false);
! 	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "userid",
! 					   OIDOID, -1, 0);
! 	TupleDescInitEntry(tupdesc, (AttrNumber) 2, "dbid",
! 					   OIDOID, -1, 0);
! 	TupleDescInitEntry(tupdesc, (AttrNumber) 3, "query",
! 					   TEXTOID, -1, 0);
! 	TupleDescInitEntry(tupdesc, (AttrNumber) 4, "calls",
! 					   INT8OID, -1, 0);
! 	TupleDescInitEntry(tupdesc, (AttrNumber) 5, "total_time",
! 					   FLOAT8OID, -1, 0);
! 	TupleDescInitEntry(tupdesc, (AttrNumber) 6, "rows",
! 					   INT8OID, -1, 0);
  
  	tupstore = tuplestore_begin_heap(true, false, work_mem);
  	rsinfo->returnMode = SFRM_Materialize;
--- 695,708 ----
  				 errmsg("materialize mode required, but it is not " \
  						"allowed in this context")));
  
+ 	/* Build a tuple descriptor for our result type */
+ 	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ 		elog(ERROR, "return type must be a row type");
+ 
  	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
  	oldcontext = MemoryContextSwitchTo(per_query_ctx);
  
! 	tupdesc = CreateTupleDescCopy(tupdesc);
  
  	tupstore = tuplestore_begin_heap(true, false, work_mem);
  	rsinfo->returnMode = SFRM_Materialize;
*************** pg_stat_statements(PG_FUNCTION_ARGS)
*** 757,762 ****
--- 756,764 ----
  		values[i++] = Int64GetDatumFast(tmp.calls);
  		values[i++] = Float8GetDatumFast(tmp.total_time);
  		values[i++] = Int64GetDatumFast(tmp.rows);
+ 		values[i++] = Int64GetDatumFast(tmp.blks_hit);
+ 		values[i++] = Int64GetDatumFast(tmp.blks_read);
+ 		values[i++] = Int64GetDatumFast(tmp.temp_blks_read);
  
  		Assert(i == PG_STAT_STATEMENTS_COLS);
  
diff -cprN head/contrib/pg_stat_statements/pg_stat_statements.sql.in work/contrib/pg_stat_statements/pg_stat_statements.sql.in
*** head/contrib/pg_stat_statements/pg_stat_statements.sql.in	Mon Jan  5 07:19:59 2009
--- work/contrib/pg_stat_statements/pg_stat_statements.sql.in	Mon Oct  5 14:12:51 2009
*************** CREATE FUNCTION pg_stat_statements(
*** 15,21 ****
      OUT query text,
      OUT calls int8,
      OUT total_time float8,
!     OUT rows int8
  )
  RETURNS SETOF record
  AS 'MODULE_PATHNAME'
--- 15,24 ----
      OUT query text,
      OUT calls int8,
      OUT total_time float8,
!     OUT rows int8,
!     OUT blks_hit int8,
!     OUT blks_read int8,
!     OUT temp_blks_read int8
  )
  RETURNS SETOF record
  AS 'MODULE_PATHNAME'
diff -cprN head/doc/src/sgml/auto-explain.sgml work/doc/src/sgml/auto-explain.sgml
*** head/doc/src/sgml/auto-explain.sgml	Mon Aug 10 14:46:50 2009
--- work/doc/src/sgml/auto-explain.sgml	Mon Oct  5 11:50:46 2009
*************** LOAD 'auto_explain';
*** 104,109 ****
--- 104,128 ----
  
     <varlistentry>
      <term>
+      <varname>auto_explain.log_buffers</varname> (<type>boolean</type>)
+     </term>
+     <indexterm>
+      <primary><varname>auto_explain.log_buffers</> configuration parameter</primary>
+     </indexterm>
+     <listitem>
+      <para>
+       <varname>auto_explain.log_buffers</varname> causes <command>EXPLAIN
+       (ANALYZE, BUFFERS)</> output, rather than just <command>EXPLAIN</> 
+       output, to be printed when an execution plan is logged. This parameter is 
+       off by default. Only superusers can change this setting. Also, this
+       parameter only has effect if <varname>auto_explain.log_analyze</>
+       parameter is set.
+      </para>
+     </listitem>
+    </varlistentry>
+ 
+    <varlistentry>
+     <term>
       <varname>auto_explain.log_format</varname> (<type>enum</type>)
      </term>
      <indexterm>
diff -cprN head/doc/src/sgml/pgstatstatements.sgml work/doc/src/sgml/pgstatstatements.sgml
*** head/doc/src/sgml/pgstatstatements.sgml	Mon May 18 20:08:24 2009
--- work/doc/src/sgml/pgstatstatements.sgml	Mon Oct  5 14:12:51 2009
***************
*** 85,90 ****
--- 85,111 ----
        <entry>Total number of rows retrieved or affected by the statement</entry>
       </row>
  
+      <row>
+       <entry><structfield>blks_hit</structfield></entry>
+       <entry><type>bigint</type></entry>
+       <entry></entry>
+       <entry>Total number of buffer hits by the statement</entry>
+      </row>
+ 
+      <row>
+       <entry><structfield>blks_read</structfield></entry>
+       <entry><type>bigint</type></entry>
+       <entry></entry>
+       <entry>Total number of disk blocks read by the statement</entry>
+      </row>
+ 
+      <row>
+       <entry><structfield>temp_blks_read</structfield></entry>
+       <entry><type>bigint</type></entry>
+       <entry></entry>
+       <entry>Total number of temp blocks read by the statement</entry>
+      </row>
+ 
      </tbody>
     </tgroup>
    </table>
diff -cprN head/doc/src/sgml/ref/explain.sgml work/doc/src/sgml/ref/explain.sgml
*** head/doc/src/sgml/ref/explain.sgml	Mon Aug 10 14:46:50 2009
--- work/doc/src/sgml/ref/explain.sgml	Mon Oct  5 13:11:22 2009
*************** PostgreSQL documentation
*** 31,37 ****
  
   <refsynopsisdiv>
  <synopsis>
! EXPLAIN [ ( { ANALYZE <replaceable class="parameter">boolean</replaceable> | VERBOSE <replaceable class="parameter">boolean</replaceable> | COSTS <replaceable class="parameter">boolean</replaceable> | FORMAT { TEXT | XML | JSON } } [, ...] ) ] <replaceable class="parameter">statement</replaceable>
  EXPLAIN [ ANALYZE ] [ VERBOSE ] <replaceable class="parameter">statement</replaceable>
  </synopsis>
   </refsynopsisdiv>
--- 31,37 ----
  
   <refsynopsisdiv>
  <synopsis>
! EXPLAIN [ ( { ANALYZE <replaceable class="parameter">boolean</replaceable> | VERBOSE <replaceable class="parameter">boolean</replaceable> | COSTS <replaceable class="parameter">boolean</replaceable> | BUFFERS <replaceable class="parameter">boolean</replaceable> | FORMAT { TEXT | XML | JSON } } [, ...] ) ] <replaceable class="parameter">statement</replaceable>
  EXPLAIN [ ANALYZE ] [ VERBOSE ] <replaceable class="parameter">statement</replaceable>
  </synopsis>
   </refsynopsisdiv>
*************** ROLLBACK;
*** 140,145 ****
--- 140,157 ----
     </varlistentry>
  
     <varlistentry>
+     <term><literal>BUFFERS</literal></term>
+     <listitem>
+      <para>
+       Include information on the buffers. Specifically, include the number of
+       buffer hits, number of disk blocks read, and number of local buffer read.
+       This parameter should be used with <literal>ANALYZE</literal> parameter.
+       Also, this parameter defaults to <literal>FALSE</literal>.
+      </para>
+     </listitem>
+    </varlistentry>
+ 
+    <varlistentry>
      <term><literal>FORMAT</literal></term>
      <listitem>
       <para>
diff -cprN head/src/backend/commands/explain.c work/src/backend/commands/explain.c
*** head/src/backend/commands/explain.c	Sat Aug 22 11:06:32 2009
--- work/src/backend/commands/explain.c	Mon Oct  5 14:16:02 2009
*************** ExplainQuery(ExplainStmt *stmt, const ch
*** 127,132 ****
--- 127,134 ----
  			es.verbose = defGetBoolean(opt);
  		else if (strcmp(opt->defname, "costs") == 0)
  			es.costs = defGetBoolean(opt);
+ 		else if (strcmp(opt->defname, "buffers") == 0)
+ 			es.buffers = defGetBoolean(opt);
  		else if (strcmp(opt->defname, "format") == 0)
  		{
  			char   *p = defGetString(opt);
*************** ExplainQuery(ExplainStmt *stmt, const ch
*** 150,155 ****
--- 152,162 ----
  							opt->defname)));
  	}
  
+ 	if (es.buffers && !es.analyze)
+ 		ereport(ERROR,
+ 			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ 			 errmsg("EXPLAIN option BUFFERS requires ANALYZE")));
+ 
  	/* Convert parameter type data to the form parser wants */
  	getParamListTypes(params, &param_types, &num_params);
  
*************** ExplainNode(Plan *plan, PlanState *plans
*** 1019,1024 ****
--- 1026,1052 ----
  			break;
  	}
  
+ 	/* Show buffer usage */
+ 	if (es->buffers)
+ 	{
+ 		long	num_hit = planstate->instrument->blks_hit;
+ 		long	num_read = planstate->instrument->blks_read;
+ 		long	num_temp_read = planstate->instrument->temp_blks_read;
+ 
+ 		if (es->format == EXPLAIN_FORMAT_TEXT)
+ 		{
+ 			appendStringInfoSpaces(es->str, es->indent * 2);
+ 			appendStringInfo(es->str, "Blocks Hit: %ld  Read: %ld  Temp Read: %ld\n",
+ 							 num_hit, num_read, num_temp_read);
+ 		}
+ 		else
+ 		{
+ 			ExplainPropertyLong("Hit Blocks", num_hit, es);
+ 			ExplainPropertyLong("Read Blocks", num_read, es);
+ 			ExplainPropertyLong("Temp Read Blocks", num_temp_read, es);
+ 		}
+ 	}
+ 
  	/* Get ready to display the child plans */
  	haschildren = plan->initPlan ||
  		outerPlan(plan) ||
diff -cprN head/src/backend/executor/execMain.c work/src/backend/executor/execMain.c
*** head/src/backend/executor/execMain.c	Mon Sep 28 05:09:57 2009
--- work/src/backend/executor/execMain.c	Mon Oct  5 14:12:52 2009
*************** standard_ExecutorRun(QueryDesc *queryDes
*** 267,272 ****
--- 267,273 ----
  	DestReceiver *dest;
  	bool		sendTuples;
  	MemoryContext oldcontext;
+ 	Instrumentation *save_TopInstrument = NULL;
  
  	/* sanity checks */
  	Assert(queryDesc != NULL);
*************** standard_ExecutorRun(QueryDesc *queryDes
*** 282,288 ****
--- 283,293 ----
  
  	/* Allow instrumentation of ExecutorRun overall runtime */
  	if (queryDesc->totaltime)
+ 	{
  		InstrStartNode(queryDesc->totaltime);
+ 		save_TopInstrument = TopInstrument;
+ 		TopInstrument = queryDesc->totaltime;
+ 	}
  
  	/*
  	 * extract information from the query descriptor and the query feature.
*************** standard_ExecutorRun(QueryDesc *queryDes
*** 320,326 ****
--- 325,340 ----
  		(*dest->rShutdown) (dest);
  
  	if (queryDesc->totaltime)
+ 	{
  		InstrStopNode(queryDesc->totaltime, estate->es_processed);
+ 		if (save_TopInstrument)
+ 		{
+ 			save_TopInstrument->blks_hit += queryDesc->totaltime->blks_hit;
+ 			save_TopInstrument->blks_read += queryDesc->totaltime->blks_read;
+ 			save_TopInstrument->temp_blks_read += queryDesc->totaltime->temp_blks_read;
+ 		}
+ 		TopInstrument = save_TopInstrument;
+ 	}
  
  	MemoryContextSwitchTo(oldcontext);
  }
diff -cprN head/src/backend/executor/instrument.c work/src/backend/executor/instrument.c
*** head/src/backend/executor/instrument.c	Fri Jan  2 02:23:41 2009
--- work/src/backend/executor/instrument.c	Mon Oct  5 14:13:52 2009
***************
*** 16,22 ****
--- 16,26 ----
  #include <unistd.h>
  
  #include "executor/instrument.h"
+ #include "storage/buf_internals.h"
+ #include "storage/bufmgr.h"
  
+ Instrumentation *CurrentInstrument = NULL;
+ Instrumentation *TopInstrument = NULL;
  
  /* Allocate new instrumentation structure(s) */
  Instrumentation *
*************** InstrStartNode(Instrumentation *instr)
*** 37,42 ****
--- 41,50 ----
  		INSTR_TIME_SET_CURRENT(instr->starttime);
  	else
  		elog(DEBUG2, "InstrStartNode called twice in a row");
+ 
+ 	/* push stack */
+ 	instr->prev = CurrentInstrument;
+ 	CurrentInstrument = instr;
  }
  
  /* Exit from a plan node */
*************** void
*** 44,49 ****
--- 52,85 ----
  InstrStopNode(Instrumentation *instr, double nTuples)
  {
  	instr_time	endtime;
+ 	long		num_get;
+ 	long		num_hit;
+ 	long		num_read;
+ 	long		num_temp_read;
+ 
+ 	Assert(instr == CurrentInstrument);
+ 
+ 	num_get = ReadBufferCount + ReadLocalBufferCount;
+ 	num_hit = BufferHitCount + LocalBufferHitCount;
+ 	num_read = num_get - num_hit;
+ 	num_temp_read = BufFileReadCount;
+ 
+ 	/* count buffer usage per plan node */
+ 	instr->blks_hit += num_hit;
+ 	instr->blks_read += num_read;
+ 	instr->temp_blks_read += num_temp_read;
+ 
+ 	/* accumulate per-node buffer statistics into top node */
+ 	if (TopInstrument && TopInstrument != CurrentInstrument)
+ 	{
+ 		TopInstrument->blks_hit += num_hit;
+ 		TopInstrument->blks_read += num_read;
+ 		TopInstrument->temp_blks_read += num_temp_read;
+ 	}
+ 
+ 	/* reset buffer usage and pop stack */
+ 	ResetLocalBufferUsage();
+ 	CurrentInstrument = instr->prev;
  
  	/* count the returned tuples */
  	instr->tuplecount += nTuples;
diff -cprN head/src/backend/storage/buffer/bufmgr.c work/src/backend/storage/buffer/bufmgr.c
*** head/src/backend/storage/buffer/bufmgr.c	Thu Jun 11 23:49:01 2009
--- work/src/backend/storage/buffer/bufmgr.c	Mon Oct  5 13:59:31 2009
*************** static bool IsForInput;
*** 79,84 ****
--- 79,94 ----
  /* local state for LockBufferForCleanup */
  static volatile BufferDesc *PinCountWaitBuf = NULL;
  
+ /* statistics counters for log_[parser|planner|executor|statement]_stats */
+ static long GlobalReadBufferCount;
+ static long GlobalReadLocalBufferCount;
+ static long GlobalBufferHitCount;
+ static long GlobalLocalBufferHitCount;
+ static long GlobalBufferFlushCount;
+ static long GlobalLocalBufferFlushCount;
+ static long GlobalBufFileReadCount;
+ static long GlobalBufFileWriteCount;
+ 
  
  static Buffer ReadBuffer_common(SMgrRelation reln, bool isLocalBuf,
  				  ForkNumber forkNum, BlockNumber blockNum,
*************** ShowBufferUsage(void)
*** 1620,1646 ****
  	float		hitrate;
  	float		localhitrate;
  
  	initStringInfo(&str);
  
! 	if (ReadBufferCount == 0)
  		hitrate = 0.0;
  	else
! 		hitrate = (float) BufferHitCount *100.0 / ReadBufferCount;
  
! 	if (ReadLocalBufferCount == 0)
  		localhitrate = 0.0;
  	else
! 		localhitrate = (float) LocalBufferHitCount *100.0 / ReadLocalBufferCount;
  
  	appendStringInfo(&str,
  	"!\tShared blocks: %10ld read, %10ld written, buffer hit rate = %.2f%%\n",
! 				ReadBufferCount - BufferHitCount, BufferFlushCount, hitrate);
  	appendStringInfo(&str,
  	"!\tLocal  blocks: %10ld read, %10ld written, buffer hit rate = %.2f%%\n",
! 					 ReadLocalBufferCount - LocalBufferHitCount, LocalBufferFlushCount, localhitrate);
  	appendStringInfo(&str,
  					 "!\tDirect blocks: %10ld read, %10ld written\n",
! 					 BufFileReadCount, BufFileWriteCount);
  
  	return str.data;
  }
--- 1630,1658 ----
  	float		hitrate;
  	float		localhitrate;
  
+ 	ResetLocalBufferUsage();
+ 
  	initStringInfo(&str);
  
! 	if (GlobalReadBufferCount == 0)
  		hitrate = 0.0;
  	else
! 		hitrate = (float) GlobalBufferHitCount *100.0 / GlobalReadBufferCount;
  
! 	if (GlobalReadLocalBufferCount == 0)
  		localhitrate = 0.0;
  	else
! 		localhitrate = (float) GlobalLocalBufferHitCount *100.0 / GlobalReadLocalBufferCount;
  
  	appendStringInfo(&str,
  	"!\tShared blocks: %10ld read, %10ld written, buffer hit rate = %.2f%%\n",
! 		GlobalReadBufferCount - GlobalBufferHitCount, GlobalBufferFlushCount, hitrate);
  	appendStringInfo(&str,
  	"!\tLocal  blocks: %10ld read, %10ld written, buffer hit rate = %.2f%%\n",
! 		GlobalReadLocalBufferCount - GlobalLocalBufferHitCount, GlobalLocalBufferFlushCount, localhitrate);
  	appendStringInfo(&str,
  					 "!\tDirect blocks: %10ld read, %10ld written\n",
! 					 GlobalBufFileReadCount, GlobalBufFileWriteCount);
  
  	return str.data;
  }
*************** ResetBufferUsage(void)
*** 1656,1661 ****
--- 1668,1704 ----
  	LocalBufferFlushCount = 0;
  	BufFileReadCount = 0;
  	BufFileWriteCount = 0;
+ 
+ 	GlobalBufferHitCount = 0;
+ 	GlobalReadBufferCount = 0;
+ 	GlobalBufferFlushCount = 0;
+ 	GlobalLocalBufferHitCount = 0;
+ 	GlobalReadLocalBufferCount = 0;
+ 	GlobalLocalBufferFlushCount = 0;
+ 	GlobalBufFileReadCount = 0;
+ 	GlobalBufFileWriteCount = 0;
+ }
+ 
+ void
+ ResetLocalBufferUsage(void)
+ {
+ 	GlobalReadBufferCount += ReadBufferCount;
+ 	GlobalReadLocalBufferCount += ReadLocalBufferCount;
+ 	GlobalBufferHitCount += BufferHitCount;
+ 	GlobalLocalBufferHitCount += LocalBufferHitCount;
+ 	GlobalBufferFlushCount += BufferFlushCount;
+ 	GlobalLocalBufferFlushCount += LocalBufferFlushCount;
+ 	GlobalBufFileReadCount += BufFileReadCount;
+ 	GlobalBufFileWriteCount += BufFileWriteCount;
+ 
+ 	BufferHitCount = 0;
+ 	ReadBufferCount = 0;
+ 	BufferFlushCount = 0;
+ 	LocalBufferHitCount = 0;
+ 	ReadLocalBufferCount = 0;
+ 	LocalBufferFlushCount = 0;
+ 	BufFileReadCount = 0;
+ 	BufFileWriteCount = 0;
  }
  
  /*
diff -cprN head/src/backend/tcop/postgres.c work/src/backend/tcop/postgres.c
*** head/src/backend/tcop/postgres.c	Tue Sep  1 11:54:51 2009
--- work/src/backend/tcop/postgres.c	Mon Oct  5 11:50:46 2009
***************
*** 44,49 ****
--- 44,50 ----
  #include "catalog/pg_type.h"
  #include "commands/async.h"
  #include "commands/prepare.h"
+ #include "executor/instrument.h"
  #include "libpq/libpq.h"
  #include "libpq/pqformat.h"
  #include "libpq/pqsignal.h"
*************** PostgresMain(int argc, char *argv[], con
*** 3482,3487 ****
--- 3483,3492 ----
  		 */
  		doing_extended_query_message = false;
  
+ 		/* Reset buffer usage counters */
+ 		CurrentInstrument = TopInstrument = NULL;
+ 		ResetLocalBufferUsage();
+ 
  		/*
  		 * Release storage left over from prior query cycle, and create a new
  		 * query input buffer in the cleared MessageContext.
diff -cprN head/src/include/commands/explain.h work/src/include/commands/explain.h
*** head/src/include/commands/explain.h	Mon Aug 10 14:46:50 2009
--- work/src/include/commands/explain.h	Mon Oct  5 11:50:46 2009
*************** typedef struct ExplainState
*** 29,34 ****
--- 29,35 ----
  	bool		verbose;		/* be verbose */
  	bool		analyze;		/* print actual times */
  	bool		costs;			/* print costs */
+ 	bool		buffers;		/* print buffer usage */
  	ExplainFormat format;		/* output format */
  	/* other states */
  	PlannedStmt *pstmt;			/* top of plan */
diff -cprN head/src/include/executor/instrument.h work/src/include/executor/instrument.h
*** head/src/include/executor/instrument.h	Fri Jan  2 02:23:59 2009
--- work/src/include/executor/instrument.h	Mon Oct  5 14:14:25 2009
*************** typedef struct Instrumentation
*** 29,36 ****
--- 29,45 ----
  	double		total;			/* Total total time (in seconds) */
  	double		ntuples;		/* Total tuples produced */
  	double		nloops;			/* # of run cycles for this node */
+ 	/* Buffer usage */
+ 	long		blks_hit;		/* # of buffer hits */
+ 	long		blks_read;		/* # of disk blocks read */
+ 	long		temp_blks_read;	/* # of temp blocks read */
+ 	/* previous node in stack */
+ 	struct Instrumentation *prev;
  } Instrumentation;
  
+ extern Instrumentation *CurrentInstrument;
+ extern Instrumentation *TopInstrument;
+ 
  extern Instrumentation *InstrAlloc(int n);
  extern void InstrStartNode(Instrumentation *instr);
  extern void InstrStopNode(Instrumentation *instr, double nTuples);
diff -cprN head/src/include/storage/bufmgr.h work/src/include/storage/bufmgr.h
*** head/src/include/storage/bufmgr.h	Thu Jun 11 23:49:12 2009
--- work/src/include/storage/bufmgr.h	Mon Oct  5 11:50:46 2009
*************** extern void InitBufferPoolAccess(void);
*** 175,180 ****
--- 175,181 ----
  extern void InitBufferPoolBackend(void);
  extern char *ShowBufferUsage(void);
  extern void ResetBufferUsage(void);
+ extern void ResetLocalBufferUsage(void);
  extern void AtEOXact_Buffers(bool isCommit);
  extern void PrintBufferLeakWarning(Buffer buffer);
  extern void CheckPointBuffers(int flags);
#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Itagaki Takahiro (#12)
Re: Buffer usage in EXPLAIN and pg_stat_statements (review)

Itagaki Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:

Here is an updated version of the buffer usage patch.

I started to look at this patch, and I have a few comments:

1. I was expecting this patch to get rid of ShowBufferUsage() and friends
altogether, instead of adding yet more static counters to support them.
Isn't that stuff pretty well superseded by having EXPLAIN support?

2. I do not understand the stuff with propagating counts into the top
instrumentation node. That seems like it's going to double-count those
counts. In any case it is 100% inconsistent to propagate only buffer
counts that way and not any other resource usage. I think you should
drop the TopInstrument variable and the logic that propagates counts up.

3. I don't believe that you've sufficiently considered the problem of
restoring the previous value of CurrentInstrument after an error. It is
not at all adequate to do it in postgres.c; consider subtransactions
for example. However, so far as I can see that variable is useless
anyway. Couldn't you just drop both that and the "prev" link?
(If you keep TopInstrument then the same objection applies to it.)

4. I don't believe this counting scheme works, except in the special
case where all buffer access happens in leaf plan nodes (which might be
enough if it weren't for Sort, Materialize, Hash, etc). It looks to
me like counts will be transferred into the instrumentation node for
the next plan node to stop execution, which could be a descendant of
the node that really ought to get charged.

You could deal with #4 by having the low-level I/O routines accumulate
counts directly into *CurrentInstrument and not have static I/O counters
at all, but then you'd have to contend with fixing #3 properly instead
of just eliminating that global variable. It might be better to add a
"start" field to struct Instrumentation for each counter, and do
something like this:
* StartNode copies static counter into start field
* StopNode computes delta = static counter - start field,
then adds delta to node's count and resets counter to start
The reason for the reset is so that the I/O isn't double counted by
parent nodes. If you wanted buffer I/O to be charged to the node
causing it *and* to all parent nodes, which would be more consistent
with the way we charge CPU time, then don't do the reset. Offhand
though that seems to me like it'd be more surprising than useful.
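
A minimal sketch of the start-field scheme described above, for a single counter. All names here (demo_io_counter, DemoInstr, the demo_* functions) are hypothetical and are not taken from the patch; the point is only to show how the reset at stop time changes what a parent node is charged.

```c
#include <assert.h>

/* Hypothetical stand-in for one static I/O counter; not a name from the patch. */
long		demo_io_counter = 0;

typedef struct DemoInstr
{
	long		start;		/* counter value saved when the node starts */
	long		count;		/* I/O charged to this node so far */
} DemoInstr;

/* StartNode: copy the static counter into the start field. */
void
demo_start_node(DemoInstr *instr)
{
	instr->start = demo_io_counter;
}

/*
 * StopNode: add the delta to the node's count.  If reset is nonzero, rewind
 * the counter to the start value so parent nodes do not count this I/O again.
 */
void
demo_stop_node(DemoInstr *instr, int reset)
{
	instr->count += demo_io_counter - instr->start;
	if (reset)
		demo_io_counter = instr->start;
}

/* A parent node performs 2 block reads itself and its child performs 3. */
void
demo_run(DemoInstr *parent, DemoInstr *child, int reset)
{
	demo_io_counter = 0;
	parent->count = child->count = 0;

	demo_start_node(parent);
	demo_io_counter += 2;			/* parent's own I/O */
	demo_start_node(child);
	demo_io_counter += 3;			/* child's I/O */
	demo_stop_node(child, reset);
	demo_stop_node(parent, reset);
}

/* Checks both variants; call it from main() to run the sketch. */
void
demo_check(void)
{
	DemoInstr	parent, child;

	demo_run(&parent, &child, 1);	/* with reset: each node gets its own I/O */
	assert(child.count == 3 && parent.count == 2);

	demo_run(&parent, &child, 0);	/* without reset: parent includes child */
	assert(child.count == 3 && parent.count == 5);
}
```

Note that in the reset variant the static counter ends up back at its starting value, so it no longer holds a running total for the whole statement.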

regards, tom lane

#14 Itagaki Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Tom Lane (#13)
Re: Buffer usage in EXPLAIN and pg_stat_statements (review)

Tom Lane <tgl@sss.pgh.pa.us> wrote:

2. I do not understand the stuff with propagating counts into the top
instrumentation node. That seems like it's going to double-count those
counts. In any case it is 100% inconsistent to propagate only buffer
counts that way and not any other resource usage. I think you should
drop the TopInstrument variable and the logic that propagates counts up.

It is required by contrib/pg_stat_statements. EXPLAIN wants per-node
accumulation, but pg_stat_statements wants the total number.

Is it enough to add a PG_TRY block to standard_ExecutorRun() to
clean up TopInstrument on error? I'm working on your other comments,
but I cannot remove TopInstrument because pg_stat_statements needs it.

I considered other approaches, but all of them require node-dependent
routines; for example, adding a function to pg_stat_statements that walks
through a plan tree and accumulates the instrumentation in it. But that is
hard to maintain when the executor nodes change. Are there any better ideas?
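
For illustration, the tree-walking alternative mentioned above might look roughly like the following. This is only a sketch under simplifying assumptions: DemoPlanNode and its outer/inner links are hypothetical, whereas real executor nodes keep their children in node-specific fields, which is exactly why such a walker would need updating whenever node types change.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified plan node with at most two children. */
typedef struct DemoPlanNode
{
	long		blks_read;			/* blocks charged to this node only */
	struct DemoPlanNode *outer;		/* first child, or NULL */
	struct DemoPlanNode *inner;		/* second child, or NULL */
} DemoPlanNode;

/* Recursively sum the per-node counts to get a per-statement total. */
long
demo_total_blks_read(const DemoPlanNode *node)
{
	if (node == NULL)
		return 0;
	return node->blks_read
		+ demo_total_blks_read(node->outer)
		+ demo_total_blks_read(node->inner);
}

/* Builds a join over two scans and checks the total; call from main(). */
void
demo_tree_check(void)
{
	DemoPlanNode scan_a = {10, NULL, NULL};
	DemoPlanNode scan_b = {4, NULL, NULL};
	DemoPlanNode join = {1, &scan_a, &scan_b};

	assert(demo_total_blks_read(&join) == 15);
}
```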

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#15 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Itagaki Takahiro (#14)
Re: Buffer usage in EXPLAIN and pg_stat_statements (review)

Itagaki Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:

Tom Lane <tgl@sss.pgh.pa.us> wrote:

2. I do not understand the stuff with propagating counts into the top
instrumentation node.

It is required by contrib/pg_stat_statements. EXPLAIN wants per-node
accumulation, but pg_stat_statements wants the total number.

Well, you need to find another way or risk getting the patch rejected
altogether. Those global variables are the weakest part of the whole
design, and I'm not going to commit a patch that destabilizes the entire
system for the sake of a debatable "requirement" of a contrib module.

If you went with the alternative definition I suggested (don't reset the
static counters, so that every node includes its children's counts) then
the behavior you want would fall out automatically. Or, at the price of
running both resettable and non-resettable static counters, you could
have pg_stat_statements obtain totals that are independent of any
particular instrumentation node.
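
A rough sketch of the second option (running both resettable and non-resettable counters), assuming hypothetical names throughout: demo_node_counter is rewound at each node stop so per-node counts stay exclusive, while demo_total_counter is never rewound, so a statement-level total can be taken as a simple before/after difference without touching any instrumentation node.

```c
#include <assert.h>

long		demo_node_counter = 0;	/* resettable: rewound at each node stop */
long		demo_total_counter = 0;	/* never rewound: running total */

/* Called wherever block reads would be counted. */
void
demo_count_read(long nblocks)
{
	demo_node_counter += nblocks;
	demo_total_counter += nblocks;
}

/* Per-node accounting: return the node's own I/O and rewind the counter. */
long
demo_stop_node_dual(long start)
{
	long		delta = demo_node_counter - start;

	demo_node_counter = start;
	return delta;
}

/* One statement: a parent node reads 2 blocks, its child reads 3. */
void
demo_statement_check(void)
{
	long		stmt_start = demo_total_counter;	/* snapshot at statement start */
	long		parent_start;
	long		child_start;
	long		parent_io;
	long		child_io;

	parent_start = demo_node_counter;
	demo_count_read(2);				/* parent's own I/O */
	child_start = demo_node_counter;
	demo_count_read(3);				/* child's I/O */
	child_io = demo_stop_node_dual(child_start);
	parent_io = demo_stop_node_dual(parent_start);

	/* Per-node counts are exclusive; the statement total needs no tree walk. */
	assert(child_io == 3 && parent_io == 2);
	assert(demo_total_counter - stmt_start == 5);
}
```

The cost, as noted above, is that every counted event updates two counters instead of one.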

regards, tom lane

#16 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#15)
Re: Buffer usage in EXPLAIN and pg_stat_statements (review)

On Wed, Oct 14, 2009 at 9:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Itagaki Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:

Tom Lane <tgl@sss.pgh.pa.us> wrote:

2. I do not understand the stuff with propagating counts into the top
instrumentation node.

It is required by contrib/pg_stat_statements. EXPLAIN wants per-node
accumulation, but pg_stat_statements wants the total number.

Well, you need to find another way or risk getting the patch rejected
altogether.  Those global variables are the weakest part of the whole
design, and I'm not going to commit a patch that destabilizes the entire
system for the sake of a debatable "requirement" of a contrib module.

If you went with the alternative definition I suggested (don't reset the
static counters, so that every node includes its children's counts) then
the behavior you want would fall out automatically.  Or, at the price of
running both resettable and non-resettable static counters, you could
have pg_stat_statements obtain totals that are independent of any
particular instrumentation node.

I am marking this patch as Returned with Feedback. I hope that it
will be resubmitted for a future CommitFest, because I think this
could be a pretty interesting feature.

...Robert

#17 Itagaki Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Robert Haas (#16)
Re: Buffer usage in EXPLAIN and pg_stat_statements (review)

Robert Haas <robertmhaas@gmail.com> wrote:

Well, you need to find another way or risk getting the patch rejected
altogether. Those global variables are the weakest part of the whole
design, and I'm not going to commit a patch that destabilizes the entire
system for the sake of a debatable "requirement" of a contrib module.

I am marking this patch as Returned with Feedback. I hope that it
will be resubmitted for a future CommitFest, because I think this
could be a pretty interesting feature.

OK, I'll reconsider them and resubmit patches for the next commitfest.
Maybe I need to split the patch into an EXPLAIN part and a contrib part.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#18 Robert Haas
robertmhaas@gmail.com
In reply to: Itagaki Takahiro (#17)
Re: Buffer usage in EXPLAIN and pg_stat_statements (review)

2009/10/14 Itagaki Takahiro <itagaki.takahiro@oss.ntt.co.jp>:

Robert Haas <robertmhaas@gmail.com> wrote:

Well, you need to find another way or risk getting the patch rejected
altogether. Those global variables are the weakest part of the whole
design, and I'm not going to commit a patch that destabilizes the entire
system for the sake of a debatable "requirement" of a contrib module.

I am marking this patch as Returned with Feedback.  I hope that it
will be resubmitted for a future CommitFest, because I think this
could be a pretty interesting feature.

OK, I'll reconsider them and resubmit patches for the next commitfest.
Maybe I need to split the patch into an EXPLAIN part and a contrib part.

My (limited) experience is that it's usually better to get something
incremental committed, even if it's not what you really want. You can
always take another crack at the remaining issues later, but if the
whole patch gets shot down then you are out of luck.

In this case, I think that the auto_explain changes ought to be part of
the same patch as the core EXPLAIN changes, but if the
pg_stat_statements stuff is severable it might make sense to push that
off until later.

...Robert

#19 Itagaki Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Robert Haas (#18)
Re: Buffer usage in EXPLAIN and pg_stat_statements (review)

Robert Haas <robertmhaas@gmail.com> wrote:

My (limited) experience is that it's usually better to get something
incremental committed, even if it's not what you really want. You can
always take another crack at the remaining issues later, but if the
whole patch gets shot down then you are out of luck.

Yeah, that makes sense. But the partial change should also be
a "long-term solution" ;-). It is hard to determine whether
the partial change is a good solution until the whole feature
works as expected (at least partially).

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#20 Robert Haas
robertmhaas@gmail.com
In reply to: Itagaki Takahiro (#19)
Re: Buffer usage in EXPLAIN and pg_stat_statements (review)

On Wed, Oct 14, 2009 at 9:38 PM, Itagaki Takahiro
<itagaki.takahiro@oss.ntt.co.jp> wrote:

Robert Haas <robertmhaas@gmail.com> wrote:

My (limited) experience is that it's usually better to get something
incremental committed, even if it's not what you really want.  You can
always take another crack at the remaining issues later, but if the
whole patch gets shot down then you are out of luck.

Yeah, that makes sense. But the partial change should also be
a "long-term solution" ;-). It is hard to determine whether
the partial change is a good solution until the whole feature
works as expected (at least partially).

Well, that's an indication that you've chosen too small a piece. But
I don't really believe that a change that affects only core EXPLAIN
and auto_explain is too small a piece to be independently useful. If
it is, the whole feature is probably badly conceived in the first
place...

...Robert

#21 Itagaki Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Robert Haas (#18)
1 attachment(s)
EXPLAIN BUFFERS

Robert Haas <robertmhaas@gmail.com> wrote:

In this case, I think that the auto_explain changes ought to be part of
the same patch as the core EXPLAIN changes

Here is a rewritten patch to add EXPLAIN (ANALYZE, BUFFERS) and
support for it by contrib/auto_explain. I removed pg_stat_statements
support from the patch for now.

I heavily modified the buffer statistics counters. They are now
gathered into struct BufferUsage, which is also used in struct
Instrumentation. The global buffer usage counters are saved into the
'bufusage_start' field at InstrStartNode(). At InstrStopNode(), the
difference is accumulated into the 'bufusage' field and the global
counters are reset to their saved values.

In text format, EXPLAIN BUFFERS shows only 'hit', 'read' and 'temp read'
so that the output fits the display, but the XML and JSON formats
contain all of the counters.

I removed ShowBufferUsage() because we can retrieve the same information
from the XML or JSON explain output, but the patch drops neither the
log_statement_stats family of variables nor the ShowUsage() function.
We could also remove all of them if no one uses them at all.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Attachments:

explain_buffers_20091015.patch (application/octet-stream)
diff -cprN head/contrib/auto_explain/auto_explain.c work/contrib/auto_explain/auto_explain.c
*** head/contrib/auto_explain/auto_explain.c	2009-08-11 09:26:35.209377000 +0900
--- work/contrib/auto_explain/auto_explain.c	2009-10-15 20:05:16.109810412 +0900
*************** PG_MODULE_MAGIC;
*** 22,27 ****
--- 22,28 ----
  static int	auto_explain_log_min_duration = -1; /* msec or -1 */
  static bool auto_explain_log_analyze = false;
  static bool auto_explain_log_verbose = false;
+ static bool auto_explain_log_buffers = false;
  static int	auto_explain_log_format = EXPLAIN_FORMAT_TEXT;
  static bool auto_explain_log_nested_statements = false;
  
*************** _PG_init(void)
*** 92,97 ****
--- 93,108 ----
  							 NULL,
  							 NULL);
  
+ 	DefineCustomBoolVariable("auto_explain.log_buffers",
+ 							 "Log buffers usage.",
+ 							 NULL,
+ 							 &auto_explain_log_buffers,
+ 							 false,
+ 							 PGC_SUSET,
+ 							 0,
+ 							 NULL,
+ 							 NULL);
+ 
  	DefineCustomEnumVariable("auto_explain.log_format",
  							 "EXPLAIN format to be used for plan logging.",
  							 NULL,
*************** explain_ExecutorEnd(QueryDesc *queryDesc
*** 218,225 ****
  			ExplainState	es;
  
  			ExplainInitState(&es);
! 			es.analyze = (queryDesc->doInstrument && auto_explain_log_analyze);
  			es.verbose = auto_explain_log_verbose;
  			es.format = auto_explain_log_format;
  
  			ExplainPrintPlan(&es, queryDesc);
--- 229,238 ----
  			ExplainState	es;
  
  			ExplainInitState(&es);
! 			es.analyze = (queryDesc->doInstrument &&
! 				(auto_explain_log_analyze || auto_explain_log_buffers));
  			es.verbose = auto_explain_log_verbose;
+ 			es.buffers = (es.analyze && auto_explain_log_buffers);
  			es.format = auto_explain_log_format;
  
  			ExplainPrintPlan(&es, queryDesc);
diff -cprN head/doc/src/sgml/auto-explain.sgml work/doc/src/sgml/auto-explain.sgml
*** head/doc/src/sgml/auto-explain.sgml	2009-08-11 09:26:35.209377000 +0900
--- work/doc/src/sgml/auto-explain.sgml	2009-10-15 20:04:10.963807103 +0900
*************** LOAD 'auto_explain';
*** 104,109 ****
--- 104,128 ----
  
     <varlistentry>
      <term>
+      <varname>auto_explain.log_buffers</varname> (<type>boolean</type>)
+     </term>
+     <indexterm>
+      <primary><varname>auto_explain.log_buffers</> configuration parameter</primary>
+     </indexterm>
+     <listitem>
+      <para>
+       <varname>auto_explain.log_buffers</varname> causes <command>EXPLAIN
+       (ANALYZE, BUFFERS)</> output, rather than just <command>EXPLAIN</> 
+       output, to be printed when an execution plan is logged. This parameter is 
+       off by default. Only superusers can change this setting. Also, this
+       parameter has an effect only if the <varname>auto_explain.log_analyze</>
+       parameter is set.
+      </para>
+     </listitem>
+    </varlistentry>
+ 
+    <varlistentry>
+     <term>
       <varname>auto_explain.log_format</varname> (<type>enum</type>)
      </term>
      <indexterm>
diff -cprN head/doc/src/sgml/ref/explain.sgml work/doc/src/sgml/ref/explain.sgml
*** head/doc/src/sgml/ref/explain.sgml	2009-08-11 09:26:35.209377000 +0900
--- work/doc/src/sgml/ref/explain.sgml	2009-10-15 20:04:10.963807103 +0900
*************** PostgreSQL documentation
*** 31,37 ****
  
   <refsynopsisdiv>
  <synopsis>
! EXPLAIN [ ( { ANALYZE <replaceable class="parameter">boolean</replaceable> | VERBOSE <replaceable class="parameter">boolean</replaceable> | COSTS <replaceable class="parameter">boolean</replaceable> | FORMAT { TEXT | XML | JSON } } [, ...] ) ] <replaceable class="parameter">statement</replaceable>
  EXPLAIN [ ANALYZE ] [ VERBOSE ] <replaceable class="parameter">statement</replaceable>
  </synopsis>
   </refsynopsisdiv>
--- 31,37 ----
  
   <refsynopsisdiv>
  <synopsis>
! EXPLAIN [ ( { ANALYZE <replaceable class="parameter">boolean</replaceable> | VERBOSE <replaceable class="parameter">boolean</replaceable> | COSTS <replaceable class="parameter">boolean</replaceable> | BUFFERS <replaceable class="parameter">boolean</replaceable> | FORMAT { TEXT | XML | JSON } } [, ...] ) ] <replaceable class="parameter">statement</replaceable>
  EXPLAIN [ ANALYZE ] [ VERBOSE ] <replaceable class="parameter">statement</replaceable>
  </synopsis>
   </refsynopsisdiv>
*************** ROLLBACK;
*** 140,145 ****
--- 140,157 ----
     </varlistentry>
  
     <varlistentry>
+     <term><literal>BUFFERS</literal></term>
+     <listitem>
+      <para>
+       Include information on buffer usage. Specifically, include the number of
+       buffer hits, number of disk blocks read, and number of temp blocks read.
+       This parameter can only be used together with the <literal>ANALYZE</literal>
+       parameter. It defaults to <literal>FALSE</literal>.
+      </para>
+     </listitem>
+    </varlistentry>
+ 
+    <varlistentry>
      <term><literal>FORMAT</literal></term>
      <listitem>
       <para>
diff -cprN head/src/backend/commands/explain.c work/src/backend/commands/explain.c
*** head/src/backend/commands/explain.c	2009-10-13 09:24:03.097662000 +0900
--- work/src/backend/commands/explain.c	2009-10-15 20:04:10.964808301 +0900
*************** ExplainQuery(ExplainStmt *stmt, const ch
*** 127,132 ****
--- 127,134 ----
  			es.verbose = defGetBoolean(opt);
  		else if (strcmp(opt->defname, "costs") == 0)
  			es.costs = defGetBoolean(opt);
+ 		else if (strcmp(opt->defname, "buffers") == 0)
+ 			es.buffers = defGetBoolean(opt);
  		else if (strcmp(opt->defname, "format") == 0)
  		{
  			char   *p = defGetString(opt);
*************** ExplainQuery(ExplainStmt *stmt, const ch
*** 150,155 ****
--- 152,162 ----
  							opt->defname)));
  	}
  
+ 	if (es.buffers && !es.analyze)
+ 		ereport(ERROR,
+ 			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ 			 errmsg("EXPLAIN option BUFFERS requires ANALYZE")));
+ 
  	/* Convert parameter type data to the form parser wants */
  	getParamListTypes(params, &param_types, &num_params);
  
*************** ExplainNode(Plan *plan, PlanState *plans
*** 1043,1048 ****
--- 1050,1079 ----
  			break;
  	}
  
+ 	/* Show buffer usage */
+ 	if (es->buffers)
+ 	{
+ 		const BufferUsage *usage = &planstate->instrument->bufusage;
+ 
+ 		if (es->format == EXPLAIN_FORMAT_TEXT)
+ 		{
+ 			appendStringInfoSpaces(es->str, es->indent * 2);
+ 			appendStringInfo(es->str, "Blocks Hit: %ld  Read: %ld  Temp Read: %ld\n",
+ 				usage->blks_hit, usage->blks_read, usage->temp_blks_read);
+ 		}
+ 		else
+ 		{
+ 			ExplainPropertyLong("Hit Blocks", usage->blks_hit, es);
+ 			ExplainPropertyLong("Read Blocks", usage->blks_read, es);
+ 			ExplainPropertyLong("Written Blocks", usage->blks_written, es);
+ 			ExplainPropertyLong("Local Hit Blocks", usage->local_blks_hit, es);
+ 			ExplainPropertyLong("Local Read Blocks", usage->local_blks_read, es);
+ 			ExplainPropertyLong("Local Written Blocks", usage->local_blks_written, es);
+ 			ExplainPropertyLong("Temp Read Blocks", usage->temp_blks_read, es);
+ 			ExplainPropertyLong("Temp Written Blocks", usage->temp_blks_written, es);
+ 		}
+ 	}
+ 
  	/* Get ready to display the child plans */
  	haschildren = plan->initPlan ||
  		outerPlan(plan) ||
diff -cprN head/src/backend/executor/instrument.c work/src/backend/executor/instrument.c
*** head/src/backend/executor/instrument.c	2009-01-05 00:22:25.168790000 +0900
--- work/src/backend/executor/instrument.c	2009-10-15 20:10:08.807120586 +0900
***************
*** 17,22 ****
--- 17,26 ----
  
  #include "executor/instrument.h"
  
+ BufferUsage			pgBufferUsage;
+ 
+ static void BufferUsageAccumDiff(BufferUsage *dst,
+ 		const BufferUsage *add, const BufferUsage *sub);
  
  /* Allocate new instrumentation structure(s) */
  Instrumentation *
*************** InstrStartNode(Instrumentation *instr)
*** 37,42 ****
--- 41,49 ----
  		INSTR_TIME_SET_CURRENT(instr->starttime);
  	else
  		elog(DEBUG2, "InstrStartNode called twice in a row");
+ 
+ 	/* initialize buffer usage per plan node */
+ 	instr->bufusage_start = pgBufferUsage;
  }
  
  /* Exit from a plan node */
*************** InstrStopNode(Instrumentation *instr, do
*** 59,64 ****
--- 66,79 ----
  
  	INSTR_TIME_SET_ZERO(instr->starttime);
  
+ 	/*
+ 	 * Adds delta of buffer usage to node's count and resets counter to start
+ 	 * so that the counters are not double counted by parent nodes.
+ 	 */
+ 	BufferUsageAccumDiff(&instr->bufusage,
+ 		&pgBufferUsage, &instr->bufusage_start);
+ 	pgBufferUsage = instr->bufusage_start;
+ 
  	/* Is this the first tuple of this cycle? */
  	if (!instr->running)
  	{
*************** InstrEndLoop(Instrumentation *instr)
*** 95,97 ****
--- 110,128 ----
  	instr->firsttuple = 0;
  	instr->tuplecount = 0;
  }
+ 
+ static void
+ BufferUsageAccumDiff(BufferUsage *dst,
+ 					 const BufferUsage *add,
+ 					 const BufferUsage *sub)
+ {
+ 	/* dst += add - sub */
+ 	dst->blks_hit += add->blks_hit - sub->blks_hit;
+ 	dst->blks_read += add->blks_read - sub->blks_read;
+ 	dst->blks_written += add->blks_written - sub->blks_written;
+ 	dst->local_blks_hit += add->local_blks_hit - sub->local_blks_hit;
+ 	dst->local_blks_read += add->local_blks_read - sub->local_blks_read;
+ 	dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
+ 	dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
+ 	dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
+ }
diff -cprN head/src/backend/storage/buffer/buf_init.c work/src/backend/storage/buffer/buf_init.c
*** head/src/backend/storage/buffer/buf_init.c	2009-01-05 00:22:25.168790000 +0900
--- work/src/backend/storage/buffer/buf_init.c	2009-10-15 20:04:10.965870810 +0900
*************** BufferDesc *BufferDescriptors;
*** 22,37 ****
  char	   *BufferBlocks;
  int32	   *PrivateRefCount;
  
- /* statistics counters */
- long int	ReadBufferCount;
- long int	ReadLocalBufferCount;
- long int	BufferHitCount;
- long int	LocalBufferHitCount;
- long int	BufferFlushCount;
- long int	LocalBufferFlushCount;
- long int	BufFileReadCount;
- long int	BufFileWriteCount;
- 
  
  /*
   * Data Structures:
--- 22,27 ----
diff -cprN head/src/backend/storage/buffer/bufmgr.c work/src/backend/storage/buffer/bufmgr.c
*** head/src/backend/storage/buffer/bufmgr.c	2009-06-12 09:52:43.356212000 +0900
--- work/src/backend/storage/buffer/bufmgr.c	2009-10-15 20:04:10.966906779 +0900
***************
*** 34,39 ****
--- 34,40 ----
  #include <unistd.h>
  
  #include "catalog/catalog.h"
+ #include "executor/instrument.h"
  #include "miscadmin.h"
  #include "pg_trace.h"
  #include "pgstat.h"
*************** ReadBuffer_common(SMgrRelation smgr, boo
*** 300,321 ****
  
  	if (isLocalBuf)
  	{
- 		ReadLocalBufferCount++;
  		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
  		if (found)
! 			LocalBufferHitCount++;
  	}
  	else
  	{
- 		ReadBufferCount++;
- 
  		/*
  		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
  		 * not currently in memory.
  		 */
  		bufHdr = BufferAlloc(smgr, forkNum, blockNum, strategy, &found);
  		if (found)
! 			BufferHitCount++;
  	}
  
  	/* At this point we do NOT hold any locks. */
--- 301,323 ----
  
  	if (isLocalBuf)
  	{
  		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
  		if (found)
! 			pgBufferUsage.local_blks_hit++;
! 		else
! 			pgBufferUsage.local_blks_read++;
  	}
  	else
  	{
  		/*
  		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
  		 * not currently in memory.
  		 */
  		bufHdr = BufferAlloc(smgr, forkNum, blockNum, strategy, &found);
  		if (found)
! 			pgBufferUsage.blks_hit++;
! 		else
! 			pgBufferUsage.blks_read++;
  	}
  
  	/* At this point we do NOT hold any locks. */
*************** SyncOneBuffer(int buf_id, bool skip_rece
*** 1611,1664 ****
  
  
  /*
-  * Return a palloc'd string containing buffer usage statistics.
-  */
- char *
- ShowBufferUsage(void)
- {
- 	StringInfoData str;
- 	float		hitrate;
- 	float		localhitrate;
- 
- 	initStringInfo(&str);
- 
- 	if (ReadBufferCount == 0)
- 		hitrate = 0.0;
- 	else
- 		hitrate = (float) BufferHitCount *100.0 / ReadBufferCount;
- 
- 	if (ReadLocalBufferCount == 0)
- 		localhitrate = 0.0;
- 	else
- 		localhitrate = (float) LocalBufferHitCount *100.0 / ReadLocalBufferCount;
- 
- 	appendStringInfo(&str,
- 	"!\tShared blocks: %10ld read, %10ld written, buffer hit rate = %.2f%%\n",
- 				ReadBufferCount - BufferHitCount, BufferFlushCount, hitrate);
- 	appendStringInfo(&str,
- 	"!\tLocal  blocks: %10ld read, %10ld written, buffer hit rate = %.2f%%\n",
- 					 ReadLocalBufferCount - LocalBufferHitCount, LocalBufferFlushCount, localhitrate);
- 	appendStringInfo(&str,
- 					 "!\tDirect blocks: %10ld read, %10ld written\n",
- 					 BufFileReadCount, BufFileWriteCount);
- 
- 	return str.data;
- }
- 
- void
- ResetBufferUsage(void)
- {
- 	BufferHitCount = 0;
- 	ReadBufferCount = 0;
- 	BufferFlushCount = 0;
- 	LocalBufferHitCount = 0;
- 	ReadLocalBufferCount = 0;
- 	LocalBufferFlushCount = 0;
- 	BufFileReadCount = 0;
- 	BufFileWriteCount = 0;
- }
- 
- /*
   *		AtEOXact_Buffers - clean up at end of transaction.
   *
   *		As of PostgreSQL 8.0, buffer pins should get released by the
--- 1613,1618 ----
*************** FlushBuffer(volatile BufferDesc *buf, SM
*** 1916,1922 ****
  			  (char *) BufHdrGetBlock(buf),
  			  false);
  
! 	BufferFlushCount++;
  
  	/*
  	 * Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
--- 1870,1876 ----
  			  (char *) BufHdrGetBlock(buf),
  			  false);
  
! 	pgBufferUsage.blks_written++;
  
  	/*
  	 * Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
diff -cprN head/src/backend/storage/buffer/localbuf.c work/src/backend/storage/buffer/localbuf.c
*** head/src/backend/storage/buffer/localbuf.c	2009-06-12 09:52:43.356212000 +0900
--- work/src/backend/storage/buffer/localbuf.c	2009-10-15 20:04:10.966906779 +0900
***************
*** 16,21 ****
--- 16,22 ----
  #include "postgres.h"
  
  #include "catalog/catalog.h"
+ #include "executor/instrument.h"
  #include "storage/buf_internals.h"
  #include "storage/bufmgr.h"
  #include "storage/smgr.h"
*************** LocalBufferAlloc(SMgrRelation smgr, Fork
*** 209,215 ****
  		/* Mark not-dirty now in case we error out below */
  		bufHdr->flags &= ~BM_DIRTY;
  
! 		LocalBufferFlushCount++;
  	}
  
  	/*
--- 210,216 ----
  		/* Mark not-dirty now in case we error out below */
  		bufHdr->flags &= ~BM_DIRTY;
  
! 		pgBufferUsage.local_blks_written++;
  	}
  
  	/*
diff -cprN head/src/backend/storage/file/buffile.c work/src/backend/storage/file/buffile.c
*** head/src/backend/storage/file/buffile.c	2009-06-12 09:52:43.356212000 +0900
--- work/src/backend/storage/file/buffile.c	2009-10-15 20:04:10.966906779 +0900
***************
*** 34,39 ****
--- 34,40 ----
  
  #include "postgres.h"
  
+ #include "executor/instrument.h"
  #include "storage/fd.h"
  #include "storage/buffile.h"
  #include "storage/buf_internals.h"
*************** BufFileLoadBuffer(BufFile *file)
*** 240,246 ****
  	file->offsets[file->curFile] += file->nbytes;
  	/* we choose not to advance curOffset here */
  
! 	BufFileReadCount++;
  }
  
  /*
--- 241,247 ----
  	file->offsets[file->curFile] += file->nbytes;
  	/* we choose not to advance curOffset here */
  
! 	pgBufferUsage.temp_blks_read++;
  }
  
  /*
*************** BufFileDumpBuffer(BufFile *file)
*** 304,310 ****
  		file->curOffset += bytestowrite;
  		wpos += bytestowrite;
  
! 		BufFileWriteCount++;
  	}
  	file->dirty = false;
  
--- 305,311 ----
  		file->curOffset += bytestowrite;
  		wpos += bytestowrite;
  
! 		pgBufferUsage.temp_blks_written++;
  	}
  	file->dirty = false;
  
diff -cprN head/src/backend/tcop/postgres.c work/src/backend/tcop/postgres.c
*** head/src/backend/tcop/postgres.c	2009-10-13 09:24:03.097662000 +0900
--- work/src/backend/tcop/postgres.c	2009-10-15 20:04:10.967906967 +0900
*************** ResetUsage(void)
*** 3850,3856 ****
  {
  	getrusage(RUSAGE_SELF, &Save_r);
  	gettimeofday(&Save_t, NULL);
- 	ResetBufferUsage();
  }
  
  void
--- 3850,3855 ----
*************** ShowUsage(const char *title)
*** 3861,3867 ****
  				sys;
  	struct timeval elapse_t;
  	struct rusage r;
- 	char	   *bufusage;
  
  	getrusage(RUSAGE_SELF, &r);
  	gettimeofday(&elapse_t, NULL);
--- 3860,3865 ----
*************** ShowUsage(const char *title)
*** 3935,3944 ****
  					 r.ru_nvcsw, r.ru_nivcsw);
  #endif   /* HAVE_GETRUSAGE */
  
- 	bufusage = ShowBufferUsage();
- 	appendStringInfo(&str, "! buffer usage stats:\n%s", bufusage);
- 	pfree(bufusage);
- 
  	/* remove trailing newline */
  	if (str.data[str.len - 1] == '\n')
  		str.data[--str.len] = '\0';
--- 3933,3938 ----
diff -cprN head/src/include/commands/explain.h work/src/include/commands/explain.h
*** head/src/include/commands/explain.h	2009-08-11 09:26:35.209377000 +0900
--- work/src/include/commands/explain.h	2009-10-15 20:04:10.968906657 +0900
*************** typedef struct ExplainState
*** 29,34 ****
--- 29,35 ----
  	bool		verbose;		/* be verbose */
  	bool		analyze;		/* print actual times */
  	bool		costs;			/* print costs */
+ 	bool		buffers;		/* print buffer usage */
  	ExplainFormat format;		/* output format */
  	/* other states */
  	PlannedStmt *pstmt;			/* top of plan */
diff -cprN head/src/include/executor/instrument.h work/src/include/executor/instrument.h
*** head/src/include/executor/instrument.h	2009-01-05 00:22:25.168790000 +0900
--- work/src/include/executor/instrument.h	2009-10-15 20:09:54.963808044 +0900
***************
*** 16,21 ****
--- 16,33 ----
  #include "portability/instr_time.h"
  
  
+ typedef struct BufferUsage
+ {
+ 	long	blks_hit;			/* # of buffer hits at start */
+ 	long	blks_read;			/* # of disk blocks read at start */
+ 	long	blks_written;		/* # of disk blocks written at start */
+ 	long	local_blks_hit;		/* # of buffer hits at start */
+ 	long	local_blks_read;	/* # of disk blocks read at start */
+ 	long	local_blks_written;	/* # of disk blocks written at start */
+ 	long	temp_blks_read;		/* # of temp blocks read at start */
+ 	long	temp_blks_written;	/* # of temp blocks written at start */
+ } BufferUsage;
+ 
  typedef struct Instrumentation
  {
  	/* Info about current plan cycle: */
*************** typedef struct Instrumentation
*** 24,36 ****
--- 36,52 ----
  	instr_time	counter;		/* Accumulated runtime for this node */
  	double		firsttuple;		/* Time for first tuple of this cycle */
  	double		tuplecount;		/* Tuples emitted so far this cycle */
+ 	BufferUsage	bufusage_start;	/* Buffer usage at start */
  	/* Accumulated statistics across all completed cycles: */
  	double		startup;		/* Total startup time (in seconds) */
  	double		total;			/* Total total time (in seconds) */
  	double		ntuples;		/* Total tuples produced */
  	double		nloops;			/* # of run cycles for this node */
+ 	BufferUsage	bufusage;		/* Total buffer usage */
  } Instrumentation;
  
+ extern BufferUsage		pgBufferUsage;
+ 
  extern Instrumentation *InstrAlloc(int n);
  extern void InstrStartNode(Instrumentation *instr);
  extern void InstrStopNode(Instrumentation *instr, double nTuples);
diff -cprN head/src/include/storage/buf_internals.h work/src/include/storage/buf_internals.h
*** head/src/include/storage/buf_internals.h	2009-06-12 09:52:43.356212000 +0900
--- work/src/include/storage/buf_internals.h	2009-10-15 20:04:10.968906657 +0900
*************** extern PGDLLIMPORT BufferDesc *BufferDes
*** 173,188 ****
  /* in localbuf.c */
  extern BufferDesc *LocalBufferDescriptors;
  
- /* event counters in buf_init.c */
- extern long int ReadBufferCount;
- extern long int ReadLocalBufferCount;
- extern long int BufferHitCount;
- extern long int LocalBufferHitCount;
- extern long int BufferFlushCount;
- extern long int LocalBufferFlushCount;
- extern long int BufFileReadCount;
- extern long int BufFileWriteCount;
- 
  
  /*
   * Internal routines: only called by bufmgr
--- 173,178 ----
diff -cprN head/src/include/storage/bufmgr.h work/src/include/storage/bufmgr.h
*** head/src/include/storage/bufmgr.h	2009-06-12 09:52:43.356212000 +0900
--- work/src/include/storage/bufmgr.h	2009-10-15 20:04:10.968906657 +0900
*************** extern Buffer ReleaseAndReadBuffer(Buffe
*** 173,180 ****
  extern void InitBufferPool(void);
  extern void InitBufferPoolAccess(void);
  extern void InitBufferPoolBackend(void);
- extern char *ShowBufferUsage(void);
- extern void ResetBufferUsage(void);
  extern void AtEOXact_Buffers(bool isCommit);
  extern void PrintBufferLeakWarning(Buffer buffer);
  extern void CheckPointBuffers(int flags);
--- 173,178 ----
#22Robert Haas
robertmhaas@gmail.com
In reply to: Itagaki Takahiro (#21)
Re: EXPLAIN BUFFERS

On Thu, Oct 15, 2009 at 7:29 AM, Itagaki Takahiro
<itagaki.takahiro@oss.ntt.co.jp> wrote:

EXPLAIN BUFFERS only shows 'hit', 'read' and 'temp read' in text format
to fit in display, but xml or json format contains all of them.

I was very careful when I submitted the machine-readable explain patch
to make sure that the choice of which information was displayed was
independent of the format, and I think that we should stick with that.
If you want we could have 'buffers terse' and 'buffers detail' but I
don't think we should force JSON/XML on people who want to see that
information.

...Robert

#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#22)
Re: EXPLAIN BUFFERS

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Oct 15, 2009 at 7:29 AM, Itagaki Takahiro
<itagaki.takahiro@oss.ntt.co.jp> wrote:

EXPLAIN BUFFERS only shows 'hit', 'read' and 'temp read' in text format
to fit in display, but xml or json format contains all of them.

I was very careful when I submitted the machine-readable explain patch
to make sure that the choice of which information was displayed was
independent of the format, and I think that we should stick with that.

I thought one of the main purposes of adding the machine-readable
formats was to allow inclusion of information that we thought too
verbose for the human-readable format. Whether this info falls into
that category remains to be seen, but I don't agree with the premise
that the information content must always be exactly the same.

FWIW, the patch's output as it stood a few days ago (one extra line per
node, conditional on a BUFFERS option) did seem perfectly reasonable to
me, and I don't see the reason to change that format now.

regards, tom lane

#24Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#23)
Re: EXPLAIN BUFFERS

On Thu, Oct 15, 2009 at 11:06 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Oct 15, 2009 at 7:29 AM, Itagaki Takahiro
<itagaki.takahiro@oss.ntt.co.jp> wrote:

EXPLAIN BUFFERS only shows 'hit', 'read' and 'temp read' in text format
to fit in display, but xml or json format contains all of them.

I was very careful when I submitted the machine-readable explain patch
to make sure that the choice of which information was displayed was
independent of the format, and I think that we should stick with that.

I thought one of the main purposes of adding the machine-readable
formats was to allow inclusion of information that we thought too
verbose for the human-readable format.  Whether this info falls into
that category remains to be seen, but I don't agree with the premise
that the information content must always be exactly the same.

Hmm. I thought that the purpose of having a generalized options
syntax was that people could have the information they wanted,
independently of the format they chose. Even with a lot of extra
information, the human readable format is still far shorter and more
easily readable than either of the others. If we had gone with the
idea of just dumping everything in the world into the XML format,
you'd be right: but for various reasons we decided against that, which
I'm very happy about.

FWIW, the patch's output as it stood a few days ago (one extra line per
node, conditional on a BUFFERS option) did seem perfectly reasonable to
me, and I don't see the reason to change that format now.

Yep, agreed.

...Robert

#25Jeff Janes
jeff.janes@gmail.com
In reply to: Itagaki Takahiro (#21)
Re: EXPLAIN BUFFERS

On Thu, Oct 15, 2009 at 3:29 AM, Itagaki Takahiro
<itagaki.takahiro@oss.ntt.co.jp> wrote:

Robert Haas <robertmhaas@gmail.com> wrote:

In this case, I think that the auto_explain changes ought to be part of
the same patch as the core EXPLAIN changes

Here is a rewritten patch to add EXPLAIN (ANALYZE, BUFFERS) and
support for it by contrib/auto_explain. I removed pg_stat_statements
support from the patch for now.

Just a quick note: this patch does not apply cleanly to HEAD due to
the subsequent removal from explain.c of the near-by lines:

/* Convert parameter type data to the form parser wants */
getParamListTypes(params, &param_types, &num_params);

I think it is merely a text conflict and not a functional one.

Cheers,

Jeff

#26Itagaki Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Jeff Janes (#25)
1 attachment(s)
Re: EXPLAIN BUFFERS

Jeff Janes <jeff.janes@gmail.com> wrote:

Just a quick note: this patch does not apply cleanly to HEAD due to
the subsequent removal from explain.c of the near-by lines:

Thanks for reporting.
The attached patch is rebased to current CVS.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Attachments:

explain_buffers_20091124.patch (application/octet-stream)
diff -cprN head/contrib/auto_explain/auto_explain.c work/contrib/auto_explain/auto_explain.c
*** head/contrib/auto_explain/auto_explain.c	2009-08-10 14:46:49.000000000 +0900
--- work/contrib/auto_explain/auto_explain.c	2009-11-24 10:04:50.868963605 +0900
*************** PG_MODULE_MAGIC;
*** 22,27 ****
--- 22,28 ----
  static int	auto_explain_log_min_duration = -1; /* msec or -1 */
  static bool auto_explain_log_analyze = false;
  static bool auto_explain_log_verbose = false;
+ static bool auto_explain_log_buffers = false;
  static int	auto_explain_log_format = EXPLAIN_FORMAT_TEXT;
  static bool auto_explain_log_nested_statements = false;
  
*************** _PG_init(void)
*** 92,97 ****
--- 93,108 ----
  							 NULL,
  							 NULL);
  
+ 	DefineCustomBoolVariable("auto_explain.log_buffers",
+ 							 "Log buffers usage.",
+ 							 NULL,
+ 							 &auto_explain_log_buffers,
+ 							 false,
+ 							 PGC_SUSET,
+ 							 0,
+ 							 NULL,
+ 							 NULL);
+ 
  	DefineCustomEnumVariable("auto_explain.log_format",
  							 "EXPLAIN format to be used for plan logging.",
  							 NULL,
*************** explain_ExecutorEnd(QueryDesc *queryDesc
*** 218,225 ****
  			ExplainState	es;
  
  			ExplainInitState(&es);
! 			es.analyze = (queryDesc->doInstrument && auto_explain_log_analyze);
  			es.verbose = auto_explain_log_verbose;
  			es.format = auto_explain_log_format;
  
  			ExplainPrintPlan(&es, queryDesc);
--- 229,238 ----
  			ExplainState	es;
  
  			ExplainInitState(&es);
! 			es.analyze = (queryDesc->doInstrument &&
! 				(auto_explain_log_analyze || auto_explain_log_buffers));
  			es.verbose = auto_explain_log_verbose;
+ 			es.buffers = (es.analyze && auto_explain_log_buffers);
  			es.format = auto_explain_log_format;
  
  			ExplainPrintPlan(&es, queryDesc);
diff -cprN head/doc/src/sgml/auto-explain.sgml work/doc/src/sgml/auto-explain.sgml
*** head/doc/src/sgml/auto-explain.sgml	2009-08-10 14:46:50.000000000 +0900
--- work/doc/src/sgml/auto-explain.sgml	2009-11-24 10:04:50.868963605 +0900
*************** LOAD 'auto_explain';
*** 104,109 ****
--- 104,128 ----
  
     <varlistentry>
      <term>
+      <varname>auto_explain.log_buffers</varname> (<type>boolean</type>)
+     </term>
+     <indexterm>
+      <primary><varname>auto_explain.log_buffers</> configuration parameter</primary>
+     </indexterm>
+     <listitem>
+      <para>
+       <varname>auto_explain.log_buffers</varname> causes <command>EXPLAIN
+       (ANALYZE, BUFFERS)</> output, rather than just <command>EXPLAIN</> 
+       output, to be printed when an execution plan is logged. This parameter is 
+       off by default. Only superusers can change this setting. Also, this
+       parameter only has effect if <varname>auto_explain.log_analyze</>
+       parameter is set.
+      </para>
+     </listitem>
+    </varlistentry>
+ 
+    <varlistentry>
+     <term>
       <varname>auto_explain.log_format</varname> (<type>enum</type>)
      </term>
      <indexterm>
diff -cprN head/doc/src/sgml/ref/explain.sgml work/doc/src/sgml/ref/explain.sgml
*** head/doc/src/sgml/ref/explain.sgml	2009-08-10 14:46:50.000000000 +0900
--- work/doc/src/sgml/ref/explain.sgml	2009-11-24 10:04:50.870421302 +0900
*************** PostgreSQL documentation
*** 31,37 ****
  
   <refsynopsisdiv>
  <synopsis>
! EXPLAIN [ ( { ANALYZE <replaceable class="parameter">boolean</replaceable> | VERBOSE <replaceable class="parameter">boolean</replaceable> | COSTS <replaceable class="parameter">boolean</replaceable> | FORMAT { TEXT | XML | JSON } } [, ...] ) ] <replaceable class="parameter">statement</replaceable>
  EXPLAIN [ ANALYZE ] [ VERBOSE ] <replaceable class="parameter">statement</replaceable>
  </synopsis>
   </refsynopsisdiv>
--- 31,37 ----
  
   <refsynopsisdiv>
  <synopsis>
! EXPLAIN [ ( { ANALYZE <replaceable class="parameter">boolean</replaceable> | VERBOSE <replaceable class="parameter">boolean</replaceable> | COSTS <replaceable class="parameter">boolean</replaceable> | BUFFERS <replaceable class="parameter">boolean</replaceable> | FORMAT { TEXT | XML | JSON } } [, ...] ) ] <replaceable class="parameter">statement</replaceable>
  EXPLAIN [ ANALYZE ] [ VERBOSE ] <replaceable class="parameter">statement</replaceable>
  </synopsis>
   </refsynopsisdiv>
*************** ROLLBACK;
*** 140,145 ****
--- 140,157 ----
     </varlistentry>
  
     <varlistentry>
+     <term><literal>BUFFERS</literal></term>
+     <listitem>
+      <para>
+       Include information on the buffers. Specifically, include the number of
+       buffer hits, number of disk blocks read, and number of local buffer read.
+       This parameter should be used with <literal>ANALYZE</literal> parameter.
+       Also, this parameter defaults to <literal>FALSE</literal>.
+      </para>
+     </listitem>
+    </varlistentry>
+ 
+    <varlistentry>
      <term><literal>FORMAT</literal></term>
      <listitem>
       <para>
diff -cprN head/src/backend/commands/explain.c work/src/backend/commands/explain.c
*** head/src/backend/commands/explain.c	2009-11-05 07:26:04.000000000 +0900
--- work/src/backend/commands/explain.c	2009-11-24 10:06:53.073015629 +0900
*************** ExplainQuery(ExplainStmt *stmt, const ch
*** 125,130 ****
--- 125,132 ----
  			es.verbose = defGetBoolean(opt);
  		else if (strcmp(opt->defname, "costs") == 0)
  			es.costs = defGetBoolean(opt);
+ 		else if (strcmp(opt->defname, "buffers") == 0)
+ 			es.buffers = defGetBoolean(opt);
  		else if (strcmp(opt->defname, "format") == 0)
  		{
  			char   *p = defGetString(opt);
*************** ExplainQuery(ExplainStmt *stmt, const ch
*** 147,152 ****
--- 149,159 ----
  					 errmsg("unrecognized EXPLAIN option \"%s\"",
  							opt->defname)));
  	}
+ 
+ 	if (es.buffers && !es.analyze)
+ 		ereport(ERROR,
+ 			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ 			 errmsg("EXPLAIN option BUFFERS requires ANALYZE")));
  
  	/*
  	 * Run parse analysis and rewrite.	Note this also acquires sufficient
*************** ExplainNode(Plan *plan, PlanState *plans
*** 1040,1045 ****
--- 1047,1076 ----
  			break;
  	}
  
+ 	/* Show buffer usage */
+ 	if (es->buffers)
+ 	{
+ 		const BufferUsage *usage = &planstate->instrument->bufusage;
+ 
+ 		if (es->format == EXPLAIN_FORMAT_TEXT)
+ 		{
+ 			appendStringInfoSpaces(es->str, es->indent * 2);
+ 			appendStringInfo(es->str, "Blocks Hit: %ld  Read: %ld  Temp Read: %ld\n",
+ 				usage->blks_hit, usage->blks_read, usage->temp_blks_read);
+ 		}
+ 		else
+ 		{
+ 			ExplainPropertyLong("Hit Blocks", usage->blks_hit, es);
+ 			ExplainPropertyLong("Read Blocks", usage->blks_read, es);
+ 			ExplainPropertyLong("Written Blocks", usage->blks_written, es);
+ 			ExplainPropertyLong("Local Hit Blocks", usage->local_blks_hit, es);
+ 			ExplainPropertyLong("Local Read Blocks", usage->local_blks_read, es);
+ 			ExplainPropertyLong("Local Written Blocks", usage->local_blks_written, es);
+ 			ExplainPropertyLong("Temp Read Blocks", usage->temp_blks_read, es);
+ 			ExplainPropertyLong("Temp Written Blocks", usage->temp_blks_written, es);
+ 		}
+ 	}
+ 
  	/* Get ready to display the child plans */
  	haschildren = plan->initPlan ||
  		outerPlan(plan) ||
diff -cprN head/src/backend/executor/instrument.c work/src/backend/executor/instrument.c
*** head/src/backend/executor/instrument.c	2009-01-02 02:23:41.000000000 +0900
--- work/src/backend/executor/instrument.c	2009-11-24 10:04:50.871436243 +0900
***************
*** 17,22 ****
--- 17,26 ----
  
  #include "executor/instrument.h"
  
+ BufferUsage			pgBufferUsage;
+ 
+ static void BufferUsageAccumDiff(BufferUsage *dst,
+ 		const BufferUsage *add, const BufferUsage *sub);
  
  /* Allocate new instrumentation structure(s) */
  Instrumentation *
*************** InstrStartNode(Instrumentation *instr)
*** 37,42 ****
--- 41,49 ----
  		INSTR_TIME_SET_CURRENT(instr->starttime);
  	else
  		elog(DEBUG2, "InstrStartNode called twice in a row");
+ 
+ 	/* initialize buffer usage per plan node */
+ 	instr->bufusage_start = pgBufferUsage;
  }
  
  /* Exit from a plan node */
*************** InstrStopNode(Instrumentation *instr, do
*** 59,64 ****
--- 66,79 ----
  
  	INSTR_TIME_SET_ZERO(instr->starttime);
  
+ 	/*
+ 	 * Adds delta of buffer usage to node's count and resets counter to start
+ 	 * so that the counters are not double counted by parent nodes.
+ 	 */
+ 	BufferUsageAccumDiff(&instr->bufusage,
+ 		&pgBufferUsage, &instr->bufusage_start);
+ 	pgBufferUsage = instr->bufusage_start;
+ 
  	/* Is this the first tuple of this cycle? */
  	if (!instr->running)
  	{
*************** InstrEndLoop(Instrumentation *instr)
*** 95,97 ****
--- 110,128 ----
  	instr->firsttuple = 0;
  	instr->tuplecount = 0;
  }
+ 
+ static void
+ BufferUsageAccumDiff(BufferUsage *dst,
+ 					 const BufferUsage *add,
+ 					 const BufferUsage *sub)
+ {
+ 	/* dst += add - sub */
+ 	dst->blks_hit += add->blks_hit - sub->blks_hit;
+ 	dst->blks_read += add->blks_read - sub->blks_read;
+ 	dst->blks_written += add->blks_written - sub->blks_written;
+ 	dst->local_blks_hit += add->local_blks_hit - sub->local_blks_hit;
+ 	dst->local_blks_read += add->local_blks_read - sub->local_blks_read;
+ 	dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
+ 	dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
+ 	dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
+ }
diff -cprN head/src/backend/storage/buffer/buf_init.c work/src/backend/storage/buffer/buf_init.c
*** head/src/backend/storage/buffer/buf_init.c	2009-01-02 02:23:47.000000000 +0900
--- work/src/backend/storage/buffer/buf_init.c	2009-11-24 10:04:50.871436243 +0900
*************** BufferDesc *BufferDescriptors;
*** 22,37 ****
  char	   *BufferBlocks;
  int32	   *PrivateRefCount;
  
- /* statistics counters */
- long int	ReadBufferCount;
- long int	ReadLocalBufferCount;
- long int	BufferHitCount;
- long int	LocalBufferHitCount;
- long int	BufferFlushCount;
- long int	LocalBufferFlushCount;
- long int	BufFileReadCount;
- long int	BufFileWriteCount;
- 
  
  /*
   * Data Structures:
--- 22,27 ----
diff -cprN head/src/backend/storage/buffer/bufmgr.c work/src/backend/storage/buffer/bufmgr.c
*** head/src/backend/storage/buffer/bufmgr.c	2009-06-11 23:49:01.000000000 +0900
--- work/src/backend/storage/buffer/bufmgr.c	2009-11-24 10:04:50.872386293 +0900
***************
*** 34,39 ****
--- 34,40 ----
  #include <unistd.h>
  
  #include "catalog/catalog.h"
+ #include "executor/instrument.h"
  #include "miscadmin.h"
  #include "pg_trace.h"
  #include "pgstat.h"
*************** ReadBuffer_common(SMgrRelation smgr, boo
*** 300,321 ****
  
  	if (isLocalBuf)
  	{
- 		ReadLocalBufferCount++;
  		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
  		if (found)
! 			LocalBufferHitCount++;
  	}
  	else
  	{
- 		ReadBufferCount++;
- 
  		/*
  		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
  		 * not currently in memory.
  		 */
  		bufHdr = BufferAlloc(smgr, forkNum, blockNum, strategy, &found);
  		if (found)
! 			BufferHitCount++;
  	}
  
  	/* At this point we do NOT hold any locks. */
--- 301,323 ----
  
  	if (isLocalBuf)
  	{
  		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
  		if (found)
! 			pgBufferUsage.local_blks_hit++;
! 		else
! 			pgBufferUsage.local_blks_read++;
  	}
  	else
  	{
  		/*
  		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
  		 * not currently in memory.
  		 */
  		bufHdr = BufferAlloc(smgr, forkNum, blockNum, strategy, &found);
  		if (found)
! 			pgBufferUsage.blks_hit++;
! 		else
! 			pgBufferUsage.blks_read++;
  	}
  
  	/* At this point we do NOT hold any locks. */
*************** SyncOneBuffer(int buf_id, bool skip_rece
*** 1611,1664 ****
  
  
  /*
-  * Return a palloc'd string containing buffer usage statistics.
-  */
- char *
- ShowBufferUsage(void)
- {
- 	StringInfoData str;
- 	float		hitrate;
- 	float		localhitrate;
- 
- 	initStringInfo(&str);
- 
- 	if (ReadBufferCount == 0)
- 		hitrate = 0.0;
- 	else
- 		hitrate = (float) BufferHitCount *100.0 / ReadBufferCount;
- 
- 	if (ReadLocalBufferCount == 0)
- 		localhitrate = 0.0;
- 	else
- 		localhitrate = (float) LocalBufferHitCount *100.0 / ReadLocalBufferCount;
- 
- 	appendStringInfo(&str,
- 	"!\tShared blocks: %10ld read, %10ld written, buffer hit rate = %.2f%%\n",
- 				ReadBufferCount - BufferHitCount, BufferFlushCount, hitrate);
- 	appendStringInfo(&str,
- 	"!\tLocal  blocks: %10ld read, %10ld written, buffer hit rate = %.2f%%\n",
- 					 ReadLocalBufferCount - LocalBufferHitCount, LocalBufferFlushCount, localhitrate);
- 	appendStringInfo(&str,
- 					 "!\tDirect blocks: %10ld read, %10ld written\n",
- 					 BufFileReadCount, BufFileWriteCount);
- 
- 	return str.data;
- }
- 
- void
- ResetBufferUsage(void)
- {
- 	BufferHitCount = 0;
- 	ReadBufferCount = 0;
- 	BufferFlushCount = 0;
- 	LocalBufferHitCount = 0;
- 	ReadLocalBufferCount = 0;
- 	LocalBufferFlushCount = 0;
- 	BufFileReadCount = 0;
- 	BufFileWriteCount = 0;
- }
- 
- /*
   *		AtEOXact_Buffers - clean up at end of transaction.
   *
   *		As of PostgreSQL 8.0, buffer pins should get released by the
--- 1613,1618 ----
*************** FlushBuffer(volatile BufferDesc *buf, SM
*** 1916,1922 ****
  			  (char *) BufHdrGetBlock(buf),
  			  false);
  
! 	BufferFlushCount++;
  
  	/*
  	 * Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
--- 1870,1876 ----
  			  (char *) BufHdrGetBlock(buf),
  			  false);
  
! 	pgBufferUsage.blks_written++;
  
  	/*
  	 * Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
diff -cprN head/src/backend/storage/buffer/localbuf.c work/src/backend/storage/buffer/localbuf.c
*** head/src/backend/storage/buffer/localbuf.c	2009-06-11 23:49:01.000000000 +0900
--- work/src/backend/storage/buffer/localbuf.c	2009-11-24 10:04:50.873458462 +0900
***************
*** 16,21 ****
--- 16,22 ----
  #include "postgres.h"
  
  #include "catalog/catalog.h"
+ #include "executor/instrument.h"
  #include "storage/buf_internals.h"
  #include "storage/bufmgr.h"
  #include "storage/smgr.h"
*************** LocalBufferAlloc(SMgrRelation smgr, Fork
*** 209,215 ****
  		/* Mark not-dirty now in case we error out below */
  		bufHdr->flags &= ~BM_DIRTY;
  
! 		LocalBufferFlushCount++;
  	}
  
  	/*
--- 210,216 ----
  		/* Mark not-dirty now in case we error out below */
  		bufHdr->flags &= ~BM_DIRTY;
  
! 		pgBufferUsage.local_blks_written++;
  	}
  
  	/*
diff -cprN head/src/backend/storage/file/buffile.c work/src/backend/storage/file/buffile.c
*** head/src/backend/storage/file/buffile.c	2009-06-11 23:49:01.000000000 +0900
--- work/src/backend/storage/file/buffile.c	2009-11-24 10:04:50.873458462 +0900
***************
*** 34,39 ****
--- 34,40 ----
  
  #include "postgres.h"
  
+ #include "executor/instrument.h"
  #include "storage/fd.h"
  #include "storage/buffile.h"
  #include "storage/buf_internals.h"
*************** BufFileLoadBuffer(BufFile *file)
*** 240,246 ****
  	file->offsets[file->curFile] += file->nbytes;
  	/* we choose not to advance curOffset here */
  
! 	BufFileReadCount++;
  }
  
  /*
--- 241,247 ----
  	file->offsets[file->curFile] += file->nbytes;
  	/* we choose not to advance curOffset here */
  
! 	pgBufferUsage.temp_blks_read++;
  }
  
  /*
*************** BufFileDumpBuffer(BufFile *file)
*** 304,310 ****
  		file->curOffset += bytestowrite;
  		wpos += bytestowrite;
  
! 		BufFileWriteCount++;
  	}
  	file->dirty = false;
  
--- 305,311 ----
  		file->curOffset += bytestowrite;
  		wpos += bytestowrite;
  
! 		pgBufferUsage.temp_blks_written++;
  	}
  	file->dirty = false;
  
diff -cprN head/src/backend/tcop/postgres.c work/src/backend/tcop/postgres.c
*** head/src/backend/tcop/postgres.c	2009-11-05 07:26:06.000000000 +0900
--- work/src/backend/tcop/postgres.c	2009-11-24 10:04:50.874015639 +0900
*************** ResetUsage(void)
*** 3901,3907 ****
  {
  	getrusage(RUSAGE_SELF, &Save_r);
  	gettimeofday(&Save_t, NULL);
- 	ResetBufferUsage();
  }
  
  void
--- 3901,3906 ----
*************** ShowUsage(const char *title)
*** 3912,3918 ****
  				sys;
  	struct timeval elapse_t;
  	struct rusage r;
- 	char	   *bufusage;
  
  	getrusage(RUSAGE_SELF, &r);
  	gettimeofday(&elapse_t, NULL);
--- 3911,3916 ----
*************** ShowUsage(const char *title)
*** 3986,3995 ****
  					 r.ru_nvcsw, r.ru_nivcsw);
  #endif   /* HAVE_GETRUSAGE */
  
- 	bufusage = ShowBufferUsage();
- 	appendStringInfo(&str, "! buffer usage stats:\n%s", bufusage);
- 	pfree(bufusage);
- 
  	/* remove trailing newline */
  	if (str.data[str.len - 1] == '\n')
  		str.data[--str.len] = '\0';
--- 3984,3989 ----
diff -cprN head/src/include/commands/explain.h work/src/include/commands/explain.h
*** head/src/include/commands/explain.h	2009-08-10 14:46:50.000000000 +0900
--- work/src/include/commands/explain.h	2009-11-24 10:04:50.875067601 +0900
*************** typedef struct ExplainState
*** 29,34 ****
--- 29,35 ----
  	bool		verbose;		/* be verbose */
  	bool		analyze;		/* print actual times */
  	bool		costs;			/* print costs */
+ 	bool		buffers;		/* print buffer usage */
  	ExplainFormat format;		/* output format */
  	/* other states */
  	PlannedStmt *pstmt;			/* top of plan */
diff -cprN head/src/include/executor/instrument.h work/src/include/executor/instrument.h
*** head/src/include/executor/instrument.h	2009-01-02 02:23:59.000000000 +0900
--- work/src/include/executor/instrument.h	2009-11-24 10:04:50.875067601 +0900
***************
*** 16,21 ****
--- 16,33 ----
  #include "portability/instr_time.h"
  
  
+ typedef struct BufferUsage
+ {
+ 	long	blks_hit;			/* # of buffer hits at start */
+ 	long	blks_read;			/* # of disk blocks read at start */
+ 	long	blks_written;		/* # of disk blocks written at start */
+ 	long	local_blks_hit;		/* # of buffer hits at start */
+ 	long	local_blks_read;	/* # of disk blocks read at start */
+ 	long	local_blks_written;	/* # of disk blocks written at start */
+ 	long	temp_blks_read;		/* # of temp blocks read at start */
+ 	long	temp_blks_written;	/* # of temp blocks written at start */
+ } BufferUsage;
+ 
  typedef struct Instrumentation
  {
  	/* Info about current plan cycle: */
*************** typedef struct Instrumentation
*** 24,36 ****
--- 36,52 ----
  	instr_time	counter;		/* Accumulated runtime for this node */
  	double		firsttuple;		/* Time for first tuple of this cycle */
  	double		tuplecount;		/* Tuples emitted so far this cycle */
+ 	BufferUsage	bufusage_start;	/* Buffer usage at start */
  	/* Accumulated statistics across all completed cycles: */
  	double		startup;		/* Total startup time (in seconds) */
  	double		total;			/* Total total time (in seconds) */
  	double		ntuples;		/* Total tuples produced */
  	double		nloops;			/* # of run cycles for this node */
+ 	BufferUsage	bufusage;		/* Total buffer usage */
  } Instrumentation;
  
+ extern BufferUsage		pgBufferUsage;
+ 
  extern Instrumentation *InstrAlloc(int n);
  extern void InstrStartNode(Instrumentation *instr);
  extern void InstrStopNode(Instrumentation *instr, double nTuples);
diff -cprN head/src/include/storage/buf_internals.h work/src/include/storage/buf_internals.h
*** head/src/include/storage/buf_internals.h	2009-06-11 23:49:12.000000000 +0900
--- work/src/include/storage/buf_internals.h	2009-11-24 10:04:50.875067601 +0900
*************** extern PGDLLIMPORT BufferDesc *BufferDes
*** 173,188 ****
  /* in localbuf.c */
  extern BufferDesc *LocalBufferDescriptors;
  
- /* event counters in buf_init.c */
- extern long int ReadBufferCount;
- extern long int ReadLocalBufferCount;
- extern long int BufferHitCount;
- extern long int LocalBufferHitCount;
- extern long int BufferFlushCount;
- extern long int LocalBufferFlushCount;
- extern long int BufFileReadCount;
- extern long int BufFileWriteCount;
- 
  
  /*
   * Internal routines: only called by bufmgr
--- 173,178 ----
diff -cprN head/src/include/storage/bufmgr.h work/src/include/storage/bufmgr.h
*** head/src/include/storage/bufmgr.h	2009-06-11 23:49:12.000000000 +0900
--- work/src/include/storage/bufmgr.h	2009-11-24 10:04:50.875067601 +0900
*************** extern Buffer ReleaseAndReadBuffer(Buffe
*** 173,180 ****
  extern void InitBufferPool(void);
  extern void InitBufferPoolAccess(void);
  extern void InitBufferPoolBackend(void);
- extern char *ShowBufferUsage(void);
- extern void ResetBufferUsage(void);
  extern void AtEOXact_Buffers(bool isCommit);
  extern void PrintBufferLeakWarning(Buffer buffer);
  extern void CheckPointBuffers(int flags);
--- 173,178 ----
#27Euler Taveira de Oliveira
In reply to: Itagaki Takahiro (#26)
Re: EXPLAIN BUFFERS

Itagaki Takahiro wrote:

The attached patch is rebased to current CVS.

I'm looking at your patch now... It is almost there but has some issues.

(i) documentation: you have more than three counters and they could be
mentioned in docs too.

+    Include information on the buffers. Specifically, include the number of
+    buffer hits, number of disk blocks read, and number of local buffer read.

(ii) format: why does the text output format have fewer counters than the other ones?

+       if (es->format == EXPLAIN_FORMAT_TEXT)
+       {
+           appendStringInfoSpaces(es->str, es->indent * 2);
+           appendStringInfo(es->str, "Blocks Hit: %ld  Read: %ld  Temp Read: %ld\n",
+               usage->blks_hit, usage->blks_read, usage->temp_blks_read);
+       }
+       else
+       {
+           ExplainPropertyLong("Hit Blocks", usage->blks_hit, es);
+           ExplainPropertyLong("Read Blocks", usage->blks_read, es);
+           ExplainPropertyLong("Written Blocks", usage->blks_written, es);
+           ExplainPropertyLong("Local Hit Blocks", usage->local_blks_hit, es);
+           ExplainPropertyLong("Local Read Blocks", usage->local_blks_read, es);
+           ExplainPropertyLong("Local Written Blocks", usage->local_blks_written, es);
+           ExplainPropertyLong("Temp Read Blocks", usage->temp_blks_read, es);
+           ExplainPropertyLong("Temp Written Blocks", usage->temp_blks_written, es);
+       }

(iii) string: I don't like the string in text format because (1) it is not
consistent (only the first item has the word 'Blocks'), (2) it doesn't say
which kind of block it is about (shared, local, or temp?), (3) it leaves out
the other counters, and (4) it leaves out the written counters.

->  Seq Scan on pg_namespace nc  (cost=0.00..1.07 rows=4 width=68) (actual time=0.015..0.165 rows=4 loops=1)
      Filter: (NOT pg_is_other_temp_schema(oid))
      Blocks Hit: 11  Read: 0  Temp Read: 0

(iv) text format: I don't have a good suggestion but here are some ideas. The
former is too long and the latter is too verbose. :( Another option is to
suppress the words hit, read, and written, and just document the order.

Shared Blocks (11 hit, 5 read, 0 written); Local Blocks (5 hit, 0 read, 0 written); Temp Blocks (0 read, 0 written)

or

Shared Blocks: 11 hit, 5 read, 0 written
Local Blocks: 5 hit, 0 read, 0 written
Temp Blocks: 0 read, 0 written
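
To make the second layout concrete, here is a minimal standalone sketch that prints all eight counters on three lines. It is not code from the patch: the struct TextUsage and the function format_usage_lines() are hypothetical names, the field names merely mirror the patch's BufferUsage, and plain snprintf() stands in for the backend's appendStringInfo().

```c
#include <stdio.h>

/* Hypothetical stand-in for the patch's BufferUsage; field names mirror it. */
typedef struct TextUsage
{
	long	blks_hit;
	long	blks_read;
	long	blks_written;
	long	local_blks_hit;
	long	local_blks_read;
	long	local_blks_written;
	long	temp_blks_read;
	long	temp_blks_written;
} TextUsage;

/*
 * Write the three-line layout into buf, each line indented by "indent"
 * spaces. Returns what snprintf() returns: the length that was (or would
 * have been) written, not counting the terminating NUL.
 */
static int
format_usage_lines(const TextUsage *u, int indent, char *buf, size_t buflen)
{
	/* "%*s" with an empty string emits exactly "indent" spaces */
	return snprintf(buf, buflen,
					"%*sShared Blocks: %ld hit, %ld read, %ld written\n"
					"%*sLocal Blocks: %ld hit, %ld read, %ld written\n"
					"%*sTemp Blocks: %ld read, %ld written\n",
					indent, "", u->blks_hit, u->blks_read, u->blks_written,
					indent, "", u->local_blks_hit, u->local_blks_read,
					u->local_blks_written,
					indent, "", u->temp_blks_read, u->temp_blks_written);
}
```

Calling format_usage_lines() with the counters 11/5/0, 5/0/0 and 0/0 and an indent of 0 yields the three lines shown above; in the backend the indent would presumably come from es->indent.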

(v) accumulative: I don't remember if we discussed it, but is there a reason
the number of buffers isn't accumulative? We already have cost and time that
are both accumulative. I saw the BufferUsageAccumDiff() function but didn't try
to figure out why it isn't accumulating or passing the counters to parent nodes.

euler=# explain (analyze true, buffers true) select * from pgbench_branches inner join pgbench_history using (bid) where bid > 100;
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Hash Join (cost=1.02..18.62 rows=3 width=476) (actual time=0.136..0.136 rows=0 loops=1)
   Hash Cond: (pgbench_history.bid = pgbench_branches.bid)
   Blocks Hit: 2 Read: 0 Temp Read: 0
   -> Seq Scan on pgbench_history (cost=0.00..15.50 rows=550 width=116) (actual time=0.034..0.034 rows=1 loops=1)
        Blocks Hit: 1 Read: 0 Temp Read: 0
   -> Hash (cost=1.01..1.01 rows=1 width=364) (actual time=0.022..0.022 rows=0 loops=1)
        Blocks Hit: 0 Read: 0 Temp Read: 0
        -> Seq Scan on pgbench_branches (cost=0.00..1.01 rows=1 width=364) (actual time=0.019..0.019 rows=0 loops=1)
             Filter: (bid > 100)
             Blocks Hit: 1 Read: 0 Temp Read: 0
 Total runtime: 0.531 ms
(11 rows)

(vi) comment: the 'at start' is superfluous. Please, remove it.

+   long    blks_hit;           /* # of buffer hits at start */
+   long    blks_read;          /* # of disk blocks read at start */

(vii) all nodes: I wonder whether we need this information in all nodes (even
in those nodes that don't read or write). Suppressing it there would be less
verbose, but it could complicate life for some parsers. Of course, if we
suppress this information, we still need to include it on the top node even if
that node doesn't read or write anything.

I didn't have time to adjust your patch per comments above but if you can
address all of those issues I certainly could check your patch again.

--
Euler Taveira de Oliveira
http://www.timbira.com/

#28 Itagaki Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Euler Taveira de Oliveira (#27)
Re: EXPLAIN BUFFERS

Euler Taveira de Oliveira <euler@timbira.com> wrote:

I'm looking at your patch now... It is almost there but has some issues.

(i) documentation: you have more than three counters and they could be
mentioned in docs too.

I'll add documentation for all variables.

(ii) format: why does text output format have fewer counters than the other ones?

That's because lines will be too long for text format. I think the
three values in it are the most important and useful ones.

(iii) string: i don't like the string in text format
(1) it is not concise (only the first item has the word 'Blocks'),
(2) what block is it about? Shared, Local, or Temp?

The format was suggested here, and there were no objections at the time.
http://archives.postgresql.org/pgsql-hackers/2009-10/msg00268.php
I think the current output is enough and useful in normal use.
We can use XML or JSON format for more details.

I think
Blocks Hit: 1641 Read: 0 Temp Read: 1443
means
Blocks (Hit: 1641 Read: 0 Temp Read: 1443)
i.e., blocks hit, blocks read, and temp blocks read.

(3) why don't you include the other ones?, and
(4) why don't you include the written counters?
(iv) text format: i don't have a good suggestion but here are some ideas. The
former is too long and the latter is too verbose.

The reason for these is the same as for (ii).

(v) accumulative: i don't remember if we discussed it but is there a reason
the number of buffers isn't accumulative? We already have cost and time that
are both accumulative. I saw BufferUsageAccumDiff() function but didn't try to
figure out why it isn't accumulating or passing the counters to parent nodes.

That's reasonable. I'll change it if there are no objections.

(vi) comment: the 'at start' is superfluous. Please, remove it.

Ooops, I'll remove them.

(vii) all nodes: I'm thinking if we need this information in all nodes (even
in those nodes that don't read or write). It would be less verbose but it
could complicate some parser's life. Of course, if we suppress this
information, we need to include it on the top node even if we don't read or
write in it.

I cannot understand what you mean -- should I suppress the lines when they
have all-zero values?

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#29 Robert Haas
robertmhaas@gmail.com
In reply to: Itagaki Takahiro (#28)
Re: EXPLAIN BUFFERS

On Mon, Dec 7, 2009 at 1:28 AM, Itagaki Takahiro
<itagaki.takahiro@oss.ntt.co.jp> wrote:

(ii) format: why does text output format have fewer counters than the other ones?

That's because lines will be too long for text format. I think the
three values in it are the most important and useful ones.

I disagree. I objected to showing only part of the information when I
looked at this patch last CommitFest, and I object again now. I do
*NOT* want to have to use JSON or XML to get at the counters you've
arbitrarily decided are not interesting to me. I think with a little
creativity we can certainly get these into the text format, and I'm
willing to help with that. In fact, if you want, I'll pick up this
patch and make it my first commit, though since you're now a committer
as well perhaps you'd prefer to do it yourself.

(v) accumulative: i don't remember if we discussed it but is there a reason
the number of buffers isn't accumulative? We already have cost and time that
are both accumulative. I saw BufferUsageAccumDiff() function but didn't try to
figure out why it isn't accumulating or passing the counters to parent nodes.

It's reasonable. I'll change so if no objections.

+1 to change it.

...Robert

#30 Euler Taveira de Oliveira
euler@timbira.com
In reply to: Itagaki Takahiro (#28)
Re: EXPLAIN BUFFERS

Itagaki Takahiro wrote:

I think the current output is enough and useful in normal use.
We can use XML or JSON format for more details.

I don't think it is a good idea to have different information in different
formats. I'm with Robert; *don't* do that. If you want to suppress the other
ones in text format, do it in the others too. One idea is to add them only if
we prefer the VERBOSE output.

I think
Blocks Hit: 1641 Read: 0 Temp Read: 1443
means
Blocks (Hit: 1641 Read: 0 Temp Read: 1443)
i.e., Blocks of hit, blocks of reads, and Blocks of temp reads.

But the latter is clearer than the former.

I cannot understand what you mean -- should I suppress the lines when they
have all-zero values?

There are nodes that don't read or write blocks. If we go this way, we need to
document this behavior.
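
To make the suppression idea in (vii) concrete, here is a minimal sketch of what "skip the line when every counter is zero, except on the top node" could look like. It is not code from the patch: DemoBufferUsage, demo_usage_is_zero() and demo_format_usage() are hypothetical names, the struct only mirrors the counter names used in the thread, and the output string is just the one-line text format that was under discussion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the counters discussed in this thread. */
typedef struct DemoBufferUsage
{
	long		shared_blks_hit;
	long		shared_blks_read;
	long		shared_blks_written;
	long		local_blks_hit;
	long		local_blks_read;
	long		local_blks_written;
	long		temp_blks_read;
	long		temp_blks_written;
} DemoBufferUsage;

/* Return 1 if every counter is zero, 0 otherwise. */
int
demo_usage_is_zero(const DemoBufferUsage *u)
{
	assert(u != NULL);
	return u->shared_blks_hit == 0 &&
		u->shared_blks_read == 0 &&
		u->shared_blks_written == 0 &&
		u->local_blks_hit == 0 &&
		u->local_blks_read == 0 &&
		u->local_blks_written == 0 &&
		u->temp_blks_read == 0 &&
		u->temp_blks_written == 0;
}

/*
 * Format one text line into buf.  The line is suppressed (buf is left
 * empty and 0 is returned) when all counters are zero, unless this is
 * the top node, which always prints.  Otherwise returns snprintf's
 * result, i.e. the line length when buf is large enough.
 */
int
demo_format_usage(char *buf, size_t buflen,
				  const DemoBufferUsage *u, int is_top_node)
{
	assert(buf != NULL && buflen > 0);
	buf[0] = '\0';

	if (!is_top_node && demo_usage_is_zero(u))
		return 0;

	return snprintf(buf, buflen,
					"Shared Blocks: (hit=%ld read=%ld written=%ld) "
					"Local Blocks: (hit=%ld read=%ld written=%ld) "
					"Temp Blocks: (read=%ld written=%ld)",
					u->shared_blks_hit, u->shared_blks_read,
					u->shared_blks_written,
					u->local_blks_hit, u->local_blks_read,
					u->local_blks_written,
					u->temp_blks_read, u->temp_blks_written);
}
```

A parser consuming such output would have to treat a missing line as "all zero", which is the behavior that would need documenting.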

--
Euler Taveira de Oliveira
http://www.timbira.com/

#31 Itagaki Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Euler Taveira de Oliveira (#30)
1 attachment(s)
Re: EXPLAIN BUFFERS

Here is an updated patch per discussion.

* Counters are accumulative. They contain I/Os by child nodes.
* Text format shows all counters.
* Add "shared_" prefix to variables representing shared buffers/blocks.

Euler Taveira de Oliveira <euler@timbira.com> wrote:

Itagaki Takahiro wrote:

I think the current output is enough and useful in normal use.
We can use XML or JSON format for more details.

I don't think it is a good idea to have different information in different
formats. I'm with Robert; *don't* do that.

I was worried about the readability of the text format for humans, which is
discussed in the YAML format thread... but I found that the docs say the following.

XML or JSON output contains the same information as the text output format
http://developer.postgresql.org/pgdocs/postgres/sql-explain.html

Obviously I should not hide any information only in the text format.
The new output will be: (in one line)
Shared Blocks: (hit=2 read=1641 written=0) Local Blocks: (hit=0 read=0 written=0) Temp Blocks: (read=1443 written=1443)

There are nodes that don't read or write blocks.

This will be impossible now because we redefined the meaning of the counters
as "accumulated number of I/Os". Even if a node never reads or writes blocks
itself, it might have child nodes that do I/O.
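
As an illustration of why the counters become cumulative, here is a small self-contained sketch of the snapshot-and-diff idea. It is a simplification and not the patch itself: DemoUsage, DemoNode, demo_global and the demo_* functions are hypothetical names that only imitate the roles of pgBufferUsage, bufusage_start and bufusage, and the struct keeps just two counters.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, trimmed-down usage counters (two fields only). */
typedef struct DemoUsage
{
	long		hit;
	long		read;
} DemoUsage;

/* Hypothetical per-node bookkeeping: start snapshot and running total. */
typedef struct DemoNode
{
	DemoUsage	start;			/* global counters when the node started */
	DemoUsage	total;			/* accumulated usage, children included */
} DemoNode;

/* Process-wide counter, bumped by every simulated buffer access. */
DemoUsage	demo_global;

/* Entering a node: remember the current global counters. */
void
demo_start_node(DemoNode *node)
{
	assert(node != NULL);
	node->start = demo_global;
}

/* Leaving a node: total += global - start. */
void
demo_stop_node(DemoNode *node)
{
	assert(node != NULL);
	node->total.hit += demo_global.hit - node->start.hit;
	node->total.read += demo_global.read - node->start.read;
}

/*
 * Simulate a parent node that touches one block itself and then runs a
 * child that hits two blocks and reads five.  The child runs between the
 * parent's start and stop, so its I/O lands in the parent's total too.
 */
void
demo_run_parent_with_child(DemoNode *parent, DemoNode *child)
{
	assert(parent != NULL && child != NULL);
	memset(&demo_global, 0, sizeof(demo_global));
	memset(parent, 0, sizeof(*parent));
	memset(child, 0, sizeof(*child));

	demo_start_node(parent);
	demo_global.hit += 1;		/* parent's own access */

	demo_start_node(child);
	demo_global.hit += 2;		/* child's accesses */
	demo_global.read += 5;
	demo_stop_node(child);

	demo_stop_node(parent);
}
```

In this sketch the child ends up with hit=2 read=5 and the parent with hit=3 read=5, so a parent that does no I/O of its own still reports nonzero counters whenever a child does.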

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Attachments:

explain_buffers_20091208.patch (application/octet-stream)
diff -cprN head/contrib/auto_explain/auto_explain.c work/contrib/auto_explain/auto_explain.c
*** head/contrib/auto_explain/auto_explain.c	2009-08-10 14:46:49.000000000 +0900
--- work/contrib/auto_explain/auto_explain.c	2009-12-08 11:28:09.398644461 +0900
*************** PG_MODULE_MAGIC;
*** 22,27 ****
--- 22,28 ----
  static int	auto_explain_log_min_duration = -1; /* msec or -1 */
  static bool auto_explain_log_analyze = false;
  static bool auto_explain_log_verbose = false;
+ static bool auto_explain_log_buffers = false;
  static int	auto_explain_log_format = EXPLAIN_FORMAT_TEXT;
  static bool auto_explain_log_nested_statements = false;
  
*************** _PG_init(void)
*** 92,97 ****
--- 93,108 ----
  							 NULL,
  							 NULL);
  
+ 	DefineCustomBoolVariable("auto_explain.log_buffers",
+ 							 "Log buffers usage.",
+ 							 NULL,
+ 							 &auto_explain_log_buffers,
+ 							 false,
+ 							 PGC_SUSET,
+ 							 0,
+ 							 NULL,
+ 							 NULL);
+ 
  	DefineCustomEnumVariable("auto_explain.log_format",
  							 "EXPLAIN format to be used for plan logging.",
  							 NULL,
*************** explain_ExecutorEnd(QueryDesc *queryDesc
*** 218,225 ****
  			ExplainState	es;
  
  			ExplainInitState(&es);
! 			es.analyze = (queryDesc->doInstrument && auto_explain_log_analyze);
  			es.verbose = auto_explain_log_verbose;
  			es.format = auto_explain_log_format;
  
  			ExplainPrintPlan(&es, queryDesc);
--- 229,238 ----
  			ExplainState	es;
  
  			ExplainInitState(&es);
! 			es.analyze = (queryDesc->doInstrument &&
! 				(auto_explain_log_analyze || auto_explain_log_buffers));
  			es.verbose = auto_explain_log_verbose;
+ 			es.buffers = (es.analyze && auto_explain_log_buffers);
  			es.format = auto_explain_log_format;
  
  			ExplainPrintPlan(&es, queryDesc);
diff -cprN head/doc/src/sgml/auto-explain.sgml work/doc/src/sgml/auto-explain.sgml
*** head/doc/src/sgml/auto-explain.sgml	2009-08-10 14:46:50.000000000 +0900
--- work/doc/src/sgml/auto-explain.sgml	2009-12-08 11:28:09.399563675 +0900
*************** LOAD 'auto_explain';
*** 104,109 ****
--- 104,128 ----
  
     <varlistentry>
      <term>
+      <varname>auto_explain.log_buffers</varname> (<type>boolean</type>)
+     </term>
+     <indexterm>
+      <primary><varname>auto_explain.log_buffers</> configuration parameter</primary>
+     </indexterm>
+     <listitem>
+      <para>
+       <varname>auto_explain.log_buffers</varname> causes <command>EXPLAIN
+       (ANALYZE, BUFFERS)</> output, rather than just <command>EXPLAIN</> 
+       output, to be printed when an execution plan is logged. This parameter is 
+       off by default. Only superusers can change this setting. Also, this
+       parameter only has effect if <varname>auto_explain.log_analyze</>
+       parameter is set.
+      </para>
+     </listitem>
+    </varlistentry>
+ 
+    <varlistentry>
+     <term>
       <varname>auto_explain.log_format</varname> (<type>enum</type>)
      </term>
      <indexterm>
diff -cprN head/doc/src/sgml/ref/explain.sgml work/doc/src/sgml/ref/explain.sgml
*** head/doc/src/sgml/ref/explain.sgml	2009-08-10 14:46:50.000000000 +0900
--- work/doc/src/sgml/ref/explain.sgml	2009-12-08 11:42:32.165582000 +0900
*************** PostgreSQL documentation
*** 31,37 ****
  
   <refsynopsisdiv>
  <synopsis>
! EXPLAIN [ ( { ANALYZE <replaceable class="parameter">boolean</replaceable> | VERBOSE <replaceable class="parameter">boolean</replaceable> | COSTS <replaceable class="parameter">boolean</replaceable> | FORMAT { TEXT | XML | JSON } } [, ...] ) ] <replaceable class="parameter">statement</replaceable>
  EXPLAIN [ ANALYZE ] [ VERBOSE ] <replaceable class="parameter">statement</replaceable>
  </synopsis>
   </refsynopsisdiv>
--- 31,37 ----
  
   <refsynopsisdiv>
  <synopsis>
! EXPLAIN [ ( { ANALYZE <replaceable class="parameter">boolean</replaceable> | VERBOSE <replaceable class="parameter">boolean</replaceable> | COSTS <replaceable class="parameter">boolean</replaceable> | BUFFERS <replaceable class="parameter">boolean</replaceable> | FORMAT { TEXT | XML | JSON } } [, ...] ) ] <replaceable class="parameter">statement</replaceable>
  EXPLAIN [ ANALYZE ] [ VERBOSE ] <replaceable class="parameter">statement</replaceable>
  </synopsis>
   </refsynopsisdiv>
*************** ROLLBACK;
*** 140,145 ****
--- 140,161 ----
     </varlistentry>
  
     <varlistentry>
+     <term><literal>BUFFERS</literal></term>
+     <listitem>
+      <para>
+       Include information on the buffers. Specifically, include the number of
+       hits/reads/writes of shared blocks and local blocks, and number of reads
+       and writes of temp blocks. Shared blocks contain global tables and
+       indexes, local blocks contain temporary tables and indexes, and temp
+       blocks contain disk blocks used in sort and materialized plans.
+       The value of a parent node contains values of its children.
+       This parameter should be used with <literal>ANALYZE</literal> parameter.
+       This parameter defaults to <literal>FALSE</literal>.
+      </para>
+     </listitem>
+    </varlistentry>
+ 
+    <varlistentry>
      <term><literal>FORMAT</literal></term>
      <listitem>
       <para>
diff -cprN head/src/backend/commands/explain.c work/src/backend/commands/explain.c
*** head/src/backend/commands/explain.c	2009-11-05 07:26:04.000000000 +0900
--- work/src/backend/commands/explain.c	2009-12-08 11:51:06.599538966 +0900
*************** ExplainQuery(ExplainStmt *stmt, const ch
*** 125,130 ****
--- 125,132 ----
  			es.verbose = defGetBoolean(opt);
  		else if (strcmp(opt->defname, "costs") == 0)
  			es.costs = defGetBoolean(opt);
+ 		else if (strcmp(opt->defname, "buffers") == 0)
+ 			es.buffers = defGetBoolean(opt);
  		else if (strcmp(opt->defname, "format") == 0)
  		{
  			char   *p = defGetString(opt);
*************** ExplainQuery(ExplainStmt *stmt, const ch
*** 148,153 ****
--- 150,160 ----
  							opt->defname)));
  	}
  
+ 	if (es.buffers && !es.analyze)
+ 		ereport(ERROR,
+ 			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ 			 errmsg("EXPLAIN option BUFFERS requires ANALYZE")));
+ 
  	/*
  	 * Run parse analysis and rewrite.	Note this also acquires sufficient
  	 * locks on the source table(s).
*************** ExplainNode(Plan *plan, PlanState *plans
*** 1040,1045 ****
--- 1047,1083 ----
  			break;
  	}
  
+ 	/* Show buffer usage */
+ 	if (es->buffers)
+ 	{
+ 		const BufferUsage *usage = &planstate->instrument->bufusage;
+ 
+ 		if (es->format == EXPLAIN_FORMAT_TEXT)
+ 		{
+ 			appendStringInfoSpaces(es->str, es->indent * 2);
+ 			appendStringInfo(es->str,
+ 				"Shared Blocks: (hit=%ld read=%ld written=%ld) "
+ 				"Local Blocks: (hit=%ld read=%ld written=%ld) "
+ 				"Temp Blocks: (read=%ld written=%ld)\n",
+ 				usage->shared_blks_hit, usage->shared_blks_read,
+ 				usage->shared_blks_written,
+ 				usage->local_blks_hit, usage->local_blks_read,
+ 				usage->local_blks_written,
+ 				usage->temp_blks_read, usage->temp_blks_written);
+ 		}
+ 		else
+ 		{
+ 			ExplainPropertyLong("Shared Hit Blocks", usage->shared_blks_hit, es);
+ 			ExplainPropertyLong("Shared Read Blocks", usage->shared_blks_read, es);
+ 			ExplainPropertyLong("Shared Written Blocks", usage->shared_blks_written, es);
+ 			ExplainPropertyLong("Local Hit Blocks", usage->local_blks_hit, es);
+ 			ExplainPropertyLong("Local Read Blocks", usage->local_blks_read, es);
+ 			ExplainPropertyLong("Local Written Blocks", usage->local_blks_written, es);
+ 			ExplainPropertyLong("Temp Read Blocks", usage->temp_blks_read, es);
+ 			ExplainPropertyLong("Temp Written Blocks", usage->temp_blks_written, es);
+ 		}
+ 	}
+ 
  	/* Get ready to display the child plans */
  	haschildren = plan->initPlan ||
  		outerPlan(plan) ||
diff -cprN head/src/backend/executor/instrument.c work/src/backend/executor/instrument.c
*** head/src/backend/executor/instrument.c	2009-01-02 02:23:41.000000000 +0900
--- work/src/backend/executor/instrument.c	2009-12-08 11:53:00.714541295 +0900
***************
*** 17,22 ****
--- 17,26 ----
  
  #include "executor/instrument.h"
  
+ BufferUsage			pgBufferUsage;
+ 
+ static void BufferUsageAccumDiff(BufferUsage *dst,
+ 		const BufferUsage *add, const BufferUsage *sub);
  
  /* Allocate new instrumentation structure(s) */
  Instrumentation *
*************** InstrStartNode(Instrumentation *instr)
*** 37,42 ****
--- 41,49 ----
  		INSTR_TIME_SET_CURRENT(instr->starttime);
  	else
  		elog(DEBUG2, "InstrStartNode called twice in a row");
+ 
+ 	/* initialize buffer usage per plan node */
+ 	instr->bufusage_start = pgBufferUsage;
  }
  
  /* Exit from a plan node */
*************** InstrStopNode(Instrumentation *instr, do
*** 59,64 ****
--- 66,78 ----
  
  	INSTR_TIME_SET_ZERO(instr->starttime);
  
+ 	/*
+ 	 * Adds delta of buffer usage to node's count and resets counter to start
+ 	 * so that the counters are not double counted by parent nodes.
+ 	 */
+ 	BufferUsageAccumDiff(&instr->bufusage,
+ 		&pgBufferUsage, &instr->bufusage_start);
+ 
  	/* Is this the first tuple of this cycle? */
  	if (!instr->running)
  	{
*************** InstrEndLoop(Instrumentation *instr)
*** 95,97 ****
--- 109,127 ----
  	instr->firsttuple = 0;
  	instr->tuplecount = 0;
  }
+ 
+ static void
+ BufferUsageAccumDiff(BufferUsage *dst,
+ 					 const BufferUsage *add,
+ 					 const BufferUsage *sub)
+ {
+ 	/* dst += add - sub */
+ 	dst->shared_blks_hit += add->shared_blks_hit - sub->shared_blks_hit;
+ 	dst->shared_blks_read += add->shared_blks_read - sub->shared_blks_read;
+ 	dst->shared_blks_written += add->shared_blks_written - sub->shared_blks_written;
+ 	dst->local_blks_hit += add->local_blks_hit - sub->local_blks_hit;
+ 	dst->local_blks_read += add->local_blks_read - sub->local_blks_read;
+ 	dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
+ 	dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
+ 	dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
+ }
diff -cprN head/src/backend/storage/buffer/buf_init.c work/src/backend/storage/buffer/buf_init.c
*** head/src/backend/storage/buffer/buf_init.c	2009-01-02 02:23:47.000000000 +0900
--- work/src/backend/storage/buffer/buf_init.c	2009-12-08 11:28:09.400644825 +0900
*************** BufferDesc *BufferDescriptors;
*** 22,37 ****
  char	   *BufferBlocks;
  int32	   *PrivateRefCount;
  
- /* statistics counters */
- long int	ReadBufferCount;
- long int	ReadLocalBufferCount;
- long int	BufferHitCount;
- long int	LocalBufferHitCount;
- long int	BufferFlushCount;
- long int	LocalBufferFlushCount;
- long int	BufFileReadCount;
- long int	BufFileWriteCount;
- 
  
  /*
   * Data Structures:
--- 22,27 ----
diff -cprN head/src/backend/storage/buffer/bufmgr.c work/src/backend/storage/buffer/bufmgr.c
*** head/src/backend/storage/buffer/bufmgr.c	2009-06-11 23:49:01.000000000 +0900
--- work/src/backend/storage/buffer/bufmgr.c	2009-12-08 11:53:00.757645179 +0900
***************
*** 34,39 ****
--- 34,40 ----
  #include <unistd.h>
  
  #include "catalog/catalog.h"
+ #include "executor/instrument.h"
  #include "miscadmin.h"
  #include "pg_trace.h"
  #include "pgstat.h"
*************** ReadBuffer_common(SMgrRelation smgr, boo
*** 300,321 ****
  
  	if (isLocalBuf)
  	{
- 		ReadLocalBufferCount++;
  		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
  		if (found)
! 			LocalBufferHitCount++;
  	}
  	else
  	{
- 		ReadBufferCount++;
- 
  		/*
  		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
  		 * not currently in memory.
  		 */
  		bufHdr = BufferAlloc(smgr, forkNum, blockNum, strategy, &found);
  		if (found)
! 			BufferHitCount++;
  	}
  
  	/* At this point we do NOT hold any locks. */
--- 301,323 ----
  
  	if (isLocalBuf)
  	{
  		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
  		if (found)
! 			pgBufferUsage.local_blks_hit++;
! 		else
! 			pgBufferUsage.local_blks_read++;
  	}
  	else
  	{
  		/*
  		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
  		 * not currently in memory.
  		 */
  		bufHdr = BufferAlloc(smgr, forkNum, blockNum, strategy, &found);
  		if (found)
! 			pgBufferUsage.shared_blks_hit++;
! 		else
! 			pgBufferUsage.shared_blks_read++;
  	}
  
  	/* At this point we do NOT hold any locks. */
*************** SyncOneBuffer(int buf_id, bool skip_rece
*** 1611,1664 ****
  
  
  /*
-  * Return a palloc'd string containing buffer usage statistics.
-  */
- char *
- ShowBufferUsage(void)
- {
- 	StringInfoData str;
- 	float		hitrate;
- 	float		localhitrate;
- 
- 	initStringInfo(&str);
- 
- 	if (ReadBufferCount == 0)
- 		hitrate = 0.0;
- 	else
- 		hitrate = (float) BufferHitCount *100.0 / ReadBufferCount;
- 
- 	if (ReadLocalBufferCount == 0)
- 		localhitrate = 0.0;
- 	else
- 		localhitrate = (float) LocalBufferHitCount *100.0 / ReadLocalBufferCount;
- 
- 	appendStringInfo(&str,
- 	"!\tShared blocks: %10ld read, %10ld written, buffer hit rate = %.2f%%\n",
- 				ReadBufferCount - BufferHitCount, BufferFlushCount, hitrate);
- 	appendStringInfo(&str,
- 	"!\tLocal  blocks: %10ld read, %10ld written, buffer hit rate = %.2f%%\n",
- 					 ReadLocalBufferCount - LocalBufferHitCount, LocalBufferFlushCount, localhitrate);
- 	appendStringInfo(&str,
- 					 "!\tDirect blocks: %10ld read, %10ld written\n",
- 					 BufFileReadCount, BufFileWriteCount);
- 
- 	return str.data;
- }
- 
- void
- ResetBufferUsage(void)
- {
- 	BufferHitCount = 0;
- 	ReadBufferCount = 0;
- 	BufferFlushCount = 0;
- 	LocalBufferHitCount = 0;
- 	ReadLocalBufferCount = 0;
- 	LocalBufferFlushCount = 0;
- 	BufFileReadCount = 0;
- 	BufFileWriteCount = 0;
- }
- 
- /*
   *		AtEOXact_Buffers - clean up at end of transaction.
   *
   *		As of PostgreSQL 8.0, buffer pins should get released by the
--- 1613,1618 ----
*************** FlushBuffer(volatile BufferDesc *buf, SM
*** 1916,1922 ****
  			  (char *) BufHdrGetBlock(buf),
  			  false);
  
! 	BufferFlushCount++;
  
  	/*
  	 * Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
--- 1870,1876 ----
  			  (char *) BufHdrGetBlock(buf),
  			  false);
  
! 	pgBufferUsage.shared_blks_written++;
  
  	/*
  	 * Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
diff -cprN head/src/backend/storage/buffer/localbuf.c work/src/backend/storage/buffer/localbuf.c
*** head/src/backend/storage/buffer/localbuf.c	2009-06-11 23:49:01.000000000 +0900
--- work/src/backend/storage/buffer/localbuf.c	2009-12-08 11:28:09.402644626 +0900
***************
*** 16,21 ****
--- 16,22 ----
  #include "postgres.h"
  
  #include "catalog/catalog.h"
+ #include "executor/instrument.h"
  #include "storage/buf_internals.h"
  #include "storage/bufmgr.h"
  #include "storage/smgr.h"
*************** LocalBufferAlloc(SMgrRelation smgr, Fork
*** 209,215 ****
  		/* Mark not-dirty now in case we error out below */
  		bufHdr->flags &= ~BM_DIRTY;
  
! 		LocalBufferFlushCount++;
  	}
  
  	/*
--- 210,216 ----
  		/* Mark not-dirty now in case we error out below */
  		bufHdr->flags &= ~BM_DIRTY;
  
! 		pgBufferUsage.local_blks_written++;
  	}
  
  	/*
diff -cprN head/src/backend/storage/file/buffile.c work/src/backend/storage/file/buffile.c
*** head/src/backend/storage/file/buffile.c	2009-06-11 23:49:01.000000000 +0900
--- work/src/backend/storage/file/buffile.c	2009-12-08 11:28:09.402644626 +0900
***************
*** 34,39 ****
--- 34,40 ----
  
  #include "postgres.h"
  
+ #include "executor/instrument.h"
  #include "storage/fd.h"
  #include "storage/buffile.h"
  #include "storage/buf_internals.h"
*************** BufFileLoadBuffer(BufFile *file)
*** 240,246 ****
  	file->offsets[file->curFile] += file->nbytes;
  	/* we choose not to advance curOffset here */
  
! 	BufFileReadCount++;
  }
  
  /*
--- 241,247 ----
  	file->offsets[file->curFile] += file->nbytes;
  	/* we choose not to advance curOffset here */
  
! 	pgBufferUsage.temp_blks_read++;
  }
  
  /*
*************** BufFileDumpBuffer(BufFile *file)
*** 304,310 ****
  		file->curOffset += bytestowrite;
  		wpos += bytestowrite;
  
! 		BufFileWriteCount++;
  	}
  	file->dirty = false;
  
--- 305,311 ----
  		file->curOffset += bytestowrite;
  		wpos += bytestowrite;
  
! 		pgBufferUsage.temp_blks_written++;
  	}
  	file->dirty = false;
  
diff -cprN head/src/backend/tcop/postgres.c work/src/backend/tcop/postgres.c
*** head/src/backend/tcop/postgres.c	2009-11-05 07:26:06.000000000 +0900
--- work/src/backend/tcop/postgres.c	2009-12-08 11:28:09.403644636 +0900
*************** ResetUsage(void)
*** 3901,3907 ****
  {
  	getrusage(RUSAGE_SELF, &Save_r);
  	gettimeofday(&Save_t, NULL);
- 	ResetBufferUsage();
  }
  
  void
--- 3901,3906 ----
*************** ShowUsage(const char *title)
*** 3912,3918 ****
  				sys;
  	struct timeval elapse_t;
  	struct rusage r;
- 	char	   *bufusage;
  
  	getrusage(RUSAGE_SELF, &r);
  	gettimeofday(&elapse_t, NULL);
--- 3911,3916 ----
*************** ShowUsage(const char *title)
*** 3986,3995 ****
  					 r.ru_nvcsw, r.ru_nivcsw);
  #endif   /* HAVE_GETRUSAGE */
  
- 	bufusage = ShowBufferUsage();
- 	appendStringInfo(&str, "! buffer usage stats:\n%s", bufusage);
- 	pfree(bufusage);
- 
  	/* remove trailing newline */
  	if (str.data[str.len - 1] == '\n')
  		str.data[--str.len] = '\0';
--- 3984,3989 ----
diff -cprN head/src/include/commands/explain.h work/src/include/commands/explain.h
*** head/src/include/commands/explain.h	2009-08-10 14:46:50.000000000 +0900
--- work/src/include/commands/explain.h	2009-12-08 11:28:09.404674074 +0900
*************** typedef struct ExplainState
*** 29,34 ****
--- 29,35 ----
  	bool		verbose;		/* be verbose */
  	bool		analyze;		/* print actual times */
  	bool		costs;			/* print costs */
+ 	bool		buffers;		/* print buffer usage */
  	ExplainFormat format;		/* output format */
  	/* other states */
  	PlannedStmt *pstmt;			/* top of plan */
diff -cprN head/src/include/executor/instrument.h work/src/include/executor/instrument.h
*** head/src/include/executor/instrument.h	2009-01-02 02:23:59.000000000 +0900
--- work/src/include/executor/instrument.h	2009-12-08 11:52:03.376544810 +0900
***************
*** 16,21 ****
--- 16,33 ----
  #include "portability/instr_time.h"
  
  
+ typedef struct BufferUsage
+ {
+ 	long	shared_blks_hit;		/* # of shared buffer hits */
+ 	long	shared_blks_read;		/* # of shared disk blocks read */
+ 	long	shared_blks_written;	/* # of shared disk blocks written */
+ 	long	local_blks_hit;			/* # of local buffer hits */
+ 	long	local_blks_read;		/* # of local disk blocks read */
+ 	long	local_blks_written;		/* # of local disk blocks written */
+ 	long	temp_blks_read;			/* # of temp blocks read */
+ 	long	temp_blks_written;		/* # of temp blocks written */
+ } BufferUsage;
+ 
  typedef struct Instrumentation
  {
  	/* Info about current plan cycle: */
*************** typedef struct Instrumentation
*** 24,36 ****
--- 36,52 ----
  	instr_time	counter;		/* Accumulated runtime for this node */
  	double		firsttuple;		/* Time for first tuple of this cycle */
  	double		tuplecount;		/* Tuples emitted so far this cycle */
+ 	BufferUsage	bufusage_start;	/* Buffer usage at start */
  	/* Accumulated statistics across all completed cycles: */
  	double		startup;		/* Total startup time (in seconds) */
  	double		total;			/* Total total time (in seconds) */
  	double		ntuples;		/* Total tuples produced */
  	double		nloops;			/* # of run cycles for this node */
+ 	BufferUsage	bufusage;		/* Total buffer usage */
  } Instrumentation;
  
+ extern BufferUsage		pgBufferUsage;
+ 
  extern Instrumentation *InstrAlloc(int n);
  extern void InstrStartNode(Instrumentation *instr);
  extern void InstrStopNode(Instrumentation *instr, double nTuples);
diff -cprN head/src/include/storage/buf_internals.h work/src/include/storage/buf_internals.h
*** head/src/include/storage/buf_internals.h	2009-06-11 23:49:12.000000000 +0900
--- work/src/include/storage/buf_internals.h	2009-12-08 11:28:09.404674074 +0900
*************** extern PGDLLIMPORT BufferDesc *BufferDes
*** 173,188 ****
  /* in localbuf.c */
  extern BufferDesc *LocalBufferDescriptors;
  
- /* event counters in buf_init.c */
- extern long int ReadBufferCount;
- extern long int ReadLocalBufferCount;
- extern long int BufferHitCount;
- extern long int LocalBufferHitCount;
- extern long int BufferFlushCount;
- extern long int LocalBufferFlushCount;
- extern long int BufFileReadCount;
- extern long int BufFileWriteCount;
- 
  
  /*
   * Internal routines: only called by bufmgr
--- 173,178 ----
diff -cprN head/src/include/storage/bufmgr.h work/src/include/storage/bufmgr.h
*** head/src/include/storage/bufmgr.h	2009-06-11 23:49:12.000000000 +0900
--- work/src/include/storage/bufmgr.h	2009-12-08 11:28:09.405636486 +0900
*************** extern Buffer ReleaseAndReadBuffer(Buffe
*** 173,180 ****
  extern void InitBufferPool(void);
  extern void InitBufferPoolAccess(void);
  extern void InitBufferPoolBackend(void);
- extern char *ShowBufferUsage(void);
- extern void ResetBufferUsage(void);
  extern void AtEOXact_Buffers(bool isCommit);
  extern void PrintBufferLeakWarning(Buffer buffer);
  extern void CheckPointBuffers(int flags);
--- 173,178 ----
#32 Robert Haas
robertmhaas@gmail.com
In reply to: Itagaki Takahiro (#31)
Re: EXPLAIN BUFFERS

On Mon, Dec 7, 2009 at 9:58 PM, Itagaki Takahiro
<itagaki.takahiro@oss.ntt.co.jp> wrote:

Here is an updated patch per discussion.

 * Counters are accumulative. They contain I/Os by child nodes.
 * Text format shows all counters.
 * Add "shared_" prefix to variables representing shared buffers/blocks.

Euler Taveira de Oliveira <euler@timbira.com> wrote:

Itagaki Takahiro wrote:

I think the current output is enough and useful in normal use.
We can use XML or JSON format for more details.

I don't think it is a good idea to have different information in different
formats. I'm with Robert; *don't* do that.

I'm afraid of the human-unreadability of the text format, that is discussed
in the YAML format thread. ...but I found we say the following in the docs.

 XML or JSON output contains the same information as the text output format
 http://developer.postgresql.org/pgdocs/postgres/sql-explain.html

Obviously I should not hide any information only in the text format.
The new output will be: (in one line)
 Shared Blocks: (hit=2 read=1641 written=0) Local Blocks: (hit=0 read=0 written=0) Temp Blocks: (read=1443 written=1443)

Hmm, that's a little awkward. I think we could drop some of the punctuation.

Shared Blocks: hit 2 read 1641 wrote 0, Local Blocks: hit 0 read 0
wrote 0, Temp Blocks: read 1443 wrote 1443

...Robert

#33 Greg Smith
greg@2ndquadrant.com
In reply to: Robert Haas (#32)
Re: EXPLAIN BUFFERS

Robert Haas wrote:

On Mon, Dec 7, 2009 at 9:58 PM, Itagaki Takahiro
<itagaki.takahiro@oss.ntt.co.jp> wrote:

Obviously I should not hide any information only in the text format.
The new output will be: (in one line)
Shared Blocks: (hit=2 read=1641 written=0) Local Blocks: (hit=0 read=0 written=0) Temp Blocks: (read=1443 written=1443)

Hmm, that's a little awkward. I think we could drop some of the punctuation.

Shared Blocks: hit 2 read 1641 wrote 0, Local Blocks: hit 0 read 0
wrote 0, Temp Blocks: read 1443 wrote 1443

Having written more things to parse log files that should have been
saved in a better format after the fact than I'd like, I'd say that
replacing the "=" signs used as a delimiter with a space is a step
backwards. That doesn't save any space anyway, and drifts too far from
what one gets out of regular EXPLAIN I think.

I was perfectly happy with proposed text format as being a reasonable
trade-off between *possible* to parse if all you have is the text
format, while still being readable. If you want to compress
horizontally, get rid of "Blocks" after the first usage is the first
thing to do:

(1) Blocks Shared: (hit=2 read=1641 written=0) Local: (hit=0 read=0
written=0) Temp: (read=1443 written=1443)

That's already at about the same length as what you suggested at 105
characters, without losing any useful formatting.

If further compression is required, you could just remove all the
parentheses:

(2) Blocks Shared:hit=2 read=1641 written=0 Local:hit=0 read=0 written=0
Temp:read=1443 written=1443

I don't really like this though. Instead you could abbreviate the rest
of the repeated text and reduce the number of spaces:

(3) Blocks Shared:(hit=2 read=1641 written=0) Local:(h=0 r=0 w=0)
Temp:(r=1443 w=1443)

And now we're at the smallest result yet without any real loss in
readability--I'd argue it's faster to read in fact. This has a good
balance of fitting on a reasonably wide console (the above is down to 82
characters), still being readable to anyone, and being possible to
machine parse in a pinch if all you have are text logs around (split on
: then =). It might be too compressed down for some tastes though.

--
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.com

#34Robert Haas
robertmhaas@gmail.com
In reply to: Greg Smith (#33)
Re: EXPLAIN BUFFERS

On Mon, Dec 7, 2009 at 11:09 PM, Greg Smith <greg@2ndquadrant.com> wrote:

Robert Haas wrote:

On Mon, Dec 7, 2009 at 9:58 PM, Itagaki Takahiro
<itagaki.takahiro@oss.ntt.co.jp> wrote:

Obviously I should not hide any information only in the text format.
The new output will be: (in one line)
 Shared Blocks: (hit=2 read=1641 written=0) Local Blocks: (hit=0 read=0
written=0) Temp Blocks: (read=1443 written=1443)

Hmm, that's a little awkward. I think we could drop some of the
punctuation.

Shared Blocks: hit 2 read 1641 wrote 0, Local Blocks: hit 0 read 0
wrote 0, Temp Blocks: read 1443 wrote 1443

Having written more things to parse log files that should have been saved in
a better format after the fact than I'd like, I'd say that replacing the "="
signs used as a delimiter with a space is a step backwards.  That doesn't
save any space anyway, and drifts too far from what one gets out of regular
EXPLAIN I think.

I was perfectly happy with the proposed text format as being a reasonable
trade-off between *possible* to parse if all you have is the text format,
while still being readable.  If you want to compress horizontally, getting rid
of "Blocks" after the first usage is the first thing to do:

(1) Blocks Shared: (hit=2 read=1641 written=0) Local: (hit=0 read=0
written=0) Temp: (read=1443 written=1443)

That's already at about the same length as what you suggested at 105
characters, without losing any useful formatting.

If further compression is required, you could just remove all the
parentheses:

(2) Blocks Shared:hit=2 read=1641 written=0 Local:hit=0 read=0 written=0
Temp:read=1443 written=1443

I don't really like this though.  Instead you could abbreviate the rest of
the repeated text and reduce the number of spaces:

(3) Blocks Shared:(hit=2 read=1641 written=0) Local:(h=0 r=0 w=0)
Temp:(r=1443 w=1443)

And now we're at the smallest result yet without any real loss in
readability--I'd argue it's faster to read in fact.  This has a good balance
of fitting on a reasonably wide console (the above is down to 82
characters), still being readable to anyone, and being possible to machine
parse in a pinch if all you have are text logs around (split on : then =).
It might be too compressed down for some tastes though.

I could live with the equals signs, but the use of parentheses seems
weird and inconsistent with normal english usage (which permits
parentheses as a means of making parenthetical comments). (You can
also put an entire sentence in parentheses.) But you can't: (do
this).

...Robert

#35Takahiro Itagaki
itagaki.takahiro@oss.ntt.co.jp
In reply to: Robert Haas (#34)
Re: EXPLAIN BUFFERS

Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Dec 7, 2009 at 11:09 PM, Greg Smith <greg@2ndquadrant.com> wrote:

(1) Blocks Shared: (hit=2 read=1641 written=0) Local: (hit=0 read=0
written=0) Temp: (read=1443 written=1443)

I could live with the equals signs, but the use of parentheses seems
weird and inconsistent with normal english usage (which permits
parentheses as a means of making parenthetical comments). (You can
also put an entire sentence in parentheses.) But you can't: (do
this).

+1 for (1) personally, if we can think of the parentheses not as English but
just as symbols. I have another idea that makes it look like the ANALYZE output.

(4) Blocks (shared hit=2 read=1641 ...) (local hit=0 ...) (temp read=0 ...)

Which is the best? I think it's a matter of preference.
(0) 109 characters - Shared Blocks: hit 2 read 1641 wrote 0, ...
(1) 105 characters - Blocks Shared: (hit=2 ...
(2) 96 characters - Blocks Shared:hit=2 ...
(3) 82 characters - Blocks Shared:(hit=2 ...
(4) 82 characters - Blocks (shared hit=2 ...

BTW, I found that text is a *bad* format because it requires so much discussion ;)
We have little choice in the XML, JSON and YAML formats.

Regards,
---
Takahiro Itagaki
NTT Open Source Software Center

#36Greg Smith
greg@2ndquadrant.com
In reply to: Robert Haas (#34)
Re: EXPLAIN BUFFERS

Robert Haas wrote:

I could live with the equals signs, but the use of parentheses seems
weird and inconsistent with normal english usage (which permits
parentheses as a means of making parenthetical comments).

But it is consistent with people seeing:

Seq Scan on foo (cost=0.00..155.00 rows=10000 width=4)

Which seems to be what was being emulated here. I thought that was
pretty reasonable given this is a related feature.

--
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.com

#37Euler Taveira de Oliveira
In reply to: Itagaki Takahiro (#31)
Re: EXPLAIN BUFFERS

Itagaki Takahiro escreveu:

Here is an updated patch per discussion.

* Counters are accumulative. They contain I/Os by child nodes.
* Text format shows all counters.
* Add "shared_" prefix to variables representing shared buffers/blocks.

Nice. Despite the other opinions, I'm satisfied with your text format
sentence. It is: (i) clear (I don't have to think or check the docs to know
what the information is) and (ii) not so verbose (there are nodes that are
even longer than that).

The only thing that needs fixing is:

+     Include information on the buffers. Specifically, include the number of
+     hits/reads/writes of shared blocks and local blocks, and number of reads
+     and writes of temp blocks. Shared blocks contain global tables and
+     indexes, local blocks contain temporary tables and indexes, and temp
+     blocks contain disk blocks used in sort and materialized plans.
+     The value of a parent node contains values of its children.
+     This parameter should be used with <literal>ANALYZE</literal> parameter.
+     This parameter defaults to <literal>FALSE</literal>.

That could be:

Include information on the buffers. Specifically, include the number of shared
blocks hits, reads, and writes, the number of local blocks hits, reads, and
writes, and the number of temp blocks reads and writes. Shared blocks, local
blocks, and temp blocks contain tables and indexes, temporary tables and
temporary indexes, and disk blocks used in sort and materialized plans,
respectively. The number of blocks of an upper-level node includes the blocks
of all its child nodes. It should be used with <literal>ANALYZE</literal>
parameter. This parameter defaults to <literal>FALSE</literal>.

If there are no more objections, I'll flag the patch 'ready for committer' (you :).

--
Euler Taveira de Oliveira
http://www.timbira.com/

#38Robert Haas
robertmhaas@gmail.com
In reply to: Greg Smith (#36)
Re: EXPLAIN BUFFERS

On Dec 8, 2009, at 12:05 AM, Greg Smith <greg@2ndquadrant.com> wrote:

Robert Haas wrote:

I could live with the equals signs, but the use of parentheses seems
weird and inconsistent with normal english usage (which permits
parentheses as a means of making parenthetical comments).

But it is consistent with people seeing:

Seq Scan on foo (cost=0.00..155.00 rows=10000 width=4)

Which seems to be what was being emulated here. I thought that was
pretty reasonable given this is a related feature.

It's not the same at all. The essence of a parenthetical phrase is
that it can be omitted without turning what's left into nonsense - and
in fact we have COSTS OFF, which does just that. Omitting only the
parenthesized portions of the proposed output would not be sensible.

...Robert

#39Greg Smith
greg@2ndquadrant.com
In reply to: Euler Taveira de Oliveira (#37)
Re: EXPLAIN BUFFERS

Euler Taveira de Oliveira wrote:

If there are no more objections, I'll flag the patch 'ready for committer'

I just executed that. Note that there are two bits of subjective
tweaking possible to do with this one when it's committed: slimming
down the width of the display, and Euler's suggestions for rewording.
I linked to both of those messages in the CF app, labeled as notes the
committer might want to consider, but that the patch hasn't been updated
to include yet.

--
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.com

#40Takahiro Itagaki
itagaki.takahiro@oss.ntt.co.jp
In reply to: Greg Smith (#39)
1 attachment(s)
Re: EXPLAIN BUFFERS

Greg Smith <greg@2ndquadrant.com> wrote:

I just executed that. Note that there are two bits of subjective
tweaking possible to do with this one when it's committed: slimming
down the width of the display, and Euler's suggestions for rewording.
I linked to both of those messages in the CF app, labeled as notes the
committer might want to consider, but that the patch hasn't been updated
to include yet.

Sure, I should have merged all of the comments. Patch attached.

- Updated the documentation per Euler's suggestion, but I replaced
the "It" in the second-to-last sentence with "This parameter".
- Updated the output format as follows. I think this format is the most
similar to existing lines. ("actual" by ANALYZE and "Filter:").

Note that the patch also removes buffer counters from log_statement_stats,
but we only have brief descriptions of the options. Our documentation
says nothing about buffer counters, so I didn't modify those lines in the SGML.
http://developer.postgresql.org/pgdocs/postgres/runtime-config-statistics.html#RUNTIME-CONFIG-STATISTICS-MONITOR
IMHO, we could remove those options completely because we can use
EXPLAIN BUFFERS and DTrace probes instead of them.

=# EXPLAIN (BUFFERS, ANALYZE) SELECT *
FROM pgbench_accounts a, pgbench_branches b
WHERE a.bid = b.bid AND abalance > 0 ORDER BY abalance;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Sort (cost=2891.03..2891.04 rows=1 width=461) (actual time=22.494..22.494 rows=0 loops=1)
   Sort Key: a.abalance
   Sort Method: quicksort Memory: 25kB
   Blocks: (shared hit=96 read=1544 written=0) (local hit=0 read=0 written=0) (temp read=0 written=0)
   -> Nested Loop (cost=0.00..2891.02 rows=1 width=461) (actual time=22.488..22.488 rows=0 loops=1)
        Join Filter: (a.bid = b.bid)
        Blocks: (shared hit=96 read=1544 written=0) (local hit=0 read=0 written=0) (temp read=0 written=0)
        -> Seq Scan on pgbench_accounts a (cost=0.00..2890.00 rows=1 width=97) (actual time=22.486..22.486 rows=0 loops=1)
             Filter: (abalance > 0)
             Blocks: (shared hit=96 read=1544 written=0) (local hit=0 read=0 written=0) (temp read=0 written=0)
        -> Seq Scan on pgbench_branches b (cost=0.00..1.01 rows=1 width=364) (never executed)
             Blocks: (shared hit=0 read=0 written=0) (local hit=0 read=0 written=0) (temp read=0 written=0)
 Total runtime: 22.546 ms
(13 rows)

Regards,
---
Takahiro Itagaki
NTT Open Source Software Center

Attachments:

explain_buffers_20091209.patch (application/octet-stream)
diff -cprN head/contrib/auto_explain/auto_explain.c work/contrib/auto_explain/auto_explain.c
*** head/contrib/auto_explain/auto_explain.c	2009-08-10 14:46:49.000000000 +0900
--- work/contrib/auto_explain/auto_explain.c	2009-12-09 13:54:40.292404947 +0900
*************** PG_MODULE_MAGIC;
*** 22,27 ****
--- 22,28 ----
  static int	auto_explain_log_min_duration = -1; /* msec or -1 */
  static bool auto_explain_log_analyze = false;
  static bool auto_explain_log_verbose = false;
+ static bool auto_explain_log_buffers = false;
  static int	auto_explain_log_format = EXPLAIN_FORMAT_TEXT;
  static bool auto_explain_log_nested_statements = false;
  
*************** _PG_init(void)
*** 92,97 ****
--- 93,108 ----
  							 NULL,
  							 NULL);
  
+ 	DefineCustomBoolVariable("auto_explain.log_buffers",
+ 							 "Log buffers usage.",
+ 							 NULL,
+ 							 &auto_explain_log_buffers,
+ 							 false,
+ 							 PGC_SUSET,
+ 							 0,
+ 							 NULL,
+ 							 NULL);
+ 
  	DefineCustomEnumVariable("auto_explain.log_format",
  							 "EXPLAIN format to be used for plan logging.",
  							 NULL,
*************** explain_ExecutorEnd(QueryDesc *queryDesc
*** 218,225 ****
  			ExplainState	es;
  
  			ExplainInitState(&es);
! 			es.analyze = (queryDesc->doInstrument && auto_explain_log_analyze);
  			es.verbose = auto_explain_log_verbose;
  			es.format = auto_explain_log_format;
  
  			ExplainPrintPlan(&es, queryDesc);
--- 229,238 ----
  			ExplainState	es;
  
  			ExplainInitState(&es);
! 			es.analyze = (queryDesc->doInstrument &&
! 				(auto_explain_log_analyze || auto_explain_log_buffers));
  			es.verbose = auto_explain_log_verbose;
+ 			es.buffers = (es.analyze && auto_explain_log_buffers);
  			es.format = auto_explain_log_format;
  
  			ExplainPrintPlan(&es, queryDesc);
diff -cprN head/doc/src/sgml/auto-explain.sgml work/doc/src/sgml/auto-explain.sgml
*** head/doc/src/sgml/auto-explain.sgml	2009-08-10 14:46:50.000000000 +0900
--- work/doc/src/sgml/auto-explain.sgml	2009-12-09 13:54:40.293421709 +0900
*************** LOAD 'auto_explain';
*** 104,109 ****
--- 104,128 ----
  
     <varlistentry>
      <term>
+      <varname>auto_explain.log_buffers</varname> (<type>boolean</type>)
+     </term>
+     <indexterm>
+      <primary><varname>auto_explain.log_buffers</> configuration parameter</primary>
+     </indexterm>
+     <listitem>
+      <para>
+       <varname>auto_explain.log_buffers</varname> causes <command>EXPLAIN
+       (ANALYZE, BUFFERS)</> output, rather than just <command>EXPLAIN</> 
+       output, to be printed when an execution plan is logged. This parameter is 
+       off by default. Only superusers can change this setting. Also, this
+       parameter only has effect if <varname>auto_explain.log_analyze</>
+       parameter is set.
+      </para>
+     </listitem>
+    </varlistentry>
+ 
+    <varlistentry>
+     <term>
       <varname>auto_explain.log_format</varname> (<type>enum</type>)
      </term>
      <indexterm>
diff -cprN head/doc/src/sgml/ref/explain.sgml work/doc/src/sgml/ref/explain.sgml
*** head/doc/src/sgml/ref/explain.sgml	2009-08-10 14:46:50.000000000 +0900
--- work/doc/src/sgml/ref/explain.sgml	2009-12-09 13:59:44.985468000 +0900
*************** PostgreSQL documentation
*** 31,37 ****
  
   <refsynopsisdiv>
  <synopsis>
! EXPLAIN [ ( { ANALYZE <replaceable class="parameter">boolean</replaceable> | VERBOSE <replaceable class="parameter">boolean</replaceable> | COSTS <replaceable class="parameter">boolean</replaceable> | FORMAT { TEXT | XML | JSON } } [, ...] ) ] <replaceable class="parameter">statement</replaceable>
  EXPLAIN [ ANALYZE ] [ VERBOSE ] <replaceable class="parameter">statement</replaceable>
  </synopsis>
   </refsynopsisdiv>
--- 31,37 ----
  
   <refsynopsisdiv>
  <synopsis>
! EXPLAIN [ ( { ANALYZE <replaceable class="parameter">boolean</replaceable> | VERBOSE <replaceable class="parameter">boolean</replaceable> | COSTS <replaceable class="parameter">boolean</replaceable> | BUFFERS <replaceable class="parameter">boolean</replaceable> | FORMAT { TEXT | XML | JSON } } [, ...] ) ] <replaceable class="parameter">statement</replaceable>
  EXPLAIN [ ANALYZE ] [ VERBOSE ] <replaceable class="parameter">statement</replaceable>
  </synopsis>
   </refsynopsisdiv>
*************** ROLLBACK;
*** 140,145 ****
--- 140,162 ----
     </varlistentry>
  
     <varlistentry>
+     <term><literal>BUFFERS</literal></term>
+     <listitem>
+      <para>
+       Include information on the buffers. Specifically, include the number of
+       shared blocks hits, reads, and writes, the number of local blocks hits,
+       reads, and writes, and the number of temp blocks reads and writes.
+       Shared blocks, local blocks, and temp blocks contain tables and indexes,
+       temporary tables and temporary indexes, and disk blocks used in sort and
+       materialized plans, respectively. The number of blocks of an upper-level
+       node includes the blocks of all its child nodes. This parameter should
+       be used with <literal>ANALYZE</literal> parameter. It defaults to
+       <literal>FALSE</literal>.
+      </para>
+     </listitem>
+    </varlistentry>
+ 
+    <varlistentry>
      <term><literal>FORMAT</literal></term>
      <listitem>
       <para>
diff -cprN head/src/backend/commands/explain.c work/src/backend/commands/explain.c
*** head/src/backend/commands/explain.c	2009-11-05 07:26:04.000000000 +0900
--- work/src/backend/commands/explain.c	2009-12-09 14:19:50.048388685 +0900
*************** ExplainQuery(ExplainStmt *stmt, const ch
*** 125,130 ****
--- 125,132 ----
  			es.verbose = defGetBoolean(opt);
  		else if (strcmp(opt->defname, "costs") == 0)
  			es.costs = defGetBoolean(opt);
+ 		else if (strcmp(opt->defname, "buffers") == 0)
+ 			es.buffers = defGetBoolean(opt);
  		else if (strcmp(opt->defname, "format") == 0)
  		{
  			char   *p = defGetString(opt);
*************** ExplainQuery(ExplainStmt *stmt, const ch
*** 148,153 ****
--- 150,160 ----
  							opt->defname)));
  	}
  
+ 	if (es.buffers && !es.analyze)
+ 		ereport(ERROR,
+ 			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ 			 errmsg("EXPLAIN option BUFFERS requires ANALYZE")));
+ 
  	/*
  	 * Run parse analysis and rewrite.	Note this also acquires sufficient
  	 * locks on the source table(s).
*************** ExplainNode(Plan *plan, PlanState *plans
*** 1040,1045 ****
--- 1047,1083 ----
  			break;
  	}
  
+ 	/* Show buffer usage */
+ 	if (es->buffers)
+ 	{
+ 		const BufferUsage *usage = &planstate->instrument->bufusage;
+ 
+ 		if (es->format == EXPLAIN_FORMAT_TEXT)
+ 		{
+ 			appendStringInfoSpaces(es->str, es->indent * 2);
+ 			appendStringInfo(es->str,
+ 				"Blocks: (shared hit=%ld read=%ld written=%ld) "
+ 				"(local hit=%ld read=%ld written=%ld) "
+ 				"(temp read=%ld written=%ld)\n",
+ 				usage->shared_blks_hit, usage->shared_blks_read,
+ 				usage->shared_blks_written,
+ 				usage->local_blks_hit, usage->local_blks_read,
+ 				usage->local_blks_written,
+ 				usage->temp_blks_read, usage->temp_blks_written);
+ 		}
+ 		else
+ 		{
+ 			ExplainPropertyLong("Shared Hit Blocks", usage->shared_blks_hit, es);
+ 			ExplainPropertyLong("Shared Read Blocks", usage->shared_blks_read, es);
+ 			ExplainPropertyLong("Shared Written Blocks", usage->shared_blks_written, es);
+ 			ExplainPropertyLong("Local Hit Blocks", usage->local_blks_hit, es);
+ 			ExplainPropertyLong("Local Read Blocks", usage->local_blks_read, es);
+ 			ExplainPropertyLong("Local Written Blocks", usage->local_blks_written, es);
+ 			ExplainPropertyLong("Temp Read Blocks", usage->temp_blks_read, es);
+ 			ExplainPropertyLong("Temp Written Blocks", usage->temp_blks_written, es);
+ 		}
+ 	}
+ 
  	/* Get ready to display the child plans */
  	haschildren = plan->initPlan ||
  		outerPlan(plan) ||
diff -cprN head/src/backend/executor/instrument.c work/src/backend/executor/instrument.c
*** head/src/backend/executor/instrument.c	2009-01-02 02:23:41.000000000 +0900
--- work/src/backend/executor/instrument.c	2009-12-09 13:54:40.294422165 +0900
***************
*** 17,22 ****
--- 17,26 ----
  
  #include "executor/instrument.h"
  
+ BufferUsage			pgBufferUsage;
+ 
+ static void BufferUsageAccumDiff(BufferUsage *dst,
+ 		const BufferUsage *add, const BufferUsage *sub);
  
  /* Allocate new instrumentation structure(s) */
  Instrumentation *
*************** InstrStartNode(Instrumentation *instr)
*** 37,42 ****
--- 41,49 ----
  		INSTR_TIME_SET_CURRENT(instr->starttime);
  	else
  		elog(DEBUG2, "InstrStartNode called twice in a row");
+ 
+ 	/* initialize buffer usage per plan node */
+ 	instr->bufusage_start = pgBufferUsage;
  }
  
  /* Exit from a plan node */
*************** InstrStopNode(Instrumentation *instr, do
*** 59,64 ****
--- 66,78 ----
  
  	INSTR_TIME_SET_ZERO(instr->starttime);
  
+ 	/*
+ 	 * Adds delta of buffer usage to node's count and resets counter to start
+ 	 * so that the counters are not double counted by parent nodes.
+ 	 */
+ 	BufferUsageAccumDiff(&instr->bufusage,
+ 		&pgBufferUsage, &instr->bufusage_start);
+ 
  	/* Is this the first tuple of this cycle? */
  	if (!instr->running)
  	{
*************** InstrEndLoop(Instrumentation *instr)
*** 95,97 ****
--- 109,127 ----
  	instr->firsttuple = 0;
  	instr->tuplecount = 0;
  }
+ 
+ static void
+ BufferUsageAccumDiff(BufferUsage *dst,
+ 					 const BufferUsage *add,
+ 					 const BufferUsage *sub)
+ {
+ 	/* dst += add - sub */
+ 	dst->shared_blks_hit += add->shared_blks_hit - sub->shared_blks_hit;
+ 	dst->shared_blks_read += add->shared_blks_read - sub->shared_blks_read;
+ 	dst->shared_blks_written += add->shared_blks_written - sub->shared_blks_written;
+ 	dst->local_blks_hit += add->local_blks_hit - sub->local_blks_hit;
+ 	dst->local_blks_read += add->local_blks_read - sub->local_blks_read;
+ 	dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
+ 	dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
+ 	dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
+ }
diff -cprN head/src/backend/storage/buffer/buf_init.c work/src/backend/storage/buffer/buf_init.c
*** head/src/backend/storage/buffer/buf_init.c	2009-01-02 02:23:47.000000000 +0900
--- work/src/backend/storage/buffer/buf_init.c	2009-12-09 13:54:40.295553562 +0900
*************** BufferDesc *BufferDescriptors;
*** 22,37 ****
  char	   *BufferBlocks;
  int32	   *PrivateRefCount;
  
- /* statistics counters */
- long int	ReadBufferCount;
- long int	ReadLocalBufferCount;
- long int	BufferHitCount;
- long int	LocalBufferHitCount;
- long int	BufferFlushCount;
- long int	LocalBufferFlushCount;
- long int	BufFileReadCount;
- long int	BufFileWriteCount;
- 
  
  /*
   * Data Structures:
--- 22,27 ----
diff -cprN head/src/backend/storage/buffer/bufmgr.c work/src/backend/storage/buffer/bufmgr.c
*** head/src/backend/storage/buffer/bufmgr.c	2009-06-11 23:49:01.000000000 +0900
--- work/src/backend/storage/buffer/bufmgr.c	2009-12-09 13:54:40.296426875 +0900
***************
*** 34,39 ****
--- 34,40 ----
  #include <unistd.h>
  
  #include "catalog/catalog.h"
+ #include "executor/instrument.h"
  #include "miscadmin.h"
  #include "pg_trace.h"
  #include "pgstat.h"
*************** ReadBuffer_common(SMgrRelation smgr, boo
*** 300,321 ****
  
  	if (isLocalBuf)
  	{
- 		ReadLocalBufferCount++;
  		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
  		if (found)
! 			LocalBufferHitCount++;
  	}
  	else
  	{
- 		ReadBufferCount++;
- 
  		/*
  		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
  		 * not currently in memory.
  		 */
  		bufHdr = BufferAlloc(smgr, forkNum, blockNum, strategy, &found);
  		if (found)
! 			BufferHitCount++;
  	}
  
  	/* At this point we do NOT hold any locks. */
--- 301,323 ----
  
  	if (isLocalBuf)
  	{
  		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
  		if (found)
! 			pgBufferUsage.local_blks_hit++;
! 		else
! 			pgBufferUsage.local_blks_read++;
  	}
  	else
  	{
  		/*
  		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
  		 * not currently in memory.
  		 */
  		bufHdr = BufferAlloc(smgr, forkNum, blockNum, strategy, &found);
  		if (found)
! 			pgBufferUsage.shared_blks_hit++;
! 		else
! 			pgBufferUsage.shared_blks_read++;
  	}
  
  	/* At this point we do NOT hold any locks. */
*************** SyncOneBuffer(int buf_id, bool skip_rece
*** 1611,1664 ****
  
  
  /*
-  * Return a palloc'd string containing buffer usage statistics.
-  */
- char *
- ShowBufferUsage(void)
- {
- 	StringInfoData str;
- 	float		hitrate;
- 	float		localhitrate;
- 
- 	initStringInfo(&str);
- 
- 	if (ReadBufferCount == 0)
- 		hitrate = 0.0;
- 	else
- 		hitrate = (float) BufferHitCount *100.0 / ReadBufferCount;
- 
- 	if (ReadLocalBufferCount == 0)
- 		localhitrate = 0.0;
- 	else
- 		localhitrate = (float) LocalBufferHitCount *100.0 / ReadLocalBufferCount;
- 
- 	appendStringInfo(&str,
- 	"!\tShared blocks: %10ld read, %10ld written, buffer hit rate = %.2f%%\n",
- 				ReadBufferCount - BufferHitCount, BufferFlushCount, hitrate);
- 	appendStringInfo(&str,
- 	"!\tLocal  blocks: %10ld read, %10ld written, buffer hit rate = %.2f%%\n",
- 					 ReadLocalBufferCount - LocalBufferHitCount, LocalBufferFlushCount, localhitrate);
- 	appendStringInfo(&str,
- 					 "!\tDirect blocks: %10ld read, %10ld written\n",
- 					 BufFileReadCount, BufFileWriteCount);
- 
- 	return str.data;
- }
- 
- void
- ResetBufferUsage(void)
- {
- 	BufferHitCount = 0;
- 	ReadBufferCount = 0;
- 	BufferFlushCount = 0;
- 	LocalBufferHitCount = 0;
- 	ReadLocalBufferCount = 0;
- 	LocalBufferFlushCount = 0;
- 	BufFileReadCount = 0;
- 	BufFileWriteCount = 0;
- }
- 
- /*
   *		AtEOXact_Buffers - clean up at end of transaction.
   *
   *		As of PostgreSQL 8.0, buffer pins should get released by the
--- 1613,1618 ----
*************** FlushBuffer(volatile BufferDesc *buf, SM
*** 1916,1922 ****
  			  (char *) BufHdrGetBlock(buf),
  			  false);
  
! 	BufferFlushCount++;
  
  	/*
  	 * Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
--- 1870,1876 ----
  			  (char *) BufHdrGetBlock(buf),
  			  false);
  
! 	pgBufferUsage.shared_blks_written++;
  
  	/*
  	 * Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
diff -cprN head/src/backend/storage/buffer/localbuf.c work/src/backend/storage/buffer/localbuf.c
*** head/src/backend/storage/buffer/localbuf.c	2009-06-11 23:49:01.000000000 +0900
--- work/src/backend/storage/buffer/localbuf.c	2009-12-09 13:54:40.296426875 +0900
***************
*** 16,21 ****
--- 16,22 ----
  #include "postgres.h"
  
  #include "catalog/catalog.h"
+ #include "executor/instrument.h"
  #include "storage/buf_internals.h"
  #include "storage/bufmgr.h"
  #include "storage/smgr.h"
*************** LocalBufferAlloc(SMgrRelation smgr, Fork
*** 209,215 ****
  		/* Mark not-dirty now in case we error out below */
  		bufHdr->flags &= ~BM_DIRTY;
  
! 		LocalBufferFlushCount++;
  	}
  
  	/*
--- 210,216 ----
  		/* Mark not-dirty now in case we error out below */
  		bufHdr->flags &= ~BM_DIRTY;
  
! 		pgBufferUsage.local_blks_written++;
  	}
  
  	/*
diff -cprN head/src/backend/storage/file/buffile.c work/src/backend/storage/file/buffile.c
*** head/src/backend/storage/file/buffile.c	2009-06-11 23:49:01.000000000 +0900
--- work/src/backend/storage/file/buffile.c	2009-12-09 13:54:40.296426875 +0900
***************
*** 34,39 ****
--- 34,40 ----
  
  #include "postgres.h"
  
+ #include "executor/instrument.h"
  #include "storage/fd.h"
  #include "storage/buffile.h"
  #include "storage/buf_internals.h"
*************** BufFileLoadBuffer(BufFile *file)
*** 240,246 ****
  	file->offsets[file->curFile] += file->nbytes;
  	/* we choose not to advance curOffset here */
  
! 	BufFileReadCount++;
  }
  
  /*
--- 241,247 ----
  	file->offsets[file->curFile] += file->nbytes;
  	/* we choose not to advance curOffset here */
  
! 	pgBufferUsage.temp_blks_read++;
  }
  
  /*
*************** BufFileDumpBuffer(BufFile *file)
*** 304,310 ****
  		file->curOffset += bytestowrite;
  		wpos += bytestowrite;
  
! 		BufFileWriteCount++;
  	}
  	file->dirty = false;
  
--- 305,311 ----
  		file->curOffset += bytestowrite;
  		wpos += bytestowrite;
  
! 		pgBufferUsage.temp_blks_written++;
  	}
  	file->dirty = false;
  
diff -cprN head/src/backend/tcop/postgres.c work/src/backend/tcop/postgres.c
*** head/src/backend/tcop/postgres.c	2009-11-05 07:26:06.000000000 +0900
--- work/src/backend/tcop/postgres.c	2009-12-09 13:54:40.297410363 +0900
*************** ResetUsage(void)
*** 3901,3907 ****
  {
  	getrusage(RUSAGE_SELF, &Save_r);
  	gettimeofday(&Save_t, NULL);
- 	ResetBufferUsage();
  }
  
  void
--- 3901,3906 ----
*************** ShowUsage(const char *title)
*** 3912,3918 ****
  				sys;
  	struct timeval elapse_t;
  	struct rusage r;
- 	char	   *bufusage;
  
  	getrusage(RUSAGE_SELF, &r);
  	gettimeofday(&elapse_t, NULL);
--- 3911,3916 ----
*************** ShowUsage(const char *title)
*** 3986,3995 ****
  					 r.ru_nvcsw, r.ru_nivcsw);
  #endif   /* HAVE_GETRUSAGE */
  
- 	bufusage = ShowBufferUsage();
- 	appendStringInfo(&str, "! buffer usage stats:\n%s", bufusage);
- 	pfree(bufusage);
- 
  	/* remove trailing newline */
  	if (str.data[str.len - 1] == '\n')
  		str.data[--str.len] = '\0';
--- 3984,3989 ----
diff -cprN head/src/include/commands/explain.h work/src/include/commands/explain.h
*** head/src/include/commands/explain.h	2009-08-10 14:46:50.000000000 +0900
--- work/src/include/commands/explain.h	2009-12-09 13:54:40.298406797 +0900
*************** typedef struct ExplainState
*** 29,34 ****
--- 29,35 ----
  	bool		verbose;		/* be verbose */
  	bool		analyze;		/* print actual times */
  	bool		costs;			/* print costs */
+ 	bool		buffers;		/* print buffer usage */
  	ExplainFormat format;		/* output format */
  	/* other states */
  	PlannedStmt *pstmt;			/* top of plan */
diff -cprN head/src/include/executor/instrument.h work/src/include/executor/instrument.h
*** head/src/include/executor/instrument.h	2009-01-02 02:23:59.000000000 +0900
--- work/src/include/executor/instrument.h	2009-12-09 13:54:40.298406797 +0900
***************
*** 16,21 ****
--- 16,33 ----
  #include "portability/instr_time.h"
  
  
+ typedef struct BufferUsage
+ {
+ 	long	shared_blks_hit;		/* # of shared buffer hits */
+ 	long	shared_blks_read;		/* # of shared disk blocks read */
+ 	long	shared_blks_written;	/* # of shared disk blocks written */
+ 	long	local_blks_hit;			/* # of local buffer hits */
+ 	long	local_blks_read;		/* # of local disk blocks read */
+ 	long	local_blks_written;		/* # of local disk blocks written */
+ 	long	temp_blks_read;			/* # of temp blocks read */
+ 	long	temp_blks_written;		/* # of temp blocks written */
+ } BufferUsage;
+ 
  typedef struct Instrumentation
  {
  	/* Info about current plan cycle: */
*************** typedef struct Instrumentation
*** 24,36 ****
--- 36,52 ----
  	instr_time	counter;		/* Accumulated runtime for this node */
  	double		firsttuple;		/* Time for first tuple of this cycle */
  	double		tuplecount;		/* Tuples emitted so far this cycle */
+ 	BufferUsage	bufusage_start;	/* Buffer usage at start */
  	/* Accumulated statistics across all completed cycles: */
  	double		startup;		/* Total startup time (in seconds) */
  	double		total;			/* Total total time (in seconds) */
  	double		ntuples;		/* Total tuples produced */
  	double		nloops;			/* # of run cycles for this node */
+ 	BufferUsage	bufusage;		/* Total buffer usage */
  } Instrumentation;
  
+ extern BufferUsage		pgBufferUsage;
+ 
  extern Instrumentation *InstrAlloc(int n);
  extern void InstrStartNode(Instrumentation *instr);
  extern void InstrStopNode(Instrumentation *instr, double nTuples);
diff -cprN head/src/include/storage/buf_internals.h work/src/include/storage/buf_internals.h
*** head/src/include/storage/buf_internals.h	2009-06-11 23:49:12.000000000 +0900
--- work/src/include/storage/buf_internals.h	2009-12-09 13:54:40.298406797 +0900
*************** extern PGDLLIMPORT BufferDesc *BufferDes
*** 173,188 ****
  /* in localbuf.c */
  extern BufferDesc *LocalBufferDescriptors;
  
- /* event counters in buf_init.c */
- extern long int ReadBufferCount;
- extern long int ReadLocalBufferCount;
- extern long int BufferHitCount;
- extern long int LocalBufferHitCount;
- extern long int BufferFlushCount;
- extern long int LocalBufferFlushCount;
- extern long int BufFileReadCount;
- extern long int BufFileWriteCount;
- 
  
  /*
   * Internal routines: only called by bufmgr
--- 173,178 ----
diff -cprN head/src/include/storage/bufmgr.h work/src/include/storage/bufmgr.h
*** head/src/include/storage/bufmgr.h	2009-06-11 23:49:12.000000000 +0900
--- work/src/include/storage/bufmgr.h	2009-12-09 13:54:40.299427251 +0900
*************** extern Buffer ReleaseAndReadBuffer(Buffe
*** 173,180 ****
  extern void InitBufferPool(void);
  extern void InitBufferPoolAccess(void);
  extern void InitBufferPoolBackend(void);
- extern char *ShowBufferUsage(void);
- extern void ResetBufferUsage(void);
  extern void AtEOXact_Buffers(bool isCommit);
  extern void PrintBufferLeakWarning(Buffer buffer);
  extern void CheckPointBuffers(int flags);
--- 173,178 ----
In reply to: Takahiro Itagaki (#40)
Re: EXPLAIN BUFFERS

Takahiro Itagaki wrote:

Sure, I should have merged all of the comments. Patch attached.

Thanks for your effort. Looks sane to me.

- Updated the output format as follows. I think this format is the most
similar to existing lines. ("actual" by ANALYZE and "Filter:").

If people object to it, we can always change it later.

IMHO, we could remove those options completely because we can use
EXPLAIN BUFFERS and DTrace probes instead of them.

+1. But we need to propose some replacement options.

--
Euler Taveira de Oliveira
http://www.timbira.com/

#42Robert Haas
robertmhaas@gmail.com
In reply to: Takahiro Itagaki (#40)
Re: EXPLAIN BUFFERS

On Wed, Dec 9, 2009 at 12:36 AM, Takahiro Itagaki
<itagaki.takahiro@oss.ntt.co.jp> wrote:

Note that the patch also removes buffer counters from log_statement_stats,
but we only have brief descriptions about the options. Our documentation
says nothing about buffer counters, so I didn't modify those lines in sgml.
http://developer.postgresql.org/pgdocs/postgres/runtime-config-statistics.html#RUNTIME-CONFIG-STATISTICS-MONITOR

I'm not sure whether this is a good idea or not. Let me read the
patch. I'm not sure an EXPLAIN option is really an adequate
substitute for log_statement_stats - the latter will let you get stats
for all of your queries automatically, I believe, and might still be
useful as a quick and dirty tool.

IMHO, we could remove those options completely because we can use
EXPLAIN BUFFERS and DTrace probes instead of them.

We certainly should NOT count on dtrace as a substitute for anything.
It's not available on Windows, or all other platforms either.

=# EXPLAIN (BUFFERS, ANALYZE) SELECT *
     FROM pgbench_accounts a, pgbench_branches b
    WHERE a.bid = b.bid AND abalance > 0 ORDER BY abalance;
                                                         QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=2891.03..2891.04 rows=1 width=461) (actual time=22.494..22.494 rows=0 loops=1)
  Sort Key: a.abalance
  Sort Method:  quicksort  Memory: 25kB
  Blocks: (shared hit=96 read=1544 written=0) (local hit=0 read=0 written=0) (temp read=0 written=0)
  ->  Nested Loop  (cost=0.00..2891.02 rows=1 width=461) (actual time=22.488..22.488 rows=0 loops=1)
        Join Filter: (a.bid = b.bid)
        Blocks: (shared hit=96 read=1544 written=0) (local hit=0 read=0 written=0) (temp read=0 written=0)
        ->  Seq Scan on pgbench_accounts a  (cost=0.00..2890.00 rows=1 width=97) (actual time=22.486..22.486 rows=0 loops=1)
              Filter: (abalance > 0)
              Blocks: (shared hit=96 read=1544 written=0) (local hit=0 read=0 written=0) (temp read=0 written=0)
        ->  Seq Scan on pgbench_branches b  (cost=0.00..1.01 rows=1 width=364) (never executed)
              Blocks: (shared hit=0 read=0 written=0) (local hit=0 read=0 written=0) (temp read=0 written=0)
 Total runtime: 22.546 ms
(13 rows)

I still think this is a bad format. Instead of putting "(" and ")"
around each phrase, can't we just separate them with a "," or ";"?
The filter uses parentheses in a mathematical way, for grouping
related items. Not all filters are surrounded by parentheses
(consider a filter like "WHERE x", x being a boolean column) and some
will have multiple sets, if there are ANDs and ORs in there.

...Robert

In reply to: Robert Haas (#42)
Re: EXPLAIN BUFFERS

Robert Haas wrote:

I'm not sure whether this is a good idea or not. Let me read the
patch. I'm not sure an EXPLAIN option is really an adequate
substitute for log_statement_stats - the latter will let you get stats
for all of your queries automatically, I believe, and might still be
useful as a quick and dirty tool.

Why? If you want this information for all of your queries, you can always set
auto_explain.log_min_duration to 0. But if you're suggesting that we should
maintain log_statement_stats (that is not what I understood from Tom's email [1]),
it's not that difficult to change ShowBufferUsage().
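
For illustration only, here is a rough sketch of what such a change might compute if the hit rate were derived from the patch's new counters. In the old code the rate was hits divided by ReadBufferCount (every request); in the patch, shared_blks_read is incremented only on a miss, so the total becomes hit + read. The struct below is a cut-down stand-in for BufferUsage, and shared_hit_rate() and format_shared_usage() are hypothetical names, not functions from the patch.

```c
#include <stdio.h>

/* Simplified stand-in for the patch's BufferUsage: shared counters only. */
typedef struct BufferUsage
{
	long	shared_blks_hit;		/* # of shared buffer hits */
	long	shared_blks_read;		/* # of shared disk blocks read (misses) */
	long	shared_blks_written;	/* # of shared disk blocks written */
} BufferUsage;

/*
 * Hypothetical helper: percentage of shared buffer requests that were hits.
 * Total requests = hits + reads, because reads are counted only on a miss.
 */
double
shared_hit_rate(const BufferUsage *usage)
{
	long	total = usage->shared_blks_hit + usage->shared_blks_read;

	if (total == 0)
		return 0.0;
	return (double) usage->shared_blks_hit * 100.0 / (double) total;
}

/*
 * Hypothetical formatter that mimics the "Shared blocks" line of the old
 * log_statement_stats output, fed from the new counters.
 */
int
format_shared_usage(char *buf, size_t len, const BufferUsage *usage)
{
	return snprintf(buf, len,
		"!\tShared blocks: %10ld read, %10ld written, buffer hit rate = %.2f%%\n",
		usage->shared_blks_read, usage->shared_blks_written,
		shared_hit_rate(usage));
}
```

With the counters from the plan quoted earlier in the thread (hit=96, read=1544) this gives a hit rate of about 5.85%.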

We certainly should NOT count on dtrace as a substitute for anything.
It's not available on Windows, or all other platforms either.

But we can always count on EXPLAIN BUFFERS. Remember that some monitoring
tasks are _only_ available via DTrace.

I still think this is a bad format. Instead of putting "(" and ")"
around each phrase, can't we just separate them with a "," or ";"?

We already use ( and ) to group things. I don't remember us using , or ; in
any output node. The suggested output is intuitive and similar to the patterns
used by other nodes.

[1]: http://archives.postgresql.org/pgsql-hackers/2009-10/msg00718.php

--
Euler Taveira de Oliveira
http://www.timbira.com/

#44Robert Haas
robertmhaas@gmail.com
In reply to: Euler Taveira de Oliveira (#43)
Re: EXPLAIN BUFFERS

On Thu, Dec 10, 2009 at 9:03 AM, Euler Taveira de Oliveira
<euler@timbira.com> wrote:

Robert Haas wrote:

I'm not sure whether this is a good idea or not.  Let me read the
patch.  I'm not sure an EXPLAIN option is really an adequate
substitute for log_statement_stats - the latter will let you get stats
for all of your queries automatically, I believe, and might still be
useful as a quick and dirty tool.

Why? If you want this information for all of your queries, you can always set
auto_explain.log_min_duration to 0. But if you're suggesting that we should
maintain log_statement_stats (that is not what I understood from Tom's email [1]),
it's not that difficult to change ShowBufferUsage().

Mmm, OK, if Tom thinks we should rip it out, I'm not going to second-guess him.

I still think this is a bad format.  Instead of putting "(" and ")"
around each phrase, can't we just separate them with a "," or ";"?

We already use ( and ) to group things. I don't remember us using , or ; in
any output node. The suggested output is intuitive and similar to other nodes
patterns.

It isn't. In the other cases where we output multiple distinct values
on the same output row - like the sort instrumentation when ANALYZE is
turned on - they are separated with copious amounts of whitespace.
Costs are an exception, but those aren't done the same way as this
either.

The only reason anyone is even thinking that they need parentheses
here is because they're trying to put three separate groups of
buffer-related statistics - a total of 8 values - on the same output
line. If this were split up over three output lines, no one would
even be suggesting parentheses. Maybe that's a saner way to go. If
not, fine, but I don't believe for a minute that the suggested format
is either correct or parallel to what has been done elsewhere.

...Robert

#45Alvaro Herrera
alvherre@commandprompt.com
In reply to: Takahiro Itagaki (#40)
Re: EXPLAIN BUFFERS

Takahiro Itagaki wrote:

=# EXPLAIN (BUFFERS, ANALYZE) SELECT *
FROM pgbench_accounts a, pgbench_branches b
WHERE a.bid = b.bid AND abalance > 0 ORDER BY abalance;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Sort (cost=2891.03..2891.04 rows=1 width=461) (actual time=22.494..22.494 rows=0 loops=1)
Sort Key: a.abalance
Sort Method: quicksort Memory: 25kB
Blocks: (shared hit=96 read=1544 written=0) (local hit=0 read=0 written=0) (temp read=0 written=0)

Maybe I missed part of this discussion, but it seems a bit weird to have
an option named "buffers" turn on a line that specifies numbers of
"blocks". I kept looking for where you were specifying the BLOCKS
option to EXPLAIN in the command ...

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

#46Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#45)
Re: EXPLAIN BUFFERS

Alvaro Herrera <alvherre@commandprompt.com> writes:

Takahiro Itagaki wrote:

Blocks: (shared hit=96 read=1544 written=0) (local hit=0 read=0 written=0) (temp read=0 written=0)

Maybe I missed part of this discussion, but it seems a bit weird to have
an option named "buffers" turn on a line that specifies numbers of
"blocks".

Agreed, and I have to agree also with the people who have been
criticizing the output format. If we were trying to put the block
counts onto the same line as everything else then maybe parentheses
would be helpful, but here they're just clutter.

Perhaps

I/O: shared hit=96 read=1544 written=0 local hit=0 read=0 written=0 temp read=0 written=0

(although this would suggest making the option name "io" which is
probably a poor choice)

I also suggest that dropping out zeroes might help --- a large fraction
of EXPLAIN work is done with SELECTs that aren't ever going to write
anything. Then the above becomes

I/O: shared hit=96 read=1544

which is vastly more readable. You wouldn't want that to happen in
machine-readable formats of course, but I think we no longer care about
whether the text format is easy for programs to parse.

regards, tom lane

#47Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#46)
Re: EXPLAIN BUFFERS

On Thu, Dec 10, 2009 at 10:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alvaro Herrera <alvherre@commandprompt.com> writes:

Takahiro Itagaki wrote:

Blocks: (shared hit=96 read=1544 written=0) (local hit=0 read=0 written=0) (temp read=0 written=0)

Maybe I missed part of this discussion, but it seems a bit weird to have
an option named "buffers" turn on a line that specifies numbers of
"blocks".

Agreed, and I have to agree also with the people who have been
criticizing the output format.  If we were trying to put the block
counts onto the same line as everything else then maybe parentheses
would be helpful, but here they're just clutter.

Perhaps

       I/O: shared hit=96 read=1544 written=0 local hit=0 read=0 written=0 temp read=0 written=0

(although this would suggest making the option name "io" which is
probably a poor choice)

I also suggest that dropping out zeroes might help --- a large fraction
of EXPLAIN work is done with SELECTs that aren't ever going to write
anything.  Then the above becomes

       I/O: shared hit=96 read=1544

which is vastly more readable.  You wouldn't want that to happen in
machine-readable formats of course, but I think we no longer care about
whether the text format is easy for programs to parse.

Oooh, that's a nice idea, though I think you should throw in some
commas if there is, say, both shared and local stuff:

shared hit=96 read=1544, local read=19

I don't think IO is a terrible name for an option but I like BUFFERS
better. I don't think the BUFFERS/BLOCKS confusion is too bad, but
perhaps we could use BUFFERS in both places.

...Robert

#48Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#44)
Re: EXPLAIN BUFFERS

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Dec 10, 2009 at 9:03 AM, Euler Taveira de Oliveira
<euler@timbira.com> wrote:

Why? If you want this information for all of your queries, you can always set
auto_explain.log_min_duration to 0. But if you're suggesting that we should
maintain log_statement_stats (that is not what I understood from Tom's email [1]),
it's not that difficult to change ShowBufferUsage().

Mmm, OK, if Tom thinks we should rip it out, I'm not going to second-guess him.

Feel free to question that. But it's ancient code and I'm not convinced
it still has a reason to live. If you want to investigate the I/O
behavior of a particular query, you'll use EXPLAIN. If you want to get
an idea of the system-wide behavior, you'll use the stats collector.
What use case is left for the backend-local counters?

regards, tom lane

#49Greg Smith
greg@2ndquadrant.com
In reply to: Robert Haas (#47)
Re: EXPLAIN BUFFERS

Robert Haas wrote:

I don't think IO is a terrible name for an option but I like BUFFERS
better. I don't think the BUFFERS/BLOCKS confusion is too bad, but
perhaps we could use BUFFERS in both places.

I don't know how "blocks" got into here in the first place--this concept
is "buffers" just about everywhere else already, right?

--
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.com

#50Robert Haas
robertmhaas@gmail.com
In reply to: Greg Smith (#49)
Re: EXPLAIN BUFFERS

On Thu, Dec 10, 2009 at 10:52 AM, Greg Smith <greg@2ndquadrant.com> wrote:

Robert Haas wrote:

I don't think IO is a terrible name for an option but I like BUFFERS
better.  I don't think the BUFFERS/BLOCKS confusion is too bad, but
perhaps we could use BUFFERS in both places.

I don't know how "blocks" got into here in the first place--this concept is
"buffers" just about everywhere else already, right?

I think we have some places already in the system where we bounce back
and forth between those terms. I expect that's the reason.

...Robert

In reply to: Robert Haas (#44)
Re: EXPLAIN BUFFERS

Robert Haas wrote:

The only reason anyone is even thinking that they need parentheses
here is because they're trying to put three separate groups of
buffer-related statistics - a total of 8 values - on the same output
line. If this were split up over three output lines, no one would
even be suggesting parentheses.

That's the point. I'm afraid 3 new lines per node is too verbose.

--
Euler Taveira de Oliveira
http://www.timbira.com/

#52Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#48)
Re: EXPLAIN BUFFERS

On Thu, Dec 10, 2009 at 10:50 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Dec 10, 2009 at 9:03 AM, Euler Taveira de Oliveira
<euler@timbira.com> wrote:

Why? If you want this information for all of your queries, you can always set
auto_explain.log_min_duration to 0. But if you're suggesting that we should
maintain log_statement_stats (that is not what I understood from Tom's email [1]),
it's not that difficult to change ShowBufferUsage().

Mmm, OK, if Tom thinks we should rip it out, I'm not going to second-guess him.

Feel free to question that.  But it's ancient code and I'm not convinced
it still has a reason to live.  If you want to investigate the I/O
behavior of a particular query, you'll use EXPLAIN.  If you want to get
an idea of the system-wide behavior, you'll use the stats collector.
What use case is left for the backend-local counters?

Beats me. Tracing just your session without having to EXPLAIN each
query (and therefore not get the output rows)? OK, I'm reaching. I
tend to be very conservative about ripping things out that someone
might want unless they're actually getting in the way of doing some
new thing that we want to do - but so are you, and you know the
history of this code better than I do. I'm happy to save my
questioning for a more important issue.

...Robert

#53Takahiro Itagaki
itagaki.takahiro@oss.ntt.co.jp
In reply to: Robert Haas (#50)
1 attachment(s)
Re: EXPLAIN BUFFERS

Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Dec 10, 2009 at 10:52 AM, Greg Smith <greg@2ndquadrant.com> wrote:

I don't think IO is a terrible name for an option but I like BUFFERS
better. I don't think the BUFFERS/BLOCKS confusion is too bad, but
perhaps we could use BUFFERS in both places.

I don't know how "blocks" got into here in the first place--this concept is
"buffers" just about everywhere else already, right?

I think we have some places already in the system where we bounce back
and forth between those terms. I expect that's the reason.

The "blocks" comes from pg_statio_all_tables.heap_blks_{read|hit},
but "buffers" might be easier to understand. One concern is that it
might not be clear whether "buffer read" means a memory access
or a disk read.

Anyway, a revised patch according to the comments is attached.
The new text format is:
Buffers: shared hit=675 read=968, temp read=1443 written=1443
* Zero values are omitted. (Non-text formats could have zero values.)
* Rename "Blocks:" to "Buffers:".
* Remove parentheses and add a comma between shared, local and temp.
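
The three rules above can be sketched in standalone C as follows. This is only an illustration of the text-format logic, not the patch itself: it writes into a plain char buffer instead of a StringInfo, and format_buffers_line(), append_str() and append_counter() are hypothetical helper names.

```c
#include <stdio.h>
#include <string.h>

/* Same fields as the patch's BufferUsage. */
typedef struct BufferUsage
{
	long	shared_blks_hit, shared_blks_read, shared_blks_written;
	long	local_blks_hit, local_blks_read, local_blks_written;
	long	temp_blks_read, temp_blks_written;
} BufferUsage;

/* Append a string to the NUL-terminated buffer, truncating if needed. */
void
append_str(char *buf, size_t len, const char *s)
{
	size_t	used = strlen(buf);

	if (used < len)
		snprintf(buf + used, len - used, "%s", s);
}

/* Append " name=value", but only for positive values (zeroes are omitted). */
void
append_counter(char *buf, size_t len, const char *name, long value)
{
	size_t	used = strlen(buf);

	if (value > 0 && used < len)
		snprintf(buf + used, len - used, " %s=%ld", name, value);
}

/* Returns 1 if a "Buffers:" line was produced, 0 if every counter is zero. */
int
format_buffers_line(char *buf, size_t len, const BufferUsage *u)
{
	int		has_shared = (u->shared_blks_hit > 0 || u->shared_blks_read > 0 ||
						  u->shared_blks_written > 0);
	int		has_local = (u->local_blks_hit > 0 || u->local_blks_read > 0 ||
						 u->local_blks_written > 0);
	int		has_temp = (u->temp_blks_read > 0 || u->temp_blks_written > 0);

	if (len == 0)
		return 0;
	buf[0] = '\0';
	if (!has_shared && !has_local && !has_temp)
		return 0;

	append_str(buf, len, "Buffers:");
	if (has_shared)
	{
		append_str(buf, len, " shared");
		append_counter(buf, len, "hit", u->shared_blks_hit);
		append_counter(buf, len, "read", u->shared_blks_read);
		append_counter(buf, len, "written", u->shared_blks_written);
		if (has_local || has_temp)
			append_str(buf, len, ",");	/* comma between groups */
	}
	if (has_local)
	{
		append_str(buf, len, " local");
		append_counter(buf, len, "hit", u->local_blks_hit);
		append_counter(buf, len, "read", u->local_blks_read);
		append_counter(buf, len, "written", u->local_blks_written);
		if (has_temp)
			append_str(buf, len, ",");
	}
	if (has_temp)
	{
		append_str(buf, len, " temp");
		append_counter(buf, len, "read", u->temp_blks_read);
		append_counter(buf, len, "written", u->temp_blks_written);
	}
	return 1;
}
```

The non-text formats would skip this logic and emit all eight counters unconditionally, zeroes included.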

=# EXPLAIN (BUFFERS, ANALYZE) SELECT *
FROM pgbench_accounts a, pgbench_branches b
WHERE a.bid = b.bid AND abalance >= 0 ORDER BY abalance;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=54151.83..54401.83 rows=100000 width=461) (actual time=92.551..109.646 rows=100000 loops=1)
Sort Key: a.abalance
Sort Method: external sort Disk: 11544kB
Buffers: shared hit=675 read=968, temp read=1443 written=1443
-> Nested Loop (cost=0.00..4141.01 rows=100000 width=461) (actual time=0.048..42.190 rows=100000 loops=1)
Join Filter: (a.bid = b.bid)
Buffers: shared hit=673 read=968
-> Seq Scan on pgbench_branches b (cost=0.00..1.01 rows=1 width=364) (actual time=0.003..0.004 rows=1 loops=1)
Buffers: shared hit=1
-> Seq Scan on pgbench_accounts a (cost=0.00..2890.00 rows=100000 width=97) (actual time=0.038..22.912 rows=100000 loops=1)
Filter: (a.abalance >= 0)
Buffers: shared hit=672 read=968
Total runtime: 116.058 ms
(13 rows)
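
Note that in the plan above the Nested Loop's count (hit=673) is the sum of its two scans (1 + 672): an upper node's numbers include its children. A minimal sketch of how that falls out of the snapshot-and-diff approach used in the attached patch, with simplified stand-ins: NodeInstr, node_start() and node_stop() are hypothetical names playing the roles of Instrumentation, InstrStartNode() and InstrStopNode(), and only two counters are tracked.

```c
/* Cut-down BufferUsage with two of the eight counters. */
typedef struct BufferUsage
{
	long	shared_blks_hit;
	long	shared_blks_read;
} BufferUsage;

/* Hypothetical stand-in for the buffer-related part of Instrumentation. */
typedef struct NodeInstr
{
	BufferUsage bufusage_start;	/* global counters when the node was entered */
	BufferUsage bufusage;		/* accumulated usage for this node */
} NodeInstr;

/* Backend-wide counters, bumped by the buffer manager (as in the patch). */
BufferUsage pgBufferUsage;

/* On entry, remember the current global counters. */
void
node_start(NodeInstr *instr)
{
	instr->bufusage_start = pgBufferUsage;
}

/* On exit, add (current - snapshot) to the node's totals. */
void
node_stop(NodeInstr *instr)
{
	instr->bufusage.shared_blks_hit +=
		pgBufferUsage.shared_blks_hit - instr->bufusage_start.shared_blks_hit;
	instr->bufusage.shared_blks_read +=
		pgBufferUsage.shared_blks_read - instr->bufusage_start.shared_blks_read;
}

/*
 * Simulate a join node that runs two scans inside its own start/stop window.
 * The buffer accesses are invented to match the plan shown in the thread.
 */
void
simulate_join(NodeInstr *join, NodeInstr *branches, NodeInstr *accounts)
{
	node_start(join);

	node_start(branches);
	pgBufferUsage.shared_blks_hit += 1;		/* scan of the small table */
	node_stop(branches);

	node_start(accounts);
	pgBufferUsage.shared_blks_hit += 672;	/* scan of the large table */
	pgBufferUsage.shared_blks_read += 968;
	node_stop(accounts);

	node_stop(join);
}
```

Because the children run between the parent's start and stop calls, whatever they add to pgBufferUsage also shows up in the parent's delta, which is why the counts are inclusive rather than per-node exclusive.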

Regards,
---
Takahiro Itagaki
NTT Open Source Software Center

Attachments:

explain_buffers_20091211.patch (application/octet-stream)
diff -cprN head/contrib/auto_explain/auto_explain.c work/contrib/auto_explain/auto_explain.c
*** head/contrib/auto_explain/auto_explain.c	Fri Dec 11 10:47:06 2009
--- work/contrib/auto_explain/auto_explain.c	Fri Dec 11 11:08:04 2009
*************** PG_MODULE_MAGIC;
*** 22,27 ****
--- 22,28 ----
  static int	auto_explain_log_min_duration = -1; /* msec or -1 */
  static bool auto_explain_log_analyze = false;
  static bool auto_explain_log_verbose = false;
+ static bool auto_explain_log_buffers = false;
  static int	auto_explain_log_format = EXPLAIN_FORMAT_TEXT;
  static bool auto_explain_log_nested_statements = false;
  
*************** _PG_init(void)
*** 93,98 ****
--- 94,109 ----
  							 NULL,
  							 NULL);
  
+ 	DefineCustomBoolVariable("auto_explain.log_buffers",
+ 							 "Log buffers usage.",
+ 							 NULL,
+ 							 &auto_explain_log_buffers,
+ 							 false,
+ 							 PGC_SUSET,
+ 							 0,
+ 							 NULL,
+ 							 NULL);
+ 
  	DefineCustomEnumVariable("auto_explain.log_format",
  							 "EXPLAIN format to be used for plan logging.",
  							 NULL,
*************** explain_ExecutorEnd(QueryDesc *queryDesc
*** 219,226 ****
  			ExplainState	es;
  
  			ExplainInitState(&es);
! 			es.analyze = (queryDesc->doInstrument && auto_explain_log_analyze);
  			es.verbose = auto_explain_log_verbose;
  			es.format = auto_explain_log_format;
  
  			ExplainPrintPlan(&es, queryDesc);
--- 230,239 ----
  			ExplainState	es;
  
  			ExplainInitState(&es);
! 			es.analyze = (queryDesc->doInstrument &&
! 				(auto_explain_log_analyze || auto_explain_log_buffers));
  			es.verbose = auto_explain_log_verbose;
+ 			es.buffers = (es.analyze && auto_explain_log_buffers);
  			es.format = auto_explain_log_format;
  
  			ExplainPrintPlan(&es, queryDesc);
diff -cprN head/doc/src/sgml/auto-explain.sgml work/doc/src/sgml/auto-explain.sgml
*** head/doc/src/sgml/auto-explain.sgml	Fri Dec 11 10:47:06 2009
--- work/doc/src/sgml/auto-explain.sgml	Fri Dec 11 11:08:04 2009
*************** LOAD 'auto_explain';
*** 104,109 ****
--- 104,128 ----
  
     <varlistentry>
      <term>
+      <varname>auto_explain.log_buffers</varname> (<type>boolean</type>)
+     </term>
+     <indexterm>
+      <primary><varname>auto_explain.log_buffers</> configuration parameter</primary>
+     </indexterm>
+     <listitem>
+      <para>
+       <varname>auto_explain.log_buffers</varname> causes <command>EXPLAIN
+       (ANALYZE, BUFFERS)</> output, rather than just <command>EXPLAIN</> 
+       output, to be printed when an execution plan is logged. This parameter is 
+       off by default. Only superusers can change this setting. Also, this
+       parameter only has effect if <varname>auto_explain.log_analyze</>
+       parameter is set.
+      </para>
+     </listitem>
+    </varlistentry>
+ 
+    <varlistentry>
+     <term>
       <varname>auto_explain.log_format</varname> (<type>enum</type>)
      </term>
      <indexterm>
diff -cprN head/doc/src/sgml/ref/explain.sgml work/doc/src/sgml/ref/explain.sgml
*** head/doc/src/sgml/ref/explain.sgml	Fri Dec 11 10:47:06 2009
--- work/doc/src/sgml/ref/explain.sgml	Fri Dec 11 11:34:58 2009
*************** PostgreSQL documentation
*** 31,37 ****
  
   <refsynopsisdiv>
  <synopsis>
! EXPLAIN [ ( { ANALYZE <replaceable class="parameter">boolean</replaceable> | VERBOSE <replaceable class="parameter">boolean</replaceable> | COSTS <replaceable class="parameter">boolean</replaceable> | FORMAT { TEXT | XML | JSON | YAML } } [, ...] ) ] <replaceable class="parameter">statement</replaceable>
  EXPLAIN [ ANALYZE ] [ VERBOSE ] <replaceable class="parameter">statement</replaceable>
  </synopsis>
   </refsynopsisdiv>
--- 31,37 ----
  
   <refsynopsisdiv>
  <synopsis>
! EXPLAIN [ ( { ANALYZE <replaceable class="parameter">boolean</replaceable> | VERBOSE <replaceable class="parameter">boolean</replaceable> | COSTS <replaceable class="parameter">boolean</replaceable> | BUFFERS <replaceable class="parameter">boolean</replaceable> | FORMAT { TEXT | XML | JSON | YAML } } [, ...] ) ] <replaceable class="parameter">statement</replaceable>
  EXPLAIN [ ANALYZE ] [ VERBOSE ] <replaceable class="parameter">statement</replaceable>
  </synopsis>
   </refsynopsisdiv>
*************** ROLLBACK;
*** 140,145 ****
--- 140,162 ----
     </varlistentry>
  
     <varlistentry>
+     <term><literal>BUFFERS</literal></term>
+     <listitem>
+      <para>
+       Include information on the buffers. Specifically, include the number of
+       shared blocks hits, reads, and writes, the number of local blocks hits,
+       reads, and writes, and the number of temp blocks reads and writes.
+       Shared blocks, local blocks, and temp blocks contain tables and indexes,
+       temporary tables and temporary indexes, and disk blocks used in sort and
+       materialized plans, respectively. The number of blocks of an upper-level
+       node includes the blocks of all its child nodes. This parameter should
+       be used with <literal>ANALYZE</literal> parameter. It defaults to
+       <literal>FALSE</literal>.
+      </para>
+     </listitem>
+    </varlistentry>
+ 
+    <varlistentry>
      <term><literal>FORMAT</literal></term>
      <listitem>
       <para>
diff -cprN head/src/backend/commands/explain.c work/src/backend/commands/explain.c
*** head/src/backend/commands/explain.c	Fri Dec 11 10:47:06 2009
--- work/src/backend/commands/explain.c	Fri Dec 11 11:32:15 2009
*************** ExplainQuery(ExplainStmt *stmt, const ch
*** 127,132 ****
--- 127,134 ----
  			es.verbose = defGetBoolean(opt);
  		else if (strcmp(opt->defname, "costs") == 0)
  			es.costs = defGetBoolean(opt);
+ 		else if (strcmp(opt->defname, "buffers") == 0)
+ 			es.buffers = defGetBoolean(opt);
  		else if (strcmp(opt->defname, "format") == 0)
  		{
  			char   *p = defGetString(opt);
*************** ExplainQuery(ExplainStmt *stmt, const ch
*** 152,157 ****
--- 154,164 ----
  							opt->defname)));
  	}
  
+ 	if (es.buffers && !es.analyze)
+ 		ereport(ERROR,
+ 			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ 			 errmsg("EXPLAIN option BUFFERS requires ANALYZE")));
+ 
  	/*
  	 * Run parse analysis and rewrite.	Note this also acquires sufficient
  	 * locks on the source table(s).
*************** ExplainNode(Plan *plan, PlanState *plans
*** 1044,1049 ****
--- 1051,1134 ----
  			break;
  	}
  
+ 	/* Show buffer usage */
+ 	if (es->buffers)
+ 	{
+ 		const BufferUsage *usage = &planstate->instrument->bufusage;
+ 
+ 		if (es->format == EXPLAIN_FORMAT_TEXT)
+ 		{
+ 			bool	has_shared = (usage->shared_blks_hit > 0 ||
+ 								  usage->shared_blks_read > 0 ||
+ 								  usage->shared_blks_written);
+ 			bool	has_local = (usage->local_blks_hit > 0 ||
+ 								 usage->local_blks_read > 0 ||
+ 								 usage->local_blks_written);
+ 			bool	has_temp = (usage->temp_blks_read > 0 ||
+ 								usage->temp_blks_written);
+ 
+ 			/* Show only positive counter values. */
+ 			if (has_shared || has_local || has_temp)
+ 			{
+ 				appendStringInfoSpaces(es->str, es->indent * 2);
+ 				appendStringInfoString(es->str, "Buffers:");
+ 
+ 				if (has_shared)
+ 				{
+ 					appendStringInfoString(es->str, " shared");
+ 					if (usage->shared_blks_hit > 0)
+ 						appendStringInfo(es->str, " hit=%ld",
+ 							usage->shared_blks_hit);
+ 					if (usage->shared_blks_read > 0)
+ 						appendStringInfo(es->str, " read=%ld",
+ 							usage->shared_blks_read);
+ 					if (usage->shared_blks_written > 0)
+ 						appendStringInfo(es->str, " written=%ld",
+ 							usage->shared_blks_written);
+ 					if (has_local || has_temp)
+ 						appendStringInfoChar(es->str, ',');
+ 				}
+ 				if (has_local)
+ 				{
+ 					appendStringInfoString(es->str, " local");
+ 					if (usage->local_blks_hit > 0)
+ 						appendStringInfo(es->str, " hit=%ld",
+ 							usage->local_blks_hit);
+ 					if (usage->local_blks_read > 0)
+ 						appendStringInfo(es->str, " read=%ld",
+ 							usage->local_blks_read);
+ 					if (usage->local_blks_written > 0)
+ 						appendStringInfo(es->str, " written=%ld",
+ 							usage->local_blks_written);
+ 					if (has_temp)
+ 						appendStringInfoChar(es->str, ',');
+ 				}
+ 				if (has_temp)
+ 				{
+ 					appendStringInfoString(es->str, " temp");
+ 					if (usage->temp_blks_read > 0)
+ 						appendStringInfo(es->str, " read=%ld",
+ 							usage->temp_blks_read);
+ 					if (usage->temp_blks_written > 0)
+ 						appendStringInfo(es->str, " written=%ld",
+ 							usage->temp_blks_written);
+ 				}
+ 				appendStringInfoChar(es->str, '\n');
+ 			}
+ 		}
+ 		else
+ 		{
+ 			ExplainPropertyLong("Shared Hit Blocks", usage->shared_blks_hit, es);
+ 			ExplainPropertyLong("Shared Read Blocks", usage->shared_blks_read, es);
+ 			ExplainPropertyLong("Shared Written Blocks", usage->shared_blks_written, es);
+ 			ExplainPropertyLong("Local Hit Blocks", usage->local_blks_hit, es);
+ 			ExplainPropertyLong("Local Read Blocks", usage->local_blks_read, es);
+ 			ExplainPropertyLong("Local Written Blocks", usage->local_blks_written, es);
+ 			ExplainPropertyLong("Temp Read Blocks", usage->temp_blks_read, es);
+ 			ExplainPropertyLong("Temp Written Blocks", usage->temp_blks_written, es);
+ 		}
+ 	}
+ 
  	/* Get ready to display the child plans */
  	haschildren = plan->initPlan ||
  		outerPlan(plan) ||
diff -cprN head/src/backend/executor/instrument.c work/src/backend/executor/instrument.c
*** head/src/backend/executor/instrument.c	Mon Jan  5 00:22:25 2009
--- work/src/backend/executor/instrument.c	Fri Dec 11 11:08:04 2009
***************
*** 17,22 ****
--- 17,26 ----
  
  #include "executor/instrument.h"
  
+ BufferUsage			pgBufferUsage;
+ 
+ static void BufferUsageAccumDiff(BufferUsage *dst,
+ 		const BufferUsage *add, const BufferUsage *sub);
  
  /* Allocate new instrumentation structure(s) */
  Instrumentation *
*************** InstrStartNode(Instrumentation *instr)
*** 37,42 ****
--- 41,49 ----
  		INSTR_TIME_SET_CURRENT(instr->starttime);
  	else
  		elog(DEBUG2, "InstrStartNode called twice in a row");
+ 
+ 	/* initialize buffer usage per plan node */
+ 	instr->bufusage_start = pgBufferUsage;
  }
  
  /* Exit from a plan node */
*************** InstrStopNode(Instrumentation *instr, do
*** 59,64 ****
--- 66,78 ----
  
  	INSTR_TIME_SET_ZERO(instr->starttime);
  
+ 	/*
+ 	 * Adds delta of buffer usage to node's count and resets counter to start
+ 	 * so that the counters are not double counted by parent nodes.
+ 	 */
+ 	BufferUsageAccumDiff(&instr->bufusage,
+ 		&pgBufferUsage, &instr->bufusage_start);
+ 
  	/* Is this the first tuple of this cycle? */
  	if (!instr->running)
  	{
*************** InstrEndLoop(Instrumentation *instr)
*** 95,97 ****
--- 109,127 ----
  	instr->firsttuple = 0;
  	instr->tuplecount = 0;
  }
+ 
+ static void
+ BufferUsageAccumDiff(BufferUsage *dst,
+ 					 const BufferUsage *add,
+ 					 const BufferUsage *sub)
+ {
+ 	/* dst += add - sub */
+ 	dst->shared_blks_hit += add->shared_blks_hit - sub->shared_blks_hit;
+ 	dst->shared_blks_read += add->shared_blks_read - sub->shared_blks_read;
+ 	dst->shared_blks_written += add->shared_blks_written - sub->shared_blks_written;
+ 	dst->local_blks_hit += add->local_blks_hit - sub->local_blks_hit;
+ 	dst->local_blks_read += add->local_blks_read - sub->local_blks_read;
+ 	dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
+ 	dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
+ 	dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
+ }
diff -cprN head/src/backend/storage/buffer/buf_init.c work/src/backend/storage/buffer/buf_init.c
*** head/src/backend/storage/buffer/buf_init.c	Mon Jan  5 00:22:25 2009
--- work/src/backend/storage/buffer/buf_init.c	Fri Dec 11 11:08:04 2009
*************** BufferDesc *BufferDescriptors;
*** 22,37 ****
  char	   *BufferBlocks;
  int32	   *PrivateRefCount;
  
- /* statistics counters */
- long int	ReadBufferCount;
- long int	ReadLocalBufferCount;
- long int	BufferHitCount;
- long int	LocalBufferHitCount;
- long int	BufferFlushCount;
- long int	LocalBufferFlushCount;
- long int	BufFileReadCount;
- long int	BufFileWriteCount;
- 
  
  /*
   * Data Structures:
--- 22,27 ----
diff -cprN head/src/backend/storage/buffer/bufmgr.c work/src/backend/storage/buffer/bufmgr.c
*** head/src/backend/storage/buffer/bufmgr.c	Fri Jun 12 09:52:43 2009
--- work/src/backend/storage/buffer/bufmgr.c	Fri Dec 11 11:08:04 2009
***************
*** 34,39 ****
--- 34,40 ----
  #include <unistd.h>
  
  #include "catalog/catalog.h"
+ #include "executor/instrument.h"
  #include "miscadmin.h"
  #include "pg_trace.h"
  #include "pgstat.h"
*************** ReadBuffer_common(SMgrRelation smgr, boo
*** 300,321 ****
  
  	if (isLocalBuf)
  	{
- 		ReadLocalBufferCount++;
  		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
  		if (found)
! 			LocalBufferHitCount++;
  	}
  	else
  	{
- 		ReadBufferCount++;
- 
  		/*
  		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
  		 * not currently in memory.
  		 */
  		bufHdr = BufferAlloc(smgr, forkNum, blockNum, strategy, &found);
  		if (found)
! 			BufferHitCount++;
  	}
  
  	/* At this point we do NOT hold any locks. */
--- 301,323 ----
  
  	if (isLocalBuf)
  	{
  		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
  		if (found)
! 			pgBufferUsage.local_blks_hit++;
! 		else
! 			pgBufferUsage.local_blks_read++;
  	}
  	else
  	{
  		/*
  		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
  		 * not currently in memory.
  		 */
  		bufHdr = BufferAlloc(smgr, forkNum, blockNum, strategy, &found);
  		if (found)
! 			pgBufferUsage.shared_blks_hit++;
! 		else
! 			pgBufferUsage.shared_blks_read++;
  	}
  
  	/* At this point we do NOT hold any locks. */
*************** SyncOneBuffer(int buf_id, bool skip_rece
*** 1611,1664 ****
  
  
  /*
-  * Return a palloc'd string containing buffer usage statistics.
-  */
- char *
- ShowBufferUsage(void)
- {
- 	StringInfoData str;
- 	float		hitrate;
- 	float		localhitrate;
- 
- 	initStringInfo(&str);
- 
- 	if (ReadBufferCount == 0)
- 		hitrate = 0.0;
- 	else
- 		hitrate = (float) BufferHitCount *100.0 / ReadBufferCount;
- 
- 	if (ReadLocalBufferCount == 0)
- 		localhitrate = 0.0;
- 	else
- 		localhitrate = (float) LocalBufferHitCount *100.0 / ReadLocalBufferCount;
- 
- 	appendStringInfo(&str,
- 	"!\tShared blocks: %10ld read, %10ld written, buffer hit rate = %.2f%%\n",
- 				ReadBufferCount - BufferHitCount, BufferFlushCount, hitrate);
- 	appendStringInfo(&str,
- 	"!\tLocal  blocks: %10ld read, %10ld written, buffer hit rate = %.2f%%\n",
- 					 ReadLocalBufferCount - LocalBufferHitCount, LocalBufferFlushCount, localhitrate);
- 	appendStringInfo(&str,
- 					 "!\tDirect blocks: %10ld read, %10ld written\n",
- 					 BufFileReadCount, BufFileWriteCount);
- 
- 	return str.data;
- }
- 
- void
- ResetBufferUsage(void)
- {
- 	BufferHitCount = 0;
- 	ReadBufferCount = 0;
- 	BufferFlushCount = 0;
- 	LocalBufferHitCount = 0;
- 	ReadLocalBufferCount = 0;
- 	LocalBufferFlushCount = 0;
- 	BufFileReadCount = 0;
- 	BufFileWriteCount = 0;
- }
- 
- /*
   *		AtEOXact_Buffers - clean up at end of transaction.
   *
   *		As of PostgreSQL 8.0, buffer pins should get released by the
--- 1613,1618 ----
*************** FlushBuffer(volatile BufferDesc *buf, SM
*** 1916,1922 ****
  			  (char *) BufHdrGetBlock(buf),
  			  false);
  
! 	BufferFlushCount++;
  
  	/*
  	 * Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
--- 1870,1876 ----
  			  (char *) BufHdrGetBlock(buf),
  			  false);
  
! 	pgBufferUsage.shared_blks_written++;
  
  	/*
  	 * Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
diff -cprN head/src/backend/storage/buffer/localbuf.c work/src/backend/storage/buffer/localbuf.c
*** head/src/backend/storage/buffer/localbuf.c	Fri Jun 12 09:52:43 2009
--- work/src/backend/storage/buffer/localbuf.c	Fri Dec 11 11:08:04 2009
***************
*** 16,21 ****
--- 16,22 ----
  #include "postgres.h"
  
  #include "catalog/catalog.h"
+ #include "executor/instrument.h"
  #include "storage/buf_internals.h"
  #include "storage/bufmgr.h"
  #include "storage/smgr.h"
*************** LocalBufferAlloc(SMgrRelation smgr, Fork
*** 209,215 ****
  		/* Mark not-dirty now in case we error out below */
  		bufHdr->flags &= ~BM_DIRTY;
  
! 		LocalBufferFlushCount++;
  	}
  
  	/*
--- 210,216 ----
  		/* Mark not-dirty now in case we error out below */
  		bufHdr->flags &= ~BM_DIRTY;
  
! 		pgBufferUsage.local_blks_written++;
  	}
  
  	/*
diff -cprN head/src/backend/storage/file/buffile.c work/src/backend/storage/file/buffile.c
*** head/src/backend/storage/file/buffile.c	Fri Jun 12 09:52:43 2009
--- work/src/backend/storage/file/buffile.c	Fri Dec 11 11:08:04 2009
***************
*** 34,39 ****
--- 34,40 ----
  
  #include "postgres.h"
  
+ #include "executor/instrument.h"
  #include "storage/fd.h"
  #include "storage/buffile.h"
  #include "storage/buf_internals.h"
*************** BufFileLoadBuffer(BufFile *file)
*** 240,246 ****
  	file->offsets[file->curFile] += file->nbytes;
  	/* we choose not to advance curOffset here */
  
! 	BufFileReadCount++;
  }
  
  /*
--- 241,247 ----
  	file->offsets[file->curFile] += file->nbytes;
  	/* we choose not to advance curOffset here */
  
! 	pgBufferUsage.temp_blks_read++;
  }
  
  /*
*************** BufFileDumpBuffer(BufFile *file)
*** 304,310 ****
  		file->curOffset += bytestowrite;
  		wpos += bytestowrite;
  
! 		BufFileWriteCount++;
  	}
  	file->dirty = false;
  
--- 305,311 ----
  		file->curOffset += bytestowrite;
  		wpos += bytestowrite;
  
! 		pgBufferUsage.temp_blks_written++;
  	}
  	file->dirty = false;
  
diff -cprN head/src/backend/tcop/postgres.c work/src/backend/tcop/postgres.c
*** head/src/backend/tcop/postgres.c	Fri Nov  6 09:53:35 2009
--- work/src/backend/tcop/postgres.c	Fri Dec 11 11:08:04 2009
*************** ResetUsage(void)
*** 3901,3907 ****
  {
  	getrusage(RUSAGE_SELF, &Save_r);
  	gettimeofday(&Save_t, NULL);
- 	ResetBufferUsage();
  }
  
  void
--- 3901,3906 ----
*************** ShowUsage(const char *title)
*** 3912,3918 ****
  				sys;
  	struct timeval elapse_t;
  	struct rusage r;
- 	char	   *bufusage;
  
  	getrusage(RUSAGE_SELF, &r);
  	gettimeofday(&elapse_t, NULL);
--- 3911,3916 ----
*************** ShowUsage(const char *title)
*** 3986,3995 ****
  					 r.ru_nvcsw, r.ru_nivcsw);
  #endif   /* HAVE_GETRUSAGE */
  
- 	bufusage = ShowBufferUsage();
- 	appendStringInfo(&str, "! buffer usage stats:\n%s", bufusage);
- 	pfree(bufusage);
- 
  	/* remove trailing newline */
  	if (str.data[str.len - 1] == '\n')
  		str.data[--str.len] = '\0';
--- 3984,3989 ----
diff -cprN head/src/include/commands/explain.h work/src/include/commands/explain.h
*** head/src/include/commands/explain.h	Fri Dec 11 10:47:06 2009
--- work/src/include/commands/explain.h	Fri Dec 11 11:08:04 2009
*************** typedef struct ExplainState
*** 30,35 ****
--- 30,36 ----
  	bool		verbose;		/* be verbose */
  	bool		analyze;		/* print actual times */
  	bool		costs;			/* print costs */
+ 	bool		buffers;		/* print buffer usage */
  	ExplainFormat format;		/* output format */
  	/* other states */
  	PlannedStmt *pstmt;			/* top of plan */
diff -cprN head/src/include/executor/instrument.h work/src/include/executor/instrument.h
*** head/src/include/executor/instrument.h	Mon Jan  5 00:22:25 2009
--- work/src/include/executor/instrument.h	Fri Dec 11 11:08:04 2009
***************
*** 16,21 ****
--- 16,33 ----
  #include "portability/instr_time.h"
  
  
+ typedef struct BufferUsage
+ {
+ 	long	shared_blks_hit;		/* # of shared buffer hits */
+ 	long	shared_blks_read;		/* # of shared disk blocks read */
+ 	long	shared_blks_written;	/* # of shared disk blocks written */
+ 	long	local_blks_hit;			/* # of local buffer hits */
+ 	long	local_blks_read;		/* # of local disk blocks read */
+ 	long	local_blks_written;		/* # of local disk blocks written */
+ 	long	temp_blks_read;			/* # of temp blocks read */
+ 	long	temp_blks_written;		/* # of temp blocks written */
+ } BufferUsage;
+ 
  typedef struct Instrumentation
  {
  	/* Info about current plan cycle: */
*************** typedef struct Instrumentation
*** 24,36 ****
--- 36,52 ----
  	instr_time	counter;		/* Accumulated runtime for this node */
  	double		firsttuple;		/* Time for first tuple of this cycle */
  	double		tuplecount;		/* Tuples emitted so far this cycle */
+ 	BufferUsage	bufusage_start;	/* Buffer usage at start */
  	/* Accumulated statistics across all completed cycles: */
  	double		startup;		/* Total startup time (in seconds) */
  	double		total;			/* Total total time (in seconds) */
  	double		ntuples;		/* Total tuples produced */
  	double		nloops;			/* # of run cycles for this node */
+ 	BufferUsage	bufusage;		/* Total buffer usage */
  } Instrumentation;
  
+ extern BufferUsage		pgBufferUsage;
+ 
  extern Instrumentation *InstrAlloc(int n);
  extern void InstrStartNode(Instrumentation *instr);
  extern void InstrStopNode(Instrumentation *instr, double nTuples);
diff -cprN head/src/include/storage/buf_internals.h work/src/include/storage/buf_internals.h
*** head/src/include/storage/buf_internals.h	Fri Jun 12 09:52:43 2009
--- work/src/include/storage/buf_internals.h	Fri Dec 11 11:08:04 2009
*************** extern PGDLLIMPORT BufferDesc *BufferDes
*** 173,188 ****
  /* in localbuf.c */
  extern BufferDesc *LocalBufferDescriptors;
  
- /* event counters in buf_init.c */
- extern long int ReadBufferCount;
- extern long int ReadLocalBufferCount;
- extern long int BufferHitCount;
- extern long int LocalBufferHitCount;
- extern long int BufferFlushCount;
- extern long int LocalBufferFlushCount;
- extern long int BufFileReadCount;
- extern long int BufFileWriteCount;
- 
  
  /*
   * Internal routines: only called by bufmgr
--- 173,178 ----
diff -cprN head/src/include/storage/bufmgr.h work/src/include/storage/bufmgr.h
*** head/src/include/storage/bufmgr.h	Fri Jun 12 09:52:43 2009
--- work/src/include/storage/bufmgr.h	Fri Dec 11 11:08:04 2009
*************** extern Buffer ReleaseAndReadBuffer(Buffe
*** 173,180 ****
  extern void InitBufferPool(void);
  extern void InitBufferPoolAccess(void);
  extern void InitBufferPoolBackend(void);
- extern char *ShowBufferUsage(void);
- extern void ResetBufferUsage(void);
  extern void AtEOXact_Buffers(bool isCommit);
  extern void PrintBufferLeakWarning(Buffer buffer);
  extern void CheckPointBuffers(int flags);
--- 173,178 ----
#54 Robert Haas
robertmhaas@gmail.com
In reply to: Takahiro Itagaki (#53)
1 attachment(s)
Re: EXPLAIN BUFFERS

On Thu, Dec 10, 2009 at 9:35 PM, Takahiro Itagaki
<itagaki.takahiro@oss.ntt.co.jp> wrote:

Anyway, a revised patch according to the comments is attached.
The new text format is:
 Buffers: shared hit=675 read=968, temp read=1443 written=1443
   * Zero values are omitted. (Non-text formats could have zero values.)
   * Rename "Blocks:" to "Buffers:".
   * Remove parentheses and add a comma between shared, local and temp.

I did a bit of copy-editing of your doc changes to make the English a
bit more correct and idiomatic. Slightly revised patch attached for
your consideration. The output format looks really nice (thanks for
bearing with me), and the functionality is great.
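
The text-format rules listed above (zero counters omitted, groups separated by commas) can be sketched outside the backend as follows. This is only an illustrative model, not the patch's code: it writes into a plain char buffer instead of a StringInfo, and the helper names append_text, append_counter and format_buffers_text are assumptions made up for this sketch. The BufferUsage struct mirrors the one the patch adds.

```c
#include <stdio.h>
#include <string.h>

/* Same fields as the BufferUsage struct added by the patch. */
typedef struct BufferUsage
{
	long	shared_blks_hit;
	long	shared_blks_read;
	long	shared_blks_written;
	long	local_blks_hit;
	long	local_blks_read;
	long	local_blks_written;
	long	temp_blks_read;
	long	temp_blks_written;
} BufferUsage;

/* Hypothetical helper: append text to a NUL-terminated buffer, truncating if full. */
static void
append_text(char *buf, size_t buflen, const char *text)
{
	size_t	len = strlen(buf);

	if (len < buflen)
		snprintf(buf + len, buflen - len, "%s", text);
}

/* Hypothetical helper: append " name=value", but only for positive values. */
static void
append_counter(char *buf, size_t buflen, const char *name, long value)
{
	size_t	len = strlen(buf);

	if (value > 0 && len < buflen)
		snprintf(buf + len, buflen - len, " %s=%ld", name, value);
}

/*
 * Build the text-format "Buffers:" line in buf.
 * Returns 1 if a line was produced, 0 if every counter was zero
 * (in which case buf is left empty and nothing would be printed).
 */
int
format_buffers_text(const BufferUsage *u, char *buf, size_t buflen)
{
	int		has_shared = (u->shared_blks_hit > 0 ||
						  u->shared_blks_read > 0 ||
						  u->shared_blks_written > 0);
	int		has_local = (u->local_blks_hit > 0 ||
						 u->local_blks_read > 0 ||
						 u->local_blks_written > 0);
	int		has_temp = (u->temp_blks_read > 0 ||
						u->temp_blks_written > 0);

	if (buflen == 0)
		return 0;
	buf[0] = '\0';
	if (!(has_shared || has_local || has_temp))
		return 0;

	append_text(buf, buflen, "Buffers:");
	if (has_shared)
	{
		append_text(buf, buflen, " shared");
		append_counter(buf, buflen, "hit", u->shared_blks_hit);
		append_counter(buf, buflen, "read", u->shared_blks_read);
		append_counter(buf, buflen, "written", u->shared_blks_written);
		if (has_local || has_temp)
			append_text(buf, buflen, ",");
	}
	if (has_local)
	{
		append_text(buf, buflen, " local");
		append_counter(buf, buflen, "hit", u->local_blks_hit);
		append_counter(buf, buflen, "read", u->local_blks_read);
		append_counter(buf, buflen, "written", u->local_blks_written);
		if (has_temp)
			append_text(buf, buflen, ",");
	}
	if (has_temp)
	{
		append_text(buf, buflen, " temp");
		append_counter(buf, buflen, "read", u->temp_blks_read);
		append_counter(buf, buflen, "written", u->temp_blks_written);
	}
	return 1;
}
```

Called with shared hit=675, shared read=968 and temp read/written=1443, this yields the line quoted above. The non-text formats are not modeled here; per the thread they emit all eight properties, zeros included.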

...Robert

Attachments:

explain_buffers_20091211_rmh.patch (text/x-diff; charset=US-ASCII)
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index 75ac9ca..88c33c0 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -22,6 +22,7 @@ PG_MODULE_MAGIC;
 static int	auto_explain_log_min_duration = -1; /* msec or -1 */
 static bool auto_explain_log_analyze = false;
 static bool auto_explain_log_verbose = false;
+static bool auto_explain_log_buffers = false;
 static int	auto_explain_log_format = EXPLAIN_FORMAT_TEXT;
 static bool auto_explain_log_nested_statements = false;
 
@@ -93,6 +94,16 @@ _PG_init(void)
 							 NULL,
 							 NULL);
 
+	DefineCustomBoolVariable("auto_explain.log_buffers",
+							 "Log buffers usage.",
+							 NULL,
+							 &auto_explain_log_buffers,
+							 false,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL);
+
 	DefineCustomEnumVariable("auto_explain.log_format",
 							 "EXPLAIN format to be used for plan logging.",
 							 NULL,
@@ -219,8 +230,10 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 			ExplainState	es;
 
 			ExplainInitState(&es);
-			es.analyze = (queryDesc->doInstrument && auto_explain_log_analyze);
+			es.analyze = (queryDesc->doInstrument &&
+				(auto_explain_log_analyze || auto_explain_log_buffers));
 			es.verbose = auto_explain_log_verbose;
+			es.buffers = (es.analyze && auto_explain_log_buffers);
 			es.format = auto_explain_log_format;
 
 			ExplainPrintPlan(&es, queryDesc);
diff --git a/doc/src/sgml/auto-explain.sgml b/doc/src/sgml/auto-explain.sgml
index dd3f3fd..1b9d4d9 100644
--- a/doc/src/sgml/auto-explain.sgml
+++ b/doc/src/sgml/auto-explain.sgml
@@ -104,6 +104,25 @@ LOAD 'auto_explain';
 
    <varlistentry>
     <term>
+     <varname>auto_explain.log_buffers</varname> (<type>boolean</type>)
+    </term>
+    <indexterm>
+     <primary><varname>auto_explain.log_buffers</> configuration parameter</primary>
+    </indexterm>
+    <listitem>
+     <para>
+      <varname>auto_explain.log_buffers</varname> causes <command>EXPLAIN
+      (ANALYZE, BUFFERS)</> output, rather than just <command>EXPLAIN</> 
+      output, to be printed when an execution plan is logged. This parameter is 
+      off by default. Only superusers can change this setting. This
+      parameter has no effect unless <varname>auto_explain.log_analyze</>
+      parameter is set.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>
      <varname>auto_explain.log_format</varname> (<type>enum</type>)
     </term>
     <indexterm>
diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 0d03469..c90a028 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -31,7 +31,7 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-EXPLAIN [ ( { ANALYZE <replaceable class="parameter">boolean</replaceable> | VERBOSE <replaceable class="parameter">boolean</replaceable> | COSTS <replaceable class="parameter">boolean</replaceable> | FORMAT { TEXT | XML | JSON | YAML } } [, ...] ) ] <replaceable class="parameter">statement</replaceable>
+EXPLAIN [ ( { ANALYZE <replaceable class="parameter">boolean</replaceable> | VERBOSE <replaceable class="parameter">boolean</replaceable> | COSTS <replaceable class="parameter">boolean</replaceable> | BUFFERS <replaceable class="parameter">boolean</replaceable> | FORMAT { TEXT | XML | JSON | YAML } } [, ...] ) ] <replaceable class="parameter">statement</replaceable>
 EXPLAIN [ ANALYZE ] [ VERBOSE ] <replaceable class="parameter">statement</replaceable>
 </synopsis>
  </refsynopsisdiv>
@@ -140,6 +140,23 @@ ROLLBACK;
    </varlistentry>
 
    <varlistentry>
+    <term><literal>BUFFERS</literal></term>
+    <listitem>
+     <para>
+      Include information on buffer usage. Specifically, include the number of
+      shared blocks hits, reads, and writes, the number of local blocks hits,
+      reads, and writes, and the number of temp blocks reads and writes.
+      Shared blocks, local blocks, and temp blocks contain tables and indexes,
+      temporary tables and temporary indexes, and disk blocks used in sort and
+      materialized plans, respectively. The number of blocks shown for an
+      upper-level node includes those used by all its child nodes. This
+      parameter may only be used with <literal>ANALYZE</literal> parameter.
+      It defaults to <literal>FALSE</literal>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
     <term><literal>FORMAT</literal></term>
     <listitem>
      <para>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 0437ffa..0aba2a7 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -127,6 +127,8 @@ ExplainQuery(ExplainStmt *stmt, const char *queryString,
 			es.verbose = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "costs") == 0)
 			es.costs = defGetBoolean(opt);
+		else if (strcmp(opt->defname, "buffers") == 0)
+			es.buffers = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "format") == 0)
 		{
 			char   *p = defGetString(opt);
@@ -152,6 +154,11 @@ ExplainQuery(ExplainStmt *stmt, const char *queryString,
 							opt->defname)));
 	}
 
+	if (es.buffers && !es.analyze)
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+			 errmsg("EXPLAIN option BUFFERS requires ANALYZE")));
+
 	/*
 	 * Run parse analysis and rewrite.	Note this also acquires sufficient
 	 * locks on the source table(s).
@@ -1044,6 +1051,84 @@ ExplainNode(Plan *plan, PlanState *planstate,
 			break;
 	}
 
+	/* Show buffer usage */
+	if (es->buffers)
+	{
+		const BufferUsage *usage = &planstate->instrument->bufusage;
+
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			bool	has_shared = (usage->shared_blks_hit > 0 ||
+								  usage->shared_blks_read > 0 ||
+								  usage->shared_blks_written);
+			bool	has_local = (usage->local_blks_hit > 0 ||
+								 usage->local_blks_read > 0 ||
+								 usage->local_blks_written);
+			bool	has_temp = (usage->temp_blks_read > 0 ||
+								usage->temp_blks_written);
+
+			/* Show only positive counter values. */
+			if (has_shared || has_local || has_temp)
+			{
+				appendStringInfoSpaces(es->str, es->indent * 2);
+				appendStringInfoString(es->str, "Buffers:");
+
+				if (has_shared)
+				{
+					appendStringInfoString(es->str, " shared");
+					if (usage->shared_blks_hit > 0)
+						appendStringInfo(es->str, " hit=%ld",
+							usage->shared_blks_hit);
+					if (usage->shared_blks_read > 0)
+						appendStringInfo(es->str, " read=%ld",
+							usage->shared_blks_read);
+					if (usage->shared_blks_written > 0)
+						appendStringInfo(es->str, " written=%ld",
+							usage->shared_blks_written);
+					if (has_local || has_temp)
+						appendStringInfoChar(es->str, ',');
+				}
+				if (has_local)
+				{
+					appendStringInfoString(es->str, " local");
+					if (usage->local_blks_hit > 0)
+						appendStringInfo(es->str, " hit=%ld",
+							usage->local_blks_hit);
+					if (usage->local_blks_read > 0)
+						appendStringInfo(es->str, " read=%ld",
+							usage->local_blks_read);
+					if (usage->local_blks_written > 0)
+						appendStringInfo(es->str, " written=%ld",
+							usage->local_blks_written);
+					if (has_temp)
+						appendStringInfoChar(es->str, ',');
+				}
+				if (has_temp)
+				{
+					appendStringInfoString(es->str, " temp");
+					if (usage->temp_blks_read > 0)
+						appendStringInfo(es->str, " read=%ld",
+							usage->temp_blks_read);
+					if (usage->temp_blks_written > 0)
+						appendStringInfo(es->str, " written=%ld",
+							usage->temp_blks_written);
+				}
+				appendStringInfoChar(es->str, '\n');
+			}
+		}
+		else
+		{
+			ExplainPropertyLong("Shared Hit Blocks", usage->shared_blks_hit, es);
+			ExplainPropertyLong("Shared Read Blocks", usage->shared_blks_read, es);
+			ExplainPropertyLong("Shared Written Blocks", usage->shared_blks_written, es);
+			ExplainPropertyLong("Local Hit Blocks", usage->local_blks_hit, es);
+			ExplainPropertyLong("Local Read Blocks", usage->local_blks_read, es);
+			ExplainPropertyLong("Local Written Blocks", usage->local_blks_written, es);
+			ExplainPropertyLong("Temp Read Blocks", usage->temp_blks_read, es);
+			ExplainPropertyLong("Temp Written Blocks", usage->temp_blks_written, es);
+		}
+	}
+
 	/* Get ready to display the child plans */
 	haschildren = plan->initPlan ||
 		outerPlan(plan) ||
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index d8d7039..8690581 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -17,6 +17,10 @@
 
 #include "executor/instrument.h"
 
+BufferUsage			pgBufferUsage;
+
+static void BufferUsageAccumDiff(BufferUsage *dst,
+		const BufferUsage *add, const BufferUsage *sub);
 
 /* Allocate new instrumentation structure(s) */
 Instrumentation *
@@ -37,6 +41,9 @@ InstrStartNode(Instrumentation *instr)
 		INSTR_TIME_SET_CURRENT(instr->starttime);
 	else
 		elog(DEBUG2, "InstrStartNode called twice in a row");
+
+	/* initialize buffer usage per plan node */
+	instr->bufusage_start = pgBufferUsage;
 }
 
 /* Exit from a plan node */
@@ -59,6 +66,13 @@ InstrStopNode(Instrumentation *instr, double nTuples)
 
 	INSTR_TIME_SET_ZERO(instr->starttime);
 
+	/*
+	 * Adds delta of buffer usage to node's count and resets counter to start
+	 * so that the counters are not double counted by parent nodes.
+	 */
+	BufferUsageAccumDiff(&instr->bufusage,
+		&pgBufferUsage, &instr->bufusage_start);
+
 	/* Is this the first tuple of this cycle? */
 	if (!instr->running)
 	{
@@ -95,3 +109,19 @@ InstrEndLoop(Instrumentation *instr)
 	instr->firsttuple = 0;
 	instr->tuplecount = 0;
 }
+
+static void
+BufferUsageAccumDiff(BufferUsage *dst,
+					 const BufferUsage *add,
+					 const BufferUsage *sub)
+{
+	/* dst += add - sub */
+	dst->shared_blks_hit += add->shared_blks_hit - sub->shared_blks_hit;
+	dst->shared_blks_read += add->shared_blks_read - sub->shared_blks_read;
+	dst->shared_blks_written += add->shared_blks_written - sub->shared_blks_written;
+	dst->local_blks_hit += add->local_blks_hit - sub->local_blks_hit;
+	dst->local_blks_read += add->local_blks_read - sub->local_blks_read;
+	dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
+	dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
+	dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
+}
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index e0211f5..cc434c3 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -22,16 +22,6 @@ BufferDesc *BufferDescriptors;
 char	   *BufferBlocks;
 int32	   *PrivateRefCount;
 
-/* statistics counters */
-long int	ReadBufferCount;
-long int	ReadLocalBufferCount;
-long int	BufferHitCount;
-long int	LocalBufferHitCount;
-long int	BufferFlushCount;
-long int	LocalBufferFlushCount;
-long int	BufFileReadCount;
-long int	BufFileWriteCount;
-
 
 /*
  * Data Structures:
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index de28374..276723d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -34,6 +34,7 @@
 #include <unistd.h>
 
 #include "catalog/catalog.h"
+#include "executor/instrument.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "pgstat.h"
@@ -300,22 +301,23 @@ ReadBuffer_common(SMgrRelation smgr, bool isLocalBuf, ForkNumber forkNum,
 
 	if (isLocalBuf)
 	{
-		ReadLocalBufferCount++;
 		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
 		if (found)
-			LocalBufferHitCount++;
+			pgBufferUsage.local_blks_hit++;
+		else
+			pgBufferUsage.local_blks_read++;
 	}
 	else
 	{
-		ReadBufferCount++;
-
 		/*
 		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
 		 * not currently in memory.
 		 */
 		bufHdr = BufferAlloc(smgr, forkNum, blockNum, strategy, &found);
 		if (found)
-			BufferHitCount++;
+			pgBufferUsage.shared_blks_hit++;
+		else
+			pgBufferUsage.shared_blks_read++;
 	}
 
 	/* At this point we do NOT hold any locks. */
@@ -1611,54 +1613,6 @@ SyncOneBuffer(int buf_id, bool skip_recently_used)
 
 
 /*
- * Return a palloc'd string containing buffer usage statistics.
- */
-char *
-ShowBufferUsage(void)
-{
-	StringInfoData str;
-	float		hitrate;
-	float		localhitrate;
-
-	initStringInfo(&str);
-
-	if (ReadBufferCount == 0)
-		hitrate = 0.0;
-	else
-		hitrate = (float) BufferHitCount *100.0 / ReadBufferCount;
-
-	if (ReadLocalBufferCount == 0)
-		localhitrate = 0.0;
-	else
-		localhitrate = (float) LocalBufferHitCount *100.0 / ReadLocalBufferCount;
-
-	appendStringInfo(&str,
-	"!\tShared blocks: %10ld read, %10ld written, buffer hit rate = %.2f%%\n",
-				ReadBufferCount - BufferHitCount, BufferFlushCount, hitrate);
-	appendStringInfo(&str,
-	"!\tLocal  blocks: %10ld read, %10ld written, buffer hit rate = %.2f%%\n",
-					 ReadLocalBufferCount - LocalBufferHitCount, LocalBufferFlushCount, localhitrate);
-	appendStringInfo(&str,
-					 "!\tDirect blocks: %10ld read, %10ld written\n",
-					 BufFileReadCount, BufFileWriteCount);
-
-	return str.data;
-}
-
-void
-ResetBufferUsage(void)
-{
-	BufferHitCount = 0;
-	ReadBufferCount = 0;
-	BufferFlushCount = 0;
-	LocalBufferHitCount = 0;
-	ReadLocalBufferCount = 0;
-	LocalBufferFlushCount = 0;
-	BufFileReadCount = 0;
-	BufFileWriteCount = 0;
-}
-
-/*
  *		AtEOXact_Buffers - clean up at end of transaction.
  *
  *		As of PostgreSQL 8.0, buffer pins should get released by the
@@ -1916,7 +1870,7 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
 			  (char *) BufHdrGetBlock(buf),
 			  false);
 
-	BufferFlushCount++;
+	pgBufferUsage.shared_blks_written++;
 
 	/*
 	 * Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 641f8e9..c7d25b9 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -16,6 +16,7 @@
 #include "postgres.h"
 
 #include "catalog/catalog.h"
+#include "executor/instrument.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
 #include "storage/smgr.h"
@@ -209,7 +210,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 		/* Mark not-dirty now in case we error out below */
 		bufHdr->flags &= ~BM_DIRTY;
 
-		LocalBufferFlushCount++;
+		pgBufferUsage.local_blks_written++;
 	}
 
 	/*
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 436a82b..ebe77ff 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -34,6 +34,7 @@
 
 #include "postgres.h"
 
+#include "executor/instrument.h"
 #include "storage/fd.h"
 #include "storage/buffile.h"
 #include "storage/buf_internals.h"
@@ -240,7 +241,7 @@ BufFileLoadBuffer(BufFile *file)
 	file->offsets[file->curFile] += file->nbytes;
 	/* we choose not to advance curOffset here */
 
-	BufFileReadCount++;
+	pgBufferUsage.temp_blks_read++;
 }
 
 /*
@@ -304,7 +305,7 @@ BufFileDumpBuffer(BufFile *file)
 		file->curOffset += bytestowrite;
 		wpos += bytestowrite;
 
-		BufFileWriteCount++;
+		pgBufferUsage.temp_blks_written++;
 	}
 	file->dirty = false;
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 0672652..c985478 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3901,7 +3901,6 @@ ResetUsage(void)
 {
 	getrusage(RUSAGE_SELF, &Save_r);
 	gettimeofday(&Save_t, NULL);
-	ResetBufferUsage();
 }
 
 void
@@ -3912,7 +3911,6 @@ ShowUsage(const char *title)
 				sys;
 	struct timeval elapse_t;
 	struct rusage r;
-	char	   *bufusage;
 
 	getrusage(RUSAGE_SELF, &r);
 	gettimeofday(&elapse_t, NULL);
@@ -3986,10 +3984,6 @@ ShowUsage(const char *title)
 					 r.ru_nvcsw, r.ru_nivcsw);
 #endif   /* HAVE_GETRUSAGE */
 
-	bufusage = ShowBufferUsage();
-	appendStringInfo(&str, "! buffer usage stats:\n%s", bufusage);
-	pfree(bufusage);
-
 	/* remove trailing newline */
 	if (str.data[str.len - 1] == '\n')
 		str.data[--str.len] = '\0';
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index ab48825..f97c0ee 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -30,6 +30,7 @@ typedef struct ExplainState
 	bool		verbose;		/* be verbose */
 	bool		analyze;		/* print actual times */
 	bool		costs;			/* print costs */
+	bool		buffers;		/* print buffer usage */
 	ExplainFormat format;		/* output format */
 	/* other states */
 	PlannedStmt *pstmt;			/* top of plan */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 9846f6f..4bb6f91 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -16,6 +16,18 @@
 #include "portability/instr_time.h"
 
 
+typedef struct BufferUsage
+{
+	long	shared_blks_hit;		/* # of shared buffer hits */
+	long	shared_blks_read;		/* # of shared disk blocks read */
+	long	shared_blks_written;	/* # of shared disk blocks written */
+	long	local_blks_hit;			/* # of local buffer hits */
+	long	local_blks_read;		/* # of local disk blocks read */
+	long	local_blks_written;		/* # of local disk blocks written */
+	long	temp_blks_read;			/* # of temp blocks read */
+	long	temp_blks_written;		/* # of temp blocks written */
+} BufferUsage;
+
 typedef struct Instrumentation
 {
 	/* Info about current plan cycle: */
@@ -24,13 +36,17 @@ typedef struct Instrumentation
 	instr_time	counter;		/* Accumulated runtime for this node */
 	double		firsttuple;		/* Time for first tuple of this cycle */
 	double		tuplecount;		/* Tuples emitted so far this cycle */
+	BufferUsage	bufusage_start;	/* Buffer usage at start */
 	/* Accumulated statistics across all completed cycles: */
 	double		startup;		/* Total startup time (in seconds) */
 	double		total;			/* Total total time (in seconds) */
 	double		ntuples;		/* Total tuples produced */
 	double		nloops;			/* # of run cycles for this node */
+	BufferUsage	bufusage;		/* Total buffer usage */
 } Instrumentation;
 
+extern BufferUsage		pgBufferUsage;
+
 extern Instrumentation *InstrAlloc(int n);
 extern void InstrStartNode(Instrumentation *instr);
 extern void InstrStopNode(Instrumentation *instr, double nTuples);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 841cf09..42ed94e 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -173,16 +173,6 @@ extern PGDLLIMPORT BufferDesc *BufferDescriptors;
 /* in localbuf.c */
 extern BufferDesc *LocalBufferDescriptors;
 
-/* event counters in buf_init.c */
-extern long int ReadBufferCount;
-extern long int ReadLocalBufferCount;
-extern long int BufferHitCount;
-extern long int LocalBufferHitCount;
-extern long int BufferFlushCount;
-extern long int LocalBufferFlushCount;
-extern long int BufFileReadCount;
-extern long int BufFileWriteCount;
-
 
 /*
  * Internal routines: only called by bufmgr
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d06eb77..f8d685c 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -173,8 +173,6 @@ extern Buffer ReleaseAndReadBuffer(Buffer buffer, Relation relation,
 extern void InitBufferPool(void);
 extern void InitBufferPoolAccess(void);
 extern void InitBufferPoolBackend(void);
-extern char *ShowBufferUsage(void);
-extern void ResetBufferUsage(void);
 extern void AtEOXact_Buffers(bool isCommit);
 extern void PrintBufferLeakWarning(Buffer buffer);
 extern void CheckPointBuffers(int flags);
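
The per-node accounting in the instrument.c part of the patch (snapshot the backend-wide counters in InstrStartNode, add the delta in InstrStopNode) can be modeled in isolation as below. This is a simplified sketch under stated assumptions: only two of the eight counters are kept, and NodeInstr, node_start and node_stop are hypothetical stand-ins for Instrumentation, InstrStartNode and InstrStopNode. It is meant to show why an upper-level node's numbers include those of its children.

```c
/* Reduced stand-in for the patch's BufferUsage (two counters instead of eight). */
typedef struct BufferUsage
{
	long	shared_blks_hit;
	long	shared_blks_read;
} BufferUsage;

/* Hypothetical stand-in for the buffer-related fields of Instrumentation. */
typedef struct NodeInstr
{
	BufferUsage	bufusage_start;	/* snapshot taken when the node is entered */
	BufferUsage	bufusage;		/* usage accumulated over all entries */
} NodeInstr;

/* Backend-wide running counters, playing the role of pgBufferUsage in the patch. */
BufferUsage pgBufferUsage;

/* On entry, remember where the global counters stood. */
void
node_start(NodeInstr *instr)
{
	instr->bufusage_start = pgBufferUsage;
}

/* On exit, accumulate (current - snapshot) into the node's totals. */
void
node_stop(NodeInstr *instr)
{
	instr->bufusage.shared_blks_hit +=
		pgBufferUsage.shared_blks_hit - instr->bufusage_start.shared_blks_hit;
	instr->bufusage.shared_blks_read +=
		pgBufferUsage.shared_blks_read - instr->bufusage_start.shared_blks_read;
}
```

Because a child node runs between its parent's start and stop calls, whatever the child adds to the global counters also falls inside the parent's delta. Activity that happened before a node was entered is excluded by the snapshot, so no global reset is needed.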
#55 Euler Taveira de Oliveira
euler@timbira.com
In reply to: Robert Haas (#54)
Re: EXPLAIN BUFFERS

Robert Haas wrote:

On Thu, Dec 10, 2009 at 9:35 PM, Takahiro Itagaki
<itagaki.takahiro@oss.ntt.co.jp> wrote:

Anyway, a revised patch according to the comments is attached.
The new text format is:
Buffers: shared hit=675 read=968, temp read=1443 written=1443
* Zero values are omitted. (Non-text formats could have zero values.)
* Rename "Blocks:" to "Buffers:".
* Remove parentheses and add a comma between shared, local and temp.

I did a bit of copy-editing of your doc changes to make the English a
bit more correct and idiomatic. Slightly revised patch attached for
your consideration. The output format looks really nice (thanks for
bearing with me), and the functionality is great.

Please document that zero values are omitted in the text format. It seems
intuitive but could be a surprise, because zero values are shown in non-text formats.

--
Euler Taveira de Oliveira
http://www.timbira.com/

#56 Robert Haas
robertmhaas@gmail.com
In reply to: Euler Taveira de Oliveira (#55)
1 attachment(s)
Re: EXPLAIN BUFFERS

On Fri, Dec 11, 2009 at 11:36 AM, Euler Taveira de Oliveira
<euler@timbira.com> wrote:

Robert Haas wrote:

On Thu, Dec 10, 2009 at 9:35 PM, Takahiro Itagaki
<itagaki.takahiro@oss.ntt.co.jp> wrote:

Anyway, a revised patch according to the comments is attached.
The new text format is:
 Buffers: shared hit=675 read=968, temp read=1443 written=1443
   * Zero values are omitted. (Non-text formats could have zero values.)
   * Rename "Blocks:" to "Buffers:".
   * Remove parentheses and add a comma between shared, local and temp.

I did a bit of copy-editing of your doc changes to make the English a
bit more correct and idiomatic.  Slightly revised patch attached for
your consideration.  The output format looks really nice (thanks for
bearing with me), and the functionality is great.

Please document that zero values are omitted in the text format. It seems
intuitive but could be a surprise, because zero values are shown in non-text formats.

OK, done, see attached. I also noticed when looking through this that
the documentation says that auto_explain.log_buffers is ignored unless
auto_explain.log_analyze is set. That is true and seems right to me,
but for some reason explain_ExecutorEnd() had been changed to set
es.analyze if either log_analyze or log_buffers was set. It actually
didn't have any effect unless log_analyze was set, but only because
explain_ExecutorStart doesn't set queryDesc->doInstrument in that
case. So I've reverted that here for clarity.
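
The gating described here can be written out as a small standalone model. It is only a sketch: ExplainFlags and resolve_explain_flags are hypothetical names, and do_instrument stands in for queryDesc->doInstrument, which in the real module is set elsewhere. The two assignments follow the explain_ExecutorEnd() hunk in the attached patch.

```c
#include <stdbool.h>

/* Hypothetical, reduced view of the two ExplainState flags involved. */
typedef struct ExplainFlags
{
	bool	analyze;
	bool	buffers;
} ExplainFlags;

/*
 * Derive the EXPLAIN flags from the auto_explain settings.
 * buffers can only be true when analyze is true, so log_buffers
 * on its own has no effect.
 */
ExplainFlags
resolve_explain_flags(bool do_instrument, bool log_analyze, bool log_buffers)
{
	ExplainFlags f;

	f.analyze = (do_instrument && log_analyze);
	f.buffers = (f.analyze && log_buffers);
	return f;
}
```

With log_buffers set but log_analyze unset, both flags come out false, which matches the documented behavior that auto_explain.log_buffers is ignored unless auto_explain.log_analyze is set.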

...Robert

Attachments:

explain_buffers_20091211_rmh2.patch (text/x-diff; charset=US-ASCII)
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index f0d907d..491f479 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -22,6 +22,7 @@ PG_MODULE_MAGIC;
 static int	auto_explain_log_min_duration = -1; /* msec or -1 */
 static bool auto_explain_log_analyze = false;
 static bool auto_explain_log_verbose = false;
+static bool auto_explain_log_buffers = false;
 static int	auto_explain_log_format = EXPLAIN_FORMAT_TEXT;
 static bool auto_explain_log_nested_statements = false;
 
@@ -93,6 +94,16 @@ _PG_init(void)
 							 NULL,
 							 NULL);
 
+	DefineCustomBoolVariable("auto_explain.log_buffers",
+							 "Log buffers usage.",
+							 NULL,
+							 &auto_explain_log_buffers,
+							 false,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL);
+
 	DefineCustomEnumVariable("auto_explain.log_format",
 							 "EXPLAIN format to be used for plan logging.",
 							 NULL,
@@ -221,6 +232,7 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 			ExplainInitState(&es);
 			es.analyze = (queryDesc->doInstrument && auto_explain_log_analyze);
 			es.verbose = auto_explain_log_verbose;
+			es.buffers = (es.analyze && auto_explain_log_buffers);
 			es.format = auto_explain_log_format;
 
 			ExplainBeginOutput(&es);
diff --git a/doc/src/sgml/auto-explain.sgml b/doc/src/sgml/auto-explain.sgml
index dd3f3fd..1b9d4d9 100644
--- a/doc/src/sgml/auto-explain.sgml
+++ b/doc/src/sgml/auto-explain.sgml
@@ -104,6 +104,25 @@ LOAD 'auto_explain';
 
    <varlistentry>
     <term>
+     <varname>auto_explain.log_buffers</varname> (<type>boolean</type>)
+    </term>
+    <indexterm>
+     <primary><varname>auto_explain.log_buffers</> configuration parameter</primary>
+    </indexterm>
+    <listitem>
+     <para>
+      <varname>auto_explain.log_buffers</varname> causes <command>EXPLAIN
+      (ANALYZE, BUFFERS)</> output, rather than just <command>EXPLAIN</> 
+      output, to be printed when an execution plan is logged. This parameter is 
+      off by default. Only superusers can change this setting. This
+      parameter has no effect unless <varname>auto_explain.log_analyze</>
+      parameter is set.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>
      <varname>auto_explain.log_format</varname> (<type>enum</type>)
     </term>
     <indexterm>
diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 0d03469..6c68afd 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -31,7 +31,7 @@ PostgreSQL documentation
 
  <refsynopsisdiv>
 <synopsis>
-EXPLAIN [ ( { ANALYZE <replaceable class="parameter">boolean</replaceable> | VERBOSE <replaceable class="parameter">boolean</replaceable> | COSTS <replaceable class="parameter">boolean</replaceable> | FORMAT { TEXT | XML | JSON | YAML } } [, ...] ) ] <replaceable class="parameter">statement</replaceable>
+EXPLAIN [ ( { ANALYZE <replaceable class="parameter">boolean</replaceable> | VERBOSE <replaceable class="parameter">boolean</replaceable> | COSTS <replaceable class="parameter">boolean</replaceable> | BUFFERS <replaceable class="parameter">boolean</replaceable> | FORMAT { TEXT | XML | JSON | YAML } } [, ...] ) ] <replaceable class="parameter">statement</replaceable>
 EXPLAIN [ ANALYZE ] [ VERBOSE ] <replaceable class="parameter">statement</replaceable>
 </synopsis>
  </refsynopsisdiv>
@@ -140,6 +140,24 @@ ROLLBACK;
    </varlistentry>
 
    <varlistentry>
+    <term><literal>BUFFERS</literal></term>
+    <listitem>
+     <para>
+      Include information on buffer usage. Specifically, include the number of
+      shared blocks hits, reads, and writes, the number of local blocks hits,
+      reads, and writes, and the number of temp blocks reads and writes.
+      Shared blocks, local blocks, and temp blocks contain tables and indexes,
+      temporary tables and temporary indexes, and disk blocks used in sort and
+      materialized plans, respectively. The number of blocks shown for an
+      upper-level node includes those used by all its child nodes.  In text
+      format, only non-zero values are printed.  This parameter may only be
+      used with <literal>ANALYZE</literal> parameter.  It defaults to
+      <literal>FALSE</literal>.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
     <term><literal>FORMAT</literal></term>
     <listitem>
      <para>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 2067636..03a39c1 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -125,6 +125,8 @@ ExplainQuery(ExplainStmt *stmt, const char *queryString,
 			es.verbose = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "costs") == 0)
 			es.costs = defGetBoolean(opt);
+		else if (strcmp(opt->defname, "buffers") == 0)
+			es.buffers = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "format") == 0)
 		{
 			char   *p = defGetString(opt);
@@ -150,6 +152,11 @@ ExplainQuery(ExplainStmt *stmt, const char *queryString,
 							opt->defname)));
 	}
 
+	if (es.buffers && !es.analyze)
+		ereport(ERROR,
+			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+			 errmsg("EXPLAIN option BUFFERS requires ANALYZE")));
+
 	/*
 	 * Run parse analysis and rewrite.	Note this also acquires sufficient
 	 * locks on the source table(s).
@@ -1042,6 +1049,84 @@ ExplainNode(Plan *plan, PlanState *planstate,
 			break;
 	}
 
+	/* Show buffer usage */
+	if (es->buffers)
+	{
+		const BufferUsage *usage = &planstate->instrument->bufusage;
+
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			bool	has_shared = (usage->shared_blks_hit > 0 ||
+								  usage->shared_blks_read > 0 ||
+								  usage->shared_blks_written);
+			bool	has_local = (usage->local_blks_hit > 0 ||
+								 usage->local_blks_read > 0 ||
+								 usage->local_blks_written);
+			bool	has_temp = (usage->temp_blks_read > 0 ||
+								usage->temp_blks_written);
+
+			/* Show only positive counter values. */
+			if (has_shared || has_local || has_temp)
+			{
+				appendStringInfoSpaces(es->str, es->indent * 2);
+				appendStringInfoString(es->str, "Buffers:");
+
+				if (has_shared)
+				{
+					appendStringInfoString(es->str, " shared");
+					if (usage->shared_blks_hit > 0)
+						appendStringInfo(es->str, " hit=%ld",
+							usage->shared_blks_hit);
+					if (usage->shared_blks_read > 0)
+						appendStringInfo(es->str, " read=%ld",
+							usage->shared_blks_read);
+					if (usage->shared_blks_written > 0)
+						appendStringInfo(es->str, " written=%ld",
+							usage->shared_blks_written);
+					if (has_local || has_temp)
+						appendStringInfoChar(es->str, ',');
+				}
+				if (has_local)
+				{
+					appendStringInfoString(es->str, " local");
+					if (usage->local_blks_hit > 0)
+						appendStringInfo(es->str, " hit=%ld",
+							usage->local_blks_hit);
+					if (usage->local_blks_read > 0)
+						appendStringInfo(es->str, " read=%ld",
+							usage->local_blks_read);
+					if (usage->local_blks_written > 0)
+						appendStringInfo(es->str, " written=%ld",
+							usage->local_blks_written);
+					if (has_temp)
+						appendStringInfoChar(es->str, ',');
+				}
+				if (has_temp)
+				{
+					appendStringInfoString(es->str, " temp");
+					if (usage->temp_blks_read > 0)
+						appendStringInfo(es->str, " read=%ld",
+							usage->temp_blks_read);
+					if (usage->temp_blks_written > 0)
+						appendStringInfo(es->str, " written=%ld",
+							usage->temp_blks_written);
+				}
+				appendStringInfoChar(es->str, '\n');
+			}
+		}
+		else
+		{
+			ExplainPropertyLong("Shared Hit Blocks", usage->shared_blks_hit, es);
+			ExplainPropertyLong("Shared Read Blocks", usage->shared_blks_read, es);
+			ExplainPropertyLong("Shared Written Blocks", usage->shared_blks_written, es);
+			ExplainPropertyLong("Local Hit Blocks", usage->local_blks_hit, es);
+			ExplainPropertyLong("Local Read Blocks", usage->local_blks_read, es);
+			ExplainPropertyLong("Local Written Blocks", usage->local_blks_written, es);
+			ExplainPropertyLong("Temp Read Blocks", usage->temp_blks_read, es);
+			ExplainPropertyLong("Temp Written Blocks", usage->temp_blks_written, es);
+		}
+	}
+
 	/* Get ready to display the child plans */
 	haschildren = plan->initPlan ||
 		outerPlan(plan) ||
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index d8d7039..8690581 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -17,6 +17,10 @@
 
 #include "executor/instrument.h"
 
+BufferUsage			pgBufferUsage;
+
+static void BufferUsageAccumDiff(BufferUsage *dst,
+		const BufferUsage *add, const BufferUsage *sub);
 
 /* Allocate new instrumentation structure(s) */
 Instrumentation *
@@ -37,6 +41,9 @@ InstrStartNode(Instrumentation *instr)
 		INSTR_TIME_SET_CURRENT(instr->starttime);
 	else
 		elog(DEBUG2, "InstrStartNode called twice in a row");
+
+	/* initialize buffer usage per plan node */
+	instr->bufusage_start = pgBufferUsage;
 }
 
 /* Exit from a plan node */
@@ -59,6 +66,13 @@ InstrStopNode(Instrumentation *instr, double nTuples)
 
 	INSTR_TIME_SET_ZERO(instr->starttime);
 
+	/*
+	 * Adds delta of buffer usage to node's count and resets counter to start
+	 * so that the counters are not double counted by parent nodes.
+	 */
+	BufferUsageAccumDiff(&instr->bufusage,
+		&pgBufferUsage, &instr->bufusage_start);
+
 	/* Is this the first tuple of this cycle? */
 	if (!instr->running)
 	{
@@ -95,3 +109,19 @@ InstrEndLoop(Instrumentation *instr)
 	instr->firsttuple = 0;
 	instr->tuplecount = 0;
 }
+
+static void
+BufferUsageAccumDiff(BufferUsage *dst,
+					 const BufferUsage *add,
+					 const BufferUsage *sub)
+{
+	/* dst += add - sub */
+	dst->shared_blks_hit += add->shared_blks_hit - sub->shared_blks_hit;
+	dst->shared_blks_read += add->shared_blks_read - sub->shared_blks_read;
+	dst->shared_blks_written += add->shared_blks_written - sub->shared_blks_written;
+	dst->local_blks_hit += add->local_blks_hit - sub->local_blks_hit;
+	dst->local_blks_read += add->local_blks_read - sub->local_blks_read;
+	dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
+	dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
+	dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
+}
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index e0211f5..cc434c3 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -22,16 +22,6 @@ BufferDesc *BufferDescriptors;
 char	   *BufferBlocks;
 int32	   *PrivateRefCount;
 
-/* statistics counters */
-long int	ReadBufferCount;
-long int	ReadLocalBufferCount;
-long int	BufferHitCount;
-long int	LocalBufferHitCount;
-long int	BufferFlushCount;
-long int	LocalBufferFlushCount;
-long int	BufFileReadCount;
-long int	BufFileWriteCount;
-
 
 /*
  * Data Structures:
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index de28374..276723d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -34,6 +34,7 @@
 #include <unistd.h>
 
 #include "catalog/catalog.h"
+#include "executor/instrument.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "pgstat.h"
@@ -300,22 +301,23 @@ ReadBuffer_common(SMgrRelation smgr, bool isLocalBuf, ForkNumber forkNum,
 
 	if (isLocalBuf)
 	{
-		ReadLocalBufferCount++;
 		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
 		if (found)
-			LocalBufferHitCount++;
+			pgBufferUsage.local_blks_hit++;
+		else
+			pgBufferUsage.local_blks_read++;
 	}
 	else
 	{
-		ReadBufferCount++;
-
 		/*
 		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
 		 * not currently in memory.
 		 */
 		bufHdr = BufferAlloc(smgr, forkNum, blockNum, strategy, &found);
 		if (found)
-			BufferHitCount++;
+			pgBufferUsage.shared_blks_hit++;
+		else
+			pgBufferUsage.shared_blks_read++;
 	}
 
 	/* At this point we do NOT hold any locks. */
@@ -1611,54 +1613,6 @@ SyncOneBuffer(int buf_id, bool skip_recently_used)
 
 
 /*
- * Return a palloc'd string containing buffer usage statistics.
- */
-char *
-ShowBufferUsage(void)
-{
-	StringInfoData str;
-	float		hitrate;
-	float		localhitrate;
-
-	initStringInfo(&str);
-
-	if (ReadBufferCount == 0)
-		hitrate = 0.0;
-	else
-		hitrate = (float) BufferHitCount *100.0 / ReadBufferCount;
-
-	if (ReadLocalBufferCount == 0)
-		localhitrate = 0.0;
-	else
-		localhitrate = (float) LocalBufferHitCount *100.0 / ReadLocalBufferCount;
-
-	appendStringInfo(&str,
-	"!\tShared blocks: %10ld read, %10ld written, buffer hit rate = %.2f%%\n",
-				ReadBufferCount - BufferHitCount, BufferFlushCount, hitrate);
-	appendStringInfo(&str,
-	"!\tLocal  blocks: %10ld read, %10ld written, buffer hit rate = %.2f%%\n",
-					 ReadLocalBufferCount - LocalBufferHitCount, LocalBufferFlushCount, localhitrate);
-	appendStringInfo(&str,
-					 "!\tDirect blocks: %10ld read, %10ld written\n",
-					 BufFileReadCount, BufFileWriteCount);
-
-	return str.data;
-}
-
-void
-ResetBufferUsage(void)
-{
-	BufferHitCount = 0;
-	ReadBufferCount = 0;
-	BufferFlushCount = 0;
-	LocalBufferHitCount = 0;
-	ReadLocalBufferCount = 0;
-	LocalBufferFlushCount = 0;
-	BufFileReadCount = 0;
-	BufFileWriteCount = 0;
-}
-
-/*
  *		AtEOXact_Buffers - clean up at end of transaction.
  *
  *		As of PostgreSQL 8.0, buffer pins should get released by the
@@ -1916,7 +1870,7 @@ FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
 			  (char *) BufHdrGetBlock(buf),
 			  false);
 
-	BufferFlushCount++;
+	pgBufferUsage.shared_blks_written++;
 
 	/*
 	 * Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 641f8e9..c7d25b9 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -16,6 +16,7 @@
 #include "postgres.h"
 
 #include "catalog/catalog.h"
+#include "executor/instrument.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
 #include "storage/smgr.h"
@@ -209,7 +210,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 		/* Mark not-dirty now in case we error out below */
 		bufHdr->flags &= ~BM_DIRTY;
 
-		LocalBufferFlushCount++;
+		pgBufferUsage.local_blks_written++;
 	}
 
 	/*
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 436a82b..ebe77ff 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -34,6 +34,7 @@
 
 #include "postgres.h"
 
+#include "executor/instrument.h"
 #include "storage/fd.h"
 #include "storage/buffile.h"
 #include "storage/buf_internals.h"
@@ -240,7 +241,7 @@ BufFileLoadBuffer(BufFile *file)
 	file->offsets[file->curFile] += file->nbytes;
 	/* we choose not to advance curOffset here */
 
-	BufFileReadCount++;
+	pgBufferUsage.temp_blks_read++;
 }
 
 /*
@@ -304,7 +305,7 @@ BufFileDumpBuffer(BufFile *file)
 		file->curOffset += bytestowrite;
 		wpos += bytestowrite;
 
-		BufFileWriteCount++;
+		pgBufferUsage.temp_blks_written++;
 	}
 	file->dirty = false;
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 0672652..c985478 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3901,7 +3901,6 @@ ResetUsage(void)
 {
 	getrusage(RUSAGE_SELF, &Save_r);
 	gettimeofday(&Save_t, NULL);
-	ResetBufferUsage();
 }
 
 void
@@ -3912,7 +3911,6 @@ ShowUsage(const char *title)
 				sys;
 	struct timeval elapse_t;
 	struct rusage r;
-	char	   *bufusage;
 
 	getrusage(RUSAGE_SELF, &r);
 	gettimeofday(&elapse_t, NULL);
@@ -3986,10 +3984,6 @@ ShowUsage(const char *title)
 					 r.ru_nvcsw, r.ru_nivcsw);
 #endif   /* HAVE_GETRUSAGE */
 
-	bufusage = ShowBufferUsage();
-	appendStringInfo(&str, "! buffer usage stats:\n%s", bufusage);
-	pfree(bufusage);
-
 	/* remove trailing newline */
 	if (str.data[str.len - 1] == '\n')
 		str.data[--str.len] = '\0';
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index ba2ba08..648b2be 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -30,6 +30,7 @@ typedef struct ExplainState
 	bool		verbose;		/* be verbose */
 	bool		analyze;		/* print actual times */
 	bool		costs;			/* print costs */
+	bool		buffers;		/* print buffer usage */
 	ExplainFormat format;		/* output format */
 	/* other states */
 	PlannedStmt *pstmt;			/* top of plan */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 9846f6f..4bb6f91 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -16,6 +16,18 @@
 #include "portability/instr_time.h"
 
 
+typedef struct BufferUsage
+{
+	long	shared_blks_hit;		/* # of shared buffer hits */
+	long	shared_blks_read;		/* # of shared disk blocks read */
+	long	shared_blks_written;	/* # of shared disk blocks written */
+	long	local_blks_hit;			/* # of local buffer hits */
+	long	local_blks_read;		/* # of local disk blocks read */
+	long	local_blks_written;		/* # of local disk blocks written */
+	long	temp_blks_read;			/* # of temp blocks read */
+	long	temp_blks_written;		/* # of temp blocks written */
+} BufferUsage;
+
 typedef struct Instrumentation
 {
 	/* Info about current plan cycle: */
@@ -24,13 +36,17 @@ typedef struct Instrumentation
 	instr_time	counter;		/* Accumulated runtime for this node */
 	double		firsttuple;		/* Time for first tuple of this cycle */
 	double		tuplecount;		/* Tuples emitted so far this cycle */
+	BufferUsage	bufusage_start;	/* Buffer usage at start */
 	/* Accumulated statistics across all completed cycles: */
 	double		startup;		/* Total startup time (in seconds) */
 	double		total;			/* Total total time (in seconds) */
 	double		ntuples;		/* Total tuples produced */
 	double		nloops;			/* # of run cycles for this node */
+	BufferUsage	bufusage;		/* Total buffer usage */
 } Instrumentation;
 
+extern BufferUsage		pgBufferUsage;
+
 extern Instrumentation *InstrAlloc(int n);
 extern void InstrStartNode(Instrumentation *instr);
 extern void InstrStopNode(Instrumentation *instr, double nTuples);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 841cf09..42ed94e 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -173,16 +173,6 @@ extern PGDLLIMPORT BufferDesc *BufferDescriptors;
 /* in localbuf.c */
 extern BufferDesc *LocalBufferDescriptors;
 
-/* event counters in buf_init.c */
-extern long int ReadBufferCount;
-extern long int ReadLocalBufferCount;
-extern long int BufferHitCount;
-extern long int LocalBufferHitCount;
-extern long int BufferFlushCount;
-extern long int LocalBufferFlushCount;
-extern long int BufFileReadCount;
-extern long int BufFileWriteCount;
-
 
 /*
  * Internal routines: only called by bufmgr
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d06eb77..f8d685c 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -173,8 +173,6 @@ extern Buffer ReleaseAndReadBuffer(Buffer buffer, Relation relation,
 extern void InitBufferPool(void);
 extern void InitBufferPoolAccess(void);
 extern void InitBufferPoolBackend(void);
-extern char *ShowBufferUsage(void);
-extern void ResetBufferUsage(void);
 extern void AtEOXact_Buffers(bool isCommit);
 extern void PrintBufferLeakWarning(Buffer buffer);
 extern void CheckPointBuffers(int flags);
#57 Takahiro Itagaki
itagaki.takahiro@oss.ntt.co.jp
In reply to: Robert Haas (#56)
Re: EXPLAIN BUFFERS

Robert Haas <robertmhaas@gmail.com> wrote:

OK, done, see attached. I also noticed when looking through this that
the documentation says that auto_explain.log_buffers is ignored unless
auto_explain.log_analyze is set. That is true and seems right to me,
but for some reason explain_ExecutorEnd() had been changed to set
es.analyze if either log_analyze or log_buffers was set.

Thanks. It was my bug.

Could you apply the patch? Or may I do it myself?

Regards,
---
Takahiro Itagaki
NTT Open Source Software Center

#58 Robert Haas
robertmhaas@gmail.com
In reply to: Takahiro Itagaki (#57)
Re: EXPLAIN BUFFERS

On Sun, Dec 13, 2009 at 7:55 PM, Takahiro Itagaki
<itagaki.takahiro@oss.ntt.co.jp> wrote:

Robert Haas <robertmhaas@gmail.com> wrote:

OK, done, see attached.  I also noticed when looking through this that
the documentation says that auto_explain.log_buffers is ignored unless
auto_explain.log_analyze is set.  That is true and seems right to me,
but for some reason explain_ExecutorEnd() had been changed to set
es.analyze if either log_analyze or log_buffers was set.

Thanks. It was my bug.

Could you apply the patch? Or may I do it myself?

Sorry, I've been meaning to look at this a little more and have gotten
distracted.

I have a question about the comment in InstrStopNode(), which reads:
"Adds delta of buffer usage to node's count and resets counter to
start so that the counters are not double counted by parent nodes."
It then calls BufferUsageAccumDiff(), but that function doesn't
actually "reset" anything, so it seems like the comment is wrong.
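
To make the point concrete, here is a minimal sketch of the snapshot-and-delta scheme as it appears in the patch, reduced to two counters. All Demo* and demo_* names are hypothetical; the real code uses pgBufferUsage, InstrStartNode, InstrStopNode and BufferUsageAccumDiff.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical two-counter stand-in for the patch's BufferUsage. */
typedef struct DemoUsage
{
    long hit;
    long read;
} DemoUsage;

/* Plays the role of the global pgBufferUsage; it is only ever incremented. */
static DemoUsage demo_global;

typedef struct DemoInstr
{
    DemoUsage start;            /* snapshot taken when the node starts */
    DemoUsage total;            /* usage accumulated for this node */
} DemoInstr;

/* Like InstrStartNode: remember the global counters at entry. */
static void
demo_start_node(DemoInstr *instr)
{
    assert(instr != NULL);
    instr->start = demo_global;
}

/* Like InstrStopNode: total += global - start. Nothing is reset. */
static void
demo_stop_node(DemoInstr *instr)
{
    assert(instr != NULL);
    instr->total.hit += demo_global.hit - instr->start.hit;
    instr->total.read += demo_global.read - instr->start.read;
}
```

If a parent node is started, a child is started and stopped around three hits, and the parent then does two reads of its own before stopping, the child ends with hit=3 and the parent with hit=3, read=2. The parent's figure includes the child's work, and the global counter is left untouched, so the word "resets" in the comment does not describe what the code does.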

Two other thoughts:

1. It doesn't appear that there is any provision to ever zero
pgBufferUsage. Shouldn't we do this, say, once per explain, just to
avoid the possibility of overflowing the counters?

2. We seem to do all the work associated with pgBufferUsage even when
the "buffers" option is not passed to explain. The overhead of
incrementing the counters is probably negligible (and we were paying
the equivalent overhead before anyway) but I'm not sure whether saving
the starting counters and accumulating the deltas might be enough to
slow down EXPLAIN ANALYZE. That's sorta slow already so I'd hate to
whack it any more - have you benchmarked this at all?

...Robert

#59 Takahiro Itagaki
itagaki.takahiro@oss.ntt.co.jp
In reply to: Robert Haas (#58)
Re: EXPLAIN BUFFERS

Robert Haas <robertmhaas@gmail.com> wrote:

I have a question about the comment in InstrStopNode(), which reads:
"Adds delta of buffer usage to node's count and resets counter to
start so that the counters are not double counted by parent nodes."
It then calls BufferUsageAccumDiff(), but that function doesn't
actually "reset" anything, so it seems like the comment is wrong.

Oops, it's wrong. It just does "Adds delta of buffer usage to node's count."

Two other thoughts:

1. It doesn't appear that there is any provision to ever zero
pgBufferUsage. Shouldn't we do this, say, once per explain, just to
avoid the possibility of overflowing the counters?

I think overflow will not be a problem because we only use
the differences between values. The delta is always correct unless a single
execution of a node performs 2^32 or more buffer accesses.
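
The argument can be checked with a minimal sketch, under one important assumption: the counters here are unsigned, where wraparound is well defined in C. The patch declares them as signed long, for which overflow is undefined behavior in the C standard, so this illustrates the idea rather than the patch itself. counter_delta and demo_wraparound are hypothetical names.

```c
#include <assert.h>
#include <limits.h>

/* Delta between two snapshots of a wrapping unsigned counter. */
static unsigned long
counter_delta(unsigned long now, unsigned long start)
{
    /* Unsigned subtraction is modulo ULONG_MAX + 1, so a wrap cancels out. */
    return now - start;
}

/* Walk through the wraparound case; returns the observed delta. */
static unsigned long
demo_wraparound(void)
{
    unsigned long start = ULONG_MAX - 2;    /* snapshot near the top of the range */
    unsigned long now = start + 10;         /* ten more accesses; wraps to 7 */

    assert(now < start);                    /* the raw counter did wrap */
    return counter_delta(now, start);
}
```

The delta stays correct as long as fewer than ULONG_MAX + 1 accesses happen between the two snapshots. Note that the width of long differs between platforms (32 or 64 bits), so that limit is not the same everywhere.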

2. We seem to do all the work associated with pgBufferUsage even when
the "buffers" option is not passed to explain. The overhead of
incrementing the counters is probably negligible (and we were paying
the equivalent overhead before anyway) but I'm not sure whether saving
the starting counters and accumulating the deltas might be enough to
slow down EXPLAIN ANALYZE. That's sorta slow already so I'd hate to
whack it any more - have you benchmarked this at all?

The overhead is about 5% in the worst case. The difference will be
smaller with more complex operations or when there is some disk I/O.

Adding an Instrumentation->count_bufusage flag could reduce the overhead:

    if (instr->count_bufusage)
        BufferUsageAccumDiff(...)

Should I add a countBufferUsage boolean argument to all the places where
doInstrument booleans are currently used? This requires several
minor code modifications in many places.

[without patch]
=# EXPLAIN (ANALYZE) SELECT * FROM pgbench_accounts;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Seq Scan on pgbench_accounts (cost=0.00..263935.00 rows=10000000 width=97) (actual time=0.003..571.794 rows=10000000 loops=1)
Total runtime: 899.427 ms

[with patch]
=# EXPLAIN (ANALYZE) SELECT * FROM pgbench_accounts;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Seq Scan on pgbench_accounts (cost=0.00..263935.00 rows=10000000 width=97) (actual time=0.003..585.885 rows=10000000 loops=1)
Total runtime: 955.280 ms

- shared_buffers = 1500MB
- pgbench -i -s100
- Read all pages in the pgbench_accounts into shared buffers before runs.
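
For reference, the relative overhead implied by the two runs above can be worked out with a short helper. This is plain arithmetic on the posted timings, not a new measurement; overhead_percent is a hypothetical name.

```c
#include <assert.h>

/* Relative slowdown of patched_ms over base_ms, in percent. */
static double
overhead_percent(double base_ms, double patched_ms)
{
    assert(base_ms > 0.0);
    return (patched_ms - base_ms) / base_ms * 100.0;
}
```

With the total runtimes (899.427 ms without the patch, 955.280 ms with it) this gives roughly 6.2%. With the node's actual time (571.794 ms versus 585.885 ms) it gives roughly 2.5%. Both figures come from a single run each, so they are only indicative.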

Regards,
---
Takahiro Itagaki
NTT Open Source Software Center

#60 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Takahiro Itagaki (#59)
Re: EXPLAIN BUFFERS

Takahiro Itagaki <itagaki.takahiro@oss.ntt.co.jp> writes:

Should I add a countBufferUsage boolean argument to all the places where
doInstrument booleans are currently used? This requires several
minor code modifications in many places.

Pushing extra arguments around would create overhead of its own ...
overhead that would be paid even when not using EXPLAIN at all.

regards, tom lane

#61 Takahiro Itagaki
itagaki.takahiro@oss.ntt.co.jp
In reply to: Tom Lane (#60)
Re: EXPLAIN BUFFERS

Tom Lane <tgl@sss.pgh.pa.us> wrote:

Takahiro Itagaki <itagaki.takahiro@oss.ntt.co.jp> writes:

Should I add a countBufferUsage boolean argument to all the places where
doInstrument booleans are currently used? This requires several
minor code modifications in many places.

Pushing extra arguments around would create overhead of its own ...
overhead that would be paid even when not using EXPLAIN at all.

I cannot understand what you mean... The additional argument should
not be a performance overhead because the code path is run only once
per execution. Instrumentation structures are still not allocated
in normal or EXPLAIN queries; they are allocated only in EXPLAIN ANALYZE.

Or are you suggesting separating the buffer counters from the Instrumentation
structure? That still requires extra arguments, but it could minimize the
overhead when we use EXPLAIN ANALYZE without BUFFERS. However, we would need
additional code around the InstrStartNode/InstrStopNode calls.

Or are you complaining about non-performance overhead,
such as the overhead of code maintenance?

Regards,
---
Takahiro Itagaki
NTT Open Source Software Center

#62 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#60)
Re: EXPLAIN BUFFERS

On Sun, Dec 13, 2009 at 10:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Takahiro Itagaki <itagaki.takahiro@oss.ntt.co.jp> writes:

Should I add a countBufferUsage boolean argument to all the places where
doInstrument booleans are currently used? This requires several
minor code modifications in many places.

Pushing extra arguments around would create overhead of its own ...
overhead that would be paid even when not using EXPLAIN at all.

Well, I think we need to do something. I don't really want to tack
another 5-6% overhead onto EXPLAIN ANALYZE. Maybe we could recast the
doInstrument argument as a set of OR'd flags?

...Robert

#63 Robert Haas
robertmhaas@gmail.com
In reply to: Takahiro Itagaki (#59)
Re: EXPLAIN BUFFERS

On Sun, Dec 13, 2009 at 10:00 PM, Takahiro Itagaki
<itagaki.takahiro@oss.ntt.co.jp> wrote:

Two other thoughts:

1. It doesn't appear that there is any provision to ever zero
pgBufferUsage.  Shouldn't we do this, say, once per explain, just to
avoid the possibility of overflowing the counters?

I think overflow will not be a problem because we only use
the differences between values. The delta is always correct unless a single
execution of a node performs 2^32 or more buffer accesses.

Hmm... you might be right. I'm not savvy enough to know whether there
are any portability concerns here.

Anyone else know?

...Robert

#64 Takahiro Itagaki
itagaki.takahiro@oss.ntt.co.jp
In reply to: Robert Haas (#62)
1 attachment(s)
Re: EXPLAIN BUFFERS

Robert Haas <robertmhaas@gmail.com> wrote:

Well, I think we need to do something. I don't really want to tack
another 5-6% overhead onto EXPLAIN ANALYZE. Maybe we could recast the
doInstrument argument as a set of OR'd flags?

I'm thinking the same thing (OR'd flags) right now.

The attached patch adds INSTRUMENT_TIMER and INSTRUMENT_BUFFERS flags.
The types of QueryDesc.doInstrument (renamed to instrument_options) and
EState.es_instrument are changed from bool to int, and they store the
OR of InstrumentOption flags. INSTRUMENT_TIMER is always enabled when
instrumentation is initialized, but INSTRUMENT_BUFFERS is enabled only if
we use EXPLAIN BUFFERS. I think the flag options are not a bad idea because
they are extensible. For example, we could support EXPLAIN CPU_USAGE someday.

One issue is the top-level instrumentation (queryDesc->totaltime).
Since the field might be used by multiple plugins, the first initializer
needs to initialize the counter with all options. I used INSTRUMENT_ALL
for it in the patch.
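
The flag scheme described above might look roughly like the following sketch. The DEMO_* values, the DemoInstrumentation fields and the demo_* functions are hypothetical; the names and values actually used are defined in the attached patch.

```c
#include <assert.h>

/* Hypothetical flag values; each option occupies its own bit. */
typedef enum DemoInstrumentOption
{
    DEMO_INSTRUMENT_TIMER = 1 << 0,     /* collect timing */
    DEMO_INSTRUMENT_BUFFERS = 1 << 1,   /* collect buffer usage */
    DEMO_INSTRUMENT_ALL = 0x7FFFFFFF    /* every current and future option */
} DemoInstrumentOption;

/* Hypothetical per-node state derived from the option flags. */
typedef struct DemoInstrumentation
{
    int need_timer;
    int need_bufusage;
} DemoInstrumentation;

/* Translate EXPLAIN options into an OR'd set of flags. */
static int
demo_explain_options(int analyze, int buffers)
{
    int options = 0;

    assert(!buffers || analyze);    /* BUFFERS requires ANALYZE */
    if (analyze)
        options |= DEMO_INSTRUMENT_TIMER;
    if (buffers)
        options |= DEMO_INSTRUMENT_BUFFERS;
    return options;
}

/* Decide once, at setup time, which counters a node has to maintain. */
static DemoInstrumentation
demo_instr_init(int instrument_options)
{
    DemoInstrumentation instr;

    instr.need_timer = (instrument_options & DEMO_INSTRUMENT_TIMER) != 0;
    instr.need_bufusage = (instrument_options & DEMO_INSTRUMENT_BUFFERS) != 0;
    return instr;
}
```

With this shape, plain EXPLAIN ANALYZE skips the buffer snapshot and delta work entirely, while a caller that cannot know what other plugins need (the totaltime case) can pass the ALL value. A new option only needs a new bit, not a new argument at every call site.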

=# EXPLAIN (ANALYZE) SELECT * FROM pgbench_accounts;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Seq Scan on pgbench_accounts (cost=0.00..263935.00 rows=10000000 width=97) (actual time=0.003..572.126 rows=10000000 loops=1)
Total runtime: 897.729 ms

=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM pgbench_accounts;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Seq Scan on pgbench_accounts (cost=0.00..263935.00 rows=10000000 width=97) (actual time=0.002..580.642 rows=10000000 loops=1)
Buffers: shared hit=163935
Total runtime: 955.744 ms

Regards,
---
Takahiro Itagaki
NTT Open Source Software Center

Attachments:

explain_buffers_20091214.patch (application/octet-stream)
diff -cprN head/contrib/auto_explain/auto_explain.c work/contrib/auto_explain/auto_explain.c
*** head/contrib/auto_explain/auto_explain.c	2009-12-14 09:21:34.822978000 +0900
--- work/contrib/auto_explain/auto_explain.c	2009-12-14 13:28:35.935089826 +0900
*************** PG_MODULE_MAGIC;
*** 22,27 ****
--- 22,28 ----
  static int	auto_explain_log_min_duration = -1; /* msec or -1 */
  static bool auto_explain_log_analyze = false;
  static bool auto_explain_log_verbose = false;
+ static bool auto_explain_log_buffers = false;
  static int	auto_explain_log_format = EXPLAIN_FORMAT_TEXT;
  static bool auto_explain_log_nested_statements = false;
  
*************** _PG_init(void)
*** 93,98 ****
--- 94,109 ----
  							 NULL,
  							 NULL);
  
+ 	DefineCustomBoolVariable("auto_explain.log_buffers",
+ 							 "Log buffers usage.",
+ 							 NULL,
+ 							 &auto_explain_log_buffers,
+ 							 false,
+ 							 PGC_SUSET,
+ 							 0,
+ 							 NULL,
+ 							 NULL);
+ 
  	DefineCustomEnumVariable("auto_explain.log_format",
  							 "EXPLAIN format to be used for plan logging.",
  							 NULL,
*************** explain_ExecutorStart(QueryDesc *queryDe
*** 147,153 ****
  	{
  		/* Enable per-node instrumentation iff log_analyze is required. */
  		if (auto_explain_log_analyze && (eflags & EXEC_FLAG_EXPLAIN_ONLY) == 0)
! 			queryDesc->doInstrument = true;
  	}
  
  	if (prev_ExecutorStart)
--- 158,168 ----
  	{
  		/* Enable per-node instrumentation iff log_analyze is required. */
  		if (auto_explain_log_analyze && (eflags & EXEC_FLAG_EXPLAIN_ONLY) == 0)
! 		{
! 			queryDesc->instrument_options |= INSTRUMENT_TIMER;
! 			if (auto_explain_log_buffers)
! 				queryDesc->instrument_options |= INSTRUMENT_BUFFERS;
! 		}
  	}
  
  	if (prev_ExecutorStart)
*************** explain_ExecutorStart(QueryDesc *queryDe
*** 167,173 ****
  			MemoryContext oldcxt;
  
  			oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
! 			queryDesc->totaltime = InstrAlloc(1);
  			MemoryContextSwitchTo(oldcxt);
  		}
  	}
--- 182,188 ----
  			MemoryContext oldcxt;
  
  			oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
! 			queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL);
  			MemoryContextSwitchTo(oldcxt);
  		}
  	}
*************** explain_ExecutorEnd(QueryDesc *queryDesc
*** 219,226 ****
  			ExplainState	es;
  
  			ExplainInitState(&es);
! 			es.analyze = (queryDesc->doInstrument && auto_explain_log_analyze);
  			es.verbose = auto_explain_log_verbose;
  			es.format = auto_explain_log_format;
  
  			ExplainBeginOutput(&es);
--- 234,242 ----
  			ExplainState	es;
  
  			ExplainInitState(&es);
! 			es.analyze = (queryDesc->instrument_options && auto_explain_log_analyze);
  			es.verbose = auto_explain_log_verbose;
+ 			es.buffers = (es.analyze && auto_explain_log_buffers);
  			es.format = auto_explain_log_format;
  
  			ExplainBeginOutput(&es);
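
Reviewer's aside (not part of the patch): the auto_explain hunks above derive the instrumentation bitmask from two settings plus the EXPLAIN-only executor flag. A minimal standalone sketch of that decision follows, assuming the two flag values from the InstrumentOption enum later in this patch; the DEMO_ names and demo_instrument_options are hypothetical and exist only for illustration.

```c
#include <assert.h>

/* Flag values mirror INSTRUMENT_TIMER and INSTRUMENT_BUFFERS in the patch. */
enum
{
	DEMO_TIMER = 1 << 0,
	DEMO_BUFFERS = 1 << 1
};

/*
 * Hypothetical helper: compute the option bitmask the way
 * explain_ExecutorStart does.  Buffer tracking is only ever requested
 * together with the timer, and nothing is requested for EXPLAIN-only runs.
 */
static int
demo_instrument_options(int log_analyze, int log_buffers, int explain_only)
{
	int		options = 0;

	if (log_analyze && !explain_only)
	{
		options |= DEMO_TIMER;
		if (log_buffers)
			options |= DEMO_BUFFERS;
	}

	/* Same invariant that InstrAlloc asserts: no buffers without the timer. */
	assert(options == 0 || (options & DEMO_TIMER));
	return options;
}
```

Under these assumptions, setting log_buffers without log_analyze yields 0, which matches the documented "no effect unless log_analyze is set" behaviour.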
diff -cprN head/contrib/pg_stat_statements/pg_stat_statements.c work/contrib/pg_stat_statements/pg_stat_statements.c
*** head/contrib/pg_stat_statements/pg_stat_statements.c	2009-12-03 13:12:47.180551000 +0900
--- work/contrib/pg_stat_statements/pg_stat_statements.c	2009-12-14 13:16:42.859766691 +0900
*************** pgss_ExecutorStart(QueryDesc *queryDesc,
*** 495,501 ****
  			MemoryContext oldcxt;
  
  			oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
! 			queryDesc->totaltime = InstrAlloc(1);
  			MemoryContextSwitchTo(oldcxt);
  		}
  	}
--- 495,501 ----
  			MemoryContext oldcxt;
  
  			oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
! 			queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL);
  			MemoryContextSwitchTo(oldcxt);
  		}
  	}
diff -cprN head/doc/src/sgml/auto-explain.sgml work/doc/src/sgml/auto-explain.sgml
*** head/doc/src/sgml/auto-explain.sgml	2009-12-11 10:47:06.949855000 +0900
--- work/doc/src/sgml/auto-explain.sgml	2009-12-14 11:32:38.419722226 +0900
*************** LOAD 'auto_explain';
*** 104,109 ****
--- 104,128 ----
  
     <varlistentry>
      <term>
+      <varname>auto_explain.log_buffers</varname> (<type>boolean</type>)
+     </term>
+     <indexterm>
+      <primary><varname>auto_explain.log_buffers</> configuration parameter</primary>
+     </indexterm>
+     <listitem>
+      <para>
+       <varname>auto_explain.log_buffers</varname> causes <command>EXPLAIN
+       (ANALYZE, BUFFERS)</> output, rather than just <command>EXPLAIN</>
+       output, to be printed when an execution plan is logged. This parameter is
+       off by default. Only superusers can change this setting. This
+       parameter has no effect unless the
+       <varname>auto_explain.log_analyze</> parameter is also enabled.
+      </para>
+     </listitem>
+    </varlistentry>
+ 
+    <varlistentry>
+     <term>
       <varname>auto_explain.log_format</varname> (<type>enum</type>)
      </term>
      <indexterm>
diff -cprN head/doc/src/sgml/ref/explain.sgml work/doc/src/sgml/ref/explain.sgml
*** head/doc/src/sgml/ref/explain.sgml	2009-12-11 10:47:06.949855000 +0900
--- work/doc/src/sgml/ref/explain.sgml	2009-12-14 11:32:38.419722226 +0900
*************** PostgreSQL documentation
*** 31,37 ****
  
   <refsynopsisdiv>
  <synopsis>
! EXPLAIN [ ( { ANALYZE <replaceable class="parameter">boolean</replaceable> | VERBOSE <replaceable class="parameter">boolean</replaceable> | COSTS <replaceable class="parameter">boolean</replaceable> | FORMAT { TEXT | XML | JSON | YAML } } [, ...] ) ] <replaceable class="parameter">statement</replaceable>
  EXPLAIN [ ANALYZE ] [ VERBOSE ] <replaceable class="parameter">statement</replaceable>
  </synopsis>
   </refsynopsisdiv>
--- 31,37 ----
  
   <refsynopsisdiv>
  <synopsis>
! EXPLAIN [ ( { ANALYZE <replaceable class="parameter">boolean</replaceable> | VERBOSE <replaceable class="parameter">boolean</replaceable> | COSTS <replaceable class="parameter">boolean</replaceable> | BUFFERS <replaceable class="parameter">boolean</replaceable> | FORMAT { TEXT | XML | JSON | YAML } } [, ...] ) ] <replaceable class="parameter">statement</replaceable>
  EXPLAIN [ ANALYZE ] [ VERBOSE ] <replaceable class="parameter">statement</replaceable>
  </synopsis>
   </refsynopsisdiv>
*************** ROLLBACK;
*** 140,145 ****
--- 140,163 ----
     </varlistentry>
  
     <varlistentry>
+     <term><literal>BUFFERS</literal></term>
+     <listitem>
+      <para>
+       Include information on buffer usage. Specifically, include the number of
+       shared block hits, reads, and writes, the number of local block hits,
+       reads, and writes, and the number of temp block reads and writes.
+       Shared blocks, local blocks, and temp blocks contain tables and indexes,
+       temporary tables and temporary indexes, and disk blocks used in sort and
+       materialized plans, respectively. The number of blocks shown for an
+       upper-level node includes those used by all its child nodes.  In text
+       format, only non-zero values are printed.  This parameter may only be
+       used together with the <literal>ANALYZE</literal> parameter.  It
+       defaults to <literal>FALSE</literal>.
+      </para>
+     </listitem>
+    </varlistentry>
+ 
+    <varlistentry>
      <term><literal>FORMAT</literal></term>
      <listitem>
       <para>
diff -cprN head/src/backend/commands/copy.c work/src/backend/commands/copy.c
*** head/src/backend/commands/copy.c	2009-11-24 10:04:57.883822000 +0900
--- work/src/backend/commands/copy.c	2009-12-14 13:21:28.588811970 +0900
*************** DoCopy(const CopyStmt *stmt, const char 
*** 1094,1100 ****
  		cstate->queryDesc = CreateQueryDesc(plan, queryString,
  											GetActiveSnapshot(),
  											InvalidSnapshot,
! 											dest, NULL, false);
  
  		/*
  		 * Call ExecutorStart to prepare the plan for execution.
--- 1094,1100 ----
  		cstate->queryDesc = CreateQueryDesc(plan, queryString,
  											GetActiveSnapshot(),
  											InvalidSnapshot,
! 											dest, NULL, 0);
  
  		/*
  		 * Call ExecutorStart to prepare the plan for execution.
diff -cprN head/src/backend/commands/explain.c work/src/backend/commands/explain.c
*** head/src/backend/commands/explain.c	2009-12-14 09:21:34.822978000 +0900
--- work/src/backend/commands/explain.c	2009-12-14 13:20:19.415713632 +0900
*************** ExplainQuery(ExplainStmt *stmt, const ch
*** 125,130 ****
--- 125,132 ----
  			es.verbose = defGetBoolean(opt);
  		else if (strcmp(opt->defname, "costs") == 0)
  			es.costs = defGetBoolean(opt);
+ 		else if (strcmp(opt->defname, "buffers") == 0)
+ 			es.buffers = defGetBoolean(opt);
  		else if (strcmp(opt->defname, "format") == 0)
  		{
  			char   *p = defGetString(opt);
*************** ExplainQuery(ExplainStmt *stmt, const ch
*** 150,155 ****
--- 152,162 ----
  							opt->defname)));
  	}
  
+ 	if (es.buffers && !es.analyze)
+ 		ereport(ERROR,
+ 			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ 			 errmsg("EXPLAIN option BUFFERS requires ANALYZE")));
+ 
  	/*
  	 * Run parse analysis and rewrite.	Note this also acquires sufficient
  	 * locks on the source table(s).
*************** ExplainOnePlan(PlannedStmt *plannedstmt,
*** 339,344 ****
--- 346,357 ----
  	instr_time	starttime;
  	double		totaltime = 0;
  	int			eflags;
+ 	int			instrument_option = 0;
+ 
+ 	if (es->analyze)
+ 		instrument_option |= INSTRUMENT_TIMER;
+ 	if (es->buffers)
+ 		instrument_option |= INSTRUMENT_BUFFERS;
  
  	/*
  	 * Use a snapshot with an updated command ID to ensure this query sees
*************** ExplainOnePlan(PlannedStmt *plannedstmt,
*** 349,355 ****
  	/* Create a QueryDesc requesting no output */
  	queryDesc = CreateQueryDesc(plannedstmt, queryString,
  								GetActiveSnapshot(), InvalidSnapshot,
! 								None_Receiver, params, es->analyze);
  
  	INSTR_TIME_SET_CURRENT(starttime);
  
--- 362,368 ----
  	/* Create a QueryDesc requesting no output */
  	queryDesc = CreateQueryDesc(plannedstmt, queryString,
  								GetActiveSnapshot(), InvalidSnapshot,
! 								None_Receiver, params, instrument_option);
  
  	INSTR_TIME_SET_CURRENT(starttime);
  
*************** ExplainNode(Plan *plan, PlanState *plans
*** 1042,1047 ****
--- 1055,1138 ----
  			break;
  	}
  
+ 	/* Show buffer usage */
+ 	if (es->buffers)
+ 	{
+ 		const BufferUsage *usage = &planstate->instrument->bufusage;
+ 
+ 		if (es->format == EXPLAIN_FORMAT_TEXT)
+ 		{
+ 			bool	has_shared = (usage->shared_blks_hit > 0 ||
+ 								  usage->shared_blks_read > 0 ||
+ 								  usage->shared_blks_written > 0);
+ 			bool	has_local = (usage->local_blks_hit > 0 ||
+ 								 usage->local_blks_read > 0 ||
+ 								 usage->local_blks_written > 0);
+ 			bool	has_temp = (usage->temp_blks_read > 0 ||
+ 								usage->temp_blks_written > 0);
+ 
+ 			/* Show only positive counter values. */
+ 			if (has_shared || has_local || has_temp)
+ 			{
+ 				appendStringInfoSpaces(es->str, es->indent * 2);
+ 				appendStringInfoString(es->str, "Buffers:");
+ 
+ 				if (has_shared)
+ 				{
+ 					appendStringInfoString(es->str, " shared");
+ 					if (usage->shared_blks_hit > 0)
+ 						appendStringInfo(es->str, " hit=%ld",
+ 							usage->shared_blks_hit);
+ 					if (usage->shared_blks_read > 0)
+ 						appendStringInfo(es->str, " read=%ld",
+ 							usage->shared_blks_read);
+ 					if (usage->shared_blks_written > 0)
+ 						appendStringInfo(es->str, " written=%ld",
+ 							usage->shared_blks_written);
+ 					if (has_local || has_temp)
+ 						appendStringInfoChar(es->str, ',');
+ 				}
+ 				if (has_local)
+ 				{
+ 					appendStringInfoString(es->str, " local");
+ 					if (usage->local_blks_hit > 0)
+ 						appendStringInfo(es->str, " hit=%ld",
+ 							usage->local_blks_hit);
+ 					if (usage->local_blks_read > 0)
+ 						appendStringInfo(es->str, " read=%ld",
+ 							usage->local_blks_read);
+ 					if (usage->local_blks_written > 0)
+ 						appendStringInfo(es->str, " written=%ld",
+ 							usage->local_blks_written);
+ 					if (has_temp)
+ 						appendStringInfoChar(es->str, ',');
+ 				}
+ 				if (has_temp)
+ 				{
+ 					appendStringInfoString(es->str, " temp");
+ 					if (usage->temp_blks_read > 0)
+ 						appendStringInfo(es->str, " read=%ld",
+ 							usage->temp_blks_read);
+ 					if (usage->temp_blks_written > 0)
+ 						appendStringInfo(es->str, " written=%ld",
+ 							usage->temp_blks_written);
+ 				}
+ 				appendStringInfoChar(es->str, '\n');
+ 			}
+ 		}
+ 		else
+ 		{
+ 			ExplainPropertyLong("Shared Hit Blocks", usage->shared_blks_hit, es);
+ 			ExplainPropertyLong("Shared Read Blocks", usage->shared_blks_read, es);
+ 			ExplainPropertyLong("Shared Written Blocks", usage->shared_blks_written, es);
+ 			ExplainPropertyLong("Local Hit Blocks", usage->local_blks_hit, es);
+ 			ExplainPropertyLong("Local Read Blocks", usage->local_blks_read, es);
+ 			ExplainPropertyLong("Local Written Blocks", usage->local_blks_written, es);
+ 			ExplainPropertyLong("Temp Read Blocks", usage->temp_blks_read, es);
+ 			ExplainPropertyLong("Temp Written Blocks", usage->temp_blks_written, es);
+ 		}
+ 	}
+ 
  	/* Get ready to display the child plans */
  	haschildren = plan->initPlan ||
  		outerPlan(plan) ||
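
Reviewer's aside (not part of the patch): the text-format branch in the ExplainNode hunk above prints a group only when at least one of its counters is positive, and separates groups with a comma. The sketch below reproduces that layout in standalone C under simplifying assumptions: it keeps only a shared and a temp group, writes into a caller-supplied buffer with snprintf instead of a StringInfo, and every demo_ name is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Reduced, hypothetical counter set; the patch's BufferUsage has eight fields. */
typedef struct DemoBufUsage
{
	long	shared_hit;
	long	shared_read;
	long	temp_read;
	long	temp_written;
} DemoBufUsage;

/* Append a literal string to the NUL-terminated buffer, truncating if full. */
static void
demo_append_str(char *buf, size_t buflen, const char *s)
{
	size_t	used = strlen(buf);

	snprintf(buf + used, buflen - used, "%s", s);
}

/* Append " name=value", but only for positive counters. */
static void
demo_append_counter(char *buf, size_t buflen, const char *name, long value)
{
	size_t	used = strlen(buf);

	if (value > 0)
		snprintf(buf + used, buflen - used, " %s=%ld", name, value);
}

/* Returns 1 and fills buf if any counter is positive; otherwise 0, buf empty. */
static int
demo_format_buffers(const DemoBufUsage *u, char *buf, size_t buflen)
{
	int		has_shared = (u->shared_hit > 0 || u->shared_read > 0);
	int		has_temp = (u->temp_read > 0 || u->temp_written > 0);

	assert(buf != NULL && buflen > 0);
	buf[0] = '\0';
	if (!has_shared && !has_temp)
		return 0;

	demo_append_str(buf, buflen, "Buffers:");
	if (has_shared)
	{
		demo_append_str(buf, buflen, " shared");
		demo_append_counter(buf, buflen, "hit", u->shared_hit);
		demo_append_counter(buf, buflen, "read", u->shared_read);
		if (has_temp)
			demo_append_str(buf, buflen, ",");
	}
	if (has_temp)
	{
		demo_append_str(buf, buflen, " temp");
		demo_append_counter(buf, buflen, "read", u->temp_read);
		demo_append_counter(buf, buflen, "written", u->temp_written);
	}
	return 1;
}
```

With shared_hit = 5 and temp_written = 7 this sketch produces "Buffers: shared hit=5, temp written=7"; with all counters zero it produces no line at all, mirroring the "only non-zero values are printed" rule in the docs.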
diff -cprN head/src/backend/commands/tablecmds.c work/src/backend/commands/tablecmds.c
*** head/src/backend/commands/tablecmds.c	2009-12-11 12:39:49.829461000 +0900
--- work/src/backend/commands/tablecmds.c	2009-12-14 13:21:28.546723185 +0900
*************** ExecuteTruncate(TruncateStmt *stmt)
*** 936,942 ****
  						  rel,
  						  0,	/* dummy rangetable index */
  						  CMD_DELETE,	/* don't need any index info */
! 						  false);
  		resultRelInfo++;
  	}
  	estate->es_result_relations = resultRelInfos;
--- 936,942 ----
  						  rel,
  						  0,	/* dummy rangetable index */
  						  CMD_DELETE,	/* don't need any index info */
! 						  0);
  		resultRelInfo++;
  	}
  	estate->es_result_relations = resultRelInfos;
diff -cprN head/src/backend/executor/execMain.c work/src/backend/executor/execMain.c
*** head/src/backend/executor/execMain.c	2009-12-14 09:21:34.822978000 +0900
--- work/src/backend/executor/execMain.c	2009-12-14 13:10:49.340796341 +0900
*************** standard_ExecutorStart(QueryDesc *queryD
*** 180,186 ****
  	 */
  	estate->es_snapshot = RegisterSnapshot(queryDesc->snapshot);
  	estate->es_crosscheck_snapshot = RegisterSnapshot(queryDesc->crosscheck_snapshot);
! 	estate->es_instrument = queryDesc->doInstrument;
  
  	/*
  	 * Initialize the plan state tree
--- 180,186 ----
  	 */
  	estate->es_snapshot = RegisterSnapshot(queryDesc->snapshot);
  	estate->es_crosscheck_snapshot = RegisterSnapshot(queryDesc->crosscheck_snapshot);
! 	estate->es_instrument = queryDesc->instrument_options;
  
  	/*
  	 * Initialize the plan state tree
*************** InitResultRelInfo(ResultRelInfo *resultR
*** 859,865 ****
  				  Relation resultRelationDesc,
  				  Index resultRelationIndex,
  				  CmdType operation,
! 				  bool doInstrument)
  {
  	/*
  	 * Check valid relkind ... parser and/or planner should have noticed this
--- 859,865 ----
  				  Relation resultRelationDesc,
  				  Index resultRelationIndex,
  				  CmdType operation,
! 				  int instrument_options)
  {
  	/*
  	 * Check valid relkind ... parser and/or planner should have noticed this
*************** InitResultRelInfo(ResultRelInfo *resultR
*** 914,923 ****
  			palloc0(n * sizeof(FmgrInfo));
  		resultRelInfo->ri_TrigWhenExprs = (List **)
  			palloc0(n * sizeof(List *));
! 		if (doInstrument)
! 			resultRelInfo->ri_TrigInstrument = InstrAlloc(n);
! 		else
! 			resultRelInfo->ri_TrigInstrument = NULL;
  	}
  	else
  	{
--- 914,921 ----
  			palloc0(n * sizeof(FmgrInfo));
  		resultRelInfo->ri_TrigWhenExprs = (List **)
  			palloc0(n * sizeof(List *));
! 		if (instrument_options)
! 			resultRelInfo->ri_TrigInstrument = InstrAlloc(n, instrument_options);
  	}
  	else
  	{
diff -cprN head/src/backend/executor/execProcnode.c work/src/backend/executor/execProcnode.c
*** head/src/backend/executor/execProcnode.c	2009-10-13 09:24:03.097662000 +0900
--- work/src/backend/executor/execProcnode.c	2009-12-14 13:10:30.437752389 +0900
*************** ExecInitNode(Plan *node, EState *estate,
*** 321,327 ****
  
  	/* Set up instrumentation for this node if requested */
  	if (estate->es_instrument)
! 		result->instrument = InstrAlloc(1);
  
  	return result;
  }
--- 321,327 ----
  
  	/* Set up instrumentation for this node if requested */
  	if (estate->es_instrument)
! 		result->instrument = InstrAlloc(1, estate->es_instrument);
  
  	return result;
  }
diff -cprN head/src/backend/executor/functions.c work/src/backend/executor/functions.c
*** head/src/backend/executor/functions.c	2009-11-06 09:53:35.834256000 +0900
--- work/src/backend/executor/functions.c	2009-12-14 13:21:28.623763373 +0900
*************** postquel_start(execution_state *es, SQLF
*** 414,420 ****
  								 fcache->src,
  								 snapshot, InvalidSnapshot,
  								 dest,
! 								 fcache->paramLI, false);
  	else
  		es->qd = CreateUtilityQueryDesc(es->stmt,
  										fcache->src,
--- 414,420 ----
  								 fcache->src,
  								 snapshot, InvalidSnapshot,
  								 dest,
! 								 fcache->paramLI, 0);
  	else
  		es->qd = CreateUtilityQueryDesc(es->stmt,
  										fcache->src,
diff -cprN head/src/backend/executor/instrument.c work/src/backend/executor/instrument.c
*** head/src/backend/executor/instrument.c	2009-01-05 00:22:25.168790000 +0900
--- work/src/backend/executor/instrument.c	2009-12-14 13:17:59.739739775 +0900
***************
*** 17,30 ****
  
  #include "executor/instrument.h"
  
  
  /* Allocate new instrumentation structure(s) */
  Instrumentation *
! InstrAlloc(int n)
  {
! 	Instrumentation *instr = palloc0(n * sizeof(Instrumentation));
  
! 	/* we don't need to do any initialization except zero 'em */
  
  	return instr;
  }
--- 17,44 ----
  
  #include "executor/instrument.h"
  
+ BufferUsage			pgBufferUsage;
+ 
+ static void BufferUsageAccumDiff(BufferUsage *dst,
+ 		const BufferUsage *add, const BufferUsage *sub);
  
  /* Allocate new instrumentation structure(s) */
  Instrumentation *
! InstrAlloc(int n, int instrument_options)
  {
! 	Instrumentation *instr;
! 
! 	/* timer is always required for now */
! 	Assert(instrument_options & INSTRUMENT_TIMER);
  
! 	instr = palloc0(n * sizeof(Instrumentation));
! 	if (instrument_options & INSTRUMENT_BUFFERS)
! 	{
! 		int		i;
! 
! 		for (i = 0; i < n; i++)
! 			instr[i].needs_bufusage = true;
! 	}
  
  	return instr;
  }
*************** InstrStartNode(Instrumentation *instr)
*** 37,42 ****
--- 51,60 ----
  		INSTR_TIME_SET_CURRENT(instr->starttime);
  	else
  		elog(DEBUG2, "InstrStartNode called twice in a row");
+ 
+ 	/* remember the global buffer counters at node start */
+ 	if (instr->needs_bufusage)
+ 		instr->bufusage_start = pgBufferUsage;
  }
  
  /* Exit from a plan node */
*************** InstrStopNode(Instrumentation *instr, do
*** 59,64 ****
--- 77,87 ----
  
  	INSTR_TIME_SET_ZERO(instr->starttime);
  
+ 	/* Add the delta of buffer usage to the node's count. */
+ 	if (instr->needs_bufusage)
+ 		BufferUsageAccumDiff(&instr->bufusage,
+ 			&pgBufferUsage, &instr->bufusage_start);
+ 
  	/* Is this the first tuple of this cycle? */
  	if (!instr->running)
  	{
*************** InstrEndLoop(Instrumentation *instr)
*** 95,97 ****
--- 118,136 ----
  	instr->firsttuple = 0;
  	instr->tuplecount = 0;
  }
+ 
+ static void
+ BufferUsageAccumDiff(BufferUsage *dst,
+ 					 const BufferUsage *add,
+ 					 const BufferUsage *sub)
+ {
+ 	/* dst += add - sub */
+ 	dst->shared_blks_hit += add->shared_blks_hit - sub->shared_blks_hit;
+ 	dst->shared_blks_read += add->shared_blks_read - sub->shared_blks_read;
+ 	dst->shared_blks_written += add->shared_blks_written - sub->shared_blks_written;
+ 	dst->local_blks_hit += add->local_blks_hit - sub->local_blks_hit;
+ 	dst->local_blks_read += add->local_blks_read - sub->local_blks_read;
+ 	dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
+ 	dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
+ 	dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
+ }
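
Reviewer's aside (not part of the patch): the instrument.c hunks above take a snapshot of the global counters in InstrStartNode and add "current minus snapshot" to the node's totals in InstrStopNode. The standalone sketch below illustrates that snapshot-and-delta technique with a reduced, hypothetical counter set (the Demo names are mine, not the patch's) and shows why a parent node's totals include its children's.

```c
#include <assert.h>

/* Reduced, hypothetical counter set; the patch's BufferUsage has eight fields. */
typedef struct DemoUsage
{
	long	blks_hit;
	long	blks_read;
} DemoUsage;

/* Stand-in for the backend-global pgBufferUsage. */
static DemoUsage demo_global;

typedef struct DemoInstr
{
	DemoUsage	start;		/* snapshot taken when the node starts */
	DemoUsage	total;		/* accumulated over all start/stop pairs */
} DemoInstr;

/* dst += add - sub, the same shape as BufferUsageAccumDiff */
static void
demo_accum_diff(DemoUsage *dst, const DemoUsage *add, const DemoUsage *sub)
{
	dst->blks_hit += add->blks_hit - sub->blks_hit;
	dst->blks_read += add->blks_read - sub->blks_read;
}

/* Like InstrStartNode: remember the global counters at entry. */
static void
demo_start(DemoInstr *instr)
{
	instr->start = demo_global;
}

/* Like InstrStopNode: add whatever the globals gained since demo_start. */
static void
demo_stop(DemoInstr *instr)
{
	/* the global counters only ever increase, so deltas are non-negative */
	assert(demo_global.blks_hit >= instr->start.blks_hit);
	assert(demo_global.blks_read >= instr->start.blks_read);
	demo_accum_diff(&instr->total, &demo_global, &instr->start);
}
```

Because a child node runs between its parent's start and stop calls, anything the child adds to the global counters also falls inside the parent's delta. That is consistent with the documentation hunk stating that an upper-level node's numbers include those of all its child nodes.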
diff -cprN head/src/backend/executor/spi.c work/src/backend/executor/spi.c
*** head/src/backend/executor/spi.c	2009-11-06 09:53:35.834256000 +0900
--- work/src/backend/executor/spi.c	2009-12-14 13:21:28.660753362 +0900
*************** _SPI_execute_plan(SPIPlanPtr plan, Param
*** 1908,1914 ****
  										plansource->query_string,
  										snap, crosscheck_snapshot,
  										dest,
! 										paramLI, false);
  				res = _SPI_pquery(qdesc, fire_triggers,
  								  canSetTag ? tcount : 0);
  				FreeQueryDesc(qdesc);
--- 1908,1914 ----
  										plansource->query_string,
  										snap, crosscheck_snapshot,
  										dest,
! 										paramLI, 0);
  				res = _SPI_pquery(qdesc, fire_triggers,
  								  canSetTag ? tcount : 0);
  				FreeQueryDesc(qdesc);
diff -cprN head/src/backend/storage/buffer/buf_init.c work/src/backend/storage/buffer/buf_init.c
*** head/src/backend/storage/buffer/buf_init.c	2009-01-05 00:22:25.168790000 +0900
--- work/src/backend/storage/buffer/buf_init.c	2009-12-14 11:32:38.421721964 +0900
*************** BufferDesc *BufferDescriptors;
*** 22,37 ****
  char	   *BufferBlocks;
  int32	   *PrivateRefCount;
  
- /* statistics counters */
- long int	ReadBufferCount;
- long int	ReadLocalBufferCount;
- long int	BufferHitCount;
- long int	LocalBufferHitCount;
- long int	BufferFlushCount;
- long int	LocalBufferFlushCount;
- long int	BufFileReadCount;
- long int	BufFileWriteCount;
- 
  
  /*
   * Data Structures:
--- 22,27 ----
diff -cprN head/src/backend/storage/buffer/bufmgr.c work/src/backend/storage/buffer/bufmgr.c
*** head/src/backend/storage/buffer/bufmgr.c	2009-06-12 09:52:43.356212000 +0900
--- work/src/backend/storage/buffer/bufmgr.c	2009-12-14 11:32:38.422722048 +0900
***************
*** 34,39 ****
--- 34,40 ----
  #include <unistd.h>
  
  #include "catalog/catalog.h"
+ #include "executor/instrument.h"
  #include "miscadmin.h"
  #include "pg_trace.h"
  #include "pgstat.h"
*************** ReadBuffer_common(SMgrRelation smgr, boo
*** 300,321 ****
  
  	if (isLocalBuf)
  	{
- 		ReadLocalBufferCount++;
  		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
  		if (found)
! 			LocalBufferHitCount++;
  	}
  	else
  	{
- 		ReadBufferCount++;
- 
  		/*
  		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
  		 * not currently in memory.
  		 */
  		bufHdr = BufferAlloc(smgr, forkNum, blockNum, strategy, &found);
  		if (found)
! 			BufferHitCount++;
  	}
  
  	/* At this point we do NOT hold any locks. */
--- 301,323 ----
  
  	if (isLocalBuf)
  	{
  		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
  		if (found)
! 			pgBufferUsage.local_blks_hit++;
! 		else
! 			pgBufferUsage.local_blks_read++;
  	}
  	else
  	{
  		/*
  		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
  		 * not currently in memory.
  		 */
  		bufHdr = BufferAlloc(smgr, forkNum, blockNum, strategy, &found);
  		if (found)
! 			pgBufferUsage.shared_blks_hit++;
! 		else
! 			pgBufferUsage.shared_blks_read++;
  	}
  
  	/* At this point we do NOT hold any locks. */
*************** SyncOneBuffer(int buf_id, bool skip_rece
*** 1611,1664 ****
  
  
  /*
-  * Return a palloc'd string containing buffer usage statistics.
-  */
- char *
- ShowBufferUsage(void)
- {
- 	StringInfoData str;
- 	float		hitrate;
- 	float		localhitrate;
- 
- 	initStringInfo(&str);
- 
- 	if (ReadBufferCount == 0)
- 		hitrate = 0.0;
- 	else
- 		hitrate = (float) BufferHitCount *100.0 / ReadBufferCount;
- 
- 	if (ReadLocalBufferCount == 0)
- 		localhitrate = 0.0;
- 	else
- 		localhitrate = (float) LocalBufferHitCount *100.0 / ReadLocalBufferCount;
- 
- 	appendStringInfo(&str,
- 	"!\tShared blocks: %10ld read, %10ld written, buffer hit rate = %.2f%%\n",
- 				ReadBufferCount - BufferHitCount, BufferFlushCount, hitrate);
- 	appendStringInfo(&str,
- 	"!\tLocal  blocks: %10ld read, %10ld written, buffer hit rate = %.2f%%\n",
- 					 ReadLocalBufferCount - LocalBufferHitCount, LocalBufferFlushCount, localhitrate);
- 	appendStringInfo(&str,
- 					 "!\tDirect blocks: %10ld read, %10ld written\n",
- 					 BufFileReadCount, BufFileWriteCount);
- 
- 	return str.data;
- }
- 
- void
- ResetBufferUsage(void)
- {
- 	BufferHitCount = 0;
- 	ReadBufferCount = 0;
- 	BufferFlushCount = 0;
- 	LocalBufferHitCount = 0;
- 	ReadLocalBufferCount = 0;
- 	LocalBufferFlushCount = 0;
- 	BufFileReadCount = 0;
- 	BufFileWriteCount = 0;
- }
- 
- /*
   *		AtEOXact_Buffers - clean up at end of transaction.
   *
   *		As of PostgreSQL 8.0, buffer pins should get released by the
--- 1613,1618 ----
*************** FlushBuffer(volatile BufferDesc *buf, SM
*** 1916,1922 ****
  			  (char *) BufHdrGetBlock(buf),
  			  false);
  
! 	BufferFlushCount++;
  
  	/*
  	 * Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
--- 1870,1876 ----
  			  (char *) BufHdrGetBlock(buf),
  			  false);
  
! 	pgBufferUsage.shared_blks_written++;
  
  	/*
  	 * Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
diff -cprN head/src/backend/storage/buffer/localbuf.c work/src/backend/storage/buffer/localbuf.c
*** head/src/backend/storage/buffer/localbuf.c	2009-06-12 09:52:43.356212000 +0900
--- work/src/backend/storage/buffer/localbuf.c	2009-12-14 11:32:38.422722048 +0900
***************
*** 16,21 ****
--- 16,22 ----
  #include "postgres.h"
  
  #include "catalog/catalog.h"
+ #include "executor/instrument.h"
  #include "storage/buf_internals.h"
  #include "storage/bufmgr.h"
  #include "storage/smgr.h"
*************** LocalBufferAlloc(SMgrRelation smgr, Fork
*** 209,215 ****
  		/* Mark not-dirty now in case we error out below */
  		bufHdr->flags &= ~BM_DIRTY;
  
! 		LocalBufferFlushCount++;
  	}
  
  	/*
--- 210,216 ----
  		/* Mark not-dirty now in case we error out below */
  		bufHdr->flags &= ~BM_DIRTY;
  
! 		pgBufferUsage.local_blks_written++;
  	}
  
  	/*
diff -cprN head/src/backend/storage/file/buffile.c work/src/backend/storage/file/buffile.c
*** head/src/backend/storage/file/buffile.c	2009-06-12 09:52:43.356212000 +0900
--- work/src/backend/storage/file/buffile.c	2009-12-14 11:32:38.422722048 +0900
***************
*** 34,39 ****
--- 34,40 ----
  
  #include "postgres.h"
  
+ #include "executor/instrument.h"
  #include "storage/fd.h"
  #include "storage/buffile.h"
  #include "storage/buf_internals.h"
*************** BufFileLoadBuffer(BufFile *file)
*** 240,246 ****
  	file->offsets[file->curFile] += file->nbytes;
  	/* we choose not to advance curOffset here */
  
! 	BufFileReadCount++;
  }
  
  /*
--- 241,247 ----
  	file->offsets[file->curFile] += file->nbytes;
  	/* we choose not to advance curOffset here */
  
! 	pgBufferUsage.temp_blks_read++;
  }
  
  /*
*************** BufFileDumpBuffer(BufFile *file)
*** 304,310 ****
  		file->curOffset += bytestowrite;
  		wpos += bytestowrite;
  
! 		BufFileWriteCount++;
  	}
  	file->dirty = false;
  
--- 305,311 ----
  		file->curOffset += bytestowrite;
  		wpos += bytestowrite;
  
! 		pgBufferUsage.temp_blks_written++;
  	}
  	file->dirty = false;
  
diff -cprN head/src/backend/tcop/postgres.c work/src/backend/tcop/postgres.c
*** head/src/backend/tcop/postgres.c	2009-11-06 09:53:35.834256000 +0900
--- work/src/backend/tcop/postgres.c	2009-12-14 11:32:38.423855423 +0900
*************** ResetUsage(void)
*** 3901,3907 ****
  {
  	getrusage(RUSAGE_SELF, &Save_r);
  	gettimeofday(&Save_t, NULL);
- 	ResetBufferUsage();
  }
  
  void
--- 3901,3906 ----
*************** ShowUsage(const char *title)
*** 3912,3918 ****
  				sys;
  	struct timeval elapse_t;
  	struct rusage r;
- 	char	   *bufusage;
  
  	getrusage(RUSAGE_SELF, &r);
  	gettimeofday(&elapse_t, NULL);
--- 3911,3916 ----
*************** ShowUsage(const char *title)
*** 3986,3995 ****
  					 r.ru_nvcsw, r.ru_nivcsw);
  #endif   /* HAVE_GETRUSAGE */
  
- 	bufusage = ShowBufferUsage();
- 	appendStringInfo(&str, "! buffer usage stats:\n%s", bufusage);
- 	pfree(bufusage);
- 
  	/* remove trailing newline */
  	if (str.data[str.len - 1] == '\n')
  		str.data[--str.len] = '\0';
--- 3984,3989 ----
diff -cprN head/src/backend/tcop/pquery.c work/src/backend/tcop/pquery.c
*** head/src/backend/tcop/pquery.c	2009-10-13 09:24:03.097662000 +0900
--- work/src/backend/tcop/pquery.c	2009-12-14 13:21:28.483752760 +0900
*************** CreateQueryDesc(PlannedStmt *plannedstmt
*** 67,73 ****
  				Snapshot crosscheck_snapshot,
  				DestReceiver *dest,
  				ParamListInfo params,
! 				bool doInstrument)
  {
  	QueryDesc  *qd = (QueryDesc *) palloc(sizeof(QueryDesc));
  
--- 67,73 ----
  				Snapshot crosscheck_snapshot,
  				DestReceiver *dest,
  				ParamListInfo params,
! 				int instrument_options)
  {
  	QueryDesc  *qd = (QueryDesc *) palloc(sizeof(QueryDesc));
  
*************** CreateQueryDesc(PlannedStmt *plannedstmt
*** 80,86 ****
  	qd->crosscheck_snapshot = RegisterSnapshot(crosscheck_snapshot);
  	qd->dest = dest;			/* output dest */
  	qd->params = params;		/* parameter values passed into query */
! 	qd->doInstrument = doInstrument;	/* instrumentation wanted? */
  
  	/* null these fields until set by ExecutorStart */
  	qd->tupDesc = NULL;
--- 80,86 ----
  	qd->crosscheck_snapshot = RegisterSnapshot(crosscheck_snapshot);
  	qd->dest = dest;			/* output dest */
  	qd->params = params;		/* parameter values passed into query */
! 	qd->instrument_options = instrument_options;	/* instrumentation wanted? */
  
  	/* null these fields until set by ExecutorStart */
  	qd->tupDesc = NULL;
*************** CreateUtilityQueryDesc(Node *utilitystmt
*** 111,117 ****
  	qd->crosscheck_snapshot = InvalidSnapshot;	/* RI check snapshot */
  	qd->dest = dest;			/* output dest */
  	qd->params = params;		/* parameter values passed into query */
! 	qd->doInstrument = false;	/* uninteresting for utilities */
  
  	/* null these fields until set by ExecutorStart */
  	qd->tupDesc = NULL;
--- 111,117 ----
  	qd->crosscheck_snapshot = InvalidSnapshot;	/* RI check snapshot */
  	qd->dest = dest;			/* output dest */
  	qd->params = params;		/* parameter values passed into query */
! 	qd->instrument_options = 0;	/* uninteresting for utilities */
  
  	/* null these fields until set by ExecutorStart */
  	qd->tupDesc = NULL;
*************** ProcessQuery(PlannedStmt *plan,
*** 178,184 ****
  	 */
  	queryDesc = CreateQueryDesc(plan, sourceText,
  								GetActiveSnapshot(), InvalidSnapshot,
! 								dest, params, false);
  
  	/*
  	 * Set up to collect AFTER triggers
--- 178,184 ----
  	 */
  	queryDesc = CreateQueryDesc(plan, sourceText,
  								GetActiveSnapshot(), InvalidSnapshot,
! 								dest, params, 0);
  
  	/*
  	 * Set up to collect AFTER triggers
*************** PortalStart(Portal portal, ParamListInfo
*** 515,521 ****
  											InvalidSnapshot,
  											None_Receiver,
  											params,
! 											false);
  
  				/*
  				 * We do *not* call AfterTriggerBeginQuery() here.	We assume
--- 515,521 ----
  											InvalidSnapshot,
  											None_Receiver,
  											params,
! 											0);
  
  				/*
  				 * We do *not* call AfterTriggerBeginQuery() here.	We assume
diff -cprN head/src/include/commands/explain.h work/src/include/commands/explain.h
*** head/src/include/commands/explain.h	2009-12-14 09:21:34.822978000 +0900
--- work/src/include/commands/explain.h	2009-12-14 11:32:38.424717292 +0900
*************** typedef struct ExplainState
*** 30,35 ****
--- 30,36 ----
  	bool		verbose;		/* be verbose */
  	bool		analyze;		/* print actual times */
  	bool		costs;			/* print costs */
+ 	bool		buffers;		/* print buffer usage */
  	ExplainFormat format;		/* output format */
  	/* other states */
  	PlannedStmt *pstmt;			/* top of plan */
diff -cprN head/src/include/executor/execdesc.h work/src/include/executor/execdesc.h
*** head/src/include/executor/execdesc.h	2009-01-05 00:22:25.168790000 +0900
--- work/src/include/executor/execdesc.h	2009-12-14 13:16:23.718736978 +0900
*************** typedef struct QueryDesc
*** 42,48 ****
  	Snapshot	crosscheck_snapshot;	/* crosscheck for RI update/delete */
  	DestReceiver *dest;			/* the destination for tuple output */
  	ParamListInfo params;		/* param values being passed in */
! 	bool		doInstrument;	/* TRUE requests runtime instrumentation */
  
  	/* These fields are set by ExecutorStart */
  	TupleDesc	tupDesc;		/* descriptor for result tuples */
--- 42,48 ----
  	Snapshot	crosscheck_snapshot;	/* crosscheck for RI update/delete */
  	DestReceiver *dest;			/* the destination for tuple output */
  	ParamListInfo params;		/* param values being passed in */
! 	int			instrument_options;		/* OR of InstrumentOption flags */
  
  	/* These fields are set by ExecutorStart */
  	TupleDesc	tupDesc;		/* descriptor for result tuples */
*************** extern QueryDesc *CreateQueryDesc(Planne
*** 60,66 ****
  				Snapshot crosscheck_snapshot,
  				DestReceiver *dest,
  				ParamListInfo params,
! 				bool doInstrument);
  
  extern QueryDesc *CreateUtilityQueryDesc(Node *utilitystmt,
  					   const char *sourceText,
--- 60,66 ----
  				Snapshot crosscheck_snapshot,
  				DestReceiver *dest,
  				ParamListInfo params,
! 				int instrument_options);
  
  extern QueryDesc *CreateUtilityQueryDesc(Node *utilitystmt,
  					   const char *sourceText,
diff -cprN head/src/include/executor/executor.h work/src/include/executor/executor.h
*** head/src/include/executor/executor.h	2009-12-09 13:45:22.745455000 +0900
--- work/src/include/executor/executor.h	2009-12-14 13:09:03.597764161 +0900
*************** extern void InitResultRelInfo(ResultRelI
*** 161,167 ****
  				  Relation resultRelationDesc,
  				  Index resultRelationIndex,
  				  CmdType operation,
! 				  bool doInstrument);
  extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
  extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
  extern void ExecConstraints(ResultRelInfo *resultRelInfo,
--- 161,167 ----
  				  Relation resultRelationDesc,
  				  Index resultRelationIndex,
  				  CmdType operation,
! 				  int instrument_options);
  extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
  extern bool ExecContextForcesOids(PlanState *planstate, bool *hasoids);
  extern void ExecConstraints(ResultRelInfo *resultRelInfo,
diff -cprN head/src/include/executor/instrument.h work/src/include/executor/instrument.h
*** head/src/include/executor/instrument.h	2009-01-05 00:22:25.168790000 +0900
--- work/src/include/executor/instrument.h	2009-12-14 13:17:12.753756326 +0900
***************
*** 16,37 ****
  #include "portability/instr_time.h"
  
  
  typedef struct Instrumentation
  {
  	/* Info about current plan cycle: */
  	bool		running;		/* TRUE if we've completed first tuple */
  	instr_time	starttime;		/* Start time of current iteration of node */
  	instr_time	counter;		/* Accumulated runtime for this node */
  	double		firsttuple;		/* Time for first tuple of this cycle */
  	double		tuplecount;		/* Tuples emitted so far this cycle */
  	/* Accumulated statistics across all completed cycles: */
  	double		startup;		/* Total startup time (in seconds) */
  	double		total;			/* Total total time (in seconds) */
  	double		ntuples;		/* Total tuples produced */
  	double		nloops;			/* # of run cycles for this node */
  } Instrumentation;
  
! extern Instrumentation *InstrAlloc(int n);
  extern void InstrStartNode(Instrumentation *instr);
  extern void InstrStopNode(Instrumentation *instr, double nTuples);
  extern void InstrEndLoop(Instrumentation *instr);
--- 16,61 ----
  #include "portability/instr_time.h"
  
  
+ typedef struct BufferUsage
+ {
+ 	long	shared_blks_hit;		/* # of shared buffer hits */
+ 	long	shared_blks_read;		/* # of shared disk blocks read */
+ 	long	shared_blks_written;	/* # of shared disk blocks written */
+ 	long	local_blks_hit;			/* # of local buffer hits */
+ 	long	local_blks_read;		/* # of local disk blocks read */
+ 	long	local_blks_written;		/* # of local disk blocks written */
+ 	long	temp_blks_read;			/* # of temp blocks read */
+ 	long	temp_blks_written;		/* # of temp blocks written */
+ } BufferUsage;
+ 
+ typedef enum InstrumentOption
+ {
+ 	INSTRUMENT_TIMER	= 1 << 0,		/* needs timer */
+ 	INSTRUMENT_BUFFERS	= 1 << 1,		/* needs buffer usage */
+ 	INSTRUMENT_ALL		= 0x7FFFFFFF
+ } InstrumentOption;
+ 
  typedef struct Instrumentation
  {
  	/* Info about current plan cycle: */
  	bool		running;		/* TRUE if we've completed first tuple */
+ 	bool		needs_bufusage;	/* TRUE if we need buffer usage */
  	instr_time	starttime;		/* Start time of current iteration of node */
  	instr_time	counter;		/* Accumulated runtime for this node */
  	double		firsttuple;		/* Time for first tuple of this cycle */
  	double		tuplecount;		/* Tuples emitted so far this cycle */
+ 	BufferUsage	bufusage_start;	/* Buffer usage at start */
  	/* Accumulated statistics across all completed cycles: */
  	double		startup;		/* Total startup time (in seconds) */
  	double		total;			/* Total total time (in seconds) */
  	double		ntuples;		/* Total tuples produced */
  	double		nloops;			/* # of run cycles for this node */
+ 	BufferUsage	bufusage;		/* Total buffer usage */
  } Instrumentation;
  
! extern BufferUsage		pgBufferUsage;
! 
! extern Instrumentation *InstrAlloc(int n, int instrument_options);
  extern void InstrStartNode(Instrumentation *instr);
  extern void InstrStopNode(Instrumentation *instr, double nTuples);
  extern void InstrEndLoop(Instrumentation *instr);
diff -cprN head/src/include/nodes/execnodes.h work/src/include/nodes/execnodes.h
*** head/src/include/nodes/execnodes.h	2009-12-09 13:45:22.745455000 +0900
--- work/src/include/nodes/execnodes.h	2009-12-14 13:09:42.846822263 +0900
*************** typedef struct EState
*** 370,376 ****
  	uint32		es_processed;	/* # of tuples processed */
  	Oid			es_lastoid;		/* last oid processed (by INSERT) */
  
! 	bool		es_instrument;	/* true requests runtime instrumentation */
  	bool		es_select_into; /* true if doing SELECT INTO */
  	bool		es_into_oids;	/* true to generate OIDs in SELECT INTO */
  
--- 370,376 ----
  	uint32		es_processed;	/* # of tuples processed */
  	Oid			es_lastoid;		/* last oid processed (by INSERT) */
  
! 	int			es_instrument;	/* OR of InstrumentOption flags */
  	bool		es_select_into; /* true if doing SELECT INTO */
  	bool		es_into_oids;	/* true to generate OIDs in SELECT INTO */
  
diff -cprN head/src/include/storage/buf_internals.h work/src/include/storage/buf_internals.h
*** head/src/include/storage/buf_internals.h	2009-06-12 09:52:43.356212000 +0900
--- work/src/include/storage/buf_internals.h	2009-12-14 11:32:38.424717292 +0900
*************** extern PGDLLIMPORT BufferDesc *BufferDes
*** 173,188 ****
  /* in localbuf.c */
  extern BufferDesc *LocalBufferDescriptors;
  
- /* event counters in buf_init.c */
- extern long int ReadBufferCount;
- extern long int ReadLocalBufferCount;
- extern long int BufferHitCount;
- extern long int LocalBufferHitCount;
- extern long int BufferFlushCount;
- extern long int LocalBufferFlushCount;
- extern long int BufFileReadCount;
- extern long int BufFileWriteCount;
- 
  
  /*
   * Internal routines: only called by bufmgr
--- 173,178 ----
diff -cprN head/src/include/storage/bufmgr.h work/src/include/storage/bufmgr.h
*** head/src/include/storage/bufmgr.h	2009-06-12 09:52:43.356212000 +0900
--- work/src/include/storage/bufmgr.h	2009-12-14 11:32:38.424717292 +0900
*************** extern Buffer ReleaseAndReadBuffer(Buffe
*** 173,180 ****
  extern void InitBufferPool(void);
  extern void InitBufferPoolAccess(void);
  extern void InitBufferPoolBackend(void);
- extern char *ShowBufferUsage(void);
- extern void ResetBufferUsage(void);
  extern void AtEOXact_Buffers(bool isCommit);
  extern void PrintBufferLeakWarning(Buffer buffer);
  extern void CheckPointBuffers(int flags);
--- 173,178 ----
#65 Robert Haas
robertmhaas@gmail.com
In reply to: Takahiro Itagaki (#64)
Re: EXPLAIN BUFFERS

On Sun, Dec 13, 2009 at 11:49 PM, Takahiro Itagaki
<itagaki.takahiro@oss.ntt.co.jp> wrote:

Robert Haas <robertmhaas@gmail.com> wrote:

Well, I think we need to do something.  I don't really want to tack
another 5-6% overhead onto EXPLAIN ANALYZE.  Maybe we could recast the
doInstrument argument as a set of OR'd flags?

I'm thinking the same thing (OR'd flags) right now.

The attached patch adds INSTRUMENT_TIMER and INSTRUMENT_BUFFERS flags.
The types of QueryDesc.doInstrument (renamed to instrument_options) and
EState.es_instrument are changed from bool to int, and they store an
OR of InstrumentOption flags. INSTRUMENT_TIMER is always enabled when
instrumentation is initialized, but INSTRUMENT_BUFFERS is enabled only if
we use EXPLAIN BUFFERS. I think the flag options are not a bad idea because
they are extensible. For example, we could support EXPLAIN CPU_USAGE someday.

One issue is in the top-level instrumentation (queryDesc->totaltime).
Since the field might be used by multiple plugins, the first initializer
needs to initialize the counter with all options. I used INSTRUMENT_ALL
for it in the patch.
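
The flag handling described above can be sketched roughly as follows. This is a minimal illustration, not the patch's code: the enum values are copied from the patch's instrument.h, but DemoInstr, demo_instr_alloc() and demo_plugin_start() are hypothetical stand-ins for Instrumentation, InstrAlloc() and a plugin's ExecutorStart hook, and needs_timer is an assumed field (the patch itself only adds needs_bufusage).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Flag values as they appear in the patch's instrument.h. */
typedef enum InstrumentOption
{
	INSTRUMENT_TIMER   = 1 << 0,	/* needs timer */
	INSTRUMENT_BUFFERS = 1 << 1,	/* needs buffer usage */
	INSTRUMENT_ALL     = 0x7FFFFFFF
} InstrumentOption;

/* Hypothetical, reduced stand-in for the Instrumentation struct. */
typedef struct DemoInstr
{
	bool		needs_timer;	/* assumed field, for illustration only */
	bool		needs_bufusage;	/* mirrors the field added by the patch */
} DemoInstr;

/* Hypothetical allocator: decode the OR'd flags once, at allocation time. */
static DemoInstr *
demo_instr_alloc(int instrument_options)
{
	DemoInstr  *instr = calloc(1, sizeof(DemoInstr));

	if (instr == NULL)
		return NULL;
	instr->needs_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
	instr->needs_bufusage = (instrument_options & INSTRUMENT_BUFFERS) != 0;
	return instr;
}

/*
 * Hypothetical plugin hook: whichever plugin runs first allocates the
 * shared top-level counter, so it asks for every option. Later plugins
 * find the counter already set and leave it alone.
 */
static void
demo_plugin_start(DemoInstr **totaltime)
{
	assert(totaltime != NULL);
	if (*totaltime == NULL)
		*totaltime = demo_instr_alloc(INSTRUMENT_ALL);
}
```

A caller that wants timing only would pass INSTRUMENT_TIMER, and EXPLAIN (ANALYZE, BUFFERS) would pass INSTRUMENT_TIMER | INSTRUMENT_BUFFERS. Requesting INSTRUMENT_ALL for the shared counter means a plugin that runs later and wants buffer counts is not left with a counter that was allocated without them.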

=# EXPLAIN (ANALYZE) SELECT * FROM pgbench_accounts;
                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on pgbench_accounts  (cost=0.00..263935.00 rows=10000000 width=97) (actual time=0.003..572.126 rows=10000000 loops=1)
 Total runtime: 897.729 ms

=# EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM pgbench_accounts;
                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on pgbench_accounts  (cost=0.00..263935.00 rows=10000000 width=97) (actual time=0.002..580.642 rows=10000000 loops=1)
  Buffers: shared hit=163935
 Total runtime: 955.744 ms
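
The header changes in the patch (a global pgBufferUsage plus the bufusage_start and bufusage fields) suggest that a node's "Buffers:" figures are accumulated as differences between snapshots of the global counters. A minimal sketch under that assumption follows. DemoBufferUsage is reduced to two of the eight counters, and demo_pgBufferUsage, DemoNodeInstr, demo_start_node() and demo_stop_node() are hypothetical names, not the patch's functions.

```c
#include <stdbool.h>

/* Reduced stand-in for BufferUsage: only two of the patch's counters. */
typedef struct DemoBufferUsage
{
	long		shared_blks_hit;	/* # of shared buffer hits */
	long		shared_blks_read;	/* # of shared disk blocks read */
} DemoBufferUsage;

/* Hypothetical stand-in for the backend-global pgBufferUsage. */
static DemoBufferUsage demo_pgBufferUsage;

/* Hypothetical, reduced stand-in for the Instrumentation struct. */
typedef struct DemoNodeInstr
{
	bool		needs_bufusage;		/* set when INSTRUMENT_BUFFERS is requested */
	DemoBufferUsage bufusage_start;	/* global counters when the node started */
	DemoBufferUsage bufusage;		/* total attributed to this node */
} DemoNodeInstr;

/* Snapshot the global counters when the node begins an iteration. */
static void
demo_start_node(DemoNodeInstr *instr)
{
	if (instr->needs_bufusage)
		instr->bufusage_start = demo_pgBufferUsage;
}

/* Add the growth since the snapshot to the node's running total. */
static void
demo_stop_node(DemoNodeInstr *instr)
{
	if (!instr->needs_bufusage)
		return;
	instr->bufusage.shared_blks_hit +=
		demo_pgBufferUsage.shared_blks_hit - instr->bufusage_start.shared_blks_hit;
	instr->bufusage.shared_blks_read +=
		demo_pgBufferUsage.shared_blks_read - instr->bufusage_start.shared_blks_read;
}
```

With this scheme the buffer manager only increments one set of global counters, and the per-node copying and subtraction is skipped entirely when needs_bufusage is false, which is the point of making buffer accounting optional.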

That seems very promising, but it's almost midnight here so I have to
turn in for now. I'll take another look at this tomorrow.

...Robert

#66 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Takahiro Itagaki (#61)
Re: EXPLAIN BUFFERS

Takahiro Itagaki <itagaki.takahiro@oss.ntt.co.jp> writes:

Tom Lane <tgl@sss.pgh.pa.us> wrote:

Pushing extra arguments around would create overhead of its own ...
overhead that would be paid even when not using EXPLAIN at all.

I cannot understand what you mean... The additional argument should
not be a performance overhead because the code path is run only once
per execution.

Hmm, maybe, but still: once you have two flags you're likely to need
more. I concur with turning doInstrument into a bitmask as per Robert's
suggestion downthread.

regards, tom lane

#67 Robert Haas
robertmhaas@gmail.com
In reply to: Takahiro Itagaki (#64)
Re: EXPLAIN BUFFERS

On Sun, Dec 13, 2009 at 11:49 PM, Takahiro Itagaki
<itagaki.takahiro@oss.ntt.co.jp> wrote:

The attached patch [...]

Committed.

...Robert