Measuring relation free space

greg@2ndQuadrant.com

about 14 years ago

In reply to: Mark Kirkwood (#4)

Re: Measuring relation free space

On 11/06/2011 11:55 PM, Mark Kirkwood wrote:

I am guessing (at this point very much guessing) that pg_freespace map
may return its data faster, as pageinspect is gonna have to grovel
through all the pages for itself (whereas pg_freespacemap relies on
using info from the ... free space map fork).

I started with pageinspect because I wasn't sure if other methods would
be as accurate. For example, I wasn't sure until just before submission
that only the free space and size of the relation are needed to get a
useful measure here; at one point I was considering some other things
too. I've ruled them out as unnecessary as far as I can tell, but I
can't claim my tests are exhaustive. Some review confirmation that this
measure is useful for other people is one thing I was hoping for
feedback on, as one thing to consider in addition to the actual
implementation.

If this measurement is the only one needed, than as I said at the start
of the thread here it might easily be implemented to run just against
the free space map instead. I'm thinking of what's been sent so far as
a UI with matching output it should produce. If it's possible to get
the same numbers faster, exactly how to implement the function under the
hood is easy enough to change. Jaime already has a new version in
development that adds a ring buffer for example.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us

jaime@2ndquadrant.com

about 14 years ago

In reply to: Magnus Hagander (#2)

1 attachment(s)

Re: Measuring relation free space

On Sun, Nov 6, 2011 at 5:38 AM, Magnus Hagander <magnus@hagander.net> wrote:

Looks pretty useful.

thanks for the review, attached is a new version of it

One quick stylistic comment - we don't generally use "* 1.0" to turn
an int into a double - just use a cast.

Hooking into a ring buffer seems like an almost requirement before you
can run this on a larger production system, wouldn't it? I don't know
how hard that is code-wise, but it certainly seems worthwhile.

seems it wasn't too difficult... i just have to indicate the right
buffer access strategy so it's using a ring buffer now

--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación

Attachments:

relation_free_space-v3.patchtext/x-patch; charset=US-ASCII; name=relation_free_space-v3.patchDownload

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
new file mode 100644
index dbb2158..8be21ed
*** a/contrib/pageinspect/btreefuncs.c
--- b/contrib/pageinspect/btreefuncs.c
***************
*** 34,39 ****
--- 34,40 ----
  #include "utils/builtins.h"
  #include "utils/rel.h"
  
+ #include "btreefuncs.h"
  
  extern Datum bt_metap(PG_FUNCTION_ARGS);
  extern Datum bt_page_items(PG_FUNCTION_ARGS);
*************** GetBTPageStatistics(BlockNumber blkno, B
*** 155,160 ****
--- 156,204 ----
  		stat->avg_item_size = 0;
  }
  
+ /*------------------------------------------------
+  * GetBTRelationFreeSpace
+  *
+  * Get the free space for a btree index.
+  * This is a helper function for relation_free_space()
+  *------------------------------------------------
+  */
+ float4
+ GetBTRelationFreeSpace(Relation rel)
+ {
+ 	BTPageStat stat;
+ 
+ 	Buffer		buffer;
+ 	BlockNumber blkno;
+ 	BlockNumber totalBlcksInRelation = RelationGetNumberOfBlocks(rel);
+ 	Size 		free_space = 0;
+ 	double		free_percent = 0;
+ 
+     BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+ 	
+ 	/* Skip page 0 because it is a metapage */
+ 	for (blkno = 1; blkno < totalBlcksInRelation; blkno++)
+ 	{
+ 		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
+ 		/* 
+ 		 * get the statistics of the indexes and use that info
+ 		 * to determine free space on the page
+ 		 */
+ 		GetBTPageStatistics(blkno, buffer, &stat);
+ 		if (stat.type == 'd')
+ 			free_space += stat.page_size;
+ 		else
+ 			free_space += stat.free_size;		
+ 
+ 		ReleaseBuffer(buffer);
+ 	}
+ 
+ 	if (totalBlcksInRelation > 1)
+ 		free_percent = ((float4) free_space) / ((totalBlcksInRelation - 1) * BLCKSZ);
+ 	return free_percent;
+ }
+ 
+ 
  /* -----------------------------------------------
   * bt_page()
   *
diff --git a/contrib/pageinspect/btreefuncs.h b/contrib/pageinspect/btreefuncs.h
new file mode 100644
index ...549f878
*** a/contrib/pageinspect/btreefuncs.h
--- b/contrib/pageinspect/btreefuncs.h
***************
*** 0 ****
--- 1,5 ----
+ /*
+  * contrib/pageinspect/btreefuncs.h
+  */
+ 
+ float4 GetBTRelationFreeSpace(Relation);
diff --git a/contrib/pageinspect/heapfuncs.c b/contrib/pageinspect/heapfuncs.c
new file mode 100644
index fa50655..e7436fb
*** a/contrib/pageinspect/heapfuncs.c
--- b/contrib/pageinspect/heapfuncs.c
***************
*** 28,33 ****
--- 28,36 ----
  #include "funcapi.h"
  #include "utils/builtins.h"
  #include "miscadmin.h"
+ #include "storage/bufmgr.h"
+ 
+ #include "heapfuncs.h"
  
  Datum		heap_page_items(PG_FUNCTION_ARGS);
  
*************** bits_to_text(bits8 *bits, int len)
*** 55,60 ****
--- 58,96 ----
  }
  
  
+ /*
+  * GetHeapRelationFreeSpace()
+  *
+  * Get the free space for a heap relation.
+  * This is a helper function for relation_free_space()
+  */
+ float4
+ GetHeapRelationFreeSpace(Relation rel)
+ {
+ 	Buffer      buffer;
+ 	Page		page;
+ 	BlockNumber blkno;
+ 	BlockNumber totalBlcksInRelation = RelationGetNumberOfBlocks(rel);
+ 	Size        free_space = 0;
+ 	double		free_percent = 0;
+ 
+ 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+ 
+ 	for (blkno = 0; blkno < totalBlcksInRelation; blkno++)
+ 	{
+ 		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
+ 		page   = BufferGetPage(buffer);
+ 		free_space += PageGetHeapFreeSpace(page);
+ 
+ 		ReleaseBuffer(buffer);
+ 	}
+ 
+ 	if (totalBlcksInRelation > 0)
+ 		free_percent = ((float4) free_space) / (totalBlcksInRelation * BLCKSZ);
+ 	return free_percent;
+ }
+ 
+ 
  /*
   * heap_page_items
   *
diff --git a/contrib/pageinspect/heapfuncs.h b/contrib/pageinspect/heapfuncs.h
new file mode 100644
index ...17b7cb3
*** a/contrib/pageinspect/heapfuncs.h
--- b/contrib/pageinspect/heapfuncs.h
***************
*** 0 ****
--- 1,5 ----
+ /*
+  * contrib/pageinspect/heapfuncs.h
+  */
+ 
+ float4 GetHeapRelationFreeSpace(Relation);
diff --git a/contrib/pageinspect/pageinspect--1.0.sql b/contrib/pageinspect/pageinspect--1.0.sql
new file mode 100644
index 5613956..4502a13
*** a/contrib/pageinspect/pageinspect--1.0.sql
--- b/contrib/pageinspect/pageinspect--1.0.sql
*************** CREATE FUNCTION fsm_page_contents(IN pag
*** 105,107 ****
--- 105,115 ----
  RETURNS text
  AS 'MODULE_PATHNAME', 'fsm_page_contents'
  LANGUAGE C STRICT;
+ 
+ --
+ -- relation_free_space()
+ --
+ CREATE FUNCTION relation_free_space(IN relname text)
+ RETURNS real
+ AS 'MODULE_PATHNAME', 'relation_free_space'
+ LANGUAGE C STRICT;
diff --git a/contrib/pageinspect/pageinspect--unpackaged--1.0.sql b/contrib/pageinspect/pageinspect--unpackaged--1.0.sql
new file mode 100644
index 13e2167..ccfe6e4
*** a/contrib/pageinspect/pageinspect--unpackaged--1.0.sql
--- b/contrib/pageinspect/pageinspect--unpackaged--1.0.sql
*************** ALTER EXTENSION pageinspect ADD function
*** 29,31 ****
--- 29,32 ----
  ALTER EXTENSION pageinspect ADD function bt_page_stats(text,integer);
  ALTER EXTENSION pageinspect ADD function bt_page_items(text,integer);
  ALTER EXTENSION pageinspect ADD function fsm_page_contents(bytea);
+ ALTER EXTENSION pageinspect ADD function relation_free_space(integer);
diff --git a/contrib/pageinspect/rawpage.c b/contrib/pageinspect/rawpage.c
new file mode 100644
index 362ad84..988d1e3
*** a/contrib/pageinspect/rawpage.c
--- b/contrib/pageinspect/rawpage.c
***************
*** 23,33 ****
--- 23,37 ----
  #include "utils/builtins.h"
  #include "utils/rel.h"
  
+ #include "btreefuncs.h"
+ #include "heapfuncs.h"
+ 
  PG_MODULE_MAGIC;
  
  Datum		get_raw_page(PG_FUNCTION_ARGS);
  Datum		get_raw_page_fork(PG_FUNCTION_ARGS);
  Datum		page_header(PG_FUNCTION_ARGS);
+ Datum		relation_free_space(PG_FUNCTION_ARGS);
  
  static bytea *get_raw_page_internal(text *relname, ForkNumber forknum,
  					  BlockNumber blkno);
*************** page_header(PG_FUNCTION_ARGS)
*** 227,229 ****
--- 231,285 ----
  
  	PG_RETURN_DATUM(result);
  }
+ 
+ 
+ /*
+  * relation_free_space
+  *
+  * Returns the percentage of free space for a given relation.
+  * Supported relation types are tables and indexes.  TOAST is not
+  * considered/handled.
+  */
+ PG_FUNCTION_INFO_V1(relation_free_space);
+ 
+ Datum
+ relation_free_space(PG_FUNCTION_ARGS)
+ {
+ 	text       *relname = PG_GETARG_TEXT_P(0);
+ 	RangeVar	*relrv;
+ 	Relation	rel;
+ 	float4 		free_space = 0;
+ 
+ 	relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname));
+ 	rel = relation_openrv(relrv, AccessShareLock);
+ 
+ 	/* Only allow tables or indexes */
+ 	if (rel->rd_rel->relkind != RELKIND_INDEX && rel->rd_rel->relkind != RELKIND_RELATION)
+ 		ereport(ERROR,
+ 			(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ 				errmsg("cannot get free space from object \"%s\"",
+ 					RelationGetRelationName(rel))));
+ 
+ 	/*
+ 	 * Reject attempts to read non-local temporary relations.  We would be
+ 	 * likely to get wrong data, since we have no visibility into the owning
+ 	 * session's local buffers.
+ 	 */
+ 	if (RELATION_IS_OTHER_TEMP(rel))
+ 		ereport(ERROR,
+ 			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ 				errmsg("cannot access temporary tables of other sessions")));
+ 
+ 	switch (rel->rd_rel->relkind)
+ 	{
+ 		case RELKIND_RELATION: 
+ 			free_space = GetHeapRelationFreeSpace(rel);
+ 			break;
+ 		case RELKIND_INDEX: 
+ 			free_space = GetBTRelationFreeSpace(rel);
+ 			break;
+ 	}
+ 	relation_close(rel, AccessShareLock);
+ 
+ 	PG_RETURN_FLOAT4(free_space);
+ }
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
new file mode 100644
index acbb05b..49708c3
*** a/doc/src/sgml/pageinspect.sgml
--- b/doc/src/sgml/pageinspect.sgml
*************** test=# SELECT * FROM bt_page_items('pg_c
*** 196,201 ****
--- 196,216 ----
       </para>
      </listitem>
     </varlistentry>
+ 
+    <varlistentry>
+     <term>
+      <function>relation_free_space(relname text) returns float</function>
+     </term>
+ 
+     <listitem>
+      <para>
+       <function>relation_free_space</function> returns the total amount of free
+       free space in a table or index.  The entire relation is scanned in order
+       to compute this amount, which may generate a large amount of disk reads.
+       Information stored in <acronym>TOAST</> tables is not included.
+      </para>
+     </listitem>
+    </varlistentry>
    </variablelist>
   </sect2>

Robert Treat

rob@xzilla.net

about 14 years ago

In reply to: Greg Smith (#5)

Re: Measuring relation free space

On Tue, Nov 8, 2011 at 1:07 PM, Greg Smith <greg@2ndquadrant.com> wrote:

On 11/06/2011 11:55 PM, Mark Kirkwood wrote:

I am guessing (at this point very much guessing) that pg_freespace map may
return its data faster, as pageinspect is gonna have to grovel through all
the pages for itself (whereas pg_freespacemap relies on using info from the
... free space map fork).

I started with pageinspect because I wasn't sure if other methods would be
as accurate. For example, I wasn't sure until just before submission that
only the free space and size of the relation are needed to get a useful
measure here; at one point I was considering some other things too. I've
ruled them out as unnecessary as far as I can tell, but I can't claim my
tests are exhaustive. Some review confirmation that this measure is useful
for other people is one thing I was hoping for feedback on, as one thing to
consider in addition to the actual implementation.

If this measurement is the only one needed, than as I said at the start of
the thread here it might easily be implemented to run just against the free
space map instead. I'm thinking of what's been sent so far as a UI with
matching output it should produce. If it's possible to get the same numbers
faster, exactly how to implement the function under the hood is easy enough
to change. Jaime already has a new version in development that adds a ring
buffer for example.

It's already easy to get "good enough" numbers based on user space
tools with very little overhead, so I think it's more important that
the server side tool be accurate rather than fast. Of course, if we
can get both, that's a bonus, but I'd rather not go that route at the
expense of accuracy. Just my .02.

Robert Treat
conjecture: xzilla.net
consulting: omniti.com

greg@2ndQuadrant.com

about 14 years ago

In reply to: Robert Treat (#7)

Re: Measuring relation free space

On 11/08/2011 05:07 PM, Robert Treat wrote:

It's already easy to get "good enough" numbers based on user space
tools with very little overhead, so I think it's more important that
the server side tool be accurate rather than fast.

What user space method do you consider good enough here? I haven't
found any approximation that I was really happy with; wouldn't have
bothered with this otherwise.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us

Alvaro Herrera

alvherre@commandprompt.com

about 14 years ago

In reply to: Jaime Casanova (#6)

Re: Measuring relation free space

Excerpts from Jaime Casanova's message of mar nov 08 18:12:25 -0300 2011:

On Sun, Nov 6, 2011 at 5:38 AM, Magnus Hagander <magnus@hagander.net> wrote:

Looks pretty useful.

thanks for the review, attached is a new version of it

Note that AFAIK you shouldn't update the 1.0 extension script ... you
have to create a 1.1 version (or whatever), update the default version
in the control file, and create an 1.0--1.1 script to upgrade from the
original version to 1.1.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#10

Robert Treat

rob@xzilla.net

about 14 years ago

In reply to: Greg Smith (#8)

Re: Measuring relation free space

On Tue, Nov 8, 2011 at 7:19 PM, Greg Smith <greg@2ndquadrant.com> wrote:

On 11/08/2011 05:07 PM, Robert Treat wrote:

It's already easy to get "good enough" numbers based on user space
tools with very little overhead, so I think it's more important that
the server side tool be accurate rather than fast.

What user space method do you consider good enough here? I haven't found
any approximation that I was really happy with; wouldn't have bothered with
this otherwise.

check_postgres and the pg_bloat_report both use a method of comparing
on disk size vs estimated size based on table structure (or index
info). Run regularly, it's certainly possible to keep bloat under
control. That said, I'd still like to see something more accurate.

Robert Treat
conjecture: xzilla.net
consulting: omniti.com

#11

jaime@2ndquadrant.com

about 14 years ago

In reply to: Alvaro Herrera (#9)

1 attachment(s)

Re: Measuring relation free space

On Wed, Nov 9, 2011 at 7:58 AM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:

Excerpts from Jaime Casanova's message of mar nov 08 18:12:25 -0300 2011:

On Sun, Nov 6, 2011 at 5:38 AM, Magnus Hagander <magnus@hagander.net> wrote:

Looks pretty useful.

thanks for the review, attached is a new version of it

Note that AFAIK you shouldn't update the 1.0 extension script ... you
have to create a 1.1 version (or whatever), update the default version
in the control file, and create an 1.0--1.1 script to upgrade from the
original version to 1.1.

good point... fixed that...
a question i have is: are we supposed to let the old script (1.0) around?

--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación

Attachments:

relation_free_space-v4.patchtext/x-patch; charset=US-ASCII; name=relation_free_space-v4.patchDownload

diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile
new file mode 100644
index 13ba6d3..63fab95
*** a/contrib/pageinspect/Makefile
--- b/contrib/pageinspect/Makefile
*************** MODULE_big	= pageinspect
*** 4,10 ****
  OBJS		= rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o
  
  EXTENSION = pageinspect
! DATA = pageinspect--1.0.sql pageinspect--unpackaged--1.0.sql
  
  ifdef USE_PGXS
  PG_CONFIG = pg_config
--- 4,12 ----
  OBJS		= rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o
  
  EXTENSION = pageinspect
! DATA = pageinspect--1.0.sql pageinspect--1.1.sql \
!        pageinspect--1.0--1.1.sql \
!        pageinspect--unpackaged--1.0.sql
  
  ifdef USE_PGXS
  PG_CONFIG = pg_config
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
new file mode 100644
index dbb2158..8be21ed
*** a/contrib/pageinspect/btreefuncs.c
--- b/contrib/pageinspect/btreefuncs.c
***************
*** 34,39 ****
--- 34,40 ----
  #include "utils/builtins.h"
  #include "utils/rel.h"
  
+ #include "btreefuncs.h"
  
  extern Datum bt_metap(PG_FUNCTION_ARGS);
  extern Datum bt_page_items(PG_FUNCTION_ARGS);
*************** GetBTPageStatistics(BlockNumber blkno, B
*** 155,160 ****
--- 156,204 ----
  		stat->avg_item_size = 0;
  }
  
+ /*------------------------------------------------
+  * GetBTRelationFreeSpace
+  *
+  * Get the free space for a btree index.
+  * This is a helper function for relation_free_space()
+  *------------------------------------------------
+  */
+ float4
+ GetBTRelationFreeSpace(Relation rel)
+ {
+ 	BTPageStat stat;
+ 
+ 	Buffer		buffer;
+ 	BlockNumber blkno;
+ 	BlockNumber totalBlcksInRelation = RelationGetNumberOfBlocks(rel);
+ 	Size 		free_space = 0;
+ 	double		free_percent = 0;
+ 
+     BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+ 	
+ 	/* Skip page 0 because it is a metapage */
+ 	for (blkno = 1; blkno < totalBlcksInRelation; blkno++)
+ 	{
+ 		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
+ 		/* 
+ 		 * get the statistics of the indexes and use that info
+ 		 * to determine free space on the page
+ 		 */
+ 		GetBTPageStatistics(blkno, buffer, &stat);
+ 		if (stat.type == 'd')
+ 			free_space += stat.page_size;
+ 		else
+ 			free_space += stat.free_size;		
+ 
+ 		ReleaseBuffer(buffer);
+ 	}
+ 
+ 	if (totalBlcksInRelation > 1)
+ 		free_percent = ((float4) free_space) / ((totalBlcksInRelation - 1) * BLCKSZ);
+ 	return free_percent;
+ }
+ 
+ 
  /* -----------------------------------------------
   * bt_page()
   *
diff --git a/contrib/pageinspect/btreefuncs.h b/contrib/pageinspect/btreefuncs.h
new file mode 100644
index ...549f878
*** a/contrib/pageinspect/btreefuncs.h
--- b/contrib/pageinspect/btreefuncs.h
***************
*** 0 ****
--- 1,5 ----
+ /*
+  * contrib/pageinspect/btreefuncs.h
+  */
+ 
+ float4 GetBTRelationFreeSpace(Relation);
diff --git a/contrib/pageinspect/heapfuncs.c b/contrib/pageinspect/heapfuncs.c
new file mode 100644
index fa50655..e7436fb
*** a/contrib/pageinspect/heapfuncs.c
--- b/contrib/pageinspect/heapfuncs.c
***************
*** 28,33 ****
--- 28,36 ----
  #include "funcapi.h"
  #include "utils/builtins.h"
  #include "miscadmin.h"
+ #include "storage/bufmgr.h"
+ 
+ #include "heapfuncs.h"
  
  Datum		heap_page_items(PG_FUNCTION_ARGS);
  
*************** bits_to_text(bits8 *bits, int len)
*** 55,60 ****
--- 58,96 ----
  }
  
  
+ /*
+  * GetHeapRelationFreeSpace()
+  *
+  * Get the free space for a heap relation.
+  * This is a helper function for relation_free_space()
+  */
+ float4
+ GetHeapRelationFreeSpace(Relation rel)
+ {
+ 	Buffer      buffer;
+ 	Page		page;
+ 	BlockNumber blkno;
+ 	BlockNumber totalBlcksInRelation = RelationGetNumberOfBlocks(rel);
+ 	Size        free_space = 0;
+ 	double		free_percent = 0;
+ 
+ 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+ 
+ 	for (blkno = 0; blkno < totalBlcksInRelation; blkno++)
+ 	{
+ 		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
+ 		page   = BufferGetPage(buffer);
+ 		free_space += PageGetHeapFreeSpace(page);
+ 
+ 		ReleaseBuffer(buffer);
+ 	}
+ 
+ 	if (totalBlcksInRelation > 0)
+ 		free_percent = ((float4) free_space) / (totalBlcksInRelation * BLCKSZ);
+ 	return free_percent;
+ }
+ 
+ 
  /*
   * heap_page_items
   *
diff --git a/contrib/pageinspect/heapfuncs.h b/contrib/pageinspect/heapfuncs.h
new file mode 100644
index ...17b7cb3
*** a/contrib/pageinspect/heapfuncs.h
--- b/contrib/pageinspect/heapfuncs.h
***************
*** 0 ****
--- 1,5 ----
+ /*
+  * contrib/pageinspect/heapfuncs.h
+  */
+ 
+ float4 GetHeapRelationFreeSpace(Relation);
diff --git a/contrib/pageinspect/pageinspect--1.0--1.1.sql b/contrib/pageinspect/pageinspect--1.0--1.1.sql
new file mode 100644
index ...c97aeba
*** a/contrib/pageinspect/pageinspect--1.0--1.1.sql
--- b/contrib/pageinspect/pageinspect--1.0--1.1.sql
***************
*** 0 ****
--- 1,12 ----
+ /* contrib/pageinspect/pageinspect--1.0--1.1.sql */
+ 
+ -- complain if script is sourced in psql, rather than via CREATE EXTENSION
+ \echo Use "ALTER EXTENSION pageinspect UPDATE TO '1.1'" to load this file. \quit
+ 
+ --
+ -- relation_free_space()
+ --
+ CREATE FUNCTION relation_free_space(IN relname text)
+ RETURNS real
+ AS 'MODULE_PATHNAME', 'relation_free_space'
+ LANGUAGE C STRICT;
diff --git a/contrib/pageinspect/pageinspect--1.1.sql b/contrib/pageinspect/pageinspect--1.1.sql
new file mode 100644
index ...035195b
*** a/contrib/pageinspect/pageinspect--1.1.sql
--- b/contrib/pageinspect/pageinspect--1.1.sql
***************
*** 0 ****
--- 1,115 ----
+ /* contrib/pageinspect/pageinspect--1.1.sql */
+ 
+ -- complain if script is sourced in psql, rather than via CREATE EXTENSION
+ \echo Use "CREATE EXTENSION pageinspect" to load this file. \quit
+ 
+ --
+ -- get_raw_page()
+ --
+ CREATE FUNCTION get_raw_page(text, int4)
+ RETURNS bytea
+ AS 'MODULE_PATHNAME', 'get_raw_page'
+ LANGUAGE C STRICT;
+ 
+ CREATE FUNCTION get_raw_page(text, text, int4)
+ RETURNS bytea
+ AS 'MODULE_PATHNAME', 'get_raw_page_fork'
+ LANGUAGE C STRICT;
+ 
+ --
+ -- page_header()
+ --
+ CREATE FUNCTION page_header(IN page bytea,
+     OUT lsn text,
+     OUT tli smallint,
+     OUT flags smallint,
+     OUT lower smallint,
+     OUT upper smallint,
+     OUT special smallint,
+     OUT pagesize smallint,
+     OUT version smallint,
+     OUT prune_xid xid)
+ AS 'MODULE_PATHNAME', 'page_header'
+ LANGUAGE C STRICT;
+ 
+ --
+ -- heap_page_items()
+ --
+ CREATE FUNCTION heap_page_items(IN page bytea,
+     OUT lp smallint,
+     OUT lp_off smallint,
+     OUT lp_flags smallint,
+     OUT lp_len smallint,
+     OUT t_xmin xid,
+     OUT t_xmax xid,
+     OUT t_field3 int4,
+     OUT t_ctid tid,
+     OUT t_infomask2 integer,
+     OUT t_infomask integer,
+     OUT t_hoff smallint,
+     OUT t_bits text,
+     OUT t_oid oid)
+ RETURNS SETOF record
+ AS 'MODULE_PATHNAME', 'heap_page_items'
+ LANGUAGE C STRICT;
+ 
+ --
+ -- bt_metap()
+ --
+ CREATE FUNCTION bt_metap(IN relname text,
+     OUT magic int4,
+     OUT version int4,
+     OUT root int4,
+     OUT level int4,
+     OUT fastroot int4,
+     OUT fastlevel int4)
+ AS 'MODULE_PATHNAME', 'bt_metap'
+ LANGUAGE C STRICT;
+ 
+ --
+ -- bt_page_stats()
+ --
+ CREATE FUNCTION bt_page_stats(IN relname text, IN blkno int4,
+     OUT blkno int4,
+     OUT type "char",
+     OUT live_items int4,
+     OUT dead_items int4,
+     OUT avg_item_size int4,
+     OUT page_size int4,
+     OUT free_size int4,
+     OUT btpo_prev int4,
+     OUT btpo_next int4,
+     OUT btpo int4,
+     OUT btpo_flags int4)
+ AS 'MODULE_PATHNAME', 'bt_page_stats'
+ LANGUAGE C STRICT;
+ 
+ --
+ -- bt_page_items()
+ --
+ CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+     OUT itemoffset smallint,
+     OUT ctid tid,
+     OUT itemlen smallint,
+     OUT nulls bool,
+     OUT vars bool,
+     OUT data text)
+ RETURNS SETOF record
+ AS 'MODULE_PATHNAME', 'bt_page_items'
+ LANGUAGE C STRICT;
+ 
+ --
+ -- fsm_page_contents()
+ --
+ CREATE FUNCTION fsm_page_contents(IN page bytea)
+ RETURNS text
+ AS 'MODULE_PATHNAME', 'fsm_page_contents'
+ LANGUAGE C STRICT;
+ 
+ --
+ -- relation_free_space()
+ --
+ CREATE FUNCTION relation_free_space(IN relname text)
+ RETURNS real
+ AS 'MODULE_PATHNAME', 'relation_free_space'
+ LANGUAGE C STRICT;
diff --git a/contrib/pageinspect/pageinspect.control b/contrib/pageinspect/pageinspect.control
new file mode 100644
index f9da0e8..a412cf1
*** a/contrib/pageinspect/pageinspect.control
--- b/contrib/pageinspect/pageinspect.control
***************
*** 1,5 ****
  # pageinspect extension
  comment = 'inspect the contents of database pages at a low level'
! default_version = '1.0'
  module_pathname = '$libdir/pageinspect'
  relocatable = true
--- 1,5 ----
  # pageinspect extension
  comment = 'inspect the contents of database pages at a low level'
! default_version = '1.1'
  module_pathname = '$libdir/pageinspect'
  relocatable = true
diff --git a/contrib/pageinspect/rawpage.c b/contrib/pageinspect/rawpage.c
new file mode 100644
index 362ad84..988d1e3
*** a/contrib/pageinspect/rawpage.c
--- b/contrib/pageinspect/rawpage.c
***************
*** 23,33 ****
--- 23,37 ----
  #include "utils/builtins.h"
  #include "utils/rel.h"
  
+ #include "btreefuncs.h"
+ #include "heapfuncs.h"
+ 
  PG_MODULE_MAGIC;
  
  Datum		get_raw_page(PG_FUNCTION_ARGS);
  Datum		get_raw_page_fork(PG_FUNCTION_ARGS);
  Datum		page_header(PG_FUNCTION_ARGS);
+ Datum		relation_free_space(PG_FUNCTION_ARGS);
  
  static bytea *get_raw_page_internal(text *relname, ForkNumber forknum,
  					  BlockNumber blkno);
*************** page_header(PG_FUNCTION_ARGS)
*** 227,229 ****
--- 231,285 ----
  
  	PG_RETURN_DATUM(result);
  }
+ 
+ 
+ /*
+  * relation_free_space
+  *
+  * Returns the percentage of free space for a given relation.
+  * Supported relation types are tables and indexes.  TOAST is not
+  * considered/handled.
+  */
+ PG_FUNCTION_INFO_V1(relation_free_space);
+ 
+ Datum
+ relation_free_space(PG_FUNCTION_ARGS)
+ {
+ 	text       *relname = PG_GETARG_TEXT_P(0);
+ 	RangeVar	*relrv;
+ 	Relation	rel;
+ 	float4 		free_space = 0;
+ 
+ 	relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname));
+ 	rel = relation_openrv(relrv, AccessShareLock);
+ 
+ 	/* Only allow tables or indexes */
+ 	if (rel->rd_rel->relkind != RELKIND_INDEX && rel->rd_rel->relkind != RELKIND_RELATION)
+ 		ereport(ERROR,
+ 			(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ 				errmsg("cannot get free space from object \"%s\"",
+ 					RelationGetRelationName(rel))));
+ 
+ 	/*
+ 	 * Reject attempts to read non-local temporary relations.  We would be
+ 	 * likely to get wrong data, since we have no visibility into the owning
+ 	 * session's local buffers.
+ 	 */
+ 	if (RELATION_IS_OTHER_TEMP(rel))
+ 		ereport(ERROR,
+ 			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ 				errmsg("cannot access temporary tables of other sessions")));
+ 
+ 	switch (rel->rd_rel->relkind)
+ 	{
+ 		case RELKIND_RELATION: 
+ 			free_space = GetHeapRelationFreeSpace(rel);
+ 			break;
+ 		case RELKIND_INDEX: 
+ 			free_space = GetBTRelationFreeSpace(rel);
+ 			break;
+ 	}
+ 	relation_close(rel, AccessShareLock);
+ 
+ 	PG_RETURN_FLOAT4(free_space);
+ }
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
new file mode 100644
index acbb05b..49708c3
*** a/doc/src/sgml/pageinspect.sgml
--- b/doc/src/sgml/pageinspect.sgml
*************** test=# SELECT * FROM bt_page_items('pg_c
*** 196,201 ****
--- 196,216 ----
       </para>
      </listitem>
     </varlistentry>
+ 
+    <varlistentry>
+     <term>
+      <function>relation_free_space(relname text) returns float</function>
+     </term>
+ 
+     <listitem>
+      <para>
+       <function>relation_free_space</function> returns the total amount of free
+       free space in a table or index.  The entire relation is scanned in order
+       to compute this amount, which may generate a large amount of disk reads.
+       Information stored in <acronym>TOAST</> tables is not included.
+      </para>
+     </listitem>
+    </varlistentry>
    </variablelist>
   </sect2>

#12

Jeff Janes

jeff.janes@gmail.com

about 14 years ago

In reply to: Jaime Casanova (#11)

Re: Measuring relation free space

On Mon, Nov 14, 2011 at 2:02 PM, Jaime Casanova <jaime@2ndquadrant.com> wrote:

On Wed, Nov 9, 2011 at 7:58 AM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:

Excerpts from Jaime Casanova's message of mar nov 08 18:12:25 -0300 2011:

On Sun, Nov 6, 2011 at 5:38 AM, Magnus Hagander <magnus@hagander.net> wrote:

Looks pretty useful.

thanks for the review, attached is a new version of it

Note that AFAIK you shouldn't update the 1.0 extension script ... you
have to create a 1.1 version (or whatever), update the default version
in the control file, and create an 1.0--1.1 script to upgrade from the
original version to 1.1.

good point... fixed that...
a question i have is: are we supposed to let the old script (1.0) around?

Since the syntax to install a non-default version is supported, I
would argue the old script should be kept.
CREATE extension pageinspect with version "1.0"

This patch applies and builds cleanly. It works either for "CREATE
EXTENSION" from scratch, or for updating from the prior version with
"ALTER EXTENSION..UPDATE".

It seems to be using the buffer ring strategy as advertised.

It reports space that is free exclusively for updates as being free.
In other words, it considers space free even if it is reserved against
inserts in deference to fillfactor. This is in contrast to
pg_freespace, which only reports space available for inserts as being
available. I think this is reasonable behavior, but it is subtle and
should perhaps be documented. (Is it common to use fill factors other
than the default in the first place? Do we assume that people using
fillfactor are sophisticated enough not to shot themselves in the
foot?)

As noted by Greg, the documentation calls it "total amount of free
free [sic] space" when that is not what is reported. However, it also
is not reporting a percentage, but rather a decimal fraction. The
reported value should be multiplied by 100, especially if the docs are
going to be changed to call it a percentage.

Unless I am missing something, all indexes are handled via a procedure
designed for BTree indices, "GetBTRelationFreeSpace". I don't know
that the ultimate behavior of this is wrong, but it seems unusual. If
I get some more time, I will try to explore what is actually going on
when called on other types of indexes.

I have no insight into how to handle toast tables, or non-superusers.
I had thought that toast tables had names of their own which could be
used, but I could not figure out how to do that.

Even if there are other ways to get approximately the same
information, this functionality seems to be a natural thing to have in
the pageinspect extension.

Cheers,

Jeff

#13

greg@2ndQuadrant.com

about 14 years ago

In reply to: Jeff Janes (#12)

Re: Measuring relation free space

On 11/25/2011 04:42 PM, Jeff Janes wrote:

It reports space that is free exclusively for updates as being free.
In other words, it considers space free even if it is reserved against
inserts in deference to fillfactor. This is in contrast to
pg_freespace, which only reports space available for inserts as being
available. I think this is reasonable behavior, but it is subtle and
should perhaps be documented.

Ah, that's right, this is why I first wandered this specific path.
Ignoring fillfactor seems to have even more downsides as I see it.
Certainly deserves a doc improvement, as well as fixing the description
of the value so it's clearly a ratio rather than a true percentage.

(Is it common to use fill factors other
than the default in the first place? Do we assume that people using
fillfactor are sophisticated enough not to shot themselves in the
foot?)

It's not common, and I think anyone who sets fillfactor themselves would
understand the downside. The bigger risk are people who inherit designs
from others that use this feature, but the new person doesn't understand
it. If using this feature calls attention to a problem there that
prompts an investigation, I'd see that as a good thing, rather than a
foot shot.

Unless I am missing something, all indexes are handled via a procedure
designed for BTree indices, "GetBTRelationFreeSpace". I don't know
that the ultimate behavior of this is wrong, but it seems unusual. If
I get some more time, I will try to explore what is actually going on
when called on other types of indexes.

This I think I'll punt back toward Jaime, as well as asking "did you
have a plan for TOAST here?"

#14

jaime@2ndquadrant.com

about 14 years ago

In reply to: Greg Smith (#13)

Re: Measuring relation free space

On Mon, Nov 28, 2011 at 5:40 AM, Greg Smith <greg@2ndquadrant.com> wrote:

Unless I am missing something, all indexes are handled via a procedure
designed for BTree indices, "GetBTRelationFreeSpace". I don't know
that the ultimate behavior of this is wrong, but it seems unusual. If
I get some more time, I will try to explore what is actually going on
when called on other types of indexes.

This I think I'll punt back toward Jaime, as well as asking "did you have a
plan for TOAST here?"

for indexes. it seems pageinspect only deals with btree indexes and i
neglected to put a similar limitation on this function... now, because
the free space is calculated using PageGetFreeSpace() for indexes it
should be doing the right thing for all kind of indexes, i only put
the function there because i was trying to avoid to create a new file.
But if the function is right for all kind of indexes that's maybe
enough to create a new file and rename the helper function so is
obvious that it can manage all kind of indexes

for toast tables. a simple test here seems to show that is as easy as
to add toast tables in the supported objects and treat them as normal
pages...

or there is something i'm missing?

--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación

#15

greg@2ndQuadrant.com

about 14 years ago

In reply to: Greg Smith (#13)

Re: Measuring relation free space

On 11/28/2011 05:40 AM, Greg Smith wrote:

Ignoring fillfactor seems to have even more downsides as I see it.
Certainly deserves a doc improvement, as well as fixing the
description of the value so it's clearly a ratio rather than a true
percentage.

So: I'm very clear on what to do here now:

-Make the computation be in units that match it documetnation
-Take a look at other index types, as well as TOAST, at least to get the
easy ones right.
-Fully confirm the extension upgrade logic works as hoped

That's the must do stuff. Then there's two more features to consider
and do something with if sensible:

-Double check whether there is really a useful role in using
pg_freespace here. I don't think the numbers will be as good, but maybe
we don't care.
-Once the above is all sorted, add a second UI that samples some pages
and extrapolates, ANALYZE-style, rather than hitting everything.

This ones leaves as returned with feedback, feeling pretty good it will
be whipped into good shape for the last 9.2 CommitFest.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us

#16

noah@leadboat.com

about 14 years ago

In reply to: Bernd Helmle (#3)

Re: Measuring relation free space

On Sun, Nov 06, 2011 at 10:20:49PM +0100, Bernd Helmle wrote:

--On 6. November 2011 01:08:11 -0200 Greg Smith <greg@2ndQuadrant.com> wrote:

Attached patch adds a new function to the pageinspect extension for measuring
total free space, in either tables or indexes.

I wonder if that should be done in the pgstattuple module, which output
some similar numbers.

Indeed, pgstattuple already claims to show precisely the same measure. Its
reckoning is right in line for heaps, but the proposed pageinspect function
finds more free space in indexes:

[local] test=# SELECT t.free_percent, relation_free_space('pg_proc'), i.free_percent, relation_free_space('pg_proc_proname_args_nsp_index') FROM pgstattuple('pg_proc') t, pgstattuple('pg_proc_proname_args_nsp_index') i;
free_percent | relation_free_space | free_percent | relation_free_space
--------------+---------------------+--------------+---------------------
2.53 | 0.0253346 | 8.61 | 0.128041
(1 row)

Is one of those index figures simply wrong, or do they measure two senses of
free space, both of which are interesting to DBAs?

Thanks,
nm

#17

greg@2ndQuadrant.com

about 14 years ago

In reply to: Noah Misch (#16)

Re: Measuring relation free space

On 12/15/2011 04:11 PM, Noah Misch wrote:

Is one of those index figures simply wrong, or do they measure two senses of
free space, both of which are interesting to DBAs?

I think the bigger one--the one I was aiming to measure--also includes
fill-factor space. It should be possible to isolate whether that's true
by running the function against a fresh index, or by trying tests with a
table where there's no useful fill. I need to add some of those to the
test example suite.

While in theory both measures of free space might be interesting to
DBAs, I'd prefer to have the one that reflects the lost space to
fill-factor if I'm checking an index. But as Robert Treat was pointing
out, even the very rough estimates being made with existing user-space
tools not using the contrib module features are helpful enough for a lot
of users. So long as it's easy and accuracy is good enough to find
bloated indexes, either implementation is probably good enough.

Shaking out the alternate implementation ideas was really my goal for
this CF here. The major goal of the next revision is to present the
options with a measure of their respective accuracy and runtime. If I
have to give up just a of bit of accuracy and make it much faster,
that's probably what most people want as an option. When Jaime and I
come back with an update, it really needs to have benchmarks and
accuracy numbers for each option. That may be complicated a bit
depending on how much of the table or index is cached, so isolating that
out will be a pain.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us

#18

noah@leadboat.com

about 14 years ago

In reply to: Greg Smith (#17)

Re: Measuring relation free space

On Fri, Dec 16, 2011 at 02:02:03AM -0500, Greg Smith wrote:

On 12/15/2011 04:11 PM, Noah Misch wrote:

Is one of those index figures simply wrong, or do they measure two senses of
free space, both of which are interesting to DBAs?

I think the bigger one--the one I was aiming to measure--also includes
fill-factor space. It should be possible to isolate whether that's true
by running the function against a fresh index, or by trying tests with a
table where there's no useful fill. I need to add some of those to the
test example suite.

No, both measures include fillfactor space. From a brief look at the code, the
proposed function counts space in non-leaf pages, while pgstattuple does not.
Also, the proposed function counts half-dead pages like live pages, while
pgstattuple counts them like dead pages.

One could perhaps justify those choices either way, but they seem too esoteric
for DBA exposure. I recommend choosing a policy on each and making both
pgstattuple() and any new code respect that policy.

Shaking out the alternate implementation ideas was really my goal for
this CF here. The major goal of the next revision is to present the
options with a measure of their respective accuracy and runtime. If I
have to give up just a of bit of accuracy and make it much faster,
that's probably what most people want as an option. When Jaime and I
come back with an update, it really needs to have benchmarks and
accuracy numbers for each option. That may be complicated a bit
depending on how much of the table or index is cached, so isolating that
out will be a pain.

The previous submission seemed to boil down to a speedier version of "SELECT
free_percent FROM pgstattuple('foo')". (Some of the other statistics aren't
cheap.) Considering that, the code does belong in the pgstattuple module.

The sampling approach you have mentioned sounds promising, especially for
indexes. For heap bloat, it may be hard to improve on pg_freespacemap-based and
check_postgres-style estimates with anything less than a full heap scan.

Thanks,
nm

#19

jaime@2ndquadrant.com

almost 14 years ago

In reply to: Noah Misch (#16)

2 attachment(s)

Re: Measuring relation free space

On Thu, Dec 15, 2011 at 4:11 PM, Noah Misch <noah@leadboat.com> wrote:

On Sun, Nov 06, 2011 at 10:20:49PM +0100, Bernd Helmle wrote:

--On 6. November 2011 01:08:11 -0200 Greg Smith <greg@2ndQuadrant.com> wrote:

Attached patch adds a new function to the pageinspect extension for measuring
total free space, in either tables or indexes.

I wonder if that should be done in the pgstattuple module, which output
some similar numbers.

Indeed, pgstattuple already claims to show precisely the same measure. Its
reckoning is right in line for heaps, but the proposed pageinspect function
finds more free space in indexes:

[local] test=# SELECT t.free_percent, relation_free_space('pg_proc'), i.free_percent, relation_free_space('pg_proc_proname_args_nsp_index') FROM pgstattuple('pg_proc') t, pgstattuple('pg_proc_proname_args_nsp_index') i;
free_percent | relation_free_space | free_percent | relation_free_space
--------------+---------------------+--------------+---------------------
2.53 | 0.0253346 | 8.61 | 0.128041
(1 row)

Is one of those index figures simply wrong, or do they measure two senses of
free space, both of which are interesting to DBAs?

i created a test env using pgbench -s 20 -F 90, i then create a new
table (that keep tracks actions that happens the the pgbench tables,
so insert only) and changed a few fillfactors:
"""
relname | reltuples | reloptions
-------------------------------------+---- -------+------------------
audit_log | 804977 |
pgbench_accounts | 1529890 | {fillfactor=90}
pgbench_accounts_pkey | 1529890 | {fillfactor=50}
pgbench_branches | 20 | {fillfactor=100}
pgbench_branches_pkey | 20 |
pgbench_history | 94062 |
pgbench_tellers | 200 | {fillfactor=100}
pgbench_tellers_pkey | 200 |
(8 rows)
"""

and after running "pgbench -n -c 4 -j 2 -T 300" a few times, i used
attached free_space.sql to see what pg_freespacemap, pgstattuple and
relation_free_space had to say about these tables. the result is
attached in result_free_space.out

my first conclusion is that pg_freespacemap is unreliable when indexes
are involved (and looking at the documentation of that module confirms
that), also the fact that FSM is not designed for accuracy make me
think is not an option.

pgstattuple and relation_free_space are very close in all the numbers
except for 2 indexes pgbench_branches_pkey and pgbench_tellers_pkey;
after a VACUUM FULL and a REINDEX (and the difference persistence) i
checked pgbench_tellers_pkey contents (it has only one page besides
the metapage) and the numbers that i look at where:

page size: 8192
free size: 4148

which in good romance means 50% of free space... so, answering Noah's
question: if that difference has some meaning i can't see it but
looking at the evidence the measure relation_free_space() is giving is
the good one

so, tomorrow (or ...looking at the clock... later today) i will update
the relation_free_space() patch to accept toast tables and other kind
of indexes and add it to the commitfest unless someone says that my
math is wrong and somehow there is a more accurate way of getting the
free space (which is entirely possible)

--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación

#20

[1]: http://archives.postgresql.org/message-id/20111218165625.GB6393@tornado.leadboat.com

noah@leadboat.com

almost 14 years ago

In reply to: Jaime Casanova (#19)

Re: Measuring relation free space

On Sat, Jan 14, 2012 at 04:41:57AM -0500, Jaime Casanova wrote:

pgstattuple and relation_free_space are very close in all the numbers
except for 2 indexes pgbench_branches_pkey and pgbench_tellers_pkey;
after a VACUUM FULL and a REINDEX (and the difference persistence) i
checked pgbench_tellers_pkey contents (it has only one page besides
the metapage) and the numbers that i look at where:

page size: 8192
free size: 4148

which in good romance means 50% of free space... so, answering Noah's
question: if that difference has some meaning i can't see it but
looking at the evidence the measure relation_free_space() is giving is
the good one

so, tomorrow (or ...looking at the clock... later today) i will update
the relation_free_space() patch to accept toast tables and other kind
of indexes and add it to the commitfest unless someone says that my
math is wrong and somehow there is a more accurate way of getting the
free space (which is entirely possible)

Did you see this followup[1]http://archives.postgresql.org/message-id/20111218165625.GB6393@tornado.leadboat.com? To summarize:

- pgstattuple() and relation_free_space() should emit the same number, even if
that means improving pgstattuple() at the same time.
- relation_free_space() belongs in the pgstattuple extension, because its role
is cheaper access to a single pgstattuple() metric.

Thanks,
nm

#21

jaime@2ndquadrant.com

almost 14 years ago

In reply to: Noah Misch (#20)

Re: Measuring relation free space

On Sat, Jan 14, 2012 at 6:26 AM, Noah Misch <noah@leadboat.com> wrote:

- pgstattuple() and relation_free_space() should emit the same number, even if
that means improving pgstattuple() at the same time.

yes, i just wanted to understand which one was more accurate and
why... and give the opportunity for anyone to point my error if any

- relation_free_space() belongs in the pgstattuple extension, because its role
is cheaper access to a single pgstattuple() metric.

oh! right! so, what about just fixing the free_percent that
pgstattuple is providing

--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación

#22

noah@leadboat.com

almost 14 years ago

In reply to: Jaime Casanova (#21)

Re: Measuring relation free space

On Sat, Jan 14, 2012 at 02:40:46PM -0500, Jaime Casanova wrote:

On Sat, Jan 14, 2012 at 6:26 AM, Noah Misch <noah@leadboat.com> wrote:

- pgstattuple() and relation_free_space() should emit the same number, even if
?that means improving pgstattuple() at the same time.

yes, i just wanted to understand which one was more accurate and
why... and give the opportunity for anyone to point my error if any

pgstattuple()'s decision to treat half-dead pages like deleted pages is
better. That transient state can only end in the page's deletion.

I don't know about counting non-leaf pages, but I personally wouldn't revisit
pgstattuple()'s decision there. In the indexes I've briefly surveyed, the
ratio of leaf pages to non-leaf pages is 100:1 or better. No amount of bloat
in that 1% will matter. Feel free to make the argument if you think
otherwise, though; I've only taken a brief look at the topic.

- relation_free_space() belongs in the pgstattuple extension, because its role
?is cheaper access to a single pgstattuple() metric.

oh! right! so, what about just fixing the free_percent that
pgstattuple is providing

If pgstattuple() meets your needs, you might indeed no longer need any patch.

#23

jaime@2ndquadrant.com

almost 14 years ago

In reply to: Noah Misch (#22)

Re: Measuring relation free space

On Mon, Jan 16, 2012 at 5:09 AM, Noah Misch <noah@leadboat.com> wrote:

pgstattuple()'s decision to treat half-dead pages like deleted pages is
better. That transient state can only end in the page's deletion.

the only page in that index has 200 records (all live 0 dead) using
half the page size (which is a leaf page and is not half dead, btw).
so, how do you justify that pgstattuple say we have just 25% of free
space?

I don't know about counting non-leaf pages

ignoring all non-leaf pages still gives a considerable difference
between pgstattuple and relation_free_space()

--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación

#24

noah@leadboat.com

almost 14 years ago

In reply to: Jaime Casanova (#23)

Re: Measuring relation free space

On Wed, Jan 18, 2012 at 09:46:20AM -0500, Jaime Casanova wrote:

On Mon, Jan 16, 2012 at 5:09 AM, Noah Misch <noah@leadboat.com> wrote:

pgstattuple()'s decision to treat half-dead pages like deleted pages is
better. ?That transient state can only end in the page's deletion.

the only page in that index has 200 records (all live 0 dead) using
half the page size (which is a leaf page and is not half dead, btw).
so, how do you justify that pgstattuple say we have just 25% of free
space?

postgres=# SELECT * from bt_page_stats('pgbench_tellers_pkey', 1);
-[ RECORD 1 ]-+-----
blkno | 1
type | l
live_items | 200
dead_items | 0
avg_item_size | 16
page_size | 8192
free_size | 4148
btpo_prev | 0
btpo_next | 0
btpo | 0
btpo_flags | 3

I don't know about counting non-leaf pages

ignoring all non-leaf pages still gives a considerable difference
between pgstattuple and relation_free_space()

pgstattuple() counts the single B-tree meta page as always-full, while
relation_free_space() skips it for all purposes. For tiny indexes, that can
shift the percentage dramatically.

#25

jaime@2ndquadrant.com

almost 14 years ago

In reply to: Noah Misch (#24)

Re: Measuring relation free space

On Wed, Jan 18, 2012 at 7:01 PM, Noah Misch <noah@leadboat.com> wrote:

On Wed, Jan 18, 2012 at 09:46:20AM -0500, Jaime Casanova wrote:

ignoring all non-leaf pages still gives a considerable difference
between pgstattuple and relation_free_space()

pgstattuple() counts the single B-tree meta page as always-full, while
relation_free_space() skips it for all purposes. For tiny indexes, that can
shift the percentage dramatically.

ok, i will reformulate the question. why is fine ignoring non-leaf
pages but is not fine to ignore the meta page?

--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación

#26

noah@leadboat.com

almost 14 years ago

In reply to: Jaime Casanova (#25)

Re: Measuring relation free space

On Fri, Jan 20, 2012 at 07:03:22PM -0500, Jaime Casanova wrote:

On Wed, Jan 18, 2012 at 7:01 PM, Noah Misch <noah@leadboat.com> wrote:

On Wed, Jan 18, 2012 at 09:46:20AM -0500, Jaime Casanova wrote:

ignoring all non-leaf pages still gives a considerable difference
between pgstattuple and relation_free_space()

pgstattuple() counts the single B-tree meta page as always-full, while
relation_free_space() skips it for all purposes. ?For tiny indexes, that can
shift the percentage dramatically.

ok, i will reformulate the question. why is fine ignoring non-leaf
pages but is not fine to ignore the meta page?

pgstattuple() figures the free_percent by adding up all space available to
hold tuples and dividing that by the simple size of the relation. Non-leaf
pages and the meta page get identical treatment: both never hold tuples, so
they do not contribute to the free space.

#27

Alvaro Herrera

alvherre@commandprompt.com

almost 14 years ago

In reply to: Noah Misch (#26)

Re: Measuring relation free space

Excerpts from Noah Misch's message of vie ene 20 22:33:30 -0300 2012:

On Fri, Jan 20, 2012 at 07:03:22PM -0500, Jaime Casanova wrote:

On Wed, Jan 18, 2012 at 7:01 PM, Noah Misch <noah@leadboat.com> wrote:

On Wed, Jan 18, 2012 at 09:46:20AM -0500, Jaime Casanova wrote:

ignoring all non-leaf pages still gives a considerable difference
between pgstattuple and relation_free_space()

pgstattuple() counts the single B-tree meta page as always-full, while
relation_free_space() skips it for all purposes. ?For tiny indexes, that can
shift the percentage dramatically.

ok, i will reformulate the question. why is fine ignoring non-leaf
pages but is not fine to ignore the meta page?

pgstattuple() figures the free_percent by adding up all space available to
hold tuples and dividing that by the simple size of the relation. Non-leaf
pages and the meta page get identical treatment: both never hold tuples, so
they do not contribute to the free space.

Hm. Leaf pages hold as much tuples as non-leaf pages, no? I mean
for each page element there's a value and a CTID. In non-leaf those
CTIDs point to other index pages, one level down the tree; in leaf pages
they point to the heap.

The metapage is special in that it is not used to store any user data,
just a pointer to the root page.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#28

noah@leadboat.com

almost 14 years ago

In reply to: Alvaro Herrera (#27)

Re: Measuring relation free space

On Mon, Jan 23, 2012 at 04:56:24PM -0300, Alvaro Herrera wrote:

Excerpts from Noah Misch's message of vie ene 20 22:33:30 -0300 2012:

pgstattuple() figures the free_percent by adding up all space available to
hold tuples and dividing that by the simple size of the relation. Non-leaf
pages and the meta page get identical treatment: both never hold tuples, so
they do not contribute to the free space.

Hm. Leaf pages hold as much tuples as non-leaf pages, no? I mean
for each page element there's a value and a CTID. In non-leaf those
CTIDs point to other index pages, one level down the tree; in leaf pages
they point to the heap.

That distinction seemed important when I sent my last message, but now I agree
that it's largely irrelevant for free space purposes. If someone feels like
doing it, +1 for making pgstattuple() count non-leaf free space.

Thanks,
nm

#29

jaime@2ndquadrant.com

almost 14 years ago

In reply to: Noah Misch (#28)

2 attachment(s)

Re: Measuring relation free space

On Mon, Jan 23, 2012 at 7:18 PM, Noah Misch <noah@leadboat.com> wrote:

On Mon, Jan 23, 2012 at 04:56:24PM -0300, Alvaro Herrera wrote:

Hm. Leaf pages hold as much tuples as non-leaf pages, no? I mean
for each page element there's a value and a CTID. In non-leaf those
CTIDs point to other index pages, one level down the tree; in leaf pages
they point to the heap.

That distinction seemed important when I sent my last message, but now I agree
that it's largely irrelevant for free space purposes. If someone feels like
doing it, +1 for making pgstattuple() count non-leaf free space.

actually i agreed that non-leaf pages are irrelevant... i just
confirmed that in a production system with 300GB none of the indexes
in an 84M rows table nor in a heavily updated one has more than 1 root
page, all the rest are deleted, half_dead or leaf. so the posibility
of bloat coming from non-leaf pages seems very odd

but the possibility of bloat coming from the meta page doesn't exist,
AFAIUI at least

we need the most accurate value about usable free space, because the
idea is to add a sampler mode to the function so we don't scan the
whole relation. that's why we still need the function.

btw... pgstattuple also has the problem that it's not using a ring buffer

attached are two patches:
- v5: is the same original patch but only track space in leaf, deleted
and half_dead pages
- v5.1: adds the same for all kind of indexes (problem is that this is
inconsistent with the fact that pageinspect only manages btree indexes
for everything else)

--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación

Attachments:

relation_free_space-v5.patchtext/x-patch; charset=US-ASCII; name=relation_free_space-v5.patchDownload

diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile
new file mode 100644
index 13ba6d3..63fab95
*** a/contrib/pageinspect/Makefile
--- b/contrib/pageinspect/Makefile
*************** MODULE_big	= pageinspect
*** 4,10 ****
  OBJS		= rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o
  
  EXTENSION = pageinspect
! DATA = pageinspect--1.0.sql pageinspect--unpackaged--1.0.sql
  
  ifdef USE_PGXS
  PG_CONFIG = pg_config
--- 4,12 ----
  OBJS		= rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o
  
  EXTENSION = pageinspect
! DATA = pageinspect--1.0.sql pageinspect--1.1.sql \
!        pageinspect--1.0--1.1.sql \
!        pageinspect--unpackaged--1.0.sql
  
  ifdef USE_PGXS
  PG_CONFIG = pg_config
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
new file mode 100644
index dbb2158..9c0b0fb
*** a/contrib/pageinspect/btreefuncs.c
--- b/contrib/pageinspect/btreefuncs.c
***************
*** 34,39 ****
--- 34,40 ----
  #include "utils/builtins.h"
  #include "utils/rel.h"
  
+ #include "btreefuncs.h"
  
  extern Datum bt_metap(PG_FUNCTION_ARGS);
  extern Datum bt_page_items(PG_FUNCTION_ARGS);
*************** GetBTPageStatistics(BlockNumber blkno, B
*** 155,160 ****
--- 156,216 ----
  		stat->avg_item_size = 0;
  }
  
+ /*------------------------------------------------
+  * GetBTRelationFreeSpace
+  *
+  * Get the free space for a btree index.
+  * This is a helper function for relation_free_space()
+  *------------------------------------------------
+  */
+ float4
+ GetBTRelationFreeSpace(Relation rel)
+ {
+ 	BTPageStat stat;
+ 
+ 	Buffer		buffer;
+ 	BlockNumber blkno;
+ 	BlockNumber totalBlcksInRelation = RelationGetNumberOfBlocks(rel);
+ 	BlockNumber totalBlcksCounted = 0;
+ 	Size 		free_space = 0;
+ 	double		free_percent = 0;
+ 
+     BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+ 	
+ 	/* Skip page 0 because it is a metapage */
+ 	for (blkno = 1; blkno < totalBlcksInRelation; blkno++)
+ 	{
+ 		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
+ 		/* 
+ 		 * get the statistics of the indexes and use that info
+ 		 * to determine free space on the page
+ 		 */
+ 		GetBTPageStatistics(blkno, buffer, &stat);
+ 		/* 
+ 		 * Consider pages DELETED and HALF_DEAD as empty,
+ 		 * besides those only consider LEAF pages
+ 		 */
+ 		if (stat.type == 'd' || stat.type == 'e')
+ 		{
+ 			free_space += stat.page_size;
+ 			totalBlcksCounted++;
+ 		}
+ 		else if (stat.type == 'l')
+ 		{
+ 			free_space += stat.free_size;		
+ 			totalBlcksCounted++;
+ 		}
+ 
+ 		ReleaseBuffer(buffer);
+ 	}
+ 
+ 	if (totalBlcksCounted > 0)
+ 		free_percent = ((float4) free_space) / (totalBlcksCounted * BLCKSZ);
+ 
+ 	return free_percent;
+ }
+ 
+ 
  /* -----------------------------------------------
   * bt_page()
   *
diff --git a/contrib/pageinspect/btreefuncs.h b/contrib/pageinspect/btreefuncs.h
new file mode 100644
index ...549f878
*** a/contrib/pageinspect/btreefuncs.h
--- b/contrib/pageinspect/btreefuncs.h
***************
*** 0 ****
--- 1,5 ----
+ /*
+  * contrib/pageinspect/btreefuncs.h
+  */
+ 
+ float4 GetBTRelationFreeSpace(Relation);
diff --git a/contrib/pageinspect/heapfuncs.c b/contrib/pageinspect/heapfuncs.c
new file mode 100644
index 260ccff..8e3961d
*** a/contrib/pageinspect/heapfuncs.c
--- b/contrib/pageinspect/heapfuncs.c
***************
*** 28,33 ****
--- 28,36 ----
  #include "funcapi.h"
  #include "utils/builtins.h"
  #include "miscadmin.h"
+ #include "storage/bufmgr.h"
+ 
+ #include "heapfuncs.h"
  
  Datum		heap_page_items(PG_FUNCTION_ARGS);
  
*************** bits_to_text(bits8 *bits, int len)
*** 55,60 ****
--- 58,96 ----
  }
  
  
+ /*
+  * GetHeapRelationFreeSpace()
+  *
+  * Get the free space for a heap relation.
+  * This is a helper function for relation_free_space()
+  */
+ float4
+ GetHeapRelationFreeSpace(Relation rel)
+ {
+ 	Buffer      buffer;
+ 	Page		page;
+ 	BlockNumber blkno;
+ 	BlockNumber totalBlcksInRelation = RelationGetNumberOfBlocks(rel);
+ 	Size        free_space = 0;
+ 	double		free_percent = 0;
+ 
+ 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+ 
+ 	for (blkno = 0; blkno < totalBlcksInRelation; blkno++)
+ 	{
+ 		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
+ 		page   = BufferGetPage(buffer);
+ 		free_space += PageGetHeapFreeSpace(page);
+ 
+ 		ReleaseBuffer(buffer);
+ 	}
+ 
+ 	if (totalBlcksInRelation > 0)
+ 		free_percent = ((float4) free_space) / (totalBlcksInRelation * BLCKSZ);
+ 	return free_percent;
+ }
+ 
+ 
  /*
   * heap_page_items
   *
diff --git a/contrib/pageinspect/heapfuncs.h b/contrib/pageinspect/heapfuncs.h
new file mode 100644
index ...17b7cb3
*** a/contrib/pageinspect/heapfuncs.h
--- b/contrib/pageinspect/heapfuncs.h
***************
*** 0 ****
--- 1,5 ----
+ /*
+  * contrib/pageinspect/heapfuncs.h
+  */
+ 
+ float4 GetHeapRelationFreeSpace(Relation);
diff --git a/contrib/pageinspect/pageinspect--1.0--1.1.sql b/contrib/pageinspect/pageinspect--1.0--1.1.sql
new file mode 100644
index ...c97aeba
*** a/contrib/pageinspect/pageinspect--1.0--1.1.sql
--- b/contrib/pageinspect/pageinspect--1.0--1.1.sql
***************
*** 0 ****
--- 1,12 ----
+ /* contrib/pageinspect/pageinspect--1.0--1.1.sql */
+ 
+ -- complain if script is sourced in psql, rather than via CREATE EXTENSION
+ \echo Use "ALTER EXTENSION pageinspect UPDATE TO '1.1'" to load this file. \quit
+ 
+ --
+ -- relation_free_space()
+ --
+ CREATE FUNCTION relation_free_space(IN relname text)
+ RETURNS real
+ AS 'MODULE_PATHNAME', 'relation_free_space'
+ LANGUAGE C STRICT;
diff --git a/contrib/pageinspect/pageinspect--1.1.sql b/contrib/pageinspect/pageinspect--1.1.sql
new file mode 100644
index ...035195b
*** a/contrib/pageinspect/pageinspect--1.1.sql
--- b/contrib/pageinspect/pageinspect--1.1.sql
***************
*** 0 ****
--- 1,115 ----
+ /* contrib/pageinspect/pageinspect--1.1.sql */
+ 
+ -- complain if script is sourced in psql, rather than via CREATE EXTENSION
+ \echo Use "CREATE EXTENSION pageinspect" to load this file. \quit
+ 
+ --
+ -- get_raw_page()
+ --
+ CREATE FUNCTION get_raw_page(text, int4)
+ RETURNS bytea
+ AS 'MODULE_PATHNAME', 'get_raw_page'
+ LANGUAGE C STRICT;
+ 
+ CREATE FUNCTION get_raw_page(text, text, int4)
+ RETURNS bytea
+ AS 'MODULE_PATHNAME', 'get_raw_page_fork'
+ LANGUAGE C STRICT;
+ 
+ --
+ -- page_header()
+ --
+ CREATE FUNCTION page_header(IN page bytea,
+     OUT lsn text,
+     OUT tli smallint,
+     OUT flags smallint,
+     OUT lower smallint,
+     OUT upper smallint,
+     OUT special smallint,
+     OUT pagesize smallint,
+     OUT version smallint,
+     OUT prune_xid xid)
+ AS 'MODULE_PATHNAME', 'page_header'
+ LANGUAGE C STRICT;
+ 
+ --
+ -- heap_page_items()
+ --
+ CREATE FUNCTION heap_page_items(IN page bytea,
+     OUT lp smallint,
+     OUT lp_off smallint,
+     OUT lp_flags smallint,
+     OUT lp_len smallint,
+     OUT t_xmin xid,
+     OUT t_xmax xid,
+     OUT t_field3 int4,
+     OUT t_ctid tid,
+     OUT t_infomask2 integer,
+     OUT t_infomask integer,
+     OUT t_hoff smallint,
+     OUT t_bits text,
+     OUT t_oid oid)
+ RETURNS SETOF record
+ AS 'MODULE_PATHNAME', 'heap_page_items'
+ LANGUAGE C STRICT;
+ 
+ --
+ -- bt_metap()
+ --
+ CREATE FUNCTION bt_metap(IN relname text,
+     OUT magic int4,
+     OUT version int4,
+     OUT root int4,
+     OUT level int4,
+     OUT fastroot int4,
+     OUT fastlevel int4)
+ AS 'MODULE_PATHNAME', 'bt_metap'
+ LANGUAGE C STRICT;
+ 
+ --
+ -- bt_page_stats()
+ --
+ CREATE FUNCTION bt_page_stats(IN relname text, IN blkno int4,
+     OUT blkno int4,
+     OUT type "char",
+     OUT live_items int4,
+     OUT dead_items int4,
+     OUT avg_item_size int4,
+     OUT page_size int4,
+     OUT free_size int4,
+     OUT btpo_prev int4,
+     OUT btpo_next int4,
+     OUT btpo int4,
+     OUT btpo_flags int4)
+ AS 'MODULE_PATHNAME', 'bt_page_stats'
+ LANGUAGE C STRICT;
+ 
+ --
+ -- bt_page_items()
+ --
+ CREATE FUNCTION bt_page_items(IN relname text, IN blkno int4,
+     OUT itemoffset smallint,
+     OUT ctid tid,
+     OUT itemlen smallint,
+     OUT nulls bool,
+     OUT vars bool,
+     OUT data text)
+ RETURNS SETOF record
+ AS 'MODULE_PATHNAME', 'bt_page_items'
+ LANGUAGE C STRICT;
+ 
+ --
+ -- fsm_page_contents()
+ --
+ CREATE FUNCTION fsm_page_contents(IN page bytea)
+ RETURNS text
+ AS 'MODULE_PATHNAME', 'fsm_page_contents'
+ LANGUAGE C STRICT;
+ 
+ --
+ -- relation_free_space()
+ --
+ CREATE FUNCTION relation_free_space(IN relname text)
+ RETURNS real
+ AS 'MODULE_PATHNAME', 'relation_free_space'
+ LANGUAGE C STRICT;
diff --git a/contrib/pageinspect/pageinspect.control b/contrib/pageinspect/pageinspect.control
new file mode 100644
index f9da0e8..a412cf1
*** a/contrib/pageinspect/pageinspect.control
--- b/contrib/pageinspect/pageinspect.control
***************
*** 1,5 ****
  # pageinspect extension
  comment = 'inspect the contents of database pages at a low level'
! default_version = '1.0'
  module_pathname = '$libdir/pageinspect'
  relocatable = true
--- 1,5 ----
  # pageinspect extension
  comment = 'inspect the contents of database pages at a low level'
! default_version = '1.1'
  module_pathname = '$libdir/pageinspect'
  relocatable = true
diff --git a/contrib/pageinspect/rawpage.c b/contrib/pageinspect/rawpage.c
new file mode 100644
index f51a4e3..9a54744
*** a/contrib/pageinspect/rawpage.c
--- b/contrib/pageinspect/rawpage.c
***************
*** 23,33 ****
--- 23,37 ----
  #include "utils/builtins.h"
  #include "utils/rel.h"
  
+ #include "btreefuncs.h"
+ #include "heapfuncs.h"
+ 
  PG_MODULE_MAGIC;
  
  Datum		get_raw_page(PG_FUNCTION_ARGS);
  Datum		get_raw_page_fork(PG_FUNCTION_ARGS);
  Datum		page_header(PG_FUNCTION_ARGS);
+ Datum		relation_free_space(PG_FUNCTION_ARGS);
  
  static bytea *get_raw_page_internal(text *relname, ForkNumber forknum,
  					  BlockNumber blkno);
*************** page_header(PG_FUNCTION_ARGS)
*** 227,229 ****
--- 231,285 ----
  
  	PG_RETURN_DATUM(result);
  }
+ 
+ 
+ /*
+  * relation_free_space
+  *
+  * Returns the percentage of free space for a given relation.
+  * Supported relation types are tables and indexes.  TOAST is not
+  * considered/handled.
+  */
+ PG_FUNCTION_INFO_V1(relation_free_space);
+ 
+ Datum
+ relation_free_space(PG_FUNCTION_ARGS)
+ {
+ 	text       *relname = PG_GETARG_TEXT_P(0);
+ 	RangeVar	*relrv;
+ 	Relation	rel;
+ 	float4 		free_space = 0;
+ 
+ 	relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname));
+ 	rel = relation_openrv(relrv, AccessShareLock);
+ 
+ 	/* Only allow tables or indexes */
+ 	if (rel->rd_rel->relkind != RELKIND_INDEX && rel->rd_rel->relkind != RELKIND_RELATION)
+ 		ereport(ERROR,
+ 			(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ 				errmsg("cannot get free space from object \"%s\"",
+ 					RelationGetRelationName(rel))));
+ 
+ 	/*
+ 	 * Reject attempts to read non-local temporary relations.  We would be
+ 	 * likely to get wrong data, since we have no visibility into the owning
+ 	 * session's local buffers.
+ 	 */
+ 	if (RELATION_IS_OTHER_TEMP(rel))
+ 		ereport(ERROR,
+ 			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ 				errmsg("cannot access temporary tables of other sessions")));
+ 
+ 	switch (rel->rd_rel->relkind)
+ 	{
+ 		case RELKIND_RELATION: 
+ 			free_space = GetHeapRelationFreeSpace(rel);
+ 			break;
+ 		case RELKIND_INDEX: 
+ 			free_space = GetBTRelationFreeSpace(rel);
+ 			break;
+ 	}
+ 	relation_close(rel, AccessShareLock);
+ 
+ 	PG_RETURN_FLOAT4(free_space);
+ }
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
new file mode 100644
index acbb05b..49708c3
*** a/doc/src/sgml/pageinspect.sgml
--- b/doc/src/sgml/pageinspect.sgml
*************** test=# SELECT * FROM bt_page_items('pg_c
*** 196,201 ****
--- 196,216 ----
       </para>
      </listitem>
     </varlistentry>
+ 
+    <varlistentry>
+     <term>
+      <function>relation_free_space(relname text) returns float</function>
+     </term>
+ 
+     <listitem>
+      <para>
+       <function>relation_free_space</function> returns the total amount of free
+       free space in a table or index.  The entire relation is scanned in order
+       to compute this amount, which may generate a large amount of disk reads.
+       Information stored in <acronym>TOAST</> tables is not included.
+      </para>
+     </listitem>
+    </varlistentry>
    </variablelist>
   </sect2>

relation_free_space-v5.1.patchtext/x-patch; charset=US-ASCII; name=relation_free_space-v5.1.patchDownload

diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile
new file mode 100644
index 63fab95..f63b92f
*** a/contrib/pageinspect/Makefile
--- b/contrib/pageinspect/Makefile
***************
*** 1,7 ****
  # contrib/pageinspect/Makefile
  
  MODULE_big	= pageinspect
! OBJS		= rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o
  
  EXTENSION = pageinspect
  DATA = pageinspect--1.0.sql pageinspect--1.1.sql \
--- 1,7 ----
  # contrib/pageinspect/Makefile
  
  MODULE_big	= pageinspect
! OBJS		= rawpage.o heapfuncs.o btreefuncs.o idxfuncs.o fsmfuncs.o
  
  EXTENSION = pageinspect
  DATA = pageinspect--1.0.sql pageinspect--1.1.sql \
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
new file mode 100644
index 9c0b0fb..b399b85
*** a/contrib/pageinspect/btreefuncs.c
--- b/contrib/pageinspect/btreefuncs.c
***************
*** 34,41 ****
  #include "utils/builtins.h"
  #include "utils/rel.h"
  
- #include "btreefuncs.h"
- 
  extern Datum bt_metap(PG_FUNCTION_ARGS);
  extern Datum bt_page_items(PG_FUNCTION_ARGS);
  extern Datum bt_page_stats(PG_FUNCTION_ARGS);
--- 34,39 ----
*************** GetBTPageStatistics(BlockNumber blkno, B
*** 156,216 ****
  		stat->avg_item_size = 0;
  }
  
- /*------------------------------------------------
-  * GetBTRelationFreeSpace
-  *
-  * Get the free space for a btree index.
-  * This is a helper function for relation_free_space()
-  *------------------------------------------------
-  */
- float4
- GetBTRelationFreeSpace(Relation rel)
- {
- 	BTPageStat stat;
- 
- 	Buffer		buffer;
- 	BlockNumber blkno;
- 	BlockNumber totalBlcksInRelation = RelationGetNumberOfBlocks(rel);
- 	BlockNumber totalBlcksCounted = 0;
- 	Size 		free_space = 0;
- 	double		free_percent = 0;
- 
-     BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
- 	
- 	/* Skip page 0 because it is a metapage */
- 	for (blkno = 1; blkno < totalBlcksInRelation; blkno++)
- 	{
- 		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
- 		/* 
- 		 * get the statistics of the indexes and use that info
- 		 * to determine free space on the page
- 		 */
- 		GetBTPageStatistics(blkno, buffer, &stat);
- 		/* 
- 		 * Consider pages DELETED and HALF_DEAD as empty,
- 		 * besides those only consider LEAF pages
- 		 */
- 		if (stat.type == 'd' || stat.type == 'e')
- 		{
- 			free_space += stat.page_size;
- 			totalBlcksCounted++;
- 		}
- 		else if (stat.type == 'l')
- 		{
- 			free_space += stat.free_size;		
- 			totalBlcksCounted++;
- 		}
- 
- 		ReleaseBuffer(buffer);
- 	}
- 
- 	if (totalBlcksCounted > 0)
- 		free_percent = ((float4) free_space) / (totalBlcksCounted * BLCKSZ);
- 
- 	return free_percent;
- }
- 
- 
  /* -----------------------------------------------
   * bt_page()
   *
--- 154,159 ----
diff --git a/contrib/pageinspect/btreefuncs.h b/contrib/pageinspect/btreefuncs.h
new file mode .
index 549f878..e69de29
*** a/contrib/pageinspect/btreefuncs.h
--- b/contrib/pageinspect/btreefuncs.h
***************
*** 1,5 ****
- /*
-  * contrib/pageinspect/btreefuncs.h
-  */
- 
- float4 GetBTRelationFreeSpace(Relation);
--- 0 ----
diff --git a/contrib/pageinspect/idxfuncs.c b/contrib/pageinspect/idxfuncs.c
new file mode 100644
index ...6533429
*** a/contrib/pageinspect/idxfuncs.c
--- b/contrib/pageinspect/idxfuncs.c
***************
*** 0 ****
--- 1,313 ----
+ /*
+  * contrib/pageinspect/idxfuncs.c
+  *
+  *
+  * idxfuncs.c
+  *
+  * Permission to use, copy, modify, and distribute this software and
+  * its documentation for any purpose, without fee, and without a
+  * written agreement is hereby granted, provided that the above
+  * copyright notice and this paragraph and the following two
+  * paragraphs appear in all copies.
+  *
+  * IN NO EVENT SHALL THE AUTHOR BE LIABLE TO ANY PARTY FOR DIRECT,
+  * INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+  * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+  * DOCUMENTATION, EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED
+  * OF THE POSSIBILITY OF SUCH DAMAGE.
+  *
+  * THE AUTHOR SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT
+  * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+  * A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS ON AN "AS
+  * IS" BASIS, AND THE AUTHOR HAS NO OBLIGATIONS TO PROVIDE MAINTENANCE,
+  * SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+  */
+ 
+ #include "postgres.h"
+ 
+ #include "access/gin_private.h"
+ #include "access/gist_private.h"
+ #include "access/hash.h"
+ #include "access/nbtree.h"
+ #include "access/spgist_private.h"
+ #include "funcapi.h"
+ #include "utils/rel.h"
+ 
+ #include "idxfuncs.h"
+ 
+ /*------------------------------------------------
+  * GetIndexRelationFreeSpace
+  *
+  * Look the type of the index and call the appropiate
+  * function to get the free space
+  *------------------------------------------------
+  */
+ float4
+ GetIndexRelationFreeSpace(Relation rel)
+ {
+ 	switch (rel->rd_rel->relam)
+ 	{
+ 		case BTREE_AM_OID:
+ 			return GetBTRelationFreeSpace(rel);
+ 		case HASH_AM_OID:
+ 			return GetHashRelationFreeSpace(rel);
+ 		case GIST_AM_OID:
+ 			return GetGistRelationFreeSpace(rel);
+ 		case GIN_AM_OID:
+ 			return GetGinRelationFreeSpace(rel);
+ 		case SPGIST_AM_OID:
+ 			return GetSpGistRelationFreeSpace(rel);
+ 		default:
+ 			elog(ERROR, "Unknown index type %d", rel->rd_rel->relam);
+ 	}
+ 
+ 	return 0;
+ }
+ 
+ /*------------------------------------------------
+  * GetBTRelationFreeSpace
+  *
+  * Get the free space for a btree index.
+  * This is a helper function for relation_free_space()
+  *------------------------------------------------
+  */
+ float4
+ GetBTRelationFreeSpace(Relation rel)
+ {
+ 	BlockNumber blkno;
+ 	BlockNumber totalBlcksInRelation = RelationGetNumberOfBlocks(rel);
+ 	BlockNumber totalBlcksCounted = 0;
+ 	Size 		free_space = 0;
+ 	double		free_percent = 0;
+ 
+     BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+ 	
+ 	/* Start just after the metapage */
+ 	for (blkno = BTREE_METAPAGE + 1; blkno < totalBlcksInRelation; blkno++)
+ 	{
+ 		Buffer buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
+ 		Page page = BufferGetPage(buffer);
+ 		BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ 
+ 		/* 
+ 		 * Consider pages DELETED and HALF_DEAD as empty,
+ 		 * besides those only consider LEAF pages
+ 		 */
+ 		if (P_ISDELETED(opaque) || P_ISHALFDEAD(opaque))
+ 		{
+ 			free_space += PageGetPageSize(page);
+ 			totalBlcksCounted++;
+ 		}
+ 		else if (P_ISLEAF(opaque))
+ 		{
+ 			free_space += PageGetFreeSpace(page);	
+ 			totalBlcksCounted++;
+ 		}
+ 
+ 		ReleaseBuffer(buffer);
+ 	}
+ 
+ 	if (totalBlcksCounted > 0)
+ 		free_percent = ((float4) free_space) / (totalBlcksCounted * BLCKSZ);
+ 
+ 	return free_percent;
+ }
+ 
+ 
+ /*------------------------------------------------
+  * GetHashRelationFreeSpace
+  *
+  * Get the free space for a Hash index.
+  * This is a helper function for relation_free_space()
+  *------------------------------------------------
+  */
+ float4
+ GetHashRelationFreeSpace(Relation rel)
+ {
+ 	BlockNumber blkno;
+ 	BlockNumber totalBlcksInRelation = RelationGetNumberOfBlocks(rel);
+ 	BlockNumber totalBlcksCounted = 0;
+ 	Size 		free_space = 0;
+ 	double		free_percent = 0;
+ 
+     BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+ 	
+ 	/* Start just after the metapage */
+ 	for (blkno = HASH_METAPAGE + 1; blkno < totalBlcksInRelation; blkno++)
+ 	{
+ 		Buffer buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
+ 		Page page = BufferGetPage(buffer);
+ 		HashPageOpaque opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ 
+ 		/* 
+ 		 * Consider pages LH_UNUSED_PAGE as empty,
+ 		 * besides that consider the free space in LH_OVERFLOW_PAGE and LH_BUCKET_PAGE pages
+ 		 */
+ 		if (opaque->hasho_flag == LH_UNUSED_PAGE)
+ 		{
+ 			free_space += PageGetPageSize(page);
+ 			totalBlcksCounted++;
+ 		}
+ 		else if (opaque->hasho_flag == LH_OVERFLOW_PAGE || opaque->hasho_flag == LH_BUCKET_PAGE)
+ 		{
+ 			free_space += PageGetFreeSpace(page);	
+ 			totalBlcksCounted++;
+ 		}
+ 
+ 		ReleaseBuffer(buffer);
+ 	}
+ 
+ 	if (totalBlcksCounted > 0)
+ 		free_percent = ((float4) free_space) / (totalBlcksCounted * BLCKSZ);
+ 
+ 	return free_percent;
+ }
+ 
+ /*------------------------------------------------
+  * GetGistRelationFreeSpace
+  *
+  * Get the free space for a Gist index.
+  * This is a helper function for relation_free_space()
+  *------------------------------------------------
+  */
+ float4
+ GetGistRelationFreeSpace(Relation rel)
+ {
+ 	BlockNumber blkno;
+ 	BlockNumber totalBlcksInRelation = RelationGetNumberOfBlocks(rel);
+ 	BlockNumber totalBlcksCounted = 0;
+ 	Size 		free_space = 0;
+ 	double		free_percent = 0;
+ 
+     BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+ 	
+ 	/* Start from the root */
+ 	for (blkno = GIST_ROOT_BLKNO; blkno < totalBlcksInRelation; blkno++)
+ 	{
+ 		Buffer buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
+ 		Page page = BufferGetPage(buffer);
+ 
+ 		/* 
+ 		 * Consider pages LH_UNUSED_PAGE as empty,
+ 		 * besides that consider the free space in LH_OVERFLOW_PAGE and LH_BUCKET_PAGE pages
+ 		 */
+ 		if (GistPageIsDeleted(page))
+ 		{
+ 			free_space += PageGetPageSize(page);
+ 			totalBlcksCounted++;
+ 		}
+ 		else if (GistPageIsLeaf(page))
+ 		{
+ 			free_space += PageGetFreeSpace(page);	
+ 			totalBlcksCounted++;
+ 		}
+ 
+ 		ReleaseBuffer(buffer);
+ 	}
+ 
+ 	if (totalBlcksCounted > 0)
+ 		free_percent = ((float4) free_space) / (totalBlcksCounted * BLCKSZ);
+ 
+ 	return free_percent;
+ }
+ 
+ /*------------------------------------------------
+  * GetGinRelationFreeSpace
+  *
+  * Get the free space for a Gin index.
+  * This is a helper function for relation_free_space()
+  *------------------------------------------------
+  */
+ float4
+ GetGinRelationFreeSpace(Relation rel)
+ {
+ 	BlockNumber blkno;
+ 	BlockNumber totalBlcksInRelation = RelationGetNumberOfBlocks(rel);
+ 	BlockNumber totalBlcksCounted = 0;
+ 	Size 		free_space = 0;
+ 	double		free_percent = 0;
+ 
+     BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+ 	
+ 	/* Start just after the first root */
+ 	for (blkno = GIN_METAPAGE_BLKNO + 1; blkno < totalBlcksInRelation; blkno++)
+ 	{
+ 		Buffer buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
+ 		Page page = BufferGetPage(buffer);
+ 
+ 		/* 
+ 		 * Consider pages GIN_DELETED as empty,
+ 		 * besides that consider the free space in GIN_DATA and GIN_LEAF pages
+ 		 */
+ 		if (GinPageIsDeleted(page))
+ 		{
+ 			free_space += PageGetPageSize(page);
+ 			totalBlcksCounted++;
+ 		}
+ 		else if (GinPageIsLeaf(page))
+ 		{
+ 			free_space += PageGetFreeSpace(page);	
+ 			totalBlcksCounted++;
+ 		}
+ 		else if (GinPageIsData(page))
+ 		{
+ 			free_space += GinDataPageGetFreeSpace(page);	
+ 			totalBlcksCounted++;
+ 		}
+ 
+ 		ReleaseBuffer(buffer);
+ 	}
+ 
+ 	if (totalBlcksCounted > 0)
+ 		free_percent = ((float4) free_space) / (totalBlcksCounted * BLCKSZ);
+ 
+ 	return free_percent;
+ }
+ 
+ /*------------------------------------------------
+  * GetSPGistRelationFreeSpace
+  *
+  * Get the free space for a SPGist index.
+  * This is a helper function for relation_free_space()
+  *------------------------------------------------
+  */
+ float4
+ GetSpGistRelationFreeSpace(Relation rel)
+ {
+ 	BlockNumber blkno;
+ 	BlockNumber totalBlcksInRelation = RelationGetNumberOfBlocks(rel);
+ 	BlockNumber totalBlcksCounted = 0;
+ 	Size 		free_space = 0;
+ 	double		free_percent = 0;
+ 
+     BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+ 	
+ 	/* Start just after the first root */
+ 	for (blkno = SPGIST_METAPAGE_BLKNO + 1; blkno < totalBlcksInRelation; blkno++)
+ 	{
+ 		Buffer buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
+ 		Page page = BufferGetPage(buffer);
+ 
+ 		/* 
+ 		 * Consider pages GIN_DELETED as empty,
+ 		 * besides that consider the free space in GIN_DATA and GIN_LEAF pages
+ 		 */
+ 		if (SpGistPageIsDeleted(page))
+ 		{
+ 			free_space += PageGetPageSize(page);
+ 			totalBlcksCounted++;
+ 		}
+ 		else if (SpGistPageIsLeaf(page))
+ 		{
+ 			free_space += SpGistPageGetFreeSpace(page, 1);	
+ 			totalBlcksCounted++;
+ 		}
+ 
+ 		ReleaseBuffer(buffer);
+ 	}
+ 
+ 	if (totalBlcksCounted > 0)
+ 		free_percent = ((float4) free_space) / (totalBlcksCounted * BLCKSZ);
+ 
+ 	return free_percent;
+ }
diff --git a/contrib/pageinspect/idxfuncs.h b/contrib/pageinspect/idxfuncs.h
new file mode 100644
index ...788d920
*** a/contrib/pageinspect/idxfuncs.h
--- b/contrib/pageinspect/idxfuncs.h
***************
*** 0 ****
--- 1,10 ----
+ /*
+  * contrib/pageinspect/idxfuncs.h
+  */
+ 
+ float4 GetIndexRelationFreeSpace(Relation);
+ float4 GetBTRelationFreeSpace(Relation);
+ float4 GetHashRelationFreeSpace(Relation);
+ float4 GetGistRelationFreeSpace(Relation);
+ float4 GetGinRelationFreeSpace(Relation);
+ float4 GetSpGistRelationFreeSpace(Relation);
diff --git a/contrib/pageinspect/rawpage.c b/contrib/pageinspect/rawpage.c
new file mode 100644
index 9a54744..99fccab
*** a/contrib/pageinspect/rawpage.c
--- b/contrib/pageinspect/rawpage.c
***************
*** 23,30 ****
  #include "utils/builtins.h"
  #include "utils/rel.h"
  
- #include "btreefuncs.h"
  #include "heapfuncs.h"
  
  PG_MODULE_MAGIC;
  
--- 23,30 ----
  #include "utils/builtins.h"
  #include "utils/rel.h"
  
  #include "heapfuncs.h"
+ #include "idxfuncs.h"
  
  PG_MODULE_MAGIC;
  
*************** relation_free_space(PG_FUNCTION_ARGS)
*** 276,282 ****
  			free_space = GetHeapRelationFreeSpace(rel);
  			break;
  		case RELKIND_INDEX: 
! 			free_space = GetBTRelationFreeSpace(rel);
  			break;
  	}
  	relation_close(rel, AccessShareLock);
--- 276,282 ----
  			free_space = GetHeapRelationFreeSpace(rel);
  			break;
  		case RELKIND_INDEX: 
! 			free_space = GetIndexRelationFreeSpace(rel);
  			break;
  	}
  	relation_close(rel, AccessShareLock);

#30

noah@leadboat.com

almost 14 years ago

In reply to: Jaime Casanova (#29)

Re: Measuring relation free space

On Tue, Jan 24, 2012 at 11:24:08AM -0500, Jaime Casanova wrote:

On Mon, Jan 23, 2012 at 7:18 PM, Noah Misch <noah@leadboat.com> wrote:

If someone feels like
doing it, +1 for making pgstattuple() count non-leaf free space.

actually i agreed that non-leaf pages are irrelevant... i just
confirmed that in a production system with 300GB none of the indexes
in an 84M rows table nor in a heavily updated one has more than 1 root
page, all the rest are deleted, half_dead or leaf. so the posibility
of bloat coming from non-leaf pages seems very odd

FWIW, the number to look at is internal_pages from pgstatindex():

[local] test=# create table t4 (c) as select * from generate_series(1,1000000);
SELECT 1000000
[local] test=# alter table t4 add primary key(c);
NOTICE: ALTER TABLE / ADD PRIMARY KEY will create implicit index "t4_pkey" for table "t4"
ALTER TABLE
[local] test=# select * from pgstatindex('t4_pkey');
-[ RECORD 1 ]------+---------
version | 2
tree_level | 2
index_size | 22478848
root_block_no | 290
internal_pages | 10
leaf_pages | 2733
empty_pages | 0
deleted_pages | 0
avg_leaf_density | 90.06
leaf_fragmentation | 0

So, 0.4% of this index. They appear in proportion to the logarithm of the
total index size. I agree that bloat centered on them is unlikely. Counting
them would be justified, but that is a question of formal accuracy rather than
practical importance.

but the possibility of bloat coming from the meta page doesn't exist,
AFAIUI at least

we need the most accurate value about usable free space, because the
idea is to add a sampler mode to the function so we don't scan the
whole relation. that's why we still need the function.

I doubt we'd add this function solely on the basis that a future improvement
will make it useful. For the patch to go in now, it needs to be useful now.
(This is not a universal principle, but it mostly holds for low-complexity
patches like this one.)

All my comments below would also apply to such a broader patch.

btw... pgstattuple also has the problem that it's not using a ring buffer

attached are two patches:
- v5: is the same original patch but only track space in leaf, deleted
and half_dead pages
- v5.1: adds the same for all kind of indexes (problem is that this is
inconsistent with the fact that pageinspect only manages btree indexes
for everything else)

Let's take a step back. Again, what you're proposing is essentially a faster
implementation of "SELECT free_percent FROM pgstattuple(rel)". If this code
belongs in core at all, it belongs in the pgstattuple module. Share as much
infrastructure as is reasonable between the user-visible functions of that
module. For example, I'm suspecting that the pgstat_index() call tree should
be shared, with pgstat_index_page() checking a flag to decide whether to
gather per-tuple stats.

Next, compare the bits of code that differ between pgstattuple() and
relation_free_space(), convincing yourself that the differences are justified.
Each difference will yield one of the following conclusions:

1. Your code contains an innovation that would apply to both functions. Where
not too difficult, merge these improvements into pgstattuple(). In order for
a demonstration of your new code's better performance to be interesting, we
must fix the same low-hanging fruit in its competitor. One example is the use
of the bulk read strategy. Another is the support for SP-GiST.

2. Your code is missing an essential behavior of pgstattuple(). Add it to
your code. One example is the presence of CHECK_FOR_INTERRUPTS() calls.

3. Your code behaves differently from pgstattuple() due to a fundamental
difference in their tasks. These are the only functional differences that
ought to remain in your finished patch; please point them out in comments.
For example, pgstat_heap() visits every tuple in the heap. You'll have no
reason to do that; pgstattuple() only needs it to calculate statistics other
than free_percent.

In particular, I call your attention to the fact that pgstattuple() takes
shared buffer content locks before examining pages. Your proposed patch does
not do so. I do not know with certainty whether that falls under #1 or #2.
The broad convention is to take such locks, because we elsewhere want an exact
answer. These functions are already inexact; they make no effort to observe a
consistent snapshot of the table. If you convince yourself that the error
arising from not locking buffers is reasonably bounded, we can lose the locks
(in both functions -- conclusion #1). Comments would then be in order.

With all that done, run some quick benchmarks: see how "SELECT free_percent
FROM pgstattuple(rel)" fares compared to "SELECT relation_free_space(rel)" for
a large heap and for a large B-tree index. If the timing difference is too
small to be interesting to you, remove relation_free_space() and submit your
pgstattuple() improvements alone. Otherwise, submit as written.

I suppose I didn't make all of this clear enough earlier; sorry for that.

Thanks,
nm

#31

jaime@2ndquadrant.com

almost 14 years ago

In reply to: Noah Misch (#30)

3 attachment(s)

Re: Measuring relation free space

On Wed, Jan 25, 2012 at 9:47 PM, Noah Misch <noah@leadboat.com> wrote:

With all that done, run some quick benchmarks: see how "SELECT free_percent
FROM pgstattuple(rel)" fares compared to "SELECT relation_free_space(rel)" for
a large heap and for a large B-tree index. If the timing difference is too
small to be interesting to you, remove relation_free_space() and submit your
pgstattuple() improvements alone. Otherwise, submit as written.

Ok. I split this in three patches.

1) pgstattuple-gin_spgist.patch
This first patch adds gin and spgist support to pgstattuple, also
makes pgstattuple use a ring buffer when reading tables or indexes.

2) pgstattuple-relation_free_space.patch
This patch adds the relation_free_space function to pgstattuple.

the function relation_free_space() is faster than pgstattuple(), to
test that i initialize pgbench with a scale of 40.
In that context pgstattuple() tooks 1.4s to process pgbench_account
table and relation_free_space() tooks 730ms (half the time!)
In the index the difference is less notorious, 170ms the former and
150ms the latter.

3) pgstattuple-stats_target.patch
This patch adds a stats_target parameter to the relation_free_space()
function, it mimics the way analyze choose the blocks to read and is
faster than plain relation_free_space() but of course could be inexact
if the pages that we don't read are the ones with more free space

--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación

Attachments:

pgstattuple-gin_spgist.patchtext/x-patch; charset=US-ASCII; name=pgstattuple-gin_spgist.patchDownload

diff --git a/contrib/pgstattuple/pgstatindex.c b/contrib/pgstattuple/pgstatindex.c
index beff1b9..9f2ec1f 100644
*** a/contrib/pgstattuple/pgstatindex.c
--- b/contrib/pgstattuple/pgstatindex.c
*************** pgstatindex(PG_FUNCTION_ARGS)
*** 95,100 ****
--- 95,101 ----
  	BlockNumber nblocks;
  	BlockNumber blkno;
  	BTIndexStat indexStat;
+  	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
  
  	if (!superuser())
  		ereport(ERROR,
*************** pgstatindex(PG_FUNCTION_ARGS)
*** 122,128 ****
  	 * Read metapage
  	 */
  	{
! 		Buffer		buffer = ReadBuffer(rel, 0);
  		Page		page = BufferGetPage(buffer);
  		BTMetaPageData *metad = BTPageGetMeta(page);
  
--- 123,129 ----
  	 * Read metapage
  	 */
  	{
! 		Buffer		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, 0, RBM_NORMAL, bstrategy);
  		Page		page = BufferGetPage(buffer);
  		BTMetaPageData *metad = BTPageGetMeta(page);
  
*************** pgstatindex(PG_FUNCTION_ARGS)
*** 159,165 ****
  		CHECK_FOR_INTERRUPTS();
  
  		/* Read and lock buffer */
! 		buffer = ReadBuffer(rel, blkno);
  		LockBuffer(buffer, BUFFER_LOCK_SHARE);
  
  		page = BufferGetPage(buffer);
--- 160,166 ----
  		CHECK_FOR_INTERRUPTS();
  
  		/* Read and lock buffer */
!  		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
  		LockBuffer(buffer, BUFFER_LOCK_SHARE);
  
  		page = BufferGetPage(buffer);
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index e5ddd87..6bbc957 100644
*** a/contrib/pgstattuple/pgstattuple.c
--- b/contrib/pgstattuple/pgstattuple.c
***************
*** 24,33 ****
--- 24,35 ----
  
  #include "postgres.h"
  
+ #include "access/gin_private.h"
  #include "access/gist_private.h"
  #include "access/hash.h"
  #include "access/nbtree.h"
  #include "access/relscan.h"
+ #include "access/spgist_private.h"
  #include "catalog/namespace.h"
  #include "funcapi.h"
  #include "miscadmin.h"
*************** static void pgstat_hash_page(pgstattuple
*** 73,83 ****
--- 75,96 ----
  				 Relation rel, BlockNumber blkno);
  static void pgstat_gist_page(pgstattuple_type *stat,
  				 Relation rel, BlockNumber blkno);
+ static void pgstat_gin_page(pgstattuple_type *stat,
+ 				 Relation rel, BlockNumber blkno);
+ static void pgstat_spgist_page(pgstattuple_type *stat,
+ 				 Relation rel, BlockNumber blkno);
  static Datum pgstat_index(Relation rel, BlockNumber start,
  			 pgstat_page pagefn, FunctionCallInfo fcinfo);
  static void pgstat_index_page(pgstattuple_type *stat, Page page,
  				  OffsetNumber minoff, OffsetNumber maxoff);
  
+ /* 
+  * Buffer access strategy for reading relations, it's simpler to keep it
+  * global because pgstat_*_page() functions read one buffer at a time.
+  * pgstat_heap() and pgstat_index() should initialize it before use.
+  */
+ BufferAccessStrategy bstrategy;
+ 
  /*
   * build_pgstattuple_type -- build a pgstattuple_type tuple
   */
*************** pgstat_relation(Relation rel, FunctionCa
*** 229,235 ****
  					return pgstat_index(rel, GIST_ROOT_BLKNO + 1,
  										pgstat_gist_page, fcinfo);
  				case GIN_AM_OID:
! 					err = "gin index";
  					break;
  				default:
  					err = "unknown index";
--- 242,253 ----
  					return pgstat_index(rel, GIST_ROOT_BLKNO + 1,
  										pgstat_gist_page, fcinfo);
  				case GIN_AM_OID:
! 					return pgstat_index(rel, GIN_METAPAGE_BLKNO + 1,
! 										pgstat_gin_page, fcinfo);
! 					break;
! 				case SPGIST_AM_OID:
! 					return pgstat_index(rel, SPGIST_METAPAGE_BLKNO + 1,
! 										pgstat_spgist_page, fcinfo);
  					break;
  				default:
  					err = "unknown index";
*************** pgstat_heap(Relation rel, FunctionCallIn
*** 276,281 ****
--- 294,302 ----
  
  	nblocks = scan->rs_nblocks; /* # blocks to be scanned */
  
+ 	/* prepare access strategy for this table */
+ 	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+ 
  	/* scan the relation */
  	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
  	{
*************** pgstat_heap(Relation rel, FunctionCallIn
*** 309,315 ****
  		{
  			CHECK_FOR_INTERRUPTS();
  
! 			buffer = ReadBuffer(rel, block);
  			LockBuffer(buffer, BUFFER_LOCK_SHARE);
  			stat.free_space += PageGetHeapFreeSpace((Page) BufferGetPage(buffer));
  			UnlockReleaseBuffer(buffer);
--- 330,336 ----
  		{
  			CHECK_FOR_INTERRUPTS();
  
! 			buffer = ReadBufferExtended(rel, MAIN_FORKNUM, block, RBM_NORMAL, bstrategy);
  			LockBuffer(buffer, BUFFER_LOCK_SHARE);
  			stat.free_space += PageGetHeapFreeSpace((Page) BufferGetPage(buffer));
  			UnlockReleaseBuffer(buffer);
*************** pgstat_heap(Relation rel, FunctionCallIn
*** 322,328 ****
  	{
  		CHECK_FOR_INTERRUPTS();
  
! 		buffer = ReadBuffer(rel, block);
  		LockBuffer(buffer, BUFFER_LOCK_SHARE);
  		stat.free_space += PageGetHeapFreeSpace((Page) BufferGetPage(buffer));
  		UnlockReleaseBuffer(buffer);
--- 343,349 ----
  	{
  		CHECK_FOR_INTERRUPTS();
  
! 		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, block, RBM_NORMAL, bstrategy);
  		LockBuffer(buffer, BUFFER_LOCK_SHARE);
  		stat.free_space += PageGetHeapFreeSpace((Page) BufferGetPage(buffer));
  		UnlockReleaseBuffer(buffer);
*************** pgstat_btree_page(pgstattuple_type *stat
*** 345,351 ****
  	Buffer		buf;
  	Page		page;
  
! 	buf = ReadBuffer(rel, blkno);
  	LockBuffer(buf, BT_READ);
  	page = BufferGetPage(buf);
  
--- 366,372 ----
  	Buffer		buf;
  	Page		page;
  
! 	buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
  	LockBuffer(buf, BT_READ);
  	page = BufferGetPage(buf);
  
*************** pgstat_hash_page(pgstattuple_type *stat,
*** 389,395 ****
  	Page		page;
  
  	_hash_getlock(rel, blkno, HASH_SHARE);
! 	buf = _hash_getbuf(rel, blkno, HASH_READ, 0);
  	page = BufferGetPage(buf);
  
  	if (PageGetSpecialSize(page) == MAXALIGN(sizeof(HashPageOpaqueData)))
--- 410,416 ----
  	Page		page;
  
  	_hash_getlock(rel, blkno, HASH_SHARE);
! 	buf = _hash_getbuf_with_strategy(rel, blkno, HASH_READ, 0, bstrategy);
  	page = BufferGetPage(buf);
  
  	if (PageGetSpecialSize(page) == MAXALIGN(sizeof(HashPageOpaqueData)))
*************** pgstat_gist_page(pgstattuple_type *stat,
*** 431,437 ****
  	Buffer		buf;
  	Page		page;
  
! 	buf = ReadBuffer(rel, blkno);
  	LockBuffer(buf, GIST_SHARE);
  	gistcheckpage(rel, buf);
  	page = BufferGetPage(buf);
--- 452,458 ----
  	Buffer		buf;
  	Page		page;
  
! 	buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
  	LockBuffer(buf, GIST_SHARE);
  	gistcheckpage(rel, buf);
  	page = BufferGetPage(buf);
*************** pgstat_gist_page(pgstattuple_type *stat,
*** 450,455 ****
--- 471,538 ----
  }
  
  /*
+  * pgstat_gin_page -- check tuples in a gin page
+  */
+ static void
+ pgstat_gin_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno)
+ {
+ 	Buffer		buf;
+ 	Page		page;
+ 
+ 	buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
+ 	LockBuffer(buf, GIN_SHARE);
+ 	page = BufferGetPage(buf);
+ 
+ 	if (GinPageIsDeleted(page))
+ 	{
+ 		/* recyclable page */
+ 		stat->free_space += BLCKSZ; 
+ 	}
+ 	else if (GinPageIsLeaf(page) || GinPageIsData(page))
+ 	{
+ 		pgstat_index_page(stat, page, FirstOffsetNumber,
+ 						  PageGetMaxOffsetNumber(page));
+ 	}
+ 	else
+ 	{
+ 		/* root or node */
+ 	}
+ 
+ 	UnlockReleaseBuffer(buf);
+ }
+ 
+ /*
+  * pgstat_spgist_page -- check tuples in a spgist page
+  */
+ static void
+ pgstat_spgist_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno)
+ {
+ 	Buffer		buf;
+ 	Page		page;
+ 
+ 	buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
+ 	LockBuffer(buf, BUFFER_LOCK_SHARE);
+ 	page = BufferGetPage(buf);
+ 
+ 	if (SpGistPageIsDeleted(page))
+ 	{
+ 		/* recyclable page */
+ 	    stat->free_space += BLCKSZ;
+ 	}
+ 	else if (SpGistPageIsLeaf(page))
+ 	{
+ 		pgstat_index_page(stat, page, FirstOffsetNumber,
+ 						  PageGetMaxOffsetNumber(page));
+ 	}
+ 	else
+ 	{
+ 		/* root or node */
+ 	}
+ 
+ 	UnlockReleaseBuffer(buf);
+ }
+ 
+ /*
   * pgstat_index -- returns live/dead tuples info in a generic index
   */
  static Datum
*************** pgstat_index(Relation rel, BlockNumber s
*** 460,465 ****
--- 543,551 ----
  	BlockNumber blkno;
  	pgstattuple_type stat = {0};
  
+ 	/* prepare access strategy for this index */
+ 	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+ 
  	blkno = start;
  	for (;;)
  	{

pgstattuple-relation_free_space.patchtext/x-patch; charset=US-ASCII; name=pgstattuple-relation_free_space.patchDownload

diff --git a/contrib/pgstattuple/Makefile b/contrib/pgstattuple/Makefile
index 6ac2775..75c4405 100644
*** a/contrib/pgstattuple/Makefile
--- b/contrib/pgstattuple/Makefile
*************** MODULE_big	= pgstattuple
*** 4,10 ****
  OBJS		= pgstattuple.o pgstatindex.o
  
  EXTENSION = pgstattuple
! DATA = pgstattuple--1.0.sql pgstattuple--unpackaged--1.0.sql
  
  REGRESS = pgstattuple
  
--- 4,13 ----
  OBJS		= pgstattuple.o pgstatindex.o
  
  EXTENSION = pgstattuple
! DATA = pgstattuple--1.0.sql pgstattuple--1.1.sql \
!        pgstattuple--1.0--1.1.sql \
!        pgstattuple--unpackaged--1.0.sql 
! 		
  
  REGRESS = pgstattuple
  
diff --git a/contrib/pgstattuple/pgstattuple--1.0--1.1.sql b/contrib/pgstattuple/pgstattuple--1.0--1.1.sql
index ...d7ad6ca .
*** a/contrib/pgstattuple/pgstattuple--1.0--1.1.sql
--- b/contrib/pgstattuple/pgstattuple--1.0--1.1.sql
***************
*** 0 ****
--- 1,12 ----
+ /* contrib/pgstattuple/pgstattuple--1.0--1.1.sql */
+ 
+ -- complain if script is sourced in psql, rather than via CREATE EXTENSION
+ \echo Use "ALTER EXTENSION pgstattuple UPDATE TO '1.1'" to load this file. \quit
+ 
+ --
+ -- relation_free_space()
+ --
+ CREATE FUNCTION relation_free_space(IN relname text)
+ RETURNS real
+ AS 'MODULE_PATHNAME', 'relation_free_space'
+ LANGUAGE C STRICT;
diff --git a/contrib/pgstattuple/pgstattuple--1.1.sql b/contrib/pgstattuple/pgstattuple--1.1.sql
index ...d94c20f .
*** a/contrib/pgstattuple/pgstattuple--1.1.sql
--- b/contrib/pgstattuple/pgstattuple--1.1.sql
***************
*** 0 ****
--- 1,57 ----
+ /* contrib/pgstattuple/pgstattuple--1.0.sql */
+ 
+ -- complain if script is sourced in psql, rather than via CREATE EXTENSION
+ \echo Use "CREATE EXTENSION pgstattuple" to load this file. \quit
+ 
+ CREATE FUNCTION pgstattuple(IN relname text,
+     OUT table_len BIGINT,		-- physical table length in bytes
+     OUT tuple_count BIGINT,		-- number of live tuples
+     OUT tuple_len BIGINT,		-- total tuples length in bytes
+     OUT tuple_percent FLOAT8,		-- live tuples in %
+     OUT dead_tuple_count BIGINT,	-- number of dead tuples
+     OUT dead_tuple_len BIGINT,		-- total dead tuples length in bytes
+     OUT dead_tuple_percent FLOAT8,	-- dead tuples in %
+     OUT free_space BIGINT,		-- free space in bytes
+     OUT free_percent FLOAT8)		-- free space in %
+ AS 'MODULE_PATHNAME', 'pgstattuple'
+ LANGUAGE C STRICT;
+ 
+ CREATE FUNCTION pgstattuple(IN reloid oid,
+     OUT table_len BIGINT,		-- physical table length in bytes
+     OUT tuple_count BIGINT,		-- number of live tuples
+     OUT tuple_len BIGINT,		-- total tuples length in bytes
+     OUT tuple_percent FLOAT8,		-- live tuples in %
+     OUT dead_tuple_count BIGINT,	-- number of dead tuples
+     OUT dead_tuple_len BIGINT,		-- total dead tuples length in bytes
+     OUT dead_tuple_percent FLOAT8,	-- dead tuples in %
+     OUT free_space BIGINT,		-- free space in bytes
+     OUT free_percent FLOAT8)		-- free space in %
+ AS 'MODULE_PATHNAME', 'pgstattuplebyid'
+ LANGUAGE C STRICT;
+ 
+ CREATE FUNCTION pgstatindex(IN relname text,
+     OUT version INT,
+     OUT tree_level INT,
+     OUT index_size BIGINT,
+     OUT root_block_no BIGINT,
+     OUT internal_pages BIGINT,
+     OUT leaf_pages BIGINT,
+     OUT empty_pages BIGINT,
+     OUT deleted_pages BIGINT,
+     OUT avg_leaf_density FLOAT8,
+     OUT leaf_fragmentation FLOAT8)
+ AS 'MODULE_PATHNAME', 'pgstatindex'
+ LANGUAGE C STRICT;
+ 
+ CREATE FUNCTION pg_relpages(IN relname text)
+ RETURNS BIGINT
+ AS 'MODULE_PATHNAME', 'pg_relpages'
+ LANGUAGE C STRICT;
+ 
+ --
+ -- relation_free_space()
+ --
+ CREATE FUNCTION relation_free_space(IN relname text)
+ RETURNS real
+ AS 'MODULE_PATHNAME', 'relation_free_space'
+ LANGUAGE C STRICT;
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 89ec4da..1001035 100644
*** a/contrib/pgstattuple/pgstattuple.c
--- b/contrib/pgstattuple/pgstattuple.c
*************** PG_MODULE_MAGIC;
*** 43,51 ****
--- 43,53 ----
  
  PG_FUNCTION_INFO_V1(pgstattuple);
  PG_FUNCTION_INFO_V1(pgstattuplebyid);
+ PG_FUNCTION_INFO_V1(relation_free_space);
  
  extern Datum pgstattuple(PG_FUNCTION_ARGS);
  extern Datum pgstattuplebyid(PG_FUNCTION_ARGS);
+ extern Datum relation_free_space(PG_FUNCTION_ARGS);
  
  /*
   * struct pgstattuple_type
*************** typedef struct pgstattuple_type
*** 63,93 ****
  	uint64		free_space;		/* free/reusable space in bytes */
  } pgstattuple_type;
  
! typedef void (*pgstat_page) (pgstattuple_type *, Relation, BlockNumber);
  
  static Datum build_pgstattuple_type(pgstattuple_type *stat,
  					   FunctionCallInfo fcinfo);
  static Datum pgstat_relation(Relation rel, FunctionCallInfo fcinfo);
  static Datum pgstat_heap(Relation rel, FunctionCallInfo fcinfo);
  static void pgstat_btree_page(pgstattuple_type *stat,
! 				  Relation rel, BlockNumber blkno);
  static void pgstat_hash_page(pgstattuple_type *stat,
! 				 Relation rel, BlockNumber blkno);
  static void pgstat_gist_page(pgstattuple_type *stat,
! 				 Relation rel, BlockNumber blkno);
  static void pgstat_gin_page(pgstattuple_type *stat,
! 				 Relation rel, BlockNumber blkno);
  static void pgstat_spgist_page(pgstattuple_type *stat,
! 				 Relation rel, BlockNumber blkno);
  static Datum pgstat_index(Relation rel, BlockNumber start,
  			 pgstat_page pagefn, FunctionCallInfo fcinfo);
  static void pgstat_index_page(pgstattuple_type *stat, Page page,
! 				  OffsetNumber minoff, OffsetNumber maxoff);
  
  /* 
   * Buffer access strategy for reading relations, it's simpler to keep it
   * global because pgstat_*_page() functions read one buffer at a time.
!  * pgstat_heap() and pgstat_index() should initialize it before use.
   */
  BufferAccessStrategy bstrategy;
  
--- 65,98 ----
  	uint64		free_space;		/* free/reusable space in bytes */
  } pgstattuple_type;
  
! typedef void (*pgstat_page) (pgstattuple_type *, Relation, BlockNumber, bool);
  
  static Datum build_pgstattuple_type(pgstattuple_type *stat,
  					   FunctionCallInfo fcinfo);
  static Datum pgstat_relation(Relation rel, FunctionCallInfo fcinfo);
  static Datum pgstat_heap(Relation rel, FunctionCallInfo fcinfo);
  static void pgstat_btree_page(pgstattuple_type *stat,
! 				  Relation rel, BlockNumber blkno, bool enable_counters);
  static void pgstat_hash_page(pgstattuple_type *stat,
! 				 Relation rel, BlockNumber blkno, bool enable_counters);
  static void pgstat_gist_page(pgstattuple_type *stat,
! 				 Relation rel, BlockNumber blkno, bool enable_counters);
  static void pgstat_gin_page(pgstattuple_type *stat,
! 				 Relation rel, BlockNumber blkno, bool enable_counters);
  static void pgstat_spgist_page(pgstattuple_type *stat,
! 				 Relation rel, BlockNumber blkno, bool enable_counters);
  static Datum pgstat_index(Relation rel, BlockNumber start,
  			 pgstat_page pagefn, FunctionCallInfo fcinfo);
  static void pgstat_index_page(pgstattuple_type *stat, Page page,
! 				  OffsetNumber minoff, OffsetNumber maxoff, bool enable_counters);
! 
! static float4 GetHeapRelationFreeSpace(Relation rel);
! static float4 GetIndexRelationFreeSpace(Relation rel);
  
  /* 
   * Buffer access strategy for reading relations, it's simpler to keep it
   * global because pgstat_*_page() functions read one buffer at a time.
!  * every consumer of bstrategy should initialize it before use.
   */
  BufferAccessStrategy bstrategy;
  
*************** pgstat_heap(Relation rel, FunctionCallIn
*** 362,368 ****
   * pgstat_btree_page -- check tuples in a btree page
   */
  static void
! pgstat_btree_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno)
  {
  	Buffer		buf;
  	Page		page;
--- 367,373 ----
   * pgstat_btree_page -- check tuples in a btree page
   */
  static void
! pgstat_btree_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno, bool enable_counters)
  {
  	Buffer		buf;
  	Page		page;
*************** pgstat_btree_page(pgstattuple_type *stat
*** 390,396 ****
  		else if (P_ISLEAF(opaque))
  		{
  			pgstat_index_page(stat, page, P_FIRSTDATAKEY(opaque),
! 							  PageGetMaxOffsetNumber(page));
  		}
  		else
  		{
--- 395,401 ----
  		else if (P_ISLEAF(opaque))
  		{
  			pgstat_index_page(stat, page, P_FIRSTDATAKEY(opaque),
! 							  PageGetMaxOffsetNumber(page), enable_counters);
  		}
  		else
  		{
*************** pgstat_btree_page(pgstattuple_type *stat
*** 405,411 ****
   * pgstat_hash_page -- check tuples in a hash page
   */
  static void
! pgstat_hash_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno)
  {
  	Buffer		buf;
  	Page		page;
--- 410,416 ----
   * pgstat_hash_page -- check tuples in a hash page
   */
  static void
! pgstat_hash_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno, bool enable_counters)
  {
  	Buffer		buf;
  	Page		page;
*************** pgstat_hash_page(pgstattuple_type *stat,
*** 427,433 ****
  			case LH_BUCKET_PAGE:
  			case LH_OVERFLOW_PAGE:
  				pgstat_index_page(stat, page, FirstOffsetNumber,
! 								  PageGetMaxOffsetNumber(page));
  				break;
  			case LH_BITMAP_PAGE:
  			case LH_META_PAGE:
--- 432,438 ----
  			case LH_BUCKET_PAGE:
  			case LH_OVERFLOW_PAGE:
  				pgstat_index_page(stat, page, FirstOffsetNumber,
! 								  PageGetMaxOffsetNumber(page), enable_counters);
  				break;
  			case LH_BITMAP_PAGE:
  			case LH_META_PAGE:
*************** pgstat_hash_page(pgstattuple_type *stat,
*** 448,454 ****
   * pgstat_gist_page -- check tuples in a gist page
   */
  static void
! pgstat_gist_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno)
  {
  	Buffer		buf;
  	Page		page;
--- 453,459 ----
   * pgstat_gist_page -- check tuples in a gist page
   */
  static void
! pgstat_gist_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno, bool enable_counters)
  {
  	Buffer		buf;
  	Page		page;
*************** pgstat_gist_page(pgstattuple_type *stat,
*** 461,467 ****
  	if (GistPageIsLeaf(page))
  	{
  		pgstat_index_page(stat, page, FirstOffsetNumber,
! 						  PageGetMaxOffsetNumber(page));
  	}
  	else
  	{
--- 466,472 ----
  	if (GistPageIsLeaf(page))
  	{
  		pgstat_index_page(stat, page, FirstOffsetNumber,
! 						  PageGetMaxOffsetNumber(page), enable_counters);
  	}
  	else
  	{
*************** pgstat_gist_page(pgstattuple_type *stat,
*** 475,481 ****
   * pgstat_gin_page -- check tuples in a gin page
   */
  static void
! pgstat_gin_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno)
  {
  	Buffer		buf;
  	Page		page;
--- 480,486 ----
   * pgstat_gin_page -- check tuples in a gin page
   */
  static void
! pgstat_gin_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno, bool enable_counters)
  {
  	Buffer		buf;
  	Page		page;
*************** pgstat_gin_page(pgstattuple_type *stat,
*** 492,498 ****
  	else if (GinPageIsLeaf(page) || GinPageIsData(page))
  	{
  		pgstat_index_page(stat, page, FirstOffsetNumber,
! 						  PageGetMaxOffsetNumber(page));
  	}
  	else
  	{
--- 497,503 ----
  	else if (GinPageIsLeaf(page) || GinPageIsData(page))
  	{
  		pgstat_index_page(stat, page, FirstOffsetNumber,
! 						  PageGetMaxOffsetNumber(page), enable_counters);
  	}
  	else
  	{
*************** pgstat_gin_page(pgstattuple_type *stat,
*** 506,512 ****
   * pgstat_spgist_page -- check tuples in a spgist page
   */
  static void
! pgstat_spgist_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno)
  {
  	Buffer		buf;
  	Page		page;
--- 511,517 ----
   * pgstat_spgist_page -- check tuples in a spgist page
   */
  static void
! pgstat_spgist_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno, bool enable_counters)
  {
  	Buffer		buf;
  	Page		page;
*************** pgstat_spgist_page(pgstattuple_type *sta
*** 523,529 ****
  	else if (SpGistPageIsLeaf(page))
  	{
  		pgstat_index_page(stat, page, FirstOffsetNumber,
! 						  PageGetMaxOffsetNumber(page));
  	}
  	else
  	{
--- 528,534 ----
  	else if (SpGistPageIsLeaf(page))
  	{
  		pgstat_index_page(stat, page, FirstOffsetNumber,
! 						  PageGetMaxOffsetNumber(page), enable_counters);
  	}
  	else
  	{
*************** pgstat_index(Relation rel, BlockNumber s
*** 567,573 ****
  		{
  			CHECK_FOR_INTERRUPTS();
  
! 			pagefn(&stat, rel, blkno);
  		}
  	}
  
--- 572,578 ----
  		{
  			CHECK_FOR_INTERRUPTS();
  
! 			pagefn(&stat, rel, blkno, true);
  		}
  	}
  
*************** pgstat_index(Relation rel, BlockNumber s
*** 581,605 ****
   */
  static void
  pgstat_index_page(pgstattuple_type *stat, Page page,
! 				  OffsetNumber minoff, OffsetNumber maxoff)
  {
  	OffsetNumber i;
  
  	stat->free_space += PageGetFreeSpace(page);
  
! 	for (i = minoff; i <= maxoff; i = OffsetNumberNext(i))
  	{
! 		ItemId		itemid = PageGetItemId(page, i);
  
! 		if (ItemIdIsDead(itemid))
  		{
! 			stat->dead_tuple_count++;
! 			stat->dead_tuple_len += ItemIdGetLength(itemid);
  		}
! 		else
  		{
! 			stat->tuple_count++;
! 			stat->tuple_len += ItemIdGetLength(itemid);
  		}
  	}
  }
--- 586,808 ----
   */
  static void
  pgstat_index_page(pgstattuple_type *stat, Page page,
! 				  OffsetNumber minoff, OffsetNumber maxoff, bool enable_counters)
  {
  	OffsetNumber i;
  
  	stat->free_space += PageGetFreeSpace(page);
  
! 	if (enable_counters)
  	{
! 		for (i = minoff; i <= maxoff; i = OffsetNumberNext(i))
! 		{
! 			ItemId		itemid = PageGetItemId(page, i);
! 	
! 			if (ItemIdIsDead(itemid))
! 			{
! 				stat->dead_tuple_count++;
! 				stat->dead_tuple_len += ItemIdGetLength(itemid);
! 			}
! 			else
! 			{
! 				stat->tuple_count++;
! 				stat->tuple_len += ItemIdGetLength(itemid);
! 			}
! 		}
! 	}
! }
  
! /*
!  * relation_free_space
!  *
!  * Returns the percentage of free space for a given relation.
!  */
! Datum
! relation_free_space(PG_FUNCTION_ARGS)
! {
! 	text       *relname = PG_GETARG_TEXT_P(0);
! 	RangeVar	*relrv;
! 	Relation	rel;
! 	float4 		free_space = 0;
! 
! 	relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname));
! 	rel = relation_openrv(relrv, AccessShareLock);
! 
! 	/* Only allow tables or indexes */
! 	if (rel->rd_rel->relkind != RELKIND_INDEX && 
! 		rel->rd_rel->relkind != RELKIND_RELATION &&
! 		rel->rd_rel->relkind != RELKIND_TOASTVALUE)
! 		ereport(ERROR,
! 			(errcode(ERRCODE_WRONG_OBJECT_TYPE),
! 				errmsg("cannot get free space from object \"%s\"",
! 					RelationGetRelationName(rel))));
! 
! 	/*
! 	 * Reject attempts to read non-local temporary relations.  We would be
! 	 * likely to get wrong data, since we have no visibility into the owning
! 	 * session's local buffers.
! 	 */
! 	if (RELATION_IS_OTHER_TEMP(rel))
! 		ereport(ERROR,
! 			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
! 				errmsg("cannot access temporary tables of other sessions")));
! 
! 	switch (rel->rd_rel->relkind)
! 	{
! 		case RELKIND_RELATION: 
! 		case RELKIND_TOASTVALUE: 
! 			free_space = GetHeapRelationFreeSpace(rel);
! 			break;
! 		case RELKIND_INDEX: 
! 			free_space = GetIndexRelationFreeSpace(rel);
! 			break;
! 		default:
! 			/* we are not supposed to get here */
! 			ereport(ERROR,
! 				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
! 					errmsg("cannot get free space from object \"%s\"",
! 						RelationGetRelationName(rel))));
! 	}
! 	relation_close(rel, AccessShareLock);
! 
! 	PG_RETURN_FLOAT4(free_space);
! }
! 
! /*
!  * GetHeapRelationFreeSpace()
!  *
!  * Get the free space for a heap relation.
!  * This is a helper function for relation_free_space()
!  */
! static float4
! GetHeapRelationFreeSpace(Relation rel)
! {
! 	Buffer      buffer;
! 	Page		page;
! 	BlockNumber blkno = 0;
! 	BlockNumber nblocks; 
! 	Size        free_space = 0;
! 	double		free_percent = 0;
! 
! 	/* prepare access strategy for reading the table */
! 	bstrategy = GetAccessStrategy(BAS_BULKREAD);
! 
! 	for (;;)
! 	{
! 		/* Get the current relation length */
! 		LockRelationForExtension(rel, ExclusiveLock);
! 		nblocks = RelationGetNumberOfBlocks(rel);
! 		UnlockRelationForExtension(rel, ExclusiveLock);
! 
!         /* Quit if we've scanned the whole relation */
! 		if (blkno >= nblocks)
  		{
! 			break;
  		}
! 
! 		for (; blkno < nblocks; blkno++)
  		{
! 			CHECK_FOR_INTERRUPTS();
! 
! 			buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
! 			LockBuffer(buffer, BUFFER_LOCK_SHARE);
! 
! 			page   = BufferGetPage(buffer);
! 			free_space += PageGetHeapFreeSpace(page);
! 
! 			UnlockReleaseBuffer(buffer);
  		}
  	}
+ 
+ 	if (nblocks > 0)
+ 		free_percent = ((float4) free_space) / (nblocks * BLCKSZ);
+ 
+ 	return free_percent;
+ }
+ 
+ /*------------------------------------------------
+  * GetIndexRelationFreeSpace
+  *
+  * Look the type of the index and call the appropiate
+  * function to get the free space
+  *------------------------------------------------
+  */
+ static float4
+ GetIndexRelationFreeSpace(Relation rel)
+ {
+ 	BlockNumber nblocks;
+ 	BlockNumber blkno;
+     pgstattuple_type stat = {0};
+ 	double		free_percent = 0;
+ 
+ 	/* prepare access strategy for reading the index */
+ 	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+ 
+ 	/* start at block */
+ 	switch (rel->rd_rel->relam)
+ 	{
+ 		case BTREE_AM_OID:
+ 			blkno = BTREE_METAPAGE + 1;
+ 			break;
+ 		case HASH_AM_OID:
+ 			blkno = HASH_METAPAGE + 1;
+ 			break;
+ 		case GIST_AM_OID:
+ 			blkno = GIST_ROOT_BLKNO + 1;
+ 			break;
+ 		case GIN_AM_OID:
+ 			blkno = GIN_METAPAGE_BLKNO + 1;
+ 			break;
+ 		case SPGIST_AM_OID:
+ 			blkno = SPGIST_METAPAGE_BLKNO + 1;
+ 			break;
+ 		default:
+ 			blkno = 0;
+ 	}
+ 
+     for (;;)
+     {
+         /* Get the current relation length */
+         LockRelationForExtension(rel, ExclusiveLock);
+         nblocks = RelationGetNumberOfBlocks(rel);
+         UnlockRelationForExtension(rel, ExclusiveLock);
+ 
+         /* Quit if we've scanned the whole relation */
+ 		if (blkno >= nblocks)
+ 		{
+ 			break;
+ 		}
+ 
+         for (; blkno < nblocks; blkno++)
+         {
+             CHECK_FOR_INTERRUPTS();
+ 
+             switch (rel->rd_rel->relam)
+             {
+                 case BTREE_AM_OID:
+ 					pgstat_btree_page(&stat, rel, blkno, false);
+                     break;
+                 case HASH_AM_OID:
+ 					pgstat_hash_page(&stat, rel, blkno, false);
+                     break;
+                 case GIST_AM_OID:
+ 					pgstat_gist_page(&stat, rel, blkno, false);
+                     break;
+                 case GIN_AM_OID:
+ 					pgstat_gin_page(&stat, rel, blkno, false);
+                     break;
+                 case SPGIST_AM_OID:
+ 					pgstat_spgist_page(&stat, rel, blkno, false);
+                     break;
+                 default:
+ 					elog(ERROR, "Unknown index type %d", rel->rd_rel->relam);
+                     break;
+             }
+         }
+     }
+ 
+ 	if (nblocks > 0)
+ 		free_percent = ((float4) stat.free_space) / (nblocks * BLCKSZ);
+ 
+ 	return free_percent;
  }
diff --git a/contrib/pgstattuple/pgstattuple.control b/contrib/pgstattuple/pgstattuple.control
index 7b5129b..fcfd36f 100644
*** a/contrib/pgstattuple/pgstattuple.control
--- b/contrib/pgstattuple/pgstattuple.control
***************
*** 1,5 ****
  # pgstattuple extension
  comment = 'show tuple-level statistics'
! default_version = '1.0'
  module_pathname = '$libdir/pgstattuple'
  relocatable = true
--- 1,5 ----
  # pgstattuple extension
  comment = 'show tuple-level statistics'
! default_version = '1.1'
  module_pathname = '$libdir/pgstattuple'
  relocatable = true
diff --git a/doc/src/sgml/pgstattuple.sgml b/doc/src/sgml/pgstattuple.sgml
index a55b35c..4e99a54 100644
*** a/doc/src/sgml/pgstattuple.sgml
--- b/doc/src/sgml/pgstattuple.sgml
*************** leaf_fragmentation | 0
*** 256,261 ****
--- 256,277 ----
       </para>
      </listitem>
     </varlistentry>
+ 
+    <varlistentry>
+     <term>
+      <function>relation_free_space(relname text) returns float</function>
+     </term>
+ 
+     <listitem>
+      <para>
+       <function>relation_free_space</function> returns a factor that represents 
+       the total amount of free space in a table or index.  The entire relation 
+       is scanned in order to compute this amount, which may generate a large 
+       amount of disk reads.
+      </para>
+     </listitem>
+    </varlistentry>
+ 
    </variablelist>
   </sect2>

pgstattuple-stats_target.patchtext/x-patch; charset=US-ASCII; name=pgstattuple-stats_target.patchDownload

diff --git a/contrib/pgstattuple/pgstattuple--1.1.sql b/contrib/pgstattuple/pgstattuple--1.1.sql
index d94c20f..08507c1 100644
*** a/contrib/pgstattuple/pgstattuple--1.1.sql
--- b/contrib/pgstattuple/pgstattuple--1.1.sql
*************** LANGUAGE C STRICT;
*** 51,57 ****
  --
  -- relation_free_space()
  --
! CREATE FUNCTION relation_free_space(IN relname text)
  RETURNS real
  AS 'MODULE_PATHNAME', 'relation_free_space'
  LANGUAGE C STRICT;
--- 51,57 ----
  --
  -- relation_free_space()
  --
! CREATE FUNCTION relation_free_space(IN relname text, IN stats_target int DEFAULT 100)
  RETURNS real
  AS 'MODULE_PATHNAME', 'relation_free_space'
  LANGUAGE C STRICT;
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 1001035..8649b6a 100644
*** a/contrib/pgstattuple/pgstattuple.c
--- b/contrib/pgstattuple/pgstattuple.c
*************** static Datum pgstat_index(Relation rel,
*** 86,93 ****
  static void pgstat_index_page(pgstattuple_type *stat, Page page,
  				  OffsetNumber minoff, OffsetNumber maxoff, bool enable_counters);
  
! static float4 GetHeapRelationFreeSpace(Relation rel);
! static float4 GetIndexRelationFreeSpace(Relation rel);
  
  /* 
   * Buffer access strategy for reading relations, it's simpler to keep it
--- 86,94 ----
  static void pgstat_index_page(pgstattuple_type *stat, Page page,
  				  OffsetNumber minoff, OffsetNumber maxoff, bool enable_counters);
  
! static float4 GetHeapRelationFreeSpace(Relation rel, int32 stats_target);
! static float4 GetIndexRelationFreeSpace(Relation rel, int32 stats_target);
! static BlockNumber get_next_block(BlockNumber blkno, BlockNumber nblocks, int blocks_read, int32 stats_target);
  
  /* 
   * Buffer access strategy for reading relations, it's simpler to keep it
*************** Datum
*** 621,630 ****
--- 622,637 ----
  relation_free_space(PG_FUNCTION_ARGS)
  {
  	text       *relname = PG_GETARG_TEXT_P(0);
+ 	int32		stats_target = PG_GETARG_INT32(1);
  	RangeVar	*relrv;
  	Relation	rel;
  	float4 		free_space = 0;
  
+ 	if (stats_target < 1 || stats_target > 100)
+ 		ereport(ERROR,
+ 			(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ 				errmsg("parameter stats_target should be between 1 and 100")));
+ 
  	relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname));
  	rel = relation_openrv(relrv, AccessShareLock);
  
*************** relation_free_space(PG_FUNCTION_ARGS)
*** 651,660 ****
  	{
  		case RELKIND_RELATION: 
  		case RELKIND_TOASTVALUE: 
! 			free_space = GetHeapRelationFreeSpace(rel);
  			break;
  		case RELKIND_INDEX: 
! 			free_space = GetIndexRelationFreeSpace(rel);
  			break;
  		default:
  			/* we are not supposed to get here */
--- 658,667 ----
  	{
  		case RELKIND_RELATION: 
  		case RELKIND_TOASTVALUE: 
! 			free_space = GetHeapRelationFreeSpace(rel, stats_target);
  			break;
  		case RELKIND_INDEX: 
! 			free_space = GetIndexRelationFreeSpace(rel, stats_target);
  			break;
  		default:
  			/* we are not supposed to get here */
*************** relation_free_space(PG_FUNCTION_ARGS)
*** 675,686 ****
   * This is a helper function for relation_free_space()
   */
  static float4
! GetHeapRelationFreeSpace(Relation rel)
  {
  	Buffer      buffer;
  	Page		page;
  	BlockNumber blkno = 0;
  	BlockNumber nblocks; 
  	Size        free_space = 0;
  	double		free_percent = 0;
  
--- 682,695 ----
   * This is a helper function for relation_free_space()
   */
  static float4
! GetHeapRelationFreeSpace(Relation rel, int32 stats_target)
  {
  	Buffer      buffer;
  	Page		page;
  	BlockNumber blkno = 0;
  	BlockNumber nblocks; 
+  	BlockNumber totalBlocksInRelation; 
+ 	int			blocks_read = 0;
  	Size        free_space = 0;
  	double		free_percent = 0;
  
*************** GetHeapRelationFreeSpace(Relation rel)
*** 691,716 ****
  	{
  		/* Get the current relation length */
  		LockRelationForExtension(rel, ExclusiveLock);
! 		nblocks = RelationGetNumberOfBlocks(rel);
  		UnlockRelationForExtension(rel, ExclusiveLock);
  
          /* Quit if we've scanned the whole relation */
  		if (blkno >= nblocks)
  		{
  			break;
  		}
  
! 		for (; blkno < nblocks; blkno++)
  		{
  			CHECK_FOR_INTERRUPTS();
  
  			buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
  			LockBuffer(buffer, BUFFER_LOCK_SHARE);
  
  			page   = BufferGetPage(buffer);
  			free_space += PageGetHeapFreeSpace(page);
  
  			UnlockReleaseBuffer(buffer);
  		}
  	}
  
--- 700,733 ----
  	{
  		/* Get the current relation length */
  		LockRelationForExtension(rel, ExclusiveLock);
! 		totalBlocksInRelation = RelationGetNumberOfBlocks(rel);
  		UnlockRelationForExtension(rel, ExclusiveLock);
  
+ 		nblocks = totalBlocksInRelation * stats_target / 100;
+ 
          /* Quit if we've scanned the whole relation */
  		if (blkno >= nblocks)
  		{
  			break;
  		}
  
! 		for (; blkno < nblocks; )
  		{
  			CHECK_FOR_INTERRUPTS();
  
  			buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
  			LockBuffer(buffer, BUFFER_LOCK_SHARE);
+ 			blocks_read++;
  
  			page   = BufferGetPage(buffer);
  			free_space += PageGetHeapFreeSpace(page);
  
  			UnlockReleaseBuffer(buffer);
+ 
+ 			if (stats_target == 100)
+ 				blkno++;
+ 			else
+ 				blkno = get_next_block(blkno, totalBlocksInRelation, blocks_read, stats_target);
  		}
  	}
  
*************** GetHeapRelationFreeSpace(Relation rel)
*** 728,737 ****
   *------------------------------------------------
   */
  static float4
! GetIndexRelationFreeSpace(Relation rel)
  {
- 	BlockNumber nblocks;
  	BlockNumber blkno;
      pgstattuple_type stat = {0};
  	double		free_percent = 0;
  
--- 745,756 ----
   *------------------------------------------------
   */
  static float4
! GetIndexRelationFreeSpace(Relation rel, int32 stats_target)
  {
  	BlockNumber blkno;
+ 	BlockNumber nblocks;
+ 	BlockNumber totalBlocksInRelation; 
+ 	int 		blocks_read = 0;
      pgstattuple_type stat = {0};
  	double		free_percent = 0;
  
*************** GetIndexRelationFreeSpace(Relation rel)
*** 764,779 ****
      {
          /* Get the current relation length */
          LockRelationForExtension(rel, ExclusiveLock);
!         nblocks = RelationGetNumberOfBlocks(rel);
          UnlockRelationForExtension(rel, ExclusiveLock);
  
          /* Quit if we've scanned the whole relation */
  		if (blkno >= nblocks)
  		{
  			break;
  		}
  
!         for (; blkno < nblocks; blkno++)
          {
              CHECK_FOR_INTERRUPTS();
  
--- 783,800 ----
      {
          /* Get the current relation length */
          LockRelationForExtension(rel, ExclusiveLock);
! 		totalBlocksInRelation = RelationGetNumberOfBlocks(rel);
          UnlockRelationForExtension(rel, ExclusiveLock);
  
+ 		nblocks = totalBlocksInRelation * stats_target / 100;
+ 
          /* Quit if we've scanned the whole relation */
  		if (blkno >= nblocks)
  		{
  			break;
  		}
  
!         for (; blkno < nblocks; )
          {
              CHECK_FOR_INTERRUPTS();
  
*************** GetIndexRelationFreeSpace(Relation rel)
*** 798,803 ****
--- 819,830 ----
  					elog(ERROR, "Unknown index type %d", rel->rd_rel->relam);
                      break;
              }
+ 			blocks_read++;
+ 
+ 			if (stats_target == 100)
+ 				blkno++;
+ 			else
+ 				get_next_block(blkno, totalBlocksInRelation, blocks_read, stats_target);
          }
      }
  
*************** GetIndexRelationFreeSpace(Relation rel)
*** 806,808 ****
--- 833,870 ----
  
  	return free_percent;
  }
+ 
+ static BlockNumber
+ get_next_block(BlockNumber blkno, BlockNumber nblocks, int blocks_read, int32 stats_target)
+ {
+ 	BlockNumber K = nblocks - blkno;								/* remaining blocks */
+ 	int         k = (nblocks * stats_target / 100) - blocks_read;	/* blocks still to sample */
+ 	double      p;              									/* probability to skip block */
+ 	double      V; 
+ 
+ 	if (k > K)
+ 	{	
+ 		/* need all the rest */
+ 		return blkno + 1;
+ 	}
+ 
+ 	V = ((double) random() + 1) / ((double) MAX_RANDOM_VALUE + 2);
+     p = 1.0 - (double) k / (double) K;
+     while (V < p)
+     {
+ 		if (k > K)
+ 		{	
+ 			/* need all the rest */
+ 			return blkno + 1;
+ 		}
+ 
+         /* skip */
+         blkno++;
+         K--;                    /* keep K == N - t */
+ 
+         /* adjust p to be new cutoff point in reduced range */
+         p *= 1.0 - (double) k / (double) K;
+     }
+ 
+ 	return blkno + 1;
+ }

#32

noah@leadboat.com

almost 14 years ago

In reply to: Jaime Casanova (#31)

Re: Measuring relation free space

On Tue, Feb 14, 2012 at 02:04:26AM -0500, Jaime Casanova wrote:

On Wed, Jan 25, 2012 at 9:47 PM, Noah Misch <noah@leadboat.com> wrote:

With all that done, run some quick benchmarks: see how "SELECT free_percent
FROM pgstattuple(rel)" fares compared to "SELECT relation_free_space(rel)" for
a large heap and for a large B-tree index. ?If the timing difference is too
small to be interesting to you, remove relation_free_space() and submit your
pgstattuple() improvements alone. ?Otherwise, submit as written.

Ok. I split this in three patches.

1) pgstattuple-gin_spgist.patch
This first patch adds gin and spgist support to pgstattuple, also
makes pgstattuple use a ring buffer when reading tables or indexes.

The buffer access strategy usage bits look fine to commit. The gin and spgist
support has problems, detailed below.

2) pgstattuple-relation_free_space.patch
This patch adds the relation_free_space function to pgstattuple.

the function relation_free_space() is faster than pgstattuple(), to
test that i initialize pgbench with a scale of 40.
In that context pgstattuple() tooks 1.4s to process pgbench_account
table and relation_free_space() tooks 730ms (half the time!)
In the index the difference is less notorious, 170ms the former and
150ms the latter.

Benchmarks lasting on the order of one second are far too short. I tried the
first two patches on this 6914 MiB table and 4284 MiB index:

create table t(n) as select * from generate_series(1,200000000);
create index on t(n);

This machine has about 1 GiB of memory available for disk cache, and I used
shared_buffers = 128MB. I used a regular assert-enabled build with
debug_assertions = off. Timings:

pgstattuple.free_percent, heap: runtime 166.2s; answer 0.34
pgstattuple.free_percent, index: runtime 110.9s; answer 9.83
relation_free_space, heap: runtime 165.1s; answer 0.00838721
relation_free_space, index: runtime 108.7s; answer 2.23692

Note the disagreement in answers and the nonsensical answer from the last
test. The numbers do line up for smaller tables and indexes that I tried.

During the pgstattuple() runs on the heap, CPU usage divided evenly between
user and iowait time. With relation_free_space(), iowait kept above 80%. For
the index, pgstattuple() managed 60-80% iowait and relation_free_space() again
kept above 80%. Even so, that did not produce any significant change in
runtime. I'm guessing that readahead was highly effective here, so the I/O
bound dictated elapsed time.

Bottom line, this patch can probably achieve 50% speedups on already-in-memory
relations. It can reduce the contribution to CPU load, but not the elapsed
runtime, for relations we largely pull from disk. Do those benefits justify
the additional user-visible interface? I suppose the sort of installation
that would benefit most is one just short of the tipping point of the database
size exceeding memory size. Larger databases could not afford either
function, and smaller databases don't need to watch bloat so closely.
Offhand, I think that the I/O savings of sampling will be the real win, and
it's not worth an extra user-visible function to get the CPU usage savings
this patch offers. Other opinions welcome.

3) pgstattuple-stats_target.patch
This patch adds a stats_target parameter to the relation_free_space()
function, it mimics the way analyze choose the blocks to read and is
faster than plain relation_free_space() but of course could be inexact
if the pages that we don't read are the ones with more free space

This part is a fresh submission. It is simple enough that I have reviewed it.
It gives the expected speedup. However, the results are wrong:

3 runs of pgstattuple('t', 5): 0.000171412, 0.000171876, 0.000169326
3 runs of pgstattuple('t', 10): 0.000336525, 0.000344404, 0.000341314

I can get an apparent infinite loop:

create table t0(n) as select * from generate_series(1,4000000);
create index on t0(n);

[local] test=# select relation_free_space('t0_n_idx', 100);
relation_free_space
---------------------
0.0982675

Time: 133.788 ms

[local] test=# select relation_free_space('t0_n_idx', 5);
Cancel request sent
ERROR: canceling statement due to user request
Time: 24655.481 ms

As a way forward, I suggest abandoning relation_free_space() entirely and
adding a sampling capability to pgstattuple(). There are clear gains to be
had from that method. The gains of splitting out the free percent calculation
from the other pgstattuple() calculations seem meager.

If you would like, submit the buffer strategy bits as a separate patch with
its own CF entry, noting that it arose from here. That part can be Ready for
Committer. I'm marking the original CF entry Returned with Feedback.

Patch 1:

*** a/contrib/pgstattuple/pgstatindex.c
--- b/contrib/pgstattuple/pgstatindex.c

+ /* 
+  * Buffer access strategy for reading relations, it's simpler to keep it
+  * global because pgstat_*_page() functions read one buffer at a time.
+  * pgstat_heap() and pgstat_index() should initialize it before use.
+  */
+ BufferAccessStrategy bstrategy;

This variable should be static.

*** 450,455 ****
--- 471,538 ----
}

/*
+  * pgstat_gin_page -- check tuples in a gin page
+  */
+ static void
+ pgstat_gin_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno)
+ {
+ 	Buffer		buf;
+ 	Page		page;
+ 
+ 	buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
+ 	LockBuffer(buf, GIN_SHARE);
+ 	page = BufferGetPage(buf);
+ 
+ 	if (GinPageIsDeleted(page))
+ 	{
+ 		/* recyclable page */
+ 		stat->free_space += BLCKSZ; 
+ 	}
+ 	else if (GinPageIsLeaf(page) || GinPageIsData(page))
+ 	{
+ 		pgstat_index_page(stat, page, FirstOffsetNumber,
+ 						  PageGetMaxOffsetNumber(page));
+ 	}
+ 	else
+ 	{
+ 		/* root or node */
+ 	}
+ 
+ 	UnlockReleaseBuffer(buf);
+ }

Would you discuss, preferably in comments, the various page types found in GIN
indexes and why this code is correct for all of them? At a minimum, the
comment "root or node" is incorrect.

Another thing to consider and document is how you wish to count tuples. Your
implementation counts every key as a tuple. I think it would be more useful
to count every posting tree entry as a tuple, but I haven't given it a great
deal of thought.

+ 
+ /*
+  * pgstat_spgist_page -- check tuples in a spgist page
+  */
+ static void
+ pgstat_spgist_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno)
+ {
+ 	Buffer		buf;
+ 	Page		page;
+ 
+ 	buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
+ 	LockBuffer(buf, BUFFER_LOCK_SHARE);
+ 	page = BufferGetPage(buf);
+ 
+ 	if (SpGistPageIsDeleted(page))
+ 	{
+ 		/* recyclable page */
+ 	    stat->free_space += BLCKSZ;
+ 	}
+ 	else if (SpGistPageIsLeaf(page))
+ 	{
+ 		pgstat_index_page(stat, page, FirstOffsetNumber,
+ 						  PageGetMaxOffsetNumber(page));
+ 	}
+ 	else
+ 	{
+ 		/* root or node */
+ 	}
+ 
+ 	UnlockReleaseBuffer(buf);
+ }

pgstat_index_page will not do the right thing for SP-GiST indexes. Only
SPGIST_LIVE tuples should count as live; the other tupstates should count as
dead. Also, pgstattuple gives me a tuple count of 119288 for an SP-GiST index
of a table containing only 40000 tuples. (Disclaimer: I took my first look at
the SP-GiST code today.)

Patch 3:

*** a/contrib/pgstattuple/pgstattuple.c
--- b/contrib/pgstattuple/pgstattuple.c
*************** GetHeapRelationFreeSpace(Relation rel)
*** 691,716 ****
{
/* Get the current relation length */
LockRelationForExtension(rel, ExclusiveLock);
! 		nblocks = RelationGetNumberOfBlocks(rel);
UnlockRelationForExtension(rel, ExclusiveLock);
/* Quit if we've scanned the whole relation */
if (blkno >= nblocks)
{
break;
}

! for (; blkno < nblocks; blkno++)
{
CHECK_FOR_INTERRUPTS();

buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
LockBuffer(buffer, BUFFER_LOCK_SHARE);

page = BufferGetPage(buffer);
free_space += PageGetHeapFreeSpace(page);

UnlockReleaseBuffer(buffer);
}
}
--- 700,733 ----
{
/* Get the current relation length */
LockRelationForExtension(rel, ExclusiveLock);
! 		totalBlocksInRelation = RelationGetNumberOfBlocks(rel);
UnlockRelationForExtension(rel, ExclusiveLock);
+ nblocks = totalBlocksInRelation * stats_target / 100;

You have defined stats_target as a percentage of the relation to sample.
That's a poor basis for sample size. Use a constant number of pages, just as
a given default_statistics_target leads ANALYZE to take a constant number of
tuples from each table. Further background:
http://en.wikipedia.org/wiki/Sample_size_determination#Estimating_proportions_and_means

Excepting those few copied from analyze.c, this patch adds zero comments. You
even omit comments present in the corresponding analyze.c code.

#33

jaime@2ndquadrant.com

almost 14 years ago

In reply to: Noah Misch (#32)

1 attachment(s)

Re: Measuring relation free space

On Wed, Feb 22, 2012 at 12:27 AM, Noah Misch <noah@leadboat.com> wrote:

On Tue, Feb 14, 2012 at 02:04:26AM -0500, Jaime Casanova wrote:

1) pgstattuple-gin_spgist.patch
This first patch adds gin and spgist support to pgstattuple, also
makes pgstattuple use a ring buffer when reading tables or indexes.

The buffer access strategy usage bits look fine to commit.

ok. i extracted that part. which basically makes pgstattuple usable in
production (i mean, by not bloating shared buffers when using the
function)

--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación

Attachments:

pgstattuple_bas.patchtext/x-patch; charset=US-ASCII; name=pgstattuple_bas.patchDownload

diff --git a/contrib/pgstattuple/pgstatindex.c b/contrib/pgstattuple/pgstatindex.c
new file mode 100644
index beff1b9..9f2ec1f
*** a/contrib/pgstattuple/pgstatindex.c
--- b/contrib/pgstattuple/pgstatindex.c
*************** pgstatindex(PG_FUNCTION_ARGS)
*** 95,100 ****
--- 95,101 ----
  	BlockNumber nblocks;
  	BlockNumber blkno;
  	BTIndexStat indexStat;
+  	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
  
  	if (!superuser())
  		ereport(ERROR,
*************** pgstatindex(PG_FUNCTION_ARGS)
*** 122,128 ****
  	 * Read metapage
  	 */
  	{
! 		Buffer		buffer = ReadBuffer(rel, 0);
  		Page		page = BufferGetPage(buffer);
  		BTMetaPageData *metad = BTPageGetMeta(page);
  
--- 123,129 ----
  	 * Read metapage
  	 */
  	{
! 		Buffer		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, 0, RBM_NORMAL, bstrategy);
  		Page		page = BufferGetPage(buffer);
  		BTMetaPageData *metad = BTPageGetMeta(page);
  
*************** pgstatindex(PG_FUNCTION_ARGS)
*** 159,165 ****
  		CHECK_FOR_INTERRUPTS();
  
  		/* Read and lock buffer */
! 		buffer = ReadBuffer(rel, blkno);
  		LockBuffer(buffer, BUFFER_LOCK_SHARE);
  
  		page = BufferGetPage(buffer);
--- 160,166 ----
  		CHECK_FOR_INTERRUPTS();
  
  		/* Read and lock buffer */
!  		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
  		LockBuffer(buffer, BUFFER_LOCK_SHARE);
  
  		page = BufferGetPage(buffer);
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
new file mode 100644
index e5ddd87..580e24e
*** a/contrib/pgstattuple/pgstattuple.c
--- b/contrib/pgstattuple/pgstattuple.c
*************** static Datum pgstat_index(Relation rel,
*** 78,83 ****
--- 78,90 ----
  static void pgstat_index_page(pgstattuple_type *stat, Page page,
  				  OffsetNumber minoff, OffsetNumber maxoff);
  
+ /* 
+  * Buffer access strategy for reading relations, it's simpler to keep it
+  * global because pgstat_*_page() functions read one buffer at a time.
+  * pgstat_heap() and pgstat_index() should initialize it before use.
+  */
+ BufferAccessStrategy bstrategy;
+ 
  /*
   * build_pgstattuple_type -- build a pgstattuple_type tuple
   */
*************** pgstat_relation(Relation rel, FunctionCa
*** 231,236 ****
--- 238,246 ----
  				case GIN_AM_OID:
  					err = "gin index";
  					break;
+ 				case SPGIST_AM_OID:
+ 					err = "spgist index";
+ 					break;
  				default:
  					err = "unknown index";
  					break;
*************** pgstat_heap(Relation rel, FunctionCallIn
*** 276,281 ****
--- 286,295 ----
  
  	nblocks = scan->rs_nblocks; /* # blocks to be scanned */
  
+ 	/* prepare access strategy for this table */
+ 	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+ 	scan->rs_strategy = bstrategy;
+ 
  	/* scan the relation */
  	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
  	{
*************** pgstat_heap(Relation rel, FunctionCallIn
*** 309,315 ****
  		{
  			CHECK_FOR_INTERRUPTS();
  
! 			buffer = ReadBuffer(rel, block);
  			LockBuffer(buffer, BUFFER_LOCK_SHARE);
  			stat.free_space += PageGetHeapFreeSpace((Page) BufferGetPage(buffer));
  			UnlockReleaseBuffer(buffer);
--- 323,329 ----
  		{
  			CHECK_FOR_INTERRUPTS();
  
! 			buffer = ReadBufferExtended(rel, MAIN_FORKNUM, block, RBM_NORMAL, bstrategy);
  			LockBuffer(buffer, BUFFER_LOCK_SHARE);
  			stat.free_space += PageGetHeapFreeSpace((Page) BufferGetPage(buffer));
  			UnlockReleaseBuffer(buffer);
*************** pgstat_heap(Relation rel, FunctionCallIn
*** 322,328 ****
  	{
  		CHECK_FOR_INTERRUPTS();
  
! 		buffer = ReadBuffer(rel, block);
  		LockBuffer(buffer, BUFFER_LOCK_SHARE);
  		stat.free_space += PageGetHeapFreeSpace((Page) BufferGetPage(buffer));
  		UnlockReleaseBuffer(buffer);
--- 336,342 ----
  	{
  		CHECK_FOR_INTERRUPTS();
  
! 		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, block, RBM_NORMAL, bstrategy);
  		LockBuffer(buffer, BUFFER_LOCK_SHARE);
  		stat.free_space += PageGetHeapFreeSpace((Page) BufferGetPage(buffer));
  		UnlockReleaseBuffer(buffer);
*************** pgstat_btree_page(pgstattuple_type *stat
*** 345,351 ****
  	Buffer		buf;
  	Page		page;
  
! 	buf = ReadBuffer(rel, blkno);
  	LockBuffer(buf, BT_READ);
  	page = BufferGetPage(buf);
  
--- 359,365 ----
  	Buffer		buf;
  	Page		page;
  
! 	buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
  	LockBuffer(buf, BT_READ);
  	page = BufferGetPage(buf);
  
*************** pgstat_hash_page(pgstattuple_type *stat,
*** 389,395 ****
  	Page		page;
  
  	_hash_getlock(rel, blkno, HASH_SHARE);
! 	buf = _hash_getbuf(rel, blkno, HASH_READ, 0);
  	page = BufferGetPage(buf);
  
  	if (PageGetSpecialSize(page) == MAXALIGN(sizeof(HashPageOpaqueData)))
--- 403,409 ----
  	Page		page;
  
  	_hash_getlock(rel, blkno, HASH_SHARE);
! 	buf = _hash_getbuf_with_strategy(rel, blkno, HASH_READ, 0, bstrategy);
  	page = BufferGetPage(buf);
  
  	if (PageGetSpecialSize(page) == MAXALIGN(sizeof(HashPageOpaqueData)))
*************** pgstat_gist_page(pgstattuple_type *stat,
*** 431,437 ****
  	Buffer		buf;
  	Page		page;
  
! 	buf = ReadBuffer(rel, blkno);
  	LockBuffer(buf, GIST_SHARE);
  	gistcheckpage(rel, buf);
  	page = BufferGetPage(buf);
--- 445,451 ----
  	Buffer		buf;
  	Page		page;
  
! 	buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy);
  	LockBuffer(buf, GIST_SHARE);
  	gistcheckpage(rel, buf);
  	page = BufferGetPage(buf);
*************** pgstat_index(Relation rel, BlockNumber s
*** 460,465 ****
--- 474,482 ----
  	BlockNumber blkno;
  	pgstattuple_type stat = {0};
  
+ 	/* prepare access strategy for this index */
+ 	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+ 
  	blkno = start;
  	for (;;)
  	{

#34

noah@leadboat.com

almost 14 years ago

In reply to: Jaime Casanova (#33)

Re: Measuring relation free space

On Fri, Mar 09, 2012 at 02:18:02AM -0500, Jaime Casanova wrote:

On Wed, Feb 22, 2012 at 12:27 AM, Noah Misch <noah@leadboat.com> wrote:

On Tue, Feb 14, 2012 at 02:04:26AM -0500, Jaime Casanova wrote:

1) pgstattuple-gin_spgist.patch
This first patch adds gin and spgist support to pgstattuple, also
makes pgstattuple use a ring buffer when reading tables or indexes.

The buffer access strategy usage bits look fine to commit.

ok. i extracted that part. which basically makes pgstattuple usable in
production (i mean, by not bloating shared buffers when using the
function)

I created a CF entry for this and marked it Ready for Committer. You left the
bstrategy variable non-static, but that didn't seem important enough to
justify another round trip.

#35

jaime@2ndquadrant.com

almost 14 years ago

In reply to: Noah Misch (#34)

Re: Measuring relation free space

On Mon, Mar 12, 2012 at 9:41 PM, Noah Misch <noah@leadboat.com> wrote:

I created a CF entry for this and marked it Ready for Committer.

i wasn't sure if create an entry this late was a good idea or not...
but now i feel better because is less probable that it will fall out
on the cracks, thanks

You left the
bstrategy variable non-static, but that didn't seem important enough to
justify another round trip.

ah! i forgot that...

--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación

#36

Robert Haas

robertmhaas@gmail.com

almost 14 years ago

In reply to: Jaime Casanova (#35)

Re: Measuring relation free space

On Mon, Mar 12, 2012 at 11:10 PM, Jaime Casanova <jaime@2ndquadrant.com> wrote:

On Mon, Mar 12, 2012 at 9:41 PM, Noah Misch <noah@leadboat.com> wrote:

I created a CF entry for this and marked it Ready for Committer.

i wasn't sure if create an entry this late was a good idea or not...
but now i feel better because is less probable that it will fall out
on the cracks, thanks

You left the
bstrategy variable non-static, but that didn't seem important enough to
justify another round trip.

ah! i forgot that...

I committed this, but I didn't like the global variable, so I adjusted
it to pass bstrategy as a parameter where needed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#37

ams@2ndQuadrant.com

almost 12 years ago

In reply to: Greg Smith (#1)

5 attachment(s)

a fast bloat measurement tool (was Re: Measuring relation free space)

This is a follow-up to the thread at
/messages/by-id/4EB5FA1B.1090305@2ndQuadrant.com

A quick summary: that thread proposed adding a relation_free_space()
function to the pageinspect extension. Various review comments were
received, among which was the suggestion that the code belonged in
pg_stattuple as a faster way to calculate free_percent.

===

I've attached an extension that produces largely pgstattuple-compatible
numbers for a table without doing a full-table scan.

It scans through the table, skipping blocks that have their visibility
map bit set. For such pages, it gets the free space from the free space
map, and assumes that all remaining space on the page is taken by live
tuples. It scans other pages tuple-by-tuple and counts live and dead
tuples and free space.

Here's a comparison of fastbloat and pgstattuple output on a 50-million
row table with some holes created with a single big DELETE statement:

ams=# select * from fastbloat('x');
table_len | scanned_percent | approx_tuple_count | approx_tuple_len | approx_tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent
------------+-----------------+--------------------+------------------+----------------------+------------------+----------------+--------------------+------------+--------------
6714761216 | 17 | 41318301 | 5483815648 | 81.67 | 8681708 | 1111258624 | 16.55 | 80972912 | 1.21
(1 row)

Time: 639.455 ms

ams=# select * from pgstattuple('x');
table_len | tuple_count | tuple_len | tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent
------------+-------------+------------+---------------+------------------+----------------+--------------------+------------+--------------
6714761216 | 41318292 | 5288741376 | 78.76 | 8681708 | 1111258624 | 16.55 | 91810372 | 1.37
(1 row)

Time: 15610.651 ms

In the above, the table_len is nblocks*BLCKSZ, and the dead_tuple_count,
dead_tuple_len, dead_tuple_percent, free_space, and free_percent are all
exact. scanned_percent shows the percentage of pages that were scanned
tuple-by-tuple (the others having been skipped based on the VM bit).
The live tuple count, size, and percentage are all estimates.

The approx_tuple_count is calculated using vac_estimate_reltuples based
on the pages/tuples that were scanned. The approx_tuple_len is the exact
size of the live tuples on scanned pages, plus the approximate size from
skipped pages (BLCKSZ-GetRecordedFreeSpace()). This is an overestimate,
because it's counting the line pointer array as live tuple space.

Even in the worst case, when every page has dead tuples, fastbloat is
marginally faster than pgstattuple. The same table as the first example,
but with every even-numbered row deleted:

ams=# select * from fastbloat('x');
table_len | scanned_percent | approx_tuple_count | approx_tuple_len | approx_tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent
------------+-----------------+--------------------+------------------+----------------------+------------------+----------------+--------------------+------------+--------------
6714761216 | 100 | 20659150 | 2644371200 | 39.38 | 20659142 | 2644370176 | 39.38 | 1203068996 | 17.92
(1 row)

Time: 8924.511 ms

ams=# select * from pgstattuple('x');
table_len | tuple_count | tuple_len | tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent
------------+-------------+------------+---------------+------------------+----------------+--------------------+------------+--------------
6714761216 | 20659150 | 2644371200 | 39.38 | 20659142 | 2644370176 | 39.38 | 1203068996 | 17.92
(1 row)

Time: 13338.712 ms

Since the code depends on the visibility map to determine which pages to
skip, it does not support indexes (which have no visibility map).

(Just drop the attached files into contrib/fastbloat, and "make install"
should just work. Then just "create extension fastbloat".)

Questions and suggestions welcome.

-- Abhijit

#38

Robert Haas

robertmhaas@gmail.com

almost 12 years ago

In reply to: Abhijit Menon-Sen (#37)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

On Wed, Apr 2, 2014 at 5:41 PM, Abhijit Menon-Sen <ams@2ndquadrant.com> wrote:

I've attached an extension that produces largely pgstattuple-compatible
numbers for a table without doing a full-table scan.

It scans through the table, skipping blocks that have their visibility
map bit set. For such pages, it gets the free space from the free space
map, and assumes that all remaining space on the page is taken by live
tuples. It scans other pages tuple-by-tuple and counts live and dead
tuples and free space.

That's clever. I think it might underestimate free space relative to
tuples because the free space map isn't guaranteed to be completely
correct. But I guess you knew that already...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#39

ams@2ndquadrant.com

almost 12 years ago

In reply to: Robert Haas (#38)

Re: a fast bloat measurement tool

At 2014-04-02 20:10:54 -0400, robertmhaas@gmail.com wrote:

I think it might underestimate free space relative to tuples because
the free space map isn't guaranteed to be completely correct. But I
guess you knew that already...

Yes, and tuple_len is already a slight overestimate (because it counts
the line pointer array as live tuple space for skipped pages). But for
the speedup it provides, I believe the result is still useful.

I'll mention the potential for free-space inaccuracy in the README.

-- Abhijit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#40

amit.kapila16@gmail.com

over 11 years ago

In reply to: Abhijit Menon-Sen (#37)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

On Thu, Apr 3, 2014 at 3:11 AM, Abhijit Menon-Sen <ams@2ndquadrant.com>
wrote:

This is a follow-up to the thread at
/messages/by-id/4EB5FA1B.1090305@2ndQuadrant.com

A quick summary: that thread proposed adding a relation_free_space()
function to the pageinspect extension.

Various review comments were
received, among which was the suggestion that the code belonged in
pg_stattuple as a faster way to calculate free_percent.

You haven't mentioned why you didn't follow that way. After looking
at code, I also felt that it is better to add this as a version of
pg_stattuple.

===

I've attached an extension that produces largely pgstattuple-compatible
numbers for a table without doing a full-table scan.

It scans through the table, skipping blocks that have their visibility
map bit set. For such pages, it gets the free space from the free space
map, and assumes that all remaining space on the page is taken by live
tuples. It scans other pages tuple-by-tuple and counts live and dead
tuples and free space.

Is this assumption based on the reason that if the visibility map bit of
page is set, then there is high chance that vacuum would have pruned
the dead tuples and updated FSM with freespace?
In anycase, I think it will be better if you update README and or
code comments to mention the reason of such an assumption.

1. compilation errors
1>contrib\fastbloat\fastbloat.c(450): error C2198: 'MultiXactIdIsRunning' :
too few arguments for call
1>contrib\fastbloat\fastbloat.c(467): error C2198: 'MultiXactIdIsRunning' :
too few arguments for call

Recent commit 05315 added new parameter to this API, so this usage
of API needs to be updated accordingly.

2.
/* Returns a tuple with live/dead tuple statistics for the named table.
*/

I think this is not a proper multi-line comment.

3.
fbstat_heap()
{
..
for (blkno = 0; blkno < nblocks; blkno++)
{
..
}

It is better to have CHECK_FOR_INTERRUPTS() in above loop.

4.
if (PageIsNew(page))
{
ReleaseBuffer(buf);
continue;
}

Incase of new page don't you need to account for freespace.
Why can't we safely assume that all the space in page is
free and add it to freespace.

5. Don't we need the handling for empty page (PageIsEmpty()) case?

6.
ItemIdIsDead(itemid))
{
continue;
}

What is the reason for not counting ItemIdIsDead() case for
accounting of dead tuples.

I think for Vaccum, it is okay to skip that case because same
is counted via heap_page_prune() call.

7.
HeapTupleSatisfiesVacuumNoHint()
a. Why can't we use HeapTupleSatisfiesVisibility() with dirty snapshot
as we use for pgstattuple? I think it will have the advantage of
keeping the code for fastbloat similar to pgstattuple.
b. If we want to use HeapTupleSatisfiesVacuum(), why can't we
add one parameter to function which can avoid setting of hintbit.
Are you worried about the performance impact it might cause in
other flows, if yes then which flow your are mainly worried.
c. I agree that there is advantage of avoiding hintbit code, but as
this operation will not be frequent, so not sure if its good idea
to add a separate version of tuplevisibility function

In general, I think the main advantage of this patch comes from the
fact that it can skip scanning pages based on visibilitymap, so
it seems to me that it is better if we keep other part of code similar
to pgstattuple.

8.
HeapTupleSatisfiesVacuumNoHint(HeapTuple htup, TransactionId OldestXmin)

There is some new code added in corresponding function
HeapTupleSatisfiesVacuum(), so if we want to use it, then
update the changes in this function as well and I think it
is better to move this into tqual.c along with other similar
functions.

9.
* This routine is shared by VACUUM and ANALYZE.
*/
double
vac_estimate_reltuples(Relation relation, bool is_analyze,
BlockNumber total_pages,
BlockNumber scanned_pages,
double scanned_tuples)

function header of vac_estimate_reltuples() suggests that it is
used by VACUUM and ANALYZE, I think it will be better to update
the comment w.r.t new usage as well.

10. I think it should generate resource file as is done for other modules
if you want to keep it as a separate module.
Example:
MODULE_big = btree_gin
OBJS = btree_gin.o $(WIN32RES)

11.
README.txt

Here is an example comparing the output of fastbloat and pgstattuple for

the same table:

Do you really need to specify examples in README. I think it would be
better to specify examples in documentation (.sgml).

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#41

ams@2ndQuadrant.com

over 11 years ago

In reply to: Amit Kapila (#40)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

Hi Amit.

Thanks for your comments, and I'm sorry it's taken me so long to
respond.

At 2014-08-03 11:18:57 +0530, amit.kapila16@gmail.com wrote:

After looking at code, I also felt that it is better to add this as a
version of pg_stattuple.

I started off trying to do that, but now I'm afraid I strongly disagree.
First, pg_stattuple works on relations and indexes, whereas fastbloat
only works on relations (because of the VM/FSM use). Second, the code
began to look hideous when mushed together, with too many ifs to avoid
locking I didn't need and so on. The logic is just too different.

Is this assumption based on the reason that if the visibility map bit
of page is set, then there is high chance that vacuum would have
pruned the dead tuples and updated FSM with freespace?

Right. I'm not convinced an explanation of the VM/FSM belongs in the
fastbloat documentation, but I'll see if I can make it clearer.

1. compilation errors […]
I think this is not a proper multi-line comment. […]
It is better to have CHECK_FOR_INTERRUPTS() in above loop. […]
Incase of new page don't you need to account for freespace. […]
5. Don't we need the handling for empty page (PageIsEmpty()) case?

Thanks, have fixed, will push updates soon.

What is the reason for not counting ItemIdIsDead() case for
accounting of dead tuples.

Will reconsider and fix.

7.
HeapTupleSatisfiesVacuumNoHint()
a. Why can't we use HeapTupleSatisfiesVisibility() with dirty snapshot
as we use for pgstattuple?

Heavier locking. I tried to make do with the existing HTS* functions,
but fastbloat was noticeably faster in tests with the current setup.

b. If we want to use HeapTupleSatisfiesVacuum(), why can't we add one
parameter to function which can avoid setting of hintbit.

This is certainly a possibility, and as you say it would be better in
terms of maintenance. I didn't do it that way because I wanted to keep
the extension self-contained rather than make it depend on core changes.
If there's consensus on adding a nohint parameter, sure, I can do that.
(But the fastbloat won't work on current versions, which would suck.)

In the meantime, I'll merge the later updates from HTSVacuum into my
code. Thanks for the heads-up.

function header of vac_estimate_reltuples() suggests that it is
used by VACUUM and ANALYZE, I think it will be better to update
the comment w.r.t new usage as well.

OK.

10. I think it should generate resource file as is done for other
modules if you want to keep it as a separate module.

Thanks, will do.

Do you really need to specify examples in README.

I don't *need* to, but I like it.

-- Abhijit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#42

andres@2ndquadrant.com

over 11 years ago

In reply to: Abhijit Menon-Sen (#41)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

Hi,

On 2014-09-24 14:26:37 +0530, Abhijit Menon-Sen wrote:

Thanks for your comments, and I'm sorry it's taken me so long to
respond.

At 2014-08-03 11:18:57 +0530, amit.kapila16@gmail.com wrote:

After looking at code, I also felt that it is better to add this as a
version of pg_stattuple.

I started off trying to do that, but now I'm afraid I strongly disagree.
First, pg_stattuple works on relations and indexes, whereas fastbloat
only works on relations (because of the VM/FSM use). Second, the code
began to look hideous when mushed together, with too many ifs to avoid
locking I didn't need and so on. The logic is just too different.

Why not add it to the stattuple extension, but as an independent
function (and file)? I don't really see a need for a completely new
extension, but a separate extension seems wasteful.

7.
HeapTupleSatisfiesVacuumNoHint()
a. Why can't we use HeapTupleSatisfiesVisibility() with dirty snapshot
as we use for pgstattuple?

I think it's completely unacceptable to copy a visibility routine. And I
don't believe for a second that avoiding hint bit setting makes it legal
to not acquire a content lock for this. What's your reasoning for that
being safe? If you argue that all possible corruption has limited impact
you need to do that *much* more explicitly and verbosely.

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#43

ams@2ndQuadrant.com

over 11 years ago

In reply to: Andres Freund (#42)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

At 2014-09-24 11:09:24 +0200, andres@2ndquadrant.com wrote:

Why not add it to the stattuple extension, but as an independent
function (and file)?

Thanks, that's a good idea. I'll do that.

I think it's completely unacceptable to copy a visibility routine.

OK. Which visibility routine should I use, and should I try to create a
variant that doesn't set hint bits?

I don't have any reasoning for why it's safe to not hold a content lock.
If there is one, I need help to find it. If not, I'll acquire a content
lock. (If anyone can explain why it isn't safe, I would appreciate it.)

-- Abhijit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#44

andres@2ndquadrant.com

over 11 years ago

In reply to: Abhijit Menon-Sen (#43)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

On 2014-09-25 10:24:39 +0530, Abhijit Menon-Sen wrote:

At 2014-09-24 11:09:24 +0200, andres@2ndquadrant.com wrote:

I think it's completely unacceptable to copy a visibility routine.

OK. Which visibility routine should I use, and should I try to create a
variant that doesn't set hint bits?

I've not yet followed your premise that you actually need one that
doesn't set hint bits...

I don't have any reasoning for why it's safe to not hold a content lock.
If there is one, I need help to find it. If not, I'll acquire a content
lock. (If anyone can explain why it isn't safe, I would appreciate it.)

I don't see why it'd be safe. Without the content lock you can come
across half written tuple headers and similar things. You can try to
argue that all the possible faults of that are harmless, but I think
that'd require a tremendous amount of code review and it'd be hard to
continue to guarantee. There's a reason we acquire locks, you know :)

I think it's saner to first get this working & committed properly *with*
the lock. If you afterwards have energy to improve it further we can
another look.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#45

ams@2ndQuadrant.com

over 11 years ago

In reply to: Andres Freund (#44)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

At 2014-09-25 11:41:29 +0200, andres@2ndquadrant.com wrote:

I've not yet followed your premise that you actually need one that
doesn't set hint bits...

Oh.

All right, then I'll post a version that addresses Amit's other points,
adds a new file/function to pgstattuple, acquires content locks, and
uses HeapTupleSatisfiesVacuum, hint-bit setting and all. Maybe I'll
call it not_too_slow_bloat().

Thank you for the feedback.

-- Abhijit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#46

Simon Riggs

simon@2ndquadrant.com

over 11 years ago

In reply to: Andres Freund (#44)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

On 25 September 2014 10:41, Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-09-25 10:24:39 +0530, Abhijit Menon-Sen wrote:

At 2014-09-24 11:09:24 +0200, andres@2ndquadrant.com wrote:

I think it's completely unacceptable to copy a visibility routine.

OK. Which visibility routine should I use, and should I try to create a
variant that doesn't set hint bits?

I've not yet followed your premise that you actually need one that
doesn't set hint bits...

Not least because I'm trying to solve a similar problem on another
thread, so no need to make a special case here.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#47

andres@2ndquadrant.com

over 11 years ago

In reply to: Simon Riggs (#46)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

On 2014-09-25 14:43:14 +0100, Simon Riggs wrote:

On 25 September 2014 10:41, Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-09-25 10:24:39 +0530, Abhijit Menon-Sen wrote:

At 2014-09-24 11:09:24 +0200, andres@2ndquadrant.com wrote:

I think it's completely unacceptable to copy a visibility routine.

OK. Which visibility routine should I use, and should I try to create a
variant that doesn't set hint bits?

I've not yet followed your premise that you actually need one that
doesn't set hint bits...

Not least because I'm trying to solve a similar problem on another
thread, so no need to make a special case here.

That's mostly unrelated though - Abhijit wants to avoid them because he
tried to avoid having *any* form of lock on the buffer. That's the
reason he tried avoid hint bit setting. Since I don't believe that's
safe (at least there's by far not enough evidence about it), there's
simply no reason to avoid it.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#48

amit.kapila16@gmail.com

over 11 years ago

In reply to: Abhijit Menon-Sen (#41)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

On Wed, Sep 24, 2014 at 2:26 PM, Abhijit Menon-Sen <ams@2ndquadrant.com>
wrote:

Hi Amit.

Thanks for your comments, and I'm sorry it's taken me so long to
respond.

No issues.

At 2014-08-03 11:18:57 +0530, amit.kapila16@gmail.com wrote:

7.
HeapTupleSatisfiesVacuumNoHint()
a. Why can't we use HeapTupleSatisfiesVisibility() with dirty snapshot
as we use for pgstattuple?

Heavier locking. I tried to make do with the existing HTS* functions,
but fastbloat was noticeably faster in tests with the current setup.

I am not sure for the purpose of this functionality, why we need to
use a different HTS (HeapTupleSatisfiesVacuum) routine as compare
to pgstat_heap(). Unless you have some specific purpose to achieve,
I think it is better to use HeapTupleSatisfiesVisibility().

Maybe I'll call it not_too_slow_bloat().

How about pgfaststattuple() or pgquickstattuple() or pgfuzzystattuple()?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#49

ams@2ndQuadrant.com

about 11 years ago

In reply to: Abhijit Menon-Sen (#45)

1 attachment(s)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

At 2014-09-25 15:40:11 +0530, ams@2ndQuadrant.com wrote:

All right, then I'll post a version that addresses Amit's other
points, adds a new file/function to pgstattuple, acquires content
locks, and uses HeapTupleSatisfiesVacuum, hint-bit setting and all.

Sorry, I forgot to post this patch. It does what I listed above, and
seems to work fine (it's still quite a lot faster than pgstattuple
in many cases).

A couple of remaining questions:

1. I didn't change the handling of LP_DEAD items, because the way it is
now matches what pgstattuple counts. I'm open to changing it, though.
Maybe it makes sense to count LP_DEAD items iff lp_len != 0? Or just
leave it as it is? I think it doesn't matter much.

2. I changed the code to acquire the content locks on the buffer, as
discussed, but it still uses HeapTupleSatisfiesVacuum. Amit suggested
using HeapTupleSatisfiesVisibility, but it's not clear to me why that
would be better. I welcome advice in this matter.

(If anything, I should use HeapTupleIsSurelyDead, which doesn't set
any hint bits, but which I earlier rejected as too conservative in
its dead/not-dead decisions for this purpose.)

(I've actually patched the pgstattuple.sgml docs as well, but I want to
re-read that to make sure it's up to date, and didn't want to wait to
post the code changes.)

I also didn't change the name. I figure it's easy enough to change it
everywhere once all the remaining pieces are in place.

Comments welcome.

-- Abhijit

Attachments:

fastbloat.difftext/x-diff; charset=us-asciiDownload

diff --git a/contrib/pgstattuple/Makefile b/contrib/pgstattuple/Makefile
index 862585c..bae80c9 100644
--- a/contrib/pgstattuple/Makefile
+++ b/contrib/pgstattuple/Makefile
@@ -1,10 +1,10 @@
 # contrib/pgstattuple/Makefile
 
 MODULE_big	= pgstattuple
-OBJS		= pgstattuple.o pgstatindex.o $(WIN32RES)
+OBJS		= pgstattuple.o pgstatindex.o fastbloat.o $(WIN32RES)
 
 EXTENSION = pgstattuple
-DATA = pgstattuple--1.2.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql
+DATA = pgstattuple--1.3.sql pgstattuple--1.2--1.3.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql
 PGFILEDESC = "pgstattuple - tuple-level statistics"
 
 REGRESS = pgstattuple
diff --git a/contrib/pgstattuple/fastbloat.c b/contrib/pgstattuple/fastbloat.c
new file mode 100644
index 0000000..b33db14
--- /dev/null
+++ b/contrib/pgstattuple/fastbloat.c
@@ -0,0 +1,367 @@
+/*
+ * contrib/pgstattuple/fastbloat.c
+ *
+ * Abhijit Menon-Sen <ams@2ndQuadrant.com>
+ * Portions Copyright (c) 2001,2002	Tatsuo Ishii (from pg_stattuple)
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose, without fee, and without a
+ * written agreement is hereby granted, provided that the above
+ * copyright notice and this paragraph and the following two
+ * paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE TO ANY PARTY FOR DIRECT,
+ * INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED
+ * OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS ON AN "AS
+ * IS" BASIS, AND THE AUTHOR HAS NO OBLIGATIONS TO PROVIDE MAINTENANCE,
+ * SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+ */
+
+#include "postgres.h"
+
+#include "access/visibilitymap.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/multixact.h"
+#include "access/htup_details.h"
+#include "catalog/namespace.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "storage/freespace.h"
+#include "storage/procarray.h"
+#include "storage/lmgr.h"
+#include "utils/builtins.h"
+#include "utils/tqual.h"
+#include "commands/vacuum.h"
+
+PG_FUNCTION_INFO_V1(fastbloat);
+PG_FUNCTION_INFO_V1(fastbloatbyid);
+
+/*
+ * tuple_percent, dead_tuple_percent and free_percent are computable,
+ * so not defined here.
+ */
+typedef struct fastbloat_output_type
+{
+	uint64		table_len;
+	uint64		tuple_count;
+	uint64		tuple_len;
+	uint64		dead_tuple_count;
+	uint64		dead_tuple_len;
+	uint64		free_space;
+	uint64		total_pages;
+	uint64		scanned_pages;
+} fastbloat_output_type;
+
+static Datum build_output_type(fastbloat_output_type *stat,
+							   FunctionCallInfo fcinfo);
+static Datum fbstat_relation(Relation rel, FunctionCallInfo fcinfo);
+static Datum fbstat_heap(Relation rel, FunctionCallInfo fcinfo);
+
+/*
+ * build a fastbloat_output_type tuple
+ */
+static Datum
+build_output_type(fastbloat_output_type *stat, FunctionCallInfo fcinfo)
+{
+#define NCOLUMNS	10
+#define NCHARS		32
+
+	HeapTuple	tuple;
+	char	   *values[NCOLUMNS];
+	char		values_buf[NCOLUMNS][NCHARS];
+	int			i;
+	double		tuple_percent;
+	double		dead_tuple_percent;
+	double		free_percent;	/* free/reusable space in % */
+	double		scanned_percent;
+	TupleDesc	tupdesc;
+	AttInMetadata *attinmeta;
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	/*
+	 * Generate attribute metadata needed later to produce tuples from raw C
+	 * strings
+	 */
+	attinmeta = TupleDescGetAttInMetadata(tupdesc);
+
+	if (stat->table_len == 0)
+	{
+		tuple_percent = 0.0;
+		dead_tuple_percent = 0.0;
+		free_percent = 0.0;
+	}
+	else
+	{
+		tuple_percent = 100.0 * stat->tuple_len / stat->table_len;
+		dead_tuple_percent = 100.0 * stat->dead_tuple_len / stat->table_len;
+		free_percent = 100.0 * stat->free_space / stat->table_len;
+	}
+
+	scanned_percent = 0.0;
+	if (stat->total_pages != 0)
+	{
+		scanned_percent = 100 * stat->scanned_pages / stat->total_pages;
+	}
+
+	for (i = 0; i < NCOLUMNS; i++)
+		values[i] = values_buf[i];
+	i = 0;
+	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->table_len);
+	snprintf(values[i++], NCHARS, "%.2f", scanned_percent);
+	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->tuple_count);
+	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->tuple_len);
+	snprintf(values[i++], NCHARS, "%.2f", tuple_percent);
+	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->dead_tuple_count);
+	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->dead_tuple_len);
+	snprintf(values[i++], NCHARS, "%.2f", dead_tuple_percent);
+	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->free_space);
+	snprintf(values[i++], NCHARS, "%.2f", free_percent);
+
+	tuple = BuildTupleFromCStrings(attinmeta, values);
+
+	return HeapTupleGetDatum(tuple);
+}
+
+/* Returns live/dead tuple statistics for the named table. */
+
+Datum
+fastbloat(PG_FUNCTION_ARGS)
+{
+	text	   *relname = PG_GETARG_TEXT_P(0);
+	RangeVar   *relrv;
+	Relation	rel;
+
+	if (!superuser())
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 (errmsg("must be superuser to use fastbloat functions"))));
+
+	relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname));
+	rel = relation_openrv(relrv, AccessShareLock);
+
+	PG_RETURN_DATUM(fbstat_relation(rel, fcinfo));
+}
+
+/* As above, but takes a reloid instead of a relation name. */
+
+Datum
+fastbloatbyid(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	Relation	rel;
+
+	if (!superuser())
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 (errmsg("must be superuser to use fastbloat functions"))));
+
+	rel = relation_open(relid, AccessShareLock);
+
+	PG_RETURN_DATUM(fbstat_relation(rel, fcinfo));
+}
+
+/*
+ * A helper function to reject unsupported relation types. We depend on
+ * the visibility map to decide which pages we can skip, so we can't
+ * support indexes, for example, which don't have a VM.
+ */
+
+static Datum
+fbstat_relation(Relation rel, FunctionCallInfo fcinfo)
+{
+	const char *err;
+
+	/*
+	 * Reject attempts to read non-local temporary relations; we would be
+	 * likely to get wrong data since we have no visibility into the owning
+	 * session's local buffers.
+	 */
+	if (RELATION_IS_OTHER_TEMP(rel))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot access temporary tables of other sessions")));
+
+	switch (rel->rd_rel->relkind)
+	{
+		case RELKIND_RELATION:
+		case RELKIND_MATVIEW:
+			return fbstat_heap(rel, fcinfo);
+		case RELKIND_TOASTVALUE:
+			err = "toast value";
+			break;
+		case RELKIND_SEQUENCE:
+			err = "sequence";
+			break;
+		case RELKIND_INDEX:
+			err = "index";
+			break;
+		case RELKIND_VIEW:
+			err = "view";
+			break;
+		case RELKIND_COMPOSITE_TYPE:
+			err = "composite type";
+			break;
+		case RELKIND_FOREIGN_TABLE:
+			err = "foreign table";
+			break;
+		default:
+			err = "unknown";
+			break;
+	}
+
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("\"%s\" (%s) is not supported",
+					RelationGetRelationName(rel), err)));
+	return 0;
+}
+
+/*
+ * This function takes an already open relation and scans its pages,
+ * skipping those that have the corresponding visibility map bit set.
+ * For pages we skip, we find the free space from the free space map
+ * and approximate tuple_len on that basis. For the others, we count
+ * the exact number of dead tuples etc.
+ *
+ * This scan is loosely based on vacuumlazy.c:lazy_scan_heap(), but
+ * we do not try to avoid skipping single pages.
+ */
+
+static Datum
+fbstat_heap(Relation rel, FunctionCallInfo fcinfo)
+{
+	BlockNumber scanned,
+				nblocks,
+				blkno;
+	Buffer		vmbuffer = InvalidBuffer;
+	fastbloat_output_type stat = {0};
+	BufferAccessStrategy bstrategy;
+	TransactionId OldestXmin;
+
+	OldestXmin = GetOldestXmin(rel, true);
+	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	scanned = 0;
+	nblocks = RelationGetNumberOfBlocks(rel);
+
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		Buffer		buf;
+		Page		page;
+		OffsetNumber offnum,
+					maxoff;
+		Size		freespace;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the page has only visible tuples, then we can find out the
+		 * free space from the FSM and move on.
+		 */
+
+		if (visibilitymap_test(rel, blkno, &vmbuffer))
+		{
+			freespace = GetRecordedFreeSpace(rel, blkno);
+			stat.tuple_len += BLCKSZ - freespace;
+			stat.free_space += freespace;
+			continue;
+		}
+
+		buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
+								 RBM_NORMAL, bstrategy);
+
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+		page = BufferGetPage(buf);
+
+		stat.free_space += PageGetHeapFreeSpace(page);
+
+		if (PageIsNew(page))
+		{
+			UnlockReleaseBuffer(buf);
+			continue;
+		}
+
+		scanned++;
+
+		/*
+		 * Look at each tuple on the page and decide whether it's live
+		 * or dead, then count it and its size. Unlike lazy_scan_heap,
+		 * we can afford to ignore problems and special cases.
+		 */
+
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemid;
+			HeapTupleData tuple;
+
+			itemid = PageGetItemId(page, offnum);
+
+			if (!ItemIdIsUsed(itemid) || ItemIdIsRedirected(itemid) ||
+				ItemIdIsDead(itemid))
+			{
+				continue;
+			}
+
+			Assert(ItemIdIsNormal(itemid));
+
+			ItemPointerSet(&(tuple.t_self), blkno, offnum);
+
+			tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+			tuple.t_len = ItemIdGetLength(itemid);
+			tuple.t_tableOid = RelationGetRelid(rel);
+
+			switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
+			{
+				case HEAPTUPLE_DEAD:
+				case HEAPTUPLE_RECENTLY_DEAD:
+					stat.dead_tuple_len += tuple.t_len;
+					stat.dead_tuple_count++;
+					break;
+				case HEAPTUPLE_LIVE:
+					stat.tuple_len += tuple.t_len;
+					stat.tuple_count++;
+					break;
+				case HEAPTUPLE_INSERT_IN_PROGRESS:
+				case HEAPTUPLE_DELETE_IN_PROGRESS:
+					break;
+				default:
+					elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
+					break;
+			}
+		}
+
+		UnlockReleaseBuffer(buf);
+	}
+
+	stat.table_len = (uint64) nblocks * BLCKSZ;
+	stat.tuple_count = vac_estimate_reltuples(rel, false, nblocks, scanned,
+											  stat.tuple_count);
+	stat.total_pages = nblocks;
+	stat.scanned_pages = scanned;
+
+	if (BufferIsValid(vmbuffer))
+	{
+		ReleaseBuffer(vmbuffer);
+		vmbuffer = InvalidBuffer;
+	}
+
+	relation_close(rel, AccessShareLock);
+
+	return build_output_type(&stat, fcinfo);
+}
diff --git a/contrib/pgstattuple/pgstattuple--1.2--1.3.sql b/contrib/pgstattuple/pgstattuple--1.2--1.3.sql
new file mode 100644
index 0000000..8a4064e
--- /dev/null
+++ b/contrib/pgstattuple/pgstattuple--1.2--1.3.sql
@@ -0,0 +1,32 @@
+/* contrib/pgstattuple/pgstattuple--1.2--1.3.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pgstattuple UPDATE TO '1.3'" to load this file. \quit
+
+CREATE FUNCTION fastbloat(IN relname text,
+    OUT table_len BIGINT,               -- physical table length in bytes
+    OUT scanned_percent FLOAT8,         -- what percentage of the table's pages was scanned
+    OUT approx_tuple_count BIGINT,      -- estimated number of live tuples
+    OUT approx_tuple_len BIGINT,        -- estimated total length in bytes of live tuples
+    OUT approx_tuple_percent FLOAT8,    -- live tuples in % (based on estimate)
+    OUT dead_tuple_count BIGINT,        -- exact number of dead tuples
+    OUT dead_tuple_len BIGINT,          -- exact total length in bytes of dead tuples
+    OUT dead_tuple_percent FLOAT8,      -- dead tuples in % (based on estimate)
+    OUT free_space BIGINT,              -- exact free space in bytes
+    OUT free_percent FLOAT8)            -- free space in %
+AS 'MODULE_PATHNAME', 'fastbloat'
+LANGUAGE C STRICT;
+
+CREATE FUNCTION fastbloat(IN reloid regclass,
+    OUT table_len BIGINT,               -- physical table length in bytes
+    OUT scanned_percent FLOAT8,         -- what percentage of the table's pages was scanned
+    OUT approx_tuple_count BIGINT,      -- estimated number of live tuples
+    OUT approx_tuple_len BIGINT,        -- estimated total length in bytes of live tuples
+    OUT approx_tuple_percent FLOAT8,    -- live tuples in % (based on estimate)
+    OUT dead_tuple_count BIGINT,        -- exact number of dead tuples
+    OUT dead_tuple_len BIGINT,          -- exact total length in bytes of dead tuples
+    OUT dead_tuple_percent FLOAT8,      -- dead tuples in % (based on estimate)
+    OUT free_space BIGINT,              -- exact free space in bytes
+    OUT free_percent FLOAT8)            -- free space in %
+AS 'MODULE_PATHNAME', 'fastbloatbyid'
+LANGUAGE C STRICT;
diff --git a/contrib/pgstattuple/pgstattuple--1.2.sql b/contrib/pgstattuple/pgstattuple--1.3.sql
similarity index 59%
rename from contrib/pgstattuple/pgstattuple--1.2.sql
rename to contrib/pgstattuple/pgstattuple--1.3.sql
index e5fa2f5..0a111e6 100644
--- a/contrib/pgstattuple/pgstattuple--1.2.sql
+++ b/contrib/pgstattuple/pgstattuple--1.3.sql
@@ -1,4 +1,4 @@
-/* contrib/pgstattuple/pgstattuple--1.2.sql */
+/* contrib/pgstattuple/pgstattuple--1.3.sql */
 
 -- complain if script is sourced in psql, rather than via CREATE EXTENSION
 \echo Use "CREATE EXTENSION pgstattuple" to load this file. \quit
@@ -77,3 +77,33 @@ CREATE FUNCTION pg_relpages(IN relname regclass)
 RETURNS BIGINT
 AS 'MODULE_PATHNAME', 'pg_relpagesbyid'
 LANGUAGE C STRICT;
+
+/* New stuff in 1.3 begins here */
+
+CREATE FUNCTION fastbloat(IN relname text,
+    OUT table_len BIGINT,               -- physical table length in bytes
+    OUT scanned_percent FLOAT8,         -- what percentage of the table's pages was scanned
+    OUT approx_tuple_count BIGINT,      -- estimated number of live tuples
+    OUT approx_tuple_len BIGINT,        -- estimated total length in bytes of live tuples
+    OUT approx_tuple_percent FLOAT8,    -- live tuples in % (based on estimate)
+    OUT dead_tuple_count BIGINT,        -- exact number of dead tuples
+    OUT dead_tuple_len BIGINT,          -- exact total length in bytes of dead tuples
+    OUT dead_tuple_percent FLOAT8,      -- dead tuples in % (based on estimate)
+    OUT free_space BIGINT,              -- exact free space in bytes
+    OUT free_percent FLOAT8)            -- free space in %
+AS 'MODULE_PATHNAME', 'fastbloat'
+LANGUAGE C STRICT;
+
+CREATE FUNCTION fastbloat(IN reloid regclass,
+    OUT table_len BIGINT,               -- physical table length in bytes
+    OUT scanned_percent FLOAT8,         -- what percentage of the table's pages was scanned
+    OUT approx_tuple_count BIGINT,      -- estimated number of live tuples
+    OUT approx_tuple_len BIGINT,        -- estimated total length in bytes of live tuples
+    OUT approx_tuple_percent FLOAT8,    -- live tuples in % (based on estimate)
+    OUT dead_tuple_count BIGINT,        -- exact number of dead tuples
+    OUT dead_tuple_len BIGINT,          -- exact total length in bytes of dead tuples
+    OUT dead_tuple_percent FLOAT8,      -- dead tuples in % (based on estimate)
+    OUT free_space BIGINT,              -- exact free space in bytes
+    OUT free_percent FLOAT8)            -- free space in %
+AS 'MODULE_PATHNAME', 'fastbloatbyid'
+LANGUAGE C STRICT;

#50

Jim.Nasby@BlueTreble.com

almost 11 years ago

In reply to: Abhijit Menon-Sen (#49)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

On 12/26/14 1:38 AM, Abhijit Menon-Sen wrote:

At 2014-09-25 15:40:11 +0530,ams@2ndQuadrant.com wrote:

All right, then I'll post a version that addresses Amit's other
points, adds a new file/function to pgstattuple, acquires content
locks, and uses HeapTupleSatisfiesVacuum, hint-bit setting and all.

Sorry, I forgot to post this patch. It does what I listed above, and
seems to work fine (it's still quite a lot faster than pgstattuple
in many cases).

Anything happen with this?
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#51

ams@2ndQuadrant.com

almost 11 years ago

In reply to: Jim Nasby (#50)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

At 2015-01-26 18:47:29 -0600, Jim.Nasby@BlueTreble.com wrote:

Anything happen with this?

Nothing so far.

-- Abhijit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#52

Jim.Nasby@BlueTreble.com

almost 11 years ago

In reply to: Abhijit Menon-Sen (#51)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

On 1/27/15 3:56 AM, Abhijit Menon-Sen wrote:

At 2015-01-26 18:47:29 -0600, Jim.Nasby@BlueTreble.com wrote:

Anything happen with this?

Nothing so far.

It would be best to get this into a commit fest so it's not lost.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#53

ams@2ndQuadrant.com

almost 11 years ago

In reply to: Jim Nasby (#52)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

At 2015-01-27 17:00:27 -0600, Jim.Nasby@BlueTreble.com wrote:

It would be best to get this into a commit fest so it's not lost.

It's there already.

-- Abhijit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#54

tomas.vondra@2ndquadrant.com

almost 11 years ago

In reply to: Abhijit Menon-Sen (#53)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

Hi!

On 28.1.2015 05:03, Abhijit Menon-Sen wrote:

At 2015-01-27 17:00:27 -0600, Jim.Nasby@BlueTreble.com wrote:

It would be best to get this into a commit fest so it's not lost.

It's there already.

-- Abhijit

I looked at this patch today, so a few comments from me:

1) I believe the patch is slightly broken, because the version was
bumped to 1.3 but only partially, so I get this on "make install"

$ make -j9 -s install
/usr/bin/install: cannot stat ï¿½./pgstattuple--1.3.sqlï¿½: No such
file or directory
../../src/makefiles/pgxs.mk:129: recipe for target 'install' failed
make[1]: *** [install] Error 1
Makefile:85: recipe for target 'install-pgstattuple-recurse' failed
make: *** [install-pgstattuple-recurse] Error 2
make: *** Waiting for unfinished jobs....

The problem seems to be that Makefile already lists --1.3.sql in the
DATA section, but the file was not renamed (there's just --1.2.sql).

After fixing that, it's also necessary to fix the version in the
control file, otherwise the CREATE EXTENSION fails.

default_version = '1.2' -> '1.3'

At least that fixed the install for me.

2) Returning Datum from fbstat_heap, which is a static function seems a
bit strange to me, I'd use the HeapTuple directly, but meh ...

3) The way the tuple is built by first turning the values into strings
and then turned back into Datums by calling BuildTupleFromCStrings
is more serious I guess. Why not to just keep the Datums and call
heap_form_tuple directly?

I see the other functions in pgstattuple.c do use the string way, so
maybe there's a reason for that? Or is this just to keep the code
consistent in the extension?

4) I really doubt 'fastbloat' is a good name. I propose 'pgstatbloat'
to make that consistent with the rest of pgstattuple functions. Or
something similar, but 'fastbloat' is just plain confusing.

Otherwise, the code looks OK to me. Now, there are a few features I'd
like to have for production use (to minimize the impact):

1) no index support :-(

I'd like to see support for more relation types (at least btree
indexes). Are there any plans for that? Do we have an idea on how to
compute that?

2) sampling just a portion of the table

For example, being able to sample just 5% of blocks, making it less
obtrusive, especially on huge tables. Interestingly, there's a
TABLESAMPLE patch in this CF, so maybe it's possible to reuse some
of the methods (e.g. functions behind SYSTEM sampling)?

3) throttling

Another feature minimizing impact of running this on production might
be some sort of throttling, e.g. saying 'limit the scan to 4 MB/s'
or something along those lines.

4) prefetch

fbstat_heap is using visibility map to skip fully-visible pages,
which is nice, but if we skip too many pages it breaks readahead
similarly to bitmap heap scan. I believe this is another place where
effective_io_concurrency (i.e. prefetch) would be appropriate.

regards

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#55

Jim.Nasby@BlueTreble.com

almost 11 years ago

In reply to: Tomas Vondra (#54)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

On 2/22/15 5:41 PM, Tomas Vondra wrote:

Otherwise, the code looks OK to me. Now, there are a few features I'd
like to have for production use (to minimize the impact):

1) no index support:-(

I'd like to see support for more relation types (at least btree
indexes). Are there any plans for that? Do we have an idea on how to
compute that?

It'd be cleaner if had actual an actual am function for this, but see below.

2) sampling just a portion of the table

For example, being able to sample just 5% of blocks, making it less
obtrusive, especially on huge tables. Interestingly, there's a
TABLESAMPLE patch in this CF, so maybe it's possible to reuse some
of the methods (e.g. functions behind SYSTEM sampling)?

3) throttling

Another feature minimizing impact of running this on production might
be some sort of throttling, e.g. saying 'limit the scan to 4 MB/s'
or something along those lines.

4) prefetch

fbstat_heap is using visibility map to skip fully-visible pages,
which is nice, but if we skip too many pages it breaks readahead
similarly to bitmap heap scan. I believe this is another place where
effective_io_concurrency (i.e. prefetch) would be appropriate.

All of those wishes are solved in one way or another by vacuum and/or
analyze. If we had a hook in the tuple scanning loop and at the end of
vacuum you could just piggy-back on it. But really all we'd need for
vacuum to be able to report this info is one more field in LVRelStats, a
call to GetRecordedFreeSpace for all-visible pages, and some logic to
deal with pages skipped because we couldn't get the vacuum lock.

Should we just add this to vacuum instead?
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#56

tomas.vondra@2ndquadrant.com

almost 11 years ago

In reply to: Jim Nasby (#55)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

On 23.2.2015 03:20, Jim Nasby wrote:

On 2/22/15 5:41 PM, Tomas Vondra wrote:

Otherwise, the code looks OK to me. Now, there are a few features I'd
like to have for production use (to minimize the impact):

1) no index support:-(

I'd like to see support for more relation types (at least btree
indexes). Are there any plans for that? Do we have an idea on how to
compute that?

It'd be cleaner if had actual an actual am function for this, but see
below.

2) sampling just a portion of the table

For example, being able to sample just 5% of blocks, making it less
obtrusive, especially on huge tables. Interestingly, there's a
TABLESAMPLE patch in this CF, so maybe it's possible to reuse some
of the methods (e.g. functions behind SYSTEM sampling)?

3) throttling

Another feature minimizing impact of running this on production might
be some sort of throttling, e.g. saying 'limit the scan to 4 MB/s'
or something along those lines.

4) prefetch

fbstat_heap is using visibility map to skip fully-visible pages,
which is nice, but if we skip too many pages it breaks readahead
similarly to bitmap heap scan. I believe this is another place where
effective_io_concurrency (i.e. prefetch) would be appropriate.

All of those wishes are solved in one way or another by vacuum and/or
analyze. If we had a hook in the tuple scanning loop and at the end of
vacuum you could just piggy-back on it. But really all we'd need for
vacuum to be able to report this info is one more field in LVRelStats, a
call to GetRecordedFreeSpace for all-visible pages, and some logic to
deal with pages skipped because we couldn't get the vacuum lock.

Should we just add this to vacuum instead?

Possibly. I think the ultimate goal is to be able to get this info
easily and without disrupting the system performance too much (which is
difficult without sampling/throttling). If we can stuff that into
autovacuum reasonably, and then get the info from catalogs, I'm OK with
that.

However I'm not sure putting this into autovacuum is actually possible,
because how do you merge data from multiple partial runs (when each of
them skipped different pages)? Also, autovacuum is not the only place
where we free space - we'd have to handle HOT for example, I guess.

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#57

Jim.Nasby@BlueTreble.com

almost 11 years ago

In reply to: Tomas Vondra (#56)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

On 2/22/15 8:32 PM, Tomas Vondra wrote:

On 23.2.2015 03:20, Jim Nasby wrote:

On 2/22/15 5:41 PM, Tomas Vondra wrote:

Otherwise, the code looks OK to me. Now, there are a few features I'd
like to have for production use (to minimize the impact):

1) no index support:-(

I'd like to see support for more relation types (at least btree
indexes). Are there any plans for that? Do we have an idea on how to
compute that?

It'd be cleaner if had actual an actual am function for this, but see
below.

2) sampling just a portion of the table

For example, being able to sample just 5% of blocks, making it less
obtrusive, especially on huge tables. Interestingly, there's a
TABLESAMPLE patch in this CF, so maybe it's possible to reuse some
of the methods (e.g. functions behind SYSTEM sampling)?

3) throttling

Another feature minimizing impact of running this on production might
be some sort of throttling, e.g. saying 'limit the scan to 4 MB/s'
or something along those lines.

4) prefetch

fbstat_heap is using visibility map to skip fully-visible pages,
which is nice, but if we skip too many pages it breaks readahead
similarly to bitmap heap scan. I believe this is another place where
effective_io_concurrency (i.e. prefetch) would be appropriate.

All of those wishes are solved in one way or another by vacuum and/or
analyze. If we had a hook in the tuple scanning loop and at the end of
vacuum you could just piggy-back on it. But really all we'd need for
vacuum to be able to report this info is one more field in LVRelStats, a
call to GetRecordedFreeSpace for all-visible pages, and some logic to
deal with pages skipped because we couldn't get the vacuum lock.

Should we just add this to vacuum instead?

Possibly. I think the ultimate goal is to be able to get this info
easily and without disrupting the system performance too much (which is
difficult without sampling/throttling). If we can stuff that into
autovacuum reasonably, and then get the info from catalogs, I'm OK with
that.

Doing the counting in vacuum/analyze (auto or not) is quite easy, and it
would happen at the same time we're doing useful work. We would
automatically get the benefit of the throttling and sampling work that
those routines already do.

However I'm not sure putting this into autovacuum is actually possible,
because how do you merge data from multiple partial runs (when each of
them skipped different pages)?

ISTM that's just a form of sampling, no?

Besides, we don't need the same lock for figuring out bloat. We could
still measure bloat even if we can't vacuum the page, but I think that's
overkill. If we're skipping enough pages to mess with the bloat
measurement then we most likely need to teach vacuum how to revisit pages.

Also, autovacuum is not the only place
where we free space - we'd have to handle HOT for example, I guess.

I wasn't thinking about trying to keep live bloat statistics, so HOT
wouldn't affect this.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#58

tomas.vondra@2ndquadrant.com

almost 11 years ago

In reply to: Jim Nasby (#57)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

On 24.2.2015 19:08, Jim Nasby wrote:

On 2/22/15 8:32 PM, Tomas Vondra wrote:

On 23.2.2015 03:20, Jim Nasby wrote:

On 2/22/15 5:41 PM, Tomas Vondra wrote:

Otherwise, the code looks OK to me. Now, there are a few features I'd
like to have for production use (to minimize the impact):

1) no index support:-(

I'd like to see support for more relation types (at least btree
indexes). Are there any plans for that? Do we have an idea on
how to
compute that?

It'd be cleaner if had actual an actual am function for this, but see
below.

2) sampling just a portion of the table

For example, being able to sample just 5% of blocks, making it
less
obtrusive, especially on huge tables. Interestingly, there's a
TABLESAMPLE patch in this CF, so maybe it's possible to reuse some
of the methods (e.g. functions behind SYSTEM sampling)?

3) throttling

Another feature minimizing impact of running this on production
might
be some sort of throttling, e.g. saying 'limit the scan to 4 MB/s'
or something along those lines.

4) prefetch

fbstat_heap is using visibility map to skip fully-visible pages,
which is nice, but if we skip too many pages it breaks readahead
similarly to bitmap heap scan. I believe this is another place
where
effective_io_concurrency (i.e. prefetch) would be appropriate.

All of those wishes are solved in one way or another by vacuum and/or
analyze. If we had a hook in the tuple scanning loop and at the end of
vacuum you could just piggy-back on it. But really all we'd need for
vacuum to be able to report this info is one more field in LVRelStats, a
call to GetRecordedFreeSpace for all-visible pages, and some logic to
deal with pages skipped because we couldn't get the vacuum lock.

Should we just add this to vacuum instead?

Possibly. I think the ultimate goal is to be able to get this info
easily and without disrupting the system performance too much (which is
difficult without sampling/throttling). If we can stuff that into
autovacuum reasonably, and then get the info from catalogs, I'm OK with
that.

Doing the counting in vacuum/analyze (auto or not) is quite easy, and it
would happen at the same time we're doing useful work. We would
automatically get the benefit of the throttling and sampling work that
those routines already do.

However I'm not sure putting this into autovacuum is actually possible,
because how do you merge data from multiple partial runs (when each of
them skipped different pages)?

ISTM that's just a form of sampling, no?

Maybe.

I was thinking about collecting the necessary info during the VACUUM
phase, and somehow keeping track of free space in the whole table. I
thought there would be trouble exactly because this phase only processes
pages that possibly need vacuuming (so it wouldn't be a truly random
sample, making the estimation tricky).

But maybe that's not really true and it is possible to do that somehow.
For example what if we kept track of how much space each VACUUM freed,
and keeping running sum?

It might also be done during the ANALYZE, but that seems a bit
complicated because that's based on a sample of rows, not pages. Also,
the autovacuum_analyze_factor is 0.2 by default, so could end up with up
to 20% bloat without knowing it (vacuum_factor=0.1 is not great either,
but it's better).

Besides, we don't need the same lock for figuring out bloat. We
could still measure bloat even if we can't vacuum the page, but I
think that's overkill. If we're skipping enough pages to mess with
the bloat measurement then we most likely need to teach vacuum how to
revisit pages.

Also, autovacuum is not the only place
where we free space - we'd have to handle HOT for example, I guess.

I wasn't thinking about trying to keep live bloat statistics, so HOT
wouldn't affect this.

I'm afraid this might cause the estimate to drift away over time, but I
guess it depends on implementation - e.g. if doing this in ANALYZE, it'd
be mostly immune to this, while with collecting incremental info during
VACUUM it might be a problem I guess.

Anyway, we don't have a patch trying to do that (certainly not in this
CF). I think it makes sense to add fastbloat() to pageinspect. Maybe
we'll get autovacuum-based solution in the future, but we don't have
that right now.

Actually, wouldn't that be a nice GSoC project? The scope seems about
right, not touching too many parts of the code base, etc.

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#59

Jim.Nasby@BlueTreble.com

almost 11 years ago

In reply to: Tomas Vondra (#58)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

On 2/25/15 2:56 PM, Tomas Vondra wrote:

On 24.2.2015 19:08, Jim Nasby wrote:

On 2/22/15 8:32 PM, Tomas Vondra wrote:

On 23.2.2015 03:20, Jim Nasby wrote:

On 2/22/15 5:41 PM, Tomas Vondra wrote:

Otherwise, the code looks OK to me. Now, there are a few features I'd
like to have for production use (to minimize the impact):

1) no index support:-(

I'd like to see support for more relation types (at least btree
indexes). Are there any plans for that? Do we have an idea on
how to
compute that?

It'd be cleaner if had actual an actual am function for this, but see
below.

2) sampling just a portion of the table

For example, being able to sample just 5% of blocks, making it
less
obtrusive, especially on huge tables. Interestingly, there's a
TABLESAMPLE patch in this CF, so maybe it's possible to reuse some
of the methods (e.g. functions behind SYSTEM sampling)?

3) throttling

Another feature minimizing impact of running this on production
might
be some sort of throttling, e.g. saying 'limit the scan to 4 MB/s'
or something along those lines.

4) prefetch

fbstat_heap is using visibility map to skip fully-visible pages,
which is nice, but if we skip too many pages it breaks readahead
similarly to bitmap heap scan. I believe this is another place
where
effective_io_concurrency (i.e. prefetch) would be appropriate.

All of those wishes are solved in one way or another by vacuum and/or
analyze. If we had a hook in the tuple scanning loop and at the end of
vacuum you could just piggy-back on it. But really all we'd need for
vacuum to be able to report this info is one more field in LVRelStats, a
call to GetRecordedFreeSpace for all-visible pages, and some logic to
deal with pages skipped because we couldn't get the vacuum lock.

Should we just add this to vacuum instead?

Possibly. I think the ultimate goal is to be able to get this info
easily and without disrupting the system performance too much (which is
difficult without sampling/throttling). If we can stuff that into
autovacuum reasonably, and then get the info from catalogs, I'm OK with
that.

Doing the counting in vacuum/analyze (auto or not) is quite easy, and it
would happen at the same time we're doing useful work. We would
automatically get the benefit of the throttling and sampling work that
those routines already do.

However I'm not sure putting this into autovacuum is actually possible,
because how do you merge data from multiple partial runs (when each of
them skipped different pages)?

ISTM that's just a form of sampling, no?

Maybe.

I was thinking about collecting the necessary info during the VACUUM
phase, and somehow keeping track of free space in the whole table. I
thought there would be trouble exactly because this phase only processes
pages that possibly need vacuuming (so it wouldn't be a truly random
sample, making the estimation tricky).

Well, all-visible pages can still be listed in the FSM, so we'd have at
least that info. I don't think there can be dead tuples on an
all-visible page; if so, that means free space is all we care about. So
when pulling bloat info we'd just have to scan the VSM and FSM;
presumably that's pretty cheap.

But maybe that's not really true and it is possible to do that somehow.
For example what if we kept track of how much space each VACUUM freed,
and keeping running sum?

Well, then we'd also have to keep track of space we used. I don't think
we want to do that.

On a completely unrelated note, the idea of logging details about vacuum
runs in a table is appealing. Forcing users to something like pgBadger
for that data seems pretty silly to me. But that's obviously a
completely separate discussion from this one.

It might also be done during the ANALYZE, but that seems a bit
complicated because that's based on a sample of rows, not pages.

Right, but you've got the page anyway so I doubt it'd cost much extra to
measure the bloat on it. Might want to guard against counting a page
twice, but...

Also,
the autovacuum_analyze_factor is 0.2 by default, so could end up with up
to 20% bloat without knowing it (vacuum_factor=0.1 is not great either,
but it's better).

... I don't think it's practical to get this bloat measurement terribly
precise, no do I think we need to. Anyone that needs better than 20%
accuracy can probably afford to fire off a manual vacuum at the same
time (or run a scan-only version).

Besides, we don't need the same lock for figuring out bloat. We
could still measure bloat even if we can't vacuum the page, but I
think that's overkill. If we're skipping enough pages to mess with
the bloat measurement then we most likely need to teach vacuum how to
revisit pages.

Also, autovacuum is not the only place
where we free space - we'd have to handle HOT for example, I guess.

I wasn't thinking about trying to keep live bloat statistics, so HOT
wouldn't affect this.

I'm afraid this might cause the estimate to drift away over time, but I
guess it depends on implementation - e.g. if doing this in ANALYZE, it'd
be mostly immune to this, while with collecting incremental info during
VACUUM it might be a problem I guess.

Yeah, I just don't see the need for that level of accuracy.

Anyway, we don't have a patch trying to do that (certainly not in this
CF). I think it makes sense to add fastbloat() to pageinspect. Maybe
we'll get autovacuum-based solution in the future, but we don't have
that right now.

I still think we could add this as an option to at least vacuum in this
CF. It should be far less code and gets us throttling for free. It
wouldn't be hard to add a "BLOAT ONLY" option to bypass all the other
work; it'd still be less code.

That said, I don't want to block this; I think it's useful. Though,
perhaps it would be better as an extension instead of in contrib? I
don't think it should be very version dependent?

Actually, wouldn't that be a nice GSoC project? The scope seems about
right, not touching too many parts of the code base, etc.

In it's simplest form I think it'd be too small, but if we got more
advanced than simply adding some counters to vacuum then I agree.

I think I'd rather gave a system for logging important stuff (like
vacuum stats) in tables with a way to prevent infinite growth though. ;)
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#60

Simon Riggs

simon@2ndQuadrant.com

almost 11 years ago

In reply to: Jim Nasby (#59)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

On 25 February 2015 at 23:30, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

That said, I don't want to block this; I think it's useful. Though, perhaps
it would be better as an extension instead of in contrib? I don't think it
should be very version dependent?

The whole point of this is to get it into contrib.

It could have published it as an extension months ago.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#61

Mark Kirkwood

mark.kirkwood@catalyst.net.nz

almost 11 years ago

In reply to: Simon Riggs (#60)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

On 26/02/15 13:32, Simon Riggs wrote:

On 25 February 2015 at 23:30, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

That said, I don't want to block this; I think it's useful. Though, perhaps
it would be better as an extension instead of in contrib? I don't think it
should be very version dependent?

The whole point of this is to get it into contrib.

It could have published it as an extension months ago.

Is this intended to replace pg_freespacemap? Not trying to be
overprotective or anything :-) but is this new extension/module
better/faster/stronger/more accurate etc? If so then excellent! Ohth was
wondering if people either a) didn't know about pg_freespacemap or b)
consider that it is crap and won't use it.

Cheers

Mark

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#62

Alvaro Herrera

alvherre@2ndquadrant.com

almost 11 years ago

In reply to: Mark Kirkwood (#61)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

Mark Kirkwood wrote:

On 26/02/15 13:32, Simon Riggs wrote:

On 25 February 2015 at 23:30, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

That said, I don't want to block this; I think it's useful. Though, perhaps
it would be better as an extension instead of in contrib? I don't think it
should be very version dependent?

The whole point of this is to get it into contrib.

It could have published it as an extension months ago.

Is this intended to replace pg_freespacemap? Not trying to be overprotective
or anything :-) but is this new extension/module better/faster/stronger/more
accurate etc? If so then excellent! Ohth was wondering if people either a)
didn't know about pg_freespacemap or b) consider that it is crap and won't
use it.

It's a replacement for pgstattuple as far as I recall, not
pg_freespacemap. The problem with pgstattuple is that it reads every
single page of the table; this fast tool skips reading pages that are
marked all-visible in the visibility map. The disadvantage of
pg_freespacemap is that it doesn't have page data more recent than the
last vacuum, IIRC.

--
ï¿½lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#63

amit.kapila16@gmail.com

almost 11 years ago

In reply to: Abhijit Menon-Sen (#49)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

On Fri, Dec 26, 2014 at 1:08 PM, Abhijit Menon-Sen <ams@2ndquadrant.com>
wrote:

At 2014-09-25 15:40:11 +0530, ams@2ndQuadrant.com wrote:

All right, then I'll post a version that addresses Amit's other
points, adds a new file/function to pgstattuple, acquires content
locks, and uses HeapTupleSatisfiesVacuum, hint-bit setting and all.

Sorry, I forgot to post this patch. It does what I listed above, and
seems to work fine (it's still quite a lot faster than pgstattuple
in many cases).

I think one of the comment is not handled in latest patch.
5. Don't we need the handling for empty page (PageIsEmpty()) case?

I think we should have siimilar for PageIsEmpty()
as you have done for PageIsNew() in your patch.

+ stat.tuple_count = vac_estimate_reltuples(rel, false, nblocks, scanned,
+  stat.tuple_count);

Here for scanned tuples, you have used the live tuple counter
whereas I think we need to count HEAPTUPLE_INSERT_IN_PROGRESS
and HEAPTUPLE_DELETE_IN_PROGRESS and
HEAPTUPLE_RECENTLY_DEAD.

Refer the similar logic in Vacuum.

Although I am not in favor of using HeapTupleSatisfiesVacuum(), however
if you want to use it then lets be consistent with what Vacuum does
for estimation of tuples.

A couple of remaining questions:

1. I didn't change the handling of LP_DEAD items, because the way it is
now matches what pgstattuple counts. I'm open to changing it, though.
Maybe it makes sense to count LP_DEAD items iff lp_len != 0? Or just
leave it as it is? I think it doesn't matter much.

2. I changed the code to acquire the content locks on the buffer, as
discussed, but it still uses HeapTupleSatisfiesVacuum. Amit suggested
using HeapTupleSatisfiesVisibility, but it's not clear to me why that
would be better. I welcome advice in this matter.

The reason to use HeapTupleSatisfiesVacuum() is that it helps us
in freezing and some other similar stuff which is required by
Vacuum. Also it could be beneficial if we want take specific
action for various result codes of Vaccum. I think as this routine
mostly gives the estimated count, so it might not matter much even
if we use HeapTupleSatisfiesVacuum(), however it is better to be
consistent with pgstat_heap() in this case.
Do you have any reason for using HeapTupleSatisfiesVacuum() rather
than what is used in pgstat_heap()?

(If anything, I should use HeapTupleIsSurelyDead, which doesn't set
any hint bits, but which I earlier rejected as too conservative in
its dead/not-dead decisions for this purpose.)

(I've actually patched the pgstattuple.sgml docs as well, but I want to
re-read that to make sure it's up to date, and didn't want to wait to
post the code changes.)

Sure, post the same when it is ready.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#64

amit.kapila16@gmail.com

almost 11 years ago

In reply to: Tomas Vondra (#54)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

On Mon, Feb 23, 2015 at 7:11 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

On 28.1.2015 05:03, Abhijit Menon-Sen wrote:

At 2015-01-27 17:00:27 -0600, Jim.Nasby@BlueTreble.com wrote:

Otherwise, the code looks OK to me. Now, there are a few features I'd
like to have for production use (to minimize the impact):

1) no index support :-(

I'd like to see support for more relation types (at least btree
indexes). Are there any plans for that? Do we have an idea on how to
compute that?

2) sampling just a portion of the table

For example, being able to sample just 5% of blocks, making it less
obtrusive, especially on huge tables. Interestingly, there's a
TABLESAMPLE patch in this CF, so maybe it's possible to reuse some
of the methods (e.g. functions behind SYSTEM sampling)?

3) throttling

Another feature minimizing impact of running this on production might
be some sort of throttling, e.g. saying 'limit the scan to 4 MB/s'
or something along those lines.

I think these features could be done separately if anybody is interested.
The patch in its proposed form seems useful to me.

4) prefetch

fbstat_heap is using visibility map to skip fully-visible pages,
which is nice, but if we skip too many pages it breaks readahead
similarly to bitmap heap scan. I believe this is another place where
effective_io_concurrency (i.e. prefetch) would be appropriate.

Good point. We can even think of using the technique used by Vacuum
which is skip only when we can skip atleast SKIP_PAGES_THRESHOLD
pages.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#65

ams@2ndQuadrant.com

almost 11 years ago

In reply to: Amit Kapila (#64)

1 attachment(s)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

Hi.

I'm just posting this WIP patch where I've renamed fastbloat to
pgstatbloat as suggested by Tomas, and added in the documentation, and
so on. I still have to incorporate Amit's comments about the estimation
of reltuples according to the way vacuum does it, and I expect to post
that tomorrow (I just need to test a little more).

In the meantime, if anyone else was having trouble installing the
extension due to the incorrect version in the control file, this is the
patch you should be using.

-- Abhijit

Attachments:

0001-Add-pgstatbloat-to-pgstattuple.patchtext/x-diff; charset=us-asciiDownload

>From f809e070e8ea13b74c6206ca67a7eaf2a32e60fa Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen <ams@2ndQuadrant.com>
Date: Fri, 26 Dec 2014 12:37:13 +0530
Subject: Add pgstatbloat to pgstattuple

---
 contrib/pgstattuple/Makefile                       |   4 +-
 contrib/pgstattuple/pgstatbloat.c                  | 346 +++++++++++++++++++++
 contrib/pgstattuple/pgstattuple--1.2--1.3.sql      |  18 ++
 .../{pgstattuple--1.2.sql => pgstattuple--1.3.sql} |  18 +-
 contrib/pgstattuple/pgstattuple.control            |   2 +-
 doc/src/sgml/pgstattuple.sgml                      | 135 ++++++++
 6 files changed, 519 insertions(+), 4 deletions(-)
 create mode 100644 contrib/pgstattuple/pgstatbloat.c
 create mode 100644 contrib/pgstattuple/pgstattuple--1.2--1.3.sql
 rename contrib/pgstattuple/{pgstattuple--1.2.sql => pgstattuple--1.3.sql} (73%)

diff --git a/contrib/pgstattuple/Makefile b/contrib/pgstattuple/Makefile
index 862585c..d7d27a5 100644
--- a/contrib/pgstattuple/Makefile
+++ b/contrib/pgstattuple/Makefile
@@ -1,10 +1,10 @@
 # contrib/pgstattuple/Makefile
 
 MODULE_big	= pgstattuple
-OBJS		= pgstattuple.o pgstatindex.o $(WIN32RES)
+OBJS		= pgstattuple.o pgstatindex.o pgstatbloat.o $(WIN32RES)
 
 EXTENSION = pgstattuple
-DATA = pgstattuple--1.2.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql
+DATA = pgstattuple--1.3.sql pgstattuple--1.2--1.3.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql
 PGFILEDESC = "pgstattuple - tuple-level statistics"
 
 REGRESS = pgstattuple
diff --git a/contrib/pgstattuple/pgstatbloat.c b/contrib/pgstattuple/pgstatbloat.c
new file mode 100644
index 0000000..15c2cb9
--- /dev/null
+++ b/contrib/pgstattuple/pgstatbloat.c
@@ -0,0 +1,346 @@
+/*
+ * contrib/pgstattuple/pgstatbloat.c
+ *
+ * Abhijit Menon-Sen <ams@2ndQuadrant.com>
+ * Portions Copyright (c) 2001,2002	Tatsuo Ishii (from pg_stattuple)
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose, without fee, and without a
+ * written agreement is hereby granted, provided that the above
+ * copyright notice and this paragraph and the following two
+ * paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE TO ANY PARTY FOR DIRECT,
+ * INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED
+ * OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS ON AN "AS
+ * IS" BASIS, AND THE AUTHOR HAS NO OBLIGATIONS TO PROVIDE MAINTENANCE,
+ * SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+ */
+
+#include "postgres.h"
+
+#include "access/visibilitymap.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/multixact.h"
+#include "access/htup_details.h"
+#include "catalog/namespace.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "storage/freespace.h"
+#include "storage/procarray.h"
+#include "storage/lmgr.h"
+#include "utils/builtins.h"
+#include "utils/tqual.h"
+#include "commands/vacuum.h"
+
+PG_FUNCTION_INFO_V1(pgstatbloat);
+
+/*
+ * tuple_percent, dead_tuple_percent and free_percent are computable,
+ * so not defined here.
+ */
+typedef struct pgstatbloat_output_type
+{
+	uint64		table_len;
+	uint64		tuple_count;
+	uint64		tuple_len;
+	uint64		dead_tuple_count;
+	uint64		dead_tuple_len;
+	uint64		free_space;
+	uint64		total_pages;
+	uint64		scanned_pages;
+} pgstatbloat_output_type;
+
+static Datum build_output_type(pgstatbloat_output_type *stat,
+							   FunctionCallInfo fcinfo);
+static Datum fbstat_relation(Relation rel, FunctionCallInfo fcinfo);
+static Datum fbstat_heap(Relation rel, FunctionCallInfo fcinfo);
+
+/*
+ * build a pgstatbloat_output_type tuple
+ */
+static Datum
+build_output_type(pgstatbloat_output_type *stat, FunctionCallInfo fcinfo)
+{
+#define NCOLUMNS	10
+#define NCHARS		32
+
+	HeapTuple	tuple;
+	char	   *values[NCOLUMNS];
+	char		values_buf[NCOLUMNS][NCHARS];
+	int			i;
+	double		tuple_percent;
+	double		dead_tuple_percent;
+	double		free_percent;	/* free/reusable space in % */
+	double		scanned_percent;
+	TupleDesc	tupdesc;
+	AttInMetadata *attinmeta;
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	/*
+	 * Generate attribute metadata needed later to produce tuples from raw C
+	 * strings
+	 */
+	attinmeta = TupleDescGetAttInMetadata(tupdesc);
+
+	if (stat->table_len == 0)
+	{
+		tuple_percent = 0.0;
+		dead_tuple_percent = 0.0;
+		free_percent = 0.0;
+	}
+	else
+	{
+		tuple_percent = 100.0 * stat->tuple_len / stat->table_len;
+		dead_tuple_percent = 100.0 * stat->dead_tuple_len / stat->table_len;
+		free_percent = 100.0 * stat->free_space / stat->table_len;
+	}
+
+	scanned_percent = 0.0;
+	if (stat->total_pages != 0)
+	{
+		scanned_percent = 100 * stat->scanned_pages / stat->total_pages;
+	}
+
+	for (i = 0; i < NCOLUMNS; i++)
+		values[i] = values_buf[i];
+	i = 0;
+	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->table_len);
+	snprintf(values[i++], NCHARS, "%.2f", scanned_percent);
+	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->tuple_count);
+	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->tuple_len);
+	snprintf(values[i++], NCHARS, "%.2f", tuple_percent);
+	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->dead_tuple_count);
+	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->dead_tuple_len);
+	snprintf(values[i++], NCHARS, "%.2f", dead_tuple_percent);
+	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->free_space);
+	snprintf(values[i++], NCHARS, "%.2f", free_percent);
+
+	tuple = BuildTupleFromCStrings(attinmeta, values);
+
+	return HeapTupleGetDatum(tuple);
+}
+
+/* Returns live/dead tuple statistics for the given relid. */
+
+Datum
+pgstatbloat(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	Relation	rel;
+
+	if (!superuser())
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 (errmsg("must be superuser to use pgstatbloat functions"))));
+
+	rel = relation_open(relid, AccessShareLock);
+
+	PG_RETURN_DATUM(fbstat_relation(rel, fcinfo));
+}
+
+/*
+ * A helper function to reject unsupported relation types. We depend on
+ * the visibility map to decide which pages we can skip, so we can't
+ * support indexes, for example, which don't have a VM.
+ */
+
+static Datum
+fbstat_relation(Relation rel, FunctionCallInfo fcinfo)
+{
+	const char *err;
+
+	/*
+	 * Reject attempts to read non-local temporary relations; we would be
+	 * likely to get wrong data since we have no visibility into the owning
+	 * session's local buffers.
+	 */
+	if (RELATION_IS_OTHER_TEMP(rel))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot access temporary tables of other sessions")));
+
+	switch (rel->rd_rel->relkind)
+	{
+		case RELKIND_RELATION:
+		case RELKIND_MATVIEW:
+			return fbstat_heap(rel, fcinfo);
+		case RELKIND_TOASTVALUE:
+			err = "toast value";
+			break;
+		case RELKIND_SEQUENCE:
+			err = "sequence";
+			break;
+		case RELKIND_INDEX:
+			err = "index";
+			break;
+		case RELKIND_VIEW:
+			err = "view";
+			break;
+		case RELKIND_COMPOSITE_TYPE:
+			err = "composite type";
+			break;
+		case RELKIND_FOREIGN_TABLE:
+			err = "foreign table";
+			break;
+		default:
+			err = "unknown";
+			break;
+	}
+
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("\"%s\" (%s) is not supported",
+					RelationGetRelationName(rel), err)));
+	return 0;
+}
+
+/*
+ * This function takes an already open relation and scans its pages,
+ * skipping those that have the corresponding visibility map bit set.
+ * For pages we skip, we find the free space from the free space map
+ * and approximate tuple_len on that basis. For the others, we count
+ * the exact number of dead tuples etc.
+ *
+ * This scan is loosely based on vacuumlazy.c:lazy_scan_heap(), but
+ * we do not try to avoid skipping single pages.
+ */
+
+static Datum
+fbstat_heap(Relation rel, FunctionCallInfo fcinfo)
+{
+	BlockNumber scanned,
+				nblocks,
+				blkno;
+	Buffer		vmbuffer = InvalidBuffer;
+	pgstatbloat_output_type stat = {0};
+	BufferAccessStrategy bstrategy;
+	TransactionId OldestXmin;
+
+	OldestXmin = GetOldestXmin(rel, true);
+	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	scanned = 0;
+	nblocks = RelationGetNumberOfBlocks(rel);
+
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		Buffer		buf;
+		Page		page;
+		OffsetNumber offnum,
+					maxoff;
+		Size		freespace;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the page has only visible tuples, then we can find out the
+		 * free space from the FSM and move on.
+		 */
+
+		if (visibilitymap_test(rel, blkno, &vmbuffer))
+		{
+			freespace = GetRecordedFreeSpace(rel, blkno);
+			stat.tuple_len += BLCKSZ - freespace;
+			stat.free_space += freespace;
+			continue;
+		}
+
+		buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
+								 RBM_NORMAL, bstrategy);
+
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+		page = BufferGetPage(buf);
+
+		stat.free_space += PageGetHeapFreeSpace(page);
+
+		if (PageIsNew(page) || PageIsEmpty(page))
+		{
+			UnlockReleaseBuffer(buf);
+			continue;
+		}
+
+		scanned++;
+
+		/*
+		 * Look at each tuple on the page and decide whether it's live
+		 * or dead, then count it and its size. Unlike lazy_scan_heap,
+		 * we can afford to ignore problems and special cases.
+		 */
+
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemid;
+			HeapTupleData tuple;
+
+			itemid = PageGetItemId(page, offnum);
+
+			if (!ItemIdIsUsed(itemid) || ItemIdIsRedirected(itemid) ||
+				ItemIdIsDead(itemid))
+			{
+				continue;
+			}
+
+			Assert(ItemIdIsNormal(itemid));
+
+			ItemPointerSet(&(tuple.t_self), blkno, offnum);
+
+			tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+			tuple.t_len = ItemIdGetLength(itemid);
+			tuple.t_tableOid = RelationGetRelid(rel);
+
+			switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
+			{
+				case HEAPTUPLE_DEAD:
+				case HEAPTUPLE_RECENTLY_DEAD:
+					stat.dead_tuple_len += tuple.t_len;
+					stat.dead_tuple_count++;
+					break;
+				case HEAPTUPLE_LIVE:
+					stat.tuple_len += tuple.t_len;
+					stat.tuple_count++;
+					break;
+				case HEAPTUPLE_INSERT_IN_PROGRESS:
+				case HEAPTUPLE_DELETE_IN_PROGRESS:
+					break;
+				default:
+					elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
+					break;
+			}
+		}
+
+		UnlockReleaseBuffer(buf);
+	}
+
+	stat.table_len = (uint64) nblocks * BLCKSZ;
+	stat.tuple_count = vac_estimate_reltuples(rel, false, nblocks, scanned,
+											  stat.tuple_count);
+	stat.total_pages = nblocks;
+	stat.scanned_pages = scanned;
+
+	if (BufferIsValid(vmbuffer))
+	{
+		ReleaseBuffer(vmbuffer);
+		vmbuffer = InvalidBuffer;
+	}
+
+	relation_close(rel, AccessShareLock);
+
+	return build_output_type(&stat, fcinfo);
+}
diff --git a/contrib/pgstattuple/pgstattuple--1.2--1.3.sql b/contrib/pgstattuple/pgstattuple--1.2--1.3.sql
new file mode 100644
index 0000000..dba7993
--- /dev/null
+++ b/contrib/pgstattuple/pgstattuple--1.2--1.3.sql
@@ -0,0 +1,18 @@
+/* contrib/pgstattuple/pgstattuple--1.2--1.3.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pgstattuple UPDATE TO '1.3'" to load this file. \quit
+
+CREATE FUNCTION pgstatbloat(IN reloid regclass,
+    OUT table_len BIGINT,               -- physical table length in bytes
+    OUT scanned_percent FLOAT8,         -- what percentage of the table's pages was scanned
+    OUT approx_tuple_count BIGINT,      -- estimated number of live tuples
+    OUT approx_tuple_len BIGINT,        -- estimated total length in bytes of live tuples
+    OUT approx_tuple_percent FLOAT8,    -- live tuples in % (based on estimate)
+    OUT dead_tuple_count BIGINT,        -- exact number of dead tuples
+    OUT dead_tuple_len BIGINT,          -- exact total length in bytes of dead tuples
+    OUT dead_tuple_percent FLOAT8,      -- dead tuples in % (based on estimate)
+    OUT free_space BIGINT,              -- exact free space in bytes
+    OUT free_percent FLOAT8)            -- free space in %
+AS 'MODULE_PATHNAME', 'pgstatbloat'
+LANGUAGE C STRICT;
diff --git a/contrib/pgstattuple/pgstattuple--1.2.sql b/contrib/pgstattuple/pgstattuple--1.3.sql
similarity index 73%
rename from contrib/pgstattuple/pgstattuple--1.2.sql
rename to contrib/pgstattuple/pgstattuple--1.3.sql
index e5fa2f5..93593ee 100644
--- a/contrib/pgstattuple/pgstattuple--1.2.sql
+++ b/contrib/pgstattuple/pgstattuple--1.3.sql
@@ -1,4 +1,4 @@
-/* contrib/pgstattuple/pgstattuple--1.2.sql */
+/* contrib/pgstattuple/pgstattuple--1.3.sql */
 
 -- complain if script is sourced in psql, rather than via CREATE EXTENSION
 \echo Use "CREATE EXTENSION pgstattuple" to load this file. \quit
@@ -77,3 +77,19 @@ CREATE FUNCTION pg_relpages(IN relname regclass)
 RETURNS BIGINT
 AS 'MODULE_PATHNAME', 'pg_relpagesbyid'
 LANGUAGE C STRICT;
+
+/* New stuff in 1.3 begins here */
+
+CREATE FUNCTION pgstatbloat(IN reloid regclass,
+    OUT table_len BIGINT,               -- physical table length in bytes
+    OUT scanned_percent FLOAT8,         -- what percentage of the table's pages was scanned
+    OUT approx_tuple_count BIGINT,      -- estimated number of live tuples
+    OUT approx_tuple_len BIGINT,        -- estimated total length in bytes of live tuples
+    OUT approx_tuple_percent FLOAT8,    -- live tuples in % (based on estimate)
+    OUT dead_tuple_count BIGINT,        -- exact number of dead tuples
+    OUT dead_tuple_len BIGINT,          -- exact total length in bytes of dead tuples
+    OUT dead_tuple_percent FLOAT8,      -- dead tuples in % (based on estimate)
+    OUT free_space BIGINT,              -- exact free space in bytes
+    OUT free_percent FLOAT8)            -- free space in %
+AS 'MODULE_PATHNAME', 'pgstatbloat'
+LANGUAGE C STRICT;
diff --git a/contrib/pgstattuple/pgstattuple.control b/contrib/pgstattuple/pgstattuple.control
index a7cf47f..c03b180 100644
--- a/contrib/pgstattuple/pgstattuple.control
+++ b/contrib/pgstattuple/pgstattuple.control
@@ -1,5 +1,5 @@
 # pgstattuple extension
 comment = 'show tuple-level statistics'
-default_version = '1.2'
+default_version = '1.3'
 module_pathname = '$libdir/pgstattuple'
 relocatable = true
diff --git a/doc/src/sgml/pgstattuple.sgml b/doc/src/sgml/pgstattuple.sgml
index 9cabd71..2cd6cdc 100644
--- a/doc/src/sgml/pgstattuple.sgml
+++ b/doc/src/sgml/pgstattuple.sgml
@@ -358,6 +358,140 @@ pending_tuples | 0
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term>
+     <indexterm>
+      <primary>pgstatbloat</primary>
+     </indexterm>
+     <function>pgstatbloat(regclass) returns record</>
+    </term>
+
+    <listitem>
+     <para>
+      <function>pgstatbloat</function> is a faster alternative to
+      <function>pgstattuple</function> that returns approximate results.
+      The argument is the target relation's OID.
+      For example:
+<programlisting>
+test=> SELECT * FROM pgstatbloat('pg_catalog.pg_proc'::regclass);
+-[ RECORD 1 ]--------+-------
+table_len            | 573440
+scanned_percent      | 2
+approx_tuple_count   | 2740
+approx_tuple_len     | 561210
+approx_tuple_percent | 97.87
+dead_tuple_count     | 0
+dead_tuple_len       | 0
+dead_tuple_percent   | 0
+free_space           | 11996
+free_percent         | 2.09
+</programlisting>
+      The output columns are described in <xref linkend="pgstatbloat-columns">.
+     </para>
+
+     <para>
+     Whereas <function>pgstattuple</function> always performs a
+     full-table scan and returns an exact count of live and dead tuples
+     (and their sizes) and free space, <function>pgstatbloat</function>
+     tries to avoid the full-table scan and returns exact dead tuple and
+     free space statistics along with an approximation of the number and
+     size of live tuples.
+     </para>
+
+     <para>
+     It does this by skipping pages that have only visible tuples
+     according to the visibility map (if a page has the corresponding VM
+     bit set, then it is assumed to contain no dead tuples). For such
+     pages, it derives the free space value from the free space map, and
+     assumes that the rest of the space on the page is taken up by live
+     tuples.
+     </para>
+ 
+     <para>
+     For pages that cannot be skipped, it scans each tuple, recording its
+     presence and size in the appropriate counters, and adding up the
+     free space on the page. At the end, it estimates the total number of
+     live tuples based on the number of pages and tuples scanned (in the
+     same way that VACUUM estimates pg_class.reltuples).
+     </para>
+
+     <table id="pgstatbloat-columns">
+      <title><function>pgstatbloat</function> Output Columns</title>
+      <tgroup cols="3">
+       <thead>
+        <row>
+         <entry>Column</entry>
+         <entry>Type</entry>
+         <entry>Description</entry>
+        </row>
+       </thead>
+ 
+       <tbody>
+        <row>
+         <entry><structfield>table_len</structfield></entry>
+         <entry><type>bigint</type></entry>
+         <entry>Physical relation length in bytes (exact)</entry>
+        </row>
+        <row>
+         <entry><structfield>scanned_percent</structfield></entry>
+         <entry><type>float8</type></entry>
+         <entry>Percentage of table scanned</entry>
+        </row>
+        <row>
+         <entry><structfield>tuple_count</structfield></entry>
+         <entry><type>bigint</type></entry>
+         <entry>Number of live tuples (estimated)</entry>
+        </row>
+        <row>
+         <entry><structfield>tuple_len</structfield></entry>
+         <entry><type>bigint</type></entry>
+         <entry>Total length of live tuples in bytes (estimated)</entry>
+        </row>
+        <row>
+         <entry><structfield>tuple_percent</structfield></entry>
+         <entry><type>float8</type></entry>
+         <entry>Percentage of live tuples</entry>
+        </row>
+        <row>
+         <entry><structfield>dead_tuple_count</structfield></entry>
+         <entry><type>bigint</type></entry>
+         <entry>Number of dead tuples (exact)</entry>
+        </row>
+        <row>
+         <entry><structfield>dead_tuple_len</structfield></entry>
+         <entry><type>bigint</type></entry>
+         <entry>Total length of dead tuples in bytes (exact)</entry>
+        </row>
+        <row>
+         <entry><structfield>dead_tuple_percent</structfield></entry>
+         <entry><type>float8</type></entry>
+         <entry>Percentage of dead tuples</entry>
+        </row>
+        <row>
+         <entry><structfield>free_space</structfield></entry>
+         <entry><type>bigint</type></entry>
+         <entry>Total free space in bytes (exact)</entry>
+        </row>
+        <row>
+         <entry><structfield>free_percent</structfield></entry>
+         <entry><type>float8</type></entry>
+         <entry>Percentage of free space (exact)</entry>
+        </row>
+ 
+       </tbody>
+      </tgroup>
+     </table>
+ 
+     <para>
+     In the above output, the free space figures may not match the
+     <function>pgstattuple</function> output exactly, because the free
+     space map gives us an exact figure, but is not guaranteed to be
+     accurate.
+     </para>
+
+    </listitem>
+   </varlistentry>
+
   </variablelist>
  </sect2>
 
@@ -366,6 +500,7 @@ pending_tuples | 0
 
   <para>
    Tatsuo Ishii and Satoshi Nagayasu
+   Abhijit Menon-Sen
   </para>
  </sect2>
 
-- 
1.9.1

#66

ams@2ndQuadrant.com

almost 11 years ago

In reply to: Abhijit Menon-Sen (#65)

1 attachment(s)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

At 2015-03-31 22:43:49 +0530, ams@2ndQuadrant.com wrote:

I'm just posting this WIP patch where I've renamed fastbloat to
pgstatbloat as suggested by Tomas, and added in the documentation, and
so on.

Here's the revised version that also adds the count of RECENTLY_DEAD and
INSERT/DELETE_IN_PROGRESS tuples to the call to vac_estimate_reltuples.

Tomas, I agree that the build_output_tuple function could use cleaning
up, but it's the same as the corresponding code in pgstattuple, and it
seems to me reasonable to clean both up together in a separate patch.

-- Abhijit

Attachments:

0001-Add-pgstatbloat-to-pgstattuple-20150403.patchtext/x-diff; charset=us-asciiDownload

>From d11b84627438fca12955dfad77042be0a27c9acc Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen <ams@2ndQuadrant.com>
Date: Fri, 26 Dec 2014 12:37:13 +0530
Subject: Add pgstatbloat to pgstattuple

---
 contrib/pgstattuple/Makefile                       |   4 +-
 contrib/pgstattuple/pgstatbloat.c                  | 340 +++++++++++++++++++++
 contrib/pgstattuple/pgstattuple--1.2--1.3.sql      |  18 ++
 .../{pgstattuple--1.2.sql => pgstattuple--1.3.sql} |  18 +-
 contrib/pgstattuple/pgstattuple.control            |   2 +-
 doc/src/sgml/pgstattuple.sgml                      | 135 ++++++++
 6 files changed, 513 insertions(+), 4 deletions(-)
 create mode 100644 contrib/pgstattuple/pgstatbloat.c
 create mode 100644 contrib/pgstattuple/pgstattuple--1.2--1.3.sql
 rename contrib/pgstattuple/{pgstattuple--1.2.sql => pgstattuple--1.3.sql} (73%)

diff --git a/contrib/pgstattuple/Makefile b/contrib/pgstattuple/Makefile
index 862585c..d7d27a5 100644
--- a/contrib/pgstattuple/Makefile
+++ b/contrib/pgstattuple/Makefile
@@ -1,10 +1,10 @@
 # contrib/pgstattuple/Makefile
 
 MODULE_big	= pgstattuple
-OBJS		= pgstattuple.o pgstatindex.o $(WIN32RES)
+OBJS		= pgstattuple.o pgstatindex.o pgstatbloat.o $(WIN32RES)
 
 EXTENSION = pgstattuple
-DATA = pgstattuple--1.2.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql
+DATA = pgstattuple--1.3.sql pgstattuple--1.2--1.3.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql
 PGFILEDESC = "pgstattuple - tuple-level statistics"
 
 REGRESS = pgstattuple
diff --git a/contrib/pgstattuple/pgstatbloat.c b/contrib/pgstattuple/pgstatbloat.c
new file mode 100644
index 0000000..2dd1900
--- /dev/null
+++ b/contrib/pgstattuple/pgstatbloat.c
@@ -0,0 +1,340 @@
+/*
+ * contrib/pgstattuple/pgstatbloat.c
+ *
+ * Abhijit Menon-Sen <ams@2ndQuadrant.com>
+ * Portions Copyright (c) 2001,2002	Tatsuo Ishii (from pgstattuple)
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose, without fee, and without a
+ * written agreement is hereby granted, provided that the above
+ * copyright notice and this paragraph and the following two
+ * paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE TO ANY PARTY FOR DIRECT,
+ * INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED
+ * OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS ON AN "AS
+ * IS" BASIS, AND THE AUTHOR HAS NO OBLIGATIONS TO PROVIDE MAINTENANCE,
+ * SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+ */
+
+#include "postgres.h"
+
+#include "access/visibilitymap.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/multixact.h"
+#include "access/htup_details.h"
+#include "catalog/namespace.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "storage/freespace.h"
+#include "storage/procarray.h"
+#include "storage/lmgr.h"
+#include "utils/builtins.h"
+#include "utils/tqual.h"
+#include "commands/vacuum.h"
+
+PG_FUNCTION_INFO_V1(pgstatbloat);
+
+/*
+ * tuple_percent, dead_tuple_percent and free_percent are computable,
+ * so not defined here.
+ */
+typedef struct pgstatbloat_output_type
+{
+	uint64		table_len;
+	uint64		tuple_count;
+	uint64		misc_count;
+	uint64		tuple_len;
+	uint64		dead_tuple_count;
+	uint64		dead_tuple_len;
+	uint64		free_space;
+	uint64		total_pages;
+	uint64		scanned_pages;
+} pgstatbloat_output_type;
+
+static Datum build_output_type(pgstatbloat_output_type *stat,
+							   FunctionCallInfo fcinfo);
+static Datum pgstatbloat_heap(Relation rel, FunctionCallInfo fcinfo);
+
+/*
+ * build a pgstatbloat_output_type tuple
+ */
+static Datum
+build_output_type(pgstatbloat_output_type *stat, FunctionCallInfo fcinfo)
+{
+#define NCOLUMNS	10
+#define NCHARS		32
+
+	HeapTuple	tuple;
+	char	   *values[NCOLUMNS];
+	char		values_buf[NCOLUMNS][NCHARS];
+	int			i;
+	double		tuple_percent;
+	double		dead_tuple_percent;
+	double		free_percent;	/* free/reusable space in % */
+	double		scanned_percent;
+	TupleDesc	tupdesc;
+	AttInMetadata *attinmeta;
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	/*
+	 * Generate attribute metadata needed later to produce tuples from raw C
+	 * strings
+	 */
+	attinmeta = TupleDescGetAttInMetadata(tupdesc);
+
+	if (stat->table_len == 0)
+	{
+		tuple_percent = 0.0;
+		dead_tuple_percent = 0.0;
+		free_percent = 0.0;
+	}
+	else
+	{
+		tuple_percent = 100.0 * stat->tuple_len / stat->table_len;
+		dead_tuple_percent = 100.0 * stat->dead_tuple_len / stat->table_len;
+		free_percent = 100.0 * stat->free_space / stat->table_len;
+	}
+
+	scanned_percent = 0.0;
+	if (stat->total_pages != 0)
+	{
+		scanned_percent = 100 * stat->scanned_pages / stat->total_pages;
+	}
+
+	for (i = 0; i < NCOLUMNS; i++)
+		values[i] = values_buf[i];
+	i = 0;
+	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->table_len);
+	snprintf(values[i++], NCHARS, "%.2f", scanned_percent);
+	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->tuple_count);
+	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->tuple_len);
+	snprintf(values[i++], NCHARS, "%.2f", tuple_percent);
+	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->dead_tuple_count);
+	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->dead_tuple_len);
+	snprintf(values[i++], NCHARS, "%.2f", dead_tuple_percent);
+	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->free_space);
+	snprintf(values[i++], NCHARS, "%.2f", free_percent);
+
+	tuple = BuildTupleFromCStrings(attinmeta, values);
+
+	return HeapTupleGetDatum(tuple);
+}
+
+/* Returns live/dead tuple statistics for the given relid. */
+
+Datum
+pgstatbloat(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	Relation	rel;
+	const char *err;
+
+	if (!superuser())
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 (errmsg("must be superuser to use pgstatbloat functions"))));
+
+	rel = relation_open(relid, AccessShareLock);
+
+	/*
+	 * Reject attempts to read non-local temporary relations; we would be
+	 * likely to get wrong data since we have no visibility into the owning
+	 * session's local buffers.
+	 */
+	if (RELATION_IS_OTHER_TEMP(rel))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot access temporary tables of other sessions")));
+
+	switch (rel->rd_rel->relkind)
+	{
+		case RELKIND_RELATION:
+		case RELKIND_MATVIEW:
+			PG_RETURN_DATUM(pgstatbloat_heap(rel, fcinfo));
+		case RELKIND_TOASTVALUE:
+			err = "toast value";
+			break;
+		case RELKIND_SEQUENCE:
+			err = "sequence";
+			break;
+		case RELKIND_INDEX:
+			err = "index";
+			break;
+		case RELKIND_VIEW:
+			err = "view";
+			break;
+		case RELKIND_COMPOSITE_TYPE:
+			err = "composite type";
+			break;
+		case RELKIND_FOREIGN_TABLE:
+			err = "foreign table";
+			break;
+		default:
+			err = "unknown";
+			break;
+	}
+
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("\"%s\" (%s) is not supported",
+					RelationGetRelationName(rel), err)));
+}
+
+/*
+ * This function takes an already open relation and scans its pages,
+ * skipping those that have the corresponding visibility map bit set.
+ * For pages we skip, we find the free space from the free space map
+ * and approximate tuple_len on that basis. For the others, we count
+ * the exact number of dead tuples etc.
+ *
+ * This scan is loosely based on vacuumlazy.c:lazy_scan_heap(), but
+ * we do not try to avoid skipping single pages.
+ */
+
+static Datum
+pgstatbloat_heap(Relation rel, FunctionCallInfo fcinfo)
+{
+	BlockNumber scanned,
+				nblocks,
+				blkno;
+	Buffer		vmbuffer = InvalidBuffer;
+	pgstatbloat_output_type stat = {0};
+	BufferAccessStrategy bstrategy;
+	TransactionId OldestXmin;
+
+	OldestXmin = GetOldestXmin(rel, true);
+	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	scanned = 0;
+	nblocks = RelationGetNumberOfBlocks(rel);
+
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		Buffer		buf;
+		Page		page;
+		OffsetNumber offnum,
+					maxoff;
+		Size		freespace;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the page has only visible tuples, then we can find out the
+		 * free space from the FSM and move on.
+		 */
+
+		if (visibilitymap_test(rel, blkno, &vmbuffer))
+		{
+			freespace = GetRecordedFreeSpace(rel, blkno);
+			stat.tuple_len += BLCKSZ - freespace;
+			stat.free_space += freespace;
+			continue;
+		}
+
+		buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
+								 RBM_NORMAL, bstrategy);
+
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+		page = BufferGetPage(buf);
+
+		stat.free_space += PageGetHeapFreeSpace(page);
+
+		if (PageIsNew(page) || PageIsEmpty(page))
+		{
+			UnlockReleaseBuffer(buf);
+			continue;
+		}
+
+		scanned++;
+
+		/*
+		 * Look at each tuple on the page and decide whether it's live
+		 * or dead, then count it and its size. Unlike lazy_scan_heap,
+		 * we can afford to ignore problems and special cases.
+		 */
+
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemid;
+			HeapTupleData tuple;
+
+			itemid = PageGetItemId(page, offnum);
+
+			if (!ItemIdIsUsed(itemid) || ItemIdIsRedirected(itemid) ||
+				ItemIdIsDead(itemid))
+			{
+				continue;
+			}
+
+			Assert(ItemIdIsNormal(itemid));
+
+			ItemPointerSet(&(tuple.t_self), blkno, offnum);
+
+			tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+			tuple.t_len = ItemIdGetLength(itemid);
+			tuple.t_tableOid = RelationGetRelid(rel);
+
+			/*
+			 * We count live and dead tuples, but we also need to add up
+			 * others in order to feed vac_estimate_reltuples.
+			 */
+
+			switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
+			{
+				case HEAPTUPLE_RECENTLY_DEAD:
+					stat.misc_count++;
+					/* Fall through */
+				case HEAPTUPLE_DEAD:
+					stat.dead_tuple_len += tuple.t_len;
+					stat.dead_tuple_count++;
+					break;
+				case HEAPTUPLE_LIVE:
+					stat.tuple_len += tuple.t_len;
+					stat.tuple_count++;
+					break;
+				case HEAPTUPLE_INSERT_IN_PROGRESS:
+				case HEAPTUPLE_DELETE_IN_PROGRESS:
+					stat.misc_count++;
+					break;
+				default:
+					elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
+					break;
+			}
+		}
+
+		UnlockReleaseBuffer(buf);
+	}
+
+	stat.table_len = (uint64) nblocks * BLCKSZ;
+	stat.tuple_count = vac_estimate_reltuples(rel, false, nblocks, scanned,
+											  stat.tuple_count+stat.misc_count);
+	stat.total_pages = nblocks;
+	stat.scanned_pages = scanned;
+
+	if (BufferIsValid(vmbuffer))
+	{
+		ReleaseBuffer(vmbuffer);
+		vmbuffer = InvalidBuffer;
+	}
+
+	relation_close(rel, AccessShareLock);
+
+	return build_output_type(&stat, fcinfo);
+}
diff --git a/contrib/pgstattuple/pgstattuple--1.2--1.3.sql b/contrib/pgstattuple/pgstattuple--1.2--1.3.sql
new file mode 100644
index 0000000..dba7993
--- /dev/null
+++ b/contrib/pgstattuple/pgstattuple--1.2--1.3.sql
@@ -0,0 +1,18 @@
+/* contrib/pgstattuple/pgstattuple--1.2--1.3.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pgstattuple UPDATE TO '1.3'" to load this file. \quit
+
+CREATE FUNCTION pgstatbloat(IN reloid regclass,
+    OUT table_len BIGINT,               -- physical table length in bytes
+    OUT scanned_percent FLOAT8,         -- what percentage of the table's pages was scanned
+    OUT approx_tuple_count BIGINT,      -- estimated number of live tuples
+    OUT approx_tuple_len BIGINT,        -- estimated total length in bytes of live tuples
+    OUT approx_tuple_percent FLOAT8,    -- live tuples in % (based on estimate)
+    OUT dead_tuple_count BIGINT,        -- exact number of dead tuples
+    OUT dead_tuple_len BIGINT,          -- exact total length in bytes of dead tuples
+    OUT dead_tuple_percent FLOAT8,      -- dead tuples in % (based on estimate)
+    OUT free_space BIGINT,              -- exact free space in bytes
+    OUT free_percent FLOAT8)            -- free space in %
+AS 'MODULE_PATHNAME', 'pgstatbloat'
+LANGUAGE C STRICT;
diff --git a/contrib/pgstattuple/pgstattuple--1.2.sql b/contrib/pgstattuple/pgstattuple--1.3.sql
similarity index 73%
rename from contrib/pgstattuple/pgstattuple--1.2.sql
rename to contrib/pgstattuple/pgstattuple--1.3.sql
index e5fa2f5..93593ee 100644
--- a/contrib/pgstattuple/pgstattuple--1.2.sql
+++ b/contrib/pgstattuple/pgstattuple--1.3.sql
@@ -1,4 +1,4 @@
-/* contrib/pgstattuple/pgstattuple--1.2.sql */
+/* contrib/pgstattuple/pgstattuple--1.3.sql */
 
 -- complain if script is sourced in psql, rather than via CREATE EXTENSION
 \echo Use "CREATE EXTENSION pgstattuple" to load this file. \quit
@@ -77,3 +77,19 @@ CREATE FUNCTION pg_relpages(IN relname regclass)
 RETURNS BIGINT
 AS 'MODULE_PATHNAME', 'pg_relpagesbyid'
 LANGUAGE C STRICT;
+
+/* New stuff in 1.3 begins here */
+
+CREATE FUNCTION pgstatbloat(IN reloid regclass,
+    OUT table_len BIGINT,               -- physical table length in bytes
+    OUT scanned_percent FLOAT8,         -- what percentage of the table's pages was scanned
+    OUT approx_tuple_count BIGINT,      -- estimated number of live tuples
+    OUT approx_tuple_len BIGINT,        -- estimated total length in bytes of live tuples
+    OUT approx_tuple_percent FLOAT8,    -- live tuples in % (based on estimate)
+    OUT dead_tuple_count BIGINT,        -- exact number of dead tuples
+    OUT dead_tuple_len BIGINT,          -- exact total length in bytes of dead tuples
+    OUT dead_tuple_percent FLOAT8,      -- dead tuples in % (based on estimate)
+    OUT free_space BIGINT,              -- exact free space in bytes
+    OUT free_percent FLOAT8)            -- free space in %
+AS 'MODULE_PATHNAME', 'pgstatbloat'
+LANGUAGE C STRICT;
diff --git a/contrib/pgstattuple/pgstattuple.control b/contrib/pgstattuple/pgstattuple.control
index a7cf47f..c03b180 100644
--- a/contrib/pgstattuple/pgstattuple.control
+++ b/contrib/pgstattuple/pgstattuple.control
@@ -1,5 +1,5 @@
 # pgstattuple extension
 comment = 'show tuple-level statistics'
-default_version = '1.2'
+default_version = '1.3'
 module_pathname = '$libdir/pgstattuple'
 relocatable = true
diff --git a/doc/src/sgml/pgstattuple.sgml b/doc/src/sgml/pgstattuple.sgml
index 9cabd71..2cd6cdc 100644
--- a/doc/src/sgml/pgstattuple.sgml
+++ b/doc/src/sgml/pgstattuple.sgml
@@ -358,6 +358,140 @@ pending_tuples | 0
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term>
+     <indexterm>
+      <primary>pgstatbloat</primary>
+     </indexterm>
+     <function>pgstatbloat(regclass) returns record</>
+    </term>
+
+    <listitem>
+     <para>
+      <function>pgstatbloat</function> is a faster alternative to
+      <function>pgstattuple</function> that returns approximate results.
+      The argument is the target relation's OID.
+      For example:
+<programlisting>
+test=> SELECT * FROM pgstatbloat('pg_catalog.pg_proc'::regclass);
+-[ RECORD 1 ]--------+-------
+table_len            | 573440
+scanned_percent      | 2
+approx_tuple_count   | 2740
+approx_tuple_len     | 561210
+approx_tuple_percent | 97.87
+dead_tuple_count     | 0
+dead_tuple_len       | 0
+dead_tuple_percent   | 0
+free_space           | 11996
+free_percent         | 2.09
+</programlisting>
+      The output columns are described in <xref linkend="pgstatbloat-columns">.
+     </para>
+
+     <para>
+     Whereas <function>pgstattuple</function> always performs a
+     full-table scan and returns an exact count of live and dead tuples
+     (and their sizes) and free space, <function>pgstatbloat</function>
+     tries to avoid the full-table scan and returns exact dead tuple and
+     free space statistics along with an approximation of the number and
+     size of live tuples.
+     </para>
+
+     <para>
+     It does this by skipping pages that have only visible tuples
+     according to the visibility map (if a page has the corresponding VM
+     bit set, then it is assumed to contain no dead tuples). For such
+     pages, it derives the free space value from the free space map, and
+     assumes that the rest of the space on the page is taken up by live
+     tuples.
+     </para>
+ 
+     <para>
+     For pages that cannot be skipped, it scans each tuple, recording its
+     presence and size in the appropriate counters, and adding up the
+     free space on the page. At the end, it estimates the total number of
+     live tuples based on the number of pages and tuples scanned (in the
+     same way that VACUUM estimates pg_class.reltuples).
+     </para>
+
+     <table id="pgstatbloat-columns">
+      <title><function>pgstatbloat</function> Output Columns</title>
+      <tgroup cols="3">
+       <thead>
+        <row>
+         <entry>Column</entry>
+         <entry>Type</entry>
+         <entry>Description</entry>
+        </row>
+       </thead>
+ 
+       <tbody>
+        <row>
+         <entry><structfield>table_len</structfield></entry>
+         <entry><type>bigint</type></entry>
+         <entry>Physical relation length in bytes (exact)</entry>
+        </row>
+        <row>
+         <entry><structfield>scanned_percent</structfield></entry>
+         <entry><type>float8</type></entry>
+         <entry>Percentage of table scanned</entry>
+        </row>
+        <row>
+         <entry><structfield>tuple_count</structfield></entry>
+         <entry><type>bigint</type></entry>
+         <entry>Number of live tuples (estimated)</entry>
+        </row>
+        <row>
+         <entry><structfield>tuple_len</structfield></entry>
+         <entry><type>bigint</type></entry>
+         <entry>Total length of live tuples in bytes (estimated)</entry>
+        </row>
+        <row>
+         <entry><structfield>tuple_percent</structfield></entry>
+         <entry><type>float8</type></entry>
+         <entry>Percentage of live tuples</entry>
+        </row>
+        <row>
+         <entry><structfield>dead_tuple_count</structfield></entry>
+         <entry><type>bigint</type></entry>
+         <entry>Number of dead tuples (exact)</entry>
+        </row>
+        <row>
+         <entry><structfield>dead_tuple_len</structfield></entry>
+         <entry><type>bigint</type></entry>
+         <entry>Total length of dead tuples in bytes (exact)</entry>
+        </row>
+        <row>
+         <entry><structfield>dead_tuple_percent</structfield></entry>
+         <entry><type>float8</type></entry>
+         <entry>Percentage of dead tuples</entry>
+        </row>
+        <row>
+         <entry><structfield>free_space</structfield></entry>
+         <entry><type>bigint</type></entry>
+         <entry>Total free space in bytes (exact)</entry>
+        </row>
+        <row>
+         <entry><structfield>free_percent</structfield></entry>
+         <entry><type>float8</type></entry>
+         <entry>Percentage of free space (exact)</entry>
+        </row>
+ 
+       </tbody>
+      </tgroup>
+     </table>
+ 
+     <para>
+     In the above output, the free space figures may not match the
+     <function>pgstattuple</function> output exactly, because the free
+     space map gives us an exact figure, but is not guaranteed to be
+     accurate.
+     </para>
+
+    </listitem>
+   </varlistentry>
+
   </variablelist>
  </sect2>
 
@@ -366,6 +500,7 @@ pending_tuples | 0
 
   <para>
    Tatsuo Ishii and Satoshi Nagayasu
+   Abhijit Menon-Sen
   </para>
  </sect2>
 
-- 
1.9.1

#67

amit.kapila16@gmail.com

over 10 years ago

In reply to: Abhijit Menon-Sen (#66)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

On Fri, Apr 3, 2015 at 9:07 PM, Abhijit Menon-Sen <ams@2ndquadrant.com>
wrote:

At 2015-03-31 22:43:49 +0530, ams@2ndQuadrant.com wrote:

I'm just posting this WIP patch where I've renamed fastbloat to
pgstatbloat as suggested by Tomas, and added in the documentation, and
so on.

Here's the revised version that also adds the count of RECENTLY_DEAD and
INSERT/DELETE_IN_PROGRESS tuples to the call to vac_estimate_reltuples.

I think you have missed to address the below point:

4) prefetch

fbstat_heap is using visibility map to skip fully-visible pages,
which is nice, but if we skip too many pages it breaks readahead
similarly to bitmap heap scan. I believe this is another place where
effective_io_concurrency (i.e. prefetch) would be appropriate.

Good point. We can even think of using the technique used by Vacuum
which is skip only when we can skip atleast SKIP_PAGES_THRESHOLD
pages.

Apart from this, one minor point:
+typedef struct pgstatbloat_output_type
+{
+ uint64 table_len;
+ uint64 tuple_count;
+ uint64 misc_count;

misc_count sounds out of place, may be non_live_count?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#68

ams@2ndQuadrant.com

over 10 years ago

In reply to: Amit Kapila (#67)

2 attachment(s)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

At 2015-04-18 12:28:36 +0530, amit.kapila16@gmail.com wrote:

I think you have missed to address the below point:

4) prefetch

Updated patch attached, as well as the diff against the earlier version
just to make it easier to see the exact change I made (which is to copy
the skip logic from lazy_scan_heap, as you suggested).

I'm not entirely convinced that this change is worthwhile, but it's easy
to make and difficult to argue against, so here it is. (I did test, and
it seems to work OK after the change.)

Amit, Tomas, thanks again for your review comments.

-- Abhijit

Attachments:

0001-20150422-Add-pgstatbloat-to-pgstattuple.patchtext/x-diff; charset=us-asciiDownload

>From 3edb5426292d6097cb66339b865e99bf4f766646 Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen <ams@2ndQuadrant.com>
Date: Fri, 26 Dec 2014 12:37:13 +0530
Subject: Add pgstatbloat to pgstattuple

---
 contrib/pgstattuple/Makefile                       |   4 +-
 contrib/pgstattuple/pgstatbloat.c                  | 387 +++++++++++++++++++++
 contrib/pgstattuple/pgstattuple--1.2--1.3.sql      |  18 +
 .../{pgstattuple--1.2.sql => pgstattuple--1.3.sql} |  18 +-
 contrib/pgstattuple/pgstattuple.control            |   2 +-
 doc/src/sgml/pgstattuple.sgml                      | 135 +++++++
 6 files changed, 560 insertions(+), 4 deletions(-)
 create mode 100644 contrib/pgstattuple/pgstatbloat.c
 create mode 100644 contrib/pgstattuple/pgstattuple--1.2--1.3.sql
 rename contrib/pgstattuple/{pgstattuple--1.2.sql => pgstattuple--1.3.sql} (73%)

diff --git a/contrib/pgstattuple/Makefile b/contrib/pgstattuple/Makefile
index 862585c..d7d27a5 100644
--- a/contrib/pgstattuple/Makefile
+++ b/contrib/pgstattuple/Makefile
@@ -1,10 +1,10 @@
 # contrib/pgstattuple/Makefile
 
 MODULE_big	= pgstattuple
-OBJS		= pgstattuple.o pgstatindex.o $(WIN32RES)
+OBJS		= pgstattuple.o pgstatindex.o pgstatbloat.o $(WIN32RES)
 
 EXTENSION = pgstattuple
-DATA = pgstattuple--1.2.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql
+DATA = pgstattuple--1.3.sql pgstattuple--1.2--1.3.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql
 PGFILEDESC = "pgstattuple - tuple-level statistics"
 
 REGRESS = pgstattuple
diff --git a/contrib/pgstattuple/pgstatbloat.c b/contrib/pgstattuple/pgstatbloat.c
new file mode 100644
index 0000000..b19459a
--- /dev/null
+++ b/contrib/pgstattuple/pgstatbloat.c
@@ -0,0 +1,387 @@
+/*
+ * contrib/pgstattuple/pgstatbloat.c
+ *
+ * Abhijit Menon-Sen <ams@2ndQuadrant.com>
+ * Portions Copyright (c) 2001,2002	Tatsuo Ishii (from pgstattuple)
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose, without fee, and without a
+ * written agreement is hereby granted, provided that the above
+ * copyright notice and this paragraph and the following two
+ * paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE TO ANY PARTY FOR DIRECT,
+ * INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED
+ * OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS ON AN "AS
+ * IS" BASIS, AND THE AUTHOR HAS NO OBLIGATIONS TO PROVIDE MAINTENANCE,
+ * SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+ */
+
+#include "postgres.h"
+
+#include "access/visibilitymap.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/multixact.h"
+#include "access/htup_details.h"
+#include "catalog/namespace.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "storage/freespace.h"
+#include "storage/procarray.h"
+#include "storage/lmgr.h"
+#include "utils/builtins.h"
+#include "utils/tqual.h"
+#include "commands/vacuum.h"
+
+PG_FUNCTION_INFO_V1(pgstatbloat);
+
+/*
+ * tuple_percent, dead_tuple_percent and free_percent are computable,
+ * so not defined here.
+ */
+typedef struct pgstatbloat_output_type
+{
+	uint64		table_len;
+	uint64		tuple_count;
+	uint64		misc_count;
+	uint64		tuple_len;
+	uint64		dead_tuple_count;
+	uint64		dead_tuple_len;
+	uint64		free_space;
+	uint64		total_pages;
+	uint64		scanned_pages;
+} pgstatbloat_output_type;
+
+static Datum build_output_type(pgstatbloat_output_type *stat,
+							   FunctionCallInfo fcinfo);
+static Datum pgstatbloat_heap(Relation rel, FunctionCallInfo fcinfo);
+
+/*
+ * build a pgstatbloat_output_type tuple
+ */
+static Datum
+build_output_type(pgstatbloat_output_type *stat, FunctionCallInfo fcinfo)
+{
+#define NCOLUMNS	10
+#define NCHARS		32
+
+	HeapTuple	tuple;
+	char	   *values[NCOLUMNS];
+	char		values_buf[NCOLUMNS][NCHARS];
+	int			i;
+	double		tuple_percent;
+	double		dead_tuple_percent;
+	double		free_percent;	/* free/reusable space in % */
+	double		scanned_percent;
+	TupleDesc	tupdesc;
+	AttInMetadata *attinmeta;
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	/*
+	 * Generate attribute metadata needed later to produce tuples from raw C
+	 * strings
+	 */
+	attinmeta = TupleDescGetAttInMetadata(tupdesc);
+
+	if (stat->table_len == 0)
+	{
+		tuple_percent = 0.0;
+		dead_tuple_percent = 0.0;
+		free_percent = 0.0;
+	}
+	else
+	{
+		tuple_percent = 100.0 * stat->tuple_len / stat->table_len;
+		dead_tuple_percent = 100.0 * stat->dead_tuple_len / stat->table_len;
+		free_percent = 100.0 * stat->free_space / stat->table_len;
+	}
+
+	scanned_percent = 0.0;
+	if (stat->total_pages != 0)
+	{
+		scanned_percent = 100 * stat->scanned_pages / stat->total_pages;
+	}
+
+	for (i = 0; i < NCOLUMNS; i++)
+		values[i] = values_buf[i];
+	i = 0;
+	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->table_len);
+	snprintf(values[i++], NCHARS, "%.2f", scanned_percent);
+	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->tuple_count);
+	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->tuple_len);
+	snprintf(values[i++], NCHARS, "%.2f", tuple_percent);
+	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->dead_tuple_count);
+	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->dead_tuple_len);
+	snprintf(values[i++], NCHARS, "%.2f", dead_tuple_percent);
+	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->free_space);
+	snprintf(values[i++], NCHARS, "%.2f", free_percent);
+
+	tuple = BuildTupleFromCStrings(attinmeta, values);
+
+	return HeapTupleGetDatum(tuple);
+}
+
+/* Returns live/dead tuple statistics for the given relid. */
+
+Datum
+pgstatbloat(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	Relation	rel;
+	const char *err;
+
+	if (!superuser())
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 (errmsg("must be superuser to use pgstatbloat functions"))));
+
+	rel = relation_open(relid, AccessShareLock);
+
+	/*
+	 * Reject attempts to read non-local temporary relations; we would be
+	 * likely to get wrong data since we have no visibility into the owning
+	 * session's local buffers.
+	 */
+	if (RELATION_IS_OTHER_TEMP(rel))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot access temporary tables of other sessions")));
+
+	switch (rel->rd_rel->relkind)
+	{
+		case RELKIND_RELATION:
+		case RELKIND_MATVIEW:
+			PG_RETURN_DATUM(pgstatbloat_heap(rel, fcinfo));
+		case RELKIND_TOASTVALUE:
+			err = "toast value";
+			break;
+		case RELKIND_SEQUENCE:
+			err = "sequence";
+			break;
+		case RELKIND_INDEX:
+			err = "index";
+			break;
+		case RELKIND_VIEW:
+			err = "view";
+			break;
+		case RELKIND_COMPOSITE_TYPE:
+			err = "composite type";
+			break;
+		case RELKIND_FOREIGN_TABLE:
+			err = "foreign table";
+			break;
+		default:
+			err = "unknown";
+			break;
+	}
+
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("\"%s\" (%s) is not supported",
+					RelationGetRelationName(rel), err)));
+}
+
+/*
+ * Before we consider skipping a page that's marked as clean in
+ * visibility map, we must've seen at least this many clean pages.
+ */
+#define SKIP_PAGES_THRESHOLD    ((BlockNumber) 32)
+
+/*
+ * This function takes an already open relation and scans its pages,
+ * skipping those that have the corresponding visibility map bit set.
+ * For pages we skip, we find the free space from the free space map
+ * and approximate tuple_len on that basis. For the others, we count
+ * the exact number of dead tuples etc.
+ *
+ * This scan is loosely based on vacuumlazy.c:lazy_scan_heap(), but
+ * we do not try to avoid skipping single pages.
+ */
+
+static Datum
+pgstatbloat_heap(Relation rel, FunctionCallInfo fcinfo)
+{
+	BlockNumber scanned,
+				nblocks,
+				blkno;
+	Buffer		vmbuffer = InvalidBuffer;
+	pgstatbloat_output_type stat = {0};
+	BufferAccessStrategy bstrategy;
+	TransactionId OldestXmin;
+	BlockNumber next_not_all_visible_block;
+	bool		skipping_all_visible_blocks;
+
+	OldestXmin = GetOldestXmin(rel, true);
+	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	scanned = 0;
+	nblocks = RelationGetNumberOfBlocks(rel);
+
+	/*
+	 * We use the visibility map to skip over SKIP_PAGES_THRESHOLD or
+	 * more contiguous all-visible pages. See the explanation in
+	 * lazy_scan_heap for the rationale.
+	 */
+
+	for (next_not_all_visible_block = 0;
+		 next_not_all_visible_block < nblocks;
+		 next_not_all_visible_block++)
+	{
+		if (!visibilitymap_test(rel, next_not_all_visible_block, &vmbuffer))
+			break;
+	}
+	if (next_not_all_visible_block >= SKIP_PAGES_THRESHOLD)
+		skipping_all_visible_blocks = true;
+	else
+		skipping_all_visible_blocks = false;
+
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		Buffer		buf;
+		Page		page;
+		OffsetNumber offnum,
+					maxoff;
+		Size		freespace;
+
+		CHECK_FOR_INTERRUPTS();
+
+		if (blkno == next_not_all_visible_block)
+		{
+			/* Time to advance next_not_all_visible_block */
+			for (next_not_all_visible_block++;
+				 next_not_all_visible_block < nblocks;
+				 next_not_all_visible_block++)
+			{
+				if (!visibilitymap_test(rel, next_not_all_visible_block,
+										&vmbuffer))
+					break;
+			}
+
+			/*
+			 * We know we can't skip the current block.  But set up
+			 * skipping_all_visible_blocks to do the right thing at the
+			 * following blocks.
+			 */
+			if (next_not_all_visible_block - blkno > SKIP_PAGES_THRESHOLD)
+				skipping_all_visible_blocks = true;
+			else
+				skipping_all_visible_blocks = false;
+		}
+		else if (skipping_all_visible_blocks)
+		{
+			/*
+			 * We can skip this all-visible page, so we just find out
+			 * the free space from the FSM.
+			 */
+			freespace = GetRecordedFreeSpace(rel, blkno);
+			stat.tuple_len += BLCKSZ - freespace;
+			stat.free_space += freespace;
+			continue;
+		}
+
+		buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
+								 RBM_NORMAL, bstrategy);
+
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+		page = BufferGetPage(buf);
+
+		stat.free_space += PageGetHeapFreeSpace(page);
+
+		if (PageIsNew(page) || PageIsEmpty(page))
+		{
+			UnlockReleaseBuffer(buf);
+			continue;
+		}
+
+		scanned++;
+
+		/*
+		 * Look at each tuple on the page and decide whether it's live
+		 * or dead, then count it and its size. Unlike lazy_scan_heap,
+		 * we can afford to ignore problems and special cases.
+		 */
+
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemid;
+			HeapTupleData tuple;
+
+			itemid = PageGetItemId(page, offnum);
+
+			if (!ItemIdIsUsed(itemid) || ItemIdIsRedirected(itemid) ||
+				ItemIdIsDead(itemid))
+			{
+				continue;
+			}
+
+			Assert(ItemIdIsNormal(itemid));
+
+			ItemPointerSet(&(tuple.t_self), blkno, offnum);
+
+			tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+			tuple.t_len = ItemIdGetLength(itemid);
+			tuple.t_tableOid = RelationGetRelid(rel);
+
+			/*
+			 * We count live and dead tuples, but we also need to add up
+			 * others in order to feed vac_estimate_reltuples.
+			 */
+
+			switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
+			{
+				case HEAPTUPLE_RECENTLY_DEAD:
+					stat.misc_count++;
+					/* Fall through */
+				case HEAPTUPLE_DEAD:
+					stat.dead_tuple_len += tuple.t_len;
+					stat.dead_tuple_count++;
+					break;
+				case HEAPTUPLE_LIVE:
+					stat.tuple_len += tuple.t_len;
+					stat.tuple_count++;
+					break;
+				case HEAPTUPLE_INSERT_IN_PROGRESS:
+				case HEAPTUPLE_DELETE_IN_PROGRESS:
+					stat.misc_count++;
+					break;
+				default:
+					elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
+					break;
+			}
+		}
+
+		UnlockReleaseBuffer(buf);
+	}
+
+	stat.table_len = (uint64) nblocks * BLCKSZ;
+	stat.tuple_count = vac_estimate_reltuples(rel, false, nblocks, scanned,
+											  stat.tuple_count+stat.misc_count);
+	stat.total_pages = nblocks;
+	stat.scanned_pages = scanned;
+
+	if (BufferIsValid(vmbuffer))
+	{
+		ReleaseBuffer(vmbuffer);
+		vmbuffer = InvalidBuffer;
+	}
+
+	relation_close(rel, AccessShareLock);
+
+	return build_output_type(&stat, fcinfo);
+}
diff --git a/contrib/pgstattuple/pgstattuple--1.2--1.3.sql b/contrib/pgstattuple/pgstattuple--1.2--1.3.sql
new file mode 100644
index 0000000..dba7993
--- /dev/null
+++ b/contrib/pgstattuple/pgstattuple--1.2--1.3.sql
@@ -0,0 +1,18 @@
+/* contrib/pgstattuple/pgstattuple--1.2--1.3.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pgstattuple UPDATE TO '1.3'" to load this file. \quit
+
+CREATE FUNCTION pgstatbloat(IN reloid regclass,
+    OUT table_len BIGINT,               -- physical table length in bytes
+    OUT scanned_percent FLOAT8,         -- what percentage of the table's pages was scanned
+    OUT approx_tuple_count BIGINT,      -- estimated number of live tuples
+    OUT approx_tuple_len BIGINT,        -- estimated total length in bytes of live tuples
+    OUT approx_tuple_percent FLOAT8,    -- live tuples in % (based on estimate)
+    OUT dead_tuple_count BIGINT,        -- exact number of dead tuples
+    OUT dead_tuple_len BIGINT,          -- exact total length in bytes of dead tuples
+    OUT dead_tuple_percent FLOAT8,      -- dead tuples in % (based on estimate)
+    OUT free_space BIGINT,              -- exact free space in bytes
+    OUT free_percent FLOAT8)            -- free space in %
+AS 'MODULE_PATHNAME', 'pgstatbloat'
+LANGUAGE C STRICT;
diff --git a/contrib/pgstattuple/pgstattuple--1.2.sql b/contrib/pgstattuple/pgstattuple--1.3.sql
similarity index 73%
rename from contrib/pgstattuple/pgstattuple--1.2.sql
rename to contrib/pgstattuple/pgstattuple--1.3.sql
index e5fa2f5..93593ee 100644
--- a/contrib/pgstattuple/pgstattuple--1.2.sql
+++ b/contrib/pgstattuple/pgstattuple--1.3.sql
@@ -1,4 +1,4 @@
-/* contrib/pgstattuple/pgstattuple--1.2.sql */
+/* contrib/pgstattuple/pgstattuple--1.3.sql */
 
 -- complain if script is sourced in psql, rather than via CREATE EXTENSION
 \echo Use "CREATE EXTENSION pgstattuple" to load this file. \quit
@@ -77,3 +77,19 @@ CREATE FUNCTION pg_relpages(IN relname regclass)
 RETURNS BIGINT
 AS 'MODULE_PATHNAME', 'pg_relpagesbyid'
 LANGUAGE C STRICT;
+
+/* New stuff in 1.3 begins here */
+
+CREATE FUNCTION pgstatbloat(IN reloid regclass,
+    OUT table_len BIGINT,               -- physical table length in bytes
+    OUT scanned_percent FLOAT8,         -- what percentage of the table's pages was scanned
+    OUT approx_tuple_count BIGINT,      -- estimated number of live tuples
+    OUT approx_tuple_len BIGINT,        -- estimated total length in bytes of live tuples
+    OUT approx_tuple_percent FLOAT8,    -- live tuples in % (based on estimate)
+    OUT dead_tuple_count BIGINT,        -- exact number of dead tuples
+    OUT dead_tuple_len BIGINT,          -- exact total length in bytes of dead tuples
+    OUT dead_tuple_percent FLOAT8,      -- dead tuples in % (based on estimate)
+    OUT free_space BIGINT,              -- exact free space in bytes
+    OUT free_percent FLOAT8)            -- free space in %
+AS 'MODULE_PATHNAME', 'pgstatbloat'
+LANGUAGE C STRICT;
diff --git a/contrib/pgstattuple/pgstattuple.control b/contrib/pgstattuple/pgstattuple.control
index a7cf47f..c03b180 100644
--- a/contrib/pgstattuple/pgstattuple.control
+++ b/contrib/pgstattuple/pgstattuple.control
@@ -1,5 +1,5 @@
 # pgstattuple extension
 comment = 'show tuple-level statistics'
-default_version = '1.2'
+default_version = '1.3'
 module_pathname = '$libdir/pgstattuple'
 relocatable = true
diff --git a/doc/src/sgml/pgstattuple.sgml b/doc/src/sgml/pgstattuple.sgml
index 9cabd71..2cd6cdc 100644
--- a/doc/src/sgml/pgstattuple.sgml
+++ b/doc/src/sgml/pgstattuple.sgml
@@ -358,6 +358,140 @@ pending_tuples | 0
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term>
+     <indexterm>
+      <primary>pgstatbloat</primary>
+     </indexterm>
+     <function>pgstatbloat(regclass) returns record</>
+    </term>
+
+    <listitem>
+     <para>
+      <function>pgstatbloat</function> is a faster alternative to
+      <function>pgstattuple</function> that returns approximate results.
+      The argument is the target relation's OID.
+      For example:
+<programlisting>
+test=> SELECT * FROM pgstatbloat('pg_catalog.pg_proc'::regclass);
+-[ RECORD 1 ]--------+-------
+table_len            | 573440
+scanned_percent      | 2
+approx_tuple_count   | 2740
+approx_tuple_len     | 561210
+approx_tuple_percent | 97.87
+dead_tuple_count     | 0
+dead_tuple_len       | 0
+dead_tuple_percent   | 0
+free_space           | 11996
+free_percent         | 2.09
+</programlisting>
+      The output columns are described in <xref linkend="pgstatbloat-columns">.
+     </para>
+
+     <para>
+     Whereas <function>pgstattuple</function> always performs a
+     full-table scan and returns an exact count of live and dead tuples
+     (and their sizes) and free space, <function>pgstatbloat</function>
+     tries to avoid the full-table scan and returns exact dead tuple and
+     free space statistics along with an approximation of the number and
+     size of live tuples.
+     </para>
+
+     <para>
+     It does this by skipping pages that have only visible tuples
+     according to the visibility map (if a page has the corresponding VM
+     bit set, then it is assumed to contain no dead tuples). For such
+     pages, it derives the free space value from the free space map, and
+     assumes that the rest of the space on the page is taken up by live
+     tuples.
+     </para>
+ 
+     <para>
+     For pages that cannot be skipped, it scans each tuple, recording its
+     presence and size in the appropriate counters, and adding up the
+     free space on the page. At the end, it estimates the total number of
+     live tuples based on the number of pages and tuples scanned (in the
+     same way that VACUUM estimates pg_class.reltuples).
+     </para>
+
+     <table id="pgstatbloat-columns">
+      <title><function>pgstatbloat</function> Output Columns</title>
+      <tgroup cols="3">
+       <thead>
+        <row>
+         <entry>Column</entry>
+         <entry>Type</entry>
+         <entry>Description</entry>
+        </row>
+       </thead>
+ 
+       <tbody>
+        <row>
+         <entry><structfield>table_len</structfield></entry>
+         <entry><type>bigint</type></entry>
+         <entry>Physical relation length in bytes (exact)</entry>
+        </row>
+        <row>
+         <entry><structfield>scanned_percent</structfield></entry>
+         <entry><type>float8</type></entry>
+         <entry>Percentage of table scanned</entry>
+        </row>
+        <row>
+         <entry><structfield>tuple_count</structfield></entry>
+         <entry><type>bigint</type></entry>
+         <entry>Number of live tuples (estimated)</entry>
+        </row>
+        <row>
+         <entry><structfield>tuple_len</structfield></entry>
+         <entry><type>bigint</type></entry>
+         <entry>Total length of live tuples in bytes (estimated)</entry>
+        </row>
+        <row>
+         <entry><structfield>tuple_percent</structfield></entry>
+         <entry><type>float8</type></entry>
+         <entry>Percentage of live tuples</entry>
+        </row>
+        <row>
+         <entry><structfield>dead_tuple_count</structfield></entry>
+         <entry><type>bigint</type></entry>
+         <entry>Number of dead tuples (exact)</entry>
+        </row>
+        <row>
+         <entry><structfield>dead_tuple_len</structfield></entry>
+         <entry><type>bigint</type></entry>
+         <entry>Total length of dead tuples in bytes (exact)</entry>
+        </row>
+        <row>
+         <entry><structfield>dead_tuple_percent</structfield></entry>
+         <entry><type>float8</type></entry>
+         <entry>Percentage of dead tuples</entry>
+        </row>
+        <row>
+         <entry><structfield>free_space</structfield></entry>
+         <entry><type>bigint</type></entry>
+         <entry>Total free space in bytes (exact)</entry>
+        </row>
+        <row>
+         <entry><structfield>free_percent</structfield></entry>
+         <entry><type>float8</type></entry>
+         <entry>Percentage of free space (exact)</entry>
+        </row>
+ 
+       </tbody>
+      </tgroup>
+     </table>
+ 
+     <para>
+     In the above output, the free space figures may not match the
+     <function>pgstattuple</function> output exactly, because the free
+     space map gives us an exact figure, but is not guaranteed to be
+     accurate.
+     </para>
+
+    </listitem>
+   </varlistentry>
+
   </variablelist>
  </sect2>
 
@@ -366,6 +500,7 @@ pending_tuples | 0
 
   <para>
    Tatsuo Ishii and Satoshi Nagayasu
+   Abhijit Menon-Sen
   </para>
  </sect2>
 
-- 
1.9.1

inter.difftext/x-diff; charset=us-asciiDownload

diff --git a/contrib/pgstattuple/pgstatbloat.c b/contrib/pgstattuple/pgstatbloat.c
index 2dd1900..b19459a 100644
--- a/contrib/pgstattuple/pgstatbloat.c
+++ b/contrib/pgstattuple/pgstatbloat.c
@@ -193,6 +193,12 @@ pgstatbloat(PG_FUNCTION_ARGS)
 }
 
 /*
+ * Before we consider skipping a page that's marked as clean in
+ * visibility map, we must've seen at least this many clean pages.
+ */
+#define SKIP_PAGES_THRESHOLD    ((BlockNumber) 32)
+
+/*
  * This function takes an already open relation and scans its pages,
  * skipping those that have the corresponding visibility map bit set.
  * For pages we skip, we find the free space from the free space map
@@ -213,6 +219,8 @@ pgstatbloat_heap(Relation rel, FunctionCallInfo fcinfo)
 	pgstatbloat_output_type stat = {0};
 	BufferAccessStrategy bstrategy;
 	TransactionId OldestXmin;
+	BlockNumber next_not_all_visible_block;
+	bool		skipping_all_visible_blocks;
 
 	OldestXmin = GetOldestXmin(rel, true);
 	bstrategy = GetAccessStrategy(BAS_BULKREAD);
@@ -220,6 +228,24 @@ pgstatbloat_heap(Relation rel, FunctionCallInfo fcinfo)
 	scanned = 0;
 	nblocks = RelationGetNumberOfBlocks(rel);
 
+	/*
+	 * We use the visibility map to skip over SKIP_PAGES_THRESHOLD or
+	 * more contiguous all-visible pages. See the explanation in
+	 * lazy_scan_heap for the rationale.
+	 */
+
+	for (next_not_all_visible_block = 0;
+		 next_not_all_visible_block < nblocks;
+		 next_not_all_visible_block++)
+	{
+		if (!visibilitymap_test(rel, next_not_all_visible_block, &vmbuffer))
+			break;
+	}
+	if (next_not_all_visible_block >= SKIP_PAGES_THRESHOLD)
+		skipping_all_visible_blocks = true;
+	else
+		skipping_all_visible_blocks = false;
+
 	for (blkno = 0; blkno < nblocks; blkno++)
 	{
 		Buffer		buf;
@@ -230,13 +256,34 @@ pgstatbloat_heap(Relation rel, FunctionCallInfo fcinfo)
 
 		CHECK_FOR_INTERRUPTS();
 
-		/*
-		 * If the page has only visible tuples, then we can find out the
-		 * free space from the FSM and move on.
-		 */
+		if (blkno == next_not_all_visible_block)
+		{
+			/* Time to advance next_not_all_visible_block */
+			for (next_not_all_visible_block++;
+				 next_not_all_visible_block < nblocks;
+				 next_not_all_visible_block++)
+			{
+				if (!visibilitymap_test(rel, next_not_all_visible_block,
+										&vmbuffer))
+					break;
+			}
 
-		if (visibilitymap_test(rel, blkno, &vmbuffer))
+			/*
+			 * We know we can't skip the current block.  But set up
+			 * skipping_all_visible_blocks to do the right thing at the
+			 * following blocks.
+			 */
+			if (next_not_all_visible_block - blkno > SKIP_PAGES_THRESHOLD)
+				skipping_all_visible_blocks = true;
+			else
+				skipping_all_visible_blocks = false;
+		}
+		else if (skipping_all_visible_blocks)
 		{
+			/*
+			 * We can skip this all-visible page, so we just find out
+			 * the free space from the FSM.
+			 */
 			freespace = GetRecordedFreeSpace(rel, blkno);
 			stat.tuple_len += BLCKSZ - freespace;
 			stat.free_space += freespace;

#69

amit.kapila16@gmail.com

over 10 years ago

In reply to: Abhijit Menon-Sen (#68)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

On Wed, Apr 22, 2015 at 6:33 PM, Abhijit Menon-Sen <ams@2ndquadrant.com>
wrote:

At 2015-04-18 12:28:36 +0530, amit.kapila16@gmail.com wrote:

I think you have missed to address the below point:

4) prefetch

Updated patch attached, as well as the diff against the earlier version
just to make it easier to see the exact change I made (which is to copy
the skip logic from lazy_scan_heap, as you suggested).

Few minor issues:
1.
warning on windows

\pgstatbloat.c(193): warning C4715: 'pgstatbloat' : not all control paths
return a value

2.
Failed to install pgstattuple--1.3.sql

You have to name sql file as pgstattuple--1.3.sql rather
than pgstattuple--1.2.sql

Other than above 2 points, patch looks good to me.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#70

ams@2ndQuadrant.com

over 10 years ago

In reply to: Amit Kapila (#69)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

At 2015-04-24 07:22:27 +0530, amit.kapila16@gmail.com wrote:

Few minor issues:
1.
warning on windows

\pgstatbloat.c(193): warning C4715: 'pgstatbloat' : not all control paths
return a value

This is because the function ends with an ereport(ERROR, …). What would
you suggest here? Just stick a PG_RETURN_NULL() at the end?

2.
Failed to install pgstattuple--1.3.sql

You have to name sql file as pgstattuple--1.3.sql rather
than pgstattuple--1.2.sql

How are you applying the patch? It already contains this:

    diff --git a/contrib/pgstattuple/pgstattuple--1.2.sql b/contrib/pgstattuple/pgstattuple--1.3.sql
    similarity index 73%
    rename from contrib/pgstattuple/pgstattuple--1.2.sql
    rename to contrib/pgstattuple/pgstattuple--1.3.sql
    index e5fa2f5..93593ee 100644
    --- a/contrib/pgstattuple/pgstattuple--1.2.sql
    +++ b/contrib/pgstattuple/pgstattuple--1.3.sql

So it should do the right thing with git apply.

-- Abhijit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#71

amit.kapila16@gmail.com

over 10 years ago

In reply to: Abhijit Menon-Sen (#70)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

On Fri, Apr 24, 2015 at 8:04 AM, Abhijit Menon-Sen <ams@2ndquadrant.com>
wrote:

At 2015-04-24 07:22:27 +0530, amit.kapila16@gmail.com wrote:

Few minor issues:
1.
warning on windows

\pgstatbloat.c(193): warning C4715: 'pgstatbloat' : not all control

paths

return a value

This is because the function ends with an ereport(ERROR, …). What would
you suggest here?

For similar code in function (pgstattuple.c->pgstat_relation()), it simply
return 0;

Just stick a PG_RETURN_NULL() at the end?

That should also work.

2.
Failed to install pgstattuple--1.3.sql

You have to name sql file as pgstattuple--1.3.sql rather
than pgstattuple--1.2.sql

How are you applying the patch? It already contains this:
diff --git a/contrib/pgstattuple/pgstattuple--1.2.sql

b/contrib/pgstattuple/pgstattuple--1.3.sql

similarity index 73%
rename from contrib/pgstattuple/pgstattuple--1.2.sql
rename to contrib/pgstattuple/pgstattuple--1.3.sql
index e5fa2f5..93593ee 100644
--- a/contrib/pgstattuple/pgstattuple--1.2.sql
+++ b/contrib/pgstattuple/pgstattuple--1.3.sql

So it should do the right thing with git apply.

I was using patch -p1 ...

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#72

ams@2ndQuadrant.com

over 10 years ago

In reply to: Amit Kapila (#71)

1 attachment(s)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

At 2015-04-24 08:35:40 +0530, amit.kapila16@gmail.com wrote:

Just stick a PG_RETURN_NULL() at the end?

That should also work.

OK, updated patch attached with just that one change.

I'm not doing anything about the rename. I don't know if it's possible
to make patch(1) rename a file at all (it wasn't, but maybe something
has changed recently?).

Note to friendly neighbourhood committers: the patch is expected to
rename pgstattuple--1.2.sql to pgstattuple--1.3.sql, which it does
if applied using git apply.

-- Abhijit

Attachments:

0001-20150424-Add-pgstatbloat-to-pgstattuple.patchtext/x-diff; charset=us-asciiDownload

>From f67ddc68ab03fe8237058dcdca1e410ef9e95b5e Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen <ams@2ndQuadrant.com>
Date: Fri, 26 Dec 2014 12:37:13 +0530
Subject: Add pgstatbloat to pgstattuple

---
 contrib/pgstattuple/Makefile                       |   4 +-
 contrib/pgstattuple/pgstatbloat.c                  | 389 +++++++++++++++++++++
 contrib/pgstattuple/pgstattuple--1.2--1.3.sql      |  18 +
 .../{pgstattuple--1.2.sql => pgstattuple--1.3.sql} |  18 +-
 contrib/pgstattuple/pgstattuple.control            |   2 +-
 doc/src/sgml/pgstattuple.sgml                      | 135 +++++++
 6 files changed, 562 insertions(+), 4 deletions(-)
 create mode 100644 contrib/pgstattuple/pgstatbloat.c
 create mode 100644 contrib/pgstattuple/pgstattuple--1.2--1.3.sql
 rename contrib/pgstattuple/{pgstattuple--1.2.sql => pgstattuple--1.3.sql} (73%)

diff --git a/contrib/pgstattuple/Makefile b/contrib/pgstattuple/Makefile
index 862585c..d7d27a5 100644
--- a/contrib/pgstattuple/Makefile
+++ b/contrib/pgstattuple/Makefile
@@ -1,10 +1,10 @@
 # contrib/pgstattuple/Makefile
 
 MODULE_big	= pgstattuple
-OBJS		= pgstattuple.o pgstatindex.o $(WIN32RES)
+OBJS		= pgstattuple.o pgstatindex.o pgstatbloat.o $(WIN32RES)
 
 EXTENSION = pgstattuple
-DATA = pgstattuple--1.2.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql
+DATA = pgstattuple--1.3.sql pgstattuple--1.2--1.3.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql
 PGFILEDESC = "pgstattuple - tuple-level statistics"
 
 REGRESS = pgstattuple
diff --git a/contrib/pgstattuple/pgstatbloat.c b/contrib/pgstattuple/pgstatbloat.c
new file mode 100644
index 0000000..612e22b
--- /dev/null
+++ b/contrib/pgstattuple/pgstatbloat.c
@@ -0,0 +1,389 @@
+/*
+ * contrib/pgstattuple/pgstatbloat.c
+ *
+ * Abhijit Menon-Sen <ams@2ndQuadrant.com>
+ * Portions Copyright (c) 2001,2002	Tatsuo Ishii (from pgstattuple)
+ *
+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose, without fee, and without a
+ * written agreement is hereby granted, provided that the above
+ * copyright notice and this paragraph and the following two
+ * paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE TO ANY PARTY FOR DIRECT,
+ * INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED
+ * OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS ON AN "AS
+ * IS" BASIS, AND THE AUTHOR HAS NO OBLIGATIONS TO PROVIDE MAINTENANCE,
+ * SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+ */
+
+#include "postgres.h"
+
+#include "access/visibilitymap.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/multixact.h"
+#include "access/htup_details.h"
+#include "catalog/namespace.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "storage/freespace.h"
+#include "storage/procarray.h"
+#include "storage/lmgr.h"
+#include "utils/builtins.h"
+#include "utils/tqual.h"
+#include "commands/vacuum.h"
+
+PG_FUNCTION_INFO_V1(pgstatbloat);
+
+/*
+ * tuple_percent, dead_tuple_percent and free_percent are computable,
+ * so not defined here.
+ */
+typedef struct pgstatbloat_output_type
+{
+	uint64		table_len;
+	uint64		tuple_count;
+	uint64		misc_count;
+	uint64		tuple_len;
+	uint64		dead_tuple_count;
+	uint64		dead_tuple_len;
+	uint64		free_space;
+	uint64		total_pages;
+	uint64		scanned_pages;
+} pgstatbloat_output_type;
+
+static Datum build_output_type(pgstatbloat_output_type *stat,
+							   FunctionCallInfo fcinfo);
+static Datum pgstatbloat_heap(Relation rel, FunctionCallInfo fcinfo);
+
+/*
+ * build a pgstatbloat_output_type tuple
+ */
+static Datum
+build_output_type(pgstatbloat_output_type *stat, FunctionCallInfo fcinfo)
+{
+#define NCOLUMNS	10
+#define NCHARS		32
+
+	HeapTuple	tuple;
+	char	   *values[NCOLUMNS];
+	char		values_buf[NCOLUMNS][NCHARS];
+	int			i;
+	double		tuple_percent;
+	double		dead_tuple_percent;
+	double		free_percent;	/* free/reusable space in % */
+	double		scanned_percent;
+	TupleDesc	tupdesc;
+	AttInMetadata *attinmeta;
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	/*
+	 * Generate attribute metadata needed later to produce tuples from raw C
+	 * strings
+	 */
+	attinmeta = TupleDescGetAttInMetadata(tupdesc);
+
+	if (stat->table_len == 0)
+	{
+		tuple_percent = 0.0;
+		dead_tuple_percent = 0.0;
+		free_percent = 0.0;
+	}
+	else
+	{
+		tuple_percent = 100.0 * stat->tuple_len / stat->table_len;
+		dead_tuple_percent = 100.0 * stat->dead_tuple_len / stat->table_len;
+		free_percent = 100.0 * stat->free_space / stat->table_len;
+	}
+
+	scanned_percent = 0.0;
+	if (stat->total_pages != 0)
+	{
+		scanned_percent = 100 * stat->scanned_pages / stat->total_pages;
+	}
+
+	for (i = 0; i < NCOLUMNS; i++)
+		values[i] = values_buf[i];
+	i = 0;
+	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->table_len);
+	snprintf(values[i++], NCHARS, "%.2f", scanned_percent);
+	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->tuple_count);
+	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->tuple_len);
+	snprintf(values[i++], NCHARS, "%.2f", tuple_percent);
+	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->dead_tuple_count);
+	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->dead_tuple_len);
+	snprintf(values[i++], NCHARS, "%.2f", dead_tuple_percent);
+	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->free_space);
+	snprintf(values[i++], NCHARS, "%.2f", free_percent);
+
+	tuple = BuildTupleFromCStrings(attinmeta, values);
+
+	return HeapTupleGetDatum(tuple);
+}
+
+/* Returns live/dead tuple statistics for the given relid. */
+
+Datum
+pgstatbloat(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	Relation	rel;
+	const char *err;
+
+	if (!superuser())
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 (errmsg("must be superuser to use pgstatbloat functions"))));
+
+	rel = relation_open(relid, AccessShareLock);
+
+	/*
+	 * Reject attempts to read non-local temporary relations; we would be
+	 * likely to get wrong data since we have no visibility into the owning
+	 * session's local buffers.
+	 */
+	if (RELATION_IS_OTHER_TEMP(rel))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot access temporary tables of other sessions")));
+
+	switch (rel->rd_rel->relkind)
+	{
+		case RELKIND_RELATION:
+		case RELKIND_MATVIEW:
+			PG_RETURN_DATUM(pgstatbloat_heap(rel, fcinfo));
+		case RELKIND_TOASTVALUE:
+			err = "toast value";
+			break;
+		case RELKIND_SEQUENCE:
+			err = "sequence";
+			break;
+		case RELKIND_INDEX:
+			err = "index";
+			break;
+		case RELKIND_VIEW:
+			err = "view";
+			break;
+		case RELKIND_COMPOSITE_TYPE:
+			err = "composite type";
+			break;
+		case RELKIND_FOREIGN_TABLE:
+			err = "foreign table";
+			break;
+		default:
+			err = "unknown";
+			break;
+	}
+
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("\"%s\" (%s) is not supported",
+					RelationGetRelationName(rel), err)));
+
+	PG_RETURN_NULL();
+}
+
+/*
+ * Before we consider skipping a page that's marked as clean in
+ * visibility map, we must've seen at least this many clean pages.
+ */
+#define SKIP_PAGES_THRESHOLD    ((BlockNumber) 32)
+
+/*
+ * This function takes an already open relation and scans its pages,
+ * skipping those that have the corresponding visibility map bit set.
+ * For pages we skip, we find the free space from the free space map
+ * and approximate tuple_len on that basis. For the others, we count
+ * the exact number of dead tuples etc.
+ *
+ * This scan is loosely based on vacuumlazy.c:lazy_scan_heap(), but
+ * we do not try to avoid skipping single pages.
+ */
+
+static Datum
+pgstatbloat_heap(Relation rel, FunctionCallInfo fcinfo)
+{
+	BlockNumber scanned,
+				nblocks,
+				blkno;
+	Buffer		vmbuffer = InvalidBuffer;
+	pgstatbloat_output_type stat = {0};
+	BufferAccessStrategy bstrategy;
+	TransactionId OldestXmin;
+	BlockNumber next_not_all_visible_block;
+	bool		skipping_all_visible_blocks;
+
+	OldestXmin = GetOldestXmin(rel, true);
+	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	scanned = 0;
+	nblocks = RelationGetNumberOfBlocks(rel);
+
+	/*
+	 * We use the visibility map to skip over SKIP_PAGES_THRESHOLD or
+	 * more contiguous all-visible pages. See the explanation in
+	 * lazy_scan_heap for the rationale.
+	 */
+
+	for (next_not_all_visible_block = 0;
+		 next_not_all_visible_block < nblocks;
+		 next_not_all_visible_block++)
+	{
+		if (!visibilitymap_test(rel, next_not_all_visible_block, &vmbuffer))
+			break;
+	}
+	if (next_not_all_visible_block >= SKIP_PAGES_THRESHOLD)
+		skipping_all_visible_blocks = true;
+	else
+		skipping_all_visible_blocks = false;
+
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		Buffer		buf;
+		Page		page;
+		OffsetNumber offnum,
+					maxoff;
+		Size		freespace;
+
+		CHECK_FOR_INTERRUPTS();
+
+		if (blkno == next_not_all_visible_block)
+		{
+			/* Time to advance next_not_all_visible_block */
+			for (next_not_all_visible_block++;
+				 next_not_all_visible_block < nblocks;
+				 next_not_all_visible_block++)
+			{
+				if (!visibilitymap_test(rel, next_not_all_visible_block,
+										&vmbuffer))
+					break;
+			}
+
+			/*
+			 * We know we can't skip the current block.  But set up
+			 * skipping_all_visible_blocks to do the right thing at the
+			 * following blocks.
+			 */
+			if (next_not_all_visible_block - blkno > SKIP_PAGES_THRESHOLD)
+				skipping_all_visible_blocks = true;
+			else
+				skipping_all_visible_blocks = false;
+		}
+		else if (skipping_all_visible_blocks)
+		{
+			/*
+			 * We can skip this all-visible page, so we just find out
+			 * the free space from the FSM.
+			 */
+			freespace = GetRecordedFreeSpace(rel, blkno);
+			stat.tuple_len += BLCKSZ - freespace;
+			stat.free_space += freespace;
+			continue;
+		}
+
+		buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
+								 RBM_NORMAL, bstrategy);
+
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+		page = BufferGetPage(buf);
+
+		stat.free_space += PageGetHeapFreeSpace(page);
+
+		if (PageIsNew(page) || PageIsEmpty(page))
+		{
+			UnlockReleaseBuffer(buf);
+			continue;
+		}
+
+		scanned++;
+
+		/*
+		 * Look at each tuple on the page and decide whether it's live
+		 * or dead, then count it and its size. Unlike lazy_scan_heap,
+		 * we can afford to ignore problems and special cases.
+		 */
+
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemid;
+			HeapTupleData tuple;
+
+			itemid = PageGetItemId(page, offnum);
+
+			if (!ItemIdIsUsed(itemid) || ItemIdIsRedirected(itemid) ||
+				ItemIdIsDead(itemid))
+			{
+				continue;
+			}
+
+			Assert(ItemIdIsNormal(itemid));
+
+			ItemPointerSet(&(tuple.t_self), blkno, offnum);
+
+			tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+			tuple.t_len = ItemIdGetLength(itemid);
+			tuple.t_tableOid = RelationGetRelid(rel);
+
+			/*
+			 * We count live and dead tuples, but we also need to add up
+			 * others in order to feed vac_estimate_reltuples.
+			 */
+
+			switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
+			{
+				case HEAPTUPLE_RECENTLY_DEAD:
+					stat.misc_count++;
+					/* Fall through */
+				case HEAPTUPLE_DEAD:
+					stat.dead_tuple_len += tuple.t_len;
+					stat.dead_tuple_count++;
+					break;
+				case HEAPTUPLE_LIVE:
+					stat.tuple_len += tuple.t_len;
+					stat.tuple_count++;
+					break;
+				case HEAPTUPLE_INSERT_IN_PROGRESS:
+				case HEAPTUPLE_DELETE_IN_PROGRESS:
+					stat.misc_count++;
+					break;
+				default:
+					elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
+					break;
+			}
+		}
+
+		UnlockReleaseBuffer(buf);
+	}
+
+	stat.table_len = (uint64) nblocks * BLCKSZ;
+	stat.tuple_count = vac_estimate_reltuples(rel, false, nblocks, scanned,
+											  stat.tuple_count+stat.misc_count);
+	stat.total_pages = nblocks;
+	stat.scanned_pages = scanned;
+
+	if (BufferIsValid(vmbuffer))
+	{
+		ReleaseBuffer(vmbuffer);
+		vmbuffer = InvalidBuffer;
+	}
+
+	relation_close(rel, AccessShareLock);
+
+	return build_output_type(&stat, fcinfo);
+}
diff --git a/contrib/pgstattuple/pgstattuple--1.2--1.3.sql b/contrib/pgstattuple/pgstattuple--1.2--1.3.sql
new file mode 100644
index 0000000..dba7993
--- /dev/null
+++ b/contrib/pgstattuple/pgstattuple--1.2--1.3.sql
@@ -0,0 +1,18 @@
+/* contrib/pgstattuple/pgstattuple--1.2--1.3.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pgstattuple UPDATE TO '1.3'" to load this file. \quit
+
+CREATE FUNCTION pgstatbloat(IN reloid regclass,
+    OUT table_len BIGINT,               -- physical table length in bytes
+    OUT scanned_percent FLOAT8,         -- what percentage of the table's pages was scanned
+    OUT approx_tuple_count BIGINT,      -- estimated number of live tuples
+    OUT approx_tuple_len BIGINT,        -- estimated total length in bytes of live tuples
+    OUT approx_tuple_percent FLOAT8,    -- live tuples in % (based on estimate)
+    OUT dead_tuple_count BIGINT,        -- exact number of dead tuples
+    OUT dead_tuple_len BIGINT,          -- exact total length in bytes of dead tuples
+    OUT dead_tuple_percent FLOAT8,      -- dead tuples in % (based on estimate)
+    OUT free_space BIGINT,              -- exact free space in bytes
+    OUT free_percent FLOAT8)            -- free space in %
+AS 'MODULE_PATHNAME', 'pgstatbloat'
+LANGUAGE C STRICT;
diff --git a/contrib/pgstattuple/pgstattuple--1.2.sql b/contrib/pgstattuple/pgstattuple--1.3.sql
similarity index 73%
rename from contrib/pgstattuple/pgstattuple--1.2.sql
rename to contrib/pgstattuple/pgstattuple--1.3.sql
index e5fa2f5..93593ee 100644
--- a/contrib/pgstattuple/pgstattuple--1.2.sql
+++ b/contrib/pgstattuple/pgstattuple--1.3.sql
@@ -1,4 +1,4 @@
-/* contrib/pgstattuple/pgstattuple--1.2.sql */
+/* contrib/pgstattuple/pgstattuple--1.3.sql */
 
 -- complain if script is sourced in psql, rather than via CREATE EXTENSION
 \echo Use "CREATE EXTENSION pgstattuple" to load this file. \quit
@@ -77,3 +77,19 @@ CREATE FUNCTION pg_relpages(IN relname regclass)
 RETURNS BIGINT
 AS 'MODULE_PATHNAME', 'pg_relpagesbyid'
 LANGUAGE C STRICT;
+
+/* New stuff in 1.3 begins here */
+
+CREATE FUNCTION pgstatbloat(IN reloid regclass,
+    OUT table_len BIGINT,               -- physical table length in bytes
+    OUT scanned_percent FLOAT8,         -- what percentage of the table's pages was scanned
+    OUT approx_tuple_count BIGINT,      -- estimated number of live tuples
+    OUT approx_tuple_len BIGINT,        -- estimated total length in bytes of live tuples
+    OUT approx_tuple_percent FLOAT8,    -- live tuples in % (based on estimate)
+    OUT dead_tuple_count BIGINT,        -- exact number of dead tuples
+    OUT dead_tuple_len BIGINT,          -- exact total length in bytes of dead tuples
+    OUT dead_tuple_percent FLOAT8,      -- dead tuples in % (based on estimate)
+    OUT free_space BIGINT,              -- exact free space in bytes
+    OUT free_percent FLOAT8)            -- free space in %
+AS 'MODULE_PATHNAME', 'pgstatbloat'
+LANGUAGE C STRICT;
diff --git a/contrib/pgstattuple/pgstattuple.control b/contrib/pgstattuple/pgstattuple.control
index a7cf47f..c03b180 100644
--- a/contrib/pgstattuple/pgstattuple.control
+++ b/contrib/pgstattuple/pgstattuple.control
@@ -1,5 +1,5 @@
 # pgstattuple extension
 comment = 'show tuple-level statistics'
-default_version = '1.2'
+default_version = '1.3'
 module_pathname = '$libdir/pgstattuple'
 relocatable = true
diff --git a/doc/src/sgml/pgstattuple.sgml b/doc/src/sgml/pgstattuple.sgml
index 9cabd71..2cd6cdc 100644
--- a/doc/src/sgml/pgstattuple.sgml
+++ b/doc/src/sgml/pgstattuple.sgml
@@ -358,6 +358,140 @@ pending_tuples | 0
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term>
+     <indexterm>
+      <primary>pgstatbloat</primary>
+     </indexterm>
+     <function>pgstatbloat(regclass) returns record</>
+    </term>
+
+    <listitem>
+     <para>
+      <function>pgstatbloat</function> is a faster alternative to
+      <function>pgstattuple</function> that returns approximate results.
+      The argument is the target relation's OID.
+      For example:
+<programlisting>
+test=> SELECT * FROM pgstatbloat('pg_catalog.pg_proc'::regclass);
+-[ RECORD 1 ]--------+-------
+table_len            | 573440
+scanned_percent      | 2
+approx_tuple_count   | 2740
+approx_tuple_len     | 561210
+approx_tuple_percent | 97.87
+dead_tuple_count     | 0
+dead_tuple_len       | 0
+dead_tuple_percent   | 0
+free_space           | 11996
+free_percent         | 2.09
+</programlisting>
+      The output columns are described in <xref linkend="pgstatbloat-columns">.
+     </para>
+
+     <para>
+     Whereas <function>pgstattuple</function> always performs a
+     full-table scan and returns an exact count of live and dead tuples
+     (and their sizes) and free space, <function>pgstatbloat</function>
+     tries to avoid the full-table scan and returns exact dead tuple and
+     free space statistics along with an approximation of the number and
+     size of live tuples.
+     </para>
+
+     <para>
+     It does this by skipping pages that have only visible tuples
+     according to the visibility map (if a page has the corresponding VM
+     bit set, then it is assumed to contain no dead tuples). For such
+     pages, it derives the free space value from the free space map, and
+     assumes that the rest of the space on the page is taken up by live
+     tuples.
+     </para>
+ 
+     <para>
+     For pages that cannot be skipped, it scans each tuple, recording its
+     presence and size in the appropriate counters, and adding up the
+     free space on the page. At the end, it estimates the total number of
+     live tuples based on the number of pages and tuples scanned (in the
+     same way that VACUUM estimates pg_class.reltuples).
+     </para>
+
+     <table id="pgstatbloat-columns">
+      <title><function>pgstatbloat</function> Output Columns</title>
+      <tgroup cols="3">
+       <thead>
+        <row>
+         <entry>Column</entry>
+         <entry>Type</entry>
+         <entry>Description</entry>
+        </row>
+       </thead>
+ 
+       <tbody>
+        <row>
+         <entry><structfield>table_len</structfield></entry>
+         <entry><type>bigint</type></entry>
+         <entry>Physical relation length in bytes (exact)</entry>
+        </row>
+        <row>
+         <entry><structfield>scanned_percent</structfield></entry>
+         <entry><type>float8</type></entry>
+         <entry>Percentage of table scanned</entry>
+        </row>
+        <row>
+         <entry><structfield>tuple_count</structfield></entry>
+         <entry><type>bigint</type></entry>
+         <entry>Number of live tuples (estimated)</entry>
+        </row>
+        <row>
+         <entry><structfield>tuple_len</structfield></entry>
+         <entry><type>bigint</type></entry>
+         <entry>Total length of live tuples in bytes (estimated)</entry>
+        </row>
+        <row>
+         <entry><structfield>tuple_percent</structfield></entry>
+         <entry><type>float8</type></entry>
+         <entry>Percentage of live tuples</entry>
+        </row>
+        <row>
+         <entry><structfield>dead_tuple_count</structfield></entry>
+         <entry><type>bigint</type></entry>
+         <entry>Number of dead tuples (exact)</entry>
+        </row>
+        <row>
+         <entry><structfield>dead_tuple_len</structfield></entry>
+         <entry><type>bigint</type></entry>
+         <entry>Total length of dead tuples in bytes (exact)</entry>
+        </row>
+        <row>
+         <entry><structfield>dead_tuple_percent</structfield></entry>
+         <entry><type>float8</type></entry>
+         <entry>Percentage of dead tuples</entry>
+        </row>
+        <row>
+         <entry><structfield>free_space</structfield></entry>
+         <entry><type>bigint</type></entry>
+         <entry>Total free space in bytes (exact)</entry>
+        </row>
+        <row>
+         <entry><structfield>free_percent</structfield></entry>
+         <entry><type>float8</type></entry>
+         <entry>Percentage of free space (exact)</entry>
+        </row>
+ 
+       </tbody>
+      </tgroup>
+     </table>
+ 
+     <para>
+     In the above output, the free space figures may not match the
+     <function>pgstattuple</function> output exactly, because the free
+     space map gives us an exact figure, but is not guaranteed to be
+     accurate.
+     </para>
+
+    </listitem>
+   </varlistentry>
+
   </variablelist>
  </sect2>
 
@@ -366,6 +500,7 @@ pending_tuples | 0
 
   <para>
    Tatsuo Ishii and Satoshi Nagayasu
+   Abhijit Menon-Sen
   </para>
  </sect2>
 
-- 
1.9.1

#73

amit.kapila16@gmail.com

over 10 years ago

In reply to: Abhijit Menon-Sen (#72)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

On Fri, Apr 24, 2015 at 8:46 AM, Abhijit Menon-Sen <ams@2ndquadrant.com>
wrote:

At 2015-04-24 08:35:40 +0530, amit.kapila16@gmail.com wrote:

Just stick a PG_RETURN_NULL() at the end?

That should also work.

OK, updated patch attached with just that one change.

Patch looks good to me, will mark it as Ready for Committer if Tomas
or anyone else doesn't have more to add in terms of review.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#74

tomas.vondra@2ndquadrant.com

over 10 years ago

In reply to: Amit Kapila (#73)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

On 04/24/15 14:58, Amit Kapila wrote:

On Fri, Apr 24, 2015 at 8:46 AM, Abhijit Menon-Sen <ams@2ndquadrant.com
<mailto:ams@2ndquadrant.com>> wrote:

At 2015-04-24 08:35:40 +0530, amit.kapila16@gmail.com

<mailto:amit.kapila16@gmail.com> wrote:

Just stick a PG_RETURN_NULL() at the end?

That should also work.

OK, updated patch attached with just that one change.

Patch looks good to me, will mark it as Ready for Committer if Tomas
or anyone else doesn't have more to add in terms of review.

FWIW, I'm OK with the patch as is. Your reviews were spot on.

regards

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#75

amit.kapila16@gmail.com

over 10 years ago

In reply to: Tomas Vondra (#74)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

On Fri, Apr 24, 2015 at 8:05 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

On 04/24/15 14:58, Amit Kapila wrote:

On Fri, Apr 24, 2015 at 8:46 AM, Abhijit Menon-Sen <ams@2ndquadrant.com
<mailto:ams@2ndquadrant.com>> wrote:

At 2015-04-24 08:35:40 +0530, amit.kapila16@gmail.com

<mailto:amit.kapila16@gmail.com> wrote:

Just stick a PG_RETURN_NULL() at the end?

That should also work.

OK, updated patch attached with just that one change.

Patch looks good to me, will mark it as Ready for Committer if Tomas
or anyone else doesn't have more to add in terms of review.

FWIW, I'm OK with the patch as is. Your reviews were spot on.

Thanks, I have marked this patch as Ready For Committer.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#76

andres@anarazel.de

over 10 years ago

In reply to: Abhijit Menon-Sen (#72)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

Hi,

On 2015-04-24 08:46:48 +0530, Abhijit Menon-Sen wrote:

diff --git a/contrib/pgstattuple/pgstatbloat.c b/contrib/pgstattuple/pgstatbloat.c
new file mode 100644
index 0000000..612e22b
--- /dev/null
+++ b/contrib/pgstattuple/pgstatbloat.c
@@ -0,0 +1,389 @@
+/*
+ * contrib/pgstattuple/pgstatbloat.c
+ *
+ * Abhijit Menon-Sen <ams@2ndQuadrant.com>
+ * Portions Copyright (c) 2001,2002	Tatsuo Ishii (from pgstattuple)

I think for new extension we don't really add authors and such anymore.

+ * Permission to use, copy, modify, and distribute this software and
+ * its documentation for any purpose, without fee, and without a
+ * written agreement is hereby granted, provided that the above
+ * copyright notice and this paragraph and the following two
+ * paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE TO ANY PARTY FOR DIRECT,
+ * INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED
+ * OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS ON AN "AS
+ * IS" BASIS, AND THE AUTHOR HAS NO OBLIGATIONS TO PROVIDE MAINTENANCE,
+ * SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+ */

Shouldn't be here in a contrib module.

+PG_FUNCTION_INFO_V1(pgstatbloat);

+CREATE FUNCTION pgstatbloat(IN reloid regclass,
+    OUT table_len BIGINT,               -- physical table length in bytes
+    OUT scanned_percent FLOAT8,         -- what percentage of the table's pages was scanned
+    OUT approx_tuple_count BIGINT,      -- estimated number of live tuples
+    OUT approx_tuple_len BIGINT,        -- estimated total length in bytes of live tuples
+    OUT approx_tuple_percent FLOAT8,    -- live tuples in % (based on estimate)
+    OUT dead_tuple_count BIGINT,        -- exact number of dead tuples
+    OUT dead_tuple_len BIGINT,          -- exact total length in bytes of dead tuples
+    OUT dead_tuple_percent FLOAT8,      -- dead tuples in % (based on estimate)
+    OUT free_space BIGINT,              -- exact free space in bytes
+    OUT free_percent FLOAT8)            -- free space in %

Hm. I do wonder if this should really be called 'statbloat'. Perhaps
it'd more appropriately be called pg_estimate_bloat or somesuch?

Also, is free_space really exact? The fsm isn't byte exact.

+static Datum
+build_output_type(pgstatbloat_output_type *stat, FunctionCallInfo fcinfo)
+{
+#define NCOLUMNS	10
+#define NCHARS		32
+
+	HeapTuple	tuple;
+	char	   *values[NCOLUMNS];
+	char		values_buf[NCOLUMNS][NCHARS];
+	int			i;
+	double		tuple_percent;
+	double		dead_tuple_percent;
+	double		free_percent;	/* free/reusable space in % */
+	double		scanned_percent;
+	TupleDesc	tupdesc;
+	AttInMetadata *attinmeta;
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	/*
+	 * Generate attribute metadata needed later to produce tuples from raw C
+	 * strings
+	 */
+	attinmeta = TupleDescGetAttInMetadata(tupdesc);
+
+	if (stat->table_len == 0)
+	{
+		tuple_percent = 0.0;
+		dead_tuple_percent = 0.0;
+		free_percent = 0.0;
+	}
+	else
+	{
+		tuple_percent = 100.0 * stat->tuple_len / stat->table_len;
+		dead_tuple_percent = 100.0 * stat->dead_tuple_len / stat->table_len;
+		free_percent = 100.0 * stat->free_space / stat->table_len;
+	}
+
+	scanned_percent = 0.0;
+	if (stat->total_pages != 0)
+	{
+		scanned_percent = 100 * stat->scanned_pages / stat->total_pages;
+	}
+
+	for (i = 0; i < NCOLUMNS; i++)
+		values[i] = values_buf[i];
+	i = 0;
+	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->table_len);
+	snprintf(values[i++], NCHARS, "%.2f", scanned_percent);
+	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->tuple_count);
+	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->tuple_len);
+	snprintf(values[i++], NCHARS, "%.2f", tuple_percent);
+	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->dead_tuple_count);
+	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->dead_tuple_len);
+	snprintf(values[i++], NCHARS, "%.2f", dead_tuple_percent);
+	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->free_space);
+	snprintf(values[i++], NCHARS, "%.2f", free_percent);
+
+	tuple = BuildTupleFromCStrings(attinmeta, values);
+
+	return HeapTupleGetDatum(tuple);
+}

Why go through C strings? I personally think we really shouldn't add
more callers to BuildTupleFromCStrings, it's such an absurd interface.

+	switch (rel->rd_rel->relkind)
+	{
+		case RELKIND_RELATION:
+		case RELKIND_MATVIEW:
+			PG_RETURN_DATUM(pgstatbloat_heap(rel, fcinfo));
+		case RELKIND_TOASTVALUE:
+			err = "toast value";
+			break;
+		case RELKIND_SEQUENCE:
+			err = "sequence";
+			break;
+		case RELKIND_INDEX:
+			err = "index";
+			break;
+		case RELKIND_VIEW:
+			err = "view";
+			break;
+		case RELKIND_COMPOSITE_TYPE:
+			err = "composite type";
+			break;
+		case RELKIND_FOREIGN_TABLE:
+			err = "foreign table";
+			break;
+		default:
+			err = "unknown";
+			break;
+	}

This breaks translateability. I'm not sure that's worthy of concern. I
think it'd actually be fine to just say that the relation has to be a
table or matview.

+static Datum
+pgstatbloat_heap(Relation rel, FunctionCallInfo fcinfo)
+{
+	/*
+	 * We use the visibility map to skip over SKIP_PAGES_THRESHOLD or
+	 * more contiguous all-visible pages. See the explanation in
+	 * lazy_scan_heap for the rationale.
+	 */

I don't believe that rationale is really true these days. I'd much
rather just rip this out here than copy the rather complex logic.

+	for (blkno = 0; blkno < nblocks; blkno++)
+	{

+		stat.free_space += PageGetHeapFreeSpace(page);
+
+		if (PageIsNew(page) || PageIsEmpty(page))
+		{
+			UnlockReleaseBuffer(buf);
+			continue;
+		}

I haven't checked, but I'm not sure that it's safe/meaningful to call
PageGetHeapFreeSpace() on a new page.

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#77

ams@2ndQuadrant.com

over 10 years ago

In reply to: Andres Freund (#76)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

At 2015-05-09 02:20:51 +0200, andres@anarazel.de wrote:

+ * Abhijit Menon-Sen <ams@2ndQuadrant.com>
+ * Portions Copyright (c) 2001,2002	Tatsuo Ishii (from pgstattuple)
I think for new extension we don't really add authors and such anymore.

OK, I'll get rid of the boilerplate.

Hm. I do wonder if this should really be called 'statbloat'. Perhaps
it'd more appropriately be called pg_estimate_bloat or somesuch?

Since we've moved it into pgstattuple, perhaps pgstattuple_approximate()
or pgstattuple_estimated() would be a better name. I don't really care,
I'll change it to whatever people like.

Also, is free_space really exact? The fsm isn't byte exact.

You're right, I'll fix that.

Why go through C strings? I personally think we really shouldn't add
more callers to BuildTupleFromCStrings, it's such an absurd interface.

I just copied this more or less blindly from pgstattuple. I'll fix it
and submit a separate patch to fix the equivalent code in pgstattuple.

I think it'd actually be fine to just say that the relation has to be
a table or matview.

Good point. Agreed.

I don't believe that rationale is really true these days. I'd much
rather just rip this out here than copy the rather complex logic.

Yes, thanks, I very much agree that this isn't really worth copying.
I'll drop it in my next submission.

I haven't checked, but I'm not sure that it's safe/meaningful to call
PageGetHeapFreeSpace() on a new page.

OK, I'll check and fix if necessary.

Thanks for the feedback. I'll submit a revised patch shortly.

-- Abhijit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#78

ams@2ndQuadrant.com

over 10 years ago

In reply to: Abhijit Menon-Sen (#77)

2 attachment(s)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

Hi Andres.

I've attached an updated patch for pgstatbloat, as well as a patch to
replace two uses of BuildTupleFromCStrings() elsewhere in pgstattuple.

I've made the changes you mentioned in your earlier mail, except that I
have not changed the name pending further suggestions about what would
be the best name.

Also:

At 2015-05-09 15:36:49 +0530, ams@2ndQuadrant.com wrote:

At 2015-05-09 02:20:51 +0200, andres@anarazel.de wrote:

I haven't checked, but I'm not sure that it's safe/meaningful to
call PageGetHeapFreeSpace() on a new page.

OK, I'll check and fix if necessary.

You're right, PageGetHeapFreeSpace() isn't safe on a new page. I've
added a guard to that call in the attached patch, but I'm not sure
that's the right thing to do.

Should I copy the orphaned new-page handling from lazy_scan_heap? What
about for empty pages? Neither feels like a very good thing to do, but
then neither does skipping the new page silently. Should I count the
space it would have free if it were initialised, but leave the page
alone for VACUUM to deal with? Or just leave it as it is?

-- Abhijit

Attachments:

0001-Make-pgstattuple-use-heap_form_tuple-instead-of-Buil.patchtext/x-diff; charset=us-asciiDownload

>From 421267bba8394255feed8f9b9424d25d0be9f979 Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen <ams@2ndQuadrant.com>
Date: Mon, 11 May 2015 15:54:48 +0530
Subject: Make pgstattuple use heap_form_tuple instead of
 BuildTupleFromCStrings

---
 contrib/pgstattuple/pgstatindex.c | 43 ++++++++++++++++++-------------------
 contrib/pgstattuple/pgstattuple.c | 45 ++++++++++++++-------------------------
 2 files changed, 37 insertions(+), 51 deletions(-)

diff --git a/contrib/pgstattuple/pgstatindex.c b/contrib/pgstattuple/pgstatindex.c
index a2ea5d7..608f729 100644
--- a/contrib/pgstattuple/pgstatindex.c
+++ b/contrib/pgstattuple/pgstatindex.c
@@ -257,7 +257,8 @@ pgstatindex_impl(Relation rel, FunctionCallInfo fcinfo)
 	{
 		TupleDesc	tupleDesc;
 		int			j;
-		char	   *values[10];
+		Datum		values[10];
+		bool		nulls[10];
 		HeapTuple	tuple;
 
 		/* Build a tuple descriptor for our result type */
@@ -265,33 +266,31 @@ pgstatindex_impl(Relation rel, FunctionCallInfo fcinfo)
 			elog(ERROR, "return type must be a row type");
 
 		j = 0;
-		values[j++] = psprintf("%d", indexStat.version);
-		values[j++] = psprintf("%d", indexStat.level);
-		values[j++] = psprintf(INT64_FORMAT,
-							   (indexStat.root_pages +
-								indexStat.leaf_pages +
-								indexStat.internal_pages +
-								indexStat.deleted_pages +
-								indexStat.empty_pages) * BLCKSZ);
-		values[j++] = psprintf("%u", indexStat.root_blkno);
-		values[j++] = psprintf(INT64_FORMAT, indexStat.internal_pages);
-		values[j++] = psprintf(INT64_FORMAT, indexStat.leaf_pages);
-		values[j++] = psprintf(INT64_FORMAT, indexStat.empty_pages);
-		values[j++] = psprintf(INT64_FORMAT, indexStat.deleted_pages);
+		values[j++] = Int32GetDatum(indexStat.version);
+		values[j++] = Int32GetDatum(indexStat.level);
+		values[j++] = Int64GetDatum((indexStat.root_pages +
+									 indexStat.leaf_pages +
+									 indexStat.internal_pages +
+									 indexStat.deleted_pages +
+									 indexStat.empty_pages) * BLCKSZ);
+		values[j++] = Int32GetDatum(indexStat.root_blkno);
+		values[j++] = Int32GetDatum(indexStat.internal_pages);
+		values[j++] = Int32GetDatum(indexStat.leaf_pages);
+		values[j++] = Int32GetDatum(indexStat.empty_pages);
+		values[j++] = Int32GetDatum(indexStat.deleted_pages);
 		if (indexStat.max_avail > 0)
-			values[j++] = psprintf("%.2f",
-								   100.0 - (double) indexStat.free_space / (double) indexStat.max_avail * 100.0);
+			values[j++] = Float8GetDatum(100.0 - (double) indexStat.free_space / (double) indexStat.max_avail * 100.0);
 		else
-			values[j++] = pstrdup("NaN");
+			values[j++] = CStringGetDatum("NaN");
 		if (indexStat.leaf_pages > 0)
-			values[j++] = psprintf("%.2f",
-								   (double) indexStat.fragments / (double) indexStat.leaf_pages * 100.0);
+			values[j++] = Float8GetDatum((double) indexStat.fragments / (double) indexStat.leaf_pages * 100.0);
 		else
-			values[j++] = pstrdup("NaN");
+			values[j++] = CStringGetDatum("NaN");
 
-		tuple = BuildTupleFromCStrings(TupleDescGetAttInMetadata(tupleDesc),
-									   values);
+		for (j = 0; j < 10; j++)
+			nulls[j] = false;
 
+		tuple = heap_form_tuple(tupleDesc, values, nulls);
 		result = HeapTupleGetDatum(tuple);
 	}
 
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index c3a8b1d..552f188 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -85,28 +85,20 @@ static Datum
 build_pgstattuple_type(pgstattuple_type *stat, FunctionCallInfo fcinfo)
 {
 #define NCOLUMNS	9
-#define NCHARS		32
 
 	HeapTuple	tuple;
-	char	   *values[NCOLUMNS];
-	char		values_buf[NCOLUMNS][NCHARS];
+	Datum		values[NCOLUMNS];
+	bool		nulls[NCOLUMNS];
 	int			i;
 	double		tuple_percent;
 	double		dead_tuple_percent;
 	double		free_percent;	/* free/reusable space in % */
 	TupleDesc	tupdesc;
-	AttInMetadata *attinmeta;
 
 	/* Build a tuple descriptor for our result type */
 	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
 		elog(ERROR, "return type must be a row type");
 
-	/*
-	 * Generate attribute metadata needed later to produce tuples from raw C
-	 * strings
-	 */
-	attinmeta = TupleDescGetAttInMetadata(tupdesc);
-
 	if (stat->table_len == 0)
 	{
 		tuple_percent = 0.0;
@@ -120,26 +112,21 @@ build_pgstattuple_type(pgstattuple_type *stat, FunctionCallInfo fcinfo)
 		free_percent = 100.0 * stat->free_space / stat->table_len;
 	}
 
-	/*
-	 * Prepare a values array for constructing the tuple. This should be an
-	 * array of C strings which will be processed later by the appropriate
-	 * "in" functions.
-	 */
-	for (i = 0; i < NCOLUMNS; i++)
-		values[i] = values_buf[i];
 	i = 0;
-	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->table_len);
-	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->tuple_count);
-	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->tuple_len);
-	snprintf(values[i++], NCHARS, "%.2f", tuple_percent);
-	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->dead_tuple_count);
-	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->dead_tuple_len);
-	snprintf(values[i++], NCHARS, "%.2f", dead_tuple_percent);
-	snprintf(values[i++], NCHARS, INT64_FORMAT, stat->free_space);
-	snprintf(values[i++], NCHARS, "%.2f", free_percent);
-
-	/* build a tuple */
-	tuple = BuildTupleFromCStrings(attinmeta, values);
+	values[i++] = Int64GetDatum(stat->table_len);
+	values[i++] = Int64GetDatum(stat->tuple_count);
+	values[i++] = Int64GetDatum(stat->tuple_len);
+	values[i++] = Float8GetDatum(tuple_percent);
+	values[i++] = Int64GetDatum(stat->dead_tuple_count);
+	values[i++] = Int64GetDatum(stat->dead_tuple_len);
+	values[i++] = Float8GetDatum(dead_tuple_percent);
+	values[i++] = Int64GetDatum(stat->free_space);
+	values[i++] = Float8GetDatum(free_percent);
+
+	for (i = 0; i < NCOLUMNS; i++)
+		nulls[i] = false;
+
+	tuple = heap_form_tuple(tupdesc, values, nulls);
 
 	/* make the tuple into a datum */
 	return HeapTupleGetDatum(tuple);
-- 
1.9.1

0002-Add-pgstatbloat-to-pgstattuple.patchtext/x-diff; charset=us-asciiDownload

>From 0be1d1d49703ff9704255495cedb91c097fbeaa0 Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen <ams@2ndQuadrant.com>
Date: Fri, 26 Dec 2014 12:37:13 +0530
Subject: Add pgstatbloat to pgstattuple

---
 contrib/pgstattuple/Makefile                       |   4 +-
 contrib/pgstattuple/pgstatbloat.c                  | 282 +++++++++++++++++++++
 contrib/pgstattuple/pgstattuple--1.2--1.3.sql      |  18 ++
 .../{pgstattuple--1.2.sql => pgstattuple--1.3.sql} |  18 +-
 contrib/pgstattuple/pgstattuple.control            |   2 +-
 doc/src/sgml/pgstattuple.sgml                      | 135 ++++++++++
 6 files changed, 455 insertions(+), 4 deletions(-)
 create mode 100644 contrib/pgstattuple/pgstatbloat.c
 create mode 100644 contrib/pgstattuple/pgstattuple--1.2--1.3.sql
 rename contrib/pgstattuple/{pgstattuple--1.2.sql => pgstattuple--1.3.sql} (72%)

diff --git a/contrib/pgstattuple/Makefile b/contrib/pgstattuple/Makefile
index 862585c..d7d27a5 100644
--- a/contrib/pgstattuple/Makefile
+++ b/contrib/pgstattuple/Makefile
@@ -1,10 +1,10 @@
 # contrib/pgstattuple/Makefile
 
 MODULE_big	= pgstattuple
-OBJS		= pgstattuple.o pgstatindex.o $(WIN32RES)
+OBJS		= pgstattuple.o pgstatindex.o pgstatbloat.o $(WIN32RES)
 
 EXTENSION = pgstattuple
-DATA = pgstattuple--1.2.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql
+DATA = pgstattuple--1.3.sql pgstattuple--1.2--1.3.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql
 PGFILEDESC = "pgstattuple - tuple-level statistics"
 
 REGRESS = pgstattuple
diff --git a/contrib/pgstattuple/pgstatbloat.c b/contrib/pgstattuple/pgstatbloat.c
new file mode 100644
index 0000000..1c1f388
--- /dev/null
+++ b/contrib/pgstattuple/pgstatbloat.c
@@ -0,0 +1,282 @@
+/*
+ * contrib/pgstattuple/pgstatbloat.c
+ * Fast bloat estimation functions
+ */
+
+#include "postgres.h"
+
+#include "access/visibilitymap.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/multixact.h"
+#include "access/htup_details.h"
+#include "catalog/namespace.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "storage/freespace.h"
+#include "storage/procarray.h"
+#include "storage/lmgr.h"
+#include "utils/builtins.h"
+#include "utils/tqual.h"
+#include "commands/vacuum.h"
+
+PG_FUNCTION_INFO_V1(pgstatbloat);
+
+/*
+ * tuple_percent, dead_tuple_percent and free_percent are computable,
+ * so not defined here. total_pages and misc_count are not returned.
+ */
+typedef struct pgstatbloat_output_type
+{
+	uint64		table_len;
+	uint64		total_pages;
+	uint64		scanned_pages;
+	uint64		tuple_count;
+	uint64		tuple_len;
+	uint64		dead_tuple_count;
+	uint64		dead_tuple_len;
+	uint64		misc_count;
+	uint64		free_space;
+} pgstatbloat_output_type;
+
+#define NUM_PGSTATBLOAT_OUTPUT_COLUMNS 10
+
+/*
+ * This function takes an already open relation and scans its pages,
+ * skipping those that have the corresponding visibility map bit set.
+ * For pages we skip, we find the free space from the free space map
+ * and approximate tuple_len on that basis. For the others, we count
+ * the exact number of dead tuples etc.
+ *
+ * This scan is loosely based on vacuumlazy.c:lazy_scan_heap(), but
+ * we do not try to avoid skipping single pages.
+ */
+
+static void
+pgstatbloat_heap(Relation rel, pgstatbloat_output_type *stat)
+{
+	BlockNumber scanned,
+				nblocks,
+				blkno;
+	Buffer		vmbuffer = InvalidBuffer;
+	BufferAccessStrategy bstrategy;
+	TransactionId OldestXmin;
+
+	OldestXmin = GetOldestXmin(rel, true);
+	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	scanned = 0;
+	nblocks = RelationGetNumberOfBlocks(rel);
+
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		Buffer		buf;
+		Page		page;
+		OffsetNumber offnum,
+					maxoff;
+		Size		freespace;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the page has only visible tuples, then we can find out the
+		 * free space from the FSM and move on.
+		 */
+
+		if (visibilitymap_test(rel, blkno, &vmbuffer))
+		{
+			freespace = GetRecordedFreeSpace(rel, blkno);
+			stat->tuple_len += BLCKSZ - freespace;
+			stat->free_space += freespace;
+			continue;
+		}
+
+		buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
+								 RBM_NORMAL, bstrategy);
+
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+		page = BufferGetPage(buf);
+
+		if (!PageIsNew(page))
+			stat->free_space += PageGetHeapFreeSpace(page);
+
+		if (PageIsNew(page) || PageIsEmpty(page))
+		{
+			UnlockReleaseBuffer(buf);
+			continue;
+		}
+
+		scanned++;
+
+		/*
+		 * Look at each tuple on the page and decide whether it's live
+		 * or dead, then count it and its size. Unlike lazy_scan_heap,
+		 * we can afford to ignore problems and special cases.
+		 */
+
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemid;
+			HeapTupleData tuple;
+
+			itemid = PageGetItemId(page, offnum);
+
+			if (!ItemIdIsUsed(itemid) || ItemIdIsRedirected(itemid) ||
+				ItemIdIsDead(itemid))
+			{
+				continue;
+			}
+
+			Assert(ItemIdIsNormal(itemid));
+
+			ItemPointerSet(&(tuple.t_self), blkno, offnum);
+
+			tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+			tuple.t_len = ItemIdGetLength(itemid);
+			tuple.t_tableOid = RelationGetRelid(rel);
+
+			/*
+			 * We count live and dead tuples, but we also need to add up
+			 * others in order to feed vac_estimate_reltuples.
+			 */
+
+			switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
+			{
+				case HEAPTUPLE_RECENTLY_DEAD:
+					stat->misc_count++;
+					/* Fall through */
+				case HEAPTUPLE_DEAD:
+					stat->dead_tuple_len += tuple.t_len;
+					stat->dead_tuple_count++;
+					break;
+				case HEAPTUPLE_LIVE:
+					stat->tuple_len += tuple.t_len;
+					stat->tuple_count++;
+					break;
+				case HEAPTUPLE_INSERT_IN_PROGRESS:
+				case HEAPTUPLE_DELETE_IN_PROGRESS:
+					stat->misc_count++;
+					break;
+				default:
+					elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
+					break;
+			}
+		}
+
+		UnlockReleaseBuffer(buf);
+	}
+
+	stat->table_len = (uint64) nblocks * BLCKSZ;
+	stat->tuple_count = vac_estimate_reltuples(rel, false, nblocks, scanned,
+											   stat->tuple_count+stat->misc_count);
+	stat->total_pages = nblocks;
+	stat->scanned_pages = scanned;
+
+	if (BufferIsValid(vmbuffer))
+	{
+		ReleaseBuffer(vmbuffer);
+		vmbuffer = InvalidBuffer;
+	}
+
+	relation_close(rel, AccessShareLock);
+}
+
+/*
+ * build and return a pgstatbloat_output_type tuple
+ */
+static HeapTuple
+build_tuple(pgstatbloat_output_type *stat, FunctionCallInfo fcinfo)
+{
+	TupleDesc	tupdesc;
+	bool		nulls[NUM_PGSTATBLOAT_OUTPUT_COLUMNS];
+	Datum		values[NUM_PGSTATBLOAT_OUTPUT_COLUMNS];
+	double		scanned_percent = 0;
+	double		tuple_percent = 0;
+	double		dead_tuple_percent = 0;
+	double		free_percent = 0;
+	int			i;
+
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	/*
+	 * Calculate percentages.
+	 */
+
+	if (stat->total_pages != 0)
+		scanned_percent = 100 * stat->scanned_pages / stat->total_pages;
+
+	if (stat->table_len != 0)
+	{
+		tuple_percent = 100.0 * stat->tuple_len / stat->table_len;
+		dead_tuple_percent = 100.0 * stat->dead_tuple_len / stat->table_len;
+		free_percent = 100.0 * stat->free_space / stat->table_len;
+	}
+
+	i = 0;
+	values[i++] = Int64GetDatum(stat->table_len);
+	values[i++] = Float8GetDatum(scanned_percent);
+	values[i++] = Int64GetDatum(stat->tuple_count);
+	values[i++] = Int64GetDatum(stat->tuple_len);
+	values[i++] = Float8GetDatum(tuple_percent);
+	values[i++] = Int64GetDatum(stat->dead_tuple_count);
+	values[i++] = Int64GetDatum(stat->dead_tuple_len);
+	values[i++] = Float8GetDatum(dead_tuple_percent);
+	values[i++] = Int64GetDatum(stat->free_space);
+	values[i++] = Float8GetDatum(free_percent);
+
+	for (i = 0; i < NUM_PGSTATBLOAT_OUTPUT_COLUMNS; i++)
+		nulls[i] = false;
+
+	return heap_form_tuple(tupdesc, values, nulls);
+}
+
+/*
+ * Returns estimated live/dead tuple statistics for the given relid.
+ */
+
+Datum
+pgstatbloat(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	Relation	rel;
+	pgstatbloat_output_type stat = {0};
+
+	if (!superuser())
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 (errmsg("must be superuser to use pgstatbloat functions"))));
+
+	rel = relation_open(relid, AccessShareLock);
+
+	/*
+	 * Reject attempts to read non-local temporary relations; we would be
+	 * likely to get wrong data since we have no visibility into the owning
+	 * session's local buffers.
+	 */
+	if (RELATION_IS_OTHER_TEMP(rel))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot access temporary tables of other sessions")));
+
+	/*
+	 * We support only ordinary relations and materialised views,
+	 * because we depend on the visibility map and free space map
+	 * for our estimates about unscanned pages.
+	 */
+	if (!(rel->rd_rel->relkind == RELKIND_RELATION ||
+		  rel->rd_rel->relkind == RELKIND_MATVIEW))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("\"%s\" is not a table or materialized view",
+						RelationGetRelationName(rel))));
+
+	pgstatbloat_heap(rel, &stat);
+	return HeapTupleGetDatum(build_tuple(&stat, fcinfo));
+}
diff --git a/contrib/pgstattuple/pgstattuple--1.2--1.3.sql b/contrib/pgstattuple/pgstattuple--1.2--1.3.sql
new file mode 100644
index 0000000..e4762d6
--- /dev/null
+++ b/contrib/pgstattuple/pgstattuple--1.2--1.3.sql
@@ -0,0 +1,18 @@
+/* contrib/pgstattuple/pgstattuple--1.2--1.3.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pgstattuple UPDATE TO '1.3'" to load this file. \quit
+
+CREATE FUNCTION pgstatbloat(IN reloid regclass,
+    OUT table_len BIGINT,               -- physical table length in bytes
+    OUT scanned_percent FLOAT8,         -- what percentage of the table's pages was scanned
+    OUT approx_tuple_count BIGINT,      -- estimated number of live tuples
+    OUT approx_tuple_len BIGINT,        -- estimated total length in bytes of live tuples
+    OUT approx_tuple_percent FLOAT8,    -- live tuples in % (based on estimate)
+    OUT dead_tuple_count BIGINT,        -- exact number of dead tuples
+    OUT dead_tuple_len BIGINT,          -- exact total length in bytes of dead tuples
+    OUT dead_tuple_percent FLOAT8,      -- dead tuples in % (based on estimate)
+    OUT approx_free_space BIGINT,       -- estimated free space in bytes
+    OUT approx_free_percent FLOAT8)     -- free space in % (based on estimate)
+AS 'MODULE_PATHNAME', 'pgstatbloat'
+LANGUAGE C STRICT;
diff --git a/contrib/pgstattuple/pgstattuple--1.2.sql b/contrib/pgstattuple/pgstattuple--1.3.sql
similarity index 72%
rename from contrib/pgstattuple/pgstattuple--1.2.sql
rename to contrib/pgstattuple/pgstattuple--1.3.sql
index e5fa2f5..f0e2a7d 100644
--- a/contrib/pgstattuple/pgstattuple--1.2.sql
+++ b/contrib/pgstattuple/pgstattuple--1.3.sql
@@ -1,4 +1,4 @@
-/* contrib/pgstattuple/pgstattuple--1.2.sql */
+/* contrib/pgstattuple/pgstattuple--1.3.sql */
 
 -- complain if script is sourced in psql, rather than via CREATE EXTENSION
 \echo Use "CREATE EXTENSION pgstattuple" to load this file. \quit
@@ -77,3 +77,19 @@ CREATE FUNCTION pg_relpages(IN relname regclass)
 RETURNS BIGINT
 AS 'MODULE_PATHNAME', 'pg_relpagesbyid'
 LANGUAGE C STRICT;
+
+/* New stuff in 1.3 begins here */
+
+CREATE FUNCTION pgstatbloat(IN reloid regclass,
+    OUT table_len BIGINT,               -- physical table length in bytes
+    OUT scanned_percent FLOAT8,         -- what percentage of the table's pages was scanned
+    OUT approx_tuple_count BIGINT,      -- estimated number of live tuples
+    OUT approx_tuple_len BIGINT,        -- estimated total length in bytes of live tuples
+    OUT approx_tuple_percent FLOAT8,    -- live tuples in % (based on estimate)
+    OUT dead_tuple_count BIGINT,        -- exact number of dead tuples
+    OUT dead_tuple_len BIGINT,          -- exact total length in bytes of dead tuples
+    OUT dead_tuple_percent FLOAT8,      -- dead tuples in % (based on estimate)
+    OUT approx_free_space BIGINT,       -- estimated free space in bytes
+    OUT approx_free_percent FLOAT8)     -- free space in % (based on estimate)
+AS 'MODULE_PATHNAME', 'pgstatbloat'
+LANGUAGE C STRICT;
diff --git a/contrib/pgstattuple/pgstattuple.control b/contrib/pgstattuple/pgstattuple.control
index a7cf47f..c03b180 100644
--- a/contrib/pgstattuple/pgstattuple.control
+++ b/contrib/pgstattuple/pgstattuple.control
@@ -1,5 +1,5 @@
 # pgstattuple extension
 comment = 'show tuple-level statistics'
-default_version = '1.2'
+default_version = '1.3'
 module_pathname = '$libdir/pgstattuple'
 relocatable = true
diff --git a/doc/src/sgml/pgstattuple.sgml b/doc/src/sgml/pgstattuple.sgml
index 9cabd71..958c6f7 100644
--- a/doc/src/sgml/pgstattuple.sgml
+++ b/doc/src/sgml/pgstattuple.sgml
@@ -358,6 +358,140 @@ pending_tuples | 0
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term>
+     <indexterm>
+      <primary>pgstatbloat</primary>
+     </indexterm>
+     <function>pgstatbloat(regclass) returns record</>
+    </term>
+
+    <listitem>
+     <para>
+      <function>pgstatbloat</function> is a faster alternative to
+      <function>pgstattuple</function> that returns approximate results.
+      The argument is the target relation's OID.
+      For example:
+<programlisting>
+test=> SELECT * FROM pgstatbloat('pg_catalog.pg_proc'::regclass);
+-[ RECORD 1 ]--------+-------
+table_len            | 573440
+scanned_percent      | 2
+approx_tuple_count   | 2740
+approx_tuple_len     | 561210
+approx_tuple_percent | 97.87
+dead_tuple_count     | 0
+dead_tuple_len       | 0
+dead_tuple_percent   | 0
+approx_free_space    | 11996
+approx_free_percent  | 2.09
+</programlisting>
+      The output columns are described in <xref linkend="pgstatbloat-columns">.
+     </para>
+
+     <para>
+     Whereas <function>pgstattuple</function> always performs a
+     full-table scan and returns an exact count of live and dead tuples
+     (and their sizes) and free space, <function>pgstatbloat</function>
+     tries to avoid the full-table scan and returns exact dead tuple
+     statistics along with an approximation of the number and
+     size of live tuples and free space.
+     </para>
+
+     <para>
+     It does this by skipping pages that have only visible tuples
+     according to the visibility map (if a page has the corresponding VM
+     bit set, then it is assumed to contain no dead tuples). For such
+     pages, it derives the free space value from the free space map, and
+     assumes that the rest of the space on the page is taken up by live
+     tuples.
+     </para>
+ 
+     <para>
+     For pages that cannot be skipped, it scans each tuple, recording its
+     presence and size in the appropriate counters, and adding up the
+     free space on the page. At the end, it estimates the total number of
+     live tuples based on the number of pages and tuples scanned (in the
+     same way that VACUUM estimates pg_class.reltuples).
+     </para>
+
+     <table id="pgstatbloat-columns">
+      <title><function>pgstatbloat</function> Output Columns</title>
+      <tgroup cols="3">
+       <thead>
+        <row>
+         <entry>Column</entry>
+         <entry>Type</entry>
+         <entry>Description</entry>
+        </row>
+       </thead>
+ 
+       <tbody>
+        <row>
+         <entry><structfield>table_len</structfield></entry>
+         <entry><type>bigint</type></entry>
+         <entry>Physical relation length in bytes (exact)</entry>
+        </row>
+        <row>
+         <entry><structfield>scanned_percent</structfield></entry>
+         <entry><type>float8</type></entry>
+         <entry>Percentage of table scanned</entry>
+        </row>
+        <row>
+         <entry><structfield>tuple_count</structfield></entry>
+         <entry><type>bigint</type></entry>
+         <entry>Number of live tuples (estimated)</entry>
+        </row>
+        <row>
+         <entry><structfield>tuple_len</structfield></entry>
+         <entry><type>bigint</type></entry>
+         <entry>Total length of live tuples in bytes (estimated)</entry>
+        </row>
+        <row>
+         <entry><structfield>tuple_percent</structfield></entry>
+         <entry><type>float8</type></entry>
+         <entry>Percentage of live tuples</entry>
+        </row>
+        <row>
+         <entry><structfield>dead_tuple_count</structfield></entry>
+         <entry><type>bigint</type></entry>
+         <entry>Number of dead tuples (exact)</entry>
+        </row>
+        <row>
+         <entry><structfield>dead_tuple_len</structfield></entry>
+         <entry><type>bigint</type></entry>
+         <entry>Total length of dead tuples in bytes (exact)</entry>
+        </row>
+        <row>
+         <entry><structfield>dead_tuple_percent</structfield></entry>
+         <entry><type>float8</type></entry>
+         <entry>Percentage of dead tuples</entry>
+        </row>
+        <row>
+         <entry><structfield>approx_free_space</structfield></entry>
+         <entry><type>bigint</type></entry>
+         <entry>Total free space in bytes (estimated)</entry>
+        </row>
+        <row>
+         <entry><structfield>approx_free_percent</structfield></entry>
+         <entry><type>float8</type></entry>
+         <entry>Percentage of free space</entry>
+        </row>
+ 
+       </tbody>
+      </tgroup>
+     </table>
+ 
+     <para>
+     In the above output, the free space figures may not match the
+     <function>pgstattuple</function> output exactly, because the free
+     space map gives us an exact figure, but is not guaranteed to be
+     accurate to the byte.
+     </para>
+
+    </listitem>
+   </varlistentry>
+
   </variablelist>
  </sect2>
 
@@ -366,6 +500,7 @@ pending_tuples | 0
 
   <para>
    Tatsuo Ishii and Satoshi Nagayasu
+   Abhijit Menon-Sen
   </para>
  </sect2>
 
-- 
1.9.1

#79

andres@anarazel.de

over 10 years ago

In reply to: Abhijit Menon-Sen (#78)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

Hi,

On 2015-05-11 16:57:08 +0530, Abhijit Menon-Sen wrote:

I've attached an updated patch for pgstatbloat, as well as a patch to
replace two uses of BuildTupleFromCStrings() elsewhere in pgstattuple.

TBH, I'd rather not touch unrelated things right now. We're pretty badly
behind...

I've made the changes you mentioned in your earlier mail, except that I
have not changed the name pending further suggestions about what would
be the best name.

I don't really care how it's named, as long as it makes clear that it's
not an exact measurement.

I haven't checked, but I'm not sure that it's safe/meaningful to
call PageGetHeapFreeSpace() on a new page.

OK, I'll check and fix if necessary.

You're right, PageGetHeapFreeSpace() isn't safe on a new page. I've
added a guard to that call in the attached patch, but I'm not sure
that's the right thing to do.

Should I copy the orphaned new-page handling from lazy_scan_heap? What
about for empty pages? Neither feels like a very good thing to do, but
then neither does skipping the new page silently. Should I count the
space it would have free if it were initialised, but leave the page
alone for VACUUM to deal with? Or just leave it as it is?

I think there's absolutely no need for pgstattuple to do anything
here. It's not even desirable.

+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+		page = BufferGetPage(buf);
+
+		if (!PageIsNew(page))
+			stat->free_space += PageGetHeapFreeSpace(page);
+
+		if (PageIsNew(page) || PageIsEmpty(page))
+		{
+			UnlockReleaseBuffer(buf);
+			continue;
+		}

Shouldn't new pages be counted as being fully free or at least bloat?
Just disregarding them seems wrong.

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#80

ams@2ndQuadrant.com

over 10 years ago

In reply to: Andres Freund (#79)

1 attachment(s)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

At 2015-05-11 19:15:47 +0200, andres@anarazel.de wrote:

TBH, I'd rather not touch unrelated things right now. We're pretty
badly behind...

OK. That patch is independent; just ignore it.

I don't really care how it's named, as long as it makes clear that
it's not an exact measurement.

Not having heard any better suggestions, I picked "pgstatapprox" as a
compromise between length and familiarity/consistency with pgstattuple.

Should I count the space it would have free if it were initialised,
but leave the page alone for VACUUM to deal with?

And this is what the attached patch does.

I also cleaned up a few things that I didn't like but had left alone to
make the code look similar to pgstattuple. In particular, build_tuple()
now does nothing but build a tuple from values calculated earlier in
pgstatapprox_heap().

Thank you.

-- Abhijit

P.S. What, if anything, should be done about the complicated and likely
not very useful skip-only-min#-blocks logic in lazy_scan_heap?

Attachments:

0001-Add-pgstatapprox-to-pgstattuple.patchtext/x-diff; charset=us-asciiDownload

>From 0a99fc61d36e3a3ff4003db95e5c299a5f07a850 Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen <ams@2ndQuadrant.com>
Date: Fri, 26 Dec 2014 12:37:13 +0530
Subject: Add pgstatapprox to pgstattuple

---
 contrib/pgstattuple/Makefile                       |   4 +-
 contrib/pgstattuple/pgstatapprox.c                 | 277 +++++++++++++++++++++
 contrib/pgstattuple/pgstattuple--1.2--1.3.sql      |  18 ++
 .../{pgstattuple--1.2.sql => pgstattuple--1.3.sql} |  18 +-
 contrib/pgstattuple/pgstattuple.control            |   2 +-
 doc/src/sgml/pgstattuple.sgml                      | 135 ++++++++++
 6 files changed, 450 insertions(+), 4 deletions(-)
 create mode 100644 contrib/pgstattuple/pgstatapprox.c
 create mode 100644 contrib/pgstattuple/pgstattuple--1.2--1.3.sql
 rename contrib/pgstattuple/{pgstattuple--1.2.sql => pgstattuple--1.3.sql} (72%)

diff --git a/contrib/pgstattuple/Makefile b/contrib/pgstattuple/Makefile
index 862585c..6083dab 100644
--- a/contrib/pgstattuple/Makefile
+++ b/contrib/pgstattuple/Makefile
@@ -1,10 +1,10 @@
 # contrib/pgstattuple/Makefile
 
 MODULE_big	= pgstattuple
-OBJS		= pgstattuple.o pgstatindex.o $(WIN32RES)
+OBJS		= pgstattuple.o pgstatindex.o pgstatapprox.o $(WIN32RES)
 
 EXTENSION = pgstattuple
-DATA = pgstattuple--1.2.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql
+DATA = pgstattuple--1.3.sql pgstattuple--1.2--1.3.sql pgstattuple--1.1--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql
 PGFILEDESC = "pgstattuple - tuple-level statistics"
 
 REGRESS = pgstattuple
diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
new file mode 100644
index 0000000..46f5bc0
--- /dev/null
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -0,0 +1,277 @@
+/*
+ * contrib/pgstattuple/pgstatapprox.c
+ * Fast bloat estimation functions
+ */
+
+#include "postgres.h"
+
+#include "access/visibilitymap.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/multixact.h"
+#include "access/htup_details.h"
+#include "catalog/namespace.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "storage/freespace.h"
+#include "storage/procarray.h"
+#include "storage/lmgr.h"
+#include "utils/builtins.h"
+#include "utils/tqual.h"
+#include "commands/vacuum.h"
+
+PG_FUNCTION_INFO_V1(pgstatapprox);
+
+typedef struct output_type
+{
+	uint64		table_len;
+	uint64		scanned_percent;
+	uint64		tuple_count;
+	uint64		tuple_len;
+	double		tuple_percent;
+	uint64		dead_tuple_count;
+	uint64		dead_tuple_len;
+	double		dead_tuple_percent;
+	uint64		free_space;
+	double		free_percent;
+} output_type;
+
+#define NUM_OUTPUT_COLUMNS 10
+
+/*
+ * This function takes an already open relation and scans its pages,
+ * skipping those that have the corresponding visibility map bit set.
+ * For pages we skip, we find the free space from the free space map
+ * and approximate tuple_len on that basis. For the others, we count
+ * the exact number of dead tuples etc.
+ *
+ * This scan is loosely based on vacuumlazy.c:lazy_scan_heap(), but
+ * we do not try to avoid skipping single pages.
+ */
+
+static void
+pgstatapprox_heap(Relation rel, output_type *stat)
+{
+	BlockNumber scanned,
+				nblocks,
+				blkno;
+	Buffer		vmbuffer = InvalidBuffer;
+	BufferAccessStrategy bstrategy;
+	TransactionId OldestXmin;
+	uint64		misc_count = 0;
+
+	OldestXmin = GetOldestXmin(rel, true);
+	bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	nblocks = RelationGetNumberOfBlocks(rel);
+	scanned = 0;
+
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		Buffer		buf;
+		Page		page;
+		OffsetNumber offnum,
+					maxoff;
+		Size		freespace;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the page has only visible tuples, then we can find out the
+		 * free space from the FSM and move on.
+		 */
+
+		if (visibilitymap_test(rel, blkno, &vmbuffer))
+		{
+			freespace = GetRecordedFreeSpace(rel, blkno);
+			stat->tuple_len += BLCKSZ - freespace;
+			stat->free_space += freespace;
+			continue;
+		}
+
+		buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
+								 RBM_NORMAL, bstrategy);
+
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+		page = BufferGetPage(buf);
+
+		/*
+		 * It's not safe to call PageGetHeapFreeSpace() on new pages, so
+		 * we treat them as being free space for our purposes.
+		 */
+		if (!PageIsNew(page))
+			stat->free_space += PageGetHeapFreeSpace(page);
+		else
+			stat->free_space += BLCKSZ - SizeOfPageHeaderData;
+
+		if (PageIsNew(page) || PageIsEmpty(page))
+		{
+			UnlockReleaseBuffer(buf);
+			continue;
+		}
+
+		scanned++;
+
+		/*
+		 * Look at each tuple on the page and decide whether it's live
+		 * or dead, then count it and its size. Unlike lazy_scan_heap,
+		 * we can afford to ignore problems and special cases.
+		 */
+
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		for (offnum = FirstOffsetNumber;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemid;
+			HeapTupleData tuple;
+
+			itemid = PageGetItemId(page, offnum);
+
+			if (!ItemIdIsUsed(itemid) || ItemIdIsRedirected(itemid) ||
+				ItemIdIsDead(itemid))
+			{
+				continue;
+			}
+
+			Assert(ItemIdIsNormal(itemid));
+
+			ItemPointerSet(&(tuple.t_self), blkno, offnum);
+
+			tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
+			tuple.t_len = ItemIdGetLength(itemid);
+			tuple.t_tableOid = RelationGetRelid(rel);
+
+			/*
+			 * We count live and dead tuples, but we also need to add up
+			 * others in order to feed vac_estimate_reltuples.
+			 */
+
+			switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
+			{
+				case HEAPTUPLE_RECENTLY_DEAD:
+					misc_count++;
+					/* Fall through */
+				case HEAPTUPLE_DEAD:
+					stat->dead_tuple_len += tuple.t_len;
+					stat->dead_tuple_count++;
+					break;
+				case HEAPTUPLE_LIVE:
+					stat->tuple_len += tuple.t_len;
+					stat->tuple_count++;
+					break;
+				case HEAPTUPLE_INSERT_IN_PROGRESS:
+				case HEAPTUPLE_DELETE_IN_PROGRESS:
+					misc_count++;
+					break;
+				default:
+					elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
+					break;
+			}
+		}
+
+		UnlockReleaseBuffer(buf);
+	}
+
+	stat->table_len = (uint64) nblocks * BLCKSZ;
+	stat->tuple_count = vac_estimate_reltuples(rel, false, nblocks, scanned,
+											   stat->tuple_count+misc_count);
+
+	/*
+	 * Calculate percentages if the relation has one or more pages.
+	 */
+	if (nblocks != 0)
+	{
+		stat->scanned_percent = 100 * scanned / nblocks;
+		stat->tuple_percent = 100.0 * stat->tuple_len / stat->table_len;
+		stat->dead_tuple_percent = 100.0 * stat->dead_tuple_len / stat->table_len;
+		stat->free_percent = 100.0 * stat->free_space / stat->table_len;
+	}
+
+	if (BufferIsValid(vmbuffer))
+	{
+		ReleaseBuffer(vmbuffer);
+		vmbuffer = InvalidBuffer;
+	}
+}
+
+/*
+ * build and return an output_type tuple
+ */
+static HeapTuple
+build_tuple(output_type *stat, FunctionCallInfo fcinfo)
+{
+	TupleDesc	tupdesc;
+	bool		nulls[NUM_OUTPUT_COLUMNS];
+	Datum		values[NUM_OUTPUT_COLUMNS];
+	int			i = 0;
+
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	values[i++] = Int64GetDatum(stat->table_len);
+	values[i++] = Float8GetDatum(stat->scanned_percent);
+	values[i++] = Int64GetDatum(stat->tuple_count);
+	values[i++] = Int64GetDatum(stat->tuple_len);
+	values[i++] = Float8GetDatum(stat->tuple_percent);
+	values[i++] = Int64GetDatum(stat->dead_tuple_count);
+	values[i++] = Int64GetDatum(stat->dead_tuple_len);
+	values[i++] = Float8GetDatum(stat->dead_tuple_percent);
+	values[i++] = Int64GetDatum(stat->free_space);
+	values[i++] = Float8GetDatum(stat->free_percent);
+
+	for (i = 0; i < NUM_OUTPUT_COLUMNS; i++)
+		nulls[i] = false;
+
+	return heap_form_tuple(tupdesc, values, nulls);
+}
+
+/*
+ * Returns estimated live/dead tuple statistics for the given relid.
+ */
+
+Datum
+pgstatapprox(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	Relation	rel;
+	output_type stat = {0};
+
+	if (!superuser())
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 (errmsg("must be superuser to use pgstattuple functions"))));
+
+	rel = relation_open(relid, AccessShareLock);
+
+	/*
+	 * Reject attempts to read non-local temporary relations; we would be
+	 * likely to get wrong data since we have no visibility into the owning
+	 * session's local buffers.
+	 */
+	if (RELATION_IS_OTHER_TEMP(rel))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot access temporary tables of other sessions")));
+
+	/*
+	 * We support only ordinary relations and materialised views,
+	 * because we depend on the visibility map and free space map
+	 * for our estimates about unscanned pages.
+	 */
+	if (!(rel->rd_rel->relkind == RELKIND_RELATION ||
+		  rel->rd_rel->relkind == RELKIND_MATVIEW))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("\"%s\" is not a table or materialized view",
+						RelationGetRelationName(rel))));
+
+	pgstatapprox_heap(rel, &stat);
+
+	relation_close(rel, AccessShareLock);
+
+	return HeapTupleGetDatum(build_tuple(&stat, fcinfo));
+}
diff --git a/contrib/pgstattuple/pgstattuple--1.2--1.3.sql b/contrib/pgstattuple/pgstattuple--1.2--1.3.sql
new file mode 100644
index 0000000..bfcf464
--- /dev/null
+++ b/contrib/pgstattuple/pgstattuple--1.2--1.3.sql
@@ -0,0 +1,18 @@
+/* contrib/pgstattuple/pgstattuple--1.2--1.3.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pgstattuple UPDATE TO '1.3'" to load this file. \quit
+
+CREATE FUNCTION pgstatapprox(IN reloid regclass,
+    OUT table_len BIGINT,               -- physical table length in bytes
+    OUT scanned_percent FLOAT8,         -- what percentage of the table's pages was scanned
+    OUT approx_tuple_count BIGINT,      -- estimated number of live tuples
+    OUT approx_tuple_len BIGINT,        -- estimated total length in bytes of live tuples
+    OUT approx_tuple_percent FLOAT8,    -- live tuples in % (based on estimate)
+    OUT dead_tuple_count BIGINT,        -- exact number of dead tuples
+    OUT dead_tuple_len BIGINT,          -- exact total length in bytes of dead tuples
+    OUT dead_tuple_percent FLOAT8,      -- dead tuples in % (based on estimate)
+    OUT approx_free_space BIGINT,       -- estimated free space in bytes
+    OUT approx_free_percent FLOAT8)     -- free space in % (based on estimate)
+AS 'MODULE_PATHNAME', 'pgstatapprox'
+LANGUAGE C STRICT;
diff --git a/contrib/pgstattuple/pgstattuple--1.2.sql b/contrib/pgstattuple/pgstattuple--1.3.sql
similarity index 72%
rename from contrib/pgstattuple/pgstattuple--1.2.sql
rename to contrib/pgstattuple/pgstattuple--1.3.sql
index e5fa2f5..f69b619 100644
--- a/contrib/pgstattuple/pgstattuple--1.2.sql
+++ b/contrib/pgstattuple/pgstattuple--1.3.sql
@@ -1,4 +1,4 @@
-/* contrib/pgstattuple/pgstattuple--1.2.sql */
+/* contrib/pgstattuple/pgstattuple--1.3.sql */
 
 -- complain if script is sourced in psql, rather than via CREATE EXTENSION
 \echo Use "CREATE EXTENSION pgstattuple" to load this file. \quit
@@ -77,3 +77,19 @@ CREATE FUNCTION pg_relpages(IN relname regclass)
 RETURNS BIGINT
 AS 'MODULE_PATHNAME', 'pg_relpagesbyid'
 LANGUAGE C STRICT;
+
+/* New stuff in 1.3 begins here */
+
+CREATE FUNCTION pgstatapprox(IN reloid regclass,
+    OUT table_len BIGINT,               -- physical table length in bytes
+    OUT scanned_percent FLOAT8,         -- what percentage of the table's pages was scanned
+    OUT approx_tuple_count BIGINT,      -- estimated number of live tuples
+    OUT approx_tuple_len BIGINT,        -- estimated total length in bytes of live tuples
+    OUT approx_tuple_percent FLOAT8,    -- live tuples in % (based on estimate)
+    OUT dead_tuple_count BIGINT,        -- exact number of dead tuples
+    OUT dead_tuple_len BIGINT,          -- exact total length in bytes of dead tuples
+    OUT dead_tuple_percent FLOAT8,      -- dead tuples in % (based on estimate)
+    OUT approx_free_space BIGINT,       -- estimated free space in bytes
+    OUT approx_free_percent FLOAT8)     -- free space in % (based on estimate)
+AS 'MODULE_PATHNAME', 'pgstatapprox'
+LANGUAGE C STRICT;
diff --git a/contrib/pgstattuple/pgstattuple.control b/contrib/pgstattuple/pgstattuple.control
index a7cf47f..c03b180 100644
--- a/contrib/pgstattuple/pgstattuple.control
+++ b/contrib/pgstattuple/pgstattuple.control
@@ -1,5 +1,5 @@
 # pgstattuple extension
 comment = 'show tuple-level statistics'
-default_version = '1.2'
+default_version = '1.3'
 module_pathname = '$libdir/pgstattuple'
 relocatable = true
diff --git a/doc/src/sgml/pgstattuple.sgml b/doc/src/sgml/pgstattuple.sgml
index 9cabd71..4cba006 100644
--- a/doc/src/sgml/pgstattuple.sgml
+++ b/doc/src/sgml/pgstattuple.sgml
@@ -358,6 +358,140 @@ pending_tuples | 0
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term>
+     <indexterm>
+      <primary>pgstatapprox</primary>
+     </indexterm>
+     <function>pgstatapprox(regclass) returns record</>
+    </term>
+
+    <listitem>
+     <para>
+      <function>pgstatapprox</function> is a faster alternative to
+      <function>pgstattuple</function> that returns approximate results.
+      The argument is the target relation's OID.
+      For example:
+<programlisting>
+test=> SELECT * FROM pgstatapprox('pg_catalog.pg_proc'::regclass);
+-[ RECORD 1 ]--------+-------
+table_len            | 573440
+scanned_percent      | 2
+approx_tuple_count   | 2740
+approx_tuple_len     | 561210
+approx_tuple_percent | 97.87
+dead_tuple_count     | 0
+dead_tuple_len       | 0
+dead_tuple_percent   | 0
+approx_free_space    | 11996
+approx_free_percent  | 2.09
+</programlisting>
+      The output columns are described in <xref linkend="pgstatapprox-columns">.
+     </para>
+
+     <para>
+     Whereas <function>pgstattuple</function> always performs a
+     full-table scan and returns an exact count of live and dead tuples
+     (and their sizes) and free space, <function>pgstatapprox</function>
+     tries to avoid the full-table scan and returns exact dead tuple
+     statistics along with an approximation of the number and
+     size of live tuples and free space.
+     </para>
+
+     <para>
+     It does this by skipping pages that have only visible tuples
+     according to the visibility map (if a page has the corresponding VM
+     bit set, then it is assumed to contain no dead tuples). For such
+     pages, it derives the free space value from the free space map, and
+     assumes that the rest of the space on the page is taken up by live
+     tuples.
+     </para>
+ 
+     <para>
+     For pages that cannot be skipped, it scans each tuple, recording its
+     presence and size in the appropriate counters, and adding up the
+     free space on the page. At the end, it estimates the total number of
+     live tuples based on the number of pages and tuples scanned (in the
+     same way that VACUUM estimates pg_class.reltuples).
+     </para>
+
+     <table id="pgstatapprox-columns">
+      <title><function>pgstatapprox</function> Output Columns</title>
+      <tgroup cols="3">
+       <thead>
+        <row>
+         <entry>Column</entry>
+         <entry>Type</entry>
+         <entry>Description</entry>
+        </row>
+       </thead>
+ 
+       <tbody>
+        <row>
+         <entry><structfield>table_len</structfield></entry>
+         <entry><type>bigint</type></entry>
+         <entry>Physical relation length in bytes (exact)</entry>
+        </row>
+        <row>
+         <entry><structfield>scanned_percent</structfield></entry>
+         <entry><type>float8</type></entry>
+         <entry>Percentage of table scanned</entry>
+        </row>
+        <row>
+         <entry><structfield>tuple_count</structfield></entry>
+         <entry><type>bigint</type></entry>
+         <entry>Number of live tuples (estimated)</entry>
+        </row>
+        <row>
+         <entry><structfield>tuple_len</structfield></entry>
+         <entry><type>bigint</type></entry>
+         <entry>Total length of live tuples in bytes (estimated)</entry>
+        </row>
+        <row>
+         <entry><structfield>tuple_percent</structfield></entry>
+         <entry><type>float8</type></entry>
+         <entry>Percentage of live tuples</entry>
+        </row>
+        <row>
+         <entry><structfield>dead_tuple_count</structfield></entry>
+         <entry><type>bigint</type></entry>
+         <entry>Number of dead tuples (exact)</entry>
+        </row>
+        <row>
+         <entry><structfield>dead_tuple_len</structfield></entry>
+         <entry><type>bigint</type></entry>
+         <entry>Total length of dead tuples in bytes (exact)</entry>
+        </row>
+        <row>
+         <entry><structfield>dead_tuple_percent</structfield></entry>
+         <entry><type>float8</type></entry>
+         <entry>Percentage of dead tuples</entry>
+        </row>
+        <row>
+         <entry><structfield>approx_free_space</structfield></entry>
+         <entry><type>bigint</type></entry>
+         <entry>Total free space in bytes (estimated)</entry>
+        </row>
+        <row>
+         <entry><structfield>approx_free_percent</structfield></entry>
+         <entry><type>float8</type></entry>
+         <entry>Percentage of free space</entry>
+        </row>
+ 
+       </tbody>
+      </tgroup>
+     </table>
+ 
+     <para>
+     In the above output, the free space figures may not match the
+     <function>pgstattuple</function> output exactly, because the free
+     space map gives us an exact figure, but is not guaranteed to be
+     accurate to the byte.
+     </para>
+
+    </listitem>
+   </varlistentry>
+
   </variablelist>
  </sect2>
 
@@ -366,6 +500,7 @@ pending_tuples | 0
 
   <para>
    Tatsuo Ishii and Satoshi Nagayasu
+   Abhijit Menon-Sen
   </para>
  </sect2>
 
-- 
1.9.1

#81

andres@anarazel.de

over 10 years ago

In reply to: Abhijit Menon-Sen (#80)

Re: a fast bloat measurement tool (was Re: Measuring relation free space)

On 2015-05-12 00:42:14 +0530, Abhijit Menon-Sen wrote:

At 2015-05-11 19:15:47 +0200, andres@anarazel.de wrote:

I don't really care how it's named, as long as it makes clear that
it's not an exact measurement.

Not having heard any better suggestions, I picked "pgstatapprox" as a
compromise between length and familiarity/consistency with pgstattuple.

I can live with that, although I'd personally go with
pgstattuple_approx()...

Other than that and some minor local changes I've made, I'm ready to
commit this.

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#82