Our CLUSTER implementation is pessimal

Started by Gregory Stark over 17 years ago · 12 messages
#1 Gregory Stark
stark@enterprisedb.com
1 attachment(s)

One thing that's been annoying me for a while is that our CLUSTER
implementation is really very slow. When I say very slow I mean it's really
very very very slow.

The problem is that it does a full index scan and looks up each tuple in the
order of the index. That means it a) is doing a lot of random i/o and b) has
to access the same pages over and over again.

Our query planner knows better and normally does a sequential heap scan and
sort for queries of this type. But CLUSTER has to use the index.

Just the other day I wanted to randomize the order of a table so I added a
random column, indexed it and started cluster running. After 6 hours I looked
at how far it had gotten and found it was only about 20% done rewriting the
heap (not including rebuilding the indexes). I cancelled it and built a new
table in the desired order with CREATE TABLE AS ... ORDER BY in 45 minutes!
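
The manual workaround looks roughly like this (a sketch only; the table and
column names here are illustrative, not from my actual run):

    CREATE TABLE foo_sorted AS
        SELECT * FROM foo ORDER BY r;
    -- recreate indexes and constraints on foo_sorted here,
    -- then swap it into place
    DROP TABLE foo;
    ALTER TABLE foo_sorted RENAME TO foo;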

There are a couple problems with this:

a) We need some way to decide *when* to do a sort and when to do an index
scan. The planner has all this machinery but we don't really have all the
pieces handy to use it in a utility statement. This is especially important
for the case where we're doing a cluster operation on a table that's already
clustered. In that case an index scan could conceivably actually win (though I
kind of doubt it). I don't really have a solution for this.

b) tuplesort no longer has the pieces needed to sort whole tuples including
visibility info. And actually even the old pieces that were removed had not
quite the right interface and behaviour. We need to preserve t_self for the
heap rewrite tools and we need to be able to use _bt_mkscankey_nodata() to
generate the scan keys that match the index. The cleanest approach I think is
to just add back in the old raw tuple methods to tuplesort and tweak them to
preserve t_self and take Scankeys instead of attnums etc.

c) expression indexes are a problem. We would have to start with constructing
new tuples to hold the expression values but retain the entire original heap
tuple. Perhaps pass both an indextuple and a heaptuple to tuplesort and store
both in the sorttuple? For now I figure we can just punt on expression indexes
and always do an index scan.

I tested out the idea and it works reasonably well. The code might need some
cleanup, including, I think, refactoring cluster_rel() and possibly tweaking the
interface to tuplesort. But I'm fairly happy with it.

The results clustering a 2.2GB table with 2,000,000 rows on my 1.5GB laptop
using maintenance_work_mem of 768MB and work_mem of 256MB:

postgres=# \d x

Table "public.x"
Column | Type | Modifiers
--------+------------------+-----------
i | integer |
r | double precision |
repeat | text |
Indexes:
"xi" btree (i)
"xi2" btree ((i + 0))
"xir" btree (r) CLUSTER

postgresql=# CLUSTER xi ON x;
CLUSTER
Time: 736029.322 ms

postgresql=# CLUSTER xir ON x;
CLUSTER
Time: 723610.507 ms

postgresql=# CLUSTER xi2 ON x;
CLUSTER
Time: 12842793.346 ms

That's 3.5 hours for traditional cluster and 12 minutes for cluster using a
sort...
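
For anyone who wants to poke at this, the test table can be approximated with
something like the following (a reconstruction -- the exact data generation I
used isn't shown here, and TOAST compression will affect the final size):

    CREATE TABLE x AS
        SELECT i,
               random() AS r,
               repeat('x', 1000) AS repeat
        FROM generate_series(1, 2000000) AS i;
    CREATE INDEX xi ON x (i);
    CREATE INDEX xi2 ON x ((i + 0));
    CREATE INDEX xir ON x (r);
    ALTER TABLE x CLUSTER ON xir;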

Attachments:

cluster-sort-v1.diff (text/x-diff)
Index: src/backend/commands/cluster.c
===================================================================
RCS file: /home/stark/src/REPOSITORY/pgsql/src/backend/commands/cluster.c,v
retrieving revision 1.177
diff -c -r1.177 cluster.c
*** src/backend/commands/cluster.c	19 Jun 2008 00:46:04 -0000	1.177
--- src/backend/commands/cluster.c	31 Aug 2008 11:36:14 -0000
***************
*** 19,24 ****
--- 19,25 ----
  
  #include "access/genam.h"
  #include "access/heapam.h"
+ #include "access/nbtree.h"
  #include "access/relscan.h"
  #include "access/rewriteheap.h"
  #include "access/transam.h"
***************
*** 46,52 ****
  #include "utils/snapmgr.h"
  #include "utils/syscache.h"
  #include "utils/tqual.h"
! 
  
  /*
   * This struct is used to pass around the information on tables to be
--- 47,53 ----
  #include "utils/snapmgr.h"
  #include "utils/syscache.h"
  #include "utils/tqual.h"
! #include "utils/tuplesort.h"
  
  /*
   * This struct is used to pass around the information on tables to be
***************
*** 713,725 ****
  	int			natts;
  	Datum	   *values;
  	bool	   *isnull;
- 	IndexScanDesc scan;
  	HeapTuple	tuple;
  	bool		use_wal;
  	TransactionId OldestXmin;
  	TransactionId FreezeXid;
  	RewriteState rwstate;
  
  	/*
  	 * Open the relations we need.
  	 */
--- 714,739 ----
  	int			natts;
  	Datum	   *values;
  	bool	   *isnull;
  	HeapTuple	tuple;
  	bool		use_wal;
  	TransactionId OldestXmin;
  	TransactionId FreezeXid;
  	RewriteState rwstate;
  
+ 	bool	do_sort;
+ 
+ 	/* We use this if we're doing an index scan */
+ 	IndexScanDesc scan;
+ 
+ 	/* we need these if we're going to heap scan and sort */
+ 	HeapScanDesc hscan;
+ 	IndexInfo	*ii;
+ 	ScanKey		 scanKeys;
+ 	int			 i;
+ 
+ 	Tuplesortstate *tuplesort;
+ 	bool 		shouldfree;
+ 
  	/*
  	 * Open the relations we need.
  	 */
***************
*** 773,782 ****
  	 * copied, we scan with SnapshotAny and use HeapTupleSatisfiesVacuum for
  	 * the visibility test.
  	 */
- 	scan = index_beginscan(OldHeap, OldIndex,
- 						   SnapshotAny, 0, (ScanKey) NULL);
  
! 	while ((tuple = index_getnext(scan, ForwardScanDirection)) != NULL)
  	{
  		HeapTuple	copiedTuple;
  		bool		isdead;
--- 787,825 ----
  	 * copied, we scan with SnapshotAny and use HeapTupleSatisfiesVacuum for
  	 * the visibility test.
  	 */
  
! 	ii = BuildIndexInfo(OldIndex);
! 
! 	if (ii->ii_Expressions) {
! 		do_sort = false;
! 		shouldfree = false;
! 		scan = index_beginscan(OldHeap, OldIndex,
! 							   SnapshotAny, 0, (ScanKey) NULL);
! 	} else {
! 		do_sort = true;
! 
! 		/* Generate the scan keys for the index */
! 		scanKeys = _bt_mkscankey_nodata(OldIndex);
! 		/* And then point them at the heap attributes */
! 		for (i = 0; i < ii->ii_NumIndexAttrs; i++)
! 			scanKeys[i].sk_attno = ii->ii_KeyAttrNumbers[i];
! 		/* sort the raw heap tuples using the index ordering */
! 		tuplesort = tuplesort_begin_rawheap(oldTupDesc,
! 											ii->ii_NumIndexAttrs, scanKeys,
! 											maintenance_work_mem, false);
! 		
! 		hscan = heap_beginscan(OldHeap, SnapshotAny, 0, NULL);
! 		while ((tuple = heap_getnext(hscan, ForwardScanDirection))) {
! 			tuplesort_putrawtuple(tuplesort, tuple);
! 		}
! 		heap_endscan(hscan);
! 		tuplesort_performsort(tuplesort);
! 	}
! 	
! 		
! 	while ((tuple = (do_sort
! 					 ? tuplesort_getrawtuple(tuplesort, true, &shouldfree)
! 					 : index_getnext(scan, ForwardScanDirection))) != NULL)
  	{
  		HeapTuple	copiedTuple;
  		bool		isdead;
***************
*** 784,797 ****
  
  		CHECK_FOR_INTERRUPTS();
  
! 		/* Since we used no scan keys, should never need to recheck */
! 		if (scan->xs_recheck)
! 			elog(ERROR, "CLUSTER does not support lossy index conditions");
! 
! 		LockBuffer(scan->xs_cbuf, BUFFER_LOCK_SHARE);
  
  		switch (HeapTupleSatisfiesVacuum(tuple->t_data, OldestXmin,
! 										 scan->xs_cbuf))
  		{
  			case HEAPTUPLE_DEAD:
  				/* Definitely dead */
--- 827,843 ----
  
  		CHECK_FOR_INTERRUPTS();
  
! 		if (!do_sort) {
! 			
! 			/* Since we used no scan keys, should never need to recheck */
! 			if (scan->xs_recheck)
! 				elog(ERROR, "CLUSTER does not support lossy index conditions");
! 			
! 			LockBuffer(scan->xs_cbuf, BUFFER_LOCK_SHARE);
! 		}
  
  		switch (HeapTupleSatisfiesVacuum(tuple->t_data, OldestXmin,
! 										 (!do_sort ? scan->xs_cbuf : InvalidBuffer)))
  		{
  			case HEAPTUPLE_DEAD:
  				/* Definitely dead */
***************
*** 833,839 ****
  				break;
  		}
  
! 		LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
  
  		if (isdead)
  		{
--- 879,886 ----
  				break;
  		}
  
! 		if (!do_sort)
! 			LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
  
  		if (isdead)
  		{
***************
*** 875,883 ****
  		rewrite_heap_tuple(rwstate, tuple, copiedTuple);
  
  		heap_freetuple(copiedTuple);
  	}
  
! 	index_endscan(scan);
  
  	/* Write out any remaining tuples, and fsync if needed */
  	end_heap_rewrite(rwstate);
--- 922,935 ----
  		rewrite_heap_tuple(rwstate, tuple, copiedTuple);
  
  		heap_freetuple(copiedTuple);
+ 		if (shouldfree)
+ 			heap_freetuple(tuple);
  	}
  
! 	if (do_sort)
! 		tuplesort_end(tuplesort);
! 	else
! 		index_endscan(scan);
  
  	/* Write out any remaining tuples, and fsync if needed */
  	end_heap_rewrite(rwstate);
Index: src/backend/utils/error/elog.c
===================================================================
RCS file: /home/stark/src/REPOSITORY/pgsql/src/backend/utils/error/elog.c,v
retrieving revision 1.205
diff -c -r1.205 elog.c
*** src/backend/utils/error/elog.c	9 Jul 2008 15:56:49 -0000	1.205
--- src/backend/utils/error/elog.c	31 Aug 2008 11:21:00 -0000
***************
*** 2008,2017 ****
  		debug_query_string != NULL &&
  		!edata->hide_stmt)
  	{
! 		log_line_prefix(&buf);
! 		appendStringInfoString(&buf, _("STATEMENT:  "));
! 		append_with_tabs(&buf, debug_query_string);
! 		appendStringInfoChar(&buf, '\n');
  	}
  
  #ifdef HAVE_SYSLOG
--- 2008,2023 ----
  		debug_query_string != NULL &&
  		!edata->hide_stmt)
  	{
! 		/* awful kludge */
! 		static const char *prev_string = NULL;
! 		if (prev_string != debug_query_string) {
! 			prev_string = debug_query_string;
! 
! 			log_line_prefix(&buf);
! 			appendStringInfoString(&buf, _("STATEMENT:  "));
! 			append_with_tabs(&buf, debug_query_string);
! 			appendStringInfoChar(&buf, '\n');
! 		}
  	}
  
  #ifdef HAVE_SYSLOG
Index: src/backend/utils/sort/tuplesort.c
===================================================================
RCS file: /home/stark/src/REPOSITORY/pgsql/src/backend/utils/sort/tuplesort.c,v
retrieving revision 1.86
diff -c -r1.86 tuplesort.c
*** src/backend/utils/sort/tuplesort.c	1 Aug 2008 13:16:09 -0000	1.86
--- src/backend/utils/sort/tuplesort.c	31 Aug 2008 13:09:14 -0000
***************
*** 443,448 ****
--- 443,449 ----
  static void tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex);
  static unsigned int getlen(Tuplesortstate *state, int tapenum, bool eofOK);
  static void markrunend(Tuplesortstate *state, int tapenum);
+ 
  static int comparetup_heap(const SortTuple *a, const SortTuple *b,
  				Tuplesortstate *state);
  static void copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup);
***************
*** 451,456 ****
--- 452,465 ----
  static void readtup_heap(Tuplesortstate *state, SortTuple *stup,
  			 int tapenum, unsigned int len);
  static void reversedirection_heap(Tuplesortstate *state);
+ 
+ static int comparetup_rawheap(const SortTuple *a, const SortTuple *b, 
+ 							  Tuplesortstate *state);
+ static void copytup_rawheap(Tuplesortstate *state, SortTuple *stup, void *tup);
+ static void writetup_rawheap(Tuplesortstate *state, int tapenum, SortTuple *stup);
+ static void readtup_rawheap(Tuplesortstate *state, SortTuple *stup, int tapenum, 
+ 							unsigned int len);
+ 
  static int comparetup_index_btree(const SortTuple *a, const SortTuple *b,
  				 Tuplesortstate *state);
  static int comparetup_index_hash(const SortTuple *a, const SortTuple *b,
***************
*** 462,467 ****
--- 471,477 ----
  			  int tapenum, unsigned int len);
  static void reversedirection_index_btree(Tuplesortstate *state);
  static void reversedirection_index_hash(Tuplesortstate *state);
+ 
  static int comparetup_datum(const SortTuple *a, const SortTuple *b,
  				 Tuplesortstate *state);
  static void copytup_datum(Tuplesortstate *state, SortTuple *stup, void *tup);
***************
*** 625,630 ****
--- 635,678 ----
  }
  
  Tuplesortstate *
+ tuplesort_begin_rawheap(TupleDesc tupDesc,
+ 						int nkeys, ScanKey scanKeys,
+ 						int workMem, bool randomAccess)
+ {
+ 	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+ 	MemoryContext oldcontext;
+ 
+ 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ 
+ 	AssertArg(nkeys > 0);
+ 
+ #ifdef TRACE_SORT
+ 	if (trace_sort)
+ 		elog(LOG,
+ 			 "begin tuple sort: nkeys = %d, workMem = %d, randomAccess = %c",
+ 			 nkeys, workMem, randomAccess ? 't' : 'f');
+ #endif
+ 
+ 	TRACE_POSTGRESQL_SORT_START(HEAP_SORT, false, nkeys, workMem, randomAccess);
+ 
+ 	state->nKeys = nkeys;
+ 
+ 	state->comparetup = comparetup_rawheap;
+ 	state->copytup = copytup_rawheap;
+ 	state->writetup = writetup_rawheap;
+ 	state->readtup = readtup_rawheap;
+ 	state->reversedirection = reversedirection_heap;
+ 
+ 	state->tupDesc = tupDesc;	/* assume we need not copy tupDesc */
+ 	state->scanKeys = (ScanKey) palloc(nkeys * sizeof(ScanKeyData));
+ 	memcpy(state->scanKeys, scanKeys, nkeys * sizeof(ScanKeyData));
+ 	
+ 	MemoryContextSwitchTo(oldcontext);
+ 
+ 	return state;
+ }
+ 
+ Tuplesortstate *
  tuplesort_begin_index_btree(Relation indexRel,
  							bool enforceUnique,
  							int workMem, bool randomAccess)
***************
*** 916,921 ****
--- 964,991 ----
  }
  
  /*
+  * Accept one tuple while collecting input data for sort.
+  *
+  * Note that the input data is always copied; the caller need not save it.
+  */
+ void
+ tuplesort_putrawtuple(Tuplesortstate *state, HeapTuple tup)
+ {
+ 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ 	SortTuple	stup;
+ 
+ 	/*
+ 	 * Copy the given tuple into memory we control, and decrease availMem.
+ 	 * Then call the common code.
+ 	 */
+ 	COPYTUP(state, &stup, (void *) tup);
+ 
+ 	puttuple_common(state, &stup);
+ 
+ 	MemoryContextSwitchTo(oldcontext);
+ }
+ 
+ /*
   * Accept one index tuple while collecting input data for sort.
   *
   * Note that the input tuple is always copied; the caller need not save it.
***************
*** 1414,1419 ****
--- 1484,1509 ----
  }
  
  /*
+  * Fetch the next tuple in either forward or back direction.
+  * Returns NULL if no more tuples.	If *should_free is set, the
+  * caller must pfree the returned tuple when done with it.
+  */
+ HeapTuple
+ tuplesort_getrawtuple(Tuplesortstate *state, bool forward,
+ 					  bool *should_free)
+ {
+ 	MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ 	SortTuple	stup;
+ 
+ 	if (!tuplesort_gettuple_common(state, forward, &stup, should_free))
+ 		stup.tuple = NULL;
+ 
+ 	MemoryContextSwitchTo(oldcontext);
+ 
+ 	return stup.tuple;
+ }
+ 
+ /*
   * Fetch the next index tuple in either forward or back direction.
   * Returns NULL if no more tuples.	If *should_free is set, the
   * caller must pfree the returned tuple when done with it.
***************
*** 2704,2709 ****
--- 2794,2908 ----
  
  
  /*
+  * Routines specialized for the raw on-disk HeapTuple case. These are only used
+  * when we need the visibility info for things like CLUSTER. Otherwise we use
+  * the regular *tup_heap methods, which actually manipulate MinimalTuples.
+  */
+ 
+ static int
+ comparetup_rawheap(const SortTuple *a, const SortTuple *b, Tuplesortstate *state)
+ {
+ 	ScanKey		scanKey = state->scanKeys;
+ 	HeapTuple ltup;
+ 	HeapTuple rtup;
+ 	TupleDesc	tupDesc;
+ 	int			nkey;
+ 	int32		compare;
+ 
+ 	/* Allow interrupting long sorts */
+ 	CHECK_FOR_INTERRUPTS();
+ 
+ 	/* Compare the leading sort key */
+ 	compare = inlineApplySortFunction(&scanKey->sk_func, scanKey->sk_flags,
+ 									  a->datum1, a->isnull1,
+ 									  b->datum1, b->isnull1);
+ 	if (compare != 0)
+ 		return compare;
+ 
+ 	/* Compare additional sort keys */
+ 	ltup = (HeapTuple) a->tuple;
+ 	rtup = (HeapTuple) b->tuple;
+ 	tupDesc = state->tupDesc;
+ 	scanKey++;
+ 	for (nkey = 1; nkey < state->nKeys; nkey++, scanKey++)
+ 	{
+ 		AttrNumber	attno = scanKey->sk_attno;
+ 		Datum		datum1,
+ 					datum2;
+ 		bool		isnull1,
+ 					isnull2;
+ 
+ 		datum1 = heap_getattr(ltup, attno, tupDesc, &isnull1);
+ 		datum2 = heap_getattr(rtup, attno, tupDesc, &isnull2);
+ 
+ 		compare = inlineApplySortFunction(&scanKey->sk_func, scanKey->sk_flags,
+ 										  datum1, isnull1,
+ 										  datum2, isnull2);
+ 		if (compare != 0)
+ 			return compare;
+ 	}
+ 
+ 	return 0;
+ }
+ 
+ static void
+ copytup_rawheap(Tuplesortstate *state, SortTuple *stup, void *tup)
+ {
+ 	HeapTuple	tuple = (HeapTuple) tup;
+ 
+ 	/* copy the tuple into sort storage */
+ 	stup->tuple = (void *) heap_copytuple(tuple);
+ 	USEMEM(state, GetMemoryChunkSpace(stup->tuple));
+ 	/* set up first-column key value */
+ 	stup->datum1 = heap_getattr((HeapTuple) stup->tuple,
+ 								state->scanKeys[0].sk_attno,
+ 								state->tupDesc,
+ 								&stup->isnull1);
+ }
+ 
+ static void
+ writetup_rawheap(Tuplesortstate *state, int tapenum, SortTuple *stup)
+ {
+ 	HeapTuple	tuple = (HeapTuple) stup->tuple;
+ 	tuple->t_len += HEAPTUPLESIZE; /* write out the header as well */
+ 
+ 	LogicalTapeWrite(state->tapeset, tapenum,
+ 					 tuple, tuple->t_len);
+ 
+ 	if (state->randomAccess)	/* need trailing length word? */
+ 		LogicalTapeWrite(state->tapeset, tapenum,
+ 						 tuple, sizeof(tuple->t_len));
+ 
+ 	FREEMEM(state, GetMemoryChunkSpace(tuple));
+ 	heap_freetuple(tuple);
+ }
+ 
+ static void
+ readtup_rawheap(Tuplesortstate *state, SortTuple *stup,
+ 			 int tapenum, unsigned int tuplen)
+ {
+ 	HeapTuple	tuple = (HeapTuple) palloc(tuplen);
+ 
+ 	USEMEM(state, GetMemoryChunkSpace(tuple));
+ 
+ 	tuple->t_len = tuplen - HEAPTUPLESIZE;
+ 	if (LogicalTapeRead(state->tapeset, tapenum, &tuple->t_self, tuplen-sizeof(tuplen)) != tuplen-sizeof(tuplen))
+ 		elog(ERROR, "unexpected end of data");
+ 	if (state->randomAccess)	/* need trailing length word? */
+ 		if (LogicalTapeRead(state->tapeset, tapenum, &tuplen,
+ 							sizeof(tuplen)) != sizeof(tuplen))
+ 			elog(ERROR, "unexpected end of data");
+ 
+ 	stup->tuple = tuple;
+ 	/* set up first-column key value */
+ 	stup->datum1 = heap_getattr(tuple,
+ 								state->scanKeys[0].sk_attno,
+ 								state->tupDesc,
+ 								&stup->isnull1);
+ }
+ 
+ 
+ /*
   * Routines specialized for IndexTuple case
   *
   * The btree and hash cases require separate comparison functions, but the
Index: src/backend/utils/time/tqual.c
===================================================================
RCS file: /home/stark/src/REPOSITORY/pgsql/src/backend/utils/time/tqual.c,v
retrieving revision 1.110
diff -c -r1.110 tqual.c
*** src/backend/utils/time/tqual.c	26 Mar 2008 16:20:47 -0000	1.110
--- src/backend/utils/time/tqual.c	31 Aug 2008 11:34:11 -0000
***************
*** 86,91 ****
--- 86,97 ----
  SetHintBits(HeapTupleHeader tuple, Buffer buffer,
  			uint16 infomask, TransactionId xid)
  {
+ 	/* This is primarily for CLUSTER which needs to do HTSV on tuples copied
+ 	 * into local memory long after the original buffer is gone. Under normal
+ 	 * operation we hope branch prediction will make this zero-cost */
+ 	if (!BufferIsValid(buffer))
+ 		return;
+ 
  	if (TransactionIdIsValid(xid))
  	{
  		/* NB: xid must be known committed here! */
Index: src/include/utils/tuplesort.h
===================================================================
RCS file: /home/stark/src/REPOSITORY/pgsql/src/include/utils/tuplesort.h,v
retrieving revision 1.31
diff -c -r1.31 tuplesort.h
*** src/include/utils/tuplesort.h	19 Jun 2008 00:46:06 -0000	1.31
--- src/include/utils/tuplesort.h	31 Aug 2008 10:37:38 -0000
***************
*** 55,60 ****
--- 55,63 ----
  					 int nkeys, AttrNumber *attNums,
  					 Oid *sortOperators, bool *nullsFirstFlags,
  					 int workMem, bool randomAccess);
+ extern Tuplesortstate * tuplesort_begin_rawheap(TupleDesc tupDesc,
+ 						int nkeys, ScanKey scanKeys,
+ 						int workMem, bool randomAccess);
  extern Tuplesortstate *tuplesort_begin_index_btree(Relation indexRel,
  							bool enforceUnique,
  							int workMem, bool randomAccess);
***************
*** 69,74 ****
--- 72,78 ----
  
  extern void tuplesort_puttupleslot(Tuplesortstate *state,
  					   TupleTableSlot *slot);
+ extern void tuplesort_putrawtuple(Tuplesortstate *state, HeapTuple tup);
  extern void tuplesort_putindextuple(Tuplesortstate *state, IndexTuple tuple);
  extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
  				   bool isNull);
***************
*** 77,82 ****
--- 81,88 ----
  
  extern bool tuplesort_gettupleslot(Tuplesortstate *state, bool forward,
  					   TupleTableSlot *slot);
+ extern HeapTuple tuplesort_getrawtuple(Tuplesortstate *state, bool forward,
+ 									   bool *should_free);
  extern IndexTuple tuplesort_getindextuple(Tuplesortstate *state, bool forward,
  						bool *should_free);
  extern bool tuplesort_getdatum(Tuplesortstate *state, bool forward,
#2 Martijn van Oosterhout
kleptog@svana.org
In reply to: Gregory Stark (#1)
Re: Our CLUSTER implementation is pessimal

On Mon, Sep 01, 2008 at 12:25:26AM +0100, Gregory Stark wrote:

The problem is that it does a full index scan and looks up each tuple in the
order of the index. That means it a) is doing a lot of random i/o and b) has
to access the same pages over and over again.

<snip>

a) We need some way to decide *when* to do a sort and when to do an index
scan. The planner has all this machinery but we don't really have all the
pieces handy to use it in a utility statement. This is especially important
for the case where we're doing a cluster operation on a table that's already
clustered. In that case an index scan could conceivably actually win (though I
kind of doubt it). I don't really have a solution for this.

The case I had recently was a table that was hugely bloated. 300MB data
and only 110 live rows. A cluster was instant, a seqscan/sort would
probably be much slower. A VACUUM FULL probably worse :)

Isn't there some compromise? Like, say, scanning the index to collect a
few thousand records and then sorting them, the way a bitmap index scan
does. That should be much more efficient than what we have now.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Please line up in a tree and maintain the heap invariant while
boarding. Thank you for flying nlogn airlines.

#3 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Martijn van Oosterhout (#2)
Re: Our CLUSTER implementation is pessimal

Martijn van Oosterhout wrote:

On Mon, Sep 01, 2008 at 12:25:26AM +0100, Gregory Stark wrote:

The problem is that it does a full index scan and looks up each tuple in the
order of the index. That means it a) is doing a lot of random i/o and b) has
to access the same pages over and over again.

<snip>

a) We need some way to decide *when* to do a sort and when to do an index
scan. The planner has all this machinery but we don't really have all the
pieces handy to use it in a utility statement. This is especially important
for the case where we're doing a cluster operation on a table that's already
clustered. In that case an index scan could conceivably actually win (though I
kind of doubt it). I don't really have a solution for this.

The case I had recently was a table that was hugely bloated. 300MB data
and only 110 live rows. A cluster was instant, a seqscan/sort would
probably be much slower. A VACUUM FULL probably worse :)

Isn't there some compromise? Like, say, scanning the index to collect a
few thousand records and then sorting them, the way a bitmap index scan
does. That should be much more efficient than what we have now.

Ideally we would use the planner, and the planner would choose the best
plan for a bloated table like that (it probably does, I'm not sure) as well.

However, I'm not sure how much we can trust the statistics for a table
we're about to CLUSTER. Often you run CLUSTER on a table you've just
loaded or mass-updated.
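
If we do go through the planner, the obvious (if manual) mitigation would be
to refresh the statistics just before clustering, along these lines (the
names are only an example):

    ANALYZE bigtable;
    CLUSTER bigtable_i_idx ON bigtable;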

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gregory Stark (#1)
Re: Our CLUSTER implementation is pessimal

Gregory Stark <stark@enterprisedb.com> writes:

There are a couple problems with this:

a) We need some way to decide *when* to do a sort and when to do an index
scan. The planner has all this machinery but we don't really have all the
pieces handy to use it in a utility statement.

Why not? You don't even need any quals when trying to cost a full-index
scan.

b) tuplesort no longer has the pieces needed to sort whole tuples including
visibility info. And actually even the old pieces that were removed had not
quite the right interface and behaviour. We need to preserve t_self for the
heap rewrite tools and we need to be able to use _bt_mkscankey_nodata() to
generate the scan keys that match the index.

So you just broke it irredeemably for non-btree indexes, no?

regards, tom lane

#5 Simon Riggs
simon@2ndQuadrant.com
In reply to: Gregory Stark (#1)
Re: Our CLUSTER implementation is pessimal

On Mon, 2008-09-01 at 00:25 +0100, Gregory Stark wrote:

One thing that's been annoying me for a while is that our CLUSTER
implementation is really very slow. When I say very slow I mean it's really
very very very slow.

Does this implementation work towards being able to do
CREATE INDEX ... CLUSTER TABLE
So that we can do both actions with just one sort of the data?

I think there needs to be an option to force this to do either sorts or
indexscans. On a large table you may not have the space to perform a
full table sort, plus on a multi-column index we may not accurately
predict the cost of an indexscan.

(What is the change to elog.c about?)

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#6 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#5)
Re: Our CLUSTER implementation is pessimal

Simon Riggs wrote:

I think there needs to be an option to force this to do either sorts or
indexscans.

If we use the planner, "set enable_indexscan =off" or "set
enable_sort=off" ought to work.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#7 Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#6)
Re: Our CLUSTER implementation is pessimal

On Mon, 2008-09-08 at 13:52 +0300, Heikki Linnakangas wrote:

Simon Riggs wrote:

I think there needs to be an option to force this to do either sorts or
indexscans.

If we use the planner, "set enable_indexscan =off" or "set
enable_sort=off" ought to work.

Agreed - as long as that is explicitly in the docs.

I'm wondering whether we should put a limit on the size of each temp
tablespace. This change will cause old admin jobs to break on disks that
aren't big enough for the new way of doing it.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#8 Gregory Stark
stark@enterprisedb.com
In reply to: Heikki Linnakangas (#6)
Re: Our CLUSTER implementation is pessimal

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

Simon Riggs wrote:

I think there needs to be an option to force this to do either sorts or
indexscans.

If we use the planner, "set enable_indexscan =off" or "set enable_sort=off"
ought to work.

Yeah, I've been thinking about how to use the planner to do this. It seems to
me it would be a much better solution because it would allow us to take
advantage of other access paths as well (such as other indexes) and even new
features that don't currently exist (index-only scans...).

To do that it seems to me what we would need to do is add a function
_pg_get_rawtuple_header() which returns the visibility information that HTSV
needs.

Then we need to construct an SPI query like

SELECT _pg_get_rawtuple_header(), * FROM tab ORDER BY col1, col2, col3, ...

For each tuple we'll have to deform it, and reform it using the new tuple
descriptor and just the columns excluding the header and pass that to the heap
rewrite module. Passing the header separately.

Heap rewrite would have to call HTSV on just the header (with the same hack I
put in for this patch to allow passing InvalidBuffer to HTSV).

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!

#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gregory Stark (#8)
Re: Our CLUSTER implementation is pessimal

Gregory Stark <stark@enterprisedb.com> writes:

Yeah, I've been thinking about how to use the planner to do this.

I thought the answer to that was going to be more or less "call
cost_sort() and cost_index() and compare the answers".

To do that it seems to me what we would need to do is add a function
_pg_get_rawtuple_header() which returns the visibility information that HTSV
needs.

You seem to be confusing "use the planner" with "use the executor".
All that we need here is a decision about which code path to take
within CLUSTER. We don't need to bring in boatloads of irrelevant
infrastructure --- especially not infrastructure that's going to be
fighting us every step of the way. The executor isn't designed to
return raw tuples and no magic function is going to change that.

regards, tom lane

#10 Gregory Stark
stark@enterprisedb.com
In reply to: Tom Lane (#9)
Re: Our CLUSTER implementation is pessimal

Tom Lane <tgl@sss.pgh.pa.us> writes:

Gregory Stark <stark@enterprisedb.com> writes:

Yeah, I've been thinking about how to use the planner to do this.

I thought the answer to that was going to be more or less "call
cost_sort() and cost_index() and compare the answers".

That was the way I was headed. But the idea of using the executor is looking
more and more attractive the more I think about it. It would solve this
problem as well as future-proof us against any other new features which could
speed up CLUSTER.

In particular I'm thinking of people clustering on a covering index (which
isn't as uncommon as it sounds, if you have a covering index you probably do
want to cluster it -- consider many-to-many join tables). We should be able to
do an index-only scan which might be even faster than sorting.
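
The kind of thing I have in mind (illustrative only):

    -- a many-to-many join table whose only columns are the two keys,
    -- so the primary key index covers the whole table
    CREATE TABLE a_b (
        a_id integer NOT NULL,
        b_id integer NOT NULL,
        PRIMARY KEY (a_id, b_id)
    );
    CLUSTER a_b_pkey ON a_b;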

Also, it would cleanly solve the expression index case which my current
solution can't solve (not without much larger changes to tuplesort because the
sort key isn't present in the sort tuple). Also, if your table is very bloated
it could be faster to do a bitmap index scan and sort. Or scan another newer
and better organized index instead of the index you're clustering.

Basically I feel like what I've done is reproduce a small part of the planner
and executor and it would be a cleaner and more general solution to just do it
properly.

To do that it seems to me what we would need to do is add a function
_pg_get_rawtuple_header() which returns the visibility information that HTSV
needs.

You seem to be confusing "use the planner" with "use the executor".
All that we need here is a decision about which code path to take
within CLUSTER. We don't need to bring in boatloads of irrelevant
infrastructure --- especially not infrastructure that's going to be
fighting us every step of the way. The executor isn't designed to
return raw tuples and no magic function is going to change that.

Actually, thinking about it, couldn't we just add a new system column to do this?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!

#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gregory Stark (#10)
Re: Our CLUSTER implementation is pessimal

Gregory Stark <stark@enterprisedb.com> writes:

In particular I'm thinking of people clustering on a covering index (which
isn't as uncommon as it sounds, if you have a covering index you probably do
want to cluster it -- consider many-to-many join tables). We should be able to
do an index-only scan which might be even faster than sorting.

[ scratches head... ] You need *all* the data from the heap. Or by
"covering index" do you mean an index that contains the entire table
contents? Doesn't really sound like a case we need to focus on; or
at least this version of clustering isn't what it needs, it wants an
implementation where the table and the index are the same thing.

regards, tom lane

#12 Bruce Momjian
bruce@momjian.us
In reply to: Gregory Stark (#1)
Re: Our CLUSTER implementation is pessimal

Added to TODO:

Improve CLUSTER performance by sorting to reduce random I/O

* http://archives.postgresql.org/pgsql-hackers/2008-08/msg01371.php

---------------------------------------------------------------------------

Gregory Stark wrote:

One thing that's been annoying me for a while is that our CLUSTER
implementation is really very slow. When I say very slow I mean it's really
very very very slow.

The problem is that it does a full index scan and looks up each tuple in the
order of the index. That means it a) is doing a lot of random i/o and b) has
to access the same pages over and over again.

Our query planner knows better and normally does a sequential heap scan and
sort for queries of this type. But CLUSTER has to use the index.

Just the other day I wanted to randomize the order of a table so I added a
random column, indexed it and started cluster running. After 6 hours I looked
at how far it had gotten and found it was only about 20% done rewriting the
heap (not including rebuilding the indexes). I cancelled it and built a new
table in the desired order with CREATE TABLE AS ... ORDER BY in 45 minutes!

There are a couple problems with this:

a) We need some way to decide *when* to do a sort and when to do an index
scan. The planner has all this machinery but we don't really have all the
pieces handy to use it in a utility statement. This is especially important
for the case where we're doing a cluster operation on a table that's already
clustered. In that case an index scan could conceivably actually win (though I
kind of doubt it). I don't really have a solution for this.

b) tuplesort no longer has the pieces needed to sort whole tuples including
visibility info. And actually even the old pieces that were removed had not
quite the right interface and behaviour. We need to preserve t_self for the
heap rewrite tools and we need to be able to use _bt_mkscankey_nodata() to
generate the scan keys that match the index. The cleanest approach I think is
to just add back in the old raw tuple methods to tuplesort and tweak them to
preserve t_self and take Scankeys instead of attnums etc.

c) expression indexes are a problem. We would have to start with constructing
new tuples to hold the expression values but retain the entire original heap
tuple. Perhaps pass both an indextuple and a heaptuple to tuplesort and store
both in the sorttuple? For now I figure we can just punt on expression indexes
and always do an index scan.

I tested out the idea and it works reasonably well. The code might need some
cleanup, including, I think, refactoring cluster_rel() and possibly tweaking the
interface to tuplesort. But I'm fairly happy with it.

The results clustering a 2.2GB table with 2,000,000 rows on my 1.5GB laptop
using maintenance_work_mem of 768MB and work_mem of 256MB:

postgres=# \d x

Table "public.x"
Column | Type | Modifiers
--------+------------------+-----------
i | integer |
r | double precision |
repeat | text |
Indexes:
"xi" btree (i)
"xi2" btree ((i + 0))
"xir" btree (r) CLUSTER

postgresql=# CLUSTER xi ON x;
CLUSTER
Time: 736029.322 ms

postgresql=# CLUSTER xir ON x;
CLUSTER
Time: 723610.507 ms

postgresql=# CLUSTER xi2 ON x;
CLUSTER
Time: 12842793.346 ms

That's 3.5 hours for traditional cluster and 12 minutes for cluster using a
sort...

[ Attachment, skipping... ]

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +