hash index improving v3
There are minor changes against the previous version
( http://archives.postgresql.org/pgsql-hackers/2008-07/msg01183.php ):
* merged branch master (Aug 16) into the patch
* cleaned up the code and added some comments
Performance results are here:
http://wiki.postgresql.org/wiki/Gsoc08-hashindex
It seems the hash index is a little better on index creation and selection,
but that may be within the range of noise; I'm not sure.
I'd like to try it with a bigger dataset (e.g. a 10GB table), but there is
not enough space on my machine.
Anyone interested can test on a bigger data set.
--
Best Regards,
Xiao Meng
DKERC, Harbin Institute of Technology, China
Gtalk: mx.cogito@gmail.com
MSN: cnEnder@live.com
http://xiaomeng.yo2.cn
Attachments:
hash-v3.patch (text/x-diff)
*** a/src/backend/access/hash/hash.c
--- b/src/backend/access/hash/hash.c
***************
*** 129,135 **** hashbuildCallback(Relation index,
IndexTuple itup;
/* form an index tuple and point it at the heap tuple */
! itup = index_form_tuple(RelationGetDescr(index), values, isnull);
itup->t_tid = htup->t_self;
/* Hash indexes don't index nulls, see notes in hashinsert */
--- 129,135 ----
IndexTuple itup;
/* form an index tuple and point it at the heap tuple */
! itup = _hash_form_tuple(index, values,isnull);
itup->t_tid = htup->t_self;
/* Hash indexes don't index nulls, see notes in hashinsert */
***************
*** 153,160 **** hashbuildCallback(Relation index,
/*
* hashinsert() -- insert an index tuple into a hash table.
*
! * Hash on the index tuple's key, find the appropriate location
! * for the new tuple, and put it there.
*/
Datum
hashinsert(PG_FUNCTION_ARGS)
--- 153,160 ----
/*
* hashinsert() -- insert an index tuple into a hash table.
*
! * Hash on the heap tuple's key, form an index tuple with hash code.
! * Find the appropriate location for the new tuple, and put it there.
*/
Datum
hashinsert(PG_FUNCTION_ARGS)
***************
*** 171,177 **** hashinsert(PG_FUNCTION_ARGS)
IndexTuple itup;
/* generate an index tuple */
! itup = index_form_tuple(RelationGetDescr(rel), values, isnull);
itup->t_tid = *ht_ctid;
/*
--- 171,177 ----
IndexTuple itup;
/* generate an index tuple */
! itup = _hash_form_tuple(rel, values, isnull);
itup->t_tid = *ht_ctid;
/*
***************
*** 211,218 **** hashgettuple(PG_FUNCTION_ARGS)
OffsetNumber offnum;
bool res;
! /* Hash indexes are never lossy (at the moment anyway) */
! scan->xs_recheck = false;
/*
* We hold pin but not lock on current buffer while outside the hash AM.
--- 211,218 ----
OffsetNumber offnum;
bool res;
! /* Hash indexes may be lossy since we store only the hash code */
! scan->xs_recheck = true;
/*
* We hold pin but not lock on current buffer while outside the hash AM.
*** a/src/backend/access/hash/hashinsert.c
--- b/src/backend/access/hash/hashinsert.c
***************
*** 44,60 **** _hash_doinsert(Relation rel, IndexTuple itup)
uint32 hashkey;
Bucket bucket;
Datum datum;
- bool isnull;
/*
! * Compute the hash key for the item. We do this first so as not to need
! * to hold any locks while running the hash function.
*/
if (rel->rd_rel->relnatts != 1)
elog(ERROR, "hash indexes support only one index key");
! datum = index_getattr(itup, 1, RelationGetDescr(rel), &isnull);
! Assert(!isnull);
! hashkey = _hash_datum2hashkey(rel, datum);
/* compute item size too */
itemsz = IndexTupleDSize(*itup);
--- 44,58 ----
uint32 hashkey;
Bucket bucket;
Datum datum;
/*
! * Get the hash key for the item.
*/
if (rel->rd_rel->relnatts != 1)
elog(ERROR, "hash indexes support only one index key");
!
! datum = _hash_get_datum(itup);
! hashkey = DatumGetUInt32(datum);
/* compute item size too */
itemsz = IndexTupleDSize(*itup);
***************
*** 197,207 **** _hash_pgaddtup(Relation rel,
{
OffsetNumber itup_off;
Page page;
_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
page = BufferGetPage(buf);
! itup_off = OffsetNumberNext(PageGetMaxOffsetNumber(page));
if (PageAddItem(page, (Item) itup, itemsize, itup_off, false, false)
== InvalidOffsetNumber)
elog(ERROR, "failed to add index item to \"%s\"",
--- 195,210 ----
{
OffsetNumber itup_off;
Page page;
+ Datum datum;
+ uint32 hashkey;
_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
page = BufferGetPage(buf);
! datum = _hash_get_datum(itup);
! hashkey = DatumGetUInt32(datum);
! itup_off = _hash_binsearch(page, hashkey);
!
if (PageAddItem(page, (Item) itup, itemsize, itup_off, false, false)
== InvalidOffsetNumber)
elog(ERROR, "failed to add index item to \"%s\"",
*** a/src/backend/access/hash/hashpage.c
--- b/src/backend/access/hash/hashpage.c
***************
*** 348,358 **** _hash_metapinit(Relation rel, double num_tuples)
* Determine the target fill factor (in tuples per bucket) for this index.
* The idea is to make the fill factor correspond to pages about as full
* as the user-settable fillfactor parameter says. We can compute it
! * exactly if the index datatype is fixed-width, but for var-width there's
! * some guessing involved.
*/
! data_width = get_typavgwidth(RelationGetDescr(rel)->attrs[0]->atttypid,
! RelationGetDescr(rel)->attrs[0]->atttypmod);
item_width = MAXALIGN(sizeof(IndexTupleData)) + MAXALIGN(data_width) +
sizeof(ItemIdData); /* include the line pointer */
ffactor = RelationGetTargetPageUsage(rel, HASH_DEFAULT_FILLFACTOR) / item_width;
--- 348,356 ----
* Determine the target fill factor (in tuples per bucket) for this index.
* The idea is to make the fill factor correspond to pages about as full
* as the user-settable fillfactor parameter says. We can compute it
! * exactly since the index datatype (i.e., a uint32 hash code) is fixed-width.
*/
! data_width = 4;
item_width = MAXALIGN(sizeof(IndexTupleData)) + MAXALIGN(data_width) +
sizeof(ItemIdData); /* include the line pointer */
ffactor = RelationGetTargetPageUsage(rel, HASH_DEFAULT_FILLFACTOR) / item_width;
***************
*** 785,791 **** _hash_splitbucket(Relation rel,
OffsetNumber omaxoffnum;
Page opage;
Page npage;
! TupleDesc itupdesc = RelationGetDescr(rel);
/*
* It should be okay to simultaneously write-lock pages from each bucket,
--- 783,789 ----
OffsetNumber omaxoffnum;
Page opage;
Page npage;
!
/*
* It should be okay to simultaneously write-lock pages from each bucket,
***************
*** 848,861 **** _hash_splitbucket(Relation rel,
/*
* Re-hash the tuple to determine which bucket it now belongs in.
*
! * It is annoying to call the hash function while holding locks, but
! * releasing and relocking the page for each tuple is unappealing too.
*/
itup = (IndexTuple) PageGetItem(opage, PageGetItemId(opage, ooffnum));
! datum = index_getattr(itup, 1, itupdesc, &null);
! Assert(!null);
!
! bucket = _hash_hashkey2bucket(_hash_datum2hashkey(rel, datum),
maxbucket, highmask, lowmask);
if (bucket == nbucket)
--- 846,856 ----
/*
* Re-hash the tuple to determine which bucket it now belongs in.
*
! * We needn't call the hash function since we store the hash code in the index tuple.
*/
itup = (IndexTuple) PageGetItem(opage, PageGetItemId(opage, ooffnum));
! datum = _hash_get_datum(itup);
! bucket = _hash_hashkey2bucket(DatumGetUInt32(datum),
maxbucket, highmask, lowmask);
if (bucket == nbucket)
*** a/src/backend/access/hash/hashsearch.c
--- b/src/backend/access/hash/hashsearch.c
***************
*** 178,183 **** _hash_first(IndexScanDesc scan, ScanDirection dir)
--- 178,184 ----
hashkey = _hash_datum2hashkey_type(rel, cur->sk_argument,
cur->sk_subtype);
+ so->hashso_sk_hash = hashkey;
/*
* Acquire shared split lock so we can compute the target bucket safely
* (see README).
***************
*** 289,370 **** _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
* continue to step through tuples until: 1) we get to the end of the
* bucket chain or 2) we find a valid tuple.
*/
! do
! {
! switch (dir)
{
! case ForwardScanDirection:
! if (offnum != InvalidOffsetNumber)
! offnum = OffsetNumberNext(offnum); /* move forward */
! else
! offnum = FirstOffsetNumber; /* new page */
!
! while (offnum > maxoff)
! {
! /*
! * either this page is empty (maxoff ==
! * InvalidOffsetNumber) or we ran off the end.
! */
! _hash_readnext(rel, &buf, &page, &opaque);
! if (BufferIsValid(buf))
! {
! maxoff = PageGetMaxOffsetNumber(page);
! offnum = FirstOffsetNumber;
! }
! else
! {
! /* end of bucket */
! maxoff = offnum = InvalidOffsetNumber;
! break; /* exit while */
! }
! }
! break;
!
! case BackwardScanDirection:
! if (offnum != InvalidOffsetNumber)
! offnum = OffsetNumberPrev(offnum); /* move back */
! else
! offnum = maxoff; /* new page */
!
! while (offnum < FirstOffsetNumber)
! {
! /*
! * either this page is empty (offnum ==
! * InvalidOffsetNumber) or we ran off the end.
! */
! _hash_readprev(rel, &buf, &page, &opaque);
! if (BufferIsValid(buf))
! maxoff = offnum = PageGetMaxOffsetNumber(page);
! else
! {
! /* end of bucket */
! maxoff = offnum = InvalidOffsetNumber;
! break; /* exit while */
! }
! }
! break;
!
! default:
! /* NoMovementScanDirection */
! /* this should not be reached */
! break;
}
! /* we ran off the end of the world without finding a match */
! if (offnum == InvalidOffsetNumber)
{
! *bufP = so->hashso_curbuf = InvalidBuffer;
! ItemPointerSetInvalid(current);
! return false;
}
!
! /* get ready to check this tuple */
! itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
! } while (!_hash_checkqual(scan, itup));
!
! /* if we made it to here, we've found a valid tuple */
! blkno = BufferGetBlockNumber(buf);
! *bufP = so->hashso_curbuf = buf;
! ItemPointerSet(current, blkno, offnum);
! return true;
}
--- 290,340 ----
* continue to step through tuples until: 1) we get to the end of the
* bucket chain or 2) we find a valid tuple.
*/
! for (;;)
! {
! if (offnum == InvalidOffsetNumber)
! {
! /*
! * This is the first time we're scanning this particular
! * page of the bucket, so jump to the right spot via
! * binary search.
! */
! offnum = _hash_binsearch(page, so->hashso_sk_hash);
! }
! else
{
! /* Advance to the next tuple */
! offnum = OffsetNumberNext(offnum);
}
! if (offnum <= maxoff) itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
! if (offnum <= maxoff && _hash_checkqual(scan, itup))
{
! /* Found a matching tuple */
! *bufP = so->hashso_curbuf = buf;
! ItemPointerSet(current, BufferGetBlockNumber(buf), offnum);
! return true;
}
! else
! {
! /* No more matches on this page, so go on to next page */
! if (ScanDirectionIsForward(dir))
! _hash_readnext(rel, &buf, &page, &opaque);
! else
! _hash_readprev(rel, &buf, &page, &opaque);
!
! if (BufferIsValid(buf))
! {
! maxoff = PageGetMaxOffsetNumber(page);
! offnum = InvalidOffsetNumber;
! }
! else
! {
! /* Ran out of pages, so we're done */
! *bufP = so->hashso_curbuf = InvalidBuffer;
! ItemPointerSetInvalid(current);
! return false;
! }
! }
! }
}
*** a/src/backend/access/hash/hashutil.c
--- b/src/backend/access/hash/hashutil.c
***************
*** 20,53 ****
#include "executor/execdebug.h"
#include "storage/bufmgr.h"
#include "utils/lsyscache.h"
/*
* _hash_checkqual -- does the index tuple satisfy the scan conditions?
*/
bool
_hash_checkqual(IndexScanDesc scan, IndexTuple itup)
{
- TupleDesc tupdesc = RelationGetDescr(scan->indexRelation);
ScanKey key = scan->keyData;
int scanKeySize = scan->numberOfKeys;
IncrIndexProcessed();
while (scanKeySize > 0)
{
- Datum datum;
- bool isNull;
Datum test;
!
! datum = index_getattr(itup,
! key->sk_attno,
! tupdesc,
! &isNull);
/* assume sk_func is strict */
- if (isNull)
- return false;
if (key->sk_flags & SK_ISNULL)
return false;
--- 20,59 ----
#include "executor/execdebug.h"
#include "storage/bufmgr.h"
#include "utils/lsyscache.h"
+ #include "utils/memutils.h"
+ #include "catalog/pg_type.h"
/*
+ * hardcoded tuple descriptors. see include/access/hash.h
+ */
+ static FormData_pg_attribute Desc_hash[1] = {Schema_Hash};
+ /*
* _hash_checkqual -- does the index tuple satisfy the scan conditions?
*/
bool
_hash_checkqual(IndexScanDesc scan, IndexTuple itup)
{
ScanKey key = scan->keyData;
int scanKeySize = scan->numberOfKeys;
+ Datum datum;
+ HashScanOpaque so = scan->opaque;
IncrIndexProcessed();
+
+ datum = _hash_get_datum(itup);
+ if( so->hashso_sk_hash != DatumGetUInt32(datum) )
+ return false;
+ key++;
+ scanKeySize--;
while (scanKeySize > 0)
{
Datum test;
!
! datum = _hash_get_datum(itup);
/* assume sk_func is strict */
if (key->sk_flags & SK_ISNULL)
return false;
***************
*** 221,223 **** hashoptions(PG_FUNCTION_ARGS)
--- 227,326 ----
PG_RETURN_BYTEA_P(result);
PG_RETURN_NULL();
}
+
+ /*
+ * _get_hash_desc - get the hash index tuple descriptor
+ *
+ * The hash index tuple descriptor is a hard-coded tuple descriptor with only int32 attribute.
+ */
+ TupleDesc _get_hash_desc()
+ {
+ static TupleDesc hashdesc = NULL;
+
+ /* Already done? */
+ if (hashdesc == NULL){
+ /*
+ * It's the same with BuildHardcodedDescriptor() in relcache.c to build
+ * a hard-coded tuple descriptor.
+ */
+ MemoryContext oldcxt;
+
+ oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
+
+ hashdesc = CreateTemplateTupleDesc(1, false);
+ hashdesc->tdtypeid = INT4OID;
+ hashdesc->tdtypmod = -1;
+ memcpy(hashdesc->attrs[0], &Desc_hash[0], ATTRIBUTE_TUPLE_SIZE);
+ hashdesc->attrs[0]->attcacheoff = 0;
+
+ MemoryContextSwitchTo(oldcxt);
+ }
+
+ return hashdesc;
+
+ }
+
+ /*
+ * _hash_get_datum - get the hash index tuple's first column datum (i.e. hash code)
+ */
+ Datum _hash_get_datum(IndexTuple itup)
+ {
+ return fetch_att( (char *) (itup) + IndexInfoFindDataOffset((itup)->t_info) ,
+ true,
+ 4);
+
+ }
+ /*
+ * _hash_form_tuple - form a tuple with hash code only
+ */
+ IndexTuple _hash_form_tuple(Relation rel, Datum* values, bool* isnull)
+ {
+ TupleDesc hashdesc;
+ IndexTuple itup;
+ uint32 hashkey;
+
+ hashdesc = _get_hash_desc();
+ hashkey = _hash_datum2hashkey(rel, values[0]);
+ values[0] = UInt32GetDatum(hashkey);
+ itup = index_form_tuple(hashdesc, values, isnull);
+ return itup;
+ }
+
+ /*
+ * _hash_binsearch - Return the offset number in the page where the specified hash value
+ * should be located.
+ *
+ * The return value might exceed the page's max offset
+ * if the hash value is greater than any hash in the page.
+ */
+ OffsetNumber
+ _hash_binsearch(Page page, uint32 hash_value)
+ {
+ OffsetNumber upper;
+ OffsetNumber lower;
+
+ upper = PageGetMaxOffsetNumber(page) + 1;
+ lower = FirstOffsetNumber;
+
+ while (upper > lower)
+ {
+ IndexTuple pos;
+ OffsetNumber off;
+ uint32 hashkey;
+ Datum datum;
+ bool isNull;
+
+ off = (upper + lower) / 2;
+ Assert(OffsetNumberIsValid(off));
+
+ pos = (IndexTuple) PageGetItem(page, PageGetItemId(page, off));
+ datum = _hash_get_datum(pos);
+ hashkey = DatumGetUInt32(datum);
+ if (hashkey < hash_value)
+ lower = off + 1;
+ else
+ upper = off;
+ }
+
+ return upper;
+ }
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
***************
*** 456,465 **** static int comparetup_index_btree(const SortTuple *a, const SortTuple *b,
--- 456,468 ----
static int comparetup_index_hash(const SortTuple *a, const SortTuple *b,
Tuplesortstate *state);
static void copytup_index(Tuplesortstate *state, SortTuple *stup, void *tup);
+ static void copytup_index_hash(Tuplesortstate *state, SortTuple *stup, void *tup);
static void writetup_index(Tuplesortstate *state, int tapenum,
SortTuple *stup);
static void readtup_index(Tuplesortstate *state, SortTuple *stup,
int tapenum, unsigned int len);
+ static void readtup_index_hash(Tuplesortstate *state, SortTuple *stup,
+ int tapenum, unsigned int len);
static void reversedirection_index_btree(Tuplesortstate *state);
static void reversedirection_index_hash(Tuplesortstate *state);
static int comparetup_datum(const SortTuple *a, const SortTuple *b,
***************
*** 682,690 **** tuplesort_begin_index_hash(Relation indexRel,
state->nKeys = 1; /* Only one sort column, the hash code */
state->comparetup = comparetup_index_hash;
! state->copytup = copytup_index;
state->writetup = writetup_index;
! state->readtup = readtup_index;
state->reversedirection = reversedirection_index_hash;
state->indexRel = indexRel;
--- 685,693 ----
state->nKeys = 1; /* Only one sort column, the hash code */
state->comparetup = comparetup_index_hash;
! state->copytup = copytup_index_hash;
state->writetup = writetup_index;
! state->readtup = readtup_index_hash;
state->reversedirection = reversedirection_index_hash;
state->indexRel = indexRel;
***************
*** 2822,2830 **** comparetup_index_hash(const SortTuple *a, const SortTuple *b,
Tuplesortstate *state)
{
/*
! * It's slightly annoying to redo the hash function each time, although
! * most hash functions ought to be cheap. Is it worth having a variant
! * tuple storage format so we can store the hash code?
*/
uint32 hash1;
uint32 hash2;
--- 2825,2831 ----
Tuplesortstate *state)
{
/*
! * We needn't redo the hash function each time since we've stored the hash code.
*/
uint32 hash1;
uint32 hash2;
***************
*** 2836,2845 **** comparetup_index_hash(const SortTuple *a, const SortTuple *b,
/* Compute hash codes and mask off bits we don't want to sort by */
Assert(!a->isnull1);
! hash1 = DatumGetUInt32(FunctionCall1(state->hash_proc, a->datum1))
& state->hash_mask;
Assert(!b->isnull1);
! hash2 = DatumGetUInt32(FunctionCall1(state->hash_proc, b->datum1))
& state->hash_mask;
if (hash1 > hash2)
--- 2837,2846 ----
/* Compute hash codes and mask off bits we don't want to sort by */
Assert(!a->isnull1);
! hash1 = DatumGetUInt32(a->datum1)
& state->hash_mask;
Assert(!b->isnull1);
! hash2 = DatumGetUInt32(b->datum1)
& state->hash_mask;
if (hash1 > hash2)
***************
*** 2894,2899 **** copytup_index(Tuplesortstate *state, SortTuple *stup, void *tup)
--- 2895,2916 ----
}
static void
+ copytup_index_hash(Tuplesortstate *state, SortTuple *stup, void *tup)
+ {
+ IndexTuple tuple = (IndexTuple) tup;
+ unsigned int tuplen = IndexTupleSize(tuple);
+ IndexTuple newtuple;
+
+ /* copy the tuple into sort storage */
+ newtuple = (IndexTuple) palloc(tuplen);
+ memcpy(newtuple, tuple, tuplen);
+ USEMEM(state, GetMemoryChunkSpace(newtuple));
+ stup->tuple = (void *) newtuple;
+ /* set up first-column key value(i.e. hash code) */
+ stup->datum1 = _hash_get_datum(newtuple);
+ stup->isnull1 = false;
+ }
+ static void
writetup_index(Tuplesortstate *state, int tapenum, SortTuple *stup)
{
IndexTuple tuple = (IndexTuple) stup->tuple;
***************
*** 2936,2941 **** readtup_index(Tuplesortstate *state, SortTuple *stup,
--- 2953,2979 ----
}
static void
+ readtup_index_hash(Tuplesortstate *state, SortTuple *stup,
+ int tapenum, unsigned int len)
+ {
+ unsigned int tuplen = len - sizeof(unsigned int);
+ IndexTuple tuple = (IndexTuple) palloc(tuplen);
+
+ USEMEM(state, GetMemoryChunkSpace(tuple));
+ if (LogicalTapeRead(state->tapeset, tapenum, (void *) tuple,
+ tuplen) != tuplen)
+ elog(ERROR, "unexpected end of data");
+ if (state->randomAccess) /* need trailing length word? */
+ if (LogicalTapeRead(state->tapeset, tapenum, (void *) &tuplen,
+ sizeof(tuplen)) != sizeof(tuplen))
+ elog(ERROR, "unexpected end of data");
+ stup->tuple = (void *) tuple;
+ /* set up first-column key value(i.e. hash code) */
+ stup->datum1 = _hash_get_datum(tuple);
+ stup->isnull1 = false;
+ }
+
+ static void
reversedirection_index_btree(Tuplesortstate *state)
{
ScanKey scanKey = state->indexScanKey;
*** a/src/include/access/hash.h
--- b/src/include/access/hash.h
***************
*** 100,105 **** typedef struct HashScanOpaqueData
--- 100,107 ----
/* Current and marked position of the scan */
ItemPointerData hashso_curpos;
ItemPointerData hashso_mrkpos;
+ /* Hash value of the scan key */
+ uint32 hashso_sk_hash;
} HashScanOpaqueData;
typedef HashScanOpaqueData *HashScanOpaque;
***************
*** 227,233 **** typedef HashMetaPageData *HashMetaPage;
*/
#define HASHPROC 1
!
/* public routines */
extern Datum hashbuild(PG_FUNCTION_ARGS);
--- 229,239 ----
*/
#define HASHPROC 1
! /*
! * hard-coded hash desc
! */
! #define Schema_Hash \
! { 0, {"hashcode"}, 23, -1, 4, 1, 0, -1, -1, true, 'p', 'i', false, false, false, true, 0 }
/* public routines */
extern Datum hashbuild(PG_FUNCTION_ARGS);
***************
*** 330,335 **** extern Bucket _hash_hashkey2bucket(uint32 hashkey, uint32 maxbucket,
--- 336,345 ----
uint32 highmask, uint32 lowmask);
extern uint32 _hash_log2(uint32 num);
extern void _hash_checkpage(Relation rel, Buffer buf, int flags);
+ extern TupleDesc _get_hash_desc();
+ extern Datum _hash_get_datum(IndexTuple itup);
+ extern IndexTuple _hash_form_tuple(Relation rel, Datum* values, bool* isnull);
+ OffsetNumber _hash_binsearch(Page page, uint32 hash_value);
/* hash.c */
extern void hash_redo(XLogRecPtr lsn, XLogRecord *record);
With the help of David Fetter, you can get a copy via
git clone http://git.postgresql.org/git/~davidfetter/hash/.git
It's in the branch gsoc-hash.
Thank you, David.
--
Best Regards,
Xiao Meng
DKERC, Harbin Institute of Technology, China
Gtalk: mx.cogito@gmail.com
MSN: cnEnder@live.com
http://xiaomeng.yo2.cn
On Mon, 2008-08-18 at 09:46 +0800, Xiao Meng wrote:
There are minor changes against the previous version
( http://archives.postgresql.org/pgsql-hackers/2008-07/msg01183.php ):
* merged branch master (Aug 16) into the patch
* cleaned up the code and added some comments
Performance results are here:
http://wiki.postgresql.org/wiki/Gsoc08-hashindex
It seems the hash index is a little better on index creation and selection,
but that may be within the range of noise; I'm not sure.
I'd like to try it with a bigger dataset (e.g. a 10GB table), but
there is not enough space on my machine.
Anyone interested can test on a bigger data set.
You don't give the text of the query used to do these performance tests,
so I can't validate your test results.
Right now it seems strange that the index is larger than a btree, yet
the performance tests show that 3 times as much I/O was used accessing
the btree.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Wed, Sep 3, 2008 at 10:06 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
It seems hash index is a little better on index creation and
selection.
But maybe it's in the range of noise, I'm not sure.
I'd like to try it with a bigger dataset (e.g. table with 10GB) but
there is not enough space in my computer.
Anyone interested can test on a bigger data set.
I tried it earlier on a 500M row table and found a few bugs. In
particular, it doesn't seem like recheck is happening and the
performance/sizing is a bit *interesting*. I'll post stats tomorrow
when I'm in the office.
--
Jonah H. Harris, Senior DBA
myYearbook.com
Simon Riggs <simon@2ndQuadrant.com> writes:
Right now it seems strange that the index is larger than a btree, yet
the performance tests show that 3 times as much I/O was used accessing
the btree.
Well, in an ideal world a hash index probe is O(1) while a btree probe
is O(log N), so that result is exactly what hash proponents would hope
for. Whether it's real or not is another question, but it could be.
regards, tom lane
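Tom's asymptotic point can be illustrated with a back-of-the-envelope probe-depth calculation (a sketch only: the fanout of 256 is an assumed figure, and real btree descents and hash bucket chains change the constants):

```python
import math

def btree_probe_depth(nrows, fanout=256):
    """Approximate page reads for a btree descent: O(log N)."""
    return max(1, math.ceil(math.log(nrows, fanout)))

def hash_probe_depth(nrows):
    """An ideal hash probe touches a constant number of pages: O(1)."""
    return 1

for n in (10**6, 10**8):
    print(n, btree_probe_depth(n), hash_probe_depth(n))
```

Even at 100M rows the difference is only a handful of page reads per probe, which is why the observed 3x I/O gap could plausibly be real but is also easy to confuse with caching effects.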
I performed a code review; see my comments below.
pgsql/src/backend/access/hash/hashpage.c
<http://reviewdemo.postgresql.org/r/26/#comment31>
Use sizeof() or something better than the hard-coded 4.
pgsql/src/backend/access/hash/hashpage.c
<http://reviewdemo.postgresql.org/r/26/#comment32>
New empty line.
pgsql/src/backend/access/hash/hashutil.c
<http://reviewdemo.postgresql.org/r/26/#comment27>
It would be better to remove the #define from hash.h and set it up there
directly.
pgsql/src/backend/access/hash/hashutil.c
<http://reviewdemo.postgresql.org/r/26/#comment28>
Why not return uint32 directly?
pgsql/src/backend/access/hash/hashutil.c
<http://reviewdemo.postgresql.org/r/26/#comment22>
Change this to the correct return type.
pgsql/src/backend/access/hash/hashutil.c
<http://reviewdemo.postgresql.org/r/26/#comment29>
What about null values?
pgsql/src/backend/access/hash/hashutil.c
<http://reviewdemo.postgresql.org/r/26/#comment30>
I'm not sure that modifying the values array is safe. Please recheck.
pgsql/src/backend/access/hash/hashutil.c
<http://reviewdemo.postgresql.org/r/26/#comment23>
The return value is not very clear. I would prefer to return
InvalidOffsetNumber when no record is found. However, it seems you also
use the result in PageAddItem to put the item at the correct ordered
position. A better explanation would help in understanding how it works.
pgsql/src/backend/access/hash/hashutil.c
<http://reviewdemo.postgresql.org/r/26/#comment26>
It could return FirstOffsetNumber in the case when nothing of interest
is on the page.
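The insertion-point semantics being questioned here can be sketched in a few lines (a hypothetical stand-in for `_hash_binsearch`, using a Python list of ascending hash codes in place of page items; offsets are 1-based, mirroring PostgreSQL's FirstOffsetNumber):

```python
def hash_binsearch(hashkeys, hash_value):
    """Return the 1-based offset where hash_value should be located.

    The result can be maxoff + 1 when hash_value is greater than every
    stored hash, and it is 1 (FirstOffsetNumber) when the page is empty
    or every stored hash is >= hash_value -- the two boundary cases the
    review comments raise.
    """
    lower, upper = 1, len(hashkeys) + 1   # FirstOffsetNumber, maxoff + 1
    while upper > lower:
        off = (upper + lower) // 2
        if hashkeys[off - 1] < hash_value:  # the list itself is 0-based
            lower = off + 1
        else:
            upper = off
    return upper
```

This dual use (lookup position for scans, insertion position for PageAddItem) is why the patch does not return InvalidOffsetNumber on a miss.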
pgsql/src/include/access/hash.h
<http://reviewdemo.postgresql.org/r/26/#comment34>
Why not define a new datatype, for example HashKey, instead of uint32?
pgsql/src/include/access/hash.h
<http://reviewdemo.postgresql.org/r/26/#comment33>
This is not a good place for it. See my other comment.
--------------
You also forgot to bump the hash index version in the meta page.
Zdenek
On Thu, Sep 4, 2008 at 10:06 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
You don't give the text of the query used to do these performance tests,
so I can't validate your test results.
The attachment is the code used to generate the test data.
Just:
$ g++ my-permu-code.cpp
$ ./a.out > /tmp/words
--
Best Regards,
Xiao Meng
DKERC, Harbin Institute of Technology, China
Gtalk: mx.cogito@gmail.com
MSN: cnEnder@live.com
http://xiaomeng.yo2.cn
Attachments:
On Thu, 2008-09-04 at 21:03 +0800, Xiao Meng wrote:
On Thu, Sep 4, 2008 at 10:06 AM, Simon Riggs <simon@2ndquadrant.com>
wrote:
You don't give the text of the query used to do these
performance tests, so I can't validate your test results.
The attachment is the code to generate the text.
Just
$g++ my-permu-code.cpp
$./a.out > /tmp/words
That generates the data, fine. What about the test query?
You say you are running this command line
pgbench -U postgres -n -f /tmp/query.sql dict
Where is query.sql?
It doesn't seem a big ask to be shown the SQL statement that is being
executed, so we understand what is going on.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
I performed code review and see my comments.
Thanks for the comments. I've incorporated all of these into an updated
patch that I'm preparing, except for
Why not define new datatype for example HashKey instead of uint32?
This seems like a good idea, but I think we should do it as a separate,
cosmetic-cleanup patch. It'll touch a lot of parts of access/hash/ that
the current patch doesn't need to change, and thus complicate reviewing.
regards, tom lane
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
pgsql/src/backend/access/hash/hashutil.c
<http://reviewdemo.postgresql.org/r/26/#comment27>
It would be better remove #define from hash.h and setup it there
directly.
Actually, I don't like this aspect of the patch one bit: it means that
the system catalogs are lying about what is stored in the index, which
seems likely to break something somewhere, either now or down the road.
I think the correct way to handle this is to make the pg_attribute entries
(and hence the index's relation descriptor) accurately match what is
stored in the index. For testing purposes I propose this crude hack
in catalog/index.c's ConstructTupleDescriptor():
*** src/backend/catalog/index.c.orig Mon Aug 25 18:42:32 2008
--- src/backend/catalog/index.c Thu Sep 4 16:20:12 2008
***************
*** 133,138 ****
--- 133,139 ----
Form_pg_attribute to = indexTupDesc->attrs[i];
HeapTuple tuple;
Form_pg_type typeTup;
+ Form_pg_opclass opclassTup;
Oid keyType;
if (atnum != 0)
***************
*** 240,246 ****
if (!HeapTupleIsValid(tuple))
elog(ERROR, "cache lookup failed for opclass %u",
classObjectId[i]);
! keyType = ((Form_pg_opclass) GETSTRUCT(tuple))->opckeytype;
ReleaseSysCache(tuple);
if (OidIsValid(keyType) && keyType != to->atttypid)
--- 241,252 ----
if (!HeapTupleIsValid(tuple))
elog(ERROR, "cache lookup failed for opclass %u",
classObjectId[i]);
! opclassTup = (Form_pg_opclass) GETSTRUCT(tuple);
! /* HACK: make hash always use int4 as storage (really it's uint32) */
! if (opclassTup->opcmethod == HASH_AM_OID)
! keyType = INT4OID;
! else
! keyType = opclassTup->opckeytype;
ReleaseSysCache(tuple);
if (OidIsValid(keyType) && keyType != to->atttypid)
Assuming the patch gets accepted, we should devise some cleaner way
of letting index AMs adjust their indexes' reldescs; maybe declare a
new entry point column in pg_am that lets the AM modify the tupledesc
constructed by this function before it gets used to create the index.
But that is irrelevant to performance testing, so I'm not going to do
it right now.
regards, tom lane
Here is an updated patch incorporating Zdenek's review, my own
observation that we should make the index tupledesc tell the truth,
and some other fixes/improvements such as making backwards scans
work as expected.
The main thing lacking before this could be committed, from a code
standpoint, is a cleaner solution to the problem of adjusting the
index tupledesc (see the ugly hack in catalog/index.c). However,
that complaint is irrelevant for functionality or performance testing,
so I'm throwing this back out there in hopes someone will do some...
I thought a little bit about how to extend this to store both hashcode
and original index key, and realized that the desire to have a truthful
index tupledesc makes that a *whole* lot harder. The planner, and
really even the pg_index catalog representation, assume that the visible
columns of an index are one-for-one with the index keys. We can slide
through with the attached patch because this is still true ---
effectively we're just using a "storage type" different from the indexed
column's type for hash indexes, as already works for GIST and GIN.
But having two visible columns would bollix up quite a lot of stuff.
So I think if we actually want to do that, we'd need to revert to the
concept of cheating on the tupledesc. Aside from the various uglinesses
that I was able to remove from the original patch by not having that,
I'm still quite concerned that we'd find something else wrong with
doing that, further down the road.
So my thinking right now is that we should just test this patch as-is.
If it doesn't show really horrid performance when there are lots of
hash key collisions, we should forget the store-both-things idea and
just go with this.
regards, tom lane
On Thu, Sep 4, 2008 at 5:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
So my thinking right now is that we should just test this patch as-is.
If it doesn't show really horrid performance when there are lots of
hash key collisions, we should forget the store-both-things idea and
just go with this.
OK, let me know if this is too naive of an approach or isn't hitting the
right cases you want tested.
create table hash_a (same text, uniq text);
insert into hash_a (same, uniq) select 'same', n from
generate_series(0, 5000) as n;
create table hash_b (uniq text);
insert into hash_b (uniq) select n from generate_series(5000, 10000) as n;
pgbench -c 1 -t 100 -n -f of the following
hash_same.sql:
set enable_seqscan to off;
set enable_mergejoin to off;
select 1 from hash_a as a inner join hash_a as aa on aa.same = a.same;
hash_uniq.sql:
set enable_seqscan to off;
set enable_mergejoin to off;
select 1 from hash_a as a inner join hash_b as b on b.uniq = a.uniq;
"Alex Hunsaker" <badalex@gmail.com> writes:
OK, let me know if this is too naive of an approach or isn't hitting the
right cases you want tested.
You have the unique-versus-not dimension, but I'm also wondering about
narrow vs wide index keys (say about 8 bytes vs 50-100 or so). In the
former case we're not saving any index space by storing only the hash
code, so these could be expected to have different performance
behaviors.
As for specifics of the suggested scripts:
* might be better to do select count(*) not select 1, so that client
communication is minimized
* check that the queries actually use the indexes (not sure that the
proposed switch settings ensure this, not to mention you didn't create
the indexes)
* make sure the pgbench transaction counts are large enough to ensure
significant runtime
* the specific table sizes suggested are surely not large enough
regards, tom lane
I wrote:
You have the unique-versus-not dimension,
On second thought, actually not. What we want to look at is the penalty
for false matches due to *distinct* key values that happen to have the
same hash codes. Your test case for all-the-same is using all the same
key values, which means it'll hit the heap a lot, but none of those will
be wasted trips.
So what we need for testing is a few different key values that hash to
the same code. Not sure about an easy way to find such.
regards, tom lane
On Thu, Sep 4, 2008 at 7:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Alex Hunsaker" <badalex@gmail.com> writes:
Ok let me know if this is too naive of an approach or not hitting the
right cases you want tested.
You have the unique-versus-not dimension, but I'm also wondering about
narrow vs wide index keys (say about 8 bytes vs 50-100 or so). In the
former case we're not saving any index space by storing only the hash
code, so these could be expected to have different performance
behaviors.
Arg yes... I just read the last part of your mail in this thread. I
think it was the one on -hackers that talked about narrow vs wide...
so I figured I would just try to do what the thread where you posted
the patch talked about, namely the below:
So my thinking right now is that we should just test this patch as-is.
If it doesn't show really horrid performance when there are lots of
hash key collisions, we should forget the store-both-things idea and
just go with this.
So I thought, let's try to generate lots of hash collisions... obviously
though using the same key won't do that... Not sure what I was thinking
As for specifics of the suggested scripts:
* might be better to do select count(*) not select 1, so that client
communication is minimized
Yar.
* check that the queries actually use the indexes (not sure that the
proposed switch settings ensure this, not to mention you didn't create
the indexes)
Well I was assuming I could just test the speed of a hash join...
* make sure the pgbench transaction counts are large enough to ensure
significant runtime
* the specific table sizes suggested are surely not large enough
Ok
On Thu, Sep 4, 2008 at 7:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
So what we need for testing is a few different key values that hash to
the same code. Not sure about an easy way to find such.
Hrm, well I have not really looked at the hash algorithm but I assume
we could just reduce the number of buckets?
"Alex Hunsaker" <badalex@gmail.com> writes:
On Thu, Sep 4, 2008 at 7:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
So what we need for testing is a few different key values that hash to
the same code. Not sure about an easy way to find such.
Hrm, well I have not really looked at the hash algorithm but I assume
we could just reduce the number of buckets?
No, we need fully equal hash keys, else the code won't visit the heap.
I guess one thing we could do for testing purposes is lobotomize one of
the datatype-specific hash functions. For instance, make int8_hash
return the input mod 2^32, ignoring the upper bytes. Then it'd be easy
to compute different int8s that hash to the same thing.
regards, tom lane
"Alex Hunsaker" <badalex@gmail.com> writes:
On Thu, Sep 4, 2008 at 7:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
* check that the queries actually use the indexes (not sure that the
proposed switch settings ensure this, not to mention you didn't create
the indexes)
Well I was assuming I could just test the speed of a hash join...
Uh, no, hash joins have nearly zip to do with hash indexes. They rely
on the same per-datatype support functions but that's the end of the
commonality.
regards, tom lane
On Thu, Sep 4, 2008 at 8:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I guess one thing we could do for testing purposes is lobotomize one of
the datatype-specific hash functions. For instance, make int8_hash
return the input mod 2^32, ignoring the upper bytes. Then it'd be easy
to compute different int8s that hash to the same thing.
Heh, Ok, I'm slowly getting there... So we lobotomize hashint8 and then
time how long it takes to make an index on a table... something like:
create table test_hash(num int8);
(obviously on a 64 bit machine)
#include <stdio.h>
#include <limits.h>

int main(void)
{
unsigned long y = 0;
unsigned cnt = 0;
printf("insert into test_hash (num) values ");
//while(cnt != LONG_MAX/UINT_MAX)
while(cnt < 10000000)
{
y += UINT_MAX;
printf("(%ld), ", y);
cnt++;
}
printf("(0);\n");
}
./a.out | psql
pgbench -c 1 -t1000 -n -f test.sql
test.sql:
create index test_hash_num_idx on test_hash using hash (num);
drop index test_hash_num_idx;
For both pre- and post-patch, just to make sure post-patch is not worse
than pre-patch?
If I'm still way off and it's not too much trouble, want to give me a test
case to run =) ?
Or maybe because hash collisions should be fairly rare it's not
something to really worry about?
On Thu, Sep 4, 2008 at 9:48 PM, Alex Hunsaker <badalex@gmail.com> wrote:
Ok here are the results:
(data generated from the c program before)
select count(1) from test_hash;
count
-----------
100000011
create index test_hash_num_idx on test_hash using hash (num);
CVS: Time: 698065.180 ms
patch: Time: 565982.099 ms
./pgbench -c 1 -t 100000 -n -f bench.sql
bench.sql
select count(1) from test_hash where num = 110034304728896610;
CVS: tps = 7232.375875 (excluding connections establishing)
patch: tps = 7913.700150 (excluding connections establishing)
EXPLAIN ANALYZE select count(1) from test_hash where num = 110034304728896610;
QUERY
PLAN
------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=29.24..29.25 rows=1 width=0) (actual
time=0.066..0.067 rows=1 loops=1)
-> Index Scan using test_hash_num_idx on test_hash
(cost=0.00..29.24 rows=1 width=0) (actual time=0.051..0.054 rows=1
loops=1)
Index Cond: (num = 110034304728896610::bigint)
Total runtime: 0.153 ms
Oddly the index sizes were the same (4096 MB); is that to be expected?
Here is the change I made to hashint8
--- a/src/backend/access/hash/hashfunc.c
+++ b/src/backend/access/hash/hashfunc.c
@@ -61,12 +61,14 @@ hashint8(PG_FUNCTION_ARGS)
*/
#ifndef INT64_IS_BUSTED
int64 val = PG_GETARG_INT64(0);
- uint32 lohalf = (uint32) val;
+/* uint32 lohalf = (uint32) val;
uint32 hihalf = (uint32) (val >> 32);
lohalf ^= (val >= 0) ? hihalf : ~hihalf;
return hash_uint32(lohalf);
+*/
+ return val % 4294967296;
#else
/* here if we can't count on "x >> 32" to work sanely */
return hash_uint32((int32) PG_GETARG_INT64(0));
Alex Hunsaker wrote:
On Thu, Sep 4, 2008 at 9:48 PM, Alex Hunsaker <badalex@gmail.com> wrote:
Ok here are the results:
(data generated from the c program before)
select count(1) from test_hash;
count
-----------
100000011
create index test_hash_num_idx on test_hash using hash (num);
CVS: Time: 698065.180 ms
patch: Time: 565982.099 ms
./pgbench -c 1 -t 100000 -n -f bench.sql
bench.sql
select count(1) from test_hash where num = 110034304728896610;
CVS: tps = 7232.375875 (excluding connections establishing)
patch: tps = 7913.700150 (excluding connections establishing)
EXPLAIN ANALYZE select count(1) from test_hash where num = 110034304728896610;
QUERY
PLAN
------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=29.24..29.25 rows=1 width=0) (actual
time=0.066..0.067 rows=1 loops=1)
-> Index Scan using test_hash_num_idx on test_hash
(cost=0.00..29.24 rows=1 width=0) (actual time=0.051..0.054 rows=1
loops=1)
Index Cond: (num = 110034304728896610::bigint)
Total runtime: 0.153 ms
Oddly the index sizes were the same (4096 MB); is that to be expected?
I think yes, because the hash key is uint32. You save space only if you use
a hash on, for example, a varchar attribute.
Zdenek
On Fri, Sep 5, 2008 at 12:57 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
That generates the data, fine. What about the test query?
You say you are running this command line
pgbench -U postgres -n -f /tmp/query.sql dict
Where is query.sql?
Oh, sorry. I forgot it.
The attachment is query.sql and the generator.
--
Best Regards,
Xiao Meng
DKERC, Harbin Institute of Technology, China
Gtalk: mx.cogito@gmail.com
MSN: cnEnder@live.com
http://xiaomeng.yo2.cn
On Thu, Sep 4, 2008 at 9:48 PM, Alex Hunsaker <badalex@gmail.com> wrote:
(obviously on a 64 bit machine)
int main(void)
{
unsigned long y = 0;
unsigned cnt = 0;
printf("insert into test_hash (num) values ");
//while(cnt != LONG_MAX/UINT_MAX)
while(cnt < 10000000)
{
y += UINT_MAX;
printf("(%ld), ", y);
cnt++;
}
printf("(0);\n");
}
Argh, sorry, I must have been smoking something yesterday; that's supposed
to be y += UINT_MAX + 1, so the benchmarking I did was worthless...
running again with the right values....
Ok now that I made it so it actually *tests* collisions, with the patch
it always returns all rows that matched the hashed "key". For example
(non lobotomized inthash8, I just brute forced some collisions for 0
so that anyone can try this)
create table test_hash (num int8);
insert into test_hash (num) values (0), (1805671691), (3294821164),
(4294967297);
create index test_hash_num_idx on test_hash using hash (num);
select * from test_hash where num = 0;
num
------------
0
1805671691
3294821164
4294967297
set enable_bitmapscan to off;
select * from test_hash where num = 0;
num
-----
0
CVS HEAD, 8_3_STABLE obviously work if I force the index scan
On Fri, Sep 5, 2008 at 2:21 PM, Alex Hunsaker <badalex@gmail.com> wrote:
Ok now that I made it so it actually *tests* collisions, with the patch
it always returns all rows that matched the hashed "key".
And here is the fix; we just forgot to set the recheck flag for bitmap scans.
*** a/src/backend/access/hash/hash.c
--- b/src/backend/access/hash/hash.c
***************
*** 317,323 **** hashgetbitmap(PG_FUNCTION_ARGS)
/* Save tuple ID, and continue scanning */
if (add_tuple)
{
! tbm_add_tuples(tbm, &scan->xs_ctup.t_self, 1, false);
ntids++;
}
--- 317,323 ----
/* Save tuple ID, and continue scanning */
if (add_tuple)
{
! tbm_add_tuples(tbm, &scan->xs_ctup.t_self, 1, true);
ntids++;
}
"Alex Hunsaker" <badalex@gmail.com> writes:
- tbm_add_tuples(tbm, &scan->xs_ctup.t_self, 1, false);
+ tbm_add_tuples(tbm, &scan->xs_ctup.t_self, 1, true);
Hah, I bet that explains Jonah's complaint that recheck didn't seem to
be happening in his tests. Nice catch.
regards, tom lane
"Alex Hunsaker" <badalex@gmail.com> writes:
On Fri, Sep 5, 2008 at 2:21 PM, Alex Hunsaker <badalex@gmail.com> wrote:
Ok now that I made it so it actually *tests* collisions, with the patch
it always returns all rows that matched the hashed "key".
And here is the fix; we just forgot to set the recheck flag for bitmap scans.
For the convenience of anyone intending to test, here is an updated
patch against CVS HEAD that incorporates Alex's fix.
regards, tom lane
On Sat, Sep 6, 2008 at 1:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
For the convenience of anyone intending to test, here is an updated
patch against CVS HEAD that incorporates Alex's fix.
Here are the results for a table containing 1 million entries that
will generate hash collisions. It paints a bad picture for the patch
but then again I'm not sure how relevant the issue is. For example
yesterday I imported a table with 10 million collisions and the create
index is still running (now at about ~18 hours). Maybe we should warn
if there are lots of collisions when creating the index and suggest
you use a btree? Anyway here are the results.
./pgbench -c1 -n -t10 -f bench_createindex.sql
cvs head: tps = 0.002169
v5 : tps = 0.002196
pgbench -c1 -n -t1000 -f bench_bitmap.sql
cvs head: tps = 24.011871
v5: tps = 2.543123
pgbench -c1 -n -t1000 -f bench_index.sql
cvs head: tps = 51.614502
v5: tps = 3.205542
pgbench -c1 -n -t1000 -f bench_seqscan.sql
cvs head: tps = 8.553318
v5: tps = 9.836091
Table created via:
create table test_hash (num int8);
./hash | psql -c 'copy test_hash from stdin;'
Attachments:
int8collide.patch (application/octet-stream)
*** a/src/backend/access/hash/hashfunc.c
--- b/src/backend/access/hash/hashfunc.c
***************
*** 62,70 **** hashint8(PG_FUNCTION_ARGS)
#ifndef INT64_IS_BUSTED
int64 val = PG_GETARG_INT64(0);
uint32 lohalf = (uint32) val;
- uint32 hihalf = (uint32) (val >> 32);
-
- lohalf ^= (val >= 0) ? hihalf : ~hihalf;
return hash_uint32(lohalf);
#else
--- 62,67 ----
2008/9/6 Alex Hunsaker <badalex@gmail.com>:
pgbench -c1 -n -t1000 -f bench_bitmap.sql
cvs head: tps = 24.011871
v5: tps = 2.543123
oops forgot to attach bench_bitmap.sql
Attachments:
It is a nice test case. It should be part of the hash index regression test.
Zdenek
Alex Hunsaker wrote:
Ok now that I made it so it actually *tests* collisions, with the patch
it always returns all rows that matched the hashed "key". For example
(non lobotomized inthash8, I just brute forced some collisions for 0
so that anyone can try this)
create table test_hash (num int8);
insert into test_hash (num) values (0), (1805671691), (3294821164),
(4294967297);
create index test_hash_num_idx on test_hash using hash (num);
select * from test_hash where num = 0;
num
------------
0
1805671691
3294821164
4294967297
set enable_bitmapscan to off;
select * from test_hash where num = 0;
num
-----
0
CVS HEAD, 8_3_STABLE obviously work if I force the index scan
--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql
Tom Lane wrote:
"Alex Hunsaker" <badalex@gmail.com> writes:
On Fri, Sep 5, 2008 at 2:21 PM, Alex Hunsaker <badalex@gmail.com> wrote:
Ok now that I made it so it actually *tests* collisions, with the patch
it always returns all rows that matched the hashed "key".
And here is the fix; we just forgot to set the recheck flag for bitmap scans.
For the convenience of anyone intending to test, here is an updated
patch against CVS HEAD that incorporates Alex's fix.
Tom,
I attach the cleanup patch which we discussed last commit fest. It introduces
new macros HashGetMetaPage and HashGetBitmap and, of course, it breaks backward
on-disk format compatibility, which we will break anyway when Xiao's patch
is committed.
Zdenek
--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql
Attachments:
hash_cleanup.patch (text/x-diff)
Common subdirectories: pgsql_hash.be486df40421/src and pgsql_hash/src
Common subdirectories: pgsql_hash.be486df40421/src/backend and pgsql_hash/src/backend
Common subdirectories: pgsql_hash.be486df40421/src/include and pgsql_hash/src/include
Common subdirectories: pgsql_hash.be486df40421/src/backend/access and pgsql_hash/src/backend/access
Common subdirectories: pgsql_hash.be486df40421/src/backend/access/hash and pgsql_hash/src/backend/access/hash
diff -rc pgsql_hash.be486df40421/src/backend/access/hash/hash.c pgsql_hash/src/backend/access/hash/hash.c
*** pgsql_hash.be486df40421/src/backend/access/hash/hash.c Mon Sep 8 16:14:00 2008
--- pgsql_hash/src/backend/access/hash/hash.c Mon Sep 8 16:14:00 2008
***************
*** 527,533 ****
* each bucket.
*/
metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
! metap = (HashMetaPage) BufferGetPage(metabuf);
orig_maxbucket = metap->hashm_maxbucket;
orig_ntuples = metap->hashm_ntuples;
memcpy(&local_metapage, metap, sizeof(local_metapage));
--- 527,533 ----
* each bucket.
*/
metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
! metap = HashPageGetMeta(BufferGetPage(metabuf));
orig_maxbucket = metap->hashm_maxbucket;
orig_ntuples = metap->hashm_ntuples;
memcpy(&local_metapage, metap, sizeof(local_metapage));
***************
*** 629,635 ****
/* Write-lock metapage and check for split since we started */
metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_WRITE, LH_META_PAGE);
! metap = (HashMetaPage) BufferGetPage(metabuf);
if (cur_maxbucket != metap->hashm_maxbucket)
{
--- 629,635 ----
/* Write-lock metapage and check for split since we started */
metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_WRITE, LH_META_PAGE);
! metap = HashPageGetMeta(BufferGetPage(metabuf));
if (cur_maxbucket != metap->hashm_maxbucket)
{
diff -rc pgsql_hash.be486df40421/src/backend/access/hash/hashinsert.c pgsql_hash/src/backend/access/hash/hashinsert.c
*** pgsql_hash.be486df40421/src/backend/access/hash/hashinsert.c Mon Sep 8 16:14:00 2008
--- pgsql_hash/src/backend/access/hash/hashinsert.c Mon Sep 8 16:14:00 2008
***************
*** 69,75 ****
/* Read the metapage */
metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
! metap = (HashMetaPage) BufferGetPage(metabuf);
/*
* Check whether the item can fit on a hash page at all. (Eventually, we
--- 69,75 ----
/* Read the metapage */
metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
! metap = HashPageGetMeta(BufferGetPage(metabuf));
/*
* Check whether the item can fit on a hash page at all. (Eventually, we
diff -rc pgsql_hash.be486df40421/src/backend/access/hash/hashovfl.c pgsql_hash/src/backend/access/hash/hashovfl.c
*** pgsql_hash.be486df40421/src/backend/access/hash/hashovfl.c Mon Sep 8 16:14:00 2008
--- pgsql_hash/src/backend/access/hash/hashovfl.c Mon Sep 8 16:14:00 2008
***************
*** 187,193 ****
_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
_hash_checkpage(rel, metabuf, LH_META_PAGE);
! metap = (HashMetaPage) BufferGetPage(metabuf);
/* start search at hashm_firstfree */
orig_firstfree = metap->hashm_firstfree;
--- 187,193 ----
_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
_hash_checkpage(rel, metabuf, LH_META_PAGE);
! metap = HashPageGetMeta(BufferGetPage(metabuf));
/* start search at hashm_firstfree */
orig_firstfree = metap->hashm_firstfree;
***************
*** 450,456 ****
/* Read the metapage so we can determine which bitmap page to use */
metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
! metap = (HashMetaPage) BufferGetPage(metabuf);
/* Identify which bit to set */
ovflbitno = blkno_to_bitno(metap, ovflblkno);
--- 450,456 ----
/* Read the metapage so we can determine which bitmap page to use */
metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
! metap = HashPageGetMeta(BufferGetPage(metabuf));
/* Identify which bit to set */
ovflbitno = blkno_to_bitno(metap, ovflblkno);
diff -rc pgsql_hash.be486df40421/src/backend/access/hash/hashpage.c pgsql_hash/src/backend/access/hash/hashpage.c
*** pgsql_hash.be486df40421/src/backend/access/hash/hashpage.c Mon Sep 8 16:14:00 2008
--- pgsql_hash/src/backend/access/hash/hashpage.c Mon Sep 8 16:14:00 2008
***************
*** 395,401 ****
pageopaque->hasho_flag = LH_META_PAGE;
pageopaque->hasho_page_id = HASHO_PAGE_ID;
! metap = (HashMetaPage) pg;
metap->hashm_magic = HASH_MAGIC;
metap->hashm_version = HASH_VERSION;
--- 395,401 ----
pageopaque->hasho_flag = LH_META_PAGE;
pageopaque->hasho_page_id = HASHO_PAGE_ID;
! metap = HashPageGetMeta(pg);
metap->hashm_magic = HASH_MAGIC;
metap->hashm_version = HASH_VERSION;
***************
*** 402,414 ****
metap->hashm_ntuples = 0;
metap->hashm_nmaps = 0;
metap->hashm_ffactor = ffactor;
! metap->hashm_bsize = BufferGetPageSize(metabuf);
/* find largest bitmap array size that will fit in page size */
for (i = _hash_log2(metap->hashm_bsize); i > 0; --i)
{
! if ((1 << i) <= (metap->hashm_bsize -
! (MAXALIGN(sizeof(PageHeaderData)) +
! MAXALIGN(sizeof(HashPageOpaqueData)))))
break;
}
Assert(i > 0);
--- 402,412 ----
metap->hashm_ntuples = 0;
metap->hashm_nmaps = 0;
metap->hashm_ffactor = ffactor;
! metap->hashm_bsize = HashGetMaxBitmapSize(pg);
/* find largest bitmap array size that will fit in page size */
for (i = _hash_log2(metap->hashm_bsize); i > 0; --i)
{
! if ((1 << i) <= metap->hashm_bsize)
break;
}
Assert(i > 0);
***************
*** 532,538 ****
_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
_hash_checkpage(rel, metabuf, LH_META_PAGE);
! metap = (HashMetaPage) BufferGetPage(metabuf);
/*
* Check to see if split is still needed; someone else might have already
--- 530,536 ----
_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
_hash_checkpage(rel, metabuf, LH_META_PAGE);
! metap = HashPageGetMeta(BufferGetPage(metabuf));
/*
* Check to see if split is still needed; someone else might have already
diff -rc pgsql_hash.be486df40421/src/backend/access/hash/hashsearch.c pgsql_hash/src/backend/access/hash/hashsearch.c
*** pgsql_hash.be486df40421/src/backend/access/hash/hashsearch.c Mon Sep 8 16:14:00 2008
--- pgsql_hash/src/backend/access/hash/hashsearch.c Mon Sep 8 16:14:00 2008
***************
*** 186,192 ****
/* Read the metapage */
metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
! metap = (HashMetaPage) BufferGetPage(metabuf);
/*
* Compute the target bucket number, and convert to block number.
--- 186,192 ----
/* Read the metapage */
metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
! metap = HashPageGetMeta(BufferGetPage(metabuf));
/*
* Compute the target bucket number, and convert to block number.
diff -rc pgsql_hash.be486df40421/src/backend/access/hash/hashutil.c pgsql_hash/src/backend/access/hash/hashutil.c
*** pgsql_hash.be486df40421/src/backend/access/hash/hashutil.c Mon Sep 8 16:14:00 2008
--- pgsql_hash/src/backend/access/hash/hashutil.c Mon Sep 8 16:14:00 2008
***************
*** 190,196 ****
*/
if (flags == LH_META_PAGE)
{
! HashMetaPage metap = (HashMetaPage) page;
if (metap->hashm_magic != HASH_MAGIC)
ereport(ERROR,
--- 190,196 ----
*/
if (flags == LH_META_PAGE)
{
! HashMetaPage metap = HashPageGetMeta(page);
if (metap->hashm_magic != HASH_MAGIC)
ereport(ERROR,
Common subdirectories: pgsql_hash.be486df40421/src/include/access and pgsql_hash/src/include/access
diff -rc pgsql_hash.be486df40421/src/include/access/hash.h pgsql_hash/src/include/access/hash.h
*** pgsql_hash.be486df40421/src/include/access/hash.h Mon Sep 8 16:14:00 2008
--- pgsql_hash/src/include/access/hash.h Mon Sep 8 16:14:00 2008
***************
*** 138,144 ****
typedef struct HashMetaPageData
{
- PageHeaderData hashm_phdr; /* pad for page header (do not use) */
uint32 hashm_magic; /* magic no. for hash tables */
uint32 hashm_version; /* version ID */
double hashm_ntuples; /* number of tuples stored in the table */
--- 138,143 ----
***************
*** 191,199 ****
#define BMPGSZ_BIT(metap) ((metap)->hashm_bmsize << BYTE_TO_BIT)
#define BMPG_SHIFT(metap) ((metap)->hashm_bmshift)
#define BMPG_MASK(metap) (BMPGSZ_BIT(metap) - 1)
- #define HashPageGetBitmap(pg) \
- ((uint32 *) (((char *) (pg)) + MAXALIGN(sizeof(PageHeaderData))))
/*
* The number of bits in an ovflpage bitmap word.
*/
--- 190,207 ----
#define BMPGSZ_BIT(metap) ((metap)->hashm_bmsize << BYTE_TO_BIT)
#define BMPG_SHIFT(metap) ((metap)->hashm_bmshift)
#define BMPG_MASK(metap) (BMPGSZ_BIT(metap) - 1)
+ #define HashPageGetBitmap(page) \
+ ((uint32*) PageGetContents(page))
+
+ #define HashGetMaxBitmapSize(page) \
+ (PageGetPageSize((Page) page)- \
+ (MAXALIGN(SizeOfPageHeaderData) + \
+ MAXALIGN(sizeof(HashPageOpaqueData))) )
+
+ #define HashPageGetMeta(page) \
+ ((HashMetaPage) PageGetContents(page))
+
/*
* The number of bits in an ovflpage bitmap word.
*/
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
I attach the cleanup patch which we discussed last commit fest. It introduces
new macros HashGetMetaPage and HashGetBitmap and, of course, it breaks backward
on-disk format compatibility, which we will break anyway when Xiao's patch
is committed.
If we accept that patch, I'm okay with including this one too. Still
hoping for some performance testing though. (Note that this one isn't
going to affect performance, so no need to include it for that purpose.)
regards, tom lane
On Sat, Sep 06, 2008 at 08:23:05PM -0600, Alex Hunsaker wrote:
On Sat, Sep 6, 2008 at 1:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
For the convenience of anyone intending to test, here is an updated
patch against CVS HEAD that incorporates Alex's fix.
Here are the results for a table containing 1 million entries that
will generate hash collisions. It paints a bad picture for the patch
but then again I'm not sure how relevant the issue is. For example
yesterday I imported a table with 10 million collisions and the create
index is still running (now at about ~18 hours). Maybe we should warn
if there are lots of collisions when creating the index and suggest
you use a btree? Anyway here are the results.
I think that the glacial speed for generating a big hash index is
the same problem that the original code faced. Because of the collisions
you are unable to achieve the correct packing that the code assumes.
This results in the splitting/copying of every page in the hash index,
a very slow proposition. I had suggested adding some additional parameters
like fillfactor to accommodate these sorts of situations. Since your test
cuts the effective fill by 2 because of the many collisions, you would
need to adjust that calculation to avoid the tremendous amount of random
I/O generated by that mis-assumption.
./pgbench -c1 -n -t10 -f bench_createindex.sql
cvs head: tps = 0.002169
v5 : tps = 0.002196
pgbench -c1 -n -t1000 -f bench_bitmap.sql
cvs head: tps = 24.011871
v5: tps = 2.543123
pgbench -c1 -n -t1000 -f bench_index.sql
cvs head: tps = 51.614502
v5: tps = 3.205542
pgbench -c1 -n -t1000 -f bench_seqscan.sql
cvs head: tps = 8.553318
v5: tps = 9.836091
Table created via:
create table test_hash (num int8);
./hash | psql -c 'copy test_hash from stdin;'
It would be useful to have an equivalent test for the hash-only
index without the modified int8 hash function, since that would
be more representative of its performance. The collision rates
that I was observing in my tests of the old and new mix() functions
were about 2 * (1/10000) of what your test generated. You could just
test against the integers between 1 and 2000000.
Ken
On Tue, Sep 9, 2008 at 7:48 AM, Kenneth Marshall <ktm@rice.edu> wrote:
I think that the glacial speed for generating a big hash index is
the same problem that the original code faced.
Yeah sorry, I was not saying it was a new problem with the patch. Err
at least not trying to :) *Both* of them had been running at 18+ hours (I
finally killed them sometime Sunday or around +32 hours...)
It would be useful to have an equivalent test for the hash-only
index without the modified int8 hash function, since that would
be more representative of its performance. The collision rates
that I was observing in my tests of the old and new mix() functions
were about 2 * (1/10000) of what your test generated. You could just
test against the integers between 1 and 2000000.
Sure but then it's pretty much just a general test of patch vs no
patch. i.e. How do we measure how much longer collisions take when
the new patch makes things faster? That's what I was trying to
measure... Though I apologize I don't think that was clearly stated
anywhere...
Now of course it still would be interesting... And if it's only to
2,000,000 I can still use the modified int8 or just use the int4
one...
Anyway Here are the numbers:
create table test_hash(num int8);
insert into test_hash (num) select generate_series(1, 2000000);
create index test_hash_num_idx on test_hash (num);
pgbench -c1 -n -t10000 -f bench_index.sql
cvs head: tps = 3161.500511
v5: tps = 7248.808839
BTW I'm still planning on doing a wide vs narrow test... sometime... :)
On Tue, Sep 9, 2008 at 7:23 PM, Alex Hunsaker <badalex@gmail.com> wrote:
create table test_hash(num int8);
insert into test_hash (num) select generate_series(1, 2000000);
create index test_hash_num_idx on test_hash (num);
pgbench -c1 -n -t10000 -f bench_index.sql
cvs head: tps = 3161.500511
v5: tps = 7248.808839
Whoa... Ignore those numbers!
It would be nice to actually test with a hash index... and my CVS
head at the moment had cassert on... sorry about that.
create table test_hash(num int8);
insert into test_hash (num) select generate_series(1, 2000000);
create index test_hash_num_idx on test_hash using hash (num);
pgbench -c1 -n -t100000 -f bench_index.sql
cvs head: tps = 7345.69432
v5: tps = 7526.290462
On Tue, Sep 9, 2008 at 7:23 PM, Alex Hunsaker <badalex@gmail.com> wrote:
BTW I'm still planning on doing a wide vs narrow test... sometime... :)
narrow: (exactly the same as what I just did in the other post)
create table test_hash(num int8);
insert into test_hash (num) select generate_series(1, 2000000);
create index test_hash_num_idx on test_hash using hash (num);
pgbench -c1 -n -t100000 -f bench_index.sql
cvs head: tps = 7345.69432
v5: tps = 7526.290462
wide:
# NOTE not on the same machine as the "narrow" test was run!
# spit out 2,000,000 random 100-character strings
perl gen.pl > data.sql
create table test_hash (wide text);
copy test_hash from './data.sql';
create index test_hash_num_idx on test_hash using hash (wide);
bench.sql:
select a.wide from test_hash as a inner join test_hash as b on b.wide
= a.wide where a.wide =
'BJNORSLMITGKHJCWDBLKLYRSJTVPTYXZJPWNBKXGHYFNDHRAKNFMDHRMUXLDXNTRBJMTHPGPBFJZPAENZXDHAHCUSCJTUPUXWCXUH';
# ^ that string is in data.sql
# 3 runs each
pgbench -c1 -n -t100000 -f bench.sql
cvs head: tps = 5073.463498, 5110.620923, 4955.347610
v5: tps = 5870.681336, 5740.007837, 5699.002942
Attachments:
On Tue, Sep 09, 2008 at 07:23:03PM -0600, Alex Hunsaker wrote:
On Tue, Sep 9, 2008 at 7:48 AM, Kenneth Marshall <ktm@rice.edu> wrote:
I think that the glacial speed for generating a big hash index is
the same problem that the original code faced.
Yeah sorry, I was not saying it was a new problem with the patch. Err
at least not trying to :) *Both* of them had been running at 18+ hours (I
finally killed them sometime Sunday or around +32 hours...)
It would be useful to have an equivalent test for the hash-only
index without the modified int8 hash function, since that would
be more representative of its performance. The collision rates
that I was observing in my tests of the old and new mix() functions
were about 2 * (1/10000) of what your test generated. You could just
test against the integers between 1 and 2000000.
Sure but then it's pretty much just a general test of patch vs no
patch. i.e. How do we measure how much longer collisions take when
the new patch makes things faster? That's what I was trying to
measure... Though I apologize I don't think that was clearly stated
anywhere...
Right, I agree that we need to benchmark the collision processing
time difference. I am not certain that two data points are useful
information. There are 469 collisions with our current hash function
on the integers from 1 to 2000000. What about testing the performance
at power-of-2 multiples of 500, i.e. 500, 1000, 2000, 4000, 8000,...
Unless you adjust the fill calculation for the CREATE INDEX, I would
stop once the time to create the index spikes. It might also be useful
to see if a CLUSTER affects the performance as well. What do you think
of that strategy?
Regards,
Ken
Now of course it still would be interesting... And if it's only to
2,000,000 I can still use the modified int8 or just use the int4
one...
Anyway here are the numbers:
create table test_hash(num int8);
insert into test_hash (num) select generate_series(1, 2000000);
create index test_hash_num_idx on test_hash (num);
pgbench -c1 -n -t10000 -f bench_index.sql
cvs head: tps = 3161.500511
v5: tps = 7248.808839
BTW I'm still planning on doing a wide vs narrow test... sometime... :)
Alex Hunsaker wrote:
wide:
# NOTE not on the same machine as the "narrow" test was run!
# spit out 2,000,000 random 100-character strings
perl gen.pl > data.sql
create table test_hash (wide text);
copy test_hash from './data.sql';
create index test_hash_num_idx on test_hash using hash (wide);
bench.sql:
select a.wide from test_hash as a inner join test_hash as b on b.wide
= a.wide where a.wide =
'BJNORSLMITGKHJCWDBLKLYRSJTVPTYXZJPWNBKXGHYFNDHRAKNFMDHRMUXLDXNTRBJMTHPGPBFJZPAENZXDHAHCUSCJTUPUXWCXUH';
# ^ that string is in data.sql
# 3 runs each
pgbench -c1 -n -t100000 -f bench.sql
cvs head: tps = 5073.463498, 5110.620923, 4955.347610
v5: tps = 5870.681336, 5740.007837, 5699.002942
What locale did you use? It would be nice to also have a comparison between
the C locale and any UTF8 locale. I think we should see a big benefit when a
non-C locale is used.
Thanks Zdenek
On Wed, Sep 10, 2008 at 8:47 AM, Zdenek Kotala <Zdenek.Kotala@sun.com> wrote:
What locale did you use? It would be nice to have also comparing between C
and any UTF8 locale. I think we should see big benefit when non C locale is
used.
Err yes this was UTF8, I'll see about doing a C locale.
On Wed, Sep 10, 2008 at 10:27 AM, Alex Hunsaker <badalex@gmail.com> wrote:
And here it is with a C locale:
pgbench -c1 -n -t100000 -f bench.sql
cvs head: tps = 5142.784262
v5: tps = 6169.405965
and just for fun, the same using a btree index
pgbench -c1 -n -t100000 -f bench.sql
cvs head: tps = 5234.680407
v5: tps = 5286.08252
On Wed, Sep 10, 2008 at 7:04 AM, Kenneth Marshall <ktm@rice.edu> wrote:
On Tue, Sep 09, 2008 at 07:23:03PM -0600, Alex Hunsaker wrote:
On Tue, Sep 9, 2008 at 7:48 AM, Kenneth Marshall <ktm@rice.edu> wrote:
I think that the glacial speed for generating a big hash index is
the same problem that the original code faced.
Yeah sorry, I was not saying it was a new problem with the patch. Err
at least not trying to :) *Both* of them had been running at 18+ (I
finally killed them sometime Sunday or around +32 hours...)
It would be useful to have an equivalent test for the hash-only
index without the modified int8 hash function, since that would
be more representative of its performance. The collision rates
that I was observing in my tests of the old and new mix() functions
was about 2 * (1/10000) of what you test generated. You could just
test against the integers between 1 and 2000000.
Sure but then its pretty much just a general test of patch vs no
patch. i.e. How do we measure how much longer collisions take when
the new patch makes things faster? That's what I was trying to
measure... Though I apologize I don't think that was clearly stated
anywhere...
Right, I agree that we need to benchmark the collision processing
time difference. I am not certain that two data points is useful
information. There are 469 collisions with our current hash function
on the integers from 1 to 2000000. What about testing the performance
at power-of-2 multiples of 500, i.e. 500, 1000, 2000, 4000, 8000,...
Unless you adjust the fill calculation for the CREATE INDEX, I would
stop once the time to create the index spikes. It might also be useful
to see if a CLUSTER affects the performance as well. What do you think
of that strategy?
Not sure it will be a good benchmark of collision processing. Then
again you seem to have studied the hash algo closer than me. I'll go
see about doing this. Stay tuned.
On Wed, Sep 10, 2008 at 9:49 PM, Alex Hunsaker <badalex@gmail.com> wrote:
Assuming I understood you correctly (and I probably didn't), this does
not work very well because you max out at 27,006 values before you get
this error:
ERROR: index row size 8152 exceeds hash maximum 8144
HINT: Values larger than a buffer page cannot be indexed.
So is a power-of-2 multiple of 500 not simply:
x = 500;
while(1)
{
print x;
x *= 2;
}
?
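For what it's worth, the doubling loop above, capped at the 1,000,000 Ken mentions (the cap is an assumption on my part), produces this series; a quick Python sketch:

```python
# Collision counts to benchmark: start at 500 and double until a cap.
def doubling_series(start=500, cap=1_000_000):
    out = []
    x = start
    while x <= cap:
        out.append(x)
        x *= 2
    return out

print(doubling_series())  # 500, 1000, 2000, ..., 512000
```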
On Wed, Sep 10, 2008 at 10:17:31PM -0600, Alex Hunsaker wrote:
Alex,
I meant to check the performance with increasing numbers of collisions,
not increasing size of the hashed item. In other words, something like
this:
for ($coll=500; $coll<=1000000; $coll=$coll*2) {
for ($i=0; $i<=1000000; $i++) {
hash(int8 $i);
}
# add the appropriate number of collisions, distributed evenly to
# minimize the packing overrun problem
for ($dup=0; $dup<=$coll; $dup++) {
hash(int8 MAX_INT + $dup * 1000000/$coll);
}
}
Ken
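A runnable sketch of Ken's scheme (Python in place of the pseudocode; I'm using 2**32 for MAX_INT to match the 4294967296 offset used later in the thread, on the assumption that the int8 hash ignores the top 32 bits, so k and k + 2**32 collide):

```python
# Build n sequential keys plus `coll` extra keys that collide under a
# hash that ignores the top 32 bits, spread evenly across the key range.
def keys_with_collisions(n, coll, offset=2**32):
    keys = list(range(n))
    step = n // coll  # even spacing, to avoid piling dups into one region
    keys.extend(offset + dup * step for dup in range(coll))
    return keys

def low32(k):
    return k & 0xFFFFFFFF  # stand-in for a hash dropping the top 32 bits

ks = keys_with_collisions(1_000_000, 500)
dups = len(ks) - len(set(map(low32, ks)))
print(dups)  # 500: exactly `coll` duplicates modulo 2**32
```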
On Thu, Sep 11, 2008 at 9:24 AM, Kenneth Marshall <ktm@rice.edu> wrote:
*doh* right something like this...
create or replace function create_test_hash() returns bool as $$
declare
coll integer default 500;
-- tweak this to where create index gets really slow
max_coll integer default 1000000;
begin
loop
execute 'create table test_hash_'|| coll ||'(num int8);';
execute 'insert into test_hash_'|| coll ||' (num) select n
from generate_series(0, '|| max_coll ||') as n;';
execute 'insert into test_hash_'|| coll ||' (num) select
(n+4294967296) * '|| max_coll ||'/'|| coll ||'::int from
generate_series(0, '|| coll ||') as n;';
coll := coll * 2;
exit when coll >= max_coll;
end loop;
return true;
end;
$$ language 'plpgsql';
And then benchmark each table, and for extra credit cluster the table
on the index and benchmark that.
Also obviously with the hashint8 which just ignores the top 32 bits.
Right?
Alex Hunsaker wrote:
And here it is with a C locale:
pgbench -c1 -n -t100000 -f bench.sql
cvs head: tps = 5142.784262
v5: tps = 6169.405965
If I look at both results
C UTF8 difference
-----------------------------------
cvs head: 5140 5050 -2%
v5: 6170 5750 -7%
improvement: 20% 14%
then I wonder a little bit. I personally expected a bigger UTF8 difference
between CVS and v5. This test also shows that locale selection has a
bigger impact on performance in the v5 case, but the result is still better than cvs head.
Zdenek
--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql
On Thu, Sep 11, 2008 at 08:51:53PM -0600, Alex Hunsaker wrote:
Yes, that is exactly right.
Ken
On Fri, Sep 12, 2008 at 3:14 AM, Zdenek Kotala <Zdenek.Kotala@sun.com> wrote:
Right, I think part of it is I need to try again with a larger
dataset... The reason I did three runs before was because it was
variable between runs (v5 between 5700 and 6200). Also the query I'm
doing might be too simplistic because it runs too fast; that's part of
why I get the variable speed between runs (I think...)
Suggestions?
I did some testing of my own on the hash index patch, and mostly seem to
have confirmed Alex's results. I used a table like this:
create table tab (f1 serial primary key, f2 some-datatype);
create index ind on tab using hash (f2);
and populated it with 2 million rows of data; then timed queries
like this:
select * from tab a join tab b using(f2)
where f2 = (select f2 from tab c
where c.f1 = (select int4(random() * 2e6)));
using pgbench like this:
pgbench -n -c 1 -T 60 -M prepared -f query.sql hashdb
To test "wide" indexed columns I used a text column with entries of 100
random characters (identical to Alex's test). I saw a performance gain
of about 50% with an empty kernel cache (26.9 vs 41.9 tps), dropping to
about 14% once the table and index were fully swapped in (4185 vs 4764
tps). This was in a C-locale database. Presumably the win would have
been significantly greater in a non-C-locale test, but I didn't try it.
To test "narrow" indexed columns I made f2 a bigint containing 2000000
consecutive integers. Here I saw about a 5% improvement with either
empty cache (48.1 vs 50.5 tps) or full cache (4590 vs 4800 tps).
This is not too surprising since about the same amount of I/O is needed
either way, and bigint comparison is very cheap. (This is a 64-bit
machine, and it's defaulting to pass-by-value for bigints, so value
comparisons are hardly more expensive than hashcode comparisons.)
But the patch still wins a bit by being able to do binary search within
index pages.
In both of the above cases there were only a negligible number of hash
collisions. I made up another test case, still 2 million bigints, but
with values chosen so that almost every entry had a hash collision with
another entry (a small fraction had two or even 3 collisions). This
showed about a 25% slowdown compared to CVS HEAD with empty cache
(49.9 tps down to 37.2), decreasing to 3% with full cache (4609 vs 4482
tps).
I experimented with some variant queries that did more hash index
fetches per query, and saw penalties as high as 50%. However, this is
surely by far the worst case for the patch: no savings in index size,
and a ridiculously high collision rate.
Lastly, for comparison's sake I tried the "wide column" case with a
btree instead of hash index. This gave me 31.5 tps with empty cache,
4749 tps with full cache. Note that the btree is losing significantly
to the patched hash index in empty-cache conditions --- this presumably
reflects the much larger index size causing more I/O.
I'm thinking that we should go ahead and apply the patch. AFAIR this is
the first example we've ever seen of hash beating btree for fetch
performance, and it's pretty obvious that removing the indexed value
from the index contents is the reason for the win. So I've lost
interest in the alternative that involved storing both hashcode and
indexed value. We now see a possible production use-case for hash,
namely indexing wide column values.
BTW, one thing I noticed was that the hash index build time for the
"wide column" case got a lot worse after applying the patch (from 56 to
237 sec). The reason for this turned out to be that with the smaller
predicted index size, the code decided not to use the pre-sorting method
that was recently added. Reducing effective_cache_size to less than the
index size brought the time back down, to about 54 sec. So it would
seem that effective_cache_size is too large a cutoff value. I'm
considering changing hashbuild to switch over at shared_buffers instead
of effective_cache_size --- any thoughts about that?
regards, tom lane
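The cutoff Tom describes is just a threshold test; a hypothetical sketch of the decision (names invented for illustration — the real logic lives in hashbuild(), and the proposal is to compare against shared_buffers rather than effective_cache_size):

```python
# Pick the hash index build strategy: pre-sort tuples by bucket number
# when the predicted index exceeds the cache threshold, else insert
# tuples directly.  `threshold_pages` would be shared_buffers under
# Tom's proposal (effective_cache_size in the patch as posted).
def use_presorted_build(predicted_index_pages, threshold_pages):
    # Random-order inserts thrash the cache once the index no longer fits.
    return predicted_index_pages > threshold_pages

assert use_presorted_build(100_000, 16_384)    # big index: sort first
assert not use_presorted_build(1_000, 16_384)  # small index: insert directly
```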
On Fri, Sep 12, 2008 at 8:29 AM, Kenneth Marshall <ktm@rice.edu> wrote:
Ok I finally found time to do this. In summary, it looks like v5 scales
about the same as cvs head when the collisions are spread evenly
(obviously not HEAD with the hash patch applied...). I couldn't test
cluster because we can't cluster on hash indexes...
benchmark with 50,000,000 rows and 500 collisions:
index creation time:
head: 326116.647 ms
v5: 269509.026 ms
pgbench -n -c1 -T600 -f q.sql hash
head: tps = 3226.605611
v5: tps = 3412.688884 (excluding connections establishing)
50,000,000 rows and 32,768,000 collisions
index time:
head: 576600.967 ms
v5 : 487949.490 ms
pgbench -n -c1 -T500 -f q.sql hash
head: tps = 3105.270185
v5: tps = 3382.25782
You can see each result from 500 all the way up to 32,768,000
collision in the attached result.out
Attached files:
create_test_hash.sql: function I used to create the test tables
result.out: output from bench.pl which shows the pgbench results and
the create index times
bench.pl: stupid little perl script to run pgbench against each of the
tables created by create_test_hash.sql
On Sun, Sep 14, 2008 at 8:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Switching to shared_buffers gets my vote, on my test table with
50,000,000 rows it takes about 8 minutes to create an index using the
default effective_cache_size. With effective_cache_size set to 6GB
(machine has 8GB) it's still going an hour later.
On Mon, Sep 22, 2008 at 11:25 PM, Alex Hunsaker <badalex@gmail.com> wrote:
Agreed. I think using shared_buffers as a cutoff is a much better idea as well.
--
Jonah H. Harris, Senior DBA
myYearbook.com
On Mon, Sep 22, 2008 at 7:57 PM, Alex Hunsaker <badalex@gmail.com> wrote:
50,000,000 rows and 32,768,000 collisions
I should mention that's 50 million rows + 32 million more, or 82,768,000
rows, which explains some of the added index creation time...
"Jonah H. Harris" <jonah.harris@gmail.com> writes:
Agreed. I think using shared_buffers as a cutoff is a much better idea as well.
Already done in CVS a week or so back, but thanks for following up with
some confirmation.
regards, tom lane
On Tue, 2008-09-23 at 00:05 -0400, Tom Lane wrote:
Already done in CVS a week or so back, but thanks for following up with
some confirmation.
I think nobody noticed that change, or the discussion.
I wasn't very happy with effective_cache_size and not happy with
shared_buffers either. If building hash indexes is memory critical then
we just need to say so and encourage others to set memory use correctly.
People are already aware that maintenance_work_mem needs to be increased
for large index builds and we will confuse people if we ignore that and
use another parameter instead. At the very least, a controllable
parameter is the only appropriate choice for production systems, not one
that cannot be changed without restart. It would also throw away any
chance of having pg_restore of hash indexes run in parallel, since it
will not be able to control memory usage. Setting maintenance_work_mem
higher than shared buffers is easily possible now and we should just use
that rather than try to prejudge how people will and can configure their
systems. If the maintenance_work_mem default is too low, let's increase it.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes:
I wasn't very happy with effective_cache_size and not happy with
shared_buffers either. If building hash indexes is memory critical then
we just need to say so and encourage others to set memory use correctly.
People are already aware that maintenance_work_mem needs to be increased
for large index builds and we will confuse people if we ignore that and
use another parameter instead.
I think you've got this completely backwards. The general theory about
maintenance_work_mem is "set it as large as you can stand it". The
issue at hand here is that the crossover point for hash index sort
building seems to be a good deal less than all-the-memory-you-have.
Perhaps there is a case for giving this behavior its very own
configuration parameter; but seeing that we still don't have all that
much of a use case for hash indexes at all, I don't feel a need to do
that yet. In any case, tying it to maintenance_work_mem is certainly
wrong.
regards, tom lane
On Tue, 2008-09-23 at 00:48 -0400, Tom Lane wrote:
I think you've got this completely backwards. The general theory about
maintenance_work_mem is "set it as large as you can stand it". The
issue at hand here is that the crossover point for hash index sort
building seems to be a good deal less than all-the-memory-you-have.
There's an optimal point for btree builds using sorts that is a good
deal less also, so I don't get that.
Plus, if you give it all-the-memory-you-have there won't be anything
left for anything else. You might set it that high for an empty database
load but you don't set it that high in production unless you want to see
swapping when you create large indexes.
maintenance_work_mem is already used for 3 separate operations that bear
little resemblance to each other. If it's appropriate for all of those
then its appropriate for this usage also. Doing it otherwise is going to
mean more than 50% of people do the wrong thing, which would be a shame
because what's been done looks very cool.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes:
maintenance_work_mem is already used for 3 separate operations that bear
little resemblance to each other. If it's appropriate for all of those
then its appropriate for this usage also.
No, it isn't.
The fundamental point here is that this isn't a memory allocation
parameter; it's a switchover threshold between two different behaviors.
regards, tom lane
On Tue, 2008-09-23 at 08:16 -0400, Tom Lane wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
maintenance_work_mem is already used for 3 separate operations that bear
little resemblance to each other. If it's appropriate for all of those
then its appropriate for this usage also.
No, it isn't.
The fundamental point here is that this isn't a memory allocation
parameter; it's a switchover threshold between two different behaviors.
That's a little confusing since sorts switch their behaviour also, but
use (some form of) work_mem, which is *also* their max allocation. I see
the difficulty in understanding the algorithm's behaviour now.
So shared_buffers is the wrong parameter, but even if we had a parameter
it would be very difficult to set it.
Thinks: Why not just sort all of the time and skip the debate entirely?
I thought the main use case was for larger indexes, since that's when
the number of levels in the index is significantly less than btrees? Do
we need to optimise creation time of smaller hash indexes at all?
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes:
Thinks: Why not just sort all of the time and skip the debate entirely?
The sort is demonstrably a loser for smaller indexes. Admittedly,
if the index is small then the sort can't cost all that much, but if
the (correct) threshold is some large fraction of shared_buffers then
it could still take awhile on installations with lots-o-buffers.
The other side of that coin is that it's not clear this is really worth
arguing about, much less exposing a separate parameter for.
regards, tom lane
On Tue, 2008-09-23 at 09:13 -0400, Tom Lane wrote:
The other side of that coin is that it's not clear this is really worth
arguing about, much less exposing a separate parameter for.
Agreed.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Tue, 2008-09-23 at 09:13 -0400, Tom Lane wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
Thinks: Why not just sort all of the time and skip the debate entirely?
The sort is demonstrably a loser for smaller indexes. Admittedly,
if the index is small then the sort can't cost all that much, but if
the (correct) threshold is some large fraction of shared_buffers then
it could still take awhile on installations with lots-o-buffers.
The other realisation is that for large indexes, giving them more
maintenance_work_mem probably will make them build faster 'cos we'll be
sorting. So "give big indexes more memory" is still true *enough* to be
broadly consistent, explainable and understandable. I do explain things
in more detail on some courses, but pithy rules help busy people.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Can we consider this hash thread closed/completed?
---------------------------------------------------------------------------
Tom Lane wrote:
[...]
The other side of that coin is that it's not clear this is really worth
arguing about, much less exposing a separate parameter for.
regards, tom lane
--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +