MemoryContextAllocHuge(): selectively bypassing MaxAllocSize
A memory chunk allocated through the existing palloc.h interfaces is limited
to MaxAllocSize (~1 GiB). This is best for most callers; SET_VARSIZE() need
not check its own 1 GiB limit, and algorithms that grow a buffer by doubling
need not check for overflow. However, a handful of callers are quite happy to
navigate those hazards in exchange for the ability to allocate a larger chunk.
This patch introduces MemoryContextAllocHuge() and repalloc_huge() that check
a higher MaxAllocHugeSize limit of SIZE_MAX/2. Chunks don't bother recording
whether they were allocated as huge; one can start with palloc() and then
repalloc_huge() to grow the value. To demonstrate, I put this to use in
tuplesort.c; the patch also updates tuplestore.c to keep them similar. Here's
the trace_sort from building the pgbench_accounts primary key at scale factor
7500, maintenance_work_mem = '56GB'; memtuples itself consumed 17.2 GiB:
LOG: internal sort ended, 48603324 KB used: CPU 75.65s/305.46u sec elapsed 391.21 sec
Compare:
LOG: external sort ended, 1832846 disk blocks used: CPU 77.45s/988.11u sec elapsed 1146.05 sec
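To make the intended usage pattern concrete -- start with palloc(), grow with
repalloc_huge() -- here is a minimal sketch (not taken from the patch; "buf",
"buflen" and "needed" are hypothetical names):

    /* Start with an ordinary palloc() chunk ... */
    Size    buflen = 8192;
    char   *buf = palloc(buflen);

    /*
     * ... and grow it by doubling.  repalloc_huge() accepts requests up to
     * MaxAllocHugeSize (SIZE_MAX/2), so the chunk may cross the 1 GiB mark;
     * a request beyond that limit still fails with elog(ERROR).
     */
    while (buflen < needed)
    {
        buflen *= 2;
        buf = repalloc_huge(buf, buflen);
    }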
This was made easier by tuplesort growth algorithm improvements in commit
8ae35e91807508872cabd3b0e8db35fc78e194ac. The problem has come up before
(TODO item "Allow sorts to use more available memory"), and Tom floated the
idea[1] behind the approach I've used. The next limit faced by sorts is
INT_MAX concurrent tuples in memory, which limits helpful work_mem to about
150 GiB when sorting int4.
I have not added variants like palloc_huge() and palloc0_huge(), and I have
not added to the frontend palloc.h interface. There's no particular barrier
to doing any of that. I don't expect more than a dozen or so callers, so most
of the variations might go unused.
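For reference, such a variant would presumably be little more than a one-line
wrapper (hypothetical sketch only; the patch deliberately omits it):

    /* Hypothetical convenience wrapper; not part of this patch. */
    void *
    palloc_huge(Size size)
    {
        return MemoryContextAllocHuge(CurrentMemoryContext, size);
    }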
The comment at MaxAllocSize said that aset.c expects doubling the size of an
arbitrary allocation to never overflow, but I couldn't find the code in
question. AllocSetAlloc() does double sizes of blocks used to aggregate small
allocations, so maxBlockSize had better stay under SIZE_MAX/2. Nonetheless,
that expectation does apply to dozens of repalloc() users outside aset.c, and
I preserved it for repalloc_huge(). 64-bit builds will never notice, and I
won't cry for the resulting 2 GiB limit on 32-bit.
Thanks,
nm
[1]: /messages/by-id/19908.1297696263@sss.pgh.pa.us
--
Noah Misch
EnterpriseDB http://www.enterprisedb.com
Attachments:
alloc-huge-v1.patch (text/plain; charset=us-ascii)
*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
***************
*** 557,562 **** AllocSetDelete(MemoryContext context)
--- 557,566 ----
* AllocSetAlloc
* Returns pointer to allocated memory of given size; memory is added
* to the set.
+ *
+ * No request may exceed:
+ * MAXALIGN_DOWN(SIZE_MAX) - ALLOC_BLOCKHDRSZ - ALLOC_CHUNKHDRSZ
+ * All callers use a much-lower limit.
*/
static void *
AllocSetAlloc(MemoryContext context, Size size)
*** a/src/backend/utils/mmgr/mcxt.c
--- b/src/backend/utils/mmgr/mcxt.c
***************
*** 451,464 **** MemoryContextContains(MemoryContext context, void *pointer)
header = (StandardChunkHeader *)
((char *) pointer - STANDARDCHUNKHEADERSIZE);
! /*
! * If the context link doesn't match then we certainly have a non-member
! * chunk. Also check for a reasonable-looking size as extra guard against
! * being fooled by bogus pointers.
! */
! if (header->context == context && AllocSizeIsValid(header->size))
! return true;
! return false;
}
/*--------------------
--- 451,457 ----
header = (StandardChunkHeader *)
((char *) pointer - STANDARDCHUNKHEADERSIZE);
! return header->context == context;
}
/*--------------------
***************
*** 735,740 **** repalloc(void *pointer, Size size)
--- 728,790 ----
}
/*
+ * MemoryContextAllocHuge
+ * Allocate (possibly-expansive) space within the specified context.
+ *
+ * See considerations in comment at MaxAllocHugeSize.
+ */
+ void *
+ MemoryContextAllocHuge(MemoryContext context, Size size)
+ {
+ AssertArg(MemoryContextIsValid(context));
+
+ if (!AllocHugeSizeIsValid(size))
+ elog(ERROR, "invalid memory alloc request size %lu",
+ (unsigned long) size);
+
+ context->isReset = false;
+
+ return (*context->methods->alloc) (context, size);
+ }
+
+ /*
+ * repalloc_huge
+ * Adjust the size of a previously allocated chunk, permitting a large
+ * value. The previous allocation need not have been "huge".
+ */
+ void *
+ repalloc_huge(void *pointer, Size size)
+ {
+ StandardChunkHeader *header;
+
+ /*
+ * Try to detect bogus pointers handed to us, poorly though we can.
+ * Presumably, a pointer that isn't MAXALIGNED isn't pointing at an
+ * allocated chunk.
+ */
+ Assert(pointer != NULL);
+ Assert(pointer == (void *) MAXALIGN(pointer));
+
+ /*
+ * OK, it's probably safe to look at the chunk header.
+ */
+ header = (StandardChunkHeader *)
+ ((char *) pointer - STANDARDCHUNKHEADERSIZE);
+
+ AssertArg(MemoryContextIsValid(header->context));
+
+ if (!AllocHugeSizeIsValid(size))
+ elog(ERROR, "invalid memory alloc request size %lu",
+ (unsigned long) size);
+
+ /* isReset must be false already */
+ Assert(!header->context->isReset);
+
+ return (*header->context->methods->realloc) (header->context,
+ pointer, size);
+ }
+
+ /*
* MemoryContextStrdup
* Like strdup(), but allocate from the specified context
*/
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
***************
*** 211,218 **** struct Tuplesortstate
* tuples to return? */
bool boundUsed; /* true if we made use of a bounded heap */
int bound; /* if bounded, the maximum number of tuples */
! long availMem; /* remaining memory available, in bytes */
! long allowedMem; /* total memory allowed, in bytes */
int maxTapes; /* number of tapes (Knuth's T) */
int tapeRange; /* maxTapes-1 (Knuth's P) */
MemoryContext sortcontext; /* memory context holding all sort data */
--- 211,218 ----
* tuples to return? */
bool boundUsed; /* true if we made use of a bounded heap */
int bound; /* if bounded, the maximum number of tuples */
! Size availMem; /* remaining memory available, in bytes */
! Size allowedMem; /* total memory allowed, in bytes */
int maxTapes; /* number of tapes (Knuth's T) */
int tapeRange; /* maxTapes-1 (Knuth's P) */
MemoryContext sortcontext; /* memory context holding all sort data */
***************
*** 308,314 **** struct Tuplesortstate
int *mergenext; /* first preread tuple for each source */
int *mergelast; /* last preread tuple for each source */
int *mergeavailslots; /* slots left for prereading each tape */
! long *mergeavailmem; /* availMem for prereading each tape */
int mergefreelist; /* head of freelist of recycled slots */
int mergefirstfree; /* first slot never used in this merge */
--- 308,314 ----
int *mergenext; /* first preread tuple for each source */
int *mergelast; /* last preread tuple for each source */
int *mergeavailslots; /* slots left for prereading each tape */
! Size *mergeavailmem; /* availMem for prereading each tape */
int mergefreelist; /* head of freelist of recycled slots */
int mergefirstfree; /* first slot never used in this merge */
***************
*** 961,985 **** tuplesort_end(Tuplesortstate *state)
}
/*
! * Grow the memtuples[] array, if possible within our memory constraint.
! * Return TRUE if we were able to enlarge the array, FALSE if not.
*
! * Normally, at each increment we double the size of the array. When we no
! * longer have enough memory to do that, we attempt one last, smaller increase
! * (and then clear the growmemtuples flag so we don't try any more). That
! * allows us to use allowedMem as fully as possible; sticking to the pure
! * doubling rule could result in almost half of allowedMem going unused.
! * Because availMem moves around with tuple addition/removal, we need some
! * rule to prevent making repeated small increases in memtupsize, which would
! * just be useless thrashing. The growmemtuples flag accomplishes that and
! * also prevents useless recalculations in this function.
*/
static bool
grow_memtuples(Tuplesortstate *state)
{
int newmemtupsize;
int memtupsize = state->memtupsize;
! long memNowUsed = state->allowedMem - state->availMem;
/* Forget it if we've already maxed out memtuples, per comment above */
if (!state->growmemtuples)
--- 961,986 ----
}
/*
! * Grow the memtuples[] array, if possible within our memory constraint. We
! * must not exceed INT_MAX tuples in memory or the caller-provided memory
! * limit. Return TRUE if we were able to enlarge the array, FALSE if not.
*
! * Normally, at each increment we double the size of the array. When doing
! * that would exceed a limit, we attempt one last, smaller increase (and then
! * clear the growmemtuples flag so we don't try any more). That allows us to
! * use memory as fully as permitted; sticking to the pure doubling rule could
! * result in almost half going unused. Because availMem moves around with
! * tuple addition/removal, we need some rule to prevent making repeated small
! * increases in memtupsize, which would just be useless thrashing. The
! * growmemtuples flag accomplishes that and also prevents useless
! * recalculations in this function.
*/
static bool
grow_memtuples(Tuplesortstate *state)
{
int newmemtupsize;
int memtupsize = state->memtupsize;
! Size memNowUsed = state->allowedMem - state->availMem;
/* Forget it if we've already maxed out memtuples, per comment above */
if (!state->growmemtuples)
***************
*** 989,1002 **** grow_memtuples(Tuplesortstate *state)
if (memNowUsed <= state->availMem)
{
/*
! * It is surely safe to double memtupsize if we've used no more than
! * half of allowedMem.
! *
! * Note: it might seem that we need to worry about memtupsize * 2
! * overflowing an int, but the MaxAllocSize clamp applied below
! * ensures the existing memtupsize can't be large enough for that.
*/
! newmemtupsize = memtupsize * 2;
}
else
{
--- 990,1005 ----
if (memNowUsed <= state->availMem)
{
/*
! * We've used no more than half of allowedMem; double our usage,
! * clamping at INT_MAX.
*/
! if (memtupsize < INT_MAX / 2)
! newmemtupsize = memtupsize * 2;
! else
! {
! newmemtupsize = INT_MAX;
! state->growmemtuples = false;
! }
}
else
{
***************
*** 1012,1018 **** grow_memtuples(Tuplesortstate *state)
* we've already seen, and thus we can extrapolate from the space
* consumption so far to estimate an appropriate new size for the
* memtuples array. The optimal value might be higher or lower than
! * this estimate, but it's hard to know that in advance.
*
* This calculation is safe against enlarging the array so much that
* LACKMEM becomes true, because the memory currently used includes
--- 1015,1022 ----
* we've already seen, and thus we can extrapolate from the space
* consumption so far to estimate an appropriate new size for the
* memtuples array. The optimal value might be higher or lower than
! * this estimate, but it's hard to know that in advance. We again
! * clamp at INT_MAX tuples.
*
* This calculation is safe against enlarging the array so much that
* LACKMEM becomes true, because the memory currently used includes
***************
*** 1020,1035 **** grow_memtuples(Tuplesortstate *state)
* new array elements even if no other memory were currently used.
*
* We do the arithmetic in float8, because otherwise the product of
! * memtupsize and allowedMem could overflow. (A little algebra shows
! * that grow_ratio must be less than 2 here, so we are not risking
! * integer overflow this way.) Any inaccuracy in the result should be
! * insignificant; but even if we computed a completely insane result,
! * the checks below will prevent anything really bad from happening.
*/
double grow_ratio;
grow_ratio = (double) state->allowedMem / (double) memNowUsed;
! newmemtupsize = (int) (memtupsize * grow_ratio);
/* We won't make any further enlargement attempts */
state->growmemtuples = false;
--- 1024,1041 ----
* new array elements even if no other memory were currently used.
*
* We do the arithmetic in float8, because otherwise the product of
! * memtupsize and allowedMem could overflow. Any inaccuracy in the
! * result should be insignificant; but even if we computed a
! * completely insane result, the checks below will prevent anything
! * really bad from happening.
*/
double grow_ratio;
grow_ratio = (double) state->allowedMem / (double) memNowUsed;
! if (memtupsize * grow_ratio < INT_MAX)
! newmemtupsize = (int) (memtupsize * grow_ratio);
! else
! newmemtupsize = INT_MAX;
/* We won't make any further enlargement attempts */
state->growmemtuples = false;
***************
*** 1040,1051 **** grow_memtuples(Tuplesortstate *state)
goto noalloc;
/*
! * On a 64-bit machine, allowedMem could be more than MaxAllocSize. Clamp
! * to ensure our request won't be rejected by palloc.
*/
! if ((Size) newmemtupsize >= MaxAllocSize / sizeof(SortTuple))
{
! newmemtupsize = (int) (MaxAllocSize / sizeof(SortTuple));
state->growmemtuples = false; /* can't grow any more */
}
--- 1046,1058 ----
goto noalloc;
/*
! * On a 32-bit machine, allowedMem could exceed MaxAllocHugeSize. Clamp
! * to ensure our request won't be rejected. Note that we can easily
! * exhaust address space before facing this outcome.
*/
! if ((Size) newmemtupsize >= MaxAllocHugeSize / sizeof(SortTuple))
{
! newmemtupsize = (int) (MaxAllocHugeSize / sizeof(SortTuple));
state->growmemtuples = false; /* can't grow any more */
}
***************
*** 1060,1074 **** grow_memtuples(Tuplesortstate *state)
* palloc would be treating both old and new arrays as separate chunks.
* But we'll check LACKMEM explicitly below just in case.)
*/
! if (state->availMem < (long) ((newmemtupsize - memtupsize) * sizeof(SortTuple)))
goto noalloc;
/* OK, do it */
FREEMEM(state, GetMemoryChunkSpace(state->memtuples));
state->memtupsize = newmemtupsize;
state->memtuples = (SortTuple *)
! repalloc(state->memtuples,
! state->memtupsize * sizeof(SortTuple));
USEMEM(state, GetMemoryChunkSpace(state->memtuples));
if (LACKMEM(state))
elog(ERROR, "unexpected out-of-memory situation during sort");
--- 1067,1081 ----
* palloc would be treating both old and new arrays as separate chunks.
* But we'll check LACKMEM explicitly below just in case.)
*/
! if (state->availMem < (Size) ((newmemtupsize - memtupsize) * sizeof(SortTuple)))
goto noalloc;
/* OK, do it */
FREEMEM(state, GetMemoryChunkSpace(state->memtuples));
state->memtupsize = newmemtupsize;
state->memtuples = (SortTuple *)
! repalloc_huge(state->memtuples,
! state->memtupsize * sizeof(SortTuple));
USEMEM(state, GetMemoryChunkSpace(state->memtuples));
if (LACKMEM(state))
elog(ERROR, "unexpected out-of-memory situation during sort");
***************
*** 1715,1721 **** tuplesort_getdatum(Tuplesortstate *state, bool forward,
* This is exported for use by the planner. allowedMem is in bytes.
*/
int
! tuplesort_merge_order(long allowedMem)
{
int mOrder;
--- 1722,1728 ----
* This is exported for use by the planner. allowedMem is in bytes.
*/
int
! tuplesort_merge_order(Size allowedMem)
{
int mOrder;
***************
*** 1749,1755 **** inittapes(Tuplesortstate *state)
int maxTapes,
ntuples,
j;
! long tapeSpace;
/* Compute number of tapes to use: merge order plus 1 */
maxTapes = tuplesort_merge_order(state->allowedMem) + 1;
--- 1756,1762 ----
int maxTapes,
ntuples,
j;
! Size tapeSpace;
/* Compute number of tapes to use: merge order plus 1 */
maxTapes = tuplesort_merge_order(state->allowedMem) + 1;
***************
*** 1798,1804 **** inittapes(Tuplesortstate *state)
state->mergenext = (int *) palloc0(maxTapes * sizeof(int));
state->mergelast = (int *) palloc0(maxTapes * sizeof(int));
state->mergeavailslots = (int *) palloc0(maxTapes * sizeof(int));
! state->mergeavailmem = (long *) palloc0(maxTapes * sizeof(long));
state->tp_fib = (int *) palloc0(maxTapes * sizeof(int));
state->tp_runs = (int *) palloc0(maxTapes * sizeof(int));
state->tp_dummy = (int *) palloc0(maxTapes * sizeof(int));
--- 1805,1811 ----
state->mergenext = (int *) palloc0(maxTapes * sizeof(int));
state->mergelast = (int *) palloc0(maxTapes * sizeof(int));
state->mergeavailslots = (int *) palloc0(maxTapes * sizeof(int));
! state->mergeavailmem = (Size *) palloc0(maxTapes * sizeof(Size));
state->tp_fib = (int *) palloc0(maxTapes * sizeof(int));
state->tp_runs = (int *) palloc0(maxTapes * sizeof(int));
state->tp_dummy = (int *) palloc0(maxTapes * sizeof(int));
***************
*** 2026,2032 **** mergeonerun(Tuplesortstate *state)
int srcTape;
int tupIndex;
SortTuple *tup;
! long priorAvail,
spaceFreed;
/*
--- 2033,2039 ----
int srcTape;
int tupIndex;
SortTuple *tup;
! Size priorAvail,
spaceFreed;
/*
***************
*** 2100,2106 **** beginmerge(Tuplesortstate *state)
int tapenum;
int srcTape;
int slotsPerTape;
! long spacePerTape;
/* Heap should be empty here */
Assert(state->memtupcount == 0);
--- 2107,2113 ----
int tapenum;
int srcTape;
int slotsPerTape;
! Size spacePerTape;
/* Heap should be empty here */
Assert(state->memtupcount == 0);
***************
*** 2221,2227 **** mergeprereadone(Tuplesortstate *state, int srcTape)
unsigned int tuplen;
SortTuple stup;
int tupIndex;
! long priorAvail,
spaceUsed;
if (!state->mergeactive[srcTape])
--- 2228,2234 ----
unsigned int tuplen;
SortTuple stup;
int tupIndex;
! Size priorAvail,
spaceUsed;
if (!state->mergeactive[srcTape])
*** a/src/backend/utils/sort/tuplestore.c
--- b/src/backend/utils/sort/tuplestore.c
***************
*** 104,111 **** struct Tuplestorestate
bool backward; /* store extra length words in file? */
bool interXact; /* keep open through transactions? */
bool truncated; /* tuplestore_trim has removed tuples? */
! long availMem; /* remaining memory available, in bytes */
! long allowedMem; /* total memory allowed, in bytes */
BufFile *myfile; /* underlying file, or NULL if none */
MemoryContext context; /* memory context for holding tuples */
ResourceOwner resowner; /* resowner for holding temp files */
--- 104,111 ----
bool backward; /* store extra length words in file? */
bool interXact; /* keep open through transactions? */
bool truncated; /* tuplestore_trim has removed tuples? */
! Size availMem; /* remaining memory available, in bytes */
! Size allowedMem; /* total memory allowed, in bytes */
BufFile *myfile; /* underlying file, or NULL if none */
MemoryContext context; /* memory context for holding tuples */
ResourceOwner resowner; /* resowner for holding temp files */
***************
*** 531,555 **** tuplestore_ateof(Tuplestorestate *state)
}
/*
! * Grow the memtuples[] array, if possible within our memory constraint.
! * Return TRUE if we were able to enlarge the array, FALSE if not.
*
! * Normally, at each increment we double the size of the array. When we no
! * longer have enough memory to do that, we attempt one last, smaller increase
! * (and then clear the growmemtuples flag so we don't try any more). That
! * allows us to use allowedMem as fully as possible; sticking to the pure
! * doubling rule could result in almost half of allowedMem going unused.
! * Because availMem moves around with tuple addition/removal, we need some
! * rule to prevent making repeated small increases in memtupsize, which would
! * just be useless thrashing. The growmemtuples flag accomplishes that and
! * also prevents useless recalculations in this function.
*/
static bool
grow_memtuples(Tuplestorestate *state)
{
int newmemtupsize;
int memtupsize = state->memtupsize;
! long memNowUsed = state->allowedMem - state->availMem;
/* Forget it if we've already maxed out memtuples, per comment above */
if (!state->growmemtuples)
--- 531,556 ----
}
/*
! * Grow the memtuples[] array, if possible within our memory constraint. We
! * must not exceed INT_MAX tuples in memory or the caller-provided memory
! * limit. Return TRUE if we were able to enlarge the array, FALSE if not.
*
! * Normally, at each increment we double the size of the array. When doing
! * that would exceed a limit, we attempt one last, smaller increase (and then
! * clear the growmemtuples flag so we don't try any more). That allows us to
! * use memory as fully as permitted; sticking to the pure doubling rule could
! * result in almost half going unused. Because availMem moves around with
! * tuple addition/removal, we need some rule to prevent making repeated small
! * increases in memtupsize, which would just be useless thrashing. The
! * growmemtuples flag accomplishes that and also prevents useless
! * recalculations in this function.
*/
static bool
grow_memtuples(Tuplestorestate *state)
{
int newmemtupsize;
int memtupsize = state->memtupsize;
! Size memNowUsed = state->allowedMem - state->availMem;
/* Forget it if we've already maxed out memtuples, per comment above */
if (!state->growmemtuples)
***************
*** 559,572 **** grow_memtuples(Tuplestorestate *state)
if (memNowUsed <= state->availMem)
{
/*
! * It is surely safe to double memtupsize if we've used no more than
! * half of allowedMem.
! *
! * Note: it might seem that we need to worry about memtupsize * 2
! * overflowing an int, but the MaxAllocSize clamp applied below
! * ensures the existing memtupsize can't be large enough for that.
*/
! newmemtupsize = memtupsize * 2;
}
else
{
--- 560,575 ----
if (memNowUsed <= state->availMem)
{
/*
! * We've used no more than half of allowedMem; double our usage,
! * clamping at INT_MAX.
*/
! if (memtupsize < INT_MAX / 2)
! newmemtupsize = memtupsize * 2;
! else
! {
! newmemtupsize = INT_MAX;
! state->growmemtuples = false;
! }
}
else
{
***************
*** 582,588 **** grow_memtuples(Tuplestorestate *state)
* we've already seen, and thus we can extrapolate from the space
* consumption so far to estimate an appropriate new size for the
* memtuples array. The optimal value might be higher or lower than
! * this estimate, but it's hard to know that in advance.
*
* This calculation is safe against enlarging the array so much that
* LACKMEM becomes true, because the memory currently used includes
--- 585,592 ----
* we've already seen, and thus we can extrapolate from the space
* consumption so far to estimate an appropriate new size for the
* memtuples array. The optimal value might be higher or lower than
! * this estimate, but it's hard to know that in advance. We again
! * clamp at INT_MAX tuples.
*
* This calculation is safe against enlarging the array so much that
* LACKMEM becomes true, because the memory currently used includes
***************
*** 590,605 **** grow_memtuples(Tuplestorestate *state)
* new array elements even if no other memory were currently used.
*
* We do the arithmetic in float8, because otherwise the product of
! * memtupsize and allowedMem could overflow. (A little algebra shows
! * that grow_ratio must be less than 2 here, so we are not risking
! * integer overflow this way.) Any inaccuracy in the result should be
! * insignificant; but even if we computed a completely insane result,
! * the checks below will prevent anything really bad from happening.
*/
double grow_ratio;
grow_ratio = (double) state->allowedMem / (double) memNowUsed;
! newmemtupsize = (int) (memtupsize * grow_ratio);
/* We won't make any further enlargement attempts */
state->growmemtuples = false;
--- 594,611 ----
* new array elements even if no other memory were currently used.
*
* We do the arithmetic in float8, because otherwise the product of
! * memtupsize and allowedMem could overflow. Any inaccuracy in the
! * result should be insignificant; but even if we computed a
! * completely insane result, the checks below will prevent anything
! * really bad from happening.
*/
double grow_ratio;
grow_ratio = (double) state->allowedMem / (double) memNowUsed;
! if (memtupsize * grow_ratio < INT_MAX)
! newmemtupsize = (int) (memtupsize * grow_ratio);
! else
! newmemtupsize = INT_MAX;
/* We won't make any further enlargement attempts */
state->growmemtuples = false;
***************
*** 610,621 **** grow_memtuples(Tuplestorestate *state)
goto noalloc;
/*
! * On a 64-bit machine, allowedMem could be more than MaxAllocSize. Clamp
! * to ensure our request won't be rejected by palloc.
*/
! if ((Size) newmemtupsize >= MaxAllocSize / sizeof(void *))
{
! newmemtupsize = (int) (MaxAllocSize / sizeof(void *));
state->growmemtuples = false; /* can't grow any more */
}
--- 616,628 ----
goto noalloc;
/*
! * On a 32-bit machine, allowedMem could exceed MaxAllocHugeSize. Clamp
! * to ensure our request won't be rejected. Note that we can easily
! * exhaust address space before facing this outcome.
*/
! if ((Size) newmemtupsize >= MaxAllocHugeSize / sizeof(void *))
{
! newmemtupsize = (int) (MaxAllocHugeSize / sizeof(void *));
state->growmemtuples = false; /* can't grow any more */
}
***************
*** 630,644 **** grow_memtuples(Tuplestorestate *state)
* palloc would be treating both old and new arrays as separate chunks.
* But we'll check LACKMEM explicitly below just in case.)
*/
! if (state->availMem < (long) ((newmemtupsize - memtupsize) * sizeof(void *)))
goto noalloc;
/* OK, do it */
FREEMEM(state, GetMemoryChunkSpace(state->memtuples));
state->memtupsize = newmemtupsize;
state->memtuples = (void **)
! repalloc(state->memtuples,
! state->memtupsize * sizeof(void *));
USEMEM(state, GetMemoryChunkSpace(state->memtuples));
if (LACKMEM(state))
elog(ERROR, "unexpected out-of-memory situation during sort");
--- 637,651 ----
* palloc would be treating both old and new arrays as separate chunks.
* But we'll check LACKMEM explicitly below just in case.)
*/
! if (state->availMem < (Size) ((newmemtupsize - memtupsize) * sizeof(void *)))
goto noalloc;
/* OK, do it */
FREEMEM(state, GetMemoryChunkSpace(state->memtuples));
state->memtupsize = newmemtupsize;
state->memtuples = (void **)
! repalloc_huge(state->memtuples,
! state->memtupsize * sizeof(void *));
USEMEM(state, GetMemoryChunkSpace(state->memtuples));
if (LACKMEM(state))
elog(ERROR, "unexpected out-of-memory situation during sort");
*** a/src/include/utils/memutils.h
--- b/src/include/utils/memutils.h
***************
*** 21,46 ****
/*
! * MaxAllocSize
! * Quasi-arbitrary limit on size of allocations.
*
* Note:
! * There is no guarantee that allocations smaller than MaxAllocSize
! * will succeed. Allocation requests larger than MaxAllocSize will
! * be summarily denied.
*
! * XXX This is deliberately chosen to correspond to the limiting size
! * of varlena objects under TOAST. See VARSIZE_4B() and related macros
! * in postgres.h. Many datatypes assume that any allocatable size can
! * be represented in a varlena header.
! *
! * XXX Also, various places in aset.c assume they can compute twice an
! * allocation's size without overflow, so beware of raising this.
*/
#define MaxAllocSize ((Size) 0x3fffffff) /* 1 gigabyte - 1 */
#define AllocSizeIsValid(size) ((Size) (size) <= MaxAllocSize)
/*
* All chunks allocated by any memory context manager are required to be
* preceded by a StandardChunkHeader at a spacing of STANDARDCHUNKHEADERSIZE.
--- 21,49 ----
/*
! * MaxAllocSize, MaxAllocHugeSize
! * Quasi-arbitrary limits on size of allocations.
*
* Note:
! * There is no guarantee that smaller allocations will succeed, but
! * larger requests will be summarily denied.
*
! * palloc() enforces MaxAllocSize, chosen to correspond to the limiting size
! * of varlena objects under TOAST. See VARSIZE_4B() and related macros in
! * postgres.h. Many datatypes assume that any allocatable size can be
! * represented in a varlena header. Callers that never use the allocation as
! * a varlena can access the higher limit with MemoryContextAllocHuge(). Both
! * limits permit code to assume that it may compute (in size_t math) twice an
! * allocation's size without overflow.
*/
#define MaxAllocSize ((Size) 0x3fffffff) /* 1 gigabyte - 1 */
#define AllocSizeIsValid(size) ((Size) (size) <= MaxAllocSize)
+ #define MaxAllocHugeSize ((Size) -1 >> 1) /* SIZE_MAX / 2 */
+
+ #define AllocHugeSizeIsValid(size) ((Size) (size) <= MaxAllocHugeSize)
+
/*
* All chunks allocated by any memory context manager are required to be
* preceded by a StandardChunkHeader at a spacing of STANDARDCHUNKHEADERSIZE.
*** a/src/include/utils/palloc.h
--- b/src/include/utils/palloc.h
***************
*** 51,56 **** extern void *MemoryContextAlloc(MemoryContext context, Size size);
--- 51,60 ----
extern void *MemoryContextAllocZero(MemoryContext context, Size size);
extern void *MemoryContextAllocZeroAligned(MemoryContext context, Size size);
+ /* Higher-limit allocators. */
+ extern void *MemoryContextAllocHuge(MemoryContext context, Size size);
+ extern void *repalloc_huge(void *pointer, Size size);
+
/*
* The result of palloc() is always word-aligned, so we can skip testing
* alignment of the pointer when deciding which MemSet variant to use.
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
***************
*** 106,112 **** extern void tuplesort_get_stats(Tuplesortstate *state,
const char **spaceType,
long *spaceUsed);
! extern int tuplesort_merge_order(long allowedMem);
/*
* These routines may only be called if randomAccess was specified 'true'.
--- 106,112 ----
const char **spaceType,
long *spaceUsed);
! extern int tuplesort_merge_order(Size allowedMem);
/*
* These routines may only be called if randomAccess was specified 'true'.
+1
Pavel
Noah,
* Noah Misch (noah@leadboat.com) wrote:
This patch introduces MemoryContextAllocHuge() and repalloc_huge() that check
a higher MaxAllocHugeSize limit of SIZE_MAX/2.
Nice! I've complained about this limit a few different times and just
never got around to addressing it.
This was made easier by tuplesort growth algorithm improvements in commit
8ae35e91807508872cabd3b0e8db35fc78e194ac. The problem has come up before
(TODO item "Allow sorts to use more available memory"), and Tom floated the
idea[1] behind the approach I've used. The next limit faced by sorts is
INT_MAX concurrent tuples in memory, which limits helpful work_mem to about
150 GiB when sorting int4.
That's frustratingly small. :(
[...]
--- 1024,1041 ----
* new array elements even if no other memory were currently used.
*
* We do the arithmetic in float8, because otherwise the product of
! * memtupsize and allowedMem could overflow. Any inaccuracy in the
! * result should be insignificant; but even if we computed a
! * completely insane result, the checks below will prevent anything
! * really bad from happening.
*/
double grow_ratio;
grow_ratio = (double) state->allowedMem / (double) memNowUsed;
! if (memtupsize * grow_ratio < INT_MAX)
! newmemtupsize = (int) (memtupsize * grow_ratio);
! else
! newmemtupsize = INT_MAX;
/* We won't make any further enlargement attempts */
state->growmemtuples = false;
I'm not a huge fan of moving directly to INT_MAX. Are we confident that
everything can handle that cleanly..? I feel like it might be a bit
safer to shy a bit short of INT_MAX (say, by 1K). Perhaps that's overly
paranoid, but there's an awful lot of callers and some loop which +2's
and then overflows would suck, eg:
int x;
for (x = 0; x < INT_MAX; x += 2) {
    myarray[x] = 5;     /* x += 2 eventually overflows past INT_MAX */
}
Also, could this be used to support hashing larger sets..? If we change
NTUP_PER_BUCKET to one, we could end up wanting to create a hash table
larger than INT_MAX since, with 8-byte pointers, that'd only be around
134M tuples.
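(For the arithmetic: a palloc'd bucket array is capped at MaxAllocSize, and
0x3fffffff / 8 bytes per pointer ≈ 134 million buckets.)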
Haven't had a chance to review the rest, but +1 on the overall idea. :)
Thanks!
Stephen
On 13 May 2013 15:26, Noah Misch <noah@leadboat.com> wrote:
A memory chunk allocated through the existing palloc.h interfaces is limited
to MaxAllocSize (~1 GiB). This is best for most callers; SET_VARSIZE() need
not check its own 1 GiB limit, and algorithms that grow a buffer by doubling
need not check for overflow. However, a handful of callers are quite happy to
navigate those hazards in exchange for the ability to allocate a larger chunk.
This patch introduces MemoryContextAllocHuge() and repalloc_huge() that check
a higher MaxAllocHugeSize limit of SIZE_MAX/2. Chunks don't bother recording
whether they were allocated as huge; one can start with palloc() and then
repalloc_huge() to grow the value.
I like the design and think it's workable.
I'm concerned that people will accidentally use MaxAllocSize. Can we
put in a runtime warning if someone tests AllocSizeIsValid() with a
larger value?
To demonstrate, I put this to use in
tuplesort.c; the patch also updates tuplestore.c to keep them similar. Here's
the trace_sort from building the pgbench_accounts primary key at scale factor
7500, maintenance_work_mem = '56GB'; memtuples itself consumed 17.2 GiB:
LOG: internal sort ended, 48603324 KB used: CPU 75.65s/305.46u sec elapsed 391.21 sec
Compare:
LOG: external sort ended, 1832846 disk blocks used: CPU 77.45s/988.11u sec elapsed 1146.05 sec
Cool.
I'd like to put in an explicit test for this somewhere. Obviously not
part of normal regression, but somewhere, at least, so we have
automated testing that we all agree on. (yes, I know we don't have
that for replication/recovery yet, but that's why I don't want to
repeat that mistake).
This was made easier by tuplesort growth algorithm improvements in commit
8ae35e91807508872cabd3b0e8db35fc78e194ac. The problem has come up before
(TODO item "Allow sorts to use more available memory"), and Tom floated the
idea[1] behind the approach I've used. The next limit faced by sorts is
INT_MAX concurrent tuples in memory, which limits helpful work_mem to about
150 GiB when sorting int4.
I have not added variants like palloc_huge() and palloc0_huge(), and I have
not added to the frontend palloc.h interface. There's no particular barrier
to doing any of that. I don't expect more than a dozen or so callers, so most
of the variations might go unused.
The comment at MaxAllocSize said that aset.c expects doubling the size of an
arbitrary allocation to never overflow, but I couldn't find the code in
question. AllocSetAlloc() does double sizes of blocks used to aggregate small
allocations, so maxBlockSize had better stay under SIZE_MAX/2. Nonetheless,
that expectation does apply to dozens of repalloc() users outside aset.c, and
I preserved it for repalloc_huge(). 64-bit builds will never notice, and I
won't cry for the resulting 2 GiB limit on 32-bit.
Agreed. Can we document this for the relevant parameters?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 22 June 2013 08:46, Stephen Frost <sfrost@snowman.net> wrote:
The next limit faced by sorts is
INT_MAX concurrent tuples in memory, which limits helpful work_mem to about
150 GiB when sorting int4.
That's frustratingly small. :(
But that has nothing to do with this patch, right? And is easily fixed, yes?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
* Simon Riggs (simon@2ndQuadrant.com) wrote:
On 22 June 2013 08:46, Stephen Frost <sfrost@snowman.net> wrote:
The next limit faced by sorts is
INT_MAX concurrent tuples in memory, which limits helpful work_mem to about
150 GiB when sorting int4.
That's frustratingly small. :(
But that has nothing to do with this patch, right? And is easily fixed, yes?
I don't know about 'easily fixed' (consider supporting a HashJoin of >2B
records) but I do agree that dealing with places in the code where we are
using an int4 to keep track of the number of objects in memory is outside
the scope of this patch.
Hopefully we are properly range-checking and limiting ourselves to only
what a given node can support, and not solely depending on MaxAllocSize
to keep us from overflowing some int4 which we're using as an index for
an array or as a count of how many objects we've currently got in
memory. We'll want to consider carefully what happens with such
large sets as we're adding support into nodes for these huge
allocations (along with the recent change to allow 1TB work_mem, which
may encourage users with systems large enough to actually try to set it
that high... :)
Thanks,
Stephen
On Sat, Jun 22, 2013 at 3:46 AM, Stephen Frost <sfrost@snowman.net> wrote:
I'm not a huge fan of moving directly to INT_MAX. Are we confident that
everything can handle that cleanly..? I feel like it might be a bit
safer to shy a bit short of INT_MAX (say, by 1K).
Maybe it would be better to stick with INT_MAX and fix any bugs we
find. If there are magic numbers short of INT_MAX that cause
problems, it would likely be better to find out about those problems
and adjust the relevant code, rather than trying to dodge them. We'll
have to confront all of those problems eventually as we come to
support larger and larger sorts; I don't see much value in putting it
off.
Especially since we're early in the release cycle.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Jun 22, 2013 at 03:46:49AM -0400, Stephen Frost wrote:
* Noah Misch (noah@leadboat.com) wrote:
The next limit faced by sorts is
INT_MAX concurrent tuples in memory, which limits helpful work_mem to about
150 GiB when sorting int4.
That's frustratingly small. :(
I could appreciate a desire to remove that limit. The way to do that is to
audit all uses of "int" variables in tuplesort.c and tuplestore.c, changing
them to Size where they can be used as indexes into the memtuples array.
Nonetheless, this new limit is about 50x the current limit; you need an
(unpartitioned) table of 2B+ rows to encounter it. I'm happy with that.
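For concreteness, that audit would mean widening declarations like these in
Tuplesortstate (an illustrative sketch, not an exhaustive list):

    /* today */
    int         memtupcount;    /* number of tuples currently present */
    int         memtupsize;     /* allocated length of memtuples array */

    /* after such an audit (sketch) */
    Size        memtupcount;
    Size        memtupsize;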
! if (memtupsize * grow_ratio < INT_MAX)
! newmemtupsize = (int) (memtupsize * grow_ratio);
! else
! newmemtupsize = INT_MAX;
/* We won't make any further enlargement attempts */
state->growmemtuples = false;
I'm not a huge fan of moving directly to INT_MAX. Are we confident that
everything can handle that cleanly..? I feel like it might be a bit
safer to shy a bit short of INT_MAX (say, by 1K). Perhaps that's overly
paranoid, but there's an awful lot of callers and some loop which +2's
and then overflows would suck, eg:
Where are you seeing "an awful lot of callers"? The code that needs to be
correct with respect to the INT_MAX limit is all in tuplesort.c/tuplestore.c.
Consequently, I chose to verify that code rather than add a safety factor. (I
did add an unrelated safety factor to repalloc_huge() itself.)
Also, could this be used to support hashing larger sets..? If we change
NTUP_PER_BUCKET to one, we could end up wanting to create a hash table
larger than INT_MAX since, with 8-byte pointers, that'd only be around
134M tuples.
The INT_MAX limit is an internal limit of tuplesort/tuplestore; other
consumers of the huge allocation APIs are only subject to that limit if they
find reasons to enforce it on themselves. (Incidentally, the internal limit
in question is INT_MAX tuples, not INT_MAX bytes.)
--
Noah Misch
EnterpriseDB http://www.enterprisedb.com
On Sat, Jun 22, 2013 at 11:36:45AM +0100, Simon Riggs wrote:
On 13 May 2013 15:26, Noah Misch <noah@leadboat.com> wrote:
I'm concerned that people will accidentally use MaxAllocSize. Can we
put in a runtime warning if someone tests AllocSizeIsValid() with a
larger value?
I don't see how we could. To preempt a repalloc() failure, you test with
AllocSizeIsValid(); testing a larger value is not a programming error.
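In other words, the caller-side idiom looks roughly like this (a sketch;
"newsize", "buf" and the error text are illustrative, not from any particular
caller):

    /* Probe the limit up front instead of letting repalloc() elog. */
    if (!AllocSizeIsValid(newsize))
        ereport(ERROR,
                (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                 errmsg("value too large to process")));
    buf = repalloc(buf, newsize);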
To demonstrate, I put this to use in
tuplesort.c; the patch also updates tuplestore.c to keep them similar. Here's
the trace_sort from building the pgbench_accounts primary key at scale factor
7500, maintenance_work_mem = '56GB'; memtuples itself consumed 17.2 GiB:
LOG: internal sort ended, 48603324 KB used: CPU 75.65s/305.46u sec elapsed 391.21 sec
Compare:
LOG: external sort ended, 1832846 disk blocks used: CPU 77.45s/988.11u sec elapsed 1146.05 sec
Cool.
I'd like to put in an explicit test for this somewhere. Obviously not
part of normal regression, but somewhere, at least, so we have
automated testing that we all agree on. (yes, I know we don't have
that for replication/recovery yet, but thats why I don't want to
repeat that mistake).
Probably the easiest way to test from nothing is to run "pgbench -i -s 7500"
under a high work_mem. I agree that an automated test suite dedicated to
coverage of scale-dependent matters would be valuable, though I'm disinclined
to start one in conjunction with this particular patch.
The comment at MaxAllocSize said that aset.c expects doubling the size of an
arbitrary allocation to never overflow, but I couldn't find the code in
question. AllocSetAlloc() does double sizes of blocks used to aggregate small
allocations, so maxBlockSize had better stay under SIZE_MAX/2. Nonetheless,
that expectation does apply to dozens of repalloc() users outside aset.c, and
I preserved it for repalloc_huge(). 64-bit builds will never notice, and I
won't cry for the resulting 2 GiB limit on 32-bit.
Agreed. Can we document this for the relevant parameters?
I attempted to cover most of that in the comment above MaxAllocHugeSize, but I
did not mention the maxBlockSize constraint. I'll add an
Assert(AllocHugeSizeIsValid(maxBlockSize)) and a comment to
AllocSetContextCreate(). Did I miss documenting anything else notable?
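(Concretely, the planned addition would be something along these lines near
the top of AllocSetContextCreate() -- a sketch of the change, not yet
committed:

    /*
     * Blocks used to aggregate small allocations grow by doubling up to
     * maxBlockSize, so that limit must stay low enough to double without
     * overflowing Size arithmetic.
     */
    Assert(AllocHugeSizeIsValid(maxBlockSize));
)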
--
Noah Misch
EnterpriseDB http://www.enterprisedb.com
On Monday, June 24, 2013, Noah Misch wrote:
On Sat, Jun 22, 2013 at 03:46:49AM -0400, Stephen Frost wrote:
* Noah Misch (noah@leadboat.com) wrote:
The next limit faced by sorts is
INT_MAX concurrent tuples in memory, which limits helpful work_mem to about
150 GiB when sorting int4.
That's frustratingly small. :(
I could appreciate a desire to remove that limit. The way to do that is to
audit all uses of "int" variables in tuplesort.c and tuplestore.c, changing
them to Size where they can be used as indexes into the memtuples array.
Right, that's about what I figured would need to be done.
Nonetheless, this new limit is about 50x the current limit; you need an
(unpartitioned) table of 2B+ rows to encounter it. I'm happy with that.
Definitely better but I could see cases with that many tuples in the
not-too-distant future, esp. when used with MinMax indexes...
! if (memtupsize * grow_ratio < INT_MAX)
! newmemtupsize = (int) (memtupsize * grow_ratio);
! else
! newmemtupsize = INT_MAX;
/* We won't make any further enlargement attempts */
state->growmemtuples = false;
I'm not a huge fan of moving directly to INT_MAX. Are we confident that
everything can handle that cleanly..? I feel like it might be a bit
safer to shy a bit short of INT_MAX (say, by 1K). Perhaps that's overly
paranoid, but there's an awful lot of callers and some loop which +2's
and then overflows would suck, eg:
Where are you seeing "an awful lot of callers"? The code that needs to be
correct with respect to the INT_MAX limit is all in
tuplesort.c/tuplestore.c.
Consequently, I chose to verify that code rather than add a safety factor.
(I did add an unrelated safety factor to repalloc_huge() itself.)
Ok, I was thinking this code was used beyond tuplesort (I was thinking it
was actually associated with palloc). Apologies for the confusion. :)
Also, could this be used to support hashing larger sets..? If we change
NTUP_PER_BUCKET to one, we could end up wanting to create a hash table
larger than INT_MAX since, with 8-byte pointers, that'd only be around
134M tuples.
The INT_MAX limit is an internal limit of tuplesort/tuplestore; other
consumers of the huge allocation APIs are only subject to that limit if they
find reasons to enforce it on themselves. (Incidentally, the internal limit
in question is INT_MAX tuples, not INT_MAX bytes.)
There are other places where we use integers for indexes into arrays of
tuples (at least hashing is another area..) and those are then also subject
to INT_MAX, which was really what I was getting at. We might move the
hashing code to use the _huge functions and would then need to adjust that
code to use Size for the index into the hash table array of pointers.
Thanks,
Stephen
On Mon, May 13, 2013 at 7:26 AM, Noah Misch <noah@leadboat.com> wrote:
A memory chunk allocated through the existing palloc.h interfaces is limited
to MaxAllocSize (~1 GiB). This is best for most callers; SET_VARSIZE() need
not check its own 1 GiB limit, and algorithms that grow a buffer by doubling
need not check for overflow. However, a handful of callers are quite happy to
navigate those hazards in exchange for the ability to allocate a larger chunk.
This patch introduces MemoryContextAllocHuge() and repalloc_huge() that check
a higher MaxAllocHugeSize limit of SIZE_MAX/2. Chunks don't bother recording
whether they were allocated as huge; one can start with palloc() and then
repalloc_huge() to grow the value.
Since it doesn't record the size, I assume the non-use as a varlena is
enforced only by coder discipline and not by the system?
! * represented in a varlena header. Callers that never use the allocation as
! * a varlena can access the higher limit with MemoryContextAllocHuge(). Both
! * limits permit code to assume that it may compute (in size_t math) twice an
! * allocation's size without overflow.
What is likely to happen if I accidentally let a pointer to huge memory
escape to someone who then passes it to varlena constructor without me
knowing it? (I tried sabotaging the code to make this happen, but I could
not figure out how to). Is there a place we can put an Assert to catch
this mistake under enable-cassert builds?
I have not yet done a detailed code review, but this applies and builds
cleanly, passes make check with and without enable-cassert, it does what it
says (and gives performance improvements when it does kick in), and we want
this. No doc changes should be needed, we probably don't want run an
automatic regression test of the size needed to usefully test this, and as
far as I know there is no infrastructure for "big memory only" tests.
The only danger I can think of is that it could sometimes make some sorts
slower, as using more memory than is necessary can sometimes slow down an
"external" sort (because the heap is then too big for the fastest CPU
cache). If you use more tapes, but not enough more to reduce the number of
passes needed, then you can get a slowdown.
I can't imagine that it would make things worse on average, though, as the
benefit of doing more sorts as quicksorts rather than merge sorts, or doing
a mergesort with fewer passes, would outweigh sometimes doing a
slower mergesort. If someone has a pathological use pattern for which the
averages don't work out favorably for them, they could probably play with
work_mem to correct the problem. Whereas without the patch, people who
want more memory have no options.
People have mentioned additional things that could be done in this area,
but I don't think that applying this patch will make those things harder,
or back us into a corner. Taking an incremental approach seems suitable.
Cheers,
Jeff
On Wed, Jun 26, 2013 at 03:48:23PM -0700, Jeff Janes wrote:
On Mon, May 13, 2013 at 7:26 AM, Noah Misch <noah@leadboat.com> wrote:
This patch introduces MemoryContextAllocHuge() and repalloc_huge() that check
a higher MaxAllocHugeSize limit of SIZE_MAX/2. Chunks don't bother recording
whether they were allocated as huge; one can start with palloc() and then
repalloc_huge() to grow the value.
Since it doesn't record the size, I assume the non-use as a varlena is
enforced only by coder discipline and not by the system?
We will rely on coder discipline, yes. The allocator actually does record a
size. I was referring to the fact that it can't distinguish the result of
repalloc(p, 7) from the result of repalloc_huge(p, 7).
What is likely to happen if I accidentally let a pointer to huge memory
escape to someone who then passes it to varlena constructor without me
knowing it? (I tried sabotaging the code to make this happen, but I could
not figure out how to). Is there a place we can put an Assert to catch
this mistake under enable-cassert builds?
Passing a too-large value gives a modulo effect. We could inject an
AssertMacro() into SET_VARSIZE(). But it's a hot path, and I don't think this
mistake is too likely.
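For the record, the guard under discussion would look something like this
(hypothetical sketch; it assumes memutils.h's limit macro is visible where
SET_VARSIZE() is defined, and it is not being proposed given the hot-path
cost):

    /* Hypothetical assertion: catch a huge chunk being stamped as varlena. */
    #define SET_VARSIZE(PTR, len) \
        do { \
            AssertMacro(AllocSizeIsValid(len)); \
            SET_VARSIZE_4B(PTR, len); \
        } while (0)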
The only danger I can think of is that it could sometimes make some sorts
slower, as using more memory than is necessary can sometimes slow down an
"external" sort (because the heap is then too big for the fastest CPU
cache). If you use more tapes, but not enough more to reduce the number of
passes needed, then you can get a slowdown.
Interesting point, though I don't fully understand it. The fastest CPU cache
will be a tiny L1 data cache; surely that's not the relevant parameter here?
I can't imagine that it would make things worse on average, though, as the
benefit of doing more sorts as quicksorts rather than merge sorts, or doing
mergesort with fewer number of passes, would outweigh sometimes doing a
slower mergesort. If someone has a pathological use pattern for which the
averages don't work out favorably for them, they could probably play with
work_mem to correct the problem. Whereas without the patch, people who
want more memory have no options.
Agreed.
People have mentioned additional things that could be done in this area,
but I don't think that applying this patch will make those things harder,
or back us into a corner. Taking an incremental approach seems suitable.
Committed with some cosmetic tweaks discussed upthread.
Thanks,
nm
--
Noah Misch
EnterpriseDB http://www.enterprisedb.com
On Sat, Jun 22, 2013 at 12:46 AM, Stephen Frost <sfrost@snowman.net> wrote:
Noah,
* Noah Misch (noah@leadboat.com) wrote:
This patch introduces MemoryContextAllocHuge() and repalloc_huge() that check
a higher MaxAllocHugeSize limit of SIZE_MAX/2.
Nice! I've complained about this limit a few different times and just
never got around to addressing it.
This was made easier by tuplesort growth algorithm improvements in commit
8ae35e91807508872cabd3b0e8db35fc78e194ac. The problem has come up before
(TODO item "Allow sorts to use more available memory"), and Tom floated the
idea[1] behind the approach I've used. The next limit faced by sorts is
INT_MAX concurrent tuples in memory, which limits helpful work_mem to about
150 GiB when sorting int4.
That's frustratingly small. :(
I've added a ToDo item to remove that limit from sorts as well.
I was going to add another item to make nodeHash.c use the new huge
allocator, but after looking at it just now it was not clear to me that it
even has such a limitation. nbatch is limited by MaxAllocSize, but
nbuckets doesn't seem to be.
Cheers,
Jeff
Jeff,
* Jeff Janes (jeff.janes@gmail.com) wrote:
I was going to add another item to make nodeHash.c use the new huge
allocator, but after looking at it just now it was not clear to me that it
even has such a limitation. nbatch is limited by MaxAllocSize, but
nbuckets doesn't seem to be.
nodeHash.c:ExecHashTableCreate() allocates ->buckets using:
palloc(nbuckets * sizeof(HashJoinTuple))
(where HashJoinTuple is actually just a pointer), and reallocates same
in ExecHashTableReset(). That limits the current implementation to only
about 134M buckets, no?
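If that ever became a pressing limit, the allocation itself could presumably
switch to the huge API, along the lines of the following untested sketch (it
assumes the bucket array lives in the hash table's batch context, and nbuckets
would also have to become wider than int for this to help):

    /* Hypothetical: let the bucket array grow past MaxAllocSize. */
    hashtable->buckets = (HashJoinTuple *)
        MemoryContextAllocHuge(hashtable->batchCxt,
                               nbuckets * sizeof(HashJoinTuple));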
Now, what I was really suggesting wasn't so much changing those specific
calls; my point was really that there's a ton of stuff in the HashJoin
code that uses 32-bit integers for things which, these days, might be too
small (nbuckets being one example, imv). There's a lot of code there
though and you'd have to really consider which things make sense to have
as int64's.
Thanks,
Stephen