Poor memory context performance in large hash joins
When doing a hash join with large work_mem, you can have a large number of
chunks. Then if ExecHashIncreaseNumBatches gets called, those chunks are
walked through, moving the tuples to new chunks (or to disk, if they no
longer match the batch's bitmask), and freeing the old chunks.
The number of new chunks can be almost as large as the number of old
chunks, especially if there is a very popular value. The problem is that
every time an old chunk is freed, the code in aset.c around line 968 has to
walk over all the newly allocated chunks in the linked list before it can
find the old one being freed. This is an N^2 operation, and I think it has
horrible CPU cache hit rates as well.
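For reference, the search in question looks roughly like this (paraphrased
from AllocSetFree() in aset.c, not the verbatim source):

    AllocBlock block = set->blocks;
    AllocBlock prevblock = NULL;

    while (block != NULL)
    {
        /* a big chunk sits alone in its block, right after the header */
        if (chunk == (AllocChunk) (((char *) block) + ALLOC_BLOCKHDRSZ))
            break;
        prevblock = block;
        block = block->next;    /* new blocks are pushed onto the front */
    }
    if (block == NULL)
        elog(ERROR, "could not find block containing chunk %p", chunk);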
Is there a good solution to this? Could the new chunks be put in a
different memory context, and then destroy the old context and install the
new one at the end of ExecHashIncreaseNumBatches? I couldn't find a destroy
method for memory contexts, it looks like you just reset the parent
instead. But I don't think that would work here.
Thanks,
Jeff
On Thu, Feb 23, 2017 at 2:13 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
Is there a good solution to this? Could the new chunks be put in a
different memory context, and then destroy the old context and install the
new one at the end of ExecHashIncreaseNumBatches? I couldn't find a destroy
method for memory contexts, it looks like you just reset the parent instead.
But I don't think that would work here.
Are you aware of the fact that tuplesort.c got a second memory context
for 9.6, entirely on performance grounds?
--
Peter Geoghegan
Jeff Janes <jeff.janes@gmail.com> writes:
The number of new chunks can be almost as large as the number of old
chunks, especially if there is a very popular value. The problem is that
every time an old chunk is freed, the code in aset.c around line 968 has to
walk over all the newly allocated chunks in the linked list before it can
find the old one being freed. This is an N^2 operation, and I think it has
horrible CPU cache hit rates as well.
Maybe it's time to convert that to a doubly-linked list. Although if the
hash code is producing a whole lot of requests that are only a bit bigger
than the separate-block threshold, I'd say It's Doing It Wrong. It should
learn to aggregate them into larger requests.
regards, tom lane
On 2017-02-23 17:28:26 -0500, Tom Lane wrote:
Jeff Janes <jeff.janes@gmail.com> writes:
The number of new chunks can be almost as large as the number of old
chunks, especially if there is a very popular value. The problem is that
every time an old chunk is freed, the code in aset.c around line 968 has to
walk over all the newly allocated chunks in the linked list before it can
find the old one being freed. This is an N^2 operation, and I think it has
horrible CPU cache hit rates as well.
Maybe it's time to convert that to a doubly-linked list.
Yes, I do think so. Given that we only have that for full blocks, not
for small chunks, the cost seems negligible.
That would also, partially, address the performance issue
http://archives.postgresql.org/message-id/d15dff83-0b37-28ed-0809-95a5cc7292ad%402ndquadrant.com
addresses, in a more realistically backpatchable manner.
Jeff, do you have a handy demonstrator?
Although if the
hash code is producing a whole lot of requests that are only a bit bigger
than the separate-block threshold, I'd say It's Doing It Wrong. It should
learn to aggregate them into larger requests.
That's probably right, but we can't really address that in the
back-branches. And to me this sounds like something we should address
in the branches, not just in master. Even if we'd also fix the
hash-aggregation logic, I think such an O(n^2) behaviour in the
allocator is a bad idea in general, and we should fix it anyway.
Regards,
Andres
Andres Freund <andres@anarazel.de> writes:
On 2017-02-23 17:28:26 -0500, Tom Lane wrote:
Maybe it's time to convert that to a doubly-linked list.
Yes, I do think so. Given that we only have that for full blocks, not
for small chunks, the cost seems negligible.
That would also, partially, address the performance issue
http://archives.postgresql.org/message-id/d15dff83-0b37-28ed-0809-95a5cc7292ad%402ndquadrant.com
addresses, in a more realistically backpatchable manner.
Yeah, I was wondering if we could get away with back-patching such a
change. In principle, nothing outside aset.c should know what's in the
header of an AllocBlock, but ...
Jeff, do you have a handy demonstrator?
A solid test case would definitely help to convince people that it was
worth taking some risk here.
regards, tom lane
On Fri, Feb 24, 2017 at 12:17 PM, Andres Freund <andres@anarazel.de> wrote:
Jeff, do you have a handy demonstrator?
If you want to hit ExecHashIncreaseNumBatches a bunch of times, see
the "ugly" example in the attached.
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
On 2017-02-24 01:59:01 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2017-02-23 17:28:26 -0500, Tom Lane wrote:
Maybe it's time to convert that to a doubly-linked list.
Yes, I do think so. Given that we only have that for full blocks, not
for small chunks, the cost seems negligible.
That would also, partially, address the performance issue
http://archives.postgresql.org/message-id/d15dff83-0b37-28ed-0809-95a5cc7292ad%402ndquadrant.com
addresses, in a more realistically backpatchable manner.
Yeah, I was wondering if we could get away with back-patching such a
change. In principle, nothing outside aset.c should know what's in the
header of an AllocBlock, but ...
You'd need to go through a fair amount of intentional pain to be
affected by a change to AllocBlockData's structure. We could add the
->prev pointer to the end of AllocBlockData's definition to make it less
likely that one would be affected in that unlikely case - but I'm a bit
doubtful it's worth the trouble.
- Andres
On Thu, Feb 23, 2017 at 2:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeff Janes <jeff.janes@gmail.com> writes:
The number of new chunks can be almost as large as the number of old
chunks, especially if there is a very popular value. The problem is that
every time an old chunk is freed, the code in aset.c around line 968 has to
walk over all the newly allocated chunks in the linked list before it can
find the old one being freed. This is an N^2 operation, and I think it has
horrible CPU cache hit rates as well.
Maybe it's time to convert that to a doubly-linked list.
I don't think that would help. You would need some heuristic to guess
whether the chunk you are looking for is near the front, or near the end.
And in this case, the desired chunk starts out at the front, and then keeps
moving down the list with each iteration as new things are added to the
front, until near the end of the ExecHashIncreaseNumBatches it is freeing
them from near the end. But in between, it is freeing them from the
middle, so I don't think a doubly-linked list would change it from N^2,
just lower the constant, even if you always knew which end to start at. Or
am I misunderstanding what the implications for a doubly-linked-list are?
What would really help here is if it remembered the next pointer of the
just-freed chunk, and started the scan from that location the next time,
cycling around to the head pointer if it doesn't find anything. But I
don't think that that is a very general solution.
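As a self-contained sketch of that idea on a generic singly-linked list
(hypothetical code; the names are invented and this is not aset.c):

    #include <stdbool.h>
    #include <stdlib.h>

    typedef struct Node
    {
        struct Node *next;
        void        *payload;
    } Node;

    typedef struct List
    {
        Node   *head;
        Node   *cursor_prev;    /* node before where the next search resumes */
    } List;

    /* Roving-cursor delete: resume searching just past the previous
     * deletion, wrapping to the head at most once.  Mostly-FIFO frees
     * then find their target immediately instead of rescanning the
     * whole list from the head each time. */
    static bool
    list_delete(List *list, void *payload)
    {
        Node   *prev = list->cursor_prev;
        Node   *node = prev ? prev->next : list->head;
        bool    wrapped = (prev == NULL);

        for (;;)
        {
            if (node == NULL)
            {
                if (wrapped)
                    return false;       /* searched the whole list */
                prev = NULL;
                node = list->head;      /* wrap to the head once */
                wrapped = true;
                continue;
            }
            if (node->payload == payload)
                break;
            prev = node;
            node = node->next;
        }

        /* unlink, and remember where to resume the next search */
        if (prev)
            prev->next = node->next;
        else
            list->head = node->next;
        list->cursor_prev = prev;
        free(node);
        return true;
    }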
Or, if you could pass a flag when creating the context telling it whether
memory will be freed mostly-LIFO or mostly-FIFO, and have it use a stack or
a queue accordingly.
Although if the
hash code is producing a whole lot of requests that are only a bit bigger
than the separate-block threshold, I'd say It's Doing It Wrong. It should
learn to aggregate them into larger requests.
Right now it is using compiled-in 32KB chunks. Should it use something
like max(32kb,work_mem/128) instead?
Cheers,
Jeff
On Thu, Feb 23, 2017 at 10:47 PM, Andres Freund <andres@anarazel.de> wrote:
On 2017-02-23 17:28:26 -0500, Tom Lane wrote:
Jeff Janes <jeff.janes@gmail.com> writes:
The number of new chunks can be almost as large as the number of old
chunks, especially if there is a very popular value. The problem is that
every time an old chunk is freed, the code in aset.c around line 968 has to
walk over all the newly allocated chunks in the linked list before it can
find the old one being freed. This is an N^2 operation, and I think it has
horrible CPU cache hit rates as well.
Maybe it's time to convert that to a doubly-linked list.
Yes, I do think so. Given that we only have that for full blocks, not
for small chunks, the cost seems negligible.
That would also, partially, address the performance issue
http://archives.postgresql.org/message-id/d15dff83-0b37-28ed-0809-95a5cc7292ad%402ndquadrant.com
addresses, in a more realistically backpatchable manner.
Jeff, do you have a handy demonstrator?
Not exactly "handy", as it takes about 45 minutes to set up and uses >50GB
of disk and 16GB of RAM, but here is a demonstration.
create table foobar as select CASE when random()<(420.0/540.0) then 1 else
floor(random()*880000)::int END as titleid from
generate_series(1,540000000);
create table foobar2 as select distinct titleid from foobar ;
analyze;
set work_mem TO "13GB";
select count(*) from foobar2 where not exists (select 1 from foobar t where
t.titleid=foobar2.titleid);
This will run for effectively forever, unless it gets killed/dies due to
OOM first. If I have other things consuming some of my 16GB of RAM, then
it gets OOM (which is not a complaint: it is as one should expect given
that I told it work_mem was 13GB). If I have no other demands on RAM, then
I can't tell if it would eventually OOM or not because of how long it would
take to get that far.
This is inspired by the thread
/messages/by-id/CACw4T0p4Lzd6VpwptxgPgoTMh2dEKTQBGu7NTaJ1+A0PRx1BGg@mail.gmail.com
I ran into this while evaluating Tom's responding patch, but you don't need
to apply that patch to run this example and see the effect. I don't have
an example of a case that demonstrates the problem in the absence of a
degenerate hash bucket. I think there should be non-degenerate cases that
trigger it, but I haven't been able to produce one yet.
Although if the
hash code is producing a whole lot of requests that are only a bit bigger
than the separate-block threshold, I'd say It's Doing It Wrong. It should
learn to aggregate them into larger requests.
That's probably right, but we can't really address that in the
back-branches. And to me this sounds like something we should address
in the branches, not just in master. Even if we'd also fix the
hash-aggregation logic, I think such an O(n^2) behaviour in the
allocator is a bad idea in general, and we should fix it anyway.
I don't know how important a back-patch would be. This is a toy case for
me. Presumably not for David Hinkle, except he doesn't want a hash join in
the first place, while I want one that finishes in a reasonable time. It
might depend on what happens to Tom's OOM patch.
It would be great if the allocator was made bullet-proof, but I don't think
adding reverse links (or anything else back-patchable) is going to be
enough to do that.
Cheers,
Jeff
Jeff Janes <jeff.janes@gmail.com> writes:
On Thu, Feb 23, 2017 at 2:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Maybe it's time to convert that to a doubly-linked list.
I don't think that would help. You would need some heuristic to guess
whether the chunk you are looking for is near the front, or near the end.
Uh, what? In a doubly-linked list, you can remove an element in O(1)
time, you don't need any searching. It basically becomes
item->prev->next = item->next;
item->next->prev = item->prev;
modulo possible special cases for the head and tail elements.
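Spelled out with the head and tail cases (a sketch of what the unlink
would look like with the list head kept in the owning set):

    if (block->prev == NULL)
        set->blocks = block->next;          /* block was the list head */
    else
        block->prev->next = block->next;
    if (block->next != NULL)
        block->next->prev = block->prev;    /* block was not the tail */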
Although if the
hash code is producing a whole lot of requests that are only a bit bigger
than the separate-block threshold, I'd say It's Doing It Wrong. It should
learn to aggregate them into larger requests.
Right now it is using compiled-in 32KB chunks. Should it use something
like max(32kb,work_mem/128) instead?
I'd say it should double the size of the request each time. That's what
we do in most places.
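Roughly like this (illustrative only; the identifiers are made up, not the
actual nodeHash.c ones):

    #define DENSE_CHUNK_INIT_SIZE  (32 * 1024)          /* current fixed size */
    #define DENSE_CHUNK_MAX_SIZE   (8 * 1024 * 1024)    /* arbitrary cap */

    static Size dense_chunk_size = DENSE_CHUNK_INIT_SIZE;

    /* Double the request each time, so the number of separate blocks
     * grows only logarithmically with the total memory used. */
    static void *
    alloc_dense_chunk(MemoryContext cxt)
    {
        void   *chunk = MemoryContextAlloc(cxt, dense_chunk_size);

        if (dense_chunk_size < DENSE_CHUNK_MAX_SIZE)
            dense_chunk_size *= 2;
        return chunk;
    }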
regards, tom lane
On Fri, Feb 24, 2017 at 10:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeff Janes <jeff.janes@gmail.com> writes:
On Thu, Feb 23, 2017 at 2:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Maybe it's time to convert that to a doubly-linked list.
I don't think that would help. You would need some heuristic to guess
whether the chunk you are looking for is near the front, or near the end.
Uh, what? In a doubly-linked list, you can remove an element in O(1)
time, you don't need any searching. It basically becomes
item->prev->next = item->next;
item->next->prev = item->prev;
modulo possible special cases for the head and tail elements.
Currently it is walking the chain to identify which block holds the chunk
in the first place, not just to get the pointer to the previous block. But
I see that that could be fixed by pointer arithmetic once there is a reason
to fix it. Which there isn't a reason to as long as you need to walk the
chain to get the prev pointer anyway. Like this?
targetblock = (AllocBlock) (((char*)chunk) - ALLOC_BLOCKHDRSZ);
Although if the
hash code is producing a whole lot of requests that are only a bit bigger
than the separate-block threshold, I'd say It's Doing It Wrong. It should
learn to aggregate them into larger requests.
Right now it is using compiled-in 32KB chunks. Should it use something
like max(32kb,work_mem/128) instead?
I'd say it should double the size of the request each time. That's what
we do in most places.
I thought the design goal there was that the space in old chunks could get
re-used into the new chunks in a reasonably fine-grained way. If the last
chunk contains just over half of all the data, that couldn't happen.
Cheers,
Jeff
Jeff Janes <jeff.janes@gmail.com> writes:
On Fri, Feb 24, 2017 at 10:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Uh, what? In a doubly-linked list, you can remove an element in O(1)
time, you don't need any searching.
Currently it is walking the chain to identify which block holds the chunk
in the first place, not just to get the pointer to the previous block. But
I see that that could be fixed by pointer arithmetic once there is a reason
to fix it. Which there isn't a reason to as long as you need to walk the
chain to get the prev pointer anyway. Like this?
targetblock = (AllocBlock) (((char*)chunk) - ALLOC_BLOCKHDRSZ);
Right. We're just turning around the existing address calculation. It'd
be a good idea to provide some cross-check that we found a sane-looking
block header, but that's not that hard --- we know what ought to be in
aset, freeptr, and endptr for a single-chunk block's header, and that
seems like enough of a crosscheck to me.
Concretely, something like the attached. This passes regression tests
but I've not pushed on it any harder than that.
One argument against this is that it adds some nonzero amount of overhead
to block management; but considering that we are calling malloc or realloc
or free alongside any one of these manipulations, that overhead should be
pretty negligible. I'm a bit more worried about increasing the alloc
block header size by 8 bytes, but that would only really matter with a
very small allocChunkLimit.
regards, tom lane
Attachments:
doubly-linked-block-list.patch (text/x-diff)
diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index 4dfc3ec..c250bae 100644
*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
*************** typedef AllocSetContext *AllocSet;
*** 201,207 ****
typedef struct AllocBlockData
{
AllocSet aset; /* aset that owns this block */
! AllocBlock next; /* next block in aset's blocks list */
char *freeptr; /* start of free space in this block */
char *endptr; /* end of space in this block */
} AllocBlockData;
--- 201,208 ----
typedef struct AllocBlockData
{
AllocSet aset; /* aset that owns this block */
! AllocBlock prev; /* prev block in aset's blocks list, if any */
! AllocBlock next; /* next block in aset's blocks list, if any */
char *freeptr; /* start of free space in this block */
char *endptr; /* end of space in this block */
} AllocBlockData;
*************** AllocSetContextCreate(MemoryContext pare
*** 522,528 ****
--- 523,532 ----
block->aset = set;
block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
block->endptr = ((char *) block) + blksize;
+ block->prev = NULL;
block->next = set->blocks;
+ if (block->next)
+ block->next->prev = block;
set->blocks = block;
/* Mark block as not to be released at reset time */
set->keeper = block;
*************** AllocSetReset(MemoryContext context)
*** 604,609 ****
--- 608,614 ----
VALGRIND_MAKE_MEM_NOACCESS(datastart, block->freeptr - datastart);
#endif
block->freeptr = datastart;
+ block->prev = NULL;
block->next = NULL;
}
else
*************** AllocSetAlloc(MemoryContext context, Siz
*** 710,725 ****
#endif
/*
! * Stick the new block underneath the active allocation block, so that
! * we don't lose the use of the space remaining therein.
*/
if (set->blocks != NULL)
{
block->next = set->blocks->next;
set->blocks->next = block;
}
else
{
block->next = NULL;
set->blocks = block;
}
--- 715,734 ----
#endif
/*
! * Stick the new block underneath the active allocation block, if any,
! * so that we don't lose the use of the space remaining therein.
*/
if (set->blocks != NULL)
{
+ block->prev = set->blocks;
block->next = set->blocks->next;
+ if (block->next)
+ block->next->prev = block;
set->blocks->next = block;
}
else
{
+ block->prev = NULL;
block->next = NULL;
set->blocks = block;
}
*************** AllocSetAlloc(MemoryContext context, Siz
*** 900,906 ****
--- 909,918 ----
VALGRIND_MAKE_MEM_NOACCESS(block->freeptr,
blksize - ALLOC_BLOCKHDRSZ);
+ block->prev = NULL;
block->next = set->blocks;
+ if (block->next)
+ block->next->prev = block;
set->blocks = block;
}
*************** AllocSetFree(MemoryContext context, void
*** 960,988 ****
{
/*
* Big chunks are certain to have been allocated as single-chunk
! * blocks. Find the containing block and return it to malloc().
*/
! AllocBlock block = set->blocks;
! AllocBlock prevblock = NULL;
! while (block != NULL)
! {
! if (chunk == (AllocChunk) (((char *) block) + ALLOC_BLOCKHDRSZ))
! break;
! prevblock = block;
! block = block->next;
! }
! if (block == NULL)
elog(ERROR, "could not find block containing chunk %p", chunk);
- /* let's just make sure chunk is the only one in the block */
- Assert(block->freeptr == ((char *) block) +
- (chunk->size + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ));
/* OK, remove block from aset's list and free it */
! if (prevblock == NULL)
set->blocks = block->next;
else
! prevblock->next = block->next;
#ifdef CLOBBER_FREED_MEMORY
wipe_mem(block, block->freeptr - ((char *) block));
#endif
--- 972,999 ----
{
/*
* Big chunks are certain to have been allocated as single-chunk
! * blocks. Just unlink that block and return it to malloc().
*/
! AllocBlock block = (AllocBlock) (((char *) chunk) - ALLOC_BLOCKHDRSZ);
! /*
! * Try to verify that we have a sane block pointer: it should
! * reference the correct aset, and freeptr and endptr should point
! * just past the chunk.
! */
! if (block->aset != set ||
! block->freeptr != block->endptr ||
! block->freeptr != ((char *) block) +
! (chunk->size + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ))
elog(ERROR, "could not find block containing chunk %p", chunk);
/* OK, remove block from aset's list and free it */
! if (block->prev == NULL)
set->blocks = block->next;
else
! block->prev->next = block->next;
! if (block->next != NULL)
! block->next->prev = block->prev;
#ifdef CLOBBER_FREED_MEMORY
wipe_mem(block, block->freeptr - ((char *) block));
#endif
*************** AllocSetRealloc(MemoryContext context, v
*** 1088,1114 ****
if (oldsize > set->allocChunkLimit)
{
/*
! * The chunk must have been allocated as a single-chunk block. Find
! * the containing block and use realloc() to make it bigger with
! * minimum space wastage.
*/
! AllocBlock block = set->blocks;
! AllocBlock prevblock = NULL;
Size chksize;
Size blksize;
! while (block != NULL)
! {
! if (chunk == (AllocChunk) (((char *) block) + ALLOC_BLOCKHDRSZ))
! break;
! prevblock = block;
! block = block->next;
! }
! if (block == NULL)
elog(ERROR, "could not find block containing chunk %p", chunk);
- /* let's just make sure chunk is the only one in the block */
- Assert(block->freeptr == ((char *) block) +
- (chunk->size + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ));
/* Do the realloc */
chksize = MAXALIGN(size);
--- 1099,1122 ----
if (oldsize > set->allocChunkLimit)
{
/*
! * The chunk must have been allocated as a single-chunk block. Use
! * realloc() to make the containing block bigger with minimum space
! * wastage.
*/
! AllocBlock block = (AllocBlock) (((char *) chunk) - ALLOC_BLOCKHDRSZ);
Size chksize;
Size blksize;
! /*
! * Try to verify that we have a sane block pointer: it should
! * reference the correct aset, and freeptr and endptr should point
! * just past the chunk.
! */
! if (block->aset != set ||
! block->freeptr != block->endptr ||
! block->freeptr != ((char *) block) +
! (chunk->size + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ))
elog(ERROR, "could not find block containing chunk %p", chunk);
/* Do the realloc */
chksize = MAXALIGN(size);
*************** AllocSetRealloc(MemoryContext context, v
*** 1121,1130 ****
/* Update pointers since block has likely been moved */
chunk = (AllocChunk) (((char *) block) + ALLOC_BLOCKHDRSZ);
pointer = AllocChunkGetPointer(chunk);
! if (prevblock == NULL)
set->blocks = block;
else
! prevblock->next = block;
chunk->size = chksize;
#ifdef MEMORY_CONTEXT_CHECKING
--- 1129,1140 ----
/* Update pointers since block has likely been moved */
chunk = (AllocChunk) (((char *) block) + ALLOC_BLOCKHDRSZ);
pointer = AllocChunkGetPointer(chunk);
! if (block->prev == NULL)
set->blocks = block;
else
! block->prev->next = block;
! if (block->next != NULL)
! block->next->prev = block;
chunk->size = chksize;
#ifdef MEMORY_CONTEXT_CHECKING
*************** AllocSetCheck(MemoryContext context)
*** 1315,1323 ****
{
AllocSet set = (AllocSet) context;
char *name = set->header.name;
AllocBlock block;
! for (block = set->blocks; block != NULL; block = block->next)
{
char *bpoz = ((char *) block) + ALLOC_BLOCKHDRSZ;
long blk_used = block->freeptr - bpoz;
--- 1325,1336 ----
{
AllocSet set = (AllocSet) context;
char *name = set->header.name;
+ AllocBlock prevblock;
AllocBlock block;
! for (prevblock = NULL, block = set->blocks;
! block != NULL;
! prevblock = block, block = block->next)
{
char *bpoz = ((char *) block) + ALLOC_BLOCKHDRSZ;
long blk_used = block->freeptr - bpoz;
*************** AllocSetCheck(MemoryContext context)
*** 1335,1340 ****
--- 1348,1363 ----
}
/*
+ * Check block header fields
+ */
+ if (block->aset != set ||
+ block->prev != prevblock ||
+ block->freeptr < bpoz ||
+ block->freeptr > block->endptr)
+ elog(WARNING, "problem in alloc set %s: corrupt header in block %p",
+ name, block);
+
+ /*
* Chunk walker
*/
while (bpoz < block->freeptr)
On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
Concretely, something like the attached. This passes regression tests
but I've not pushed on it any harder than that.
Heh, I'd just gotten something that didn't immediately crash anymore ;)
Running your patch against Jeff's test-case, verified before that I
could easily reproduce the O(N^2) cost.
- Andres
On 2017-02-24 15:12:37 -0800, Andres Freund wrote:
On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
Concretely, something like the attached. This passes regression tests
but I've not pushed on it any harder than that.
Heh, I'd just gotten something that didn't immediately crash anymore ;)
Running your patch against Jeff's test-case, verified before that I
could easily reproduce the O(N^2) cost.
Oh, that didn't take as long as I was afraid (optimized/non-assert build):
postgres[26268][1]=# SET work_mem = '13GB';
SET
Time: 2.591 ms
postgres[26268][1]=# select count(*) from foobar2 where not exists (select 1 from foobar t where t.titleid=foobar2.titleid);
┌───────┐
│ count │
├───────┤
│ 0 │
└───────┘
(1 row)
Time: 268043.710 ms (04:28.044)
Profile:
13.49% postgres [.] ExecHashTableInsert
11.16% postgres [.] BufFileRead
9.20% postgres [.] ExecHashIncreaseNumBatches
9.19% libc-2.24.so [.] __memmove_avx_unaligned_erms
6.74% postgres [.] dense_alloc.isra.0
5.53% postgres [.] ExecStoreMinimalTuple
5.14% postgres [.] ExecHashJoinGetSavedTuple.isra.0
4.68% postgres [.] AllocSetAlloc
4.12% postgres [.] AllocSetFree
2.31% postgres [.] palloc
2.06% [kernel] [k] free_pcppages_bulk
I'd previously run the test for 30min, with something like 99% of the
profile being in AllocSetFree().
- Andres
On 2017-02-24 15:18:04 -0800, Andres Freund wrote:
On 2017-02-24 15:12:37 -0800, Andres Freund wrote:
On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
Concretely, something like the attached. This passes regression tests
but I've not pushed on it any harder than that.
Heh, I'd just gotten something that didn't immediately crash anymore ;)
Running your patch against Jeff's test-case, verified before that I
could easily reproduce the O(N^2) cost.
Oh, that didn't take as long as I was afraid (optimized/non-assert build):
postgres[26268][1]=# SET work_mem = '13GB';
SET
Time: 2.591 ms
postgres[26268][1]=# select count(*) from foobar2 where not exists (select 1 from foobar t where t.titleid=foobar2.titleid);
Time: 268043.710 ms (04:28.044)
As another datapoint, I measured this patch against the problem from
/messages/by-id/20170227111732.vrx5v72ighehwpkf@alap3.anarazel.de
(see top post in thread), and it indeed fixes the runtime issue -
there's still considerably higher memory usage and some runtime
overhead, but the quadratic behaviour is gone.
I think we should go forward with something like this patch in all
branches, and only use Tomas' patch in master, because they're
considerably larger.
Regards,
Andres
On 02/27/2017 12:55 PM, Andres Freund wrote:
On 2017-02-24 15:18:04 -0800, Andres Freund wrote:
On 2017-02-24 15:12:37 -0800, Andres Freund wrote:
On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
Concretely, something like the attached. This passes regression tests
but I've not pushed on it any harder than that.
Heh, I'd just gotten something that didn't immediately crash anymore ;)
Running your patch against Jeff's test-case, verified before that I
could easily reproduce the O(N^2) cost.
Oh, that didn't take as long as I was afraid (optimized/non-assert build):
postgres[26268][1]=# SET work_mem = '13GB';
SET
Time: 2.591 ms
postgres[26268][1]=# select count(*) from foobar2 where not exists (select 1 from foobar t where t.titleid=foobar2.titleid);
Time: 268043.710 ms (04:28.044)
As another datapoint, I measured this patch against the problem from
/messages/by-id/20170227111732.vrx5v72ighehwpkf@alap3.anarazel.de
(see top post in thread), and it indeed fixes the runtime issue -
there's still considerably higher memory usage and some runtime
overhead, but the quadratic behaviour is gone.
I think we should go forward with something like this patch in all
branches, and only use Tomas' patch in master, because they're
considerably larger.
So you've tried to switch hashjoin to the slab allocators? Or what have
you compared?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2017-02-27 19:20:56 +0100, Tomas Vondra wrote:
On 02/27/2017 12:55 PM, Andres Freund wrote:
On 2017-02-24 15:18:04 -0800, Andres Freund wrote:
On 2017-02-24 15:12:37 -0800, Andres Freund wrote:
On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
Concretely, something like the attached. This passes regression tests
but I've not pushed on it any harder than that.
Heh, I'd just gotten something that didn't immediately crash anymore ;)
Running your patch against Jeff's test-case, verified before that I
could easily reproduce the O(N^2) cost.
Oh, that didn't take as long as I was afraid (optimized/non-assert build):
postgres[26268][1]=# SET work_mem = '13GB';
SET
Time: 2.591 ms
postgres[26268][1]=# select count(*) from foobar2 where not exists (select 1 from foobar t where t.titleid=foobar2.titleid);
Time: 268043.710 ms (04:28.044)
As another datapoint, I measured this patch against the problem from
/messages/by-id/20170227111732.vrx5v72ighehwpkf@alap3.anarazel.de
(see top post in thread), and it indeed fixes the runtime issue -
there's still considerably higher memory usage and some runtime
overhead, but the quadratic behaviour is gone.
I think we should go forward with something like this patch in all
branches, and only use Tomas' patch in master, because they're
considerably larger.
So you've tried to switch hashjoin to the slab allocators? Or what have you
compared?
No, sorry for not being more explicit about this. Meant that Tom's
patch addresses the performance issue in reorderbuffer.c to a good
degree (i.e. gets rid of the quadratic cost, even though constants are
higher than w/ your patches). As the patch here is a lot smaller, it
seems like a better choice for the back-branches than backporting
slab.c/generation.c.
Andres
Andres Freund <andres@anarazel.de> writes:
On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
Concretely, something like the attached. This passes regression tests
but I've not pushed on it any harder than that.
I think we should go forward with something like this patch in all
branches, and only use Tomas' patch in master, because they're
considerably larger.
While I'm willing to back-patch the proposed patch to some extent,
I'm a bit hesitant to shove it into all branches, because I don't
think that usage patterns that create a problem occurred till recently.
You won't get a whole lot of standalone-block allocations unless you
have code that allocates many chunks bigger than 8K without applying a
double-the-request-each-time type of strategy. And even then, it doesn't
hurt unless you try to resize those chunks, or delete them in a non-LIFO
order.
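For instance, a pattern like the following (illustrative backend code, not
from the tree) would have hit the quadratic path:

    /* Many chunks just over the default 8K allocChunkLimit, each in its
     * own single-chunk block, freed oldest-first.  New blocks go on the
     * front of the list, so under the old code every pfree() here scans
     * nearly the whole list: O(N^2) overall. */
    #define NCHUNKS 100000

    static void
    fifo_free_worst_case(void)
    {
        void  **chunks = palloc(NCHUNKS * sizeof(void *));
        int     i;

        for (i = 0; i < NCHUNKS; i++)
            chunks[i] = palloc(9000);   /* > 8K: one block per chunk */
        for (i = 0; i < NCHUNKS; i++)
            pfree(chunks[i]);           /* FIFO, i.e. non-LIFO, order */
        pfree(chunks);
    }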
The hashjoin issue is certainly new, and the reorderbuffer issue can't
go back further than 9.4. So my inclination is to patch back to 9.4
and call it good.
regards, tom lane
On 2017-03-07 13:06:39 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2017-02-24 18:04:18 -0500, Tom Lane wrote:
Concretely, something like the attached. This passes regression tests
but I've not pushed on it any harder than that.
I think we should go forward with something like this patch in all
branches, and only use Tomas' patch in master, because they're
considerably larger.
While I'm willing to back-patch the proposed patch to some extent,
I'm a bit hesitant to shove it into all branches, because I don't
think that usage patterns that create a problem occurred till recently.
You won't get a whole lot of standalone-block allocations unless you
have code that allocates many chunks bigger than 8K without applying a
double-the-request-each-time type of strategy. And even then, it doesn't
hurt unless you try to resize those chunks, or delete them in a non-LIFO
order.
The hashjoin issue is certainly new, and the reorderbuffer issue can't
go back further than 9.4. So my inclination is to patch back to 9.4
and call it good.
That works for me. If we find further cases later, we can easily enough
backpatch it then.
Regards,
Andres
Andres Freund <andres@anarazel.de> writes:
On 2017-03-07 13:06:39 -0500, Tom Lane wrote:
The hashjoin issue is certainly new, and the reorderbuffer issue can't
go back further than 9.4. So my inclination is to patch back to 9.4
and call it good.
That works for me. If we find further cases later, we can easily enough
backpatch it then.
Done like that.
regards, tom lane