bulk_multi_insert infinite loops with large rows and small fill factors

Started by David Gould about 13 years ago, 11 messages
#1 David Gould
daveg@sonic.net
2 attachments

COPY IN loops in heap_multi_insert() extending the table until it fills the
disk when trying to insert a wide row into a table with a low fill-factor.
Internally fill-factor is implemented by reserving some space on a
page. For large enough rows and small enough fill-factor bulk_multi_insert()
can't fit the row even on a new empty page, so it keeps allocating new pages
but is never able to place the row. It should always put at least one row on
an empty page.

In the excerpt below, saveFreeSpace is the reserved space for the fill-factor.

while (ndone < ntuples)
{
	...
	/*
	 * Find buffer where at least the next tuple will fit. If the page is
	 * all-visible, this will also pin the requisite visibility map page.
	 */
	buffer = RelationGetBufferForTuple(relation, heaptuples[ndone]->t_len,
	...
	/* Put as many tuples as fit on this page */
	for (nthispage = 0; ndone + nthispage < ntuples; nthispage++)
	{
		HeapTuple	heaptup = heaptuples[ndone + nthispage];

		if (PageGetHeapFreeSpace(page) < MAXALIGN(heaptup->t_len) + saveFreeSpace)
			break;

		RelationPutHeapTuple(relation, buffer, heaptup);
	}
	... Do a bunch of dirtying and logging etc ...
}
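
To see why the loop never terminates, look at the arithmetic: saveFreeSpace
comes from RelationGetTargetPageFreeSpace(), which reserves
BLCKSZ * (100 - fillfactor) / 100 bytes. A minimal standalone sketch of the
failing check, with the constants inlined (illustrative numbers, not
PostgreSQL source):

#include <stdio.h>

#define BLCKSZ      8192	/* default PostgreSQL page size */
#define MAXALIGN(x) (((x) + 7) & ~((unsigned long) 7))

int
main(void)
{
	int		fillfactor = 10;	/* the low fill-factor case */
	int		saveFreeSpace = BLCKSZ * (100 - fillfactor) / 100;	/* 7372 bytes */
	int		page_free = 8168;	/* approx. free space on a brand-new empty page */
	int		tuple_len = 2000;	/* a wide row */

	/* The check from the loop above: even a completely empty page "fails". */
	if (page_free < MAXALIGN(tuple_len) + saveFreeSpace)
		printf("tuple never fits: %d < %lu\n",
			   page_free, MAXALIGN(tuple_len) + saveFreeSpace);
	return 0;
}

With fillfactor = 10 the reservation is 7372 of 8192 bytes, so any tuple wider
than roughly 800 bytes fails the check even on a fresh page, and the outer
loop keeps extending the relation.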

This was introduced in 9.2 as part of the bulk insert speedup.

One more point: in the case where we don't insert any rows, we still do all the
dirtying and logging work even though we did not modify the page. I have tried
to skip all this when no rows are added (nthispage == 0), but my access method
foo is sadly out of date, so someone should take a skeptical look at that.

A test case and patch against 9.2.2 are attached. The patch fixes the problem
and passes make check; most of the diff is just indentation changes. Whoever
tries this will want to test it on a small partition by itself.

-dg

--
David Gould 510 282 0869 daveg@sonic.net
If simplicity worked, the world would be overrun with insects.

Attachments:

multi_insert_fail.sql (text/x-sql)
multi_insert_fail.diff (text/x-patch)
*** postgresql-9.2.2/src/backend/access/heap/heapam.c	2012-12-03 12:16:10.000000000 -0800
--- postgresql-9.2.2dg/src/backend/access/heap/heapam.c	2012-12-12 01:55:58.174653706 -0800
***************
*** 2158,2163 ****
--- 2158,2164 ----
  		Buffer		buffer;
  		Buffer		vmbuffer = InvalidBuffer;
  		bool		all_visible_cleared = false;
+ 		bool		page_is_empty;
  		int			nthispage;
  
  		/*
***************
*** 2173,2299 ****
  		START_CRIT_SECTION();
  
  		/* Put as many tuples as fit on this page */
  		for (nthispage = 0; ndone + nthispage < ntuples; nthispage++)
  		{
  			HeapTuple	heaptup = heaptuples[ndone + nthispage];
  
! 			if (PageGetHeapFreeSpace(page) < MAXALIGN(heaptup->t_len) + saveFreeSpace)
  				break;
! 
  			RelationPutHeapTuple(relation, buffer, heaptup);
  		}
  
- 		if (PageIsAllVisible(page))
- 		{
- 			all_visible_cleared = true;
- 			PageClearAllVisible(page);
- 			visibilitymap_clear(relation,
- 								BufferGetBlockNumber(buffer),
- 								vmbuffer);
- 		}
- 
  		/*
! 		 * XXX Should we set PageSetPrunable on this page ? See heap_insert()
  		 */
! 
! 		MarkBufferDirty(buffer);
! 
! 		/* XLOG stuff */
! 		if (needwal)
  		{
! 			XLogRecPtr	recptr;
! 			xl_heap_multi_insert *xlrec;
! 			XLogRecData rdata[2];
! 			uint8		info = XLOG_HEAP2_MULTI_INSERT;
! 			char	   *tupledata;
! 			int			totaldatalen;
! 			char	   *scratchptr = scratch;
! 			bool		init;
  
  			/*
! 			 * If the page was previously empty, we can reinit the page
! 			 * instead of restoring the whole thing.
  			 */
- 			init = (ItemPointerGetOffsetNumber(&(heaptuples[ndone]->t_self)) == FirstOffsetNumber &&
- 					PageGetMaxOffsetNumber(page) == FirstOffsetNumber + nthispage - 1);
  
! 			/* allocate xl_heap_multi_insert struct from the scratch area */
! 			xlrec = (xl_heap_multi_insert *) scratchptr;
! 			scratchptr += SizeOfHeapMultiInsert;
  
! 			/*
! 			 * Allocate offsets array. Unless we're reinitializing the page,
! 			 * in that case the tuples are stored in order starting at
! 			 * FirstOffsetNumber and we don't need to store the offsets
! 			 * explicitly.
! 			 */
! 			if (!init)
! 				scratchptr += nthispage * sizeof(OffsetNumber);
  
! 			/* the rest of the scratch space is used for tuple data */
! 			tupledata = scratchptr;
  
! 			xlrec->all_visible_cleared = all_visible_cleared;
! 			xlrec->node = relation->rd_node;
! 			xlrec->blkno = BufferGetBlockNumber(buffer);
! 			xlrec->ntuples = nthispage;
! 
! 			/*
! 			 * Write out an xl_multi_insert_tuple and the tuple data itself
! 			 * for each tuple.
! 			 */
! 			for (i = 0; i < nthispage; i++)
! 			{
! 				HeapTuple	heaptup = heaptuples[ndone + i];
! 				xl_multi_insert_tuple *tuphdr;
! 				int			datalen;
  
  				if (!init)
! 					xlrec->offsets[i] = ItemPointerGetOffsetNumber(&heaptup->t_self);
! 				/* xl_multi_insert_tuple needs two-byte alignment. */
! 				tuphdr = (xl_multi_insert_tuple *) SHORTALIGN(scratchptr);
! 				scratchptr = ((char *) tuphdr) + SizeOfMultiInsertTuple;
! 
! 				tuphdr->t_infomask2 = heaptup->t_data->t_infomask2;
! 				tuphdr->t_infomask = heaptup->t_data->t_infomask;
! 				tuphdr->t_hoff = heaptup->t_data->t_hoff;
! 
! 				/* write bitmap [+ padding] [+ oid] + data */
! 				datalen = heaptup->t_len - offsetof(HeapTupleHeaderData, t_bits);
! 				memcpy(scratchptr,
! 					   (char *) heaptup->t_data + offsetof(HeapTupleHeaderData, t_bits),
! 					   datalen);
! 				tuphdr->datalen = datalen;
! 				scratchptr += datalen;
! 			}
! 			totaldatalen = scratchptr - tupledata;
! 			Assert((scratchptr - scratch) < BLCKSZ);
  
! 			rdata[0].data = (char *) xlrec;
! 			rdata[0].len = tupledata - scratch;
! 			rdata[0].buffer = InvalidBuffer;
! 			rdata[0].next = &rdata[1];
! 
! 			rdata[1].data = tupledata;
! 			rdata[1].len = totaldatalen;
! 			rdata[1].buffer = buffer;
! 			rdata[1].buffer_std = true;
! 			rdata[1].next = NULL;
  
! 			/*
! 			 * If we're going to reinitialize the whole page using the WAL
! 			 * record, hide buffer reference from XLogInsert.
! 			 */
! 			if (init)
! 			{
! 				rdata[1].buffer = InvalidBuffer;
! 				info |= XLOG_HEAP_INIT_PAGE;
! 			}
  
! 			recptr = XLogInsert(RM_HEAP2_ID, info, rdata);
  
! 			PageSetLSN(page, recptr);
! 			PageSetTLI(page, ThisTimeLineID);
  		}
  
  		END_CRIT_SECTION();
--- 2174,2309 ----
  		START_CRIT_SECTION();
  
  		/* Put as many tuples as fit on this page */
+ 		page_is_empty = PageGetMaxOffsetNumber(page) == 0;
  		for (nthispage = 0; ndone + nthispage < ntuples; nthispage++)
  		{
  			HeapTuple	heaptup = heaptuples[ndone + nthispage];
  
! 			if (PageGetHeapFreeSpace(page) <
! 			    MAXALIGN(heaptup->t_len) + (page_is_empty ? 0 : saveFreeSpace))
  				break;
! 			page_is_empty = false;
  			RelationPutHeapTuple(relation, buffer, heaptup);
  		}
  
  		/*
! 		 * If nthispage > 0 then we modified the (possibly new) page,
! 		 * otherwise there was not enough space for even one new tuple.
  		 */
! 		if (nthispage > 0)
  		{
! 			if (PageIsAllVisible(page))
! 			{
! 				all_visible_cleared = true;
! 				PageClearAllVisible(page);
! 				visibilitymap_clear(relation,
! 									BufferGetBlockNumber(buffer),
! 									vmbuffer);
! 			}
  
  			/*
! 			 * XXX Should we set PageSetPrunable on this page ? See heap_insert()
  			 */
  
! 			MarkBufferDirty(buffer);
  
! 			/* XLOG stuff */
! 			if (needwal)
! 			{
! 				XLogRecPtr	recptr;
! 				xl_heap_multi_insert *xlrec;
! 				XLogRecData rdata[2];
! 				uint8		info = XLOG_HEAP2_MULTI_INSERT;
! 				char	   *tupledata;
! 				int			totaldatalen;
! 				char	   *scratchptr = scratch;
! 				bool		init;
  
! 				/*
! 				 * If the page was previously empty, we can reinit the page
! 				 * instead of restoring the whole thing.
! 				 */
! 				init = (ItemPointerGetOffsetNumber(&(heaptuples[ndone]->t_self)) == FirstOffsetNumber &&
! 						PageGetMaxOffsetNumber(page) == FirstOffsetNumber + nthispage - 1);
  
! 				/* allocate xl_heap_multi_insert struct from the scratch area */
! 				xlrec = (xl_heap_multi_insert *) scratchptr;
! 				scratchptr += SizeOfHeapMultiInsert;
  
+ 				/*
+ 				 * Allocate offsets array. Unless we're reinitializing the page,
+ 				 * in that case the tuples are stored in order starting at
+ 				 * FirstOffsetNumber and we don't need to store the offsets
+ 				 * explicitly.
+ 				 */
  				if (!init)
! 					scratchptr += nthispage * sizeof(OffsetNumber);
  
! 				/* the rest of the scratch space is used for tuple data */
! 				tupledata = scratchptr;
  
! 				xlrec->all_visible_cleared = all_visible_cleared;
! 				xlrec->node = relation->rd_node;
! 				xlrec->blkno = BufferGetBlockNumber(buffer);
! 				xlrec->ntuples = nthispage;
  
! 				/*
! 				 * Write out an xl_multi_insert_tuple and the tuple data itself
! 				 * for each tuple.
! 				 */
! 				for (i = 0; i < nthispage; i++)
! 				{
! 					HeapTuple	heaptup = heaptuples[ndone + i];
! 					xl_multi_insert_tuple *tuphdr;
! 					int			datalen;
! 
! 					if (!init)
! 						xlrec->offsets[i] = ItemPointerGetOffsetNumber(&heaptup->t_self);
! 					/* xl_multi_insert_tuple needs two-byte alignment. */
! 					tuphdr = (xl_multi_insert_tuple *) SHORTALIGN(scratchptr);
! 					scratchptr = ((char *) tuphdr) + SizeOfMultiInsertTuple;
! 
! 					tuphdr->t_infomask2 = heaptup->t_data->t_infomask2;
! 					tuphdr->t_infomask = heaptup->t_data->t_infomask;
! 					tuphdr->t_hoff = heaptup->t_data->t_hoff;
! 
! 					/* write bitmap [+ padding] [+ oid] + data */
! 					datalen = heaptup->t_len - offsetof(HeapTupleHeaderData, t_bits);
! 					memcpy(scratchptr,
! 						   (char *) heaptup->t_data + offsetof(HeapTupleHeaderData, t_bits),
! 						   datalen);
! 					tuphdr->datalen = datalen;
! 					scratchptr += datalen;
! 				}
! 				totaldatalen = scratchptr - tupledata;
! 				Assert((scratchptr - scratch) < BLCKSZ);
! 
! 				rdata[0].data = (char *) xlrec;
! 				rdata[0].len = tupledata - scratch;
! 				rdata[0].buffer = InvalidBuffer;
! 				rdata[0].next = &rdata[1];
! 
! 				rdata[1].data = tupledata;
! 				rdata[1].len = totaldatalen;
! 				rdata[1].buffer = buffer;
! 				rdata[1].buffer_std = true;
! 				rdata[1].next = NULL;
! 
! 				/*
! 				 * If we're going to reinitialize the whole page using the WAL
! 				 * record, hide buffer reference from XLogInsert.
! 				 */
! 				if (init)
! 				{
! 					rdata[1].buffer = InvalidBuffer;
! 					info |= XLOG_HEAP_INIT_PAGE;
! 				}
! 
! 				recptr = XLogInsert(RM_HEAP2_ID, info, rdata);
  
! 				PageSetLSN(page, recptr);
! 				PageSetTLI(page, ThisTimeLineID);
! 			}
  		}
  
  		END_CRIT_SECTION();
***************
*** 2304,2310 ****
  
  		ndone += nthispage;
  	}
- 
  	/*
  	 * If tuples are cachable, mark them for invalidation from the caches in
  	 * case we abort.  Note it is OK to do this after releasing the buffer,
--- 2314,2319 ----
#2 Andres Freund
andres@2ndquadrant.com
In reply to: David Gould (#1)
Re: bulk_multi_insert infinite loops with large rows and small fill factors

On 2012-12-12 03:04:19 -0800, David Gould wrote:

COPY IN loops in heap_multi_insert() extending the table until it fills the
disk when trying to insert a wide row into a table with a low fill-factor.
Internally fill-factor is implemented by reserving some space on a
page. For large enough rows and small enough fill-factor bulk_multi_insert()
can't fit the row even on a new empty page, so it keeps allocating new pages
but is never able to place the row. It should always put at least one row on
an empty page.

Heh. Nice one. Did you hit that in practice?

One more point: in the case where we don't insert any rows, we still do all the
dirtying and logging work even though we did not modify the page. I have tried
to skip all this when no rows are added (nthispage == 0), but my access method
foo is sadly out of date, so someone should take a skeptical look at that.

A test case and patch against 9.2.2 are attached. The patch fixes the problem
and passes make check; most of the diff is just indentation changes. Whoever
tries this will want to test it on a small partition by itself.

ISTM this would be fixed with a smaller footprint by just changing

if (PageGetHeapFreeSpace(page) < MAXALIGN(heaptup->t_len) + saveFreeSpace)

to

if (!PageIsEmpty(page) &&
    PageGetHeapFreeSpace(page) < MAXALIGN(heaptup->t_len) + saveFreeSpace)

I think that should work?
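
For context, here is how the loop from #1 would read with that guard in place
(a sketch using the 9.2 variable names; the fix actually committed in #3 below
takes a slightly different approach):

/*
 * Put as many tuples as fit on this page. With the PageIsEmpty() guard an
 * empty page always accepts at least the first tuple, so the loop can no
 * longer stall while the relation is extended forever.
 */
for (nthispage = 0; ndone + nthispage < ntuples; nthispage++)
{
	HeapTuple	heaptup = heaptuples[ndone + nthispage];

	if (!PageIsEmpty(page) &&
		PageGetHeapFreeSpace(page) < MAXALIGN(heaptup->t_len) + saveFreeSpace)
		break;

	RelationPutHeapTuple(relation, buffer, heaptup);
}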

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#3 Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Andres Freund (#2)
Re: Re: bulk_multi_insert infinite loops with large rows and small fill factors

On 12.12.2012 13:27, Andres Freund wrote:

On 2012-12-12 03:04:19 -0800, David Gould wrote:

One more point: in the case where we don't insert any rows, we still do all the
dirtying and logging work even though we did not modify the page. I have tried
to skip all this when no rows are added (nthispage == 0), but my access method
foo is sadly out of date, so someone should take a skeptical look at that.

A test case and patch against 9.2.2 are attached. The patch fixes the problem
and passes make check; most of the diff is just indentation changes. Whoever
tries this will want to test it on a small partition by itself.

ISTM this would be fixed with a smaller footprint by just changing

if (PageGetHeapFreeSpace(page) < MAXALIGN(heaptup->t_len) + saveFreeSpace)

to

if (!PageIsEmpty(page) &&
    PageGetHeapFreeSpace(page) < MAXALIGN(heaptup->t_len) + saveFreeSpace)

I think that should work?

Yeah, seems that it should, although PageIsEmpty() is no guarantee that
the tuple fits, because even though PageIsEmpty() returns true, there
might be dead line pointers consuming so much space that the tuple at
hand doesn't fit. However, RelationGetBufferForTuple() won't return such
a page; it guarantees that the first tuple does indeed fit on the page
it returns. For the same reason, the later check that at least one tuple
was actually placed on the page is not necessary.

I committed a slightly different version, which unconditionally puts the
first tuple on the page, and only applies the freespace check to the
subsequent tuples. Since RelationGetBufferForTuple() guarantees that the
first tuple fits, we can trust that, like heap_insert does.

--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2172,8 +2172,12 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
  		/* NO EREPORT(ERROR) from here till changes are logged */
  		START_CRIT_SECTION();
-		/* Put as many tuples as fit on this page */
-		for (nthispage = 0; ndone + nthispage < ntuples; nthispage++)
+		/*
+		 * RelationGetBufferForTuple has ensured that the first tuple fits.
+		 * Put that on the page, and then as many other tuples as fit.
+		 */
+		RelationPutHeapTuple(relation, buffer, heaptuples[ndone]);
+		for (nthispage = 1; ndone + nthispage < ntuples; nthispage++)
  		{
  			HeapTuple	heaptup = heaptuples[ndone + nthispage];

Thanks for the report!

- Heikki


#4 David Gould
daveg@sonic.net
In reply to: Andres Freund (#2)
Re: Re: bulk_multi_insert infinite loops with large rows and small fill factors

On Wed, 12 Dec 2012 12:27:11 +0100
Andres Freund <andres@2ndquadrant.com> wrote:

On 2012-12-12 03:04:19 -0800, David Gould wrote:

COPY IN loops in heap_multi_insert() extending the table until it fills the

Heh. Nice one. Did you hit that in practice?

Yeah, with a bunch of hosts that run postgres on a ramdisk, and that copy
happens late in the initial setup script for new hosts. The first batch of
new hosts to be set up with 9.2 filled the ramdisk, OOMed and fell over
within a minute. Since the script sets up a lot of stuff we had no idea
at first who OOMed.

ISTM this would be fixed with a smaller footprint by just changing

if (PageGetHeapFreeSpace(page) < MAXALIGN(heaptup->t_len) + saveFreeSpace)

to

if (!PageIsEmpty(page) &&
    PageGetHeapFreeSpace(page) < MAXALIGN(heaptup->t_len) + saveFreeSpace)

I think that should work?

I like PageIsEmpty() better (and would have used it if I'd known about it),
but I'm not so crazy about the negation.

-dg

--
David Gould 510 282 0869 daveg@sonic.net
If simplicity worked, the world would be overrun with insects.


#5 Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: David Gould (#4)
Re: Re: bulk_multi_insert infinite loops with large rows and small fill factors

On 12.12.2012 14:17, David Gould wrote:

On Wed, 12 Dec 2012 12:27:11 +0100
Andres Freund <andres@2ndquadrant.com> wrote:

On 2012-12-12 03:04:19 -0800, David Gould wrote:

COPY IN loops in heap_multi_insert() extending the table until it fills the

Heh. Nice one. Did you hit that in practice?

Yeah, with a bunch of hosts that run postgres on a ramdisk, and that copy
happens late in the initial setup script for new hosts. The first batch of
new hosts to be set up with 9.2 filled the ramdisk, OOMed and fell over
within a minute. Since the script sets up a lot of stuff we had no idea
at first who OOMed.

The bug's been fixed now, but note that huge tuples like this will
always cause the table to be extended, even if there are completely
empty pages in the table after a vacuum. Even a completely empty
existing page is not considered spacious enough in this case, because
it's still too small once you take fillfactor into account, so the
insertion will always extend the table. If you regularly run into this
situation, you might want to raise your fillfactor.
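
For a rough sense of how much it would have to be raised, here is a
back-of-the-envelope helper using the same reservation arithmetic as in #1
(assumes 8 KB pages and ~8168 usable bytes on an empty page; illustrative
only, not PostgreSQL source):

#include <stdio.h>

#define BLCKSZ      8192
#define PAGE_FREE   8168	/* approx. free space on a completely empty page */
#define MAXALIGN(x) (((x) + 7) & ~((unsigned long) 7))

/*
 * Smallest fillfactor (the heap minimum is 10) at which a tuple of the
 * given length still fits on an empty page after the reservation.
 */
static int
min_fillfactor(int tuple_len)
{
	int		ff;

	for (ff = 10; ff <= 100; ff++)
		if (PAGE_FREE >= MAXALIGN(tuple_len) + BLCKSZ * (100 - ff) / 100)
			return ff;
	return 100;
}

int
main(void)
{
	printf("2000-byte rows: fillfactor >= %d\n", min_fillfactor(2000));
	printf("4000-byte rows: fillfactor >= %d\n", min_fillfactor(4000));
	return 0;
}

By this arithmetic a 2000-byte row needs fillfactor >= 25, and a 4000-byte
row fillfactor >= 50, before an existing empty page is considered spacious
enough.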

- Heikki


#6 David Gould
daveg@sonic.net
In reply to: Heikki Linnakangas (#3)
Re: Re: bulk_multi_insert infinite loops with large rows and small fill factors

On Wed, 12 Dec 2012 13:56:08 +0200
Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

However, RelationGetBufferForTuple() won't return such
a page; it guarantees that the first tuple does indeed fit on the page
it returns. For the same reason, the later check that at least one tuple
was actually placed on the page is not necessary.

I committed a slightly different version, which unconditionally puts the
first tuple on the page, and only applies the freespace check to the
subsequent tuples. Since RelationGetBufferForTuple() guarantees that the
first tuple fits, we can trust that, like heap_insert does.

--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2172,8 +2172,12 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 		/* NO EREPORT(ERROR) from here till changes are logged */
 		START_CRIT_SECTION();
-		/* Put as many tuples as fit on this page */
-		for (nthispage = 0; ndone + nthispage < ntuples; nthispage++)
+		/*
+		 * RelationGetBufferForTuple has ensured that the first tuple fits.
+		 * Put that on the page, and then as many other tuples as fit.
+		 */
+		RelationPutHeapTuple(relation, buffer, heaptuples[ndone]);
+		for (nthispage = 1; ndone + nthispage < ntuples; nthispage++)
 		{
 			HeapTuple	heaptup = heaptuples[ndone + nthispage];

I don't know if this is the same thing. At least in the comments I was
reading while trying to figure this out, there was some concern that someone
else could change the space on the page. Does RelationGetBufferForTuple()
guarantee against this too?

-dg

--
David Gould 510 282 0869 daveg@sonic.net
If simplicity worked, the world would be overrun with insects.


#7 Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: David Gould (#6)
Re: Re: bulk_multi_insert infinite loops with large rows and small fill factors

On 12.12.2012 14:24, David Gould wrote:

I don't know if this is the same thing. At least in the comments I was
reading while trying to figure this out, there was some concern that someone
else could change the space on the page. Does RelationGetBufferForTuple()
guarantee against this too?

Yeah, RelationGetBufferForTuple grabs a lock on the page before
returning it. For comparison, plain heap_insert does simply this:

buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
                                   InvalidBuffer, options, bistate,
                                   &vmbuffer, NULL);

/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();

RelationPutHeapTuple(relation, buffer, heaptup);

- Heikki


#8 David Gould
daveg@sonic.net
In reply to: Heikki Linnakangas (#5)
Re: Re: bulk_multi_insert infinite loops with large rows and small fill factors

On Wed, 12 Dec 2012 14:23:12 +0200
Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

The bug's been fixed now, but note that huge tuples like this will
always cause the table to be extended, even if there are completely
empty pages in the table after a vacuum. Even a completely empty
existing page is not considered spacious enough in this case, because
it's still too small once you take fillfactor into account, so the
insertion will always extend the table. If you regularly run into this
situation, you might want to raise your fillfactor.

Actually, we'd like it lower. Ideally, one row per page.

We lose noticeable performance when we raise fill-factor above 10. Even 20 is
slower.

During busy times these hosts sometimes fall into a stable state
with very high CPU use, mostly in s_lock() and LWLockAcquire() and, I think,
PinBuffer, plus very high system CPU in the scheduler (I don't have the perf
trace in front of me, so take this with a grain of salt). In this mode they
fall from the normal 7000 queries per second to below 3000. Once in this
state they tend to stay that way. If we turn down the number of incoming
requests they go back to normal. Our conjecture is that most requests are
for only a few keys, so we have multiple sessions contending for a few
pages and convoying in the buffer manager. The table is under 20k rows, but
the hot items are probably only a couple hundred different rows. The busy
processes are doing reads only, but there is some update activity on this
table too.

Ah, found an email with the significant part of the perf output:

... set number of client threads = number of postgres backends = 70. That way
all my threads have constant access to a backend and they just spin in a tight
loop running the same query over and over (with different values). ... this
seems to have tapped into 9.2's resonant frequency; right now we're spending
almost all our time spin locking.

...

762377.00 71.0% s_lock /usr/local/bin/postgres
22279.00 2.1% LWLockAcquire /usr/local/bin/postgres
18916.00 1.8% LWLockRelease /usr/local/bin/postgres

I was trying to resurrect the pthread s_lock() patch to see if that helps,
but it did not apply at all and I have not had time to pursue it.

We have tried lots of different numbers of processes and get the best result
with about ten fewer active postgresql backends than HT cores. The system is
128GB with:

processor : 79
vendor_id : GenuineIntel
cpu family : 6
model : 47
model name : Intel(R) Xeon(R) CPU E7-L8867 @ 2.13GHz
stepping : 2
cpu MHz : 2128.478
cache size : 30720 KB

-dg

--
David Gould 510 282 0869 daveg@sonic.net
If simplicity worked, the world would be overrun with insects.


#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#5)
Re: Re: bulk_multi_insert infinite loops with large rows and small fill factors

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

The bug's been fixed now, but note that huge tuples like this will
always cause the table to be extended, even if there are completely
empty pages in the table after a vacuum. Even a completely empty
existing page is not considered spacious enough in this case, because
it's still too small once you take fillfactor into account, so the
insertion will always extend the table.

Seems like that's a bug in itself: there's no reason to reject an empty
existing page.

regards, tom lane


#10 Robert Haas
robertmhaas@gmail.com
In reply to: David Gould (#8)
Re: Re: bulk_multi_insert infinite loops with large rows and small fill factors

On Wed, Dec 12, 2012 at 8:29 AM, David Gould <daveg@sonic.net> wrote:

We lose noticeable performance when we raise fill-factor above 10. Even 20 is
slower.

Whoa.

During busy times these hosts sometimes fall into a stable state
with very high CPU use, mostly in s_lock() and LWLockAcquire() and, I think,
PinBuffer, plus very high system CPU in the scheduler (I don't have the perf
trace in front of me, so take this with a grain of salt). In this mode they
fall from the normal 7000 queries per second to below 3000.

I have seen signs of something similar to this when running pgbench -S
tests at high concurrency. I've never been able to track down where
the problem is happening. My belief is that once a spinlock starts to
be contended, there's some kind of death spiral that can't be arrested
until the workload eases up. But I haven't had much luck identifying
exactly which spinlock is the problem or if it even is just one...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#11 David Gould
daveg@sonic.net
In reply to: Robert Haas (#10)
Re: Re: bulk_multi_insert infinite loops with large rows and small fill factors

On Fri, 14 Dec 2012 15:39:44 -0500
Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Dec 12, 2012 at 8:29 AM, David Gould <daveg@sonic.net> wrote:

We lose noticeable performance when we raise fill-factor above 10. Even 20 is
slower.

Whoa.

Any interest in a fill-factor patch to place exactly one row per page? That
would be the least contended. There are applications where it might help.

During busy times these hosts sometimes fall into a stable state
with very high CPU use, mostly in s_lock() and LWLockAcquire() and, I think,
PinBuffer, plus very high system CPU in the scheduler (I don't have the perf
trace in front of me, so take this with a grain of salt). In this mode they
fall from the normal 7000 queries per second to below 3000.

I have seen signs of something similar to this when running pgbench -S
tests at high concurrency. I've never been able to track down where

I think I may have seen that with pgbench -S too. I did not have time to
investigate more, but out of a sequence of three-minute runs I got most
runs at 300k+ qps, but a couple were around 200k qps.

the problem is happening. My belief is that once a spinlock starts to
be contended, there's some kind of death spiral that can't be arrested
until the workload eases up. But I haven't had much luck identifying
exactly which spinlock is the problem or if it even is just one...

I agree about the death spiral. I think what happens is all the backends
get synchronized by waiting, and then they are more likely to contend again.

-dg

--
David Gould 510 282 0869 daveg@sonic.net
If simplicity worked, the world would be overrun with insects.
