Relation extension scalability

Started by Andres Freund almost 11 years ago, 124 messages
#1 Andres Freund
andres@2ndquadrant.com
1 attachment(s)

Hello,

Currently bigger shared_buffers settings don't combine well with
relations being extended frequently. Especially if many/most pages have
a high usagecount and/or are dirty and the system is IO constrained.

As a quick recap, relation extension basically works like:
1) We lock the relation for extension
2) ReadBuffer*(P_NEW) is being called, to extend the relation
3) smgrnblocks() is used to find the new target block
4) We search for a victim buffer (via BufferAlloc()) to put the new
block into
5) If dirty the victim buffer is cleaned
6) The relation is extended using smgrextend()
7) The page is initialized

The problems come from 4) and 5) potentially each taking a fair
while. If the working set mostly fits into shared_buffers, 4) can
require iterating over all shared buffers several times to find a
victim buffer. If the IO subsystem is busy and/or we've hit the kernel's
dirty limits, 5) can take a couple of seconds.

I've prototyped solving this for heap relations moving the smgrnblocks()
+ smgrextend() calls to RelationGetBufferForTuple(). With some care
(including a retry loop) it's possible to only do those two under the
extension lock. That indeed fixes problems in some of my tests.

I'm not sure whether the above is the best solution however. For one I
think it's not necessarily a good idea to opencode this in hio.c - I've
not observed it, but this probably can happen for btrees and such as
well. For another, this still holds an exclusive lock while we're doing
IO: smgrextend() wants a page to write out, so we have to be careful not
to overwrite things.

There's two things that seem to make sense to me:

First, decouple relation extension from ReadBuffer*, i.e. remove P_NEW
and introduce a bufmgr function specifically for extension.

Secondly I think we could maybe remove the requirement of needing an
extension lock altogether. It's primarily required because we're
worried that somebody else can come along, read the page, and initialize
it before us. ISTM that could be resolved by *not* writing any data via
smgrextend()/mdextend(). If we instead only do the write once we've read
in & locked the page exclusively, there's no need for the extension
lock. We probably still should write out the new page to the OS
immediately once we've initialized it, to avoid creating sparse files.

The other reason we need the extension lock is that code like
lazy_scan_heap() and btvacuumscan() tries to avoid initializing
pages that are about to be initialized by the extending backend. I think
we should just remove that code and deal with the problem by retrying in
the extending backend; that's why I think moving extension to a
different file might be helpful.

I've attached my POC for heap extension, but it's really just a POC at
this point.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-WIP-Saner-heap-extension.patch (text/x-patch; charset=us-ascii)
From f1f38829160d0f2998cd187f2f920cdc7b1fa709 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 29 Mar 2015 20:55:32 +0200
Subject: [PATCH] WIP: Saner heap extension.

---
 src/backend/access/heap/hio.c | 110 ++++++++++++++++++++++++------------------
 1 file changed, 63 insertions(+), 47 deletions(-)

diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 6d091f6..178c417 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -15,6 +15,8 @@
 
 #include "postgres.h"
 
+#include "miscadmin.h"
+
 #include "access/heapam.h"
 #include "access/hio.h"
 #include "access/htup_details.h"
@@ -420,63 +422,77 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	/*
 	 * Have to extend the relation.
 	 *
-	 * We have to use a lock to ensure no one else is extending the rel at the
-	 * same time, else we will both try to initialize the same new page.  We
-	 * can skip locking for new or temp relations, however, since no one else
-	 * could be accessing them.
+	 * To avoid, as it used to be the case, holding the extension lock during
+	 * victim buffer search for the new buffer, we extend the relation here
+	 * instead of relying on bufmgr.c. We still have to hold the extension
+	 * lock to prevent a race between two backends initializing the same page.
 	 */
-	needLock = !RELATION_IS_LOCAL(relation);
+	while(true)
+	{
+		char		emptybuf[BLCKSZ];
 
-	if (needLock)
-		LockRelationForExtension(relation, ExclusiveLock);
+		/*
+		 * We have to use a lock to ensure no one else is extending the rel at
+		 * the same time, else we will both try to initialize the same new
+		 * page.  We can skip locking for new or temp relations, however,
+		 * since no one else could be accessing them.
+		 */
+		needLock = !RELATION_IS_LOCAL(relation);
+		RelationOpenSmgr(relation);
 
-	/*
-	 * XXX This does an lseek - rather expensive - but at the moment it is the
-	 * only way to accurately determine how many blocks are in a relation.  Is
-	 * it worth keeping an accurate file length in shared memory someplace,
-	 * rather than relying on the kernel to do it for us?
-	 */
-	buffer = ReadBufferBI(relation, P_NEW, bistate);
+		MemSet((char *) emptybuf, 0, BLCKSZ);
+		PageInit((Page) emptybuf, BLCKSZ, 0);
 
-	/*
-	 * We can be certain that locking the otherBuffer first is OK, since it
-	 * must have a lower page number.
-	 */
-	if (otherBuffer != InvalidBuffer)
-		LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
+		/*
+		 * Acquire extension lock to avoid two backends extending the
+		 * relation at the same time. This could be avoided by using
+		 * lseek(SEEK_END, +BLKSZ) *without* immediately writing to the
+		 * block. Then read in the page and only initialize after locking
+		 * it. Unclear whether it's a benefit or whether it might be too
+		 * likely to result in sparse files.
+		 */
+		if (needLock)
+			LockRelationForExtension(relation, ExclusiveLock);
 
-	/*
-	 * Now acquire lock on the new page.
-	 */
-	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		targetBlock = smgrnblocks(relation->rd_smgr, MAIN_FORKNUM);
+		smgrextend(relation->rd_smgr, MAIN_FORKNUM, targetBlock,
+					   emptybuf, false);
 
-	/*
-	 * Release the file-extension lock; it's now OK for someone else to extend
-	 * the relation some more.  Note that we cannot release this lock before
-	 * we have buffer lock on the new page, or we risk a race condition
-	 * against vacuumlazy.c --- see comments therein.
-	 */
-	if (needLock)
-		UnlockRelationForExtension(relation, ExclusiveLock);
+		if (needLock)
+			UnlockRelationForExtension(relation, ExclusiveLock);
 
-	/*
-	 * We need to initialize the empty new page.  Double-check that it really
-	 * is empty (this should never happen, but if it does we don't want to
-	 * risk wiping out valid data).
-	 */
-	page = BufferGetPage(buffer);
+		buffer = ReadBufferBI(relation, targetBlock, bistate);
+
+		/*
+		 * We can be certain that locking the otherBuffer first is OK,
+		 * since it must have a lower page number.
+		 */
+		if (otherBuffer != InvalidBuffer)
+			LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
-	if (!PageIsNew(page))
-		elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
-			 BufferGetBlockNumber(buffer),
-			 RelationGetRelationName(relation));
+		page = BufferGetPage(buffer);
 
-	PageInit(page, BufferGetPageSize(buffer), 0);
+		Assert(!PageIsNew(page));
 
-	if (len > PageGetHeapFreeSpace(page))
-	{
-		/* We should not get here given the test at the top */
-		elog(PANIC, "tuple is too big: size %zu", len);
+		/*
+		 * While unlikely, it's possible that another backend managed to use
+		 * up the free space till we got the exclusive lock. That'd require
+		 * the page to be vacuumed (to be put on the free space list) and then
+		 * be used; possible but fairly unlikely in practice. If it happens,
+		 * just retry.
+		 */
+		if (len <= PageGetHeapFreeSpace(page))
+			break;
+
+		if (otherBuffer != InvalidBuffer)
+			LockBuffer(otherBuffer, BUFFER_LOCK_UNLOCK);
+
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+		ReleaseBuffer(buffer);
+
+		CHECK_FOR_INTERRUPTS();
 	}
 
 	/*
-- 
2.3.0.149.gf3f4077

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#1)
Re: Relation extension scalability

Andres Freund <andres@2ndquadrant.com> writes:

As a quick recap, relation extension basically works like:
1) We lock the relation for extension
2) ReadBuffer*(P_NEW) is being called, to extend the relation
3) smgrnblocks() is used to find the new target block
4) We search for a victim buffer (via BufferAlloc()) to put the new
block into
5) If dirty the victim buffer is cleaned
6) The relation is extended using smgrextend()
7) The page is initialized

The problems come from 4) and 5) potentially each taking a fair
while.

Right, so basically we want to get those steps out of the exclusive lock
scope.

There's two things that seem to make sense to me:

First, decouple relation extension from ReadBuffer*, i.e. remove P_NEW
and introduce a bufmgr function specifically for extension.

I think that removing P_NEW is likely to require a fair amount of
refactoring of calling code, so I'm not thrilled with doing that.
On the other hand, perhaps all that code would have to be touched
anyway to modify the scope over which the extension lock is held.

Secondly I think we could maybe remove the requirement of needing an
extension lock altogether. It's primarily required because we're
worried that somebody else can come along, read the page, and initialize
it before us. ISTM that could be resolved by *not* writing any data via
smgrextend()/mdextend().

I'm afraid this would break stuff rather thoroughly, in particular
handling of out-of-disk-space cases. And I really don't see how you get
consistent behavior at all for multiple concurrent callers if there's no
locking.

One idea that might help is to change smgrextend's API so that it doesn't
need a buffer to write from, but just has an API of "add a prezeroed block
on-disk and tell me the number of the block you added". On the other
hand, that would then require reading in the block after allocating a
buffer to hold it (I don't think you can safely assume otherwise) so the
added read step might eat any savings.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3 Andres Freund
andres@2ndquadrant.com
In reply to: Tom Lane (#2)
Re: Relation extension scalability

On 2015-03-29 15:21:44 -0400, Tom Lane wrote:

There's two things that seem to make sense to me:

First, decouple relation extension from ReadBuffer*, i.e. remove P_NEW
and introduce a bufmgr function specifically for extension.

I think that removing P_NEW is likely to require a fair amount of
refactoring of calling code, so I'm not thrilled with doing that.
On the other hand, perhaps all that code would have to be touched
anyway to modify the scope over which the extension lock is held.

It's not *that* many locations that need to extend relations. In my
playing around it seemed to me they all would need to be modified
anyway if we want to remove/reduce the scope of extension locks, to
deal with the fact that somebody else could have started to use the buffer.

Secondly I think we could maybe remove the requirement of needing an
extension lock altogether. It's primarily required because we're
worried that somebody else can come along, read the page, and initialize
it before us. ISTM that could be resolved by *not* writing any data via
smgrextend()/mdextend().

I'm afraid this would break stuff rather thoroughly, in particular
handling of out-of-disk-space cases.

Hm. Not a bad point.

And I really don't see how you get
consistent behavior at all for multiple concurrent callers if there's no
locking.

What I was thinking is something like this:

while (true)
{
    targetBuffer = AcquireFromFSMEquivalent();

    if (targetBuffer == InvalidBuffer)
        targetBuffer = ExtendRelation();

    LockBuffer(targetBuffer, BUFFER_LOCK_EXCLUSIVE);

    if (BufferHasEnoughSpace(targetBuffer))
        break;

    LockBuffer(targetBuffer, BUFFER_LOCK_UNLOCK);
    ReleaseBuffer(targetBuffer);
}

where ExtendRelation() would basically work like

while (true)
{
    targetBlock = (lseek(fd, BLCKSZ, SEEK_END) - BLCKSZ) / BLCKSZ;
    buffer = ReadBuffer(rel, targetBlock);
    LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
    page = BufferGetPage(buffer);
    if (PageIsNew(page))
    {
        PageInit(page);
        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
        FlushBuffer(buffer);
        break;
    }
    else
    {
        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
        ReleaseBuffer(buffer);
        continue;
    }
}

Obviously it's a tad more complex than that pseudocode, but I think that
basically should work. Except that, as you say, it could lead to some
oddities with out-of-space handling. I think it should actually be OK;
it might just be confusing to the user.

I think we might be able to address those issues by not using
lseek(SEEK_END) but instead doing something like

flags = fcntl(fd, F_GETFL);
fcntl(fd, F_SETFL, flags | O_APPEND);
write(fd, pre-init-block, BLCKSZ);
fcntl(fd, F_SETFL, flags & ~O_APPEND);
newblock = (lseek(fd, 0, SEEK_CUR) - BLCKSZ)/BLCKSZ;

By using O_APPEND and a pre-initialized block we can be sure to write a
valid block at the end, and shouldn't run afoul of any out-of-space
issues that we don't already have.

Unfortunately I'm not sure whether fcntl for O_APPEND is portable :(

One idea that might help is to change smgrextend's API so that it doesn't
need a buffer to write from, but just has an API of "add a prezeroed block
on-disk and tell me the number of the block you added". On the other
hand, that would then require reading in the block after allocating a
buffer to hold it (I don't think you can safely assume otherwise) so the
added read step might eat any savings.

Yea, I was thinking that as well. We simply could skip the reading step
by setting up the contents in the buffer manager without a read in this
case...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#3)
Re: Relation extension scalability

Andres Freund <andres@2ndquadrant.com> writes:

On 2015-03-29 15:21:44 -0400, Tom Lane wrote:

One idea that might help is to change smgrextend's API so that it doesn't
need a buffer to write from, but just has an API of "add a prezeroed block
on-disk and tell me the number of the block you added". On the other
hand, that would then require reading in the block after allocating a
buffer to hold it (I don't think you can safely assume otherwise) so the
added read step might eat any savings.

Yea, I was thinking that as well. We simply could skip the reading step
by setting up the contents in the buffer manager without a read in this
case...

No, you can't, at least not if the point is to not be holding any
exclusive lock by the time you go to talk to the buffer manager. There
will be nothing stopping some other backend from writing into that page of
the file before you can get hold of it. If the buffer they used to do the
write has itself gotten recycled, there is nothing left at all to tell you
your page image is out of date.

regards, tom lane

#5 Andres Freund
andres@2ndquadrant.com
In reply to: Tom Lane (#4)
Re: Relation extension scalability

On 2015-03-29 16:07:49 -0400, Tom Lane wrote:

Andres Freund <andres@2ndquadrant.com> writes:

On 2015-03-29 15:21:44 -0400, Tom Lane wrote:

One idea that might help is to change smgrextend's API so that it doesn't
need a buffer to write from, but just has an API of "add a prezeroed block
on-disk and tell me the number of the block you added". On the other
hand, that would then require reading in the block after allocating a
buffer to hold it (I don't think you can safely assume otherwise) so the
added read step might eat any savings.

Yea, I was thinking that as well. We simply could skip the reading step
by setting up the contents in the buffer manager without a read in this
case...

No, you can't, at least not if the point is to not be holding any
exclusive lock by the time you go to talk to the buffer manager. There
will be nothing stopping some other backend from writing into that page of
the file before you can get hold of it. If the buffer they used to do the
write has itself gotten recycled, there is nothing left at all to tell you
your page image is out of date.

That's why I'd proposed restructuring things so that the actual
extension/write to the file only happens once we have the buffer
manager's exclusive lock on the individual buffer. While not trivial to
implement, it doesn't look prohibitively complex.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#6 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#1)
Re: Relation extension scalability

On Sun, Mar 29, 2015 at 2:56 PM, Andres Freund <andres@2ndquadrant.com> wrote:

As a quick recap, relation extension basically works like:
1) We lock the relation for extension
2) ReadBuffer*(P_NEW) is being called, to extend the relation
3) smgrnblocks() is used to find the new target block
4) We search for a victim buffer (via BufferAlloc()) to put the new
block into
5) If dirty the victim buffer is cleaned
6) The relation is extended using smgrextend()
7) The page is initialized

The problems come from 4) and 5) potentially each taking a fair
while. If the working set mostly fits into shared_buffers, 4) can
require iterating over all shared buffers several times to find a
victim buffer. If the IO subsystem is busy and/or we've hit the kernel's
dirty limits, 5) can take a couple of seconds.

Interesting. I had always assumed the bottleneck was waiting for the
filesystem to extend the relation.

Secondly I think we could maybe remove the requirement of needing an
extension lock altogether. It's primarily required because we're
worried that somebody else can come along, read the page, and initialize
it before us. ISTM that could be resolved by *not* writing any data via
smgrextend()/mdextend(). If we instead only do the write once we've read
in & locked the page exclusively there's no need for the extension
lock. We probably still should write out the new page to the OS
immediately once we've initialized it; to avoid creating sparse files.

The other reason we need the extension lock is that code like
lazy_scan_heap() and btvacuumscan() that tries to avoid initializing
pages that are about to be initialized by the extending backend. I think
we should just remove that code and deal with the problem by retrying in
the extending backend; that's why I think moving extension to a
different file might be helpful.

I thought the primary reason we did this is because we wanted to
write-and-fsync the block so that, if we're out of disk space, any
attendant failure will happen before we put data into the block. Once
we've initialized the block, a subsequent failure to write or fsync it
will be hard to recover from; basically, we won't be able to
checkpoint any more. If we discover the problem while the block is
still all-zeroes, the transaction that uncovers the problem errors
out, but the system as a whole is still OK.

Or at least, so I think. Maybe I'm misunderstanding.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#7 Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#6)
Re: Relation extension scalability

On 2015-03-29 20:02:06 -0400, Robert Haas wrote:

On Sun, Mar 29, 2015 at 2:56 PM, Andres Freund <andres@2ndquadrant.com> wrote:

As a quick recap, relation extension basically works like:
1) We lock the relation for extension
2) ReadBuffer*(P_NEW) is being called, to extend the relation
3) smgrnblocks() is used to find the new target block
4) We search for a victim buffer (via BufferAlloc()) to put the new
block into
5) If dirty the victim buffer is cleaned
6) The relation is extended using smgrextend()
7) The page is initialized

The problems come from 4) and 5) potentially each taking a fair
while. If the working set mostly fits into shared_buffers, 4) can
require iterating over all shared buffers several times to find a
victim buffer. If the IO subsystem is busy and/or we've hit the kernel's
dirty limits, 5) can take a couple of seconds.

Interesting. I had always assumed the bottleneck was waiting for the
filesystem to extend the relation.

That might be the case sometimes, but it's not what I've actually
observed so far. I think most modern filesystems do preallocation,
which has resolved this to some degree.

Secondly I think we could maybe remove the requirement of needing an
extension lock altogether. It's primarily required because we're
worried that somebody else can come along, read the page, and initialize
it before us. ISTM that could be resolved by *not* writing any data via
smgrextend()/mdextend(). If we instead only do the write once we've read
in & locked the page exclusively there's no need for the extension
lock. We probably still should write out the new page to the OS
immediately once we've initialized it; to avoid creating sparse files.

The other reason we need the extension lock is that code like
lazy_scan_heap() and btvacuumscan() that tries to avoid initializing
pages that are about to be initialized by the extending backend. I think
we should just remove that code and deal with the problem by retrying in
the extending backend; that's why I think moving extension to a
different file might be helpful.

I thought the primary reason we did this is because we wanted to
write-and-fsync the block so that, if we're out of disk space, any
attendant failure will happen before we put data into the block.

Well, we only write and register an fsync request. AFAICS we don't
actually perform the fsync at that point. I don't think having to do
the fsync() necessarily precludes removing the extension lock.

Once we've initialized the block, a subsequent failure to write or
fsync it will be hard to recover from;

At the very least the buffer shouldn't become dirty before we've
successfully written it once, right. It seems quite doable to achieve
that without the lock though. We'll have to do the write without going
through the buffer manager, but that seems doable.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#6)
Re: Relation extension scalability

Robert Haas <robertmhaas@gmail.com> writes:

I thought the primary reason we did this is because we wanted to
write-and-fsync the block so that, if we're out of disk space, any
attendant failure will happen before we put data into the block. Once
we've initialized the block, a subsequent failure to write or fsync it
will be hard to recover from; basically, we won't be able to
checkpoint any more. If we discover the problem while the block is
still all-zeroes, the transaction that uncovers the problem errors
out, but the system as a whole is still OK.

Yeah. As Andres says, the fsync is not an important part of that,
but we do expect that ENOSPC will happen during the initial write()
if it's going to happen.

To some extent that's an obsolete assumption, I'm afraid --- I believe
that some modern filesystems don't necessarily overwrite the previous
version of a block, which would mean that they are capable of failing
with ENOSPC even during a re-write of a previously-written block.
However, the possibility of filesystem misfeasance of that sort doesn't
excuse us from having a clear recovery strategy for failures during
relation extension.

regards, tom lane

#9 Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#1)
Re: Relation extension scalability

On Mon, Mar 30, 2015 at 12:26 AM, Andres Freund <andres@2ndquadrant.com>
wrote:

Hello,

Currently bigger shared_buffers settings don't combine well with
relations being extended frequently. Especially if many/most pages have
a high usagecount and/or are dirty and the system is IO constrained.

As a quick recap, relation extension basically works like:
1) We lock the relation for extension
2) ReadBuffer*(P_NEW) is being called, to extend the relation
3) smgrnblocks() is used to find the new target block
4) We search for a victim buffer (via BufferAlloc()) to put the new
block into
5) If dirty the victim buffer is cleaned
6) The relation is extended using smgrextend()
7) The page is initialized

The problems come from 4) and 5) potentially each taking a fair
while. If the working set mostly fits into shared_buffers, 4) can
require iterating over all shared buffers several times to find a
victim buffer. If the IO subsystem is busy and/or we've hit the kernel's
dirty limits, 5) can take a couple of seconds.

In the past, I have observed in one of the write-oriented tests that
backends have to flush the pages by themselves many times, so in the
above situation that can lead to a more severe bottleneck.

I've prototyped solving this for heap relations moving the smgrnblocks()
+ smgrextend() calls to RelationGetBufferForTuple(). With some care
(including a retry loop) it's possible to only do those two under the
extension lock. That indeed fixes problems in some of my tests.

So does this mean that the problem is because of contention on the
extension lock?

I'm not sure whether the above is the best solution however.

Another thing to note here is that during extension we are extending
just one block; won't it make sense to increase that by some bigger
number? (We could even take input from the user, specifying how much to
autoextend a relation when it doesn't have any empty space.) During
mdextend(), we might increase by just one block, but we could register a
request for a background process to increase the size, similar to what
is done for fsync.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#10 Andres Freund
andres@2ndquadrant.com
In reply to: Amit Kapila (#9)
Re: Relation extension scalability

On 2015-03-30 09:33:57 +0530, Amit Kapila wrote:

In the past, I have observed in one of the Write-oriented tests that
backend's have to flush the pages by themselves many a times, so
in above situation that can lead to more severe bottleneck.

Yes.

I've prototyped solving this for heap relations moving the smgrnblocks()
+ smgrextend() calls to RelationGetBufferForTuple(). With some care
(including a retry loop) it's possible to only do those two under the
extension lock. That indeed fixes problems in some of my tests.

So does this mean that the problem is because of contention on the
extension lock?

Yes, at least commonly. Obviously the extension lock would be less of a
problem if we were better at having a clean victim buffer ready.

I'm not sure whether the above is the best solution however.

Another thing to note here is that during extension we are extending
just one block; won't it make sense to increase that by some bigger
number? (We could even take input from the user, specifying how much to
autoextend a relation when it doesn't have any empty space.) During
mdextend(), we might increase by just one block, but we could register a
request for a background process to increase the size, similar to what
is done for fsync.

I think that's pretty much a separate patch, made easier by moving
things out from under the lock, maybe. Other than that I'd prefer not to
mix things. There's a whole bunch of unrelated complexity that I don't
want to attach to the topic at the same time (autovacuum truncating
again and so on).

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#11 David Steele
david@pgmasters.net
In reply to: Andres Freund (#10)
Re: Relation extension scalability

On 3/30/15 6:45 AM, Andres Freund wrote:

On 2015-03-30 09:33:57 +0530, Amit Kapila wrote:

Another thing to note here is that during extension we are extending
just one block; won't it make sense to increase that by some bigger
number? (We could even take input from the user, specifying how much to
autoextend a relation when it doesn't have any empty space.) During
mdextend(), we might increase by just one block, but we could register a
request for a background process to increase the size, similar to what
is done for fsync.

I think that's pretty much a separate patch, made easier by moving
things out from under the lock, maybe. Other than that I'd prefer not to
mix things. There's a whole bunch of unrelated complexity that I don't
want to attach to the topic at the same time (autovacuum truncating
again and so on).

Agreed that it makes more sense for this to be in a separate patch, but
I definitely like the idea.

A user configurable setting would be fine, but better would be to learn
from the current growth rate of the table and extend based on that.

For instance, if a table is very large but is only growing by a few
rows a day, there's probably no need for a large extent. Conversely, an
initially small table growing by 1GB per minute would definitely benefit
from large extents, and it would be good to be able to track growth and
compute extent sizes accordingly.

Of course, a manual setting to start with would cover most use cases.
Large tables in a database are generally in the minority and known in
advance.

--
- David Steele
david@pgmasters.net

#12 Stephen Frost
sfrost@snowman.net
In reply to: David Steele (#11)
Re: Relation extension scalability

* David Steele (david@pgmasters.net) wrote:

On 3/30/15 6:45 AM, Andres Freund wrote:

On 2015-03-30 09:33:57 +0530, Amit Kapila wrote:

Another thing to note here is that during extension we are extending
just one block; won't it make sense to increase that by some bigger
number? (We could even take input from the user, specifying how much to
autoextend a relation when it doesn't have any empty space.) During
mdextend(), we might increase by just one block, but we could register a
request for a background process to increase the size, similar to what
is done for fsync.

I think that's pretty much a separate patch, made easier by moving
things out from under the lock, maybe. Other than that I'd prefer not to
mix things. There's a whole bunch of unrelated complexity that I don't
want to attach to the topic at the same time (autovacuum truncating
again and so on).

Agreed that it makes more sense for this to be in a separate patch, but
I definitely like the idea.

A user configurable setting would be fine, but better would be to learn
from the current growth rate of the table and extend based on that.

For instance, if a table is very large but is only growing by a few
rows a day, there's probably no need for a large extent. Conversely, an
initially small table growing by 1GB per minute would definitely benefit
from large extents and it would be good to be able to track growth and
compute extent sizes accordingly.

Of course, a manual setting to start with would cover most use cases.
Large tables in a database are generally in the minority and known in
advance.

If we're able to extend based on page-level locks rather than the global
relation locking that we're doing now, then I'm not sure we really need
to adjust how big the extents are any more. The reason for making
bigger extents is because of the locking problem we have now when lots
of backends want to extend a relation, but, if I'm following correctly,
that'd go away with Andres' approach.

We don't have full patches for either of these and so I don't mind
saying that, basically, I'd prefer to see if we still have a big
bottleneck here with lots of backends trying to extend the same relation
before we work on adding this particular feature in as it might end up
being unnecessary. Now, if someone shows up tomorrow with a patch to do
this and Andres' approach ends up not progressing, then we should
certainly consider it (in due time and with consideration to the
activities going on for 9.5, of course).

Thanks!

Stephen

#13Amit Kapila
amit.kapila16@gmail.com
In reply to: Stephen Frost (#12)
Re: Relation extension scalability

On Mon, Mar 30, 2015 at 8:57 PM, Stephen Frost <sfrost@snowman.net> wrote:

If we're able to extend based on page-level locks rather than the global
relation locking that we're doing now, then I'm not sure we really need
to adjust how big the extents are any more. The reason for making
bigger extents is because of the locking problem we have now when lots
of backends want to extend a relation, but, if I'm following correctly,
that'd go away with Andres' approach.

The benefit of extending in bigger chunks in the background is that a
backend would need to perform such an operation less frequently, which
in itself could be a win.

We don't have full patches for either of these and so I don't mind
saying that, basically, I'd prefer to see if we still have a big
bottleneck here with lots of backends trying to extend the same relation
before we work on adding this particular feature in as it might end up
being unnecessary.

Agreed, I think it is better to first see the results of current
patch on which Andres is working and then if someone is interested
and can show any real benefit with the patch to extend relation
in bigger chunks, then that might be worth consideration.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#14Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Amit Kapila (#13)
Re: Relation extension scalability

On 3/30/15 10:48 PM, Amit Kapila wrote:

If we're able to extend based on page-level locks rather than the global
relation locking that we're doing now, then I'm not sure we really need
to adjust how big the extents are any more. The reason for making
bigger extents is because of the locking problem we have now when lots
of backends want to extend a relation, but, if I'm following correctly,
that'd go away with Andres' approach.

The benefit of extending in bigger chunks in the background is that a
backend would need to perform such an operation less frequently, which
in itself could be a win.

The other potential advantage (and I have to think this could be a BIG
advantage) is extending by a large amount makes it more likely you'll
get contiguous blocks on the storage. That's going to make a big
difference for SeqScan speed. It'd be interesting if someone with access
to some real systems could test that. In particular, seqscan of a
possibly fragmented table vs one of the same size but created at once.
For extra credit, compare to dd bs=8192 of a file of the same size as
the overall table.
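
For the dd comparison, the read side amounts to a trivial sequential
reader; timing the loop (e.g. with clock_gettime) and dividing bytes by
seconds gives a figure directly comparable to dd bs=8192. A sketch (the
8192 chunk size mirrors BLCKSZ; the function name is made up):

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Read a file sequentially in 8192-byte chunks, the way dd bs=8192
 * would.  Returns the total number of bytes read, or -1 on open failure.
 */
static long long
seqread_bytes(const char *path)
{
	char		buf[8192];
	long long	total = 0;
	ssize_t		n;
	int			fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	while ((n = read(fd, buf, sizeof buf)) > 0)
		total += n;
	close(fd);
	return total;
}
```

Running this over a heavily fragmented table's segment files versus a
freshly created file of the same size would show how much of the SeqScan
gap is the on-disk layout rather than executor overhead.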

What I've seen in the real world is very, very poor SeqScan performance
of tables that were relatively large. So bad that I had to SeqScan 8-16
tables in parallel to max out the IO system the same way I could with a
single dd bs=8k of a large file (in this case, something like 480MB/s).
A single SeqScan would only do something like 30MB/s.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Jim Nasby (#14)
Re: Relation extension scalability

On 02-04-2015 AM 09:24, Jim Nasby wrote:

The other potential advantage (and I have to think this could be a BIG
advantage) is extending by a large amount makes it more likely you'll get
contiguous blocks on the storage. That's going to make a big difference for
SeqScan speed. It'd be interesting if someone with access to some real systems
could test that. In particular, seqscan of a possibly fragmented table vs one
of the same size but created at once. For extra credit, compare to dd bs=8192
of a file of the same size as the overall table.

Orthogonal to topic of the thread but this comment made me recall a proposal
couple years ago[0] to add (posix_)fallocate to mdextend(). Wonder if it helps
the case?
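
At the md.c level that proposal amounts to roughly the following (a
sketch for illustration, not the actual patch; MY_BLCKSZ and the
function name are stand-ins):

```c
#include <fcntl.h>
#include <unistd.h>

#define MY_BLCKSZ 8192			/* stand-in for PostgreSQL's BLCKSZ */

/*
 * Extend a segment file by 'nblocks' pages using posix_fallocate()
 * instead of writing zero-filled pages, letting the filesystem reserve
 * (ideally contiguous) space up front.  Returns 0 on success.
 */
static int
fallocate_extend(const char *path, unsigned nblocks)
{
	int			fd = open(path, O_RDWR | O_CREAT, 0600);
	off_t		cur;
	int			rc;

	if (fd < 0)
		return -1;
	cur = lseek(fd, 0, SEEK_END);	/* current segment size */
	rc = posix_fallocate(fd, cur, (off_t) nblocks * MY_BLCKSZ);
	close(fd);
	return rc;
}
```

Whether the reserved extents actually come out contiguous depends on the
filesystem; that is presumably what would need benchmarking.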

Amit

[0]: /messages/by-id/CADupcHW1POmSuNoNMdVaWLTq-a3X_A3ZQMuSjHs4rCexiPgxAQ@mail.gmail.com


#16Stephen Frost
sfrost@snowman.net
In reply to: Amit Langote (#15)
Re: Relation extension scalability

* Amit Langote (Langote_Amit_f8@lab.ntt.co.jp) wrote:

On 02-04-2015 AM 09:24, Jim Nasby wrote:

The other potential advantage (and I have to think this could be a BIG
advantage) is extending by a large amount makes it more likely you'll get
contiguous blocks on the storage. That's going to make a big difference for
SeqScan speed. It'd be interesting if someone with access to some real systems
could test that. In particular, seqscan of a possibly fragmented table vs one
of the same size but created at once. For extra credit, compare to dd bs=8192
of a file of the same size as the overall table.

Orthogonal to topic of the thread but this comment made me recall a proposal
couple years ago[0] to add (posix_)fallocate to mdextend(). Wonder if it helps
the case?

As I recall, it didn't, and further, modern filesystems are pretty good
about avoiding fragmentation anyway..

I'm not saying Jim's completely off-base with this idea, I'm just not
sure that it'll really buy us much.

Thanks,

Stephen

#17Qingqing Zhou
zhouqq.postgres@gmail.com
In reply to: Andres Freund (#1)
Re: Relation extension scalability

On Sun, Mar 29, 2015 at 11:56 AM, Andres Freund <andres@2ndquadrant.com> wrote:

I'm not sure whether the above is the best solution however. For one I
think it's not necessarily a good idea to opencode this in hio.c - I've
not observed it, but this probably can happen for btrees and such as
well. For another, this is still a exclusive lock while we're doing IO:
smgrextend() wants a page to write out, so we have to be careful not to
overwrite things.

I think relaxing the global lock will mostly fix the contention.
However, several people suggested that extending by many pages at once
has other benefits. This hints at a more fundamental change in our
storage model. Currently we map one file per relation. While that is
simple and robust, once partitioned tables, and maybe later columnar
storage, are integrated into the core, this model needs some further
thought. Think about a 1000-partition table with 100 columns: that is
100K files, not to speak of other forks. Surely we can continue
challenging the file system's limits or playing around with vfds, but
we have a chance now to think ahead.

Most commercial databases employ a DMS storage model, where the
database manages object mapping and free space itself, so different
objects share storage within a few files. Surely this has historic
reasons, but it has several advantages over the current model:
- remove fd pressure
- remove double buffering (by introducing ADIO)
- controlled layout and access pattern (sequential and read-ahead)
- better quota management
- performance potentially better

Considering the platforms supported and the stabilization period
needed, we would have to support both the current storage model and the
DMS model. I will stop here to see if this deserves further discussion.

Regards,
Qingqing


#18Qingqing Zhou
zhouqq.postgres@gmail.com
In reply to: Qingqing Zhou (#17)
Re: Relation extension scalability

On Fri, Apr 17, 2015 at 11:19 AM, Qingqing Zhou
<zhouqq.postgres@gmail.com> wrote:

Most commercial databases employ a DMS storage model, where the
database manages object mapping and free space itself, so different
objects share storage within a few files. Surely this has historic
reasons, but it has several advantages over the current model:
- remove fd pressure
- remove double buffering (by introducing ADIO)
- controlled layout and access pattern (sequential and read-ahead)
- better quota management
- performance potentially better

Considering the platforms supported and the stabilization period
needed, we would have to support both the current storage model and the
DMS model. I will stop here to see if this deserves further discussion.

Sorry, this might be considered double-posting, but I am wondering:
have we ever discussed this before? If we already have some conclusions
on this, could anyone share a link?

Thanks,
Qingqing


#19Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
1 attachment(s)
Re: Relation extension scalability

Hi,

Every now and then over the last few weeks I've spent a bit of time
making this more efficient.

I had a bit of a problem reproducing the problems I'd seen in
production on physical hardware (I found EC2 too variable to benchmark
this), but luckily 2ndQuadrant today allowed me access to their four
socket machine[1] of the AXLE project. Thanks Simon and
Tomas!

First, some mostly juicy numbers:

My benchmark was a parallel COPY into a single WAL-logged target
table:
CREATE TABLE data(data text);
The source data has been generated with
narrow:
COPY (select g.i::text FROM generate_series(1, 10000) g(i)) TO '/tmp/copybinary' WITH BINARY;
wide:
COPY (select repeat(random()::text, 10) FROM generate_series(1, 10000) g(i)) TO '/tmp/copybinarywide' WITH BINARY;

Between every test I ran a TRUNCATE data; CHECKPOINT;

For each number of clients I ran pgbench for 70 seconds. I'd previously
determined using -P 1 that the numbers are fairly stable. Longer runs
would have been nice, but then I'd not have finished in time.

shared_buffers = 48GB, narrow table contents:
client tps after: tps before:
1 180.255577 210.125143
2 338.231058 391.875088
4 638.814300 405.243901
8 1126.852233 370.922271
16 1242.363623 498.487008
32 1229.648854 484.477042
48 1223.288397 468.127943
64 1198.007422 438.238119
96 1201.501278 370.556354
128 1198.554929 288.213032
196 1189.603398 193.841993
256 1144.082291 191.293781
512 643.323675 200.782105

shared_buffers = 1GB, narrow table contents:
client tps after: tps before:
1 191.137410 210.787214
2 351.293017 384.086634
4 649.800991 420.703149
8 1103.770749 355.947915
16 1287.192256 489.050768
32 1226.329585 464.936427
48 1187.266489 443.386440
64 1182.698974 402.251258
96 1208.315983 331.290851
128 1183.469635 269.250601
196 1202.847382 202.788617
256 1177.924515 190.876852
512 572.457773 192.413191

shared_buffers = 48GB, wide table contents:
client tps after: tps before:
1 59.685215 68.445331
2 102.034688 103.210277
4 179.434065 78.982315
8 222.613727 76.195353
16 232.162484 77.520265
32 231.979136 71.654421
48 231.981216 64.730114
64 230.955979 57.444215
96 228.016910 56.324725
128 227.693947 45.701038
196 227.410386 37.138537
256 224.626948 35.265530
512 105.356439 34.397636

shared_buffers = 1GB, wide table contents:
(ran out of patience)

Note that the peak performance with the patch is significantly better,
but there's currently a noticeable regression in single threaded
performance. That undoubtedly needs to be addressed.

So, to get to the actual meat: my goal was to essentially get rid of
the exclusive lock over relation extension altogether. I think I found
a way to do that which addresses the concerns raised in this thread.

The new algorithm basically is:
1) Acquire a victim buffer, clean it, and mark it as pinned
2) Get the current size of the relation, save it into blockno
3) Try to insert an entry into the buffer table for blockno
4) If the page is already in the buffer table, increment blockno by 1,
goto 3)
5) Try to read the page. In most cases it'll not yet exist. But the page
might concurrently have been written by another backend and removed
from shared buffers already. If it already exists, goto 1)
6) Zero out the page on disk.
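
The core trick in steps 2-4, that every backend colliding on a buffer
table entry simply bumps blockno, can be shown with a toy,
single-threaded model (a simulation for illustration only, not the real
bufmgr code):

```c
#include <stdbool.h>
#include <string.h>

#define TABLE_SIZE 1024

/* toy buffer table: slot i records whether block i is claimed */
static bool claimed[TABLE_SIZE];

/*
 * Returns the block number this "backend" ends up owning.  Several
 * callers that all observed the same relation size still get distinct,
 * consecutive blocks, which is why no extension lock is needed.
 */
static int
claim_block(int relsize)
{
	int			blockno = relsize;	/* step 2: current end of relation */

	for (;;)
	{
		if (!claimed[blockno])		/* step 3: try to insert */
		{
			claimed[blockno] = true;
			return blockno;
		}
		blockno++;					/* step 4: somebody got it, try next */
	}
}
```

In the real patch the "insert" is BufTableInsert() under the buffer
mapping partition lock, which is what makes the claim atomic across
backends.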

I think this does handle the concurrency issues.

This patch very clearly is in the POC stage. But I do think the approach
is generally sound. I'd like to see some comments before deciding
whether to carry on.

Greetings,

Andres Freund

PS: Yes, I know that the precision in the benchmark numbers isn't
warranted, but I'm too lazy to truncate them.

[1]: [10:28:11 PM] Tomas Vondra: 4x Intel Xeon E5-4620 Eight Core 2.2GHz
Processor's generation Sandy Bridge EP
each core handles 2 threads, so 16 threads total
256GB (16x16GB) ECC REG System Validated Memory (1333 MHz)
2x 250GB SATA 2.5" Enterprise Level HDs (RAID 1, ~250GB)
17x 600GB SATA 2.5" Solid State HDs (RAID 0, ~10TB)
LSI MegaRAID 9271-8iCC controller and Cache Vault Kit (1GB cache)
2 x Nvidia Tesla K20 Active GPU Cards (GK110GL)

Attachments:

0001-WIP-Saner-heap-extension.patch (text/x-patch; charset=us-ascii)
>From fc095897a6f4207d384559a095f80a36cf49648c Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 29 Mar 2015 20:55:32 +0200
Subject: [PATCH] WIP: Saner heap extension.

---
 src/backend/access/heap/hio.c       |  86 ++++----
 src/backend/commands/vacuumlazy.c   |  39 ++--
 src/backend/storage/buffer/bufmgr.c | 377 ++++++++++++++++++++++++++----------
 src/backend/storage/smgr/md.c       |  62 ++++++
 src/backend/storage/smgr/smgr.c     |  20 +-
 src/include/storage/buf_internals.h |   1 +
 src/include/storage/bufmgr.h        |   1 +
 src/include/storage/smgr.h          |   7 +-
 8 files changed, 417 insertions(+), 176 deletions(-)

diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 6db73bf..b47f9fe 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -15,6 +15,8 @@
 
 #include "postgres.h"
 
+#include "miscadmin.h"
+
 #include "access/heapam.h"
 #include "access/hio.h"
 #include "access/htup_details.h"
@@ -237,7 +239,6 @@ RelationGetBufferForTuple(Relation relation, Size len,
 				saveFreeSpace;
 	BlockNumber targetBlock,
 				otherBlock;
-	bool		needLock;
 
 	len = MAXALIGN(len);		/* be conservative */
 
@@ -433,63 +434,50 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	/*
 	 * Have to extend the relation.
 	 *
-	 * We have to use a lock to ensure no one else is extending the rel at the
-	 * same time, else we will both try to initialize the same new page.  We
-	 * can skip locking for new or temp relations, however, since no one else
-	 * could be accessing them.
+	 * To avoid, as it used to be the case, holding the extension lock during
+	 * victim buffer search for the new buffer, we extend the relation here
+	 * instead of relying on bufmgr.c. We still have to hold the extension
+	 * lock to prevent a race between two backends initializing the same page.
 	 */
-	needLock = !RELATION_IS_LOCAL(relation);
-
-	if (needLock)
-		LockRelationForExtension(relation, ExclusiveLock);
+	while(true)
+	{
+		buffer = ExtendRelation(relation, MAIN_FORKNUM, bistate->strategy);
 
-	/*
-	 * XXX This does an lseek - rather expensive - but at the moment it is the
-	 * only way to accurately determine how many blocks are in a relation.  Is
-	 * it worth keeping an accurate file length in shared memory someplace,
-	 * rather than relying on the kernel to do it for us?
-	 */
-	buffer = ReadBufferBI(relation, P_NEW, bistate);
+		if (otherBuffer != InvalidBuffer)
+			LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
 
-	/*
-	 * We can be certain that locking the otherBuffer first is OK, since it
-	 * must have a lower page number.
-	 */
-	if (otherBuffer != InvalidBuffer)
-		LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
+		/*
+		 * Now acquire lock on the new page.
+		 */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
-	/*
-	 * Now acquire lock on the new page.
-	 */
-	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		page = BufferGetPage(buffer);
 
-	/*
-	 * Release the file-extension lock; it's now OK for someone else to extend
-	 * the relation some more.  Note that we cannot release this lock before
-	 * we have buffer lock on the new page, or we risk a race condition
-	 * against vacuumlazy.c --- see comments therein.
-	 */
-	if (needLock)
-		UnlockRelationForExtension(relation, ExclusiveLock);
+		/*
+		 * While unlikely, it's possible that another backend managed to
+		 * initialize the page and use up the free space till we got the
+		 * exclusive lock. That'd require the page to be vacuumed (to be put
+		 * on the free space list) and then be used; possible but fairly
+		 * unlikely in practice. If it happens and there's not enough space,
+		 * just retry.
+		 */
+		if (PageIsNew(page))
+		{
+			PageInit(page, BLCKSZ, 0);
 
-	/*
-	 * We need to initialize the empty new page.  Double-check that it really
-	 * is empty (this should never happen, but if it does we don't want to
-	 * risk wiping out valid data).
-	 */
-	page = BufferGetPage(buffer);
+			Assert(len <= PageGetHeapFreeSpace(page));
+			break;
+		}
+		else if (len <= PageGetHeapFreeSpace(page))
+			break;
 
-	if (!PageIsNew(page))
-		elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
-			 BufferGetBlockNumber(buffer),
-			 RelationGetRelationName(relation));
+		if (otherBuffer != InvalidBuffer)
+			LockBuffer(otherBuffer, BUFFER_LOCK_UNLOCK);
 
-	PageInit(page, BufferGetPageSize(buffer), 0);
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+		ReleaseBuffer(buffer);
 
-	if (len > PageGetHeapFreeSpace(page))
-	{
-		/* We should not get here given the test at the top */
-		elog(PANIC, "tuple is too big: size %zu", len);
+		CHECK_FOR_INTERRUPTS();
 	}
 
 	/*
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index a01cfb4..896731c 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -674,35 +674,18 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 			/*
 			 * An all-zeroes page could be left over if a backend extends the
 			 * relation but crashes before initializing the page. Reclaim such
-			 * pages for use.
-			 *
-			 * We have to be careful here because we could be looking at a
-			 * page that someone has just added to the relation and not yet
-			 * been able to initialize (see RelationGetBufferForTuple). To
-			 * protect against that, release the buffer lock, grab the
-			 * relation extension lock momentarily, and re-lock the buffer. If
-			 * the page is still uninitialized by then, it must be left over
-			 * from a crashed backend, and we can initialize it.
-			 *
-			 * We don't really need the relation lock when this is a new or
-			 * temp relation, but it's probably not worth the code space to
-			 * check that, since this surely isn't a critical path.
-			 *
-			 * Note: the comparable code in vacuum.c need not worry because
-			 * it's got exclusive lock on the whole relation.
+			 * pages for use.  It is also possible that we're looking at a
+			 * page that has just added but not yet initialized (see
+			 * RelationGetBufferForTuple). In that case we just initialize the
+			 * page here. That means the page will end up in the free space
+			 * map a little earlier, but that seems fine.
 			 */
-			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-			LockRelationForExtension(onerel, ExclusiveLock);
-			UnlockRelationForExtension(onerel, ExclusiveLock);
-			LockBufferForCleanup(buf);
-			if (PageIsNew(page))
-			{
-				ereport(WARNING,
-				(errmsg("relation \"%s\" page %u is uninitialized --- fixing",
-						relname, blkno)));
-				PageInit(page, BufferGetPageSize(buf), 0);
-				empty_pages++;
-			}
+			ereport(DEBUG2,
+					(errmsg("relation \"%s\" page %u is uninitialized --- fixing",
+							relname, blkno)));
+			PageInit(page, BufferGetPageSize(buf), 0);
+			empty_pages++;
+
 			freespace = PageGetHeapFreeSpace(page);
 			MarkBufferDirty(buf);
 			UnlockReleaseBuffer(buf);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e4b25587..4613666 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -392,6 +392,7 @@ static Buffer ReadBuffer_common(SMgrRelation reln, char relpersistence,
 				  ForkNumber forkNum, BlockNumber blockNum,
 				  ReadBufferMode mode, BufferAccessStrategy strategy,
 				  bool *hit);
+static volatile BufferDesc *GetVictimBuffer(BufferAccessStrategy strategy, BufFlags *oldFlags);
 static bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
 static void PinBuffer_Locked(volatile BufferDesc *buf);
 static void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
@@ -483,6 +484,176 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
 #endif   /* USE_PREFETCH */
 }
 
+Buffer
+ExtendRelation(Relation reln, ForkNumber forkNum, BufferAccessStrategy strategy)
+{
+	BlockNumber	blockno;
+	Buffer		buf_id;
+	volatile BufferDesc *buf;
+	BufFlags	oldFlags;
+	Block		bufBlock;
+	bool		isLocalBuf = RelationUsesLocalBuffers(reln);
+	int			readblocks;
+
+	BufferTag	oldTag;			/* previous identity of selected buffer */
+	uint32		oldHash;		/* hash value for oldTag */
+	LWLock	   *oldPartitionLock;		/* buffer partition lock for it */
+
+	BufferTag	newTag;
+	uint32		newHash;
+	LWLock	   *newPartitionLock;
+
+	/* FIXME: This obviously isn't acceptable for integration */
+	if (isLocalBuf)
+	{
+		return ReadBufferExtended(reln, forkNum, P_NEW, RBM_NORMAL, strategy);
+	}
+
+	/* Open it at the smgr level if not already done */
+	RelationOpenSmgr(reln);
+
+	/* Make sure we will have room to remember the buffer pin */
+	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+retry_victim:
+	/* we'll need a clean unassociated victim buffer */
+	while (true)
+	{
+		bool		gotIt = false;
+
+		/*
+		 * Returns a buffer that was unpinned and not dirty at the time of the
+		 * check.
+		 */
+		buf = GetVictimBuffer(strategy, &oldFlags);
+
+		if (oldFlags & BM_TAG_VALID)
+		{
+			oldTag = buf->tag;
+			oldHash = BufTableHashCode(&oldTag);
+			oldPartitionLock = BufMappingPartitionLock(oldHash);
+			LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
+		}
+
+		LockBufHdr(buf);
+
+		/* somebody else might have re-pinned the buffer by now */
+		if (buf->refcount != 1 || (buf->flags & BM_DIRTY))
+		{
+			UnlockBufHdr(buf);
+		}
+		else
+		{
+			buf->flags &= ~(BM_TAG_VALID | BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT | BM_NEW);
+
+			UnlockBufHdr(buf);
+
+			gotIt = true;
+
+			if (oldFlags & BM_TAG_VALID)
+				BufTableDelete(&oldTag, oldHash);
+		}
+
+		if (oldFlags & BM_TAG_VALID)
+			LWLockRelease(oldPartitionLock);
+
+		if (gotIt)
+			break;
+		else
+			UnpinBuffer(buf, true);
+	}
+
+	/*
+	 * At this state we have an empty victim buffer; pinned to prevent it from
+	 * being reused.
+	 */
+
+	/*
+	 * First try the current end of the relation. If a concurrent process has
+	 * acquired that, try the next one after that.
+	 */
+	blockno = smgrnblocks(reln->rd_smgr, forkNum);
+
+	while (true)
+	{
+		INIT_BUFFERTAG(newTag, reln->rd_smgr->smgr_rnode.node, forkNum, blockno);
+
+		newHash = BufTableHashCode(&newTag);
+		newPartitionLock = BufMappingPartitionLock(newHash);
+		LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
+
+		buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
+
+		if (buf_id >= 0)
+		{
+			/* somebody else got this block, try the next one */
+			LWLockRelease(newPartitionLock);
+			blockno++;
+			continue;
+		}
+
+		LockBufHdr(buf);
+
+		buf->tag = newTag;
+		if (reln->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+			buf->flags |= BM_NEW | BM_TAG_VALID | BM_PERMANENT;
+		else
+			buf->flags |= BM_NEW | BM_TAG_VALID;
+		buf->usage_count = 1;
+
+		UnlockBufHdr(buf);
+		LWLockRelease(newPartitionLock);
+
+		break;
+	}
+
+	/*
+	 * By here we made a entry into the buffer table, but haven't yet
+	 * read/written the page.  We can't just initialize the page, potentially
+	 * while we were busy with the above, another backend could have extended
+	 * the relation, written something, and the buffer could already have been
+	 * reused for something else.
+	 */
+
+	if (!StartBufferIO(buf, true))
+	{
+		/*
+		 * Somebody else is already using this block. Just try another one.
+		 */
+		UnpinBuffer(buf, true);
+		goto retry_victim;
+	}
+
+	/*
+	 * FIXME: if we die here we might have a problem: Everyone trying to read
+	 * this block will get a failure. Need to add checks for BM_NEW against
+	 * that. That's not really new to this code tho.
+	 */
+
+	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(buf) : BufHdrGetBlock(buf);
+
+	readblocks = smgrtryread(reln->rd_smgr, forkNum, blockno, bufBlock);
+
+	if (readblocks != BLCKSZ)
+	{
+		MemSet((char *) bufBlock, 0, BLCKSZ);
+
+		smgrextend(reln->rd_smgr, forkNum, blockno, (char *) bufBlock, false);
+
+		/* Set BM_VALID, terminate IO, and wake up any waiters */
+		TerminateBufferIO(buf, false, BM_VALID);
+	}
+	else
+	{
+		/* Set BM_VALID, terminate IO, and wake up any waiters */
+		TerminateBufferIO(buf, false, BM_VALID);
+		UnpinBuffer(buf, true);
+
+		goto retry_victim;
+	}
+
+	return BufferDescriptorGetBuffer(buf);
+}
 
 /*
  * ReadBuffer -- a shorthand for ReadBufferExtended, for reading from main
@@ -847,6 +1018,112 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	return BufferDescriptorGetBuffer(bufHdr);
 }
 
+static volatile BufferDesc *
+GetVictimBuffer(BufferAccessStrategy strategy, BufFlags *oldFlags)
+{
+	volatile BufferDesc *buf;
+
+	/*
+	 * Ensure, while the spinlock's not yet held, that there's a free refcount
+	 * entry.
+	 */
+	ReservePrivateRefCountEntry();
+
+retry:
+	/*
+	 * Select a victim buffer.  The buffer is returned with its header
+	 * spinlock still held!
+	 */
+	buf = StrategyGetBuffer(strategy);
+
+	Assert(buf->refcount == 0);
+
+	/* Must copy buffer flags while we still hold the spinlock */
+	*oldFlags = buf->flags;
+
+	/* Pin the buffer and then release the buffer spinlock */
+	PinBuffer_Locked(buf);
+
+	/*
+	 * If the buffer was dirty, try to write it out.  There is a race
+	 * condition here, in that someone might dirty it after we released it
+	 * above, or even while we are writing it out (since our share-lock
+	 * won't prevent hint-bit updates).  We will recheck the dirty bit
+	 * after re-locking the buffer header.
+	 */
+	if (*oldFlags & BM_DIRTY)
+	{
+		/*
+		 * We need a share-lock on the buffer contents to write it out
+		 * (else we might write invalid data, eg because someone else is
+		 * compacting the page contents while we write).  We must use a
+		 * conditional lock acquisition here to avoid deadlock.  Even
+		 * though the buffer was not pinned (and therefore surely not
+		 * locked) when StrategyGetBuffer returned it, someone else could
+		 * have pinned and exclusive-locked it by the time we get here. If
+		 * we try to get the lock unconditionally, we'd block waiting for
+		 * them; if they later block waiting for us, deadlock ensues.
+		 * (This has been observed to happen when two backends are both
+		 * trying to split btree index pages, and the second one just
+		 * happens to be trying to split the page the first one got from
+		 * StrategyGetBuffer.)
+		 */
+		if (LWLockConditionalAcquire(buf->content_lock, LW_SHARED))
+		{
+			/*
+			 * If using a nondefault strategy, and writing the buffer
+			 * would require a WAL flush, let the strategy decide whether
+			 * to go ahead and write/reuse the buffer or to choose another
+			 * victim.  We need lock to inspect the page LSN, so this
+			 * can't be done inside StrategyGetBuffer.
+			 */
+			if (strategy != NULL)
+			{
+				XLogRecPtr	lsn;
+
+				/* Read the LSN while holding buffer header lock */
+				LockBufHdr(buf);
+				lsn = BufferGetLSN(buf);
+				UnlockBufHdr(buf);
+
+				if (XLogNeedsFlush(lsn) &&
+					StrategyRejectBuffer(strategy, buf))
+				{
+					/* Drop lock/pin and loop around for another buffer */
+					LWLockRelease(buf->content_lock);
+					UnpinBuffer(buf, true);
+					goto retry;
+				}
+			}
+
+			/* OK, do the I/O */
+			TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
+										   smgr->smgr_rnode.node.spcNode,
+											smgr->smgr_rnode.node.dbNode,
+										  smgr->smgr_rnode.node.relNode);
+
+			FlushBuffer(buf, NULL);
+			LWLockRelease(buf->content_lock);
+
+			TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
+										   smgr->smgr_rnode.node.spcNode,
+											smgr->smgr_rnode.node.dbNode,
+										  smgr->smgr_rnode.node.relNode);
+		}
+		else
+		{
+			/*
+			 * Someone else has locked the buffer, so give it up and loop
+			 * back to get another one.
+			 */
+			UnpinBuffer(buf, true);
+			goto retry;
+		}
+	}
+
+	return buf;
+}
+
 /*
  * BufferAlloc -- subroutine for ReadBuffer.  Handles lookup of a shared
  *		buffer.  If no buffer exists already, selects a replacement
@@ -940,102 +1217,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	/* Loop here in case we have to try another victim buffer */
 	for (;;)
 	{
-		/*
-		 * Ensure, while the spinlock's not yet held, that there's a free
-		 * refcount entry.
-		 */
-		ReservePrivateRefCountEntry();
-
-		/*
-		 * Select a victim buffer.  The buffer is returned with its header
-		 * spinlock still held!
-		 */
-		buf = StrategyGetBuffer(strategy);
-
-		Assert(buf->refcount == 0);
-
-		/* Must copy buffer flags while we still hold the spinlock */
-		oldFlags = buf->flags;
-
-		/* Pin the buffer and then release the buffer spinlock */
-		PinBuffer_Locked(buf);
-
-		/*
-		 * If the buffer was dirty, try to write it out.  There is a race
-		 * condition here, in that someone might dirty it after we released it
-		 * above, or even while we are writing it out (since our share-lock
-		 * won't prevent hint-bit updates).  We will recheck the dirty bit
-		 * after re-locking the buffer header.
-		 */
-		if (oldFlags & BM_DIRTY)
-		{
-			/*
-			 * We need a share-lock on the buffer contents to write it out
-			 * (else we might write invalid data, eg because someone else is
-			 * compacting the page contents while we write).  We must use a
-			 * conditional lock acquisition here to avoid deadlock.  Even
-			 * though the buffer was not pinned (and therefore surely not
-			 * locked) when StrategyGetBuffer returned it, someone else could
-			 * have pinned and exclusive-locked it by the time we get here. If
-			 * we try to get the lock unconditionally, we'd block waiting for
-			 * them; if they later block waiting for us, deadlock ensues.
-			 * (This has been observed to happen when two backends are both
-			 * trying to split btree index pages, and the second one just
-			 * happens to be trying to split the page the first one got from
-			 * StrategyGetBuffer.)
-			 */
-			if (LWLockConditionalAcquire(buf->content_lock, LW_SHARED))
-			{
-				/*
-				 * If using a nondefault strategy, and writing the buffer
-				 * would require a WAL flush, let the strategy decide whether
-				 * to go ahead and write/reuse the buffer or to choose another
-				 * victim.  We need lock to inspect the page LSN, so this
-				 * can't be done inside StrategyGetBuffer.
-				 */
-				if (strategy != NULL)
-				{
-					XLogRecPtr	lsn;
-
-					/* Read the LSN while holding buffer header lock */
-					LockBufHdr(buf);
-					lsn = BufferGetLSN(buf);
-					UnlockBufHdr(buf);
-
-					if (XLogNeedsFlush(lsn) &&
-						StrategyRejectBuffer(strategy, buf))
-					{
-						/* Drop lock/pin and loop around for another buffer */
-						LWLockRelease(buf->content_lock);
-						UnpinBuffer(buf, true);
-						continue;
-					}
-				}
-
-				/* OK, do the I/O */
-				TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
-											   smgr->smgr_rnode.node.spcNode,
-												smgr->smgr_rnode.node.dbNode,
-											  smgr->smgr_rnode.node.relNode);
-
-				FlushBuffer(buf, NULL);
-				LWLockRelease(buf->content_lock);
-
-				TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
-											   smgr->smgr_rnode.node.spcNode,
-												smgr->smgr_rnode.node.dbNode,
-											  smgr->smgr_rnode.node.relNode);
-			}
-			else
-			{
-				/*
-				 * Someone else has locked the buffer, so give it up and loop
-				 * back to get another one.
-				 */
-				UnpinBuffer(buf, true);
-				continue;
-			}
-		}
+		/* returns a nondirty buffer, with potentially valid contents */
+		buf = GetVictimBuffer(strategy, &oldFlags);
 
 		/*
 		 * To change the association of a valid buffer, we'll need to have
@@ -1171,7 +1354,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	 * 1 so that the buffer can survive one clock-sweep pass.)
 	 */
 	buf->tag = newTag;
-	buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT);
+	buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT | BM_NEW);
 	if (relpersistence == RELPERSISTENCE_PERMANENT)
 		buf->flags |= BM_TAG_VALID | BM_PERMANENT;
 	else
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 42a43bb..0038c91 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -729,6 +729,68 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	}
 }
 
+
+/*
+ *	mdtryread() -- Read the specified block from a relation.
+ */
+int
+mdtryread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+	   char *buffer)
+{
+	off_t		seekpos;
+	int			nbytes;
+	MdfdVec    *v;
+
+	TRACE_POSTGRESQL_SMGR_MD_READ_START(forknum, blocknum,
+										reln->smgr_rnode.node.spcNode,
+										reln->smgr_rnode.node.dbNode,
+										reln->smgr_rnode.node.relNode,
+										reln->smgr_rnode.backend);
+
+	v = _mdfd_getseg(reln, forknum, blocknum, false, EXTENSION_RETURN_NULL);
+
+	/* would need another segment */
+	if (v == NULL)
+		return 0;
+
+	seekpos = (off_t) BLCKSZ *(blocknum % ((BlockNumber) RELSEG_SIZE));
+
+	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+	if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not seek to block %u in file \"%s\": %m",
+						blocknum, FilePathName(v->mdfd_vfd))));
+
+	nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ);
+
+	TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum,
+									   reln->smgr_rnode.node.spcNode,
+									   reln->smgr_rnode.node.dbNode,
+									   reln->smgr_rnode.node.relNode,
+									   reln->smgr_rnode.backend,
+									   nbytes,
+									   BLCKSZ);
+
+	if (nbytes < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read block %u in file \"%s\": %m",
+						blocknum, FilePathName(v->mdfd_vfd))));
+
+	if (nbytes > 0 && nbytes < BLCKSZ)
+	{
+		ereport(LOG,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("could not read block %u in file \"%s\": read only %d of %d bytes",
+						blocknum, FilePathName(v->mdfd_vfd),
+						nbytes, BLCKSZ)));
+	}
+
+	return nbytes;
+}
+
 /*
  *	mdwrite() -- Write the supplied block at the appropriate location.
  *
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 244b4ea..f0e9a7b 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -51,6 +51,8 @@ typedef struct f_smgr
 											  BlockNumber blocknum);
 	void		(*smgr_read) (SMgrRelation reln, ForkNumber forknum,
 										  BlockNumber blocknum, char *buffer);
+	int			(*smgr_tryread) (SMgrRelation reln, ForkNumber forknum,
+										  BlockNumber blocknum, char *buffer);
 	void		(*smgr_write) (SMgrRelation reln, ForkNumber forknum,
 						 BlockNumber blocknum, char *buffer, bool skipFsync);
 	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
@@ -66,7 +68,7 @@ typedef struct f_smgr
 static const f_smgr smgrsw[] = {
 	/* magnetic disk */
 	{mdinit, NULL, mdclose, mdcreate, mdexists, mdunlink, mdextend,
-		mdprefetch, mdread, mdwrite, mdnblocks, mdtruncate, mdimmedsync,
+		mdprefetch, mdread, mdtryread, mdwrite, mdnblocks, mdtruncate, mdimmedsync,
 		mdpreckpt, mdsync, mdpostckpt
 	}
 };
@@ -626,6 +628,22 @@ smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	(*(smgrsw[reln->smgr_which].smgr_read)) (reln, forknum, blocknum, buffer);
 }
 
+
+/*
+ *	smgrtryread() -- read a particular block from a relation into the supplied
+ *				  buffer.
+ *
+ *		This routine is called from the buffer manager in order to
+ *		instantiate pages in the shared buffer cache.  All storage managers
+ *		return pages in the format that POSTGRES expects.
+ */
+int
+smgrtryread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		 char *buffer)
+{
+	return (*(smgrsw[reln->smgr_which].smgr_tryread)) (reln, forknum, blocknum, buffer);
+}
+
 /*
  *	smgrwrite() -- Write the supplied buffer out.
  *
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 521ee1c..5f961af 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -40,6 +40,7 @@
 #define BM_CHECKPOINT_NEEDED	(1 << 7)		/* must write for checkpoint */
 #define BM_PERMANENT			(1 << 8)		/* permanent relation (not
 												 * unlogged) */
+#define BM_NEW					(1 << 9)		/* Not guaranteed to exist on disk */
 
 typedef bits16 BufFlags;
 
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ec0a254..b52591f 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -153,6 +153,7 @@ extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 						  ForkNumber forkNum, BlockNumber blockNum,
 						  ReadBufferMode mode, BufferAccessStrategy strategy);
+extern Buffer ExtendRelation(Relation reln, ForkNumber forkNum, BufferAccessStrategy strategy);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 69a624f..07a331c 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -94,6 +94,8 @@ extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
 			 BlockNumber blocknum);
 extern void smgrread(SMgrRelation reln, ForkNumber forknum,
 		 BlockNumber blocknum, char *buffer);
+extern int smgrtryread(SMgrRelation reln, ForkNumber forknum,
+		 BlockNumber blocknum, char *buffer);
 extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
 		  BlockNumber blocknum, char *buffer, bool skipFsync);
 extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
@@ -114,12 +116,15 @@ extern void mdclose(SMgrRelation reln, ForkNumber forknum);
 extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
 extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
-extern void mdextend(SMgrRelation reln, ForkNumber forknum,
+extern void mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void mdappend(SMgrRelation reln, ForkNumber forknum,
 		 BlockNumber blocknum, char *buffer, bool skipFsync);
 extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
 		   BlockNumber blocknum);
 extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	   char *buffer);
+extern int mdtryread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+	   char *buffer);
 extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
 		BlockNumber blocknum, char *buffer, bool skipFsync);
 extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
-- 
2.3.0.149.gf3f4077.dirty

#20 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#19)
1 attachment(s)
Re: Relation extension scalability

Hi,

Eeek, the attached patch included a trivial last minute screwup
(dereferencing bistate unconditionally...). Fixed version attached.

Andres

Attachments:

0001-WIP-Saner-heap-extension.patch (text/x-patch; charset=us-ascii)
From fc095897a6f4207d384559a095f80a36cf49648c Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 29 Mar 2015 20:55:32 +0200
Subject: [PATCH] WIP: Saner heap extension.

---
 src/backend/access/heap/hio.c       |  86 ++++----
 src/backend/commands/vacuumlazy.c   |  39 ++--
 src/backend/storage/buffer/bufmgr.c | 377 ++++++++++++++++++++++++++----------
 src/backend/storage/smgr/md.c       |  62 ++++++
 src/backend/storage/smgr/smgr.c     |  20 +-
 src/include/storage/buf_internals.h |   1 +
 src/include/storage/bufmgr.h        |   1 +
 src/include/storage/smgr.h          |   7 +-
 8 files changed, 417 insertions(+), 176 deletions(-)

diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 6db73bf..b47f9fe 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -15,6 +15,8 @@
 
 #include "postgres.h"
 
+#include "miscadmin.h"
+
 #include "access/heapam.h"
 #include "access/hio.h"
 #include "access/htup_details.h"
@@ -237,7 +239,6 @@ RelationGetBufferForTuple(Relation relation, Size len,
 				saveFreeSpace;
 	BlockNumber targetBlock,
 				otherBlock;
-	bool		needLock;
 
 	len = MAXALIGN(len);		/* be conservative */
 
@@ -433,63 +434,50 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	/*
 	 * Have to extend the relation.
 	 *
-	 * We have to use a lock to ensure no one else is extending the rel at the
-	 * same time, else we will both try to initialize the same new page.  We
-	 * can skip locking for new or temp relations, however, since no one else
-	 * could be accessing them.
+	 * To avoid holding the extension lock during the search for a victim
+	 * buffer, as used to be the case, we extend the relation here instead
+	 * of relying on bufmgr.c. We still have to hold the extension lock to
+	 * prevent a race between two backends initializing the same page.
 	 */
-	needLock = !RELATION_IS_LOCAL(relation);
-
-	if (needLock)
-		LockRelationForExtension(relation, ExclusiveLock);
+	while(true)
+	{
+		buffer = ExtendRelation(relation, MAIN_FORKNUM, bistate ? bistate->strategy : NULL);
 
-	/*
-	 * XXX This does an lseek - rather expensive - but at the moment it is the
-	 * only way to accurately determine how many blocks are in a relation.  Is
-	 * it worth keeping an accurate file length in shared memory someplace,
-	 * rather than relying on the kernel to do it for us?
-	 */
-	buffer = ReadBufferBI(relation, P_NEW, bistate);
+		if (otherBuffer != InvalidBuffer)
+			LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
 
-	/*
-	 * We can be certain that locking the otherBuffer first is OK, since it
-	 * must have a lower page number.
-	 */
-	if (otherBuffer != InvalidBuffer)
-		LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
+		/*
+		 * Now acquire lock on the new page.
+		 */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
-	/*
-	 * Now acquire lock on the new page.
-	 */
-	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		page = BufferGetPage(buffer);
 
-	/*
-	 * Release the file-extension lock; it's now OK for someone else to extend
-	 * the relation some more.  Note that we cannot release this lock before
-	 * we have buffer lock on the new page, or we risk a race condition
-	 * against vacuumlazy.c --- see comments therein.
-	 */
-	if (needLock)
-		UnlockRelationForExtension(relation, ExclusiveLock);
+		/*
+		 * While unlikely, it's possible that another backend managed to
+		 * initialize the page and use up the free space till we got the
+		 * exclusive lock. That'd require the page to be vacuumed (to be put
+		 * on the free space list) and then be used; possible but fairly
+		 * unlikely in practice. If it happens and there's not enough space,
+		 * just retry.
+		 */
+		if (PageIsNew(page))
+		{
+			PageInit(page, BLCKSZ, 0);
 
-	/*
-	 * We need to initialize the empty new page.  Double-check that it really
-	 * is empty (this should never happen, but if it does we don't want to
-	 * risk wiping out valid data).
-	 */
-	page = BufferGetPage(buffer);
+			Assert(len <= PageGetHeapFreeSpace(page));
+			break;
+		}
+		else if (len <= PageGetHeapFreeSpace(page))
+			break;
 
-	if (!PageIsNew(page))
-		elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
-			 BufferGetBlockNumber(buffer),
-			 RelationGetRelationName(relation));
+		if (otherBuffer != InvalidBuffer)
+			LockBuffer(otherBuffer, BUFFER_LOCK_UNLOCK);
 
-	PageInit(page, BufferGetPageSize(buffer), 0);
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+		ReleaseBuffer(buffer);
 
-	if (len > PageGetHeapFreeSpace(page))
-	{
-		/* We should not get here given the test at the top */
-		elog(PANIC, "tuple is too big: size %zu", len);
+		CHECK_FOR_INTERRUPTS();
 	}
 
 	/*
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index a01cfb4..896731c 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -674,35 +674,18 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 			/*
 			 * An all-zeroes page could be left over if a backend extends the
 			 * relation but crashes before initializing the page. Reclaim such
-			 * pages for use.
-			 *
-			 * We have to be careful here because we could be looking at a
-			 * page that someone has just added to the relation and not yet
-			 * been able to initialize (see RelationGetBufferForTuple). To
-			 * protect against that, release the buffer lock, grab the
-			 * relation extension lock momentarily, and re-lock the buffer. If
-			 * the page is still uninitialized by then, it must be left over
-			 * from a crashed backend, and we can initialize it.
-			 *
-			 * We don't really need the relation lock when this is a new or
-			 * temp relation, but it's probably not worth the code space to
-			 * check that, since this surely isn't a critical path.
-			 *
-			 * Note: the comparable code in vacuum.c need not worry because
-			 * it's got exclusive lock on the whole relation.
+			 * pages for use.  It is also possible that we're looking at a
+			 * page that has just been added but not yet initialized (see
+			 * RelationGetBufferForTuple). In that case we just initialize the
+			 * page here. That means the page will end up in the free space
+			 * map a little earlier, but that seems fine.
 			 */
-			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-			LockRelationForExtension(onerel, ExclusiveLock);
-			UnlockRelationForExtension(onerel, ExclusiveLock);
-			LockBufferForCleanup(buf);
-			if (PageIsNew(page))
-			{
-				ereport(WARNING,
-				(errmsg("relation \"%s\" page %u is uninitialized --- fixing",
-						relname, blkno)));
-				PageInit(page, BufferGetPageSize(buf), 0);
-				empty_pages++;
-			}
+			ereport(DEBUG2,
+					(errmsg("relation \"%s\" page %u is uninitialized --- fixing",
+							relname, blkno)));
+			PageInit(page, BufferGetPageSize(buf), 0);
+			empty_pages++;
+
 			freespace = PageGetHeapFreeSpace(page);
 			MarkBufferDirty(buf);
 			UnlockReleaseBuffer(buf);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e4b25587..4613666 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -392,6 +392,7 @@ static Buffer ReadBuffer_common(SMgrRelation reln, char relpersistence,
 				  ForkNumber forkNum, BlockNumber blockNum,
 				  ReadBufferMode mode, BufferAccessStrategy strategy,
 				  bool *hit);
+static volatile BufferDesc *GetVictimBuffer(BufferAccessStrategy strategy, BufFlags *oldFlags);
 static bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
 static void PinBuffer_Locked(volatile BufferDesc *buf);
 static void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
@@ -483,6 +484,176 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
 #endif   /* USE_PREFETCH */
 }
 
+Buffer
+ExtendRelation(Relation reln, ForkNumber forkNum, BufferAccessStrategy strategy)
+{
+	BlockNumber	blockno;
+	Buffer		buf_id;
+	volatile BufferDesc *buf;
+	BufFlags	oldFlags;
+	Block		bufBlock;
+	bool		isLocalBuf = RelationUsesLocalBuffers(reln);
+	int			readblocks;
+
+	BufferTag	oldTag;			/* previous identity of selected buffer */
+	uint32		oldHash;		/* hash value for oldTag */
+	LWLock	   *oldPartitionLock;		/* buffer partition lock for it */
+
+	BufferTag	newTag;
+	uint32		newHash;
+	LWLock	   *newPartitionLock;
+
+	/* FIXME: This obviously isn't acceptable for integration */
+	if (isLocalBuf)
+	{
+		return ReadBufferExtended(reln, forkNum, P_NEW, RBM_NORMAL, strategy);
+	}
+
+	/* Open it at the smgr level if not already done */
+	RelationOpenSmgr(reln);
+
+	/* Make sure we will have room to remember the buffer pin */
+	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+retry_victim:
+	/* we'll need a clean unassociated victim buffer */
+	while (true)
+	{
+		bool		gotIt = false;
+
+		/*
+		 * Returns a buffer that was unpinned and not dirty at the time of the
+		 * check.
+		 */
+		buf = GetVictimBuffer(strategy, &oldFlags);
+
+		if (oldFlags & BM_TAG_VALID)
+		{
+			oldTag = buf->tag;
+			oldHash = BufTableHashCode(&oldTag);
+			oldPartitionLock = BufMappingPartitionLock(oldHash);
+			LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
+		}
+
+		LockBufHdr(buf);
+
+		/* somebody else might have re-pinned the buffer by now */
+		if (buf->refcount != 1 || (buf->flags & BM_DIRTY))
+		{
+			UnlockBufHdr(buf);
+		}
+		else
+		{
+			buf->flags &= ~(BM_TAG_VALID | BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT | BM_NEW);
+
+			UnlockBufHdr(buf);
+
+			gotIt = true;
+
+			if (oldFlags & BM_TAG_VALID)
+				BufTableDelete(&oldTag, oldHash);
+		}
+
+		if (oldFlags & BM_TAG_VALID)
+			LWLockRelease(oldPartitionLock);
+
+		if (gotIt)
+			break;
+		else
+			UnpinBuffer(buf, true);
+	}
+
+	/*
+	 * At this point we have an empty victim buffer, pinned to prevent it from
+	 * being reused.
+	 */
+
+	/*
+	 * First try the current end of the relation. If a concurrent process has
+	 * acquired that, try the next one after that.
+	 */
+	blockno = smgrnblocks(reln->rd_smgr, forkNum);
+
+	while (true)
+	{
+		INIT_BUFFERTAG(newTag, reln->rd_smgr->smgr_rnode.node, forkNum, blockno);
+
+		newHash = BufTableHashCode(&newTag);
+		newPartitionLock = BufMappingPartitionLock(newHash);
+		LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
+
+		buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
+
+		if (buf_id >= 0)
+		{
+			/* somebody else got this block, try the next one */
+			LWLockRelease(newPartitionLock);
+			blockno++;
+			continue;
+		}
+
+		LockBufHdr(buf);
+
+		buf->tag = newTag;
+		if (reln->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+			buf->flags |= BM_NEW | BM_TAG_VALID | BM_PERMANENT;
+		else
+			buf->flags |= BM_NEW | BM_TAG_VALID;
+		buf->usage_count = 1;
+
+		UnlockBufHdr(buf);
+		LWLockRelease(newPartitionLock);
+
+		break;
+	}
+
+	/*
+	 * By here we have made an entry into the buffer table, but haven't yet
+	 * read/written the page.  We can't just initialize the page: while we
+	 * were busy with the above, another backend could have extended the
+	 * relation, written something, and the buffer could already have been
+	 * reused for something else.
+	 */
+
+	if (!StartBufferIO(buf, true))
+	{
+		/*
+		 * Somebody else is already using this block. Just try another one.
+		 */
+		UnpinBuffer(buf, true);
+		goto retry_victim;
+	}
+
+	/*
+	 * FIXME: if we die here we might have a problem: Everyone trying to read
+	 * this block will get a failure. Need to add checks for BM_NEW against
+	 * that. That's not really new to this code, though.
+	 */
+
+	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(buf) : BufHdrGetBlock(buf);
+
+	readblocks = smgrtryread(reln->rd_smgr, forkNum, blockno, bufBlock);
+
+	if (readblocks != BLCKSZ)
+	{
+		MemSet((char *) bufBlock, 0, BLCKSZ);
+
+		smgrextend(reln->rd_smgr, forkNum, blockno, (char *) bufBlock, false);
+
+		/* Set BM_VALID, terminate IO, and wake up any waiters */
+		TerminateBufferIO(buf, false, BM_VALID);
+	}
+	else
+	{
+		/* Set BM_VALID, terminate IO, and wake up any waiters */
+		TerminateBufferIO(buf, false, BM_VALID);
+		UnpinBuffer(buf, true);
+
+		goto retry_victim;
+	}
+
+	return BufferDescriptorGetBuffer(buf);
+}
 
 /*
  * ReadBuffer -- a shorthand for ReadBufferExtended, for reading from main
@@ -847,6 +1018,112 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	return BufferDescriptorGetBuffer(bufHdr);
 }
 
+static volatile BufferDesc *
+GetVictimBuffer(BufferAccessStrategy strategy, BufFlags *oldFlags)
+{
+	volatile BufferDesc *buf;
+
+	/*
+	 * Ensure, while the spinlock's not yet held, that there's a free refcount
+	 * entry.
+	 */
+	ReservePrivateRefCountEntry();
+
+retry:
+	/*
+	 * Select a victim buffer.  The buffer is returned with its header
+	 * spinlock still held!
+	 */
+	buf = StrategyGetBuffer(strategy);
+
+	Assert(buf->refcount == 0);
+
+	/* Must copy buffer flags while we still hold the spinlock */
+	*oldFlags = buf->flags;
+
+	/* Pin the buffer and then release the buffer spinlock */
+	PinBuffer_Locked(buf);
+
+	/*
+	 * If the buffer was dirty, try to write it out.  There is a race
+	 * condition here, in that someone might dirty it after we released it
+	 * above, or even while we are writing it out (since our share-lock
+	 * won't prevent hint-bit updates).  We will recheck the dirty bit
+	 * after re-locking the buffer header.
+	 */
+	if (*oldFlags & BM_DIRTY)
+	{
+		/*
+		 * We need a share-lock on the buffer contents to write it out
+		 * (else we might write invalid data, eg because someone else is
+		 * compacting the page contents while we write).  We must use a
+		 * conditional lock acquisition here to avoid deadlock.  Even
+		 * though the buffer was not pinned (and therefore surely not
+		 * locked) when StrategyGetBuffer returned it, someone else could
+		 * have pinned and exclusive-locked it by the time we get here. If
+		 * we try to get the lock unconditionally, we'd block waiting for
+		 * them; if they later block waiting for us, deadlock ensues.
+		 * (This has been observed to happen when two backends are both
+		 * trying to split btree index pages, and the second one just
+		 * happens to be trying to split the page the first one got from
+		 * StrategyGetBuffer.)
+		 */
+		if (LWLockConditionalAcquire(buf->content_lock, LW_SHARED))
+		{
+			/*
+			 * If using a nondefault strategy, and writing the buffer
+			 * would require a WAL flush, let the strategy decide whether
+			 * to go ahead and write/reuse the buffer or to choose another
+			 * victim.  We need lock to inspect the page LSN, so this
+			 * can't be done inside StrategyGetBuffer.
+			 */
+			if (strategy != NULL)
+			{
+				XLogRecPtr	lsn;
+
+				/* Read the LSN while holding buffer header lock */
+				LockBufHdr(buf);
+				lsn = BufferGetLSN(buf);
+				UnlockBufHdr(buf);
+
+				if (XLogNeedsFlush(lsn) &&
+					StrategyRejectBuffer(strategy, buf))
+				{
+					/* Drop lock/pin and loop around for another buffer */
+					LWLockRelease(buf->content_lock);
+					UnpinBuffer(buf, true);
+					goto retry;
+				}
+			}
+
+			/* OK, do the I/O */
+			TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
+										   smgr->smgr_rnode.node.spcNode,
+											smgr->smgr_rnode.node.dbNode,
+										  smgr->smgr_rnode.node.relNode);
+
+			FlushBuffer(buf, NULL);
+			LWLockRelease(buf->content_lock);
+
+			TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
+										   smgr->smgr_rnode.node.spcNode,
+											smgr->smgr_rnode.node.dbNode,
+										  smgr->smgr_rnode.node.relNode);
+		}
+		else
+		{
+			/*
+			 * Someone else has locked the buffer, so give it up and loop
+			 * back to get another one.
+			 */
+			UnpinBuffer(buf, true);
+			goto retry;
+		}
+	}
+
+	return buf;
+}
+
 /*
  * BufferAlloc -- subroutine for ReadBuffer.  Handles lookup of a shared
  *		buffer.  If no buffer exists already, selects a replacement
@@ -940,102 +1217,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	/* Loop here in case we have to try another victim buffer */
 	for (;;)
 	{
-		/*
-		 * Ensure, while the spinlock's not yet held, that there's a free
-		 * refcount entry.
-		 */
-		ReservePrivateRefCountEntry();
-
-		/*
-		 * Select a victim buffer.  The buffer is returned with its header
-		 * spinlock still held!
-		 */
-		buf = StrategyGetBuffer(strategy);
-
-		Assert(buf->refcount == 0);
-
-		/* Must copy buffer flags while we still hold the spinlock */
-		oldFlags = buf->flags;
-
-		/* Pin the buffer and then release the buffer spinlock */
-		PinBuffer_Locked(buf);
-
-		/*
-		 * If the buffer was dirty, try to write it out.  There is a race
-		 * condition here, in that someone might dirty it after we released it
-		 * above, or even while we are writing it out (since our share-lock
-		 * won't prevent hint-bit updates).  We will recheck the dirty bit
-		 * after re-locking the buffer header.
-		 */
-		if (oldFlags & BM_DIRTY)
-		{
-			/*
-			 * We need a share-lock on the buffer contents to write it out
-			 * (else we might write invalid data, eg because someone else is
-			 * compacting the page contents while we write).  We must use a
-			 * conditional lock acquisition here to avoid deadlock.  Even
-			 * though the buffer was not pinned (and therefore surely not
-			 * locked) when StrategyGetBuffer returned it, someone else could
-			 * have pinned and exclusive-locked it by the time we get here. If
-			 * we try to get the lock unconditionally, we'd block waiting for
-			 * them; if they later block waiting for us, deadlock ensues.
-			 * (This has been observed to happen when two backends are both
-			 * trying to split btree index pages, and the second one just
-			 * happens to be trying to split the page the first one got from
-			 * StrategyGetBuffer.)
-			 */
-			if (LWLockConditionalAcquire(buf->content_lock, LW_SHARED))
-			{
-				/*
-				 * If using a nondefault strategy, and writing the buffer
-				 * would require a WAL flush, let the strategy decide whether
-				 * to go ahead and write/reuse the buffer or to choose another
-				 * victim.  We need lock to inspect the page LSN, so this
-				 * can't be done inside StrategyGetBuffer.
-				 */
-				if (strategy != NULL)
-				{
-					XLogRecPtr	lsn;
-
-					/* Read the LSN while holding buffer header lock */
-					LockBufHdr(buf);
-					lsn = BufferGetLSN(buf);
-					UnlockBufHdr(buf);
-
-					if (XLogNeedsFlush(lsn) &&
-						StrategyRejectBuffer(strategy, buf))
-					{
-						/* Drop lock/pin and loop around for another buffer */
-						LWLockRelease(buf->content_lock);
-						UnpinBuffer(buf, true);
-						continue;
-					}
-				}
-
-				/* OK, do the I/O */
-				TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
-											   smgr->smgr_rnode.node.spcNode,
-												smgr->smgr_rnode.node.dbNode,
-											  smgr->smgr_rnode.node.relNode);
-
-				FlushBuffer(buf, NULL);
-				LWLockRelease(buf->content_lock);
-
-				TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
-											   smgr->smgr_rnode.node.spcNode,
-												smgr->smgr_rnode.node.dbNode,
-											  smgr->smgr_rnode.node.relNode);
-			}
-			else
-			{
-				/*
-				 * Someone else has locked the buffer, so give it up and loop
-				 * back to get another one.
-				 */
-				UnpinBuffer(buf, true);
-				continue;
-			}
-		}
+		/* returns a nondirty buffer, with potentially valid contents */
+		buf = GetVictimBuffer(strategy, &oldFlags);
 
 		/*
 		 * To change the association of a valid buffer, we'll need to have
@@ -1171,7 +1354,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	 * 1 so that the buffer can survive one clock-sweep pass.)
 	 */
 	buf->tag = newTag;
-	buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT);
+	buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT | BM_NEW);
 	if (relpersistence == RELPERSISTENCE_PERMANENT)
 		buf->flags |= BM_TAG_VALID | BM_PERMANENT;
 	else
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 42a43bb..0038c91 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -729,6 +729,68 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	}
 }
 
+
+/*
+ *	mdtryread() -- Read the specified block from a relation.
+ */
+int
+mdtryread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+	   char *buffer)
+{
+	off_t		seekpos;
+	int			nbytes;
+	MdfdVec    *v;
+
+	TRACE_POSTGRESQL_SMGR_MD_READ_START(forknum, blocknum,
+										reln->smgr_rnode.node.spcNode,
+										reln->smgr_rnode.node.dbNode,
+										reln->smgr_rnode.node.relNode,
+										reln->smgr_rnode.backend);
+
+	v = _mdfd_getseg(reln, forknum, blocknum, false, EXTENSION_RETURN_NULL);
+
+	/* would need another segment */
+	if (v == NULL)
+		return 0;
+
+	seekpos = (off_t) BLCKSZ *(blocknum % ((BlockNumber) RELSEG_SIZE));
+
+	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+	if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not seek to block %u in file \"%s\": %m",
+						blocknum, FilePathName(v->mdfd_vfd))));
+
+	nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ);
+
+	TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum,
+									   reln->smgr_rnode.node.spcNode,
+									   reln->smgr_rnode.node.dbNode,
+									   reln->smgr_rnode.node.relNode,
+									   reln->smgr_rnode.backend,
+									   nbytes,
+									   BLCKSZ);
+
+	if (nbytes < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read block %u in file \"%s\": %m",
+						blocknum, FilePathName(v->mdfd_vfd))));
+
+	if (nbytes > 0 && nbytes < BLCKSZ)
+	{
+		ereport(LOG,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("could not read block %u in file \"%s\": read only %d of %d bytes",
+						blocknum, FilePathName(v->mdfd_vfd),
+						nbytes, BLCKSZ)));
+	}
+
+	return nbytes;
+}
+
 /*
  *	mdwrite() -- Write the supplied block at the appropriate location.
  *
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 244b4ea..f0e9a7b 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -51,6 +51,8 @@ typedef struct f_smgr
 											  BlockNumber blocknum);
 	void		(*smgr_read) (SMgrRelation reln, ForkNumber forknum,
 										  BlockNumber blocknum, char *buffer);
+	int			(*smgr_tryread) (SMgrRelation reln, ForkNumber forknum,
+										  BlockNumber blocknum, char *buffer);
 	void		(*smgr_write) (SMgrRelation reln, ForkNumber forknum,
 						 BlockNumber blocknum, char *buffer, bool skipFsync);
 	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
@@ -66,7 +68,7 @@ typedef struct f_smgr
 static const f_smgr smgrsw[] = {
 	/* magnetic disk */
 	{mdinit, NULL, mdclose, mdcreate, mdexists, mdunlink, mdextend,
-		mdprefetch, mdread, mdwrite, mdnblocks, mdtruncate, mdimmedsync,
+		mdprefetch, mdread, mdtryread, mdwrite, mdnblocks, mdtruncate, mdimmedsync,
 		mdpreckpt, mdsync, mdpostckpt
 	}
 };
@@ -626,6 +628,22 @@ smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	(*(smgrsw[reln->smgr_which].smgr_read)) (reln, forknum, blocknum, buffer);
 }
 
+
+/*
+ *	smgrtryread() -- try to read a particular block from a relation into
+ *				  the supplied buffer.
+ *
+ *		Unlike smgrread(), this does not raise an error if the block does
+ *		not exist on disk; instead it returns the number of bytes actually
+ *		read, which is 0 when the block is past the current end of the file.
+ */
+int
+smgrtryread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		 char *buffer)
+{
+	return (*(smgrsw[reln->smgr_which].smgr_tryread)) (reln, forknum, blocknum, buffer);
+}
+
 /*
  *	smgrwrite() -- Write the supplied buffer out.
  *
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 521ee1c..5f961af 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -40,6 +40,7 @@
 #define BM_CHECKPOINT_NEEDED	(1 << 7)		/* must write for checkpoint */
 #define BM_PERMANENT			(1 << 8)		/* permanent relation (not
 												 * unlogged) */
+#define BM_NEW					(1 << 9)		/* Not guaranteed to exist on disk */
 
 typedef bits16 BufFlags;
 
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ec0a254..b52591f 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -153,6 +153,7 @@ extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 						  ForkNumber forkNum, BlockNumber blockNum,
 						  ReadBufferMode mode, BufferAccessStrategy strategy);
+extern Buffer ExtendRelation(Relation reln, ForkNumber forkNum, BufferAccessStrategy strategy);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 69a624f..07a331c 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -94,6 +94,8 @@ extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
 			 BlockNumber blocknum);
 extern void smgrread(SMgrRelation reln, ForkNumber forknum,
 		 BlockNumber blocknum, char *buffer);
+extern int smgrtryread(SMgrRelation reln, ForkNumber forknum,
+		 BlockNumber blocknum, char *buffer);
 extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
 		  BlockNumber blocknum, char *buffer, bool skipFsync);
 extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
@@ -114,12 +116,15 @@ extern void mdclose(SMgrRelation reln, ForkNumber forknum);
 extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
 extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
-extern void mdextend(SMgrRelation reln, ForkNumber forknum,
+extern void mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void mdappend(SMgrRelation reln, ForkNumber forknum,
 		 BlockNumber blocknum, char *buffer, bool skipFsync);
 extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
 		   BlockNumber blocknum);
 extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	   char *buffer);
+extern int mdtryread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+	   char *buffer);
 extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
 		BlockNumber blocknum, char *buffer, bool skipFsync);
 extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
-- 
2.3.0.149.gf3f4077.dirty

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#19)
Re: Relation extension scalability

Andres Freund <andres@anarazel.de> writes:

So, to get to the actual meat: My goal was to essentially get rid of an
exclusive lock over relation extension altogether. I think I found a
way to do that that addresses the concerns made in this thread.

The new algorithm basically is:
1) Acquire victim buffer, clean it, and mark it as pinned
2) Get the current size of the relation, save it into blockno
3) Try to insert an entry into the buffer table for blockno
4) If the page is already in the buffer table, increment blockno by 1,
goto 3)
5) Try to read the page. In most cases it'll not yet exist. But the page
might concurrently have been written by another backend and removed
from shared buffers already. If already existing, goto 1)
6) Zero out the page on disk.

I think this does handle the concurrency issues.

The need for (5) kind of destroys my faith in this really being safe: it
says there are non-obvious race conditions here.

For instance, what about this scenario:
* Session 1 tries to extend file, allocates buffer for page 42 (so it's
now between steps 4 and 5).
* Session 2 tries to extend file, sees buffer for 42 exists, allocates
buffer for page 43 (so it's also now between 4 and 5).
* Session 2 tries to read page 43, sees it's not there, and writes out
page 43 with zeroes (now it's done).
* Session 1 tries to read page 42, sees it's there and zero-filled
(not because anybody wrote it, but because holes in files read as 0).

At this point session 1 will go and create page 44, won't it, and you
just wasted a page. Now we do have mechanisms for reclaiming such pages
but they may not kick in until VACUUM, so you could end up with a whole
lot of table bloat.

Also, the file is likely to end up badly physically fragmented when the
skipped pages are finally filled in. One of the good things about the
relation extension lock is that the kernel sees the file as being extended
strictly sequentially, which it should handle fairly well as far as
filesystem layout goes. This way might end up creating a mess on-disk.

Perhaps even more to the point, you've added a read() kernel call that was
previously not there at all, without having removed either the lseek() or
the write(). Perhaps that scales better when what you're measuring is
saturation conditions on a many-core machine, but I have a very hard time
believing that it's not a significant net loss under less-contended
conditions.

I'm inclined to think that a better solution in the long run is to keep
the relation extension lock but find a way to extend files more than
one page per lock acquisition.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#22Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#21)
Re: Relation extension scalability

On 2015-07-19 11:28:25 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

So, to get to the actual meat: My goal was to essentially get rid of an
exclusive lock over relation extension altogether. I think I found a
way to do that that addresses the concerns made in this thread.

The new algorithm basically is:
1) Acquire victim buffer, clean it, and mark it as pinned
2) Get the current size of the relation, save it into blockno
3) Try to insert an entry into the buffer table for blockno
4) If the page is already in the buffer table, increment blockno by 1,
goto 3)
5) Try to read the page. In most cases it'll not yet exist. But the page
might concurrently have been written by another backend and removed
from shared buffers already. If already existing, goto 1)
6) Zero out the page on disk.

I think this does handle the concurrency issues.

The need for (5) kind of destroys my faith in this really being safe: it
says there are non-obvious race conditions here.

It's not simple, I agree. I'm doubtful that a significantly simpler
approach exists.

For instance, what about this scenario:
* Session 1 tries to extend file, allocates buffer for page 42 (so it's
now between steps 4 and 5).
* Session 2 tries to extend file, sees buffer for 42 exists, allocates
buffer for page 43 (so it's also now between 4 and 5).
* Session 2 tries to read page 43, sees it's not there, and writes out
page 43 with zeroes (now it's done).
* Session 1 tries to read page 42, sees it's there and zero-filled
(not because anybody wrote it, but because holes in files read as 0).

At this point session 1 will go and create page 44, won't it, and you
just wasted a page.

My local code now recognizes that case and uses the page. We just need
to do a PageIsNew() check.

Also, the file is likely to end up badly physically fragmented when the
skipped pages are finally filled in. One of the good things about the
relation extension lock is that the kernel sees the file as being extended
strictly sequentially, which it should handle fairly well as far as
filesystem layout goes. This way might end up creating a mess on-disk.

I don't think that'll actually happen with any recent
filesystems. Pretty much all of them do delayed allocation. But it
definitely is a concern with older filesystems.

I've just measured, and with ext4 the number of extents per segment in a
300GB relation doesn't show a significant difference between
the existing and the new code.

We could try to address this by optionally using posix_fallocate() to do
the actual extension - then there'll not be sparse regions, but actually
allocated disk blocks.

Perhaps even more to the point, you've added a read() kernel call that was
previously not there at all, without having removed either the lseek() or
the write(). Perhaps that scales better when what you're measuring is
saturation conditions on a many-core machine, but I have a very hard time
believing that it's not a significant net loss under less-contended
conditions.

Yes, this has me worried too.

I'm inclined to think that a better solution in the long run is to keep
the relation extension lock but find a way to extend files more than
one page per lock acquisition.

I doubt that'll help as much. As long as you have to search and write
out buffers under an exclusive lock that'll be painful. You might be
able to make that an infrequent occurrence by extending in larger
amounts and entering the returned pages into the FSM, but you'll have
rather noticeable latency increases every time that happens. And not just
in the extending relation - all the other relations will wait for the
one doing the extending. We could move that into some background
process, but at that point things have gotten seriously complex.

The more radical solution would be to have some place in memory that'd
store the current number of blocks. Then all the extension specific
locking we'd need would be around incrementing that. But how and where
to store that isn't easy.

Greetings,

Andres Freund


#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#22)
Re: Relation extension scalability

Andres Freund <andres@anarazel.de> writes:

On 2015-07-19 11:28:25 -0400, Tom Lane wrote:

At this point session 1 will go and create page 44, won't it, and you
just wasted a page.

My local code now recognizes that case and uses the page. We just need
to do a PageIsNew() check.

Er, what? How can you tell whether an all-zero page was or was not
just written by another session?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#24Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#23)
Re: Relation extension scalability

On 2015-07-19 11:56:47 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On 2015-07-19 11:28:25 -0400, Tom Lane wrote:

At this point session 1 will go and create page 44, won't it, and you
just wasted a page.

My local code now recognizes that case and uses the page. We just need
to do a PageIsNew() check.

Er, what? How can you tell whether an all-zero page was or was not
just written by another session?

The check is only done while holding the io lock on the relevant page
(have to hold that anyway), after reading it in ourselves, just before
setting BM_VALID. As we can only get to that point when there wasn't any
other entry for the page in the buffer table, that guarantees that no
other backend is currently expanding into that page. Others might
wait to read it, but those'll wait behind the IO lock.

The situation the read() protects us against is that two backends try to
extend to the same block, but after one of them succeeded the buffer is
written out and reused for an independent page. So there is no in-memory
state telling the slower backend that that page has already been used.
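
For readers trying to follow this subthread, the combined algorithm (steps 1-6 above plus the PageIsNew() amendment) can be sketched in pseudocode; the helper names here are illustrative, not the actual patch:

```
extend_one_block(rel):
  retry:
    buf = acquire_clean_victim_buffer()       -- step 1: pinned, cleaned if dirty
    blockno = smgrnblocks(rel)                -- step 2: current size, may be stale
    while not buffer_table_insert(rel, blockno, buf):
        blockno = blockno + 1                 -- steps 3/4: block already claimed
    nread = smgrtryread(rel, blockno, page)   -- step 5, holding buf's IO lock
    if nread == BLCKSZ and not PageIsNew(page):
        release(buf)                          -- block was used by another backend
        goto retry                            -- and already evicted; start over
    zero out blockno on disk (smgrextend)     -- step 6
    set BM_VALID on buf
    return buf
```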

Andres


#25Dilip Kumar
dilipbalaut@gmail.com
In reply to: Andres Freund (#24)
Re: Relation extension scalability

On Sun, Jul 19 2015 9:37 PM Andres Wrote,

The situation the read() protects us against is that two backends try to
extend to the same block, but after one of them succeeded the buffer is
written out and reused for an independent page. So there is no in-memory
state telling the slower backend that that page has already been used.

I was looking into this patch, and have done some performance testing.

Currently I have done the testing on my local machine; later I will perform
it on a big machine once I get access to one.

Just wanted to share the current results I get on my local machine.
Machine conf: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz, 8 cores and 16GB of
RAM.

Test Script:
./psql -d postgres -c "COPY (select g.i::text FROM generate_series(1,
10000) g(i)) TO '/tmp/copybinarywide' WITH BINARY";

./psql -d postgres -c "truncate table data"
./psql -d postgres -c "checkpoint"
./pgbench -f copy_script -T 120 -c$ -j$ postgres

Summary of the results:
1. When the data fits into shared buffers the improvement is not visible,
but when it doesn't, some improvement is visible on my local machine (it
still does not seem to scale; maybe we will see different behaviour on the
big machine). That's because in the first case there is no need to flush
buffers out.

2. As per Tom's analysis, since we are doing an extra read it will reduce
performance at lower client counts, where the RelationExtensionLock is not
the bottleneck, and the same is visible in the test results.

As discussed earlier, what about keeping the RelationExtensionLock as it is
and just doing the victim buffer search and buffer flushing outside the
lock? That way we can save the extra read. Correct me if I have missed
something here.
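
That alternative would keep the existing lock but shrink the critical section to just the nblocks/extend pair, roughly (illustrative pseudocode only, not a patch):

```
RelationExtend(rel):
    buf = acquire_clean_victim_buffer()      -- search + any flush, no lock held
    LockRelationForExtension(rel, ExclusiveLock)
    blockno = smgrnblocks(rel)
    smgrextend(rel, blockno, zeroed_page)    -- only this part is serialized
    UnlockRelationForExtension(rel, ExclusiveLock)
    install buf in the buffer table as (rel, blockno)
    return buf
```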

Shared Buffer 512 MB
-----------------------------
Clients    TPS Base    TPS Patch
1          145         126
2          211         246
4          248         302
8          225         234

Shared Buffer 5GB
-----------------------------
Clients    TPS Base    TPS Patch
1          165         156
2          237         244
4          294         296
8          253         247

Also observed one problem with the patch:

@@ -433,63 +434,50 @@ RelationGetBufferForTuple(Relation relation, Size len,
+ while(true)
+ {
+ buffer = ExtendRelation(relation, MAIN_FORKNUM, bistate->strategy);

bistate can be NULL in the case of a direct insert rather than COPY.

Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


#26Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#25)
1 attachment(s)
Re: Relation extension scalability

On Fri, Dec 18, 2015 at 10:51 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sun, Jul 19 2015 9:37 PM Andres Wrote,

The situation the read() protects us against is that two backends try to
extend to the same block, but after one of them succeeded the buffer is
written out and reused for an independent page. So there is no in-memory
state telling the slower backend that that page has already been used.

I was looking into this patch, and have done some performance testing.

Currently I have done the testing on my local machine; later I will perform
it on a big machine once I get access to one.

Just wanted to share the current results I get on my local machine.
Machine conf: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz, 8 cores and 16GB
of RAM.

Test Script:
./psql -d postgres -c "COPY (select g.i::text FROM generate_series(1,
10000) g(i)) TO '/tmp/copybinarywide' WITH BINARY";

./psql -d postgres -c "truncate table data"
./psql -d postgres -c "checkpoint"
./pgbench -f copy_script -T 120 -c$ -j$ postgres

This time I have done some testing on a big machine with 64 physical cores @
2.13GHz and 50GB of RAM.

This is a performance comparison of the base code, the
extension-lock-free patch given by Andres, and
the multi-extend patch (which extends by multiple blocks at a time, based on
a configuration parameter).

Problem Analysis:
------------------------
1. With the base code, when I tried to observe the problem using perf and
other methods (gdb), I found that the RelationExtensionLock is the main
bottleneck.
2. Then, after applying the RelationExtensionLock-free patch, I observed
that the contention is now on FileWrite (all backends are trying to extend
the file).

Performance Summary and Analysis:
------------------------------------------------
1. In my performance results, multi-extend showed the best performance and
scalability.
2. I think that by extending multiple blocks at a time we solve both
problems (the extension lock and parallel file writes).
3. After extending each block it is immediately added to the FSM, so in
most cases other backends can find it directly without taking the extension
lock.

Currently the patch is at an initial stage; I have only tested performance
and passed the regression test suite.

Open problems
-----------------------------
1. After extending a page we add it directly to the FSM, so if vacuum finds
this page as new it will give a WARNING.
2. In RelationGetBufferForTuple, when PageIsNew we do PageInit; the same
needs to be considered for the index cases.

Test Script:
-------------------------
./psql -d postgres -c "COPY (select g.i::text FROM generate_series(1,
10000) g(i)) TO '/tmp/copybinarywide' WITH BINARY";

./psql -d postgres -c "truncate table data"
./psql -d postgres -c "checkpoint"
./pgbench -f copy_script -T 120 -c$ -j$ postgres

Performance Data:
--------------------------
There are three code bases compared:
1. Base code

2. Lock-free patch: the patch given in the thread below
/messages/by-id/20150719140746.GH25610@awork2.anarazel.de

3. Multi-extend patch, attached to this mail.
extend_num_pages: this is a new config parameter telling how many extra
pages to extend in the case of a normal extend.
Maybe it will give more control to the user if we make it a relation
property.

I will work on the patch for this CF, so adding it to CF.

Shared Buffer 48 GB

Clients   Base (TPS)   Lock Free Patch   Multi-extend (extend_num_pages=5)
1         142          138               148
2         251          253               280
4         237          416               464
8         168          491               575
16        141          448               404
32        122          337               332

Shared Buffer 64 MB

Clients   Base (TPS)   Multi-extend (extend_num_pages=5)
1         140          148
2         252          266
4         229          437
8         153          475
16        132          364

Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

multi_extend_v1.patch (text/x-patch; charset=US-ASCII) [Download]
*** a/src/backend/access/brin/brin_pageops.c
--- b/src/backend/access/brin/brin_pageops.c
***************
*** 771,776 **** brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
--- 771,781 ----
  			UnlockRelationForExtension(irel, ExclusiveLock);
  
  		page = BufferGetPage(buf);
+ 		if (PageIsNew(page))
+ 		{
+ 			MarkBufferDirty(buf);
+ 			PageInit(page, BufferGetPageSize(buf), 0);
+ 		}
  
  		/*
  		 * We have a new buffer to insert into.  Check that the new page has
*** a/src/backend/access/heap/hio.c
--- b/src/backend/access/heap/hio.c
***************
*** 393,398 **** RelationGetBufferForTuple(Relation relation, Size len,
--- 393,404 ----
  		 * we're done.
  		 */
  		page = BufferGetPage(buffer);
+ 		if (PageIsNew(page))
+ 		{
+ 			MarkBufferDirty(buffer);
+ 			PageInit(page, BufferGetPageSize(buffer), 0);
+ 		}
+ 
  		pageFreeSpace = PageGetHeapFreeSpace(page);
  		if (len + saveFreeSpace <= pageFreeSpace)
  		{
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
***************
*** 90,95 **** int			effective_io_concurrency = 0;
--- 90,96 ----
   * effective_io_concurrency parameter set.
   */
  int			target_prefetch_pages = 0;
+ int			extend_num_pages = 0;
  
  /* local state for StartBufferIO and related functions */
  static BufferDesc *InProgressBuf = NULL;
***************
*** 394,400 **** ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
  static Buffer ReadBuffer_common(SMgrRelation reln, char relpersistence,
  				  ForkNumber forkNum, BlockNumber blockNum,
  				  ReadBufferMode mode, BufferAccessStrategy strategy,
! 				  bool *hit);
  static bool PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy);
  static void PinBuffer_Locked(BufferDesc *buf);
  static void UnpinBuffer(BufferDesc *buf, bool fixOwner);
--- 395,401 ----
  static Buffer ReadBuffer_common(SMgrRelation reln, char relpersistence,
  				  ForkNumber forkNum, BlockNumber blockNum,
  				  ReadBufferMode mode, BufferAccessStrategy strategy,
! 				  bool *hit, Relation rel);
  static bool PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy);
  static void PinBuffer_Locked(BufferDesc *buf);
  static void UnpinBuffer(BufferDesc *buf, bool fixOwner);
***************
*** 621,627 **** ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
  	 */
  	pgstat_count_buffer_read(reln);
  	buf = ReadBuffer_common(reln->rd_smgr, reln->rd_rel->relpersistence,
! 							forkNum, blockNum, mode, strategy, &hit);
  	if (hit)
  		pgstat_count_buffer_hit(reln);
  	return buf;
--- 622,628 ----
  	 */
  	pgstat_count_buffer_read(reln);
  	buf = ReadBuffer_common(reln->rd_smgr, reln->rd_rel->relpersistence,
! 							forkNum, blockNum, mode, strategy, &hit, reln);
  	if (hit)
  		pgstat_count_buffer_hit(reln);
  	return buf;
***************
*** 649,655 **** ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
  	Assert(InRecovery);
  
  	return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
! 							 mode, strategy, &hit);
  }
  
  
--- 650,656 ----
  	Assert(InRecovery);
  
  	return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
! 							 mode, strategy, &hit, NULL);
  }
  
  
***************
*** 661,667 **** ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
  static Buffer
  ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  				  BlockNumber blockNum, ReadBufferMode mode,
! 				  BufferAccessStrategy strategy, bool *hit)
  {
  	BufferDesc *bufHdr;
  	Block		bufBlock;
--- 662,668 ----
  static Buffer
  ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  				  BlockNumber blockNum, ReadBufferMode mode,
! 				  BufferAccessStrategy strategy, bool *hit, Relation rel)
  {
  	BufferDesc *bufHdr;
  	Block		bufBlock;
***************
*** 685,691 **** ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
--- 686,695 ----
  
  	/* Substitute proper block number if caller asked for P_NEW */
  	if (isExtend)
+ 	{
  		blockNum = smgrnblocks(smgr, forkNum);
+ 		//blockNum += extend_num_pages;
+ 	}
  
  	if (isLocalBuf)
  	{
***************
*** 814,823 **** ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
--- 818,836 ----
  
  	if (isExtend)
  	{
+ 		int blkCount = 0;
+ 
  		/* new buffers are zero-filled */
  		MemSet((char *) bufBlock, 0, BLCKSZ);
  		/* don't set checksum for all-zero page */
  		smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);
+ 
+ 		while (blkCount < extend_num_pages)
+ 		{
+ 			blkCount++;
+ 			smgrextend(smgr, forkNum, blockNum+blkCount, (char *) bufBlock, false);
+ 			RecordPageWithFreeSpace(rel, blockNum+blkCount, 8126);
+ 		}
  	}
  	else
  	{
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 2683,2688 **** static struct config_int ConfigureNamesInt[] =
--- 2683,2698 ----
  		NULL, NULL, NULL
  	},
  
+ 	{
+ 		{"extend_num_pages", PGC_SUSET, RESOURCES_ASYNCHRONOUS,
+ 			gettext_noop("Sets the Number of pages to extended at one time."),
+ 			NULL
+ 		},
+ 		&extend_num_pages,
+ 		0, 0, 100,
+ 		NULL, NULL, NULL
+ 	},
+ 
  	/* End-of-list marker */
  	{
  		{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 139,144 ****
--- 139,146 ----
  
  #temp_file_limit = -1			# limits per-session temp file space
  					# in kB, or -1 for no limit
+ #extend_num_pages = 0			# number of extra pages allocate during extend
+ 					# min 0 max 100 pages
  
  # - Kernel Resource Usage -
  
*** a/src/include/storage/bufmgr.h
--- b/src/include/storage/bufmgr.h
***************
*** 60,65 **** extern PGDLLIMPORT char *BufferBlocks;
--- 60,66 ----
  
  /* in guc.c */
  extern int	effective_io_concurrency;
+ extern int	extend_num_pages;
  
  /* in localbuf.c */
  extern PGDLLIMPORT int NLocBuffer;
#27Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#26)
Re: Relation extension scalability

On Thu, Dec 31, 2015 at 6:22 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Dec 18, 2015 at 10:51 AM, Dilip Kumar <dilipbalaut@gmail.com>
wrote:

On Sun, Jul 19 2015 9:37 PM Andres Wrote,

The situation the read() protects us against is that two backends try to
extend to the same block, but after one of them succeeded the buffer is
written out and reused for an independent page. So there is no in-memory
state telling the slower backend that that page has already been used.

I was looking into this patch, and have done some performance testing.

Currently I have done the testing on my local machine; later I will perform
it on a big machine once I get access to one.

Just wanted to share the current results I get on my local machine.
Machine conf: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz, 8 cores and 16GB
of RAM.

Test Script:
./psql -d postgres -c "COPY (select g.i::text FROM generate_series(1,
10000) g(i)) TO '/tmp/copybinarywide' WITH BINARY";

./psql -d postgres -c "truncate table data"
./psql -d postgres -c "checkpoint"
./pgbench -f copy_script -T 120 -c$ -j$ postgres

This time I have done some testing on a big machine with 64 physical cores
@ 2.13GHz and 50GB of RAM.

This is a performance comparison of the base code, the
extension-lock-free patch given by Andres, and
the multi-extend patch (which extends by multiple blocks at a time, based
on a configuration parameter).

Problem Analysis:
------------------------
1. With the base code, when I tried to observe the problem using perf and
other methods (gdb), I found that the RelationExtensionLock is the main
bottleneck.
2. Then, after applying the RelationExtensionLock-free patch, I observed
that the contention is now on FileWrite (all backends are trying to extend
the file).

Performance Summary and Analysis:
------------------------------------------------
1. In my performance results, multi-extend showed the best performance and
scalability.
2. I think that by extending multiple blocks at a time we solve both
problems (the extension lock and parallel file writes).
3. After extending each block it is immediately added to the FSM, so in
most cases other backends can find it directly without taking the extension
lock.

Currently the patch is at an initial stage; I have only tested performance
and passed the regression test suite.

Open problems
-----------------------------
1. After extending a page we add it directly to the FSM, so if vacuum finds
this page as new it will give a WARNING.
2. In RelationGetBufferForTuple, when PageIsNew we do PageInit; the same
needs to be considered for the index cases.

Test Script:
-------------------------
./psql -d postgres -c "COPY (select g.i::text FROM generate_series(1,
10000) g(i)) TO '/tmp/copybinarywide' WITH BINARY";

./psql -d postgres -c "truncate table data"
./psql -d postgres -c "checkpoint"
./pgbench -f copy_script -T 120 -c$ -j$ postgres

Performance Data:
--------------------------
There are three code bases compared:
1. Base code

2. Lock-free patch: the patch given in the thread below
/messages/by-id/20150719140746.GH25610@awork2.anarazel.de

3. Multi-extend patch, attached to this mail.
extend_num_pages: this is a new config parameter telling how many extra
pages to extend in the case of a normal extend.
Maybe it will give more control to the user if we make it a relation
property.

I will work on the patch for this CF, so adding it to CF.

Shared Buffer 48 GB

Clients   Base (TPS)   Lock Free Patch   Multi-extend (extend_num_pages=5)
1         142          138               148
2         251          253               280
4         237          416               464
8         168          491               575
16        141          448               404
32        122          337               332

Shared Buffer 64 MB

Clients   Base (TPS)   Multi-extend (extend_num_pages=5)
1         140          148
2         252          266
4         229          437
8         153          475
16        132          364

I'm not really sure what this email is trying to say.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#28Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#27)
Re: Relation extension scalability

On Thu, Jan 7, 2016 at 1:36 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Dec 31, 2015 at 6:22 AM, Dilip Kumar <dilipbalaut@gmail.com>
wrote:

*Performance Data:*
--------------------------
*There are three code bases for performance comparison:*
1. Base Code

2. Lock Free Patch : the patch given in the thread below
*/messages/by-id/20150719140746.GH25610@awork2.anarazel.de*

3. Multi-extend patch attached in this mail.
*extend_num_pages:* This is a new config parameter to tell how many
extra pages to extend in case of a normal extend.
Maybe it will give more control to the user if we make it a relation property.

I will work on the patch for this CF, so adding it to CF.

*Shared Buffer 48 GB*

Clients  Base (TPS)  Lock Free Patch  Multi-extend (extend_num_pages=5)
1        142         138              148
2        251         253              280
4        237         416              464
8        168         491              575
16       141         448              404
32       122         337              332

*Shared Buffer 64 MB*

*Clients* *Base (TPS)* *Multi-extend (extend_num_pages=5)*
1 140 148
2 252 266
4 229 437
8 153 475
16 132 364

I'm not really sure what this email is trying to say.

What I could understand from the above e-mail is that Dilip has tried to
extend the relation multiple-pages-at-a-time and observed that it gives
comparable or better performance compared to Andres's idea of
lock-free extension, and it doesn't regress the low-thread-count case.

Now, I think the point to discuss here is that there could be multiple ways
of extending a relation multiple-pages-at-a-time, like:

a. Extend the relation page by page and add it to the FSM without initializing
it. I think this is what the current patch of Dilip seems to be doing. If we
want to go via this route, then we need to ensure that whenever we get
the page from the FSM, if it is empty and not initialised, then initialise it.
b. Extend the relation page by page, initialize it and add it to the FSM.
c. Extend the relation *n* pages at a time (in mdextend, have a provision
to do FILEWRITE for multiples of BLCKSZ). Here again, we need to
evaluate which is the better way to add them to the FSM (after page
initialization or before page initialization).
d. Use some form of group extension, which means only one backend
will extend the relation and others will piggyback their extension requests
onto that backend and wait for the extension.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#29Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#28)
Re: Relation extension scalability

On 2016-01-07 16:48:53 +0530, Amit Kapila wrote:

What I could understand from the above e-mail is that Dilip has tried to
extend the relation multiple-pages-at-a-time and observed that it gives
comparable or better performance compared to Andres's idea of
lock-free extension, and it doesn't regress the low-thread-count case.

I think it's a worthwhile approach to pursue. But until it actually
fixes the problem of leaving around uninitialized pages I don't think
it's very meaningful to do performance comparisons.

Now, I think the point to discuss here is that there could be multiple ways
of extending a relation multiple-pages-at-a-time, like:

a. Extend the relation page by page and add it to the FSM without initializing
it. I think this is what the current patch of Dilip seems to be doing. If we
want to go via this route, then we need to ensure that whenever we get
the page from the FSM, if it is empty and not initialised, then initialise
it.

I think that's pretty much unacceptable, for the non-error path at
least.

One very simple, Linux-specific approach would be to simply do
fallocate(FALLOC_FL_KEEP_SIZE) to extend the file; that way space is
pre-allocated, but not yet counted as part of the file's length.

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#30Dilip Kumar
dilipbalaut@gmail.com
In reply to: Andres Freund (#29)
1 attachment(s)
Re: Relation extension scalability

On Thu, Jan 7, 2016 at 4:53 PM, Andres Freund <andres@anarazel.de> wrote:

On 2016-01-07 16:48:53 +0530, Amit Kapila wrote:

I think it's a worthwhile approach to pursue. But until it actually
fixes the problem of leaving around uninitialized pages I don't think
it's very meaningful to do performance comparisons.

The attached patch solves this issue: I am allocating the buffer for each
page and initializing the page, and only after that adding it to the FSM.

a. Extend the relation page by page and add it to the FSM without
initializing it. I think this is what the current patch of Dilip seems
to be doing. If we

I think that's pretty much unacceptable, for the non-error path at
least.

Performance results:
----------------------------
Test Case:
------------
./psql -d postgres -c "COPY (select g.i::text FROM generate_series(1,
10000) g(i)) TO '/tmp/copybinary' WITH BINARY";

echo COPY data from '/tmp/copybinary' WITH BINARY; > copy_script

./psql -d postgres -c "truncate table data"
./psql -d postgres -c "checkpoint"
./pgbench -f copy_script -T 120 -c$ -j$ postgres

Test Summary:
--------------------
1. I have measured the performance of base and patch.
2. With the patch there are multiple results, with different values of
"extend_num_pages" (the parameter which says how many extra blocks to extend).

Test with Data on magnetic Disk and WAL on SSD
--------------------------------------------------------------------
Shared Buffer : 48GB
max_wal_size : 10GB
Storage : Magnetic Disk
WAL : SSD

TPS with different values of extend_num_pages

------------------------------------------------------------

Client Base 10-Page 20-Page 50-Page

1 105 103 157 129
2 217 219 255 288
4 210 421 494 486
8 166 605 702 701
16 145 484 563 686
32 124 477 480 745

Test with Data and WAL on SSD
-----------------------------------------------

Shared Buffer : 48GB
Max Wal Size : 10GB
Storage : SSD

TPS with different values of extend_num_pages

------------------------------------------------------------

Client Base 10-Page 20-Page 50-Page 100-Page

1 152 153 155 147 157
2 281 281 292 275 287
4 236 505 502 508 514
8 171 662 687 767 764
16 145 527 639 826 907

Note: Test with both data and WAL on magnetic disk: no significant
improvement visible
-- I think WAL write is becoming the bottleneck in this case.

Currently I have kept extend_num_pages as a session-level parameter, but I
think later we can make this a table property.
Any suggestion on this?

Apart from this approach, I also tried extending the file by multiple blocks
in one extend call, but this approach (extending one by one) is performing
better.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

multi_extend_v2.patchtext/x-diff; charset=US-ASCII; name=multi_extend_v2.patchDownload
*** a/src/backend/access/heap/hio.c
--- b/src/backend/access/heap/hio.c
***************
*** 24,29 ****
--- 24,30 ----
  #include "storage/lmgr.h"
  #include "storage/smgr.h"
  
+ int extend_num_pages = 0;
  
  /*
   * RelationPutHeapTuple - place tuple at specified page
***************
*** 238,243 **** RelationGetBufferForTuple(Relation relation, Size len,
--- 239,245 ----
  	BlockNumber targetBlock,
  				otherBlock;
  	bool		needLock;
+ 	int			totalBlocks;
  
  	len = MAXALIGN(len);		/* be conservative */
  
***************
*** 449,467 **** RelationGetBufferForTuple(Relation relation, Size len,
  	 * it worth keeping an accurate file length in shared memory someplace,
  	 * rather than relying on the kernel to do it for us?
  	 */
! 	buffer = ReadBufferBI(relation, P_NEW, bistate);
! 
! 	/*
! 	 * We can be certain that locking the otherBuffer first is OK, since it
! 	 * must have a lower page number.
! 	 */
! 	if (otherBuffer != InvalidBuffer)
! 		LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
! 
! 	/*
! 	 * Now acquire lock on the new page.
! 	 */
! 	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
  
  	/*
  	 * Release the file-extension lock; it's now OK for someone else to extend
--- 451,491 ----
  	 * it worth keeping an accurate file length in shared memory someplace,
  	 * rather than relying on the kernel to do it for us?
  	 */
! 
! 	totalBlocks = extend_num_pages;
! 
! 	do {
! 
! 
! 		buffer = ReadBufferBI(relation, P_NEW, bistate);
! 
! 		/*
! 		 * We can be certain that locking the otherBuffer first is OK, since it
! 		 * must have a lower page number.
! 		 */
! 		if ((otherBuffer != InvalidBuffer) && !totalBlocks)
! 			LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
! 
! 		/*
! 		 * Now acquire lock on the new page.
! 		 */
! 		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
! 
! 		if (totalBlocks)
! 		{
! 			Page page;
! 			Size freespace;
! 
! 			page = BufferGetPage(buffer);
! 			PageInit(page, BufferGetPageSize(buf), 0);
! 
! 			freespace = PageGetHeapFreeSpace(page);
! 			MarkBufferDirty(buffer);
! 			UnlockReleaseBuffer(buffer);
! 			RecordPageWithFreeSpace(relation, BufferGetBlockNumber(buffer), freespace);
! 		}
! 
! 	}while (totalBlocks--);
  
  	/*
  	 * Release the file-extension lock; it's now OK for someone else to extend
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 31,36 ****
--- 31,37 ----
  #include "access/transam.h"
  #include "access/twophase.h"
  #include "access/xact.h"
+ #include "access/hio.h"
  #include "catalog/namespace.h"
  #include "commands/async.h"
  #include "commands/prepare.h"
***************
*** 2683,2688 **** static struct config_int ConfigureNamesInt[] =
--- 2684,2699 ----
  		NULL, NULL, NULL
  	},
  
+ 	{
+ 		{"extend_num_pages", PGC_USERSET, UNGROUPED,
+ 			gettext_noop("Sets the Number of pages to extended at one time."),
+ 			NULL
+ 		},
+ 		&extend_num_pages,
+ 		0, 0, 100,
+ 		NULL, NULL, NULL
+ 	},
+ 
  	/* End-of-list marker */
  	{
  		{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 139,144 ****
--- 139,146 ----
  
  #temp_file_limit = -1			# limits per-session temp file space
  					# in kB, or -1 for no limit
+ #extend_num_pages = 0			# number of extra pages allocate during extend
+ 					# min 0 max 100 pages
  
  # - Kernel Resource Usage -
  
*** a/src/include/access/hio.h
--- b/src/include/access/hio.h
***************
*** 19,25 ****
  #include "utils/relcache.h"
  #include "storage/buf.h"
  
! 
  /*
   * state for bulk inserts --- private to heapam.c and hio.c
   *
--- 19,25 ----
  #include "utils/relcache.h"
  #include "storage/buf.h"
  
! extern int extend_num_pages;
  /*
   * state for bulk inserts --- private to heapam.c and hio.c
   *
#31Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#30)
Re: Relation extension scalability

On Tue, Jan 12, 2016 at 2:41 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Jan 7, 2016 at 4:53 PM, Andres Freund <andres@anarazel.de> wrote:

On 2016-01-07 16:48:53 +0530, Amit Kapila wrote:

I think it's a worthwhile approach to pursue. But until it actually
fixes the problem of leaving around uninitialized pages I don't think
it's very meaningful to do performance comparisons.

Attached patch solves this issue, I am allocating the buffer for each page
and initializing the page, only after that adding to FSM.

Few comments about patch:

1.
Patch is not getting compiled.

1>src/backend/access/heap/hio.c(480): error C2065: 'buf' : undeclared
identifier
1>src/backend/access/heap/hio.c(480): error C2065: 'buf' : undeclared
identifier
1>src/backend/access/heap/hio.c(480): error C2065: 'buf' : undeclared
identifier

2.
! page = BufferGetPage(buffer);
! PageInit(page, BufferGetPageSize(buf), 0);
!
! freespace = PageGetHeapFreeSpace(page);
! MarkBufferDirty(buffer);
! UnlockReleaseBuffer(buffer);
! RecordPageWithFreeSpace(relation, BufferGetBlockNumber(buffer), freespace);

What is the need to mark the page dirty here? Won't it automatically
be marked dirty once the page is used? I think it is required
if you wish to WAL-log this action.

3. I think you don't need to multi-extend a relation if
HEAP_INSERT_SKIP_FSM is used, as in that case it anyway tries to
get a new page by extending the relation.

4. Again, why do you need this multi-extend optimization for local
relations (those only accessible to the current backend)?

5. Do we need this for nbtree as well? One way to check that
is by copying large data into a table having an index.

Note: Test with both data and WAL on magnetic disk: no significant
improvement visible
-- I think WAL write is becoming the bottleneck in this case.

In that case, can you try the same test with un-logged tables?

Also, it is good to check the performance of the patch with a read-write
workload to ensure that extending the relation in multiple chunks does not
regress such cases.

Currently I have kept extend_num_pages as a session-level parameter, but I
think later we can make this a table property.
Any suggestion on this?

I think we should have a new storage parameter at the table level,
extend_by_blocks or something like that, instead of a GUC. The
default value of this parameter should be 1, which means retain the
current behaviour by default.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#32Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#31)
Re: Relation extension scalability

On Sat, Jan 23, 2016 at 12:19 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Tue, Jan 12, 2016 at 2:41 PM, Dilip Kumar <dilipbalaut@gmail.com>
wrote:

On Thu, Jan 7, 2016 at 4:53 PM, Andres Freund <andres@anarazel.de> wrote:

On 2016-01-07 16:48:53 +0530, Amit Kapila wrote:

I think it's a worthwhile approach to pursue. But until it actually
fixes the problem of leaving around uninitialized pages I don't think
it's very meaningful to do performance comparisons.

Attached patch solves this issue, I am allocating the buffer for each
page and initializing the page, only after that adding to FSM.

Few comments about patch:

I found one more problem with patch.

! UnlockReleaseBuffer(buffer);
! RecordPageWithFreeSpace(relation, BufferGetBlockNumber(buffer),
freespace);

You can't call BufferGetBlockNumber(buffer) after releasing
the pin on the buffer, which will be released by
UnlockReleaseBuffer(). Get the block number before unlocking
the buffer.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#33Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#31)
Re: Relation extension scalability

On Sat, Jan 23, 2016 at 12:19 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote

Few comments about patch:

Thanks for reviewing..

1.
Patch is not getting compiled.

1>src/backend/access/heap/hio.c(480): error C2065: 'buf' : undeclared
identifier
1>src/backend/access/heap/hio.c(480): error C2065: 'buf' : undeclared
identifier
1>src/backend/access/heap/hio.c(480): error C2065: 'buf' : undeclared
identifier

Oh, my mistake; my preprocessor is ignoring this error and replacing it with
BLKSIZE.

I will fix it in the next version of the patch.

2.
! page = BufferGetPage(buffer);
! PageInit(page, BufferGetPageSize(buf), 0);
!
! freespace = PageGetHeapFreeSpace(page);
! MarkBufferDirty(buffer);
! UnlockReleaseBuffer(buffer);
! RecordPageWithFreeSpace(relation, BufferGetBlockNumber(buffer), freespace);

What is the need to mark the page dirty here? Won't it automatically
be marked dirty once the page is used? I think it is required
if you wish to WAL-log this action.

These pages are not going to be used immediately, and we have done PageInit,
so I think a page should be marked dirty before adding it to the FSM, so
that if the buffer gets replaced out, the init data is flushed.

3. I think you don't need to multi-extend a relation if
HEAP_INSERT_SKIP_FSM is used, as for that case it anyways try to
get a new page by extending a relation.

Yes, if HEAP_INSERT_SKIP_FSM is used and we use multi-extend, then at least
in the current transaction it will not take pages from the FSM and every
time it will do a multi-extend; the pages will be used if there are parallel
backends, but it is still not a good idea for the current backend to extend
in multiple chunks every time.

So I will change this..

4. Again why do you need this multi-extend optimization for local

relations (those only accessible to current backend)?

I think we can change this while adding the table-level "extend_by_blocks":
for local tables we will not allow this property, so no need to change at
this place.

What do you think?

5. Do we need this for nbtree as well? One way to check that

is by Copying large data in table having index.

Ok, i will try this test and update.

Note: Test with both data and WAL on magnetic disk: no significant
improvement visible
-- I think WAL write is becoming the bottleneck in this case.

In that case, can you try the same test with un-logged tables?

OK

Also, it is good to check the performance of the patch with a read-write
workload to ensure that extending the relation in multiple chunks does not
regress such cases.

Ok

Currently I have kept extend_num_pages as a session-level parameter, but I
think later we can make this a table property.
Any suggestion on this?

I think we should have a new storage parameter at the table level,
extend_by_blocks or something like that, instead of a GUC. The
default value of this parameter should be 1, which means retain the
current behaviour by default.

+1

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#34Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#32)
Re: Relation extension scalability

On Sat, Jan 23, 2016 at 4:28 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

I found one more problem with patch.

! UnlockReleaseBuffer(buffer);
! RecordPageWithFreeSpace(relation, BufferGetBlockNumber(buffer),
freespace);

You can't call BufferGetBlockNumber(buffer) after releasing
the pin on the buffer, which will be released by
UnlockReleaseBuffer(). Get the block number before unlocking
the buffer.

Good catch, will fix this also in next version.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#35Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#33)
1 attachment(s)
Re: Relation extension scalability

On Mon, Jan 25, 2016 at 11:59 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

1.

Patch is not getting compiled.

1>src/backend/access/heap/hio.c(480): error C2065: 'buf' : undeclared
identifier

Oh, my mistake; my preprocessor is ignoring this error and replacing it
with BLKSIZE.

FIXED

3. I think you don't need to multi-extend a relation if

HEAP_INSERT_SKIP_FSM is used, as for that case it anyways try to
get a new page by extending a relation.

So i will change this..

FIXED

4. Again why do you need this multi-extend optimization for local

relations (those only accessible to current backend)?

I think we can change this while adding the table-level
"extend_by_blocks": for local tables we will not allow this property, so no
need to change at this place.
What do you think?

Now I have added a table-level parameter for specifying the number of blocks.
So do you think that we still need to block it, given that the user can
control it? Moreover, I think if a user wants to use it for a local table
then there is no harm in it; at least by extending in one shot we avoid
multiple acquisitions of the extension lock, though there will be no
contention.

What is your opinion?

5. Do we need this for nbtree as well? One way to check that

is by Copying large data in table having index.

Ok, i will try this test and update.

I tried to load data into a table with an index and tried to analyze the
bottleneck using perf, and found btcompare was taking the maximum time;
still, I don't deny that it could benefit from multi-extend.

So I tried a quick POC for this, but I realized that even though we extend
multiple pages for the index and add them to the FSM, the FSM will be
updated only in the current page. Information about the new free pages will
be propagated to the root page only during vacuum, and unlike heap, btree
always searches the FSM from the root, so it will not find the extra added
pages.

So I think we can analyze this topic separately for index multi-extend and
find whether there are cases where index multi-extend can give a benefit.

Note: Test with both data and WAL on Magnetic Disk : No significant

improvement visible
-- I think wall write is becoming bottleneck in this case.

In that case, can you try the same test with un-logged tables?

Data with un-logged table

Test Init:
------------
./psql -d postgres -c "COPY (select g.i::text FROM generate_series(1,
10000) g(i)) TO '/tmp/copybinary' WITH BINARY";
echo COPY data from '/tmp/copybinary' WITH BINARY; > copy_script
./psql -d postgres -c "create unlogged table data (data text)" --> base
./psql -d postgres -c "create unlogged table data (data text)
with(extend_by_blocks=50)" --patch

test_script:
------------
./psql -d postgres -c "truncate table data"
./psql -d postgres -c "checkpoint"
./pgbench -f copy_script -T 120 -c$ -j$ postgres

Shared Buffer: 48GB
Table: Unlogged Table
./pgbench -c$ -j$ -f copy_script -M Prepared postgres

Clients Base patch
1 178 180
2 337 338
4 265 601
8 167 805

Also, it is good to check the performance of patch with read-write work

load to ensure that extending relation in multiple-chunks should not
regress such cases.

Ok

I did not find any regression in the normal case.
Note: I tested it with the previous patch, extend_num_pages=10 (GUC
parameter), so that we can see any impact on the overall system.

Currently i have kept extend_num_page as session level parameter but i

think later we can make this as table property.
Any suggestion on this ?

I think we should have a new storage_parameter at table level
extend_by_blocks or something like that instead of GUC. The
default value of this parameter should be 1 which means retain
current behaviour by default.

+1

Changed it to a table-level storage parameter. I kept the max limit at 100;
any suggestion on this? Should we increase it?

The latest patch is attached..

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

multi_extend_v3.patchapplication/x-patch; name=multi_extend_v3.patchDownload
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 86b9ae1..76f9a21 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -268,6 +268,16 @@ static relopt_int intRelOpts[] =
 #endif
 	},
 
+	{
+		{
+			"extend_by_blocks",
+			"Number of blocks to be added to relation in every extend call",
+			RELOPT_KIND_HEAP,
+			AccessExclusiveLock
+		},
+		1, 1, 100
+	},
+
 	/* list terminator */
 	{{NULL}}
 };
@@ -1291,7 +1301,9 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		{"autovacuum_analyze_scale_factor", RELOPT_TYPE_REAL,
 		offsetof(StdRdOptions, autovacuum) +offsetof(AutoVacOpts, analyze_scale_factor)},
 		{"user_catalog_table", RELOPT_TYPE_BOOL,
-		offsetof(StdRdOptions, user_catalog_table)}
+		offsetof(StdRdOptions, user_catalog_table)},
+		{"extend_by_blocks", RELOPT_TYPE_INT,
+		offsetof(StdRdOptions, extend_by_blocks)}
 	};
 
 	options = parseRelOptions(reloptions, validate, kind, &numoptions);
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 8140418..ec430fc 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -238,6 +238,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	BlockNumber targetBlock,
 				otherBlock;
 	bool		needLock;
+	int			extraBlocks;
 
 	len = MAXALIGN(len);		/* be conservative */
 
@@ -443,25 +444,49 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	if (needLock)
 		LockRelationForExtension(relation, ExclusiveLock);
 
+	if (use_fsm)
+		extraBlocks = RelationGetExtendBlocks(relation) -1;
+	else
+		extraBlocks = 0;
 	/*
 	 * XXX This does an lseek - rather expensive - but at the moment it is the
 	 * only way to accurately determine how many blocks are in a relation.  Is
 	 * it worth keeping an accurate file length in shared memory someplace,
 	 * rather than relying on the kernel to do it for us?
 	 */
-	buffer = ReadBufferBI(relation, P_NEW, bistate);
 
-	/*
-	 * We can be certain that locking the otherBuffer first is OK, since it
-	 * must have a lower page number.
-	 */
-	if (otherBuffer != InvalidBuffer)
-		LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
+	do {
+		buffer = ReadBufferBI(relation, P_NEW, bistate);
 
-	/*
-	 * Now acquire lock on the new page.
-	 */
-	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		/*
+		 * We can be certain that locking the otherBuffer first is OK, since it
+		 * must have a lower page number.
+		 */
+		if ((otherBuffer != InvalidBuffer) && !extraBlocks)
+			LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		/*
+		 * Now acquire lock on the new page.
+		 */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+		if (extraBlocks)
+		{
+			Page page;
+			Size freespace;
+			BlockNumber blockNum;
+
+			page = BufferGetPage(buffer);
+			PageInit(page, BufferGetPageSize(buffer), 0);
+
+			freespace = PageGetHeapFreeSpace(page);
+			MarkBufferDirty(buffer);
+			blockNum = BufferGetBlockNumber(buffer);
+			UnlockReleaseBuffer(buffer);
+			RecordPageWithFreeSpace(relation, blockNum, freespace);
+		}
+
+	}while (extraBlocks--);
 
 	/*
 	 * Release the file-extension lock; it's now OK for someone else to extend
diff --git a/src/include/access/hio.h b/src/include/access/hio.h
index a174b34..40c3941 100644
--- a/src/include/access/hio.h
+++ b/src/include/access/hio.h
@@ -19,7 +19,6 @@
 #include "utils/relcache.h"
 #include "storage/buf.h"
 
-
 /*
  * state for bulk inserts --- private to heapam.c and hio.c
  *
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index f2bebf2..26f6b8e 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -203,6 +203,7 @@ typedef struct StdRdOptions
 	AutoVacOpts autovacuum;		/* autovacuum-related options */
 	bool		user_catalog_table;		/* use as an additional catalog
 										 * relation */
+	int			extend_by_blocks;
 } StdRdOptions;
 
 #define HEAP_MIN_FILLFACTOR			10
@@ -239,6 +240,13 @@ typedef struct StdRdOptions
 	((relation)->rd_options ?				\
 	 ((StdRdOptions *) (relation)->rd_options)->user_catalog_table : false)
 
+/*
+ * RelationGetExtendBlocks
+ *		Returns the relation's number of block to be extended one time.
+ */
+#define RelationGetExtendBlocks(relation) \
+	((relation)->rd_options ? \
+	 ((StdRdOptions *) (relation)->rd_options)->extend_by_blocks : 1)
 
 /*
  * ViewOptions
#36Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#35)
Re: Relation extension scalability

On Thu, Jan 28, 2016 at 4:53 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

I did not find any regression in the normal case.
Note: I tested it with the previous patch, extend_num_pages=10 (GUC
parameter), so that we can see any impact on the overall system.

Just forgot to mention that I have run the pgbench read-write case.

S.F: 300

./pgbench -j $ -c $ -T 1800 -M Prepared postgres

Tested with 1, 8, 16, 32 and 64 threads and did not see any regression with
the patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#37Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#35)
Re: Relation extension scalability

On Thu, Jan 28, 2016 at 6:23 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

[ new patch ]

This patch contains a useless hunk and also code not in PostgreSQL
style. Get pgindent set up and it will do it correctly for you, or
look at the style of the surrounding code.

What I'm a bit murky about is *why* this should be any better than the
status quo. I mean, the obvious explanation that pops to mind is that
extending the relation by two pages at a time relieves pressure on the
relation extension lock somehow. One other possible explanation is
that calling RecordPageWithFreeSpace() allows multiple backends to get
access to that page at the same time, while otherwise each backend
would try to conduct a separate extension. But in the first case,
what we ought to do is try to make the locking more efficient; and in
the second case, we might want to think about recording the first page
in the free space map too. I don't think the problem here is that w

Here's a sketch of another approach to this problem. Get rid of the
relation extension lock. Instead, have an array of, say, 256 lwlocks.
Each one protects the extension of relations where hash(relfilenode) %
256 maps to that lock. To extend a relation, grab the corresponding
lwlock, do the work, then release the lwlock. You might occasionally
have a situation where two relations are both being extended very
quickly and happen to map to the same bucket, but that shouldn't be
much of a problem in practice, and the way we're doing it presently is
worse, not better, since two relation extension locks may very easily
map to the same lock manager partition. The only problem with this is
that acquiring an LWLock holds off interrupts, and we don't want
interrupts to be locked out across a potentially lengthy I/O. We
could partially fix that if we call RESUME_INTERRUPTS() after
acquiring the lock and HOLD_INTERRUPTS() just before releasing it, but
there's still the problem that you might block non-interruptibly while
somebody else has the lock. I don't see an easy solution to that
problem right off-hand, but if something like this performs well we
can probably conjure up some solution to that problem.

I'm not saying that we need to do that exact thing - but I am saying
that I don't think we can proceed with an approach like this without
first understanding why it works and whether there's some other way
that might be better to address the underlying problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#38Andres Freund
andres@anarazel.de
In reply to: Dilip Kumar (#35)
Re: Relation extension scalability

On 2016-01-28 16:53:08 +0530, Dilip Kumar wrote:

test_script:
------------
./psql -d postgres -c "truncate table data"
./psql -d postgres -c "checkpoint"
./pgbench -f copy_script -T 120 -c$ -j$ postgres

Shared Buffer 48GB
Table: Unlogged Table
ench -c$ -j$ -f -M Prepared postgres

Clients Base patch
1 178 180
2 337 338
4 265 601
8 167 805

Could you also measure how this behaves for an INSERT instead of a COPY
workload? Both throughput and latency. It's quite possible that this
causes latency hikes, because suddenly backends will have to wait for
one other to extend by 50 pages. You'll probably have to use -P 1 or
full statement logging to judge that. I think just having a number of
connections inserting relatively wide rows into one table should do the
trick.

I'm doubtful that anything that does the victim buffer search while
holding the extension lock will actually scale in a wide range of
scenarios. The COPY scenario here probably isn't too bad because the
COPY ring buffers are in use, and because there are no reads increasing the
usage count of recent buffers; thus victim buffers are easily found.

Thanks,

Andres


#39Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#37)
Re: Relation extension scalability

On 2016-02-02 10:12:38 -0500, Robert Haas wrote:

Here's a sketch of another approach to this problem. Get rid of the
relation extension lock. Instead, have an array of, say, 256 lwlocks.
Each one protects the extension of relations where hash(relfilenode) %
256 maps to that lock. To extend a relation, grab the corresponding
lwlock, do the work, then release the lwlock. You might occasionally
have a situation where two relations are both being extended very
quickly and happen to map to the same bucket, but that shouldn't be
much of a problem in practice, and the way we're doing it presently is
worse, not better, since two relation extension locks may very easily
map to the same lock manager partition.

I guess you suspect that the performance problems come from the
heavyweight lock overhead? That's not what I *think* I've seen in
profiles, but it's hard to conclusively judge that.

I kinda doubt that really solves the problem, profiles aside,
though. The above wouldn't really get rid of the extension locks, it
just changes the implementation a bit. We'd still do victim buffer
search, and filesystem operations, while holding an exclusive
lock. Batching can solve some of that, but I think primarily we need
more granular locking, or get rid of locks entirely.

The only problem with this is that acquiring an LWLock holds off
interrupts, and we don't want interrupts to be locked out across a
potentially lengthy I/O. We could partially fix that if we call
RESUME_INTERRUPTS() after acquiring the lock and HOLD_INTERRUPTS()
just before releasing it, but there's still the problem that you might
block non-interruptibly while somebody else has the lock. I don't see
an easy solution to that problem right off-hand, but if something like
this performs well we can probably conjure up some solution to that
problem.

Hm. I think to get rid of the HOLD_INTERRUPTS() we'd have to record
what lock we were waiting on, and in which mode, before going into
PGSemaphoreLock(). Then LWLockReleaseAll() could "hand off" the wakeup
to the next waiter in the queue. Without that we'd sometimes end up with
absorbing a wakeup without then releasing the lock, causing everyone to
block on a released lock.

There's probably two major questions around that: Will it have a
performance impact, and will there be any impact on existing callers?

Regards,

Andres


#40Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#38)
Re: Relation extension scalability

On Tue, Feb 2, 2016 at 10:49 AM, Andres Freund <andres@anarazel.de> wrote:

I'm doubtful that anything that does the victim buffer search while
holding the extension lock will actually scale in a wide range of
scenarios. The copy scenario here probably isn't too bad because the
copy ring buffers are in use, and because there are no reads increasing the
usagecount of recent buffers; thus victim buffers are easily found.

I agree that's an avenue we should try to explore. I haven't had any
time to think much about how it should be done, but it seems like it
ought to be possible somehow.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#41Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#38)
Re: Relation extension scalability

Andres Freund wrote:

Could you also measure how this behaves for [...]

While we're proposing benchmark cases -- I remember this being an issue
with toast tables: getting very large XML values causes multiple toast
pages to be extended for each new value inserted. If there are multiple
processes inserting these all the time, things get clogged.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#42Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#38)
Re: Relation extension scalability

On Tue, Feb 2, 2016 at 9:19 PM, Andres Freund <andres@anarazel.de> wrote:

On 2016-01-28 16:53:08 +0530, Dilip Kumar wrote:

test_script:
------------
./psql -d postgres -c "truncate table data"
./psql -d postgres -c "checkpoint"
./pgbench -f copy_script -T 120 -c$ -j$ postgres

Shared Buffer 48GB
Table: Unlogged Table
./pgbench -c$ -j$ -f -M Prepared postgres

Clients  Base  Patch
1        178   180
2        337   338
4        265   601
8        167   805

Could you also measure how this behaves for an INSERT instead of a COPY
workload?

I think such a test will be useful.

I'm doubtful that anything that does the victim buffer search while
holding the extension lock will actually scale in a wide range of
scenarios.

I think the problem with the victim buffer search could be visible if the
blocks are dirty and it needs to write out the dirty buffer, especially
since, as the patch is written, after acquiring the extension lock it
extends the relation again without checking whether it can get a page with
space from the FSM. It seems to me that we should re-check the
availability of a page, because while one backend is waiting on the
extension lock, another backend might have added pages. To re-check the
availability we might want to use something similar to the
LWLockAcquireOrWait() semantics used during WAL writing.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#43Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#42)
Re: Relation extension scalability

On Fri, Feb 5, 2016 at 4:50 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Could you also measure how this behaves for an INSERT instead of a COPY
workload?

I think such a test will be useful.

I have measured the performance with INSERT to see the behavior when it
doesn't use a buffer access strategy. I have tested multiple options: small
tuples, big tuples, data that fits in shared buffers, and data that doesn't.

Observation:
------------------
Apart from this test I have also used a tool which can capture many
stack traces with some delay between captures.
1. I have observed that with the base code (when data doesn't fit in shared
buffers) almost all the stack traces are waiting on "LockRelationForExtension",
and many on "FlushBuffer" as well (flushing the dirty buffer).

Total Stack Captured: 204, FlushBuffer: 13, LockRelationForExtension: 187

(This test was run with 8 threads (shared_buffers 512MB), and a stack was
captured every 5 seconds.)

2. If I change shared_buffers to 48GB then, obviously, FlushBuffer
disappears, but LockRelationForExtension still remains in very high numbers.

3. Performance of the base code remains very low and non-scaling in both
cases, whether or not the data fits in shared buffers (we can see that in
the results below).

Test--1 (big record insert and Data fits in shared Buffer)
------------------------------------------------------------
setup
--------
./psql -d postgres -c "create table test_data(a int, b text)"
./psql -d postgres -c "insert into test_data
values(generate_series(1,1000),repeat('x', 1024))"
./psql -d postgres -c "create table data (a int)
with(extend_by_blocks=$variable)" {create table data (a int) for base code}
echo "insert into data select * from test_data;" >> copy_script

test:
-----
shared_buffers=48GB max_wal_size=20GB checkpoint_timeout=10min
./pgbench -c $ -j $ -f copy_script -T 120 postgres

client   base   extend_by_blocks=50   extend_by_blocks=1000
1        113    115                   118
4        50     220                   216
8        43     202                   302

Test--2 (big record insert and Data doesn't fits in shared Buffer)
------------------------------------------------------------------
setup:
-------
./psql -d postgres -c "create table test_data(a int, b text)"
./psql -d postgres -c "insert into test_data
values(generate_series(1,1000),repeat('x', 1024))"
./psql -d postgres -c "create table data (a int)
with(extend_by_blocks=1000)"
echo "insert into data select * from test_data;" >> copy_script

test:
------
shared_buffers=512MB max_wal_size=20GB checkpoint_timeout=10min
./pgbench -c $ -j $ -f copy_script -T 120 postgres

client   base   extend_by_blocks=1000
1        125    125
4        49     236
8        41     294
16       39     279

Test--3 (small record insert and Data fits in shared Buffer)
------------------------------------------------------------------
setup:
--------
./psql -d postgres -c "create table test_data(a int)"
./psql -d postgres -c "insert into test_data
values(generate_series(1,10000))"
./psql -d postgres -c "create table data (a int) with(extend_by_blocks=20)"
echo "insert into data select * from test_data;" >> copy_script

test:
-----
shared_buffers=48GB -c max_wal_size=20GB -c checkpoint_timeout=10min
./pgbench -c $ -j $ -f copy_script -T 120 postgres

client   base   patch (extend_by_blocks=20)
1        137    143
2        269    250
4        377    443
8        170    690
16       145    745

*All tests done with data on magnetic disk and WAL on SSD

Note: The last patch has a max limit of extend_by_blocks=100, so to measure
performance with extend_by_blocks=1000 I locally changed it.
I will send the modified patch once we finalize which approach to
proceed with.

I'm doubtful that anything that does the victim buffer search while
holding the extension lock will actually scale in a wide range of
scenarios.

I think the problem with the victim buffer search could be visible if the
blocks are dirty and it needs to write out the dirty buffer, especially
since, as the patch is written, after acquiring the extension lock it
extends the relation again without checking whether it can get a page with
space from the FSM. It seems to me that we should re-check the
availability of a page, because while one backend is waiting on the
extension lock, another backend might have added pages. To re-check the
availability we might want to use something similar to the
LWLockAcquireOrWait() semantics used during WAL writing.

I will work on this in next version...

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#44Andres Freund
andres@anarazel.de
In reply to: Dilip Kumar (#43)
Re: Relation extension scalability

On 2016-02-10 10:32:44 +0530, Dilip Kumar wrote:

On Fri, Feb 5, 2016 at 4:50 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Could you also measure how this behaves for an INSERT instead of a COPY
workload?

I think such a test will be useful.

I have measured the performance with INSERT to see the behavior when it
doesn't use a buffer access strategy. I have tested multiple options: small
tuples, big tuples, data that fits in shared buffers, and data that doesn't.

Could you please also have a look at the influence this has on latency?
I think you unfortunately have to look at the per-transaction logs, and
then see whether the worst case latencies get better or worse.

Greetings,

Andres Freund


#45Dilip Kumar
dilipbalaut@gmail.com
In reply to: Andres Freund (#44)
2 attachment(s)
Re: Relation extension scalability

On Wed, Feb 10, 2016 at 3:24 PM, Andres Freund <andres@anarazel.de> wrote:

Could you please also have a look at the influence this has on latency?
I think you unfortunately have to look at the per-transaction logs, and
then see whether the worst case latencies get better or worse.

I have quickly measured the per-transaction latency of one case.
(I selected the case below to find the worst-case latency, because in this
case we are extending by 1000 blocks and the data doesn't fit in shared
buffers.)

Test--2 (big record insert and Data doesn't fits in shared Buffer)
------------------------------------------------------------------
./psql -d postgres -c "create table test_data(a int, b text)"
./psql -d postgres -c "insert into test_data
values(generate_series(1,1000),repeat('x', 1024))"
./psql -d postgres -c "create table data (a int)
with(extend_by_blocks=1000)"
echo "insert into data select * from test_data;" >> copy_script

shared_buffers=512MB -c max_wal_size=20GB -c checkpoint_timeout=10min
./pgbench -c 8 -j 8 -f copy_script -T 120 -l postgres

          base     patch (extend 1000)
best      23245    3857
worst     236329   382859
average   190303   35143

I have also attached the pgbench log files
patch_1000.tar --> log files with patch extend by 1000 blocks
base.tar --> log files with base code

From the attached files we can see that with the patch only a few
transactions have high latency (> 300,000), which is expected, and only when
we are extending 1000 blocks; with the base code almost every transaction's
latency is high (> 200,000). The best-case and average-case latencies are
roughly 1/5 with extend-by-1000.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

patch_1000.tarapplication/x-tar; name=patch_1000.tarDownload
base.tarapplication/x-tar; name=base.tarDownload
#46Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#45)
1 attachment(s)
Re: Relation extension scalability

On Wed, Feb 10, 2016 at 7:06 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have tested the relation extension patch from various aspects; performance
results and other statistical data are explained in this mail.

Test 1: Identify whether the heavyweight lock is the problem, or the actual
context switches.
1. I converted the RelationExtensionLock to a simple LWLock and tested with a
single relation. Results are as below.

This is a simple script that COPYs 10000 records of size 4 bytes in one
transaction.
client   base   lwlock   multi_extend by 50 blocks
1        155    156      160
2        282    276      284
4        248    319      428
8        161    267      675
16       143    241      889

LWLock performance is better than base; the obvious reason may be that we
save some instructions by converting to an LWLock, but it doesn't scale any
better than the base code.

Test 2: Identify whether the improvement in the multi-extend case is because
of avoiding context switches or some other factor, like reusing blocks
between backends by putting them in the FSM.

1. Test by just extending multiple blocks and reusing them in the extending
backend itself (don't put them in the FSM).
Insert 1K records; data doesn't fit in shared buffers (512MB shared buffers).

Client   Base   Extend 800 blocks, self use   Extend 1000 blocks
1        117    131                           118
2        111    203                           140
3        51     242                           178
4        51     231                           190
5        52     259                           224
6        51     263                           243
7        43     253                           254
8        43     240                           254
16       40     190                           243

We can see the same improvement when the backend uses the blocks itself. It
shows that sharing blocks between backends was not the win; avoiding context
switches was the major win.

2. Tested the number of ProcSleep calls during the run.
This is a simple script that COPYs 10000 records of size 4 bytes in one
transaction.

         Base code                 Patch (extend by 10 blocks)
Client   TPS    ProcSleep count    TPS    ProcSleep count
2        280    457,506            311    62,641
3        235    1,098,701          358    141,624
4        216    1,155,735          368    188,173

What we can see in the above test is that with the base code performance
degrades after 2 threads, while the ProcSleep count increases hugely.

Compared to that, with the patch extending 10 blocks at a time, the
ProcSleep count is reduced to ~1/8, and we can see it scales consistently.

The ProcSleep test for the INSERT case, where data doesn't fit in shared
buffers and big records of 1024 bytes are inserted, is currently running;
once I get the data I will post it.

Posting the re-based version and moving to next CF.

Open points:
1. After getting the lock, re-check the FSM in case some other backend has
already added extra blocks, and reuse them.
2. Is it a good idea to have a user-level parameter for extend_by_blocks, or
can we try some approach that internally identifies how many blocks are
needed and adds only that many? That would make it more flexible.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

multi_extend_v4.patchapplication/x-patch; name=multi_extend_v4.patchDownload
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 86b9ae1..78e81dd 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -268,6 +268,16 @@ static relopt_int intRelOpts[] =
 #endif
 	},
 
+	{
+		{
+			"extend_by_blocks",
+			"Number of blocks to be added to relation in every extend call",
+			RELOPT_KIND_HEAP,
+			AccessExclusiveLock
+		},
+		1, 1, 10000
+	},
+
 	/* list terminator */
 	{{NULL}}
 };
@@ -1291,7 +1301,9 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		{"autovacuum_analyze_scale_factor", RELOPT_TYPE_REAL,
 		offsetof(StdRdOptions, autovacuum) +offsetof(AutoVacOpts, analyze_scale_factor)},
 		{"user_catalog_table", RELOPT_TYPE_BOOL,
-		offsetof(StdRdOptions, user_catalog_table)}
+		offsetof(StdRdOptions, user_catalog_table)},
+		{"extend_by_blocks", RELOPT_TYPE_INT,
+		offsetof(StdRdOptions, extend_by_blocks)}
 	};
 
 	options = parseRelOptions(reloptions, validate, kind, &numoptions);
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 8140418..eb3ce17 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -238,6 +238,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	BlockNumber targetBlock,
 				otherBlock;
 	bool		needLock;
+	int			extraBlocks;
 
 	len = MAXALIGN(len);		/* be conservative */
 
@@ -443,25 +444,50 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	if (needLock)
 		LockRelationForExtension(relation, ExclusiveLock);
 
+	if (use_fsm)
+		extraBlocks = RelationGetExtendBlocks(relation) - 1;
+	else
+		extraBlocks = 0;
 	/*
 	 * XXX This does an lseek - rather expensive - but at the moment it is the
 	 * only way to accurately determine how many blocks are in a relation.  Is
 	 * it worth keeping an accurate file length in shared memory someplace,
 	 * rather than relying on the kernel to do it for us?
 	 */
-	buffer = ReadBufferBI(relation, P_NEW, bistate);
 
-	/*
-	 * We can be certain that locking the otherBuffer first is OK, since it
-	 * must have a lower page number.
-	 */
-	if (otherBuffer != InvalidBuffer)
-		LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
+	do
+	{
+		buffer = ReadBufferBI(relation, P_NEW, bistate);
 
-	/*
-	 * Now acquire lock on the new page.
-	 */
-	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		/*
+		 * We can be certain that locking the otherBuffer first is OK, since
+		 * it must have a lower page number.
+		 */
+		if ((otherBuffer != InvalidBuffer) && !extraBlocks)
+			LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		/*
+		 * Now acquire lock on the new page.
+		 */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+		if (extraBlocks)
+		{
+			Page		page;
+			Size		freespace;
+			BlockNumber blockNum;
+
+			page = BufferGetPage(buffer);
+			PageInit(page, BufferGetPageSize(buffer), 0);
+
+			freespace = PageGetHeapFreeSpace(page);
+			MarkBufferDirty(buffer);
+			blockNum = BufferGetBlockNumber(buffer);
+			UnlockReleaseBuffer(buffer);
+			RecordPageWithFreeSpace(relation, blockNum, freespace);
+		}
+
+	} while (extraBlocks--);
 
 	/*
 	 * Release the file-extension lock; it's now OK for someone else to extend
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index f2bebf2..26f6b8e 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -203,6 +203,7 @@ typedef struct StdRdOptions
 	AutoVacOpts autovacuum;		/* autovacuum-related options */
 	bool		user_catalog_table;		/* use as an additional catalog
 										 * relation */
+	int			extend_by_blocks;
 } StdRdOptions;
 
 #define HEAP_MIN_FILLFACTOR			10
@@ -239,6 +240,13 @@ typedef struct StdRdOptions
 	((relation)->rd_options ?				\
 	 ((StdRdOptions *) (relation)->rd_options)->user_catalog_table : false)
 
+/*
+ * RelationGetExtendBlocks
+ *		Returns the relation's number of block to be extended one time.
+ */
+#define RelationGetExtendBlocks(relation) \
+	((relation)->rd_options ? \
+	 ((StdRdOptions *) (relation)->rd_options)->extend_by_blocks : 1)
 
 /*
  * ViewOptions
#47Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#46)
Re: Relation extension scalability

On Mon, Feb 29, 2016 at 3:37 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Feb 10, 2016 at 7:06 PM, Dilip Kumar <dilipbalaut@gmail.com>
wrote:

Test 2: Identify whether the improvement in the multi-extend case is because
of avoiding context switches or some other factor, like reusing blocks
between backends by putting them in the FSM.

1. Test by just extending multiple blocks and reusing them in the extending
backend itself (don't put them in the FSM).
Insert 1K records; data doesn't fit in shared buffers (512MB shared buffers).

Client   Base   Extend 800 blocks, self use   Extend 1000 blocks
1        117    131                           118
2        111    203                           140
3        51     242                           178
4        51     231                           190
5        52     259                           224
6        51     263                           243
7        43     253                           254
8        43     240                           254
16       40     190                           243

We can see the same improvement when the backend uses the blocks itself. It
shows that sharing blocks between backends was not the win; avoiding context
switches was the major win.

One thing that is slightly unclear is whether there is any overhead due to
buffer eviction, especially when the buffer to be evicted is already dirty
and needs XLogFlush(). One reason why it might not hurt is that by the time
we try to evict the buffer, the corresponding WAL has already been flushed
by the WALWriter; another possibility is that even when it does happen
during buffer eviction, its impact is much smaller. Can we try to measure
the number of flush calls which happen during buffer eviction?

2. Tested the number of ProcSleep calls during the run.
This is a simple script that COPYs 10000 records of size 4 bytes in one
transaction.

         Base code                 Patch (extend by 10 blocks)
Client   TPS    ProcSleep count    TPS    ProcSleep count
2        280    457,506            311    62,641
3        235    1,098,701          358    141,624
4        216    1,155,735          368    188,173

What we can see in the above test is that with the base code performance
degrades after 2 threads, while the ProcSleep count increases hugely.

Compared to that, with the patch extending 10 blocks at a time, the
ProcSleep count is reduced to ~1/8, and we can see it scales consistently.

The ProcSleep test for the INSERT case, where data doesn't fit in shared
buffers and big records of 1024 bytes are inserted, is currently running;
once I get the data I will post it.

Okay. However, if the performance data for the case when data doesn't fit
into shared buffers also shows a similar trend, then it might be worth
trying to extend according to the load on the system. I mean to say we can
batch the extension requests (as we have done in ProcArrayGroupClearXid)
and extend accordingly; if that works out, the benefit could be that we
don't need any configurable knob for this.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#48Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#47)
Re: Relation extension scalability

On Tue, Mar 1, 2016 at 4:36 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

One thing that is slightly unclear is whether there is any overhead due to
buffer eviction, especially when the buffer to be evicted is already dirty
and needs XLogFlush(). One reason why it might not hurt is that by the time
we try to evict the buffer, the corresponding WAL has already been flushed
by the WALWriter; another possibility is that even when it does happen
during buffer eviction, its impact is much smaller. Can we try to measure
the number of flush calls which happen during buffer eviction?

Good Idea, I will do this test and post the results..

The ProcSleep test for the INSERT case, where data doesn't fit in shared
buffers and big records of 1024 bytes are inserted, is currently running;
once I get the data I will post it.

Okay. However, if the performance data for the case when data doesn't fit
into shared buffers also shows a similar trend, then it might be worth
trying to extend according to the load on the system. I mean to say we can
batch the extension requests (as we have done in ProcArrayGroupClearXid)
and extend accordingly; if that works out, the benefit could be that we
don't need any configurable knob for this.

1. One option can be, as you suggested, something like
ProcArrayGroupClearXid, with some modification: if we wait for requests and
extend exactly per request, we may again face the context-switch problem, so
maybe we can extend by some multiple of the number of requests.
(But we need to decide whether to give a block directly to the requester,
add it to the FSM, or do both, i.e. give at least one block to the requester
and put some multiple in the FSM.)

2. Another option can be that we analyse the current load on extension and
then speed up or slow down the extension pace.
A simple algorithm can look like this:

If (GotBlockFromFSM())
    success++       // We got the block from the FSM
Else
    failure++       // Did not get a block from the FSM; need to extend myself

Now while extending
-------------------
Speed up
--------
If (failure - success > threshold)       // threshold can be some number, say 10
    ExtendByBlock += failure - success   // many failures mean load is high,
                                         // so increase ExtendByBlock
    failure = success = 0                // reset, so that we measure the
                                         // latest trend

Slow down
---------
// It is possible that demand for blocks has reduced but ExtendByBlock is
// still high, so analyse the statistics and slow down the extension pace.
If (success - failure > threshold)
{
    // Cannot reduce it by a big number; maybe more requests are being
    // satisfied because this is the right amount, so gradually decrease the
    // pace and re-analyse the statistics next time.
    ExtendByBlock--
    failure = success = 0
}

Any Suggestions are Welcome...

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#49Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#48)
1 attachment(s)
Re: Relation extension scalability

On Wed, Mar 2, 2016 at 10:31 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

1. One option can be, as you suggested, something like
ProcArrayGroupClearXid, with some modification: if we wait for requests and
extend exactly per request, we may again face the context-switch problem, so
maybe we can extend by some multiple of the number of requests.
(But we need to decide whether to give a block directly to the requester,
add it to the FSM, or do both, i.e. give at least one block to the requester
and put some multiple in the FSM.)

2. Another option can be that we analyse the current load on extension and
then speed up or slow down the extension pace.
A simple algorithm can look like this:

I have tried the group-extend approach:

1. We convert the extension lock to a TryLock, and if we get the lock we
extend by one block.
2. If we don't get the lock, we use the group-leader concept, where only one
process extends for all. A slight change from ProcArrayGroupClearXid is
that, apart from satisfying the requesting backends, we add some extra
blocks to the FSM, say GroupSize*10.
3. So we cannot measure the exact load, but we still have a factor, the
group size, that tells us the contention, and we extend in a multiple of
that.

Performance Analysis:
---------------------
Performance is scaling with this approach, its slightly less compared to
previous patch where we directly give extend_by_block parameter and extend
in multiple blocks, and I think that's obvious because in group extend case
only after contention happen on lock we add extra blocks, but in former
case it was extending extra blocks optimistically.

Test1(COPY)
-----
./psql -d postgres -c "COPY (select g.i::text FROM generate_series(1,
10000) g(i)) TO '/tmp/copybinary' WITH BINARY";
echo "COPY data from '/tmp/copybinary' WITH BINARY;" > copy_script

./pgbench -f copy_script -T 300 -c$ -j$ postgres

Shared Buffer:8GB max_wal_size:10GB Storage:Magnetic Disk WAL:SSD
-----------------------------------------------------------------------------------------------
Client   Base   Multi-extend by 20 pages   Group-extend patch (group*10)
1        105    157                        149
2        217    255                        282
4        210    494                        452
8        166    702                        629
16       145    563                        561
32       124    480                        566

Test2(INSERT)
--------
./psql -d postgres -c "create table test_data(a int, b text)"
./psql -d postgres -c "insert into test_data
values(generate_series(1,1000),repeat('x', 1024))"
./psql -d postgres -c "create table data (a int, b text)"
echo "insert into data select * from test_data;" >> insert_script

shared_buffers=512MB max_wal_size=20GB checkpoint_timeout=10min
./pgbench -c $ -j $ -f insert_script -T 300 postgres

Client   Base   Multi-extend by 1000   Group extend (group*10)   Group extend (group*100)
1        117    118                    125                       122
2        111    140                    161                       159
4        51     190                    141                       187
8        43     254                    148                       172
16       40     243                    150                       173

* (group*10) means that, inside the code, the group leader sees how many
members of the group want blocks; we satisfy the requests of all members and
additionally put extra blocks in the FSM, where the number of extra blocks
to extend is group size * 10 (10 is just some constant).

Summary:
------------
1. With the group-extend patch there is no configuration for how many blocks
to extend; that is decided by the current load on the system (contention on
the extension lock).
2. With a small multiplier, i.e. 10, we get a fairly good improvement
compared to the base code, but when the load is high, e.g. when the record
size is 1K, increasing the multiplier gives better results.

* Note: This is currently a POC patch. It has only one group-extend list,
so it can handle group extension for only one relation.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

multi_extend_group.patchapplication/x-patch; name=multi_extend_group.patchDownload

diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 8140418..adb82ba 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -23,6 +23,7 @@
 #include "storage/freespace.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
+#include "storage/proc.h"
 
 
 /*
@@ -168,6 +169,160 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
 	}
 }
 
+
+static BlockNumber
+GroupExtendRelation(PGPROC *proc, Relation relation, BulkInsertState bistate)
+{
+	volatile PROC_HDR *procglobal = ProcGlobal;
+	uint32		nextidx;
+	uint32		wakeidx;
+	int			extraWaits = -1;
+	BlockNumber targetBlock;
+	int count = 0;
+
+	/* Add ourselves to the list of processes needing a group extend. */
+	proc->groupExtendMember = true;
+
+	while (true)
+	{
+		nextidx = pg_atomic_read_u32(&procglobal->extendGroupFirst);
+		pg_atomic_write_u32(&proc->extendGroupNext, nextidx);
+
+		if (pg_atomic_compare_exchange_u32(&procglobal->extendGroupFirst,
+										   &nextidx,
+										   (uint32) proc->pgprocno))
+			break;
+	}
+
+	/*
+	 * If the list was not empty, the leader will clear our XID.  It is
+	 * impossible to have followers without a leader because the first process
+	 * that has added itself to the list will always have nextidx as
+	 * INVALID_PGPROCNO.
+	 */
+	if (nextidx != INVALID_PGPROCNO)
+	{
+		/* Sleep until the leader clears our XID. */
+		for (;;)
+		{
+			/* acts as a read barrier */
+			PGSemaphoreLock(&proc->sem);
+			if (!proc->groupExtendMember)
+				break;
+			extraWaits++;
+		}
+
+		Assert(pg_atomic_read_u32(&proc->extendGroupNext) == INVALID_PGPROCNO);
+
+		/* Fix semaphore count for any absorbed wake ups */
+		while (extraWaits-- > 0)
+			PGSemaphoreUnlock(&proc->sem);
+
+		targetBlock = proc->blockNum;
+
+		proc->blockNum = InvalidBlockNumber;
+
+		return targetBlock;
+	}
+
+	/* We are the leader.  Acquire the lock on behalf of everyone. */
+	LockRelationForExtension(relation, ExclusiveLock);
+
+	/*
+	 * Now that we've got the lock, clear the list of processes waiting for
+	 * group extending
+	 */
+	while (true)
+	{
+		nextidx = pg_atomic_read_u32(&procglobal->extendGroupFirst);
+		if (pg_atomic_compare_exchange_u32(&procglobal->extendGroupFirst,
+										   &nextidx,
+										   PG_INT32_MAX))
+			break;
+	}
+
+	/* Remember head of list so we can perform wakeups after dropping lock. */
+	wakeidx = nextidx;
+
+
+	/* Walk the list and clear all XIDs. */
+	while (nextidx != INVALID_PGPROCNO)
+	{
+		PGPROC	   *proc = &ProcGlobal->allProcs[nextidx];
+		Buffer buffer;
+
+		buffer = ReadBufferBI(relation, P_NEW, bistate);
+		proc->blockNum = BufferGetBlockNumber(buffer);
+
+		ReleaseBuffer(buffer);
+
+		/* Move to next proc in list. */
+		nextidx = pg_atomic_read_u32(&proc->extendGroupNext);
+
+		count ++;
+	}
+
+	count = count*10;
+
+	do
+	{
+		Buffer buffer;
+		Page		page;
+		Size		freespace;
+		BlockNumber blockNum;
+
+		buffer = ReadBufferBI(relation, P_NEW, bistate);
+
+		/*
+		 * Now acquire lock on the new page.
+		 */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+		page = BufferGetPage(buffer);
+		PageInit(page, BufferGetPageSize(buffer), 0);
+
+		freespace = PageGetHeapFreeSpace(page);
+		MarkBufferDirty(buffer);
+		blockNum = BufferGetBlockNumber(buffer);
+		UnlockReleaseBuffer(buffer);
+
+		RecordPageWithFreeSpace(relation, blockNum, freespace);
+	} while (count--);
+
+	/* We're done with the lock now. */
+	UnlockRelationForExtension(relation, ExclusiveLock);
+
+	/*
+	 * Now that we've released the lock, go back and wake everybody up.  We
+	 * don't do this under the lock so as to keep lock hold times to a
+	 * minimum.  The system calls we need to perform to wake other processes
+	 * up are probably much slower than the simple memory writes we did while
+	 * holding the lock.
+	 */
+	while (wakeidx != INVALID_PGPROCNO)
+	{
+		PGPROC	   *proc = &ProcGlobal->allProcs[wakeidx];
+
+		wakeidx = pg_atomic_read_u32(&proc->extendGroupNext);
+		pg_atomic_write_u32(&proc->extendGroupNext, INVALID_PGPROCNO);
+
+		/* ensure all previous writes are visible before follower continues. */
+		pg_write_barrier();
+
+		proc->groupExtendMember = false;
+
+		if (proc != MyProc)
+			PGSemaphoreUnlock(&proc->sem);
+		else
+		{
+			targetBlock = proc->blockNum;
+			proc->blockNum = InvalidBlockNumber;
+		}
+	}
+
+	return targetBlock;
+}
+
 /*
  * RelationGetBufferForTuple
  *
@@ -238,6 +393,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	BlockNumber targetBlock,
 				otherBlock;
 	bool		needLock;
+	int			extraBlocks;
 
 	len = MAXALIGN(len);		/* be conservative */
 
@@ -308,6 +464,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
 		}
 	}
 
+loop:
 	while (targetBlock != InvalidBlockNumber)
 	{
 		/*
@@ -441,27 +598,95 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	needLock = !RELATION_IS_LOCAL(relation);
 
 	if (needLock)
-		LockRelationForExtension(relation, ExclusiveLock);
+	{
+		if (TryLockRelationForExtension(relation, ExclusiveLock))
+		{
+			buffer = ReadBufferBI(relation, P_NEW, bistate);
+			/*
+			 * We can be certain that locking the otherBuffer first is OK, since
+			 * it must have a lower page number.
+			 */
+			if ((otherBuffer != InvalidBuffer) && !extraBlocks)
+				LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+			/*
+			 * Now acquire lock on the new page.
+			 */
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+			UnlockRelationForExtension(relation, ExclusiveLock);
+		}
+		else
+		{
+			targetBlock = GroupExtendRelation(MyProc, relation, bistate);
+			goto loop;
+		}
+	}
+	else
+	{
+		buffer = ReadBufferBI(relation, P_NEW, bistate);
+		/*
+		 * We can be certain that locking the otherBuffer first is OK, since
+		 * it must have a lower page number.
+		 */
+		if ((otherBuffer != InvalidBuffer) && !extraBlocks)
+			LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		/*
+		 * Now acquire lock on the new page.
+		 */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+	}
+
+#if 0
+	if (LOCKACQUIRE_NOT_AVAIL)
+
+	if (use_fsm)
+		extraBlocks = RelationGetExtendBlocks(relation) - 1;
+	else
+		extraBlocks = 0;
 	/*
 	 * XXX This does an lseek - rather expensive - but at the moment it is the
 	 * only way to accurately determine how many blocks are in a relation.  Is
 	 * it worth keeping an accurate file length in shared memory someplace,
 	 * rather than relying on the kernel to do it for us?
 	 */
-	buffer = ReadBufferBI(relation, P_NEW, bistate);
 
-	/*
-	 * We can be certain that locking the otherBuffer first is OK, since it
-	 * must have a lower page number.
-	 */
-	if (otherBuffer != InvalidBuffer)
-		LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
+	do
+	{
+		buffer = ReadBufferBI(relation, P_NEW, bistate);
+
+		/*
+		 * We can be certain that locking the otherBuffer first is OK, since
+		 * it must have a lower page number.
+		 */
+		if ((otherBuffer != InvalidBuffer) && !extraBlocks)
+			LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		/*
+		 * Now acquire lock on the new page.
+		 */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+		if (extraBlocks)
+		{
+			Page		page;
+			Size		freespace;
+			BlockNumber blockNum;
+
+			page = BufferGetPage(buffer);
+			PageInit(page, BufferGetPageSize(buffer), 0);
+
+			freespace = PageGetHeapFreeSpace(page);
+			MarkBufferDirty(buffer);
+			blockNum = BufferGetBlockNumber(buffer);
+			UnlockReleaseBuffer(buffer);
+			RecordPageWithFreeSpace(relation, blockNum, freespace);
+		}
+
+	} while (extraBlocks--);
 
-	/*
-	 * Now acquire lock on the new page.
-	 */
-	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
 	/*
 	 * Release the file-extension lock; it's now OK for someone else to extend
@@ -471,7 +696,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	 */
 	if (needLock)
 		UnlockRelationForExtension(relation, ExclusiveLock);
-
+#endif
 	/*
 	 * We need to initialize the empty new page.  Double-check that it really
 	 * is empty (this should never happen, but if it does we don't want to
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 9d16afb..08e5e07 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -340,6 +340,21 @@ LockRelationForExtension(Relation relation, LOCKMODE lockmode)
 	(void) LockAcquire(&tag, lockmode, false, false);
 }
 
+LockAcquireResult
+TryLockRelationForExtension(Relation relation, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return LockAcquire(&tag, lockmode, false, true);
+}
+
 /*
  *		UnlockRelationForExtension
  */
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 6453b88..a823eda 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -182,6 +182,7 @@ InitProcGlobal(void)
 	ProcGlobal->walwriterLatch = NULL;
 	ProcGlobal->checkpointerLatch = NULL;
 	pg_atomic_init_u32(&ProcGlobal->procArrayGroupFirst, INVALID_PGPROCNO);
+	pg_atomic_init_u32(&ProcGlobal->extendGroupFirst, INVALID_PGPROCNO);
 
 	/*
 	 * Create and initialize all the PGPROC structures we'll need.  There are
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index dbcdd3f..5a0e60b 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -152,6 +152,11 @@ struct PGPROC
 	 */
 	TransactionId	procArrayGroupMemberXid;
 
+	bool groupExtendMember;
+	pg_atomic_uint32	extendGroupNext;
+	uint32	blockNum;
+
 	/* Per-backend LWLock.  Protects fields below (but not group fields). */
 	LWLock		backendLock;
 
@@ -223,6 +228,8 @@ typedef struct PROC_HDR
 	PGPROC	   *bgworkerFreeProcs;
 	/* First pgproc waiting for group XID clear */
 	pg_atomic_uint32 procArrayGroupFirst;
+	pg_atomic_uint32 extendGroupFirst;
+
 	/* WALWriter process's latch */
 	Latch	   *walwriterLatch;
 	/* Checkpointer process's latch */
#50Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#49)
Re: Relation extension scalability

On Fri, Mar 4, 2016 at 12:06 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have tried the approach of group extend:

1. We convert the extension lock to TryLock, and if we get the lock then we
extend by one block.
2. If we don't get the lock, then we use the group leader concept, where only
one process will extend for all. A slight change from ProcArrayGroupClear is
that here, besides satisfying the requesting backends, we add some extra
blocks to the FSM, say GroupSize*10.
3. So we cannot measure the exact load, but we still have a factor, the group
size, which tells us the contention, and we extend in multiples of that.

This approach seems good to me, and the performance results look very
positive. The nice thing about this is that there is no
user-configurable knob; the system automatically determines when
larger extensions are needed, which means that real-world users
are much more likely to benefit from this. I don't think it matters
that this is a little faster or slower than an approach with a manual
knob; what matters is that it is a huge improvement over unpatched
master, and that it does not need a knob. The arbitrary constant of
10 is a little unsettling, but I think we can live with it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#51Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#50)
Re: Relation extension scalability

Robert Haas <robertmhaas@gmail.com> writes:

This approach seems good to me, and the performance results look very
positive. The nice thing about this is that there is no
user-configurable knob; the system automatically determines when
larger extensions are needed, which means that real-world users
are much more likely to benefit from this. I don't think it matters
that this is a little faster or slower than an approach with a manual
knob; what matters is that it is a huge improvement over unpatched
master, and that it does not need a knob. The arbitrary constant of
10 is a little unsettling, but I think we can live with it.

+1. "No knob" is a huge win.

regards, tom lane


#52Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#50)
Re: Relation extension scalability

On Fri, Mar 4, 2016 at 9:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Mar 4, 2016 at 12:06 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have tried the approach of group extend:

1. We convert the extension lock to TryLock, and if we get the lock then we
extend by one block.
2. If we don't get the lock, then we use the group leader concept, where only
one process will extend for all. A slight change from ProcArrayGroupClear is
that here, besides satisfying the requesting backends, we add some extra
blocks to the FSM, say GroupSize*10.
3. So we cannot measure the exact load, but we still have a factor, the group
size, which tells us the contention, and we extend in multiples of that.

This approach seems good to me, and the performance results look very
positive. The nice thing about this is that there is no
user-configurable knob; the system automatically determines when
larger extensions are needed, which means that real-world users
are much more likely to benefit from this.

I think one thing which needs more thought about this approach is that we
need to maintain some number of slots so that group extend for different
relations can happen in parallel. Do we want to provide simultaneous
extension for 1, 2, 3, 4, 5 or more relations? I think providing it for
three or four relations should be okay, as the higher the number we want to
provide, the bigger the PGPROC structure will be.

+GroupExtendRelation(PGPROC *proc, Relation relation, BulkInsertState bistate)
+{
+	volatile PROC_HDR *procglobal = ProcGlobal;
+	uint32 nextidx;
+	uint32 wakeidx;
+	int extraWaits = -1;
+	BlockNumber targetBlock;
+	int count = 0;
+
+	/* Add ourselves to the list of processes needing a group extend. */
+	proc->groupExtendMember = true;
..
+	/* We are the leader.  Acquire the lock on behalf of everyone. */
+	LockRelationForExtension(relation, ExclusiveLock);

To provide it for multiple relations, I think you need to advertise the
reloid for the relation in each proc and then get the relation descriptor
for the relation extension lock.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#53Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#52)
Re: Relation extension scalability

On Fri, Mar 4, 2016 at 11:49 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think one thing which needs more thought about this approach is that we
need to maintain some number of slots so that group extend for different
relations can happen in parallel. Do we want to provide simultaneous
extension for 1, 2, 3, 4, 5 or more relations? I think providing it for
three or four relations should be okay, as the higher the number we want to
provide, the bigger the PGPROC structure will be.

Hmm. Can we drive this off of the heavyweight lock manager's idea of
how big the relation extension lock wait queue is, instead of adding
more stuff to PGPROC?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#54Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#53)
Re: Relation extension scalability

On Mon, Mar 7, 2016 at 8:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Mar 4, 2016 at 11:49 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think one thing which needs more thought about this approach is that we
need to maintain some number of slots so that group extend for different
relations can happen in parallel. Do we want to provide simultaneous
extension for 1, 2, 3, 4, 5 or more relations? I think providing it for
three or four relations should be okay, as the higher the number we want to
provide, the bigger the PGPROC structure will be.

Hmm. Can we drive this off of the heavyweight lock manager's idea of
how big the relation extension lock wait queue is, instead of adding
more stuff to PGPROC?

One idea to make it work without adding additional stuff in PGPROC is that
after acquiring the relation extension lock, we check if there is any
available block in the FSM; if we find a block, we release the lock and
proceed, else we extend the relation by one block, check the lock's wait
queue size or number of lock requests (nRequested), extend the relation
further in proportion to the wait queue size, and then release the lock and
proceed. Here, I think we can check the wait queue size even before
extending the relation by one block.

The benefit of doing it with PGPROC is that there will be relatively fewer
LockAcquire calls compared to the heavyweight lock approach, which I think
should not matter much because we are planning to extend the relation in
proportion to the wait queue size (probably wait queue size * 10).
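
The waiter-proportional policy described above can be sketched in a few lines
of self-contained C. This is only an illustration of the arithmetic and the
"hand one block to the caller, record the rest as free" bookkeeping; the
`ToyRel` struct, the function names, and the multiplier constant are
stand-ins, not PostgreSQL's actual API.

```c
#include <assert.h>

#define EXTEND_MULTIPLIER 10    /* illustrative, mirrors the "* 10" above */

/* Toy relation: total block count plus a free-block counter (the "FSM"). */
typedef struct
{
    int nblocks;        /* total blocks in the toy relation */
    int free_blocks;    /* blocks recorded as free for other backends */
} ToyRel;

/* One block for the caller, plus multiplier * waiters for the FSM. */
int
blocks_to_extend(int wait_queue_len)
{
    return 1 + wait_queue_len * EXTEND_MULTIPLIER;
}

/* Extend: the first new block goes to the caller, the rest become free. */
int
extend_for_waiters(ToyRel *rel, int wait_queue_len)
{
    int n = blocks_to_extend(wait_queue_len);
    int mine = rel->nblocks;    /* block number handed back to the caller */

    rel->nblocks += n;
    rel->free_blocks += n - 1;  /* remaining blocks go to the toy "FSM" */
    return mine;
}
```

With five waiters this hands one block to the extending backend and leaves
fifty behind for the waiters to find, which is the batching effect the
numbers later in the thread measure.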

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#55Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#54)
Re: Relation extension scalability

On Tue, Mar 8, 2016 at 4:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Hmm. Can we drive this off of the heavyweight lock manager's idea of
how big the relation extension lock wait queue is, instead of adding
more stuff to PGPROC?

One idea to make it work without adding additional stuff in PGPROC is that
after acquiring the relation extension lock, we check if there is any
available block in the FSM; if we find a block, we release the lock and
proceed, else we extend the relation by one block, check the lock's wait
queue size or number of lock requests (nRequested), extend the relation
further in proportion to the wait queue size, and then release the lock and
proceed. Here, I think we can check the wait queue size even before
extending the relation by one block.

The benefit of doing it with PGPROC is that there will be relatively fewer
LockAcquire calls compared to the heavyweight lock approach, which I think
should not matter much because we are planning to extend the relation in
proportion to the wait queue size (probably wait queue size * 10).

I don't think switching relation extension from heavyweight locks to
lightweight locks is going to work. It would mean, for example, that
we lose the ability to service interrupts while extending a relation;
not to mention that we lose scalability if many relations are being
extended at once.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#56Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#55)
Re: Relation extension scalability

On Tue, Mar 8, 2016 at 7:23 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Mar 8, 2016 at 4:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Hmm. Can we drive this off of the heavyweight lock manager's idea of
how big the relation extension lock wait queue is, instead of adding
more stuff to PGPROC?

One idea to make it work without adding additional stuff in PGPROC is that
after acquiring the relation extension lock, we check if there is any
available block in the FSM; if we find a block, we release the lock and
proceed, else we extend the relation by one block, check the lock's wait
queue size or number of lock requests (nRequested), extend the relation
further in proportion to the wait queue size, and then release the lock and
proceed. Here, I think we can check the wait queue size even before
extending the relation by one block.

The benefit of doing it with PGPROC is that there will be relatively fewer
LockAcquire calls compared to the heavyweight lock approach, which I think
should not matter much because we are planning to extend the relation in
proportion to the wait queue size (probably wait queue size * 10).

I don't think switching relation extension from heavyweight locks to
lightweight locks is going to work.

Sorry, but I am not suggesting changing it to lightweight locks. I am just
suggesting how to make batching work with heavyweight locks, as you asked.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#57Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#56)
1 attachment(s)
Re: Relation extension scalability

On Tue, Mar 8, 2016 at 8:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Hmm. Can we drive this off of the heavyweight lock manager's idea of
how big the relation extension lock wait queue is, instead of adding
more stuff to PGPROC?

One idea to make it work without adding additional stuff in PGPROC is that
after acquiring the relation extension lock, we check if there is any
available block in the FSM; if we find a block, we release the lock and
proceed, else we extend the relation by one block, check the lock's wait
queue size or number of lock requests (nRequested), extend the relation
further in proportion to the wait queue size, and then release the lock and
proceed. Here, I think we can check the wait queue size even before
extending the relation by one block.

I have come up with this patch.

If this approach looks fine, then I will prepare the final patch (more
comments, indentation, and some code improvements) and do some long-run
testing (the current results are from 5-minute runs).

The idea is the same as what Robert and Amit suggested upthread.

/*
 * First we try the lock, and if we get it we just extend by one block.
 * This gives two benefits:
 * 1. Single-thread performance is not impacted by checking lock waiters.
 * 2. If we check the waiters in the else part, it gives time for more
 *    waiters to collect, so we get a better estimate of the contention.
 */
if (TryRelExtLock())
{
    extend one block
}
else
{
    RelExtLock();
    if (recheck the FSM: somebody may have added blocks for us)
        -- don't extend any block, just reuse
    else
        -- we have to extend blocks
        -- get the waiters = lock->nRequested
        -- add extra blocks to the FSM: extraBlocks = waiters * 20;
}

The results look like this
---------------------------

Test1 (COPY)
------------
./psql -d postgres -c "COPY (select g.i::text FROM generate_series(1, 10000) g(i)) TO '/tmp/copybinary' WITH BINARY"
echo "COPY data from '/tmp/copybinary' WITH BINARY;" > copy_script
./pgbench -f copy_script -T 300 -c$ -j$ postgres

Shared Buffers: 8GB  max_wal_size: 10GB  Storage: Magnetic Disk  WAL: SSD
-------------------------------------------------------------------------
Client   Base   multi_extend by 20 pages   lock_waiter patch (waiter*20)
1        105    157                        148
2        217    255                        252
4        210    494                        442
8        166    702                        645
16       145    563                        773
32       124    480                        957

Note: at 1 thread there should not be any improvement, so this may be
run-to-run variance.

Test2 (INSERT)
--------------
./psql -d postgres -c "create table test_data(a int, b text)"
./psql -d postgres -c "insert into test_data values(generate_series(1,1000), repeat('x', 1024))"
./psql -d postgres -c "create table data (a int, b text)"
echo "insert into data select * from test_data;" >> insert_script

shared_buffers=512GB max_wal_size=20GB checkpoint_timeout=10min
./pgbench -c $ -j $ -f insert_script -T 300 postgres

Client   Base   Multi-extend by 1000   lock_waiter patch (waiter*20)
1        117    118                    117
2        111    140                    132
4        51     190                    134
8        43     254                    148
16       40     243                    206
32       -      -                      264

* (waiter*20) -> the first process to get the lock finds the number of lock
waiters and adds waiter*20 extra blocks.

In the next run I will go beyond 32 clients as well; as we can see, it is
still increasing even at 32 clients, so it is clear that when it sees more
contention, it adapts and adds more blocks.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

multi_extend_v5.patch (application/octet-stream)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 8140418..832a80d 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -167,6 +167,32 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
 			break;
 	}
 }
+static Buffer
+RelationAddOneBlock(Relation relation, Buffer otherBuffer, BulkInsertState bistate)
+{
+	Buffer buffer;
+	/*
+	 * XXX This does an lseek - rather expensive - but at the moment it is the
+	 * only way to accurately determine how many blocks are in a relation.  Is
+	 * it worth keeping an accurate file length in shared memory someplace,
+	 * rather than relying on the kernel to do it for us?
+	 */
+	buffer = ReadBufferBI(relation, P_NEW, bistate);
+
+	/*
+	 * We can be certain that locking the otherBuffer first is OK, since
+	 * it must have a lower page number.
+	 */
+	if (otherBuffer != InvalidBuffer)
+		LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+	/*
+	 * Now acquire lock on the new page.
+	 */
+	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+	return buffer;
+}
 
 /*
  * RelationGetBufferForTuple
@@ -236,7 +262,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	Size		pageFreeSpace,
 				saveFreeSpace;
 	BlockNumber targetBlock,
-				otherBlock;
+				otherBlock,
+				lastValidBlock;
 	bool		needLock;
 
 	len = MAXALIGN(len);		/* be conservative */
@@ -308,6 +335,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
 		}
 	}
 
+loop:
 	while (targetBlock != InvalidBlockNumber)
 	{
 		/*
@@ -388,6 +416,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
 								 otherBlock, targetBlock, vmbuffer_other,
 								 vmbuffer);
 
+		lastValidBlock = targetBlock;
+
 		/*
 		 * Now we can check to see if there's enough free space here. If so,
 		 * we're done.
@@ -441,36 +471,76 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	needLock = !RELATION_IS_LOCAL(relation);
 
 	if (needLock)
-		LockRelationForExtension(relation, ExclusiveLock);
-
-	/*
-	 * XXX This does an lseek - rather expensive - but at the moment it is the
-	 * only way to accurately determine how many blocks are in a relation.  Is
-	 * it worth keeping an accurate file length in shared memory someplace,
-	 * rather than relying on the kernel to do it for us?
-	 */
-	buffer = ReadBufferBI(relation, P_NEW, bistate);
-
-	/*
-	 * We can be certain that locking the otherBuffer first is OK, since it
-	 * must have a lower page number.
-	 */
-	if (otherBuffer != InvalidBuffer)
-		LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
-
-	/*
-	 * Now acquire lock on the new page.
-	 */
-	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
-
-	/*
-	 * Release the file-extension lock; it's now OK for someone else to extend
-	 * the relation some more.  Note that we cannot release this lock before
-	 * we have buffer lock on the new page, or we risk a race condition
-	 * against vacuumlazy.c --- see comments therein.
-	 */
-	if (needLock)
-		UnlockRelationForExtension(relation, ExclusiveLock);
+	{
+		if (TryLockRelationForExtension(relation, ExclusiveLock))
+		{
+			buffer = RelationAddOneBlock(relation, otherBuffer, bistate);
+			UnlockRelationForExtension(relation, ExclusiveLock);
+		}
+		else
+		{
+			LockRelationForExtension(relation, ExclusiveLock);
+
+			if (use_fsm)
+			{
+				Page		page;
+				Size		freespace;
+				BlockNumber blockNum;
+				int			extraBlocks = 0;
+				int 		lockWaiters = 0;
+				Buffer		buf;
+
+				/*
+				 * Update FSM as to condition of this page, and ask for another page
+				 * to try.
+				 */
+				targetBlock = RecordAndGetPageWithFreeSpace(relation,
+															lastValidBlock,
+															pageFreeSpace,
+															len + saveFreeSpace);
+
+				/* Other waiter has extended the block for us*/
+				if (targetBlock != InvalidBlockNumber)
+				{
+					UnlockRelationForExtension(relation, ExclusiveLock);
+					goto loop;
+				}
+
+				buffer = RelationAddOneBlock(relation, otherBuffer, bistate);
+
+				lockWaiters = RelationExtensionLockWaiter(relation);
+
+				extraBlocks = lockWaiters * 20;
+
+				while (extraBlocks--)
+				{
+					/*
+					 * XXX This does an lseek - rather expensive - but at the moment it is the
+					 * only way to accurately determine how many blocks are in a relation.  Is
+					 * it worth keeping an accurate file length in shared memory someplace,
+					 * rather than relying on the kernel to do it for us?
+					 */
+					buf = ReadBufferBI(relation, P_NEW, bistate);
+					LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+					page = BufferGetPage(buf);
+					PageInit(page, BufferGetPageSize(buf), 0);
+					freespace = PageGetHeapFreeSpace(page);
+					MarkBufferDirty(buf);
+					blockNum = BufferGetBlockNumber(buf);
+					UnlockReleaseBuffer(buf);
+
+					RecordPageWithFreeSpace(relation, blockNum, freespace);
+				}
+			}
+			else
+				buffer = RelationAddOneBlock(relation, otherBuffer, bistate);
+
+			UnlockRelationForExtension(relation, ExclusiveLock);
+		}
+	}
+	else
+		buffer = RelationAddOneBlock(relation, otherBuffer, bistate);
 
 	/*
 	 * We need to initialize the empty new page.  Double-check that it really
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 9d16afb..a56b203 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -340,6 +340,30 @@ LockRelationForExtension(Relation relation, LOCKMODE lockmode)
 	(void) LockAcquire(&tag, lockmode, false, false);
 }
 
+LockAcquireResult
+TryLockRelationForExtension(Relation relation, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return LockAcquire(&tag, lockmode, false, true);
+}
+
+int
+RelationExtensionLockWaiter(Relation relation)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return LockWaiterCount(&tag);
+}
+
 /*
  *		UnlockRelationForExtension
  */
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index a458c68..8f49192 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -4380,3 +4380,35 @@ VirtualXactLock(VirtualTransactionId vxid, bool wait)
 	LockRelease(&tag, ShareLock, false);
 	return true;
 }
+
+int
+LockWaiterCount(const LOCKTAG *locktag)
+{
+	LOCKMETHODID lockmethodid = locktag->locktag_lockmethodid;
+	LOCALLOCKTAG localtag;
+	LOCALLOCK  *locallock;
+	LOCK	   *lock;
+	bool		found;
+	uint32		hashcode;
+	int			waiters = 0;
+
+	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
+		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
+
+
+	/*
+	 * Find a LOCALLOCK entry for this lock and lockmode
+	 */
+	MemSet(&localtag, 0, sizeof(localtag));		/* must clear padding */
+	localtag.lock = *locktag;
+	localtag.mode = ExclusiveLock;
+
+	locallock = (LOCALLOCK *) hash_search(LockMethodLocalHash,
+										  (void *) &localtag,
+										  HASH_FIND, &found);
+
+	if (found)
+		waiters = locallock->lock->nRequested;
+
+	return waiters;
+}
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index e9d41bf..a492bb1 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -101,4 +101,7 @@ extern void UnlockSharedObjectForSession(Oid classid, Oid objid, uint16 objsubid
 /* Describe a locktag for error messages */
 extern void DescribeLockTag(StringInfo buf, const LOCKTAG *tag);
 
+extern LockAcquireResult TryLockRelationForExtension(Relation relation, LOCKMODE lockmode);
+extern int RelationExtensionLockWaiter(Relation relation);
+
 #endif   /* LMGR_H */
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 788d50a..3fd74fb 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -572,6 +572,8 @@ extern void RememberSimpleDeadLock(PGPROC *proc1,
 					   PGPROC *proc2);
 extern void InitDeadLockChecking(void);
 
+extern int LockWaiterCount(const LOCKTAG *locktag);
+
 #ifdef LOCK_DEBUG
 extern void DumpLocks(PGPROC *proc);
 extern void DumpAllLocks(void);
#58Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#57)
Re: Relation extension scalability

On Tue, Mar 8, 2016 at 11:20 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have come up with this patch..

If this approach looks fine then I will prepare final patch (more comments,
indentation, and improve some code) and do some long run testing (current
results are 5 mins run).

Idea is same what Robert and Amit suggested up thread.

So this seems promising, but I think the code needs some work.

LockWaiterCount() bravely accesses a shared memory data structure that
is mutable with no locking at all. That might actually be safe for
our purposes, but I think it would be better to take the partition
lock in shared mode if it doesn't cost too much performance. If
that's too expensive, then it should at least have a comment
explaining (1) that it is doing this without the lock and (2) why
that's safe (sketch: the LOCK can't go away because we hold it, and
nRequested could change but we don't mind a stale value, and a 32-bit
read won't be torn).
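
The "stale but untorn" unlocked read sketched above can be illustrated with
C11 atomics. This is a standalone toy, not the lock manager's code: the
`nRequested` variable here is just a stand-in for LOCK->nRequested, and the
function names are invented for the example.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Toy stand-in for LOCK->nRequested: writers update it under their own
 * locking; an unlocked reader only needs an untorn, possibly stale value.
 */
atomic_uint_least32_t nRequested;

/* A backend joining the wait queue bumps the request count. */
void
requester_arrives(void)
{
    atomic_fetch_add_explicit(&nRequested, 1, memory_order_relaxed);
}

/*
 * Unlocked read: a relaxed atomic load of a 32-bit value can never be
 * torn, though the result may be stale by the time it is used -- which
 * is acceptable when it only sizes a batch heuristic.
 */
uint32_t
waiter_count_snapshot(void)
{
    return atomic_load_explicit(&nRequested, memory_order_relaxed);
}
```

The point of the sketch is the trade-off in the comment: a concurrent writer
may change the count between the load and its use, but the reader can never
observe a half-written value.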

A few of the other functions in here also lack comments, and perhaps
should have them.

The changes to RelationGetBufferForTuple need a visit from the
refactoring police. Look at the way you are calling
RelationAddOneBlock. The logic looks about like this:

if (needLock)
{
    if (trylock relation for extension)
        RelationAddOneBlock();
    else
    {
        lock relation for extension;
        if (use fsm)
        {
            complicated;
        }
        else
            RelationAddOneBlock();
    }
}
else
    RelationAddOneBlock();

So basically you want to do the RelationAddOneBlock() thing if
!needLock || !use_fsm || can't trylock. See if you can rearrange the
code so that there's only one fallthrough call to
RelationAddOneBlock() instead of three separate ones.

Also, consider moving the code that adds multiple blocks at a time to
its own function instead of including it all in line.
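
The suggested restructuring, with a single fallthrough call, might look like
the sketch below. The helper names loosely mirror the patch, but the bodies
are placeholder stubs (with counters so the flow can be checked), and the
FSM recheck/goto path is omitted for brevity -- this is an assumption about
the shape of the refactoring, not the committed code.

```c
#include <stdbool.h>

/* Stubs standing in for the real calls; counters let us check the flow. */
int one_block_calls, multi_block_calls;

bool try_lock_for_extension(bool contended) { return !contended; }
void lock_for_extension(void)   { /* blocking LockRelationForExtension */ }
void unlock_for_extension(void) { /* UnlockRelationForExtension */ }
void add_one_block(void)        { one_block_calls++; }
void add_extra_blocks(void)     { multi_block_calls++; }

/*
 * One fallthrough call to add_one_block(): only the contended,
 * FSM-using path does the extra multi-block work first.
 */
void
get_buffer_for_tuple(bool need_lock, bool use_fsm, bool contended)
{
    bool got_lock = false;

    if (need_lock)
    {
        if (!try_lock_for_extension(contended))
        {
            lock_for_extension();
            if (use_fsm)
                add_extra_blocks();   /* the "complicated" batch path */
        }
        got_lock = true;
    }

    add_one_block();                  /* the single fallthrough call */

    if (got_lock)
        unlock_for_extension();
}
```

Every path ends at the same add_one_block() call, which is the refactoring
asked for above: three duplicated call sites collapse into one.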

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#59Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#58)
1 attachment(s)
Re: Relation extension scalability

On Wed, Mar 9, 2016 at 1:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:

LockWaiterCount() bravely accesses a shared memory data structure that
is mutable with no locking at all. That might actually be safe for
our purposes, but I think it would be better to take the partition
lock in shared mode if it doesn't cost too much performance. If
that's too expensive, then it should at least have a comment
explaining (1) that it is doing this without the lock and (2) why
that's safe (sketch: the LOCK can't go away because we hold it, and
nRequested could change but we don't mind a stale value, and a 32-bit
read won't be torn).

Performance is equally good with the LWLock as well, so I have added the lock.

A few of the other functions in here also lack comments, and perhaps
should have them.

The changes to RelationGetBufferForTuple need a visit from the
refactoring police. Look at the way you are calling
RelationAddOneBlock. The logic looks about like this:

if (needLock)
{
    if (trylock relation for extension)
        RelationAddOneBlock();
    else
    {
        lock relation for extension;
        if (use fsm)
        {
            complicated;
        }
        else
            RelationAddOneBlock();
    }
}
else
    RelationAddOneBlock();

So basically you want to do the RelationAddOneBlock() thing if
!needLock || !use_fsm || can't trylock. See if you can rearrange the
code so that there's only one fallthrough call to
RelationAddOneBlock() instead of three separate ones.

Actually, in every case we need one block, so I have refactored it and the
RelationAddOneBlock call is now outside any condition.

Also, consider moving the code that adds multiple blocks at a time to
its own function instead of including it all in line.

Done

Attaching the latest patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

multi_extend_v6.patch (text/x-patch; charset=US-ASCII)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 8140418..eb4ee0c 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -169,6 +169,86 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
 }
 
 /*
+ * RelationAddOneBlock
+ *
+ * Extend relation by one block and lock the buffer
+ */
+static Buffer
+RelationAddOneBlock(Relation relation, Buffer otherBuffer, BulkInsertState bistate)
+{
+	Buffer buffer;
+	/*
+	 * XXX This does an lseek - rather expensive - but at the moment it is the
+	 * only way to accurately determine how many blocks are in a relation.  Is
+	 * it worth keeping an accurate file length in shared memory someplace,
+	 * rather than relying on the kernel to do it for us?
+	 */
+	buffer = ReadBufferBI(relation, P_NEW, bistate);
+
+	/*
+	 * We can be certain that locking the otherBuffer first is OK, since
+	 * it must have a lower page number.
+	 */
+	if (otherBuffer != InvalidBuffer)
+		LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+	/*
+	 * Now acquire lock on the new page.
+	 */
+	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+	return buffer;
+}
+/*
+ * RelationAddExtraBlocks
+ *
+ * Extend the relation by extra blocks, to avoid future contention
+ * on the relation extension lock.
+ */
+
+static void
+RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
+{
+	Page		page;
+	Size		freespace;
+	BlockNumber blockNum;
+	int			extraBlocks = 0;
+	int 		lockWaiters = 0;
+	Buffer		buffer;
+
+	/*
+	 * To calculate the number of extra blocks to extend by, find the
+	 * level of contention on this lock by counting its requesters, and
+	 * add extra blocks in a multiple of the number of waiters.
+	 */
+	lockWaiters = RelationExtensionLockWaiter(relation);
+
+	extraBlocks = lockWaiters * 20;
+
+	while (extraBlocks--)
+	{
+		/*
+		 * Here we are adding extra blocks to the relation; after
+		 * adding each block, update the information in the FSM so
+		 * that other backends running in parallel can find it.
+		 */
+		buffer = ReadBufferBI(relation, P_NEW, bistate);
+
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+		page = BufferGetPage(buffer);
+		PageInit(page, BufferGetPageSize(buffer), 0);
+		freespace = PageGetHeapFreeSpace(page);
+		MarkBufferDirty(buffer);
+		blockNum = BufferGetBlockNumber(buffer);
+
+		UnlockReleaseBuffer(buffer);
+
+		RecordPageWithFreeSpace(relation, blockNum, freespace);
+	}
+}
+
+/*
  * RelationGetBufferForTuple
  *
  *	Returns pinned and exclusive-locked buffer of a page in given relation
@@ -233,10 +313,11 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	bool		use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
 	Buffer		buffer = InvalidBuffer;
 	Page		page;
-	Size		pageFreeSpace,
-				saveFreeSpace;
+	Size		pageFreeSpace = 0,
+				saveFreeSpace = 0;
 	BlockNumber targetBlock,
-				otherBlock;
+				otherBlock,
+				lastValidBlock = InvalidBlockNumber;
 	bool		needLock;
 
 	len = MAXALIGN(len);		/* be conservative */
@@ -308,6 +389,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
 		}
 	}
 
+loop:
 	while (targetBlock != InvalidBlockNumber)
 	{
 		/*
@@ -388,6 +470,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
 								 otherBlock, targetBlock, vmbuffer_other,
 								 vmbuffer);
 
+		lastValidBlock = targetBlock;
+
 		/*
 		 * Now we can check to see if there's enough free space here. If so,
 		 * we're done.
@@ -441,37 +525,53 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	needLock = !RELATION_IS_LOCAL(relation);
 
 	if (needLock)
-		LockRelationForExtension(relation, ExclusiveLock);
-
-	/*
-	 * XXX This does an lseek - rather expensive - but at the moment it is the
-	 * only way to accurately determine how many blocks are in a relation.  Is
-	 * it worth keeping an accurate file length in shared memory someplace,
-	 * rather than relying on the kernel to do it for us?
-	 */
-	buffer = ReadBufferBI(relation, P_NEW, bistate);
-
-	/*
-	 * We can be certain that locking the otherBuffer first is OK, since it
-	 * must have a lower page number.
-	 */
-	if (otherBuffer != InvalidBuffer)
-		LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
+	{
+		/*
+		 * First try to get the lock in no-wait mode; if that succeeds,
+		 * extend by one block.  Otherwise, take the lock in normal mode
+		 * and, once we have it, extend by some extra blocks; the extra
+		 * blocks satisfy the requests of other waiters and avoid future
+		 * contention.  We try no-wait mode first so that the uncontended
+		 * case does not pay the cost of counting the lock waiters and
+		 * executing the extra instructions.
+		 */
+		if (LOCKACQUIRE_OK
+				!= RelationExtensionLockConditional(relation, ExclusiveLock))
+		{
+			LockRelationForExtension(relation, ExclusiveLock);
+			
+			if (use_fsm)
+			{
+				if (lastValidBlock != InvalidBlockNumber)
+				{
+					targetBlock = RecordAndGetPageWithFreeSpace(relation,
+														lastValidBlock,
+														pageFreeSpace,
+														len + saveFreeSpace);
+				}	
+
+				/* Other waiter has extended the block for us*/
+				if (targetBlock != InvalidBlockNumber)
+				{
+					UnlockRelationForExtension(relation, ExclusiveLock);
+					goto loop;
+				}
+				
+				RelationAddExtraBlocks(relation, bistate);
+			}
+		}
+	}
 
-	/*
-	 * Now acquire lock on the new page.
+	/*
+	 * In all cases we need to add at least one block to satisfy our own request.
 	 */
-	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+	buffer = RelationAddOneBlock(relation, otherBuffer, bistate);
 
-	/*
-	 * Release the file-extension lock; it's now OK for someone else to extend
-	 * the relation some more.  Note that we cannot release this lock before
-	 * we have buffer lock on the new page, or we risk a race condition
-	 * against vacuumlazy.c --- see comments therein.
-	 */
 	if (needLock)
 		UnlockRelationForExtension(relation, ExclusiveLock);
 
+
+
 	/*
 	 * We need to initialize the empty new page.  Double-check that it really
 	 * is empty (this should never happen, but if it does we don't want to
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 9d16afb..5d9454d 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -341,6 +341,40 @@ LockRelationForExtension(Relation relation, LOCKMODE lockmode)
 }
 
 /*
+ *		RelationExtensionLockConditional
+ *
+ * Same as LockRelationForExtension except it will not wait on the lock.
+ */
+LockAcquireResult
+RelationExtensionLockConditional(Relation relation, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return LockAcquire(&tag, lockmode, false, true);
+}
+
+/*
+ *		RelationExtensionLockWaiter
+ *
+ * Count the lock requester for the RelationExtension lock.
+ */
+int
+RelationExtensionLockWaiter(Relation relation)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return LockWaiterCount(&tag);
+}
+
+/*
  *		UnlockRelationForExtension
  */
 void
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index a458c68..cf723ae 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -4380,3 +4380,47 @@ VirtualXactLock(VirtualTransactionId vxid, bool wait)
 	LockRelease(&tag, ShareLock, false);
 	return true;
 }
+
+/*
+ * LockWaiterCount
+ *
+ * Find the number of lock requester on this locktag
+ */
+int
+LockWaiterCount(const LOCKTAG *locktag)
+{
+	LOCKMETHODID lockmethodid = locktag->locktag_lockmethodid;
+	LOCALLOCKTAG localtag;
+	LOCALLOCK  *locallock;
+	bool		found;
+	uint32		hashcode;
+	LWLock		*partitionLock;
+	int			waiters = 0;
+
+	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
+		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
+
+	/*
+	 * Find a LOCALLOCK entry for this lock and lockmode
+	 */
+	MemSet(&localtag, 0, sizeof(localtag));		/* must clear padding */
+	localtag.lock = *locktag;
+	localtag.mode = ExclusiveLock;
+
+	locallock = (LOCALLOCK *) hash_search(LockMethodLocalHash,
+										(void *) &localtag,
+										HASH_FIND, &found);
+
+	/* If we never acquired this lock, there are no waiters to count. */
+	if (found)
+	{
+		hashcode = locallock->hashcode;
+		partitionLock = LockHashPartitionLock(hashcode);
+
+		LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+		waiters = locallock->lock->nRequested;
+		LWLockRelease(partitionLock);
+	}
+
+	return waiters;
+}
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index e9d41bf..ea8e19e 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -101,4 +101,7 @@ extern void UnlockSharedObjectForSession(Oid classid, Oid objid, uint16 objsubid
 /* Describe a locktag for error messages */
 extern void DescribeLockTag(StringInfo buf, const LOCKTAG *tag);
 
+extern LockAcquireResult RelationExtensionLockConditional(Relation relation, LOCKMODE lockmode);
+extern int RelationExtensionLockWaiter(Relation relation);
+
 #endif   /* LMGR_H */
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 788d50a..3fd74fb 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -572,6 +572,8 @@ extern void RememberSimpleDeadLock(PGPROC *proc1,
 					   PGPROC *proc2);
 extern void InitDeadLockChecking(void);
 
+extern int LockWaiterCount(const LOCKTAG *locktag);
+
 #ifdef LOCK_DEBUG
 extern void DumpLocks(PGPROC *proc);
 extern void DumpAllLocks(void);
#60Petr Jelinek
petr@2ndquadrant.com
In reply to: Dilip Kumar (#59)
Re: Relation extension scalability

On 10/03/16 02:53, Dilip Kumar wrote:

Attaching a latest patch.

Hmm, why did you remove the comment above the call to
UnlockRelationForExtension? It still seems relevant, maybe with some
minor modification?

Also there is a bit of whitespace mess inside the conditional lock block.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#61Dilip Kumar
dilipbalaut@gmail.com
In reply to: Petr Jelinek (#60)
1 attachment(s)
Re: Relation extension scalability

On Thu, Mar 10, 2016 at 7:55 AM, Petr Jelinek <petr@2ndquadrant.com> wrote:

Thanks for the comments..

Hmm, why did you remove the comment above the call to
UnlockRelationForExtension?

While refactoring I lost this comment. Fixed it.

It still seems relevant, maybe with some minor modification?

Also there is a bit of whitespace mess inside the conditional lock block.

Fixed

I got the results of the 10-minute runs, so I am posting them.
Note: base-code results are copied from upthread.

Results for a 10-minute run of COPY of 10000 records of size 4 bytes; script
and configuration are the same as used upthread
--------------------------------------------------------------------------------------------

Client    Base    Patch
1          105      111
2          217      246
4          210      428
8          166      653
16         145      808
32         124      988
64         ---      974

Results for a 10-minute run of INSERT of 1000 records of size 1024 bytes
(data doesn't fit in shared buffers)
--------------------------------------------------------------------------------------------------

Client    Base    Patch
1          117      120
2          111      126
4           51      130
8           43      147
16          40      209
32         ---      254
64         ---      205

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

multi_extend_v7.patch (text/x-patch; charset=US-ASCII)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 8140418..b73535c 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -169,6 +169,87 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
 }
 
 /*
+ * RelationAddOneBlock
+ *
+ * Extend relation by one block and lock the buffer
+ */
+static Buffer
+RelationAddOneBlock(Relation relation, Buffer otherBuffer, BulkInsertState bistate)
+{
+	Buffer buffer;
+	/*
+	 * XXX This does an lseek - rather expensive - but at the moment it is the
+	 * only way to accurately determine how many blocks are in a relation.  Is
+	 * it worth keeping an accurate file length in shared memory someplace,
+	 * rather than relying on the kernel to do it for us?
+	 */
+	buffer = ReadBufferBI(relation, P_NEW, bistate);
+
+	/*
+	 * We can be certain that locking the otherBuffer first is OK, since
+	 * it must have a lower page number.
+	 */
+	if (otherBuffer != InvalidBuffer)
+		LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+	/*
+	 * Now acquire lock on the new page.
+	 */
+	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+	return buffer;
+}
+
+/*
+ * RelationAddExtraBlocks
+ *
+ * Extend the relation by extra blocks, to avoid future contention
+ * on the relation extension lock.
+ */
+
+static void
+RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
+{
+	Page		page;
+	Size		freespace;
+	BlockNumber blockNum;
+	int			extraBlocks = 0;
+	int 		lockWaiters = 0;
+	Buffer		buffer;
+
+	/*
+	 * To calculate the number of extra blocks to extend by, find the
+	 * level of contention on this lock by counting its requesters, and
+	 * add extra blocks in a multiple of the number of waiters.
+	 */
+	lockWaiters = RelationExtensionLockWaiter(relation);
+
+	extraBlocks = lockWaiters * 20;
+
+	while (extraBlocks--)
+	{
+		/*
+		 * Here we are adding extra blocks to the relation; after
+		 * adding each block, update the information in the FSM so
+		 * that other backends running in parallel can find it.
+		 */
+		buffer = ReadBufferBI(relation, P_NEW, bistate);
+
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+		page = BufferGetPage(buffer);
+		PageInit(page, BufferGetPageSize(buffer), 0);
+		freespace = PageGetHeapFreeSpace(page);
+		MarkBufferDirty(buffer);
+		blockNum = BufferGetBlockNumber(buffer);
+
+		UnlockReleaseBuffer(buffer);
+
+		RecordPageWithFreeSpace(relation, blockNum, freespace);
+	}
+}
+
+/*
  * RelationGetBufferForTuple
  *
  *	Returns pinned and exclusive-locked buffer of a page in given relation
@@ -233,10 +314,11 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	bool		use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
 	Buffer		buffer = InvalidBuffer;
 	Page		page;
-	Size		pageFreeSpace,
-				saveFreeSpace;
+	Size		pageFreeSpace = 0,
+				saveFreeSpace = 0;
 	BlockNumber targetBlock,
-				otherBlock;
+				otherBlock,
+				lastValidBlock = InvalidBlockNumber;
 	bool		needLock;
 
 	len = MAXALIGN(len);		/* be conservative */
@@ -308,6 +390,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
 		}
 	}
 
+loop:
 	while (targetBlock != InvalidBlockNumber)
 	{
 		/*
@@ -388,6 +471,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
 								 otherBlock, targetBlock, vmbuffer_other,
 								 vmbuffer);
 
+		lastValidBlock = targetBlock;
+
 		/*
 		 * Now we can check to see if there's enough free space here. If so,
 		 * we're done.
@@ -441,27 +526,47 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	needLock = !RELATION_IS_LOCAL(relation);
 
 	if (needLock)
-		LockRelationForExtension(relation, ExclusiveLock);
-
-	/*
-	 * XXX This does an lseek - rather expensive - but at the moment it is the
-	 * only way to accurately determine how many blocks are in a relation.  Is
-	 * it worth keeping an accurate file length in shared memory someplace,
-	 * rather than relying on the kernel to do it for us?
-	 */
-	buffer = ReadBufferBI(relation, P_NEW, bistate);
-
-	/*
-	 * We can be certain that locking the otherBuffer first is OK, since it
-	 * must have a lower page number.
-	 */
-	if (otherBuffer != InvalidBuffer)
-		LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
+	{
+		/*
+		 * First try to get the lock in no-wait mode; if that succeeds,
+		 * extend by one block.  Otherwise, take the lock in normal mode
+		 * and, once we have it, extend by some extra blocks; the extra
+		 * blocks satisfy the requests of other waiters and avoid future
+		 * contention.  We try no-wait mode first so that the uncontended
+		 * case does not pay the cost of counting the lock waiters and
+		 * executing the extra instructions.
+		 */
+		if (LOCKACQUIRE_OK
+				!= RelationExtensionLockConditional(relation, ExclusiveLock))
+		{
+			LockRelationForExtension(relation, ExclusiveLock);
+
+			if (use_fsm)
+			{
+				if (lastValidBlock != InvalidBlockNumber)
+				{
+					targetBlock = RecordAndGetPageWithFreeSpace(relation,
+														lastValidBlock,
+														pageFreeSpace,
+														len + saveFreeSpace);
+				}
+
+				/* Other waiter has extended the block for us*/
+				if (targetBlock != InvalidBlockNumber)
+				{
+					UnlockRelationForExtension(relation, ExclusiveLock);
+					goto loop;
+				}
+
+				RelationAddExtraBlocks(relation, bistate);
+			}
+		}
+	}
 
-	/*
-	 * Now acquire lock on the new page.
+	/*
+	 * In all cases we need to add at least one block to satisfy our own request.
 	 */
-	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+	buffer = RelationAddOneBlock(relation, otherBuffer, bistate);
 
 	/*
 	 * Release the file-extension lock; it's now OK for someone else to extend
@@ -472,6 +577,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	if (needLock)
 		UnlockRelationForExtension(relation, ExclusiveLock);
 
+
+
 	/*
 	 * We need to initialize the empty new page.  Double-check that it really
 	 * is empty (this should never happen, but if it does we don't want to
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 9d16afb..5d9454d 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -341,6 +341,40 @@ LockRelationForExtension(Relation relation, LOCKMODE lockmode)
 }
 
 /*
+ *		RelationExtensionLockConditional
+ *
+ * Same as LockRelationForExtension except it will not wait on the lock.
+ */
+LockAcquireResult
+RelationExtensionLockConditional(Relation relation, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return LockAcquire(&tag, lockmode, false, true);
+}
+
+/*
+ *		RelationExtensionLockWaiter
+ *
+ * Count the lock requester for the RelationExtension lock.
+ */
+int
+RelationExtensionLockWaiter(Relation relation)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return LockWaiterCount(&tag);
+}
+
+/*
  *		UnlockRelationForExtension
  */
 void
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index a458c68..cf723ae 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -4380,3 +4380,47 @@ VirtualXactLock(VirtualTransactionId vxid, bool wait)
 	LockRelease(&tag, ShareLock, false);
 	return true;
 }
+
+/*
+ * LockWaiterCount
+ *
+ * Find the number of lock requester on this locktag
+ */
+int
+LockWaiterCount(const LOCKTAG *locktag)
+{
+	LOCKMETHODID lockmethodid = locktag->locktag_lockmethodid;
+	LOCALLOCKTAG localtag;
+	LOCALLOCK  *locallock;
+	bool		found;
+	uint32		hashcode;
+	LWLock		*partitionLock;
+	int			waiters = 0;
+
+	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
+		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
+
+	/*
+	 * Find a LOCALLOCK entry for this lock and lockmode
+	 */
+	MemSet(&localtag, 0, sizeof(localtag));		/* must clear padding */
+	localtag.lock = *locktag;
+	localtag.mode = ExclusiveLock;
+
+	locallock = (LOCALLOCK *) hash_search(LockMethodLocalHash,
+										(void *) &localtag,
+										HASH_FIND, &found);
+
+	/* If we never acquired this lock, there are no waiters to count. */
+	if (found)
+	{
+		hashcode = locallock->hashcode;
+		partitionLock = LockHashPartitionLock(hashcode);
+
+		LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+		waiters = locallock->lock->nRequested;
+		LWLockRelease(partitionLock);
+	}
+
+	return waiters;
+}
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index e9d41bf..ea8e19e 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -101,4 +101,7 @@ extern void UnlockSharedObjectForSession(Oid classid, Oid objid, uint16 objsubid
 /* Describe a locktag for error messages */
 extern void DescribeLockTag(StringInfo buf, const LOCKTAG *tag);
 
+extern LockAcquireResult RelationExtensionLockConditional(Relation relation, LOCKMODE lockmode);
+extern int RelationExtensionLockWaiter(Relation relation);
+
 #endif   /* LMGR_H */
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 788d50a..3fd74fb 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -572,6 +572,8 @@ extern void RememberSimpleDeadLock(PGPROC *proc1,
 					   PGPROC *proc2);
 extern void InitDeadLockChecking(void);
 
+extern int LockWaiterCount(const LOCKTAG *locktag);
+
 #ifdef LOCK_DEBUG
 extern void DumpLocks(PGPROC *proc);
 extern void DumpAllLocks(void);
#62Petr Jelinek
petr@2ndquadrant.com
In reply to: Dilip Kumar (#61)
Re: Relation extension scalability

On 10/03/16 09:57, Dilip Kumar wrote:

On Thu, Mar 10, 2016 at 7:55 AM, Petr Jelinek <petr@2ndquadrant.com
<mailto:petr@2ndquadrant.com>> wrote:

Thanks for the comments..

Hmm, why did you remove the comment above the call to
UnlockRelationForExtension?

While re factoring I lose this comment.. Fixed it

It still seems relevant, maybe with some minor modification?

Also there is a bit of whitespace mess inside the conditional lock
block.

Fixed

I got the result of 10 mins run so posting it..
Note: Base code results are copied from up thread...

Results for a 10-minute run of COPY of 10000 records of size 4 bytes; script
and configuration are the same as used upthread
--------------------------------------------------------------------------------------------

Client    Base    Patch
1          105      111
2          217      246
4          210      428
8          166      653
16         145      808
32         124      988
64         ---      974

Results for a 10-minute run of INSERT of 1000 records of size 1024 bytes
(data doesn't fit in shared buffers)
--------------------------------------------------------------------------------------------------

Client    Base    Patch
1          117      120
2          111      126
4           51      130
8           43      147
16          40      209
32         ---      254
64         ---      205

Those look good. The patch looks good in general now. I am a bit scared by
the lockWaiters * 20, as it can result in relatively big extensions in rare
corner cases, when for example a lot of backends were waiting for a lock on
the relation and suddenly all try to extend it. I wonder if we should clamp
it to something sane (although what's sane today might be small in a
couple of years).

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#63Dilip Kumar
dilipbalaut@gmail.com
In reply to: Petr Jelinek (#62)
Re: Relation extension scalability

On Fri, Mar 11, 2016 at 12:04 AM, Petr Jelinek <petr@2ndquadrant.com> wrote:

Thanks for looking..

Those look good. The patch looks good in general now. I am bit scared by

the lockWaiters * 20 as it can result in relatively big changes in rare
corner cases when for example a lot of backends were waiting for lock on
relation and suddenly all try to extend it. I wonder if we should clamp it
to something sane (although what's sane today might be small in couple of
years).

But in such a condition, when all are waiting on the lock, only one waiter
at a time will get the lock; it will count the lock waiters and extend in a
multiple of that. And we also have the check that when any waiter gets the
lock, it will first check in the FSM whether anybody else has already added
a block.

And the other waiters will not get the lock until the first waiter has
extended all the blocks and released the lock.

One possible case is that as soon as we extend the blocks, new requesters
find them directly in the FSM and never come for the lock, while an old
waiter, after getting the lock, finds nothing in the FSM. But IMHO in such
cases it is also good that the other waiters extend more blocks (because
this can happen when the request rate is very high).

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#64Petr Jelinek
petr@2ndquadrant.com
In reply to: Dilip Kumar (#63)
Re: Relation extension scalability

On 11/03/16 02:44, Dilip Kumar wrote:

On Fri, Mar 11, 2016 at 12:04 AM, Petr Jelinek <petr@2ndquadrant.com
<mailto:petr@2ndquadrant.com>> wrote:

Thanks for looking..

Those look good. The patch looks good in general now. I am bit
scared by the lockWaiters * 20 as it can result in relatively big
changes in rare corner cases when for example a lot of backends were
waiting for lock on relation and suddenly all try to extend it. I
wonder if we should clamp it to something sane (although what's sane
today might be small in couple of years).

But in such condition when all are waiting on lock, then at a time only
one waiter will get the lock and that will easily count the lock waiter
and extend in multiple of that. And we also have the check that if any
waiter get the lock it will first check in FSM that anybody else have
added one block or not..

I am not talking about extension locks; the lock queue can be long because
there is concurrent DDL, for example, and then once the DDL finishes,
suddenly 100 connections that tried to insert into the table will try to
get the extension lock, and this will add 2000 new pages when far fewer
were actually needed. I guess that's fine as it's a corner case, and it's
only 16MB even in such an extreme case.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#65Robert Haas
robertmhaas@gmail.com
In reply to: Petr Jelinek (#64)
Re: Relation extension scalability

On Thu, Mar 10, 2016 at 8:54 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:

I am not talking about extension locks, the lock queue can be long because
there is concurrent DDL for example and then once DDL finishes suddenly 100
connections that tried to insert into table will try to get extension lock
and this will add 2000 new pages when much fewer was actually needed. I
guess that's fine as it's corner case and it's only 16MB even in such
extreme case.

I don't really understand this part about concurrent DDL. If there
were concurrent DDL going on, presumably other backends would be
blocked on the relation lock, not the relation extension lock - and it
doesn't seem likely that you'd often have a huge pile-up of inserters
waiting on concurrent DDL. But I guess it could happen.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#66Petr Jelinek
petr@2ndquadrant.com
In reply to: Robert Haas (#65)
Re: Relation extension scalability

On 11/03/16 22:29, Robert Haas wrote:

On Thu, Mar 10, 2016 at 8:54 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:

I am not talking about extension locks, the lock queue can be long because
there is concurrent DDL for example and then once DDL finishes suddenly 100
connections that tried to insert into table will try to get extension lock
and this will add 2000 new pages when much fewer was actually needed. I
guess that's fine as it's corner case and it's only 16MB even in such
extreme case.

I don't really understand this part about concurrent DDL. If there
were concurrent DDL going on, presumably other backends would be
blocked on the relation lock, not the relation extension lock - and it
doesn't seem likely that you'd often have a huge pile-up of inserters
waiting on concurrent DDL. But I guess it could happen.

Yeah, I was thinking about the latter part, and as I said it's a very rare
case, but I did see something similar a couple of times in the wild. It's
not an objection against committing this patch though; in fact I think it
can be committed as is.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#67Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Petr Jelinek (#66)
Re: Relation extension scalability

On 3/11/16 5:14 PM, Petr Jelinek wrote:

I don't really understand this part about concurrent DDL. If there
were concurrent DDL going on, presumably other backends would be
blocked on the relation lock, not the relation extension lock - and it
doesn't seem likely that you'd often have a huge pile-up of inserters
waiting on concurrent DDL. But I guess it could happen.

Yeah I was thinking about the latter part and as I said it's very rare
case, but I did see something similar couple of times in the wild. It's
not objection against committing this patch though, in fact I think it
can be committed as is.

FWIW, this is definitely a real possibility in any shop that has very
high downtime costs and high transaction rates.

I also think some kind of clamp is a good idea. It's not that uncommon
to run max_connections significantly higher than 100, so the extension
could be way larger than 16MB. In those cases this patch could actually
make things far worse as everyone backs up waiting on the OS to extend
many MB when all you actually needed were a couple dozen more pages.

BTW, how was *20 arrived at? ISTM that if you have a lot of concurrent
demand for extension that means you're running lots of small DML
operations, not really big ones. I'd think that would make *1 more
appropriate.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


#68Dilip Kumar
dilipbalaut@gmail.com
In reply to: Jim Nasby (#67)
Re: Relation extension scalability

On Sat, Mar 12, 2016 at 5:31 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

FWIW, this is definitely a real possibility in any shop that has very high
downtime costs and high transaction rates.

I also think some kind of clamp is a good idea. It's not that uncommon to
run max_connections significantly higher than 100, so the extension could
be way larger than 16MB. In those cases this patch could actually make
things far worse as everyone backs up waiting on the OS to extend many MB
when all you actually needed were a couple dozen more pages.

I agree, we can have some maximum limit on the number of extra pages. What
do others think?

BTW, how was *20 arrived at? ISTM that if you have a lot of concurrent
demand for extension that means you're running lots of small DML
operations, not really big ones. I'd think that would make *1 more
appropriate.

*1 will not solve this problem. Here the main problem was that many
backends sleep/wake up on the extension lock, and that was causing the
bottleneck. If we do *1, that satisfies only the current requesters which
have already waited on the lock, but our goal is to stop backends from
requesting this lock in the first place.

The idea is to find the number of requesters, to get statistics on this
lock (the load on the lock), and extend in a multiple of that load, so
that this situation is avoided for a long time; when it happens again, we
again extend in a multiple of the load.

How did 20 come about?
I tested with multiple client loads (1..64) and multiple record sizes (4
bytes to 1KB) using COPY/INSERT, and found that 20 works best.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#69Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#68)
Re: Relation extension scalability

On Sat, Mar 12, 2016 at 8:16 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sat, Mar 12, 2016 at 5:31 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

FWIW, this is definitely a real possibility in any shop that has very
high downtime costs and high transaction rates.

I also think some kind of clamp is a good idea. It's not that uncommon
to run max_connections significantly higher than 100, so the extension
could be way larger than 16MB. In those cases this patch could actually
make things far worse as everyone backs up waiting on the OS to extend
many MB when all you actually needed were a couple dozen more pages.

I agree, We can have some max limit on number of extra pages, What other
thinks ?

BTW, how was *20 arrived at? ISTM that if you have a lot of concurrent
demand for extension that means you're running lots of small DML
operations, not really big ones. I'd think that would make *1 more
appropriate.

*1 will not solve this problem, Here the main problem was many people are
sleep/wakeup on the extension lock and that was causing the bottleneck. So
if we do *1 this will satisfy only current requesters which has already
waited on the lock. But our goal is to avoid backends from requesting this
lock.

Idea of Finding the requester to get the statistics on this locks (load
on the lock) and extend in multiple of load so that in future this
situation will be avoided for long time and again when happen next time
extend in multiple of load.

How 20 comes ?
I tested with Multiple clients loads 1..64, with multiple load size 4
byte records to 1KB Records, COPY/ INSERT and found 20 works best.

Can you post the numbers for 1, 5, 10, 15, 25 or whatever other multiplier
you have tried, so that it is clear that 20 is best?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#70Petr Jelinek
petr@2ndquadrant.com
In reply to: Jim Nasby (#67)
Re: Relation extension scalability

On 12/03/16 01:01, Jim Nasby wrote:

On 3/11/16 5:14 PM, Petr Jelinek wrote:

I don't really understand this part about concurrent DDL. If there
were concurrent DDL going on, presumably other backends would be
blocked on the relation lock, not the relation extension lock - and it
doesn't seem likely that you'd often have a huge pile-up of inserters
waiting on concurrent DDL. But I guess it could happen.

Yeah I was thinking about the latter part and as I said it's very rare
case, but I did see something similar couple of times in the wild. It's
not objection against committing this patch though, in fact I think it
can be committed as is.

FWIW, this is definitely a real possibility in any shop that has very
high downtime costs and high transaction rates.

I also think some kind of clamp is a good idea. It's not that uncommon
to run max_connections significantly higher than 100, so the extension
could be way larger than 16MB. In those cases this patch could actually
make things far worse as everyone backs up waiting on the OS to extend
many MB when all you actually needed were a couple dozen more pages.

Well yeah, I've seen 10k, but not everything will write to the same
table; I wanted to stay within the realm of something that has a
realistic chance of happening.

BTW, how was *20 arrived at? ISTM that if you have a lot of concurrent
demand for extension that means you're running lots of small DML
operations, not really big ones. I'd think that would make *1 more
appropriate.

The benchmarks I've seen showed you want at least *10 and *20 was better.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#71Petr Jelinek
petr@2ndquadrant.com
In reply to: Dilip Kumar (#68)
Re: Relation extension scalability

On 12/03/16 03:46, Dilip Kumar wrote:

On Sat, Mar 12, 2016 at 5:31 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

FWIW, this is definitely a real possibility in any shop that has
very high downtime costs and high transaction rates.

I also think some kind of clamp is a good idea. It's not that
uncommon to run max_connections significantly higher than 100, so
the extension could be way larger than 16MB. In those cases this
patch could actually make things far worse as everyone backs up
waiting on the OS to extend many MB when all you actually needed
were a couple dozen more pages.

I agree, We can have some max limit on number of extra pages, What other
thinks ?

Well, that's what I meant with clamping originally. I don't know what is
a good value though.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#72Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#69)
Re: Relation extension scalability

On Sat, Mar 12, 2016 at 8:37 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

Can you post the numbers for 1, 5, 10, 15, 25 or whatever other multiplier
you have tried, so that it is clear that 20 is best?

I had tried with 1, 10, 20 and 50.

1. With *1 it was almost the same as the base code.

2. With *10 the data matched my previous group-extend patch data posted
upthread:
/messages/by-id/CAFiTN-tyEu+Wf0-jBc3TGfCoHdEAjNTx=WVuxpoA1vDDyST6KQ@mail.gmail.com

3. Beyond 20 I did not see any extra benefit: with 50 the performance
numbers were comparable to 20 (tested with 4-byte COPY only; I did not
test other data sizes with 50).

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#73Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Petr Jelinek (#71)
Re: Relation extension scalability

On 3/11/16 9:57 PM, Petr Jelinek wrote:

I also think some kind of clamp is a good idea. It's not that
uncommon to run max_connections significantly higher than 100, so
the extension could be way larger than 16MB. In those cases this
patch could actually make things far worse as everyone backs up
waiting on the OS to extend many MB when all you actually needed
were a couple dozen more pages.

I agree, We can have some max limit on number of extra pages, What other
thinks ?

Well, that's what I meant with clamping originally. I don't know what is
a good value though.

Well, 16MB is 2K pages, which is what you'd get if 100 connections were
all blocked and we're doing 20 pages per waiter. That seems like a
really extreme scenario, so maybe 4MB is a good compromise. That's
unlikely to be hit in most cases, unlikely to put a ton of stress on IO,
even with magnetic media (assuming the whole 4MB is queued to write in
one shot...). 4MB would still reduce the number of locks by 500x.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


#74Dilip Kumar
dilipbalaut@gmail.com
In reply to: Jim Nasby (#73)
Re: Relation extension scalability

On Mon, Mar 14, 2016 at 5:02 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

Well, 16MB is 2K pages, which is what you'd get if 100 connections were
all blocked and we're doing 20 pages per waiter. That seems like a really
extreme scenario, so maybe 4MB is a good compromise. That's unlikely to be
hit in most cases, unlikely to put a ton of stress on IO, even with
magnetic media (assuming the whole 4MB is queued to write in one shot...).
4MB would still reduce the number of locks by 500x.

In my performance results given upthread, we get maximum performance at
32 clients, meaning that at a time we are extending at most 32*20 = 640
pages. So with the 4MB limit (max 512 pages) the results should look
similar. We need to decide whether 4MB is a good limit; should I change
it?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#75Petr Jelinek
petr@2ndquadrant.com
In reply to: Dilip Kumar (#74)
Re: Relation extension scalability

On 14/03/16 03:29, Dilip Kumar wrote:

On Mon, Mar 14, 2016 at 5:02 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

Well, 16MB is 2K pages, which is what you'd get if 100 connections
were all blocked and we're doing 20 pages per waiter. That seems
like a really extreme scenario, so maybe 4MB is a good compromise.
That's unlikely to be hit in most cases, unlikely to put a ton of
stress on IO, even with magnetic media (assuming the whole 4MB is
queued to write in one shot...). 4MB would still reduce the number
of locks by 500x.

In my performance results given up thread, we are getting max
performance at 32 clients, means at a time we are extending 32*20 ~= max
(600) pages at a time. So now with 4MB limit (max 512 pages) Results
will looks similar. So we need to take a decision whether 4MB is good
limit, should I change it ?

Well, any value we choose will be very arbitrary. If we look at it from
the point of view of the maximum absolute disk space we allocate for a
relation at once, the 4MB limit represents a change of about 2.5 orders
of magnitude. That sounds like enough for one release cycle; I think we
can further tune it if the need arises in the next one. (With my love
for round numbers I would have suggested 8MB, as that's 3 orders of
magnitude, but I am fine with 4MB as well.)

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#76Dilip Kumar
dilipbalaut@gmail.com
In reply to: Petr Jelinek (#75)
1 attachment(s)
Re: Relation extension scalability

On Mon, Mar 14, 2016 at 8:26 AM, Petr Jelinek <petr@2ndquadrant.com> wrote:

Well any value we choose will be very arbitrary. If we look at it from the
point of maximum absolute disk space we allocate for relation at once,
the 4MB limit would represent 2.5 orders of magnitude change. That sounds
like enough for one release cycle, I think we can further tune it if the
need arises in next one. (with my love for round numbers I would have
suggested 8MB as that's 3 orders of magnitude, but I am fine with 4MB as
well)

I have modified the patch; it now puts a max limit on the extra pages,
with 512 pages (4MB) as the cap.

I have measured the performance as well, and it looks equally good.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

multi_extend_v8.patchtext/x-patch; charset=US-ASCII; name=multi_extend_v8.patchDownload
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 8140418..fcd42b9 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -169,6 +169,94 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
 }
 
 /*
+ * RelationAddOneBlock
+ *
+ * Extend relation by one block and lock the buffer
+ */
+static Buffer
+RelationAddOneBlock(Relation relation, Buffer otherBuffer, BulkInsertState bistate)
+{
+	Buffer buffer;
+	/*
+	 * XXX This does an lseek - rather expensive - but at the moment it is the
+	 * only way to accurately determine how many blocks are in a relation.  Is
+	 * it worth keeping an accurate file length in shared memory someplace,
+	 * rather than relying on the kernel to do it for us?
+	 */
+	buffer = ReadBufferBI(relation, P_NEW, bistate);
+
+	/*
+	 * We can be certain that locking the otherBuffer first is OK, since
+	 * it must have a lower page number.
+	 */
+	if (otherBuffer != InvalidBuffer)
+		LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+	/*
+	 * Now acquire lock on the new page.
+	 */
+	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+	return buffer;
+}
+
+/*
+ * RelationAddExtraBlocks
+ *
+ * Extend extra blocks for the relations to avoid the future contention
+ * on the relation extension lock.
+ */
+
+static void
+RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
+{
+	Page		page;
+	Size		freespace;
+	BlockNumber blockNum;
+	int			extraBlocks = 0;
+	int 		lockWaiters = 0;
+	Buffer		buffer;
+
+	/*
+	 * For calculating number of extra blocks to extend, find the level
+	 * of contention on this lock, by getting the requester of this lock
+	 * and add extra blocks in multiple of waiters.
+	 */
+	lockWaiters = RelationExtensionLockWaiter(relation);
+
+	extraBlocks = lockWaiters * 20;
+
+	/* To avoid the cases when there are huge number of lock waiters, and
+	 * extend file size by big amount at a time, put some limit on the
+	 * max number of pages to be extended at a time.
+	 */
+	if (extraBlocks > 512)
+		extraBlocks = 512;
+
+	while (extraBlocks--)
+	{
+		/*
+		 * Here we are adding extra blocks to the relation after
+		 * adding each block update the information in FSM so that
+		 * other backend running parallel can find the block.
+		 */
+		buffer = ReadBufferBI(relation, P_NEW, bistate);
+
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+		page = BufferGetPage(buffer);
+		PageInit(page, BufferGetPageSize(buffer), 0);
+		freespace = PageGetHeapFreeSpace(page);
+		MarkBufferDirty(buffer);
+		blockNum = BufferGetBlockNumber(buffer);
+
+		UnlockReleaseBuffer(buffer);
+
+		RecordPageWithFreeSpace(relation, blockNum, freespace);
+	}
+}
+
+/*
  * RelationGetBufferForTuple
  *
  *	Returns pinned and exclusive-locked buffer of a page in given relation
@@ -233,10 +321,11 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	bool		use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
 	Buffer		buffer = InvalidBuffer;
 	Page		page;
-	Size		pageFreeSpace,
-				saveFreeSpace;
+	Size		pageFreeSpace = 0,
+				saveFreeSpace = 0;
 	BlockNumber targetBlock,
-				otherBlock;
+				otherBlock,
+				lastValidBlock = InvalidBlockNumber;
 	bool		needLock;
 
 	len = MAXALIGN(len);		/* be conservative */
@@ -308,6 +397,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
 		}
 	}
 
+loop:
 	while (targetBlock != InvalidBlockNumber)
 	{
 		/*
@@ -388,6 +478,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
 								 otherBlock, targetBlock, vmbuffer_other,
 								 vmbuffer);
 
+		lastValidBlock = targetBlock;
+
 		/*
 		 * Now we can check to see if there's enough free space here. If so,
 		 * we're done.
@@ -441,27 +533,47 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	needLock = !RELATION_IS_LOCAL(relation);
 
 	if (needLock)
-		LockRelationForExtension(relation, ExclusiveLock);
-
-	/*
-	 * XXX This does an lseek - rather expensive - but at the moment it is the
-	 * only way to accurately determine how many blocks are in a relation.  Is
-	 * it worth keeping an accurate file length in shared memory someplace,
-	 * rather than relying on the kernel to do it for us?
-	 */
-	buffer = ReadBufferBI(relation, P_NEW, bistate);
-
-	/*
-	 * We can be certain that locking the otherBuffer first is OK, since it
-	 * must have a lower page number.
-	 */
-	if (otherBuffer != InvalidBuffer)
-		LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
+	{
+		/*
+		 * First try to get the lock in no-wait mode, if succeed extend one
+		 * block, else get the lock in normal mode and after we get the lock
+		 * extend some extra blocks, extra blocks will be added to satisfy
+		 * request of other waiters and avoid future contention. Here instead
+		 * of directly taking lock we try no-wait mode, this is to handle the
+		 * case, when there is no contention - it should not find the lock
+		 * waiter and execute extra instructions.
+		 */
+		if (LOCKACQUIRE_OK
+				!= RelationExtensionLockConditional(relation, ExclusiveLock))
+		{
+			LockRelationForExtension(relation, ExclusiveLock);
+
+			if (use_fsm)
+			{
+				if (lastValidBlock != InvalidBlockNumber)
+				{
+					targetBlock = RecordAndGetPageWithFreeSpace(relation,
+														lastValidBlock,
+														pageFreeSpace,
+														len + saveFreeSpace);
+				}
+
+				/* Other waiter has extended the block for us*/
+				if (targetBlock != InvalidBlockNumber)
+				{
+					UnlockRelationForExtension(relation, ExclusiveLock);
+					goto loop;
+				}
+
+				RelationAddExtraBlocks(relation, bistate);
+			}
+		}
+	}
 
-	/*
-	 * Now acquire lock on the new page.
+	/* For all case we need to add at least one block to satisfy our
+	 * own request.
 	 */
-	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+	buffer = RelationAddOneBlock(relation, otherBuffer, bistate);
 
 	/*
 	 * Release the file-extension lock; it's now OK for someone else to extend
@@ -472,6 +584,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	if (needLock)
 		UnlockRelationForExtension(relation, ExclusiveLock);
 
+
+
 	/*
 	 * We need to initialize the empty new page.  Double-check that it really
 	 * is empty (this should never happen, but if it does we don't want to
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 9d16afb..5d9454d 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -341,6 +341,40 @@ LockRelationForExtension(Relation relation, LOCKMODE lockmode)
 }
 
 /*
+ *		RelationExtensionLockConditional
+ *
+ * Same as LockRelationForExtension except it will not wait on the lock.
+ */
+LockAcquireResult
+RelationExtensionLockConditional(Relation relation, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return LockAcquire(&tag, lockmode, false, true);
+}
+
+/*
+ *		RelationExtensionLockWaiter
+ *
+ * Count the lock requester for the RelationExtension lock.
+ */
+int
+RelationExtensionLockWaiter(Relation relation)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return LockWaiterCount(&tag);
+}
+
+/*
  *		UnlockRelationForExtension
  */
 void
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index a458c68..cf723ae 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -4380,3 +4380,47 @@ VirtualXactLock(VirtualTransactionId vxid, bool wait)
 	LockRelease(&tag, ShareLock, false);
 	return true;
 }
+
+/*
+ * LockWaiterCount
+ *
+ * Find the number of lock requester on this locktag
+ */
+int
+LockWaiterCount(const LOCKTAG *locktag)
+{
+	LOCKMETHODID lockmethodid = locktag->locktag_lockmethodid;
+	LOCALLOCKTAG localtag;
+	LOCALLOCK  *locallock;
+	bool		found;
+	uint32		hashcode;
+	LWLock		*partitionLock;
+	int			waiters = 0;
+
+	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
+		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
+
+	/*
+	 * Find a LOCALLOCK entry for this lock and lockmode
+	 */
+	MemSet(&localtag, 0, sizeof(localtag));		/* must clear padding */
+	localtag.lock = *locktag;
+	localtag.mode = ExclusiveLock;
+
+	locallock = (LOCALLOCK *) hash_search(LockMethodLocalHash,
+										(void *) &localtag,
+										HASH_FIND, &found);
+
+	if (!found)
+		return 0;
+
+	hashcode = locallock->hashcode;
+	partitionLock = LockHashPartitionLock(hashcode);
+	LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+
+	waiters = locallock->lock->nRequested;
+
+	LWLockRelease(partitionLock);
+
+	return waiters;
+}
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index e9d41bf..ea8e19e 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -101,4 +101,7 @@ extern void UnlockSharedObjectForSession(Oid classid, Oid objid, uint16 objsubid
 /* Describe a locktag for error messages */
 extern void DescribeLockTag(StringInfo buf, const LOCKTAG *tag);
 
+extern LockAcquireResult RelationExtensionLockConditional(Relation relation, LOCKMODE lockmode);
+extern int RelationExtensionLockWaiter(Relation relation);
+
 #endif   /* LMGR_H */
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 788d50a..3fd74fb 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -572,6 +572,8 @@ extern void RememberSimpleDeadLock(PGPROC *proc1,
 					   PGPROC *proc2);
 extern void InitDeadLockChecking(void);
 
+extern int LockWaiterCount(const LOCKTAG *locktag);
+
 #ifdef LOCK_DEBUG
 extern void DumpLocks(PGPROC *proc);
 extern void DumpAllLocks(void);
#77Petr Jelinek
petr@2ndquadrant.com
In reply to: Dilip Kumar (#76)
Re: Relation extension scalability

On 17/03/16 04:42, Dilip Kumar wrote:

On Mon, Mar 14, 2016 at 8:26 AM, Petr Jelinek <petr@2ndquadrant.com> wrote:

Well any value we choose will be very arbitrary. If we look at it
from the point of maximum absolute disk space we allocate for
relation at once, the 4MB limit would represent 2.5 orders of
magnitude change. That sounds like enough for one release cycle, I
think we can further tune it if the need arises in next one. (with
my love for round numbers I would have suggested 8MB as that's 3
orders of magnitude, but I am fine with 4MB as well)

I have modified the patch, this contains the max limit on extra pages,
512(4MB) pages is the max limit.

I have measured the performance also and that looks equally good.

Great.

Just small notational thing, maybe this would be simpler?:
extraBlocks = Min(512, lockWaiters * 20);

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#78Dilip Kumar
dilipbalaut@gmail.com
In reply to: Petr Jelinek (#77)
1 attachment(s)
Re: Relation extension scalability

On Thu, Mar 17, 2016 at 1:31 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:

Great.

Just small notational thing, maybe this would be simpler?:
extraBlocks = Min(512, lockWaiters * 20);

Done, new patch attached.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

multi_extend_v9.patchtext/x-patch; charset=US-ASCII; name=multi_extend_v9.patchDownload
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 8140418..a9e18dc 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -169,6 +169,91 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
 }
 
 /*
+ * RelationAddOneBlock
+ *
+ * Extend relation by one block and lock the buffer
+ */
+static Buffer
+RelationAddOneBlock(Relation relation, Buffer otherBuffer, BulkInsertState bistate)
+{
+	Buffer buffer;
+	/*
+	 * XXX This does an lseek - rather expensive - but at the moment it is the
+	 * only way to accurately determine how many blocks are in a relation.  Is
+	 * it worth keeping an accurate file length in shared memory someplace,
+	 * rather than relying on the kernel to do it for us?
+	 */
+	buffer = ReadBufferBI(relation, P_NEW, bistate);
+
+	/*
+	 * We can be certain that locking the otherBuffer first is OK, since
+	 * it must have a lower page number.
+	 */
+	if (otherBuffer != InvalidBuffer)
+		LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+	/*
+	 * Now acquire lock on the new page.
+	 */
+	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+	return buffer;
+}
+
+/*
+ * RelationAddExtraBlocks
+ *
+ * Extend extra blocks for the relations to avoid the future contention
+ * on the relation extension lock.
+ */
+
+static void
+RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
+{
+	Page		page;
+	Size		freespace;
+	BlockNumber blockNum;
+	int			extraBlocks = 0;
+	int 		lockWaiters = 0;
+	Buffer		buffer;
+
+	/*
+	 * For calculating number of extra blocks to extend, find the level
+	 * of contention on this lock, by getting the requester of this lock
+	 * and add extra blocks in multiple of waiters.
+	 */
+	lockWaiters = RelationExtensionLockWaiter(relation);
+
+	/* To avoid the cases when there are huge number of lock waiters, and
+	 * extend file size by big amount at a time, put some limit on the
+	 * max number of pages to be extended at a time.
+	 */
+	extraBlocks = Min(512, lockWaiters * 20);
+
+	while (extraBlocks--)
+	{
+		/*
+		 * Here we are adding extra blocks to the relation after
+		 * adding each block update the information in FSM so that
+		 * other backend running parallel can find the block.
+		 */
+		buffer = ReadBufferBI(relation, P_NEW, bistate);
+
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+		page = BufferGetPage(buffer);
+		PageInit(page, BufferGetPageSize(buffer), 0);
+		freespace = PageGetHeapFreeSpace(page);
+		MarkBufferDirty(buffer);
+		blockNum = BufferGetBlockNumber(buffer);
+
+		UnlockReleaseBuffer(buffer);
+
+		RecordPageWithFreeSpace(relation, blockNum, freespace);
+	}
+}
+
+/*
  * RelationGetBufferForTuple
  *
  *	Returns pinned and exclusive-locked buffer of a page in given relation
@@ -233,10 +318,11 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	bool		use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
 	Buffer		buffer = InvalidBuffer;
 	Page		page;
-	Size		pageFreeSpace,
-				saveFreeSpace;
+	Size		pageFreeSpace = 0,
+				saveFreeSpace = 0;
 	BlockNumber targetBlock,
-				otherBlock;
+				otherBlock,
+				lastValidBlock = InvalidBlockNumber;
 	bool		needLock;
 
 	len = MAXALIGN(len);		/* be conservative */
@@ -308,6 +394,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
 		}
 	}
 
+loop:
 	while (targetBlock != InvalidBlockNumber)
 	{
 		/*
@@ -388,6 +475,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
 								 otherBlock, targetBlock, vmbuffer_other,
 								 vmbuffer);
 
+		lastValidBlock = targetBlock;
+
 		/*
 		 * Now we can check to see if there's enough free space here. If so,
 		 * we're done.
@@ -441,27 +530,47 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	needLock = !RELATION_IS_LOCAL(relation);
 
 	if (needLock)
-		LockRelationForExtension(relation, ExclusiveLock);
-
-	/*
-	 * XXX This does an lseek - rather expensive - but at the moment it is the
-	 * only way to accurately determine how many blocks are in a relation.  Is
-	 * it worth keeping an accurate file length in shared memory someplace,
-	 * rather than relying on the kernel to do it for us?
-	 */
-	buffer = ReadBufferBI(relation, P_NEW, bistate);
-
-	/*
-	 * We can be certain that locking the otherBuffer first is OK, since it
-	 * must have a lower page number.
-	 */
-	if (otherBuffer != InvalidBuffer)
-		LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
+	{
+		/*
+		 * First try to get the lock in no-wait mode, if succeed extend one
+		 * block, else get the lock in normal mode and after we get the lock
+		 * extend some extra blocks, extra blocks will be added to satisfy
+		 * request of other waiters and avoid future contention. Here instead
+		 * of directly taking lock we try no-wait mode, this is to handle the
+		 * case, when there is no contention - it should not find the lock
+		 * waiter and execute extra instructions.
+		 */
+		if (LOCKACQUIRE_OK
+				!= RelationExtensionLockConditional(relation, ExclusiveLock))
+		{
+			LockRelationForExtension(relation, ExclusiveLock);
+
+			if (use_fsm)
+			{
+				if (lastValidBlock != InvalidBlockNumber)
+				{
+					targetBlock = RecordAndGetPageWithFreeSpace(relation,
+														lastValidBlock,
+														pageFreeSpace,
+														len + saveFreeSpace);
+				}
+
+				/* Other waiter has extended the block for us*/
+				if (targetBlock != InvalidBlockNumber)
+				{
+					UnlockRelationForExtension(relation, ExclusiveLock);
+					goto loop;
+				}
+
+				RelationAddExtraBlocks(relation, bistate);
+			}
+		}
+	}
 
-	/*
-	 * Now acquire lock on the new page.
+	/* For all case we need to add at least one block to satisfy our
+	 * own request.
 	 */
-	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+	buffer = RelationAddOneBlock(relation, otherBuffer, bistate);
 
 	/*
 	 * Release the file-extension lock; it's now OK for someone else to extend
@@ -472,6 +581,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	if (needLock)
 		UnlockRelationForExtension(relation, ExclusiveLock);
 
+
+
 	/*
 	 * We need to initialize the empty new page.  Double-check that it really
 	 * is empty (this should never happen, but if it does we don't want to
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 9d16afb..5d9454d 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -341,6 +341,40 @@ LockRelationForExtension(Relation relation, LOCKMODE lockmode)
 }
 
 /*
+ *		RelationExtensionLockConditional
+ *
+ * Same as LockRelationForExtension except it will not wait on the lock.
+ */
+LockAcquireResult
+RelationExtensionLockConditional(Relation relation, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return LockAcquire(&tag, lockmode, false, true);
+}
+
+/*
+ *		RelationExtensionLockWaiter
+ *
+ * Count the lock requester for the RelationExtension lock.
+ */
+int
+RelationExtensionLockWaiter(Relation relation)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return LockWaiterCount(&tag);
+}
+
+/*
  *		UnlockRelationForExtension
  */
 void
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index a458c68..cf723ae 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -4380,3 +4380,47 @@ VirtualXactLock(VirtualTransactionId vxid, bool wait)
 	LockRelease(&tag, ShareLock, false);
 	return true;
 }
+
+/*
+ * LockWaiterCount
+ *
+ * Find the number of lock requester on this locktag
+ */
+int
+LockWaiterCount(const LOCKTAG *locktag)
+{
+	LOCKMETHODID lockmethodid = locktag->locktag_lockmethodid;
+	LOCALLOCKTAG localtag;
+	LOCALLOCK  *locallock;
+	bool		found;
+	uint32		hashcode;
+	LWLock		*partitionLock;
+	int			waiters = 0;
+
+	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
+		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
+
+	/*
+	 * Find a LOCALLOCK entry for this lock and lockmode
+	 */
+	MemSet(&localtag, 0, sizeof(localtag));		/* must clear padding */
+	localtag.lock = *locktag;
+	localtag.mode = ExclusiveLock;
+
+	locallock = (LOCALLOCK *) hash_search(LockMethodLocalHash,
+										(void *) &localtag,
+										HASH_FIND, &found);
+
+	if (!found)
+		return 0;
+
+	hashcode = locallock->hashcode;
+	partitionLock = LockHashPartitionLock(hashcode);
+	LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+
+	waiters = locallock->lock->nRequested;
+
+	LWLockRelease(partitionLock);
+
+	return waiters;
+}
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index e9d41bf..ea8e19e 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -101,4 +101,7 @@ extern void UnlockSharedObjectForSession(Oid classid, Oid objid, uint16 objsubid
 /* Describe a locktag for error messages */
 extern void DescribeLockTag(StringInfo buf, const LOCKTAG *tag);
 
+extern LockAcquireResult RelationExtensionLockConditional(Relation relation, LOCKMODE lockmode);
+extern int RelationExtensionLockWaiter(Relation relation);
+
 #endif   /* LMGR_H */
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 788d50a..3fd74fb 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -572,6 +572,8 @@ extern void RememberSimpleDeadLock(PGPROC *proc1,
 					   PGPROC *proc2);
 extern void InitDeadLockChecking(void);
 
+extern int LockWaiterCount(const LOCKTAG *locktag);
+
 #ifdef LOCK_DEBUG
 extern void DumpLocks(PGPROC *proc);
 extern void DumpAllLocks(void);
#79Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#78)
Re: Relation extension scalability

On Fri, Mar 18, 2016 at 2:38 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Mar 17, 2016 at 1:31 PM, Petr Jelinek <petr@2ndquadrant.com>

wrote:

Great.

Just small notational thing, maybe this would be simpler?:
extraBlocks = Min(512, lockWaiters * 20);

Done, new patch attached.

Review comments:

1.
 /*
+ * RelationAddOneBlock
+ *
+ * Extend relation by one block and lock the buffer
+ */
+static Buffer
+RelationAddOneBlock(Relation relation, Buffer otherBuffer, BulkInsertState bistate)

Shall we mention in comments that this API returns locked buffer and it's
the responsibility of caller to unlock it.

2.
+ /* To avoid the cases when there are huge number of lock waiters, and
+  * extend file size by big amount at a time, put some limit on the

first line in multi-line comments should not contain anything.

3.
+ extraBlocks = Min(512, lockWaiters * 20);

I think we should explain in comments about the reason of choosing 20 in
above calculation.

4. Sometime back [1], you agreed on doing some analysis for the overhead
that XLogFlush can cause during buffer eviction, but I don't see the
results of same, are you planing to complete the same?

5.
+ if (LOCKACQUIRE_OK
+     != RelationExtensionLockConditional(relation, ExclusiveLock))

I think the coding style is to keep constant on right side of condition,
did you see any other place in code which uses the check in a similar way?

6.
- /*
-  * Now acquire lock on the new page.
+ /* For all case we need to add at least one block to satisfy our
+  * own request.
   */

Same problem as in point 2.

7.
@@ -472,6 +581,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	if (needLock)
 		UnlockRelationForExtension(relation, ExclusiveLock);

+
+

Spurious newline addition.

8.
+int
+RelationExtensionLockWaiter(Relation relation)

How about naming this function as RelationExtensionLockWaiterCount?

9.
+ /* Other waiter has extended the block for us*/

Provide an extra space before ending the comment.

10.
+ if (use_fsm)
+ {
+     if (lastValidBlock != InvalidBlockNumber)
+     {
+         targetBlock = RecordAndGetPageWithFreeSpace(relation,
+                                                     lastValidBlock,
+                                                     pageFreeSpace,
+                                                     len + saveFreeSpace);
+     }

Are you using RecordAndGetPageWithFreeSpace() instead of
GetPageWithFreeSpace() to get the page close to the previous target page?
If yes, then do you see enough benefit of the same that it can compensate
the additional write operation which Record* API might cause due to
additional dirtying of buffer?

11.
+ {
+     /*
+      * First try to get the lock in no-wait mode, if succeed extend one
+      * block, else get the lock in normal mode and after we get the lock
+      * extend some extra blocks, extra blocks will be added to satisfy
+      * request of other waiters and avoid future contention. Here instead
+      * of directly taking lock we try no-wait mode, this is to handle the
+      * case, when there is no contention - it should not find the lock
+      * waiter and execute extra instructions.
+      */
+     if (LOCKACQUIRE_OK
+         != RelationExtensionLockConditional(relation, ExclusiveLock))
+     {
+         LockRelationForExtension(relation, ExclusiveLock);
+
+         if (use_fsm)
+         {
+             if (lastValidBlock != InvalidBlockNumber)
+             {
+                 targetBlock = RecordAndGetPageWithFreeSpace(relation,
+                                                             lastValidBlock,
+                                                             pageFreeSpace,
+                                                             len + saveFreeSpace);
+             }
+
+             /* Other waiter has extended the block for us*/
+             if (targetBlock != InvalidBlockNumber)
+             {
+                 UnlockRelationForExtension(relation, ExclusiveLock);
+                 goto loop;
+             }
+
+             RelationAddExtraBlocks(relation, bistate);
+         }
+     }
+ }
- /*
-  * Now acquire lock on the new page.
+ /* For all case we need to add at least one block to satisfy our
+  * own request.
   */
- LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+ buffer = RelationAddOneBlock(relation, otherBuffer, bistate);

Won't this cause one extra block addition after backend extends the
relation for multiple blocks, what is the need of same?

12. I think it is good to once test pgbench read-write tests to ensure that
this doesn't introduce any new regression.

[1]: /messages/by-id/CAA4eK1LOnxz4Qa_DquqbanSPXscTJXrKexJii8h3gnD9z8UY-A@mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#80Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#79)
1 attachment(s)
Re: Relation extension scalability

On Mon, Mar 21, 2016 at 8:10 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

Review comments:

Thanks for the review, Please find my response inline..

1.
/*
+ * RelationAddOneBlock
+ *
+ * Extend relation by one block and lock the buffer
+ */
+static Buffer
+RelationAddOneBlock(Relation relation, Buffer otherBuffer, BulkInsertState bistate)

Shall we mention in comments that this API returns locked buffer and it's
the responsibility of caller to unlock it.

Fixed

2.
+ /* To avoid the cases when there are huge number of lock waiters, and
+  * extend file size by big amount at a time, put some limit on the

first line in multi-line comments should not contain anything.

Fixed

3.
+ extraBlocks = Min(512, lockWaiters * 20);

I think we should explain in comments about the reason of choosing 20 in
above calculation.

Fixed. Please check whether the explanation is enough or whether we need to add something more.

4. Sometime back [1], you agreed on doing some analysis for the overhead
that XLogFlush can cause during buffer eviction, but I don't see the
results of same, are you planing to complete the same?

Ok, I will test this..

5.
+ if (LOCKACQUIRE_OK
+     != RelationExtensionLockConditional(relation, ExclusiveLock))

I think the coding style is to keep constant on right side of condition,
did you see any other place in code which uses the check in a similar way?

Fixed,

Not sure about any other place. (I used to keep the constant on the left
side so that if = is typed by mistake instead of ==, the compiler reports
an error rather than silently doing an assignment.)

But in PG style the constant should be on the right, so I will take care of it.

6.
- /*
-  * Now acquire lock on the new page.
+ /* For all case we need to add at least one block to satisfy our
+  * own request.
   */

Same problem as in point 2.

Fixed.

7.
@@ -472,6 +581,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	if (needLock)
 		UnlockRelationForExtension(relation, ExclusiveLock);

+
+

Spurious newline addition.

Fixed

8.
+int
+RelationExtensionLockWaiter(Relation relation)

How about naming this function as RelationExtensionLockWaiterCount?

Done

9.
+ /* Other waiter has extended the block for us*/

Provide an extra space before ending the comment.

Fixed

10.
+ if (use_fsm)
+ {
+     if (lastValidBlock != InvalidBlockNumber)
+     {
+         targetBlock = RecordAndGetPageWithFreeSpace(relation,
+                                                     lastValidBlock,
+                                                     pageFreeSpace,
+                                                     len + saveFreeSpace);
+     }

Are you using RecordAndGetPageWithFreeSpace() instead of
GetPageWithFreeSpace() to get the page close to the previous target page?
If yes, then do you see enough benefit of the same that it can compensate
the additional write operation which Record* API might cause due to
additional dirtying of buffer?

Here we are calling RecordAndGetPageWithFreeSpace instead of
GetPageWithFreeSpace because another backend that got the lock might have
added extra blocks to the FSM, and it's possible that the upper levels of
the FSM tree have not been updated yet, so we would not find the page by
searching from the top with GetPageWithFreeSpace; we need to search the
leaf page directly, using our last valid target block.

I explained the same in the comments...

11.
+ {
+     /*
+      * First try to get the lock in no-wait mode, if succeed extend one
+      * block, else get the lock in normal mode and after we get the lock
[...]
- LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+ buffer = RelationAddOneBlock(relation, otherBuffer, bistate);

Won't this cause one extra block addition after backend extends the
relation for multiple blocks, what is the need of same?

This is the block for this backend; the extra extension is for future
requests and is already added to the FSM. I could have added this count
along with the extra blocks in RelationAddExtraBlocks, but in that case I
would need some extra logic for saving one buffer for this backend and
then returning that specific buffer to the caller, and the caller would
also need to distinguish between who wants to add one block and who got
one block added along with the extra blocks.

I think this way the code is simpler: everybody who comes down here adds
one block for their own use, and all the other logic is above, i.e.
whether to take the lock or not, whether to add extra blocks or not.

12. I think it is good to once test pgbench read-write tests to ensure
that this doesn't introduce any new regression.

I will test this and post the results..

Latest patch attached..

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

multi_extend_v10.patch (application/octet-stream)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 8140418..b1e764c 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -169,6 +169,102 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
 }
 
 /*
+ * RelationAddOneBlock
+ *
+ * Extend relation by one block and lock the buffer, and caller must take care
+ * of unlocking the buffer.
+ */
+static Buffer
+RelationAddOneBlock(Relation relation, Buffer otherBuffer, BulkInsertState bistate)
+{
+	Buffer buffer;
+	/*
+	 * XXX This does an lseek - rather expensive - but at the moment it is the
+	 * only way to accurately determine how many blocks are in a relation.  Is
+	 * it worth keeping an accurate file length in shared memory someplace,
+	 * rather than relying on the kernel to do it for us?
+	 */
+	buffer = ReadBufferBI(relation, P_NEW, bistate);
+
+	/*
+	 * We can be certain that locking the otherBuffer first is OK, since
+	 * it must have a lower page number.
+	 */
+	if (otherBuffer != InvalidBuffer)
+		LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+	/*
+	 * Now acquire lock on the new page.
+	 */
+	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+	return buffer;
+}
+
+/*
+ * RelationAddExtraBlocks
+ *
+ * Extend extra blocks for the relations to avoid the future contention
+ * on the relation extension lock.
+ */
+
+static void
+RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
+{
+	Page		page;
+	Size		freespace;
+	BlockNumber blockNum;
+	int			extraBlocks = 0;
+	int 		lockWaiters = 0;
+	Buffer		buffer;
+
+	/*
+	 * For calculating number of extra blocks to extend, find the level
+	 * of contention on this lock, by getting the requester of this lock
+	 * and add extra blocks in multiple of waiters.
+	 */
+	lockWaiters = RelationExtensionLockWaiterCount(relation);
+
+	/*
+	 * To avoid the cases when there are huge number of lock waiters, and
+	 * extend file size by big amount at a time, put some limit on the
+	 * max number of pages to be extended at a time.
+	 * At a time we are extending (20 * LockWaiters), We reached to this
+	 * number by testing with different multipliers i.e 10,20,50.. and found
+	 * that 10 is good for the small tuple size but not sufficient when
+	 * tuple size is very huge (1K), and with 20 we get good scalability
+	 * with small tuple as well as with big tuple and increasing multiplier
+	 * beyond 20 doen't seems to be giving much benefit, So no point in
+	 * adding many extra block without seeing significant improvement.
+	 * Also we are limiting the extension at 512 blocks (4MB), Even though
+	 * there can be some benefit its better to have some limit.
+	 */
+	extraBlocks = Min(512, lockWaiters * 20);
+
+	while (extraBlocks--)
+	{
+		/*
+		 * Here we are adding extra blocks to the relation after
+		 * adding each block update the information in FSM so that
+		 * other backend running parallel can find the block.
+		 */
+		buffer = ReadBufferBI(relation, P_NEW, bistate);
+
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+		page = BufferGetPage(buffer);
+		PageInit(page, BufferGetPageSize(buffer), 0);
+		freespace = PageGetHeapFreeSpace(page);
+		MarkBufferDirty(buffer);
+		blockNum = BufferGetBlockNumber(buffer);
+
+		UnlockReleaseBuffer(buffer);
+
+		RecordPageWithFreeSpace(relation, blockNum, freespace);
+	}
+}
+
+/*
  * RelationGetBufferForTuple
  *
  *	Returns pinned and exclusive-locked buffer of a page in given relation
@@ -233,10 +329,11 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	bool		use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
 	Buffer		buffer = InvalidBuffer;
 	Page		page;
-	Size		pageFreeSpace,
-				saveFreeSpace;
+	Size		pageFreeSpace = 0,
+				saveFreeSpace = 0;
 	BlockNumber targetBlock,
-				otherBlock;
+				otherBlock,
+				lastValidBlock = InvalidBlockNumber;
 	bool		needLock;
 
 	len = MAXALIGN(len);		/* be conservative */
@@ -308,6 +405,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
 		}
 	}
 
+loop:
 	while (targetBlock != InvalidBlockNumber)
 	{
 		/*
@@ -388,6 +486,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
 								 otherBlock, targetBlock, vmbuffer_other,
 								 vmbuffer);
 
+		lastValidBlock = targetBlock;
+
 		/*
 		 * Now we can check to see if there's enough free space here. If so,
 		 * we're done.
@@ -441,27 +541,58 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	needLock = !RELATION_IS_LOCAL(relation);
 
 	if (needLock)
-		LockRelationForExtension(relation, ExclusiveLock);
-
-	/*
-	 * XXX This does an lseek - rather expensive - but at the moment it is the
-	 * only way to accurately determine how many blocks are in a relation.  Is
-	 * it worth keeping an accurate file length in shared memory someplace,
-	 * rather than relying on the kernel to do it for us?
-	 */
-	buffer = ReadBufferBI(relation, P_NEW, bistate);
-
-	/*
-	 * We can be certain that locking the otherBuffer first is OK, since it
-	 * must have a lower page number.
-	 */
-	if (otherBuffer != InvalidBuffer)
-		LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
+	{
+		/*
+		 * First try to get the lock in no-wait mode, if succeed extend one
+		 * block, else get the lock in normal mode and after we get the lock
+		 * extend some extra blocks, extra blocks will be added to satisfy
+		 * request of other waiters and avoid future contention. Here instead
+		 * of directly taking lock we try no-wait mode, this is to handle the
+		 * case, when there is no contention - it should not find the lock
+		 * waiter and execute extra instructions.
+		 */
+		if (RelationExtensionLockConditional(relation, ExclusiveLock)
+			!= LOCKACQUIRE_OK)
+		{
+			LockRelationForExtension(relation, ExclusiveLock);
+
+			if (use_fsm)
+			{
+				if (lastValidBlock != InvalidBlockNumber)
+				{
+					/*
+					 * Here we are calling RecordAndGetPageWithFreeSpace
+					 * instead of GetPageWithFreeSpace, because other backend
+					 * who have got the lock might have added extra blocks
+					 * in the FSM and its possible that FSM tree might not
+					 * have been updated so far and we will not get the page by
+					 * searching from top using GetPageWithFreeSpace, so we
+					 * need to search the leaf page directly using our
+					 * last valid target block.
+					 */
+					targetBlock = RecordAndGetPageWithFreeSpace(relation,
+														lastValidBlock,
+														pageFreeSpace,
+														len + saveFreeSpace);
+				}
+
+				/* Other waiter has extended the block for us */
+				if (targetBlock != InvalidBlockNumber)
+				{
+					UnlockRelationForExtension(relation, ExclusiveLock);
+					goto loop;
+				}
+
+				RelationAddExtraBlocks(relation, bistate);
+			}
+		}
+	}
 
 	/*
-	 * Now acquire lock on the new page.
+	 * For all case we need to add at least one block to satisfy our
+	 * own request.
 	 */
-	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+	buffer = RelationAddOneBlock(relation, otherBuffer, bistate);
 
 	/*
 	 * Release the file-extension lock; it's now OK for someone else to extend
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 9d16afb..8967d1b 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -341,6 +341,40 @@ LockRelationForExtension(Relation relation, LOCKMODE lockmode)
 }
 
 /*
+ *		RelationExtensionLockConditional
+ *
+ * Same as LockRelationForExtension except it will not wait on the lock.
+ */
+LockAcquireResult
+RelationExtensionLockConditional(Relation relation, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return LockAcquire(&tag, lockmode, false, true);
+}
+
+/*
+ *		RelationExtensionLockWaiterCount
+ *
+ * Count the lock requester for the RelationExtension lock.
+ */
+int
+RelationExtensionLockWaiterCount(Relation relation)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return LockWaiterCount(&tag);
+}
+
+/*
  *		UnlockRelationForExtension
  */
 void
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index a458c68..cf723ae 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -4380,3 +4380,47 @@ VirtualXactLock(VirtualTransactionId vxid, bool wait)
 	LockRelease(&tag, ShareLock, false);
 	return true;
 }
+
+/*
+ * LockWaiterCount
+ *
+ * Find the number of lock requester on this locktag
+ */
+int
+LockWaiterCount(const LOCKTAG *locktag)
+{
+	LOCKMETHODID lockmethodid = locktag->locktag_lockmethodid;
+	LOCALLOCKTAG localtag;
+	LOCALLOCK  *locallock;
+	bool		found;
+	uint32		hashcode;
+	LWLock		*partitionLock;
+	int			waiters = 0;
+
+	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
+		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
+
+	/*
+	 * Find a LOCALLOCK entry for this lock and lockmode
+	 */
+	MemSet(&localtag, 0, sizeof(localtag));		/* must clear padding */
+	localtag.lock = *locktag;
+	localtag.mode = ExclusiveLock;
+
+	locallock = (LOCALLOCK *) hash_search(LockMethodLocalHash,
+										(void *) &localtag,
+										HASH_FIND, &found);
+
+
+	hashcode = locallock->hashcode;
+
+	partitionLock = LockHashPartitionLock(hashcode);
+	LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+
+	if (found)
+		waiters = locallock->lock->nRequested;
+
+	LWLockRelease(partitionLock);
+
+	return waiters;
+}
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index e9d41bf..ce9c6aa 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -101,4 +101,7 @@ extern void UnlockSharedObjectForSession(Oid classid, Oid objid, uint16 objsubid
 /* Describe a locktag for error messages */
 extern void DescribeLockTag(StringInfo buf, const LOCKTAG *tag);
 
+extern LockAcquireResult RelationExtensionLockConditional(Relation relation, LOCKMODE lockmode);
+extern int RelationExtensionLockWaiterCount(Relation relation);
+
 #endif   /* LMGR_H */
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 788d50a..3fd74fb 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -572,6 +572,8 @@ extern void RememberSimpleDeadLock(PGPROC *proc1,
 					   PGPROC *proc2);
 extern void InitDeadLockChecking(void);
 
+extern int LockWaiterCount(const LOCKTAG *locktag);
+
 #ifdef LOCK_DEBUG
 extern void DumpLocks(PGPROC *proc);
 extern void DumpAllLocks(void);
#81Petr Jelinek
petr@2ndquadrant.com
In reply to: Dilip Kumar (#80)
Re: Relation extension scalability

On 22/03/16 10:15, Dilip Kumar wrote:

On Mon, Mar 21, 2016 at 8:10 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
11.
+ {
+     /*
+      * First try to get the lock in no-wait mode, if succeed extend one
+      * block, else get the lock in normal mode and after we get the lock
[...]
- LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+ buffer = RelationAddOneBlock(relation, otherBuffer, bistate);

Won't this cause one extra block addition after backend extends the
relation for multiple blocks, what is the need of same?

This is the block for this backend, Extra extend for future request and
already added to FSM. I could have added this count along with extra
block in RelationAddExtraBlocks, But In that case I need to put some
extra If for saving one buffer for this bakend and then returning that
the specific buffer to caller, and In caller also need to distinguish
between who wants to add one block or who have got one block added in
along with extra block.

I think this way code is simple.. That everybody comes down will add one
block for self use. and all other functionality and logic is above, i.e.
wether to take lock or not, whether to add extra blocks or not..

I also think the code simplicity makes this worth it.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#82Robert Haas
robertmhaas@gmail.com
In reply to: Petr Jelinek (#81)
1 attachment(s)
Re: Relation extension scalability

On Tue, Mar 22, 2016 at 1:12 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:

I also think the code simplicity makes this worth it.

Agreed. I went over this patch and did a cleanup pass today. I
discovered that the LockWaiterCount() function was broken if you try
to tried to use it on a lock that you didn't hold or a lock that you
held in any mode other than exclusive, so I tried to fix that. I
rewrote a lot of the comments and tightened some other things up. The
result is attached.

I'm baffled by the same code Amit asked about upthread, even though
there's now a comment:

+                               /*
+                                * Here we are calling RecordAndGetPageWithFreeSpace
+                                * instead of GetPageWithFreeSpace, because other backend
+                                * who have got the lock might have added extra blocks in
+                                * the FSM and its possible that FSM tree might not have
+                                * been updated so far and we will not get the page by
+                                * searching from top using GetPageWithFreeSpace, so we
+                                * need to search the leaf page directly using our last
+                                * valid target block.
+                                *
+                                * XXX. I don't understand what is happening here. -RMH
+                                */

I've read this over several times and looked at
RecordAndGetPageWithFreeSpace() and I'm still confused. First of all,
if the lock was acquired by some other backend which did
RelationAddExtraBlocks(), it *will* have updated the FSM - that's the
whole point. Second, if the other backend extended the relation in
some other manner and did not extend the FSM, how does calling
RecordAndGetPageWithFreeSpace help? As far as I can see,
GetPageWithFreeSpace and RecordAndGetPageWithFreeSpace are both just
searching the FSM, so if one is stymied the other will be too. What
am I missing?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

multi_extend_v11.patch (text/x-diff; charset=US-ASCII)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 8140418..3ab911f 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -169,6 +169,55 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
 }
 
 /*
+ * Extend a relation by multiple blocks to avoid future contention on the
+ * relation extension lock.  Our goal is to pre-extend the relation by an
+ * amount which ramps up as the degree of contention ramps up, but limiting
+ * the result to some sane overall value.
+ */
+static void
+RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
+{
+	Page		page;
+	Size		freespace;
+	BlockNumber blockNum;
+	int			extraBlocks = 0;
+	int			lockWaiters = 0;
+	Buffer		buffer;
+
+	/*
+	 * We use the length of the lock wait queue to judge how much to extend.
+	 * It might seem like multiplying the number of lock waiters by as much
+	 * as 20 is too aggressive, but benchmarking revealed that smaller numbers
+	 * were insufficient.  512 is just an arbitrary cap to prevent pathological
+	 * results (and excessive wasted disk space).
+	 */
+	lockWaiters = RelationExtensionLockWaiterCount(relation);
+	extraBlocks = Min(512, lockWaiters * 20);
+
+	while (extraBlocks-- >= 0)
+	{
+		/* Ouch - an unnecessary lseek() each time through the loop! */
+		buffer = ReadBufferBI(relation, P_NEW, bistate);
+
+		/* Extend by one page. */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		page = BufferGetPage(buffer);
+		PageInit(page, BufferGetPageSize(buffer), 0);
+		freespace = PageGetHeapFreeSpace(page);
+		MarkBufferDirty(buffer);
+		blockNum = BufferGetBlockNumber(buffer);
+		UnlockReleaseBuffer(buffer);
+
+		/*
+		 * Put the page in the freespace map so other backends can find it.
+		 * This is what will keep those other backends from also queueing up
+		 * on the relation extension lock.
+		 */
+		RecordPageWithFreeSpace(relation, blockNum, freespace);
+	}
+}
+
+/*
  * RelationGetBufferForTuple
  *
  *	Returns pinned and exclusive-locked buffer of a page in given relation
@@ -233,10 +282,11 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	bool		use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
 	Buffer		buffer = InvalidBuffer;
 	Page		page;
-	Size		pageFreeSpace,
-				saveFreeSpace;
+	Size		pageFreeSpace = 0,
+				saveFreeSpace = 0;
 	BlockNumber targetBlock,
-				otherBlock;
+				otherBlock,
+				lastValidBlock = InvalidBlockNumber;
 	bool		needLock;
 
 	len = MAXALIGN(len);		/* be conservative */
@@ -308,6 +358,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
 		}
 	}
 
+loop:
 	while (targetBlock != InvalidBlockNumber)
 	{
 		/*
@@ -388,6 +439,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
 								 otherBlock, targetBlock, vmbuffer_other,
 								 vmbuffer);
 
+		lastValidBlock = targetBlock;
+
 		/*
 		 * Now we can check to see if there's enough free space here. If so,
 		 * we're done.
@@ -440,10 +493,60 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	 */
 	needLock = !RELATION_IS_LOCAL(relation);
 
+	/*
+	 * If we need the lock but are not able to acquire it immediately, we'll
+	 * consider extending the relation by multiple blocks at a time to manage
+	 * contention on the relation extension lock.  However, this only makes
+	 * sense if we're using the FSM; otherwise, there's no point.
+	 */
 	if (needLock)
-		LockRelationForExtension(relation, ExclusiveLock);
+	{
+		if (use_fsm)
+			LockRelationForExtension(relation, ExclusiveLock);
+		else if (!ConditionLockRelationForExtension(relation, ExclusiveLock))
+		{
+			/* Couldn't get the lock immmediately; wait for it. */
+			LockRelationForExtension(relation, ExclusiveLock);
+
+			if (lastValidBlock != InvalidBlockNumber)
+			{
+				/*
+				 * Here we are calling RecordAndGetPageWithFreeSpace
+				 * instead of GetPageWithFreeSpace, because other backend
+				 * who have got the lock might have added extra blocks in
+				 * the FSM and its possible that FSM tree might not have
+				 * been updated so far and we will not get the page by
+				 * searching from top using GetPageWithFreeSpace, so we
+				 * need to search the leaf page directly using our last
+				 * valid target block.
+				 *
+				 * XXX. I don't understand what is happening here. -RMH
+				 */
+				targetBlock = RecordAndGetPageWithFreeSpace(relation,
+														  lastValidBlock,
+															pageFreeSpace,
+													len + saveFreeSpace);
+			}
+
+			/*
+			 * If some other waiter has already extended the relation, we
+			 * don't need to do so; just use the existing freespace.
+			 */
+			if (targetBlock != InvalidBlockNumber)
+			{
+				UnlockRelationForExtension(relation, ExclusiveLock);
+				goto loop;
+			}
+
+			/* Time to bulk-extend. */
+			RelationAddExtraBlocks(relation, bistate);
+		}
+	}
 
 	/*
+	 * In addition to whatever extension we performed above, we always add
+	 * at least one block to satisfy our own request.
+	 *
 	 * XXX This does an lseek - rather expensive - but at the moment it is the
 	 * only way to accurately determine how many blocks are in a relation.  Is
 	 * it worth keeping an accurate file length in shared memory someplace,
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 0632fc0..7e04137 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -341,6 +341,41 @@ LockRelationForExtension(Relation relation, LOCKMODE lockmode)
 }
 
 /*
+ *		ConditionLockRelationForExtension
+ *
+ * As above, but only lock if we can get the lock without blocking.
+ * Returns TRUE iff the lock was acquired.
+ */
+bool
+ConditionLockRelationForExtension(Relation relation, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return (LockAcquire(&tag, lockmode, false, true) != LOCKACQUIRE_NOT_AVAIL);
+}
+
+/*
+ *		RelationExtensionLockWaiterCount
+ *
+ * Count the lock requester for the RelationExtension lock.
+ */
+int
+RelationExtensionLockWaiterCount(Relation relation)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return LockWaiterCount(&tag);
+}
+
+/*
  *		UnlockRelationForExtension
  */
 void
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index b30b7b1..353f705 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -4380,3 +4380,40 @@ VirtualXactLock(VirtualTransactionId vxid, bool wait)
 	LockRelease(&tag, ShareLock, false);
 	return true;
 }
+
+/*
+ * LockWaiterCount
+ *
+ * Find the number of lock requester on this locktag
+ */
+int
+LockWaiterCount(const LOCKTAG *locktag)
+{
+	LOCKMETHODID lockmethodid = locktag->locktag_lockmethodid;
+	LOCK	   *lock;
+	bool		found;
+	uint32		hashcode;
+	LWLock		*partitionLock;
+	int			waiters = 0;
+
+	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
+		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
+
+	hashcode = LockTagHashCode(locktag);
+	partitionLock = LockHashPartitionLock(hashcode);
+	LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+
+	lock = (LOCK *) hash_search_with_hash_value(LockMethodLockHash,
+												(const void *) locktag,
+												hashcode,
+												HASH_FIND,
+												&found);
+	if (found)
+	{
+		Assert(lock != NULL);
+		waiters = lock->nRequested;
+	}
+	LWLockRelease(partitionLock);
+
+	return waiters;
+}
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 975b6f8..4460756 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -53,6 +53,9 @@ extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
 /* Lock a relation for extension */
 extern void LockRelationForExtension(Relation relation, LOCKMODE lockmode);
 extern void UnlockRelationForExtension(Relation relation, LOCKMODE lockmode);
+extern bool ConditionLockRelationForExtension(Relation relation,
+								  LOCKMODE lockmode);
+extern int	RelationExtensionLockWaiterCount(Relation relation);
 
 /* Lock a page (currently only used within indexes) */
 extern void LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index b26427d..9c08679 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -574,6 +574,8 @@ extern void RememberSimpleDeadLock(PGPROC *proc1,
 					   PGPROC *proc2);
 extern void InitDeadLockChecking(void);
 
+extern int LockWaiterCount(const LOCKTAG *locktag);
+
 #ifdef LOCK_DEBUG
 extern void DumpLocks(PGPROC *proc);
 extern void DumpAllLocks(void);
#83Petr Jelinek
petr@2ndquadrant.com
In reply to: Robert Haas (#82)
Re: Relation extension scalability

On 23/03/16 19:39, Robert Haas wrote:

On Tue, Mar 22, 2016 at 1:12 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:

I also think the code simplicity makes this worth it.

Agreed. I went over this patch and did a cleanup pass today. I
discovered that the LockWaiterCount() function was broken if you try
to tried to use it on a lock that you didn't hold or a lock that you
held in any mode other than exclusive, so I tried to fix that. I
rewrote a lot of the comments and tightened some other things up. The
result is attached.

I'm baffled by the same code Amit asked about upthread, even though
there's now a comment:

+                               /*
+                                * Here we are calling RecordAndGetPageWithFreeSpace
+                                * instead of GetPageWithFreeSpace, because other backend
+                                * who have got the lock might have added extra blocks in
+                                * the FSM and its possible that FSM tree might not have
+                                * been updated so far and we will not get the page by
+                                * searching from top using GetPageWithFreeSpace, so we
+                                * need to search the leaf page directly using our last
+                                * valid target block.
+                                *
+                                * XXX. I don't understand what is happening here. -RMH
+                                */

I've read this over several times and looked at
RecordAndGetPageWithFreeSpace() and I'm still confused. First of all,
if the lock was acquired by some other backend which did
RelationAddExtraBlocks(), it *will* have updated the FSM - that's the
whole point.

That's a good point; maybe this coding is a bit too defensive.

Second, if the other backend extended the relation in
some other manner and did not extend the FSM, how does calling
RecordAndGetPageWithFreeSpace help? As far as I can see,
GetPageWithFreeSpace and RecordAndGetPageWithFreeSpace are both just
searching the FSM, so if one is stymied the other will be too. What
am I missing?

The RecordAndGetPageWithFreeSpace will extend FSM as it calls
fsm_set_and_search which in turn calls fsm_readbuf with extend = true.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#84Robert Haas
robertmhaas@gmail.com
In reply to: Petr Jelinek (#83)
Re: Relation extension scalability

On Wed, Mar 23, 2016 at 2:52 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:

Second, if the other backend extended the relation in
some other manner and did not extend the FSM, how does calling
RecordAndGetPageWithFreeSpace help? As far as I can see,
GetPageWithFreeSpace and RecordAndGetPageWithFreeSpace are both just
searching the FSM, so if one is stymied the other will be too. What
am I missing?

The RecordAndGetPageWithFreeSpace will extend FSM as it calls
fsm_set_and_search which in turn calls fsm_readbuf with extend = true.

So how does that help? If I'm reading this right, the new block will
be all zeroes which means no space available on any of those pages.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#85Petr Jelinek
petr@2ndquadrant.com
In reply to: Robert Haas (#84)
Re: Relation extension scalability

On 23/03/16 20:01, Robert Haas wrote:

On Wed, Mar 23, 2016 at 2:52 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:

Second, if the other backend extended the relation in
some other manner and did not extend the FSM, how does calling
RecordAndGetPageWithFreeSpace help? As far as I can see,
GetPageWithFreeSpace and RecordAndGetPageWithFreeSpace are both just
searching the FSM, so if one is stymied the other will be too. What
am I missing?

The RecordAndGetPageWithFreeSpace will extend FSM as it calls
fsm_set_and_search which in turn calls fsm_readbuf with extend = true.

So how does that help? If I'm reading this right, the new block will
be all zeroes which means no space available on any of those pages.

I am a bit confused as to what exactly you are saying, but what will
happen is that we get back to the while loop and try again, so eventually
we should find either a block with enough free space or add a new one (not
sure this would ever actually happen in practice in a heavily concurrent
workload where the FSM was not correctly extended during relation
extension, though; we might just loop here forever).

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#86Petr Jelinek
petr@2ndquadrant.com
In reply to: Petr Jelinek (#85)
Re: Relation extension scalability

On 23/03/16 20:19, Petr Jelinek wrote:

On 23/03/16 20:01, Robert Haas wrote:

On Wed, Mar 23, 2016 at 2:52 PM, Petr Jelinek <petr@2ndquadrant.com>
wrote:

Second, if the other backend extended the relation in
some other manner and did not extend the FSM, how does calling
RecordAndGetPageWithFreeSpace help? As far as I can see,
GetPageWithFreeSpace and RecordAndGetPageWithFreeSpace are both just
searching the FSM, so if one is stymied the other will be too. What
am I missing?

The RecordAndGetPageWithFreeSpace will extend FSM as it calls
fsm_set_and_search which in turn calls fsm_readbuf with extend = true.

So how does that help? If I'm reading this right, the new block will
be all zeroes which means no space available on any of those pages.

I am bit confused as to what exactly you are saying, but what will
happen is we get back to the while cycle and try again so eventually we
should find either block with enough free space or add new one (not sure
if this would actually ever happen in practice in heavily concurrent
workload where the FSM would not be correctly extended during relation
extension though, we might just loop here forever).

Btw thinking about it some more, ISTM that not finding the block and
just doing the extension if the FSM wasn't extended correctly previously
is probably cleaner behavior than what we do now. The reasoning for that
opinion is that if the FSM wasn't extended, we'll fix it by doing
relation extension since we know we do both in this code path and also
if we could not find page before we'll most likely not find one even on
retry and if there was page added at the end by extension that we might
reuse partially here then there is no harm in adding new one anyway as
the whole point of this patch is that it does bigger extension that
strictly necessary so insisting on page reuse for something that seems
like only theoretical possibility that does not even exist in current
code does not seem right.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#87Robert Haas
robertmhaas@gmail.com
In reply to: Petr Jelinek (#86)
Re: Relation extension scalability

On Wed, Mar 23, 2016 at 3:33 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:

Btw thinking about it some more, ISTM that not finding the block and just
doing the extension if the FSM wasn't extended correctly previously is
probably cleaner behavior than what we do now. The reasoning for that
opinion is that if the FSM wasn't extended, we'll fix it by doing relation
extension since we know we do both in this code path and also if we could
not find page before we'll most likely not find one even on retry and if
there was page added at the end by extension that we might reuse partially
here then there is no harm in adding new one anyway as the whole point of
this patch is that it does bigger extension that strictly necessary so
insisting on page reuse for something that seems like only theoretical
possibility that does not even exist in current code does not seem right.

I'm not sure I completely follow this. The fact that the last
sentence is 9 lines long may be related. :-)

I think it's pretty clearly important to re-check the FSM after
acquiring the extension lock. Otherwise, imagine that 25 backends
arrive at the exact same time. The first one gets the lock and
extends the relation 500 pages; the next one, 480, and so on. In
total, they extend the relation by 6500 pages, which is a bit rich.
Rechecking the FSM after acquiring the lock prevents that from
happening, and that's a very good thing. We'll do one 500-page
extension and that's it.
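The arithmetic above can be checked with a toy model (a sketch outside PostgreSQL; the min(512, waiters * 20) heuristic comes from the patch, but everything else here is invented for illustration):

```python
# Hypothetical model of the extension heuristic: whichever backend wins
# the extension lock extends the relation by min(512, waiters * 20) pages.

def extension_size(waiters):
    return min(512, waiters * 20)

def total_without_recheck(backends):
    # Without re-checking the FSM, each backend in turn acquires the lock,
    # sees the remaining queue, and extends the relation again.
    return sum(extension_size(w) for w in range(backends, 0, -1))

def total_with_recheck(backends):
    # With the re-check, only the first backend extends; the rest find
    # the freshly recorded pages in the FSM.
    return extension_size(backends)

print(total_without_recheck(25))  # 6500 pages: 500 + 480 + ... + 20
print(total_with_recheck(25))     # 500 pages: one extension and done
```

This reproduces the numbers in the paragraph above: 25 waiters yield one 500-page extension with the re-check, versus 6500 pages without it.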

However, I don't think using RecordAndGetPageWithFreeSpace rather than
GetPageWithFreeSpace is appropriate. We've already recorded free
space on that page, and recording it again is a bad idea. It's quite
possible that by the time we get the lock our old value is totally
inaccurate. If there's some advantage to searching in the more
targeted way that RecordAndGetPageWithFreeSpace does over
GetPageWithFreeSpace then we need a new API into the freespace stuff
that does the more targeted search without updating anything.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#88Petr Jelinek
petr@2ndquadrant.com
In reply to: Robert Haas (#87)
Re: Relation extension scalability

On 23/03/16 20:43, Robert Haas wrote:

On Wed, Mar 23, 2016 at 3:33 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:

Btw thinking about it some more, ISTM that not finding the block and just
doing the extension if the FSM wasn't extended correctly previously is
probably cleaner behavior than what we do now. The reasoning for that
opinion is that if the FSM wasn't extended, we'll fix it by doing relation
extension since we know we do both in this code path and also if we could
not find page before we'll most likely not find one even on retry and if
there was page added at the end by extension that we might reuse partially
here then there is no harm in adding new one anyway as the whole point of
this patch is that it does bigger extension that strictly necessary so
insisting on page reuse for something that seems like only theoretical
possibility that does not even exist in current code does not seem right.

I'm not sure I completely follow this. The fact that the last
sentence is 9 lines long may be related. :-)

I tend to do that sometimes :)

I think it's pretty clearly important to re-check the FSM after
acquiring the extension lock. Otherwise, imagine that 25 backends
arrive at the exact same time. The first one gets the lock and
extends the relation 500 pages; the next one, 480, and so on. In
total, they extend the relation by 6500 pages, which is a bit rich.
Rechecking the FSM after acquiring the lock prevents that from
happening, and that's a very good thing. We'll do one 500-page
extension and that's it.

Right, but that would only happen if all the backends did it using
different code which does not do the FSM extension because the current
code does FSM extension and the point of using
RecordAndGetPageWithFreeSpace seems to be "just in case" somebody is
doing extension differently (at least I don't see other reason). So
basically I am not saying we shouldn't do the search but that I agree
GetPageWithFreeSpace should be enough as the worst that can happen is
that we overextend the relation in case some theoretical code from
somewhere else also did extension of relation without extending FSM
(afaics).

But maybe Dilip had some other reason for using the
RecordAndGetPageWithFreeSpace that is not documented.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#89Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#82)
Re: Relation extension scalability

On Thu, Mar 24, 2016 at 12:09 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Mar 22, 2016 at 1:12 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:

I've read this over several times and looked at
RecordAndGetPageWithFreeSpace() and I'm still confused. First of all,
if the lock was acquired by some other backend which did
RelationAddExtraBlocks(), it *will* have updated the FSM - that's the
whole point.

It doesn't update the FSM up to the root in some cases, as per the comments
on top of RecordPageWithFreeSpace and the code as well.

Second, if the other backend extended the relation in
some other manner and did not extend the FSM, how does calling
RecordAndGetPageWithFreeSpace help? As far as I can see,
GetPageWithFreeSpace and RecordAndGetPageWithFreeSpace are both just
searching the FSM, so if one is stymied the other will be too. What
am I missing?

RecordAndGetPageWithFreeSpace() tries to search from the oldPage passed to
it, rather than from top, so even if RecordPageWithFreeSpace() doesn't
update till root, it will be able to search the newly added page. I agree
with whatever you have said in another mail that we should introduce a new
API to do a more targeted search for such cases.
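A toy model of that situation may help (purely illustrative; these structures and names are invented and do not match PostgreSQL's real FSM layout): the leaf has up-to-date free-space information, but the root's cached maximum is stale until vacuum updates it, so a top-down search gives up while a search that starts at the known leaf still finds the page.

```python
# Toy two-level free-space map: the root caches the maximum free-space
# value of the leaf page below it.  The leaf was just updated for heap
# block 2, but the root was not (that happens later, during vacuum).

leaf = [0, 0, 200, 0]   # free space recorded per heap block
root = [0]              # stale cached maximum

def search_from_top(needed):
    # A top-down search trusts the (stale) root and gives up early.
    if root[0] < needed:
        return None
    return next(b for b, f in enumerate(leaf) if f >= needed)

def search_from_leaf(needed):
    # Searching the leaf page directly still finds the new space.
    for block, free in enumerate(leaf):
        if free >= needed:
            return block
    return None

print(search_from_top(100))   # None: the stale root hides the page
print(search_from_leaf(100))  # 2
```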

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#90Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#89)
Re: Relation extension scalability

On Wed, Mar 23, 2016 at 9:43 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

RecordAndGetPageWithFreeSpace() tries to search from the oldPage passed to
it, rather than from top, so even if RecordPageWithFreeSpace() doesn't
update till root, it will be able to search the newly added page. I agree
with whatever you have said in another mail that we should introduce a new
API to do a more targeted search for such cases.

OK, let's do that, then.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#91Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#90)
1 attachment(s)
Re: Relation extension scalability

On Thu, Mar 24, 2016 at 10:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Mar 23, 2016 at 9:43 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

RecordAndGetPageWithFreeSpace() tries to search from the oldPage passed to
it, rather than from top, so even if RecordPageWithFreeSpace() doesn't
update till root, it will be able to search the newly added page. I agree
with whatever you have said in another mail that we should introduce a new
API to do a more targeted search for such cases.

OK, let's do that, then.

OK, I have added a new API which just finds a free block, starting the
search from the last given page.

1. I have named the new function GetPageWithFreeSpaceUsingOldPage. I don't
like this name, but I could not come up with anything better; please
suggest one.

Also, the body of GetPageWithFreeSpaceUsingOldPage looks almost identical
to RecordAndGetPageWithFreeSpace; I tried to merge the two, but for that we
would need to pass an extra parameter to the function.

2. I also had to write one more function, fsm_search_from_addr, instead of
using fsm_set_and_search, so that we can find a block without updating the
other slot.

I have done a performance test just to confirm the result, and performance
is the same as before, with both COPY and INSERT.

3. I have also run the pgbench read-write test that Amit suggested
upthread. No regression or improvement with the pgbench workload.

Client    Base     Patch
1          899       914
8         5397      5413
32       18170     18052
64       29850     29941

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

multi_extend_v12.patch (application/octet-stream)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 8140418..3b46c4f 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -169,6 +169,55 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
 }
 
 /*
+ * Extend a relation by multiple blocks to avoid future contention on the
+ * relation extension lock.  Our goal is to pre-extend the relation by an
+ * amount which ramps up as the degree of contention ramps up, but limiting
+ * the result to some sane overall value.
+ */
+static void
+RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
+{
+	Page		page;
+	Size		freespace;
+	BlockNumber blockNum;
+	int			extraBlocks = 0;
+	int			lockWaiters = 0;
+	Buffer		buffer;
+
+	/*
+	 * We use the length of the lock wait queue to judge how much to extend.
+	 * It might seem like multiplying the number of lock waiters by as much
+	 * as 20 is too aggressive, but benchmarking revealed that smaller numbers
+	 * were insufficient.  512 is just an arbitrary cap to prevent pathological
+	 * results (and excessive wasted disk space).
+	 */
+	lockWaiters = RelationExtensionLockWaiterCount(relation);
+	extraBlocks = Min(512, lockWaiters * 20);
+
+	while (extraBlocks-- >= 0)
+	{
+		/* Ouch - an unnecessary lseek() each time through the loop! */
+		buffer = ReadBufferBI(relation, P_NEW, bistate);
+
+		/* Extend by one page. */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		page = BufferGetPage(buffer);
+		PageInit(page, BufferGetPageSize(buffer), 0);
+		freespace = PageGetHeapFreeSpace(page);
+		MarkBufferDirty(buffer);
+		blockNum = BufferGetBlockNumber(buffer);
+		UnlockReleaseBuffer(buffer);
+
+		/*
+		 * Put the page in the freespace map so other backends can find it.
+		 * This is what will keep those other backends from also queueing up
+		 * on the relation extension lock.
+		 */
+		RecordPageWithFreeSpace(relation, blockNum, freespace);
+	}
+}
+
+/*
  * RelationGetBufferForTuple
  *
  *	Returns pinned and exclusive-locked buffer of a page in given relation
@@ -233,10 +282,11 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	bool		use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
 	Buffer		buffer = InvalidBuffer;
 	Page		page;
-	Size		pageFreeSpace,
-				saveFreeSpace;
+	Size		pageFreeSpace = 0,
+				saveFreeSpace = 0;
 	BlockNumber targetBlock,
-				otherBlock;
+				otherBlock,
+				lastValidBlock = InvalidBlockNumber;
 	bool		needLock;
 
 	len = MAXALIGN(len);		/* be conservative */
@@ -308,6 +358,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
 		}
 	}
 
+loop:
 	while (targetBlock != InvalidBlockNumber)
 	{
 		/*
@@ -388,6 +439,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
 								 otherBlock, targetBlock, vmbuffer_other,
 								 vmbuffer);
 
+		lastValidBlock = targetBlock;
+
 		/*
 		 * Now we can check to see if there's enough free space here. If so,
 		 * we're done.
@@ -440,10 +493,57 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	 */
 	needLock = !RELATION_IS_LOCAL(relation);
 
+	/*
+	 * If we need the lock but are not able to acquire it immediately, we'll
+	 * consider extending the relation by multiple blocks at a time to manage
+	 * contention on the relation extension lock.  However, this only makes
+	 * sense if we're using the FSM; otherwise, there's no point.
+	 */
 	if (needLock)
-		LockRelationForExtension(relation, ExclusiveLock);
+	{
+		if (!use_fsm)
+			LockRelationForExtension(relation, ExclusiveLock);
+		else if (!ConditionLockRelationForExtension(relation, ExclusiveLock))
+		{
+			/* Couldn't get the lock immediately; wait for it. */
+			LockRelationForExtension(relation, ExclusiveLock);
+
+			if (lastValidBlock != InvalidBlockNumber)
+			{
+				 * Here we are calling GetPageWithFreeSpaceUsingOldPage
+				 * instead of GetPageWithFreeSpace, because another backend
+				 * that got the lock might have added extra blocks in
+				 * the FSM, and it's possible that the free space
+				 * information has not yet been propagated up to the root
+				 * node (it will be updated during vacuum).  So start the
+				 * search directly at the leaf level, where we ended the
+				 * search last time.
+				 * the search last time.
+				 */
+				targetBlock = GetPageWithFreeSpaceUsingOldPage(relation,
+														lastValidBlock,
+														len + saveFreeSpace);
+			}
+
+			/*
+			 * If some other waiter has already extended the relation, we
+			 * don't need to do so; just use the existing freespace.
+			 */
+			if (targetBlock != InvalidBlockNumber)
+			{
+				UnlockRelationForExtension(relation, ExclusiveLock);
+				goto loop;
+			}
+
+			/* Time to bulk-extend. */
+			RelationAddExtraBlocks(relation, bistate);
+		}
+	}
 
 	/*
+	 * In addition to whatever extension we performed above, we always add
+	 * at least one block to satisfy our own request.
+	 *
 	 * XXX This does an lseek - rather expensive - but at the moment it is the
 	 * only way to accurately determine how many blocks are in a relation.  Is
 	 * it worth keeping an accurate file length in shared memory someplace,
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 2631080..ad94b08 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -109,6 +109,7 @@ static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
 				   uint8 newValue, uint8 minValue);
 static BlockNumber fsm_search(Relation rel, uint8 min_cat);
 static uint8 fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof);
+static int fsm_search_from_addr(Relation rel, FSMAddress addr, uint8 minValue);
 
 
 /******** Public API ********/
@@ -135,6 +136,37 @@ GetPageWithFreeSpace(Relation rel, Size spaceNeeded)
 }
 
 /*
+ * 		GetPageWithFreeSpaceUsingOldPage
+ *
+ * As above, but start the search from oldPage instead of starting from the
+ * root, so that we can find the appropriate page in cases where a free block
+ * was added to the FSM but the change has not yet been propagated up to the
+ * root.
+ */
+BlockNumber
+GetPageWithFreeSpaceUsingOldPage(Relation rel, BlockNumber oldPage,
+								Size spaceNeeded)
+{
+	int			search_cat = fsm_space_needed_to_cat(spaceNeeded);
+	FSMAddress	addr;
+	uint16		slot;
+	int			search_slot;
+
+	/* Get the location of the FSM byte representing the heap block */
+	addr = fsm_get_location(oldPage, &slot);
+
+	search_slot = fsm_search_from_addr(rel, addr, search_cat);
+
+	/*
+	 * If fsm_set_and_search found a suitable new block, return that.
+	 * Otherwise, search as usual.
+	 */
+	if (search_slot != -1)
+		return fsm_get_heap_blk(addr, search_slot);
+	else
+		return fsm_search(rel, search_cat);
+}
+
+/*
  * RecordAndGetPageWithFreeSpace - update info about a page and try again.
  *
  * We provide this combo form to save some locking overhead, compared to
@@ -634,6 +666,35 @@ fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
 }
 
 /*
+ * Search the FSM tree for free space > minValue.
+ * It starts the search from the given addr, and is used for finding the
+ * required page in cases where vacuum has not yet updated the FSM tree
+ * up to the root level.
+ * If one is found, its slot number is returned, -1 otherwise.
+ */
+static int
+fsm_search_from_addr(Relation rel, FSMAddress addr, uint8 minValue)
+{
+	Buffer		buf;
+	int			newslot = -1;
+
+	buf = fsm_readbuf(rel, addr, true);
+	LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+	if (minValue != 0)
+	{
+		/* Search while we still hold the lock */
+		newslot = fsm_search_avail(buf, minValue,
+								   addr.level == FSM_BOTTOM_LEVEL,
+								   false);
+	}
+
+	UnlockReleaseBuffer(buf);
+
+	return newslot;
+}
+
+/*
  * Search the tree for a heap page with at least min_cat of free space
  */
 static BlockNumber
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 0632fc0..7e04137 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -341,6 +341,41 @@ LockRelationForExtension(Relation relation, LOCKMODE lockmode)
 }
 
 /*
+ *		ConditionLockRelationForExtension
+ *
+ * As above, but only lock if we can get the lock without blocking.
+ * Returns TRUE iff the lock was acquired.
+ */
+bool
+ConditionLockRelationForExtension(Relation relation, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return (LockAcquire(&tag, lockmode, false, true) != LOCKACQUIRE_NOT_AVAIL);
+}
+
+/*
+ *		RelationExtensionLockWaiterCount
+ *
+ * Count the lock requesters for the relation extension lock.
+ */
+int
+RelationExtensionLockWaiterCount(Relation relation)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return LockWaiterCount(&tag);
+}
+
+/*
  *		UnlockRelationForExtension
  */
 void
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index b30b7b1..353f705 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -4380,3 +4380,40 @@ VirtualXactLock(VirtualTransactionId vxid, bool wait)
 	LockRelease(&tag, ShareLock, false);
 	return true;
 }
+
+/*
+ * LockWaiterCount
+ *
+ * Find the number of lock requesters on this locktag.
+ */
+int
+LockWaiterCount(const LOCKTAG *locktag)
+{
+	LOCKMETHODID lockmethodid = locktag->locktag_lockmethodid;
+	LOCK	   *lock;
+	bool		found;
+	uint32		hashcode;
+	LWLock		*partitionLock;
+	int			waiters = 0;
+
+	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
+		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
+
+	hashcode = LockTagHashCode(locktag);
+	partitionLock = LockHashPartitionLock(hashcode);
+	LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+
+	lock = (LOCK *) hash_search_with_hash_value(LockMethodLockHash,
+												(const void *) locktag,
+												hashcode,
+												HASH_FIND,
+												&found);
+	if (found)
+	{
+		Assert(lock != NULL);
+		waiters = lock->nRequested;
+	}
+	LWLockRelease(partitionLock);
+
+	return waiters;
+}
diff --git a/src/include/storage/freespace.h b/src/include/storage/freespace.h
index 19dcb8d..0e67e47 100644
--- a/src/include/storage/freespace.h
+++ b/src/include/storage/freespace.h
@@ -32,5 +32,9 @@ extern void XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk,
 
 extern void FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks);
 extern void FreeSpaceMapVacuum(Relation rel);
+extern BlockNumber GetPageWithFreeSpaceUsingOldPage(Relation rel,
+								BlockNumber oldPage,
+								Size spaceNeeded);
+
 
 #endif   /* FREESPACE_H_ */
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 975b6f8..4460756 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -53,6 +53,9 @@ extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
 /* Lock a relation for extension */
 extern void LockRelationForExtension(Relation relation, LOCKMODE lockmode);
 extern void UnlockRelationForExtension(Relation relation, LOCKMODE lockmode);
+extern bool ConditionLockRelationForExtension(Relation relation,
+								  LOCKMODE lockmode);
+extern int	RelationExtensionLockWaiterCount(Relation relation);
 
 /* Lock a page (currently only used within indexes) */
 extern void LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index b26427d..9c08679 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -574,6 +574,8 @@ extern void RememberSimpleDeadLock(PGPROC *proc1,
 					   PGPROC *proc2);
 extern void InitDeadLockChecking(void);
 
+extern int LockWaiterCount(const LOCKTAG *locktag);
+
 #ifdef LOCK_DEBUG
 extern void DumpLocks(PGPROC *proc);
 extern void DumpAllLocks(void);
#92Petr Jelinek
petr@2ndquadrant.com
In reply to: Dilip Kumar (#91)
Re: Relation extension scalability

On 24/03/16 07:04, Dilip Kumar wrote:

On Thu, Mar 24, 2016 at 10:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Mar 23, 2016 at 9:43 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

RecordAndGetPageWithFreeSpace() tries to search from the oldPage passed to
it, rather than from top, so even if RecordPageWithFreeSpace() doesn't
update till root, it will be able to search the newly added page. I agree
with whatever you have said in another mail that we should introduce a new
API to do a more targeted search for such cases.

OK, let's do that, then.

OK, I have added a new API which just finds a free block, starting the
search from the last given page.

1. I have named the new function GetPageWithFreeSpaceUsingOldPage. I don't
like this name, but I could not come up with anything better; please
suggest one.

GetNearestPageWithFreeSpace? (although I'm not sure that's an accurate
description; maybe Nearby would be better)

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#93Amit Kapila
amit.kapila16@gmail.com
In reply to: Petr Jelinek (#92)
Re: Relation extension scalability

On Thu, Mar 24, 2016 at 1:48 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:

On 24/03/16 07:04, Dilip Kumar wrote:

On Thu, Mar 24, 2016 at 10:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Mar 23, 2016 at 9:43 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

RecordAndGetPageWithFreeSpace() tries to search from the oldPage passed to
it, rather than from top, so even if RecordPageWithFreeSpace() doesn't
update till root, it will be able to search the newly added page. I agree
with whatever you have said in another mail that we should introduce a new
API to do a more targeted search for such cases.

OK, let's do that, then.

OK, I have added a new API which just finds a free block, starting the
search from the last given page.

1. I have named the new function GetPageWithFreeSpaceUsingOldPage. I don't
like this name, but I could not come up with anything better; please
suggest one.

1.
+GetPageWithFreeSpaceUsingOldPage(Relation rel, BlockNumber oldPage,
+ Size spaceNeeded)
{
..
+ /*
+ * If fsm_set_and_search found a suitable new block, return that.
+ * Otherwise, search as usual.
+ */
..
}

In the above comment, you are referring to the wrong function.

2.
+static int
+fsm_search_from_addr(Relation rel, FSMAddress addr, uint8 minValue)
+{
+ Buffer buf;
+ int newslot = -1;
+
+ buf = fsm_readbuf(rel, addr, true);
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+ if (minValue != 0)
+ {
+ /* Search while we still hold the lock */
+ newslot = fsm_search_avail(buf, minValue,
+   addr.level == FSM_BOTTOM_LEVEL,
+   false);

In this new API, I don't understand why we need the minValue != 0 check;
if the user of the API doesn't want to search for space > 0, then what is
the need of calling this API? I think this API should use Assert for
minValue != 0 unless you see a reason for not doing so.

GetNearestPageWithFreeSpace? (although not sure that's accurate

description, maybe Nearby would be better)

Better than what is used in patch.

Yet another possibility could be to call it as GetPageWithFreeSpaceExtended
and call it from GetPageWithFreeSpace with value of oldPage
as InvalidBlockNumber.
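The naming scheme Amit suggests here follows a common C idiom: keep the
existing function as a thin wrapper and add an "extended" variant that
takes the extra hint parameter, with a sentinel meaning "no hint". A
minimal standalone sketch of that pattern (the array-of-categories FSM,
the function names, and INVALID_BLOCK are all invented for illustration;
this is not the actual freespace.c code):

```c
#include <assert.h>

#define INVALID_BLOCK ((int) -1)	/* stand-in for InvalidBlockNumber */

/* Extended variant: honors a starting-point hint when one is given. */
static int get_page_extended(const int *freecat, int npages,
							 int old_page, int space_needed)
{
	/* If a hint was given, scan from there first (cheap targeted search). */
	if (old_page != INVALID_BLOCK)
	{
		for (int i = old_page; i < npages; i++)
			if (freecat[i] >= space_needed)
				return i;
	}
	/* Otherwise, or on failure, fall back to a full search from the start. */
	for (int i = 0; i < npages; i++)
		if (freecat[i] >= space_needed)
			return i;
	return INVALID_BLOCK;
}

/* The original entry point stays a thin wrapper passing the sentinel. */
static int get_page(const int *freecat, int npages, int space_needed)
{
	return get_page_extended(freecat, npages, INVALID_BLOCK, space_needed);
}
```

The benefit of the wrapper approach is that existing callers keep their
signature unchanged while new callers can opt into the targeted search.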

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#94Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#93)
1 attachment(s)
Re: Relation extension scalability

On Thu, Mar 24, 2016 at 6:13 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

1.
+GetPageWithFreeSpaceUsingOldPage(Relation rel, BlockNumber oldPage,
+ Size spaceNeeded)
{
..
+ /*
+ * If fsm_set_and_search found a suitable new block, return that.
+ * Otherwise, search as usual.
+ */
..
}

In the above comment, you are referring to the wrong function.

Fixed

2.
+static int
+fsm_search_from_addr(Relation rel, FSMAddress addr, uint8 minValue)
+{
+ Buffer buf;
+ int newslot = -1;
+
+ buf = fsm_readbuf(rel, addr, true);
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+ if (minValue != 0)
+ {
+ /* Search while we still hold the lock */
+ newslot = fsm_search_avail(buf, minValue,
+   addr.level == FSM_BOTTOM_LEVEL,
+   false);

In this new API, I don't understand why we need the minValue != 0 check;
if the user of the API doesn't want to search for space > 0, then what is
the need of calling this API? I think this API should use Assert for
minValue != 0 unless you see a reason for not doing so.

Agreed, it should be an assert.

GetNearestPageWithFreeSpace? (although not sure that's accurate

description, maybe Nearby would be better)

Better than what is used in patch.

Yet another possibility could be to call it as
GetPageWithFreeSpaceExtended and call it from GetPageWithFreeSpace with
value of oldPage as InvalidBlockNumber.

Yes I like this.. Changed the same.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

multi_extend_v13.patch (application/octet-stream)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 8140418..d3608c6 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -169,6 +169,55 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
 }
 
 /*
+ * Extend a relation by multiple blocks to avoid future contention on the
+ * relation extension lock.  Our goal is to pre-extend the relation by an
+ * amount which ramps up as the degree of contention ramps up, but limiting
+ * the result to some sane overall value.
+ */
+static void
+RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
+{
+	Page		page;
+	Size		freespace;
+	BlockNumber blockNum;
+	int			extraBlocks = 0;
+	int			lockWaiters = 0;
+	Buffer		buffer;
+
+	/*
+	 * We use the length of the lock wait queue to judge how much to extend.
+	 * It might seem like multiplying the number of lock waiters by as much
+	 * as 20 is too aggressive, but benchmarking revealed that smaller numbers
+	 * were insufficient.  512 is just an arbitrary cap to prevent pathological
+	 * results (and excessive wasted disk space).
+	 */
+	lockWaiters = RelationExtensionLockWaiterCount(relation);
+	extraBlocks = Min(512, lockWaiters * 20);
+
+	while (extraBlocks-- >= 0)
+	{
+		/* Ouch - an unnecessary lseek() each time through the loop! */
+		buffer = ReadBufferBI(relation, P_NEW, bistate);
+
+		/* Extend by one page. */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		page = BufferGetPage(buffer);
+		PageInit(page, BufferGetPageSize(buffer), 0);
+		freespace = PageGetHeapFreeSpace(page);
+		MarkBufferDirty(buffer);
+		blockNum = BufferGetBlockNumber(buffer);
+		UnlockReleaseBuffer(buffer);
+
+		/*
+		 * Put the page in the freespace map so other backends can find it.
+		 * This is what will keep those other backends from also queueing up
+		 * on the relation extension lock.
+		 */
+		RecordPageWithFreeSpace(relation, blockNum, freespace);
+	}
+}
+
+/*
  * RelationGetBufferForTuple
  *
  *	Returns pinned and exclusive-locked buffer of a page in given relation
@@ -233,10 +282,11 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	bool		use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
 	Buffer		buffer = InvalidBuffer;
 	Page		page;
-	Size		pageFreeSpace,
-				saveFreeSpace;
+	Size		pageFreeSpace = 0,
+				saveFreeSpace = 0;
 	BlockNumber targetBlock,
-				otherBlock;
+				otherBlock,
+				lastValidBlock = InvalidBlockNumber;
 	bool		needLock;
 
 	len = MAXALIGN(len);		/* be conservative */
@@ -308,6 +358,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
 		}
 	}
 
+loop:
 	while (targetBlock != InvalidBlockNumber)
 	{
 		/*
@@ -388,6 +439,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
 								 otherBlock, targetBlock, vmbuffer_other,
 								 vmbuffer);
 
+		lastValidBlock = targetBlock;
+
 		/*
 		 * Now we can check to see if there's enough free space here. If so,
 		 * we're done.
@@ -440,10 +493,57 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	 */
 	needLock = !RELATION_IS_LOCAL(relation);
 
+	/*
+	 * If we need the lock but are not able to acquire it immediately, we'll
+	 * consider extending the relation by multiple blocks at a time to manage
+	 * contention on the relation extension lock.  However, this only makes
+	 * sense if we're using the FSM; otherwise, there's no point.
+	 */
 	if (needLock)
-		LockRelationForExtension(relation, ExclusiveLock);
+	{
+		if (!use_fsm)
+			LockRelationForExtension(relation, ExclusiveLock);
+		else if (!ConditionLockRelationForExtension(relation, ExclusiveLock))
+		{
+			/* Couldn't get the lock immediately; wait for it. */
+			LockRelationForExtension(relation, ExclusiveLock);
+
+			if (lastValidBlock != InvalidBlockNumber)
+			{
+				/*
+				 * Here we call GetPageWithFreeSpaceExtended instead of
+				 * GetPageWithFreeSpace, because other backends that got the
+				 * lock might have added extra blocks in the FSM, and it's
+				 * possible that the free space information has not yet been
+				 * propagated up to the root node (it will be updated during
+				 * vacuum).
+				 * So start the search directly at the leaf level, where we
+				 * ended the search last time.
+				 */
+				targetBlock = GetPageWithFreeSpaceExtended(relation,
+														lastValidBlock,
+														len + saveFreeSpace);
+			}
+
+			/*
+			 * If some other waiter has already extended the relation, we
+			 * don't need to do so; just use the existing freespace.
+			 */
+			if (targetBlock != InvalidBlockNumber)
+			{
+				UnlockRelationForExtension(relation, ExclusiveLock);
+				goto loop;
+			}
+
+			/* Time to bulk-extend. */
+			RelationAddExtraBlocks(relation, bistate);
+		}
+	}
 
 	/*
+	 * In addition to whatever extension we performed above, we always add
+	 * at least one block to satisfy our own request.
+	 *
 	 * XXX This does an lseek - rather expensive - but at the moment it is the
 	 * only way to accurately determine how many blocks are in a relation.  Is
 	 * it worth keeping an accurate file length in shared memory someplace,
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 2631080..980651e 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -109,6 +109,7 @@ static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
 				   uint8 newValue, uint8 minValue);
 static BlockNumber fsm_search(Relation rel, uint8 min_cat);
 static uint8 fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof);
+static int fsm_search_from_addr(Relation rel, FSMAddress addr, uint8 minValue);
 
 
 /******** Public API ********/
@@ -129,9 +130,46 @@ static uint8 fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof);
 BlockNumber
 GetPageWithFreeSpace(Relation rel, Size spaceNeeded)
 {
-	uint8		min_cat = fsm_space_needed_to_cat(spaceNeeded);
+	/*
+	 * Call GetPageWithFreeSpaceExtended with InvalidBlockNumber so that
+	 * it will search the FSM tree from the root
+	 */
+	return GetPageWithFreeSpaceExtended(rel, InvalidBlockNumber, spaceNeeded);
+}
+
+/*
+ * 		GetPageWithFreeSpaceExtended
+ *
+ * As above, but start the search from oldPage instead of from the root,
+ * so that we can find the appropriate page in cases where a free block
+ * was added to the FSM but not yet propagated up to the root.  If
+ * oldPage is InvalidBlockNumber, start the search from the root.
+ */
+BlockNumber
+GetPageWithFreeSpaceExtended(Relation rel, BlockNumber oldPage,
+								Size spaceNeeded)
+{
+	int			search_cat = fsm_space_needed_to_cat(spaceNeeded);
+	FSMAddress	addr;
+	uint16		slot;
+	int			search_slot = -1;
+
+	if (oldPage != InvalidBlockNumber)
+	{
+		/* Get the location of the FSM byte representing the heap block */
+		addr = fsm_get_location(oldPage, &slot);
+
+		search_slot = fsm_search_from_addr(rel, addr, search_cat);
+	}
 
-	return fsm_search(rel, min_cat);
+	/*
+	 * If fsm_search_from_addr found a suitable new block, return that.
+	 * Otherwise, search as usual.
+	 */
+	if (search_slot != -1)
+		return fsm_get_heap_blk(addr, search_slot);
+	else
+		return fsm_search(rel, search_cat);
 }
 
 /*
@@ -634,6 +672,34 @@ fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
 }
 
 /*
+ * Search the FSM tree for free space >= minValue.
+ * The search starts from the given addr, and this is used for finding the
+ * required page in cases where vacuum has not yet updated the FSM tree up
+ * to the root level.
+ * If a suitable slot is found, its slot number is returned, -1 otherwise.
+ */
+static int
+fsm_search_from_addr(Relation rel, FSMAddress addr, uint8 minValue)
+{
+	Buffer		buf;
+	int			newslot = -1;
+
+	buf = fsm_readbuf(rel, addr, true);
+	LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+	Assert(minValue != 0);
+
+	/* Search while we still hold the lock */
+	newslot = fsm_search_avail(buf, minValue,
+							addr.level == FSM_BOTTOM_LEVEL,
+							false);
+
+	UnlockReleaseBuffer(buf);
+
+	return newslot;
+}
+
+/*
  * Search the tree for a heap page with at least min_cat of free space
  */
 static BlockNumber
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 0632fc0..7e04137 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -341,6 +341,41 @@ LockRelationForExtension(Relation relation, LOCKMODE lockmode)
 }
 
 /*
+ *		ConditionLockRelationForExtension
+ *
+ * As above, but only lock if we can get the lock without blocking.
+ * Returns TRUE iff the lock was acquired.
+ */
+bool
+ConditionLockRelationForExtension(Relation relation, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return (LockAcquire(&tag, lockmode, false, true) != LOCKACQUIRE_NOT_AVAIL);
+}
+
+/*
+ *		RelationExtensionLockWaiterCount
+ *
+ * Count the lock requesters for the RelationExtension lock.
+ */
+int
+RelationExtensionLockWaiterCount(Relation relation)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return LockWaiterCount(&tag);
+}
+
+/*
  *		UnlockRelationForExtension
  */
 void
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index b30b7b1..353f705 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -4380,3 +4380,40 @@ VirtualXactLock(VirtualTransactionId vxid, bool wait)
 	LockRelease(&tag, ShareLock, false);
 	return true;
 }
+
+/*
+ * LockWaiterCount
+ *
+ * Find the number of lock requesters on this locktag
+ */
+int
+LockWaiterCount(const LOCKTAG *locktag)
+{
+	LOCKMETHODID lockmethodid = locktag->locktag_lockmethodid;
+	LOCK	   *lock;
+	bool		found;
+	uint32		hashcode;
+	LWLock		*partitionLock;
+	int			waiters = 0;
+
+	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
+		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
+
+	hashcode = LockTagHashCode(locktag);
+	partitionLock = LockHashPartitionLock(hashcode);
+	LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+
+	lock = (LOCK *) hash_search_with_hash_value(LockMethodLockHash,
+												(const void *) locktag,
+												hashcode,
+												HASH_FIND,
+												&found);
+	if (found)
+	{
+		Assert(lock != NULL);
+		waiters = lock->nRequested;
+	}
+	LWLockRelease(partitionLock);
+
+	return waiters;
+}
diff --git a/src/include/storage/freespace.h b/src/include/storage/freespace.h
index 19dcb8d..27cb971 100644
--- a/src/include/storage/freespace.h
+++ b/src/include/storage/freespace.h
@@ -32,5 +32,9 @@ extern void XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk,
 
 extern void FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks);
 extern void FreeSpaceMapVacuum(Relation rel);
+extern BlockNumber GetPageWithFreeSpaceExtended(Relation rel,
+								BlockNumber oldPage,
+								Size spaceNeeded);
+
 
 #endif   /* FREESPACE_H_ */
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 975b6f8..4460756 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -53,6 +53,9 @@ extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
 /* Lock a relation for extension */
 extern void LockRelationForExtension(Relation relation, LOCKMODE lockmode);
 extern void UnlockRelationForExtension(Relation relation, LOCKMODE lockmode);
+extern bool ConditionLockRelationForExtension(Relation relation,
+								  LOCKMODE lockmode);
+extern int	RelationExtensionLockWaiterCount(Relation relation);
 
 /* Lock a page (currently only used within indexes) */
 extern void LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index b26427d..9c08679 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -574,6 +574,8 @@ extern void RememberSimpleDeadLock(PGPROC *proc1,
 					   PGPROC *proc2);
 extern void InitDeadLockChecking(void);
 
+extern int LockWaiterCount(const LOCKTAG *locktag);
+
 #ifdef LOCK_DEBUG
 extern void DumpLocks(PGPROC *proc);
 extern void DumpAllLocks(void);
#95Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#94)
Re: Relation extension scalability

On Thu, Mar 24, 2016 at 7:17 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

Yet another possibility could be to call it as
GetPageWithFreeSpaceExtended and call it from GetPageWithFreeSpace with
value of oldPage as InvalidBlockNumber.

Yes I like this.. Changed the same.

After thinking about this some more, I don't think this is the right
approach. I finally understand what's going on here:
RecordPageWithFreeSpace updates the FSM lazily, only adjusting the
leaves and not the upper levels. It relies on VACUUM to update the
upper levels. This seems like it might be a bad policy in general,
because VACUUM on a very large relation may be quite infrequent, and
you could lose track of a lot of space for a long time, leading to a
lot of extra bloat. However, it's a particularly bad policy for bulk
relation extension, because you're stuffing a large number of totally
free pages in there in a way that doesn't make them particularly easy
for anybody else to discover. There are two ways we can fail here:

1. Callers who use GetPageWithFreeSpace() rather than
GetPageWithFreeSpaceExtended() will fail to find the new pages if the
upper map levels haven't been updated by VACUUM.

2. Even callers who use GetPageWithFreeSpaceExtended() may fail to find
the new pages. This can happen in two separate ways, namely (a) the
lastValidBlock saved by RelationGetBufferForTuple() can be in the
middle of the relation someplace rather than near the end, or (b) the
bulk-extension performed by some other backend can have overflowed
onto some new FSM page that won't be searched even though a relatively
plausible lastValidBlock was passed.

It seems to me that since we're adding a whole bunch of empty pages at
once, it's worth the effort to update the upper levels of the FSM.
This isn't a case of discovering a single page with an extra few bytes
of storage available due to a HOT prune or something - this is a case
of putting at least 20 and plausibly hundreds of extra pages into the
FSM. The extra effort to update the upper FSM pages is trivial by
comparison with the cost of extending the relation by many blocks.

So, I suggest adding a new function FreeSpaceMapBulkExtend(BlockNumber
first_block, BlockNumber last_block) which sets all the FSM entries
for pages between first_block and last_block to 255 and then bubbles
that up to the higher levels of the tree and all the way to the root.
Have the bulk extend code use that instead of repeatedly calling
RecordPageWithFreeSpace. That should actually be much more efficient,
because it can call fsm_readbuf(), LockBuffer(), and
UnlockReleaseBuffer() just once per FSM page instead of once per FSM
page *per byte modified*. Maybe that makes no difference in practice,
but it can't hurt.
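The idea Robert describes can be illustrated with a toy model (plain C,
not PostgreSQL code; the heap-array tree, LEAVES constant, and function
names here are invented for illustration). Setting a whole range of leaf
categories and then rebuilding the internal maxima touches each node
once, instead of once per modified leaf, and a root-first search then
finds the new space immediately:

```c
#include <assert.h>

/* Toy FSM: a perfect binary tree over LEAVES leaf slots, stored as a
 * 1-based heap (node i's children are 2i and 2i+1).  Leaves live at
 * indices LEAVES..2*LEAVES-1; each internal node caches the maximum
 * category of its subtree, as the real FSM's upper levels do. */
#define LEAVES 16
static unsigned char fsm[2 * LEAVES];

/* Set leaves [first, last] to 'cat', then rebuild the internal nodes
 * bottom-up.  Every node is visited at most once, regardless of the
 * range size -- the point of a bulk update over per-page updates. */
static void fsm_bulk_set(int first, int last, unsigned char cat)
{
	for (int i = first; i <= last; i++)
		fsm[LEAVES + i] = cat;
	for (int i = LEAVES - 1; i >= 1; i--)
	{
		unsigned char l = fsm[2 * i], r = fsm[2 * i + 1];
		fsm[i] = l > r ? l : r;
	}
}

/* Root-first search, in the spirit of GetPageWithFreeSpace: return a
 * leaf index with category >= min, or -1.  It only finds space the
 * upper levels already know about -- which is exactly why leaf-only
 * updates are invisible to it until vacuum fixes the upper levels. */
static int fsm_search_root(unsigned char min)
{
	if (fsm[1] < min)
		return -1;
	int i = 1;
	while (i < LEAVES)
		i = (fsm[2 * i] >= min) ? 2 * i : 2 * i + 1;
	return i - LEAVES;
}
```

A real implementation would only rewrite the FSM pages the range touches,
but the cost argument is the same: one lock/unlock per FSM page, not one
per byte modified.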

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#96Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#95)
1 attachment(s)
Re: Relation extension scalability

On Fri, Mar 25, 2016 at 3:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:

1. Callers who use GetPageWithFreeSpace() rather than
GetPageWithFreeSpaceExtended() will fail to find the new pages if the
upper map levels haven't been updated by VACUUM.

2. Even callers who use GetPageWithFreeSpaceExtended() may fail to find
the new pages. This can happen in two separate ways, namely (a) the

Yeah, that's the issue: if the extended pages spill to the next FSM page,
then other waiters will not find those pages, and one by one all the
waiters will end up adding extra pages. For example, if there are ~25
waiters, then total blocks extended = (25 * (25+1)/2) * 20 =~ 6500 pages.

This is not the case every time, but it will happen whenever the heap
blocks go to a new FSM page.

- An FSM page can hold info for 4096 heap blocks, so after roughly every
8th extension (assuming 512 blocks are extended at a time), it will extend
~6500 pages.

- Any new request to RelationGetBufferForTuple will be able to find those
pages, because by that time the backend which extended the pages will have
set a new block using RelationSetTargetBlock. (There is still a chance
that some blocks are left completely unused until vacuum comes.)
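The cascade arithmetic above can be sketched as a small model (a rough
sketch, not PostgreSQL code: the per-waiter multiplier of 20 and the cap
of 512 are taken from the patch's heuristic, and the assumption that the
waiters wake one at a time and each re-extends is the worst case being
described here):

```c
#include <assert.h>

/* Per the patch's heuristic: extend by 20 blocks per lock waiter,
 * capped at 512 to bound wasted disk space. */
static int extra_blocks(int lock_waiters)
{
	int n = lock_waiters * 20;
	return n < 512 ? n : 512;
}

/* Worst case when the new pages spill onto an FSM page the other
 * waiters never search: the waiters wake up one at a time, each sees
 * no free space, and each re-extends based on the remaining queue. */
static int cascade_total(int waiters)
{
	int total = 0;
	for (int w = waiters; w > 0; w--)
		total += extra_blocks(w);
	return total;
}
```

With 25 waiters and no term hitting the cap this is the triangular sum
(25 * 26 / 2) * 20 = 6500 pages.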

I have changed the patch as per the suggestion (just a POC because the
performance numbers are not that great).

Below is a performance comparison of base, the previous patch (v13), and
the latest patch (v14).

The performance of patch v14 is significantly lower than v13, mainly, I
guess, for the reasons below:

1. As per the above calculation, v13 extends ~6500 blocks (after every 8th
extension), and that's why it performs well.

2. In v13, as soon as we extend a block we add it to the FSM, so it is
immediately available to new requesters. (In this patch I also tried adding
blocks to the FSM one by one and updating the FSM tree up to the root after
all pages were added, but saw no significant improvement.)

3. fsm_update_recursive doesn't seem like a problem to me. Does it?

Copy 10000 tuples of 4 bytes each:

Client   base   patch v13   patch v14
1         118         147         126
2         217         276         269
4         210         421         347
8         166         630         375
16        145         813         415
32        124         985         451
64          -         974         455

Insert 1000 tuples of 1K size each:

Client   base   patch v13   patch v14
1         117         124         119
2         111         126         119
4          51         128         124
8          43         149         131
16         40         217         120
32          -         263         115
64          -         248         109

Note: I think the one-client numbers may just be run-to-run variance.

Does anyone see a problem in updating the FSM tree this way? I have
debugged it and seen that we are able to get the pages properly from the
tree, and the same is visible in the performance numbers of v14 compared
to base.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

multi_extend_v14_poc.patch (text/x-diff; charset=US-ASCII)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 8140418..9ac1eb7 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -169,6 +169,62 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
 }
 
 /*
+ * Extend a relation by multiple blocks to avoid future contention on the
+ * relation extension lock.  Our goal is to pre-extend the relation by an
+ * amount which ramps up as the degree of contention ramps up, but limiting
+ * the result to some sane overall value.
+ */
+static void
+RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
+{
+	Page		page;
+	BlockNumber blockNum,
+				firstBlock,
+				lastBlock;
+	int			extraBlocks = 0;
+	int			lockWaiters = 0;
+	Buffer		buffer;
+
+	/*
+	 * We use the length of the lock wait queue to judge how much to extend.
+	 * It might seem like multiplying the number of lock waiters by as much
+	 * as 20 is too aggressive, but benchmarking revealed that smaller numbers
+	 * were insufficient.  512 is just an arbitrary cap to prevent pathological
+	 * results (and excessive wasted disk space).
+	 */
+	lockWaiters = RelationExtensionLockWaiterCount(relation);
+	extraBlocks = Min(512, lockWaiters * 20);
+
+	firstBlock = lastBlock = InvalidBlockNumber;
+
+	while (extraBlocks-- >= 0)
+	{
+		/* Ouch - an unnecessary lseek() each time through the loop! */
+		buffer = ReadBufferBI(relation, P_NEW, bistate);
+
+		/* Extend by one page. */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		page = BufferGetPage(buffer);
+		PageInit(page, BufferGetPageSize(buffer), 0);
+		MarkBufferDirty(buffer);
+		blockNum = BufferGetBlockNumber(buffer);
+		UnlockReleaseBuffer(buffer);
+
+		if (firstBlock == InvalidBlockNumber)
+			firstBlock = blockNum;
+
+		lastBlock = blockNum;
+	}
+
+	/*
+	 * Put the page in the freespace map so other backends can find it.
+	 * This is what will keep those other backends from also queueing up
+	 * on the relation extension lock.
+	 */
+	FreeSpaceMapBulkExtend(relation, firstBlock, lastBlock);
+}
+
+/*
  * RelationGetBufferForTuple
  *
  *	Returns pinned and exclusive-locked buffer of a page in given relation
@@ -233,8 +289,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	bool		use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
 	Buffer		buffer = InvalidBuffer;
 	Page		page;
-	Size		pageFreeSpace,
-				saveFreeSpace;
+	Size		pageFreeSpace = 0,
+				saveFreeSpace = 0;
 	BlockNumber targetBlock,
 				otherBlock;
 	bool		needLock;
@@ -308,6 +364,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
 		}
 	}
 
+loop:
 	while (targetBlock != InvalidBlockNumber)
 	{
 		/*
@@ -440,10 +497,46 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	 */
 	needLock = !RELATION_IS_LOCAL(relation);
 
+	/*
+	 * If we need the lock but are not able to acquire it immediately, we'll
+	 * consider extending the relation by multiple blocks at a time to manage
+	 * contention on the relation extension lock.  However, this only makes
+	 * sense if we're using the FSM; otherwise, there's no point.
+	 */
 	if (needLock)
-		LockRelationForExtension(relation, ExclusiveLock);
+	{
+		if (!use_fsm)
+			LockRelationForExtension(relation, ExclusiveLock);
+		else if (!ConditionLockRelationForExtension(relation, ExclusiveLock))
+		{
+			/* Couldn't get the lock immediately; wait for it. */
+			LockRelationForExtension(relation, ExclusiveLock);
+
+			/*
+			 * Check if some other backend has extended a block for us while
+			 * we were waiting on the lock.
+			 */
+			targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
+
+			/*
+			 * If some other waiter has already extended the relation, we
+			 * don't need to do so; just use the existing freespace.
+			 */
+			if (targetBlock != InvalidBlockNumber)
+			{
+				UnlockRelationForExtension(relation, ExclusiveLock);
+				goto loop;
+			}
+
+			/* Time to bulk-extend. */
+			RelationAddExtraBlocks(relation, bistate);
+		}
+	}
 
 	/*
+	 * In addition to whatever extension we performed above, we always add
+	 * at least one block to satisfy our own request.
+	 *
 	 * XXX This does an lseek - rather expensive - but at the moment it is the
 	 * only way to accurately determine how many blocks are in a relation.  Is
 	 * it worth keeping an accurate file length in shared memory someplace,
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 2631080..5f45891 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -109,7 +109,10 @@ static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
 				   uint8 newValue, uint8 minValue);
 static BlockNumber fsm_search(Relation rel, uint8 min_cat);
 static uint8 fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof);
-
+static bool fsm_addr_on_same_page(FSMAddress addr1, FSMAddress addr2);
+static void fsm_set_range(Relation rel, FSMAddress addr, uint16 firstSlot,
+					uint16 lastSlot, uint8 newValue);
+static void fsm_update_recursive(Relation rel, FSMAddress addr, uint8 new_cat);
 
 /******** Public API ********/
 
@@ -189,6 +192,49 @@ RecordPageWithFreeSpace(Relation rel, BlockNumber heapBlk, Size spaceAvail)
 }
 
 /*
+ * Sets all the FSM entries for pages between firstBlock and lastBlock to
+ * the category for BLCKSZ free space, and then bubbles that up to the
+ * higher levels of the tree, all the way to the root.
+ */
+void
+FreeSpaceMapBulkExtend(Relation rel, BlockNumber firstBlock,
+					BlockNumber lastBlock)
+{
+	int			new_cat = fsm_space_avail_to_cat(BLCKSZ);
+	FSMAddress	addr1,
+				addr2;
+	uint16		firstSlot;
+	uint16		lastSlot;
+	uint16		slot;
+	BlockNumber	blockNum;
+
+	blockNum = firstBlock;
+	addr1 = fsm_get_location(firstBlock, &slot);
+
+	while (blockNum < lastBlock)
+	{
+		blockNum++;
+
+		addr2 = fsm_get_location(blockNum, &slot);
+
+		if (!fsm_addr_on_same_page(addr1, addr2)
+			|| blockNum == lastBlock)
+		{
+			/* This block is on the next FSM page, so update the current FSM page */
+			fsm_get_location(firstBlock, &firstSlot);
+			fsm_get_location(blockNum - 1, &lastSlot);
+			fsm_set_range(rel, addr1, firstSlot, lastSlot, new_cat);
+			fsm_update_recursive(rel, addr1, new_cat);
+			/*
+			 * Continue updating the FSM until the last page: set addr1 to the
+			 * new address and continue the scan from this page.
+			 */
+			addr1 = addr2;
+		}
+	}
+}
+
+/*
  * XLogRecordPageWithFreeSpace - like RecordPageWithFreeSpace, for use in
  *		WAL replay
  */
@@ -788,3 +834,56 @@ fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof_p)
 
 	return max_avail;
 }
+
+static bool
+fsm_addr_on_same_page(FSMAddress addr1, FSMAddress addr2)
+{
+	Assert(addr1);
+	Assert(addr2);
+
+	if ((addr1.level != addr2.level)
+		|| (addr1.logpageno != addr2.logpageno))
+		return false;
+
+	return true;
+}
+
+/*
+ * Set value in given FSM page for given slot range.
+ */
+static void
+fsm_set_range(Relation rel, FSMAddress addr, uint16 firstSlot,
+			uint16 lastSlot, uint8 newValue)
+{
+	Buffer		buf;
+	Page		page;
+	int			slot;
+
+	buf = fsm_readbuf(rel, addr, true);
+	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+	page = BufferGetPage(buf);
+
+	slot = firstSlot;
+
+	for (slot = firstSlot; slot <= lastSlot; slot++)
+		fsm_set_avail(page, slot, newValue);
+
+	MarkBufferDirtyHint(buf, false);
+
+	UnlockReleaseBuffer(buf);
+}
+
+static void
+fsm_update_recursive(Relation rel, FSMAddress addr, uint8 new_cat)
+{
+	uint16		parentslot;
+	FSMAddress	parent;
+
+	if (addr.level == FSM_ROOT_LEVEL)
+		return;
+
+	parent = fsm_get_parent(addr, &parentslot);
+	fsm_set_and_search(rel, parent, parentslot, new_cat, 0);
+	fsm_update_recursive(rel, parent, new_cat);
+}
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 0632fc0..7e04137 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -341,6 +341,41 @@ LockRelationForExtension(Relation relation, LOCKMODE lockmode)
 }
 
 /*
+ *		ConditionLockRelationForExtension
+ *
+ * As above, but only lock if we can get the lock without blocking.
+ * Returns TRUE iff the lock was acquired.
+ */
+bool
+ConditionLockRelationForExtension(Relation relation, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return (LockAcquire(&tag, lockmode, false, true) != LOCKACQUIRE_NOT_AVAIL);
+}
+
+/*
+ *		RelationExtensionLockWaiterCount
+ *
+ * Count the lock requesters for the RelationExtension lock.
+ */
+int
+RelationExtensionLockWaiterCount(Relation relation)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return LockWaiterCount(&tag);
+}
+
+/*
  *		UnlockRelationForExtension
  */
 void
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index b30b7b1..353f705 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -4380,3 +4380,40 @@ VirtualXactLock(VirtualTransactionId vxid, bool wait)
 	LockRelease(&tag, ShareLock, false);
 	return true;
 }
+
+/*
+ * LockWaiterCount
+ *
+ * Find the number of lock requesters on this locktag
+ */
+int
+LockWaiterCount(const LOCKTAG *locktag)
+{
+	LOCKMETHODID lockmethodid = locktag->locktag_lockmethodid;
+	LOCK	   *lock;
+	bool		found;
+	uint32		hashcode;
+	LWLock		*partitionLock;
+	int			waiters = 0;
+
+	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
+		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
+
+	hashcode = LockTagHashCode(locktag);
+	partitionLock = LockHashPartitionLock(hashcode);
+	LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+
+	lock = (LOCK *) hash_search_with_hash_value(LockMethodLockHash,
+												(const void *) locktag,
+												hashcode,
+												HASH_FIND,
+												&found);
+	if (found)
+	{
+		Assert(lock != NULL);
+		waiters = lock->nRequested;
+	}
+	LWLockRelease(partitionLock);
+
+	return waiters;
+}
diff --git a/src/include/storage/freespace.h b/src/include/storage/freespace.h
index 19dcb8d..0f2a2be 100644
--- a/src/include/storage/freespace.h
+++ b/src/include/storage/freespace.h
@@ -32,5 +32,7 @@ extern void XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk,
 
 extern void FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks);
 extern void FreeSpaceMapVacuum(Relation rel);
+extern void FreeSpaceMapBulkExtend(Relation rel, BlockNumber firstBlock,
+							BlockNumber lastBlock);
 
 #endif   /* FREESPACE_H_ */
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 975b6f8..4460756 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -53,6 +53,9 @@ extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
 /* Lock a relation for extension */
 extern void LockRelationForExtension(Relation relation, LOCKMODE lockmode);
 extern void UnlockRelationForExtension(Relation relation, LOCKMODE lockmode);
+extern bool ConditionLockRelationForExtension(Relation relation,
+								  LOCKMODE lockmode);
+extern int	RelationExtensionLockWaiterCount(Relation relation);
 
 /* Lock a page (currently only used within indexes) */
 extern void LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index b26427d..9c08679 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -574,6 +574,8 @@ extern void RememberSimpleDeadLock(PGPROC *proc1,
 					   PGPROC *proc2);
 extern void InitDeadLockChecking(void);
 
+extern int LockWaiterCount(const LOCKTAG *locktag);
+
 #ifdef LOCK_DEBUG
 extern void DumpLocks(PGPROC *proc);
 extern void DumpAllLocks(void);
#97Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#96)
Re: Relation extension scalability

On Fri, Mar 25, 2016 at 1:05 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Mar 25, 2016 at 3:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:

1. Callers who use GetPageWithFreeSpace() rather than
GetPageFreeSpaceExtended() will fail to find the new pages if the
upper map levels haven't been updated by VACUUM.

2. Even callers who use GetPageFreeSpaceExtended() may fail to find
the new pages. This can happen in two separate ways, namely (a) the

Yeah, that's the issue: if the extended pages spill to the next FSM page,
then other waiters will not find those pages, and one by one all the waiters
will end up adding extra pages.
For example, if there are ~30 waiters, then
total blocks extended = (25 * (25 + 1) / 2) * 20 =~ 6500 pages.

This is not the case every time, but it will happen whenever a heap block
goes to a new FSM page.

I think we need to start testing these patches not only in terms of
how *fast* they are but how *large* the relation ends up being when
we're done. A patch that inserts the rows slower but the final
relation is smaller may be better overall. Can you retest v13, v14,
and master, and post not only the timings but the relation size
afterwards? And maybe post the exact script you are using?

The performance of patch v14 is significantly lower than v13's, mainly, I
guess, for the reasons below:
1. As per the above calculation, v13 extends ~6500 blocks (after every 8th
extension), and that's why it performs well.

That should be completely unnecessary, though. I mean, if the problem
is that it's expensive to repeatedly acquire and release the relation
extension lock, then bulk-extending even 100 blocks at a time should
be enough to fix that, because you've reduced the number of times that
has to be done by 99%. There's no way we should need to extend by
thousands of blocks to get good performance.

Maybe something like this would help:

if (needLock)
{
if (!use_fsm)
LockRelationForExtension(relation, ExclusiveLock);
else if (!ConditionLockRelationForExtension(relation, ExclusiveLock))
{
BlockNumber last_blkno = RelationGetNumberOfBlocks(relation);

targetBlock = GetPageWithFreeSpaceExtended(relation,
last_blkno, len + saveFreeSpace);
if (targetBlock != InvalidBlockNumber)
goto loop;

LockRelationForExtension(relation, ExclusiveLock);
targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
if (targetBlock != InvalidBlockNumber)
{
UnlockRelationForExtension(relation, ExclusiveLock);
goto loop;
}
RelationAddExtraBlocks(relation, bistate);
}
}

I think this is better than what you had before with lastValidBlock,
because we're actually interested in searching the free space map at
the *very end* of the relation, not wherever the last target block
happens to have been.

We could go further still and have GetPageWithFreeSpace() always
search the last, say, two pages of the FSM in all cases. But that
might be expensive. The extra call to RelationGetNumberOfBlocks seems
cheap enough here because the alternative is to wait for a contended
heavyweight lock.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#98Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#97)
4 attachment(s)
Re: Relation extension scalability

On Sat, Mar 26, 2016 at 8:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I think we need to start testing these patches not only in terms of
how *fast* they are but how *large* the relation ends up being when
we're done. A patch that inserts the rows slower but the final
relation is smaller may be better overall. Can you retest v13, v14,
and master, and post not only the timings but the relation size
afterwards? And maybe post the exact script you are using?

I have tested both the size and the performance; the scripts are attached to this mail.

COPY of 1-10 byte tuples from 32 clients:

                   Base        v13          v14
TPS                123         874          446
No. of tuples      148270000   1049980000   536370000
Relpages           656089      4652593      2485482

INSERT of 1028-byte tuples from 16 clients:

                   Base        v13          v14
TPS                42          211          120
No. of tuples      5149000     25343000     14524000
Relpages           735577      3620765      2140612

From the above results, if we compare the number of tuples against the
corresponding relpages, neither v13 nor v14 wastes a large fraction of the
relation in unused pages.

As per my calculation for INSERT (1028-byte tuples), each page contains 7
tuples, so the number of pages required is:
Base: 5149000/7 = 735571 (from relpages we can see 6 extra pages)
v13:  25343000/7 = 3620428 (from relpages we can see ~300 extra pages)
v14:  14524000/7 = 2074857 (from relpages we can see ~70000 extra pages)

So v14 has the most extra pages. I expected v13 to have the most unused
pages, but it's v14, and that held across multiple runs; v13 is always the
winner. I also tested with other client counts, such as 8 and 32, and v13
always has only ~60-300 extra pages out of a total of ~2-4 million pages.

Attached files:
-------------------
test_size_ins.sh --> automated script to run the insert test and calculate
tuple count and relpages.
test_size_copy.sh --> automated script to run the copy test and calculate
tuple count and relpages.
copy_script --> copy pgbench script used by test_size_copy.sh
insert_script --> insert pgbench script used by test_size_ins.sh

Maybe something like this would help:

if (needLock)
{
if (!use_fsm)
LockRelationForExtension(relation, ExclusiveLock);
else if (!ConditionLockRelationForExtension(relation,
ExclusiveLock))
{
BlockNumber last_blkno =
RelationGetNumberOfBlocks(relation);

targetBlock = GetPageWithFreeSpaceExtended(relation,
last_blkno, len + saveFreeSpace);
if (targetBlock != InvalidBlockNumber)
goto loop;

LockRelationForExtension(relation, ExclusiveLock);
targetBlock = GetPageWithFreeSpace(relation, len +
saveFreeSpace);
if (targetBlock != InvalidBlockNumber)
{
UnlockRelationForExtension(relation, ExclusiveLock);
goto loop;
}
RelationAddExtraBlocks(relation, bistate);
}
}

I think this is better than what you had before with lastValidBlock,
because we're actually interested in searching the free space map at
the *very end* of the relation, not wherever the last target block
happens to have been.

We could go further still and have GetPageWithFreeSpace() always
search the last, say, two pages of the FSM in all cases. But that
might be expensive. The extra call to RelationGetNumberOfBlocks seems
cheap enough here because the alternative is to wait for a contended
heavyweight lock.

I will try the test with this also and post the results.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

copy_scriptapplication/octet-stream; name=copy_scriptDownload
insert_scriptapplication/octet-stream; name=insert_scriptDownload
test_size_copy.shapplication/x-sh; name=test_size_copy.shDownload
test_size_ins.shapplication/x-sh; name=test_size_ins.shDownload
#99Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#98)
5 attachment(s)
Re: Relation extension scalability

On Sat, Mar 26, 2016 at 3:18 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

search the last, say, two pages of the FSM in all cases. But that
might be expensive. The extra call to RelationGetNumberOfBlocks seems
cheap enough here because the alternative is to wait for a contended
heavyweight lock.

I will try the test with this also and post the results.

I have changed v14 as per this suggestion, and the results are the same as
v14's.

I have measured the relation size again, this time directly using the size
function, so the results are easier to understand.

Relation size
-----------------
INSERT: 16000 transactions from 32 clients

              Base     v13      v14_1
TPS           37       255      112
Rel size      17GB     17GB     18GB

COPY: 32000 transactions from 32 clients

              Base     v13      v14_1
TPS           121      823      427
Rel size      11GB     11GB     11GB

Scripts are attached to this mail:
----------------------------------------
test_size_ins.sh --> runs the insert test and calculates relation size.
test_size_copy.sh --> runs the copy test and calculates relation size.
copy_script --> copy pgbench script used by test_size_copy.sh
insert_script --> insert pgbench script used by test_size_ins.sh
multi_extend_v14_poc_v1.patch --> modified v14 patch.

I also tried modifying v14 from a different angle.

One approach is like below:
-------------------------
In AddExtraBlock
{
add pages to the FSM one by one, like v13 does,
then update the full FSM tree up to the root
}

Results:
----------
1. With this, performance is a little lower than v14's, but the problem of
the extra relation size is solved.
2. From this we can conclude that the extra relation size in v14 arises
because, while extending, some pages are not immediately visible in the FSM,
and at the end some of those pages are left unused.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

test_size_copy.shapplication/x-sh; name=test_size_copy.shDownload
test_size_ins.shapplication/x-sh; name=test_size_ins.shDownload
copy_scriptapplication/octet-stream; name=copy_scriptDownload
insert_scriptapplication/octet-stream; name=insert_scriptDownload
multi_extend_v14_poc_v1.patchapplication/octet-stream; name=multi_extend_v14_poc_v1.patchDownload
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 8140418..7d322f8 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -169,6 +169,62 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
 }
 
 /*
+ * Extend a relation by multiple blocks to avoid future contention on the
+ * relation extension lock.  Our goal is to pre-extend the relation by an
+ * amount which ramps up as the degree of contention ramps up, but limiting
+ * the result to some sane overall value.
+ */
+static void
+RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
+{
+	Page		page;
+	BlockNumber blockNum,
+				firstBlock,
+				lastBlock;
+	int			extraBlocks = 0;
+	int			lockWaiters = 0;
+	Buffer		buffer;
+
+	/*
+	 * We use the length of the lock wait queue to judge how much to extend.
+	 * It might seem like multiplying the number of lock waiters by as much
+	 * as 20 is too aggressive, but benchmarking revealed that smaller numbers
+	 * were insufficient.  512 is just an arbitrary cap to prevent pathological
+	 * results (and excessive wasted disk space).
+	 */
+	lockWaiters = RelationExtensionLockWaiterCount(relation);
+	extraBlocks = Min(512, lockWaiters * 20);
+
+	firstBlock = lastBlock = InvalidBlockNumber;
+
+	while (extraBlocks-- >= 0)
+	{
+		/* Ouch - an unnecessary lseek() each time through the loop! */
+		buffer = ReadBufferBI(relation, P_NEW, bistate);
+
+		/* Extend by one page. */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		page = BufferGetPage(buffer);
+		PageInit(page, BufferGetPageSize(buffer), 0);
+		MarkBufferDirty(buffer);
+		blockNum = BufferGetBlockNumber(buffer);
+		UnlockReleaseBuffer(buffer);
+
+		if (firstBlock == InvalidBlockNumber)
+			firstBlock = blockNum;
+
+		lastBlock = blockNum;
+	}
+
+	/*
+	 * Put the page in the freespace map so other backends can find it.
+	 * This is what will keep those other backends from also queueing up
+	 * on the relation extension lock.
+	 */
+	FreeSpaceMapBulkExtend(relation, firstBlock, lastBlock);
+}
+
+/*
  * RelationGetBufferForTuple
  *
  *	Returns pinned and exclusive-locked buffer of a page in given relation
@@ -233,8 +289,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	bool		use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
 	Buffer		buffer = InvalidBuffer;
 	Page		page;
-	Size		pageFreeSpace,
-				saveFreeSpace;
+	Size		pageFreeSpace = 0,
+				saveFreeSpace = 0;
 	BlockNumber targetBlock,
 				otherBlock;
 	bool		needLock;
@@ -308,6 +364,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
 		}
 	}
 
+loop:
 	while (targetBlock != InvalidBlockNumber)
 	{
 		/*
@@ -440,10 +497,54 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	 */
 	needLock = !RELATION_IS_LOCAL(relation);
 
+	/*
+	 * If we need the lock but are not able to acquire it immediately, we'll
+	 * consider extending the relation by multiple blocks at a time to manage
+	 * contention on the relation extension lock.  However, this only makes
+	 * sense if we're using the FSM; otherwise, there's no point.
+	 */
 	if (needLock)
-		LockRelationForExtension(relation, ExclusiveLock);
+	{
+		if (!use_fsm)
+			LockRelationForExtension(relation, ExclusiveLock);
+		else if (!ConditionLockRelationForExtension(relation, ExclusiveLock))
+		{
+			BlockNumber last_blkno = RelationGetNumberOfBlocks(relation);
+
+			targetBlock = GetPageWithFreeSpaceExtended(relation,
+													last_blkno,
+													len + saveFreeSpace);
+			if (targetBlock != InvalidBlockNumber)
+				goto loop;
+
+			/* Couldn't get the lock immediately; wait for it. */
+			LockRelationForExtension(relation, ExclusiveLock);
+
+			/*
+			 * Check if some other backend has extended a block for us while
+			 * we were waiting on the lock.
+			 */
+			targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
+
+			/*
+			 * If some other waiter has already extended the relation, we
+			 * don't need to do so; just use the existing freespace.
+			 */
+			if (targetBlock != InvalidBlockNumber)
+			{
+				UnlockRelationForExtension(relation, ExclusiveLock);
+				goto loop;
+			}
+
+			/* Time to bulk-extend. */
+			RelationAddExtraBlocks(relation, bistate);
+		}
+	}
 
 	/*
+	 * In addition to whatever extension we performed above, we always add
+	 * at least one block to satisfy our own request.
+	 *
 	 * XXX This does an lseek - rather expensive - but at the moment it is the
 	 * only way to accurately determine how many blocks are in a relation.  Is
 	 * it worth keeping an accurate file length in shared memory someplace,
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 2631080..b3ce12e 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -109,8 +109,11 @@ static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
 				   uint8 newValue, uint8 minValue);
 static BlockNumber fsm_search(Relation rel, uint8 min_cat);
 static uint8 fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof);
-
-
+static bool fsm_addr_on_same_page(FSMAddress addr1, FSMAddress addr2);
+static void fsm_set_range(Relation rel, FSMAddress addr, uint16 firstSlot,
+					uint16 lastSlot, uint8 newValue);
+static void fsm_update_recursive(Relation rel, FSMAddress addr, uint8 new_cat);
+static int fsm_search_from_addr(Relation rel, FSMAddress addr, uint8 minValue);
 /******** Public API ********/
 
 /*
@@ -129,9 +132,46 @@ static uint8 fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof);
 BlockNumber
 GetPageWithFreeSpace(Relation rel, Size spaceNeeded)
 {
-	uint8		min_cat = fsm_space_needed_to_cat(spaceNeeded);
+	/*
+	 * Call GetPageWithFreeSpaceExtended with InvalidBlockNumber so that
+	 * it will search the FSM tree from the root
+	 */
+	return GetPageWithFreeSpaceExtended(rel, InvalidBlockNumber, spaceNeeded);
+}
 
-	return fsm_search(rel, min_cat);
+/*
+ * 		GetPageWithFreeSpaceExtended
+ *
+ * As above, but start the search from oldPage instead of starting from the
+ * root, so that we can find the appropriate page in cases where a free block
+ * has been added to the FSM but not yet propagated up to the root. If oldPage
+ * is invalid, then start the search from the root.
+ */
+BlockNumber
+GetPageWithFreeSpaceExtended(Relation rel, BlockNumber oldPage,
+								Size spaceNeeded)
+{
+	int			search_cat = fsm_space_needed_to_cat(spaceNeeded);
+	FSMAddress	addr;
+	uint16		slot;
+	int			search_slot = -1;
+
+	if (oldPage != InvalidBlockNumber)
+	{
+		/* Get the location of the FSM byte representing the heap block */
+		addr = fsm_get_location(oldPage, &slot);
+
+		search_slot = fsm_search_from_addr(rel, addr, search_cat);
+	}
+
+	/*
+	 * If fsm_search_from_addr found a suitable new block, return that.
+	 * Otherwise, search as usual.
+	 */
+	if (search_slot != -1)
+		return fsm_get_heap_blk(addr, search_slot);
+	else
+		return fsm_search(rel, search_cat);
 }
 
 /*
@@ -189,6 +229,55 @@ RecordPageWithFreeSpace(Relation rel, BlockNumber heapBlk, Size spaceAvail)
 }
 
 /*
+ * Sets all the FSM entries for pages between firstBlock and lastBlock to
+ * BLKSIZE and then bubbles that up to the higher levels of the tree and
+ * all the way to the root.
+ */
+void
+FreeSpaceMapBulkExtend(Relation rel, BlockNumber firstBlock,
+					BlockNumber lastBlock)
+{
+	int			new_cat = fsm_space_avail_to_cat(BLCKSZ);
+	FSMAddress	addr1,
+				addr2;
+	uint16		firstSlot;
+	uint16		lastSlot;
+	uint16		slot;
+	BlockNumber	blockNum,
+				startBlk;
+
+	startBlk = blockNum = firstBlock;
+	addr1 = fsm_get_location(firstBlock, &slot);
+
+	while (blockNum <= lastBlock)
+	{
+		addr2 = fsm_get_location(blockNum+1, &slot);
+
+		/*
+		 * If the next block is on a different FSM page, or we have reached
+		 * the last block, then record what we have so far.
+		 */
+		if (!fsm_addr_on_same_page(addr1, addr2)
+			|| blockNum == lastBlock)
+		{
+			/* This block is on the next FSM page, so update the FSM page */
+			fsm_get_location(startBlk, &firstSlot);
+			fsm_get_location(blockNum, &lastSlot);
+
+			fsm_set_range(rel, addr1, firstSlot, lastSlot, new_cat);
+			fsm_update_recursive(rel, addr1, new_cat);
+			/*
+			 * Continue updating the FSM up to the last page: set addr1 to the
+			 * new address and continue the search from this page.
+			 */
+			addr1 = addr2;
+		}
+
+		blockNum++;
+	}
+}
+
+/*
  * XLogRecordPageWithFreeSpace - like RecordPageWithFreeSpace, for use in
  *		WAL replay
  */
@@ -788,3 +877,85 @@ fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof_p)
 
 	return max_avail;
 }
+
+static bool
+fsm_addr_on_same_page(FSMAddress addr1, FSMAddress addr2)
+{
+	Assert(addr1);
+	Assert(addr2);
+
+	if ((addr1.level != addr2.level)
+		|| (addr1.logpageno != addr2.logpageno))
+		return false;
+
+	return true;
+}
+
+/*
+ * Set value in given FSM page for given slot range.
+ */
+static void
+fsm_set_range(Relation rel, FSMAddress addr, uint16 firstSlot,
+			uint16 lastSlot, uint8 newValue)
+{
+	Buffer		buf;
+	Page		page;
+	int			slot;
+
+	buf = fsm_readbuf(rel, addr, true);
+	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+	page = BufferGetPage(buf);
+
+	slot = firstSlot;
+
+	for (slot = firstSlot; slot <= lastSlot; slot++)
+		fsm_set_avail(page, slot, newValue);
+
+	MarkBufferDirtyHint(buf, false);
+
+	UnlockReleaseBuffer(buf);
+}
+
+static void
+fsm_update_recursive(Relation rel, FSMAddress addr, uint8 new_cat)
+{
+	uint16		parentslot;
+	FSMAddress	parent;
+
+	if (addr.level == FSM_ROOT_LEVEL)
+		return;
+
+	parent = fsm_get_parent(addr, &parentslot);
+	fsm_set_and_search(rel, parent, parentslot, new_cat, 0);
+	fsm_update_recursive(rel, parent, new_cat);
+}
+
+/*
+ * Search the FSM tree for free space > minValue.
+ * It starts the search from the given addr, and is used for finding
+ * the required page in cases where vacuum has not yet updated the FSM tree
+ * up to the root level.
+ * If one is found, its slot number is returned, -1 otherwise.
+ */
+static int
+fsm_search_from_addr(Relation rel, FSMAddress addr, uint8 minValue)
+{
+	Buffer		buf;
+	int			newslot = -1;
+
+	buf = fsm_readbuf(rel, addr, true);
+	LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+	Assert(minValue != 0);
+
+	/* Search while we still hold the lock */
+	newslot = fsm_search_avail(buf, minValue,
+							addr.level == FSM_BOTTOM_LEVEL,
+							false);
+
+	UnlockReleaseBuffer(buf);
+
+	return newslot;
+}
+
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 0632fc0..7e04137 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -341,6 +341,41 @@ LockRelationForExtension(Relation relation, LOCKMODE lockmode)
 }
 
 /*
+ *		ConditionLockRelationForExtension
+ *
+ * As above, but only lock if we can get the lock without blocking.
+ * Returns TRUE iff the lock was acquired.
+ */
+bool
+ConditionLockRelationForExtension(Relation relation, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return (LockAcquire(&tag, lockmode, false, true) != LOCKACQUIRE_NOT_AVAIL);
+}
+
+/*
+ *		RelationExtensionLockWaiterCount
+ *
+ * Count the lock requesters for the relation extension lock.
+ */
+int
+RelationExtensionLockWaiterCount(Relation relation)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return LockWaiterCount(&tag);
+}
+
+/*
  *		UnlockRelationForExtension
  */
 void
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index b30b7b1..353f705 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -4380,3 +4380,40 @@ VirtualXactLock(VirtualTransactionId vxid, bool wait)
 	LockRelease(&tag, ShareLock, false);
 	return true;
 }
+
+/*
+ * LockWaiterCount
+ *
+ * Find the number of lock requesters on this locktag
+ */
+int
+LockWaiterCount(const LOCKTAG *locktag)
+{
+	LOCKMETHODID lockmethodid = locktag->locktag_lockmethodid;
+	LOCK	   *lock;
+	bool		found;
+	uint32		hashcode;
+	LWLock		*partitionLock;
+	int			waiters = 0;
+
+	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
+		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
+
+	hashcode = LockTagHashCode(locktag);
+	partitionLock = LockHashPartitionLock(hashcode);
+	LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+
+	lock = (LOCK *) hash_search_with_hash_value(LockMethodLockHash,
+												(const void *) locktag,
+												hashcode,
+												HASH_FIND,
+												&found);
+	if (found)
+	{
+		Assert(lock != NULL);
+		waiters = lock->nRequested;
+	}
+	LWLockRelease(partitionLock);
+
+	return waiters;
+}
diff --git a/src/include/storage/freespace.h b/src/include/storage/freespace.h
index 19dcb8d..8da5312 100644
--- a/src/include/storage/freespace.h
+++ b/src/include/storage/freespace.h
@@ -32,5 +32,10 @@ extern void XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk,
 
 extern void FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks);
 extern void FreeSpaceMapVacuum(Relation rel);
+extern void FreeSpaceMapBulkExtend(Relation rel, BlockNumber firstBlock,
+							BlockNumber lastBlock);
+extern BlockNumber GetPageWithFreeSpaceExtended(Relation rel,
+								BlockNumber oldPage,
+								Size spaceNeeded);
 
 #endif   /* FREESPACE_H_ */
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 975b6f8..4460756 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -53,6 +53,9 @@ extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
 /* Lock a relation for extension */
 extern void LockRelationForExtension(Relation relation, LOCKMODE lockmode);
 extern void UnlockRelationForExtension(Relation relation, LOCKMODE lockmode);
+extern bool ConditionLockRelationForExtension(Relation relation,
+								  LOCKMODE lockmode);
+extern int	RelationExtensionLockWaiterCount(Relation relation);
 
 /* Lock a page (currently only used within indexes) */
 extern void LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index b26427d..9c08679 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -574,6 +574,8 @@ extern void RememberSimpleDeadLock(PGPROC *proc1,
 					   PGPROC *proc2);
 extern void InitDeadLockChecking(void);
 
+extern int LockWaiterCount(const LOCKTAG *locktag);
+
 #ifdef LOCK_DEBUG
 extern void DumpLocks(PGPROC *proc);
 extern void DumpAllLocks(void);
#100Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#98)
5 attachment(s)
Re: Relation extension scalability

On Sat, Mar 26, 2016 at 3:18 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

We could go further still and have GetPageWithFreeSpace() always
search the last, say, two pages of the FSM in all cases. But that
might be expensive. The extra call to RelationGetNumberOfBlocks seems
cheap enough here because the alternative is to wait for a contended
heavyweight lock.

I will try the test with this also and post the results.

**Something went wrong with the last mail; it seems to have become a separate
thread, so I am resending the same mail.**

I have changed v14 as per this suggestion, and the results are the same as
v14's.

I have measured the relation size again, this time directly using the size
function, so the results are easier to understand.

Relation size
-----------------
INSERT: 16000 transactions from 32 clients

              Base     v13      v14_1
TPS           37       255      112
Rel size      17GB     17GB     18GB

COPY: 32000 transactions from 32 clients

              Base     v13      v14_1
TPS           121      823      427
Rel size      11GB     11GB     11GB

Scripts are attached to this mail:
----------------------------------------
test_size_ins.sh --> runs the insert test and calculates relation size.
test_size_copy.sh --> runs the copy test and calculates relation size.
copy_script --> copy pgbench script used by test_size_copy.sh
insert_script --> insert pgbench script used by test_size_ins.sh
multi_extend_v14_poc_v1.patch --> modified v14 patch.

I also tried modifying v14 from a different angle.

One approach is like below:
-------------------------
In AddExtraBlock
{
add pages to the FSM one by one, like v13 does,
then update the full FSM tree up to the root
}

Results:
----------
1. With this, performance is a little lower than v14's, but the problem of
the extra relation size is solved.
2. From this we can conclude that the extra relation size in v14 arises
because, while extending, some pages are not immediately visible in the FSM,
and at the end some of those pages are left unused.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

copy_scriptapplication/octet-stream; name=copy_scriptDownload
insert_scriptapplication/octet-stream; name=insert_scriptDownload
multi_extend_v14_poc_v1.patchapplication/octet-stream; name=multi_extend_v14_poc_v1.patchDownload
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 8140418..7d322f8 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -169,6 +169,62 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
 }
 
 /*
+ * Extend a relation by multiple blocks to avoid future contention on the
+ * relation extension lock.  Our goal is to pre-extend the relation by an
+ * amount which ramps up as the degree of contention ramps up, but limiting
+ * the result to some sane overall value.
+ */
+static void
+RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
+{
+	Page		page;
+	BlockNumber blockNum,
+				firstBlock,
+				lastBlock;
+	int			extraBlocks = 0;
+	int			lockWaiters = 0;
+	Buffer		buffer;
+
+	/*
+	 * We use the length of the lock wait queue to judge how much to extend.
+	 * It might seem like multiplying the number of lock waiters by as much
+	 * as 20 is too aggressive, but benchmarking revealed that smaller numbers
+	 * were insufficient.  512 is just an arbitrary cap to prevent pathological
+	 * results (and excessive wasted disk space).
+	 */
+	lockWaiters = RelationExtensionLockWaiterCount(relation);
+	extraBlocks = Min(512, lockWaiters * 20);
+
+	firstBlock = lastBlock = InvalidBlockNumber;
+
+	while (extraBlocks-- >= 0)
+	{
+		/* Ouch - an unnecessary lseek() each time through the loop! */
+		buffer = ReadBufferBI(relation, P_NEW, bistate);
+
+		/* Extend by one page. */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		page = BufferGetPage(buffer);
+		PageInit(page, BufferGetPageSize(buffer), 0);
+		MarkBufferDirty(buffer);
+		blockNum = BufferGetBlockNumber(buffer);
+		UnlockReleaseBuffer(buffer);
+
+		if (firstBlock == InvalidBlockNumber)
+			firstBlock = blockNum;
+
+		lastBlock = blockNum;
+	}
+
+	/*
+	 * Put the page in the freespace map so other backends can find it.
+	 * This is what will keep those other backends from also queueing up
+	 * on the relation extension lock.
+	 */
+	FreeSpaceMapBulkExtend(relation, firstBlock, lastBlock);
+}
+
+/*
  * RelationGetBufferForTuple
  *
  *	Returns pinned and exclusive-locked buffer of a page in given relation
@@ -233,8 +289,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	bool		use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
 	Buffer		buffer = InvalidBuffer;
 	Page		page;
-	Size		pageFreeSpace,
-				saveFreeSpace;
+	Size		pageFreeSpace = 0,
+				saveFreeSpace = 0;
 	BlockNumber targetBlock,
 				otherBlock;
 	bool		needLock;
@@ -308,6 +364,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
 		}
 	}
 
+loop:
 	while (targetBlock != InvalidBlockNumber)
 	{
 		/*
@@ -440,10 +497,54 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	 */
 	needLock = !RELATION_IS_LOCAL(relation);
 
+	/*
+	 * If we need the lock but are not able to acquire it immediately, we'll
+	 * consider extending the relation by multiple blocks at a time to manage
+	 * contention on the relation extension lock.  However, this only makes
+	 * sense if we're using the FSM; otherwise, there's no point.
+	 */
 	if (needLock)
-		LockRelationForExtension(relation, ExclusiveLock);
+	{
+		if (!use_fsm)
+			LockRelationForExtension(relation, ExclusiveLock);
+		else if (!ConditionLockRelationForExtension(relation, ExclusiveLock))
+		{
+			BlockNumber last_blkno = RelationGetNumberOfBlocks(relation);
+
+			targetBlock = GetPageWithFreeSpaceExtended(relation,
+													last_blkno,
+													len + saveFreeSpace);
+			if (targetBlock != InvalidBlockNumber)
+				goto loop;
+
+			/* Couldn't get the lock immediately; wait for it. */
+			LockRelationForExtension(relation, ExclusiveLock);
+
+			/*
+			 * Check if some other backend has extended a block for us while
+			 * we were waiting on the lock.
+			 */
+			targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
+
+			/*
+			 * If some other waiter has already extended the relation, we
+			 * don't need to do so; just use the existing freespace.
+			 */
+			if (targetBlock != InvalidBlockNumber)
+			{
+				UnlockRelationForExtension(relation, ExclusiveLock);
+				goto loop;
+			}
+
+			/* Time to bulk-extend. */
+			RelationAddExtraBlocks(relation, bistate);
+		}
+	}
 
 	/*
+	 * In addition to whatever extension we performed above, we always add
+	 * at least one block to satisfy our own request.
+	 *
 	 * XXX This does an lseek - rather expensive - but at the moment it is the
 	 * only way to accurately determine how many blocks are in a relation.  Is
 	 * it worth keeping an accurate file length in shared memory someplace,
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 2631080..b3ce12e 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -109,8 +109,11 @@ static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
 				   uint8 newValue, uint8 minValue);
 static BlockNumber fsm_search(Relation rel, uint8 min_cat);
 static uint8 fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof);
-
-
+static bool fsm_addr_on_same_page(FSMAddress addr1, FSMAddress addr2);
+static void fsm_set_range(Relation rel, FSMAddress addr, uint16 firstSlot,
+					uint16 lastSlot, uint8 newValue);
+static void fsm_update_recursive(Relation rel, FSMAddress addr, uint8 new_cat);
+static int fsm_search_from_addr(Relation rel, FSMAddress addr, uint8 minValue);
 /******** Public API ********/
 
 /*
@@ -129,9 +132,46 @@ static uint8 fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof);
 BlockNumber
 GetPageWithFreeSpace(Relation rel, Size spaceNeeded)
 {
-	uint8		min_cat = fsm_space_needed_to_cat(spaceNeeded);
+	/*
+	 * Call GetPageWithFreeSpaceExtended with InvalidBlockNumber so that
+	 * it will search the FSM tree from the root
+	 */
+	return GetPageWithFreeSpaceExtended(rel, InvalidBlockNumber, spaceNeeded);
+}
 
-	return fsm_search(rel, min_cat);
+/*
+ * 		GetPageWithFreeSpaceExtended
+ *
+ * As above, but start the search from oldPage instead of from the root,
+ * so that we can find the appropriate page in cases where a free block has
+ * been added to the FSM but the change has not yet propagated up to the
+ * root. If oldPage is InvalidBlockNumber, start the search from the root.
+ */
+BlockNumber
+GetPageWithFreeSpaceExtended(Relation rel, BlockNumber oldPage,
+								Size spaceNeeded)
+{
+	int			search_cat = fsm_space_needed_to_cat(spaceNeeded);
+	FSMAddress	addr;
+	uint16		slot;
+	int			search_slot = -1;
+
+	if (oldPage != InvalidBlockNumber)
+	{
+		/* Get the location of the FSM byte representing the heap block */
+		addr = fsm_get_location(oldPage, &slot);
+
+		search_slot = fsm_search_from_addr(rel, addr, search_cat);
+	}
+
+	/*
+	 * If fsm_search_from_addr found a suitable new block, return that.
+	 * Otherwise, search as usual.
+	 */
+	if (search_slot != -1)
+		return fsm_get_heap_blk(addr, search_slot);
+	else
+		return fsm_search(rel, search_cat);
 }
 
 /*
@@ -189,6 +229,55 @@ RecordPageWithFreeSpace(Relation rel, BlockNumber heapBlk, Size spaceAvail)
 }
 
 /*
+ * Set the FSM entries for all pages between firstBlock and lastBlock
+ * (inclusive) to BLCKSZ, then bubble that up through the higher levels of
+ * the tree all the way to the root.
+ */
+void
+FreeSpaceMapBulkExtend(Relation rel, BlockNumber firstBlock,
+					BlockNumber lastBlock)
+{
+	int			new_cat = fsm_space_avail_to_cat(BLCKSZ);
+	FSMAddress	addr1,
+				addr2;
+	uint16		firstSlot;
+	uint16		lastSlot;
+	uint16		slot;
+	BlockNumber	blockNum,
+				startBlk;
+
+	startBlk = blockNum = firstBlock;
+	addr1 = fsm_get_location(firstBlock, &slot);
+
+	while (blockNum <= lastBlock)
+	{
+		addr2 = fsm_get_location(blockNum+1, &slot);
+
+		/*
+		 * If the next block is on a different FSM page, or we have reached
+		 * the last block, record what we have accumulated so far.
+		 */
+		if (!fsm_addr_on_same_page(addr1, addr2)
+			|| blockNum == lastBlock)
+		{
+			/* Update the range of slots on the current FSM page. */
+			fsm_get_location(startBlk, &firstSlot);
+			fsm_get_location(blockNum, &lastSlot);
+
+			fsm_set_range(rel, addr1, firstSlot, lastSlot, new_cat);
+			fsm_update_recursive(rel, addr1, new_cat);
+			/*
+			 * Move on to the next FSM page and continue updating from
+			 * there.
+			 */
+			addr1 = addr2;
+			startBlk = blockNum + 1;
+		}
+
+		blockNum++;
+	}
+}
+
+/*
  * XLogRecordPageWithFreeSpace - like RecordPageWithFreeSpace, for use in
  *		WAL replay
  */
@@ -788,3 +877,85 @@ fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof_p)
 
 	return max_avail;
 }
+
+static bool
+fsm_addr_on_same_page(FSMAddress addr1, FSMAddress addr2)
+{
+	return addr1.level == addr2.level
+		&& addr1.logpageno == addr2.logpageno;
+}
+
+/*
+ * Set value in given FSM page for given slot range.
+ */
+static void
+fsm_set_range(Relation rel, FSMAddress addr, uint16 firstSlot,
+			uint16 lastSlot, uint8 newValue)
+{
+	Buffer		buf;
+	Page		page;
+	int			slot;
+
+	buf = fsm_readbuf(rel, addr, true);
+	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+	page = BufferGetPage(buf);
+
+	for (slot = firstSlot; slot <= lastSlot; slot++)
+		fsm_set_avail(page, slot, newValue);
+
+	MarkBufferDirtyHint(buf, false);
+
+	UnlockReleaseBuffer(buf);
+}
+
+static void
+fsm_update_recursive(Relation rel, FSMAddress addr, uint8 new_cat)
+{
+	uint16		parentslot;
+	FSMAddress	parent;
+
+	if (addr.level == FSM_ROOT_LEVEL)
+		return;
+
+	parent = fsm_get_parent(addr, &parentslot);
+	fsm_set_and_search(rel, parent, parentslot, new_cat, 0);
+	fsm_update_recursive(rel, parent, new_cat);
+}
+
+/*
+ * Search the FSM page at the given address for a slot with free space of at
+ * least minValue.  Starting from a specific address lets us find a suitable
+ * page in cases where vacuum has not yet propagated the free space up to
+ * the root of the FSM tree.
+ * If a slot is found, its number is returned; -1 otherwise.
+ */
+static int
+fsm_search_from_addr(Relation rel, FSMAddress addr, uint8 minValue)
+{
+	Buffer		buf;
+	int			newslot = -1;
+
+	buf = fsm_readbuf(rel, addr, true);
+	LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+	Assert(minValue != 0);
+
+	/* Search while we still hold the lock */
+	newslot = fsm_search_avail(buf, minValue,
+							addr.level == FSM_BOTTOM_LEVEL,
+							false);
+
+	UnlockReleaseBuffer(buf);
+
+	return newslot;
+}
+
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 0632fc0..7e04137 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -341,6 +341,41 @@ LockRelationForExtension(Relation relation, LOCKMODE lockmode)
 }
 
 /*
+ *		ConditionLockRelationForExtension
+ *
+ * As above, but only lock if we can get the lock without blocking.
+ * Returns TRUE iff the lock was acquired.
+ */
+bool
+ConditionLockRelationForExtension(Relation relation, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return (LockAcquire(&tag, lockmode, false, true) != LOCKACQUIRE_NOT_AVAIL);
+}
+
+/*
+ *		RelationExtensionLockWaiterCount
+ *
+ * Count the lock requesters for the given relation's extension lock.
+ */
+int
+RelationExtensionLockWaiterCount(Relation relation)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return LockWaiterCount(&tag);
+}
+
+/*
  *		UnlockRelationForExtension
  */
 void
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index b30b7b1..353f705 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -4380,3 +4380,40 @@ VirtualXactLock(VirtualTransactionId vxid, bool wait)
 	LockRelease(&tag, ShareLock, false);
 	return true;
 }
+
+/*
+ * LockWaiterCount
+ *
+ * Find the number of lock requesters on this locktag.
+ */
+int
+LockWaiterCount(const LOCKTAG *locktag)
+{
+	LOCKMETHODID lockmethodid = locktag->locktag_lockmethodid;
+	LOCK	   *lock;
+	bool		found;
+	uint32		hashcode;
+	LWLock		*partitionLock;
+	int			waiters = 0;
+
+	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
+		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
+
+	hashcode = LockTagHashCode(locktag);
+	partitionLock = LockHashPartitionLock(hashcode);
+	LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+
+	lock = (LOCK *) hash_search_with_hash_value(LockMethodLockHash,
+												(const void *) locktag,
+												hashcode,
+												HASH_FIND,
+												&found);
+	if (found)
+	{
+		Assert(lock != NULL);
+		waiters = lock->nRequested;
+	}
+	LWLockRelease(partitionLock);
+
+	return waiters;
+}
diff --git a/src/include/storage/freespace.h b/src/include/storage/freespace.h
index 19dcb8d..8da5312 100644
--- a/src/include/storage/freespace.h
+++ b/src/include/storage/freespace.h
@@ -32,5 +32,10 @@ extern void XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk,
 
 extern void FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks);
 extern void FreeSpaceMapVacuum(Relation rel);
+extern void FreeSpaceMapBulkExtend(Relation rel, BlockNumber firstBlock,
+							BlockNumber lastBlock);
+extern BlockNumber GetPageWithFreeSpaceExtended(Relation rel,
+								BlockNumber oldPage,
+								Size spaceNeeded);
 
 #endif   /* FREESPACE_H_ */
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 975b6f8..4460756 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -53,6 +53,9 @@ extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
 /* Lock a relation for extension */
 extern void LockRelationForExtension(Relation relation, LOCKMODE lockmode);
 extern void UnlockRelationForExtension(Relation relation, LOCKMODE lockmode);
+extern bool ConditionLockRelationForExtension(Relation relation,
+								  LOCKMODE lockmode);
+extern int	RelationExtensionLockWaiterCount(Relation relation);
 
 /* Lock a page (currently only used within indexes) */
 extern void LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index b26427d..9c08679 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -574,6 +574,8 @@ extern void RememberSimpleDeadLock(PGPROC *proc1,
 					   PGPROC *proc2);
 extern void InitDeadLockChecking(void);
 
+extern int LockWaiterCount(const LOCKTAG *locktag);
+
 #ifdef LOCK_DEBUG
 extern void DumpLocks(PGPROC *proc);
 extern void DumpAllLocks(void);
Attachments:

test_size_copy.sh (application/x-sh)
test_size_ins.sh (application/x-sh)
#101Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#100)
Re: Relation extension scalability

On Sun, Mar 27, 2016 at 8:00 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

Relation Size
-----------------
INSERT : 16000 transactions from 32 clients

            Base    v13   v14_1
TPS           37    255     112
Rel Size    17GB   17GB    18GB

COPY : 32000 transactions from 32 clients

            Base    v13   v14_1
TPS          121    823     427
Rel Size    11GB   11GB    11GB

Scripts are attached to the mail
----------------------------------------
test_size_ins.sh --> runs the insert test and calculates the relation size.
test_size_copy.sh --> runs the copy test and calculates the relation size.
copy_script --> copy pgbench script used by test_size_copy.sh
insert_script --> insert pgbench script used by test_size_ins.sh
multi_extend_v14_poc_v1.patch --> modified version of the v14 patch.

I also tried modifying v14 from a different angle.

One is like below -->
-------------------------
In AddExtraBlock
{
    add pages to the FSM one by one, as v13 does;
    then update the full FSM tree up to the root
}

Not following this. Did you attach this version?

Results:
----------
1. With this, performance is a little lower than v14, but the problem of the
extra relation size is solved.
2. From this we can conclude that the extra relation size in v14 arises
because some pages are not immediately available while the relation is being
extended, and at the end some of them are left unused.

I agree with that conclusion. I'm not quite sure where that leaves
us, though. We can go back to v13, but why isn't that producing extra
pages? It seems like it should: whenever a bulk extend rolls over to
a new FSM page, concurrent backends will search either the old or the
new one but not both.

Maybe we could do this - not sure if it's what you were suggesting above:

1. Add the pages one at a time, and do RecordPageWithFreeSpace after each one.
2. After inserting them all, go back and update the upper levels of
the FSM tree up the root.

Another idea is:

If ConditionalLockRelationForExtension fails to get the lock
immediately, search the last *two* pages of the FSM for a free page.

Just brainstorming here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#102Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#101)
Re: Relation extension scalability

On Mon, Mar 28, 2016 at 1:55 AM, Robert Haas <robertmhaas@gmail.com> wrote:

One is like below-->
-------------------------
In AddExtraBlock
{
I add page to FSM one by one like v13 does.
then update the full FSM tree up till root
}

Not following this. Did you attach this version?

No, I did not attach this. It was a rough experiment; I tried it but did not
produce a patch. I will send it.

Results:
----------
1. With this, performance is a little lower than v14, but the problem of the
extra relation size is solved.
2. From this we can conclude that the extra relation size in v14 arises
because some pages are not immediately available while the relation is being
extended, and at the end some of them are left unused.

I agree with that conclusion. I'm not quite sure where that leaves
us, though. We can go back to v13, but why isn't that producing extra
pages? It seems like it should: whenever a bulk extend rolls over to
a new FSM page, concurrent backends will search either the old or the
new one but not both.

Maybe we could do this - not sure if it's what you were suggesting above:

1. Add the pages one at a time, and do RecordPageWithFreeSpace after each
one.
2. After inserting them all, go back and update the upper levels of
the FSM tree up the root.

Yes, the same; that is what I was trying to explain above.

Another idea is:

If ConditionalLockRelationForExtension fails to get the lock
immediately, search the last *two* pages of the FSM for a free page.

Just brainstorming here.

I think this is the better option: since we will search the last two pages
of the FSM tree, there is no need to update the upper levels of the FSM
tree. Right?

I will test and post the result with this option.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#103Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#101)
Re: Relation extension scalability

On Mon, Mar 28, 2016 at 1:55 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, Mar 27, 2016 at 8:00 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

Results:
----------
1. With this, performance is a little lower than v14, but the problem of the
extra relation size is solved.
2. From this we can conclude that the extra relation size in v14 arises
because some pages are not immediately available while the relation is being
extended, and at the end some of them are left unused.

I agree with that conclusion. I'm not quite sure where that leaves
us, though. We can go back to v13, but why isn't that producing extra
pages? It seems like it should: whenever a bulk extend rolls over to
a new FSM page, concurrent backends will search either the old or the
new one but not both.

I have not debugged the flow, but looking at the v13 code, it seems it will
search both the old and the new page. In
GetPageWithFreeSpaceExtended() -> fsm_search_from_addr() -> fsm_search_avail(),
the basic idea of the search is: start from the target slot; at every step,
move one node to the right, then climb up to the parent; stop when we reach
a node with enough free space (as we must, since the root has enough space).
So shouldn't it be able to find the new FSM page where the bulk extend
rolls over?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#104Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#103)
Re: Relation extension scalability

On Mon, Mar 28, 2016 at 11:00 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

I have not debugged the flow, but by looking at v13 code, it looks like it
will search both old and new. In
function GetPageWithFreeSpaceExtended()->fsm_search_from_addr()->fsm_search_avail(),
the basic idea of search is: Start the search from the target slot. At
every step, move one
node to the right, then climb up to the parent. Stop when we reach a node
with enough free space (as we must, since the root has enough space).
So shouldn't it be able to find the new FSM page where the bulk extend
rolls over?

This is actually a multi-level tree, and each FSM page contains one slot tree.

So fsm_search_avail() searches only the slot tree inside one FSM page, but
we want to move on to the next FSM page.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#105Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#102)
1 attachment(s)
Re: Relation extension scalability

On Mon, Mar 28, 2016 at 7:21 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

I agree with that conclusion. I'm not quite sure where that leaves

us, though. We can go back to v13, but why isn't that producing extra
pages? It seems like it should: whenever a bulk extend rolls over to
a new FSM page, concurrent backends will search either the old or the
new one but not both.

Our open question was why v13 is not producing extra pages. I added some
logging and debugged it. It seems to me that when blocks spill over to the
next FSM page, the backends waiting on the lock will not find those blocks,
because they are searching the old FSM page. But the backend that is
extending the relation will call RelationSetTargetBlock with the latest
block, and that makes the new FSM page available for search by new
requesters.

1. So this is why v13 (in the normal case*1) does not produce unused pages.
2. But it will produce extra pages (which will be consumed by new
requesters), because all the waiters come one by one and each extends 512
pages.

*1 : By "normal case" above I mean that cases do exist where v13 can leave
unused pages. For example: one by one, each waiter gets the lock and extends
the relation, but none of them has yet reached RelationSetTargetBlock, so
the new pages are not yet visible to new requesters. Eventually blocks spill
over to a third FSM page; now the backends reach RelationSetTargetBlock one
by one, and suppose the last one sets it to a block on the third FSM page.
In that case, the pages on the second FSM page are left unused.

Maybe we could do this - not sure if it's what you were suggesting above:

1. Add the pages one at a time, and do RecordPageWithFreeSpace after each
one.
2. After inserting them all, go back and update the upper levels of
the FSM tree up the root.

I think this is the better option: since we will search the last two pages
of the FSM tree, there is no need to update the upper levels of the FSM
tree. Right?

I will test and post the result with this option.

I have created this patch and results are as below.

* All test scripts are same attached upthread

1. Relation Size : No change in size; it is the same as base and v13.

2. INSERT performance (1000 tuples of 1028 bytes each), TPS
-----------------------------------------------------------
Client   base   v13   v15
     1    117   124   122
     2    111   126   123
     4     51   128   125
     8     43   149   135
    16     40   217   147
    32     35   263   141

3. COPY performance (10000 tuples), TPS
----------------------------------------------
Client   base   v13   v15
     1    118   147   155
     2    217   276   273
     4    210   421   457
     8    166   630   643
    16    145   813   595
    32    124   985   598

Conclusion:
---------------
1. I think v15 solves the problem that exists with v13; performance is
significantly higher compared to base, and the relation size is also stable.
So IMHO v15 is the winner over the other solutions. What do others think?

2. And there is no point in concluding that v13 is better than v15, because
v13 has a bug of sometimes extending more pages than expected in an
uncontrolled way, and that can also be the reason v13 performs better.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

multi_extend_v15.patch (application/octet-stream)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 8140418..54ffcd4 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -169,6 +169,55 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
 }
 
 /*
+ * Extend a relation by multiple blocks to avoid future contention on the
+ * relation extension lock.  Our goal is to pre-extend the relation by an
+ * amount which ramps up as the degree of contention ramps up, but limiting
+ * the result to some sane overall value.
+ */
+static void
+RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
+{
+	Page		page;
+	Size		freespace;
+	BlockNumber blockNum;
+	int			extraBlocks = 0;
+	int			lockWaiters = 0;
+	Buffer		buffer;
+
+	/*
+	 * We use the length of the lock wait queue to judge how much to extend.
+	 * It might seem like multiplying the number of lock waiters by as much
+	 * as 20 is too aggressive, but benchmarking revealed that smaller numbers
+	 * were insufficient.  512 is just an arbitrary cap to prevent pathological
+	 * results (and excessive wasted disk space).
+	 */
+	lockWaiters = RelationExtensionLockWaiterCount(relation);
+	extraBlocks = Min(512, lockWaiters * 20);
+
+	while (extraBlocks-- >= 0)
+	{
+		/* Ouch - an unnecessary lseek() each time through the loop! */
+		buffer = ReadBufferBI(relation, P_NEW, bistate);
+
+		/* Extend by one page. */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		page = BufferGetPage(buffer);
+		PageInit(page, BufferGetPageSize(buffer), 0);
+		freespace = PageGetHeapFreeSpace(page);
+		MarkBufferDirty(buffer);
+		blockNum = BufferGetBlockNumber(buffer);
+		UnlockReleaseBuffer(buffer);
+
+		/*
+		 * Put the page in the freespace map so other backends can find it.
+		 * This is what will keep those other backends from also queueing up
+		 * on the relation extension lock.
+		 */
+		RecordPageWithFreeSpace(relation, blockNum, freespace);
+	}
+}
+
+/*
  * RelationGetBufferForTuple
  *
  *	Returns pinned and exclusive-locked buffer of a page in given relation
@@ -233,10 +282,11 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	bool		use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
 	Buffer		buffer = InvalidBuffer;
 	Page		page;
-	Size		pageFreeSpace,
-				saveFreeSpace;
+	Size		pageFreeSpace = 0,
+				saveFreeSpace = 0;
 	BlockNumber targetBlock,
-				otherBlock;
+				otherBlock,
+				lastValidBlock = InvalidBlockNumber;
 	bool		needLock;
 
 	len = MAXALIGN(len);		/* be conservative */
@@ -308,6 +358,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
 		}
 	}
 
+loop:
 	while (targetBlock != InvalidBlockNumber)
 	{
 		/*
@@ -388,6 +439,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
 								 otherBlock, targetBlock, vmbuffer_other,
 								 vmbuffer);
 
+		lastValidBlock = targetBlock;
+
 		/*
 		 * Now we can check to see if there's enough free space here. If so,
 		 * we're done.
@@ -440,10 +493,68 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	 */
 	needLock = !RELATION_IS_LOCAL(relation);
 
+	/*
+	 * If we need the lock but are not able to acquire it immediately, we'll
+	 * consider extending the relation by multiple blocks at a time to manage
+	 * contention on the relation extension lock.  However, this only makes
+	 * sense if we're using the FSM; otherwise, there's no point.
+	 */
 	if (needLock)
-		LockRelationForExtension(relation, ExclusiveLock);
+	{
+		if (!use_fsm)
+			LockRelationForExtension(relation, ExclusiveLock);
+		else if (!ConditionLockRelationForExtension(relation, ExclusiveLock))
+		{
+			BlockNumber     last_blkno;
+			BlockNumber     prev_blkno;
+
+			/* Couldn't get the lock immediately; wait for it. */
+			LockRelationForExtension(relation, ExclusiveLock);
+
+			last_blkno = RelationGetNumberOfBlocks(relation);
+
+			/*
+			 * If lastValidBlock is InvalidBlockNumber, the relation had no
+			 * blocks when we checked, but we still need to search because
+			 * another backend might have extended it since then.
+			 */
+			if (lastValidBlock != InvalidBlockNumber)
+				prev_blkno = lastValidBlock;
+			else
+				prev_blkno = last_blkno;
+
+			/*
+			 * Call GetPageWithFreeSpaceExtended with the previous block and
+			 * the last block of the relation, so that we can find any block
+			 * added to the FSM near the previous valid block, and also the
+			 * FSM page of the last block in case a bulk extend has spilled
+			 * over to the next FSM page. Since one bulk extend is capped at
+			 * 512 pages, searching two FSM pages is enough.
+			 */
+			targetBlock = GetPageWithFreeSpaceExtended(relation,
+													prev_blkno,
+													last_blkno,
+													len + saveFreeSpace);
+
+			/*
+			 * If some other waiter has already extended the relation, we
+			 * don't need to do so; just use the existing freespace.
+			 */
+			if (targetBlock != InvalidBlockNumber)
+			{
+				UnlockRelationForExtension(relation, ExclusiveLock);
+				goto loop;
+			}
+
+			/* Time to bulk-extend. */
+			RelationAddExtraBlocks(relation, bistate);
+		}
+	}
 
 	/*
+	 * In addition to whatever extension we performed above, we always add
+	 * at least one block to satisfy our own request.
+	 *
 	 * XXX This does an lseek - rather expensive - but at the moment it is the
 	 * only way to accurately determine how many blocks are in a relation.  Is
 	 * it worth keeping an accurate file length in shared memory someplace,
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 2631080..19e499b 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -109,7 +109,8 @@ static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
 				   uint8 newValue, uint8 minValue);
 static BlockNumber fsm_search(Relation rel, uint8 min_cat);
 static uint8 fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof);
-
+static int fsm_search_from_addr(Relation rel, FSMAddress addr, uint8 minValue);
+static bool fsm_addr_on_same_page(FSMAddress addr1, FSMAddress addr2);
 
 /******** Public API ********/
 
@@ -129,9 +130,56 @@ static uint8 fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof);
 BlockNumber
 GetPageWithFreeSpace(Relation rel, Size spaceNeeded)
 {
-	uint8		min_cat = fsm_space_needed_to_cat(spaceNeeded);
+	/*
+	 * Call GetPageWithFreeSpaceExtended with InvalidBlockNumber so that
+	 * it searches the FSM tree from the root.
+	 */
+	return GetPageWithFreeSpaceExtended(rel, InvalidBlockNumber, InvalidBlockNumber, spaceNeeded);
+}
+
+/*
+ * 		GetPageWithFreeSpaceExtended
+ *
+ * As above, but start the search on the FSM page that holds oldBlockNum's
+ * slot instead of from the root, so that we can find the appropriate page
+ * in cases where a free block has been added to the FSM but the change has
+ * not yet propagated up to the root. If oldBlockNum is InvalidBlockNumber,
+ * start the search from the root. Also, if newBlockNum falls on a different
+ * FSM page, search that page as well.
+ */
+BlockNumber
+GetPageWithFreeSpaceExtended(Relation rel, BlockNumber oldBlockNum,
+							BlockNumber newBlockNum, Size spaceNeeded)
+{
+	int			search_cat = fsm_space_needed_to_cat(spaceNeeded);
+	FSMAddress	addr, new_addr;
+	uint16		slot;
+	int			search_slot = -1;
 
-	return fsm_search(rel, min_cat);
+	if (oldBlockNum != InvalidBlockNumber)
+	{
+		/* Get the location of the FSM byte representing the heap block */
+		addr = fsm_get_location(oldBlockNum, &slot);
+
+		search_slot = fsm_search_from_addr(rel, addr, search_cat);
+		if (search_slot == -1)
+		{
+			new_addr = fsm_get_location(newBlockNum, &slot);
+			if (!fsm_addr_on_same_page(addr, new_addr))
+			{
+				search_slot = fsm_search_from_addr(rel, new_addr, search_cat);
+				addr = new_addr;
+			}
+		}
+	}
+
+	/*
+	 * If fsm_search_from_addr found a suitable new block, return that.
+	 * Otherwise, search as usual.
+	 */
+	if (search_slot != -1)
+		return fsm_get_heap_blk(addr, search_slot);
+	else
+		return fsm_search(rel, search_cat);
 }
 
 /*
@@ -634,6 +682,44 @@ fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
 }
 
 /*
+ * Search the FSM page at the given address for a slot with free space of at
+ * least minValue.  Starting from a specific address lets us find a suitable
+ * page in cases where vacuum has not yet propagated the free space up to
+ * the root of the FSM tree.
+ * If a slot is found, its number is returned; -1 otherwise.
+ */
+static int
+fsm_search_from_addr(Relation rel, FSMAddress addr, uint8 minValue)
+{
+	Buffer		buf;
+	int			newslot = -1;
+
+	buf = fsm_readbuf(rel, addr, true);
+	LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+	Assert(minValue != 0);
+
+	/* Search while we still hold the lock */
+	newslot = fsm_search_avail(buf, minValue,
+							addr.level == FSM_BOTTOM_LEVEL,
+							false);
+
+	UnlockReleaseBuffer(buf);
+
+	return newslot;
+}
+
+static bool
+fsm_addr_on_same_page(FSMAddress addr1, FSMAddress addr2)
+{
+	if ((addr1.level != addr2.level)
+		|| (addr1.logpageno != addr2.logpageno))
+		return false;
+
+	return true;
+}
+
+/*
  * Search the tree for a heap page with at least min_cat of free space
  */
 static BlockNumber
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 0632fc0..7e04137 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -341,6 +341,41 @@ LockRelationForExtension(Relation relation, LOCKMODE lockmode)
 }
 
 /*
+ *		ConditionLockRelationForExtension
+ *
+ * As above, but only lock if we can get the lock without blocking.
+ * Returns TRUE iff the lock was acquired.
+ */
+bool
+ConditionLockRelationForExtension(Relation relation, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return (LockAcquire(&tag, lockmode, false, true) != LOCKACQUIRE_NOT_AVAIL);
+}
+
+/*
+ *		RelationExtensionLockWaiterCount
+ *
+ * Count the lock requesters for the given relation's extension lock.
+ */
+int
+RelationExtensionLockWaiterCount(Relation relation)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return LockWaiterCount(&tag);
+}
+
+/*
  *		UnlockRelationForExtension
  */
 void
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index b30b7b1..353f705 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -4380,3 +4380,40 @@ VirtualXactLock(VirtualTransactionId vxid, bool wait)
 	LockRelease(&tag, ShareLock, false);
 	return true;
 }
+
+/*
+ * LockWaiterCount
+ *
+ * Find the number of lock requester on this locktag
+ */
+int
+LockWaiterCount(const LOCKTAG *locktag)
+{
+	LOCKMETHODID lockmethodid = locktag->locktag_lockmethodid;
+	LOCK	   *lock;
+	bool		found;
+	uint32		hashcode;
+	LWLock		*partitionLock;
+	int			waiters = 0;
+
+	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
+		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
+
+	hashcode = LockTagHashCode(locktag);
+	partitionLock = LockHashPartitionLock(hashcode);
+	LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+
+	lock = (LOCK *) hash_search_with_hash_value(LockMethodLockHash,
+												(const void *) locktag,
+												hashcode,
+												HASH_FIND,
+												&found);
+	if (found)
+	{
+		Assert(lock != NULL);
+		waiters = lock->nRequested;
+	}
+	LWLockRelease(partitionLock);
+
+	return waiters;
+}
diff --git a/src/include/storage/freespace.h b/src/include/storage/freespace.h
index 19dcb8d..281137a 100644
--- a/src/include/storage/freespace.h
+++ b/src/include/storage/freespace.h
@@ -32,5 +32,10 @@ extern void XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk,
 
 extern void FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks);
 extern void FreeSpaceMapVacuum(Relation rel);
+extern BlockNumber GetPageWithFreeSpaceExtended(Relation rel,
+								BlockNumber oldPage,
+								BlockNumber newPage,
+								Size spaceNeeded);
+
 
 #endif   /* FREESPACE_H_ */
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 975b6f8..4460756 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -53,6 +53,9 @@ extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
 /* Lock a relation for extension */
 extern void LockRelationForExtension(Relation relation, LOCKMODE lockmode);
 extern void UnlockRelationForExtension(Relation relation, LOCKMODE lockmode);
+extern bool ConditionLockRelationForExtension(Relation relation,
+								  LOCKMODE lockmode);
+extern int	RelationExtensionLockWaiterCount(Relation relation);
 
 /* Lock a page (currently only used within indexes) */
 extern void LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index b26427d..9c08679 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -574,6 +574,8 @@ extern void RememberSimpleDeadLock(PGPROC *proc1,
 					   PGPROC *proc2);
 extern void InitDeadLockChecking(void);
 
+extern int LockWaiterCount(const LOCKTAG *locktag);
+
 #ifdef LOCK_DEBUG
 extern void DumpLocks(PGPROC *proc);
 extern void DumpAllLocks(void);
#106Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#105)
1 attachment(s)
Re: Relation extension scalability

On Mon, Mar 28, 2016 at 3:02 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

1. Relation size: no change; it is the same as base and v13.

2. INSERT performance (1000 tuples of 1028 bytes each)
-----------------------------------------------------------
Client    base    v13    v15
     1     117    124    122
     2     111    126    123
     4      51    128    125
     8      43    149    135
    16      40    217    147
    32      35    263    141

3. COPY performance (10000 tuples)
----------------------------------------------
Client    base    v13    v15
     1     118    147    155
     2     217    276    273
     4     210    421    457
     8     166    630    643
    16     145    813    595
    32     124    985    598

Conclusion:
---------------
1. I think v15 solves the problem that exists with v13: performance is
significantly higher than base, and the relation size is also stable. So
IMHO v15 is the winner over the other solutions; what do others think?

2. And there is no point in concluding that v13 is better than v15, because
v13 has a bug where it sometimes extends by more pages than expected, in an
uncontrolled way, and that may also be the reason v13 performs better.

Found one problem with v15, so I am sending a new version.
In v15 I was setting prev_blkno to targetBlock, but it should instead be the
last block of the relation at that time. Attaching a new patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

multi_extend_v16.patch (application/octet-stream)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 8140418..517d465 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -169,6 +169,55 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
 }
 
 /*
+ * Extend a relation by multiple blocks to avoid future contention on the
+ * relation extension lock.  Our goal is to pre-extend the relation by an
+ * amount which ramps up as the degree of contention ramps up, but limiting
+ * the result to some sane overall value.
+ */
+static void
+RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
+{
+	Page		page;
+	Size		freespace;
+	BlockNumber blockNum;
+	int			extraBlocks = 0;
+	int			lockWaiters = 0;
+	Buffer		buffer;
+
+	/*
+	 * We use the length of the lock wait queue to judge how much to extend.
+	 * It might seem like multiplying the number of lock waiters by as much
+	 * as 20 is too aggressive, but benchmarking revealed that smaller numbers
+	 * were insufficient.  512 is just an arbitrary cap to prevent pathological
+	 * results (and excessive wasted disk space).
+	 */
+	lockWaiters = RelationExtensionLockWaiterCount(relation);
+	extraBlocks = Min(512, lockWaiters * 20);
+
+	while (extraBlocks-- >= 0)
+	{
+		/* Ouch - an unnecessary lseek() each time through the loop! */
+		buffer = ReadBufferBI(relation, P_NEW, bistate);
+
+		/* Extend by one page. */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		page = BufferGetPage(buffer);
+		PageInit(page, BufferGetPageSize(buffer), 0);
+		freespace = PageGetHeapFreeSpace(page);
+		MarkBufferDirty(buffer);
+		blockNum = BufferGetBlockNumber(buffer);
+		UnlockReleaseBuffer(buffer);
+
+		/*
+		 * Put the page in the freespace map so other backends can find it.
+		 * This is what will keep those other backends from also queueing up
+		 * on the relation extension lock.
+		 */
+		RecordPageWithFreeSpace(relation, blockNum, freespace);
+	}
+}
+
+/*
  * RelationGetBufferForTuple
  *
  *	Returns pinned and exclusive-locked buffer of a page in given relation
@@ -233,10 +282,11 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	bool		use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
 	Buffer		buffer = InvalidBuffer;
 	Page		page;
-	Size		pageFreeSpace,
-				saveFreeSpace;
+	Size		pageFreeSpace = 0,
+				saveFreeSpace = 0;
 	BlockNumber targetBlock,
-				otherBlock;
+				otherBlock,
+				prev_blkno = RelationGetNumberOfBlocks(relation);
 	bool		needLock;
 
 	len = MAXALIGN(len);		/* be conservative */
@@ -308,6 +358,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
 		}
 	}
 
+loop:
 	while (targetBlock != InvalidBlockNumber)
 	{
 		/*
@@ -440,10 +491,65 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	 */
 	needLock = !RELATION_IS_LOCAL(relation);
 
+	/*
+	 * If we need the lock but are not able to acquire it immediately, we'll
+	 * consider extending the relation by multiple blocks at a time to manage
+	 * contention on the relation extension lock.  However, this only makes
+	 * sense if we're using the FSM; otherwise, there's no point.
+	 */
 	if (needLock)
-		LockRelationForExtension(relation, ExclusiveLock);
+	{
+		if (!use_fsm)
+			LockRelationForExtension(relation, ExclusiveLock);
+		else if (!ConditionLockRelationForExtension(relation, ExclusiveLock))
+		{
+			BlockNumber     last_blkno;
+
+			/* Couldn't get the lock immediately; wait for it. */
+			LockRelationForExtension(relation, ExclusiveLock);
+
+			last_blkno = RelationGetNumberOfBlocks(relation);
+
+			/*
+			 * If prev_blkno is InvalidBlockNumber, the relation had no blocks
+			 * when we checked, but we still need to search because another
+			 * backend might have extended it since then.
+			 */
+			if (prev_blkno == InvalidBlockNumber)
+				prev_blkno = last_blkno;
+
+			/*
+			 * Call GetPageWithFreeSpaceExtended with the previous block and
+			 * the last block of the relation, so that we can find any block
+			 * added to the FSM near the previous valid block, and also handle
+			 * the case where a bulk extend spilled over to the next FSM page.
+			 * One bulk extend is capped at 512 pages, so searching two FSM
+			 * pages is enough.
+			 */
+			targetBlock = GetPageWithFreeSpaceExtended(relation,
+													prev_blkno,
+													last_blkno,
+													len + saveFreeSpace);
+
+			/*
+			 * If some other waiter has already extended the relation, we
+			 * don't need to do so; just use the existing freespace.
+			 */
+			if (targetBlock != InvalidBlockNumber)
+			{
+				UnlockRelationForExtension(relation, ExclusiveLock);
+				goto loop;
+			}
+
+			/* Time to bulk-extend. */
+			RelationAddExtraBlocks(relation, bistate);
+		}
+	}
 
 	/*
+	 * In addition to whatever extension we performed above, we always add
+	 * at least one block to satisfy our own request.
+	 *
 	 * XXX This does an lseek - rather expensive - but at the moment it is the
 	 * only way to accurately determine how many blocks are in a relation.  Is
 	 * it worth keeping an accurate file length in shared memory someplace,
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 2631080..19e499b 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -109,7 +109,8 @@ static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
 				   uint8 newValue, uint8 minValue);
 static BlockNumber fsm_search(Relation rel, uint8 min_cat);
 static uint8 fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof);
-
+static int fsm_search_from_addr(Relation rel, FSMAddress addr, uint8 minValue);
+static bool fsm_addr_on_same_page(FSMAddress addr1, FSMAddress addr2);
 
 /******** Public API ********/
 
@@ -129,9 +130,56 @@ static uint8 fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof);
 BlockNumber
 GetPageWithFreeSpace(Relation rel, Size spaceNeeded)
 {
-	uint8		min_cat = fsm_space_needed_to_cat(spaceNeeded);
+	/*
+	 * Call GetPageWithFreeSpaceExtended with InvalidBlockNumber so that
+	 * it will search the FSM tree from the root
+	 */
+	return GetPageWithFreeSpaceExtended(rel, InvalidBlockNumber, InvalidBlockNumber, spaceNeeded);
+}
+
+/*
+ * 		GetPageWithFreeSpaceExtended
+ *
+ * As above, but start the search in the FSM page holding oldBlockNum's slot
+ * instead of starting from the root, so that we can find the appropriate
+ * page in cases where a free block has been added to the FSM but the update
+ * has not yet propagated up to the root.  If oldBlockNum is invalid, start
+ * the search from the root.  If newBlockNum is on a different FSM page,
+ * search that FSM page as well.
+ */
+BlockNumber
+GetPageWithFreeSpaceExtended(Relation rel, BlockNumber oldBlockNum,
+							BlockNumber newBlockNum, Size spaceNeeded)
+{
+	int			search_cat = fsm_space_needed_to_cat(spaceNeeded);
+	FSMAddress	addr, new_addr;
+	uint16		slot;
+	int			search_slot = -1;
 
-	return fsm_search(rel, min_cat);
+	if (oldBlockNum != InvalidBlockNumber)
+	{
+		/* Get the location of the FSM byte representing the heap block */
+		addr = fsm_get_location(oldBlockNum, &slot);
+
+		search_slot = fsm_search_from_addr(rel, addr, search_cat);
+		if (search_slot == -1)
+		{
+			new_addr = fsm_get_location(newBlockNum, &slot);
+			if (!fsm_addr_on_same_page(addr, new_addr))
+			{
+				search_slot = fsm_search_from_addr(rel, new_addr, search_cat);
+				addr = new_addr;
+			}
+		}
+	}
+
+	/*
+	 * If fsm_search_from_addr found a suitable new block, return that.
+	 * Otherwise, search as usual.
+	 */
+	if (search_slot != -1)
+		return fsm_get_heap_blk(addr, search_slot);
+	else
+		return fsm_search(rel, search_cat);
 }
 
 /*
@@ -634,6 +682,44 @@ fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
 }
 
 /*
+ * Search the FSM tree for free space of at least minValue, starting from
+ * the given addr.  This is used to find the required page in cases where
+ * vacuum has not yet updated the FSM tree up to the root level.
+ * If a suitable slot is found, its slot number is returned, -1 otherwise.
+ */
+static int
+fsm_search_from_addr(Relation rel, FSMAddress addr, uint8 minValue)
+{
+	Buffer		buf;
+	int			newslot = -1;
+
+	buf = fsm_readbuf(rel, addr, true);
+	LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+	Assert(minValue != 0);
+
+	/* Search while we still hold the lock */
+	newslot = fsm_search_avail(buf, minValue,
+							addr.level == FSM_BOTTOM_LEVEL,
+							false);
+
+	UnlockReleaseBuffer(buf);
+
+	return newslot;
+}
+
+static bool
+fsm_addr_on_same_page(FSMAddress addr1, FSMAddress addr2)
+{
+	if ((addr1.level != addr2.level)
+		|| (addr1.logpageno != addr2.logpageno))
+		return false;
+
+	return true;
+}
+
+/*
  * Search the tree for a heap page with at least min_cat of free space
  */
 static BlockNumber
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 0632fc0..7e04137 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -341,6 +341,41 @@ LockRelationForExtension(Relation relation, LOCKMODE lockmode)
 }
 
 /*
+ *		ConditionLockRelationForExtension
+ *
+ * As above, but only lock if we can get the lock without blocking.
+ * Returns TRUE iff the lock was acquired.
+ */
+bool
+ConditionLockRelationForExtension(Relation relation, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return (LockAcquire(&tag, lockmode, false, true) != LOCKACQUIRE_NOT_AVAIL);
+}
+
+/*
+ *		RelationExtensionLockWaiterCount
+ *
+ * Count the lock requesters for the relation extension lock.
+ */
+int
+RelationExtensionLockWaiterCount(Relation relation)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return LockWaiterCount(&tag);
+}
+
+/*
  *		UnlockRelationForExtension
  */
 void
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index b30b7b1..353f705 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -4380,3 +4380,40 @@ VirtualXactLock(VirtualTransactionId vxid, bool wait)
 	LockRelease(&tag, ShareLock, false);
 	return true;
 }
+
+/*
+ * LockWaiterCount
+ *
+ * Find the number of lock requesters on this locktag
+ */
+int
+LockWaiterCount(const LOCKTAG *locktag)
+{
+	LOCKMETHODID lockmethodid = locktag->locktag_lockmethodid;
+	LOCK	   *lock;
+	bool		found;
+	uint32		hashcode;
+	LWLock		*partitionLock;
+	int			waiters = 0;
+
+	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
+		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
+
+	hashcode = LockTagHashCode(locktag);
+	partitionLock = LockHashPartitionLock(hashcode);
+	LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+
+	lock = (LOCK *) hash_search_with_hash_value(LockMethodLockHash,
+												(const void *) locktag,
+												hashcode,
+												HASH_FIND,
+												&found);
+	if (found)
+	{
+		Assert(lock != NULL);
+		waiters = lock->nRequested;
+	}
+	LWLockRelease(partitionLock);
+
+	return waiters;
+}
diff --git a/src/include/storage/freespace.h b/src/include/storage/freespace.h
index 19dcb8d..281137a 100644
--- a/src/include/storage/freespace.h
+++ b/src/include/storage/freespace.h
@@ -32,5 +32,10 @@ extern void XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk,
 
 extern void FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks);
 extern void FreeSpaceMapVacuum(Relation rel);
+extern BlockNumber GetPageWithFreeSpaceExtended(Relation rel,
+								BlockNumber oldPage,
+								BlockNumber newPage,
+								Size spaceNeeded);
+
 
 #endif   /* FREESPACE_H_ */
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 975b6f8..4460756 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -53,6 +53,9 @@ extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
 /* Lock a relation for extension */
 extern void LockRelationForExtension(Relation relation, LOCKMODE lockmode);
 extern void UnlockRelationForExtension(Relation relation, LOCKMODE lockmode);
+extern bool ConditionLockRelationForExtension(Relation relation,
+								  LOCKMODE lockmode);
+extern int	RelationExtensionLockWaiterCount(Relation relation);
 
 /* Lock a page (currently only used within indexes) */
 extern void LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index b26427d..9c08679 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -574,6 +574,8 @@ extern void RememberSimpleDeadLock(PGPROC *proc1,
 					   PGPROC *proc2);
 extern void InitDeadLockChecking(void);
 
+extern int LockWaiterCount(const LOCKTAG *locktag);
+
 #ifdef LOCK_DEBUG
 extern void DumpLocks(PGPROC *proc);
 extern void DumpAllLocks(void);
#107Petr Jelinek
petr@2ndquadrant.com
In reply to: Dilip Kumar (#106)
Re: Relation extension scalability

On 28/03/16 14:46, Dilip Kumar wrote:

Conclusion:
---------------
1. I think v15 solves the problem that exists with v13: performance is
significantly higher than base, and the relation size is also stable. So
IMHO v15 is the winner over the other solutions; what do others think?

It seems so. Do you have the ability to reasonably test with 64 clients? I
am mostly wondering whether we would see performance going further down or
just plateauing.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#108Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#102)
Re: Relation extension scalability

On Sun, Mar 27, 2016 at 9:51 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

I think this is the better option: since we will search the last two pages
of the FSM tree, there is no need to update the upper levels of the FSM
tree. Right?

Well, it's less important in that case, but I think it's still worth
doing. Some people are going to do just plain GetPageWithFreeSpace().

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#109Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#106)
Re: Relation extension scalability

On Mon, Mar 28, 2016 at 8:46 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

Found one problem with V15, so sending the new version.
In V15 I am taking prev_blkno as targetBlock instead it should be the last
block of the relation at that time. Attaching new patch.

     BlockNumber targetBlock,
-                otherBlock;
+                otherBlock,
+                prev_blkno = RelationGetNumberOfBlocks(relation);

Absolutely not. There is no way it's acceptable to introduce an
unconditional call to RelationGetNumberOfBlocks() into every call to
RelationGetBufferForTuple().

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#110Amit Kapila
amit.kapila16@gmail.com
In reply to: Petr Jelinek (#107)
Re: Relation extension scalability

On Tue, Mar 29, 2016 at 3:21 AM, Petr Jelinek <petr@2ndquadrant.com> wrote:

On 28/03/16 14:46, Dilip Kumar wrote:

Conclusion:
---------------
1. I think v15 is solving the problem exist with v13 and performance
is significantly high compared to base, and relation size is also
stable, So IMHO V15 is winner over other solution, what other thinks ?

It seems so, do you have ability to reasonably test with 64 clients? I am
mostly wondering if we see the performance going further down or just
plateau.

Yes, that makes sense. One more point: if the reason v13 gives better
performance is the extra blocks (which we believe can, in certain cases,
leak until VACUUM updates the FSM tree), do you think it makes sense to
also test with the lockWaiters * 20 limit increased to, say,
lockWaiters * 25 or lockWaiters * 30?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#111Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#108)
1 attachment(s)
Re: Relation extension scalability

On Tue, Mar 29, 2016 at 7:26 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Well, it's less important in that case, but I think it's still worth
doing. Some people are going to do just plain GetPageWithFreeSpace().

I am attaching new version v17.

It works like this...

In RelationAddExtraBlocks:
-- Add blocks one by one to the FSM.
-- Update the FSM tree all the way to the root.

In RelationGetBufferForTuple:
-- Same as v14: search the FSM tree from the root (GetPageWithFreeSpace).

Summary:
--------------
1. By adding blocks to the FSM tree one by one, it solves the unused-blocks
problem of v14.
2. It updates the FSM tree all the way up to the root, so anybody searching
from the root can find the new blocks.
3. It also searches for blocks from the root, so it doesn't have the problem
v15 has (exactly identifying which two FSM pages to search).
4. It solves the performance problem of v14 through some optimizations in
the logic for updating the FSM tree up to the root.

Performance data:
--------------------------
INSERT (1000 tuples of 1028 bytes each):
Client    base    v17
     1     117    120
     2     111    123
     4      51    124
     8      43    135
    16      40    145
    32      35    144
    64      --    140

COPY (10000 tuples):
Client    base    v17
     1     118    117
     2     217    220
     4     210    379
     8     166    574
    16     145    594
    32     124    599
    64      --    609

Notes:
---------
If I change the block-search strategy in this patch, performance remains
almost the same, e.g.:
1. Searching in two FSM pages, like v15 does.
2. Searching first using the target block and, if nothing is found, then
searching from the top.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

multi_extend_v17.patch (application/octet-stream)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 8140418..f8594d0 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -169,6 +169,67 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
 }
 
 /*
+ * Extend a relation by multiple blocks to avoid future contention on the
+ * relation extension lock.  Our goal is to pre-extend the relation by an
+ * amount which ramps up as the degree of contention ramps up, but limiting
+ * the result to some sane overall value.
+ */
+static void
+RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
+{
+	Page		page;
+	BlockNumber	blockNum,
+				firstBlock;
+	int			extraBlocks = 0;
+	int			lockWaiters = 0;
+	Size		freespace;
+	Buffer		buffer;
+
+	/*
+	 * We use the length of the lock wait queue to judge how much to extend.
+	 * It might seem like multiplying the number of lock waiters by as much
+	 * as 20 is too aggressive, but benchmarking revealed that smaller numbers
+	 * were insufficient.  512 is just an arbitrary cap to prevent pathological
+	 * results (and excessive wasted disk space).
+	 */
+	lockWaiters = RelationExtensionLockWaiterCount(relation);
+	extraBlocks = Min(512, lockWaiters * 20);
+
+	firstBlock = InvalidBlockNumber;
+
+	while (extraBlocks-- >= 0)
+	{
+		/* Ouch - an unnecessary lseek() each time through the loop! */
+		buffer = ReadBufferBI(relation, P_NEW, bistate);
+
+		/* Extend by one page. */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		page = BufferGetPage(buffer);
+		PageInit(page, BufferGetPageSize(buffer), 0);
+		MarkBufferDirty(buffer);
+		blockNum = BufferGetBlockNumber(buffer);
+		freespace = PageGetHeapFreeSpace(page);
+		UnlockReleaseBuffer(buffer);
+
+		if (firstBlock == InvalidBlockNumber)
+			firstBlock = blockNum;
+
+		/*
+		 * Put the page in the freespace map so other backends can find it.
+		 * This is what will keep those other backends from also queueing up
+		 * on the relation extension lock.
+		 */
+		RecordPageWithFreeSpace(relation, blockNum, freespace);
+	}
+
+	/*
+	 * Update the free space map all the way up to the root, so that other
+	 * backends searching the free space map from the root can also find the
+	 * new blocks.
+	 */
+	UpdateFreeSpaceMap(relation, firstBlock, blockNum, freespace);
+}
+
+/*
  * RelationGetBufferForTuple
  *
  *	Returns pinned and exclusive-locked buffer of a page in given relation
@@ -233,8 +294,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	bool		use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
 	Buffer		buffer = InvalidBuffer;
 	Page		page;
-	Size		pageFreeSpace,
-				saveFreeSpace;
+	Size		pageFreeSpace = 0,
+				saveFreeSpace = 0;
 	BlockNumber targetBlock,
 				otherBlock;
 	bool		needLock;
@@ -308,6 +369,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
 		}
 	}
 
+loop:
 	while (targetBlock != InvalidBlockNumber)
 	{
 		/*
@@ -440,10 +502,46 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	 */
 	needLock = !RELATION_IS_LOCAL(relation);
 
+	/*
+	 * If we need the lock but are not able to acquire it immediately, we'll
+	 * consider extending the relation by multiple blocks at a time to manage
+	 * contention on the relation extension lock.  However, this only makes
+	 * sense if we're using the FSM; otherwise, there's no point.
+	 */
 	if (needLock)
-		LockRelationForExtension(relation, ExclusiveLock);
+	{
+		if (!use_fsm)
+			LockRelationForExtension(relation, ExclusiveLock);
+		else if (!ConditionLockRelationForExtension(relation, ExclusiveLock))
+		{
+			/* Couldn't get the lock immediately; wait for it. */
+			LockRelationForExtension(relation, ExclusiveLock);
+
+			/*
+			 * Check if some other backend has extended a block for us while
+			 * we were waiting on the lock.
+			 */
+			targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
+
+			/*
+			 * If some other waiter has already extended the relation, we
+			 * don't need to do so; just use the existing freespace.
+			 */
+			if (targetBlock != InvalidBlockNumber)
+			{
+				UnlockRelationForExtension(relation, ExclusiveLock);
+				goto loop;
+			}
+
+			/* Time to bulk-extend. */
+			RelationAddExtraBlocks(relation, bistate);
+		}
+	}
 
 	/*
+	 * In addition to whatever extension we performed above, we always add
+	 * at least one block to satisfy our own request.
+	 *
 	 * XXX This does an lseek - rather expensive - but at the moment it is the
 	 * only way to accurately determine how many blocks are in a relation.  Is
 	 * it worth keeping an accurate file length in shared memory someplace,
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 2631080..696aaa3 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -109,7 +109,10 @@ static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
 				   uint8 newValue, uint8 minValue);
 static BlockNumber fsm_search(Relation rel, uint8 min_cat);
 static uint8 fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof);
-
+static bool fsm_addr_on_same_page(FSMAddress addr1, FSMAddress addr2);
+static void fsm_set_range(Relation rel, FSMAddress addr, uint16 firstSlot,
+					uint16 lastSlot, uint8 newValue);
+static void fsm_update_recursive(Relation rel, FSMAddress addr, uint8 new_cat);
 
 /******** Public API ********/
 
@@ -189,6 +192,59 @@ RecordPageWithFreeSpace(Relation rel, BlockNumber heapBlk, Size spaceAvail)
 }
 
 /*
+ * Update the free space map all the way up to the root, so that other
+ * backends searching the free space map from the root can also find the
+ * new blocks.
+ */
+void
+UpdateFreeSpaceMap(Relation rel, BlockNumber firstBlkNum,
+					BlockNumber lastBlkNum, Size freespace)
+{
+	int			new_cat = fsm_space_avail_to_cat(freespace);
+	FSMAddress	addr1,
+				addr2;
+	uint16		firstSlot;
+	uint16		lastSlot;
+	uint16		slot;
+	BlockNumber	blockNum;
+
+	blockNum = firstBlkNum;
+
+	/*
+	 * If the complete block range is on the same FSM page, there is no need
+	 * to check block by block.
+	 */
+	addr1 = fsm_get_location(firstBlkNum, &slot);
+	addr2 = fsm_get_location(lastBlkNum, &slot);
+	if (fsm_addr_on_same_page(addr1, addr2))
+	{
+		fsm_update_recursive(rel, addr1, new_cat);
+		return;
+	}
+
+	while (blockNum < lastBlkNum)
+	{
+		blockNum++;
+
+		addr2 = fsm_get_location(blockNum, &slot);
+
+		if (!fsm_addr_on_same_page(addr1, addr2)
+			|| blockNum == lastBlkNum)
+		{
+			/* This block is on the next FSM page, so update the previous FSM page */
+			fsm_get_location(firstBlkNum, &firstSlot);
+			fsm_get_location(blockNum - 1, &lastSlot);
+			fsm_update_recursive(rel, addr1, new_cat);
+
+			/*
+			 * Continue updating the FSM up to the last page: set addr1 to the
+			 * new address and continue the scan from this page.
+			 */
+			addr1 = addr2;
+		}
+	}
+}
+
+/*
  * XLogRecordPageWithFreeSpace - like RecordPageWithFreeSpace, for use in
  *		WAL replay
  */
@@ -788,3 +844,60 @@ fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof_p)
 
 	return max_avail;
 }
+
+static bool
+fsm_addr_on_same_page(FSMAddress addr1, FSMAddress addr2)
+{
+	if ((addr1.level != addr2.level)
+		|| (addr1.logpageno != addr2.logpageno))
+		return false;
+
+	return true;
+}
+
+/*
+ * Set value in given FSM page for given slot range.
+ */
+static void
+fsm_set_range(Relation rel, FSMAddress addr, uint16 firstSlot,
+			uint16 lastSlot, uint8 newValue)
+{
+	Buffer		buf;
+	Page		page;
+	int			slot;
+
+	buf = fsm_readbuf(rel, addr, true);
+	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+	page = BufferGetPage(buf);
+
+	slot = firstSlot;
+
+	for (slot = firstSlot; slot <= lastSlot; slot++)
+		fsm_set_avail(page, slot, newValue);
+
+	MarkBufferDirtyHint(buf, false);
+
+	UnlockReleaseBuffer(buf);
+}
+
+/*
+ * Recursively update the FSM tree from given address to
+ * all the way up to root.
+ */
+static void
+fsm_update_recursive(Relation rel, FSMAddress addr, uint8 new_cat)
+{
+	uint16		parentslot;
+	FSMAddress	parent;
+
+	if (addr.level == FSM_ROOT_LEVEL)
+		return;
+
+	parent = fsm_get_parent(addr, &parentslot);
+	fsm_set_and_search(rel, parent, parentslot, new_cat, 0);
+	fsm_update_recursive(rel, parent, new_cat);
+}
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 0632fc0..7e04137 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -341,6 +341,41 @@ LockRelationForExtension(Relation relation, LOCKMODE lockmode)
 }
 
 /*
+ *		ConditionLockRelationForExtension
+ *
+ * As above, but only lock if we can get the lock without blocking.
+ * Returns TRUE iff the lock was acquired.
+ */
+bool
+ConditionLockRelationForExtension(Relation relation, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return (LockAcquire(&tag, lockmode, false, true) != LOCKACQUIRE_NOT_AVAIL);
+}
+
+/*
+ *		RelationExtensionLockWaiterCount
+ *
+ * Count the lock requester for the RelationExtension lock.
+ */
+int
+RelationExtensionLockWaiterCount(Relation relation)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return LockWaiterCount(&tag);
+}
+
+/*
  *		UnlockRelationForExtension
  */
 void
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index b30b7b1..353f705 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -4380,3 +4380,40 @@ VirtualXactLock(VirtualTransactionId vxid, bool wait)
 	LockRelease(&tag, ShareLock, false);
 	return true;
 }
+
+/*
+ * LockWaiterCount
+ *
+ * Find the number of lock requester on this locktag
+ */
+int
+LockWaiterCount(const LOCKTAG *locktag)
+{
+	LOCKMETHODID lockmethodid = locktag->locktag_lockmethodid;
+	LOCK	   *lock;
+	bool		found;
+	uint32		hashcode;
+	LWLock		*partitionLock;
+	int			waiters = 0;
+
+	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
+		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
+
+	hashcode = LockTagHashCode(locktag);
+	partitionLock = LockHashPartitionLock(hashcode);
+	LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+
+	lock = (LOCK *) hash_search_with_hash_value(LockMethodLockHash,
+												(const void *) locktag,
+												hashcode,
+												HASH_FIND,
+												&found);
+	if (found)
+	{
+		Assert(lock != NULL);
+		waiters = lock->nRequested;
+	}
+	LWLockRelease(partitionLock);
+
+	return waiters;
+}
diff --git a/src/include/storage/freespace.h b/src/include/storage/freespace.h
index 19dcb8d..16c052b 100644
--- a/src/include/storage/freespace.h
+++ b/src/include/storage/freespace.h
@@ -32,5 +32,9 @@ extern void XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk,
 
 extern void FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks);
 extern void FreeSpaceMapVacuum(Relation rel);
+extern void UpdateFreeSpaceMap(Relation rel,
+							BlockNumber firstBlkNum,
+							BlockNumber lastBlkNum,
+							Size freespace);
 
 #endif   /* FREESPACE_H_ */
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 975b6f8..4460756 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -53,6 +53,9 @@ extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
 /* Lock a relation for extension */
 extern void LockRelationForExtension(Relation relation, LOCKMODE lockmode);
 extern void UnlockRelationForExtension(Relation relation, LOCKMODE lockmode);
+extern bool ConditionLockRelationForExtension(Relation relation,
+								  LOCKMODE lockmode);
+extern int	RelationExtensionLockWaiterCount(Relation relation);
 
 /* Lock a page (currently only used within indexes) */
 extern void LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index b26427d..9c08679 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -574,6 +574,8 @@ extern void RememberSimpleDeadLock(PGPROC *proc1,
 					   PGPROC *proc2);
 extern void InitDeadLockChecking(void);
 
+extern int LockWaiterCount(const LOCKTAG *locktag);
+
 #ifdef LOCK_DEBUG
 extern void DumpLocks(PGPROC *proc);
 extern void DumpAllLocks(void);
#112Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#111)
1 attachment(s)
Re: Relation extension scalability

On Tue, Mar 29, 2016 at 2:09 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

Attaching new version v18

- Some cleanup work on v17.
- Improved UpdateFreeSpaceMap function.
- Performance and space utilization are the same as v17

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

multi_extend_v18.patch (application/octet-stream)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 8140418..f8594d0 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -169,6 +169,67 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
 }
 
 /*
+ * Extend a relation by multiple blocks to avoid future contention on the
+ * relation extension lock.  Our goal is to pre-extend the relation by an
+ * amount which ramps up as the degree of contention ramps up, but limiting
+ * the result to some sane overall value.
+ */
+static void
+RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
+{
+	Page		page;
+	BlockNumber	blockNum,
+				firstBlock;
+	int			extraBlocks = 0;
+	int			lockWaiters = 0;
+	Size		freespace;
+	Buffer		buffer;
+
+	/*
+	 * We use the length of the lock wait queue to judge how much to extend.
+	 * It might seem like multiplying the number of lock waiters by as much
+	 * as 20 is too aggressive, but benchmarking revealed that smaller numbers
+	 * were insufficient.  512 is just an arbitrary cap to prevent pathological
+	 * results (and excessive wasted disk space).
+	 */
+	lockWaiters = RelationExtensionLockWaiterCount(relation);
+	extraBlocks = Min(512, lockWaiters * 20);
+
+	firstBlock = InvalidBlockNumber;
+
+	while (extraBlocks-- >= 0)
+	{
+		/* Ouch - an unnecessary lseek() each time through the loop! */
+		buffer = ReadBufferBI(relation, P_NEW, bistate);
+
+		/* Extend by one page. */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		page = BufferGetPage(buffer);
+		PageInit(page, BufferGetPageSize(buffer), 0);
+		MarkBufferDirty(buffer);
+		blockNum = BufferGetBlockNumber(buffer);
+		freespace = PageGetHeapFreeSpace(page);
+		UnlockReleaseBuffer(buffer);
+
+		if (firstBlock == InvalidBlockNumber)
+			firstBlock = blockNum;
+
+		/*
+		 * Put the page in the freespace map so other backends can find it.
+		 * This is what will keep those other backends from also queueing up
+		 * on the relation extension lock.
+		 */
+		RecordPageWithFreeSpace(relation, blockNum, freespace);
+	}
+
+	/*
+	 * Update the free space map all the way up to the root, so that
+	 * other backends searching the free space map from the root can
+	 * also find the new blocks.
+	 */
+	UpdateFreeSpaceMap(relation, firstBlock, blockNum, freespace);
+}
+
+/*
  * RelationGetBufferForTuple
  *
  *	Returns pinned and exclusive-locked buffer of a page in given relation
@@ -233,8 +294,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	bool		use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
 	Buffer		buffer = InvalidBuffer;
 	Page		page;
-	Size		pageFreeSpace,
-				saveFreeSpace;
+	Size		pageFreeSpace = 0,
+				saveFreeSpace = 0;
 	BlockNumber targetBlock,
 				otherBlock;
 	bool		needLock;
@@ -308,6 +369,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
 		}
 	}
 
+loop:
 	while (targetBlock != InvalidBlockNumber)
 	{
 		/*
@@ -440,10 +502,46 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	 */
 	needLock = !RELATION_IS_LOCAL(relation);
 
+	/*
+	 * If we need the lock but are not able to acquire it immediately, we'll
+	 * consider extending the relation by multiple blocks at a time to manage
+	 * contention on the relation extension lock.  However, this only makes
+	 * sense if we're using the FSM; otherwise, there's no point.
+	 */
 	if (needLock)
-		LockRelationForExtension(relation, ExclusiveLock);
+	{
+		if (!use_fsm)
+			LockRelationForExtension(relation, ExclusiveLock);
+		else if (!ConditionLockRelationForExtension(relation, ExclusiveLock))
+		{
+			/* Couldn't get the lock immediately; wait for it. */
+			LockRelationForExtension(relation, ExclusiveLock);
+
+			/*
+			 * Check if some other backend has extended a block for us while
+			 * we were waiting on the lock.
+			 */
+			targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
+
+			/*
+			 * If some other waiter has already extended the relation, we
+			 * don't need to do so; just use the existing freespace.
+			 */
+			if (targetBlock != InvalidBlockNumber)
+			{
+				UnlockRelationForExtension(relation, ExclusiveLock);
+				goto loop;
+			}
+
+			/* Time to bulk-extend. */
+			RelationAddExtraBlocks(relation, bistate);
+		}
+	}
 
 	/*
+	 * In addition to whatever extension we performed above, we always add
+	 * at least one block to satisfy our own request.
+	 *
 	 * XXX This does an lseek - rather expensive - but at the moment it is the
 	 * only way to accurately determine how many blocks are in a relation.  Is
 	 * it worth keeping an accurate file length in shared memory someplace,
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 2631080..676f5d6 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -109,7 +109,8 @@ static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
 				   uint8 newValue, uint8 minValue);
 static BlockNumber fsm_search(Relation rel, uint8 min_cat);
 static uint8 fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof);
-
+static BlockNumber fsm_get_lastblckno(Relation rel, FSMAddress addr);
+static void fsm_update_recursive(Relation rel, FSMAddress addr, uint8 new_cat);
 
 /******** Public API ********/
 
@@ -189,6 +190,53 @@ RecordPageWithFreeSpace(Relation rel, BlockNumber heapBlk, Size spaceAvail)
 }
 
 /*
+ * Update the free space map all the way up to the root, so that other
+ * backends searching the free space map from the root can also find the
+ * new blocks.
+ *
+ * This function updates the FSM tree up to the root, but only for the
+ * FSM pages on which the given block range from startBlkNum to endBlkNum
+ * resides.
+ */
+void
+UpdateFreeSpaceMap(Relation rel, BlockNumber startBlkNum,
+					BlockNumber endBlkNum, Size freespace)
+{
+	int			new_cat = fsm_space_avail_to_cat(freespace);
+	FSMAddress	addr;
+	uint16		slot;
+	BlockNumber	blockNum;
+	BlockNumber	lastBlkOnPage;
+
+	blockNum = startBlkNum;
+
+	while (blockNum <= endBlkNum)
+	{
+		/*
+		 * Get the FSM address where this block resides and update the
+		 * FSM tree from that address all the way up to the root.
+		 */
+		addr = fsm_get_location(blockNum, &slot);
+		fsm_update_recursive(rel, addr, new_cat);
+
+		/*
+		 * Get the last block number on this FSM page.  If this page
+		 * covers up to our endBlkNum, we are done; otherwise continue
+		 * with the next FSM page.
+		 */
+		lastBlkOnPage = fsm_get_lastblckno(rel, addr);
+		if (lastBlkOnPage >= endBlkNum)
+			break;
+
+		/*
+		 * We are not done yet, so move to the first block of the next
+		 * FSM page.
+		 */
+		blockNum = lastBlkOnPage + 1;
+	}
+}
+
+/*
  * XLogRecordPageWithFreeSpace - like RecordPageWithFreeSpace, for use in
  *		WAL replay
  */
@@ -788,3 +836,42 @@ fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof_p)
 
 	return max_avail;
 }
+
+/*
+ * Return the last block number covered by the given FSM page
+ * address.
+ */
+static BlockNumber
+fsm_get_lastblckno(Relation rel, FSMAddress addr)
+{
+	int			slot;
+
+	/*
+	 * Get the last slot number on the given address and convert it to
+	 * a block number.
+	 */
+	slot = SlotsPerFSMPage - 1;
+	return fsm_get_heap_blk(addr, slot);
+}
+
+/*
+ * Recursively update the FSM tree from the given address all the way
+ * up to the root.
+ */
+static void
+fsm_update_recursive(Relation rel, FSMAddress addr, uint8 new_cat)
+{
+	uint16		parentslot;
+	FSMAddress	parent;
+
+	if (addr.level == FSM_ROOT_LEVEL)
+		return;
+
+	/*
+	 * Get the parent page and our slot in the parent page, and update
+	 * the information stored there.
+	 */
+	parent = fsm_get_parent(addr, &parentslot);
+	fsm_set_and_search(rel, parent, parentslot, new_cat, 0);
+	fsm_update_recursive(rel, parent, new_cat);
+}
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 0632fc0..7e04137 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -341,6 +341,41 @@ LockRelationForExtension(Relation relation, LOCKMODE lockmode)
 }
 
 /*
+ *		ConditionLockRelationForExtension
+ *
+ * As above, but only lock if we can get the lock without blocking.
+ * Returns TRUE iff the lock was acquired.
+ */
+bool
+ConditionLockRelationForExtension(Relation relation, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return (LockAcquire(&tag, lockmode, false, true) != LOCKACQUIRE_NOT_AVAIL);
+}
+
+/*
+ *		RelationExtensionLockWaiterCount
+ *
+ * Count the lock requesters for the relation extension lock.
+ */
+int
+RelationExtensionLockWaiterCount(Relation relation)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return LockWaiterCount(&tag);
+}
+
+/*
  *		UnlockRelationForExtension
  */
 void
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index b30b7b1..353f705 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -4380,3 +4380,40 @@ VirtualXactLock(VirtualTransactionId vxid, bool wait)
 	LockRelease(&tag, ShareLock, false);
 	return true;
 }
+
+/*
+ * LockWaiterCount
+ *
+ * Find the number of lock requesters on this locktag
+ */
+int
+LockWaiterCount(const LOCKTAG *locktag)
+{
+	LOCKMETHODID lockmethodid = locktag->locktag_lockmethodid;
+	LOCK	   *lock;
+	bool		found;
+	uint32		hashcode;
+	LWLock		*partitionLock;
+	int			waiters = 0;
+
+	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
+		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
+
+	hashcode = LockTagHashCode(locktag);
+	partitionLock = LockHashPartitionLock(hashcode);
+	LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+
+	lock = (LOCK *) hash_search_with_hash_value(LockMethodLockHash,
+												(const void *) locktag,
+												hashcode,
+												HASH_FIND,
+												&found);
+	if (found)
+	{
+		Assert(lock != NULL);
+		waiters = lock->nRequested;
+	}
+	LWLockRelease(partitionLock);
+
+	return waiters;
+}
diff --git a/src/include/storage/freespace.h b/src/include/storage/freespace.h
index 19dcb8d..16c052b 100644
--- a/src/include/storage/freespace.h
+++ b/src/include/storage/freespace.h
@@ -32,5 +32,9 @@ extern void XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk,
 
 extern void FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks);
 extern void FreeSpaceMapVacuum(Relation rel);
+extern void UpdateFreeSpaceMap(Relation rel,
+							BlockNumber firstBlkNum,
+							BlockNumber lastBlkNum,
+							Size freespace);
 
 #endif   /* FREESPACE_H_ */
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 975b6f8..4460756 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -53,6 +53,9 @@ extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
 /* Lock a relation for extension */
 extern void LockRelationForExtension(Relation relation, LOCKMODE lockmode);
 extern void UnlockRelationForExtension(Relation relation, LOCKMODE lockmode);
+extern bool ConditionLockRelationForExtension(Relation relation,
+								  LOCKMODE lockmode);
+extern int	RelationExtensionLockWaiterCount(Relation relation);
 
 /* Lock a page (currently only used within indexes) */
 extern void LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index b26427d..9c08679 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -574,6 +574,8 @@ extern void RememberSimpleDeadLock(PGPROC *proc1,
 					   PGPROC *proc2);
 extern void InitDeadLockChecking(void);
 
+extern int LockWaiterCount(const LOCKTAG *locktag);
+
 #ifdef LOCK_DEBUG
 extern void DumpLocks(PGPROC *proc);
 extern void DumpAllLocks(void);
#113Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#112)
1 attachment(s)
Re: Relation extension scalability

On Tue, Mar 29, 2016 at 1:29 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

Attaching new version v18

- Some cleanup work on v17.
- Improved UpdateFreeSpaceMap function.
- Performance and space utilization are the same as v17

Looks better. Here's a v19 that I hacked on a bit.

Unfortunately, one compiler I tried this with had a pretty legitimate complaint:

hio.c: In function ‘RelationGetBufferForTuple’:
hio.c:231:20: error: ‘freespace’ may be used uninitialized in this
function [-Werror=uninitialized]
hio.c:185:7: note: ‘freespace’ was declared here
hio.c:231:20: error: ‘blockNum’ may be used uninitialized in this
function [-Werror=uninitialized]
hio.c:181:14: note: ‘blockNum’ was declared here

There's nothing whatsoever to prevent RelationExtensionLockWaiterCount
from returning 0.

It's also rather ugly that the call to UpdateFreeSpaceMap() assumes
that the last value returned by PageGetHeapFreeSpace() is as good as
any other, but maybe we can just install a comment explaining that
point; there's not an obviously better approach that I can see.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

multi_extend_v19.patch (text/x-diff)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 8140418..aeed40b 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -169,6 +169,69 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
 }
 
 /*
+ * Extend a relation by multiple blocks to avoid future contention on the
+ * relation extension lock.  Our goal is to pre-extend the relation by an
+ * amount which ramps up as the degree of contention ramps up, but limiting
+ * the result to some sane overall value.
+ */
+static void
+RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
+{
+	Page		page;
+	BlockNumber	blockNum,
+				firstBlock;
+	int			extraBlocks = 0;
+	int			lockWaiters = 0;
+	Size		freespace;
+	Buffer		buffer;
+
+	/*
+	 * We use the length of the lock wait queue to judge how much to extend.
+	 * It might seem like multiplying the number of lock waiters by as much
+	 * as 20 is too aggressive, but benchmarking revealed that smaller numbers
+	 * were insufficient.  512 is just an arbitrary cap to prevent pathological
+	 * results (and excessive wasted disk space).
+	 */
+	lockWaiters = RelationExtensionLockWaiterCount(relation);
+	extraBlocks = Min(512, lockWaiters * 20);
+
+	firstBlock = InvalidBlockNumber;
+
+	while (extraBlocks-- >= 0)
+	{
+		/* Ouch - an unnecessary lseek() each time through the loop! */
+		buffer = ReadBufferBI(relation, P_NEW, bistate);
+
+		/* Extend by one page. */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		page = BufferGetPage(buffer);
+		PageInit(page, BufferGetPageSize(buffer), 0);
+		MarkBufferDirty(buffer);
+		blockNum = BufferGetBlockNumber(buffer);
+		freespace = PageGetHeapFreeSpace(page);
+		UnlockReleaseBuffer(buffer);
+
+		if (firstBlock == InvalidBlockNumber)
+			firstBlock = blockNum;
+
+		/*
+		 * Immediately update the bottom level of the FSM.  This has a good
+		 * chance of making this page visible to other concurrently inserting
+		 * backends, and we want that to happen without delay.
+		 */
+		RecordPageWithFreeSpace(relation, blockNum, freespace);
+	}
+
+	/*
+	 * Updating the upper levels of the free space map is too expensive
+	 * to do for every block, but it's worth doing once at the end to make
+	 * sure that subsequent insertion activity sees all of those nifty free
+	 * pages we just inserted.
+	 */
+	UpdateFreeSpaceMap(relation, firstBlock, blockNum, freespace);
+}
+
+/*
  * RelationGetBufferForTuple
  *
  *	Returns pinned and exclusive-locked buffer of a page in given relation
@@ -233,8 +296,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	bool		use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
 	Buffer		buffer = InvalidBuffer;
 	Page		page;
-	Size		pageFreeSpace,
-				saveFreeSpace;
+	Size		pageFreeSpace = 0,
+				saveFreeSpace = 0;
 	BlockNumber targetBlock,
 				otherBlock;
 	bool		needLock;
@@ -308,6 +371,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
 		}
 	}
 
+loop:
 	while (targetBlock != InvalidBlockNumber)
 	{
 		/*
@@ -440,10 +504,46 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	 */
 	needLock = !RELATION_IS_LOCAL(relation);
 
+	/*
+	 * If we need the lock but are not able to acquire it immediately, we'll
+	 * consider extending the relation by multiple blocks at a time to manage
+	 * contention on the relation extension lock.  However, this only makes
+	 * sense if we're using the FSM; otherwise, there's no point.
+	 */
 	if (needLock)
-		LockRelationForExtension(relation, ExclusiveLock);
+	{
+		if (!use_fsm)
+			LockRelationForExtension(relation, ExclusiveLock);
+		else if (!ConditionalLockRelationForExtension(relation, ExclusiveLock))
+		{
+			/* Couldn't get the lock immediately; wait for it. */
+			LockRelationForExtension(relation, ExclusiveLock);
+
+			/*
+			 * Check if some other backend has extended a block for us while
+			 * we were waiting on the lock.
+			 */
+			targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
+
+			/*
+			 * If some other waiter has already extended the relation, we
+			 * don't need to do so; just use the existing freespace.
+			 */
+			if (targetBlock != InvalidBlockNumber)
+			{
+				UnlockRelationForExtension(relation, ExclusiveLock);
+				goto loop;
+			}
+
+			/* Time to bulk-extend. */
+			RelationAddExtraBlocks(relation, bistate);
+		}
+	}
 
 	/*
+	 * In addition to whatever extension we performed above, we always add
+	 * at least one block to satisfy our own request.
+	 *
 	 * XXX This does an lseek - rather expensive - but at the moment it is the
 	 * only way to accurately determine how many blocks are in a relation.  Is
 	 * it worth keeping an accurate file length in shared memory someplace,
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 2631080..b2361e5 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -109,6 +109,8 @@ static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
 				   uint8 newValue, uint8 minValue);
 static BlockNumber fsm_search(Relation rel, uint8 min_cat);
 static uint8 fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof);
+static BlockNumber fsm_get_lastblckno(Relation rel, FSMAddress addr);
+static void fsm_update_recursive(Relation rel, FSMAddress addr, uint8 new_cat);
 
 
 /******** Public API ********/
@@ -189,6 +191,47 @@ RecordPageWithFreeSpace(Relation rel, BlockNumber heapBlk, Size spaceAvail)
 }
 
 /*
+ * Update the upper levels of the free space map all the way up to the root
+ * to make sure we don't lose track of new blocks we just inserted.  This is
+ * intended to be used after adding many new blocks to the relation; we judge
+ * it not worth updating the upper levels of the tree every time data for
+ * a single page changes, but for a bulk-extend it's worth it.
+ */
+void
+UpdateFreeSpaceMap(Relation rel, BlockNumber startBlkNum,
+					BlockNumber endBlkNum, Size freespace)
+{
+	int			new_cat = fsm_space_avail_to_cat(freespace);
+	FSMAddress	addr;
+	uint16		slot;
+	BlockNumber	blockNum;
+	BlockNumber	lastBlkOnPage;
+
+	blockNum = startBlkNum;
+
+	while (blockNum <= endBlkNum)
+	{
+		/*
+		 * Get the FSM address where this block resides and update the
+		 * FSM tree from that address all the way up to the root.
+		 */
+		addr = fsm_get_location(blockNum, &slot);
+		fsm_update_recursive(rel, addr, new_cat);
+
+		/*
+		 * Get the last block number on this FSM page.  If that's greater
+		 * than or equal to our endBlkNum, we're done.  Otherwise, advance
+		 * to the first block on the next page.
+		 */
+		lastBlkOnPage = fsm_get_lastblckno(rel, addr);
+		if (lastBlkOnPage >= endBlkNum)
+			break;
+		blockNum = lastBlkOnPage + 1;
+	}
+}
+
+/*
  * XLogRecordPageWithFreeSpace - like RecordPageWithFreeSpace, for use in
  *		WAL replay
  */
@@ -788,3 +831,42 @@ fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof_p)
 
 	return max_avail;
 }
+
+/*
+ * Return the last block number covered by the given FSM page
+ * address.
+ */
+static BlockNumber
+fsm_get_lastblckno(Relation rel, FSMAddress addr)
+{
+	int			slot;
+
+	/*
+	 * Get the last slot number on the given address and convert it to
+	 * a block number.
+	 */
+	slot = SlotsPerFSMPage - 1;
+	return fsm_get_heap_blk(addr, slot);
+}
+
+/*
+ * Recursively update the FSM tree from the given address all the way
+ * up to the root.
+ */
+static void
+fsm_update_recursive(Relation rel, FSMAddress addr, uint8 new_cat)
+{
+	uint16		parentslot;
+	FSMAddress	parent;
+
+	if (addr.level == FSM_ROOT_LEVEL)
+		return;
+
+	/*
+	 * Get the parent page and our slot in the parent page, and update
+	 * the information stored there.
+	 */
+	parent = fsm_get_parent(addr, &parentslot);
+	fsm_set_and_search(rel, parent, parentslot, new_cat, 0);
+	fsm_update_recursive(rel, parent, new_cat);
+}
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 0632fc0..7b08555 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -341,6 +341,41 @@ LockRelationForExtension(Relation relation, LOCKMODE lockmode)
 }
 
 /*
+ *		ConditionalLockRelationForExtension
+ *
+ * As above, but only lock if we can get the lock without blocking.
+ * Returns TRUE iff the lock was acquired.
+ */
+bool
+ConditionalLockRelationForExtension(Relation relation, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return (LockAcquire(&tag, lockmode, false, true) != LOCKACQUIRE_NOT_AVAIL);
+}
+
+/*
+ *		RelationExtensionLockWaiterCount
+ *
+ * Count the number of processes waiting for the given relation extension lock.
+ */
+int
+RelationExtensionLockWaiterCount(Relation relation)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return LockWaiterCount(&tag);
+}
+
+/*
  *		UnlockRelationForExtension
  */
 void
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index b30b7b1..41f6930 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -4380,3 +4380,40 @@ VirtualXactLock(VirtualTransactionId vxid, bool wait)
 	LockRelease(&tag, ShareLock, false);
 	return true;
 }
+
+/*
+ * LockWaiterCount
+ *
+ * Find the number of lock requesters on this locktag
+ */
+int
+LockWaiterCount(const LOCKTAG *locktag)
+{
+	LOCKMETHODID lockmethodid = locktag->locktag_lockmethodid;
+	LOCK	   *lock;
+	bool		found;
+	uint32		hashcode;
+	LWLock	   *partitionLock;
+	int			waiters = 0;
+
+	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
+		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
+
+	hashcode = LockTagHashCode(locktag);
+	partitionLock = LockHashPartitionLock(hashcode);
+	LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+
+	lock = (LOCK *) hash_search_with_hash_value(LockMethodLockHash,
+												(const void *) locktag,
+												hashcode,
+												HASH_FIND,
+												&found);
+	if (found)
+	{
+		Assert(lock != NULL);
+		waiters = lock->nRequested;
+	}
+	LWLockRelease(partitionLock);
+
+	return waiters;
+}
diff --git a/src/include/storage/freespace.h b/src/include/storage/freespace.h
index 19dcb8d..16c052b 100644
--- a/src/include/storage/freespace.h
+++ b/src/include/storage/freespace.h
@@ -32,5 +32,9 @@ extern void XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk,
 
 extern void FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks);
 extern void FreeSpaceMapVacuum(Relation rel);
+extern void UpdateFreeSpaceMap(Relation rel,
+							BlockNumber firtsBlkNum,
+							BlockNumber lastBlkNum,
+							Size freespace);
 
 #endif   /* FREESPACE_H_ */
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 975b6f8..8288e7d 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -53,6 +53,9 @@ extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
 /* Lock a relation for extension */
 extern void LockRelationForExtension(Relation relation, LOCKMODE lockmode);
 extern void UnlockRelationForExtension(Relation relation, LOCKMODE lockmode);
+extern bool ConditionalLockRelationForExtension(Relation relation,
+									LOCKMODE lockmode);
+extern int	RelationExtensionLockWaiterCount(Relation relation);
 
 /* Lock a page (currently only used within indexes) */
 extern void LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index b26427d..efa75ec 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -574,6 +574,8 @@ extern void RememberSimpleDeadLock(PGPROC *proc1,
 					   PGPROC *proc2);
 extern void InitDeadLockChecking(void);
 
+extern int	LockWaiterCount(const LOCKTAG *locktag);
+
 #ifdef LOCK_DEBUG
 extern void DumpLocks(PGPROC *proc);
 extern void DumpAllLocks(void);
#114Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#113)
1 attachment(s)
Re: Relation extension scalability

On Wed, Mar 30, 2016 at 7:19 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Thanks for the review and the better comments.

hio.c: In function ‘RelationGetBufferForTuple’:
hio.c:231:20: error: ‘freespace’ may be used uninitialized in this
function [-Werror=uninitialized]
hio.c:185:7: note: ‘freespace’ was declared here
hio.c:231:20: error: ‘blockNum’ may be used uninitialized in this
function [-Werror=uninitialized]
hio.c:181:14: note: ‘blockNum’ was declared here

I have fixed those in v20

There's nothing whatsoever to prevent RelationExtensionLockWaiterCount
from returning 0.

It's also rather ugly that the call to UpdateFreeSpaceMap() assumes
that the last value returned by PageGetHeapFreeSpace() is as good as
any other, but maybe we can just install a comment explaining that
point; there's not an obviously better approach that I can see.

Added comments..

+ if (lockWaiters)
+ /*
+ * Here we use the same freespace value for all the blocks.  That is
+ * OK, because they are all newly added blocks and have the same free
+ * space.  Even if some block we just added to the FSM above has
+ * already been used by another backend, so that its free space is no
+ * longer the same, no harm is done: the caller recomputes the actual
+ * free space after getting the page.
+ */
+ UpdateFreeSpaceMap(relation, firstBlock, blockNum, freespace);

Does this look good?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

multi_extend_v20.patch (application/octet-stream)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 8140418..31b84d0 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -169,6 +169,78 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
 }
 
 /*
+ * Extend a relation by multiple blocks to avoid future contention on the
+ * relation extension lock.  Our goal is to pre-extend the relation by an
+ * amount which ramps up as the degree of contention ramps up, but limiting
+ * the result to some sane overall value.
+ */
+static void
+RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
+{
+	Page		page;
+	BlockNumber	blockNum,
+				firstBlock;
+	int			extraBlocks = 0;
+	int			lockWaiters = 0;
+	Size		freespace = 0;
+	Buffer		buffer;
+
+	/*
+	 * We use the length of the lock wait queue to judge how much to extend.
+	 * It might seem like multiplying the number of lock waiters by as much
+	 * as 20 is too aggressive, but benchmarking revealed that smaller numbers
+	 * were insufficient.  512 is just an arbitrary cap to prevent pathological
+	 * results (and excessive wasted disk space).
+	 */
+	lockWaiters = RelationExtensionLockWaiterCount(relation);
+	extraBlocks = Min(512, lockWaiters * 20);
+
+	blockNum = firstBlock = InvalidBlockNumber;
+
+	while (extraBlocks-- >= 0)
+	{
+		/* Ouch - an unnecessary lseek() each time through the loop! */
+		buffer = ReadBufferBI(relation, P_NEW, bistate);
+
+		/* Extend by one page. */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		page = BufferGetPage(buffer);
+		PageInit(page, BufferGetPageSize(buffer), 0);
+		MarkBufferDirty(buffer);
+		blockNum = BufferGetBlockNumber(buffer);
+		freespace = PageGetHeapFreeSpace(page);
+		UnlockReleaseBuffer(buffer);
+
+		if (firstBlock == InvalidBlockNumber)
+			firstBlock = blockNum;
+
+		/*
+		 * Immediately update the bottom level of the FSM.  This has a good
+		 * chance of making this page visible to other concurrently inserting
+		 * backends, and we want that to happen without delay.
+		 */
+		RecordPageWithFreeSpace(relation, blockNum, freespace);
+	}
+
+	/*
+	 * Updating the upper levels of the free space map is too expensive
+	 * to do for every block, but it's worth doing once at the end to make
+	 * sure that subsequent insertion activity sees all of those nifty free
+	 * pages we just inserted.
+	 */
+	if (lockWaiters)
+		/*
+		 * Here we use the same freespace value for all the blocks.
+		 * That is OK, because they are all newly added blocks and have
+		 * the same free space.  Even if some block we just added to the
+		 * FSM above has already been used by another backend, so that
+		 * its free space is no longer the same, no harm is done: the
+		 * caller recomputes the actual free space after getting the
+		 * page.
+		 */
+		UpdateFreeSpaceMap(relation, firstBlock, blockNum, freespace);
+}
+
+/*
  * RelationGetBufferForTuple
  *
  *	Returns pinned and exclusive-locked buffer of a page in given relation
@@ -233,8 +305,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	bool		use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
 	Buffer		buffer = InvalidBuffer;
 	Page		page;
-	Size		pageFreeSpace,
-				saveFreeSpace;
+	Size		pageFreeSpace = 0,
+				saveFreeSpace = 0;
 	BlockNumber targetBlock,
 				otherBlock;
 	bool		needLock;
@@ -308,6 +380,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
 		}
 	}
 
+loop:
 	while (targetBlock != InvalidBlockNumber)
 	{
 		/*
@@ -440,10 +513,46 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	 */
 	needLock = !RELATION_IS_LOCAL(relation);
 
+	/*
+	 * If we need the lock but are not able to acquire it immediately, we'll
+	 * consider extending the relation by multiple blocks at a time to manage
+	 * contention on the relation extension lock.  However, this only makes
+	 * sense if we're using the FSM; otherwise, there's no point.
+	 */
 	if (needLock)
-		LockRelationForExtension(relation, ExclusiveLock);
+	{
+		if (!use_fsm)
+			LockRelationForExtension(relation, ExclusiveLock);
+		else if (!ConditionalLockRelationForExtension(relation, ExclusiveLock))
+		{
+			/* Couldn't get the lock immediately; wait for it. */
+			LockRelationForExtension(relation, ExclusiveLock);
+
+			/*
+			 * Check if some other backend has extended a block for us while
+			 * we were waiting on the lock.
+			 */
+			targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
+
+			/*
+			 * If some other waiter has already extended the relation, we
+			 * don't need to do so; just use the existing freespace.
+			 */
+			if (targetBlock != InvalidBlockNumber)
+			{
+				UnlockRelationForExtension(relation, ExclusiveLock);
+				goto loop;
+			}
+
+			/* Time to bulk-extend. */
+			RelationAddExtraBlocks(relation, bistate);
+		}
+	}
 
 	/*
+	 * In addition to whatever extension we performed above, we always add
+	 * at least one block to satisfy our own request.
+	 *
 	 * XXX This does an lseek - rather expensive - but at the moment it is the
 	 * only way to accurately determine how many blocks are in a relation.  Is
 	 * it worth keeping an accurate file length in shared memory someplace,
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 2631080..b2361e5 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -109,6 +109,8 @@ static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
 				   uint8 newValue, uint8 minValue);
 static BlockNumber fsm_search(Relation rel, uint8 min_cat);
 static uint8 fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof);
+static BlockNumber fsm_get_lastblckno(Relation rel, FSMAddress addr);
+static void fsm_update_recursive(Relation rel, FSMAddress addr, uint8 new_cat);
 
 
 /******** Public API ********/
@@ -189,6 +191,47 @@ RecordPageWithFreeSpace(Relation rel, BlockNumber heapBlk, Size spaceAvail)
 }
 
 /*
+ * Update the upper levels of the free space map all the way up to the root
+ * to make sure we don't lose track of new blocks we just inserted.  This is
+ * intended to be used after adding many new blocks to the relation; we judge
+ * it not worth updating the upper levels of the tree every time data for
+ * a single page changes, but for a bulk-extend it's worth it.
+ */
+void
+UpdateFreeSpaceMap(Relation rel, BlockNumber startBlkNum,
+					BlockNumber endBlkNum, Size freespace)
+{
+	int			new_cat = fsm_space_avail_to_cat(freespace);
+	FSMAddress	addr;
+	uint16		slot;
+	BlockNumber	blockNum;
+	BlockNumber	lastBlkOnPage;
+
+	blockNum = startBlkNum;
+
+	while (blockNum <= endBlkNum)
+	{
+		/*
+		 * Get the FSM address where this block resides and update the
+		 * FSM tree, starting from that FSM address and going all the way
+		 * up to the root.
+		 */
+		addr = fsm_get_location(blockNum, &slot);
+		fsm_update_recursive(rel, addr, new_cat);
+
+		/*
+		 * Get the last block number on this FSM page.  If that's greater
+		 * than or equal to our endBlkNum, we're done.  Otherwise, advance
+		 * to the first block on the next page.
+		 */
+		lastBlkOnPage = fsm_get_lastblckno(rel, addr);
+		if (lastBlkOnPage >= endBlkNum)
+			break;
+		blockNum = lastBlkOnPage + 1;
+	}
+}
+
+/*
  * XLogRecordPageWithFreeSpace - like RecordPageWithFreeSpace, for use in
  *		WAL replay
  */
@@ -788,3 +831,42 @@ fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof_p)
 
 	return max_avail;
 }
+
+/*
+ * Return the last heap block number whose FSM slot is stored on the
+ * given FSM page address.
+ */
+static BlockNumber
+fsm_get_lastblckno(Relation rel, FSMAddress addr)
+{
+	int			slot;
+
+	/*
+	 * Get the last slot number on the given page and convert that to a
+	 * heap block number.
+	 */
+	slot = SlotsPerFSMPage - 1;
+	return fsm_get_heap_blk(addr, slot);
+}
+
+/*
+ * Recursively update the FSM tree from the given address
+ * all the way up to the root.
+ */
+static void
+fsm_update_recursive(Relation rel, FSMAddress addr, uint8 new_cat)
+{
+	uint16		parentslot;
+	FSMAddress	parent;
+
+	if (addr.level == FSM_ROOT_LEVEL)
+		return;
+
+	/*
+	 * Get the parent page and our slot in the parent page, and
+	 * update the information in that.
+	 */
+	parent = fsm_get_parent(addr, &parentslot);
+	fsm_set_and_search(rel, parent, parentslot, new_cat, 0);
+	fsm_update_recursive(rel, parent, new_cat);
+}
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 0632fc0..7b08555 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -341,6 +341,41 @@ LockRelationForExtension(Relation relation, LOCKMODE lockmode)
 }
 
 /*
+ *		ConditionalLockRelationForExtension
+ *
+ * As above, but only lock if we can get the lock without blocking.
+ * Returns TRUE iff the lock was acquired.
+ */
+bool
+ConditionalLockRelationForExtension(Relation relation, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return (LockAcquire(&tag, lockmode, false, true) != LOCKACQUIRE_NOT_AVAIL);
+}
+
+/*
+ *		RelationExtensionLockWaiterCount
+ *
+ * Count the number of processes waiting for the given relation extension lock.
+ */
+int
+RelationExtensionLockWaiterCount(Relation relation)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return LockWaiterCount(&tag);
+}
+
+/*
  *		UnlockRelationForExtension
  */
 void
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index b30b7b1..41f6930 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -4380,3 +4380,40 @@ VirtualXactLock(VirtualTransactionId vxid, bool wait)
 	LockRelease(&tag, ShareLock, false);
 	return true;
 }
+
+/*
+ * LockWaiterCount
+ *
+ * Find the number of lock requesters on this locktag
+ */
+int
+LockWaiterCount(const LOCKTAG *locktag)
+{
+	LOCKMETHODID lockmethodid = locktag->locktag_lockmethodid;
+	LOCK	   *lock;
+	bool		found;
+	uint32		hashcode;
+	LWLock	   *partitionLock;
+	int			waiters = 0;
+
+	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
+		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
+
+	hashcode = LockTagHashCode(locktag);
+	partitionLock = LockHashPartitionLock(hashcode);
+	LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+
+	lock = (LOCK *) hash_search_with_hash_value(LockMethodLockHash,
+												(const void *) locktag,
+												hashcode,
+												HASH_FIND,
+												&found);
+	if (found)
+	{
+		Assert(lock != NULL);
+		waiters = lock->nRequested;
+	}
+	LWLockRelease(partitionLock);
+
+	return waiters;
+}
diff --git a/src/include/storage/freespace.h b/src/include/storage/freespace.h
index 19dcb8d..16c052b 100644
--- a/src/include/storage/freespace.h
+++ b/src/include/storage/freespace.h
@@ -32,5 +32,9 @@ extern void XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk,
 
 extern void FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks);
 extern void FreeSpaceMapVacuum(Relation rel);
+extern void UpdateFreeSpaceMap(Relation rel,
+							BlockNumber firstBlkNum,
+							BlockNumber lastBlkNum,
+							Size freespace);
 
 #endif   /* FREESPACE_H_ */
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 975b6f8..8288e7d 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -53,6 +53,9 @@ extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
 /* Lock a relation for extension */
 extern void LockRelationForExtension(Relation relation, LOCKMODE lockmode);
 extern void UnlockRelationForExtension(Relation relation, LOCKMODE lockmode);
+extern bool ConditionalLockRelationForExtension(Relation relation,
+									LOCKMODE lockmode);
+extern int	RelationExtensionLockWaiterCount(Relation relation);
 
 /* Lock a page (currently only used within indexes) */
 extern void LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index b26427d..efa75ec 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -574,6 +574,8 @@ extern void RememberSimpleDeadLock(PGPROC *proc1,
 					   PGPROC *proc2);
 extern void InitDeadLockChecking(void);
 
+extern int	LockWaiterCount(const LOCKTAG *locktag);
+
 #ifdef LOCK_DEBUG
 extern void DumpLocks(PGPROC *proc);
 extern void DumpAllLocks(void);
#115Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#114)
1 attachment(s)
Re: Relation extension scalability

On Wed, Mar 30, 2016 at 7:51 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

+ if (lockWaiters)
+ /*
+ * Here we are using same freespace for all the Blocks, but that
+ * is Ok, because all are newly added blocks and have same freespace
+ * And even some block which we just added to FreespaceMap above, is
+ * used by some backend and now freespace is not same, will not harm
+ * anything, because actual freespace will be calculated by user
+ * after getting the page.
+ */
+ UpdateFreeSpaceMap(relation, firstBlock, blockNum, freespace);

Does this look good?

Done in a better way:

+ lockWaiters = RelationExtensionLockWaiterCount(relation);
+
+ if (lockWaiters == 0)
+ return;
+

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

multi_extend_v21.patch (application/octet-stream)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 8140418..b8feac9 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -169,6 +169,80 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
 }
 
 /*
+ * Extend a relation by multiple blocks to avoid future contention on the
+ * relation extension lock.  Our goal is to pre-extend the relation by an
+ * amount which ramps up as the degree of contention ramps up, but limiting
+ * the result to some sane overall value.
+ */
+static void
+RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
+{
+	Page		page;
+	BlockNumber	blockNum,
+				firstBlock;
+	int			extraBlocks = 0;
+	int			lockWaiters = 0;
+	Size		freespace = 0;
+	Buffer		buffer;
+
+	/*
+	 * We use the length of the lock wait queue to judge how much to extend.
+	 * It might seem like multiplying the number of lock waiters by as much
+	 * as 20 is too aggressive, but benchmarking revealed that smaller numbers
+	 * were insufficient.  512 is just an arbitrary cap to prevent pathological
+	 * results (and excessive wasted disk space).
+	 */
+	lockWaiters = RelationExtensionLockWaiterCount(relation);
+
+	if (lockWaiters == 0)
+		return;
+
+	extraBlocks = Min(512, lockWaiters * 20);
+
+	blockNum = firstBlock = InvalidBlockNumber;
+
+	while (extraBlocks-- >= 0)
+	{
+		/* Ouch - an unnecessary lseek() each time through the loop! */
+		buffer = ReadBufferBI(relation, P_NEW, bistate);
+
+		/* Extend by one page. */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		page = BufferGetPage(buffer);
+		PageInit(page, BufferGetPageSize(buffer), 0);
+		MarkBufferDirty(buffer);
+		blockNum = BufferGetBlockNumber(buffer);
+		freespace = PageGetHeapFreeSpace(page);
+		UnlockReleaseBuffer(buffer);
+
+		if (firstBlock == InvalidBlockNumber)
+			firstBlock = blockNum;
+
+		/*
+		 * Immediately update the bottom level of the FSM.  This has a good
+		 * chance of making this page visible to other concurrently inserting
+		 * backends, and we want that to happen without delay.
+		 */
+		RecordPageWithFreeSpace(relation, blockNum, freespace);
+	}
+
+	/*
+	 * Updating the upper levels of the free space map is too expensive
+	 * to do for every block, but it's worth doing once at the end to make
+	 * sure that subsequent insertion activity sees all of those nifty free
+	 * pages we just inserted.
+	 *
+	 * Here we use the same freespace value for all the blocks, which is
+	 * OK because they are all newly added blocks with the same amount of
+	 * free space.  Even if one of the blocks we just added to the FSM
+	 * above has already been used by some backend, so that its freespace
+	 * is no longer accurate, no harm is done: the actual freespace will
+	 * be recalculated by the next user of the block before using it.
+	 */
+	UpdateFreeSpaceMap(relation, firstBlock, blockNum, freespace);
+}
+
+/*
  * RelationGetBufferForTuple
  *
  *	Returns pinned and exclusive-locked buffer of a page in given relation
@@ -233,8 +307,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	bool		use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
 	Buffer		buffer = InvalidBuffer;
 	Page		page;
-	Size		pageFreeSpace,
-				saveFreeSpace;
+	Size		pageFreeSpace = 0,
+				saveFreeSpace = 0;
 	BlockNumber targetBlock,
 				otherBlock;
 	bool		needLock;
@@ -308,6 +382,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
 		}
 	}
 
+loop:
 	while (targetBlock != InvalidBlockNumber)
 	{
 		/*
@@ -440,10 +515,46 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	 */
 	needLock = !RELATION_IS_LOCAL(relation);
 
+	/*
+	 * If we need the lock but are not able to acquire it immediately, we'll
+	 * consider extending the relation by multiple blocks at a time to manage
+	 * contention on the relation extension lock.  However, this only makes
+	 * sense if we're using the FSM; otherwise, there's no point.
+	 */
 	if (needLock)
-		LockRelationForExtension(relation, ExclusiveLock);
+	{
+		if (!use_fsm)
+			LockRelationForExtension(relation, ExclusiveLock);
+		else if (!ConditionalLockRelationForExtension(relation, ExclusiveLock))
+		{
+			/* Couldn't get the lock immediately; wait for it. */
+			LockRelationForExtension(relation, ExclusiveLock);
+
+			/*
+			 * Check if some other backend has extended a block for us while
+			 * we were waiting on the lock.
+			 */
+			targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
+
+			/*
+			 * If some other waiter has already extended the relation, we
+			 * don't need to do so; just use the existing freespace.
+			 */
+			if (targetBlock != InvalidBlockNumber)
+			{
+				UnlockRelationForExtension(relation, ExclusiveLock);
+				goto loop;
+			}
+
+			/* Time to bulk-extend. */
+			RelationAddExtraBlocks(relation, bistate);
+		}
+	}
 
 	/*
+	 * In addition to whatever extension we performed above, we always add
+	 * at least one block to satisfy our own request.
+	 *
 	 * XXX This does an lseek - rather expensive - but at the moment it is the
 	 * only way to accurately determine how many blocks are in a relation.  Is
 	 * it worth keeping an accurate file length in shared memory someplace,
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 2631080..b2361e5 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -109,6 +109,8 @@ static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
 				   uint8 newValue, uint8 minValue);
 static BlockNumber fsm_search(Relation rel, uint8 min_cat);
 static uint8 fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof);
+static BlockNumber fsm_get_lastblckno(Relation rel, FSMAddress addr);
+static void fsm_update_recursive(Relation rel, FSMAddress addr, uint8 new_cat);
 
 
 /******** Public API ********/
@@ -189,6 +191,47 @@ RecordPageWithFreeSpace(Relation rel, BlockNumber heapBlk, Size spaceAvail)
 }
 
 /*
+ * Update the upper levels of the free space map all the way up to the root
+ * to make sure we don't lose track of new blocks we just inserted.  This is
+ * intended to be used after adding many new blocks to the relation; we judge
+ * it not worth updating the upper levels of the tree every time data for
+ * a single page changes, but for a bulk-extend it's worth it.
+ */
+void
+UpdateFreeSpaceMap(Relation rel, BlockNumber startBlkNum,
+					BlockNumber endBlkNum, Size freespace)
+{
+	int			new_cat = fsm_space_avail_to_cat(freespace);
+	FSMAddress	addr;
+	uint16		slot;
+	BlockNumber	blockNum;
+	BlockNumber	lastBlkOnPage;
+
+	blockNum = startBlkNum;
+
+	while (blockNum <= endBlkNum)
+	{
+		/*
+		 * Get the FSM address where this block resides and update the
+		 * FSM tree, starting from that FSM address and going all the way
+		 * up to the root.
+		 */
+		addr = fsm_get_location(blockNum, &slot);
+		fsm_update_recursive(rel, addr, new_cat);
+
+		/*
+		 * Get the last block number on this FSM page.  If that's greater
+		 * than or equal to our endBlkNum, we're done.  Otherwise, advance
+		 * to the first block on the next page.
+		 */
+		lastBlkOnPage = fsm_get_lastblckno(rel, addr);
+		if (lastBlkOnPage >= endBlkNum)
+			break;
+		blockNum = lastBlkOnPage + 1;
+	}
+}
+
+/*
  * XLogRecordPageWithFreeSpace - like RecordPageWithFreeSpace, for use in
  *		WAL replay
  */
@@ -788,3 +831,42 @@ fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof_p)
 
 	return max_avail;
 }
+
+/*
+ * Return the last heap block number whose FSM slot is stored on the
+ * given FSM page address.
+ */
+static BlockNumber
+fsm_get_lastblckno(Relation rel, FSMAddress addr)
+{
+	int			slot;
+
+	/*
+	 * Get the last slot number on the given page and convert that to a
+	 * heap block number.
+	 */
+	slot = SlotsPerFSMPage - 1;
+	return fsm_get_heap_blk(addr, slot);
+}
+
+/*
+ * Recursively update the FSM tree from the given address
+ * all the way up to the root.
+ */
+static void
+fsm_update_recursive(Relation rel, FSMAddress addr, uint8 new_cat)
+{
+	uint16		parentslot;
+	FSMAddress	parent;
+
+	if (addr.level == FSM_ROOT_LEVEL)
+		return;
+
+	/*
+	 * Get the parent page and our slot in the parent page, and
+	 * update the information in that.
+	 */
+	parent = fsm_get_parent(addr, &parentslot);
+	fsm_set_and_search(rel, parent, parentslot, new_cat, 0);
+	fsm_update_recursive(rel, parent, new_cat);
+}
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 0632fc0..7b08555 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -341,6 +341,41 @@ LockRelationForExtension(Relation relation, LOCKMODE lockmode)
 }
 
 /*
+ *		ConditionalLockRelationForExtension
+ *
+ * As above, but only lock if we can get the lock without blocking.
+ * Returns TRUE iff the lock was acquired.
+ */
+bool
+ConditionalLockRelationForExtension(Relation relation, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return (LockAcquire(&tag, lockmode, false, true) != LOCKACQUIRE_NOT_AVAIL);
+}
+
+/*
+ *		RelationExtensionLockWaiterCount
+ *
+ * Count the number of processes waiting for the given relation extension lock.
+ */
+int
+RelationExtensionLockWaiterCount(Relation relation)
+{
+	LOCKTAG		tag;
+
+	SET_LOCKTAG_RELATION_EXTEND(tag,
+								relation->rd_lockInfo.lockRelId.dbId,
+								relation->rd_lockInfo.lockRelId.relId);
+
+	return LockWaiterCount(&tag);
+}
+
+/*
  *		UnlockRelationForExtension
  */
 void
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index b30b7b1..41f6930 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -4380,3 +4380,40 @@ VirtualXactLock(VirtualTransactionId vxid, bool wait)
 	LockRelease(&tag, ShareLock, false);
 	return true;
 }
+
+/*
+ * LockWaiterCount
+ *
+ * Find the number of lock requesters on this locktag
+ */
+int
+LockWaiterCount(const LOCKTAG *locktag)
+{
+	LOCKMETHODID lockmethodid = locktag->locktag_lockmethodid;
+	LOCK	   *lock;
+	bool		found;
+	uint32		hashcode;
+	LWLock	   *partitionLock;
+	int			waiters = 0;
+
+	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
+		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
+
+	hashcode = LockTagHashCode(locktag);
+	partitionLock = LockHashPartitionLock(hashcode);
+	LWLockAcquire(partitionLock, LW_EXCLUSIVE);
+
+	lock = (LOCK *) hash_search_with_hash_value(LockMethodLockHash,
+												(const void *) locktag,
+												hashcode,
+												HASH_FIND,
+												&found);
+	if (found)
+	{
+		Assert(lock != NULL);
+		waiters = lock->nRequested;
+	}
+	LWLockRelease(partitionLock);
+
+	return waiters;
+}
diff --git a/src/include/storage/freespace.h b/src/include/storage/freespace.h
index 19dcb8d..16c052b 100644
--- a/src/include/storage/freespace.h
+++ b/src/include/storage/freespace.h
@@ -32,5 +32,9 @@ extern void XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk,
 
 extern void FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks);
 extern void FreeSpaceMapVacuum(Relation rel);
+extern void UpdateFreeSpaceMap(Relation rel,
+							BlockNumber firstBlkNum,
+							BlockNumber lastBlkNum,
+							Size freespace);
 
 #endif   /* FREESPACE_H_ */
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index 975b6f8..8288e7d 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -53,6 +53,9 @@ extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
 /* Lock a relation for extension */
 extern void LockRelationForExtension(Relation relation, LOCKMODE lockmode);
 extern void UnlockRelationForExtension(Relation relation, LOCKMODE lockmode);
+extern bool ConditionalLockRelationForExtension(Relation relation,
+									LOCKMODE lockmode);
+extern int	RelationExtensionLockWaiterCount(Relation relation);
 
 /* Lock a page (currently only used within indexes) */
 extern void LockPage(Relation relation, BlockNumber blkno, LOCKMODE lockmode);
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index b26427d..efa75ec 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -574,6 +574,8 @@ extern void RememberSimpleDeadLock(PGPROC *proc1,
 					   PGPROC *proc2);
 extern void InitDeadLockChecking(void);
 
+extern int	LockWaiterCount(const LOCKTAG *locktag);
+
 #ifdef LOCK_DEBUG
 extern void DumpLocks(PGPROC *proc);
 extern void DumpAllLocks(void);
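The UpdateFreeSpaceMap loop in the patch above visits each leaf FSM page only once, rather than once per heap block, by jumping to the first block past each page's coverage. A stand-alone sketch of that page-skipping walk (a hypothetical model, not the patch code; the real SlotsPerFSMPage depends on BLCKSZ and is replaced here by a small illustrative constant):

```python
SLOTS_PER_FSM_PAGE = 4  # illustrative; the real value is derived from BLCKSZ


def fsm_pages_visited(start_blk, end_blk):
    """Model of the UpdateFreeSpaceMap walk: update one FSM leaf page
    (one fsm_update_recursive call in the patch), then jump to the first
    heap block past that page's coverage."""
    visits = []
    blk = start_blk
    while blk <= end_blk:
        page = blk // SLOTS_PER_FSM_PAGE        # leaf page covering blk
        visits.append(page)
        last_on_page = (page + 1) * SLOTS_PER_FSM_PAGE - 1
        if last_on_page >= end_blk:
            break
        blk = last_on_page + 1
    return visits


# Extending blocks 5..13 touches 3 leaf pages, not 9.
print(fsm_pages_visited(5, 13))
```

So the number of recursive FSM updates scales with the number of FSM pages spanned, not with the number of blocks added.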
#116Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#110)
Re: Relation extension scalability

On Tue, Mar 29, 2016 at 10:08 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

Yes, that makes sense. One more point is that if the reason for v13
giving better performance is extra blocks (which we believe in certain
cases can leak till the time Vacuum updates the FSM tree), do you think it
makes sense to once test by increasing lockWaiters * 20 limit to may
be lockWaiters * 25 or lockWaiters * 30.

I tested COPY of 10000 records, increasing the number of blocks, to find
out why we are not as good as v13. With extraBlocks = Min(lockWaiters *
40, 2048) I got the results below:

COPY 10000
--------------------
Client    Patch (extraBlocks = Min(lockWaiters * 40, 2048))
------    ------
16        752
32        708

This proves that the main reason v13 is better is that it adds extra
blocks without any cap. Though v13 is still better than these results, I
think we can match it by changing the multiplier and the max limit.

But I think we are OK with the max size as 4MB (512 blocks), right?

Does this test make sense?
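For reference, the extension-sizing heuristic being tuned in this sub-thread can be modeled outside the server like this (a sketch, not the patch code; BLCKSZ assumed at its 8kB default):

```python
BLCKSZ = 8192  # default PostgreSQL block size in bytes


def extra_blocks(lock_waiters, multiplier=20, cap=512):
    """Model of the bulk-extension heuristic in RelationAddExtraBlocks:
    extend by min(cap, lock_waiters * multiplier) blocks, or not at all
    when nobody is waiting on the extension lock."""
    if lock_waiters <= 0:
        return 0
    return min(cap, lock_waiters * multiplier)


# With the patch's settings, 26 or more waiters already hit the 512-block
# (4MB) cap; the variant tested above (multiplier=40, cap=2048) allows
# pre-extension of up to 16MB.
for waiters in (1, 8, 16, 32):
    blocks = extra_blocks(waiters)
    print(waiters, blocks, round(blocks * BLCKSZ / (1024 * 1024), 2), "MB")
```

The trade-off under discussion is exactly the cap: a larger cap reduces extension-lock contention further but risks leaving more unused pages behind until vacuum updates the FSM.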

Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#117Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#116)
Re: Relation extension scalability

On Thu, Mar 31, 2016 at 10:29 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Mar 29, 2016 at 10:08 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

Yes, that makes sense. One more point is that if the reason for v13
giving better performance is extra blocks (which we believe in certain
cases can leak till the time Vacuum updates the FSM tree), do you think it
makes sense to once test by increasing lockWaiters * 20 limit to may
be lockWaiters * 25 or lockWaiters * 30.

I tested COPY 10000 record, by increasing the number of blocks just to
find out why we are not as good as V13
with extraBlocks = Min( lockWaiters * 40, 2048) and got below results..

COPY 10000
--------------------
Client Patch(extraBlocks = Min( lockWaiters * 40, 2048))
-------- ---------
16 752
32 708

This proves that main reason of v13 being better is its adding extra
blocks without control.
though v13 is better than these results, I think we can get that also by
changing multiplier and max limit .

But I think we are ok with the max size as 4MB (512 blocks) right?.

This shows that there is a performance increase of ~26% (599 to 752) at
16 clients if we raise the limit of the max extend size from 4MB to
16MB, which is a good boost, but I am not sure it is worth extending the
relation in cases where the newly added pages won't get used. I think it
should be okay to go with 4MB as the limit for now, and then if during
beta testing or in the future there are use cases where tuning this max
limit helps, we can come back to it.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#118Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#116)
Re: Relation extension scalability

On Thu, Mar 31, 2016 at 12:59 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Mar 29, 2016 at 10:08 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

Yes, that makes sense. One more point is that if the reason for v13
giving better performance is extra blocks (which we believe in certain cases
can leak till the time Vacuum updates the FSM tree), do you think it makes
sense to once test by increasing lockWaiters * 20 limit to may be
lockWaiters * 25 or lockWaiters * 30.

I tested COPY 10000 record, by increasing the number of blocks just to find
out why we are not as good as V13
with extraBlocks = Min( lockWaiters * 40, 2048) and got below results..

COPY 10000
--------------------
Client Patch(extraBlocks = Min( lockWaiters * 40, 2048))
-------- ---------
16 752
32 708

This proves that main reason of v13 being better is its adding extra blocks
without control.
though v13 is better than these results, I think we can get that also by
changing multiplier and max limit .

But I think we are ok with the max size as 4MB (512 blocks) right?

Yeah, kind of. But obviously if we could make the limit smaller
without hurting performance, that would be better.

Per my note yesterday about performance degradation with parallel
COPY, I wasn't able to demonstrate that this patch gives a consistent
performance benefit on hydra - the best result I got was speeding up a
9.5 minute load to 8 minutes where linear scalability would have been
2 minutes. And I found cases where it was actually slower with the
patch. Now maybe hydra is just a crap machine, but I'm feeling
nervous.

What machines did you use to test this? Have you tested really large
data loads, like many MB or even GB of data?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#119Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#118)
Re: Relation extension scalability

On Thu, Mar 31, 2016 at 4:28 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Mar 31, 2016 at 12:59 AM, Dilip Kumar <dilipbalaut@gmail.com>
wrote:

On Tue, Mar 29, 2016 at 10:08 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

Yes, that makes sense. One more point is that if the reason for v13
giving better performance is extra blocks (which we believe in certain
cases can leak till the time Vacuum updates the FSM tree), do you think it
makes sense to once test by increasing lockWaiters * 20 limit to may be
lockWaiters * 25 or lockWaiters * 30.

I tested COPY 10000 record, by increasing the number of blocks just to
find out why we are not as good as V13
with extraBlocks = Min( lockWaiters * 40, 2048) and got below results..

COPY 10000
--------------------
Client    Patch (extraBlocks = Min(lockWaiters * 40, 2048))
------    ------
16        752
32        708

This proves that main reason of v13 being better is its adding extra
blocks without control.
though v13 is better than these results, I think we can get that also by
changing multiplier and max limit.

But I think we are ok with the max size as 4MB (512 blocks) right?

Yeah, kind of. But obviously if we could make the limit smaller
without hurting performance, that would be better.

Per my note yesterday about performance degradation with parallel
COPY, I wasn't able to demonstrate that this patch gives a consistent
performance benefit on hydra - the best result I got was speeding up a
9.5 minute load to 8 minutes where linear scalability would have been
2 minutes.

Is this test for unlogged tables? As far as I understand this patch will
show benefit if Data and WAL are on separate disks or if you test them with
unlogged tables, otherwise the WAL contention supersedes the benefit of
this patch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#120Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#118)
Re: Relation extension scalability

On Thu, Mar 31, 2016 at 4:28 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Yeah, kind of. But obviously if we could make the limit smaller
without hurting performance, that would be better.

Per my note yesterday about performance degradation with parallel
COPY, I wasn't able to demonstrate that this patch gives a consistent
performance benefit on hydra - the best result I got was speeding up a
9.5 minute load to 8 minutes where linear scalability would have been
2 minutes. And I found cases where it was actually slower with the
patch. Now maybe hydra is just a crap machine, but I'm feeling
nervous.

I see the performance gain when either the complete data is on SSD, or
the data is on MD and the WAL is on SSD, or with unlogged tables.

What machines did you use to test this? Have you tested really large
data loads, like many MB or even GB of data?

With the INSERT script, a 2-minute run produces 18GB of data, and I am
running for 5-10 minutes, which means at least 85GB of data.
(Inserts 1000 1KB tuples in each transaction.)

With the COPY script, a 2-minute run produces 23GB of data, and I am
running for 5-10 minutes, which means at least 100GB of data.
(Inserts 10000 tuples in each transaction; tuple sizes are 1 to 5 bytes.)

Machine Details
-----------------------
I tested on an 8-socket NUMA machine with 64 cores:

[dilip.kumar@cthulhu ~]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 8
NUMA node(s): 8
Vendor ID: GenuineIntel
CPU family: 6
Model: 47
Model name: Intel(R) Xeon(R) CPU E7- 8830 @ 2.13GHz
Stepping: 2
CPU MHz: 1064.000
BogoMIPS: 4266.62

If you need any more information, please let me know.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#121Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#120)
Re: Relation extension scalability

On Thu, Mar 31, 2016 at 9:03 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

If you need some more information please let me know ?

I repeated the testing described in
/messages/by-id/CA+TgmoYoUQf9cGcpgyGNgZQHcY-gCcKRyAqQtDU8KFE4N6HVkA@mail.gmail.com
on a MacBook Pro (OS X 10.8.5, 2.4 GHz Intel Core i7, 8GB, early 2013)
and got the following results. Note that I did not adjust
*_flush_delay in this test because that's always 0, apparently, on
MacOS.

master, unlogged tables, 1 copy: 0m18.928s, 0m20.276s, 0m18.040s
patched, unlogged tables, 1 copy: 0m20.499s, 0m20.879s, 0m18.912s
master, unlogged tables, 4 parallel copies: 0m57.301s, 0m58.045s, 0m57.556s
patched, unlogged tables, 4 parallel copies: 0m47.994s, 0m45.586s, 0m44.440s

master, logged tables, 1 copy: 0m29.353s, 0m29.693s, 0m31.840s
patched, logged tables, 1 copy: 0m30.837s, 0m31.567s, 0m36.843s
master, logged tables, 4 parallel copies: 1m45.691s, 1m53.085s, 1m35.674s
patched, logged tables, 4 parallel copies: 1m21.137s, 1m20.678s, 1m22.419s

So the first thing here is that the patch seems to be a clear win in
this test. For a single copy, it seems to be pretty much a wash.
When running 4 copies in parallel, it is about 20-25% faster with both
logged and unlogged tables. The second thing that is interesting is
that we are getting super-linear scalability even without the patch:
if 1 copy takes 20 seconds, you might expect 4 to take 80 seconds, but
it really takes 60 unpatched or 45 patched. If 1 copy takes 30
seconds, you might expect 4 to take 120 seconds, but it really takes
105 unpatched or 80 patched. So we're not actually I/O constrained on
this test, I think, perhaps because this machine has an SSD.
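To make the scaling arithmetic above concrete, here is a small check of
those numbers (the single-copy times of 20s and 30s are rounded from the
figures quoted above):

```python
# Timings (seconds) from the results above: master branch,
# 1 copy vs. 4 parallel copies.
single_unlogged, parallel_unlogged = 20.0, 60.0
single_logged, parallel_logged = 30.0, 105.0

def throughput_scaling(t_single, t_parallel, n=4):
    # n parallel copies do n times the work of one copy; a factor of 1.0
    # would mean no benefit from concurrency, n would mean perfect scaling.
    return n * t_single / t_parallel

print(throughput_scaling(single_unlogged, parallel_unlogged))  # ~1.33
print(throughput_scaling(single_logged, parallel_logged))      # ~1.14
```

Both factors are above 1.0 even unpatched, which is the point being made:
the 4-way runs finish faster than four serialized runs would, so the test
is not fully serialized on I/O.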

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#122Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#121)
Re: Relation extension scalability

On Tue, Apr 5, 2016 at 10:02 AM, Robert Haas <robertmhaas@gmail.com> wrote:

So the first thing here is that the patch seems to be a clear win in
this test. For a single copy, it seems to be pretty much a wash.
When running 4 copies in parallel, it is about 20-25% faster with both
logged and unlogged tables. The second thing that is interesting is
that we are getting super-linear scalability even without the patch:
if 1 copy takes 20 seconds, you might expect 4 to take 80 seconds, but
it really takes 60 unpatched or 45 patched. If 1 copy takes 30
seconds, you might expect 4 to take 120 seconds, but it really takes
105 unpatched or 80 patched. So we're not actually I/O constrained on
this test, I think, perhaps because this machine has an SSD.

It's not unusual for COPY to not be I/O constrained, I believe.

--
Peter Geoghegan


#123Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#122)
Re: Relation extension scalability

On Tue, Apr 5, 2016 at 1:05 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Tue, Apr 5, 2016 at 10:02 AM, Robert Haas <robertmhaas@gmail.com> wrote:

So the first thing here is that the patch seems to be a clear win in
this test. For a single copy, it seems to be pretty much a wash.
When running 4 copies in parallel, it is about 20-25% faster with both
logged and unlogged tables. The second thing that is interesting is
that we are getting super-linear scalability even without the patch:
if 1 copy takes 20 seconds, you might expect 4 to take 80 seconds, but
it really takes 60 unpatched or 45 patched. If 1 copy takes 30
seconds, you might expect 4 to take 120 seconds, but it really takes
105 unpatched or 80 patched. So we're not actually I/O constrained on
this test, I think, perhaps because this machine has an SSD.

It's not unusual for COPY to not be I/O constrained, I believe.

Yeah. I've committed the patch now, with some cosmetic cleanup.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#124Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#123)
Re: Relation extension scalability

On Fri, Apr 8, 2016 at 11:38 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Yeah. I've committed the patch now, with some cosmetic cleanup.

Thanks Robert !!!

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com