patch: avoid heavyweight locking on hash metapage

Started by Robert Haas over 13 years ago · 7 messages
#1Robert Haas
robertmhaas@gmail.com
1 attachment(s)

I developed the attached patch to avoid taking a heavyweight lock on
the metapage of a hash index. Instead, an exclusive buffer content
lock is viewed as sufficient permission to modify the metapage, and a
shared buffer content lock is used when such modifications need to be
prevented. For the most part this is a trivial change, because we
were already taking these locks: we were just taking the heavyweight
locks in addition. The only sticking point is that, when we're
searching or inserting, we previously locked the bucket before
releasing the heavyweight metapage lock, which is unworkable when
holding only a buffer content lock because (1) we might deadlock and
(2) buffer content locks can't be held for long periods of time even
when there's no deadlock risk. To fix this, I implemented a simple
loop-and-retry system: we release the metapage content lock, acquire
the heavyweight lock on the target bucket, and then reacquire the
metapage content lock and check that the bucket mapping has not
changed. Normally it hasn't, and we're done. But if by chance it
has, we simply unlock the metapage, release the heavyweight lock we
acquired previously, lock the new bucket, and loop around again. Even
in the worst case we cannot loop very many times here, since we don't
split the same bucket again until we've split all the other buckets,
and 2^N gets big pretty fast.
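
For reference, the whole dance boils down to the loop below, condensed from
_hash_doinsert() and _hash_first() in the attached patch with explanatory
comments added:

    /*
     * Loop until we hold the heavyweight lock on the bucket that the
     * metapage currently maps our hash key to.
     */
    for (;;)
    {
        /* compute target bucket while holding the metapage content lock */
        bucket = _hash_hashkey2bucket(hashkey,
                                      metap->hashm_maxbucket,
                                      metap->hashm_highmask,
                                      metap->hashm_lowmask);
        blkno = BUCKET_TO_BLKNO(metap, bucket);

        /* release the metapage content lock, but keep the pin */
        _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);

        if (retry)
        {
            if (oldblkno == blkno)
                break;          /* mapping unchanged; we hold the right lock */
            _hash_droplock(rel, oldblkno, HASH_SHARE);
        }

        /* heavyweight bucket lock, taken with no content lock held */
        _hash_getlock(rel, blkno, HASH_SHARE);

        /* relock the metapage and recheck the mapping on the next pass */
        _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
        oldblkno = blkno;
        retry = true;
    }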

I tested the effect of this by setting up a series of 5-minute
read-only pgbench runs at scale factor 300 with 8GB of shared buffers
on the IBM POWER7 machine. For these runs, I dropped the primary
key constraint on pgbench_accounts (aid) and created a hash index on
that column instead. I ran each test three times and took the median
result. Here are the results on unpatched master, at various client
counts:

m01 tps = 9004.070422 (including connections establishing)
m04 tps = 34838.126542 (including connections establishing)
m08 tps = 70584.356826 (including connections establishing)
m16 tps = 128726.248198 (including connections establishing)
m32 tps = 123639.248172 (including connections establishing)
m64 tps = 104650.296143 (including connections establishing)
m80 tps = 88412.736416 (including connections establishing)

And here are the results with the patch:

h01 tps = 9110.561413 (including connections establishing) [+1.2%]
h04 tps = 36012.787524 (including connections establishing) [+3.4%]
h08 tps = 72606.302993 (including connections establishing) [+2.9%]
h16 tps = 141938.762793 (including connections establishing) [+10%]
h32 tps = 205325.232316 (including connections establishing) [+66%]
h64 tps = 274156.881975 (including connections establishing) [+162%]
h80 tps = 291224.012066 (including connections establishing) [+229%]

Obviously, even with this change, there's a lot not to like about hash
indexes: they still won't be crash-safe, and they still won't perform
as well under high concurrency as btree indexes. But neither of those
problems seems like a good reason not to fix this problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

hash-avoid-heavyweight-metapage-locks-v1.patch (application/octet-stream)
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index cd4e058..4082581 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -132,15 +132,6 @@ long-term locking since there is a (small) risk of deadlock, which we must
 be able to detect.  Buffer context locks are used for short-term access
 control to individual pages of the index.
 
-We define the following lmgr locks for a hash index:
-
-LockPage(rel, 0) represents the right to modify the hash-code-to-bucket
-mapping.  A process attempting to enlarge the hash table by splitting a
-bucket must exclusive-lock this lock before modifying the metapage data
-representing the mapping.  Processes intending to access a particular
-bucket must share-lock this lock until they have acquired lock on the
-correct target bucket.
-
 LockPage(rel, page), where page is the page number of a hash bucket page,
 represents the right to split or compact an individual bucket.  A process
 splitting a bucket must exclusive-lock both old and new halves of the
@@ -150,7 +141,10 @@ insertions must share-lock the bucket they are scanning or inserting into.
 (It is okay to allow concurrent scans and insertions.)
 
 The lmgr lock IDs corresponding to overflow pages are currently unused.
-These are available for possible future refinements.
+These are available for possible future refinements.  LockPage(rel, 0)
+is also currently undefined (it was previously used to represent the right
+to modify the hash-code-to-bucket mapping, but it is no longer needed for
+that purpose).
 
 Note that these lock definitions are conceptually distinct from any sort
 of lock on the pages whose numbers they share.  A process must also obtain
@@ -165,9 +159,7 @@ hash index code, since a process holding one of these locks could block
 waiting for an unrelated lock held by another process.  If that process
 then does something that requires exclusive lock on the bucket, we have
 deadlock.  Therefore the bucket locks must be lmgr locks so that deadlock
-can be detected and recovered from.  This also forces the page-zero lock
-to be an lmgr lock, because as we'll see below it is held while attempting
-to acquire a bucket lock, and so it could also participate in a deadlock.
+can be detected and recovered from.
 
 Processes must obtain read (share) buffer context lock on any hash index
 page while reading it, and write (exclusive) lock while modifying it.
@@ -195,12 +187,14 @@ track of available overflow pages.
 
 The reader algorithm is:
 
-	share-lock page 0 (to prevent active split)
 	read/sharelock meta page
-	compute bucket number for target hash key
-	release meta page
-	share-lock bucket page (to prevent split/compact of this bucket)
-	release page 0 share-lock
+    loop:
+		compute bucket number for target hash key
+		release meta page
+		if (correct bucket page is already locked)
+			break
+		release any existing bucket page lock (if a concurrent split happened)
+		share-lock bucket page
 -- then, per read request:
 	read/sharelock current page of bucket
 		step to next page if necessary (no chaining of locks)
@@ -209,10 +203,12 @@ The reader algorithm is:
 -- at scan shutdown:
 	release bucket share-lock
 
-By holding the page-zero lock until lock on the target bucket is obtained,
-the reader ensures that the target bucket calculation is valid (otherwise
-the bucket might be split before the reader arrives at it, and the target
-entries might go into the new bucket).  Holding the bucket sharelock for
+We can't hold the metapage lock while acquiring a lock on the target bucket,
+because that might result in an undetected deadlock (lwlocks do not participate
+in deadlock detection).  Instead, we relock the metapage after acquiring the
+bucket page lock and check whether the bucket has been split.  If not, we're
+done.  If so, we release our previously-acquired lock and repeat the process
+using the new bucket number.  Holding the bucket sharelock for
 the remainder of the scan prevents the reader's current-tuple pointer from
 being invalidated by splits or compactions.  Notice that the reader's lock
 does not prevent other buckets from being split or compacted.
@@ -229,12 +225,14 @@ as it was before.
 
 The insertion algorithm is rather similar:
 
-	share-lock page 0 (to prevent active split)
 	read/sharelock meta page
-	compute bucket number for target hash key
-	release meta page
-	share-lock bucket page (to prevent split/compact of this bucket)
-	release page 0 share-lock
+    loop:
+		compute bucket number for target hash key
+		release meta page
+		if (correct bucket page is already locked)
+			break
+		release any existing bucket page lock (if a concurrent split happened)
+		share-lock bucket page
 -- (so far same as reader)
 	read/exclusive-lock current page of bucket
 	if full, release, read/exclusive-lock next page; repeat as needed
@@ -285,10 +283,9 @@ existing bucket in two, thereby lowering the fill ratio:
 	>> see below about acquiring needed extra space
 	Release X-locks of old and new buckets
 
-Note the page zero and metapage locks are not held while the actual tuple
-rearrangement is performed, so accesses to other buckets can proceed in
-parallel; in fact, it's possible for multiple bucket splits to proceed
-in parallel.
+Note the metapage lock is not held while the actual tuple rearrangement is
+performed, so accesses to other buckets can proceed in parallel; in fact,
+it's possible for multiple bucket splits to proceed in parallel.
 
 Split's attempt to X-lock the old bucket number could fail if another
 process holds S-lock on it.  We do not want to wait if that happens, first
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 66084f4..c534372 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -32,6 +32,8 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	Buffer		metabuf;
 	HashMetaPage metap;
 	BlockNumber blkno;
+	BlockNumber oldblkno = InvalidBlockNumber;
+	bool		retry = false;
 	Page		page;
 	HashPageOpaque pageopaque;
 	Size		itemsz;
@@ -49,12 +51,6 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	itemsz = MAXALIGN(itemsz);	/* be safe, PageAddItem will do this but we
 								 * need to be consistent */
 
-	/*
-	 * Acquire shared split lock so we can compute the target bucket safely
-	 * (see README).
-	 */
-	_hash_getlock(rel, 0, HASH_SHARE);
-
 	/* Read the metapage */
 	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
 	metap = HashPageGetMeta(BufferGetPage(metabuf));
@@ -75,24 +71,44 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 			errhint("Values larger than a buffer page cannot be indexed.")));
 
 	/*
-	 * Compute the target bucket number, and convert to block number.
+	 * Loop until we get a lock on the correct target bucket.
 	 */
-	bucket = _hash_hashkey2bucket(hashkey,
-								  metap->hashm_maxbucket,
-								  metap->hashm_highmask,
-								  metap->hashm_lowmask);
+	for (;;)
+	{
+		/*
+		 * Compute the target bucket number, and convert to block number.
+		 */
+		bucket = _hash_hashkey2bucket(hashkey,
+									  metap->hashm_maxbucket,
+									  metap->hashm_highmask,
+									  metap->hashm_lowmask);
 
-	blkno = BUCKET_TO_BLKNO(metap, bucket);
+		blkno = BUCKET_TO_BLKNO(metap, bucket);
 
-	/* release lock on metapage, but keep pin since we'll need it again */
-	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+		/* Release metapage lock, but keep pin. */
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
 
-	/*
-	 * Acquire share lock on target bucket; then we can release split lock.
-	 */
-	_hash_getlock(rel, blkno, HASH_SHARE);
+		/*
+		 * If the previous iteration of this loop locked what is still the
+		 * correct target bucket, we are done.  Otherwise, drop any old lock
+		 * and lock what now appears to be the correct bucket.
+		 */
+		if (retry)
+		{
+			if (oldblkno == blkno)
+				break;
+			_hash_droplock(rel, oldblkno, HASH_SHARE);
+		}
+		_hash_getlock(rel, blkno, HASH_SHARE);
 
-	_hash_droplock(rel, 0, HASH_SHARE);
+		/*
+		 * Reacquire metapage lock and check that no bucket split has taken
+		 * place while we were awaiting the bucket lock.
+		 */
+		_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+		oldblkno = blkno;
+		retry = true;
+	}
 
 	/* Fetch the primary bucket page for the bucket */
 	buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 6b647a8..c0b6eb0 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -57,9 +57,9 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
 /*
  * _hash_getlock() -- Acquire an lmgr lock.
  *
- * 'whichlock' should be zero to acquire the split-control lock, or the
- * block number of a bucket's primary bucket page to acquire the per-bucket
- * lock.  (See README for details of the use of these locks.)
+ * 'whichlock' should be the block number of a bucket's primary bucket page to
+ * acquire the per-bucket lock.  (See README for details of the use of these
+ * locks.)
  *
  * 'access' must be HASH_SHARE or HASH_EXCLUSIVE.
  */
@@ -507,21 +507,9 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	uint32		lowmask;
 
 	/*
-	 * Obtain the page-zero lock to assert the right to begin a split (see
-	 * README).
-	 *
-	 * Note: deadlock should be impossible here. Our own backend could only be
-	 * holding bucket sharelocks due to stopped indexscans; those will not
-	 * block other holders of the page-zero lock, who are only interested in
-	 * acquiring bucket sharelocks themselves.	Exclusive bucket locks are
-	 * only taken here and in hashbulkdelete, and neither of these operations
-	 * needs any additional locks to complete.	(If, due to some flaw in this
-	 * reasoning, we manage to deadlock anyway, it's okay to error out; the
-	 * index will be left in a consistent state.)
+	 * Write-lock the meta page.  It used to be necessary to acquire a
+	 * heavyweight lock to begin a split, but that is no longer required.
 	 */
-	_hash_getlock(rel, 0, HASH_EXCLUSIVE);
-
-	/* Write-lock the meta page */
 	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
 
 	_hash_checkpage(rel, metabuf, LH_META_PAGE);
@@ -571,6 +559,12 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	if (_hash_has_active_scan(rel, old_bucket))
 		goto fail;
 
+	/*
+	 * It's normally a bad idea to grab a heavyweight lock while holding
+	 * a buffer content lock, both because of deadlock risk and because
+	 * content locks should be held only briefly.  But since we are only
+	 * trylocking here it should be OK.
+	 */
 	if (!_hash_try_getlock(rel, start_oblkno, HASH_EXCLUSIVE))
 		goto fail;
 
@@ -587,6 +581,12 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	if (_hash_has_active_scan(rel, new_bucket))
 		elog(ERROR, "scan in progress on supposedly new bucket");
 
+	/*
+	 * It's normally a bad idea to grab a heavyweight lock while holding
+	 * a buffer content lock, both because of deadlock risk and because
+	 * content locks should be held only briefly.  But since we are only
+	 * trylocking here it should be OK.
+	 */
 	if (!_hash_try_getlock(rel, start_nblkno, HASH_EXCLUSIVE))
 		elog(ERROR, "could not get lock on supposedly new bucket");
 
@@ -663,9 +663,6 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	/* Write out the metapage and drop lock, but keep pin */
 	_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
 
-	/* Release split lock; okay for other splits to occur now */
-	_hash_droplock(rel, 0, HASH_EXCLUSIVE);
-
 	/* Relocate records to the new bucket */
 	_hash_splitbucket(rel, metabuf, old_bucket, new_bucket,
 					  start_oblkno, start_nblkno,
@@ -682,9 +679,6 @@ fail:
 
 	/* We didn't write the metapage, so just drop lock */
 	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
-
-	/* Release split lock */
-	_hash_droplock(rel, 0, HASH_EXCLUSIVE);
 }
 
 
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 88a2ad1..b8abc6a 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -125,6 +125,8 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	uint32		hashkey;
 	Bucket		bucket;
 	BlockNumber blkno;
+	BlockNumber oldblkno = InvalidBlockNumber;
+	bool		retry = false;
 	Buffer		buf;
 	Buffer		metabuf;
 	Page		page;
@@ -184,35 +186,52 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 
 	so->hashso_sk_hash = hashkey;
 
-	/*
-	 * Acquire shared split lock so we can compute the target bucket safely
-	 * (see README).
-	 */
-	_hash_getlock(rel, 0, HASH_SHARE);
-
 	/* Read the metapage */
 	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
 	metap = HashPageGetMeta(BufferGetPage(metabuf));
 
 	/*
-	 * Compute the target bucket number, and convert to block number.
+	 * Loop until we get a lock on the correct target bucket.
 	 */
-	bucket = _hash_hashkey2bucket(hashkey,
-								  metap->hashm_maxbucket,
-								  metap->hashm_highmask,
-								  metap->hashm_lowmask);
-
-	blkno = BUCKET_TO_BLKNO(metap, bucket);
+	for (;;)
+	{
+		/*
+		 * Compute the target bucket number, and convert to block number.
+		 */
+		bucket = _hash_hashkey2bucket(hashkey,
+									  metap->hashm_maxbucket,
+									  metap->hashm_highmask,
+									  metap->hashm_lowmask);
+
+		blkno = BUCKET_TO_BLKNO(metap, bucket);
+
+		/* Release metapage lock, but keep pin. */
+		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+		/*
+		 * If the previous iteration of this loop locked what is still the
+		 * correct target bucket, we are done.  Otherwise, drop any old lock
+		 * and lock what now appears to be the correct bucket.
+		 */
+		if (retry)
+		{
+			if (oldblkno == blkno)
+				break;
+			_hash_droplock(rel, oldblkno, HASH_SHARE);
+		}
+		_hash_getlock(rel, blkno, HASH_SHARE);
+
+		/*
+		 * Reacquire metapage lock and check that no bucket split has taken
+		 * place while we were awaiting the bucket lock.
+		 */
+		_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+		oldblkno = blkno;
+		retry = true;
+	}
 
 	/* done with the metapage */
-	_hash_relbuf(rel, metabuf);
-
-	/*
-	 * Acquire share lock on target bucket; then we can release split lock.
-	 */
-	_hash_getlock(rel, blkno, HASH_SHARE);
-
-	_hash_droplock(rel, 0, HASH_SHARE);
+	_hash_dropbuf(rel, metabuf);
 
 	/* Update scan opaque state to show we have lock on the bucket */
 	so->hashso_bucket = bucket;
#2Dickson S. Guedes
listas@guedesoft.net
In reply to: Robert Haas (#1)
Re: patch: avoid heavyweight locking on hash metapage

2012/5/30 Robert Haas <robertmhaas@gmail.com>:

I tested the effect of this by setting up a series of 5-minute
read-only pgbench runs at scale factor 300 with 8GB of shared buffers
on the IBM POWER7 machine.

I know it doesn't matter, but out of curiosity, what OS did you use?

best regards,
--
Dickson S. Guedes
mail/xmpp: guedes@guedesoft.net - skype: guediz
http://guedesoft.net - http://www.postgresql.org.br

#3Robert Haas
robertmhaas@gmail.com
In reply to: Dickson S. Guedes (#2)
Re: patch: avoid heavyweight locking on hash metapage

On Thu, May 31, 2012 at 7:07 AM, Dickson S. Guedes <listas@guedesoft.net> wrote:

2012/5/30 Robert Haas <robertmhaas@gmail.com>:

I tested the effect of this by setting up a series of 5-minute
read-only pgbench runs at scale factor 300 with 8GB of shared buffers
on the IBM POWER7 machine.

I know it doesn't matter, but out of curiosity, what OS did you use?

Fedora 16, Linux 3.2.6-3.fc16.ppc64

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4Jeff Janes
jeff.janes@gmail.com
In reply to: Robert Haas (#1)
Re: patch: avoid heavyweight locking on hash metapage

On Wed, May 30, 2012 at 3:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I developed the attached patch to avoid taking a heavyweight lock on
the metapage of a hash index.  Instead, an exclusive buffer content
lock is viewed as sufficient permission to modify the metapage, and a
shared buffer content lock is used when such modifications need to be
prevented.  For the most part this is a trivial change, because we
were already taking these locks: we were just taking the heavyweight
locks in addition.  The only sticking point is that, when we're
searching or inserting, we previously locked the bucket before
releasing the heavyweight metapage lock, which is unworkable when
holding only a buffer content lock because (1) we might deadlock and
(2) buffer content locks can't be held for long periods of time even
when there's no deadlock risk.  To fix this, I implemented a simple
loop-and-retry system: we release the metapage content lock, acquire
the heavyweight lock on the target bucket, and then reacquire the
metapage content lock and check that the bucket mapping has not
changed.   Normally it hasn't, and we're done.  But if by chance it
has, we simply unlock the metapage, release the heavyweight lock we
acquired previously, lock the new bucket, and loop around again.  Even
in the worst case we cannot loop very many times here, since we don't
split the same bucket again until we've split all the other buckets,
and 2^N gets big pretty fast.

Do we need the retry flag (applies to two places)? If oldblkno is
still InvalidBlockNumber then it can't equal blkno.
I think the extra variable might be clearer than the magic value, but
we already have the magic value so do we want to have both a flag
variable and a magic value?

+       if (retry)
+       {
+           if (oldblkno == blkno)
+               break;
+           _hash_droplock(rel, oldblkno, HASH_SHARE);
+       }
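
For what it's worth, a flag-free version might look something like the
sketch below. It is only an illustration of the idea, and it leans on
oldblkno being initialized to InvalidBlockNumber and on BUCKET_TO_BLKNO()
never producing that value:

    /*
     * Hypothetical flag-free variant: oldblkno doubles as the
     * 'have we been around the loop before' indicator.
     */
    if (oldblkno == blkno)
        break;                  /* can only happen on a retry */
    if (BlockNumberIsValid(oldblkno))
        _hash_droplock(rel, oldblkno, HASH_SHARE);
    _hash_getlock(rel, blkno, HASH_SHARE);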

In the README, the pseudocode probably needs to mention re-locking
the meta page in the loop.

Also, "page" is used to mean either the disk page or the shared buffer
currently holding that page, depending on context. This is confusing.
Maybe we should clarify "Lock the meta page buffer". Of course this
gripe precedes your patch and applies to other parts of the code as
well, but since we mingle LW locks (on buffers) and heavy locks (on
names of disk pages) it might make sense to be more meticulous here.

"exclusive-lock page 0 (assert the right to begin a split)" is no
longer true, nor is "release X-lock on page 0"

Also in the README, section "To prevent deadlock we enforce these
coding rules:" would need to be changed as those rules are being
changed. But, should we change them at all?

In _hash_expandtable, the claim "But since we are only trylocking here
it should be OK" doesn't completely satisfy me. Even a conditional
heavy-lock acquire still takes a transient non-conditional LW Lock on
the lock manager partition. Unless there is a global rule that no one
can take a buffer content lock while holding a lock manager partition
lock, this seems dangerous. Could this be redone to follow the
pattern of heavy locking with no content lock, then reacquiring the
buffer content lock to check to make sure we locked the correct
things? I don't know if it would be better to loop, or just give up,
if the meta page changed underneath us.
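
Very roughly, the shape I have in mind is something like the following
sketch, where saved_maxbucket stands in for a copy of metap->hashm_maxbucket
taken before the content lock was released (the real code would no doubt
need to recheck more than that):

    /*
     * Sketch only: take the heavyweight bucket locks while holding no
     * buffer content lock, then relock the metapage and verify that the
     * bucket mapping is still what we computed.
     */
    _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);  /* not dirty yet */

    if (!_hash_try_getlock(rel, start_oblkno, HASH_EXCLUSIVE))
        return;                 /* no split this time around */
    if (!_hash_try_getlock(rel, start_nblkno, HASH_EXCLUSIVE))
        elog(ERROR, "could not get lock on supposedly new bucket");

    /* relock the metapage and confirm the mapping is still what we computed */
    _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
    if (metap->hashm_maxbucket != saved_maxbucket)
    {
        /* a concurrent split intervened: drop locks and give up (or loop) */
        _hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
        _hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
        _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
        return;
    }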

Cheers,

Jeff

#5Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Janes (#4)
1 attachment(s)
Re: patch: avoid heavyweight locking on hash metapage

On Fri, Jun 15, 2012 at 1:58 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

Do we need the retry flag (applies to two places)?  If oldblkno is
still InvalidBlockNumber then it can't equal blkno.

I was a bit concerned that blkno = BUCKET_TO_BLKNO(metap, bucket)
might set blkno to InvalidBlockNumber, thus possibly messing up the
algorithm. This way seemed better in terms of not requiring any
assumption in that area.

In the README, the pseudocode probably needs to mention re-locking
the meta page in the loop.

Good point. Fixed.

Also, "page" is used to mean either the disk page or the shared buffer
currently holding that page, depending on context.  This is confusing.
 Maybe we should clarify "Lock the meta page buffer".  Of course this
gripe precedes your patch and applies to other parts of the code as
well, but since we mingle LW locks (on buffers) and heavy locks (on
names of disk pages) it might make sense to be more meticulous here.

I attempted to improve this somewhat in the README, but I think it may
be more than I can do completely.

"exclusive-lock page 0 (assert the right to begin a split)" is no
longer true, nor is "release X-lock on page 0"

I think this is fixed.

Also in the README, section "To prevent deadlock we enforce these
coding rules:" would need to be changed as those rules are being
changed.  But, should we change them at all?

In _hash_expandtable, the claim "But since we are only trylocking here
it should be OK" doesn't completely satisfy me.  Even a conditional
heavy-lock acquire still takes a transient non-conditional LW Lock on
the lock manager partition.  Unless there is a global rule that no one
can take a buffer content lock while holding a lock manager partition
lock, this seems dangerous.  Could this be redone to follow the
pattern of heavy locking with no content lock, then reacquiring the
buffer content lock to check to make sure we locked the correct
things?  I don't know if it would be better to loop, or just give up,
if the meta page changed underneath us.

Hmm. That was actually a gloss I added on existing code to try to
convince myself that it was safe; I don't think that the changes I
made make that any more or less safe than it was before. But the
question of whether or not it is safe is an interesting one. I do
believe that your statement is true, though: lock manager locks - or
backend locks, for the fast-path - should be the last thing we lock.
In some cases we lock more than one lock manager partition lock, but
we never lock anything else afterwards, AFAIK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

hash-avoid-heavyweight-metapage-locks-v2.patch (application/octet-stream)
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 4082581..2652cae 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -187,19 +187,21 @@ track of available overflow pages.
 
 The reader algorithm is:
 
-	read/sharelock meta page
-    loop:
+	pin meta page and take buffer content lock in shared mode
+	loop:
 		compute bucket number for target hash key
-		release meta page
+		release meta page buffer content lock
 		if (correct bucket page is already locked)
 			break
 		release any existing bucket page lock (if a concurrent split happened)
-		share-lock bucket page
+		take heavyweight bucket lock
+		retake meta page buffer content lock in shared mode
 -- then, per read request:
-	read/sharelock current page of bucket
+	release pin on metapage
+	read current page of bucket and take shared buffer content lock
 		step to next page if necessary (no chaining of locks)
 	get tuple
-	release current page
+	release buffer content lock and pin on current page
 -- at scan shutdown:
 	release bucket share-lock
 
@@ -225,24 +227,26 @@ as it was before.
 
 The insertion algorithm is rather similar:
 
-	read/sharelock meta page
-    loop:
+	pin meta page and take buffer content lock in shared mode
+	loop:
 		compute bucket number for target hash key
-		release meta page
+		release meta page buffer content lock
 		if (correct bucket page is already locked)
 			break
 		release any existing bucket page lock (if a concurrent split happened)
-		share-lock bucket page
+		take heavyweight bucket lock in shared mode
+		retake meta page buffer content lock in shared mode
 -- (so far same as reader)
-	read/exclusive-lock current page of bucket
+	release pin on metapage
+	pin current page of bucket and take exclusive buffer content lock
 	if full, release, read/exclusive-lock next page; repeat as needed
 	>> see below if no space in any page of bucket
 	insert tuple at appropriate place in page
-	write/release current page
-	release bucket share-lock
-	read/exclusive-lock meta page
+	mark current page dirty and release buffer content lock and pin
+	release heavyweight share-lock
+	pin meta page and take buffer content lock in shared mode
 	increment tuple count, decide if split needed
-	write/release meta page
+	mark meta page dirty and release buffer content lock and pin
 	done if no split needed, else enter Split algorithm below
 
 To speed searches, the index entries within any individual index page are
@@ -267,17 +271,15 @@ index is overfull (has a higher-than-wanted ratio of tuples to buckets).
 The algorithm attempts, but does not necessarily succeed, to split one
 existing bucket in two, thereby lowering the fill ratio:
 
-	exclusive-lock page 0 (assert the right to begin a split)
-	read/exclusive-lock meta page
+	pin meta page and take buffer content lock in exclusive mode
 	check split still needed
-	if split not needed anymore, drop locks and exit
+	if split not needed anymore, drop buffer content lock and pin and exit
 	decide which bucket to split
 	Attempt to X-lock old bucket number (definitely could fail)
 	Attempt to X-lock new bucket number (shouldn't fail, but...)
-	if above fail, drop locks and exit
+	if above fail, drop locks and pin and exit
 	update meta page to reflect new number of buckets
-	write/release meta page
-	release X-lock on page 0
+	mark meta page dirty and release buffer content lock and pin
 	-- now, accesses to all other buckets can proceed.
 	Perform actual split of bucket, moving tuples as needed
 	>> see below about acquiring needed extra space
@@ -313,20 +315,20 @@ go-round.
 The fourth operation is garbage collection (bulk deletion):
 
 	next bucket := 0
-	read/sharelock meta page
+	pin metapage and take buffer content lock in exclusive mode
 	fetch current max bucket number
-	release meta page
+	release meta page buffer content lock and pin
 	while next bucket <= max bucket do
 		Acquire X lock on target bucket
 		Scan and remove tuples, compact free space as needed
 		Release X lock
 		next bucket ++
 	end loop
-	exclusive-lock meta page
+	pin metapage and take buffer content lock in exclusive mode
 	check if number of buckets changed
-	if so, release lock and return to for-each-bucket loop
+	if so, release content lock and pin and return to for-each-bucket loop
 	else update metapage tuple count
-	write/release meta page
+	mark meta page dirty and release buffer content lock and pin
 
 Note that this is designed to allow concurrent splits.  If a split occurs,
 tuples relocated into the new bucket will be visited twice by the scan,
@@ -357,25 +359,25 @@ overflow page to the free pool.
 
 Obtaining an overflow page:
 
-	read/exclusive-lock meta page
+	take metapage content lock in exclusive mode
 	determine next bitmap page number; if none, exit loop
-	release meta page lock
-	read/exclusive-lock bitmap page
+	release meta page content lock
+	pin bitmap page and take content lock in exclusive mode
 	search for a free page (zero bit in bitmap)
 	if found:
 		set bit in bitmap
-		write/release bitmap page
-		read/exclusive-lock meta page
+		mark bitmap page dirty and release content lock
+		take metapage buffer content lock in exclusive mode
 		if first-free-bit value did not change,
-			update it and write meta page
-		release meta page
+			update it and mark meta page dirty
+		release meta page buffer content lock
 		return page number
 	else (not found):
-	release bitmap page
+	release bitmap page buffer content lock
 	loop back to try next bitmap page, if any
 -- here when we have checked all bitmap pages; we hold meta excl. lock
 	extend index to add another overflow page; update meta information
-	write/release meta page
+	mark meta page dirty and release buffer content lock
 	return page number
 
 It is slightly annoying to release and reacquire the metapage lock
@@ -425,17 +427,17 @@ algorithm is:
 
 	delink overflow page from bucket chain
 	(this requires read/update/write/release of fore and aft siblings)
-	read/share-lock meta page
+	pin meta page and take buffer content lock in shared mode
 	determine which bitmap page contains the free space bit for page
-	release meta page
-	read/exclusive-lock bitmap page
+	release meta page buffer content lock
+	pin bitmap page and take buffer content lock in exclusive mode
 	update bitmap bit
-	write/release bitmap page
+	mark bitmap page dirty and release buffer content lock and pin
 	if page number is less than what we saw as first-free-bit in meta:
-	read/exclusive-lock meta page
+	retake meta page buffer content lock in exclusive mode
 	if page number is still less than first-free-bit,
-		update first-free-bit field and write meta page
-	release meta page
+		update first-free-bit field and mark meta page dirty
+	release meta page buffer content lock and pin
 
 We have to do it this way because we must clear the bitmap bit before
 changing the first-free-bit field (hashm_firstfree).  It is possible that
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index c0b6eb0..3d067e7 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -559,12 +559,6 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	if (_hash_has_active_scan(rel, old_bucket))
 		goto fail;
 
-	/*
-	 * It's normally a bad idea to grab a heavyweight lock while holding
-	 * a buffer content lock, both because of deadlock risk and because
-	 * content locks should be held only briefly.  But since we are only
-	 * trylocking here it should be OK.
-	 */
 	if (!_hash_try_getlock(rel, start_oblkno, HASH_EXCLUSIVE))
 		goto fail;
 
@@ -581,12 +575,6 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	if (_hash_has_active_scan(rel, new_bucket))
 		elog(ERROR, "scan in progress on supposedly new bucket");
 
-	/*
-	 * It's normally a bad idea to grab a heavyweight lock while holding
-	 * a buffer content lock, both because of deadlock risk and because
-	 * content locks should be held only briefly.  But since we are only
-	 * trylocking here it should be OK.
-	 */
 	if (!_hash_try_getlock(rel, start_nblkno, HASH_EXCLUSIVE))
 		elog(ERROR, "could not get lock on supposedly new bucket");
 
#6Jeff Janes
jeff.janes@gmail.com
In reply to: Robert Haas (#5)
Re: patch: avoid heavyweight locking on hash metapage

On Mon, Jun 18, 2012 at 5:42 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Hmm.  That was actually a gloss I added on existing code to try to
convince myself that it was safe; I don't think that the changes I
made make that any more or less safe than it was before.

Right, sorry. I thought there was some strength reduction going on
there as well.

Thanks for the various explanations, they address my concerns. I see
that v2 applies over v1.

I've verified the performance improvements using 8 cores with my proposed
pgbench -P benchmark, at a scale that fits in shared_buffers.
It brings hash index performance most of the way, but not quite all the
way, up to btree performance.

I've marked this as ready for committer.

Cheers,

Jeff

#7Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Janes (#6)
Re: patch: avoid heavyweight locking on hash metapage

On Mon, Jun 25, 2012 at 11:05 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Mon, Jun 18, 2012 at 5:42 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Hmm.  That was actually a gloss I added on existing code to try to
convince myself that it was safe; I don't think that the changes I
made make that any more or less safe than it was before.

Right, sorry.  I thought there was some strength reduction going on
there as well.

Thanks for the various explanations, they address my concerns.  I see
that v2 applies over v1.

I've verified the performance improvements using 8 cores with my proposed
pgbench -P benchmark, at a scale that fits in shared_buffers.
It brings hash index performance most of the way, but not quite all the
way, up to btree performance.

I've marked this as ready for committer.

Thanks for the review; committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company