Hash Indexes
To make hash indexes usable in production systems, we need to improve their
concurrency and make them crash-safe by WAL-logging them. The first problem
I would like to tackle is improving the concurrency of hash indexes. The
first advantage I see in improving concurrency is that hash indexes have
the potential to outperform btree for "equal to" searches (with the WIP
patch attached to this mail, I could see a hash index outperform a btree
index by 20 to 30% for the very simple cases mentioned later in this
e-mail). Another advantage, as Robert explained earlier [1], is that if we
remove the heavy-weight locks under which we currently perform an
arbitrarily large number of operations, it becomes possible to WAL-log the
index sensibly. With this patch, I would also like to make hash indexes
capable of completing incomplete splits, which can occur due to interrupts
(like cancel), errors, or crashes.
I have studied the concurrency problems of hash indexes and some of the
solutions previously proposed for them, and based on that I have come up
with the solution below, which draws on an idea by Robert [1], the
community discussion on thread [2], and some of my own thoughts.
Maintain a flag that can be set and cleared on the primary bucket page,
call it split-in-progress, and a flag that can optionally be set on
particular index tuples, call it moved-by-split. We will allow scans of all
buckets and insertions into all buckets while the split is in progress, but
(as now) we will not allow more than one split of a given bucket to be in
progress at the same time. We start the split by updating the metapage to
increment the number of buckets and setting the split-in-progress flag on
the primary bucket pages of both the old and the new bucket (for the
purposes of this discussion, call the old bucket (N+1)/2 and the new bucket
N+1). While the split-in-progress flag is set, any scan of bucket N+1 will
first scan that bucket, ignoring any tuples flagged moved-by-split, and
then ALSO scan bucket (N+1)/2. To ensure that vacuum doesn't clean any
tuples from the old or new bucket while such a scan is in progress, the
scan maintains a pin on both buckets (the pin on the old bucket must be
acquired first). The moved-by-split flag never has any effect except when
scanning the new bucket that existed at the start of that particular scan,
and then only if the split-in-progress flag was also set at that time.
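To make this concrete, here is a simplified sketch of how a scan could
react to the split-in-progress flag. It follows the attached WIP patch's
flag and field names (LH_BUCKET_PAGE_SPLIT, hashso_old_bucket_buf,
hashso_skip_moved_tuples, _hash_get_oldblk), but the function itself is
hypothetical and the lock-ordering re-check and error handling done by the
patch are omitted:

#include "postgres.h"
#include "access/hash.h"

/* Illustrative sketch only -- simplified from the attached WIP patch. */
static void
scan_setup_for_split_in_progress(Relation rel, HashScanOpaque so,
								 HashPageOpaque nopaque)
{
	/* 'nopaque' is the opaque area of the new bucket's primary page. */
	if (nopaque->hasho_flag & LH_BUCKET_PAGE_SPLIT)
	{
		BlockNumber old_blkno = _hash_get_oldblk(rel, nopaque);
		Buffer		old_buf;

		/* Pin the old bucket's primary page so vacuum can't clean it away. */
		old_buf = _hash_getbuf(rel, old_blkno, HASH_READ, LH_BUCKET_PAGE);
		_hash_chgbufaccess(rel, old_buf, HASH_READ, HASH_NOLOCK); /* keep pin only */
		so->hashso_old_bucket_buf = old_buf;

		/*
		 * Ignore tuples flagged moved-by-split while scanning the new
		 * bucket; when the new bucket is exhausted, the scan continues
		 * into the old bucket (N+1)/2.
		 */
		so->hashso_skip_moved_tuples = true;
	}
}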
Once the split operation has set the split-in-progress flag, it will begin
scanning bucket (N+1)/2. Every time it finds a tuple that properly belongs
in bucket N+1, it will insert the tuple into bucket N+1 with the
moved-by-split flag set. Tuples inserted by anything other than a split
operation will leave this flag clear, and tuples inserted while the split
is in progress will target the same bucket that they would hit if the split
were already complete. Thus, bucket N+1 will end up with a mix of
moved-by-split tuples, coming from bucket (N+1)/2, and unflagged tuples
coming from parallel insertion activity. When the scan of bucket (N+1)/2 is
complete, we know that bucket N+1 now contains all the tuples that are
supposed to be there, so we clear the split-in-progress flag on both
buckets. Future scans of both buckets can then proceed normally. The split
operation needs to take a cleanup lock on the primary bucket page to ensure
that it doesn't start while any insertion into the bucket is in flight. It
releases the lock on the primary bucket page, but not the pin, as it
proceeds to the next overflow page; retaining the pin on the primary bucket
page ensures that vacuum doesn't start on this bucket until the split is
finished.
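A rough sketch of the heart of the split loop, modelled on the
_hash_splitbucket changes in the attached patch (the walk over overflow
pages, allocation of new overflow pages for bucket N+1, and buffer release
logic are elided; variable setup follows the existing function):

/* Illustrative fragment only -- not the exact patch code. */
for (ooffnum = FirstOffsetNumber; ooffnum <= omaxoffnum;
	 ooffnum = OffsetNumberNext(ooffnum))
{
	IndexTuple	itup = (IndexTuple) PageGetItem(opage,
												PageGetItemId(opage, ooffnum));
	Bucket		bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
											  maxbucket, highmask, lowmask);

	if (bucket == nbucket)
	{
		Size		itupsize = itup->t_info & INDEX_SIZE_MASK;

		/*
		 * Mark the tuple as moved-by-split before adding it to bucket N+1,
		 * so that scans which also visit the old bucket don't see it twice.
		 */
		itup->t_info &= ~INDEX_SIZE_MASK;
		itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;
		itup->t_info |= itupsize;

		(void) _hash_pgaddtup(rel, nbuf, MAXALIGN(IndexTupleDSize(*itup)), itup);

		/*
		 * The copy in the old bucket is NOT deleted here; it is cleaned up
		 * later by vacuum (or by a subsequent re-split).
		 */
	}
	/* else: the tuple stays in bucket (N+1)/2 */
}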
Insertion will happen by scanning the appropriate bucket and needs to
retain a pin on the primary bucket page to ensure that a concurrent split
cannot start; otherwise the split might leave this tuple unaccounted for.
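In sketch form the insert path would look roughly as below; this is only
an illustration built from primitives the hash AM already uses, with the
overflow-page handling of the real _hash_doinsert compressed into a comment
and variable declarations as in that function:

/* Illustrative fragment only. */
bucket = _hash_hashkey2bucket(hashkey,
							  metap->hashm_maxbucket,
							  metap->hashm_highmask,
							  metap->hashm_lowmask);
blkno = BUCKET_TO_BLKNO(metap, bucket);
itemsz = MAXALIGN(IndexTupleDSize(*itup));

/* Release the metapage lock, but keep its pin. */
_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);

/*
 * Pin and write-lock the primary bucket page.  The pin is retained until
 * the tuple has been placed: a split needs a cleanup lock on this page to
 * start, so holding the pin guarantees no split can begin and leave this
 * tuple unaccounted for.
 */
buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);

/*
 * ... walk the overflow chain looking for free space, chaining a new
 * overflow page via _hash_addovflpage(..., retain_pin = true) if needed,
 * then add the tuple and write out the page ...
 */
(void) _hash_pgaddtup(rel, buf, itemsz, itup);
_hash_wrtbuf(rel, buf);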
Now, for deletion of tuples from bucket (N+1)/2, we need to wait for the
completion of any scans that began before we finished populating bucket
N+1, because otherwise we might remove tuples that they're still expecting
to find in bucket (N+1)/2. A scan always maintains a pin on the primary
bucket page, and vacuum can take a buffer cleanup lock on that page (a
cleanup lock is an exclusive lock on the buffer combined with waiting until
its pin count drops to zero). I think we can relax the requirement for
vacuum to take a cleanup lock (taking just an exclusive lock on buckets
where no split has happened) by means of an additional has_garbage flag,
set on the primary bucket page whenever any tuples have been moved out of
that bucket. However, for the squeeze phase of vacuum (in which we move
tuples from later overflow pages to earlier overflow pages in the bucket
and return any overflow pages that become empty to a free pool), we still
need a cleanup lock, otherwise scan results might get affected.
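A condensed sketch of the vacuum side, following the hashbulkdelete changes
in the attached patch (per-page processing and the overflow-chain walk are
summarized in comments; variables are as in that function):

/* Illustrative fragment only. */
buf = ReadBufferExtended(rel, MAIN_FORKNUM, bucket_blkno, RBM_NORMAL,
						 info->strategy);
LockBufferForCleanup(buf);		/* waits until no scan still holds a pin */
_hash_checkpage(rel, buf, LH_BUCKET_PAGE);

page = BufferGetPage(buf);
bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);

/* Tuples moved out by a completed split may be removed from this bucket. */
if (bucket_opaque->hasho_flag & LH_BUCKET_PAGE_HAS_GARBAGE)
	bucket_has_garbage = true;

/*
 * ... scan every page of the bucket, deleting tuples the callback reports
 * dead and, if bucket_has_garbage, tuples whose hash key now maps to a
 * different bucket; the pin on the primary bucket page is held throughout,
 * and _hash_squeezebucket() runs while the cleanup lock is still held ...
 */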
Incomplete Splits
--------------------------
Incomplete splits can be completed either by vacuum or by insert, as both
need an exclusive lock on the bucket. If vacuum finds the split-in-progress
flag on a bucket, it will complete the split operation; vacuum won't see
this flag while a split is actually in progress on that bucket, because
vacuum needs a cleanup lock and the split retains its pin until the end of
the operation. To make it work for the insert operation, one simple idea is
that if an insert finds the split-in-progress flag, it releases its current
exclusive lock on the bucket and tries to acquire a cleanup lock on the
bucket instead. If it gets the cleanup lock, it can complete the split and
then insert the tuple; otherwise it re-takes an exclusive lock on the
bucket and just performs the insertion. The disadvantage of completing the
split in vacuum is that a split might require new pages, and allocating new
pages during vacuum is not advisable. The disadvantage of doing it at
insert time is that the insert might skip the completion even though only a
scan is in progress on the bucket (a scan also retains a pin on the
bucket), but I think that is not a big deal. The actual completion of the
split can be done in two ways: (a) scan the new bucket and build a hash
table with all of the TIDs you find there; when copying tuples from the old
bucket, first probe the hash table, and if you find a match, just skip that
tuple (idea suggested by Robert Haas offlist), or (b) delete all the tuples
that are marked moved_by_split in the new bucket and perform the split
operation from the beginning using the old bucket.
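As a hedged sketch of the insert-side completion (ConditionalLockBufferForCleanup
is the existing buffer-manager primitive; _hash_finish_split is a
hypothetical helper standing in for whichever of options (a) or (b) gets
implemented):

/* Illustrative fragment only -- _hash_finish_split is hypothetical. */
buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
pageopaque = (HashPageOpaque) PageGetSpecialPointer(BufferGetPage(buf));

if (pageopaque->hasho_flag & LH_BUCKET_PAGE_SPLIT)	/* split-in-progress? */
{
	/*
	 * Trade the exclusive lock for a cleanup lock if we can get one
	 * without waiting; otherwise re-take the exclusive lock and just
	 * insert under it.
	 */
	LockBuffer(buf, BUFFER_LOCK_UNLOCK);
	if (ConditionalLockBufferForCleanup(buf))
		_hash_finish_split(rel, metabuf, buf);	/* hypothetical helper */
	else
		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
}

/* ... proceed with the insertion of the tuple as usual ... */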
Although I don't think it is a very good idea to take performance data with
a WIP patch, I still couldn't resist doing so, and the numbers are below.
To get the performance data, I dropped the primary key constraint on
pgbench_accounts and created a hash index on the aid column, as below:
alter table pgbench_accounts drop constraint pgbench_accounts_pkey;
create index pgbench_accounts_pkey on pgbench_accounts using hash(aid);
The data below is for a read-only pgbench test and is the median of three
5-minute runs. The performance tests were executed on a POWER8 machine.
Data fits in shared buffers
scale_factor - 300
shared_buffers - 8GB
Patch_Ver/Client count     1      8     16     32     64     72     80     88     96    128
HEAD-Btree             19397 122488 194433 344524 519536 527365 597368 559381 614321 609102
HEAD-Hindex            18539 141905 218635 363068 512067 522018 492103 484372 440265 393231
Patch                  22504 146937 235948 419268 637871 637595 674042 669278 683704 639967
The % improvement of the patch over the HEAD hash index and over the HEAD
btree index is:
Client count             1      8     16     32     64     72     80     88     96    128
Head-Hash vs Patch   21.38    3.5    7.9  15.47  24.56  22.14  36.97  38.17  55.29  62.74
Head-Btree vs Patch  16.01  19.96  21.35  21.69  22.77   20.9  12.83  19.64  11.29   5.06
This data shows that the patch improves the performance of the hash index
by up to 62.74%, and that it also makes the hash index faster than the
btree index by ~20% (most client counts show a performance improvement in
the range of 15-20%).
For the comparison with btree, I think the performance benefit of the hash
index will be even larger when the data doesn't fit in shared buffers; the
performance data for that case is below:
Data doesn't fit in shared buffers
scale_factor - 3000
shared_buffers - 8GB
Client_Count/Patch      16     64     96
Head-Btree          170042 463721 520656
Patch-Hash          227528 603594 659287
% diff                33.8  30.16  26.62
The performance with the hash index is ~30% better than with btree. Note
that, for now, I have not taken data for the HEAD hash index in this case.
I think there will be many more cases, such as a hash index on a char(20)
column, where the performance of a hash index can be much better than a
btree index for "equal to" searches.
Note that this patch is very much a WIP patch and I am posting it mainly to
facilitate the discussion. Currently, it doesn't have any code to complete
incomplete splits, the locking/pin logic during insert is yet to be done,
and many more things.
[1]: /messages/by-id/CA+TgmoZyMoJSrFxHXQ06G8jhjXQcsKvDiHB_8z_7nc7hj7iHYQ@mail.gmail.com
[2]: /messages/by-id/531992AF.2080306@vmware.com
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
concurrent_hash_index_v1.patch (application/octet-stream)
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index c1122b4..d02539a 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -400,7 +400,6 @@ pgstat_hash_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno,
Buffer buf;
Page page;
- _hash_getlock(rel, blkno, HASH_SHARE);
buf = _hash_getbuf_with_strategy(rel, blkno, HASH_READ, 0, bstrategy);
page = BufferGetPage(buf);
@@ -431,7 +430,6 @@ pgstat_hash_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno,
}
_hash_relbuf(rel, buf);
- _hash_droplock(rel, blkno, HASH_SHARE);
}
/*
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 49a6c81..f95ac00 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -407,12 +407,15 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
so->hashso_bucket_valid = false;
- so->hashso_bucket_blkno = 0;
so->hashso_curbuf = InvalidBuffer;
+ so->hashso_bucket_buf = InvalidBuffer;
+ so->hashso_old_bucket_buf = InvalidBuffer;
/* set position invalid (this will cause _hash_first call) */
ItemPointerSetInvalid(&(so->hashso_curpos));
ItemPointerSetInvalid(&(so->hashso_heappos));
+ so->hashso_skip_moved_tuples = false;
+
scan->opaque = so;
/* register scan in case we change pages it's using */
@@ -436,10 +439,15 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_hash_dropbuf(rel, so->hashso_curbuf);
so->hashso_curbuf = InvalidBuffer;
- /* release lock on bucket, too */
- if (so->hashso_bucket_blkno)
- _hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
- so->hashso_bucket_blkno = 0;
+ /* release pin we hold on old primary bucket */
+ if (BufferIsValid(so->hashso_old_bucket_buf))
+ _hash_dropbuf(rel, so->hashso_old_bucket_buf);
+ so->hashso_old_bucket_buf = InvalidBuffer;
+
+ /* release pin we hold on primary bucket */
+ if (BufferIsValid(so->hashso_bucket_buf))
+ _hash_dropbuf(rel, so->hashso_bucket_buf);
+ so->hashso_bucket_buf = InvalidBuffer;
/* set position invalid (this will cause _hash_first call) */
ItemPointerSetInvalid(&(so->hashso_curpos));
@@ -453,6 +461,8 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
scan->numberOfKeys * sizeof(ScanKeyData));
so->hashso_bucket_valid = false;
}
+
+ so->hashso_skip_moved_tuples = false;
}
/*
@@ -472,10 +482,15 @@ hashendscan(IndexScanDesc scan)
_hash_dropbuf(rel, so->hashso_curbuf);
so->hashso_curbuf = InvalidBuffer;
- /* release lock on bucket, too */
- if (so->hashso_bucket_blkno)
- _hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
- so->hashso_bucket_blkno = 0;
+ /* release pin we hold on old primary bucket */
+ if (BufferIsValid(so->hashso_old_bucket_buf))
+ _hash_dropbuf(rel, so->hashso_old_bucket_buf);
+ so->hashso_old_bucket_buf = InvalidBuffer;
+
+ /* release pin we hold on primary bucket */
+ if (BufferIsValid(so->hashso_bucket_buf))
+ _hash_dropbuf(rel, so->hashso_bucket_buf);
+ so->hashso_bucket_buf = InvalidBuffer;
pfree(so);
scan->opaque = NULL;
@@ -486,6 +501,9 @@ hashendscan(IndexScanDesc scan)
* The set of target tuples is specified via a callback routine that tells
* whether any given heap tuple (identified by ItemPointer) is being deleted.
*
+ * This function also delete the tuples that are moved by split to other
+ * bucket.
+ *
* Result: a palloc'd struct containing statistical info for VACUUM displays.
*/
IndexBulkDeleteResult *
@@ -530,35 +548,61 @@ loop_top:
{
BlockNumber bucket_blkno;
BlockNumber blkno;
+ Buffer bucket_buf;
+ Buffer buf;
+ HashPageOpaque bucket_opaque;
+ Page page;
bool bucket_dirty = false;
+ bool bucket_has_garbage = false;
/* Get address of bucket's start page */
bucket_blkno = BUCKET_TO_BLKNO(&local_metapage, cur_bucket);
- /* Exclusive-lock the bucket so we can shrink it */
- _hash_getlock(rel, bucket_blkno, HASH_EXCLUSIVE);
-
/* Shouldn't have any active scans locally, either */
if (_hash_has_active_scan(rel, cur_bucket))
elog(ERROR, "hash index has active scan during VACUUM");
- /* Scan each page in bucket */
blkno = bucket_blkno;
- while (BlockNumberIsValid(blkno))
+
+ /*
+ * Maintain a cleanup lock on primary bucket till we scan all the
+ * pages in bucket. This is required to ensure that we don't delete
+ * tuples which are needed for concurrent scans on buckets where split
+ * is in progress. Retaining it till end of bucket scan ensures that
+ * concurrent split can't be started on it. In future, we might want
+ * to relax the requirement for vacuum to take cleanup lock only for
+ * buckets where split is in progress, however for squeeze phase we
+ * need a cleanup lock, otherwise squeeze will move the tuples to a
+ * different location and that can lead to change in order of results.
+ */
+ buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, info->strategy);
+ LockBufferForCleanup(buf);
+ _hash_checkpage(rel, buf, LH_BUCKET_PAGE);
+
+ page = BufferGetPage(buf);
+ bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+ /*
+ * If the bucket contains tuples that are moved by split, then we need
+ * to delete such tuples as well.
+ */
+ if (bucket_opaque->hasho_flag & LH_BUCKET_PAGE_HAS_GARBAGE)
+ bucket_has_garbage = true;
+
+ bucket_buf = buf;
+
+ /* Scan each page in bucket */
+ for (;;)
{
- Buffer buf;
- Page page;
HashPageOpaque opaque;
OffsetNumber offno;
OffsetNumber maxoffno;
OffsetNumber deletable[MaxOffsetNumber];
int ndeletable = 0;
+ bool release_buf = false;
vacuum_delay_point();
- buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
- LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
- info->strategy);
page = BufferGetPage(buf);
opaque = (HashPageOpaque) PageGetSpecialPointer(page);
Assert(opaque->hasho_bucket == cur_bucket);
@@ -571,6 +615,7 @@ loop_top:
{
IndexTuple itup;
ItemPointer htup;
+ Bucket bucket;
itup = (IndexTuple) PageGetItem(page,
PageGetItemId(page, offno));
@@ -581,32 +626,72 @@ loop_top:
deletable[ndeletable++] = offno;
tuples_removed += 1;
}
+ else if (bucket_has_garbage)
+ {
+ /* delete the tuples that are moved by split. */
+ bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
+ local_metapage.hashm_maxbucket,
+ local_metapage.hashm_highmask,
+ local_metapage.hashm_lowmask);
+ if (bucket != cur_bucket)
+ {
+ /* mark the item for deletion */
+ deletable[ndeletable++] = offno;
+ tuples_removed += 1;
+ }
+ }
else
num_index_tuples += 1;
}
/*
- * Apply deletions and write page if needed, advance to next page.
+ * We don't release the lock on primary bucket till end of bucket
+ * scan.
*/
+ if (blkno != bucket_blkno)
+ release_buf = true;
+
blkno = opaque->hasho_nextblkno;
+ /*
+ * Apply deletions and write page if needed, advance to next page.
+ */
if (ndeletable > 0)
{
PageIndexMultiDelete(page, deletable, ndeletable);
- _hash_wrtbuf(rel, buf);
+ if (release_buf)
+ _hash_wrtbuf(rel, buf);
+ else
+ MarkBufferDirty(buf);
bucket_dirty = true;
}
- else
+ else if (release_buf)
_hash_relbuf(rel, buf);
+
+ /* bail out if there are no more pages to scan. */
+ if (!BlockNumberIsValid(blkno))
+ break;
+
+ buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+ LH_OVERFLOW_PAGE,
+ info->strategy);
}
+ /*
+ * Clear the garbage flag from bucket after deleting the tuples that
+ * are moved by split. We purposefully clear the flag before squeeze
+ * bucket, so that after restart, vacuum shouldn't again try to delete
+ * the moved by split tuples.
+ */
+ if (bucket_has_garbage)
+ bucket_opaque->hasho_flag &= ~LH_BUCKET_PAGE_HAS_GARBAGE;
+
/* If we deleted anything, try to compact free space */
if (bucket_dirty)
- _hash_squeezebucket(rel, cur_bucket, bucket_blkno,
+ _hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf,
info->strategy);
- /* Release bucket lock */
- _hash_droplock(rel, bucket_blkno, HASH_EXCLUSIVE);
+ _hash_relbuf(rel, bucket_buf);
/* Advance to next bucket */
cur_bucket++;
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index acd2e64..eedf6ae 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -32,8 +32,6 @@ _hash_doinsert(Relation rel, IndexTuple itup)
Buffer metabuf;
HashMetaPage metap;
BlockNumber blkno;
- BlockNumber oldblkno = InvalidBlockNumber;
- bool retry = false;
Page page;
HashPageOpaque pageopaque;
Size itemsz;
@@ -70,45 +68,22 @@ _hash_doinsert(Relation rel, IndexTuple itup)
errhint("Values larger than a buffer page cannot be indexed.")));
/*
- * Loop until we get a lock on the correct target bucket.
+ * Compute the target bucket number, and convert to block number.
*/
- for (;;)
- {
- /*
- * Compute the target bucket number, and convert to block number.
- */
- bucket = _hash_hashkey2bucket(hashkey,
+ bucket = _hash_hashkey2bucket(hashkey,
metap->hashm_maxbucket,
metap->hashm_highmask,
metap->hashm_lowmask);
- blkno = BUCKET_TO_BLKNO(metap, bucket);
-
- /* Release metapage lock, but keep pin. */
- _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+ blkno = BUCKET_TO_BLKNO(metap, bucket);
- /*
- * If the previous iteration of this loop locked what is still the
- * correct target bucket, we are done. Otherwise, drop any old lock
- * and lock what now appears to be the correct bucket.
- */
- if (retry)
- {
- if (oldblkno == blkno)
- break;
- _hash_droplock(rel, oldblkno, HASH_SHARE);
- }
- _hash_getlock(rel, blkno, HASH_SHARE);
-
- /*
- * Reacquire metapage lock and check that no bucket split has taken
- * place while we were awaiting the bucket lock.
- */
- _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
- oldblkno = blkno;
- retry = true;
- }
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+ /*
+ * FixMe: If the split operation happens during insertion and it
+ * doesn't account the tuple being inserted, then it can be lost
+ * for future searches.
+ */
/* Fetch the primary bucket page for the bucket */
buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
page = BufferGetPage(buf);
@@ -141,10 +116,10 @@ _hash_doinsert(Relation rel, IndexTuple itup)
*/
/* release our write lock without modifying buffer */
- _hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+ _hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
/* chain to a new overflow page */
- buf = _hash_addovflpage(rel, metabuf, buf);
+ buf = _hash_addovflpage(rel, metabuf, buf, false);
page = BufferGetPage(buf);
/* should fit now, given test above */
@@ -161,9 +136,6 @@ _hash_doinsert(Relation rel, IndexTuple itup)
/* write and release the modified page */
_hash_wrtbuf(rel, buf);
- /* We can drop the bucket lock now */
- _hash_droplock(rel, blkno, HASH_SHARE);
-
/*
* Write-lock the metapage so we can increment the tuple count. After
* incrementing it, check to see if it's time for a split.
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index db3e268..184236c 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -82,23 +82,20 @@ blkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
*
* On entry, the caller must hold a pin but no lock on 'buf'. The pin is
* dropped before exiting (we assume the caller is not interested in 'buf'
- * anymore). The returned overflow page will be pinned and write-locked;
- * it is guaranteed to be empty.
+ * anymore) if not asked to retain. The pin will be retained only for the
+ * primary bucket. The returned overflow page will be pinned and
+ * write-locked; it is guaranteed to be empty.
*
* The caller must hold a pin, but no lock, on the metapage buffer.
* That buffer is returned in the same state.
*
- * The caller must hold at least share lock on the bucket, to ensure that
- * no one else tries to compact the bucket meanwhile. This guarantees that
- * 'buf' won't stop being part of the bucket while it's unlocked.
- *
* NB: since this could be executed concurrently by multiple processes,
* one should not assume that the returned overflow page will be the
* immediate successor of the originally passed 'buf'. Additional overflow
* pages might have been added to the bucket chain in between.
*/
Buffer
-_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
+_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
{
Buffer ovflbuf;
Page page;
@@ -131,7 +128,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
break;
/* we assume we do not need to write the unmodified page */
- _hash_relbuf(rel, buf);
+ if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+ _hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+ else
+ _hash_relbuf(rel, buf);
buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
}
@@ -149,7 +149,13 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
/* logically chain overflow page to previous page */
pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
- _hash_wrtbuf(rel, buf);
+ if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+ {
+ MarkBufferDirty(buf);
+ _hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+ }
+ else
+ _hash_wrtbuf(rel, buf);
return ovflbuf;
}
@@ -570,7 +576,7 @@ _hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
* required that to be true on entry as well, but it's a lot easier for
* callers to leave empty overflow pages and let this guy clean it up.
*
- * Caller must hold exclusive lock on the target bucket. This allows
+ * Caller must hold cleanup lock on the target bucket. This allows
* us to safely lock multiple pages in the bucket.
*
* Since this function is invoked in VACUUM, we provide an access strategy
@@ -580,6 +586,7 @@ void
_hash_squeezebucket(Relation rel,
Bucket bucket,
BlockNumber bucket_blkno,
+ Buffer bucket_buf,
BufferAccessStrategy bstrategy)
{
BlockNumber wblkno;
@@ -591,16 +598,13 @@ _hash_squeezebucket(Relation rel,
HashPageOpaque wopaque;
HashPageOpaque ropaque;
bool wbuf_dirty;
+ bool release_buf = false;
/*
* start squeezing into the base bucket page.
*/
wblkno = bucket_blkno;
- wbuf = _hash_getbuf_with_strategy(rel,
- wblkno,
- HASH_WRITE,
- LH_BUCKET_PAGE,
- bstrategy);
+ wbuf = bucket_buf;
wpage = BufferGetPage(wbuf);
wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
@@ -669,12 +673,17 @@ _hash_squeezebucket(Relation rel,
{
Assert(!PageIsEmpty(wpage));
+ if (wblkno != bucket_blkno)
+ release_buf = true;
+
wblkno = wopaque->hasho_nextblkno;
Assert(BlockNumberIsValid(wblkno));
- if (wbuf_dirty)
+ if (wbuf_dirty && release_buf)
_hash_wrtbuf(rel, wbuf);
- else
+ else if (wbuf_dirty)
+ MarkBufferDirty(wbuf);
+ else if (release_buf)
_hash_relbuf(rel, wbuf);
/* nothing more to do if we reached the read page */
@@ -700,6 +709,7 @@ _hash_squeezebucket(Relation rel,
wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
Assert(wopaque->hasho_bucket == bucket);
wbuf_dirty = false;
+ release_buf = false;
}
/*
@@ -733,11 +743,17 @@ _hash_squeezebucket(Relation rel,
/* are we freeing the page adjacent to wbuf? */
if (rblkno == wblkno)
{
- /* yes, so release wbuf lock first */
- if (wbuf_dirty)
+ if (wblkno != bucket_blkno)
+ release_buf = true;
+
+ /* yes, so release wbuf lock first if needed */
+ if (wbuf_dirty && release_buf)
_hash_wrtbuf(rel, wbuf);
- else
+ else if (wbuf_dirty)
+ MarkBufferDirty(wbuf);
+ else if (release_buf)
_hash_relbuf(rel, wbuf);
+
/* free this overflow page (releases rbuf) */
_hash_freeovflpage(rel, rbuf, bstrategy);
/* done */
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 178463f..1ba4d52 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -38,7 +38,7 @@ static bool _hash_alloc_buckets(Relation rel, BlockNumber firstblock,
uint32 nblocks);
static void _hash_splitbucket(Relation rel, Buffer metabuf,
Bucket obucket, Bucket nbucket,
- BlockNumber start_oblkno,
+ Buffer obuf,
Buffer nbuf,
uint32 maxbucket,
uint32 highmask, uint32 lowmask);
@@ -55,46 +55,6 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
/*
- * _hash_getlock() -- Acquire an lmgr lock.
- *
- * 'whichlock' should the block number of a bucket's primary bucket page to
- * acquire the per-bucket lock. (See README for details of the use of these
- * locks.)
- *
- * 'access' must be HASH_SHARE or HASH_EXCLUSIVE.
- */
-void
-_hash_getlock(Relation rel, BlockNumber whichlock, int access)
-{
- if (USELOCKING(rel))
- LockPage(rel, whichlock, access);
-}
-
-/*
- * _hash_try_getlock() -- Acquire an lmgr lock, but only if it's free.
- *
- * Same as above except we return FALSE without blocking if lock isn't free.
- */
-bool
-_hash_try_getlock(Relation rel, BlockNumber whichlock, int access)
-{
- if (USELOCKING(rel))
- return ConditionalLockPage(rel, whichlock, access);
- else
- return true;
-}
-
-/*
- * _hash_droplock() -- Release an lmgr lock.
- */
-void
-_hash_droplock(Relation rel, BlockNumber whichlock, int access)
-{
- if (USELOCKING(rel))
- UnlockPage(rel, whichlock, access);
-}
-
-/*
* _hash_getbuf() -- Get a buffer by block number for read or write.
*
* 'access' must be HASH_READ, HASH_WRITE, or HASH_NOLOCK.
@@ -489,9 +449,8 @@ _hash_pageinit(Page page, Size size)
/*
* Attempt to expand the hash table by creating one new bucket.
*
- * This will silently do nothing if it cannot get the needed locks.
- *
- * The caller should hold no locks on the hash index.
+ * This will silently do nothing if there are active scans of our own
+ * backend or the old bucket contains tuples from previous split.
*
* The caller must hold a pin, but no lock, on the metapage buffer.
* The buffer is returned in the same state.
@@ -506,6 +465,9 @@ _hash_expandtable(Relation rel, Buffer metabuf)
BlockNumber start_oblkno;
BlockNumber start_nblkno;
Buffer buf_nblkno;
+ Buffer buf_oblkno;
+ Page opage;
+ HashPageOpaque oopaque;
uint32 maxbucket;
uint32 highmask;
uint32 lowmask;
@@ -548,11 +510,9 @@ _hash_expandtable(Relation rel, Buffer metabuf)
goto fail;
/*
- * Determine which bucket is to be split, and attempt to lock the old
- * bucket. If we can't get the lock, give up.
- *
- * The lock protects us against other backends, but not against our own
- * backend. Must check for active scans separately.
+ * Determine which bucket is to be split, and if it still contains tuples
+ * from previous split or there is any active scan of our own backend,
+ * then give up.
*/
new_bucket = metap->hashm_maxbucket + 1;
@@ -563,11 +523,18 @@ _hash_expandtable(Relation rel, Buffer metabuf)
if (_hash_has_active_scan(rel, old_bucket))
goto fail;
- if (!_hash_try_getlock(rel, start_oblkno, HASH_EXCLUSIVE))
+ buf_oblkno = _hash_getbuf(rel, start_oblkno, HASH_READ, LH_BUCKET_PAGE);
+
+ opage = BufferGetPage(buf_oblkno);
+ oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+ if (oopaque->hasho_flag & LH_BUCKET_PAGE_HAS_GARBAGE)
+ {
+ _hash_relbuf(rel, buf_oblkno);
goto fail;
+ }
/*
- * Likewise lock the new bucket (should never fail).
+ * There shouldn't be any active scan on new bucket.
*
* Note: it is safe to compute the new bucket's blkno here, even though we
* may still need to update the BUCKET_TO_BLKNO mapping. This is because
@@ -579,9 +546,6 @@ _hash_expandtable(Relation rel, Buffer metabuf)
if (_hash_has_active_scan(rel, new_bucket))
elog(ERROR, "scan in progress on supposedly new bucket");
- if (!_hash_try_getlock(rel, start_nblkno, HASH_EXCLUSIVE))
- elog(ERROR, "could not get lock on supposedly new bucket");
-
/*
* If the split point is increasing (hashm_maxbucket's log base 2
* increases), we need to allocate a new batch of bucket pages.
@@ -600,8 +564,7 @@ _hash_expandtable(Relation rel, Buffer metabuf)
if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
{
/* can't split due to BlockNumber overflow */
- _hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
- _hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
+ _hash_relbuf(rel, buf_oblkno);
goto fail;
}
}
@@ -665,13 +628,9 @@ _hash_expandtable(Relation rel, Buffer metabuf)
/* Relocate records to the new bucket */
_hash_splitbucket(rel, metabuf,
old_bucket, new_bucket,
- start_oblkno, buf_nblkno,
+ buf_oblkno, buf_nblkno,
maxbucket, highmask, lowmask);
- /* Release bucket locks, allowing others to access them */
- _hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
- _hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
-
return;
/* Here if decide not to split or fail to acquire old bucket lock */
@@ -745,6 +704,10 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
* The buffer is returned in the same state. (The metapage is only
* touched if it becomes necessary to add or remove overflow pages.)
*
+ * Split needs to hold pin on primary bucket pages of both old and new
+ * buckets till end of operation. This is to prevent vacuum to start
+ * when split is in progress.
+ *
* In addition, the caller must have created the new bucket's base page,
* which is passed in buffer nbuf, pinned and write-locked. That lock and
* pin are released here. (The API is set up this way because we must do
@@ -756,37 +719,46 @@ _hash_splitbucket(Relation rel,
Buffer metabuf,
Bucket obucket,
Bucket nbucket,
- BlockNumber start_oblkno,
+ Buffer obuf,
Buffer nbuf,
uint32 maxbucket,
uint32 highmask,
uint32 lowmask)
{
- Buffer obuf;
+ Buffer bucket_obuf;
+ Buffer bucket_nbuf;
Page opage;
Page npage;
HashPageOpaque oopaque;
HashPageOpaque nopaque;
+ HashPageOpaque bucket_nopaque;
- /*
- * It should be okay to simultaneously write-lock pages from each bucket,
- * since no one else can be trying to acquire buffer lock on pages of
- * either bucket.
- */
- obuf = _hash_getbuf(rel, start_oblkno, HASH_WRITE, LH_BUCKET_PAGE);
+ bucket_nbuf = nbuf;
+ bucket_obuf = obuf;
opage = BufferGetPage(obuf);
oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+ /*
+ * Mark the old bucket to indicate that it has deletable tuples. Vacuum
+ * will clear this flag after deleting such tuples.
+ */
+ oopaque->hasho_flag |= LH_BUCKET_PAGE_HAS_GARBAGE;
+
npage = BufferGetPage(nbuf);
- /* initialize the new bucket's primary page */
+ /*
+ * initialize the new bucket's primary page and mark it to indicate that
+ * split is in progress.
+ */
nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
nopaque->hasho_prevblkno = InvalidBlockNumber;
nopaque->hasho_nextblkno = InvalidBlockNumber;
nopaque->hasho_bucket = nbucket;
- nopaque->hasho_flag = LH_BUCKET_PAGE;
+ nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_PAGE_SPLIT;
nopaque->hasho_page_id = HASHO_PAGE_ID;
+ bucket_nopaque = nopaque;
+
/*
* Partition the tuples in the old bucket between the old bucket and the
* new bucket, advancing along the old bucket's overflow bucket chain and
@@ -798,8 +770,6 @@ _hash_splitbucket(Relation rel,
BlockNumber oblkno;
OffsetNumber ooffnum;
OffsetNumber omaxoffnum;
- OffsetNumber deletable[MaxOffsetNumber];
- int ndeletable = 0;
/* Scan each tuple in old page */
omaxoffnum = PageGetMaxOffsetNumber(opage);
@@ -822,6 +792,18 @@ _hash_splitbucket(Relation rel,
if (bucket == nbucket)
{
+ Size itupsize = 0;
+
+ /*
+ * mark the index tuple as moved by split, such tuples are
+ * skipped by scan if there is split in progress for a primary
+ * bucket.
+ */
+ itupsize = itup->t_info & INDEX_SIZE_MASK;
+ itup->t_info &= ~INDEX_SIZE_MASK;
+ itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;
+ itup->t_info |= itupsize;
+
/*
* insert the tuple into the new bucket. if it doesn't fit on
* the current page in the new bucket, we must allocate a new
@@ -840,9 +822,10 @@ _hash_splitbucket(Relation rel,
/* write out nbuf and drop lock, but keep pin */
_hash_chgbufaccess(rel, nbuf, HASH_WRITE, HASH_NOLOCK);
/* chain to a new overflow page */
- nbuf = _hash_addovflpage(rel, metabuf, nbuf);
+ nbuf = _hash_addovflpage(rel, metabuf, nbuf,
+ nopaque->hasho_flag & LH_BUCKET_PAGE ? true : false);
npage = BufferGetPage(nbuf);
- /* we don't need nopaque within the loop */
+ nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
}
/*
@@ -853,11 +836,6 @@ _hash_splitbucket(Relation rel,
* the new page and qsort them before insertion.
*/
(void) _hash_pgaddtup(rel, nbuf, itemsz, itup);
-
- /*
- * Mark tuple for deletion from old page.
- */
- deletable[ndeletable++] = ooffnum;
}
else
{
@@ -870,15 +848,9 @@ _hash_splitbucket(Relation rel,
oblkno = oopaque->hasho_nextblkno;
- /*
- * Done scanning this old page. If we moved any tuples, delete them
- * from the old page.
- */
- if (ndeletable > 0)
- {
- PageIndexMultiDelete(opage, deletable, ndeletable);
- _hash_wrtbuf(rel, obuf);
- }
+ /* retain the pin on the old primary bucket */
+ if (obuf == bucket_obuf)
+ _hash_chgbufaccess(rel, obuf, HASH_READ, HASH_NOLOCK);
else
_hash_relbuf(rel, obuf);
@@ -887,18 +859,24 @@ _hash_splitbucket(Relation rel,
break;
/* Else, advance to next old page */
- obuf = _hash_getbuf(rel, oblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
+ obuf = _hash_getbuf(rel, oblkno, HASH_READ, LH_OVERFLOW_PAGE);
opage = BufferGetPage(obuf);
oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
}
+ /* indicate that split is finished */
+ bucket_nopaque->hasho_flag &= ~LH_BUCKET_PAGE_SPLIT;
+
+ /* release the pin on the old primary bucket */
+ _hash_dropbuf(rel, bucket_obuf);
+
/*
* We're at the end of the old bucket chain, so we're done partitioning
- * the tuples. Before quitting, call _hash_squeezebucket to ensure the
- * tuples remaining in the old bucket (including the overflow pages) are
- * packed as tightly as possible. The new bucket is already tight.
+ * the tuples.
*/
_hash_wrtbuf(rel, nbuf);
- _hash_squeezebucket(rel, obucket, start_oblkno, NULL);
+ /* release the pin on the new primary bucket */
+ if (!(nopaque->hasho_flag & LH_BUCKET_PAGE))
+ _hash_dropbuf(rel, bucket_nbuf);
}
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 4825558..b73559a 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -20,6 +20,7 @@
#include "pgstat.h"
#include "utils/rel.h"
+static BlockNumber _hash_get_oldblk(Relation rel, HashPageOpaque opaque);
/*
* _hash_next() -- Get the next item in a scan.
@@ -72,7 +73,23 @@ _hash_readnext(Relation rel,
BlockNumber blkno;
blkno = (*opaquep)->hasho_nextblkno;
- _hash_relbuf(rel, *bufp);
+
+ /*
+ * Retain the pin on primary bucket page till the end of scan to ensure
+ * that Vacuum can't delete the tuples (that are moved by split to new
+ * bucket) which are required by the scans that are started on splitted
+ * buckets before a new bucket's split in progress flag
+ * (LH_BUCKET_PAGE_SPLIT) is cleared. Now the requirement to retain a pin
+ * on primary bucket can be relaxed for buckets that are not splitted by
+ * maintaining a flag like has_garbage in bucket but still we need to
+ * retain pin for squeeze phase otherwise the movement of tuples could
+ * lead to change the ordering of scan results.
+ */
+ if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+ _hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+ else
+ _hash_relbuf(rel, *bufp);
+
*bufp = InvalidBuffer;
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
@@ -94,7 +111,23 @@ _hash_readprev(Relation rel,
BlockNumber blkno;
blkno = (*opaquep)->hasho_prevblkno;
- _hash_relbuf(rel, *bufp);
+
+ /*
+ * Retain the pin on primary bucket page till the end of scan to ensure
+ * that Vacuum can't delete the tuples (that are moved by split to new
+ * bucket) which are required by the scans that are started on splitted
+ * buckets before a new bucket's split in progress flag
+ * (LH_BUCKET_PAGE_SPLIT) is cleared. Now the requirement to retain a pin
+ * on primary bucket can be relaxed for buckets that are not splitted by
+ * maintaining a flag like has_garbage in bucket but still we need to
+ * retain pin for squeeze phase otherwise the movemenet of tuples could
+ * lead to change the ordering of scan results.
+ */
+ if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+ _hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+ else
+ _hash_relbuf(rel, *bufp);
+
*bufp = InvalidBuffer;
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
@@ -125,8 +158,6 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
uint32 hashkey;
Bucket bucket;
BlockNumber blkno;
- BlockNumber oldblkno = InvalidBuffer;
- bool retry = false;
Buffer buf;
Buffer metabuf;
Page page;
@@ -192,52 +223,21 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
metap = HashPageGetMeta(page);
/*
- * Loop until we get a lock on the correct target bucket.
+ * Compute the target bucket number, and convert to block number.
*/
- for (;;)
- {
- /*
- * Compute the target bucket number, and convert to block number.
- */
- bucket = _hash_hashkey2bucket(hashkey,
- metap->hashm_maxbucket,
- metap->hashm_highmask,
- metap->hashm_lowmask);
-
- blkno = BUCKET_TO_BLKNO(metap, bucket);
-
- /* Release metapage lock, but keep pin. */
- _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
-
- /*
- * If the previous iteration of this loop locked what is still the
- * correct target bucket, we are done. Otherwise, drop any old lock
- * and lock what now appears to be the correct bucket.
- */
- if (retry)
- {
- if (oldblkno == blkno)
- break;
- _hash_droplock(rel, oldblkno, HASH_SHARE);
- }
- _hash_getlock(rel, blkno, HASH_SHARE);
+ bucket = _hash_hashkey2bucket(hashkey,
+ metap->hashm_maxbucket,
+ metap->hashm_highmask,
+ metap->hashm_lowmask);
- /*
- * Reacquire metapage lock and check that no bucket split has taken
- * place while we were awaiting the bucket lock.
- */
- _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
- oldblkno = blkno;
- retry = true;
- }
+ blkno = BUCKET_TO_BLKNO(metap, bucket);
/* done with the metapage */
- _hash_dropbuf(rel, metabuf);
+ _hash_relbuf(rel, metabuf);
/* Update scan opaque state to show we have lock on the bucket */
so->hashso_bucket = bucket;
so->hashso_bucket_valid = true;
- so->hashso_bucket_blkno = blkno;
/* Fetch the primary bucket page for the bucket */
buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
@@ -245,6 +245,54 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
opaque = (HashPageOpaque) PageGetSpecialPointer(page);
Assert(opaque->hasho_bucket == bucket);
+ so->hashso_bucket_buf = buf;
+
+ /*
+ * If the bucket split is in progress, then we need to skip tuples that
+ * are moved from old bucket. To ensure that vacuum doesn't clean any
+ * tuples from old or new buckets till this scan is in progress, maintain
+ * a pin on both of the buckets. Here, we have to be cautious about lock
+ * ordering, first acquire the lock on old bucket, release the lock on old
+ * bucket, but not pin, then acuire the lock on new bucket and again
+ * re-verify whether the bucket split still is in progress. Acquiring lock
+ * on old bucket first ensures that the vacuum waits for this scan to
+ * finish.
+ */
+ if (opaque->hasho_flag & LH_BUCKET_PAGE_SPLIT)
+ {
+ BlockNumber old_blkno;
+ Buffer old_buf;
+
+ old_blkno = _hash_get_oldblk(rel, opaque);
+
+ /*
+ * release the lock on new bucket and re-acquire it after acquiring
+ * the lock on old bucket.
+ */
+ _hash_relbuf(rel, buf);
+
+ old_buf = _hash_getbuf(rel, old_blkno, HASH_READ, LH_BUCKET_PAGE);
+
+ /*
+ * remember the old bucket buffer so as to release it later.
+ */
+ so->hashso_old_bucket_buf = buf;
+ _hash_chgbufaccess(rel, old_buf, HASH_READ, HASH_NOLOCK);
+
+ buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
+ page = BufferGetPage(buf);
+ opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ Assert(opaque->hasho_bucket == bucket);
+
+ if (opaque->hasho_flag & LH_BUCKET_PAGE_SPLIT)
+ so->hashso_skip_moved_tuples = true;
+ else
+ {
+ _hash_dropbuf(rel, so->hashso_old_bucket_buf);
+ so->hashso_old_bucket_buf = InvalidBuffer;
+ }
+ }
+
/* If a backwards scan is requested, move to the end of the chain */
if (ScanDirectionIsBackward(dir))
{
@@ -273,6 +321,13 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
* false. Else, return true and set the hashso_curpos for the
* scan to the right thing.
*
+ * Here we also scan the old bucket if the split for current bucket
+ * was in progress at the start of scan. The basic idea is that
+ * skip the tuples that are moved by split while scanning current
+ * bucket and then scan the old bucket to cover all such tuples. This
+ * is done ensure that we don't miss any tuples in the current scan
+ * when split was in progress.
+ *
* 'bufP' points to the current buffer, which is pinned and read-locked.
* On success exit, we have pin and read-lock on whichever page
* contains the right item; on failure, we have released all buffers.
@@ -338,6 +393,16 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
{
Assert(offnum >= FirstOffsetNumber);
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+ /*
+ * skip the tuples that are moved by split operation
+ * for the scan that has started when split was in
+ * progress
+ */
+ if (so->hashso_skip_moved_tuples &&
+ (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
+ continue;
+
if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
break; /* yes, so exit for-loop */
}
@@ -353,9 +418,52 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
}
else
{
- /* end of bucket */
- itup = NULL;
- break; /* exit for-loop */
+ /*
+ * end of bucket, scan old bucket if there was a split
+ * in progress at the start of scan.
+ */
+ if (so->hashso_skip_moved_tuples)
+ {
+ page = BufferGetPage(so->hashso_bucket_buf);
+ opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+ blkno = _hash_get_oldblk(rel, opaque);
+
+ Assert(BlockNumberIsValid(blkno));
+ buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
+
+ /*
+ * remember the old bucket buffer so as to release
+ * the pin at end of scan. If this scan already
+ * has a pin on old buffer, then release it as one
+ * pin is sufficient to hold-off vacuum to clean
+ * the bucket where scan is in progress.
+ */
+ if (BufferIsValid(so->hashso_old_bucket_buf))
+ {
+ _hash_dropbuf(rel, so->hashso_old_bucket_buf);
+ so->hashso_old_bucket_buf = InvalidBuffer;
+ }
+ so->hashso_old_bucket_buf = buf;
+
+ page = BufferGetPage(buf);
+ maxoff = PageGetMaxOffsetNumber(page);
+ offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+ /*
+ * setting hashso_skip_moved_tuples to false
+ * ensures that we don't check for tuples that are
+ * moved by split in old bucket and it also
+ * ensures that we won't retry to scan the old
+ * bucket once the scan for same is finished.
+ */
+ so->hashso_skip_moved_tuples = false;
+ }
+ else
+ {
+ itup = NULL;
+ break; /* exit for-loop */
+ }
}
}
break;
@@ -379,6 +487,16 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
{
Assert(offnum <= maxoff);
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+ /*
+ * skip the tuples that are moved by split operation
+ * for the scan that has started when split was in
+ * progress
+ */
+ if (so->hashso_skip_moved_tuples &&
+ (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
+ continue;
+
if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
break; /* yes, so exit for-loop */
}
@@ -394,9 +512,64 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
}
else
{
- /* end of bucket */
- itup = NULL;
- break; /* exit for-loop */
+ /*
+ * end of bucket, scan old bucket if there was a split
+ * in progress at the start of scan.
+ */
+ if (so->hashso_skip_moved_tuples)
+ {
+ page = BufferGetPage(so->hashso_bucket_buf);
+ opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+ blkno = _hash_get_oldblk(rel, opaque);
+
+ /* read the old page */
+ Assert(BlockNumberIsValid(blkno));
+ buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
+
+ /*
+ * remember the old bucket buffer so as to release
+ * the pin at end of scan. If this scan already
+ * has a pin on old buffer, then release it as one
+ * pin is sufficient to hold-off vacuum to clean
+ * the bucket where scan is in progress.
+ */
+ if (BufferIsValid(so->hashso_old_bucket_buf))
+ {
+ _hash_dropbuf(rel, so->hashso_old_bucket_buf);
+ so->hashso_old_bucket_buf = InvalidBuffer;
+ }
+ so->hashso_old_bucket_buf = buf;
+
+ page = BufferGetPage(buf);
+
+ opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+ /*
+ * For backward scan, we need to start scan from
+ * the last overflow page of old bucket till
+ * primary bucket page.
+ */
+ while (BlockNumberIsValid(opaque->hasho_nextblkno))
+ _hash_readnext(rel, &buf, &page, &opaque);
+
+ maxoff = PageGetMaxOffsetNumber(page);
+ offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+ /*
+ * setting hashso_skip_moved_tuples to false
+ * ensures that we don't check for tuples that are
+ * moved by split in old bucket and it also
+ * ensures that we won't retry to scan the old
+ * bucket once the scan for same is finished.
+ */
+ so->hashso_skip_moved_tuples = false;
+ }
+ else
+ {
+ itup = NULL;
+ break; /* exit for-loop */
+ }
}
}
break;
@@ -425,3 +598,39 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
ItemPointerSet(current, blkno, offnum);
return true;
}
+
+/*
+ * _hash_get_oldblk() -- get the block number from which current bucket
+ * is being splitted.
+ */
+static BlockNumber
+_hash_get_oldblk(Relation rel, HashPageOpaque opaque)
+{
+ Bucket curr_bucket;
+ Bucket old_bucket;
+ uint32 mask;
+ Buffer metabuf;
+ HashMetaPage metap;
+ BlockNumber blkno;
+
+ /*
+ * To get the old bucket from the current bucket, we need a mask to modulo
+ * into lower half of table. This mask is stored in meta page as
+ * hashm_lowmask, but here we can't rely on the same, because we need a
+ * value of lowmask that was prevalent at the time when bucket split was
+ * started. lowmask is always equal to last bucket number in lower half
+ * of the table which can be calculate from current bucket.
+ */
+ curr_bucket = opaque->hasho_bucket;
+ mask = (((uint32) 1) << _hash_log2((uint32) curr_bucket) / 2) - 1;
+ old_bucket = curr_bucket & mask;
+
+ metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+ metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+ blkno = BUCKET_TO_BLKNO(metap, old_bucket);
+
+ _hash_relbuf(rel, metabuf);
+
+ return blkno;
+}
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index fa3f9b6..cd40ed7 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -52,6 +52,8 @@ typedef uint32 Bucket;
#define LH_BUCKET_PAGE (1 << 1)
#define LH_BITMAP_PAGE (1 << 2)
#define LH_META_PAGE (1 << 3)
+#define LH_BUCKET_PAGE_SPLIT (1 << 4)
+#define LH_BUCKET_PAGE_HAS_GARBAGE (1 << 5)
typedef struct HashPageOpaqueData
{
@@ -88,12 +90,6 @@ typedef struct HashScanOpaqueData
bool hashso_bucket_valid;
/*
- * If we have a share lock on the bucket, we record it here. When
- * hashso_bucket_blkno is zero, we have no such lock.
- */
- BlockNumber hashso_bucket_blkno;
-
- /*
* We also want to remember which buffer we're currently examining in the
* scan. We keep the buffer pinned (but not locked) across hashgettuple
* calls, in order to avoid doing a ReadBuffer() for every tuple in the
@@ -101,11 +97,23 @@ typedef struct HashScanOpaqueData
*/
Buffer hashso_curbuf;
+ /* remember the buffer associated with primary bucket */
+ Buffer hashso_bucket_buf;
+
+ /*
+ * remember the buffer associated with old primary bucket which is
+ * required during the scan of the bucket for which split is in progress.
+ */
+ Buffer hashso_old_bucket_buf;
+
/* Current position of the scan, as an index TID */
ItemPointerData hashso_curpos;
/* Current position of the scan, as a heap TID */
ItemPointerData hashso_heappos;
+
+ /* Whether scan needs to skip tuples that are moved by split */
+ bool hashso_skip_moved_tuples;
} HashScanOpaqueData;
typedef HashScanOpaqueData *HashScanOpaque;
@@ -176,6 +184,8 @@ typedef HashMetaPageData *HashMetaPage;
sizeof(ItemIdData) - \
MAXALIGN(sizeof(HashPageOpaqueData)))
+#define INDEX_MOVED_BY_SPLIT_MASK 0x2000
+
#define HASH_MIN_FILLFACTOR 10
#define HASH_DEFAULT_FILLFACTOR 75
@@ -299,19 +309,17 @@ extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
Size itemsize, IndexTuple itup);
/* hashovfl.c */
-extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf);
+extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin);
extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf,
BufferAccessStrategy bstrategy);
extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
BlockNumber blkno, ForkNumber forkNum);
extern void _hash_squeezebucket(Relation rel,
Bucket bucket, BlockNumber bucket_blkno,
+ Buffer bucket_buf,
BufferAccessStrategy bstrategy);
/* hashpage.c */
-extern void _hash_getlock(Relation rel, BlockNumber whichlock, int access);
-extern bool _hash_try_getlock(Relation rel, BlockNumber whichlock, int access);
-extern void _hash_droplock(Relation rel, BlockNumber whichlock, int access);
extern Buffer _hash_getbuf(Relation rel, BlockNumber blkno,
int access, int flags);
extern Buffer _hash_getinitbuf(Relation rel, BlockNumber blkno);
On Tue, May 10, 2016 at 5:39 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Incomplete Splits
--------------------------
Incomplete splits can be completed either by vacuum or by insert, as both
need an exclusive lock on the bucket. If vacuum finds the split-in-progress
flag on a bucket, it will complete the split operation; vacuum won't see
this flag while a split is actually in progress on that bucket, because
vacuum needs a cleanup lock and the split retains its pin until the end of
the operation. To make it work for the insert operation, one simple idea is
that if an insert finds the split-in-progress flag, it releases its current
exclusive lock on the bucket and tries to acquire a cleanup lock on the
bucket instead. If it gets the cleanup lock, it can complete the split and
then insert the tuple; otherwise it re-takes an exclusive lock on the
bucket and just performs the insertion. The disadvantage of completing the
split in vacuum is that a split might require new pages, and allocating new
pages during vacuum is not advisable. The disadvantage of doing it at
insert time is that the insert might skip the completion even though only a
scan is in progress on the bucket (a scan also retains a pin on the
bucket), but I think that is not a big deal. The actual completion of the
split can be done in two ways: (a) scan the new bucket and build a hash
table with all of the TIDs you find there; when copying tuples from the old
bucket, first probe the hash table, and if you find a match, just skip that
tuple (idea suggested by Robert Haas offlist), or (b) delete all the tuples
that are marked moved_by_split in the new bucket and perform the split
operation from the beginning using the old bucket.
I have now completed the patch with respect to incomplete splits and
delayed cleanup of garbage tuples. For incomplete splits, I have used
option (a) as mentioned above. An incomplete split is completed when an
insertion sees the split-in-progress flag in a bucket. The second major
thing this new version of the patch achieves is cleanup of garbage tuples,
i.e. the tuples that are left behind in the old bucket during a split.
Currently (in HEAD), as part of a split operation, we clean the tuples from
the old bucket after moving them to the new bucket, since we hold
heavy-weight locks on both the old and the new bucket for the whole split
operation. In the new design, we need a cleanup lock on the old bucket and
an exclusive lock on the new bucket to perform the split operation, and we
don't retain those locks till the end (we release the lock as we move on to
overflow pages). To clean up the tuples we need a cleanup lock on the
bucket, which we might not have at split-end, so I chose to perform the
cleanup of garbage tuples during vacuum and when a re-split of the bucket
happens, as during both of those operations we do hold a cleanup lock. We
can extend the cleanup of garbage to other operations as well if required.
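For reference, the check that identifies such a garbage tuple (one left
behind in the old bucket by a completed split) is just a re-mapping of its
hash key with the current metapage masks, roughly as in the bulk-delete
path of the attached patch (variables as in hashbulkdelete):

/* Illustrative fragment only, following the patch's bulk-delete changes. */
Bucket		bucket;

bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
							  local_metapage.hashm_maxbucket,
							  local_metapage.hashm_highmask,
							  local_metapage.hashm_lowmask);

/*
 * A tuple that no longer maps to this bucket must have been copied to its
 * new bucket by a split that already completed, so it can be deleted here.
 */
if (bucket != cur_bucket)
	deletable[ndeletable++] = offno;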
I have done some performance tests with this new version of the patch and
the results are along the same lines as in my previous e-mail. I have also
done some functional testing of the patch. I think more detailed testing is
required; however, it is better to do that once the design is discussed and
agreed upon.
I have improved the code comments to make the new design clear, but one can
still have questions about the locking decisions taken in the patch. I
think one of the important things to verify in the patch is the locking
strategy used for the different operations. I have changed the heavy-weight
locks to light-weight read and write locks, plus a cleanup lock for vacuum
and the split operation.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
concurrent_hash_index_v2.patch (application/octet-stream)
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index c1122b4..d02539a 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -400,7 +400,6 @@ pgstat_hash_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno,
Buffer buf;
Page page;
- _hash_getlock(rel, blkno, HASH_SHARE);
buf = _hash_getbuf_with_strategy(rel, blkno, HASH_READ, 0, bstrategy);
page = BufferGetPage(buf);
@@ -431,7 +430,6 @@ pgstat_hash_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno,
}
_hash_relbuf(rel, buf);
- _hash_droplock(rel, blkno, HASH_SHARE);
}
/*
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 49a6c81..861dbc8 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -407,12 +407,15 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
so->hashso_bucket_valid = false;
- so->hashso_bucket_blkno = 0;
so->hashso_curbuf = InvalidBuffer;
+ so->hashso_bucket_buf = InvalidBuffer;
+ so->hashso_old_bucket_buf = InvalidBuffer;
/* set position invalid (this will cause _hash_first call) */
ItemPointerSetInvalid(&(so->hashso_curpos));
ItemPointerSetInvalid(&(so->hashso_heappos));
+ so->hashso_skip_moved_tuples = false;
+
scan->opaque = so;
/* register scan in case we change pages it's using */
@@ -436,10 +439,15 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_hash_dropbuf(rel, so->hashso_curbuf);
so->hashso_curbuf = InvalidBuffer;
- /* release lock on bucket, too */
- if (so->hashso_bucket_blkno)
- _hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
- so->hashso_bucket_blkno = 0;
+ /* release pin we hold on old primary bucket */
+ if (BufferIsValid(so->hashso_old_bucket_buf))
+ _hash_dropbuf(rel, so->hashso_old_bucket_buf);
+ so->hashso_old_bucket_buf = InvalidBuffer;
+
+ /* release pin we hold on primary bucket */
+ if (BufferIsValid(so->hashso_bucket_buf))
+ _hash_dropbuf(rel, so->hashso_bucket_buf);
+ so->hashso_bucket_buf = InvalidBuffer;
/* set position invalid (this will cause _hash_first call) */
ItemPointerSetInvalid(&(so->hashso_curpos));
@@ -453,6 +461,8 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
scan->numberOfKeys * sizeof(ScanKeyData));
so->hashso_bucket_valid = false;
}
+
+ so->hashso_skip_moved_tuples = false;
}
/*
@@ -472,10 +482,15 @@ hashendscan(IndexScanDesc scan)
_hash_dropbuf(rel, so->hashso_curbuf);
so->hashso_curbuf = InvalidBuffer;
- /* release lock on bucket, too */
- if (so->hashso_bucket_blkno)
- _hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
- so->hashso_bucket_blkno = 0;
+ /* release pin we hold on old primary bucket */
+ if (BufferIsValid(so->hashso_old_bucket_buf))
+ _hash_dropbuf(rel, so->hashso_old_bucket_buf);
+ so->hashso_old_bucket_buf = InvalidBuffer;
+
+ /* release pin we hold on primary bucket */
+ if (BufferIsValid(so->hashso_bucket_buf))
+ _hash_dropbuf(rel, so->hashso_bucket_buf);
+ so->hashso_bucket_buf = InvalidBuffer;
pfree(so);
scan->opaque = NULL;
@@ -486,6 +501,9 @@ hashendscan(IndexScanDesc scan)
* The set of target tuples is specified via a callback routine that tells
* whether any given heap tuple (identified by ItemPointer) is being deleted.
*
+ * This function also deletes the tuples that are moved by split to another
+ * bucket.
+ *
* Result: a palloc'd struct containing statistical info for VACUUM displays.
*/
IndexBulkDeleteResult *
@@ -530,83 +548,60 @@ loop_top:
{
BlockNumber bucket_blkno;
BlockNumber blkno;
- bool bucket_dirty = false;
+ Buffer bucket_buf;
+ Buffer buf;
+ HashPageOpaque bucket_opaque;
+ Page page;
+ bool bucket_has_garbage = false;
/* Get address of bucket's start page */
bucket_blkno = BUCKET_TO_BLKNO(&local_metapage, cur_bucket);
- /* Exclusive-lock the bucket so we can shrink it */
- _hash_getlock(rel, bucket_blkno, HASH_EXCLUSIVE);
-
/* Shouldn't have any active scans locally, either */
if (_hash_has_active_scan(rel, cur_bucket))
elog(ERROR, "hash index has active scan during VACUUM");
- /* Scan each page in bucket */
blkno = bucket_blkno;
- while (BlockNumberIsValid(blkno))
- {
- Buffer buf;
- Page page;
- HashPageOpaque opaque;
- OffsetNumber offno;
- OffsetNumber maxoffno;
- OffsetNumber deletable[MaxOffsetNumber];
- int ndeletable = 0;
-
- vacuum_delay_point();
- buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
- LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
- info->strategy);
- page = BufferGetPage(buf);
- opaque = (HashPageOpaque) PageGetSpecialPointer(page);
- Assert(opaque->hasho_bucket == cur_bucket);
-
- /* Scan each tuple in page */
- maxoffno = PageGetMaxOffsetNumber(page);
- for (offno = FirstOffsetNumber;
- offno <= maxoffno;
- offno = OffsetNumberNext(offno))
- {
- IndexTuple itup;
- ItemPointer htup;
+ /*
+ * Maintain a cleanup lock on the primary bucket till we scan all the
+ * pages in the bucket. This is required to ensure that we don't delete
+ * tuples which are needed for concurrent scans on buckets where a split
+ * is in progress. Retaining it till the end of the bucket scan ensures
+ * that a concurrent split can't be started on it. In future, we might
+ * want to relax this so that vacuum takes a cleanup lock only for
+ * buckets where a split is in progress; however, for the squeeze phase
+ * we need a cleanup lock anyway, otherwise squeeze would move tuples to
+ * a different location and that could change the order of results.
+ */
+ buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, info->strategy);
+ LockBufferForCleanup(buf);
+ _hash_checkpage(rel, buf, LH_BUCKET_PAGE);
- itup = (IndexTuple) PageGetItem(page,
- PageGetItemId(page, offno));
- htup = &(itup->t_tid);
- if (callback(htup, callback_state))
- {
- /* mark the item for deletion */
- deletable[ndeletable++] = offno;
- tuples_removed += 1;
- }
- else
- num_index_tuples += 1;
- }
+ page = BufferGetPage(buf);
+ bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
- /*
- * Apply deletions and write page if needed, advance to next page.
- */
- blkno = opaque->hasho_nextblkno;
+ /*
+ * If the bucket contains tuples that are moved by split, then we need
+ * to delete such tuples on completion of split. The cleanup lock on
+ * bucket is not sufficient to detect whether a split is complete, as
+ * the previous split could have been interrupted by cancel request or
+ * error.
+ */
+ if (H_HAS_GARBAGE(bucket_opaque) &&
+ !H_INCOMPLETE_SPLIT(bucket_opaque))
+ bucket_has_garbage = true;
- if (ndeletable > 0)
- {
- PageIndexMultiDelete(page, deletable, ndeletable);
- _hash_wrtbuf(rel, buf);
- bucket_dirty = true;
- }
- else
- _hash_relbuf(rel, buf);
- }
+ bucket_buf = buf;
- /* If we deleted anything, try to compact free space */
- if (bucket_dirty)
- _hash_squeezebucket(rel, cur_bucket, bucket_blkno,
- info->strategy);
+ hashbucketcleanup(rel, bucket_buf, blkno, info->strategy,
+ local_metapage.hashm_maxbucket,
+ local_metapage.hashm_highmask,
+ local_metapage.hashm_lowmask, &tuples_removed,
+ &num_index_tuples, bucket_has_garbage, true,
+ callback, callback_state);
- /* Release bucket lock */
- _hash_droplock(rel, bucket_blkno, HASH_EXCLUSIVE);
+ _hash_relbuf(rel, bucket_buf);
/* Advance to next bucket */
cur_bucket++;
@@ -687,6 +682,155 @@ hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
return stats;
}
+/*
+ * Helper function to perform deletion of index entries from a bucket.
+ *
+ * This expects that the caller has acquired a cleanup lock on the target
+ * bucket (primary page of a bucket) and it is the responsibility of the caller to
+ * release that lock.
+ */
+void
+hashbucketcleanup(Relation rel, Buffer bucket_buf,
+ BlockNumber bucket_blkno,
+ BufferAccessStrategy bstrategy,
+ uint32 maxbucket,
+ uint32 highmask, uint32 lowmask,
+ double *tuples_removed,
+ double *num_index_tuples,
+ bool bucket_has_garbage,
+ bool delay,
+ IndexBulkDeleteCallback callback,
+ void *callback_state)
+{
+ BlockNumber blkno;
+ Buffer buf;
+ Bucket cur_bucket;
+ Bucket new_bucket PG_USED_FOR_ASSERTS_ONLY;
+ Page page;
+ bool bucket_dirty = false;
+
+ blkno = bucket_blkno;
+ buf = bucket_buf;
+ page = BufferGetPage(buf);
+ cur_bucket = ((HashPageOpaque) PageGetSpecialPointer(page))->hasho_bucket;
+
+ if (bucket_has_garbage)
+ new_bucket = _hash_get_newbucket(rel, cur_bucket,
+ lowmask, maxbucket);
+
+ /* Scan each page in bucket */
+ for (;;)
+ {
+ HashPageOpaque opaque;
+ OffsetNumber offno;
+ OffsetNumber maxoffno;
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+ bool release_buf = false;
+
+ if (delay)
+ vacuum_delay_point();
+
+ page = BufferGetPage(buf);
+ opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+ /* Scan each tuple in page */
+ maxoffno = PageGetMaxOffsetNumber(page);
+ for (offno = FirstOffsetNumber;
+ offno <= maxoffno;
+ offno = OffsetNumberNext(offno))
+ {
+ IndexTuple itup;
+ ItemPointer htup;
+ Bucket bucket;
+
+ itup = (IndexTuple) PageGetItem(page,
+ PageGetItemId(page, offno));
+ htup = &(itup->t_tid);
+ if (callback && callback(htup, callback_state))
+ {
+ /* mark the item for deletion */
+ deletable[ndeletable++] = offno;
+ tuples_removed += 1;
+ }
+ else if (bucket_has_garbage)
+ {
+ /* delete the tuples that are moved by split. */
+ bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
+ maxbucket,
+ highmask,
+ lowmask);
+ /* mark the item for deletion */
+ if (bucket != cur_bucket)
+ {
+ /*
+ * We expect tuples to either belong to the current bucket or
+ * new_bucket. This is ensured because we don't allow
+ * further splits from a bucket that contains garbage. See
+ * comments in _hash_expandtable.
+ */
+ Assert(bucket == new_bucket);
+ deletable[ndeletable++] = offno;
+ }
+ }
+ else
+ num_index_tuples += 1;
+ }
+
+ /*
+ * We don't release the lock on primary bucket till end of bucket
+ * scan.
+ */
+ if (blkno != bucket_blkno)
+ release_buf = true;
+
+ blkno = opaque->hasho_nextblkno;
+
+ /*
+ * Apply deletions and write page if needed, advance to next page.
+ */
+ if (ndeletable > 0)
+ {
+ PageIndexMultiDelete(page, deletable, ndeletable);
+ if (release_buf)
+ _hash_wrtbuf(rel, buf);
+ else
+ MarkBufferDirty(buf);
+ bucket_dirty = true;
+ }
+ else if (release_buf)
+ _hash_relbuf(rel, buf);
+
+ /* bail out if there are no more pages to scan. */
+ if (!BlockNumberIsValid(blkno))
+ break;
+
+ buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+ LH_OVERFLOW_PAGE,
+ bstrategy);
+ }
+
+ /*
+ * Clear the garbage flag from bucket after deleting the tuples that are
+ * moved by split. We purposefully clear the flag before squeezing the bucket,
+ * so that after a restart, vacuum doesn't again try to delete the
+ * moved-by-split tuples.
+ */
+ if (bucket_has_garbage)
+ {
+ HashPageOpaque bucket_opaque;
+
+ page = BufferGetPage(bucket_buf);
+ bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+ bucket_opaque->hasho_flag &= ~LH_BUCKET_PAGE_HAS_GARBAGE;
+ }
+
+ /* If we deleted anything, try to compact free space */
+ if (bucket_dirty)
+ _hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf,
+ bstrategy);
+}
void
hash_redo(XLogReaderState *record)
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index acd2e64..e7a7b51 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -18,6 +18,8 @@
#include "access/hash.h"
#include "utils/rel.h"
+static void
+ _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Buffer nbuf);
/*
* _hash_doinsert() -- Handle insertion of a single index tuple.
@@ -28,7 +30,8 @@
void
_hash_doinsert(Relation rel, IndexTuple itup)
{
- Buffer buf;
+ Buffer buf = InvalidBuffer;
+ Buffer bucket_buf;
Buffer metabuf;
HashMetaPage metap;
BlockNumber blkno;
@@ -70,51 +73,136 @@ _hash_doinsert(Relation rel, IndexTuple itup)
errhint("Values larger than a buffer page cannot be indexed.")));
/*
- * Loop until we get a lock on the correct target bucket.
+ * Conditionally get the lock on the primary bucket page for insertion while
+ * holding the lock on the meta page. If we have to wait, then release the
+ * meta page lock and retry it the hard way.
*/
- for (;;)
- {
- /*
- * Compute the target bucket number, and convert to block number.
- */
- bucket = _hash_hashkey2bucket(hashkey,
- metap->hashm_maxbucket,
- metap->hashm_highmask,
- metap->hashm_lowmask);
+ bucket = _hash_hashkey2bucket(hashkey,
+ metap->hashm_maxbucket,
+ metap->hashm_highmask,
+ metap->hashm_lowmask);
- blkno = BUCKET_TO_BLKNO(metap, bucket);
+ blkno = BUCKET_TO_BLKNO(metap, bucket);
- /* Release metapage lock, but keep pin. */
+ /* Fetch the primary bucket page for the bucket */
+ buf = ReadBuffer(rel, blkno);
+ if (!ConditionalLockBuffer(buf))
+ {
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+ LockBuffer(buf, HASH_WRITE);
+ _hash_checkpage(rel, buf, LH_BUCKET_PAGE);
+ _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+ oldblkno = blkno;
+ retry = true;
+ }
+ else
+ {
+ _hash_checkpage(rel, buf, LH_BUCKET_PAGE);
_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+ }
+ if (retry)
+ {
/*
- * If the previous iteration of this loop locked what is still the
- * correct target bucket, we are done. Otherwise, drop any old lock
- * and lock what now appears to be the correct bucket.
+ * Loop until we get a lock on the correct target bucket. We get the
+ * lock on primary bucket page and retain the pin on it during insert
+ * operation to prevent the concurrent splits. Retaining pin on a
+ * primary bucket page ensures that split can't happen as it needs to
+ * acquire the cleanup lock on primary bucket page. Acquiring lock on
+ * primary bucket and rechecking if it is a target bucket is mandatory
+ * as otherwise a concurrent split might cause this insertion to fall
+ * in wrong bucket.
*/
- if (retry)
+ for (;;)
{
- if (oldblkno == blkno)
- break;
- _hash_droplock(rel, oldblkno, HASH_SHARE);
- }
- _hash_getlock(rel, blkno, HASH_SHARE);
+ /*
+ * Compute the target bucket number, and convert to block number.
+ */
+ bucket = _hash_hashkey2bucket(hashkey,
+ metap->hashm_maxbucket,
+ metap->hashm_highmask,
+ metap->hashm_lowmask);
- /*
- * Reacquire metapage lock and check that no bucket split has taken
- * place while we were awaiting the bucket lock.
- */
- _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
- oldblkno = blkno;
- retry = true;
+ blkno = BUCKET_TO_BLKNO(metap, bucket);
+
+ /* Release metapage lock, but keep pin. */
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+ /*
+ * If the previous iteration of this loop locked what is still the
+ * correct target bucket, we are done. Otherwise, drop any old
+ * lock and lock what now appears to be the correct bucket.
+ */
+ if (retry)
+ {
+ if (oldblkno == blkno)
+ break;
+ _hash_relbuf(rel, buf);
+ }
+
+ /* Fetch the primary bucket page for the bucket */
+ buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
+
+ /*
+ * Reacquire metapage lock and check that no bucket split has
+ * taken place while we were awaiting the bucket lock.
+ */
+ _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+ oldblkno = blkno;
+ retry = true;
+ }
}
- /* Fetch the primary bucket page for the bucket */
- buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
+ /* remember the primary bucket buffer to release the pin on it at end. */
+ bucket_buf = buf;
+
page = BufferGetPage(buf);
pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
Assert(pageopaque->hasho_bucket == bucket);
+ /*
+ * If there is any pending split, finish it before proceeding with the
+ * insertion, since the insertion itself can cause a new split. We don't
+ * want to allow a split from a bucket that has a pending split, as there
+ * is no apparent benefit in doing so, and it would complicate the code to
+ * finish a split that involves multiple buckets, considering the case
+ * where the new split can also fail.
+ */
+ if (H_NEW_INCOMPLETE_SPLIT(pageopaque))
+ {
+ BlockNumber oblkno;
+ Buffer obuf;
+
+ oblkno = _hash_get_oldblk(rel, pageopaque);
+
+ /* Fetch the primary bucket page for the bucket */
+ obuf = _hash_getbuf(rel, oblkno, HASH_READ, LH_BUCKET_PAGE);
+
+ _hash_finish_split(rel, metabuf, obuf, buf);
+
+ /*
+ * release the buffer here as the insertion will happen in new bucket.
+ */
+ _hash_relbuf(rel, obuf);
+ }
+ else if (H_OLD_INCOMPLETE_SPLIT(pageopaque))
+ {
+ BlockNumber nblkno;
+ Buffer nbuf;
+
+ nblkno = _hash_get_newblk(rel, pageopaque);
+
+ /* Fetch the primary bucket page for the bucket */
+ nbuf = _hash_getbuf(rel, nblkno, HASH_WRITE, LH_BUCKET_PAGE);
+
+ _hash_finish_split(rel, metabuf, buf, nbuf);
+
+ /*
+ * release the buffer here as the insertion will happen in old bucket.
+ */
+ _hash_relbuf(rel, nbuf);
+ }
+
/* Do the insertion */
while (PageGetFreeSpace(page) < itemsz)
{
@@ -127,14 +215,23 @@ _hash_doinsert(Relation rel, IndexTuple itup)
{
/*
* ovfl page exists; go get it. if it doesn't have room, we'll
- * find out next pass through the loop test above.
+ * find out next pass through the loop test above. Retain the pin
+ * if it is a primary bucket.
*/
- _hash_relbuf(rel, buf);
+ if (pageopaque->hasho_flag & LH_BUCKET_PAGE)
+ _hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+ else
+ _hash_relbuf(rel, buf);
buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
page = BufferGetPage(buf);
}
else
{
+ bool retain_pin = false;
+
+ /* page flags must be accessed before releasing lock on a page. */
+ retain_pin = pageopaque->hasho_flag & LH_BUCKET_PAGE;
+
/*
* we're at the end of the bucket chain and we haven't found a
* page with enough room. allocate a new overflow page.
@@ -144,7 +241,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
/* chain to a new overflow page */
- buf = _hash_addovflpage(rel, metabuf, buf);
+ buf = _hash_addovflpage(rel, metabuf, buf, retain_pin);
page = BufferGetPage(buf);
/* should fit now, given test above */
@@ -158,11 +255,13 @@ _hash_doinsert(Relation rel, IndexTuple itup)
/* found page with enough space, so add the item here */
(void) _hash_pgaddtup(rel, buf, itemsz, itup);
- /* write and release the modified page */
+ /*
+ * write and release the modified page and ensure to release the pin on
+ * primary page.
+ */
_hash_wrtbuf(rel, buf);
-
- /* We can drop the bucket lock now */
- _hash_droplock(rel, blkno, HASH_SHARE);
+ if (buf != bucket_buf)
+ _hash_dropbuf(rel, bucket_buf);
/*
* Write-lock the metapage so we can increment the tuple count. After
@@ -188,6 +287,127 @@ _hash_doinsert(Relation rel, IndexTuple itup)
}
/*
+ * _hash_finish_split() -- Finish the previously interrupted split operation
+ *
+ * To complete the split operation, we build a hash table of the TIDs already
+ * present in the new bucket, which the split operation then uses to skip
+ * tuples that were already moved before the split was interrupted.
+ *
+ * The caller must hold a pin, but no lock, on the metapage buffer.
+ * The buffer is returned in the same state. (The metapage is only
+ * touched if it becomes necessary to add or remove overflow pages.)
+ *
+ * 'obuf' and 'nbuf' must be locked by the caller, which is also responsible
+ * for unlocking them.
+ */
+static void
+_hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Buffer nbuf)
+{
+ HASHCTL hash_ctl;
+ HTAB *tidhtab;
+ Buffer bucket_nbuf;
+ Page opage;
+ Page npage;
+ HashPageOpaque opageopaque;
+ HashPageOpaque npageopaque;
+ HashMetaPage metap;
+ Bucket obucket;
+ Bucket nbucket;
+ uint32 maxbucket;
+ uint32 highmask;
+ uint32 lowmask;
+ bool found;
+
+ /* Initialize hash tables used to track TIDs */
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(ItemPointerData);
+ hash_ctl.entrysize = sizeof(ItemPointerData);
+ hash_ctl.hcxt = CurrentMemoryContext;
+
+ tidhtab =
+ hash_create("bucket ctids",
+ 256, /* arbitrary initial size */
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+ /*
+ * Scan the new bucket and build hash table of TIDs
+ */
+ bucket_nbuf = nbuf;
+ npage = BufferGetPage(nbuf);
+ npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+ for (;;)
+ {
+ BlockNumber nblkno;
+ OffsetNumber noffnum;
+ OffsetNumber nmaxoffnum;
+
+ /* Scan each tuple in new page */
+ nmaxoffnum = PageGetMaxOffsetNumber(npage);
+ for (noffnum = FirstOffsetNumber;
+ noffnum <= nmaxoffnum;
+ noffnum = OffsetNumberNext(noffnum))
+ {
+ IndexTuple itup;
+
+ /* Fetch the item's TID and insert it in hash table. */
+ itup = (IndexTuple) PageGetItem(npage,
+ PageGetItemId(npage, noffnum));
+
+ (void) hash_search(tidhtab, &itup->t_tid, HASH_ENTER, &found);
+
+ Assert(!found);
+ }
+
+ nblkno = npageopaque->hasho_nextblkno;
+
+ /*
+ * release our lock without modifying the buffer, and make sure to
+ * retain the pin on the primary bucket.
+ */
+ if (nbuf == bucket_nbuf)
+ _hash_chgbufaccess(rel, nbuf, HASH_READ, HASH_NOLOCK);
+ else
+ _hash_relbuf(rel, nbuf);
+
+ /* Exit loop if no more overflow pages in new bucket */
+ if (!BlockNumberIsValid(nblkno))
+ break;
+
+ /* Else, advance to next page */
+ nbuf = _hash_getbuf(rel, nblkno, HASH_READ, LH_OVERFLOW_PAGE);
+ npage = BufferGetPage(nbuf);
+ npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+ }
+
+ /* Get the metapage info */
+ _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+ metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+ maxbucket = metap->hashm_maxbucket;
+ highmask = metap->hashm_highmask;
+ lowmask = metap->hashm_lowmask;
+
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+ _hash_chgbufaccess(rel, bucket_nbuf, HASH_NOLOCK, HASH_WRITE);
+
+ npage = BufferGetPage(bucket_nbuf);
+ npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+ nbucket = npageopaque->hasho_bucket;
+
+ opage = BufferGetPage(obuf);
+ opageopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+ obucket = opageopaque->hasho_bucket;
+
+ _hash_splitbucket_guts(rel, metabuf, obucket,
+ nbucket, obuf, bucket_nbuf, tidhtab,
+ maxbucket, highmask, lowmask);
+
+ hash_destroy(tidhtab);
+}
+
+/*
* _hash_pgaddtup() -- add a tuple to a particular page in the index.
*
* This routine adds the tuple to the page as requested; it does not write out
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index db3e268..760563a 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -82,23 +82,20 @@ blkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
*
* On entry, the caller must hold a pin but no lock on 'buf'. The pin is
* dropped before exiting (we assume the caller is not interested in 'buf'
- * anymore). The returned overflow page will be pinned and write-locked;
- * it is guaranteed to be empty.
+ * anymore) unless asked to retain it. The pin will be retained only for the
+ * primary bucket. The returned overflow page will be pinned and
+ * write-locked; it is guaranteed to be empty.
*
* The caller must hold a pin, but no lock, on the metapage buffer.
* That buffer is returned in the same state.
*
- * The caller must hold at least share lock on the bucket, to ensure that
- * no one else tries to compact the bucket meanwhile. This guarantees that
- * 'buf' won't stop being part of the bucket while it's unlocked.
- *
* NB: since this could be executed concurrently by multiple processes,
* one should not assume that the returned overflow page will be the
* immediate successor of the originally passed 'buf'. Additional overflow
* pages might have been added to the bucket chain in between.
*/
Buffer
-_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
+_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
{
Buffer ovflbuf;
Page page;
@@ -131,7 +128,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
break;
/* we assume we do not need to write the unmodified page */
- _hash_relbuf(rel, buf);
+ if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+ _hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+ else
+ _hash_relbuf(rel, buf);
buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
}
@@ -149,7 +149,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
/* logically chain overflow page to previous page */
pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
- _hash_wrtbuf(rel, buf);
+ if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+ _hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+ else
+ _hash_wrtbuf(rel, buf);
return ovflbuf;
}
@@ -370,11 +373,11 @@ _hash_firstfreebit(uint32 map)
* in the bucket, or InvalidBlockNumber if no following page.
*
* NB: caller must not hold lock on metapage, nor on either page that's
- * adjacent in the bucket chain. The caller had better hold exclusive lock
- * on the bucket, too.
+ * adjacent in the bucket chain, except for the primary bucket. The caller had
+ * better hold cleanup lock on the primary bucket.
*/
BlockNumber
-_hash_freeovflpage(Relation rel, Buffer ovflbuf,
+_hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
BufferAccessStrategy bstrategy)
{
HashMetaPage metap;
@@ -413,22 +416,41 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf,
/*
* Fix up the bucket chain. this is a doubly-linked list, so we must fix
* up the bucket chain members behind and ahead of the overflow page being
- * deleted. No concurrency issues since we hold exclusive lock on the
- * entire bucket.
+ * deleted. No concurrency issues since we hold the cleanup lock on
+ * primary bucket. We don't need to acquire the buffer lock to fix the
+ * primary bucket, as we already have that lock.
*/
if (BlockNumberIsValid(prevblkno))
{
- Buffer prevbuf = _hash_getbuf_with_strategy(rel,
- prevblkno,
- HASH_WRITE,
+ if (prevblkno == bucket_blkno)
+ {
+ Buffer prevbuf = ReadBufferExtended(rel, MAIN_FORKNUM,
+ prevblkno,
+ RBM_NORMAL,
+ bstrategy);
+
+ Page prevpage = BufferGetPage(prevbuf);
+ HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+ Assert(prevopaque->hasho_bucket == bucket);
+ prevopaque->hasho_nextblkno = nextblkno;
+ MarkBufferDirty(prevbuf);
+ ReleaseBuffer(prevbuf);
+ }
+ else
+ {
+ Buffer prevbuf = _hash_getbuf_with_strategy(rel,
+ prevblkno,
+ HASH_WRITE,
LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
- bstrategy);
- Page prevpage = BufferGetPage(prevbuf);
- HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+ bstrategy);
+ Page prevpage = BufferGetPage(prevbuf);
+ HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
- Assert(prevopaque->hasho_bucket == bucket);
- prevopaque->hasho_nextblkno = nextblkno;
- _hash_wrtbuf(rel, prevbuf);
+ Assert(prevopaque->hasho_bucket == bucket);
+ prevopaque->hasho_nextblkno = nextblkno;
+ _hash_wrtbuf(rel, prevbuf);
+ }
}
if (BlockNumberIsValid(nextblkno))
{
@@ -570,7 +592,7 @@ _hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
* required that to be true on entry as well, but it's a lot easier for
* callers to leave empty overflow pages and let this guy clean it up.
*
- * Caller must hold exclusive lock on the target bucket. This allows
+ * Caller must hold cleanup lock on the target bucket. This allows
* us to safely lock multiple pages in the bucket.
*
* Since this function is invoked in VACUUM, we provide an access strategy
@@ -580,6 +602,7 @@ void
_hash_squeezebucket(Relation rel,
Bucket bucket,
BlockNumber bucket_blkno,
+ Buffer bucket_buf,
BufferAccessStrategy bstrategy)
{
BlockNumber wblkno;
@@ -591,27 +614,22 @@ _hash_squeezebucket(Relation rel,
HashPageOpaque wopaque;
HashPageOpaque ropaque;
bool wbuf_dirty;
+ bool release_buf = false;
/*
* start squeezing into the base bucket page.
*/
wblkno = bucket_blkno;
- wbuf = _hash_getbuf_with_strategy(rel,
- wblkno,
- HASH_WRITE,
- LH_BUCKET_PAGE,
- bstrategy);
+ wbuf = bucket_buf;
wpage = BufferGetPage(wbuf);
wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
/*
- * if there aren't any overflow pages, there's nothing to squeeze.
+ * if there aren't any overflow pages, there's nothing to squeeze. The caller
+ * is responsible for releasing the lock on the primary bucket.
*/
if (!BlockNumberIsValid(wopaque->hasho_nextblkno))
- {
- _hash_relbuf(rel, wbuf);
return;
- }
/*
* Find the last page in the bucket chain by starting at the base bucket
@@ -669,12 +687,17 @@ _hash_squeezebucket(Relation rel,
{
Assert(!PageIsEmpty(wpage));
+ if (wblkno != bucket_blkno)
+ release_buf = true;
+
wblkno = wopaque->hasho_nextblkno;
Assert(BlockNumberIsValid(wblkno));
- if (wbuf_dirty)
+ if (wbuf_dirty && release_buf)
_hash_wrtbuf(rel, wbuf);
- else
+ else if (wbuf_dirty)
+ MarkBufferDirty(wbuf);
+ else if (release_buf)
_hash_relbuf(rel, wbuf);
/* nothing more to do if we reached the read page */
@@ -700,6 +723,7 @@ _hash_squeezebucket(Relation rel,
wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
Assert(wopaque->hasho_bucket == bucket);
wbuf_dirty = false;
+ release_buf = false;
}
/*
@@ -733,19 +757,25 @@ _hash_squeezebucket(Relation rel,
/* are we freeing the page adjacent to wbuf? */
if (rblkno == wblkno)
{
- /* yes, so release wbuf lock first */
- if (wbuf_dirty)
+ if (wblkno != bucket_blkno)
+ release_buf = true;
+
+ /* yes, so release wbuf lock first if needed */
+ if (wbuf_dirty && release_buf)
_hash_wrtbuf(rel, wbuf);
- else
+ else if (wbuf_dirty)
+ MarkBufferDirty(wbuf);
+ else if (release_buf)
_hash_relbuf(rel, wbuf);
+
/* free this overflow page (releases rbuf) */
- _hash_freeovflpage(rel, rbuf, bstrategy);
+ _hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
/* done */
return;
}
/* free this overflow page, then get the previous one */
- _hash_freeovflpage(rel, rbuf, bstrategy);
+ _hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
rbuf = _hash_getbuf_with_strategy(rel,
rblkno,
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 178463f..83007ac 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -38,7 +38,7 @@ static bool _hash_alloc_buckets(Relation rel, BlockNumber firstblock,
uint32 nblocks);
static void _hash_splitbucket(Relation rel, Buffer metabuf,
Bucket obucket, Bucket nbucket,
- BlockNumber start_oblkno,
+ Buffer obuf,
Buffer nbuf,
uint32 maxbucket,
uint32 highmask, uint32 lowmask);
@@ -55,46 +55,6 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
/*
- * _hash_getlock() -- Acquire an lmgr lock.
- *
- * 'whichlock' should the block number of a bucket's primary bucket page to
- * acquire the per-bucket lock. (See README for details of the use of these
- * locks.)
- *
- * 'access' must be HASH_SHARE or HASH_EXCLUSIVE.
- */
-void
-_hash_getlock(Relation rel, BlockNumber whichlock, int access)
-{
- if (USELOCKING(rel))
- LockPage(rel, whichlock, access);
-}
-
-/*
- * _hash_try_getlock() -- Acquire an lmgr lock, but only if it's free.
- *
- * Same as above except we return FALSE without blocking if lock isn't free.
- */
-bool
-_hash_try_getlock(Relation rel, BlockNumber whichlock, int access)
-{
- if (USELOCKING(rel))
- return ConditionalLockPage(rel, whichlock, access);
- else
- return true;
-}
-
-/*
- * _hash_droplock() -- Release an lmgr lock.
- */
-void
-_hash_droplock(Relation rel, BlockNumber whichlock, int access)
-{
- if (USELOCKING(rel))
- UnlockPage(rel, whichlock, access);
-}
-
-/*
* _hash_getbuf() -- Get a buffer by block number for read or write.
*
* 'access' must be HASH_READ, HASH_WRITE, or HASH_NOLOCK.
@@ -489,9 +449,11 @@ _hash_pageinit(Page page, Size size)
/*
* Attempt to expand the hash table by creating one new bucket.
*
- * This will silently do nothing if it cannot get the needed locks.
+ * This will silently do nothing if there are active scans in our own
+ * backend or if we can't get a cleanup lock on the old bucket.
*
- * The caller should hold no locks on the hash index.
+ * We do remove tuples from the old bucket if any are left over from a
+ * previous split.
*
* The caller must hold a pin, but no lock, on the metapage buffer.
* The buffer is returned in the same state.
@@ -506,10 +468,15 @@ _hash_expandtable(Relation rel, Buffer metabuf)
BlockNumber start_oblkno;
BlockNumber start_nblkno;
Buffer buf_nblkno;
+ Buffer buf_oblkno;
+ Page opage;
+ HashPageOpaque oopaque;
uint32 maxbucket;
uint32 highmask;
uint32 lowmask;
+restart_expand:
+
/*
* Write-lock the meta page. It used to be necessary to acquire a
* heavyweight lock to begin a split, but that is no longer required.
@@ -548,11 +515,15 @@ _hash_expandtable(Relation rel, Buffer metabuf)
goto fail;
/*
- * Determine which bucket is to be split, and attempt to lock the old
- * bucket. If we can't get the lock, give up.
+ * Determine which bucket is to be split, and attempt to take cleanup lock
+ * on the old bucket. If we can't get the lock, give up.
*
- * The lock protects us against other backends, but not against our own
- * backend. Must check for active scans separately.
+ * The cleanup lock protects us against other backends, but not against
+ * our own backend. Must check for active scans separately.
+ *
+ * The cleanup lock is mainly to protect the split from concurrent
+ * inserts; however, if there is any pending scan, the split will give up,
+ * which is not ideal but is harmless.
*/
new_bucket = metap->hashm_maxbucket + 1;
@@ -563,11 +534,50 @@ _hash_expandtable(Relation rel, Buffer metabuf)
if (_hash_has_active_scan(rel, old_bucket))
goto fail;
- if (!_hash_try_getlock(rel, start_oblkno, HASH_EXCLUSIVE))
+ buf_oblkno = ReadBuffer(rel, start_oblkno);
+ if (!ConditionalLockBufferForCleanup(buf_oblkno))
+ {
+ ReleaseBuffer(buf_oblkno);
goto fail;
+ }
+ _hash_checkpage(rel, buf_oblkno, LH_BUCKET_PAGE);
+
+ opage = BufferGetPage(buf_oblkno);
+ oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+ /* we don't expect any pending split at this stage. */
+ Assert(!H_INCOMPLETE_SPLIT(oopaque));
+
+ /*
+ * Clean the tuples left over from the previous split. This operation
+ * requires a cleanup lock, and we already have one on the old bucket, so
+ * let's do it. We also don't want to allow further splits from the bucket
+ * till the garbage of the previous split is cleaned. This has two
+ * advantages: first, it helps to avoid bloat due to garbage, and second,
+ * during cleanup of a bucket we are always sure that the garbage tuples
+ * belong to the most recently split bucket. On the contrary, if we allowed
+ * cleanup of a bucket after the meta page is updated to indicate the new
+ * split but before the actual split, the cleanup operation would not be
+ * able to decide whether a tuple has been moved to the newly created
+ * bucket, and could end up deleting such tuples.
+ */
+ if (H_HAS_GARBAGE(oopaque))
+ {
+ /* Release the metapage lock. */
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+ hashbucketcleanup(rel, buf_oblkno, start_oblkno, NULL,
+ metap->hashm_maxbucket, metap->hashm_highmask,
+ metap->hashm_lowmask, NULL,
+ NULL, true, false, NULL, NULL);
+
+ _hash_relbuf(rel, buf_oblkno);
+
+ goto restart_expand;
+ }
/*
- * Likewise lock the new bucket (should never fail).
+ * There shouldn't be any active scan on new bucket.
*
* Note: it is safe to compute the new bucket's blkno here, even though we
* may still need to update the BUCKET_TO_BLKNO mapping. This is because
@@ -579,9 +589,6 @@ _hash_expandtable(Relation rel, Buffer metabuf)
if (_hash_has_active_scan(rel, new_bucket))
elog(ERROR, "scan in progress on supposedly new bucket");
- if (!_hash_try_getlock(rel, start_nblkno, HASH_EXCLUSIVE))
- elog(ERROR, "could not get lock on supposedly new bucket");
-
/*
* If the split point is increasing (hashm_maxbucket's log base 2
* increases), we need to allocate a new batch of bucket pages.
@@ -600,8 +607,7 @@ _hash_expandtable(Relation rel, Buffer metabuf)
if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
{
/* can't split due to BlockNumber overflow */
- _hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
- _hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
+ _hash_relbuf(rel, buf_oblkno);
goto fail;
}
}
@@ -609,7 +615,8 @@ _hash_expandtable(Relation rel, Buffer metabuf)
/*
* Physically allocate the new bucket's primary page. We want to do this
* before changing the metapage's mapping info, in case we can't get the
- * disk space.
+ * disk space. We don't need to take a cleanup lock on the new bucket, as no
+ * other backend can find this bucket until the meta page is updated.
*/
buf_nblkno = _hash_getnewbuf(rel, start_nblkno, MAIN_FORKNUM);
@@ -665,13 +672,9 @@ _hash_expandtable(Relation rel, Buffer metabuf)
/* Relocate records to the new bucket */
_hash_splitbucket(rel, metabuf,
old_bucket, new_bucket,
- start_oblkno, buf_nblkno,
+ buf_oblkno, buf_nblkno,
maxbucket, highmask, lowmask);
- /* Release bucket locks, allowing others to access them */
- _hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
- _hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
-
return;
/* Here if decide not to split or fail to acquire old bucket lock */
@@ -745,6 +748,10 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
* The buffer is returned in the same state. (The metapage is only
* touched if it becomes necessary to add or remove overflow pages.)
*
+ * Split needs to hold pins on the primary bucket pages of both the old and
+ * new buckets till the end of the operation. This is to prevent vacuum from
+ * starting while the split is in progress.
+ *
* In addition, the caller must have created the new bucket's base page,
* which is passed in buffer nbuf, pinned and write-locked. That lock and
* pin are released here. (The API is set up this way because we must do
@@ -756,37 +763,87 @@ _hash_splitbucket(Relation rel,
Buffer metabuf,
Bucket obucket,
Bucket nbucket,
- BlockNumber start_oblkno,
+ Buffer obuf,
Buffer nbuf,
uint32 maxbucket,
uint32 highmask,
uint32 lowmask)
{
- Buffer obuf;
Page opage;
Page npage;
HashPageOpaque oopaque;
HashPageOpaque nopaque;
- /*
- * It should be okay to simultaneously write-lock pages from each bucket,
- * since no one else can be trying to acquire buffer lock on pages of
- * either bucket.
- */
- obuf = _hash_getbuf(rel, start_oblkno, HASH_WRITE, LH_BUCKET_PAGE);
opage = BufferGetPage(obuf);
oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+ /*
+ * Mark the old bucket to indicate that split is in progress and it has
+ * deletable tuples. At operation end, we clear split in progress flag and
+ * vacuum will clear page_has_garbage flag after deleting such tuples.
+ */
+ oopaque->hasho_flag |= LH_BUCKET_PAGE_HAS_GARBAGE | LH_BUCKET_OLD_PAGE_SPLIT;
+
npage = BufferGetPage(nbuf);
- /* initialize the new bucket's primary page */
+ /*
+ * initialize the new bucket's primary page and mark it to indicate that
+ * split is in progress.
+ */
nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
nopaque->hasho_prevblkno = InvalidBlockNumber;
nopaque->hasho_nextblkno = InvalidBlockNumber;
nopaque->hasho_bucket = nbucket;
- nopaque->hasho_flag = LH_BUCKET_PAGE;
+ nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_NEW_PAGE_SPLIT;
nopaque->hasho_page_id = HASHO_PAGE_ID;
+ _hash_splitbucket_guts(rel, metabuf, obucket,
+ nbucket, obuf, nbuf, NULL,
+ maxbucket, highmask, lowmask);
+
+ /* all done, now release the locks and pins on primary buckets. */
+ _hash_relbuf(rel, obuf);
+ _hash_relbuf(rel, nbuf);
+}
+
+/*
+ * _hash_splitbucket_guts -- Helper function to perform the split operation
+ *
+ * This routine is used to partition the tuples between the old and new bucket
+ * and also to finish incomplete split operations. To finish a previously
+ * interrupted split operation, the caller needs to fill htab. If htab is set,
+ * we skip moving tuples that exist in htab; otherwise, a NULL htab indicates
+ * movement of all the tuples that belong to the new bucket.
+ *
+ * Caller needs to lock and unlock the old and new primary buckets.
+ */
+void
+_hash_splitbucket_guts(Relation rel,
+ Buffer metabuf,
+ Bucket obucket,
+ Bucket nbucket,
+ Buffer obuf,
+ Buffer nbuf,
+ HTAB *htab,
+ uint32 maxbucket,
+ uint32 highmask,
+ uint32 lowmask)
+{
+ Buffer bucket_obuf;
+ Buffer bucket_nbuf;
+ Page opage;
+ Page npage;
+ HashPageOpaque oopaque;
+ HashPageOpaque nopaque;
+
+ bucket_obuf = obuf;
+ opage = BufferGetPage(obuf);
+ oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+ bucket_nbuf = nbuf;
+ npage = BufferGetPage(nbuf);
+ nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+
/*
* Partition the tuples in the old bucket between the old bucket and the
* new bucket, advancing along the old bucket's overflow bucket chain and
@@ -798,8 +855,6 @@ _hash_splitbucket(Relation rel,
BlockNumber oblkno;
OffsetNumber ooffnum;
OffsetNumber omaxoffnum;
- OffsetNumber deletable[MaxOffsetNumber];
- int ndeletable = 0;
/* Scan each tuple in old page */
omaxoffnum = PageGetMaxOffsetNumber(opage);
@@ -810,18 +865,45 @@ _hash_splitbucket(Relation rel,
IndexTuple itup;
Size itemsz;
Bucket bucket;
+ bool found = false;
/*
- * Fetch the item's hash key (conveniently stored in the item) and
- * determine which bucket it now belongs in.
+ * Before inserting the tuple, probe the hash table containing TIDs
+ * of tuples belonging to the new bucket; if we find a match, skip
+ * that tuple, else fetch the item's hash key (conveniently stored
+ * in the item) and determine which bucket it now belongs in.
*/
itup = (IndexTuple) PageGetItem(opage,
PageGetItemId(opage, ooffnum));
+
+ if (htab)
+ (void) hash_search(htab, &itup->t_tid, HASH_FIND, &found);
+
+ if (found)
+ continue;
+
bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
maxbucket, highmask, lowmask);
if (bucket == nbucket)
{
+ Size itupsize = 0;
+ IndexTuple new_itup;
+
+ /*
+ * make a copy of index tuple as we have to scribble on it.
+ */
+ new_itup = CopyIndexTuple(itup);
+
+ /*
+ * mark the index tuple as moved by split; such tuples are
+ * skipped by scans while a split is in progress for the bucket.
+ */
+ itupsize = new_itup->t_info & INDEX_SIZE_MASK;
+ new_itup->t_info &= ~INDEX_SIZE_MASK;
+ new_itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;
+ new_itup->t_info |= itupsize;
+
/*
* insert the tuple into the new bucket. if it doesn't fit on
* the current page in the new bucket, we must allocate a new
@@ -832,17 +914,25 @@ _hash_splitbucket(Relation rel,
* only partially complete, meaning the index is corrupt,
* since searches may fail to find entries they should find.
*/
- itemsz = IndexTupleDSize(*itup);
+ itemsz = IndexTupleDSize(*new_itup);
itemsz = MAXALIGN(itemsz);
if (PageGetFreeSpace(npage) < itemsz)
{
+ bool retain_pin = false;
+
+ /*
+ * page flags must be accessed before releasing lock on a
+ * page.
+ */
+ retain_pin = nopaque->hasho_flag & LH_BUCKET_PAGE;
+
/* write out nbuf and drop lock, but keep pin */
_hash_chgbufaccess(rel, nbuf, HASH_WRITE, HASH_NOLOCK);
/* chain to a new overflow page */
- nbuf = _hash_addovflpage(rel, metabuf, nbuf);
+ nbuf = _hash_addovflpage(rel, metabuf, nbuf, retain_pin);
npage = BufferGetPage(nbuf);
- /* we don't need nopaque within the loop */
+ nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
}
/*
@@ -852,12 +942,10 @@ _hash_splitbucket(Relation rel,
* Possible future improvement: accumulate all the items for
* the new page and qsort them before insertion.
*/
- (void) _hash_pgaddtup(rel, nbuf, itemsz, itup);
+ (void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
- /*
- * Mark tuple for deletion from old page.
- */
- deletable[ndeletable++] = ooffnum;
+ /* be tidy */
+ pfree(new_itup);
}
else
{
@@ -870,15 +958,9 @@ _hash_splitbucket(Relation rel,
oblkno = oopaque->hasho_nextblkno;
- /*
- * Done scanning this old page. If we moved any tuples, delete them
- * from the old page.
- */
- if (ndeletable > 0)
- {
- PageIndexMultiDelete(opage, deletable, ndeletable);
- _hash_wrtbuf(rel, obuf);
- }
+ /* retain the pin on the old primary bucket */
+ if (obuf == bucket_obuf)
+ _hash_chgbufaccess(rel, obuf, HASH_READ, HASH_NOLOCK);
else
_hash_relbuf(rel, obuf);
@@ -887,18 +969,42 @@ _hash_splitbucket(Relation rel,
break;
/* Else, advance to next old page */
- obuf = _hash_getbuf(rel, oblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
+ obuf = _hash_getbuf(rel, oblkno, HASH_READ, LH_OVERFLOW_PAGE);
opage = BufferGetPage(obuf);
oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
}
/*
* We're at the end of the old bucket chain, so we're done partitioning
- * the tuples. Before quitting, call _hash_squeezebucket to ensure the
- * tuples remaining in the old bucket (including the overflow pages) are
- * packed as tightly as possible. The new bucket is already tight.
+ * the tuples. Mark the old and new buckets to indicate split is
+ * finished.
+ */
+ if (!(nopaque->hasho_flag & LH_BUCKET_PAGE))
+ _hash_wrtbuf(rel, nbuf);
+
+ _hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
+ opage = BufferGetPage(bucket_obuf);
+ oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+ /*
+ * need to acquire the write lock only if current bucket is not a primary
+ * bucket, otherwise we already have a lock on it.
*/
- _hash_wrtbuf(rel, nbuf);
+ if (!(nopaque->hasho_flag & LH_BUCKET_PAGE))
+ {
+ _hash_chgbufaccess(rel, bucket_nbuf, HASH_NOLOCK, HASH_WRITE);
+ npage = BufferGetPage(bucket_nbuf);
+ nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+ }
- _hash_squeezebucket(rel, obucket, start_oblkno, NULL);
+ /* indicate that split is finished */
+ oopaque->hasho_flag &= ~LH_BUCKET_OLD_PAGE_SPLIT;
+ nopaque->hasho_flag &= ~LH_BUCKET_NEW_PAGE_SPLIT;
+
+ /*
+ * now write the buffers; we don't release the locks here, as the caller
+ * is responsible for releasing them.
+ */
+ MarkBufferDirty(bucket_obuf);
+ MarkBufferDirty(bucket_nbuf);
}
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 4825558..d87cf8b 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -72,7 +72,23 @@ _hash_readnext(Relation rel,
BlockNumber blkno;
blkno = (*opaquep)->hasho_nextblkno;
- _hash_relbuf(rel, *bufp);
+
+ /*
+ * Retain the pin on the primary bucket page till the end of the scan, to
+ * ensure that vacuum can't delete the tuples that were moved by split to
+ * the new bucket. Such tuples are required by scans that started on split
+ * buckets before the new bucket's split-in-progress flag
+ * (LH_BUCKET_NEW_PAGE_SPLIT) was cleared. The requirement to retain a pin
+ * on the primary bucket could be relaxed for buckets that are not being
+ * split, by checking the has_garbage flag in the bucket, but we still need
+ * to retain the pin for the squeeze phase; otherwise the movement of tuples
+ * could change the ordering of scan results, so let's keep it for all
+ * buckets.
+ */
+ if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+ _hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+ else
+ _hash_relbuf(rel, *bufp);
+
*bufp = InvalidBuffer;
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
@@ -94,7 +110,16 @@ _hash_readprev(Relation rel,
BlockNumber blkno;
blkno = (*opaquep)->hasho_prevblkno;
- _hash_relbuf(rel, *bufp);
+
+ /*
+ * Retain the pin on the primary bucket page till the end of the scan. See
+ * the comments in _hash_readnext for the reason for retaining the pin.
+ */
+ if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+ _hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+ else
+ _hash_relbuf(rel, *bufp);
+
*bufp = InvalidBuffer;
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
@@ -192,43 +217,85 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
metap = HashPageGetMeta(page);
/*
- * Loop until we get a lock on the correct target bucket.
+ * Conditionally get the lock on the primary bucket page for search while
+ * holding the lock on the meta page. If we have to wait, then release the
+ * meta page lock and retry it the hard way.
*/
- for (;;)
- {
- /*
- * Compute the target bucket number, and convert to block number.
- */
- bucket = _hash_hashkey2bucket(hashkey,
- metap->hashm_maxbucket,
- metap->hashm_highmask,
- metap->hashm_lowmask);
+ bucket = _hash_hashkey2bucket(hashkey,
+ metap->hashm_maxbucket,
+ metap->hashm_highmask,
+ metap->hashm_lowmask);
- blkno = BUCKET_TO_BLKNO(metap, bucket);
+ blkno = BUCKET_TO_BLKNO(metap, bucket);
- /* Release metapage lock, but keep pin. */
+ /* Fetch the primary bucket page for the bucket */
+ buf = ReadBuffer(rel, blkno);
+ if (!ConditionalLockBufferShared(buf))
+ {
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+ LockBuffer(buf, HASH_READ);
+ _hash_checkpage(rel, buf, LH_BUCKET_PAGE);
+ _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+ oldblkno = blkno;
+ retry = true;
+ }
+ else
+ {
+ _hash_checkpage(rel, buf, LH_BUCKET_PAGE);
_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+ }
+ if (retry)
+ {
/*
- * If the previous iteration of this loop locked what is still the
- * correct target bucket, we are done. Otherwise, drop any old lock
- * and lock what now appears to be the correct bucket.
+ * Loop until we get a lock on the correct target bucket. We get the
+ * lock on primary bucket page and retain the pin on it during read
+ * operation to prevent the concurrent splits. Retaining pin on a
+ * primary bucket page ensures that split can't happen as it needs to
+ * acquire the cleanup lock on primary bucket page. Acquiring lock on
+ * primary bucket and rechecking if it is a target bucket is mandatory
+ * as otherwise a concurrent split followed by vacuum could remove
+ * tuples from the selected bucket which otherwise would have been
+ * visible.
*/
- if (retry)
+ for (;;)
{
- if (oldblkno == blkno)
- break;
- _hash_droplock(rel, oldblkno, HASH_SHARE);
+ /*
+ * Compute the target bucket number, and convert to block number.
+ */
+ bucket = _hash_hashkey2bucket(hashkey,
+ metap->hashm_maxbucket,
+ metap->hashm_highmask,
+ metap->hashm_lowmask);
+
+ blkno = BUCKET_TO_BLKNO(metap, bucket);
+
+ /* Release metapage lock, but keep pin. */
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+ /*
+ * If the previous iteration of this loop locked what is still the
+ * correct target bucket, we are done. Otherwise, drop any old
+ * lock and lock what now appears to be the correct bucket.
+ */
+ if (retry)
+ {
+ if (oldblkno == blkno)
+ break;
+ _hash_relbuf(rel, buf);
+ }
+
+ /* Fetch the primary bucket page for the bucket */
+ buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
+
+ /*
+ * Reacquire metapage lock and check that no bucket split has
+ * taken place while we were awaiting the bucket lock.
+ */
+ _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+ oldblkno = blkno;
+ retry = true;
}
- _hash_getlock(rel, blkno, HASH_SHARE);
-
- /*
- * Reacquire metapage lock and check that no bucket split has taken
- * place while we were awaiting the bucket lock.
- */
- _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
- oldblkno = blkno;
- retry = true;
}
/* done with the metapage */
@@ -237,14 +304,60 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
/* Update scan opaque state to show we have lock on the bucket */
so->hashso_bucket = bucket;
so->hashso_bucket_valid = true;
- so->hashso_bucket_blkno = blkno;
- /* Fetch the primary bucket page for the bucket */
- buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
+
page = BufferGetPage(buf);
opaque = (HashPageOpaque) PageGetSpecialPointer(page);
Assert(opaque->hasho_bucket == bucket);
+ so->hashso_bucket_buf = buf;
+
+ /*
+ * If a bucket split is in progress, then we need to skip tuples that
+ * were moved from the old bucket. To ensure that vacuum doesn't clean any
+ * tuples from the old or new bucket while this scan is in progress,
+ * maintain a pin on both of the buckets. Here, we have to be cautious
+ * about lock ordering: first acquire the lock on the old bucket, release
+ * the lock on the old bucket (but not the pin), then acquire the lock on
+ * the new bucket and re-verify whether the bucket split is still in
+ * progress. Acquiring the lock on the old bucket first ensures that
+ * vacuum waits for this scan to finish.
+ */
+ if (opaque->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+ {
+ BlockNumber old_blkno;
+ Buffer old_buf;
+
+ old_blkno = _hash_get_oldblk(rel, opaque);
+
+ /*
+ * release the lock on new bucket and re-acquire it after acquiring
+ * the lock on old bucket.
+ */
+ _hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+
+ old_buf = _hash_getbuf(rel, old_blkno, HASH_READ, LH_BUCKET_PAGE);
+
+ /*
+ * remember the old bucket buffer so as to use it later for scanning.
+ */
+ so->hashso_old_bucket_buf = old_buf;
+ _hash_chgbufaccess(rel, old_buf, HASH_READ, HASH_NOLOCK);
+
+ _hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+ page = BufferGetPage(buf);
+ opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ Assert(opaque->hasho_bucket == bucket);
+
+ if (opaque->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+ so->hashso_skip_moved_tuples = true;
+ else
+ {
+ _hash_dropbuf(rel, so->hashso_old_bucket_buf);
+ so->hashso_old_bucket_buf = InvalidBuffer;
+ }
+ }
+
/* If a backwards scan is requested, move to the end of the chain */
if (ScanDirectionIsBackward(dir))
{
@@ -273,6 +386,13 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
* false. Else, return true and set the hashso_curpos for the
* scan to the right thing.
*
+ * Here we also scan the old bucket if a split for the current bucket
+ * was in progress at the start of the scan. The basic idea is to
+ * skip the tuples that were moved by the split while scanning the
+ * current bucket, and then scan the old bucket to cover all such
+ * tuples. This is done to ensure that we don't miss any tuples in the
+ * current scan when a split was in progress.
+ *
* 'bufP' points to the current buffer, which is pinned and read-locked.
* On success exit, we have pin and read-lock on whichever page
* contains the right item; on failure, we have released all buffers.
@@ -338,6 +458,19 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
{
Assert(offnum >= FirstOffsetNumber);
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+ /*
+ * skip the tuples that are moved by split operation
+ * for the scan that has started when split was in
+ * progress
+ */
+ if (so->hashso_skip_moved_tuples &&
+ (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
+ {
+ offnum = OffsetNumberNext(offnum); /* move forward */
+ continue;
+ }
+
if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
break; /* yes, so exit for-loop */
}
@@ -353,9 +486,41 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
}
else
{
- /* end of bucket */
- itup = NULL;
- break; /* exit for-loop */
+ /*
+ * end of bucket, scan old bucket if there was a split
+ * in progress at the start of scan.
+ */
+ if (so->hashso_skip_moved_tuples)
+ {
+ buf = so->hashso_old_bucket_buf;
+
+ /*
+ * the old bucket buffer must be valid as we acquire
+ * the pin on it before the start of the scan and
+ * retain it till the end of the scan.
+ */
+ Assert(BufferIsValid(buf));
+
+ _hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+
+ page = BufferGetPage(buf);
+ maxoff = PageGetMaxOffsetNumber(page);
+ offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+ /*
+ * setting hashso_skip_moved_tuples to false
+ * ensures that we don't check for moved-by-split
+ * tuples in the old bucket, and also that we won't
+ * retry scanning the old bucket once that scan is
+ * finished.
+ */
+ so->hashso_skip_moved_tuples = false;
+ }
+ else
+ {
+ itup = NULL;
+ break; /* exit for-loop */
+ }
}
}
break;
@@ -379,6 +544,19 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
{
Assert(offnum <= maxoff);
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+ /*
+ * skip the tuples that are moved by split operation
+ * for the scan that has started when split was in
+ * progress
+ */
+ if (so->hashso_skip_moved_tuples &&
+ (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
+ {
+ offnum = OffsetNumberPrev(offnum); /* move back */
+ continue;
+ }
+
if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
break; /* yes, so exit for-loop */
}
@@ -394,9 +572,41 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
}
else
{
- /* end of bucket */
- itup = NULL;
- break; /* exit for-loop */
+ /*
+ * end of bucket, scan old bucket if there was a split
+ * in progress at the start of scan.
+ */
+ if (so->hashso_skip_moved_tuples)
+ {
+ buf = so->hashso_old_bucket_buf;
+
+ /*
+ * the old bucket buffer must be valid as we acquire
+ * the pin on it before the start of the scan and
+ * retain it till the end of the scan.
+ */
+ Assert(BufferIsValid(buf));
+
+ _hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+
+ page = BufferGetPage(buf);
+ maxoff = PageGetMaxOffsetNumber(page);
+ offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+ /*
+ * setting hashso_skip_moved_tuples to false
+ * ensures that we don't check for moved-by-split
+ * tuples in the old bucket, and also that we won't
+ * retry scanning the old bucket once that scan is
+ * finished.
+ */
+ so->hashso_skip_moved_tuples = false;
+ }
+ else
+ {
+ itup = NULL;
+ break; /* exit for-loop */
+ }
}
}
break;
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 456954b..bdbeb84 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -147,6 +147,23 @@ _hash_log2(uint32 num)
}
/*
+ * _hash_msb() -- returns the most significant bit position.
+ */
+static uint32
+_hash_msb(uint32 num)
+{
+ uint32 i = 0;
+
+ while (num)
+ {
+ num = num >> 1;
+ ++i;
+ }
+
+ return i - 1;
+}
+
+/*
* _hash_checkpage -- sanity checks on the format of all hash pages
*
* If flags is not zero, it is a bitwise OR of the acceptable values of
@@ -342,3 +359,123 @@ _hash_binsearch_last(Page page, uint32 hash_value)
return lower;
}
+
+/*
+ * _hash_get_oldblk() -- get the block number of the bucket from which the
+ * current bucket is being split.
+ */
+BlockNumber
+_hash_get_oldblk(Relation rel, HashPageOpaque opaque)
+{
+ Bucket curr_bucket;
+ Bucket old_bucket;
+ uint32 mask;
+ Buffer metabuf;
+ HashMetaPage metap;
+ BlockNumber blkno;
+
+ /*
+ * To get the old bucket from the current bucket, we need a mask to modulo
+ * into the lower half of the table. This mask is stored in the meta page as
+ * hashm_lowmask, but here we can't rely on it, because we need the value of
+ * lowmask that was in effect at the time the bucket split was started.
+ * Masking off the most significant bit of the new bucket gives us the old
+ * bucket.
+ */
+ curr_bucket = opaque->hasho_bucket;
+ mask = (((uint32) 1) << _hash_msb(curr_bucket)) - 1;
+ old_bucket = curr_bucket & mask;
+
+ metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+ metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+ blkno = BUCKET_TO_BLKNO(metap, old_bucket);
+
+ _hash_relbuf(rel, metabuf);
+
+ return blkno;
+}
+
+/*
+ * _hash_get_newblk() -- get the block number of the new bucket that will be
+ * generated after a split from the current bucket.
+ *
+ * This is used to find the new bucket from the old bucket based on the
+ * current table half. It is mainly required to finish incomplete splits,
+ * where we are sure that no more than one bucket could have a split in
+ * progress from the old bucket.
+ */
+BlockNumber
+_hash_get_newblk(Relation rel, HashPageOpaque opaque)
+{
+ Bucket curr_bucket;
+ Bucket new_bucket;
+ uint32 lowmask;
+ uint32 mask;
+ Buffer metabuf;
+ HashMetaPage metap;
+ BlockNumber blkno;
+
+ metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+ metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+ curr_bucket = opaque->hasho_bucket;
+
+ /*
+ * The new bucket can be obtained by OR'ing the old bucket with the most
+ * significant bit of the current table half. There could be multiple
+ * buckets that have split from the current bucket; we need the first such
+ * bucket that exists based on the current table half.
+ */
+ lowmask = metap->hashm_lowmask;
+
+ for (;;)
+ {
+ mask = lowmask + 1;
+ new_bucket = curr_bucket | mask;
+ if (new_bucket > metap->hashm_maxbucket)
+ {
+ lowmask = lowmask >> 1;
+ continue;
+ }
+ blkno = BUCKET_TO_BLKNO(metap, new_bucket);
+ break;
+ }
+
+ _hash_relbuf(rel, metabuf);
+
+ return blkno;
+}
+
+/*
+ * _hash_get_newbucket() -- get the new bucket that will be generated after
+ * split from current bucket.
+ *
+ * This is used to find the new bucket from old bucket. New bucket can be
+ * obtained by OR'ing old bucket with most significant bit of table half
+ * for lowmask passed in this function. There could be multiple buckets that
+ * could have split from the current bucket. We need the first such bucket that
+ * exists. Caller must ensure that no more than one split has happened from
+ * old bucket.
+ */
+Bucket
+_hash_get_newbucket(Relation rel, Bucket curr_bucket,
+ uint32 lowmask, uint32 maxbucket)
+{
+ Bucket new_bucket;
+ uint32 mask;
+
+ for (;;)
+ {
+ mask = lowmask + 1;
+ new_bucket = curr_bucket | mask;
+ if (new_bucket > maxbucket)
+ {
+ lowmask = lowmask >> 1;
+ continue;
+ }
+ break;
+ }
+
+ return new_bucket;
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index b7ca9bf..00129ed 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3567,6 +3567,26 @@ ConditionalLockBuffer(Buffer buffer)
}
/*
+ * Acquire the content_lock for the buffer, but only if we don't have to wait.
+ *
+ * This assumes the caller wants BUFFER_LOCK_SHARED mode.
+ */
+bool
+ConditionalLockBufferShared(Buffer buffer)
+{
+ BufferDesc *buf;
+
+ Assert(BufferIsValid(buffer));
+ if (BufferIsLocal(buffer))
+ return true; /* act as though we got it */
+
+ buf = GetBufferDescriptor(buffer - 1);
+
+ return LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
+ LW_SHARED);
+}
+
+/*
* LockBufferForCleanup - lock a buffer in preparation for deleting items
*
* Items may be deleted from a disk page only when the caller (a) holds an
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index fa3f9b6..3a64c9d 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -25,6 +25,7 @@
#include "lib/stringinfo.h"
#include "storage/bufmgr.h"
#include "storage/lockdefs.h"
+#include "utils/hsearch.h"
#include "utils/relcache.h"
/*
@@ -52,6 +53,9 @@ typedef uint32 Bucket;
#define LH_BUCKET_PAGE (1 << 1)
#define LH_BITMAP_PAGE (1 << 2)
#define LH_META_PAGE (1 << 3)
+#define LH_BUCKET_NEW_PAGE_SPLIT (1 << 4)
+#define LH_BUCKET_OLD_PAGE_SPLIT (1 << 5)
+#define LH_BUCKET_PAGE_HAS_GARBAGE (1 << 6)
typedef struct HashPageOpaqueData
{
@@ -64,6 +68,12 @@ typedef struct HashPageOpaqueData
typedef HashPageOpaqueData *HashPageOpaque;
+#define H_HAS_GARBAGE(opaque) ((opaque)->hasho_flag & LH_BUCKET_PAGE_HAS_GARBAGE)
+#define H_OLD_INCOMPLETE_SPLIT(opaque) ((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT)
+#define H_NEW_INCOMPLETE_SPLIT(opaque) ((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+#define H_INCOMPLETE_SPLIT(opaque) (((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT) || \
+ ((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT))
+
/*
* The page ID is for the convenience of pg_filedump and similar utilities,
* which otherwise would have a hard time telling pages of different index
@@ -88,12 +98,6 @@ typedef struct HashScanOpaqueData
bool hashso_bucket_valid;
/*
- * If we have a share lock on the bucket, we record it here. When
- * hashso_bucket_blkno is zero, we have no such lock.
- */
- BlockNumber hashso_bucket_blkno;
-
- /*
* We also want to remember which buffer we're currently examining in the
* scan. We keep the buffer pinned (but not locked) across hashgettuple
* calls, in order to avoid doing a ReadBuffer() for every tuple in the
@@ -101,11 +105,23 @@ typedef struct HashScanOpaqueData
*/
Buffer hashso_curbuf;
+ /* remember the buffer associated with primary bucket */
+ Buffer hashso_bucket_buf;
+
+ /*
+ * remember the buffer associated with old primary bucket which is
+ * required during the scan of the bucket for which split is in progress.
+ */
+ Buffer hashso_old_bucket_buf;
+
/* Current position of the scan, as an index TID */
ItemPointerData hashso_curpos;
/* Current position of the scan, as a heap TID */
ItemPointerData hashso_heappos;
+
+ /* Whether scan needs to skip tuples that are moved by split */
+ bool hashso_skip_moved_tuples;
} HashScanOpaqueData;
typedef HashScanOpaqueData *HashScanOpaque;
@@ -176,6 +192,8 @@ typedef HashMetaPageData *HashMetaPage;
sizeof(ItemIdData) - \
MAXALIGN(sizeof(HashPageOpaqueData)))
+#define INDEX_MOVED_BY_SPLIT_MASK 0x2000
+
#define HASH_MIN_FILLFACTOR 10
#define HASH_DEFAULT_FILLFACTOR 75
@@ -224,9 +242,6 @@ typedef HashMetaPageData *HashMetaPage;
#define HASH_WRITE BUFFER_LOCK_EXCLUSIVE
#define HASH_NOLOCK (-1)
-#define HASH_SHARE ShareLock
-#define HASH_EXCLUSIVE ExclusiveLock
-
/*
* Strategy number. There's only one valid strategy for hashing: equality.
*/
@@ -299,19 +314,17 @@ extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
Size itemsize, IndexTuple itup);
/* hashovfl.c */
-extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf);
+extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin);
extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf,
- BufferAccessStrategy bstrategy);
+ BlockNumber bucket_blkno, BufferAccessStrategy bstrategy);
extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
BlockNumber blkno, ForkNumber forkNum);
extern void _hash_squeezebucket(Relation rel,
Bucket bucket, BlockNumber bucket_blkno,
+ Buffer bucket_buf,
BufferAccessStrategy bstrategy);
/* hashpage.c */
-extern void _hash_getlock(Relation rel, BlockNumber whichlock, int access);
-extern bool _hash_try_getlock(Relation rel, BlockNumber whichlock, int access);
-extern void _hash_droplock(Relation rel, BlockNumber whichlock, int access);
extern Buffer _hash_getbuf(Relation rel, BlockNumber blkno,
int access, int flags);
extern Buffer _hash_getinitbuf(Relation rel, BlockNumber blkno);
@@ -329,6 +342,10 @@ extern uint32 _hash_metapinit(Relation rel, double num_tuples,
ForkNumber forkNum);
extern void _hash_pageinit(Page page, Size size);
extern void _hash_expandtable(Relation rel, Buffer metabuf);
+extern void _hash_splitbucket_guts(Relation rel, Buffer metabuf,
+ Bucket obucket, Bucket nbucket, Buffer obuf,
+ Buffer nbuf, HTAB *htab, uint32 maxbucket,
+ uint32 highmask, uint32 lowmask);
/* hashscan.c */
extern void _hash_regscan(IndexScanDesc scan);
@@ -363,10 +380,20 @@ extern IndexTuple _hash_form_tuple(Relation index,
Datum *values, bool *isnull);
extern OffsetNumber _hash_binsearch(Page page, uint32 hash_value);
extern OffsetNumber _hash_binsearch_last(Page page, uint32 hash_value);
+extern BlockNumber _hash_get_oldblk(Relation rel, HashPageOpaque opaque);
+extern BlockNumber _hash_get_newblk(Relation rel, HashPageOpaque opaque);
+extern Bucket _hash_get_newbucket(Relation rel, Bucket curr_bucket,
+ uint32 lowmask, uint32 maxbucket);
/* hash.c */
extern void hash_redo(XLogReaderState *record);
extern void hash_desc(StringInfo buf, XLogReaderState *record);
extern const char *hash_identify(uint8 info);
+extern void hashbucketcleanup(Relation rel, Buffer bucket_buf,
+ BlockNumber bucket_blkno, BufferAccessStrategy bstrategy,
+ uint32 maxbucket, uint32 highmask, uint32 lowmask,
+ double *tuples_removed, double *num_index_tuples,
+ bool bucket_has_garbage, bool delay,
+ IndexBulkDeleteCallback callback, void *callback_state);
#endif /* HASH_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 3d5dea7..4b318a8 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -226,6 +226,7 @@ extern void MarkBufferDirtyHint(Buffer buffer, bool buffer_std);
extern void UnlockBuffers(void);
extern void LockBuffer(Buffer buffer, int mode);
extern bool ConditionalLockBuffer(Buffer buffer);
+extern bool ConditionalLockBufferShared(Buffer buffer);
extern void LockBufferForCleanup(Buffer buffer);
extern bool ConditionalLockBufferForCleanup(Buffer buffer);
extern bool HoldingBufferPinThatDelaysRecovery(void);
On Tue, May 10, 2016 at 8:09 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
For making hash indexes usable in production systems, we need to improve its concurrency and make them crash-safe by WAL logging them. The first problem I would like to tackle is improve the concurrency of hash indexes. First advantage, I see with improving concurrency of hash indexes is that it has the potential of out performing btree for "equal to" searches (with my WIP patch attached with this mail, I could see hash index outperform btree index by 20 to 30% for very simple cases which are mentioned later in this e-mail). Another advantage as explained by Robert [1] earlier is that if we remove heavy weight locks under which we perform arbitrarily large number of operations, it can help us to sensibly WAL log it. With this patch, I would also like to make hash indexes capable of completing the incomplete_splits which can occur due to interrupts (like cancel) or errors or crash.
I have studied the concurrency problems of hash index and some of the solutions proposed for same previously and based on that came up with below solution which is based on idea by Robert [1], community discussion on thread [2] and some of my own thoughts.
Maintain a flag that can be set and cleared on the primary bucket page, call it split-in-progress, and a flag that can optionally be set on particular index tuples, call it moved-by-split. We will allow scans of all buckets and insertions into all buckets while the split is in progress, but (as now) we will not allow more than one split for a bucket to be in progress at the same time. We start the split by updating metapage to incrementing the number of buckets and set the split-in-progress flag in primary bucket pages for old and new buckets (lets number them as old bucket - N+1/2; new bucket - N + 1 for the matter of discussion). While the split-in-progress flag is set, any scans of N+1 will first scan that bucket, ignoring any tuples flagged moved-by-split, and then ALSO scan bucket N+1/2. To ensure that vacuum doesn't clean any tuples from old or new buckets till this scan is in progress, maintain a pin on both of the buckets (first pin on old bucket needs to be acquired). The moved-by-split flag never has any effect except when scanning the new bucket that existed at the start of that particular scan, and then only if the split-in-progress flag was also set at that time.
You really need parentheses in (N+1)/2. Because you are not trying to
add 1/2 to N. https://en.wikipedia.org/wiki/Order_of_operations
Once the split operation has set the split-in-progress flag, it will begin scanning bucket (N+1)/2. Every time it finds a tuple that properly belongs in bucket N+1, it will insert the tuple into bucket N+1 with the moved-by-split flag set. Tuples inserted by anything other than a split operation will leave this flag clear, and tuples inserted while the split is in progress will target the same bucket that they would hit if the split were already complete. Thus, bucket N+1 will end up with a mix of moved-by-split tuples, coming from bucket (N+1)/2, and unflagged tuples coming from parallel insertion activity. When the scan of bucket (N+1)/2 is complete, we know that bucket N+1 now contains all the tuples that are supposed to be there, so we clear the split-in-progress flag on both buckets. Future scans of both buckets can proceed normally. Split operation needs to take a cleanup lock on primary bucket to ensure that it doesn't start if there is any Insertion happening in the bucket. It will leave the lock on primary bucket, but not pin as it proceeds for next overflow page. Retaining pin on primary bucket will ensure that vacuum doesn't start on this bucket till the split is finished.
In the second-to-last sentence, I believe you have reversed the words
"lock" and "pin".
Insertion will happen by scanning the appropriate bucket and needs to retain pin on primary bucket to ensure that concurrent split doesn't happen, otherwise split might leave this tuple unaccounted.
What do you mean by "unaccounted"?
Now for deletion of tuples from (N+1/2) bucket, we need to wait for the completion of any scans that began before we finished populating bucket N+1, because otherwise we might remove tuples that they're still expecting to find in bucket (N+1)/2. The scan will always maintain a pin on primary bucket and Vacuum can take a buffer cleanup lock (cleanup lock includes Exclusive lock on bucket and wait till all the pins on buffer becomes zero) on primary bucket for the buffer. I think we can relax the requirement for vacuum to take cleanup lock (instead take Exclusive Lock on buckets where no split has happened) with the additional flag has_garbage which will be set on primary bucket, if any tuples have been moved from that bucket, however I think for squeeze phase (in this phase, we try to move the tuples from later overflow pages to earlier overflow pages in the bucket and then if there are any empty overflow pages, then we move them to kind of a free pool) of vacuum, we need a cleanup lock, otherwise scan results might get effected.
affected, not effected.
I think this is basically correct, although I don't find it to be as
clear as I think it could be. It seems very clear that any operation
which potentially changes the order of tuples in the bucket chain,
such as the squeeze phase as currently implemented, also needs to
exclude all concurrent scans. However, I think that it's OK for
vacuum to remove tuples from a given page with only an exclusive lock
on that particular page. Also, I think that when cleaning up after a
split, an exclusive lock is likewise sufficient to remove tuples from
a particular page provided that we know that every scan currently in
progress started after split-in-progress was set. If each scan holds
a pin on the primary bucket and setting the split-in-progress flag
requires a cleanup lock on that page, then this is always true.
(Plain text email is preferred to HTML on this mailing list.)
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jun 16, 2016 at 3:28 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Incomplete splits can be completed either by vacuum or insert as both
needs exclusive lock on bucket. If vacuum finds split-in-progress flag on a
bucket then it will complete the split operation, vacuum won't see this flag
if actually split is in progress on that bucket as vacuum needs cleanup lock
and split retains pin till end of operation. To make it work for Insert
operation, one simple idea could be that if insert finds split-in-progress
flag, then it releases the current exclusive lock on bucket and tries to
acquire a cleanup lock on bucket, if it gets cleanup lock, then it can
complete the split and then the insertion of tuple, else it will have a
exclusive lock on bucket and just perform the insertion of tuple. The
disadvantage of trying to complete the split in vacuum is that split might
require new pages and allocating new pages at time of vacuum is not
advisable. The disadvantage of doing it at time of Insert is that Insert
might skip it even if there is some scan on the bucket is going on as scan
will also retain pin on the bucket, but I think that is not a big deal. The
actual completion of split can be done in two ways: (a) scan the new bucket
and build a hash table with all of the TIDs you find there. When copying
tuples from the old bucket, first probe the hash table; if you find a match,
just skip that tuple (idea suggested by Robert Haas offlist) (b) delete all
the tuples that are marked as moved_by_split in the new bucket and perform
the split operation from the beginning using old bucket.
I have completed the patch with respect to incomplete splits and delayed
cleanup of garbage tuples. For incomplete splits, I have used the option
(a) as mentioned above. The incomplete splits are completed if the
insertion sees split-in-progress flag in a bucket.
It seems to me that there is a potential performance problem here. If
the split is still being performed, every insert will see the
split-in-progress flag set. The in-progress split retains only a pin
on the primary bucket, so other backends could also get an exclusive
lock, which is all they need for an insert. It seems that under this
algorithm they will now take the exclusive lock, release the exclusive
lock, try to take a cleanup lock, fail, again take the exclusive lock.
That seems like a lot of extra monkeying around. Wouldn't it be
better to take the exclusive lock and then afterwards check if the pin
count is 1? If so, even though we only intended to take an exclusive
lock, it is actually a cleanup lock. If not, we can simply proceed
with the insertion. This way you avoid unlocking and relocking the
buffer repeatedly.
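A minimal sketch of this idea, for illustration only: it assumes a hypothetical bufmgr helper IsBufferCleanupOK() that reports whether the exclusive content lock we already hold is in fact a cleanup lock (i.e. we hold the only pin on the buffer), and an illustrative _hash_finish_split() routine; neither name is from the posted patch.

    /* insert path: lock the primary bucket page once */
    LockBuffer(bucket_buf, BUFFER_LOCK_EXCLUSIVE);
    pageopaque = (HashPageOpaque) PageGetSpecialPointer(BufferGetPage(bucket_buf));

    if (H_OLD_INCOMPLETE_SPLIT(pageopaque) && IsBufferCleanupOK(bucket_buf))
    {
        /* our exclusive lock happens to be a cleanup lock: finish the split */
        _hash_finish_split(rel, metabuf, bucket_buf);
    }

    /* in either case, carry on and insert under the exclusive lock we hold */

This avoids the unlock/try-cleanup/relock cycle entirely; the only extra cost when the pin count is greater than one is the flag test.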
The second major thing
this new version of patch has achieved is cleanup of garbage tuples i.e the
tuples that are left in old bucket during split. Currently (in HEAD), as
part of a split operation, we clean the tuples from old bucket after moving
them to new bucket, as we have heavy-weight locks on both old and new bucket
till the whole split operation. In the new design, we need to take cleanup
lock on old bucket and exclusive lock on new bucket to perform the split
operation and we don't retain those locks till the end (release the lock as
we move on to overflow buckets). Now to cleanup the tuples we need a
cleanup lock on a bucket which we might not have at split-end. So I choose
to perform the cleanup of garbage tuples during vacuum and when re-split of
the bucket happens as during both the operations, we do hold cleanup lock.
We can extend the cleanup of garbage to other operations as well if
required.
I think it's OK for the squeeze phase to be deferred until vacuum or a
subsequent split, but simply removing dead tuples seems like it should
be done earlier if possible. As I noted in my last email, it seems
like any process that gets an exclusive lock can do that, and probably
should. Otherwise, the index might become quite bloated.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jun 21, 2016 at 9:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, May 10, 2016 at 8:09 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
Once the split operation has set the split-in-progress flag, it will
begin scanning bucket (N+1)/2. Every time it finds a tuple that properly
belongs in bucket N+1, it will insert the tuple into bucket N+1 with the
moved-by-split flag set. Tuples inserted by anything other than a split
operation will leave this flag clear, and tuples inserted while the split
is in progress will target the same bucket that they would hit if the split
were already complete. Thus, bucket N+1 will end up with a mix of
moved-by-split tuples, coming from bucket (N+1)/2, and unflagged tuples
coming from parallel insertion activity. When the scan of bucket (N+1)/2
is complete, we know that bucket N+1 now contains all the tuples that are
supposed to be there, so we clear the split-in-progress flag on both
buckets. Future scans of both buckets can proceed normally. Split
operation needs to take a cleanup lock on primary bucket to ensure that it
doesn't start if there is any Insertion happening in the bucket. It will
leave the lock on primary bucket, but not pin as it proceeds for next
overflow page. Retaining pin on primary bucket will ensure that vacuum
doesn't start on this bucket till the split is finished.
In the second-to-last sentence, I believe you have reversed the words
"lock" and "pin".
Yes. What I mean to say is: release the lock, but retain the pin on the
primary bucket till the end of the operation.
Insertion will happen by scanning the appropriate bucket and needs to
retain pin on primary bucket to ensure that concurrent split doesn't
happen, otherwise split might leave this tuple unaccounted.
What do you mean by "unaccounted"?
It means that split might leave this tuple in old bucket even if it can be
moved to new bucket. Consider a case where insertion has to add a tuple on
some intermediate overflow bucket in the bucket chain, if we allow split
when insertion is in progress, split might not move this newly inserted
tuple.
Now for deletion of tuples from (N+1/2) bucket, we need to wait for the
completion of any scans that began before we finished populating bucket
N+1, because otherwise we might remove tuples that they're still expecting
to find in bucket (N+1)/2. The scan will always maintain a pin on primary
bucket and Vacuum can take a buffer cleanup lock (cleanup lock includes
Exclusive lock on bucket and wait till all the pins on buffer becomes zero)
on primary bucket for the buffer. I think we can relax the requirement for
vacuum to take cleanup lock (instead take Exclusive Lock on buckets where
no split has happened) with the additional flag has_garbage which will be
set on primary bucket, if any tuples have been moved from that bucket,
however I think for squeeze phase (in this phase, we try to move the tuples
from later overflow pages to earlier overflow pages in the bucket and then
if there are any empty overflow pages, then we move them to kind of a free
pool) of vacuum, we need a cleanup lock, otherwise scan results might get
effected.
affected, not effected.
I think this is basically correct, although I don't find it to be as
clear as I think it could be. It seems very clear that any operation
which potentially changes the order of tuples in the bucket chain,
such as the squeeze phase as currently implemented, also needs to
exclude all concurrent scans. However, I think that it's OK for
vacuum to remove tuples from a given page with only an exclusive lock
on that particular page.
How can we guarantee that it doesn't remove a tuple that is required by
scan which is started after split-in-progress flag is set?
Also, I think that when cleaning up after a
split, an exclusive lock is likewise sufficient to remove tuples from
a particular page provided that we know that every scan currently in
progress started after split-in-progress was set.
I think this could also have a similar issue as above, unless we have
something which prevents concurrent scans.
(Plain text email is preferred to HTML on this mailing list.)
If I turn to Plain text [1], then the signature of my e-mail also changes
to Plain text, which I don't want. Is there a way I can retain the signature
settings in Rich Text and the mail content as Plain Text?
[1]: http://www.mail-signatures.com/articles/how-to-add-or-change-an-email-signature-in-gmailgoogle-apps/
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Jun 21, 2016 at 9:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Jun 16, 2016 at 3:28 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
Incomplete splits can be completed either by vacuum or insert as both
needs exclusive lock on bucket. If vacuum finds split-in-progress
flag on a
bucket then it will complete the split operation, vacuum won't see
this flag
if actually split is in progress on that bucket as vacuum needs
cleanup lock
and split retains pin till end of operation. To make it work for
Insert
operation, one simple idea could be that if insert finds
split-in-progress
flag, then it releases the current exclusive lock on bucket and tries
to
acquire a cleanup lock on bucket, if it gets cleanup lock, then it can
complete the split and then the insertion of tuple, else it will have a
exclusive lock on bucket and just perform the insertion of tuple. The
disadvantage of trying to complete the split in vacuum is that split
might
require new pages and allocating new pages at time of vacuum is not
advisable. The disadvantage of doing it at time of Insert is that
Insert
might skip it even if there is some scan on the bucket is going on as
scan
will also retain pin on the bucket, but I think that is not a big
deal. The
actual completion of split can be done in two ways: (a) scan the new
bucket
and build a hash table with all of the TIDs you find there. When
copying
tuples from the old bucket, first probe the hash table; if you find a
match,
just skip that tuple (idea suggested by Robert Haas offlist) (b)
delete all
the tuples that are marked as moved_by_split in the new bucket and
perform
the split operation from the beginning using old bucket.
I have completed the patch with respect to incomplete splits and delayed
cleanup of garbage tuples. For incomplete splits, I have used the
option
(a) as mentioned above. The incomplete splits are completed if the
insertion sees split-in-progress flag in a bucket.
It seems to me that there is a potential performance problem here. If
the split is still being performed, every insert will see the
split-in-progress flag set. The in-progress split retains only a pin
on the primary bucket, so other backends could also get an exclusive
lock, which is all they need for an insert. It seems that under this
algorithm they will now take the exclusive lock, release the exclusive
lock, try to take a cleanup lock, fail, again take the exclusive lock.
That seems like a lot of extra monkeying around. Wouldn't it be
better to take the exclusive lock and then afterwards check if the pin
count is 1? If so, even though we only intended to take an exclusive
lock, it is actually a cleanup lock. If not, we can simply proceed
with the insertion. This way you avoid unlocking and relocking the
buffer repeatedly.
We can do it in the way as you are suggesting, but there is another thing
which we need to consider here. As of now, the patch tries to finish the
split if it finds split-in-progress flag in either old or new bucket. We
need to lock both old and new buckets to finish the split, so it is quite
possible that two different backends try to lock them in opposite order
leading to a deadlock. I think the correct way to handle is to always try
to lock the old bucket first and then new bucket. To achieve that, if the
insertion on new bucket finds that split-in-progress flag is set on a
bucket, it needs to release the lock and then acquire the lock first on old
bucket, ensure pincount is 1 and then lock new bucket again and ensure that
pincount is 1. I have already maintained the order of locks in scan (old
bucket first and then new bucket; refer changes in _hash_first()).
Alternatively, we can try to finish the splits only when someone tries to
insert in old bucket.
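To make the ordering rule concrete, here is a rough sketch of how a backend that currently holds the new bucket and sees the split-in-progress flag could fall back to the agreed order; _hash_finish_split() is an illustrative name (not from the posted patch), and the pin-count checks discussed above are omitted.

    /* we hold content lock and pin on new_buf and see split-in-progress */
    oblkno = _hash_get_oldblk(rel, npageopaque);        /* from the posted patch */

    LockBuffer(new_buf, BUFFER_LOCK_UNLOCK);            /* keep the pin */

    /* always lock the lower-numbered (old) bucket first ... */
    obuf = _hash_getbuf(rel, oblkno, HASH_WRITE, LH_BUCKET_PAGE);
    /* ... and only then re-lock the new bucket; every backend uses this order */
    LockBuffer(new_buf, BUFFER_LOCK_EXCLUSIVE);

    _hash_finish_split(rel, metabuf, obuf, new_buf);    /* illustrative */

    _hash_relbuf(rel, obuf);

Because every backend acquires the two buckets in the same (increasing) order, two backends trying to finish the same split cannot deadlock against each other.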
The second major thing
this new version of patch has achieved is cleanup of garbage tuples i.e
the
tuples that are left in old bucket during split. Currently (in HEAD),
as
part of a split operation, we clean the tuples from old bucket after
moving
them to new bucket, as we have heavy-weight locks on both old and new
bucket
till the whole split operation. In the new design, we need to take
cleanup
lock on old bucket and exclusive lock on new bucket to perform the split
operation and we don't retain those locks till the end (release the
lock as
we move on to overflow buckets). Now to cleanup the tuples we need a
cleanup lock on a bucket which we might not have at split-end. So I
choose
to perform the cleanup of garbage tuples during vacuum and when
re-split of
the bucket happens as during both the operations, we do hold cleanup
lock.
We can extend the cleanup of garbage to other operations as well if
required.
I think it's OK for the squeeze phase to be deferred until vacuum or a
subsequent split, but simply removing dead tuples seems like it should
be done earlier if possible.
Yes, probably we can do it at the time of insertion into a bucket, if we
don't have the concurrent scan issue.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jun 22, 2016 at 5:10 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Insertion will happen by scanning the appropriate bucket and needs to
retain pin on primary bucket to ensure that concurrent split doesn't happen,
otherwise split might leave this tuple unaccounted.
What do you mean by "unaccounted"?
It means that split might leave this tuple in old bucket even if it can be
moved to new bucket. Consider a case where insertion has to add a tuple on
some intermediate overflow bucket in the bucket chain, if we allow split
when insertion is in progress, split might not move this newly inserted
tuple.
OK, that's a good point.
Now for deletion of tuples from (N+1/2) bucket, we need to wait for the
completion of any scans that began before we finished populating bucket N+1,
because otherwise we might remove tuples that they're still expecting to
find in bucket (N+1)/2. The scan will always maintain a pin on primary
bucket and Vacuum can take a buffer cleanup lock (cleanup lock includes
Exclusive lock on bucket and wait till all the pins on buffer becomes zero)
on primary bucket for the buffer. I think we can relax the requirement for
vacuum to take cleanup lock (instead take Exclusive Lock on buckets where no
split has happened) with the additional flag has_garbage which will be set
on primary bucket, if any tuples have been moved from that bucket, however I
think for squeeze phase (in this phase, we try to move the tuples from later
overflow pages to earlier overflow pages in the bucket and then if there are
any empty overflow pages, then we move them to kind of a free pool) of
vacuum, we need a cleanup lock, otherwise scan results might get effected.
affected, not effected.
I think this is basically correct, although I don't find it to be as
clear as I think it could be. It seems very clear that any operation
which potentially changes the order of tuples in the bucket chain,
such as the squeeze phase as currently implemented, also needs to
exclude all concurrent scans. However, I think that it's OK for
vacuum to remove tuples from a given page with only an exclusive lock
on that particular page.
How can we guarantee that it doesn't remove a tuple that is required by scan
which is started after split-in-progress flag is set?
If the tuple is being removed by VACUUM, it is dead. We can remove
dead tuples right away, because no MVCC scan will see them. In fact,
the only snapshot that will see them is SnapshotAny, and there's no
problem with removing dead tuples while a SnapshotAny scan is in
progress. It's no different than heap_page_prune() removing tuples
that a SnapshotAny sequential scan was about to see.
If the tuple is being removed because the bucket was split, it's only
a problem if the scan predates setting the split-in-progress flag.
But since your design involves out-waiting all scans currently in
progress before setting that flag, there can't be any scan in progress
that hasn't seen it. A scan that has seen the flag won't look at the
tuple in any case.
(Plain text email is preferred to HTML on this mailing list.)
If I turn to Plain text [1], then the signature of my e-mail also changes to
Plain text which don't want. Is there a way, I can retain signature
settings in Rich Text and mail content as Plain Text.
Nope, but I don't see what you are worried about. There's no HTML
content in your signature anyway except for a link, and most
mail-reading software will turn that into a hyperlink even without the
HTML.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jun 22, 2016 at 5:14 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
We can do it in the way as you are suggesting, but there is another thing
which we need to consider here. As of now, the patch tries to finish the
split if it finds split-in-progress flag in either old or new bucket. We
need to lock both old and new buckets to finish the split, so it is quite
possible that two different backends try to lock them in opposite order
leading to a deadlock. I think the correct way to handle is to always try
to lock the old bucket first and then new bucket. To achieve that, if the
insertion on new bucket finds that split-in-progress flag is set on a
bucket, it needs to release the lock and then acquire the lock first on old
bucket, ensure pincount is 1 and then lock new bucket again and ensure that
pincount is 1. I have already maintained the order of locks in scan (old
bucket first and then new bucket; refer changes in _hash_first()).
Alternatively, we can try to finish the splits only when someone tries to
insert in old bucket.
Yes, I think locking buckets in increasing order is a good solution.
I also think it's fine to only try to finish the split when the insert
targets the old bucket. Finishing the split enables us to remove
tuples from the old bucket, which lets us reuse space instead of
accelerating more. So there is at least some potential benefit to the
backend inserting into the old bucket. On the other hand, a process
inserting into the new bucket derives no direct benefit from finishing
the split.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jun 22, 2016 at 8:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jun 22, 2016 at 5:10 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
I think this is basically correct, although I don't find it to be as
clear as I think it could be. It seems very clear that any operation
which potentially changes the order of tuples in the bucket chain,
such as the squeeze phase as currently implemented, also needs to
exclude all concurrent scans. However, I think that it's OK for
vacuum to remove tuples from a given page with only an exclusive lock
on that particular page.
How can we guarantee that it doesn't remove a tuple that is required by scan
which is started after split-in-progress flag is set?
If the tuple is being removed by VACUUM, it is dead. We can remove
dead tuples right away, because no MVCC scan will see them. In fact,
the only snapshot that will see them is SnapshotAny, and there's no
problem with removing dead tuples while a SnapshotAny scan is in
progress. It's no different than heap_page_prune() removing tuples
that a SnapshotAny sequential scan was about to see.
If the tuple is being removed because the bucket was split, it's only
a problem if the scan predates setting the split-in-progress flag.
But since your design involves out-waiting all scans currently in
progress before setting that flag, there can't be any scan in progress
that hasn't seen it.
For the above cases, just an exclusive lock will work.
A scan that has seen the flag won't look at the
tuple in any case.
Why so? Assume that a scan started on the new bucket where the
split-in-progress flag was set; it will not look at tuples that
are marked as moved-by-split in this bucket, as it assumes it will find
all such tuples in the old bucket. Now, if we allow Vacuum or someone else
to remove tuples from the old bucket with just an Exclusive lock, it is quite
possible that the scan misses a tuple in the old bucket which got removed by
vacuum.
(Plain text email is preferred to HTML on this mailing list.)
If I turn to Plain text [1], then the signature of my e-mail also changes
to Plain text, which I don't want. Is there a way I can retain the signature
settings in Rich Text and the mail content as Plain Text?
Nope, but I don't see what you are worried about. There's no HTML
content in your signature anyway except for a link, and most
mail-reading software will turn that into a hyperlink even without the
HTML.
Okay, I didn't know that mail-reading software does that. Thanks for
pointing out.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jun 22, 2016 at 8:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jun 22, 2016 at 5:14 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
We can do it in the way as you are suggesting, but there is another thing
which we need to consider here. As of now, the patch tries to finish the
split if it finds split-in-progress flag in either old or new bucket. We
need to lock both old and new buckets to finish the split, so it is quite
possible that two different backends try to lock them in opposite order
leading to a deadlock. I think the correct way to handle is to always try
to lock the old bucket first and then new bucket. To achieve that, if the
insertion on new bucket finds that split-in-progress flag is set on a
bucket, it needs to release the lock and then acquire the lock first on old
bucket, ensure pincount is 1 and then lock new bucket again and ensure that
pincount is 1. I have already maintained the order of locks in scan (old
bucket first and then new bucket; refer changes in _hash_first()).
Alternatively, we can try to finish the splits only when someone tries to
insert in old bucket.
Yes, I think locking buckets in increasing order is a good solution.
Okay.
I also think it's fine to only try to finish the split when the insert
targets the old bucket. Finishing the split enables us to remove
tuples from the old bucket, which lets us reuse space instead of
accelerating more. So there is at least some potential benefit to the
backend inserting into the old bucket. On the other hand, a process
inserting into the new bucket derives no direct benefit from finishing
the split.
Makes sense; will change it that way and will add a comment explaining why
we are doing it only for the old bucket.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jun 22, 2016 at 10:13 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
A scan that has seen the flag won't look at the
tuple in any case.
Why so? Assume that a scan started on the new bucket where the
split-in-progress flag was set; it will not look at tuples that
are marked as moved-by-split in this bucket, as it assumes it will find
all such tuples in the old bucket. Now, if we allow Vacuum or someone else
to remove tuples from the old bucket with just an Exclusive lock, it is quite
possible that the scan misses a tuple in the old bucket which got removed by
vacuum.
Oh, you're right. So we really need to CLEAR the split-in-progress
flag before removing any tuples from the old bucket. Does that sound
right?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jun 23, 2016 at 10:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jun 22, 2016 at 10:13 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
A scan that has seen the flag won't look at the
tuple in any case.
Why so? Assume that a scan started on the new bucket where the
split-in-progress flag was set; it will not look at tuples that
are marked as moved-by-split in this bucket, as it assumes it will find
all such tuples in the old bucket. Now, if we allow Vacuum or someone else
to remove tuples from the old bucket with just an Exclusive lock, it is quite
possible that the scan misses a tuple in the old bucket which got removed by
vacuum.
Oh, you're right. So we really need to CLEAR the split-in-progress
flag before removing any tuples from the old bucket.
I think that alone is not sufficient; we also need to out-wait any
scan that started after the flag was set and before it was cleared.
Before vacuum starts cleaning a particular bucket, we can certainly
detect whether it has to clean garbage tuples (the patch sets the
has_garbage flag on the old bucket during a split operation), and only in
that case do we need to out-wait the scans. So it could work like this:
during vacuum, take an Exclusive lock on the bucket and check whether the
has_garbage flag is set and the split-in-progress flag is cleared; if so,
wait till the pin count on the bucket is 1; else, if has_garbage is not
set, just proceed with clearing dead tuples from the bucket. This limits
the requirement for a cleanup lock to the case where it is actually
needed (namely when the bucket has garbage tuples).
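A sketch of that rule, using the flag macros from the posted patch; it is only an outline of the lock dance, and note that LockBufferForCleanup() requires the content lock to be released first, which is why the unlock appears before the wait.

    buf = _hash_getbuf(rel, bucket_blkno, HASH_WRITE, LH_BUCKET_PAGE);
    opaque = (HashPageOpaque) PageGetSpecialPointer(BufferGetPage(buf));

    if (H_HAS_GARBAGE(opaque) && !H_INCOMPLETE_SPLIT(opaque))
    {
        /*
         * Moved-by-split tuples must not be removed while any scan that
         * started before the split finished might still need them, so
         * upgrade to a cleanup lock (wait until ours is the only pin).
         */
        LockBuffer(buf, BUFFER_LOCK_UNLOCK);
        LockBufferForCleanup(buf);
    }

    /* with either lock strength, dead tuples can now be removed */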
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Jun 16, 2016 at 12:58 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
I have a question regarding code changes in *_hash_first*.
+ /*
+ * Conditionally get the lock on primary bucket page for search
while
+ * holding lock on meta page. If we have to wait, then release the
meta
+ * page lock and retry it in a hard way.
+ */
+ bucket = _hash_hashkey2bucket(hashkey,
+                               metap->hashm_maxbucket,
+                               metap->hashm_highmask,
+                               metap->hashm_lowmask);
+
+ blkno = BUCKET_TO_BLKNO(metap, bucket);
+
+ /* Fetch the primary bucket page for the bucket */
+ buf = ReadBuffer(rel, blkno);
+ if (!ConditionalLockBufferShared(buf))
Here we try to take lock on bucket page but I think if successful we do not
recheck whether any split happened before taking lock. Is this not
necessary now?
Also below "if" is always true as we enter here only when outer "if
(retry)" is true.
+ if (retry)
+ {
+ if (oldblkno == blkno)
+ break;
+ _hash_relbuf(rel, buf);
+ }
--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com
On Fri, Jun 24, 2016 at 2:38 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
On Thu, Jun 16, 2016 at 12:58 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
I have a question regarding code changes in _hash_first.
+ /*
+ * Conditionally get the lock on primary bucket page for search while
+ * holding lock on meta page. If we have to wait, then release the meta
+ * page lock and retry it in a hard way.
+ */
+ bucket = _hash_hashkey2bucket(hashkey,
+                               metap->hashm_maxbucket,
+                               metap->hashm_highmask,
+                               metap->hashm_lowmask);
+
+ blkno = BUCKET_TO_BLKNO(metap, bucket);
+
+ /* Fetch the primary bucket page for the bucket */
+ buf = ReadBuffer(rel, blkno);
+ if (!ConditionalLockBufferShared(buf))
Here we try to take lock on bucket page but I think if successful we do not
recheck whether any split happened before taking lock. Is this not necessary
now?
Yes, that is not needed now, because we do that check while holding the
read lock on the metapage, and a split requires a write lock on the
metapage. The basic idea of this optimization is that if we can get the
bucket lock immediately, we do so while still holding the metapage lock;
otherwise, if we have to wait for the lock on the bucket page, we fall
back to the previous kind of mechanism.
Also below "if" is always true as we enter here only when outer "if (retry)" is true. + if (retry) + { + if (oldblkno == blkno) + break; + _hash_relbuf(rel, buf); + }
Good catch; I think we don't need this retry check now. We do need a
similar change in _hash_doinsert().
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jun 22, 2016 at 8:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jun 22, 2016 at 5:10 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Insertion will happen by scanning the appropriate bucket and needs to
retain pin on primary bucket to ensure that concurrent split doesn't happen,
otherwise split might leave this tuple unaccounted.
What do you mean by "unaccounted"?
It means that split might leave this tuple in old bucket even if it can be
moved to new bucket. Consider a case where insertion has to add a tuple on
some intermediate overflow bucket in the bucket chain, if we allow split
when insertion is in progress, split might not move this newly inserted
tuple.
I think this is basically correct, although I don't find it to be as
clear as I think it could be. It seems very clear that any operation
which potentially changes the order of tuples in the bucket chain,
such as the squeeze phase as currently implemented, also needs to
exclude all concurrent scans. However, I think that it's OK for
vacuum to remove tuples from a given page with only an exclusive lock
on that particular page.
How can we guarantee that it doesn't remove a tuple that is required by scan
which is started after split-in-progress flag is set?
If the tuple is being removed by VACUUM, it is dead. We can remove
dead tuples right away, because no MVCC scan will see them. In fact,
the only snapshot that will see them is SnapshotAny, and there's no
problem with removing dead tuples while a SnapshotAny scan is in
progress. It's no different than heap_page_prune() removing tuples
that a SnapshotAny sequential scan was about to see.
While thinking about this case again, it seems to me that we need a
cleanup lock even for dead tuple removal. The reason is that a scan that
returns multiple tuples always restarts from the offset number at which
it returned the last tuple. Now, consider the case where the first tuple
is returned from offset number 3 on a page, and after that another
backend removes the corresponding tuple from the heap and vacuum also
removes the dead index tuple at offset 3. When the scan tries to get the
next tuple, it will start from offset 3, which can lead to incorrect
results.
Now, one way to solve the above problem could be to change hash index
scans so that they work a page at a time, as we do for btree scans (refer
to BTScanPosData and the comments on top of it). This has the additional
advantage of reducing lock/unlock calls for retrieving tuples from a
page. However, I think this solution can only work for MVCC scans. For
non-MVCC scans, there is still a problem: after fetching all the tuples
from a page, when the scan tries to check the validity of tuples in the
heap, we won't be able to detect whether the old tuple was deleted and a
new tuple has been placed at that location in the heap.
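Purely for illustration, a btree-style page-at-a-time scan position for hash could look roughly like the following; the struct and field names are invented for this sketch and are not part of the posted patch (compare BTScanPosData in nbtree.h).

    typedef struct HashScanPosItem      /* what a page-at-a-time scan remembers */
    {
        ItemPointerData heapTid;        /* TID of referenced heap item */
        OffsetNumber    indexOffset;    /* index item's location within page */
    } HashScanPosItem;

    typedef struct HashScanPosData
    {
        Buffer      buf;                /* if valid, the buffer is pinned */
        BlockNumber currPage;           /* current hash index page */
        BlockNumber nextPage;           /* next overflow page in the bucket chain */

        /* tuples collected from currPage while its content lock was held */
        int         firstItem;
        int         lastItem;
        int         itemIndex;          /* next item to return from items[] */
        HashScanPosItem items[MaxIndexTuplesPerPage];
    } HashScanPosData;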
I think what we can do to solve this for non-MVCC scans is to have
vacuum always take a cleanup lock on a bucket, while MVCC scans release
both the lock and the pin as they proceed. Non-MVCC scans, and scans
that started while a split was in progress, will release the lock but
not the pin on the primary bucket. This way, we can allow vacuum to
proceed even if there is an MVCC scan going on in a bucket, provided the
scan did not start during a bucket split operation. The btree code does
something similar: vacuum always takes a cleanup lock, and a non-MVCC
scan retains a pin on the page.
Insertions should work as they currently do in the patch; that is, they
always need to retain a pin on the primary bucket to avoid the concurrent
split problem mentioned above (refer to the discussion in the first
paragraph of this mail).
Thoughts?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jun 22, 2016 at 8:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jun 22, 2016 at 5:14 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
We can do it in the way as you are suggesting, but there is another thing
which we need to consider here. As of now, the patch tries to finish the
split if it finds split-in-progress flag in either old or new bucket. We
need to lock both old and new buckets to finish the split, so it is quite
possible that two different backends try to lock them in opposite order
leading to a deadlock. I think the correct way to handle is to always try
to lock the old bucket first and then new bucket. To achieve that, if the
insertion on new bucket finds that split-in-progress flag is set on a
bucket, it needs to release the lock and then acquire the lock first on old
bucket, ensure pincount is 1 and then lock new bucket again and ensure that
pincount is 1. I have already maintained the order of locks in scan (old
bucket first and then new bucket; refer changes in _hash_first()).
Alternatively, we can try to finish the splits only when someone tries to
insert in old bucket.
Yes, I think locking buckets in increasing order is a good solution.
I also think it's fine to only try to finish the split when the insert
targets the old bucket. Finishing the split enables us to remove
tuples from the old bucket, which lets us reuse space instead of
accelerating more. So there is at least some potential benefit to the
backend inserting into the old bucket. On the other hand, a process
inserting into the new bucket derives no direct benefit from finishing
the split.
Okay, following this suggestion, I have updated the patch so that only
insertion into the old bucket tries to finish the splits. Apart from
that, I have fixed the issue reported by Mithun upthread. I have updated
the README to explain the locking used in the patch. Also, I have
changed the locking around vacuum, so that it can work with concurrent
scans whenever possible. In the previous patch version, vacuum used to
take a cleanup lock on a bucket to remove the dead tuples and the
moved-due-to-split tuples and to perform the squeeze operation, and it
held that lock till the end of cleanup. Now it still takes a cleanup
lock on the bucket to out-wait scans, but it releases the lock as it
proceeds to clean the overflow pages. The idea is that we first lock the
next page in the bucket chain and only then release the lock on the
current page. This ensures that any concurrent scan started after we
start cleaning the bucket will always stay behind the cleanup; if scans
were allowed to overtake vacuum, vacuum could remove tuples that a scan
still requires. Also, for the squeeze phase we just check whether the
pin count of the bucket buffer is one (we already hold an Exclusive lock
on it by that time) and proceed only in that case; otherwise we will try
to squeeze the next time cleanup is required for that bucket.
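A condensed sketch of that page-chaining rule (the per-page cleanup itself is elided; the point is only the order in which locks are acquired and released):

    buf = bucket_buf;                   /* cleanup lock already held here */
    for (;;)
    {
        HashPageOpaque opaque = (HashPageOpaque)
            PageGetSpecialPointer(BufferGetPage(buf));
        BlockNumber next_blkno = opaque->hasho_nextblkno;
        Buffer      next_buf = InvalidBuffer;

        /* ... remove dead and moved-by-split tuples from this page ... */

        /* lock the next overflow page before letting go of the current one */
        if (BlockNumberIsValid(next_blkno))
            next_buf = _hash_getbuf(rel, next_blkno, HASH_WRITE, LH_OVERFLOW_PAGE);

        if (buf == bucket_buf)
            LockBuffer(buf, BUFFER_LOCK_UNLOCK);   /* keep pin on primary bucket */
        else
            _hash_relbuf(rel, buf);                /* drop lock and pin */

        if (!BufferIsValid(next_buf))
            break;
        buf = next_buf;
    }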
Thoughts/Suggestions?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
concurrent_hash_index_v3.patch (application/octet-stream)
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index c1122b4..d02539a 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -400,7 +400,6 @@ pgstat_hash_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno,
Buffer buf;
Page page;
- _hash_getlock(rel, blkno, HASH_SHARE);
buf = _hash_getbuf_with_strategy(rel, blkno, HASH_READ, 0, bstrategy);
page = BufferGetPage(buf);
@@ -431,7 +430,6 @@ pgstat_hash_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno,
}
_hash_relbuf(rel, buf);
- _hash_droplock(rel, blkno, HASH_SHARE);
}
/*
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 0a7da89..a0feb2f 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -125,49 +125,45 @@ the initially created buckets.
Lock Definitions
----------------
-
-We use both lmgr locks ("heavyweight" locks) and buffer context locks
-(LWLocks) to control access to a hash index. lmgr locks are needed for
-long-term locking since there is a (small) risk of deadlock, which we must
-be able to detect. Buffer context locks are used for short-term access
-control to individual pages of the index.
-
-LockPage(rel, page), where page is the page number of a hash bucket page,
-represents the right to split or compact an individual bucket. A process
-splitting a bucket must exclusive-lock both old and new halves of the
-bucket until it is done. A process doing VACUUM must exclusive-lock the
-bucket it is currently purging tuples from. Processes doing scans or
-insertions must share-lock the bucket they are scanning or inserting into.
-(It is okay to allow concurrent scans and insertions.)
-
-The lmgr lock IDs corresponding to overflow pages are currently unused.
-These are available for possible future refinements. LockPage(rel, 0)
-is also currently undefined (it was previously used to represent the right
-to modify the hash-code-to-bucket mapping, but it is no longer needed for
-that purpose).
-
-Note that these lock definitions are conceptually distinct from any sort
-of lock on the pages whose numbers they share. A process must also obtain
-read or write buffer lock on the metapage or bucket page before accessing
-said page.
-
-Processes performing hash index scans must hold share lock on the bucket
-they are scanning throughout the scan. This seems to be essential, since
-there is no reasonable way for a scan to cope with its bucket being split
-underneath it. This creates a possibility of deadlock external to the
-hash index code, since a process holding one of these locks could block
-waiting for an unrelated lock held by another process. If that process
-then does something that requires exclusive lock on the bucket, we have
-deadlock. Therefore the bucket locks must be lmgr locks so that deadlock
-can be detected and recovered from.
-
-Processes must obtain read (share) buffer context lock on any hash index
-page while reading it, and write (exclusive) lock while modifying it.
-To prevent deadlock we enforce these coding rules: no buffer lock may be
-held long term (across index AM calls), nor may any buffer lock be held
-while waiting for an lmgr lock, nor may more than one buffer lock
-be held at a time by any one process. (The third restriction is probably
-stronger than necessary, but it makes the proof of no deadlock obvious.)
+We use buffer content locks (LWLocks) and buffer pins to control access to
+a hash index.
+
+Scans take a lock in shared mode on the primary or overflow bucket. Inserts
+acquire an exclusive lock on the bucket in which they have to insert. Both
+operations release the lock on the previous bucket before moving to the next
+overflow bucket. They retain a pin on the primary bucket till the end of the
+operation.
+Split operation must acquire cleanup lock on both old and new halves of the
+bucket and mark split-in-progress on both the buckets. The cleanup lock at
+the start of split ensures that parallel insert won't get lost. Consider a
+case where insertion has to add a tuple on some intermediate overflow bucket
+in the bucket chain, if we allow split when insertion is in progress, split
+might not move this newly inserted tuple. It releases the lock on previous
+bucket before moving to the next overflow bucket either for old bucket or for
+new bucket. After partitioning the tuples between old and new buckets, it
+again needs to acquire exclusive lock on both old and new buckets to clear
+the split-in-progress flag. Like inserts and scans, it will also retain pins
+on both the old and new primary buckets till end of split operation, although
+we can do without that as well.
+
+Vacuum acquires a cleanup lock on the bucket to remove dead tuples and/or
+tuples that are moved due to a split. The cleanup lock is needed for removing
+dead tuples to ensure that scans return correct results: a scan that returns
+multiple tuples from the same bucket page always restarts from the offset
+number at which it returned the last tuple, so if we allowed vacuum to remove
+dead tuples with just an exclusive lock, it could remove the tuple required to
+resume the scan. The cleanup lock is needed for removing moved-by-split tuples
+to ensure that there is no pending scan that started after the start of the
+split and before its finish on the bucket; without it, vacuum could remove
+tuples that such a scan still requires. We don't need to retain this cleanup
+lock during the whole vacuum operation on the bucket; we release the lock as
+we move ahead in the bucket chain. In the end, for the squeeze phase, we
+conditionally acquire the cleanup lock, and if we don't get it, we just
+abandon the squeeze phase.
+
+To avoid deadlocks, we must be consistent about the lock order in which we
+lock the buckets for operations that require locks on two different buckets.
+We use the rule "first lock the old bucket and then the new bucket"; in other
+words, lock the lower-numbered bucket first.
Pseudocode Algorithms
@@ -188,63 +184,105 @@ track of available overflow pages.
The reader algorithm is:
pin meta page and take buffer content lock in shared mode
- loop:
- compute bucket number for target hash key
- release meta page buffer content lock
- if (correct bucket page is already locked)
- break
- release any existing bucket page lock (if a concurrent split happened)
- take heavyweight bucket lock
- retake meta page buffer content lock in shared mode
+ compute bucket number for target hash key
+ read and pin the primary bucket page
+ conditionally get the buffer content lock in shared mode on primary bucket page for search
+ if we didn't get the lock (need to wait for lock)
+ release the buffer content lock on meta page
+ acquire buffer content lock on primary bucket page in shared mode
+ acquire the buffer content lock in shared mode on meta page
+ to check for the possibility of a split, recompute the bucket and
+ verify that it is still the correct one; set the retry flag
+ else if we get the lock, then we can skip the retry path
+ if (retry)
+ loop:
+ compute bucket number for target hash key
+ release meta page buffer content lock
+ if (correct bucket page is already locked)
+ break
+ release any existing content lock on bucket page (if a concurrent split happened)
+ pin primary bucket page and take shared buffer content lock
+ retake meta page buffer content lock in shared mode
-- then, per read request:
release pin on metapage
- read current page of bucket and take shared buffer content lock
- step to next page if necessary (no chaining of locks)
+ if the split is in progress for current bucket and this is a new bucket
+ release the buffer content lock on current bucket page
+ pin and acquire the buffer content lock on old bucket in shared mode
+ release the buffer content lock on old bucket, but not pin
+ retake the buffer content lock on new bucket
+ mark the scan such that it skips the tuples that are marked as moved by split
+ step to next page if necessary (no chaining of locks)
+ if the scan is flagged to skip moved-by-split tuples, then also scan the old
+ bucket after the scan of the current bucket is finished
get tuple
release buffer content lock and pin on current page
-- at scan shutdown:
- release bucket share-lock
-
-We can't hold the metapage lock while acquiring a lock on the target bucket,
-because that might result in an undetected deadlock (lwlocks do not participate
-in deadlock detection). Instead, we relock the metapage after acquiring the
-bucket page lock and check whether the bucket has been split. If not, we're
-done. If so, we release our previously-acquired lock and repeat the process
-using the new bucket number. Holding the bucket sharelock for
+ release any pin we hold on current buffer, old bucket buffer, new bucket buffer
+
+We don't want to hold the meta page lock while waiting to acquire the
+content lock on a bucket page, because that would hurt concurrency.
+Instead, we relock the metapage after acquiring the bucket page content lock
+and check whether the bucket has been split. If not, we're done. If so, we
+release our previously-acquired content lock (but not the pin) and repeat the
+process using the new bucket number. Holding the buffer pin on bucket page for
the remainder of the scan prevents the reader's current-tuple pointer from
-being invalidated by splits or compactions. Notice that the reader's lock
+being invalidated by splits or compactions. Notice that the reader's pin
does not prevent other buckets from being split or compacted.
To keep concurrency reasonably good, we require readers to cope with
concurrent insertions, which means that they have to be able to re-find
-their current scan position after re-acquiring the page sharelock. Since
-deletion is not possible while a reader holds the bucket sharelock, and
-we assume that heap tuple TIDs are unique, this can be implemented by
+their current scan position after re-acquiring the buffer content lock on the
+page. Since deletion is not possible while a reader holds a pin on the bucket,
+and we assume that heap tuple TIDs are unique, this can be implemented by
searching for the same heap tuple TID previously returned. Insertion does
not move index entries across pages, so the previously-returned index entry
should always be on the same page, at the same or higher offset number,
as it was before.
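As an illustrative sketch only (the real logic lives in hashgettuple further
down in the patch), re-finding the position could look like the helper below;
the helper name and the saved-state arguments are hypothetical, and the caller
is assumed to have re-acquired the shared content lock on the page:

    static OffsetNumber
    refind_scan_position(Page page, ItemPointer last_returned_tid,
                         OffsetNumber last_offset)
    {
        OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
        OffsetNumber off;

        /*
         * Concurrent inserts can only push our tuple to a higher offset on
         * the same page, so start at the offset we remembered.
         */
        for (off = last_offset; off <= maxoff; off = OffsetNumberNext(off))
        {
            IndexTuple  itup = (IndexTuple) PageGetItem(page,
                                                        PageGetItemId(page, off));

            if (ItemPointerEquals(&itup->t_tid, last_returned_tid))
                return off;     /* resume just after this item */
        }
        return InvalidOffsetNumber; /* shouldn't happen: the TID can't vanish under our pin */
    }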
+To allow scans during a bucket split: if, at the start of the scan, the bucket
+is marked as split-in-progress, the scan visits all tuples in that bucket
+except those marked as moved-by-split. Once it has finished with all the
+tuples in the current bucket, it scans the old bucket from which this bucket
+was formed by the split. This applies only to the new half of the split.
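A compressed sketch of the per-tuple filtering this implies, assuming the
INDEX_MOVED_BY_SPLIT_MASK bit that the patch stores in t_info (the helper name
is hypothetical):

    static bool
    scan_should_return_tuple(IndexTuple itup, bool skip_moved_tuples)
    {
        /*
         * While scanning the new half of an in-progress split, ignore tuples
         * copied in by the split; the scan will see them in the old bucket.
         */
        if (skip_moved_tuples && (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
            return false;
        return true;
    }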
+
The insertion algorithm is rather similar:
pin meta page and take buffer content lock in shared mode
- loop:
- compute bucket number for target hash key
- release meta page buffer content lock
- if (correct bucket page is already locked)
- break
- release any existing bucket page lock (if a concurrent split happened)
- take heavyweight bucket lock in shared mode
- retake meta page buffer content lock in shared mode
--- (so far same as reader)
+ compute bucket number for target hash key
+ read and pin the primary bucket page
+ conditionally get the buffer content lock in exclusive mode on primary bucket page for search
+ if we didn't get the lock (need to wait for lock)
+ release the buffer content lock on meta page
+ acquire buffer content lock on primary bucket page in exclusive mode
+ acquire the buffer content lock in shared mode on meta page
+ to check for the possibility of a split, recompute the bucket and
+ verify that it is still the correct one; set the retry flag
+ else if we get the lock, then we can skip the retry path
+ if (retry)
+ loop:
+ compute bucket number for target hash key
+ release meta page buffer content lock
+ if (correct bucket page is already locked)
+ break
+ release any existing content lock on bucket page (if a concurrent split happened)
+ pin primary bucket page and take exclusive buffer content lock
+ retake meta page buffer content lock in shared mode
+-- (so far same as reader, except for acquisition of buffer content lock in
+ exclusive mode on primary bucket page)
release pin on metapage
- pin current page of bucket and take exclusive buffer content lock
- if full, release, read/exclusive-lock next page; repeat as needed
+ if the split-in-progress flag is set for bucket in old half of split
+ and pin count on it is one, then finish the split
+ we already have a buffer content lock on old bucket, conditionally get the content lock on new bucket
+ if we get the lock on the new bucket
+ finish the split using algorithm mentioned below for split
+ release the buffer content lock and pin on new bucket
+ if full, release lock but not pin, read/exclusive-lock next page; repeat as needed
>> see below if no space in any page of bucket
insert tuple at appropriate place in page
mark current page dirty and release buffer content lock and pin
+ if current page is not a bucket page, release the pin on bucket page
- release heavyweight share-lock
- pin meta page and take buffer content lock in shared mode
+ pin meta page and take buffer content lock in exclusive mode
increment tuple count, decide if split needed
mark meta page dirty and release buffer content lock and pin
done if no split needed, else enter Split algorithm below
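A condensed sketch of the conditional-lock dance above for the insertion path,
using only buffer-manager calls the patch itself relies on; the helper name is
hypothetical and the recompute-and-retry loop is left to the caller:

    static Buffer
    pin_and_lock_target_bucket(Relation rel, Buffer metabuf, BlockNumber blkno)
    {
        Buffer  buf = ReadBuffer(rel, blkno);

        /* Don't sleep on the bucket lock while still holding the metapage lock. */
        if (!ConditionalLockBuffer(buf))
        {
            LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);   /* let others at the metapage */
            LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);    /* now wait for the bucket */
            LockBuffer(metabuf, BUFFER_LOCK_SHARE);    /* recheck for a concurrent split */
            /* caller must recompute the bucket number and retry if it changed */
        }
        return buf;
    }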
@@ -256,11 +294,13 @@ bucket that is being actively scanned, because readers can cope with this
as explained above. We only need the short-term buffer locks to ensure
that readers do not see a partially-updated page.
-It is clearly impossible for readers and inserters to deadlock, and in
-fact this algorithm allows them a very high degree of concurrency.
-(The exclusive metapage lock taken to update the tuple count is stronger
-than necessary, since readers do not care about the tuple count, but the
-lock is held for such a short time that this is probably not an issue.)
+To avoid deadlock between readers and inserters, whenever there is a need
+to lock multiple buckets, we always take the locks in the order given in
+the Lock Definitions section above. This algorithm allows them a very high degree of
+concurrency. (The exclusive metapage lock taken to update the tuple count
+is stronger than necessary, since readers do not care about the tuple count,
+but the lock is held for such a short time that this is probably not an
+issue.)
When an inserter cannot find space in any existing page of a bucket, it
must obtain an overflow page and add that page to the bucket's chain.
@@ -271,46 +311,79 @@ index is overfull (has a higher-than-wanted ratio of tuples to buckets).
The algorithm attempts, but does not necessarily succeed, to split one
existing bucket in two, thereby lowering the fill ratio:
- pin meta page and take buffer content lock in exclusive mode
- check split still needed
- if split not needed anymore, drop buffer content lock and pin and exit
- decide which bucket to split
- Attempt to X-lock old bucket number (definitely could fail)
- Attempt to X-lock new bucket number (shouldn't fail, but...)
- if above fail, drop locks and pin and exit
+ expand:
+ take buffer content lock in exclusive mode on meta page
+ check split still needed
+ if split not needed anymore, drop buffer content lock and exit
+ decide which bucket to split
+ Attempt to acquire cleanup lock on old bucket number (definitely could fail)
+ if above fail, release lock and pin and exit
+ if the split-in-progress flag is set, then finish the split
+ conditionally get the content lock on new bucket which was involved in split
+ if we got the lock on the new bucket
+ finish the split using algorithm mentioned below for split
+ release the buffer content lock and pin on old and new buckets
+ try to expand from start
+ else
+ release the buffer content lock and pin on old bucket and exit
+ if the garbage flag (indicates that tuples are moved by split) is set on bucket
+ release the buffer content lock on meta page
+ remove the tuples that don't belong to this bucket; see bucket cleanup below
+ Attempt to acquire cleanup lock on new bucket number (shouldn't fail, but...)
update meta page to reflect new number of buckets
- mark meta page dirty and release buffer content lock and pin
+ mark meta page dirty and release buffer content lock
-- now, accesses to all other buckets can proceed.
Perform actual split of bucket, moving tuples as needed
>> see below about acquiring needed extra space
Release X-locks of old and new buckets
+ split guts
+ mark the old and new buckets indicating split-in-progress
+ mark the old bucket indicating has-garbage
+ copy the tuples that belongs to new bucket from old bucket
+ during copy mark such tuples as moved-by-split
+ release lock but not pin for primary bucket page of old bucket,
+ read/shared-lock next page; repeat as needed
+ >> see below if no space in bucket page of new bucket
+ ensure to have exclusive-lock on both old and new buckets in that order
+ clear the split-in-progress flag from both the buckets
+ mark buffers dirty and release the locks and pins on both old and new buckets
+
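A minimal sketch of the final flag-clearing step, using the page-flag names the
patch introduces (LH_BUCKET_OLD_PAGE_SPLIT, LH_BUCKET_NEW_PAGE_SPLIT); the
helper name is hypothetical, and the caller is assumed to have re-acquired
exclusive content locks on both primary pages, old bucket first:

    static void
    clear_split_in_progress(Buffer obuf, Buffer nbuf)
    {
        HashPageOpaque  oopaque =
            (HashPageOpaque) PageGetSpecialPointer(BufferGetPage(obuf));
        HashPageOpaque  nopaque =
            (HashPageOpaque) PageGetSpecialPointer(BufferGetPage(nbuf));

        /*
         * The split is complete; future scans of the new bucket no longer
         * need to visit the old bucket or skip moved-by-split tuples.
         */
        oopaque->hasho_flag &= ~LH_BUCKET_OLD_PAGE_SPLIT;
        nopaque->hasho_flag &= ~LH_BUCKET_NEW_PAGE_SPLIT;

        MarkBufferDirty(obuf);
        MarkBufferDirty(nbuf);
    }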
Note the metapage lock is not held while the actual tuple rearrangement is
performed, so accesses to other buckets can proceed in parallel; in fact,
it's possible for multiple bucket splits to proceed in parallel.
-Split's attempt to X-lock the old bucket number could fail if another
-process holds S-lock on it. We do not want to wait if that happens, first
-because we don't want to wait while holding the metapage exclusive-lock,
-and second because it could very easily result in deadlock. (The other
-process might be out of the hash AM altogether, and could do something
-that blocks on another lock this process holds; so even if the hash
-algorithm itself is deadlock-free, a user-induced deadlock could occur.)
-So, this is a conditional LockAcquire operation, and if it fails we just
-abandon the attempt to split. This is all right since the index is
-overfull but perfectly functional. Every subsequent inserter will try to
-split, and eventually one will succeed. If multiple inserters failed to
-split, the index might still be overfull, but eventually, the index will
+Split's attempt to acquire cleanup-lock on the old bucket number could fail
+if another process holds any lock or pin on it. We do not want to wait if
+that happens, because we don't want to wait while holding the metapage
+exclusive-lock. So, this is a conditional LWLockAcquire operation, and if
+it fails we just abandon the attempt to split. This is all right since the
+index is overfull but perfectly functional. Every subsequent inserter will
+try to split, and eventually one will succeed. If multiple inserters failed
+to split, the index might still be overfull, but eventually, the index will
not be overfull and split attempts will stop. (We could make a successful
splitter loop to see if the index is still overfull, but it seems better to
distribute the split overhead across successive insertions.)
+The has-garbage flag indicates that the bucket contains tuples that were
+moved due to a split; it is set only on the old bucket. We need it in
+addition to the split-in-progress flag in order to recognize the state where
+the split is already over (i.e. split-in-progress has been cleared) but the
+moved tuples have not yet been removed. It is used both by vacuum and by the
+re-split operation. Vacuum uses it to decide whether, along with dead tuples,
+it needs to remove the moved-by-split tuples from the bucket. A re-split
+uses it to ensure that it doesn't start a new split from a bucket before the
+leftover tuples of the previous split have been cleared from the old bucket.
+The latter keeps bloat under control and makes the design somewhat simpler,
+since we never have to handle a bucket that contains dead tuples from
+multiple splits.
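In code terms, the condition under which the moved tuples may finally be
removed is just the conjunction of the two flags; a sketch using the macros
the patch defines (H_HAS_GARBAGE, H_INCOMPLETE_SPLIT), with a hypothetical
helper name:

    static bool
    bucket_garbage_removable(HashPageOpaque opaque)
    {
        /*
         * Moved-by-split tuples may be removed only once the split that
         * created them has finished; until then a scan of the new bucket
         * may still need to find them here in the old bucket.
         */
        return H_HAS_GARBAGE(opaque) && !H_INCOMPLETE_SPLIT(opaque);
    }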
+
A problem is that if a split fails partway through (eg due to insufficient
-disk space) the index is left corrupt. The probability of that could be
-made quite low if we grab a free page or two before we update the meta
-page, but the only real solution is to treat a split as a WAL-loggable,
+disk space or crash) the index is left corrupt. The probability of that
+could be made quite low if we grab a free page or two before we update the
+meta page, but the only real solution is to treat a split as a WAL-loggable,
must-complete action. I'm not planning to teach hash about WAL in this
-go-round.
+go-round. However, we do try to finish the incomplete splits during insert
+and split.
The fourth operation is garbage collection (bulk deletion):
@@ -319,9 +392,13 @@ The fourth operation is garbage collection (bulk deletion):
fetch current max bucket number
release meta page buffer content lock and pin
while next bucket <= max bucket do
- Acquire X lock on target bucket
- Scan and remove tuples, compact free space as needed
- Release X lock
+ Acquire cleanup lock on target bucket
+ Scan and remove tuples
+ For overflow pages, first lock the next page and then
+ release the lock on the current page
+ Ensure to have X lock on bucket page
+ If buffer pincount is one, then compact free space as needed
+ Release lock
next bucket ++
end loop
pin metapage and take buffer content lock in exclusive mode
@@ -330,20 +407,23 @@ The fourth operation is garbage collection (bulk deletion):
else update metapage tuple count
mark meta page dirty and release buffer content lock and pin
-Note that this is designed to allow concurrent splits. If a split occurs,
-tuples relocated into the new bucket will be visited twice by the scan,
-but that does no harm. (We must however be careful about the statistics
-reported by the VACUUM operation. What we can do is count the number of
-tuples scanned, and believe this in preference to the stored tuple count
-if the stored tuple count and number of buckets did *not* change at any
-time during the scan. This provides a way of correcting the stored tuple
-count if it gets out of sync for some reason. But if a split or insertion
-does occur concurrently, the scan count is untrustworthy; instead,
-subtract the number of tuples deleted from the stored tuple count and
-use that.)
-
-The exclusive lock request could deadlock in some strange scenarios, but
-we can just error out without any great harm being done.
+Note that this is designed to allow concurrent splits and scans. If a
+split occurs, tuples relocated into the new bucket will be visited twice
+by the scan, but that does no harm. Because we release locks as we work
+through a bucket, a concurrent scan can start on the bucket, but it will
+always stay behind the cleanup. Keeping scans behind cleanup is essential,
+else vacuum could remove tuples that are still required to complete the
+scan, as explained in the Lock Definitions section above. This holds true
+for backward scans as well (backward scans first traverse each bucket from
+its primary page to the last overflow page in the chain).
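The "scan stays behind cleanup" property comes from lock-coupling along the
bucket chain; a condensed sketch mirroring hashbucketcleanup further down in
the patch (the helper name is hypothetical):

    static Buffer
    cleanup_step_to_next_page(Relation rel, Buffer curbuf, BlockNumber nextblkno,
                              BufferAccessStrategy bstrategy)
    {
        /*
         * Lock the next overflow page before giving up the current one, so a
         * concurrent scan entering the chain can never overtake the cleanup.
         */
        Buffer  nextbuf = _hash_getbuf_with_strategy(rel, nextblkno, HASH_WRITE,
                                                     LH_OVERFLOW_PAGE, bstrategy);

        _hash_relbuf(rel, curbuf);  /* drop lock and pin on the page just cleaned */
        return nextbuf;
    }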
+We must be careful about the statistics reported by the VACUUM operation.
+What we can do is count the number of tuples scanned, and believe this in
+preference to the stored tuple count if the stored tuple count and number
+of buckets did *not* change at any time during the scan. This provides a
+way of correcting the stored tuple count if it gets out of sync for some
+reason. But if a split or insertion does occur concurrently, the scan
+count is untrustworthy; instead, subtract the number of tuples deleted
+from the stored tuple count and use that.
Free Space Management
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 19695ee..5552f2d 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -271,10 +271,10 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
/*
* An insertion into the current index page could have happened while
* we didn't have read lock on it. Re-find our position by looking
- * for the TID we previously returned. (Because we hold share lock on
- * the bucket, no deletions or splits could have occurred; therefore
- * we can expect that the TID still exists in the current index page,
- * at an offset >= where we were.)
+ * for the TID we previously returned. (Because we hold pin on the
+ * bucket, no deletions or splits could have occurred; therefore we
+ * can expect that the TID still exists in the current index page, at
+ * an offset >= where we were.)
*/
OffsetNumber maxoffnum;
@@ -409,12 +409,15 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
so->hashso_bucket_valid = false;
- so->hashso_bucket_blkno = 0;
so->hashso_curbuf = InvalidBuffer;
+ so->hashso_bucket_buf = InvalidBuffer;
+ so->hashso_old_bucket_buf = InvalidBuffer;
/* set position invalid (this will cause _hash_first call) */
ItemPointerSetInvalid(&(so->hashso_curpos));
ItemPointerSetInvalid(&(so->hashso_heappos));
+ so->hashso_skip_moved_tuples = false;
+
scan->opaque = so;
/* register scan in case we change pages it's using */
@@ -438,10 +441,15 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_hash_dropbuf(rel, so->hashso_curbuf);
so->hashso_curbuf = InvalidBuffer;
- /* release lock on bucket, too */
- if (so->hashso_bucket_blkno)
- _hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
- so->hashso_bucket_blkno = 0;
+ /* release pin we hold on old primary bucket */
+ if (BufferIsValid(so->hashso_old_bucket_buf))
+ _hash_dropbuf(rel, so->hashso_old_bucket_buf);
+ so->hashso_old_bucket_buf = InvalidBuffer;
+
+ /* release pin we hold on primary bucket */
+ if (BufferIsValid(so->hashso_bucket_buf))
+ _hash_dropbuf(rel, so->hashso_bucket_buf);
+ so->hashso_bucket_buf = InvalidBuffer;
/* set position invalid (this will cause _hash_first call) */
ItemPointerSetInvalid(&(so->hashso_curpos));
@@ -455,6 +463,8 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
scan->numberOfKeys * sizeof(ScanKeyData));
so->hashso_bucket_valid = false;
}
+
+ so->hashso_skip_moved_tuples = false;
}
/*
@@ -474,10 +484,15 @@ hashendscan(IndexScanDesc scan)
_hash_dropbuf(rel, so->hashso_curbuf);
so->hashso_curbuf = InvalidBuffer;
- /* release lock on bucket, too */
- if (so->hashso_bucket_blkno)
- _hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
- so->hashso_bucket_blkno = 0;
+ /* release pin we hold on old primary bucket */
+ if (BufferIsValid(so->hashso_old_bucket_buf))
+ _hash_dropbuf(rel, so->hashso_old_bucket_buf);
+ so->hashso_old_bucket_buf = InvalidBuffer;
+
+ /* release pin we hold on primary bucket */
+ if (BufferIsValid(so->hashso_bucket_buf))
+ _hash_dropbuf(rel, so->hashso_bucket_buf);
+ so->hashso_bucket_buf = InvalidBuffer;
pfree(so);
scan->opaque = NULL;
@@ -488,6 +503,9 @@ hashendscan(IndexScanDesc scan)
* The set of target tuples is specified via a callback routine that tells
* whether any given heap tuple (identified by ItemPointer) is being deleted.
*
+ * This function also deletes the tuples that were moved by a split to
+ * another bucket.
+ *
* Result: a palloc'd struct containing statistical info for VACUUM displays.
*/
IndexBulkDeleteResult *
@@ -532,83 +550,52 @@ loop_top:
{
BlockNumber bucket_blkno;
BlockNumber blkno;
- bool bucket_dirty = false;
+ Buffer bucket_buf;
+ Buffer buf;
+ HashPageOpaque bucket_opaque;
+ Page page;
+ bool bucket_has_garbage = false;
/* Get address of bucket's start page */
bucket_blkno = BUCKET_TO_BLKNO(&local_metapage, cur_bucket);
- /* Exclusive-lock the bucket so we can shrink it */
- _hash_getlock(rel, bucket_blkno, HASH_EXCLUSIVE);
-
/* Shouldn't have any active scans locally, either */
if (_hash_has_active_scan(rel, cur_bucket))
elog(ERROR, "hash index has active scan during VACUUM");
- /* Scan each page in bucket */
blkno = bucket_blkno;
- while (BlockNumberIsValid(blkno))
- {
- Buffer buf;
- Page page;
- HashPageOpaque opaque;
- OffsetNumber offno;
- OffsetNumber maxoffno;
- OffsetNumber deletable[MaxOffsetNumber];
- int ndeletable = 0;
-
- vacuum_delay_point();
- buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
- LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
- info->strategy);
- page = BufferGetPage(buf);
- opaque = (HashPageOpaque) PageGetSpecialPointer(page);
- Assert(opaque->hasho_bucket == cur_bucket);
-
- /* Scan each tuple in page */
- maxoffno = PageGetMaxOffsetNumber(page);
- for (offno = FirstOffsetNumber;
- offno <= maxoffno;
- offno = OffsetNumberNext(offno))
- {
- IndexTuple itup;
- ItemPointer htup;
+ /*
+ * We need to acquire a cleanup lock on the primary bucket to out-wait
+ * concurrent scans.
+ */
+ buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, info->strategy);
+ LockBufferForCleanup(buf);
+ _hash_checkpage(rel, buf, LH_BUCKET_PAGE);
- itup = (IndexTuple) PageGetItem(page,
- PageGetItemId(page, offno));
- htup = &(itup->t_tid);
- if (callback(htup, callback_state))
- {
- /* mark the item for deletion */
- deletable[ndeletable++] = offno;
- tuples_removed += 1;
- }
- else
- num_index_tuples += 1;
- }
+ page = BufferGetPage(buf);
+ bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
- /*
- * Apply deletions and write page if needed, advance to next page.
- */
- blkno = opaque->hasho_nextblkno;
+ /*
+ * If the bucket contains tuples that were moved by a split, we need to
+ * delete such tuples once the split has completed. Before cleaning, we
+ * need to out-wait any scans that started while the split was in
+ * progress on the bucket.
+ */
+ if (H_HAS_GARBAGE(bucket_opaque) &&
+ !H_INCOMPLETE_SPLIT(bucket_opaque))
+ bucket_has_garbage = true;
- if (ndeletable > 0)
- {
- PageIndexMultiDelete(page, deletable, ndeletable);
- _hash_wrtbuf(rel, buf);
- bucket_dirty = true;
- }
- else
- _hash_relbuf(rel, buf);
- }
+ bucket_buf = buf;
- /* If we deleted anything, try to compact free space */
- if (bucket_dirty)
- _hash_squeezebucket(rel, cur_bucket, bucket_blkno,
- info->strategy);
+ hashbucketcleanup(rel, bucket_buf, blkno, info->strategy,
+ local_metapage.hashm_maxbucket,
+ local_metapage.hashm_highmask,
+ local_metapage.hashm_lowmask, &tuples_removed,
+ &num_index_tuples, bucket_has_garbage, true,
+ callback, callback_state);
- /* Release bucket lock */
- _hash_droplock(rel, bucket_blkno, HASH_EXCLUSIVE);
+ _hash_relbuf(rel, bucket_buf);
/* Advance to next bucket */
cur_bucket++;
@@ -689,6 +676,197 @@ hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
return stats;
}
+/*
+ * Helper function to perform deletion of index entries from a bucket.
+ *
+ * This expects that the caller has acquired a cleanup lock on the target
+ * bucket (the primary page of a bucket), and it is the responsibility of the
+ * caller to release that lock.
+ *
+ * While scanning the overflow pages, we first lock the next page and only
+ * then release the lock on the current page. This ensures that any concurrent
+ * scan started after we begin cleaning the bucket will always stay behind the
+ * cleanup. If scans were allowed to overtake vacuum, vacuum could remove
+ * tuples that the scan still requires.
+ *
+ * We need to retain a pin on the primary bucket to ensure that no concurrent
+ * split can start.
+ */
+void
+hashbucketcleanup(Relation rel, Buffer bucket_buf,
+ BlockNumber bucket_blkno,
+ BufferAccessStrategy bstrategy,
+ uint32 maxbucket,
+ uint32 highmask, uint32 lowmask,
+ double *tuples_removed,
+ double *num_index_tuples,
+ bool bucket_has_garbage,
+ bool delay,
+ IndexBulkDeleteCallback callback,
+ void *callback_state)
+{
+ BlockNumber blkno;
+ Buffer buf;
+ Bucket cur_bucket;
+ Bucket new_bucket PG_USED_FOR_ASSERTS_ONLY;
+ Page page;
+ bool bucket_dirty = false;
+
+ blkno = bucket_blkno;
+ buf = bucket_buf;
+ page = BufferGetPage(buf);
+ cur_bucket = ((HashPageOpaque) PageGetSpecialPointer(page))->hasho_bucket;
+
+ if (bucket_has_garbage)
+ new_bucket = _hash_get_newbucket(rel, cur_bucket,
+ lowmask, maxbucket);
+
+ /* Scan each page in bucket */
+ for (;;)
+ {
+ HashPageOpaque opaque;
+ OffsetNumber offno;
+ OffsetNumber maxoffno;
+ Buffer next_buf;
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+ bool retain_pin = false;
+ bool curr_page_dirty = false;
+
+ if (delay)
+ vacuum_delay_point();
+
+ page = BufferGetPage(buf);
+ opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+ /* Scan each tuple in page */
+ maxoffno = PageGetMaxOffsetNumber(page);
+ for (offno = FirstOffsetNumber;
+ offno <= maxoffno;
+ offno = OffsetNumberNext(offno))
+ {
+ IndexTuple itup;
+ ItemPointer htup;
+ Bucket bucket;
+
+ itup = (IndexTuple) PageGetItem(page,
+ PageGetItemId(page, offno));
+ htup = &(itup->t_tid);
+ if (callback && callback(htup, callback_state))
+ {
+ /* mark the item for deletion */
+ deletable[ndeletable++] = offno;
+ if (tuples_removed)
+ *tuples_removed += 1;
+ }
+ else if (bucket_has_garbage)
+ {
+ /* delete the tuples that are moved by split. */
+ bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
+ maxbucket,
+ highmask,
+ lowmask);
+ /* mark the item for deletion */
+ if (bucket != cur_bucket)
+ {
+ /*
+ * We expect tuples to belong either to the current bucket or to
+ * new_bucket. This is ensured because we don't allow further
+ * splits from a bucket that contains garbage. See
+ * comments in _hash_expandtable.
+ */
+ Assert(bucket == new_bucket);
+ deletable[ndeletable++] = offno;
+ }
+ else if (num_index_tuples)
+ *num_index_tuples += 1;
+ }
+ else if (num_index_tuples)
+ *num_index_tuples += 1;
+ }
+
+ /* retain the pin on primary bucket till end of bucket scan */
+ if (blkno == bucket_blkno)
+ retain_pin = true;
+ else
+ retain_pin = false;
+
+ blkno = opaque->hasho_nextblkno;
+
+ /*
+ * Apply deletions and write page if needed, advance to next page.
+ */
+ if (ndeletable > 0)
+ {
+ PageIndexMultiDelete(page, deletable, ndeletable);
+ bucket_dirty = true;
+ curr_page_dirty = true;
+ }
+
+ /* bail out if there are no more pages to scan. */
+ if (!BlockNumberIsValid(blkno))
+ break;
+
+ next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+ LH_OVERFLOW_PAGE,
+ bstrategy);
+
+ /*
+ * release the lock on previous page after acquiring the lock on next
+ * page
+ */
+ if (curr_page_dirty)
+ {
+ if (retain_pin)
+ _hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+ else
+ _hash_wrtbuf(rel, buf);
+ curr_page_dirty = false;
+ }
+ else if (retain_pin)
+ _hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+ else
+ _hash_relbuf(rel, buf);
+
+ buf = next_buf;
+ }
+
+ /*
+ * Lock the bucket page to clear the garbage flag and squeeze the bucket.
+ * If the current buffer is the same as the bucket buffer, we already
+ * have a lock on the bucket page.
+ */
+ if (buf != bucket_buf)
+ {
+ _hash_relbuf(rel, buf);
+ _hash_chgbufaccess(rel, bucket_buf, HASH_NOLOCK, HASH_WRITE);
+ }
+
+ /*
+ * Clear the garbage flag from the bucket after deleting the tuples that
+ * were moved by the split. We purposely clear the flag before squeezing
+ * the bucket, so that after a restart vacuum doesn't try again to delete
+ * the moved-by-split tuples.
+ */
+ if (bucket_has_garbage)
+ {
+ HashPageOpaque bucket_opaque;
+
+ page = BufferGetPage(bucket_buf);
+ bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+ bucket_opaque->hasho_flag &= ~LH_BUCKET_PAGE_HAS_GARBAGE;
+ }
+
+ /*
+ * If we deleted anything, try to compact free space. To squeeze the
+ * bucket we must hold a cleanup lock, else the squeeze could disturb the
+ * tuple ordering for a scan that started before it.
+ */
+ if (bucket_dirty && CheckBufferForCleanup(bucket_buf))
+ _hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf,
+ bstrategy);
+}
void
hash_redo(XLogReaderState *record)
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index acd2e64..b1e79b5 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -28,7 +28,8 @@
void
_hash_doinsert(Relation rel, IndexTuple itup)
{
- Buffer buf;
+ Buffer buf = InvalidBuffer;
+ Buffer bucket_buf;
Buffer metabuf;
HashMetaPage metap;
BlockNumber blkno;
@@ -40,6 +41,9 @@ _hash_doinsert(Relation rel, IndexTuple itup)
bool do_expand;
uint32 hashkey;
Bucket bucket;
+ uint32 maxbucket;
+ uint32 highmask;
+ uint32 lowmask;
/*
* Get the hash key for the item (it's stored in the index tuple itself).
@@ -70,51 +74,131 @@ _hash_doinsert(Relation rel, IndexTuple itup)
errhint("Values larger than a buffer page cannot be indexed.")));
/*
- * Loop until we get a lock on the correct target bucket.
+ * Copy the bucket mapping info now; the comment in _hash_expandtable where
+ * we copy this information and call _hash_splitbucket explains why this
+ * is OK.
*/
- for (;;)
- {
- /*
- * Compute the target bucket number, and convert to block number.
- */
- bucket = _hash_hashkey2bucket(hashkey,
- metap->hashm_maxbucket,
- metap->hashm_highmask,
- metap->hashm_lowmask);
+ maxbucket = metap->hashm_maxbucket;
+ highmask = metap->hashm_highmask;
+ lowmask = metap->hashm_lowmask;
- blkno = BUCKET_TO_BLKNO(metap, bucket);
+ /*
+ * Conditionally get the lock on primary bucket page for insertion while
+ * holding the lock on the meta page. If we have to wait, release the meta
+ * page lock and retry the hard way.
+ */
+ bucket = _hash_hashkey2bucket(hashkey,
+ maxbucket,
+ highmask,
+ lowmask);
+
+ blkno = BUCKET_TO_BLKNO(metap, bucket);
- /* Release metapage lock, but keep pin. */
+ /* Fetch the primary bucket page for the bucket */
+ buf = ReadBuffer(rel, blkno);
+ if (!ConditionalLockBuffer(buf))
+ {
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+ LockBuffer(buf, HASH_WRITE);
+ _hash_checkpage(rel, buf, LH_BUCKET_PAGE);
+ _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+ oldblkno = blkno;
+ retry = true;
+ }
+ else
+ {
+ _hash_checkpage(rel, buf, LH_BUCKET_PAGE);
_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+ }
+ if (retry)
+ {
/*
- * If the previous iteration of this loop locked what is still the
- * correct target bucket, we are done. Otherwise, drop any old lock
- * and lock what now appears to be the correct bucket.
+ * Loop until we get a lock on the correct target bucket. We get the
+ * lock on primary bucket page and retain the pin on it during insert
+ * operation to prevent the concurrent splits. Retaining pin on a
+ * primary bucket page ensures that split can't happen as it needs to
+ * acquire the cleanup lock on primary bucket page. Acquiring lock on
+ * primary bucket and rechecking if it is a target bucket is mandatory
+ * as otherwise a concurrent split might cause this insertion to fall
+ * in wrong bucket.
*/
- if (retry)
+ for (;;)
{
+ /*
+ * Compute the target bucket number, and convert to block number.
+ */
+ bucket = _hash_hashkey2bucket(hashkey,
+ metap->hashm_maxbucket,
+ metap->hashm_highmask,
+ metap->hashm_lowmask);
+
+ blkno = BUCKET_TO_BLKNO(metap, bucket);
+
+ /* Release metapage lock, but keep pin. */
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+ /*
+ * If the previous iteration of this loop locked what is still the
+ * correct target bucket, we are done. Otherwise, drop any old
+ * lock and lock what now appears to be the correct bucket.
+ */
if (oldblkno == blkno)
break;
- _hash_droplock(rel, oldblkno, HASH_SHARE);
- }
- _hash_getlock(rel, blkno, HASH_SHARE);
+ _hash_relbuf(rel, buf);
- /*
- * Reacquire metapage lock and check that no bucket split has taken
- * place while we were awaiting the bucket lock.
- */
- _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
- oldblkno = blkno;
- retry = true;
+ /* Fetch the primary bucket page for the bucket */
+ buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
+
+ /*
+ * Reacquire metapage lock and check that no bucket split has
+ * taken place while we were awaiting the bucket lock.
+ */
+ _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+ oldblkno = blkno;
+ }
}
- /* Fetch the primary bucket page for the bucket */
- buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
+ /* remember the primary bucket buffer to release the pin on it at end. */
+ bucket_buf = buf;
+
page = BufferGetPage(buf);
pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
Assert(pageopaque->hasho_bucket == bucket);
+ /*
+ * If there is any pending split, try to finish it before proceeding with
+ * the insertion. We only do this when inserting into the old bucket, as
+ * that lets us remove the moved tuples from the old bucket and reuse the
+ * space. There is no comparable benefit to finishing the split while
+ * inserting into the new bucket.
+ *
+ * In future, if we want to finish the splits during insertion in new
+ * bucket, we must ensure the locking order such that old bucket is locked
+ * before new bucket.
+ */
+ if (H_OLD_INCOMPLETE_SPLIT(pageopaque) && CheckBufferForCleanup(buf))
+ {
+ BlockNumber nblkno;
+ Buffer nbuf;
+
+ nblkno = _hash_get_newblk(rel, pageopaque);
+
+ /* Fetch the primary bucket page for the new bucket */
+ nbuf = _hash_getbuf_with_condlock_cleanup(rel, nblkno, LH_BUCKET_PAGE);
+ if (nbuf)
+ {
+ _hash_finish_split(rel, metabuf, buf, nbuf, maxbucket,
+ highmask, lowmask);
+
+ /*
+ * release the buffer here as the insertion will happen in old
+ * bucket.
+ */
+ _hash_relbuf(rel, nbuf);
+ }
+ }
+
/* Do the insertion */
while (PageGetFreeSpace(page) < itemsz)
{
@@ -127,14 +211,23 @@ _hash_doinsert(Relation rel, IndexTuple itup)
{
/*
* ovfl page exists; go get it. if it doesn't have room, we'll
- * find out next pass through the loop test above.
+ * find out next pass through the loop test above. Retain the pin
+ * if it is a primary bucket.
*/
- _hash_relbuf(rel, buf);
+ if (pageopaque->hasho_flag & LH_BUCKET_PAGE)
+ _hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+ else
+ _hash_relbuf(rel, buf);
buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
page = BufferGetPage(buf);
}
else
{
+ bool retain_pin = false;
+
+ /* page flags must be accessed before releasing lock on a page. */
+ retain_pin = pageopaque->hasho_flag & LH_BUCKET_PAGE;
+
/*
* we're at the end of the bucket chain and we haven't found a
* page with enough room. allocate a new overflow page.
@@ -144,7 +237,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
/* chain to a new overflow page */
- buf = _hash_addovflpage(rel, metabuf, buf);
+ buf = _hash_addovflpage(rel, metabuf, buf, retain_pin);
page = BufferGetPage(buf);
/* should fit now, given test above */
@@ -158,11 +251,13 @@ _hash_doinsert(Relation rel, IndexTuple itup)
/* found page with enough space, so add the item here */
(void) _hash_pgaddtup(rel, buf, itemsz, itup);
- /* write and release the modified page */
+ /*
+ * write and release the modified page and ensure to release the pin on
+ * primary page.
+ */
_hash_wrtbuf(rel, buf);
-
- /* We can drop the bucket lock now */
- _hash_droplock(rel, blkno, HASH_SHARE);
+ if (buf != bucket_buf)
+ _hash_dropbuf(rel, bucket_buf);
/*
* Write-lock the metapage so we can increment the tuple count. After
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index db3e268..760563a 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -82,23 +82,20 @@ blkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
*
* On entry, the caller must hold a pin but no lock on 'buf'. The pin is
* dropped before exiting (we assume the caller is not interested in 'buf'
- * anymore). The returned overflow page will be pinned and write-locked;
- * it is guaranteed to be empty.
+ * anymore) if not asked to retain. The pin will be retained only for the
+ * primary bucket. The returned overflow page will be pinned and
+ * write-locked; it is guaranteed to be empty.
*
* The caller must hold a pin, but no lock, on the metapage buffer.
* That buffer is returned in the same state.
*
- * The caller must hold at least share lock on the bucket, to ensure that
- * no one else tries to compact the bucket meanwhile. This guarantees that
- * 'buf' won't stop being part of the bucket while it's unlocked.
- *
* NB: since this could be executed concurrently by multiple processes,
* one should not assume that the returned overflow page will be the
* immediate successor of the originally passed 'buf'. Additional overflow
* pages might have been added to the bucket chain in between.
*/
Buffer
-_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
+_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
{
Buffer ovflbuf;
Page page;
@@ -131,7 +128,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
break;
/* we assume we do not need to write the unmodified page */
- _hash_relbuf(rel, buf);
+ if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+ _hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+ else
+ _hash_relbuf(rel, buf);
buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
}
@@ -149,7 +149,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
/* logically chain overflow page to previous page */
pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
- _hash_wrtbuf(rel, buf);
+ if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+ _hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+ else
+ _hash_wrtbuf(rel, buf);
return ovflbuf;
}
@@ -370,11 +373,11 @@ _hash_firstfreebit(uint32 map)
* in the bucket, or InvalidBlockNumber if no following page.
*
* NB: caller must not hold lock on metapage, nor on either page that's
- * adjacent in the bucket chain. The caller had better hold exclusive lock
- * on the bucket, too.
+ * adjacent in the bucket chain except from primary bucket. The caller had
+ * better hold cleanup lock on the primary bucket.
*/
BlockNumber
-_hash_freeovflpage(Relation rel, Buffer ovflbuf,
+_hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
BufferAccessStrategy bstrategy)
{
HashMetaPage metap;
@@ -413,22 +416,41 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf,
/*
* Fix up the bucket chain. this is a doubly-linked list, so we must fix
* up the bucket chain members behind and ahead of the overflow page being
- * deleted. No concurrency issues since we hold exclusive lock on the
- * entire bucket.
+ * deleted. No concurrency issues since we hold the cleanup lock on
+ * primary bucket. We don't need to acquire a buffer lock to fix the
+ * primary bucket, as we already have that lock.
*/
if (BlockNumberIsValid(prevblkno))
{
- Buffer prevbuf = _hash_getbuf_with_strategy(rel,
- prevblkno,
- HASH_WRITE,
+ if (prevblkno == bucket_blkno)
+ {
+ Buffer prevbuf = ReadBufferExtended(rel, MAIN_FORKNUM,
+ prevblkno,
+ RBM_NORMAL,
+ bstrategy);
+
+ Page prevpage = BufferGetPage(prevbuf);
+ HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+ Assert(prevopaque->hasho_bucket == bucket);
+ prevopaque->hasho_nextblkno = nextblkno;
+ MarkBufferDirty(prevbuf);
+ ReleaseBuffer(prevbuf);
+ }
+ else
+ {
+ Buffer prevbuf = _hash_getbuf_with_strategy(rel,
+ prevblkno,
+ HASH_WRITE,
LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
- bstrategy);
- Page prevpage = BufferGetPage(prevbuf);
- HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+ bstrategy);
+ Page prevpage = BufferGetPage(prevbuf);
+ HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
- Assert(prevopaque->hasho_bucket == bucket);
- prevopaque->hasho_nextblkno = nextblkno;
- _hash_wrtbuf(rel, prevbuf);
+ Assert(prevopaque->hasho_bucket == bucket);
+ prevopaque->hasho_nextblkno = nextblkno;
+ _hash_wrtbuf(rel, prevbuf);
+ }
}
if (BlockNumberIsValid(nextblkno))
{
@@ -570,7 +592,7 @@ _hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
* required that to be true on entry as well, but it's a lot easier for
* callers to leave empty overflow pages and let this guy clean it up.
*
- * Caller must hold exclusive lock on the target bucket. This allows
+ * Caller must hold cleanup lock on the target bucket. This allows
* us to safely lock multiple pages in the bucket.
*
* Since this function is invoked in VACUUM, we provide an access strategy
@@ -580,6 +602,7 @@ void
_hash_squeezebucket(Relation rel,
Bucket bucket,
BlockNumber bucket_blkno,
+ Buffer bucket_buf,
BufferAccessStrategy bstrategy)
{
BlockNumber wblkno;
@@ -591,27 +614,22 @@ _hash_squeezebucket(Relation rel,
HashPageOpaque wopaque;
HashPageOpaque ropaque;
bool wbuf_dirty;
+ bool release_buf = false;
/*
* start squeezing into the base bucket page.
*/
wblkno = bucket_blkno;
- wbuf = _hash_getbuf_with_strategy(rel,
- wblkno,
- HASH_WRITE,
- LH_BUCKET_PAGE,
- bstrategy);
+ wbuf = bucket_buf;
wpage = BufferGetPage(wbuf);
wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
/*
- * if there aren't any overflow pages, there's nothing to squeeze.
+ * if there aren't any overflow pages, there's nothing to squeeze. caller
+ * is responsible to release the lock on primary bucket.
*/
if (!BlockNumberIsValid(wopaque->hasho_nextblkno))
- {
- _hash_relbuf(rel, wbuf);
return;
- }
/*
* Find the last page in the bucket chain by starting at the base bucket
@@ -669,12 +687,17 @@ _hash_squeezebucket(Relation rel,
{
Assert(!PageIsEmpty(wpage));
+ if (wblkno != bucket_blkno)
+ release_buf = true;
+
wblkno = wopaque->hasho_nextblkno;
Assert(BlockNumberIsValid(wblkno));
- if (wbuf_dirty)
+ if (wbuf_dirty && release_buf)
_hash_wrtbuf(rel, wbuf);
- else
+ else if (wbuf_dirty)
+ MarkBufferDirty(wbuf);
+ else if (release_buf)
_hash_relbuf(rel, wbuf);
/* nothing more to do if we reached the read page */
@@ -700,6 +723,7 @@ _hash_squeezebucket(Relation rel,
wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
Assert(wopaque->hasho_bucket == bucket);
wbuf_dirty = false;
+ release_buf = false;
}
/*
@@ -733,19 +757,25 @@ _hash_squeezebucket(Relation rel,
/* are we freeing the page adjacent to wbuf? */
if (rblkno == wblkno)
{
- /* yes, so release wbuf lock first */
- if (wbuf_dirty)
+ if (wblkno != bucket_blkno)
+ release_buf = true;
+
+ /* yes, so release wbuf lock first if needed */
+ if (wbuf_dirty && release_buf)
_hash_wrtbuf(rel, wbuf);
- else
+ else if (wbuf_dirty)
+ MarkBufferDirty(wbuf);
+ else if (release_buf)
_hash_relbuf(rel, wbuf);
+
/* free this overflow page (releases rbuf) */
- _hash_freeovflpage(rel, rbuf, bstrategy);
+ _hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
/* done */
return;
}
/* free this overflow page, then get the previous one */
- _hash_freeovflpage(rel, rbuf, bstrategy);
+ _hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
rbuf = _hash_getbuf_with_strategy(rel,
rblkno,
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 178463f..6dfd411 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -38,10 +38,14 @@ static bool _hash_alloc_buckets(Relation rel, BlockNumber firstblock,
uint32 nblocks);
static void _hash_splitbucket(Relation rel, Buffer metabuf,
Bucket obucket, Bucket nbucket,
- BlockNumber start_oblkno,
+ Buffer obuf,
Buffer nbuf,
uint32 maxbucket,
uint32 highmask, uint32 lowmask);
+static void _hash_splitbucket_guts(Relation rel, Buffer metabuf,
+ Bucket obucket, Bucket nbucket, Buffer obuf,
+ Buffer nbuf, HTAB *htab, uint32 maxbucket,
+ uint32 highmask, uint32 lowmask);
/*
@@ -55,46 +59,6 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
/*
- * _hash_getlock() -- Acquire an lmgr lock.
- *
- * 'whichlock' should the block number of a bucket's primary bucket page to
- * acquire the per-bucket lock. (See README for details of the use of these
- * locks.)
- *
- * 'access' must be HASH_SHARE or HASH_EXCLUSIVE.
- */
-void
-_hash_getlock(Relation rel, BlockNumber whichlock, int access)
-{
- if (USELOCKING(rel))
- LockPage(rel, whichlock, access);
-}
-
-/*
- * _hash_try_getlock() -- Acquire an lmgr lock, but only if it's free.
- *
- * Same as above except we return FALSE without blocking if lock isn't free.
- */
-bool
-_hash_try_getlock(Relation rel, BlockNumber whichlock, int access)
-{
- if (USELOCKING(rel))
- return ConditionalLockPage(rel, whichlock, access);
- else
- return true;
-}
-
-/*
- * _hash_droplock() -- Release an lmgr lock.
- */
-void
-_hash_droplock(Relation rel, BlockNumber whichlock, int access)
-{
- if (USELOCKING(rel))
- UnlockPage(rel, whichlock, access);
-}
-
-/*
* _hash_getbuf() -- Get a buffer by block number for read or write.
*
* 'access' must be HASH_READ, HASH_WRITE, or HASH_NOLOCK.
@@ -132,6 +96,35 @@ _hash_getbuf(Relation rel, BlockNumber blkno, int access, int flags)
}
/*
+ * _hash_getbuf_with_condlock_cleanup() -- as above, but get the buffer for write.
+ *
+ * We try to take the conditional cleanup lock; if we get it we return
+ * the buffer, else we return InvalidBuffer.
+ */
+Buffer
+_hash_getbuf_with_condlock_cleanup(Relation rel, BlockNumber blkno, int flags)
+{
+ Buffer buf;
+
+ if (blkno == P_NEW)
+ elog(ERROR, "hash AM does not use P_NEW");
+
+ buf = ReadBuffer(rel, blkno);
+
+ if (!ConditionalLockBufferForCleanup(buf))
+ {
+ ReleaseBuffer(buf);
+ return InvalidBuffer;
+ }
+
+ /* ref count and lock type are correct */
+
+ _hash_checkpage(rel, buf, flags);
+
+ return buf;
+}
+
+/*
* _hash_getinitbuf() -- Get and initialize a buffer by block number.
*
* This must be used only to fetch pages that are known to be before
@@ -489,9 +482,11 @@ _hash_pageinit(Page page, Size size)
/*
* Attempt to expand the hash table by creating one new bucket.
*
- * This will silently do nothing if it cannot get the needed locks.
+ * This will silently do nothing if there are active scans of our own
+ * backend or if we don't get cleanup lock on old or new bucket.
*
- * The caller should hold no locks on the hash index.
+ * Complete the pending splits and remove the tuples from old bucket,
+ * if there are any left over from previous split.
*
* The caller must hold a pin, but no lock, on the metapage buffer.
* The buffer is returned in the same state.
@@ -506,10 +501,15 @@ _hash_expandtable(Relation rel, Buffer metabuf)
BlockNumber start_oblkno;
BlockNumber start_nblkno;
Buffer buf_nblkno;
+ Buffer buf_oblkno;
+ Page opage;
+ HashPageOpaque oopaque;
uint32 maxbucket;
uint32 highmask;
uint32 lowmask;
+restart_expand:
+
/*
* Write-lock the meta page. It used to be necessary to acquire a
* heavyweight lock to begin a split, but that is no longer required.
@@ -548,11 +548,16 @@ _hash_expandtable(Relation rel, Buffer metabuf)
goto fail;
/*
- * Determine which bucket is to be split, and attempt to lock the old
- * bucket. If we can't get the lock, give up.
+ * Determine which bucket is to be split, and attempt to take cleanup lock
+ * on the old bucket. If we can't get the lock, give up.
+ *
+ * The cleanup lock protects us against other backends, but not against
+ * our own backend. Must check for active scans separately.
*
- * The lock protects us against other backends, but not against our own
- * backend. Must check for active scans separately.
+ * The cleanup lock is mainly to protect the split from concurrent
+ * inserts. See src/backend/access/hash/README, Lock Definitions for
+ * further details. Due to this locking restriction, if there is any
+ * pending scan, the split will give up, which is not good but is harmless.
*/
new_bucket = metap->hashm_maxbucket + 1;
@@ -563,11 +568,90 @@ _hash_expandtable(Relation rel, Buffer metabuf)
if (_hash_has_active_scan(rel, old_bucket))
goto fail;
- if (!_hash_try_getlock(rel, start_oblkno, HASH_EXCLUSIVE))
+ buf_oblkno = _hash_getbuf_with_condlock_cleanup(rel, start_oblkno, LH_BUCKET_PAGE);
+ if (!buf_oblkno)
goto fail;
+ opage = BufferGetPage(buf_oblkno);
+ oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+ /*
+ * We want to finish the pending split from this bucket before starting a
+ * new one: there is no apparent benefit in deferring it, and handling a
+ * split chain involving multiple buckets (in case the new split also
+ * failed) would complicate the code. We don't need to consider the new
+ * bucket for completing the split here, as a re-split of the new bucket
+ * cannot start while there is still a pending split from the old bucket.
+ */
+ if (H_OLD_INCOMPLETE_SPLIT(oopaque))
+ {
+ BlockNumber nblkno;
+ Buffer buf_nblkno;
+
+ /*
+ * Copy bucket mapping info now; The comment in code below where we
+ * copy this information and calls _hash_splitbucket explains why this
+ * is OK.
+ */
+ maxbucket = metap->hashm_maxbucket;
+ highmask = metap->hashm_highmask;
+ lowmask = metap->hashm_lowmask;
+
+ /* Release the metapage lock, before completing the split. */
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+ nblkno = _hash_get_newblk(rel, oopaque);
+
+ /* Fetch the primary bucket page for the new bucket */
+ buf_nblkno = _hash_getbuf_with_condlock_cleanup(rel, nblkno, LH_BUCKET_PAGE);
+ if (!buf_nblkno)
+ {
+ _hash_relbuf(rel, buf_oblkno);
+ goto fail;
+ }
+
+ _hash_finish_split(rel, metabuf, buf_oblkno, buf_nblkno, maxbucket,
+ highmask, lowmask);
+
+ /*
+ * release the buffers and retry for expand.
+ */
+ _hash_relbuf(rel, buf_oblkno);
+ _hash_relbuf(rel, buf_nblkno);
+
+ goto restart_expand;
+ }
+
/*
- * Likewise lock the new bucket (should never fail).
+ * Clean up the tuples left behind by the previous split. This operation
+ * requires a cleanup lock, and we already have one on the old bucket, so
+ * let's do it. We also don't want to allow further splits from the bucket
+ * until the garbage of the previous split is cleaned. This has two
+ * advantages: first, it helps avoid bloat due to garbage, and second,
+ * during cleanup of the bucket we are always sure that the garbage tuples
+ * belong to the most recently split bucket. By contrast, if we allowed
+ * cleanup of a bucket after the meta page had been updated to indicate a
+ * new split but before the actual split, the cleanup operation could not
+ * decide whether a tuple had been moved to the newly created bucket and
+ * might end up deleting such tuples.
+ */
+ if (H_HAS_GARBAGE(oopaque))
+ {
+ /* Release the metapage lock. */
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+ hashbucketcleanup(rel, buf_oblkno, start_oblkno, NULL,
+ metap->hashm_maxbucket, metap->hashm_highmask,
+ metap->hashm_lowmask, NULL,
+ NULL, true, false, NULL, NULL);
+
+ _hash_relbuf(rel, buf_oblkno);
+
+ goto restart_expand;
+ }
+
+ /*
+ * There shouldn't be any active scan on new bucket.
*
* Note: it is safe to compute the new bucket's blkno here, even though we
* may still need to update the BUCKET_TO_BLKNO mapping. This is because
@@ -579,9 +663,6 @@ _hash_expandtable(Relation rel, Buffer metabuf)
if (_hash_has_active_scan(rel, new_bucket))
elog(ERROR, "scan in progress on supposedly new bucket");
- if (!_hash_try_getlock(rel, start_nblkno, HASH_EXCLUSIVE))
- elog(ERROR, "could not get lock on supposedly new bucket");
-
/*
* If the split point is increasing (hashm_maxbucket's log base 2
* increases), we need to allocate a new batch of bucket pages.
@@ -600,8 +681,7 @@ _hash_expandtable(Relation rel, Buffer metabuf)
if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
{
/* can't split due to BlockNumber overflow */
- _hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
- _hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
+ _hash_relbuf(rel, buf_oblkno);
goto fail;
}
}
@@ -609,9 +689,18 @@ _hash_expandtable(Relation rel, Buffer metabuf)
/*
* Physically allocate the new bucket's primary page. We want to do this
* before changing the metapage's mapping info, in case we can't get the
- * disk space.
+ * disk space. Ideally we wouldn't need to check for a cleanup lock on the
+ * new bucket, as no other backend can find this bucket until the meta page
+ * is updated. However, it is good to be consistent with the old bucket's locking.
*/
buf_nblkno = _hash_getnewbuf(rel, start_nblkno, MAIN_FORKNUM);
+ if (!CheckBufferForCleanup(buf_nblkno))
+ {
+ _hash_relbuf(rel, buf_oblkno);
+ _hash_relbuf(rel, buf_nblkno);
+ goto fail;
+ }
+
/*
* Okay to proceed with split. Update the metapage bucket mapping info.
@@ -665,13 +754,9 @@ _hash_expandtable(Relation rel, Buffer metabuf)
/* Relocate records to the new bucket */
_hash_splitbucket(rel, metabuf,
old_bucket, new_bucket,
- start_oblkno, buf_nblkno,
+ buf_oblkno, buf_nblkno,
maxbucket, highmask, lowmask);
- /* Release bucket locks, allowing others to access them */
- _hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
- _hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
-
return;
/* Here if decide not to split or fail to acquire old bucket lock */
@@ -745,6 +830,10 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
* The buffer is returned in the same state. (The metapage is only
* touched if it becomes necessary to add or remove overflow pages.)
*
+ * A split needs to hold pins on the primary bucket pages of both the old
+ * and new buckets until the end of the operation, to prevent vacuum from
+ * starting while the split is in progress.
+ *
* In addition, the caller must have created the new bucket's base page,
* which is passed in buffer nbuf, pinned and write-locked. That lock and
* pin are released here. (The API is set up this way because we must do
@@ -756,37 +845,87 @@ _hash_splitbucket(Relation rel,
Buffer metabuf,
Bucket obucket,
Bucket nbucket,
- BlockNumber start_oblkno,
+ Buffer obuf,
Buffer nbuf,
uint32 maxbucket,
uint32 highmask,
uint32 lowmask)
{
- Buffer obuf;
Page opage;
Page npage;
HashPageOpaque oopaque;
HashPageOpaque nopaque;
- /*
- * It should be okay to simultaneously write-lock pages from each bucket,
- * since no one else can be trying to acquire buffer lock on pages of
- * either bucket.
- */
- obuf = _hash_getbuf(rel, start_oblkno, HASH_WRITE, LH_BUCKET_PAGE);
opage = BufferGetPage(obuf);
oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+ /*
+ * Mark the old bucket to indicate that a split is in progress and that it
+ * has deletable tuples. At the end of the operation we clear the
+ * split-in-progress flag; vacuum will clear the has-garbage flag after
+ * deleting such tuples.
+ */
+ oopaque->hasho_flag |= LH_BUCKET_PAGE_HAS_GARBAGE | LH_BUCKET_OLD_PAGE_SPLIT;
+
npage = BufferGetPage(nbuf);
- /* initialize the new bucket's primary page */
+ /*
+ * initialize the new bucket's primary page and mark it to indicate that
+ * split is in progress.
+ */
nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
nopaque->hasho_prevblkno = InvalidBlockNumber;
nopaque->hasho_nextblkno = InvalidBlockNumber;
nopaque->hasho_bucket = nbucket;
- nopaque->hasho_flag = LH_BUCKET_PAGE;
+ nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_NEW_PAGE_SPLIT;
nopaque->hasho_page_id = HASHO_PAGE_ID;
+ _hash_splitbucket_guts(rel, metabuf, obucket,
+ nbucket, obuf, nbuf, NULL,
+ maxbucket, highmask, lowmask);
+
+ /* all done, now release the locks and pins on primary buckets. */
+ _hash_relbuf(rel, obuf);
+ _hash_relbuf(rel, nbuf);
+}
+
+/*
+ * _hash_splitbucket_guts -- Helper function to perform the split operation
+ *
+ * This routine is used to partition the tuples between old and new bucket and
+ * is used to finish the incomplete split operations. To finish the previously
+ * interrupted split operation, the caller needs to fill htab. If htab is set,
+ * we skip the movement of tuples that exist in htab; otherwise, a NULL value
+ * of htab indicates movement of all the tuples that belong to the new bucket.
+ *
+ * Caller needs to lock and unlock the old and new primary buckets.
+ */
+static void
+_hash_splitbucket_guts(Relation rel,
+ Buffer metabuf,
+ Bucket obucket,
+ Bucket nbucket,
+ Buffer obuf,
+ Buffer nbuf,
+ HTAB *htab,
+ uint32 maxbucket,
+ uint32 highmask,
+ uint32 lowmask)
+{
+ Buffer bucket_obuf;
+ Buffer bucket_nbuf;
+ Page opage;
+ Page npage;
+ HashPageOpaque oopaque;
+ HashPageOpaque nopaque;
+
+ bucket_obuf = obuf;
+ opage = BufferGetPage(obuf);
+ oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+ bucket_nbuf = nbuf;
+ npage = BufferGetPage(nbuf);
+ nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+
/*
* Partition the tuples in the old bucket between the old bucket and the
* new bucket, advancing along the old bucket's overflow bucket chain and
@@ -798,8 +937,6 @@ _hash_splitbucket(Relation rel,
BlockNumber oblkno;
OffsetNumber ooffnum;
OffsetNumber omaxoffnum;
- OffsetNumber deletable[MaxOffsetNumber];
- int ndeletable = 0;
/* Scan each tuple in old page */
omaxoffnum = PageGetMaxOffsetNumber(opage);
@@ -810,18 +947,45 @@ _hash_splitbucket(Relation rel,
IndexTuple itup;
Size itemsz;
Bucket bucket;
+ bool found = false;
/*
- * Fetch the item's hash key (conveniently stored in the item) and
- * determine which bucket it now belongs in.
+ * Before inserting tuple, probe the hash table containing TIDs of
+ * tuples belonging to new bucket, if we find a match, then skip
+ * that tuple, else fetch the item's hash key (conveniently stored
+ * in the item) and determine which bucket it now belongs in.
*/
itup = (IndexTuple) PageGetItem(opage,
PageGetItemId(opage, ooffnum));
+
+ if (htab)
+ (void) hash_search(htab, &itup->t_tid, HASH_FIND, &found);
+
+ if (found)
+ continue;
+
bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
maxbucket, highmask, lowmask);
if (bucket == nbucket)
{
+ Size itupsize = 0;
+ IndexTuple new_itup;
+
+ /*
+ * make a copy of index tuple as we have to scribble on it.
+ */
+ new_itup = CopyIndexTuple(itup);
+
+ /*
+ * mark the index tuple as moved by split, such tuples are
+ * skipped by scan if there is split in progress for a bucket.
+ */
+ itupsize = new_itup->t_info & INDEX_SIZE_MASK;
+ new_itup->t_info &= ~INDEX_SIZE_MASK;
+ new_itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;
+ new_itup->t_info |= itupsize;
+
/*
* insert the tuple into the new bucket. if it doesn't fit on
* the current page in the new bucket, we must allocate a new
@@ -832,17 +996,25 @@ _hash_splitbucket(Relation rel,
* only partially complete, meaning the index is corrupt,
* since searches may fail to find entries they should find.
*/
- itemsz = IndexTupleDSize(*itup);
+ itemsz = IndexTupleDSize(*new_itup);
itemsz = MAXALIGN(itemsz);
if (PageGetFreeSpace(npage) < itemsz)
{
+ bool retain_pin = false;
+
+ /*
+ * page flags must be accessed before releasing lock on a
+ * page.
+ */
+ retain_pin = nopaque->hasho_flag & LH_BUCKET_PAGE;
+
/* write out nbuf and drop lock, but keep pin */
_hash_chgbufaccess(rel, nbuf, HASH_WRITE, HASH_NOLOCK);
/* chain to a new overflow page */
- nbuf = _hash_addovflpage(rel, metabuf, nbuf);
+ nbuf = _hash_addovflpage(rel, metabuf, nbuf, retain_pin);
npage = BufferGetPage(nbuf);
- /* we don't need nopaque within the loop */
+ nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
}
/*
@@ -852,12 +1024,10 @@ _hash_splitbucket(Relation rel,
* Possible future improvement: accumulate all the items for
* the new page and qsort them before insertion.
*/
- (void) _hash_pgaddtup(rel, nbuf, itemsz, itup);
+ (void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
- /*
- * Mark tuple for deletion from old page.
- */
- deletable[ndeletable++] = ooffnum;
+ /* be tidy */
+ pfree(new_itup);
}
else
{
@@ -870,15 +1040,9 @@ _hash_splitbucket(Relation rel,
oblkno = oopaque->hasho_nextblkno;
- /*
- * Done scanning this old page. If we moved any tuples, delete them
- * from the old page.
- */
- if (ndeletable > 0)
- {
- PageIndexMultiDelete(opage, deletable, ndeletable);
- _hash_wrtbuf(rel, obuf);
- }
+ /* retain the pin on the old primary bucket */
+ if (obuf == bucket_obuf)
+ _hash_chgbufaccess(rel, obuf, HASH_READ, HASH_NOLOCK);
else
_hash_relbuf(rel, obuf);
@@ -887,18 +1051,153 @@ _hash_splitbucket(Relation rel,
break;
/* Else, advance to next old page */
- obuf = _hash_getbuf(rel, oblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
+ obuf = _hash_getbuf(rel, oblkno, HASH_READ, LH_OVERFLOW_PAGE);
opage = BufferGetPage(obuf);
oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
}
/*
* We're at the end of the old bucket chain, so we're done partitioning
- * the tuples. Before quitting, call _hash_squeezebucket to ensure the
- * tuples remaining in the old bucket (including the overflow pages) are
- * packed as tightly as possible. The new bucket is already tight.
+ * the tuples. Mark the old and new buckets to indicate split is
+ * finished.
+ *
+ * To avoid deadlocks due to locking order of buckets, first lock the old
+ * bucket and then the new bucket.
+ */
+ if (nopaque->hasho_flag & LH_BUCKET_PAGE)
+ _hash_chgbufaccess(rel, bucket_nbuf, HASH_WRITE, HASH_NOLOCK);
+ else
+ _hash_wrtbuf(rel, nbuf);
+
+ /*
+ * Acquiring cleanup lock to clear the split-in-progress flag ensures that
+ * there is no pending scan that has seen the flag after it is cleared.
+ */
+ _hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
+ opage = BufferGetPage(bucket_obuf);
+ oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+ _hash_chgbufaccess(rel, bucket_nbuf, HASH_NOLOCK, HASH_WRITE);
+ npage = BufferGetPage(bucket_nbuf);
+ nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+
+ /* indicate that split is finished */
+ oopaque->hasho_flag &= ~LH_BUCKET_OLD_PAGE_SPLIT;
+ nopaque->hasho_flag &= ~LH_BUCKET_NEW_PAGE_SPLIT;
+
+ /*
+ * now write the buffers, here we don't release the locks as caller is
+ * responsible to release locks.
*/
- _hash_wrtbuf(rel, nbuf);
+ MarkBufferDirty(bucket_obuf);
+ MarkBufferDirty(bucket_nbuf);
+}
+
+/*
+ * _hash_finish_split() -- Finish the previously interrupted split operation
+ *
+ * To complete the split operation, we form the hash table of TIDs in new
+ * bucket which is then used by split operation to skip tuples that are
+ * already moved before the split operation was previously interrupted.
+ *
+ * The caller must hold a pin, but no lock, on the metapage buffer.
+ * The buffer is returned in the same state. (The metapage is only
+ * touched if it becomes necessary to add or remove overflow pages.)
+ *
+ * 'obuf' and 'nbuf' must be locked by the caller which is also responsible
+ * for unlocking it.
+ */
+void
+_hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Buffer nbuf,
+ uint32 maxbucket, uint32 highmask, uint32 lowmask)
+{
+ HASHCTL hash_ctl;
+ HTAB *tidhtab;
+ Buffer bucket_nbuf;
+ Page opage;
+ Page npage;
+ HashPageOpaque opageopaque;
+ HashPageOpaque npageopaque;
+ Bucket obucket;
+ Bucket nbucket;
+ bool found;
+
+ /* Initialize hash tables used to track TIDs */
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(ItemPointerData);
+ hash_ctl.entrysize = sizeof(ItemPointerData);
+ hash_ctl.hcxt = CurrentMemoryContext;
+
+ tidhtab =
+ hash_create("bucket ctids",
+ 256, /* arbitrary initial size */
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+ /*
+ * Scan the new bucket and build hash table of TIDs
+ */
+ bucket_nbuf = nbuf;
+ npage = BufferGetPage(nbuf);
+ npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+ for (;;)
+ {
+ BlockNumber nblkno;
+ OffsetNumber noffnum;
+ OffsetNumber nmaxoffnum;
+
+ /* Scan each tuple in new page */
+ nmaxoffnum = PageGetMaxOffsetNumber(npage);
+ for (noffnum = FirstOffsetNumber;
+ noffnum <= nmaxoffnum;
+ noffnum = OffsetNumberNext(noffnum))
+ {
+ IndexTuple itup;
+
+ /* Fetch the item's TID and insert it in hash table. */
+ itup = (IndexTuple) PageGetItem(npage,
+ PageGetItemId(npage, noffnum));
+
+ (void) hash_search(tidhtab, &itup->t_tid, HASH_ENTER, &found);
+
+ Assert(!found);
+ }
+
+ nblkno = npageopaque->hasho_nextblkno;
+
+ /*
+ * release our write lock without modifying buffer and ensure to
+ * retain the pin on primary bucket.
+ */
+ if (nbuf == bucket_nbuf)
+ _hash_chgbufaccess(rel, nbuf, HASH_READ, HASH_NOLOCK);
+ else
+ _hash_relbuf(rel, nbuf);
+
+ /* Exit loop if no more overflow pages in new bucket */
+ if (!BlockNumberIsValid(nblkno))
+ break;
+
+ /* Else, advance to next page */
+ nbuf = _hash_getbuf(rel, nblkno, HASH_READ, LH_OVERFLOW_PAGE);
+ npage = BufferGetPage(nbuf);
+ npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+ }
+
+ /* Need a cleanup lock to perform split operation. */
+ LockBufferForCleanup(bucket_nbuf);
+
+ npage = BufferGetPage(bucket_nbuf);
+ npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+ nbucket = npageopaque->hasho_bucket;
+
+ opage = BufferGetPage(obuf);
+ opageopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+ obucket = opageopaque->hasho_bucket;
+
+ _hash_splitbucket_guts(rel, metabuf, obucket,
+ nbucket, obuf, bucket_nbuf, tidhtab,
+ maxbucket, highmask, lowmask);
- _hash_squeezebucket(rel, obucket, start_oblkno, NULL);
+ hash_destroy(tidhtab);
}
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 4825558..b0cb638 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -72,7 +72,19 @@ _hash_readnext(Relation rel,
BlockNumber blkno;
blkno = (*opaquep)->hasho_nextblkno;
- _hash_relbuf(rel, *bufp);
+
+ /*
+ * Retain the pin on primary bucket page till the end of scan to ensure
+ * that vacuum can't delete the tuples that are moved by split to new
+ * bucket. Such tuples are required by the scans that started on the
+ * split buckets before the new bucket's split-in-progress flag
+ * (LH_BUCKET_NEW_PAGE_SPLIT) was cleared.
+ */
+ if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+ _hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+ else
+ _hash_relbuf(rel, *bufp);
+
*bufp = InvalidBuffer;
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
@@ -94,7 +106,16 @@ _hash_readprev(Relation rel,
BlockNumber blkno;
blkno = (*opaquep)->hasho_prevblkno;
- _hash_relbuf(rel, *bufp);
+
+ /*
+ * Retain the pin on primary bucket page till the end of scan. See
+ * comments in _hash_readnext to know the reason of retaining pin.
+ */
+ if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+ _hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+ else
+ _hash_relbuf(rel, *bufp);
+
*bufp = InvalidBuffer;
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
@@ -192,43 +213,81 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
metap = HashPageGetMeta(page);
/*
- * Loop until we get a lock on the correct target bucket.
+ * Conditionally get the lock on primary bucket page for search while
+ * holding lock on meta page. If we have to wait, then release the meta
+ * page lock and retry it in a hard way.
*/
- for (;;)
- {
- /*
- * Compute the target bucket number, and convert to block number.
- */
- bucket = _hash_hashkey2bucket(hashkey,
- metap->hashm_maxbucket,
- metap->hashm_highmask,
- metap->hashm_lowmask);
+ bucket = _hash_hashkey2bucket(hashkey,
+ metap->hashm_maxbucket,
+ metap->hashm_highmask,
+ metap->hashm_lowmask);
- blkno = BUCKET_TO_BLKNO(metap, bucket);
+ blkno = BUCKET_TO_BLKNO(metap, bucket);
- /* Release metapage lock, but keep pin. */
+ /* Fetch the primary bucket page for the bucket */
+ buf = ReadBuffer(rel, blkno);
+ if (!ConditionalLockBufferShared(buf))
+ {
_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+ LockBuffer(buf, HASH_READ);
+ _hash_checkpage(rel, buf, LH_BUCKET_PAGE);
+ _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+ oldblkno = blkno;
+ retry = true;
+ }
+ else
+ {
+ _hash_checkpage(rel, buf, LH_BUCKET_PAGE);
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+ }
+ if (retry)
+ {
/*
- * If the previous iteration of this loop locked what is still the
- * correct target bucket, we are done. Otherwise, drop any old lock
- * and lock what now appears to be the correct bucket.
+ * Loop until we get a lock on the correct target bucket. We get the
+ * lock on primary bucket page and retain the pin on it during read
+ * operation to prevent the concurrent splits. Retaining pin on a
+ * primary bucket page ensures that split can't happen as it needs to
+ * acquire the cleanup lock on primary bucket page. Acquiring lock on
+ * primary bucket and rechecking if it is a target bucket is mandatory
+ * as otherwise a concurrent split followed by vacuum could remove
+ * tuples from the selected bucket which otherwise would have been
+ * visible.
*/
- if (retry)
+ for (;;)
{
+ /*
+ * Compute the target bucket number, and convert to block number.
+ */
+ bucket = _hash_hashkey2bucket(hashkey,
+ metap->hashm_maxbucket,
+ metap->hashm_highmask,
+ metap->hashm_lowmask);
+
+ blkno = BUCKET_TO_BLKNO(metap, bucket);
+
+ /* Release metapage lock, but keep pin. */
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+ /*
+ * If the previous iteration of this loop locked what is still the
+ * correct target bucket, we are done. Otherwise, drop any old
+ * lock and lock what now appears to be the correct bucket.
+ */
if (oldblkno == blkno)
break;
- _hash_droplock(rel, oldblkno, HASH_SHARE);
- }
- _hash_getlock(rel, blkno, HASH_SHARE);
+ _hash_relbuf(rel, buf);
- /*
- * Reacquire metapage lock and check that no bucket split has taken
- * place while we were awaiting the bucket lock.
- */
- _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
- oldblkno = blkno;
- retry = true;
+ /* Fetch the primary bucket page for the bucket */
+ buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
+
+ /*
+ * Reacquire metapage lock and check that no bucket split has
+ * taken place while we were awaiting the bucket lock.
+ */
+ _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+ oldblkno = blkno;
+ }
}
/* done with the metapage */
@@ -237,14 +296,60 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
/* Update scan opaque state to show we have lock on the bucket */
so->hashso_bucket = bucket;
so->hashso_bucket_valid = true;
- so->hashso_bucket_blkno = blkno;
- /* Fetch the primary bucket page for the bucket */
- buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
+
page = BufferGetPage(buf);
opaque = (HashPageOpaque) PageGetSpecialPointer(page);
Assert(opaque->hasho_bucket == bucket);
+ so->hashso_bucket_buf = buf;
+
+ /*
+ * If the bucket split is in progress, then we need to skip tuples that
+ * are moved from old bucket. To ensure that vacuum doesn't clean any
+ * tuples from old or new buckets till this scan is in progress, maintain
+ * a pin on both of the buckets. Here, we have to be cautious about lock
+ * ordering, first acquire the lock on old bucket, release the lock on old
+ * bucket, but not the pin, then acquire the lock on the new bucket and
+ * re-verify whether the bucket split is still in progress. Acquiring the lock
+ * on old bucket first ensures that the vacuum waits for this scan to
+ * finish.
+ */
+ if (opaque->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+ {
+ BlockNumber old_blkno;
+ Buffer old_buf;
+
+ old_blkno = _hash_get_oldblk(rel, opaque);
+
+ /*
+ * release the lock on new bucket and re-acquire it after acquiring
+ * the lock on old bucket.
+ */
+ _hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+
+ old_buf = _hash_getbuf(rel, old_blkno, HASH_READ, LH_BUCKET_PAGE);
+
+ /*
+ * remember the old bucket buffer so as to use it later for scanning.
+ */
+ so->hashso_old_bucket_buf = old_buf;
+ _hash_chgbufaccess(rel, old_buf, HASH_READ, HASH_NOLOCK);
+
+ _hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+ page = BufferGetPage(buf);
+ opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ Assert(opaque->hasho_bucket == bucket);
+
+ if (opaque->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+ so->hashso_skip_moved_tuples = true;
+ else
+ {
+ _hash_dropbuf(rel, so->hashso_old_bucket_buf);
+ so->hashso_old_bucket_buf = InvalidBuffer;
+ }
+ }
+
/* If a backwards scan is requested, move to the end of the chain */
if (ScanDirectionIsBackward(dir))
{
@@ -273,6 +378,13 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
* false. Else, return true and set the hashso_curpos for the
* scan to the right thing.
*
+ * Here we also scan the old bucket if the split for current bucket
+ * was in progress at the start of the scan. The basic idea is to
+ * skip the tuples that were moved by the split while scanning the current
+ * bucket and then scan the old bucket to cover all such tuples. This
+ * is done to ensure that we don't miss any tuples in the scans that
+ * started during split.
+ *
* 'bufP' points to the current buffer, which is pinned and read-locked.
* On success exit, we have pin and read-lock on whichever page
* contains the right item; on failure, we have released all buffers.
@@ -338,6 +450,19 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
{
Assert(offnum >= FirstOffsetNumber);
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+ /*
+ * skip the tuples that are moved by split operation
+ * for the scan that has started when split was in
+ * progress
+ */
+ if (so->hashso_skip_moved_tuples &&
+ (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
+ {
+ offnum = OffsetNumberNext(offnum); /* move forward */
+ continue;
+ }
+
if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
break; /* yes, so exit for-loop */
}
@@ -353,9 +478,41 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
}
else
{
- /* end of bucket */
- itup = NULL;
- break; /* exit for-loop */
+ /*
+ * end of bucket, scan old bucket if there was a split
+ * in progress at the start of scan.
+ */
+ if (so->hashso_skip_moved_tuples)
+ {
+ buf = so->hashso_old_bucket_buf;
+
+ /*
+ * old bucket buffer must be valid as we acquire
+ * the pin on it before the start of scan and
+ * retain it till end of scan.
+ */
+ Assert(BufferIsValid(buf));
+
+ _hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+
+ page = BufferGetPage(buf);
+ maxoff = PageGetMaxOffsetNumber(page);
+ offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+ /*
+ * setting hashso_skip_moved_tuples to false
+ * ensures that we don't check for tuples that are
+ * moved by split in old bucket and it also
+ * ensures that we won't retry to scan the old
+ * bucket once the scan for same is finished.
+ */
+ so->hashso_skip_moved_tuples = false;
+ }
+ else
+ {
+ itup = NULL;
+ break; /* exit for-loop */
+ }
}
}
break;
@@ -379,6 +536,19 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
{
Assert(offnum <= maxoff);
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+ /*
+ * skip the tuples that are moved by split operation
+ * for the scan that has started when split was in
+ * progress
+ */
+ if (so->hashso_skip_moved_tuples &&
+ (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
+ {
+ offnum = OffsetNumberPrev(offnum); /* move back */
+ continue;
+ }
+
if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
break; /* yes, so exit for-loop */
}
@@ -394,9 +564,41 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
}
else
{
- /* end of bucket */
- itup = NULL;
- break; /* exit for-loop */
+ /*
+ * end of bucket, scan old bucket if there was a split
+ * in progress at the start of scan.
+ */
+ if (so->hashso_skip_moved_tuples)
+ {
+ buf = so->hashso_old_bucket_buf;
+
+ /*
+ * old bucket buffer must be valid as we acquire
+ * the pin on it before the start of scan and
+ * retain it till end of scan.
+ */
+ Assert(BufferIsValid(buf));
+
+ _hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+
+ page = BufferGetPage(buf);
+ maxoff = PageGetMaxOffsetNumber(page);
+ offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+ /*
+ * setting hashso_skip_moved_tuples to false
+ * ensures that we don't check for tuples that are
+ * moved by split in old bucket and it also
+ * ensures that we won't retry to scan the old
+ * bucket once the scan for same is finished.
+ */
+ so->hashso_skip_moved_tuples = false;
+ }
+ else
+ {
+ itup = NULL;
+ break; /* exit for-loop */
+ }
}
}
break;
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 822862d..1648581 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -147,6 +147,23 @@ _hash_log2(uint32 num)
}
/*
+ * _hash_msb-- returns most significant bit position.
+ */
+static uint32
+_hash_msb(uint32 num)
+{
+ uint32 i = 0;
+
+ while (num)
+ {
+ num = num >> 1;
+ ++i;
+ }
+
+ return i - 1;
+}
+
+/*
* _hash_checkpage -- sanity checks on the format of all hash pages
*
* If flags is not zero, it is a bitwise OR of the acceptable values of
@@ -352,3 +369,123 @@ _hash_binsearch_last(Page page, uint32 hash_value)
return lower;
}
+
+/*
+ * _hash_get_oldblk() -- get the block number of the bucket from which the
+ * current bucket is being split.
+ */
+BlockNumber
+_hash_get_oldblk(Relation rel, HashPageOpaque opaque)
+{
+ Bucket curr_bucket;
+ Bucket old_bucket;
+ uint32 mask;
+ Buffer metabuf;
+ HashMetaPage metap;
+ BlockNumber blkno;
+
+ /*
+ * To get the old bucket from the current bucket, we need a mask to modulo
+ * into lower half of table. This mask is stored in meta page as
+ * hashm_lowmask, but here we can't rely on the same, because we need a
+ * value of lowmask that was prevalent at the time when bucket split was
+ * started. Masking the most significant bit of new bucket would give us
+ * old bucket.
+ */
+ curr_bucket = opaque->hasho_bucket;
+ mask = (((uint32) 1) << _hash_msb(curr_bucket)) - 1;
+ old_bucket = curr_bucket & mask;
+
+ metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+ metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+ blkno = BUCKET_TO_BLKNO(metap, old_bucket);
+
+ _hash_relbuf(rel, metabuf);
+
+ return blkno;
+}
+
+/*
+ * _hash_get_newblk() -- get the block number of the new bucket that will be
+ * generated after a split of the current bucket.
+ *
+ * This is used to find the new bucket from old bucket based on current table
+ * half. It is mainly required to finish the incomplete splits where we are
+ * sure that not more than one bucket could have split in progress from old
+ * bucket.
+ */
+BlockNumber
+_hash_get_newblk(Relation rel, HashPageOpaque opaque)
+{
+ Bucket curr_bucket;
+ Bucket new_bucket;
+ uint32 lowmask;
+ uint32 mask;
+ Buffer metabuf;
+ HashMetaPage metap;
+ BlockNumber blkno;
+
+ metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+ metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+ curr_bucket = opaque->hasho_bucket;
+
+ /*
+ * new bucket can be obtained by OR'ing old bucket with most significant
+ * bit of current table half. There could be multiple buckets that could
+ * have split from the current bucket. We need the first such bucket that
+ * exists based on the current table half.
+ */
+ lowmask = metap->hashm_lowmask;
+
+ for (;;)
+ {
+ mask = lowmask + 1;
+ new_bucket = curr_bucket | mask;
+ if (new_bucket > metap->hashm_maxbucket)
+ {
+ lowmask = lowmask >> 1;
+ continue;
+ }
+ blkno = BUCKET_TO_BLKNO(metap, new_bucket);
+ break;
+ }
+
+ _hash_relbuf(rel, metabuf);
+
+ return blkno;
+}
+
+/*
+ * _hash_get_newbucket() -- get the new bucket that will be generated after
+ * split from current bucket.
+ *
+ * This is used to find the new bucket from old bucket. New bucket can be
+ * obtained by OR'ing old bucket with most significant bit of table half
+ * for lowmask passed in this function. There could be multiple buckets that
+ * could have split from the current bucket. We need the first such bucket that
+ * exists. Caller must ensure that no more than one split has happened from
+ * old bucket.
+ */
+Bucket
+_hash_get_newbucket(Relation rel, Bucket curr_bucket,
+ uint32 lowmask, uint32 maxbucket)
+{
+ Bucket new_bucket;
+ uint32 mask;
+
+ for (;;)
+ {
+ mask = lowmask + 1;
+ new_bucket = curr_bucket | mask;
+ if (new_bucket > maxbucket)
+ {
+ lowmask = lowmask >> 1;
+ continue;
+ }
+ break;
+ }
+
+ return new_bucket;
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 76ade37..1c9be40 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3567,6 +3567,26 @@ ConditionalLockBuffer(Buffer buffer)
}
/*
+ * Acquire the content_lock for the buffer, but only if we don't have to wait.
+ *
+ * This assumes the caller wants BUFFER_LOCK_SHARED mode.
+ */
+bool
+ConditionalLockBufferShared(Buffer buffer)
+{
+ BufferDesc *buf;
+
+ Assert(BufferIsValid(buffer));
+ if (BufferIsLocal(buffer))
+ return true; /* act as though we got it */
+
+ buf = GetBufferDescriptor(buffer - 1);
+
+ return LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
+ LW_SHARED);
+}
+
+/*
* LockBufferForCleanup - lock a buffer in preparation for deleting items
*
* Items may be deleted from a disk page only when the caller (a) holds an
@@ -3750,6 +3770,49 @@ ConditionalLockBufferForCleanup(Buffer buffer)
return false;
}
+/*
+ * CheckBufferForCleanup - as above, but don't attempt to take lock
+ *
+ * We won't loop, but just check once to see if the pin count is OK. If
+ * not, return FALSE.
+ */
+bool
+CheckBufferForCleanup(Buffer buffer)
+{
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+
+ Assert(BufferIsValid(buffer));
+
+ if (BufferIsLocal(buffer))
+ {
+ /* There should be exactly one pin */
+ if (LocalRefCount[-buffer - 1] != 1)
+ return false;
+ /* Nobody else to wait for */
+ return true;
+ }
+
+ /* There should be exactly one local pin */
+ if (GetPrivateRefCount(buffer) != 1)
+ return false;
+
+ bufHdr = GetBufferDescriptor(buffer - 1);
+
+ buf_state = LockBufHdr(bufHdr);
+
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+ if (BUF_STATE_GET_REFCOUNT(buf_state) == 1)
+ {
+ /* pincount is OK. */
+ UnlockBufHdr(bufHdr, buf_state);
+ return true;
+ }
+
+ UnlockBufHdr(bufHdr, buf_state);
+ return false;
+}
+
/*
* Functions for buffer I/O handling
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index ce31418..0b41563 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -25,6 +25,7 @@
#include "lib/stringinfo.h"
#include "storage/bufmgr.h"
#include "storage/lockdefs.h"
+#include "utils/hsearch.h"
#include "utils/relcache.h"
/*
@@ -52,6 +53,9 @@ typedef uint32 Bucket;
#define LH_BUCKET_PAGE (1 << 1)
#define LH_BITMAP_PAGE (1 << 2)
#define LH_META_PAGE (1 << 3)
+#define LH_BUCKET_NEW_PAGE_SPLIT (1 << 4)
+#define LH_BUCKET_OLD_PAGE_SPLIT (1 << 5)
+#define LH_BUCKET_PAGE_HAS_GARBAGE (1 << 6)
typedef struct HashPageOpaqueData
{
@@ -64,6 +68,12 @@ typedef struct HashPageOpaqueData
typedef HashPageOpaqueData *HashPageOpaque;
+#define H_HAS_GARBAGE(opaque) ((opaque)->hasho_flag & LH_BUCKET_PAGE_HAS_GARBAGE)
+#define H_OLD_INCOMPLETE_SPLIT(opaque) ((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT)
+#define H_NEW_INCOMPLETE_SPLIT(opaque) ((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+#define H_INCOMPLETE_SPLIT(opaque) (((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT) || \
+ ((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT))
+
/*
* The page ID is for the convenience of pg_filedump and similar utilities,
* which otherwise would have a hard time telling pages of different index
@@ -88,12 +98,6 @@ typedef struct HashScanOpaqueData
bool hashso_bucket_valid;
/*
- * If we have a share lock on the bucket, we record it here. When
- * hashso_bucket_blkno is zero, we have no such lock.
- */
- BlockNumber hashso_bucket_blkno;
-
- /*
* We also want to remember which buffer we're currently examining in the
* scan. We keep the buffer pinned (but not locked) across hashgettuple
* calls, in order to avoid doing a ReadBuffer() for every tuple in the
@@ -101,11 +105,23 @@ typedef struct HashScanOpaqueData
*/
Buffer hashso_curbuf;
+ /* remember the buffer associated with primary bucket */
+ Buffer hashso_bucket_buf;
+
+ /*
+ * remember the buffer associated with old primary bucket which is
+ * required during the scan of the bucket for which split is in progress.
+ */
+ Buffer hashso_old_bucket_buf;
+
/* Current position of the scan, as an index TID */
ItemPointerData hashso_curpos;
/* Current position of the scan, as a heap TID */
ItemPointerData hashso_heappos;
+
+ /* Whether scan needs to skip tuples that are moved by split */
+ bool hashso_skip_moved_tuples;
} HashScanOpaqueData;
typedef HashScanOpaqueData *HashScanOpaque;
@@ -176,6 +192,8 @@ typedef HashMetaPageData *HashMetaPage;
sizeof(ItemIdData) - \
MAXALIGN(sizeof(HashPageOpaqueData)))
+#define INDEX_MOVED_BY_SPLIT_MASK 0x2000
+
#define HASH_MIN_FILLFACTOR 10
#define HASH_DEFAULT_FILLFACTOR 75
@@ -224,9 +242,6 @@ typedef HashMetaPageData *HashMetaPage;
#define HASH_WRITE BUFFER_LOCK_EXCLUSIVE
#define HASH_NOLOCK (-1)
-#define HASH_SHARE ShareLock
-#define HASH_EXCLUSIVE ExclusiveLock
-
/*
* Strategy number. There's only one valid strategy for hashing: equality.
*/
@@ -299,21 +314,21 @@ extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
Size itemsize, IndexTuple itup);
/* hashovfl.c */
-extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf);
+extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin);
extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf,
- BufferAccessStrategy bstrategy);
+ BlockNumber bucket_blkno, BufferAccessStrategy bstrategy);
extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
BlockNumber blkno, ForkNumber forkNum);
extern void _hash_squeezebucket(Relation rel,
Bucket bucket, BlockNumber bucket_blkno,
+ Buffer bucket_buf,
BufferAccessStrategy bstrategy);
/* hashpage.c */
-extern void _hash_getlock(Relation rel, BlockNumber whichlock, int access);
-extern bool _hash_try_getlock(Relation rel, BlockNumber whichlock, int access);
-extern void _hash_droplock(Relation rel, BlockNumber whichlock, int access);
extern Buffer _hash_getbuf(Relation rel, BlockNumber blkno,
int access, int flags);
+extern Buffer _hash_getbuf_with_condlock_cleanup(Relation rel,
+ BlockNumber blkno, int flags);
extern Buffer _hash_getinitbuf(Relation rel, BlockNumber blkno);
extern Buffer _hash_getnewbuf(Relation rel, BlockNumber blkno,
ForkNumber forkNum);
@@ -329,6 +344,9 @@ extern uint32 _hash_metapinit(Relation rel, double num_tuples,
ForkNumber forkNum);
extern void _hash_pageinit(Page page, Size size);
extern void _hash_expandtable(Relation rel, Buffer metabuf);
+extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
+ Buffer nbuf, uint32 maxbucket, uint32 highmask,
+ uint32 lowmask);
/* hashscan.c */
extern void _hash_regscan(IndexScanDesc scan);
@@ -364,10 +382,20 @@ extern bool _hash_convert_tuple(Relation index,
Datum *index_values, bool *index_isnull);
extern OffsetNumber _hash_binsearch(Page page, uint32 hash_value);
extern OffsetNumber _hash_binsearch_last(Page page, uint32 hash_value);
+extern BlockNumber _hash_get_oldblk(Relation rel, HashPageOpaque opaque);
+extern BlockNumber _hash_get_newblk(Relation rel, HashPageOpaque opaque);
+extern Bucket _hash_get_newbucket(Relation rel, Bucket curr_bucket,
+ uint32 lowmask, uint32 maxbucket);
/* hash.c */
extern void hash_redo(XLogReaderState *record);
extern void hash_desc(StringInfo buf, XLogReaderState *record);
extern const char *hash_identify(uint8 info);
+extern void hashbucketcleanup(Relation rel, Buffer bucket_buf,
+ BlockNumber bucket_blkno, BufferAccessStrategy bstrategy,
+ uint32 maxbucket, uint32 highmask, uint32 lowmask,
+ double *tuples_removed, double *num_index_tuples,
+ bool bucket_has_garbage, bool delay,
+ IndexBulkDeleteCallback callback, void *callback_state);
#endif /* HASH_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 3d5dea7..6d0a29c 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -226,8 +226,10 @@ extern void MarkBufferDirtyHint(Buffer buffer, bool buffer_std);
extern void UnlockBuffers(void);
extern void LockBuffer(Buffer buffer, int mode);
extern bool ConditionalLockBuffer(Buffer buffer);
+extern bool ConditionalLockBufferShared(Buffer buffer);
extern void LockBufferForCleanup(Buffer buffer);
extern bool ConditionalLockBufferForCleanup(Buffer buffer);
+extern bool CheckBufferForCleanup(Buffer buffer);
extern bool HoldingBufferPinThatDelaysRecovery(void);
extern void AbortBufferIO(void);
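
To make the computation in the _hash_get_oldblk() hunk above concrete: masking
off the most significant bit of the new bucket number recovers the bucket it
was split from. A tiny standalone illustration (not part of the patch) of that
computation:

#include <stdint.h>
#include <stdio.h>

/* Same idea as _hash_msb() in the patch: position of the highest set bit. */
static uint32_t
msb(uint32_t num)
{
    uint32_t i = 0;

    while (num)
    {
        num >>= 1;
        ++i;
    }
    return i - 1;
}

int
main(void)
{
    uint32_t new_bucket = 5;                                   /* binary 101 */
    uint32_t mask = (((uint32_t) 1) << msb(new_bucket)) - 1;   /* binary 011 */
    uint32_t old_bucket = new_bucket & mask;                   /* binary 001 */

    printf("bucket %u was split from bucket %u\n", new_bucket, old_bucket);
    return 0;
}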
I did some basic testing of the same, and in that I found one issue with cursors.
+BEGIN;
+SET enable_seqscan = OFF;
+SET enable_bitmapscan = OFF;
+CREATE FUNCTION declares_cursor(int)
+ RETURNS void
+ AS 'DECLARE c CURSOR FOR SELECT * from con_hash_index_table WHERE keycol = $1;'
+LANGUAGE SQL;
+
+SELECT declares_cursor(1);
+MOVE FORWARD ALL FROM c;
+MOVE BACKWARD 10000 FROM c;
+CLOSE c;
+WARNING: buffer refcount leak: [5835] (rel=base/16384/30537, blockNum=327, flags=0x93800000, refcount=1 1)
+ROLLBACK;
Closing the cursor produces a warning saying that we forgot to unpin the
buffer.
I have also added tests [1] for coverage improvements.
[1] Some tests to cover hash_index </messages/by-id/CAD__OugeoQuu3mP09erV3gBdF-nX7o844kW7hAnwCF_rdzr6Qw@mail.gmail.com>
On Thu, Jul 14, 2016 at 4:33 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Jun 22, 2016 at 8:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jun 22, 2016 at 5:14 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
We can do it in the way you are suggesting, but there is another thing
which we need to consider here. As of now, the patch tries to finish the
split if it finds the split-in-progress flag in either the old or new
bucket. We need to lock both old and new buckets to finish the split, so
it is quite possible that two different backends try to lock them in
opposite order, leading to a deadlock. I think the correct way to handle
this is to always try to lock the old bucket first and then the new
bucket. To achieve that, if the insertion on the new bucket finds that
the split-in-progress flag is set on a bucket, it needs to release the
lock and then acquire the lock first on the old bucket, ensure the
pincount is 1, and then lock the new bucket again and ensure that its
pincount is 1. I have already maintained this order of locks in scans
(old bucket first and then new bucket; refer to the changes in
_hash_first()). Alternatively, we can try to finish the splits only when
someone tries to insert into the old bucket.

Yes, I think locking buckets in increasing order is a good solution.
I also think it's fine to only try to finish the split when the insert
targets the old bucket. Finishing the split enables us to remove
tuples from the old bucket, which lets us reuse space instead of
accelerating more. So there is at least some potential benefit to the
backend inserting into the old bucket. On the other hand, a process
inserting into the new bucket derives no direct benefit from finishing
the split.

Okay, following this suggestion, I have updated the patch so that only
insertion into the old bucket can try to finish the splits. Apart from
that, I have fixed the issue reported by Mithun upthread. I have
updated the README to explain the locking used in the patch. Also, I
have changed the locking around vacuum, so that it can work with
concurrent scans whenever possible. In the previous patch version,
vacuum used to take a cleanup lock on a bucket to remove the dead
tuples and moved-due-to-split tuples and to perform the squeeze
operation, and it held the lock on the bucket till the end of cleanup.
Now, it still takes a cleanup lock on a bucket to out-wait scans, but
it releases the lock as it proceeds to clean the overflow pages. The
idea is that we first need to lock the next bucket page and then
release the lock on the current bucket page. This ensures that any
concurrent scan started after we start cleaning the bucket will always
be behind the cleanup. Allowing scans to cross vacuum would allow it to
remove tuples required for the sanctity of the scan. Also, for the
squeeze phase, we just check whether the pincount of the buffer is one
(we already have an exclusive lock on the bucket's buffer by that
time); only then do we proceed, else we will try to squeeze the next
time cleanup is required for that bucket.

Thoughts/Suggestions?
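
For concreteness, a minimal sketch of the lock-ordering rule discussed above
(illustrative only, not code from the patch; the helper name is invented, and
it relies only on the existing LockBuffer/ConditionalLockBuffer calls):

#include "postgres.h"
#include "storage/bufmgr.h"

/*
 * Try to lock the old and new bucket buffers in increasing bucket order.
 * We wait on the old (lower-numbered) bucket only, and merely attempt the
 * new bucket; backing off instead of waiting is what keeps two backends
 * from ever blocking on each other in opposite order.
 */
static bool
lock_buckets_in_order(Buffer old_buf, Buffer new_buf)
{
    LockBuffer(old_buf, BUFFER_LOCK_EXCLUSIVE);     /* old bucket first */

    if (!ConditionalLockBuffer(new_buf))            /* new bucket: don't wait */
    {
        /* someone else holds it; give up rather than risk a deadlock */
        LockBuffer(old_buf, BUFFER_LOCK_UNLOCK);
        return false;
    }

    /* caller finishes the split, then unlocks both buffers */
    return true;
}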
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
--
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com
On Thu, Aug 4, 2016 at 8:02 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
I did some basic testing of the same, and in that I found one issue with cursors.
Thanks for the testing. The reason for the failure was that the patch
didn't take into account the fact that, for scrollable cursors, a scan can
reacquire the lock and pin on the bucket buffer multiple times. I have
fixed it such that we release the pin on bucket buffers after we scan
the last overflow page in the bucket. The attached patch fixes the issue
for me; let me know if you still see it.
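
To illustrate the shape of that fix, here is a rough sketch (not the actual
patch code; the helper name is invented) of dropping the retained pins once
the scan has passed the last overflow page, assuming the hashso_bucket_buf
and hashso_old_bucket_buf fields added by the patch:

#include "postgres.h"
#include "access/hash.h"

/*
 * Hypothetical helper: release the pins retained on the primary bucket
 * pages once the scan has no further pages to visit, so that repeated
 * MOVE commands on a scrollable cursor cannot leak buffer pins.
 */
static void
_hash_drop_retained_pins(Relation rel, HashScanOpaque so)
{
    /* pin on the primary page of the bucket being scanned */
    if (BufferIsValid(so->hashso_bucket_buf))
    {
        _hash_dropbuf(rel, so->hashso_bucket_buf);
        so->hashso_bucket_buf = InvalidBuffer;
    }

    /* pin on the old bucket, held only while a split was in progress */
    if (BufferIsValid(so->hashso_old_bucket_buf))
    {
        _hash_dropbuf(rel, so->hashso_old_bucket_buf);
        so->hashso_old_bucket_buf = InvalidBuffer;
    }
}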
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
concurrent_hash_index_v4.patch (application/octet-stream)
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index c1122b4..d02539a 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -400,7 +400,6 @@ pgstat_hash_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno,
Buffer buf;
Page page;
- _hash_getlock(rel, blkno, HASH_SHARE);
buf = _hash_getbuf_with_strategy(rel, blkno, HASH_READ, 0, bstrategy);
page = BufferGetPage(buf);
@@ -431,7 +430,6 @@ pgstat_hash_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno,
}
_hash_relbuf(rel, buf);
- _hash_droplock(rel, blkno, HASH_SHARE);
}
/*
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 0a7da89..a0feb2f 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -125,49 +125,45 @@ the initially created buckets.
Lock Definitions
----------------
-
-We use both lmgr locks ("heavyweight" locks) and buffer context locks
-(LWLocks) to control access to a hash index. lmgr locks are needed for
-long-term locking since there is a (small) risk of deadlock, which we must
-be able to detect. Buffer context locks are used for short-term access
-control to individual pages of the index.
-
-LockPage(rel, page), where page is the page number of a hash bucket page,
-represents the right to split or compact an individual bucket. A process
-splitting a bucket must exclusive-lock both old and new halves of the
-bucket until it is done. A process doing VACUUM must exclusive-lock the
-bucket it is currently purging tuples from. Processes doing scans or
-insertions must share-lock the bucket they are scanning or inserting into.
-(It is okay to allow concurrent scans and insertions.)
-
-The lmgr lock IDs corresponding to overflow pages are currently unused.
-These are available for possible future refinements. LockPage(rel, 0)
-is also currently undefined (it was previously used to represent the right
-to modify the hash-code-to-bucket mapping, but it is no longer needed for
-that purpose).
-
-Note that these lock definitions are conceptually distinct from any sort
-of lock on the pages whose numbers they share. A process must also obtain
-read or write buffer lock on the metapage or bucket page before accessing
-said page.
-
-Processes performing hash index scans must hold share lock on the bucket
-they are scanning throughout the scan. This seems to be essential, since
-there is no reasonable way for a scan to cope with its bucket being split
-underneath it. This creates a possibility of deadlock external to the
-hash index code, since a process holding one of these locks could block
-waiting for an unrelated lock held by another process. If that process
-then does something that requires exclusive lock on the bucket, we have
-deadlock. Therefore the bucket locks must be lmgr locks so that deadlock
-can be detected and recovered from.
-
-Processes must obtain read (share) buffer context lock on any hash index
-page while reading it, and write (exclusive) lock while modifying it.
-To prevent deadlock we enforce these coding rules: no buffer lock may be
-held long term (across index AM calls), nor may any buffer lock be held
-while waiting for an lmgr lock, nor may more than one buffer lock
-be held at a time by any one process. (The third restriction is probably
-stronger than necessary, but it makes the proof of no deadlock obvious.)
+We use buffer content locks (LWLocks) and buffer pins to control access to
+a hash index.
+
+A scan will take a shared-mode lock on the primary bucket page or overflow
+page it is reading. An insert will acquire an exclusive lock on the bucket
+page into which it has to insert. Both operations release the lock on the
+previous page before moving to the next overflow page, and they retain a pin
+on the primary bucket page till the end of the operation.
+A split operation must acquire a cleanup lock on both the old and new halves
+of the bucket and mark split-in-progress on both buckets. The cleanup lock
+at the start of the split ensures that a parallel insert won't get lost.
+Consider a case where an insertion has to add a tuple to some intermediate
+overflow page in the bucket chain; if we allowed a split while that insertion
+is in progress, the split might not move the newly inserted tuple. The split
+releases the lock on the previous page before moving to the next overflow
+page, for both the old and the new bucket. After partitioning the tuples
+between the old and new buckets, it again needs to acquire exclusive locks on
+both the old and new buckets to clear the split-in-progress flag. Like
+inserts and scans, it will also retain pins on both the old and new primary
+buckets till the end of the split operation, although we could do without
+that as well.
+
+Vacuum acquires a cleanup lock on a bucket to remove dead tuples and/or
+tuples that were moved due to a split. The cleanup lock is needed for
+removing dead tuples to ensure that scans return correct results: a scan
+that returns multiple tuples from the same bucket page always restarts from
+the offset number at which it returned the last tuple, so if we allowed
+vacuum to remove dead tuples with just an exclusive lock, it could remove
+the tuple required to resume the scan. The cleanup lock is needed for
+removing tuples moved by a split to ensure that there is no pending scan
+that started after the start of the split and before the finish of the
+split on that bucket; otherwise, vacuum could remove tuples that are
+required by such a scan. We don't need to retain this cleanup lock for the
+whole vacuum operation on the bucket; we release the lock as we move ahead
+in the bucket chain. In the end, for the squeeze phase, we conditionally
+acquire the cleanup lock and, if we don't get it, we just abandon the
+squeeze phase.
+
+To avoid deadlocks, we must be consistent about the order in which we lock
+buckets for operations that require locks on two different buckets. The
+rule is "first lock the old bucket and then the new bucket", i.e. lock the
+lower-numbered bucket first.
Pseudocode Algorithms
@@ -188,63 +184,105 @@ track of available overflow pages.
The reader algorithm is:
pin meta page and take buffer content lock in shared mode
- loop:
- compute bucket number for target hash key
- release meta page buffer content lock
- if (correct bucket page is already locked)
- break
- release any existing bucket page lock (if a concurrent split happened)
- take heavyweight bucket lock
- retake meta page buffer content lock in shared mode
+ compute bucket number for target hash key
+ read and pin the primary bucket page
+ conditionally get the buffer content lock in shared mode on primary bucket page for search
+ if we didn't get the lock (need to wait for lock)
+ release the buffer content lock on meta page
+ acquire buffer content lock on primary bucket page in shared mode
+ acquire the buffer content lock in shared mode on meta page
+ to check for possibility of split, we need to recompute the bucket and
+ verify, if it is a correct bucket; set the retry flag
+ else if we get the lock, then we can skip the retry path
+ if (retry)
+ loop:
+ compute bucket number for target hash key
+ release meta page buffer content lock
+ if (correct bucket page is already locked)
+ break
+ release any existing content lock on bucket page (if a concurrent split happened)
+ pin primary bucket page and take shared buffer content lock
+ retake meta page buffer content lock in shared mode
-- then, per read request:
release pin on metapage
- read current page of bucket and take shared buffer content lock
- step to next page if necessary (no chaining of locks)
+ if the split is in progress for current bucket and this is a new bucket
+ release the buffer content lock on current bucket page
+ pin and acquire the buffer content lock on old bucket in shared mode
+ release the buffer content lock on old bucket, but not pin
+ retake the buffer content lock on new bucket
+ mark the scan such that it skips the tuples that are marked as moved by split
+ step to next page if necessary (no chaining of locks)
+ if the scan indicates moved by split, then move to old bucket after the scan
+ of current bucket is finished
get tuple
release buffer content lock and pin on current page
-- at scan shutdown:
- release bucket share-lock
-
-We can't hold the metapage lock while acquiring a lock on the target bucket,
-because that might result in an undetected deadlock (lwlocks do not participate
-in deadlock detection). Instead, we relock the metapage after acquiring the
-bucket page lock and check whether the bucket has been split. If not, we're
-done. If so, we release our previously-acquired lock and repeat the process
-using the new bucket number. Holding the bucket sharelock for
+ release any pin we hold on current buffer, old bucket buffer, new bucket buffer
+
+We don't want to hold the meta page lock while waiting to acquire the
+content lock on bucket page, because that might result in poor concurrency.
+Instead, we relock the metapage after acquiring the bucket page content lock
+and check whether the bucket has been split. If not, we're done. If so, we
+release our previously-acquired content lock, but not the pin, and repeat the
+process using the new bucket number. Holding the buffer pin on bucket page for
the remainder of the scan prevents the reader's current-tuple pointer from
-being invalidated by splits or compactions. Notice that the reader's lock
+being invalidated by splits or compactions. Notice that the reader's pin
does not prevent other buckets from being split or compacted.
To keep concurrency reasonably good, we require readers to cope with
concurrent insertions, which means that they have to be able to re-find
-their current scan position after re-acquiring the page sharelock. Since
-deletion is not possible while a reader holds the bucket sharelock, and
-we assume that heap tuple TIDs are unique, this can be implemented by
+their current scan position after re-acquiring the buffer content lock on
+page. Since deletion is not possible while a reader holds the pin on bucket,
+and we assume that heap tuple TIDs are unique, this can be implemented by
searching for the same heap tuple TID previously returned. Insertion does
not move index entries across pages, so the previously-returned index entry
should always be on the same page, at the same or higher offset number,
as it was before.
+To allow scans during a bucket split, if at the start of the scan the bucket
+is marked as split-in-progress, the scan reads all the tuples in that bucket
+except those that are marked as moved-by-split. Once it finishes scanning
+all the tuples in the current bucket, it scans the old bucket from which this
+bucket was formed by the split. This happens only for the new half of the
+split.
+
The insertion algorithm is rather similar:
pin meta page and take buffer content lock in shared mode
- loop:
- compute bucket number for target hash key
- release meta page buffer content lock
- if (correct bucket page is already locked)
- break
- release any existing bucket page lock (if a concurrent split happened)
- take heavyweight bucket lock in shared mode
- retake meta page buffer content lock in shared mode
--- (so far same as reader)
+ compute bucket number for target hash key
+ read and pin the primary bucket page
+ conditionally get the buffer content lock in exclusive mode on primary bucket page for search
+ if we didn't get the lock (need to wait for lock)
+ release the buffer content lock on meta page
+ acquire buffer content lock on primary bucket page in exclusive mode
+ acquire the buffer content lock in shared mode on meta page
+ to check for possibility of split, we need to recompute the bucket and
+ verify, if it is a correct bucket; set the retry flag
+ else if we get the lock, then we can skip the retry path
+ if (retry)
+ loop:
+ compute bucket number for target hash key
+ release meta page buffer content lock
+ if (correct bucket page is already locked)
+ break
+ release any existing content lock on bucket page (if a concurrent split happened)
+ pin primary bucket page and take exclusive buffer content lock
+ retake meta page buffer content lock in shared mode
+-- (so far same as reader, except for acquisition of buffer content lock in
+ exclusive mode on primary bucket page)
release pin on metapage
- pin current page of bucket and take exclusive buffer content lock
- if full, release, read/exclusive-lock next page; repeat as needed
+ if the split-in-progress flag is set for bucket in old half of split
+ and pin count on it is one, then finish the split
+ we already have a buffer content lock on old bucket, conditionally get the content lock on new bucket
+ if we get the lock on new bucket
+ finish the split using algorithm mentioned below for split
+ release the buffer content lock and pin on new bucket
+ if full, release lock but not pin, read/exclusive-lock next page; repeat as needed
>> see below if no space in any page of bucket
insert tuple at appropriate place in page
mark current page dirty and release buffer content lock and pin
+ if current page is not a bucket page, release the pin on bucket page
release heavyweight share-lock
- pin meta page and take buffer content lock in shared mode
+ pin meta page and take buffer content lock in exclusive mode
increment tuple count, decide if split needed
mark meta page dirty and release buffer content lock and pin
done if no split needed, else enter Split algorithm below
@@ -256,11 +294,13 @@ bucket that is being actively scanned, because readers can cope with this
as explained above. We only need the short-term buffer locks to ensure
that readers do not see a partially-updated page.
-It is clearly impossible for readers and inserters to deadlock, and in
-fact this algorithm allows them a very high degree of concurrency.
-(The exclusive metapage lock taken to update the tuple count is stronger
-than necessary, since readers do not care about the tuple count, but the
-lock is held for such a short time that this is probably not an issue.)
+To avoid deadlock between readers and inserters, whenever there is a need
+to lock multiple buckets, we always take them in the order suggested in the
+Lock Definitions section above. This algorithm allows them a very high degree of
+concurrency. (The exclusive metapage lock taken to update the tuple count
+is stronger than necessary, since readers do not care about the tuple count,
+but the lock is held for such a short time that this is probably not an
+issue.)
When an inserter cannot find space in any existing page of a bucket, it
must obtain an overflow page and add that page to the bucket's chain.
@@ -271,46 +311,79 @@ index is overfull (has a higher-than-wanted ratio of tuples to buckets).
The algorithm attempts, but does not necessarily succeed, to split one
existing bucket in two, thereby lowering the fill ratio:
- pin meta page and take buffer content lock in exclusive mode
- check split still needed
- if split not needed anymore, drop buffer content lock and pin and exit
- decide which bucket to split
- Attempt to X-lock old bucket number (definitely could fail)
- Attempt to X-lock new bucket number (shouldn't fail, but...)
- if above fail, drop locks and pin and exit
+ expand:
+ take buffer content lock in exclusive mode on meta page
+ check split still needed
+ if split not needed anymore, drop buffer content lock and exit
+ decide which bucket to split
+ Attempt to acquire cleanup lock on old bucket number (definitely could fail)
+ if above fail, release lock and pin and exit
+ if the split-in-progress flag is set, then finish the split
+ conditionally get the content lock on new bucket which was involved in split
+ if we got the lock on new bucket
+ finish the split using algorithm mentioned below for split
+ release the buffer content lock and pin on old and new buckets
+ try to expand from start
+ else
+ release the buffer content lock and pin on old bucket and exit
+ if the garbage flag (indicates that tuples are moved by split) is set on bucket
+ release the buffer content lock on meta page
+ remove the tuples that don't belong to this bucket; see bucket cleanup below
+ Attempt to acquire cleanup lock on new bucket number (shouldn't fail, but...)
update meta page to reflect new number of buckets
- mark meta page dirty and release buffer content lock and pin
+ mark meta page dirty and release buffer content lock
-- now, accesses to all other buckets can proceed.
Perform actual split of bucket, moving tuples as needed
>> see below about acquiring needed extra space
Release X-locks of old and new buckets
+ split guts
+ mark the old and new buckets indicating split-in-progress
+ mark the old bucket indicating has-garbage
+ copy the tuples that belong to the new bucket from the old bucket
+ during copy mark such tuples as move-by-split
+ release lock but not pin for primary bucket page of old bucket,
+ read/shared-lock next page; repeat as needed
+ >> see below if no space in bucket page of new bucket
+ ensure to have exclusive-lock on both old and new buckets in that order
+ clear the split-in-progress flag from both the buckets
+ mark buffers dirty and release the locks and pins on both old and new buckets
+
Note the metapage lock is not held while the actual tuple rearrangement is
performed, so accesses to other buckets can proceed in parallel; in fact,
it's possible for multiple bucket splits to proceed in parallel.
-Split's attempt to X-lock the old bucket number could fail if another
-process holds S-lock on it. We do not want to wait if that happens, first
-because we don't want to wait while holding the metapage exclusive-lock,
-and second because it could very easily result in deadlock. (The other
-process might be out of the hash AM altogether, and could do something
-that blocks on another lock this process holds; so even if the hash
-algorithm itself is deadlock-free, a user-induced deadlock could occur.)
-So, this is a conditional LockAcquire operation, and if it fails we just
-abandon the attempt to split. This is all right since the index is
-overfull but perfectly functional. Every subsequent inserter will try to
-split, and eventually one will succeed. If multiple inserters failed to
-split, the index might still be overfull, but eventually, the index will
+Split's attempt to acquire cleanup-lock on the old bucket number could fail
+if another process holds any lock or pin on it. We do not want to wait if
+that happens, because we don't want to wait while holding the metapage
+exclusive-lock. So, this is a conditional LWLockAcquire operation, and if
+it fails we just abandon the attempt to split. This is all right since the
+index is overfull but perfectly functional. Every subsequent inserter will
+try to split, and eventually one will succeed. If multiple inserters failed
+to split, the index might still be overfull, but eventually, the index will
not be overfull and split attempts will stop. (We could make a successful
splitter loop to see if the index is still overfull, but it seems better to
distribute the split overhead across successive insertions.)
+The has_garbage flag indicates that the bucket contains tuples that were
+moved due to a split; it is set only on the old bucket. We need it in
+addition to the split-in-progress flag in order to recognize this state
+after the split is over (i.e. after the split-in-progress flag has been
+cleared). It is used both by vacuum and by a later re-split of the bucket.
+Vacuum uses it to decide whether it needs to remove the moved-by-split
+tuples from the bucket along with dead tuples. A re-split of the bucket
+uses it to ensure that it doesn't start a new split from a bucket without
+first clearing the previously moved tuples from the old bucket. This usage
+by re-split helps to keep bloat under control and makes the design somewhat
+simpler, as we never have to handle the situation where a bucket can
+contain dead tuples from multiple splits.
+
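For example, hashbulkdelete in this patch treats a bucket as needing
moved-by-split cleanup only when the garbage flag is set and no split is still
in progress:

    if (H_HAS_GARBAGE(bucket_opaque) &&
        !H_INCOMPLETE_SPLIT(bucket_opaque))
        bucket_has_garbage = true;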
A problem is that if a split fails partway through (eg due to insufficient
-disk space) the index is left corrupt. The probability of that could be
-made quite low if we grab a free page or two before we update the meta
-page, but the only real solution is to treat a split as a WAL-loggable,
+disk space or crash) the index is left corrupt. The probability of that
+could be made quite low if we grab a free page or two before we update the
+meta page, but the only real solution is to treat a split as a WAL-loggable,
must-complete action. I'm not planning to teach hash about WAL in this
-go-round.
+go-round. However, we do try to finish the incomplete splits during insert
+and split.
The fourth operation is garbage collection (bulk deletion):
@@ -319,9 +392,13 @@ The fourth operation is garbage collection (bulk deletion):
fetch current max bucket number
release meta page buffer content lock and pin
while next bucket <= max bucket do
- Acquire X lock on target bucket
- Scan and remove tuples, compact free space as needed
- Release X lock
+ Acquire cleanup lock on target bucket
+ Scan and remove tuples
+ For overflow pages, first lock the next page in the chain and only then
+ release the lock on the current page
+ Ensure we hold an exclusive lock on the primary bucket page
+ If buffer pincount is one, then compact free space as needed
+ Release lock
next bucket ++
end loop
pin metapage and take buffer content lock in exclusive mode
@@ -330,20 +407,23 @@ The fourth operation is garbage collection (bulk deletion):
else update metapage tuple count
mark meta page dirty and release buffer content lock and pin
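The "lock the next page before releasing the current one" rule corresponds to
code along these lines in hashbucketcleanup (a sketch of the loop body;
retain_pin is true only while we are on the primary bucket page):

    next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
                                          LH_OVERFLOW_PAGE, bstrategy);
    /* only now let go of the previous page, keeping the primary-bucket pin */
    if (retain_pin)
        _hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
    else
        _hash_relbuf(rel, buf);
    buf = next_buf;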
-Note that this is designed to allow concurrent splits. If a split occurs,
-tuples relocated into the new bucket will be visited twice by the scan,
-but that does no harm. (We must however be careful about the statistics
-reported by the VACUUM operation. What we can do is count the number of
-tuples scanned, and believe this in preference to the stored tuple count
-if the stored tuple count and number of buckets did *not* change at any
-time during the scan. This provides a way of correcting the stored tuple
-count if it gets out of sync for some reason. But if a split or insertion
-does occur concurrently, the scan count is untrustworthy; instead,
-subtract the number of tuples deleted from the stored tuple count and
-use that.)
-
-The exclusive lock request could deadlock in some strange scenarios, but
-we can just error out without any great harm being done.
+Note that this is designed to allow concurrent splits and scans. If a
+split occurs, tuples relocated into the new bucket will be visited twice
+by the scan, but that does no harm. Because we release locks page by page
+while cleaning a bucket, a concurrent scan can start on that bucket, but it
+will always stay behind the cleanup. Scans must stay behind cleanup;
+otherwise vacuum could remove tuples that are still required to complete
+the scan, as explained in the Lock Definitions section above. This holds
+true for backward scans as well (backward scans first traverse each bucket
+starting from the primary page to the last overflow page in the chain).
+We must be careful about the statistics reported by the VACUUM operation.
+What we can do is count the number of tuples scanned, and believe this in
+preference to the stored tuple count if the stored tuple count and number
+of buckets did *not* change at any time during the scan. This provides a
+way of correcting the stored tuple count if it gets out of sync for some
+reason. But if a split or insertion does occur concurrently, the scan
+count is untrustworthy; instead, subtract the number of tuples deleted
+from the stored tuple count and use that.
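Schematically, that rule amounts to something like the following; the variable
names (orig_maxbucket, orig_ntuples) are illustrative, and the exact code lives
in the tail of hashbulkdelete, which is not shown in this hunk:

    if (cur_maxbucket == orig_maxbucket &&
        local_metapage.hashm_ntuples == orig_ntuples)
        metap->hashm_ntuples = num_index_tuples;   /* trust the scan count */
    else
        metap->hashm_ntuples -= tuples_removed;    /* scan count untrustworthy */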
Free Space Management
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 30c82e1..190c394 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -285,10 +285,10 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
/*
* An insertion into the current index page could have happened while
* we didn't have read lock on it. Re-find our position by looking
- * for the TID we previously returned. (Because we hold share lock on
- * the bucket, no deletions or splits could have occurred; therefore
- * we can expect that the TID still exists in the current index page,
- * at an offset >= where we were.)
+ * for the TID we previously returned. (Because we hold pin on the
+ * bucket, no deletions or splits could have occurred; therefore we
+ * can expect that the TID still exists in the current index page, at
+ * an offset >= where we were.)
*/
OffsetNumber maxoffnum;
@@ -423,12 +423,15 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
so->hashso_bucket_valid = false;
- so->hashso_bucket_blkno = 0;
so->hashso_curbuf = InvalidBuffer;
+ so->hashso_bucket_buf = InvalidBuffer;
+ so->hashso_old_bucket_buf = InvalidBuffer;
/* set position invalid (this will cause _hash_first call) */
ItemPointerSetInvalid(&(so->hashso_curpos));
ItemPointerSetInvalid(&(so->hashso_heappos));
+ so->hashso_skip_moved_tuples = false;
+
scan->opaque = so;
/* register scan in case we change pages it's using */
@@ -447,15 +450,7 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
HashScanOpaque so = (HashScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
- /* release any pin we still hold */
- if (BufferIsValid(so->hashso_curbuf))
- _hash_dropbuf(rel, so->hashso_curbuf);
- so->hashso_curbuf = InvalidBuffer;
-
- /* release lock on bucket, too */
- if (so->hashso_bucket_blkno)
- _hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
- so->hashso_bucket_blkno = 0;
+ _hash_dropscanbuf(rel, so);
/* set position invalid (this will cause _hash_first call) */
ItemPointerSetInvalid(&(so->hashso_curpos));
@@ -469,6 +464,8 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
scan->numberOfKeys * sizeof(ScanKeyData));
so->hashso_bucket_valid = false;
}
+
+ so->hashso_skip_moved_tuples = false;
}
/*
@@ -482,16 +479,7 @@ hashendscan(IndexScanDesc scan)
/* don't need scan registered anymore */
_hash_dropscan(scan);
-
- /* release any pin we still hold */
- if (BufferIsValid(so->hashso_curbuf))
- _hash_dropbuf(rel, so->hashso_curbuf);
- so->hashso_curbuf = InvalidBuffer;
-
- /* release lock on bucket, too */
- if (so->hashso_bucket_blkno)
- _hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
- so->hashso_bucket_blkno = 0;
+ _hash_dropscanbuf(rel, so);
pfree(so);
scan->opaque = NULL;
@@ -502,6 +490,9 @@ hashendscan(IndexScanDesc scan)
* The set of target tuples is specified via a callback routine that tells
* whether any given heap tuple (identified by ItemPointer) is being deleted.
*
+ * This function also deletes the tuples that were moved by a split to another
+ * bucket.
+ *
* Result: a palloc'd struct containing statistical info for VACUUM displays.
*/
IndexBulkDeleteResult *
@@ -546,83 +537,52 @@ loop_top:
{
BlockNumber bucket_blkno;
BlockNumber blkno;
- bool bucket_dirty = false;
+ Buffer bucket_buf;
+ Buffer buf;
+ HashPageOpaque bucket_opaque;
+ Page page;
+ bool bucket_has_garbage = false;
/* Get address of bucket's start page */
bucket_blkno = BUCKET_TO_BLKNO(&local_metapage, cur_bucket);
- /* Exclusive-lock the bucket so we can shrink it */
- _hash_getlock(rel, bucket_blkno, HASH_EXCLUSIVE);
-
/* Shouldn't have any active scans locally, either */
if (_hash_has_active_scan(rel, cur_bucket))
elog(ERROR, "hash index has active scan during VACUUM");
- /* Scan each page in bucket */
blkno = bucket_blkno;
- while (BlockNumberIsValid(blkno))
- {
- Buffer buf;
- Page page;
- HashPageOpaque opaque;
- OffsetNumber offno;
- OffsetNumber maxoffno;
- OffsetNumber deletable[MaxOffsetNumber];
- int ndeletable = 0;
- vacuum_delay_point();
-
- buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
- LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
- info->strategy);
- page = BufferGetPage(buf);
- opaque = (HashPageOpaque) PageGetSpecialPointer(page);
- Assert(opaque->hasho_bucket == cur_bucket);
-
- /* Scan each tuple in page */
- maxoffno = PageGetMaxOffsetNumber(page);
- for (offno = FirstOffsetNumber;
- offno <= maxoffno;
- offno = OffsetNumberNext(offno))
- {
- IndexTuple itup;
- ItemPointer htup;
+ /*
+ * We need to acquire a cleanup lock on the primary bucket page to wait
+ * out concurrent scans.
+ */
+ buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, info->strategy);
+ LockBufferForCleanup(buf);
+ _hash_checkpage(rel, buf, LH_BUCKET_PAGE);
- itup = (IndexTuple) PageGetItem(page,
- PageGetItemId(page, offno));
- htup = &(itup->t_tid);
- if (callback(htup, callback_state))
- {
- /* mark the item for deletion */
- deletable[ndeletable++] = offno;
- tuples_removed += 1;
- }
- else
- num_index_tuples += 1;
- }
+ page = BufferGetPage(buf);
+ bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
- /*
- * Apply deletions and write page if needed, advance to next page.
- */
- blkno = opaque->hasho_nextblkno;
+ /*
+ * If the bucket contains tuples that were moved by a split, then we
+ * need to delete such tuples once the split is complete. Before
+ * cleaning, we need to wait out the scans that started while the split
+ * was in progress for this bucket.
+ */
+ if (H_HAS_GARBAGE(bucket_opaque) &&
+ !H_INCOMPLETE_SPLIT(bucket_opaque))
+ bucket_has_garbage = true;
- if (ndeletable > 0)
- {
- PageIndexMultiDelete(page, deletable, ndeletable);
- _hash_wrtbuf(rel, buf);
- bucket_dirty = true;
- }
- else
- _hash_relbuf(rel, buf);
- }
+ bucket_buf = buf;
- /* If we deleted anything, try to compact free space */
- if (bucket_dirty)
- _hash_squeezebucket(rel, cur_bucket, bucket_blkno,
- info->strategy);
+ hashbucketcleanup(rel, bucket_buf, blkno, info->strategy,
+ local_metapage.hashm_maxbucket,
+ local_metapage.hashm_highmask,
+ local_metapage.hashm_lowmask, &tuples_removed,
+ &num_index_tuples, bucket_has_garbage, true,
+ callback, callback_state);
- /* Release bucket lock */
- _hash_droplock(rel, bucket_blkno, HASH_EXCLUSIVE);
+ _hash_relbuf(rel, bucket_buf);
/* Advance to next bucket */
cur_bucket++;
@@ -703,6 +663,197 @@ hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
return stats;
}
+/*
+ * Helper function to perform deletion of index entries from a bucket.
+ *
+ * This expects that the caller has acquired a cleanup lock on the target
+ * bucket (primary page of a bucket), and it is the caller's responsibility
+ * to release that lock.
+ *
+ * While scanning the overflow pages, we first lock the next page and only
+ * then release the lock on the current page. This ensures that any
+ * concurrent scan started after we begin cleaning the bucket will always
+ * stay behind the cleanup. If scans were allowed to get ahead of the
+ * cleanup, vacuum could remove tuples that those scans still require.
+ *
+ * We need to retain a pin on the primary bucket to ensure that no concurrent
+ * split can start.
+ */
+void
+hashbucketcleanup(Relation rel, Buffer bucket_buf,
+ BlockNumber bucket_blkno,
+ BufferAccessStrategy bstrategy,
+ uint32 maxbucket,
+ uint32 highmask, uint32 lowmask,
+ double *tuples_removed,
+ double *num_index_tuples,
+ bool bucket_has_garbage,
+ bool delay,
+ IndexBulkDeleteCallback callback,
+ void *callback_state)
+{
+ BlockNumber blkno;
+ Buffer buf;
+ Bucket cur_bucket;
+ Bucket new_bucket PG_USED_FOR_ASSERTS_ONLY;
+ Page page;
+ bool bucket_dirty = false;
+
+ blkno = bucket_blkno;
+ buf = bucket_buf;
+ page = BufferGetPage(buf);
+ cur_bucket = ((HashPageOpaque) PageGetSpecialPointer(page))->hasho_bucket;
+
+ if (bucket_has_garbage)
+ new_bucket = _hash_get_newbucket(rel, cur_bucket,
+ lowmask, maxbucket);
+
+ /* Scan each page in bucket */
+ for (;;)
+ {
+ HashPageOpaque opaque;
+ OffsetNumber offno;
+ OffsetNumber maxoffno;
+ Buffer next_buf;
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+ bool retain_pin = false;
+ bool curr_page_dirty = false;
+
+ if (delay)
+ vacuum_delay_point();
+
+ page = BufferGetPage(buf);
+ opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+ /* Scan each tuple in page */
+ maxoffno = PageGetMaxOffsetNumber(page);
+ for (offno = FirstOffsetNumber;
+ offno <= maxoffno;
+ offno = OffsetNumberNext(offno))
+ {
+ IndexTuple itup;
+ ItemPointer htup;
+ Bucket bucket;
+
+ itup = (IndexTuple) PageGetItem(page,
+ PageGetItemId(page, offno));
+ htup = &(itup->t_tid);
+ if (callback && callback(htup, callback_state))
+ {
+ /* mark the item for deletion */
+ deletable[ndeletable++] = offno;
+ if (tuples_removed)
+ *tuples_removed += 1;
+ }
+ else if (bucket_has_garbage)
+ {
+ /* delete the tuples that are moved by split. */
+ bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
+ maxbucket,
+ highmask,
+ lowmask);
+ /* mark the item for deletion */
+ if (bucket != cur_bucket)
+ {
+ /*
+ * We expect tuples to belong either to the current bucket or
+ * to new_bucket. This is ensured because we don't allow
+ * further splits from a bucket that contains garbage. See
+ * comments in _hash_expandtable.
+ */
+ Assert(bucket == new_bucket);
+ deletable[ndeletable++] = offno;
+ }
+ else if (num_index_tuples)
+ *num_index_tuples += 1;
+ }
+ else if (num_index_tuples)
+ *num_index_tuples += 1;
+ }
+
+ /* retain the pin on primary bucket till end of bucket scan */
+ if (blkno == bucket_blkno)
+ retain_pin = true;
+ else
+ retain_pin = false;
+
+ blkno = opaque->hasho_nextblkno;
+
+ /*
+ * Apply deletions and write page if needed, advance to next page.
+ */
+ if (ndeletable > 0)
+ {
+ PageIndexMultiDelete(page, deletable, ndeletable);
+ bucket_dirty = true;
+ curr_page_dirty = true;
+ }
+
+ /* bail out if there are no more pages to scan. */
+ if (!BlockNumberIsValid(blkno))
+ break;
+
+ next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+ LH_OVERFLOW_PAGE,
+ bstrategy);
+
+ /*
+ * release the lock on previous page after acquiring the lock on next
+ * page
+ */
+ if (curr_page_dirty)
+ {
+ if (retain_pin)
+ _hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+ else
+ _hash_wrtbuf(rel, buf);
+ curr_page_dirty = false;
+ }
+ else if (retain_pin)
+ _hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+ else
+ _hash_relbuf(rel, buf);
+
+ buf = next_buf;
+ }
+
+ /*
+ * lock the bucket page to clear the garbage flag and squeeze the bucket.
+ * if the current buffer is the same as the bucket buffer, then we
+ * already have a lock on the bucket page.
+ */
+ if (buf != bucket_buf)
+ {
+ _hash_relbuf(rel, buf);
+ _hash_chgbufaccess(rel, bucket_buf, HASH_NOLOCK, HASH_WRITE);
+ }
+
+ /*
+ * Clear the garbage flag from the bucket after deleting the tuples that
+ * were moved by a split. We purposefully clear the flag before squeezing
+ * the bucket, so that after a restart vacuum doesn't again try to delete
+ * the moved-by-split tuples.
+ */
+ if (bucket_has_garbage)
+ {
+ HashPageOpaque bucket_opaque;
+
+ page = BufferGetPage(bucket_buf);
+ bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+ bucket_opaque->hasho_flag &= ~LH_BUCKET_PAGE_HAS_GARBAGE;
+ }
+
+ /*
+ * If we deleted anything, try to compact free space. For squeezing the
+ * bucket, we must have a cleanup lock, else it can impact the ordering of
+ * tuples for a scan that has started before it.
+ */
+ if (bucket_dirty && CheckBufferForCleanup(bucket_buf))
+ _hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf,
+ bstrategy);
+}
void
hash_redo(XLogReaderState *record)
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index acd2e64..b1e79b5 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -28,7 +28,8 @@
void
_hash_doinsert(Relation rel, IndexTuple itup)
{
- Buffer buf;
+ Buffer buf = InvalidBuffer;
+ Buffer bucket_buf;
Buffer metabuf;
HashMetaPage metap;
BlockNumber blkno;
@@ -40,6 +41,9 @@ _hash_doinsert(Relation rel, IndexTuple itup)
bool do_expand;
uint32 hashkey;
Bucket bucket;
+ uint32 maxbucket;
+ uint32 highmask;
+ uint32 lowmask;
/*
* Get the hash key for the item (it's stored in the index tuple itself).
@@ -70,51 +74,131 @@ _hash_doinsert(Relation rel, IndexTuple itup)
errhint("Values larger than a buffer page cannot be indexed.")));
/*
- * Loop until we get a lock on the correct target bucket.
+ * Copy bucket mapping info now; the comment in _hash_expandtable, where
+ * we copy this information and call _hash_splitbucket, explains why this
+ * is OK.
*/
- for (;;)
- {
- /*
- * Compute the target bucket number, and convert to block number.
- */
- bucket = _hash_hashkey2bucket(hashkey,
- metap->hashm_maxbucket,
- metap->hashm_highmask,
- metap->hashm_lowmask);
+ maxbucket = metap->hashm_maxbucket;
+ highmask = metap->hashm_highmask;
+ lowmask = metap->hashm_lowmask;
- blkno = BUCKET_TO_BLKNO(metap, bucket);
+ /*
+ * Conditionally get the lock on primary bucket page for insertion while
+ * holding lock on meta page. If we have to wait, then release the meta
+ * page lock and retry it the hard way.
+ */
+ bucket = _hash_hashkey2bucket(hashkey,
+ maxbucket,
+ highmask,
+ lowmask);
+
+ blkno = BUCKET_TO_BLKNO(metap, bucket);
- /* Release metapage lock, but keep pin. */
+ /* Fetch the primary bucket page for the bucket */
+ buf = ReadBuffer(rel, blkno);
+ if (!ConditionalLockBuffer(buf))
+ {
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+ LockBuffer(buf, HASH_WRITE);
+ _hash_checkpage(rel, buf, LH_BUCKET_PAGE);
+ _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+ oldblkno = blkno;
+ retry = true;
+ }
+ else
+ {
+ _hash_checkpage(rel, buf, LH_BUCKET_PAGE);
_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+ }
+ if (retry)
+ {
/*
- * If the previous iteration of this loop locked what is still the
- * correct target bucket, we are done. Otherwise, drop any old lock
- * and lock what now appears to be the correct bucket.
+ * Loop until we get a lock on the correct target bucket. We get the
+ * lock on the primary bucket page and retain the pin on it during the
+ * insert operation to prevent concurrent splits. Retaining a pin on
+ * the primary bucket page ensures that a split can't happen, as a
+ * split needs to acquire the cleanup lock on the primary bucket page.
+ * Acquiring the lock on the primary bucket and rechecking that it is
+ * the target bucket is mandatory, as otherwise a concurrent split
+ * might cause this insertion to fall in the wrong bucket.
*/
- if (retry)
+ for (;;)
{
+ /*
+ * Compute the target bucket number, and convert to block number.
+ */
+ bucket = _hash_hashkey2bucket(hashkey,
+ metap->hashm_maxbucket,
+ metap->hashm_highmask,
+ metap->hashm_lowmask);
+
+ blkno = BUCKET_TO_BLKNO(metap, bucket);
+
+ /* Release metapage lock, but keep pin. */
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+ /*
+ * If the previous iteration of this loop locked what is still the
+ * correct target bucket, we are done. Otherwise, drop any old
+ * lock and lock what now appears to be the correct bucket.
+ */
if (oldblkno == blkno)
break;
- _hash_droplock(rel, oldblkno, HASH_SHARE);
- }
- _hash_getlock(rel, blkno, HASH_SHARE);
+ _hash_relbuf(rel, buf);
- /*
- * Reacquire metapage lock and check that no bucket split has taken
- * place while we were awaiting the bucket lock.
- */
- _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
- oldblkno = blkno;
- retry = true;
+ /* Fetch the primary bucket page for the bucket */
+ buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
+
+ /*
+ * Reacquire metapage lock and check that no bucket split has
+ * taken place while we were awaiting the bucket lock.
+ */
+ _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+ oldblkno = blkno;
+ }
}
- /* Fetch the primary bucket page for the bucket */
- buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
+ /* remember the primary bucket buffer to release the pin on it at end. */
+ bucket_buf = buf;
+
page = BufferGetPage(buf);
pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
Assert(pageopaque->hasho_bucket == bucket);
+ /*
+ * If there is any pending split, try to finish it before proceeding with
+ * the insertion. We only do this when inserting into the old bucket, as
+ * that allows us to remove the tuples from the old bucket and reuse the
+ * space. There is no comparable benefit to finishing the split when
+ * inserting into the new bucket.
+ *
+ * In the future, if we want to finish splits during insertion into the
+ * new bucket, we must ensure a locking order such that the old bucket is
+ * locked before the new bucket.
+ */
+ if (H_OLD_INCOMPLETE_SPLIT(pageopaque) && CheckBufferForCleanup(buf))
+ {
+ BlockNumber nblkno;
+ Buffer nbuf;
+
+ nblkno = _hash_get_newblk(rel, pageopaque);
+
+ /* Fetch the primary bucket page for the new bucket */
+ nbuf = _hash_getbuf_with_condlock_cleanup(rel, nblkno, LH_BUCKET_PAGE);
+ if (nbuf)
+ {
+ _hash_finish_split(rel, metabuf, buf, nbuf, maxbucket,
+ highmask, lowmask);
+
+ /*
+ * release the buffer here as the insertion will happen in the old
+ * bucket.
+ */
+ _hash_relbuf(rel, nbuf);
+ }
+ }
+
/* Do the insertion */
while (PageGetFreeSpace(page) < itemsz)
{
@@ -127,14 +211,23 @@ _hash_doinsert(Relation rel, IndexTuple itup)
{
/*
* ovfl page exists; go get it. if it doesn't have room, we'll
- * find out next pass through the loop test above.
+ * find out next pass through the loop test above. Retain the pin
+ * if this is the primary bucket page.
*/
- _hash_relbuf(rel, buf);
+ if (pageopaque->hasho_flag & LH_BUCKET_PAGE)
+ _hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+ else
+ _hash_relbuf(rel, buf);
buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
page = BufferGetPage(buf);
}
else
{
+ bool retain_pin = false;
+
+ /* page flags must be accessed before releasing lock on a page. */
+ retain_pin = pageopaque->hasho_flag & LH_BUCKET_PAGE;
+
/*
* we're at the end of the bucket chain and we haven't found a
* page with enough room. allocate a new overflow page.
@@ -144,7 +237,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
/* chain to a new overflow page */
- buf = _hash_addovflpage(rel, metabuf, buf);
+ buf = _hash_addovflpage(rel, metabuf, buf, retain_pin);
page = BufferGetPage(buf);
/* should fit now, given test above */
@@ -158,11 +251,13 @@ _hash_doinsert(Relation rel, IndexTuple itup)
/* found page with enough space, so add the item here */
(void) _hash_pgaddtup(rel, buf, itemsz, itup);
- /* write and release the modified page */
+ /*
+ * write and release the modified page, and release the pin on the
+ * primary bucket page.
+ */
_hash_wrtbuf(rel, buf);
-
- /* We can drop the bucket lock now */
- _hash_droplock(rel, blkno, HASH_SHARE);
+ if (buf != bucket_buf)
+ _hash_dropbuf(rel, bucket_buf);
/*
* Write-lock the metapage so we can increment the tuple count. After
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index db3e268..760563a 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -82,23 +82,20 @@ blkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
*
* On entry, the caller must hold a pin but no lock on 'buf'. The pin is
* dropped before exiting (we assume the caller is not interested in 'buf'
- * anymore). The returned overflow page will be pinned and write-locked;
- * it is guaranteed to be empty.
+ * anymore) unless asked to retain it; the pin is retained only for the
+ * primary bucket page. The returned overflow page will be pinned and
+ * write-locked; it is guaranteed to be empty.
*
* The caller must hold a pin, but no lock, on the metapage buffer.
* That buffer is returned in the same state.
*
- * The caller must hold at least share lock on the bucket, to ensure that
- * no one else tries to compact the bucket meanwhile. This guarantees that
- * 'buf' won't stop being part of the bucket while it's unlocked.
- *
* NB: since this could be executed concurrently by multiple processes,
* one should not assume that the returned overflow page will be the
* immediate successor of the originally passed 'buf'. Additional overflow
* pages might have been added to the bucket chain in between.
*/
Buffer
-_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
+_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
{
Buffer ovflbuf;
Page page;
@@ -131,7 +128,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
break;
/* we assume we do not need to write the unmodified page */
- _hash_relbuf(rel, buf);
+ if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+ _hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+ else
+ _hash_relbuf(rel, buf);
buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
}
@@ -149,7 +149,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
/* logically chain overflow page to previous page */
pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
- _hash_wrtbuf(rel, buf);
+ if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+ _hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+ else
+ _hash_wrtbuf(rel, buf);
return ovflbuf;
}
@@ -370,11 +373,11 @@ _hash_firstfreebit(uint32 map)
* in the bucket, or InvalidBlockNumber if no following page.
*
* NB: caller must not hold lock on metapage, nor on either page that's
- * adjacent in the bucket chain. The caller had better hold exclusive lock
- * on the bucket, too.
+ * adjacent in the bucket chain except from primary bucket. The caller had
+ * better hold cleanup lock on the primary bucket.
*/
BlockNumber
-_hash_freeovflpage(Relation rel, Buffer ovflbuf,
+_hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
BufferAccessStrategy bstrategy)
{
HashMetaPage metap;
@@ -413,22 +416,41 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf,
/*
* Fix up the bucket chain. this is a doubly-linked list, so we must fix
* up the bucket chain members behind and ahead of the overflow page being
- * deleted. No concurrency issues since we hold exclusive lock on the
- * entire bucket.
+ * deleted. No concurrency issues since we hold the cleanup lock on the
+ * primary bucket. We don't need to acquire a buffer lock to fix the
+ * primary bucket page, as we already have that lock.
*/
if (BlockNumberIsValid(prevblkno))
{
- Buffer prevbuf = _hash_getbuf_with_strategy(rel,
- prevblkno,
- HASH_WRITE,
+ if (prevblkno == bucket_blkno)
+ {
+ Buffer prevbuf = ReadBufferExtended(rel, MAIN_FORKNUM,
+ prevblkno,
+ RBM_NORMAL,
+ bstrategy);
+
+ Page prevpage = BufferGetPage(prevbuf);
+ HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+ Assert(prevopaque->hasho_bucket == bucket);
+ prevopaque->hasho_nextblkno = nextblkno;
+ MarkBufferDirty(prevbuf);
+ ReleaseBuffer(prevbuf);
+ }
+ else
+ {
+ Buffer prevbuf = _hash_getbuf_with_strategy(rel,
+ prevblkno,
+ HASH_WRITE,
LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
- bstrategy);
- Page prevpage = BufferGetPage(prevbuf);
- HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+ bstrategy);
+ Page prevpage = BufferGetPage(prevbuf);
+ HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
- Assert(prevopaque->hasho_bucket == bucket);
- prevopaque->hasho_nextblkno = nextblkno;
- _hash_wrtbuf(rel, prevbuf);
+ Assert(prevopaque->hasho_bucket == bucket);
+ prevopaque->hasho_nextblkno = nextblkno;
+ _hash_wrtbuf(rel, prevbuf);
+ }
}
if (BlockNumberIsValid(nextblkno))
{
@@ -570,7 +592,7 @@ _hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
* required that to be true on entry as well, but it's a lot easier for
* callers to leave empty overflow pages and let this guy clean it up.
*
- * Caller must hold exclusive lock on the target bucket. This allows
+ * Caller must hold cleanup lock on the target bucket. This allows
* us to safely lock multiple pages in the bucket.
*
* Since this function is invoked in VACUUM, we provide an access strategy
@@ -580,6 +602,7 @@ void
_hash_squeezebucket(Relation rel,
Bucket bucket,
BlockNumber bucket_blkno,
+ Buffer bucket_buf,
BufferAccessStrategy bstrategy)
{
BlockNumber wblkno;
@@ -591,27 +614,22 @@ _hash_squeezebucket(Relation rel,
HashPageOpaque wopaque;
HashPageOpaque ropaque;
bool wbuf_dirty;
+ bool release_buf = false;
/*
* start squeezing into the base bucket page.
*/
wblkno = bucket_blkno;
- wbuf = _hash_getbuf_with_strategy(rel,
- wblkno,
- HASH_WRITE,
- LH_BUCKET_PAGE,
- bstrategy);
+ wbuf = bucket_buf;
wpage = BufferGetPage(wbuf);
wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
/*
- * if there aren't any overflow pages, there's nothing to squeeze.
+ * if there aren't any overflow pages, there's nothing to squeeze. The
+ * caller is responsible for releasing the lock on the primary bucket.
*/
if (!BlockNumberIsValid(wopaque->hasho_nextblkno))
- {
- _hash_relbuf(rel, wbuf);
return;
- }
/*
* Find the last page in the bucket chain by starting at the base bucket
@@ -669,12 +687,17 @@ _hash_squeezebucket(Relation rel,
{
Assert(!PageIsEmpty(wpage));
+ if (wblkno != bucket_blkno)
+ release_buf = true;
+
wblkno = wopaque->hasho_nextblkno;
Assert(BlockNumberIsValid(wblkno));
- if (wbuf_dirty)
+ if (wbuf_dirty && release_buf)
_hash_wrtbuf(rel, wbuf);
- else
+ else if (wbuf_dirty)
+ MarkBufferDirty(wbuf);
+ else if (release_buf)
_hash_relbuf(rel, wbuf);
/* nothing more to do if we reached the read page */
@@ -700,6 +723,7 @@ _hash_squeezebucket(Relation rel,
wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
Assert(wopaque->hasho_bucket == bucket);
wbuf_dirty = false;
+ release_buf = false;
}
/*
@@ -733,19 +757,25 @@ _hash_squeezebucket(Relation rel,
/* are we freeing the page adjacent to wbuf? */
if (rblkno == wblkno)
{
- /* yes, so release wbuf lock first */
- if (wbuf_dirty)
+ if (wblkno != bucket_blkno)
+ release_buf = true;
+
+ /* yes, so release wbuf lock first if needed */
+ if (wbuf_dirty && release_buf)
_hash_wrtbuf(rel, wbuf);
- else
+ else if (wbuf_dirty)
+ MarkBufferDirty(wbuf);
+ else if (release_buf)
_hash_relbuf(rel, wbuf);
+
/* free this overflow page (releases rbuf) */
- _hash_freeovflpage(rel, rbuf, bstrategy);
+ _hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
/* done */
return;
}
/* free this overflow page, then get the previous one */
- _hash_freeovflpage(rel, rbuf, bstrategy);
+ _hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
rbuf = _hash_getbuf_with_strategy(rel,
rblkno,
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 178463f..bb43aaa 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -38,10 +38,14 @@ static bool _hash_alloc_buckets(Relation rel, BlockNumber firstblock,
uint32 nblocks);
static void _hash_splitbucket(Relation rel, Buffer metabuf,
Bucket obucket, Bucket nbucket,
- BlockNumber start_oblkno,
+ Buffer obuf,
Buffer nbuf,
uint32 maxbucket,
uint32 highmask, uint32 lowmask);
+static void _hash_splitbucket_guts(Relation rel, Buffer metabuf,
+ Bucket obucket, Bucket nbucket, Buffer obuf,
+ Buffer nbuf, HTAB *htab, uint32 maxbucket,
+ uint32 highmask, uint32 lowmask);
/*
@@ -55,46 +59,6 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
/*
- * _hash_getlock() -- Acquire an lmgr lock.
- *
- * 'whichlock' should the block number of a bucket's primary bucket page to
- * acquire the per-bucket lock. (See README for details of the use of these
- * locks.)
- *
- * 'access' must be HASH_SHARE or HASH_EXCLUSIVE.
- */
-void
-_hash_getlock(Relation rel, BlockNumber whichlock, int access)
-{
- if (USELOCKING(rel))
- LockPage(rel, whichlock, access);
-}
-
-/*
- * _hash_try_getlock() -- Acquire an lmgr lock, but only if it's free.
- *
- * Same as above except we return FALSE without blocking if lock isn't free.
- */
-bool
-_hash_try_getlock(Relation rel, BlockNumber whichlock, int access)
-{
- if (USELOCKING(rel))
- return ConditionalLockPage(rel, whichlock, access);
- else
- return true;
-}
-
-/*
- * _hash_droplock() -- Release an lmgr lock.
- */
-void
-_hash_droplock(Relation rel, BlockNumber whichlock, int access)
-{
- if (USELOCKING(rel))
- UnlockPage(rel, whichlock, access);
-}
-
-/*
* _hash_getbuf() -- Get a buffer by block number for read or write.
*
* 'access' must be HASH_READ, HASH_WRITE, or HASH_NOLOCK.
@@ -132,6 +96,35 @@ _hash_getbuf(Relation rel, BlockNumber blkno, int access, int flags)
}
/*
+ * _hash_getbuf_with_condlock_cleanup() -- as above, but get the buffer for write.
+ *
+ * We try to conditionally take the cleanup lock; if we get it, we return
+ * the buffer, else we return InvalidBuffer.
+ */
+Buffer
+_hash_getbuf_with_condlock_cleanup(Relation rel, BlockNumber blkno, int flags)
+{
+ Buffer buf;
+
+ if (blkno == P_NEW)
+ elog(ERROR, "hash AM does not use P_NEW");
+
+ buf = ReadBuffer(rel, blkno);
+
+ if (!ConditionalLockBufferForCleanup(buf))
+ {
+ ReleaseBuffer(buf);
+ return InvalidBuffer;
+ }
+
+ /* ref count and lock type are correct */
+
+ _hash_checkpage(rel, buf, flags);
+
+ return buf;
+}
+
+/*
* _hash_getinitbuf() -- Get and initialize a buffer by block number.
*
* This must be used only to fetch pages that are known to be before
@@ -266,6 +259,33 @@ _hash_dropbuf(Relation rel, Buffer buf)
}
/*
+ * _hash_dropscanbuf() -- release buffers used in scan.
+ *
+ * This routine unpins the buffers used during scan on which we
+ * hold no lock.
+ */
+void
+_hash_dropscanbuf(Relation rel, HashScanOpaque so)
+{
+ /* release pin we hold on primary bucket */
+ if (BufferIsValid(so->hashso_bucket_buf) &&
+ so->hashso_bucket_buf != so->hashso_curbuf)
+ _hash_dropbuf(rel, so->hashso_bucket_buf);
+ so->hashso_bucket_buf = InvalidBuffer;
+
+ /* release pin we hold on old primary bucket */
+ if (BufferIsValid(so->hashso_old_bucket_buf) &&
+ so->hashso_old_bucket_buf != so->hashso_curbuf)
+ _hash_dropbuf(rel, so->hashso_old_bucket_buf);
+ so->hashso_old_bucket_buf = InvalidBuffer;
+
+ /* release any pin we still hold */
+ if (BufferIsValid(so->hashso_curbuf))
+ _hash_dropbuf(rel, so->hashso_curbuf);
+ so->hashso_curbuf = InvalidBuffer;
+}
+
+/*
* _hash_wrtbuf() -- write a hash page to disk.
*
* This routine releases the lock held on the buffer and our refcount
@@ -489,9 +509,11 @@ _hash_pageinit(Page page, Size size)
/*
* Attempt to expand the hash table by creating one new bucket.
*
- * This will silently do nothing if it cannot get the needed locks.
+ * This will silently do nothing if our own backend has active scans, or
+ * if we can't get the cleanup lock on the old or new bucket.
*
- * The caller should hold no locks on the hash index.
+ * It also completes any pending split and removes tuples left over in the
+ * old bucket from a previous split.
*
* The caller must hold a pin, but no lock, on the metapage buffer.
* The buffer is returned in the same state.
@@ -506,10 +528,15 @@ _hash_expandtable(Relation rel, Buffer metabuf)
BlockNumber start_oblkno;
BlockNumber start_nblkno;
Buffer buf_nblkno;
+ Buffer buf_oblkno;
+ Page opage;
+ HashPageOpaque oopaque;
uint32 maxbucket;
uint32 highmask;
uint32 lowmask;
+restart_expand:
+
/*
* Write-lock the meta page. It used to be necessary to acquire a
* heavyweight lock to begin a split, but that is no longer required.
@@ -548,11 +575,16 @@ _hash_expandtable(Relation rel, Buffer metabuf)
goto fail;
/*
- * Determine which bucket is to be split, and attempt to lock the old
- * bucket. If we can't get the lock, give up.
+ * Determine which bucket is to be split, and attempt to take cleanup lock
+ * on the old bucket. If we can't get the lock, give up.
*
- * The lock protects us against other backends, but not against our own
- * backend. Must check for active scans separately.
+ * The cleanup lock protects us against other backends, but not against
+ * our own backend. Must check for active scans separately.
+ *
+ * The cleanup lock is mainly to protect the split from concurrent
+ * inserts. See src/backend/access/hash/README, Lock Definitions for
+ * further details. Due to this locking restriction, if there is any
+ * pending scan, the split will give up, which is not good, but harmless.
*/
new_bucket = metap->hashm_maxbucket + 1;
@@ -563,11 +595,90 @@ _hash_expandtable(Relation rel, Buffer metabuf)
if (_hash_has_active_scan(rel, old_bucket))
goto fail;
- if (!_hash_try_getlock(rel, start_oblkno, HASH_EXCLUSIVE))
+ buf_oblkno = _hash_getbuf_with_condlock_cleanup(rel, start_oblkno, LH_BUCKET_PAGE);
+ if (!buf_oblkno)
goto fail;
+ opage = BufferGetPage(buf_oblkno);
+ oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
/*
- * Likewise lock the new bucket (should never fail).
+ * We want to finish any pending split from this bucket before starting a
+ * new one: there is no apparent benefit to deferring it, and the code
+ * would get complicated if we had to finish splits involving multiple
+ * buckets, for instance when the new split also fails. We don't need to
+ * consider the new bucket for completing the split here, as a re-split of
+ * the new bucket cannot start while there is still a pending split from
+ * the old bucket.
+ */
+ if (H_OLD_INCOMPLETE_SPLIT(oopaque))
+ {
+ BlockNumber nblkno;
+ Buffer buf_nblkno;
+
+ /*
+ * Copy bucket mapping info now; the comment in the code below, where we
+ * copy this information and call _hash_splitbucket, explains why this
+ * is OK.
+ */
+ maxbucket = metap->hashm_maxbucket;
+ highmask = metap->hashm_highmask;
+ lowmask = metap->hashm_lowmask;
+
+ /* Release the metapage lock, before completing the split. */
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+ nblkno = _hash_get_newblk(rel, oopaque);
+
+ /* Fetch the primary bucket page for the new bucket */
+ buf_nblkno = _hash_getbuf_with_condlock_cleanup(rel, nblkno, LH_BUCKET_PAGE);
+ if (!buf_nblkno)
+ {
+ _hash_relbuf(rel, buf_oblkno);
+ goto fail;
+ }
+
+ _hash_finish_split(rel, metabuf, buf_oblkno, buf_nblkno, maxbucket,
+ highmask, lowmask);
+
+ /*
+ * release the buffers and retry the expansion.
+ */
+ _hash_relbuf(rel, buf_oblkno);
+ _hash_relbuf(rel, buf_nblkno);
+
+ goto restart_expand;
+ }
+
+ /*
+ * Clean up the tuples left over from the previous split. This operation
+ * requires a cleanup lock and we already have one on the old bucket, so
+ * let's do it. We also don't want to allow further splits from the bucket
+ * until the garbage of the previous split is cleaned. This has two
+ * advantages: first, it helps to avoid the bloat due to garbage, and
+ * second, during cleanup of the bucket we can always be sure that the
+ * garbage tuples belong to the most recently split bucket. If instead we
+ * allowed cleanup of a bucket after the meta page had been updated to
+ * indicate a new split but before the actual split, the cleanup operation
+ * would not be able to decide whether a tuple had been moved to the newly
+ * created bucket, and could end up deleting such tuples.
+ */
+ if (H_HAS_GARBAGE(oopaque))
+ {
+ /* Release the metapage lock. */
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+ hashbucketcleanup(rel, buf_oblkno, start_oblkno, NULL,
+ metap->hashm_maxbucket, metap->hashm_highmask,
+ metap->hashm_lowmask, NULL,
+ NULL, true, false, NULL, NULL);
+
+ _hash_relbuf(rel, buf_oblkno);
+
+ goto restart_expand;
+ }
+
+ /*
+ * There shouldn't be any active scan on new bucket.
*
* Note: it is safe to compute the new bucket's blkno here, even though we
* may still need to update the BUCKET_TO_BLKNO mapping. This is because
@@ -579,9 +690,6 @@ _hash_expandtable(Relation rel, Buffer metabuf)
if (_hash_has_active_scan(rel, new_bucket))
elog(ERROR, "scan in progress on supposedly new bucket");
- if (!_hash_try_getlock(rel, start_nblkno, HASH_EXCLUSIVE))
- elog(ERROR, "could not get lock on supposedly new bucket");
-
/*
* If the split point is increasing (hashm_maxbucket's log base 2
* increases), we need to allocate a new batch of bucket pages.
@@ -600,8 +708,7 @@ _hash_expandtable(Relation rel, Buffer metabuf)
if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
{
/* can't split due to BlockNumber overflow */
- _hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
- _hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
+ _hash_relbuf(rel, buf_oblkno);
goto fail;
}
}
@@ -609,9 +716,18 @@ _hash_expandtable(Relation rel, Buffer metabuf)
/*
* Physically allocate the new bucket's primary page. We want to do this
* before changing the metapage's mapping info, in case we can't get the
- * disk space.
+ * disk space. Ideally, we wouldn't need to check for a cleanup lock on
+ * the new bucket, as no other backend can find this bucket until the meta
+ * page is updated. However, it is good to be consistent with the old
+ * bucket locking.
*/
buf_nblkno = _hash_getnewbuf(rel, start_nblkno, MAIN_FORKNUM);
+ if (!CheckBufferForCleanup(buf_nblkno))
+ {
+ _hash_relbuf(rel, buf_oblkno);
+ _hash_relbuf(rel, buf_nblkno);
+ goto fail;
+ }
+
/*
* Okay to proceed with split. Update the metapage bucket mapping info.
@@ -665,13 +781,9 @@ _hash_expandtable(Relation rel, Buffer metabuf)
/* Relocate records to the new bucket */
_hash_splitbucket(rel, metabuf,
old_bucket, new_bucket,
- start_oblkno, buf_nblkno,
+ buf_oblkno, buf_nblkno,
maxbucket, highmask, lowmask);
- /* Release bucket locks, allowing others to access them */
- _hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
- _hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
-
return;
/* Here if decide not to split or fail to acquire old bucket lock */
@@ -745,6 +857,10 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
* The buffer is returned in the same state. (The metapage is only
* touched if it becomes necessary to add or remove overflow pages.)
*
+ * The split needs to hold pins on the primary bucket pages of both the
+ * old and new buckets until the end of the operation, to prevent vacuum
+ * from starting while the split is in progress.
+ *
* In addition, the caller must have created the new bucket's base page,
* which is passed in buffer nbuf, pinned and write-locked. That lock and
* pin are released here. (The API is set up this way because we must do
@@ -756,37 +872,87 @@ _hash_splitbucket(Relation rel,
Buffer metabuf,
Bucket obucket,
Bucket nbucket,
- BlockNumber start_oblkno,
+ Buffer obuf,
Buffer nbuf,
uint32 maxbucket,
uint32 highmask,
uint32 lowmask)
{
- Buffer obuf;
Page opage;
Page npage;
HashPageOpaque oopaque;
HashPageOpaque nopaque;
- /*
- * It should be okay to simultaneously write-lock pages from each bucket,
- * since no one else can be trying to acquire buffer lock on pages of
- * either bucket.
- */
- obuf = _hash_getbuf(rel, start_oblkno, HASH_WRITE, LH_BUCKET_PAGE);
opage = BufferGetPage(obuf);
oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+ /*
+ * Mark the old bucket to indicate that a split is in progress and that it
+ * has deletable tuples. At the end of the operation we clear the
+ * split-in-progress flag, and vacuum will clear the has-garbage flag
+ * after deleting such tuples.
+ */
+ oopaque->hasho_flag |= LH_BUCKET_PAGE_HAS_GARBAGE | LH_BUCKET_OLD_PAGE_SPLIT;
+
npage = BufferGetPage(nbuf);
- /* initialize the new bucket's primary page */
+ /*
+ * initialize the new bucket's primary page and mark it to indicate that
+ * split is in progress.
+ */
nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
nopaque->hasho_prevblkno = InvalidBlockNumber;
nopaque->hasho_nextblkno = InvalidBlockNumber;
nopaque->hasho_bucket = nbucket;
- nopaque->hasho_flag = LH_BUCKET_PAGE;
+ nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_NEW_PAGE_SPLIT;
nopaque->hasho_page_id = HASHO_PAGE_ID;
+ _hash_splitbucket_guts(rel, metabuf, obucket,
+ nbucket, obuf, nbuf, NULL,
+ maxbucket, highmask, lowmask);
+
+ /* all done, now release the locks and pins on primary buckets. */
+ _hash_relbuf(rel, obuf);
+ _hash_relbuf(rel, nbuf);
+}
+
+/*
+ * _hash_splitbucket_guts -- Helper function to perform the split operation
+ *
+ * This routine partitions the tuples between the old and new bucket and is
+ * also used to finish incomplete split operations. To finish a previously
+ * interrupted split, the caller needs to fill htab; if htab is set, we skip
+ * moving tuples that already exist in htab, otherwise a NULL htab means all
+ * tuples that belong to the new bucket are moved.
+ *
+ * Caller needs to lock and unlock the old and new primary buckets.
+ */
+static void
+_hash_splitbucket_guts(Relation rel,
+ Buffer metabuf,
+ Bucket obucket,
+ Bucket nbucket,
+ Buffer obuf,
+ Buffer nbuf,
+ HTAB *htab,
+ uint32 maxbucket,
+ uint32 highmask,
+ uint32 lowmask)
+{
+ Buffer bucket_obuf;
+ Buffer bucket_nbuf;
+ Page opage;
+ Page npage;
+ HashPageOpaque oopaque;
+ HashPageOpaque nopaque;
+
+ bucket_obuf = obuf;
+ opage = BufferGetPage(obuf);
+ oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+ bucket_nbuf = nbuf;
+ npage = BufferGetPage(nbuf);
+ nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+
/*
* Partition the tuples in the old bucket between the old bucket and the
* new bucket, advancing along the old bucket's overflow bucket chain and
@@ -798,8 +964,6 @@ _hash_splitbucket(Relation rel,
BlockNumber oblkno;
OffsetNumber ooffnum;
OffsetNumber omaxoffnum;
- OffsetNumber deletable[MaxOffsetNumber];
- int ndeletable = 0;
/* Scan each tuple in old page */
omaxoffnum = PageGetMaxOffsetNumber(opage);
@@ -810,18 +974,45 @@ _hash_splitbucket(Relation rel,
IndexTuple itup;
Size itemsz;
Bucket bucket;
+ bool found = false;
/*
- * Fetch the item's hash key (conveniently stored in the item) and
- * determine which bucket it now belongs in.
+ * Before inserting a tuple, probe the hash table containing TIDs of
+ * tuples belonging to the new bucket; if we find a match, then skip
+ * that tuple, else fetch the item's hash key (conveniently stored
+ * in the item) and determine which bucket it now belongs in.
*/
itup = (IndexTuple) PageGetItem(opage,
PageGetItemId(opage, ooffnum));
+
+ if (htab)
+ (void) hash_search(htab, &itup->t_tid, HASH_FIND, &found);
+
+ if (found)
+ continue;
+
bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
maxbucket, highmask, lowmask);
if (bucket == nbucket)
{
+ Size itupsize = 0;
+ IndexTuple new_itup;
+
+ /*
+ * make a copy of index tuple as we have to scribble on it.
+ */
+ new_itup = CopyIndexTuple(itup);
+
+ /*
+ * mark the index tuple as moved-by-split; such tuples are
+ * skipped by scans while a split is in progress for the bucket.
+ */
+ itupsize = new_itup->t_info & INDEX_SIZE_MASK;
+ new_itup->t_info &= ~INDEX_SIZE_MASK;
+ new_itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;
+ new_itup->t_info |= itupsize;
+
/*
* insert the tuple into the new bucket. if it doesn't fit on
* the current page in the new bucket, we must allocate a new
@@ -832,17 +1023,25 @@ _hash_splitbucket(Relation rel,
* only partially complete, meaning the index is corrupt,
* since searches may fail to find entries they should find.
*/
- itemsz = IndexTupleDSize(*itup);
+ itemsz = IndexTupleDSize(*new_itup);
itemsz = MAXALIGN(itemsz);
if (PageGetFreeSpace(npage) < itemsz)
{
+ bool retain_pin = false;
+
+ /*
+ * page flags must be accessed before releasing lock on a
+ * page.
+ */
+ retain_pin = nopaque->hasho_flag & LH_BUCKET_PAGE;
+
/* write out nbuf and drop lock, but keep pin */
_hash_chgbufaccess(rel, nbuf, HASH_WRITE, HASH_NOLOCK);
/* chain to a new overflow page */
- nbuf = _hash_addovflpage(rel, metabuf, nbuf);
+ nbuf = _hash_addovflpage(rel, metabuf, nbuf, retain_pin);
npage = BufferGetPage(nbuf);
- /* we don't need nopaque within the loop */
+ nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
}
/*
@@ -852,12 +1051,10 @@ _hash_splitbucket(Relation rel,
* Possible future improvement: accumulate all the items for
* the new page and qsort them before insertion.
*/
- (void) _hash_pgaddtup(rel, nbuf, itemsz, itup);
+ (void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
- /*
- * Mark tuple for deletion from old page.
- */
- deletable[ndeletable++] = ooffnum;
+ /* be tidy */
+ pfree(new_itup);
}
else
{
@@ -870,15 +1067,9 @@ _hash_splitbucket(Relation rel,
oblkno = oopaque->hasho_nextblkno;
- /*
- * Done scanning this old page. If we moved any tuples, delete them
- * from the old page.
- */
- if (ndeletable > 0)
- {
- PageIndexMultiDelete(opage, deletable, ndeletable);
- _hash_wrtbuf(rel, obuf);
- }
+ /* retain the pin on the old primary bucket */
+ if (obuf == bucket_obuf)
+ _hash_chgbufaccess(rel, obuf, HASH_READ, HASH_NOLOCK);
else
_hash_relbuf(rel, obuf);
@@ -887,18 +1078,153 @@ _hash_splitbucket(Relation rel,
break;
/* Else, advance to next old page */
- obuf = _hash_getbuf(rel, oblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
+ obuf = _hash_getbuf(rel, oblkno, HASH_READ, LH_OVERFLOW_PAGE);
opage = BufferGetPage(obuf);
oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
}
/*
* We're at the end of the old bucket chain, so we're done partitioning
- * the tuples. Before quitting, call _hash_squeezebucket to ensure the
- * tuples remaining in the old bucket (including the overflow pages) are
- * packed as tightly as possible. The new bucket is already tight.
+ * the tuples. Mark the old and new buckets to indicate split is
+ * finished.
+ *
+ * To avoid deadlocks due to locking order of buckets, first lock the old
+ * bucket and then the new bucket.
*/
- _hash_wrtbuf(rel, nbuf);
+ if (nopaque->hasho_flag & LH_BUCKET_PAGE)
+ _hash_chgbufaccess(rel, bucket_nbuf, HASH_WRITE, HASH_NOLOCK);
+ else
+ _hash_wrtbuf(rel, nbuf);
+
+ /*
+ * Acquiring cleanup lock to clear the split-in-progress flag ensures that
+ * there is no pending scan that has seen the flag after it is cleared.
+ */
+ _hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
+ opage = BufferGetPage(bucket_obuf);
+ oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+ _hash_chgbufaccess(rel, bucket_nbuf, HASH_NOLOCK, HASH_WRITE);
+ npage = BufferGetPage(bucket_nbuf);
+ nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+
+ /* indicate that split is finished */
+ oopaque->hasho_flag &= ~LH_BUCKET_OLD_PAGE_SPLIT;
+ nopaque->hasho_flag &= ~LH_BUCKET_NEW_PAGE_SPLIT;
+
+ /*
+ * now mark the buffers dirty; we don't release the locks here, as the
+ * caller is responsible for releasing them.
+ */
+ MarkBufferDirty(bucket_obuf);
+ MarkBufferDirty(bucket_nbuf);
+}
+
+/*
+ * _hash_finish_split() -- Finish the previously interrupted split operation
+ *
+ * To complete the split operation, we form a hash table of the TIDs in the
+ * new bucket, which the split operation then uses to skip tuples that were
+ * already moved before the split was interrupted.
+ *
+ * The caller must hold a pin, but no lock, on the metapage buffer.
+ * The buffer is returned in the same state. (The metapage is only
+ * touched if it becomes necessary to add or remove overflow pages.)
+ *
+ * 'obuf' and 'nbuf' must be locked by the caller, which is also responsible
+ * for unlocking them.
+ */
+void
+_hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Buffer nbuf,
+ uint32 maxbucket, uint32 highmask, uint32 lowmask)
+{
+ HASHCTL hash_ctl;
+ HTAB *tidhtab;
+ Buffer bucket_nbuf;
+ Page opage;
+ Page npage;
+ HashPageOpaque opageopaque;
+ HashPageOpaque npageopaque;
+ Bucket obucket;
+ Bucket nbucket;
+ bool found;
+
+ /* Initialize hash tables used to track TIDs */
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(ItemPointerData);
+ hash_ctl.entrysize = sizeof(ItemPointerData);
+ hash_ctl.hcxt = CurrentMemoryContext;
+
+ tidhtab =
+ hash_create("bucket ctids",
+ 256, /* arbitrary initial size */
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+ /*
+ * Scan the new bucket and build hash table of TIDs
+ */
+ bucket_nbuf = nbuf;
+ npage = BufferGetPage(nbuf);
+ npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+ for (;;)
+ {
+ BlockNumber nblkno;
+ OffsetNumber noffnum;
+ OffsetNumber nmaxoffnum;
+
+ /* Scan each tuple in new page */
+ nmaxoffnum = PageGetMaxOffsetNumber(npage);
+ for (noffnum = FirstOffsetNumber;
+ noffnum <= nmaxoffnum;
+ noffnum = OffsetNumberNext(noffnum))
+ {
+ IndexTuple itup;
+
+ /* Fetch the item's TID and insert it in hash table. */
+ itup = (IndexTuple) PageGetItem(npage,
+ PageGetItemId(npage, noffnum));
+
+ (void) hash_search(tidhtab, &itup->t_tid, HASH_ENTER, &found);
+
+ Assert(!found);
+ }
+
+ nblkno = npageopaque->hasho_nextblkno;
+
+ /*
+ * release our write lock without modifying the buffer, and retain
+ * the pin on the primary bucket.
+ */
+ if (nbuf == bucket_nbuf)
+ _hash_chgbufaccess(rel, nbuf, HASH_READ, HASH_NOLOCK);
+ else
+ _hash_relbuf(rel, nbuf);
+
+ /* Exit loop if no more overflow pages in new bucket */
+ if (!BlockNumberIsValid(nblkno))
+ break;
+
+ /* Else, advance to next page */
+ nbuf = _hash_getbuf(rel, nblkno, HASH_READ, LH_OVERFLOW_PAGE);
+ npage = BufferGetPage(nbuf);
+ npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+ }
+
+ /* Need a cleanup lock to perform split operation. */
+ LockBufferForCleanup(bucket_nbuf);
+
+ npage = BufferGetPage(bucket_nbuf);
+ npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+ nbucket = npageopaque->hasho_bucket;
+
+ opage = BufferGetPage(obuf);
+ opageopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+ obucket = opageopaque->hasho_bucket;
+
+ _hash_splitbucket_guts(rel, metabuf, obucket,
+ nbucket, obuf, bucket_nbuf, tidhtab,
+ maxbucket, highmask, lowmask);
- _hash_squeezebucket(rel, obucket, start_oblkno, NULL);
+ hash_destroy(tidhtab);
}
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 4825558..512dabd 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -72,7 +72,19 @@ _hash_readnext(Relation rel,
BlockNumber blkno;
blkno = (*opaquep)->hasho_nextblkno;
- _hash_relbuf(rel, *bufp);
+
+ /*
+ * Retain the pin on the primary bucket page till the end of the scan to
+ * ensure that vacuum can't delete tuples that were moved by a split to
+ * the new bucket. Such tuples are required by scans that started on the
+ * split bucket before the new bucket's split-in-progress flag
+ * (LH_BUCKET_NEW_PAGE_SPLIT) was cleared.
+ */
+ if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+ _hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+ else
+ _hash_relbuf(rel, *bufp);
+
*bufp = InvalidBuffer;
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
@@ -94,7 +106,16 @@ _hash_readprev(Relation rel,
BlockNumber blkno;
blkno = (*opaquep)->hasho_prevblkno;
- _hash_relbuf(rel, *bufp);
+
+ /*
+ * Retain the pin on the primary bucket page till the end of the scan.
+ * See comments in _hash_readnext for why the pin is retained.
+ */
+ if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+ _hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+ else
+ _hash_relbuf(rel, *bufp);
+
*bufp = InvalidBuffer;
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
@@ -104,6 +125,13 @@ _hash_readprev(Relation rel,
LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
*pagep = BufferGetPage(*bufp);
*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
+
+ /*
+ * We always maintain the pin on the bucket page for the whole scan, so
+ * release the additional pin we just acquired here.
+ */
+ if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+ _hash_dropbuf(rel, *bufp);
}
}
@@ -192,43 +220,81 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
metap = HashPageGetMeta(page);
/*
- * Loop until we get a lock on the correct target bucket.
+ * Conditionally get the lock on primary bucket page for search while
+ * holding lock on meta page. If we have to wait, then release the meta
+ * page lock and retry it the hard way.
*/
- for (;;)
- {
- /*
- * Compute the target bucket number, and convert to block number.
- */
- bucket = _hash_hashkey2bucket(hashkey,
- metap->hashm_maxbucket,
- metap->hashm_highmask,
- metap->hashm_lowmask);
+ bucket = _hash_hashkey2bucket(hashkey,
+ metap->hashm_maxbucket,
+ metap->hashm_highmask,
+ metap->hashm_lowmask);
- blkno = BUCKET_TO_BLKNO(metap, bucket);
+ blkno = BUCKET_TO_BLKNO(metap, bucket);
- /* Release metapage lock, but keep pin. */
+ /* Fetch the primary bucket page for the bucket */
+ buf = ReadBuffer(rel, blkno);
+ if (!ConditionalLockBufferShared(buf))
+ {
_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+ LockBuffer(buf, HASH_READ);
+ _hash_checkpage(rel, buf, LH_BUCKET_PAGE);
+ _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+ oldblkno = blkno;
+ retry = true;
+ }
+ else
+ {
+ _hash_checkpage(rel, buf, LH_BUCKET_PAGE);
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+ }
+ if (retry)
+ {
/*
- * If the previous iteration of this loop locked what is still the
- * correct target bucket, we are done. Otherwise, drop any old lock
- * and lock what now appears to be the correct bucket.
+ * Loop until we get a lock on the correct target bucket. We get the
+ * lock on primary bucket page and retain the pin on it during read
+ * operation to prevent the concurrent splits. Retaining pin on a
+ * primary bucket page ensures that split can't happen as it needs to
+ * acquire the cleanup lock on primary bucket page. Acquiring lock on
+ * primary bucket and rechecking if it is a target bucket is mandatory
+ * as otherwise a concurrent split followed by vacuum could remove
+ * tuples from the selected bucket which otherwise would have been
+ * visible.
*/
- if (retry)
+ for (;;)
{
+ /*
+ * Compute the target bucket number, and convert to block number.
+ */
+ bucket = _hash_hashkey2bucket(hashkey,
+ metap->hashm_maxbucket,
+ metap->hashm_highmask,
+ metap->hashm_lowmask);
+
+ blkno = BUCKET_TO_BLKNO(metap, bucket);
+
+ /* Release metapage lock, but keep pin. */
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+ /*
+ * If the previous iteration of this loop locked what is still the
+ * correct target bucket, we are done. Otherwise, drop any old
+ * lock and lock what now appears to be the correct bucket.
+ */
if (oldblkno == blkno)
break;
- _hash_droplock(rel, oldblkno, HASH_SHARE);
- }
- _hash_getlock(rel, blkno, HASH_SHARE);
+ _hash_relbuf(rel, buf);
- /*
- * Reacquire metapage lock and check that no bucket split has taken
- * place while we were awaiting the bucket lock.
- */
- _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
- oldblkno = blkno;
- retry = true;
+ /* Fetch the primary bucket page for the bucket */
+ buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
+
+ /*
+ * Reacquire metapage lock and check that no bucket split has
+ * taken place while we were awaiting the bucket lock.
+ */
+ _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+ oldblkno = blkno;
+ }
}
/* done with the metapage */
@@ -237,14 +303,60 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
/* Update scan opaque state to show we have lock on the bucket */
so->hashso_bucket = bucket;
so->hashso_bucket_valid = true;
- so->hashso_bucket_blkno = blkno;
- /* Fetch the primary bucket page for the bucket */
- buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
+
page = BufferGetPage(buf);
opaque = (HashPageOpaque) PageGetSpecialPointer(page);
Assert(opaque->hasho_bucket == bucket);
+ so->hashso_bucket_buf = buf;
+
+ /*
+ * If the bucket split is in progress, then we need to skip tuples that
+ * are moved from old bucket. To ensure that vacuum doesn't clean any
+ * tuples from old or new buckets till this scan is in progress, maintain
+ * a pin on both of the buckets. Here, we have to be cautious about lock
+ * ordering, first acquire the lock on old bucket, release the lock on old
+ * bucket, but not pin, then acuire the lock on new bucket and again
+ * re-verify whether the bucket split still is in progress. Acquiring lock
+ * on old bucket first ensures that the vacuum waits for this scan to
+ * finish.
+ */
+ if (opaque->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+ {
+ BlockNumber old_blkno;
+ Buffer old_buf;
+
+ old_blkno = _hash_get_oldblk(rel, opaque);
+
+ /*
+ * release the lock on new bucket and re-acquire it after acquiring
+ * the lock on old bucket.
+ */
+ _hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+
+ old_buf = _hash_getbuf(rel, old_blkno, HASH_READ, LH_BUCKET_PAGE);
+
+ /*
+ * remember the old bucket buffer so as to use it later for scanning.
+ */
+ so->hashso_old_bucket_buf = old_buf;
+ _hash_chgbufaccess(rel, old_buf, HASH_READ, HASH_NOLOCK);
+
+ _hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+ page = BufferGetPage(buf);
+ opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ Assert(opaque->hasho_bucket == bucket);
+
+ if (opaque->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+ so->hashso_skip_moved_tuples = true;
+ else
+ {
+ _hash_dropbuf(rel, so->hashso_old_bucket_buf);
+ so->hashso_old_bucket_buf = InvalidBuffer;
+ }
+ }
+
/* If a backwards scan is requested, move to the end of the chain */
if (ScanDirectionIsBackward(dir))
{
@@ -273,6 +385,13 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
* false. Else, return true and set the hashso_curpos for the
* scan to the right thing.
*
+ * Here we also scan the old bucket if the split for current bucket
+ * was in progress at the start of scan. The basic idea is to skip
+ * the tuples that were moved by split while scanning the current
+ * bucket and then scan the old bucket to cover all such tuples. This
+ * is done to ensure that we don't miss any tuples in the scans that
+ * started during split.
+ *
* 'bufP' points to the current buffer, which is pinned and read-locked.
* On success exit, we have pin and read-lock on whichever page
* contains the right item; on failure, we have released all buffers.
@@ -338,6 +457,19 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
{
Assert(offnum >= FirstOffsetNumber);
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+ /*
+ * Skip tuples that were moved by the split
+ * operation if this scan started while the
+ * split was in progress.
+ */
+ if (so->hashso_skip_moved_tuples &&
+ (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
+ {
+ offnum = OffsetNumberNext(offnum); /* move forward */
+ continue;
+ }
+
if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
break; /* yes, so exit for-loop */
}
@@ -353,9 +485,41 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
}
else
{
- /* end of bucket */
- itup = NULL;
- break; /* exit for-loop */
+ /*
+ * end of bucket, scan old bucket if there was a split
+ * in progress at the start of scan.
+ */
+ if (so->hashso_skip_moved_tuples)
+ {
+ buf = so->hashso_old_bucket_buf;
+
+ /*
+ * old bucket buffer must be valid as we acquire
+ * the pin on it before the start of scan and
+ * retain it till end of scan.
+ */
+ Assert(BufferIsValid(buf));
+
+ _hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+
+ page = BufferGetPage(buf);
+ maxoff = PageGetMaxOffsetNumber(page);
+ offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+ /*
+ * Setting hashso_skip_moved_tuples to false
+ * ensures that we don't check for moved-by-split
+ * tuples in the old bucket and also that we
+ * won't rescan the old bucket once its scan is
+ * finished.
+ */
+ so->hashso_skip_moved_tuples = false;
+ }
+ else
+ {
+ itup = NULL;
+ break; /* exit for-loop */
+ }
}
}
break;
@@ -379,6 +543,19 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
{
Assert(offnum <= maxoff);
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+ /*
+ * Skip tuples that were moved by the split
+ * operation if this scan started while the
+ * split was in progress.
+ */
+ if (so->hashso_skip_moved_tuples &&
+ (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
+ {
+ offnum = OffsetNumberPrev(offnum); /* move back */
+ continue;
+ }
+
if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
break; /* yes, so exit for-loop */
}
@@ -394,9 +571,41 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
}
else
{
- /* end of bucket */
- itup = NULL;
- break; /* exit for-loop */
+ /*
+ * end of bucket, scan old bucket if there was a split
+ * in progress at the start of scan.
+ */
+ if (so->hashso_skip_moved_tuples)
+ {
+ buf = so->hashso_old_bucket_buf;
+
+ /*
+ * old bucket buffer must be valid as we acquire
+ * the pin on it before the start of scan and
+ * retain it till end of scan.
+ */
+ Assert(BufferIsValid(buf));
+
+ _hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+
+ page = BufferGetPage(buf);
+ maxoff = PageGetMaxOffsetNumber(page);
+ offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+ /*
+ * Setting hashso_skip_moved_tuples to false
+ * ensures that we don't check for moved-by-split
+ * tuples in the old bucket and also that we
+ * won't rescan the old bucket once its scan is
+ * finished.
+ */
+ so->hashso_skip_moved_tuples = false;
+ }
+ else
+ {
+ itup = NULL;
+ break; /* exit for-loop */
+ }
}
}
break;
@@ -410,9 +619,16 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
if (itup == NULL)
{
- /* we ran off the end of the bucket without finding a match */
+ /*
+ * We ran off the end of the bucket without finding a match.
+ * Release the pin on bucket buffers. Normally, such pins are
+ * released at the end of the scan; however, scrollable cursors
+ * can reacquire the bucket lock and pin multiple times within
+ * the same scan.
+ */
*bufP = so->hashso_curbuf = InvalidBuffer;
ItemPointerSetInvalid(current);
+ _hash_dropscanbuf(rel, so);
return false;
}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 822862d..1648581 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -147,6 +147,23 @@ _hash_log2(uint32 num)
}
/*
+ * _hash_msb-- returns most significant bit position.
+ */
+static uint32
+_hash_msb(uint32 num)
+{
+ uint32 i = 0;
+
+ while (num)
+ {
+ num = num >> 1;
+ ++i;
+ }
+
+ return i - 1;
+}
+
+/*
* _hash_checkpage -- sanity checks on the format of all hash pages
*
* If flags is not zero, it is a bitwise OR of the acceptable values of
@@ -352,3 +369,123 @@ _hash_binsearch_last(Page page, uint32 hash_value)
return lower;
}
+
+/*
+ * _hash_get_oldblk() -- get the block number of the bucket from which the
+ * current bucket is being split.
+ */
+BlockNumber
+_hash_get_oldblk(Relation rel, HashPageOpaque opaque)
+{
+ Bucket curr_bucket;
+ Bucket old_bucket;
+ uint32 mask;
+ Buffer metabuf;
+ HashMetaPage metap;
+ BlockNumber blkno;
+
+ /*
+ * To get the old bucket from the current bucket, we need a mask that maps
+ * it into the lower half of the table. Such a mask is stored in the meta
+ * page as hashm_lowmask, but we can't rely on it here, because we need the
+ * value of lowmask that was in effect when this bucket split was started.
+ * Masking off the most significant bit of the new bucket gives us the old
+ * bucket.
+ */
+ curr_bucket = opaque->hasho_bucket;
+ mask = (((uint32) 1) << _hash_msb(curr_bucket)) - 1;
+ old_bucket = curr_bucket & mask;
+
+ metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+ metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+ blkno = BUCKET_TO_BLKNO(metap, old_bucket);
+
+ _hash_relbuf(rel, metabuf);
+
+ return blkno;
+}
+
+/*
+ * _hash_get_newblk() -- get the block number of the new bucket that will be
+ * generated after a split of the current bucket.
+ *
+ * This is used to find the new bucket from old bucket based on current table
+ * half. It is mainly required to finsh the incomplete splits where we are
+ * sure that not more than one bucket could have split in progress from old
+ * bucket.
+ */
+BlockNumber
+_hash_get_newblk(Relation rel, HashPageOpaque opaque)
+{
+ Bucket curr_bucket;
+ Bucket new_bucket;
+ uint32 lowmask;
+ uint32 mask;
+ Buffer metabuf;
+ HashMetaPage metap;
+ BlockNumber blkno;
+
+ metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+ metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+ curr_bucket = opaque->hasho_bucket;
+
+ /*
+ * The new bucket can be obtained by OR'ing the old bucket with the most
+ * significant bit of the current table half. There could be multiple
+ * buckets that have split from the current bucket. We need the first such
+ * bucket that exists based on the current table half.
+ */
+ lowmask = metap->hashm_lowmask;
+
+ for (;;)
+ {
+ mask = lowmask + 1;
+ new_bucket = curr_bucket | mask;
+ if (new_bucket > metap->hashm_maxbucket)
+ {
+ lowmask = lowmask >> 1;
+ continue;
+ }
+ blkno = BUCKET_TO_BLKNO(metap, new_bucket);
+ break;
+ }
+
+ _hash_relbuf(rel, metabuf);
+
+ return blkno;
+}
+
+/*
+ * _hash_get_newbucket() -- get the new bucket that will be generated after
+ * split from current bucket.
+ *
+ * This is used to find the new bucket from old bucket. New bucket can be
+ * obtained by OR'ing old bucket with most significant bit of table half
+ * for lowmask passed in this function. There could be multiple buckets that
+ * could have split from the current bucket. We need the first such bucket that
+ * exists. Caller must ensure that no more than one split has happened from
+ * old bucket.
+ */
+Bucket
+_hash_get_newbucket(Relation rel, Bucket curr_bucket,
+ uint32 lowmask, uint32 maxbucket)
+{
+ Bucket new_bucket;
+ uint32 mask;
+
+ for (;;)
+ {
+ mask = lowmask + 1;
+ new_bucket = curr_bucket | mask;
+ if (new_bucket > maxbucket)
+ {
+ lowmask = lowmask >> 1;
+ continue;
+ }
+ break;
+ }
+
+ return new_bucket;
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 76ade37..1c9be40 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3567,6 +3567,26 @@ ConditionalLockBuffer(Buffer buffer)
}
/*
+ * Acquire the content_lock for the buffer, but only if we don't have to wait.
+ *
+ * This assumes the caller wants BUFFER_LOCK_SHARED mode.
+ */
+bool
+ConditionalLockBufferShared(Buffer buffer)
+{
+ BufferDesc *buf;
+
+ Assert(BufferIsValid(buffer));
+ if (BufferIsLocal(buffer))
+ return true; /* act as though we got it */
+
+ buf = GetBufferDescriptor(buffer - 1);
+
+ return LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
+ LW_SHARED);
+}
+
+/*
* LockBufferForCleanup - lock a buffer in preparation for deleting items
*
* Items may be deleted from a disk page only when the caller (a) holds an
@@ -3750,6 +3770,49 @@ ConditionalLockBufferForCleanup(Buffer buffer)
return false;
}
+/*
+ * CheckBufferForCleanup - as above, but don't attempt to take lock
+ *
+ * We won't loop, but just check once to see if the pin count is OK. If
+ * not, return FALSE.
+ */
+bool
+CheckBufferForCleanup(Buffer buffer)
+{
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+
+ Assert(BufferIsValid(buffer));
+
+ if (BufferIsLocal(buffer))
+ {
+ /* There should be exactly one pin */
+ if (LocalRefCount[-buffer - 1] != 1)
+ return false;
+ /* Nobody else to wait for */
+ return true;
+ }
+
+ /* There should be exactly one local pin */
+ if (GetPrivateRefCount(buffer) != 1)
+ return false;
+
+ bufHdr = GetBufferDescriptor(buffer - 1);
+
+ buf_state = LockBufHdr(bufHdr);
+
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+ if (BUF_STATE_GET_REFCOUNT(buf_state) == 1)
+ {
+ /* pincount is OK. */
+ UnlockBufHdr(bufHdr, buf_state);
+ return true;
+ }
+
+ UnlockBufHdr(bufHdr, buf_state);
+ return false;
+}
+
/*
* Functions for buffer I/O handling
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index ce31418..6e8fc4c 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -25,6 +25,7 @@
#include "lib/stringinfo.h"
#include "storage/bufmgr.h"
#include "storage/lockdefs.h"
+#include "utils/hsearch.h"
#include "utils/relcache.h"
/*
@@ -52,6 +53,9 @@ typedef uint32 Bucket;
#define LH_BUCKET_PAGE (1 << 1)
#define LH_BITMAP_PAGE (1 << 2)
#define LH_META_PAGE (1 << 3)
+#define LH_BUCKET_NEW_PAGE_SPLIT (1 << 4)
+#define LH_BUCKET_OLD_PAGE_SPLIT (1 << 5)
+#define LH_BUCKET_PAGE_HAS_GARBAGE (1 << 6)
typedef struct HashPageOpaqueData
{
@@ -64,6 +68,12 @@ typedef struct HashPageOpaqueData
typedef HashPageOpaqueData *HashPageOpaque;
+#define H_HAS_GARBAGE(opaque) ((opaque)->hasho_flag & LH_BUCKET_PAGE_HAS_GARBAGE)
+#define H_OLD_INCOMPLETE_SPLIT(opaque) ((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT)
+#define H_NEW_INCOMPLETE_SPLIT(opaque) ((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+#define H_INCOMPLETE_SPLIT(opaque) (((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT) || \
+ ((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT))
+
/*
* The page ID is for the convenience of pg_filedump and similar utilities,
* which otherwise would have a hard time telling pages of different index
@@ -88,12 +98,6 @@ typedef struct HashScanOpaqueData
bool hashso_bucket_valid;
/*
- * If we have a share lock on the bucket, we record it here. When
- * hashso_bucket_blkno is zero, we have no such lock.
- */
- BlockNumber hashso_bucket_blkno;
-
- /*
* We also want to remember which buffer we're currently examining in the
* scan. We keep the buffer pinned (but not locked) across hashgettuple
* calls, in order to avoid doing a ReadBuffer() for every tuple in the
@@ -101,11 +105,23 @@ typedef struct HashScanOpaqueData
*/
Buffer hashso_curbuf;
+ /* remember the buffer associated with primary bucket */
+ Buffer hashso_bucket_buf;
+
+ /*
+ * remember the buffer associated with old primary bucket which is
+ * required during the scan of the bucket for which split is in progress.
+ */
+ Buffer hashso_old_bucket_buf;
+
/* Current position of the scan, as an index TID */
ItemPointerData hashso_curpos;
/* Current position of the scan, as a heap TID */
ItemPointerData hashso_heappos;
+
+ /* Whether scan needs to skip tuples that are moved by split */
+ bool hashso_skip_moved_tuples;
} HashScanOpaqueData;
typedef HashScanOpaqueData *HashScanOpaque;
@@ -176,6 +192,8 @@ typedef HashMetaPageData *HashMetaPage;
sizeof(ItemIdData) - \
MAXALIGN(sizeof(HashPageOpaqueData)))
+#define INDEX_MOVED_BY_SPLIT_MASK 0x2000
+
#define HASH_MIN_FILLFACTOR 10
#define HASH_DEFAULT_FILLFACTOR 75
@@ -224,9 +242,6 @@ typedef HashMetaPageData *HashMetaPage;
#define HASH_WRITE BUFFER_LOCK_EXCLUSIVE
#define HASH_NOLOCK (-1)
-#define HASH_SHARE ShareLock
-#define HASH_EXCLUSIVE ExclusiveLock
-
/*
* Strategy number. There's only one valid strategy for hashing: equality.
*/
@@ -299,21 +314,21 @@ extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
Size itemsize, IndexTuple itup);
/* hashovfl.c */
-extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf);
+extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin);
extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf,
- BufferAccessStrategy bstrategy);
+ BlockNumber bucket_blkno, BufferAccessStrategy bstrategy);
extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
BlockNumber blkno, ForkNumber forkNum);
extern void _hash_squeezebucket(Relation rel,
Bucket bucket, BlockNumber bucket_blkno,
+ Buffer bucket_buf,
BufferAccessStrategy bstrategy);
/* hashpage.c */
-extern void _hash_getlock(Relation rel, BlockNumber whichlock, int access);
-extern bool _hash_try_getlock(Relation rel, BlockNumber whichlock, int access);
-extern void _hash_droplock(Relation rel, BlockNumber whichlock, int access);
extern Buffer _hash_getbuf(Relation rel, BlockNumber blkno,
int access, int flags);
+extern Buffer _hash_getbuf_with_condlock_cleanup(Relation rel,
+ BlockNumber blkno, int flags);
extern Buffer _hash_getinitbuf(Relation rel, BlockNumber blkno);
extern Buffer _hash_getnewbuf(Relation rel, BlockNumber blkno,
ForkNumber forkNum);
@@ -322,6 +337,7 @@ extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
BufferAccessStrategy bstrategy);
extern void _hash_relbuf(Relation rel, Buffer buf);
extern void _hash_dropbuf(Relation rel, Buffer buf);
+extern void _hash_dropscanbuf(Relation rel, HashScanOpaque so);
extern void _hash_wrtbuf(Relation rel, Buffer buf);
extern void _hash_chgbufaccess(Relation rel, Buffer buf, int from_access,
int to_access);
@@ -329,6 +345,9 @@ extern uint32 _hash_metapinit(Relation rel, double num_tuples,
ForkNumber forkNum);
extern void _hash_pageinit(Page page, Size size);
extern void _hash_expandtable(Relation rel, Buffer metabuf);
+extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
+ Buffer nbuf, uint32 maxbucket, uint32 highmask,
+ uint32 lowmask);
/* hashscan.c */
extern void _hash_regscan(IndexScanDesc scan);
@@ -364,10 +383,20 @@ extern bool _hash_convert_tuple(Relation index,
Datum *index_values, bool *index_isnull);
extern OffsetNumber _hash_binsearch(Page page, uint32 hash_value);
extern OffsetNumber _hash_binsearch_last(Page page, uint32 hash_value);
+extern BlockNumber _hash_get_oldblk(Relation rel, HashPageOpaque opaque);
+extern BlockNumber _hash_get_newblk(Relation rel, HashPageOpaque opaque);
+extern Bucket _hash_get_newbucket(Relation rel, Bucket curr_bucket,
+ uint32 lowmask, uint32 maxbucket);
/* hash.c */
extern void hash_redo(XLogReaderState *record);
extern void hash_desc(StringInfo buf, XLogReaderState *record);
extern const char *hash_identify(uint8 info);
+extern void hashbucketcleanup(Relation rel, Buffer bucket_buf,
+ BlockNumber bucket_blkno, BufferAccessStrategy bstrategy,
+ uint32 maxbucket, uint32 highmask, uint32 lowmask,
+ double *tuples_removed, double *num_index_tuples,
+ bool bucket_has_garbage, bool delay,
+ IndexBulkDeleteCallback callback, void *callback_state);
#endif /* HASH_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 3d5dea7..6d0a29c 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -226,8 +226,10 @@ extern void MarkBufferDirtyHint(Buffer buffer, bool buffer_std);
extern void UnlockBuffers(void);
extern void LockBuffer(Buffer buffer, int mode);
extern bool ConditionalLockBuffer(Buffer buffer);
+extern bool ConditionalLockBufferShared(Buffer buffer);
extern void LockBufferForCleanup(Buffer buffer);
extern bool ConditionalLockBufferForCleanup(Buffer buffer);
+extern bool CheckBufferForCleanup(Buffer buffer);
extern bool HoldingBufferPinThatDelaysRecovery(void);
extern void AbortBufferIO(void);
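[Editorial aside, not part of the patch.] For readers following the bucket-mapping helpers above (_hash_msb, _hash_get_oldblk, _hash_get_newbucket), here is a small standalone C program that works through the arithmetic with concrete numbers. The helper name msb and the sample bucket numbers are purely illustrative.

    /*
     * Standalone illustration (not part of the patch) of the bucket
     * arithmetic used above: the old bucket is found by masking off the
     * most significant bit of the new bucket number, and the new bucket is
     * found by OR'ing the old bucket with the table-half bit (lowmask + 1).
     */
    #include <stdint.h>
    #include <stdio.h>

    /* position of the most significant set bit, mirroring _hash_msb above */
    static uint32_t
    msb(uint32_t num)
    {
        uint32_t    i = 0;

        while (num)
        {
            num >>= 1;
            ++i;
        }
        return i - 1;
    }

    int
    main(void)
    {
        uint32_t    new_bucket = 5;                                 /* binary 101 */
        uint32_t    mask = (((uint32_t) 1) << msb(new_bucket)) - 1; /* binary 011 */
        uint32_t    old_bucket = new_bucket & mask;                 /* binary 001 */

        /* reverse direction: with lowmask = 3, the table-half bit is 4 */
        uint32_t    lowmask = 3;
        uint32_t    again = old_bucket | (lowmask + 1);             /* 001 | 100 = 101 */

        printf("old bucket of %u is %u; new bucket of %u is %u\n",
               (unsigned) new_bucket, (unsigned) old_bucket,
               (unsigned) old_bucket, (unsigned) again);
        return 0;
    }

So bucket 5 is split from bucket 1, and once bucket 5 exists, OR'ing bucket 1 with the table-half bit recovers bucket 5, which is what _hash_get_newbucket computes in that case.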
On 08/05/2016 07:36 AM, Amit Kapila wrote:
On Thu, Aug 4, 2016 at 8:02 PM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
I did some basic testing of same. In that I found one issue with cursor.
Thanks for the testing. The reason for failure was that the patch
didn't take into account the fact that for scrolling cursors, scan can
reacquire the lock and pin on bucket buffer multiple times. I have
fixed it such that we release the pin on bucket buffers after we scan
the last overflow page in bucket. Attached patch fixes the issue for
me, let me know if you still see the issue.
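[Editorial aside.] The fix described above routes the release of the scan's bucket pins through the new _hash_dropscanbuf helper (declared in hash.h in the patch, but its body is not shown in this excerpt). A rough sketch of what such a helper could look like, based only on the HashScanOpaqueData fields visible in the patch and assuming the usual hash AM includes, is below; the actual implementation in the patch may differ.

    /*
     * Sketch only -- the patch's real _hash_dropscanbuf may differ.  Release
     * every pin the scan may still hold, being careful not to drop the same
     * buffer twice when the current page is the primary bucket page itself.
     */
    static void
    sketch_dropscanbuf(Relation rel, HashScanOpaque so)
    {
        /* release pin held on the primary bucket page */
        if (BufferIsValid(so->hashso_bucket_buf) &&
            so->hashso_bucket_buf != so->hashso_curbuf)
            _hash_dropbuf(rel, so->hashso_bucket_buf);
        so->hashso_bucket_buf = InvalidBuffer;

        /* release pin held on the old primary bucket page (split in progress) */
        if (BufferIsValid(so->hashso_old_bucket_buf) &&
            so->hashso_old_bucket_buf != so->hashso_curbuf)
            _hash_dropbuf(rel, so->hashso_old_bucket_buf);
        so->hashso_old_bucket_buf = InvalidBuffer;

        /* release any pin still held on the current page */
        if (BufferIsValid(so->hashso_curbuf))
            _hash_dropbuf(rel, so->hashso_curbuf);
        so->hashso_curbuf = InvalidBuffer;
    }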
Needs a rebase.
hashinsert.c
+ * reuse the space. There is no such apparent benefit from finsihing the
-> finishing
hashpage.c
+ * retrun the buffer, else return InvalidBuffer.
-> return
+ if (blkno == P_NEW)
+ elog(ERROR, "hash AM does not use P_NEW");
Left over ?
+ * for unlocking it.
-> for unlocking them.
hashsearch.c
+ * bucket, but not pin, then acuire the lock on new bucket and again
-> acquire
hashutil.c
+ * half. It is mainly required to finsh the incomplete splits where we are
-> finish
Ran some tests on a CHAR() based column which showed good results. Will
have to compare with a run with the WAL patch applied.
make check-world passes.
Best regards,
Jesper
On Thu, Sep 1, 2016 at 11:33 PM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
On 08/05/2016 07:36 AM, Amit Kapila wrote:
Needs a rebase.
Done.
+ if (blkno == P_NEW)
+ elog(ERROR, "hash AM does not use P_NEW");
Left over ?
No. We need this check, as in all the other _hash_*buf APIs, because we
never expect callers of those APIs to pass P_NEW. The new buckets
(blocks) are created during a split, which uses a different mechanism to
allocate blocks in bulk.
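[Editorial aside.] The pattern being discussed is simply a defensive check at the top of the buffer-fetch helpers. A minimal, hypothetical wrapper illustrating it is below; the function name sketch_getbuf is illustrative, only the check itself is quoted from the patch, and the usual hash AM includes are assumed.

    /*
     * Hypothetical simplified fetch helper, illustrating the defensive check
     * discussed above: callers must pass a concrete block number, never
     * P_NEW, because new bucket pages are allocated in bulk during a split.
     */
    static Buffer
    sketch_getbuf(Relation rel, BlockNumber blkno, int access, int flags)
    {
        Buffer      buf;

        if (blkno == P_NEW)
            elog(ERROR, "hash AM does not use P_NEW");

        buf = ReadBuffer(rel, blkno);

        if (access != HASH_NOLOCK)
            LockBuffer(buf, access);

        /* sanity-check the page type before returning it */
        _hash_checkpage(rel, buf, flags);

        return buf;
    }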
I have fixed all other issues you have raised. Updated patch is
attached with this mail.
Ran some tests on a CHAR() based column which showed good results. Will have
to compare with a run with the WAL patch applied.
Okay, thanks for testing. I think the WAL patch is still not ready for
performance testing; I am fixing a few issues in that patch, but you can
do the design or code-level review of that patch at this stage. I
think it is fine even if you share the performance numbers with this
and/or Mithun's patch [1].
[1]: https://commitfest.postgresql.org/10/715/
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
concurrent_hash_index_v5.patch (application/octet-stream)
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index c1122b4..d02539a 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -400,7 +400,6 @@ pgstat_hash_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno,
Buffer buf;
Page page;
- _hash_getlock(rel, blkno, HASH_SHARE);
buf = _hash_getbuf_with_strategy(rel, blkno, HASH_READ, 0, bstrategy);
page = BufferGetPage(buf);
@@ -431,7 +430,6 @@ pgstat_hash_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno,
}
_hash_relbuf(rel, buf);
- _hash_droplock(rel, blkno, HASH_SHARE);
}
/*
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 0a7da89..a0feb2f 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -125,49 +125,45 @@ the initially created buckets.
Lock Definitions
----------------
-
-We use both lmgr locks ("heavyweight" locks) and buffer context locks
-(LWLocks) to control access to a hash index. lmgr locks are needed for
-long-term locking since there is a (small) risk of deadlock, which we must
-be able to detect. Buffer context locks are used for short-term access
-control to individual pages of the index.
-
-LockPage(rel, page), where page is the page number of a hash bucket page,
-represents the right to split or compact an individual bucket. A process
-splitting a bucket must exclusive-lock both old and new halves of the
-bucket until it is done. A process doing VACUUM must exclusive-lock the
-bucket it is currently purging tuples from. Processes doing scans or
-insertions must share-lock the bucket they are scanning or inserting into.
-(It is okay to allow concurrent scans and insertions.)
-
-The lmgr lock IDs corresponding to overflow pages are currently unused.
-These are available for possible future refinements. LockPage(rel, 0)
-is also currently undefined (it was previously used to represent the right
-to modify the hash-code-to-bucket mapping, but it is no longer needed for
-that purpose).
-
-Note that these lock definitions are conceptually distinct from any sort
-of lock on the pages whose numbers they share. A process must also obtain
-read or write buffer lock on the metapage or bucket page before accessing
-said page.
-
-Processes performing hash index scans must hold share lock on the bucket
-they are scanning throughout the scan. This seems to be essential, since
-there is no reasonable way for a scan to cope with its bucket being split
-underneath it. This creates a possibility of deadlock external to the
-hash index code, since a process holding one of these locks could block
-waiting for an unrelated lock held by another process. If that process
-then does something that requires exclusive lock on the bucket, we have
-deadlock. Therefore the bucket locks must be lmgr locks so that deadlock
-can be detected and recovered from.
-
-Processes must obtain read (share) buffer context lock on any hash index
-page while reading it, and write (exclusive) lock while modifying it.
-To prevent deadlock we enforce these coding rules: no buffer lock may be
-held long term (across index AM calls), nor may any buffer lock be held
-while waiting for an lmgr lock, nor may more than one buffer lock
-be held at a time by any one process. (The third restriction is probably
-stronger than necessary, but it makes the proof of no deadlock obvious.)
+We use buffer content locks (LWLocks) and buffer pins to control access to
+a hash index.
+
+A scan takes a shared lock on the primary or overflow bucket page it is reading.
+An insert acquires an exclusive lock on the bucket page into which it has to
+insert. Both operations release the lock on the previous page before moving to
+the next overflow page, and both retain a pin on the primary bucket page till
+the end of the operation.
+Split operation must acquire cleanup lock on both old and new halves of the
+bucket and mark split-in-progress on both the buckets. The cleanup lock at
+the start of split ensures that a concurrent insert won't be lost. Consider a
+case where an insertion has to add a tuple to some intermediate overflow page
+in the bucket chain; if we allowed a split while that insertion is in
+progress, the split might not move the newly inserted tuple. The split
+releases the lock on the previous
+bucket before moving to the next overflow bucket either for old bucket or for
+new bucket. After partitioning the tuples between old and new buckets, it
+again needs to acquire exclusive lock on both old and new buckets to clear
+the split-in-progress flag. Like inserts and scans, it will also retain pins
+on both the old and new primary buckets till end of split operation, although
+we can do without that as well.
+
+Vacuum acquires a cleanup lock on the bucket to remove dead tuples and/or
+tuples that were moved due to a split. The cleanup lock is needed for removing
+dead tuples to ensure that scans return correct results. A scan that returns
+multiple tuples from the same bucket page always restarts from the offset
+number at which it returned the last tuple. If we allowed vacuum to remove
+dead tuples with just an exclusive lock, it could remove the tuple required
+to resume the scan. The cleanup lock is needed for removing the tuples moved
+by a split to ensure that there is no pending scan that started after the
+start of the split and before the split finished on the bucket. Without it,
+vacuum could remove tuples that are required by such a scan. We don't need to
+retain this cleanup lock for the whole vacuum operation on the bucket; we
+release the lock as we move ahead in the bucket chain. In the end, for the
+squeeze phase, we conditionally acquire the cleanup lock and, if we don't get
+it, we just abandon the squeeze phase.
+
+To avoid deadlocks, we must be consistent about the lock order in which we
+lock the buckets for operations that require locks on two different buckets.
+We use the rule "first lock the old bucket and then the new bucket", i.e.
+lock the lower-numbered bucket first.
Pseudocode Algorithms
@@ -188,63 +184,105 @@ track of available overflow pages.
The reader algorithm is:
pin meta page and take buffer content lock in shared mode
- loop:
- compute bucket number for target hash key
- release meta page buffer content lock
- if (correct bucket page is already locked)
- break
- release any existing bucket page lock (if a concurrent split happened)
- take heavyweight bucket lock
- retake meta page buffer content lock in shared mode
+ compute bucket number for target hash key
+ read and pin the primary bucket page
+ conditionally get the buffer content lock in shared mode on primary bucket page for search
+ if we didn't get the lock (need to wait for lock)
+ release the buffer content lock on meta page
+ acquire buffer content lock on primary bucket page in shared mode
+ acquire the buffer content lock in shared mode on meta page
+ to check for possibility of split, we need to recompute the bucket and
+ verify if it is the correct bucket; set the retry flag
+ else if we get the lock, then we can skip the retry path
+ if (retry)
+ loop:
+ compute bucket number for target hash key
+ release meta page buffer content lock
+ if (correct bucket page is already locked)
+ break
+ release any existing content lock on bucket page (if a concurrent split happened)
+ pin primary bucket page and take shared buffer content lock
+ retake meta page buffer content lock in shared mode
-- then, per read request:
release pin on metapage
- read current page of bucket and take shared buffer content lock
- step to next page if necessary (no chaining of locks)
+ if the split is in progress for current bucket and this is a new bucket
+ release the buffer content lock on current bucket page
+ pin and acquire the buffer content lock on old bucket in shared mode
+ release the buffer content lock on old bucket, but not pin
+ retake the buffer content lock on new bucket
+ mark the scan such that it skips the tuples that are marked as moved by split
+ step to next page if necessary (no chaining of locks)
+ if the scan indicates moved by split, then move to old bucket after the scan
+ of current bucket is finished
get tuple
release buffer content lock and pin on current page
-- at scan shutdown:
- release bucket share-lock
-
-We can't hold the metapage lock while acquiring a lock on the target bucket,
-because that might result in an undetected deadlock (lwlocks do not participate
-in deadlock detection). Instead, we relock the metapage after acquiring the
-bucket page lock and check whether the bucket has been split. If not, we're
-done. If so, we release our previously-acquired lock and repeat the process
-using the new bucket number. Holding the bucket sharelock for
+ release any pin we hold on current buffer, old bucket buffer, new bucket buffer
+
+We don't want to hold the meta page lock if we have to wait for acquiring the
+content lock on bucket page, because that might result in poor concurrency.
+Instead, we relock the metapage after acquiring the bucket page content lock
+and check whether the bucket has been split. If not, we're done. If so, we
+release our previously-acquired content lock (but not the pin) and repeat the
+process using the new bucket number. Holding the buffer pin on bucket page for
the remainder of the scan prevents the reader's current-tuple pointer from
-being invalidated by splits or compactions. Notice that the reader's lock
+being invalidated by splits or compactions. Notice that the reader's pin
does not prevent other buckets from being split or compacted.
To keep concurrency reasonably good, we require readers to cope with
concurrent insertions, which means that they have to be able to re-find
-their current scan position after re-acquiring the page sharelock. Since
-deletion is not possible while a reader holds the bucket sharelock, and
-we assume that heap tuple TIDs are unique, this can be implemented by
+their current scan position after re-acquiring the buffer content lock on
+page. Since deletion is not possible while a reader holds the pin on bucket,
+and we assume that heap tuple TIDs are unique, this can be implemented by
searching for the same heap tuple TID previously returned. Insertion does
not move index entries across pages, so the previously-returned index entry
should always be on the same page, at the same or higher offset number,
as it was before.
+To allow scans during a bucket split: if, at the start of the scan, the bucket
+is marked as split-in-progress, the scan reads all the tuples in that bucket
+except for those that are marked as moved-by-split. Once it finishes scanning
+all the tuples in the current bucket, it scans the old bucket from which this
+bucket was formed by the split. This happens only for the new half of a split.
+
The insertion algorithm is rather similar:
pin meta page and take buffer content lock in shared mode
- loop:
- compute bucket number for target hash key
- release meta page buffer content lock
- if (correct bucket page is already locked)
- break
- release any existing bucket page lock (if a concurrent split happened)
- take heavyweight bucket lock in shared mode
- retake meta page buffer content lock in shared mode
--- (so far same as reader)
+ compute bucket number for target hash key
+ read and pin the primary bucket page
+ conditionally get the buffer content lock in exclusive mode on primary bucket page for search
+ if we didn't get the lock (need to wait for lock)
+ release the buffer content lock on meta page
+ acquire buffer content lock on primary bucket page in exclusive mode
+ acquire the buffer content lock in shared mode on meta page
+ to check for possibility of split, we need to recompute the bucket and
+ verify if it is the correct bucket; set the retry flag
+ else if we get the lock, then we can skip the retry path
+ if (retry)
+ loop:
+ compute bucket number for target hash key
+ release meta page buffer content lock
+ if (correct bucket page is already locked)
+ break
+ release any existing content lock on bucket page (if a concurrent split happened)
+ pin primary bucket page and take exclusive buffer content lock
+ retake meta page buffer content lock in shared mode
+-- (so far same as reader, except for acquisition of buffer content lock in
+ exclusive mode on primary bucket page)
release pin on metapage
- pin current page of bucket and take exclusive buffer content lock
- if full, release, read/exclusive-lock next page; repeat as needed
+ if the split-in-progress flag is set for bucket in old half of split
+ and pin count on it is one, then finish the split
+ we already have a buffer content lock on old bucket, conditionally get the content lock on new bucket
+ if get the lock on new bucket
+ finish the split using algorithm mentioned below for split
+ release the buffer content lock and pin on new bucket
+ if full, release lock but not pin, read/exclusive-lock next page; repeat as needed
>> see below if no space in any page of bucket
insert tuple at appropriate place in page
mark current page dirty and release buffer content lock and pin
+ if current page is not a bucket page, release the pin on bucket page
release heavyweight share-lock
- pin meta page and take buffer content lock in shared mode
+ pin meta page and take buffer content lock in exclusive mode
increment tuple count, decide if split needed
mark meta page dirty and release buffer content lock and pin
done if no split needed, else enter Split algorithm below
@@ -256,11 +294,13 @@ bucket that is being actively scanned, because readers can cope with this
as explained above. We only need the short-term buffer locks to ensure
that readers do not see a partially-updated page.
-It is clearly impossible for readers and inserters to deadlock, and in
-fact this algorithm allows them a very high degree of concurrency.
-(The exclusive metapage lock taken to update the tuple count is stronger
-than necessary, since readers do not care about the tuple count, but the
-lock is held for such a short time that this is probably not an issue.)
+To avoid deadlock between readers and inserters, whenever there is a need
+to lock multiple buckets, we always take the locks in the order suggested in
+Lock Definitions above. This algorithm allows them a very high degree of
+concurrency. (The exclusive metapage lock taken to update the tuple count
+is stronger than necessary, since readers do not care about the tuple count,
+but the lock is held for such a short time that this is probably not an
+issue.)
When an inserter cannot find space in any existing page of a bucket, it
must obtain an overflow page and add that page to the bucket's chain.
@@ -271,46 +311,79 @@ index is overfull (has a higher-than-wanted ratio of tuples to buckets).
The algorithm attempts, but does not necessarily succeed, to split one
existing bucket in two, thereby lowering the fill ratio:
- pin meta page and take buffer content lock in exclusive mode
- check split still needed
- if split not needed anymore, drop buffer content lock and pin and exit
- decide which bucket to split
- Attempt to X-lock old bucket number (definitely could fail)
- Attempt to X-lock new bucket number (shouldn't fail, but...)
- if above fail, drop locks and pin and exit
+ expand:
+ take buffer content lock in exclusive mode on meta page
+ check split still needed
+ if split not needed anymore, drop buffer content lock and exit
+ decide which bucket to split
+ Attempt to acquire cleanup lock on old bucket number (definitely could fail)
+ if above fail, release lock and pin and exit
+ if the split-in-progress flag is set, then finish the split
+ conditionally get the content lock on new bucket which was involved in split
+ if got the lock on new bucket
+ finish the split using algorithm mentioned below for split
+ release the buffer content lock and pin on old and new buckets
+ try to expand from start
+ else
+ release the buffer content lock and pin on old bucket and exit
+ if the garbage flag (indicates that tuples are moved by split) is set on bucket
+ release the buffer content lock on meta page
+ remove the tuples that don't belong to this bucket; see bucket cleanup below
+ Attempt to acquire cleanup lock on new bucket number (shouldn't fail, but...)
update meta page to reflect new number of buckets
- mark meta page dirty and release buffer content lock and pin
+ mark meta page dirty and release buffer content lock
-- now, accesses to all other buckets can proceed.
Perform actual split of bucket, moving tuples as needed
>> see below about acquiring needed extra space
Release X-locks of old and new buckets
+ split guts
+ mark the old and new buckets indicating split-in-progress
+ mark the old bucket indicating has-garbage
+ copy the tuples that belong to the new bucket from the old bucket
+ during copy mark such tuples as move-by-split
+ release lock but not pin for primary bucket page of old bucket,
+ read/shared-lock next page; repeat as needed
+ >> see below if no space in bucket page of new bucket
+ ensure to have exclusive-lock on both old and new buckets in that order
+ clear the split-in-progress flag from both the buckets
+ mark buffers dirty and release the locks and pins on both old and new buckets
+
Note the metapage lock is not held while the actual tuple rearrangement is
performed, so accesses to other buckets can proceed in parallel; in fact,
it's possible for multiple bucket splits to proceed in parallel.
-Split's attempt to X-lock the old bucket number could fail if another
-process holds S-lock on it. We do not want to wait if that happens, first
-because we don't want to wait while holding the metapage exclusive-lock,
-and second because it could very easily result in deadlock. (The other
-process might be out of the hash AM altogether, and could do something
-that blocks on another lock this process holds; so even if the hash
-algorithm itself is deadlock-free, a user-induced deadlock could occur.)
-So, this is a conditional LockAcquire operation, and if it fails we just
-abandon the attempt to split. This is all right since the index is
-overfull but perfectly functional. Every subsequent inserter will try to
-split, and eventually one will succeed. If multiple inserters failed to
-split, the index might still be overfull, but eventually, the index will
+Split's attempt to acquire cleanup-lock on the old bucket number could fail
+if another process holds any lock or pin on it. We do not want to wait if
+that happens, because we don't want to wait while holding the metapage
+exclusive-lock. So, this is a conditional LWLockAcquire operation, and if
+it fails we just abandon the attempt to split. This is all right since the
+index is overfull but perfectly functional. Every subsequent inserter will
+try to split, and eventually one will succeed. If multiple inserters failed
+to split, the index might still be overfull, but eventually, the index will
not be overfull and split attempts will stop. (We could make a successful
splitter loop to see if the index is still overfull, but it seems better to
distribute the split overhead across successive insertions.)
+The has_garbage flag indicates that the bucket contains tuples that were moved
+due to a split. It is set only on the old bucket. The reason we need it in
+addition to the split-in-progress flag is to cover the case when the split is
+already over (i.e. the split-in-progress flag has been cleared). It is used
+both by vacuum and by the re-split operation. Vacuum uses it to decide whether
+it needs to clear the moved-by-split tuples from the bucket along with dead
+tuples. Re-split of a bucket uses it to ensure that it doesn't start a new
+split from a bucket without first clearing the previously moved tuples from
+the old bucket. The usage by re-split helps to keep bloat under control and
+makes the design somewhat simpler, as we never have to handle the situation
+where a bucket can contain dead tuples from multiple splits.
+
A problem is that if a split fails partway through (eg due to insufficient
-disk space) the index is left corrupt. The probability of that could be
-made quite low if we grab a free page or two before we update the meta
-page, but the only real solution is to treat a split as a WAL-loggable,
+disk space or crash) the index is left corrupt. The probability of that
+could be made quite low if we grab a free page or two before we update the
+meta page, but the only real solution is to treat a split as a WAL-loggable,
must-complete action. I'm not planning to teach hash about WAL in this
-go-round.
+go-round. However, we do try to finish the incomplete splits during insert
+and split.
The fourth operation is garbage collection (bulk deletion):
@@ -319,9 +392,13 @@ The fourth operation is garbage collection (bulk deletion):
fetch current max bucket number
release meta page buffer content lock and pin
while next bucket <= max bucket do
- Acquire X lock on target bucket
- Scan and remove tuples, compact free space as needed
- Release X lock
+ Acquire cleanup lock on target bucket
+ Scan and remove tuples
+ For overflow buckets, first we need to lock the next bucket and then
+ release the lock on current bucket
+ Ensure to have X lock on bucket page
+ If buffer pincount is one, then compact free space as needed
+ Release lock
next bucket ++
end loop
pin metapage and take buffer content lock in exclusive mode
@@ -330,20 +407,23 @@ The fourth operation is garbage collection (bulk deletion):
else update metapage tuple count
mark meta page dirty and release buffer content lock and pin
-Note that this is designed to allow concurrent splits. If a split occurs,
-tuples relocated into the new bucket will be visited twice by the scan,
-but that does no harm. (We must however be careful about the statistics
-reported by the VACUUM operation. What we can do is count the number of
-tuples scanned, and believe this in preference to the stored tuple count
-if the stored tuple count and number of buckets did *not* change at any
-time during the scan. This provides a way of correcting the stored tuple
-count if it gets out of sync for some reason. But if a split or insertion
-does occur concurrently, the scan count is untrustworthy; instead,
-subtract the number of tuples deleted from the stored tuple count and
-use that.)
-
-The exclusive lock request could deadlock in some strange scenarios, but
-we can just error out without any great harm being done.
+Note that this is designed to allow concurrent splits and scans. If a
+split occurs, tuples relocated into the new bucket will be visited twice
+by the scan, but that does no harm. Since we release the locks as we scan
+a bucket, a concurrent scan can start on the bucket, and such a scan will
+always stay behind the cleanup. Keeping scans behind cleanup is a must,
+else vacuum could remove tuples that are required to complete the scan, as
+explained in the Lock Definitions section above. This holds
+true for backward scans as well (backward scans first traverse each bucket
+starting from first bucket to last overflow bucket in the chain).
+We must be careful about the statistics reported by the VACUUM operation.
+What we can do is count the number of tuples scanned, and believe this in
+preference to the stored tuple count if the stored tuple count and number
+of buckets did *not* change at any time during the scan. This provides a
+way of correcting the stored tuple count if it gets out of sync for some
+reason. But if a split or insertion does occur concurrently, the scan
+count is untrustworthy; instead, subtract the number of tuples deleted
+from the stored tuple count and use that.
Free Space Management
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index e3b1eef..a12a830 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -287,10 +287,10 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
/*
* An insertion into the current index page could have happened while
* we didn't have read lock on it. Re-find our position by looking
- * for the TID we previously returned. (Because we hold share lock on
- * the bucket, no deletions or splits could have occurred; therefore
- * we can expect that the TID still exists in the current index page,
- * at an offset >= where we were.)
+ * for the TID we previously returned. (Because we hold pin on the
+ * bucket, no deletions or splits could have occurred; therefore we
+ * can expect that the TID still exists in the current index page, at
+ * an offset >= where we were.)
*/
OffsetNumber maxoffnum;
@@ -425,12 +425,15 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
so->hashso_bucket_valid = false;
- so->hashso_bucket_blkno = 0;
so->hashso_curbuf = InvalidBuffer;
+ so->hashso_bucket_buf = InvalidBuffer;
+ so->hashso_old_bucket_buf = InvalidBuffer;
/* set position invalid (this will cause _hash_first call) */
ItemPointerSetInvalid(&(so->hashso_curpos));
ItemPointerSetInvalid(&(so->hashso_heappos));
+ so->hashso_skip_moved_tuples = false;
+
scan->opaque = so;
/* register scan in case we change pages it's using */
@@ -449,15 +452,7 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
HashScanOpaque so = (HashScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
- /* release any pin we still hold */
- if (BufferIsValid(so->hashso_curbuf))
- _hash_dropbuf(rel, so->hashso_curbuf);
- so->hashso_curbuf = InvalidBuffer;
-
- /* release lock on bucket, too */
- if (so->hashso_bucket_blkno)
- _hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
- so->hashso_bucket_blkno = 0;
+ _hash_dropscanbuf(rel, so);
/* set position invalid (this will cause _hash_first call) */
ItemPointerSetInvalid(&(so->hashso_curpos));
@@ -471,6 +466,8 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
scan->numberOfKeys * sizeof(ScanKeyData));
so->hashso_bucket_valid = false;
}
+
+ so->hashso_skip_moved_tuples = false;
}
/*
@@ -484,16 +481,7 @@ hashendscan(IndexScanDesc scan)
/* don't need scan registered anymore */
_hash_dropscan(scan);
-
- /* release any pin we still hold */
- if (BufferIsValid(so->hashso_curbuf))
- _hash_dropbuf(rel, so->hashso_curbuf);
- so->hashso_curbuf = InvalidBuffer;
-
- /* release lock on bucket, too */
- if (so->hashso_bucket_blkno)
- _hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
- so->hashso_bucket_blkno = 0;
+ _hash_dropscanbuf(rel, so);
pfree(so);
scan->opaque = NULL;
@@ -504,6 +492,9 @@ hashendscan(IndexScanDesc scan)
* The set of target tuples is specified via a callback routine that tells
* whether any given heap tuple (identified by ItemPointer) is being deleted.
*
+ * This function also deletes the tuples that were moved by a split to another
+ * bucket.
+ *
* Result: a palloc'd struct containing statistical info for VACUUM displays.
*/
IndexBulkDeleteResult *
@@ -548,83 +539,52 @@ loop_top:
{
BlockNumber bucket_blkno;
BlockNumber blkno;
- bool bucket_dirty = false;
+ Buffer bucket_buf;
+ Buffer buf;
+ HashPageOpaque bucket_opaque;
+ Page page;
+ bool bucket_has_garbage = false;
/* Get address of bucket's start page */
bucket_blkno = BUCKET_TO_BLKNO(&local_metapage, cur_bucket);
- /* Exclusive-lock the bucket so we can shrink it */
- _hash_getlock(rel, bucket_blkno, HASH_EXCLUSIVE);
-
/* Shouldn't have any active scans locally, either */
if (_hash_has_active_scan(rel, cur_bucket))
elog(ERROR, "hash index has active scan during VACUUM");
- /* Scan each page in bucket */
blkno = bucket_blkno;
- while (BlockNumberIsValid(blkno))
- {
- Buffer buf;
- Page page;
- HashPageOpaque opaque;
- OffsetNumber offno;
- OffsetNumber maxoffno;
- OffsetNumber deletable[MaxOffsetNumber];
- int ndeletable = 0;
- vacuum_delay_point();
-
- buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
- LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
- info->strategy);
- page = BufferGetPage(buf);
- opaque = (HashPageOpaque) PageGetSpecialPointer(page);
- Assert(opaque->hasho_bucket == cur_bucket);
-
- /* Scan each tuple in page */
- maxoffno = PageGetMaxOffsetNumber(page);
- for (offno = FirstOffsetNumber;
- offno <= maxoffno;
- offno = OffsetNumberNext(offno))
- {
- IndexTuple itup;
- ItemPointer htup;
+ /*
+ * We need to acquire a cleanup lock on the primary bucket to wait out
+ * concurrent scans.
+ */
+ buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, info->strategy);
+ LockBufferForCleanup(buf);
+ _hash_checkpage(rel, buf, LH_BUCKET_PAGE);
- itup = (IndexTuple) PageGetItem(page,
- PageGetItemId(page, offno));
- htup = &(itup->t_tid);
- if (callback(htup, callback_state))
- {
- /* mark the item for deletion */
- deletable[ndeletable++] = offno;
- tuples_removed += 1;
- }
- else
- num_index_tuples += 1;
- }
+ page = BufferGetPage(buf);
+ bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
- /*
- * Apply deletions and write page if needed, advance to next page.
- */
- blkno = opaque->hasho_nextblkno;
+ /*
+ * If the bucket contains tuples that were moved by a split, then we need
+ * to delete such tuples once the split has completed. Before cleaning, we
+ * need to wait out the scans that started while the split was in
+ * progress for the bucket.
+ */
+ if (H_HAS_GARBAGE(bucket_opaque) &&
+ !H_INCOMPLETE_SPLIT(bucket_opaque))
+ bucket_has_garbage = true;
- if (ndeletable > 0)
- {
- PageIndexMultiDelete(page, deletable, ndeletable);
- _hash_wrtbuf(rel, buf);
- bucket_dirty = true;
- }
- else
- _hash_relbuf(rel, buf);
- }
+ bucket_buf = buf;
- /* If we deleted anything, try to compact free space */
- if (bucket_dirty)
- _hash_squeezebucket(rel, cur_bucket, bucket_blkno,
- info->strategy);
+ hashbucketcleanup(rel, bucket_buf, blkno, info->strategy,
+ local_metapage.hashm_maxbucket,
+ local_metapage.hashm_highmask,
+ local_metapage.hashm_lowmask, &tuples_removed,
+ &num_index_tuples, bucket_has_garbage, true,
+ callback, callback_state);
- /* Release bucket lock */
- _hash_droplock(rel, bucket_blkno, HASH_EXCLUSIVE);
+ _hash_relbuf(rel, bucket_buf);
/* Advance to next bucket */
cur_bucket++;
@@ -705,6 +665,197 @@ hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
return stats;
}
+/*
+ * Helper function to perform deletion of index entries from a bucket.
+ *
+ * This expects that the caller has acquired a cleanup lock on the target
+ * bucket (primary page of a bucket) and it is the responsibility of the caller to
+ * release that lock.
+ *
+ * During scan of overflow buckets, first we need to lock the next bucket and
+ * then release the lock on current bucket. This ensures that any concurrent
+ * scan started after we start cleaning the bucket will always be behind the
+ * cleanup. If scans were allowed to overtake vacuum, vacuum could remove
+ * tuples that are still required for the correctness of the scan.
+ *
+ * We need to retain a pin on the primary bucket to ensure that no concurrent
+ * split can start.
+ */
+void
+hashbucketcleanup(Relation rel, Buffer bucket_buf,
+ BlockNumber bucket_blkno,
+ BufferAccessStrategy bstrategy,
+ uint32 maxbucket,
+ uint32 highmask, uint32 lowmask,
+ double *tuples_removed,
+ double *num_index_tuples,
+ bool bucket_has_garbage,
+ bool delay,
+ IndexBulkDeleteCallback callback,
+ void *callback_state)
+{
+ BlockNumber blkno;
+ Buffer buf;
+ Bucket cur_bucket;
+ Bucket new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
+ Page page;
+ bool bucket_dirty = false;
+
+ blkno = bucket_blkno;
+ buf = bucket_buf;
+ page = BufferGetPage(buf);
+ cur_bucket = ((HashPageOpaque) PageGetSpecialPointer(page))->hasho_bucket;
+
+ if (bucket_has_garbage)
+ new_bucket = _hash_get_newbucket(rel, cur_bucket,
+ lowmask, maxbucket);
+
+ /* Scan each page in bucket */
+ for (;;)
+ {
+ HashPageOpaque opaque;
+ OffsetNumber offno;
+ OffsetNumber maxoffno;
+ Buffer next_buf;
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+ bool retain_pin = false;
+ bool curr_page_dirty = false;
+
+ if (delay)
+ vacuum_delay_point();
+
+ page = BufferGetPage(buf);
+ opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+ /* Scan each tuple in page */
+ maxoffno = PageGetMaxOffsetNumber(page);
+ for (offno = FirstOffsetNumber;
+ offno <= maxoffno;
+ offno = OffsetNumberNext(offno))
+ {
+ IndexTuple itup;
+ ItemPointer htup;
+ Bucket bucket;
+
+ itup = (IndexTuple) PageGetItem(page,
+ PageGetItemId(page, offno));
+ htup = &(itup->t_tid);
+ if (callback && callback(htup, callback_state))
+ {
+ /* mark the item for deletion */
+ deletable[ndeletable++] = offno;
+ if (tuples_removed)
+ *tuples_removed += 1;
+ }
+ else if (bucket_has_garbage)
+ {
+ /* delete the tuples that are moved by split. */
+ bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
+ maxbucket,
+ highmask,
+ lowmask);
+ /* mark the item for deletion */
+ if (bucket != cur_bucket)
+ {
+ /*
+ * We expect tuples to belong to either the current bucket or
+ * new_bucket. This is ensured because we don't allow
+ * further splits from a bucket that contains garbage. See
+ * comments in _hash_expandtable.
+ */
+ Assert(bucket == new_bucket);
+ deletable[ndeletable++] = offno;
+ }
+ else if (num_index_tuples)
+ *num_index_tuples += 1;
+ }
+ else if (num_index_tuples)
+ *num_index_tuples += 1;
+ }
+
+ /* retain the pin on primary bucket till end of bucket scan */
+ if (blkno == bucket_blkno)
+ retain_pin = true;
+ else
+ retain_pin = false;
+
+ blkno = opaque->hasho_nextblkno;
+
+ /*
+ * Apply deletions and write page if needed, advance to next page.
+ */
+ if (ndeletable > 0)
+ {
+ PageIndexMultiDelete(page, deletable, ndeletable);
+ bucket_dirty = true;
+ curr_page_dirty = true;
+ }
+
+ /* bail out if there are no more pages to scan. */
+ if (!BlockNumberIsValid(blkno))
+ break;
+
+ next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+ LH_OVERFLOW_PAGE,
+ bstrategy);
+
+ /*
+ * release the lock on previous page after acquiring the lock on next
+ * page
+ */
+ if (curr_page_dirty)
+ {
+ if (retain_pin)
+ _hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+ else
+ _hash_wrtbuf(rel, buf);
+ curr_page_dirty = false;
+ }
+ else if (retain_pin)
+ _hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+ else
+ _hash_relbuf(rel, buf);
+
+ buf = next_buf;
+ }
+
+ /*
+ * Lock the bucket page to clear the garbage flag and squeeze the bucket.
+ * If the current buffer is the same as the bucket buffer, then we already
+ * have a lock on the bucket page.
+ */
+ if (buf != bucket_buf)
+ {
+ _hash_relbuf(rel, buf);
+ _hash_chgbufaccess(rel, bucket_buf, HASH_NOLOCK, HASH_WRITE);
+ }
+
+ /*
+ * Clear the garbage flag from the bucket after deleting the tuples that
+ * were moved by a split. We purposefully clear the flag before squeezing
+ * the bucket, so that after a restart vacuum won't again try to delete the
+ * moved-by-split tuples.
+ */
+ if (bucket_has_garbage)
+ {
+ HashPageOpaque bucket_opaque;
+
+ page = BufferGetPage(bucket_buf);
+ bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+ bucket_opaque->hasho_flag &= ~LH_BUCKET_PAGE_HAS_GARBAGE;
+ }
+
+ /*
+ * If we deleted anything, try to compact free space. To squeeze the
+ * bucket, we must have a cleanup lock, else the squeeze could disturb the
+ * ordering of tuples seen by a scan that started before it.
+ */
+ if (bucket_dirty && CheckBufferForCleanup(bucket_buf))
+ _hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf,
+ bstrategy);
+}
void
hash_redo(XLogReaderState *record)
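As an aside, the way hashbucketcleanup recognizes split leftovers may be
easier to see in isolation: it recomputes each tuple's bucket from its hash
key using the masks the caller copied from the metapage, and anything that
now maps to the new bucket is garbage left behind by the split. Below is a
small standalone sketch, not part of the patch; the helper name and mask
values are made up for illustration, and the bucket computation merely
paraphrases _hash_hashkey2bucket.

    #include <stdio.h>

    /* paraphrase of _hash_hashkey2bucket, for illustration only */
    static unsigned
    hashkey_to_bucket(unsigned hashkey, unsigned maxbucket,
                      unsigned highmask, unsigned lowmask)
    {
        unsigned    bucket = hashkey & highmask;

        if (bucket > maxbucket)
            bucket &= lowmask;      /* wrap back into the lower table half */
        return bucket;
    }

    int
    main(void)
    {
        /* assumed state: bucket 4 has just been split off from bucket 0 */
        unsigned    maxbucket = 4,
                    highmask = 7,
                    lowmask = 3;
        unsigned    cur_bucket = 0;             /* the bucket being cleaned */
        unsigned    keys[] = {0x10, 0x0c, 0x18};        /* sample hash keys */

        for (int i = 0; i < 3; i++)
        {
            unsigned    b = hashkey_to_bucket(keys[i], maxbucket,
                                              highmask, lowmask);

            printf("key 0x%02x -> bucket %u: %s\n", keys[i], b,
                   b == cur_bucket ? "keep" : "moved by split, delete");
        }
        return 0;
    }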
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index acd2e64..5cfd0aa 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -28,7 +28,8 @@
void
_hash_doinsert(Relation rel, IndexTuple itup)
{
- Buffer buf;
+ Buffer buf = InvalidBuffer;
+ Buffer bucket_buf;
Buffer metabuf;
HashMetaPage metap;
BlockNumber blkno;
@@ -40,6 +41,9 @@ _hash_doinsert(Relation rel, IndexTuple itup)
bool do_expand;
uint32 hashkey;
Bucket bucket;
+ uint32 maxbucket;
+ uint32 highmask;
+ uint32 lowmask;
/*
* Get the hash key for the item (it's stored in the index tuple itself).
@@ -70,51 +74,131 @@ _hash_doinsert(Relation rel, IndexTuple itup)
errhint("Values larger than a buffer page cannot be indexed.")));
/*
- * Loop until we get a lock on the correct target bucket.
+ * Copy the bucket mapping info now; the comment in _hash_expandtable, where
+ * we copy this information before calling _hash_splitbucket, explains why
+ * this is OK.
*/
- for (;;)
- {
- /*
- * Compute the target bucket number, and convert to block number.
- */
- bucket = _hash_hashkey2bucket(hashkey,
- metap->hashm_maxbucket,
- metap->hashm_highmask,
- metap->hashm_lowmask);
+ maxbucket = metap->hashm_maxbucket;
+ highmask = metap->hashm_highmask;
+ lowmask = metap->hashm_lowmask;
- blkno = BUCKET_TO_BLKNO(metap, bucket);
+ /*
+ * Conditionally get the lock on the primary bucket page for insertion
+ * while holding the lock on the meta page. If we would have to wait,
+ * release the meta page lock and retry the hard way.
+ */
+ bucket = _hash_hashkey2bucket(hashkey,
+ maxbucket,
+ highmask,
+ lowmask);
+
+ blkno = BUCKET_TO_BLKNO(metap, bucket);
- /* Release metapage lock, but keep pin. */
+ /* Fetch the primary bucket page for the bucket */
+ buf = ReadBuffer(rel, blkno);
+ if (!ConditionalLockBuffer(buf))
+ {
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+ LockBuffer(buf, HASH_WRITE);
+ _hash_checkpage(rel, buf, LH_BUCKET_PAGE);
+ _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+ oldblkno = blkno;
+ retry = true;
+ }
+ else
+ {
+ _hash_checkpage(rel, buf, LH_BUCKET_PAGE);
_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+ }
+ if (retry)
+ {
/*
- * If the previous iteration of this loop locked what is still the
- * correct target bucket, we are done. Otherwise, drop any old lock
- * and lock what now appears to be the correct bucket.
+ * Loop until we get a lock on the correct target bucket. We take the
+ * lock on the primary bucket page and retain the pin on it for the
+ * whole insert operation to prevent concurrent splits. Retaining a pin
+ * on the primary bucket page ensures that a split can't happen, as the
+ * split needs a cleanup lock on that page. Acquiring the lock on the
+ * primary bucket and rechecking that it is still the target bucket is
+ * mandatory, as otherwise a concurrent split might cause this insertion
+ * to land in the wrong bucket.
*/
- if (retry)
+ for (;;)
{
+ /*
+ * Compute the target bucket number, and convert to block number.
+ */
+ bucket = _hash_hashkey2bucket(hashkey,
+ metap->hashm_maxbucket,
+ metap->hashm_highmask,
+ metap->hashm_lowmask);
+
+ blkno = BUCKET_TO_BLKNO(metap, bucket);
+
+ /* Release metapage lock, but keep pin. */
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+ /*
+ * If the previous iteration of this loop locked what is still the
+ * correct target bucket, we are done. Otherwise, drop any old
+ * lock and lock what now appears to be the correct bucket.
+ */
if (oldblkno == blkno)
break;
- _hash_droplock(rel, oldblkno, HASH_SHARE);
- }
- _hash_getlock(rel, blkno, HASH_SHARE);
+ _hash_relbuf(rel, buf);
- /*
- * Reacquire metapage lock and check that no bucket split has taken
- * place while we were awaiting the bucket lock.
- */
- _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
- oldblkno = blkno;
- retry = true;
+ /* Fetch the primary bucket page for the bucket */
+ buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
+
+ /*
+ * Reacquire metapage lock and check that no bucket split has
+ * taken place while we were awaiting the bucket lock.
+ */
+ _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+ oldblkno = blkno;
+ }
}
- /* Fetch the primary bucket page for the bucket */
- buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
+ /* remember the primary bucket buffer to release the pin on it at end. */
+ bucket_buf = buf;
+
page = BufferGetPage(buf);
pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
Assert(pageopaque->hasho_bucket == bucket);
+ /*
+ * If there is any pending split, try to finish it before proceeding with
+ * the insertion. We only try to finish the split when inserting into the
+ * old bucket, as that allows us to remove the tuples from the old bucket
+ * and reuse the space. There is no such apparent benefit from finishing
+ * the split during insertion into the new bucket.
+ *
+ * In future, if we want to finish splits during insertion into the new
+ * bucket, we must ensure a locking order such that the old bucket is
+ * locked before the new bucket.
+ */
+ if (H_OLD_INCOMPLETE_SPLIT(pageopaque) && CheckBufferForCleanup(buf))
+ {
+ BlockNumber nblkno;
+ Buffer nbuf;
+
+ nblkno = _hash_get_newblk(rel, pageopaque);
+
+ /* Fetch the primary bucket page for the new bucket */
+ nbuf = _hash_getbuf_with_condlock_cleanup(rel, nblkno, LH_BUCKET_PAGE);
+ if (nbuf)
+ {
+ _hash_finish_split(rel, metabuf, buf, nbuf, maxbucket,
+ highmask, lowmask);
+
+ /*
+ * release the buffer here as the insertion will happen in old
+ * bucket.
+ */
+ _hash_relbuf(rel, nbuf);
+ }
+ }
+
/* Do the insertion */
while (PageGetFreeSpace(page) < itemsz)
{
@@ -127,14 +211,23 @@ _hash_doinsert(Relation rel, IndexTuple itup)
{
/*
* ovfl page exists; go get it. if it doesn't have room, we'll
- * find out next pass through the loop test above.
+ * find out next pass through the loop test above. Retain the pin
+ * if it is a primary bucket.
*/
- _hash_relbuf(rel, buf);
+ if (pageopaque->hasho_flag & LH_BUCKET_PAGE)
+ _hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+ else
+ _hash_relbuf(rel, buf);
buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
page = BufferGetPage(buf);
}
else
{
+ bool retain_pin = false;
+
+ /* page flags must be accessed before releasing lock on a page. */
+ retain_pin = pageopaque->hasho_flag & LH_BUCKET_PAGE;
+
/*
* we're at the end of the bucket chain and we haven't found a
* page with enough room. allocate a new overflow page.
@@ -144,7 +237,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
/* chain to a new overflow page */
- buf = _hash_addovflpage(rel, metabuf, buf);
+ buf = _hash_addovflpage(rel, metabuf, buf, retain_pin);
page = BufferGetPage(buf);
/* should fit now, given test above */
@@ -158,11 +251,13 @@ _hash_doinsert(Relation rel, IndexTuple itup)
/* found page with enough space, so add the item here */
(void) _hash_pgaddtup(rel, buf, itemsz, itup);
- /* write and release the modified page */
+ /*
+ * write and release the modified page, and make sure to release the pin
+ * on the primary page.
+ */
_hash_wrtbuf(rel, buf);
-
- /* We can drop the bucket lock now */
- _hash_droplock(rel, blkno, HASH_SHARE);
+ if (buf != bucket_buf)
+ _hash_dropbuf(rel, bucket_buf);
/*
* Write-lock the metapage so we can increment the tuple count. After
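The conditional-lock dance in _hash_doinsert above (try the bucket page lock
while still holding the metapage lock; if that would block, drop the metapage
lock, take the bucket lock the hard way, then retake the metapage lock and
recheck the target) is the same pattern the scan code uses later. Stripped of
the buffer-manager details it looks roughly like the standalone pthread
sketch below; the names and the generation counter standing in for the bucket
mapping are illustrative only, not anything from the patch.

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t meta_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t bucket_lock = PTHREAD_MUTEX_INITIALIZER;
    static int bucket_generation = 0;   /* stand-in for the bucket mapping */

    static void
    lock_target_bucket(void)
    {
        int         seen;

        pthread_mutex_lock(&meta_lock);
        seen = bucket_generation;   /* compute target under the meta lock */

        if (pthread_mutex_trylock(&bucket_lock) != 0)
        {
            /* would block: drop the meta lock before waiting on the bucket */
            pthread_mutex_unlock(&meta_lock);
            pthread_mutex_lock(&bucket_lock);

            /* retake the meta lock and recheck that the target is unchanged */
            pthread_mutex_lock(&meta_lock);
            if (bucket_generation != seen)
                printf("target changed while waiting; would loop and retry\n");
        }

        /* both locks held here, as in the fast path */
        pthread_mutex_unlock(&meta_lock);
        /* ... operate on the bucket ... */
        pthread_mutex_unlock(&bucket_lock);
    }

    int
    main(void)
    {
        lock_target_bucket();
        printf("done\n");
        return 0;
    }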
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index db3e268..760563a 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -82,23 +82,20 @@ blkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
*
* On entry, the caller must hold a pin but no lock on 'buf'. The pin is
* dropped before exiting (we assume the caller is not interested in 'buf'
- * anymore). The returned overflow page will be pinned and write-locked;
- * it is guaranteed to be empty.
+ * anymore) if not asked to retain. The pin will be retained only for the
+ * primary bucket. The returned overflow page will be pinned and
+ * write-locked; it is guaranteed to be empty.
*
* The caller must hold a pin, but no lock, on the metapage buffer.
* That buffer is returned in the same state.
*
- * The caller must hold at least share lock on the bucket, to ensure that
- * no one else tries to compact the bucket meanwhile. This guarantees that
- * 'buf' won't stop being part of the bucket while it's unlocked.
- *
* NB: since this could be executed concurrently by multiple processes,
* one should not assume that the returned overflow page will be the
* immediate successor of the originally passed 'buf'. Additional overflow
* pages might have been added to the bucket chain in between.
*/
Buffer
-_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
+_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
{
Buffer ovflbuf;
Page page;
@@ -131,7 +128,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
break;
/* we assume we do not need to write the unmodified page */
- _hash_relbuf(rel, buf);
+ if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+ _hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+ else
+ _hash_relbuf(rel, buf);
buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
}
@@ -149,7 +149,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
/* logically chain overflow page to previous page */
pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
- _hash_wrtbuf(rel, buf);
+ if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+ _hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+ else
+ _hash_wrtbuf(rel, buf);
return ovflbuf;
}
@@ -370,11 +373,11 @@ _hash_firstfreebit(uint32 map)
* in the bucket, or InvalidBlockNumber if no following page.
*
* NB: caller must not hold lock on metapage, nor on either page that's
- * adjacent in the bucket chain. The caller had better hold exclusive lock
- * on the bucket, too.
+ * adjacent in the bucket chain, except for the primary bucket. The caller
+ * had better hold a cleanup lock on the primary bucket.
*/
BlockNumber
-_hash_freeovflpage(Relation rel, Buffer ovflbuf,
+_hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
BufferAccessStrategy bstrategy)
{
HashMetaPage metap;
@@ -413,22 +416,41 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf,
/*
* Fix up the bucket chain. this is a doubly-linked list, so we must fix
* up the bucket chain members behind and ahead of the overflow page being
- * deleted. No concurrency issues since we hold exclusive lock on the
- * entire bucket.
+ * deleted. No concurrency issues since we hold the cleanup lock on the
+ * primary bucket. We don't need to acquire a buffer lock to fix the
+ * primary bucket, as we already have that lock.
*/
if (BlockNumberIsValid(prevblkno))
{
- Buffer prevbuf = _hash_getbuf_with_strategy(rel,
- prevblkno,
- HASH_WRITE,
+ if (prevblkno == bucket_blkno)
+ {
+ Buffer prevbuf = ReadBufferExtended(rel, MAIN_FORKNUM,
+ prevblkno,
+ RBM_NORMAL,
+ bstrategy);
+
+ Page prevpage = BufferGetPage(prevbuf);
+ HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+ Assert(prevopaque->hasho_bucket == bucket);
+ prevopaque->hasho_nextblkno = nextblkno;
+ MarkBufferDirty(prevbuf);
+ ReleaseBuffer(prevbuf);
+ }
+ else
+ {
+ Buffer prevbuf = _hash_getbuf_with_strategy(rel,
+ prevblkno,
+ HASH_WRITE,
LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
- bstrategy);
- Page prevpage = BufferGetPage(prevbuf);
- HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+ bstrategy);
+ Page prevpage = BufferGetPage(prevbuf);
+ HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
- Assert(prevopaque->hasho_bucket == bucket);
- prevopaque->hasho_nextblkno = nextblkno;
- _hash_wrtbuf(rel, prevbuf);
+ Assert(prevopaque->hasho_bucket == bucket);
+ prevopaque->hasho_nextblkno = nextblkno;
+ _hash_wrtbuf(rel, prevbuf);
+ }
}
if (BlockNumberIsValid(nextblkno))
{
@@ -570,7 +592,7 @@ _hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
* required that to be true on entry as well, but it's a lot easier for
* callers to leave empty overflow pages and let this guy clean it up.
*
- * Caller must hold exclusive lock on the target bucket. This allows
+ * Caller must hold cleanup lock on the target bucket. This allows
* us to safely lock multiple pages in the bucket.
*
* Since this function is invoked in VACUUM, we provide an access strategy
@@ -580,6 +602,7 @@ void
_hash_squeezebucket(Relation rel,
Bucket bucket,
BlockNumber bucket_blkno,
+ Buffer bucket_buf,
BufferAccessStrategy bstrategy)
{
BlockNumber wblkno;
@@ -591,27 +614,22 @@ _hash_squeezebucket(Relation rel,
HashPageOpaque wopaque;
HashPageOpaque ropaque;
bool wbuf_dirty;
+ bool release_buf = false;
/*
* start squeezing into the base bucket page.
*/
wblkno = bucket_blkno;
- wbuf = _hash_getbuf_with_strategy(rel,
- wblkno,
- HASH_WRITE,
- LH_BUCKET_PAGE,
- bstrategy);
+ wbuf = bucket_buf;
wpage = BufferGetPage(wbuf);
wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
/*
- * if there aren't any overflow pages, there's nothing to squeeze.
+ * if there aren't any overflow pages, there's nothing to squeeze. The
+ * caller is responsible for releasing the lock on the primary bucket.
*/
if (!BlockNumberIsValid(wopaque->hasho_nextblkno))
- {
- _hash_relbuf(rel, wbuf);
return;
- }
/*
* Find the last page in the bucket chain by starting at the base bucket
@@ -669,12 +687,17 @@ _hash_squeezebucket(Relation rel,
{
Assert(!PageIsEmpty(wpage));
+ if (wblkno != bucket_blkno)
+ release_buf = true;
+
wblkno = wopaque->hasho_nextblkno;
Assert(BlockNumberIsValid(wblkno));
- if (wbuf_dirty)
+ if (wbuf_dirty && release_buf)
_hash_wrtbuf(rel, wbuf);
- else
+ else if (wbuf_dirty)
+ MarkBufferDirty(wbuf);
+ else if (release_buf)
_hash_relbuf(rel, wbuf);
/* nothing more to do if we reached the read page */
@@ -700,6 +723,7 @@ _hash_squeezebucket(Relation rel,
wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
Assert(wopaque->hasho_bucket == bucket);
wbuf_dirty = false;
+ release_buf = false;
}
/*
@@ -733,19 +757,25 @@ _hash_squeezebucket(Relation rel,
/* are we freeing the page adjacent to wbuf? */
if (rblkno == wblkno)
{
- /* yes, so release wbuf lock first */
- if (wbuf_dirty)
+ if (wblkno != bucket_blkno)
+ release_buf = true;
+
+ /* yes, so release wbuf lock first if needed */
+ if (wbuf_dirty && release_buf)
_hash_wrtbuf(rel, wbuf);
- else
+ else if (wbuf_dirty)
+ MarkBufferDirty(wbuf);
+ else if (release_buf)
_hash_relbuf(rel, wbuf);
+
/* free this overflow page (releases rbuf) */
- _hash_freeovflpage(rel, rbuf, bstrategy);
+ _hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
/* done */
return;
}
/* free this overflow page, then get the previous one */
- _hash_freeovflpage(rel, rbuf, bstrategy);
+ _hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
rbuf = _hash_getbuf_with_strategy(rel,
rblkno,
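The new write-buffer handling in _hash_squeezebucket boils down to a small
decision table: the primary bucket page is owned (locked and pinned) by the
caller, so the squeeze only marks it dirty, while overflow pages are written
or released as before. The standalone sketch below is my summary of that
rule, with invented names, not code taken from the patch.

    #include <stdbool.h>
    #include <stdio.h>

    /* disposition of wbuf when the squeeze moves on, per the patch's rules */
    static const char *
    wbuf_action(bool dirty, bool is_primary_bucket_page)
    {
        bool        release = !is_primary_bucket_page;

        if (dirty && release)
            return "write page, release lock and pin (_hash_wrtbuf)";
        if (dirty)
            return "mark dirty only, keep lock and pin (MarkBufferDirty)";
        if (release)
            return "release lock and pin unmodified (_hash_relbuf)";
        return "nothing to do; keep lock and pin for the caller";
    }

    int
    main(void)
    {
        for (int d = 0; d < 2; d++)
            for (int p = 0; p < 2; p++)
                printf("dirty=%d primary=%d -> %s\n", d, p,
                       wbuf_action(d != 0, p != 0));
        return 0;
    }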
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 178463f..f51c313 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -38,10 +38,14 @@ static bool _hash_alloc_buckets(Relation rel, BlockNumber firstblock,
uint32 nblocks);
static void _hash_splitbucket(Relation rel, Buffer metabuf,
Bucket obucket, Bucket nbucket,
- BlockNumber start_oblkno,
+ Buffer obuf,
Buffer nbuf,
uint32 maxbucket,
uint32 highmask, uint32 lowmask);
+static void _hash_splitbucket_guts(Relation rel, Buffer metabuf,
+ Bucket obucket, Bucket nbucket, Buffer obuf,
+ Buffer nbuf, HTAB *htab, uint32 maxbucket,
+ uint32 highmask, uint32 lowmask);
/*
@@ -55,46 +59,6 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
/*
- * _hash_getlock() -- Acquire an lmgr lock.
- *
- * 'whichlock' should the block number of a bucket's primary bucket page to
- * acquire the per-bucket lock. (See README for details of the use of these
- * locks.)
- *
- * 'access' must be HASH_SHARE or HASH_EXCLUSIVE.
- */
-void
-_hash_getlock(Relation rel, BlockNumber whichlock, int access)
-{
- if (USELOCKING(rel))
- LockPage(rel, whichlock, access);
-}
-
-/*
- * _hash_try_getlock() -- Acquire an lmgr lock, but only if it's free.
- *
- * Same as above except we return FALSE without blocking if lock isn't free.
- */
-bool
-_hash_try_getlock(Relation rel, BlockNumber whichlock, int access)
-{
- if (USELOCKING(rel))
- return ConditionalLockPage(rel, whichlock, access);
- else
- return true;
-}
-
-/*
- * _hash_droplock() -- Release an lmgr lock.
- */
-void
-_hash_droplock(Relation rel, BlockNumber whichlock, int access)
-{
- if (USELOCKING(rel))
- UnlockPage(rel, whichlock, access);
-}
-
-/*
* _hash_getbuf() -- Get a buffer by block number for read or write.
*
* 'access' must be HASH_READ, HASH_WRITE, or HASH_NOLOCK.
@@ -132,6 +96,35 @@ _hash_getbuf(Relation rel, BlockNumber blkno, int access, int flags)
}
/*
+ * _hash_getbuf_with_condlock_cleanup() -- as above, but get the buffer for write.
+ *
+ * We try to take a conditional cleanup lock; if we get it we return the
+ * buffer, else we return InvalidBuffer.
+ */
+Buffer
+_hash_getbuf_with_condlock_cleanup(Relation rel, BlockNumber blkno, int flags)
+{
+ Buffer buf;
+
+ if (blkno == P_NEW)
+ elog(ERROR, "hash AM does not use P_NEW");
+
+ buf = ReadBuffer(rel, blkno);
+
+ if (!ConditionalLockBufferForCleanup(buf))
+ {
+ ReleaseBuffer(buf);
+ return InvalidBuffer;
+ }
+
+ /* ref count and lock type are correct */
+
+ _hash_checkpage(rel, buf, flags);
+
+ return buf;
+}
+
+/*
* _hash_getinitbuf() -- Get and initialize a buffer by block number.
*
* This must be used only to fetch pages that are known to be before
@@ -266,6 +259,33 @@ _hash_dropbuf(Relation rel, Buffer buf)
}
/*
+ * _hash_dropscanbuf() -- release buffers used in scan.
+ *
+ * This routine unpins the buffers used during scan on which we
+ * hold no lock.
+ */
+void
+_hash_dropscanbuf(Relation rel, HashScanOpaque so)
+{
+ /* release pin we hold on primary bucket */
+ if (BufferIsValid(so->hashso_bucket_buf) &&
+ so->hashso_bucket_buf != so->hashso_curbuf)
+ _hash_dropbuf(rel, so->hashso_bucket_buf);
+ so->hashso_bucket_buf = InvalidBuffer;
+
+ /* release pin we hold on old primary bucket */
+ if (BufferIsValid(so->hashso_old_bucket_buf) &&
+ so->hashso_old_bucket_buf != so->hashso_curbuf)
+ _hash_dropbuf(rel, so->hashso_old_bucket_buf);
+ so->hashso_old_bucket_buf = InvalidBuffer;
+
+ /* release any pin we still hold */
+ if (BufferIsValid(so->hashso_curbuf))
+ _hash_dropbuf(rel, so->hashso_curbuf);
+ so->hashso_curbuf = InvalidBuffer;
+}
+
+/*
* _hash_wrtbuf() -- write a hash page to disk.
*
* This routine releases the lock held on the buffer and our refcount
@@ -489,9 +509,11 @@ _hash_pageinit(Page page, Size size)
/*
* Attempt to expand the hash table by creating one new bucket.
*
- * This will silently do nothing if it cannot get the needed locks.
+ * This will silently do nothing if our own backend has active scans, or if
+ * we can't get a cleanup lock on the old or new bucket.
*
- * The caller should hold no locks on the hash index.
+ * This also completes any pending split and removes leftover tuples from
+ * the old bucket, if any remain from a previous split.
*
* The caller must hold a pin, but no lock, on the metapage buffer.
* The buffer is returned in the same state.
@@ -506,10 +528,15 @@ _hash_expandtable(Relation rel, Buffer metabuf)
BlockNumber start_oblkno;
BlockNumber start_nblkno;
Buffer buf_nblkno;
+ Buffer buf_oblkno;
+ Page opage;
+ HashPageOpaque oopaque;
uint32 maxbucket;
uint32 highmask;
uint32 lowmask;
+restart_expand:
+
/*
* Write-lock the meta page. It used to be necessary to acquire a
* heavyweight lock to begin a split, but that is no longer required.
@@ -548,11 +575,16 @@ _hash_expandtable(Relation rel, Buffer metabuf)
goto fail;
/*
- * Determine which bucket is to be split, and attempt to lock the old
- * bucket. If we can't get the lock, give up.
+ * Determine which bucket is to be split, and attempt to take cleanup lock
+ * on the old bucket. If we can't get the lock, give up.
*
- * The lock protects us against other backends, but not against our own
- * backend. Must check for active scans separately.
+ * The cleanup lock protects us against other backends, but not against
+ * our own backend. Must check for active scans separately.
+ *
+ * The cleanup lock is mainly to protect the split from concurrent
+ * inserts. See src/backend/access/hash/README, Lock Definitions for
+ * further details. Due to this locking restriction, if there is any
+ * pending scan, the split will give up, which is not good, but harmless.
*/
new_bucket = metap->hashm_maxbucket + 1;
@@ -563,11 +595,90 @@ _hash_expandtable(Relation rel, Buffer metabuf)
if (_hash_has_active_scan(rel, old_bucket))
goto fail;
- if (!_hash_try_getlock(rel, start_oblkno, HASH_EXCLUSIVE))
+ buf_oblkno = _hash_getbuf_with_condlock_cleanup(rel, start_oblkno, LH_BUCKET_PAGE);
+ if (!buf_oblkno)
goto fail;
+ opage = BufferGetPage(buf_oblkno);
+ oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
/*
- * Likewise lock the new bucket (should never fail).
+ * We want to finish any pending split from this bucket before starting a
+ * new one: there is no apparent benefit in deferring it, and finishing
+ * splits that involve multiple buckets (in case the new split also fails)
+ * would complicate the code. We don't need to consider the new bucket for
+ * completing the split here, as it is not possible for a re-split of the
+ * new bucket to start while there is still a pending split from the old
+ * bucket.
+ */
+ if (H_OLD_INCOMPLETE_SPLIT(oopaque))
+ {
+ BlockNumber nblkno;
+ Buffer buf_nblkno;
+
+ /*
+ * Copy the bucket mapping info now; the comment below, where we copy
+ * this information before calling _hash_splitbucket, explains why this
+ * is OK.
+ */
+ maxbucket = metap->hashm_maxbucket;
+ highmask = metap->hashm_highmask;
+ lowmask = metap->hashm_lowmask;
+
+ /* Release the metapage lock, before completing the split. */
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+ nblkno = _hash_get_newblk(rel, oopaque);
+
+ /* Fetch the primary bucket page for the new bucket */
+ buf_nblkno = _hash_getbuf_with_condlock_cleanup(rel, nblkno, LH_BUCKET_PAGE);
+ if (!buf_nblkno)
+ {
+ _hash_relbuf(rel, buf_oblkno);
+ goto fail;
+ }
+
+ _hash_finish_split(rel, metabuf, buf_oblkno, buf_nblkno, maxbucket,
+ highmask, lowmask);
+
+ /*
+ * release the buffers and retry the expansion.
+ */
+ _hash_relbuf(rel, buf_oblkno);
+ _hash_relbuf(rel, buf_nblkno);
+
+ goto restart_expand;
+ }
+
+ /*
+ * Clean up the tuples remaining from the previous split. This operation
+ * requires a cleanup lock and we already have one on the old bucket, so
+ * let's do it. We also don't want to allow further splits from the bucket
+ * till the garbage of the previous split is cleaned. This has two
+ * advantages: first, it helps avoid bloat due to garbage, and second,
+ * during cleanup of the bucket we are always sure that the garbage tuples
+ * belong to the most recently split bucket. On the contrary, if we allowed
+ * cleanup of the bucket after the meta page is updated to indicate the new
+ * split but before the actual split, the cleanup operation couldn't decide
+ * whether a tuple has already been moved to the newly created bucket, and
+ * could end up deleting such tuples.
+ */
+ if (H_HAS_GARBAGE(oopaque))
+ {
+ /* Release the metapage lock. */
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+ hashbucketcleanup(rel, buf_oblkno, start_oblkno, NULL,
+ metap->hashm_maxbucket, metap->hashm_highmask,
+ metap->hashm_lowmask, NULL,
+ NULL, true, false, NULL, NULL);
+
+ _hash_relbuf(rel, buf_oblkno);
+
+ goto restart_expand;
+ }
+
+ /*
+ * There shouldn't be any active scan on new bucket.
*
* Note: it is safe to compute the new bucket's blkno here, even though we
* may still need to update the BUCKET_TO_BLKNO mapping. This is because
@@ -579,9 +690,6 @@ _hash_expandtable(Relation rel, Buffer metabuf)
if (_hash_has_active_scan(rel, new_bucket))
elog(ERROR, "scan in progress on supposedly new bucket");
- if (!_hash_try_getlock(rel, start_nblkno, HASH_EXCLUSIVE))
- elog(ERROR, "could not get lock on supposedly new bucket");
-
/*
* If the split point is increasing (hashm_maxbucket's log base 2
* increases), we need to allocate a new batch of bucket pages.
@@ -600,8 +708,7 @@ _hash_expandtable(Relation rel, Buffer metabuf)
if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
{
/* can't split due to BlockNumber overflow */
- _hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
- _hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
+ _hash_relbuf(rel, buf_oblkno);
goto fail;
}
}
@@ -609,9 +716,18 @@ _hash_expandtable(Relation rel, Buffer metabuf)
/*
* Physically allocate the new bucket's primary page. We want to do this
* before changing the metapage's mapping info, in case we can't get the
- * disk space.
+ * disk space. Ideally, we don't need to check for a cleanup lock on the
+ * new bucket, as no other backend can find this bucket until the meta page
+ * is updated. However, it is good to be consistent with the old bucket's
+ * locking.
*/
buf_nblkno = _hash_getnewbuf(rel, start_nblkno, MAIN_FORKNUM);
+ if (!CheckBufferForCleanup(buf_nblkno))
+ {
+ _hash_relbuf(rel, buf_oblkno);
+ _hash_relbuf(rel, buf_nblkno);
+ goto fail;
+ }
+
/*
* Okay to proceed with split. Update the metapage bucket mapping info.
@@ -665,13 +781,9 @@ _hash_expandtable(Relation rel, Buffer metabuf)
/* Relocate records to the new bucket */
_hash_splitbucket(rel, metabuf,
old_bucket, new_bucket,
- start_oblkno, buf_nblkno,
+ buf_oblkno, buf_nblkno,
maxbucket, highmask, lowmask);
- /* Release bucket locks, allowing others to access them */
- _hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
- _hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
-
return;
/* Here if decide not to split or fail to acquire old bucket lock */
@@ -745,6 +857,10 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
* The buffer is returned in the same state. (The metapage is only
* touched if it becomes necessary to add or remove overflow pages.)
*
+ * The split needs to hold a pin on the primary bucket pages of both the
+ * old and new buckets till the end of the operation, to prevent vacuum
+ * from starting while the split is in progress.
+ *
* In addition, the caller must have created the new bucket's base page,
* which is passed in buffer nbuf, pinned and write-locked. That lock and
* pin are released here. (The API is set up this way because we must do
@@ -756,37 +872,87 @@ _hash_splitbucket(Relation rel,
Buffer metabuf,
Bucket obucket,
Bucket nbucket,
- BlockNumber start_oblkno,
+ Buffer obuf,
Buffer nbuf,
uint32 maxbucket,
uint32 highmask,
uint32 lowmask)
{
- Buffer obuf;
Page opage;
Page npage;
HashPageOpaque oopaque;
HashPageOpaque nopaque;
- /*
- * It should be okay to simultaneously write-lock pages from each bucket,
- * since no one else can be trying to acquire buffer lock on pages of
- * either bucket.
- */
- obuf = _hash_getbuf(rel, start_oblkno, HASH_WRITE, LH_BUCKET_PAGE);
opage = BufferGetPage(obuf);
oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+ /*
+ * Mark the old bucket to indicate that a split is in progress and that it
+ * has deletable tuples. At the end of the operation we clear the
+ * split-in-progress flag; vacuum will clear the page-has-garbage flag
+ * after deleting such tuples.
+ */
+ oopaque->hasho_flag |= LH_BUCKET_PAGE_HAS_GARBAGE | LH_BUCKET_OLD_PAGE_SPLIT;
+
npage = BufferGetPage(nbuf);
- /* initialize the new bucket's primary page */
+ /*
+ * initialize the new bucket's primary page and mark it to indicate that
+ * split is in progress.
+ */
nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
nopaque->hasho_prevblkno = InvalidBlockNumber;
nopaque->hasho_nextblkno = InvalidBlockNumber;
nopaque->hasho_bucket = nbucket;
- nopaque->hasho_flag = LH_BUCKET_PAGE;
+ nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_NEW_PAGE_SPLIT;
nopaque->hasho_page_id = HASHO_PAGE_ID;
+ _hash_splitbucket_guts(rel, metabuf, obucket,
+ nbucket, obuf, nbuf, NULL,
+ maxbucket, highmask, lowmask);
+
+ /* all done, now release the locks and pins on primary buckets. */
+ _hash_relbuf(rel, obuf);
+ _hash_relbuf(rel, nbuf);
+}
+
+/*
+ * _hash_splitbucket_guts -- Helper function to perform the split operation
+ *
+ * This routine partitions the tuples between the old and new bucket and is
+ * also used to finish incomplete split operations. To finish a previously
+ * interrupted split, the caller needs to fill htab. If htab is set, we skip
+ * moving tuples that already exist in htab; a NULL htab means that all the
+ * tuples belonging to the new bucket are moved.
+ *
+ * The caller is responsible for locking and unlocking the old and new
+ * primary buckets.
+ */
+static void
+_hash_splitbucket_guts(Relation rel,
+ Buffer metabuf,
+ Bucket obucket,
+ Bucket nbucket,
+ Buffer obuf,
+ Buffer nbuf,
+ HTAB *htab,
+ uint32 maxbucket,
+ uint32 highmask,
+ uint32 lowmask)
+{
+ Buffer bucket_obuf;
+ Buffer bucket_nbuf;
+ Page opage;
+ Page npage;
+ HashPageOpaque oopaque;
+ HashPageOpaque nopaque;
+
+ bucket_obuf = obuf;
+ opage = BufferGetPage(obuf);
+ oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+ bucket_nbuf = nbuf;
+ npage = BufferGetPage(nbuf);
+ nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+
/*
* Partition the tuples in the old bucket between the old bucket and the
* new bucket, advancing along the old bucket's overflow bucket chain and
@@ -798,8 +964,6 @@ _hash_splitbucket(Relation rel,
BlockNumber oblkno;
OffsetNumber ooffnum;
OffsetNumber omaxoffnum;
- OffsetNumber deletable[MaxOffsetNumber];
- int ndeletable = 0;
/* Scan each tuple in old page */
omaxoffnum = PageGetMaxOffsetNumber(opage);
@@ -810,18 +974,45 @@ _hash_splitbucket(Relation rel,
IndexTuple itup;
Size itemsz;
Bucket bucket;
+ bool found = false;
/*
- * Fetch the item's hash key (conveniently stored in the item) and
- * determine which bucket it now belongs in.
+ * Before inserting a tuple, probe the hash table containing TIDs of
+ * tuples belonging to the new bucket; if we find a match, skip that
+ * tuple, else fetch the item's hash key (conveniently stored in the
+ * item) and determine which bucket it now belongs in.
*/
itup = (IndexTuple) PageGetItem(opage,
PageGetItemId(opage, ooffnum));
+
+ if (htab)
+ (void) hash_search(htab, &itup->t_tid, HASH_FIND, &found);
+
+ if (found)
+ continue;
+
bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
maxbucket, highmask, lowmask);
if (bucket == nbucket)
{
+ Size itupsize = 0;
+ IndexTuple new_itup;
+
+ /*
+ * make a copy of index tuple as we have to scribble on it.
+ */
+ new_itup = CopyIndexTuple(itup);
+
+ /*
+ * mark the index tuple as moved by split; such tuples are
+ * skipped by a scan if a split is in progress for the bucket.
+ */
+ itupsize = new_itup->t_info & INDEX_SIZE_MASK;
+ new_itup->t_info &= ~INDEX_SIZE_MASK;
+ new_itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;
+ new_itup->t_info |= itupsize;
+
/*
* insert the tuple into the new bucket. if it doesn't fit on
* the current page in the new bucket, we must allocate a new
@@ -832,17 +1023,25 @@ _hash_splitbucket(Relation rel,
* only partially complete, meaning the index is corrupt,
* since searches may fail to find entries they should find.
*/
- itemsz = IndexTupleDSize(*itup);
+ itemsz = IndexTupleDSize(*new_itup);
itemsz = MAXALIGN(itemsz);
if (PageGetFreeSpace(npage) < itemsz)
{
+ bool retain_pin = false;
+
+ /*
+ * page flags must be accessed before releasing lock on a
+ * page.
+ */
+ retain_pin = nopaque->hasho_flag & LH_BUCKET_PAGE;
+
/* write out nbuf and drop lock, but keep pin */
_hash_chgbufaccess(rel, nbuf, HASH_WRITE, HASH_NOLOCK);
/* chain to a new overflow page */
- nbuf = _hash_addovflpage(rel, metabuf, nbuf);
+ nbuf = _hash_addovflpage(rel, metabuf, nbuf, retain_pin);
npage = BufferGetPage(nbuf);
- /* we don't need nopaque within the loop */
+ nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
}
/*
@@ -852,12 +1051,10 @@ _hash_splitbucket(Relation rel,
* Possible future improvement: accumulate all the items for
* the new page and qsort them before insertion.
*/
- (void) _hash_pgaddtup(rel, nbuf, itemsz, itup);
+ (void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
- /*
- * Mark tuple for deletion from old page.
- */
- deletable[ndeletable++] = ooffnum;
+ /* be tidy */
+ pfree(new_itup);
}
else
{
@@ -870,15 +1067,9 @@ _hash_splitbucket(Relation rel,
oblkno = oopaque->hasho_nextblkno;
- /*
- * Done scanning this old page. If we moved any tuples, delete them
- * from the old page.
- */
- if (ndeletable > 0)
- {
- PageIndexMultiDelete(opage, deletable, ndeletable);
- _hash_wrtbuf(rel, obuf);
- }
+ /* retain the pin on the old primary bucket */
+ if (obuf == bucket_obuf)
+ _hash_chgbufaccess(rel, obuf, HASH_READ, HASH_NOLOCK);
else
_hash_relbuf(rel, obuf);
@@ -887,18 +1078,153 @@ _hash_splitbucket(Relation rel,
break;
/* Else, advance to next old page */
- obuf = _hash_getbuf(rel, oblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
+ obuf = _hash_getbuf(rel, oblkno, HASH_READ, LH_OVERFLOW_PAGE);
opage = BufferGetPage(obuf);
oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
}
/*
* We're at the end of the old bucket chain, so we're done partitioning
- * the tuples. Before quitting, call _hash_squeezebucket to ensure the
- * tuples remaining in the old bucket (including the overflow pages) are
- * packed as tightly as possible. The new bucket is already tight.
+ * the tuples. Mark the old and new buckets to indicate split is
+ * finished.
+ *
+ * To avoid deadlocks due to locking order of buckets, first lock the old
+ * bucket and then the new bucket.
*/
- _hash_wrtbuf(rel, nbuf);
+ if (nopaque->hasho_flag & LH_BUCKET_PAGE)
+ _hash_chgbufaccess(rel, bucket_nbuf, HASH_WRITE, HASH_NOLOCK);
+ else
+ _hash_wrtbuf(rel, nbuf);
+
+ /*
+ * Acquiring cleanup lock to clear the split-in-progress flag ensures that
+ * there is no pending scan that has seen the flag after it is cleared.
+ */
+ _hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
+ opage = BufferGetPage(bucket_obuf);
+ oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+ _hash_chgbufaccess(rel, bucket_nbuf, HASH_NOLOCK, HASH_WRITE);
+ npage = BufferGetPage(bucket_nbuf);
+ nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+
+ /* indicate that split is finished */
+ oopaque->hasho_flag &= ~LH_BUCKET_OLD_PAGE_SPLIT;
+ nopaque->hasho_flag &= ~LH_BUCKET_NEW_PAGE_SPLIT;
+
+ /*
+ * now mark the buffers dirty; we don't release the locks here, as the
+ * caller is responsible for releasing them.
+ */
+ MarkBufferDirty(bucket_obuf);
+ MarkBufferDirty(bucket_nbuf);
+}
+
+/*
+ * _hash_finish_split() -- Finish the previously interrupted split operation
+ *
+ * To complete the split operation, we build a hash table of the TIDs already
+ * present in the new bucket, which the split operation then uses to skip
+ * tuples that were already moved before it was interrupted.
+ *
+ * The caller must hold a pin, but no lock, on the metapage buffer.
+ * The buffer is returned in the same state. (The metapage is only
+ * touched if it becomes necessary to add or remove overflow pages.)
+ *
+ * 'obuf' and 'nbuf' must be locked by the caller which is also responsible
+ * for unlocking them.
+ */
+void
+_hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Buffer nbuf,
+ uint32 maxbucket, uint32 highmask, uint32 lowmask)
+{
+ HASHCTL hash_ctl;
+ HTAB *tidhtab;
+ Buffer bucket_nbuf;
+ Page opage;
+ Page npage;
+ HashPageOpaque opageopaque;
+ HashPageOpaque npageopaque;
+ Bucket obucket;
+ Bucket nbucket;
+ bool found;
+
+ /* Initialize hash tables used to track TIDs */
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(ItemPointerData);
+ hash_ctl.entrysize = sizeof(ItemPointerData);
+ hash_ctl.hcxt = CurrentMemoryContext;
+
+ tidhtab =
+ hash_create("bucket ctids",
+ 256, /* arbitrary initial size */
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+ /*
+ * Scan the new bucket and build hash table of TIDs
+ */
+ bucket_nbuf = nbuf;
+ npage = BufferGetPage(nbuf);
+ npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+ for (;;)
+ {
+ BlockNumber nblkno;
+ OffsetNumber noffnum;
+ OffsetNumber nmaxoffnum;
+
+ /* Scan each tuple in new page */
+ nmaxoffnum = PageGetMaxOffsetNumber(npage);
+ for (noffnum = FirstOffsetNumber;
+ noffnum <= nmaxoffnum;
+ noffnum = OffsetNumberNext(noffnum))
+ {
+ IndexTuple itup;
+
+ /* Fetch the item's TID and insert it in hash table. */
+ itup = (IndexTuple) PageGetItem(npage,
+ PageGetItemId(npage, noffnum));
+
+ (void) hash_search(tidhtab, &itup->t_tid, HASH_ENTER, &found);
+
+ Assert(!found);
+ }
+
+ nblkno = npageopaque->hasho_nextblkno;
+
+ /*
+ * release our write lock without modifying the buffer, making sure to
+ * retain the pin on the primary bucket.
+ */
+ if (nbuf == bucket_nbuf)
+ _hash_chgbufaccess(rel, nbuf, HASH_READ, HASH_NOLOCK);
+ else
+ _hash_relbuf(rel, nbuf);
+
+ /* Exit loop if no more overflow pages in new bucket */
+ if (!BlockNumberIsValid(nblkno))
+ break;
+
+ /* Else, advance to next page */
+ nbuf = _hash_getbuf(rel, nblkno, HASH_READ, LH_OVERFLOW_PAGE);
+ npage = BufferGetPage(nbuf);
+ npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+ }
+
+ /* Need a cleanup lock to perform split operation. */
+ LockBufferForCleanup(bucket_nbuf);
+
+ npage = BufferGetPage(bucket_nbuf);
+ npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+ nbucket = npageopaque->hasho_bucket;
+
+ opage = BufferGetPage(obuf);
+ opageopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+ obucket = opageopaque->hasho_bucket;
+
+ _hash_splitbucket_guts(rel, metabuf, obucket,
+ nbucket, obuf, bucket_nbuf, tidhtab,
+ maxbucket, highmask, lowmask);
- _hash_squeezebucket(rel, obucket, start_oblkno, NULL);
+ hash_destroy(tidhtab);
}
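To summarize the split lifecycle in terms of the three page flags this patch
adds (see the hash.h hunk further down): both split-in-progress flags go up
when the split starts, come down together when _hash_splitbucket_guts
finishes, and the garbage flag stays set until cleanup has deleted the moved
tuples. Below is a standalone sketch using the same flag values, with plain
integers standing in for the page special-space flag words; it is only an
illustration of the transitions, not patch code.

    #include <stdio.h>

    #define LH_BUCKET_NEW_PAGE_SPLIT    (1 << 4)
    #define LH_BUCKET_OLD_PAGE_SPLIT    (1 << 5)
    #define LH_BUCKET_PAGE_HAS_GARBAGE  (1 << 6)

    int
    main(void)
    {
        unsigned    old_flags = 0,
                    new_flags = 0;

        /* split begins: old page is flagged as split source with garbage to
         * come, new page is flagged as split target */
        old_flags |= LH_BUCKET_PAGE_HAS_GARBAGE | LH_BUCKET_OLD_PAGE_SPLIT;
        new_flags |= LH_BUCKET_NEW_PAGE_SPLIT;

        /* split finishes: both split-in-progress flags are cleared together */
        old_flags &= ~LH_BUCKET_OLD_PAGE_SPLIT;
        new_flags &= ~LH_BUCKET_NEW_PAGE_SPLIT;

        /* later, cleanup deletes the moved tuples and clears the garbage flag */
        old_flags &= ~LH_BUCKET_PAGE_HAS_GARBAGE;

        printf("old=%#x new=%#x\n", old_flags, new_flags);
        return 0;
    }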
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 4825558..e3a99cf 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -72,7 +72,19 @@ _hash_readnext(Relation rel,
BlockNumber blkno;
blkno = (*opaquep)->hasho_nextblkno;
- _hash_relbuf(rel, *bufp);
+
+ /*
+ * Retain the pin on the primary bucket page till the end of the scan to
+ * ensure that vacuum can't delete the tuples that were moved by a split
+ * to the new bucket. Such tuples are required by scans that started on
+ * the split bucket before the new bucket's split-in-progress flag
+ * (LH_BUCKET_NEW_PAGE_SPLIT) was cleared.
+ */
+ if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+ _hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+ else
+ _hash_relbuf(rel, *bufp);
+
*bufp = InvalidBuffer;
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
@@ -94,7 +106,16 @@ _hash_readprev(Relation rel,
BlockNumber blkno;
blkno = (*opaquep)->hasho_prevblkno;
- _hash_relbuf(rel, *bufp);
+
+ /*
+ * Retain the pin on the primary bucket page till the end of the scan.
+ * See the comments in _hash_readnext for why we retain the pin.
+ */
+ if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+ _hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+ else
+ _hash_relbuf(rel, *bufp);
+
*bufp = InvalidBuffer;
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
@@ -104,6 +125,13 @@ _hash_readprev(Relation rel,
LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
*pagep = BufferGetPage(*bufp);
*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
+
+ /*
+ * We always maintain the pin on the bucket page for the whole scan
+ * operation, so release the additional pin we have acquired here.
+ */
+ if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+ _hash_dropbuf(rel, *bufp);
}
}
@@ -192,43 +220,81 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
metap = HashPageGetMeta(page);
/*
- * Loop until we get a lock on the correct target bucket.
+ * Conditionally get the lock on the primary bucket page for the search
+ * while holding the lock on the meta page. If we would have to wait,
+ * release the meta page lock and retry the hard way.
*/
- for (;;)
- {
- /*
- * Compute the target bucket number, and convert to block number.
- */
- bucket = _hash_hashkey2bucket(hashkey,
- metap->hashm_maxbucket,
- metap->hashm_highmask,
- metap->hashm_lowmask);
+ bucket = _hash_hashkey2bucket(hashkey,
+ metap->hashm_maxbucket,
+ metap->hashm_highmask,
+ metap->hashm_lowmask);
- blkno = BUCKET_TO_BLKNO(metap, bucket);
+ blkno = BUCKET_TO_BLKNO(metap, bucket);
- /* Release metapage lock, but keep pin. */
+ /* Fetch the primary bucket page for the bucket */
+ buf = ReadBuffer(rel, blkno);
+ if (!ConditionalLockBufferShared(buf))
+ {
_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+ LockBuffer(buf, HASH_READ);
+ _hash_checkpage(rel, buf, LH_BUCKET_PAGE);
+ _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+ oldblkno = blkno;
+ retry = true;
+ }
+ else
+ {
+ _hash_checkpage(rel, buf, LH_BUCKET_PAGE);
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+ }
+ if (retry)
+ {
/*
- * If the previous iteration of this loop locked what is still the
- * correct target bucket, we are done. Otherwise, drop any old lock
- * and lock what now appears to be the correct bucket.
+ * Loop until we get a lock on the correct target bucket. We take the
+ * lock on the primary bucket page and retain the pin on it for the
+ * whole read operation to prevent concurrent splits. Retaining a pin
+ * on the primary bucket page ensures that a split can't happen, as the
+ * split needs a cleanup lock on that page. Acquiring the lock on the
+ * primary bucket and rechecking that it is still the target bucket is
+ * mandatory, as otherwise a concurrent split followed by vacuum could
+ * remove tuples from the selected bucket that would otherwise have
+ * been visible.
*/
- if (retry)
+ for (;;)
{
+ /*
+ * Compute the target bucket number, and convert to block number.
+ */
+ bucket = _hash_hashkey2bucket(hashkey,
+ metap->hashm_maxbucket,
+ metap->hashm_highmask,
+ metap->hashm_lowmask);
+
+ blkno = BUCKET_TO_BLKNO(metap, bucket);
+
+ /* Release metapage lock, but keep pin. */
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+ /*
+ * If the previous iteration of this loop locked what is still the
+ * correct target bucket, we are done. Otherwise, drop any old
+ * lock and lock what now appears to be the correct bucket.
+ */
if (oldblkno == blkno)
break;
- _hash_droplock(rel, oldblkno, HASH_SHARE);
- }
- _hash_getlock(rel, blkno, HASH_SHARE);
+ _hash_relbuf(rel, buf);
- /*
- * Reacquire metapage lock and check that no bucket split has taken
- * place while we were awaiting the bucket lock.
- */
- _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
- oldblkno = blkno;
- retry = true;
+ /* Fetch the primary bucket page for the bucket */
+ buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
+
+ /*
+ * Reacquire metapage lock and check that no bucket split has
+ * taken place while we were awaiting the bucket lock.
+ */
+ _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+ oldblkno = blkno;
+ }
}
/* done with the metapage */
@@ -237,14 +303,60 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
/* Update scan opaque state to show we have lock on the bucket */
so->hashso_bucket = bucket;
so->hashso_bucket_valid = true;
- so->hashso_bucket_blkno = blkno;
- /* Fetch the primary bucket page for the bucket */
- buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
+
page = BufferGetPage(buf);
opaque = (HashPageOpaque) PageGetSpecialPointer(page);
Assert(opaque->hasho_bucket == bucket);
+ so->hashso_bucket_buf = buf;
+
+ /*
+ * If a bucket split is in progress, then we need to skip tuples that
+ * were moved from the old bucket. To ensure that vacuum doesn't clean
+ * any tuples from the old or new bucket while this scan is in progress,
+ * maintain a pin on both buckets. Here we have to be cautious about lock
+ * ordering: first acquire the lock on the old bucket, release that lock
+ * (but not the pin), then acquire the lock on the new bucket and
+ * re-verify whether the bucket split is still in progress. Acquiring the
+ * lock on the old bucket first ensures that vacuum waits for this scan
+ * to finish.
+ */
+ if (opaque->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+ {
+ BlockNumber old_blkno;
+ Buffer old_buf;
+
+ old_blkno = _hash_get_oldblk(rel, opaque);
+
+ /*
+ * release the lock on new bucket and re-acquire it after acquiring
+ * the lock on old bucket.
+ */
+ _hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+
+ old_buf = _hash_getbuf(rel, old_blkno, HASH_READ, LH_BUCKET_PAGE);
+
+ /*
+ * remember the old bucket buffer so as to use it later for scanning.
+ */
+ so->hashso_old_bucket_buf = old_buf;
+ _hash_chgbufaccess(rel, old_buf, HASH_READ, HASH_NOLOCK);
+
+ _hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+ page = BufferGetPage(buf);
+ opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ Assert(opaque->hasho_bucket == bucket);
+
+ if (opaque->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+ so->hashso_skip_moved_tuples = true;
+ else
+ {
+ _hash_dropbuf(rel, so->hashso_old_bucket_buf);
+ so->hashso_old_bucket_buf = InvalidBuffer;
+ }
+ }
+
/* If a backwards scan is requested, move to the end of the chain */
if (ScanDirectionIsBackward(dir))
{
@@ -273,6 +385,13 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
* false. Else, return true and set the hashso_curpos for the
* scan to the right thing.
*
+ * Here we also scan the old bucket if a split of the current bucket
+ * was in progress at the start of the scan. The basic idea is to
+ * skip the tuples that were moved by the split while scanning the
+ * current bucket, and then scan the old bucket to cover all such
+ * tuples. This ensures that we don't miss any tuples in scans that
+ * started during the split.
+ *
* 'bufP' points to the current buffer, which is pinned and read-locked.
* On success exit, we have pin and read-lock on whichever page
* contains the right item; on failure, we have released all buffers.
@@ -338,6 +457,19 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
{
Assert(offnum >= FirstOffsetNumber);
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+ /*
+ * skip the tuples that were moved by a split operation,
+ * for a scan that started while the split was in
+ * progress
+ */
+ if (so->hashso_skip_moved_tuples &&
+ (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
+ {
+ offnum = OffsetNumberNext(offnum); /* move forward */
+ continue;
+ }
+
if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
break; /* yes, so exit for-loop */
}
@@ -353,9 +485,41 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
}
else
{
- /* end of bucket */
- itup = NULL;
- break; /* exit for-loop */
+ /*
+ * end of bucket; scan the old bucket if a split was in
+ * progress at the start of the scan.
+ */
+ if (so->hashso_skip_moved_tuples)
+ {
+ buf = so->hashso_old_bucket_buf;
+
+ /*
+ * the old bucket buffer must be valid, as we acquired
+ * a pin on it before the start of the scan and
+ * retain it till the end of the scan.
+ */
+ Assert(BufferIsValid(buf));
+
+ _hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+
+ page = BufferGetPage(buf);
+ maxoff = PageGetMaxOffsetNumber(page);
+ offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+ /*
+ * setting hashso_skip_moved_tuples to false
+ * ensures that we don't check for moved-by-split
+ * tuples in the old bucket, and also that we
+ * won't retry the old bucket scan once it is
+ * finished.
+ */
+ so->hashso_skip_moved_tuples = false;
+ }
+ else
+ {
+ itup = NULL;
+ break; /* exit for-loop */
+ }
}
}
break;
@@ -379,6 +543,19 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
{
Assert(offnum <= maxoff);
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+ /*
+ * skip the tuples that were moved by a split operation,
+ * for a scan that started while the split was in
+ * progress
+ */
+ if (so->hashso_skip_moved_tuples &&
+ (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
+ {
+ offnum = OffsetNumberPrev(offnum); /* move back */
+ continue;
+ }
+
if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
break; /* yes, so exit for-loop */
}
@@ -394,9 +571,41 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
}
else
{
- /* end of bucket */
- itup = NULL;
- break; /* exit for-loop */
+ /*
+ * end of bucket; scan the old bucket if a split was in
+ * progress at the start of the scan.
+ */
+ if (so->hashso_skip_moved_tuples)
+ {
+ buf = so->hashso_old_bucket_buf;
+
+ /*
+ * the old bucket buffer must be valid, as we acquired
+ * a pin on it before the start of the scan and
+ * retain it till the end of the scan.
+ */
+ Assert(BufferIsValid(buf));
+
+ _hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+
+ page = BufferGetPage(buf);
+ maxoff = PageGetMaxOffsetNumber(page);
+ offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+ /*
+ * setting hashso_skip_moved_tuples to false
+ * ensures that we don't check for moved-by-split
+ * tuples in the old bucket, and also that we
+ * won't retry the old bucket scan once it is
+ * finished.
+ */
+ so->hashso_skip_moved_tuples = false;
+ }
+ else
+ {
+ itup = NULL;
+ break; /* exit for-loop */
+ }
}
}
break;
@@ -410,9 +619,16 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
if (itup == NULL)
{
- /* we ran off the end of the bucket without finding a match */
+ /*
+ * We ran off the end of the bucket without finding a match.
+ * Release the pins on the bucket buffers. Normally such pins
+ * are released at the end of the scan; however, a scrolling
+ * cursor can reacquire the bucket lock and pin multiple
+ * times within the same scan.
+ */
*bufP = so->hashso_curbuf = InvalidBuffer;
ItemPointerSetInvalid(current);
+ _hash_dropscanbuf(rel, so);
return false;
}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 822862d..b5164d7 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -147,6 +147,23 @@ _hash_log2(uint32 num)
}
/*
+ * _hash_msb-- returns most significant bit position.
+ */
+static uint32
+_hash_msb(uint32 num)
+{
+ uint32 i = 0;
+
+ while (num)
+ {
+ num = num >> 1;
+ ++i;
+ }
+
+ return i - 1;
+}
+
+/*
* _hash_checkpage -- sanity checks on the format of all hash pages
*
* If flags is not zero, it is a bitwise OR of the acceptable values of
@@ -352,3 +369,123 @@ _hash_binsearch_last(Page page, uint32 hash_value)
return lower;
}
+
+/*
+ * _hash_get_oldblk() -- get the block number of the bucket from which the
+ * current bucket is being split.
+ */
+BlockNumber
+_hash_get_oldblk(Relation rel, HashPageOpaque opaque)
+{
+ Bucket curr_bucket;
+ Bucket old_bucket;
+ uint32 mask;
+ Buffer metabuf;
+ HashMetaPage metap;
+ BlockNumber blkno;
+
+ /*
+ * To get the old bucket from the current bucket, we need a mask to modulo
+ * into the lower half of the table. This mask is stored in the meta page
+ * as hashm_lowmask, but we can't rely on it here, because we need the
+ * value of lowmask that was in effect when the bucket split started.
+ * Masking off the most significant bit of the new bucket gives us the old
+ * bucket.
+ */
+ curr_bucket = opaque->hasho_bucket;
+ mask = (((uint32) 1) << _hash_msb(curr_bucket)) - 1;
+ old_bucket = curr_bucket & mask;
+
+ metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+ metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+ blkno = BUCKET_TO_BLKNO(metap, old_bucket);
+
+ _hash_relbuf(rel, metabuf);
+
+ return blkno;
+}
+
+/*
+ * _hash_get_newblk() -- get the block number of the new bucket that will be
+ * generated by a split of the current bucket.
+ *
+ * This is used to find the new bucket from the old bucket based on the
+ * current table half. It is mainly required for finishing incomplete splits,
+ * where we are sure that no more than one split can be in progress from the
+ * old bucket.
+ */
+BlockNumber
+_hash_get_newblk(Relation rel, HashPageOpaque opaque)
+{
+ Bucket curr_bucket;
+ Bucket new_bucket;
+ uint32 lowmask;
+ uint32 mask;
+ Buffer metabuf;
+ HashMetaPage metap;
+ BlockNumber blkno;
+
+ metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+ metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+ curr_bucket = opaque->hasho_bucket;
+
+ /*
+ * the new bucket can be obtained by OR'ing the old bucket with the most
+ * significant bit of the current table half. Multiple buckets could have
+ * split from the current bucket over time; we need the first such bucket
+ * that exists based on the current table half.
+ */
+ lowmask = metap->hashm_lowmask;
+
+ for (;;)
+ {
+ mask = lowmask + 1;
+ new_bucket = curr_bucket | mask;
+ if (new_bucket > metap->hashm_maxbucket)
+ {
+ lowmask = lowmask >> 1;
+ continue;
+ }
+ blkno = BUCKET_TO_BLKNO(metap, new_bucket);
+ break;
+ }
+
+ _hash_relbuf(rel, metabuf);
+
+ return blkno;
+}
+
+/*
+ * _hash_get_newbucket() -- get the new bucket that will be generated by a
+ * split of the current bucket.
+ *
+ * This is used to find the new bucket from the old bucket. The new bucket
+ * can be obtained by OR'ing the old bucket with the most significant bit of
+ * the table half corresponding to the lowmask passed to this function.
+ * Multiple buckets could have split from the current bucket; we need the
+ * first such bucket that exists. The caller must ensure that no more than
+ * one split has happened from the old bucket.
+ */
+Bucket
+_hash_get_newbucket(Relation rel, Bucket curr_bucket,
+ uint32 lowmask, uint32 maxbucket)
+{
+ Bucket new_bucket;
+ uint32 mask;
+
+ for (;;)
+ {
+ mask = lowmask + 1;
+ new_bucket = curr_bucket | mask;
+ if (new_bucket > maxbucket)
+ {
+ lowmask = lowmask >> 1;
+ continue;
+ }
+ break;
+ }
+
+ return new_bucket;
+}
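A quick standalone worked example of the bucket arithmetic above: masking off
the most significant bit of new bucket 5 (binary 101) recovers old bucket 1,
and OR'ing bucket 1 with lowmask + 1 = 4 gets back to bucket 5, provided that
bucket already exists. Helper and variable names are local to this sketch.

    #include <stdio.h>

    /* mirrors _hash_msb above */
    static unsigned
    msb(unsigned n)
    {
        unsigned    i = 0;

        while (n)
        {
            n >>= 1;
            i++;
        }
        return i - 1;
    }

    int
    main(void)
    {
        unsigned    new_bucket = 5; /* binary 101 */
        unsigned    old_bucket = new_bucket & ((1u << msb(new_bucket)) - 1);

        /* masking off the most significant bit of 5 (101) leaves 1 (001) */
        printf("bucket %u was split from bucket %u\n", new_bucket, old_bucket);

        /* forward direction: with lowmask = 3 (table half of 4), bucket 1 maps
         * to 1 | (3 + 1) = 5, accepted because 5 <= maxbucket */
        unsigned    lowmask = 3,
                    maxbucket = 5;
        unsigned    candidate = 1 | (lowmask + 1);

        if (candidate <= maxbucket)
            printf("bucket 1 currently splits into bucket %u\n", candidate);
        return 0;
    }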
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 76ade37..1c9be40 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3567,6 +3567,26 @@ ConditionalLockBuffer(Buffer buffer)
}
/*
+ * Acquire the content_lock for the buffer, but only if we don't have to wait.
+ *
+ * This assumes the caller wants BUFFER_LOCK_SHARED mode.
+ */
+bool
+ConditionalLockBufferShared(Buffer buffer)
+{
+ BufferDesc *buf;
+
+ Assert(BufferIsValid(buffer));
+ if (BufferIsLocal(buffer))
+ return true; /* act as though we got it */
+
+ buf = GetBufferDescriptor(buffer - 1);
+
+ return LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
+ LW_SHARED);
+}
+
+/*
* LockBufferForCleanup - lock a buffer in preparation for deleting items
*
* Items may be deleted from a disk page only when the caller (a) holds an
@@ -3750,6 +3770,49 @@ ConditionalLockBufferForCleanup(Buffer buffer)
return false;
}
+/*
+ * CheckBufferForCleanup - as above, but don't attempt to take lock
+ *
+ * We won't loop, but just check once to see if the pin count is OK. If
+ * not, return FALSE.
+ */
+bool
+CheckBufferForCleanup(Buffer buffer)
+{
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+
+ Assert(BufferIsValid(buffer));
+
+ if (BufferIsLocal(buffer))
+ {
+ /* There should be exactly one pin */
+ if (LocalRefCount[-buffer - 1] != 1)
+ return false;
+ /* Nobody else to wait for */
+ return true;
+ }
+
+ /* There should be exactly one local pin */
+ if (GetPrivateRefCount(buffer) != 1)
+ return false;
+
+ bufHdr = GetBufferDescriptor(buffer - 1);
+
+ buf_state = LockBufHdr(bufHdr);
+
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+ if (BUF_STATE_GET_REFCOUNT(buf_state) == 1)
+ {
+ /* pincount is OK. */
+ UnlockBufHdr(bufHdr, buf_state);
+ return true;
+ }
+
+ UnlockBufHdr(bufHdr, buf_state);
+ return false;
+}
+
/*
* Functions for buffer I/O handling
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index d9df904..bbf822b 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -24,6 +24,7 @@
#include "lib/stringinfo.h"
#include "storage/bufmgr.h"
#include "storage/lockdefs.h"
+#include "utils/hsearch.h"
#include "utils/relcache.h"
/*
@@ -32,6 +33,8 @@
*/
typedef uint32 Bucket;
+#define InvalidBucket ((Bucket) 0xFFFFFFFF)
+
#define BUCKET_TO_BLKNO(metap,B) \
((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_log2((B)+1)-1] : 0)) + 1)
@@ -51,6 +54,9 @@ typedef uint32 Bucket;
#define LH_BUCKET_PAGE (1 << 1)
#define LH_BITMAP_PAGE (1 << 2)
#define LH_META_PAGE (1 << 3)
+#define LH_BUCKET_NEW_PAGE_SPLIT (1 << 4)
+#define LH_BUCKET_OLD_PAGE_SPLIT (1 << 5)
+#define LH_BUCKET_PAGE_HAS_GARBAGE (1 << 6)
typedef struct HashPageOpaqueData
{
@@ -63,6 +69,12 @@ typedef struct HashPageOpaqueData
typedef HashPageOpaqueData *HashPageOpaque;
+#define H_HAS_GARBAGE(opaque) ((opaque)->hasho_flag & LH_BUCKET_PAGE_HAS_GARBAGE)
+#define H_OLD_INCOMPLETE_SPLIT(opaque) ((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT)
+#define H_NEW_INCOMPLETE_SPLIT(opaque) ((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+#define H_INCOMPLETE_SPLIT(opaque) (((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT) || \
+ ((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT))
+
/*
* The page ID is for the convenience of pg_filedump and similar utilities,
* which otherwise would have a hard time telling pages of different index
@@ -87,12 +99,6 @@ typedef struct HashScanOpaqueData
bool hashso_bucket_valid;
/*
- * If we have a share lock on the bucket, we record it here. When
- * hashso_bucket_blkno is zero, we have no such lock.
- */
- BlockNumber hashso_bucket_blkno;
-
- /*
* We also want to remember which buffer we're currently examining in the
* scan. We keep the buffer pinned (but not locked) across hashgettuple
* calls, in order to avoid doing a ReadBuffer() for every tuple in the
@@ -100,11 +106,23 @@ typedef struct HashScanOpaqueData
*/
Buffer hashso_curbuf;
+ /* remember the buffer associated with primary bucket */
+ Buffer hashso_bucket_buf;
+
+ /*
+ * remember the buffer associated with old primary bucket which is
+ * required during the scan of the bucket for which split is in progress.
+ */
+ Buffer hashso_old_bucket_buf;
+
/* Current position of the scan, as an index TID */
ItemPointerData hashso_curpos;
/* Current position of the scan, as a heap TID */
ItemPointerData hashso_heappos;
+
+ /* Whether scan needs to skip tuples that are moved by split */
+ bool hashso_skip_moved_tuples;
} HashScanOpaqueData;
typedef HashScanOpaqueData *HashScanOpaque;
@@ -175,6 +193,8 @@ typedef HashMetaPageData *HashMetaPage;
sizeof(ItemIdData) - \
MAXALIGN(sizeof(HashPageOpaqueData)))
+#define INDEX_MOVED_BY_SPLIT_MASK 0x2000
+
#define HASH_MIN_FILLFACTOR 10
#define HASH_DEFAULT_FILLFACTOR 75
@@ -223,9 +243,6 @@ typedef HashMetaPageData *HashMetaPage;
#define HASH_WRITE BUFFER_LOCK_EXCLUSIVE
#define HASH_NOLOCK (-1)
-#define HASH_SHARE ShareLock
-#define HASH_EXCLUSIVE ExclusiveLock
-
/*
* Strategy number. There's only one valid strategy for hashing: equality.
*/
@@ -298,21 +315,21 @@ extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
Size itemsize, IndexTuple itup);
/* hashovfl.c */
-extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf);
+extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin);
extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf,
- BufferAccessStrategy bstrategy);
+ BlockNumber bucket_blkno, BufferAccessStrategy bstrategy);
extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
BlockNumber blkno, ForkNumber forkNum);
extern void _hash_squeezebucket(Relation rel,
Bucket bucket, BlockNumber bucket_blkno,
+ Buffer bucket_buf,
BufferAccessStrategy bstrategy);
/* hashpage.c */
-extern void _hash_getlock(Relation rel, BlockNumber whichlock, int access);
-extern bool _hash_try_getlock(Relation rel, BlockNumber whichlock, int access);
-extern void _hash_droplock(Relation rel, BlockNumber whichlock, int access);
extern Buffer _hash_getbuf(Relation rel, BlockNumber blkno,
int access, int flags);
+extern Buffer _hash_getbuf_with_condlock_cleanup(Relation rel,
+ BlockNumber blkno, int flags);
extern Buffer _hash_getinitbuf(Relation rel, BlockNumber blkno);
extern Buffer _hash_getnewbuf(Relation rel, BlockNumber blkno,
ForkNumber forkNum);
@@ -321,6 +338,7 @@ extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
BufferAccessStrategy bstrategy);
extern void _hash_relbuf(Relation rel, Buffer buf);
extern void _hash_dropbuf(Relation rel, Buffer buf);
+extern void _hash_dropscanbuf(Relation rel, HashScanOpaque so);
extern void _hash_wrtbuf(Relation rel, Buffer buf);
extern void _hash_chgbufaccess(Relation rel, Buffer buf, int from_access,
int to_access);
@@ -328,6 +346,9 @@ extern uint32 _hash_metapinit(Relation rel, double num_tuples,
ForkNumber forkNum);
extern void _hash_pageinit(Page page, Size size);
extern void _hash_expandtable(Relation rel, Buffer metabuf);
+extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
+ Buffer nbuf, uint32 maxbucket, uint32 highmask,
+ uint32 lowmask);
/* hashscan.c */
extern void _hash_regscan(IndexScanDesc scan);
@@ -363,5 +384,17 @@ extern bool _hash_convert_tuple(Relation index,
Datum *index_values, bool *index_isnull);
extern OffsetNumber _hash_binsearch(Page page, uint32 hash_value);
extern OffsetNumber _hash_binsearch_last(Page page, uint32 hash_value);
+extern BlockNumber _hash_get_oldblk(Relation rel, HashPageOpaque opaque);
+extern BlockNumber _hash_get_newblk(Relation rel, HashPageOpaque opaque);
+extern Bucket _hash_get_newbucket(Relation rel, Bucket curr_bucket,
+ uint32 lowmask, uint32 maxbucket);
+
+/* hash.c */
+extern void hashbucketcleanup(Relation rel, Buffer bucket_buf,
+ BlockNumber bucket_blkno, BufferAccessStrategy bstrategy,
+ uint32 maxbucket, uint32 highmask, uint32 lowmask,
+ double *tuples_removed, double *num_index_tuples,
+ bool bucket_has_garbage, bool delay,
+ IndexBulkDeleteCallback callback, void *callback_state);
#endif /* HASH_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 7b6ba96..accbb88 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -225,8 +225,10 @@ extern void MarkBufferDirtyHint(Buffer buffer, bool buffer_std);
extern void UnlockBuffers(void);
extern void LockBuffer(Buffer buffer, int mode);
extern bool ConditionalLockBuffer(Buffer buffer);
+extern bool ConditionalLockBufferShared(Buffer buffer);
extern void LockBufferForCleanup(Buffer buffer);
extern bool ConditionalLockBufferForCleanup(Buffer buffer);
+extern bool CheckBufferForCleanup(Buffer buffer);
extern bool HoldingBufferPinThatDelaysRecovery(void);
extern void AbortBufferIO(void);
On Thu, Sep 1, 2016 at 8:55 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
I have fixed all other issues you have raised. Updated patch is
attached with this mail.
I am finding the comments (particularly README) quite hard to follow.
There are many references to an "overflow bucket", or similar phrases. I
think these should be "overflow pages". A bucket is a conceptual thing
consisting of a primary page for that bucket and zero or more overflow
pages for the same bucket. There are no overflow buckets, unless you are
referring to the new bucket to which things are being moved.
Was maintaining on-disk compatibility a major concern for this patch?
Would you do things differently if that were not a concern? If we would
benefit from a break in format, I think it would be better to do that now
while hash indexes are still discouraged, rather than in a future release.
In particular, I am thinking about the need for every insert to
exclusive-content-lock the meta page to increment the index-wide tuple
count. I think that this is going to be a huge bottleneck on update
intensive workloads (which I don't believe have been performance tested as
of yet). I was wondering if we might not want to change that so that each
bucket keeps a local count, and sweeps that up to the meta page only when
it exceeds a threshold. But this would require the bucket page to have an
area to hold such a count. Another idea would be to keep not a count of
tuples, but of buckets with at least one overflow page, and split when
there are too many of those. I bring it up now because it would be a shame
to ignore it until 10.0 is out the door, and then need to break things in
11.0.
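(To make the bucket-local count idea concrete, here is a minimal sketch; the struct, field, and threshold names below are invented for illustration and are not part of any posted patch, and the real bucket page's special space would need room for such a counter.)

#include "postgres.h"

/* Invented threshold: how many local inserts before folding into the metapage */
#define LOCAL_NTUPLES_SWEEP_THRESHOLD 128

/* Hypothetical per-bucket counter kept in the primary bucket page */
typedef struct HypotheticalBucketCounter
{
	uint16		local_ntuples;	/* inserts since the last sweep to the metapage */
} HypotheticalBucketCounter;

/*
 * Called with the bucket page exclusively locked.  Returns true when the
 * caller should exclusive-lock the metapage and fold the accumulated count
 * into hashm_ntuples; most inserts would never touch the metapage at all.
 */
static bool
hypothetical_count_insert(HypotheticalBucketCounter *c)
{
	if (++c->local_ntuples >= LOCAL_NTUPLES_SWEEP_THRESHOLD)
	{
		c->local_ntuples = 0;
		return true;
	}
	return false;
}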
Cheers,
Jeff
On Wed, Sep 7, 2016 at 11:49 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Thu, Sep 1, 2016 at 8:55 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
I have fixed all other issues you have raised. Updated patch is
attached with this mail.

I am finding the comments (particularly README) quite hard to follow. There
are many references to an "overflow bucket", or similar phrases. I think
these should be "overflow pages". A bucket is a conceptual thing consisting
of a primary page for that bucket and zero or more overflow pages for the
same bucket. There are no overflow buckets, unless you are referring to the
new bucket to which things are being moved.
Hmm. I think page or block is a concept of database systems, and
bucket is a general concept used in hashing technology. I think the
difference is that there are primary buckets and overflow buckets. I
have checked how they are referred to in one of the wiki pages [1]https://en.wikipedia.org/wiki/Linear_hashing,
search for overflow on that wiki page. Now, I think we shouldn't be
inconsistent in using them. I will change it to be consistent if I find
any inconsistency, based on what you or other people think is the
better way to refer to the overflow space.
Was maintaining on-disk compatibility a major concern for this patch? Would
you do things differently if that were not a concern?
I would not have done much differently from what it is now; however,
one thing I considered during development was to change the hash
index tuple structure as below to mark index tuples as
moved-by-split:
typedef struct
{
IndexTuple entry; /* tuple to insert */
bool moved_by_split;
} HashEntryData;
The other alternative was to use the (unused) bit in IndexTupleData->t_info.
I have chosen the latter approach. Now, one could definitely argue that
it is the last available bit in IndexTuple and using it for hash
indexes might or might not be the best thing to do. However, I think it
is also not advisable to break compatibility if we can use an
existing bit. In any case, the same question can arise whenever
anyone wants to use it for some other purpose.
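(For illustration only: a minimal sketch of how that spare t_info bit can be set and tested. The mask value matches the INDEX_MOVED_BY_SPLIT_MASK definition in the patch; the helper functions themselves are invented and not part of the patch.)

#include "postgres.h"
#include "access/itup.h"		/* IndexTuple, INDEX_SIZE_MASK, etc. */

/*
 * 0x2000 is the only bit of IndexTupleData->t_info not already claimed by
 * INDEX_SIZE_MASK (0x1FFF), INDEX_VAR_MASK (0x4000) or INDEX_NULL_MASK (0x8000).
 */
#define INDEX_MOVED_BY_SPLIT_MASK 0x2000

static inline void
sketch_mark_moved_by_split(IndexTuple itup)
{
	itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;
}

static inline bool
sketch_is_moved_by_split(IndexTuple itup)
{
	return (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK) != 0;
}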
In particular, I am thinking about the need for every insert to
exclusive-content-lock the meta page to increment the index-wide tuple
count.
This is not something this patch has changed. The main purpose of this
patch is to change heavy-weight locking to light-weight locking and to
provide a way to handle incomplete splits, both of which are
required to sensibly write WAL for hash indexes. Having said that, I
agree with your point that we can improve the insertion logic so that
we don't need to write-lock the meta page on each insert. I have
noticed some other possible improvements in hash indexes during this
work, such as caching the meta page, reducing lock/unlock calls for
retrieving tuples by making hash index scans work a page at a time as
we do for btree scans, improving the kill_prior_tuple mechanism (which
is currently quite naive), and, biggest of all, improving the create
index logic, which inserts tuple-by-tuple whereas btree operates at the
page level and also bypasses shared buffers. One such improvement
(caching the meta page) is already being worked on by my colleague, and
the patch [2]https://commitfest.postgresql.org/10/715/ for it is in the CF.
The main point I want to highlight is that, apart from what this patch
does, there are a number of other areas in hash indexes that need
improvement, and I think it is better to do those as separate
enhancements rather than as a single patch.
I think that this is going to be a huge bottleneck on update
intensive workloads (which I don't believe have been performance tested as
of yet).
I have done some performance testing with this patch and found a
significant improvement as compared to what we have now in hash
indexes, even for a read-write workload. A better idea would be to
compare it with btree, but in any case, even if this proves to be a
bottleneck, we should try to improve it in a separate patch rather
than as a part of this patch.
I was wondering if we might not want to change that so that each
bucket keeps a local count, and sweeps that up to the meta page only when it
exceeds a threshold. But this would require the bucket page to have an area
to hold such a count. Another idea would be to keep not a count of tuples, but
of buckets with at least one overflow page, and split when there are too
many of those.
I think both of these ideas could change the point (tuple count) at
which we currently split, which might impact search speed and space
usage. Yet another alternative could be to change hashm_ntuples to
64-bit and use 64-bit atomics to operate on it, or maybe use a separate
spinlock to protect it. However, whatever we decide to do with it, I
think it is a matter for a separate patch.
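(A rough sketch of the atomics alternative, assuming the counter lived in shared memory as a 64-bit atomic rather than as the double hashm_ntuples stored in the on-disk metapage; the struct and function names are invented. Initialization with pg_atomic_init_u64() is omitted.)

#include "postgres.h"
#include "port/atomics.h"

typedef struct HypotheticalHashCounter
{
	pg_atomic_uint64 ntuples;	/* index-wide tuple count */
} HypotheticalHashCounter;

static void
sketch_bump_ntuples(HypotheticalHashCounter *c)
{
	/* no exclusive content lock on the meta page is needed just for the count */
	pg_atomic_fetch_add_u64(&c->ntuples, 1);
}

static uint64
sketch_read_ntuples(HypotheticalHashCounter *c)
{
	return pg_atomic_read_u64(&c->ntuples);
}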
Thanks for looking into the patch.
[1]: https://en.wikipedia.org/wiki/Linear_hashing
[2]: https://commitfest.postgresql.org/10/715/
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 09/01/2016 11:55 PM, Amit Kapila wrote:
I have fixed all other issues you have raised. Updated patch is
attached with this mail.
The following script hangs on idx_val creation - just with v5, WAL patch
not applied.
Best regards,
Jesper
Attachments:
On 13/09/16 01:20, Jesper Pedersen wrote:
On 09/01/2016 11:55 PM, Amit Kapila wrote:
I have fixed all other issues you have raised. Updated patch is
attached with this mail.

The following script hangs on idx_val creation - just with v5, WAL patch
not applied.
Are you sure it is actually hanging? I see 100% cpu for a few minutes
but the index eventually completes ok for me (v5 patch applied to
today's master).
Cheers
Mark
On Tue, Sep 13, 2016 at 3:58 AM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:
On 13/09/16 01:20, Jesper Pedersen wrote:
On 09/01/2016 11:55 PM, Amit Kapila wrote:
I have fixed all other issues you have raised. Updated patch is
attached with this mail.

The following script hangs on idx_val creation - just with v5, WAL patch
not applied.

Are you sure it is actually hanging? I see 100% CPU for a few minutes but
the index eventually completes ok for me (v5 patch applied to today's
master).
It completed for me as well. The second index creation takes more
time and CPU, because it is just inserting duplicate values, which need
a lot of overflow pages.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attached is a new version of the patch, which contains the fix for the problem
reported on the write-ahead-log of hash index thread [1]/messages/by-id/CAA4eK1JuKt=-=Y0FheiFL-i8Z5_5660=3n8JUA8s3zG53t_ArQ@mail.gmail.com.
[1]: /messages/by-id/CAA4eK1JuKt=-=Y0FheiFL-i8Z5_5660=3n8JUA8s3zG53t_ArQ@mail.gmail.com
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
concurrent_hash_index_v6.patch (application/octet-stream)
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index c1122b4..d02539a 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -400,7 +400,6 @@ pgstat_hash_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno,
Buffer buf;
Page page;
- _hash_getlock(rel, blkno, HASH_SHARE);
buf = _hash_getbuf_with_strategy(rel, blkno, HASH_READ, 0, bstrategy);
page = BufferGetPage(buf);
@@ -431,7 +430,6 @@ pgstat_hash_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno,
}
_hash_relbuf(rel, buf);
- _hash_droplock(rel, blkno, HASH_SHARE);
}
/*
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 0a7da89..a0feb2f 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -125,49 +125,45 @@ the initially created buckets.
Lock Definitions
----------------
-
-We use both lmgr locks ("heavyweight" locks) and buffer context locks
-(LWLocks) to control access to a hash index. lmgr locks are needed for
-long-term locking since there is a (small) risk of deadlock, which we must
-be able to detect. Buffer context locks are used for short-term access
-control to individual pages of the index.
-
-LockPage(rel, page), where page is the page number of a hash bucket page,
-represents the right to split or compact an individual bucket. A process
-splitting a bucket must exclusive-lock both old and new halves of the
-bucket until it is done. A process doing VACUUM must exclusive-lock the
-bucket it is currently purging tuples from. Processes doing scans or
-insertions must share-lock the bucket they are scanning or inserting into.
-(It is okay to allow concurrent scans and insertions.)
-
-The lmgr lock IDs corresponding to overflow pages are currently unused.
-These are available for possible future refinements. LockPage(rel, 0)
-is also currently undefined (it was previously used to represent the right
-to modify the hash-code-to-bucket mapping, but it is no longer needed for
-that purpose).
-
-Note that these lock definitions are conceptually distinct from any sort
-of lock on the pages whose numbers they share. A process must also obtain
-read or write buffer lock on the metapage or bucket page before accessing
-said page.
-
-Processes performing hash index scans must hold share lock on the bucket
-they are scanning throughout the scan. This seems to be essential, since
-there is no reasonable way for a scan to cope with its bucket being split
-underneath it. This creates a possibility of deadlock external to the
-hash index code, since a process holding one of these locks could block
-waiting for an unrelated lock held by another process. If that process
-then does something that requires exclusive lock on the bucket, we have
-deadlock. Therefore the bucket locks must be lmgr locks so that deadlock
-can be detected and recovered from.
-
-Processes must obtain read (share) buffer context lock on any hash index
-page while reading it, and write (exclusive) lock while modifying it.
-To prevent deadlock we enforce these coding rules: no buffer lock may be
-held long term (across index AM calls), nor may any buffer lock be held
-while waiting for an lmgr lock, nor may more than one buffer lock
-be held at a time by any one process. (The third restriction is probably
-stronger than necessary, but it makes the proof of no deadlock obvious.)
+We use buffer content locks (LWLocks) and buffer pins to control access to
+a hash index.
+
+Scans take a lock in shared mode on the primary or overflow buckets. Inserts
+acquire an exclusive lock on the bucket in which they have to insert. Both
+operations release the lock on the previous bucket before moving to the next
+overflow bucket. They retain a pin on the primary bucket till end of operation.
+Split operation must acquire cleanup lock on both old and new halves of the
+bucket and mark split-in-progress on both the buckets. The cleanup lock at
+the start of split ensures that parallel insert won't get lost. Consider a
+case where insertion has to add a tuple on some intermediate overflow bucket
+in the bucket chain, if we allow split when insertion is in progress, split
+might not move this newly inserted tuple. It releases the lock on previous
+bucket before moving to the next overflow bucket either for old bucket or for
+new bucket. After partitioning the tuples between old and new buckets, it
+again needs to acquire exclusive lock on both old and new buckets to clear
+the split-in-progress flag. Like inserts and scans, it will also retain pins
+on both the old and new primary buckets till end of split operation, although
+we can do without that as well.
+
+Vacuum acquires a cleanup lock on the bucket to remove dead tuples and/or
+tuples that were moved due to a split. The cleanup lock is needed for removing
+dead tuples to ensure that scans return correct results. A scan that returns
+multiple tuples from the same bucket page always restarts from the offset
+number at which it returned the last tuple. If we allowed vacuum to remove
+dead tuples with just an exclusive lock, it could remove the tuple required to
+resume the scan. The cleanup lock is needed for removing tuples that were
+moved by a split to ensure that there is no pending scan that started after
+the start of the split and before the split finished on the bucket. If we
+don't do that, vacuum can remove tuples that are required by such a scan. We
+don't need to retain this cleanup lock for the whole vacuum operation on the
+bucket; we release the lock as we move ahead in the bucket chain. At the end,
+for the squeeze phase, we conditionally acquire the cleanup lock, and if we
+don't get it, we just abandon the squeeze phase.
+
+To avoid deadlocks, we must be consistent about the lock order in which we
+lock the buckets for operations that require locks on two different buckets.
+We use the rule "first lock the old bucket and then the new bucket", i.e.
+lock the lower-numbered bucket first.
Pseudocode Algorithms
@@ -188,63 +184,105 @@ track of available overflow pages.
The reader algorithm is:
pin meta page and take buffer content lock in shared mode
- loop:
- compute bucket number for target hash key
- release meta page buffer content lock
- if (correct bucket page is already locked)
- break
- release any existing bucket page lock (if a concurrent split happened)
- take heavyweight bucket lock
- retake meta page buffer content lock in shared mode
+ compute bucket number for target hash key
+ read and pin the primary bucket page
+ conditionally get the buffer content lock in shared mode on primary bucket page for search
+ if we didn't get the lock (need to wait for lock)
+ release the buffer content lock on meta page
+ acquire buffer content lock on primary bucket page in shared mode
+ acquire the buffer content lock in shared mode on meta page
+ to check for the possibility of a split, we need to recompute the bucket and
+ verify if it is the correct bucket; set the retry flag
+ else if we get the lock, then we can skip the retry path
+ if (retry)
+ loop:
+ compute bucket number for target hash key
+ release meta page buffer content lock
+ if (correct bucket page is already locked)
+ break
+ release any existing content lock on bucket page (if a concurrent split happened)
+ pin primary bucket page and take shared buffer content lock
+ retake meta page buffer content lock in shared mode
-- then, per read request:
release pin on metapage
- read current page of bucket and take shared buffer content lock
- step to next page if necessary (no chaining of locks)
+ if the split is in progress for current bucket and this is a new bucket
+ release the buffer content lock on current bucket page
+ pin and acquire the buffer content lock on old bucket in shared mode
+ release the buffer content lock on old bucket, but not pin
+ retake the buffer content lock on new bucket
+ mark the scan such that it skips the tuples that are marked as moved by split
+ step to next page if necessary (no chaining of locks)
+ if the scan indicates moved by split, then move to old bucket after the scan
+ of current bucket is finished
get tuple
release buffer content lock and pin on current page
-- at scan shutdown:
- release bucket share-lock
-
-We can't hold the metapage lock while acquiring a lock on the target bucket,
-because that might result in an undetected deadlock (lwlocks do not participate
-in deadlock detection). Instead, we relock the metapage after acquiring the
-bucket page lock and check whether the bucket has been split. If not, we're
-done. If so, we release our previously-acquired lock and repeat the process
-using the new bucket number. Holding the bucket sharelock for
+ release any pin we hold on current buffer, old bucket buffer, new bucket buffer
+
+We don't want to hold the meta page lock while waiting to acquire the content
+lock on the bucket page, because that might result in poor concurrency.
+Instead, we relock the metapage after acquiring the bucket page content lock
+and check whether the bucket has been split. If not, we're done. If so, we
+release our previously-acquired content lock (but not the pin) and repeat the
+process using the new bucket number. Holding the buffer pin on the bucket page for
the remainder of the scan prevents the reader's current-tuple pointer from
-being invalidated by splits or compactions. Notice that the reader's lock
+being invalidated by splits or compactions. Notice that the reader's pin
does not prevent other buckets from being split or compacted.
To keep concurrency reasonably good, we require readers to cope with
concurrent insertions, which means that they have to be able to re-find
-their current scan position after re-acquiring the page sharelock. Since
-deletion is not possible while a reader holds the bucket sharelock, and
-we assume that heap tuple TIDs are unique, this can be implemented by
+their current scan position after re-acquiring the buffer content lock on
+page. Since deletion is not possible while a reader holds the pin on bucket,
+and we assume that heap tuple TIDs are unique, this can be implemented by
searching for the same heap tuple TID previously returned. Insertion does
not move index entries across pages, so the previously-returned index entry
should always be on the same page, at the same or higher offset number,
as it was before.
+To allow scans during a bucket split: if, at the start of the scan, the bucket
+is marked as split-in-progress, the scan reads all the tuples in that bucket
+except those that are marked as moved-by-split. Once it finishes scanning all
+the tuples in the current bucket, it scans the old bucket from which this
+bucket was formed by the split. This happens only for the new half of the split.
+
The insertion algorithm is rather similar:
pin meta page and take buffer content lock in shared mode
- loop:
- compute bucket number for target hash key
- release meta page buffer content lock
- if (correct bucket page is already locked)
- break
- release any existing bucket page lock (if a concurrent split happened)
- take heavyweight bucket lock in shared mode
- retake meta page buffer content lock in shared mode
--- (so far same as reader)
+ compute bucket number for target hash key
+ read and pin the primary bucket page
+ conditionally get the buffer content lock in exclusive mode on primary bucket page for search
+ if we didn't get the lock (need to wait for lock)
+ release the buffer content lock on meta page
+ acquire buffer content lock on primary bucket page in exclusive mode
+ acquire the buffer content lock in shared mode on meta page
+ to check for the possibility of a split, we need to recompute the bucket and
+ verify if it is the correct bucket; set the retry flag
+ else if we get the lock, then we can skip the retry path
+ if (retry)
+ loop:
+ compute bucket number for target hash key
+ release meta page buffer content lock
+ if (correct bucket page is already locked)
+ break
+ release any existing content lock on bucket page (if a concurrent split happened)
+ pin primary bucket page and take exclusive buffer content lock
+ retake meta page buffer content lock in shared mode
+-- (so far same as reader, except for acquisition of buffer content lock in
+ exclusive mode on primary bucket page)
release pin on metapage
- pin current page of bucket and take exclusive buffer content lock
- if full, release, read/exclusive-lock next page; repeat as needed
+ if the split-in-progress flag is set for bucket in old half of split
+ and pin count on it is one, then finish the split
+ we already have a buffer content lock on old bucket, conditionally get the content lock on new bucket
+ if get the lock on new bucket
+ finish the split using algorithm mentioned below for split
+ release the buffer content lock and pin on new bucket
+ if full, release lock but not pin, read/exclusive-lock next page; repeat as needed
>> see below if no space in any page of bucket
insert tuple at appropriate place in page
mark current page dirty and release buffer content lock and pin
+ if current page is not a bucket page, release the pin on bucket page
release heavyweight share-lock
- pin meta page and take buffer content lock in shared mode
+ pin meta page and take buffer content lock in exclusive mode
increment tuple count, decide if split needed
mark meta page dirty and release buffer content lock and pin
done if no split needed, else enter Split algorithm below
@@ -256,11 +294,13 @@ bucket that is being actively scanned, because readers can cope with this
as explained above. We only need the short-term buffer locks to ensure
that readers do not see a partially-updated page.
-It is clearly impossible for readers and inserters to deadlock, and in
-fact this algorithm allows them a very high degree of concurrency.
-(The exclusive metapage lock taken to update the tuple count is stronger
-than necessary, since readers do not care about the tuple count, but the
-lock is held for such a short time that this is probably not an issue.)
+To avoid deadlock between readers and inserters, whenever there is a need
+to lock multiple buckets, we always take the locks in the order suggested in
+Lock Definitions above. This algorithm allows them a very high degree of
+concurrency. (The exclusive metapage lock taken to update the tuple count
+is stronger than necessary, since readers do not care about the tuple count,
+but the lock is held for such a short time that this is probably not an
+issue.)
When an inserter cannot find space in any existing page of a bucket, it
must obtain an overflow page and add that page to the bucket's chain.
@@ -271,46 +311,79 @@ index is overfull (has a higher-than-wanted ratio of tuples to buckets).
The algorithm attempts, but does not necessarily succeed, to split one
existing bucket in two, thereby lowering the fill ratio:
- pin meta page and take buffer content lock in exclusive mode
- check split still needed
- if split not needed anymore, drop buffer content lock and pin and exit
- decide which bucket to split
- Attempt to X-lock old bucket number (definitely could fail)
- Attempt to X-lock new bucket number (shouldn't fail, but...)
- if above fail, drop locks and pin and exit
+ expand:
+ take buffer content lock in exclusive mode on meta page
+ check split still needed
+ if split not needed anymore, drop buffer content lock and exit
+ decide which bucket to split
+ Attempt to acquire cleanup lock on old bucket number (definitely could fail)
+ if above fail, release lock and pin and exit
+ if the split-in-progress flag is set, then finish the split
+ conditionally get the content lock on new bucket which was involved in split
+ if got the lock on new bucket
+ finish the split using algorithm mentioned below for split
+ release the buffer content lock and pin on old and new buckets
+ try to expand from start
+ else
+ release the buffer content lock and pin on old bucket and exit
+ if the garbage flag (indicates that tuples are moved by split) is set on bucket
+ release the buffer content lock on meta page
+ remove the tuples that don't belong to this bucket; see bucket cleanup below
+ Attempt to acquire cleanup lock on new bucket number (shouldn't fail, but...)
update meta page to reflect new number of buckets
- mark meta page dirty and release buffer content lock and pin
+ mark meta page dirty and release buffer content lock
-- now, accesses to all other buckets can proceed.
Perform actual split of bucket, moving tuples as needed
>> see below about acquiring needed extra space
Release X-locks of old and new buckets
+ split guts
+ mark the old and new buckets indicating split-in-progress
+ mark the old bucket indicating has-garbage
+ copy the tuples that belong to the new bucket from the old bucket
+ during the copy, mark such tuples as moved-by-split
+ release lock but not pin for primary bucket page of old bucket,
+ read/shared-lock next page; repeat as needed
+ >> see below if no space in bucket page of new bucket
+ ensure to have exclusive-lock on both old and new buckets in that order
+ clear the split-in-progress flag from both the buckets
+ mark buffers dirty and release the locks and pins on both old and new buckets
+
Note the metapage lock is not held while the actual tuple rearrangement is
performed, so accesses to other buckets can proceed in parallel; in fact,
it's possible for multiple bucket splits to proceed in parallel.
-Split's attempt to X-lock the old bucket number could fail if another
-process holds S-lock on it. We do not want to wait if that happens, first
-because we don't want to wait while holding the metapage exclusive-lock,
-and second because it could very easily result in deadlock. (The other
-process might be out of the hash AM altogether, and could do something
-that blocks on another lock this process holds; so even if the hash
-algorithm itself is deadlock-free, a user-induced deadlock could occur.)
-So, this is a conditional LockAcquire operation, and if it fails we just
-abandon the attempt to split. This is all right since the index is
-overfull but perfectly functional. Every subsequent inserter will try to
-split, and eventually one will succeed. If multiple inserters failed to
-split, the index might still be overfull, but eventually, the index will
+Split's attempt to acquire cleanup-lock on the old bucket number could fail
+if another process holds any lock or pin on it. We do not want to wait if
+that happens, because we don't want to wait while holding the metapage
+exclusive-lock. So, this is a conditional LWLockAcquire operation, and if
+it fails we just abandon the attempt to split. This is all right since the
+index is overfull but perfectly functional. Every subsequent inserter will
+try to split, and eventually one will succeed. If multiple inserters failed
+to split, the index might still be overfull, but eventually, the index will
not be overfull and split attempts will stop. (We could make a successful
splitter loop to see if the index is still overfull, but it seems better to
distribute the split overhead across successive insertions.)
+The has_garbage flag indicates that the bucket contains tuples that were moved
+due to a split. It is set only on the old bucket. The reason we need it in
+addition to the split-in-progress flag is to distinguish the case where the
+split is already over (i.e. the split-in-progress flag is cleared). It is used
+both by vacuum and during a re-split operation. Vacuum uses it to decide
+whether it needs to clear the moved-by-split tuples from the bucket along with
+the dead tuples. A re-split of the bucket uses it to ensure that it doesn't
+start a new split from a bucket without first clearing the previous tuples
+from the old bucket. The usage by re-split helps keep bloat under control and
+makes the design somewhat simpler, as we never have to handle the situation
+where a bucket can contain dead tuples from multiple splits.
+
A problem is that if a split fails partway through (eg due to insufficient
-disk space) the index is left corrupt. The probability of that could be
-made quite low if we grab a free page or two before we update the meta
-page, but the only real solution is to treat a split as a WAL-loggable,
+disk space or crash) the index is left corrupt. The probability of that
+could be made quite low if we grab a free page or two before we update the
+meta page, but the only real solution is to treat a split as a WAL-loggable,
must-complete action. I'm not planning to teach hash about WAL in this
-go-round.
+go-round. However, we do try to finish the incomplete splits during insert
+and split.
The fourth operation is garbage collection (bulk deletion):
@@ -319,9 +392,13 @@ The fourth operation is garbage collection (bulk deletion):
fetch current max bucket number
release meta page buffer content lock and pin
while next bucket <= max bucket do
- Acquire X lock on target bucket
- Scan and remove tuples, compact free space as needed
- Release X lock
+ Acquire cleanup lock on target bucket
+ Scan and remove tuples
+ For overflow buckets, first we need to lock the next bucket and then
+ release the lock on current bucket
+ Ensure to have X lock on bucket page
+ If buffer pincount is one, then compact free space as needed
+ Release lock
next bucket ++
end loop
pin metapage and take buffer content lock in exclusive mode
@@ -330,20 +407,23 @@ The fourth operation is garbage collection (bulk deletion):
else update metapage tuple count
mark meta page dirty and release buffer content lock and pin
-Note that this is designed to allow concurrent splits. If a split occurs,
-tuples relocated into the new bucket will be visited twice by the scan,
-but that does no harm. (We must however be careful about the statistics
-reported by the VACUUM operation. What we can do is count the number of
-tuples scanned, and believe this in preference to the stored tuple count
-if the stored tuple count and number of buckets did *not* change at any
-time during the scan. This provides a way of correcting the stored tuple
-count if it gets out of sync for some reason. But if a split or insertion
-does occur concurrently, the scan count is untrustworthy; instead,
-subtract the number of tuples deleted from the stored tuple count and
-use that.)
-
-The exclusive lock request could deadlock in some strange scenarios, but
-we can just error out without any great harm being done.
+Note that this is designed to allow concurrent splits and scans. If a
+split occurs, tuples relocated into the new bucket will be visited twice
+by the scan, but that does no harm. Because we release the locks during the
+scan of a bucket, a concurrent scan can start on the bucket, but any such scan
+will always be behind the cleanup. Scans must be kept behind cleanup, else
+vacuum could remove tuples that are required to complete the scan, as
+explained in the Lock Definitions section above. This holds true for backward
+scans as well (backward scans first traverse each bucket starting from the
+first bucket page to the last overflow page in the chain).
+We must be careful about the statistics reported by the VACUUM operation.
+What we can do is count the number of tuples scanned, and believe this in
+preference to the stored tuple count if the stored tuple count and number
+of buckets did *not* change at any time during the scan. This provides a
+way of correcting the stored tuple count if it gets out of sync for some
+reason. But if a split or insertion does occur concurrently, the scan
+count is untrustworthy; instead, subtract the number of tuples deleted
+from the stored tuple count and use that.
Free Space Management
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index e3b1eef..a12a830 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -287,10 +287,10 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
/*
* An insertion into the current index page could have happened while
* we didn't have read lock on it. Re-find our position by looking
- * for the TID we previously returned. (Because we hold share lock on
- * the bucket, no deletions or splits could have occurred; therefore
- * we can expect that the TID still exists in the current index page,
- * at an offset >= where we were.)
+ * for the TID we previously returned. (Because we hold pin on the
+ * bucket, no deletions or splits could have occurred; therefore we
+ * can expect that the TID still exists in the current index page, at
+ * an offset >= where we were.)
*/
OffsetNumber maxoffnum;
@@ -425,12 +425,15 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
so = (HashScanOpaque) palloc(sizeof(HashScanOpaqueData));
so->hashso_bucket_valid = false;
- so->hashso_bucket_blkno = 0;
so->hashso_curbuf = InvalidBuffer;
+ so->hashso_bucket_buf = InvalidBuffer;
+ so->hashso_old_bucket_buf = InvalidBuffer;
/* set position invalid (this will cause _hash_first call) */
ItemPointerSetInvalid(&(so->hashso_curpos));
ItemPointerSetInvalid(&(so->hashso_heappos));
+ so->hashso_skip_moved_tuples = false;
+
scan->opaque = so;
/* register scan in case we change pages it's using */
@@ -449,15 +452,7 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
HashScanOpaque so = (HashScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
- /* release any pin we still hold */
- if (BufferIsValid(so->hashso_curbuf))
- _hash_dropbuf(rel, so->hashso_curbuf);
- so->hashso_curbuf = InvalidBuffer;
-
- /* release lock on bucket, too */
- if (so->hashso_bucket_blkno)
- _hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
- so->hashso_bucket_blkno = 0;
+ _hash_dropscanbuf(rel, so);
/* set position invalid (this will cause _hash_first call) */
ItemPointerSetInvalid(&(so->hashso_curpos));
@@ -471,6 +466,8 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
scan->numberOfKeys * sizeof(ScanKeyData));
so->hashso_bucket_valid = false;
}
+
+ so->hashso_skip_moved_tuples = false;
}
/*
@@ -484,16 +481,7 @@ hashendscan(IndexScanDesc scan)
/* don't need scan registered anymore */
_hash_dropscan(scan);
-
- /* release any pin we still hold */
- if (BufferIsValid(so->hashso_curbuf))
- _hash_dropbuf(rel, so->hashso_curbuf);
- so->hashso_curbuf = InvalidBuffer;
-
- /* release lock on bucket, too */
- if (so->hashso_bucket_blkno)
- _hash_droplock(rel, so->hashso_bucket_blkno, HASH_SHARE);
- so->hashso_bucket_blkno = 0;
+ _hash_dropscanbuf(rel, so);
pfree(so);
scan->opaque = NULL;
@@ -504,6 +492,9 @@ hashendscan(IndexScanDesc scan)
* The set of target tuples is specified via a callback routine that tells
* whether any given heap tuple (identified by ItemPointer) is being deleted.
*
+ * This function also deletes the tuples that were moved by a split to
+ * another bucket.
+ *
* Result: a palloc'd struct containing statistical info for VACUUM displays.
*/
IndexBulkDeleteResult *
@@ -548,83 +539,52 @@ loop_top:
{
BlockNumber bucket_blkno;
BlockNumber blkno;
- bool bucket_dirty = false;
+ Buffer bucket_buf;
+ Buffer buf;
+ HashPageOpaque bucket_opaque;
+ Page page;
+ bool bucket_has_garbage = false;
/* Get address of bucket's start page */
bucket_blkno = BUCKET_TO_BLKNO(&local_metapage, cur_bucket);
- /* Exclusive-lock the bucket so we can shrink it */
- _hash_getlock(rel, bucket_blkno, HASH_EXCLUSIVE);
-
/* Shouldn't have any active scans locally, either */
if (_hash_has_active_scan(rel, cur_bucket))
elog(ERROR, "hash index has active scan during VACUUM");
- /* Scan each page in bucket */
blkno = bucket_blkno;
- while (BlockNumberIsValid(blkno))
- {
- Buffer buf;
- Page page;
- HashPageOpaque opaque;
- OffsetNumber offno;
- OffsetNumber maxoffno;
- OffsetNumber deletable[MaxOffsetNumber];
- int ndeletable = 0;
- vacuum_delay_point();
-
- buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
- LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
- info->strategy);
- page = BufferGetPage(buf);
- opaque = (HashPageOpaque) PageGetSpecialPointer(page);
- Assert(opaque->hasho_bucket == cur_bucket);
-
- /* Scan each tuple in page */
- maxoffno = PageGetMaxOffsetNumber(page);
- for (offno = FirstOffsetNumber;
- offno <= maxoffno;
- offno = OffsetNumberNext(offno))
- {
- IndexTuple itup;
- ItemPointer htup;
+ /*
+ * We need to acquire a cleanup lock on the primary bucket to wait out
+ * concurrent scans.
+ */
+ buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, info->strategy);
+ LockBufferForCleanup(buf);
+ _hash_checkpage(rel, buf, LH_BUCKET_PAGE);
- itup = (IndexTuple) PageGetItem(page,
- PageGetItemId(page, offno));
- htup = &(itup->t_tid);
- if (callback(htup, callback_state))
- {
- /* mark the item for deletion */
- deletable[ndeletable++] = offno;
- tuples_removed += 1;
- }
- else
- num_index_tuples += 1;
- }
+ page = BufferGetPage(buf);
+ bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
- /*
- * Apply deletions and write page if needed, advance to next page.
- */
- blkno = opaque->hasho_nextblkno;
+ /*
+ * If the bucket contains tuples that are moved by split, then we need
+ * to delete such tuples on completion of split. Before cleaning, we
+ * need to out-wait the scans that have started when the split was in
+ * progress for a bucket.
+ */
+ if (H_HAS_GARBAGE(bucket_opaque) &&
+ !H_INCOMPLETE_SPLIT(bucket_opaque))
+ bucket_has_garbage = true;
- if (ndeletable > 0)
- {
- PageIndexMultiDelete(page, deletable, ndeletable);
- _hash_wrtbuf(rel, buf);
- bucket_dirty = true;
- }
- else
- _hash_relbuf(rel, buf);
- }
+ bucket_buf = buf;
- /* If we deleted anything, try to compact free space */
- if (bucket_dirty)
- _hash_squeezebucket(rel, cur_bucket, bucket_blkno,
- info->strategy);
+ hashbucketcleanup(rel, bucket_buf, blkno, info->strategy,
+ local_metapage.hashm_maxbucket,
+ local_metapage.hashm_highmask,
+ local_metapage.hashm_lowmask, &tuples_removed,
+ &num_index_tuples, bucket_has_garbage, true,
+ callback, callback_state);
- /* Release bucket lock */
- _hash_droplock(rel, bucket_blkno, HASH_EXCLUSIVE);
+ _hash_relbuf(rel, bucket_buf);
/* Advance to next bucket */
cur_bucket++;
@@ -705,6 +665,197 @@ hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
return stats;
}
+/*
+ * Helper function to perform deletion of index entries from a bucket.
+ *
+ * This expects that the caller has acquired a cleanup lock on the target
+ * bucket (primary page of a bucket) and it is the responsibility of the caller to
+ * release that lock.
+ *
+ * During scan of overflow buckets, first we need to lock the next bucket and
+ * then release the lock on current bucket. This ensures that any concurrent
+ * scan started after we start cleaning the bucket will always be behind the
+ * cleanup. Allowing scans to overtake vacuum would allow it to remove
+ * tuples that are still required by the scan.
+ *
+ * We need to retain a pin on the primary bucket to ensure that no concurrent
+ * split can start.
+ */
+void
+hashbucketcleanup(Relation rel, Buffer bucket_buf,
+ BlockNumber bucket_blkno,
+ BufferAccessStrategy bstrategy,
+ uint32 maxbucket,
+ uint32 highmask, uint32 lowmask,
+ double *tuples_removed,
+ double *num_index_tuples,
+ bool bucket_has_garbage,
+ bool delay,
+ IndexBulkDeleteCallback callback,
+ void *callback_state)
+{
+ BlockNumber blkno;
+ Buffer buf;
+ Bucket cur_bucket;
+ Bucket new_bucket PG_USED_FOR_ASSERTS_ONLY = InvalidBucket;
+ Page page;
+ bool bucket_dirty = false;
+
+ blkno = bucket_blkno;
+ buf = bucket_buf;
+ page = BufferGetPage(buf);
+ cur_bucket = ((HashPageOpaque) PageGetSpecialPointer(page))->hasho_bucket;
+
+ if (bucket_has_garbage)
+ new_bucket = _hash_get_newbucket(rel, cur_bucket,
+ lowmask, maxbucket);
+
+ /* Scan each page in bucket */
+ for (;;)
+ {
+ HashPageOpaque opaque;
+ OffsetNumber offno;
+ OffsetNumber maxoffno;
+ Buffer next_buf;
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+ bool retain_pin = false;
+ bool curr_page_dirty = false;
+
+ if (delay)
+ vacuum_delay_point();
+
+ page = BufferGetPage(buf);
+ opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+ /* Scan each tuple in page */
+ maxoffno = PageGetMaxOffsetNumber(page);
+ for (offno = FirstOffsetNumber;
+ offno <= maxoffno;
+ offno = OffsetNumberNext(offno))
+ {
+ IndexTuple itup;
+ ItemPointer htup;
+ Bucket bucket;
+
+ itup = (IndexTuple) PageGetItem(page,
+ PageGetItemId(page, offno));
+ htup = &(itup->t_tid);
+ if (callback && callback(htup, callback_state))
+ {
+ /* mark the item for deletion */
+ deletable[ndeletable++] = offno;
+ if (tuples_removed)
+ *tuples_removed += 1;
+ }
+ else if (bucket_has_garbage)
+ {
+ /* delete the tuples that are moved by split. */
+ bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
+ maxbucket,
+ highmask,
+ lowmask);
+ /* mark the item for deletion */
+ if (bucket != cur_bucket)
+ {
+ /*
+ * We expect tuples to belong either to the current bucket or
+ * to new_bucket. This is ensured because we don't allow
+ * further splits from a bucket that contains garbage. See
+ * comments in _hash_expandtable.
+ */
+ Assert(bucket == new_bucket);
+ deletable[ndeletable++] = offno;
+ }
+ else if (num_index_tuples)
+ *num_index_tuples += 1;
+ }
+ else if (num_index_tuples)
+ *num_index_tuples += 1;
+ }
+
+ /* retain the pin on primary bucket till end of bucket scan */
+ if (blkno == bucket_blkno)
+ retain_pin = true;
+ else
+ retain_pin = false;
+
+ blkno = opaque->hasho_nextblkno;
+
+ /*
+ * Apply deletions and write page if needed, advance to next page.
+ */
+ if (ndeletable > 0)
+ {
+ PageIndexMultiDelete(page, deletable, ndeletable);
+ bucket_dirty = true;
+ curr_page_dirty = true;
+ }
+
+ /* bail out if there are no more pages to scan. */
+ if (!BlockNumberIsValid(blkno))
+ break;
+
+ next_buf = _hash_getbuf_with_strategy(rel, blkno, HASH_WRITE,
+ LH_OVERFLOW_PAGE,
+ bstrategy);
+
+ /*
+ * release the lock on previous page after acquiring the lock on next
+ * page
+ */
+ if (curr_page_dirty)
+ {
+ if (retain_pin)
+ _hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+ else
+ _hash_wrtbuf(rel, buf);
+ curr_page_dirty = false;
+ }
+ else if (retain_pin)
+ _hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+ else
+ _hash_relbuf(rel, buf);
+
+ buf = next_buf;
+ }
+
+ /*
+ * lock the bucket page to clear the garbage flag and squeeze the bucket.
+ * if the current buffer is same as bucket buffer, then we already have
+ * lock on bucket page.
+ */
+ if (buf != bucket_buf)
+ {
+ _hash_relbuf(rel, buf);
+ _hash_chgbufaccess(rel, bucket_buf, HASH_NOLOCK, HASH_WRITE);
+ }
+
+ /*
+ * Clear the garbage flag from the bucket after deleting the tuples that
+ * were moved by the split. We purposefully clear the flag before squeezing
+ * the bucket, so that after a restart, vacuum won't again try to delete
+ * the moved-by-split tuples.
+ */
+ if (bucket_has_garbage)
+ {
+ HashPageOpaque bucket_opaque;
+
+ page = BufferGetPage(bucket_buf);
+ bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+ bucket_opaque->hasho_flag &= ~LH_BUCKET_PAGE_HAS_GARBAGE;
+ }
+
+ /*
+ * If we deleted anything, try to compact free space. For squeezing the
+ * bucket, we must have a cleanup lock, else it can impact the ordering of
+ * tuples for a scan that has started before it.
+ */
+ if (bucket_dirty && CheckBufferForCleanup(bucket_buf))
+ _hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf,
+ bstrategy);
+}
void
hash_redo(XLogReaderState *record)
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index acd2e64..5cfd0aa 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -28,7 +28,8 @@
void
_hash_doinsert(Relation rel, IndexTuple itup)
{
- Buffer buf;
+ Buffer buf = InvalidBuffer;
+ Buffer bucket_buf;
Buffer metabuf;
HashMetaPage metap;
BlockNumber blkno;
@@ -40,6 +41,9 @@ _hash_doinsert(Relation rel, IndexTuple itup)
bool do_expand;
uint32 hashkey;
Bucket bucket;
+ uint32 maxbucket;
+ uint32 highmask;
+ uint32 lowmask;
/*
* Get the hash key for the item (it's stored in the index tuple itself).
@@ -70,51 +74,131 @@ _hash_doinsert(Relation rel, IndexTuple itup)
errhint("Values larger than a buffer page cannot be indexed.")));
/*
- * Loop until we get a lock on the correct target bucket.
+ * Copy bucket mapping info now; the comment in _hash_expandtable where
+ * we copy this information and call _hash_splitbucket explains why this
+ * is OK.
*/
- for (;;)
- {
- /*
- * Compute the target bucket number, and convert to block number.
- */
- bucket = _hash_hashkey2bucket(hashkey,
- metap->hashm_maxbucket,
- metap->hashm_highmask,
- metap->hashm_lowmask);
+ maxbucket = metap->hashm_maxbucket;
+ highmask = metap->hashm_highmask;
+ lowmask = metap->hashm_lowmask;
- blkno = BUCKET_TO_BLKNO(metap, bucket);
+ /*
+ * Conditionally get the lock on primary bucket page for insertion while
+ * holding lock on meta page. If we have to wait, then release the meta
+ * page lock and retry it the hard way.
+ */
+ bucket = _hash_hashkey2bucket(hashkey,
+ maxbucket,
+ highmask,
+ lowmask);
+
+ blkno = BUCKET_TO_BLKNO(metap, bucket);
- /* Release metapage lock, but keep pin. */
+ /* Fetch the primary bucket page for the bucket */
+ buf = ReadBuffer(rel, blkno);
+ if (!ConditionalLockBuffer(buf))
+ {
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+ LockBuffer(buf, HASH_WRITE);
+ _hash_checkpage(rel, buf, LH_BUCKET_PAGE);
+ _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+ oldblkno = blkno;
+ retry = true;
+ }
+ else
+ {
+ _hash_checkpage(rel, buf, LH_BUCKET_PAGE);
_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+ }
+ if (retry)
+ {
/*
- * If the previous iteration of this loop locked what is still the
- * correct target bucket, we are done. Otherwise, drop any old lock
- * and lock what now appears to be the correct bucket.
+ * Loop until we get a lock on the correct target bucket. We get the
+ * lock on the primary bucket page and retain the pin on it during the
+ * insert operation to prevent concurrent splits. Retaining a pin on the
+ * primary bucket page ensures that a split can't happen, as a split needs
+ * to acquire the cleanup lock on the primary bucket page. Acquiring the
+ * lock on the primary bucket and rechecking that it is the target bucket
+ * is mandatory, as otherwise a concurrent split might cause this
+ * insertion to fall into the wrong bucket.
*/
- if (retry)
+ for (;;)
{
+ /*
+ * Compute the target bucket number, and convert to block number.
+ */
+ bucket = _hash_hashkey2bucket(hashkey,
+ metap->hashm_maxbucket,
+ metap->hashm_highmask,
+ metap->hashm_lowmask);
+
+ blkno = BUCKET_TO_BLKNO(metap, bucket);
+
+ /* Release metapage lock, but keep pin. */
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+ /*
+ * If the previous iteration of this loop locked what is still the
+ * correct target bucket, we are done. Otherwise, drop any old
+ * lock and lock what now appears to be the correct bucket.
+ */
if (oldblkno == blkno)
break;
- _hash_droplock(rel, oldblkno, HASH_SHARE);
- }
- _hash_getlock(rel, blkno, HASH_SHARE);
+ _hash_relbuf(rel, buf);
- /*
- * Reacquire metapage lock and check that no bucket split has taken
- * place while we were awaiting the bucket lock.
- */
- _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
- oldblkno = blkno;
- retry = true;
+ /* Fetch the primary bucket page for the bucket */
+ buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
+
+ /*
+ * Reacquire metapage lock and check that no bucket split has
+ * taken place while we were awaiting the bucket lock.
+ */
+ _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+ oldblkno = blkno;
+ }
}
- /* Fetch the primary bucket page for the bucket */
- buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
+ /* remember the primary bucket buffer to release the pin on it at end. */
+ bucket_buf = buf;
+
page = BufferGetPage(buf);
pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
Assert(pageopaque->hasho_bucket == bucket);
+ /*
+ * If there is any pending split, try to finish it before proceeding with
+ * the insertion. We try to finish the split when inserting into the old
+ * bucket, as that allows us to remove the tuples from the old bucket and
+ * reuse the space. There is no comparable benefit from finishing the
+ * split during insertion into the new bucket.
+ *
+ * In future, if we want to finish splits during insertion into the new
+ * bucket, we must ensure a locking order such that the old bucket is
+ * locked before the new bucket.
+ */
+ if (H_OLD_INCOMPLETE_SPLIT(pageopaque) && CheckBufferForCleanup(buf))
+ {
+ BlockNumber nblkno;
+ Buffer nbuf;
+
+ nblkno = _hash_get_newblk(rel, pageopaque);
+
+ /* Fetch the primary bucket page for the new bucket */
+ nbuf = _hash_getbuf_with_condlock_cleanup(rel, nblkno, LH_BUCKET_PAGE);
+ if (nbuf)
+ {
+ _hash_finish_split(rel, metabuf, buf, nbuf, maxbucket,
+ highmask, lowmask);
+
+ /*
+ * release the buffer here as the insertion will happen in old
+ * bucket.
+ */
+ _hash_relbuf(rel, nbuf);
+ }
+ }
+
/* Do the insertion */
while (PageGetFreeSpace(page) < itemsz)
{
@@ -127,14 +211,23 @@ _hash_doinsert(Relation rel, IndexTuple itup)
{
/*
* ovfl page exists; go get it. if it doesn't have room, we'll
- * find out next pass through the loop test above.
+ * find out next pass through the loop test above. Retain the pin
+ * if it is a primary bucket.
*/
- _hash_relbuf(rel, buf);
+ if (pageopaque->hasho_flag & LH_BUCKET_PAGE)
+ _hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+ else
+ _hash_relbuf(rel, buf);
buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
page = BufferGetPage(buf);
}
else
{
+ bool retain_pin = false;
+
+ /* page flags must be accessed before releasing lock on a page. */
+ retain_pin = pageopaque->hasho_flag & LH_BUCKET_PAGE;
+
/*
* we're at the end of the bucket chain and we haven't found a
* page with enough room. allocate a new overflow page.
@@ -144,7 +237,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
/* chain to a new overflow page */
- buf = _hash_addovflpage(rel, metabuf, buf);
+ buf = _hash_addovflpage(rel, metabuf, buf, retain_pin);
page = BufferGetPage(buf);
/* should fit now, given test above */
@@ -158,11 +251,13 @@ _hash_doinsert(Relation rel, IndexTuple itup)
/* found page with enough space, so add the item here */
(void) _hash_pgaddtup(rel, buf, itemsz, itup);
- /* write and release the modified page */
+ /*
+ * write and release the modified page, and make sure to release the
+ * pin on the primary page.
+ */
_hash_wrtbuf(rel, buf);
-
- /* We can drop the bucket lock now */
- _hash_droplock(rel, blkno, HASH_SHARE);
+ if (buf != bucket_buf)
+ _hash_dropbuf(rel, bucket_buf);
/*
* Write-lock the metapage so we can increment the tuple count. After
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index db3e268..760563a 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -82,23 +82,20 @@ blkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
*
* On entry, the caller must hold a pin but no lock on 'buf'. The pin is
* dropped before exiting (we assume the caller is not interested in 'buf'
- * anymore). The returned overflow page will be pinned and write-locked;
- * it is guaranteed to be empty.
+ * anymore) if not asked to retain. The pin will be retained only for the
+ * primary bucket. The returned overflow page will be pinned and
+ * write-locked; it is guaranteed to be empty.
*
* The caller must hold a pin, but no lock, on the metapage buffer.
* That buffer is returned in the same state.
*
- * The caller must hold at least share lock on the bucket, to ensure that
- * no one else tries to compact the bucket meanwhile. This guarantees that
- * 'buf' won't stop being part of the bucket while it's unlocked.
- *
* NB: since this could be executed concurrently by multiple processes,
* one should not assume that the returned overflow page will be the
* immediate successor of the originally passed 'buf'. Additional overflow
* pages might have been added to the bucket chain in between.
*/
Buffer
-_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
+_hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
{
Buffer ovflbuf;
Page page;
@@ -131,7 +128,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
break;
/* we assume we do not need to write the unmodified page */
- _hash_relbuf(rel, buf);
+ if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+ _hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+ else
+ _hash_relbuf(rel, buf);
buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
}
@@ -149,7 +149,10 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf)
/* logically chain overflow page to previous page */
pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
- _hash_wrtbuf(rel, buf);
+ if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+ _hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+ else
+ _hash_wrtbuf(rel, buf);
return ovflbuf;
}
@@ -370,11 +373,11 @@ _hash_firstfreebit(uint32 map)
* in the bucket, or InvalidBlockNumber if no following page.
*
* NB: caller must not hold lock on metapage, nor on either page that's
- * adjacent in the bucket chain. The caller had better hold exclusive lock
- * on the bucket, too.
+ * adjacent in the bucket chain, except for the primary bucket. The caller
+ * had better hold a cleanup lock on the primary bucket.
*/
BlockNumber
-_hash_freeovflpage(Relation rel, Buffer ovflbuf,
+_hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
BufferAccessStrategy bstrategy)
{
HashMetaPage metap;
@@ -413,22 +416,41 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf,
/*
* Fix up the bucket chain. this is a doubly-linked list, so we must fix
* up the bucket chain members behind and ahead of the overflow page being
- * deleted. No concurrency issues since we hold exclusive lock on the
- * entire bucket.
+ * deleted. No concurrency issues since we hold the cleanup lock on the
+ * primary bucket. We don't need to acquire a buffer lock to fix the
+ * primary bucket, as we already have that lock.
*/
if (BlockNumberIsValid(prevblkno))
{
- Buffer prevbuf = _hash_getbuf_with_strategy(rel,
- prevblkno,
- HASH_WRITE,
+ if (prevblkno == bucket_blkno)
+ {
+ Buffer prevbuf = ReadBufferExtended(rel, MAIN_FORKNUM,
+ prevblkno,
+ RBM_NORMAL,
+ bstrategy);
+
+ Page prevpage = BufferGetPage(prevbuf);
+ HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+ Assert(prevopaque->hasho_bucket == bucket);
+ prevopaque->hasho_nextblkno = nextblkno;
+ MarkBufferDirty(prevbuf);
+ ReleaseBuffer(prevbuf);
+ }
+ else
+ {
+ Buffer prevbuf = _hash_getbuf_with_strategy(rel,
+ prevblkno,
+ HASH_WRITE,
LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
- bstrategy);
- Page prevpage = BufferGetPage(prevbuf);
- HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+ bstrategy);
+ Page prevpage = BufferGetPage(prevbuf);
+ HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
- Assert(prevopaque->hasho_bucket == bucket);
- prevopaque->hasho_nextblkno = nextblkno;
- _hash_wrtbuf(rel, prevbuf);
+ Assert(prevopaque->hasho_bucket == bucket);
+ prevopaque->hasho_nextblkno = nextblkno;
+ _hash_wrtbuf(rel, prevbuf);
+ }
}
if (BlockNumberIsValid(nextblkno))
{
@@ -570,7 +592,7 @@ _hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
* required that to be true on entry as well, but it's a lot easier for
* callers to leave empty overflow pages and let this guy clean it up.
*
- * Caller must hold exclusive lock on the target bucket. This allows
+ * Caller must hold cleanup lock on the target bucket. This allows
* us to safely lock multiple pages in the bucket.
*
* Since this function is invoked in VACUUM, we provide an access strategy
@@ -580,6 +602,7 @@ void
_hash_squeezebucket(Relation rel,
Bucket bucket,
BlockNumber bucket_blkno,
+ Buffer bucket_buf,
BufferAccessStrategy bstrategy)
{
BlockNumber wblkno;
@@ -591,27 +614,22 @@ _hash_squeezebucket(Relation rel,
HashPageOpaque wopaque;
HashPageOpaque ropaque;
bool wbuf_dirty;
+ bool release_buf = false;
/*
* start squeezing into the base bucket page.
*/
wblkno = bucket_blkno;
- wbuf = _hash_getbuf_with_strategy(rel,
- wblkno,
- HASH_WRITE,
- LH_BUCKET_PAGE,
- bstrategy);
+ wbuf = bucket_buf;
wpage = BufferGetPage(wbuf);
wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
/*
- * if there aren't any overflow pages, there's nothing to squeeze.
+ * if there aren't any overflow pages, there's nothing to squeeze. caller
+ * is responsible for releasing the lock on the primary bucket.
*/
if (!BlockNumberIsValid(wopaque->hasho_nextblkno))
- {
- _hash_relbuf(rel, wbuf);
return;
- }
/*
* Find the last page in the bucket chain by starting at the base bucket
@@ -669,12 +687,17 @@ _hash_squeezebucket(Relation rel,
{
Assert(!PageIsEmpty(wpage));
+ if (wblkno != bucket_blkno)
+ release_buf = true;
+
wblkno = wopaque->hasho_nextblkno;
Assert(BlockNumberIsValid(wblkno));
- if (wbuf_dirty)
+ if (wbuf_dirty && release_buf)
_hash_wrtbuf(rel, wbuf);
- else
+ else if (wbuf_dirty)
+ MarkBufferDirty(wbuf);
+ else if (release_buf)
_hash_relbuf(rel, wbuf);
/* nothing more to do if we reached the read page */
@@ -700,6 +723,7 @@ _hash_squeezebucket(Relation rel,
wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
Assert(wopaque->hasho_bucket == bucket);
wbuf_dirty = false;
+ release_buf = false;
}
/*
@@ -733,19 +757,25 @@ _hash_squeezebucket(Relation rel,
/* are we freeing the page adjacent to wbuf? */
if (rblkno == wblkno)
{
- /* yes, so release wbuf lock first */
- if (wbuf_dirty)
+ if (wblkno != bucket_blkno)
+ release_buf = true;
+
+ /* yes, so release wbuf lock first if needed */
+ if (wbuf_dirty && release_buf)
_hash_wrtbuf(rel, wbuf);
- else
+ else if (wbuf_dirty)
+ MarkBufferDirty(wbuf);
+ else if (release_buf)
_hash_relbuf(rel, wbuf);
+
/* free this overflow page (releases rbuf) */
- _hash_freeovflpage(rel, rbuf, bstrategy);
+ _hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
/* done */
return;
}
/* free this overflow page, then get the previous one */
- _hash_freeovflpage(rel, rbuf, bstrategy);
+ _hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
rbuf = _hash_getbuf_with_strategy(rel,
rblkno,
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 178463f..f51c313 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -38,10 +38,14 @@ static bool _hash_alloc_buckets(Relation rel, BlockNumber firstblock,
uint32 nblocks);
static void _hash_splitbucket(Relation rel, Buffer metabuf,
Bucket obucket, Bucket nbucket,
- BlockNumber start_oblkno,
+ Buffer obuf,
Buffer nbuf,
uint32 maxbucket,
uint32 highmask, uint32 lowmask);
+static void _hash_splitbucket_guts(Relation rel, Buffer metabuf,
+ Bucket obucket, Bucket nbucket, Buffer obuf,
+ Buffer nbuf, HTAB *htab, uint32 maxbucket,
+ uint32 highmask, uint32 lowmask);
/*
@@ -55,46 +59,6 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
/*
- * _hash_getlock() -- Acquire an lmgr lock.
- *
- * 'whichlock' should the block number of a bucket's primary bucket page to
- * acquire the per-bucket lock. (See README for details of the use of these
- * locks.)
- *
- * 'access' must be HASH_SHARE or HASH_EXCLUSIVE.
- */
-void
-_hash_getlock(Relation rel, BlockNumber whichlock, int access)
-{
- if (USELOCKING(rel))
- LockPage(rel, whichlock, access);
-}
-
-/*
- * _hash_try_getlock() -- Acquire an lmgr lock, but only if it's free.
- *
- * Same as above except we return FALSE without blocking if lock isn't free.
- */
-bool
-_hash_try_getlock(Relation rel, BlockNumber whichlock, int access)
-{
- if (USELOCKING(rel))
- return ConditionalLockPage(rel, whichlock, access);
- else
- return true;
-}
-
-/*
- * _hash_droplock() -- Release an lmgr lock.
- */
-void
-_hash_droplock(Relation rel, BlockNumber whichlock, int access)
-{
- if (USELOCKING(rel))
- UnlockPage(rel, whichlock, access);
-}
-
-/*
* _hash_getbuf() -- Get a buffer by block number for read or write.
*
* 'access' must be HASH_READ, HASH_WRITE, or HASH_NOLOCK.
@@ -132,6 +96,35 @@ _hash_getbuf(Relation rel, BlockNumber blkno, int access, int flags)
}
/*
+ * _hash_getbuf_with_condlock_cleanup() -- as above, but get the buffer for write.
+ *
+ * We try to take the conditional cleanup lock and if we get it then
+ * return the buffer, else return InvalidBuffer.
+ */
+Buffer
+_hash_getbuf_with_condlock_cleanup(Relation rel, BlockNumber blkno, int flags)
+{
+ Buffer buf;
+
+ if (blkno == P_NEW)
+ elog(ERROR, "hash AM does not use P_NEW");
+
+ buf = ReadBuffer(rel, blkno);
+
+ if (!ConditionalLockBufferForCleanup(buf))
+ {
+ ReleaseBuffer(buf);
+ return InvalidBuffer;
+ }
+
+ /* ref count and lock type are correct */
+
+ _hash_checkpage(rel, buf, flags);
+
+ return buf;
+}
+
+/*
* _hash_getinitbuf() -- Get and initialize a buffer by block number.
*
* This must be used only to fetch pages that are known to be before
@@ -266,6 +259,33 @@ _hash_dropbuf(Relation rel, Buffer buf)
}
/*
+ * _hash_dropscanbuf() -- release buffers used in scan.
+ *
+ * This routine unpins the buffers used during scan on which we
+ * hold no lock.
+ */
+void
+_hash_dropscanbuf(Relation rel, HashScanOpaque so)
+{
+ /* release pin we hold on primary bucket */
+ if (BufferIsValid(so->hashso_bucket_buf) &&
+ so->hashso_bucket_buf != so->hashso_curbuf)
+ _hash_dropbuf(rel, so->hashso_bucket_buf);
+ so->hashso_bucket_buf = InvalidBuffer;
+
+ /* release pin we hold on old primary bucket */
+ if (BufferIsValid(so->hashso_old_bucket_buf) &&
+ so->hashso_old_bucket_buf != so->hashso_curbuf)
+ _hash_dropbuf(rel, so->hashso_old_bucket_buf);
+ so->hashso_old_bucket_buf = InvalidBuffer;
+
+ /* release any pin we still hold */
+ if (BufferIsValid(so->hashso_curbuf))
+ _hash_dropbuf(rel, so->hashso_curbuf);
+ so->hashso_curbuf = InvalidBuffer;
+}
+
+/*
* _hash_wrtbuf() -- write a hash page to disk.
*
* This routine releases the lock held on the buffer and our refcount
@@ -489,9 +509,11 @@ _hash_pageinit(Page page, Size size)
/*
* Attempt to expand the hash table by creating one new bucket.
*
- * This will silently do nothing if it cannot get the needed locks.
+ * This will silently do nothing if there are active scans of our own
+ * backend or if we don't get cleanup lock on old or new bucket.
*
- * The caller should hold no locks on the hash index.
+ * It also completes any pending split and removes tuples from the old
+ * bucket if there are any left over from a previous split.
*
* The caller must hold a pin, but no lock, on the metapage buffer.
* The buffer is returned in the same state.
@@ -506,10 +528,15 @@ _hash_expandtable(Relation rel, Buffer metabuf)
BlockNumber start_oblkno;
BlockNumber start_nblkno;
Buffer buf_nblkno;
+ Buffer buf_oblkno;
+ Page opage;
+ HashPageOpaque oopaque;
uint32 maxbucket;
uint32 highmask;
uint32 lowmask;
+restart_expand:
+
/*
* Write-lock the meta page. It used to be necessary to acquire a
* heavyweight lock to begin a split, but that is no longer required.
@@ -548,11 +575,16 @@ _hash_expandtable(Relation rel, Buffer metabuf)
goto fail;
/*
- * Determine which bucket is to be split, and attempt to lock the old
- * bucket. If we can't get the lock, give up.
+ * Determine which bucket is to be split, and attempt to take cleanup lock
+ * on the old bucket. If we can't get the lock, give up.
*
- * The lock protects us against other backends, but not against our own
- * backend. Must check for active scans separately.
+ * The cleanup lock protects us against other backends, but not against
+ * our own backend. Must check for active scans separately.
+ *
+ * The cleanup lock is mainly to protect the split from concurrent
+ * inserts. See src/backend/access/hash/README, Lock Definitions for
+ * further details. Due to this locking restriction, if there is any
+ * pending scan, the split will give up, which is not good, but harmless.
*/
new_bucket = metap->hashm_maxbucket + 1;
@@ -563,11 +595,90 @@ _hash_expandtable(Relation rel, Buffer metabuf)
if (_hash_has_active_scan(rel, old_bucket))
goto fail;
- if (!_hash_try_getlock(rel, start_oblkno, HASH_EXCLUSIVE))
+ buf_oblkno = _hash_getbuf_with_condlock_cleanup(rel, start_oblkno, LH_BUCKET_PAGE);
+ if (!buf_oblkno)
goto fail;
+ opage = BufferGetPage(buf_oblkno);
+ oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
/*
- * Likewise lock the new bucket (should never fail).
+ * We want to finish any pending split from this bucket before starting a
+ * new one: there is no apparent benefit in deferring it, and allowing
+ * incomplete splits to pile up across multiple buckets (e.g. if the new
+ * split also fails) would complicate the code. We don't need to consider
+ * the new bucket for completing the split here, as a re-split of the new
+ * bucket cannot start while there is still a pending split from the old
+ * bucket.
+ */
+ if (H_OLD_INCOMPLETE_SPLIT(oopaque))
+ {
+ BlockNumber nblkno;
+ Buffer buf_nblkno;
+
+ /*
+ * Copy bucket mapping info now; the comment in the code below where we
+ * copy this information and call _hash_splitbucket explains why this
+ * is OK.
+ */
+ maxbucket = metap->hashm_maxbucket;
+ highmask = metap->hashm_highmask;
+ lowmask = metap->hashm_lowmask;
+
+ /* Release the metapage lock, before completing the split. */
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+ nblkno = _hash_get_newblk(rel, oopaque);
+
+ /* Fetch the primary bucket page for the new bucket */
+ buf_nblkno = _hash_getbuf_with_condlock_cleanup(rel, nblkno, LH_BUCKET_PAGE);
+ if (!buf_nblkno)
+ {
+ _hash_relbuf(rel, buf_oblkno);
+ goto fail;
+ }
+
+ _hash_finish_split(rel, metabuf, buf_oblkno, buf_nblkno, maxbucket,
+ highmask, lowmask);
+
+ /*
+ * release the buffers and retry for expand.
+ */
+ _hash_relbuf(rel, buf_oblkno);
+ _hash_relbuf(rel, buf_nblkno);
+
+ goto restart_expand;
+ }
+
+ /*
+ * Clean up the tuples remaining from the previous split. This operation
+ * requires a cleanup lock and we already have one on the old bucket, so
+ * let's do it. We also don't want to allow further splits from the
+ * bucket until the garbage of the previous split is cleaned. This has
+ * two advantages: first, it helps avoid bloat due to garbage, and
+ * second, during cleanup of a bucket we are always sure that the garbage
+ * tuples belong to the most recently split bucket. On the contrary, if
+ * we allowed cleanup of the bucket after the meta page is updated to
+ * indicate the new split but before the actual split, the cleanup
+ * operation would not be able to decide whether a tuple has been moved
+ * to the newly created bucket, and could end up deleting such tuples.
+ */
+ if (H_HAS_GARBAGE(oopaque))
+ {
+ /* Release the metapage lock. */
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+ hashbucketcleanup(rel, buf_oblkno, start_oblkno, NULL,
+ metap->hashm_maxbucket, metap->hashm_highmask,
+ metap->hashm_lowmask, NULL,
+ NULL, true, false, NULL, NULL);
+
+ _hash_relbuf(rel, buf_oblkno);
+
+ goto restart_expand;
+ }
+
+ /*
+ * There shouldn't be any active scan on new bucket.
*
* Note: it is safe to compute the new bucket's blkno here, even though we
* may still need to update the BUCKET_TO_BLKNO mapping. This is because
@@ -579,9 +690,6 @@ _hash_expandtable(Relation rel, Buffer metabuf)
if (_hash_has_active_scan(rel, new_bucket))
elog(ERROR, "scan in progress on supposedly new bucket");
- if (!_hash_try_getlock(rel, start_nblkno, HASH_EXCLUSIVE))
- elog(ERROR, "could not get lock on supposedly new bucket");
-
/*
* If the split point is increasing (hashm_maxbucket's log base 2
* increases), we need to allocate a new batch of bucket pages.
@@ -600,8 +708,7 @@ _hash_expandtable(Relation rel, Buffer metabuf)
if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
{
/* can't split due to BlockNumber overflow */
- _hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
- _hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
+ _hash_relbuf(rel, buf_oblkno);
goto fail;
}
}
@@ -609,9 +716,18 @@ _hash_expandtable(Relation rel, Buffer metabuf)
/*
* Physically allocate the new bucket's primary page. We want to do this
* before changing the metapage's mapping info, in case we can't get the
- * disk space.
+ * disk space. Ideally, we wouldn't need to check for a cleanup lock on
+ * the new bucket, as no other backend can find this bucket until the
+ * meta page is updated. However, it is good to be consistent with the
+ * old bucket's locking.
*/
buf_nblkno = _hash_getnewbuf(rel, start_nblkno, MAIN_FORKNUM);
+ if (!CheckBufferForCleanup(buf_nblkno))
+ {
+ _hash_relbuf(rel, buf_oblkno);
+ _hash_relbuf(rel, buf_nblkno);
+ goto fail;
+ }
+
/*
* Okay to proceed with split. Update the metapage bucket mapping info.
@@ -665,13 +781,9 @@ _hash_expandtable(Relation rel, Buffer metabuf)
/* Relocate records to the new bucket */
_hash_splitbucket(rel, metabuf,
old_bucket, new_bucket,
- start_oblkno, buf_nblkno,
+ buf_oblkno, buf_nblkno,
maxbucket, highmask, lowmask);
- /* Release bucket locks, allowing others to access them */
- _hash_droplock(rel, start_oblkno, HASH_EXCLUSIVE);
- _hash_droplock(rel, start_nblkno, HASH_EXCLUSIVE);
-
return;
/* Here if decide not to split or fail to acquire old bucket lock */
@@ -745,6 +857,10 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
* The buffer is returned in the same state. (The metapage is only
* touched if it becomes necessary to add or remove overflow pages.)
*
+ * A split needs to hold pins on the primary bucket pages of both the old
+ * and new buckets until the end of the operation. This is to prevent
+ * vacuum from starting while the split is in progress.
+ *
* In addition, the caller must have created the new bucket's base page,
* which is passed in buffer nbuf, pinned and write-locked. That lock and
* pin are released here. (The API is set up this way because we must do
@@ -756,37 +872,87 @@ _hash_splitbucket(Relation rel,
Buffer metabuf,
Bucket obucket,
Bucket nbucket,
- BlockNumber start_oblkno,
+ Buffer obuf,
Buffer nbuf,
uint32 maxbucket,
uint32 highmask,
uint32 lowmask)
{
- Buffer obuf;
Page opage;
Page npage;
HashPageOpaque oopaque;
HashPageOpaque nopaque;
- /*
- * It should be okay to simultaneously write-lock pages from each bucket,
- * since no one else can be trying to acquire buffer lock on pages of
- * either bucket.
- */
- obuf = _hash_getbuf(rel, start_oblkno, HASH_WRITE, LH_BUCKET_PAGE);
opage = BufferGetPage(obuf);
oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+ /*
+ * Mark the old bucket to indicate that a split is in progress and that it
+ * has deletable tuples. At the end of the operation we clear the
+ * split-in-progress flag, and vacuum will clear the page_has_garbage flag
+ * after deleting such tuples.
+ */
+ oopaque->hasho_flag |= LH_BUCKET_PAGE_HAS_GARBAGE | LH_BUCKET_OLD_PAGE_SPLIT;
+
npage = BufferGetPage(nbuf);
- /* initialize the new bucket's primary page */
+ /*
+ * initialize the new bucket's primary page and mark it to indicate that
+ * split is in progress.
+ */
nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
nopaque->hasho_prevblkno = InvalidBlockNumber;
nopaque->hasho_nextblkno = InvalidBlockNumber;
nopaque->hasho_bucket = nbucket;
- nopaque->hasho_flag = LH_BUCKET_PAGE;
+ nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_NEW_PAGE_SPLIT;
nopaque->hasho_page_id = HASHO_PAGE_ID;
+ _hash_splitbucket_guts(rel, metabuf, obucket,
+ nbucket, obuf, nbuf, NULL,
+ maxbucket, highmask, lowmask);
+
+ /* all done, now release the locks and pins on primary buckets. */
+ _hash_relbuf(rel, obuf);
+ _hash_relbuf(rel, nbuf);
+}
+
+/*
+ * _hash_splitbucket_guts -- Helper function to perform the split operation
+ *
+ * This routine is used to partition the tuples between the old and new
+ * buckets and also to finish incomplete split operations. To finish a
+ * previously interrupted split, the caller needs to fill htab. If htab is
+ * set, we skip moving tuples that exist in htab; a NULL htab means all
+ * tuples that belong to the new bucket are moved.
+ *
+ * Caller needs to lock and unlock the old and new primary buckets.
+ */
+static void
+_hash_splitbucket_guts(Relation rel,
+ Buffer metabuf,
+ Bucket obucket,
+ Bucket nbucket,
+ Buffer obuf,
+ Buffer nbuf,
+ HTAB *htab,
+ uint32 maxbucket,
+ uint32 highmask,
+ uint32 lowmask)
+{
+ Buffer bucket_obuf;
+ Buffer bucket_nbuf;
+ Page opage;
+ Page npage;
+ HashPageOpaque oopaque;
+ HashPageOpaque nopaque;
+
+ bucket_obuf = obuf;
+ opage = BufferGetPage(obuf);
+ oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+ bucket_nbuf = nbuf;
+ npage = BufferGetPage(nbuf);
+ nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+
/*
* Partition the tuples in the old bucket between the old bucket and the
* new bucket, advancing along the old bucket's overflow bucket chain and
@@ -798,8 +964,6 @@ _hash_splitbucket(Relation rel,
BlockNumber oblkno;
OffsetNumber ooffnum;
OffsetNumber omaxoffnum;
- OffsetNumber deletable[MaxOffsetNumber];
- int ndeletable = 0;
/* Scan each tuple in old page */
omaxoffnum = PageGetMaxOffsetNumber(opage);
@@ -810,18 +974,45 @@ _hash_splitbucket(Relation rel,
IndexTuple itup;
Size itemsz;
Bucket bucket;
+ bool found = false;
/*
- * Fetch the item's hash key (conveniently stored in the item) and
- * determine which bucket it now belongs in.
+ * Before inserting the tuple, probe the hash table containing TIDs of
+ * tuples belonging to the new bucket. If we find a match, skip that
+ * tuple; otherwise fetch the item's hash key (conveniently stored
+ * in the item) and determine which bucket it now belongs in.
*/
itup = (IndexTuple) PageGetItem(opage,
PageGetItemId(opage, ooffnum));
+
+ if (htab)
+ (void) hash_search(htab, &itup->t_tid, HASH_FIND, &found);
+
+ if (found)
+ continue;
+
bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
maxbucket, highmask, lowmask);
if (bucket == nbucket)
{
+ Size itupsize = 0;
+ IndexTuple new_itup;
+
+ /*
+ * make a copy of index tuple as we have to scribble on it.
+ */
+ new_itup = CopyIndexTuple(itup);
+
+ /*
+ * mark the index tuple as moved-by-split; such tuples are
+ * skipped by scans while a split is in progress for the bucket.
+ */
+ itupsize = new_itup->t_info & INDEX_SIZE_MASK;
+ new_itup->t_info &= ~INDEX_SIZE_MASK;
+ new_itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;
+ new_itup->t_info |= itupsize;
+
/*
* insert the tuple into the new bucket. if it doesn't fit on
* the current page in the new bucket, we must allocate a new
@@ -832,17 +1023,25 @@ _hash_splitbucket(Relation rel,
* only partially complete, meaning the index is corrupt,
* since searches may fail to find entries they should find.
*/
- itemsz = IndexTupleDSize(*itup);
+ itemsz = IndexTupleDSize(*new_itup);
itemsz = MAXALIGN(itemsz);
if (PageGetFreeSpace(npage) < itemsz)
{
+ bool retain_pin = false;
+
+ /*
+ * page flags must be accessed before releasing lock on a
+ * page.
+ */
+ retain_pin = nopaque->hasho_flag & LH_BUCKET_PAGE;
+
/* write out nbuf and drop lock, but keep pin */
_hash_chgbufaccess(rel, nbuf, HASH_WRITE, HASH_NOLOCK);
/* chain to a new overflow page */
- nbuf = _hash_addovflpage(rel, metabuf, nbuf);
+ nbuf = _hash_addovflpage(rel, metabuf, nbuf, retain_pin);
npage = BufferGetPage(nbuf);
- /* we don't need nopaque within the loop */
+ nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
}
/*
@@ -852,12 +1051,10 @@ _hash_splitbucket(Relation rel,
* Possible future improvement: accumulate all the items for
* the new page and qsort them before insertion.
*/
- (void) _hash_pgaddtup(rel, nbuf, itemsz, itup);
+ (void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
- /*
- * Mark tuple for deletion from old page.
- */
- deletable[ndeletable++] = ooffnum;
+ /* be tidy */
+ pfree(new_itup);
}
else
{
@@ -870,15 +1067,9 @@ _hash_splitbucket(Relation rel,
oblkno = oopaque->hasho_nextblkno;
- /*
- * Done scanning this old page. If we moved any tuples, delete them
- * from the old page.
- */
- if (ndeletable > 0)
- {
- PageIndexMultiDelete(opage, deletable, ndeletable);
- _hash_wrtbuf(rel, obuf);
- }
+ /* retain the pin on the old primary bucket */
+ if (obuf == bucket_obuf)
+ _hash_chgbufaccess(rel, obuf, HASH_READ, HASH_NOLOCK);
else
_hash_relbuf(rel, obuf);
@@ -887,18 +1078,153 @@ _hash_splitbucket(Relation rel,
break;
/* Else, advance to next old page */
- obuf = _hash_getbuf(rel, oblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
+ obuf = _hash_getbuf(rel, oblkno, HASH_READ, LH_OVERFLOW_PAGE);
opage = BufferGetPage(obuf);
oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
}
/*
* We're at the end of the old bucket chain, so we're done partitioning
- * the tuples. Before quitting, call _hash_squeezebucket to ensure the
- * tuples remaining in the old bucket (including the overflow pages) are
- * packed as tightly as possible. The new bucket is already tight.
+ * the tuples. Mark the old and new buckets to indicate split is
+ * finished.
+ *
+ * To avoid deadlocks due to locking order of buckets, first lock the old
+ * bucket and then the new bucket.
*/
- _hash_wrtbuf(rel, nbuf);
+ if (nopaque->hasho_flag & LH_BUCKET_PAGE)
+ _hash_chgbufaccess(rel, bucket_nbuf, HASH_WRITE, HASH_NOLOCK);
+ else
+ _hash_wrtbuf(rel, nbuf);
+
+ /*
+ * Acquiring cleanup lock to clear the split-in-progress flag ensures that
+ * there is no pending scan that has seen the flag after it is cleared.
+ */
+ _hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
+ opage = BufferGetPage(bucket_obuf);
+ oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+ _hash_chgbufaccess(rel, bucket_nbuf, HASH_NOLOCK, HASH_WRITE);
+ npage = BufferGetPage(bucket_nbuf);
+ nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+
+ /* indicate that split is finished */
+ oopaque->hasho_flag &= ~LH_BUCKET_OLD_PAGE_SPLIT;
+ nopaque->hasho_flag &= ~LH_BUCKET_NEW_PAGE_SPLIT;
+
+ /*
+ * now mark the buffers dirty; we don't release the locks here, as the
+ * caller is responsible for releasing them.
+ */
+ MarkBufferDirty(bucket_obuf);
+ MarkBufferDirty(bucket_nbuf);
+}
+
+/*
+ * _hash_finish_split() -- Finish the previously interrupted split operation
+ *
+ * To complete the split operation, we build a hash table of the TIDs
+ * already present in the new bucket, which the split then uses to skip
+ * tuples that were moved before the split was interrupted.
+ *
+ * The caller must hold a pin, but no lock, on the metapage buffer.
+ * The buffer is returned in the same state. (The metapage is only
+ * touched if it becomes necessary to add or remove overflow pages.)
+ *
+ * 'obuf' and 'nbuf' must be locked by the caller which is also responsible
+ * for unlocking them.
+ */
+void
+_hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Buffer nbuf,
+ uint32 maxbucket, uint32 highmask, uint32 lowmask)
+{
+ HASHCTL hash_ctl;
+ HTAB *tidhtab;
+ Buffer bucket_nbuf;
+ Page opage;
+ Page npage;
+ HashPageOpaque opageopaque;
+ HashPageOpaque npageopaque;
+ Bucket obucket;
+ Bucket nbucket;
+ bool found;
+
+ /* Initialize hash tables used to track TIDs */
+ memset(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(ItemPointerData);
+ hash_ctl.entrysize = sizeof(ItemPointerData);
+ hash_ctl.hcxt = CurrentMemoryContext;
+
+ tidhtab =
+ hash_create("bucket ctids",
+ 256, /* arbitrary initial size */
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+ /*
+ * Scan the new bucket and build hash table of TIDs
+ */
+ bucket_nbuf = nbuf;
+ npage = BufferGetPage(nbuf);
+ npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+ for (;;)
+ {
+ BlockNumber nblkno;
+ OffsetNumber noffnum;
+ OffsetNumber nmaxoffnum;
+
+ /* Scan each tuple in new page */
+ nmaxoffnum = PageGetMaxOffsetNumber(npage);
+ for (noffnum = FirstOffsetNumber;
+ noffnum <= nmaxoffnum;
+ noffnum = OffsetNumberNext(noffnum))
+ {
+ IndexTuple itup;
+
+ /* Fetch the item's TID and insert it in hash table. */
+ itup = (IndexTuple) PageGetItem(npage,
+ PageGetItemId(npage, noffnum));
+
+ (void) hash_search(tidhtab, &itup->t_tid, HASH_ENTER, &found);
+
+ Assert(!found);
+ }
+
+ nblkno = npageopaque->hasho_nextblkno;
+
+ /*
+ * release our write lock without modifying the buffer, making sure
+ * to retain the pin on the primary bucket.
+ */
+ if (nbuf == bucket_nbuf)
+ _hash_chgbufaccess(rel, nbuf, HASH_READ, HASH_NOLOCK);
+ else
+ _hash_relbuf(rel, nbuf);
+
+ /* Exit loop if no more overflow pages in new bucket */
+ if (!BlockNumberIsValid(nblkno))
+ break;
+
+ /* Else, advance to next page */
+ nbuf = _hash_getbuf(rel, nblkno, HASH_READ, LH_OVERFLOW_PAGE);
+ npage = BufferGetPage(nbuf);
+ npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+ }
+
+ /* Need a cleanup lock to perform split operation. */
+ LockBufferForCleanup(bucket_nbuf);
+
+ npage = BufferGetPage(bucket_nbuf);
+ npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+ nbucket = npageopaque->hasho_bucket;
+
+ opage = BufferGetPage(obuf);
+ opageopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+ obucket = opageopaque->hasho_bucket;
+
+ _hash_splitbucket_guts(rel, metabuf, obucket,
+ nbucket, obuf, bucket_nbuf, tidhtab,
+ maxbucket, highmask, lowmask);
- _hash_squeezebucket(rel, obucket, start_oblkno, NULL);
+ hash_destroy(tidhtab);
}
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 4825558..6ec3bea 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -72,7 +72,19 @@ _hash_readnext(Relation rel,
BlockNumber blkno;
blkno = (*opaquep)->hasho_nextblkno;
- _hash_relbuf(rel, *bufp);
+
+ /*
+ * Retain the pin on the primary bucket page till the end of the scan to
+ * ensure that vacuum can't delete the tuples that were moved by a split
+ * to the new bucket. Such tuples are required by scans started on split
+ * buckets before the new bucket's split-in-progress flag
+ * (LH_BUCKET_NEW_PAGE_SPLIT) is cleared.
+ */
+ if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+ _hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+ else
+ _hash_relbuf(rel, *bufp);
+
*bufp = InvalidBuffer;
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
@@ -94,7 +106,16 @@ _hash_readprev(Relation rel,
BlockNumber blkno;
blkno = (*opaquep)->hasho_prevblkno;
- _hash_relbuf(rel, *bufp);
+
+ /*
+ * Retain the pin on the primary bucket page till the end of the scan.
+ * See comments in _hash_readnext for the reason to retain the pin.
+ */
+ if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+ _hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+ else
+ _hash_relbuf(rel, *bufp);
+
*bufp = InvalidBuffer;
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
@@ -104,6 +125,13 @@ _hash_readprev(Relation rel,
LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
*pagep = BufferGetPage(*bufp);
*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
+
+ /*
+ * We always maintain the pin on the bucket page for the whole scan
+ * operation, so release the additional pin we acquired here.
+ */
+ if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
+ _hash_dropbuf(rel, *bufp);
}
}
@@ -192,43 +220,81 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
metap = HashPageGetMeta(page);
/*
- * Loop until we get a lock on the correct target bucket.
+ * Conditionally get the lock on the primary bucket page for the search
+ * while holding the lock on the meta page. If we have to wait, release
+ * the meta page lock and retry the hard way.
*/
- for (;;)
- {
- /*
- * Compute the target bucket number, and convert to block number.
- */
- bucket = _hash_hashkey2bucket(hashkey,
- metap->hashm_maxbucket,
- metap->hashm_highmask,
- metap->hashm_lowmask);
+ bucket = _hash_hashkey2bucket(hashkey,
+ metap->hashm_maxbucket,
+ metap->hashm_highmask,
+ metap->hashm_lowmask);
- blkno = BUCKET_TO_BLKNO(metap, bucket);
+ blkno = BUCKET_TO_BLKNO(metap, bucket);
- /* Release metapage lock, but keep pin. */
+ /* Fetch the primary bucket page for the bucket */
+ buf = ReadBuffer(rel, blkno);
+ if (!ConditionalLockBufferShared(buf))
+ {
_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+ LockBuffer(buf, HASH_READ);
+ _hash_checkpage(rel, buf, LH_BUCKET_PAGE);
+ _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+ oldblkno = blkno;
+ retry = true;
+ }
+ else
+ {
+ _hash_checkpage(rel, buf, LH_BUCKET_PAGE);
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+ }
+ if (retry)
+ {
/*
- * If the previous iteration of this loop locked what is still the
- * correct target bucket, we are done. Otherwise, drop any old lock
- * and lock what now appears to be the correct bucket.
+ * Loop until we get a lock on the correct target bucket. We take the
+ * lock on the primary bucket page and retain the pin on it during the
+ * read operation to prevent concurrent splits: retaining a pin on the
+ * primary bucket page ensures that a split can't happen, as a split
+ * needs to acquire the cleanup lock on the primary bucket page. Locking
+ * the primary bucket and rechecking that it is still the target bucket
+ * is mandatory, as otherwise a concurrent split followed by vacuum
+ * could remove tuples from the selected bucket that would otherwise
+ * have been visible.
*/
- if (retry)
+ for (;;)
{
+ /*
+ * Compute the target bucket number, and convert to block number.
+ */
+ bucket = _hash_hashkey2bucket(hashkey,
+ metap->hashm_maxbucket,
+ metap->hashm_highmask,
+ metap->hashm_lowmask);
+
+ blkno = BUCKET_TO_BLKNO(metap, bucket);
+
+ /* Release metapage lock, but keep pin. */
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+ /*
+ * If the previous iteration of this loop locked what is still the
+ * correct target bucket, we are done. Otherwise, drop any old
+ * lock and lock what now appears to be the correct bucket.
+ */
if (oldblkno == blkno)
break;
- _hash_droplock(rel, oldblkno, HASH_SHARE);
- }
- _hash_getlock(rel, blkno, HASH_SHARE);
+ _hash_relbuf(rel, buf);
- /*
- * Reacquire metapage lock and check that no bucket split has taken
- * place while we were awaiting the bucket lock.
- */
- _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
- oldblkno = blkno;
- retry = true;
+ /* Fetch the primary bucket page for the bucket */
+ buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
+
+ /*
+ * Reacquire metapage lock and check that no bucket split has
+ * taken place while we were awaiting the bucket lock.
+ */
+ _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+ oldblkno = blkno;
+ }
}
/* done with the metapage */
@@ -237,14 +303,60 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
/* Update scan opaque state to show we have lock on the bucket */
so->hashso_bucket = bucket;
so->hashso_bucket_valid = true;
- so->hashso_bucket_blkno = blkno;
- /* Fetch the primary bucket page for the bucket */
- buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
+
page = BufferGetPage(buf);
opaque = (HashPageOpaque) PageGetSpecialPointer(page);
Assert(opaque->hasho_bucket == bucket);
+ so->hashso_bucket_buf = buf;
+
+ /*
+ * If a bucket split is in progress, then we need to skip tuples that
+ * were moved from the old bucket. To ensure that vacuum doesn't clean
+ * any tuples from the old or new bucket while this scan is in progress,
+ * maintain a pin on both buckets. Here we have to be careful about lock
+ * ordering: first acquire the lock on the old bucket, then release that
+ * lock (but not the pin), then re-acquire the lock on the new bucket and
+ * re-verify whether the bucket split is still in progress. Acquiring the
+ * lock on the old bucket first ensures that vacuum waits for this scan
+ * to finish.
+ */
+ if (opaque->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+ {
+ BlockNumber old_blkno;
+ Buffer old_buf;
+
+ old_blkno = _hash_get_oldblk(rel, opaque);
+
+ /*
+ * release the lock on new bucket and re-acquire it after acquiring
+ * the lock on old bucket.
+ */
+ _hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+
+ old_buf = _hash_getbuf(rel, old_blkno, HASH_READ, LH_BUCKET_PAGE);
+
+ /*
+ * remember the old bucket buffer so as to use it later for scanning.
+ */
+ so->hashso_old_bucket_buf = old_buf;
+ _hash_chgbufaccess(rel, old_buf, HASH_READ, HASH_NOLOCK);
+
+ _hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+ page = BufferGetPage(buf);
+ opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ Assert(opaque->hasho_bucket == bucket);
+
+ if (opaque->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+ so->hashso_skip_moved_tuples = true;
+ else
+ {
+ _hash_dropbuf(rel, so->hashso_old_bucket_buf);
+ so->hashso_old_bucket_buf = InvalidBuffer;
+ }
+ }
+
/* If a backwards scan is requested, move to the end of the chain */
if (ScanDirectionIsBackward(dir))
{
@@ -273,6 +385,13 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
* false. Else, return true and set the hashso_curpos for the
* scan to the right thing.
*
+ * Here we also scan the old bucket if a split for the current bucket
+ * was in progress at the start of the scan. The basic idea is to
+ * skip the tuples that were moved by the split while scanning the
+ * current bucket, and then scan the old bucket to cover all such
+ * tuples. This ensures that we don't miss any tuples in scans that
+ * started during the split.
+ *
* 'bufP' points to the current buffer, which is pinned and read-locked.
* On success exit, we have pin and read-lock on whichever page
* contains the right item; on failure, we have released all buffers.
@@ -338,6 +457,19 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
{
Assert(offnum >= FirstOffsetNumber);
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+ /*
+ * skip tuples that were moved by the split operation,
+ * for a scan that started while the split was in
+ * progress
+ */
+ if (so->hashso_skip_moved_tuples &&
+ (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
+ {
+ offnum = OffsetNumberNext(offnum); /* move forward */
+ continue;
+ }
+
if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
break; /* yes, so exit for-loop */
}
@@ -353,9 +485,42 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
}
else
{
- /* end of bucket */
- itup = NULL;
- break; /* exit for-loop */
+ /*
+ * end of bucket, scan old bucket if there was a split
+ * in progress at the start of scan.
+ */
+ if (so->hashso_skip_moved_tuples)
+ {
+ buf = so->hashso_old_bucket_buf;
+
+ /*
+ * old bucket buffer must be valid as we acquire
+ * the pin on it before the start of the scan and
+ * retain it till the end of the scan.
+ */
+ Assert(BufferIsValid(buf));
+
+ _hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+
+ page = BufferGetPage(buf);
+ opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ maxoff = PageGetMaxOffsetNumber(page);
+ offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+ /*
+ * setting hashso_skip_moved_tuples to false
+ * ensures that we don't check for moved-by-split
+ * tuples in the old bucket, and also that we
+ * won't try to scan the old bucket again once its
+ * scan is finished.
+ */
+ so->hashso_skip_moved_tuples = false;
+ }
+ else
+ {
+ itup = NULL;
+ break; /* exit for-loop */
+ }
}
}
break;
@@ -379,6 +544,19 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
{
Assert(offnum <= maxoff);
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+ /*
+ * skip tuples that were moved by the split operation,
+ * for a scan that started while the split was in
+ * progress
+ */
+ if (so->hashso_skip_moved_tuples &&
+ (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK))
+ {
+ offnum = OffsetNumberPrev(offnum); /* move back */
+ continue;
+ }
+
if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup))
break; /* yes, so exit for-loop */
}
@@ -394,9 +572,42 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
}
else
{
- /* end of bucket */
- itup = NULL;
- break; /* exit for-loop */
+ /*
+ * end of bucket, scan old bucket if there was a split
+ * in progress at the start of scan.
+ */
+ if (so->hashso_skip_moved_tuples)
+ {
+ buf = so->hashso_old_bucket_buf;
+
+ /*
+ * old bucket buffer must be valid as we acquire
+ * the pin on it before the start of the scan and
+ * retain it till the end of the scan.
+ */
+ Assert(BufferIsValid(buf));
+
+ _hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+
+ page = BufferGetPage(buf);
+ opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ maxoff = PageGetMaxOffsetNumber(page);
+ offnum = _hash_binsearch(page, so->hashso_sk_hash);
+
+ /*
+ * setting hashso_skip_moved_tuples to false
+ * ensures that we don't check for moved-by-split
+ * tuples in the old bucket, and also that we
+ * won't try to scan the old bucket again once its
+ * scan is finished.
+ */
+ so->hashso_skip_moved_tuples = false;
+ }
+ else
+ {
+ itup = NULL;
+ break; /* exit for-loop */
+ }
}
}
break;
@@ -410,9 +621,16 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
if (itup == NULL)
{
- /* we ran off the end of the bucket without finding a match */
+ /*
+ * We ran off the end of the bucket without finding a match.
+ * Release the pin on bucket buffers. Normally, such pins are
+ * released at the end of the scan; however, scrolling cursors
+ * can reacquire the bucket lock and pin multiple times within
+ * the same scan.
+ */
*bufP = so->hashso_curbuf = InvalidBuffer;
ItemPointerSetInvalid(current);
+ _hash_dropscanbuf(rel, so);
return false;
}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index 822862d..b5164d7 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -147,6 +147,23 @@ _hash_log2(uint32 num)
}
/*
+ * _hash_msb-- returns most significant bit position.
+ */
+static uint32
+_hash_msb(uint32 num)
+{
+ uint32 i = 0;
+
+ while (num)
+ {
+ num = num >> 1;
+ ++i;
+ }
+
+ return i - 1;
+}
+
+/*
* _hash_checkpage -- sanity checks on the format of all hash pages
*
* If flags is not zero, it is a bitwise OR of the acceptable values of
@@ -352,3 +369,123 @@ _hash_binsearch_last(Page page, uint32 hash_value)
return lower;
}
+
+/*
+ * _hash_get_oldblk() -- get the block number of the bucket from which
+ * the current bucket is being split.
+ */
+BlockNumber
+_hash_get_oldblk(Relation rel, HashPageOpaque opaque)
+{
+ Bucket curr_bucket;
+ Bucket old_bucket;
+ uint32 mask;
+ Buffer metabuf;
+ HashMetaPage metap;
+ BlockNumber blkno;
+
+ /*
+ * To get the old bucket from the current bucket, we need a mask to modulo
+ * into the lower half of the table. This mask is stored in the meta page
+ * as hashm_lowmask, but here we can't rely on it, because we need the
+ * value of lowmask that was in effect at the time the bucket split was
+ * started. Masking off the most significant bit of the new bucket gives
+ * us the old bucket.
+ */
+ curr_bucket = opaque->hasho_bucket;
+ mask = (((uint32) 1) << _hash_msb(curr_bucket)) - 1;
+ old_bucket = curr_bucket & mask;
+
+ metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+ metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+ blkno = BUCKET_TO_BLKNO(metap, old_bucket);
+
+ _hash_relbuf(rel, metabuf);
+
+ return blkno;
+}
+
+/*
+ * _hash_get_newblk() -- get the block number of the new bucket that will
+ * be generated by splitting the current bucket.
+ *
+ * This is used to find the new bucket from the old bucket based on the
+ * current table half. It is mainly required to finish incomplete splits,
+ * where we are sure that no more than one split from the old bucket can
+ * be in progress.
+ */
+BlockNumber
+_hash_get_newblk(Relation rel, HashPageOpaque opaque)
+{
+ Bucket curr_bucket;
+ Bucket new_bucket;
+ uint32 lowmask;
+ uint32 mask;
+ Buffer metabuf;
+ HashMetaPage metap;
+ BlockNumber blkno;
+
+ metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+ metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+ curr_bucket = opaque->hasho_bucket;
+
+ /*
+ * the new bucket can be obtained by OR'ing the old bucket with the most
+ * significant bit of the current table half. There could be multiple
+ * buckets that have split from the current bucket. We need the first
+ * such bucket that exists, based on the current table half.
+ */
+ lowmask = metap->hashm_lowmask;
+
+ for (;;)
+ {
+ mask = lowmask + 1;
+ new_bucket = curr_bucket | mask;
+ if (new_bucket > metap->hashm_maxbucket)
+ {
+ lowmask = lowmask >> 1;
+ continue;
+ }
+ blkno = BUCKET_TO_BLKNO(metap, new_bucket);
+ break;
+ }
+
+ _hash_relbuf(rel, metabuf);
+
+ return blkno;
+}
+
+/*
+ * _hash_get_newbucket() -- get the new bucket that will be generated by
+ * splitting the current bucket.
+ *
+ * This is used to find the new bucket from the old bucket. The new bucket
+ * can be obtained by OR'ing the old bucket with the most significant bit
+ * of the table half for the lowmask passed to this function. There could
+ * be multiple buckets that have split from the current bucket; we need the
+ * first such bucket that exists. The caller must ensure that no more than
+ * one split has happened from the old bucket.
+ */
+Bucket
+_hash_get_newbucket(Relation rel, Bucket curr_bucket,
+ uint32 lowmask, uint32 maxbucket)
+{
+ Bucket new_bucket;
+ uint32 mask;
+
+ for (;;)
+ {
+ mask = lowmask + 1;
+ new_bucket = curr_bucket | mask;
+ if (new_bucket > maxbucket)
+ {
+ lowmask = lowmask >> 1;
+ continue;
+ }
+ break;
+ }
+
+ return new_bucket;
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 90804a3..3e5b1d2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3567,6 +3567,26 @@ ConditionalLockBuffer(Buffer buffer)
}
/*
+ * Acquire the content_lock for the buffer, but only if we don't have to wait.
+ *
+ * This assumes the caller wants BUFFER_LOCK_SHARED mode.
+ */
+bool
+ConditionalLockBufferShared(Buffer buffer)
+{
+ BufferDesc *buf;
+
+ Assert(BufferIsValid(buffer));
+ if (BufferIsLocal(buffer))
+ return true; /* act as though we got it */
+
+ buf = GetBufferDescriptor(buffer - 1);
+
+ return LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
+ LW_SHARED);
+}
+
+/*
* LockBufferForCleanup - lock a buffer in preparation for deleting items
*
* Items may be deleted from a disk page only when the caller (a) holds an
@@ -3750,6 +3770,49 @@ ConditionalLockBufferForCleanup(Buffer buffer)
return false;
}
+/*
+ * CheckBufferForCleanup - as above, but don't attempt to take lock
+ *
+ * We won't loop, but just check once to see if the pin count is OK. If
+ * not, return FALSE.
+ */
+bool
+CheckBufferForCleanup(Buffer buffer)
+{
+ BufferDesc *bufHdr;
+ uint32 buf_state;
+
+ Assert(BufferIsValid(buffer));
+
+ if (BufferIsLocal(buffer))
+ {
+ /* There should be exactly one pin */
+ if (LocalRefCount[-buffer - 1] != 1)
+ return false;
+ /* Nobody else to wait for */
+ return true;
+ }
+
+ /* There should be exactly one local pin */
+ if (GetPrivateRefCount(buffer) != 1)
+ return false;
+
+ bufHdr = GetBufferDescriptor(buffer - 1);
+
+ buf_state = LockBufHdr(bufHdr);
+
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+ if (BUF_STATE_GET_REFCOUNT(buf_state) == 1)
+ {
+ /* pincount is OK. */
+ UnlockBufHdr(bufHdr, buf_state);
+ return true;
+ }
+
+ UnlockBufHdr(bufHdr, buf_state);
+ return false;
+}
+
/*
* Functions for buffer I/O handling
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index d9df904..bbf822b 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -24,6 +24,7 @@
#include "lib/stringinfo.h"
#include "storage/bufmgr.h"
#include "storage/lockdefs.h"
+#include "utils/hsearch.h"
#include "utils/relcache.h"
/*
@@ -32,6 +33,8 @@
*/
typedef uint32 Bucket;
+#define InvalidBucket ((Bucket) 0xFFFFFFFF)
+
#define BUCKET_TO_BLKNO(metap,B) \
((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_log2((B)+1)-1] : 0)) + 1)
@@ -51,6 +54,9 @@ typedef uint32 Bucket;
#define LH_BUCKET_PAGE (1 << 1)
#define LH_BITMAP_PAGE (1 << 2)
#define LH_META_PAGE (1 << 3)
+#define LH_BUCKET_NEW_PAGE_SPLIT (1 << 4)
+#define LH_BUCKET_OLD_PAGE_SPLIT (1 << 5)
+#define LH_BUCKET_PAGE_HAS_GARBAGE (1 << 6)
typedef struct HashPageOpaqueData
{
@@ -63,6 +69,12 @@ typedef struct HashPageOpaqueData
typedef HashPageOpaqueData *HashPageOpaque;
+#define H_HAS_GARBAGE(opaque) ((opaque)->hasho_flag & LH_BUCKET_PAGE_HAS_GARBAGE)
+#define H_OLD_INCOMPLETE_SPLIT(opaque) ((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT)
+#define H_NEW_INCOMPLETE_SPLIT(opaque) ((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
+#define H_INCOMPLETE_SPLIT(opaque) (((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT) || \
+ ((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT))
+
/*
* The page ID is for the convenience of pg_filedump and similar utilities,
* which otherwise would have a hard time telling pages of different index
@@ -87,12 +99,6 @@ typedef struct HashScanOpaqueData
bool hashso_bucket_valid;
/*
- * If we have a share lock on the bucket, we record it here. When
- * hashso_bucket_blkno is zero, we have no such lock.
- */
- BlockNumber hashso_bucket_blkno;
-
- /*
* We also want to remember which buffer we're currently examining in the
* scan. We keep the buffer pinned (but not locked) across hashgettuple
* calls, in order to avoid doing a ReadBuffer() for every tuple in the
@@ -100,11 +106,23 @@ typedef struct HashScanOpaqueData
*/
Buffer hashso_curbuf;
+ /* remember the buffer associated with primary bucket */
+ Buffer hashso_bucket_buf;
+
+ /*
+ * remember the buffer associated with the old primary bucket, which is
+ * required while scanning a bucket for which a split is in progress.
+ */
+ Buffer hashso_old_bucket_buf;
+
/* Current position of the scan, as an index TID */
ItemPointerData hashso_curpos;
/* Current position of the scan, as a heap TID */
ItemPointerData hashso_heappos;
+
+ /* Whether scan needs to skip tuples that are moved by split */
+ bool hashso_skip_moved_tuples;
} HashScanOpaqueData;
typedef HashScanOpaqueData *HashScanOpaque;
@@ -175,6 +193,8 @@ typedef HashMetaPageData *HashMetaPage;
sizeof(ItemIdData) - \
MAXALIGN(sizeof(HashPageOpaqueData)))
+#define INDEX_MOVED_BY_SPLIT_MASK 0x2000
+
#define HASH_MIN_FILLFACTOR 10
#define HASH_DEFAULT_FILLFACTOR 75
@@ -223,9 +243,6 @@ typedef HashMetaPageData *HashMetaPage;
#define HASH_WRITE BUFFER_LOCK_EXCLUSIVE
#define HASH_NOLOCK (-1)
-#define HASH_SHARE ShareLock
-#define HASH_EXCLUSIVE ExclusiveLock
-
/*
* Strategy number. There's only one valid strategy for hashing: equality.
*/
@@ -298,21 +315,21 @@ extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
Size itemsize, IndexTuple itup);
/* hashovfl.c */
-extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf);
+extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin);
extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf,
- BufferAccessStrategy bstrategy);
+ BlockNumber bucket_blkno, BufferAccessStrategy bstrategy);
extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
BlockNumber blkno, ForkNumber forkNum);
extern void _hash_squeezebucket(Relation rel,
Bucket bucket, BlockNumber bucket_blkno,
+ Buffer bucket_buf,
BufferAccessStrategy bstrategy);
/* hashpage.c */
-extern void _hash_getlock(Relation rel, BlockNumber whichlock, int access);
-extern bool _hash_try_getlock(Relation rel, BlockNumber whichlock, int access);
-extern void _hash_droplock(Relation rel, BlockNumber whichlock, int access);
extern Buffer _hash_getbuf(Relation rel, BlockNumber blkno,
int access, int flags);
+extern Buffer _hash_getbuf_with_condlock_cleanup(Relation rel,
+ BlockNumber blkno, int flags);
extern Buffer _hash_getinitbuf(Relation rel, BlockNumber blkno);
extern Buffer _hash_getnewbuf(Relation rel, BlockNumber blkno,
ForkNumber forkNum);
@@ -321,6 +338,7 @@ extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
BufferAccessStrategy bstrategy);
extern void _hash_relbuf(Relation rel, Buffer buf);
extern void _hash_dropbuf(Relation rel, Buffer buf);
+extern void _hash_dropscanbuf(Relation rel, HashScanOpaque so);
extern void _hash_wrtbuf(Relation rel, Buffer buf);
extern void _hash_chgbufaccess(Relation rel, Buffer buf, int from_access,
int to_access);
@@ -328,6 +346,9 @@ extern uint32 _hash_metapinit(Relation rel, double num_tuples,
ForkNumber forkNum);
extern void _hash_pageinit(Page page, Size size);
extern void _hash_expandtable(Relation rel, Buffer metabuf);
+extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
+ Buffer nbuf, uint32 maxbucket, uint32 highmask,
+ uint32 lowmask);
/* hashscan.c */
extern void _hash_regscan(IndexScanDesc scan);
@@ -363,5 +384,17 @@ extern bool _hash_convert_tuple(Relation index,
Datum *index_values, bool *index_isnull);
extern OffsetNumber _hash_binsearch(Page page, uint32 hash_value);
extern OffsetNumber _hash_binsearch_last(Page page, uint32 hash_value);
+extern BlockNumber _hash_get_oldblk(Relation rel, HashPageOpaque opaque);
+extern BlockNumber _hash_get_newblk(Relation rel, HashPageOpaque opaque);
+extern Bucket _hash_get_newbucket(Relation rel, Bucket curr_bucket,
+ uint32 lowmask, uint32 maxbucket);
+
+/* hash.c */
+extern void hashbucketcleanup(Relation rel, Buffer bucket_buf,
+ BlockNumber bucket_blkno, BufferAccessStrategy bstrategy,
+ uint32 maxbucket, uint32 highmask, uint32 lowmask,
+ double *tuples_removed, double *num_index_tuples,
+ bool bucket_has_garbage, bool delay,
+ IndexBulkDeleteCallback callback, void *callback_state);
#endif /* HASH_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 7b6ba96..accbb88 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -225,8 +225,10 @@ extern void MarkBufferDirtyHint(Buffer buffer, bool buffer_std);
extern void UnlockBuffers(void);
extern void LockBuffer(Buffer buffer, int mode);
extern bool ConditionalLockBuffer(Buffer buffer);
+extern bool ConditionalLockBufferShared(Buffer buffer);
extern void LockBufferForCleanup(Buffer buffer);
extern bool ConditionalLockBufferForCleanup(Buffer buffer);
+extern bool CheckBufferForCleanup(Buffer buffer);
extern bool HoldingBufferPinThatDelaysRecovery(void);
extern void AbortBufferIO(void);
On 09/12/2016 10:42 PM, Amit Kapila wrote:
The following script hangs on idx_val creation - just with v5, WAL patch
not applied.

Are you sure it is actually hanging? I see 100% cpu for a few minutes but
the index eventually completes ok for me (v5 patch applied to today's
master).

It completed for me as well. The second index creation is taking more
time and cpu, because it is just inserting duplicate values which need
a lot of overflow pages.

Yeah, sorry for the false alarm. It just took 3m45s to complete on my
machine.
Best regards,
Jesper
On Thu, Sep 8, 2016 at 12:32 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Hmm. I think page or block is a concept of database systems and
buckets is a general concept used in hashing technology. I think the
difference is that there are primary buckets and overflow buckets. I
have checked how they are referred in one of the wiki pages [1],
search for overflow on that wiki page. Now, I think we shouldn't be
inconsistent in using them. I will change to make it same if I find
any inconsistency based on what you or other people think is the
better way to refer overflow space.
In the existing source code, the terminology 'overflow page' is
clearly preferred to 'overflow bucket'.
[rhaas pgsql]$ git grep 'overflow page' | wc -l
75
[rhaas pgsql]$ git grep 'overflow bucket' | wc -l
1
In our off-list conversations, I too have found it very confusing when
you've made reference to an overflow bucket. A hash table has a fixed
number of buckets, and depending on the type of hash table the storage
for each bucket may be linked together into some kind of a chain;
here, a chain of pages. The 'bucket' logically refers to all of the
entries that have hash codes such that (hc % nbuckets) == bucketno,
regardless of which pages contain them.
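To make that concrete, here is a minimal sketch of the bucket relation described above (a deliberate simplification, not the server's actual bucket-mapping code, which also has to cope with the in-between states of a split via maxbucket/highmask/lowmask):

#include <stdint.h>

/*
 * Sketch: the bucket is a function of the hash code alone.  Every tuple
 * whose hash code satisfies (hc % nbuckets) == bucketno belongs to bucket
 * bucketno, regardless of whether it currently sits on the bucket's
 * primary page or on one of its overflow pages.
 */
static inline uint32_t
bucket_for_hash(uint32_t hc, uint32_t nbuckets)
{
    return hc % nbuckets;
}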
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Sep 7, 2016 at 9:32 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Sep 7, 2016 at 11:49 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Thu, Sep 1, 2016 at 8:55 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
I have fixed all other issues you have raised. Updated patch is
attached with this mail.

I am finding the comments (particularly README) quite hard to follow. There
are many references to an "overflow bucket", or similar phrases. I think
these should be "overflow pages". A bucket is a conceptual thing consisting
of a primary page for that bucket and zero or more overflow pages for the
same bucket. There are no overflow buckets, unless you are referring to the
new bucket to which things are being moved.
Hmm. I think page or block is a concept of database systems and
buckets is a general concept used in hashing technology. I think the
difference is that there are primary buckets and overflow buckets. I
have checked how they are referred in one of the wiki pages [1],
search for overflow on that wiki page.
That page seems to use "slot" to refer to the primary bucket/page and all
the overflow buckets/pages which cover the same post-masked values. I
don't think that would be an improvement for us, because "slot" is already
pretty well-used for other things. Their use of "bucket" does seem to be
mostly the same as "page" (or maybe "buffer" or "block"?) but I don't think
we gain anything from creating yet another synonym for page/buffer/block.
I think the easiest thing would be to keep using the meanings which the
existing committed code uses, so that we at least have internal consistency.
Now, I think we shouldn't be
inconsistent in using them. I will change to make it same if I find
any inconsistency based on what you or other people think is the
better way to refer overflow space.
I think just "overflow page" or "buffer containing the overflow page".
Here are some more notes I've taken, mostly about the README and comments.
It took me a while to understand that once a tuple is marked as moved by
split, it stays that way forever. It doesn't mean "recently moved by
split", but "ever moved by split". Which works, but is rather subtle.
Perhaps this deserves a parenthetical comment in the README the first time
the flag is mentioned.
========
#define INDEX_SIZE_MASK 0x1FFF
/* bit 0x2000 is not used at present */
This is no longer true, maybe:
/* bit 0x2000 is reserved for index-AM specific usage */
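For what it's worth, the flag lives in the index tuple header's t_info word; a rough sketch of how the new bit might be set and tested follows (only the mask value and the "never cleared" semantics come from the patch and the discussion here; the helper names are made up):

#include "postgres.h"
#include "access/itup.h"

#define INDEX_MOVED_BY_SPLIT_MASK 0x2000    /* from the hash.h hunk above */

/* Hypothetical helpers, shown only to illustrate the flag's semantics. */
static inline void
itup_mark_moved_by_split(IndexTuple itup)
{
    itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;
}

static inline bool
itup_moved_by_split(IndexTuple itup)
{
    /* Once set, the bit stays set: "ever moved by split", not "recently". */
    return (itup->t_info & INDEX_MOVED_BY_SPLIT_MASK) != 0;
}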
========
Note that this is designed to allow concurrent splits and scans. If a
split occurs, tuples relocated into the new bucket will be visited twice
by the scan, but that does no harm. As we are releasing the locks during
scan of a bucket, it will allow concurrent scan to start on a bucket and
ensures that scan will always be behind cleanup.
Above, the abrupt transition from splits (first sentence) to cleanup is
confusing. If the cleanup referred to is vacuuming, it should be a new
paragraph or at least have a transition sentence. Or is it referring to
clean-up locks used for control purposes, rather than for actual vacuum
clean-up? I think it is the first one, the vacuum. (I find the committed
version of this comment confusing as well--how in the committed code would
a tuple be visited twice, and why does that not do harm in the committed
coding? So maybe the issue here is me, not the comment.)
=======
+Vacuum acquires cleanup lock on bucket to remove the dead tuples and or tuples
+that are moved due to split. The need for cleanup lock to remove dead tuples
+is to ensure that scans' returns correct results. Scan that returns multiple
+tuples from the same bucket page always restart the scan from the previous
+offset number from which it has returned last tuple.
Perhaps it would be better to teach scans to restart anywhere on the page,
than to force more cleanup locks to be taken?
=======
This comment no longer seems accurate (as long as it is just an ERROR and
not a PANIC):
* XXX we have a problem here if we fail to get space for a
* new overflow page: we'll error out leaving the bucket split
* only partially complete, meaning the index is corrupt,
* since searches may fail to find entries they should find.
The split will still be marked as being in progress, so any scanner will
have to scan the old page and see the tuple there.
========
in _hash_splitbucket comments, this needs updating:
* The caller must hold exclusive locks on both buckets to ensure that
* no one else is trying to access them (see README).
The true prereq here is a buffer clean up lock (pin plus exclusive buffer
content lock), correct?
And then:
* Split needs to hold pin on primary bucket pages of both old and new
* buckets till end of operation.
'retain' is probably better than 'hold', to emphasize that we are dropping
the buffer content lock part of the clean-up lock, but that the pin part of
it is kept continuously (this also matches the variable name used in the
code). Also, the paragraph after that one seems to be obsolete and
contradictory with the newly added comments.
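As an illustration of the retain-the-pin pattern being described (a sketch using the stock buffer-manager calls; the function name is invented, and bucket_buf is assumed to be already pinned by the caller):

#include "postgres.h"
#include "storage/bufmgr.h"

static void
work_then_retain_pin(Buffer bucket_buf)
{
    /* Cleanup lock: keep the existing pin and add an exclusive content
     * lock, after waiting for every other pin to go away. */
    LockBufferForCleanup(bucket_buf);

    /* ... operate on the primary bucket page ... */

    /* Drop only the content lock; the pin is retained for the rest of
     * the operation, which is what keeps vacuum at bay. */
    LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
}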
===========
/*
 * Acquiring cleanup lock to clear the split-in-progress flag ensures that
 * there is no pending scan that has seen the flag after it is cleared.
 */
But, we are not acquiring a clean up lock. We already have a pin, and we
do acquire a write buffer-content lock, but don't observe that our pin is
the only one. I don't see why it is necessary to have a clean up lock
(what harm is done if an under-way scan thinks it is scanning a bucket that
is being split when it actually just finished the split?), but if it is
necessary then I think this code is wrong. If not necessary, the comment
is wrong.
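To spell out the difference being debated, here is a sketch (CheckBufferForCleanup is taken from the bufmgr.h hunk above; the way it is combined here, and its exact semantics, are my assumption, not the patch's code):

#include "postgres.h"
#include "storage/bufmgr.h"

static void
clear_split_in_progress(Buffer bucket_buf)
{
    /* A plain write lock: exclusive content lock, with no statement about
     * whether other backends still hold pins on the page. */
    LockBuffer(bucket_buf, BUFFER_LOCK_EXCLUSIVE);

    /* A true cleanup lock would additionally require that our pin is the
     * only one; presumably that is what the new CheckBufferForCleanup()
     * is meant to verify. */
    Assert(CheckBufferForCleanup(bucket_buf));

    /* ... clear the split-in-progress flag on the page ... */
    LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
}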
Also, why must we hold a write lock on both old and new primary bucket
pages simultaneously? Is this in anticipation of the WAL patch? The
contract for the function does say that it returns both pages write locked,
but I don't see a reason for that part of the contract at the moment.
=========
To avoid deadlock between readers and inserters, whenever there is a need
to lock multiple buckets, we always take in the order suggested in Locking
Definitions above. This algorithm allows them a very high degree of
concurrency.
The section referred to is actually spelled "Lock Definitions", no "ing".
The Lock Definitions section doesn't mention the meta page at all. I
think there needs to be something added to it about how the meta page gets
locked and why that is deadlock free. (But we could be optimistic and
assume the patch to implement caching of the metapage will go in and will
take care of that).
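For reference, the usual shape of such an ordering rule is simply to acquire the two locks in a globally consistent order (a sketch; the block-number ordering key and the helper name are illustrative, not taken from the patch):

#include "postgres.h"
#include "storage/bufmgr.h"

static void
lock_two_buckets_in_order(Buffer a, Buffer b)
{
    /* Lock the lower-numbered block first, so no two backends can
     * ever wait on each other's second lock. */
    if (BufferGetBlockNumber(a) < BufferGetBlockNumber(b))
    {
        LockBuffer(a, BUFFER_LOCK_EXCLUSIVE);
        LockBuffer(b, BUFFER_LOCK_EXCLUSIVE);
    }
    else
    {
        LockBuffer(b, BUFFER_LOCK_EXCLUSIVE);
        LockBuffer(a, BUFFER_LOCK_EXCLUSIVE);
    }
}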
=========
And an operational question on this: A lot of stuff is done conditionally
here. Under high concurrency, do splits ever actually occur? It seems
like they could easily be permanently starved.
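The conditional path in question presumably looks something like this (a sketch built around the _hash_getbuf_with_condlock_cleanup declaration in the hunk above; the body is my guess at its shape, not the patch's code). Because the caller simply backs off when the cleanup lock isn't immediately available, a bucket that is continuously pinned by scans could indeed keep deferring its split:

#include "postgres.h"
#include "storage/bufmgr.h"

static Buffer
try_get_bucket_for_split(Relation rel, BlockNumber blkno)
{
    Buffer buf = ReadBuffer(rel, blkno);        /* acquires a pin */

    if (!ConditionalLockBufferForCleanup(buf))
    {
        /* Another backend holds a pin; give up rather than wait. */
        ReleaseBuffer(buf);
        return InvalidBuffer;
    }
    return buf;                                 /* pinned and cleanup-locked */
}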
Cheers,
Jeff
On 09/13/2016 07:26 AM, Amit Kapila wrote:
Attached, new version of patch which contains the fix for problem
reported on write-ahead-log of hash index thread [1].
I have been testing patch in various scenarios, and it has a positive
performance impact in some cases.
This is especially seen in cases where the values of the indexed column
are unique - SELECTs can see a 40-60% benefit over a similar query using
b-tree. UPDATE also sees an improvement.
In cases where the indexed column value isn't unique, it takes a long
time to build the index due to the overflow page creation.
Also in cases where the index column is updated with a high number of
clients, ala
-- ddl.sql --
CREATE TABLE test AS SELECT generate_series(1, 10) AS id, 0 AS val;
CREATE INDEX IF NOT EXISTS idx_id ON test USING hash (id);
CREATE INDEX IF NOT EXISTS idx_val ON test USING hash (val);
ANALYZE;
-- test.sql --
\set id random(1,10)
\set val random(0,10)
BEGIN;
UPDATE test SET val = :val WHERE id = :id;
COMMIT;
w/ 100 clients - it takes longer than the b-tree counterpart (2921 tps
for hash, and 10062 tps for b-tree).
Jeff mentioned upthread the idea of moving the lock to a bucket meta
page instead of having it on the main meta page. Likely a question for
the assigned committer.
Thanks for working on this !
Best regards,
Jesper
On Tue, Sep 13, 2016 at 5:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Sep 8, 2016 at 12:32 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Hmm. I think page or block is a concept of database systems and
buckets is a general concept used in hashing technology. I think the
difference is that there are primary buckets and overflow buckets. I
have checked how they are referred in one of the wiki pages [1],
search for overflow on that wiki page. Now, I think we shouldn't be
inconsistent in using them. I will change to make it same if I find
any inconsistency based on what you or other people think is the
better way to refer overflow space.

In the existing source code, the terminology 'overflow page' is
clearly preferred to 'overflow bucket'.
Okay, point taken. Will update it in the next version of the patch.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Sep 14, 2016 at 12:29 AM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
On 09/13/2016 07:26 AM, Amit Kapila wrote:
Attached, new version of patch which contains the fix for problem
reported on write-ahead-log of hash index thread [1].

I have been testing patch in various scenarios, and it has a positive
performance impact in some cases.

This is especially seen in cases where the values of the indexed column are
unique - SELECTs can see a 40-60% benefit over a similar query using b-tree.
Here, I think it is better if we have the data comparing the situation
of hash index with respect to HEAD as well. What I mean to say is
that you are claiming that after the hash index improvements SELECT
workload is 40-60% better, but where do we stand as of HEAD?
UPDATE also sees an improvement.
Can you explain this more? Is the improvement compared to HEAD or
compared to Btree? Isn't this contradictory to what the test in the
mail below shows?
In cases where the indexed column value isn't unique, it takes a long time
to build the index due to the overflow page creation.

Also in cases where the index column is updated with a high number of
clients, ala

-- ddl.sql --
CREATE TABLE test AS SELECT generate_series(1, 10) AS id, 0 AS val;
CREATE INDEX IF NOT EXISTS idx_id ON test USING hash (id);
CREATE INDEX IF NOT EXISTS idx_val ON test USING hash (val);
ANALYZE;

-- test.sql --
\set id random(1,10)
\set val random(0,10)
BEGIN;
UPDATE test SET val = :val WHERE id = :id;
COMMIT;

w/ 100 clients - it takes longer than the b-tree counterpart (2921 tps for
hash, and 10062 tps for b-tree).
Thanks for doing the tests. Have you applied both the concurrent index
and the cache-the-meta-page patch for these tests? So from the above tests,
we can say that after this set of patches read-only workloads will be
significantly improved, even better than btree in quite a few useful
cases. However, when the indexed column is updated, there is still a
large gap compared to btree (what about the case when the indexed
column is not updated in a read-write transaction, as in our pgbench
read-write transactions - by any chance did you run any such test?). I
think we need to focus on improving cases where index columns are
updated, but it is better to do that work as a separate patch.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hi,
On 09/14/2016 07:24 AM, Amit Kapila wrote:
On Wed, Sep 14, 2016 at 12:29 AM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
On 09/13/2016 07:26 AM, Amit Kapila wrote:

Attached, new version of patch which contains the fix for problem
reported on write-ahead-log of hash index thread [1].

I have been testing patch in various scenarios, and it has a positive
performance impact in some cases.

This is especially seen in cases where the values of the indexed column are
unique - SELECTs can see a 40-60% benefit over a similar query using b-tree.

Here, I think it is better if we have the data comparing the situation
of hash index with respect to HEAD as well. What I mean to say is
that you are claiming that after the hash index improvements SELECT
workload is 40-60% better, but where do we stand as of HEAD?
The tests I have done are with a copy of a production database using the
same queries sent with a b-tree index for the primary key, and the same
with a hash index. Those are seeing a speed-up of the mentioned 40-60%
in execution time - some involve JOINs.
Largest of those tables is 390Mb with a CHAR() based primary key.
UPDATE also sees an improvement.
Can you explain this more? Is the improvement compared to HEAD or
compared to Btree? Isn't this contradictory to what the test in the
mail below shows?
Same thing here - where the fields involving the hash index aren't updated.
In cases where the indexed column value isn't unique, it takes a long time
to build the index due to the overflow page creation.

Also in cases where the index column is updated with a high number of
clients, ala

-- ddl.sql --
CREATE TABLE test AS SELECT generate_series(1, 10) AS id, 0 AS val;
CREATE INDEX IF NOT EXISTS idx_id ON test USING hash (id);
CREATE INDEX IF NOT EXISTS idx_val ON test USING hash (val);
ANALYZE;

-- test.sql --
\set id random(1,10)
\set val random(0,10)
BEGIN;
UPDATE test SET val = :val WHERE id = :id;
COMMIT;

w/ 100 clients - it takes longer than the b-tree counterpart (2921 tps for
hash, and 10062 tps for b-tree).

Thanks for doing the tests. Have you applied both the concurrent index
and the cache-the-meta-page patch for these tests? So from the above tests,
we can say that after this set of patches read-only workloads will be
significantly improved, even better than btree in quite a few useful
cases.
Agreed.
However, when the indexed column is updated, there is still a
large gap compared to btree (what about the case when the indexed
column is not updated in a read-write transaction, as in our pgbench
read-write transactions - by any chance did you run any such test?).
I have done a run to look at the concurrency / TPS aspect of the
implementation - to try something different than Mark's work on testing
the pgbench setup.
With definitions as above, with SELECT as
-- select.sql --
\set id random(1,10)
BEGIN;
SELECT * FROM test WHERE id = :id;
COMMIT;
and UPDATE/Indexed with an index on 'val', and finally UPDATE/Nonindexed
w/o one.
Hash runs are with [1] applied on master; btree is master too.
Machine is a 28C/56T with 256Gb RAM with 2 x RAID10 SSD for data + wal.
Clients ran with -M prepared.
[1]: /messages/by-id/CAA4eK1+ERbP+7mdKkAhJZWQ_dTdkocbpt7LSWFwCQvUHBXzkmA@mail.gmail.com
[2]: /messages/by-id/CAD__OujvYghFX_XVkgRcJH4VcEbfJNSxySd9x=1Wp5VyLvkf8Q@mail.gmail.com
[3]: /messages/by-id/CAA4eK1JUYr_aB7BxFnSg5+JQhiwgkLKgAcFK9bfD4MLfFK6Oqw@mail.gmail.com
Don't know if you find this useful due to the small number of rows, but
let me know if there are other tests I can run, f.ex. bump the number of
rows.
I think we need to focus on improving cases where index columns are
updated, but it is better to do that work as a separate patch.
Ok.
Best regards,
Jesper
Attachments:
select.pngimage/png; name=select.pngDownload
�PNG
IHDR ^ T z0�* pHYs � ���}� D
IDATx��� \���m�"D< ������xDIG�fV��)b���)�G�����x"��i)��"�%������&��
>�W����w�����|���CS*�2 �?�t7 �Z� r� r� r� r� r� r� r� r� r� r� r� r� r� r� r� r� r� r� M�n�����g�����8��o������e2�/��T*�V��������O�_�������w�0a�������{ ��!� ""���^�z���#I.>}����ss��)//�1c�NsB���
��kWpp���b����?���o=zt����o�@@�� M@�gd����b��=�L�bii��O?����\�vm�����_711�*�����9r�DcK7 �N �@ t���N�L�N�:�
��y���� Y.���;���@44I&''��+W�3FKK��������-k�h�� M0o�<�D���S\\<l�0����o��V�un�����mnn���y����A44�����������+W�$&&�?���:22����Z��6�!�I���� ��d$�l�}��W"������9sf+��H��>0`@K?@;�hh����O������+:w��
���;�=z���� M��� ���r�������?���>����}yK IKK��}{+� � � 00p�������iaaaBB����/^�
�>r�H///2w\�z�� ��555566v��
l� � D#@899=zt����������������6m�8q�l��M��M����
[�l!����_r�\33��c�^�r����O � ���j/�����6�Bc6m%��Y��k�(q �� � � � � � � � � � � � � � � � � � G��1<<|��)R������O�g���e�2�p8^^^III����������/ @����Hq��]�+$����,--k������BCC����_ �vK�����cC�=���B������j���111l6���������f ��R�h 'KY�D�������mmm���caaA�fffd`nn���N���"