Microvacuum support for Hash Index
Hi All,
I have added microvacuum support for the hash index access method, and
attached is the v1 patch for the same. The patch basically takes care
of the following things:
1. Firstly, it changes the marking of dead tuples from
'tuple-at-a-time' to 'page-at-a-time' during a hash index scan. For this
we accumulate the heap TIDs and offsets of all the hash index tuples
flagged by kill_prior_tuple during the scan, and then mark all the
accumulated items as LP_DEAD either while stepping from one page to
the next (for scans in both the forward and backward direction), at the
end of the hash index scan, or during a rescan. A condensed sketch of
this accumulation is shown right after this list.
2. Secondly, when inserting a tuple into the hash index, if not enough
space is found on the current page, it ensures that we first remove any
dead tuples found on that page before moving to the next page in the
bucket chain or going for a bucket split. This increases page
reusability and reduces the number of page splits, thereby reducing the
overall size of the hash index.
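A condensed sketch of the accumulation in hashgettuple() (see the
attached patch for the full context; HashScanPosItem is a small struct
added by the patch that stores the heap TID and index offset of a
killed item):

    if (scan->kill_prior_tuple)
    {
        /* Remember the item; it is marked LP_DEAD later, in bulk. */
        if (so->killedItems == NULL)
            so->killedItems = palloc(MaxIndexTuplesPerPage *
                                     sizeof(HashScanPosItem));

        if (so->numKilled < MaxIndexTuplesPerPage)
        {
            so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
            so->killedItems[so->numKilled].indexOffset =
                ItemPointerGetOffsetNumber(&(so->hashso_curpos));
            so->numKilled++;
        }
    }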
I have compared the hash index size with and without my patch
(microvacuum_hash_index_v1.patch, attached with this mail) on a
high-end machine at various scale factors, and the results are shown
below. For this test, I created a hash index (pgbench_accounts_aid) on
the aid column of the 'pgbench_accounts' table instead of the primary
key, and the results shown in the table below are for that setup. The
patch (pgbench.patch) containing these changes is also attached with
this mail. Moreover, I am using my own script file
(file_hash_kill_prior_tuple), also attached, which updates the indexed
column through pgbench's read-write workload.
Here are some initial test results showing the benefit of this patch:
postgresql.conf and pgbench settings:
autovacuum=off
client count = 64
run duration = 15 mins
./pgbench -c $threads -j $threads -T 900 postgres -f ~/file_hash_kill_prior_tuple
Scale Factor    Hash index size at start    HEAD        HEAD + Patch
10              32 MB                       579 MB      158 MB
50              128 MB                      630 MB      350 MB
100             256 MB                      1255 MB     635 MB
300             1024 MB                     2233 MB     1093 MB
As shown in the results above, at scale factor 10 the hash index ends
up almost 4 times smaller with my patch, whereas at scale factors 50,
100 and 300 it is roughly half the size. This shows that we can reduce
the hash index size to a good extent with this patch.
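For reference, the sizes above refer to the on-disk size of the index;
an easy way to check it is with the standard size functions, e.g.:

    SELECT pg_size_pretty(pg_relation_size('pgbench_accounts_aid'));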
System specifications:
---------------------------------
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 8
NUMA node(s): 8
Vendor ID: GenuineIntel
Note: The patch (microvacuum_hash_index_v1.patch) is prepared on top
of concurrent_hash_index_v8.patch [1] and wal_hash_index_v5.1.patch [2]
for hash index.
[1]: /messages/by-id/CAA4eK1+X=8sUd1UCZDZnE3D9CGi9kw+kjxp2Tnw7SX5w8pLBNw@mail.gmail.com
[2]: /messages/by-id/CAA4eK1KE=+kkowyYD0vmch=ph4ND3H1tViAB+0cWTHqjZDDfqg@mail.gmail.com
Attachments:
microvacuum_hash_index_v1.patch (text/x-patch)
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index db73f05..a0720ef 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -325,14 +325,21 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
if (scan->kill_prior_tuple)
{
/*
- * Yes, so mark it by setting the LP_DEAD state in the item flags.
+ * Yes, so remember it for later. (We'll deal with all such
+ * tuples at once right after leaving the index page or at
+ * end of scan.)
*/
- ItemIdMarkDead(PageGetItemId(page, offnum));
+ if (so->killedItems == NULL)
+ so->killedItems = palloc(MaxIndexTuplesPerPage *
+ sizeof(HashScanPosItem));
- /*
- * Since this can be redone later if needed, mark as a hint.
- */
- MarkBufferDirtyHint(buf, true);
+ if (so->numKilled < MaxIndexTuplesPerPage)
+ {
+ so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
+ so->killedItems[so->numKilled].indexOffset =
+ ItemPointerGetOffsetNumber(&(so->hashso_curpos));
+ so->numKilled++;
+ }
}
/*
@@ -439,6 +446,9 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
so->hashso_skip_moved_tuples = false;
+ so->killedItems = NULL;
+ so->numKilled = 0;
+
scan->opaque = so;
return scan;
@@ -454,6 +464,10 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
HashScanOpaque so = (HashScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ hashkillitems(scan);
+
_hash_dropscanbuf(rel, so);
/* set position invalid (this will cause _hash_first call) */
@@ -480,6 +494,10 @@ hashendscan(IndexScanDesc scan)
HashScanOpaque so = (HashScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ hashkillitems(scan);
+
_hash_dropscanbuf(rel, so);
pfree(so);
@@ -809,6 +827,15 @@ hashbucketcleanup(Relation rel, Buffer bucket_buf,
PageIndexMultiDelete(page, deletable, ndeletable);
bucket_dirty = true;
+ /*
+ * Let us mark the page as clean if vacuum removes the DEAD tuples
+ * from an index page. We do this by clearing LH_PAGE_HAS_DEAD_TUPLES
+ * flag.
+ */
+ if (tuples_removed && *tuples_removed > 0 &&
+ opaque->hasho_flag & LH_PAGE_HAS_DEAD_TUPLES)
+ opaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
+
MarkBufferDirty(buf);
/* XLOG stuff */
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index d030a8d..5f2bc7c 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -921,6 +921,82 @@ hash_xlog_update_meta_page(XLogReaderState *record)
UnlockReleaseBuffer(metabuf);
}
+/*
+ * replay delete operation in hash index to remove
+ * tuples marked as DEAD during index tuple insertion.
+ */
+static void
+hash_xlog_vacuum_one_page(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ xl_hash_vacuum *xldata = (xl_hash_vacuum *) XLogRecGetData(record);
+ Buffer bucketbuf = InvalidBuffer;
+ Buffer buffer;
+ Buffer metabuf;
+ Page page;
+ XLogRedoAction action;
+
+ if (xldata->is_primary_bucket_page)
+ action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &buffer);
+ else
+ {
+ RelFileNode rnode;
+ BlockNumber blkno;
+
+ XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+ bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+ RBM_NORMAL);
+
+ if (BufferIsValid(bucketbuf))
+ LockBufferForCleanup(bucketbuf);
+
+ action = XLogReadBufferForRedo(record, 1, &buffer);
+ }
+
+ if (action == BLK_NEEDS_REDO)
+ {
+ char *ptr;
+ Size len;
+
+ ptr = XLogRecGetBlockData(record, 1, &len);
+
+ page = (Page) BufferGetPage(buffer);
+
+ if (len > 0)
+ {
+ OffsetNumber *unused;
+ OffsetNumber *unend;
+
+ unused = (OffsetNumber *) ptr;
+ unend = (OffsetNumber *) ((char *) ptr + len);
+
+ if ((unend - unused) > 0)
+ PageIndexMultiDelete(page, unused, unend - unused);
+ }
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ }
+ if (BufferIsValid(buffer))
+ UnlockReleaseBuffer(buffer);
+
+ if (XLogReadBufferForRedo(record, 2, &metabuf) == BLK_NEEDS_REDO)
+ {
+ Page metapage;
+ HashMetaPage metap;
+
+ metapage = BufferGetPage(metabuf);
+ metap = HashPageGetMeta(metapage);
+
+ metap->hashm_ntuples -= xldata->ntuples;
+
+ PageSetLSN(metapage, lsn);
+ MarkBufferDirty(metabuf);
+ }
+ if (BufferIsValid(metabuf))
+ UnlockReleaseBuffer(metabuf);
+}
+
void
hash_redo(XLogReaderState *record)
{
@@ -964,6 +1040,9 @@ hash_redo(XLogReaderState *record)
case XLOG_HASH_UPDATE_META_PAGE:
hash_xlog_update_meta_page(record);
break;
+ case XLOG_HASH_VACUUM_ONE_PAGE:
+ hash_xlog_vacuum_one_page(record);
+ break;
default:
elog(PANIC, "hash_redo: unknown op code %u", info);
}
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 3514138..1bcb214 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -19,7 +19,11 @@
#include "access/hash_xlog.h"
#include "miscadmin.h"
#include "utils/rel.h"
+#include "storage/lwlock.h"
+#include "storage/buf_internals.h"
+static void _hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf,
+ Buffer bucket_buf, bool is_primary_bucket_page);
/*
* _hash_doinsert() -- Handle insertion of a single index tuple.
@@ -206,6 +210,28 @@ _hash_doinsert(Relation rel, IndexTuple itup)
while (PageGetFreeSpace(page) < itemsz)
{
/*
+ * Check if current page has any DEAD tuples. If yes,
+ * delete these tuples and see if we can get a space for
+ * the new item to be inserted before moving to the next
+ * page in the bucket chain.
+ */
+ if (H_HAS_DEAD_TUPLES(pageopaque) && CheckBufferForCleanup(bucket_buf))
+ {
+ /*
+ * Write-lock the meta page so that we can decrement
+ * tuple count.
+ */
+ _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
+
+ _hash_vacuum_one_page(rel, metabuf, buf, bucket_buf,
+ (buf == bucket_buf) ? true : false);
+
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+ if (PageGetFreeSpace(page) >= itemsz)
+ break; /* OK, now we have enough space */
+ }
+
+ /*
* no space on this page; check for an overflow page
*/
BlockNumber nextblkno = pageopaque->hasho_nextblkno;
@@ -247,7 +273,8 @@ _hash_doinsert(Relation rel, IndexTuple itup)
Assert(PageGetFreeSpace(page) >= itemsz);
}
pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
- Assert(pageopaque->hasho_flag == LH_OVERFLOW_PAGE);
+ Assert(pageopaque->hasho_flag == LH_OVERFLOW_PAGE ||
+ pageopaque->hasho_flag == (LH_OVERFLOW_PAGE | LH_PAGE_HAS_DEAD_TUPLES));
Assert(pageopaque->hasho_bucket == bucket);
}
@@ -390,3 +417,89 @@ _hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
RelationGetRelationName(rel));
}
}
+
+/*
+ * _hash_vacuum_one_page - vacuum just one index page.
+ * Try to remove LP_DEAD items from the given page. We
+ * must acquire cleanup lock on the primary bucket page
+ * before calling this function.
+ */
+
+static void
+_hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf,
+ Buffer bucket_buf, bool is_primary_bucket_page)
+{
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+ OffsetNumber offnum,
+ maxoff;
+ Page page = BufferGetPage(buf);
+ HashPageOpaque pageopaque;
+ HashMetaPage metap;
+ double tuples_removed = 0;
+
+ /* Scan each tuple in page to see if it is marked as LP_DEAD */
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (offnum = FirstOffsetNumber;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemId = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemId))
+ {
+ deletable[ndeletable++] = offnum;
+ tuples_removed += 1;
+ }
+ }
+
+ if (ndeletable > 0)
+ {
+ /* No ereport(ERROR) until changes are logged */
+ START_CRIT_SECTION();
+
+ PageIndexMultiDelete(page, deletable, ndeletable);
+
+ pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
+
+ metap = HashPageGetMeta(BufferGetPage(metabuf));
+ metap->hashm_ntuples -= tuples_removed;
+
+ MarkBufferDirty(buf);
+ MarkBufferDirty(metabuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ xl_hash_vacuum xlrec;
+ XLogRecPtr recptr;
+
+ xlrec.is_primary_bucket_page = is_primary_bucket_page;
+ xlrec.ntuples = tuples_removed;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfHashVacuum);
+
+ /*
+ * primary bucket buffer needs to be registered to ensure
+ * that we acquire cleanup lock during replay.
+ */
+ if (!xlrec.is_primary_bucket_page)
+ XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+ XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
+ XLogRegisterBufData(1, (char *) deletable,
+ ndeletable * sizeof(OffsetNumber));
+
+ XLogRegisterBuffer(2, metabuf, REGBUF_STANDARD);
+
+ recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_VACUUM_ONE_PAGE);
+
+ PageSetLSN(BufferGetPage(buf), recptr);
+ PageSetLSN(BufferGetPage(metabuf), recptr);
+ }
+
+ END_CRIT_SECTION();
+ }
+}
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 0df64a8..574998e 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -455,6 +455,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
if (offnum <= maxoff)
{
Assert(offnum >= FirstOffsetNumber);
+
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
/*
@@ -473,6 +474,10 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
break; /* yes, so exit for-loop */
}
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ hashkillitems(scan);
+
/*
* ran off the end of this page, try the next
*/
@@ -544,6 +549,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
if (offnum >= FirstOffsetNumber)
{
Assert(offnum <= maxoff);
+
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
/*
@@ -562,6 +568,10 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
break; /* yes, so exit for-loop */
}
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ hashkillitems(scan);
+
/*
* ran off the end of this page, try the next
*/
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index b5164d7..4350e32 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -19,6 +19,7 @@
#include "access/relscan.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"
+#include "storage/buf_internals.h"
/*
@@ -489,3 +490,72 @@ _hash_get_newbucket(Relation rel, Bucket curr_bucket,
return new_bucket;
}
+
+/*
+ * hashkillitems - set LP_DEAD state for items an indexscan caller has
+ * told us were killed.
+ *
+ * scan->opaque, referenced locally through so, contains information about the
+ * current page and killed tuples thereon (generally, this should only be
+ * called if so->numKilled > 0).
+ *
+ * We match items by heap TID before assuming they are the right ones to
+ * delete. If an item has moved off the current page due to a split, we'll
+ * fail to find it and do nothing (this is not an error case --- we assume
+ * the item will eventually get marked in a future indexscan).
+ */
+void
+hashkillitems(IndexScanDesc scan)
+{
+ HashScanOpaque so = (HashScanOpaque) scan->opaque;
+ Page page;
+ HashPageOpaque opaque;
+ OffsetNumber offnum, maxoff;
+ int numKilled = so->numKilled;
+ int i;
+ bool killedsomething = false;
+
+ Assert(so->numKilled > 0);
+ Assert(so->killedItems != NULL);
+
+ /*
+ * Always reset the scan state, so we don't look for same
+ * items on other pages.
+ */
+ so->numKilled = 0;
+
+ page = BufferGetPage(so->hashso_curbuf);
+ opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ for (i = 0; i < numKilled; i++)
+ {
+ offnum = so->killedItems[i].indexOffset;
+
+ while (offnum <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
+
+ if (ItemPointerEquals(&ituple->t_tid, &so->killedItems[i].heapTid))
+ {
+ /* found the item */
+ ItemIdMarkDead(iid);
+ killedsomething = true;
+ break; /* out of inner search loop */
+ }
+ offnum = OffsetNumberNext(offnum);
+ }
+ }
+
+ /*
+ * Since this can be redone later if needed, mark as dirty hint.
+ * Whenever we mark anything LP_DEAD, we also set the page's
+ * LH_PAGE_HAS_DEAD_TUPLES flag, which is likewise just a hint.
+ */
+ if (killedsomething)
+ {
+ opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
+ MarkBufferDirtyHint(so->hashso_curbuf, true);
+ }
+}
diff --git a/src/backend/access/rmgrdesc/hashdesc.c b/src/backend/access/rmgrdesc/hashdesc.c
index 245ce97..7fc5721 100644
--- a/src/backend/access/rmgrdesc/hashdesc.c
+++ b/src/backend/access/rmgrdesc/hashdesc.c
@@ -155,6 +155,8 @@ hash_identify(uint8 info)
case XLOG_HASH_UPDATE_META_PAGE:
id = "UPDATE_META_PAGE";
break;
+ case XLOG_HASH_VACUUM_ONE_PAGE:
+ id = "VACUUM_ONE_PAGE";
}
return id;
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index c0434f5..185d1e8 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -57,6 +57,7 @@ typedef uint32 Bucket;
#define LH_BUCKET_NEW_PAGE_SPLIT (1 << 4)
#define LH_BUCKET_OLD_PAGE_SPLIT (1 << 5)
#define LH_BUCKET_PAGE_HAS_GARBAGE (1 << 6)
+#define LH_PAGE_HAS_DEAD_TUPLES (1 << 7)
typedef struct HashPageOpaqueData
{
@@ -74,6 +75,7 @@ typedef HashPageOpaqueData *HashPageOpaque;
#define H_NEW_INCOMPLETE_SPLIT(opaque) ((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
#define H_INCOMPLETE_SPLIT(opaque) (((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT) || \
((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT))
+#define H_HAS_DEAD_TUPLES(opaque) ((opaque)->hasho_flag & LH_PAGE_HAS_DEAD_TUPLES)
/*
* The page ID is for the convenience of pg_filedump and similar utilities,
@@ -83,6 +85,13 @@ typedef HashPageOpaqueData *HashPageOpaque;
*/
#define HASHO_PAGE_ID 0xFF80
+typedef struct HashScanPosItem /* what we remember about each match */
+{
+ ItemPointerData heapTid; /* TID of referenced heap item */
+ OffsetNumber indexOffset; /* index item's location within page */
+} HashScanPosItem;
+
+
/*
* HashScanOpaqueData is private state for a hash index scan.
*/
@@ -116,6 +125,10 @@ typedef struct HashScanOpaqueData
/* Whether scan needs to skip tuples that are moved by split */
bool hashso_skip_moved_tuples;
+
+ /* info about killed items if any (killedItems is NULL if never used) */
+ HashScanPosItem *killedItems; /* tids and offset numbers of killed items */
+ int numKilled; /* number of currently stored items */
} HashScanOpaqueData;
typedef HashScanOpaqueData *HashScanOpaque;
@@ -177,6 +190,7 @@ typedef struct HashMetaPageData
typedef HashMetaPageData *HashMetaPage;
+
/*
* Maximum size of a hash index item (it's okay to have only one per page)
*/
@@ -381,6 +395,7 @@ extern BlockNumber _hash_get_oldblk(Relation rel, HashPageOpaque opaque);
extern BlockNumber _hash_get_newblk(Relation rel, HashPageOpaque opaque);
extern Bucket _hash_get_newbucket(Relation rel, Bucket curr_bucket,
uint32 lowmask, uint32 maxbucket);
+extern void hashkillitems(IndexScanDesc scan);
/* hash.c */
extern void hashbucketcleanup(Relation rel, Buffer bucket_buf,
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index 30e16c0..e9946d1 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -43,6 +43,7 @@
#define XLOG_HASH_UPDATE_META_PAGE 0xB0 /* update meta page after
* vacuum */
+#define XLOG_HASH_VACUUM_ONE_PAGE 0xC0 /* remove dead tuples from index page */
/*
* xl_hash_split_allocpage flag values, 8 bits are available.
@@ -257,6 +258,24 @@ typedef struct xl_hash_init_bitmap_page
#define SizeOfHashInitBitmapPage \
(offsetof(xl_hash_init_bitmap_page, bmsize) + sizeof(uint16))
+/*
+ * This is what we need for index tuple deletion and to
+ * update the meta page.
+ *
+ * This data record is used for XLOG_HASH_VACUUM_ONE_PAGE
+ *
+ * Backup Blk 0/1: bucket page
+ * Backup Blk 2: meta page
+ */
+typedef struct xl_hash_vacuum
+{
+ double ntuples;
+ bool is_primary_bucket_page;
+} xl_hash_vacuum;
+
+#define SizeOfHashVacuum \
+ (offsetof(xl_hash_vacuum, is_primary_bucket_page) + sizeof(bool))
+
extern void hash_redo(XLogReaderState *record);
extern void hash_desc(StringInfo buf, XLogReaderState *record);
extern const char *hash_identify(uint8 info);
pgbench.patch (text/x-patch)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 87fb006..9fda82d 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -2381,9 +2381,9 @@ init(bool is_no_vacuum)
}
};
static const char *const DDLINDEXes[] = {
- "alter table pgbench_branches add primary key (bid)",
- "alter table pgbench_tellers add primary key (tid)",
- "alter table pgbench_accounts add primary key (aid)"
+ "create index pgbench_branches_bid on pgbench_branches using hash (bid)",
+ "create index pgbench_tellers_tid on pgbench_tellers using hash (tid)",
+ "create index pgbench_accounts_aid on pgbench_accounts using hash (aid)"
};
static const char *const DDLKEYs[] = {
"alter table pgbench_tellers add foreign key (bid) references pgbench_branches",
On Mon, Oct 24, 2016 at 2:21 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Hi All,
I have added a microvacuum support for hash index access method and
attached is the v1 patch for the same.
This is an important functionality for hash indexes, as we already
have the same functionality for other index types such as btree.
The patch basically takes care
of the following things:
1. Firstly, it changes the marking of dead tuples from
'tuple-at-a-time' to 'page-at-a-time' during hash index scan. For this
we accumulate the heap tids and offset of all the hash index tuples if
it is pointed by kill_prior_tuple during scan and then mark all
accumulated tids as LP_DEAD either while stepping from one page to
another (assuming the scan in both forward and backward direction) or
during end of the hash index scan or during rescan.
2. Secondly, when inserting tuple into hash index table, if not enough
space is found on a current page then it ensures that we first clean
the dead tuples if found in the current hash index page before moving
to the next page in a bucket chain or going for a bucket split. This
basically increases the page reusability and reduces the number of
page splits, thereby reducing the overall size of hash index table.
Few comments on patch:
1.
+static void
+hash_xlog_vacuum_one_page(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ xl_hash_vacuum *xldata = (xl_hash_vacuum *) XLogRecGetData(record);
+ Buffer bucketbuf = InvalidBuffer;
+ Buffer buffer;
+ Buffer metabuf;
+ Page page;
+ XLogRedoAction action;
While replaying the delete/vacuum record on standby, it can conflict
with some already running queries. Basically the replay can remove
some row which can be visible on standby. You need to resolve
conflicts similar to what we do in btree delete records (refer
btree_xlog_delete).
2.
+ /*
+ * Write-lock the meta page so that we can decrement
+ * tuple count.
+ */
+ _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
+
+ _hash_vacuum_one_page(rel, metabuf, buf, bucket_buf,
+ (buf == bucket_buf) ? true : false);
+
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
It seems the meta page lock is acquired here for longer than required,
and it is also not needed when there are no deletable items on the
page. You can take the metapage lock just before decrementing the count.
3.
Assert(offnum <= maxoff);
+
Spurious space. There are some other similar spurious white space
changes in patch, remove them as well.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hi Amit,
Thanks for showing your interest and reviewing my patch. I have
started looking into your review comments. I will share the updated
patch in a day or two.
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Hi,
While replaying the delete/vacuum record on standby, it can conflict
with some already running queries. Basically the replay can remove
some row which can be visible on standby. You need to resolve
conflicts similar to what we do in btree delete records (refer
btree_xlog_delete).
Agreed. Thanks for pointing this out. I have taken care of it in the
attached v2 patch.
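The relevant part, condensed from hash_xlog_vacuum_one_page() in v2
(hash_xlog_vacuum_get_latestRemovedXid() is a new helper added by the
patch, and the heap relfilenode is now carried in the WAL record so
that latestRemovedXid can be computed from the heap during replay):

    if (InHotStandby)
    {
        TransactionId latestRemovedXid =
            hash_xlog_vacuum_get_latestRemovedXid(record);
        RelFileNode rnode;

        /* Resolve conflicts against the index relation, as btree does. */
        XLogRecGetBlockTag(record, 1, &rnode, NULL, NULL);
        ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rnode);
    }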
It seems the meta page lock is acquired here for longer than required,
and it is also not needed when there are no deletable items on the
page. You can take the metapage lock just before decrementing the count.
Ok. Corrected. Please refer to the attached v2 patch.
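Condensed from _hash_vacuum_one_page() in v2, the meta page is now
write-locked only around the tuple count decrement, inside the block
that actually has deletable items, and released once the deletion has
been logged:

    if (ndeletable > 0)
    {
        /* No ereport(ERROR) until changes are logged */
        START_CRIT_SECTION();

        PageIndexMultiDelete(page, deletable, ndeletable);

        pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
        pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;

        /* Write-lock the meta page just to decrement the tuple count. */
        _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);

        metap = HashPageGetMeta(BufferGetPage(metabuf));
        metap->hashm_ntuples -= tuples_removed;

        MarkBufferDirty(buf);
        MarkBufferDirty(metabuf);

        /* ... WAL logging of XLOG_HASH_VACUUM_ONE_PAGE goes here ... */

        END_CRIT_SECTION();

        _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
    }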
Spurious space. There are some other similar spurious white space
changes in patch, remove them as well.
Corrected. Please refer to the attached v2 patch.
Attachments:
microvacuum_hash_index_v2.patch (text/x-patch)
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index db73f05..4a4d614 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -157,7 +157,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
if (buildstate.spool)
{
/* sort the tuples and insert them into the index */
- _h_indexbuild(buildstate.spool);
+ _h_indexbuild(buildstate.spool, heap->rd_node);
_h_spooldestroy(buildstate.spool);
}
@@ -196,6 +196,8 @@ hashbuildCallback(Relation index,
Datum index_values[1];
bool index_isnull[1];
IndexTuple itup;
+ Relation rel;
+ RelFileNode rnode;
/* convert data to a hash key; on failure, do not insert anything */
if (!_hash_convert_tuple(index,
@@ -212,8 +214,12 @@ hashbuildCallback(Relation index,
/* form an index tuple and point it at the heap tuple */
itup = index_form_tuple(RelationGetDescr(index),
index_values, index_isnull);
+ /* Get RelfileNode from relation OID */
+ rel = relation_open(htup->t_tableOid, NoLock);
+ rnode = rel->rd_node;
+ relation_close(rel, NoLock);
itup->t_tid = htup->t_self;
- _hash_doinsert(index, itup);
+ _hash_doinsert(index, itup, rnode);
pfree(itup);
}
@@ -245,7 +251,7 @@ hashinsert(Relation rel, Datum *values, bool *isnull,
itup = index_form_tuple(RelationGetDescr(rel), index_values, index_isnull);
itup->t_tid = *ht_ctid;
- _hash_doinsert(rel, itup);
+ _hash_doinsert(rel, itup, heapRel->rd_node);
pfree(itup);
@@ -325,14 +331,21 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
if (scan->kill_prior_tuple)
{
/*
- * Yes, so mark it by setting the LP_DEAD state in the item flags.
+ * Yes, so remember it for later. (We'll deal with all such
+ * tuples at once right after leaving the index page or at
+ * end of scan.)
*/
- ItemIdMarkDead(PageGetItemId(page, offnum));
+ if (so->killedItems == NULL)
+ so->killedItems = palloc(MaxIndexTuplesPerPage *
+ sizeof(HashScanPosItem));
- /*
- * Since this can be redone later if needed, mark as a hint.
- */
- MarkBufferDirtyHint(buf, true);
+ if (so->numKilled < MaxIndexTuplesPerPage)
+ {
+ so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
+ so->killedItems[so->numKilled].indexOffset =
+ ItemPointerGetOffsetNumber(&(so->hashso_curpos));
+ so->numKilled++;
+ }
}
/*
@@ -439,6 +452,9 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
so->hashso_skip_moved_tuples = false;
+ so->killedItems = NULL;
+ so->numKilled = 0;
+
scan->opaque = so;
return scan;
@@ -454,6 +470,10 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
HashScanOpaque so = (HashScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ hashkillitems(scan);
+
_hash_dropscanbuf(rel, so);
/* set position invalid (this will cause _hash_first call) */
@@ -480,6 +500,10 @@ hashendscan(IndexScanDesc scan)
HashScanOpaque so = (HashScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ hashkillitems(scan);
+
_hash_dropscanbuf(rel, so);
pfree(so);
@@ -809,6 +833,15 @@ hashbucketcleanup(Relation rel, Buffer bucket_buf,
PageIndexMultiDelete(page, deletable, ndeletable);
bucket_dirty = true;
+ /*
+ * Let us mark the page as clean if vacuum removes the DEAD tuples
+ * from an index page. We do this by clearing LH_PAGE_HAS_DEAD_TUPLES
+ * flag.
+ */
+ if (tuples_removed && *tuples_removed > 0 &&
+ opaque->hasho_flag & LH_PAGE_HAS_DEAD_TUPLES)
+ opaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
+
MarkBufferDirty(buf);
/* XLOG stuff */
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index d030a8d..c6dc20b 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -14,8 +14,13 @@
*/
#include "postgres.h"
+#include "access/heapam_xlog.h"
#include "access/hash_xlog.h"
#include "access/xlogutils.h"
+#include "access/xlog.h"
+#include "access/transam.h"
+#include "storage/procarray.h"
+#include "miscadmin.h"
/*
* replay a hash index meta page
@@ -921,6 +926,247 @@ hash_xlog_update_meta_page(XLogReaderState *record)
UnlockReleaseBuffer(metabuf);
}
+/*
+ * Get the latestRemovedXid from the heap pages pointed at by the index
+ * tuples being deleted. This puts the work for calculating latestRemovedXid
+ * into the recovery path rather than the primary path.
+ *
+ * It's possible that this generates a fair amount of I/O, since an index
+ * block may have hundreds of tuples being deleted. Repeat accesses to the
+ * same heap blocks are common, though are not yet optimised.
+ */
+static TransactionId
+hash_xlog_vacuum_get_latestRemovedXid(XLogReaderState *record)
+{
+ xl_hash_vacuum *xlrec = (xl_hash_vacuum *) XLogRecGetData(record);
+ OffsetNumber *unused;
+ Buffer ibuffer,
+ hbuffer;
+ Page ipage,
+ hpage;
+ RelFileNode rnode;
+ BlockNumber blkno;
+ ItemId iitemid,
+ hitemid;
+ IndexTuple itup;
+ HeapTupleHeader htuphdr;
+ BlockNumber hblkno;
+ OffsetNumber hoffnum;
+ TransactionId latestRemovedXid = InvalidTransactionId;
+ int i;
+ char *ptr;
+ Size len;
+
+ /*
+ * If there's nothing running on the standby we don't need to derive a
+ * full latestRemovedXid value, so use a fast path out of here. This
+ * returns InvalidTransactionId, and so will conflict with all HS
+ * transactions; but since we just worked out that that's zero people,
+ * it's OK.
+ */
+ if (CountDBBackends(InvalidOid) == 0)
+ return latestRemovedXid;
+
+ /*
+ * Get index page. If the DB is consistent, this should not fail, nor
+ * should any of the heap page fetches below. If one does, we return
+ * InvalidTransactionId to cancel all HS transactions. That's probably
+ * overkill, but it's safe, and certainly better than panicking here.
+ */
+ XLogRecGetBlockTag(record, 1, &rnode, NULL, &blkno);
+ ibuffer = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno, RBM_NORMAL);
+
+ if (!BufferIsValid(ibuffer))
+ return InvalidTransactionId;
+ LockBuffer(ibuffer, HASH_READ);
+ ipage = (Page) BufferGetPage(ibuffer);
+
+ /*
+ * Loop through the deleted index items to obtain the TransactionId from
+ * the heap items they point to.
+ */
+ ptr = XLogRecGetBlockData(record, 1, &len);
+
+ unused = (OffsetNumber *) ptr;
+
+ for (i = 0; i < xlrec->ntuples; i++)
+ {
+ /*
+ * Identify the index tuple about to be deleted.
+ */
+ iitemid = PageGetItemId(ipage, unused[i]);
+ itup = (IndexTuple) PageGetItem(ipage, iitemid);
+
+ /*
+ * Locate the heap page that the index tuple points at
+ */
+ hblkno = ItemPointerGetBlockNumber(&(itup->t_tid));
+ hbuffer = XLogReadBufferExtended(xlrec->hnode, MAIN_FORKNUM,
+ hblkno, RBM_NORMAL);
+
+ if (!BufferIsValid(hbuffer))
+ {
+ UnlockReleaseBuffer(ibuffer);
+ return InvalidTransactionId;
+ }
+ LockBuffer(hbuffer, HASH_READ);
+ hpage = (Page) BufferGetPage(hbuffer);
+
+ /*
+ * Look up the heap tuple header that the index tuple points at by
+ * using the heap node supplied with the xlrec. We can't use
+ * heap_fetch, since it uses ReadBuffer rather than XLogReadBuffer.
+ * Note that we are not looking at tuple data here, just headers.
+ */
+ hoffnum = ItemPointerGetOffsetNumber(&(itup->t_tid));
+ hitemid = PageGetItemId(hpage, hoffnum);
+
+ /*
+ * Follow any redirections until we find something useful.
+ */
+ while (ItemIdIsRedirected(hitemid))
+ {
+ hoffnum = ItemIdGetRedirect(hitemid);
+ hitemid = PageGetItemId(hpage, hoffnum);
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ /*
+ * If the heap item has storage, then read the header and use that to
+ * set latestRemovedXid.
+ *
+ * Some LP_DEAD items may not be accessible, so we ignore them.
+ */
+ if (ItemIdHasStorage(hitemid))
+ {
+ htuphdr = (HeapTupleHeader) PageGetItem(hpage, hitemid);
+ HeapTupleHeaderAdvanceLatestRemovedXid(htuphdr, &latestRemovedXid);
+ }
+ else if (ItemIdIsDead(hitemid))
+ {
+ /*
+ * Conjecture: if hitemid is dead then it had xids before the xids
+ * marked on LP_NORMAL items. So we just ignore this item and move
+ * onto the next, for the purposes of calculating
+ * latestRemovedxids.
+ */
+ }
+ else
+ Assert(!ItemIdIsUsed(hitemid));
+
+ UnlockReleaseBuffer(hbuffer);
+ }
+
+ UnlockReleaseBuffer(ibuffer);
+
+ /*
+ * If all heap tuples were LP_DEAD then we will be returning
+ * InvalidTransactionId here, which avoids conflicts. This matches
+ * existing logic which assumes that LP_DEAD tuples must already be older
+ * than the latestRemovedXid on the cleanup record that set them as
+ * LP_DEAD, hence must already have generated a conflict.
+ */
+ return latestRemovedXid;
+}
+
+/*
+ * replay delete operation in hash index to remove
+ * tuples marked as DEAD during index tuple insertion.
+ */
+static void
+hash_xlog_vacuum_one_page(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ xl_hash_vacuum *xldata = (xl_hash_vacuum *) XLogRecGetData(record);
+ Buffer bucketbuf = InvalidBuffer;
+ Buffer buffer;
+ Buffer metabuf;
+ Page page;
+ XLogRedoAction action;
+
+ /*
+ * If we have any conflict processing to do, it must happen before we
+ * update the page.
+ *
+ * Hash Index delete records can conflict with standby queries. You might
+ * think that vacuum records would conflict as well, but we've handled
+ * that already. XLOG_HEAP2_CLEANUP_INFO records provide the highest xid
+ * cleaned by the vacuum of the heap and so we can resolve any conflicts
+ * just once when that arrives. After that we know that no conflicts
+ * exist from individual hash index vacuum records on that index.
+ */
+ if (InHotStandby)
+ {
+ TransactionId latestRemovedXid =
+ hash_xlog_vacuum_get_latestRemovedXid(record);
+ RelFileNode rnode;
+
+ XLogRecGetBlockTag(record, 1, &rnode, NULL, NULL);
+ ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rnode);
+ }
+
+ if (xldata->is_primary_bucket_page)
+ action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL,
+ true, &buffer);
+ else
+ {
+ RelFileNode rnode;
+ BlockNumber blkno;
+
+ XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+ bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+ RBM_NORMAL);
+
+ if (BufferIsValid(bucketbuf))
+ LockBufferForCleanup(bucketbuf);
+
+ action = XLogReadBufferForRedo(record, 1, &buffer);
+ }
+
+ if (action == BLK_NEEDS_REDO)
+ {
+ char *ptr;
+ Size len;
+
+ ptr = XLogRecGetBlockData(record, 1, &len);
+
+ page = (Page) BufferGetPage(buffer);
+
+ if (len > 0)
+ {
+ OffsetNumber *unused;
+ OffsetNumber *unend;
+
+ unused = (OffsetNumber *) ptr;
+ unend = (OffsetNumber *) ((char *) ptr + len);
+
+ if ((unend - unused) > 0)
+ PageIndexMultiDelete(page, unused, unend - unused);
+ }
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ }
+ if (BufferIsValid(buffer))
+ UnlockReleaseBuffer(buffer);
+
+ if (XLogReadBufferForRedo(record, 2, &metabuf) == BLK_NEEDS_REDO)
+ {
+ Page metapage;
+ HashMetaPage metap;
+
+ metapage = BufferGetPage(metabuf);
+ metap = HashPageGetMeta(metapage);
+
+ metap->hashm_ntuples -= xldata->ntuples;
+
+ PageSetLSN(metapage, lsn);
+ MarkBufferDirty(metabuf);
+ }
+ if (BufferIsValid(metabuf))
+ UnlockReleaseBuffer(metabuf);
+}
+
void
hash_redo(XLogReaderState *record)
{
@@ -964,6 +1210,9 @@ hash_redo(XLogReaderState *record)
case XLOG_HASH_UPDATE_META_PAGE:
hash_xlog_update_meta_page(record);
break;
+ case XLOG_HASH_VACUUM_ONE_PAGE:
+ hash_xlog_vacuum_one_page(record);
+ break;
default:
elog(PANIC, "hash_redo: unknown op code %u", info);
}
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 3514138..7435db0 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -19,7 +19,12 @@
#include "access/hash_xlog.h"
#include "miscadmin.h"
#include "utils/rel.h"
+#include "storage/lwlock.h"
+#include "storage/buf_internals.h"
+static void _hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf,
+ Buffer bucket_buf, bool is_primary_bucket_page,
+ RelFileNode hnode);
/*
* _hash_doinsert() -- Handle insertion of a single index tuple.
@@ -28,7 +33,7 @@
* and hashinsert. By here, itup is completely filled in.
*/
void
-_hash_doinsert(Relation rel, IndexTuple itup)
+_hash_doinsert(Relation rel, IndexTuple itup, RelFileNode hnode)
{
Buffer buf = InvalidBuffer;
Buffer bucket_buf;
@@ -206,6 +211,22 @@ _hash_doinsert(Relation rel, IndexTuple itup)
while (PageGetFreeSpace(page) < itemsz)
{
/*
+ * Check if current page has any DEAD tuples. If yes,
+ * delete these tuples and see if we can get a space for
+ * the new item to be inserted before moving to the next
+ * page in the bucket chain.
+ */
+ if (H_HAS_DEAD_TUPLES(pageopaque) && CheckBufferForCleanup(bucket_buf))
+ {
+ _hash_vacuum_one_page(rel, metabuf, buf, bucket_buf,
+ (buf == bucket_buf) ? true : false,
+ hnode);
+
+ if (PageGetFreeSpace(page) >= itemsz)
+ break; /* OK, now we have enough space */
+ }
+
+ /*
* no space on this page; check for an overflow page
*/
BlockNumber nextblkno = pageopaque->hasho_nextblkno;
@@ -247,7 +268,8 @@ _hash_doinsert(Relation rel, IndexTuple itup)
Assert(PageGetFreeSpace(page) >= itemsz);
}
pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
- Assert(pageopaque->hasho_flag == LH_OVERFLOW_PAGE);
+ Assert(pageopaque->hasho_flag == LH_OVERFLOW_PAGE ||
+ pageopaque->hasho_flag == (LH_OVERFLOW_PAGE | LH_PAGE_HAS_DEAD_TUPLES));
Assert(pageopaque->hasho_bucket == bucket);
}
@@ -390,3 +412,98 @@ _hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
RelationGetRelationName(rel));
}
}
+
+/*
+ * _hash_vacuum_one_page - vacuum just one index page.
+ * Try to remove LP_DEAD items from the given page. We
+ * must acquire cleanup lock on the primary bucket page
+ * before calling this function.
+ */
+
+static void
+_hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf,
+ Buffer bucket_buf, bool is_primary_bucket_page,
+ RelFileNode hnode)
+{
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+ OffsetNumber offnum,
+ maxoff;
+ Page page = BufferGetPage(buf);
+ HashPageOpaque pageopaque;
+ HashMetaPage metap;
+ double tuples_removed = 0;
+
+ /* Scan each tuple in page to see if it is marked as LP_DEAD */
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (offnum = FirstOffsetNumber;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemId = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemId))
+ {
+ deletable[ndeletable++] = offnum;
+ tuples_removed += 1;
+ }
+ }
+
+ if (ndeletable > 0)
+ {
+ /* No ereport(ERROR) until changes are logged */
+ START_CRIT_SECTION();
+
+ PageIndexMultiDelete(page, deletable, ndeletable);
+
+ pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
+
+ /*
+ * Write-lock the meta page so that we can decrement
+ * tuple count.
+ */
+ _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
+
+ metap = HashPageGetMeta(BufferGetPage(metabuf));
+ metap->hashm_ntuples -= tuples_removed;
+
+ MarkBufferDirty(buf);
+ MarkBufferDirty(metabuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ xl_hash_vacuum xlrec;
+ XLogRecPtr recptr;
+
+ xlrec.hnode = hnode;
+ xlrec.is_primary_bucket_page = is_primary_bucket_page;
+ xlrec.ntuples = tuples_removed;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfHashVacuum);
+
+ /*
+ * primary bucket buffer needs to be registered to ensure
+ * that we acquire cleanup lock during replay.
+ */
+ if (!xlrec.is_primary_bucket_page)
+ XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+ XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
+ XLogRegisterBufData(1, (char *) deletable,
+ ndeletable * sizeof(OffsetNumber));
+
+ XLogRegisterBuffer(2, metabuf, REGBUF_STANDARD);
+
+ recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_VACUUM_ONE_PAGE);
+
+ PageSetLSN(BufferGetPage(buf), recptr);
+ PageSetLSN(BufferGetPage(metabuf), recptr);
+ }
+
+ END_CRIT_SECTION();
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+ }
+}
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 0df64a8..316f891 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -473,6 +473,10 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
break; /* yes, so exit for-loop */
}
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ hashkillitems(scan);
+
/*
* ran off the end of this page, try the next
*/
@@ -562,6 +566,10 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
break; /* yes, so exit for-loop */
}
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ hashkillitems(scan);
+
/*
* ran off the end of this page, try the next
*/
diff --git a/src/backend/access/hash/hashsort.c b/src/backend/access/hash/hashsort.c
index 8938ab5..aa4c7b7 100644
--- a/src/backend/access/hash/hashsort.c
+++ b/src/backend/access/hash/hashsort.c
@@ -101,7 +101,7 @@ _h_spool(HSpool *hspool, ItemPointer self, Datum *values, bool *isnull)
* create an entire index.
*/
void
-_h_indexbuild(HSpool *hspool)
+_h_indexbuild(HSpool *hspool, RelFileNode rnode)
{
IndexTuple itup;
bool should_free;
@@ -128,7 +128,7 @@ _h_indexbuild(HSpool *hspool)
Assert(hashkey >= lasthashkey);
#endif
- _hash_doinsert(hspool->index, itup);
+ _hash_doinsert(hspool->index, itup, rnode);
if (should_free)
pfree(itup);
}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index b5164d7..4350e32 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -19,6 +19,7 @@
#include "access/relscan.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"
+#include "storage/buf_internals.h"
/*
@@ -489,3 +490,72 @@ _hash_get_newbucket(Relation rel, Bucket curr_bucket,
return new_bucket;
}
+
+/*
+ * hashkillitems - set LP_DEAD state for items an indexscan caller has
+ * told us were killed.
+ *
+ * scan->opaque, referenced locally through so, contains information about the
+ * current page and killed tuples thereon (generally, this should only be
+ * called if so->numKilled > 0).
+ *
+ * We match items by heap TID before assuming they are the right ones to
+ * delete. If an item has moved off the current page due to a split, we'll
+ * fail to find it and do nothing (this is not an error case --- we assume
+ * the item will eventually get marked in a future indexscan).
+ */
+void
+hashkillitems(IndexScanDesc scan)
+{
+ HashScanOpaque so = (HashScanOpaque) scan->opaque;
+ Page page;
+ HashPageOpaque opaque;
+ OffsetNumber offnum, maxoff;
+ int numKilled = so->numKilled;
+ int i;
+ bool killedsomething = false;
+
+ Assert(so->numKilled > 0);
+ Assert(so->killedItems != NULL);
+
+ /*
+ * Always reset the scan state, so we don't look for same
+ * items on other pages.
+ */
+ so->numKilled = 0;
+
+ page = BufferGetPage(so->hashso_curbuf);
+ opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ for (i = 0; i < numKilled; i++)
+ {
+ offnum = so->killedItems[i].indexOffset;
+
+ while (offnum <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
+
+ if (ItemPointerEquals(&ituple->t_tid, &so->killedItems[i].heapTid))
+ {
+ /* found the item */
+ ItemIdMarkDead(iid);
+ killedsomething = true;
+ break; /* out of inner search loop */
+ }
+ offnum = OffsetNumberNext(offnum);
+ }
+ }
+
+ /*
+ * Since this can be redone later if needed, mark as dirty hint.
+ * Whenever we mark anything LP_DEAD, we also set the page's
+ * LH_PAGE_HAS_DEAD_TUPLES flag, which is likewise just a hint.
+ */
+ if (killedsomething)
+ {
+ opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
+ MarkBufferDirtyHint(so->hashso_curbuf, true);
+ }
+}
diff --git a/src/backend/access/rmgrdesc/hashdesc.c b/src/backend/access/rmgrdesc/hashdesc.c
index 245ce97..7fc5721 100644
--- a/src/backend/access/rmgrdesc/hashdesc.c
+++ b/src/backend/access/rmgrdesc/hashdesc.c
@@ -155,6 +155,8 @@ hash_identify(uint8 info)
case XLOG_HASH_UPDATE_META_PAGE:
id = "UPDATE_META_PAGE";
break;
+ case XLOG_HASH_VACUUM_ONE_PAGE:
+ id = "VACUUM_ONE_PAGE";
}
return id;
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index c0434f5..6fc7cd0 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -57,6 +57,7 @@ typedef uint32 Bucket;
#define LH_BUCKET_NEW_PAGE_SPLIT (1 << 4)
#define LH_BUCKET_OLD_PAGE_SPLIT (1 << 5)
#define LH_BUCKET_PAGE_HAS_GARBAGE (1 << 6)
+#define LH_PAGE_HAS_DEAD_TUPLES (1 << 7)
typedef struct HashPageOpaqueData
{
@@ -74,6 +75,7 @@ typedef HashPageOpaqueData *HashPageOpaque;
#define H_NEW_INCOMPLETE_SPLIT(opaque) ((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
#define H_INCOMPLETE_SPLIT(opaque) (((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT) || \
((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT))
+#define H_HAS_DEAD_TUPLES(opaque) ((opaque)->hasho_flag & LH_PAGE_HAS_DEAD_TUPLES)
/*
* The page ID is for the convenience of pg_filedump and similar utilities,
@@ -83,6 +85,13 @@ typedef HashPageOpaqueData *HashPageOpaque;
*/
#define HASHO_PAGE_ID 0xFF80
+typedef struct HashScanPosItem /* what we remember about each match */
+{
+ ItemPointerData heapTid; /* TID of referenced heap item */
+ OffsetNumber indexOffset; /* index item's location within page */
+} HashScanPosItem;
+
+
/*
* HashScanOpaqueData is private state for a hash index scan.
*/
@@ -116,6 +125,10 @@ typedef struct HashScanOpaqueData
/* Whether scan needs to skip tuples that are moved by split */
bool hashso_skip_moved_tuples;
+
+ /* info about killed items if any (killedItems is NULL if never used) */
+ HashScanPosItem *killedItems; /* tids and offset numbers of killed items */
+ int numKilled; /* number of currently stored items */
} HashScanOpaqueData;
typedef HashScanOpaqueData *HashScanOpaque;
@@ -177,6 +190,7 @@ typedef struct HashMetaPageData
typedef HashMetaPageData *HashMetaPage;
+
/*
* Maximum size of a hash index item (it's okay to have only one per page)
*/
@@ -303,7 +317,7 @@ extern Datum hash_uint32(uint32 k);
/* private routines */
/* hashinsert.c */
-extern void _hash_doinsert(Relation rel, IndexTuple itup);
+extern void _hash_doinsert(Relation rel, IndexTuple itup, RelFileNode hnode);
extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
Size itemsize, IndexTuple itup);
extern void _hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
@@ -361,7 +375,7 @@ extern HSpool *_h_spoolinit(Relation heap, Relation index, uint32 num_buckets);
extern void _h_spooldestroy(HSpool *hspool);
extern void _h_spool(HSpool *hspool, ItemPointer self,
Datum *values, bool *isnull);
-extern void _h_indexbuild(HSpool *hspool);
+extern void _h_indexbuild(HSpool *hspool, RelFileNode rnode);
/* hashutil.c */
extern bool _hash_checkqual(IndexScanDesc scan, IndexTuple itup);
@@ -381,6 +395,7 @@ extern BlockNumber _hash_get_oldblk(Relation rel, HashPageOpaque opaque);
extern BlockNumber _hash_get_newblk(Relation rel, HashPageOpaque opaque);
extern Bucket _hash_get_newbucket(Relation rel, Bucket curr_bucket,
uint32 lowmask, uint32 maxbucket);
+extern void hashkillitems(IndexScanDesc scan);
/* hash.c */
extern void hashbucketcleanup(Relation rel, Buffer bucket_buf,
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index 30e16c0..b4d2bf2 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -43,6 +43,7 @@
#define XLOG_HASH_UPDATE_META_PAGE 0xB0 /* update meta page after
* vacuum */
+#define XLOG_HASH_VACUUM_ONE_PAGE 0xC0 /* remove dead tuples from index page */
/*
* xl_hash_split_allocpage flag values, 8 bits are available.
@@ -257,6 +258,25 @@ typedef struct xl_hash_init_bitmap_page
#define SizeOfHashInitBitmapPage \
(offsetof(xl_hash_init_bitmap_page, bmsize) + sizeof(uint16))
+/*
+ * This is what we need for index tuple deletion and to
+ * update the meta page.
+ *
+ * This data record is used for XLOG_HASH_VACUUM_ONE_PAGE
+ *
+ * Backup Blk 0/1: bucket page
+ * Backup Blk 2: meta page
+ */
+typedef struct xl_hash_vacuum
+{
+ RelFileNode hnode;
+ double ntuples;
+ bool is_primary_bucket_page;
+} xl_hash_vacuum;
+
+#define SizeOfHashVacuum \
+ (offsetof(xl_hash_vacuum, is_primary_bucket_page) + sizeof(bool))
+
extern void hash_redo(XLogReaderState *record);
extern void hash_desc(StringInfo buf, XLogReaderState *record);
extern const char *hash_identify(uint8 info);
Hi,
On 11/02/2016 01:38 AM, Ashutosh Sharma wrote:
While replaying the delete/vacuum record on standby, it can conflict
with some already running queries. Basically the replay can remove
some row which can be visible on standby. You need to resolve
conflicts similar to what we do in btree delete records (refer
btree_xlog_delete).
Agreed. Thanks for pointing this out. I have taken care of it in the
attached v2 patch.
Some initial comments.
_hash_vacuum_one_page:
+ END_CRIT_SECTION();
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
The _hash_chgbufaccess() needs a comment.
You also need a place where you call pfree for so->killedItems - maybe
in hashkillitems().
Best regards,
Jesper
Hi Jesper,
Some initial comments.
_hash_vacuum_one_page:
+ END_CRIT_SECTION();
+ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
The _hash_chgbufaccess() needs a comment.
You also need a place where you call pfree for so->killedItems - maybe in
hashkillitems().
Thanks for reviewing this patch. I would like to let you know that this
patch depends on the patches for concurrent hash index and WAL logging
in hash index. So, until those two patches are stable, I won't be able
to share the next version of the patch for supporting microvacuum in
hash index.
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
On 11/11/2016 12:11 AM, Ashutosh Sharma wrote:
Thanks for reviewing this patch. I would like to let you know that this
patch depends on the patches for concurrent hash index and WAL logging
in hash index. So, until those two patches are stable, I won't be able
to share the next version of the patch for supporting microvacuum in
hash index.
As the concurrent hash index patch was committed in 6d46f4, this patch
needs a rebase.
I have moved this submission to the next CF.
Thanks for working on this!
Best regards,
Jesper
On 11/11/2016 12:11 AM, Ashutosh Sharma wrote:
Hi Jesper,
Thanks for reviewing this patch. I would like to let you know that this
patch depends on the patches for concurrent hash index and WAL logging
in hash index. So, until those two patches are stable, I won't be able
to share the next version of the patch for supporting microvacuum in
hash index.
This can be rebased on the WAL v7 patch [1]. In addition to the
previous comments, you need to take commit 7819ba into account.
[1]: /messages/by-id/CAA4eK1+dmGNTFMnLO4EbOWJDHUq=+a2L8T=72ifXeh-Kd8HOsg@mail.gmail.com
Best regards,
Jesper
Hi,
This can be rebased on the WAL v7 patch [1]. In addition to the previous
comments you need to take commit 7819ba into account.
Attached is the v3 patch, rebased on PostgreSQL HEAD and the WAL v7 patch.
It also takes care of all the previous comments from Jesper [1].
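In particular, so->killedItems is now freed in hashendscan(); condensed
from the v3 patch:

    _hash_dropscanbuf(rel, so);

    if (so->killedItems != NULL)
        pfree(so->killedItems);
    pfree(so);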
Also, I have changed the status of this patch to "Needs review" for
this commit-fest.
[1]: /messages/by-id/a751842f-2aed-9f2e-104c-34cfe06bfbe2@redhat.com
With Regards,
Ashutosh Sharma.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
microvacuum_hash_index_v3.patch (octet-stream)
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index f186e52..c96ff0a 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -157,7 +157,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
if (buildstate.spool)
{
/* sort the tuples and insert them into the index */
- _h_indexbuild(buildstate.spool);
+ _h_indexbuild(buildstate.spool, heap->rd_node);
_h_spooldestroy(buildstate.spool);
}
@@ -196,6 +196,8 @@ hashbuildCallback(Relation index,
Datum index_values[1];
bool index_isnull[1];
IndexTuple itup;
+ Relation rel;
+ RelFileNode rnode;
/* convert data to a hash key; on failure, do not insert anything */
if (!_hash_convert_tuple(index,
@@ -212,8 +214,12 @@ hashbuildCallback(Relation index,
/* form an index tuple and point it at the heap tuple */
itup = index_form_tuple(RelationGetDescr(index),
index_values, index_isnull);
+ /* Get RelfileNode from relation OID */
+ rel = relation_open(htup->t_tableOid, NoLock);
+ rnode = rel->rd_node;
+ relation_close(rel, NoLock);
itup->t_tid = htup->t_self;
- _hash_doinsert(index, itup);
+ _hash_doinsert(index, itup, rnode);
pfree(itup);
}
@@ -245,7 +251,7 @@ hashinsert(Relation rel, Datum *values, bool *isnull,
itup = index_form_tuple(RelationGetDescr(rel), index_values, index_isnull);
itup->t_tid = *ht_ctid;
- _hash_doinsert(rel, itup);
+ _hash_doinsert(rel, itup, heapRel->rd_node);
pfree(itup);
@@ -325,14 +331,21 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
if (scan->kill_prior_tuple)
{
/*
- * Yes, so mark it by setting the LP_DEAD state in the item flags.
+ * Yes, so remember it for later. (We'll deal with all such
+ * tuples at once right after leaving the index page or at
+ * end of scan.)
*/
- ItemIdMarkDead(PageGetItemId(page, offnum));
+ if (so->killedItems == NULL)
+ so->killedItems = palloc(MaxIndexTuplesPerPage *
+ sizeof(HashScanPosItem));
- /*
- * Since this can be redone later if needed, mark as a hint.
- */
- MarkBufferDirtyHint(buf, true);
+ if (so->numKilled < MaxIndexTuplesPerPage)
+ {
+ so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
+ so->killedItems[so->numKilled].indexOffset =
+ ItemPointerGetOffsetNumber(&(so->hashso_curpos));
+ so->numKilled++;
+ }
}
/*
@@ -440,6 +453,9 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
so->hashso_buc_populated = false;
so->hashso_buc_split = false;
+ so->killedItems = NULL;
+ so->numKilled = 0;
+
scan->opaque = so;
return scan;
@@ -455,6 +471,10 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
HashScanOpaque so = (HashScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ hashkillitems(scan);
+
_hash_dropscanbuf(rel, so);
/* set position invalid (this will cause _hash_first call) */
@@ -482,8 +502,14 @@ hashendscan(IndexScanDesc scan)
HashScanOpaque so = (HashScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ hashkillitems(scan);
+
_hash_dropscanbuf(rel, so);
+ if (so->killedItems != NULL)
+ pfree(so->killedItems);
pfree(so);
scan->opaque = NULL;
}
@@ -835,6 +861,16 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
PageIndexMultiDelete(page, deletable, ndeletable);
bucket_dirty = true;
+
+ /*
+ * Let us mark the page as clean if vacuum removes the DEAD tuples
+ * from an index page. We do this by clearing LH_PAGE_HAS_DEAD_TUPLES
+ * flag.
+ */
+ if (tuples_removed && *tuples_removed > 0 &&
+ opaque->hasho_flag & LH_PAGE_HAS_DEAD_TUPLES)
+ opaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
+
MarkBufferDirty(buf);
/* XLOG stuff */
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index 41429a7..3f88804 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -14,8 +14,13 @@
*/
#include "postgres.h"
+#include "access/heapam_xlog.h"
#include "access/hash_xlog.h"
#include "access/xlogutils.h"
+#include "access/xlog.h"
+#include "access/transam.h"
+#include "storage/procarray.h"
+#include "miscadmin.h"
/*
* replay a hash index meta page
@@ -921,6 +926,247 @@ hash_xlog_update_meta_page(XLogReaderState *record)
UnlockReleaseBuffer(metabuf);
}
+/*
+ * Get the latestRemovedXid from the heap pages pointed at by the index
+ * tuples being deleted. This puts the work for calculating latestRemovedXid
+ * into the recovery path rather than the primary path.
+ *
+ * It's possible that this generates a fair amount of I/O, since an index
+ * block may have hundreds of tuples being deleted. Repeat accesses to the
+ * same heap blocks are common, though are not yet optimised.
+ */
+static TransactionId
+hash_xlog_vacuum_get_latestRemovedXid(XLogReaderState *record)
+{
+ xl_hash_vacuum *xlrec = (xl_hash_vacuum *) XLogRecGetData(record);
+ OffsetNumber *unused;
+ Buffer ibuffer,
+ hbuffer;
+ Page ipage,
+ hpage;
+ RelFileNode rnode;
+ BlockNumber blkno;
+ ItemId iitemid,
+ hitemid;
+ IndexTuple itup;
+ HeapTupleHeader htuphdr;
+ BlockNumber hblkno;
+ OffsetNumber hoffnum;
+ TransactionId latestRemovedXid = InvalidTransactionId;
+ int i;
+ char *ptr;
+ Size len;
+
+ /*
+ * If there's nothing running on the standby we don't need to derive a
+ * full latestRemovedXid value, so use a fast path out of here. This
+ * returns InvalidTransactionId, and so will conflict with all HS
+ * transactions; but since we just worked out that that's zero people,
+ * it's OK.
+ */
+ if (CountDBBackends(InvalidOid) == 0)
+ return latestRemovedXid;
+
+ /*
+ * Get index page. If the DB is consistent, this should not fail, nor
+ * should any of the heap page fetches below. If one does, we return
+ * InvalidTransactionId to cancel all HS transactions. That's probably
+ * overkill, but it's safe, and certainly better than panicking here.
+ */
+ XLogRecGetBlockTag(record, 1, &rnode, NULL, &blkno);
+ ibuffer = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno, RBM_NORMAL);
+
+ if (!BufferIsValid(ibuffer))
+ return InvalidTransactionId;
+ LockBuffer(ibuffer, HASH_READ);
+ ipage = (Page) BufferGetPage(ibuffer);
+
+ /*
+ * Loop through the deleted index items to obtain the TransactionId from
+ * the heap items they point to.
+ */
+ ptr = XLogRecGetBlockData(record, 1, &len);
+
+ unused = (OffsetNumber *) ptr;
+
+ for (i = 0; i < xlrec->ntuples; i++)
+ {
+ /*
+ * Identify the index tuple about to be deleted.
+ */
+ iitemid = PageGetItemId(ipage, unused[i]);
+ itup = (IndexTuple) PageGetItem(ipage, iitemid);
+
+ /*
+ * Locate the heap page that the index tuple points at
+ */
+ hblkno = ItemPointerGetBlockNumber(&(itup->t_tid));
+ hbuffer = XLogReadBufferExtended(xlrec->hnode, MAIN_FORKNUM,
+ hblkno, RBM_NORMAL);
+
+ if (!BufferIsValid(hbuffer))
+ {
+ UnlockReleaseBuffer(ibuffer);
+ return InvalidTransactionId;
+ }
+ LockBuffer(hbuffer, HASH_READ);
+ hpage = (Page) BufferGetPage(hbuffer);
+
+ /*
+ * Look up the heap tuple header that the index tuple points at by
+ * using the heap node supplied with the xlrec. We can't use
+ * heap_fetch, since it uses ReadBuffer rather than XLogReadBuffer.
+ * Note that we are not looking at tuple data here, just headers.
+ */
+ hoffnum = ItemPointerGetOffsetNumber(&(itup->t_tid));
+ hitemid = PageGetItemId(hpage, hoffnum);
+
+ /*
+ * Follow any redirections until we find something useful.
+ */
+ while (ItemIdIsRedirected(hitemid))
+ {
+ hoffnum = ItemIdGetRedirect(hitemid);
+ hitemid = PageGetItemId(hpage, hoffnum);
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ /*
+ * If the heap item has storage, then read the header and use that to
+ * set latestRemovedXid.
+ *
+ * Some LP_DEAD items may not be accessible, so we ignore them.
+ */
+ if (ItemIdHasStorage(hitemid))
+ {
+ htuphdr = (HeapTupleHeader) PageGetItem(hpage, hitemid);
+ HeapTupleHeaderAdvanceLatestRemovedXid(htuphdr, &latestRemovedXid);
+ }
+ else if (ItemIdIsDead(hitemid))
+ {
+ /*
+ * Conjecture: if hitemid is dead then it had xids before the xids
+ * marked on LP_NORMAL items. So we just ignore this item and move
+ * onto the next, for the purposes of calculating
+ * latestRemovedxids.
+ */
+ }
+ else
+ Assert(!ItemIdIsUsed(hitemid));
+
+ UnlockReleaseBuffer(hbuffer);
+ }
+
+ UnlockReleaseBuffer(ibuffer);
+
+ /*
+ * If all heap tuples were LP_DEAD then we will be returning
+ * InvalidTransactionId here, which avoids conflicts. This matches
+ * existing logic which assumes that LP_DEAD tuples must already be older
+ * than the latestRemovedXid on the cleanup record that set them as
+ * LP_DEAD, hence must already have generated a conflict.
+ */
+ return latestRemovedXid;
+}
+
+/*
+ * replay delete operation in hash index to remove
+ * tuples marked as DEAD during index tuple insertion.
+ */
+static void
+hash_xlog_vacuum_one_page(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ xl_hash_vacuum *xldata = (xl_hash_vacuum *) XLogRecGetData(record);
+ Buffer bucketbuf = InvalidBuffer;
+ Buffer buffer;
+ Buffer metabuf;
+ Page page;
+ XLogRedoAction action;
+
+ /*
+ * If we have any conflict processing to do, it must happen before we
+ * update the page.
+ *
+ * Hash Index delete records can conflict with standby queries. You might
+ * think that vacuum records would conflict as well, but we've handled
+ * that already. XLOG_HEAP2_CLEANUP_INFO records provide the highest xid
+ * cleaned by the vacuum of the heap and so we can resolve any conflicts
+ * just once when that arrives. After that we know that no conflicts
+ * exist from individual hash index vacuum records on that index.
+ */
+ if (InHotStandby)
+ {
+ TransactionId latestRemovedXid =
+ hash_xlog_vacuum_get_latestRemovedXid(record);
+ RelFileNode rnode;
+
+ XLogRecGetBlockTag(record, 1, &rnode, NULL, NULL);
+ ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rnode);
+ }
+
+ if (xldata->is_primary_bucket_page)
+ action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL,
+ true, &buffer);
+ else
+ {
+ RelFileNode rnode;
+ BlockNumber blkno;
+
+ XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+ bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+ RBM_NORMAL);
+
+ if (BufferIsValid(bucketbuf))
+ LockBufferForCleanup(bucketbuf);
+
+ action = XLogReadBufferForRedo(record, 1, &buffer);
+ }
+
+ if (action == BLK_NEEDS_REDO)
+ {
+ char *ptr;
+ Size len;
+
+ ptr = XLogRecGetBlockData(record, 1, &len);
+
+ page = (Page) BufferGetPage(buffer);
+
+ if (len > 0)
+ {
+ OffsetNumber *unused;
+ OffsetNumber *unend;
+
+ unused = (OffsetNumber *) ptr;
+ unend = (OffsetNumber *) ((char *) ptr + len);
+
+ if ((unend - unused) > 0)
+ PageIndexMultiDelete(page, unused, unend - unused);
+ }
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ }
+ if (BufferIsValid(buffer))
+ UnlockReleaseBuffer(buffer);
+
+ if (XLogReadBufferForRedo(record, 2, &metabuf) == BLK_NEEDS_REDO)
+ {
+ Page metapage;
+ HashMetaPage metap;
+
+ metapage = BufferGetPage(metabuf);
+ metap = HashPageGetMeta(metapage);
+
+ metap->hashm_ntuples -= xldata->ntuples;
+
+ PageSetLSN(metapage, lsn);
+ MarkBufferDirty(metabuf);
+ }
+ if (BufferIsValid(metabuf))
+ UnlockReleaseBuffer(metabuf);
+}
+
void
hash_redo(XLogReaderState *record)
{
@@ -964,6 +1210,9 @@ hash_redo(XLogReaderState *record)
case XLOG_HASH_UPDATE_META_PAGE:
hash_xlog_update_meta_page(record);
break;
+ case XLOG_HASH_VACUUM_ONE_PAGE:
+ hash_xlog_vacuum_one_page(record);
+ break;
default:
elog(PANIC, "hash_redo: unknown op code %u", info);
}
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 057bd3c..826b0db 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -19,7 +19,12 @@
#include "access/hash_xlog.h"
#include "miscadmin.h"
#include "utils/rel.h"
+#include "storage/lwlock.h"
+#include "storage/buf_internals.h"
+static void _hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf,
+ Buffer bucket_buf, bool is_primary_bucket_page,
+ RelFileNode hnode);
/*
* _hash_doinsert() -- Handle insertion of a single index tuple.
@@ -28,7 +33,7 @@
* and hashinsert. By here, itup is completely filled in.
*/
void
-_hash_doinsert(Relation rel, IndexTuple itup)
+_hash_doinsert(Relation rel, IndexTuple itup, RelFileNode hnode)
{
Buffer buf = InvalidBuffer;
Buffer bucket_buf;
@@ -166,10 +171,28 @@ restart_insert:
/* Do the insertion */
while (PageGetFreeSpace(page) < itemsz)
{
+ BlockNumber nextblkno;
+
+ /*
+ * Check if current page has any DEAD tuples. If yes,
+ * delete these tuples and see if we can get a space for
+ * the new item to be inserted before moving to the next
+ * page in the bucket chain.
+ */
+ if (H_HAS_DEAD_TUPLES(pageopaque) && IsBufferCleanupOK(bucket_buf))
+ {
+ _hash_vacuum_one_page(rel, metabuf, buf, bucket_buf,
+ (buf == bucket_buf) ? true : false,
+ hnode);
+
+ if (PageGetFreeSpace(page) >= itemsz)
+ break; /* OK, now we have enough space */
+ }
+
/*
* no space on this page; check for an overflow page
*/
- BlockNumber nextblkno = pageopaque->hasho_nextblkno;
+ nextblkno = pageopaque->hasho_nextblkno;
if (BlockNumberIsValid(nextblkno))
{
@@ -205,7 +228,8 @@ restart_insert:
Assert(PageGetFreeSpace(page) >= itemsz);
}
pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
- Assert(pageopaque->hasho_flag == LH_OVERFLOW_PAGE);
+ Assert(pageopaque->hasho_flag == LH_OVERFLOW_PAGE ||
+ pageopaque->hasho_flag == (LH_OVERFLOW_PAGE | LH_PAGE_HAS_DEAD_TUPLES));
Assert(pageopaque->hasho_bucket == bucket);
}
@@ -347,3 +371,102 @@ _hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
RelationGetRelationName(rel));
}
}
+
+/*
+ * _hash_vacuum_one_page - vacuum just one index page.
+ * Try to remove LP_DEAD items from the given page. We
+ * must acquire cleanup lock on the primary bucket page
+ * before calling this function.
+ */
+
+static void
+_hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf,
+ Buffer bucket_buf, bool is_primary_bucket_page,
+ RelFileNode hnode)
+{
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+ OffsetNumber offnum,
+ maxoff;
+ Page page = BufferGetPage(buf);
+ HashPageOpaque pageopaque;
+ HashMetaPage metap;
+ double tuples_removed = 0;
+
+ /* Scan each tuple in page to see if it is marked as LP_DEAD */
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (offnum = FirstOffsetNumber;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemId = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemId))
+ {
+ deletable[ndeletable++] = offnum;
+ tuples_removed += 1;
+ }
+ }
+
+ if (ndeletable > 0)
+ {
+ /* No ereport(ERROR) until changes are logged */
+ START_CRIT_SECTION();
+
+ PageIndexMultiDelete(page, deletable, ndeletable);
+
+ pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
+
+ /*
+ * Write-lock the meta page so that we can decrement
+ * tuple count.
+ */
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+
+ metap = HashPageGetMeta(BufferGetPage(metabuf));
+ metap->hashm_ntuples -= tuples_removed;
+
+ MarkBufferDirty(buf);
+ MarkBufferDirty(metabuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ xl_hash_vacuum xlrec;
+ XLogRecPtr recptr;
+
+ xlrec.hnode = hnode;
+ xlrec.is_primary_bucket_page = is_primary_bucket_page;
+ xlrec.ntuples = tuples_removed;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfHashVacuum);
+
+ /*
+ * primary bucket buffer needs to be registered to ensure
+ * that we acquire cleanup lock during replay.
+ */
+ if (!xlrec.is_primary_bucket_page)
+ XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+ XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
+ XLogRegisterBufData(1, (char *) deletable,
+ ndeletable * sizeof(OffsetNumber));
+
+ XLogRegisterBuffer(2, metabuf, REGBUF_STANDARD);
+
+ recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_VACUUM_ONE_PAGE);
+
+ PageSetLSN(BufferGetPage(buf), recptr);
+ PageSetLSN(BufferGetPage(metabuf), recptr);
+ }
+
+ END_CRIT_SECTION();
+ /*
+ * Releasing write lock on meta page as we have updated
+ * the tuple count.
+ */
+ LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+ }
+}
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 9aaee1e..75aa0f8 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -509,6 +509,10 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
break; /* yes, so exit for-loop */
}
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ hashkillitems(scan);
+
/*
* ran off the end of this page, try the next
*/
@@ -562,6 +566,10 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
break; /* yes, so exit for-loop */
}
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ hashkillitems(scan);
+
/*
* ran off the end of this page, try the next
*/
diff --git a/src/backend/access/hash/hashsort.c b/src/backend/access/hash/hashsort.c
index ea8f109..60483cf 100644
--- a/src/backend/access/hash/hashsort.c
+++ b/src/backend/access/hash/hashsort.c
@@ -101,7 +101,7 @@ _h_spool(HSpool *hspool, ItemPointer self, Datum *values, bool *isnull)
* create an entire index.
*/
void
-_h_indexbuild(HSpool *hspool)
+_h_indexbuild(HSpool *hspool, RelFileNode rnode)
{
IndexTuple itup;
#ifdef USE_ASSERT_CHECKING
@@ -126,6 +126,6 @@ _h_indexbuild(HSpool *hspool)
Assert(hashkey >= lasthashkey);
#endif
- _hash_doinsert(hspool->index, itup);
+ _hash_doinsert(hspool->index, itup, rnode);
}
}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index c705531..9e338d9 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -19,6 +19,7 @@
#include "access/relscan.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"
+#include "storage/buf_internals.h"
#define CALC_NEW_BUCKET(old_bucket, lowmask) \
old_bucket | (lowmask + 1)
@@ -446,3 +447,72 @@ _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
return new_bucket;
}
+
+/*
+ * hashkillitems - set LP_DEAD state for items an indexscan caller has
+ * told us were killed.
+ *
+ * scan->opaque, referenced locally through so, contains information about the
+ * current page and killed tuples thereon (generally, this should only be
+ * called if so->numKilled > 0).
+ *
+ * We match items by heap TID before assuming they are the right ones to
+ * delete. If an item has moved off the current page due to a split, we'll
+ * fail to find it and do nothing (this is not an error case --- we assume
+ * the item will eventually get marked in a future indexscan).
+ */
+void
+hashkillitems(IndexScanDesc scan)
+{
+ HashScanOpaque so = (HashScanOpaque) scan->opaque;
+ Page page;
+ HashPageOpaque opaque;
+ OffsetNumber offnum, maxoff;
+ int numKilled = so->numKilled;
+ int i;
+ bool killedsomething = false;
+
+ Assert(so->numKilled > 0);
+ Assert(so->killedItems != NULL);
+
+ /*
+ * Always reset the scan state, so we don't look for same
+ * items on other pages.
+ */
+ so->numKilled = 0;
+
+ page = BufferGetPage(so->hashso_curbuf);
+ opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ for (i = 0; i < numKilled; i++)
+ {
+ offnum = so->killedItems[i].indexOffset;
+
+ while (offnum <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
+
+ if (ItemPointerEquals(&ituple->t_tid, &so->killedItems[i].heapTid))
+ {
+ /* found the item */
+ ItemIdMarkDead(iid);
+ killedsomething = true;
+ break; /* out of inner search loop */
+ }
+ offnum = OffsetNumberNext(offnum);
+ }
+ }
+
+ /*
+ * Since this can be redone later if needed, mark as dirty hint.
+ * Whenever we mark anything LP_DEAD, we also set the page's
+ * LH_PAGE_HAS_DEAD_TUPLES flag, which is likewise just a hint.
+ */
+ if (killedsomething)
+ {
+ opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
+ MarkBufferDirtyHint(so->hashso_curbuf, true);
+ }
+}
diff --git a/src/backend/access/rmgrdesc/hashdesc.c b/src/backend/access/rmgrdesc/hashdesc.c
index 5e3f7d8..5a06bb1 100644
--- a/src/backend/access/rmgrdesc/hashdesc.c
+++ b/src/backend/access/rmgrdesc/hashdesc.c
@@ -155,6 +155,8 @@ hash_identify(uint8 info)
case XLOG_HASH_UPDATE_META_PAGE:
id = "UPDATE_META_PAGE";
break;
+ case XLOG_HASH_VACUUM_ONE_PAGE:
+ id = "VACUUM_ONE_PAGE";
}
return id;
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 8328fc5..c6355de 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -57,6 +57,7 @@ typedef uint32 Bucket;
#define LH_BUCKET_BEING_POPULATED (1 << 4)
#define LH_BUCKET_BEING_SPLIT (1 << 5)
#define LH_BUCKET_NEEDS_SPLIT_CLEANUP (1 << 6)
+#define LH_PAGE_HAS_DEAD_TUPLES (1 << 7)
typedef struct HashPageOpaqueData
{
@@ -72,6 +73,7 @@ typedef HashPageOpaqueData *HashPageOpaque;
#define H_NEEDS_SPLIT_CLEANUP(opaque) ((opaque)->hasho_flag & LH_BUCKET_NEEDS_SPLIT_CLEANUP)
#define H_BUCKET_BEING_SPLIT(opaque) ((opaque)->hasho_flag & LH_BUCKET_BEING_SPLIT)
#define H_BUCKET_BEING_POPULATED(opaque) ((opaque)->hasho_flag & LH_BUCKET_BEING_POPULATED)
+#define H_HAS_DEAD_TUPLES(opaque) ((opaque)->hasho_flag & LH_PAGE_HAS_DEAD_TUPLES)
/*
* The page ID is for the convenience of pg_filedump and similar utilities,
@@ -81,6 +83,13 @@ typedef HashPageOpaqueData *HashPageOpaque;
*/
#define HASHO_PAGE_ID 0xFF80
+typedef struct HashScanPosItem /* what we remember about each match */
+{
+ ItemPointerData heapTid; /* TID of referenced heap item */
+ OffsetNumber indexOffset; /* index item's location within page */
+} HashScanPosItem;
+
+
/*
* HashScanOpaqueData is private state for a hash index scan.
*/
@@ -121,6 +130,9 @@ typedef struct HashScanOpaqueData
* referred only when hashso_buc_populated is true.
*/
bool hashso_buc_split;
+ /* info about killed items if any (killedItems is NULL if never used) */
+ HashScanPosItem *killedItems; /* tids and offset numbers of killed items */
+ int numKilled; /* number of currently stored items */
} HashScanOpaqueData;
typedef HashScanOpaqueData *HashScanOpaque;
@@ -182,6 +194,7 @@ typedef struct HashMetaPageData
typedef HashMetaPageData *HashMetaPage;
+
/*
* Maximum size of a hash index item (it's okay to have only one per page)
*/
@@ -307,7 +320,7 @@ extern Datum hash_uint32(uint32 k);
/* private routines */
/* hashinsert.c */
-extern void _hash_doinsert(Relation rel, IndexTuple itup);
+extern void _hash_doinsert(Relation rel, IndexTuple itup, RelFileNode hnode);
extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
Size itemsize, IndexTuple itup);
extern void _hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
@@ -362,7 +375,7 @@ extern HSpool *_h_spoolinit(Relation heap, Relation index, uint32 num_buckets);
extern void _h_spooldestroy(HSpool *hspool);
extern void _h_spool(HSpool *hspool, ItemPointer self,
Datum *values, bool *isnull);
-extern void _h_indexbuild(HSpool *hspool);
+extern void _h_indexbuild(HSpool *hspool, RelFileNode rnode);
/* hashutil.c */
extern bool _hash_checkqual(IndexScanDesc scan, IndexTuple itup);
@@ -382,6 +395,7 @@ extern BlockNumber _hash_get_oldblock_from_newbucket(Relation rel, Bucket new_bu
extern BlockNumber _hash_get_newblock_from_oldbucket(Relation rel, Bucket old_bucket);
extern Bucket _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
uint32 lowmask, uint32 maxbucket);
+extern void hashkillitems(IndexScanDesc scan);
/* hash.c */
extern void hashbucketcleanup(Relation rel, Bucket cur_bucket,
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index c53f878..aab4ac2 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -43,6 +43,7 @@
#define XLOG_HASH_UPDATE_META_PAGE 0xB0 /* update meta page after
* vacuum */
+#define XLOG_HASH_VACUUM_ONE_PAGE 0xC0 /* remove dead tuples from index page */
/*
* xl_hash_split_allocpage flag values, 8 bits are available.
@@ -257,6 +258,25 @@ typedef struct xl_hash_init_bitmap_page
#define SizeOfHashInitBitmapPage \
(offsetof(xl_hash_init_bitmap_page, bmsize) + sizeof(uint16))
+/*
+ * This is what we need for index tuple deletion and to
+ * update the meta page.
+ *
+ * This data record is used for XLOG_HASH_VACUUM_ONE_PAGE
+ *
+ * Backup Blk 0/1: bucket page
+ * Backup Blk 2: meta page
+ */
+typedef struct xl_hash_vacuum
+{
+ RelFileNode hnode;
+ double ntuples;
+ bool is_primary_bucket_page;
+} xl_hash_vacuum;
+
+#define SizeOfHashVacuum \
+ (offsetof(xl_hash_vacuum, is_primary_bucket_page) + sizeof(bool))
+
extern void hash_redo(XLogReaderState *record);
extern void hash_desc(StringInfo buf, XLogReaderState *record);
extern const char *hash_identify(uint8 info);
Hi Ashutosh,
On 01/04/2017 06:13 AM, Ashutosh Sharma wrote:
Attached is the v3 patch, rebased on PostgreSQL HEAD and the WAL v7 patch.
It also takes care of all the previous comments from Jesper - [1].
With an --enable-cassert build (master / WAL v7 / MV v3) and
-- ddl.sql --
CREATE TABLE test AS SELECT generate_series(1, 10) AS id, 0 AS val;
CREATE INDEX IF NOT EXISTS idx_id ON test USING hash (id);
CREATE INDEX IF NOT EXISTS idx_val ON test USING hash (val);
ANALYZE;
-- ddl.sql --
-- test.sql --
\set id random(1,10)
\set val random(0,10)
BEGIN;
UPDATE test SET val = :val WHERE id = :id;
COMMIT;
-- test.sql --
using pgbench -M prepared -c 10 -j 10 -T 600 -f test.sql test
crashes after a few minutes with
TRAP: FailedAssertion("!(LWLockHeldByMeInMode(((LWLock*)
(&(bufHdr)->content_lock)), LW_EXCLUSIVE))", File: "bufmgr.c", Line: 3781)
BTW, better rename 'hashkillitems' to '_hash_kill_items' to follow the
naming convention in hash.h
Best regards,
Jesper
Hi,
using pgbench -M prepared -c 10 -j 10 -T 600 -f test.sql test
crashes after a few minutes with
TRAP: FailedAssertion("!(LWLockHeldByMeInMode(((LWLock*)
(&(bufHdr)->content_lock)), LW_EXCLUSIVE))", File: "bufmgr.c", Line: 3781)
Attached v4 patch fixes this assertion failure.
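For context: IsBufferCleanupOK() asserts that the caller already holds the
buffer's content lock in exclusive mode, and in v3 it was called on the
primary bucket buffer even when the insert only had the overflow page
locked. The v4 fix takes (and releases) that lock around the check; below
is a condensed sketch of the hashinsert.c hunk attached further down, with
context trimmed - refer to the patch itself for the exact code:

    /* inside the free-space loop of _hash_doinsert() */
    if (H_HAS_DEAD_TUPLES(pageopaque))
    {
        /* acquire the exclusive content lock that IsBufferCleanupOK() expects */
        if (bucket_buf != buf)
            LockBuffer(bucket_buf, BUFFER_LOCK_EXCLUSIVE);

        if (IsBufferCleanupOK(bucket_buf))
        {
            /* remove LP_DEAD items from the current page */
            _hash_vacuum_one_page(rel, metabuf, buf, bucket_buf,
                                  (buf == bucket_buf), hnode);

            if (bucket_buf != buf)
                LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);

            if (PageGetFreeSpace(page) >= itemsz)
                break;          /* enough space was freed on this page */
        }
        else if (bucket_buf != buf)
            LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
    }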
BTW, better rename 'hashkillitems' to '_hash_kill_items' to follow the
naming convention in hash.h
Okay, I have renamed 'hashkillitems' to '_hash_kill_items'. Please
check the attached v4 patch.
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Attachments:
microvacuum_hash_index_v4.patchinvalid/octet-stream; name=microvacuum_hash_index_v4.patchDownload
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index f186e52..cdfaf54 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -157,7 +157,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
if (buildstate.spool)
{
/* sort the tuples and insert them into the index */
- _h_indexbuild(buildstate.spool);
+ _h_indexbuild(buildstate.spool, heap->rd_node);
_h_spooldestroy(buildstate.spool);
}
@@ -196,6 +196,8 @@ hashbuildCallback(Relation index,
Datum index_values[1];
bool index_isnull[1];
IndexTuple itup;
+ Relation rel;
+ RelFileNode rnode;
/* convert data to a hash key; on failure, do not insert anything */
if (!_hash_convert_tuple(index,
@@ -212,8 +214,12 @@ hashbuildCallback(Relation index,
/* form an index tuple and point it at the heap tuple */
itup = index_form_tuple(RelationGetDescr(index),
index_values, index_isnull);
+ /* Get RelfileNode from relation OID */
+ rel = relation_open(htup->t_tableOid, NoLock);
+ rnode = rel->rd_node;
+ relation_close(rel, NoLock);
itup->t_tid = htup->t_self;
- _hash_doinsert(index, itup);
+ _hash_doinsert(index, itup, rnode);
pfree(itup);
}
@@ -245,7 +251,7 @@ hashinsert(Relation rel, Datum *values, bool *isnull,
itup = index_form_tuple(RelationGetDescr(rel), index_values, index_isnull);
itup->t_tid = *ht_ctid;
- _hash_doinsert(rel, itup);
+ _hash_doinsert(rel, itup, heapRel->rd_node);
pfree(itup);
@@ -325,14 +331,21 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
if (scan->kill_prior_tuple)
{
/*
- * Yes, so mark it by setting the LP_DEAD state in the item flags.
+ * Yes, so remember it for later. (We'll deal with all such
+ * tuples at once right after leaving the index page or at
+ * end of scan.)
*/
- ItemIdMarkDead(PageGetItemId(page, offnum));
+ if (so->killedItems == NULL)
+ so->killedItems = palloc(MaxIndexTuplesPerPage *
+ sizeof(HashScanPosItem));
- /*
- * Since this can be redone later if needed, mark as a hint.
- */
- MarkBufferDirtyHint(buf, true);
+ if (so->numKilled < MaxIndexTuplesPerPage)
+ {
+ so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
+ so->killedItems[so->numKilled].indexOffset =
+ ItemPointerGetOffsetNumber(&(so->hashso_curpos));
+ so->numKilled++;
+ }
}
/*
@@ -440,6 +453,9 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
so->hashso_buc_populated = false;
so->hashso_buc_split = false;
+ so->killedItems = NULL;
+ so->numKilled = 0;
+
scan->opaque = so;
return scan;
@@ -455,6 +471,10 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
HashScanOpaque so = (HashScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ _hash_kill_items(scan);
+
_hash_dropscanbuf(rel, so);
/* set position invalid (this will cause _hash_first call) */
@@ -482,8 +502,14 @@ hashendscan(IndexScanDesc scan)
HashScanOpaque so = (HashScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ _hash_kill_items(scan);
+
_hash_dropscanbuf(rel, so);
+ if (so->killedItems != NULL)
+ pfree(so->killedItems);
pfree(so);
scan->opaque = NULL;
}
@@ -835,6 +861,16 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
PageIndexMultiDelete(page, deletable, ndeletable);
bucket_dirty = true;
+
+ /*
+ * Let us mark the page as clean if vacuum removes the DEAD tuples
+ * from an index page. We do this by clearing LH_PAGE_HAS_DEAD_TUPLES
+ * flag.
+ */
+ if (tuples_removed && *tuples_removed > 0 &&
+ opaque->hasho_flag & LH_PAGE_HAS_DEAD_TUPLES)
+ opaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
+
MarkBufferDirty(buf);
/* XLOG stuff */
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index 41429a7..3f88804 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -14,8 +14,13 @@
*/
#include "postgres.h"
+#include "access/heapam_xlog.h"
#include "access/hash_xlog.h"
#include "access/xlogutils.h"
+#include "access/xlog.h"
+#include "access/transam.h"
+#include "storage/procarray.h"
+#include "miscadmin.h"
/*
* replay a hash index meta page
@@ -921,6 +926,247 @@ hash_xlog_update_meta_page(XLogReaderState *record)
UnlockReleaseBuffer(metabuf);
}
+/*
+ * Get the latestRemovedXid from the heap pages pointed at by the index
+ * tuples being deleted. This puts the work for calculating latestRemovedXid
+ * into the recovery path rather than the primary path.
+ *
+ * It's possible that this generates a fair amount of I/O, since an index
+ * block may have hundreds of tuples being deleted. Repeat accesses to the
+ * same heap blocks are common, though are not yet optimised.
+ */
+static TransactionId
+hash_xlog_vacuum_get_latestRemovedXid(XLogReaderState *record)
+{
+ xl_hash_vacuum *xlrec = (xl_hash_vacuum *) XLogRecGetData(record);
+ OffsetNumber *unused;
+ Buffer ibuffer,
+ hbuffer;
+ Page ipage,
+ hpage;
+ RelFileNode rnode;
+ BlockNumber blkno;
+ ItemId iitemid,
+ hitemid;
+ IndexTuple itup;
+ HeapTupleHeader htuphdr;
+ BlockNumber hblkno;
+ OffsetNumber hoffnum;
+ TransactionId latestRemovedXid = InvalidTransactionId;
+ int i;
+ char *ptr;
+ Size len;
+
+ /*
+ * If there's nothing running on the standby we don't need to derive a
+ * full latestRemovedXid value, so use a fast path out of here. This
+ * returns InvalidTransactionId, and so will conflict with all HS
+ * transactions; but since we just worked out that that's zero people,
+ * it's OK.
+ */
+ if (CountDBBackends(InvalidOid) == 0)
+ return latestRemovedXid;
+
+ /*
+ * Get index page. If the DB is consistent, this should not fail, nor
+ * should any of the heap page fetches below. If one does, we return
+ * InvalidTransactionId to cancel all HS transactions. That's probably
+ * overkill, but it's safe, and certainly better than panicking here.
+ */
+ XLogRecGetBlockTag(record, 1, &rnode, NULL, &blkno);
+ ibuffer = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno, RBM_NORMAL);
+
+ if (!BufferIsValid(ibuffer))
+ return InvalidTransactionId;
+ LockBuffer(ibuffer, HASH_READ);
+ ipage = (Page) BufferGetPage(ibuffer);
+
+ /*
+ * Loop through the deleted index items to obtain the TransactionId from
+ * the heap items they point to.
+ */
+ ptr = XLogRecGetBlockData(record, 1, &len);
+
+ unused = (OffsetNumber *) ptr;
+
+ for (i = 0; i < xlrec->ntuples; i++)
+ {
+ /*
+ * Identify the index tuple about to be deleted.
+ */
+ iitemid = PageGetItemId(ipage, unused[i]);
+ itup = (IndexTuple) PageGetItem(ipage, iitemid);
+
+ /*
+ * Locate the heap page that the index tuple points at
+ */
+ hblkno = ItemPointerGetBlockNumber(&(itup->t_tid));
+ hbuffer = XLogReadBufferExtended(xlrec->hnode, MAIN_FORKNUM,
+ hblkno, RBM_NORMAL);
+
+ if (!BufferIsValid(hbuffer))
+ {
+ UnlockReleaseBuffer(ibuffer);
+ return InvalidTransactionId;
+ }
+ LockBuffer(hbuffer, HASH_READ);
+ hpage = (Page) BufferGetPage(hbuffer);
+
+ /*
+ * Look up the heap tuple header that the index tuple points at by
+ * using the heap node supplied with the xlrec. We can't use
+ * heap_fetch, since it uses ReadBuffer rather than XLogReadBuffer.
+ * Note that we are not looking at tuple data here, just headers.
+ */
+ hoffnum = ItemPointerGetOffsetNumber(&(itup->t_tid));
+ hitemid = PageGetItemId(hpage, hoffnum);
+
+ /*
+ * Follow any redirections until we find something useful.
+ */
+ while (ItemIdIsRedirected(hitemid))
+ {
+ hoffnum = ItemIdGetRedirect(hitemid);
+ hitemid = PageGetItemId(hpage, hoffnum);
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ /*
+ * If the heap item has storage, then read the header and use that to
+ * set latestRemovedXid.
+ *
+ * Some LP_DEAD items may not be accessible, so we ignore them.
+ */
+ if (ItemIdHasStorage(hitemid))
+ {
+ htuphdr = (HeapTupleHeader) PageGetItem(hpage, hitemid);
+ HeapTupleHeaderAdvanceLatestRemovedXid(htuphdr, &latestRemovedXid);
+ }
+ else if (ItemIdIsDead(hitemid))
+ {
+ /*
+ * Conjecture: if hitemid is dead then it had xids before the xids
+ * marked on LP_NORMAL items. So we just ignore this item and move
+ * onto the next, for the purposes of calculating
+ * latestRemovedxids.
+ */
+ }
+ else
+ Assert(!ItemIdIsUsed(hitemid));
+
+ UnlockReleaseBuffer(hbuffer);
+ }
+
+ UnlockReleaseBuffer(ibuffer);
+
+ /*
+ * If all heap tuples were LP_DEAD then we will be returning
+ * InvalidTransactionId here, which avoids conflicts. This matches
+ * existing logic which assumes that LP_DEAD tuples must already be older
+ * than the latestRemovedXid on the cleanup record that set them as
+ * LP_DEAD, hence must already have generated a conflict.
+ */
+ return latestRemovedXid;
+}
+
+/*
+ * replay delete operation in hash index to remove
+ * tuples marked as DEAD during index tuple insertion.
+ */
+static void
+hash_xlog_vacuum_one_page(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ xl_hash_vacuum *xldata = (xl_hash_vacuum *) XLogRecGetData(record);
+ Buffer bucketbuf = InvalidBuffer;
+ Buffer buffer;
+ Buffer metabuf;
+ Page page;
+ XLogRedoAction action;
+
+ /*
+ * If we have any conflict processing to do, it must happen before we
+ * update the page.
+ *
+ * Hash Index delete records can conflict with standby queries. You might
+ * think that vacuum records would conflict as well, but we've handled
+ * that already. XLOG_HEAP2_CLEANUP_INFO records provide the highest xid
+ * cleaned by the vacuum of the heap and so we can resolve any conflicts
+ * just once when that arrives. After that we know that no conflicts
+ * exist from individual hash index vacuum records on that index.
+ */
+ if (InHotStandby)
+ {
+ TransactionId latestRemovedXid =
+ hash_xlog_vacuum_get_latestRemovedXid(record);
+ RelFileNode rnode;
+
+ XLogRecGetBlockTag(record, 1, &rnode, NULL, NULL);
+ ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rnode);
+ }
+
+ if (xldata->is_primary_bucket_page)
+ action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL,
+ true, &buffer);
+ else
+ {
+ RelFileNode rnode;
+ BlockNumber blkno;
+
+ XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+ bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+ RBM_NORMAL);
+
+ if (BufferIsValid(bucketbuf))
+ LockBufferForCleanup(bucketbuf);
+
+ action = XLogReadBufferForRedo(record, 1, &buffer);
+ }
+
+ if (action == BLK_NEEDS_REDO)
+ {
+ char *ptr;
+ Size len;
+
+ ptr = XLogRecGetBlockData(record, 1, &len);
+
+ page = (Page) BufferGetPage(buffer);
+
+ if (len > 0)
+ {
+ OffsetNumber *unused;
+ OffsetNumber *unend;
+
+ unused = (OffsetNumber *) ptr;
+ unend = (OffsetNumber *) ((char *) ptr + len);
+
+ if ((unend - unused) > 0)
+ PageIndexMultiDelete(page, unused, unend - unused);
+ }
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ }
+ if (BufferIsValid(buffer))
+ UnlockReleaseBuffer(buffer);
+
+ if (XLogReadBufferForRedo(record, 2, &metabuf) == BLK_NEEDS_REDO)
+ {
+ Page metapage;
+ HashMetaPage metap;
+
+ metapage = BufferGetPage(metabuf);
+ metap = HashPageGetMeta(metapage);
+
+ metap->hashm_ntuples -= xldata->ntuples;
+
+ PageSetLSN(metapage, lsn);
+ MarkBufferDirty(metabuf);
+ }
+ if (BufferIsValid(metabuf))
+ UnlockReleaseBuffer(metabuf);
+}
+
void
hash_redo(XLogReaderState *record)
{
@@ -964,6 +1210,9 @@ hash_redo(XLogReaderState *record)
case XLOG_HASH_UPDATE_META_PAGE:
hash_xlog_update_meta_page(record);
break;
+ case XLOG_HASH_VACUUM_ONE_PAGE:
+ hash_xlog_vacuum_one_page(record);
+ break;
default:
elog(PANIC, "hash_redo: unknown op code %u", info);
}
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 057bd3c..e886544 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -19,7 +19,12 @@
#include "access/hash_xlog.h"
#include "miscadmin.h"
#include "utils/rel.h"
+#include "storage/lwlock.h"
+#include "storage/buf_internals.h"
+static void _hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf,
+ Buffer bucket_buf, bool is_primary_bucket_page,
+ RelFileNode hnode);
/*
* _hash_doinsert() -- Handle insertion of a single index tuple.
@@ -28,7 +33,7 @@
* and hashinsert. By here, itup is completely filled in.
*/
void
-_hash_doinsert(Relation rel, IndexTuple itup)
+_hash_doinsert(Relation rel, IndexTuple itup, RelFileNode hnode)
{
Buffer buf = InvalidBuffer;
Buffer bucket_buf;
@@ -166,10 +171,41 @@ restart_insert:
/* Do the insertion */
while (PageGetFreeSpace(page) < itemsz)
{
+ BlockNumber nextblkno;
+
+ /*
+ * Check if current page has any DEAD tuples. If yes,
+ * delete these tuples and see if we can get a space for
+ * the new item to be inserted before moving to the next
+ * page in the bucket chain.
+ */
+ if (H_HAS_DEAD_TUPLES(pageopaque))
+ {
+ if (bucket_buf != buf)
+ LockBuffer(bucket_buf, BUFFER_LOCK_EXCLUSIVE);
+
+ if (IsBufferCleanupOK(bucket_buf))
+ {
+ _hash_vacuum_one_page(rel, metabuf, buf, bucket_buf,
+ (buf == bucket_buf) ? true : false,
+ hnode);
+ if (bucket_buf != buf)
+ LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
+
+ if (PageGetFreeSpace(page) >= itemsz)
+ break; /* OK, now we have enough space */
+ }
+ else
+ {
+ if (bucket_buf != buf)
+ LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
+ }
+ }
+
/*
* no space on this page; check for an overflow page
*/
- BlockNumber nextblkno = pageopaque->hasho_nextblkno;
+ nextblkno = pageopaque->hasho_nextblkno;
if (BlockNumberIsValid(nextblkno))
{
@@ -205,7 +241,8 @@ restart_insert:
Assert(PageGetFreeSpace(page) >= itemsz);
}
pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
- Assert(pageopaque->hasho_flag == LH_OVERFLOW_PAGE);
+ Assert(pageopaque->hasho_flag == LH_OVERFLOW_PAGE ||
+ pageopaque->hasho_flag == (LH_OVERFLOW_PAGE | LH_PAGE_HAS_DEAD_TUPLES));
Assert(pageopaque->hasho_bucket == bucket);
}
@@ -347,3 +384,102 @@ _hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
RelationGetRelationName(rel));
}
}
+
+/*
+ * _hash_vacuum_one_page - vacuum just one index page.
+ * Try to remove LP_DEAD items from the given page. We
+ * must acquire cleanup lock on the primary bucket page
+ * before calling this function.
+ */
+
+static void
+_hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf,
+ Buffer bucket_buf, bool is_primary_bucket_page,
+ RelFileNode hnode)
+{
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+ OffsetNumber offnum,
+ maxoff;
+ Page page = BufferGetPage(buf);
+ HashPageOpaque pageopaque;
+ HashMetaPage metap;
+ double tuples_removed = 0;
+
+ /* Scan each tuple in page to see if it is marked as LP_DEAD */
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (offnum = FirstOffsetNumber;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemId = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemId))
+ {
+ deletable[ndeletable++] = offnum;
+ tuples_removed += 1;
+ }
+ }
+
+ if (ndeletable > 0)
+ {
+ /* No ereport(ERROR) until changes are logged */
+ START_CRIT_SECTION();
+
+ PageIndexMultiDelete(page, deletable, ndeletable);
+
+ pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
+
+ /*
+ * Write-lock the meta page so that we can decrement
+ * tuple count.
+ */
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+
+ metap = HashPageGetMeta(BufferGetPage(metabuf));
+ metap->hashm_ntuples -= tuples_removed;
+
+ MarkBufferDirty(buf);
+ MarkBufferDirty(metabuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ xl_hash_vacuum xlrec;
+ XLogRecPtr recptr;
+
+ xlrec.hnode = hnode;
+ xlrec.is_primary_bucket_page = is_primary_bucket_page;
+ xlrec.ntuples = tuples_removed;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfHashVacuum);
+
+ /*
+ * primary bucket buffer needs to be registered to ensure
+ * that we acquire cleanup lock during replay.
+ */
+ if (!xlrec.is_primary_bucket_page)
+ XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+ XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
+ XLogRegisterBufData(1, (char *) deletable,
+ ndeletable * sizeof(OffsetNumber));
+
+ XLogRegisterBuffer(2, metabuf, REGBUF_STANDARD);
+
+ recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_VACUUM_ONE_PAGE);
+
+ PageSetLSN(BufferGetPage(buf), recptr);
+ PageSetLSN(BufferGetPage(metabuf), recptr);
+ }
+
+ END_CRIT_SECTION();
+ /*
+ * Releasing write lock on meta page as we have updated
+ * the tuple count.
+ */
+ LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+ }
+}
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 9aaee1e..5d1c5be 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -509,6 +509,10 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
break; /* yes, so exit for-loop */
}
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ _hash_kill_items(scan);
+
/*
* ran off the end of this page, try the next
*/
@@ -562,6 +566,10 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
break; /* yes, so exit for-loop */
}
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ _hash_kill_items(scan);
+
/*
* ran off the end of this page, try the next
*/
diff --git a/src/backend/access/hash/hashsort.c b/src/backend/access/hash/hashsort.c
index ea8f109..60483cf 100644
--- a/src/backend/access/hash/hashsort.c
+++ b/src/backend/access/hash/hashsort.c
@@ -101,7 +101,7 @@ _h_spool(HSpool *hspool, ItemPointer self, Datum *values, bool *isnull)
* create an entire index.
*/
void
-_h_indexbuild(HSpool *hspool)
+_h_indexbuild(HSpool *hspool, RelFileNode rnode)
{
IndexTuple itup;
#ifdef USE_ASSERT_CHECKING
@@ -126,6 +126,6 @@ _h_indexbuild(HSpool *hspool)
Assert(hashkey >= lasthashkey);
#endif
- _hash_doinsert(hspool->index, itup);
+ _hash_doinsert(hspool->index, itup, rnode);
}
}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index c705531..4810553 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -19,6 +19,7 @@
#include "access/relscan.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"
+#include "storage/buf_internals.h"
#define CALC_NEW_BUCKET(old_bucket, lowmask) \
old_bucket | (lowmask + 1)
@@ -446,3 +447,72 @@ _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
return new_bucket;
}
+
+/*
+ * _hash_kill_items - set LP_DEAD state for items an indexscan caller has
+ * told us were killed.
+ *
+ * scan->opaque, referenced locally through so, contains information about the
+ * current page and killed tuples thereon (generally, this should only be
+ * called if so->numKilled > 0).
+ *
+ * We match items by heap TID before assuming they are the right ones to
+ * delete. If an item has moved off the current page due to a split, we'll
+ * fail to find it and do nothing (this is not an error case --- we assume
+ * the item will eventually get marked in a future indexscan).
+ */
+void
+_hash_kill_items(IndexScanDesc scan)
+{
+ HashScanOpaque so = (HashScanOpaque) scan->opaque;
+ Page page;
+ HashPageOpaque opaque;
+ OffsetNumber offnum, maxoff;
+ int numKilled = so->numKilled;
+ int i;
+ bool killedsomething = false;
+
+ Assert(so->numKilled > 0);
+ Assert(so->killedItems != NULL);
+
+ /*
+ * Always reset the scan state, so we don't look for same
+ * items on other pages.
+ */
+ so->numKilled = 0;
+
+ page = BufferGetPage(so->hashso_curbuf);
+ opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ for (i = 0; i < numKilled; i++)
+ {
+ offnum = so->killedItems[i].indexOffset;
+
+ while (offnum <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
+
+ if (ItemPointerEquals(&ituple->t_tid, &so->killedItems[i].heapTid))
+ {
+ /* found the item */
+ ItemIdMarkDead(iid);
+ killedsomething = true;
+ break; /* out of inner search loop */
+ }
+ offnum = OffsetNumberNext(offnum);
+ }
+ }
+
+ /*
+ * Since this can be redone later if needed, mark as dirty hint.
+ * Whenever we mark anything LP_DEAD, we also set the page's
+ * LH_PAGE_HAS_DEAD_TUPLES flag, which is likewise just a hint.
+ */
+ if (killedsomething)
+ {
+ opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
+ MarkBufferDirtyHint(so->hashso_curbuf, true);
+ }
+}
diff --git a/src/backend/access/rmgrdesc/hashdesc.c b/src/backend/access/rmgrdesc/hashdesc.c
index 5e3f7d8..5a06bb1 100644
--- a/src/backend/access/rmgrdesc/hashdesc.c
+++ b/src/backend/access/rmgrdesc/hashdesc.c
@@ -155,6 +155,8 @@ hash_identify(uint8 info)
case XLOG_HASH_UPDATE_META_PAGE:
id = "UPDATE_META_PAGE";
break;
+ case XLOG_HASH_VACUUM_ONE_PAGE:
+ id = "VACUUM_ONE_PAGE";
}
return id;
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 8328fc5..bb73012 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -57,6 +57,7 @@ typedef uint32 Bucket;
#define LH_BUCKET_BEING_POPULATED (1 << 4)
#define LH_BUCKET_BEING_SPLIT (1 << 5)
#define LH_BUCKET_NEEDS_SPLIT_CLEANUP (1 << 6)
+#define LH_PAGE_HAS_DEAD_TUPLES (1 << 7)
typedef struct HashPageOpaqueData
{
@@ -72,6 +73,7 @@ typedef HashPageOpaqueData *HashPageOpaque;
#define H_NEEDS_SPLIT_CLEANUP(opaque) ((opaque)->hasho_flag & LH_BUCKET_NEEDS_SPLIT_CLEANUP)
#define H_BUCKET_BEING_SPLIT(opaque) ((opaque)->hasho_flag & LH_BUCKET_BEING_SPLIT)
#define H_BUCKET_BEING_POPULATED(opaque) ((opaque)->hasho_flag & LH_BUCKET_BEING_POPULATED)
+#define H_HAS_DEAD_TUPLES(opaque) ((opaque)->hasho_flag & LH_PAGE_HAS_DEAD_TUPLES)
/*
* The page ID is for the convenience of pg_filedump and similar utilities,
@@ -81,6 +83,13 @@ typedef HashPageOpaqueData *HashPageOpaque;
*/
#define HASHO_PAGE_ID 0xFF80
+typedef struct HashScanPosItem /* what we remember about each match */
+{
+ ItemPointerData heapTid; /* TID of referenced heap item */
+ OffsetNumber indexOffset; /* index item's location within page */
+} HashScanPosItem;
+
+
/*
* HashScanOpaqueData is private state for a hash index scan.
*/
@@ -121,6 +130,9 @@ typedef struct HashScanOpaqueData
* referred only when hashso_buc_populated is true.
*/
bool hashso_buc_split;
+ /* info about killed items if any (killedItems is NULL if never used) */
+ HashScanPosItem *killedItems; /* tids and offset numbers of killed items */
+ int numKilled; /* number of currently stored items */
} HashScanOpaqueData;
typedef HashScanOpaqueData *HashScanOpaque;
@@ -182,6 +194,7 @@ typedef struct HashMetaPageData
typedef HashMetaPageData *HashMetaPage;
+
/*
* Maximum size of a hash index item (it's okay to have only one per page)
*/
@@ -307,7 +320,7 @@ extern Datum hash_uint32(uint32 k);
/* private routines */
/* hashinsert.c */
-extern void _hash_doinsert(Relation rel, IndexTuple itup);
+extern void _hash_doinsert(Relation rel, IndexTuple itup, RelFileNode hnode);
extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
Size itemsize, IndexTuple itup);
extern void _hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
@@ -362,7 +375,7 @@ extern HSpool *_h_spoolinit(Relation heap, Relation index, uint32 num_buckets);
extern void _h_spooldestroy(HSpool *hspool);
extern void _h_spool(HSpool *hspool, ItemPointer self,
Datum *values, bool *isnull);
-extern void _h_indexbuild(HSpool *hspool);
+extern void _h_indexbuild(HSpool *hspool, RelFileNode rnode);
/* hashutil.c */
extern bool _hash_checkqual(IndexScanDesc scan, IndexTuple itup);
@@ -382,6 +395,7 @@ extern BlockNumber _hash_get_oldblock_from_newbucket(Relation rel, Bucket new_bu
extern BlockNumber _hash_get_newblock_from_oldbucket(Relation rel, Bucket old_bucket);
extern Bucket _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
uint32 lowmask, uint32 maxbucket);
+extern void _hash_kill_items(IndexScanDesc scan);
/* hash.c */
extern void hashbucketcleanup(Relation rel, Bucket cur_bucket,
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index c53f878..aab4ac2 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -43,6 +43,7 @@
#define XLOG_HASH_UPDATE_META_PAGE 0xB0 /* update meta page after
* vacuum */
+#define XLOG_HASH_VACUUM_ONE_PAGE 0xC0 /* remove dead tuples from index page */
/*
* xl_hash_split_allocpage flag values, 8 bits are available.
@@ -257,6 +258,25 @@ typedef struct xl_hash_init_bitmap_page
#define SizeOfHashInitBitmapPage \
(offsetof(xl_hash_init_bitmap_page, bmsize) + sizeof(uint16))
+/*
+ * This is what we need for index tuple deletion and to
+ * update the meta page.
+ *
+ * This data record is used for XLOG_HASH_VACUUM_ONE_PAGE
+ *
+ * Backup Blk 0/1: bucket page
+ * Backup Blk 2: meta page
+ */
+typedef struct xl_hash_vacuum
+{
+ RelFileNode hnode;
+ double ntuples;
+ bool is_primary_bucket_page;
+} xl_hash_vacuum;
+
+#define SizeOfHashVacuum \
+ (offsetof(xl_hash_vacuum, is_primary_bucket_page) + sizeof(bool))
+
extern void hash_redo(XLogReaderState *record);
extern void hash_desc(StringInfo buf, XLogReaderState *record);
extern const char *hash_identify(uint8 info);
Hi Ashutosh,
On 01/06/2017 12:54 AM, Ashutosh Sharma wrote:
using pgbench -M prepared -c 10 -j 10 -T 600 -f test.sql test
crashes after a few minutes with
TRAP: FailedAssertion("!(LWLockHeldByMeInMode(((LWLock*)
(&(bufHdr)->content_lock)), LW_EXCLUSIVE))", File: "bufmgr.c", Line: 3781)
Attached v4 patch fixes this assertion failure.
Yes, that fixes the problem.
However (master / WAL v7 / MV v4),
--- ddl.sql ---
CREATE TABLE test AS SELECT generate_series(1, 10) AS id, 0 AS val;
CREATE INDEX IF NOT EXISTS idx_id ON test USING hash (id);
CREATE INDEX IF NOT EXISTS idx_val ON test USING hash (val);
ANALYZE;
--- ddl.sql ---
--- test.sql ---
\set id random(1,10)
\set val random(0,10)
BEGIN;
DELETE FROM test WHERE id = :id;
INSERT INTO test VALUES (:id, :val);
COMMIT;
--- test.sql ---
gives
#9 0x000000000098a83e in elog_finish (elevel=20, fmt=0xb6ea92
"incorrect local pin count: %d") at elog.c:1378
#10 0x00000000007f0b33 in LockBufferForCleanup (buffer=1677) at
bufmgr.c:3605
#11 0x0000000000549390 in XLogReadBufferForRedoExtended
(record=0x2afced8, block_id=1 '\001', mode=RBM_NORMAL,
get_cleanup_lock=1 '\001', buf=0x7ffe3ee27c8c) at xlogutils.c:394
#12 0x00000000004c5026 in hash_xlog_vacuum_one_page (record=0x2afced8)
at hash_xlog.c:1109
#13 0x00000000004c5547 in hash_redo (record=0x2afced8) at hash_xlog.c:1214
#14 0x000000000053a361 in StartupXLOG () at xlog.c:6975
#15 0x00000000007a4ca0 in StartupProcessMain () at startup.c:216
on the slave instance in a master-slave setup.
Also, the src/backend/access/README file should be updated with a
description of the changes which this patch provides.
Best regards,
Jesper
Hi Jesper,
However (master / WAL v7 / MV v4),
--- ddl.sql ---
CREATE TABLE test AS SELECT generate_series(1, 10) AS id, 0 AS val;
CREATE INDEX IF NOT EXISTS idx_id ON test USING hash (id);
CREATE INDEX IF NOT EXISTS idx_val ON test USING hash (val);
ANALYZE;
--- ddl.sql ---
--- test.sql ---
\set id random(1,10)
\set val random(0,10)
BEGIN;
DELETE FROM test WHERE id = :id;
INSERT INTO test VALUES (:id, :val);
COMMIT;
--- test.sql ---
gives
#9 0x000000000098a83e in elog_finish (elevel=20, fmt=0xb6ea92 "incorrect
local pin count: %d") at elog.c:1378
#10 0x00000000007f0b33 in LockBufferForCleanup (buffer=1677) at
bufmgr.c:3605
#11 0x0000000000549390 in XLogReadBufferForRedoExtended (record=0x2afced8,
block_id=1 '\001', mode=RBM_NORMAL, get_cleanup_lock=1 '\001',
buf=0x7ffe3ee27c8c) at xlogutils.c:394
#12 0x00000000004c5026 in hash_xlog_vacuum_one_page (record=0x2afced8) at
hash_xlog.c:1109
#13 0x00000000004c5547 in hash_redo (record=0x2afced8) at hash_xlog.c:1214
#14 0x000000000053a361 in StartupXLOG () at xlog.c:6975
#15 0x00000000007a4ca0 in StartupProcessMain () at startup.c:216
on the slave instance in a master-slave setup.
Thanks for reporting this problem. It happens because I forgot to unpin
the bucketbuf in hash_xlog_vacuum_one_page(). Please find the attached v5
patch that fixes the issue.
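For reference, the leak was that hash_xlog_vacuum_one_page() pinned and
cleanup-locked the primary bucket buffer for the non-primary-page case but
never released it, so a later replay touching the same bucket failed the
pin-count check in LockBufferForCleanup(). A minimal sketch of the kind of
cleanup v5 adds at the end of the redo routine (the exact code is in the
attached patch and may differ slightly):

    /* at the end of hash_xlog_vacuum_one_page(), after the meta page is handled */
    if (BufferIsValid(bucketbuf))
        UnlockReleaseBuffer(bucketbuf);     /* drop the cleanup lock and the pin */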
Also, the src/backend/access/README file should be updated with a
description of the changes which this patch provides.
Okay, I have updated the insertion algorithm in the README file.
--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Attachments:
microvacuum_hash_index_v5.patchinvalid/octet-stream; name=microvacuum_hash_index_v5.patchDownload
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 06ef477..0e669f1 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -259,7 +259,10 @@ The insertion algorithm is rather similar:
if we get the lock on both the buckets
finish the split using algorithm mentioned below for split
release the pin on old bucket and restart the insert from beginning.
- if current page is full, release lock but not pin, read/exclusive-lock
+ if current page is full, first check if this page contains any dead tuples.
+ if yes, remove dead tuples from the current page and again check for the
+ availability of the space. If enough space found, insert the tuple else
+ release lock but not pin, read/exclusive-lock
next page; repeat as needed
>> see below if no space in any page of bucket
take buffer content lock in exclusive mode on metapage
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index f186e52..cdfaf54 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -157,7 +157,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
if (buildstate.spool)
{
/* sort the tuples and insert them into the index */
- _h_indexbuild(buildstate.spool);
+ _h_indexbuild(buildstate.spool, heap->rd_node);
_h_spooldestroy(buildstate.spool);
}
@@ -196,6 +196,8 @@ hashbuildCallback(Relation index,
Datum index_values[1];
bool index_isnull[1];
IndexTuple itup;
+ Relation rel;
+ RelFileNode rnode;
/* convert data to a hash key; on failure, do not insert anything */
if (!_hash_convert_tuple(index,
@@ -212,8 +214,12 @@ hashbuildCallback(Relation index,
/* form an index tuple and point it at the heap tuple */
itup = index_form_tuple(RelationGetDescr(index),
index_values, index_isnull);
+ /* Get RelfileNode from relation OID */
+ rel = relation_open(htup->t_tableOid, NoLock);
+ rnode = rel->rd_node;
+ relation_close(rel, NoLock);
itup->t_tid = htup->t_self;
- _hash_doinsert(index, itup);
+ _hash_doinsert(index, itup, rnode);
pfree(itup);
}
@@ -245,7 +251,7 @@ hashinsert(Relation rel, Datum *values, bool *isnull,
itup = index_form_tuple(RelationGetDescr(rel), index_values, index_isnull);
itup->t_tid = *ht_ctid;
- _hash_doinsert(rel, itup);
+ _hash_doinsert(rel, itup, heapRel->rd_node);
pfree(itup);
@@ -325,14 +331,21 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
if (scan->kill_prior_tuple)
{
/*
- * Yes, so mark it by setting the LP_DEAD state in the item flags.
+ * Yes, so remember it for later. (We'll deal with all such
+ * tuples at once right after leaving the index page or at
+ * end of scan.)
*/
- ItemIdMarkDead(PageGetItemId(page, offnum));
+ if (so->killedItems == NULL)
+ so->killedItems = palloc(MaxIndexTuplesPerPage *
+ sizeof(HashScanPosItem));
- /*
- * Since this can be redone later if needed, mark as a hint.
- */
- MarkBufferDirtyHint(buf, true);
+ if (so->numKilled < MaxIndexTuplesPerPage)
+ {
+ so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
+ so->killedItems[so->numKilled].indexOffset =
+ ItemPointerGetOffsetNumber(&(so->hashso_curpos));
+ so->numKilled++;
+ }
}
/*
@@ -440,6 +453,9 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
so->hashso_buc_populated = false;
so->hashso_buc_split = false;
+ so->killedItems = NULL;
+ so->numKilled = 0;
+
scan->opaque = so;
return scan;
@@ -455,6 +471,10 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
HashScanOpaque so = (HashScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ _hash_kill_items(scan);
+
_hash_dropscanbuf(rel, so);
/* set position invalid (this will cause _hash_first call) */
@@ -482,8 +502,14 @@ hashendscan(IndexScanDesc scan)
HashScanOpaque so = (HashScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ _hash_kill_items(scan);
+
_hash_dropscanbuf(rel, so);
+ if (so->killedItems != NULL)
+ pfree(so->killedItems);
pfree(so);
scan->opaque = NULL;
}
@@ -835,6 +861,16 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
PageIndexMultiDelete(page, deletable, ndeletable);
bucket_dirty = true;
+
+ /*
+ * Let us mark the page as clean if vacuum removes the DEAD tuples
+ * from an index page. We do this by clearing LH_PAGE_HAS_DEAD_TUPLES
+ * flag.
+ */
+ if (tuples_removed && *tuples_removed > 0 &&
+ opaque->hasho_flag & LH_PAGE_HAS_DEAD_TUPLES)
+ opaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
+
MarkBufferDirty(buf);
/* XLOG stuff */
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index 41429a7..9ef6e96 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -14,8 +14,13 @@
*/
#include "postgres.h"
+#include "access/heapam_xlog.h"
#include "access/hash_xlog.h"
#include "access/xlogutils.h"
+#include "access/xlog.h"
+#include "access/transam.h"
+#include "storage/procarray.h"
+#include "miscadmin.h"
/*
* replay a hash index meta page
@@ -921,6 +926,250 @@ hash_xlog_update_meta_page(XLogReaderState *record)
UnlockReleaseBuffer(metabuf);
}
+/*
+ * Get the latestRemovedXid from the heap pages pointed at by the index
+ * tuples being deleted. This puts the work for calculating latestRemovedXid
+ * into the recovery path rather than the primary path.
+ *
+ * It's possible that this generates a fair amount of I/O, since an index
+ * block may have hundreds of tuples being deleted. Repeat accesses to the
+ * same heap blocks are common, though are not yet optimised.
+ */
+static TransactionId
+hash_xlog_vacuum_get_latestRemovedXid(XLogReaderState *record)
+{
+ xl_hash_vacuum *xlrec = (xl_hash_vacuum *) XLogRecGetData(record);
+ OffsetNumber *unused;
+ Buffer ibuffer,
+ hbuffer;
+ Page ipage,
+ hpage;
+ RelFileNode rnode;
+ BlockNumber blkno;
+ ItemId iitemid,
+ hitemid;
+ IndexTuple itup;
+ HeapTupleHeader htuphdr;
+ BlockNumber hblkno;
+ OffsetNumber hoffnum;
+ TransactionId latestRemovedXid = InvalidTransactionId;
+ int i;
+ char *ptr;
+ Size len;
+
+ /*
+ * If there's nothing running on the standby we don't need to derive a
+ * full latestRemovedXid value, so use a fast path out of here. This
+ * returns InvalidTransactionId, and so will conflict with all HS
+ * transactions; but since we just worked out that that's zero people,
+ * it's OK.
+ */
+ if (CountDBBackends(InvalidOid) == 0)
+ return latestRemovedXid;
+
+ /*
+ * Get index page. If the DB is consistent, this should not fail, nor
+ * should any of the heap page fetches below. If one does, we return
+ * InvalidTransactionId to cancel all HS transactions. That's probably
+ * overkill, but it's safe, and certainly better than panicking here.
+ */
+ XLogRecGetBlockTag(record, 1, &rnode, NULL, &blkno);
+ ibuffer = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno, RBM_NORMAL);
+
+ if (!BufferIsValid(ibuffer))
+ return InvalidTransactionId;
+ LockBuffer(ibuffer, HASH_READ);
+ ipage = (Page) BufferGetPage(ibuffer);
+
+ /*
+ * Loop through the deleted index items to obtain the TransactionId from
+ * the heap items they point to.
+ */
+ ptr = XLogRecGetBlockData(record, 1, &len);
+
+ unused = (OffsetNumber *) ptr;
+
+ for (i = 0; i < xlrec->ntuples; i++)
+ {
+ /*
+ * Identify the index tuple about to be deleted.
+ */
+ iitemid = PageGetItemId(ipage, unused[i]);
+ itup = (IndexTuple) PageGetItem(ipage, iitemid);
+
+ /*
+ * Locate the heap page that the index tuple points at
+ */
+ hblkno = ItemPointerGetBlockNumber(&(itup->t_tid));
+ hbuffer = XLogReadBufferExtended(xlrec->hnode, MAIN_FORKNUM,
+ hblkno, RBM_NORMAL);
+
+ if (!BufferIsValid(hbuffer))
+ {
+ UnlockReleaseBuffer(ibuffer);
+ return InvalidTransactionId;
+ }
+ LockBuffer(hbuffer, HASH_READ);
+ hpage = (Page) BufferGetPage(hbuffer);
+
+ /*
+ * Look up the heap tuple header that the index tuple points at by
+ * using the heap node supplied with the xlrec. We can't use
+ * heap_fetch, since it uses ReadBuffer rather than XLogReadBuffer.
+ * Note that we are not looking at tuple data here, just headers.
+ */
+ hoffnum = ItemPointerGetOffsetNumber(&(itup->t_tid));
+ hitemid = PageGetItemId(hpage, hoffnum);
+
+ /*
+ * Follow any redirections until we find something useful.
+ */
+ while (ItemIdIsRedirected(hitemid))
+ {
+ hoffnum = ItemIdGetRedirect(hitemid);
+ hitemid = PageGetItemId(hpage, hoffnum);
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ /*
+ * If the heap item has storage, then read the header and use that to
+ * set latestRemovedXid.
+ *
+ * Some LP_DEAD items may not be accessible, so we ignore them.
+ */
+ if (ItemIdHasStorage(hitemid))
+ {
+ htuphdr = (HeapTupleHeader) PageGetItem(hpage, hitemid);
+ HeapTupleHeaderAdvanceLatestRemovedXid(htuphdr, &latestRemovedXid);
+ }
+ else if (ItemIdIsDead(hitemid))
+ {
+ /*
+ * Conjecture: if hitemid is dead then it had xids before the xids
+ * marked on LP_NORMAL items. So we just ignore this item and move
+ * onto the next, for the purposes of calculating
+ * latestRemovedxids.
+ */
+ }
+ else
+ Assert(!ItemIdIsUsed(hitemid));
+
+ UnlockReleaseBuffer(hbuffer);
+ }
+
+ UnlockReleaseBuffer(ibuffer);
+
+ /*
+ * If all heap tuples were LP_DEAD then we will be returning
+ * InvalidTransactionId here, which avoids conflicts. This matches
+ * existing logic which assumes that LP_DEAD tuples must already be older
+ * than the latestRemovedXid on the cleanup record that set them as
+ * LP_DEAD, hence must already have generated a conflict.
+ */
+ return latestRemovedXid;
+}
+
+/*
+ * replay delete operation in hash index to remove
+ * tuples marked as DEAD during index tuple insertion.
+ */
+static void
+hash_xlog_vacuum_one_page(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ xl_hash_vacuum *xldata = (xl_hash_vacuum *) XLogRecGetData(record);
+ Buffer bucketbuf = InvalidBuffer;
+ Buffer buffer;
+ Buffer metabuf;
+ Page page;
+ XLogRedoAction action;
+
+ /*
+ * If we have any conflict processing to do, it must happen before we
+ * update the page.
+ *
+ * Hash Index delete records can conflict with standby queries.You might
+ * think that vacuum records would conflict as well, but we've handled
+ * that already. XLOG_HEAP2_CLEANUP_INFO records provide the highest xid
+ * cleaned by the vacuum of the heap and so we can resolve any conflicts
+ * just once when that arrives. After that we know that no conflicts
+ * exist from individual hash index vacuum records on that index.
+ */
+ if (InHotStandby)
+ {
+ TransactionId latestRemovedXid =
+ hash_xlog_vacuum_get_latestRemovedXid(record);
+ RelFileNode rnode;
+
+ XLogRecGetBlockTag(record, 1, &rnode, NULL, NULL);
+ ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rnode);
+ }
+
+ if (xldata->is_primary_bucket_page)
+ action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL,
+ true, &buffer);
+ else
+ {
+ RelFileNode rnode;
+ BlockNumber blkno;
+
+ XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+ bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+ RBM_NORMAL);
+
+ if (BufferIsValid(bucketbuf))
+ LockBufferForCleanup(bucketbuf);
+
+ action = XLogReadBufferForRedo(record, 1, &buffer);
+ }
+
+ if (action == BLK_NEEDS_REDO)
+ {
+ char *ptr;
+ Size len;
+
+ ptr = XLogRecGetBlockData(record, 1, &len);
+
+ page = (Page) BufferGetPage(buffer);
+
+ if (len > 0)
+ {
+ OffsetNumber *unused;
+ OffsetNumber *unend;
+
+ unused = (OffsetNumber *) ptr;
+ unend = (OffsetNumber *) ((char *) ptr + len);
+
+ if ((unend - unused) > 0)
+ PageIndexMultiDelete(page, unused, unend - unused);
+ }
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ }
+ if (BufferIsValid(buffer))
+ UnlockReleaseBuffer(buffer);
+
+ if (BufferIsValid(bucketbuf))
+ UnlockReleaseBuffer(bucketbuf);
+
+ if (XLogReadBufferForRedo(record, 2, &metabuf) == BLK_NEEDS_REDO)
+ {
+ Page metapage;
+ HashMetaPage metap;
+
+ metapage = BufferGetPage(metabuf);
+ metap = HashPageGetMeta(metapage);
+
+ metap->hashm_ntuples -= xldata->ntuples;
+
+ PageSetLSN(metapage, lsn);
+ MarkBufferDirty(metabuf);
+ }
+ if (BufferIsValid(metabuf))
+ UnlockReleaseBuffer(metabuf);
+}
+
void
hash_redo(XLogReaderState *record)
{
@@ -964,6 +1213,9 @@ hash_redo(XLogReaderState *record)
case XLOG_HASH_UPDATE_META_PAGE:
hash_xlog_update_meta_page(record);
break;
+ case XLOG_HASH_VACUUM_ONE_PAGE:
+ hash_xlog_vacuum_one_page(record);
+ break;
default:
elog(PANIC, "hash_redo: unknown op code %u", info);
}
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 057bd3c..e886544 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -19,7 +19,12 @@
#include "access/hash_xlog.h"
#include "miscadmin.h"
#include "utils/rel.h"
+#include "storage/lwlock.h"
+#include "storage/buf_internals.h"
+static void _hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf,
+ Buffer bucket_buf, bool is_primary_bucket_page,
+ RelFileNode hnode);
/*
* _hash_doinsert() -- Handle insertion of a single index tuple.
@@ -28,7 +33,7 @@
* and hashinsert. By here, itup is completely filled in.
*/
void
-_hash_doinsert(Relation rel, IndexTuple itup)
+_hash_doinsert(Relation rel, IndexTuple itup, RelFileNode hnode)
{
Buffer buf = InvalidBuffer;
Buffer bucket_buf;
@@ -166,10 +171,41 @@ restart_insert:
/* Do the insertion */
while (PageGetFreeSpace(page) < itemsz)
{
+ BlockNumber nextblkno;
+
+ /*
+ * Check if current page has any DEAD tuples. If yes,
+ * delete these tuples and see if we can get a space for
+ * the new item to be inserted before moving to the next
+ * page in the bucket chain.
+ */
+ if (H_HAS_DEAD_TUPLES(pageopaque))
+ {
+ if (bucket_buf != buf)
+ LockBuffer(bucket_buf, BUFFER_LOCK_EXCLUSIVE);
+
+ if (IsBufferCleanupOK(bucket_buf))
+ {
+ _hash_vacuum_one_page(rel, metabuf, buf, bucket_buf,
+ (buf == bucket_buf) ? true : false,
+ hnode);
+ if (bucket_buf != buf)
+ LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
+
+ if (PageGetFreeSpace(page) >= itemsz)
+ break; /* OK, now we have enough space */
+ }
+ else
+ {
+ if (bucket_buf != buf)
+ LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
+ }
+ }
+
/*
* no space on this page; check for an overflow page
*/
- BlockNumber nextblkno = pageopaque->hasho_nextblkno;
+ nextblkno = pageopaque->hasho_nextblkno;
if (BlockNumberIsValid(nextblkno))
{
@@ -205,7 +241,8 @@ restart_insert:
Assert(PageGetFreeSpace(page) >= itemsz);
}
pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
- Assert(pageopaque->hasho_flag == LH_OVERFLOW_PAGE);
+ Assert(pageopaque->hasho_flag == LH_OVERFLOW_PAGE ||
+ pageopaque->hasho_flag == (LH_OVERFLOW_PAGE | LH_PAGE_HAS_DEAD_TUPLES));
Assert(pageopaque->hasho_bucket == bucket);
}
@@ -347,3 +384,102 @@ _hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
RelationGetRelationName(rel));
}
}
+
+/*
+ * _hash_vacuum_one_page - vacuum just one index page.
+ * Try to remove LP_DEAD items from the given page. We
+ * must acquire cleanup lock on the primary bucket page
+ * before calling this function.
+ */
+
+static void
+_hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf,
+ Buffer bucket_buf, bool is_primary_bucket_page,
+ RelFileNode hnode)
+{
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+ OffsetNumber offnum,
+ maxoff;
+ Page page = BufferGetPage(buf);
+ HashPageOpaque pageopaque;
+ HashMetaPage metap;
+ double tuples_removed = 0;
+
+ /* Scan each tuple in page to see if it is marked as LP_DEAD */
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (offnum = FirstOffsetNumber;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemId = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemId))
+ {
+ deletable[ndeletable++] = offnum;
+ tuples_removed += 1;
+ }
+ }
+
+ if (ndeletable > 0)
+ {
+ /* No ereport(ERROR) until changes are logged */
+ START_CRIT_SECTION();
+
+ PageIndexMultiDelete(page, deletable, ndeletable);
+
+ pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
+
+ /*
+ * Write-lock the meta page so that we can decrement
+ * tuple count.
+ */
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+
+ metap = HashPageGetMeta(BufferGetPage(metabuf));
+ metap->hashm_ntuples -= tuples_removed;
+
+ MarkBufferDirty(buf);
+ MarkBufferDirty(metabuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ xl_hash_vacuum xlrec;
+ XLogRecPtr recptr;
+
+ xlrec.hnode = hnode;
+ xlrec.is_primary_bucket_page = is_primary_bucket_page;
+ xlrec.ntuples = tuples_removed;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfHashVacuum);
+
+ /*
+ * primary bucket buffer needs to be registered to ensure
+ * that we acquire cleanup lock during replay.
+ */
+ if (!xlrec.is_primary_bucket_page)
+ XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+ XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
+ XLogRegisterBufData(1, (char *) deletable,
+ ndeletable * sizeof(OffsetNumber));
+
+ XLogRegisterBuffer(2, metabuf, REGBUF_STANDARD);
+
+ recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_VACUUM_ONE_PAGE);
+
+ PageSetLSN(BufferGetPage(buf), recptr);
+ PageSetLSN(BufferGetPage(metabuf), recptr);
+ }
+
+ END_CRIT_SECTION();
+ /*
+ * Releasing write lock on meta page as we have updated
+ * the tuple count.
+ */
+ LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+ }
+}
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 9aaee1e..5d1c5be 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -509,6 +509,10 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
break; /* yes, so exit for-loop */
}
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ _hash_kill_items(scan);
+
/*
* ran off the end of this page, try the next
*/
@@ -562,6 +566,10 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
break; /* yes, so exit for-loop */
}
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ _hash_kill_items(scan);
+
/*
* ran off the end of this page, try the next
*/
diff --git a/src/backend/access/hash/hashsort.c b/src/backend/access/hash/hashsort.c
index ea8f109..60483cf 100644
--- a/src/backend/access/hash/hashsort.c
+++ b/src/backend/access/hash/hashsort.c
@@ -101,7 +101,7 @@ _h_spool(HSpool *hspool, ItemPointer self, Datum *values, bool *isnull)
* create an entire index.
*/
void
-_h_indexbuild(HSpool *hspool)
+_h_indexbuild(HSpool *hspool, RelFileNode rnode)
{
IndexTuple itup;
#ifdef USE_ASSERT_CHECKING
@@ -126,6 +126,6 @@ _h_indexbuild(HSpool *hspool)
Assert(hashkey >= lasthashkey);
#endif
- _hash_doinsert(hspool->index, itup);
+ _hash_doinsert(hspool->index, itup, rnode);
}
}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index c705531..4810553 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -19,6 +19,7 @@
#include "access/relscan.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"
+#include "storage/buf_internals.h"
#define CALC_NEW_BUCKET(old_bucket, lowmask) \
old_bucket | (lowmask + 1)
@@ -446,3 +447,72 @@ _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
return new_bucket;
}
+
+/*
+ * _hash_kill_items - set LP_DEAD state for items an indexscan caller has
+ * told us were killed.
+ *
+ * scan->opaque, referenced locally through so, contains information about the
+ * current page and killed tuples thereon (generally, this should only be
+ * called if so->numKilled > 0).
+ *
+ * We match items by heap TID before assuming they are the right ones to
+ * delete. If an item has moved off the current page due to a split, we'll
+ * fail to find it and do nothing (this is not an error case --- we assume
+ * the item will eventually get marked in a future indexscan).
+ */
+void
+_hash_kill_items(IndexScanDesc scan)
+{
+ HashScanOpaque so = (HashScanOpaque) scan->opaque;
+ Page page;
+ HashPageOpaque opaque;
+ OffsetNumber offnum, maxoff;
+ int numKilled = so->numKilled;
+ int i;
+ bool killedsomething = false;
+
+ Assert(so->numKilled > 0);
+ Assert(so->killedItems != NULL);
+
+ /*
+ * Always reset the scan state, so we don't look for same
+ * items on other pages.
+ */
+ so->numKilled = 0;
+
+ page = BufferGetPage(so->hashso_curbuf);
+ opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ for (i = 0; i < numKilled; i++)
+ {
+ offnum = so->killedItems[i].indexOffset;
+
+ while (offnum <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
+
+ if (ItemPointerEquals(&ituple->t_tid, &so->killedItems[i].heapTid))
+ {
+ /* found the item */
+ ItemIdMarkDead(iid);
+ killedsomething = true;
+ break; /* out of inner search loop */
+ }
+ offnum = OffsetNumberNext(offnum);
+ }
+ }
+
+ /*
+ * Since this can be redone later if needed, mark as dirty hint.
+ * Whenever we mark anything LP_DEAD, we also set the page's
+ * LH_PAGE_HAS_DEAD_TUPLES flag, which is likewise just a hint.
+ */
+ if (killedsomething)
+ {
+ opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
+ MarkBufferDirtyHint(so->hashso_curbuf, true);
+ }
+}
diff --git a/src/backend/access/rmgrdesc/hashdesc.c b/src/backend/access/rmgrdesc/hashdesc.c
index 5e3f7d8..5a06bb1 100644
--- a/src/backend/access/rmgrdesc/hashdesc.c
+++ b/src/backend/access/rmgrdesc/hashdesc.c
@@ -155,6 +155,8 @@ hash_identify(uint8 info)
case XLOG_HASH_UPDATE_META_PAGE:
id = "UPDATE_META_PAGE";
break;
+ case XLOG_HASH_VACUUM_ONE_PAGE:
+ id = "VACUUM_ONE_PAGE";
}
return id;
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 8328fc5..bb73012 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -57,6 +57,7 @@ typedef uint32 Bucket;
#define LH_BUCKET_BEING_POPULATED (1 << 4)
#define LH_BUCKET_BEING_SPLIT (1 << 5)
#define LH_BUCKET_NEEDS_SPLIT_CLEANUP (1 << 6)
+#define LH_PAGE_HAS_DEAD_TUPLES (1 << 7)
typedef struct HashPageOpaqueData
{
@@ -72,6 +73,7 @@ typedef HashPageOpaqueData *HashPageOpaque;
#define H_NEEDS_SPLIT_CLEANUP(opaque) ((opaque)->hasho_flag & LH_BUCKET_NEEDS_SPLIT_CLEANUP)
#define H_BUCKET_BEING_SPLIT(opaque) ((opaque)->hasho_flag & LH_BUCKET_BEING_SPLIT)
#define H_BUCKET_BEING_POPULATED(opaque) ((opaque)->hasho_flag & LH_BUCKET_BEING_POPULATED)
+#define H_HAS_DEAD_TUPLES(opaque) ((opaque)->hasho_flag & LH_PAGE_HAS_DEAD_TUPLES)
/*
* The page ID is for the convenience of pg_filedump and similar utilities,
@@ -81,6 +83,13 @@ typedef HashPageOpaqueData *HashPageOpaque;
*/
#define HASHO_PAGE_ID 0xFF80
+typedef struct HashScanPosItem /* what we remember about each match */
+{
+ ItemPointerData heapTid; /* TID of referenced heap item */
+ OffsetNumber indexOffset; /* index item's location within page */
+} HashScanPosItem;
+
+
/*
* HashScanOpaqueData is private state for a hash index scan.
*/
@@ -121,6 +130,9 @@ typedef struct HashScanOpaqueData
* referred only when hashso_buc_populated is true.
*/
bool hashso_buc_split;
+ /* info about killed items if any (killedItems is NULL if never used) */
+ HashScanPosItem *killedItems; /* tids and offset numbers of killed items */
+ int numKilled; /* number of currently stored items */
} HashScanOpaqueData;
typedef HashScanOpaqueData *HashScanOpaque;
@@ -182,6 +194,7 @@ typedef struct HashMetaPageData
typedef HashMetaPageData *HashMetaPage;
+
/*
* Maximum size of a hash index item (it's okay to have only one per page)
*/
@@ -307,7 +320,7 @@ extern Datum hash_uint32(uint32 k);
/* private routines */
/* hashinsert.c */
-extern void _hash_doinsert(Relation rel, IndexTuple itup);
+extern void _hash_doinsert(Relation rel, IndexTuple itup, RelFileNode hnode);
extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
Size itemsize, IndexTuple itup);
extern void _hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
@@ -362,7 +375,7 @@ extern HSpool *_h_spoolinit(Relation heap, Relation index, uint32 num_buckets);
extern void _h_spooldestroy(HSpool *hspool);
extern void _h_spool(HSpool *hspool, ItemPointer self,
Datum *values, bool *isnull);
-extern void _h_indexbuild(HSpool *hspool);
+extern void _h_indexbuild(HSpool *hspool, RelFileNode rnode);
/* hashutil.c */
extern bool _hash_checkqual(IndexScanDesc scan, IndexTuple itup);
@@ -382,6 +395,7 @@ extern BlockNumber _hash_get_oldblock_from_newbucket(Relation rel, Bucket new_bu
extern BlockNumber _hash_get_newblock_from_oldbucket(Relation rel, Bucket old_bucket);
extern Bucket _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
uint32 lowmask, uint32 maxbucket);
+extern void _hash_kill_items(IndexScanDesc scan);
/* hash.c */
extern void hashbucketcleanup(Relation rel, Bucket cur_bucket,
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index c53f878..aab4ac2 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -43,6 +43,7 @@
#define XLOG_HASH_UPDATE_META_PAGE 0xB0 /* update meta page after
* vacuum */
+#define XLOG_HASH_VACUUM_ONE_PAGE 0xC0 /* remove dead tuples from index page */
/*
* xl_hash_split_allocpage flag values, 8 bits are available.
@@ -257,6 +258,25 @@ typedef struct xl_hash_init_bitmap_page
#define SizeOfHashInitBitmapPage \
(offsetof(xl_hash_init_bitmap_page, bmsize) + sizeof(uint16))
+/*
+ * This is what we need for index tuple deletion and to
+ * update the meta page.
+ *
+ * This data record is used for XLOG_HASH_VACUUM_ONE_PAGE
+ *
+ * Backup Blk 0/1: bucket page
+ * Backup Blk 2: meta page
+ */
+typedef struct xl_hash_vacuum
+{
+ RelFileNode hnode;
+ double ntuples;
+ bool is_primary_bucket_page;
+} xl_hash_vacuum;
+
+#define SizeOfHashVacuum \
+ (offsetof(xl_hash_vacuum, is_primary_bucket_page) + sizeof(bool))
+
extern void hash_redo(XLogReaderState *record);
extern void hash_desc(StringInfo buf, XLogReaderState *record);
extern const char *hash_identify(uint8 info);
Hi Ashutosh,
On 01/10/2017 05:24 AM, Ashutosh Sharma wrote:
Thanks for reporting this problem. It basically happens because I
forgot to unpin the bucketbuf in hash_xlog_vacuum_one_page(). Please
find the attached v5 patch that fixes the issue.
The crash is now fixed, but the
--- test.sql ---
\set id random(1,10)
\set val random(0,10)
BEGIN;
UPDATE test SET val = :val WHERE id = :id;
COMMIT;
--- test.sql ---
case gives
client 6 aborted in command 3 of script 0; ERROR: deadlock detected
DETAIL: Process 14608 waits for ShareLock on transaction 1444620;
blocked by process 14610.
Process 14610 waits for ShareLock on transaction 1444616; blocked by
process 14608.
HINT: See server log for query details.
CONTEXT: while rechecking updated tuple (12,3) in relation "test"
...
using pgbench -M prepared -c 10 -j 10 -T 300 -f test.sql test
Best regards,
Jesper
Hi Ashutosh,
On 01/10/2017 08:40 AM, Jesper Pedersen wrote:
On 01/10/2017 05:24 AM, Ashutosh Sharma wrote:
Thanks for reporting this problem. It basically happens because I
forgot to unpin the bucketbuf in hash_xlog_vacuum_one_page(). Please
find the attached v5 patch that fixes the issue.

The crash is now fixed, but the
--- test.sql ---
\set id random(1,10)
\set val random(0,10)
BEGIN;
UPDATE test SET val = :val WHERE id = :id;
COMMIT;
--- test.sql ---

case gives
client 6 aborted in command 3 of script 0; ERROR: deadlock detected
DETAIL: Process 14608 waits for ShareLock on transaction 1444620;
blocked by process 14610.
Process 14610 waits for ShareLock on transaction 1444616; blocked by
process 14608.
HINT: See server log for query details.
CONTEXT: while rechecking updated tuple (12,3) in relation "test"
...

using pgbench -M prepared -c 10 -j 10 -T 300 -f test.sql test
I'm not seeing this deadlock with just the WAL v8 patch applied.
Best regards,
Jesper
Hi Jesper,
I'm not seeing this deadlock with just the WAL v8 patch applied.
Okay, thanks for confirming that.
I would like to update you that I am not able to reproduce this issue
at my end. I suspect that the steps I am following might be slightly
different from yours. Could you please have a look at the steps
mentioned below and confirm whether there is something different in
what I am doing.
Firstly, I am running the test case on the following git commit in HEAD:
<git-commit>
commit ba61a04bc7fefeee03416d9911eb825c4897c223
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Thu Jan 19 19:52:13 2017 -0500
Avoid core dump for empty prepared statement in an aborted transaction.
Brown-paper-bag bug in commit ab1f0c822: the old code here coped with
null CachedPlanSource.raw_parse_tree, the new code not so much.
Per report from Dave Cramer.
</git-commit>
On top of the above commit, I have applied the WAL v8 patch for hash
index and the MV v5 patch.
Now, with an --enable-cassert build, I am following the steps below:
1) Created a 'test' database
2) psql -d test -f ~/ddl.sql
where ddl.sql is,
-- ddl.sql --
CREATE TABLE test AS SELECT generate_series(1, 10) AS id, 0 AS val;
CREATE INDEX IF NOT EXISTS idx_id ON test USING hash (id);
CREATE INDEX IF NOT EXISTS idx_val ON test USING hash (val);
ANALYZE;
-- ddl.sql --
3) pgbench -M prepared -c 10 -j 10 -T 1800 -f ~/test.sql test
where test.sql is,
-- test.sql --
\set id random(1,10)
\set val random(0,10)
BEGIN;
UPDATE test SET val = :val WHERE id = :id;
COMMIT;
-- test.sql --
Machine details are as follows:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 8
Also, it would be great if you could confirm whether you have been
getting this issue repeatedly. Thanks.
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Hi Ashutosh,
On 01/20/2017 04:18 AM, Ashutosh Sharma wrote:
Okay, thanks for confirming that.
...
I would like to update you that I am not able to reproduce this issue
at my end. I suspect that the steps I am following might be slightly
different from yours.
...
Also, it would be great if you could confirm whether you have been
getting this issue repeatedly. Thanks.
Yeah, those are the steps; just with a Skylake laptop.
However, I restarted with a fresh master, with WAL v8 and MV v5, and
can't reproduce the issue.
Best regards,
Jesper
Hi Ashutosh,
On 01/20/2017 03:24 PM, Jesper Pedersen wrote:
Yeah, those are the steps; just with a Skylake laptop.
However, I restarted with a fresh master, with WAL v8 and MV v5, and
can't reproduce the issue.
I have done some more testing with this, and have moved the patch
back to 'Needs Review' pending Amit's comments.
Best regards,
Jesper
On 01/23/2017 02:53 PM, Jesper Pedersen wrote:
I have done some more testing with this, and have moved the patch
back to 'Needs Review' pending Amit's comments.
Moved to "Ready for Committer".
Best regards,
Jesper
On Thu, Jan 26, 2017 at 6:38 PM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:
On 01/23/2017 02:53 PM, Jesper Pedersen wrote:
I have done some more testing with this, and have moved the patch
back to 'Needs Review' pending Amit's comments.

Moved to "Ready for Committer".
Don't you think we should try to identify the reason for the deadlock
error reported by you up thread [1]? I know that you and Ashutosh are
not able to reproduce it, but I still feel some investigation is
required to find the reason. It is quite possible that the test case
is such that the deadlock is expected in rare cases; if that is the
case, then it is okay. I have not spent enough time on it to comment on
whether it is a test or a code issue.
[1]: /messages/by-id/dc6d7247-050f-4014-8c80-a4ee676eb384@redhat.com
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
I have done some more testing with this, and have moved the patch
back to 'Needs Review' pending Amit's comments.

Moved to "Ready for Committer".

Don't you think we should try to identify the reason for the deadlock
error reported by you up thread [1]? I know that you and Ashutosh are
not able to reproduce it, but I still feel some investigation is
required to find the reason. It is quite possible that the test case
is such that the deadlock is expected in rare cases; if that is the
case, then it is okay. I have not spent enough time on it to comment on
whether it is a test or a code issue.

[1] - /messages/by-id/dc6d7247-050f-4014-8c80-a4ee676eb384@redhat.com
I am finally able to reproduce the issue using the attached script
file (deadlock_report). Basically, once I was able to reproduce the
issue with the hash index, I also thought of checking it with a
non-unique B-Tree index and was able to reproduce it with the B-Tree
index as well. This certainly tells us that there is nothing wrong at
the code level; rather, it is the test script that is causing the
deadlock. I have run pgbench with two different configurations, and my
observations are as follows:

1) With primary keys, i.e. unique values: I have never encountered the
deadlock issue with this configuration.

2) With non-unique indexes (be it hash or B-Tree): I have seen the
deadlock many times with this configuration. Basically, when a column
contains non-unique values, there is a high probability that a single
'UPDATE' statement will update multiple records; when a large number
of backends do this concurrently, they can acquire row locks on
overlapping sets of rows in different orders, so a deadlock becomes
quite likely, which I assume is expected database behaviour. A minimal
illustration of this lock-ordering problem is sketched below.
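To make the lock-ordering point concrete, here is a minimal,
deterministic two-session sketch of the same problem. The table, index
and values are made up for illustration and this is not the exact
interleaving that pgbench produces; with a non-unique key, a single
multi-row UPDATE can end up in the equivalent situation when concurrent
backends visit the matching rows in different physical orders.

-- setup (hypothetical example, not part of the attached scripts)
CREATE TABLE t (id int, val int);
CREATE INDEX t_id_hash ON t USING hash (id);
INSERT INTO t VALUES (1, 0), (2, 0);

-- Session 1                               -- Session 2
BEGIN;                                     BEGIN;
UPDATE t SET val = 1 WHERE id = 1;         UPDATE t SET val = 2 WHERE id = 2;
-- now holds the row lock for id = 1       -- now holds the row lock for id = 2
UPDATE t SET val = 1 WHERE id = 2;         UPDATE t SET val = 2 WHERE id = 1;
-- blocks waiting for session 2            -- blocks waiting for session 1
-- the deadlock detector then aborts one session with "ERROR: deadlock detected"

Each session ends up waiting for a row lock held by the other, which is
the same failure pgbench reports, only reached here through explicit
statements rather than a single UPDATE that happens to lock overlapping
rows in a different order than a concurrent one.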
Also, attached are the pgbench_bt.patch and pgbench_hash.patch files
that contain the changes used to create the btree and hash indexes.
pgbench settings:
Scale Factor = 300
Shared Buffer= 1GB
Client counts = 64
run time duration = 30 mins
read-write workload.
./pgbench -c $threads -j $threads -T $time_for_reading -M prepared
postgres -f /home/ashu/deadlock_report
I hope this makes things clear now, and if there are no more concerns
it can be moved to the 'Ready for Committer' state. Thank you.
--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Attachments:
pgbench_bt.patch (invalid/octet-stream)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index f6cb5d4..4fff035 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -2624,9 +2624,9 @@ init(bool is_no_vacuum)
}
};
static const char *const DDLINDEXes[] = {
- "alter table pgbench_branches add primary key (bid)",
- "alter table pgbench_tellers add primary key (tid)",
- "alter table pgbench_accounts add primary key (aid)"
+ "create index pgbench_branches_bid on pgbench_branches using btree (bid)",
+ "create index pgbench_tellers_tid on pgbench_tellers using btree (tid)",
+ "create index pgbench_accounts_aid on pgbench_accounts using btree (aid)"
};
static const char *const DDLKEYs[] = {
"alter table pgbench_tellers add foreign key (bid) references pgbench_branches",
pgbench_hash.patch (invalid/octet-stream)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 87fb006..9fda82d 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -2381,9 +2381,9 @@ init(bool is_no_vacuum)
}
};
static const char *const DDLINDEXes[] = {
- "alter table pgbench_branches add primary key (bid)",
- "alter table pgbench_tellers add primary key (tid)",
- "alter table pgbench_accounts add primary key (aid)"
+ "create index pgbench_branches_bid on pgbench_branches using hash (bid)",
+ "create index pgbench_tellers_tid on pgbench_tellers using hash (tid)",
+ "create index pgbench_accounts_aid on pgbench_accounts using hash (aid)"
};
static const char *const DDLKEYs[] = {
"alter table pgbench_tellers add foreign key (bid) references pgbench_branches",
On Fri, Jan 27, 2017 at 5:15 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
1) With primary keys, i.e. unique values: I have never encountered the
deadlock issue with this configuration.

2) With non-unique indexes (be it hash or B-Tree): I have seen the
deadlock many times with this configuration. ... when a large number
of backends do this concurrently, they can acquire row locks on
overlapping sets of rows in different orders, so a deadlock becomes
quite likely, which I assume is expected database behaviour.
I agree with your analysis; surely, trying to update multiple rows with
the same values from different backends can lead to a deadlock.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Jan 28, 2017 at 8:02 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
I agree with your analysis; surely, trying to update multiple rows with
the same values from different backends can lead to a deadlock.
Moved that to CF 2017-03.
--
Michael
Hi,
Attached is the v6 patch for microvacuum in hash index rebased on top
of 'v10 patch for WAL in hash index' [1] and 'v1 patch for WAL
consistency check for hash index' [2].
[1]: /messages/by-id/CAA4eK1+k5wR4-kAjPqLoKemuHayQd6RkQQT9gheTfpn+72o1UA@mail.gmail.com
[2]: /messages/by-id/CAGz5QCJLERUn_zoO0eDv6_Y_d0o4tNTMPeR7ivTLBg4rUrJdwg@mail.gmail.com
Also, the patch (mask_hint_bit_LH_PAGE_HAS_DEAD_TUPLES.patch) to mask
the 'LH_PAGE_HAS_DEAD_TUPLES' flag, which got added as a part of the
microvacuum patch, is attached with this mail.
--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
On Wed, Feb 1, 2017 at 10:30 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
Moved that to CF 2017-03.
Attachments:
microvacuum_hash_index_v6.patch (application/x-download)
From 00addf795c86be3329e15279608837fa206d5cdd Mon Sep 17 00:00:00 2001
From: ashu <ashu@localhost.localdomain>
Date: Tue, 14 Mar 2017 13:26:16 +0530
Subject: [PATCH] Microvaccum support in Hash Index
Patch by Ashutosh Sharma
---
src/backend/access/hash/README | 5 +-
src/backend/access/hash/hash.c | 54 +++++--
src/backend/access/hash/hash_xlog.c | 252 +++++++++++++++++++++++++++++++++
src/backend/access/hash/hashinsert.c | 142 ++++++++++++++++++-
src/backend/access/hash/hashsearch.c | 8 ++
src/backend/access/hash/hashsort.c | 4 +-
src/backend/access/hash/hashutil.c | 70 +++++++++
src/backend/access/rmgrdesc/hashdesc.c | 2 +
src/include/access/hash.h | 18 ++-
src/include/access/hash_xlog.h | 20 +++
10 files changed, 558 insertions(+), 17 deletions(-)
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index bd13d07..91367e3 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -284,7 +284,10 @@ The insertion algorithm is rather similar:
if we get the lock on both the buckets
finish the split using algorithm mentioned below for split
release the pin on old bucket and restart the insert from beginning.
- if current page is full, release lock but not pin, read/exclusive-lock
+ if current page is full, first check if this page contains any dead tuples.
+ if yes, remove dead tuples from the current page and again check for the
+ availability of the space. If enough space found, insert the tuple else
+ release lock but not pin, read/exclusive-lock
next page; repeat as needed
>> see below if no space in any page of bucket
take buffer content lock in exclusive mode on metapage
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 6416769..7d3998e 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -162,7 +162,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
if (buildstate.spool)
{
/* sort the tuples and insert them into the index */
- _h_indexbuild(buildstate.spool);
+ _h_indexbuild(buildstate.spool, heap->rd_node);
_h_spooldestroy(buildstate.spool);
}
@@ -201,6 +201,8 @@ hashbuildCallback(Relation index,
Datum index_values[1];
bool index_isnull[1];
IndexTuple itup;
+ Relation rel;
+ RelFileNode rnode;
/* convert data to a hash key; on failure, do not insert anything */
if (!_hash_convert_tuple(index,
@@ -217,8 +219,12 @@ hashbuildCallback(Relation index,
/* form an index tuple and point it at the heap tuple */
itup = index_form_tuple(RelationGetDescr(index),
index_values, index_isnull);
+ /* Get RelfileNode from relation OID */
+ rel = relation_open(htup->t_tableOid, NoLock);
+ rnode = rel->rd_node;
+ relation_close(rel, NoLock);
itup->t_tid = htup->t_self;
- _hash_doinsert(index, itup);
+ _hash_doinsert(index, itup, rnode);
pfree(itup);
}
@@ -251,7 +257,7 @@ hashinsert(Relation rel, Datum *values, bool *isnull,
itup = index_form_tuple(RelationGetDescr(rel), index_values, index_isnull);
itup->t_tid = *ht_ctid;
- _hash_doinsert(rel, itup);
+ _hash_doinsert(rel, itup, heapRel->rd_node);
pfree(itup);
@@ -331,14 +337,21 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
if (scan->kill_prior_tuple)
{
/*
- * Yes, so mark it by setting the LP_DEAD state in the item flags.
+ * Yes, so remember it for later. (We'll deal with all such
+ * tuples at once right after leaving the index page or at
+ * end of scan.)
*/
- ItemIdMarkDead(PageGetItemId(page, offnum));
+ if (so->killedItems == NULL)
+ so->killedItems = palloc(MaxIndexTuplesPerPage *
+ sizeof(HashScanPosItem));
- /*
- * Since this can be redone later if needed, mark as a hint.
- */
- MarkBufferDirtyHint(buf, true);
+ if (so->numKilled < MaxIndexTuplesPerPage)
+ {
+ so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
+ so->killedItems[so->numKilled].indexOffset =
+ ItemPointerGetOffsetNumber(&(so->hashso_curpos));
+ so->numKilled++;
+ }
}
/*
@@ -446,6 +459,9 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
so->hashso_buc_populated = false;
so->hashso_buc_split = false;
+ so->killedItems = NULL;
+ so->numKilled = 0;
+
scan->opaque = so;
return scan;
@@ -461,6 +477,10 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
HashScanOpaque so = (HashScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ _hash_kill_items(scan);
+
_hash_dropscanbuf(rel, so);
/* set position invalid (this will cause _hash_first call) */
@@ -488,8 +508,14 @@ hashendscan(IndexScanDesc scan)
HashScanOpaque so = (HashScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ _hash_kill_items(scan);
+
_hash_dropscanbuf(rel, so);
+ if (so->killedItems != NULL)
+ pfree(so->killedItems);
pfree(so);
scan->opaque = NULL;
}
@@ -848,6 +874,16 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
PageIndexMultiDelete(page, deletable, ndeletable);
bucket_dirty = true;
+
+ /*
+ * Let us mark the page as clean if vacuum removes the DEAD tuples
+ * from an index page. We do this by clearing LH_PAGE_HAS_DEAD_TUPLES
+ * flag.
+ */
+ if (tuples_removed && *tuples_removed > 0 &&
+ opaque->hasho_flag & LH_PAGE_HAS_DEAD_TUPLES)
+ opaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
+
MarkBufferDirty(buf);
/* XLOG stuff */
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index 9d2f86f..ee02d9a 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -14,10 +14,15 @@
*/
#include "postgres.h"
+#include "access/heapam_xlog.h"
#include "access/bufmask.h"
#include "access/hash.h"
#include "access/hash_xlog.h"
#include "access/xlogutils.h"
+#include "access/xlog.h"
+#include "access/transam.h"
+#include "storage/procarray.h"
+#include "miscadmin.h"
/*
* replay a hash index meta page
@@ -915,6 +920,250 @@ hash_xlog_update_meta_page(XLogReaderState *record)
UnlockReleaseBuffer(metabuf);
}
+/*
+ * Get the latestRemovedXid from the heap pages pointed at by the index
+ * tuples being deleted. This puts the work for calculating latestRemovedXid
+ * into the recovery path rather than the primary path.
+ *
+ * It's possible that this generates a fair amount of I/O, since an index
+ * block may have hundreds of tuples being deleted. Repeat accesses to the
+ * same heap blocks are common, though are not yet optimised.
+ */
+static TransactionId
+hash_xlog_vacuum_get_latestRemovedXid(XLogReaderState *record)
+{
+ xl_hash_vacuum *xlrec = (xl_hash_vacuum *) XLogRecGetData(record);
+ OffsetNumber *unused;
+ Buffer ibuffer,
+ hbuffer;
+ Page ipage,
+ hpage;
+ RelFileNode rnode;
+ BlockNumber blkno;
+ ItemId iitemid,
+ hitemid;
+ IndexTuple itup;
+ HeapTupleHeader htuphdr;
+ BlockNumber hblkno;
+ OffsetNumber hoffnum;
+ TransactionId latestRemovedXid = InvalidTransactionId;
+ int i;
+ char *ptr;
+ Size len;
+
+ /*
+ * If there's nothing running on the standby we don't need to derive a
+ * full latestRemovedXid value, so use a fast path out of here. This
+ * returns InvalidTransactionId, and so will conflict with all HS
+ * transactions; but since we just worked out that that's zero people,
+ * it's OK.
+ */
+ if (CountDBBackends(InvalidOid) == 0)
+ return latestRemovedXid;
+
+ /*
+ * Get index page. If the DB is consistent, this should not fail, nor
+ * should any of the heap page fetches below. If one does, we return
+ * InvalidTransactionId to cancel all HS transactions. That's probably
+ * overkill, but it's safe, and certainly better than panicking here.
+ */
+ XLogRecGetBlockTag(record, 1, &rnode, NULL, &blkno);
+ ibuffer = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno, RBM_NORMAL);
+
+ if (!BufferIsValid(ibuffer))
+ return InvalidTransactionId;
+ LockBuffer(ibuffer, HASH_READ);
+ ipage = (Page) BufferGetPage(ibuffer);
+
+ /*
+ * Loop through the deleted index items to obtain the TransactionId from
+ * the heap items they point to.
+ */
+ ptr = XLogRecGetBlockData(record, 1, &len);
+
+ unused = (OffsetNumber *) ptr;
+
+ for (i = 0; i < xlrec->ntuples; i++)
+ {
+ /*
+ * Identify the index tuple about to be deleted.
+ */
+ iitemid = PageGetItemId(ipage, unused[i]);
+ itup = (IndexTuple) PageGetItem(ipage, iitemid);
+
+ /*
+ * Locate the heap page that the index tuple points at
+ */
+ hblkno = ItemPointerGetBlockNumber(&(itup->t_tid));
+ hbuffer = XLogReadBufferExtended(xlrec->hnode, MAIN_FORKNUM,
+ hblkno, RBM_NORMAL);
+
+ if (!BufferIsValid(hbuffer))
+ {
+ UnlockReleaseBuffer(ibuffer);
+ return InvalidTransactionId;
+ }
+ LockBuffer(hbuffer, HASH_READ);
+ hpage = (Page) BufferGetPage(hbuffer);
+
+ /*
+ * Look up the heap tuple header that the index tuple points at by
+ * using the heap node supplied with the xlrec. We can't use
+ * heap_fetch, since it uses ReadBuffer rather than XLogReadBuffer.
+ * Note that we are not looking at tuple data here, just headers.
+ */
+ hoffnum = ItemPointerGetOffsetNumber(&(itup->t_tid));
+ hitemid = PageGetItemId(hpage, hoffnum);
+
+ /*
+ * Follow any redirections until we find something useful.
+ */
+ while (ItemIdIsRedirected(hitemid))
+ {
+ hoffnum = ItemIdGetRedirect(hitemid);
+ hitemid = PageGetItemId(hpage, hoffnum);
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ /*
+ * If the heap item has storage, then read the header and use that to
+ * set latestRemovedXid.
+ *
+ * Some LP_DEAD items may not be accessible, so we ignore them.
+ */
+ if (ItemIdHasStorage(hitemid))
+ {
+ htuphdr = (HeapTupleHeader) PageGetItem(hpage, hitemid);
+ HeapTupleHeaderAdvanceLatestRemovedXid(htuphdr, &latestRemovedXid);
+ }
+ else if (ItemIdIsDead(hitemid))
+ {
+ /*
+ * Conjecture: if hitemid is dead then it had xids before the xids
+ * marked on LP_NORMAL items. So we just ignore this item and move
+ * onto the next, for the purposes of calculating
+ * latestRemovedxids.
+ */
+ }
+ else
+ Assert(!ItemIdIsUsed(hitemid));
+
+ UnlockReleaseBuffer(hbuffer);
+ }
+
+ UnlockReleaseBuffer(ibuffer);
+
+ /*
+ * If all heap tuples were LP_DEAD then we will be returning
+ * InvalidTransactionId here, which avoids conflicts. This matches
+ * existing logic which assumes that LP_DEAD tuples must already be older
+ * than the latestRemovedXid on the cleanup record that set them as
+ * LP_DEAD, hence must already have generated a conflict.
+ */
+ return latestRemovedXid;
+}
+
+/*
+ * replay delete operation in hash index to remove
+ * tuples marked as DEAD during index tuple insertion.
+ */
+static void
+hash_xlog_vacuum_one_page(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ xl_hash_vacuum *xldata = (xl_hash_vacuum *) XLogRecGetData(record);
+ Buffer bucketbuf = InvalidBuffer;
+ Buffer buffer;
+ Buffer metabuf;
+ Page page;
+ XLogRedoAction action;
+
+ /*
+ * If we have any conflict processing to do, it must happen before we
+ * update the page.
+ *
+ * Hash Index delete records can conflict with standby queries.You might
+ * think that vacuum records would conflict as well, but we've handled
+ * that already. XLOG_HEAP2_CLEANUP_INFO records provide the highest xid
+ * cleaned by the vacuum of the heap and so we can resolve any conflicts
+ * just once when that arrives. After that we know that no conflicts
+ * exist from individual hash index vacuum records on that index.
+ */
+ if (InHotStandby)
+ {
+ TransactionId latestRemovedXid =
+ hash_xlog_vacuum_get_latestRemovedXid(record);
+ RelFileNode rnode;
+
+ XLogRecGetBlockTag(record, 1, &rnode, NULL, NULL);
+ ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rnode);
+ }
+
+ if (xldata->is_primary_bucket_page)
+ action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL,
+ true, &buffer);
+ else
+ {
+ RelFileNode rnode;
+ BlockNumber blkno;
+
+ XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+ bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+ RBM_NORMAL);
+
+ if (BufferIsValid(bucketbuf))
+ LockBufferForCleanup(bucketbuf);
+
+ action = XLogReadBufferForRedo(record, 1, &buffer);
+ }
+
+ if (action == BLK_NEEDS_REDO)
+ {
+ char *ptr;
+ Size len;
+
+ ptr = XLogRecGetBlockData(record, 1, &len);
+
+ page = (Page) BufferGetPage(buffer);
+
+ if (len > 0)
+ {
+ OffsetNumber *unused;
+ OffsetNumber *unend;
+
+ unused = (OffsetNumber *) ptr;
+ unend = (OffsetNumber *) ((char *) ptr + len);
+
+ if ((unend - unused) > 0)
+ PageIndexMultiDelete(page, unused, unend - unused);
+ }
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ }
+ if (BufferIsValid(buffer))
+ UnlockReleaseBuffer(buffer);
+
+ if (BufferIsValid(bucketbuf))
+ UnlockReleaseBuffer(bucketbuf);
+
+ if (XLogReadBufferForRedo(record, 2, &metabuf) == BLK_NEEDS_REDO)
+ {
+ Page metapage;
+ HashMetaPage metap;
+
+ metapage = BufferGetPage(metabuf);
+ metap = HashPageGetMeta(metapage);
+
+ metap->hashm_ntuples -= xldata->ntuples;
+
+ PageSetLSN(metapage, lsn);
+ MarkBufferDirty(metabuf);
+ }
+ if (BufferIsValid(metabuf))
+ UnlockReleaseBuffer(metabuf);
+}
+
void
hash_redo(XLogReaderState *record)
{
@@ -958,6 +1207,9 @@ hash_redo(XLogReaderState *record)
case XLOG_HASH_UPDATE_META_PAGE:
hash_xlog_update_meta_page(record);
break;
+ case XLOG_HASH_VACUUM_ONE_PAGE:
+ hash_xlog_vacuum_one_page(record);
+ break;
default:
elog(PANIC, "hash_redo: unknown op code %u", info);
}
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 241728f..0e28f33 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -19,7 +19,12 @@
#include "access/hash_xlog.h"
#include "miscadmin.h"
#include "utils/rel.h"
+#include "storage/lwlock.h"
+#include "storage/buf_internals.h"
+static void _hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf,
+ Buffer bucket_buf, bool is_primary_bucket_page,
+ RelFileNode hnode);
/*
* _hash_doinsert() -- Handle insertion of a single index tuple.
@@ -28,7 +33,7 @@
* and hashinsert. By here, itup is completely filled in.
*/
void
-_hash_doinsert(Relation rel, IndexTuple itup)
+_hash_doinsert(Relation rel, IndexTuple itup, RelFileNode hnode)
{
Buffer buf = InvalidBuffer;
Buffer bucket_buf;
@@ -118,10 +123,41 @@ restart_insert:
/* Do the insertion */
while (PageGetFreeSpace(page) < itemsz)
{
+ BlockNumber nextblkno;
+
+ /*
+ * Check if current page has any DEAD tuples. If yes,
+ * delete these tuples and see if we can get a space for
+ * the new item to be inserted before moving to the next
+ * page in the bucket chain.
+ */
+ if (H_HAS_DEAD_TUPLES(pageopaque))
+ {
+ if (bucket_buf != buf)
+ LockBuffer(bucket_buf, BUFFER_LOCK_EXCLUSIVE);
+
+ if (IsBufferCleanupOK(bucket_buf))
+ {
+ _hash_vacuum_one_page(rel, metabuf, buf, bucket_buf,
+ (buf == bucket_buf) ? true : false,
+ hnode);
+ if (bucket_buf != buf)
+ LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
+
+ if (PageGetFreeSpace(page) >= itemsz)
+ break; /* OK, now we have enough space */
+ }
+ else
+ {
+ if (bucket_buf != buf)
+ LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
+ }
+ }
+
/*
* no space on this page; check for an overflow page
*/
- BlockNumber nextblkno = pageopaque->hasho_nextblkno;
+ nextblkno = pageopaque->hasho_nextblkno;
if (BlockNumberIsValid(nextblkno))
{
@@ -157,7 +193,8 @@ restart_insert:
Assert(PageGetFreeSpace(page) >= itemsz);
}
pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
- Assert(pageopaque->hasho_flag == LH_OVERFLOW_PAGE);
+ Assert(pageopaque->hasho_flag == LH_OVERFLOW_PAGE ||
+ pageopaque->hasho_flag == (LH_OVERFLOW_PAGE | LH_PAGE_HAS_DEAD_TUPLES));
Assert(pageopaque->hasho_bucket == bucket);
}
@@ -300,3 +337,102 @@ _hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
RelationGetRelationName(rel));
}
}
+
+/*
+ * _hash_vacuum_one_page - vacuum just one index page.
+ * Try to remove LP_DEAD items from the given page. We
+ * must acquire cleanup lock on the primary bucket page
+ * before calling this function.
+ */
+
+static void
+_hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf,
+ Buffer bucket_buf, bool is_primary_bucket_page,
+ RelFileNode hnode)
+{
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+ OffsetNumber offnum,
+ maxoff;
+ Page page = BufferGetPage(buf);
+ HashPageOpaque pageopaque;
+ HashMetaPage metap;
+ double tuples_removed = 0;
+
+ /* Scan each tuple in page to see if it is marked as LP_DEAD */
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (offnum = FirstOffsetNumber;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemId = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemId))
+ {
+ deletable[ndeletable++] = offnum;
+ tuples_removed += 1;
+ }
+ }
+
+ if (ndeletable > 0)
+ {
+ /* No ereport(ERROR) until changes are logged */
+ START_CRIT_SECTION();
+
+ PageIndexMultiDelete(page, deletable, ndeletable);
+
+ pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
+
+ /*
+ * Write-lock the meta page so that we can decrement
+ * tuple count.
+ */
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+
+ metap = HashPageGetMeta(BufferGetPage(metabuf));
+ metap->hashm_ntuples -= tuples_removed;
+
+ MarkBufferDirty(buf);
+ MarkBufferDirty(metabuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ xl_hash_vacuum xlrec;
+ XLogRecPtr recptr;
+
+ xlrec.hnode = hnode;
+ xlrec.is_primary_bucket_page = is_primary_bucket_page;
+ xlrec.ntuples = tuples_removed;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfHashVacuum);
+
+ /*
+ * primary bucket buffer needs to be registered to ensure
+ * that we acquire cleanup lock during replay.
+ */
+ if (!xlrec.is_primary_bucket_page)
+ XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+ XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
+ XLogRegisterBufData(1, (char *) deletable,
+ ndeletable * sizeof(OffsetNumber));
+
+ XLogRegisterBuffer(2, metabuf, REGBUF_STANDARD);
+
+ recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_VACUUM_ONE_PAGE);
+
+ PageSetLSN(BufferGetPage(buf), recptr);
+ PageSetLSN(BufferGetPage(metabuf), recptr);
+ }
+
+ END_CRIT_SECTION();
+ /*
+ * Releasing write lock on meta page as we have updated
+ * the tuple count.
+ */
+ LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+ }
+}
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index d733770..2d92049 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -465,6 +465,10 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
break; /* yes, so exit for-loop */
}
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ _hash_kill_items(scan);
+
/*
* ran off the end of this page, try the next
*/
@@ -518,6 +522,10 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
break; /* yes, so exit for-loop */
}
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ _hash_kill_items(scan);
+
/*
* ran off the end of this page, try the next
*/
diff --git a/src/backend/access/hash/hashsort.c b/src/backend/access/hash/hashsort.c
index ea8f109..60483cf 100644
--- a/src/backend/access/hash/hashsort.c
+++ b/src/backend/access/hash/hashsort.c
@@ -101,7 +101,7 @@ _h_spool(HSpool *hspool, ItemPointer self, Datum *values, bool *isnull)
* create an entire index.
*/
void
-_h_indexbuild(HSpool *hspool)
+_h_indexbuild(HSpool *hspool, RelFileNode rnode)
{
IndexTuple itup;
#ifdef USE_ASSERT_CHECKING
@@ -126,6 +126,6 @@ _h_indexbuild(HSpool *hspool)
Assert(hashkey >= lasthashkey);
#endif
- _hash_doinsert(hspool->index, itup);
+ _hash_doinsert(hspool->index, itup, rnode);
}
}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index c705531..4810553 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -19,6 +19,7 @@
#include "access/relscan.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"
+#include "storage/buf_internals.h"
#define CALC_NEW_BUCKET(old_bucket, lowmask) \
old_bucket | (lowmask + 1)
@@ -446,3 +447,72 @@ _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
return new_bucket;
}
+
+/*
+ * _hash_kill_items - set LP_DEAD state for items an indexscan caller has
+ * told us were killed.
+ *
+ * scan->opaque, referenced locally through so, contains information about the
+ * current page and killed tuples thereon (generally, this should only be
+ * called if so->numKilled > 0).
+ *
+ * We match items by heap TID before assuming they are the right ones to
+ * delete. If an item has moved off the current page due to a split, we'll
+ * fail to find it and do nothing (this is not an error case --- we assume
+ * the item will eventually get marked in a future indexscan).
+ */
+void
+_hash_kill_items(IndexScanDesc scan)
+{
+ HashScanOpaque so = (HashScanOpaque) scan->opaque;
+ Page page;
+ HashPageOpaque opaque;
+ OffsetNumber offnum, maxoff;
+ int numKilled = so->numKilled;
+ int i;
+ bool killedsomething = false;
+
+ Assert(so->numKilled > 0);
+ Assert(so->killedItems != NULL);
+
+ /*
+ * Always reset the scan state, so we don't look for same
+ * items on other pages.
+ */
+ so->numKilled = 0;
+
+ page = BufferGetPage(so->hashso_curbuf);
+ opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ for (i = 0; i < numKilled; i++)
+ {
+ offnum = so->killedItems[i].indexOffset;
+
+ while (offnum <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
+
+ if (ItemPointerEquals(&ituple->t_tid, &so->killedItems[i].heapTid))
+ {
+ /* found the item */
+ ItemIdMarkDead(iid);
+ killedsomething = true;
+ break; /* out of inner search loop */
+ }
+ offnum = OffsetNumberNext(offnum);
+ }
+ }
+
+ /*
+ * Since this can be redone later if needed, mark as dirty hint.
+ * Whenever we mark anything LP_DEAD, we also set the page's
+ * LH_PAGE_HAS_DEAD_TUPLES flag, which is likewise just a hint.
+ */
+ if (killedsomething)
+ {
+ opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
+ MarkBufferDirtyHint(so->hashso_curbuf, true);
+ }
+}
diff --git a/src/backend/access/rmgrdesc/hashdesc.c b/src/backend/access/rmgrdesc/hashdesc.c
index f1cc9ff..5bd5c8d 100644
--- a/src/backend/access/rmgrdesc/hashdesc.c
+++ b/src/backend/access/rmgrdesc/hashdesc.c
@@ -154,6 +154,8 @@ hash_identify(uint8 info)
case XLOG_HASH_UPDATE_META_PAGE:
id = "UPDATE_META_PAGE";
break;
+ case XLOG_HASH_VACUUM_ONE_PAGE:
+ id = "VACUUM_ONE_PAGE";
}
return id;
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index bfdfed8..fb6e34f 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -57,6 +57,7 @@ typedef uint32 Bucket;
#define LH_BUCKET_BEING_POPULATED (1 << 4)
#define LH_BUCKET_BEING_SPLIT (1 << 5)
#define LH_BUCKET_NEEDS_SPLIT_CLEANUP (1 << 6)
+#define LH_PAGE_HAS_DEAD_TUPLES (1 << 7)
#define LH_PAGE_TYPE \
(LH_OVERFLOW_PAGE|LH_BUCKET_PAGE|LH_BITMAP_PAGE|LH_META_PAGE)
@@ -86,6 +87,7 @@ typedef HashPageOpaqueData *HashPageOpaque;
#define H_NEEDS_SPLIT_CLEANUP(opaque) ((opaque)->hasho_flag & LH_BUCKET_NEEDS_SPLIT_CLEANUP)
#define H_BUCKET_BEING_SPLIT(opaque) ((opaque)->hasho_flag & LH_BUCKET_BEING_SPLIT)
#define H_BUCKET_BEING_POPULATED(opaque) ((opaque)->hasho_flag & LH_BUCKET_BEING_POPULATED)
+#define H_HAS_DEAD_TUPLES(opaque) ((opaque)->hasho_flag & LH_PAGE_HAS_DEAD_TUPLES)
/*
* The page ID is for the convenience of pg_filedump and similar utilities,
@@ -95,6 +97,13 @@ typedef HashPageOpaqueData *HashPageOpaque;
*/
#define HASHO_PAGE_ID 0xFF80
+typedef struct HashScanPosItem /* what we remember about each match */
+{
+ ItemPointerData heapTid; /* TID of referenced heap item */
+ OffsetNumber indexOffset; /* index item's location within page */
+} HashScanPosItem;
+
+
/*
* HashScanOpaqueData is private state for a hash index scan.
*/
@@ -135,6 +144,9 @@ typedef struct HashScanOpaqueData
* referred only when hashso_buc_populated is true.
*/
bool hashso_buc_split;
+ /* info about killed items if any (killedItems is NULL if never used) */
+ HashScanPosItem *killedItems; /* tids and offset numbers of killed items */
+ int numKilled; /* number of currently stored items */
} HashScanOpaqueData;
typedef HashScanOpaqueData *HashScanOpaque;
@@ -196,6 +208,7 @@ typedef struct HashMetaPageData
typedef HashMetaPageData *HashMetaPage;
+
/*
* Maximum size of a hash index item (it's okay to have only one per page)
*/
@@ -300,7 +313,7 @@ extern Datum hash_uint32(uint32 k);
/* private routines */
/* hashinsert.c */
-extern void _hash_doinsert(Relation rel, IndexTuple itup);
+extern void _hash_doinsert(Relation rel, IndexTuple itup, RelFileNode hnode);
extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
Size itemsize, IndexTuple itup);
extern void _hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
@@ -361,7 +374,7 @@ extern HSpool *_h_spoolinit(Relation heap, Relation index, uint32 num_buckets);
extern void _h_spooldestroy(HSpool *hspool);
extern void _h_spool(HSpool *hspool, ItemPointer self,
Datum *values, bool *isnull);
-extern void _h_indexbuild(HSpool *hspool);
+extern void _h_indexbuild(HSpool *hspool, RelFileNode rnode);
/* hashutil.c */
extern bool _hash_checkqual(IndexScanDesc scan, IndexTuple itup);
@@ -381,6 +394,7 @@ extern BlockNumber _hash_get_oldblock_from_newbucket(Relation rel, Bucket new_bu
extern BlockNumber _hash_get_newblock_from_oldbucket(Relation rel, Bucket old_bucket);
extern Bucket _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
uint32 lowmask, uint32 maxbucket);
+extern void _hash_kill_items(IndexScanDesc scan);
/* hash.c */
extern void hashbucketcleanup(Relation rel, Bucket cur_bucket,
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index 552d642..4e505cf 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -44,6 +44,7 @@
#define XLOG_HASH_UPDATE_META_PAGE 0xB0 /* update meta page after
* vacuum */
+#define XLOG_HASH_VACUUM_ONE_PAGE 0xC0 /* remove dead tuples from index page */
/*
* xl_hash_split_allocate_page flag values, 8 bits are available.
@@ -250,6 +251,25 @@ typedef struct xl_hash_init_bitmap_page
#define SizeOfHashInitBitmapPage \
(offsetof(xl_hash_init_bitmap_page, bmsize) + sizeof(uint16))
+/*
+ * This is what we need for index tuple deletion and to
+ * update the meta page.
+ *
+ * This data record is used for XLOG_HASH_VACUUM_ONE_PAGE
+ *
+ * Backup Blk 0/1: bucket page
+ * Backup Blk 2: meta page
+ */
+typedef struct xl_hash_vacuum
+{
+ RelFileNode hnode;
+ double ntuples;
+ bool is_primary_bucket_page;
+} xl_hash_vacuum;
+
+#define SizeOfHashVacuum \
+ (offsetof(xl_hash_vacuum, is_primary_bucket_page) + sizeof(bool))
+
extern void hash_redo(XLogReaderState *record);
extern void hash_desc(StringInfo buf, XLogReaderState *record);
extern const char *hash_identify(uint8 info);
--
1.8.3.1
Attachment: mask_hint_bit_LH_PAGE_HAS_DEAD_TUPLES.patch (application/x-download)
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index ee02d9a..064caa8 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -1249,4 +1249,10 @@ hash_mask(char *pagedata, BlockNumber blkno)
*/
mask_lp_flags(page);
}
+
+ /*
+ * LH_PAGE_HAS_DEAD_TUPLES is just an un-logged hint bit. So, mask it.
+ * See _hash_kill_items() for details.
+ */
+ opaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
}
On Tue, Mar 14, 2017 at 8:02 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Attached is the v6 patch for microvacuum in hash index rebased on top
of 'v10 patch for WAL in hash index - [1]' and 'v1 patch for WAL
consistency check for hash index - [2]'.
[1] - /messages/by-id/CAA4eK1+k5wR4-kAjPqLoKemuHayQd6RkQQT9gheTfpn+72o1UA@mail.gmail.com
[2] - /messages/by-id/CAGz5QCJLERUn_zoO0eDv6_Y_d0o4tNTMPeR7ivTLBg4rUrJdwg@mail.gmail.com
Also, the patch (mask_hint_bit_LH_PAGE_HAS_DEAD_TUPLES.patch) to mask
'LH_PAGE_HAS_DEAD_TUPLES' flag which got added as a part of
Microvacuum patch is attached with this mail.
Generally, this patch looks like a pretty straightforward adaptation
of the similar btree mechanism to hash indexes, so if it works for
btree it ought to work here, too. But I noticed a few things while
reading through it.
+ /* Get RelfileNode from relation OID */
+ rel = relation_open(htup->t_tableOid, NoLock);
+ rnode = rel->rd_node;
+ relation_close(rel, NoLock);
itup->t_tid = htup->t_self;
- _hash_doinsert(index, itup);
+ _hash_doinsert(index, itup, rnode);
This is an awfully low-level place to be doing something like this.
I'm not sure exactly where this should be happening, but not in the
per-tuple callback.
+ /*
+ * If there's nothing running on the standby we don't need to derive a
+ * full latestRemovedXid value, so use a fast path out of here. This
+ * returns InvalidTransactionId, and so will conflict with all HS
+ * transactions; but since we just worked out that that's zero people,
+ * it's OK.
+ */
+ if (CountDBBackends(InvalidOid) == 0)
+ return latestRemovedXid;
I see that this comment (and most of what surrounds it) was just
copied from the existing btree example, but isn't there a discrepancy
between the comment and the code? It says it returns
InvalidTransactionId, but it doesn't. Also, you dropped the XXX from
the btree original, and the following reachedConsistency check.
+ * Hash Index delete records can conflict with standby queries.You might
+ * think that vacuum records would conflict as well, but we've handled
But they're not called delete records in a hash index. The function
is called hash_xlog_vacuum_one_page. The corresponding btree function
is btree_xlog_delete. So this comment needs a little more updating.
+ if (IsBufferCleanupOK(bucket_buf))
+ {
+ _hash_vacuum_one_page(rel, metabuf, buf, bucket_buf,
+ (buf == bucket_buf) ? true : false,
+ hnode);
+ if (bucket_buf != buf)
+ LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
+
+ if (PageGetFreeSpace(page) >= itemsz)
+ break; /* OK, now we have enough space */
+ }
I might be missing something, but I don't quite see why this needs a
cleanup lock on the primary bucket page. I would think a cleanup lock
on the page we're actually modifying would suffice, and I think if
that's correct it would be a better way to go. If that's not correct,
then I think the comments need some work.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Mar 15, 2017 at 1:15 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Mar 14, 2017 at 8:02 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Attached is the v6 patch for microvacuum in hash index rebased on top
of 'v10 patch for WAL in hash index - [1]' and 'v1 patch for WAL
consistency check for hash index - [2]'.
[1] - /messages/by-id/CAA4eK1+k5wR4-kAjPqLoKemuHayQd6RkQQT9gheTfpn+72o1UA@mail.gmail.com
[2] - /messages/by-id/CAGz5QCJLERUn_zoO0eDv6_Y_d0o4tNTMPeR7ivTLBg4rUrJdwg@mail.gmail.com
Also, the patch (mask_hint_bit_LH_PAGE_HAS_DEAD_TUPLES.patch) to mask
'LH_PAGE_HAS_DEAD_TUPLES' flag which got added as a part of
Microvacuum patch is attached with this mail.
Generally, this patch looks like a pretty straightforward adaptation
of the similar btree mechanism to hash indexes, so if it works for
btree it ought to work here, too. But I noticed a few things while
reading through it.
+ /* Get RelfileNode from relation OID */
+ rel = relation_open(htup->t_tableOid, NoLock);
+ rnode = rel->rd_node;
+ relation_close(rel, NoLock);
itup->t_tid = htup->t_self;
- _hash_doinsert(index, itup);
+ _hash_doinsert(index, itup, rnode);
This is an awfully low-level place to be doing something like this.
I'm not sure exactly where this should be happening, but not in the
per-tuple callback.
I think one possibility is to get it using
indexrel->rd_index->indrelid in _hash_doinsert().
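For clarity, a minimal sketch of that lookup (this is essentially what the
later v7 patch does inside _hash_doinsert(); the variable names here are
just illustrative):

    /* Resolve the heap relation's RelFileNode via the index's pg_index entry */
    Relation    hrel;
    RelFileNode hnode;

    hrel = relation_open(rel->rd_index->indrelid, NoLock);
    hnode = hrel->rd_node;
    relation_close(hrel, NoLock);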
But they're not called delete records in a hash index. The function
is called hash_xlog_vacuum_one_page. The corresponding btree function
is btree_xlog_delete. So this comment needs a little more updating.
+ if (IsBufferCleanupOK(bucket_buf))
+ {
+ _hash_vacuum_one_page(rel, metabuf, buf, bucket_buf,
+ (buf == bucket_buf) ? true : false,
+ hnode);
+ if (bucket_buf != buf)
+ LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
+
+ if (PageGetFreeSpace(page) >= itemsz)
+ break; /* OK, now we have enough space */
+ }
I might be missing something, but I don't quite see why this needs a
cleanup lock on the primary bucket page. I would think a cleanup lock
on the page we're actually modifying would suffice, and I think if
that's correct it would be a better way to go.
Offhand, I also don't see any problem with it.
Few other comments:
1.
+ if (ndeletable > 0)
+ {
+ /* No ereport(ERROR) until changes are logged */
+ START_CRIT_SECTION();
+
+ PageIndexMultiDelete(page, deletable, ndeletable);
+
+ pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
You are clearing this flag while logging the action, but the same is not taken
care of during replay. Any reasons?
2.
+ /* No ereport(ERROR) until changes are logged */
+ START_CRIT_SECTION();
+
+ PageIndexMultiDelete(page, deletable, ndeletable);
+
+ pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
+
+ /*
+ * Write-lock the meta page so that we can decrement
+ * tuple count.
+ */
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
The lock on buffer should be acquired before critical section.
3.
- /*
- * Since this can be redone later if needed, mark as a hint.
- */
- MarkBufferDirtyHint(buf, true);
+ if (so->numKilled < MaxIndexTuplesPerPage)
+ {
+ so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
+ so->killedItems[so->numKilled].indexOffset =
+ ItemPointerGetOffsetNumber(&(so->hashso_curpos));
+ so->numKilled++;
+ }
By looking at the above code, the first thing that comes to mind is when
numKilled can become greater than MaxIndexTuplesPerPage and why we are
ignoring the marking of dead tuples when it becomes greater than
MaxIndexTuplesPerPage. After looking at the similar btree code, I realize
that it could happen if the user reverses the scan direction. I think you
should mention in a comment to see btgettuple for the reason behind the
numKilled overrun test, or something like that.
4.
+ * We match items by heap TID before assuming they are the right ones to
+ * delete. If an item has moved off the current page due to a split, we'll
+ * fail to find it and do nothing (this is not an error case --- we assume
+ * the item will eventually get marked in a future indexscan).
+ */
+void
+_hash_kill_items(IndexScanDesc scan)
I think this comment doesn't make much sense for hash index because
item won't move off from the current page due to split, only later
cleanup can remove it.
5.
+
/*
* Maximum size of a hash index item (it's okay to have only one per page)
Spurious white space change.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Generally, this patch looks like a pretty straightforward adaptation
of the similar btree mechanism to hash indexes, so if it works for
btree it ought to work here, too. But I noticed a few things while
reading through it.
+ /* Get RelfileNode from relation OID */
+ rel = relation_open(htup->t_tableOid, NoLock);
+ rnode = rel->rd_node;
+ relation_close(rel, NoLock);
itup->t_tid = htup->t_self;
- _hash_doinsert(index, itup);
+ _hash_doinsert(index, itup, rnode);
This is an awfully low-level place to be doing something like this.
I'm not sure exactly where this should be happening, but not in the
per-tuple callback.
Okay, now I have done this inside _hash_doinsert() instead of the callback
function. Please have a look at the attached v7 patch.
+ /*
+ * If there's nothing running on the standby we don't need to derive a
+ * full latestRemovedXid value, so use a fast path out of here. This
+ * returns InvalidTransactionId, and so will conflict with all HS
+ * transactions; but since we just worked out that that's zero people,
+ * it's OK.
+ */
+ if (CountDBBackends(InvalidOid) == 0)
+ return latestRemovedXid;
I see that this comment (and most of what surrounds it) was just
copied from the existing btree example, but isn't there a discrepancy
between the comment and the code? It says it returns
InvalidTransactionId, but it doesn't. Also, you dropped the XXX from
the btree original, and the following reachedConsistency check.
It does return InvalidTransactionId if there are no backends running
across any database on the standby. As shown below, 'latestRemovedXid'
is initialised with InvalidTransactionId:
TransactionId latestRemovedXid = InvalidTransactionId;
So, if there are no backend processes running across any database on the
standby, latestRemovedXid is returned as is.
I have also added back the XXX note in the above comment. Please check the
v7 patch attached with this mail.
+ * Hash Index delete records can conflict with standby queries.You might
+ * think that vacuum records would conflict as well, but we've handled
But they're not called delete records in a hash index. The function
is called hash_xlog_vacuum_one_page. The corresponding btree function
is btree_xlog_delete. So this comment needs a little more updating.
Okay, I have tried to rephrase it to avoid the confusion.
+ if (IsBufferCleanupOK(bucket_buf))
+ {
+ _hash_vacuum_one_page(rel, metabuf, buf, bucket_buf,
+ (buf == bucket_buf) ? true : false,
+ hnode);
+ if (bucket_buf != buf)
+ LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
+
+ if (PageGetFreeSpace(page) >= itemsz)
+ break; /* OK, now we have enough space */
+ }
I might be missing something, but I don't quite see why this needs a
cleanup lock on the primary bucket page. I would think a cleanup lock
on the page we're actually modifying would suffice, and I think if
that's correct it would be a better way to go. If that's not correct,
then I think the comments need some work.
Thanks for that suggestion... I spent a lot of time thinking on
this and also had a small discussion with Amit, but could not find any
issue with taking a cleanup lock on the modified page instead of the
primary bucket page. I had to do some decent code changes for this. The
attached v7 patch has the changes.
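To summarise, the check inside the insertion loop in the attached v7 patch
now looks roughly like this (condensed from the hashinsert.c hunk; the
lookup of hnode is elided here):

    if (H_HAS_DEAD_TUPLES(pageopaque))
    {
        /* cleanup lock is needed only on the page we are about to modify */
        if (IsBufferCleanupOK(buf))
        {
            _hash_vacuum_one_page(rel, metabuf, buf, hnode);

            if (PageGetFreeSpace(page) >= itemsz)
                break;          /* OK, now we have enough space */
        }
    }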
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
Attachments:
microvacuum_hash_index_v7.patch (application/x-download)
From d57b8aaafc8debbe010f49de55150fdbaa461792 Mon Sep 17 00:00:00 2001
From: ashu <ashu@localhost.localdomain>
Date: Wed, 15 Mar 2017 20:35:58 +0530
Subject: [PATCH] microvacuum_hash_index_v7.patch
---
src/backend/access/hash/README | 5 +-
src/backend/access/hash/hash.c | 45 ++++++-
src/backend/access/hash/hash_xlog.c | 240 +++++++++++++++++++++++++++++++++
src/backend/access/hash/hashinsert.c | 127 ++++++++++++++++-
src/backend/access/hash/hashsearch.c | 8 ++
src/backend/access/hash/hashutil.c | 68 ++++++++++
src/backend/access/rmgrdesc/hashdesc.c | 2 +
src/include/access/hash.h | 13 ++
src/include/access/hash_xlog.h | 20 +++
9 files changed, 519 insertions(+), 9 deletions(-)
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 53b0e0d..1541438 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -284,7 +284,10 @@ The insertion algorithm is rather similar:
if we get the lock on both the buckets
finish the split using algorithm mentioned below for split
release the pin on old bucket and restart the insert from beginning.
- if current page is full, release lock but not pin, read/exclusive-lock
+ if current page is full, first check if this page contains any dead tuples.
+ if yes, remove dead tuples from the current page and again check for the
+ availability of the space. If enough space found, insert the tuple else
+ release lock but not pin, read/exclusive-lock
next page; repeat as needed
>> see below if no space in any page of bucket
take buffer content lock in exclusive mode on metapage
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 6416769..74b4ab7 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -331,14 +331,24 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
if (scan->kill_prior_tuple)
{
/*
- * Yes, so mark it by setting the LP_DEAD state in the item flags.
+ * Yes, so remember it for later. (We'll deal with all such
+ * tuples at once right after leaving the index page or at
+ * end of scan.) In case if caller reverses the indexscan
+ * direction it is quite possible that the same item might
+ * get entered multiple times. But, we don't detect that
+ * instead we just forget any excess entries.
*/
- ItemIdMarkDead(PageGetItemId(page, offnum));
+ if (so->killedItems == NULL)
+ so->killedItems = palloc(MaxIndexTuplesPerPage *
+ sizeof(HashScanPosItem));
- /*
- * Since this can be redone later if needed, mark as a hint.
- */
- MarkBufferDirtyHint(buf, true);
+ if (so->numKilled < MaxIndexTuplesPerPage)
+ {
+ so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
+ so->killedItems[so->numKilled].indexOffset =
+ ItemPointerGetOffsetNumber(&(so->hashso_curpos));
+ so->numKilled++;
+ }
}
/*
@@ -446,6 +456,9 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
so->hashso_buc_populated = false;
so->hashso_buc_split = false;
+ so->killedItems = NULL;
+ so->numKilled = 0;
+
scan->opaque = so;
return scan;
@@ -461,6 +474,10 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
HashScanOpaque so = (HashScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ _hash_kill_items(scan);
+
_hash_dropscanbuf(rel, so);
/* set position invalid (this will cause _hash_first call) */
@@ -488,8 +505,14 @@ hashendscan(IndexScanDesc scan)
HashScanOpaque so = (HashScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ _hash_kill_items(scan);
+
_hash_dropscanbuf(rel, so);
+ if (so->killedItems != NULL)
+ pfree(so->killedItems);
pfree(so);
scan->opaque = NULL;
}
@@ -848,6 +871,16 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
PageIndexMultiDelete(page, deletable, ndeletable);
bucket_dirty = true;
+
+ /*
+ * Let us mark the page as clean if vacuum removes the DEAD tuples
+ * from an index page. We do this by clearing LH_PAGE_HAS_DEAD_TUPLES
+ * flag.
+ */
+ if (tuples_removed && *tuples_removed > 0 &&
+ opaque->hasho_flag & LH_PAGE_HAS_DEAD_TUPLES)
+ opaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
+
MarkBufferDirty(buf);
/* XLOG stuff */
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index 0c830ab..adaab28 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -14,10 +14,15 @@
*/
#include "postgres.h"
+#include "access/heapam_xlog.h"
#include "access/bufmask.h"
#include "access/hash.h"
#include "access/hash_xlog.h"
#include "access/xlogutils.h"
+#include "access/xlog.h"
+#include "access/transam.h"
+#include "storage/procarray.h"
+#include "miscadmin.h"
/*
* replay a hash index meta page
@@ -915,6 +920,238 @@ hash_xlog_update_meta_page(XLogReaderState *record)
UnlockReleaseBuffer(metabuf);
}
+/*
+ * Get the latestRemovedXid from the heap pages pointed at by the index
+ * tuples being deleted. This puts the work for calculating latestRemovedXid
+ * into the recovery path rather than the primary path.
+ *
+ * It's possible that this generates a fair amount of I/O, since an index
+ * block may have hundreds of tuples being deleted. Repeat accesses to the
+ * same heap blocks are common, though are not yet optimised.
+ */
+static TransactionId
+hash_xlog_vacuum_get_latestRemovedXid(XLogReaderState *record)
+{
+ xl_hash_vacuum *xlrec = (xl_hash_vacuum *) XLogRecGetData(record);
+ OffsetNumber *unused;
+ Buffer ibuffer,
+ hbuffer;
+ Page ipage,
+ hpage;
+ RelFileNode rnode;
+ BlockNumber blkno;
+ ItemId iitemid,
+ hitemid;
+ IndexTuple itup;
+ HeapTupleHeader htuphdr;
+ BlockNumber hblkno;
+ OffsetNumber hoffnum;
+ TransactionId latestRemovedXid = InvalidTransactionId;
+ int i;
+ char *ptr;
+ Size len;
+
+ /*
+ * If there's nothing running on the standby we don't need to derive a
+ * full latestRemovedXid value, so use a fast path out of here. This
+ * returns InvalidTransactionId, and so will conflict with all HS
+ * transactions; but since we just worked out that that's zero people,
+ * it's OK.
+ *
+ * XXX There is a race condition here, which is that a new backend might
+ * start just after we look. If so, it cannot need to conflict, but this
+ * coding will result in throwing a conflict anyway.
+ */
+ if (CountDBBackends(InvalidOid) == 0)
+ return latestRemovedXid;
+
+ /*
+ * Get index page. If the DB is consistent, this should not fail, nor
+ * should any of the heap page fetches below. If one does, we return
+ * InvalidTransactionId to cancel all HS transactions. That's probably
+ * overkill, but it's safe, and certainly better than panicking here.
+ */
+ XLogRecGetBlockTag(record, 1, &rnode, NULL, &blkno);
+ ibuffer = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno, RBM_NORMAL);
+
+ if (!BufferIsValid(ibuffer))
+ return InvalidTransactionId;
+ LockBuffer(ibuffer, HASH_READ);
+ ipage = (Page) BufferGetPage(ibuffer);
+
+ /*
+ * Loop through the deleted index items to obtain the TransactionId from
+ * the heap items they point to.
+ */
+ ptr = XLogRecGetBlockData(record, 1, &len);
+
+ unused = (OffsetNumber *) ptr;
+
+ for (i = 0; i < xlrec->ntuples; i++)
+ {
+ /*
+ * Identify the index tuple about to be deleted.
+ */
+ iitemid = PageGetItemId(ipage, unused[i]);
+ itup = (IndexTuple) PageGetItem(ipage, iitemid);
+
+ /*
+ * Locate the heap page that the index tuple points at
+ */
+ hblkno = ItemPointerGetBlockNumber(&(itup->t_tid));
+ hbuffer = XLogReadBufferExtended(xlrec->hnode, MAIN_FORKNUM,
+ hblkno, RBM_NORMAL);
+
+ if (!BufferIsValid(hbuffer))
+ {
+ UnlockReleaseBuffer(ibuffer);
+ return InvalidTransactionId;
+ }
+ LockBuffer(hbuffer, HASH_READ);
+ hpage = (Page) BufferGetPage(hbuffer);
+
+ /*
+ * Look up the heap tuple header that the index tuple points at by
+ * using the heap node supplied with the xlrec. We can't use
+ * heap_fetch, since it uses ReadBuffer rather than XLogReadBuffer.
+ * Note that we are not looking at tuple data here, just headers.
+ */
+ hoffnum = ItemPointerGetOffsetNumber(&(itup->t_tid));
+ hitemid = PageGetItemId(hpage, hoffnum);
+
+ /*
+ * Follow any redirections until we find something useful.
+ */
+ while (ItemIdIsRedirected(hitemid))
+ {
+ hoffnum = ItemIdGetRedirect(hitemid);
+ hitemid = PageGetItemId(hpage, hoffnum);
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ /*
+ * If the heap item has storage, then read the header and use that to
+ * set latestRemovedXid.
+ *
+ * Some LP_DEAD items may not be accessible, so we ignore them.
+ */
+ if (ItemIdHasStorage(hitemid))
+ {
+ htuphdr = (HeapTupleHeader) PageGetItem(hpage, hitemid);
+ HeapTupleHeaderAdvanceLatestRemovedXid(htuphdr, &latestRemovedXid);
+ }
+ else if (ItemIdIsDead(hitemid))
+ {
+ /*
+ * Conjecture: if hitemid is dead then it had xids before the xids
+ * marked on LP_NORMAL items. So we just ignore this item and move
+ * onto the next, for the purposes of calculating
+ * latestRemovedxids.
+ */
+ }
+ else
+ Assert(!ItemIdIsUsed(hitemid));
+
+ UnlockReleaseBuffer(hbuffer);
+ }
+
+ UnlockReleaseBuffer(ibuffer);
+
+ /*
+ * If all heap tuples were LP_DEAD then we will be returning
+ * InvalidTransactionId here, which avoids conflicts. This matches
+ * existing logic which assumes that LP_DEAD tuples must already be older
+ * than the latestRemovedXid on the cleanup record that set them as
+ * LP_DEAD, hence must already have generated a conflict.
+ */
+ return latestRemovedXid;
+}
+
+/*
+ * replay delete operation in hash index to remove
+ * tuples marked as DEAD during index tuple insertion.
+ */
+static void
+hash_xlog_vacuum_one_page(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ xl_hash_vacuum *xldata = (xl_hash_vacuum *) XLogRecGetData(record);
+ Buffer buffer;
+ Buffer metabuf;
+ Page page;
+ XLogRedoAction action;
+
+ /*
+ * If we have any conflict processing to do, it must happen before we
+ * update the page.
+ *
+ * Hash Index records that are marked as LP_DEAD and being removed during
+ * hash index tuple insertion can conflict with standby queries.You might
+ * think that vacuum records would conflict as well, but we've handled
+ * that already. XLOG_HEAP2_CLEANUP_INFO records provide the highest xid
+ * cleaned by the vacuum of the heap and so we can resolve any conflicts
+ * just once when that arrives. After that we know that no conflicts
+ * exist from individual hash index vacuum records on that index.
+ */
+ if (InHotStandby)
+ {
+ TransactionId latestRemovedXid =
+ hash_xlog_vacuum_get_latestRemovedXid(record);
+ RelFileNode rnode;
+
+ XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
+ ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rnode);
+ }
+
+ action = XLogReadBufferForRedo(record, 0, &buffer);
+
+ if (!IsBufferCleanupOK(buffer))
+ elog(PANIC, "hash_xlog_vacuum_one_page: failed to acquire cleanup lock");
+
+ if (action == BLK_NEEDS_REDO)
+ {
+ char *ptr;
+ Size len;
+
+ ptr = XLogRecGetBlockData(record, 0, &len);
+
+ page = (Page) BufferGetPage(buffer);
+
+ if (len > 0)
+ {
+ OffsetNumber *unused;
+ OffsetNumber *unend;
+
+ unused = (OffsetNumber *) ptr;
+ unend = (OffsetNumber *) ((char *) ptr + len);
+
+ if ((unend - unused) > 0)
+ PageIndexMultiDelete(page, unused, unend - unused);
+ }
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ }
+ if (BufferIsValid(buffer))
+ UnlockReleaseBuffer(buffer);
+
+ if (XLogReadBufferForRedo(record, 1, &metabuf) == BLK_NEEDS_REDO)
+ {
+ Page metapage;
+ HashMetaPage metap;
+
+ metapage = BufferGetPage(metabuf);
+ metap = HashPageGetMeta(metapage);
+
+ metap->hashm_ntuples -= xldata->ntuples;
+
+ PageSetLSN(metapage, lsn);
+ MarkBufferDirty(metabuf);
+ }
+ if (BufferIsValid(metabuf))
+ UnlockReleaseBuffer(metabuf);
+}
+
void
hash_redo(XLogReaderState *record)
{
@@ -958,6 +1195,9 @@ hash_redo(XLogReaderState *record)
case XLOG_HASH_UPDATE_META_PAGE:
hash_xlog_update_meta_page(record);
break;
+ case XLOG_HASH_VACUUM_ONE_PAGE:
+ hash_xlog_vacuum_one_page(record);
+ break;
default:
elog(PANIC, "hash_redo: unknown op code %u", info);
}
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 241728f..80d9c05 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,9 +17,14 @@
#include "access/hash.h"
#include "access/hash_xlog.h"
+#include "access/heapam.h"
#include "miscadmin.h"
#include "utils/rel.h"
+#include "storage/lwlock.h"
+#include "storage/buf_internals.h"
+static void _hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf,
+ RelFileNode hnode);
/*
* _hash_doinsert() -- Handle insertion of a single index tuple.
@@ -43,6 +48,8 @@ _hash_doinsert(Relation rel, IndexTuple itup)
uint32 hashkey;
Bucket bucket;
OffsetNumber itup_off;
+ Relation hrel;
+ RelFileNode hnode;
/*
* Get the hash key for the item (it's stored in the index tuple itself).
@@ -118,10 +125,35 @@ restart_insert:
/* Do the insertion */
while (PageGetFreeSpace(page) < itemsz)
{
+ BlockNumber nextblkno;
+
+ /*
+ * Check if current page has any DEAD tuples. If yes,
+ * delete these tuples and see if we can get a space for
+ * the new item to be inserted before moving to the next
+ * page in the bucket chain.
+ */
+ if (H_HAS_DEAD_TUPLES(pageopaque))
+ {
+
+ if (IsBufferCleanupOK(buf))
+ {
+ /* Get RelfileNode from relation OID */
+ hrel = relation_open(rel->rd_index->indrelid, NoLock);
+ hnode = hrel->rd_node;
+ relation_close(hrel, NoLock);
+
+ _hash_vacuum_one_page(rel, metabuf, buf, hnode);
+
+ if (PageGetFreeSpace(page) >= itemsz)
+ break; /* OK, now we have enough space */
+ }
+ }
+
/*
* no space on this page; check for an overflow page
*/
- BlockNumber nextblkno = pageopaque->hasho_nextblkno;
+ nextblkno = pageopaque->hasho_nextblkno;
if (BlockNumberIsValid(nextblkno))
{
@@ -157,7 +189,8 @@ restart_insert:
Assert(PageGetFreeSpace(page) >= itemsz);
}
pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
- Assert(pageopaque->hasho_flag == LH_OVERFLOW_PAGE);
+ Assert(pageopaque->hasho_flag == LH_OVERFLOW_PAGE ||
+ pageopaque->hasho_flag == (LH_OVERFLOW_PAGE | LH_PAGE_HAS_DEAD_TUPLES));
Assert(pageopaque->hasho_bucket == bucket);
}
@@ -300,3 +333,93 @@ _hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
RelationGetRelationName(rel));
}
}
+
+/*
+ * _hash_vacuum_one_page - vacuum just one index page.
+ * Try to remove LP_DEAD items from the given page. We
+ * must acquire cleanup lock on the page being modified
+ * before calling this function.
+ */
+
+static void
+_hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf,
+ RelFileNode hnode)
+{
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+ OffsetNumber offnum,
+ maxoff;
+ Page page = BufferGetPage(buf);
+ HashPageOpaque pageopaque;
+ HashMetaPage metap;
+ double tuples_removed = 0;
+
+ /* Scan each tuple in page to see if it is marked as LP_DEAD */
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (offnum = FirstOffsetNumber;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemId = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemId))
+ {
+ deletable[ndeletable++] = offnum;
+ tuples_removed += 1;
+ }
+ }
+
+ if (ndeletable > 0)
+ {
+ /*
+ * Write-lock the meta page so that we can decrement
+ * tuple count.
+ */
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+
+ /* No ereport(ERROR) until changes are logged */
+ START_CRIT_SECTION();
+
+ PageIndexMultiDelete(page, deletable, ndeletable);
+
+ pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
+
+ metap = HashPageGetMeta(BufferGetPage(metabuf));
+ metap->hashm_ntuples -= tuples_removed;
+
+ MarkBufferDirty(buf);
+ MarkBufferDirty(metabuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ xl_hash_vacuum xlrec;
+ XLogRecPtr recptr;
+
+ xlrec.hnode = hnode;
+ xlrec.ntuples = tuples_removed;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfHashVacuum);
+
+ XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+ XLogRegisterBufData(0, (char *) deletable,
+ ndeletable * sizeof(OffsetNumber));
+
+ XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);
+
+ recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_VACUUM_ONE_PAGE);
+
+ PageSetLSN(BufferGetPage(buf), recptr);
+ PageSetLSN(BufferGetPage(metabuf), recptr);
+ }
+
+ END_CRIT_SECTION();
+ /*
+ * Releasing write lock on meta page as we have updated
+ * the tuple count.
+ */
+ LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+ }
+}
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index d733770..2d92049 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -465,6 +465,10 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
break; /* yes, so exit for-loop */
}
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ _hash_kill_items(scan);
+
/*
* ran off the end of this page, try the next
*/
@@ -518,6 +522,10 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
break; /* yes, so exit for-loop */
}
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ _hash_kill_items(scan);
+
/*
* ran off the end of this page, try the next
*/
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index c705531..2e99719 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -19,6 +19,7 @@
#include "access/relscan.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"
+#include "storage/buf_internals.h"
#define CALC_NEW_BUCKET(old_bucket, lowmask) \
old_bucket | (lowmask + 1)
@@ -446,3 +447,70 @@ _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
return new_bucket;
}
+
+/*
+ * _hash_kill_items - set LP_DEAD state for items an indexscan caller has
+ * told us were killed.
+ *
+ * scan->opaque, referenced locally through so, contains information about the
+ * current page and killed tuples thereon (generally, this should only be
+ * called if so->numKilled > 0).
+ *
+ * We match items by heap TID before assuming they are the right ones to
+ * delete.
+ */
+void
+_hash_kill_items(IndexScanDesc scan)
+{
+ HashScanOpaque so = (HashScanOpaque) scan->opaque;
+ Page page;
+ HashPageOpaque opaque;
+ OffsetNumber offnum, maxoff;
+ int numKilled = so->numKilled;
+ int i;
+ bool killedsomething = false;
+
+ Assert(so->numKilled > 0);
+ Assert(so->killedItems != NULL);
+
+ /*
+ * Always reset the scan state, so we don't look for same
+ * items on other pages.
+ */
+ so->numKilled = 0;
+
+ page = BufferGetPage(so->hashso_curbuf);
+ opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ for (i = 0; i < numKilled; i++)
+ {
+ offnum = so->killedItems[i].indexOffset;
+
+ while (offnum <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
+
+ if (ItemPointerEquals(&ituple->t_tid, &so->killedItems[i].heapTid))
+ {
+ /* found the item */
+ ItemIdMarkDead(iid);
+ killedsomething = true;
+ break; /* out of inner search loop */
+ }
+ offnum = OffsetNumberNext(offnum);
+ }
+ }
+
+ /*
+ * Since this can be redone later if needed, mark as dirty hint.
+ * Whenever we mark anything LP_DEAD, we also set the page's
+ * LH_PAGE_HAS_DEAD_TUPLES flag, which is likewise just a hint.
+ */
+ if (killedsomething)
+ {
+ opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
+ MarkBufferDirtyHint(so->hashso_curbuf, true);
+ }
+}
diff --git a/src/backend/access/rmgrdesc/hashdesc.c b/src/backend/access/rmgrdesc/hashdesc.c
index f1cc9ff..5bd5c8d 100644
--- a/src/backend/access/rmgrdesc/hashdesc.c
+++ b/src/backend/access/rmgrdesc/hashdesc.c
@@ -154,6 +154,8 @@ hash_identify(uint8 info)
case XLOG_HASH_UPDATE_META_PAGE:
id = "UPDATE_META_PAGE";
break;
+ case XLOG_HASH_VACUUM_ONE_PAGE:
+ id = "VACUUM_ONE_PAGE";
}
return id;
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index bfdfed8..02434e2 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -57,6 +57,7 @@ typedef uint32 Bucket;
#define LH_BUCKET_BEING_POPULATED (1 << 4)
#define LH_BUCKET_BEING_SPLIT (1 << 5)
#define LH_BUCKET_NEEDS_SPLIT_CLEANUP (1 << 6)
+#define LH_PAGE_HAS_DEAD_TUPLES (1 << 7)
#define LH_PAGE_TYPE \
(LH_OVERFLOW_PAGE|LH_BUCKET_PAGE|LH_BITMAP_PAGE|LH_META_PAGE)
@@ -86,6 +87,7 @@ typedef HashPageOpaqueData *HashPageOpaque;
#define H_NEEDS_SPLIT_CLEANUP(opaque) ((opaque)->hasho_flag & LH_BUCKET_NEEDS_SPLIT_CLEANUP)
#define H_BUCKET_BEING_SPLIT(opaque) ((opaque)->hasho_flag & LH_BUCKET_BEING_SPLIT)
#define H_BUCKET_BEING_POPULATED(opaque) ((opaque)->hasho_flag & LH_BUCKET_BEING_POPULATED)
+#define H_HAS_DEAD_TUPLES(opaque) ((opaque)->hasho_flag & LH_PAGE_HAS_DEAD_TUPLES)
/*
* The page ID is for the convenience of pg_filedump and similar utilities,
@@ -95,6 +97,13 @@ typedef HashPageOpaqueData *HashPageOpaque;
*/
#define HASHO_PAGE_ID 0xFF80
+typedef struct HashScanPosItem /* what we remember about each match */
+{
+ ItemPointerData heapTid; /* TID of referenced heap item */
+ OffsetNumber indexOffset; /* index item's location within page */
+} HashScanPosItem;
+
+
/*
* HashScanOpaqueData is private state for a hash index scan.
*/
@@ -135,6 +144,9 @@ typedef struct HashScanOpaqueData
* referred only when hashso_buc_populated is true.
*/
bool hashso_buc_split;
+ /* info about killed items if any (killedItems is NULL if never used) */
+ HashScanPosItem *killedItems; /* tids and offset numbers of killed items */
+ int numKilled; /* number of currently stored items */
} HashScanOpaqueData;
typedef HashScanOpaqueData *HashScanOpaque;
@@ -381,6 +393,7 @@ extern BlockNumber _hash_get_oldblock_from_newbucket(Relation rel, Bucket new_bu
extern BlockNumber _hash_get_newblock_from_oldbucket(Relation rel, Bucket old_bucket);
extern Bucket _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
uint32 lowmask, uint32 maxbucket);
+extern void _hash_kill_items(IndexScanDesc scan);
/* hash.c */
extern void hashbucketcleanup(Relation rel, Bucket cur_bucket,
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index 552d642..4e505cf 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -44,6 +44,7 @@
#define XLOG_HASH_UPDATE_META_PAGE 0xB0 /* update meta page after
* vacuum */
+#define XLOG_HASH_VACUUM_ONE_PAGE 0xC0 /* remove dead tuples from index page */
/*
* xl_hash_split_allocate_page flag values, 8 bits are available.
@@ -250,6 +251,25 @@ typedef struct xl_hash_init_bitmap_page
#define SizeOfHashInitBitmapPage \
(offsetof(xl_hash_init_bitmap_page, bmsize) + sizeof(uint16))
+/*
+ * This is what we need for index tuple deletion and to
+ * update the meta page.
+ *
+ * This data record is used for XLOG_HASH_VACUUM_ONE_PAGE
+ *
+ * Backup Blk 0/1: bucket page
+ * Backup Blk 2: meta page
+ */
+typedef struct xl_hash_vacuum
+{
+ RelFileNode hnode;
+ double ntuples;
+ bool is_primary_bucket_page;
+} xl_hash_vacuum;
+
+#define SizeOfHashVacuum \
+ (offsetof(xl_hash_vacuum, is_primary_bucket_page) + sizeof(bool))
+
extern void hash_redo(XLogReaderState *record);
extern void hash_desc(StringInfo buf, XLogReaderState *record);
extern const char *hash_identify(uint8 info);
--
1.8.3.1
I think one possibility is to get it using
indexrel->rd_index->indrelid in _hash_doinsert().
Thanks. I have tried the same in the v7 patch shared upthread.
But they're not called delete records in a hash index. The function
is called hash_xlog_vacuum_one_page. The corresponding btree function
is btree_xlog_delete. So this comment needs a little more updating.
+ if (IsBufferCleanupOK(bucket_buf))
+ {
+ _hash_vacuum_one_page(rel, metabuf, buf, bucket_buf,
+ (buf == bucket_buf) ? true : false,
+ hnode);
+ if (bucket_buf != buf)
+ LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
+
+ if (PageGetFreeSpace(page) >= itemsz)
+ break; /* OK, now we have enough space */
+ }
I might be missing something, but I don't quite see why this needs a
cleanup lock on the primary bucket page. I would think a cleanup lock
on the page we're actually modifying would suffice, and I think if
that's correct it would be a better way to go.
Offhand, I also don't see any problem with it.
I too found no problem with that...
Few other comments:
1.
+ if (ndeletable > 0)
+ {
+ /* No ereport(ERROR) until changes are logged */
+ START_CRIT_SECTION();
+
+ PageIndexMultiDelete(page, deletable, ndeletable);
+
+ pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
You are clearing this flag while logging the action, but the same is not taken
care of during replay. Any reasons?
That's because we conditionally WAL-log this flag status, and when we
do so, we take an FPI (full page image) of it.
2.
+ /* No ereport(ERROR) until changes are logged */
+ START_CRIT_SECTION();
+
+ PageIndexMultiDelete(page, deletable, ndeletable);
+
+ pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
+
+ /*
+ * Write-lock the meta page so that we can decrement
+ * tuple count.
+ */
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
The lock on buffer should be acquired before critical section.
Point taken. I have taken care of it in the v7 patch.
3.
- /*
- * Since this can be redone later if needed, mark as a hint.
- */
- MarkBufferDirtyHint(buf, true);
+ if (so->numKilled < MaxIndexTuplesPerPage)
+ {
+ so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
+ so->killedItems[so->numKilled].indexOffset =
+ ItemPointerGetOffsetNumber(&(so->hashso_curpos));
+ so->numKilled++;
+ }
By looking at the above code, the first thing that comes to mind is when
numKilled can become greater than MaxIndexTuplesPerPage and why we are
ignoring the marking of dead tuples when it becomes greater than
MaxIndexTuplesPerPage. After looking at the similar btree code, I realize
that it could happen if the user reverses the scan direction. I think you
should mention in a comment to see btgettuple for the reason behind the
numKilled overrun test, or something like that.
Added comment. Please refer to v7 patch.
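For reference, the resulting hashgettuple() change in the v7 patch
accumulates the killed item roughly as below (the comment text about
reversed scans is a paraphrase of the discussion above, not necessarily
the exact comment wording in the patch):

    if (scan->kill_prior_tuple)
    {
        /*
         * Remember the tuple for later; it is marked LP_DEAD when we leave
         * the page or end the scan.  As in btgettuple(), a reversed scan
         * direction can revisit items, so numKilled is capped at
         * MaxIndexTuplesPerPage and any excess entries are simply forgotten.
         */
        if (so->killedItems == NULL)
            so->killedItems = palloc(MaxIndexTuplesPerPage *
                                     sizeof(HashScanPosItem));

        if (so->numKilled < MaxIndexTuplesPerPage)
        {
            so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
            so->killedItems[so->numKilled].indexOffset =
                ItemPointerGetOffsetNumber(&(so->hashso_curpos));
            so->numKilled++;
        }
    }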
4.
+ * We match items by heap TID before assuming they are the right ones to
+ * delete. If an item has moved off the current page due to a split, we'll
+ * fail to find it and do nothing (this is not an error case --- we assume
+ * the item will eventually get marked in a future indexscan).
+ */
+void
+_hash_kill_items(IndexScanDesc scan)
I think this comment doesn't make much sense for hash index because
item won't move off from the current page due to split, only later
cleanup can remove it.
Yes, the reason being that no cleanup can happen while a scan is in progress.
Corrected it.
5.
+
/*
* Maximum size of a hash index item (it's okay to have only one per page)
Spurious white space change.
Fixed.
On Wed, Mar 15, 2017 at 11:37 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
+ /* Get RelfileNode from relation OID */
+ rel = relation_open(htup->t_tableOid, NoLock);
+ rnode = rel->rd_node;
+ relation_close(rel, NoLock);
itup->t_tid = htup->t_self;
- _hash_doinsert(index, itup);
+ _hash_doinsert(index, itup, rnode);
This is an awfully low-level place to be doing something like this.
I'm not sure exactly where this should be happening, but not in the
per-tuple callback.
Okay, now I have done this inside _hash_doinsert() instead of the callback
function. Please have a look at the attached v7 patch.
In the btree case, the heap relation isn't re-opened from anywhere in
the btree code. I think we should try to do the same thing here. If
we add an argument for the heap relation to _hash_doinsert(),
hashinsert() can easily pass it down; it's already got that value
available. There are two other calls to _hash_doinsert:
1. _h_indexbuild() calls _hash_doinsert(). It's called only from
hashbuild(), which has the heap relation available. So we can just
add that as an extra argument to _h_indexbuild() and then from there
pass it to _hash_doinsert.
2. hashbuildCallback calls _hash_doinsert(). Its sixth argument is a
HashBuildState which is set up by hashbuild(), which has the heap
relation available. So we can just add an extra member to the
HashBuildState and have hashbuild() set it before calling
IndexBuildHeapScan. hashbuildCallback can then fish it out of the
HashBuildState and pass it to _hash_doinsert().
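A rough sketch of what that would look like (the v8 patch attached below
ends up doing essentially this):

    typedef struct
    {
        HSpool     *spool;          /* NULL if not using spooling */
        double      indtuples;      /* # tuples accepted into index */
        Relation    heapRel;        /* heap relation descriptor */
    } HashBuildState;

    /* in hashbuild(), before calling IndexBuildHeapScan: */
    buildstate.heapRel = heap;

    /* in hashbuildCallback(), when inserting each index tuple: */
    _hash_doinsert(index, itup, buildstate->heapRel);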
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Mar 15, 2017 at 9:28 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Mar 15, 2017 at 11:37 AM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
+ /* Get RelfileNode from relation OID */
+ rel = relation_open(htup->t_tableOid, NoLock);
+ rnode = rel->rd_node;
+ relation_close(rel, NoLock);
itup->t_tid = htup->t_self;
- _hash_doinsert(index, itup);
+ _hash_doinsert(index, itup, rnode);
This is an awfully low-level place to be doing something like this.
I'm not sure exactly where this should be happening, but not in the
per-tuple callback.
Okay, now I have done this inside _hash_doinsert() instead of the callback
function. Please have a look at the attached v7 patch.
In the btree case, the heap relation isn't re-opened from anywhere in
the btree code. I think we should try to do the same thing here. If
we add an argument for the heap relation to _hash_doinsert(),
hashinsert() can easily pass it down; it's already got that value
available. There are two other calls to _hash_doinsert:
1. _h_indexbuild() calls _hash_doinsert(). It's called only from
hashbuild(), which has the heap relation available. So we can just
add that as an extra argument to _h_indexbuild() and then from there
pass it to _hash_doinsert.
2. hashbuildCallback calls _hash_doinsert(). Its sixth argument is a
HashBuildState which is set up by hashbuild(), which has the heap
relation available. So we can just add an extra member to the
HashBuildState and have hashbuild() set it before calling
IndexBuildHeapScan. hashbuildCallback can then fish it out of the
HashBuildState and pass it to _hash_doinsert().
Okay, I have done the changes as suggested by you. Please refer to the
attached v8 patch. Thanks.
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
Attachments:
microvacuum_hash_index_v8.patch (application/x-download)
From 436e35537324f17d4b825ad6d60c17ab4f3469ba Mon Sep 17 00:00:00 2001
From: ashu <ashu@localhost.localdomain>
Date: Wed, 15 Mar 2017 23:21:12 +0530
Subject: [PATCH] microvacuum_hash_index_v7.patch
---
src/backend/access/hash/README | 5 +-
src/backend/access/hash/hash.c | 53 ++++++--
src/backend/access/hash/hash_xlog.c | 240 +++++++++++++++++++++++++++++++++
src/backend/access/hash/hashinsert.c | 122 ++++++++++++++++-
src/backend/access/hash/hashsearch.c | 8 ++
src/backend/access/hash/hashsort.c | 4 +-
src/backend/access/hash/hashutil.c | 68 ++++++++++
src/backend/access/rmgrdesc/hashdesc.c | 2 +
src/include/access/hash.h | 17 ++-
src/include/access/hash_xlog.h | 20 +++
10 files changed, 522 insertions(+), 17 deletions(-)
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 53b0e0d..1541438 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -284,7 +284,10 @@ The insertion algorithm is rather similar:
if we get the lock on both the buckets
finish the split using algorithm mentioned below for split
release the pin on old bucket and restart the insert from beginning.
- if current page is full, release lock but not pin, read/exclusive-lock
+ if current page is full, first check if this page contains any dead tuples.
+ if yes, remove dead tuples from the current page and again check for the
+ availability of the space. If enough space found, insert the tuple else
+ release lock but not pin, read/exclusive-lock
next page; repeat as needed
>> see below if no space in any page of bucket
take buffer content lock in exclusive mode on metapage
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 6416769..4dedab4 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -36,6 +36,7 @@ typedef struct
{
HSpool *spool; /* NULL if not using spooling */
double indtuples; /* # tuples accepted into index */
+ Relation heapRel; /* heap relation descriptor */
} HashBuildState;
static void hashbuildCallback(Relation index,
@@ -154,6 +155,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
/* prepare to build the index */
buildstate.indtuples = 0;
+ buildstate.heapRel = heap;
/* do the heap scan */
reltuples = IndexBuildHeapScan(heap, index, indexInfo, true,
@@ -162,7 +164,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
if (buildstate.spool)
{
/* sort the tuples and insert them into the index */
- _h_indexbuild(buildstate.spool);
+ _h_indexbuild(buildstate.spool, buildstate.heapRel);
_h_spooldestroy(buildstate.spool);
}
@@ -218,7 +220,7 @@ hashbuildCallback(Relation index,
itup = index_form_tuple(RelationGetDescr(index),
index_values, index_isnull);
itup->t_tid = htup->t_self;
- _hash_doinsert(index, itup);
+ _hash_doinsert(index, itup, buildstate->heapRel);
pfree(itup);
}
@@ -251,7 +253,7 @@ hashinsert(Relation rel, Datum *values, bool *isnull,
itup = index_form_tuple(RelationGetDescr(rel), index_values, index_isnull);
itup->t_tid = *ht_ctid;
- _hash_doinsert(rel, itup);
+ _hash_doinsert(rel, itup, heapRel);
pfree(itup);
@@ -331,14 +333,24 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
if (scan->kill_prior_tuple)
{
/*
- * Yes, so mark it by setting the LP_DEAD state in the item flags.
+ * Yes, so remember it for later. (We'll deal with all such
+ * tuples at once right after leaving the index page or at
+ * end of scan.) In case if caller reverses the indexscan
+ * direction it is quite possible that the same item might
+ * get entered multiple times. But, we don't detect that
+ * instead we just forget any excess entries.
*/
- ItemIdMarkDead(PageGetItemId(page, offnum));
+ if (so->killedItems == NULL)
+ so->killedItems = palloc(MaxIndexTuplesPerPage *
+ sizeof(HashScanPosItem));
- /*
- * Since this can be redone later if needed, mark as a hint.
- */
- MarkBufferDirtyHint(buf, true);
+ if (so->numKilled < MaxIndexTuplesPerPage)
+ {
+ so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
+ so->killedItems[so->numKilled].indexOffset =
+ ItemPointerGetOffsetNumber(&(so->hashso_curpos));
+ so->numKilled++;
+ }
}
/*
@@ -446,6 +458,9 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
so->hashso_buc_populated = false;
so->hashso_buc_split = false;
+ so->killedItems = NULL;
+ so->numKilled = 0;
+
scan->opaque = so;
return scan;
@@ -461,6 +476,10 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
HashScanOpaque so = (HashScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ _hash_kill_items(scan);
+
_hash_dropscanbuf(rel, so);
/* set position invalid (this will cause _hash_first call) */
@@ -488,8 +507,14 @@ hashendscan(IndexScanDesc scan)
HashScanOpaque so = (HashScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ _hash_kill_items(scan);
+
_hash_dropscanbuf(rel, so);
+ if (so->killedItems != NULL)
+ pfree(so->killedItems);
pfree(so);
scan->opaque = NULL;
}
@@ -848,6 +873,16 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
PageIndexMultiDelete(page, deletable, ndeletable);
bucket_dirty = true;
+
+ /*
+ * Let us mark the page as clean if vacuum removes the DEAD tuples
+ * from an index page. We do this by clearing LH_PAGE_HAS_DEAD_TUPLES
+ * flag.
+ */
+ if (tuples_removed && *tuples_removed > 0 &&
+ opaque->hasho_flag & LH_PAGE_HAS_DEAD_TUPLES)
+ opaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
+
MarkBufferDirty(buf);
/* XLOG stuff */
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index 0c830ab..adaab28 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -14,10 +14,15 @@
*/
#include "postgres.h"
+#include "access/heapam_xlog.h"
#include "access/bufmask.h"
#include "access/hash.h"
#include "access/hash_xlog.h"
#include "access/xlogutils.h"
+#include "access/xlog.h"
+#include "access/transam.h"
+#include "storage/procarray.h"
+#include "miscadmin.h"
/*
* replay a hash index meta page
@@ -915,6 +920,238 @@ hash_xlog_update_meta_page(XLogReaderState *record)
UnlockReleaseBuffer(metabuf);
}
+/*
+ * Get the latestRemovedXid from the heap pages pointed at by the index
+ * tuples being deleted. This puts the work for calculating latestRemovedXid
+ * into the recovery path rather than the primary path.
+ *
+ * It's possible that this generates a fair amount of I/O, since an index
+ * block may have hundreds of tuples being deleted. Repeat accesses to the
+ * same heap blocks are common, though are not yet optimised.
+ */
+static TransactionId
+hash_xlog_vacuum_get_latestRemovedXid(XLogReaderState *record)
+{
+ xl_hash_vacuum *xlrec = (xl_hash_vacuum *) XLogRecGetData(record);
+ OffsetNumber *unused;
+ Buffer ibuffer,
+ hbuffer;
+ Page ipage,
+ hpage;
+ RelFileNode rnode;
+ BlockNumber blkno;
+ ItemId iitemid,
+ hitemid;
+ IndexTuple itup;
+ HeapTupleHeader htuphdr;
+ BlockNumber hblkno;
+ OffsetNumber hoffnum;
+ TransactionId latestRemovedXid = InvalidTransactionId;
+ int i;
+ char *ptr;
+ Size len;
+
+ /*
+ * If there's nothing running on the standby we don't need to derive a
+ * full latestRemovedXid value, so use a fast path out of here. This
+ * returns InvalidTransactionId, and so will conflict with all HS
+ * transactions; but since we just worked out that that's zero people,
+ * it's OK.
+ *
+ * XXX There is a race condition here, which is that a new backend might
+ * start just after we look. If so, it cannot need to conflict, but this
+ * coding will result in throwing a conflict anyway.
+ */
+ if (CountDBBackends(InvalidOid) == 0)
+ return latestRemovedXid;
+
+ /*
+ * Get index page. If the DB is consistent, this should not fail, nor
+ * should any of the heap page fetches below. If one does, we return
+ * InvalidTransactionId to cancel all HS transactions. That's probably
+ * overkill, but it's safe, and certainly better than panicking here.
+ */
+ XLogRecGetBlockTag(record, 1, &rnode, NULL, &blkno);
+ ibuffer = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno, RBM_NORMAL);
+
+ if (!BufferIsValid(ibuffer))
+ return InvalidTransactionId;
+ LockBuffer(ibuffer, HASH_READ);
+ ipage = (Page) BufferGetPage(ibuffer);
+
+ /*
+ * Loop through the deleted index items to obtain the TransactionId from
+ * the heap items they point to.
+ */
+ ptr = XLogRecGetBlockData(record, 1, &len);
+
+ unused = (OffsetNumber *) ptr;
+
+ for (i = 0; i < xlrec->ntuples; i++)
+ {
+ /*
+ * Identify the index tuple about to be deleted.
+ */
+ iitemid = PageGetItemId(ipage, unused[i]);
+ itup = (IndexTuple) PageGetItem(ipage, iitemid);
+
+ /*
+ * Locate the heap page that the index tuple points at
+ */
+ hblkno = ItemPointerGetBlockNumber(&(itup->t_tid));
+ hbuffer = XLogReadBufferExtended(xlrec->hnode, MAIN_FORKNUM,
+ hblkno, RBM_NORMAL);
+
+ if (!BufferIsValid(hbuffer))
+ {
+ UnlockReleaseBuffer(ibuffer);
+ return InvalidTransactionId;
+ }
+ LockBuffer(hbuffer, HASH_READ);
+ hpage = (Page) BufferGetPage(hbuffer);
+
+ /*
+ * Look up the heap tuple header that the index tuple points at by
+ * using the heap node supplied with the xlrec. We can't use
+ * heap_fetch, since it uses ReadBuffer rather than XLogReadBuffer.
+ * Note that we are not looking at tuple data here, just headers.
+ */
+ hoffnum = ItemPointerGetOffsetNumber(&(itup->t_tid));
+ hitemid = PageGetItemId(hpage, hoffnum);
+
+ /*
+ * Follow any redirections until we find something useful.
+ */
+ while (ItemIdIsRedirected(hitemid))
+ {
+ hoffnum = ItemIdGetRedirect(hitemid);
+ hitemid = PageGetItemId(hpage, hoffnum);
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ /*
+ * If the heap item has storage, then read the header and use that to
+ * set latestRemovedXid.
+ *
+ * Some LP_DEAD items may not be accessible, so we ignore them.
+ */
+ if (ItemIdHasStorage(hitemid))
+ {
+ htuphdr = (HeapTupleHeader) PageGetItem(hpage, hitemid);
+ HeapTupleHeaderAdvanceLatestRemovedXid(htuphdr, &latestRemovedXid);
+ }
+ else if (ItemIdIsDead(hitemid))
+ {
+ /*
+ * Conjecture: if hitemid is dead then it had xids before the xids
+ * marked on LP_NORMAL items. So we just ignore this item and move
+ * onto the next, for the purposes of calculating
+ * latestRemovedxids.
+ */
+ }
+ else
+ Assert(!ItemIdIsUsed(hitemid));
+
+ UnlockReleaseBuffer(hbuffer);
+ }
+
+ UnlockReleaseBuffer(ibuffer);
+
+ /*
+ * If all heap tuples were LP_DEAD then we will be returning
+ * InvalidTransactionId here, which avoids conflicts. This matches
+ * existing logic which assumes that LP_DEAD tuples must already be older
+ * than the latestRemovedXid on the cleanup record that set them as
+ * LP_DEAD, hence must already have generated a conflict.
+ */
+ return latestRemovedXid;
+}
+
+/*
+ * replay delete operation in hash index to remove
+ * tuples marked as DEAD during index tuple insertion.
+ */
+static void
+hash_xlog_vacuum_one_page(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ xl_hash_vacuum *xldata = (xl_hash_vacuum *) XLogRecGetData(record);
+ Buffer buffer;
+ Buffer metabuf;
+ Page page;
+ XLogRedoAction action;
+
+ /*
+ * If we have any conflict processing to do, it must happen before we
+ * update the page.
+ *
+ * Hash Index records that are marked as LP_DEAD and being removed during
+ * hash index tuple insertion can conflict with standby queries.You might
+ * think that vacuum records would conflict as well, but we've handled
+ * that already. XLOG_HEAP2_CLEANUP_INFO records provide the highest xid
+ * cleaned by the vacuum of the heap and so we can resolve any conflicts
+ * just once when that arrives. After that we know that no conflicts
+ * exist from individual hash index vacuum records on that index.
+ */
+ if (InHotStandby)
+ {
+ TransactionId latestRemovedXid =
+ hash_xlog_vacuum_get_latestRemovedXid(record);
+ RelFileNode rnode;
+
+ XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
+ ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rnode);
+ }
+
+ action = XLogReadBufferForRedo(record, 0, &buffer);
+
+ if (!IsBufferCleanupOK(buffer))
+ elog(PANIC, "hash_xlog_vacuum_one_page: failed to acquire cleanup lock");
+
+ if (action == BLK_NEEDS_REDO)
+ {
+ char *ptr;
+ Size len;
+
+ ptr = XLogRecGetBlockData(record, 0, &len);
+
+ page = (Page) BufferGetPage(buffer);
+
+ if (len > 0)
+ {
+ OffsetNumber *unused;
+ OffsetNumber *unend;
+
+ unused = (OffsetNumber *) ptr;
+ unend = (OffsetNumber *) ((char *) ptr + len);
+
+ if ((unend - unused) > 0)
+ PageIndexMultiDelete(page, unused, unend - unused);
+ }
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ }
+ if (BufferIsValid(buffer))
+ UnlockReleaseBuffer(buffer);
+
+ if (XLogReadBufferForRedo(record, 1, &metabuf) == BLK_NEEDS_REDO)
+ {
+ Page metapage;
+ HashMetaPage metap;
+
+ metapage = BufferGetPage(metabuf);
+ metap = HashPageGetMeta(metapage);
+
+ metap->hashm_ntuples -= xldata->ntuples;
+
+ PageSetLSN(metapage, lsn);
+ MarkBufferDirty(metabuf);
+ }
+ if (BufferIsValid(metabuf))
+ UnlockReleaseBuffer(metabuf);
+}
+
void
hash_redo(XLogReaderState *record)
{
@@ -958,6 +1195,9 @@ hash_redo(XLogReaderState *record)
case XLOG_HASH_UPDATE_META_PAGE:
hash_xlog_update_meta_page(record);
break;
+ case XLOG_HASH_VACUUM_ONE_PAGE:
+ hash_xlog_vacuum_one_page(record);
+ break;
default:
elog(PANIC, "hash_redo: unknown op code %u", info);
}
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 241728f..b2faf03 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,9 +17,14 @@
#include "access/hash.h"
#include "access/hash_xlog.h"
+#include "access/heapam.h"
#include "miscadmin.h"
#include "utils/rel.h"
+#include "storage/lwlock.h"
+#include "storage/buf_internals.h"
+static void _hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf,
+ RelFileNode hnode);
/*
* _hash_doinsert() -- Handle insertion of a single index tuple.
@@ -28,7 +33,7 @@
* and hashinsert. By here, itup is completely filled in.
*/
void
-_hash_doinsert(Relation rel, IndexTuple itup)
+_hash_doinsert(Relation rel, IndexTuple itup, Relation heapRel)
{
Buffer buf = InvalidBuffer;
Buffer bucket_buf;
@@ -118,10 +123,30 @@ restart_insert:
/* Do the insertion */
while (PageGetFreeSpace(page) < itemsz)
{
+ BlockNumber nextblkno;
+
+ /*
+ * Check if current page has any DEAD tuples. If yes,
+ * delete these tuples and see if we can get a space for
+ * the new item to be inserted before moving to the next
+ * page in the bucket chain.
+ */
+ if (H_HAS_DEAD_TUPLES(pageopaque))
+ {
+
+ if (IsBufferCleanupOK(buf))
+ {
+ _hash_vacuum_one_page(rel, metabuf, buf, heapRel->rd_node);
+
+ if (PageGetFreeSpace(page) >= itemsz)
+ break; /* OK, now we have enough space */
+ }
+ }
+
/*
* no space on this page; check for an overflow page
*/
- BlockNumber nextblkno = pageopaque->hasho_nextblkno;
+ nextblkno = pageopaque->hasho_nextblkno;
if (BlockNumberIsValid(nextblkno))
{
@@ -157,7 +182,8 @@ restart_insert:
Assert(PageGetFreeSpace(page) >= itemsz);
}
pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
- Assert(pageopaque->hasho_flag == LH_OVERFLOW_PAGE);
+ Assert(pageopaque->hasho_flag == LH_OVERFLOW_PAGE ||
+ pageopaque->hasho_flag == (LH_OVERFLOW_PAGE | LH_PAGE_HAS_DEAD_TUPLES));
Assert(pageopaque->hasho_bucket == bucket);
}
@@ -300,3 +326,93 @@ _hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
RelationGetRelationName(rel));
}
}
+
+/*
+ * _hash_vacuum_one_page - vacuum just one index page.
+ * Try to remove LP_DEAD items from the given page. We
+ * must acquire cleanup lock on the page being modified
+ * before calling this function.
+ */
+
+static void
+_hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf,
+ RelFileNode hnode)
+{
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+ OffsetNumber offnum,
+ maxoff;
+ Page page = BufferGetPage(buf);
+ HashPageOpaque pageopaque;
+ HashMetaPage metap;
+ double tuples_removed = 0;
+
+ /* Scan each tuple in page to see if it is marked as LP_DEAD */
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (offnum = FirstOffsetNumber;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemId = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemId))
+ {
+ deletable[ndeletable++] = offnum;
+ tuples_removed += 1;
+ }
+ }
+
+ if (ndeletable > 0)
+ {
+ /*
+ * Write-lock the meta page so that we can decrement
+ * tuple count.
+ */
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+
+ /* No ereport(ERROR) until changes are logged */
+ START_CRIT_SECTION();
+
+ PageIndexMultiDelete(page, deletable, ndeletable);
+
+ pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
+
+ metap = HashPageGetMeta(BufferGetPage(metabuf));
+ metap->hashm_ntuples -= tuples_removed;
+
+ MarkBufferDirty(buf);
+ MarkBufferDirty(metabuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ xl_hash_vacuum xlrec;
+ XLogRecPtr recptr;
+
+ xlrec.hnode = hnode;
+ xlrec.ntuples = tuples_removed;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfHashVacuum);
+
+ XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+ XLogRegisterBufData(0, (char *) deletable,
+ ndeletable * sizeof(OffsetNumber));
+
+ XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);
+
+ recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_VACUUM_ONE_PAGE);
+
+ PageSetLSN(BufferGetPage(buf), recptr);
+ PageSetLSN(BufferGetPage(metabuf), recptr);
+ }
+
+ END_CRIT_SECTION();
+ /*
+ * Releasing write lock on meta page as we have updated
+ * the tuple count.
+ */
+ LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+ }
+}
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index d733770..2d92049 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -465,6 +465,10 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
break; /* yes, so exit for-loop */
}
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ _hash_kill_items(scan);
+
/*
* ran off the end of this page, try the next
*/
@@ -518,6 +522,10 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
break; /* yes, so exit for-loop */
}
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ _hash_kill_items(scan);
+
/*
* ran off the end of this page, try the next
*/
diff --git a/src/backend/access/hash/hashsort.c b/src/backend/access/hash/hashsort.c
index ea8f109..0e0f393 100644
--- a/src/backend/access/hash/hashsort.c
+++ b/src/backend/access/hash/hashsort.c
@@ -101,7 +101,7 @@ _h_spool(HSpool *hspool, ItemPointer self, Datum *values, bool *isnull)
* create an entire index.
*/
void
-_h_indexbuild(HSpool *hspool)
+_h_indexbuild(HSpool *hspool, Relation heapRel)
{
IndexTuple itup;
#ifdef USE_ASSERT_CHECKING
@@ -126,6 +126,6 @@ _h_indexbuild(HSpool *hspool)
Assert(hashkey >= lasthashkey);
#endif
- _hash_doinsert(hspool->index, itup);
+ _hash_doinsert(hspool->index, itup, heapRel);
}
}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index c705531..2e99719 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -19,6 +19,7 @@
#include "access/relscan.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"
+#include "storage/buf_internals.h"
#define CALC_NEW_BUCKET(old_bucket, lowmask) \
old_bucket | (lowmask + 1)
@@ -446,3 +447,70 @@ _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
return new_bucket;
}
+
+/*
+ * _hash_kill_items - set LP_DEAD state for items an indexscan caller has
+ * told us were killed.
+ *
+ * scan->opaque, referenced locally through so, contains information about the
+ * current page and killed tuples thereon (generally, this should only be
+ * called if so->numKilled > 0).
+ *
+ * We match items by heap TID before assuming they are the right ones to
+ * delete.
+ */
+void
+_hash_kill_items(IndexScanDesc scan)
+{
+ HashScanOpaque so = (HashScanOpaque) scan->opaque;
+ Page page;
+ HashPageOpaque opaque;
+ OffsetNumber offnum, maxoff;
+ int numKilled = so->numKilled;
+ int i;
+ bool killedsomething = false;
+
+ Assert(so->numKilled > 0);
+ Assert(so->killedItems != NULL);
+
+ /*
+ * Always reset the scan state, so we don't look for same
+ * items on other pages.
+ */
+ so->numKilled = 0;
+
+ page = BufferGetPage(so->hashso_curbuf);
+ opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ for (i = 0; i < numKilled; i++)
+ {
+ offnum = so->killedItems[i].indexOffset;
+
+ while (offnum <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
+
+ if (ItemPointerEquals(&ituple->t_tid, &so->killedItems[i].heapTid))
+ {
+ /* found the item */
+ ItemIdMarkDead(iid);
+ killedsomething = true;
+ break; /* out of inner search loop */
+ }
+ offnum = OffsetNumberNext(offnum);
+ }
+ }
+
+ /*
+ * Since this can be redone later if needed, mark as dirty hint.
+ * Whenever we mark anything LP_DEAD, we also set the page's
+ * LH_PAGE_HAS_DEAD_TUPLES flag, which is likewise just a hint.
+ */
+ if (killedsomething)
+ {
+ opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
+ MarkBufferDirtyHint(so->hashso_curbuf, true);
+ }
+}
diff --git a/src/backend/access/rmgrdesc/hashdesc.c b/src/backend/access/rmgrdesc/hashdesc.c
index f1cc9ff..5bd5c8d 100644
--- a/src/backend/access/rmgrdesc/hashdesc.c
+++ b/src/backend/access/rmgrdesc/hashdesc.c
@@ -154,6 +154,8 @@ hash_identify(uint8 info)
case XLOG_HASH_UPDATE_META_PAGE:
id = "UPDATE_META_PAGE";
break;
+ case XLOG_HASH_VACUUM_ONE_PAGE:
+ id = "VACUUM_ONE_PAGE";
}
return id;
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index bfdfed8..eb1df57 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -57,6 +57,7 @@ typedef uint32 Bucket;
#define LH_BUCKET_BEING_POPULATED (1 << 4)
#define LH_BUCKET_BEING_SPLIT (1 << 5)
#define LH_BUCKET_NEEDS_SPLIT_CLEANUP (1 << 6)
+#define LH_PAGE_HAS_DEAD_TUPLES (1 << 7)
#define LH_PAGE_TYPE \
(LH_OVERFLOW_PAGE|LH_BUCKET_PAGE|LH_BITMAP_PAGE|LH_META_PAGE)
@@ -86,6 +87,7 @@ typedef HashPageOpaqueData *HashPageOpaque;
#define H_NEEDS_SPLIT_CLEANUP(opaque) ((opaque)->hasho_flag & LH_BUCKET_NEEDS_SPLIT_CLEANUP)
#define H_BUCKET_BEING_SPLIT(opaque) ((opaque)->hasho_flag & LH_BUCKET_BEING_SPLIT)
#define H_BUCKET_BEING_POPULATED(opaque) ((opaque)->hasho_flag & LH_BUCKET_BEING_POPULATED)
+#define H_HAS_DEAD_TUPLES(opaque) ((opaque)->hasho_flag & LH_PAGE_HAS_DEAD_TUPLES)
/*
* The page ID is for the convenience of pg_filedump and similar utilities,
@@ -95,6 +97,13 @@ typedef HashPageOpaqueData *HashPageOpaque;
*/
#define HASHO_PAGE_ID 0xFF80
+typedef struct HashScanPosItem /* what we remember about each match */
+{
+ ItemPointerData heapTid; /* TID of referenced heap item */
+ OffsetNumber indexOffset; /* index item's location within page */
+} HashScanPosItem;
+
+
/*
* HashScanOpaqueData is private state for a hash index scan.
*/
@@ -135,6 +144,9 @@ typedef struct HashScanOpaqueData
* referred only when hashso_buc_populated is true.
*/
bool hashso_buc_split;
+ /* info about killed items if any (killedItems is NULL if never used) */
+ HashScanPosItem *killedItems; /* tids and offset numbers of killed items */
+ int numKilled; /* number of currently stored items */
} HashScanOpaqueData;
typedef HashScanOpaqueData *HashScanOpaque;
@@ -300,7 +312,7 @@ extern Datum hash_uint32(uint32 k);
/* private routines */
/* hashinsert.c */
-extern void _hash_doinsert(Relation rel, IndexTuple itup);
+extern void _hash_doinsert(Relation rel, IndexTuple itup, Relation heapRel);
extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
Size itemsize, IndexTuple itup);
extern void _hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
@@ -361,7 +373,7 @@ extern HSpool *_h_spoolinit(Relation heap, Relation index, uint32 num_buckets);
extern void _h_spooldestroy(HSpool *hspool);
extern void _h_spool(HSpool *hspool, ItemPointer self,
Datum *values, bool *isnull);
-extern void _h_indexbuild(HSpool *hspool);
+extern void _h_indexbuild(HSpool *hspool, Relation heapRel);
/* hashutil.c */
extern bool _hash_checkqual(IndexScanDesc scan, IndexTuple itup);
@@ -381,6 +393,7 @@ extern BlockNumber _hash_get_oldblock_from_newbucket(Relation rel, Bucket new_bu
extern BlockNumber _hash_get_newblock_from_oldbucket(Relation rel, Bucket old_bucket);
extern Bucket _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
uint32 lowmask, uint32 maxbucket);
+extern void _hash_kill_items(IndexScanDesc scan);
/* hash.c */
extern void hashbucketcleanup(Relation rel, Bucket cur_bucket,
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index 552d642..4e505cf 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -44,6 +44,7 @@
#define XLOG_HASH_UPDATE_META_PAGE 0xB0 /* update meta page after
* vacuum */
+#define XLOG_HASH_VACUUM_ONE_PAGE 0xC0 /* remove dead tuples from index page */
/*
* xl_hash_split_allocate_page flag values, 8 bits are available.
@@ -250,6 +251,25 @@ typedef struct xl_hash_init_bitmap_page
#define SizeOfHashInitBitmapPage \
(offsetof(xl_hash_init_bitmap_page, bmsize) + sizeof(uint16))
+/*
+ * This is what we need for index tuple deletion and to
+ * update the meta page.
+ *
+ * This data record is used for XLOG_HASH_VACUUM_ONE_PAGE
+ *
+ * Backup Blk 0/1: bucket page
+ * Backup Blk 2: meta page
+ */
+typedef struct xl_hash_vacuum
+{
+ RelFileNode hnode;
+ double ntuples;
+ bool is_primary_bucket_page;
+} xl_hash_vacuum;
+
+#define SizeOfHashVacuum \
+ (offsetof(xl_hash_vacuum, is_primary_bucket_page) + sizeof(bool))
+
extern void hash_redo(XLogReaderState *record);
extern void hash_desc(StringInfo buf, XLogReaderState *record);
extern const char *hash_identify(uint8 info);
--
1.8.3.1
On Wed, Mar 15, 2017 at 2:10 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Okay, I have done the changes as suggested by you. Please refer to the
attached v8 patch. Thanks.
Cool, but this doesn't look right:
+ action = XLogReadBufferForRedo(record, 0, &buffer);
+
+ if (!IsBufferCleanupOK(buffer))
+ elog(PANIC, "hash_xlog_vacuum_one_page: failed to acquire
cleanup lock");
That could fail, I think, because of a pin from a Hot Standby backend.
You want to call XLogReadBufferForRedoExtended() with a third argument
of true. Come to think of it, shouldn't hash_xlog_split_allocate_page
be changed the same way?
+ /*
+ * Let us mark the page as clean if vacuum removes the DEAD tuples
+ * from an index page. We do this by clearing
LH_PAGE_HAS_DEAD_TUPLES
+ * flag.
+ */
Maybe add: Clearing this flag is just a hint; replay won't redo this.
+ * Hash Index records that are marked as LP_DEAD and being removed during
+ * hash index tuple insertion can conflict with standby queries.You might
The word Index shouldn't be capitalized here. There should be a space
before "You".
The formatting of this comment is oddly narrow:
+ * _hash_vacuum_one_page - vacuum just one index page.
+ * Try to remove LP_DEAD items from the given page. We
+ * must acquire cleanup lock on the page being modified
+ * before calling this function.
I'd add a blank line after the first line and reflow the rest to be
something more like 75 characters. pgindent evidently doesn't think
this needs reformatting, but it's oddly narrow.
I suggest changing the header comment of
hash_xlog_vacuum_get_latestRemovedXid like this:
+ * Get the latestRemovedXid from the heap pages pointed at by the index
+ * tuples being deleted. See also btree_xlog_delete_get_latestRemovedXid,
+ * on which this function is based.
This is looking good.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
+ action = XLogReadBufferForRedo(record, 0, &buffer);
+
+ if (!IsBufferCleanupOK(buffer))
+ elog(PANIC, "hash_xlog_vacuum_one_page: failed to acquire cleanup lock");

That could fail, I think, because of a pin from a Hot Standby backend.
You want to call XLogReadBufferForRedoExtended() with a third argument
of true.
Yes, there is a possibility that a new backend may start on the standby
after we kill the conflicting backends. If that new backend holds a pin
on the buffer the startup process is trying to read, then
'IsBufferCleanupOK' will fail, thereby causing the startup process to
PANIC.
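With that in mind, the v9 patch simply requests a cleanup lock while
reading the block during redo, along these lines (sketch; the exact
change is in the attached patch):

    /*
     * Ask for a cleanup lock, so replay waits for any standby backend
     * holding a pin instead of PANICking in IsBufferCleanupOK().
     */
    action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true,
                                           &buffer);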
Come to think of it, shouldn't hash_xlog_split_allocate_page
be changed the same way?
No. The reason is that we are allocating a new bucket page on the
standby, so no backend can already hold a pin on a page that is yet to
be initialised.
+ /*
+ * Let us mark the page as clean if vacuum removes the DEAD tuples
+ * from an index page. We do this by clearing LH_PAGE_HAS_DEAD_TUPLES
+ * flag.
+ */

Maybe add: Clearing this flag is just a hint; replay won't redo this.
Added. Please check the attached v9 patch.
+ * Hash Index records that are marked as LP_DEAD and being removed during
+ * hash index tuple insertion can conflict with standby queries.You might

The word Index shouldn't be capitalized here. There should be a space
before "You".
Corrected.
The formatting of this comment is oddly narrow:
+ * _hash_vacuum_one_page - vacuum just one index page.
+ * Try to remove LP_DEAD items from the given page. We
+ * must acquire cleanup lock on the page being modified
+ * before calling this function.

I'd add a blank line after the first line and reflow the rest to be
something more like 75 characters. pgindent evidently doesn't think
this needs reformatting, but it's oddly narrow.
Corrected.
I suggest changing the header comment of
hash_xlog_vacuum_get_latestRemovedXid like this:

+ * Get the latestRemovedXid from the heap pages pointed at by the index
+ * tuples being deleted. See also btree_xlog_delete_get_latestRemovedXid,
+ * on which this function is based.

This is looking good.
Changed as per suggestions. Attached v9 patch. Thanks.
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
Attachments:
microvacuum_hash_index_v9.patch (application/x-download)
From c19b4f97afc4acc22dfb07aea7011097b85ef4f3 Mon Sep 17 00:00:00 2001
From: ashu <ashu@localhost.localdomain>
Date: Thu, 16 Mar 2017 00:54:38 +0530
Subject: [PATCH] microvacuum_hash_index_v9.patch
---
src/backend/access/hash/README | 5 +-
src/backend/access/hash/hash.c | 53 ++++++--
src/backend/access/hash/hash_xlog.c | 233 +++++++++++++++++++++++++++++++++
src/backend/access/hash/hashinsert.c | 122 ++++++++++++++++-
src/backend/access/hash/hashsearch.c | 8 ++
src/backend/access/hash/hashsort.c | 4 +-
src/backend/access/hash/hashutil.c | 68 ++++++++++
src/backend/access/rmgrdesc/hashdesc.c | 2 +
src/include/access/hash.h | 17 ++-
src/include/access/hash_xlog.h | 20 +++
10 files changed, 515 insertions(+), 17 deletions(-)
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 53b0e0d..1541438 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -284,7 +284,10 @@ The insertion algorithm is rather similar:
if we get the lock on both the buckets
finish the split using algorithm mentioned below for split
release the pin on old bucket and restart the insert from beginning.
- if current page is full, release lock but not pin, read/exclusive-lock
+ if current page is full, first check if this page contains any dead tuples.
+ if yes, remove dead tuples from the current page and again check for the
+ availability of the space. If enough space found, insert the tuple else
+ release lock but not pin, read/exclusive-lock
next page; repeat as needed
>> see below if no space in any page of bucket
take buffer content lock in exclusive mode on metapage
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 6416769..a293683 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -36,6 +36,7 @@ typedef struct
{
HSpool *spool; /* NULL if not using spooling */
double indtuples; /* # tuples accepted into index */
+ Relation heapRel; /* heap relation descriptor */
} HashBuildState;
static void hashbuildCallback(Relation index,
@@ -154,6 +155,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
/* prepare to build the index */
buildstate.indtuples = 0;
+ buildstate.heapRel = heap;
/* do the heap scan */
reltuples = IndexBuildHeapScan(heap, index, indexInfo, true,
@@ -162,7 +164,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
if (buildstate.spool)
{
/* sort the tuples and insert them into the index */
- _h_indexbuild(buildstate.spool);
+ _h_indexbuild(buildstate.spool, buildstate.heapRel);
_h_spooldestroy(buildstate.spool);
}
@@ -218,7 +220,7 @@ hashbuildCallback(Relation index,
itup = index_form_tuple(RelationGetDescr(index),
index_values, index_isnull);
itup->t_tid = htup->t_self;
- _hash_doinsert(index, itup);
+ _hash_doinsert(index, itup, buildstate->heapRel);
pfree(itup);
}
@@ -251,7 +253,7 @@ hashinsert(Relation rel, Datum *values, bool *isnull,
itup = index_form_tuple(RelationGetDescr(rel), index_values, index_isnull);
itup->t_tid = *ht_ctid;
- _hash_doinsert(rel, itup);
+ _hash_doinsert(rel, itup, heapRel);
pfree(itup);
@@ -331,14 +333,24 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
if (scan->kill_prior_tuple)
{
/*
- * Yes, so mark it by setting the LP_DEAD state in the item flags.
+ * Yes, so remember it for later. (We'll deal with all such
+ * tuples at once right after leaving the index page or at
+ * end of scan.) In case if caller reverses the indexscan
+ * direction it is quite possible that the same item might
+ * get entered multiple times. But, we don't detect that
+ * instead we just forget any excess entries.
*/
- ItemIdMarkDead(PageGetItemId(page, offnum));
+ if (so->killedItems == NULL)
+ so->killedItems = palloc(MaxIndexTuplesPerPage *
+ sizeof(HashScanPosItem));
- /*
- * Since this can be redone later if needed, mark as a hint.
- */
- MarkBufferDirtyHint(buf, true);
+ if (so->numKilled < MaxIndexTuplesPerPage)
+ {
+ so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
+ so->killedItems[so->numKilled].indexOffset =
+ ItemPointerGetOffsetNumber(&(so->hashso_curpos));
+ so->numKilled++;
+ }
}
/*
@@ -446,6 +458,9 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
so->hashso_buc_populated = false;
so->hashso_buc_split = false;
+ so->killedItems = NULL;
+ so->numKilled = 0;
+
scan->opaque = so;
return scan;
@@ -461,6 +476,10 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
HashScanOpaque so = (HashScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ _hash_kill_items(scan);
+
_hash_dropscanbuf(rel, so);
/* set position invalid (this will cause _hash_first call) */
@@ -488,8 +507,14 @@ hashendscan(IndexScanDesc scan)
HashScanOpaque so = (HashScanOpaque) scan->opaque;
Relation rel = scan->indexRelation;
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ _hash_kill_items(scan);
+
_hash_dropscanbuf(rel, so);
+ if (so->killedItems != NULL)
+ pfree(so->killedItems);
pfree(so);
scan->opaque = NULL;
}
@@ -848,6 +873,16 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
PageIndexMultiDelete(page, deletable, ndeletable);
bucket_dirty = true;
+
+ /*
+ * Let us mark the page as clean if vacuum removes the DEAD tuples
+ * from an index page. We do this by clearing LH_PAGE_HAS_DEAD_TUPLES
+ * flag. Clearing this flag is just a hint; replay won't redo this.
+ */
+ if (tuples_removed && *tuples_removed > 0 &&
+ opaque->hasho_flag & LH_PAGE_HAS_DEAD_TUPLES)
+ opaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
+
MarkBufferDirty(buf);
/* XLOG stuff */
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index 0c830ab..9ca53f7 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -14,10 +14,15 @@
*/
#include "postgres.h"
+#include "access/heapam_xlog.h"
#include "access/bufmask.h"
#include "access/hash.h"
#include "access/hash_xlog.h"
#include "access/xlogutils.h"
+#include "access/xlog.h"
+#include "access/transam.h"
+#include "storage/procarray.h"
+#include "miscadmin.h"
/*
* replay a hash index meta page
@@ -915,6 +920,231 @@ hash_xlog_update_meta_page(XLogReaderState *record)
UnlockReleaseBuffer(metabuf);
}
+/*
+ * Get the latestRemovedXid from the heap pages pointed at by the index
+ * tuples being deleted. See also btree_xlog_delete_get_latestRemovedXid,
+ * on which this function is based.
+ */
+static TransactionId
+hash_xlog_vacuum_get_latestRemovedXid(XLogReaderState *record)
+{
+ xl_hash_vacuum *xlrec = (xl_hash_vacuum *) XLogRecGetData(record);
+ OffsetNumber *unused;
+ Buffer ibuffer,
+ hbuffer;
+ Page ipage,
+ hpage;
+ RelFileNode rnode;
+ BlockNumber blkno;
+ ItemId iitemid,
+ hitemid;
+ IndexTuple itup;
+ HeapTupleHeader htuphdr;
+ BlockNumber hblkno;
+ OffsetNumber hoffnum;
+ TransactionId latestRemovedXid = InvalidTransactionId;
+ int i;
+ char *ptr;
+ Size len;
+
+ /*
+ * If there's nothing running on the standby we don't need to derive a
+ * full latestRemovedXid value, so use a fast path out of here. This
+ * returns InvalidTransactionId, and so will conflict with all HS
+ * transactions; but since we just worked out that that's zero people,
+ * it's OK.
+ *
+ * XXX There is a race condition here, which is that a new backend might
+ * start just after we look. If so, it cannot need to conflict, but this
+ * coding will result in throwing a conflict anyway.
+ */
+ if (CountDBBackends(InvalidOid) == 0)
+ return latestRemovedXid;
+
+ /*
+ * Get index page. If the DB is consistent, this should not fail, nor
+ * should any of the heap page fetches below. If one does, we return
+ * InvalidTransactionId to cancel all HS transactions. That's probably
+ * overkill, but it's safe, and certainly better than panicking here.
+ */
+ XLogRecGetBlockTag(record, 1, &rnode, NULL, &blkno);
+ ibuffer = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno, RBM_NORMAL);
+
+ if (!BufferIsValid(ibuffer))
+ return InvalidTransactionId;
+ LockBuffer(ibuffer, HASH_READ);
+ ipage = (Page) BufferGetPage(ibuffer);
+
+ /*
+ * Loop through the deleted index items to obtain the TransactionId from
+ * the heap items they point to.
+ */
+ ptr = XLogRecGetBlockData(record, 1, &len);
+
+ unused = (OffsetNumber *) ptr;
+
+ for (i = 0; i < xlrec->ntuples; i++)
+ {
+ /*
+ * Identify the index tuple about to be deleted.
+ */
+ iitemid = PageGetItemId(ipage, unused[i]);
+ itup = (IndexTuple) PageGetItem(ipage, iitemid);
+
+ /*
+ * Locate the heap page that the index tuple points at
+ */
+ hblkno = ItemPointerGetBlockNumber(&(itup->t_tid));
+ hbuffer = XLogReadBufferExtended(xlrec->hnode, MAIN_FORKNUM,
+ hblkno, RBM_NORMAL);
+
+ if (!BufferIsValid(hbuffer))
+ {
+ UnlockReleaseBuffer(ibuffer);
+ return InvalidTransactionId;
+ }
+ LockBuffer(hbuffer, HASH_READ);
+ hpage = (Page) BufferGetPage(hbuffer);
+
+ /*
+ * Look up the heap tuple header that the index tuple points at by
+ * using the heap node supplied with the xlrec. We can't use
+ * heap_fetch, since it uses ReadBuffer rather than XLogReadBuffer.
+ * Note that we are not looking at tuple data here, just headers.
+ */
+ hoffnum = ItemPointerGetOffsetNumber(&(itup->t_tid));
+ hitemid = PageGetItemId(hpage, hoffnum);
+
+ /*
+ * Follow any redirections until we find something useful.
+ */
+ while (ItemIdIsRedirected(hitemid))
+ {
+ hoffnum = ItemIdGetRedirect(hitemid);
+ hitemid = PageGetItemId(hpage, hoffnum);
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ /*
+ * If the heap item has storage, then read the header and use that to
+ * set latestRemovedXid.
+ *
+ * Some LP_DEAD items may not be accessible, so we ignore them.
+ */
+ if (ItemIdHasStorage(hitemid))
+ {
+ htuphdr = (HeapTupleHeader) PageGetItem(hpage, hitemid);
+ HeapTupleHeaderAdvanceLatestRemovedXid(htuphdr, &latestRemovedXid);
+ }
+ else if (ItemIdIsDead(hitemid))
+ {
+ /*
+ * Conjecture: if hitemid is dead then it had xids before the xids
+ * marked on LP_NORMAL items. So we just ignore this item and move
+ * onto the next, for the purposes of calculating
+ * latestRemovedxids.
+ */
+ }
+ else
+ Assert(!ItemIdIsUsed(hitemid));
+
+ UnlockReleaseBuffer(hbuffer);
+ }
+
+ UnlockReleaseBuffer(ibuffer);
+
+ /*
+ * If all heap tuples were LP_DEAD then we will be returning
+ * InvalidTransactionId here, which avoids conflicts. This matches
+ * existing logic which assumes that LP_DEAD tuples must already be older
+ * than the latestRemovedXid on the cleanup record that set them as
+ * LP_DEAD, hence must already have generated a conflict.
+ */
+ return latestRemovedXid;
+}
+
+/*
+ * replay delete operation in hash index to remove
+ * tuples marked as DEAD during index tuple insertion.
+ */
+static void
+hash_xlog_vacuum_one_page(XLogReaderState *record)
+{
+ XLogRecPtr lsn = record->EndRecPtr;
+ xl_hash_vacuum *xldata = (xl_hash_vacuum *) XLogRecGetData(record);
+ Buffer buffer;
+ Buffer metabuf;
+ Page page;
+ XLogRedoAction action;
+
+ /*
+ * If we have any conflict processing to do, it must happen before we
+ * update the page.
+ *
+ * Hash index records that are marked as LP_DEAD and being removed during
+ * hash index tuple insertion can conflict with standby queries. You might
+ * think that vacuum records would conflict as well, but we've handled
+ * that already. XLOG_HEAP2_CLEANUP_INFO records provide the highest xid
+ * cleaned by the vacuum of the heap and so we can resolve any conflicts
+ * just once when that arrives. After that we know that no conflicts
+ * exist from individual hash index vacuum records on that index.
+ */
+ if (InHotStandby)
+ {
+ TransactionId latestRemovedXid =
+ hash_xlog_vacuum_get_latestRemovedXid(record);
+ RelFileNode rnode;
+
+ XLogRecGetBlockTag(record, 0, &rnode, NULL, NULL);
+ ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rnode);
+ }
+
+ action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &buffer);
+
+ if (action == BLK_NEEDS_REDO)
+ {
+ char *ptr;
+ Size len;
+
+ ptr = XLogRecGetBlockData(record, 0, &len);
+
+ page = (Page) BufferGetPage(buffer);
+
+ if (len > 0)
+ {
+ OffsetNumber *unused;
+ OffsetNumber *unend;
+
+ unused = (OffsetNumber *) ptr;
+ unend = (OffsetNumber *) ((char *) ptr + len);
+
+ if ((unend - unused) > 0)
+ PageIndexMultiDelete(page, unused, unend - unused);
+ }
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ }
+ if (BufferIsValid(buffer))
+ UnlockReleaseBuffer(buffer);
+
+ if (XLogReadBufferForRedo(record, 1, &metabuf) == BLK_NEEDS_REDO)
+ {
+ Page metapage;
+ HashMetaPage metap;
+
+ metapage = BufferGetPage(metabuf);
+ metap = HashPageGetMeta(metapage);
+
+ metap->hashm_ntuples -= xldata->ntuples;
+
+ PageSetLSN(metapage, lsn);
+ MarkBufferDirty(metabuf);
+ }
+ if (BufferIsValid(metabuf))
+ UnlockReleaseBuffer(metabuf);
+}
+
void
hash_redo(XLogReaderState *record)
{
@@ -958,6 +1188,9 @@ hash_redo(XLogReaderState *record)
case XLOG_HASH_UPDATE_META_PAGE:
hash_xlog_update_meta_page(record);
break;
+ case XLOG_HASH_VACUUM_ONE_PAGE:
+ hash_xlog_vacuum_one_page(record);
+ break;
default:
elog(PANIC, "hash_redo: unknown op code %u", info);
}
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 241728f..b8969d9 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -17,9 +17,14 @@
#include "access/hash.h"
#include "access/hash_xlog.h"
+#include "access/heapam.h"
#include "miscadmin.h"
#include "utils/rel.h"
+#include "storage/lwlock.h"
+#include "storage/buf_internals.h"
+static void _hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf,
+ RelFileNode hnode);
/*
* _hash_doinsert() -- Handle insertion of a single index tuple.
@@ -28,7 +33,7 @@
* and hashinsert. By here, itup is completely filled in.
*/
void
-_hash_doinsert(Relation rel, IndexTuple itup)
+_hash_doinsert(Relation rel, IndexTuple itup, Relation heapRel)
{
Buffer buf = InvalidBuffer;
Buffer bucket_buf;
@@ -118,10 +123,30 @@ restart_insert:
/* Do the insertion */
while (PageGetFreeSpace(page) < itemsz)
{
+ BlockNumber nextblkno;
+
+ /*
+ * Check if current page has any DEAD tuples. If yes,
+ * delete these tuples and see if we can get a space for
+ * the new item to be inserted before moving to the next
+ * page in the bucket chain.
+ */
+ if (H_HAS_DEAD_TUPLES(pageopaque))
+ {
+
+ if (IsBufferCleanupOK(buf))
+ {
+ _hash_vacuum_one_page(rel, metabuf, buf, heapRel->rd_node);
+
+ if (PageGetFreeSpace(page) >= itemsz)
+ break; /* OK, now we have enough space */
+ }
+ }
+
/*
* no space on this page; check for an overflow page
*/
- BlockNumber nextblkno = pageopaque->hasho_nextblkno;
+ nextblkno = pageopaque->hasho_nextblkno;
if (BlockNumberIsValid(nextblkno))
{
@@ -157,7 +182,8 @@ restart_insert:
Assert(PageGetFreeSpace(page) >= itemsz);
}
pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
- Assert(pageopaque->hasho_flag == LH_OVERFLOW_PAGE);
+ Assert(pageopaque->hasho_flag == LH_OVERFLOW_PAGE ||
+ pageopaque->hasho_flag == (LH_OVERFLOW_PAGE | LH_PAGE_HAS_DEAD_TUPLES));
Assert(pageopaque->hasho_bucket == bucket);
}
@@ -300,3 +326,93 @@ _hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
RelationGetRelationName(rel));
}
}
+
+/*
+ * _hash_vacuum_one_page - vacuum just one index page.
+ *
+ * Try to remove LP_DEAD items from the given page. We must acquire cleanup
+ * lock on the page being modified before calling this function.
+ */
+
+static void
+_hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf,
+ RelFileNode hnode)
+{
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+ OffsetNumber offnum,
+ maxoff;
+ Page page = BufferGetPage(buf);
+ HashPageOpaque pageopaque;
+ HashMetaPage metap;
+ double tuples_removed = 0;
+
+ /* Scan each tuple in page to see if it is marked as LP_DEAD */
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (offnum = FirstOffsetNumber;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemId = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemId))
+ {
+ deletable[ndeletable++] = offnum;
+ tuples_removed += 1;
+ }
+ }
+
+ if (ndeletable > 0)
+ {
+ /*
+ * Write-lock the meta page so that we can decrement
+ * tuple count.
+ */
+ LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+
+ /* No ereport(ERROR) until changes are logged */
+ START_CRIT_SECTION();
+
+ PageIndexMultiDelete(page, deletable, ndeletable);
+
+ pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
+
+ metap = HashPageGetMeta(BufferGetPage(metabuf));
+ metap->hashm_ntuples -= tuples_removed;
+
+ MarkBufferDirty(buf);
+ MarkBufferDirty(metabuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ xl_hash_vacuum xlrec;
+ XLogRecPtr recptr;
+
+ xlrec.hnode = hnode;
+ xlrec.ntuples = tuples_removed;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfHashVacuum);
+
+ XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+ XLogRegisterBufData(0, (char *) deletable,
+ ndeletable * sizeof(OffsetNumber));
+
+ XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);
+
+ recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_VACUUM_ONE_PAGE);
+
+ PageSetLSN(BufferGetPage(buf), recptr);
+ PageSetLSN(BufferGetPage(metabuf), recptr);
+ }
+
+ END_CRIT_SECTION();
+ /*
+ * Releasing write lock on meta page as we have updated
+ * the tuple count.
+ */
+ LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+ }
+}
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index d733770..2d92049 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -465,6 +465,10 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
break; /* yes, so exit for-loop */
}
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ _hash_kill_items(scan);
+
/*
* ran off the end of this page, try the next
*/
@@ -518,6 +522,10 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
break; /* yes, so exit for-loop */
}
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ _hash_kill_items(scan);
+
/*
* ran off the end of this page, try the next
*/
diff --git a/src/backend/access/hash/hashsort.c b/src/backend/access/hash/hashsort.c
index ea8f109..0e0f393 100644
--- a/src/backend/access/hash/hashsort.c
+++ b/src/backend/access/hash/hashsort.c
@@ -101,7 +101,7 @@ _h_spool(HSpool *hspool, ItemPointer self, Datum *values, bool *isnull)
* create an entire index.
*/
void
-_h_indexbuild(HSpool *hspool)
+_h_indexbuild(HSpool *hspool, Relation heapRel)
{
IndexTuple itup;
#ifdef USE_ASSERT_CHECKING
@@ -126,6 +126,6 @@ _h_indexbuild(HSpool *hspool)
Assert(hashkey >= lasthashkey);
#endif
- _hash_doinsert(hspool->index, itup);
+ _hash_doinsert(hspool->index, itup, heapRel);
}
}
diff --git a/src/backend/access/hash/hashutil.c b/src/backend/access/hash/hashutil.c
index c705531..2e99719 100644
--- a/src/backend/access/hash/hashutil.c
+++ b/src/backend/access/hash/hashutil.c
@@ -19,6 +19,7 @@
#include "access/relscan.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"
+#include "storage/buf_internals.h"
#define CALC_NEW_BUCKET(old_bucket, lowmask) \
old_bucket | (lowmask + 1)
@@ -446,3 +447,70 @@ _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
return new_bucket;
}
+
+/*
+ * _hash_kill_items - set LP_DEAD state for items an indexscan caller has
+ * told us were killed.
+ *
+ * scan->opaque, referenced locally through so, contains information about the
+ * current page and killed tuples thereon (generally, this should only be
+ * called if so->numKilled > 0).
+ *
+ * We match items by heap TID before assuming they are the right ones to
+ * delete.
+ */
+void
+_hash_kill_items(IndexScanDesc scan)
+{
+ HashScanOpaque so = (HashScanOpaque) scan->opaque;
+ Page page;
+ HashPageOpaque opaque;
+ OffsetNumber offnum, maxoff;
+ int numKilled = so->numKilled;
+ int i;
+ bool killedsomething = false;
+
+ Assert(so->numKilled > 0);
+ Assert(so->killedItems != NULL);
+
+ /*
+ * Always reset the scan state, so we don't look for same
+ * items on other pages.
+ */
+ so->numKilled = 0;
+
+ page = BufferGetPage(so->hashso_curbuf);
+ opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ for (i = 0; i < numKilled; i++)
+ {
+ offnum = so->killedItems[i].indexOffset;
+
+ while (offnum <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
+
+ if (ItemPointerEquals(&ituple->t_tid, &so->killedItems[i].heapTid))
+ {
+ /* found the item */
+ ItemIdMarkDead(iid);
+ killedsomething = true;
+ break; /* out of inner search loop */
+ }
+ offnum = OffsetNumberNext(offnum);
+ }
+ }
+
+ /*
+ * Since this can be redone later if needed, mark as dirty hint.
+ * Whenever we mark anything LP_DEAD, we also set the page's
+ * LH_PAGE_HAS_DEAD_TUPLES flag, which is likewise just a hint.
+ */
+ if (killedsomething)
+ {
+ opaque->hasho_flag |= LH_PAGE_HAS_DEAD_TUPLES;
+ MarkBufferDirtyHint(so->hashso_curbuf, true);
+ }
+}
diff --git a/src/backend/access/rmgrdesc/hashdesc.c b/src/backend/access/rmgrdesc/hashdesc.c
index f1cc9ff..5bd5c8d 100644
--- a/src/backend/access/rmgrdesc/hashdesc.c
+++ b/src/backend/access/rmgrdesc/hashdesc.c
@@ -154,6 +154,8 @@ hash_identify(uint8 info)
case XLOG_HASH_UPDATE_META_PAGE:
id = "UPDATE_META_PAGE";
break;
+ case XLOG_HASH_VACUUM_ONE_PAGE:
+ id = "VACUUM_ONE_PAGE";
}
return id;
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index bfdfed8..eb1df57 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -57,6 +57,7 @@ typedef uint32 Bucket;
#define LH_BUCKET_BEING_POPULATED (1 << 4)
#define LH_BUCKET_BEING_SPLIT (1 << 5)
#define LH_BUCKET_NEEDS_SPLIT_CLEANUP (1 << 6)
+#define LH_PAGE_HAS_DEAD_TUPLES (1 << 7)
#define LH_PAGE_TYPE \
(LH_OVERFLOW_PAGE|LH_BUCKET_PAGE|LH_BITMAP_PAGE|LH_META_PAGE)
@@ -86,6 +87,7 @@ typedef HashPageOpaqueData *HashPageOpaque;
#define H_NEEDS_SPLIT_CLEANUP(opaque) ((opaque)->hasho_flag & LH_BUCKET_NEEDS_SPLIT_CLEANUP)
#define H_BUCKET_BEING_SPLIT(opaque) ((opaque)->hasho_flag & LH_BUCKET_BEING_SPLIT)
#define H_BUCKET_BEING_POPULATED(opaque) ((opaque)->hasho_flag & LH_BUCKET_BEING_POPULATED)
+#define H_HAS_DEAD_TUPLES(opaque) ((opaque)->hasho_flag & LH_PAGE_HAS_DEAD_TUPLES)
/*
* The page ID is for the convenience of pg_filedump and similar utilities,
@@ -95,6 +97,13 @@ typedef HashPageOpaqueData *HashPageOpaque;
*/
#define HASHO_PAGE_ID 0xFF80
+typedef struct HashScanPosItem /* what we remember about each match */
+{
+ ItemPointerData heapTid; /* TID of referenced heap item */
+ OffsetNumber indexOffset; /* index item's location within page */
+} HashScanPosItem;
+
+
/*
* HashScanOpaqueData is private state for a hash index scan.
*/
@@ -135,6 +144,9 @@ typedef struct HashScanOpaqueData
* referred only when hashso_buc_populated is true.
*/
bool hashso_buc_split;
+ /* info about killed items if any (killedItems is NULL if never used) */
+ HashScanPosItem *killedItems; /* tids and offset numbers of killed items */
+ int numKilled; /* number of currently stored items */
} HashScanOpaqueData;
typedef HashScanOpaqueData *HashScanOpaque;
@@ -300,7 +312,7 @@ extern Datum hash_uint32(uint32 k);
/* private routines */
/* hashinsert.c */
-extern void _hash_doinsert(Relation rel, IndexTuple itup);
+extern void _hash_doinsert(Relation rel, IndexTuple itup, Relation heapRel);
extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
Size itemsize, IndexTuple itup);
extern void _hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
@@ -361,7 +373,7 @@ extern HSpool *_h_spoolinit(Relation heap, Relation index, uint32 num_buckets);
extern void _h_spooldestroy(HSpool *hspool);
extern void _h_spool(HSpool *hspool, ItemPointer self,
Datum *values, bool *isnull);
-extern void _h_indexbuild(HSpool *hspool);
+extern void _h_indexbuild(HSpool *hspool, Relation heapRel);
/* hashutil.c */
extern bool _hash_checkqual(IndexScanDesc scan, IndexTuple itup);
@@ -381,6 +393,7 @@ extern BlockNumber _hash_get_oldblock_from_newbucket(Relation rel, Bucket new_bu
extern BlockNumber _hash_get_newblock_from_oldbucket(Relation rel, Bucket old_bucket);
extern Bucket _hash_get_newbucket_from_oldbucket(Relation rel, Bucket old_bucket,
uint32 lowmask, uint32 maxbucket);
+extern void _hash_kill_items(IndexScanDesc scan);
/* hash.c */
extern void hashbucketcleanup(Relation rel, Bucket cur_bucket,
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index 552d642..4e505cf 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -44,6 +44,7 @@
#define XLOG_HASH_UPDATE_META_PAGE 0xB0 /* update meta page after
* vacuum */
+#define XLOG_HASH_VACUUM_ONE_PAGE 0xC0 /* remove dead tuples from index page */
/*
* xl_hash_split_allocate_page flag values, 8 bits are available.
@@ -250,6 +251,25 @@ typedef struct xl_hash_init_bitmap_page
#define SizeOfHashInitBitmapPage \
(offsetof(xl_hash_init_bitmap_page, bmsize) + sizeof(uint16))
+/*
+ * This is what we need for index tuple deletion and to
+ * update the meta page.
+ *
+ * This data record is used for XLOG_HASH_VACUUM_ONE_PAGE
+ *
+ * Backup Blk 0/1: bucket page
+ * Backup Blk 2: meta page
+ */
+typedef struct xl_hash_vacuum
+{
+ RelFileNode hnode;
+ double ntuples;
+ bool is_primary_bucket_page;
+} xl_hash_vacuum;
+
+#define SizeOfHashVacuum \
+ (offsetof(xl_hash_vacuum, is_primary_bucket_page) + sizeof(bool))
+
extern void hash_redo(XLogReaderState *record);
extern void hash_desc(StringInfo buf, XLogReaderState *record);
extern const char *hash_identify(uint8 info);
--
1.8.3.1
On Wed, Mar 15, 2017 at 3:54 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Changed as per suggestions. Attached v9 patch. Thanks.
Wow, when do you sleep? Will have a look.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Mar 15, 2017 at 4:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Mar 15, 2017 at 3:54 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Changed as per suggestions. Attached v9 patch. Thanks.
Wow, when do you sleep? Will have a look.
Committed with a few corrections.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2017-03-15 16:31:11 -0400, Robert Haas wrote:
On Wed, Mar 15, 2017 at 3:54 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Changed as per suggestions. Attached v9 patch. Thanks.
Wow, when do you sleep?
I think that applies to a bunch of people, including yourself ;)
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mar 16, 2017 7:49 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:
On Wed, Mar 15, 2017 at 4:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Mar 15, 2017 at 3:54 PM, Ashutosh Sharma <ashu.coek88@gmail.com>
wrote:
Changed as per suggestions. Attached v9 patch. Thanks.
Wow, when do you sleep? Will have a look.
Committed with a few corrections.
Thanks Robert for the commit. Thank you Amit and Jesper for reviewing this
patch.
With Regards,
Ashutosh Sharma
On Wed, Mar 15, 2017 at 9:23 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Few other comments:

1.
+ if (ndeletable > 0)
+ {
+ /* No ereport(ERROR) until changes are logged */
+ START_CRIT_SECTION();
+
+ PageIndexMultiDelete(page, deletable, ndeletable);
+
+ pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;

You are clearing this flag while logging the action, but the same is not
taken care of during replay. Any reasons?

That's because we conditionally WAL-log this flag status, and when we
do so, we take its FPI.
Sure, but we are not clearing it conditionally. I am not sure how,
after recovery, it will be cleared if it gets set during normal operation.
Moreover, btree already clears a similar flag during replay (refer to
btree_xlog_delete).
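(In other words, something along the lines of the sketch below would be
needed in hash_xlog_vacuum_one_page() as well; this is only an
illustration of the point, not tested code.)

    /* on the standby, clear the hint flag too when replaying the deletion */
    pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
    pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;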
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Mar 16, 2017 at 11:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Mar 15, 2017 at 9:23 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Few other comments:

1.
+ if (ndeletable > 0)
+ {
+ /* No ereport(ERROR) until changes are logged */
+ START_CRIT_SECTION();
+
+ PageIndexMultiDelete(page, deletable, ndeletable);
+
+ pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;

You are clearing this flag while logging the action, but the same is not
taken care of during replay. Any reasons?

That's because we conditionally WAL-log this flag status, and when we
do so, we take its FPI.

Sure, but we are not clearing it conditionally. I am not sure how,
after recovery, it will be cleared if it gets set during normal operation.
Moreover, btree already clears a similar flag during replay (refer to
btree_xlog_delete).
You were right. In case datachecksum is enabled or wal_log_hint is set
to true, 'LH_PAGE_HAS_DEAD_TUPLES' will get wal logged and therefore
needs to be cleared on the standby as well. Attached is the patch that
clears this flag on standby during replay.
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
Attachments:
0001-Reset-LH_PAGE_HAS_DEAD_TUPLES-flag-on-standby-when.patchapplication/x-download; name=0001-Reset-LH_PAGE_HAS_DEAD_TUPLES-flag-on-standby-when.patchDownload
From aa9e0c0fbb3d15b440b53d8e038e381f673f7fda Mon Sep 17 00:00:00 2001
From: ashu <ashu@localhost.localdomain>
Date: Thu, 16 Mar 2017 12:05:59 +0530
Subject: [PATCH] Reset LH_PAGE_HAS_DEAD_TUPLES flag on standby when replaying
XLOG_HASH_VACUUM_ONE_PAGE record.
---
src/backend/access/hash/hash_xlog.c | 8 ++++++++
src/backend/access/hash/hashinsert.c | 8 ++++++++
2 files changed, 16 insertions(+)
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index cabf0fd..53f2dbc 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -1078,6 +1078,7 @@ hash_xlog_vacuum_one_page(XLogReaderState *record)
Buffer metabuf;
Page page;
XLogRedoAction action;
+ HashPageOpaque pageopaque;
xldata = (xl_hash_vacuum_one_page *) XLogRecGetData(record);
@@ -1126,6 +1127,13 @@ hash_xlog_vacuum_one_page(XLogReaderState *record)
PageIndexMultiDelete(page, unused, unend - unused);
}
+ /*
+ * Mark the page as not containing any LP_DEAD items. See comments
+ * in _hash_vacuum_one_page().
+ */
+ pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
+
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 8b6d0a0..8640e85 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -374,6 +374,14 @@ _hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf,
PageIndexMultiDelete(page, deletable, ndeletable);
+ /*
+ * Mark the page as not containing any LP_DEAD items. This is not
+ * certainly true (there might be some that have recently been marked,
+ * but weren't included in our target-item list), but it will almost
+ * always be true and it doesn't seem worth an additional page scan
+ * to check it. Remember that LH_PAGE_HAS_DEAD_TUPLES is only a hint
+ * anyway.
+ */
pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
--
1.8.3.1
On Thu, Mar 16, 2017 at 1:02 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
On Thu, Mar 16, 2017 at 11:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Mar 15, 2017 at 9:23 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Few other comments: 1. + if (ndeletable > 0) + { + /* No ereport(ERROR) until changes are logged */ + START_CRIT_SECTION(); + + PageIndexMultiDelete(page, deletable, ndeletable); + + pageopaque = (HashPageOpaque) PageGetSpecialPointer(page); + pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
You are clearing this flag while logging the action, but the same is not
taken care of during replay. Any reasons?
That's because we conditionally WAL-log this flag status, and when we do so
we take its FPI.
Sure, but we are not clearing it conditionally. I am not sure how it will be
cleared after recovery if it gets set during normal operation. Moreover,
btree already clears a similar flag during replay (refer btree_xlog_delete).
You were right. In case data checksums are enabled or wal_log_hints is set
to true, 'LH_PAGE_HAS_DEAD_TUPLES' will get WAL-logged and therefore needs
to be cleared on the standby as well.
I was thinking about what bad can happen if we don't clear this flag during
replay. The main thing that comes to mind is that after crash recovery, if
the flag is still set, inserts into that page might need to traverse all the
tuples on it once the page is full, even if there are no dead tuples on that
page. It can be cleared later, once there are dead tuples on that page and we
actually delete them, but I don't think that is a price worth paying for not
clearing the flag during replay.
Attached is the patch that
clears this flag on standby during replay.
Don't you think we should also clear it during the replay of
XLOG_HASH_DELETE? We might want to log the clearing of the flag along with
the WAL record for XLOG_HASH_DELETE.
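A rough sketch of what that could look like (illustrative only; the actual
patch may differ in details) is to carry the flag status in xl_hash_delete
and act on it in hash_xlog_delete():

typedef struct xl_hash_delete
{
    bool    clear_dead_marking;     /* TRUE if this operation clears
                                     * LH_PAGE_HAS_DEAD_TUPLES */
    bool    is_primary_bucket_page; /* TRUE if the operation is for
                                     * primary bucket page */
} xl_hash_delete;

/* ... and during replay, clear the hint only when the record says so: */
if (xldata->clear_dead_marking)
{
    HashPageOpaque pageopaque;

    pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
    pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
}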
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Sure, but we are not clearing it conditionally. I am not sure how it will be
cleared after recovery if it gets set during normal operation. Moreover,
btree already clears a similar flag during replay (refer btree_xlog_delete).
You were right. In case data checksums are enabled or wal_log_hints is set
to true, 'LH_PAGE_HAS_DEAD_TUPLES' will get WAL-logged and therefore needs
to be cleared on the standby as well.
I was thinking about what bad can happen if we don't clear this flag during
replay. The main thing that comes to mind is that after crash recovery, if
the flag is still set, inserts into that page might need to traverse all the
tuples on it once the page is full, even if there are no dead tuples on that
page. It can be cleared later, once there are dead tuples on that page and we
actually delete them, but I don't think that is a price worth paying for not
clearing the flag during replay.
Yes, you are absolutely correct. If we do not clear this flag during replay,
there is a possibility of _hash_doinsert() unnecessarily scanning a page with
no free space, assuming that the page has some dead tuples in it when it does
not.
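For illustration, a simplified sketch of the check in _hash_doinsert() that
this hint guards (variable names are abbreviated and the
_hash_vacuum_one_page() argument list is elided; see the actual code for
details). If the flag were left set after replay, the inner branch would
fire and the page would be scanned even though it has no LP_DEAD items:

/* inside _hash_doinsert()'s loop over the pages of the bucket chain */
if (PageGetFreeSpace(page) < itemsz)
{
    HashPageOpaque pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);

    if (pageopaque->hasho_flag & LH_PAGE_HAS_DEAD_TUPLES)
    {
        /*
         * The page advertises LP_DEAD items, so scan it and try to
         * reclaim space before moving on or splitting the bucket.
         */
        _hash_vacuum_one_page(rel, metabuf, buf, ...);  /* args elided */

        if (PageGetFreeSpace(page) >= itemsz)
            break;              /* enough space reclaimed, insert here */
    }

    /* otherwise move to the next overflow page, or split the bucket */
}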
Attached is the patch that clears this flag on standby during replay.
Don't you think we should also clear it during the replay of
XLOG_HASH_DELETE? We might want to log the clearing of the flag along with
the WAL record for XLOG_HASH_DELETE.
Yes, it should be cleared. I completely missed this part in my hurry. Thanks
for pointing it out. I have taken care of it in the attached v2 patch.
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
Attachments:
0001-Reset-LH_PAGE_HAS_DEAD_TUPLES-flag-on-standby-during.patchbinary/octet-stream; name=0001-Reset-LH_PAGE_HAS_DEAD_TUPLES-flag-on-standby-during.patchDownload
From 53c7e71e9d293d1a5e623343601d76a26ba2cbb1 Mon Sep 17 00:00:00 2001
From: ashu <ashu@localhost.localdomain>
Date: Thu, 16 Mar 2017 21:10:55 +0530
Subject: [PATCH] Reset LH_PAGE_HAS_DEAD_TUPLES flag on standby during replay
v2
---
src/backend/access/hash/hash_xlog.c | 16 ++++++++++++++++
src/backend/access/hash/hashinsert.c | 8 ++++++++
2 files changed, 24 insertions(+)
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index 8647e8c..265a3b8 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -815,6 +815,7 @@ hash_xlog_delete(XLogReaderState *record)
Buffer bucketbuf = InvalidBuffer;
Buffer deletebuf;
Page page;
+ HashPageOpaque pageopaque;
XLogRedoAction action;
/*
@@ -859,6 +860,13 @@ hash_xlog_delete(XLogReaderState *record)
PageIndexMultiDelete(page, unused, unend - unused);
}
+ /*
+ * Mark the page as not containing any LP_DEAD items. See comments
+ * in hashbucketcleanup() for details.
+ */
+ pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
+
PageSetLSN(page, lsn);
MarkBufferDirty(deletebuf);
}
@@ -1078,6 +1086,7 @@ hash_xlog_vacuum_one_page(XLogReaderState *record)
Buffer metabuf;
Page page;
XLogRedoAction action;
+ HashPageOpaque pageopaque;
xldata = (xl_hash_vacuum_one_page *) XLogRecGetData(record);
@@ -1126,6 +1135,13 @@ hash_xlog_vacuum_one_page(XLogReaderState *record)
PageIndexMultiDelete(page, unused, unend - unused);
}
+ /*
+ * Mark the page as not containing any LP_DEAD items. See comments
+ * in _hash_vacuum_one_page() for details.
+ */
+ pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
+
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 8b6d0a0..8640e85 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -374,6 +374,14 @@ _hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf,
PageIndexMultiDelete(page, deletable, ndeletable);
+ /*
+ * Mark the page as not containing any LP_DEAD items. This is not
+ * certainly true (there might be some that have recently been marked,
+ * but weren't included in our target-item list), but it will almost
+ * always be true and it doesn't seem worth an additional page scan
+ * to check it. Remember that LH_PAGE_HAS_DEAD_TUPLES is only a hint
+ * anyway.
+ */
pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
--
1.8.3.1
On Thu, Mar 16, 2017 at 9:39 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Don't you think we should also clear it during the replay of
XLOG_HASH_DELETE? We might want to log the clearing of the flag along with
the WAL record for XLOG_HASH_DELETE.
Yes, it should be cleared. I completely missed this part in my hurry. Thanks
for pointing it out. I have taken care of it in the attached v2 patch.
+ /*
+ * Mark the page as not containing any LP_DEAD items. See comments
+ * in hashbucketcleanup() for details.
+ */
+ pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
Your comment here says to refer to hashbucketcleanup(), but in that function
the comment says "Clearing this flag is just a hint; replay won't redo
this." The two seem contradictory; you need to change the comment in
hashbucketcleanup(). As I said in my previous e-mail, I think you need to
record the clearing of this flag in the XLOG_HASH_DELETE WAL record, since
you are not doing it unconditionally, and then during replay clear it only
when the WAL record indicates the same.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Mar 17, 2017 at 8:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Mar 16, 2017 at 9:39 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Don't you think we should also clear it during the replay of
XLOG_HASH_DELETE? We might want to log the clearing of the flag along with
the WAL record for XLOG_HASH_DELETE.
Yes, it should be cleared. I completely missed this part in my hurry. Thanks
for pointing it out. I have taken care of it in the attached v2 patch.
+ /* + * Mark the page as not containing any LP_DEAD items. See comments + * in hashbucketcleanup() for details. + */ + pageopaque = (HashPageOpaque) PageGetSpecialPointer(page); + pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
Your comment here says to refer to hashbucketcleanup(), but in that function
the comment says "Clearing this flag is just a hint; replay won't redo
this." The two seem contradictory; you need to change the comment in
hashbucketcleanup().
Done. Please check the attached v3 patch.
As I said in my previous e-mail, I think you need to record the clearing of
this flag in the XLOG_HASH_DELETE WAL record, since you are not doing it
unconditionally, and then during replay clear it only when the WAL record
indicates the same.
Thank you so much for raising that point. I too think that we should record
the flag status in the WAL record and clear it only when required during
replay.
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
Attachments:
0001-Reset-LH_PAGE_HAS_DEAD_TUPLES-flag-during-replay-v3.patchbinary/octet-stream; name=0001-Reset-LH_PAGE_HAS_DEAD_TUPLES-flag-during-replay-v3.patchDownload
From 085a3d52d743703db2034674e9571161e852a594 Mon Sep 17 00:00:00 2001
From: ashu <ashu@localhost.localdomain>
Date: Fri, 17 Mar 2017 11:37:16 +0530
Subject: [PATCH] Reset LH_PAGE_HAS_DEAD_TUPLES flag during replay v3
Patch by Ashutosh Sharma
---
src/backend/access/hash/hash.c | 8 +++++++-
src/backend/access/hash/hash_xlog.c | 21 +++++++++++++++++++++
src/backend/access/hash/hashinsert.c | 8 ++++++++
src/include/access/hash_xlog.h | 2 ++
4 files changed, 38 insertions(+), 1 deletion(-)
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index cfcec34..869ddce 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -790,6 +790,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
OffsetNumber deletable[MaxOffsetNumber];
int ndeletable = 0;
bool retain_pin = false;
+ bool clear_dead_marking = false;
vacuum_delay_point();
@@ -877,11 +878,15 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
/*
* Let us mark the page as clean if vacuum removes the DEAD tuples
* from an index page. We do this by clearing LH_PAGE_HAS_DEAD_TUPLES
- * flag. Clearing this flag is just a hint; replay won't redo this.
+ * flag. Clearing this flag is just a hint; replay will check the
+ * status of clear_dead_marking flag before redo it.
*/
if (tuples_removed && *tuples_removed > 0 &&
opaque->hasho_flag & LH_PAGE_HAS_DEAD_TUPLES)
+ {
opaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
+ clear_dead_marking = true;
+ }
MarkBufferDirty(buf);
@@ -891,6 +896,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
xl_hash_delete xlrec;
XLogRecPtr recptr;
+ xlrec.clear_dead_marking = clear_dead_marking;
xlrec.is_primary_bucket_page = (buf == bucket_buf) ? true : false;
XLogBeginInsert();
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index 8647e8c..ac82092 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -859,6 +859,19 @@ hash_xlog_delete(XLogReaderState *record)
PageIndexMultiDelete(page, unused, unend - unused);
}
+ /*
+ * Mark the page as not containing any LP_DEAD items only if
+ * clear_dead_marking flag is set to true. See comments in
+ * hashbucketcleanup() for details.
+ */
+ if (xldata->clear_dead_marking)
+ {
+ HashPageOpaque pageopaque;
+
+ pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
+ }
+
PageSetLSN(page, lsn);
MarkBufferDirty(deletebuf);
}
@@ -1078,6 +1091,7 @@ hash_xlog_vacuum_one_page(XLogReaderState *record)
Buffer metabuf;
Page page;
XLogRedoAction action;
+ HashPageOpaque pageopaque;
xldata = (xl_hash_vacuum_one_page *) XLogRecGetData(record);
@@ -1126,6 +1140,13 @@ hash_xlog_vacuum_one_page(XLogReaderState *record)
PageIndexMultiDelete(page, unused, unend - unused);
}
+ /*
+ * Mark the page as not containing any LP_DEAD items. See comments
+ * in _hash_vacuum_one_page() for details.
+ */
+ pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
+
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 8b6d0a0..8640e85 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -374,6 +374,14 @@ _hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf,
PageIndexMultiDelete(page, deletable, ndeletable);
+ /*
+ * Mark the page as not containing any LP_DEAD items. This is not
+ * certainly true (there might be some that have recently been marked,
+ * but weren't included in our target-item list), but it will almost
+ * always be true and it doesn't seem worth an additional page scan
+ * to check it. Remember that LH_PAGE_HAS_DEAD_TUPLES is only a hint
+ * anyway.
+ */
pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index dfd9237..4db40b4 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -197,6 +197,8 @@ typedef struct xl_hash_squeeze_page
*/
typedef struct xl_hash_delete
{
+ bool clear_dead_marking; /* TRUE if VACUUM clears
+ * LH_PAGE_HAS_DEAD_TUPLES flag */
bool is_primary_bucket_page; /* TRUE if the operation is for
* primary bucket page */
} xl_hash_delete;
--
1.8.3.1
On Fri, Mar 17, 2017 at 12:27 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
On Fri, Mar 17, 2017 at 8:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
As I said in my previous e-mail, I think you need to record the clearing of
this flag in the XLOG_HASH_DELETE WAL record, since you are not doing it
unconditionally, and then during replay clear it only when the WAL record
indicates the same.
Thank you so much for raising that point. I too think that we should record
the flag status in the WAL record and clear it only when required during
replay.
I think hashdesc.c needs an update (refer case XLOG_HASH_DELETE:).
- * flag. Clearing this flag is just a hint; replay won't redo this.
+ * flag. Clearing this flag is just a hint; replay will check the
+ * status of clear_dead_marking flag before redo it.
*/
if (tuples_removed && *tuples_removed > 0 &&
opaque->hasho_flag & LH_PAGE_HAS_DEAD_TUPLES)
+ {
opaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
+ clear_dead_marking = true;
+ }
I feel the above comment is not required as you are logging this
action explicitly.
+ bool clear_dead_marking; /* TRUE if VACUUM clears
No need to write VACUUM explicitly, you can simply say "TRUE if this
operation clears ...".
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Mar 17, 2017 at 6:13 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Mar 17, 2017 at 12:27 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
On Fri, Mar 17, 2017 at 8:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
As I said in my previous e-mail, I think you need to record the clearing of
this flag in the XLOG_HASH_DELETE WAL record, since you are not doing it
unconditionally, and then during replay clear it only when the WAL record
indicates the same.
Thank you so much for raising that point. I too think that we should record
the flag status in the WAL record and clear it only when required during
replay.
I think hashdesc.c needs an update (refer case XLOG_HASH_DELETE:).
Done. Thanks!
- * flag. Clearing this flag is just a hint; replay won't redo this. + * flag. Clearing this flag is just a hint; replay will check the + * status of clear_dead_marking flag before redo it. */ if (tuples_removed && *tuples_removed > 0 && opaque->hasho_flag & LH_PAGE_HAS_DEAD_TUPLES) + { opaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES; + clear_dead_marking = true; + }
I feel the above comment is not required as you are logging this action
explicitly.
That's right. I have removed it in the attached v4 patch.
+ bool clear_dead_marking; /* TRUE if VACUUM clears
No need to write VACUUM explicitly, you can simply say "TRUE if this
operation clears ...".
Corrected. Please find the attached v4 patch.
--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com
Attachments:
0001-Reset-LH_PAGE_HAS_DEAD_TUPLES-flag-during-replay-v4.patchbinary/octet-stream; name=0001-Reset-LH_PAGE_HAS_DEAD_TUPLES-flag-during-replay-v4.patchDownload
From b550eb4f6f968bbc93f1e39622a53b53ecd4923f Mon Sep 17 00:00:00 2001
From: ashu <ashu@localhost.localdomain>
Date: Fri, 17 Mar 2017 20:07:09 +0530
Subject: [PATCH] Reset LH_PAGE_HAS_DEAD_TUPLES flag during replay v4
Patch by Ashutosh Sharma
---
src/backend/access/hash/hash.c | 7 ++++++-
src/backend/access/hash/hash_xlog.c | 21 +++++++++++++++++++++
src/backend/access/hash/hashinsert.c | 8 ++++++++
src/backend/access/rmgrdesc/hashdesc.c | 11 ++++++++++-
src/include/access/hash_xlog.h | 2 ++
5 files changed, 47 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index cfcec34..34cc08f 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -790,6 +790,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
OffsetNumber deletable[MaxOffsetNumber];
int ndeletable = 0;
bool retain_pin = false;
+ bool clear_dead_marking = false;
vacuum_delay_point();
@@ -877,11 +878,14 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
/*
* Let us mark the page as clean if vacuum removes the DEAD tuples
* from an index page. We do this by clearing LH_PAGE_HAS_DEAD_TUPLES
- * flag. Clearing this flag is just a hint; replay won't redo this.
+ * flag.
*/
if (tuples_removed && *tuples_removed > 0 &&
opaque->hasho_flag & LH_PAGE_HAS_DEAD_TUPLES)
+ {
opaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
+ clear_dead_marking = true;
+ }
MarkBufferDirty(buf);
@@ -891,6 +895,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
xl_hash_delete xlrec;
XLogRecPtr recptr;
+ xlrec.clear_dead_marking = clear_dead_marking;
xlrec.is_primary_bucket_page = (buf == bucket_buf) ? true : false;
XLogBeginInsert();
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index 8647e8c..ac82092 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -859,6 +859,19 @@ hash_xlog_delete(XLogReaderState *record)
PageIndexMultiDelete(page, unused, unend - unused);
}
+ /*
+ * Mark the page as not containing any LP_DEAD items only if
+ * clear_dead_marking flag is set to true. See comments in
+ * hashbucketcleanup() for details.
+ */
+ if (xldata->clear_dead_marking)
+ {
+ HashPageOpaque pageopaque;
+
+ pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
+ }
+
PageSetLSN(page, lsn);
MarkBufferDirty(deletebuf);
}
@@ -1078,6 +1091,7 @@ hash_xlog_vacuum_one_page(XLogReaderState *record)
Buffer metabuf;
Page page;
XLogRedoAction action;
+ HashPageOpaque pageopaque;
xldata = (xl_hash_vacuum_one_page *) XLogRecGetData(record);
@@ -1126,6 +1140,13 @@ hash_xlog_vacuum_one_page(XLogReaderState *record)
PageIndexMultiDelete(page, unused, unend - unused);
}
+ /*
+ * Mark the page as not containing any LP_DEAD items. See comments
+ * in _hash_vacuum_one_page() for details.
+ */
+ pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+ pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
+
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 8b6d0a0..8640e85 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -374,6 +374,14 @@ _hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf,
PageIndexMultiDelete(page, deletable, ndeletable);
+ /*
+ * Mark the page as not containing any LP_DEAD items. This is not
+ * certainly true (there might be some that have recently been marked,
+ * but weren't included in our target-item list), but it will almost
+ * always be true and it doesn't seem worth an additional page scan
+ * to check it. Remember that LH_PAGE_HAS_DEAD_TUPLES is only a hint
+ * anyway.
+ */
pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
diff --git a/src/backend/access/rmgrdesc/hashdesc.c b/src/backend/access/rmgrdesc/hashdesc.c
index 5bd5c8d..5f5f4a0 100644
--- a/src/backend/access/rmgrdesc/hashdesc.c
+++ b/src/backend/access/rmgrdesc/hashdesc.c
@@ -96,7 +96,8 @@ hash_desc(StringInfo buf, XLogReaderState *record)
{
xl_hash_delete *xlrec = (xl_hash_delete *) rec;
- appendStringInfo(buf, "is_primary %c",
+ appendStringInfo(buf, "clear_dead_marking %c, is_primary %c",
+ xlrec->clear_dead_marking ? 'T' : 'F',
xlrec->is_primary_bucket_page ? 'T' : 'F');
break;
}
@@ -108,6 +109,14 @@ hash_desc(StringInfo buf, XLogReaderState *record)
xlrec->ntuples);
break;
}
+ case XLOG_HASH_VACUUM_ONE_PAGE:
+ {
+ xl_hash_vacuum_one_page *xlrec = (xl_hash_vacuum_one_page *) rec;
+
+ appendStringInfo(buf, "ntuples %g",
+ xlrec->ntuples);
+ break;
+ }
}
}
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index dfd9237..2e64cfa 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -197,6 +197,8 @@ typedef struct xl_hash_squeeze_page
*/
typedef struct xl_hash_delete
{
+ bool clear_dead_marking; /* TRUE if this operation clears
+ * LH_PAGE_HAS_DEAD_TUPLES flag */
bool is_primary_bucket_page; /* TRUE if the operation is for
* primary bucket page */
} xl_hash_delete;
--
1.8.3.1
On Wed, Mar 15, 2017 at 07:26:45PM -0700, Andres Freund wrote:
On 2017-03-15 16:31:11 -0400, Robert Haas wrote:
On Wed, Mar 15, 2017 at 3:54 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Changed as per suggestions. Attached v9 patch. Thanks.
Wow, when do you sleep?
I think that applies to a bunch of people, including yourself ;)
Gee, no one asks when I sleep. I wonder why. ;-)
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
On Fri, Mar 17, 2017 at 8:34 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
On Fri, Mar 17, 2017 at 6:13 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Mar 17, 2017 at 12:27 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
On Fri, Mar 17, 2017 at 8:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
As I said in my previous e-mail, I think you need to record the clearing of
this flag in the XLOG_HASH_DELETE WAL record, since you are not doing it
unconditionally, and then during replay clear it only when the WAL record
indicates the same.
Thank you so much for raising that point. I too think that we should record
the flag status in the WAL record and clear it only when required during
replay.
I think hashdesc.c needs an update (refer case XLOG_HASH_DELETE:).
Done. Thanks!
This version looks good to me.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Mar 18, 2017 at 4:35 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
This version looks good to me.
Committed.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company