Write Ahead Logging for Hash Indexes

Started by Amit Kapila, over 9 years ago, 124 messages
#1 Amit Kapila
amit.kapila16@gmail.com
1 attachment(s)

$SUBJECT will make hash indexes reliable and usable on standby.
AFAIU, hash indexes are currently not recommended for use in
production mainly because they are not crash-safe. With this patch, I
hope we can address that limitation and recommend them for production
use.

This patch is built on my earlier patch [1] for making hash indexes
concurrent. The main reason for doing so is that the earlier patch
allows the split operation to be completed and uses light-weight
locking, due to which operations can be logged at a granular level.

WAL for different operations:

This has been explained in the README as well, but I am writing it
here again for convenience.
=================================

Multiple WAL records are written for the create index operation:
first one for initializing the metapage, followed by one for each new
bucket created during the operation, followed by one for initializing
the bitmap page. If the system crashes after any of these operations,
the whole operation is rolled back. I considered writing a single WAL
record for the whole operation, but for that we would need to limit
the number of initial buckets that can be created during the
operation. As we can log only a fixed number of pages,
XLR_MAX_BLOCK_ID (32), with the current XLog machinery, it is better
to write multiple WAL records for this operation. The downside of
restricting the number of buckets is that we would need to perform a
split operation if the number of tuples is more than what can be
accommodated in the initial set of buckets, and it is not unusual to
have a large number of tuples during a create index operation.

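To illustrate the per-page pattern, here is a minimal sketch of how
one freshly initialized primary bucket page could be logged as its
own record during create index. The helper and the record name
XLOG_HASH_INIT_BUCKET_PAGE are assumptions for this example; the
actual code is in the attached patch.

#include "postgres.h"
#include "access/hash.h"
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "utils/rel.h"

/*
 * Hypothetical helper: WAL-log one new, empty primary bucket page.
 * XLOG_HASH_INIT_BUCKET_PAGE is an assumed record name.
 */
static void
log_new_bucket_page(Relation rel, Buffer buf, uint32 num_bucket)
{
	/* No ereport(ERROR) until the change is logged */
	START_CRIT_SECTION();

	/* set up an empty primary bucket page (see _hash_initbuf in the patch) */
	_hash_initbuf(buf, num_bucket, LH_BUCKET_PAGE, true);
	MarkBufferDirty(buf);

	if (RelationNeedsWAL(rel))
	{
		XLogRecPtr	recptr;

		XLogBeginInsert();
		/* the page is re-initialized at replay, so no payload is needed */
		XLogRegisterBuffer(0, buf, REGBUF_WILL_INIT);

		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INIT_BUCKET_PAGE);
		PageSetLSN(BufferGetPage(buf), recptr);
	}

	END_CRIT_SECTION();
}
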
Ordinary item insertions (that don't force a page split or need a new
overflow page) are single WAL entries. They touch a single bucket
page and the metapage; the metapage is updated during replay just as
it is updated during the original operation.

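To show the shape of that single record, below is a condensed sketch
of the logging step, modeled on _hash_doinsert() in the attached
patch (xl_hash_insert, SizeOfHashInsert and XLOG_HASH_INSERT come
from the patch); in the patch this runs inside the critical section
that also adds the tuple and bumps hashm_ntuples.

#include "postgres.h"
#include "access/hash.h"
#include "access/xloginsert.h"
#include "utils/rel.h"

/* Sketch: log an ordinary insert touching one bucket page plus the metapage. */
static void
log_simple_insert(Relation rel, Buffer buf, Buffer metabuf,
				  IndexTuple itup, OffsetNumber itup_off)
{
	xl_hash_insert xlrec;
	XLogRecPtr	recptr;

	if (!RelationNeedsWAL(rel))
		return;

	xlrec.offnum = itup_off;

	XLogBeginInsert();
	XLogRegisterData((char *) &xlrec, SizeOfHashInsert);

	/* the new tuple is replayed into the bucket (or overflow) page */
	XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
	XLogRegisterBufData(0, (char *) itup, IndexTupleDSize(*itup));

	/* the metapage tuple count is incremented again during replay */
	XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);

	recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INSERT);
	PageSetLSN(BufferGetPage(buf), recptr);
	PageSetLSN(BufferGetPage(metabuf), recptr);
}
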
An insertion that causes the addition of an overflow page is logged
as a single WAL entry, preceded by a WAL entry for the new overflow
page required to insert the tuple. There is a corner case where, by
the time we try to use the newly allocated overflow page, it has
already been used by concurrent insertions; in that case, a new
overflow page will be allocated and a separate WAL entry will be made
for it.

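The caller-side handling of that corner case amounts to a retry loop,
roughly as below (an approximation of the logic in _hash_doinsert(),
which is not shown in full in the attached hunks):

#include "postgres.h"
#include "access/hash.h"

/*
 * Sketch: keep asking for overflow pages until one with enough free space
 * comes back.  _hash_addovflpage() no longer guarantees an empty page, so a
 * concurrent inserter may have consumed the space before we reacquire the
 * lock; each allocation writes its own WAL record.
 */
static Buffer
ensure_space_for_tuple(Relation rel, Buffer metabuf, Buffer buf,
					   Size itemsz, bool retain_pin)
{
	while (PageGetFreeSpace(BufferGetPage(buf)) < itemsz)
		buf = _hash_addovflpage(rel, metabuf, buf, retain_pin);

	return buf;
}
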
An insertion that causes a bucket split is logged as a single WAL
entry, followed by a WAL entry for allocating a new bucket, followed
by a WAL entry for each overflow bucket page in the new bucket to
which the tuples are moved from the old bucket, followed by a WAL
entry to indicate that the split is complete for both the old and new
buckets.

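As an illustration of that final step, the record that marks the
split complete only needs to clear the split-in-progress flags on the
two primary bucket pages in one atomic action. A rough sketch
follows; the flag names come from the prerequisite concurrency patch
[2], and the record name XLOG_HASH_SPLIT_COMPLETE is an assumption
here:

#include "postgres.h"
#include "access/hash.h"
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "utils/rel.h"

/* Sketch: mark the split complete on both the old and new bucket pages. */
static void
log_split_complete(Relation rel, Buffer obuf, Buffer nbuf)
{
	HashPageOpaque oopaque;
	HashPageOpaque nopaque;

	oopaque = (HashPageOpaque) PageGetSpecialPointer(BufferGetPage(obuf));
	nopaque = (HashPageOpaque) PageGetSpecialPointer(BufferGetPage(nbuf));

	START_CRIT_SECTION();

	/* flag names are from the concurrent hash index patch [2] */
	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
	nopaque->hasho_flag &= ~LH_BUCKET_BEING_POPULATED;
	MarkBufferDirty(obuf);
	MarkBufferDirty(nbuf);

	if (RelationNeedsWAL(rel))
	{
		XLogRecPtr	recptr;

		XLogBeginInsert();
		XLogRegisterBuffer(0, obuf, REGBUF_STANDARD);
		XLogRegisterBuffer(1, nbuf, REGBUF_STANDARD);

		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_COMPLETE);
		PageSetLSN(BufferGetPage(obuf), recptr);
		PageSetLSN(BufferGetPage(nbuf), recptr);
	}

	END_CRIT_SECTION();
}
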
A split operation which requires overflow pages to complete will need
to write a WAL record for each new allocation of an overflow page. As
splitting involves multiple atomic actions, it's possible that the
system crashes in between moving tuples from the bucket pages of the
old bucket to the new bucket. After recovery, both the old and new
buckets will be marked with the incomplete-split flag. The reader
algorithm works correctly, as it will scan both the old and new
buckets.

We finish the split at the next insert or split operation on the old
bucket. It could be done during searches, too, but it seems best not
to put any extra updates in what would otherwise be a read-only
operation (updating is not possible in hot standby mode anyway). It
would seem natural to complete the split in VACUUM, but since
splitting a bucket might require allocating a new page, it might fail
if you run out of disk space. That would be bad during VACUUM - the
reason for running VACUUM in the first place might be that you ran
out of disk space, and now VACUUM won't finish because you're out of
disk space. In contrast, an insertion can require enlarging the
physical file anyway.

Deletion of tuples from a bucket is performed for two reasons: one is
removing dead tuples, and the other is removing tuples that were
moved by a split. A WAL entry is made for each bucket page from which
tuples are removed, followed by a WAL entry to clear the garbage flag
if the tuples moved by a split are removed. Another separate WAL
entry is made for updating the metapage if the deletion is performed
to remove dead tuples by vacuum.

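For reference, the per-page deletion record is quite small; the
following is condensed from hashbucketcleanup() in the attached patch
and runs inside the critical section right after
PageIndexMultiDelete():

#include "postgres.h"
#include "access/hash.h"
#include "access/xloginsert.h"
#include "utils/rel.h"

/* Sketch: log removal of an array of offsets from one bucket/overflow page. */
static void
log_page_delete(Relation rel, Buffer bucket_buf, Buffer buf,
				OffsetNumber *deletable, int ndeletable)
{
	xl_hash_delete xlrec;
	XLogRecPtr	recptr;

	if (!RelationNeedsWAL(rel))
		return;

	xlrec.is_primary_bucket_page = (buf == bucket_buf);

	XLogBeginInsert();
	XLogRegisterData((char *) &xlrec, SizeOfHashDelete);

	/*
	 * the primary bucket page is registered (without data) only so that
	 * replay can take a cleanup lock on it
	 */
	if (!xlrec.is_primary_bucket_page)
		XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);

	XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
	XLogRegisterBufData(1, (char *) deletable,
						ndeletable * sizeof(OffsetNumber));

	recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_DELETE);
	PageSetLSN(BufferGetPage(buf), recptr);
}
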
As deletion involves multiple atomic operations, it is quite possible
that the system crashes (a) after removing tuples from some of the
bucket pages, (b) before clearing the garbage flag, or (c) before
updating the metapage. If the system crashes before completing (b),
it will again try to clean the bucket during the next vacuum or
insert after recovery, which can have some performance impact, but it
will work fine. If the system crashes before completing (c), after
recovery there could be some additional splits until the next vacuum
updates the metapage, but the other operations like insert, delete
and scan will work correctly. We could fix this problem by actually
updating the metapage based on the delete operation during replay,
but I am not sure if it is worth the complication.

The squeeze operation moves tuples from a page later in the bucket
chain to a page earlier in the chain, and writes a WAL record when
either the page to which it is writing tuples becomes full or the
page from which it is removing tuples becomes empty.

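The write-page-full case is logged as XLOG_HASH_MOVE_PAGE_CONTENTS
and the read-page-empty case as XLOG_HASH_SQUEEZE_PAGE (see
_hash_squeezebucket() and _hash_freeovflpage() in the attached
patch). Below is a condensed sketch of the former; in the patch it
runs inside a critical section, after calling
XLogEnsureRecordSpace(0, XLR_NORMAL_RDATAS + nitups) because a single
record can carry many tuples.

#include "postgres.h"
#include "access/hash.h"
#include "access/xloginsert.h"
#include "utils/rel.h"

/* Sketch: log tuples moved to the "write" page and deleted from the "read" page. */
static void
log_moved_tuples(Relation rel, Buffer bucket_buf, Buffer wbuf, Buffer rbuf,
				 IndexTuple *itups, OffsetNumber *itup_offsets,
				 Size *tups_size, uint16 nitups,
				 OffsetNumber *deletable, uint16 ndeletable)
{
	xl_hash_move_page_contents xlrec;
	XLogRecPtr	recptr;
	int			i;

	if (!RelationNeedsWAL(rel))
		return;

	xlrec.ntups = nitups;
	xlrec.is_prim_bucket_same_wrt = (wbuf == bucket_buf);

	XLogBeginInsert();
	XLogRegisterData((char *) &xlrec, SizeOfHashMovePageContents);

	/* register the primary bucket page so replay can cleanup-lock it */
	if (!xlrec.is_prim_bucket_same_wrt)
		XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);

	/* the tuples added to the "write" page travel with that page */
	XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
	XLogRegisterBufData(1, (char *) itup_offsets,
						nitups * sizeof(OffsetNumber));
	for (i = 0; i < nitups; i++)
		XLogRegisterBufData(1, (char *) itups[i], tups_size[i]);

	/* the offsets deleted from the "read" page travel with that page */
	XLogRegisterBuffer(2, rbuf, REGBUF_STANDARD);
	XLogRegisterBufData(2, (char *) deletable,
						ndeletable * sizeof(OffsetNumber));

	recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_MOVE_PAGE_CONTENTS);
	PageSetLSN(BufferGetPage(wbuf), recptr);
	PageSetLSN(BufferGetPage(rbuf), recptr);
}
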
As the squeeze operation involves multiple atomic operations, it is
quite possible that the system crashes before completing the
operation on the entire bucket. After recovery, the operations will
work correctly, but the index will remain bloated and can impact the
performance of read and insert operations until the next vacuum
squeezes the bucket completely.

=====================================

One of the challenges in writing this patch was that the current code
was not written with the mindset that we would need to write WAL for
different operations. A typical example is _hash_addovflpage(), where
pages are modified across different function calls and all
modifications need to be done atomically, so I had to refactor some
code so that the operations can be logged sensibly.

Thanks to Ashutosh Sharma, who has helped me complete the patch by
writing WAL for the create index and delete operations and by doing
detailed testing of the patch using the pg_filedump tool. I think it
is better if he explains himself the testing he has done to ensure
the correctness of the patch.

Thoughts?

Note - To use this patch, first apply the latest version of the
concurrent hash index patch [2].

[1]: https://commitfest.postgresql.org/10/647/
[2]: /messages/by-id/CAA4eK1LkQ_Udism-Z2Dq6cUvjH3dB5FNFNnEzZBPsRjw0haFqA@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

wal_hash_index_v1.patch (application/octet-stream)
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 0f09d82..2d0d3bd 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1520,19 +1520,6 @@ archive_command = 'local_backup_script.sh "%p" "%f"'
   <itemizedlist>
    <listitem>
     <para>
-     Operations on hash indexes are not presently WAL-logged, so
-     replay will not update these indexes.  This will mean that any new inserts
-     will be ignored by the index, updated rows will apparently disappear and
-     deleted rows will still retain pointers. In other words, if you modify a
-     table with a hash index on it then you will get incorrect query results
-     on a standby server.  When recovery completes it is recommended that you
-     manually <xref linkend="sql-reindex">
-     each such index after completing a recovery operation.
-    </para>
-   </listitem>
-
-   <listitem>
-    <para>
      If a <xref linkend="sql-createdatabase">
      command is executed while a base backup is being taken, and then
      the template database that the <command>CREATE DATABASE</> copied
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 06f49db..f285795 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2323,12 +2323,6 @@ LOG:  database system is ready to accept read only connections
   <itemizedlist>
    <listitem>
     <para>
-     Operations on hash indexes are not presently WAL-logged, so
-     replay will not update these indexes.
-    </para>
-   </listitem>
-   <listitem>
-    <para>
      Full knowledge of running transactions is required before snapshots
      can be taken. Transactions that use large numbers of subtransactions
      (currently greater than 64) will delay the start of read only
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 46f8e55..e075ede 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -193,18 +193,6 @@ CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable>
 </synopsis>
   </para>
 
-  <caution>
-   <para>
-    Hash index operations are not presently WAL-logged,
-    so hash indexes might need to be rebuilt with <command>REINDEX</>
-    after a database crash if there were unwritten changes.
-    Also, changes to hash indexes are not replicated over streaming or
-    file-based replication after the initial base backup, so they
-    give wrong answers to queries that subsequently use them.
-    For these reasons, hash index use is presently discouraged.
-   </para>
-  </caution>
-
   <para>
    <indexterm>
     <primary>index</primary>
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index e9f47c4..a2dc212 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -510,19 +510,6 @@ Indexes:
    they can be useful.
   </para>
 
-  <caution>
-   <para>
-    Hash index operations are not presently WAL-logged,
-    so hash indexes might need to be rebuilt with <command>REINDEX</>
-    after a database crash if there were unwritten changes.
-    Also, changes to hash indexes are not replicated over streaming or
-    file-based replication after the initial base backup, so they
-    give wrong answers to queries that subsequently use them.
-    Hash indexes are also not properly restored during point-in-time
-    recovery.  For these reasons, hash index use is presently discouraged.
-   </para>
-  </caution>
-
   <para>
    Currently, only the B-tree, GiST, GIN, and BRIN index methods support
    multicolumn indexes. Up to 32 fields can be specified by default.
diff --git a/src/backend/access/hash/Makefile b/src/backend/access/hash/Makefile
index 5d3bd94..dae82be 100644
--- a/src/backend/access/hash/Makefile
+++ b/src/backend/access/hash/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = hash.o hashfunc.o hashinsert.o hashovfl.o hashpage.o hashscan.o \
-       hashsearch.o hashsort.o hashutil.o hashvalidate.o
+       hashsearch.o hashsort.o hashutil.o hashvalidate.o hashxlog.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index a0feb2f..f4b5521 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -531,6 +531,97 @@ All the freespace operations should be called while holding no buffer
 locks.  Since they need no lmgr locks, deadlock is not possible.
 
 
+WAL Considerations
+------------------
+
+The hash index operations like create index, insert, delete, bucket split,
+allocate overflow page, squeeze (overflow pages are freed in this operation)
+in themselves doesn't guarantee hash index consistency after a crash.  To
+provide robustness, we write WAL for each of these operations which gets
+replayed on crash recovery.
+
+Multiple WAL records are being written for create index operation, first for
+initializing the metapage, followed by one for each new bucket created during
+operation followed by one for initializing the bitmap page.  If the system
+crashes after any operation, the whole operation is rolled back.  We have
+considered to write a single WAL record for the whole operation, but for that
+we need to limit the number of initial buckets that can be created during the
+operation. As we can log only fixed number of pages XLR_MAX_BLOCK_ID (32) with
+current XLog machinery, it is better to write multiple WAL records for this
+operation.  The downside of restricting the number of buckets is that we need
+to perform split operation if the number of tuples are more than what can be
+accommodated in initial set of buckets and it is not unusual to have large
+number of tuples during create index operation.
+
+Ordinary item insertions (that don't force a page split or need a new overflow
+page) are single WAL entries.  They touch a single bucket page and meta page,
+metapage is updated during replay as it is updated during original operation.
+
+An insertion that causes an addition of an overflow page is logged as a single
+WAL entry preceded by a WAL entry for new overflow page required to insert
+a tuple.  There is a corner case where by the time we try to use newly
+allocated overflow page, it already gets used by concurrent insertions, for
+such a case, a new overflow page will be allocated and a separate WAL entry
+will be made for the same.
+
+An insertion that causes a bucket split is logged as a single WAL entry,
+followed by a WAL entry for allocating a new bucket, followed by a WAL entry
+for each overflow bucket page in the new bucket to which the tuples are moved
+from old bucket, followed by a WAL entry to indicate that split is complete
+for both old and new buckets.
+
+A split operation which requires overflow pages to complete the operation
+will need to write a WAL record for each new allocation of an overflow page.
+
+As splitting involves multiple atomic actions, it's possible that the
+system crashes between moving tuples from bucket pages of old bucket to new
+bucket.  After recovery, both the old and new buckets will be marked with
+incomplete-split flag.  The reader algorithm works correctly, as it will scan
+both the old and new buckets as explained in the reader algorithm section
+above.
+
+We finish the split at next insert or split operation on old bucket as
+explained in insert and split algorithm above.  It could be done during
+searches, too, but it seems best not to put any extra updates in what would
+otherwise be a read-only operation (updating is not possible in hot standby
+mode anyway).  It would seem natural to complete the split in VACUUM, but since
+splitting a bucket might require to allocate a new page, it might fail if you
+run out of disk space.  That would be bad during VACUUM - the reason for
+running VACUUM in the first place might be that you run out of disk space,
+and now VACUUM won't finish because you're out of disk space.  In contrast,
+an insertion can require enlarging the physical file anyway.
+
+Deletion of tuples from a bucket is performed for two reasons, one for
+removing the dead tuples and other for removing the tuples that are moved by
+split.  WAL entry is made for each bucket page from which tuples are removed,
+followed by a WAL entry to clear the garbage flag if the tuples moved by split
+are removed.  Another separate WAL entry is made for updating the metapage if
+the deletion is performed for removing the dead tuples by vacuum.
+
+As deletion involves multiple atomic operations, it is quite possible that
+system crashes after (a) removing tuples from some of the bucket pages
+(b) before clearing the garbage flag (c) before updating the metapage.  If the
+system crashes before completing (b), it will again try to clean the bucket
+during next vacuum or insert after recovery which can have some performance
+impact, but it will work fine. If the system crashes before completing (c),
+after recovery there could be some additional splits till the next vacuum
+updates the metapage, but the other operations like insert, delete and scan
+will work correctly.  We can fix this problem by actually updating the metapage
+based on delete operation during replay, but not sure if it is worth the
+complication.
+
+Squeeze operation moves tuples from one of the buckets later in the chain to
+one of the bucket earlier in chain and writes WAL record when either the
+bucket to which it is writing tuples is filled or bucket from which it
+is removing the tuples becomes empty.
+
+As Squeeze operation involves writing multiple atomic operations, it is
+quite possible, that system crashes before completing the operation on
+entire bucket.  After recovery, the operations will work correctly, but
+the index will remain bloated and can impact performance of read and
+insert operations until the next vacuum squeezes the bucket completely.
+
+
 Other Notes
 -----------
 
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index d869d35..8eb84d2 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -26,6 +26,7 @@
 #include "optimizer/plancat.h"
 #include "utils/index_selfuncs.h"
 #include "utils/rel.h"
+#include "miscadmin.h"
 
 
 /* Working state for hashbuild and its callback */
@@ -114,7 +115,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	estimate_rel_size(heap, NULL, &relpages, &reltuples, &allvisfrac);
 
 	/* Initialize the hash index metadata page and initial buckets */
-	num_buckets = _hash_metapinit(index, reltuples, MAIN_FORKNUM);
+	num_buckets = _hash_init(index, reltuples, MAIN_FORKNUM);
 
 	/*
 	 * If we just insert the tuples into the index in scan order, then
@@ -176,7 +177,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 void
 hashbuildempty(Relation index)
 {
-	_hash_metapinit(index, 0, INIT_FORKNUM);
+	_hash_init(index, 0, INIT_FORKNUM);
 }
 
 /*
@@ -603,6 +604,7 @@ loop_top:
 	}
 
 	/* Okay, we're really done.  Update tuple count in metapage. */
+	START_CRIT_SECTION();
 
 	if (orig_maxbucket == metap->hashm_maxbucket &&
 		orig_ntuples == metap->hashm_ntuples)
@@ -628,7 +630,28 @@ loop_top:
 		num_index_tuples = metap->hashm_ntuples;
 	}
 
-	_hash_wrtbuf(rel, metabuf);
+	MarkBufferDirty(metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_update_meta_page xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.ntuples = metap->hashm_ntuples;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, sizeof(SizeOfHashUpdateMetaPage));
+
+		XLogRegisterBuffer(0, metabuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_UPDATE_META_PAGE);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	_hash_relbuf(rel, metabuf);
 
 	/* return statistics */
 	if (stats == NULL)
@@ -786,9 +809,42 @@ hashbucketcleanup(Relation rel, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			/* No ereport(ERROR) until changes are logged */
+			START_CRIT_SECTION();
+
 			PageIndexMultiDelete(page, deletable, ndeletable);
 			bucket_dirty = true;
 			curr_page_dirty = true;
+
+			MarkBufferDirty(buf);
+
+			/* XLOG stuff */
+			if (RelationNeedsWAL(rel))
+			{
+				xl_hash_delete xlrec;
+				XLogRecPtr	recptr;
+
+				xlrec.is_primary_bucket_page = (buf == bucket_buf) ? true : false;
+
+				XLogBeginInsert();
+				XLogRegisterData((char *) &xlrec, SizeOfHashDelete);
+
+				/*
+				 * bucket buffer needs to be registered to ensure that we can
+				 * acquire a cleanup lock on it during replay.
+				 */
+				if (!xlrec.is_primary_bucket_page)
+					XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+				XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
+				XLogRegisterBufData(1, (char *) deletable,
+									ndeletable * sizeof(OffsetNumber));
+
+				recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_DELETE);
+				PageSetLSN(BufferGetPage(buf), recptr);
+			}
+
+			END_CRIT_SECTION();
 		}
 
 		/* bail out if there are no more pages to scan. */
@@ -808,7 +864,7 @@ hashbucketcleanup(Relation rel, Buffer bucket_buf,
 			if (retain_pin)
 				_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
 			else
-				_hash_wrtbuf(rel, buf);
+				_hash_relbuf(rel, buf);
 			curr_page_dirty = false;
 		}
 		else if (retain_pin)
@@ -843,10 +899,36 @@ hashbucketcleanup(Relation rel, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		/* No ereport(ERROR) until changes are logged */
+		START_CRIT_SECTION();
+
 		bucket_opaque->hasho_flag &= ~LH_BUCKET_PAGE_HAS_GARBAGE;
+
+		MarkBufferDirty(bucket_buf);
+
+		/* XLOG stuff */
+		if (RelationNeedsWAL(rel))
+		{
+			XLogRecPtr	recptr;
+
+			XLogBeginInsert();
+			XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+			recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_CLEAR_GARBAGE);
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
 	}
 
 	/*
+	 * we need to release and reacquire the lock on bucket buffer to ensure
+	 * that standby shouldn't see an intermediate state of it.
+	 */
+	_hash_chgbufaccess(rel, bucket_buf, HASH_READ, HASH_NOLOCK);
+	_hash_chgbufaccess(rel, bucket_buf, HASH_NOLOCK, HASH_WRITE);
+
+	/*
 	 * If we deleted anything, try to compact free space.  For squeezing the
 	 * bucket, we must have a cleanup lock, else it can impact the ordering of
 	 * tuples for a scan that has started before it.
@@ -855,9 +937,3 @@ hashbucketcleanup(Relation rel, Buffer bucket_buf,
 		_hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf,
 							bstrategy);
 }
-
-void
-hash_redo(XLogReaderState *record)
-{
-	elog(PANIC, "hash_redo: unimplemented");
-}
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index b1e79b5..907c553 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -16,6 +16,7 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "miscadmin.h"
 #include "utils/rel.h"
 
 
@@ -44,6 +45,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	uint32		maxbucket;
 	uint32		highmask;
 	uint32		lowmask;
+	OffsetNumber itup_off;
 
 	/*
 	 * Get the hash key for the item (it's stored in the index tuple itself).
@@ -248,31 +250,63 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 		Assert(pageopaque->hasho_bucket == bucket);
 	}
 
-	/* found page with enough space, so add the item here */
-	(void) _hash_pgaddtup(rel, buf, itemsz, itup);
-
-	/*
-	 * write and release the modified page and ensure to release the pin on
-	 * primary page.
-	 */
-	_hash_wrtbuf(rel, buf);
-	if (buf != bucket_buf)
-		_hash_dropbuf(rel, bucket_buf);
-
 	/*
 	 * Write-lock the metapage so we can increment the tuple count. After
 	 * incrementing it, check to see if it's time for a split.
 	 */
 	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
 
+	/* Do the update.  No ereport(ERROR) until changes are logged */
+	START_CRIT_SECTION();
+
+	/* found page with enough space, so add the item here */
+	itup_off = _hash_pgaddtup(rel, buf, itemsz, itup);
+
+	MarkBufferDirty(buf);
+
+	/* metapage operations */
 	metap->hashm_ntuples += 1;
 
 	/* Make sure this stays in sync with _hash_expandtable() */
 	do_expand = metap->hashm_ntuples >
 		(double) metap->hashm_ffactor * (metap->hashm_maxbucket + 1);
 
-	/* Write out the metapage and drop lock, but keep pin */
-	_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
+	MarkBufferDirty(metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_insert xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.offnum = itup_off;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInsert);
+
+		XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);
+
+		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+		XLogRegisterBufData(0, (char *) itup, IndexTupleDSize(*itup));
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INSERT);
+
+		PageSetLSN(BufferGetPage(buf), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	/* drop lock on metapage, but keep pin */
+	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+	/*
+	 * Release the modified page and ensure to release the pin on primary
+	 * page.
+	 */
+	_hash_relbuf(rel, buf);
+	if (buf != bucket_buf)
+		_hash_dropbuf(rel, bucket_buf);
 
 	/* Attempt to split if a split is needed */
 	if (do_expand)
@@ -314,3 +348,44 @@ _hash_pgaddtup(Relation rel, Buffer buf, Size itemsize, IndexTuple itup)
 
 	return itup_off;
 }
+
+/*
+ *	_hash_pgaddmultitup() -- add a tuple vector to a particular page in the
+ *							 index.
+ *
+ * This routine has same requirements for locking and tuple ordering as
+ * _hash_pgaddtup().
+ *
+ * Returns the offset number array at which the tuples were inserted.
+ */
+void
+_hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
+					OffsetNumber *itup_offsets, uint16 nitups)
+{
+	OffsetNumber itup_off;
+	Page		page;
+	uint32		hashkey;
+	int			i;
+
+	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+	page = BufferGetPage(buf);
+
+	for (i = 0; i < nitups; i++)
+	{
+		Size		itemsize;
+
+		itemsize = IndexTupleDSize(*itups[i]);
+		itemsize = MAXALIGN(itemsize);
+
+		/* Find where to insert the tuple (preserving page's hashkey ordering) */
+		hashkey = _hash_get_indextuple_hashkey(itups[i]);
+		itup_off = _hash_binsearch(page, hashkey);
+
+		itup_offsets[i] = itup_off;
+
+		if (PageAddItem(page, (Item) itups[i], itemsize, itup_off, false, false)
+			== InvalidOffsetNumber)
+			elog(ERROR, "failed to add index item to \"%s\"",
+				 RelationGetRelationName(rel));
+	}
+}
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 760563a..a10671c 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -18,10 +18,10 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "miscadmin.h"
 #include "utils/rel.h"
 
 
-static Buffer _hash_getovflpage(Relation rel, Buffer metabuf);
 static uint32 _hash_firstfreebit(uint32 map);
 
 
@@ -84,7 +84,9 @@ blkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
  *	dropped before exiting (we assume the caller is not interested in 'buf'
  *	anymore) if not asked to retain.  The pin will be retained only for the
  *	primary bucket.  The returned overflow page will be pinned and
- *	write-locked; it is guaranteed to be empty.
+ *	write-locked; it is not guaranteed to be empty as we release and reacquire
+ *	the lock on it, so the caller must ensure that it has the required space
+ *	in the page.
  *
  *	The caller must hold a pin, but no lock, on the metapage buffer.
  *	That buffer is returned in the same state.
@@ -102,88 +104,22 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 	Page		ovflpage;
 	HashPageOpaque pageopaque;
 	HashPageOpaque ovflopaque;
-
-	/* allocate and lock an empty overflow page */
-	ovflbuf = _hash_getovflpage(rel, metabuf);
-
-	/*
-	 * Write-lock the tail page.  It is okay to hold two buffer locks here
-	 * since there cannot be anyone else contending for access to ovflbuf.
-	 */
-	_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_WRITE);
-
-	/* probably redundant... */
-	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
-
-	/* loop to find current tail page, in case someone else inserted too */
-	for (;;)
-	{
-		BlockNumber nextblkno;
-
-		page = BufferGetPage(buf);
-		pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
-		nextblkno = pageopaque->hasho_nextblkno;
-
-		if (!BlockNumberIsValid(nextblkno))
-			break;
-
-		/* we assume we do not need to write the unmodified page */
-		if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
-			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
-		else
-			_hash_relbuf(rel, buf);
-
-		buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
-	}
-
-	/* now that we have correct backlink, initialize new overflow page */
-	ovflpage = BufferGetPage(ovflbuf);
-	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
-	ovflopaque->hasho_prevblkno = BufferGetBlockNumber(buf);
-	ovflopaque->hasho_nextblkno = InvalidBlockNumber;
-	ovflopaque->hasho_bucket = pageopaque->hasho_bucket;
-	ovflopaque->hasho_flag = LH_OVERFLOW_PAGE;
-	ovflopaque->hasho_page_id = HASHO_PAGE_ID;
-
-	MarkBufferDirty(ovflbuf);
-
-	/* logically chain overflow page to previous page */
-	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
-	if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
-		_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
-	else
-		_hash_wrtbuf(rel, buf);
-
-	return ovflbuf;
-}
-
-/*
- *	_hash_getovflpage()
- *
- *	Find an available overflow page and return it.  The returned buffer
- *	is pinned and write-locked, and has had _hash_pageinit() applied,
- *	but it is caller's responsibility to fill the special space.
- *
- * The caller must hold a pin, but no lock, on the metapage buffer.
- * That buffer is left in the same state at exit.
- */
-static Buffer
-_hash_getovflpage(Relation rel, Buffer metabuf)
-{
 	HashMetaPage metap;
-	Buffer		mapbuf = 0;
-	Buffer		newbuf;
+	Buffer		mapbuf = InvalidBuffer;
+	Buffer		newmapbuf = InvalidBuffer;
 	BlockNumber blkno;
 	uint32		orig_firstfree;
 	uint32		splitnum;
 	uint32	   *freep = NULL;
 	uint32		max_ovflpg;
 	uint32		bit;
+	uint32		bitmap_page_bit;
 	uint32		first_page;
 	uint32		last_bit;
 	uint32		last_page;
 	uint32		i,
 				j;
+	bool		page_found = false;
 
 	/* Get exclusive lock on the meta page */
 	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
@@ -233,11 +169,31 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 		for (; bit <= last_inpage; j++, bit += BITS_PER_MAP)
 		{
 			if (freep[j] != ALL_SET)
+			{
+				page_found = true;
+
+				/* Reacquire exclusive lock on the meta page */
+				_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
+
+				/* convert bit to bit number within page */
+				bit += _hash_firstfreebit(freep[j]);
+				bitmap_page_bit = bit;
+
+				/* convert bit to absolute bit number */
+				bit += (i << BMPG_SHIFT(metap));
+				/* Calculate address of the recycled overflow page */
+				blkno = bitno_to_blkno(metap, bit);
+
+				/* Fetch and init the recycled page */
+				ovflbuf = _hash_getinitbuf(rel, blkno);
+
 				goto found;
+			}
 		}
 
 		/* No free space here, try to advance to next map page */
 		_hash_relbuf(rel, mapbuf);
+		mapbuf = InvalidBuffer;
 		i++;
 		j = 0;					/* scan from start of next map page */
 		bit = 0;
@@ -261,8 +217,15 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 		 * convenient to pre-mark them as "in use" too.
 		 */
 		bit = metap->hashm_spares[splitnum];
-		_hash_initbitmap(rel, metap, bitno_to_blkno(metap, bit), MAIN_FORKNUM);
-		metap->hashm_spares[splitnum]++;
+		newmapbuf = _hash_getnewbuf(rel, bitno_to_blkno(metap, bit), MAIN_FORKNUM);
+
+		/* add the new bitmap page to the metapage's list of bitmaps */
+		/* metapage already has a write lock */
+		if (metap->hashm_nmaps >= HASH_MAX_BITMAPS)
+			ereport(ERROR,
+					(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+					 errmsg("out of overflow pages in hash index \"%s\"",
+							RelationGetRelationName(rel))));
 	}
 	else
 	{
@@ -282,38 +245,76 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 	 * with metapage write lock held; would be better to use a lock that
 	 * doesn't block incoming searches.
 	 */
-	newbuf = _hash_getnewbuf(rel, blkno, MAIN_FORKNUM);
+	ovflbuf = _hash_getnewbuf(rel, blkno, MAIN_FORKNUM);
 
-	metap->hashm_spares[splitnum]++;
+found:
 
 	/*
-	 * Adjust hashm_firstfree to avoid redundant searches.  But don't risk
-	 * changing it if someone moved it while we were searching bitmap pages.
+	 * Write-lock the tail page.  It is okay to hold two buffer locks here
+	 * since there cannot be anyone else contending for access to ovflbuf.
 	 */
-	if (metap->hashm_firstfree == orig_firstfree)
-		metap->hashm_firstfree = bit + 1;
+	_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_WRITE);
 
-	/* Write updated metapage and release lock, but not pin */
-	_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
+	/* probably redundant... */
+	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
 
-	return newbuf;
+	/* loop to find current tail page, in case someone else inserted too */
+	for (;;)
+	{
+		BlockNumber nextblkno;
 
-found:
-	/* convert bit to bit number within page */
-	bit += _hash_firstfreebit(freep[j]);
+		page = BufferGetPage(buf);
+		pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+		nextblkno = pageopaque->hasho_nextblkno;
 
-	/* mark page "in use" in the bitmap */
-	SETBIT(freep, bit);
-	_hash_wrtbuf(rel, mapbuf);
+		if (!BlockNumberIsValid(nextblkno))
+			break;
 
-	/* Reacquire exclusive lock on the meta page */
-	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
+		/* we assume we do not need to write the unmodified page */
+		if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+		else
+			_hash_relbuf(rel, buf);
 
-	/* convert bit to absolute bit number */
-	bit += (i << BMPG_SHIFT(metap));
+		buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
+	}
 
-	/* Calculate address of the recycled overflow page */
-	blkno = bitno_to_blkno(metap, bit);
+	/*
+	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
+	 * log the changes for bitmap page and overflow page together to avoid
+	 * loss of pages in case new page is added for them.
+	 */
+	START_CRIT_SECTION();
+
+	if (page_found)
+	{
+		Assert(BufferIsValid(mapbuf));
+
+		/* mark page "in use" in the bitmap */
+		SETBIT(freep, bitmap_page_bit);
+		MarkBufferDirty(mapbuf);
+	}
+	else
+	{
+		/* update the count to indicate new overflow page is added */
+		metap->hashm_spares[splitnum]++;
+
+		if (BufferIsValid(newmapbuf))
+		{
+			_hash_initbitmapbuffer(newmapbuf, metap->hashm_bmsize, false);
+			MarkBufferDirty(newmapbuf);
+
+			metap->hashm_mapp[metap->hashm_nmaps] = BufferGetBlockNumber(newmapbuf);
+			metap->hashm_nmaps++;
+			metap->hashm_spares[splitnum]++;
+			MarkBufferDirty(metabuf);
+		}
+
+		/*
+		 * for new overflow page, we don't need to explicitly set the bit in
+		 * bitmap page, as by default that will be set to "in use".
+		 */
+	}
 
 	/*
 	 * Adjust hashm_firstfree to avoid redundant searches.  But don't risk
@@ -322,18 +323,97 @@ found:
 	if (metap->hashm_firstfree == orig_firstfree)
 	{
 		metap->hashm_firstfree = bit + 1;
-
-		/* Write updated metapage and release lock, but not pin */
-		_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
+		MarkBufferDirty(metabuf);
 	}
-	else
+
+	/* now that we have correct backlink, initialize new overflow page */
+	ovflpage = BufferGetPage(ovflbuf);
+	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
+	ovflopaque->hasho_prevblkno = BufferGetBlockNumber(buf);
+	ovflopaque->hasho_nextblkno = InvalidBlockNumber;
+	ovflopaque->hasho_bucket = pageopaque->hasho_bucket;
+	ovflopaque->hasho_flag = LH_OVERFLOW_PAGE;
+	ovflopaque->hasho_page_id = HASHO_PAGE_ID;
+
+	MarkBufferDirty(ovflbuf);
+
+	/* logically chain overflow page to previous page */
+	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
+
+	MarkBufferDirty(buf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
 	{
-		/* We didn't change the metapage, so no need to write */
-		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+		XLogRecPtr	recptr;
+		xl_hash_addovflpage xlrec;
+
+		xlrec.bmpage_found = page_found;
+		xlrec.bmsize = metap->hashm_bmsize;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashAddOvflPage);
+
+		XLogRegisterBuffer(0, ovflbuf, REGBUF_WILL_INIT);
+		XLogRegisterBufData(0, (char *) &pageopaque->hasho_bucket, sizeof(Bucket));
+
+		XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
+
+		if (BufferIsValid(mapbuf))
+		{
+			XLogRegisterBuffer(2, mapbuf, REGBUF_STANDARD);
+			XLogRegisterBufData(2, (char *) &bitmap_page_bit, sizeof(uint32));
+		}
+
+		if (BufferIsValid(newmapbuf))
+			XLogRegisterBuffer(3, newmapbuf, REGBUF_WILL_INIT);
+
+		/*
+		 * To replay meta page changes, we can log the entire metapage which
+		 * doesn't seem advisable considering size of hash metapage or we can
+		 * log the individual updated values which seems doable, but we prefer
+		 * to perform exact operations on metapage during replay as are done
+		 * during actual operation.  That looks straightforward and has an
+		 * advantage of using less space in WAL.
+		 */
+		XLogRegisterBuffer(4, metabuf, REGBUF_STANDARD);
+		XLogRegisterBufData(4, (char *) &metap->hashm_firstfree, sizeof(uint32));
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_ADD_OVFL_PAGE);
+
+		PageSetLSN(BufferGetPage(ovflbuf), recptr);
+		PageSetLSN(BufferGetPage(buf), recptr);
+
+		if (BufferIsValid(mapbuf))
+			PageSetLSN(BufferGetPage(mapbuf), recptr);
+
+		if (BufferIsValid(newmapbuf))
+			PageSetLSN(BufferGetPage(newmapbuf), recptr);
 	}
 
-	/* Fetch, init, and return the recycled page */
-	return _hash_getinitbuf(rel, blkno);
+	END_CRIT_SECTION();
+
+	if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+		_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+	else
+		_hash_relbuf(rel, buf);
+
+	if (BufferIsValid(mapbuf))
+		_hash_relbuf(rel, mapbuf);
+
+	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+	if (BufferIsValid(newmapbuf))
+		_hash_relbuf(rel, newmapbuf);
+
+	/*
+	 * we need to release and reacquire the lock on overflow buffer to ensure
+	 * that standby shouldn't see an intermediate state of it.
+	 */
+	_hash_chgbufaccess(rel, ovflbuf, HASH_READ, HASH_NOLOCK);
+	_hash_chgbufaccess(rel, ovflbuf, HASH_NOLOCK, HASH_WRITE);
+
+	return ovflbuf;
 }
 
 /*
@@ -366,6 +446,12 @@ _hash_firstfreebit(uint32 map)
  *	Remove this overflow page from its bucket's chain, and mark the page as
  *	free.  On entry, ovflbuf is write-locked; it is released before exiting.
  *
+ *	Add the tuples (itups) to wbuf in this function, we could do that in the
+ *	caller as well.  The advantage of doing it here is we can easily write
+ *	the WAL for XLOG_HASH_SQUEEZE_PAGE operation.  Addition of tuples and
+ *	removal of overflow page has to be done as an atomic operation, otherwise
+ *	during replay on standby users might find duplicate records.
+ *
  *	Since this function is invoked in VACUUM, we provide an access strategy
  *	parameter that controls fetches of the bucket pages.
  *
@@ -377,7 +463,9 @@ _hash_firstfreebit(uint32 map)
  *	better hold cleanup lock on the primary bucket.
  */
 BlockNumber
-_hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
+_hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
+				   Buffer wbuf, IndexTuple *itups, OffsetNumber *itup_offsets,
+				   Size *tups_size, uint16 nitups,
 				   BufferAccessStrategy bstrategy)
 {
 	HashMetaPage metap;
@@ -395,6 +483,10 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
 	int32		bitmappage,
 				bitmapbit;
 	Bucket bucket PG_USED_FOR_ASSERTS_ONLY;
+	Buffer		prevbuf = InvalidBuffer;
+	Buffer		nextbuf = InvalidBuffer;
+	BlockNumber bucketblkno;
+	bool		update_metap = false;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -404,14 +496,7 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
 	nextblkno = ovflopaque->hasho_nextblkno;
 	prevblkno = ovflopaque->hasho_prevblkno;
 	bucket = ovflopaque->hasho_bucket;
-
-	/*
-	 * Zero the page for debugging's sake; then write and release it. (Note:
-	 * if we failed to zero the page here, we'd have problems with the Assert
-	 * in _hash_pageinit() when the page is reused.)
-	 */
-	MemSet(ovflpage, 0, BufferGetPageSize(ovflbuf));
-	_hash_wrtbuf(rel, ovflbuf);
+	bucketblkno = BufferGetBlockNumber(bucketbuf);
 
 	/*
 	 * Fix up the bucket chain.  this is a doubly-linked list, so we must fix
@@ -422,50 +507,25 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
 	 */
 	if (BlockNumberIsValid(prevblkno))
 	{
-		if (prevblkno == bucket_blkno)
-		{
-			Buffer		prevbuf = ReadBufferExtended(rel, MAIN_FORKNUM,
-													 prevblkno,
-													 RBM_NORMAL,
-													 bstrategy);
-
-			Page		prevpage = BufferGetPage(prevbuf);
-			HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
-
-			Assert(prevopaque->hasho_bucket == bucket);
-			prevopaque->hasho_nextblkno = nextblkno;
-			MarkBufferDirty(prevbuf);
-			ReleaseBuffer(prevbuf);
-		}
+		if (prevblkno == bucketblkno)
+			prevbuf = _hash_getbuf_with_strategy(rel,
+												 prevblkno,
+												 HASH_NOLOCK,
+												 LH_BUCKET_PAGE,
+												 bstrategy);
 		else
-		{
-			Buffer		prevbuf = _hash_getbuf_with_strategy(rel,
-															 prevblkno,
-															 HASH_WRITE,
-										   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
-															 bstrategy);
-			Page		prevpage = BufferGetPage(prevbuf);
-			HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
-
-			Assert(prevopaque->hasho_bucket == bucket);
-			prevopaque->hasho_nextblkno = nextblkno;
-			_hash_wrtbuf(rel, prevbuf);
-		}
+			prevbuf = _hash_getbuf_with_strategy(rel,
+												 prevblkno,
+												 HASH_WRITE,
+												 LH_OVERFLOW_PAGE,
+												 bstrategy);
 	}
 	if (BlockNumberIsValid(nextblkno))
-	{
-		Buffer		nextbuf = _hash_getbuf_with_strategy(rel,
-														 nextblkno,
-														 HASH_WRITE,
-														 LH_OVERFLOW_PAGE,
-														 bstrategy);
-		Page		nextpage = BufferGetPage(nextbuf);
-		HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage);
-
-		Assert(nextopaque->hasho_bucket == bucket);
-		nextopaque->hasho_prevblkno = prevblkno;
-		_hash_wrtbuf(rel, nextbuf);
-	}
+		nextbuf = _hash_getbuf_with_strategy(rel,
+											 nextblkno,
+											 HASH_WRITE,
+											 LH_OVERFLOW_PAGE,
+											 bstrategy);
 
 	/* Note: bstrategy is intentionally not used for metapage and bitmap */
 
@@ -491,60 +551,185 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
 	mappage = BufferGetPage(mapbuf);
 	freep = HashPageGetBitmap(mappage);
 	Assert(ISSET(freep, bitmapbit));
-	CLRBIT(freep, bitmapbit);
-	_hash_wrtbuf(rel, mapbuf);
 
 	/* Get write-lock on metapage to update firstfree */
 	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
 
+	/* This operation needs to log multiple tuples, prepare WAL for that */
+	if (RelationNeedsWAL(rel))
+		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, XLR_NORMAL_RDATAS + nitups);
+
+	START_CRIT_SECTION();
+
+	/*
+	 * we have to insert tuples on the "write" page, being careful to preserve
+	 * hashkey ordering.  (If we insert many tuples into the same "write" page
+	 * it would be worth qsort'ing them).
+	 */
+	if (nitups > 0)
+	{
+		_hash_pgaddmultitup(rel, wbuf, itups, itup_offsets, nitups);
+		MarkBufferDirty(wbuf);
+	}
+
+	/*
+	 * Zero the page for debugging's sake; then write it.
+	 */
+	MemSet(ovflpage, 0, BufferGetPageSize(ovflbuf));
+	MarkBufferDirty(ovflbuf);
+
+	if (BufferIsValid(prevbuf))
+	{
+		Page		prevpage = BufferGetPage(prevbuf);
+		HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+		Assert(prevopaque->hasho_bucket == bucket);
+		prevopaque->hasho_nextblkno = nextblkno;
+		MarkBufferDirty(prevbuf);
+	}
+
+	if (BufferIsValid(nextbuf))
+	{
+		Page		nextpage = BufferGetPage(nextbuf);
+		HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage);
+
+		Assert(nextopaque->hasho_bucket == bucket);
+		nextopaque->hasho_prevblkno = prevblkno;
+		MarkBufferDirty(nextbuf);
+	}
+
+	CLRBIT(freep, bitmapbit);
+	MarkBufferDirty(mapbuf);
+
 	/* if this is now the first free page, update hashm_firstfree */
 	if (ovflbitno < metap->hashm_firstfree)
 	{
 		metap->hashm_firstfree = ovflbitno;
-		_hash_wrtbuf(rel, metabuf);
+		update_metap = true;
+		MarkBufferDirty(metabuf);
 	}
-	else
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
 	{
-		/* no need to change metapage */
-		_hash_relbuf(rel, metabuf);
+		xl_hash_squeeze_page xlrec;
+		XLogRecPtr	recptr;
+		int			i;
+
+		xlrec.prevblkno = prevblkno;
+		xlrec.nextblkno = nextblkno;
+		xlrec.ntups = nitups;
+		xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf) ? true : false;
+		xlrec.is_prev_bucket_same_wrt = (wbuf == prevbuf) ? true : false;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashSqueezePage);
+
+		/*
+		 * bucket buffer needs to be registered to ensure that we can acquire
+		 * a cleanup lock on it during replay.
+		 */
+		if (!xlrec.is_prim_bucket_same_wrt)
+			XLogRegisterBuffer(0, bucketbuf, REGBUF_STANDARD);
+
+		XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
+		if (xlrec.ntups > 0)
+		{
+			XLogRegisterBufData(1, (char *) itup_offsets,
+								nitups * sizeof(OffsetNumber));
+			for (i = 0; i < nitups; i++)
+				XLogRegisterBufData(1, (char *) itups[i], tups_size[i]);
+		}
+
+		XLogRegisterBuffer(2, ovflbuf, REGBUF_STANDARD);
+
+		/*
+		 * If prevpage and the writepage (block in which we are moving tuples
+		 * from overflow) are same, then no need to separately register
+		 * prevpage.  During replay, we can directly update the nextblock in
+		 * writepage.
+		 */
+		if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt)
+			XLogRegisterBuffer(3, prevbuf, REGBUF_STANDARD);
+
+		if (BufferIsValid(nextbuf))
+			XLogRegisterBuffer(4, nextbuf, REGBUF_STANDARD);
+
+		XLogRegisterBuffer(5, mapbuf, REGBUF_STANDARD);
+		XLogRegisterBufData(5, (char *) &bitmapbit, sizeof(uint32));
+
+		if (update_metap)
+		{
+			XLogRegisterBuffer(6, metabuf, REGBUF_STANDARD);
+			XLogRegisterBufData(6, (char *) &metap->hashm_firstfree, sizeof(uint32));
+		}
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SQUEEZE_PAGE);
+
+		PageSetLSN(BufferGetPage(wbuf), recptr);
+		PageSetLSN(BufferGetPage(ovflbuf), recptr);
+
+		if (BufferIsValid(prevbuf))
+			PageSetLSN(BufferGetPage(prevbuf), recptr);
+		if (BufferIsValid(nextbuf))
+			PageSetLSN(BufferGetPage(nextbuf), recptr);
+
+		PageSetLSN(BufferGetPage(mapbuf), recptr);
+
+		if (update_metap)
+			PageSetLSN(BufferGetPage(metabuf), recptr);
 	}
 
+	END_CRIT_SECTION();
+
+	/*
+	 * Release the lock, caller will decide whether to release the pin or
+	 * reacquire the lock.  Caller is responsible for locking and unlocking
+	 * bucket buffer.
+	 */
+	if (BufferGetBlockNumber(wbuf) != bucketblkno)
+		_hash_chgbufaccess(rel, wbuf, HASH_READ, HASH_NOLOCK);
+
+	if (BufferIsValid(ovflbuf))
+		_hash_relbuf(rel, ovflbuf);
+
+	/* release buffer for non-primary bucket and pin for primary bucket. */
+	if (BufferIsValid(prevbuf) && prevblkno == bucketblkno)
+		_hash_dropbuf(rel, prevbuf);
+	else
+		_hash_relbuf(rel, prevbuf);
+
+	if (BufferIsValid(nextbuf))
+		_hash_relbuf(rel, nextbuf);
+
+	if (BufferIsValid(mapbuf))
+		_hash_relbuf(rel, mapbuf);
+
+	if (BufferIsValid(metabuf))
+		_hash_relbuf(rel, metabuf);
+
 	return nextblkno;
 }
 
-
 /*
- *	_hash_initbitmap()
- *
- *	 Initialize a new bitmap page.  The metapage has a write-lock upon
- *	 entering the function, and must be written by caller after return.
+ *	_hash_initbitmapbuffer()
  *
- * 'blkno' is the block number of the new bitmap page.
- *
- * All bits in the new bitmap page are set to "1", indicating "in use".
+ *	 Initialize a new bitmap page.  All bits in the new bitmap page are set to
+ *	 "1", indicating "in use".
  */
 void
-_hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
-				 ForkNumber forkNum)
+_hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage)
 {
-	Buffer		buf;
 	Page		pg;
 	HashPageOpaque op;
 	uint32	   *freep;
 
-	/*
-	 * It is okay to write-lock the new bitmap page while holding metapage
-	 * write lock, because no one else could be contending for the new page.
-	 * Also, the metapage lock makes it safe to extend the index using
-	 * _hash_getnewbuf.
-	 *
-	 * There is some loss of concurrency in possibly doing I/O for the new
-	 * page while holding the metapage lock, but this path is taken so seldom
-	 * that it's not worth worrying about.
-	 */
-	buf = _hash_getnewbuf(rel, blkno, forkNum);
 	pg = BufferGetPage(buf);
 
+	/* initialize the page */
+	if (initpage)
+		_hash_pageinit(pg, BufferGetPageSize(buf));
+
 	/* initialize the page's special space */
 	op = (HashPageOpaque) PageGetSpecialPointer(pg);
 	op->hasho_prevblkno = InvalidBlockNumber;
@@ -555,25 +740,9 @@ _hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
 
 	/* set all of the bits to 1 */
 	freep = HashPageGetBitmap(pg);
-	MemSet(freep, 0xFF, BMPGSZ_BYTE(metap));
-
-	/* write out the new bitmap page (releasing write lock and pin) */
-	_hash_wrtbuf(rel, buf);
-
-	/* add the new bitmap page to the metapage's list of bitmaps */
-	/* metapage already has a write lock */
-	if (metap->hashm_nmaps >= HASH_MAX_BITMAPS)
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("out of overflow pages in hash index \"%s\"",
-						RelationGetRelationName(rel))));
-
-	metap->hashm_mapp[metap->hashm_nmaps] = blkno;
-
-	metap->hashm_nmaps++;
+	MemSet(freep, 0xFF, bmsize);
 }
 
-
 /*
  *	_hash_squeezebucket(rel, bucket)
  *
@@ -613,8 +782,6 @@ _hash_squeezebucket(Relation rel,
 	Page		rpage;
 	HashPageOpaque wopaque;
 	HashPageOpaque ropaque;
-	bool		wbuf_dirty;
-	bool		release_buf = false;
 
 	/*
 	 * start squeezing into the base bucket page.
@@ -657,13 +824,22 @@ _hash_squeezebucket(Relation rel,
 	/*
 	 * squeeze the tuples.
 	 */
-	wbuf_dirty = false;
 	for (;;)
 	{
 		OffsetNumber roffnum;
 		OffsetNumber maxroffnum;
 		OffsetNumber deletable[MaxOffsetNumber];
-		int			ndeletable = 0;
+		IndexTuple	itups[MaxIndexTuplesPerPage];
+		Size		tups_size[MaxIndexTuplesPerPage];
+		IndexTuple	itup;
+		OffsetNumber *itup_offsets;
+		uint16		ndeletable = 0;
+		uint16		nitups = 0;
+		Size		all_tups_size = 0;
+		Size		itemsz;
+		int			i;
+
+		itup_offsets = (OffsetNumber *) palloc(MaxIndexTuplesPerPage * sizeof(OffsetNumber));
 
 		/* Scan each tuple in "read" page */
 		maxroffnum = PageGetMaxOffsetNumber(rpage);
@@ -671,9 +847,6 @@ _hash_squeezebucket(Relation rel,
 			 roffnum <= maxroffnum;
 			 roffnum = OffsetNumberNext(roffnum))
 		{
-			IndexTuple	itup;
-			Size		itemsz;
-
 			itup = (IndexTuple) PageGetItem(rpage,
 											PageGetItemId(rpage, roffnum));
 			itemsz = IndexTupleDSize(*itup);
@@ -681,39 +854,111 @@ _hash_squeezebucket(Relation rel,
 
 			/*
 			 * Walk up the bucket chain, looking for a page big enough for
-			 * this item.  Exit if we reach the read page.
+			 * this item and all other accumulated items.  Exit if we reach
+			 * the read page.
 			 */
-			while (PageGetFreeSpace(wpage) < itemsz)
+			while (PageGetFreeSpaceForMulTups(wpage, nitups + 1) < (all_tups_size + itemsz))
 			{
+				bool		release_wbuf = false;
+
 				Assert(!PageIsEmpty(wpage));
 
+				/*
+				 * caller is responsible for locking and unlocking bucket
+				 * buffer
+				 */
 				if (wblkno != bucket_blkno)
-					release_buf = true;
+					release_wbuf = true;
 
 				wblkno = wopaque->hasho_nextblkno;
 				Assert(BlockNumberIsValid(wblkno));
 
-				if (wbuf_dirty && release_buf)
-					_hash_wrtbuf(rel, wbuf);
-				else if (wbuf_dirty)
+				if (nitups > 0)
+				{
+					Assert(nitups == ndeletable);
+
+					/*
+					 * This operation needs to log multiple tuples, prepare
+					 * WAL for that.
+					 */
+					if (RelationNeedsWAL(rel))
+						XLogEnsureRecordSpace(0, XLR_NORMAL_RDATAS + nitups);
+
+					START_CRIT_SECTION();
+
+					/*
+					 * we have to insert tuples on the "write" page, being
+					 * careful to preserve hashkey ordering.  (If we insert
+					 * many tuples into the same "write" page it would be
+					 * worth qsort'ing them).
+					 */
+					_hash_pgaddmultitup(rel, wbuf, itups, itup_offsets, nitups);
 					MarkBufferDirty(wbuf);
-				else if (release_buf)
-					_hash_relbuf(rel, wbuf);
+
+					/* Delete tuples we already moved off read page */
+					PageIndexMultiDelete(rpage, deletable, ndeletable);
+					MarkBufferDirty(rbuf);
+
+					/* XLOG stuff */
+					if (RelationNeedsWAL(rel))
+					{
+						XLogRecPtr	recptr;
+						xl_hash_move_page_contents xlrec;
+
+						xlrec.ntups = nitups;
+						xlrec.is_prim_bucket_same_wrt = (wbuf == bucket_buf) ? true : false;
+
+						XLogBeginInsert();
+						XLogRegisterData((char *) &xlrec, SizeOfHashMovePageContents);
+
+						/*
+						 * bucket buffer needs to be registered to ensure that
+						 * we can acquire a cleanup lock on it during replay.
+						 */
+						if (!xlrec.is_prim_bucket_same_wrt)
+							XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+						XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
+						XLogRegisterBufData(1, (char *) itup_offsets,
+											nitups * sizeof(OffsetNumber));
+						for (i = 0; i < nitups; i++)
+							XLogRegisterBufData(1, (char *) itups[i], tups_size[i]);
+
+						XLogRegisterBuffer(2, rbuf, REGBUF_STANDARD);
+						XLogRegisterBufData(2, (char *) deletable,
+										  ndeletable * sizeof(OffsetNumber));
+
+						recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_MOVE_PAGE_CONTENTS);
+
+						PageSetLSN(BufferGetPage(wbuf), recptr);
+						PageSetLSN(BufferGetPage(rbuf), recptr);
+					}
+
+					END_CRIT_SECTION();
+
+					if (release_wbuf)
+						_hash_relbuf(rel, wbuf);
+
+					/*
+					 * We need to release and if required reacquire the lock
+					 * on rbuf to ensure that standby shouldn't see an
+					 * intermediate state of it.  If we don't release the
+					 * lock, after replay of XLOG_HASH_SQUEEZE_PAGE on standby
+					 * users will be able to view the results of partial
+					 * deletion on rblkno.
+					 */
+					_hash_chgbufaccess(rel, rbuf, HASH_READ, HASH_NOLOCK);
+				}
 
 				/* nothing more to do if we reached the read page */
 				if (rblkno == wblkno)
 				{
-					if (ndeletable > 0)
-					{
-						/* Delete tuples we already moved off read page */
-						PageIndexMultiDelete(rpage, deletable, ndeletable);
-						_hash_wrtbuf(rel, rbuf);
-					}
-					else
-						_hash_relbuf(rel, rbuf);
+					_hash_dropbuf(rel, rbuf);
 					return;
 				}
 
+				_hash_chgbufaccess(rel, rbuf, HASH_NOLOCK, HASH_WRITE);
+
 				wbuf = _hash_getbuf_with_strategy(rel,
 												  wblkno,
 												  HASH_WRITE,
@@ -722,21 +967,26 @@ _hash_squeezebucket(Relation rel,
 				wpage = BufferGetPage(wbuf);
 				wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
 				Assert(wopaque->hasho_bucket == bucket);
-				wbuf_dirty = false;
-				release_buf = false;
-			}
 
-			/*
-			 * we have found room so insert on the "write" page, being careful
-			 * to preserve hashkey ordering.  (If we insert many tuples into
-			 * the same "write" page it would be worth qsort'ing instead of
-			 * doing repeated _hash_pgaddtup.)
-			 */
-			(void) _hash_pgaddtup(rel, wbuf, itemsz, itup);
-			wbuf_dirty = true;
+				/* be tidy */
+				for (i = 0; i < nitups; i++)
+					pfree(itups[i]);
+				nitups = 0;
+				all_tups_size = 0;
+				ndeletable = 0;
+			}
 
 			/* remember tuple for deletion from "read" page */
 			deletable[ndeletable++] = roffnum;
+
+			/*
+			 * we need a copy of index tuples as they can be freed as part of
+			 * overflow page, however we need them to write a WAL record in
+			 * _hash_freeovflpage.
+			 */
+			itups[nitups] = CopyIndexTuple(itup);
+			tups_size[nitups++] = itemsz;
+			all_tups_size += itemsz;
 		}
 
 		/*
@@ -748,34 +998,36 @@ _hash_squeezebucket(Relation rel,
 		 * Tricky point here: if our read and write pages are adjacent in the
 		 * bucket chain, our write lock on wbuf will conflict with
 		 * _hash_freeovflpage's attempt to update the sibling links of the
-		 * removed page.  However, in that case we are done anyway, so we can
-		 * simply drop the write lock before calling _hash_freeovflpage.
+		 * removed page.  However, in that case we are ensuring that
+		 * _hash_freeovflpage doesn't take lock on that page again.  Releasing
+		 * the lock is not an option, because before that we need to write WAL
+		 * for the change in this page.
 		 */
 		rblkno = ropaque->hasho_prevblkno;
 		Assert(BlockNumberIsValid(rblkno));
 
+		/* free this overflow page (releases rbuf) */
+		_hash_freeovflpage(rel, bucket_buf, rbuf, wbuf, itups, itup_offsets,
+						   tups_size, nitups, bstrategy);
+
+		/* be tidy */
+		for (i = 0; i < nitups; i++)
+			pfree(itups[i]);
+
+		pfree(itup_offsets);
+
 		/* are we freeing the page adjacent to wbuf? */
 		if (rblkno == wblkno)
 		{
+			/* Caller is responsible for locking and unlocking bucket buffer */
 			if (wblkno != bucket_blkno)
-				release_buf = true;
-
-			/* yes, so release wbuf lock first if needed */
-			if (wbuf_dirty && release_buf)
-				_hash_wrtbuf(rel, wbuf);
-			else if (wbuf_dirty)
-				MarkBufferDirty(wbuf);
-			else if (release_buf)
-				_hash_relbuf(rel, wbuf);
-
-			/* free this overflow page (releases rbuf) */
-			_hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
-			/* done */
+				_hash_dropbuf(rel, wbuf);
 			return;
 		}
 
-		/* free this overflow page, then get the previous one */
-		_hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
+		/* Caller is responsible for locking and unlocking bucket buffer */
+		if (wblkno != bucket_blkno)
+			_hash_chgbufaccess(rel, wbuf, HASH_NOLOCK, HASH_WRITE);
 
 		rbuf = _hash_getbuf_with_strategy(rel,
 										  rblkno,
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index bb43aaa..5a285be 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -40,12 +40,11 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
 				  Bucket obucket, Bucket nbucket,
 				  Buffer obuf,
 				  Buffer nbuf,
+				  HTAB *htab,
 				  uint32 maxbucket,
 				  uint32 highmask, uint32 lowmask);
-static void _hash_splitbucket_guts(Relation rel, Buffer metabuf,
-					   Bucket obucket, Bucket nbucket, Buffer obuf,
-					   Buffer nbuf, HTAB *htab, uint32 maxbucket,
-					   uint32 highmask, uint32 lowmask);
+static void log_split_page(Relation rel, Buffer buf);
 
 
 /*
@@ -160,6 +159,29 @@ _hash_getinitbuf(Relation rel, BlockNumber blkno)
 }
 
 /*
+ *	_hash_initbuf() -- Get and initialize a buffer by bucket number.
+ */
+void
+_hash_initbuf(Buffer buf, uint32 num_bucket, uint32 flag, bool initpage)
+{
+	HashPageOpaque pageopaque;
+	Page		page;
+
+	page = BufferGetPage(buf);
+
+	/* initialize the page */
+	if (initpage)
+		_hash_pageinit(page, BufferGetPageSize(buf));
+
+	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+	pageopaque->hasho_prevblkno = InvalidBlockNumber;
+	pageopaque->hasho_nextblkno = InvalidBlockNumber;
+	pageopaque->hasho_bucket = num_bucket;
+	pageopaque->hasho_flag = flag;
+	pageopaque->hasho_page_id = HASHO_PAGE_ID;
+}
+
+/*
  *	_hash_getnewbuf() -- Get a new page at the end of the index.
  *
  *		This has the same API as _hash_getinitbuf, except that we are adding
@@ -286,25 +308,6 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 }
 
 /*
- *	_hash_wrtbuf() -- write a hash page to disk.
- *
- *		This routine releases the lock held on the buffer and our refcount
- *		for it.  It is an error to call _hash_wrtbuf() without a write lock
- *		and a pin on the buffer.
- *
- * NOTE: this routine should go away when/if hash indexes are WAL-ified.
- * The correct sequence of operations is to mark the buffer dirty, then
- * write the WAL record, then release the lock and pin; so marking dirty
- * can't be combined with releasing.
- */
-void
-_hash_wrtbuf(Relation rel, Buffer buf)
-{
-	MarkBufferDirty(buf);
-	UnlockReleaseBuffer(buf);
-}
-
-/*
  * _hash_chgbufaccess() -- Change the lock type on a buffer, without
  *			dropping our pin on it.
  *
@@ -332,7 +335,7 @@ _hash_chgbufaccess(Relation rel,
 
 
 /*
- *	_hash_metapinit() -- Initialize the metadata page of a hash index,
+ *	_hash_init() -- Initialize the metadata page of a hash index,
  *				the initial buckets, and the initial bitmap page.
  *
  * The initial number of buckets is dependent on num_tuples, an estimate
@@ -344,19 +347,18 @@ _hash_chgbufaccess(Relation rel,
  * multiple buffer locks is ignored.
  */
 uint32
-_hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
+_hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
 {
-	HashMetaPage metap;
-	HashPageOpaque pageopaque;
 	Buffer		metabuf;
 	Buffer		buf;
+	Buffer		bitmapbuf;
 	Page		pg;
+	HashMetaPage metap;
+	RegProcedure procid;
 	int32		data_width;
 	int32		item_width;
 	int32		ffactor;
-	double		dnumbuckets;
 	uint32		num_buckets;
-	uint32		log2_num_buckets;
 	uint32		i;
 
 	/* safety check */
@@ -378,6 +380,154 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	if (ffactor < 10)
 		ffactor = 10;
 
+	procid = index_getprocid(rel, 1, HASHPROC);
+
+	/*
+	 * We initialize the metapage, the first N bucket pages, and the first
+	 * bitmap page in sequence, using _hash_getnewbuf to cause smgrextend()
+	 * calls to occur.  This ensures that the smgr level has the right idea of
+	 * the physical index length.
+	 *
+	 * Critical section not required, because on error the creation of the
+	 * whole relation will be rolled back.
+	 */
+	metabuf = _hash_getnewbuf(rel, HASH_METAPAGE, forkNum);
+	_hash_init_metabuffer(metabuf, num_tuples, procid, ffactor, false);
+	MarkBufferDirty(metabuf);
+
+	pg = BufferGetPage(metabuf);
+	metap = HashPageGetMeta(pg);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_init_meta_page xlrec;
+		XLogRecPtr	recptr;
+
+		/*
+		 * Here, we could have copied the entire metapage into the WAL record
+		 * and restored it as-is during replay.  However, the metapage is not
+		 * small (more than 600 bytes), so we record just the information
+		 * needed to reconstruct it.
+		 */
+		xlrec.num_tuples = num_tuples;
+		xlrec.procid = metap->hashm_procid;
+		xlrec.ffactor = metap->hashm_ffactor;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInitMetaPage);
+		XLogRegisterBuffer(0, metabuf, REGBUF_WILL_INIT);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INIT_META_PAGE);
+
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	num_buckets = metap->hashm_maxbucket + 1;
+
+	/*
+	 * Release buffer lock on the metapage while we initialize buckets.
+	 * Otherwise, we'll be in interrupt holdoff and the CHECK_FOR_INTERRUPTS
+	 * won't accomplish anything.  It's a bad idea to hold buffer locks for
+	 * long intervals in any case, since that can block the bgwriter.
+	 */
+	_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
+
+	/*
+	 * Initialize and WAL-log the first N buckets
+	 */
+	for (i = 0; i < num_buckets; i++)
+	{
+		BlockNumber blkno;
+
+		/* Allow interrupts, in case N is huge */
+		CHECK_FOR_INTERRUPTS();
+
+		blkno = BUCKET_TO_BLKNO(metap, i);
+		buf = _hash_getnewbuf(rel, blkno, forkNum);
+		_hash_initbuf(buf, i, LH_BUCKET_PAGE, false);
+		MarkBufferDirty(buf);
+
+		log_newpage(&rel->rd_node,
+					forkNum,
+					blkno,
+					BufferGetPage(buf),
+					true);
+		_hash_relbuf(rel, buf);
+	}
+
+	/* Now reacquire buffer lock on metapage */
+	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
+
+	/*
+	 * Initialize bitmap page
+	 */
+	bitmapbuf = _hash_getnewbuf(rel, num_buckets + 1, forkNum);
+	_hash_initbitmapbuffer(bitmapbuf, metap->hashm_bmsize, false);
+	MarkBufferDirty(bitmapbuf);
+
+	/* add the new bitmap page to the metapage's list of bitmaps */
+	/* metapage already has a write lock */
+	if (metap->hashm_nmaps >= HASH_MAX_BITMAPS)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+				 errmsg("out of overflow pages in hash index \"%s\"",
+						RelationGetRelationName(rel))));
+
+	metap->hashm_mapp[metap->hashm_nmaps] = num_buckets + 1;
+
+	metap->hashm_nmaps++;
+	MarkBufferDirty(metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_init_bitmap_page xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.bmsize = metap->hashm_bmsize;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInitBitmapPage);
+		XLogRegisterBuffer(0, bitmapbuf, REGBUF_WILL_INIT);
+
+		/*
+		 * We could log the exact changes made to the metapage, but since no
+		 * concurrent operation can see the index during the replay of this
+		 * record, we simply perform the same operations during replay as we
+		 * do here.
+		 */
+		XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INIT_BITMAP_PAGE);
+
+		PageSetLSN(BufferGetPage(bitmapbuf), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	/* all done */
+	_hash_relbuf(rel, bitmapbuf);
+	_hash_relbuf(rel, metabuf);
+
+	return num_buckets;
+}
+
+/*
+ *	_hash_init_metabuffer() -- Initialize the metadata page of a hash index.
+ */
+void
+_hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
+					  uint16 ffactor, bool initpage)
+{
+	HashMetaPage metap;
+	HashPageOpaque pageopaque;
+	Page		page;
+	double		dnumbuckets;
+	uint32		num_buckets;
+	uint32		log2_num_buckets;
+	uint32		i;
+
+
 	/*
 	 * Choose the number of initial bucket pages to match the fill factor
 	 * given the estimated number of tuples.  We round up the result to the
@@ -397,30 +547,25 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	Assert(num_buckets == (((uint32) 1) << log2_num_buckets));
 	Assert(log2_num_buckets < HASH_MAX_SPLITPOINTS);
 
-	/*
-	 * We initialize the metapage, the first N bucket pages, and the first
-	 * bitmap page in sequence, using _hash_getnewbuf to cause smgrextend()
-	 * calls to occur.  This ensures that the smgr level has the right idea of
-	 * the physical index length.
-	 */
-	metabuf = _hash_getnewbuf(rel, HASH_METAPAGE, forkNum);
-	pg = BufferGetPage(metabuf);
+	page = BufferGetPage(buf);
+	if (initpage)
+		_hash_pageinit(page, BufferGetPageSize(buf));
 
-	pageopaque = (HashPageOpaque) PageGetSpecialPointer(pg);
+	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	pageopaque->hasho_prevblkno = InvalidBlockNumber;
 	pageopaque->hasho_nextblkno = InvalidBlockNumber;
 	pageopaque->hasho_bucket = -1;
 	pageopaque->hasho_flag = LH_META_PAGE;
 	pageopaque->hasho_page_id = HASHO_PAGE_ID;
 
-	metap = HashPageGetMeta(pg);
+	metap = HashPageGetMeta(page);
 
 	metap->hashm_magic = HASH_MAGIC;
 	metap->hashm_version = HASH_VERSION;
 	metap->hashm_ntuples = 0;
 	metap->hashm_nmaps = 0;
 	metap->hashm_ffactor = ffactor;
-	metap->hashm_bsize = HashGetMaxBitmapSize(pg);
+	metap->hashm_bsize = HashGetMaxBitmapSize(page);
 	/* find largest bitmap array size that will fit in page size */
 	for (i = _hash_log2(metap->hashm_bsize); i > 0; --i)
 	{
@@ -437,7 +582,7 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	 * pretty useless for normal operation (in fact, hashm_procid is not used
 	 * anywhere), but it might be handy for forensic purposes so we keep it.
 	 */
-	metap->hashm_procid = index_getprocid(rel, 1, HASHPROC);
+	metap->hashm_procid = procid;
 
 	/*
 	 * We initialize the index with N buckets, 0 .. N-1, occupying physical
@@ -456,44 +601,11 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	metap->hashm_firstfree = 0;
 
 	/*
-	 * Release buffer lock on the metapage while we initialize buckets.
-	 * Otherwise, we'll be in interrupt holdoff and the CHECK_FOR_INTERRUPTS
-	 * won't accomplish anything.  It's a bad idea to hold buffer locks for
-	 * long intervals in any case, since that can block the bgwriter.
-	 */
-	_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
-
-	/*
-	 * Initialize the first N buckets
+	 * Set pd_lower just past the end of the metadata.  This allows
+	 * xloginsert.c to treat the rest of the page as a hole when logging a
+	 * full-page image of the metapage.
 	 */
-	for (i = 0; i < num_buckets; i++)
-	{
-		/* Allow interrupts, in case N is huge */
-		CHECK_FOR_INTERRUPTS();
-
-		buf = _hash_getnewbuf(rel, BUCKET_TO_BLKNO(metap, i), forkNum);
-		pg = BufferGetPage(buf);
-		pageopaque = (HashPageOpaque) PageGetSpecialPointer(pg);
-		pageopaque->hasho_prevblkno = InvalidBlockNumber;
-		pageopaque->hasho_nextblkno = InvalidBlockNumber;
-		pageopaque->hasho_bucket = i;
-		pageopaque->hasho_flag = LH_BUCKET_PAGE;
-		pageopaque->hasho_page_id = HASHO_PAGE_ID;
-		_hash_wrtbuf(rel, buf);
-	}
-
-	/* Now reacquire buffer lock on metapage */
-	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
-
-	/*
-	 * Initialize first bitmap page
-	 */
-	_hash_initbitmap(rel, metap, num_buckets + 1, forkNum);
-
-	/* all done */
-	_hash_wrtbuf(rel, metabuf);
-
-	return num_buckets;
+	((PageHeader) page)->pd_lower =
+		((char *) metap + sizeof(HashMetaPageData)) - (char *) page;
 }
 
 /*
@@ -502,7 +614,6 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 void
 _hash_pageinit(Page page, Size size)
 {
-	Assert(PageIsNew(page));
 	PageInit(page, size, sizeof(HashPageOpaqueData));
 }
 
@@ -530,10 +641,14 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	Buffer		buf_nblkno;
 	Buffer		buf_oblkno;
 	Page		opage;
+	Page		npage;
 	HashPageOpaque oopaque;
+	HashPageOpaque nopaque;
 	uint32		maxbucket;
 	uint32		highmask;
 	uint32		lowmask;
+	bool		metap_update_masks = false;
+	bool		metap_update_splitpoint = false;
 
 restart_expand:
 
@@ -569,7 +684,7 @@ restart_expand:
 	 * than a disk block then this would be an independent constraint.
 	 *
 	 * If you change this, see also the maximum initial number of buckets in
-	 * _hash_metapinit().
+	 * _hash_init().
 	 */
 	if (metap->hashm_maxbucket >= (uint32) 0x7FFFFFFE)
 		goto fail;
@@ -703,7 +818,11 @@ restart_expand:
 		 * The number of buckets in the new splitpoint is equal to the total
 		 * number already in existence, i.e. new_bucket.  Currently this maps
 		 * one-to-one to blocks required, but someday we may need a more
-		 * complicated calculation here.
+		 * complicated calculation here.  We treat allocation of the buckets
+		 * as a separate WAL-logged action.  Even if we fail right after it,
+		 * the space won't be leaked on the standby, since the next split
+		 * will consume it.  In any case, even without a failure we don't use
+		 * all of the space in one split operation.
 		 */
 		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
 		{
@@ -728,18 +847,44 @@ restart_expand:
 		goto fail;
 	}
 
-
 	/*
-	 * Okay to proceed with split.  Update the metapage bucket mapping info.
-	 *
-	 * Since we are scribbling on the metapage data right in the shared
-	 * buffer, any failure in this next little bit leaves us with a big
+	 * Since we are scribbling on the pages in the shared buffers, establish a
+	 * critical section.  Any failure in this next code leaves us with a big
 	 * problem: the metapage is effectively corrupt but could get written back
-	 * to disk.  We don't really expect any failure, but just to be sure,
-	 * establish a critical section.
+	 * to disk.
 	 */
 	START_CRIT_SECTION();
 
+	opage = BufferGetPage(buf_oblkno);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	/*
+	 * Mark the old bucket to indicate that split is in progress and it has
+	 * deletable tuples. At operation end, we clear split in progress flag and
+	 * vacuum will clear page_has_garbage flag after deleting such tuples.
+	 */
+	oopaque->hasho_flag |= LH_BUCKET_PAGE_HAS_GARBAGE | LH_BUCKET_OLD_PAGE_SPLIT;
+
+	MarkBufferDirty(buf_oblkno);
+
+	npage = BufferGetPage(buf_nblkno);
+
+	/*
+	 * initialize the new bucket's primary page and mark it to indicate that
+	 * split is in progress.
+	 */
+	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	nopaque->hasho_prevblkno = InvalidBlockNumber;
+	nopaque->hasho_nextblkno = InvalidBlockNumber;
+	nopaque->hasho_bucket = new_bucket;
+	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_NEW_PAGE_SPLIT;
+	nopaque->hasho_page_id = HASHO_PAGE_ID;
+
+	MarkBufferDirty(buf_nblkno);
+
+	/*
+	 * Okay to proceed with split.  Update the metapage bucket mapping info.
+	 */
 	metap->hashm_maxbucket = new_bucket;
 
 	if (new_bucket > metap->hashm_highmask)
@@ -747,6 +892,7 @@ restart_expand:
 		/* Starting a new doubling */
 		metap->hashm_lowmask = metap->hashm_highmask;
 		metap->hashm_highmask = new_bucket | metap->hashm_lowmask;
+		metap_update_masks = true;
 	}
 
 	/*
@@ -759,10 +905,10 @@ restart_expand:
 	{
 		metap->hashm_spares[spare_ndx] = metap->hashm_spares[metap->hashm_ovflpoint];
 		metap->hashm_ovflpoint = spare_ndx;
+		metap_update_splitpoint = true;
 	}
 
-	/* Done mucking with metapage */
-	END_CRIT_SECTION();
+	MarkBufferDirty(metabuf);
 
 	/*
 	 * Copy bucket mapping info now; this saves re-accessing the meta page
@@ -775,15 +921,71 @@ restart_expand:
 	highmask = metap->hashm_highmask;
 	lowmask = metap->hashm_lowmask;
 
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_split_allocpage xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.new_bucket = maxbucket;
+		xlrec.old_bucket_flag = oopaque->hasho_flag;
+		xlrec.new_bucket_flag = nopaque->hasho_flag;
+		xlrec.flags = 0;
+
+		XLogBeginInsert();
+
+		XLogRegisterBuffer(0, buf_oblkno, REGBUF_STANDARD);
+		XLogRegisterBuffer(1, buf_nblkno, REGBUF_WILL_INIT);
+		XLogRegisterBuffer(2, metabuf, REGBUF_STANDARD);
+
+		if (metap_update_masks)
+		{
+			xlrec.flags |= XLH_SPLIT_META_UPDATE_MASKS;
+			XLogRegisterBufData(2, (char *) &metap->hashm_lowmask, sizeof(uint32));
+			XLogRegisterBufData(2, (char *) &metap->hashm_highmask, sizeof(uint32));
+		}
+
+		if (metap_update_splitpoint)
+		{
+			xlrec.flags |= XLH_SPLIT_META_UPDATE_SPLITPOINT;
+			XLogRegisterBufData(2, (char *) &metap->hashm_ovflpoint,
+								sizeof(uint32));
+			XLogRegisterBufData(2,
+					   (char *) &metap->hashm_spares[metap->hashm_ovflpoint],
+								sizeof(uint32));
+		}
+
+		XLogRegisterData((char *) &xlrec, SizeOfHashSplitAllocPage);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_ALLOCATE_PAGE);
+
+		PageSetLSN(BufferGetPage(buf_oblkno), recptr);
+		PageSetLSN(BufferGetPage(buf_nblkno), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
 	/* Write out the metapage and drop lock, but keep pin */
-	_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
+	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+	/*
+	 * We need to release and reacquire the lock on the new bucket's buffer
+	 * to ensure that the standby doesn't see an intermediate state of it.
+	 */
+	_hash_chgbufaccess(rel, buf_nblkno, HASH_READ, HASH_NOLOCK);
+	_hash_chgbufaccess(rel, buf_nblkno, HASH_NOLOCK, HASH_WRITE);
 
 	/* Relocate records to the new bucket */
 	_hash_splitbucket(rel, metabuf,
 					  old_bucket, new_bucket,
-					  buf_oblkno, buf_nblkno,
+					  buf_oblkno, buf_nblkno, NULL,
 					  maxbucket, highmask, lowmask);
 
+	/* all done, now release the locks and pins on primary buckets. */
+	_hash_relbuf(rel, buf_oblkno);
+	_hash_relbuf(rel, buf_nblkno);
+
 	return;
 
 	/* Here if decide not to split or fail to acquire old bucket lock */
@@ -835,20 +1037,31 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 
 	MemSet(zerobuf, 0, sizeof(zerobuf));
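+	/*
+	 * Logging a full-page image of just the last block should be enough:
+	 * replaying that image extends the relation to the required length on
+	 * the standby, mirroring the smgrextend() call below.
+	 */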
 
+	if (RelationNeedsWAL(rel))
+		log_newpage(&rel->rd_node,
+					MAIN_FORKNUM,
+					lastblock,
+					zerobuf,
+					true);
+
 	RelationOpenSmgr(rel);
 	smgrextend(rel->rd_smgr, MAIN_FORKNUM, lastblock, zerobuf, false);
 
 	return true;
 }
 
-
 /*
  * _hash_splitbucket -- split 'obucket' into 'obucket' and 'nbucket'
  *
+ * This routine is used to partition the tuples between the old and new
+ * buckets, and also to finish an incomplete split operation.  To finish a
+ * previously interrupted split, the caller must fill htab; tuples already
+ * present in htab are skipped, whereas a NULL htab means that all tuples
+ * belonging to the new bucket are moved.
+ *
  * We are splitting a bucket that consists of a base bucket page and zero
  * or more overflow (bucket chain) pages.  We must relocate tuples that
- * belong in the new bucket, and compress out any free space in the old
- * bucket.
+ * belong in the new bucket.
  *
  * The caller must hold exclusive locks on both buckets to ensure that
  * no one else is trying to access them (see README).
@@ -874,70 +1087,11 @@ _hash_splitbucket(Relation rel,
 				  Bucket nbucket,
 				  Buffer obuf,
 				  Buffer nbuf,
+				  HTAB *htab,
 				  uint32 maxbucket,
 				  uint32 highmask,
 				  uint32 lowmask)
 {
-	Page		opage;
-	Page		npage;
-	HashPageOpaque oopaque;
-	HashPageOpaque nopaque;
-
-	opage = BufferGetPage(obuf);
-	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
-
-	/*
-	 * Mark the old bucket to indicate that split is in progress and it has
-	 * deletable tuples. At operation end, we clear split in progress flag and
-	 * vacuum will clear page_has_garbage flag after deleting such tuples.
-	 */
-	oopaque->hasho_flag |= LH_BUCKET_PAGE_HAS_GARBAGE | LH_BUCKET_OLD_PAGE_SPLIT;
-
-	npage = BufferGetPage(nbuf);
-
-	/*
-	 * initialize the new bucket's primary page and mark it to indicate that
-	 * split is in progress.
-	 */
-	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
-	nopaque->hasho_prevblkno = InvalidBlockNumber;
-	nopaque->hasho_nextblkno = InvalidBlockNumber;
-	nopaque->hasho_bucket = nbucket;
-	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_NEW_PAGE_SPLIT;
-	nopaque->hasho_page_id = HASHO_PAGE_ID;
-
-	_hash_splitbucket_guts(rel, metabuf, obucket,
-						   nbucket, obuf, nbuf, NULL,
-						   maxbucket, highmask, lowmask);
-
-	/* all done, now release the locks and pins on primary buckets. */
-	_hash_relbuf(rel, obuf);
-	_hash_relbuf(rel, nbuf);
-}
-
-/*
- * _hash_splitbucket_guts -- Helper function to perform the split operation
- *
- * This routine is used to partition the tuples between old and new bucket and
- * is used to finish the incomplete split operations.  To finish the previously
- * interrupted split operation, caller needs to fill htab.  If htab is set, then
- * we skip the movement of tuples that exists in htab, otherwise NULL value of
- * htab indicates movement of all the tuples that belong to new bucket.
- *
- * Caller needs to lock and unlock the old and new primary buckets.
- */
-static void
-_hash_splitbucket_guts(Relation rel,
-					   Buffer metabuf,
-					   Bucket obucket,
-					   Bucket nbucket,
-					   Buffer obuf,
-					   Buffer nbuf,
-					   HTAB *htab,
-					   uint32 maxbucket,
-					   uint32 highmask,
-					   uint32 lowmask)
-{
 	Buffer		bucket_obuf;
 	Buffer		bucket_nbuf;
 	Page		opage;
@@ -1026,7 +1180,7 @@ _hash_splitbucket_guts(Relation rel,
 				itemsz = IndexTupleDSize(*new_itup);
 				itemsz = MAXALIGN(itemsz);
 
-				if (PageGetFreeSpace(npage) < itemsz)
+				while (PageGetFreeSpace(npage) < itemsz)
 				{
 					bool		retain_pin = false;
 
@@ -1036,8 +1190,12 @@ _hash_splitbucket_guts(Relation rel,
 					 */
 					retain_pin = nopaque->hasho_flag & LH_BUCKET_PAGE;
 
-					/* write out nbuf and drop lock, but keep pin */
-					_hash_chgbufaccess(rel, nbuf, HASH_WRITE, HASH_NOLOCK);
+					/* log the split operation before releasing the lock */
+					log_split_page(rel, nbuf);
+
+					/* drop lock, but keep pin */
+					_hash_chgbufaccess(rel, nbuf, HASH_READ, HASH_NOLOCK);
+
 					/* chain to a new overflow page */
 					nbuf = _hash_addovflpage(rel, metabuf, nbuf, retain_pin);
 					npage = BufferGetPage(nbuf);
@@ -1045,6 +1203,13 @@ _hash_splitbucket_guts(Relation rel,
 				}
 
 				/*
+				 * Change the shared buffer state within a critical section;
+				 * otherwise, an error here could leave the page in a state
+				 * that recovery cannot repair.
+				 */
+				START_CRIT_SECTION();
+
+				/*
 				 * Insert tuple on new page, using _hash_pgaddtup to ensure
 				 * correct ordering by hashkey.  This is a tad inefficient
 				 * since we may have to shuffle itempointers repeatedly.
@@ -1053,6 +1218,8 @@ _hash_splitbucket_guts(Relation rel,
 				 */
 				(void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
 
+				END_CRIT_SECTION();
+
 				/* be tidy */
 				pfree(new_itup);
 			}
@@ -1075,7 +1242,16 @@ _hash_splitbucket_guts(Relation rel,
 
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
+		{
+			/* log the split operation before releasing the lock */
+			log_split_page(rel, nbuf);
+
+			if (nopaque->hasho_flag & LH_BUCKET_PAGE)
+				_hash_chgbufaccess(rel, nbuf, HASH_WRITE, HASH_NOLOCK);
+			else
+				_hash_relbuf(rel, nbuf);
 			break;
+		}
 
 		/* Else, advance to next old page */
 		obuf = _hash_getbuf(rel, oblkno, HASH_READ, LH_OVERFLOW_PAGE);
@@ -1091,15 +1267,6 @@ _hash_splitbucket_guts(Relation rel,
 	 * To avoid deadlocks due to locking order of buckets, first lock the old
 	 * bucket and then the new bucket.
 	 */
-	if (nopaque->hasho_flag & LH_BUCKET_PAGE)
-		_hash_chgbufaccess(rel, bucket_nbuf, HASH_WRITE, HASH_NOLOCK);
-	else
-		_hash_wrtbuf(rel, nbuf);
-
-	/*
-	 * Acquiring cleanup lock to clear the split-in-progress flag ensures that
-	 * there is no pending scan that has seen the flag after it is cleared.
-	 */
 	_hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
 	opage = BufferGetPage(bucket_obuf);
 	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
@@ -1108,6 +1275,8 @@ _hash_splitbucket_guts(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	START_CRIT_SECTION();
+
 	/* indicate that split is finished */
 	oopaque->hasho_flag &= ~LH_BUCKET_OLD_PAGE_SPLIT;
 	nopaque->hasho_flag &= ~LH_BUCKET_NEW_PAGE_SPLIT;
@@ -1118,6 +1287,29 @@ _hash_splitbucket_guts(Relation rel,
 	 */
 	MarkBufferDirty(bucket_obuf);
 	MarkBufferDirty(bucket_nbuf);
+
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+		xl_hash_split_complete xlrec;
+
+		xlrec.old_bucket_flag = oopaque->hasho_flag;
+		xlrec.new_bucket_flag = nopaque->hasho_flag;
+
+		XLogBeginInsert();
+
+		XLogRegisterData((char *) &xlrec, SizeOfHashSplitComplete);
+
+		XLogRegisterBuffer(0, bucket_obuf, REGBUF_STANDARD);
+		XLogRegisterBuffer(1, bucket_nbuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_COMPLETE);
+
+		PageSetLSN(BufferGetPage(bucket_obuf), recptr);
+		PageSetLSN(BufferGetPage(bucket_nbuf), recptr);
+	}
+
+	END_CRIT_SECTION();
 }
 
 /*
@@ -1222,9 +1414,41 @@ _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Buffer nbuf,
 	opageopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
 	obucket = opageopaque->hasho_bucket;
 
-	_hash_splitbucket_guts(rel, metabuf, obucket,
-						   nbucket, obuf, bucket_nbuf, tidhtab,
-						   maxbucket, highmask, lowmask);
+	_hash_splitbucket(rel, metabuf, obucket,
+					  nbucket, obuf, bucket_nbuf, tidhtab,
+					  maxbucket, highmask, lowmask);
 
 	hash_destroy(tidhtab);
 }
+
+/*
+ *	log_split_page() -- Log the split operation
+ *
+ *	We log the split operation when a page of the new bucket becomes full,
+ *	and we log the entire page.
+ *
+ *	'buf' must be locked by the caller, which is also responsible for
+ *	unlocking it.
+ */
+static void
+log_split_page(Relation rel, Buffer buf)
+{
+	START_CRIT_SECTION();
+
+	MarkBufferDirty(buf);
+
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+
+		XLogBeginInsert();
+
+		XLogRegisterBuffer(0, buf, REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_PAGE);
+
+		PageSetLSN(BufferGetPage(buf), recptr);
+	}
+
+	END_CRIT_SECTION();
+}
diff --git a/src/backend/access/hash/hashxlog.c b/src/backend/access/hash/hashxlog.c
new file mode 100644
index 0000000..f9e77c0
--- /dev/null
+++ b/src/backend/access/hash/hashxlog.c
@@ -0,0 +1,970 @@
+/*-------------------------------------------------------------------------
+ *
+ * hashxlog.c
+ *	  WAL replay logic for hash index.
+ *
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/hash/hashxlog.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/hash.h"
+#include "access/xlogutils.h"
+
+/*
+ * replay a hash index meta page
+ */
+static void
+hash_xlog_init_meta_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Page		page;
+	Buffer		metabuf;
+
+	xl_hash_init_meta_page *xlrec = (xl_hash_init_meta_page *) XLogRecGetData(record);
+
+	/* create the index's metapage */
+	metabuf = XLogInitBufferForRedo(record, 0);
+	Assert(BufferIsValid(metabuf));
+	_hash_init_metabuffer(metabuf, xlrec->num_tuples, xlrec->procid,
+						  xlrec->ffactor, true);
+	page = (Page) BufferGetPage(metabuf);
+	PageSetLSN(page, lsn);
+	MarkBufferDirty(metabuf);
+	/* all done */
+	UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay a hash index bitmap page
+ */
+static void
+hash_xlog_init_bitmap_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		bitmapbuf;
+	Buffer		metabuf;
+	Page		page;
+	HashMetaPage metap;
+	uint32		num_buckets;
+
+	xl_hash_init_bitmap_page *xlrec = (xl_hash_init_bitmap_page *) XLogRecGetData(record);
+
+	/*
+	 * Initialize bitmap page
+	 */
+	bitmapbuf = XLogInitBufferForRedo(record, 0);
+	_hash_initbitmapbuffer(bitmapbuf, xlrec->bmsize, true);
+	PageSetLSN(BufferGetPage(bitmapbuf), lsn);
+	MarkBufferDirty(bitmapbuf);
+	UnlockReleaseBuffer(bitmapbuf);
+
+	/* add the new bitmap page to the metapage's list of bitmaps */
+	if (XLogReadBufferForRedo(record, 1, &metabuf) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Note: in normal operation, we'd update the metapage while still
+		 * holding lock on the page we inserted into.  But during replay it's
+		 * not necessary to hold that lock, since no other index updates can
+		 * be happening concurrently.
+		 */
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+
+		num_buckets = metap->hashm_maxbucket + 1;
+		metap->hashm_mapp[metap->hashm_nmaps] = num_buckets + 1;
+		metap->hashm_nmaps++;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay a hash index insert without split
+ */
+static void
+hash_xlog_insert(XLogReaderState *record)
+{
+	HashMetaPage metap;
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_insert *xlrec = (xl_hash_insert *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		Size		datalen;
+		char	   *datapos = XLogRecGetBlockData(record, 0, &datalen);
+
+		page = BufferGetPage(buffer);
+
+		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+						false, false) == InvalidOffsetNumber)
+			elog(PANIC, "hash_xlog_insert: failed to add item");
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+
+	if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Note: in normal operation, we'd update the metapage while still
+		 * holding lock on the page we inserted into.  But during replay it's
+		 * not necessary to hold that lock, since no other index updates can
+		 * be happening concurrently.
+		 */
+		page = BufferGetPage(buffer);
+		metap = HashPageGetMeta(page);
+		metap->hashm_ntuples += 1;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
+
+/*
+ * replay addition of overflow page for hash index
+ */
+static void
+hash_xlog_addovflpage(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_addovflpage *xlrec = (xl_hash_addovflpage *) XLogRecGetData(record);
+	Buffer		leftbuf;
+	Buffer		ovflbuf;
+	Buffer		metabuf;
+	BlockNumber leftblk;
+	BlockNumber rightblk;
+	BlockNumber newmapblk = InvalidBlockNumber;
+	Page		ovflpage;
+	HashPageOpaque ovflopaque;
+	uint32	   *num_bucket;
+	char	   *data;
+	Size datalen PG_USED_FOR_ASSERTS_ONLY;
+	bool		new_bmpage = false;
+
+	XLogRecGetBlockTag(record, 0, NULL, NULL, &rightblk);
+	XLogRecGetBlockTag(record, 1, NULL, NULL, &leftblk);
+
+	ovflbuf = XLogInitBufferForRedo(record, 0);
+	Assert(BufferIsValid(ovflbuf));
+
+	data = XLogRecGetBlockData(record, 0, &datalen);
+	num_bucket = (uint32 *) data;
+	Assert(datalen == sizeof(uint32));
+	_hash_initbuf(ovflbuf, *num_bucket, LH_OVERFLOW_PAGE, true);
+	/* update backlink */
+	ovflpage = BufferGetPage(ovflbuf);
+	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
+	ovflopaque->hasho_prevblkno = leftblk;
+
+	PageSetLSN(ovflpage, lsn);
+	MarkBufferDirty(ovflbuf);
+
+	if (XLogReadBufferForRedo(record, 1, &leftbuf) == BLK_NEEDS_REDO)
+	{
+		Page		leftpage;
+		HashPageOpaque leftopaque;
+
+		leftpage = BufferGetPage(leftbuf);
+		leftopaque = (HashPageOpaque) PageGetSpecialPointer(leftpage);
+		leftopaque->hasho_nextblkno = rightblk;
+
+		PageSetLSN(leftpage, lsn);
+		MarkBufferDirty(leftbuf);
+	}
+
+	/*
+	 * We can only release the locks once the prev pointer of the overflow
+	 * page and the next pointer of the left page are set; otherwise a
+	 * concurrent reader might skip the page.
+	 */
+	if (BufferIsValid(leftbuf))
+		UnlockReleaseBuffer(leftbuf);
+	UnlockReleaseBuffer(ovflbuf);
+
+	/*
+	 * Note: in normal operation, we'd update the bitmap and meta page while
+	 * still holding lock on the overflow pages we updated.  But during replay
+	 * it's not necessary to hold those locks, since no other index updates
+	 * can be happening concurrently.
+	 */
+	if (XLogRecHasBlockRef(record, 2))
+	{
+		Buffer		mapbuffer;
+
+		if (XLogReadBufferForRedo(record, 2, &mapbuffer) == BLK_NEEDS_REDO)
+		{
+			Page		mappage = (Page) BufferGetPage(mapbuffer);
+			uint32	   *freep = NULL;
+			char	   *data;
+			uint32	   *bitmap_page_bit;
+
+			freep = HashPageGetBitmap(mappage);
+
+			data = XLogRecGetBlockData(record, 2, &datalen);
+			bitmap_page_bit = (uint32 *) data;
+
+			SETBIT(freep, *bitmap_page_bit);
+
+			PageSetLSN(mappage, lsn);
+			MarkBufferDirty(mapbuffer);
+		}
+		if (BufferIsValid(mapbuffer))
+			UnlockReleaseBuffer(mapbuffer);
+	}
+
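+	/*
+	 * Block 3, if present, is a newly allocated bitmap page; the writer
+	 * registers one when it could not find a free bit in any existing
+	 * bitmap page.
+	 */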
+	if (XLogRecHasBlockRef(record, 3))
+	{
+		Buffer		newmapbuf;
+
+		newmapbuf = XLogInitBufferForRedo(record, 3);
+
+		_hash_initbitmapbuffer(newmapbuf, xlrec->bmsize, true);
+
+		new_bmpage = true;
+		newmapblk = BufferGetBlockNumber(newmapbuf);
+
+		MarkBufferDirty(newmapbuf);
+		PageSetLSN(BufferGetPage(newmapbuf), lsn);
+
+		UnlockReleaseBuffer(newmapbuf);
+	}
+
+	if (XLogReadBufferForRedo(record, 4, &metabuf) == BLK_NEEDS_REDO)
+	{
+		HashMetaPage metap;
+		Page		page;
+		uint32	   *firstfree_ovflpage;
+
+		data = XLogRecGetBlockData(record, 4, &datalen);
+		firstfree_ovflpage = (uint32 *) data;
+
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+		metap->hashm_firstfree = *firstfree_ovflpage;
+
+		if (!xlrec->bmpage_found)
+		{
+			metap->hashm_spares[metap->hashm_ovflpoint]++;
+
+			if (new_bmpage)
+			{
+				Assert(BlockNumberIsValid(newmapblk));
+
+				metap->hashm_mapp[metap->hashm_nmaps] = newmapblk;
+				metap->hashm_nmaps++;
+				metap->hashm_spares[metap->hashm_ovflpoint]++;
+			}
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay allocation of page for split operation
+ */
+static void
+hash_xlog_split_allocpage(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_split_allocpage *xlrec = (xl_hash_split_allocpage *) XLogRecGetData(record);
+	Buffer		oldbuf;
+	Buffer		newbuf;
+	Buffer		metabuf;
+	Size datalen PG_USED_FOR_ASSERTS_ONLY;
+	char	   *data;
+	XLogRedoAction action;
+
+	/* replay the record for old bucket */
+	action = XLogReadBufferForRedo(record, 0, &oldbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		oldpage;
+		HashPageOpaque oldopaque;
+
+		oldpage = BufferGetPage(oldbuf);
+		oldopaque = (HashPageOpaque) PageGetSpecialPointer(oldpage);
+
+		oldopaque->hasho_flag = xlrec->old_bucket_flag;
+
+		PageSetLSN(oldpage, lsn);
+		MarkBufferDirty(oldbuf);
+	}
+
+	/*
+	 * There is no harm in releasing the lock on old bucket before new bucket
+	 * during replay as no other bucket splits can be happening concurrently.
+	 */
+	if (BufferIsValid(oldbuf))
+		UnlockReleaseBuffer(oldbuf);
+
+	/* replay the record for new bucket */
+	newbuf = XLogInitBufferForRedo(record, 1);
+	_hash_initbuf(newbuf, xlrec->new_bucket, xlrec->new_bucket_flag, true);
+	MarkBufferDirty(newbuf);
+	PageSetLSN(BufferGetPage(newbuf), lsn);
+
+	UnlockReleaseBuffer(newbuf);
+
+	/*
+	 * Note: in normal operation, we'd update the meta page while still
+	 * holding lock on the old and new bucket pages.  But during replay it's
+	 * not necessary to hold those locks, since no other bucket splits can be
+	 * happening concurrently.
+	 */
+
+	/* replay the record for metapage changes */
+	if (XLogReadBufferForRedo(record, 2, &metabuf) == BLK_NEEDS_REDO)
+	{
+		Page		page;
+		HashMetaPage metap;
+
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+		metap->hashm_maxbucket = xlrec->new_bucket;
+
+		data = XLogRecGetBlockData(record, 2, &datalen);
+
+		if (xlrec->flags & XLH_SPLIT_META_UPDATE_MASKS)
+		{
+			uint32		lowmask;
+			uint32	   *highmask;
+
+			/* extract low and high masks. */
+			memcpy(&lowmask, data, sizeof(uint32));
+			highmask = (uint32 *) ((char *) data + sizeof(uint32));
+
+			/* update metapage */
+			metap->hashm_lowmask = lowmask;
+			metap->hashm_highmask = *highmask;
+
+			data += sizeof(uint32) * 2;
+		}
+
+		if (xlrec->flags & XLH_SPLIT_META_UPDATE_SPLITPOINT)
+		{
+			uint32		ovflpoint;
+			uint32	   *ovflpages;
+
+			/* extract information of overflow pages. */
+			memcpy(&ovflpoint, data, sizeof(uint32));
+			ovflpages = (uint32 *) ((char *) data + sizeof(uint32));
+
+			/* update metapage */
+			metap->hashm_spares[ovflpoint] = *ovflpages;
+			metap->hashm_ovflpoint = ovflpoint;
+		}
+
+		MarkBufferDirty(metabuf);
+		PageSetLSN(BufferGetPage(metabuf), lsn);
+	}
+
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay of split operation
+ */
+static void
+hash_xlog_split_page(XLogReaderState *record)
+{
+	Buffer		buf;
+
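+	/*
+	 * log_split_page() registers this page with REGBUF_FORCE_IMAGE, so
+	 * replay only needs the full-page image to be restored.
+	 */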
+	if (XLogReadBufferForRedo(record, 0, &buf) != BLK_RESTORED)
+		elog(ERROR, "hash split record did not contain a full-page image");
+
+	UnlockReleaseBuffer(buf);
+}
+
+/*
+ * replay completion of split operation
+ */
+static void
+hash_xlog_split_complete(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_split_complete *xlrec = (xl_hash_split_complete *) XLogRecGetData(record);
+	Buffer		oldbuf;
+	Buffer		newbuf;
+	XLogRedoAction action;
+
+	/* replay the record for old bucket */
+	action = XLogReadBufferForRedo(record, 0, &oldbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		oldpage;
+		HashPageOpaque oldopaque;
+
+		oldpage = BufferGetPage(oldbuf);
+		oldopaque = (HashPageOpaque) PageGetSpecialPointer(oldpage);
+
+		oldopaque->hasho_flag = xlrec->old_bucket_flag;
+
+		PageSetLSN(oldpage, lsn);
+		MarkBufferDirty(oldbuf);
+	}
+	if (BufferIsValid(oldbuf))
+		UnlockReleaseBuffer(oldbuf);
+
+	/* replay the record for new bucket */
+	action = XLogReadBufferForRedo(record, 1, &newbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		newpage;
+		HashPageOpaque nopaque;
+
+		newpage = BufferGetPage(newbuf);
+		nopaque = (HashPageOpaque) PageGetSpecialPointer(newpage);
+
+		nopaque->hasho_flag = xlrec->new_bucket_flag;
+
+		PageSetLSN(newpage, lsn);
+		MarkBufferDirty(newbuf);
+	}
+	if (BufferIsValid(newbuf))
+		UnlockReleaseBuffer(newbuf);
+}
+
+/*
+ * replay move of page contents for squeeze operation of hash index
+ */
+static void
+hash_xlog_move_page_contents(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_move_page_contents *xldata = (xl_hash_move_page_contents *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		writebuf = InvalidBuffer;
+	Buffer		deletebuf = InvalidBuffer;
+	XLogRedoAction action;
+
+	/*
+	 * Make sure we have a cleanup lock on the primary bucket page before we
+	 * start the actual replay operation.  This guarantees that no scan can
+	 * start, and that no scan is already in progress, while we replay this
+	 * operation.  If scans were allowed, they could miss some tuples or see
+	 * the same tuple multiple times.
+	 */
+	if (xldata->is_prim_bucket_same_wrt)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &writebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bucketbuf))
+			LockBufferForCleanup(bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &writebuf);
+	}
+
+	/* replay the record for adding entries in overflow buffer */
+	if (action == BLK_NEEDS_REDO)
+	{
+		Page		writepage;
+		char	   *begin;
+		char	   *data;
+		Size		datalen;
+		uint16		ninserted = 0;
+
+		data = begin = XLogRecGetBlockData(record, 1, &datalen);
+
+		writepage = (Page) BufferGetPage(writebuf);
+
+		if (xldata->ntups > 0)
+		{
+			OffsetNumber *towrite = (OffsetNumber *) data;
+
+			data += sizeof(OffsetNumber) * xldata->ntups;
+
+			while (data - begin < datalen)
+			{
+				IndexTuple	itup = (IndexTuple) data;
+				Size		itemsz;
+				OffsetNumber l;
+
+				itemsz = IndexTupleDSize(*itup);
+				itemsz = MAXALIGN(itemsz);
+
+				data += itemsz;
+
+				l = PageAddItem(writepage, (Item) itup, itemsz, towrite[ninserted], false, false);
+				if (l == InvalidOffsetNumber)
+					elog(ERROR, "hash_xlog_move_page_contents: failed to add item to hash index page, size %d bytes",
+						 (int) itemsz);
+
+				ninserted++;
+			}
+		}
+
+		/*
+		 * The number of tuples inserted must match the count in the WAL record.
+		 */
+		Assert(ninserted == xldata->ntups);
+
+		PageSetLSN(writepage, lsn);
+		MarkBufferDirty(writebuf);
+	}
+
+	/* replay the record for deleting entries from overflow buffer */
+	if (XLogReadBufferForRedo(record, 2, &deletebuf) == BLK_NEEDS_REDO)
+	{
+		Page		page;
+		char	   *ptr;
+		Size		len;
+
+		ptr = XLogRecGetBlockData(record, 2, &len);
+
+		page = (Page) BufferGetPage(deletebuf);
+
+		if (len > 0)
+		{
+			OffsetNumber *unused;
+			OffsetNumber *unend;
+
+			unused = (OffsetNumber *) ptr;
+			unend = (OffsetNumber *) ((char *) ptr + len);
+
+			if ((unend - unused) > 0)
+				PageIndexMultiDelete(page, unused, unend - unused);
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(deletebuf);
+	}
+
+	/*
+	 * Replay is complete, so now we can release the buffers.  We release the
+	 * locks at the end of the replay operation so that the lock on the
+	 * primary bucket page is held until the whole operation is done.  We
+	 * could release the lock on the write buffer as soon as we are done with
+	 * it (when it is not the primary bucket page), but that doesn't seem
+	 * worth complicating the code for.
+	 */
+	if (BufferIsValid(deletebuf))
+		UnlockReleaseBuffer(deletebuf);
+
+	if (BufferIsValid(writebuf))
+		UnlockReleaseBuffer(writebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay squeeze page operation of hash index
+ */
+static void
+hash_xlog_squeeze_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_squeeze_page *xldata = (xl_hash_squeeze_page *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		writebuf;
+	Buffer		ovflbuf;
+	Buffer		prevbuf = InvalidBuffer;
+	Buffer		mapbuf;
+	XLogRedoAction action;
+
+	/*
+	 * Make sure we have a cleanup lock on the primary bucket page before we
+	 * start the actual replay operation.  This guarantees that no scan can
+	 * start, and that no scan is already in progress, while we replay this
+	 * operation.  If scans were allowed, they could miss some tuples or see
+	 * the same tuple multiple times.
+	 */
+	if (xldata->is_prim_bucket_same_wrt)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &writebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bucketbuf))
+			LockBufferForCleanup(bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &writebuf);
+	}
+
+	/* replay the record for adding entries in overflow buffer */
+	if (action == BLK_NEEDS_REDO)
+	{
+		Page		writepage;
+		char	   *begin;
+		char	   *data;
+		Size		datalen;
+		uint16		ninserted = 0;
+
+		data = begin = XLogRecGetBlockData(record, 1, &datalen);
+
+		writepage = (Page) BufferGetPage(writebuf);
+
+		if (xldata->ntups > 0)
+		{
+			OffsetNumber *towrite = (OffsetNumber *) data;
+
+			data += sizeof(OffsetNumber) * xldata->ntups;
+
+			while (data - begin < datalen)
+			{
+				IndexTuple	itup = (IndexTuple) data;
+				Size		itemsz;
+				OffsetNumber l;
+
+				itemsz = IndexTupleDSize(*itup);
+				itemsz = MAXALIGN(itemsz);
+
+				data += itemsz;
+
+				l = PageAddItem(writepage, (Item) itup, itemsz, towrite[ninserted], false, false);
+				if (l == InvalidOffsetNumber)
+					elog(ERROR, "hash_xlog_squeeze_page: failed to add item to hash index page, size %d bytes",
+						 (int) itemsz);
+
+				ninserted++;
+			}
+		}
+
+		/*
+		 * The number of tuples inserted must match the count in the WAL record.
+		 */
+		Assert(ninserted == xldata->ntups);
+
+		/*
+		 * If the page to which we are adding tuples is the page preceding
+		 * the freed overflow page, then update its nextblkno as well.
+		 */
+		if (xldata->is_prev_bucket_same_wrt)
+		{
+			HashPageOpaque writeopaque = (HashPageOpaque) PageGetSpecialPointer(writepage);
+
+			writeopaque->hasho_nextblkno = xldata->nextblkno;
+		}
+
+		PageSetLSN(writepage, lsn);
+		MarkBufferDirty(writebuf);
+	}
+
+	/* replay the record for initializing overflow buffer */
+	if (XLogReadBufferForRedo(record, 2, &ovflbuf) == BLK_NEEDS_REDO)
+	{
+		Page		ovflpage;
+
+		ovflpage = BufferGetPage(ovflbuf);
+
+		MemSet(ovflpage, 0, BufferGetPageSize(ovflbuf));
+
+		PageSetLSN(ovflpage, lsn);
+		MarkBufferDirty(ovflbuf);
+	}
+	if (BufferIsValid(ovflbuf))
+		UnlockReleaseBuffer(ovflbuf);
+
+	/* replay the record for page previous to the freed overflow page */
+	if (!xldata->is_prev_bucket_same_wrt &&
+		XLogReadBufferForRedo(record, 3, &prevbuf) == BLK_NEEDS_REDO)
+	{
+		Page		prevpage = BufferGetPage(prevbuf);
+		HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+		prevopaque->hasho_nextblkno = xldata->nextblkno;
+
+		PageSetLSN(prevpage, lsn);
+		MarkBufferDirty(prevbuf);
+	}
+	if (BufferIsValid(prevbuf))
+		UnlockReleaseBuffer(prevbuf);
+
+	/* replay the record for page next to the freed overflow page */
+	if (XLogRecHasBlockRef(record, 4))
+	{
+		Buffer		nextbuf;
+
+		if (XLogReadBufferForRedo(record, 4, &nextbuf) == BLK_NEEDS_REDO)
+		{
+			Page		nextpage = BufferGetPage(nextbuf);
+			HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage);
+
+			nextopaque->hasho_prevblkno = xldata->prevblkno;
+
+			PageSetLSN(nextpage, lsn);
+			MarkBufferDirty(nextbuf);
+		}
+		if (BufferIsValid(nextbuf))
+			UnlockReleaseBuffer(nextbuf);
+	}
+
+	/* replay the record for bitmap page */
+	if (XLogReadBufferForRedo(record, 5, &mapbuf) == BLK_NEEDS_REDO)
+	{
+		Page		mappage = (Page) BufferGetPage(mapbuf);
+		uint32	   *freep = NULL;
+		char	   *data;
+		uint32	   *bitmap_page_bit;
+		Size		datalen;
+
+		freep = HashPageGetBitmap(mappage);
+
+		data = XLogRecGetBlockData(record, 5, &datalen);
+		bitmap_page_bit = (uint32 *) data;
+
+		CLRBIT(freep, *bitmap_page_bit);
+
+		PageSetLSN(mappage, lsn);
+		MarkBufferDirty(mapbuf);
+	}
+	if (BufferIsValid(mapbuf))
+		UnlockReleaseBuffer(mapbuf);
+
+	/* replay the record for meta page */
+	if (XLogRecHasBlockRef(record, 6))
+	{
+		Buffer		metabuf;
+
+		if (XLogReadBufferForRedo(record, 6, &metabuf) == BLK_NEEDS_REDO)
+		{
+			HashMetaPage metap;
+			Page		page;
+			char	   *data;
+			uint32	   *firstfree_ovflpage;
+			Size		datalen;
+
+			data = XLogRecGetBlockData(record, 6, &datalen);
+			firstfree_ovflpage = (uint32 *) data;
+
+			page = BufferGetPage(metabuf);
+			metap = HashPageGetMeta(page);
+			metap->hashm_firstfree = *firstfree_ovflpage;
+
+			PageSetLSN(page, lsn);
+			MarkBufferDirty(metabuf);
+		}
+		if (BufferIsValid(metabuf))
+			UnlockReleaseBuffer(metabuf);
+	}
+
+	/*
+	 * We release the locks on writebuf and bucketbuf at the end of the
+	 * replay operation so that the lock on the primary bucket page is held
+	 * until the whole operation is done.  We could release the lock on the
+	 * write buffer as soon as we are done with it (when it is not the
+	 * primary bucket page), but that doesn't seem worth complicating the
+	 * code for.
+	 */
+	if (BufferIsValid(writebuf))
+		UnlockReleaseBuffer(writebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay delete operation of hash index
+ */
+static void
+hash_xlog_delete(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_delete *xldata = (xl_hash_delete *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		deletebuf;
+	Page		page;
+	XLogRedoAction action;
+
+	/*
+	 * Make sure we have a cleanup lock on the primary bucket page before we
+	 * start the actual replay operation.  This guarantees that no scan can
+	 * start, and that no scan is already in progress, while we replay this
+	 * operation.  If scans were allowed, they could miss some tuples or see
+	 * the same tuple multiple times.
+	 */
+	if (xldata->is_primary_bucket_page)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &deletebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bucketbuf))
+			LockBufferForCleanup(bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &deletebuf);
+	}
+
+	/* replay the record for deleting entries in bucket page */
+	if (action == BLK_NEEDS_REDO)
+	{
+		char	   *ptr;
+		Size		len;
+
+		ptr = XLogRecGetBlockData(record, 1, &len);
+
+		page = (Page) BufferGetPage(deletebuf);
+
+		if (len > 0)
+		{
+			OffsetNumber *unused;
+			OffsetNumber *unend;
+
+			unused = (OffsetNumber *) ptr;
+			unend = (OffsetNumber *) ((char *) ptr + len);
+
+			if ((unend - unused) > 0)
+				PageIndexMultiDelete(page, unused, unend - unused);
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(deletebuf);
+	}
+	if (BufferIsValid(deletebuf))
+		UnlockReleaseBuffer(deletebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay clear garbage flag operation for primary bucket page.
+ */
+static void
+hash_xlog_clear_garbage(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		HashPageOpaque bucket_opaque;
+
+		page = (Page) BufferGetPage(buffer);
+
+		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+		bucket_opaque->hasho_flag &= ~LH_BUCKET_PAGE_HAS_GARBAGE;
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
+
+/*
+ * replay for update meta page
+ */
+static void
+hash_xlog_update_meta_page(XLogReaderState *record)
+{
+	HashMetaPage metap;
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_update_meta_page *xldata = (xl_hash_update_meta_page *) XLogRecGetData(record);
+	Buffer		metabuf;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &metabuf) == BLK_NEEDS_REDO)
+	{
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+
+		metap->hashm_ntuples = xldata->ntuples;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+void
+hash_redo(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			hash_xlog_init_meta_page(record);
+			break;
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			hash_xlog_init_bitmap_page(record);
+			break;
+		case XLOG_HASH_INSERT:
+			hash_xlog_insert(record);
+			break;
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			hash_xlog_addovflpage(record);
+			break;
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			hash_xlog_split_allocpage(record);
+			break;
+		case XLOG_HASH_SPLIT_PAGE:
+			hash_xlog_split_page(record);
+			break;
+		case XLOG_HASH_SPLIT_COMPLETE:
+			hash_xlog_split_complete(record);
+			break;
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			hash_xlog_move_page_contents(record);
+			break;
+		case XLOG_HASH_SQUEEZE_PAGE:
+			hash_xlog_squeeze_page(record);
+			break;
+		case XLOG_HASH_DELETE:
+			hash_xlog_delete(record);
+			break;
+		case XLOG_HASH_CLEAR_GARBAGE:
+			hash_xlog_clear_garbage(record);
+			break;
+		case XLOG_HASH_UPDATE_META_PAGE:
+			hash_xlog_update_meta_page(record);
+			break;
+		default:
+			elog(PANIC, "hash_redo: unknown op code %u", info);
+	}
+}
diff --git a/src/backend/access/rmgrdesc/hashdesc.c b/src/backend/access/rmgrdesc/hashdesc.c
index d37c9b1..7b32fbb 100644
--- a/src/backend/access/rmgrdesc/hashdesc.c
+++ b/src/backend/access/rmgrdesc/hashdesc.c
@@ -19,10 +19,143 @@
 void
 hash_desc(StringInfo buf, XLogReaderState *record)
 {
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			{
+				xl_hash_init_meta_page *xlrec = (xl_hash_init_meta_page *) rec;
+
+				appendStringInfo(buf, "num_tuples %g, fillfactor %d",
+								 xlrec->num_tuples, xlrec->ffactor);
+				break;
+			}
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			{
+				xl_hash_init_bitmap_page *xlrec = (xl_hash_init_bitmap_page *) rec;
+
+				appendStringInfo(buf, "bmsize %d", xlrec->bmsize);
+				break;
+			}
+		case XLOG_HASH_INSERT:
+			{
+				xl_hash_insert *xlrec = (xl_hash_insert *) rec;
+
+				appendStringInfo(buf, "off %u", xlrec->offnum);
+				break;
+			}
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			{
+				xl_hash_addovflpage *xlrec = (xl_hash_addovflpage *) rec;
+
+				appendStringInfo(buf, "bmsize %d, bmpage_found %c",
+						   xlrec->bmsize, (xlrec->bmpage_found) ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			{
+				xl_hash_split_allocpage *xlrec = (xl_hash_split_allocpage *) rec;
+
+				appendStringInfo(buf, "new_bucket %u, meta_page_masks_updated %c, issplitpoint_changed %c",
+								 xlrec->new_bucket,
+					(xlrec->flags & XLH_SPLIT_META_UPDATE_MASKS) ? 'T' : 'F',
+								 (xlrec->flags & XLH_SPLIT_META_UPDATE_SPLITPOINT) ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SPLIT_COMPLETE:
+			{
+				xl_hash_split_complete *xlrec = (xl_hash_split_complete *) rec;
+
+				appendStringInfo(buf, "split_complete_old_bucket %c, split_complete_new_bucket %c",
+								 (xlrec->old_bucket_flag & LH_BUCKET_OLD_PAGE_SPLIT) ? 'F' : 'T',
+								 (xlrec->new_bucket_flag & LH_BUCKET_NEW_PAGE_SPLIT) ? 'F' : 'T');
+				break;
+			}
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			{
+				xl_hash_move_page_contents *xlrec = (xl_hash_move_page_contents *) rec;
+
+				appendStringInfo(buf, "ntups %d, is_primary %c",
+								 xlrec->ntups,
+								 xlrec->is_prim_bucket_same_wrt ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SQUEEZE_PAGE:
+			{
+				xl_hash_squeeze_page *xlrec = (xl_hash_squeeze_page *) rec;
+
+				appendStringInfo(buf, "prevblkno %u, nextblkno %u, ntups %d, is_primary %c",
+								 xlrec->prevblkno,
+								 xlrec->nextblkno,
+								 xlrec->ntups,
+								 xlrec->is_prim_bucket_same_wrt ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_DELETE:
+			{
+				xl_hash_delete *xlrec = (xl_hash_delete *) rec;
+
+				appendStringInfo(buf, "is_primary %c",
+								 xlrec->is_primary_bucket_page ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_UPDATE_META_PAGE:
+			{
+				xl_hash_update_meta_page *xlrec = (xl_hash_update_meta_page *) rec;
+
+				appendStringInfo(buf, "ntuples %g",
+								 xlrec->ntuples);
+				break;
+			}
+	}
 }
 
 const char *
 hash_identify(uint8 info)
 {
-	return NULL;
+	const char *id = NULL;
+
+	switch (info & ~XLR_INFO_MASK)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			id = "INIT_META_PAGE";
+			break;
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			id = "INIT_BITMAP_PAGE";
+			break;
+		case XLOG_HASH_INSERT:
+			id = "INSERT";
+			break;
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			id = "ADD_OVFL_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			id = "SPLIT_ALLOCATE_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_PAGE:
+			id = "SPLIT_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_COMPLETE:
+			id = "SPLIT_COMPLETE";
+			break;
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			id = "MOVE_PAGE_CONTENTS";
+			break;
+		case XLOG_HASH_SQUEEZE_PAGE:
+			id = "SQUEEZE_PAGE";
+			break;
+		case XLOG_HASH_DELETE:
+			id = "DELETE";
+			break;
+		case XLOG_HASH_CLEAR_GARBAGE:
+			id = "CLEAR_GARBAGE";
+			break;
+		case XLOG_HASH_UPDATE_META_PAGE:
+			id = "UPDATE_META_PAGE";
+			break;
+	}
+
+	return id;
 }
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index d14d540..fac16cc 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -499,11 +499,6 @@ DefineIndex(Oid relationId,
 	accessMethodForm = (Form_pg_am) GETSTRUCT(tuple);
 	amRoutine = GetIndexAmRoutine(accessMethodForm->amhandler);
 
-	if (strcmp(accessMethodName, "hash") == 0 &&
-		RelationNeedsWAL(rel))
-		ereport(WARNING,
-				(errmsg("hash indexes are not WAL-logged and their use is discouraged")));
-
 	if (stmt->unique && !amRoutine->amcanunique)
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index f2a07f2..e1ae4e3 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -615,6 +615,33 @@ PageGetFreeSpace(Page page)
 }
 
 /*
+ * PageGetFreeSpaceForMulTups
+ *		Returns the size of the free (allocatable) space on a page,
+ *		reduced by the space needed for multiple new line pointers.
+ *
+ * Note: this should usually only be used on index pages.  Use
+ * PageGetHeapFreeSpace on heap pages.
+ */
+Size
+PageGetFreeSpaceForMulTups(Page page, int ntups)
+{
+	int			space;
+
+	/*
+	 * Use signed arithmetic here so that we behave sensibly if pd_lower >
+	 * pd_upper.
+	 */
+	space = (int) ((PageHeader) page)->pd_upper -
+		(int) ((PageHeader) page)->pd_lower;
+
+	if (space < (int) (ntups * sizeof(ItemIdData)))
+		return 0;
+	space -= ntups * sizeof(ItemIdData);
+
+	return (Size) space;
+}
+
+/*
  * PageGetExactFreeSpace
  *		Returns the size of the free (allocatable) space on a page,
  *		without any consideration for adding/removing line pointers.
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 6e8fc4c..2b43ddf 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -183,6 +183,246 @@ typedef struct HashMetaPageData
 
 typedef HashMetaPageData *HashMetaPage;
 
+/* Number of buffers required for XLOG_HASH_SQUEEZE_PAGE operation */
+#define HASH_XLOG_FREE_OVFL_BUFS	6
+
+/*
+ * XLOG records for hash operations
+ */
+#define XLOG_HASH_INIT_META_PAGE	0x00		/* initialize the meta page */
+#define XLOG_HASH_INIT_BITMAP_PAGE	0x10		/* initialize the bitmap page */
+#define XLOG_HASH_INSERT		0x20	/* add index tuple without split */
+#define XLOG_HASH_ADD_OVFL_PAGE 0x30	/* add overflow page */
+#define XLOG_HASH_SPLIT_ALLOCATE_PAGE	0x40	/* allocate new page for split */
+#define XLOG_HASH_SPLIT_PAGE	0x50	/* split page */
+#define XLOG_HASH_SPLIT_COMPLETE	0x60		/* completion of split
+												 * operation */
+#define XLOG_HASH_MOVE_PAGE_CONTENTS	0x70	/* remove tuples from one page
+												 * and add to another page */
+#define XLOG_HASH_SQUEEZE_PAGE	0x80	/* add tuples to one of the previous
+										 * pages in chain and free the ovfl
+										 * page */
+#define XLOG_HASH_DELETE		0x90	/* delete index tuples from a page */
+#define XLOG_HASH_CLEAR_GARBAGE 0xA0	/* clear garbage flag in primary
+										 * bucket page after deleting tuples
+										 * that are moved due to split	*/
+#define XLOG_HASH_UPDATE_META_PAGE	0xB0		/* update meta page after
+												 * vacuum */
+
+
+/*
+ * xl_hash_split_allocpage flag values, 8 bits are available.
+ */
+#define XLH_SPLIT_META_UPDATE_MASKS		(1<<0)
+#define XLH_SPLIT_META_UPDATE_SPLITPOINT		(1<<1)
+
+/*
+ * Data to regenerate the meta-data page
+ */
+typedef struct xl_hash_metadata
+{
+	HashMetaPageData metadata;
+}	xl_hash_metadata;
+
+/*
+ * This is what we need to know about a HASH index create.
+ *
+ * Backup block 0: metapage
+ */
+typedef struct xl_hash_createidx
+{
+	double		num_tuples;
+	RegProcedure procid;
+	uint16		ffactor;
+}	xl_hash_createidx;
+#define SizeOfHashCreateIdx (offsetof(xl_hash_createidx, ffactor) + sizeof(uint16))
+
+/*
+ * This is what we need to know about simple (without split) insert.
+ *
+ * This data record is used for XLOG_HASH_INSERT
+ *
+ * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 1: metapage (xl_hash_metadata)
+ */
+typedef struct xl_hash_insert
+{
+	OffsetNumber offnum;
+}	xl_hash_insert;
+
+#define SizeOfHashInsert	(offsetof(xl_hash_insert, offnum) + sizeof(OffsetNumber))
+
+/*
+ * This is what we need to know about addition of overflow page.
+ *
+ * This data record is used for XLOG_HASH_ADD_OVFL_PAGE
+ *
+ * Backup Blk 0: newly allocated overflow page
+ * Backup Blk 1: page before new overflow page in the bucket chain
+ * Backup Blk 2: bitmap page
+ * Backup Blk 3: new bitmap page
+ * Backup Blk 4: metapage
+ */
+typedef struct xl_hash_addovflpage
+{
+	uint16		bmsize;
+	bool		bmpage_found;
+}	xl_hash_addovflpage;
+
+#define SizeOfHashAddOvflPage	\
+	(offsetof(xl_hash_addovflpage, bmpage_found) + sizeof(bool))
+
+/*
+ * This is what we need to know about allocating a page for split.
+ *
+ * This data record is used for XLOG_HASH_SPLIT_ALLOCATE_PAGE
+ *
+ * Backup Blk 0: page for old bucket
+ * Backup Blk 1: page for new bucket
+ * Backup Blk 2: metapage
+ */
+typedef struct xl_hash_split_allocpage
+{
+	uint32		new_bucket;
+	uint16		old_bucket_flag;
+	uint16		new_bucket_flag;
+	uint8		flags;
+}	xl_hash_split_allocpage;
+
+#define SizeOfHashSplitAllocPage	\
+	(offsetof(xl_hash_split_allocpage, flags) + sizeof(uint8))
+
+/*
+ * This is what we need to know about completing the split operation.
+ *
+ * This data record is used for XLOG_HASH_SPLIT_COMPLETE
+ *
+ * Backup Blk 0: page for old bucket
+ * Backup Blk 1: page for new bucket
+ */
+typedef struct xl_hash_split_complete
+{
+	uint16		old_bucket_flag;
+	uint16		new_bucket_flag;
+}	xl_hash_split_complete;
+
+#define SizeOfHashSplitComplete \
+	(offsetof(xl_hash_split_complete, new_bucket_flag) + sizeof(uint16))
+
+/*
+ * This is what we need to know about move page contents required during
+ * squeeze operation.
+ *
+ * This data record is used for XLOG_HASH_MOVE_PAGE_CONTENTS
+ *
+ * Backup Blk 0: bucket page
+ * Backup Blk 1: page containing moved tuples
+ * Backup Blk 2: page from which tuples will be removed
+ */
+typedef struct xl_hash_move_page_contents
+{
+	uint16		ntups;
+	bool		is_prim_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is same as
+												 * primary bucket page */
+}	xl_hash_move_page_contents;
+
+#define SizeOfHashMovePageContents	\
+	(offsetof(xl_hash_move_page_contents, is_prim_bucket_same_wrt) + sizeof(bool))
+
+/*
+ * This is what we need to know about the squeeze page operation.
+ *
+ * This data record is used for XLOG_HASH_SQUEEZE_PAGE
+ *
+ * Backup Blk 0: page containing tuples moved from freed overflow page
+ * Backup Blk 1: freed overflow page
+ * Backup Blk 2: page previous to the freed overflow page
+ * Backup Blk 3: page next to the freed overflow page
+ * Backup Blk 4: bitmap page containing info of freed overflow page
+ * Backup Blk 5: meta page
+ */
+typedef struct xl_hash_squeeze_page
+{
+	BlockNumber prevblkno;
+	BlockNumber nextblkno;
+	uint16		ntups;
+	bool		is_prim_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is same as
+												 * primary bucket page */
+	bool		is_prev_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is the
+												 * page previous to the freed
+												 * overflow page */
+}	xl_hash_squeeze_page;
+
+#define SizeOfHashSqueezePage	\
+	(offsetof(xl_hash_squeeze_page, is_prev_bucket_same_wrt) + sizeof(bool))
+
+/*
+ * This is what we need to know about the deletion of index tuples from a page.
+ *
+ * This data record is used for XLOG_HASH_DELETE
+ *
+ * Backup Blk 0: primary bucket page
+ * Backup Blk 1: page from which tuples are deleted
+ */
+typedef struct xl_hash_delete
+{
+	bool		is_primary_bucket_page; /* TRUE if the operation is for
+										 * primary bucket page */
+}	xl_hash_delete;
+
+#define SizeOfHashDelete	(offsetof(xl_hash_delete, is_primary_bucket_page) + sizeof(bool))
+
+/*
+ * This is what we need for metapage update operation.
+ *
+ * This data record is used for XLOG_HASH_UPDATE_META_PAGE
+ *
+ * Backup Blk 0: meta page
+ */
+typedef struct xl_hash_update_meta_page
+{
+	double		ntuples;
+}	xl_hash_update_meta_page;
+
+#define SizeOfHashUpdateMetaPage	\
+	(offsetof(xl_hash_update_meta_page, ntuples) + sizeof(double))
+
+/*
+ * This is what we need to initialize metapage.
+ *
+ * This data record is used for XLOG_HASH_INIT_META_PAGE
+ *
+ * Backup Blk 0: meta page
+ */
+typedef struct xl_hash_init_meta_page
+{
+	double		num_tuples;
+	RegProcedure procid;
+	uint16		ffactor;
+}	xl_hash_init_meta_page;
+
+#define SizeOfHashInitMetaPage		\
+	(offsetof(xl_hash_init_meta_page, ffactor) + sizeof(uint16))
+
+/*
+ * This is what we need to initialize bitmap page.
+ *
+ * This data record is used for XLOG_HASH_INIT_BITMAP_PAGE
+ *
+ * Backup Blk 0: bitmap page
+ * Backup Blk 1: meta page
+ */
+typedef struct xl_hash_init_bitmap_page
+{
+	uint16		bmsize;
+}	xl_hash_init_bitmap_page;
+
+#define SizeOfHashInitBitmapPage	\
+	(offsetof(xl_hash_init_bitmap_page, bmsize) + sizeof(uint16))
+
 /*
  * Maximum size of a hash index item (it's okay to have only one per page)
  */
@@ -312,13 +552,16 @@ extern Datum hash_uint32(uint32 k);
 extern void _hash_doinsert(Relation rel, IndexTuple itup);
 extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
 			   Size itemsize, IndexTuple itup);
+extern void _hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
+					OffsetNumber *itup_offsets, uint16 nitups);
 
 /* hashovfl.c */
 extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin);
-extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf,
-				   BlockNumber bucket_blkno, BufferAccessStrategy bstrategy);
-extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
-				 BlockNumber blkno, ForkNumber forkNum);
+extern BlockNumber _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
+				   Buffer wbuf, IndexTuple *itups, OffsetNumber *itup_offsets,
+				   Size *tups_size, uint16 nitups,
+				   BufferAccessStrategy bstrategy);
+extern void _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage);
 extern void _hash_squeezebucket(Relation rel,
 					Bucket bucket, BlockNumber bucket_blkno,
 					Buffer bucket_buf,
@@ -330,6 +573,8 @@ extern Buffer _hash_getbuf(Relation rel, BlockNumber blkno,
 extern Buffer _hash_getbuf_with_condlock_cleanup(Relation rel,
 								   BlockNumber blkno, int flags);
 extern Buffer _hash_getinitbuf(Relation rel, BlockNumber blkno);
+extern void _hash_initbuf(Buffer buf, uint32 num_bucket, uint32 flag,
+			  bool initpage);
 extern Buffer _hash_getnewbuf(Relation rel, BlockNumber blkno,
 				ForkNumber forkNum);
 extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
@@ -338,11 +583,12 @@ extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
 extern void _hash_relbuf(Relation rel, Buffer buf);
 extern void _hash_dropbuf(Relation rel, Buffer buf);
 extern void _hash_dropscanbuf(Relation rel, HashScanOpaque so);
-extern void _hash_wrtbuf(Relation rel, Buffer buf);
 extern void _hash_chgbufaccess(Relation rel, Buffer buf, int from_access,
 				   int to_access);
-extern uint32 _hash_metapinit(Relation rel, double num_tuples,
-				ForkNumber forkNum);
+extern uint32 _hash_init(Relation rel, double num_tuples,
+		   ForkNumber forkNum);
+extern void _hash_init_metabuffer(Buffer buf, double num_tuples,
+					  RegProcedure procid, uint16 ffactor, bool initpage);
 extern void _hash_pageinit(Page page, Size size);
 extern void _hash_expandtable(Relation rel, Buffer metabuf);
 extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 15cebfc..c1ac1d9 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -423,6 +423,7 @@ extern Page PageGetTempPageCopySpecial(Page page);
 extern void PageRestoreTempPage(Page tempPage, Page oldPage);
 extern void PageRepairFragmentation(Page page);
 extern Size PageGetFreeSpace(Page page);
+extern Size PageGetFreeSpaceForMulTups(Page page, int ntups);
 extern Size PageGetExactFreeSpace(Page page);
 extern Size PageGetHeapFreeSpace(Page page);
 extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 76593e1..80089b9 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -2335,13 +2335,9 @@ Options: fastupdate=on, gin_pending_list_limit=128
 -- HASH
 --
 CREATE INDEX hash_i4_index ON hash_i4_heap USING hash (random int4_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_name_index ON hash_name_heap USING hash (random name_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_txt_index ON hash_txt_heap USING hash (random text_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_f8_index ON hash_f8_heap USING hash (random float8_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE UNLOGGED TABLE unlogged_hash_table (id int4);
 CREATE INDEX unlogged_hash_index ON unlogged_hash_table USING hash (id int4_ops);
 DROP TABLE unlogged_hash_table;
@@ -2350,7 +2346,6 @@ DROP TABLE unlogged_hash_table;
 -- maintenance_work_mem setting and fillfactor:
 SET maintenance_work_mem = '1MB';
 CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 EXPLAIN (COSTS OFF)
 SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
                       QUERY PLAN                       
diff --git a/src/test/regress/expected/enum.out b/src/test/regress/expected/enum.out
index 1a61a5b..3682642 100644
--- a/src/test/regress/expected/enum.out
+++ b/src/test/regress/expected/enum.out
@@ -383,7 +383,6 @@ DROP INDEX enumtest_btree;
 -- Hash index / opclass with the = operator
 --
 CREATE INDEX enumtest_hash ON enumtest USING hash (col);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 SELECT * FROM enumtest WHERE col = 'orange';
   col   
 --------
diff --git a/src/test/regress/expected/macaddr.out b/src/test/regress/expected/macaddr.out
index e84ff5f..151f9ce 100644
--- a/src/test/regress/expected/macaddr.out
+++ b/src/test/regress/expected/macaddr.out
@@ -41,7 +41,6 @@ SELECT * FROM macaddr_data;
 
 CREATE INDEX macaddr_data_btree ON macaddr_data USING btree (b);
 CREATE INDEX macaddr_data_hash ON macaddr_data USING hash (b);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 SELECT a, b, trunc(b) FROM macaddr_data ORDER BY 2, 1;
  a  |         b         |       trunc       
 ----+-------------------+-------------------
diff --git a/src/test/regress/expected/replica_identity.out b/src/test/regress/expected/replica_identity.out
index 39a60a5..62f2ad7 100644
--- a/src/test/regress/expected/replica_identity.out
+++ b/src/test/regress/expected/replica_identity.out
@@ -12,7 +12,6 @@ CREATE UNIQUE INDEX test_replica_identity_keyab_key ON test_replica_identity (ke
 CREATE UNIQUE INDEX test_replica_identity_oid_idx ON test_replica_identity (oid);
 CREATE UNIQUE INDEX test_replica_identity_nonkey ON test_replica_identity (keya, nonkey);
 CREATE INDEX test_replica_identity_hash ON test_replica_identity USING hash (nonkey);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE UNIQUE INDEX test_replica_identity_expr ON test_replica_identity (keya, keyb, (3));
 CREATE UNIQUE INDEX test_replica_identity_partial ON test_replica_identity (keya, keyb) WHERE keyb != '3';
 -- default is 'd'/DEFAULT for user created tables
diff --git a/src/test/regress/expected/uuid.out b/src/test/regress/expected/uuid.out
index 59cb1e0..d907519 100644
--- a/src/test/regress/expected/uuid.out
+++ b/src/test/regress/expected/uuid.out
@@ -114,7 +114,6 @@ SELECT COUNT(*) FROM guid1 WHERE guid_field >= '22222222-2222-2222-2222-22222222
 -- btree and hash index creation test
 CREATE INDEX guid1_btree ON guid1 USING BTREE (guid_field);
 CREATE INDEX guid1_hash  ON guid1 USING HASH  (guid_field);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 -- unique index test
 CREATE UNIQUE INDEX guid1_unique_BTREE ON guid1 USING BTREE (guid_field);
 -- should fail
#2Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#1)
Re: Write Ahead Logging for Hash Indexes

On Tue, Aug 23, 2016 at 8:54 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

$SUBJECT will make hash indexes reliable and usable on standby.
AFAIU, currently hash indexes are not recommended to be used in
production mainly because they are not crash-safe and with this patch,
I hope we can address that limitation and recommend them for use in
production.

This patch is built on my earlier patch [1] of making hash indexes
concurrent. The main reason for doing so is that the earlier patch
allows to complete the split operation and used light-weight locking
due to which operations can be logged at granular level.

WAL for different operations:

This has been explained in README as well, but I am again writing it
here for the ease of people.

..

One of the challenges in writing this patch was that the current code was
not written with the mindset that we need to write WAL for the different
operations. A typical example is _hash_addovflpage(), where pages are
modified across different function calls and all the modifications need
to be done atomically, so I had to refactor some code so that the
operations can be logged sensibly.
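
To give an idea of the shape this takes, each operation now makes all of
its related page changes inside a single critical section and emits one
record covering them.  A rough sketch for the simple-insert case, with
illustrative variable names rather than the exact patch code:

START_CRIT_SECTION();

/* change the bucket page and the metapage together */
(void) PageAddItem(page, (Item) itup, itemsize, itup_off, false, false);
metap->hashm_ntuples += 1;

MarkBufferDirty(buf);
MarkBufferDirty(metabuf);

if (RelationNeedsWAL(rel))
{
	xl_hash_insert xlrec;
	XLogRecPtr	recptr;

	xlrec.offnum = itup_off;

	XLogBeginInsert();
	XLogRegisterData((char *) &xlrec, SizeOfHashInsert);
	XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
	XLogRegisterBufData(0, (char *) itup, itemsize);
	XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);

	recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INSERT);

	PageSetLSN(BufferGetPage(buf), recptr);
	PageSetLSN(BufferGetPage(metabuf), recptr);
}

END_CRIT_SECTION();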

This patch does not yet handle OldSnapshot. Previously, we didn't perform
TestForOldSnapshot() checks in hash indexes as they were not WAL-logged,
but now, with this patch, it makes sense to perform such checks.
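
For reference, the kind of check I mean is roughly the following, similar
to what btree does after reading a page during a scan (just a sketch; the
exact call sites for hash would still need to be worked out):

	/* after pinning and locking a bucket or overflow page in a scan */
	page = BufferGetPage(buf);
	TestForOldSnapshot(scan->xs_snapshot, rel, page);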

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#3Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Amit Kapila (#2)
Re: Write Ahead Logging for Hash Indexes

Hi All,

Following are the steps that I have followed to verify the WAL logging
of hash indexes:

1. I used Mithun's patch to improve coverage of the hash index code [1]
to verify the WAL logging of hash indexes. First, I confirmed with this
patch whether all the XLOG records associated with hash indexes are being
covered. Wherever an XLOG record for a hash index operation was not being
covered, I added a test case for it. I found that one XLOG record,
'XLOG_HASH_MOVE_PAGE_CONTENTS', was not being covered and added a small
test case for it. The patch for this is available at [2].

2. I executed the regression test suite and found all the hash indexes
that get created as part of the regression test suite using the query
below.

SELECT t.relname index_name, t.oid FROM pg_class t JOIN pg_am idx
ON idx.oid = t.relam WHERE idx.amname = 'hash';

3. I calculated the number of pages associated with each hash index and
compared every page of the hash index on the master and standby servers
using the pg_filedump tool. For example, if the number of pages
associated with 'con_hash_index' is 10, here is what I did:

On master:
-----------------
select pg_relation_filepath('con_hash_index');
pg_relation_filepath
----------------------
base/16408/16433
(1 row)

./pg_filedump -if -R 0 9 /home/edb/git-clone-postgresql/postgresql/TMP/postgres/master/base/16408/16433 > /tmp/file1

On Slave:
---------------
select pg_relation_filepath('con_hash_index');
pg_relation_filepath
----------------------
base/16408/16433
(1 row)

./pg_filedump -if -R 0 9 /home/edb/git-clone-postgresql/postgresql/TMP/postgres/standby/base/16408/16433 > /tmp/file2

Then I compared file1 and file2 using a diff tool.

Following is the list of hash indexes that got created inside the
regression database when the regression test suite was executed on the
master server:

hash_i4_index
hash_name_index
hash_txt_index
hash_f8_index
con_hash_index
hash_idx

In short, this is all I did, and I found no issues during testing. Please
let me know if you need any further details.

I would like to thank Amit for his support and guidance during the
testing phase.

[1]: /messages/by-id/CAA4eK1JOBX=YU33631Qh-XivYXtPSALh514+jR8XeD7v+K3r_Q@mail.gmail.com
[2]: /messages/by-id/CAE9k0PkNjryhSiG53mjnKFhi+MipJMjSa=YkH-UeW3bfr1HPJQ@mail.gmail.com

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com


#4Jeff Janes
jeff.janes@gmail.com
In reply to: Amit Kapila (#1)
Re: Write Ahead Logging for Hash Indexes

Hi Amit,

Thanks for working on this.

When building with --enable-cassert, I get a compiler warning:

hash.c: In function 'hashbucketcleanup':
hash.c:722: warning: 'new_bucket' may be used uninitialized in this function

After an intentionally created crash, I get an Assert triggering:

TRAP: FailedAssertion("!(((freep)[(bitmapbit)/32] &
(1<<((bitmapbit)%32))))", File: "hashovfl.c", Line: 553)

freep[0] is zero and bitmapbit is 16.

With this backtrace:

(gdb) bt
#0 0x0000003838c325e5 in raise (sig=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x0000003838c33dc5 in abort () at abort.c:92
#2 0x000000000081a8fd in ExceptionalCondition (conditionName=<value
optimized out>, errorType=<value optimized out>, fileName=<value optimized
out>,
lineNumber=<value optimized out>) at assert.c:54
#3 0x00000000004a4199 in _hash_freeovflpage (rel=0x7f3f745d86b8,
bucketbuf=198, ovflbuf=199, wbuf=198, itups=0x7ffc258fa090,
itup_offsets=0x126e8a8,
tups_size=0x7ffc258f93d0, nitups=70, bstrategy=0x12ba320) at
hashovfl.c:553
#4 0x00000000004a4c32 in _hash_squeezebucket (rel=<value optimized out>,
bucket=38, bucket_blkno=56, bucket_buf=198, bstrategy=0x12ba320)
at hashovfl.c:1010
#5 0x00000000004a042a in hashbucketcleanup (rel=0x7f3f745d86b8,
bucket_buf=198, bucket_blkno=56, bstrategy=0x12ba320, maxbucket=96,
highmask=127,
lowmask=63, tuples_removed=0x7ffc258fc1c8,
num_index_tuples=0x7ffc258fc1c0, bucket_has_garbage=0 '\000', delay=1
'\001',
callback=0x5e9bd0 <lazy_tid_reaped>, callback_state=0x126e248) at
hash.c:937
#6 0x00000000004a07e7 in hashbulkdelete (info=0x7ffc258fc2b0, stats=0x0,
callback=0x5e9bd0 <lazy_tid_reaped>, callback_state=0x126e248) at hash.c:580
#7 0x00000000005e98c5 in lazy_vacuum_index (indrel=0x7f3f745d86b8,
stats=0x126ecc0, vacrelstats=0x126e248) at vacuumlazy.c:1599
#8 0x00000000005ea7f9 in lazy_scan_heap (onerel=<value optimized out>,
options=<value optimized out>, params=0x12ba290, bstrategy=<value optimized
out>)
at vacuumlazy.c:1291
#9 lazy_vacuum_rel (onerel=<value optimized out>, options=<value optimized
out>, params=0x12ba290, bstrategy=<value optimized out>) at vacuumlazy.c:255
#10 0x00000000005e8939 in vacuum_rel (relid=17329, relation=0x7ffc258fcbd0,
options=99, params=0x12ba290) at vacuum.c:1399
#11 0x00000000005e8d01 in vacuum (options=99, relation=0x7ffc258fcbd0,
relid=<value optimized out>, params=0x12ba290, va_cols=0x0,
bstrategy=<value optimized out>, isTopLevel=1 '\001') at vacuum.c:307
#12 0x00000000006a07f1 in autovacuum_do_vac_analyze () at autovacuum.c:2823
#13 do_autovacuum () at autovacuum.c:2341
#14 0x00000000006a0f9c in AutoVacWorkerMain (argc=<value optimized out>,
argv=<value optimized out>) at autovacuum.c:1656
#15 0x00000000006a1116 in StartAutoVacWorker () at autovacuum.c:1461
#16 0x00000000006afb00 in StartAutovacuumWorker (postgres_signal_arg=<value
optimized out>) at postmaster.c:5323
#17 sigusr1_handler (postgres_signal_arg=<value optimized out>) at
postmaster.c:5009
#18 <signal handler called>
#19 0x0000003838ce1503 in __select_nocancel () at
../sysdeps/unix/syscall-template.S:82
#20 0x00000000006b0ec0 in ServerLoop (argc=<value optimized out>,
argv=<value optimized out>) at postmaster.c:1657
#21 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>)
at postmaster.c:1301
#22 0x0000000000632e88 in main (argc=4, argv=0x11f4d50) at main.c:228

Cheers,

Jeff

#5Mark Kirkwood
mark.kirkwood@catalyst.net.nz
In reply to: Amit Kapila (#1)
Re: Write Ahead Logging for Hash Indexes

On 23/08/16 15:24, Amit Kapila wrote:

Thoughts?

Note - To use this patch, first apply latest version of concurrent
hash index patch [2].

[1] - https://commitfest.postgresql.org/10/647/
[2] - /messages/by-id/CAA4eK1LkQ_Udism-Z2Dq6cUvjH3dB5FNFNnEzZBPsRjw0haFqA@mail.gmail.com

Firstly - really nice! The patches applied easily etc. to the latest
version 10 checkout.

I thought I'd test by initializing pgbench schema, adding a standby,
then adding some hash indexes and running pgbench:

Looking on the standby:

bench=# \d pgbench_accounts
Table "public.pgbench_accounts"
Column | Type | Modifiers
----------+---------------+-----------
aid | integer | not null
bid | integer |
abalance | integer |
filler | character(84) |
Indexes:
"pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
"pgbench_accounts_bid" hash (bid) <====

bench=# \d pgbench_history
Table "public.pgbench_history"
Column | Type | Modifiers
--------+-----------------------------+-----------
tid | integer |
bid | integer |
aid | integer |
delta | integer |
mtime | timestamp without time zone |
filler | character(22) |
Indexes:
"pgbench_history_bid" hash (bid) <=====

they have been replicated there ok.

However I'm seeing a hang on the master after a while:

bench=# SELECT datname,application_name,state,now()-xact_start AS
wait,query FROM pg_stat_activity ORDER BY wait DESC;
datname | application_name | state | wait | query
---------+------------------+--------+-----------------+----------------------------------------------------------------------------------------------------------------
| walreceiver | idle | |
bench | pgbench | active | 00:31:38.367467 | INSERT INTO
pgbench_history (tid, bid, aid, delta, mtime) VALUES (921, 38, 251973,
-3868, CURRENT_TIMESTAMP);
bench | pgbench | active | 00:31:38.215378 | INSERT INTO
pgbench_history (tid, bid, aid, delta, mtime) VALUES (280, 95, 3954814,
2091, CURRENT_TIMESTAMP);
bench | pgbench | active | 00:31:35.991056 | INSERT INTO
pgbench_history (tid, bid, aid, delta, mtime) VALUES (447, 33, 8355237,
3438, CURRENT_TIMESTAMP);
bench | pgbench | active | 00:31:35.619798 | INSERT INTO
pgbench_history (tid, bid, aid, delta, mtime) VALUES (134, 16, 4839994,
-2443, CURRENT_TIMESTAMP);
bench | pgbench | active | 00:31:35.544196 | INSERT INTO
pgbench_history (tid, bid, aid, delta, mtime) VALUES (37, 73, 9620119,
4053, CURRENT_TIMESTAMP);
bench | pgbench | active | 00:31:35.334504 | UPDATE
pgbench_branches SET bbalance = bbalance + -2954 WHERE bid = 33;
bench | pgbench | active | 00:31:35.234112 | UPDATE
pgbench_branches SET bbalance = bbalance + -713 WHERE bid = 38;
bench | pgbench | active | 00:31:34.434676 | UPDATE
pgbench_branches SET bbalance = bbalance + -921 WHERE bid = 33;
bench | psql | active | 00:00:00 | SELECT
datname,application_name,state,now()-xact_start AS wait,query FROM
pg_stat_activity ORDER BY wait DESC;
(10 rows)

but no errors in the logs, any thoughts?

Cheers

Mark


#6Mark Kirkwood
mark.kirkwood@catalyst.net.nz
In reply to: Mark Kirkwood (#5)
Re: Write Ahead Logging for Hash Indexes

On 24/08/16 12:09, Mark Kirkwood wrote:

On 23/08/16 15:24, Amit Kapila wrote:

Thoughts?

Note - To use this patch, first apply latest version of concurrent
hash index patch [2].

[1] - https://commitfest.postgresql.org/10/647/
[2] -
/messages/by-id/CAA4eK1LkQ_Udism-Z2Dq6cUvjH3dB5FNFNnEzZBPsRjw0haFqA@mail.gmail.com

Firstly - really nice! Patches applied easily etc to latest version 10
checkout.

I thought I'd test by initializing pgbench schema, adding a standby,
then adding some hash indexes and running pgbench:

Looking on the standby:

bench=# \d pgbench_accounts
Table "public.pgbench_accounts"
Column | Type | Modifiers
----------+---------------+-----------
aid | integer | not null
bid | integer |
abalance | integer |
filler | character(84) |
Indexes:
"pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
"pgbench_accounts_bid" hash (bid) <====

bench=# \d pgbench_history
Table "public.pgbench_history"
Column | Type | Modifiers
--------+-----------------------------+-----------
tid | integer |
bid | integer |
aid | integer |
delta | integer |
mtime | timestamp without time zone |
filler | character(22) |
Indexes:
"pgbench_history_bid" hash (bid) <=====

they have been replicated there ok.

However I'm seeing a hang on the master after a while:

bench=# SELECT datname,application_name,state,now()-xact_start AS
wait,query FROM pg_stat_activity ORDER BY wait DESC;
datname | application_name | state | wait | query
---------+------------------+--------+-----------------+----------------------------------------------------------------------------------------------------------------

| walreceiver | idle | |
bench | pgbench | active | 00:31:38.367467 | INSERT INTO
pgbench_history (tid, bid, aid, delta, mtime) VALUES (921, 38, 251973,
-3868, CURRENT_TIMESTAMP);
bench | pgbench | active | 00:31:38.215378 | INSERT INTO
pgbench_history (tid, bid, aid, delta, mtime) VALUES (280, 95,
3954814, 2091, CURRENT_TIMESTAMP);
bench | pgbench | active | 00:31:35.991056 | INSERT INTO
pgbench_history (tid, bid, aid, delta, mtime) VALUES (447, 33,
8355237, 3438, CURRENT_TIMESTAMP);
bench | pgbench | active | 00:31:35.619798 | INSERT INTO
pgbench_history (tid, bid, aid, delta, mtime) VALUES (134, 16,
4839994, -2443, CURRENT_TIMESTAMP);
bench | pgbench | active | 00:31:35.544196 | INSERT INTO
pgbench_history (tid, bid, aid, delta, mtime) VALUES (37, 73, 9620119,
4053, CURRENT_TIMESTAMP);
bench | pgbench | active | 00:31:35.334504 | UPDATE
pgbench_branches SET bbalance = bbalance + -2954 WHERE bid = 33;
bench | pgbench | active | 00:31:35.234112 | UPDATE
pgbench_branches SET bbalance = bbalance + -713 WHERE bid = 38;
bench | pgbench | active | 00:31:34.434676 | UPDATE
pgbench_branches SET bbalance = bbalance + -921 WHERE bid = 33;
bench | psql | active | 00:00:00 | SELECT
datname,application_name,state,now()-xact_start AS wait,query FROM
pg_stat_activity ORDER BY wait DESC;
(10 rows)

but no errors in the logs, any thoughts?

FWIW, retesting with the WAL logging patch removed (i.e. leaving just the
concurrent hash one) works fine.


#7Amit Kapila
amit.kapila16@gmail.com
In reply to: Mark Kirkwood (#6)
Re: Write Ahead Logging for Hash Indexes

On Wed, Aug 24, 2016 at 8:54 AM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:

On 24/08/16 12:09, Mark Kirkwood wrote:

On 23/08/16 15:24, Amit Kapila wrote:

Thoughts?

Note - To use this patch, first apply latest version of concurrent
hash index patch [2].

[1] - https://commitfest.postgresql.org/10/647/
[2] -
/messages/by-id/CAA4eK1LkQ_Udism-Z2Dq6cUvjH3dB5FNFNnEzZBPsRjw0haFqA@mail.gmail.com

Firstly - really nice! Patches applied easily etc to latest version 10
checkout.

I thought I'd test by initializing pgbench schema, adding a standby, then
adding some hash indexes and running pgbench:

Looking on the standby:

bench=# \d pgbench_accounts
Table "public.pgbench_accounts"
Column | Type | Modifiers
----------+---------------+-----------
aid | integer | not null
bid | integer |
abalance | integer |
filler | character(84) |
Indexes:
"pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
"pgbench_accounts_bid" hash (bid) <====

bench=# \d pgbench_history
Table "public.pgbench_history"
Column | Type | Modifiers
--------+-----------------------------+-----------
tid | integer |
bid | integer |
aid | integer |
delta | integer |
mtime | timestamp without time zone |
filler | character(22) |
Indexes:
"pgbench_history_bid" hash (bid) <=====

they have been replicated there ok.

However I'm seeing a hang on the master after a while:

bench=# SELECT datname,application_name,state,now()-xact_start AS
wait,query FROM pg_stat_activity ORDER BY wait DESC;
datname | application_name | state | wait | query

---------+------------------+--------+-----------------+----------------------------------------------------------------------------------------------------------------
| walreceiver | idle | |
bench | pgbench | active | 00:31:38.367467 | INSERT INTO
pgbench_history (tid, bid, aid, delta, mtime) VALUES (921, 38, 251973,
-3868, CURRENT_TIMESTAMP);
bench | pgbench | active | 00:31:38.215378 | INSERT INTO
pgbench_history (tid, bid, aid, delta, mtime) VALUES (280, 95, 3954814,
2091, CURRENT_TIMESTAMP);
bench | pgbench | active | 00:31:35.991056 | INSERT INTO
pgbench_history (tid, bid, aid, delta, mtime) VALUES (447, 33, 8355237,
3438, CURRENT_TIMESTAMP);
bench | pgbench | active | 00:31:35.619798 | INSERT INTO
pgbench_history (tid, bid, aid, delta, mtime) VALUES (134, 16, 4839994,
-2443, CURRENT_TIMESTAMP);
bench | pgbench | active | 00:31:35.544196 | INSERT INTO
pgbench_history (tid, bid, aid, delta, mtime) VALUES (37, 73, 9620119, 4053,
CURRENT_TIMESTAMP);
bench | pgbench | active | 00:31:35.334504 | UPDATE
pgbench_branches SET bbalance = bbalance + -2954 WHERE bid = 33;
bench | pgbench | active | 00:31:35.234112 | UPDATE
pgbench_branches SET bbalance = bbalance + -713 WHERE bid = 38;
bench | pgbench | active | 00:31:34.434676 | UPDATE
pgbench_branches SET bbalance = bbalance + -921 WHERE bid = 33;
bench | psql | active | 00:00:00 | SELECT
datname,application_name,state,now()-xact_start AS wait,query FROM
pg_stat_activity ORDER BY wait DESC;
(10 rows)

but no errors in the logs, any thoughts?

Can you get the call stacks?

FWIW, retesting with the wal logging patch removed (i.e leaving the
concurrent hast one) works fine.

Okay, information noted.

Thanks for testing and showing interest in the patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#8Mark Kirkwood
mark.kirkwood@catalyst.net.nz
In reply to: Amit Kapila (#7)
Re: Write Ahead Logging for Hash Indexes

On 24/08/16 15:36, Amit Kapila wrote:

On Wed, Aug 24, 2016 at 8:54 AM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:

On 24/08/16 12:09, Mark Kirkwood wrote:

On 23/08/16 15:24, Amit Kapila wrote:

Thoughts?

Note - To use this patch, first apply latest version of concurrent
hash index patch [2].

[1] - https://commitfest.postgresql.org/10/647/
[2] -
/messages/by-id/CAA4eK1LkQ_Udism-Z2Dq6cUvjH3dB5FNFNnEzZBPsRjw0haFqA@mail.gmail.com

Firstly - really nice! Patches applied easily etc to latest version 10
checkout.

I thought I'd test by initializing pgbench schema, adding a standby, then
adding some hash indexes and running pgbench:

Looking on the standby:

bench=# \d pgbench_accounts
Table "public.pgbench_accounts"
Column | Type | Modifiers
----------+---------------+-----------
aid | integer | not null
bid | integer |
abalance | integer |
filler | character(84) |
Indexes:
"pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
"pgbench_accounts_bid" hash (bid) <====

bench=# \d pgbench_history
Table "public.pgbench_history"
Column | Type | Modifiers
--------+-----------------------------+-----------
tid | integer |
bid | integer |
aid | integer |
delta | integer |
mtime | timestamp without time zone |
filler | character(22) |
Indexes:
"pgbench_history_bid" hash (bid) <=====

they have been replicated there ok.

However I'm seeing a hang on the master after a while:

bench=# SELECT datname,application_name,state,now()-xact_start AS
wait,query FROM pg_stat_activity ORDER BY wait DESC;
datname | application_name | state | wait | query

---------+------------------+--------+-----------------+----------------------------------------------------------------------------------------------------------------
| walreceiver | idle | |
bench | pgbench | active | 00:31:38.367467 | INSERT INTO
pgbench_history (tid, bid, aid, delta, mtime) VALUES (921, 38, 251973,
-3868, CURRENT_TIMESTAMP);
bench | pgbench | active | 00:31:38.215378 | INSERT INTO
pgbench_history (tid, bid, aid, delta, mtime) VALUES (280, 95, 3954814,
2091, CURRENT_TIMESTAMP);
bench | pgbench | active | 00:31:35.991056 | INSERT INTO
pgbench_history (tid, bid, aid, delta, mtime) VALUES (447, 33, 8355237,
3438, CURRENT_TIMESTAMP);
bench | pgbench | active | 00:31:35.619798 | INSERT INTO
pgbench_history (tid, bid, aid, delta, mtime) VALUES (134, 16, 4839994,
-2443, CURRENT_TIMESTAMP);
bench | pgbench | active | 00:31:35.544196 | INSERT INTO
pgbench_history (tid, bid, aid, delta, mtime) VALUES (37, 73, 9620119, 4053,
CURRENT_TIMESTAMP);
bench | pgbench | active | 00:31:35.334504 | UPDATE
pgbench_branches SET bbalance = bbalance + -2954 WHERE bid = 33;
bench | pgbench | active | 00:31:35.234112 | UPDATE
pgbench_branches SET bbalance = bbalance + -713 WHERE bid = 38;
bench | pgbench | active | 00:31:34.434676 | UPDATE
pgbench_branches SET bbalance = bbalance + -921 WHERE bid = 33;
bench | psql | active | 00:00:00 | SELECT
datname,application_name,state,now()-xact_start AS wait,query FROM
pg_stat_activity ORDER BY wait DESC;
(10 rows)

but no errors in the logs, any thoughts?

Can you get the call stacks?

For every stuck backend? (just double checking)...


#9Amit Kapila
amit.kapila16@gmail.com
In reply to: Mark Kirkwood (#8)
Re: Write Ahead Logging for Hash Indexes

On Wed, Aug 24, 2016 at 9:53 AM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:

On 24/08/16 15:36, Amit Kapila wrote:

On Wed, Aug 24, 2016 at 8:54 AM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:

Can you get the call stacks?

For every stuck backend? (just double checking)...

Yeah.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#10Mark Kirkwood
mark.kirkwood@catalyst.net.nz
In reply to: Amit Kapila (#9)
1 attachment(s)
Re: Write Ahead Logging for Hash Indexes

On 24/08/16 16:33, Amit Kapila wrote:

On Wed, Aug 24, 2016 at 9:53 AM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:

On 24/08/16 15:36, Amit Kapila wrote:

On Wed, Aug 24, 2016 at 8:54 AM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:
Can you get the call stacks?

For every stuck backend? (just double checking)...

Yeah.

Cool,

I managed to reproduce with a reduced workload of 4 backends, then
noticed that the traces for 3 of 'em were all the same. So I've attached
the 2 unique ones, plus noted what pg_stat_activity thought the wait
event was, in case that is useful.

Cheers

Mark

Attachments:

bt.out (text/plain)
#11Mark Kirkwood
mark.kirkwood@catalyst.net.nz
In reply to: Mark Kirkwood (#10)
1 attachment(s)
Re: Write Ahead Logging for Hash Indexes

On 24/08/16 16:52, Mark Kirkwood wrote:

On 24/08/16 16:33, Amit Kapila wrote:

On Wed, Aug 24, 2016 at 9:53 AM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:

On 24/08/16 15:36, Amit Kapila wrote:

On Wed, Aug 24, 2016 at 8:54 AM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:
Can you get the call stacks?

For every stuck backend? (just double checking)...

Yeah.

Cool,

I managed to reproduce with a reduced workload of 4 backends, then
noticed that the traces for 3 of 'em were all the same. So I've
attached the 2 unique ones, plus noted what pg_stat_activity thought
the wait event was, in case that is useful.

...actually I was wrong there, only 2 of them were the same. So I've
attached a new log of bt's of them all.

Attachments:

bt.out (text/plain)
#12Amit Kapila
amit.kapila16@gmail.com
In reply to: Jeff Janes (#4)
Re: Write Ahead Logging for Hash Indexes

On Wed, Aug 24, 2016 at 2:37 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

Hi Amit,

Thanks for working on this.

When building with --enable-cassert, I get compiler warning:

hash.c: In function 'hashbucketcleanup':
hash.c:722: warning: 'new_bucket' may be used uninitialized in this function

This warning is from the concurrent hash index patch. I will fix it and
post the patch on that thread.

After an intentionally created crash, I get an Assert triggering:

TRAP: FailedAssertion("!(((freep)[(bitmapbit)/32] &
(1<<((bitmapbit)%32))))", File: "hashovfl.c", Line: 553)

freep[0] is zero and bitmapbit is 16.

What is happening here is that when it tries to clear the bitmap bit, it
expects the bit to be set. I think the reason it didn't find the bit set
could be that after the new overflow page was added and the bit
corresponding to it was set, you crashed the system, and replay did not
set the bit again. Then, while freeing the overflow page, it can hit the
Assert you mentioned. I think the problem could be that I am using
REGBUF_STANDARD to log the bitmap page updates. As the bitmap page
doesn't follow the standard page layout, the actual contents would have
been omitted while taking the full-page image, and then during replay the
bit would not have been set, because the page is restored from that image
instead of going through REDO. I think the fix here is to use
REGBUF_NO_IMAGE, as we do for visibility map buffers.
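
In other words, something along these lines (a simplified sketch, not the
exact patch code): register the bitmap buffer without a full-page image
and always set the bit explicitly in the redo routine.  Here block 2 is
the bitmap page of the add-overflow-page record, and mapbuf/bitmapbit are
illustrative names:

/* while logging the overflow page addition */
XLogRegisterBuffer(2, mapbuf, REGBUF_NO_IMAGE);

/* and in hash_redo(), replay the bit change explicitly */
if (XLogReadBufferForRedo(record, 2, &mapbuf) == BLK_NEEDS_REDO)
{
	Page		mappage = BufferGetPage(mapbuf);
	uint32	   *freep = HashPageGetBitmap(mappage);

	SETBIT(freep, bitmapbit);	/* bit for the newly allocated ovfl page */
	PageSetLSN(mappage, record->EndRecPtr);
	MarkBufferDirty(mapbuf);
}
if (BufferIsValid(mapbuf))
	UnlockReleaseBuffer(mapbuf);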

If you can send me the detailed steps for how you have produced the
problem, then I can verify after fixing whether you are seeing the
same problem or something else.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#13Mark Kirkwood
mark.kirkwood@catalyst.net.nz
In reply to: Mark Kirkwood (#11)
1 attachment(s)
Re: Write Ahead Logging for Hash Indexes

On 24/08/16 17:01, Mark Kirkwood wrote:

...actually I was wrong there, only 2 of them were the same. So I've
attached a new log of bt's of them all.

And I can reproduce with only 1 session; I figured that might be a helpful
piece of the puzzle (trace attached).

Attachments:

bt.out (text/plain)
#14Amit Kapila
amit.kapila16@gmail.com
In reply to: Mark Kirkwood (#13)
1 attachment(s)
Re: Write Ahead Logging for Hash Indexes

On Wed, Aug 24, 2016 at 11:44 AM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:

On 24/08/16 17:01, Mark Kirkwood wrote:

...actually I was wrong there, only 2 of them were the same. So I've
attached a new log of bt's of them all.

And I can reproduce with only 1 session, figured that might be a helpful
piece of the puzzle (trace attached).

Thanks.

I think I know the problem here. Basically, _hash_freeovflpage() tries to
take a lock on the buffer previous to the overflow page in order to
update the chain links, and it is quite possible that the same buffer is
already locked for moving the tuples while squeezing the bucket. I am
working on a fix for this.
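
The rough shape of the fix I have in mind is to make _hash_freeovflpage()
aware of the buffer the squeeze operation already holds, so that it
reuses that lock instead of trying to acquire it a second time.  An
illustrative sketch (not the final code):

	/* inside _hash_freeovflpage(), before touching the previous page */
	if (BlockNumberIsValid(prevblkno))
	{
		if (prevblkno == BufferGetBlockNumber(wbuf))
		{
			/* the caller already holds this page locked; just reuse it */
			prevbuf = wbuf;
		}
		else
			prevbuf = _hash_getbuf_with_strategy(rel, prevblkno, HASH_WRITE,
												 LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
												 bstrategy);
	}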

Coincidentally, Ashutosh Sharma, a colleague of mine who was also testing
this patch, found the same issue with the attached SQL script. So we
might be able to add a test case to the regression suite as well after
the fix.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

test_hash.sql (application/octet-stream)
#15Jeff Janes
jeff.janes@gmail.com
In reply to: Amit Kapila (#12)
3 attachment(s)
Re: Write Ahead Logging for Hash Indexes

On Tue, Aug 23, 2016 at 10:05 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Wed, Aug 24, 2016 at 2:37 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

After an intentionally created crash, I get an Assert triggering:

TRAP: FailedAssertion("!(((freep)[(bitmapbit)/32] &
(1<<((bitmapbit)%32))))", File: "hashovfl.c", Line: 553)

freep[0] is zero and bitmapbit is 16.

Here what is happening is that when it tries to clear the bitmapbit,
it expects it to be set. Now, I think the reason for why it didn't
find the bit as set could be that after the new overflow page is added
and the bit corresponding to it is set, you might have crashed the
system and the replay would not have set the bit. Then while freeing
the overflow page it can hit the Assert as mentioned by you. I think
the problem here could be that I am using REGBUF_STANDARD to log the
bitmap page updates which seems to be causing the issue. As bitmap
page doesn't follow the standard page layout, it would have omitted
the actual contents while taking full page image and then during
replay, it would not have set the bit, because page doesn't need REDO.
I think here the fix is to use REGBUF_NO_IMAGE as we use for vm
buffers.

If you can send me the detailed steps for how you have produced the
problem, then I can verify after fixing whether you are seeing the
same problem or something else.

The test is rather awkward; it might be easier to just have me test it.
But I've attached it.

There is a patch that needs to be applied and compiled (alongside your
patches, of course) to inject the crashes, a Perl script which creates
the schema and does the updates, and a shell script which sets up the
cluster with the appropriate parameters and then calls the Perl script in
a loop.

The top of the shell script has some hard-coded paths to the binaries and
to the test data directory (which is automatically deleted).

I run it like "sh do.sh >& do.err &"

It gives two different types of assertion failures:

$ fgrep TRAP: do.err |sort|uniq -c

21 TRAP: FailedAssertion("!(((freep)[(bitmapbit)/32] &
(1<<((bitmapbit)%32))))", File: "hashovfl.c", Line: 553)
32 TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c",
Line: 2506)

The second one is related to the intentional crashes, and so is not
relevant to you.

Cheers,

Jeff

Attachments:

count.pl (application/octet-stream)
crash_REL10.patch (application/octet-stream)
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
new file mode 100644
index 2f7e645..3d7e255
*** a/src/backend/access/transam/varsup.c
--- b/src/backend/access/transam/varsup.c
***************
*** 33,38 ****
--- 33,41 ----
  /* pointer to "variable cache" in shared memory (set up by shmem.c) */
  VariableCache ShmemVariableCache = NULL;
  
+ int JJ_xid=0;
+ extern int	JJ_vac;
+ 
  
  /*
   * Allocate the next XID for a new transaction or subtransaction.
*************** GetNewTransactionId(bool isSubXact)
*** 168,173 ****
--- 171,181 ----
  	 *
  	 * Extend pg_subtrans and pg_commit_ts too.
  	 */
+ 	{
+ 	int		incr;
+ 	for (incr=0; incr <=JJ_xid; incr++)
+ 	{
+ 	xid = ShmemVariableCache->nextXid;
  	ExtendCLOG(xid);
  	ExtendCommitTs(xid);
  	ExtendSUBTRANS(xid);
*************** GetNewTransactionId(bool isSubXact)
*** 179,184 ****
--- 187,194 ----
  	 * more XIDs until there is CLOG space for them.
  	 */
  	TransactionIdAdvance(ShmemVariableCache->nextXid);
+ 	}
+ 	}
  
  	/*
  	 * We must store the new XID into the shared ProcArray before releasing
*************** SetTransactionIdLimit(TransactionId olde
*** 342,349 ****
  	LWLockRelease(XidGenLock);
  
  	/* Log the info */
! 	ereport(DEBUG1,
! 			(errmsg("transaction ID wrap limit is %u, limited by database with OID %u",
  					xidWrapLimit, oldest_datoid)));
  
  	/*
--- 352,359 ----
  	LWLockRelease(XidGenLock);
  
  	/* Log the info */
! 	ereport(LOG,
! 			(errmsg("JJ transaction ID wrap limit is %u, limited by database with OID %u",
  					xidWrapLimit, oldest_datoid)));
  
  	/*
*************** ForceTransactionIdLimitUpdate(void)
*** 420,425 ****
--- 430,446 ----
  	oldestXidDB = ShmemVariableCache->oldestXidDB;
  	LWLockRelease(XidGenLock);
  
+ 	if (JJ_vac) {
+ 	elog(LOG,"JJ ForceTransactionIdLimitUpdate in %d: !normal %d, !valid %d, follows %d (%u, %u), !exists %d", MyDatabaseId,
+ 		!TransactionIdIsNormal(oldestXid),
+ 		!TransactionIdIsValid(xidVacLimit),
+ 		TransactionIdFollowsOrEquals(nextXid, xidVacLimit),
+ 		nextXid, xidVacLimit,
+  		!SearchSysCacheExists1(DATABASEOID, ObjectIdGetDatum(oldestXidDB))
+ 	);
+ 	};
+ 
+ 
  	if (!TransactionIdIsNormal(oldestXid))
  		return true;			/* shouldn't happen, but just in case */
  	if (!TransactionIdIsValid(xidVacLimit))
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
new file mode 100644
index f13f9c1..bc92ec6
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
*************** BootStrapXLOG(void)
*** 4821,4826 ****
--- 4821,4827 ----
  	ShmemVariableCache->nextOid = checkPoint.nextOid;
  	ShmemVariableCache->oidCount = 0;
  	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
+ 	//elog(LOG,"JJ SetTransactionIDLimit %d", checkPoint.oldestXid);
  	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
  	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
  	SetCommitTsLimit(InvalidTransactionId, InvalidTransactionId);
*************** StartupXLOG(void)
*** 6339,6344 ****
--- 6340,6346 ----
  	ShmemVariableCache->nextOid = checkPoint.nextOid;
  	ShmemVariableCache->oidCount = 0;
  	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
+ 	//elog(LOG,"JJ SetTransactionIDLimit %d", checkPoint.oldestXid);
  	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
  	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
  	SetCommitTsLimit(checkPoint.oldestCommitTsXid,
*************** xlog_redo(XLogReaderState *record)
*** 9247,9252 ****
--- 9249,9255 ----
  
  		MultiXactAdvanceOldest(checkPoint.oldestMulti,
  							   checkPoint.oldestMultiDB);
+ 		//elog(LOG,"JJ SetTransactionIDLimit %d", checkPoint.oldestXid);
  		SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
  
  		/*
*************** xlog_redo(XLogReaderState *record)
*** 9344,9349 ****
--- 9347,9353 ----
  		 */
  		MultiXactAdvanceOldest(checkPoint.oldestMulti,
  							   checkPoint.oldestMultiDB);
+ 		//elog(LOG,"JJ maybe SetTransactionIDLimit %d", checkPoint.oldestXid);
  		if (TransactionIdPrecedes(ShmemVariableCache->oldestXid,
  								  checkPoint.oldestXid))
  			SetTransactionIdLimit(checkPoint.oldestXid,
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
new file mode 100644
index 0563e63..348b855
*** a/src/backend/commands/vacuum.c
--- b/src/backend/commands/vacuum.c
*************** int			vacuum_freeze_min_age;
*** 58,63 ****
--- 58,65 ----
  int			vacuum_freeze_table_age;
  int			vacuum_multixact_freeze_min_age;
  int			vacuum_multixact_freeze_table_age;
+ int			JJ_vac=0;
+ 
  
  
  /* A few variables that don't seem worth passing around as parameters */
*************** vacuum_set_xid_limits(Relation rel,
*** 540,545 ****
--- 542,548 ----
  	}
  
  	*freezeLimit = limit;
+ 	if (JJ_vac) elog(LOG,"JJ freezeLimit %d", *freezeLimit);
  
  	/*
  	 * Compute the multixact age for which freezing is urgent.  This is
*************** vacuum_set_xid_limits(Relation rel,
*** 594,599 ****
--- 597,604 ----
  		 * VACUUM schedule, the nightly VACUUM gets a chance to freeze tuples
  		 * before anti-wraparound autovacuum is launched.
  		 */
+ 		if (JJ_vac) elog(LOG,"JJ freeze_min_age %d vacuum_freeze_table_age %d freeze_table_age %d ReadNew %d", freeze_min_age, 
+                            vacuum_freeze_table_age, freeze_table_age,ReadNewTransactionId());
  		freezetable = freeze_table_age;
  		if (freezetable < 0)
  			freezetable = vacuum_freeze_table_age;
*************** vac_update_datfrozenxid(void)
*** 1031,1036 ****
--- 1036,1042 ----
  	 * truncate pg_clog and/or pg_multixact.  Also do it if the shared
  	 * XID-wrap-limit info is stale, since this action will update that too.
  	 */
+ 	if (JJ_vac && dirty) elog(LOG,"JJ updating in %d without call to ForceTransactionIdLimitUpdate", MyDatabaseId);
  	if (dirty || ForceTransactionIdLimitUpdate())
  		vac_truncate_clog(newFrozenXid, newMinMulti,
  						  lastSaneFrozenXid, lastSaneMinMulti);
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
new file mode 100644
index 231e92d..c709b37
*** a/src/backend/commands/vacuumlazy.c
--- b/src/backend/commands/vacuumlazy.c
***************
*** 64,69 ****
--- 64,70 ----
  #include "utils/tqual.h"
  
  
+ extern int JJ_vac;
  /*
   * Space/time tradeoff parameters: do these need to be user-tunable?
   *
*************** lazy_vacuum_rel(Relation onerel, int opt
*** 236,241 ****
--- 237,244 ----
  	if (options & VACOPT_DISABLE_PAGE_SKIPPING)
  		aggressive = true;
  
+ 	if (JJ_vac) elog(LOG,"JJ aggresive %d, relfrozenid %d", aggressive, onerel->rd_rel->relfrozenxid);
+ 
  	vacrelstats = (LVRelStats *) palloc0(sizeof(LVRelStats));
  
  	vacrelstats->old_rel_pages = onerel->rd_rel->relpages;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
new file mode 100644
index 3768f50..beca0eb
*** a/src/backend/postmaster/autovacuum.c
--- b/src/backend/postmaster/autovacuum.c
*************** int			autovacuum_vac_cost_delay;
*** 122,127 ****
--- 122,128 ----
  int			autovacuum_vac_cost_limit;
  
  int			Log_autovacuum_min_duration = -1;
+ extern int  JJ_vac;
  
  /* how long to keep pgstat data in the launcher, in milliseconds */
  #define STATS_READ_DELAY 1000
*************** AutoVacWorkerMain(int argc, char *argv[]
*** 1643,1650 ****
  		InitPostgres(NULL, dbid, NULL, InvalidOid, dbname);
  		SetProcessingMode(NormalProcessing);
  		set_ps_display(dbname, false);
- 		ereport(DEBUG1,
- 				(errmsg("autovacuum: processing database \"%s\"", dbname)));
  
  		if (PostAuthDelay)
  			pg_usleep(PostAuthDelay * 1000000L);
--- 1644,1649 ----
*************** AutoVacWorkerMain(int argc, char *argv[]
*** 1652,1658 ****
--- 1651,1661 ----
  		/* And do an appropriate amount of work */
  		recentXid = ReadNewTransactionId();
  		recentMulti = ReadNextMultiXactId();
+ 		if (JJ_vac) ereport(LOG,
+ 				(errmsg("autovacuum: processing database \"%s\" at recent Xid of %u recent mxid of %u", dbname,recentXid,recentMulti)));
  		do_autovacuum();
+ 		if (JJ_vac) ereport(LOG,
+ 				(errmsg("autovacuum: done processing database \"%s\" at recent Xid of %u recent mxid of %u", dbname,ReadNewTransactionId(),ReadNextMultiXactId())));
  	}
  
  	/*
*************** relation_needs_vacanalyze(Oid relid,
*** 2773,2785 ****
  		 * reset, because if that happens, the last vacuum and analyze counts
  		 * will be reset too.
  		 */
- 		elog(DEBUG3, "%s: vac: %.0f (threshold %.0f), anl: %.0f (threshold %.0f)",
- 			 NameStr(classForm->relname),
- 			 vactuples, vacthresh, anltuples, anlthresh);
  
  		/* Determine if this table needs vacuum or analyze. */
  		*dovacuum = force_vacuum || (vactuples > vacthresh);
  		*doanalyze = (anltuples > anlthresh);
  	}
  	else
  	{
--- 2776,2789 ----
  		 * reset, because if that happens, the last vacuum and analyze counts
  		 * will be reset too.
  		 */
  
  		/* Determine if this table needs vacuum or analyze. */
  		*dovacuum = force_vacuum || (vactuples > vacthresh);
  		*doanalyze = (anltuples > anlthresh);
+ 
+ 		if (JJ_vac) elog(LOG, "%s: vac: %.0f (threshold %.0f), anl: %.0f (threshold %.0f) wraparound %d dovaccum %d doanalyze %d",
+ 			 NameStr(classForm->relname),
+ 			 vactuples, vacthresh, anltuples, anlthresh, *wraparound, *dovacuum, *doanalyze);
  	}
  	else
  	{
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
new file mode 100644
index f329d15..8fe51f9
*** a/src/backend/storage/smgr/md.c
--- b/src/backend/storage/smgr/md.c
***************
*** 67,72 ****
--- 67,74 ----
  #define FILE_POSSIBLY_DELETED(err)	((err) == ENOENT || (err) == EACCES)
  #endif
  
+ int JJ_torn_page=0;
+ 
  /*
   *	The magnetic disk storage manager keeps track of open file
   *	descriptors in its own descriptor pool.  This is done to make it
*************** mdwrite(SMgrRelation reln, ForkNumber fo
*** 806,811 ****
--- 808,814 ----
  	off_t		seekpos;
  	int			nbytes;
  	MdfdVec    *v;
+         static int counter=0;
  
  	/* This assert is too expensive to have on normally ... */
  #ifdef CHECK_WRITE_VS_EXTEND
*************** mdwrite(SMgrRelation reln, ForkNumber fo
*** 831,837 ****
  				 errmsg("could not seek to block %u in file \"%s\": %m",
  						blocknum, FilePathName(v->mdfd_vfd))));
  
! 	nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
  
  	TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
  										reln->smgr_rnode.node.spcNode,
--- 834,851 ----
  				 errmsg("could not seek to block %u in file \"%s\": %m",
  						blocknum, FilePathName(v->mdfd_vfd))));
  
!         if (JJ_torn_page > 0 && counter++ > JJ_torn_page && !RecoveryInProgress()) {
! 	  nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ/3);
! 		ereport(FATAL,
! 				(errcode(ERRCODE_DISK_FULL),
! 				 errmsg("could not write block %u of relation %s: wrote only %d of %d bytes",
! 						blocknum,
! 						relpath(reln->smgr_rnode, forknum),
! 						nbytes, BLCKSZ),
! 				 errhint("JJ is screwing with the database.")));
!         } else {
! 	  nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
! 	}
  
  	TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
  										reln->smgr_rnode.node.spcNode,
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index c5178f7..c2d8e70
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 106,111 ****
--- 106,114 ----
  /* XXX these should appear in other modules' header files */
  extern bool Log_disconnections;
  extern int	CommitDelay;
+ int	JJ_torn_page;
+ extern int	JJ_xid;
+ extern int	JJ_vac;
  extern int	CommitSiblings;
  extern char *default_tablespace;
  extern char *temp_tablespaces;
*************** static struct config_int ConfigureNamesI
*** 2359,2364 ****
--- 2362,2394 ----
  	},
  
  	{
+ 		{"JJ_torn_page", PGC_USERSET, WAL_SETTINGS,
+ 			gettext_noop("Simulate a torn-page crash after this number of page writes (0 to turn off)"),
+ 			NULL
+ 		},
+ 		&JJ_torn_page,
+ 		0, 0, 100000, NULL, NULL
+ 	},
+ 
+ 	{
+ 		{"JJ_xid", PGC_USERSET, WAL_SETTINGS,
+ 			gettext_noop("Skip this many xid every time we acquire one"),
+ 			NULL
+ 		},
+ 		&JJ_xid,
+ 		0, 0, 1000000, NULL, NULL
+ 	},
+ 
+ 	{
+ 		{"JJ_vac", PGC_USERSET, WAL_SETTINGS,
+ 			gettext_noop("turn on verbose logging"),
+ 			NULL
+ 		},
+ 		&JJ_vac,
+ 		0, 0, 1000000, NULL, NULL
+ 	},
+ 
+ 	{
  		{"commit_siblings", PGC_USERSET, WAL_SETTINGS,
  			gettext_noop("Sets the minimum concurrent open transactions before performing "
  						 "commit_delay."),
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
new file mode 100644
index a2b2b61..fe4c806
*** a/src/include/pg_config_manual.h
--- b/src/include/pg_config_manual.h
***************
*** 283,289 ****
  /*
   * Enable debugging print statements for lock-related operations.
   */
! /* #define LOCK_DEBUG */
  
  /*
   * Enable debugging print statements for WAL-related operations; see
--- 283,289 ----
  /*
   * Enable debugging print statements for lock-related operations.
   */
! #define LOCK_DEBUG 1
  
  /*
   * Enable debugging print statements for WAL-related operations; see
Attachment: do.sh (application/x-sh)
#16Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amit Kapila (#1)
Re: Write Ahead Logging for Hash Indexes

Amit Kapila wrote:

$SUBJECT will make hash indexes reliable and usable on standby.

Nice work.

Can you split the new xlog-related stuff to a new file, say hash_xlog.h,
instead of cramming it in hash.h? Removing the existing #include
"xlogreader.h" from hash.h would be nice. I volunteer for pushing any
preliminary header cleanup commits.

I think it's silly that access/hash.h is the go-to header for hash
usage, including hash_any(). Not this patch's fault, of course, just
venting.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#17Amit Kapila
amit.kapila16@gmail.com
In reply to: Alvaro Herrera (#16)
Re: Write Ahead Logging for Hash Indexes

On Wed, Aug 24, 2016 at 11:46 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Amit Kapila wrote:

$SUBJECT will make hash indexes reliable and usable on standby.

Nice work.

Can you split the new xlog-related stuff to a new file, say hash_xlog.h,
instead of cramming it in hash.h? Removing the existing #include
"xlogreader.h" from hash.h would be nice. I volunteer for pushing any
preliminary header cleanup commits.

So, what you are expecting here is that we create hash_xlog.h and move
necessary functions like hash_redo(), hash_desc() and hash_identify()
to it. Then do whatever else is required to build it successfully.
Once that is done, I can build my patches on top of it. Is that
right? If yes, I can work on it.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#18Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amit Kapila (#17)
Re: Write Ahead Logging for Hash Indexes

Amit Kapila wrote:

On Wed, Aug 24, 2016 at 11:46 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Can you split the new xlog-related stuff to a new file, say hash_xlog.h,
instead of cramming it in hash.h? Removing the existing #include
"xlogreader.h" from hash.h would be nice. I volunteer for pushing any
preliminary header cleanup commits.

So, what you are expecting here is that we create hash_xlog.h and move
necessary functions like hash_redo(), hash_desc() and hash_identify()
to it. Then do whatever else is required to build it successfully.
Once that is done, I can build my patches on top of it. Is that
right? If yes, I can work on it.

Yes, thanks.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#19Amit Kapila
amit.kapila16@gmail.com
In reply to: Alvaro Herrera (#18)
1 attachment(s)
Re: Write Ahead Logging for Hash Indexes

On Thu, Aug 25, 2016 at 6:54 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Amit Kapila wrote:

On Wed, Aug 24, 2016 at 11:46 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Can you split the new xlog-related stuff to a new file, say hash_xlog.h,
instead of cramming it in hash.h? Removing the existing #include
"xlogreader.h" from hash.h would be nice. I volunteer for pushing any
preliminary header cleanup commits.

So, what you are expecting here is that we create hash_xlog.h and move
necessary functions like hash_redo(), hash_desc() and hash_identify()
to it. Then do whatever else is required to build it successfully.
Once that is done, I can build my patches on top of it. Is that
right? If yes, I can work on it.

Yes, thanks.

How about attached? If you want, I think we can go one step further and
move hash_redo to a new file, hash_xlog.c, which is required for the main
patch, but we can leave it for later as well.
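
To illustrate what I have in mind, a bare skeleton of such a file could
look like this (just an illustrative sketch, not the actual contents of
the main patch; the dispatch-on-info-bits shape is assumed from how the
other index AMs structure their redo routines):

/* src/backend/access/hash/hash_xlog.c -- hypothetical skeleton */
#include "postgres.h"

#include "access/hash_xlog.h"

void
hash_redo(XLogReaderState *record)
{
	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;

	switch (info)
	{
			/* one replay routine per hash WAL record type would go here */
		default:
			elog(PANIC, "hash_redo: unknown op code %u", info);
	}
}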

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

add_hash_xlog-v1.patch (application/octet-stream)
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 07496f8..e3b1eef 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -19,6 +19,7 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "access/hash_xlog.h"
 #include "access/relscan.h"
 #include "catalog/index.h"
 #include "commands/vacuum.h"
diff --git a/src/backend/access/rmgrdesc/hashdesc.c b/src/backend/access/rmgrdesc/hashdesc.c
index d37c9b1..12e1818 100644
--- a/src/backend/access/rmgrdesc/hashdesc.c
+++ b/src/backend/access/rmgrdesc/hashdesc.c
@@ -14,7 +14,7 @@
  */
 #include "postgres.h"
 
-#include "access/hash.h"
+#include "access/hash_xlog.h"
 
 void
 hash_desc(StringInfo buf, XLogReaderState *record)
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 31c5fd1..9bb1362 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -12,7 +12,7 @@
 #include "access/gin.h"
 #include "access/gist_private.h"
 #include "access/generic_xlog.h"
-#include "access/hash.h"
+#include "access/hash_xlog.h"
 #include "access/heapam_xlog.h"
 #include "access/brin_xlog.h"
 #include "access/multixact.h"
diff --git a/src/bin/pg_xlogdump/rmgrdesc.c b/src/bin/pg_xlogdump/rmgrdesc.c
index 017b9c5..8fe20ce 100644
--- a/src/bin/pg_xlogdump/rmgrdesc.c
+++ b/src/bin/pg_xlogdump/rmgrdesc.c
@@ -14,7 +14,7 @@
 #include "access/generic_xlog.h"
 #include "access/gin.h"
 #include "access/gist_private.h"
-#include "access/hash.h"
+#include "access/hash_xlog.h"
 #include "access/heapam_xlog.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index ce31418..d9df904 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -20,7 +20,6 @@
 #include "access/amapi.h"
 #include "access/itup.h"
 #include "access/sdir.h"
-#include "access/xlogreader.h"
 #include "fmgr.h"
 #include "lib/stringinfo.h"
 #include "storage/bufmgr.h"
@@ -365,9 +364,4 @@ extern bool _hash_convert_tuple(Relation index,
 extern OffsetNumber _hash_binsearch(Page page, uint32 hash_value);
 extern OffsetNumber _hash_binsearch_last(Page page, uint32 hash_value);
 
-/* hash.c */
-extern void hash_redo(XLogReaderState *record);
-extern void hash_desc(StringInfo buf, XLogReaderState *record);
-extern const char *hash_identify(uint8 info);
-
 #endif   /* HASH_H */
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
new file mode 100644
index 0000000..8a46c07
--- /dev/null
+++ b/src/include/access/hash_xlog.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * hash_xlog.h
+ *	  header file for postgres hash access XLOG definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/hash_xlog.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef HASH_XLOG_H
+#define HASH_XLOG_H
+
+#include "access/amapi.h"
+#include "access/itup.h"
+#include "access/sdir.h"
+#include "access/xlogreader.h"
+#include "fmgr.h"
+#include "lib/stringinfo.h"
+#include "storage/bufmgr.h"
+#include "storage/lockdefs.h"
+#include "utils/relcache.h"
+
+
+extern void hash_redo(XLogReaderState *record);
+extern void hash_desc(StringInfo buf, XLogReaderState *record);
+extern const char *hash_identify(uint8 info);
+
+#endif   /* HASH_XLOG_H */
#20Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amit Kapila (#19)
Re: Write Ahead Logging for Hash Indexes

Amit Kapila wrote:

How about attached?

That works; pushed. (I removed a few #includes from the new file.)

If you want, I think we can go one step further and move hash_redo to a
new file, hash_xlog.c, which is required for the main patch, but we can
leave it for later as well.

I think that can be a part of the main patch.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#21Amit Kapila
amit.kapila16@gmail.com
In reply to: Alvaro Herrera (#20)
Re: Write Ahead Logging for Hash Indexes

On Tue, Aug 30, 2016 at 3:40 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Amit Kapila wrote:

How about attached?

That works; pushed.

Thanks.

(I removed a few #includes from the new file.)

oops, copied from hash.h and forgot to remove those.

If you want, I think we can go one step further and move hash_redo to a
new file, hash_xlog.c, which is required for the main patch, but we can
leave it for later as well.

I think that can be a part of the main patch.

Okay, makes sense. Any suggestions on splitting the main patch, or on the
design/implementation in general, are welcome.

I will rebase the patches on the latest commit. The current status is
that I have fixed the problems reported by Jeff Janes and Mark
Kirkwood. We are currently stress testing the patch using pgbench;
once that is done, I will post a new version. I am hoping that it will
be complete this week.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#22Amit Kapila
amit.kapila16@gmail.com
In reply to: Jeff Janes (#15)
1 attachment(s)
Re: Write Ahead Logging for Hash Indexes

On Wed, Aug 24, 2016 at 10:32 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Tue, Aug 23, 2016 at 10:05 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Wed, Aug 24, 2016 at 2:37 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

After an intentionally created crash, I get an Assert triggering:

TRAP: FailedAssertion("!(((freep)[(bitmapbit)/32] &
(1<<((bitmapbit)%32))))", File: "hashovfl.c", Line: 553)

freep[0] is zero and bitmapbit is 16.

What is happening here is that when it tries to clear the bitmapbit,
it expects it to be set. Now, I think the reason why it didn't find
the bit set could be that after the new overflow page was added and
the bit corresponding to it was set, you might have crashed the
system, and the replay would not have set the bit. Then, while freeing
the overflow page, it can hit the Assert you mentioned. I think the
problem here is that I am using REGBUF_STANDARD to log the bitmap page
updates, which seems to be causing the issue. As the bitmap page
doesn't follow the standard page layout, the full page image would
have omitted the actual contents, and then during replay the bit would
not have been set, because the page doesn't need REDO. I think the fix
here is to use REGBUF_NO_IMAGE, as we do for vm buffers.

If you can send me the detailed steps for how you have produced the
problem, then I can verify after fixing whether you are seeing the
same problem or something else.

The test is rather awkward, it might be easier to just have me test it.

Okay, I have fixed this issue as explained above. Apart from that, I
have fixed another issue reported by Mark Kirkwood upthread and a few
other issues found during internal testing by Ashutosh Sharma.
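
To make the first fix concrete, the bitmap page registration in the
attached patch now looks roughly like this (a sketch only; the buffer
number and variable names are as used in _hash_addovflpage() in v2):

/*
 * The bitmap page doesn't follow the standard page layout, so a full
 * page image would omit its contents.  Register it without an image
 * and log the bit number explicitly, so that replay can set the bit.
 */
XLogRegisterBuffer(2, mapbuf, REGBUF_STANDARD | REGBUF_NO_IMAGE);
XLogRegisterBufData(2, (char *) &bitmap_page_bit, sizeof(uint32));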

The locking issue reported by Mark and Ashutosh is that the patch
didn't maintain, while adding an overflow page, the locking order it
maintains in other write operations (lock the bucket pages first and
then the metapage to perform the write operation). I have added
comments in _hash_addovflpage() to explain the locking order used in
the modified patch.

During stress testing with pgbench using a master-standby setup, we
found an issue which indicates that the WAL replay machinery doesn't
expect completely zeroed pages (see the explanation of RBM_NORMAL mode
atop XLogReadBufferExtended). Previously we zeroed the overflow page
before freeing it; now I have changed it to just initialize the page
so that it is empty.
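
Concretely, the free path now re-initializes the page instead of
zeroing it, along these lines (a sketch of the corresponding hunk in
_hash_freeovflpage() from the attached patch):

/*
 * Re-initialize the freed overflow page rather than MemSet()ing it to
 * zeroes; replay of a later record in RBM_NORMAL mode would otherwise
 * reject the all-zero page (see XLogReadBufferExtended).
 */
_hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));
MarkBufferDirty(ovflbuf);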

Apart from the above, I have added support for the old snapshot
threshold in the hash index code.

Thanks to Ashutosh Sharma for doing the testing of the patch and
helping me in analyzing some of the above issues.

I forgot to mention in my initial mail that Robert and I had some
off-list discussions about the design of this patch; many thanks to
him for providing inputs.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

wal_hash_index_v2.patch (application/octet-stream)
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 0f09d82..2d0d3bd 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1520,19 +1520,6 @@ archive_command = 'local_backup_script.sh "%p" "%f"'
   <itemizedlist>
    <listitem>
     <para>
-     Operations on hash indexes are not presently WAL-logged, so
-     replay will not update these indexes.  This will mean that any new inserts
-     will be ignored by the index, updated rows will apparently disappear and
-     deleted rows will still retain pointers. In other words, if you modify a
-     table with a hash index on it then you will get incorrect query results
-     on a standby server.  When recovery completes it is recommended that you
-     manually <xref linkend="sql-reindex">
-     each such index after completing a recovery operation.
-    </para>
-   </listitem>
-
-   <listitem>
-    <para>
      If a <xref linkend="sql-createdatabase">
      command is executed while a base backup is being taken, and then
      the template database that the <command>CREATE DATABASE</> copied
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 7c483c6..588e3b0 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2107,10 +2107,9 @@ include_dir 'conf.d'
          has materialized a result set, no error will be generated even if the
          underlying rows in the referenced table have been vacuumed away.
          Some tables cannot safely be vacuumed early, and so will not be
-         affected by this setting.  Examples include system catalogs and any
-         table which has a hash index.  For such tables this setting will
-         neither reduce bloat nor create a possibility of a <literal>snapshot
-         too old</> error on scanning.
+         affected by this setting.  Examples include system catalogs.  For
+         such tables this setting will neither reduce bloat nor create a
+         possibility of a <literal>snapshot too old</> error on scanning.
         </para>
        </listitem>
       </varlistentry>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 06f49db..f285795 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2323,12 +2323,6 @@ LOG:  database system is ready to accept read only connections
   <itemizedlist>
    <listitem>
     <para>
-     Operations on hash indexes are not presently WAL-logged, so
-     replay will not update these indexes.
-    </para>
-   </listitem>
-   <listitem>
-    <para>
      Full knowledge of running transactions is required before snapshots
      can be taken. Transactions that use large numbers of subtransactions
      (currently greater than 64) will delay the start of read only
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 46f8e55..e075ede 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -193,18 +193,6 @@ CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable>
 </synopsis>
   </para>
 
-  <caution>
-   <para>
-    Hash index operations are not presently WAL-logged,
-    so hash indexes might need to be rebuilt with <command>REINDEX</>
-    after a database crash if there were unwritten changes.
-    Also, changes to hash indexes are not replicated over streaming or
-    file-based replication after the initial base backup, so they
-    give wrong answers to queries that subsequently use them.
-    For these reasons, hash index use is presently discouraged.
-   </para>
-  </caution>
-
   <para>
    <indexterm>
     <primary>index</primary>
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index e9f47c4..a2dc212 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -510,19 +510,6 @@ Indexes:
    they can be useful.
   </para>
 
-  <caution>
-   <para>
-    Hash index operations are not presently WAL-logged,
-    so hash indexes might need to be rebuilt with <command>REINDEX</>
-    after a database crash if there were unwritten changes.
-    Also, changes to hash indexes are not replicated over streaming or
-    file-based replication after the initial base backup, so they
-    give wrong answers to queries that subsequently use them.
-    Hash indexes are also not properly restored during point-in-time
-    recovery.  For these reasons, hash index use is presently discouraged.
-   </para>
-  </caution>
-
   <para>
    Currently, only the B-tree, GiST, GIN, and BRIN index methods support
    multicolumn indexes. Up to 32 fields can be specified by default.
diff --git a/src/backend/access/hash/Makefile b/src/backend/access/hash/Makefile
index 5d3bd94..dae82be 100644
--- a/src/backend/access/hash/Makefile
+++ b/src/backend/access/hash/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = hash.o hashfunc.o hashinsert.o hashovfl.o hashpage.o hashscan.o \
-       hashsearch.o hashsort.o hashutil.o hashvalidate.o
+       hashsearch.o hashsort.o hashutil.o hashvalidate.o hashxlog.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index a0feb2f..f4b5521 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -531,6 +531,97 @@ All the freespace operations should be called while holding no buffer
 locks.  Since they need no lmgr locks, deadlock is not possible.
 
 
+WAL Considerations
+------------------
+
+The hash index operations like create index, insert, delete, bucket split,
+allocate overflow page, and squeeze (overflow pages are freed in this
+operation) do not in themselves guarantee hash index consistency after a
+crash.  To provide robustness, we write WAL for each of these operations,
+which gets replayed on crash recovery.
+
+Multiple WAL records are being written for create index operation, first for
+initializing the metapage, followed by one for each new bucket created during
+operation followed by one for initializing the bitmap page.  If the system
+crashes after any operation, the whole operation is rolled back.  We have
+considered to write a single WAL record for the whole operation, but for that
+we need to limit the number of initial buckets that can be created during the
+operation. As we can log only fixed number of pages XLR_MAX_BLOCK_ID (32) with
+current XLog machinery, it is better to write multiple WAL records for this
+operation.  The downside of restricting the number of buckets is that we need
+to perform split operation if the number of tuples are more than what can be
+accommodated in the initial set of buckets and it is not unusual to have a
+large number of tuples during create index operation.
+
+Ordinary item insertions (that don't force a page split or need a new overflow
+page) are single WAL entries.  They touch a single bucket page and meta page,
+metapage is updated during replay as it is updated during original operation.
+
+An insertion that causes an addition of an overflow page is logged as a single
+WAL entry preceded by a WAL entry for new overflow page required to insert
+a tuple.  There is a corner case where by the time we try to use newly
+allocated overflow page, it already gets used by concurrent insertions, for
+such a case, a new overflow page will be allocated and a separate WAL entry
+will be made for the same.
+
+An insertion that causes a bucket split is logged as a single WAL entry,
+followed by a WAL entry for allocating a new bucket, followed by a WAL entry
+for each overflow bucket page in the new bucket to which the tuples are moved
+from old bucket, followed by a WAL entry to indicate that split is complete
+for both old and new buckets.
+
+A split operation which requires overflow pages to complete the operation
+will need to write a WAL record for each new allocation of an overflow page.
+
+As splitting involves multiple atomic actions, it's possible that the
+system crashes between moving tuples from bucket pages of old bucket to new
+bucket.  After recovery, both the old and new buckets will be marked with
+in_complete split flag.  The reader algorithm works correctly, as it will scan
+both the old and new buckets as explained in the reader algorithm section
+above.
+
+We finish the split at next insert or split operation on old bucket as
+explained in insert and split algorithm above.  It could be done during
+searches, too, but it seems best not to put any extra updates in what would
+otherwise be a read-only operation (updating is not possible in hot standby
+mode anyway).  It would seem natural to complete the split in VACUUM, but since
+splitting a bucket might require to allocate a new page, it might fail if you
+run out of disk space.  That would be bad during VACUUM - the reason for
+running VACUUM in the first place might be that you run out of disk space,
+and now VACUUM won't finish because you're out of disk space.  In contrast,
+an insertion can require enlarging the physical file anyway.
+
+Deletion of tuples from a bucket is performed for two reasons: one is to
+remove dead tuples and the other is to remove tuples that were moved by a
+split.  A WAL entry is made for each bucket page from which tuples are
+removed, followed by a WAL entry to clear the garbage flag if the tuples
+moved by split are removed.  A separate WAL entry is made for updating the
+metapage if the deletion is performed to remove dead tuples by vacuum.
+
+As deletion involves multiple atomic operations, it is quite possible that
+the system crashes (a) after removing tuples from some of the bucket pages,
+(b) before clearing the garbage flag, or (c) before updating the metapage.
+If the system crashes before completing (b), it will again try to clean the
+bucket during the next vacuum or insert after recovery, which can have some
+performance impact, but it will work fine.  If the system crashes before
+completing (c), after recovery there could be some additional splits until
+the next vacuum updates the metapage, but the other operations like insert,
+delete and scan will work correctly.  We could fix this problem by actually
+updating the metapage based on the delete operation during replay, but it is
+not clear that it is worth the complication.
+
+The squeeze operation moves tuples from one of the buckets later in the
+chain to one of the buckets earlier in the chain and writes a WAL record
+when either the bucket to which it is writing tuples is filled or the
+bucket from which it is removing the tuples becomes empty.
+
+As the squeeze operation involves multiple atomic operations, it is quite
+possible that the system crashes before completing the operation on the
+entire bucket.  After recovery, the operations will work correctly, but the
+index will remain bloated, which can impact the performance of read and
+insert operations until the next vacuum squeezes the bucket completely.
+
+
 Other Notes
 -----------
 
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index a12a830..6553165 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -27,6 +27,7 @@
 #include "optimizer/plancat.h"
 #include "utils/index_selfuncs.h"
 #include "utils/rel.h"
+#include "miscadmin.h"
 
 
 /* Working state for hashbuild and its callback */
@@ -115,7 +116,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	estimate_rel_size(heap, NULL, &relpages, &reltuples, &allvisfrac);
 
 	/* Initialize the hash index metadata page and initial buckets */
-	num_buckets = _hash_metapinit(index, reltuples, MAIN_FORKNUM);
+	num_buckets = _hash_init(index, reltuples, MAIN_FORKNUM);
 
 	/*
 	 * If we just insert the tuples into the index in scan order, then
@@ -177,7 +178,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 void
 hashbuildempty(Relation index)
 {
-	_hash_metapinit(index, 0, INIT_FORKNUM);
+	_hash_init(index, 0, INIT_FORKNUM);
 }
 
 /*
@@ -297,6 +298,11 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		buf = so->hashso_curbuf;
 		Assert(BufferIsValid(buf));
 		page = BufferGetPage(buf);
+
+		/*
+		 * We don't need a test for an old snapshot here, as the current
+		 * buffer is pinned, so vacuum can't clean the page.
+		 */
 		maxoffnum = PageGetMaxOffsetNumber(page);
 		for (offnum = ItemPointerGetOffsetNumber(current);
 			 offnum <= maxoffnum;
@@ -604,6 +610,7 @@ loop_top:
 	}
 
 	/* Okay, we're really done.  Update tuple count in metapage. */
+	START_CRIT_SECTION();
 
 	if (orig_maxbucket == metap->hashm_maxbucket &&
 		orig_ntuples == metap->hashm_ntuples)
@@ -629,7 +636,28 @@ loop_top:
 		num_index_tuples = metap->hashm_ntuples;
 	}
 
-	_hash_wrtbuf(rel, metabuf);
+	MarkBufferDirty(metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_update_meta_page xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.ntuples = metap->hashm_ntuples;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, sizeof(SizeOfHashUpdateMetaPage));
+
+		XLogRegisterBuffer(0, metabuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_UPDATE_META_PAGE);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	_hash_relbuf(rel, metabuf);
 
 	/* return statistics */
 	if (stats == NULL)
@@ -720,7 +748,6 @@ hashbucketcleanup(Relation rel, Buffer bucket_buf,
 		OffsetNumber deletable[MaxOffsetNumber];
 		int			ndeletable = 0;
 		bool		retain_pin = false;
-		bool		curr_page_dirty = false;
 
 		if (delay)
 			vacuum_delay_point();
@@ -787,9 +814,41 @@ hashbucketcleanup(Relation rel, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			/* No ereport(ERROR) until changes are logged */
+			START_CRIT_SECTION();
+
 			PageIndexMultiDelete(page, deletable, ndeletable);
 			bucket_dirty = true;
-			curr_page_dirty = true;
+
+			MarkBufferDirty(buf);
+
+			/* XLOG stuff */
+			if (RelationNeedsWAL(rel))
+			{
+				xl_hash_delete xlrec;
+				XLogRecPtr	recptr;
+
+				xlrec.is_primary_bucket_page = (buf == bucket_buf) ? true : false;
+
+				XLogBeginInsert();
+				XLogRegisterData((char *) &xlrec, SizeOfHashDelete);
+
+				/*
+				 * bucket buffer needs to be registered to ensure that we can
+				 * acquire a cleanup lock on it during replay.
+				 */
+				if (!xlrec.is_primary_bucket_page)
+					XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+				XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
+				XLogRegisterBufData(1, (char *) deletable,
+									ndeletable * sizeof(OffsetNumber));
+
+				recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_DELETE);
+				PageSetLSN(BufferGetPage(buf), recptr);
+			}
+
+			END_CRIT_SECTION();
 		}
 
 		/* bail out if there are no more pages to scan. */
@@ -804,15 +863,7 @@ hashbucketcleanup(Relation rel, Buffer bucket_buf,
 		 * release the lock on previous page after acquiring the lock on next
 		 * page
 		 */
-		if (curr_page_dirty)
-		{
-			if (retain_pin)
-				_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
-			else
-				_hash_wrtbuf(rel, buf);
-			curr_page_dirty = false;
-		}
-		else if (retain_pin)
+		if (retain_pin)
 			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
 		else
 			_hash_relbuf(rel, buf);
@@ -844,10 +895,36 @@ hashbucketcleanup(Relation rel, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		/* No ereport(ERROR) until changes are logged */
+		START_CRIT_SECTION();
+
 		bucket_opaque->hasho_flag &= ~LH_BUCKET_PAGE_HAS_GARBAGE;
+
+		MarkBufferDirty(bucket_buf);
+
+		/* XLOG stuff */
+		if (RelationNeedsWAL(rel))
+		{
+			XLogRecPtr	recptr;
+
+			XLogBeginInsert();
+			XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+			recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_CLEAR_GARBAGE);
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
 	}
 
 	/*
+	 * we need to release and reacquire the lock on the bucket buffer to
+	 * ensure that the standby doesn't see an intermediate state of it.
+	 */
+	_hash_chgbufaccess(rel, bucket_buf, HASH_READ, HASH_NOLOCK);
+	_hash_chgbufaccess(rel, bucket_buf, HASH_NOLOCK, HASH_WRITE);
+
+	/*
 	 * If we deleted anything, try to compact free space.  For squeezing the
 	 * bucket, we must have a cleanup lock, else it can impact the ordering of
 	 * tuples for a scan that has started before it.
@@ -856,9 +933,3 @@ hashbucketcleanup(Relation rel, Buffer bucket_buf,
 		_hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf,
 							bstrategy);
 }
-
-void
-hash_redo(XLogReaderState *record)
-{
-	elog(PANIC, "hash_redo: unimplemented");
-}
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 5cfd0aa..a04fe2d 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -16,6 +16,7 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "miscadmin.h"
 #include "utils/rel.h"
 
 
@@ -44,6 +45,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	uint32		maxbucket;
 	uint32		highmask;
 	uint32		lowmask;
+	OffsetNumber itup_off;
 
 	/*
 	 * Get the hash key for the item (it's stored in the index tuple itself).
@@ -248,31 +250,63 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 		Assert(pageopaque->hasho_bucket == bucket);
 	}
 
-	/* found page with enough space, so add the item here */
-	(void) _hash_pgaddtup(rel, buf, itemsz, itup);
-
-	/*
-	 * write and release the modified page and ensure to release the pin on
-	 * primary page.
-	 */
-	_hash_wrtbuf(rel, buf);
-	if (buf != bucket_buf)
-		_hash_dropbuf(rel, bucket_buf);
-
 	/*
 	 * Write-lock the metapage so we can increment the tuple count. After
 	 * incrementing it, check to see if it's time for a split.
 	 */
 	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
 
+	/* Do the update.  No ereport(ERROR) until changes are logged */
+	START_CRIT_SECTION();
+
+	/* found page with enough space, so add the item here */
+	itup_off = _hash_pgaddtup(rel, buf, itemsz, itup);
+
+	MarkBufferDirty(buf);
+
+	/* metapage operations */
 	metap->hashm_ntuples += 1;
 
 	/* Make sure this stays in sync with _hash_expandtable() */
 	do_expand = metap->hashm_ntuples >
 		(double) metap->hashm_ffactor * (metap->hashm_maxbucket + 1);
 
-	/* Write out the metapage and drop lock, but keep pin */
-	_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
+	MarkBufferDirty(metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_insert xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.offnum = itup_off;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInsert);
+
+		XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);
+
+		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+		XLogRegisterBufData(0, (char *) itup, IndexTupleDSize(*itup));
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INSERT);
+
+		PageSetLSN(BufferGetPage(buf), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	/* drop lock on metapage, but keep pin */
+	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+	/*
+	 * Release the modified page and ensure to release the pin on primary
+	 * page.
+	 */
+	_hash_relbuf(rel, buf);
+	if (buf != bucket_buf)
+		_hash_dropbuf(rel, bucket_buf);
 
 	/* Attempt to split if a split is needed */
 	if (do_expand)
@@ -314,3 +348,44 @@ _hash_pgaddtup(Relation rel, Buffer buf, Size itemsize, IndexTuple itup)
 
 	return itup_off;
 }
+
+/*
+ *	_hash_pgaddmultitup() -- add a tuple vector to a particular page in the
+ *							 index.
+ *
+ * This routine has same requirements for locking and tuple ordering as
+ * _hash_pgaddtup().
+ *
+ * The offset numbers at which the tuples were inserted are returned in the
+ * itup_offsets array.
+ */
+void
+_hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
+					OffsetNumber *itup_offsets, uint16 nitups)
+{
+	OffsetNumber itup_off;
+	Page		page;
+	uint32		hashkey;
+	int			i;
+
+	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+	page = BufferGetPage(buf);
+
+	for (i = 0; i < nitups; i++)
+	{
+		Size		itemsize;
+
+		itemsize = IndexTupleDSize(*itups[i]);
+		itemsize = MAXALIGN(itemsize);
+
+		/* Find where to insert the tuple (preserving page's hashkey ordering) */
+		hashkey = _hash_get_indextuple_hashkey(itups[i]);
+		itup_off = _hash_binsearch(page, hashkey);
+
+		itup_offsets[i] = itup_off;
+
+		if (PageAddItem(page, (Item) itups[i], itemsize, itup_off, false, false)
+			== InvalidOffsetNumber)
+			elog(ERROR, "failed to add index item to \"%s\"",
+				 RelationGetRelationName(rel));
+	}
+}
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 760563a..84cb339 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -18,10 +18,10 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "miscadmin.h"
 #include "utils/rel.h"
 
 
-static Buffer _hash_getovflpage(Relation rel, Buffer metabuf);
 static uint32 _hash_firstfreebit(uint32 map);
 
 
@@ -84,7 +84,9 @@ blkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
  *	dropped before exiting (we assume the caller is not interested in 'buf'
  *	anymore) if not asked to retain.  The pin will be retained only for the
  *	primary bucket.  The returned overflow page will be pinned and
- *	write-locked; it is guaranteed to be empty.
+ *	write-locked; it is not guaranteed to be empty as we release and reacquire
+ *	the lock on it, so the caller must ensure that it has the required space
+ *	in the page.
  *
  *	The caller must hold a pin, but no lock, on the metapage buffer.
  *	That buffer is returned in the same state.
@@ -102,13 +104,37 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 	Page		ovflpage;
 	HashPageOpaque pageopaque;
 	HashPageOpaque ovflopaque;
-
-	/* allocate and lock an empty overflow page */
-	ovflbuf = _hash_getovflpage(rel, metabuf);
+	HashMetaPage metap;
+	Buffer		mapbuf = InvalidBuffer;
+	Buffer		newmapbuf = InvalidBuffer;
+	BlockNumber blkno;
+	uint32		orig_firstfree;
+	uint32		splitnum;
+	uint32	   *freep = NULL;
+	uint32		max_ovflpg;
+	uint32		bit;
+	uint32		bitmap_page_bit;
+	uint32		first_page;
+	uint32		last_bit;
+	uint32		last_page;
+	uint32		i,
+				j;
+	bool		page_found = false;
 
 	/*
-	 * Write-lock the tail page.  It is okay to hold two buffer locks here
-	 * since there cannot be anyone else contending for access to ovflbuf.
+	 * Write-lock the tail page.  Here, we need to maintain the locking order:
+	 * first acquire the lock on the tail page of the bucket, then on the meta
+	 * page to find and lock the bitmap page; once that is found, the lock on
+	 * the meta page is released, and finally we acquire the lock on the new
+	 * overflow buffer.  We need this locking order to avoid deadlock with
+	 * backends that are doing inserts.
+	 *
+	 * Note: We could have avoided locking many buffers here if we made two
+	 * WAL records for acquiring an overflow page (one to allocate an overflow
+	 * page and another to add it to the overflow bucket chain).  However,
+	 * doing so can leak an overflow page if the system crashes after
+	 * allocation.
+	 * Needless to say, it is better to have a single record from a
+	 * performance point of view as well.
 	 */
 	_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_WRITE);
 
@@ -136,55 +162,6 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 		buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
 	}
 
-	/* now that we have correct backlink, initialize new overflow page */
-	ovflpage = BufferGetPage(ovflbuf);
-	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
-	ovflopaque->hasho_prevblkno = BufferGetBlockNumber(buf);
-	ovflopaque->hasho_nextblkno = InvalidBlockNumber;
-	ovflopaque->hasho_bucket = pageopaque->hasho_bucket;
-	ovflopaque->hasho_flag = LH_OVERFLOW_PAGE;
-	ovflopaque->hasho_page_id = HASHO_PAGE_ID;
-
-	MarkBufferDirty(ovflbuf);
-
-	/* logically chain overflow page to previous page */
-	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
-	if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
-		_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
-	else
-		_hash_wrtbuf(rel, buf);
-
-	return ovflbuf;
-}
-
-/*
- *	_hash_getovflpage()
- *
- *	Find an available overflow page and return it.  The returned buffer
- *	is pinned and write-locked, and has had _hash_pageinit() applied,
- *	but it is caller's responsibility to fill the special space.
- *
- * The caller must hold a pin, but no lock, on the metapage buffer.
- * That buffer is left in the same state at exit.
- */
-static Buffer
-_hash_getovflpage(Relation rel, Buffer metabuf)
-{
-	HashMetaPage metap;
-	Buffer		mapbuf = 0;
-	Buffer		newbuf;
-	BlockNumber blkno;
-	uint32		orig_firstfree;
-	uint32		splitnum;
-	uint32	   *freep = NULL;
-	uint32		max_ovflpg;
-	uint32		bit;
-	uint32		first_page;
-	uint32		last_bit;
-	uint32		last_page;
-	uint32		i,
-				j;
-
 	/* Get exclusive lock on the meta page */
 	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
 
@@ -233,11 +210,31 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 		for (; bit <= last_inpage; j++, bit += BITS_PER_MAP)
 		{
 			if (freep[j] != ALL_SET)
+			{
+				page_found = true;
+
+				/* Reacquire exclusive lock on the meta page */
+				_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
+
+				/* convert bit to bit number within page */
+				bit += _hash_firstfreebit(freep[j]);
+				bitmap_page_bit = bit;
+
+				/* convert bit to absolute bit number */
+				bit += (i << BMPG_SHIFT(metap));
+				/* Calculate address of the recycled overflow page */
+				blkno = bitno_to_blkno(metap, bit);
+
+				/* Fetch and init the recycled page */
+				ovflbuf = _hash_getinitbuf(rel, blkno);
+
 				goto found;
+			}
 		}
 
 		/* No free space here, try to advance to next map page */
 		_hash_relbuf(rel, mapbuf);
+		mapbuf = InvalidBuffer;
 		i++;
 		j = 0;					/* scan from start of next map page */
 		bit = 0;
@@ -261,8 +258,15 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 		 * convenient to pre-mark them as "in use" too.
 		 */
 		bit = metap->hashm_spares[splitnum];
-		_hash_initbitmap(rel, metap, bitno_to_blkno(metap, bit), MAIN_FORKNUM);
-		metap->hashm_spares[splitnum]++;
+		newmapbuf = _hash_getnewbuf(rel, bitno_to_blkno(metap, bit), MAIN_FORKNUM);
+
+		/* add the new bitmap page to the metapage's list of bitmaps */
+		/* metapage already has a write lock */
+		if (metap->hashm_nmaps >= HASH_MAX_BITMAPS)
+			ereport(ERROR,
+					(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+					 errmsg("out of overflow pages in hash index \"%s\"",
+							RelationGetRelationName(rel))));
 	}
 	else
 	{
@@ -273,7 +277,8 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 	}
 
 	/* Calculate address of the new overflow page */
-	bit = metap->hashm_spares[splitnum];
+	bit = BufferIsValid(newmapbuf) ?
+		metap->hashm_spares[splitnum] + 1 : metap->hashm_spares[splitnum];
 	blkno = bitno_to_blkno(metap, bit);
 
 	/*
@@ -281,39 +286,51 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 	 * relation length stays in sync with ours.  XXX It's annoying to do this
 	 * with metapage write lock held; would be better to use a lock that
 	 * doesn't block incoming searches.
+	 *
+	 * It is okay to hold two buffer locks here (one on tail page of bucket
+	 * and other on new overflow page) since there cannot be anyone else
+	 * contending for access to ovflbuf.
 	 */
-	newbuf = _hash_getnewbuf(rel, blkno, MAIN_FORKNUM);
+	ovflbuf = _hash_getnewbuf(rel, blkno, MAIN_FORKNUM);
 
-	metap->hashm_spares[splitnum]++;
+found:
 
 	/*
-	 * Adjust hashm_firstfree to avoid redundant searches.  But don't risk
-	 * changing it if someone moved it while we were searching bitmap pages.
+	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
+	 * log the changes for the bitmap page and overflow page together to
+	 * avoid loss of pages in case a new page is added for them.
 	 */
-	if (metap->hashm_firstfree == orig_firstfree)
-		metap->hashm_firstfree = bit + 1;
-
-	/* Write updated metapage and release lock, but not pin */
-	_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
+	START_CRIT_SECTION();
 
-	return newbuf;
-
-found:
-	/* convert bit to bit number within page */
-	bit += _hash_firstfreebit(freep[j]);
+	if (page_found)
+	{
+		Assert(BufferIsValid(mapbuf));
 
-	/* mark page "in use" in the bitmap */
-	SETBIT(freep, bit);
-	_hash_wrtbuf(rel, mapbuf);
+		/* mark page "in use" in the bitmap */
+		SETBIT(freep, bitmap_page_bit);
+		MarkBufferDirty(mapbuf);
+	}
+	else
+	{
+		/* update the count to indicate new overflow page is added */
+		metap->hashm_spares[splitnum]++;
 
-	/* Reacquire exclusive lock on the meta page */
-	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
+		if (BufferIsValid(newmapbuf))
+		{
+			_hash_initbitmapbuffer(newmapbuf, metap->hashm_bmsize, false);
+			MarkBufferDirty(newmapbuf);
 
-	/* convert bit to absolute bit number */
-	bit += (i << BMPG_SHIFT(metap));
+			metap->hashm_mapp[metap->hashm_nmaps] = BufferGetBlockNumber(newmapbuf);
+			metap->hashm_nmaps++;
+			metap->hashm_spares[splitnum]++;
+			MarkBufferDirty(metabuf);
+		}
 
-	/* Calculate address of the recycled overflow page */
-	blkno = bitno_to_blkno(metap, bit);
+		/*
+		 * for new overflow page, we don't need to explicitly set the bit in
+		 * bitmap page, as by default that will be set to "in use".
+		 */
+	}
 
 	/*
 	 * Adjust hashm_firstfree to avoid redundant searches.  But don't risk
@@ -322,18 +339,101 @@ found:
 	if (metap->hashm_firstfree == orig_firstfree)
 	{
 		metap->hashm_firstfree = bit + 1;
-
-		/* Write updated metapage and release lock, but not pin */
-		_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
+		MarkBufferDirty(metabuf);
 	}
-	else
+
+	/* now that we have correct backlink, initialize new overflow page */
+	ovflpage = BufferGetPage(ovflbuf);
+	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
+	ovflopaque->hasho_prevblkno = BufferGetBlockNumber(buf);
+	ovflopaque->hasho_nextblkno = InvalidBlockNumber;
+	ovflopaque->hasho_bucket = pageopaque->hasho_bucket;
+	ovflopaque->hasho_flag = LH_OVERFLOW_PAGE;
+	ovflopaque->hasho_page_id = HASHO_PAGE_ID;
+
+	MarkBufferDirty(ovflbuf);
+
+	/* logically chain overflow page to previous page */
+	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
+
+	MarkBufferDirty(buf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
 	{
-		/* We didn't change the metapage, so no need to write */
-		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+		XLogRecPtr	recptr;
+		xl_hash_addovflpage xlrec;
+
+		xlrec.bmpage_found = page_found;
+		xlrec.bmsize = metap->hashm_bmsize;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashAddOvflPage);
+
+		XLogRegisterBuffer(0, ovflbuf, REGBUF_WILL_INIT);
+		XLogRegisterBufData(0, (char *) &pageopaque->hasho_bucket, sizeof(Bucket));
+
+		XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
+
+		if (BufferIsValid(mapbuf))
+		{
+			/*
+			 * As bitmap page doesn't have standard page layout, so this will
+			 * The bitmap page doesn't have a standard page layout, so this
+			 * allows us to log the data.
+			XLogRegisterBuffer(2, mapbuf, REGBUF_STANDARD | REGBUF_NO_IMAGE);
+			XLogRegisterBufData(2, (char *) &bitmap_page_bit, sizeof(uint32));
+		}
+
+		if (BufferIsValid(newmapbuf))
+			XLogRegisterBuffer(3, newmapbuf, REGBUF_WILL_INIT);
+
+		/*
+		 * To replay meta page changes, we could log the entire metapage
+		 * (which doesn't seem advisable considering the size of the hash
+		 * metapage), or we could log the individual updated values, which
+		 * seems doable, but we prefer to perform the same operations on the
+		 * metapage during replay as are done during the actual operation.
+		 * That looks straightforward and has the advantage of using less
+		 * space in WAL.
+		 */
+		XLogRegisterBuffer(4, metabuf, REGBUF_STANDARD);
+		XLogRegisterBufData(4, (char *) &metap->hashm_firstfree, sizeof(uint32));
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_ADD_OVFL_PAGE);
+
+		PageSetLSN(BufferGetPage(ovflbuf), recptr);
+		PageSetLSN(BufferGetPage(buf), recptr);
+
+		if (BufferIsValid(mapbuf))
+			PageSetLSN(BufferGetPage(mapbuf), recptr);
+
+		if (BufferIsValid(newmapbuf))
+			PageSetLSN(BufferGetPage(newmapbuf), recptr);
 	}
 
-	/* Fetch, init, and return the recycled page */
-	return _hash_getinitbuf(rel, blkno);
+	END_CRIT_SECTION();
+
+	if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+		_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+	else
+		_hash_relbuf(rel, buf);
+
+	if (BufferIsValid(mapbuf))
+		_hash_relbuf(rel, mapbuf);
+
+	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+	if (BufferIsValid(newmapbuf))
+		_hash_relbuf(rel, newmapbuf);
+
+	/*
+	 * we need to release and reacquire the lock on the overflow buffer to
+	 * ensure that the standby doesn't see an intermediate state of it.
+	 */
+	_hash_chgbufaccess(rel, ovflbuf, HASH_READ, HASH_NOLOCK);
+	_hash_chgbufaccess(rel, ovflbuf, HASH_NOLOCK, HASH_WRITE);
+
+	return ovflbuf;
 }
 
 /*
@@ -366,6 +466,12 @@ _hash_firstfreebit(uint32 map)
  *	Remove this overflow page from its bucket's chain, and mark the page as
  *	free.  On entry, ovflbuf is write-locked; it is released before exiting.
  *
+ *	Add the tuples (itups) to wbuf in this function; we could do that in the
+ *	caller as well.  The advantage of doing it here is that we can easily
+ *	write the WAL for the XLOG_HASH_SQUEEZE_PAGE operation.  Addition of
+ *	tuples and removal of the overflow page have to be done as an atomic
+ *	operation, otherwise during replay on the standby users might find
+ *	duplicate records.
+ *
  *	Since this function is invoked in VACUUM, we provide an access strategy
  *	parameter that controls fetches of the bucket pages.
  *
@@ -377,7 +483,9 @@ _hash_firstfreebit(uint32 map)
  *	better hold cleanup lock on the primary bucket.
  */
 BlockNumber
-_hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
+_hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
+				   Buffer wbuf, IndexTuple *itups, OffsetNumber *itup_offsets,
+				   Size *tups_size, uint16 nitups,
 				   BufferAccessStrategy bstrategy)
 {
 	HashMetaPage metap;
@@ -387,6 +495,7 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
 	BlockNumber prevblkno;
 	BlockNumber blkno;
 	BlockNumber nextblkno;
+	BlockNumber writeblkno;
 	HashPageOpaque ovflopaque;
 	Page		ovflpage;
 	Page		mappage;
@@ -395,6 +504,10 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
 	int32		bitmappage,
 				bitmapbit;
 	Bucket bucket PG_USED_FOR_ASSERTS_ONLY;
+	Buffer		prevbuf = InvalidBuffer;
+	Buffer		nextbuf = InvalidBuffer;
+	BlockNumber bucketblkno;
+	bool		update_metap = false;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -403,69 +516,37 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
 	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
 	nextblkno = ovflopaque->hasho_nextblkno;
 	prevblkno = ovflopaque->hasho_prevblkno;
+	writeblkno = BufferGetBlockNumber(wbuf);
 	bucket = ovflopaque->hasho_bucket;
-
-	/*
-	 * Zero the page for debugging's sake; then write and release it. (Note:
-	 * if we failed to zero the page here, we'd have problems with the Assert
-	 * in _hash_pageinit() when the page is reused.)
-	 */
-	MemSet(ovflpage, 0, BufferGetPageSize(ovflbuf));
-	_hash_wrtbuf(rel, ovflbuf);
+	bucketblkno = BufferGetBlockNumber(bucketbuf);
 
 	/*
 	 * Fix up the bucket chain.  this is a doubly-linked list, so we must fix
 	 * up the bucket chain members behind and ahead of the overflow page being
 	 * deleted.  No concurrency issues since we hold the cleanup lock on
 	 * primary bucket.  We don't need to aqcuire buffer lock to fix the
-	 * primary bucket, as we already have that lock.
+	 * primary bucket or if the previous bucket is same as write bucket, as we
+	 * already have lock on those buckets.
 	 */
 	if (BlockNumberIsValid(prevblkno))
 	{
-		if (prevblkno == bucket_blkno)
-		{
-			Buffer		prevbuf = ReadBufferExtended(rel, MAIN_FORKNUM,
-													 prevblkno,
-													 RBM_NORMAL,
-													 bstrategy);
-
-			Page		prevpage = BufferGetPage(prevbuf);
-			HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
-
-			Assert(prevopaque->hasho_bucket == bucket);
-			prevopaque->hasho_nextblkno = nextblkno;
-			MarkBufferDirty(prevbuf);
-			ReleaseBuffer(prevbuf);
-		}
+		if (prevblkno == bucketblkno)
+			prevbuf = bucketbuf;
+		else if (prevblkno == writeblkno)
+			prevbuf = wbuf;
 		else
-		{
-			Buffer		prevbuf = _hash_getbuf_with_strategy(rel,
-															 prevblkno,
-															 HASH_WRITE,
-										   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
-															 bstrategy);
-			Page		prevpage = BufferGetPage(prevbuf);
-			HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
-
-			Assert(prevopaque->hasho_bucket == bucket);
-			prevopaque->hasho_nextblkno = nextblkno;
-			_hash_wrtbuf(rel, prevbuf);
-		}
+			prevbuf = _hash_getbuf_with_strategy(rel,
+												 prevblkno,
+												 HASH_WRITE,
+												 LH_OVERFLOW_PAGE,
+												 bstrategy);
 	}
 	if (BlockNumberIsValid(nextblkno))
-	{
-		Buffer		nextbuf = _hash_getbuf_with_strategy(rel,
-														 nextblkno,
-														 HASH_WRITE,
-														 LH_OVERFLOW_PAGE,
-														 bstrategy);
-		Page		nextpage = BufferGetPage(nextbuf);
-		HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage);
-
-		Assert(nextopaque->hasho_bucket == bucket);
-		nextopaque->hasho_prevblkno = prevblkno;
-		_hash_wrtbuf(rel, nextbuf);
-	}
+		nextbuf = _hash_getbuf_with_strategy(rel,
+											 nextblkno,
+											 HASH_WRITE,
+											 LH_OVERFLOW_PAGE,
+											 bstrategy);
 
 	/* Note: bstrategy is intentionally not used for metapage and bitmap */
 
@@ -491,60 +572,191 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
 	mappage = BufferGetPage(mapbuf);
 	freep = HashPageGetBitmap(mappage);
 	Assert(ISSET(freep, bitmapbit));
-	CLRBIT(freep, bitmapbit);
-	_hash_wrtbuf(rel, mapbuf);
 
 	/* Get write-lock on metapage to update firstfree */
 	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
 
+	/* This operation needs to log multiple tuples, prepare WAL for that */
+	if (RelationNeedsWAL(rel))
+		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, XLR_NORMAL_RDATAS + nitups);
+
+	START_CRIT_SECTION();
+
+	/*
+	 * we have to insert tuples on the "write" page, being careful to preserve
+	 * hashkey ordering.  (If we insert many tuples into the same "write" page
+	 * it would be worth qsort'ing them).
+	 */
+	if (nitups > 0)
+	{
+		_hash_pgaddmultitup(rel, wbuf, itups, itup_offsets, nitups);
+		MarkBufferDirty(wbuf);
+	}
+
+	/*
+	 * Initialize the freed overflow page.  Here we can't completely zero
+	 * the page, as WAL replay routines expect pages to be initialized.
+	 * See explanation of RBM_NORMAL mode atop XLogReadBufferExtended.
+	 */
+	_hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));
+	MarkBufferDirty(ovflbuf);
+
+	if (BufferIsValid(prevbuf))
+	{
+		Page		prevpage = BufferGetPage(prevbuf);
+		HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+		Assert(prevopaque->hasho_bucket == bucket);
+		prevopaque->hasho_nextblkno = nextblkno;
+		MarkBufferDirty(prevbuf);
+	}
+
+	if (BufferIsValid(nextbuf))
+	{
+		Page		nextpage = BufferGetPage(nextbuf);
+		HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage);
+
+		Assert(nextopaque->hasho_bucket == bucket);
+		nextopaque->hasho_prevblkno = prevblkno;
+		MarkBufferDirty(nextbuf);
+	}
+
+	CLRBIT(freep, bitmapbit);
+	MarkBufferDirty(mapbuf);
+
 	/* if this is now the first free page, update hashm_firstfree */
 	if (ovflbitno < metap->hashm_firstfree)
 	{
 		metap->hashm_firstfree = ovflbitno;
-		_hash_wrtbuf(rel, metabuf);
+		update_metap = true;
+		MarkBufferDirty(metabuf);
 	}
-	else
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
 	{
-		/* no need to change metapage */
-		_hash_relbuf(rel, metabuf);
+		xl_hash_squeeze_page xlrec;
+		XLogRecPtr	recptr;
+		int			i;
+
+		xlrec.prevblkno = prevblkno;
+		xlrec.nextblkno = nextblkno;
+		xlrec.ntups = nitups;
+		xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf) ? true : false;
+		xlrec.is_prev_bucket_same_wrt = (wbuf == prevbuf) ? true : false;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashSqueezePage);
+
+		/*
+		 * bucket buffer needs to be registered to ensure that we can acquire
+		 * a cleanup lock on it during replay.
+		 */
+		if (!xlrec.is_prim_bucket_same_wrt)
+			XLogRegisterBuffer(0, bucketbuf, REGBUF_STANDARD);
+
+		XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
+		if (xlrec.ntups > 0)
+		{
+			XLogRegisterBufData(1, (char *) itup_offsets,
+								nitups * sizeof(OffsetNumber));
+			for (i = 0; i < nitups; i++)
+				XLogRegisterBufData(1, (char *) itups[i], tups_size[i]);
+		}
+
+		XLogRegisterBuffer(2, ovflbuf, REGBUF_STANDARD);
+
+		/*
+		 * If prevpage and the writepage (the block into which we are moving
+		 * tuples from the overflow page) are the same, there is no need to
+		 * separately register prevpage.  During replay, we can directly
+		 * update the nextblock in the writepage.
+		 */
+		if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt)
+			XLogRegisterBuffer(3, prevbuf, REGBUF_STANDARD);
+
+		if (BufferIsValid(nextbuf))
+			XLogRegisterBuffer(4, nextbuf, REGBUF_STANDARD);
+
+		/*
+		 * As the bitmap page doesn't have a standard page layout, this
+		 * allows us to log the data explicitly.
+		 */
+		XLogRegisterBuffer(5, mapbuf, REGBUF_STANDARD | REGBUF_NO_IMAGE);
+		XLogRegisterBufData(5, (char *) &bitmapbit, sizeof(uint32));
+
+		if (update_metap)
+		{
+			XLogRegisterBuffer(6, metabuf, REGBUF_STANDARD);
+			XLogRegisterBufData(6, (char *) &metap->hashm_firstfree, sizeof(uint32));
+		}
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SQUEEZE_PAGE);
+
+		PageSetLSN(BufferGetPage(wbuf), recptr);
+		PageSetLSN(BufferGetPage(ovflbuf), recptr);
+
+		if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt)
+			PageSetLSN(BufferGetPage(prevbuf), recptr);
+		if (BufferIsValid(nextbuf))
+			PageSetLSN(BufferGetPage(nextbuf), recptr);
+
+		PageSetLSN(BufferGetPage(mapbuf), recptr);
+
+		if (update_metap)
+			PageSetLSN(BufferGetPage(metabuf), recptr);
 	}
 
+	END_CRIT_SECTION();
+
+	/*
+	 * Release the lock; the caller will decide whether to release the pin or
+	 * reacquire the lock.  The caller is responsible for locking and
+	 * unlocking the bucket buffer.
+	 */
+	if (BufferGetBlockNumber(wbuf) != bucketblkno)
+		_hash_chgbufaccess(rel, wbuf, HASH_READ, HASH_NOLOCK);
+
+	if (BufferIsValid(ovflbuf))
+		_hash_relbuf(rel, ovflbuf);
+
+	/* release previous bucket page if it is not the same as the primary or write bucket */
+	if (BufferIsValid(prevbuf) &&
+		prevblkno != bucketblkno &&
+		prevblkno != writeblkno)
+		_hash_relbuf(rel, prevbuf);
+
+	if (BufferIsValid(nextbuf))
+		_hash_relbuf(rel, nextbuf);
+
+	if (BufferIsValid(mapbuf))
+		_hash_relbuf(rel, mapbuf);
+
+	if (BufferIsValid(metabuf))
+		_hash_relbuf(rel, metabuf);
+
 	return nextblkno;
 }
 
-
 /*
- *	_hash_initbitmap()
- *
- *	 Initialize a new bitmap page.  The metapage has a write-lock upon
- *	 entering the function, and must be written by caller after return.
+ *	_hash_initbitmapbuffer()
  *
- * 'blkno' is the block number of the new bitmap page.
- *
- * All bits in the new bitmap page are set to "1", indicating "in use".
+ *	 Initialize a new bitmap page.  All bits in the new bitmap page are set to
+ *	 "1", indicating "in use".
  */
 void
-_hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
-				 ForkNumber forkNum)
+_hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage)
 {
-	Buffer		buf;
 	Page		pg;
 	HashPageOpaque op;
 	uint32	   *freep;
 
-	/*
-	 * It is okay to write-lock the new bitmap page while holding metapage
-	 * write lock, because no one else could be contending for the new page.
-	 * Also, the metapage lock makes it safe to extend the index using
-	 * _hash_getnewbuf.
-	 *
-	 * There is some loss of concurrency in possibly doing I/O for the new
-	 * page while holding the metapage lock, but this path is taken so seldom
-	 * that it's not worth worrying about.
-	 */
-	buf = _hash_getnewbuf(rel, blkno, forkNum);
 	pg = BufferGetPage(buf);
 
+	/* initialize the page */
+	if (initpage)
+		_hash_pageinit(pg, BufferGetPageSize(buf));
+
 	/* initialize the page's special space */
 	op = (HashPageOpaque) PageGetSpecialPointer(pg);
 	op->hasho_prevblkno = InvalidBlockNumber;
@@ -555,25 +767,9 @@ _hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
 
 	/* set all of the bits to 1 */
 	freep = HashPageGetBitmap(pg);
-	MemSet(freep, 0xFF, BMPGSZ_BYTE(metap));
-
-	/* write out the new bitmap page (releasing write lock and pin) */
-	_hash_wrtbuf(rel, buf);
-
-	/* add the new bitmap page to the metapage's list of bitmaps */
-	/* metapage already has a write lock */
-	if (metap->hashm_nmaps >= HASH_MAX_BITMAPS)
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("out of overflow pages in hash index \"%s\"",
-						RelationGetRelationName(rel))));
-
-	metap->hashm_mapp[metap->hashm_nmaps] = blkno;
-
-	metap->hashm_nmaps++;
+	MemSet(freep, 0xFF, bmsize);
 }
 
-
 /*
  *	_hash_squeezebucket(rel, bucket)
  *
@@ -613,8 +809,6 @@ _hash_squeezebucket(Relation rel,
 	Page		rpage;
 	HashPageOpaque wopaque;
 	HashPageOpaque ropaque;
-	bool		wbuf_dirty;
-	bool		release_buf = false;
 
 	/*
 	 * start squeezing into the base bucket page.
@@ -657,23 +851,30 @@ _hash_squeezebucket(Relation rel,
 	/*
 	 * squeeze the tuples.
 	 */
-	wbuf_dirty = false;
 	for (;;)
 	{
 		OffsetNumber roffnum;
 		OffsetNumber maxroffnum;
 		OffsetNumber deletable[MaxOffsetNumber];
-		int			ndeletable = 0;
-
+		IndexTuple	itups[MaxIndexTuplesPerPage];
+		Size		tups_size[MaxIndexTuplesPerPage];
+		IndexTuple	itup;
+		OffsetNumber *itup_offsets;
+		uint16		ndeletable = 0;
+		uint16		nitups = 0;
+		Size		all_tups_size = 0;
+		Size		itemsz;
+		int			i;
+
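+		/* offsets of moved tuples on the "write" page, needed for WAL */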
+		itup_offsets = (OffsetNumber *) palloc(MaxIndexTuplesPerPage * sizeof(OffsetNumber));
+
+readpage:
 		/* Scan each tuple in "read" page */
 		maxroffnum = PageGetMaxOffsetNumber(rpage);
 		for (roffnum = FirstOffsetNumber;
 			 roffnum <= maxroffnum;
 			 roffnum = OffsetNumberNext(roffnum))
 		{
-			IndexTuple	itup;
-			Size		itemsz;
-
 			itup = (IndexTuple) PageGetItem(rpage,
 											PageGetItemId(rpage, roffnum));
 			itemsz = IndexTupleDSize(*itup);
@@ -681,39 +882,113 @@ _hash_squeezebucket(Relation rel,
 
 			/*
 			 * Walk up the bucket chain, looking for a page big enough for
-			 * this item.  Exit if we reach the read page.
+			 * this item and all other accumulated items.  Exit if we reach
+			 * the read page.
 			 */
-			while (PageGetFreeSpace(wpage) < itemsz)
+			while (PageGetFreeSpaceForMulTups(wpage, nitups + 1) < (all_tups_size + itemsz))
 			{
+				bool		release_wbuf = false;
+				bool		tups_moved = false;
+
 				Assert(!PageIsEmpty(wpage));
 
+				/*
+				 * the caller is responsible for locking and unlocking the
+				 * bucket buffer
+				 */
 				if (wblkno != bucket_blkno)
-					release_buf = true;
+					release_wbuf = true;
 
 				wblkno = wopaque->hasho_nextblkno;
 				Assert(BlockNumberIsValid(wblkno));
 
-				if (wbuf_dirty && release_buf)
-					_hash_wrtbuf(rel, wbuf);
-				else if (wbuf_dirty)
+				if (nitups > 0)
+				{
+					Assert(nitups == ndeletable);
+
+					/*
+					 * This operation needs to log multiple tuples, prepare
+					 * WAL for that.
+					 */
+					if (RelationNeedsWAL(rel))
+						XLogEnsureRecordSpace(0, XLR_NORMAL_RDATAS + nitups);
+
+					START_CRIT_SECTION();
+
+					/*
+					 * we have to insert tuples on the "write" page, being
+					 * careful to preserve hashkey ordering.  (If we insert
+					 * many tuples into the same "write" page it would be
+					 * worth qsort'ing them).
+					 */
+					_hash_pgaddmultitup(rel, wbuf, itups, itup_offsets, nitups);
 					MarkBufferDirty(wbuf);
-				else if (release_buf)
+
+					/* Delete tuples we already moved off read page */
+					PageIndexMultiDelete(rpage, deletable, ndeletable);
+					MarkBufferDirty(rbuf);
+
+					/* XLOG stuff */
+					if (RelationNeedsWAL(rel))
+					{
+						XLogRecPtr	recptr;
+						xl_hash_move_page_contents xlrec;
+
+						xlrec.ntups = nitups;
+						xlrec.is_prim_bucket_same_wrt = (wbuf == bucket_buf) ? true : false;
+
+						XLogBeginInsert();
+						XLogRegisterData((char *) &xlrec, SizeOfHashMovePageContents);
+
+						/*
+						 * bucket buffer needs to be registered to ensure that
+						 * we can acquire a cleanup lock on it during replay.
+						 */
+						if (!xlrec.is_prim_bucket_same_wrt)
+							XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+						XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
+						XLogRegisterBufData(1, (char *) itup_offsets,
+											nitups * sizeof(OffsetNumber));
+						for (i = 0; i < nitups; i++)
+							XLogRegisterBufData(1, (char *) itups[i], tups_size[i]);
+
+						XLogRegisterBuffer(2, rbuf, REGBUF_STANDARD);
+						XLogRegisterBufData(2, (char *) deletable,
+										  ndeletable * sizeof(OffsetNumber));
+
+						recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_MOVE_PAGE_CONTENTS);
+
+						PageSetLSN(BufferGetPage(wbuf), recptr);
+						PageSetLSN(BufferGetPage(rbuf), recptr);
+					}
+
+					END_CRIT_SECTION();
+
+					tups_moved = true;
+				}
+
+				if (release_wbuf)
 					_hash_relbuf(rel, wbuf);
 
+				/*
+				 * We need to release and, if required, reacquire the lock on
+				 * rbuf so that the standby doesn't see an intermediate state
+				 * of it.  If we didn't release the lock, then after replay of
+				 * XLOG_HASH_SQUEEZE_PAGE on the standby, users would be able
+				 * to see the results of a partial deletion on rblkno.
+				 */
+				_hash_chgbufaccess(rel, rbuf, HASH_READ, HASH_NOLOCK);
+
 				/* nothing more to do if we reached the read page */
 				if (rblkno == wblkno)
 				{
-					if (ndeletable > 0)
-					{
-						/* Delete tuples we already moved off read page */
-						PageIndexMultiDelete(rpage, deletable, ndeletable);
-						_hash_wrtbuf(rel, rbuf);
-					}
-					else
-						_hash_relbuf(rel, rbuf);
+					_hash_dropbuf(rel, rbuf);
 					return;
 				}
 
+				_hash_chgbufaccess(rel, rbuf, HASH_NOLOCK, HASH_WRITE);
+
 				wbuf = _hash_getbuf_with_strategy(rel,
 												  wblkno,
 												  HASH_WRITE,
@@ -722,21 +997,33 @@ _hash_squeezebucket(Relation rel,
 				wpage = BufferGetPage(wbuf);
 				wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
 				Assert(wopaque->hasho_bucket == bucket);
-				wbuf_dirty = false;
-				release_buf = false;
-			}
 
-			/*
-			 * we have found room so insert on the "write" page, being careful
-			 * to preserve hashkey ordering.  (If we insert many tuples into
-			 * the same "write" page it would be worth qsort'ing instead of
-			 * doing repeated _hash_pgaddtup.)
-			 */
-			(void) _hash_pgaddtup(rel, wbuf, itemsz, itup);
-			wbuf_dirty = true;
+				/* be tidy */
+				for (i = 0; i < nitups; i++)
+					pfree(itups[i]);
+				nitups = 0;
+				all_tups_size = 0;
+				ndeletable = 0;
+
+				/*
+				 * after moving the tuples, rpage would have been compacted,
+				 * so we need to rescan it.
+				 */
+				if (tups_moved)
+					goto readpage;
+			}
 
 			/* remember tuple for deletion from "read" page */
 			deletable[ndeletable++] = roffnum;
+
+			/*
+			 * we need a copy of the index tuples, as they can be freed along
+			 * with the overflow page; we still need them to write the WAL
+			 * record in _hash_freeovflpage.
+			 */
+			itups[nitups] = CopyIndexTuple(itup);
+			tups_size[nitups++] = itemsz;
+			all_tups_size += itemsz;
 		}
 
 		/*
@@ -748,34 +1035,36 @@ _hash_squeezebucket(Relation rel,
 		 * Tricky point here: if our read and write pages are adjacent in the
 		 * bucket chain, our write lock on wbuf will conflict with
 		 * _hash_freeovflpage's attempt to update the sibling links of the
-		 * removed page.  However, in that case we are done anyway, so we can
-		 * simply drop the write lock before calling _hash_freeovflpage.
+		 * removed page.  However, in that case we ensure that
+		 * _hash_freeovflpage doesn't take a lock on that page again.
+		 * Releasing the lock is not an option, because we still need to write
+		 * WAL for the changes to this page before doing so.
 		 */
 		rblkno = ropaque->hasho_prevblkno;
 		Assert(BlockNumberIsValid(rblkno));
 
+		/* free this overflow page (releases rbuf) */
+		_hash_freeovflpage(rel, bucket_buf, rbuf, wbuf, itups, itup_offsets,
+						   tups_size, nitups, bstrategy);
+
+		/* be tidy */
+		for (i = 0; i < nitups; i++)
+			pfree(itups[i]);
+
+		pfree(itup_offsets);
+
 		/* are we freeing the page adjacent to wbuf? */
 		if (rblkno == wblkno)
 		{
+			/* Caller is responsible for locking and unlocking the bucket buffer */
 			if (wblkno != bucket_blkno)
-				release_buf = true;
-
-			/* yes, so release wbuf lock first if needed */
-			if (wbuf_dirty && release_buf)
-				_hash_wrtbuf(rel, wbuf);
-			else if (wbuf_dirty)
-				MarkBufferDirty(wbuf);
-			else if (release_buf)
-				_hash_relbuf(rel, wbuf);
-
-			/* free this overflow page (releases rbuf) */
-			_hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
-			/* done */
+				_hash_dropbuf(rel, wbuf);
 			return;
 		}
 
-		/* free this overflow page, then get the previous one */
-		_hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
+		/* Caller is responsible for locking and unlocking the bucket buffer */
+		if (wblkno != bucket_blkno)
+			_hash_chgbufaccess(rel, wbuf, HASH_NOLOCK, HASH_WRITE);
 
 		rbuf = _hash_getbuf_with_strategy(rel,
 										  rblkno,
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index f51c313..716b037 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -40,12 +40,11 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
 				  Bucket obucket, Bucket nbucket,
 				  Buffer obuf,
 				  Buffer nbuf,
+				  HTAB *htab,
 				  uint32 maxbucket,
 				  uint32 highmask, uint32 lowmask);
-static void _hash_splitbucket_guts(Relation rel, Buffer metabuf,
-					   Bucket obucket, Bucket nbucket, Buffer obuf,
-					   Buffer nbuf, HTAB *htab, uint32 maxbucket,
-					   uint32 highmask, uint32 lowmask);
+static void
+			log_split_page(Relation rel, Buffer buf);
 
 
 /*
@@ -160,6 +159,29 @@ _hash_getinitbuf(Relation rel, BlockNumber blkno)
 }
 
 /*
+ *	_hash_initbuf() -- Get and initialize a buffer by bucket number.
+ */
+void
+_hash_initbuf(Buffer buf, uint32 num_bucket, uint32 flag, bool initpage)
+{
+	HashPageOpaque pageopaque;
+	Page		page;
+
+	page = BufferGetPage(buf);
+
+	/* initialize the page */
+	if (initpage)
+		_hash_pageinit(page, BufferGetPageSize(buf));
+
+	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+	pageopaque->hasho_prevblkno = InvalidBlockNumber;
+	pageopaque->hasho_nextblkno = InvalidBlockNumber;
+	pageopaque->hasho_bucket = num_bucket;
+	pageopaque->hasho_flag = flag;
+	pageopaque->hasho_page_id = HASHO_PAGE_ID;
+}
+
+/*
  *	_hash_getnewbuf() -- Get a new page at the end of the index.
  *
  *		This has the same API as _hash_getinitbuf, except that we are adding
@@ -286,35 +308,17 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 }
 
 /*
- *	_hash_wrtbuf() -- write a hash page to disk.
- *
- *		This routine releases the lock held on the buffer and our refcount
- *		for it.  It is an error to call _hash_wrtbuf() without a write lock
- *		and a pin on the buffer.
- *
- * NOTE: this routine should go away when/if hash indexes are WAL-ified.
- * The correct sequence of operations is to mark the buffer dirty, then
- * write the WAL record, then release the lock and pin; so marking dirty
- * can't be combined with releasing.
- */
-void
-_hash_wrtbuf(Relation rel, Buffer buf)
-{
-	MarkBufferDirty(buf);
-	UnlockReleaseBuffer(buf);
-}
-
-/*
  * _hash_chgbufaccess() -- Change the lock type on a buffer, without
  *			dropping our pin on it.
  *
- * from_access and to_access may be HASH_READ, HASH_WRITE, or HASH_NOLOCK,
- * the last indicating that no buffer-level lock is held or wanted.
+ * from_access can be HASH_READ or HASH_NOLOCK, the latter indicating that
+ * no buffer-level lock is currently held.
+ *
+ * to_access can be HASH_READ, HASH_WRITE, or HASH_NOLOCK, the last indicating
+ * that no buffer-level lock is held or wanted.
  *
- * When from_access == HASH_WRITE, we assume the buffer is dirty and tell
- * bufmgr it must be written out.  If the caller wants to release a write
- * lock on a page that's not been modified, it's okay to pass from_access
- * as HASH_READ (a bit ugly, but handy in some places).
+ * If the caller wants to release a write lock on a page, it's okay to pass
+ * from_access as HASH_READ (a bit ugly, but quite handy when required).
  */
 void
 _hash_chgbufaccess(Relation rel,
@@ -322,8 +326,7 @@ _hash_chgbufaccess(Relation rel,
 				   int from_access,
 				   int to_access)
 {
-	if (from_access == HASH_WRITE)
-		MarkBufferDirty(buf);
+	Assert(from_access != HASH_WRITE);
 	if (from_access != HASH_NOLOCK)
 		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 	if (to_access != HASH_NOLOCK)
@@ -332,7 +335,7 @@ _hash_chgbufaccess(Relation rel,
 
 
 /*
- *	_hash_metapinit() -- Initialize the metadata page of a hash index,
+ *	_hash_init() -- Initialize the metadata page of a hash index,
  *				the initial buckets, and the initial bitmap page.
  *
  * The initial number of buckets is dependent on num_tuples, an estimate
@@ -344,19 +347,18 @@ _hash_chgbufaccess(Relation rel,
  * multiple buffer locks is ignored.
  */
 uint32
-_hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
+_hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
 {
-	HashMetaPage metap;
-	HashPageOpaque pageopaque;
 	Buffer		metabuf;
 	Buffer		buf;
+	Buffer		bitmapbuf;
 	Page		pg;
+	HashMetaPage metap;
+	RegProcedure procid;
 	int32		data_width;
 	int32		item_width;
 	int32		ffactor;
-	double		dnumbuckets;
 	uint32		num_buckets;
-	uint32		log2_num_buckets;
 	uint32		i;
 
 	/* safety check */
@@ -378,6 +380,154 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	if (ffactor < 10)
 		ffactor = 10;
 
+	procid = index_getprocid(rel, 1, HASHPROC);
+
+	/*
+	 * We initialize the metapage, the first N bucket pages, and the first
+	 * bitmap page in sequence, using _hash_getnewbuf to cause smgrextend()
+	 * calls to occur.  This ensures that the smgr level has the right idea of
+	 * the physical index length.
+	 *
+	 * Critical section not required, because on error the creation of the
+	 * whole relation will be rolled back.
+	 */
+	metabuf = _hash_getnewbuf(rel, HASH_METAPAGE, forkNum);
+	_hash_init_metabuffer(metabuf, num_tuples, procid, ffactor, false);
+	MarkBufferDirty(metabuf);
+
+	pg = BufferGetPage(metabuf);
+	metap = HashPageGetMeta(pg);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_init_meta_page xlrec;
+		XLogRecPtr	recptr;
+
+		/*
+		 * We could have copied the entire metapage into the WAL record and
+		 * restored it as-is during replay.  However, the metapage is not
+		 * small (more than 600 bytes), so we record just the information
+		 * required to reconstruct it.
+		 */
+		xlrec.num_tuples = num_tuples;
+		xlrec.procid = metap->hashm_procid;
+		xlrec.ffactor = metap->hashm_ffactor;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInitMetaPage);
+		XLogRegisterBuffer(0, metabuf, REGBUF_WILL_INIT);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INIT_META_PAGE);
+
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	num_buckets = metap->hashm_maxbucket + 1;
+
+	/*
+	 * Release buffer lock on the metapage while we initialize buckets.
+	 * Otherwise, we'll be in interrupt holdoff and the CHECK_FOR_INTERRUPTS
+	 * won't accomplish anything.  It's a bad idea to hold buffer locks for
+	 * long intervals in any case, since that can block the bgwriter.
+	 */
+	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+	/*
+	 * Initialize and WAL Log the first N buckets
+	 */
+	for (i = 0; i < num_buckets; i++)
+	{
+		BlockNumber blkno;
+
+		/* Allow interrupts, in case N is huge */
+		CHECK_FOR_INTERRUPTS();
+
+		blkno = BUCKET_TO_BLKNO(metap, i);
+		buf = _hash_getnewbuf(rel, blkno, forkNum);
+		_hash_initbuf(buf, i, LH_BUCKET_PAGE, false);
+		MarkBufferDirty(buf);
+
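+		/* each new bucket page is WAL-logged as a full-page image */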
+		log_newpage(&rel->rd_node,
+					forkNum,
+					blkno,
+					BufferGetPage(buf),
+					true);
+		_hash_relbuf(rel, buf);
+	}
+
+	/* Now reacquire buffer lock on metapage */
+	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
+
+	/*
+	 * Initialize bitmap page
+	 */
+	bitmapbuf = _hash_getnewbuf(rel, num_buckets + 1, forkNum);
+	_hash_initbitmapbuffer(bitmapbuf, metap->hashm_bmsize, false);
+	MarkBufferDirty(bitmapbuf);
+
+	/* add the new bitmap page to the metapage's list of bitmaps */
+	/* metapage already has a write lock */
+	if (metap->hashm_nmaps >= HASH_MAX_BITMAPS)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+				 errmsg("out of overflow pages in hash index \"%s\"",
+						RelationGetRelationName(rel))));
+
+	metap->hashm_mapp[metap->hashm_nmaps] = num_buckets + 1;
+
+	metap->hashm_nmaps++;
+	MarkBufferDirty(metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_init_bitmap_page xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.bmsize = metap->hashm_bmsize;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInitBitmapPage);
+		XLogRegisterBuffer(0, bitmapbuf, REGBUF_WILL_INIT);
+
+		/*
+		 * We could log the exact changes made to the metapage; however, as
+		 * no concurrent operation can see the index during the replay of
+		 * this record, we simply perform the same operations during replay
+		 * as are done here.
+		 */
+		XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INIT_BITMAP_PAGE);
+
+		PageSetLSN(BufferGetPage(bitmapbuf), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	/* all done */
+	_hash_relbuf(rel, bitmapbuf);
+	_hash_relbuf(rel, metabuf);
+
+	return num_buckets;
+}
+
+/*
+ *	_hash_init_metabuffer() -- Initialize the metadata page of a hash index.
+ */
+void
+_hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
+					  uint16 ffactor, bool initpage)
+{
+	HashMetaPage metap;
+	HashPageOpaque pageopaque;
+	Page		page;
+	double		dnumbuckets;
+	uint32		num_buckets;
+	uint32		log2_num_buckets;
+	uint32		i;
+
+
 	/*
 	 * Choose the number of initial bucket pages to match the fill factor
 	 * given the estimated number of tuples.  We round up the result to the
@@ -397,30 +547,25 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	Assert(num_buckets == (((uint32) 1) << log2_num_buckets));
 	Assert(log2_num_buckets < HASH_MAX_SPLITPOINTS);
 
-	/*
-	 * We initialize the metapage, the first N bucket pages, and the first
-	 * bitmap page in sequence, using _hash_getnewbuf to cause smgrextend()
-	 * calls to occur.  This ensures that the smgr level has the right idea of
-	 * the physical index length.
-	 */
-	metabuf = _hash_getnewbuf(rel, HASH_METAPAGE, forkNum);
-	pg = BufferGetPage(metabuf);
+	page = BufferGetPage(buf);
+	if (initpage)
+		_hash_pageinit(page, BufferGetPageSize(buf));
 
-	pageopaque = (HashPageOpaque) PageGetSpecialPointer(pg);
+	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	pageopaque->hasho_prevblkno = InvalidBlockNumber;
 	pageopaque->hasho_nextblkno = InvalidBlockNumber;
 	pageopaque->hasho_bucket = -1;
 	pageopaque->hasho_flag = LH_META_PAGE;
 	pageopaque->hasho_page_id = HASHO_PAGE_ID;
 
-	metap = HashPageGetMeta(pg);
+	metap = HashPageGetMeta(page);
 
 	metap->hashm_magic = HASH_MAGIC;
 	metap->hashm_version = HASH_VERSION;
 	metap->hashm_ntuples = 0;
 	metap->hashm_nmaps = 0;
 	metap->hashm_ffactor = ffactor;
-	metap->hashm_bsize = HashGetMaxBitmapSize(pg);
+	metap->hashm_bsize = HashGetMaxBitmapSize(page);
 	/* find largest bitmap array size that will fit in page size */
 	for (i = _hash_log2(metap->hashm_bsize); i > 0; --i)
 	{
@@ -437,7 +582,7 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	 * pretty useless for normal operation (in fact, hashm_procid is not used
 	 * anywhere), but it might be handy for forensic purposes so we keep it.
 	 */
-	metap->hashm_procid = index_getprocid(rel, 1, HASHPROC);
+	metap->hashm_procid = procid;
 
 	/*
 	 * We initialize the index with N buckets, 0 .. N-1, occupying physical
@@ -456,44 +601,11 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	metap->hashm_firstfree = 0;
 
 	/*
-	 * Release buffer lock on the metapage while we initialize buckets.
-	 * Otherwise, we'll be in interrupt holdoff and the CHECK_FOR_INTERRUPTS
-	 * won't accomplish anything.  It's a bad idea to hold buffer locks for
-	 * long intervals in any case, since that can block the bgwriter.
+	 * Set pd_lower just past the end of the metadata.  This is needed when
+	 * logging a full-page image of the metapage in xloginsert.c.
 	 */
-	_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
-
-	/*
-	 * Initialize the first N buckets
-	 */
-	for (i = 0; i < num_buckets; i++)
-	{
-		/* Allow interrupts, in case N is huge */
-		CHECK_FOR_INTERRUPTS();
-
-		buf = _hash_getnewbuf(rel, BUCKET_TO_BLKNO(metap, i), forkNum);
-		pg = BufferGetPage(buf);
-		pageopaque = (HashPageOpaque) PageGetSpecialPointer(pg);
-		pageopaque->hasho_prevblkno = InvalidBlockNumber;
-		pageopaque->hasho_nextblkno = InvalidBlockNumber;
-		pageopaque->hasho_bucket = i;
-		pageopaque->hasho_flag = LH_BUCKET_PAGE;
-		pageopaque->hasho_page_id = HASHO_PAGE_ID;
-		_hash_wrtbuf(rel, buf);
-	}
-
-	/* Now reacquire buffer lock on metapage */
-	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
-
-	/*
-	 * Initialize first bitmap page
-	 */
-	_hash_initbitmap(rel, metap, num_buckets + 1, forkNum);
-
-	/* all done */
-	_hash_wrtbuf(rel, metabuf);
-
-	return num_buckets;
+	((PageHeader) page)->pd_lower =
+		((char *) metap + sizeof(HashMetaPageData)) - (char *) page;
 }
 
 /*
@@ -502,7 +614,6 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 void
 _hash_pageinit(Page page, Size size)
 {
-	Assert(PageIsNew(page));
 	PageInit(page, size, sizeof(HashPageOpaqueData));
 }
 
@@ -530,10 +641,14 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	Buffer		buf_nblkno;
 	Buffer		buf_oblkno;
 	Page		opage;
+	Page		npage;
 	HashPageOpaque oopaque;
+	HashPageOpaque nopaque;
 	uint32		maxbucket;
 	uint32		highmask;
 	uint32		lowmask;
+	bool		metap_update_masks = false;
+	bool		metap_update_splitpoint = false;
 
 restart_expand:
 
@@ -569,7 +684,7 @@ restart_expand:
 	 * than a disk block then this would be an independent constraint.
 	 *
 	 * If you change this, see also the maximum initial number of buckets in
-	 * _hash_metapinit().
+	 * _hash_init().
 	 */
 	if (metap->hashm_maxbucket >= (uint32) 0x7FFFFFFE)
 		goto fail;
@@ -703,7 +818,11 @@ restart_expand:
 		 * The number of buckets in the new splitpoint is equal to the total
 		 * number already in existence, i.e. new_bucket.  Currently this maps
 		 * one-to-one to blocks required, but someday we may need a more
-		 * complicated calculation here.
+		 * complicated calculation here.  We treat the allocation of buckets
+		 * as a separate WAL action.  Even if we fail after this operation,
+		 * the space won't be leaked on the standby, as the next split will
+		 * consume it.  In any case, even without a failure we don't use all
+		 * of the space in one split operation.
 		 */
 		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
 		{
@@ -728,18 +847,44 @@ restart_expand:
 		goto fail;
 	}
 
-
 	/*
-	 * Okay to proceed with split.  Update the metapage bucket mapping info.
-	 *
-	 * Since we are scribbling on the metapage data right in the shared
-	 * buffer, any failure in this next little bit leaves us with a big
+	 * Since we are scribbling on the pages in the shared buffers, establish a
+	 * critical section.  Any failure in this next code leaves us with a big
 	 * problem: the metapage is effectively corrupt but could get written back
-	 * to disk.  We don't really expect any failure, but just to be sure,
-	 * establish a critical section.
+	 * to disk.
 	 */
 	START_CRIT_SECTION();
 
+	opage = BufferGetPage(buf_oblkno);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	/*
+	 * Mark the old bucket to indicate that a split is in progress and that it
+	 * has deletable tuples.  At the end of the operation we clear the
+	 * split-in-progress flag, and vacuum will clear the page-has-garbage flag
+	 * after deleting such tuples.
+	 */
+	oopaque->hasho_flag |= LH_BUCKET_PAGE_HAS_GARBAGE | LH_BUCKET_OLD_PAGE_SPLIT;
+
+	MarkBufferDirty(buf_oblkno);
+
+	npage = BufferGetPage(buf_nblkno);
+
+	/*
+	 * initialize the new bucket's primary page and mark it to indicate that
+	 * split is in progress.
+	 */
+	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	nopaque->hasho_prevblkno = InvalidBlockNumber;
+	nopaque->hasho_nextblkno = InvalidBlockNumber;
+	nopaque->hasho_bucket = new_bucket;
+	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_NEW_PAGE_SPLIT;
+	nopaque->hasho_page_id = HASHO_PAGE_ID;
+
+	MarkBufferDirty(buf_nblkno);
+
+	/*
+	 * Okay to proceed with split.  Update the metapage bucket mapping info.
+	 */
 	metap->hashm_maxbucket = new_bucket;
 
 	if (new_bucket > metap->hashm_highmask)
@@ -747,6 +892,7 @@ restart_expand:
 		/* Starting a new doubling */
 		metap->hashm_lowmask = metap->hashm_highmask;
 		metap->hashm_highmask = new_bucket | metap->hashm_lowmask;
+		metap_update_masks = true;
 	}
 
 	/*
@@ -759,10 +905,10 @@ restart_expand:
 	{
 		metap->hashm_spares[spare_ndx] = metap->hashm_spares[metap->hashm_ovflpoint];
 		metap->hashm_ovflpoint = spare_ndx;
+		metap_update_splitpoint = true;
 	}
 
-	/* Done mucking with metapage */
-	END_CRIT_SECTION();
+	MarkBufferDirty(metabuf);
 
 	/*
 	 * Copy bucket mapping info now; this saves re-accessing the meta page
@@ -775,15 +921,71 @@ restart_expand:
 	highmask = metap->hashm_highmask;
 	lowmask = metap->hashm_lowmask;
 
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_split_allocpage xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.new_bucket = maxbucket;
+		xlrec.old_bucket_flag = oopaque->hasho_flag;
+		xlrec.new_bucket_flag = nopaque->hasho_flag;
+		xlrec.flags = 0;
+
+		XLogBeginInsert();
+
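+		/* block 0 is the old bucket, block 1 the new bucket, block 2 the metapage */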
+		XLogRegisterBuffer(0, buf_oblkno, REGBUF_STANDARD);
+		XLogRegisterBuffer(1, buf_nblkno, REGBUF_WILL_INIT);
+		XLogRegisterBuffer(2, metabuf, REGBUF_STANDARD);
+
+		if (metap_update_masks)
+		{
+			xlrec.flags |= XLH_SPLIT_META_UPDATE_MASKS;
+			XLogRegisterBufData(2, (char *) &metap->hashm_lowmask, sizeof(uint32));
+			XLogRegisterBufData(2, (char *) &metap->hashm_highmask, sizeof(uint32));
+		}
+
+		if (metap_update_splitpoint)
+		{
+			xlrec.flags |= XLH_SPLIT_META_UPDATE_SPLITPOINT;
+			XLogRegisterBufData(2, (char *) &metap->hashm_ovflpoint,
+								sizeof(uint32));
+			XLogRegisterBufData(2,
+					   (char *) &metap->hashm_spares[metap->hashm_ovflpoint],
+								sizeof(uint32));
+		}
+
+		XLogRegisterData((char *) &xlrec, SizeOfHashSplitAllocPage);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_ALLOCATE_PAGE);
+
+		PageSetLSN(BufferGetPage(buf_oblkno), recptr);
+		PageSetLSN(BufferGetPage(buf_nblkno), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
 	/* Write out the metapage and drop lock, but keep pin */
-	_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
+	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+	/*
+	 * we need to release and reacquire the lock on the new bucket's buffer so
+	 * that the standby doesn't see an intermediate state of it.
+	 */
+	_hash_chgbufaccess(rel, buf_nblkno, HASH_READ, HASH_NOLOCK);
+	_hash_chgbufaccess(rel, buf_nblkno, HASH_NOLOCK, HASH_WRITE);
 
 	/* Relocate records to the new bucket */
 	_hash_splitbucket(rel, metabuf,
 					  old_bucket, new_bucket,
-					  buf_oblkno, buf_nblkno,
+					  buf_oblkno, buf_nblkno, NULL,
 					  maxbucket, highmask, lowmask);
 
+	/* all done, now release the locks and pins on primary buckets. */
+	_hash_relbuf(rel, buf_oblkno);
+	_hash_relbuf(rel, buf_nblkno);
+
 	return;
 
 	/* Here if decide not to split or fail to acquire old bucket lock */
@@ -835,20 +1037,31 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 
 	MemSet(zerobuf, 0, sizeof(zerobuf));
 
+	if (RelationNeedsWAL(rel))
+		log_newpage(&rel->rd_node,
+					MAIN_FORKNUM,
+					lastblock,
+					zerobuf,
+					true);
+
 	RelationOpenSmgr(rel);
 	smgrextend(rel->rd_smgr, MAIN_FORKNUM, lastblock, zerobuf, false);
 
 	return true;
 }
 
-
 /*
  * _hash_splitbucket -- split 'obucket' into 'obucket' and 'nbucket'
  *
+ * This routine is used to partition the tuples between the old and new
+ * buckets and also to finish incomplete split operations.  To finish a
+ * previously interrupted split, the caller needs to fill htab.  If htab is
+ * set, we skip moving any tuple that already exists in htab; a NULL htab
+ * means all tuples belonging to the new bucket are moved.
+ *
  * We are splitting a bucket that consists of a base bucket page and zero
  * or more overflow (bucket chain) pages.  We must relocate tuples that
- * belong in the new bucket, and compress out any free space in the old
- * bucket.
+ * belong in the new bucket.
  *
  * The caller must hold exclusive locks on both buckets to ensure that
  * no one else is trying to access them (see README).
@@ -874,70 +1087,11 @@ _hash_splitbucket(Relation rel,
 				  Bucket nbucket,
 				  Buffer obuf,
 				  Buffer nbuf,
+				  HTAB *htab,
 				  uint32 maxbucket,
 				  uint32 highmask,
 				  uint32 lowmask)
 {
-	Page		opage;
-	Page		npage;
-	HashPageOpaque oopaque;
-	HashPageOpaque nopaque;
-
-	opage = BufferGetPage(obuf);
-	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
-
-	/*
-	 * Mark the old bucket to indicate that split is in progress and it has
-	 * deletable tuples. At operation end, we clear split in progress flag and
-	 * vacuum will clear page_has_garbage flag after deleting such tuples.
-	 */
-	oopaque->hasho_flag |= LH_BUCKET_PAGE_HAS_GARBAGE | LH_BUCKET_OLD_PAGE_SPLIT;
-
-	npage = BufferGetPage(nbuf);
-
-	/*
-	 * initialize the new bucket's primary page and mark it to indicate that
-	 * split is in progress.
-	 */
-	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
-	nopaque->hasho_prevblkno = InvalidBlockNumber;
-	nopaque->hasho_nextblkno = InvalidBlockNumber;
-	nopaque->hasho_bucket = nbucket;
-	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_NEW_PAGE_SPLIT;
-	nopaque->hasho_page_id = HASHO_PAGE_ID;
-
-	_hash_splitbucket_guts(rel, metabuf, obucket,
-						   nbucket, obuf, nbuf, NULL,
-						   maxbucket, highmask, lowmask);
-
-	/* all done, now release the locks and pins on primary buckets. */
-	_hash_relbuf(rel, obuf);
-	_hash_relbuf(rel, nbuf);
-}
-
-/*
- * _hash_splitbucket_guts -- Helper function to perform the split operation
- *
- * This routine is used to partition the tuples between old and new bucket and
- * is used to finish the incomplete split operations.  To finish the previously
- * interrupted split operation, caller needs to fill htab.  If htab is set, then
- * we skip the movement of tuples that exists in htab, otherwise NULL value of
- * htab indicates movement of all the tuples that belong to new bucket.
- *
- * Caller needs to lock and unlock the old and new primary buckets.
- */
-static void
-_hash_splitbucket_guts(Relation rel,
-					   Buffer metabuf,
-					   Bucket obucket,
-					   Bucket nbucket,
-					   Buffer obuf,
-					   Buffer nbuf,
-					   HTAB *htab,
-					   uint32 maxbucket,
-					   uint32 highmask,
-					   uint32 lowmask)
-{
 	Buffer		bucket_obuf;
 	Buffer		bucket_nbuf;
 	Page		opage;
@@ -1026,7 +1180,7 @@ _hash_splitbucket_guts(Relation rel,
 				itemsz = IndexTupleDSize(*new_itup);
 				itemsz = MAXALIGN(itemsz);
 
-				if (PageGetFreeSpace(npage) < itemsz)
+				while (PageGetFreeSpace(npage) < itemsz)
 				{
 					bool		retain_pin = false;
 
@@ -1036,8 +1190,12 @@ _hash_splitbucket_guts(Relation rel,
 					 */
 					retain_pin = nopaque->hasho_flag & LH_BUCKET_PAGE;
 
-					/* write out nbuf and drop lock, but keep pin */
-					_hash_chgbufaccess(rel, nbuf, HASH_WRITE, HASH_NOLOCK);
+					/* log the split operation before releasing the lock */
+					log_split_page(rel, nbuf);
+
+					/* drop lock, but keep pin */
+					_hash_chgbufaccess(rel, nbuf, HASH_READ, HASH_NOLOCK);
+
 					/* chain to a new overflow page */
 					nbuf = _hash_addovflpage(rel, metabuf, nbuf, retain_pin);
 					npage = BufferGetPage(nbuf);
@@ -1045,6 +1203,13 @@ _hash_splitbucket_guts(Relation rel,
 				}
 
 				/*
+				 * Change the shared buffer state within a critical section;
+				 * otherwise, any error here could leave the buffer in a
+				 * state that cannot be recovered.
+				 */
+				START_CRIT_SECTION();
+
+				/*
 				 * Insert tuple on new page, using _hash_pgaddtup to ensure
 				 * correct ordering by hashkey.  This is a tad inefficient
 				 * since we may have to shuffle itempointers repeatedly.
@@ -1053,6 +1218,8 @@ _hash_splitbucket_guts(Relation rel,
 				 */
 				(void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
 
+				END_CRIT_SECTION();
+
 				/* be tidy */
 				pfree(new_itup);
 			}
@@ -1075,7 +1242,16 @@ _hash_splitbucket_guts(Relation rel,
 
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
+		{
+			/* log the split operation before releasing the lock */
+			log_split_page(rel, nbuf);
+
+			if (nopaque->hasho_flag & LH_BUCKET_PAGE)
+				_hash_chgbufaccess(rel, nbuf, HASH_READ, HASH_NOLOCK);
+			else
+				_hash_relbuf(rel, nbuf);
 			break;
+		}
 
 		/* Else, advance to next old page */
 		obuf = _hash_getbuf(rel, oblkno, HASH_READ, LH_OVERFLOW_PAGE);
@@ -1091,15 +1267,6 @@ _hash_splitbucket_guts(Relation rel,
 	 * To avoid deadlocks due to locking order of buckets, first lock the old
 	 * bucket and then the new bucket.
 	 */
-	if (nopaque->hasho_flag & LH_BUCKET_PAGE)
-		_hash_chgbufaccess(rel, bucket_nbuf, HASH_WRITE, HASH_NOLOCK);
-	else
-		_hash_wrtbuf(rel, nbuf);
-
-	/*
-	 * Acquiring cleanup lock to clear the split-in-progress flag ensures that
-	 * there is no pending scan that has seen the flag after it is cleared.
-	 */
 	_hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
 	opage = BufferGetPage(bucket_obuf);
 	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
@@ -1108,6 +1275,8 @@ _hash_splitbucket_guts(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	START_CRIT_SECTION();
+
 	/* indicate that split is finished */
 	oopaque->hasho_flag &= ~LH_BUCKET_OLD_PAGE_SPLIT;
 	nopaque->hasho_flag &= ~LH_BUCKET_NEW_PAGE_SPLIT;
@@ -1118,6 +1287,29 @@ _hash_splitbucket_guts(Relation rel,
 	 */
 	MarkBufferDirty(bucket_obuf);
 	MarkBufferDirty(bucket_nbuf);
+
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+		xl_hash_split_complete xlrec;
+
+		xlrec.old_bucket_flag = oopaque->hasho_flag;
+		xlrec.new_bucket_flag = nopaque->hasho_flag;
+
+		XLogBeginInsert();
+
+		XLogRegisterData((char *) &xlrec, SizeOfHashSplitComplete);
+
+		XLogRegisterBuffer(0, bucket_obuf, REGBUF_STANDARD);
+		XLogRegisterBuffer(1, bucket_nbuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_COMPLETE);
+
+		PageSetLSN(BufferGetPage(bucket_obuf), recptr);
+		PageSetLSN(BufferGetPage(bucket_nbuf), recptr);
+	}
+
+	END_CRIT_SECTION();
 }
 
 /*
@@ -1222,9 +1414,41 @@ _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Buffer nbuf,
 	opageopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
 	obucket = opageopaque->hasho_bucket;
 
-	_hash_splitbucket_guts(rel, metabuf, obucket,
-						   nbucket, obuf, bucket_nbuf, tidhtab,
-						   maxbucket, highmask, lowmask);
+	_hash_splitbucket(rel, metabuf, obucket,
+					  nbucket, obuf, bucket_nbuf, tidhtab,
+					  maxbucket, highmask, lowmask);
 
 	hash_destroy(tidhtab);
 }
+
+/*
+ *	log_split_page() -- Log the split operation
+ *
+ *	We log the split operation once a page in the new bucket gets full,
+ *	so we log the entire page.
+ *
+ *	'buf' must be locked by the caller, which is also responsible for
+ *	unlocking it.
+ */
+static void
+log_split_page(Relation rel, Buffer buf)
+{
+	START_CRIT_SECTION();
+
+	MarkBufferDirty(buf);
+
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+
+		XLogBeginInsert();
+
+		XLogRegisterBuffer(0, buf, REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_PAGE);
+
+		PageSetLSN(BufferGetPage(buf), recptr);
+	}
+
+	END_CRIT_SECTION();
+}
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index e3a99cf..3d8b464 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -306,6 +306,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 
 
 	page = BufferGetPage(buf);
+	TestForOldSnapshot(scan->xs_snapshot, rel, page);
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	Assert(opaque->hasho_bucket == bucket);
 
@@ -362,6 +363,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	{
 		while (BlockNumberIsValid(opaque->hasho_nextblkno))
 			_hash_readnext(rel, &buf, &page, &opaque);
+
+		if (BufferIsValid(buf))
+			TestForOldSnapshot(scan->xs_snapshot, rel, page);
 	}
 
 	/* Now find the first tuple satisfying the qualification */
@@ -480,6 +484,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					_hash_readnext(rel, &buf, &page, &opaque);
 					if (BufferIsValid(buf))
 					{
+						TestForOldSnapshot(scan->xs_snapshot, rel, page);
 						maxoff = PageGetMaxOffsetNumber(page);
 						offnum = _hash_binsearch(page, so->hashso_sk_hash);
 					}
@@ -503,6 +508,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 							_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
 
 							page = BufferGetPage(buf);
+							TestForOldSnapshot(scan->xs_snapshot, rel, page);
 							maxoff = PageGetMaxOffsetNumber(page);
 							offnum = _hash_binsearch(page, so->hashso_sk_hash);
 
@@ -566,6 +572,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					_hash_readprev(rel, &buf, &page, &opaque);
 					if (BufferIsValid(buf))
 					{
+						TestForOldSnapshot(scan->xs_snapshot, rel, page);
 						maxoff = PageGetMaxOffsetNumber(page);
 						offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
 					}
@@ -589,6 +596,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 							_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
 
 							page = BufferGetPage(buf);
+							TestForOldSnapshot(scan->xs_snapshot, rel, page);
 							maxoff = PageGetMaxOffsetNumber(page);
 							offnum = _hash_binsearch(page, so->hashso_sk_hash);
 
diff --git a/src/backend/access/hash/hashxlog.c b/src/backend/access/hash/hashxlog.c
new file mode 100644
index 0000000..d0ea3ec
--- /dev/null
+++ b/src/backend/access/hash/hashxlog.c
@@ -0,0 +1,971 @@
+/*-------------------------------------------------------------------------
+ *
+ * hashxlog.c
+ *	  WAL replay logic for hash index.
+ *
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/hash/hashxlog.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/hash.h"
+#include "access/hash_xlog.h"
+#include "access/xlogutils.h"
+
+/*
+ * replay a hash index meta page
+ */
+static void
+hash_xlog_init_meta_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Page		page;
+	Buffer		metabuf;
+
+	xl_hash_init_meta_page *xlrec = (xl_hash_init_meta_page *) XLogRecGetData(record);
+
+	/* create the index's metapage */
+	metabuf = XLogInitBufferForRedo(record, 0);
+	Assert(BufferIsValid(metabuf));
+	_hash_init_metabuffer(metabuf, xlrec->num_tuples, xlrec->procid,
+						  xlrec->ffactor, true);
+	page = (Page) BufferGetPage(metabuf);
+	PageSetLSN(page, lsn);
+	MarkBufferDirty(metabuf);
+	/* all done */
+	UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay a hash index bitmap page
+ */
+static void
+hash_xlog_init_bitmap_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		bitmapbuf;
+	Buffer		metabuf;
+	Page		page;
+	HashMetaPage metap;
+	uint32		num_buckets;
+
+	xl_hash_init_bitmap_page *xlrec = (xl_hash_init_bitmap_page *) XLogRecGetData(record);
+
+	/*
+	 * Initialize bitmap page
+	 */
+	bitmapbuf = XLogInitBufferForRedo(record, 0);
+	_hash_initbitmapbuffer(bitmapbuf, xlrec->bmsize, true);
+	PageSetLSN(BufferGetPage(bitmapbuf), lsn);
+	MarkBufferDirty(bitmapbuf);
+	UnlockReleaseBuffer(bitmapbuf);
+
+	/* add the new bitmap page to the metapage's list of bitmaps */
+	if (XLogReadBufferForRedo(record, 1, &metabuf) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Note: in normal operation, we'd update the metapage while still
+		 * holding lock on the page we inserted into.  But during replay it's
+		 * not necessary to hold that lock, since no other index updates can
+		 * be happening concurrently.
+		 */
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+
+		num_buckets = metap->hashm_maxbucket + 1;
+		metap->hashm_mapp[metap->hashm_nmaps] = num_buckets + 1;
+		metap->hashm_nmaps++;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay a hash index insert without split
+ */
+static void
+hash_xlog_insert(XLogReaderState *record)
+{
+	HashMetaPage metap;
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_insert *xlrec = (xl_hash_insert *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+
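+	/* replay the tuple insertion on the bucket (or overflow) page */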
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		Size		datalen;
+		char	   *datapos = XLogRecGetBlockData(record, 0, &datalen);
+
+		page = BufferGetPage(buffer);
+
+		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+						false, false) == InvalidOffsetNumber)
+			elog(PANIC, "hash_insert_redo: failed to add item");
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+
+	if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Note: in normal operation, we'd update the metapage while still
+		 * holding lock on the page we inserted into.  But during replay it's
+		 * not necessary to hold that lock, since no other index updates can
+		 * be happening concurrently.
+		 */
+		page = BufferGetPage(buffer);
+		metap = HashPageGetMeta(page);
+		metap->hashm_ntuples += 1;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
+
+/*
+ * replay addition of overflow page for hash index
+ */
+static void
+hash_xlog_addovflpage(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_addovflpage *xlrec = (xl_hash_addovflpage *) XLogRecGetData(record);
+	Buffer		leftbuf;
+	Buffer		ovflbuf;
+	Buffer		metabuf;
+	BlockNumber leftblk;
+	BlockNumber rightblk;
+	BlockNumber newmapblk = InvalidBlockNumber;
+	Page		ovflpage;
+	HashPageOpaque ovflopaque;
+	uint32	   *num_bucket;
+	char	   *data;
+	Size datalen PG_USED_FOR_ASSERTS_ONLY;
+	bool		new_bmpage = false;
+
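+	/* block 0 is the new overflow page, block 1 is its left sibling in the chain */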
+	XLogRecGetBlockTag(record, 0, NULL, NULL, &rightblk);
+	XLogRecGetBlockTag(record, 1, NULL, NULL, &leftblk);
+
+	ovflbuf = XLogInitBufferForRedo(record, 0);
+	Assert(BufferIsValid(ovflbuf));
+
+	data = XLogRecGetBlockData(record, 0, &datalen);
+	num_bucket = (uint32 *) data;
+	Assert(datalen == sizeof(uint32));
+	_hash_initbuf(ovflbuf, *num_bucket, LH_OVERFLOW_PAGE, true);
+	/* update backlink */
+	ovflpage = BufferGetPage(ovflbuf);
+	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
+	ovflopaque->hasho_prevblkno = leftblk;
+
+	PageSetLSN(ovflpage, lsn);
+	MarkBufferDirty(ovflbuf);
+
+	if (XLogReadBufferForRedo(record, 1, &leftbuf) == BLK_NEEDS_REDO)
+	{
+		Page		leftpage;
+		HashPageOpaque leftopaque;
+
+		leftpage = BufferGetPage(leftbuf);
+		leftopaque = (HashPageOpaque) PageGetSpecialPointer(leftpage);
+		leftopaque->hasho_nextblkno = rightblk;
+
+		PageSetLSN(leftpage, lsn);
+		MarkBufferDirty(leftbuf);
+	}
+
+	/*
+	 * We must not release the locks until the prev pointer of the overflow
+	 * page and the next pointer of the left page are both set; otherwise, a
+	 * concurrent read might skip the overflow page.
+	 */
+	if (BufferIsValid(leftbuf))
+		UnlockReleaseBuffer(leftbuf);
+	UnlockReleaseBuffer(ovflbuf);
+
+	/*
+	 * Note: in normal operation, we'd update the bitmap and meta page while
+	 * still holding lock on the overflow pages we updated.  But during replay
+	 * it's not necessary to hold those locks, since no other index updates
+	 * can be happening concurrently.
+	 */
+	if (XLogRecHasBlockRef(record, 2))
+	{
+		Buffer		mapbuffer;
+
+		if (XLogReadBufferForRedo(record, 2, &mapbuffer) == BLK_NEEDS_REDO)
+		{
+			Page		mappage = (Page) BufferGetPage(mapbuffer);
+			uint32	   *freep = NULL;
+			char	   *data;
+			uint32	   *bitmap_page_bit;
+
+			freep = HashPageGetBitmap(mappage);
+
+			data = XLogRecGetBlockData(record, 2, &datalen);
+			bitmap_page_bit = (uint32 *) data;
+
+			SETBIT(freep, *bitmap_page_bit);
+
+			PageSetLSN(mappage, lsn);
+			MarkBufferDirty(mapbuffer);
+		}
+		if (BufferIsValid(mapbuffer))
+			UnlockReleaseBuffer(mapbuffer);
+	}
+
+	if (XLogRecHasBlockRef(record, 3))
+	{
+		Buffer		newmapbuf;
+
+		newmapbuf = XLogInitBufferForRedo(record, 3);
+
+		_hash_initbitmapbuffer(newmapbuf, xlrec->bmsize, true);
+
+		new_bmpage = true;
+		newmapblk = BufferGetBlockNumber(newmapbuf);
+
+		MarkBufferDirty(newmapbuf);
+		PageSetLSN(BufferGetPage(newmapbuf), lsn);
+
+		UnlockReleaseBuffer(newmapbuf);
+	}
+
+	if (XLogReadBufferForRedo(record, 4, &metabuf) == BLK_NEEDS_REDO)
+	{
+		HashMetaPage metap;
+		Page		page;
+		uint32	   *firstfree_ovflpage;
+
+		data = XLogRecGetBlockData(record, 4, &datalen);
+		firstfree_ovflpage = (uint32 *) data;
+
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+		metap->hashm_firstfree = *firstfree_ovflpage;
+
+		if (!xlrec->bmpage_found)
+		{
+			metap->hashm_spares[metap->hashm_ovflpoint]++;
+
+			if (new_bmpage)
+			{
+				Assert(BlockNumberIsValid(newmapblk));
+
+				metap->hashm_mapp[metap->hashm_nmaps] = newmapblk;
+				metap->hashm_nmaps++;
+				metap->hashm_spares[metap->hashm_ovflpoint]++;
+			}
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay allocation of page for split operation
+ */
+static void
+hash_xlog_split_allocpage(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_split_allocpage *xlrec = (xl_hash_split_allocpage *) XLogRecGetData(record);
+	Buffer		oldbuf;
+	Buffer		newbuf;
+	Buffer		metabuf;
+	Size datalen PG_USED_FOR_ASSERTS_ONLY;
+	char	   *data;
+	XLogRedoAction action;
+
+	/* replay the record for old bucket */
+	action = XLogReadBufferForRedo(record, 0, &oldbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		oldpage;
+		HashPageOpaque oldopaque;
+
+		oldpage = BufferGetPage(oldbuf);
+		oldopaque = (HashPageOpaque) PageGetSpecialPointer(oldpage);
+
+		oldopaque->hasho_flag = xlrec->old_bucket_flag;
+
+		PageSetLSN(oldpage, lsn);
+		MarkBufferDirty(oldbuf);
+	}
+
+	/*
+	 * There is no harm in releasing the lock on old bucket before new bucket
+	 * during replay as no other bucket splits can be happening concurrently.
+	 */
+	if (BufferIsValid(oldbuf))
+		UnlockReleaseBuffer(oldbuf);
+
+	/* replay the record for new bucket */
+	newbuf = XLogInitBufferForRedo(record, 1);
+	_hash_initbuf(newbuf, xlrec->new_bucket, xlrec->new_bucket_flag, true);
+	MarkBufferDirty(newbuf);
+	PageSetLSN(BufferGetPage(newbuf), lsn);
+
+	UnlockReleaseBuffer(newbuf);
+
+	/*
+	 * Note: in normal operation, we'd update the meta page while still
+	 * holding lock on the old and new bucket pages.  But during replay it's
+	 * not necessary to hold those locks, since no other bucket splits can be
+	 * happening concurrently.
+	 */
+
+	/* replay the record for metapage changes */
+	if (XLogReadBufferForRedo(record, 2, &metabuf) == BLK_NEEDS_REDO)
+	{
+		Page		page;
+		HashMetaPage metap;
+
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+		metap->hashm_maxbucket = xlrec->new_bucket;
+
+		data = XLogRecGetBlockData(record, 2, &datalen);
+
+		if (xlrec->flags & XLH_SPLIT_META_UPDATE_MASKS)
+		{
+			uint32		lowmask;
+			uint32	   *highmask;
+
+			/* extract low and high masks. */
+			memcpy(&lowmask, data, sizeof(uint32));
+			highmask = (uint32 *) ((char *) data + sizeof(uint32));
+
+			/* update metapage */
+			metap->hashm_lowmask = lowmask;
+			metap->hashm_highmask = *highmask;
+
+			data += sizeof(uint32) * 2;
+		}
+
+		if (xlrec->flags & XLH_SPLIT_META_UPDATE_SPLITPOINT)
+		{
+			uint32		ovflpoint;
+			uint32	   *ovflpages;
+
+			/* extract information of overflow pages. */
+			memcpy(&ovflpoint, data, sizeof(uint32));
+			ovflpages = (uint32 *) ((char *) data + sizeof(uint32));
+
+			/* update metapage */
+			metap->hashm_spares[ovflpoint] = *ovflpages;
+			metap->hashm_ovflpoint = ovflpoint;
+		}
+
+		MarkBufferDirty(metabuf);
+		PageSetLSN(BufferGetPage(metabuf), lsn);
+	}
+
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay of split operation
+ */
+static void
+hash_xlog_split_page(XLogReaderState *record)
+{
+	Buffer		buf;
+
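+	/* the record carries only a full-page image of the page, so it must be restored */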
+	if (XLogReadBufferForRedo(record, 0, &buf) != BLK_RESTORED)
+		elog(ERROR, "Hash split record did not contain a full-page image");
+
+	UnlockReleaseBuffer(buf);
+}
+
+/*
+ * replay completion of split operation
+ */
+static void
+hash_xlog_split_complete(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_split_complete *xlrec = (xl_hash_split_complete *) XLogRecGetData(record);
+	Buffer		oldbuf;
+	Buffer		newbuf;
+	XLogRedoAction action;
+
+	/* replay the record for old bucket */
+	action = XLogReadBufferForRedo(record, 0, &oldbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		oldpage;
+		HashPageOpaque oldopaque;
+
+		oldpage = BufferGetPage(oldbuf);
+		oldopaque = (HashPageOpaque) PageGetSpecialPointer(oldpage);
+
+		oldopaque->hasho_flag = xlrec->old_bucket_flag;
+
+		PageSetLSN(oldpage, lsn);
+		MarkBufferDirty(oldbuf);
+	}
+	if (BufferIsValid(oldbuf))
+		UnlockReleaseBuffer(oldbuf);
+
+	/* replay the record for new bucket */
+	action = XLogReadBufferForRedo(record, 1, &newbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		newpage;
+		HashPageOpaque nopaque;
+
+		newpage = BufferGetPage(newbuf);
+		nopaque = (HashPageOpaque) PageGetSpecialPointer(newpage);
+
+		nopaque->hasho_flag = xlrec->new_bucket_flag;
+
+		PageSetLSN(newpage, lsn);
+		MarkBufferDirty(newbuf);
+	}
+	if (BufferIsValid(newbuf))
+		UnlockReleaseBuffer(newbuf);
+}
+
+/*
+ * replay move of page contents for squeeze operation of hash index
+ */
+static void
+hash_xlog_move_page_contents(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_move_page_contents *xldata = (xl_hash_move_page_contents *) XLogRecGetData(record);
+	Buffer		bukcetbuf = InvalidBuffer;
+	Buffer		writebuf = InvalidBuffer;
+	Buffer		deletebuf = InvalidBuffer;
+	XLogRedoAction action;
+
+	/*
+	 * Ensure we have a cleanup lock on the primary bucket page before we
+	 * start the actual replay operation.  This guarantees that no scan can
+	 * start, and none can already be in progress, during the replay of this
+	 * operation.  If scans were allowed during this operation, they could
+	 * miss some records or see the same record multiple times.
+	 */
+	if (xldata->is_prim_bucket_same_wrt)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &writebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		bukcetbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bukcetbuf))
+			LockBufferForCleanup(bukcetbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &writebuf);
+	}
+
+	/* replay the record for adding entries in overflow buffer */
+	if (action == BLK_NEEDS_REDO)
+	{
+		Page		writepage;
+		char	   *begin;
+		char	   *data;
+		Size		datalen;
+		uint16		ninserted = 0;
+
+		data = begin = XLogRecGetBlockData(record, 1, &datalen);
+
+		writepage = (Page) BufferGetPage(writebuf);
+
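+		/* block 1 data: the array of target offsets, followed by the tuples themselves */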
+		if (xldata->ntups > 0)
+		{
+			OffsetNumber *towrite = (OffsetNumber *) data;
+
+			data += sizeof(OffsetNumber) * xldata->ntups;
+
+			while (data - begin < datalen)
+			{
+				IndexTuple	itup = (IndexTuple) data;
+				Size		itemsz;
+				OffsetNumber l;
+
+				itemsz = IndexTupleDSize(*itup);
+				itemsz = MAXALIGN(itemsz);
+
+				data += itemsz;
+
+				l = PageAddItem(writepage, (Item) itup, itemsz, towrite[ninserted], false, false);
+				if (l == InvalidOffsetNumber)
+					elog(ERROR, "hash_xlog_move_page_contents: failed to add item to hash index page, size %d bytes",
+						 (int) itemsz);
+
+				ninserted++;
+			}
+		}
+
+		/*
+		 * The number of tuples inserted must be the same as requested in the
+		 * REDO record.
+		 */
+		Assert(ninserted == xldata->ntups);
+
+		PageSetLSN(writepage, lsn);
+		MarkBufferDirty(writebuf);
+	}
+
+	/* replay the record for deleting entries from overflow buffer */
+	if (XLogReadBufferForRedo(record, 2, &deletebuf) == BLK_NEEDS_REDO)
+	{
+		Page		page;
+		char	   *ptr;
+		Size		len;
+
+		ptr = XLogRecGetBlockData(record, 2, &len);
+
+		page = (Page) BufferGetPage(deletebuf);
+
+		if (len > 0)
+		{
+			OffsetNumber *unused;
+			OffsetNumber *unend;
+
+			unused = (OffsetNumber *) ptr;
+			unend = (OffsetNumber *) ((char *) ptr + len);
+
+			if ((unend - unused) > 0)
+				PageIndexMultiDelete(page, unused, unend - unused);
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(deletebuf);
+	}
+
+	/*
+	 * Replay is complete, now we can release the buffers. We release locks at
+	 * end of replay operation to ensure that we hold lock on primary bucket
+	 * page till end of operation.  We can optimize by releasing the lock on
+	 * write buffer as soon as the operation for same is complete, if it is
+	 * not same as primary bucket page, but that doesn't seem to be worth
+	 * complicating the code.
+	 */
+	if (BufferIsValid(deletebuf))
+		UnlockReleaseBuffer(deletebuf);
+
+	if (BufferIsValid(writebuf))
+		UnlockReleaseBuffer(writebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay squeeze page operation of hash index
+ */
+static void
+hash_xlog_squeeze_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_squeeze_page *xldata = (xl_hash_squeeze_page *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		writebuf;
+	Buffer		ovflbuf;
+	Buffer		prevbuf = InvalidBuffer;
+	Buffer		mapbuf;
+	XLogRedoAction action;
+
+	/*
+	 * Ensure we have a cleanup lock on the primary bucket page before we
+	 * start the actual replay operation.  This is to ensure that neither a
+	 * scan can start nor a scan can be already-in-progress during the replay
+	 * of this operation.  If we allow scans during this operation, then they
+	 * can miss some records or show the same record multiple times.
+	 */
+	if (xldata->is_prim_bucket_same_wrt)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &writebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bucketbuf))
+			LockBufferForCleanup(bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &writebuf);
+	}
+
+	/* replay the record for adding entries in overflow buffer */
+	if (action == BLK_NEEDS_REDO)
+	{
+		Page		writepage;
+		char	   *begin;
+		char	   *data;
+		Size		datalen;
+		uint16		ninserted = 0;
+
+		data = begin = XLogRecGetBlockData(record, 1, &datalen);
+
+		writepage = (Page) BufferGetPage(writebuf);
+
+		if (xldata->ntups > 0)
+		{
+			OffsetNumber *towrite = (OffsetNumber *) data;
+
+			data += sizeof(OffsetNumber) * xldata->ntups;
+
+			while (data - begin < datalen)
+			{
+				IndexTuple	itup = (IndexTuple) data;
+				Size		itemsz;
+				OffsetNumber l;
+
+				itemsz = IndexTupleDSize(*itup);
+				itemsz = MAXALIGN(itemsz);
+
+				data += itemsz;
+
+				l = PageAddItem(writepage, (Item) itup, itemsz, towrite[ninserted], false, false);
+				if (l == InvalidOffsetNumber)
+					elog(ERROR, "hash_xlog_squeeze_page: failed to add item to hash index page, size %d bytes",
+						 (int) itemsz);
+
+				ninserted++;
+			}
+		}
+
+		/*
+		 * number of tuples inserted must be same as requested in REDO record.
+		 */
+		Assert(ninserted == xldata->ntups);
+
+		/*
+		 * If the page to which we are adding tuples is the page previous to
+		 * the freed overflow page, then update its nextblkno.
+		 */
+		if (xldata->is_prev_bucket_same_wrt)
+		{
+			HashPageOpaque writeopaque = (HashPageOpaque) PageGetSpecialPointer(writepage);
+
+			writeopaque->hasho_nextblkno = xldata->nextblkno;
+		}
+
+		PageSetLSN(writepage, lsn);
+		MarkBufferDirty(writebuf);
+	}
+
+	/* replay the record for initializing overflow buffer */
+	if (XLogReadBufferForRedo(record, 2, &ovflbuf) == BLK_NEEDS_REDO)
+	{
+		Page		ovflpage;
+
+		ovflpage = BufferGetPage(ovflbuf);
+
+		_hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));
+
+		PageSetLSN(ovflpage, lsn);
+		MarkBufferDirty(ovflbuf);
+	}
+	if (BufferIsValid(ovflbuf))
+		UnlockReleaseBuffer(ovflbuf);
+
+	/* replay the record for page previous to the freed overflow page */
+	if (!xldata->is_prev_bucket_same_wrt &&
+		XLogReadBufferForRedo(record, 3, &prevbuf) == BLK_NEEDS_REDO)
+	{
+		Page		prevpage = BufferGetPage(prevbuf);
+		HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+		prevopaque->hasho_nextblkno = xldata->nextblkno;
+
+		PageSetLSN(prevpage, lsn);
+		MarkBufferDirty(prevbuf);
+	}
+	if (BufferIsValid(prevbuf))
+		UnlockReleaseBuffer(prevbuf);
+
+	/* replay the record for page next to the freed overflow page */
+	if (XLogRecHasBlockRef(record, 4))
+	{
+		Buffer		nextbuf;
+
+		if (XLogReadBufferForRedo(record, 4, &nextbuf) == BLK_NEEDS_REDO)
+		{
+			Page		nextpage = BufferGetPage(nextbuf);
+			HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage);
+
+			nextopaque->hasho_prevblkno = xldata->prevblkno;
+
+			PageSetLSN(nextpage, lsn);
+			MarkBufferDirty(nextbuf);
+		}
+		if (BufferIsValid(nextbuf))
+			UnlockReleaseBuffer(nextbuf);
+	}
+
+	/* replay the record for bitmap page */
+	if (XLogReadBufferForRedo(record, 5, &mapbuf) == BLK_NEEDS_REDO)
+	{
+		Page		mappage = (Page) BufferGetPage(mapbuf);
+		uint32	   *freep = NULL;
+		char	   *data;
+		uint32	   *bitmap_page_bit;
+		Size		datalen;
+
+		freep = HashPageGetBitmap(mappage);
+
+		data = XLogRecGetBlockData(record, 5, &datalen);
+		bitmap_page_bit = (uint32 *) data;
+
+		CLRBIT(freep, *bitmap_page_bit);
+
+		PageSetLSN(mappage, lsn);
+		MarkBufferDirty(mapbuf);
+	}
+	if (BufferIsValid(mapbuf))
+		UnlockReleaseBuffer(mapbuf);
+
+	/* replay the record for meta page */
+	if (XLogRecHasBlockRef(record, 6))
+	{
+		Buffer		metabuf;
+
+		if (XLogReadBufferForRedo(record, 6, &metabuf) == BLK_NEEDS_REDO)
+		{
+			HashMetaPage metap;
+			Page		page;
+			char	   *data;
+			uint32	   *firstfree_ovflpage;
+			Size		datalen;
+
+			data = XLogRecGetBlockData(record, 6, &datalen);
+			firstfree_ovflpage = (uint32 *) data;
+
+			page = BufferGetPage(metabuf);
+			metap = HashPageGetMeta(page);
+			metap->hashm_firstfree = *firstfree_ovflpage;
+
+			PageSetLSN(page, lsn);
+			MarkBufferDirty(metabuf);
+		}
+		if (BufferIsValid(metabuf))
+			UnlockReleaseBuffer(metabuf);
+	}
+
+	/*
+	 * We release locks on writebuf and bucketbuf at end of replay operation
+	 * to ensure that we hold lock on primary bucket page till end of
+	 * operation.  We can optimize by releasing the lock on write buffer as
+	 * soon as the operation for same is complete, if it is not same as
+	 * primary bucket page, but that doesn't seem to be worth complicating the
+	 * code.
+	 */
+	if (BufferIsValid(writebuf))
+		UnlockReleaseBuffer(writebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay delete operation of hash index
+ */
+static void
+hash_xlog_delete(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_delete *xldata = (xl_hash_delete *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		deletebuf;
+	Page		page;
+	XLogRedoAction action;
+
+	/*
+	 * Ensure we have a cleanup lock on the primary bucket page before we
+	 * start the actual replay operation.  This is to ensure that neither a
+	 * scan can start nor a scan can be already-in-progress during the replay
+	 * of this operation.  If we allow scans during this operation, then they
+	 * can miss some records or show the same record multiple times.
+	 */
+	if (xldata->is_primary_bucket_page)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &deletebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bucketbuf))
+			LockBufferForCleanup(bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &deletebuf);
+	}
+
+	/* replay the record for deleting entries in bucket page */
+	if (action == BLK_NEEDS_REDO)
+	{
+		char	   *ptr;
+		Size		len;
+
+		ptr = XLogRecGetBlockData(record, 1, &len);
+
+		page = (Page) BufferGetPage(deletebuf);
+
+		if (len > 0)
+		{
+			OffsetNumber *unused;
+			OffsetNumber *unend;
+
+			unused = (OffsetNumber *) ptr;
+			unend = (OffsetNumber *) ((char *) ptr + len);
+
+			if ((unend - unused) > 0)
+				PageIndexMultiDelete(page, unused, unend - unused);
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(deletebuf);
+	}
+	if (BufferIsValid(deletebuf))
+		UnlockReleaseBuffer(deletebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay clear garbage flag operation for primary bucket page.
+ */
+static void
+hash_xlog_clear_garbage(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		HashPageOpaque bucket_opaque;
+
+		page = (Page) BufferGetPage(buffer);
+
+		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+		bucket_opaque->hasho_flag &= ~LH_BUCKET_PAGE_HAS_GARBAGE;
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
+
+/*
+ * replay for update meta page
+ */
+static void
+hash_xlog_update_meta_page(XLogReaderState *record)
+{
+	HashMetaPage metap;
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_update_meta_page *xldata = (xl_hash_update_meta_page *) XLogRecGetData(record);
+	Buffer		metabuf;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &metabuf) == BLK_NEEDS_REDO)
+	{
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+
+		metap->hashm_ntuples = xldata->ntuples;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+void
+hash_redo(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			hash_xlog_init_meta_page(record);
+			break;
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			hash_xlog_init_bitmap_page(record);
+			break;
+		case XLOG_HASH_INSERT:
+			hash_xlog_insert(record);
+			break;
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			hash_xlog_addovflpage(record);
+			break;
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			hash_xlog_split_allocpage(record);
+			break;
+		case XLOG_HASH_SPLIT_PAGE:
+			hash_xlog_split_page(record);
+			break;
+		case XLOG_HASH_SPLIT_COMPLETE:
+			hash_xlog_split_complete(record);
+			break;
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			hash_xlog_move_page_contents(record);
+			break;
+		case XLOG_HASH_SQUEEZE_PAGE:
+			hash_xlog_squeeze_page(record);
+			break;
+		case XLOG_HASH_DELETE:
+			hash_xlog_delete(record);
+			break;
+		case XLOG_HASH_CLEAR_GARBAGE:
+			hash_xlog_clear_garbage(record);
+			break;
+		case XLOG_HASH_UPDATE_META_PAGE:
+			hash_xlog_update_meta_page(record);
+			break;
+		default:
+			elog(PANIC, "hash_redo: unknown op code %u", info);
+	}
+}
diff --git a/src/backend/access/rmgrdesc/hashdesc.c b/src/backend/access/rmgrdesc/hashdesc.c
index 12e1818..245ce97 100644
--- a/src/backend/access/rmgrdesc/hashdesc.c
+++ b/src/backend/access/rmgrdesc/hashdesc.c
@@ -19,10 +19,143 @@
 void
 hash_desc(StringInfo buf, XLogReaderState *record)
 {
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			{
+				xl_hash_init_meta_page *xlrec = (xl_hash_init_meta_page *) rec;
+
+				appendStringInfo(buf, "num_tuples %g, fillfactor %d",
+								 xlrec->num_tuples, xlrec->ffactor);
+				break;
+			}
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			{
+				xl_hash_init_bitmap_page *xlrec = (xl_hash_init_bitmap_page *) rec;
+
+				appendStringInfo(buf, "bmsize %d", xlrec->bmsize);
+				break;
+			}
+		case XLOG_HASH_INSERT:
+			{
+				xl_hash_insert *xlrec = (xl_hash_insert *) rec;
+
+				appendStringInfo(buf, "off %u", xlrec->offnum);
+				break;
+			}
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			{
+				xl_hash_addovflpage *xlrec = (xl_hash_addovflpage *) rec;
+
+				appendStringInfo(buf, "bmsize %d, bmpage_found %c",
+						   xlrec->bmsize, (xlrec->bmpage_found) ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			{
+				xl_hash_split_allocpage *xlrec = (xl_hash_split_allocpage *) rec;
+
+				appendStringInfo(buf, "new_bucket %u, meta_page_masks_updated %c, issplitpoint_changed %c",
+								 xlrec->new_bucket,
+					(xlrec->flags & XLH_SPLIT_META_UPDATE_MASKS) ? 'T' : 'F',
+								 (xlrec->flags & XLH_SPLIT_META_UPDATE_SPLITPOINT) ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SPLIT_COMPLETE:
+			{
+				xl_hash_split_complete *xlrec = (xl_hash_split_complete *) rec;
+
+				appendStringInfo(buf, "split_complete_old_bucket %c, split_complete_new_bucket %c",
+								 (xlrec->old_bucket_flag & LH_BUCKET_OLD_PAGE_SPLIT) ? 'F' : 'T',
+								 (xlrec->new_bucket_flag & LH_BUCKET_NEW_PAGE_SPLIT) ? 'F' : 'T');
+				break;
+			}
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			{
+				xl_hash_move_page_contents *xlrec = (xl_hash_move_page_contents *) rec;
+
+				appendStringInfo(buf, "ntups %d, is_primary %c",
+								 xlrec->ntups,
+								 xlrec->is_prim_bucket_same_wrt ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SQUEEZE_PAGE:
+			{
+				xl_hash_squeeze_page *xlrec = (xl_hash_squeeze_page *) rec;
+
+				appendStringInfo(buf, "prevblkno %u, nextblkno %u, ntups %d, is_primary %c",
+								 xlrec->prevblkno,
+								 xlrec->nextblkno,
+								 xlrec->ntups,
+								 xlrec->is_prim_bucket_same_wrt ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_DELETE:
+			{
+				xl_hash_delete *xlrec = (xl_hash_delete *) rec;
+
+				appendStringInfo(buf, "is_primary %c",
+								 xlrec->is_primary_bucket_page ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_UPDATE_META_PAGE:
+			{
+				xl_hash_update_meta_page *xlrec = (xl_hash_update_meta_page *) rec;
+
+				appendStringInfo(buf, "ntuples %g",
+								 xlrec->ntuples);
+				break;
+			}
+	}
 }
 
 const char *
 hash_identify(uint8 info)
 {
-	return NULL;
+	const char *id = NULL;
+
+	switch (info & ~XLR_INFO_MASK)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			id = "INIT_META_PAGE";
+			break;
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			id = "INIT_BITMAP_PAGE";
+			break;
+		case XLOG_HASH_INSERT:
+			id = "INSERT";
+			break;
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			id = "ADD_OVFL_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			id = "SPLIT_ALLOCATE_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_PAGE:
+			id = "SPLIT_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_COMPLETE:
+			id = "SPLIT_COMPLETE";
+			break;
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			id = "MOVE_PAGE_CONTENTS";
+			break;
+		case XLOG_HASH_SQUEEZE_PAGE:
+			id = "SQUEEZE_PAGE";
+			break;
+		case XLOG_HASH_DELETE:
+			id = "DELETE";
+			break;
+		case XLOG_HASH_CLEAR_GARBAGE:
+			id = "CLEAR_GARBAGE";
+			break;
+		case XLOG_HASH_UPDATE_META_PAGE:
+			id = "UPDATE_META_PAGE";
+			break;
+	}
+
+	return id;
 }
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 85817c6..b868fd8 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -499,11 +499,6 @@ DefineIndex(Oid relationId,
 	accessMethodForm = (Form_pg_am) GETSTRUCT(tuple);
 	amRoutine = GetIndexAmRoutine(accessMethodForm->amhandler);
 
-	if (strcmp(accessMethodName, "hash") == 0 &&
-		RelationNeedsWAL(rel))
-		ereport(WARNING,
-				(errmsg("hash indexes are not WAL-logged and their use is discouraged")));
-
 	if (stmt->unique && !amRoutine->amcanunique)
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index f2a07f2..e1ae4e3 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -615,6 +615,33 @@ PageGetFreeSpace(Page page)
 }
 
 /*
+ * PageGetFreeSpaceForMulTups
+ *		Returns the size of the free (allocatable) space on a page,
+ *		reduced by the space needed for multiple new line pointers.
+ *
+ * Note: this should usually only be used on index pages.  Use
+ * PageGetHeapFreeSpace on heap pages.
+ */
+Size
+PageGetFreeSpaceForMulTups(Page page, int ntups)
+{
+	int			space;
+
+	/*
+	 * Use signed arithmetic here so that we behave sensibly if pd_lower >
+	 * pd_upper.
+	 */
+	space = (int) ((PageHeader) page)->pd_upper -
+		(int) ((PageHeader) page)->pd_lower;
+
+	if (space < (int) (ntups * sizeof(ItemIdData)))
+		return 0;
+	space -= ntups * sizeof(ItemIdData);
+
+	return (Size) space;
+}
+
+/*
  * PageGetExactFreeSpace
  *		Returns the size of the free (allocatable) space on a page,
  *		without any consideration for adding/removing line pointers.
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 79e0b1f..a918c6d 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -5379,13 +5379,10 @@ RelationIdIsInInitFile(Oid relationId)
 /*
  * Tells whether any index for the relation is unlogged.
  *
- * Any index using the hash AM is implicitly unlogged.
- *
  * Note: There doesn't seem to be any way to have an unlogged index attached
- * to a permanent table except to create a hash index, but it seems best to
- * keep this general so that it returns sensible results even when they seem
- * obvious (like for an unlogged table) and to handle possible future unlogged
- * indexes on permanent tables.
+ * to a permanent table, but it seems best to keep this general so that it
+ * returns sensible results even when they seem obvious (like for an unlogged
+ * table) and to handle possible future unlogged indexes on permanent tables.
  */
 bool
 RelationHasUnloggedIndex(Relation rel)
@@ -5407,8 +5404,7 @@ RelationHasUnloggedIndex(Relation rel)
 			elog(ERROR, "cache lookup failed for relation %u", indexoid);
 		reltup = (Form_pg_class) GETSTRUCT(tp);
 
-		if (reltup->relpersistence == RELPERSISTENCE_UNLOGGED
-			|| reltup->relam == HASH_AM_OID)
+		if (reltup->relpersistence == RELPERSISTENCE_UNLOGGED)
 			result = true;
 
 		ReleaseSysCache(tp);
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index bbf822b..d1d30bc 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -184,6 +184,246 @@ typedef struct HashMetaPageData
 
 typedef HashMetaPageData *HashMetaPage;
 
+/* Number of buffers required for XLOG_HASH_SQUEEZE_PAGE operation */
+#define HASH_XLOG_FREE_OVFL_BUFS	6
+
+/*
+ * XLOG records for hash operations
+ */
+#define XLOG_HASH_INIT_META_PAGE	0x00		/* initialize the meta page */
+#define XLOG_HASH_INIT_BITMAP_PAGE	0x10		/* initialize the bitmap page */
+#define XLOG_HASH_INSERT		0x20	/* add index tuple without split */
+#define XLOG_HASH_ADD_OVFL_PAGE 0x30	/* add overflow page */
+#define XLOG_HASH_SPLIT_ALLOCATE_PAGE	0x40	/* allocate new page for split */
+#define XLOG_HASH_SPLIT_PAGE	0x50	/* split page */
+#define XLOG_HASH_SPLIT_COMPLETE	0x60		/* completion of split
+												 * operation */
+#define XLOG_HASH_MOVE_PAGE_CONTENTS	0x70	/* remove tuples from one page
+												 * and add to another page */
+#define XLOG_HASH_SQUEEZE_PAGE	0x80	/* add tuples to one of the previous
+										 * pages in chain and free the ovfl
+										 * page */
+#define XLOG_HASH_DELETE		0x90	/* delete index tuples from a page */
+#define XLOG_HASH_CLEAR_GARBAGE 0xA0	/* clear garbage flag in primary
+										 * bucket page after deleting tuples
+										 * that are moved due to split	*/
+#define XLOG_HASH_UPDATE_META_PAGE	0xB0		/* update meta page after
+												 * vacuum */
+
+
+/*
+ * xl_hash_split_allocpage flag values, 8 bits are available.
+ */
+#define XLH_SPLIT_META_UPDATE_MASKS		(1<<0)
+#define XLH_SPLIT_META_UPDATE_SPLITPOINT		(1<<1)
+
+/*
+ * Data to regenerate the meta-data page
+ */
+typedef struct xl_hash_metadata
+{
+	HashMetaPageData metadata;
+}	xl_hash_metadata;
+
+/*
+ * This is what we need to know about a HASH index create.
+ *
+ * Backup block 0: metapage
+ */
+typedef struct xl_hash_createidx
+{
+	double		num_tuples;
+	RegProcedure procid;
+	uint16		ffactor;
+}	xl_hash_createidx;
+#define SizeOfHashCreateIdx (offsetof(xl_hash_createidx, ffactor) + sizeof(uint16))
+
+/*
+ * This is what we need to know about simple (without split) insert.
+ *
+ * This data record is used for XLOG_HASH_INSERT
+ *
+ * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 1: metapage (xl_hash_metadata)
+ */
+typedef struct xl_hash_insert
+{
+	OffsetNumber offnum;
+}	xl_hash_insert;
+
+#define SizeOfHashInsert	(offsetof(xl_hash_insert, offnum) + sizeof(OffsetNumber))
+
+/*
+ * This is what we need to know about addition of overflow page.
+ *
+ * This data record is used for XLOG_HASH_ADD_OVFL_PAGE
+ *
+ * Backup Blk 0: newly allocated overflow page
+ * Backup Blk 1: page before new overflow page in the bucket chain
+ * Backup Blk 2: bitmap page
+ * Backup Blk 3: new bitmap page
+ * Backup Blk 4: metapage
+ */
+typedef struct xl_hash_addovflpage
+{
+	uint16		bmsize;
+	bool		bmpage_found;
+}	xl_hash_addovflpage;
+
+#define SizeOfHashAddOvflPage	\
+	(offsetof(xl_hash_addovflpage, bmpage_found) + sizeof(bool))
+
+/*
+ * This is what we need to know about allocating a page for split.
+ *
+ * This data record is used for XLOG_HASH_SPLIT_ALLOCATE_PAGE
+ *
+ * Backup Blk 0: page for old bucket
+ * Backup Blk 1: page for new bucket
+ * Backup Blk 2: metapage
+ */
+typedef struct xl_hash_split_allocpage
+{
+	uint32		new_bucket;
+	uint16		old_bucket_flag;
+	uint16		new_bucket_flag;
+	uint8		flags;
+}	xl_hash_split_allocpage;
+
+#define SizeOfHashSplitAllocPage	\
+	(offsetof(xl_hash_split_allocpage, flags) + sizeof(uint8))
+
+/*
+ * This is what we need to know about completing the split operation.
+ *
+ * This data record is used for XLOG_HASH_SPLIT_COMPLETE
+ *
+ * Backup Blk 0: page for old bucket
+ * Backup Blk 1: page for new bucket
+ */
+typedef struct xl_hash_split_complete
+{
+	uint16		old_bucket_flag;
+	uint16		new_bucket_flag;
+}	xl_hash_split_complete;
+
+#define SizeOfHashSplitComplete \
+	(offsetof(xl_hash_split_complete, new_bucket_flag) + sizeof(uint16))
+
+/*
+ * This is what we need to know about move page contents required during
+ * squeeze operation.
+ *
+ * This data record is used for XLOG_HASH_MOVE_PAGE_CONTENTS
+ *
+ * Backup Blk 0: bucket page
+ * Backup Blk 1: page containing moved tuples
+ * Backup Blk 2: page from which tuples will be removed
+ */
+typedef struct xl_hash_move_page_contents
+{
+	uint16		ntups;
+	bool		is_prim_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is same as
+												 * primary bucket page */
+}	xl_hash_move_page_contents;
+
+#define SizeOfHashMovePageContents	\
+	(offsetof(xl_hash_move_page_contents, is_prim_bucket_same_wrt) + sizeof(bool))
+
+/*
+ * This is what we need to know about the squeeze page operation.
+ *
+ * This data record is used for XLOG_HASH_SQUEEZE_PAGE
+ *
+ * Backup Blk 0: page containing tuples moved from freed overflow page
+ * Backup Blk 1: freed overflow page
+ * Backup Blk 2: page previous to the freed overflow page
+ * Backup Blk 3: page next to the freed overflow page
+ * Backup Blk 4: bitmap page containing info of freed overflow page
+ * Backup Blk 5: meta page
+ */
+typedef struct xl_hash_squeeze_page
+{
+	BlockNumber prevblkno;
+	BlockNumber nextblkno;
+	uint16		ntups;
+	bool		is_prim_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is same as
+												 * primary bucket page */
+	bool		is_prev_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is the
+												 * page previous to the freed
+												 * overflow page */
+}	xl_hash_squeeze_page;
+
+#define SizeOfHashSqueezePage	\
+	(offsetof(xl_hash_squeeze_page, is_prev_bucket_same_wrt) + sizeof(bool))
+
+/*
+ * This is what we need to know about the deletion of index tuples from a page.
+ *
+ * This data record is used for XLOG_HASH_DELETE
+ *
+ * Backup Blk 0: primary bucket page
+ * Backup Blk 1: page from which tuples are deleted
+ */
+typedef struct xl_hash_delete
+{
+	bool		is_primary_bucket_page; /* TRUE if the operation is for
+										 * primary bucket page */
+}	xl_hash_delete;
+
+#define SizeOfHashDelete	(offsetof(xl_hash_delete, is_primary_bucket_page) + sizeof(bool))
+
+/*
+ * This is what we need for metapage update operation.
+ *
+ * This data record is used for XLOG_HASH_UPDATE_META_PAGE
+ *
+ * Backup Blk 0: meta page
+ */
+typedef struct xl_hash_update_meta_page
+{
+	double		ntuples;
+}	xl_hash_update_meta_page;
+
+#define SizeOfHashUpdateMetaPage	\
+	(offsetof(xl_hash_update_meta_page, ntuples) + sizeof(double))
+
+/*
+ * This is what we need to initialize metapage.
+ *
+ * This data record is used for XLOG_HASH_INIT_META_PAGE
+ *
+ * Backup Blk 0: meta page
+ */
+typedef struct xl_hash_init_meta_page
+{
+	double		num_tuples;
+	RegProcedure procid;
+	uint16		ffactor;
+}	xl_hash_init_meta_page;
+
+#define SizeOfHashInitMetaPage		\
+	(offsetof(xl_hash_init_meta_page, ffactor) + sizeof(uint16))
+
+/*
+ * This is what we need to initialize bitmap page.
+ *
+ * This data record is used for XLOG_HASH_INIT_BITMAP_PAGE
+ *
+ * Backup Blk 0: bitmap page
+ * Backup Blk 1: meta page
+ */
+typedef struct xl_hash_init_bitmap_page
+{
+	uint16		bmsize;
+}	xl_hash_init_bitmap_page;
+
+#define SizeOfHashInitBitmapPage	\
+	(offsetof(xl_hash_init_bitmap_page, bmsize) + sizeof(uint16))
+
 /*
  * Maximum size of a hash index item (it's okay to have only one per page)
  */
@@ -313,13 +553,16 @@ extern Datum hash_uint32(uint32 k);
 extern void _hash_doinsert(Relation rel, IndexTuple itup);
 extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
 			   Size itemsize, IndexTuple itup);
+extern void _hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
+					OffsetNumber *itup_offsets, uint16 nitups);
 
 /* hashovfl.c */
 extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin);
-extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf,
-				   BlockNumber bucket_blkno, BufferAccessStrategy bstrategy);
-extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
-				 BlockNumber blkno, ForkNumber forkNum);
+extern BlockNumber _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
+				   Buffer wbuf, IndexTuple *itups, OffsetNumber *itup_offsets,
+				   Size *tups_size, uint16 nitups,
+				   BufferAccessStrategy bstrategy);
+extern void _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage);
 extern void _hash_squeezebucket(Relation rel,
 					Bucket bucket, BlockNumber bucket_blkno,
 					Buffer bucket_buf,
@@ -331,6 +574,8 @@ extern Buffer _hash_getbuf(Relation rel, BlockNumber blkno,
 extern Buffer _hash_getbuf_with_condlock_cleanup(Relation rel,
 								   BlockNumber blkno, int flags);
 extern Buffer _hash_getinitbuf(Relation rel, BlockNumber blkno);
+extern void _hash_initbuf(Buffer buf, uint32 num_bucket, uint32 flag,
+			  bool initpage);
 extern Buffer _hash_getnewbuf(Relation rel, BlockNumber blkno,
 				ForkNumber forkNum);
 extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
@@ -339,11 +584,12 @@ extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
 extern void _hash_relbuf(Relation rel, Buffer buf);
 extern void _hash_dropbuf(Relation rel, Buffer buf);
 extern void _hash_dropscanbuf(Relation rel, HashScanOpaque so);
-extern void _hash_wrtbuf(Relation rel, Buffer buf);
 extern void _hash_chgbufaccess(Relation rel, Buffer buf, int from_access,
 				   int to_access);
-extern uint32 _hash_metapinit(Relation rel, double num_tuples,
-				ForkNumber forkNum);
+extern uint32 _hash_init(Relation rel, double num_tuples,
+		   ForkNumber forkNum);
+extern void _hash_init_metabuffer(Buffer buf, double num_tuples,
+					  RegProcedure procid, uint16 ffactor, bool initpage);
 extern void _hash_pageinit(Page page, Size size);
 extern void _hash_expandtable(Relation rel, Buffer metabuf);
 extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 15cebfc..c1ac1d9 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -423,6 +423,7 @@ extern Page PageGetTempPageCopySpecial(Page page);
 extern void PageRestoreTempPage(Page tempPage, Page oldPage);
 extern void PageRepairFragmentation(Page page);
 extern Size PageGetFreeSpace(Page page);
+extern Size PageGetFreeSpaceForMulTups(Page page, int ntups);
 extern Size PageGetExactFreeSpace(Page page);
 extern Size PageGetHeapFreeSpace(Page page);
 extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 76593e1..80089b9 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -2335,13 +2335,9 @@ Options: fastupdate=on, gin_pending_list_limit=128
 -- HASH
 --
 CREATE INDEX hash_i4_index ON hash_i4_heap USING hash (random int4_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_name_index ON hash_name_heap USING hash (random name_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_txt_index ON hash_txt_heap USING hash (random text_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_f8_index ON hash_f8_heap USING hash (random float8_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE UNLOGGED TABLE unlogged_hash_table (id int4);
 CREATE INDEX unlogged_hash_index ON unlogged_hash_table USING hash (id int4_ops);
 DROP TABLE unlogged_hash_table;
@@ -2350,7 +2346,6 @@ DROP TABLE unlogged_hash_table;
 -- maintenance_work_mem setting and fillfactor:
 SET maintenance_work_mem = '1MB';
 CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 EXPLAIN (COSTS OFF)
 SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
                       QUERY PLAN                       
diff --git a/src/test/regress/expected/enum.out b/src/test/regress/expected/enum.out
index d4a45a3..b1fd243 100644
--- a/src/test/regress/expected/enum.out
+++ b/src/test/regress/expected/enum.out
@@ -383,7 +383,6 @@ DROP INDEX enumtest_btree;
 -- Hash index / opclass with the = operator
 --
 CREATE INDEX enumtest_hash ON enumtest USING hash (col);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 SELECT * FROM enumtest WHERE col = 'orange';
   col   
 --------
diff --git a/src/test/regress/expected/macaddr.out b/src/test/regress/expected/macaddr.out
index e84ff5f..151f9ce 100644
--- a/src/test/regress/expected/macaddr.out
+++ b/src/test/regress/expected/macaddr.out
@@ -41,7 +41,6 @@ SELECT * FROM macaddr_data;
 
 CREATE INDEX macaddr_data_btree ON macaddr_data USING btree (b);
 CREATE INDEX macaddr_data_hash ON macaddr_data USING hash (b);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 SELECT a, b, trunc(b) FROM macaddr_data ORDER BY 2, 1;
  a  |         b         |       trunc       
 ----+-------------------+-------------------
diff --git a/src/test/regress/expected/replica_identity.out b/src/test/regress/expected/replica_identity.out
index 39a60a5..62f2ad7 100644
--- a/src/test/regress/expected/replica_identity.out
+++ b/src/test/regress/expected/replica_identity.out
@@ -12,7 +12,6 @@ CREATE UNIQUE INDEX test_replica_identity_keyab_key ON test_replica_identity (ke
 CREATE UNIQUE INDEX test_replica_identity_oid_idx ON test_replica_identity (oid);
 CREATE UNIQUE INDEX test_replica_identity_nonkey ON test_replica_identity (keya, nonkey);
 CREATE INDEX test_replica_identity_hash ON test_replica_identity USING hash (nonkey);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE UNIQUE INDEX test_replica_identity_expr ON test_replica_identity (keya, keyb, (3));
 CREATE UNIQUE INDEX test_replica_identity_partial ON test_replica_identity (keya, keyb) WHERE keyb != '3';
 -- default is 'd'/DEFAULT for user created tables
diff --git a/src/test/regress/expected/uuid.out b/src/test/regress/expected/uuid.out
index 59cb1e0..d907519 100644
--- a/src/test/regress/expected/uuid.out
+++ b/src/test/regress/expected/uuid.out
@@ -114,7 +114,6 @@ SELECT COUNT(*) FROM guid1 WHERE guid_field >= '22222222-2222-2222-2222-22222222
 -- btree and hash index creation test
 CREATE INDEX guid1_btree ON guid1 USING BTREE (guid_field);
 CREATE INDEX guid1_hash  ON guid1 USING HASH  (guid_field);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 -- unique index test
 CREATE UNIQUE INDEX guid1_unique_BTREE ON guid1 USING BTREE (guid_field);
 -- should fail
#23Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#22)
Re: Write Ahead Logging for Hash Indexes

On Wed, Sep 7, 2016 at 3:28 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Okay, I have fixed this issue as explained above. Apart from that, I
have fixed another issue reported by Mark Kirkwood upthread and few
other issues found during internal testing by Ashutosh Sharma.

Forgot to mention that you need to apply the latest concurrent hash
index patch [1] before applying this patch.

[1]: /messages/by-id/CAA4eK1J6b8O4PcEPqRxNYbLVbfToNMJEEm+qn0jZX31-obXrJw@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#24Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Amit Kapila (#22)
Re: Write Ahead Logging for Hash Indexes

Thanks to Ashutosh Sharma for doing the testing of the patch and
helping me in analyzing some of the above issues.

Hi All,

I would like to summarize the test cases that I have executed for
validating WAL logging in hash indexes.

1) I have mainly run the pgbench test with a read-write workload at a
scale factor of 1000 and various client counts (16, 64 and 128) for
durations of 30 minutes, 1 hour and 24 hours. I executed this test on a
highly configured power2 machine with 128 cores and 512GB of RAM. I ran
the test case both with and without a replication setup.

Please note that I have changed the schema of the pgbench tables created
during the initialisation phase.

The new schema of pgbench tables looks as shown below on both master
and standby:

postgres=# \d pgbench_accounts
Table "public.pgbench_accounts"
Column | Type | Modifiers
----------+---------------+-----------
aid | integer | not null
bid | integer |
abalance | integer |
filler | character(84) |
Indexes:
"pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
"pgbench_accounts_bid" hash (bid)

postgres=# \d pgbench_history
Table "public.pgbench_history"
Column | Type | Modifiers
--------+-----------------------------+-----------
tid | integer |
bid | integer |
aid | integer |
delta | integer |
mtime | timestamp without time zone |
filler | character(22) |
Indexes:
"pgbench_history_bid" hash (bid)

2) I have also made use of the following tools for analyzing the hash
index tables created on master and standby.

a) pgstattuple: Using this tool, I could verify tuple-level statistics,
which include the index's physical length, dead tuple percentage and
other information, on both master and standby. However, I saw the error
message below when using this tool on one of the hash index tables on
both the master and standby servers.

postgres=# SELECT * FROM pgstattuple('pgbench_history_bid');
ERROR: index "pgbench_history_bid" contains unexpected zero page at
block 2482297

To investigate this, I used the pageinspect contrib module and found
that the above block contains an uninitialised page. This is quite
possible in a hash index, because buckets are allocated in powers of 2
and some of the allocated bucket pages may still be uninitialised.
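
For example, a quick check along these lines (assuming the pageinspect
extension is installed) shows an all-zeros page header for that block:

CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT * FROM page_header(get_raw_page('pgbench_history_bid', 2482297));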

b) pg_filedump: Using this tool, I took a dump of all the hash index
tables on master and standby and compared the dump files to verify
whether there is any difference between the hash index pages on the two
servers.

In short, this is all I did to verify the patch for WAL logging in hash indexes.

I would like to thank Amit and Robert for their valuable inputs during
the hash index testing phase.

I am also planning to verify WAL logging in hash indexes using Kuntal's
patch for WAL consistency checking once it is stabilized.

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com


#25Mark Kirkwood
mark.kirkwood@catalyst.net.nz
In reply to: Amit Kapila (#22)
Re: Write Ahead Logging for Hash Indexes

On 07/09/16 21:58, Amit Kapila wrote:

On Wed, Aug 24, 2016 at 10:32 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Tue, Aug 23, 2016 at 10:05 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Wed, Aug 24, 2016 at 2:37 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

After an intentionally created crash, I get an Assert triggering:

TRAP: FailedAssertion("!(((freep)[(bitmapbit)/32] &
(1<<((bitmapbit)%32))))", File: "hashovfl.c", Line: 553)

freep[0] is zero and bitmapbit is 16.

Here what is happening is that when it tries to clear the bitmapbit,
it expects it to be set. Now, I think the reason for why it didn't
find the bit as set could be that after the new overflow page is added
and the bit corresponding to it is set, you might have crashed the
system and the replay would not have set the bit. Then while freeing
the overflow page it can hit the Assert as mentioned by you. I think
the problem here could be that I am using REGBUF_STANDARD to log the
bitmap page updates which seems to be causing the issue. As bitmap
page doesn't follow the standard page layout, it would have omitted
the actual contents while taking full page image and then during
replay, it would not have set the bit, because page doesn't need REDO.
I think here the fix is to use REGBUF_NO_IMAGE as we use for vm
buffers.

If you can send me the detailed steps for how you have produced the
problem, then I can verify after fixing whether you are seeing the
same problem or something else.

The test is rather awkward, it might be easier to just have me test it.

Okay, I have fixed this issue as explained above. Apart from that, I
have fixed another issue reported by Mark Kirkwood upthread and few
other issues found during internal testing by Ashutosh Sharma.

The locking issue reported by Mark and Ashutosh is that the patch
didn't maintain the locking order while adding overflow page as it
maintains in other write operations (lock the bucket pages first and
then metapage to perform the write operation). I have added the
comments in _hash_addovflpage() to explain the locking order used in
modified patch.

During stress testing with pgbench using master-standby setup, we
found an issue which indicates that WAL replay machinery doesn't
expect completely zeroed pages (See explanation of RBM_NORMAL mode
atop XLogReadBufferExtended). Previously before freeing the overflow
page we were zeroing it, now I have changed it to just initialize the
page such that the page will be empty.

Apart from above, I have added support for old snapshot threshold in
the hash index code.

Thanks to Ashutosh Sharma for doing the testing of the patch and
helping me in analyzing some of the above issues.

I forgot to mention in my initial mail that Robert and I had some
off-list discussions about the design of this patch, many thanks to
him for providing inputs.

Repeating my tests with these new patches applied points to the hang
issue being solved. I tested several 10 minute runs (any of which was
enough to elicit the hang previously). I'll do some longer ones, but
looks good!

regards

Mark


#26Amit Kapila
amit.kapila16@gmail.com
In reply to: Mark Kirkwood (#25)
Re: Write Ahead Logging for Hash Indexes

On Thu, Sep 8, 2016 at 10:02 AM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:

Repeating my tests with these new patches applied points to the hang issue
being solved. I tested several 10 minute runs (any of which was enough to
elicit the hang previously). I'll do some longer ones, but looks good!

Thanks for verifying the issue and doing further testing of the patch.
It is really helpful.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#27Jeff Janes
jeff.janes@gmail.com
In reply to: Ashutosh Sharma (#24)
Re: Write Ahead Logging for Hash Indexes

On Wed, Sep 7, 2016 at 3:29 AM, Ashutosh Sharma <ashu.coek88@gmail.com>
wrote:

Thanks to Ashutosh Sharma for doing the testing of the patch and
helping me in analyzing some of the above issues.

Hi All,

I would like to summarize the test-cases that i have executed for
validating WAL logging in hash index feature.

1) I have mainly ran the pgbench test with read-write workload at the
scale factor of 1000 and various client counts like 16, 64 and 128 for
time duration of 30 mins, 1 hr and 24 hrs. I have executed this test
on highly configured power2 machine with 128 cores and 512GB of RAM. I
ran the test-case both with and without the replication setup.

Please note that i have changed the schema of pgbench tables created
during initialisation phase.

The new schema of pgbench tables looks as shown below on both master
and standby:

postgres=# \d pgbench_accounts
Table "public.pgbench_accounts"
Column | Type | Modifiers
----------+---------------+-----------
aid | integer | not null
bid | integer |
abalance | integer |
filler | character(84) |
Indexes:
"pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
"pgbench_accounts_bid" hash (bid)

postgres=# \d pgbench_history
Table "public.pgbench_history"
Column | Type | Modifiers
--------+-----------------------------+-----------
tid | integer |
bid | integer |
aid | integer |
delta | integer |
mtime | timestamp without time zone |
filler | character(22) |
Indexes:
"pgbench_history_bid" hash (bid)

Hi Ashutosh,

This schema will test the maintenance of hash indexes, but it will never
use hash indexes for searching, so it limits the amount of test coverage
you will get. While searching shouldn't generate novel types of WAL
records (that I know of), it will generate locking and timing issues that
might uncover bugs (if there are any left to uncover, of course).

I would drop the primary key on pgbench_accounts and replace it with a hash
index and test it that way (except I don't have a 128 core machine at my
disposal, so really I am suggesting that you do this...)

The lack of primary key and the non-uniqueness of the hash index should not
be an operational problem, because the built in pgbench runs never attempt
to violate the constraints anyway.

In fact, I'd replace all of the indexes on the rest of the pgbench tables
with hash indexes, too, just for additional testing.
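
Something along these lines should set that up after a fresh
initialisation (an untested sketch, relying on pgbench's default
constraint names):

ALTER TABLE pgbench_accounts DROP CONSTRAINT pgbench_accounts_pkey;
ALTER TABLE pgbench_branches DROP CONSTRAINT pgbench_branches_pkey;
ALTER TABLE pgbench_tellers DROP CONSTRAINT pgbench_tellers_pkey;
CREATE INDEX pgbench_accounts_pkey ON pgbench_accounts USING hash (aid);
CREATE INDEX pgbench_branches_pkey ON pgbench_branches USING hash (bid);
CREATE INDEX pgbench_tellers_pkey ON pgbench_tellers USING hash (tid);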

I plan to do testing using my own testing harness after changing it to
insert a lot of dummy tuples (ones with negative values in the pseudo-pk
column, which are never queried by the core part of the harness) and
deleting them at random intervals. I think that none of pgbench's built in
tests are likely to give the bucket splitting and squeezing code very much
exercise.
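
Roughly, the idea is something like this (purely illustrative; the table
and index names are placeholders, not the actual harness schema):

CREATE TABLE t (k int);
CREATE INDEX t_k_hash ON t USING hash (k);
INSERT INTO t SELECT -g FROM generate_series(1, 100000) AS g;
DELETE FROM t WHERE k < 0 AND random() < 0.5;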

Is there a way to gather statistics on how many of each type of WAL record
are actually getting sent over the replication link? The only way I can
think of is to turn on WAL archiving as well as replication, then using
pg_xlogdump to gather the stats.

I've run my original test for a while now and have not seen any problems.
But I realized I forgot to compile with --enable-cassert, so I will have to
redo it to make sure the assertion failures have been fixed. In my
original testing I did very rarely get a deadlock (or some kind of hang),
and I haven't seen that again so far. It was probably the same source as
the one Mark observed, and so the same fix.

Cheers,

Jeff

#28Mark Kirkwood
mark.kirkwood@catalyst.net.nz
In reply to: Jeff Janes (#27)
Re: Write Ahead Logging for Hash Indexes

On 09/09/16 07:09, Jeff Janes wrote:

On Wed, Sep 7, 2016 at 3:29 AM, Ashutosh Sharma <ashu.coek88@gmail.com
<mailto:ashu.coek88@gmail.com>> wrote:

Thanks to Ashutosh Sharma for doing the testing of the patch and
helping me in analyzing some of the above issues.

Hi All,

I would like to summarize the test-cases that i have executed for
validating WAL logging in hash index feature.

1) I have mainly ran the pgbench test with read-write workload at the
scale factor of 1000 and various client counts like 16, 64 and 128 for
time duration of 30 mins, 1 hr and 24 hrs. I have executed this test
on highly configured power2 machine with 128 cores and 512GB of RAM. I
ran the test-case both with and without the replication setup.

Please note that i have changed the schema of pgbench tables created
during initialisation phase.

The new schema of pgbench tables looks as shown below on both master
and standby:

postgres=# \d pgbench_accounts
Table "public.pgbench_accounts"
Column | Type | Modifiers
----------+---------------+-----------
aid | integer | not null
bid | integer |
abalance | integer |
filler | character(84) |
Indexes:
"pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
"pgbench_accounts_bid" hash (bid)

postgres=# \d pgbench_history
Table "public.pgbench_history"
Column | Type | Modifiers
--------+-----------------------------+-----------
tid | integer |
bid | integer |
aid | integer |
delta | integer |
mtime | timestamp without time zone |
filler | character(22) |
Indexes:
"pgbench_history_bid" hash (bid)

Hi Ashutosh,

This schema will test the maintenance of hash indexes, but it will
never use hash indexes for searching, so it limits the amount of test
coverage you will get. While searching shouldn't generate novel types
of WAL records (that I know of), it will generate locking and timing
issues that might uncover bugs (if there are any left to uncover, of
course).

I would drop the primary key on pgbench_accounts and replace it with a
hash index and test it that way (except I don't have a 128 core
machine at my disposal, so really I am suggesting that you do this...)

The lack of primary key and the non-uniqueness of the hash index
should not be an operational problem, because the built in pgbench
runs never attempt to violate the constraints anyway.

In fact, I'd replace all of the indexes on the rest of the pgbench
tables with hash indexes, too, just for additional testing.

I plan to do testing using my own testing harness after changing it to
insert a lot of dummy tuples (ones with negative values in the
pseudo-pk column, which are never queried by the core part of the
harness) and deleting them at random intervals. I think that none of
pgbench's built in tests are likely to give the bucket splitting and
squeezing code very much exercise.

Is there a way to gather statistics on how many of each type of WAL
record are actually getting sent over the replication link? The only
way I can think of is to turn on wal archving as well as replication,
then using pg_xlogdump to gather the stats.

I've run my original test for a while now and have not seen any
problems. But I realized I forgot to compile with enable-casserts, to
I will have to redo it to make sure the assertion failures have been
fixed. In my original testing I did very rarely get a deadlock (or
some kind of hang), and I haven't seen that again so far. It was
probably the same source as the one Mark observed, and so the same fix.

Cheers,

Jeff

Yeah, good suggestion about replacing (essentially) all the indexes with
hash ones and testing. I did some short runs with this type of schema
yesterday (actually to get a feel for whether hash performance vs btree was
comparable - it does seem to be) - but probably longer ones with higher
concurrency (as high as I can manage on a single socket i7 anyway) is a
good plan. If Ashutosh has access to seriously large numbers of cores
then that is even better :-)

Cheers

Mark


#29Amit Kapila
amit.kapila16@gmail.com
In reply to: Jeff Janes (#27)
Re: Write Ahead Logging for Hash Indexes

On Fri, Sep 9, 2016 at 12:39 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

I plan to do testing using my own testing harness after changing it to
insert a lot of dummy tuples (ones with negative values in the pseudo-pk
column, which are never queried by the core part of the harness) and
deleting them at random intervals. I think that none of pgbench's built in
tests are likely to give the bucket splitting and squeezing code very much
exercise.

Hash index tests [1] written by Mithun do cover part of that code,
and we have done testing by extending those tests to cover the
splitting and squeezing parts of the code.

Is there a way to gather statistics on how many of each type of WAL record
are actually getting sent over the replication link? The only way I can
think of is to turn on wal archving as well as replication, then using
pg_xlogdump to gather the stats.

Sounds sensible, but what do you want to know by getting the number of
each type of WAL record? I understand it is important to cover all
the WAL records for hash indexes (and I think Ashutosh has done that
during his tests [2]), but maybe sending the same record multiple times
could further strengthen the validation.

I've run my original test for a while now and have not seen any problems.
But I realized I forgot to compile with enable-casserts, to I will have to
redo it to make sure the assertion failures have been fixed. In my original
testing I did very rarely get a deadlock (or some kind of hang), and I
haven't seen that again so far. It was probably the same source as the one
Mark observed, and so the same fix.

Thanks for the verification.

[1]: https://commitfest.postgresql.org/10/716/
[2]: /messages/by-id/CAE9k0PkPumi4iWFuD+jHHkpcxn531=DJ8uH0dctsvF+daZY6yQ@mail.gmail.com
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#30Mark Kirkwood
mark.kirkwood@catalyst.net.nz
In reply to: Mark Kirkwood (#28)
Re: Write Ahead Logging for Hash Indexes

On 09/09/16 14:50, Mark Kirkwood wrote:

Yeah, good suggestion about replacing (essentially) all the indexes
with hash ones and testing. I did some short runs with this type of
schema yesterday (actually to get a feel for if hash performance vs
btree was compareable - does seem tp be) - but probably longer ones
with higher concurrency (as high as I can manage on a single socket i7
anyway) is a good plan. If Ashutosh has access to seriously large
numbers of cores then that is even better :-)

I managed to find a slightly bigger server (used our openstack cloud to
run a 16 cpu vm). With the schema modified as follows:

bench=# \d pgbench_accounts
Table "public.pgbench_accounts"
Column | Type | Modifiers
----------+---------------+-----------
aid | integer | not null
bid | integer |
abalance | integer |
filler | character(84) |
Indexes:
"pgbench_accounts_pkey" hash (aid)

bench=# \d pgbench_branches
Table "public.pgbench_branches"
Column | Type | Modifiers
----------+---------------+-----------
bid | integer | not null
bbalance | integer |
filler | character(88) |
Indexes:
"pgbench_branches_pkey" hash (bid)

bench=# \d pgbench_tellers
Table "public.pgbench_tellers"
Column | Type | Modifiers
----------+---------------+-----------
tid | integer | not null
bid | integer |
tbalance | integer |
filler | character(84) |
Indexes:
"pgbench_tellers_pkey" hash (tid)

bench=# \d pgbench_history
Table "public.pgbench_history"
Column | Type | Modifiers
--------+-----------------------------+-----------
tid | integer |
bid | integer |
aid | integer |
delta | integer |
mtime | timestamp without time zone |
filler | character(22) |
Indexes:
"pgbench_history_pkey" hash (bid)

I performed several 10-hour runs on a size 100 database using 32 and 64
clients. For the last run I rebuilt with assertions enabled. No hangs or
assertion failures.

regards

Mark


#31Amit Kapila
amit.kapila16@gmail.com
In reply to: Mark Kirkwood (#30)
Re: Write Ahead Logging for Hash Indexes

On Sun, Sep 11, 2016 at 4:10 AM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:

performed several 10 hour runs on size 100 database using 32 and 64 clients.
For the last run I rebuilt with assertions enabled. No hangs or assertion
failures.

Thanks for verification. Do you think we can do some read-only
workload benchmarking using this server? If yes, then probably you
can use concurrent hash index patch [1] and cache the metapage patch
[2] (I think Mithun needs to rebase his patch) to do so.

[1]: /messages/by-id/CAA4eK1J6b8O4PcEPqRxNYbLVbfToNMJEEm+qn0jZX31-obXrJw@mail.gmail.com
[2]: /messages/by-id/CAD__OuhJ29CeBif_fLGe4t9Vj_-cFXBwCXhjO+D_16TXbemY+g@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#32Mark Kirkwood
mark.kirkwood@catalyst.net.nz
In reply to: Amit Kapila (#31)
Re: Write Ahead Logging for Hash Indexes

On 11/09/16 17:01, Amit Kapila wrote:

On Sun, Sep 11, 2016 at 4:10 AM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:

performed several 10 hour runs on size 100 database using 32 and 64 clients.
For the last run I rebuilt with assertions enabled. No hangs or assertion
failures.

Thanks for verification. Do you think we can do some read-only
workload benchmarking using this server? If yes, then probably you
can use concurrent hash index patch [1] and cache the metapage patch
[2] (I think Mithun needs to rebase his patch) to do so.

[1] - /messages/by-id/CAA4eK1J6b8O4PcEPqRxNYbLVbfToNMJEEm+qn0jZX31-obXrJw@mail.gmail.com
[2] - /messages/by-id/CAD__OuhJ29CeBif_fLGe4t9Vj_-cFXBwCXhjO+D_16TXbemY+g@mail.gmail.com

I can do - are we checking for hangs/assertions or comparing
patched vs unpatched performance (for the metapage patch)?

regards

Mark


#33Mark Kirkwood
mark.kirkwood@catalyst.net.nz
In reply to: Mark Kirkwood (#32)
Re: Write Ahead Logging for Hash Indexes

On 11/09/16 19:16, Mark Kirkwood wrote:

On 11/09/16 17:01, Amit Kapila wrote:

...Do you think we can do some read-only
workload benchmarking using this server? If yes, then probably you
can use concurrent hash index patch [1] and cache the metapage patch
[2] (I think Mithun needs to rebase his patch) to do so.

[1] -
/messages/by-id/CAA4eK1J6b8O4PcEPqRxNYbLVbfToNMJEEm+qn0jZX31-obXrJw@mail.gmail.com
[2] -
/messages/by-id/CAD__OuhJ29CeBif_fLGe4t9Vj_-cFXBwCXhjO+D_16TXbemY+g@mail.gmail.com

I can do - are we checking checking for hangs/assertions or comparing
patched vs unpatched performance (for the metapage patch)?

So, assuming the latter - testing performance with and without the
metapage patch:

For my 1st runs:

- cpus 16, ram 16G
- size 100, clients 32

I'm seeing no difference in performance for a read-only (-S) pgbench
workload (with everything using hash indexes). I guess that's not that
surprising as the db fits in RAM (1.6G and we have 16G). So I'll retry with a
bigger dataset (suspect size 2000 is needed).

regards

Mark

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#34Jeff Janes
jeff.janes@gmail.com
In reply to: Jeff Janes (#27)
3 attachment(s)
Re: Write Ahead Logging for Hash Indexes

On Thu, Sep 8, 2016 at 12:09 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

I plan to do testing using my own testing harness after changing it to
insert a lot of dummy tuples (ones with negative values in the pseudo-pk
column, which are never queried by the core part of the harness) and
deleting them at random intervals. I think that none of pgbench's built in
tests are likely to give the bucket splitting and squeezing code very much
exercise.

I've implemented this, by adding lines 197 through 202 to the count.pl
script. (I'm reattaching the test case)

Within a few minutes of testing, I start getting Errors like these:

29236 UPDATE XX000 2016-09-11 17:21:25.893 PDT:ERROR: buffer 2762 is not
owned by resource owner Portal
29236 UPDATE XX000 2016-09-11 17:21:25.893 PDT:STATEMENT: update foo set
count=count+1 where index=$1

In one test, I also got an error from my test harness itself indicating
tuples are transiently missing from the index, starting an hour into a test:

child abnormal exit update did not update 1 row: key 9555 updated 0E0 at
count.pl line 194.\n at count.pl line 208.
child abnormal exit update did not update 1 row: key 8870 updated 0E0 at
count.pl line 194.\n at count.pl line 208.
child abnormal exit update did not update 1 row: key 8453 updated 0E0 at
count.pl line 194.\n at count.pl line 208.

Those key values should always find exactly one row to update.

If the tuples were permanently missing from the index, I would keep getting
errors on the same key values very frequently. But I don't get that, the
errors remain infrequent and are on different value each time, so I think
the tuples are in the index but the scan somehow misses them, either while
the bucket is being split or while it is being squeezed.

This is on a build without enable-asserts.

Any ideas on how best to go about investigating this?

Cheers,

Jeff

Attachments:

count.pl (application/octet-stream)
crash_REL10.patch (application/octet-stream)
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
new file mode 100644
index 2f7e645..3d7e255
*** a/src/backend/access/transam/varsup.c
--- b/src/backend/access/transam/varsup.c
***************
*** 33,38 ****
--- 33,41 ----
  /* pointer to "variable cache" in shared memory (set up by shmem.c) */
  VariableCache ShmemVariableCache = NULL;
  
+ int JJ_xid=0;
+ extern int	JJ_vac;
+ 
  
  /*
   * Allocate the next XID for a new transaction or subtransaction.
*************** GetNewTransactionId(bool isSubXact)
*** 168,173 ****
--- 171,181 ----
  	 *
  	 * Extend pg_subtrans and pg_commit_ts too.
  	 */
+ 	{
+ 	int		incr;
+ 	for (incr=0; incr <=JJ_xid; incr++)
+ 	{
+ 	xid = ShmemVariableCache->nextXid;
  	ExtendCLOG(xid);
  	ExtendCommitTs(xid);
  	ExtendSUBTRANS(xid);
*************** GetNewTransactionId(bool isSubXact)
*** 179,184 ****
--- 187,194 ----
  	 * more XIDs until there is CLOG space for them.
  	 */
  	TransactionIdAdvance(ShmemVariableCache->nextXid);
+ 	}
+ 	}
  
  	/*
  	 * We must store the new XID into the shared ProcArray before releasing
*************** SetTransactionIdLimit(TransactionId olde
*** 342,349 ****
  	LWLockRelease(XidGenLock);
  
  	/* Log the info */
! 	ereport(DEBUG1,
! 			(errmsg("transaction ID wrap limit is %u, limited by database with OID %u",
  					xidWrapLimit, oldest_datoid)));
  
  	/*
--- 352,359 ----
  	LWLockRelease(XidGenLock);
  
  	/* Log the info */
! 	ereport(LOG,
! 			(errmsg("JJ transaction ID wrap limit is %u, limited by database with OID %u",
  					xidWrapLimit, oldest_datoid)));
  
  	/*
*************** ForceTransactionIdLimitUpdate(void)
*** 420,425 ****
--- 430,446 ----
  	oldestXidDB = ShmemVariableCache->oldestXidDB;
  	LWLockRelease(XidGenLock);
  
+ 	if (JJ_vac) {
+ 	elog(LOG,"JJ ForceTransactionIdLimitUpdate in %d: !normal %d, !valid %d, follows %d (%u, %u), !exists %d", MyDatabaseId,
+ 		!TransactionIdIsNormal(oldestXid),
+ 		!TransactionIdIsValid(xidVacLimit),
+ 		TransactionIdFollowsOrEquals(nextXid, xidVacLimit),
+ 		nextXid, xidVacLimit,
+  		!SearchSysCacheExists1(DATABASEOID, ObjectIdGetDatum(oldestXidDB))
+ 	);
+ 	};
+ 
+ 
  	if (!TransactionIdIsNormal(oldestXid))
  		return true;			/* shouldn't happen, but just in case */
  	if (!TransactionIdIsValid(xidVacLimit))
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
new file mode 100644
index 2189c22..d360bc6
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
*************** BootStrapXLOG(void)
*** 4822,4827 ****
--- 4822,4828 ----
  	ShmemVariableCache->nextOid = checkPoint.nextOid;
  	ShmemVariableCache->oidCount = 0;
  	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
+ 	//elog(LOG,"JJ SetTransactionIDLimit %d", checkPoint.oldestXid);
  	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
  	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
  	SetCommitTsLimit(InvalidTransactionId, InvalidTransactionId);
*************** StartupXLOG(void)
*** 6415,6420 ****
--- 6416,6422 ----
  	ShmemVariableCache->nextOid = checkPoint.nextOid;
  	ShmemVariableCache->oidCount = 0;
  	MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset);
+ 	//elog(LOG,"JJ SetTransactionIDLimit %d", checkPoint.oldestXid);
  	SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
  	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
  	SetCommitTsLimit(checkPoint.oldestCommitTsXid,
*************** xlog_redo(XLogReaderState *record)
*** 9329,9334 ****
--- 9331,9337 ----
  
  		MultiXactAdvanceOldest(checkPoint.oldestMulti,
  							   checkPoint.oldestMultiDB);
+ 		//elog(LOG,"JJ SetTransactionIDLimit %d", checkPoint.oldestXid);
  		SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
  
  		/*
*************** xlog_redo(XLogReaderState *record)
*** 9426,9431 ****
--- 9429,9435 ----
  		 */
  		MultiXactAdvanceOldest(checkPoint.oldestMulti,
  							   checkPoint.oldestMultiDB);
+ 		//elog(LOG,"JJ maybe SetTransactionIDLimit %d", checkPoint.oldestXid);
  		if (TransactionIdPrecedes(ShmemVariableCache->oldestXid,
  								  checkPoint.oldestXid))
  			SetTransactionIdLimit(checkPoint.oldestXid,
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
new file mode 100644
index 58bbf55..08c4c85
*** a/src/backend/commands/vacuum.c
--- b/src/backend/commands/vacuum.c
*************** int			vacuum_freeze_min_age;
*** 58,63 ****
--- 58,65 ----
  int			vacuum_freeze_table_age;
  int			vacuum_multixact_freeze_min_age;
  int			vacuum_multixact_freeze_table_age;
+ int			JJ_vac=0;
+ 
  
  
  /* A few variables that don't seem worth passing around as parameters */
*************** vacuum_set_xid_limits(Relation rel,
*** 538,543 ****
--- 540,546 ----
  	}
  
  	*freezeLimit = limit;
+ 	if (JJ_vac) elog(LOG,"JJ freezeLimit %d", *freezeLimit);
  
  	/*
  	 * Compute the multixact age for which freezing is urgent.  This is
*************** vacuum_set_xid_limits(Relation rel,
*** 592,597 ****
--- 595,602 ----
  		 * VACUUM schedule, the nightly VACUUM gets a chance to freeze tuples
  		 * before anti-wraparound autovacuum is launched.
  		 */
+ 		if (JJ_vac) elog(LOG,"JJ freeze_min_age %d vacuum_freeze_table_age %d freeze_table_age %d ReadNew %d", freeze_min_age, 
+                            vacuum_freeze_table_age, freeze_table_age,ReadNewTransactionId());
  		freezetable = freeze_table_age;
  		if (freezetable < 0)
  			freezetable = vacuum_freeze_table_age;
*************** vac_update_datfrozenxid(void)
*** 1029,1034 ****
--- 1034,1040 ----
  	 * truncate pg_clog and/or pg_multixact.  Also do it if the shared
  	 * XID-wrap-limit info is stale, since this action will update that too.
  	 */
+ 	if (JJ_vac && dirty) elog(LOG,"JJ updating in %d without call to ForceTransactionIdLimitUpdate", MyDatabaseId);
  	if (dirty || ForceTransactionIdLimitUpdate())
  		vac_truncate_clog(newFrozenXid, newMinMulti,
  						  lastSaneFrozenXid, lastSaneMinMulti);
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
new file mode 100644
index b5fb325..6155516
*** a/src/backend/commands/vacuumlazy.c
--- b/src/backend/commands/vacuumlazy.c
***************
*** 64,69 ****
--- 64,70 ----
  #include "utils/tqual.h"
  
  
+ extern int JJ_vac;
  /*
   * Space/time tradeoff parameters: do these need to be user-tunable?
   *
*************** lazy_vacuum_rel(Relation onerel, int opt
*** 236,241 ****
--- 237,244 ----
  	if (options & VACOPT_DISABLE_PAGE_SKIPPING)
  		aggressive = true;
  
+ 	if (JJ_vac) elog(LOG,"JJ aggresive %d, relfrozenid %d", aggressive, onerel->rd_rel->relfrozenxid);
+ 
  	vacrelstats = (LVRelStats *) palloc0(sizeof(LVRelStats));
  
  	vacrelstats->old_rel_pages = onerel->rd_rel->relpages;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
new file mode 100644
index 1a92ca1..6b96885
*** a/src/backend/postmaster/autovacuum.c
--- b/src/backend/postmaster/autovacuum.c
*************** int			autovacuum_vac_cost_delay;
*** 122,127 ****
--- 122,128 ----
  int			autovacuum_vac_cost_limit;
  
  int			Log_autovacuum_min_duration = -1;
+ extern int  JJ_vac;
  
  /* how long to keep pgstat data in the launcher, in milliseconds */
  #define STATS_READ_DELAY 1000
*************** AutoVacWorkerMain(int argc, char *argv[]
*** 1635,1642 ****
  		InitPostgres(NULL, dbid, NULL, InvalidOid, dbname);
  		SetProcessingMode(NormalProcessing);
  		set_ps_display(dbname, false);
- 		ereport(DEBUG1,
- 				(errmsg("autovacuum: processing database \"%s\"", dbname)));
  
  		if (PostAuthDelay)
  			pg_usleep(PostAuthDelay * 1000000L);
--- 1636,1641 ----
*************** AutoVacWorkerMain(int argc, char *argv[]
*** 1644,1650 ****
--- 1643,1653 ----
  		/* And do an appropriate amount of work */
  		recentXid = ReadNewTransactionId();
  		recentMulti = ReadNextMultiXactId();
+ 		if (JJ_vac) ereport(LOG,
+ 				(errmsg("autovacuum: processing database \"%s\" at recent Xid of %u recent mxid of %u", dbname,recentXid,recentMulti)));
  		do_autovacuum();
+ 		if (JJ_vac) ereport(LOG,
+ 				(errmsg("autovacuum: done processing database \"%s\" at recent Xid of %u recent mxid of %u", dbname,ReadNewTransactionId(),ReadNextMultiXactId())));
  	}
  
  	/*
*************** relation_needs_vacanalyze(Oid relid,
*** 2761,2773 ****
  		 * reset, because if that happens, the last vacuum and analyze counts
  		 * will be reset too.
  		 */
- 		elog(DEBUG3, "%s: vac: %.0f (threshold %.0f), anl: %.0f (threshold %.0f)",
- 			 NameStr(classForm->relname),
- 			 vactuples, vacthresh, anltuples, anlthresh);
  
  		/* Determine if this table needs vacuum or analyze. */
  		*dovacuum = force_vacuum || (vactuples > vacthresh);
  		*doanalyze = (anltuples > anlthresh);
  	}
  	else
  	{
--- 2764,2777 ----
  		 * reset, because if that happens, the last vacuum and analyze counts
  		 * will be reset too.
  		 */
  
  		/* Determine if this table needs vacuum or analyze. */
  		*dovacuum = force_vacuum || (vactuples > vacthresh);
  		*doanalyze = (anltuples > anlthresh);
+ 
+ 		if (JJ_vac) elog(LOG, "%s: vac: %.0f (threshold %.0f), anl: %.0f (threshold %.0f) wraparound %d dovaccum %d doanalyze %d",
+ 			 NameStr(classForm->relname),
+ 			 vactuples, vacthresh, anltuples, anlthresh, *wraparound, *dovacuum, *doanalyze);
  	}
  	else
  	{
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
new file mode 100644
index 1360f77..6221148
*** a/src/backend/storage/smgr/md.c
--- b/src/backend/storage/smgr/md.c
***************
*** 67,72 ****
--- 67,74 ----
  #define FILE_POSSIBLY_DELETED(err)	((err) == ENOENT || (err) == EACCES)
  #endif
  
+ int JJ_torn_page=0;
+ 
  /*
   *	The magnetic disk storage manager keeps track of open file
   *	descriptors in its own descriptor pool.  This is done to make it
*************** mdwrite(SMgrRelation reln, ForkNumber fo
*** 804,809 ****
--- 806,812 ----
  	off_t		seekpos;
  	int			nbytes;
  	MdfdVec    *v;
+         static int counter=0;
  
  	/* This assert is too expensive to have on normally ... */
  #ifdef CHECK_WRITE_VS_EXTEND
*************** mdwrite(SMgrRelation reln, ForkNumber fo
*** 829,835 ****
  				 errmsg("could not seek to block %u in file \"%s\": %m",
  						blocknum, FilePathName(v->mdfd_vfd))));
  
! 	nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
  
  	TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
  										reln->smgr_rnode.node.spcNode,
--- 832,849 ----
  				 errmsg("could not seek to block %u in file \"%s\": %m",
  						blocknum, FilePathName(v->mdfd_vfd))));
  
!         if (JJ_torn_page > 0 && counter++ > JJ_torn_page && !RecoveryInProgress()) {
! 	  nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ/3);
! 		ereport(FATAL,
! 				(errcode(ERRCODE_DISK_FULL),
! 				 errmsg("could not write block %u of relation %s: wrote only %d of %d bytes",
! 						blocknum,
! 						relpath(reln->smgr_rnode, forknum),
! 						nbytes, BLCKSZ),
! 				 errhint("JJ is screwing with the database.")));
!         } else {
! 	  nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
! 	}
  
  	TRACE_POSTGRESQL_SMGR_MD_WRITE_DONE(forknum, blocknum,
  										reln->smgr_rnode.node.spcNode,
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index c5178f7..c2d8e70
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 106,111 ****
--- 106,114 ----
  /* XXX these should appear in other modules' header files */
  extern bool Log_disconnections;
  extern int	CommitDelay;
+ int	JJ_torn_page;
+ extern int	JJ_xid;
+ extern int	JJ_vac;
  extern int	CommitSiblings;
  extern char *default_tablespace;
  extern char *temp_tablespaces;
*************** static struct config_int ConfigureNamesI
*** 2359,2364 ****
--- 2362,2394 ----
  	},
  
  	{
+ 		{"JJ_torn_page", PGC_USERSET, WAL_SETTINGS,
+ 			gettext_noop("Simulate a torn-page crash after this number of page writes (0 to turn off)"),
+ 			NULL
+ 		},
+ 		&JJ_torn_page,
+ 		0, 0, 100000, NULL, NULL
+ 	},
+ 
+ 	{
+ 		{"JJ_xid", PGC_USERSET, WAL_SETTINGS,
+ 			gettext_noop("Skip this many xid every time we acquire one"),
+ 			NULL
+ 		},
+ 		&JJ_xid,
+ 		0, 0, 1000000, NULL, NULL
+ 	},
+ 
+ 	{
+ 		{"JJ_vac", PGC_USERSET, WAL_SETTINGS,
+ 			gettext_noop("turn on verbose logging"),
+ 			NULL
+ 		},
+ 		&JJ_vac,
+ 		0, 0, 1000000, NULL, NULL
+ 	},
+ 
+ 	{
  		{"commit_siblings", PGC_USERSET, WAL_SETTINGS,
  			gettext_noop("Sets the minimum concurrent open transactions before performing "
  						 "commit_delay."),
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
new file mode 100644
index a2b2b61..fe4c806
*** a/src/include/pg_config_manual.h
--- b/src/include/pg_config_manual.h
***************
*** 283,289 ****
  /*
   * Enable debugging print statements for lock-related operations.
   */
! /* #define LOCK_DEBUG */
  
  /*
   * Enable debugging print statements for WAL-related operations; see
--- 283,289 ----
  /*
   * Enable debugging print statements for lock-related operations.
   */
! #define LOCK_DEBUG 1
  
  /*
   * Enable debugging print statements for WAL-related operations; see
do.sh (application/x-sh)
#35Amit Kapila
amit.kapila16@gmail.com
In reply to: Jeff Janes (#34)
Re: Write Ahead Logging for Hash Indexes

On Mon, Sep 12, 2016 at 7:00 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Thu, Sep 8, 2016 at 12:09 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

I plan to do testing using my own testing harness after changing it to
insert a lot of dummy tuples (ones with negative values in the pseudo-pk
column, which are never queried by the core part of the harness) and
deleting them at random intervals. I think that none of pgbench's built in
tests are likely to give the bucket splitting and squeezing code very much
exercise.

I've implemented this, by adding lines 197 through 202 to the count.pl
script. (I'm reattaching the test case)

Within a few minutes of testing, I start getting Errors like these:

29236 UPDATE XX000 2016-09-11 17:21:25.893 PDT:ERROR: buffer 2762 is not
owned by resource owner Portal
29236 UPDATE XX000 2016-09-11 17:21:25.893 PDT:STATEMENT: update foo set
count=count+1 where index=$1

In one test, I also got an error from my test harness itself indicating
tuples are transiently missing from the index, starting an hour into a test:

child abnormal exit update did not update 1 row: key 9555 updated 0E0 at
count.pl line 194.\n at count.pl line 208.
child abnormal exit update did not update 1 row: key 8870 updated 0E0 at
count.pl line 194.\n at count.pl line 208.
child abnormal exit update did not update 1 row: key 8453 updated 0E0 at
count.pl line 194.\n at count.pl line 208.

Those key values should always find exactly one row to update.

If the tuples were permanently missing from the index, I would keep getting
errors on the same key values very frequently. But I don't get that, the
errors remain infrequent and are on different value each time, so I think
the tuples are in the index but the scan somehow misses them, either while
the bucket is being split or while it is being squeezed.

This on a build without enable-asserts.

Any ideas on how best to go about investigating this?

I think these symptoms indicate a bug in the concurrent hash index
patch, but it could be that the problem can only be revealed with the
WAL patch. Is it possible to try this with just the concurrent hash
index patch? In any case, thanks for testing it; I will look into
these issues.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#36Amit Kapila
amit.kapila16@gmail.com
In reply to: Mark Kirkwood (#33)
Re: Write Ahead Logging for Hash Indexes

On Sun, Sep 11, 2016 at 3:01 PM, Mark Kirkwood
<mark.kirkwood@catalyst.net.nz> wrote:

On 11/09/16 19:16, Mark Kirkwood wrote:

On 11/09/16 17:01, Amit Kapila wrote:

...Do you think we can do some read-only
workload benchmarking using this server? If yes, then probably you
can use concurrent hash index patch [1] and cache the metapage patch
[2] (I think Mithun needs to rebase his patch) to do so.

[1] -
/messages/by-id/CAA4eK1J6b8O4PcEPqRxNYbLVbfToNMJEEm+qn0jZX31-obXrJw@mail.gmail.com
[2] -
/messages/by-id/CAD__OuhJ29CeBif_fLGe4t9Vj_-cFXBwCXhjO+D_16TXbemY+g@mail.gmail.com

I can do - are we checking for hangs/assertions or comparing
patched vs unpatched performance (for the metapage patch)?

The first step is to find any bugs and then do performance testing.
What I wanted for performance testing was to compare HEAD against all
three hash index patches applied together (concurrent hash index,
cache the meta page, WAL for hash index). For HEAD, we want two sets
of numbers, one with hash indexes and another with btree indexes. As
Jeff has found a few problems, I think it is better to first fix those
before going for performance tests.

So, assuming the latter - testing performance with and without the metapage
patch:

For my 1st runs:

- cpus 16, ram 16G
- size 100, clients 32

I'm seeing no difference in performance for the read only (-S) pgbench workload
(with everybody using hash indexes). I guess that's not that surprising as the db
fits in ram (1.6G and we have 16G). So I'll retry with a bigger dataset
(suspect size 2000 is needed).

I think you should try with -S -M prepared and with various client
counts (8,16,32,64,100...).

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#37Jeff Janes
jeff.janes@gmail.com
In reply to: Amit Kapila (#35)
Re: Write Ahead Logging for Hash Indexes

On Sun, Sep 11, 2016 at 7:40 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Mon, Sep 12, 2016 at 7:00 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Thu, Sep 8, 2016 at 12:09 PM, Jeff Janes <jeff.janes@gmail.com>

wrote:

I plan to do testing using my own testing harness after changing it to
insert a lot of dummy tuples (ones with negative values in the pseudo-pk
column, which are never queried by the core part of the harness) and
deleting them at random intervals. I think that none of pgbench's

built in

tests are likely to give the bucket splitting and squeezing code very

much

exercise.

I've implemented this, by adding lines 197 through 202 to the count.pl
script. (I'm reattaching the test case)

Within a few minutes of testing, I start getting Errors like these:

29236 UPDATE XX000 2016-09-11 17:21:25.893 PDT:ERROR: buffer 2762 is not
owned by resource owner Portal
29236 UPDATE XX000 2016-09-11 17:21:25.893 PDT:STATEMENT: update foo set
count=count+1 where index=$1

In one test, I also got an error from my test harness itself indicating
tuples are transiently missing from the index, starting an hour into a

test:

child abnormal exit update did not update 1 row: key 9555 updated 0E0 at
count.pl line 194.\n at count.pl line 208.
child abnormal exit update did not update 1 row: key 8870 updated 0E0 at
count.pl line 194.\n at count.pl line 208.
child abnormal exit update did not update 1 row: key 8453 updated 0E0 at
count.pl line 194.\n at count.pl line 208.

Those key values should always find exactly one row to update.

If the tuples were permanently missing from the index, I would keep

getting

errors on the same key values very frequently. But I don't get that, the
errors remain infrequent and are on different value each time, so I think
the tuples are in the index but the scan somehow misses them, either

while

the bucket is being split or while it is being squeezed.

This on a build without enable-asserts.

Any ideas on how best to go about investigating this?

I think these symptoms indicate the bug in concurrent hash index
patch, but it could be that the problem can be only revealed with WAL
patch. Is it possible to just try this with concurrent hash index
patch? In any case, thanks for testing it, I will look into these
issues.

My test program (as posted) injects crashes and then checks the
post-crash-recovery system for consistency, so it cannot be run as-is
without the WAL patch. I also ran the test with crashing turned off (just
change the JJ* variables at the top of do.sh to all be set to the
empty string), and in that case I didn't see either problem, but it
could just be that I didn't run it long enough.

It should have been long enough to detect the rather common "buffer <x> is
not owned by resource owner Portal" problem, so that one I think is
specific to the WAL patch (probably the part which tries to complete bucket
splits when it detects one was started but not completed?)

Cheers,

Jeff

#38Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: Amit Kapila (#22)
Re: Write Ahead Logging for Hash Indexes

Hi,

On 09/07/2016 05:58 AM, Amit Kapila wrote:

Okay, I have fixed this issue as explained above. Apart from that, I
have fixed another issue reported by Mark Kirkwood upthread and few
other issues found during internal testing by Ashutosh Sharma.

The locking issue reported by Mark and Ashutosh is that the patch
didn't maintain the locking order while adding overflow page as it
maintains in other write operations (lock the bucket pages first and
then metapage to perform the write operation). I have added the
comments in _hash_addovflpage() to explain the locking order used in
modified patch.

During stress testing with pgbench using master-standby setup, we
found an issue which indicates that WAL replay machinery doesn't
expect completely zeroed pages (See explanation of RBM_NORMAL mode
atop XLogReadBufferExtended). Previously before freeing the overflow
page we were zeroing it, now I have changed it to just initialize the
page such that the page will be empty.

Apart from above, I have added support for old snapshot threshold in
the hash index code.

Thanks to Ashutosh Sharma for doing the testing of the patch and
helping me in analyzing some of the above issues.

I forgot to mention in my initial mail that Robert and I had some
off-list discussions about the design of this patch, many thanks to
him for providing inputs.

Some initial feedback.

README:
+in_complete split flag. The reader algorithm works correctly, as it
will scan

What flag ?

hashxlog.c:

hash_xlog_move_page_contents
hash_xlog_squeeze_page

Both have "bukcetbuf" (-> "bucketbuf"), and

+ if (BufferIsValid(bukcetbuf));

->

+ if (BufferIsValid(bucketbuf))

and indent following line:

LockBufferForCleanup(bukcetbuf);

hash_xlog_delete

has the "if" issue too.
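
The corrected form is presumably along these lines, with the stray
semicolon removed so the cleanup lock is taken only when the buffer is
valid:

if (BufferIsValid(bucketbuf))
	LockBufferForCleanup(bucketbuf);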

hash.h:

Move the XLog related content to hash_xlog.h

Best regards,
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#39Amit Kapila
amit.kapila16@gmail.com
In reply to: Jeff Janes (#37)
Re: Write Ahead Logging for Hash Indexes

On Mon, Sep 12, 2016 at 11:29 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

My test program (as posted) injects crashes and then checks the
post-crash-recovery system for consistency, so it cannot be run as-is
without the WAL patch. I also ran the test with crashing turned off (just
change the JJ* variables at the top of do.sh to all be set to the empty
string), and in that case I didn't see either problem, but it could just
be that I didn't run it long enough.

It should have been long enough to detect the rather common "buffer <x> is
not owned by resource owner Portal" problem, so that one I think is specific
to the WAL patch (probably the part which tries to complete bucket splits
when it detects one was started but not completed?)

Both problems seem to be due to the same reason, and it lies in the
concurrent hash index patch. What is happening here is that we are
trying to unpin the old bucket buffer twice. We need to scan the old
bucket when there is an incomplete split; the issue is that the system
crashed during a split and, after restart, the old bucket scan tries
to unpin the old primary bucket buffer twice when the new bucket has
additional overflow pages. I will post the updated patch on the
concurrent hash index thread. Once I post the patch, it would be good
if you could try to reproduce the issue once more.
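
Roughly, the fix needs a guard of this kind. This is only a minimal
sketch (the scan-opaque field names here are assumptions based on the
concurrent hash index patch, not the actual change):

/*
 * Sketch only: drop the pin on the old bucket's primary page at most
 * once, and only when it is not the buffer the scan currently holds.
 */
if (BufferIsValid(so->hashso_old_bucket_buf) &&
	so->hashso_old_bucket_buf != so->hashso_curbuf)
{
	_hash_dropbuf(rel, so->hashso_old_bucket_buf);
	so->hashso_old_bucket_buf = InvalidBuffer;
}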

Thanks to Ashutosh Sharma, who shared the call stack offlist after
reproducing the problem.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#40Amit Kapila
amit.kapila16@gmail.com
In reply to: Jesper Pedersen (#38)
1 attachment(s)
Re: Write Ahead Logging for Hash Indexes

On Mon, Sep 12, 2016 at 10:24 PM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:

Some initial feedback.

Thanks for looking into patch.

README:
+in_complete split flag. The reader algorithm works correctly, as it will
scan

What flag ?

The in-complete-split flag indicates that the split has to be finished
for that particular bucket. The values of these flags are
LH_BUCKET_NEW_PAGE_SPLIT and LH_BUCKET_OLD_PAGE_SPLIT for the new and
old bucket respectively. They are set in hasho_flag in the special
area of the page. I have slightly expanded the definition in the
README, but I am not sure if it is a good idea to mention flags
defined in hash.h. Let me know if it is still unclear or you want
something additional to be added to the README.
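
For reference, a reader can detect this state from the page's special
space roughly like this (a sketch based on the flag names above, not
an excerpt from the patch):

HashPageOpaque opaque = (HashPageOpaque) PageGetSpecialPointer(page);

if (opaque->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT)
{
	/* old bucket: a split from this bucket still has to be finished */
}
if (opaque->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
{
	/* new bucket: tuples are still being moved into it from the old bucket */
}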

hashxlog.c:

hash_xlog_move_page_contents
hash_xlog_squeeze_page

Both have "bukcetbuf" (-> "bucketbuf"), and

+ if (BufferIsValid(bukcetbuf));

->

+ if (BufferIsValid(bucketbuf))

and indent following line:

LockBufferForCleanup(bukcetbuf);

hash_xlog_delete

has the "if" issue too.

Fixed all the above cosmetic issues.

hash.h:

Move the XLog related content to hash_xlog.h

Moved and renamed hashxlog.c to hash_xlog.c to match the name of the
corresponding .h file.

Note - This patch is to be applied on top of concurrent hash index v6
version [1]/messages/by-id/CAA4eK1+ERbP+7mdKkAhJZWQ_dTdkocbpt7LSWFwCQvUHBXzkmA@mail.gmail.com. Jeff, use this version of patch for further review and
test.

[1]: /messages/by-id/CAA4eK1+ERbP+7mdKkAhJZWQ_dTdkocbpt7LSWFwCQvUHBXzkmA@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

wal_hash_index_v3.patch (application/octet-stream)
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 0f09d82..2d0d3bd 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1520,19 +1520,6 @@ archive_command = 'local_backup_script.sh "%p" "%f"'
   <itemizedlist>
    <listitem>
     <para>
-     Operations on hash indexes are not presently WAL-logged, so
-     replay will not update these indexes.  This will mean that any new inserts
-     will be ignored by the index, updated rows will apparently disappear and
-     deleted rows will still retain pointers. In other words, if you modify a
-     table with a hash index on it then you will get incorrect query results
-     on a standby server.  When recovery completes it is recommended that you
-     manually <xref linkend="sql-reindex">
-     each such index after completing a recovery operation.
-    </para>
-   </listitem>
-
-   <listitem>
-    <para>
      If a <xref linkend="sql-createdatabase">
      command is executed while a base backup is being taken, and then
      the template database that the <command>CREATE DATABASE</> copied
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index cd66abc..8990bbe 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2107,10 +2107,9 @@ include_dir 'conf.d'
          has materialized a result set, no error will be generated even if the
          underlying rows in the referenced table have been vacuumed away.
          Some tables cannot safely be vacuumed early, and so will not be
-         affected by this setting.  Examples include system catalogs and any
-         table which has a hash index.  For such tables this setting will
-         neither reduce bloat nor create a possibility of a <literal>snapshot
-         too old</> error on scanning.
+         affected by this setting.  Examples include system catalogs.  For
+         such tables this setting will neither reduce bloat nor create a
+         possibility of a <literal>snapshot too old</> error on scanning.
         </para>
        </listitem>
       </varlistentry>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 06f49db..f285795 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2323,12 +2323,6 @@ LOG:  database system is ready to accept read only connections
   <itemizedlist>
    <listitem>
     <para>
-     Operations on hash indexes are not presently WAL-logged, so
-     replay will not update these indexes.
-    </para>
-   </listitem>
-   <listitem>
-    <para>
      Full knowledge of running transactions is required before snapshots
      can be taken. Transactions that use large numbers of subtransactions
      (currently greater than 64) will delay the start of read only
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 46f8e55..e075ede 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -193,18 +193,6 @@ CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable>
 </synopsis>
   </para>
 
-  <caution>
-   <para>
-    Hash index operations are not presently WAL-logged,
-    so hash indexes might need to be rebuilt with <command>REINDEX</>
-    after a database crash if there were unwritten changes.
-    Also, changes to hash indexes are not replicated over streaming or
-    file-based replication after the initial base backup, so they
-    give wrong answers to queries that subsequently use them.
-    For these reasons, hash index use is presently discouraged.
-   </para>
-  </caution>
-
   <para>
    <indexterm>
     <primary>index</primary>
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index e9f47c4..a2dc212 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -510,19 +510,6 @@ Indexes:
    they can be useful.
   </para>
 
-  <caution>
-   <para>
-    Hash index operations are not presently WAL-logged,
-    so hash indexes might need to be rebuilt with <command>REINDEX</>
-    after a database crash if there were unwritten changes.
-    Also, changes to hash indexes are not replicated over streaming or
-    file-based replication after the initial base backup, so they
-    give wrong answers to queries that subsequently use them.
-    Hash indexes are also not properly restored during point-in-time
-    recovery.  For these reasons, hash index use is presently discouraged.
-   </para>
-  </caution>
-
   <para>
    Currently, only the B-tree, GiST, GIN, and BRIN index methods support
    multicolumn indexes. Up to 32 fields can be specified by default.
diff --git a/src/backend/access/hash/Makefile b/src/backend/access/hash/Makefile
index 5d3bd94..5e24f4b 100644
--- a/src/backend/access/hash/Makefile
+++ b/src/backend/access/hash/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = hash.o hashfunc.o hashinsert.o hashovfl.o hashpage.o hashscan.o \
-       hashsearch.o hashsort.o hashutil.o hashvalidate.o
+       hashsearch.o hashsort.o hashutil.o hashvalidate.o hash_xlog.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index a0feb2f..7221892 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -531,6 +531,98 @@ All the freespace operations should be called while holding no buffer
 locks.  Since they need no lmgr locks, deadlock is not possible.
 
 
+WAL Considerations
+------------------
+
+The hash index operations like create index, insert, delete, bucket split,
+allocate overflow page, squeeze (overflow pages are freed in this operation)
+in themselves doesn't guarantee hash index consistency after a crash.  To
+provide robustness, we write WAL for each of these operations which gets
+replayed on crash recovery.
+
+Multiple WAL records are being written for create index operation, first for
+initializing the metapage, followed by one for each new bucket created during
+operation followed by one for initializing the bitmap page.  If the system
+crashes after any operation, the whole operation is rolledback.  We have
+considered to write a single WAL record for the whole operation, but for that
+we need to limit the number of initial buckets that can be created during the
+operation. As we can log only fixed number of pages XLR_MAX_BLOCK_ID (32) with
+current XLog machinery, it is better to write multiple WAL records for this
+operation.  The downside of restricting the number of buckets is that we need
+to perform split operation if the number of tuples are more than what can be
+accommodated in initial set of buckets and it is not unusual to have large
+number of tuples during create index operation.
+
+Ordinary item insertions (that don't force a page split or need a new overflow
+page) are single WAL entries.  They touch a single bucket page and meta page,
+metapage is updated during replay as it is updated during original operation.
+
+An insertion that causes an addition of an overflow page is logged as a single
+WAL entry preceded by a WAL entry for new overflow page required to insert
+a tuple.  There is a corner case where by the time we try to use newly
+allocated overflow page, it already gets used by concurrent insertions, for
+such a case, a new overflow page will be allocated and a separate WAL entry
+will be made for the same.
+
+An insertion that causes a bucket split is logged as a single WAL entry,
+followed by a WAL entry for allocating a new bucket, followed by a WAL entry
+for each overflow bucket page in the new bucket to which the tuples are moved
+from old bucket, followed by a WAL entry to indicate that split is complete
+for both old and new buckets.
+
+A split operation which requires overflow pages to complete the operation
+will need to write a WAL record for each new allocation of an overflow page.
+
+As splitting involves multiple atomic actions, it's possible that the
+system crashes between moving tuples from bucket pages of old bucket to new
+bucket.  In such a case, after recovery, both the old and new buckets will be
+marked with in-complete-split flag which indicates that split has to be
+finished for that particular bucket.  The reader algorithm works correctly, as
+it will scan both the old and new buckets as explained in the reader algorithm
+section above.
+
+We finish the split at next insert or split operation on old bucket as
+explained in insert and split algorithm above.  It could be done during
+searches, too, but it seems best not to put any extra updates in what would
+otherwise be a read-only operation (updating is not possible in hot standby
+mode anyway).  It would seem natural to complete the split in VACUUM, but since
+splitting a bucket might require to allocate a new page, it might fail if you
+run out of disk space.  That would be bad during VACUUM - the reason for
+running VACUUM in the first place might be that you run out of disk space,
+and now VACUUM won't finish because you're out of disk space.  In contrast,
+an insertion can require enlarging the physical file anyway.
+
+Deletion of tuples from a bucket is performed for two reasons, one for
+removing the dead tuples and other for removing the tuples that are moved by
+split.  WAL entry is made for each bucket page from which tuples are removed,
+followed by a WAL entry to clear the garbage flag if the tuples moved by split
+are removed.  Another separate WAL entry is made for updating the metapage if
+the deletion is performed for removing the dead tuples by vacuum.
+
+As deletion involves multiple atomic operations, it is quite possible that
+system crashes after (a) removing tuples from some of the bucket pages
+(b) before clearing the garbage flag (c) before updating the metapage.  If the
+system crashes before completing (b), it will again try to clean the bucket
+during next vacuum or insert after recovery which can have some performance
+impact, but it will work fine. If the system crashes before completing (c),
+after recovery there could be some additional splits till the next vacuum
+updates the metapage, but the other operations like insert, delete and scan
+will work correctly.  We can fix this problem by actually updating the metapage
+based on delete operation during replay, but not sure if it is worth the
+complication.
+
+Squeeze operation moves tuples from one of the buckets later in the chain to
+one of the bucket earlier in chain and writes WAL record when either the
+bucket to which it is writing tuples is filled or bucket from which it
+is removing the tuples becomes empty.
+
+As Squeeze operation involves writing multiple atomic operations, it is
+quite possible, that system crashes before completing the operation on
+entire bucket.  After recovery, the operations will work correctly, but
+the index will remain bloated and can impact performance of read and
+insert operations until the next vacuum squeezes the bucket completely.
+
+
 Other Notes
 -----------
 
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index a12a830..6553165 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -27,6 +27,7 @@
 #include "optimizer/plancat.h"
 #include "utils/index_selfuncs.h"
 #include "utils/rel.h"
+#include "miscadmin.h"
 
 
 /* Working state for hashbuild and its callback */
@@ -115,7 +116,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	estimate_rel_size(heap, NULL, &relpages, &reltuples, &allvisfrac);
 
 	/* Initialize the hash index metadata page and initial buckets */
-	num_buckets = _hash_metapinit(index, reltuples, MAIN_FORKNUM);
+	num_buckets = _hash_init(index, reltuples, MAIN_FORKNUM);
 
 	/*
 	 * If we just insert the tuples into the index in scan order, then
@@ -177,7 +178,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 void
 hashbuildempty(Relation index)
 {
-	_hash_metapinit(index, 0, INIT_FORKNUM);
+	_hash_init(index, 0, INIT_FORKNUM);
 }
 
 /*
@@ -297,6 +298,11 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		buf = so->hashso_curbuf;
 		Assert(BufferIsValid(buf));
 		page = BufferGetPage(buf);
+
+		/*
+		 * We don't need test for old snapshot here as the current buffer is
+		 * pinned, so vacuum can't clean the page.
+		 */
 		maxoffnum = PageGetMaxOffsetNumber(page);
 		for (offnum = ItemPointerGetOffsetNumber(current);
 			 offnum <= maxoffnum;
@@ -604,6 +610,7 @@ loop_top:
 	}
 
 	/* Okay, we're really done.  Update tuple count in metapage. */
+	START_CRIT_SECTION();
 
 	if (orig_maxbucket == metap->hashm_maxbucket &&
 		orig_ntuples == metap->hashm_ntuples)
@@ -629,7 +636,28 @@ loop_top:
 		num_index_tuples = metap->hashm_ntuples;
 	}
 
-	_hash_wrtbuf(rel, metabuf);
+	MarkBufferDirty(metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_update_meta_page xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.ntuples = metap->hashm_ntuples;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, sizeof(SizeOfHashUpdateMetaPage));
+
+		XLogRegisterBuffer(0, metabuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_UPDATE_META_PAGE);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	_hash_relbuf(rel, metabuf);
 
 	/* return statistics */
 	if (stats == NULL)
@@ -720,7 +748,6 @@ hashbucketcleanup(Relation rel, Buffer bucket_buf,
 		OffsetNumber deletable[MaxOffsetNumber];
 		int			ndeletable = 0;
 		bool		retain_pin = false;
-		bool		curr_page_dirty = false;
 
 		if (delay)
 			vacuum_delay_point();
@@ -787,9 +814,41 @@ hashbucketcleanup(Relation rel, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			/* No ereport(ERROR) until changes are logged */
+			START_CRIT_SECTION();
+
 			PageIndexMultiDelete(page, deletable, ndeletable);
 			bucket_dirty = true;
-			curr_page_dirty = true;
+
+			MarkBufferDirty(buf);
+
+			/* XLOG stuff */
+			if (RelationNeedsWAL(rel))
+			{
+				xl_hash_delete xlrec;
+				XLogRecPtr	recptr;
+
+				xlrec.is_primary_bucket_page = (buf == bucket_buf) ? true : false;
+
+				XLogBeginInsert();
+				XLogRegisterData((char *) &xlrec, SizeOfHashDelete);
+
+				/*
+				 * bucket buffer needs to be registered to ensure that we can
+				 * acquire a cleanup lock on it during replay.
+				 */
+				if (!xlrec.is_primary_bucket_page)
+					XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+				XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
+				XLogRegisterBufData(1, (char *) deletable,
+									ndeletable * sizeof(OffsetNumber));
+
+				recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_DELETE);
+				PageSetLSN(BufferGetPage(buf), recptr);
+			}
+
+			END_CRIT_SECTION();
 		}
 
 		/* bail out if there are no more pages to scan. */
@@ -804,15 +863,7 @@ hashbucketcleanup(Relation rel, Buffer bucket_buf,
 		 * release the lock on previous page after acquiring the lock on next
 		 * page
 		 */
-		if (curr_page_dirty)
-		{
-			if (retain_pin)
-				_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
-			else
-				_hash_wrtbuf(rel, buf);
-			curr_page_dirty = false;
-		}
-		else if (retain_pin)
+		if (retain_pin)
 			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
 		else
 			_hash_relbuf(rel, buf);
@@ -844,10 +895,36 @@ hashbucketcleanup(Relation rel, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		/* No ereport(ERROR) until changes are logged */
+		START_CRIT_SECTION();
+
 		bucket_opaque->hasho_flag &= ~LH_BUCKET_PAGE_HAS_GARBAGE;
+
+		MarkBufferDirty(bucket_buf);
+
+		/* XLOG stuff */
+		if (RelationNeedsWAL(rel))
+		{
+			XLogRecPtr	recptr;
+
+			XLogBeginInsert();
+			XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+			recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_CLEAR_GARBAGE);
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
 	}
 
 	/*
+	 * we need to release and reacquire the lock on bucket buffer to ensure
+	 * that standby shouldn't see an intermediate state of it.
+	 */
+	_hash_chgbufaccess(rel, bucket_buf, HASH_READ, HASH_NOLOCK);
+	_hash_chgbufaccess(rel, bucket_buf, HASH_NOLOCK, HASH_WRITE);
+
+	/*
 	 * If we deleted anything, try to compact free space.  For squeezing the
 	 * bucket, we must have a cleanup lock, else it can impact the ordering of
 	 * tuples for a scan that has started before it.
@@ -856,9 +933,3 @@ hashbucketcleanup(Relation rel, Buffer bucket_buf,
 		_hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf,
 							bstrategy);
 }
-
-void
-hash_redo(XLogReaderState *record)
-{
-	elog(PANIC, "hash_redo: unimplemented");
-}
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
new file mode 100644
index 0000000..fb1b251
--- /dev/null
+++ b/src/backend/access/hash/hash_xlog.c
@@ -0,0 +1,970 @@
+/*-------------------------------------------------------------------------
+ *
+ * hash_xlog.c
+ *	  WAL replay logic for hash index.
+ *
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/hash/hash_xlog.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/hash_xlog.h"
+#include "access/xlogutils.h"
+
+/*
+ * replay a hash index meta page
+ */
+static void
+hash_xlog_init_meta_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Page		page;
+	Buffer		metabuf;
+
+	xl_hash_init_meta_page *xlrec = (xl_hash_init_meta_page *) XLogRecGetData(record);
+
+	/* create the index' metapage */
+	metabuf = XLogInitBufferForRedo(record, 0);
+	Assert(BufferIsValid(metabuf));
+	_hash_init_metabuffer(metabuf, xlrec->num_tuples, xlrec->procid,
+						  xlrec->ffactor, true);
+	page = (Page) BufferGetPage(metabuf);
+	PageSetLSN(page, lsn);
+	MarkBufferDirty(metabuf);
+	/* all done */
+	UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay a hash index bitmap page
+ */
+static void
+hash_xlog_init_bitmap_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		bitmapbuf;
+	Buffer		metabuf;
+	Page		page;
+	HashMetaPage metap;
+	uint32		num_buckets;
+
+	xl_hash_init_bitmap_page *xlrec = (xl_hash_init_bitmap_page *) XLogRecGetData(record);
+
+	/*
+	 * Initialize bitmap page
+	 */
+	bitmapbuf = XLogInitBufferForRedo(record, 0);
+	_hash_initbitmapbuffer(bitmapbuf, xlrec->bmsize, true);
+	PageSetLSN(BufferGetPage(bitmapbuf), lsn);
+	MarkBufferDirty(bitmapbuf);
+	UnlockReleaseBuffer(bitmapbuf);
+
+	/* add the new bitmap page to the metapage's list of bitmaps */
+	if (XLogReadBufferForRedo(record, 1, &metabuf) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Note: in normal operation, we'd update the metapage while still
+		 * holding lock on the page we inserted into.  But during replay it's
+		 * not necessary to hold that lock, since no other index updates can
+		 * be happening concurrently.
+		 */
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+
+		num_buckets = metap->hashm_maxbucket + 1;
+		metap->hashm_mapp[metap->hashm_nmaps] = num_buckets + 1;
+		metap->hashm_nmaps++;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay a hash index insert without split
+ */
+static void
+hash_xlog_insert(XLogReaderState *record)
+{
+	HashMetaPage metap;
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_insert *xlrec = (xl_hash_insert *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		Size		datalen;
+		char	   *datapos = XLogRecGetBlockData(record, 0, &datalen);
+
+		page = BufferGetPage(buffer);
+
+		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+						false, false) == InvalidOffsetNumber)
+			elog(PANIC, "hash_insert_redo: failed to add item");
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+
+	if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Note: in normal operation, we'd update the metapage while still
+		 * holding lock on the page we inserted into.  But during replay it's
+		 * not necessary to hold that lock, since no other index updates can
+		 * be happening concurrently.
+		 */
+		page = BufferGetPage(buffer);
+		metap = HashPageGetMeta(page);
+		metap->hashm_ntuples += 1;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
+
+/*
+ * replay addition of overflow page for hash index
+ */
+static void
+hash_xlog_addovflpage(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_addovflpage *xlrec = (xl_hash_addovflpage *) XLogRecGetData(record);
+	Buffer		leftbuf;
+	Buffer		ovflbuf;
+	Buffer		metabuf;
+	BlockNumber leftblk;
+	BlockNumber rightblk;
+	BlockNumber newmapblk = InvalidBlockNumber;
+	Page		ovflpage;
+	HashPageOpaque ovflopaque;
+	uint32	   *num_bucket;
+	char	   *data;
+	Size datalen PG_USED_FOR_ASSERTS_ONLY;
+	bool		new_bmpage = false;
+
+	XLogRecGetBlockTag(record, 0, NULL, NULL, &rightblk);
+	XLogRecGetBlockTag(record, 1, NULL, NULL, &leftblk);
+
+	ovflbuf = XLogInitBufferForRedo(record, 0);
+	Assert(BufferIsValid(ovflbuf));
+
+	data = XLogRecGetBlockData(record, 0, &datalen);
+	num_bucket = (uint32 *) data;
+	Assert(datalen == sizeof(uint32));
+	_hash_initbuf(ovflbuf, *num_bucket, LH_OVERFLOW_PAGE, true);
+	/* update backlink */
+	ovflpage = BufferGetPage(ovflbuf);
+	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
+	ovflopaque->hasho_prevblkno = leftblk;
+
+	PageSetLSN(ovflpage, lsn);
+	MarkBufferDirty(ovflbuf);
+
+	if (XLogReadBufferForRedo(record, 1, &leftbuf) == BLK_NEEDS_REDO)
+	{
+		Page		leftpage;
+		HashPageOpaque leftopaque;
+
+		leftpage = BufferGetPage(leftbuf);
+		leftopaque = (HashPageOpaque) PageGetSpecialPointer(leftpage);
+		leftopaque->hasho_nextblkno = rightblk;
+
+		PageSetLSN(leftpage, lsn);
+		MarkBufferDirty(leftbuf);
+	}
+
+	/*
+	 * We need to release the locks once the prev pointer of overflow bucket
+	 * and next of left bucket are set, otherwise concurrent read might skip
+	 * the bucket.
+	 */
+	if (BufferIsValid(leftbuf))
+		UnlockReleaseBuffer(leftbuf);
+	UnlockReleaseBuffer(ovflbuf);
+
+	/*
+	 * Note: in normal operation, we'd update the bitmap and meta page while
+	 * still holding lock on the overflow pages we updated.  But during replay
+	 * it's not necessary to hold those locks, since no other index updates
+	 * can be happening concurrently.
+	 */
+	if (XLogRecHasBlockRef(record, 2))
+	{
+		Buffer		mapbuffer;
+
+		if (XLogReadBufferForRedo(record, 2, &mapbuffer) == BLK_NEEDS_REDO)
+		{
+			Page		mappage = (Page) BufferGetPage(mapbuffer);
+			uint32	   *freep = NULL;
+			char	   *data;
+			uint32	   *bitmap_page_bit;
+
+			freep = HashPageGetBitmap(mappage);
+
+			data = XLogRecGetBlockData(record, 2, &datalen);
+			bitmap_page_bit = (uint32 *) data;
+
+			SETBIT(freep, *bitmap_page_bit);
+
+			PageSetLSN(mappage, lsn);
+			MarkBufferDirty(mapbuffer);
+		}
+		if (BufferIsValid(mapbuffer))
+			UnlockReleaseBuffer(mapbuffer);
+	}
+
+	if (XLogRecHasBlockRef(record, 3))
+	{
+		Buffer		newmapbuf;
+
+		newmapbuf = XLogInitBufferForRedo(record, 3);
+
+		_hash_initbitmapbuffer(newmapbuf, xlrec->bmsize, true);
+
+		new_bmpage = true;
+		newmapblk = BufferGetBlockNumber(newmapbuf);
+
+		MarkBufferDirty(newmapbuf);
+		PageSetLSN(BufferGetPage(newmapbuf), lsn);
+
+		UnlockReleaseBuffer(newmapbuf);
+	}
+
+	if (XLogReadBufferForRedo(record, 4, &metabuf) == BLK_NEEDS_REDO)
+	{
+		HashMetaPage metap;
+		Page		page;
+		uint32	   *firstfree_ovflpage;
+
+		data = XLogRecGetBlockData(record, 4, &datalen);
+		firstfree_ovflpage = (uint32 *) data;
+
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+		metap->hashm_firstfree = *firstfree_ovflpage;
+
+		if (!xlrec->bmpage_found)
+		{
+			metap->hashm_spares[metap->hashm_ovflpoint]++;
+
+			if (new_bmpage)
+			{
+				Assert(BlockNumberIsValid(newmapblk));
+
+				metap->hashm_mapp[metap->hashm_nmaps] = newmapblk;
+				metap->hashm_nmaps++;
+				metap->hashm_spares[metap->hashm_ovflpoint]++;
+			}
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay allocation of page for split operation
+ */
+static void
+hash_xlog_split_allocpage(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_split_allocpage *xlrec = (xl_hash_split_allocpage *) XLogRecGetData(record);
+	Buffer		oldbuf;
+	Buffer		newbuf;
+	Buffer		metabuf;
+	Size datalen PG_USED_FOR_ASSERTS_ONLY;
+	char	   *data;
+	XLogRedoAction action;
+
+	/* replay the record for old bucket */
+	action = XLogReadBufferForRedo(record, 0, &oldbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		oldpage;
+		HashPageOpaque oldopaque;
+
+		oldpage = BufferGetPage(oldbuf);
+		oldopaque = (HashPageOpaque) PageGetSpecialPointer(oldpage);
+
+		oldopaque->hasho_flag = xlrec->old_bucket_flag;
+
+		PageSetLSN(oldpage, lsn);
+		MarkBufferDirty(oldbuf);
+	}
+
+	/*
+	 * There is no harm in releasing the lock on old bucket before new bucket
+	 * during replay as no other bucket splits can be happening concurrently.
+	 */
+	if (BufferIsValid(oldbuf))
+		UnlockReleaseBuffer(oldbuf);
+
+	/* replay the record for new bucket */
+	newbuf = XLogInitBufferForRedo(record, 1);
+	_hash_initbuf(newbuf, xlrec->new_bucket, xlrec->new_bucket_flag, true);
+	MarkBufferDirty(newbuf);
+	PageSetLSN(BufferGetPage(newbuf), lsn);
+
+	UnlockReleaseBuffer(newbuf);
+
+	/*
+	 * Note: in normal operation, we'd update the meta page while still
+	 * holding lock on the old and new bucket pages.  But during replay it's
+	 * not necessary to hold those locks, since no other bucket splits can be
+	 * happening concurrently.
+	 */
+
+	/* replay the record for metapage changes */
+	if (XLogReadBufferForRedo(record, 2, &metabuf) == BLK_NEEDS_REDO)
+	{
+		Page		page;
+		HashMetaPage metap;
+
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+		metap->hashm_maxbucket = xlrec->new_bucket;
+
+		data = XLogRecGetBlockData(record, 2, &datalen);
+
+		if (xlrec->flags & XLH_SPLIT_META_UPDATE_MASKS)
+		{
+			uint32		lowmask;
+			uint32	   *highmask;
+
+			/* extract low and high masks. */
+			memcpy(&lowmask, data, sizeof(uint32));
+			highmask = (uint32 *) ((char *) data + sizeof(uint32));
+
+			/* update metapage */
+			metap->hashm_lowmask = lowmask;
+			metap->hashm_highmask = *highmask;
+
+			data += sizeof(uint32) * 2;
+		}
+
+		if (xlrec->flags & XLH_SPLIT_META_UPDATE_SPLITPOINT)
+		{
+			uint32		ovflpoint;
+			uint32	   *ovflpages;
+
+			/* extract information of overflow pages. */
+			memcpy(&ovflpoint, data, sizeof(uint32));
+			ovflpages = (uint32 *) ((char *) data + sizeof(uint32));
+
+			/* update metapage */
+			metap->hashm_spares[ovflpoint] = *ovflpages;
+			metap->hashm_ovflpoint = ovflpoint;
+		}
+
+		MarkBufferDirty(metabuf);
+		PageSetLSN(BufferGetPage(metabuf), lsn);
+	}
+
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay of split operation
+ */
+static void
+hash_xlog_split_page(XLogReaderState *record)
+{
+	Buffer		buf;
+
+	if (XLogReadBufferForRedo(record, 0, &buf) != BLK_RESTORED)
+		elog(ERROR, "Hash split record did not contain a full-page image");
+
+	UnlockReleaseBuffer(buf);
+}
+
+/*
+ * replay completion of split operation
+ */
+static void
+hash_xlog_split_complete(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_split_complete *xlrec = (xl_hash_split_complete *) XLogRecGetData(record);
+	Buffer		oldbuf;
+	Buffer		newbuf;
+	XLogRedoAction action;
+
+	/* replay the record for old bucket */
+	action = XLogReadBufferForRedo(record, 0, &oldbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		oldpage;
+		HashPageOpaque oldopaque;
+
+		oldpage = BufferGetPage(oldbuf);
+		oldopaque = (HashPageOpaque) PageGetSpecialPointer(oldpage);
+
+		oldopaque->hasho_flag = xlrec->old_bucket_flag;
+
+		PageSetLSN(oldpage, lsn);
+		MarkBufferDirty(oldbuf);
+	}
+	if (BufferIsValid(oldbuf))
+		UnlockReleaseBuffer(oldbuf);
+
+	/* replay the record for new bucket */
+	action = XLogReadBufferForRedo(record, 1, &newbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		newpage;
+		HashPageOpaque nopaque;
+
+		newpage = BufferGetPage(newbuf);
+		nopaque = (HashPageOpaque) PageGetSpecialPointer(newpage);
+
+		nopaque->hasho_flag = xlrec->new_bucket_flag;
+
+		PageSetLSN(newpage, lsn);
+		MarkBufferDirty(newbuf);
+	}
+	if (BufferIsValid(newbuf))
+		UnlockReleaseBuffer(newbuf);
+}
+
+/*
+ * replay move of page contents for squeeze operation of hash index
+ */
+static void
+hash_xlog_move_page_contents(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_move_page_contents *xldata = (xl_hash_move_page_contents *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		writebuf = InvalidBuffer;
+	Buffer		deletebuf = InvalidBuffer;
+	XLogRedoAction action;
+
+	/*
+	 * Ensure that we have a cleanup lock on the primary bucket page before we
+	 * start the actual replay operation.  This guarantees that no scan can
+	 * start, and no scan can already be in progress, while we replay this
+	 * operation.  If scans were allowed during this operation, they could
+	 * miss some records or see the same record multiple times.
+	 */
+	if (xldata->is_prim_bucket_same_wrt)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &writebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bucketbuf))
+			LockBufferForCleanup(bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &writebuf);
+	}
+
+	/* replay the record for adding entries in overflow buffer */
+	if (action == BLK_NEEDS_REDO)
+	{
+		Page		writepage;
+		char	   *begin;
+		char	   *data;
+		Size		datalen;
+		uint16		ninserted = 0;
+
+		data = begin = XLogRecGetBlockData(record, 1, &datalen);
+
+		writepage = (Page) BufferGetPage(writebuf);
+
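+		/*
+		 * The block data starts with an array of xldata->ntups target offset
+		 * numbers, followed by the index tuples themselves; each tuple is
+		 * inserted at its recorded offset.
+		 */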
+		if (xldata->ntups > 0)
+		{
+			OffsetNumber *towrite = (OffsetNumber *) data;
+
+			data += sizeof(OffsetNumber) * xldata->ntups;
+
+			while (data - begin < datalen)
+			{
+				IndexTuple	itup = (IndexTuple) data;
+				Size		itemsz;
+				OffsetNumber l;
+
+				itemsz = IndexTupleDSize(*itup);
+				itemsz = MAXALIGN(itemsz);
+
+				data += itemsz;
+
+				l = PageAddItem(writepage, (Item) itup, itemsz, towrite[ninserted], false, false);
+				if (l == InvalidOffsetNumber)
+					elog(ERROR, "hash_xlog_move_page_contents: failed to add item to hash index page, size %d bytes",
+						 (int) itemsz);
+
+				ninserted++;
+			}
+		}
+
+		/*
+		 * The number of tuples inserted must be the same as requested in the
+		 * REDO record.
+		 */
+		Assert(ninserted == xldata->ntups);
+
+		PageSetLSN(writepage, lsn);
+		MarkBufferDirty(writebuf);
+	}
+
+	/* replay the record for deleting entries from overflow buffer */
+	if (XLogReadBufferForRedo(record, 2, &deletebuf) == BLK_NEEDS_REDO)
+	{
+		Page		page;
+		char	   *ptr;
+		Size		len;
+
+		ptr = XLogRecGetBlockData(record, 2, &len);
+
+		page = (Page) BufferGetPage(deletebuf);
+
+		if (len > 0)
+		{
+			OffsetNumber *unused;
+			OffsetNumber *unend;
+
+			unused = (OffsetNumber *) ptr;
+			unend = (OffsetNumber *) ((char *) ptr + len);
+
+			if ((unend - unused) > 0)
+				PageIndexMultiDelete(page, unused, unend - unused);
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(deletebuf);
+	}
+
+	/*
+	 * Replay is complete, so now we can release the buffers.  We release the
+	 * locks at the end of the replay operation to ensure that the lock on the
+	 * primary bucket page is held until the end of the operation.  We could
+	 * release the lock on the write buffer as soon as its part of the
+	 * operation is complete (when it is not the primary bucket page), but
+	 * that doesn't seem worth complicating the code.
+	 */
+	if (BufferIsValid(deletebuf))
+		UnlockReleaseBuffer(deletebuf);
+
+	if (BufferIsValid(writebuf))
+		UnlockReleaseBuffer(writebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay squeeze page operation of hash index
+ */
+static void
+hash_xlog_squeeze_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_squeeze_page *xldata = (xl_hash_squeeze_page *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		writebuf;
+	Buffer		ovflbuf;
+	Buffer		prevbuf = InvalidBuffer;
+	Buffer		mapbuf;
+	XLogRedoAction action;
+
+	/*
+	 * Ensure that we have a cleanup lock on the primary bucket page before we
+	 * start the actual replay operation.  This guarantees that no scan can
+	 * start, and no scan can already be in progress, while we replay this
+	 * operation.  If scans were allowed during this operation, they could
+	 * miss some records or see the same record multiple times.
+	 */
+	if (xldata->is_prim_bucket_same_wrt)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &writebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bucketbuf))
+			LockBufferForCleanup(bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &writebuf);
+	}
+
+	/* replay the record for adding entries in overflow buffer */
+	if (action == BLK_NEEDS_REDO)
+	{
+		Page		writepage;
+		char	   *begin;
+		char	   *data;
+		Size		datalen;
+		uint16		ninserted = 0;
+
+		data = begin = XLogRecGetBlockData(record, 1, &datalen);
+
+		writepage = (Page) BufferGetPage(writebuf);
+
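+		/*
+		 * As in hash_xlog_move_page_contents, the block data is the array of
+		 * target offset numbers followed by the tuples to insert.
+		 */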
+		if (xldata->ntups > 0)
+		{
+			OffsetNumber *towrite = (OffsetNumber *) data;
+
+			data += sizeof(OffsetNumber) * xldata->ntups;
+
+			while (data - begin < datalen)
+			{
+				IndexTuple	itup = (IndexTuple) data;
+				Size		itemsz;
+				OffsetNumber l;
+
+				itemsz = IndexTupleDSize(*itup);
+				itemsz = MAXALIGN(itemsz);
+
+				data += itemsz;
+
+				l = PageAddItem(writepage, (Item) itup, itemsz, towrite[ninserted], false, false);
+				if (l == InvalidOffsetNumber)
+					elog(ERROR, "hash_xlog_squeeze_page: failed to add item to hash index page, size %d bytes",
+						 (int) itemsz);
+
+				ninserted++;
+			}
+		}
+
+		/*
+		 * The number of tuples inserted must be the same as requested in the
+		 * REDO record.
+		 */
+		Assert(ninserted == xldata->ntups);
+
+		/*
+		 * If the page to which we are adding tuples is the page previous to
+		 * the freed overflow page, update its nextblkno here.
+		 */
+		if (xldata->is_prev_bucket_same_wrt)
+		{
+			HashPageOpaque writeopaque = (HashPageOpaque) PageGetSpecialPointer(writepage);
+
+			writeopaque->hasho_nextblkno = xldata->nextblkno;
+		}
+
+		PageSetLSN(writepage, lsn);
+		MarkBufferDirty(writebuf);
+	}
+
+	/* replay the record for initializing overflow buffer */
+	if (XLogReadBufferForRedo(record, 2, &ovflbuf) == BLK_NEEDS_REDO)
+	{
+		Page		ovflpage;
+
+		ovflpage = BufferGetPage(ovflbuf);
+
+		_hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));
+
+		PageSetLSN(ovflpage, lsn);
+		MarkBufferDirty(ovflbuf);
+	}
+	if (BufferIsValid(ovflbuf))
+		UnlockReleaseBuffer(ovflbuf);
+
+	/* replay the record for page previous to the freed overflow page */
+	if (!xldata->is_prev_bucket_same_wrt &&
+		XLogReadBufferForRedo(record, 3, &prevbuf) == BLK_NEEDS_REDO)
+	{
+		Page		prevpage = BufferGetPage(prevbuf);
+		HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+		prevopaque->hasho_nextblkno = xldata->nextblkno;
+
+		PageSetLSN(prevpage, lsn);
+		MarkBufferDirty(prevbuf);
+	}
+	if (BufferIsValid(prevbuf))
+		UnlockReleaseBuffer(prevbuf);
+
+	/* replay the record for page next to the freed overflow page */
+	if (XLogRecHasBlockRef(record, 4))
+	{
+		Buffer		nextbuf;
+
+		if (XLogReadBufferForRedo(record, 4, &nextbuf) == BLK_NEEDS_REDO)
+		{
+			Page		nextpage = BufferGetPage(nextbuf);
+			HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage);
+
+			nextopaque->hasho_prevblkno = xldata->prevblkno;
+
+			PageSetLSN(nextpage, lsn);
+			MarkBufferDirty(nextbuf);
+		}
+		if (BufferIsValid(nextbuf))
+			UnlockReleaseBuffer(nextbuf);
+	}
+
+	/* replay the record for bitmap page */
+	if (XLogReadBufferForRedo(record, 5, &mapbuf) == BLK_NEEDS_REDO)
+	{
+		Page		mappage = (Page) BufferGetPage(mapbuf);
+		uint32	   *freep = NULL;
+		char	   *data;
+		uint32	   *bitmap_page_bit;
+		Size		datalen;
+
+		freep = HashPageGetBitmap(mappage);
+
+		data = XLogRecGetBlockData(record, 5, &datalen);
+		bitmap_page_bit = (uint32 *) data;
+
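+		/* clear the bit for the freed overflow page, making it free for reuse */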
+		CLRBIT(freep, *bitmap_page_bit);
+
+		PageSetLSN(mappage, lsn);
+		MarkBufferDirty(mapbuf);
+	}
+	if (BufferIsValid(mapbuf))
+		UnlockReleaseBuffer(mapbuf);
+
+	/* replay the record for meta page */
+	if (XLogRecHasBlockRef(record, 6))
+	{
+		Buffer		metabuf;
+
+		if (XLogReadBufferForRedo(record, 6, &metabuf) == BLK_NEEDS_REDO)
+		{
+			HashMetaPage metap;
+			Page		page;
+			char	   *data;
+			uint32	   *firstfree_ovflpage;
+			Size		datalen;
+
+			data = XLogRecGetBlockData(record, 6, &datalen);
+			firstfree_ovflpage = (uint32 *) data;
+
+			page = BufferGetPage(metabuf);
+			metap = HashPageGetMeta(page);
+			metap->hashm_firstfree = *firstfree_ovflpage;
+
+			PageSetLSN(page, lsn);
+			MarkBufferDirty(metabuf);
+		}
+		if (BufferIsValid(metabuf))
+			UnlockReleaseBuffer(metabuf);
+	}
+
+	/*
+	 * We release the locks on writebuf and bucketbuf at the end of the replay
+	 * operation to ensure that the lock on the primary bucket page is held
+	 * until the end of the operation.  We could release the lock on the write
+	 * buffer as soon as its part of the operation is complete (when it is not
+	 * the primary bucket page), but that doesn't seem worth complicating the
+	 * code.
+	 */
+	if (BufferIsValid(writebuf))
+		UnlockReleaseBuffer(writebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay delete operation of hash index
+ */
+static void
+hash_xlog_delete(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_delete *xldata = (xl_hash_delete *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		deletebuf;
+	Page		page;
+	XLogRedoAction action;
+
+	/*
+	 * Ensure that we have a cleanup lock on the primary bucket page before we
+	 * start the actual replay operation.  This guarantees that no scan can
+	 * start, and no scan can already be in progress, while we replay this
+	 * operation.  If scans were allowed during this operation, they could
+	 * miss some records or see the same record multiple times.
+	 */
+	if (xldata->is_primary_bucket_page)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &deletebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bucketbuf))
+			LockBufferForCleanup(bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &deletebuf);
+	}
+
+	/* replay the record for deleting entries in bucket page */
+	if (action == BLK_NEEDS_REDO)
+	{
+		char	   *ptr;
+		Size		len;
+
+		ptr = XLogRecGetBlockData(record, 1, &len);
+
+		page = (Page) BufferGetPage(deletebuf);
+
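+		/* the block data is the array of item offsets to delete, if any */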
+		if (len > 0)
+		{
+			OffsetNumber *unused;
+			OffsetNumber *unend;
+
+			unused = (OffsetNumber *) ptr;
+			unend = (OffsetNumber *) ((char *) ptr + len);
+
+			if ((unend - unused) > 0)
+				PageIndexMultiDelete(page, unused, unend - unused);
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(deletebuf);
+	}
+	if (BufferIsValid(deletebuf))
+		UnlockReleaseBuffer(deletebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay clear garbage flag operation for primary bucket page.
+ */
+static void
+hash_xlog_clear_garbage(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		HashPageOpaque bucket_opaque;
+
+		page = (Page) BufferGetPage(buffer);
+
+		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+		bucket_opaque->hasho_flag &= ~LH_BUCKET_PAGE_HAS_GARBAGE;
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
+
+/*
+ * replay for update meta page
+ */
+static void
+hash_xlog_update_meta_page(XLogReaderState *record)
+{
+	HashMetaPage metap;
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_update_meta_page *xldata = (xl_hash_update_meta_page *) XLogRecGetData(record);
+	Buffer		metabuf;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &metabuf) == BLK_NEEDS_REDO)
+	{
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+
+		metap->hashm_ntuples = xldata->ntuples;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+void
+hash_redo(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			hash_xlog_init_meta_page(record);
+			break;
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			hash_xlog_init_bitmap_page(record);
+			break;
+		case XLOG_HASH_INSERT:
+			hash_xlog_insert(record);
+			break;
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			hash_xlog_addovflpage(record);
+			break;
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			hash_xlog_split_allocpage(record);
+			break;
+		case XLOG_HASH_SPLIT_PAGE:
+			hash_xlog_split_page(record);
+			break;
+		case XLOG_HASH_SPLIT_COMPLETE:
+			hash_xlog_split_complete(record);
+			break;
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			hash_xlog_move_page_contents(record);
+			break;
+		case XLOG_HASH_SQUEEZE_PAGE:
+			hash_xlog_squeeze_page(record);
+			break;
+		case XLOG_HASH_DELETE:
+			hash_xlog_delete(record);
+			break;
+		case XLOG_HASH_CLEAR_GARBAGE:
+			hash_xlog_clear_garbage(record);
+			break;
+		case XLOG_HASH_UPDATE_META_PAGE:
+			hash_xlog_update_meta_page(record);
+			break;
+		default:
+			elog(PANIC, "hash_redo: unknown op code %u", info);
+	}
+}
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 5cfd0aa..3514138 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -16,6 +16,8 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "access/hash_xlog.h"
+#include "miscadmin.h"
 #include "utils/rel.h"
 
 
@@ -44,6 +46,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	uint32		maxbucket;
 	uint32		highmask;
 	uint32		lowmask;
+	OffsetNumber itup_off;
 
 	/*
 	 * Get the hash key for the item (it's stored in the index tuple itself).
@@ -248,31 +251,63 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 		Assert(pageopaque->hasho_bucket == bucket);
 	}
 
-	/* found page with enough space, so add the item here */
-	(void) _hash_pgaddtup(rel, buf, itemsz, itup);
-
-	/*
-	 * write and release the modified page and ensure to release the pin on
-	 * primary page.
-	 */
-	_hash_wrtbuf(rel, buf);
-	if (buf != bucket_buf)
-		_hash_dropbuf(rel, bucket_buf);
-
 	/*
 	 * Write-lock the metapage so we can increment the tuple count. After
 	 * incrementing it, check to see if it's time for a split.
 	 */
 	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
 
+	/* Do the update.  No ereport(ERROR) until changes are logged */
+	START_CRIT_SECTION();
+
+	/* found page with enough space, so add the item here */
+	itup_off = _hash_pgaddtup(rel, buf, itemsz, itup);
+
+	MarkBufferDirty(buf);
+
+	/* metapage operations */
 	metap->hashm_ntuples += 1;
 
 	/* Make sure this stays in sync with _hash_expandtable() */
 	do_expand = metap->hashm_ntuples >
 		(double) metap->hashm_ffactor * (metap->hashm_maxbucket + 1);
 
-	/* Write out the metapage and drop lock, but keep pin */
-	_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
+	MarkBufferDirty(metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_insert xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.offnum = itup_off;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInsert);
+
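+		/*
+		 * Register the metapage as well, so that the hashm_ntuples increment
+		 * can be replayed together with the tuple insertion.
+		 */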
+		XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);
+
+		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+		XLogRegisterBufData(0, (char *) itup, IndexTupleDSize(*itup));
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INSERT);
+
+		PageSetLSN(BufferGetPage(buf), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	/* drop lock on metapage, but keep pin */
+	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+	/*
+	 * Release the modified page and ensure to release the pin on primary
+	 * page.
+	 */
+	_hash_relbuf(rel, buf);
+	if (buf != bucket_buf)
+		_hash_dropbuf(rel, bucket_buf);
 
 	/* Attempt to split if a split is needed */
 	if (do_expand)
@@ -314,3 +349,44 @@ _hash_pgaddtup(Relation rel, Buffer buf, Size itemsize, IndexTuple itup)
 
 	return itup_off;
 }
+
+/*
+ *	_hash_pgaddmultitup() -- add a tuple vector to a particular page in the
+ *							 index.
+ *
+ * This routine has the same requirements for locking and tuple ordering as
+ * _hash_pgaddtup().
+ *
+ * The offset numbers at which the tuples were inserted are returned in the
+ * itup_offsets array.
+ */
+void
+_hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
+					OffsetNumber *itup_offsets, uint16 nitups)
+{
+	OffsetNumber itup_off;
+	Page		page;
+	uint32		hashkey;
+	int			i;
+
+	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+	page = BufferGetPage(buf);
+
+	for (i = 0; i < nitups; i++)
+	{
+		Size		itemsize;
+
+		itemsize = IndexTupleDSize(*itups[i]);
+		itemsize = MAXALIGN(itemsize);
+
+		/* Find where to insert the tuple (preserving page's hashkey ordering) */
+		hashkey = _hash_get_indextuple_hashkey(itups[i]);
+		itup_off = _hash_binsearch(page, hashkey);
+
+		itup_offsets[i] = itup_off;
+
+		if (PageAddItem(page, (Item) itups[i], itemsize, itup_off, false, false)
+			== InvalidOffsetNumber)
+			elog(ERROR, "failed to add index item to \"%s\"",
+				 RelationGetRelationName(rel));
+	}
+}
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 760563a..a6ae979 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -18,10 +18,11 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "access/hash_xlog.h"
+#include "miscadmin.h"
 #include "utils/rel.h"
 
 
-static Buffer _hash_getovflpage(Relation rel, Buffer metabuf);
 static uint32 _hash_firstfreebit(uint32 map);
 
 
@@ -84,7 +85,9 @@ blkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
  *	dropped before exiting (we assume the caller is not interested in 'buf'
  *	anymore) if not asked to retain.  The pin will be retained only for the
  *	primary bucket.  The returned overflow page will be pinned and
- *	write-locked; it is guaranteed to be empty.
+ *	write-locked; it is not guaranteed to be empty, as we release and
+ *	reacquire the lock on it, so the caller must ensure that the page has
+ *	the required space.
  *
  *	The caller must hold a pin, but no lock, on the metapage buffer.
  *	That buffer is returned in the same state.
@@ -102,13 +105,37 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 	Page		ovflpage;
 	HashPageOpaque pageopaque;
 	HashPageOpaque ovflopaque;
-
-	/* allocate and lock an empty overflow page */
-	ovflbuf = _hash_getovflpage(rel, metabuf);
+	HashMetaPage metap;
+	Buffer		mapbuf = InvalidBuffer;
+	Buffer		newmapbuf = InvalidBuffer;
+	BlockNumber blkno;
+	uint32		orig_firstfree;
+	uint32		splitnum;
+	uint32	   *freep = NULL;
+	uint32		max_ovflpg;
+	uint32		bit;
+	uint32		bitmap_page_bit;
+	uint32		first_page;
+	uint32		last_bit;
+	uint32		last_page;
+	uint32		i,
+				j;
+	bool		page_found = false;
 
 	/*
-	 * Write-lock the tail page.  It is okay to hold two buffer locks here
-	 * since there cannot be anyone else contending for access to ovflbuf.
+	 * Write-lock the tail page.  Here we must maintain a strict locking
+	 * order: first acquire the lock on the tail page of the bucket, then the
+	 * lock on the meta page to find and lock a bitmap page; once one is
+	 * found, release the lock on the meta page, and finally acquire the lock
+	 * on the new overflow buffer.  We need this locking order to avoid
+	 * deadlocking with backends that are doing inserts.
+	 *
+	 * Note: we could have avoided locking many buffers here if we made two
+	 * WAL records for acquiring an overflow page (one to allocate an overflow
+	 * page and another to add it to the overflow bucket chain).  However,
+	 * doing so could leak an overflow page if the system crashes after the
+	 * allocation.  A single record is also better from a performance point of
+	 * view.
 	 */
 	_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_WRITE);
 
@@ -136,55 +163,6 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 		buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
 	}
 
-	/* now that we have correct backlink, initialize new overflow page */
-	ovflpage = BufferGetPage(ovflbuf);
-	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
-	ovflopaque->hasho_prevblkno = BufferGetBlockNumber(buf);
-	ovflopaque->hasho_nextblkno = InvalidBlockNumber;
-	ovflopaque->hasho_bucket = pageopaque->hasho_bucket;
-	ovflopaque->hasho_flag = LH_OVERFLOW_PAGE;
-	ovflopaque->hasho_page_id = HASHO_PAGE_ID;
-
-	MarkBufferDirty(ovflbuf);
-
-	/* logically chain overflow page to previous page */
-	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
-	if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
-		_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
-	else
-		_hash_wrtbuf(rel, buf);
-
-	return ovflbuf;
-}
-
-/*
- *	_hash_getovflpage()
- *
- *	Find an available overflow page and return it.  The returned buffer
- *	is pinned and write-locked, and has had _hash_pageinit() applied,
- *	but it is caller's responsibility to fill the special space.
- *
- * The caller must hold a pin, but no lock, on the metapage buffer.
- * That buffer is left in the same state at exit.
- */
-static Buffer
-_hash_getovflpage(Relation rel, Buffer metabuf)
-{
-	HashMetaPage metap;
-	Buffer		mapbuf = 0;
-	Buffer		newbuf;
-	BlockNumber blkno;
-	uint32		orig_firstfree;
-	uint32		splitnum;
-	uint32	   *freep = NULL;
-	uint32		max_ovflpg;
-	uint32		bit;
-	uint32		first_page;
-	uint32		last_bit;
-	uint32		last_page;
-	uint32		i,
-				j;
-
 	/* Get exclusive lock on the meta page */
 	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
 
@@ -233,11 +211,31 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 		for (; bit <= last_inpage; j++, bit += BITS_PER_MAP)
 		{
 			if (freep[j] != ALL_SET)
+			{
+				page_found = true;
+
+				/* Reacquire exclusive lock on the meta page */
+				_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
+
+				/* convert bit to bit number within page */
+				bit += _hash_firstfreebit(freep[j]);
+				bitmap_page_bit = bit;
+
+				/* convert bit to absolute bit number */
+				bit += (i << BMPG_SHIFT(metap));
+				/* Calculate address of the recycled overflow page */
+				blkno = bitno_to_blkno(metap, bit);
+
+				/* Fetch and init the recycled page */
+				ovflbuf = _hash_getinitbuf(rel, blkno);
+
 				goto found;
+			}
 		}
 
 		/* No free space here, try to advance to next map page */
 		_hash_relbuf(rel, mapbuf);
+		mapbuf = InvalidBuffer;
 		i++;
 		j = 0;					/* scan from start of next map page */
 		bit = 0;
@@ -261,8 +259,15 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 		 * convenient to pre-mark them as "in use" too.
 		 */
 		bit = metap->hashm_spares[splitnum];
-		_hash_initbitmap(rel, metap, bitno_to_blkno(metap, bit), MAIN_FORKNUM);
-		metap->hashm_spares[splitnum]++;
+		newmapbuf = _hash_getnewbuf(rel, bitno_to_blkno(metap, bit), MAIN_FORKNUM);
+
+		/* add the new bitmap page to the metapage's list of bitmaps */
+		/* metapage already has a write lock */
+		if (metap->hashm_nmaps >= HASH_MAX_BITMAPS)
+			ereport(ERROR,
+					(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+					 errmsg("out of overflow pages in hash index \"%s\"",
+							RelationGetRelationName(rel))));
 	}
 	else
 	{
@@ -273,7 +278,8 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 	}
 
 	/* Calculate address of the new overflow page */
-	bit = metap->hashm_spares[splitnum];
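+	/*
+	 * If a new bitmap page is being added, it occupies the block at the
+	 * current spares count, so the new overflow page lands one block further
+	 * out.
+	 */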
+	bit = BufferIsValid(newmapbuf) ?
+		metap->hashm_spares[splitnum] + 1 : metap->hashm_spares[splitnum];
 	blkno = bitno_to_blkno(metap, bit);
 
 	/*
@@ -281,39 +287,51 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 	 * relation length stays in sync with ours.  XXX It's annoying to do this
 	 * with metapage write lock held; would be better to use a lock that
 	 * doesn't block incoming searches.
+	 *
+	 * It is okay to hold two buffer locks here (one on the tail page of the
+	 * bucket and the other on the new overflow page), since no one else can
+	 * be contending for access to ovflbuf.
 	 */
-	newbuf = _hash_getnewbuf(rel, blkno, MAIN_FORKNUM);
+	ovflbuf = _hash_getnewbuf(rel, blkno, MAIN_FORKNUM);
 
-	metap->hashm_spares[splitnum]++;
+found:
 
 	/*
-	 * Adjust hashm_firstfree to avoid redundant searches.  But don't risk
-	 * changing it if someone moved it while we were searching bitmap pages.
+	 * Do the update.  No ereport(ERROR) until changes are logged.  We want to
+	 * log the changes for the bitmap page and the overflow page together, to
+	 * avoid losing pages in case a new page had to be added for either of
+	 * them.
 	 */
-	if (metap->hashm_firstfree == orig_firstfree)
-		metap->hashm_firstfree = bit + 1;
-
-	/* Write updated metapage and release lock, but not pin */
-	_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
+	START_CRIT_SECTION();
 
-	return newbuf;
-
-found:
-	/* convert bit to bit number within page */
-	bit += _hash_firstfreebit(freep[j]);
+	if (page_found)
+	{
+		Assert(BufferIsValid(mapbuf));
 
-	/* mark page "in use" in the bitmap */
-	SETBIT(freep, bit);
-	_hash_wrtbuf(rel, mapbuf);
+		/* mark page "in use" in the bitmap */
+		SETBIT(freep, bitmap_page_bit);
+		MarkBufferDirty(mapbuf);
+	}
+	else
+	{
+		/* update the count to indicate new overflow page is added */
+		metap->hashm_spares[splitnum]++;
 
-	/* Reacquire exclusive lock on the meta page */
-	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
+		if (BufferIsValid(newmapbuf))
+		{
+			_hash_initbitmapbuffer(newmapbuf, metap->hashm_bmsize, false);
+			MarkBufferDirty(newmapbuf);
 
-	/* convert bit to absolute bit number */
-	bit += (i << BMPG_SHIFT(metap));
+			metap->hashm_mapp[metap->hashm_nmaps] = BufferGetBlockNumber(newmapbuf);
+			metap->hashm_nmaps++;
+			metap->hashm_spares[splitnum]++;
+			MarkBufferDirty(metabuf);
+		}
 
-	/* Calculate address of the recycled overflow page */
-	blkno = bitno_to_blkno(metap, bit);
+		/*
+		 * For a new overflow page, we don't need to explicitly set the bit in
+		 * the bitmap page, as by default it will be set to "in use".
+		 */
+	}
 
 	/*
 	 * Adjust hashm_firstfree to avoid redundant searches.  But don't risk
@@ -322,18 +340,101 @@ found:
 	if (metap->hashm_firstfree == orig_firstfree)
 	{
 		metap->hashm_firstfree = bit + 1;
-
-		/* Write updated metapage and release lock, but not pin */
-		_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
+		MarkBufferDirty(metabuf);
 	}
-	else
+
+	/* now that we have correct backlink, initialize new overflow page */
+	ovflpage = BufferGetPage(ovflbuf);
+	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
+	ovflopaque->hasho_prevblkno = BufferGetBlockNumber(buf);
+	ovflopaque->hasho_nextblkno = InvalidBlockNumber;
+	ovflopaque->hasho_bucket = pageopaque->hasho_bucket;
+	ovflopaque->hasho_flag = LH_OVERFLOW_PAGE;
+	ovflopaque->hasho_page_id = HASHO_PAGE_ID;
+
+	MarkBufferDirty(ovflbuf);
+
+	/* logically chain overflow page to previous page */
+	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
+
+	MarkBufferDirty(buf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
 	{
-		/* We didn't change the metapage, so no need to write */
-		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+		XLogRecPtr	recptr;
+		xl_hash_addovflpage xlrec;
+
+		xlrec.bmpage_found = page_found;
+		xlrec.bmsize = metap->hashm_bmsize;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashAddOvflPage);
+
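+		/*
+		 * The new overflow page is registered with REGBUF_WILL_INIT, so it is
+		 * reinitialized from scratch during replay; the owning bucket number
+		 * is the only per-page data we need to log for it.
+		 */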
+		XLogRegisterBuffer(0, ovflbuf, REGBUF_WILL_INIT);
+		XLogRegisterBufData(0, (char *) &pageopaque->hasho_bucket, sizeof(Bucket));
+
+		XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
+
+		if (BufferIsValid(mapbuf))
+		{
+			/*
+			 * The bitmap page doesn't have a standard page layout, so we
+			 * register it with REGBUF_NO_IMAGE and log the changed bit as
+			 * block data instead.
+			 */
+			XLogRegisterBuffer(2, mapbuf, REGBUF_STANDARD | REGBUF_NO_IMAGE);
+			XLogRegisterBufData(2, (char *) &bitmap_page_bit, sizeof(uint32));
+		}
+
+		if (BufferIsValid(newmapbuf))
+			XLogRegisterBuffer(3, newmapbuf, REGBUF_WILL_INIT);
+
+		/*
+		 * To replay the meta page changes, we could log the entire metapage
+		 * (which doesn't seem advisable, considering its size), or we could
+		 * log the individual updated values.  Instead, we prefer to perform
+		 * during replay exactly the same operations on the metapage as are
+		 * done during the original operation.  That is straightforward and
+		 * has the advantage of using less space in WAL.
+		 */
+		XLogRegisterBuffer(4, metabuf, REGBUF_STANDARD);
+		XLogRegisterBufData(4, (char *) &metap->hashm_firstfree, sizeof(uint32));
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_ADD_OVFL_PAGE);
+
+		PageSetLSN(BufferGetPage(ovflbuf), recptr);
+		PageSetLSN(BufferGetPage(buf), recptr);
+
+		if (BufferIsValid(mapbuf))
+			PageSetLSN(BufferGetPage(mapbuf), recptr);
+
+		if (BufferIsValid(newmapbuf))
+			PageSetLSN(BufferGetPage(newmapbuf), recptr);
 	}
 
-	/* Fetch, init, and return the recycled page */
-	return _hash_getinitbuf(rel, blkno);
+	END_CRIT_SECTION();
+
+	if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+		_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+	else
+		_hash_relbuf(rel, buf);
+
+	if (BufferIsValid(mapbuf))
+		_hash_relbuf(rel, mapbuf);
+
+	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+	if (BufferIsValid(newmapbuf))
+		_hash_relbuf(rel, newmapbuf);
+
+	/*
+	 * We need to release and reacquire the lock on the overflow buffer to
+	 * ensure that the standby doesn't see an intermediate state of it.
+	 */
+	_hash_chgbufaccess(rel, ovflbuf, HASH_READ, HASH_NOLOCK);
+	_hash_chgbufaccess(rel, ovflbuf, HASH_NOLOCK, HASH_WRITE);
+
+	return ovflbuf;
 }
 
 /*
@@ -366,6 +467,12 @@ _hash_firstfreebit(uint32 map)
  *	Remove this overflow page from its bucket's chain, and mark the page as
  *	free.  On entry, ovflbuf is write-locked; it is released before exiting.
  *
+ *	Add the tuples (itups) to wbuf in this function; we could do that in the
+ *	caller as well.  The advantage of doing it here is that we can easily
+ *	write the WAL record for the XLOG_HASH_SQUEEZE_PAGE operation.  Adding
+ *	the tuples and removing the overflow page have to be done as one atomic
+ *	operation; otherwise, during replay on the standby, users might find
+ *	duplicate records.
+ *
  *	Since this function is invoked in VACUUM, we provide an access strategy
  *	parameter that controls fetches of the bucket pages.
  *
@@ -377,7 +484,9 @@ _hash_firstfreebit(uint32 map)
  *	better hold cleanup lock on the primary bucket.
  */
 BlockNumber
-_hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
+_hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
+				   Buffer wbuf, IndexTuple *itups, OffsetNumber *itup_offsets,
+				   Size *tups_size, uint16 nitups,
 				   BufferAccessStrategy bstrategy)
 {
 	HashMetaPage metap;
@@ -387,6 +496,7 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
 	BlockNumber prevblkno;
 	BlockNumber blkno;
 	BlockNumber nextblkno;
+	BlockNumber writeblkno;
 	HashPageOpaque ovflopaque;
 	Page		ovflpage;
 	Page		mappage;
@@ -395,6 +505,10 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
 	int32		bitmappage,
 				bitmapbit;
 	Bucket bucket PG_USED_FOR_ASSERTS_ONLY;
+	Buffer		prevbuf = InvalidBuffer;
+	Buffer		nextbuf = InvalidBuffer;
+	BlockNumber bucketblkno;
+	bool		update_metap = false;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -403,69 +517,37 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
 	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
 	nextblkno = ovflopaque->hasho_nextblkno;
 	prevblkno = ovflopaque->hasho_prevblkno;
+	writeblkno = BufferGetBlockNumber(wbuf);
 	bucket = ovflopaque->hasho_bucket;
-
-	/*
-	 * Zero the page for debugging's sake; then write and release it. (Note:
-	 * if we failed to zero the page here, we'd have problems with the Assert
-	 * in _hash_pageinit() when the page is reused.)
-	 */
-	MemSet(ovflpage, 0, BufferGetPageSize(ovflbuf));
-	_hash_wrtbuf(rel, ovflbuf);
+	bucketblkno = BufferGetBlockNumber(bucketbuf);
 
 	/*
 	 * Fix up the bucket chain.  this is a doubly-linked list, so we must fix
 	 * up the bucket chain members behind and ahead of the overflow page being
 	 * deleted.  No concurrency issues since we hold the cleanup lock on
 	 * primary bucket.  We don't need to aqcuire buffer lock to fix the
-	 * primary bucket, as we already have that lock.
+	 * primary bucket, or if the previous bucket is the same as the write
+	 * bucket, as we already have locks on those buckets.
 	 */
 	if (BlockNumberIsValid(prevblkno))
 	{
-		if (prevblkno == bucket_blkno)
-		{
-			Buffer		prevbuf = ReadBufferExtended(rel, MAIN_FORKNUM,
-													 prevblkno,
-													 RBM_NORMAL,
-													 bstrategy);
-
-			Page		prevpage = BufferGetPage(prevbuf);
-			HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
-
-			Assert(prevopaque->hasho_bucket == bucket);
-			prevopaque->hasho_nextblkno = nextblkno;
-			MarkBufferDirty(prevbuf);
-			ReleaseBuffer(prevbuf);
-		}
+		if (prevblkno == bucketblkno)
+			prevbuf = bucketbuf;
+		else if (prevblkno == writeblkno)
+			prevbuf = wbuf;
 		else
-		{
-			Buffer		prevbuf = _hash_getbuf_with_strategy(rel,
-															 prevblkno,
-															 HASH_WRITE,
-										   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
-															 bstrategy);
-			Page		prevpage = BufferGetPage(prevbuf);
-			HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
-
-			Assert(prevopaque->hasho_bucket == bucket);
-			prevopaque->hasho_nextblkno = nextblkno;
-			_hash_wrtbuf(rel, prevbuf);
-		}
+			prevbuf = _hash_getbuf_with_strategy(rel,
+												 prevblkno,
+												 HASH_WRITE,
+												 LH_OVERFLOW_PAGE,
+												 bstrategy);
 	}
 	if (BlockNumberIsValid(nextblkno))
-	{
-		Buffer		nextbuf = _hash_getbuf_with_strategy(rel,
-														 nextblkno,
-														 HASH_WRITE,
-														 LH_OVERFLOW_PAGE,
-														 bstrategy);
-		Page		nextpage = BufferGetPage(nextbuf);
-		HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage);
-
-		Assert(nextopaque->hasho_bucket == bucket);
-		nextopaque->hasho_prevblkno = prevblkno;
-		_hash_wrtbuf(rel, nextbuf);
-	}
+		nextbuf = _hash_getbuf_with_strategy(rel,
+											 nextblkno,
+											 HASH_WRITE,
+											 LH_OVERFLOW_PAGE,
+											 bstrategy);
 
 	/* Note: bstrategy is intentionally not used for metapage and bitmap */
 
@@ -491,60 +573,191 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
 	mappage = BufferGetPage(mapbuf);
 	freep = HashPageGetBitmap(mappage);
 	Assert(ISSET(freep, bitmapbit));
-	CLRBIT(freep, bitmapbit);
-	_hash_wrtbuf(rel, mapbuf);
 
 	/* Get write-lock on metapage to update firstfree */
 	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
 
+	/* This operation needs to log multiple tuples, so prepare WAL for that */
+	if (RelationNeedsWAL(rel))
+		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, XLR_NORMAL_RDATAS + nitups);
+
+	START_CRIT_SECTION();
+
+	/*
+	 * we have to insert tuples on the "write" page, being careful to preserve
+	 * hashkey ordering.  (If we insert many tuples into the same "write" page
+	 * it would be worth qsort'ing them).
+	 */
+	if (nitups > 0)
+	{
+		_hash_pgaddmultitup(rel, wbuf, itups, itup_offsets, nitups);
+		MarkBufferDirty(wbuf);
+	}
+
+	/*
+	 * Initialize the freed overflow page.  We can't completely zero the page
+	 * here, as WAL replay routines expect pages to be initialized.  See the
+	 * explanation of RBM_NORMAL mode atop XLogReadBufferExtended.
+	 */
+	_hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));
+	MarkBufferDirty(ovflbuf);
+
+	if (BufferIsValid(prevbuf))
+	{
+		Page		prevpage = BufferGetPage(prevbuf);
+		HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+		Assert(prevopaque->hasho_bucket == bucket);
+		prevopaque->hasho_nextblkno = nextblkno;
+		MarkBufferDirty(prevbuf);
+	}
+
+	if (BufferIsValid(nextbuf))
+	{
+		Page		nextpage = BufferGetPage(nextbuf);
+		HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage);
+
+		Assert(nextopaque->hasho_bucket == bucket);
+		nextopaque->hasho_prevblkno = prevblkno;
+		MarkBufferDirty(nextbuf);
+	}
+
+	CLRBIT(freep, bitmapbit);
+	MarkBufferDirty(mapbuf);
+
 	/* if this is now the first free page, update hashm_firstfree */
 	if (ovflbitno < metap->hashm_firstfree)
 	{
 		metap->hashm_firstfree = ovflbitno;
-		_hash_wrtbuf(rel, metabuf);
+		update_metap = true;
+		MarkBufferDirty(metabuf);
 	}
-	else
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
 	{
-		/* no need to change metapage */
-		_hash_relbuf(rel, metabuf);
+		xl_hash_squeeze_page xlrec;
+		XLogRecPtr	recptr;
+		int			i;
+
+		xlrec.prevblkno = prevblkno;
+		xlrec.nextblkno = nextblkno;
+		xlrec.ntups = nitups;
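+
+		/*
+		 * Record whether the write page is the same as the primary bucket
+		 * page and/or the page previous to the freed overflow page; replay
+		 * uses these flags to know whether those blocks were registered
+		 * separately.
+		 */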
+		xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf) ? true : false;
+		xlrec.is_prev_bucket_same_wrt = (wbuf == prevbuf) ? true : false;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashSqueezePage);
+
+		/*
+		 * bucket buffer needs to be registered to ensure that we can acquire
+		 * a cleanup lock on it during replay.
+		 */
+		if (!xlrec.is_prim_bucket_same_wrt)
+			XLogRegisterBuffer(0, bucketbuf, REGBUF_STANDARD);
+
+		XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
+		if (xlrec.ntups > 0)
+		{
+			XLogRegisterBufData(1, (char *) itup_offsets,
+								nitups * sizeof(OffsetNumber));
+			for (i = 0; i < nitups; i++)
+				XLogRegisterBufData(1, (char *) itups[i], tups_size[i]);
+		}
+
+		XLogRegisterBuffer(2, ovflbuf, REGBUF_STANDARD);
+
+		/*
+		 * If prevpage and the writepage (the block into which we are moving
+		 * tuples from the overflow page) are the same, there is no need to
+		 * register prevpage separately.  During replay, we can directly
+		 * update the next block number in the writepage.
+		 */
+		if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt)
+			XLogRegisterBuffer(3, prevbuf, REGBUF_STANDARD);
+
+		if (BufferIsValid(nextbuf))
+			XLogRegisterBuffer(4, nextbuf, REGBUF_STANDARD);
+
+		/*
+		 * The bitmap page doesn't have a standard page layout, so we register
+		 * it with REGBUF_NO_IMAGE and log the changed bit as block data
+		 * instead.
+		 */
+		XLogRegisterBuffer(5, mapbuf, REGBUF_STANDARD | REGBUF_NO_IMAGE);
+		XLogRegisterBufData(5, (char *) &bitmapbit, sizeof(uint32));
+
+		if (update_metap)
+		{
+			XLogRegisterBuffer(6, metabuf, REGBUF_STANDARD);
+			XLogRegisterBufData(6, (char *) &metap->hashm_firstfree, sizeof(uint32));
+		}
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SQUEEZE_PAGE);
+
+		PageSetLSN(BufferGetPage(wbuf), recptr);
+		PageSetLSN(BufferGetPage(ovflbuf), recptr);
+
+		if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt)
+			PageSetLSN(BufferGetPage(prevbuf), recptr);
+		if (BufferIsValid(nextbuf))
+			PageSetLSN(BufferGetPage(nextbuf), recptr);
+
+		PageSetLSN(BufferGetPage(mapbuf), recptr);
+
+		if (update_metap)
+			PageSetLSN(BufferGetPage(metabuf), recptr);
 	}
 
+	END_CRIT_SECTION();
+
+	/*
+	 * Release the lock; the caller will decide whether to release the pin or
+	 * reacquire the lock.  The caller is responsible for locking and
+	 * unlocking the bucket buffer.
+	 */
+	if (BufferGetBlockNumber(wbuf) != bucketblkno)
+		_hash_chgbufaccess(rel, wbuf, HASH_READ, HASH_NOLOCK);
+
+	if (BufferIsValid(ovflbuf))
+		_hash_relbuf(rel, ovflbuf);
+
+	/* release previous bucket if it is not same as primary or write bucket */
+	if (BufferIsValid(prevbuf) &&
+		prevblkno != bucketblkno &&
+		prevblkno != writeblkno)
+		_hash_relbuf(rel, prevbuf);
+
+	if (BufferIsValid(nextbuf))
+		_hash_relbuf(rel, nextbuf);
+
+	if (BufferIsValid(mapbuf))
+		_hash_relbuf(rel, mapbuf);
+
+	if (BufferIsValid(metabuf))
+		_hash_relbuf(rel, metabuf);
+
 	return nextblkno;
 }
 
-
 /*
- *	_hash_initbitmap()
- *
- *	 Initialize a new bitmap page.  The metapage has a write-lock upon
- *	 entering the function, and must be written by caller after return.
+ *	_hash_initbitmapbuffer()
  *
- * 'blkno' is the block number of the new bitmap page.
- *
- * All bits in the new bitmap page are set to "1", indicating "in use".
+ *	 Initialize a new bitmap page.  All bits in the new bitmap page are set to
+ *	 "1", indicating "in use".
  */
 void
-_hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
-				 ForkNumber forkNum)
+_hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage)
 {
-	Buffer		buf;
 	Page		pg;
 	HashPageOpaque op;
 	uint32	   *freep;
 
-	/*
-	 * It is okay to write-lock the new bitmap page while holding metapage
-	 * write lock, because no one else could be contending for the new page.
-	 * Also, the metapage lock makes it safe to extend the index using
-	 * _hash_getnewbuf.
-	 *
-	 * There is some loss of concurrency in possibly doing I/O for the new
-	 * page while holding the metapage lock, but this path is taken so seldom
-	 * that it's not worth worrying about.
-	 */
-	buf = _hash_getnewbuf(rel, blkno, forkNum);
 	pg = BufferGetPage(buf);
 
+	/* initialize the page */
+	if (initpage)
+		_hash_pageinit(pg, BufferGetPageSize(buf));
+
 	/* initialize the page's special space */
 	op = (HashPageOpaque) PageGetSpecialPointer(pg);
 	op->hasho_prevblkno = InvalidBlockNumber;
@@ -555,25 +768,9 @@ _hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
 
 	/* set all of the bits to 1 */
 	freep = HashPageGetBitmap(pg);
-	MemSet(freep, 0xFF, BMPGSZ_BYTE(metap));
-
-	/* write out the new bitmap page (releasing write lock and pin) */
-	_hash_wrtbuf(rel, buf);
-
-	/* add the new bitmap page to the metapage's list of bitmaps */
-	/* metapage already has a write lock */
-	if (metap->hashm_nmaps >= HASH_MAX_BITMAPS)
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("out of overflow pages in hash index \"%s\"",
-						RelationGetRelationName(rel))));
-
-	metap->hashm_mapp[metap->hashm_nmaps] = blkno;
-
-	metap->hashm_nmaps++;
+	MemSet(freep, 0xFF, bmsize);
 }
 
-
 /*
  *	_hash_squeezebucket(rel, bucket)
  *
@@ -613,8 +810,6 @@ _hash_squeezebucket(Relation rel,
 	Page		rpage;
 	HashPageOpaque wopaque;
 	HashPageOpaque ropaque;
-	bool		wbuf_dirty;
-	bool		release_buf = false;
 
 	/*
 	 * start squeezing into the base bucket page.
@@ -657,23 +852,30 @@ _hash_squeezebucket(Relation rel,
 	/*
 	 * squeeze the tuples.
 	 */
-	wbuf_dirty = false;
 	for (;;)
 	{
 		OffsetNumber roffnum;
 		OffsetNumber maxroffnum;
 		OffsetNumber deletable[MaxOffsetNumber];
-		int			ndeletable = 0;
-
+		IndexTuple	itups[MaxIndexTuplesPerPage];
+		Size		tups_size[MaxIndexTuplesPerPage];
+		IndexTuple	itup;
+		OffsetNumber *itup_offsets;
+		uint16		ndeletable = 0;
+		uint16		nitups = 0;
+		Size		all_tups_size = 0;
+		Size		itemsz;
+		int			i;
+
+		itup_offsets = (OffsetNumber *) palloc(MaxIndexTuplesPerPage * sizeof(OffsetNumber));
+
+readpage:
 		/* Scan each tuple in "read" page */
 		maxroffnum = PageGetMaxOffsetNumber(rpage);
 		for (roffnum = FirstOffsetNumber;
 			 roffnum <= maxroffnum;
 			 roffnum = OffsetNumberNext(roffnum))
 		{
-			IndexTuple	itup;
-			Size		itemsz;
-
 			itup = (IndexTuple) PageGetItem(rpage,
 											PageGetItemId(rpage, roffnum));
 			itemsz = IndexTupleDSize(*itup);
@@ -681,39 +883,113 @@ _hash_squeezebucket(Relation rel,
 
 			/*
 			 * Walk up the bucket chain, looking for a page big enough for
-			 * this item.  Exit if we reach the read page.
+			 * this item and all other accumulated items.  Exit if we reach
+			 * the read page.
 			 */
-			while (PageGetFreeSpace(wpage) < itemsz)
+			while (PageGetFreeSpaceForMulTups(wpage, nitups + 1) < (all_tups_size + itemsz))
 			{
+				bool		release_wbuf = false;
+				bool		tups_moved = false;
+
 				Assert(!PageIsEmpty(wpage));
 
+				/*
+				 * the caller is responsible for locking and unlocking the
+				 * bucket buffer
+				 */
 				if (wblkno != bucket_blkno)
-					release_buf = true;
+					release_wbuf = true;
 
 				wblkno = wopaque->hasho_nextblkno;
 				Assert(BlockNumberIsValid(wblkno));
 
-				if (wbuf_dirty && release_buf)
-					_hash_wrtbuf(rel, wbuf);
-				else if (wbuf_dirty)
+				if (nitups > 0)
+				{
+					Assert(nitups == ndeletable);
+
+					/*
+					 * This operation needs to log multiple tuples, so prepare
+					 * WAL for that.
+					 */
+					if (RelationNeedsWAL(rel))
+						XLogEnsureRecordSpace(0, XLR_NORMAL_RDATAS + nitups);
+
+					START_CRIT_SECTION();
+
+					/*
+					 * we have to insert tuples on the "write" page, being
+					 * careful to preserve hashkey ordering.  (If we insert
+					 * many tuples into the same "write" page it would be
+					 * worth qsort'ing them).
+					 */
+					_hash_pgaddmultitup(rel, wbuf, itups, itup_offsets, nitups);
 					MarkBufferDirty(wbuf);
-				else if (release_buf)
+
+					/* Delete tuples we already moved off read page */
+					PageIndexMultiDelete(rpage, deletable, ndeletable);
+					MarkBufferDirty(rbuf);
+
+					/* XLOG stuff */
+					if (RelationNeedsWAL(rel))
+					{
+						XLogRecPtr	recptr;
+						xl_hash_move_page_contents xlrec;
+
+						xlrec.ntups = nitups;
+						xlrec.is_prim_bucket_same_wrt = (wbuf == bucket_buf) ? true : false;
+
+						XLogBeginInsert();
+						XLogRegisterData((char *) &xlrec, SizeOfHashMovePageContents);
+
+						/*
+						 * bucket buffer needs to be registered to ensure that
+						 * we can acquire a cleanup lock on it during replay.
+						 */
+						if (!xlrec.is_prim_bucket_same_wrt)
+							XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+						XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
+						XLogRegisterBufData(1, (char *) itup_offsets,
+											nitups * sizeof(OffsetNumber));
+						for (i = 0; i < nitups; i++)
+							XLogRegisterBufData(1, (char *) itups[i], tups_size[i]);
+
+						XLogRegisterBuffer(2, rbuf, REGBUF_STANDARD);
+						XLogRegisterBufData(2, (char *) deletable,
+										  ndeletable * sizeof(OffsetNumber));
+
+						recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_MOVE_PAGE_CONTENTS);
+
+						PageSetLSN(BufferGetPage(wbuf), recptr);
+						PageSetLSN(BufferGetPage(rbuf), recptr);
+					}
+
+					END_CRIT_SECTION();
+
+					tups_moved = true;
+				}
+
+				if (release_wbuf)
 					_hash_relbuf(rel, wbuf);
 
+				/*
+				 * We need to release, and if required reacquire, the lock on
+				 * rbuf to ensure that the standby doesn't see an intermediate
+				 * state of it.  If we didn't release the lock, then after
+				 * replay of XLOG_HASH_SQUEEZE_PAGE on the standby, users
+				 * would be able to see the results of a partial deletion on
+				 * rblkno.
+				 */
+				_hash_chgbufaccess(rel, rbuf, HASH_READ, HASH_NOLOCK);
+
 				/* nothing more to do if we reached the read page */
 				if (rblkno == wblkno)
 				{
-					if (ndeletable > 0)
-					{
-						/* Delete tuples we already moved off read page */
-						PageIndexMultiDelete(rpage, deletable, ndeletable);
-						_hash_wrtbuf(rel, rbuf);
-					}
-					else
-						_hash_relbuf(rel, rbuf);
+					_hash_dropbuf(rel, rbuf);
 					return;
 				}
 
+				_hash_chgbufaccess(rel, rbuf, HASH_NOLOCK, HASH_WRITE);
+
 				wbuf = _hash_getbuf_with_strategy(rel,
 												  wblkno,
 												  HASH_WRITE,
@@ -722,21 +998,33 @@ _hash_squeezebucket(Relation rel,
 				wpage = BufferGetPage(wbuf);
 				wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
 				Assert(wopaque->hasho_bucket == bucket);
-				wbuf_dirty = false;
-				release_buf = false;
-			}
 
-			/*
-			 * we have found room so insert on the "write" page, being careful
-			 * to preserve hashkey ordering.  (If we insert many tuples into
-			 * the same "write" page it would be worth qsort'ing instead of
-			 * doing repeated _hash_pgaddtup.)
-			 */
-			(void) _hash_pgaddtup(rel, wbuf, itemsz, itup);
-			wbuf_dirty = true;
+				/* be tidy */
+				for (i = 0; i < nitups; i++)
+					pfree(itups[i]);
+				nitups = 0;
+				all_tups_size = 0;
+				ndeletable = 0;
+
+				/*
+				 * after moving the tuples, rpage would have been compacted,
+				 * so we need to rescan it.
+				 */
+				if (tups_moved)
+					goto readpage;
+			}
 
 			/* remember tuple for deletion from "read" page */
 			deletable[ndeletable++] = roffnum;
+
+			/*
+			 * We need a copy of the index tuples, as they can be freed when
+			 * the overflow page is freed; we still need them to write the WAL
+			 * record in _hash_freeovflpage.
+			 */
+			itups[nitups] = CopyIndexTuple(itup);
+			tups_size[nitups++] = itemsz;
+			all_tups_size += itemsz;
 		}
 
 		/*
@@ -748,34 +1036,36 @@ _hash_squeezebucket(Relation rel,
 		 * Tricky point here: if our read and write pages are adjacent in the
 		 * bucket chain, our write lock on wbuf will conflict with
 		 * _hash_freeovflpage's attempt to update the sibling links of the
-		 * removed page.  However, in that case we are done anyway, so we can
-		 * simply drop the write lock before calling _hash_freeovflpage.
+		 * removed page.  However, in that case we ensure that
+		 * _hash_freeovflpage doesn't take a lock on that page again.
+		 * Releasing the lock is not an option, because we first need to write
+		 * WAL for the changes to this page.
 		 */
 		rblkno = ropaque->hasho_prevblkno;
 		Assert(BlockNumberIsValid(rblkno));
 
+		/* free this overflow page (releases rbuf) */
+		_hash_freeovflpage(rel, bucket_buf, rbuf, wbuf, itups, itup_offsets,
+						   tups_size, nitups, bstrategy);
+
+		/* be tidy */
+		for (i = 0; i < nitups; i++)
+			pfree(itups[i]);
+
+		pfree(itup_offsets);
+
 		/* are we freeing the page adjacent to wbuf? */
 		if (rblkno == wblkno)
 		{
+			/* Caller is responsibile for locking and unlocking bucket buffer */
 			if (wblkno != bucket_blkno)
-				release_buf = true;
-
-			/* yes, so release wbuf lock first if needed */
-			if (wbuf_dirty && release_buf)
-				_hash_wrtbuf(rel, wbuf);
-			else if (wbuf_dirty)
-				MarkBufferDirty(wbuf);
-			else if (release_buf)
-				_hash_relbuf(rel, wbuf);
-
-			/* free this overflow page (releases rbuf) */
-			_hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
-			/* done */
+				_hash_dropbuf(rel, wbuf);
 			return;
 		}
 
-		/* free this overflow page, then get the previous one */
-		_hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
+		/* Caller is responsibile for locking and unlocking bucket buffer */
+		if (wblkno != bucket_blkno)
+			_hash_chgbufaccess(rel, wbuf, HASH_NOLOCK, HASH_WRITE);
 
 		rbuf = _hash_getbuf_with_strategy(rel,
 										  rblkno,
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index f51c313..ec83913 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -29,6 +29,7 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "access/hash_xlog.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
@@ -40,12 +41,11 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
 				  Bucket obucket, Bucket nbucket,
 				  Buffer obuf,
 				  Buffer nbuf,
+				  HTAB *htab,
 				  uint32 maxbucket,
 				  uint32 highmask, uint32 lowmask);
-static void _hash_splitbucket_guts(Relation rel, Buffer metabuf,
-					   Bucket obucket, Bucket nbucket, Buffer obuf,
-					   Buffer nbuf, HTAB *htab, uint32 maxbucket,
-					   uint32 highmask, uint32 lowmask);
+static void
+			log_split_page(Relation rel, Buffer buf);
 
 
 /*
@@ -160,6 +160,29 @@ _hash_getinitbuf(Relation rel, BlockNumber blkno)
 }
 
 /*
+ *	_hash_initbuf() -- Get and initialize a buffer by bucket number.
+ */
+void
+_hash_initbuf(Buffer buf, uint32 num_bucket, uint32 flag, bool initpage)
+{
+	HashPageOpaque pageopaque;
+	Page		page;
+
+	page = BufferGetPage(buf);
+
+	/* initialize the page */
+	if (initpage)
+		_hash_pageinit(page, BufferGetPageSize(buf));
+
+	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+	pageopaque->hasho_prevblkno = InvalidBlockNumber;
+	pageopaque->hasho_nextblkno = InvalidBlockNumber;
+	pageopaque->hasho_bucket = num_bucket;
+	pageopaque->hasho_flag = flag;
+	pageopaque->hasho_page_id = HASHO_PAGE_ID;
+}
+
+/*
  *	_hash_getnewbuf() -- Get a new page at the end of the index.
  *
  *		This has the same API as _hash_getinitbuf, except that we are adding
@@ -286,35 +309,17 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 }
 
 /*
- *	_hash_wrtbuf() -- write a hash page to disk.
- *
- *		This routine releases the lock held on the buffer and our refcount
- *		for it.  It is an error to call _hash_wrtbuf() without a write lock
- *		and a pin on the buffer.
- *
- * NOTE: this routine should go away when/if hash indexes are WAL-ified.
- * The correct sequence of operations is to mark the buffer dirty, then
- * write the WAL record, then release the lock and pin; so marking dirty
- * can't be combined with releasing.
- */
-void
-_hash_wrtbuf(Relation rel, Buffer buf)
-{
-	MarkBufferDirty(buf);
-	UnlockReleaseBuffer(buf);
-}
-
-/*
  * _hash_chgbufaccess() -- Change the lock type on a buffer, without
  *			dropping our pin on it.
  *
- * from_access and to_access may be HASH_READ, HASH_WRITE, or HASH_NOLOCK,
- * the last indicating that no buffer-level lock is held or wanted.
+ * from_access can be HASH_READ or HASH_NOLOCK, the later indicating that
+ * no buffer-level lock is held or wanted.
+ *
+ * to_access can be HASH_READ, HASH_WRITE, or HASH_NOLOCK, the last indicating
+ * that no buffer-level lock is held or wanted.
  *
- * When from_access == HASH_WRITE, we assume the buffer is dirty and tell
- * bufmgr it must be written out.  If the caller wants to release a write
- * lock on a page that's not been modified, it's okay to pass from_access
- * as HASH_READ (a bit ugly, but handy in some places).
+ * If the caller wants to release a write lock on a page, it's okay to pass
+ * from_access as HASH_READ (a bit ugly, but quite handy when required).
  */
 void
 _hash_chgbufaccess(Relation rel,
@@ -322,8 +327,7 @@ _hash_chgbufaccess(Relation rel,
 				   int from_access,
 				   int to_access)
 {
-	if (from_access == HASH_WRITE)
-		MarkBufferDirty(buf);
+	Assert(from_access != HASH_WRITE);
 	if (from_access != HASH_NOLOCK)
 		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 	if (to_access != HASH_NOLOCK)
@@ -332,7 +336,7 @@ _hash_chgbufaccess(Relation rel,
 
 
 /*
- *	_hash_metapinit() -- Initialize the metadata page of a hash index,
+ *	_hash_init() -- Initialize the metadata page of a hash index,
  *				the initial buckets, and the initial bitmap page.
  *
  * The initial number of buckets is dependent on num_tuples, an estimate
@@ -344,19 +348,18 @@ _hash_chgbufaccess(Relation rel,
  * multiple buffer locks is ignored.
  */
 uint32
-_hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
+_hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
 {
-	HashMetaPage metap;
-	HashPageOpaque pageopaque;
 	Buffer		metabuf;
 	Buffer		buf;
+	Buffer		bitmapbuf;
 	Page		pg;
+	HashMetaPage metap;
+	RegProcedure procid;
 	int32		data_width;
 	int32		item_width;
 	int32		ffactor;
-	double		dnumbuckets;
 	uint32		num_buckets;
-	uint32		log2_num_buckets;
 	uint32		i;
 
 	/* safety check */
@@ -378,6 +381,154 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	if (ffactor < 10)
 		ffactor = 10;
 
+	procid = index_getprocid(rel, 1, HASHPROC);
+
+	/*
+	 * We initialize the metapage, the first N bucket pages, and the first
+	 * bitmap page in sequence, using _hash_getnewbuf to cause smgrextend()
+	 * calls to occur.  This ensures that the smgr level has the right idea of
+	 * the physical index length.
+	 *
+	 * Critical section not required, because on error the creation of the
+	 * whole relation will be rolled back.
+	 */
+	metabuf = _hash_getnewbuf(rel, HASH_METAPAGE, forkNum);
+	_hash_init_metabuffer(metabuf, num_tuples, procid, ffactor, false);
+	MarkBufferDirty(metabuf);
+
+	pg = BufferGetPage(metabuf);
+	metap = HashPageGetMeta(pg);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_init_meta_page xlrec;
+		XLogRecPtr	recptr;
+
+		/*
+		 * Here, we could have copied the entire metapage into the WAL record
+		 * and restored it as-is during replay.  However, the metapage is not
+		 * small (more than 600 bytes), so we record just the information
+		 * required to reconstruct it.
+		 */
+		xlrec.num_tuples = num_tuples;
+		xlrec.procid = metap->hashm_procid;
+		xlrec.ffactor = metap->hashm_ffactor;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInitMetaPage);
+		XLogRegisterBuffer(0, metabuf, REGBUF_WILL_INIT);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INIT_META_PAGE);
+
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	num_buckets = metap->hashm_maxbucket + 1;
+
+	/*
+	 * Release buffer lock on the metapage while we initialize buckets.
+	 * Otherwise, we'll be in interrupt holdoff and the CHECK_FOR_INTERRUPTS
+	 * won't accomplish anything.  It's a bad idea to hold buffer locks for
+	 * long intervals in any case, since that can block the bgwriter.
+	 */
+	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+	/*
+	 * Initialize and WAL Log the first N buckets
+	 */
+	for (i = 0; i < num_buckets; i++)
+	{
+		BlockNumber blkno;
+
+		/* Allow interrupts, in case N is huge */
+		CHECK_FOR_INTERRUPTS();
+
+		blkno = BUCKET_TO_BLKNO(metap, i);
+		buf = _hash_getnewbuf(rel, blkno, forkNum);
+		_hash_initbuf(buf, i, LH_BUCKET_PAGE, false);
+		MarkBufferDirty(buf);
+
+		log_newpage(&rel->rd_node,
+					forkNum,
+					blkno,
+					BufferGetPage(buf),
+					true);
+		_hash_relbuf(rel, buf);
+	}
+
+	/* Now reacquire buffer lock on metapage */
+	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
+
+	/*
+	 * Initialize bitmap page
+	 */
+	bitmapbuf = _hash_getnewbuf(rel, num_buckets + 1, forkNum);
+	_hash_initbitmapbuffer(bitmapbuf, metap->hashm_bmsize, false);
+	MarkBufferDirty(bitmapbuf);
+
+	/* add the new bitmap page to the metapage's list of bitmaps */
+	/* metapage already has a write lock */
+	if (metap->hashm_nmaps >= HASH_MAX_BITMAPS)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+				 errmsg("out of overflow pages in hash index \"%s\"",
+						RelationGetRelationName(rel))));
+
+	metap->hashm_mapp[metap->hashm_nmaps] = num_buckets + 1;
+
+	metap->hashm_nmaps++;
+	MarkBufferDirty(metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_init_bitmap_page xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.bmsize = metap->hashm_bmsize;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInitBitmapPage);
+		XLogRegisterBuffer(0, bitmapbuf, REGBUF_WILL_INIT);
+
+		/*
+		 * We could log the exact changes made to the meta page; however, as
+		 * no concurrent operation can see the index during the replay of
+		 * this record, we simply perform the operations during replay the
+		 * same way they are done here.
+		 */
+		XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INIT_BITMAP_PAGE);
+
+		PageSetLSN(BufferGetPage(bitmapbuf), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	/* all done */
+	_hash_relbuf(rel, bitmapbuf);
+	_hash_relbuf(rel, metabuf);
+
+	return num_buckets;
+}
+
+/*
+ *	_hash_init_metabuffer() -- Initialize the metadata page of a hash index.
+ */
+void
+_hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
+					  uint16 ffactor, bool initpage)
+{
+	HashMetaPage metap;
+	HashPageOpaque pageopaque;
+	Page		page;
+	double		dnumbuckets;
+	uint32		num_buckets;
+	uint32		log2_num_buckets;
+	uint32		i;
+
+
 	/*
 	 * Choose the number of initial bucket pages to match the fill factor
 	 * given the estimated number of tuples.  We round up the result to the
@@ -397,30 +548,25 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	Assert(num_buckets == (((uint32) 1) << log2_num_buckets));
 	Assert(log2_num_buckets < HASH_MAX_SPLITPOINTS);
 
-	/*
-	 * We initialize the metapage, the first N bucket pages, and the first
-	 * bitmap page in sequence, using _hash_getnewbuf to cause smgrextend()
-	 * calls to occur.  This ensures that the smgr level has the right idea of
-	 * the physical index length.
-	 */
-	metabuf = _hash_getnewbuf(rel, HASH_METAPAGE, forkNum);
-	pg = BufferGetPage(metabuf);
+	page = BufferGetPage(buf);
+	if (initpage)
+		_hash_pageinit(page, BufferGetPageSize(buf));
 
-	pageopaque = (HashPageOpaque) PageGetSpecialPointer(pg);
+	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	pageopaque->hasho_prevblkno = InvalidBlockNumber;
 	pageopaque->hasho_nextblkno = InvalidBlockNumber;
 	pageopaque->hasho_bucket = -1;
 	pageopaque->hasho_flag = LH_META_PAGE;
 	pageopaque->hasho_page_id = HASHO_PAGE_ID;
 
-	metap = HashPageGetMeta(pg);
+	metap = HashPageGetMeta(page);
 
 	metap->hashm_magic = HASH_MAGIC;
 	metap->hashm_version = HASH_VERSION;
 	metap->hashm_ntuples = 0;
 	metap->hashm_nmaps = 0;
 	metap->hashm_ffactor = ffactor;
-	metap->hashm_bsize = HashGetMaxBitmapSize(pg);
+	metap->hashm_bsize = HashGetMaxBitmapSize(page);
 	/* find largest bitmap array size that will fit in page size */
 	for (i = _hash_log2(metap->hashm_bsize); i > 0; --i)
 	{
@@ -437,7 +583,7 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	 * pretty useless for normal operation (in fact, hashm_procid is not used
 	 * anywhere), but it might be handy for forensic purposes so we keep it.
 	 */
-	metap->hashm_procid = index_getprocid(rel, 1, HASHPROC);
+	metap->hashm_procid = procid;
 
 	/*
 	 * We initialize the index with N buckets, 0 .. N-1, occupying physical
@@ -456,44 +602,11 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	metap->hashm_firstfree = 0;
 
 	/*
-	 * Release buffer lock on the metapage while we initialize buckets.
-	 * Otherwise, we'll be in interrupt holdoff and the CHECK_FOR_INTERRUPTS
-	 * won't accomplish anything.  It's a bad idea to hold buffer locks for
-	 * long intervals in any case, since that can block the bgwriter.
+	 * Set pd_lower just past the end of the metadata, so that a full-page
+	 * image taken by xloginsert.c does not skip the metadata as a page hole.
 	 */
-	_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
-
-	/*
-	 * Initialize the first N buckets
-	 */
-	for (i = 0; i < num_buckets; i++)
-	{
-		/* Allow interrupts, in case N is huge */
-		CHECK_FOR_INTERRUPTS();
-
-		buf = _hash_getnewbuf(rel, BUCKET_TO_BLKNO(metap, i), forkNum);
-		pg = BufferGetPage(buf);
-		pageopaque = (HashPageOpaque) PageGetSpecialPointer(pg);
-		pageopaque->hasho_prevblkno = InvalidBlockNumber;
-		pageopaque->hasho_nextblkno = InvalidBlockNumber;
-		pageopaque->hasho_bucket = i;
-		pageopaque->hasho_flag = LH_BUCKET_PAGE;
-		pageopaque->hasho_page_id = HASHO_PAGE_ID;
-		_hash_wrtbuf(rel, buf);
-	}
-
-	/* Now reacquire buffer lock on metapage */
-	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
-
-	/*
-	 * Initialize first bitmap page
-	 */
-	_hash_initbitmap(rel, metap, num_buckets + 1, forkNum);
-
-	/* all done */
-	_hash_wrtbuf(rel, metabuf);
-
-	return num_buckets;
+	((PageHeader) page)->pd_lower =
+		((char *) metap + sizeof(HashMetaPageData)) - (char *) page;
 }
 
 /*
@@ -502,7 +615,6 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 void
 _hash_pageinit(Page page, Size size)
 {
-	Assert(PageIsNew(page));
 	PageInit(page, size, sizeof(HashPageOpaqueData));
 }
 
@@ -530,10 +642,14 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	Buffer		buf_nblkno;
 	Buffer		buf_oblkno;
 	Page		opage;
+	Page		npage;
 	HashPageOpaque oopaque;
+	HashPageOpaque nopaque;
 	uint32		maxbucket;
 	uint32		highmask;
 	uint32		lowmask;
+	bool		metap_update_masks = false;
+	bool		metap_update_splitpoint = false;
 
 restart_expand:
 
@@ -569,7 +685,7 @@ restart_expand:
 	 * than a disk block then this would be an independent constraint.
 	 *
 	 * If you change this, see also the maximum initial number of buckets in
-	 * _hash_metapinit().
+	 * _hash_init().
 	 */
 	if (metap->hashm_maxbucket >= (uint32) 0x7FFFFFFE)
 		goto fail;
@@ -703,7 +819,11 @@ restart_expand:
 		 * The number of buckets in the new splitpoint is equal to the total
 		 * number already in existence, i.e. new_bucket.  Currently this maps
 		 * one-to-one to blocks required, but someday we may need a more
-		 * complicated calculation here.
+		 * complicated calculation here.  We treat allocation of buckets as a
+		 * separate WAL action.  Even if we fail after this operation, it
+		 * won't leak space on the standby, as the next split will consume it.
+		 * In any case, even without a failure we don't use all the space in
+		 * one split operation.
 		 */
 		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
 		{
@@ -728,18 +848,44 @@ restart_expand:
 		goto fail;
 	}
 
-
 	/*
-	 * Okay to proceed with split.  Update the metapage bucket mapping info.
-	 *
-	 * Since we are scribbling on the metapage data right in the shared
-	 * buffer, any failure in this next little bit leaves us with a big
+	 * Since we are scribbling on the pages in the shared buffers, establish a
+	 * critical section.  Any failure in this next code leaves us with a big
 	 * problem: the metapage is effectively corrupt but could get written back
-	 * to disk.  We don't really expect any failure, but just to be sure,
-	 * establish a critical section.
+	 * to disk.
 	 */
 	START_CRIT_SECTION();
 
+	opage = BufferGetPage(buf_oblkno);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	/*
+	 * Mark the old bucket to indicate that split is in progress and it has
+	 * deletable tuples. At operation end, we clear split in progress flag and
+	 * vacuum will clear page_has_garbage flag after deleting such tuples.
+	 */
+	oopaque->hasho_flag |= LH_BUCKET_PAGE_HAS_GARBAGE | LH_BUCKET_OLD_PAGE_SPLIT;
+
+	MarkBufferDirty(buf_oblkno);
+
+	npage = BufferGetPage(buf_nblkno);
+
+	/*
+	 * initialize the new bucket's primary page and mark it to indicate that
+	 * split is in progress.
+	 */
+	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	nopaque->hasho_prevblkno = InvalidBlockNumber;
+	nopaque->hasho_nextblkno = InvalidBlockNumber;
+	nopaque->hasho_bucket = new_bucket;
+	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_NEW_PAGE_SPLIT;
+	nopaque->hasho_page_id = HASHO_PAGE_ID;
+
+	MarkBufferDirty(buf_nblkno);
+
+	/*
+	 * Okay to proceed with split.  Update the metapage bucket mapping info.
+	 */
 	metap->hashm_maxbucket = new_bucket;
 
 	if (new_bucket > metap->hashm_highmask)
@@ -747,6 +893,7 @@ restart_expand:
 		/* Starting a new doubling */
 		metap->hashm_lowmask = metap->hashm_highmask;
 		metap->hashm_highmask = new_bucket | metap->hashm_lowmask;
+		metap_update_masks = true;
 	}
 
 	/*
@@ -759,10 +906,10 @@ restart_expand:
 	{
 		metap->hashm_spares[spare_ndx] = metap->hashm_spares[metap->hashm_ovflpoint];
 		metap->hashm_ovflpoint = spare_ndx;
+		metap_update_splitpoint = true;
 	}
 
-	/* Done mucking with metapage */
-	END_CRIT_SECTION();
+	MarkBufferDirty(metabuf);
 
 	/*
 	 * Copy bucket mapping info now; this saves re-accessing the meta page
@@ -775,15 +922,71 @@ restart_expand:
 	highmask = metap->hashm_highmask;
 	lowmask = metap->hashm_lowmask;
 
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_split_allocpage xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.new_bucket = maxbucket;
+		xlrec.old_bucket_flag = oopaque->hasho_flag;
+		xlrec.new_bucket_flag = nopaque->hasho_flag;
+		xlrec.flags = 0;
+
+		XLogBeginInsert();
+
+		XLogRegisterBuffer(0, buf_oblkno, REGBUF_STANDARD);
+		XLogRegisterBuffer(1, buf_nblkno, REGBUF_WILL_INIT);
+		XLogRegisterBuffer(2, metabuf, REGBUF_STANDARD);
+
+		if (metap_update_masks)
+		{
+			xlrec.flags |= XLH_SPLIT_META_UPDATE_MASKS;
+			XLogRegisterBufData(2, (char *) &metap->hashm_lowmask, sizeof(uint32));
+			XLogRegisterBufData(2, (char *) &metap->hashm_highmask, sizeof(uint32));
+		}
+
+		if (metap_update_splitpoint)
+		{
+			xlrec.flags |= XLH_SPLIT_META_UPDATE_SPLITPOINT;
+			XLogRegisterBufData(2, (char *) &metap->hashm_ovflpoint,
+								sizeof(uint32));
+			XLogRegisterBufData(2,
+					   (char *) &metap->hashm_spares[metap->hashm_ovflpoint],
+								sizeof(uint32));
+		}
+
+		XLogRegisterData((char *) &xlrec, SizeOfHashSplitAllocPage);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_ALLOCATE_PAGE);
+
+		PageSetLSN(BufferGetPage(buf_oblkno), recptr);
+		PageSetLSN(BufferGetPage(buf_nblkno), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
 	/* Write out the metapage and drop lock, but keep pin */
-	_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
+	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+	/*
+	 * We need to release and reacquire the lock on the new bucket buffer to
+	 * ensure that the standby doesn't see it in an intermediate state.
+	 */
+	_hash_chgbufaccess(rel, buf_nblkno, HASH_READ, HASH_NOLOCK);
+	_hash_chgbufaccess(rel, buf_nblkno, HASH_NOLOCK, HASH_WRITE);
 
 	/* Relocate records to the new bucket */
 	_hash_splitbucket(rel, metabuf,
 					  old_bucket, new_bucket,
-					  buf_oblkno, buf_nblkno,
+					  buf_oblkno, buf_nblkno, NULL,
 					  maxbucket, highmask, lowmask);
 
+	/* all done, now release the locks and pins on primary buckets. */
+	_hash_relbuf(rel, buf_oblkno);
+	_hash_relbuf(rel, buf_nblkno);
+
 	return;
 
 	/* Here if decide not to split or fail to acquire old bucket lock */
@@ -835,20 +1038,31 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 
 	MemSet(zerobuf, 0, sizeof(zerobuf));
 
+	if (RelationNeedsWAL(rel))
+		log_newpage(&rel->rd_node,
+					MAIN_FORKNUM,
+					lastblock,
+					zerobuf,
+					true);
+
 	RelationOpenSmgr(rel);
 	smgrextend(rel->rd_smgr, MAIN_FORKNUM, lastblock, zerobuf, false);
 
 	return true;
 }
 
-
 /*
  * _hash_splitbucket -- split 'obucket' into 'obucket' and 'nbucket'
  *
+ * This routine is used to partition the tuples between the old and new
+ * buckets and also to finish incomplete split operations.  To finish a
+ * previously interrupted split, the caller needs to fill htab.  If htab is
+ * set, we skip moving the tuples present in htab; a NULL htab means that
+ * all tuples belonging to the new bucket are moved.
+ *
  * We are splitting a bucket that consists of a base bucket page and zero
  * or more overflow (bucket chain) pages.  We must relocate tuples that
- * belong in the new bucket, and compress out any free space in the old
- * bucket.
+ * belong in the new bucket.
  *
  * The caller must hold exclusive locks on both buckets to ensure that
  * no one else is trying to access them (see README).
@@ -874,70 +1088,11 @@ _hash_splitbucket(Relation rel,
 				  Bucket nbucket,
 				  Buffer obuf,
 				  Buffer nbuf,
+				  HTAB *htab,
 				  uint32 maxbucket,
 				  uint32 highmask,
 				  uint32 lowmask)
 {
-	Page		opage;
-	Page		npage;
-	HashPageOpaque oopaque;
-	HashPageOpaque nopaque;
-
-	opage = BufferGetPage(obuf);
-	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
-
-	/*
-	 * Mark the old bucket to indicate that split is in progress and it has
-	 * deletable tuples. At operation end, we clear split in progress flag and
-	 * vacuum will clear page_has_garbage flag after deleting such tuples.
-	 */
-	oopaque->hasho_flag |= LH_BUCKET_PAGE_HAS_GARBAGE | LH_BUCKET_OLD_PAGE_SPLIT;
-
-	npage = BufferGetPage(nbuf);
-
-	/*
-	 * initialize the new bucket's primary page and mark it to indicate that
-	 * split is in progress.
-	 */
-	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
-	nopaque->hasho_prevblkno = InvalidBlockNumber;
-	nopaque->hasho_nextblkno = InvalidBlockNumber;
-	nopaque->hasho_bucket = nbucket;
-	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_NEW_PAGE_SPLIT;
-	nopaque->hasho_page_id = HASHO_PAGE_ID;
-
-	_hash_splitbucket_guts(rel, metabuf, obucket,
-						   nbucket, obuf, nbuf, NULL,
-						   maxbucket, highmask, lowmask);
-
-	/* all done, now release the locks and pins on primary buckets. */
-	_hash_relbuf(rel, obuf);
-	_hash_relbuf(rel, nbuf);
-}
-
-/*
- * _hash_splitbucket_guts -- Helper function to perform the split operation
- *
- * This routine is used to partition the tuples between old and new bucket and
- * is used to finish the incomplete split operations.  To finish the previously
- * interrupted split operation, caller needs to fill htab.  If htab is set, then
- * we skip the movement of tuples that exists in htab, otherwise NULL value of
- * htab indicates movement of all the tuples that belong to new bucket.
- *
- * Caller needs to lock and unlock the old and new primary buckets.
- */
-static void
-_hash_splitbucket_guts(Relation rel,
-					   Buffer metabuf,
-					   Bucket obucket,
-					   Bucket nbucket,
-					   Buffer obuf,
-					   Buffer nbuf,
-					   HTAB *htab,
-					   uint32 maxbucket,
-					   uint32 highmask,
-					   uint32 lowmask)
-{
 	Buffer		bucket_obuf;
 	Buffer		bucket_nbuf;
 	Page		opage;
@@ -1026,7 +1181,7 @@ _hash_splitbucket_guts(Relation rel,
 				itemsz = IndexTupleDSize(*new_itup);
 				itemsz = MAXALIGN(itemsz);
 
-				if (PageGetFreeSpace(npage) < itemsz)
+				while (PageGetFreeSpace(npage) < itemsz)
 				{
 					bool		retain_pin = false;
 
@@ -1036,8 +1191,12 @@ _hash_splitbucket_guts(Relation rel,
 					 */
 					retain_pin = nopaque->hasho_flag & LH_BUCKET_PAGE;
 
-					/* write out nbuf and drop lock, but keep pin */
-					_hash_chgbufaccess(rel, nbuf, HASH_WRITE, HASH_NOLOCK);
+					/* log the split operation before releasing the lock */
+					log_split_page(rel, nbuf);
+
+					/* drop lock, but keep pin */
+					_hash_chgbufaccess(rel, nbuf, HASH_READ, HASH_NOLOCK);
+
 					/* chain to a new overflow page */
 					nbuf = _hash_addovflpage(rel, metabuf, nbuf, retain_pin);
 					npage = BufferGetPage(nbuf);
@@ -1045,6 +1204,13 @@ _hash_splitbucket_guts(Relation rel,
 				}
 
 				/*
+				 * Change the shared buffer state inside a critical section;
+				 * otherwise an error here could leave the buffer in a state
+				 * that recovery cannot make consistent.
+				 */
+				START_CRIT_SECTION();
+
+				/*
 				 * Insert tuple on new page, using _hash_pgaddtup to ensure
 				 * correct ordering by hashkey.  This is a tad inefficient
 				 * since we may have to shuffle itempointers repeatedly.
@@ -1053,6 +1219,8 @@ _hash_splitbucket_guts(Relation rel,
 				 */
 				(void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
 
+				END_CRIT_SECTION();
+
 				/* be tidy */
 				pfree(new_itup);
 			}
@@ -1075,7 +1243,16 @@ _hash_splitbucket_guts(Relation rel,
 
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
+		{
+			/* log the split operation before releasing the lock */
+			log_split_page(rel, nbuf);
+
+			if (nopaque->hasho_flag & LH_BUCKET_PAGE)
+				_hash_chgbufaccess(rel, nbuf, HASH_READ, HASH_NOLOCK);
+			else
+				_hash_relbuf(rel, nbuf);
 			break;
+		}
 
 		/* Else, advance to next old page */
 		obuf = _hash_getbuf(rel, oblkno, HASH_READ, LH_OVERFLOW_PAGE);
@@ -1091,15 +1268,6 @@ _hash_splitbucket_guts(Relation rel,
 	 * To avoid deadlocks due to locking order of buckets, first lock the old
 	 * bucket and then the new bucket.
 	 */
-	if (nopaque->hasho_flag & LH_BUCKET_PAGE)
-		_hash_chgbufaccess(rel, bucket_nbuf, HASH_WRITE, HASH_NOLOCK);
-	else
-		_hash_wrtbuf(rel, nbuf);
-
-	/*
-	 * Acquiring cleanup lock to clear the split-in-progress flag ensures that
-	 * there is no pending scan that has seen the flag after it is cleared.
-	 */
 	_hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
 	opage = BufferGetPage(bucket_obuf);
 	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
@@ -1108,6 +1276,8 @@ _hash_splitbucket_guts(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	START_CRIT_SECTION();
+
 	/* indicate that split is finished */
 	oopaque->hasho_flag &= ~LH_BUCKET_OLD_PAGE_SPLIT;
 	nopaque->hasho_flag &= ~LH_BUCKET_NEW_PAGE_SPLIT;
@@ -1118,6 +1288,29 @@ _hash_splitbucket_guts(Relation rel,
 	 */
 	MarkBufferDirty(bucket_obuf);
 	MarkBufferDirty(bucket_nbuf);
+
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+		xl_hash_split_complete xlrec;
+
+		xlrec.old_bucket_flag = oopaque->hasho_flag;
+		xlrec.new_bucket_flag = nopaque->hasho_flag;
+
+		XLogBeginInsert();
+
+		XLogRegisterData((char *) &xlrec, SizeOfHashSplitComplete);
+
+		XLogRegisterBuffer(0, bucket_obuf, REGBUF_STANDARD);
+		XLogRegisterBuffer(1, bucket_nbuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_COMPLETE);
+
+		PageSetLSN(BufferGetPage(bucket_obuf), recptr);
+		PageSetLSN(BufferGetPage(bucket_nbuf), recptr);
+	}
+
+	END_CRIT_SECTION();
 }
 
 /*
@@ -1222,9 +1415,41 @@ _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Buffer nbuf,
 	opageopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
 	obucket = opageopaque->hasho_bucket;
 
-	_hash_splitbucket_guts(rel, metabuf, obucket,
-						   nbucket, obuf, bucket_nbuf, tidhtab,
-						   maxbucket, highmask, lowmask);
+	_hash_splitbucket(rel, metabuf, obucket,
+					  nbucket, obuf, bucket_nbuf, tidhtab,
+					  maxbucket, highmask, lowmask);
 
 	hash_destroy(tidhtab);
 }
+
+/*
+ *	log_split_page() -- Log the split operation
+ *
+ *	We log the split operation once the current page of the new bucket gets
+ *	full, at which point we log the entire page.
+ *
+ *	'buf' must be locked by the caller which is also responsible for unlocking
+ *	it.
+ */
+static void
+log_split_page(Relation rel, Buffer buf)
+{
+	START_CRIT_SECTION();
+
+	MarkBufferDirty(buf);
+
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+
+		XLogBeginInsert();
+
+		XLogRegisterBuffer(0, buf, REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_PAGE);
+
+		PageSetLSN(BufferGetPage(buf), recptr);
+	}
+
+	END_CRIT_SECTION();
+}
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 6ec3bea..5d8942c 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -306,6 +306,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 
 
 	page = BufferGetPage(buf);
+	TestForOldSnapshot(scan->xs_snapshot, rel, page);
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	Assert(opaque->hasho_bucket == bucket);
 
@@ -362,6 +363,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	{
 		while (BlockNumberIsValid(opaque->hasho_nextblkno))
 			_hash_readnext(rel, &buf, &page, &opaque);
+
+		if (BufferIsValid(buf))
+			TestForOldSnapshot(scan->xs_snapshot, rel, page);
 	}
 
 	/* Now find the first tuple satisfying the qualification */
@@ -480,6 +484,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					_hash_readnext(rel, &buf, &page, &opaque);
 					if (BufferIsValid(buf))
 					{
+						TestForOldSnapshot(scan->xs_snapshot, rel, page);
 						maxoff = PageGetMaxOffsetNumber(page);
 						offnum = _hash_binsearch(page, so->hashso_sk_hash);
 					}
@@ -503,6 +508,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 							_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
 
 							page = BufferGetPage(buf);
+							TestForOldSnapshot(scan->xs_snapshot, rel, page);
 							opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 							maxoff = PageGetMaxOffsetNumber(page);
 							offnum = _hash_binsearch(page, so->hashso_sk_hash);
@@ -567,6 +573,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					_hash_readprev(rel, &buf, &page, &opaque);
 					if (BufferIsValid(buf))
 					{
+						TestForOldSnapshot(scan->xs_snapshot, rel, page);
 						maxoff = PageGetMaxOffsetNumber(page);
 						offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
 					}
@@ -590,6 +597,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 							_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
 
 							page = BufferGetPage(buf);
+							TestForOldSnapshot(scan->xs_snapshot, rel, page);
 							opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 							maxoff = PageGetMaxOffsetNumber(page);
 							offnum = _hash_binsearch(page, so->hashso_sk_hash);
diff --git a/src/backend/access/rmgrdesc/hashdesc.c b/src/backend/access/rmgrdesc/hashdesc.c
index 12e1818..245ce97 100644
--- a/src/backend/access/rmgrdesc/hashdesc.c
+++ b/src/backend/access/rmgrdesc/hashdesc.c
@@ -19,10 +19,143 @@
 void
 hash_desc(StringInfo buf, XLogReaderState *record)
 {
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			{
+				xl_hash_init_meta_page *xlrec = (xl_hash_init_meta_page *) rec;
+
+				appendStringInfo(buf, "num_tuples %g, fillfactor %d",
+								 xlrec->num_tuples, xlrec->ffactor);
+				break;
+			}
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			{
+				xl_hash_init_bitmap_page *xlrec = (xl_hash_init_bitmap_page *) rec;
+
+				appendStringInfo(buf, "bmsize %d", xlrec->bmsize);
+				break;
+			}
+		case XLOG_HASH_INSERT:
+			{
+				xl_hash_insert *xlrec = (xl_hash_insert *) rec;
+
+				appendStringInfo(buf, "off %u", xlrec->offnum);
+				break;
+			}
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			{
+				xl_hash_addovflpage *xlrec = (xl_hash_addovflpage *) rec;
+
+				appendStringInfo(buf, "bmsize %d, bmpage_found %c",
+						   xlrec->bmsize, (xlrec->bmpage_found) ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			{
+				xl_hash_split_allocpage *xlrec = (xl_hash_split_allocpage *) rec;
+
+				appendStringInfo(buf, "new_bucket %u, meta_page_masks_updated %c, issplitpoint_changed %c",
+								 xlrec->new_bucket,
+					(xlrec->flags & XLH_SPLIT_META_UPDATE_MASKS) ? 'T' : 'F',
+								 (xlrec->flags & XLH_SPLIT_META_UPDATE_SPLITPOINT) ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SPLIT_COMPLETE:
+			{
+				xl_hash_split_complete *xlrec = (xl_hash_split_complete *) rec;
+
+				appendStringInfo(buf, "split_complete_old_bucket %c, split_complete_new_bucket %c",
+								 (xlrec->old_bucket_flag & LH_BUCKET_OLD_PAGE_SPLIT) ? 'F' : 'T',
+								 (xlrec->new_bucket_flag & LH_BUCKET_NEW_PAGE_SPLIT) ? 'F' : 'T');
+				break;
+			}
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			{
+				xl_hash_move_page_contents *xlrec = (xl_hash_move_page_contents *) rec;
+
+				appendStringInfo(buf, "ntups %d, is_primary %c",
+								 xlrec->ntups,
+								 xlrec->is_prim_bucket_same_wrt ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SQUEEZE_PAGE:
+			{
+				xl_hash_squeeze_page *xlrec = (xl_hash_squeeze_page *) rec;
+
+				appendStringInfo(buf, "prevblkno %u, nextblkno %u, ntups %d, is_primary %c",
+								 xlrec->prevblkno,
+								 xlrec->nextblkno,
+								 xlrec->ntups,
+								 xlrec->is_prim_bucket_same_wrt ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_DELETE:
+			{
+				xl_hash_delete *xlrec = (xl_hash_delete *) rec;
+
+				appendStringInfo(buf, "is_primary %c",
+								 xlrec->is_primary_bucket_page ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_UPDATE_META_PAGE:
+			{
+				xl_hash_update_meta_page *xlrec = (xl_hash_update_meta_page *) rec;
+
+				appendStringInfo(buf, "ntuples %g",
+								 xlrec->ntuples);
+				break;
+			}
+	}
 }
 
 const char *
 hash_identify(uint8 info)
 {
-	return NULL;
+	const char *id = NULL;
+
+	switch (info & ~XLR_INFO_MASK)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			id = "INIT_META_PAGE";
+			break;
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			id = "INIT_BITMAP_PAGE";
+			break;
+		case XLOG_HASH_INSERT:
+			id = "INSERT";
+			break;
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			id = "ADD_OVFL_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			id = "SPLIT_ALLOCATE_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_PAGE:
+			id = "SPLIT_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_COMPLETE:
+			id = "SPLIT_COMPLETE";
+			break;
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			id = "MOVE_PAGE_CONTENTS";
+			break;
+		case XLOG_HASH_SQUEEZE_PAGE:
+			id = "SQUEEZE_PAGE";
+			break;
+		case XLOG_HASH_DELETE:
+			id = "DELETE";
+			break;
+		case XLOG_HASH_CLEAR_GARBAGE:
+			id = "CLEAR_GARBAGE";
+			break;
+		case XLOG_HASH_UPDATE_META_PAGE:
+			id = "UPDATE_META_PAGE";
+			break;
+	}
+
+	return id;
 }
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 85817c6..b868fd8 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -499,11 +499,6 @@ DefineIndex(Oid relationId,
 	accessMethodForm = (Form_pg_am) GETSTRUCT(tuple);
 	amRoutine = GetIndexAmRoutine(accessMethodForm->amhandler);
 
-	if (strcmp(accessMethodName, "hash") == 0 &&
-		RelationNeedsWAL(rel))
-		ereport(WARNING,
-				(errmsg("hash indexes are not WAL-logged and their use is discouraged")));
-
 	if (stmt->unique && !amRoutine->amcanunique)
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 73aa0c0..31b66d2 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -598,6 +598,33 @@ PageGetFreeSpace(Page page)
 }
 
 /*
+ * PageGetFreeSpaceForMulTups
+ *		Returns the size of the free (allocatable) space on a page,
+ *		reduced by the space needed for multiple new line pointers.
+ *
+ * Note: this should usually only be used on index pages.  Use
+ * PageGetHeapFreeSpace on heap pages.
+ */
+Size
+PageGetFreeSpaceForMulTups(Page page, int ntups)
+{
+	int			space;
+
+	/*
+	 * Use signed arithmetic here so that we behave sensibly if pd_lower >
+	 * pd_upper.
+	 */
+	space = (int) ((PageHeader) page)->pd_upper -
+		(int) ((PageHeader) page)->pd_lower;
+
+	if (space < (int) (ntups * sizeof(ItemIdData)))
+		return 0;
+	space -= ntups * sizeof(ItemIdData);
+
+	return (Size) space;
+}
+
+/*
  * PageGetExactFreeSpace
  *		Returns the size of the free (allocatable) space on a page,
  *		without any consideration for adding/removing line pointers.
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 79e0b1f..a918c6d 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -5379,13 +5379,10 @@ RelationIdIsInInitFile(Oid relationId)
 /*
  * Tells whether any index for the relation is unlogged.
  *
- * Any index using the hash AM is implicitly unlogged.
- *
  * Note: There doesn't seem to be any way to have an unlogged index attached
- * to a permanent table except to create a hash index, but it seems best to
- * keep this general so that it returns sensible results even when they seem
- * obvious (like for an unlogged table) and to handle possible future unlogged
- * indexes on permanent tables.
+ * to a permanent table, but it seems best to keep this general so that it
+ * returns sensible results even when they seem obvious (like for an unlogged
+ * table) and to handle possible future unlogged indexes on permanent tables.
  */
 bool
 RelationHasUnloggedIndex(Relation rel)
@@ -5407,8 +5404,7 @@ RelationHasUnloggedIndex(Relation rel)
 			elog(ERROR, "cache lookup failed for relation %u", indexoid);
 		reltup = (Form_pg_class) GETSTRUCT(tp);
 
-		if (reltup->relpersistence == RELPERSISTENCE_UNLOGGED
-			|| reltup->relam == HASH_AM_OID)
+		if (reltup->relpersistence == RELPERSISTENCE_UNLOGGED)
 			result = true;
 
 		ReleaseSysCache(tp);
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index bbf822b..0e6ad1a 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -313,13 +313,16 @@ extern Datum hash_uint32(uint32 k);
 extern void _hash_doinsert(Relation rel, IndexTuple itup);
 extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
 			   Size itemsize, IndexTuple itup);
+extern void _hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
+					OffsetNumber *itup_offsets, uint16 nitups);
 
 /* hashovfl.c */
 extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin);
-extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf,
-				   BlockNumber bucket_blkno, BufferAccessStrategy bstrategy);
-extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
-				 BlockNumber blkno, ForkNumber forkNum);
+extern BlockNumber _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
+				   Buffer wbuf, IndexTuple *itups, OffsetNumber *itup_offsets,
+				   Size *tups_size, uint16 nitups,
+				   BufferAccessStrategy bstrategy);
+extern void _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage);
 extern void _hash_squeezebucket(Relation rel,
 					Bucket bucket, BlockNumber bucket_blkno,
 					Buffer bucket_buf,
@@ -331,6 +334,8 @@ extern Buffer _hash_getbuf(Relation rel, BlockNumber blkno,
 extern Buffer _hash_getbuf_with_condlock_cleanup(Relation rel,
 								   BlockNumber blkno, int flags);
 extern Buffer _hash_getinitbuf(Relation rel, BlockNumber blkno);
+extern void _hash_initbuf(Buffer buf, uint32 num_bucket, uint32 flag,
+			  bool initpage);
 extern Buffer _hash_getnewbuf(Relation rel, BlockNumber blkno,
 				ForkNumber forkNum);
 extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
@@ -339,11 +344,12 @@ extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
 extern void _hash_relbuf(Relation rel, Buffer buf);
 extern void _hash_dropbuf(Relation rel, Buffer buf);
 extern void _hash_dropscanbuf(Relation rel, HashScanOpaque so);
-extern void _hash_wrtbuf(Relation rel, Buffer buf);
 extern void _hash_chgbufaccess(Relation rel, Buffer buf, int from_access,
 				   int to_access);
-extern uint32 _hash_metapinit(Relation rel, double num_tuples,
-				ForkNumber forkNum);
+extern uint32 _hash_init(Relation rel, double num_tuples,
+		   ForkNumber forkNum);
+extern void _hash_init_metabuffer(Buffer buf, double num_tuples,
+					  RegProcedure procid, uint16 ffactor, bool initpage);
 extern void _hash_pageinit(Page page, Size size);
 extern void _hash_expandtable(Relation rel, Buffer metabuf);
 extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index 5f941a9..30e16c0 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -17,6 +17,245 @@
 #include "access/hash.h"
 #include "access/xlogreader.h"
 
+/* Number of buffers required for XLOG_HASH_SQUEEZE_PAGE operation */
+#define HASH_XLOG_FREE_OVFL_BUFS	6
+
+/*
+ * XLOG records for hash operations
+ */
+#define XLOG_HASH_INIT_META_PAGE	0x00		/* initialize the meta page */
+#define XLOG_HASH_INIT_BITMAP_PAGE	0x10		/* initialize the bitmap page */
+#define XLOG_HASH_INSERT		0x20	/* add index tuple without split */
+#define XLOG_HASH_ADD_OVFL_PAGE 0x30	/* add overflow page */
+#define XLOG_HASH_SPLIT_ALLOCATE_PAGE	0x40	/* allocate new page for split */
+#define XLOG_HASH_SPLIT_PAGE	0x50	/* split page */
+#define XLOG_HASH_SPLIT_COMPLETE	0x60		/* completion of split
+												 * operation */
+#define XLOG_HASH_MOVE_PAGE_CONTENTS	0x70	/* remove tuples from one page
+												 * and add to another page */
+#define XLOG_HASH_SQUEEZE_PAGE	0x80	/* add tuples to one of the previous
+										 * pages in chain and free the ovfl
+										 * page */
+#define XLOG_HASH_DELETE		0x90	/* delete index tuples from a page */
+#define XLOG_HASH_CLEAR_GARBAGE 0xA0	/* clear garbage flag in primary
+										 * bucket page after deleting tuples
+										 * that are moved due to split	*/
+#define XLOG_HASH_UPDATE_META_PAGE	0xB0		/* update meta page after
+												 * vacuum */
+
+
+/*
+ * xl_hash_split_allocpage flag values, 8 bits are available.
+ */
+#define XLH_SPLIT_META_UPDATE_MASKS		(1<<0)
+#define XLH_SPLIT_META_UPDATE_SPLITPOINT		(1<<1)
+
+/*
+ * Data to regenerate the meta-data page
+ */
+typedef struct xl_hash_metadata
+{
+	HashMetaPageData metadata;
+}	xl_hash_metadata;
+
+/*
+ * This is what we need to know about a HASH index create.
+ *
+ * Backup block 0: metapage
+ */
+typedef struct xl_hash_createidx
+{
+	double		num_tuples;
+	RegProcedure procid;
+	uint16		ffactor;
+}	xl_hash_createidx;
+#define SizeOfHashCreateIdx (offsetof(xl_hash_createidx, ffactor) + sizeof(uint16))
+
+/*
+ * This is what we need to know about simple (without split) insert.
+ *
+ * This data record is used for XLOG_HASH_INSERT
+ *
+ * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 1: metapage (xl_hash_metadata)
+ */
+typedef struct xl_hash_insert
+{
+	OffsetNumber offnum;
+}	xl_hash_insert;
+
+#define SizeOfHashInsert	(offsetof(xl_hash_insert, offnum) + sizeof(OffsetNumber))
+
+/*
+ * This is what we need to know about addition of overflow page.
+ *
+ * This data record is used for XLOG_HASH_ADD_OVFL_PAGE
+ *
+ * Backup Blk 0: newly allocated overflow page
+ * Backup Blk 1: page before new overflow page in the bucket chain
+ * Backup Blk 2: bitmap page
+ * Backup Blk 3: new bitmap page
+ * Backup Blk 4: metapage
+ */
+typedef struct xl_hash_addovflpage
+{
+	uint16		bmsize;
+	bool		bmpage_found;
+}	xl_hash_addovflpage;
+
+#define SizeOfHashAddOvflPage	\
+	(offsetof(xl_hash_addovflpage, bmpage_found) + sizeof(bool))
+
+/*
+ * This is what we need to know about allocating a page for split.
+ *
+ * This data record is used for XLOG_HASH_SPLIT_ALLOCATE_PAGE
+ *
+ * Backup Blk 0: page for old bucket
+ * Backup Blk 1: page for new bucket
+ * Backup Blk 2: metapage
+ */
+typedef struct xl_hash_split_allocpage
+{
+	uint32		new_bucket;
+	uint16		old_bucket_flag;
+	uint16		new_bucket_flag;
+	uint8		flags;
+}	xl_hash_split_allocpage;
+
+#define SizeOfHashSplitAllocPage	\
+	(offsetof(xl_hash_split_allocpage, flags) + sizeof(uint8))
+
+/*
+ * This is what we need to know about completing the split operation.
+ *
+ * This data record is used for XLOG_HASH_SPLIT_COMPLETE
+ *
+ * Backup Blk 0: page for old bucket
+ * Backup Blk 1: page for new bucket
+ */
+typedef struct xl_hash_split_complete
+{
+	uint16		old_bucket_flag;
+	uint16		new_bucket_flag;
+}	xl_hash_split_complete;
+
+#define SizeOfHashSplitComplete \
+	(offsetof(xl_hash_split_complete, new_bucket_flag) + sizeof(uint16))
+
+/*
+ * This is what we need to know about move page contents required during
+ * squeeze operation.
+ *
+ * This data record is used for XLOG_HASH_MOVE_PAGE_CONTENTS
+ *
+ * Backup Blk 0: bucket page
+ * Backup Blk 1: page containing moved tuples
+ * Backup Blk 2: page from which tuples will be removed
+ */
+typedef struct xl_hash_move_page_contents
+{
+	uint16		ntups;
+	bool		is_prim_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is same as
+												 * primary bucket page */
+}	xl_hash_move_page_contents;
+
+#define SizeOfHashMovePageContents	\
+	(offsetof(xl_hash_move_page_contents, is_prim_bucket_same_wrt) + sizeof(bool))
+
+/*
+ * This is what we need to know about the squeeze page operation.
+ *
+ * This data record is used for XLOG_HASH_SQUEEZE_PAGE
+ *
+ * Backup Blk 0: page containing tuples moved from freed overflow page
+ * Backup Blk 1: freed overflow page
+ * Backup Blk 2: page previous to the freed overflow page
+ * Backup Blk 3: page next to the freed overflow page
+ * Backup Blk 4: bitmap page containing info of freed overflow page
+ * Backup Blk 5: meta page
+ */
+typedef struct xl_hash_squeeze_page
+{
+	BlockNumber prevblkno;
+	BlockNumber nextblkno;
+	uint16		ntups;
+	bool		is_prim_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is same as
+												 * primary bucket page */
+	bool		is_prev_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is the
+												 * page previous to the freed
+												 * overflow page */
+}	xl_hash_squeeze_page;
+
+#define SizeOfHashSqueezePage	\
+	(offsetof(xl_hash_squeeze_page, is_prev_bucket_same_wrt) + sizeof(bool))
+
+/*
+ * This is what we need to know about the deletion of index tuples from a page.
+ *
+ * This data record is used for XLOG_HASH_DELETE
+ *
+ * Backup Blk 0: primary bucket page
+ * Backup Blk 1: page from which tuples are deleted
+ */
+typedef struct xl_hash_delete
+{
+	bool		is_primary_bucket_page; /* TRUE if the operation is for
+										 * primary bucket page */
+}	xl_hash_delete;
+
+#define SizeOfHashDelete	(offsetof(xl_hash_delete, is_primary_bucket_page) + sizeof(bool))
+
+/*
+ * This is what we need for metapage update operation.
+ *
+ * This data record is used for XLOG_HASH_UPDATE_META_PAGE
+ *
+ * Backup Blk 0: meta page
+ */
+typedef struct xl_hash_update_meta_page
+{
+	double		ntuples;
+}	xl_hash_update_meta_page;
+
+#define SizeOfHashUpdateMetaPage	\
+	(offsetof(xl_hash_update_meta_page, ntuples) + sizeof(double))
+
+/*
+ * This is what we need to initialize metapage.
+ *
+ * This data record is used for XLOG_HASH_INIT_META_PAGE
+ *
+ * Backup Blk 0: meta page
+ */
+typedef struct xl_hash_init_meta_page
+{
+	double		num_tuples;
+	RegProcedure procid;
+	uint16		ffactor;
+}	xl_hash_init_meta_page;
+
+#define SizeOfHashInitMetaPage		\
+	(offsetof(xl_hash_init_meta_page, ffactor) + sizeof(uint16))
+
+/*
+ * This is what we need to initialize bitmap page.
+ *
+ * This data record is used for XLOG_HASH_INIT_BITMAP_PAGE
+ *
+ * Backup Blk 0: bitmap page
+ * Backup Blk 1: meta page
+ */
+typedef struct xl_hash_init_bitmap_page
+{
+	uint16		bmsize;
+}	xl_hash_init_bitmap_page;
+
+#define SizeOfHashInitBitmapPage	\
+	(offsetof(xl_hash_init_bitmap_page, bmsize) + sizeof(uint16))
 
 extern void hash_redo(XLogReaderState *record);
 extern void hash_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index ad4ab5f..6ea46ef 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -425,6 +425,7 @@ extern Page PageGetTempPageCopySpecial(Page page);
 extern void PageRestoreTempPage(Page tempPage, Page oldPage);
 extern void PageRepairFragmentation(Page page);
 extern Size PageGetFreeSpace(Page page);
+extern Size PageGetFreeSpaceForMulTups(Page page, int ntups);
 extern Size PageGetExactFreeSpace(Page page);
 extern Size PageGetHeapFreeSpace(Page page);
 extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 76593e1..80089b9 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -2335,13 +2335,9 @@ Options: fastupdate=on, gin_pending_list_limit=128
 -- HASH
 --
 CREATE INDEX hash_i4_index ON hash_i4_heap USING hash (random int4_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_name_index ON hash_name_heap USING hash (random name_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_txt_index ON hash_txt_heap USING hash (random text_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_f8_index ON hash_f8_heap USING hash (random float8_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE UNLOGGED TABLE unlogged_hash_table (id int4);
 CREATE INDEX unlogged_hash_index ON unlogged_hash_table USING hash (id int4_ops);
 DROP TABLE unlogged_hash_table;
@@ -2350,7 +2346,6 @@ DROP TABLE unlogged_hash_table;
 -- maintenance_work_mem setting and fillfactor:
 SET maintenance_work_mem = '1MB';
 CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 EXPLAIN (COSTS OFF)
 SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
                       QUERY PLAN                       
diff --git a/src/test/regress/expected/enum.out b/src/test/regress/expected/enum.out
index 514d1d0..0e60304 100644
--- a/src/test/regress/expected/enum.out
+++ b/src/test/regress/expected/enum.out
@@ -383,7 +383,6 @@ DROP INDEX enumtest_btree;
 -- Hash index / opclass with the = operator
 --
 CREATE INDEX enumtest_hash ON enumtest USING hash (col);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 SELECT * FROM enumtest WHERE col = 'orange';
   col   
 --------
diff --git a/src/test/regress/expected/macaddr.out b/src/test/regress/expected/macaddr.out
index e84ff5f..151f9ce 100644
--- a/src/test/regress/expected/macaddr.out
+++ b/src/test/regress/expected/macaddr.out
@@ -41,7 +41,6 @@ SELECT * FROM macaddr_data;
 
 CREATE INDEX macaddr_data_btree ON macaddr_data USING btree (b);
 CREATE INDEX macaddr_data_hash ON macaddr_data USING hash (b);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 SELECT a, b, trunc(b) FROM macaddr_data ORDER BY 2, 1;
  a  |         b         |       trunc       
 ----+-------------------+-------------------
diff --git a/src/test/regress/expected/replica_identity.out b/src/test/regress/expected/replica_identity.out
index 39a60a5..62f2ad7 100644
--- a/src/test/regress/expected/replica_identity.out
+++ b/src/test/regress/expected/replica_identity.out
@@ -12,7 +12,6 @@ CREATE UNIQUE INDEX test_replica_identity_keyab_key ON test_replica_identity (ke
 CREATE UNIQUE INDEX test_replica_identity_oid_idx ON test_replica_identity (oid);
 CREATE UNIQUE INDEX test_replica_identity_nonkey ON test_replica_identity (keya, nonkey);
 CREATE INDEX test_replica_identity_hash ON test_replica_identity USING hash (nonkey);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE UNIQUE INDEX test_replica_identity_expr ON test_replica_identity (keya, keyb, (3));
 CREATE UNIQUE INDEX test_replica_identity_partial ON test_replica_identity (keya, keyb) WHERE keyb != '3';
 -- default is 'd'/DEFAULT for user created tables
diff --git a/src/test/regress/expected/uuid.out b/src/test/regress/expected/uuid.out
index 59cb1e0..d907519 100644
--- a/src/test/regress/expected/uuid.out
+++ b/src/test/regress/expected/uuid.out
@@ -114,7 +114,6 @@ SELECT COUNT(*) FROM guid1 WHERE guid_field >= '22222222-2222-2222-2222-22222222
 -- btree and hash index creation test
 CREATE INDEX guid1_btree ON guid1 USING BTREE (guid_field);
 CREATE INDEX guid1_hash  ON guid1 USING HASH  (guid_field);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 -- unique index test
 CREATE UNIQUE INDEX guid1_unique_BTREE ON guid1 USING BTREE (guid_field);
 -- should fail
#41Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: Amit Kapila (#40)
Re: Write Ahead Logging for Hash Indexes

On 09/13/2016 07:41 AM, Amit Kapila wrote:

README:
+in_complete split flag. The reader algorithm works correctly, as it will
scan

What flag ?

The in-complete-split flag indicates that the split has to be finished
for that particular bucket. The values of these flags are
LH_BUCKET_NEW_PAGE_SPLIT and LH_BUCKET_OLD_PAGE_SPLIT for the new and
old bucket respectively. The flag is set in hasho_flag in the special
area of the page. I have slightly expanded the definition in the README,
but I am not sure it is a good idea to mention flags defined in hash.h.
Let me know if it is still unclear or you want something additional
added to the README.

I think it is better now.
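
For illustration, a minimal sketch of how a reader could detect a bucket
whose split is still incomplete, using the flag names from the patch (the
helper function itself is hypothetical, not something the patch defines):

static bool
bucket_split_is_incomplete(Page page)
{
	HashPageOpaque opaque = (HashPageOpaque) PageGetSpecialPointer(page);

	/* old bucket carries the OLD flag, new bucket carries the NEW flag */
	return (opaque->hasho_flag &
			(LH_BUCKET_OLD_PAGE_SPLIT | LH_BUCKET_NEW_PAGE_SPLIT)) != 0;
}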

hashxlog.c:

hash_xlog_move_page_contents
hash_xlog_squeeze_page

Both have "bukcetbuf" (-> "bucketbuf"), and

+ if (BufferIsValid(bukcetbuf));

->

+ if (BufferIsValid(bucketbuf))

and indent following line:

LockBufferForCleanup(bukcetbuf);

hash_xlog_delete

has the "if" issue too.

Fixed all the above cosmetic issues.

I meant there is an extra ';' on the "if" statements:

+ if (BufferIsValid(bukcetbuf)); <--

in hash_xlog_move_page_contents, hash_xlog_squeeze_page and
hash_xlog_delete.
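
To make the effect of that stray semicolon concrete (names as they appear
in the patch, including the "bukcetbuf" typo):

	/* buggy: the ';' ends the if statement, so the lock is taken unconditionally */
	if (BufferIsValid(bukcetbuf));
		LockBufferForCleanup(bukcetbuf);

	/* intended */
	if (BufferIsValid(bucketbuf))
		LockBufferForCleanup(bucketbuf);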

Best regards,
Jesper


#42Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Jesper Pedersen (#41)
Re: Write Ahead Logging for Hash Indexes

Hi All,

I am getting the following error when running the test script shared by
Jeff [1].  The error is observed upon executing the test script for
around 3-4 hrs.

57869 INSERT XX000 2016-09-14 07:58:01.211 IST:ERROR: lock
buffer_content 1 is not held
57869 INSERT XX000 2016-09-14 07:58:01.211 IST:STATEMENT: insert into
foo (index) select $1 from generate_series(1,10000)

124388 INSERT XX000 2016-09-14 11:24:13.593 IST:ERROR: lock
buffer_content 10 is not held
124388 INSERT XX000 2016-09-14 11:24:13.593 IST:STATEMENT: insert
into foo (index) select $1 from generate_series(1,10000)

124381 INSERT XX000 2016-09-14 11:24:13.594 IST:ERROR: lock
buffer_content 10 is not held
124381 INSERT XX000 2016-09-14 11:24:13.594 IST:STATEMENT: insert
into foo (index) select $1 from generate_series(1,10000)

[1]: /messages/by-id/CAMkU=1xRt8jBBB7g_7K41W00=br9UrxMVn_rhWhKPLaHfEdM5A@mail.gmail.com

Please note that I am performing the test on the latest patch for
concurrent hash index and WAL logging in hash index shared by Amit
yesterday.

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com


#43Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Ashutosh Sharma (#42)
Re: Write Ahead Logging for Hash Indexes

Hi All,

Below is the backtrace for the issue reported in my earlier mail [1].
From the callstack it looks like we are trying to release lock on a
meta page twice in _hash_expandtable().

(gdb) bt
#0 0x00000000007b01cf in LWLockRelease (lock=0x7f55f59d0570) at lwlock.c:1799
#1 0x000000000078990c in LockBuffer (buffer=2, mode=0) at bufmgr.c:3540
#2 0x00000000004a5d3c in _hash_chgbufaccess (rel=0x7f5605b608c0,
buf=2, from_access=1, to_access=-1) at hashpage.c:331
#3 0x00000000004a722d in _hash_expandtable (rel=0x7f5605b608c0,
metabuf=2) at hashpage.c:995
#4 0x00000000004a316a in _hash_doinsert (rel=0x7f5605b608c0,
itup=0x2ba5030) at hashinsert.c:313
#5 0x00000000004a0d85 in hashinsert (rel=0x7f5605b608c0,
values=0x7ffdf5409c70, isnull=0x7ffdf5409c50 "", ht_ctid=0x2c2e4f4,
heapRel=0x7f5605b58250,
checkUnique=UNIQUE_CHECK_NO) at hash.c:248
#6 0x00000000004c5a16 in index_insert (indexRelation=0x7f5605b608c0,
values=0x7ffdf5409c70, isnull=0x7ffdf5409c50 "",
heap_t_ctid=0x2c2e4f4,
heapRelation=0x7f5605b58250, checkUnique=UNIQUE_CHECK_NO) at indexam.c:204
#7 0x000000000063f2d5 in ExecInsertIndexTuples (slot=0x2b9a2c0,
tupleid=0x2c2e4f4, estate=0x2b99c00, noDupErr=0 '\000',
specConflict=0x0,
arbiterIndexes=0x0) at execIndexing.c:388
#8 0x000000000066a1fc in ExecInsert (mtstate=0x2b99e50,
slot=0x2b9a2c0, planSlot=0x2b9a2c0, arbiterIndexes=0x0,
onconflict=ONCONFLICT_NONE,
estate=0x2b99c00, canSetTag=1 '\001') at nodeModifyTable.c:481
#9 0x000000000066b841 in ExecModifyTable (node=0x2b99e50) at
nodeModifyTable.c:1496
#10 0x0000000000645e51 in ExecProcNode (node=0x2b99e50) at execProcnode.c:396
#11 0x00000000006424d9 in ExecutePlan (estate=0x2b99c00,
planstate=0x2b99e50, use_parallel_mode=0 '\000', operation=CMD_INSERT,
sendTuples=0 '\000',
numberTuples=0, direction=ForwardScanDirection, dest=0xd73c20
<donothingDR>) at execMain.c:1567
#12 0x000000000064087a in standard_ExecutorRun (queryDesc=0x2b89c70,
direction=ForwardScanDirection, count=0) at execMain.c:338
#13 0x00007f55fe590605 in pgss_ExecutorRun (queryDesc=0x2b89c70,
direction=ForwardScanDirection, count=0) at pg_stat_statements.c:877
#14 0x0000000000640751 in ExecutorRun (queryDesc=0x2b89c70,
direction=ForwardScanDirection, count=0) at execMain.c:284
#15 0x00000000007c331e in ProcessQuery (plan=0x2b873f0,
sourceText=0x2b89bd0 "insert into foo (index) select $1 from
generate_series(1,10000)",
params=0x2b89c20, dest=0xd73c20 <donothingDR>,
completionTag=0x7ffdf540a3f0 "") at pquery.c:187
#16 0x00000000007c4a0c in PortalRunMulti (portal=0x2ae7930,
isTopLevel=1 '\001', setHoldSnapshot=0 '\000', dest=0xd73c20
<donothingDR>,
altdest=0xd73c20 <donothingDR>, completionTag=0x7ffdf540a3f0 "")
at pquery.c:1303
#17 0x00000000007c4055 in PortalRun (portal=0x2ae7930,
count=9223372036854775807, isTopLevel=1 '\001', dest=0x2b5dc90,
altdest=0x2b5dc90,
completionTag=0x7ffdf540a3f0 "") at pquery.c:815
#18 0x00000000007bfb45 in exec_execute_message (portal_name=0x2b5d880
"", max_rows=9223372036854775807) at postgres.c:1977
#19 0x00000000007c25c7 in PostgresMain (argc=1, argv=0x2b05df8,
dbname=0x2aca3c0 "postgres", username=0x2b05de0 "edb") at
postgres.c:4133
#20 0x0000000000744d8f in BackendRun (port=0x2aefa60) at postmaster.c:4260
#21 0x0000000000744523 in BackendStartup (port=0x2aefa60) at postmaster.c:3934
#22 0x0000000000740f9c in ServerLoop () at postmaster.c:1691
#23 0x0000000000740623 in PostmasterMain (argc=4, argv=0x2ac81f0) at
postmaster.c:1299
#24 0x00000000006940fe in main (argc=4, argv=0x2ac81f0) at main.c:228

Please let me know for any further inputs.

[1]: /messages/by-id/CAE9k0Pmxh-4NAr4GjzDDFHdBKDrKy2FV-Z+2Tp8vb2Kmxu=6zg@mail.gmail.com

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Wed, Sep 14, 2016 at 2:45 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Hi All,

I am getting following error when running the test script shared by
Jeff -[1] . The error is observed upon executing the test script for
around 3-4 hrs.

57869 INSERT XX000 2016-09-14 07:58:01.211 IST:ERROR: lock
buffer_content 1 is not held
57869 INSERT XX000 2016-09-14 07:58:01.211 IST:STATEMENT: insert into
foo (index) select $1 from generate_series(1,10000)

124388 INSERT XX000 2016-09-14 11:24:13.593 IST:ERROR: lock
buffer_content 10 is not held
124388 INSERT XX000 2016-09-14 11:24:13.593 IST:STATEMENT: insert
into foo (index) select $1 from generate_series(1,10000)

124381 INSERT XX000 2016-09-14 11:24:13.594 IST:ERROR: lock
buffer_content 10 is not held
124381 INSERT XX000 2016-09-14 11:24:13.594 IST:STATEMENT: insert
into foo (index) select $1 from generate_series(1,10000)

[1]- /messages/by-id/CAMkU=1xRt8jBBB7g_7K41W00=br9UrxMVn_rhWhKPLaHfEdM5A@mail.gmail.com

Please note that i am performing the test on latest patch for
concurrent hash index and WAL log in hash index shared by Amit
yesterday.

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Wed, Sep 14, 2016 at 12:04 AM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:

On 09/13/2016 07:41 AM, Amit Kapila wrote:

README:
+in_complete split flag. The reader algorithm works correctly, as it will scan

What flag ?

The in-complete-split flag, which indicates that the split has to be
finished for that particular bucket. The values of these flags are
LH_BUCKET_NEW_PAGE_SPLIT and LH_BUCKET_OLD_PAGE_SPLIT for the new and old
bucket respectively; they are set in hasho_flag in the special area of the
page. I have slightly expanded the definition in the README, but I am not
sure it is a good idea to mention flags defined in hash.h. Let me know if
it is still unclear or if you want something additional added to the README.

I think it is better now.

hashxlog.c:

hash_xlog_move_page_contents
hash_xlog_squeeze_page

Both have "bukcetbuf" (-> "bucketbuf"), and

+ if (BufferIsValid(bukcetbuf));

->

+ if (BufferIsValid(bucketbuf))

and indent following line:

LockBufferForCleanup(bukcetbuf);

hash_xlog_delete

has the "if" issue too.

Fixed all the above cosmetic issues.

I meant there is an extra ';' on the "if" statements:

+ if (BufferIsValid(bukcetbuf)); <--

in hash_xlog_move_page_contents, hash_xlog_squeeze_page and
hash_xlog_delete.

Best regards,
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


#44Amit Kapila
amit.kapila16@gmail.com
In reply to: Ashutosh Sharma (#43)
Re: Write Ahead Logging for Hash Indexes

On Wed, Sep 14, 2016 at 4:36 PM, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

Hi All,

Below is the backtrace for the issue reported in my earlier mail [1].
From the call stack it looks like we are trying to release the lock on a
meta page twice in _hash_expandtable().

Thanks for the call stack.  I think the code below in the patch is the
culprit.  Here we have already released the meta page lock, and then on
failure we try to release it again.

_hash_expandtable()
{
..

/* Release the metapage lock, before completing the split. */
_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
..

if (!buf_nblkno)
{
_hash_relbuf(rel, buf_oblkno);
goto fail;
}
..
fail:
/* We didn't write the metapage, so just drop lock */
_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
}

This is a problem in the concurrent hash index patch.  I will send the
fix in the next version of the patch.
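
Just to illustrate the idea, here is a minimal sketch (hypothetical, not the
actual fix; the helper name and the metap_locked flag are made up) of how the
fail path could be guarded so that the metapage lock is released at most once:

    /*
     * Sketch only: release the old bucket buffer and, only if this code
     * path still holds it, the metapage content lock.
     */
    static void
    expand_fail_cleanup(Relation rel, Buffer metabuf, Buffer buf_oblkno,
                        bool metap_locked)
    {
        if (BufferIsValid(buf_oblkno))
            _hash_relbuf(rel, buf_oblkno);

        /* We didn't write the metapage, so just drop the lock if still held. */
        if (metap_locked)
            _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
    }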

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#45Amit Kapila
amit.kapila16@gmail.com
In reply to: Jesper Pedersen (#41)
1 attachment(s)
Re: Write Ahead Logging for Hash Indexes

On Wed, Sep 14, 2016 at 12:04 AM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:

On 09/13/2016 07:41 AM, Amit Kapila wrote:

I meant there is an extra ';' on the "if" statements:

+ if (BufferIsValid(bukcetbuf)); <--

in hash_xlog_move_page_contents, hash_xlog_squeeze_page and
hash_xlog_delete.

Okay, thanks for pointing that out. I have fixed it. Apart from
that, I have changed _hash_alloc_buckets() to initialize the page
instead of making it completely zero because of problems discussed in
another related thread [1]. I have also updated the README.

[1]: /messages/by-id/CAA4eK1+zT16bxVpF3J9UsiZxExCqFa6QxzUfazmJthW6dO78ww@mail.gmail.com
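
For context, a minimal sketch (hypothetical, only for illustration; the helper
name is made up and this is not the code in the attached patch) of what
initializing such a page, rather than leaving it all zeroes, might look like,
using helpers that already appear in the patch:

    static void
    sketch_init_bucket_page(Buffer buf, Bucket bucket)
    {
        Page            page = BufferGetPage(buf);
        HashPageOpaque  opaque;

        /* set up a valid page header and special space instead of zero bytes */
        _hash_pageinit(page, BufferGetPageSize(buf));

        opaque = (HashPageOpaque) PageGetSpecialPointer(page);
        opaque->hasho_prevblkno = InvalidBlockNumber;
        opaque->hasho_nextblkno = InvalidBlockNumber;
        opaque->hasho_bucket = bucket;
        opaque->hasho_flag = LH_BUCKET_PAGE;
        opaque->hasho_page_id = HASHO_PAGE_ID;
    }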

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

wal_hash_index_v4.patch (application/octet-stream)
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 0f09d82..2d0d3bd 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1520,19 +1520,6 @@ archive_command = 'local_backup_script.sh "%p" "%f"'
   <itemizedlist>
    <listitem>
     <para>
-     Operations on hash indexes are not presently WAL-logged, so
-     replay will not update these indexes.  This will mean that any new inserts
-     will be ignored by the index, updated rows will apparently disappear and
-     deleted rows will still retain pointers. In other words, if you modify a
-     table with a hash index on it then you will get incorrect query results
-     on a standby server.  When recovery completes it is recommended that you
-     manually <xref linkend="sql-reindex">
-     each such index after completing a recovery operation.
-    </para>
-   </listitem>
-
-   <listitem>
-    <para>
      If a <xref linkend="sql-createdatabase">
      command is executed while a base backup is being taken, and then
      the template database that the <command>CREATE DATABASE</> copied
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index cd66abc..8990bbe 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2107,10 +2107,9 @@ include_dir 'conf.d'
          has materialized a result set, no error will be generated even if the
          underlying rows in the referenced table have been vacuumed away.
          Some tables cannot safely be vacuumed early, and so will not be
-         affected by this setting.  Examples include system catalogs and any
-         table which has a hash index.  For such tables this setting will
-         neither reduce bloat nor create a possibility of a <literal>snapshot
-         too old</> error on scanning.
+         affected by this setting.  Example include system catalogs.  For
+         such tables this setting will neither reduce bloat nor create a
+         possibility of a <literal>snapshot too old</> error on scanning.
         </para>
        </listitem>
       </varlistentry>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 06f49db..f285795 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2323,12 +2323,6 @@ LOG:  database system is ready to accept read only connections
   <itemizedlist>
    <listitem>
     <para>
-     Operations on hash indexes are not presently WAL-logged, so
-     replay will not update these indexes.
-    </para>
-   </listitem>
-   <listitem>
-    <para>
      Full knowledge of running transactions is required before snapshots
      can be taken. Transactions that use large numbers of subtransactions
      (currently greater than 64) will delay the start of read only
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 46f8e55..e075ede 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -193,18 +193,6 @@ CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable>
 </synopsis>
   </para>
 
-  <caution>
-   <para>
-    Hash index operations are not presently WAL-logged,
-    so hash indexes might need to be rebuilt with <command>REINDEX</>
-    after a database crash if there were unwritten changes.
-    Also, changes to hash indexes are not replicated over streaming or
-    file-based replication after the initial base backup, so they
-    give wrong answers to queries that subsequently use them.
-    For these reasons, hash index use is presently discouraged.
-   </para>
-  </caution>
-
   <para>
    <indexterm>
     <primary>index</primary>
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index e9f47c4..a2dc212 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -510,19 +510,6 @@ Indexes:
    they can be useful.
   </para>
 
-  <caution>
-   <para>
-    Hash index operations are not presently WAL-logged,
-    so hash indexes might need to be rebuilt with <command>REINDEX</>
-    after a database crash if there were unwritten changes.
-    Also, changes to hash indexes are not replicated over streaming or
-    file-based replication after the initial base backup, so they
-    give wrong answers to queries that subsequently use them.
-    Hash indexes are also not properly restored during point-in-time
-    recovery.  For these reasons, hash index use is presently discouraged.
-   </para>
-  </caution>
-
   <para>
    Currently, only the B-tree, GiST, GIN, and BRIN index methods support
    multicolumn indexes. Up to 32 fields can be specified by default.
diff --git a/src/backend/access/hash/Makefile b/src/backend/access/hash/Makefile
index 5d3bd94..5e24f4b 100644
--- a/src/backend/access/hash/Makefile
+++ b/src/backend/access/hash/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = hash.o hashfunc.o hashinsert.o hashovfl.o hashpage.o hashscan.o \
-       hashsearch.o hashsort.o hashutil.o hashvalidate.o
+       hashsearch.o hashsort.o hashutil.o hashvalidate.o hash_xlog.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 8d815f0..56c94c4 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -452,18 +452,16 @@ Obtaining an overflow page:
 	search for a free page (zero bit in bitmap)
 	if found:
 		set bit in bitmap
-		mark bitmap page dirty and release content lock
+		mark bitmap page dirty
 		take metapage buffer content lock in exclusive mode
 		if first-free-bit value did not change,
 			update it and mark meta page dirty
-		release meta page buffer content lock
-		return page number
 	else (not found):
 	release bitmap page buffer content lock
 	loop back to try next bitmap page, if any
 -- here when we have checked all bitmap pages; we hold meta excl. lock
 	extend index to add another overflow page; update meta information
-	mark meta page dirty and release buffer content lock
+	mark meta page dirty
 	return page number
 
 It is slightly annoying to release and reacquire the metapage lock
@@ -483,12 +481,13 @@ like this:
 
 	-- having determined that no space is free in the target bucket:
 	remember last page of bucket, drop write lock on it
-	call free-page-acquire routine
 	re-write-lock last page of bucket
 	if it is not last anymore, step to the last page
+	execute free-page-acquire (Obtaining an overflow page) mechanism described above
 	update (former) last page to point to new page
 	write-lock and initialize new page, with back link to former last page
 	write and release former last page
+	release the locks on meta page and bitmap page acquired in free-page-acquire algorithm
 	insert tuple into new page
 	-- etc.
 
@@ -515,12 +514,14 @@ accessors of pages in the bucket.  The algorithm is:
 	determine which bitmap page contains the free space bit for page
 	release meta page buffer content lock
 	pin bitmap page and take buffer content lock in exclusive mode
-	update bitmap bit
-	mark bitmap page dirty and release buffer content lock and pin
-	if page number is less than what we saw as first-free-bit in meta:
 	retake meta page buffer content lock in exclusive mode
+	move (insert) tuples that belong to the overflow page being freed
+	update bitmap bit
+	mark bitmap page dirty
 	if page number is still less than first-free-bit,
 		update first-free-bit field and mark meta page dirty
+	write WAL for delink overflow page operation
+	release buffer content lock and pin
 	release meta page buffer content lock and pin
 
 We have to do it this way because we must clear the bitmap bit before
@@ -531,8 +532,101 @@ page acquirer will scan more bitmap bits than he needs to.  What must be
 avoided is having first-free-bit greater than the actual first free bit,
 because then that free page would never be found by searchers.
 
-All the freespace operations should be called while holding no buffer
-locks.  Since they need no lmgr locks, deadlock is not possible.
+The reason for moving tuples from the overflow page while delinking the latter
+is to make that an atomic operation.  Not doing so could lead to spurious reads
+on the standby.  Basically, a user might see the same tuple twice.
+
+
+WAL Considerations
+------------------
+
+The hash index operations like create index, insert, delete, bucket split,
+allocate overflow page, and squeeze (overflow pages are freed in this
+operation) do not by themselves guarantee hash index consistency after a
+crash.  To provide robustness, we write WAL for each of these operations,
+which gets replayed during crash recovery.
+
+Multiple WAL records are written for the create index operation: first for
+initializing the metapage, followed by one for each new bucket created during
+the operation, followed by one for initializing the bitmap page.  If the
+system crashes after any operation, the whole operation is rolled back.  We
+considered writing a single WAL record for the whole operation, but for that
+we would need to limit the number of initial buckets that can be created
+during the operation.  As we can log only a fixed number of pages,
+XLR_MAX_BLOCK_ID (32), with the current XLog machinery, it is better to write
+multiple WAL records for this operation.  The downside of restricting the
+number of buckets is that we need to perform a split operation if the number
+of tuples is more than what can be accommodated in the initial set of buckets,
+and it is not unusual to have a large number of tuples during create index.
+
+Ordinary item insertions (that don't force a page split or need a new overflow
+page) are single WAL entries.  They touch a single bucket page and the meta
+page; the metapage is updated during replay as in the original operation.
+
+An insertion that causes the addition of an overflow page is logged as a
+single WAL entry, preceded by a WAL entry for the new overflow page required
+to insert the tuple.  There is a corner case where, by the time we try to use
+the newly allocated overflow page, it has already been used by concurrent
+insertions; in such a case, a new overflow page will be allocated and a
+separate WAL entry will be made for it.
+
+An insertion that causes a bucket split is logged as a single WAL entry,
+followed by a WAL entry for allocating a new bucket, followed by a WAL entry
+for each overflow bucket page in the new bucket to which tuples are moved
+from the old bucket, followed by a WAL entry to indicate that the split is
+complete for both old and new buckets.
+
+A split operation which requires overflow pages to complete the operation
+will need to write a WAL record for each new allocation of an overflow page.
+
+As splitting involves multiple atomic actions, it's possible that the
+system crashes while moving tuples from the bucket pages of the old bucket to
+the new bucket.  In such a case, after recovery, both the old and new buckets
+will be marked with the in-complete-split flag, which indicates that the split
+has to be finished for that particular bucket.  The reader algorithm works
+correctly, as it will scan both the old and new buckets as explained in the
+reader algorithm section above.
+
+We finish the split at the next insert or split operation on the old bucket,
+as explained in the insert and split algorithms above.  It could be done
+during searches, too, but it seems best not to put any extra updates in what
+would otherwise be a read-only operation (updating is not possible in hot
+standby mode anyway).  It would seem natural to complete the split in VACUUM,
+but since splitting a bucket might require allocating a new page, it might
+fail if you run out of disk space.  That would be bad during VACUUM - the
+reason for running VACUUM in the first place might be that you ran out of
+disk space, and now VACUUM won't finish because you're out of disk space.
+In contrast, an insertion can require enlarging the physical file anyway.
+
+Deletion of tuples from a bucket is performed for two reasons: one is to
+remove dead tuples and the other is to remove tuples that were moved by a
+split.  A WAL entry is made for each bucket page from which tuples are
+removed, followed by a WAL entry to clear the garbage flag if the tuples moved
+by the split are removed.  Another separate WAL entry is made for updating the
+metapage if the deletion is performed to remove dead tuples by vacuum.
+
+As deletion involves multiple atomic operations, it is quite possible that
+the system crashes (a) after removing tuples from some of the bucket pages,
+(b) before clearing the garbage flag, or (c) before updating the metapage.
+If the system crashes before completing (b), it will again try to clean the
+bucket during the next vacuum or insert after recovery, which can have some
+performance impact, but it will work fine.  If the system crashes before
+completing (c), after recovery there could be some additional splits until
+the next vacuum updates the metapage, but the other operations like insert,
+delete and scan will work correctly.  We could fix this problem by actually
+updating the metapage based on the delete operation during replay, but it is
+not clear whether that is worth the complication.
+
+The squeeze operation moves tuples from one of the buckets later in the chain
+to one of the buckets earlier in the chain, and writes a WAL record when
+either the bucket to which it is writing tuples is filled or the bucket from
+which it is removing tuples becomes empty.
+
+As the squeeze operation involves multiple atomic operations, it is quite
+possible that the system crashes before completing the operation on the
+entire bucket.  After recovery, the operations will work correctly, but the
+index will remain bloated and can impact the performance of read and insert
+operations until the next vacuum squeezes the bucket completely.
 
 
 Other Notes
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 302f6ed..bb94321 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -27,6 +27,7 @@
 #include "optimizer/plancat.h"
 #include "utils/index_selfuncs.h"
 #include "utils/rel.h"
+#include "miscadmin.h"
 
 
 /* Working state for hashbuild and its callback */
@@ -115,7 +116,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	estimate_rel_size(heap, NULL, &relpages, &reltuples, &allvisfrac);
 
 	/* Initialize the hash index metadata page and initial buckets */
-	num_buckets = _hash_metapinit(index, reltuples, MAIN_FORKNUM);
+	num_buckets = _hash_init(index, reltuples, MAIN_FORKNUM);
 
 	/*
 	 * If we just insert the tuples into the index in scan order, then
@@ -177,7 +178,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 void
 hashbuildempty(Relation index)
 {
-	_hash_metapinit(index, 0, INIT_FORKNUM);
+	_hash_init(index, 0, INIT_FORKNUM);
 }
 
 /*
@@ -297,6 +298,11 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		buf = so->hashso_curbuf;
 		Assert(BufferIsValid(buf));
 		page = BufferGetPage(buf);
+
+		/*
+		 * We don't need to test for an old snapshot here, as the current
+		 * buffer is pinned, so vacuum can't clean the page.
+		 */
 		maxoffnum = PageGetMaxOffsetNumber(page);
 		for (offnum = ItemPointerGetOffsetNumber(current);
 			 offnum <= maxoffnum;
@@ -604,6 +610,7 @@ loop_top:
 	}
 
 	/* Okay, we're really done.  Update tuple count in metapage. */
+	START_CRIT_SECTION();
 
 	if (orig_maxbucket == metap->hashm_maxbucket &&
 		orig_ntuples == metap->hashm_ntuples)
@@ -629,7 +636,28 @@ loop_top:
 		num_index_tuples = metap->hashm_ntuples;
 	}
 
-	_hash_wrtbuf(rel, metabuf);
+	MarkBufferDirty(metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_update_meta_page xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.ntuples = metap->hashm_ntuples;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, sizeof(SizeOfHashUpdateMetaPage));
+
+		XLogRegisterBuffer(0, metabuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_UPDATE_META_PAGE);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	_hash_relbuf(rel, metabuf);
 
 	/* return statistics */
 	if (stats == NULL)
@@ -720,7 +748,6 @@ hashbucketcleanup(Relation rel, Buffer bucket_buf,
 		OffsetNumber deletable[MaxOffsetNumber];
 		int			ndeletable = 0;
 		bool		retain_pin = false;
-		bool		curr_page_dirty = false;
 
 		if (delay)
 			vacuum_delay_point();
@@ -787,9 +814,41 @@ hashbucketcleanup(Relation rel, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			/* No ereport(ERROR) until changes are logged */
+			START_CRIT_SECTION();
+
 			PageIndexMultiDelete(page, deletable, ndeletable);
 			bucket_dirty = true;
-			curr_page_dirty = true;
+
+			MarkBufferDirty(buf);
+
+			/* XLOG stuff */
+			if (RelationNeedsWAL(rel))
+			{
+				xl_hash_delete xlrec;
+				XLogRecPtr	recptr;
+
+				xlrec.is_primary_bucket_page = (buf == bucket_buf) ? true : false;
+
+				XLogBeginInsert();
+				XLogRegisterData((char *) &xlrec, SizeOfHashDelete);
+
+				/*
+				 * bucket buffer needs to be registered to ensure that we can
+				 * acquire a cleanup lock on it during replay.
+				 */
+				if (!xlrec.is_primary_bucket_page)
+					XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+				XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
+				XLogRegisterBufData(1, (char *) deletable,
+									ndeletable * sizeof(OffsetNumber));
+
+				recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_DELETE);
+				PageSetLSN(BufferGetPage(buf), recptr);
+			}
+
+			END_CRIT_SECTION();
 		}
 
 		/* bail out if there are no more pages to scan. */
@@ -804,15 +863,7 @@ hashbucketcleanup(Relation rel, Buffer bucket_buf,
 		 * release the lock on previous page after acquiring the lock on next
 		 * page
 		 */
-		if (curr_page_dirty)
-		{
-			if (retain_pin)
-				_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
-			else
-				_hash_wrtbuf(rel, buf);
-			curr_page_dirty = false;
-		}
-		else if (retain_pin)
+		if (retain_pin)
 			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
 		else
 			_hash_relbuf(rel, buf);
@@ -844,10 +895,36 @@ hashbucketcleanup(Relation rel, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		/* No ereport(ERROR) until changes are logged */
+		START_CRIT_SECTION();
+
 		bucket_opaque->hasho_flag &= ~LH_BUCKET_PAGE_HAS_GARBAGE;
+
+		MarkBufferDirty(bucket_buf);
+
+		/* XLOG stuff */
+		if (RelationNeedsWAL(rel))
+		{
+			XLogRecPtr	recptr;
+
+			XLogBeginInsert();
+			XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+			recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_CLEAR_GARBAGE);
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
 	}
 
 	/*
+	 * We need to release and reacquire the lock on the bucket buffer to
+	 * ensure that the standby doesn't see an intermediate state of it.
+	 */
+	_hash_chgbufaccess(rel, bucket_buf, HASH_READ, HASH_NOLOCK);
+	_hash_chgbufaccess(rel, bucket_buf, HASH_NOLOCK, HASH_WRITE);
+
+	/*
 	 * If we deleted anything, try to compact free space.  For squeezing the
 	 * bucket, we must have a cleanup lock, else it can impact the ordering of
 	 * tuples for a scan that has started before it.
@@ -856,9 +933,3 @@ hashbucketcleanup(Relation rel, Buffer bucket_buf,
 		_hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf,
 							bstrategy);
 }
-
-void
-hash_redo(XLogReaderState *record)
-{
-	elog(PANIC, "hash_redo: unimplemented");
-}
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
new file mode 100644
index 0000000..d030a8d
--- /dev/null
+++ b/src/backend/access/hash/hash_xlog.c
@@ -0,0 +1,970 @@
+/*-------------------------------------------------------------------------
+ *
+ * hash_xlog.c
+ *	  WAL replay logic for hash index.
+ *
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/hash/hash_xlog.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/hash_xlog.h"
+#include "access/xlogutils.h"
+
+/*
+ * replay a hash index meta page
+ */
+static void
+hash_xlog_init_meta_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Page		page;
+	Buffer		metabuf;
+
+	xl_hash_init_meta_page *xlrec = (xl_hash_init_meta_page *) XLogRecGetData(record);
+
+	/* create the index' metapage */
+	metabuf = XLogInitBufferForRedo(record, 0);
+	Assert(BufferIsValid(metabuf));
+	_hash_init_metabuffer(metabuf, xlrec->num_tuples, xlrec->procid,
+						  xlrec->ffactor, true);
+	page = (Page) BufferGetPage(metabuf);
+	PageSetLSN(page, lsn);
+	MarkBufferDirty(metabuf);
+	/* all done */
+	UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay a hash index bitmap page
+ */
+static void
+hash_xlog_init_bitmap_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		bitmapbuf;
+	Buffer		metabuf;
+	Page		page;
+	HashMetaPage metap;
+	uint32		num_buckets;
+
+	xl_hash_init_bitmap_page *xlrec = (xl_hash_init_bitmap_page *) XLogRecGetData(record);
+
+	/*
+	 * Initialize bitmap page
+	 */
+	bitmapbuf = XLogInitBufferForRedo(record, 0);
+	_hash_initbitmapbuffer(bitmapbuf, xlrec->bmsize, true);
+	PageSetLSN(BufferGetPage(bitmapbuf), lsn);
+	MarkBufferDirty(bitmapbuf);
+	UnlockReleaseBuffer(bitmapbuf);
+
+	/* add the new bitmap page to the metapage's list of bitmaps */
+	if (XLogReadBufferForRedo(record, 1, &metabuf) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Note: in normal operation, we'd update the metapage while still
+		 * holding lock on the page we inserted into.  But during replay it's
+		 * not necessary to hold that lock, since no other index updates can
+		 * be happening concurrently.
+		 */
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+
+		num_buckets = metap->hashm_maxbucket + 1;
+		metap->hashm_mapp[metap->hashm_nmaps] = num_buckets + 1;
+		metap->hashm_nmaps++;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay a hash index insert without split
+ */
+static void
+hash_xlog_insert(XLogReaderState *record)
+{
+	HashMetaPage metap;
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_insert *xlrec = (xl_hash_insert *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		Size		datalen;
+		char	   *datapos = XLogRecGetBlockData(record, 0, &datalen);
+
+		page = BufferGetPage(buffer);
+
+		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+						false, false) == InvalidOffsetNumber)
+			elog(PANIC, "hash_insert_redo: failed to add item");
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+
+	if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Note: in normal operation, we'd update the metapage while still
+		 * holding lock on the page we inserted into.  But during replay it's
+		 * not necessary to hold that lock, since no other index updates can
+		 * be happening concurrently.
+		 */
+		page = BufferGetPage(buffer);
+		metap = HashPageGetMeta(page);
+		metap->hashm_ntuples += 1;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
+
+/*
+ * replay addition of overflow page for hash index
+ */
+static void
+hash_xlog_addovflpage(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_addovflpage *xlrec = (xl_hash_addovflpage *) XLogRecGetData(record);
+	Buffer		leftbuf;
+	Buffer		ovflbuf;
+	Buffer		metabuf;
+	BlockNumber leftblk;
+	BlockNumber rightblk;
+	BlockNumber newmapblk = InvalidBlockNumber;
+	Page		ovflpage;
+	HashPageOpaque ovflopaque;
+	uint32	   *num_bucket;
+	char	   *data;
+	Size datalen PG_USED_FOR_ASSERTS_ONLY;
+	bool		new_bmpage = false;
+
+	XLogRecGetBlockTag(record, 0, NULL, NULL, &rightblk);
+	XLogRecGetBlockTag(record, 1, NULL, NULL, &leftblk);
+
+	ovflbuf = XLogInitBufferForRedo(record, 0);
+	Assert(BufferIsValid(ovflbuf));
+
+	data = XLogRecGetBlockData(record, 0, &datalen);
+	num_bucket = (uint32 *) data;
+	Assert(datalen == sizeof(uint32));
+	_hash_initbuf(ovflbuf, *num_bucket, LH_OVERFLOW_PAGE, true);
+	/* update backlink */
+	ovflpage = BufferGetPage(ovflbuf);
+	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
+	ovflopaque->hasho_prevblkno = leftblk;
+
+	PageSetLSN(ovflpage, lsn);
+	MarkBufferDirty(ovflbuf);
+
+	if (XLogReadBufferForRedo(record, 1, &leftbuf) == BLK_NEEDS_REDO)
+	{
+		Page		leftpage;
+		HashPageOpaque leftopaque;
+
+		leftpage = BufferGetPage(leftbuf);
+		leftopaque = (HashPageOpaque) PageGetSpecialPointer(leftpage);
+		leftopaque->hasho_nextblkno = rightblk;
+
+		PageSetLSN(leftpage, lsn);
+		MarkBufferDirty(leftbuf);
+	}
+
+	/*
+	 * We need to release the locks once the prev pointer of overflow bucket
+	 * and next of left bucket are set, otherwise concurrent read might skip
+	 * the bucket.
+	 */
+	if (BufferIsValid(leftbuf))
+		UnlockReleaseBuffer(leftbuf);
+	UnlockReleaseBuffer(ovflbuf);
+
+	/*
+	 * Note: in normal operation, we'd update the bitmap and meta page while
+	 * still holding lock on the overflow pages we updated.  But during replay
+	 * it's not necessary to hold those locks, since no other index updates
+	 * can be happening concurrently.
+	 */
+	if (XLogRecHasBlockRef(record, 2))
+	{
+		Buffer		mapbuffer;
+
+		if (XLogReadBufferForRedo(record, 2, &mapbuffer) == BLK_NEEDS_REDO)
+		{
+			Page		mappage = (Page) BufferGetPage(mapbuffer);
+			uint32	   *freep = NULL;
+			char	   *data;
+			uint32	   *bitmap_page_bit;
+
+			freep = HashPageGetBitmap(mappage);
+
+			data = XLogRecGetBlockData(record, 2, &datalen);
+			bitmap_page_bit = (uint32 *) data;
+
+			SETBIT(freep, *bitmap_page_bit);
+
+			PageSetLSN(mappage, lsn);
+			MarkBufferDirty(mapbuffer);
+		}
+		if (BufferIsValid(mapbuffer))
+			UnlockReleaseBuffer(mapbuffer);
+	}
+
+	if (XLogRecHasBlockRef(record, 3))
+	{
+		Buffer		newmapbuf;
+
+		newmapbuf = XLogInitBufferForRedo(record, 3);
+
+		_hash_initbitmapbuffer(newmapbuf, xlrec->bmsize, true);
+
+		new_bmpage = true;
+		newmapblk = BufferGetBlockNumber(newmapbuf);
+
+		MarkBufferDirty(newmapbuf);
+		PageSetLSN(BufferGetPage(newmapbuf), lsn);
+
+		UnlockReleaseBuffer(newmapbuf);
+	}
+
+	if (XLogReadBufferForRedo(record, 4, &metabuf) == BLK_NEEDS_REDO)
+	{
+		HashMetaPage metap;
+		Page		page;
+		uint32	   *firstfree_ovflpage;
+
+		data = XLogRecGetBlockData(record, 4, &datalen);
+		firstfree_ovflpage = (uint32 *) data;
+
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+		metap->hashm_firstfree = *firstfree_ovflpage;
+
+		if (!xlrec->bmpage_found)
+		{
+			metap->hashm_spares[metap->hashm_ovflpoint]++;
+
+			if (new_bmpage)
+			{
+				Assert(BlockNumberIsValid(newmapblk));
+
+				metap->hashm_mapp[metap->hashm_nmaps] = newmapblk;
+				metap->hashm_nmaps++;
+				metap->hashm_spares[metap->hashm_ovflpoint]++;
+			}
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay allocation of page for split operation
+ */
+static void
+hash_xlog_split_allocpage(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_split_allocpage *xlrec = (xl_hash_split_allocpage *) XLogRecGetData(record);
+	Buffer		oldbuf;
+	Buffer		newbuf;
+	Buffer		metabuf;
+	Size datalen PG_USED_FOR_ASSERTS_ONLY;
+	char	   *data;
+	XLogRedoAction action;
+
+	/* replay the record for old bucket */
+	action = XLogReadBufferForRedo(record, 0, &oldbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		oldpage;
+		HashPageOpaque oldopaque;
+
+		oldpage = BufferGetPage(oldbuf);
+		oldopaque = (HashPageOpaque) PageGetSpecialPointer(oldpage);
+
+		oldopaque->hasho_flag = xlrec->old_bucket_flag;
+
+		PageSetLSN(oldpage, lsn);
+		MarkBufferDirty(oldbuf);
+	}
+
+	/*
+	 * There is no harm in releasing the lock on old bucket before new bucket
+	 * during replay as no other bucket splits can be happening concurrently.
+	 */
+	if (BufferIsValid(oldbuf))
+		UnlockReleaseBuffer(oldbuf);
+
+	/* replay the record for new bucket */
+	newbuf = XLogInitBufferForRedo(record, 1);
+	_hash_initbuf(newbuf, xlrec->new_bucket, xlrec->new_bucket_flag, true);
+	MarkBufferDirty(newbuf);
+	PageSetLSN(BufferGetPage(newbuf), lsn);
+
+	UnlockReleaseBuffer(newbuf);
+
+	/*
+	 * Note: in normal operation, we'd update the meta page while still
+	 * holding lock on the old and new bucket pages.  But during replay it's
+	 * not necessary to hold those locks, since no other bucket splits can be
+	 * happening concurrently.
+	 */
+
+	/* replay the record for metapage changes */
+	if (XLogReadBufferForRedo(record, 2, &metabuf) == BLK_NEEDS_REDO)
+	{
+		Page		page;
+		HashMetaPage metap;
+
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+		metap->hashm_maxbucket = xlrec->new_bucket;
+
+		data = XLogRecGetBlockData(record, 2, &datalen);
+
+		if (xlrec->flags & XLH_SPLIT_META_UPDATE_MASKS)
+		{
+			uint32		lowmask;
+			uint32	   *highmask;
+
+			/* extract low and high masks. */
+			memcpy(&lowmask, data, sizeof(uint32));
+			highmask = (uint32 *) ((char *) data + sizeof(uint32));
+
+			/* update metapage */
+			metap->hashm_lowmask = lowmask;
+			metap->hashm_highmask = *highmask;
+
+			data += sizeof(uint32) * 2;
+		}
+
+		if (xlrec->flags & XLH_SPLIT_META_UPDATE_SPLITPOINT)
+		{
+			uint32		ovflpoint;
+			uint32	   *ovflpages;
+
+			/* extract information of overflow pages. */
+			memcpy(&ovflpoint, data, sizeof(uint32));
+			ovflpages = (uint32 *) ((char *) data + sizeof(uint32));
+
+			/* update metapage */
+			metap->hashm_spares[ovflpoint] = *ovflpages;
+			metap->hashm_ovflpoint = ovflpoint;
+		}
+
+		MarkBufferDirty(metabuf);
+		PageSetLSN(BufferGetPage(metabuf), lsn);
+	}
+
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay of split operation
+ */
+static void
+hash_xlog_split_page(XLogReaderState *record)
+{
+	Buffer		buf;
+
+	if (XLogReadBufferForRedo(record, 0, &buf) != BLK_RESTORED)
+		elog(ERROR, "Hash split record did not contain a full-page image");
+
+	UnlockReleaseBuffer(buf);
+}
+
+/*
+ * replay completion of split operation
+ */
+static void
+hash_xlog_split_complete(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_split_complete *xlrec = (xl_hash_split_complete *) XLogRecGetData(record);
+	Buffer		oldbuf;
+	Buffer		newbuf;
+	XLogRedoAction action;
+
+	/* replay the record for old bucket */
+	action = XLogReadBufferForRedo(record, 0, &oldbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		oldpage;
+		HashPageOpaque oldopaque;
+
+		oldpage = BufferGetPage(oldbuf);
+		oldopaque = (HashPageOpaque) PageGetSpecialPointer(oldpage);
+
+		oldopaque->hasho_flag = xlrec->old_bucket_flag;
+
+		PageSetLSN(oldpage, lsn);
+		MarkBufferDirty(oldbuf);
+	}
+	if (BufferIsValid(oldbuf))
+		UnlockReleaseBuffer(oldbuf);
+
+	/* replay the record for new bucket */
+	action = XLogReadBufferForRedo(record, 1, &newbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		newpage;
+		HashPageOpaque nopaque;
+
+		newpage = BufferGetPage(newbuf);
+		nopaque = (HashPageOpaque) PageGetSpecialPointer(newpage);
+
+		nopaque->hasho_flag = xlrec->new_bucket_flag;
+
+		PageSetLSN(newpage, lsn);
+		MarkBufferDirty(newbuf);
+	}
+	if (BufferIsValid(newbuf))
+		UnlockReleaseBuffer(newbuf);
+}
+
+/*
+ * replay move of page contents for squeeze operation of hash index
+ */
+static void
+hash_xlog_move_page_contents(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_move_page_contents *xldata = (xl_hash_move_page_contents *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		writebuf = InvalidBuffer;
+	Buffer		deletebuf = InvalidBuffer;
+	XLogRedoAction action;
+
+	/*
+	 * Ensure to have a cleanup lock on primary bucket page before we start
+	 * with the actual replay operation.  This is to ensure that neither a
+	 * scan can start nor a scan can be already-in-progress during the replay
+	 * of this operation.  If we allow scans during this operation, then they
+	 * can miss some records or show the same record multiple times.
+	 */
+	if (xldata->is_prim_bucket_same_wrt)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &writebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bucketbuf))
+			LockBufferForCleanup(bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &writebuf);
+	}
+
+	/* replay the record for adding entries in overflow buffer */
+	if (action == BLK_NEEDS_REDO)
+	{
+		Page		writepage;
+		char	   *begin;
+		char	   *data;
+		Size		datalen;
+		uint16		ninserted = 0;
+
+		data = begin = XLogRecGetBlockData(record, 1, &datalen);
+
+		writepage = (Page) BufferGetPage(writebuf);
+
+		if (xldata->ntups > 0)
+		{
+			OffsetNumber *towrite = (OffsetNumber *) data;
+
+			data += sizeof(OffsetNumber) * xldata->ntups;
+
+			while (data - begin < datalen)
+			{
+				IndexTuple	itup = (IndexTuple) data;
+				Size		itemsz;
+				OffsetNumber l;
+
+				itemsz = IndexTupleDSize(*itup);
+				itemsz = MAXALIGN(itemsz);
+
+				data += itemsz;
+
+				l = PageAddItem(writepage, (Item) itup, itemsz, towrite[ninserted], false, false);
+				if (l == InvalidOffsetNumber)
+					elog(ERROR, "hash_xlog_move_page_contents: failed to add item to hash index page, size %d bytes",
+						 (int) itemsz);
+
+				ninserted++;
+			}
+		}
+
+		/*
+		 * The number of tuples inserted must match the count in the REDO record.
+		 */
+		Assert(ninserted == xldata->ntups);
+
+		PageSetLSN(writepage, lsn);
+		MarkBufferDirty(writebuf);
+	}
+
+	/* replay the record for deleting entries from overflow buffer */
+	if (XLogReadBufferForRedo(record, 2, &deletebuf) == BLK_NEEDS_REDO)
+	{
+		Page		page;
+		char	   *ptr;
+		Size		len;
+
+		ptr = XLogRecGetBlockData(record, 2, &len);
+         affected by this setting.  Examples include system catalogs.  For
+		page = (Page) BufferGetPage(deletebuf);
+
+		if (len > 0)
+		{
+			OffsetNumber *unused;
+			OffsetNumber *unend;
+
+			unused = (OffsetNumber *) ptr;
+			unend = (OffsetNumber *) ((char *) ptr + len);
+
+			if ((unend - unused) > 0)
+				PageIndexMultiDelete(page, unused, unend - unused);
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(deletebuf);
+	}
+
+	/*
+	 * Replay is complete, now we can release the buffers. We release locks at
+	 * end of replay operation to ensure that we hold lock on primary bucket
+	 * page till end of operation.  We can optimize by releasing the lock on
+	 * write buffer as soon as the operation for same is complete, if it is
+	 * not same as primary bucket page, but that doesn't seem to be worth
+	 * complicating the code.
+	 */
+	if (BufferIsValid(deletebuf))
+		UnlockReleaseBuffer(deletebuf);
+
+	if (BufferIsValid(writebuf))
+		UnlockReleaseBuffer(writebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay squeeze page operation of hash index
+ */
+static void
+hash_xlog_squeeze_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_squeeze_page *xldata = (xl_hash_squeeze_page *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		writebuf;
+	Buffer		ovflbuf;
+	Buffer		prevbuf = InvalidBuffer;
+	Buffer		mapbuf;
+	XLogRedoAction action;
+
+	/*
+	 * Ensure to have a cleanup lock on primary bucket page before we start
+	 * with the actual replay operation.  This is to ensure that neither a
+	 * scan can start nor a scan can be already-in-progress during the replay
+	 * of this operation.  If we allow scans during this operation, then they
+	 * can miss some records or show the same record multiple times.
+	 */
+	if (xldata->is_prim_bucket_same_wrt)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &writebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bucketbuf))
+			LockBufferForCleanup(bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &writebuf);
+	}
+
+	/* replay the record for adding entries in overflow buffer */
+	if (action == BLK_NEEDS_REDO)
+	{
+		Page		writepage;
+		char	   *begin;
+		char	   *data;
+		Size		datalen;
+		uint16		ninserted = 0;
+
+		data = begin = XLogRecGetBlockData(record, 1, &datalen);
+
+		writepage = (Page) BufferGetPage(writebuf);
+
+		if (xldata->ntups > 0)
+		{
+			OffsetNumber *towrite = (OffsetNumber *) data;
+
+			data += sizeof(OffsetNumber) * xldata->ntups;
+
+			while (data - begin < datalen)
+			{
+				IndexTuple	itup = (IndexTuple) data;
+				Size		itemsz;
+				OffsetNumber l;
+
+				itemsz = IndexTupleDSize(*itup);
+				itemsz = MAXALIGN(itemsz);
+
+				data += itemsz;
+
+				l = PageAddItem(writepage, (Item) itup, itemsz, towrite[ninserted], false, false);
+				if (l == InvalidOffsetNumber)
+					elog(ERROR, "hash_xlog_squeeze_page: failed to add item to hash index page, size %d bytes",
+						 (int) itemsz);
+
+				ninserted++;
+			}
+		}
+
+		/*
+		 * The number of tuples inserted must match the count in the REDO record.
+		 */
+		Assert(ninserted == xldata->ntups);
+
+		/*
+		 * If the page to which we are adding tuples is the page previous to
+		 * the freed overflow page, then update its nextblkno.
+		 */
+		if (xldata->is_prev_bucket_same_wrt)
+		{
+			HashPageOpaque writeopaque = (HashPageOpaque) PageGetSpecialPointer(writepage);
+
+			writeopaque->hasho_nextblkno = xldata->nextblkno;
+		}
+
+		PageSetLSN(writepage, lsn);
+		MarkBufferDirty(writebuf);
+	}
+
+	/* replay the record for initializing overflow buffer */
+	if (XLogReadBufferForRedo(record, 2, &ovflbuf) == BLK_NEEDS_REDO)
+	{
+		Page		ovflpage;
+
+		ovflpage = BufferGetPage(ovflbuf);
+
+		_hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));
+
+		PageSetLSN(ovflpage, lsn);
+		MarkBufferDirty(ovflbuf);
+	}
+	if (BufferIsValid(ovflbuf))
+		UnlockReleaseBuffer(ovflbuf);
+
+	/* replay the record for page previous to the freed overflow page */
+	if (!xldata->is_prev_bucket_same_wrt &&
+		XLogReadBufferForRedo(record, 3, &prevbuf) == BLK_NEEDS_REDO)
+	{
+		Page		prevpage = BufferGetPage(prevbuf);
+		HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+		prevopaque->hasho_nextblkno = xldata->nextblkno;
+
+		PageSetLSN(prevpage, lsn);
+		MarkBufferDirty(prevbuf);
+	}
+	if (BufferIsValid(prevbuf))
+		UnlockReleaseBuffer(prevbuf);
+
+	/* replay the record for page next to the freed overflow page */
+	if (XLogRecHasBlockRef(record, 4))
+	{
+		Buffer		nextbuf;
+
+		if (XLogReadBufferForRedo(record, 4, &nextbuf) == BLK_NEEDS_REDO)
+		{
+			Page		nextpage = BufferGetPage(nextbuf);
+			HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage);
+
+			nextopaque->hasho_prevblkno = xldata->prevblkno;
+
+			PageSetLSN(nextpage, lsn);
+			MarkBufferDirty(nextbuf);
+		}
+		if (BufferIsValid(nextbuf))
+			UnlockReleaseBuffer(nextbuf);
+	}
+
+	/* replay the record for bitmap page */
+	if (XLogReadBufferForRedo(record, 5, &mapbuf) == BLK_NEEDS_REDO)
+	{
+		Page		mappage = (Page) BufferGetPage(mapbuf);
+		uint32	   *freep = NULL;
+		char	   *data;
+		uint32	   *bitmap_page_bit;
+		Size		datalen;
+
+		freep = HashPageGetBitmap(mappage);
+
+		data = XLogRecGetBlockData(record, 5, &datalen);
+		bitmap_page_bit = (uint32 *) data;
+
+		CLRBIT(freep, *bitmap_page_bit);
+
+		PageSetLSN(mappage, lsn);
+		MarkBufferDirty(mapbuf);
+	}
+	if (BufferIsValid(mapbuf))
+		UnlockReleaseBuffer(mapbuf);
+
+	/* replay the record for meta page */
+	if (XLogRecHasBlockRef(record, 6))
+	{
+		Buffer		metabuf;
+
+		if (XLogReadBufferForRedo(record, 6, &metabuf) == BLK_NEEDS_REDO)
+		{
+			HashMetaPage metap;
+			Page		page;
+			char	   *data;
+			uint32	   *firstfree_ovflpage;
+			Size		datalen;
+
+			data = XLogRecGetBlockData(record, 6, &datalen);
+			firstfree_ovflpage = (uint32 *) data;
+
+			page = BufferGetPage(metabuf);
+			metap = HashPageGetMeta(page);
+			metap->hashm_firstfree = *firstfree_ovflpage;
+
+			PageSetLSN(page, lsn);
+			MarkBufferDirty(metabuf);
+		}
+		if (BufferIsValid(metabuf))
+			UnlockReleaseBuffer(metabuf);
+	}
+
+	/*
+	 * We release locks on writebuf and bucketbuf at end of replay operation
+	 * to ensure that we hold lock on primary bucket page till end of
+	 * operation.  We can optimize by releasing the lock on write buffer as
+	 * soon as the operation for same is complete, if it is not same as
+	 * primary bucket page, but that doesn't seem to be worth complicating the
+	 * code.
+	 */
+	if (BufferIsValid(writebuf))
+		UnlockReleaseBuffer(writebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay delete operation of hash index
+ */
+static void
+hash_xlog_delete(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_delete *xldata = (xl_hash_delete *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		deletebuf;
+	Page		page;
+	XLogRedoAction action;
+
+	/*
+	 * Ensure to have a cleanup lock on primary bucket page before we start
+	 * with the actual replay operation.  This is to ensure that neither a
+	 * scan can start nor a scan can be already-in-progress during the replay
+	 * of this operation.  If we allow scans during this operation, then they
+	 * can miss some records or show the same record multiple times.
+	 */
+	if (xldata->is_primary_bucket_page)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &deletebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bucketbuf))
+			LockBufferForCleanup(bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &deletebuf);
+	}
+
+	/* replay the record for deleting entries in bucket page */
+	if (action == BLK_NEEDS_REDO)
+	{
+		char	   *ptr;
+		Size		len;
+
+		ptr = XLogRecGetBlockData(record, 1, &len);
+
+		page = (Page) BufferGetPage(deletebuf);
+
+		if (len > 0)
+		{
+			OffsetNumber *unused;
+			OffsetNumber *unend;
+
+			unused = (OffsetNumber *) ptr;
+			unend = (OffsetNumber *) ((char *) ptr + len);
+
+			if ((unend - unused) > 0)
+				PageIndexMultiDelete(page, unused, unend - unused);
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(deletebuf);
+	}
+	if (BufferIsValid(deletebuf))
+		UnlockReleaseBuffer(deletebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay clear garbage flag operation for primary bucket page.
+ */
+static void
+hash_xlog_clear_garbage(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		HashPageOpaque bucket_opaque;
+
+		page = (Page) BufferGetPage(buffer);
+
+		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+		bucket_opaque->hasho_flag &= ~LH_BUCKET_PAGE_HAS_GARBAGE;
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
+
+/*
+ * replay for update meta page
+ */
+static void
+hash_xlog_update_meta_page(XLogReaderState *record)
+{
+	HashMetaPage metap;
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_update_meta_page *xldata = (xl_hash_update_meta_page *) XLogRecGetData(record);
+	Buffer		metabuf;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &metabuf) == BLK_NEEDS_REDO)
+	{
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+
+		metap->hashm_ntuples = xldata->ntuples;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+void
+hash_redo(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			hash_xlog_init_meta_page(record);
+			break;
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			hash_xlog_init_bitmap_page(record);
+			break;
+		case XLOG_HASH_INSERT:
+			hash_xlog_insert(record);
+			break;
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			hash_xlog_addovflpage(record);
+			break;
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			hash_xlog_split_allocpage(record);
+			break;
+		case XLOG_HASH_SPLIT_PAGE:
+			hash_xlog_split_page(record);
+			break;
+		case XLOG_HASH_SPLIT_COMPLETE:
+			hash_xlog_split_complete(record);
+			break;
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			hash_xlog_move_page_contents(record);
+			break;
+		case XLOG_HASH_SQUEEZE_PAGE:
+			hash_xlog_squeeze_page(record);
+			break;
+		case XLOG_HASH_DELETE:
+			hash_xlog_delete(record);
+			break;
+		case XLOG_HASH_CLEAR_GARBAGE:
+			hash_xlog_clear_garbage(record);
+			break;
+		case XLOG_HASH_UPDATE_META_PAGE:
+			hash_xlog_update_meta_page(record);
+			break;
+		default:
+			elog(PANIC, "hash_redo: unknown op code %u", info);
+	}
+}
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 5cfd0aa..3514138 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -16,6 +16,8 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "access/hash_xlog.h"
+#include "miscadmin.h"
 #include "utils/rel.h"
 
 
@@ -44,6 +46,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	uint32		maxbucket;
 	uint32		highmask;
 	uint32		lowmask;
+	OffsetNumber itup_off;
 
 	/*
 	 * Get the hash key for the item (it's stored in the index tuple itself).
@@ -248,31 +251,63 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 		Assert(pageopaque->hasho_bucket == bucket);
 	}
 
-	/* found page with enough space, so add the item here */
-	(void) _hash_pgaddtup(rel, buf, itemsz, itup);
-
-	/*
-	 * write and release the modified page and ensure to release the pin on
-	 * primary page.
-	 */
-	_hash_wrtbuf(rel, buf);
-	if (buf != bucket_buf)
-		_hash_dropbuf(rel, bucket_buf);
-
 	/*
 	 * Write-lock the metapage so we can increment the tuple count. After
 	 * incrementing it, check to see if it's time for a split.
 	 */
 	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
 
+	/* Do the update.  No ereport(ERROR) until changes are logged */
+	START_CRIT_SECTION();
+
+	/* found page with enough space, so add the item here */
+	itup_off = _hash_pgaddtup(rel, buf, itemsz, itup);
+
+	MarkBufferDirty(buf);
+
+	/* metapage operations */
 	metap->hashm_ntuples += 1;
 
 	/* Make sure this stays in sync with _hash_expandtable() */
 	do_expand = metap->hashm_ntuples >
 		(double) metap->hashm_ffactor * (metap->hashm_maxbucket + 1);
 
-	/* Write out the metapage and drop lock, but keep pin */
-	_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
+	MarkBufferDirty(metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_insert xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.offnum = itup_off;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInsert);
+
+		XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);
+
+		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+		XLogRegisterBufData(0, (char *) itup, IndexTupleDSize(*itup));
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INSERT);
+
+		PageSetLSN(BufferGetPage(buf), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	/* drop lock on metapage, but keep pin */
+	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+	/*
+	 * Release the modified page and ensure to release the pin on primary
+	 * page.
+	 */
+	_hash_relbuf(rel, buf);
+	if (buf != bucket_buf)
+		_hash_dropbuf(rel, bucket_buf);
 
 	/* Attempt to split if a split is needed */
 	if (do_expand)
@@ -314,3 +349,44 @@ _hash_pgaddtup(Relation rel, Buffer buf, Size itemsize, IndexTuple itup)
 
 	return itup_off;
 }
+
+/*
+ *	_hash_pgaddmultitup() -- add a tuple vector to a particular page in the
+ *							 index.
+ *
+ * This routine has the same requirements for locking and tuple ordering as
+ * _hash_pgaddtup().
+ *
+ * The offset numbers at which the tuples were inserted are returned in itup_offsets.
+ */
+void
+_hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
+					OffsetNumber *itup_offsets, uint16 nitups)
+{
+	OffsetNumber itup_off;
+	Page		page;
+	uint32		hashkey;
+	int			i;
+
+	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+	page = BufferGetPage(buf);
+
+	for (i = 0; i < nitups; i++)
+	{
+		Size		itemsize;
+
+		itemsize = IndexTupleDSize(*itups[i]);
+		itemsize = MAXALIGN(itemsize);
+
+		/* Find where to insert the tuple (preserving page's hashkey ordering) */
+		hashkey = _hash_get_indextuple_hashkey(itups[i]);
+		itup_off = _hash_binsearch(page, hashkey);
+
+		itup_offsets[i] = itup_off;
+
+		if (PageAddItem(page, (Item) itups[i], itemsize, itup_off, false, false)
+			== InvalidOffsetNumber)
+			elog(ERROR, "failed to add index item to \"%s\"",
+				 RelationGetRelationName(rel));
+	}
+}
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 760563a..a6ae979 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -18,10 +18,11 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "access/hash_xlog.h"
+#include "miscadmin.h"
 #include "utils/rel.h"
 
 
-static Buffer _hash_getovflpage(Relation rel, Buffer metabuf);
 static uint32 _hash_firstfreebit(uint32 map);
 
 
@@ -84,7 +85,9 @@ blkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
  *	dropped before exiting (we assume the caller is not interested in 'buf'
  *	anymore) if not asked to retain.  The pin will be retained only for the
  *	primary bucket.  The returned overflow page will be pinned and
- *	write-locked; it is guaranteed to be empty.
+ *	write-locked; it is not guaranteed to be empty, as we release and reacquire
+ *	the lock on it, so the caller must ensure that the page has the required
+ *	space.
  *
  *	The caller must hold a pin, but no lock, on the metapage buffer.
  *	That buffer is returned in the same state.
@@ -102,13 +105,37 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 	Page		ovflpage;
 	HashPageOpaque pageopaque;
 	HashPageOpaque ovflopaque;
-
-	/* allocate and lock an empty overflow page */
-	ovflbuf = _hash_getovflpage(rel, metabuf);
+	HashMetaPage metap;
+	Buffer		mapbuf = InvalidBuffer;
+	Buffer		newmapbuf = InvalidBuffer;
+	BlockNumber blkno;
+	uint32		orig_firstfree;
+	uint32		splitnum;
+	uint32	   *freep = NULL;
+	uint32		max_ovflpg;
+	uint32		bit;
+	uint32		bitmap_page_bit;
+	uint32		first_page;
+	uint32		last_bit;
+	uint32		last_page;
+	uint32		i,
+				j;
+	bool		page_found = false;
 
 	/*
-	 * Write-lock the tail page.  It is okay to hold two buffer locks here
-	 * since there cannot be anyone else contending for access to ovflbuf.
+	 * Write-lock the tail page.  Here, we need to maintain the locking order:
+	 * first acquire the lock on the tail page of the bucket, then on the meta
+	 * page to find and lock the bitmap page; once that page is found, the
+	 * lock on the meta page is released, and finally we acquire the lock on
+	 * the new overflow buffer.  We need this locking order to avoid deadlock
+	 * with backends that are doing inserts.
+	 *
+	 * Note: We could have avoided locking many buffers here, if we make two
+	 * WAL records for acquiring an overflow page (one to allocate an overflow
+	 * page and another to add it to overflow bucket chain).  However, doing
+	 * so can leak an overflow page, if the system crashes after allocation.
+	 * Needless to say, it is better to have a single record from a
+	 * performance point of view as well.
 	 */
 	_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_WRITE);
 
@@ -136,55 +163,6 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 		buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
 	}
 
-	/* now that we have correct backlink, initialize new overflow page */
-	ovflpage = BufferGetPage(ovflbuf);
-	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
-	ovflopaque->hasho_prevblkno = BufferGetBlockNumber(buf);
-	ovflopaque->hasho_nextblkno = InvalidBlockNumber;
-	ovflopaque->hasho_bucket = pageopaque->hasho_bucket;
-	ovflopaque->hasho_flag = LH_OVERFLOW_PAGE;
-	ovflopaque->hasho_page_id = HASHO_PAGE_ID;
-
-	MarkBufferDirty(ovflbuf);
-
-	/* logically chain overflow page to previous page */
-	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
-	if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
-		_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
-	else
-		_hash_wrtbuf(rel, buf);
-
-	return ovflbuf;
-}
-
-/*
- *	_hash_getovflpage()
- *
- *	Find an available overflow page and return it.  The returned buffer
- *	is pinned and write-locked, and has had _hash_pageinit() applied,
- *	but it is caller's responsibility to fill the special space.
- *
- * The caller must hold a pin, but no lock, on the metapage buffer.
- * That buffer is left in the same state at exit.
- */
-static Buffer
-_hash_getovflpage(Relation rel, Buffer metabuf)
-{
-	HashMetaPage metap;
-	Buffer		mapbuf = 0;
-	Buffer		newbuf;
-	BlockNumber blkno;
-	uint32		orig_firstfree;
-	uint32		splitnum;
-	uint32	   *freep = NULL;
-	uint32		max_ovflpg;
-	uint32		bit;
-	uint32		first_page;
-	uint32		last_bit;
-	uint32		last_page;
-	uint32		i,
-				j;
-
 	/* Get exclusive lock on the meta page */
 	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
 
@@ -233,11 +211,31 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 		for (; bit <= last_inpage; j++, bit += BITS_PER_MAP)
 		{
 			if (freep[j] != ALL_SET)
+			{
+				page_found = true;
+
+				/* Reacquire exclusive lock on the meta page */
+				_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
+
+				/* convert bit to bit number within page */
+				bit += _hash_firstfreebit(freep[j]);
+				bitmap_page_bit = bit;
+
+				/* convert bit to absolute bit number */
+				bit += (i << BMPG_SHIFT(metap));
+				/* Calculate address of the recycled overflow page */
+				blkno = bitno_to_blkno(metap, bit);
+
+				/* Fetch and init the recycled page */
+				ovflbuf = _hash_getinitbuf(rel, blkno);
+
 				goto found;
+			}
 		}
 
 		/* No free space here, try to advance to next map page */
 		_hash_relbuf(rel, mapbuf);
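+		/* reset mapbuf so later BufferIsValid() checks know it was released */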
+		mapbuf = InvalidBuffer;
 		i++;
 		j = 0;					/* scan from start of next map page */
 		bit = 0;
@@ -261,8 +259,15 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 		 * convenient to pre-mark them as "in use" too.
 		 */
 		bit = metap->hashm_spares[splitnum];
-		_hash_initbitmap(rel, metap, bitno_to_blkno(metap, bit), MAIN_FORKNUM);
-		metap->hashm_spares[splitnum]++;
+		newmapbuf = _hash_getnewbuf(rel, bitno_to_blkno(metap, bit), MAIN_FORKNUM);
+
+		/* add the new bitmap page to the metapage's list of bitmaps */
+		/* metapage already has a write lock */
+		if (metap->hashm_nmaps >= HASH_MAX_BITMAPS)
+			ereport(ERROR,
+					(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+					 errmsg("out of overflow pages in hash index \"%s\"",
+							RelationGetRelationName(rel))));
 	}
 	else
 	{
@@ -273,7 +278,8 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 	}
 
 	/* Calculate address of the new overflow page */
-	bit = metap->hashm_spares[splitnum];
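+	/*
+	 * If a new bitmap page was allocated above, it consumes the block that
+	 * the current spare count maps to, so the new overflow page gets the
+	 * next bit.
+	 */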
+	bit = BufferIsValid(newmapbuf) ?
+		metap->hashm_spares[splitnum] + 1 : metap->hashm_spares[splitnum];
 	blkno = bitno_to_blkno(metap, bit);
 
 	/*
@@ -281,39 +287,51 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 	 * relation length stays in sync with ours.  XXX It's annoying to do this
 	 * with metapage write lock held; would be better to use a lock that
 	 * doesn't block incoming searches.
+	 *
+	 * It is okay to hold two buffer locks here (one on tail page of bucket
+	 * and other on new overflow page) since there cannot be anyone else
+	 * contending for access to ovflbuf.
 	 */
-	newbuf = _hash_getnewbuf(rel, blkno, MAIN_FORKNUM);
+	ovflbuf = _hash_getnewbuf(rel, blkno, MAIN_FORKNUM);
 
-	metap->hashm_spares[splitnum]++;
+found:
 
 	/*
-	 * Adjust hashm_firstfree to avoid redundant searches.  But don't risk
-	 * changing it if someone moved it while we were searching bitmap pages.
+	 * Do the update.  No ereport(ERROR) until changes are logged.  We log the
+	 * changes for the bitmap page and the overflow page together, so that a
+	 * newly added page is not lost if we crash between logging them.
 	 */
-	if (metap->hashm_firstfree == orig_firstfree)
-		metap->hashm_firstfree = bit + 1;
-
-	/* Write updated metapage and release lock, but not pin */
-	_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
+	START_CRIT_SECTION();
 
-	return newbuf;
-
-found:
-	/* convert bit to bit number within page */
-	bit += _hash_firstfreebit(freep[j]);
+	if (page_found)
+	{
+		Assert(BufferIsValid(mapbuf));
 
-	/* mark page "in use" in the bitmap */
-	SETBIT(freep, bit);
-	_hash_wrtbuf(rel, mapbuf);
+		/* mark page "in use" in the bitmap */
+		SETBIT(freep, bitmap_page_bit);
+		MarkBufferDirty(mapbuf);
+	}
+	else
+	{
+		/* update the count to indicate new overflow page is added */
+		metap->hashm_spares[splitnum]++;
 
-	/* Reacquire exclusive lock on the meta page */
-	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
+		if (BufferIsValid(newmapbuf))
+		{
+			_hash_initbitmapbuffer(newmapbuf, metap->hashm_bmsize, false);
+			MarkBufferDirty(newmapbuf);
 
-	/* convert bit to absolute bit number */
-	bit += (i << BMPG_SHIFT(metap));
+			metap->hashm_mapp[metap->hashm_nmaps] = BufferGetBlockNumber(newmapbuf);
+			metap->hashm_nmaps++;
+			metap->hashm_spares[splitnum]++;
+			MarkBufferDirty(metabuf);
+		}
 
-	/* Calculate address of the recycled overflow page */
-	blkno = bitno_to_blkno(metap, bit);
+		/*
+		 * For a new overflow page, we don't need to explicitly set the bit in
+		 * the bitmap page, as by default it will be set to "in use".
+		 */
+	}
 
 	/*
 	 * Adjust hashm_firstfree to avoid redundant searches.  But don't risk
@@ -322,18 +340,101 @@ found:
 	if (metap->hashm_firstfree == orig_firstfree)
 	{
 		metap->hashm_firstfree = bit + 1;
-
-		/* Write updated metapage and release lock, but not pin */
-		_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
+		MarkBufferDirty(metabuf);
 	}
-	else
+
+	/* now that we have correct backlink, initialize new overflow page */
+	ovflpage = BufferGetPage(ovflbuf);
+	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
+	ovflopaque->hasho_prevblkno = BufferGetBlockNumber(buf);
+	ovflopaque->hasho_nextblkno = InvalidBlockNumber;
+	ovflopaque->hasho_bucket = pageopaque->hasho_bucket;
+	ovflopaque->hasho_flag = LH_OVERFLOW_PAGE;
+	ovflopaque->hasho_page_id = HASHO_PAGE_ID;
+
+	MarkBufferDirty(ovflbuf);
+
+	/* logically chain overflow page to previous page */
+	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
+
+	MarkBufferDirty(buf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
 	{
-		/* We didn't change the metapage, so no need to write */
-		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+		XLogRecPtr	recptr;
+		xl_hash_addovflpage xlrec;
+
+		xlrec.bmpage_found = page_found;
+		xlrec.bmsize = metap->hashm_bmsize;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashAddOvflPage);
+
+		XLogRegisterBuffer(0, ovflbuf, REGBUF_WILL_INIT);
+		XLogRegisterBufData(0, (char *) &pageopaque->hasho_bucket, sizeof(Bucket));
+
+		XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
+
+		if (BufferIsValid(mapbuf))
+		{
+			/*
+			 * As the bitmap page doesn't have a standard page layout, this
+			 * allows us to log the data.
+			 */
+			XLogRegisterBuffer(2, mapbuf, REGBUF_STANDARD | REGBUF_NO_IMAGE);
+			XLogRegisterBufData(2, (char *) &bitmap_page_bit, sizeof(uint32));
+		}
+
+		if (BufferIsValid(newmapbuf))
+			XLogRegisterBuffer(3, newmapbuf, REGBUF_WILL_INIT);
+
+		/*
+		 * To replay meta page changes, we could log the entire metapage,
+		 * which doesn't seem advisable given its size, or we could log the
+		 * individual updated values, which is doable; instead, we prefer to
+		 * perform during replay the same operations on the metapage as are
+		 * done during the original operation.  That is straightforward and
+		 * has the advantage of using less space in WAL.
+		 */
+		XLogRegisterBuffer(4, metabuf, REGBUF_STANDARD);
+		XLogRegisterBufData(4, (char *) &metap->hashm_firstfree, sizeof(uint32));
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_ADD_OVFL_PAGE);
+
+		PageSetLSN(BufferGetPage(ovflbuf), recptr);
+		PageSetLSN(BufferGetPage(buf), recptr);
+
+		if (BufferIsValid(mapbuf))
+			PageSetLSN(BufferGetPage(mapbuf), recptr);
+
+		if (BufferIsValid(newmapbuf))
+			PageSetLSN(BufferGetPage(newmapbuf), recptr);
 	}
 
-	/* Fetch, init, and return the recycled page */
-	return _hash_getinitbuf(rel, blkno);
+	END_CRIT_SECTION();
+
+	if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+		_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+	else
+		_hash_relbuf(rel, buf);
+
+	if (BufferIsValid(mapbuf))
+		_hash_relbuf(rel, mapbuf);
+
+	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+	if (BufferIsValid(newmapbuf))
+		_hash_relbuf(rel, newmapbuf);
+
+	/*
+	 * We need to release and reacquire the lock on the overflow buffer to
+	 * ensure that the standby doesn't see an intermediate state of it.
+	 */
+	_hash_chgbufaccess(rel, ovflbuf, HASH_READ, HASH_NOLOCK);
+	_hash_chgbufaccess(rel, ovflbuf, HASH_NOLOCK, HASH_WRITE);
+
+	return ovflbuf;
 }
 
 /*
@@ -366,6 +467,12 @@ _hash_firstfreebit(uint32 map)
  *	Remove this overflow page from its bucket's chain, and mark the page as
  *	free.  On entry, ovflbuf is write-locked; it is released before exiting.
  *
+ *	Add the tuples (itups) to wbuf in this function; we could do that in the
+ *	caller as well, but the advantage of doing it here is that we can easily
+ *	write the WAL for the XLOG_HASH_SQUEEZE_PAGE operation.  Addition of the
+ *	tuples and removal of the overflow page have to be done as one atomic
+ *	operation, otherwise during replay on the standby users might find
+ *	duplicate records.
+ *
  *	Since this function is invoked in VACUUM, we provide an access strategy
  *	parameter that controls fetches of the bucket pages.
  *
@@ -377,7 +484,9 @@ _hash_firstfreebit(uint32 map)
  *	better hold cleanup lock on the primary bucket.
  */
 BlockNumber
-_hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
+_hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
+				   Buffer wbuf, IndexTuple *itups, OffsetNumber *itup_offsets,
+				   Size *tups_size, uint16 nitups,
 				   BufferAccessStrategy bstrategy)
 {
 	HashMetaPage metap;
@@ -387,6 +496,7 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
 	BlockNumber prevblkno;
 	BlockNumber blkno;
 	BlockNumber nextblkno;
+	BlockNumber writeblkno;
 	HashPageOpaque ovflopaque;
 	Page		ovflpage;
 	Page		mappage;
@@ -395,6 +505,10 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
 	int32		bitmappage,
 				bitmapbit;
 	Bucket bucket PG_USED_FOR_ASSERTS_ONLY;
+	Buffer		prevbuf = InvalidBuffer;
+	Buffer		nextbuf = InvalidBuffer;
+	BlockNumber bucketblkno;
+	bool		update_metap = false;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -403,69 +517,37 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
 	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
 	nextblkno = ovflopaque->hasho_nextblkno;
 	prevblkno = ovflopaque->hasho_prevblkno;
+	writeblkno = BufferGetBlockNumber(wbuf);
 	bucket = ovflopaque->hasho_bucket;
-
-	/*
-	 * Zero the page for debugging's sake; then write and release it. (Note:
-	 * if we failed to zero the page here, we'd have problems with the Assert
-	 * in _hash_pageinit() when the page is reused.)
-	 */
-	MemSet(ovflpage, 0, BufferGetPageSize(ovflbuf));
-	_hash_wrtbuf(rel, ovflbuf);
+	bucketblkno = BufferGetBlockNumber(bucketbuf);
 
 	/*
 	 * Fix up the bucket chain.  this is a doubly-linked list, so we must fix
 	 * up the bucket chain members behind and ahead of the overflow page being
 	 * deleted.  No concurrency issues since we hold the cleanup lock on
 	 * primary bucket.  We don't need to aqcuire buffer lock to fix the
-	 * primary bucket, as we already have that lock.
+	 * primary bucket, or the previous bucket when it is the same as the write
+	 * bucket, as we already hold locks on those buckets.
 	 */
 	if (BlockNumberIsValid(prevblkno))
 	{
-		if (prevblkno == bucket_blkno)
-		{
-			Buffer		prevbuf = ReadBufferExtended(rel, MAIN_FORKNUM,
-													 prevblkno,
-													 RBM_NORMAL,
-													 bstrategy);
-
-			Page		prevpage = BufferGetPage(prevbuf);
-			HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
-
-			Assert(prevopaque->hasho_bucket == bucket);
-			prevopaque->hasho_nextblkno = nextblkno;
-			MarkBufferDirty(prevbuf);
-			ReleaseBuffer(prevbuf);
-		}
+		if (prevblkno == bucketblkno)
+			prevbuf = bucketbuf;
+		else if (prevblkno == writeblkno)
+			prevbuf = wbuf;
 		else
-		{
-			Buffer		prevbuf = _hash_getbuf_with_strategy(rel,
-															 prevblkno,
-															 HASH_WRITE,
-										   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
-															 bstrategy);
-			Page		prevpage = BufferGetPage(prevbuf);
-			HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
-
-			Assert(prevopaque->hasho_bucket == bucket);
-			prevopaque->hasho_nextblkno = nextblkno;
-			_hash_wrtbuf(rel, prevbuf);
-		}
+			prevbuf = _hash_getbuf_with_strategy(rel,
+												 prevblkno,
+												 HASH_WRITE,
+												 LH_OVERFLOW_PAGE,
+												 bstrategy);
 	}
 	if (BlockNumberIsValid(nextblkno))
-	{
-		Buffer		nextbuf = _hash_getbuf_with_strategy(rel,
-														 nextblkno,
-														 HASH_WRITE,
-														 LH_OVERFLOW_PAGE,
-														 bstrategy);
-		Page		nextpage = BufferGetPage(nextbuf);
-		HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage);
-
-		Assert(nextopaque->hasho_bucket == bucket);
-		nextopaque->hasho_prevblkno = prevblkno;
-		_hash_wrtbuf(rel, nextbuf);
-	}
+		nextbuf = _hash_getbuf_with_strategy(rel,
+											 nextblkno,
+											 HASH_WRITE,
+											 LH_OVERFLOW_PAGE,
+											 bstrategy);
 
 	/* Note: bstrategy is intentionally not used for metapage and bitmap */
 
@@ -491,60 +573,191 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
 	mappage = BufferGetPage(mapbuf);
 	freep = HashPageGetBitmap(mappage);
 	Assert(ISSET(freep, bitmapbit));
-	CLRBIT(freep, bitmapbit);
-	_hash_wrtbuf(rel, mapbuf);
 
 	/* Get write-lock on metapage to update firstfree */
 	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
 
+	/* This operation needs to log multiple tuples, prepare WAL for that */
+	if (RelationNeedsWAL(rel))
+		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, XLR_NORMAL_RDATAS + nitups);
+
+	START_CRIT_SECTION();
+
+	/*
+	 * we have to insert tuples on the "write" page, being careful to preserve
+	 * hashkey ordering.  (If we insert many tuples into the same "write" page
+	 * it would be worth qsort'ing them).
+	 */
+	if (nitups > 0)
+	{
+		_hash_pgaddmultitup(rel, wbuf, itups, itup_offsets, nitups);
+		MarkBufferDirty(wbuf);
+	}
+
+	/*
+	 * Initialise the freed overflow page.  Here we can't completely zero
+	 * the page, as WAL replay routines expect pages to be initialized.
+	 * See explanation of RBM_NORMAL mode atop XLogReadBufferExtended.
+	 */
+	_hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));
+	MarkBufferDirty(ovflbuf);
+
+	if (BufferIsValid(prevbuf))
+	{
+		Page		prevpage = BufferGetPage(prevbuf);
+		HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+		Assert(prevopaque->hasho_bucket == bucket);
+		prevopaque->hasho_nextblkno = nextblkno;
+		MarkBufferDirty(prevbuf);
+	}
+
+	if (BufferIsValid(nextbuf))
+	{
+		Page		nextpage = BufferGetPage(nextbuf);
+		HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage);
+
+		Assert(nextopaque->hasho_bucket == bucket);
+		nextopaque->hasho_prevblkno = prevblkno;
+		MarkBufferDirty(nextbuf);
+	}
+
+	CLRBIT(freep, bitmapbit);
+	MarkBufferDirty(mapbuf);
+
 	/* if this is now the first free page, update hashm_firstfree */
 	if (ovflbitno < metap->hashm_firstfree)
 	{
 		metap->hashm_firstfree = ovflbitno;
-		_hash_wrtbuf(rel, metabuf);
+		update_metap = true;
+		MarkBufferDirty(metabuf);
 	}
-	else
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
 	{
-		/* no need to change metapage */
-		_hash_relbuf(rel, metabuf);
+		xl_hash_squeeze_page xlrec;
+		XLogRecPtr	recptr;
+		int			i;
+
+		xlrec.prevblkno = prevblkno;
+		xlrec.nextblkno = nextblkno;
+		xlrec.ntups = nitups;
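+		/*
+		 * The *_same_wrt flags tell replay when the "write" page doubles as
+		 * the primary bucket page or the previous page, so those blocks need
+		 * not be registered separately.
+		 */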
+		xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf) ? true : false;
+		xlrec.is_prev_bucket_same_wrt = (wbuf == prevbuf) ? true : false;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashSqueezePage);
+
+		/*
+		 * bucket buffer needs to be registered to ensure that we can acquire
+		 * a cleanup lock on it during replay.
+		 */
+		if (!xlrec.is_prim_bucket_same_wrt)
+			XLogRegisterBuffer(0, bucketbuf, REGBUF_STANDARD);
+
+		XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
+		if (xlrec.ntups > 0)
+		{
+			XLogRegisterBufData(1, (char *) itup_offsets,
+								nitups * sizeof(OffsetNumber));
+			for (i = 0; i < nitups; i++)
+				XLogRegisterBufData(1, (char *) itups[i], tups_size[i]);
+		}
+
+		XLogRegisterBuffer(2, ovflbuf, REGBUF_STANDARD);
+
+		/*
+		 * If prevpage and the writepage (the block into which we are moving
+		 * tuples from the overflow page) are the same, there is no need to
+		 * register prevpage separately.  During replay, we can directly
+		 * update the nextblock in the writepage.
+		 */
+		if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt)
+			XLogRegisterBuffer(3, prevbuf, REGBUF_STANDARD);
+
+		if (BufferIsValid(nextbuf))
+			XLogRegisterBuffer(4, nextbuf, REGBUF_STANDARD);
+
+		/*
+		 * As the bitmap page doesn't have a standard page layout, this allows
+		 * us to log the data.
+		 */
+		XLogRegisterBuffer(5, mapbuf, REGBUF_STANDARD | REGBUF_NO_IMAGE);
+		XLogRegisterBufData(5, (char *) &bitmapbit, sizeof(uint32));
+
+		if (update_metap)
+		{
+			XLogRegisterBuffer(6, metabuf, REGBUF_STANDARD);
+			XLogRegisterBufData(6, (char *) &metap->hashm_firstfree, sizeof(uint32));
+		}
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SQUEEZE_PAGE);
+
+		PageSetLSN(BufferGetPage(wbuf), recptr);
+		PageSetLSN(BufferGetPage(ovflbuf), recptr);
+
+		if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt)
+			PageSetLSN(BufferGetPage(prevbuf), recptr);
+		if (BufferIsValid(nextbuf))
+			PageSetLSN(BufferGetPage(nextbuf), recptr);
+
+		PageSetLSN(BufferGetPage(mapbuf), recptr);
+
+		if (update_metap)
+			PageSetLSN(BufferGetPage(metabuf), recptr);
 	}
 
+	END_CRIT_SECTION();
+
+	/*
+	 * Release the lock; the caller will decide whether to release the pin or
+	 * reacquire the lock.  The caller is responsible for locking and
+	 * unlocking the bucket buffer.
+	 */
+	if (BufferGetBlockNumber(wbuf) != bucketblkno)
+		_hash_chgbufaccess(rel, wbuf, HASH_READ, HASH_NOLOCK);
+
+	if (BufferIsValid(ovflbuf))
+		_hash_relbuf(rel, ovflbuf);
+
+	/* release previous bucket if it is not same as primary or write bucket */
+	if (BufferIsValid(prevbuf) &&
+		prevblkno != bucketblkno &&
+		prevblkno != writeblkno)
+		_hash_relbuf(rel, prevbuf);
+
+	if (BufferIsValid(nextbuf))
+		_hash_relbuf(rel, nextbuf);
+
+	if (BufferIsValid(mapbuf))
+		_hash_relbuf(rel, mapbuf);
+
+	if (BufferIsValid(metabuf))
+		_hash_relbuf(rel, metabuf);
+
 	return nextblkno;
 }
 
-
 /*
- *	_hash_initbitmap()
- *
- *	 Initialize a new bitmap page.  The metapage has a write-lock upon
- *	 entering the function, and must be written by caller after return.
+ *	_hash_initbitmapbuffer()
  *
- * 'blkno' is the block number of the new bitmap page.
- *
- * All bits in the new bitmap page are set to "1", indicating "in use".
+ *	 Initialize a new bitmap page.  All bits in the new bitmap page are set to
+ *	 "1", indicating "in use".
  */
 void
-_hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
-				 ForkNumber forkNum)
+_hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage)
 {
-	Buffer		buf;
 	Page		pg;
 	HashPageOpaque op;
 	uint32	   *freep;
 
-	/*
-	 * It is okay to write-lock the new bitmap page while holding metapage
-	 * write lock, because no one else could be contending for the new page.
-	 * Also, the metapage lock makes it safe to extend the index using
-	 * _hash_getnewbuf.
-	 *
-	 * There is some loss of concurrency in possibly doing I/O for the new
-	 * page while holding the metapage lock, but this path is taken so seldom
-	 * that it's not worth worrying about.
-	 */
-	buf = _hash_getnewbuf(rel, blkno, forkNum);
 	pg = BufferGetPage(buf);
 
+	/* initialize the page */
+	if (initpage)
+		_hash_pageinit(pg, BufferGetPageSize(buf));
+
 	/* initialize the page's special space */
 	op = (HashPageOpaque) PageGetSpecialPointer(pg);
 	op->hasho_prevblkno = InvalidBlockNumber;
@@ -555,25 +768,9 @@ _hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
 
 	/* set all of the bits to 1 */
 	freep = HashPageGetBitmap(pg);
-	MemSet(freep, 0xFF, BMPGSZ_BYTE(metap));
-
-	/* write out the new bitmap page (releasing write lock and pin) */
-	_hash_wrtbuf(rel, buf);
-
-	/* add the new bitmap page to the metapage's list of bitmaps */
-	/* metapage already has a write lock */
-	if (metap->hashm_nmaps >= HASH_MAX_BITMAPS)
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("out of overflow pages in hash index \"%s\"",
-						RelationGetRelationName(rel))));
-
-	metap->hashm_mapp[metap->hashm_nmaps] = blkno;
-
-	metap->hashm_nmaps++;
+	MemSet(freep, 0xFF, bmsize);
 }
 
-
 /*
  *	_hash_squeezebucket(rel, bucket)
  *
@@ -613,8 +810,6 @@ _hash_squeezebucket(Relation rel,
 	Page		rpage;
 	HashPageOpaque wopaque;
 	HashPageOpaque ropaque;
-	bool		wbuf_dirty;
-	bool		release_buf = false;
 
 	/*
 	 * start squeezing into the base bucket page.
@@ -657,23 +852,30 @@ _hash_squeezebucket(Relation rel,
 	/*
 	 * squeeze the tuples.
 	 */
-	wbuf_dirty = false;
 	for (;;)
 	{
 		OffsetNumber roffnum;
 		OffsetNumber maxroffnum;
 		OffsetNumber deletable[MaxOffsetNumber];
-		int			ndeletable = 0;
-
+		IndexTuple	itups[MaxIndexTuplesPerPage];
+		Size		tups_size[MaxIndexTuplesPerPage];
+		IndexTuple	itup;
+		OffsetNumber *itup_offsets;
+		uint16		ndeletable = 0;
+		uint16		nitups = 0;
+		Size		all_tups_size = 0;
+		Size		itemsz;
+		int			i;
+
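+		/*
+		 * Remember the offsets at which the accumulated tuples are added to
+		 * the "write" page; they are needed to WAL-log the move.
+		 */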
+		itup_offsets = (OffsetNumber *) palloc(MaxIndexTuplesPerPage * sizeof(OffsetNumber));
+
+readpage:
 		/* Scan each tuple in "read" page */
 		maxroffnum = PageGetMaxOffsetNumber(rpage);
 		for (roffnum = FirstOffsetNumber;
 			 roffnum <= maxroffnum;
 			 roffnum = OffsetNumberNext(roffnum))
 		{
-			IndexTuple	itup;
-			Size		itemsz;
-
 			itup = (IndexTuple) PageGetItem(rpage,
 											PageGetItemId(rpage, roffnum));
 			itemsz = IndexTupleDSize(*itup);
@@ -681,39 +883,113 @@ _hash_squeezebucket(Relation rel,
 
 			/*
 			 * Walk up the bucket chain, looking for a page big enough for
-			 * this item.  Exit if we reach the read page.
+			 * this item and all other accumulated items.  Exit if we reach
+			 * the read page.
 			 */
-			while (PageGetFreeSpace(wpage) < itemsz)
+			while (PageGetFreeSpaceForMulTups(wpage, nitups + 1) < (all_tups_size + itemsz))
 			{
+				bool		release_wbuf = false;
+				bool		tups_moved = false;
+
 				Assert(!PageIsEmpty(wpage));
 
+				/*
+				 * caller is responsible for locking and unlocking the bucket
+				 * buffer
+				 */
 				if (wblkno != bucket_blkno)
-					release_buf = true;
+					release_wbuf = true;
 
 				wblkno = wopaque->hasho_nextblkno;
 				Assert(BlockNumberIsValid(wblkno));
 
-				if (wbuf_dirty && release_buf)
-					_hash_wrtbuf(rel, wbuf);
-				else if (wbuf_dirty)
+				if (nitups > 0)
+				{
+					Assert(nitups == ndeletable);
+
+					/*
+					 * This operation needs to log multiple tuples, prepare
+					 * WAL for that.
+					 */
+					if (RelationNeedsWAL(rel))
+						XLogEnsureRecordSpace(0, XLR_NORMAL_RDATAS + nitups);
+
+					START_CRIT_SECTION();
+
+					/*
+					 * we have to insert tuples on the "write" page, being
+					 * careful to preserve hashkey ordering.  (If we insert
+					 * many tuples into the same "write" page it would be
+					 * worth qsort'ing them).
+					 */
+					_hash_pgaddmultitup(rel, wbuf, itups, itup_offsets, nitups);
 					MarkBufferDirty(wbuf);
-				else if (release_buf)
+
+					/* Delete tuples we already moved off read page */
+					PageIndexMultiDelete(rpage, deletable, ndeletable);
+					MarkBufferDirty(rbuf);
+
+					/* XLOG stuff */
+					if (RelationNeedsWAL(rel))
+					{
+						XLogRecPtr	recptr;
+						xl_hash_move_page_contents xlrec;
+
+						xlrec.ntups = nitups;
+						xlrec.is_prim_bucket_same_wrt = (wbuf == bucket_buf) ? true : false;
+
+						XLogBeginInsert();
+						XLogRegisterData((char *) &xlrec, SizeOfHashMovePageContents);
+
+						/*
+						 * bucket buffer needs to be registered to ensure that
+						 * we can acquire a cleanup lock on it during replay.
+						 */
+						if (!xlrec.is_prim_bucket_same_wrt)
+							XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+						XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
+						XLogRegisterBufData(1, (char *) itup_offsets,
+											nitups * sizeof(OffsetNumber));
+						for (i = 0; i < nitups; i++)
+							XLogRegisterBufData(1, (char *) itups[i], tups_size[i]);
+
+						XLogRegisterBuffer(2, rbuf, REGBUF_STANDARD);
+						XLogRegisterBufData(2, (char *) deletable,
+										  ndeletable * sizeof(OffsetNumber));
+
+						recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_MOVE_PAGE_CONTENTS);
+
+						PageSetLSN(BufferGetPage(wbuf), recptr);
+						PageSetLSN(BufferGetPage(rbuf), recptr);
+					}
+
+					END_CRIT_SECTION();
+
+					tups_moved = true;
+				}
+
+				if (release_wbuf)
 					_hash_relbuf(rel, wbuf);
 
+				/*
+				 * We need to release, and if required reacquire, the lock on
+				 * rbuf to ensure that the standby doesn't see an intermediate
+				 * state of it.  If we don't release the lock, then after
+				 * replay of XLOG_HASH_SQUEEZE_PAGE on the standby, users
+				 * would be able to view the results of a partial deletion on
+				 * rblkno.
+				 */
+				_hash_chgbufaccess(rel, rbuf, HASH_READ, HASH_NOLOCK);
+
 				/* nothing more to do if we reached the read page */
 				if (rblkno == wblkno)
 				{
-					if (ndeletable > 0)
-					{
-						/* Delete tuples we already moved off read page */
-						PageIndexMultiDelete(rpage, deletable, ndeletable);
-						_hash_wrtbuf(rel, rbuf);
-					}
-					else
-						_hash_relbuf(rel, rbuf);
+					_hash_dropbuf(rel, rbuf);
 					return;
 				}
 
+				_hash_chgbufaccess(rel, rbuf, HASH_NOLOCK, HASH_WRITE);
+
 				wbuf = _hash_getbuf_with_strategy(rel,
 												  wblkno,
 												  HASH_WRITE,
@@ -722,21 +998,33 @@ _hash_squeezebucket(Relation rel,
 				wpage = BufferGetPage(wbuf);
 				wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
 				Assert(wopaque->hasho_bucket == bucket);
-				wbuf_dirty = false;
-				release_buf = false;
-			}
 
-			/*
-			 * we have found room so insert on the "write" page, being careful
-			 * to preserve hashkey ordering.  (If we insert many tuples into
-			 * the same "write" page it would be worth qsort'ing instead of
-			 * doing repeated _hash_pgaddtup.)
-			 */
-			(void) _hash_pgaddtup(rel, wbuf, itemsz, itup);
-			wbuf_dirty = true;
+				/* be tidy */
+				for (i = 0; i < nitups; i++)
+					pfree(itups[i]);
+				nitups = 0;
+				all_tups_size = 0;
+				ndeletable = 0;
+
+				/*
+				 * after moving the tuples, rpage would have been compacted,
+				 * so we need to rescan it.
+				 */
+				if (tups_moved)
+					goto readpage;
+			}
 
 			/* remember tuple for deletion from "read" page */
 			deletable[ndeletable++] = roffnum;
+
+			/*
+			 * We need a copy of the index tuples, as they can be freed as
+			 * part of the overflow page; however, we still need them to write
+			 * a WAL record in _hash_freeovflpage.
+			 */
+			itups[nitups] = CopyIndexTuple(itup);
+			tups_size[nitups++] = itemsz;
+			all_tups_size += itemsz;
 		}
 
 		/*
@@ -748,34 +1036,36 @@ _hash_squeezebucket(Relation rel,
 		 * Tricky point here: if our read and write pages are adjacent in the
 		 * bucket chain, our write lock on wbuf will conflict with
 		 * _hash_freeovflpage's attempt to update the sibling links of the
-		 * removed page.  However, in that case we are done anyway, so we can
-		 * simply drop the write lock before calling _hash_freeovflpage.
+		 * removed page.  However, in that case we ensure that
+		 * _hash_freeovflpage doesn't take a lock on that page again.  Releasing
+		 * the lock is not an option, because before that we need to write WAL
+		 * for the change in this page.
 		 */
 		rblkno = ropaque->hasho_prevblkno;
 		Assert(BlockNumberIsValid(rblkno));
 
+		/* free this overflow page (releases rbuf) */
+		_hash_freeovflpage(rel, bucket_buf, rbuf, wbuf, itups, itup_offsets,
+						   tups_size, nitups, bstrategy);
+
+		/* be tidy */
+		for (i = 0; i < nitups; i++)
+			pfree(itups[i]);
+
+		pfree(itup_offsets);
+
 		/* are we freeing the page adjacent to wbuf? */
 		if (rblkno == wblkno)
 		{
+			/* Caller is responsible for locking and unlocking the bucket buffer */
 			if (wblkno != bucket_blkno)
-				release_buf = true;
-
-			/* yes, so release wbuf lock first if needed */
-			if (wbuf_dirty && release_buf)
-				_hash_wrtbuf(rel, wbuf);
-			else if (wbuf_dirty)
-				MarkBufferDirty(wbuf);
-			else if (release_buf)
-				_hash_relbuf(rel, wbuf);
-
-			/* free this overflow page (releases rbuf) */
-			_hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
-			/* done */
+				_hash_dropbuf(rel, wbuf);
 			return;
 		}
 
-		/* free this overflow page, then get the previous one */
-		_hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
+		/* Caller is responsible for locking and unlocking the bucket buffer */
+		if (wblkno != bucket_blkno)
+			_hash_chgbufaccess(rel, wbuf, HASH_NOLOCK, HASH_WRITE);
 
 		rbuf = _hash_getbuf_with_strategy(rel,
 										  rblkno,
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 2a45862..f9d67a7 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -29,6 +29,7 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "access/hash_xlog.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
@@ -40,12 +41,11 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
 				  Bucket obucket, Bucket nbucket,
 				  Buffer obuf,
 				  Buffer nbuf,
+				  HTAB *htab,
 				  uint32 maxbucket,
 				  uint32 highmask, uint32 lowmask);
-static void _hash_splitbucket_guts(Relation rel, Buffer metabuf,
-					   Bucket obucket, Bucket nbucket, Buffer obuf,
-					   Buffer nbuf, HTAB *htab, uint32 maxbucket,
-					   uint32 highmask, uint32 lowmask);
+static void log_split_page(Relation rel, Buffer buf);
 
 
 /*
@@ -160,6 +160,29 @@ _hash_getinitbuf(Relation rel, BlockNumber blkno)
 }
 
 /*
+ *	_hash_initbuf() -- Get and initialize a buffer by bucket number.
+ */
+void
+_hash_initbuf(Buffer buf, uint32 num_bucket, uint32 flag, bool initpage)
+{
+	HashPageOpaque pageopaque;
+	Page		page;
+
+	page = BufferGetPage(buf);
+
+	/* initialize the page */
+	if (initpage)
+		_hash_pageinit(page, BufferGetPageSize(buf));
+
+	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+	pageopaque->hasho_prevblkno = InvalidBlockNumber;
+	pageopaque->hasho_nextblkno = InvalidBlockNumber;
+	pageopaque->hasho_bucket = num_bucket;
+	pageopaque->hasho_flag = flag;
+	pageopaque->hasho_page_id = HASHO_PAGE_ID;
+}
+
+/*
  *	_hash_getnewbuf() -- Get a new page at the end of the index.
  *
  *		This has the same API as _hash_getinitbuf, except that we are adding
@@ -286,35 +309,17 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 }
 
 /*
- *	_hash_wrtbuf() -- write a hash page to disk.
- *
- *		This routine releases the lock held on the buffer and our refcount
- *		for it.  It is an error to call _hash_wrtbuf() without a write lock
- *		and a pin on the buffer.
- *
- * NOTE: this routine should go away when/if hash indexes are WAL-ified.
- * The correct sequence of operations is to mark the buffer dirty, then
- * write the WAL record, then release the lock and pin; so marking dirty
- * can't be combined with releasing.
- */
-void
-_hash_wrtbuf(Relation rel, Buffer buf)
-{
-	MarkBufferDirty(buf);
-	UnlockReleaseBuffer(buf);
-}
-
-/*
  * _hash_chgbufaccess() -- Change the lock type on a buffer, without
  *			dropping our pin on it.
  *
- * from_access and to_access may be HASH_READ, HASH_WRITE, or HASH_NOLOCK,
- * the last indicating that no buffer-level lock is held or wanted.
+ * from_access can be HASH_READ or HASH_NOLOCK, the latter indicating that
+ * no buffer-level lock is held or wanted.
+ *
+ * to_access can be HASH_READ, HASH_WRITE, or HASH_NOLOCK, the last indicating
+ * that no buffer-level lock is held or wanted.
  *
- * When from_access == HASH_WRITE, we assume the buffer is dirty and tell
- * bufmgr it must be written out.  If the caller wants to release a write
- * lock on a page that's not been modified, it's okay to pass from_access
- * as HASH_READ (a bit ugly, but handy in some places).
+ * If the caller wants to release a write lock on a page, it's okay to pass
+ * from_access as HASH_READ (a bit ugly, but quite handy when required).
  */
 void
 _hash_chgbufaccess(Relation rel,
@@ -322,8 +327,7 @@ _hash_chgbufaccess(Relation rel,
 				   int from_access,
 				   int to_access)
 {
-	if (from_access == HASH_WRITE)
-		MarkBufferDirty(buf);
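+	/* Marking the buffer dirty is now the caller's responsibility. */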
+	Assert(from_access != HASH_WRITE);
 	if (from_access != HASH_NOLOCK)
 		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 	if (to_access != HASH_NOLOCK)
@@ -332,7 +336,7 @@ _hash_chgbufaccess(Relation rel,
 
 
 /*
- *	_hash_metapinit() -- Initialize the metadata page of a hash index,
+ *	_hash_init() -- Initialize the metadata page of a hash index,
  *				the initial buckets, and the initial bitmap page.
  *
  * The initial number of buckets is dependent on num_tuples, an estimate
@@ -344,19 +348,18 @@ _hash_chgbufaccess(Relation rel,
  * multiple buffer locks is ignored.
  */
 uint32
-_hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
+_hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
 {
-	HashMetaPage metap;
-	HashPageOpaque pageopaque;
 	Buffer		metabuf;
 	Buffer		buf;
+	Buffer		bitmapbuf;
 	Page		pg;
+	HashMetaPage metap;
+	RegProcedure procid;
 	int32		data_width;
 	int32		item_width;
 	int32		ffactor;
-	double		dnumbuckets;
 	uint32		num_buckets;
-	uint32		log2_num_buckets;
 	uint32		i;
 
 	/* safety check */
@@ -378,6 +381,154 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	if (ffactor < 10)
 		ffactor = 10;
 
+	procid = index_getprocid(rel, 1, HASHPROC);
+
+	/*
+	 * We initialize the metapage, the first N bucket pages, and the first
+	 * bitmap page in sequence, using _hash_getnewbuf to cause smgrextend()
+	 * calls to occur.  This ensures that the smgr level has the right idea of
+	 * the physical index length.
+	 *
+	 * Critical section not required, because on error the creation of the
+	 * whole relation will be rolled back.
+	 */
+	metabuf = _hash_getnewbuf(rel, HASH_METAPAGE, forkNum);
+	_hash_init_metabuffer(metabuf, num_tuples, procid, ffactor, false);
+	MarkBufferDirty(metabuf);
+
+	pg = BufferGetPage(metabuf);
+	metap = HashPageGetMeta(pg);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_init_meta_page xlrec;
+		XLogRecPtr	recptr;
+
+		/*
+		 * Here, we could have copied the entire metapage into the WAL record
+		 * and restored it as is during replay.  However, the metapage is not
+		 * small (more than 600 bytes), so we record just the information
+		 * required to construct it.
+		 */
+		xlrec.num_tuples = num_tuples;
+		xlrec.procid = metap->hashm_procid;
+		xlrec.ffactor = metap->hashm_ffactor;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInitMetaPage);
+		XLogRegisterBuffer(0, metabuf, REGBUF_WILL_INIT);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INIT_META_PAGE);
+
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	num_buckets = metap->hashm_maxbucket + 1;
+
+	/*
+	 * Release buffer lock on the metapage while we initialize buckets.
+	 * Otherwise, we'll be in interrupt holdoff and the CHECK_FOR_INTERRUPTS
+	 * won't accomplish anything.  It's a bad idea to hold buffer locks for
+	 * long intervals in any case, since that can block the bgwriter.
+	 */
+	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+	/*
+	 * Initialize and WAL-log the first N buckets
+	 */
+	for (i = 0; i < num_buckets; i++)
+	{
+		BlockNumber blkno;
+
+		/* Allow interrupts, in case N is huge */
+		CHECK_FOR_INTERRUPTS();
+
+		blkno = BUCKET_TO_BLKNO(metap, i);
+		buf = _hash_getnewbuf(rel, blkno, forkNum);
+		_hash_initbuf(buf, i, LH_BUCKET_PAGE, false);
+		MarkBufferDirty(buf);
+
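+		/* WAL-log the new bucket page as a full-page image */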
+		log_newpage(&rel->rd_node,
+					forkNum,
+					blkno,
+					BufferGetPage(buf),
+					true);
+		_hash_relbuf(rel, buf);
+	}
+
+	/* Now reacquire buffer lock on metapage */
+	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
+
+	/*
+	 * Initialize bitmap page
+	 */
+	bitmapbuf = _hash_getnewbuf(rel, num_buckets + 1, forkNum);
+	_hash_initbitmapbuffer(bitmapbuf, metap->hashm_bmsize, false);
+	MarkBufferDirty(bitmapbuf);
+
+	/* add the new bitmap page to the metapage's list of bitmaps */
+	/* metapage already has a write lock */
+	if (metap->hashm_nmaps >= HASH_MAX_BITMAPS)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+				 errmsg("out of overflow pages in hash index \"%s\"",
+						RelationGetRelationName(rel))));
+
+	metap->hashm_mapp[metap->hashm_nmaps] = num_buckets + 1;
+
+	metap->hashm_nmaps++;
+	MarkBufferDirty(metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_init_bitmap_page xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.bmsize = metap->hashm_bmsize;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInitBitmapPage);
+		XLogRegisterBuffer(0, bitmapbuf, REGBUF_WILL_INIT);
+
+		/*
+		 * We could log the exact changes made to the meta page; however, as
+		 * no concurrent operation can see the index during the replay of this
+		 * record, we simply perform the same operations during replay as are
+		 * done here.
+		 */
+		XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INIT_BITMAP_PAGE);
+
+		PageSetLSN(BufferGetPage(bitmapbuf), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	/* all done */
+	_hash_relbuf(rel, bitmapbuf);
+	_hash_relbuf(rel, metabuf);
+
+	return num_buckets;
+}
+
+/*
+ *	_hash_init_metabuffer() -- Initialize the metadata page of a hash index.
+ */
+void
+_hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
+					  uint16 ffactor, bool initpage)
+{
+	HashMetaPage metap;
+	HashPageOpaque pageopaque;
+	Page		page;
+	double		dnumbuckets;
+	uint32		num_buckets;
+	uint32		log2_num_buckets;
+	uint32		i;
+
+
 	/*
 	 * Choose the number of initial bucket pages to match the fill factor
 	 * given the estimated number of tuples.  We round up the result to the
@@ -397,30 +548,25 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	Assert(num_buckets == (((uint32) 1) << log2_num_buckets));
 	Assert(log2_num_buckets < HASH_MAX_SPLITPOINTS);
 
-	/*
-	 * We initialize the metapage, the first N bucket pages, and the first
-	 * bitmap page in sequence, using _hash_getnewbuf to cause smgrextend()
-	 * calls to occur.  This ensures that the smgr level has the right idea of
-	 * the physical index length.
-	 */
-	metabuf = _hash_getnewbuf(rel, HASH_METAPAGE, forkNum);
-	pg = BufferGetPage(metabuf);
+	page = BufferGetPage(buf);
+	if (initpage)
+		_hash_pageinit(page, BufferGetPageSize(buf));
 
-	pageopaque = (HashPageOpaque) PageGetSpecialPointer(pg);
+	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	pageopaque->hasho_prevblkno = InvalidBlockNumber;
 	pageopaque->hasho_nextblkno = InvalidBlockNumber;
 	pageopaque->hasho_bucket = -1;
 	pageopaque->hasho_flag = LH_META_PAGE;
 	pageopaque->hasho_page_id = HASHO_PAGE_ID;
 
-	metap = HashPageGetMeta(pg);
+	metap = HashPageGetMeta(page);
 
 	metap->hashm_magic = HASH_MAGIC;
 	metap->hashm_version = HASH_VERSION;
 	metap->hashm_ntuples = 0;
 	metap->hashm_nmaps = 0;
 	metap->hashm_ffactor = ffactor;
-	metap->hashm_bsize = HashGetMaxBitmapSize(pg);
+	metap->hashm_bsize = HashGetMaxBitmapSize(page);
 	/* find largest bitmap array size that will fit in page size */
 	for (i = _hash_log2(metap->hashm_bsize); i > 0; --i)
 	{
@@ -437,7 +583,7 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	 * pretty useless for normal operation (in fact, hashm_procid is not used
 	 * anywhere), but it might be handy for forensic purposes so we keep it.
 	 */
-	metap->hashm_procid = index_getprocid(rel, 1, HASHPROC);
+	metap->hashm_procid = procid;
 
 	/*
 	 * We initialize the index with N buckets, 0 .. N-1, occupying physical
@@ -456,44 +602,11 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	metap->hashm_firstfree = 0;
 
 	/*
-	 * Release buffer lock on the metapage while we initialize buckets.
-	 * Otherwise, we'll be in interrupt holdoff and the CHECK_FOR_INTERRUPTS
-	 * won't accomplish anything.  It's a bad idea to hold buffer locks for
-	 * long intervals in any case, since that can block the bgwriter.
+	 * Set pd_lower just past the end of the metadata.  This is to log
+	 * full_page_image of metapage in xloginsert.c.
 	 */
-	_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
-
-	/*
-	 * Initialize the first N buckets
-	 */
-	for (i = 0; i < num_buckets; i++)
-	{
-		/* Allow interrupts, in case N is huge */
-		CHECK_FOR_INTERRUPTS();
-
-		buf = _hash_getnewbuf(rel, BUCKET_TO_BLKNO(metap, i), forkNum);
-		pg = BufferGetPage(buf);
-		pageopaque = (HashPageOpaque) PageGetSpecialPointer(pg);
-		pageopaque->hasho_prevblkno = InvalidBlockNumber;
-		pageopaque->hasho_nextblkno = InvalidBlockNumber;
-		pageopaque->hasho_bucket = i;
-		pageopaque->hasho_flag = LH_BUCKET_PAGE;
-		pageopaque->hasho_page_id = HASHO_PAGE_ID;
-		_hash_wrtbuf(rel, buf);
-	}
-
-	/* Now reacquire buffer lock on metapage */
-	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
-
-	/*
-	 * Initialize first bitmap page
-	 */
-	_hash_initbitmap(rel, metap, num_buckets + 1, forkNum);
-
-	/* all done */
-	_hash_wrtbuf(rel, metabuf);
-
-	return num_buckets;
+	((PageHeader) page)->pd_lower =
+		((char *) metap + sizeof(HashMetaPageData)) - (char *) page;
 }
 
 /*
@@ -502,7 +615,6 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 void
 _hash_pageinit(Page page, Size size)
 {
-	Assert(PageIsNew(page));
 	PageInit(page, size, sizeof(HashPageOpaqueData));
 }
 
@@ -530,10 +642,14 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	Buffer		buf_nblkno;
 	Buffer		buf_oblkno;
 	Page		opage;
+	Page		npage;
 	HashPageOpaque oopaque;
+	HashPageOpaque nopaque;
 	uint32		maxbucket;
 	uint32		highmask;
 	uint32		lowmask;
+	bool		metap_update_masks = false;
+	bool		metap_update_splitpoint = false;
 
 restart_expand:
 
@@ -569,7 +685,7 @@ restart_expand:
 	 * than a disk block then this would be an independent constraint.
 	 *
 	 * If you change this, see also the maximum initial number of buckets in
-	 * _hash_metapinit().
+	 * _hash_init().
 	 */
 	if (metap->hashm_maxbucket >= (uint32) 0x7FFFFFFE)
 		goto fail;
@@ -703,7 +819,11 @@ restart_expand:
 		 * The number of buckets in the new splitpoint is equal to the total
 		 * number already in existence, i.e. new_bucket.  Currently this maps
 		 * one-to-one to blocks required, but someday we may need a more
-		 * complicated calculation here.
+		 * complicated calculation here.  We treat allocation of buckets as a
+		 * separate WAL action.  Even if we fail after this operation, it
+		 * won't leak space on the standby, as the next split will consume it.
+		 * In any case, even without a failure we don't use all the space in
+		 * one split operation.
 		 */
 		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
 		{
@@ -728,18 +848,44 @@ restart_expand:
 		goto fail;
 	}
 
-
 	/*
-	 * Okay to proceed with split.  Update the metapage bucket mapping info.
-	 *
-	 * Since we are scribbling on the metapage data right in the shared
-	 * buffer, any failure in this next little bit leaves us with a big
+	 * Since we are scribbling on the pages in the shared buffers, establish a
+	 * critical section.  Any failure in this next code leaves us with a big
 	 * problem: the metapage is effectively corrupt but could get written back
-	 * to disk.  We don't really expect any failure, but just to be sure,
-	 * establish a critical section.
+	 * to disk.
 	 */
 	START_CRIT_SECTION();
 
+	opage = BufferGetPage(buf_oblkno);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	/*
+	 * Mark the old bucket to indicate that split is in progress and it has
+	 * deletable tuples. At operation end, we clear split in progress flag and
+	 * vacuum will clear page_has_garbage flag after deleting such tuples.
+	 */
+	oopaque->hasho_flag |= LH_BUCKET_PAGE_HAS_GARBAGE | LH_BUCKET_OLD_PAGE_SPLIT;
+
+	MarkBufferDirty(buf_oblkno);
+
+	npage = BufferGetPage(buf_nblkno);
+
+	/*
+	 * initialize the new bucket's primary page and mark it to indicate that
+	 * split is in progress.
+	 */
+	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	nopaque->hasho_prevblkno = InvalidBlockNumber;
+	nopaque->hasho_nextblkno = InvalidBlockNumber;
+	nopaque->hasho_bucket = new_bucket;
+	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_NEW_PAGE_SPLIT;
+	nopaque->hasho_page_id = HASHO_PAGE_ID;
+
+	MarkBufferDirty(buf_nblkno);
+
+	/*
+	 * Okay to proceed with split.  Update the metapage bucket mapping info.
+	 */
 	metap->hashm_maxbucket = new_bucket;
 
 	if (new_bucket > metap->hashm_highmask)
@@ -747,6 +893,7 @@ restart_expand:
 		/* Starting a new doubling */
 		metap->hashm_lowmask = metap->hashm_highmask;
 		metap->hashm_highmask = new_bucket | metap->hashm_lowmask;
+		metap_update_masks = true;
 	}
 
 	/*
@@ -759,10 +906,10 @@ restart_expand:
 	{
 		metap->hashm_spares[spare_ndx] = metap->hashm_spares[metap->hashm_ovflpoint];
 		metap->hashm_ovflpoint = spare_ndx;
+		metap_update_splitpoint = true;
 	}
 
-	/* Done mucking with metapage */
-	END_CRIT_SECTION();
+	MarkBufferDirty(metabuf);
 
 	/*
 	 * Copy bucket mapping info now; this saves re-accessing the meta page
@@ -775,15 +922,71 @@ restart_expand:
 	highmask = metap->hashm_highmask;
 	lowmask = metap->hashm_lowmask;
 
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_split_allocpage xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.new_bucket = maxbucket;
+		xlrec.old_bucket_flag = oopaque->hasho_flag;
+		xlrec.new_bucket_flag = nopaque->hasho_flag;
+		xlrec.flags = 0;
+
+		XLogBeginInsert();
+
+		XLogRegisterBuffer(0, buf_oblkno, REGBUF_STANDARD);
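+		/*
+		 * The new bucket's primary page is registered with REGBUF_WILL_INIT,
+		 * so it is re-initialized from scratch during replay and its current
+		 * contents need not be logged.
+		 */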
+		XLogRegisterBuffer(1, buf_nblkno, REGBUF_WILL_INIT);
+		XLogRegisterBuffer(2, metabuf, REGBUF_STANDARD);
+
+		if (metap_update_masks)
+		{
+			xlrec.flags |= XLH_SPLIT_META_UPDATE_MASKS;
+			XLogRegisterBufData(2, (char *) &metap->hashm_lowmask, sizeof(uint32));
+			XLogRegisterBufData(2, (char *) &metap->hashm_highmask, sizeof(uint32));
+		}
+
+		if (metap_update_splitpoint)
+		{
+			xlrec.flags |= XLH_SPLIT_META_UPDATE_SPLITPOINT;
+			XLogRegisterBufData(2, (char *) &metap->hashm_ovflpoint,
+								sizeof(uint32));
+			XLogRegisterBufData(2,
+					   (char *) &metap->hashm_spares[metap->hashm_ovflpoint],
+								sizeof(uint32));
+		}
+
+		XLogRegisterData((char *) &xlrec, SizeOfHashSplitAllocPage);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_ALLOCATE_PAGE);
+
+		PageSetLSN(BufferGetPage(buf_oblkno), recptr);
+		PageSetLSN(BufferGetPage(buf_nblkno), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
 	/* Write out the metapage and drop lock, but keep pin */
-	_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
+	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+	/*
+	 * We need to release and reacquire the lock on the new bucket's buffer to
+	 * ensure that the standby doesn't see an intermediate state of it.
+	 */
+	_hash_chgbufaccess(rel, buf_nblkno, HASH_READ, HASH_NOLOCK);
+	_hash_chgbufaccess(rel, buf_nblkno, HASH_NOLOCK, HASH_WRITE);
 
 	/* Relocate records to the new bucket */
 	_hash_splitbucket(rel, metabuf,
 					  old_bucket, new_bucket,
-					  buf_oblkno, buf_nblkno,
+					  buf_oblkno, buf_nblkno, NULL,
 					  maxbucket, highmask, lowmask);
 
+	/* all done, now release the locks and pins on primary buckets. */
+	_hash_relbuf(rel, buf_oblkno);
+	_hash_relbuf(rel, buf_nblkno);
+
 	return;
 
 	/* Here if decide not to split or fail to acquire old bucket lock */
@@ -823,6 +1026,7 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 {
 	BlockNumber lastblock;
 	char		zerobuf[BLCKSZ];
+	Page		page;
 
 	lastblock = firstblock + nblocks - 1;
 
@@ -833,7 +1037,21 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 	if (lastblock < firstblock || lastblock == InvalidBlockNumber)
 		return false;
 
-	MemSet(zerobuf, 0, sizeof(zerobuf));
+	page = (Page) zerobuf;
+
+	/*
+	 * Initialise the new bucket page.  Here we can't completely zero
+	 * the page, as WAL replay routines expect pages to be initialized.
+	 * See explanation of RBM_NORMAL mode atop XLogReadBufferExtended.
+	 */
+	_hash_pageinit(page, BLCKSZ);
+
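+	/*
+	 * WAL-log only the last block; replaying it extends the relation to the
+	 * same length on the standby.
+	 */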
+	if (RelationNeedsWAL(rel))
+		log_newpage(&rel->rd_node,
+					MAIN_FORKNUM,
+					lastblock,
+					zerobuf,
+					true);
 
 	RelationOpenSmgr(rel);
 	smgrextend(rel->rd_smgr, MAIN_FORKNUM, lastblock, zerobuf, false);
@@ -841,14 +1059,18 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 	return true;
 }
 
-
 /*
  * _hash_splitbucket -- split 'obucket' into 'obucket' and 'nbucket'
  *
+ * This routine is used to partition the tuples between the old and new
+ * buckets and to finish incomplete split operations.  To finish a previously
+ * interrupted split operation, the caller needs to fill htab.  If htab is
+ * set, we skip moving the tuples that exist in htab; otherwise a NULL htab
+ * indicates that all the tuples belonging to the new bucket are moved.
+ *
  * We are splitting a bucket that consists of a base bucket page and zero
  * or more overflow (bucket chain) pages.  We must relocate tuples that
- * belong in the new bucket, and compress out any free space in the old
- * bucket.
+ * belong in the new bucket.
  *
  * The caller must hold cleanup locks on both buckets to ensure that
  * no one else is trying to access them (see README).
@@ -874,70 +1096,11 @@ _hash_splitbucket(Relation rel,
 				  Bucket nbucket,
 				  Buffer obuf,
 				  Buffer nbuf,
+				  HTAB *htab,
 				  uint32 maxbucket,
 				  uint32 highmask,
 				  uint32 lowmask)
 {
-	Page		opage;
-	Page		npage;
-	HashPageOpaque oopaque;
-	HashPageOpaque nopaque;
-
-	opage = BufferGetPage(obuf);
-	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
-
-	/*
-	 * Mark the old bucket to indicate that split is in progress and it has
-	 * deletable tuples. At operation end, we clear split in progress flag and
-	 * vacuum will clear page_has_garbage flag after deleting such tuples.
-	 */
-	oopaque->hasho_flag |= LH_BUCKET_PAGE_HAS_GARBAGE | LH_BUCKET_OLD_PAGE_SPLIT;
-
-	npage = BufferGetPage(nbuf);
-
-	/*
-	 * initialize the new bucket's primary page and mark it to indicate that
-	 * split is in progress.
-	 */
-	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
-	nopaque->hasho_prevblkno = InvalidBlockNumber;
-	nopaque->hasho_nextblkno = InvalidBlockNumber;
-	nopaque->hasho_bucket = nbucket;
-	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_NEW_PAGE_SPLIT;
-	nopaque->hasho_page_id = HASHO_PAGE_ID;
-
-	_hash_splitbucket_guts(rel, metabuf, obucket,
-						   nbucket, obuf, nbuf, NULL,
-						   maxbucket, highmask, lowmask);
-
-	/* all done, now release the locks and pins on primary buckets. */
-	_hash_relbuf(rel, obuf);
-	_hash_relbuf(rel, nbuf);
-}
-
-/*
- * _hash_splitbucket_guts -- Helper function to perform the split operation
- *
- * This routine is used to partition the tuples between old and new bucket and
- * is used to finish the incomplete split operations.  To finish the previously
- * interrupted split operation, caller needs to fill htab.  If htab is set, then
- * we skip the movement of tuples that exists in htab, otherwise NULL value of
- * htab indicates movement of all the tuples that belong to new bucket.
- *
- * Caller needs to lock and unlock the old and new primary buckets.
- */
-static void
-_hash_splitbucket_guts(Relation rel,
-					   Buffer metabuf,
-					   Bucket obucket,
-					   Bucket nbucket,
-					   Buffer obuf,
-					   Buffer nbuf,
-					   HTAB *htab,
-					   uint32 maxbucket,
-					   uint32 highmask,
-					   uint32 lowmask)
-{
 	Buffer		bucket_obuf;
 	Buffer		bucket_nbuf;
 	Page		opage;
@@ -1021,7 +1184,7 @@ _hash_splitbucket_guts(Relation rel,
 				itemsz = IndexTupleDSize(*new_itup);
 				itemsz = MAXALIGN(itemsz);
 
-				if (PageGetFreeSpace(npage) < itemsz)
+				while (PageGetFreeSpace(npage) < itemsz)
 				{
 					bool		retain_pin = false;
 
@@ -1031,8 +1194,12 @@ _hash_splitbucket_guts(Relation rel,
 					 */
 					retain_pin = nopaque->hasho_flag & LH_BUCKET_PAGE;
 
-					/* write out nbuf and drop lock, but keep pin */
-					_hash_chgbufaccess(rel, nbuf, HASH_WRITE, HASH_NOLOCK);
+					/* log the split operation before releasing the lock */
+					log_split_page(rel, nbuf);
+
+					/* drop lock, but keep pin */
+					_hash_chgbufaccess(rel, nbuf, HASH_READ, HASH_NOLOCK);
+
 					/* chain to a new overflow page */
 					nbuf = _hash_addovflpage(rel, metabuf, nbuf, retain_pin);
 					npage = BufferGetPage(nbuf);
@@ -1040,6 +1207,13 @@ _hash_splitbucket_guts(Relation rel,
 				}
 
 				/*
+				 * Change the shared buffer state in a critical section,
+				 * otherwise any error could make it unrecoverable after
+				 * recovery.
+				 */
+				START_CRIT_SECTION();
+
+				/*
 				 * Insert tuple on new page, using _hash_pgaddtup to ensure
 				 * correct ordering by hashkey.  This is a tad inefficient
 				 * since we may have to shuffle itempointers repeatedly.
@@ -1048,6 +1222,8 @@ _hash_splitbucket_guts(Relation rel,
 				 */
 				(void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
 
+				END_CRIT_SECTION();
+
 				/* be tidy */
 				pfree(new_itup);
 			}
@@ -1070,7 +1246,16 @@ _hash_splitbucket_guts(Relation rel,
 
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
+		{
+			/* log the split operation before releasing the lock */
+			log_split_page(rel, nbuf);
+
+			if (nopaque->hasho_flag & LH_BUCKET_PAGE)
+				_hash_chgbufaccess(rel, nbuf, HASH_READ, HASH_NOLOCK);
+			else
+				_hash_relbuf(rel, nbuf);
 			break;
+		}
 
 		/* Else, advance to next old page */
 		obuf = _hash_getbuf(rel, oblkno, HASH_READ, LH_OVERFLOW_PAGE);
@@ -1086,15 +1271,6 @@ _hash_splitbucket_guts(Relation rel,
 	 * To avoid deadlocks due to locking order of buckets, first lock the old
 	 * bucket and then the new bucket.
 	 */
-	if (nopaque->hasho_flag & LH_BUCKET_PAGE)
-		_hash_chgbufaccess(rel, bucket_nbuf, HASH_WRITE, HASH_NOLOCK);
-	else
-		_hash_wrtbuf(rel, nbuf);
-
-	/*
-	 * Acquiring cleanup lock to clear the split-in-progress flag ensures that
-	 * there is no pending scan that has seen the flag after it is cleared.
-	 */
 	_hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
 	opage = BufferGetPage(bucket_obuf);
 	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
@@ -1103,6 +1279,8 @@ _hash_splitbucket_guts(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	START_CRIT_SECTION();
+
 	/* indicate that split is finished */
 	oopaque->hasho_flag &= ~LH_BUCKET_OLD_PAGE_SPLIT;
 	nopaque->hasho_flag &= ~LH_BUCKET_NEW_PAGE_SPLIT;
@@ -1113,6 +1291,29 @@ _hash_splitbucket_guts(Relation rel,
 	 */
 	MarkBufferDirty(bucket_obuf);
 	MarkBufferDirty(bucket_nbuf);
+
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+		xl_hash_split_complete xlrec;
+
+		xlrec.old_bucket_flag = oopaque->hasho_flag;
+		xlrec.new_bucket_flag = nopaque->hasho_flag;
+
+		XLogBeginInsert();
+
+		XLogRegisterData((char *) &xlrec, SizeOfHashSplitComplete);
+
+		XLogRegisterBuffer(0, bucket_obuf, REGBUF_STANDARD);
+		XLogRegisterBuffer(1, bucket_nbuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_COMPLETE);
+
+		PageSetLSN(BufferGetPage(bucket_obuf), recptr);
+		PageSetLSN(BufferGetPage(bucket_nbuf), recptr);
+	}
+
+	END_CRIT_SECTION();
 }
 
 /*
@@ -1217,9 +1418,41 @@ _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Buffer nbuf,
 	opageopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
 	obucket = opageopaque->hasho_bucket;
 
-	_hash_splitbucket_guts(rel, metabuf, obucket,
-						   nbucket, obuf, bucket_nbuf, tidhtab,
-						   maxbucket, highmask, lowmask);
+	_hash_splitbucket(rel, metabuf, obucket,
+					  nbucket, obuf, bucket_nbuf, tidhtab,
+					  maxbucket, highmask, lowmask);
 
 	hash_destroy(tidhtab);
 }
+
+/*
+ *	log_split_page() -- Log the split operation
+ *
+ *	We log the split operation when the new page in new bucket gets full,
+ *	so we log the entire page.
+ *
+ *	'buf' must be locked by the caller which is also responsible for unlocking
+ *	it.
+ */
+static void
+log_split_page(Relation rel, Buffer buf)
+{
+	START_CRIT_SECTION();
+
+	MarkBufferDirty(buf);
+
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+
+		XLogBeginInsert();
+
+		XLogRegisterBuffer(0, buf, REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_PAGE);
+
+		PageSetLSN(BufferGetPage(buf), recptr);
+	}
+
+	END_CRIT_SECTION();
+}
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 6ec3bea..5d8942c 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -306,6 +306,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 
 
 	page = BufferGetPage(buf);
+	TestForOldSnapshot(scan->xs_snapshot, rel, page);
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	Assert(opaque->hasho_bucket == bucket);
 
@@ -362,6 +363,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	{
 		while (BlockNumberIsValid(opaque->hasho_nextblkno))
 			_hash_readnext(rel, &buf, &page, &opaque);
+
+		if (BufferIsValid(buf))
+			TestForOldSnapshot(scan->xs_snapshot, rel, page);
 	}
 
 	/* Now find the first tuple satisfying the qualification */
@@ -480,6 +484,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					_hash_readnext(rel, &buf, &page, &opaque);
 					if (BufferIsValid(buf))
 					{
+						TestForOldSnapshot(scan->xs_snapshot, rel, page);
 						maxoff = PageGetMaxOffsetNumber(page);
 						offnum = _hash_binsearch(page, so->hashso_sk_hash);
 					}
@@ -503,6 +508,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 							_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
 
 							page = BufferGetPage(buf);
+							TestForOldSnapshot(scan->xs_snapshot, rel, page);
 							opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 							maxoff = PageGetMaxOffsetNumber(page);
 							offnum = _hash_binsearch(page, so->hashso_sk_hash);
@@ -567,6 +573,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					_hash_readprev(rel, &buf, &page, &opaque);
 					if (BufferIsValid(buf))
 					{
+						TestForOldSnapshot(scan->xs_snapshot, rel, page);
 						maxoff = PageGetMaxOffsetNumber(page);
 						offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
 					}
@@ -590,6 +597,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 							_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
 
 							page = BufferGetPage(buf);
+							TestForOldSnapshot(scan->xs_snapshot, rel, page);
 							opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 							maxoff = PageGetMaxOffsetNumber(page);
 							offnum = _hash_binsearch(page, so->hashso_sk_hash);
diff --git a/src/backend/access/rmgrdesc/hashdesc.c b/src/backend/access/rmgrdesc/hashdesc.c
index 12e1818..245ce97 100644
--- a/src/backend/access/rmgrdesc/hashdesc.c
+++ b/src/backend/access/rmgrdesc/hashdesc.c
@@ -19,10 +19,143 @@
 void
 hash_desc(StringInfo buf, XLogReaderState *record)
 {
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			{
+				xl_hash_init_meta_page *xlrec = (xl_hash_init_meta_page *) rec;
+
+				appendStringInfo(buf, "num_tuples %g, fillfactor %d",
+								 xlrec->num_tuples, xlrec->ffactor);
+				break;
+			}
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			{
+				xl_hash_init_bitmap_page *xlrec = (xl_hash_init_bitmap_page *) rec;
+
+				appendStringInfo(buf, "bmsize %d", xlrec->bmsize);
+				break;
+			}
+		case XLOG_HASH_INSERT:
+			{
+				xl_hash_insert *xlrec = (xl_hash_insert *) rec;
+
+				appendStringInfo(buf, "off %u", xlrec->offnum);
+				break;
+			}
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			{
+				xl_hash_addovflpage *xlrec = (xl_hash_addovflpage *) rec;
+
+				appendStringInfo(buf, "bmsize %d, bmpage_found %c",
+						   xlrec->bmsize, (xlrec->bmpage_found) ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			{
+				xl_hash_split_allocpage *xlrec = (xl_hash_split_allocpage *) rec;
+
+				appendStringInfo(buf, "new_bucket %u, meta_page_masks_updated %c, issplitpoint_changed %c",
+								 xlrec->new_bucket,
+					(xlrec->flags & XLH_SPLIT_META_UPDATE_MASKS) ? 'T' : 'F',
+								 (xlrec->flags & XLH_SPLIT_META_UPDATE_SPLITPOINT) ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SPLIT_COMPLETE:
+			{
+				xl_hash_split_complete *xlrec = (xl_hash_split_complete *) rec;
+
+				appendStringInfo(buf, "split_complete_old_bucket %c, split_complete_new_bucket %c",
+								 (xlrec->old_bucket_flag & LH_BUCKET_OLD_PAGE_SPLIT) ? 'F' : 'T',
+								 (xlrec->new_bucket_flag & LH_BUCKET_NEW_PAGE_SPLIT) ? 'F' : 'T');
+				break;
+			}
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			{
+				xl_hash_move_page_contents *xlrec = (xl_hash_move_page_contents *) rec;
+
+				appendStringInfo(buf, "ntups %d, is_primary %c",
+								 xlrec->ntups,
+								 xlrec->is_prim_bucket_same_wrt ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SQUEEZE_PAGE:
+			{
+				xl_hash_squeeze_page *xlrec = (xl_hash_squeeze_page *) rec;
+
+				appendStringInfo(buf, "prevblkno %u, nextblkno %u, ntups %d, is_primary %c",
+								 xlrec->prevblkno,
+								 xlrec->nextblkno,
+								 xlrec->ntups,
+								 xlrec->is_prim_bucket_same_wrt ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_DELETE:
+			{
+				xl_hash_delete *xlrec = (xl_hash_delete *) rec;
+
+				appendStringInfo(buf, "is_primary %c",
+								 xlrec->is_primary_bucket_page ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_UPDATE_META_PAGE:
+			{
+				xl_hash_update_meta_page *xlrec = (xl_hash_update_meta_page *) rec;
+
+				appendStringInfo(buf, "ntuples %g",
+								 xlrec->ntuples);
+				break;
+			}
+	}
 }
 
 const char *
 hash_identify(uint8 info)
 {
-	return NULL;
+	const char *id = NULL;
+
+	switch (info & ~XLR_INFO_MASK)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			id = "INIT_META_PAGE";
+			break;
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			id = "INIT_BITMAP_PAGE";
+			break;
+		case XLOG_HASH_INSERT:
+			id = "INSERT";
+			break;
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			id = "ADD_OVFL_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			id = "SPLIT_ALLOCATE_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_PAGE:
+			id = "SPLIT_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_COMPLETE:
+			id = "SPLIT_COMPLETE";
+			break;
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			id = "MOVE_PAGE_CONTENTS";
+			break;
+		case XLOG_HASH_SQUEEZE_PAGE:
+			id = "SQUEEZE_PAGE";
+			break;
+		case XLOG_HASH_DELETE:
+			id = "DELETE";
+			break;
+		case XLOG_HASH_CLEAR_GARBAGE:
+			id = "CLEAR_GARBAGE";
+			break;
+		case XLOG_HASH_UPDATE_META_PAGE:
+			id = "UPDATE_META_PAGE";
+			break;
+	}
+
+	return id;
 }
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 85817c6..b868fd8 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -499,11 +499,6 @@ DefineIndex(Oid relationId,
 	accessMethodForm = (Form_pg_am) GETSTRUCT(tuple);
 	amRoutine = GetIndexAmRoutine(accessMethodForm->amhandler);
 
-	if (strcmp(accessMethodName, "hash") == 0 &&
-		RelationNeedsWAL(rel))
-		ereport(WARNING,
-				(errmsg("hash indexes are not WAL-logged and their use is discouraged")));
-
 	if (stmt->unique && !amRoutine->amcanunique)
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 73aa0c0..31b66d2 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -598,6 +598,33 @@ PageGetFreeSpace(Page page)
 }
 
 /*
+ * PageGetFreeSpaceForMulTups
+ *		Returns the size of the free (allocatable) space on a page,
+ *		reduced by the space needed for multiple new line pointers.
+ *
+ * Note: this should usually only be used on index pages.  Use
+ * PageGetHeapFreeSpace on heap pages.
+ */
+Size
+PageGetFreeSpaceForMulTups(Page page, int ntups)
+{
+	int			space;
+
+	/*
+	 * Use signed arithmetic here so that we behave sensibly if pd_lower >
+	 * pd_upper.
+	 */
+	space = (int) ((PageHeader) page)->pd_upper -
+		(int) ((PageHeader) page)->pd_lower;
+
+	if (space < (int) (ntups * sizeof(ItemIdData)))
+		return 0;
+	space -= ntups * sizeof(ItemIdData);
+
+	return (Size) space;
+}
+
+/*
  * PageGetExactFreeSpace
  *		Returns the size of the free (allocatable) space on a page,
  *		without any consideration for adding/removing line pointers.
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 79e0b1f..a918c6d 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -5379,13 +5379,10 @@ RelationIdIsInInitFile(Oid relationId)
 /*
  * Tells whether any index for the relation is unlogged.
  *
- * Any index using the hash AM is implicitly unlogged.
- *
  * Note: There doesn't seem to be any way to have an unlogged index attached
- * to a permanent table except to create a hash index, but it seems best to
- * keep this general so that it returns sensible results even when they seem
- * obvious (like for an unlogged table) and to handle possible future unlogged
- * indexes on permanent tables.
+ * to a permanent table, but it seems best to keep this general so that it
+ * returns sensible results even when they seem obvious (like for an unlogged
+ * table) and to handle possible future unlogged indexes on permanent tables.
  */
 bool
 RelationHasUnloggedIndex(Relation rel)
@@ -5407,8 +5404,7 @@ RelationHasUnloggedIndex(Relation rel)
 			elog(ERROR, "cache lookup failed for relation %u", indexoid);
 		reltup = (Form_pg_class) GETSTRUCT(tp);
 
-		if (reltup->relpersistence == RELPERSISTENCE_UNLOGGED
-			|| reltup->relam == HASH_AM_OID)
+		if (reltup->relpersistence == RELPERSISTENCE_UNLOGGED)
 			result = true;
 
 		ReleaseSysCache(tp);
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index bbf822b..0e6ad1a 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -313,13 +313,16 @@ extern Datum hash_uint32(uint32 k);
 extern void _hash_doinsert(Relation rel, IndexTuple itup);
 extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
 			   Size itemsize, IndexTuple itup);
+extern void _hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
+					OffsetNumber *itup_offsets, uint16 nitups);
 
 /* hashovfl.c */
 extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin);
-extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf,
-				   BlockNumber bucket_blkno, BufferAccessStrategy bstrategy);
-extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
-				 BlockNumber blkno, ForkNumber forkNum);
+extern BlockNumber _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
+				   Buffer wbuf, IndexTuple *itups, OffsetNumber *itup_offsets,
+				   Size *tups_size, uint16 nitups,
+				   BufferAccessStrategy bstrategy);
+extern void _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage);
 extern void _hash_squeezebucket(Relation rel,
 					Bucket bucket, BlockNumber bucket_blkno,
 					Buffer bucket_buf,
@@ -331,6 +334,8 @@ extern Buffer _hash_getbuf(Relation rel, BlockNumber blkno,
 extern Buffer _hash_getbuf_with_condlock_cleanup(Relation rel,
 								   BlockNumber blkno, int flags);
 extern Buffer _hash_getinitbuf(Relation rel, BlockNumber blkno);
+extern void _hash_initbuf(Buffer buf, uint32 num_bucket, uint32 flag,
+			  bool initpage);
 extern Buffer _hash_getnewbuf(Relation rel, BlockNumber blkno,
 				ForkNumber forkNum);
 extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
@@ -339,11 +344,12 @@ extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
 extern void _hash_relbuf(Relation rel, Buffer buf);
 extern void _hash_dropbuf(Relation rel, Buffer buf);
 extern void _hash_dropscanbuf(Relation rel, HashScanOpaque so);
-extern void _hash_wrtbuf(Relation rel, Buffer buf);
 extern void _hash_chgbufaccess(Relation rel, Buffer buf, int from_access,
 				   int to_access);
-extern uint32 _hash_metapinit(Relation rel, double num_tuples,
-				ForkNumber forkNum);
+extern uint32 _hash_init(Relation rel, double num_tuples,
+		   ForkNumber forkNum);
+extern void _hash_init_metabuffer(Buffer buf, double num_tuples,
+					  RegProcedure procid, uint16 ffactor, bool initpage);
 extern void _hash_pageinit(Page page, Size size);
 extern void _hash_expandtable(Relation rel, Buffer metabuf);
 extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index 5f941a9..30e16c0 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -17,6 +17,245 @@
 #include "access/hash.h"
 #include "access/xlogreader.h"
 
+/* Number of buffers required for XLOG_HASH_SQUEEZE_PAGE operation */
+#define HASH_XLOG_FREE_OVFL_BUFS	6
+
+/*
+ * XLOG records for hash operations
+ */
+#define XLOG_HASH_INIT_META_PAGE	0x00		/* initialize the meta page */
+#define XLOG_HASH_INIT_BITMAP_PAGE	0x10		/* initialize the bitmap page */
+#define XLOG_HASH_INSERT		0x20	/* add index tuple without split */
+#define XLOG_HASH_ADD_OVFL_PAGE 0x30	/* add overflow page */
+#define XLOG_HASH_SPLIT_ALLOCATE_PAGE	0x40	/* allocate new page for split */
+#define XLOG_HASH_SPLIT_PAGE	0x50	/* split page */
+#define XLOG_HASH_SPLIT_COMPLETE	0x60		/* completion of split
+												 * operation */
+#define XLOG_HASH_MOVE_PAGE_CONTENTS	0x70	/* remove tuples from one page
+												 * and add to another page */
+#define XLOG_HASH_SQUEEZE_PAGE	0x80	/* add tuples to one of the previous
+										 * pages in chain and free the ovfl
+										 * page */
+#define XLOG_HASH_DELETE		0x90	/* delete index tuples from a page */
+#define XLOG_HASH_CLEAR_GARBAGE 0xA0	/* clear garbage flag in primary
+										 * bucket page after deleting tuples
+										 * that are moved due to split	*/
+#define XLOG_HASH_UPDATE_META_PAGE	0xB0		/* update meta page after
+												 * vacuum */
+
+
+/*
+ * xl_hash_split_allocpage flag values, 8 bits are available.
+ */
+#define XLH_SPLIT_META_UPDATE_MASKS		(1<<0)
+#define XLH_SPLIT_META_UPDATE_SPLITPOINT		(1<<1)
+
+/*
+ * Data to regenerate the meta-data page
+ */
+typedef struct xl_hash_metadata
+{
+	HashMetaPageData metadata;
+}	xl_hash_metadata;
+
+/*
+ * This is what we need to know about a HASH index create.
+ *
+ * Backup block 0: metapage
+ */
+typedef struct xl_hash_createidx
+{
+	double		num_tuples;
+	RegProcedure procid;
+	uint16		ffactor;
+}	xl_hash_createidx;
+#define SizeOfHashCreateIdx (offsetof(xl_hash_createidx, ffactor) + sizeof(uint16))
+
+/*
+ * This is what we need to know about simple (without split) insert.
+ *
+ * This data record is used for XLOG_HASH_INSERT
+ *
+ * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 1: metapage (xl_hash_metadata)
+ */
+typedef struct xl_hash_insert
+{
+	OffsetNumber offnum;
+}	xl_hash_insert;
+
+#define SizeOfHashInsert	(offsetof(xl_hash_insert, offnum) + sizeof(OffsetNumber))
+
+/*
+ * This is what we need to know about addition of overflow page.
+ *
+ * This data record is used for XLOG_HASH_ADD_OVFL_PAGE
+ *
+ * Backup Blk 0: newly allocated overflow page
+ * Backup Blk 1: page before new overflow page in the bucket chain
+ * Backup Blk 2: bitmap page
+ * Backup Blk 3: new bitmap page
+ * Backup Blk 4: metapage
+ */
+typedef struct xl_hash_addovflpage
+{
+	uint16		bmsize;
+	bool		bmpage_found;
+}	xl_hash_addovflpage;
+
+#define SizeOfHashAddOvflPage	\
+	(offsetof(xl_hash_addovflpage, bmpage_found) + sizeof(bool))
+
+/*
+ * This is what we need to know about allocating a page for split.
+ *
+ * This data record is used for XLOG_HASH_SPLIT_ALLOCATE_PAGE
+ *
+ * Backup Blk 0: page for old bucket
+ * Backup Blk 1: page for new bucket
+ * Backup Blk 2: metapage
+ */
+typedef struct xl_hash_split_allocpage
+{
+	uint32		new_bucket;
+	uint16		old_bucket_flag;
+	uint16		new_bucket_flag;
+	uint8		flags;
+}	xl_hash_split_allocpage;
+
+#define SizeOfHashSplitAllocPage	\
+	(offsetof(xl_hash_split_allocpage, flags) + sizeof(uint8))
+
+/*
+ * This is what we need to know about completing the split operation.
+ *
+ * This data record is used for XLOG_HASH_SPLIT_COMPLETE
+ *
+ * Backup Blk 0: page for old bucket
+ * Backup Blk 1: page for new bucket
+ */
+typedef struct xl_hash_split_complete
+{
+	uint16		old_bucket_flag;
+	uint16		new_bucket_flag;
+}	xl_hash_split_complete;
+
+#define SizeOfHashSplitComplete \
+	(offsetof(xl_hash_split_complete, new_bucket_flag) + sizeof(uint16))
+
+/*
+ * This is what we need to know about move page contents required during
+ * squeeze operation.
+ *
+ * This data record is used for XLOG_HASH_MOVE_PAGE_CONTENTS
+ *
+ * Backup Blk 0: bucket page
+ * Backup Blk 1: page containing moved tuples
+ * Backup Blk 2: page from which tuples will be removed
+ */
+typedef struct xl_hash_move_page_contents
+{
+	uint16		ntups;
+	bool		is_prim_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is same as
+												 * primary bucket page */
+}	xl_hash_move_page_contents;
+
+#define SizeOfHashMovePageContents	\
+	(offsetof(xl_hash_move_page_contents, is_prim_bucket_same_wrt) + sizeof(bool))
+
+/*
+ * This is what we need to know about the squeeze page operation.
+ *
+ * This data record is used for XLOG_HASH_SQUEEZE_PAGE
+ *
+ * Backup Blk 0: page containing tuples moved from freed overflow page
+ * Backup Blk 1: freed overflow page
+ * Backup Blk 2: page previous to the freed overflow page
+ * Backup Blk 3: page next to the freed overflow page
+ * Backup Blk 4: bitmap page containing info of freed overflow page
+ * Backup Blk 5: meta page
+ */
+typedef struct xl_hash_squeeze_page
+{
+	BlockNumber prevblkno;
+	BlockNumber nextblkno;
+	uint16		ntups;
+	bool		is_prim_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is same as
+												 * primary bucket page */
+	bool		is_prev_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is the
+												 * page previous to the freed
+												 * overflow page */
+}	xl_hash_squeeze_page;
+
+#define SizeOfHashSqueezePage	\
+	(offsetof(xl_hash_squeeze_page, is_prev_bucket_same_wrt) + sizeof(bool))
+
+/*
+ * This is what we need to know about the deletion of index tuples from a page.
+ *
+ * This data record is used for XLOG_HASH_DELETE
+ *
+ * Backup Blk 0: primary bucket page
+ * Backup Blk 1: page from which tuples are deleted
+ */
+typedef struct xl_hash_delete
+{
+	bool		is_primary_bucket_page; /* TRUE if the operation is for
+										 * primary bucket page */
+}	xl_hash_delete;
+
+#define SizeOfHashDelete	(offsetof(xl_hash_delete, is_primary_bucket_page) + sizeof(bool))
+
+/*
+ * This is what we need for metapage update operation.
+ *
+ * This data record is used for XLOG_HASH_UPDATE_META_PAGE
+ *
+ * Backup Blk 0: meta page
+ */
+typedef struct xl_hash_update_meta_page
+{
+	double		ntuples;
+}	xl_hash_update_meta_page;
+
+#define SizeOfHashUpdateMetaPage	\
+	(offsetof(xl_hash_update_meta_page, ntuples) + sizeof(double))
+
+/*
+ * This is what we need to initialize metapage.
+ *
+ * This data record is used for XLOG_HASH_INIT_META_PAGE
+ *
+ * Backup Blk 0: meta page
+ */
+typedef struct xl_hash_init_meta_page
+{
+	double		num_tuples;
+	RegProcedure procid;
+	uint16		ffactor;
+}	xl_hash_init_meta_page;
+
+#define SizeOfHashInitMetaPage		\
+	(offsetof(xl_hash_init_meta_page, ffactor) + sizeof(uint16))
+
+/*
+ * This is what we need to initialize bitmap page.
+ *
+ * This data record is used for XLOG_HASH_INIT_BITMAP_PAGE
+ *
+ * Backup Blk 0: bitmap page
+ * Backup Blk 1: meta page
+ */
+typedef struct xl_hash_init_bitmap_page
+{
+	uint16		bmsize;
+}	xl_hash_init_bitmap_page;
+
+#define SizeOfHashInitBitmapPage	\
+	(offsetof(xl_hash_init_bitmap_page, bmsize) + sizeof(uint16))
 
 extern void hash_redo(XLogReaderState *record);
 extern void hash_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index ad4ab5f..6ea46ef 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -425,6 +425,7 @@ extern Page PageGetTempPageCopySpecial(Page page);
 extern void PageRestoreTempPage(Page tempPage, Page oldPage);
 extern void PageRepairFragmentation(Page page);
 extern Size PageGetFreeSpace(Page page);
+extern Size PageGetFreeSpaceForMulTups(Page page, int ntups);
 extern Size PageGetExactFreeSpace(Page page);
 extern Size PageGetHeapFreeSpace(Page page);
 extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 76593e1..80089b9 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -2335,13 +2335,9 @@ Options: fastupdate=on, gin_pending_list_limit=128
 -- HASH
 --
 CREATE INDEX hash_i4_index ON hash_i4_heap USING hash (random int4_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_name_index ON hash_name_heap USING hash (random name_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_txt_index ON hash_txt_heap USING hash (random text_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_f8_index ON hash_f8_heap USING hash (random float8_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE UNLOGGED TABLE unlogged_hash_table (id int4);
 CREATE INDEX unlogged_hash_index ON unlogged_hash_table USING hash (id int4_ops);
 DROP TABLE unlogged_hash_table;
@@ -2350,7 +2346,6 @@ DROP TABLE unlogged_hash_table;
 -- maintenance_work_mem setting and fillfactor:
 SET maintenance_work_mem = '1MB';
 CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 EXPLAIN (COSTS OFF)
 SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
                       QUERY PLAN                       
diff --git a/src/test/regress/expected/enum.out b/src/test/regress/expected/enum.out
index 514d1d0..0e60304 100644
--- a/src/test/regress/expected/enum.out
+++ b/src/test/regress/expected/enum.out
@@ -383,7 +383,6 @@ DROP INDEX enumtest_btree;
 -- Hash index / opclass with the = operator
 --
 CREATE INDEX enumtest_hash ON enumtest USING hash (col);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 SELECT * FROM enumtest WHERE col = 'orange';
   col   
 --------
diff --git a/src/test/regress/expected/macaddr.out b/src/test/regress/expected/macaddr.out
index e84ff5f..151f9ce 100644
--- a/src/test/regress/expected/macaddr.out
+++ b/src/test/regress/expected/macaddr.out
@@ -41,7 +41,6 @@ SELECT * FROM macaddr_data;
 
 CREATE INDEX macaddr_data_btree ON macaddr_data USING btree (b);
 CREATE INDEX macaddr_data_hash ON macaddr_data USING hash (b);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 SELECT a, b, trunc(b) FROM macaddr_data ORDER BY 2, 1;
  a  |         b         |       trunc       
 ----+-------------------+-------------------
diff --git a/src/test/regress/expected/replica_identity.out b/src/test/regress/expected/replica_identity.out
index 39a60a5..62f2ad7 100644
--- a/src/test/regress/expected/replica_identity.out
+++ b/src/test/regress/expected/replica_identity.out
@@ -12,7 +12,6 @@ CREATE UNIQUE INDEX test_replica_identity_keyab_key ON test_replica_identity (ke
 CREATE UNIQUE INDEX test_replica_identity_oid_idx ON test_replica_identity (oid);
 CREATE UNIQUE INDEX test_replica_identity_nonkey ON test_replica_identity (keya, nonkey);
 CREATE INDEX test_replica_identity_hash ON test_replica_identity USING hash (nonkey);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE UNIQUE INDEX test_replica_identity_expr ON test_replica_identity (keya, keyb, (3));
 CREATE UNIQUE INDEX test_replica_identity_partial ON test_replica_identity (keya, keyb) WHERE keyb != '3';
 -- default is 'd'/DEFAULT for user created tables
diff --git a/src/test/regress/expected/uuid.out b/src/test/regress/expected/uuid.out
index 59cb1e0..d907519 100644
--- a/src/test/regress/expected/uuid.out
+++ b/src/test/regress/expected/uuid.out
@@ -114,7 +114,6 @@ SELECT COUNT(*) FROM guid1 WHERE guid_field >= '22222222-2222-2222-2222-22222222
 -- btree and hash index creation test
 CREATE INDEX guid1_btree ON guid1 USING BTREE (guid_field);
 CREATE INDEX guid1_hash  ON guid1 USING HASH  (guid_field);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 -- unique index test
 CREATE UNIQUE INDEX guid1_unique_BTREE ON guid1 USING BTREE (guid_field);
 -- should fail
#46Jeff Janes
jeff.janes@gmail.com
In reply to: Amit Kapila (#45)
Re: Write Ahead Logging for Hash Indexes

On Thu, Sep 15, 2016 at 11:42 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

Okay, Thanks for pointing out the same. I have fixed it. Apart from
that, I have changed _hash_alloc_buckets() to initialize the page
instead of making it completely Zero because of problems discussed in
another related thread [1]. I have also updated README.

with v7 of the concurrent hash patch and v4 of the write ahead log patch and
the latest relcache patch (I don't know how important that is to
reproducing this, I suspect it is not), I once got this error:

38422 00000 2016-09-19 16:25:50.055 PDT:LOG: database system was
interrupted; last known up at 2016-09-19 16:25:49 PDT
38422 00000 2016-09-19 16:25:50.057 PDT:LOG: database system was not
properly shut down; automatic recovery in progress
38422 00000 2016-09-19 16:25:50.057 PDT:LOG: redo starts at 3F/2200DE90
38422 01000 2016-09-19 16:25:50.061 PDT:WARNING: page verification
failed, calculated checksum 65067 but expected 21260
38422 01000 2016-09-19 16:25:50.061 PDT:CONTEXT: xlog redo at 3F/22053B50
for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
38422 XX001 2016-09-19 16:25:50.071 PDT:FATAL: invalid page in block 9 of
relation base/16384/17334
38422 XX001 2016-09-19 16:25:50.071 PDT:CONTEXT: xlog redo at 3F/22053B50
for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T

The original page with the invalid checksum is:

$ od 16384_17334_9
0000000 000032 000000 015420 077347 020404 000000 000030 017760
0000020 017760 020004 000000 000000 000000 000000 000000 000000
0000040 000000 000000 000000 000000 000000 000000 000000 000000
*
0017760 000007 000002 177777 177777 000007 000000 000002 177600
0020000

If I ignore the checksum failure and re-start the system, the page gets
restored to be a bitmap page.

Cheers,

Jeff

#47Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: Amit Kapila (#45)
Re: Write Ahead Logging for Hash Indexes

On 09/16/2016 02:42 AM, Amit Kapila wrote:

Okay, Thanks for pointing out the same. I have fixed it. Apart from
that, I have changed _hash_alloc_buckets() to initialize the page
instead of making it completely Zero because of problems discussed in
another related thread [1]. I have also updated README.

Thanks.

This needs a rebase against the CHI v8 patch [1].

[1]: /messages/by-id/CAA4eK1+X=8sUd1UCZDZnE3D9CGi9kw+kjxp2Tnw7SX5w8pLBNw@mail.gmail.com

Best regards,
Jesper


#48Amit Kapila
amit.kapila16@gmail.com
In reply to: Jeff Janes (#46)
Re: Write Ahead Logging for Hash Indexes

On Tue, Sep 20, 2016 at 10:24 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Thu, Sep 15, 2016 at 11:42 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

Okay, Thanks for pointing out the same. I have fixed it. Apart from
that, I have changed _hash_alloc_buckets() to initialize the page
instead of making it completely Zero because of problems discussed in
another related thread [1]. I have also updated README.

with v7 of the concurrent hash patch and v4 of the write ahead log patch and
the latest relcache patch (I don't know how important that is to reproducing
this, I suspect it is not), I once got this error:

38422 00000 2016-09-19 16:25:50.055 PDT:LOG: database system was
interrupted; last known up at 2016-09-19 16:25:49 PDT
38422 00000 2016-09-19 16:25:50.057 PDT:LOG: database system was not
properly shut down; automatic recovery in progress
38422 00000 2016-09-19 16:25:50.057 PDT:LOG: redo starts at 3F/2200DE90
38422 01000 2016-09-19 16:25:50.061 PDT:WARNING: page verification failed,
calculated checksum 65067 but expected 21260
38422 01000 2016-09-19 16:25:50.061 PDT:CONTEXT: xlog redo at 3F/22053B50
for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
38422 XX001 2016-09-19 16:25:50.071 PDT:FATAL: invalid page in block 9 of
relation base/16384/17334
38422 XX001 2016-09-19 16:25:50.071 PDT:CONTEXT: xlog redo at 3F/22053B50
for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T

The original page with the invalid checksum is:

I think this is an example of the torn page problem, which seems to be
happening because of the below code in your test.

! if (JJ_torn_page > 0 && counter++ > JJ_torn_page && !RecoveryInProgress()) {
! nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ/3);
! ereport(FATAL,
! (errcode(ERRCODE_DISK_FULL),
! errmsg("could not write block %u of relation %s: wrote only %d of %d bytes",
! blocknum,
! relpath(reln->smgr_rnode, forknum),
! nbytes, BLCKSZ),
! errhint("JJ is screwing with the database.")));
! } else {
! nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
! }

If you are running the above test by disabling JJ_torn_page, then it
is a different matter and we need to investigate it, but I assume you
are running by enabling it.

I think this could happen if the actual change to the page is in the
trailing 2/3 of the page, which the above code does not write. The
checksum in the page header, written as part of the partial page write
(the first 1/3 of the page), would have covered the change you made,
whereas after restart, when the page is read again to apply redo, the
checksum calculation won't include the change in that 2/3 part.
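
To see why the two numbers come out different, here is a minimal
standalone simulation (not PostgreSQL code; the page layout and the
toy_checksum() function are invented stand-ins for the real page format
and checksum): the checksum stored in the header is computed over the
whole modified page, but only the first third of the page reaches disk,
so recomputing the checksum from the on-disk bytes gives another value:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLCKSZ 8192

/* stand-in for the real page checksum; any whole-page hash will do */
static uint16_t
toy_checksum(const uint8_t *page)
{
    uint32_t    sum = 0;
    int         i;

    for (i = 2; i < BLCKSZ; i++)    /* skip the 2-byte checksum field itself */
        sum = sum * 31 + page[i];
    return (uint16_t) sum;
}

int
main(void)
{
    uint8_t     memory[BLCKSZ] = {0};   /* page as modified in shared buffers */
    uint8_t     disk[BLCKSZ] = {0};     /* page as it ends up on disk */
    uint16_t    stored;
    uint16_t    ondisk;

    /* the backend modifies the last third of the page ... */
    memory[BLCKSZ - 100] = 0xAB;

    /* ... and computes the checksum over the whole page before writing */
    stored = toy_checksum(memory);
    memcpy(memory, &stored, sizeof(stored));

    /* torn write: only the first third (which holds the header) hits disk */
    memcpy(disk, memory, BLCKSZ / 3);

    /* after a crash, verification of the on-disk page fails */
    memcpy(&ondisk, disk, sizeof(ondisk));
    printf("stored checksum %u, recomputed %u\n",
           (unsigned) ondisk, (unsigned) toy_checksum(disk));
    return 0;
}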

Today, Ashutosh shared the logs of his test run where he has shown a
similar problem for a HEAP page. I think this could happen, though
rarely, for any page with this kind of test.

Does this explanation explain the reason for the problem you are seeing?

If I ignore the checksum failure and re-start the system, the page gets
restored to be a bitmap page.

Okay, but have you ensured in some way that redo is applied to bitmap page?

Today, while thinking about this problem, I realized that currently in
the patch we are using REGBUF_NO_IMAGE for the bitmap page because of
one of the problems you reported [1]. That change will fix the problem
you reported, but it will expose bitmap pages to torn-page hazards. I
think the right fix there is to make pd_lower equal to pd_upper for the
bitmap page, so that full page writes don't exclude the data in the
bitmap page.
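
To make the pd_lower/pd_upper point concrete, here is a small standalone
sketch (ToyPage and image_size() are invented simplifications, not the
real PageHeaderData or xloginsert.c logic): a "standard" full-page image
leaves out the hole between pd_lower and pd_upper, so a bitmap page
whose bits sit in that region loses them unless pd_lower is pushed up to
(or past) the bitmap data:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLCKSZ 8192

/* simplified stand-in for a page header */
typedef struct
{
    uint16_t    pd_lower;   /* start of the unused "hole" */
    uint16_t    pd_upper;   /* end of the unused "hole" */
    char        data[BLCKSZ - 2 * sizeof(uint16_t)];
} ToyPage;

/* mimic the full-page-image rule: the hole is not included in the image */
static size_t
image_size(const ToyPage *page)
{
    return BLCKSZ - (page->pd_upper - page->pd_lower);
}

int
main(void)
{
    ToyPage     bitmap_page;

    memset(&bitmap_page, 0, sizeof(bitmap_page));

    /* bitmap bits live right after the header, inside the declared hole */
    bitmap_page.pd_lower = 2 * sizeof(uint16_t);
    bitmap_page.pd_upper = BLCKSZ;
    bitmap_page.data[0] = (char) 0xFF;  /* some allocated-overflow-page bits */

    printf("image size with hole elided: %zu bytes (bitmap bits dropped)\n",
           image_size(&bitmap_page));

    /* suggested fix: pd_lower == pd_upper, i.e. no hole to elide */
    bitmap_page.pd_lower = bitmap_page.pd_upper;
    printf("image size with pd_lower == pd_upper: %zu bytes\n",
           image_size(&bitmap_page));
    return 0;
}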

Thoughts?

[1]: /messages/by-id/CAA4eK1KJOfVvFUmi6dcX9Y2-0PFHkomDzGuyoC=aD3Qj9WPpFA@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#49Jeff Janes
jeff.janes@gmail.com
In reply to: Amit Kapila (#48)
Re: Write Ahead Logging for Hash Indexes

On Tue, Sep 20, 2016 at 10:27 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Tue, Sep 20, 2016 at 10:24 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Thu, Sep 15, 2016 at 11:42 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

Okay, Thanks for pointing out the same. I have fixed it. Apart from
that, I have changed _hash_alloc_buckets() to initialize the page
instead of making it completely Zero because of problems discussed in
another related thread [1]. I have also updated README.

with v7 of the concurrent hash patch and v4 of the write ahead log patch
and the latest relcache patch (I don't know how important that is to
reproducing this, I suspect it is not), I once got this error:

38422 00000 2016-09-19 16:25:50.055 PDT:LOG: database system was
interrupted; last known up at 2016-09-19 16:25:49 PDT
38422 00000 2016-09-19 16:25:50.057 PDT:LOG: database system was not
properly shut down; automatic recovery in progress
38422 00000 2016-09-19 16:25:50.057 PDT:LOG: redo starts at 3F/2200DE90
38422 01000 2016-09-19 16:25:50.061 PDT:WARNING: page verification failed,
calculated checksum 65067 but expected 21260
38422 01000 2016-09-19 16:25:50.061 PDT:CONTEXT: xlog redo at 3F/22053B50
for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
38422 XX001 2016-09-19 16:25:50.071 PDT:FATAL: invalid page in block 9 of
relation base/16384/17334
38422 XX001 2016-09-19 16:25:50.071 PDT:CONTEXT: xlog redo at 3F/22053B50
for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T

The original page with the invalid checksum is:

I think this is an example of the torn page problem, which seems to be
happening because of the below code in your test.

! if (JJ_torn_page > 0 && counter++ > JJ_torn_page && !RecoveryInProgress()) {
! nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ/3);
! ereport(FATAL,
! (errcode(ERRCODE_DISK_FULL),
! errmsg("could not write block %u of relation %s: wrote only %d of %d
bytes",
! blocknum,
! relpath(reln->smgr_rnode, forknum),
! nbytes, BLCKSZ),
! errhint("JJ is screwing with the database.")));
! } else {
! nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
! }

If you are running the above test by disabling JJ_torn_page, then it
is a different matter and we need to investigate it, but I assume you
are running by enabling it.

I think this could happen if the actual change to the page is in the
trailing 2/3 of the page, which the above code does not write. The
checksum in the page header, written as part of the partial page write
(the first 1/3 of the page), would have covered the change you made,
whereas after restart, when the page is read again to apply redo, the
checksum calculation won't include the change in that 2/3 part.

Correct. But any torn page write must be covered by the restoration of a
full page image during replay, shouldn't it? And that restoration should
happen blindly, without first reading in the old page and verifying the
checksum. Failure to restore the page from a FPI would be a bug. (That
was the purpose for which I wrote this testing harness in the first place,
to verify that the restoration of FPI happens correctly; although most of
the bugs it happens to uncover have been unrelated to that.)

Today, Ashutosh has shared the logs of his test run where he has shown
similar problem for HEAP page. I think this could happen though
rarely for any page with the above kind of tests.

I think Ashutosh's examples are of warnings, not errors. I think the
warnings occur when replay needs to read in the block (for reasons I don't
understand yet) but then doesn't care if it passes the checksum or not
because it will just be blown away by the replay anyway.

Does this explanation explains the reason of problem you are seeing?

If it can't survive artificial torn page writes, then it probably can't
survive real ones either. So I am pretty sure it is a bug of some sort.
Perhaps the bug is that it is generating an ERROR when it should just be a
WARNING?

If I ignore the checksum failure and re-start the system, the page gets
restored to be a bitmap page.

Okay, but have you ensured in some way that redo is applied to bitmap page?

I haven't done that yet. I can't start the system without destroying the
evidence, and I haven't figured out yet how to import a specific block from
a shut-down server into a bytea of a running server, in order to inspect it
using pageinspect.

Today, while thinking about this problem, I realized that currently in
the patch we are using REGBUF_NO_IMAGE for the bitmap page because of
one of the problems you reported [1]. That change will fix the problem
you reported, but it will expose bitmap pages to torn-page hazards. I
think the right fix there is to make pd_lower equal to pd_upper for the
bitmap page, so that full page writes don't exclude the data in the
bitmap page.

I'm afraid that is over my head. I can study it until it makes sense, but
it will take me a while.

Cheers,

Jeff

[1] - /messages/by-id/CAA4eK1KJOfVvFUmi6dcX9Y2-0PFHkomDzGuyoC%3DaD3Qj9WPpFA%40mail.gmail.com

#50Amit Kapila
amit.kapila16@gmail.com
In reply to: Jeff Janes (#49)
Re: Write Ahead Logging for Hash Indexes

On Thu, Sep 22, 2016 at 8:51 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Tue, Sep 20, 2016 at 10:27 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Tue, Sep 20, 2016 at 10:24 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Thu, Sep 15, 2016 at 11:42 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

Okay, Thanks for pointing out the same. I have fixed it. Apart from
that, I have changed _hash_alloc_buckets() to initialize the page
instead of making it completely Zero because of problems discussed in
another related thread [1]. I have also updated README.

with v7 of the concurrent hash patch and v4 of the write ahead log patch
and the latest relcache patch (I don't know how important that is to
reproducing this, I suspect it is not), I once got this error:

38422 00000 2016-09-19 16:25:50.055 PDT:LOG: database system was
interrupted; last known up at 2016-09-19 16:25:49 PDT
38422 00000 2016-09-19 16:25:50.057 PDT:LOG: database system was not
properly shut down; automatic recovery in progress
38422 00000 2016-09-19 16:25:50.057 PDT:LOG: redo starts at 3F/2200DE90
38422 01000 2016-09-19 16:25:50.061 PDT:WARNING: page verification failed,
calculated checksum 65067 but expected 21260
38422 01000 2016-09-19 16:25:50.061 PDT:CONTEXT: xlog redo at 3F/22053B50
for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
38422 XX001 2016-09-19 16:25:50.071 PDT:FATAL: invalid page in block 9 of
relation base/16384/17334
38422 XX001 2016-09-19 16:25:50.071 PDT:CONTEXT: xlog redo at 3F/22053B50
for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T

The original page with the invalid checksum is:

I think this is an example of the torn page problem, which seems to be
happening because of the below code in your test.

! if (JJ_torn_page > 0 && counter++ > JJ_torn_page && !RecoveryInProgress()) {
! nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ/3);
! ereport(FATAL,
! (errcode(ERRCODE_DISK_FULL),
! errmsg("could not write block %u of relation %s: wrote only %d of %d
bytes",
! blocknum,
! relpath(reln->smgr_rnode, forknum),
! nbytes, BLCKSZ),
! errhint("JJ is screwing with the database.")));
! } else {
! nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
! }

If you are running the above test by disabling JJ_torn_page, then it
is a different matter and we need to investigate it, but I assume you
are running by enabling it.

I think this could happen if the actual change to the page is in the
trailing 2/3 of the page, which the above code does not write. The
checksum in the page header, written as part of the partial page write
(the first 1/3 of the page), would have covered the change you made,
whereas after restart, when the page is read again to apply redo, the
checksum calculation won't include the change in that 2/3 part.

Correct. But any torn page write must be covered by the restoration of a
full page image during replay, shouldn't it? And that restoration should
happen blindly, without first reading in the old page and verifying the
checksum.

Probably, but I think this is not currently the way it is handled and
I don't want to change it. AFAIU, what happens now is that we first
read the old page (and we do page verification while reading the old
page and display a *warning* if the checksum doesn't match) and then
restore the image. The relevant code path is
XLogReadBufferForRedo()->XLogReadBufferExtended()->ReadBufferWithoutRelcache()->ReadBuffer_common()->PageIsVerified().

Now, the point here is why we are seeing a FATAL instead of a WARNING
for the bitmap page of the hash index. The reason, as explained in my
previous e-mail, is that for the bitmap page we are not logging a full
page image, and we should fix that as explained there. Once the full
page image is logged, then the first time it reads the torn page it
will use the flag RBM_ZERO_AND_LOCK, which will turn the FATAL error
you are seeing into a WARNING.

I think this is the reason why Ashutosh is seeing a WARNING for the
heap page whereas you are seeing a FATAL for the bitmap page of the
hash index.

I don't think we should try to avoid the WARNING for torn pages as part
of this patch, even if that is a better course of action, but the FATAL
certainly should be fixed and a WARNING displayed instead, as is done
for heap pages.
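
For reference, a redo handler typically has the shape sketched below (a
generic illustration, not the hash_xlog code from the patch; the
function name and the work done in the BLK_NEEDS_REDO branch are
placeholders, and it would need the usual xlogutils.h/bufmgr.h headers).
It shows where the two cases diverge: when a full-page image was
registered for the block, XLogReadBufferForRedo() returns BLK_RESTORED
and the old page contents are never read or verified; only the
BLK_NEEDS_REDO case goes through the read-and-verify path described
above:

/* sketch of a typical redo handler; details are placeholders */
static void
toy_hash_redo_insert(XLogReaderState *record)
{
    XLogRecPtr  lsn = record->EndRecPtr;
    Buffer      buffer;

    if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
    {
        /*
         * No full-page image was registered: the existing page was read
         * from disk (and verified, which is where the checksum complaint
         * comes from), so re-apply the logged change to it here.
         */
        Page        page = BufferGetPage(buffer);

        /* ... redo the incremental change described by the record ... */

        PageSetLSN(page, lsn);
        MarkBufferDirty(buffer);
    }

    /*
     * BLK_RESTORED: the full-page image was copied over a zeroed buffer
     * without reading the old page, so nothing more is needed here.
     */
    if (BufferIsValid(buffer))
        UnlockReleaseBuffer(buffer);
}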

Today, while thinking about this problem, I realized that currently in
the patch we are using REGBUF_NO_IMAGE for the bitmap page because of
one of the problems you reported [1]. That change will fix the problem
you reported, but it will expose bitmap pages to torn-page hazards. I
think the right fix there is to make pd_lower equal to pd_upper for the
bitmap page, so that full page writes don't exclude the data in the
bitmap page.

I'm afraid that is over my head. I can study it until it makes sense, but
it will take me a while.

Does the explanation above help? If not, then tell me and I will try to
explain it once more.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#51Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#50)
Re: Write Ahead Logging for Hash Indexes

On Thu, Sep 22, 2016 at 10:16 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Sep 22, 2016 at 8:51 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

Correct. But any torn page write must be covered by the restoration of a
full page image during replay, shouldn't it? And that restoration should
happen blindly, without first reading in the old page and verifying the
checksum.

Probably, but I think this is not currently the way it is handled and
I don't want to change it. AFAIU, what happens now is that we first
read the old page (and we do page verification while reading the old
page and display a *warning* if the checksum doesn't match) and then
restore the image. The relevant code path is
XLogReadBufferForRedo()->XLogReadBufferExtended()->ReadBufferWithoutRelcache()->ReadBuffer_common()->PageIsVerified().

Now, the point here is why we are seeing a FATAL instead of a WARNING
for the bitmap page of the hash index. The reason, as explained in my
previous e-mail, is that for the bitmap page we are not logging a full
page image, and we should fix that as explained there. Once the full
page image is logged, then the first time it reads the torn page it
will use the flag RBM_ZERO_AND_LOCK, which will turn the FATAL error
you are seeing into a WARNING.

I think I was slightly wrong here. For full page writes, it does use
RBM_ZERO_AND_LOCK mode to read the page, and in that mode we do not
perform the page verification check; rather, we blindly set the page to
zero and then overwrite it with the full page image. So after my fix,
you will not see the checksum failure error. I have a fix ready, but am
still doing some more verification. If everything passes, I will share
the patch in a day or so.
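
Roughly, the replay machinery behaves like the sketch below (paraphrased
rather than quoted from xlogutils.c; toy_redo_block is a made-up name
and several details of the real function are omitted): when a block
image is present, the buffer is grabbed with RBM_ZERO_AND_LOCK, so the
possibly-torn on-disk page is never read or checksummed, and the image
from WAL simply overwrites it:

/* rough sketch, not the actual XLogReadBufferForRedoExtended() source */
static XLogRedoAction
toy_redo_block(XLogReaderState *record, uint8 block_id, XLogRecPtr lsn,
               Buffer *buf)
{
    RelFileNode rnode;
    ForkNumber  forknum;
    BlockNumber blkno;

    XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno);

    if (XLogRecHasBlockImage(record, block_id))
    {
        Page        page;

        /* zeroes the buffer; the old (possibly torn) page is never verified */
        *buf = XLogReadBufferExtended(rnode, forknum, blkno,
                                      RBM_ZERO_AND_LOCK);
        page = BufferGetPage(*buf);
        if (!RestoreBlockImage(record, block_id, page))
            elog(ERROR, "failed to restore block image");
        PageSetLSN(page, lsn);
        MarkBufferDirty(*buf);
        return BLK_RESTORED;
    }

    /* no image: the old page is read and verified, then redo is applied */
    *buf = XLogReadBufferExtended(rnode, forknum, blkno, RBM_NORMAL);
    return BLK_NEEDS_REDO;
}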

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#52Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#51)
1 attachment(s)
Re: Write Ahead Logging for Hash Indexes

On Fri, Sep 23, 2016 at 5:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think here I am slightly wrong. For the full page writes, it do use
RBM_ZERO_AND_LOCK mode to read the page and for such mode we are not
doing page verification check and rather blindly setting the page to
zero and then overwrites it with full page image. So after my fix,
you will not see the error of checksum failure. I have a fix ready,
but still doing some more verification. If everything passes, I will
share the patch in a day or so.

The attached patch fixes the problem; now we do perform full page writes
for bitmap pages. Apart from that, I have rebased the patch on the
latest concurrent index patch [1]. I have updated the README as well to
reflect the WAL-logging-related information for the different
operations.

With the attached patch, all the review comments and issues found so far
are addressed.

[1]: /messages/by-id/CAA4eK1+X=8sUd1UCZDZnE3D9CGi9kw+kjxp2Tnw7SX5w8pLBNw@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

wal_hash_index_v5.patch (application/octet-stream)
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 0f09d82..2d0d3bd 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1520,19 +1520,6 @@ archive_command = 'local_backup_script.sh "%p" "%f"'
   <itemizedlist>
    <listitem>
     <para>
-     Operations on hash indexes are not presently WAL-logged, so
-     replay will not update these indexes.  This will mean that any new inserts
-     will be ignored by the index, updated rows will apparently disappear and
-     deleted rows will still retain pointers. In other words, if you modify a
-     table with a hash index on it then you will get incorrect query results
-     on a standby server.  When recovery completes it is recommended that you
-     manually <xref linkend="sql-reindex">
-     each such index after completing a recovery operation.
-    </para>
-   </listitem>
-
-   <listitem>
-    <para>
      If a <xref linkend="sql-createdatabase">
      command is executed while a base backup is being taken, and then
      the template database that the <command>CREATE DATABASE</> copied
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a848a7e..2607dc7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2112,10 +2112,9 @@ include_dir 'conf.d'
          has materialized a result set, no error will be generated even if the
          underlying rows in the referenced table have been vacuumed away.
          Some tables cannot safely be vacuumed early, and so will not be
-         affected by this setting.  Examples include system catalogs and any
-         table which has a hash index.  For such tables this setting will
-         neither reduce bloat nor create a possibility of a <literal>snapshot
-         too old</> error on scanning.
+         affected by this setting.  Examples include system catalogs.  For
+         such tables this setting will neither reduce bloat nor create a
+         possibility of a <literal>snapshot too old</> error on scanning.
         </para>
        </listitem>
       </varlistentry>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 06f49db..f285795 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2323,12 +2323,6 @@ LOG:  database system is ready to accept read only connections
   <itemizedlist>
    <listitem>
     <para>
-     Operations on hash indexes are not presently WAL-logged, so
-     replay will not update these indexes.
-    </para>
-   </listitem>
-   <listitem>
-    <para>
      Full knowledge of running transactions is required before snapshots
      can be taken. Transactions that use large numbers of subtransactions
      (currently greater than 64) will delay the start of read only
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 46f8e55..e075ede 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -193,18 +193,6 @@ CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable>
 </synopsis>
   </para>
 
-  <caution>
-   <para>
-    Hash index operations are not presently WAL-logged,
-    so hash indexes might need to be rebuilt with <command>REINDEX</>
-    after a database crash if there were unwritten changes.
-    Also, changes to hash indexes are not replicated over streaming or
-    file-based replication after the initial base backup, so they
-    give wrong answers to queries that subsequently use them.
-    For these reasons, hash index use is presently discouraged.
-   </para>
-  </caution>
-
   <para>
    <indexterm>
     <primary>index</primary>
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index e9f47c4..a2dc212 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -510,19 +510,6 @@ Indexes:
    they can be useful.
   </para>
 
-  <caution>
-   <para>
-    Hash index operations are not presently WAL-logged,
-    so hash indexes might need to be rebuilt with <command>REINDEX</>
-    after a database crash if there were unwritten changes.
-    Also, changes to hash indexes are not replicated over streaming or
-    file-based replication after the initial base backup, so they
-    give wrong answers to queries that subsequently use them.
-    Hash indexes are also not properly restored during point-in-time
-    recovery.  For these reasons, hash index use is presently discouraged.
-   </para>
-  </caution>
-
   <para>
    Currently, only the B-tree, GiST, GIN, and BRIN index methods support
    multicolumn indexes. Up to 32 fields can be specified by default.
diff --git a/src/backend/access/hash/Makefile b/src/backend/access/hash/Makefile
index e2e7e91..b154569 100644
--- a/src/backend/access/hash/Makefile
+++ b/src/backend/access/hash/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = hash.o hashfunc.o hashinsert.o hashovfl.o hashpage.o hashsearch.o \
-       hashsort.o hashutil.o hashvalidate.o
+       hashsort.o hashutil.o hashvalidate.o hash_xlog.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 1974fad..db46010 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -271,7 +271,6 @@ The insertion algorithm is rather similar:
 			retake meta page buffer content lock in shared mode
 -- (so far same as reader, except for acquisation of buffer content lock in
 	exclusive mode on primary bucket page)
-	release pin on metapage
 	if the split-in-progress flag is set for bucket in old half of split
 	and pin count on it is one, then finish the split
 		we already have a buffer content lock on old bucket, conditionally get the content lock on new bucket
@@ -280,13 +279,17 @@ The insertion algorithm is rather similar:
 			release the buffer content lock and pin on new bucket
 	if full, release lock but not pin, read/exclusive-lock next page; repeat as needed
 	>> see below if no space in any page of bucket
+	take buffer content lock in exclusive mode on metapage
 	insert tuple at appropriate place in page
-	mark current page dirty and release buffer content lock and pin
-	if current page is not a bucket page, release the pin on bucket page
-	pin meta page and take buffer content lock in exclusive mode
+	mark current page dirty
 	increment tuple count, decide if split needed
-	mark meta page dirty and release buffer content lock and pin
-	done if no split needed, else enter Split algorithm below
+	mark meta page dirty
+	write WAL for insertion of tuple
+	release the buffer content lock on metapage
+	release buffer content lock on current page
+	if current page is not a bucket page, release the pin on bucket page
+	if split is needed, enter Split algorithm below
+	release the pin on metapage
 
 To speed searches, the index entries within any individual index page are
 kept sorted by hash code; the insertion code must take care to insert new
@@ -331,23 +334,32 @@ existing bucket in two, thereby lowering the fill ratio:
 			release the buffer content lock on meta page
 			remove the tuples that doesn't belong to this bucket; see bucket cleanup below
 	Attempt to acquire cleanup lock on new bucket number (shouldn't fail, but...)
+	update old bucket to indicate split-in-progress and bucket has garbage
+	mark old bucket buffer dirty
+	initialize new bucket and update to indicate split-in-progress
+	mark new bucket buffer dirty
 	update meta page to reflect new number of buckets
-	mark meta page dirty and release buffer content lock
+	mark meta page dirty
+	write WAL for allocation of new page for split
+	release buffer content lock on metapage
 	-- now, accesses to all other buckets can proceed.
-	Perform actual split of bucket, moving tuples as needed
+	release and reacquire buffer content lock on new bucket page
+	Perform actual split of bucket (refer split guts below), moving tuples as needed
 	>> see below about acquiring needed extra space
 
 	split guts
-	mark the old and new buckets indicating split-in-progress
-	mark the old bucket indicating has-garbage
 	copy the tuples that belongs to new bucket from old bucket
 	during copy mark such tuples as move-by-split
+	write WAL record for moving tuples to new page once the new page is full or
+	all the pages of old bucket are finished
 	release lock but not pin for primary bucket page of old bucket,
-	read/shared-lock next page; repeat as needed
+	acquire buffer content lock in shared mode on next page of old bucket; repeat as needed
 	>> see below if no space in bucket page of new bucket
-	ensure to have exclusive-lock on both old and new buckets in that order
+	ensure to have buffer content lock in exclusive mode on both old and new buckets in that order
 	clear the split-in-progress flag from both the buckets
-	mark buffers dirty and release the locks and pins on both old and new buckets
+	mark buffers dirty
+	write WAL for clearing the flags on both old and new buckets
+	release the locks and pins on both old and new buckets
 
 Note the metapage lock is not held while the actual tuple rearrangement is
 performed, so accesses to other buckets can proceed in parallel; in fact,
@@ -400,18 +412,24 @@ The fourth operation is garbage collection (bulk deletion):
 	while next bucket <= max bucket do
 		Acquire cleanup lock on target bucket
 		Scan and remove tuples
+		mark the target page dirty
+		write WAL for deleting tuples from target page
 		For overflow page, first we need to lock the next page and then
 		release the lock on current bucket or overflow page
 		Ensure to have buffer content lock in exclusive mode on bucket page
 		If buffer pincount is one, then compact free space as needed
 		Release lock
+		if the bucket has garbage tuples, then clear the garbage flag
+		mark the buffer dirty and write WAL record for clearing the flag
 		next bucket ++
 	end loop
 	pin metapage and take buffer content lock in exclusive mode
 	check if number of buckets changed
 	if so, release content lock and pin and return to for-each-bucket loop
 	else update metapage tuple count
-	mark meta page dirty and release buffer content lock and pin
+	mark meta page dirty
+	write WAL for update of metapage
+	release buffer content lock and pin
 
 Note that this is designed to allow concurrent splits and scans.  If a
 split occurs, tuples relocated into the new bucket will be visited twice
@@ -452,18 +470,16 @@ Obtaining an overflow page:
 	search for a free page (zero bit in bitmap)
 	if found:
 		set bit in bitmap
-		mark bitmap page dirty and release content lock
+		mark bitmap page dirty
 		take metapage buffer content lock in exclusive mode
 		if first-free-bit value did not change,
 			update it and mark meta page dirty
-		release meta page buffer content lock
-		return page number
 	else (not found):
 	release bitmap page buffer content lock
 	loop back to try next bitmap page, if any
 -- here when we have checked all bitmap pages; we hold meta excl. lock
 	extend index to add another overflow page; update meta information
-	mark meta page dirty and release buffer content lock
+	mark meta page dirty
 	return page number
 
 It is slightly annoying to release and reacquire the metapage lock
@@ -483,12 +499,15 @@ like this:
 
 	-- having determined that no space is free in the target bucket:
 	remember last page of bucket, drop write lock on it
-	call free-page-acquire routine
 	re-write-lock last page of bucket
 	if it is not last anymore, step to the last page
-	update (former) last page to point to new page
+	execute free-page-acquire (Obtaining an overflow page) mechanism described above
+	update (former) last page to point to new page and mark the buffer dirty
 	write-lock and initialize new page, with back link to former last page
-	write and release former last page
+	write WAL for addition of overflow page
+	release the locks on meta page and bitmap page acquired in free-page-acquire algorithm
+	release the lock on former last page
+	release the lock on new overflow page
 	insert tuple into new page
 	-- etc.
 
@@ -515,12 +534,14 @@ accessors of pages in the bucket.  The algorithm is:
 	determine which bitmap page contains the free space bit for page
 	release meta page buffer content lock
 	pin bitmap page and take buffer content lock in exclusive mode
-	update bitmap bit
-	mark bitmap page dirty and release buffer content lock and pin
-	if page number is less than what we saw as first-free-bit in meta:
 	retake meta page buffer content lock in exclusive mode
+	move (insert) tuples that belong to the overflow page being freed
+	update bitmap bit
+	mark bitmap page dirty
 	if page number is still less than first-free-bit,
 		update first-free-bit field and mark meta page dirty
+	write WAL for delink overflow page operation
+	release buffer content lock and pin
 	release meta page buffer content lock and pin
 
 We have to do it this way because we must clear the bitmap bit before
@@ -531,8 +552,101 @@ page acquirer will scan more bitmap bits than he needs to.  What must be
 avoided is having first-free-bit greater than the actual first free bit,
 because then that free page would never be found by searchers.
 
-All the freespace operations should be called while holding no buffer
-locks.  Since they need no lmgr locks, deadlock is not possible.
+The reason for moving tuples from the overflow page while delinking the
+latter is to make that one atomic operation.  Not doing so could lead to
+spurious reads on a standby; basically, a user might see the same tuple twice.
+
+
+WAL Considerations
+------------------
+
+The hash index operations like create index, insert, delete, bucket split,
+allocate overflow page, and squeeze (the operation that frees overflow pages)
+do not by themselves guarantee hash index consistency after a crash.  To
+provide robustness, we write WAL for each of these operations, which is
+replayed during crash recovery.
+
+Multiple WAL records are being written for create index operation, first for
+initializing the metapage, followed by one for each new bucket created during
+operation followed by one for initializing the bitmap page.  If the system
+crashes after any operation, the whole operation is rolled back.  We
+considered writing a single WAL record for the whole operation, but for that
+we need to limit the number of initial buckets that can be created during the
+operation.  As we can log only a fixed number of pages, XLR_MAX_BLOCK_ID (32),
+with the current XLog machinery, it is better to write multiple WAL records
+for this operation.  The downside of restricting the number of buckets is that
+we need to perform a split operation if the number of tuples is more than what
+can be accommodated in the initial set of buckets, and it is not unusual to
+have a large number of tuples during a create index operation.
+
+Ordinary item insertions (that don't force a page split or need a new overflow
+page) are single WAL entries.  They touch a single bucket page and the meta
+page; the metapage is updated during replay just as it is during the original
+operation.
+
+An insertion that causes the addition of an overflow page is logged as a
+single WAL entry, preceded by a WAL entry for the new overflow page required
+to insert the tuple.  There is a corner case where, by the time we try to use
+the newly allocated overflow page, it has already been used by concurrent
+insertions; in such a case, a new overflow page will be allocated and a
+separate WAL entry will be made for it.
+
+An insertion that causes a bucket split is logged as a single WAL entry,
+followed by a WAL entry for allocating a new bucket, followed by a WAL entry
+for each overflow bucket page in the new bucket to which the tuples are moved
+from old bucket, followed by a WAL entry to indicate that split is complete
+for both old and new buckets.
+
+A split operation which requires overflow pages to complete the operation
+will need to write a WAL record for each new allocation of an overflow page.
+
+As splitting involves multiple atomic actions, it's possible that the
+system crashes while moving tuples from the bucket pages of the old bucket to
+the new bucket.  In such a case, after recovery, both the old and new buckets
+will be marked with the incomplete-split flag, which indicates that the split
+has to be finished for that particular bucket.
+it will scan both the old and new buckets as explained in the reader algorithm
+section above.
+
+We finish the split at next insert or split operation on old bucket as
+explained in insert and split algorithm above.  It could be done during
+searches, too, but it seems best not to put any extra updates in what would
+otherwise be a read-only operation (updating is not possible in hot standby
+mode anyway).  It would seem natural to complete the split in VACUUM, but since
+splitting a bucket might require allocating a new page, it might fail if you
+run out of disk space.  That would be bad during VACUUM - the reason for
+running VACUUM in the first place might be that you ran out of disk space,
+and now VACUUM won't finish because you're out of disk space.  In contrast,
+an insertion can require enlarging the physical file anyway.
+
+Deletion of tuples from a bucket is performed for two reasons: one is to
+remove dead tuples, and the other is to remove tuples that were moved by a
+split.  A WAL entry is made for each bucket page from which tuples are
+removed, followed by a WAL entry to clear the garbage flag if the tuples moved
+by the split are removed.  A separate WAL entry is made for updating the
+metapage if the deletion is performed by vacuum to remove dead tuples.
+
+As deletion involves multiple atomic operations, it is quite possible that
+the system crashes (a) after removing tuples from some of the bucket pages,
+(b) before clearing the garbage flag, or (c) before updating the metapage.  If
+the system crashes before completing (b), it will again try to clean the
+bucket during the next vacuum or insert after recovery, which can have some
+performance impact, but it will work fine.  If the system crashes before
+completing (c), after recovery there could be some additional splits until the
+next vacuum updates the metapage, but the other operations like insert, delete
+and scan will work correctly.  We could fix this problem by actually updating
+the metapage based on the delete operation during replay, but it is not clear
+that it is worth the complication.
+
+The squeeze operation moves tuples from one of the bucket pages later in the
+chain to one of the bucket pages earlier in the chain, and writes a WAL record
+when either the page to which it is writing tuples is filled or the page from
+which it is removing tuples becomes empty.
+
+As the squeeze operation involves multiple atomic operations, it is quite
+possible that the system crashes before completing the operation on the
+entire bucket.  After recovery, the operations will work correctly, but the
+index will remain bloated, which can impact the performance of read and
+insert operations until the next vacuum squeezes the bucket completely.
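+
+As an illustration, the write side of each of these operations follows the
+usual WAL pattern; a simplified sketch (the record struct, the registered
+buffers and the record type vary per operation) looks like:
+
+	START_CRIT_SECTION();
+	<apply the page modifications>
+	MarkBufferDirty(buf);
+	if (RelationNeedsWAL(rel))
+	{
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, <size of record struct>);
+		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+		recptr = XLogInsert(RM_HASH_ID, <record type>);
+		PageSetLSN(BufferGetPage(buf), recptr);
+	}
+	END_CRIT_SECTION();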
 
 
 Other Notes
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index ba9a1c2..db73f05 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -27,6 +27,7 @@
 #include "optimizer/plancat.h"
 #include "utils/index_selfuncs.h"
 #include "utils/rel.h"
+#include "miscadmin.h"
 
 
 /* Working state for hashbuild and its callback */
@@ -115,7 +116,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	estimate_rel_size(heap, NULL, &relpages, &reltuples, &allvisfrac);
 
 	/* Initialize the hash index metadata page and initial buckets */
-	num_buckets = _hash_metapinit(index, reltuples, MAIN_FORKNUM);
+	num_buckets = _hash_init(index, reltuples, MAIN_FORKNUM);
 
 	/*
 	 * If we just insert the tuples into the index in scan order, then
@@ -177,7 +178,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 void
 hashbuildempty(Relation index)
 {
-	_hash_metapinit(index, 0, INIT_FORKNUM);
+	_hash_init(index, 0, INIT_FORKNUM);
 }
 
 /*
@@ -297,6 +298,11 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		buf = so->hashso_curbuf;
 		Assert(BufferIsValid(buf));
 		page = BufferGetPage(buf);
+
+		/*
+		 * We don't need to test for an old snapshot here, as the current
+		 * buffer is pinned, so vacuum can't clean the page.
+		 */
 		maxoffnum = PageGetMaxOffsetNumber(page);
 		for (offnum = ItemPointerGetOffsetNumber(current);
 			 offnum <= maxoffnum;
@@ -593,6 +599,7 @@ loop_top:
 	}
 
 	/* Okay, we're really done.  Update tuple count in metapage. */
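+	/* No ereport(ERROR) until changes are logged */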
+	START_CRIT_SECTION();
 
 	if (orig_maxbucket == metap->hashm_maxbucket &&
 		orig_ntuples == metap->hashm_ntuples)
@@ -618,7 +625,28 @@ loop_top:
 		num_index_tuples = metap->hashm_ntuples;
 	}
 
-	_hash_wrtbuf(rel, metabuf);
+	MarkBufferDirty(metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_update_meta_page xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.ntuples = metap->hashm_ntuples;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, sizeof(SizeOfHashUpdateMetaPage));
+
+		XLogRegisterBuffer(0, metabuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_UPDATE_META_PAGE);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	_hash_relbuf(rel, metabuf);
 
 	/* return statistics */
 	if (stats == NULL)
@@ -709,7 +737,6 @@ hashbucketcleanup(Relation rel, Buffer bucket_buf,
 		OffsetNumber deletable[MaxOffsetNumber];
 		int			ndeletable = 0;
 		bool		retain_pin = false;
-		bool		curr_page_dirty = false;
 
 		if (delay)
 			vacuum_delay_point();
@@ -776,9 +803,41 @@ hashbucketcleanup(Relation rel, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			/* No ereport(ERROR) until changes are logged */
+			START_CRIT_SECTION();
+
 			PageIndexMultiDelete(page, deletable, ndeletable);
 			bucket_dirty = true;
-			curr_page_dirty = true;
+
+			MarkBufferDirty(buf);
+
+			/* XLOG stuff */
+			if (RelationNeedsWAL(rel))
+			{
+				xl_hash_delete xlrec;
+				XLogRecPtr	recptr;
+
+				xlrec.is_primary_bucket_page = (buf == bucket_buf) ? true : false;
+
+				XLogBeginInsert();
+				XLogRegisterData((char *) &xlrec, SizeOfHashDelete);
+
+				/*
+				 * bucket buffer needs to be registered to ensure that we can
+				 * acquire a cleanup lock on it during replay.
+				 */
+				if (!xlrec.is_primary_bucket_page)
+					XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+				XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
+				XLogRegisterBufData(1, (char *) deletable,
+									ndeletable * sizeof(OffsetNumber));
+
+				recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_DELETE);
+				PageSetLSN(BufferGetPage(buf), recptr);
+			}
+
+			END_CRIT_SECTION();
 		}
 
 		/* bail out if there are no more pages to scan. */
@@ -793,15 +852,7 @@ hashbucketcleanup(Relation rel, Buffer bucket_buf,
 		 * release the lock on previous page after acquiring the lock on next
 		 * page
 		 */
-		if (curr_page_dirty)
-		{
-			if (retain_pin)
-				_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
-			else
-				_hash_wrtbuf(rel, buf);
-			curr_page_dirty = false;
-		}
-		else if (retain_pin)
+		if (retain_pin)
 			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
 		else
 			_hash_relbuf(rel, buf);
@@ -833,10 +884,36 @@ hashbucketcleanup(Relation rel, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		/* No ereport(ERROR) until changes are logged */
+		START_CRIT_SECTION();
+
 		bucket_opaque->hasho_flag &= ~LH_BUCKET_PAGE_HAS_GARBAGE;
+
+		MarkBufferDirty(bucket_buf);
+
+		/* XLOG stuff */
+		if (RelationNeedsWAL(rel))
+		{
+			XLogRecPtr	recptr;
+
+			XLogBeginInsert();
+			XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+			recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_CLEAR_GARBAGE);
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
 	}
 
 	/*
+	 * we need to release and reacquire the lock on the bucket buffer to
+	 * ensure that a standby doesn't see an intermediate state of it.
+	 */
+	_hash_chgbufaccess(rel, bucket_buf, HASH_READ, HASH_NOLOCK);
+	_hash_chgbufaccess(rel, bucket_buf, HASH_NOLOCK, HASH_WRITE);
+
+	/*
 	 * If we deleted anything, try to compact free space.  For squeezing the
 	 * bucket, we must have a cleanup lock, else it can impact the ordering of
 	 * tuples for a scan that has started before it.
@@ -845,9 +922,3 @@ hashbucketcleanup(Relation rel, Buffer bucket_buf,
 		_hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf,
 							bstrategy);
 }
-
-void
-hash_redo(XLogReaderState *record)
-{
-	elog(PANIC, "hash_redo: unimplemented");
-}
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
new file mode 100644
index 0000000..d030a8d
--- /dev/null
+++ b/src/backend/access/hash/hash_xlog.c
@@ -0,0 +1,970 @@
+/*-------------------------------------------------------------------------
+ *
+ * hash_xlog.c
+ *	  WAL replay logic for hash index.
+ *
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/hash/hash_xlog.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/hash_xlog.h"
+#include "access/xlogutils.h"
+
+/*
+ * replay a hash index meta page
+ */
+static void
+hash_xlog_init_meta_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Page		page;
+	Buffer		metabuf;
+
+	xl_hash_init_meta_page *xlrec = (xl_hash_init_meta_page *) XLogRecGetData(record);
+
+	/* create the index' metapage */
+	metabuf = XLogInitBufferForRedo(record, 0);
+	Assert(BufferIsValid(metabuf));
+	_hash_init_metabuffer(metabuf, xlrec->num_tuples, xlrec->procid,
+						  xlrec->ffactor, true);
+	page = (Page) BufferGetPage(metabuf);
+	PageSetLSN(page, lsn);
+	MarkBufferDirty(metabuf);
+	/* all done */
+	UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay a hash index bitmap page
+ */
+static void
+hash_xlog_init_bitmap_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		bitmapbuf;
+	Buffer		metabuf;
+	Page		page;
+	HashMetaPage metap;
+	uint32		num_buckets;
+
+	xl_hash_init_bitmap_page *xlrec = (xl_hash_init_bitmap_page *) XLogRecGetData(record);
+
+	/*
+	 * Initialize bitmap page
+	 */
+	bitmapbuf = XLogInitBufferForRedo(record, 0);
+	_hash_initbitmapbuffer(bitmapbuf, xlrec->bmsize, true);
+	PageSetLSN(BufferGetPage(bitmapbuf), lsn);
+	MarkBufferDirty(bitmapbuf);
+	UnlockReleaseBuffer(bitmapbuf);
+
+	/* add the new bitmap page to the metapage's list of bitmaps */
+	if (XLogReadBufferForRedo(record, 1, &metabuf) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Note: in normal operation, we'd update the metapage while still
+		 * holding lock on the page we inserted into.  But during replay it's
+		 * not necessary to hold that lock, since no other index updates can
+		 * be happening concurrently.
+		 */
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+
+		num_buckets = metap->hashm_maxbucket + 1;
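+		/* the initial bitmap page is the block just after the last bucket page */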
+		metap->hashm_mapp[metap->hashm_nmaps] = num_buckets + 1;
+		metap->hashm_nmaps++;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay a hash index insert without split
+ */
+static void
+hash_xlog_insert(XLogReaderState *record)
+{
+	HashMetaPage metap;
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_insert *xlrec = (xl_hash_insert *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		Size		datalen;
+		char	   *datapos = XLogRecGetBlockData(record, 0, &datalen);
+
+		page = BufferGetPage(buffer);
+
+		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+						false, false) == InvalidOffsetNumber)
+			elog(PANIC, "hash_insert_redo: failed to add item");
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+
+	if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Note: in normal operation, we'd update the metapage while still
+		 * holding lock on the page we inserted into.  But during replay it's
+		 * not necessary to hold that lock, since no other index updates can
+		 * be happening concurrently.
+		 */
+		page = BufferGetPage(buffer);
+		metap = HashPageGetMeta(page);
+		metap->hashm_ntuples += 1;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
+
+/*
+ * replay addition of overflow page for hash index
+ */
+static void
+hash_xlog_addovflpage(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_addovflpage *xlrec = (xl_hash_addovflpage *) XLogRecGetData(record);
+	Buffer		leftbuf;
+	Buffer		ovflbuf;
+	Buffer		metabuf;
+	BlockNumber leftblk;
+	BlockNumber rightblk;
+	BlockNumber newmapblk = InvalidBlockNumber;
+	Page		ovflpage;
+	HashPageOpaque ovflopaque;
+	uint32	   *num_bucket;
+	char	   *data;
+	Size datalen PG_USED_FOR_ASSERTS_ONLY;
+	bool		new_bmpage = false;
+
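+	/*
+	 * Block references in this record: 0 is the new overflow page, 1 is the
+	 * page it is linked to (the old tail of the bucket chain), 2 is the
+	 * existing bitmap page whose bit is set (if any), 3 is a newly added
+	 * bitmap page (if any), and 4 is the metapage.
+	 */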
+	XLogRecGetBlockTag(record, 0, NULL, NULL, &rightblk);
+	XLogRecGetBlockTag(record, 1, NULL, NULL, &leftblk);
+
+	ovflbuf = XLogInitBufferForRedo(record, 0);
+	Assert(BufferIsValid(ovflbuf));
+
+	data = XLogRecGetBlockData(record, 0, &datalen);
+	num_bucket = (uint32 *) data;
+	Assert(datalen == sizeof(uint32));
+	_hash_initbuf(ovflbuf, *num_bucket, LH_OVERFLOW_PAGE, true);
+	/* update backlink */
+	ovflpage = BufferGetPage(ovflbuf);
+	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
+	ovflopaque->hasho_prevblkno = leftblk;
+
+	PageSetLSN(ovflpage, lsn);
+	MarkBufferDirty(ovflbuf);
+
+	if (XLogReadBufferForRedo(record, 1, &leftbuf) == BLK_NEEDS_REDO)
+	{
+		Page		leftpage;
+		HashPageOpaque leftopaque;
+
+		leftpage = BufferGetPage(leftbuf);
+		leftopaque = (HashPageOpaque) PageGetSpecialPointer(leftpage);
+		leftopaque->hasho_nextblkno = rightblk;
+
+		PageSetLSN(leftpage, lsn);
+		MarkBufferDirty(leftbuf);
+	}
+
+	/*
+	 * We need to release the locks once the prev pointer of overflow bucket
+	 * and next of left bucket are set, otherwise concurrent read might skip
+	 * the bucket.
+	 */
+	if (BufferIsValid(leftbuf))
+		UnlockReleaseBuffer(leftbuf);
+	UnlockReleaseBuffer(ovflbuf);
+
+	/*
+	 * Note: in normal operation, we'd update the bitmap and meta page while
+	 * still holding lock on the overflow pages we updated.  But during replay
+	 * it's not necessary to hold those locks, since no other index updates
+	 * can be happening concurrently.
+	 */
+	if (XLogRecHasBlockRef(record, 2))
+	{
+		Buffer		mapbuffer;
+
+		if (XLogReadBufferForRedo(record, 2, &mapbuffer) == BLK_NEEDS_REDO)
+		{
+			Page		mappage = (Page) BufferGetPage(mapbuffer);
+			uint32	   *freep = NULL;
+			char	   *data;
+			uint32	   *bitmap_page_bit;
+
+			freep = HashPageGetBitmap(mappage);
+
+			data = XLogRecGetBlockData(record, 2, &datalen);
+			bitmap_page_bit = (uint32 *) data;
+
+			SETBIT(freep, *bitmap_page_bit);
+
+			PageSetLSN(mappage, lsn);
+			MarkBufferDirty(mapbuffer);
+		}
+		if (BufferIsValid(mapbuffer))
+			UnlockReleaseBuffer(mapbuffer);
+	}
+
+	if (XLogRecHasBlockRef(record, 3))
+	{
+		Buffer		newmapbuf;
+
+		newmapbuf = XLogInitBufferForRedo(record, 3);
+
+		_hash_initbitmapbuffer(newmapbuf, xlrec->bmsize, true);
+
+		new_bmpage = true;
+		newmapblk = BufferGetBlockNumber(newmapbuf);
+
+		MarkBufferDirty(newmapbuf);
+		PageSetLSN(BufferGetPage(newmapbuf), lsn);
+
+		UnlockReleaseBuffer(newmapbuf);
+	}
+
+	if (XLogReadBufferForRedo(record, 4, &metabuf) == BLK_NEEDS_REDO)
+	{
+		HashMetaPage metap;
+		Page		page;
+		uint32	   *firstfree_ovflpage;
+
+		data = XLogRecGetBlockData(record, 4, &datalen);
+		firstfree_ovflpage = (uint32 *) data;
+
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+		metap->hashm_firstfree = *firstfree_ovflpage;
+
+		if (!xlrec->bmpage_found)
+		{
+			metap->hashm_spares[metap->hashm_ovflpoint]++;
+
+			if (new_bmpage)
+			{
+				Assert(BlockNumberIsValid(newmapblk));
+
+				metap->hashm_mapp[metap->hashm_nmaps] = newmapblk;
+				metap->hashm_nmaps++;
+				metap->hashm_spares[metap->hashm_ovflpoint]++;
+			}
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay allocation of page for split operation
+ */
+static void
+hash_xlog_split_allocpage(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_split_allocpage *xlrec = (xl_hash_split_allocpage *) XLogRecGetData(record);
+	Buffer		oldbuf;
+	Buffer		newbuf;
+	Buffer		metabuf;
+	Size datalen PG_USED_FOR_ASSERTS_ONLY;
+	char	   *data;
+	XLogRedoAction action;
+
+	/* replay the record for old bucket */
+	action = XLogReadBufferForRedo(record, 0, &oldbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		oldpage;
+		HashPageOpaque oldopaque;
+
+		oldpage = BufferGetPage(oldbuf);
+		oldopaque = (HashPageOpaque) PageGetSpecialPointer(oldpage);
+
+		oldopaque->hasho_flag = xlrec->old_bucket_flag;
+
+		PageSetLSN(oldpage, lsn);
+		MarkBufferDirty(oldbuf);
+	}
+
+	/*
+	 * There is no harm in releasing the lock on old bucket before new bucket
+	 * during replay as no other bucket splits can be happening concurrently.
+	 */
+	if (BufferIsValid(oldbuf))
+		UnlockReleaseBuffer(oldbuf);
+
+	/* replay the record for new bucket */
+	newbuf = XLogInitBufferForRedo(record, 1);
+	_hash_initbuf(newbuf, xlrec->new_bucket, xlrec->new_bucket_flag, true);
+	MarkBufferDirty(newbuf);
+	PageSetLSN(BufferGetPage(newbuf), lsn);
+
+	UnlockReleaseBuffer(newbuf);
+
+	/*
+	 * Note: in normal operation, we'd update the meta page while still
+	 * holding lock on the old and new bucket pages.  But during replay it's
+	 * not necessary to hold those locks, since no other bucket splits can be
+	 * happening concurrently.
+	 */
+
+	/* replay the record for metapage changes */
+	if (XLogReadBufferForRedo(record, 2, &metabuf) == BLK_NEEDS_REDO)
+	{
+		Page		page;
+		HashMetaPage metap;
+
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+		metap->hashm_maxbucket = xlrec->new_bucket;
+
+		data = XLogRecGetBlockData(record, 2, &datalen);
+
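+		/*
+		 * The block data registered with the metapage optionally contains
+		 * the new low and high masks, followed by the new split point
+		 * information, as indicated by the record's flags.
+		 */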
+		if (xlrec->flags & XLH_SPLIT_META_UPDATE_MASKS)
+		{
+			uint32		lowmask;
+			uint32	   *highmask;
+
+			/* extract low and high masks. */
+			memcpy(&lowmask, data, sizeof(uint32));
+			highmask = (uint32 *) ((char *) data + sizeof(uint32));
+
+			/* update metapage */
+			metap->hashm_lowmask = lowmask;
+			metap->hashm_highmask = *highmask;
+
+			data += sizeof(uint32) * 2;
+		}
+
+		if (xlrec->flags & XLH_SPLIT_META_UPDATE_SPLITPOINT)
+		{
+			uint32		ovflpoint;
+			uint32	   *ovflpages;
+
+			/* extract information of overflow pages. */
+			memcpy(&ovflpoint, data, sizeof(uint32));
+			ovflpages = (uint32 *) ((char *) data + sizeof(uint32));
+
+			/* update metapage */
+			metap->hashm_spares[ovflpoint] = *ovflpages;
+			metap->hashm_ovflpoint = ovflpoint;
+		}
+
+		MarkBufferDirty(metabuf);
+		PageSetLSN(BufferGetPage(metabuf), lsn);
+	}
+
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay of split operation
+ */
+static void
+hash_xlog_split_page(XLogReaderState *record)
+{
+	Buffer		buf;
+
+	if (XLogReadBufferForRedo(record, 0, &buf) != BLK_RESTORED)
+		elog(ERROR, "Hash split record did not contain a full-page image");
+
+	UnlockReleaseBuffer(buf);
+}
+
+/*
+ * replay completion of split operation
+ */
+static void
+hash_xlog_split_complete(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_split_complete *xlrec = (xl_hash_split_complete *) XLogRecGetData(record);
+	Buffer		oldbuf;
+	Buffer		newbuf;
+	XLogRedoAction action;
+
+	/* replay the record for old bucket */
+	action = XLogReadBufferForRedo(record, 0, &oldbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		oldpage;
+		HashPageOpaque oldopaque;
+
+		oldpage = BufferGetPage(oldbuf);
+		oldopaque = (HashPageOpaque) PageGetSpecialPointer(oldpage);
+
+		oldopaque->hasho_flag = xlrec->old_bucket_flag;
+
+		PageSetLSN(oldpage, lsn);
+		MarkBufferDirty(oldbuf);
+	}
+	if (BufferIsValid(oldbuf))
+		UnlockReleaseBuffer(oldbuf);
+
+	/* replay the record for new bucket */
+	action = XLogReadBufferForRedo(record, 1, &newbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		newpage;
+		HashPageOpaque nopaque;
+
+		newpage = BufferGetPage(newbuf);
+		nopaque = (HashPageOpaque) PageGetSpecialPointer(newpage);
+
+		nopaque->hasho_flag = xlrec->new_bucket_flag;
+
+		PageSetLSN(newpage, lsn);
+		MarkBufferDirty(newbuf);
+	}
+	if (BufferIsValid(newbuf))
+		UnlockReleaseBuffer(newbuf);
+}
+
+/*
+ * replay move of page contents for squeeze operation of hash index
+ */
+static void
+hash_xlog_move_page_contents(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_move_page_contents *xldata = (xl_hash_move_page_contents *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		writebuf = InvalidBuffer;
+	Buffer		deletebuf = InvalidBuffer;
+	XLogRedoAction action;
+
+	/*
+	 * Ensure to have a cleanup lock on primary bucket page before we start
+	 * with the actual replay operation.  This is to ensure that neither a
+	 * scan can start nor a scan can be already-in-progress during the replay
+	 * of this operation.  If we allow scans during this operation, then they
+	 * can miss some records or show the same record multiple times.
+	 */
+	if (xldata->is_prim_bucket_same_wrt)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &writebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bucketbuf))
+			LockBufferForCleanup(bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &writebuf);
+	}
+
+	/* replay the record for adding entries in overflow buffer */
+	if (action == BLK_NEEDS_REDO)
+	{
+		Page		writepage;
+		char	   *begin;
+		char	   *data;
+		Size		datalen;
+		uint16		ninserted = 0;
+
+		data = begin = XLogRecGetBlockData(record, 1, &datalen);
+
+		writepage = (Page) BufferGetPage(writebuf);
+
+		if (xldata->ntups > 0)
+		{
+			OffsetNumber *towrite = (OffsetNumber *) data;
+
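+			/* block data: ntups target offsets followed by the tuples themselves */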
+			data += sizeof(OffsetNumber) * xldata->ntups;
+
+			while (data - begin < datalen)
+			{
+				IndexTuple	itup = (IndexTuple) data;
+				Size		itemsz;
+				OffsetNumber l;
+
+				itemsz = IndexTupleDSize(*itup);
+				itemsz = MAXALIGN(itemsz);
+
+				data += itemsz;
+
+				l = PageAddItem(writepage, (Item) itup, itemsz, towrite[ninserted], false, false);
+				if (l == InvalidOffsetNumber)
+					elog(ERROR, "hash_xlog_move_page_contents: failed to add item to hash index page, size %d bytes",
+						 (int) itemsz);
+
+				ninserted++;
+			}
+		}
+
+		/*
+		 * number of tuples inserted must be same as requested in REDO record.
+		 */
+		Assert(ninserted == xldata->ntups);
+
+		PageSetLSN(writepage, lsn);
+		MarkBufferDirty(writebuf);
+	}
+
+	/* replay the record for deleting entries from overflow buffer */
+	if (XLogReadBufferForRedo(record, 2, &deletebuf) == BLK_NEEDS_REDO)
+	{
+		Page		page;
+		char	   *ptr;
+		Size		len;
+
+		ptr = XLogRecGetBlockData(record, 2, &len);
+
+		page = (Page) BufferGetPage(deletebuf);
+
+		if (len > 0)
+		{
+			OffsetNumber *unused;
+			OffsetNumber *unend;
+
+			unused = (OffsetNumber *) ptr;
+			unend = (OffsetNumber *) ((char *) ptr + len);
+
+			if ((unend - unused) > 0)
+				PageIndexMultiDelete(page, unused, unend - unused);
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(deletebuf);
+	}
+
+	/*
+	 * Replay is complete, now we can release the buffers. We release locks at
+	 * end of replay operation to ensure that we hold lock on primary bucket
+	 * page till end of operation.  We can optimize by releasing the lock on
+	 * write buffer as soon as the operation for same is complete, if it is
+	 * not same as primary bucket page, but that doesn't seem to be worth
+	 * complicating the code.
+	 */
+	if (BufferIsValid(deletebuf))
+		UnlockReleaseBuffer(deletebuf);
+
+	if (BufferIsValid(writebuf))
+		UnlockReleaseBuffer(writebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay squeeze page operation of hash index
+ */
+static void
+hash_xlog_squeeze_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_squeeze_page *xldata = (xl_hash_squeeze_page *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		writebuf;
+	Buffer		ovflbuf;
+	Buffer		prevbuf = InvalidBuffer;
+	Buffer		mapbuf;
+	XLogRedoAction action;
+
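+	/*
+	 * Block references used by this record: 0 is the primary bucket page
+	 * (used only when it differs from the write page), 1 is the page
+	 * receiving the moved tuples, 2 is the overflow page being freed, 3 is
+	 * its previous page (when different from the write page), 4 is its next
+	 * page, 5 is the bitmap page and 6 is the metapage.
+	 */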
+	/*
+	 * Ensure to have a cleanup lock on primary bucket page before we start
+	 * with the actual replay operation.  This is to ensure that neither a
+	 * scan can start nor a scan can be already-in-progress during the replay
+	 * of this operation.  If we allow scans during this operation, then they
+	 * can miss some records or show the same record multiple times.
+	 */
+	if (xldata->is_prim_bucket_same_wrt)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &writebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bucketbuf))
+			LockBufferForCleanup(bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &writebuf);
+	}
+
+	/* replay the record for adding entries in overflow buffer */
+	if (action == BLK_NEEDS_REDO)
+	{
+		Page		writepage;
+		char	   *begin;
+		char	   *data;
+		Size		datalen;
+		uint16		ninserted = 0;
+
+		data = begin = XLogRecGetBlockData(record, 1, &datalen);
+
+		writepage = (Page) BufferGetPage(writebuf);
+
+		if (xldata->ntups > 0)
+		{
+			OffsetNumber *towrite = (OffsetNumber *) data;
+
+			data += sizeof(OffsetNumber) * xldata->ntups;
+
+			while (data - begin < datalen)
+			{
+				IndexTuple	itup = (IndexTuple) data;
+				Size		itemsz;
+				OffsetNumber l;
+
+				itemsz = IndexTupleDSize(*itup);
+				itemsz = MAXALIGN(itemsz);
+
+				data += itemsz;
+
+				l = PageAddItem(writepage, (Item) itup, itemsz, towrite[ninserted], false, false);
+				if (l == InvalidOffsetNumber)
+					elog(ERROR, "hash_xlog_squeeze_page: failed to add item to hash index page, size %d bytes",
+						 (int) itemsz);
+
+				ninserted++;
+			}
+		}
+
+		/*
+		 * number of tuples inserted must be same as requested in REDO record.
+		 */
+		Assert(ninserted == xldata->ntups);
+
+		/*
+		 * if the page to which we are adding tuples is the page previous to
+		 * the freed overflow page, then update its nextblkno.
+		 */
+		if (xldata->is_prev_bucket_same_wrt)
+		{
+			HashPageOpaque writeopaque = (HashPageOpaque) PageGetSpecialPointer(writepage);
+
+			writeopaque->hasho_nextblkno = xldata->nextblkno;
+		}
+
+		PageSetLSN(writepage, lsn);
+		MarkBufferDirty(writebuf);
+	}
+
+	/* replay the record for initializing overflow buffer */
+	if (XLogReadBufferForRedo(record, 2, &ovflbuf) == BLK_NEEDS_REDO)
+	{
+		Page		ovflpage;
+
+		ovflpage = BufferGetPage(ovflbuf);
+
+		_hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));
+
+		PageSetLSN(ovflpage, lsn);
+		MarkBufferDirty(ovflbuf);
+	}
+	if (BufferIsValid(ovflbuf))
+		UnlockReleaseBuffer(ovflbuf);
+
+	/* replay the record for page previous to the freed overflow page */
+	if (!xldata->is_prev_bucket_same_wrt &&
+		XLogReadBufferForRedo(record, 3, &prevbuf) == BLK_NEEDS_REDO)
+	{
+		Page		prevpage = BufferGetPage(prevbuf);
+		HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+		prevopaque->hasho_nextblkno = xldata->nextblkno;
+
+		PageSetLSN(prevpage, lsn);
+		MarkBufferDirty(prevbuf);
+	}
+	if (BufferIsValid(prevbuf))
+		UnlockReleaseBuffer(prevbuf);
+
+	/* replay the record for page next to the freed overflow page */
+	if (XLogRecHasBlockRef(record, 4))
+	{
+		Buffer		nextbuf;
+
+		if (XLogReadBufferForRedo(record, 4, &nextbuf) == BLK_NEEDS_REDO)
+		{
+			Page		nextpage = BufferGetPage(nextbuf);
+			HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage);
+
+			nextopaque->hasho_prevblkno = xldata->prevblkno;
+
+			PageSetLSN(nextpage, lsn);
+			MarkBufferDirty(nextbuf);
+		}
+		if (BufferIsValid(nextbuf))
+			UnlockReleaseBuffer(nextbuf);
+	}
+
+	/* replay the record for bitmap page */
+	if (XLogReadBufferForRedo(record, 5, &mapbuf) == BLK_NEEDS_REDO)
+	{
+		Page		mappage = (Page) BufferGetPage(mapbuf);
+		uint32	   *freep = NULL;
+		char	   *data;
+		uint32	   *bitmap_page_bit;
+		Size		datalen;
+
+		freep = HashPageGetBitmap(mappage);
+
+		data = XLogRecGetBlockData(record, 5, &datalen);
+		bitmap_page_bit = (uint32 *) data;
+
+		CLRBIT(freep, *bitmap_page_bit);
+
+		PageSetLSN(mappage, lsn);
+		MarkBufferDirty(mapbuf);
+	}
+	if (BufferIsValid(mapbuf))
+		UnlockReleaseBuffer(mapbuf);
+
+	/* replay the record for meta page */
+	if (XLogRecHasBlockRef(record, 6))
+	{
+		Buffer		metabuf;
+
+		if (XLogReadBufferForRedo(record, 6, &metabuf) == BLK_NEEDS_REDO)
+		{
+			HashMetaPage metap;
+			Page		page;
+			char	   *data;
+			uint32	   *firstfree_ovflpage;
+			Size		datalen;
+
+			data = XLogRecGetBlockData(record, 6, &datalen);
+			firstfree_ovflpage = (uint32 *) data;
+
+			page = BufferGetPage(metabuf);
+			metap = HashPageGetMeta(page);
+			metap->hashm_firstfree = *firstfree_ovflpage;
+
+			PageSetLSN(page, lsn);
+			MarkBufferDirty(metabuf);
+		}
+		if (BufferIsValid(metabuf))
+			UnlockReleaseBuffer(metabuf);
+	}
+
+	/*
+	 * We release locks on writebuf and bucketbuf at end of replay operation
+	 * to ensure that we hold lock on primary bucket page till end of
+	 * operation.  We can optimize by releasing the lock on write buffer as
+	 * soon as the operation for same is complete, if it is not same as
+	 * primary bucket page, but that doesn't seem to be worth complicating the
+	 * code.
+	 */
+	if (BufferIsValid(writebuf))
+		UnlockReleaseBuffer(writebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay delete operation of hash index
+ */
+static void
+hash_xlog_delete(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_delete *xldata = (xl_hash_delete *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		deletebuf;
+	Page		page;
+	XLogRedoAction action;
+
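+	/*
+	 * Block references: 0 is the primary bucket page (registered only when
+	 * it differs from the page being cleaned, so that we can take a cleanup
+	 * lock on it), 1 is the page from which tuples are deleted.
+	 */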
+	/*
+	 * Ensure to have a cleanup lock on primary bucket page before we start
+	 * with the actual replay operation.  This is to ensure that neither a
+	 * scan can start nor a scan can be already-in-progress during the replay
+	 * of this operation.  If we allow scans during this operation, then they
+	 * can miss some records or show the same record multiple times.
+	 */
+	if (xldata->is_primary_bucket_page)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &deletebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bucketbuf))
+			LockBufferForCleanup(bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &deletebuf);
+	}
+
+	/* replay the record for deleting entries in bucket page */
+	if (action == BLK_NEEDS_REDO)
+	{
+		char	   *ptr;
+		Size		len;
+
+		ptr = XLogRecGetBlockData(record, 1, &len);
+
+		page = (Page) BufferGetPage(deletebuf);
+
+		if (len > 0)
+		{
+			OffsetNumber *unused;
+			OffsetNumber *unend;
+
+			unused = (OffsetNumber *) ptr;
+			unend = (OffsetNumber *) ((char *) ptr + len);
+
+			if ((unend - unused) > 0)
+				PageIndexMultiDelete(page, unused, unend - unused);
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(deletebuf);
+	}
+	if (BufferIsValid(deletebuf))
+		UnlockReleaseBuffer(deletebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay clear garbage flag operation for primary bucket page.
+ */
+static void
+hash_xlog_clear_garbage(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		HashPageOpaque bucket_opaque;
+
+		page = (Page) BufferGetPage(buffer);
+
+		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+		bucket_opaque->hasho_flag &= ~LH_BUCKET_PAGE_HAS_GARBAGE;
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
+
+/*
+ * replay for update meta page
+ */
+static void
+hash_xlog_update_meta_page(XLogReaderState *record)
+{
+	HashMetaPage metap;
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_update_meta_page *xldata = (xl_hash_update_meta_page *) XLogRecGetData(record);
+	Buffer		metabuf;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &metabuf) == BLK_NEEDS_REDO)
+	{
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+
+		metap->hashm_ntuples = xldata->ntuples;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+void
+hash_redo(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			hash_xlog_init_meta_page(record);
+			break;
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			hash_xlog_init_bitmap_page(record);
+			break;
+		case XLOG_HASH_INSERT:
+			hash_xlog_insert(record);
+			break;
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			hash_xlog_addovflpage(record);
+			break;
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			hash_xlog_split_allocpage(record);
+			break;
+		case XLOG_HASH_SPLIT_PAGE:
+			hash_xlog_split_page(record);
+			break;
+		case XLOG_HASH_SPLIT_COMPLETE:
+			hash_xlog_split_complete(record);
+			break;
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			hash_xlog_move_page_contents(record);
+			break;
+		case XLOG_HASH_SQUEEZE_PAGE:
+			hash_xlog_squeeze_page(record);
+			break;
+		case XLOG_HASH_DELETE:
+			hash_xlog_delete(record);
+			break;
+		case XLOG_HASH_CLEAR_GARBAGE:
+			hash_xlog_clear_garbage(record);
+			break;
+		case XLOG_HASH_UPDATE_META_PAGE:
+			hash_xlog_update_meta_page(record);
+			break;
+		default:
+			elog(PANIC, "hash_redo: unknown op code %u", info);
+	}
+}
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 5cfd0aa..3514138 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -16,6 +16,8 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "access/hash_xlog.h"
+#include "miscadmin.h"
 #include "utils/rel.h"
 
 
@@ -44,6 +46,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	uint32		maxbucket;
 	uint32		highmask;
 	uint32		lowmask;
+	OffsetNumber itup_off;
 
 	/*
 	 * Get the hash key for the item (it's stored in the index tuple itself).
@@ -248,31 +251,63 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 		Assert(pageopaque->hasho_bucket == bucket);
 	}
 
-	/* found page with enough space, so add the item here */
-	(void) _hash_pgaddtup(rel, buf, itemsz, itup);
-
-	/*
-	 * write and release the modified page and ensure to release the pin on
-	 * primary page.
-	 */
-	_hash_wrtbuf(rel, buf);
-	if (buf != bucket_buf)
-		_hash_dropbuf(rel, bucket_buf);
-
 	/*
 	 * Write-lock the metapage so we can increment the tuple count. After
 	 * incrementing it, check to see if it's time for a split.
 	 */
 	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
 
+	/* Do the update.  No ereport(ERROR) until changes are logged */
+	START_CRIT_SECTION();
+
+	/* found page with enough space, so add the item here */
+	itup_off = _hash_pgaddtup(rel, buf, itemsz, itup);
+
+	MarkBufferDirty(buf);
+
+	/* metapage operations */
 	metap->hashm_ntuples += 1;
 
 	/* Make sure this stays in sync with _hash_expandtable() */
 	do_expand = metap->hashm_ntuples >
 		(double) metap->hashm_ffactor * (metap->hashm_maxbucket + 1);
 
-	/* Write out the metapage and drop lock, but keep pin */
-	_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
+	MarkBufferDirty(metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_insert xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.offnum = itup_off;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInsert);
+
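+		/* register the metapage so the ntuples increment can be replayed */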
+		XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);
+
+		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+		XLogRegisterBufData(0, (char *) itup, IndexTupleDSize(*itup));
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INSERT);
+
+		PageSetLSN(BufferGetPage(buf), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	/* drop lock on metapage, but keep pin */
+	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+	/*
+	 * Release the modified page and ensure to release the pin on primary
+	 * page.
+	 */
+	_hash_relbuf(rel, buf);
+	if (buf != bucket_buf)
+		_hash_dropbuf(rel, bucket_buf);
 
 	/* Attempt to split if a split is needed */
 	if (do_expand)
@@ -314,3 +349,44 @@ _hash_pgaddtup(Relation rel, Buffer buf, Size itemsize, IndexTuple itup)
 
 	return itup_off;
 }
+
+/*
+ *	_hash_pgaddmultitup() -- add a tuple vector to a particular page in the
+ *							 index.
+ *
+ * This routine has the same requirements for locking and tuple ordering as
+ * _hash_pgaddtup().
+ *
+ * The offset numbers at which the tuples are inserted are returned in the
+ * caller-supplied itup_offsets array.
+ */
+void
+_hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
+					OffsetNumber *itup_offsets, uint16 nitups)
+{
+	OffsetNumber itup_off;
+	Page		page;
+	uint32		hashkey;
+	int			i;
+
+	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+	page = BufferGetPage(buf);
+
+	for (i = 0; i < nitups; i++)
+	{
+		Size		itemsize;
+
+		itemsize = IndexTupleDSize(*itups[i]);
+		itemsize = MAXALIGN(itemsize);
+
+		/* Find where to insert the tuple (preserving page's hashkey ordering) */
+		hashkey = _hash_get_indextuple_hashkey(itups[i]);
+		itup_off = _hash_binsearch(page, hashkey);
+
+		itup_offsets[i] = itup_off;
+
+		if (PageAddItem(page, (Item) itups[i], itemsize, itup_off, false, false)
+			== InvalidOffsetNumber)
+			elog(ERROR, "failed to add index item to \"%s\"",
+				 RelationGetRelationName(rel));
+	}
+}
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 760563a..6292e6c 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -18,10 +18,11 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "access/hash_xlog.h"
+#include "miscadmin.h"
 #include "utils/rel.h"
 
 
-static Buffer _hash_getovflpage(Relation rel, Buffer metabuf);
 static uint32 _hash_firstfreebit(uint32 map);
 
 
@@ -84,7 +85,9 @@ blkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
  *	dropped before exiting (we assume the caller is not interested in 'buf'
  *	anymore) if not asked to retain.  The pin will be retained only for the
  *	primary bucket.  The returned overflow page will be pinned and
- *	write-locked; it is guaranteed to be empty.
+ *	write-locked; it is not guaranteed to be empty, as we release and
+ *	reacquire the lock on it, so the caller must ensure that it has the
+ *	required space in the page.
  *
  *	The caller must hold a pin, but no lock, on the metapage buffer.
  *	That buffer is returned in the same state.
@@ -102,13 +105,37 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 	Page		ovflpage;
 	HashPageOpaque pageopaque;
 	HashPageOpaque ovflopaque;
-
-	/* allocate and lock an empty overflow page */
-	ovflbuf = _hash_getovflpage(rel, metabuf);
+	HashMetaPage metap;
+	Buffer		mapbuf = InvalidBuffer;
+	Buffer		newmapbuf = InvalidBuffer;
+	BlockNumber blkno;
+	uint32		orig_firstfree;
+	uint32		splitnum;
+	uint32	   *freep = NULL;
+	uint32		max_ovflpg;
+	uint32		bit;
+	uint32		bitmap_page_bit;
+	uint32		first_page;
+	uint32		last_bit;
+	uint32		last_page;
+	uint32		i,
+				j;
+	bool		page_found = false;
 
 	/*
-	 * Write-lock the tail page.  It is okay to hold two buffer locks here
-	 * since there cannot be anyone else contending for access to ovflbuf.
+	 * Write-lock the tail page.  Here, we need to maintain the following
+	 * locking order: first acquire the lock on the tail page of the bucket,
+	 * then on the meta page to find and lock the bitmap page; once the bitmap
+	 * page is found, release the lock on the meta page, and finally acquire
+	 * the lock on the new overflow buffer.
+	 * We need this locking order to avoid deadlock with backends that are
+	 * doing inserts.
+	 *
+	 * Note: We could have avoided locking many buffers here if we made two
+	 * WAL records for acquiring an overflow page (one to allocate an overflow
+	 * page and another to add it to the overflow bucket chain).  However, doing
+	 * so can leak an overflow page, if the system crashes after allocation.
+	 * Needless to say, it is better to have a single record from a
+	 * performance point of view as well.
 	 */
 	_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_WRITE);
 
@@ -136,55 +163,6 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 		buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
 	}
 
-	/* now that we have correct backlink, initialize new overflow page */
-	ovflpage = BufferGetPage(ovflbuf);
-	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
-	ovflopaque->hasho_prevblkno = BufferGetBlockNumber(buf);
-	ovflopaque->hasho_nextblkno = InvalidBlockNumber;
-	ovflopaque->hasho_bucket = pageopaque->hasho_bucket;
-	ovflopaque->hasho_flag = LH_OVERFLOW_PAGE;
-	ovflopaque->hasho_page_id = HASHO_PAGE_ID;
-
-	MarkBufferDirty(ovflbuf);
-
-	/* logically chain overflow page to previous page */
-	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
-	if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
-		_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
-	else
-		_hash_wrtbuf(rel, buf);
-
-	return ovflbuf;
-}
-
-/*
- *	_hash_getovflpage()
- *
- *	Find an available overflow page and return it.  The returned buffer
- *	is pinned and write-locked, and has had _hash_pageinit() applied,
- *	but it is caller's responsibility to fill the special space.
- *
- * The caller must hold a pin, but no lock, on the metapage buffer.
- * That buffer is left in the same state at exit.
- */
-static Buffer
-_hash_getovflpage(Relation rel, Buffer metabuf)
-{
-	HashMetaPage metap;
-	Buffer		mapbuf = 0;
-	Buffer		newbuf;
-	BlockNumber blkno;
-	uint32		orig_firstfree;
-	uint32		splitnum;
-	uint32	   *freep = NULL;
-	uint32		max_ovflpg;
-	uint32		bit;
-	uint32		first_page;
-	uint32		last_bit;
-	uint32		last_page;
-	uint32		i,
-				j;
-
 	/* Get exclusive lock on the meta page */
 	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
 
@@ -233,11 +211,31 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 		for (; bit <= last_inpage; j++, bit += BITS_PER_MAP)
 		{
 			if (freep[j] != ALL_SET)
+			{
+				page_found = true;
+
+				/* Reacquire exclusive lock on the meta page */
+				_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
+
+				/* convert bit to bit number within page */
+				bit += _hash_firstfreebit(freep[j]);
+				bitmap_page_bit = bit;
+
+				/* convert bit to absolute bit number */
+				bit += (i << BMPG_SHIFT(metap));
+				/* Calculate address of the recycled overflow page */
+				blkno = bitno_to_blkno(metap, bit);
+
+				/* Fetch and init the recycled page */
+				ovflbuf = _hash_getinitbuf(rel, blkno);
+
 				goto found;
+			}
 		}
 
 		/* No free space here, try to advance to next map page */
 		_hash_relbuf(rel, mapbuf);
+		mapbuf = InvalidBuffer;
 		i++;
 		j = 0;					/* scan from start of next map page */
 		bit = 0;
@@ -261,8 +259,15 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 		 * convenient to pre-mark them as "in use" too.
 		 */
 		bit = metap->hashm_spares[splitnum];
-		_hash_initbitmap(rel, metap, bitno_to_blkno(metap, bit), MAIN_FORKNUM);
-		metap->hashm_spares[splitnum]++;
+		newmapbuf = _hash_getnewbuf(rel, bitno_to_blkno(metap, bit), MAIN_FORKNUM);
+
+		/* add the new bitmap page to the metapage's list of bitmaps */
+		/* metapage already has a write lock */
+		if (metap->hashm_nmaps >= HASH_MAX_BITMAPS)
+			ereport(ERROR,
+					(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+					 errmsg("out of overflow pages in hash index \"%s\"",
+							RelationGetRelationName(rel))));
 	}
 	else
 	{
@@ -273,7 +278,8 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 	}
 
 	/* Calculate address of the new overflow page */
-	bit = metap->hashm_spares[splitnum];
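+	/*
+	 * If a new bitmap page was allocated above, it occupies the bit that
+	 * hashm_spares[splitnum] currently points at, so the new overflow page
+	 * gets the next bit.
+	 */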
+	bit = BufferIsValid(newmapbuf) ?
+		metap->hashm_spares[splitnum] + 1 : metap->hashm_spares[splitnum];
 	blkno = bitno_to_blkno(metap, bit);
 
 	/*
@@ -281,39 +287,51 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 	 * relation length stays in sync with ours.  XXX It's annoying to do this
 	 * with metapage write lock held; would be better to use a lock that
 	 * doesn't block incoming searches.
+	 *
+	 * It is okay to hold two buffer locks here (one on tail page of bucket
+	 * and other on new overflow page) since there cannot be anyone else
+	 * contending for access to ovflbuf.
 	 */
-	newbuf = _hash_getnewbuf(rel, blkno, MAIN_FORKNUM);
+	ovflbuf = _hash_getnewbuf(rel, blkno, MAIN_FORKNUM);
 
-	metap->hashm_spares[splitnum]++;
+found:
 
 	/*
-	 * Adjust hashm_firstfree to avoid redundant searches.  But don't risk
-	 * changing it if someone moved it while we were searching bitmap pages.
+	 * Do the update.  No ereport(ERROR) until changes are logged.  We want
+	 * to log the changes for the bitmap page and the overflow page together,
+	 * to avoid losing a page in case a new page had to be added for either
+	 * of them.
 	 */
-	if (metap->hashm_firstfree == orig_firstfree)
-		metap->hashm_firstfree = bit + 1;
-
-	/* Write updated metapage and release lock, but not pin */
-	_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
-
-	return newbuf;
+	START_CRIT_SECTION();
 
-found:
-	/* convert bit to bit number within page */
-	bit += _hash_firstfreebit(freep[j]);
+	if (page_found)
+	{
+		Assert(BufferIsValid(mapbuf));
 
-	/* mark page "in use" in the bitmap */
-	SETBIT(freep, bit);
-	_hash_wrtbuf(rel, mapbuf);
+		/* mark page "in use" in the bitmap */
+		SETBIT(freep, bitmap_page_bit);
+		MarkBufferDirty(mapbuf);
+	}
+	else
+	{
+		/* update the count to indicate new overflow page is added */
+		metap->hashm_spares[splitnum]++;
 
-	/* Reacquire exclusive lock on the meta page */
-	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
+		if (BufferIsValid(newmapbuf))
+		{
+			_hash_initbitmapbuffer(newmapbuf, metap->hashm_bmsize, false);
+			MarkBufferDirty(newmapbuf);
 
-	/* convert bit to absolute bit number */
-	bit += (i << BMPG_SHIFT(metap));
+			metap->hashm_mapp[metap->hashm_nmaps] = BufferGetBlockNumber(newmapbuf);
+			metap->hashm_nmaps++;
+			metap->hashm_spares[splitnum]++;
+			MarkBufferDirty(metabuf);
+		}
 
-	/* Calculate address of the recycled overflow page */
-	blkno = bitno_to_blkno(metap, bit);
+		/*
+		 * for new overflow page, we don't need to explicitly set the bit in
+		 * bitmap page, as by default that will be set to "in use".
+		 */
+	}
 
 	/*
 	 * Adjust hashm_firstfree to avoid redundant searches.  But don't risk
@@ -322,18 +340,101 @@ found:
 	if (metap->hashm_firstfree == orig_firstfree)
 	{
 		metap->hashm_firstfree = bit + 1;
-
-		/* Write updated metapage and release lock, but not pin */
-		_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
+		MarkBufferDirty(metabuf);
 	}
-	else
+
+	/* now that we have correct backlink, initialize new overflow page */
+	ovflpage = BufferGetPage(ovflbuf);
+	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
+	ovflopaque->hasho_prevblkno = BufferGetBlockNumber(buf);
+	ovflopaque->hasho_nextblkno = InvalidBlockNumber;
+	ovflopaque->hasho_bucket = pageopaque->hasho_bucket;
+	ovflopaque->hasho_flag = LH_OVERFLOW_PAGE;
+	ovflopaque->hasho_page_id = HASHO_PAGE_ID;
+
+	MarkBufferDirty(ovflbuf);
+
+	/* logically chain overflow page to previous page */
+	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
+
+	MarkBufferDirty(buf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
 	{
-		/* We didn't change the metapage, so no need to write */
-		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+		XLogRecPtr	recptr;
+		xl_hash_addovflpage xlrec;
+
+		xlrec.bmpage_found = page_found;
+		xlrec.bmsize = metap->hashm_bmsize;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashAddOvflPage);
+
+		XLogRegisterBuffer(0, ovflbuf, REGBUF_WILL_INIT);
+		XLogRegisterBufData(0, (char *) &pageopaque->hasho_bucket, sizeof(Bucket));
+
+		XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
+
+		if (BufferIsValid(mapbuf))
+		{
+			/*
+			 * Since the bitmap page doesn't have a standard page layout, this
+			 * will allow us to log the data.
+			 */
+			XLogRegisterBuffer(2, mapbuf, REGBUF_STANDARD);
+			XLogRegisterBufData(2, (char *) &bitmap_page_bit, sizeof(uint32));
+		}
+
+		if (BufferIsValid(newmapbuf))
+			XLogRegisterBuffer(3, newmapbuf, REGBUF_WILL_INIT);
+
+		/*
+		 * To replay the meta page changes, we could log the entire metapage
+		 * (which doesn't seem advisable considering the size of the hash
+		 * metapage), or we could log the individual updated values, but we
+		 * prefer to perform the same operations on the metapage during replay
+		 * as are done during the actual operation.  That looks straightforward
+		 * and has the advantage of using less space in WAL.
+		 */
+		XLogRegisterBuffer(4, metabuf, REGBUF_STANDARD);
+		XLogRegisterBufData(4, (char *) &metap->hashm_firstfree, sizeof(uint32));
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_ADD_OVFL_PAGE);
+
+		PageSetLSN(BufferGetPage(ovflbuf), recptr);
+		PageSetLSN(BufferGetPage(buf), recptr);
+
+		if (BufferIsValid(mapbuf))
+			PageSetLSN(BufferGetPage(mapbuf), recptr);
+
+		if (BufferIsValid(newmapbuf))
+			PageSetLSN(BufferGetPage(newmapbuf), recptr);
 	}
 
-	/* Fetch, init, and return the recycled page */
-	return _hash_getinitbuf(rel, blkno);
+	END_CRIT_SECTION();
+
+	if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+		_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+	else
+		_hash_relbuf(rel, buf);
+
+	if (BufferIsValid(mapbuf))
+		_hash_relbuf(rel, mapbuf);
+
+	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+	if (BufferIsValid(newmapbuf))
+		_hash_relbuf(rel, newmapbuf);
+
+	/*
+	 * we need to release and reacquire the lock on the overflow buffer to
+	 * ensure that a standby doesn't see an intermediate state of it.
+	 */
+	_hash_chgbufaccess(rel, ovflbuf, HASH_READ, HASH_NOLOCK);
+	_hash_chgbufaccess(rel, ovflbuf, HASH_NOLOCK, HASH_WRITE);
+
+	return ovflbuf;
 }
 
 /*
@@ -366,6 +467,12 @@ _hash_firstfreebit(uint32 map)
  *	Remove this overflow page from its bucket's chain, and mark the page as
  *	free.  On entry, ovflbuf is write-locked; it is released before exiting.
  *
+ *	Add the tuples (itups) to wbuf in this function; we could do that in the
+ *	caller as well, but the advantage of doing it here is that we can easily
+ *	write the WAL for the XLOG_HASH_SQUEEZE_PAGE operation.  The addition of
+ *	tuples and the removal of the overflow page have to be done as one atomic
+ *	operation, otherwise during replay on a standby, users might find
+ *	duplicate records.
+ *
  *	Since this function is invoked in VACUUM, we provide an access strategy
  *	parameter that controls fetches of the bucket pages.
  *
@@ -377,7 +484,9 @@ _hash_firstfreebit(uint32 map)
  *	better hold cleanup lock on the primary bucket.
  */
 BlockNumber
-_hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
+_hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
+				   Buffer wbuf, IndexTuple *itups, OffsetNumber *itup_offsets,
+				   Size *tups_size, uint16 nitups,
 				   BufferAccessStrategy bstrategy)
 {
 	HashMetaPage metap;
@@ -387,6 +496,7 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
 	BlockNumber prevblkno;
 	BlockNumber blkno;
 	BlockNumber nextblkno;
+	BlockNumber writeblkno;
 	HashPageOpaque ovflopaque;
 	Page		ovflpage;
 	Page		mappage;
@@ -395,6 +505,10 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
 	int32		bitmappage,
 				bitmapbit;
 	Bucket bucket PG_USED_FOR_ASSERTS_ONLY;
+	Buffer		prevbuf = InvalidBuffer;
+	Buffer		nextbuf = InvalidBuffer;
+	BlockNumber bucketblkno;
+	bool		update_metap = false;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -403,69 +517,37 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
 	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
 	nextblkno = ovflopaque->hasho_nextblkno;
 	prevblkno = ovflopaque->hasho_prevblkno;
+	writeblkno = BufferGetBlockNumber(wbuf);
 	bucket = ovflopaque->hasho_bucket;
-
-	/*
-	 * Zero the page for debugging's sake; then write and release it. (Note:
-	 * if we failed to zero the page here, we'd have problems with the Assert
-	 * in _hash_pageinit() when the page is reused.)
-	 */
-	MemSet(ovflpage, 0, BufferGetPageSize(ovflbuf));
-	_hash_wrtbuf(rel, ovflbuf);
+	bucketblkno = BufferGetBlockNumber(bucketbuf);
 
 	/*
 	 * Fix up the bucket chain.  this is a doubly-linked list, so we must fix
 	 * up the bucket chain members behind and ahead of the overflow page being
 	 * deleted.  No concurrency issues since we hold the cleanup lock on
 	 * primary bucket.  We don't need to aqcuire buffer lock to fix the
-	 * primary bucket, as we already have that lock.
+	 * primary bucket, or the previous bucket when it is the same as the write
+	 * bucket, as we already hold locks on those buckets.
 	 */
 	if (BlockNumberIsValid(prevblkno))
 	{
-		if (prevblkno == bucket_blkno)
-		{
-			Buffer		prevbuf = ReadBufferExtended(rel, MAIN_FORKNUM,
-													 prevblkno,
-													 RBM_NORMAL,
-													 bstrategy);
-
-			Page		prevpage = BufferGetPage(prevbuf);
-			HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
-
-			Assert(prevopaque->hasho_bucket == bucket);
-			prevopaque->hasho_nextblkno = nextblkno;
-			MarkBufferDirty(prevbuf);
-			ReleaseBuffer(prevbuf);
-		}
+		if (prevblkno == bucketblkno)
+			prevbuf = bucketbuf;
+		else if (prevblkno == writeblkno)
+			prevbuf = wbuf;
 		else
-		{
-			Buffer		prevbuf = _hash_getbuf_with_strategy(rel,
-															 prevblkno,
-															 HASH_WRITE,
-										   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
-															 bstrategy);
-			Page		prevpage = BufferGetPage(prevbuf);
-			HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
-
-			Assert(prevopaque->hasho_bucket == bucket);
-			prevopaque->hasho_nextblkno = nextblkno;
-			_hash_wrtbuf(rel, prevbuf);
-		}
+			prevbuf = _hash_getbuf_with_strategy(rel,
+												 prevblkno,
+												 HASH_WRITE,
+												 LH_OVERFLOW_PAGE,
+												 bstrategy);
 	}
 	if (BlockNumberIsValid(nextblkno))
-	{
-		Buffer		nextbuf = _hash_getbuf_with_strategy(rel,
-														 nextblkno,
-														 HASH_WRITE,
-														 LH_OVERFLOW_PAGE,
-														 bstrategy);
-		Page		nextpage = BufferGetPage(nextbuf);
-		HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage);
-
-		Assert(nextopaque->hasho_bucket == bucket);
-		nextopaque->hasho_prevblkno = prevblkno;
-		_hash_wrtbuf(rel, nextbuf);
-	}
+		nextbuf = _hash_getbuf_with_strategy(rel,
+											 nextblkno,
+											 HASH_WRITE,
+											 LH_OVERFLOW_PAGE,
+											 bstrategy);
 
 	/* Note: bstrategy is intentionally not used for metapage and bitmap */
 
@@ -491,60 +573,187 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, BlockNumber bucket_blkno,
 	mappage = BufferGetPage(mapbuf);
 	freep = HashPageGetBitmap(mappage);
 	Assert(ISSET(freep, bitmapbit));
-	CLRBIT(freep, bitmapbit);
-	_hash_wrtbuf(rel, mapbuf);
 
 	/* Get write-lock on metapage to update firstfree */
 	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
 
+	/* This operation needs to log multiple tuples, so prepare WAL for that */
+	if (RelationNeedsWAL(rel))
+		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, XLR_NORMAL_RDATAS + nitups);
+
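+	/*
+	 * From here on, the page modifications and the WAL record insertion must
+	 * be atomic; any memory allocation (such as enlarging the record space
+	 * above) has to happen before entering the critical section.
+	 */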
+	START_CRIT_SECTION();
+
+	/*
+	 * we have to insert tuples on the "write" page, being careful to preserve
+	 * hashkey ordering.  (If we insert many tuples into the same "write" page
+	 * it would be worth qsort'ing them).
+	 */
+	if (nitups > 0)
+	{
+		_hash_pgaddmultitup(rel, wbuf, itups, itup_offsets, nitups);
+		MarkBufferDirty(wbuf);
+	}
+
+	/*
+	 * Initialize the freed overflow page.  We can't leave the page completely
+	 * zeroed, because WAL replay routines expect pages to be initialized.
+	 * See explanation of RBM_NORMAL mode atop XLogReadBufferExtended.
+	 */
+	_hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));
+	MarkBufferDirty(ovflbuf);
+
+	if (BufferIsValid(prevbuf))
+	{
+		Page		prevpage = BufferGetPage(prevbuf);
+		HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+		Assert(prevopaque->hasho_bucket == bucket);
+		prevopaque->hasho_nextblkno = nextblkno;
+		MarkBufferDirty(prevbuf);
+	}
+
+	if (BufferIsValid(nextbuf))
+	{
+		Page		nextpage = BufferGetPage(nextbuf);
+		HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage);
+
+		Assert(nextopaque->hasho_bucket == bucket);
+		nextopaque->hasho_prevblkno = prevblkno;
+		MarkBufferDirty(nextbuf);
+	}
+
+	CLRBIT(freep, bitmapbit);
+	MarkBufferDirty(mapbuf);
+
 	/* if this is now the first free page, update hashm_firstfree */
 	if (ovflbitno < metap->hashm_firstfree)
 	{
 		metap->hashm_firstfree = ovflbitno;
-		_hash_wrtbuf(rel, metabuf);
+		update_metap = true;
+		MarkBufferDirty(metabuf);
 	}
-	else
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
 	{
-		/* no need to change metapage */
-		_hash_relbuf(rel, metabuf);
+		xl_hash_squeeze_page xlrec;
+		XLogRecPtr	recptr;
+		int			i;
+
+		xlrec.prevblkno = prevblkno;
+		xlrec.nextblkno = nextblkno;
+		xlrec.ntups = nitups;
+		xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf) ? true : false;
+		xlrec.is_prev_bucket_same_wrt = (wbuf == prevbuf) ? true : false;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashSqueezePage);
+
+		/*
+		 * bucket buffer needs to be registered to ensure that we can acquire
+		 * a cleanup lock on it during replay.
+		 */
+		if (!xlrec.is_prim_bucket_same_wrt)
+			XLogRegisterBuffer(0, bucketbuf, REGBUF_STANDARD);
+
+		XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
+		if (xlrec.ntups > 0)
+		{
+			XLogRegisterBufData(1, (char *) itup_offsets,
+								nitups * sizeof(OffsetNumber));
+			for (i = 0; i < nitups; i++)
+				XLogRegisterBufData(1, (char *) itups[i], tups_size[i]);
+		}
+
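+		/*
+		 * The freed overflow page was already reset to an empty page above;
+		 * registering it lets replay perform the same reset.
+		 */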
+		XLogRegisterBuffer(2, ovflbuf, REGBUF_STANDARD);
+
+		/*
+		 * If the prev page and the write page (the page to which we are
+		 * moving tuples from the overflow page) are the same, there is no
+		 * need to register the prev page separately.  During replay, we can
+		 * directly update the next-block pointer in the write page.
+		 */
+		if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt)
+			XLogRegisterBuffer(3, prevbuf, REGBUF_STANDARD);
+
+		if (BufferIsValid(nextbuf))
+			XLogRegisterBuffer(4, nextbuf, REGBUF_STANDARD);
+
+		XLogRegisterBuffer(5, mapbuf, REGBUF_STANDARD);
+		XLogRegisterBufData(5, (char *) &bitmapbit, sizeof(uint32));
+
+		if (update_metap)
+		{
+			XLogRegisterBuffer(6, metabuf, REGBUF_STANDARD);
+			XLogRegisterBufData(6, (char *) &metap->hashm_firstfree, sizeof(uint32));
+		}
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SQUEEZE_PAGE);
+
+		PageSetLSN(BufferGetPage(wbuf), recptr);
+		PageSetLSN(BufferGetPage(ovflbuf), recptr);
+
+		if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt)
+			PageSetLSN(BufferGetPage(prevbuf), recptr);
+		if (BufferIsValid(nextbuf))
+			PageSetLSN(BufferGetPage(nextbuf), recptr);
+
+		PageSetLSN(BufferGetPage(mapbuf), recptr);
+
+		if (update_metap)
+			PageSetLSN(BufferGetPage(metabuf), recptr);
 	}
 
+	END_CRIT_SECTION();
+
+	/*
+	 * Release the lock; the caller will decide whether to release the pin or
+	 * reacquire the lock.  The caller is responsible for locking and
+	 * unlocking the bucket buffer.
+	 */
+	if (BufferGetBlockNumber(wbuf) != bucketblkno)
+		_hash_chgbufaccess(rel, wbuf, HASH_READ, HASH_NOLOCK);
+
+	if (BufferIsValid(ovflbuf))
+		_hash_relbuf(rel, ovflbuf);
+
+	/* release the previous bucket if it is not the primary or write bucket */
+	if (BufferIsValid(prevbuf) &&
+		prevblkno != bucketblkno &&
+		prevblkno != writeblkno)
+		_hash_relbuf(rel, prevbuf);
+
+	if (BufferIsValid(nextbuf))
+		_hash_relbuf(rel, nextbuf);
+
+	if (BufferIsValid(mapbuf))
+		_hash_relbuf(rel, mapbuf);
+
+	if (BufferIsValid(metabuf))
+		_hash_relbuf(rel, metabuf);
+
 	return nextblkno;
 }
 
-
 /*
- *	_hash_initbitmap()
- *
- *	 Initialize a new bitmap page.  The metapage has a write-lock upon
- *	 entering the function, and must be written by caller after return.
- *
- * 'blkno' is the block number of the new bitmap page.
+ *	_hash_initbitmapbuffer()
  *
- * All bits in the new bitmap page are set to "1", indicating "in use".
+ *	 Initialize a new bitmap page.  All bits in the new bitmap page are set to
+ *	 "1", indicating "in use".
  */
 void
-_hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
-				 ForkNumber forkNum)
+_hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage)
 {
-	Buffer		buf;
 	Page		pg;
 	HashPageOpaque op;
 	uint32	   *freep;
 
-	/*
-	 * It is okay to write-lock the new bitmap page while holding metapage
-	 * write lock, because no one else could be contending for the new page.
-	 * Also, the metapage lock makes it safe to extend the index using
-	 * _hash_getnewbuf.
-	 *
-	 * There is some loss of concurrency in possibly doing I/O for the new
-	 * page while holding the metapage lock, but this path is taken so seldom
-	 * that it's not worth worrying about.
-	 */
-	buf = _hash_getnewbuf(rel, blkno, forkNum);
 	pg = BufferGetPage(buf);
 
+	/* initialize the page */
+	if (initpage)
+		_hash_pageinit(pg, BufferGetPageSize(buf));
+
 	/* initialize the page's special space */
 	op = (HashPageOpaque) PageGetSpecialPointer(pg);
 	op->hasho_prevblkno = InvalidBlockNumber;
@@ -555,25 +764,16 @@ _hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
 
 	/* set all of the bits to 1 */
 	freep = HashPageGetBitmap(pg);
-	MemSet(freep, 0xFF, BMPGSZ_BYTE(metap));
+	MemSet(freep, 0xFF, bmsize);
 
-	/* write out the new bitmap page (releasing write lock and pin) */
-	_hash_wrtbuf(rel, buf);
-
-	/* add the new bitmap page to the metapage's list of bitmaps */
-	/* metapage already has a write lock */
-	if (metap->hashm_nmaps >= HASH_MAX_BITMAPS)
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("out of overflow pages in hash index \"%s\"",
-						RelationGetRelationName(rel))));
-
-	metap->hashm_mapp[metap->hashm_nmaps] = blkno;
-
-	metap->hashm_nmaps++;
+	/*
+	 * Set pd_lower just past the end of the bitmap page data.  We could even
+	 * set pd_lower equal to pd_upper, but this is more precise and makes the
+	 * page look compressible to xlog.c.
+	 */
+	((PageHeader) pg)->pd_lower = ((char *) freep + bmsize) - (char *) pg;
 }
 
-
 /*
  *	_hash_squeezebucket(rel, bucket)
  *
@@ -613,8 +813,6 @@ _hash_squeezebucket(Relation rel,
 	Page		rpage;
 	HashPageOpaque wopaque;
 	HashPageOpaque ropaque;
-	bool		wbuf_dirty;
-	bool		release_buf = false;
 
 	/*
 	 * start squeezing into the base bucket page.
@@ -657,23 +855,30 @@ _hash_squeezebucket(Relation rel,
 	/*
 	 * squeeze the tuples.
 	 */
-	wbuf_dirty = false;
 	for (;;)
 	{
 		OffsetNumber roffnum;
 		OffsetNumber maxroffnum;
 		OffsetNumber deletable[MaxOffsetNumber];
-		int			ndeletable = 0;
-
+		IndexTuple	itups[MaxIndexTuplesPerPage];
+		Size		tups_size[MaxIndexTuplesPerPage];
+		IndexTuple	itup;
+		OffsetNumber *itup_offsets;
+		uint16		ndeletable = 0;
+		uint16		nitups = 0;
+		Size		all_tups_size = 0;
+		Size		itemsz;
+		int			i;
+
+		itup_offsets = (OffsetNumber *) palloc(MaxIndexTuplesPerPage * sizeof(OffsetNumber));
+
+readpage:
 		/* Scan each tuple in "read" page */
 		maxroffnum = PageGetMaxOffsetNumber(rpage);
 		for (roffnum = FirstOffsetNumber;
 			 roffnum <= maxroffnum;
 			 roffnum = OffsetNumberNext(roffnum))
 		{
-			IndexTuple	itup;
-			Size		itemsz;
-
 			itup = (IndexTuple) PageGetItem(rpage,
 											PageGetItemId(rpage, roffnum));
 			itemsz = IndexTupleDSize(*itup);
@@ -681,39 +886,113 @@ _hash_squeezebucket(Relation rel,
 
 			/*
 			 * Walk up the bucket chain, looking for a page big enough for
-			 * this item.  Exit if we reach the read page.
+			 * this item and all other accumulated items.  Exit if we reach
+			 * the read page.
 			 */
-			while (PageGetFreeSpace(wpage) < itemsz)
+			while (PageGetFreeSpaceForMulTups(wpage, nitups + 1) < (all_tups_size + itemsz))
 			{
+				bool		release_wbuf = false;
+				bool		tups_moved = false;
+
 				Assert(!PageIsEmpty(wpage));
 
+				 * The caller is responsible for locking and unlocking the
+				 * bucket buffer.
+				 * buffer
+				 */
 				if (wblkno != bucket_blkno)
-					release_buf = true;
+					release_wbuf = true;
 
 				wblkno = wopaque->hasho_nextblkno;
 				Assert(BlockNumberIsValid(wblkno));
 
-				if (wbuf_dirty && release_buf)
-					_hash_wrtbuf(rel, wbuf);
-				else if (wbuf_dirty)
+				if (nitups > 0)
+				{
+					Assert(nitups == ndeletable);
+
+					/*
+					 * This operation needs to log multiple tuples, so
+					 * prepare WAL for that.
+					 */
+					if (RelationNeedsWAL(rel))
+						XLogEnsureRecordSpace(0, XLR_NORMAL_RDATAS + nitups);
+
+					START_CRIT_SECTION();
+
+					/*
+					 * we have to insert tuples on the "write" page, being
+					 * careful to preserve hashkey ordering.  (If we insert
+					 * many tuples into the same "write" page it would be
+					 * worth qsort'ing them).
+					 */
+					_hash_pgaddmultitup(rel, wbuf, itups, itup_offsets, nitups);
 					MarkBufferDirty(wbuf);
-				else if (release_buf)
+
+					/* Delete tuples we already moved off read page */
+					PageIndexMultiDelete(rpage, deletable, ndeletable);
+					MarkBufferDirty(rbuf);
+
+					/* XLOG stuff */
+					if (RelationNeedsWAL(rel))
+					{
+						XLogRecPtr	recptr;
+						xl_hash_move_page_contents xlrec;
+
+						xlrec.ntups = nitups;
+						xlrec.is_prim_bucket_same_wrt = (wbuf == bucket_buf) ? true : false;
+
+						XLogBeginInsert();
+						XLogRegisterData((char *) &xlrec, SizeOfHashMovePageContents);
+
+						/*
+						 * bucket buffer needs to be registered to ensure that
+						 * we can acquire a cleanup lock on it during replay.
+						 */
+						if (!xlrec.is_prim_bucket_same_wrt)
+							XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+						XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
+						XLogRegisterBufData(1, (char *) itup_offsets,
+											nitups * sizeof(OffsetNumber));
+						for (i = 0; i < nitups; i++)
+							XLogRegisterBufData(1, (char *) itups[i], tups_size[i]);
+
+						XLogRegisterBuffer(2, rbuf, REGBUF_STANDARD);
+						XLogRegisterBufData(2, (char *) deletable,
+										  ndeletable * sizeof(OffsetNumber));
+
+						recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_MOVE_PAGE_CONTENTS);
+
+						PageSetLSN(BufferGetPage(wbuf), recptr);
+						PageSetLSN(BufferGetPage(rbuf), recptr);
+					}
+
+					END_CRIT_SECTION();
+
+					tups_moved = true;
+				}
+
+				if (release_wbuf)
 					_hash_relbuf(rel, wbuf);
 
+				/*
+				 * We need to release and, if required, reacquire the lock on
+				 * rbuf so that a standby doesn't see an intermediate state of
+				 * it.  If we didn't release the lock, then after replay of
+				 * XLOG_HASH_SQUEEZE_PAGE a standby user could see the results
+				 * of a partial deletion on rblkno.
+				 */
+				_hash_chgbufaccess(rel, rbuf, HASH_READ, HASH_NOLOCK);
+
 				/* nothing more to do if we reached the read page */
 				if (rblkno == wblkno)
 				{
-					if (ndeletable > 0)
-					{
-						/* Delete tuples we already moved off read page */
-						PageIndexMultiDelete(rpage, deletable, ndeletable);
-						_hash_wrtbuf(rel, rbuf);
-					}
-					else
-						_hash_relbuf(rel, rbuf);
+					_hash_dropbuf(rel, rbuf);
 					return;
 				}
 
+				_hash_chgbufaccess(rel, rbuf, HASH_NOLOCK, HASH_WRITE);
+
 				wbuf = _hash_getbuf_with_strategy(rel,
 												  wblkno,
 												  HASH_WRITE,
@@ -722,21 +1001,33 @@ _hash_squeezebucket(Relation rel,
 				wpage = BufferGetPage(wbuf);
 				wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
 				Assert(wopaque->hasho_bucket == bucket);
-				wbuf_dirty = false;
-				release_buf = false;
-			}
 
-			/*
-			 * we have found room so insert on the "write" page, being careful
-			 * to preserve hashkey ordering.  (If we insert many tuples into
-			 * the same "write" page it would be worth qsort'ing instead of
-			 * doing repeated _hash_pgaddtup.)
-			 */
-			(void) _hash_pgaddtup(rel, wbuf, itemsz, itup);
-			wbuf_dirty = true;
+				/* be tidy */
+				for (i = 0; i < nitups; i++)
+					pfree(itups[i]);
+				nitups = 0;
+				all_tups_size = 0;
+				ndeletable = 0;
+
+				/*
+				 * After moving the tuples, rpage has been compacted, so we
+				 * need to rescan it.
+				 */
+				if (tups_moved)
+					goto readpage;
+			}
 
 			/* remember tuple for deletion from "read" page */
 			deletable[ndeletable++] = roffnum;
+
+			/*
+			 * We need a copy of the index tuples, because the originals can
+			 * be freed as part of the overflow page, yet we still need them
+			 * to write the WAL record in _hash_freeovflpage.
+			 */
+			itups[nitups] = CopyIndexTuple(itup);
+			tups_size[nitups++] = itemsz;
+			all_tups_size += itemsz;
 		}
 
 		/*
@@ -748,34 +1039,36 @@ _hash_squeezebucket(Relation rel,
 		 * Tricky point here: if our read and write pages are adjacent in the
 		 * bucket chain, our write lock on wbuf will conflict with
 		 * _hash_freeovflpage's attempt to update the sibling links of the
-		 * removed page.  However, in that case we are done anyway, so we can
-		 * simply drop the write lock before calling _hash_freeovflpage.
+		 * removed page.  However, in that case we make sure that
+		 * _hash_freeovflpage doesn't take a lock on that page again.
+		 * Releasing the lock here is not an option, because we first need to
+		 * write WAL for the changes to this page.
 		 */
 		rblkno = ropaque->hasho_prevblkno;
 		Assert(BlockNumberIsValid(rblkno));
 
+		/* free this overflow page (releases rbuf) */
+		_hash_freeovflpage(rel, bucket_buf, rbuf, wbuf, itups, itup_offsets,
+						   tups_size, nitups, bstrategy);
+
+		/* be tidy */
+		for (i = 0; i < nitups; i++)
+			pfree(itups[i]);
+
+		pfree(itup_offsets);
+
 		/* are we freeing the page adjacent to wbuf? */
 		if (rblkno == wblkno)
 		{
+			/* Caller is responsible for locking and unlocking the bucket buffer */
 			if (wblkno != bucket_blkno)
-				release_buf = true;
-
-			/* yes, so release wbuf lock first if needed */
-			if (wbuf_dirty && release_buf)
-				_hash_wrtbuf(rel, wbuf);
-			else if (wbuf_dirty)
-				MarkBufferDirty(wbuf);
-			else if (release_buf)
-				_hash_relbuf(rel, wbuf);
-
-			/* free this overflow page (releases rbuf) */
-			_hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
-			/* done */
+				_hash_dropbuf(rel, wbuf);
 			return;
 		}
 
-		/* free this overflow page, then get the previous one */
-		_hash_freeovflpage(rel, rbuf, bucket_blkno, bstrategy);
+		/* Caller is responsible for locking and unlocking the bucket buffer */
+		if (wblkno != bucket_blkno)
+			_hash_chgbufaccess(rel, wbuf, HASH_NOLOCK, HASH_WRITE);
 
 		rbuf = _hash_getbuf_with_strategy(rel,
 										  rblkno,
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 2c8e4b5..8e00d34 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -29,6 +29,7 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "access/hash_xlog.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
@@ -40,12 +41,11 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
 				  Bucket obucket, Bucket nbucket,
 				  Buffer obuf,
 				  Buffer nbuf,
+				  HTAB *htab,
 				  uint32 maxbucket,
 				  uint32 highmask, uint32 lowmask);
-static void _hash_splitbucket_guts(Relation rel, Buffer metabuf,
-					   Bucket obucket, Bucket nbucket, Buffer obuf,
-					   Buffer nbuf, HTAB *htab, uint32 maxbucket,
-					   uint32 highmask, uint32 lowmask);
+static void log_split_page(Relation rel, Buffer buf);
 
 
 /*
@@ -160,6 +160,29 @@ _hash_getinitbuf(Relation rel, BlockNumber blkno)
 }
 
 /*
+ *	_hash_initbuf() -- Get and initialize a buffer by bucket number.
+ *	_hash_initbuf() -- Initialize a given buffer for the specified bucket.
+void
+_hash_initbuf(Buffer buf, uint32 num_bucket, uint32 flag, bool initpage)
+{
+	HashPageOpaque pageopaque;
+	Page		page;
+
+	page = BufferGetPage(buf);
+
+	/* initialize the page */
+	if (initpage)
+		_hash_pageinit(page, BufferGetPageSize(buf));
+
+	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+	pageopaque->hasho_prevblkno = InvalidBlockNumber;
+	pageopaque->hasho_nextblkno = InvalidBlockNumber;
+	pageopaque->hasho_bucket = num_bucket;
+	pageopaque->hasho_flag = flag;
+	pageopaque->hasho_page_id = HASHO_PAGE_ID;
+}
+
+/*
  *	_hash_getnewbuf() -- Get a new page at the end of the index.
  *
  *		This has the same API as _hash_getinitbuf, except that we are adding
@@ -286,35 +309,17 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 }
 
 /*
- *	_hash_wrtbuf() -- write a hash page to disk.
- *
- *		This routine releases the lock held on the buffer and our refcount
- *		for it.  It is an error to call _hash_wrtbuf() without a write lock
- *		and a pin on the buffer.
- *
- * NOTE: this routine should go away when/if hash indexes are WAL-ified.
- * The correct sequence of operations is to mark the buffer dirty, then
- * write the WAL record, then release the lock and pin; so marking dirty
- * can't be combined with releasing.
- */
-void
-_hash_wrtbuf(Relation rel, Buffer buf)
-{
-	MarkBufferDirty(buf);
-	UnlockReleaseBuffer(buf);
-}
-
-/*
  * _hash_chgbufaccess() -- Change the lock type on a buffer, without
  *			dropping our pin on it.
  *
- * from_access and to_access may be HASH_READ, HASH_WRITE, or HASH_NOLOCK,
- * the last indicating that no buffer-level lock is held or wanted.
+ * from_access can be HASH_READ or HASH_NOLOCK, the latter indicating that
+ * no buffer-level lock is held or wanted.
+ *
+ * to_access can be HASH_READ, HASH_WRITE, or HASH_NOLOCK, the last indicating
+ * that no buffer-level lock is held or wanted.
  *
- * When from_access == HASH_WRITE, we assume the buffer is dirty and tell
- * bufmgr it must be written out.  If the caller wants to release a write
- * lock on a page that's not been modified, it's okay to pass from_access
- * as HASH_READ (a bit ugly, but handy in some places).
+ * If the caller wants to release a write lock on a page, it's okay to pass
+ * from_access as HASH_READ (a bit ugly, but quite handy when required).
  */
 void
 _hash_chgbufaccess(Relation rel,
@@ -322,8 +327,7 @@ _hash_chgbufaccess(Relation rel,
 				   int from_access,
 				   int to_access)
 {
-	if (from_access == HASH_WRITE)
-		MarkBufferDirty(buf);
+	Assert(from_access != HASH_WRITE);
 	if (from_access != HASH_NOLOCK)
 		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 	if (to_access != HASH_NOLOCK)
@@ -332,7 +336,7 @@ _hash_chgbufaccess(Relation rel,
 
 
 /*
- *	_hash_metapinit() -- Initialize the metadata page of a hash index,
+ *	_hash_init() -- Initialize the metadata page of a hash index,
  *				the initial buckets, and the initial bitmap page.
  *
  * The initial number of buckets is dependent on num_tuples, an estimate
@@ -344,19 +348,18 @@ _hash_chgbufaccess(Relation rel,
  * multiple buffer locks is ignored.
  */
 uint32
-_hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
+_hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
 {
-	HashMetaPage metap;
-	HashPageOpaque pageopaque;
 	Buffer		metabuf;
 	Buffer		buf;
+	Buffer		bitmapbuf;
 	Page		pg;
+	HashMetaPage metap;
+	RegProcedure procid;
 	int32		data_width;
 	int32		item_width;
 	int32		ffactor;
-	double		dnumbuckets;
 	uint32		num_buckets;
-	uint32		log2_num_buckets;
 	uint32		i;
 
 	/* safety check */
@@ -378,6 +381,154 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	if (ffactor < 10)
 		ffactor = 10;
 
+	procid = index_getprocid(rel, 1, HASHPROC);
+
+	/*
+	 * We initialize the metapage, the first N bucket pages, and the first
+	 * bitmap page in sequence, using _hash_getnewbuf to cause smgrextend()
+	 * calls to occur.  This ensures that the smgr level has the right idea of
+	 * the physical index length.
+	 *
+	 * Critical section not required, because on error the creation of the
+	 * whole relation will be rolled back.
+	 */
+	metabuf = _hash_getnewbuf(rel, HASH_METAPAGE, forkNum);
+	_hash_init_metabuffer(metabuf, num_tuples, procid, ffactor, false);
+	MarkBufferDirty(metabuf);
+
+	pg = BufferGetPage(metabuf);
+	metap = HashPageGetMeta(pg);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_init_meta_page xlrec;
+		XLogRecPtr	recptr;
+
+		/*
+		 * We could have copied the entire metapage into the WAL record and
+		 * restored it as-is during replay.  However, the metapage is not
+		 * small (more than 600 bytes), so we record just the information
+		 * required to reconstruct it.
+		 */
+		xlrec.num_tuples = num_tuples;
+		xlrec.procid = metap->hashm_procid;
+		xlrec.ffactor = metap->hashm_ffactor;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInitMetaPage);
+		XLogRegisterBuffer(0, metabuf, REGBUF_WILL_INIT);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INIT_META_PAGE);
+
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	num_buckets = metap->hashm_maxbucket + 1;
+
+	/*
+	 * Release buffer lock on the metapage while we initialize buckets.
+	 * Otherwise, we'll be in interrupt holdoff and the CHECK_FOR_INTERRUPTS
+	 * won't accomplish anything.  It's a bad idea to hold buffer locks for
+	 * long intervals in any case, since that can block the bgwriter.
+	 */
+	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+	/*
+	 * Initialize and WAL-log the first N buckets
+	 */
+	for (i = 0; i < num_buckets; i++)
+	{
+		BlockNumber blkno;
+
+		/* Allow interrupts, in case N is huge */
+		CHECK_FOR_INTERRUPTS();
+
+		blkno = BUCKET_TO_BLKNO(metap, i);
+		buf = _hash_getnewbuf(rel, blkno, forkNum);
+		_hash_initbuf(buf, i, LH_BUCKET_PAGE, false);
+		MarkBufferDirty(buf);
+
+		log_newpage(&rel->rd_node,
+					forkNum,
+					blkno,
+					BufferGetPage(buf),
+					true);
+		_hash_relbuf(rel, buf);
+	}
+
+	/* Now reacquire buffer lock on metapage */
+	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
+
+	/*
+	 * Initialize bitmap page
+	 */
+	bitmapbuf = _hash_getnewbuf(rel, num_buckets + 1, forkNum);
+	_hash_initbitmapbuffer(bitmapbuf, metap->hashm_bmsize, false);
+	MarkBufferDirty(bitmapbuf);
+
+	/* add the new bitmap page to the metapage's list of bitmaps */
+	/* metapage already has a write lock */
+	if (metap->hashm_nmaps >= HASH_MAX_BITMAPS)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+				 errmsg("out of overflow pages in hash index \"%s\"",
+						RelationGetRelationName(rel))));
+
+	metap->hashm_mapp[metap->hashm_nmaps] = num_buckets + 1;
+
+	metap->hashm_nmaps++;
+	MarkBufferDirty(metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_init_bitmap_page xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.bmsize = metap->hashm_bmsize;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInitBitmapPage);
+		XLogRegisterBuffer(0, bitmapbuf, REGBUF_WILL_INIT);
+
+		/*
+		 * We could log the exact changes made to the metapage; however, since
+		 * no concurrent operation can see the index during the replay of this
+		 * record, we simply perform the same operations during replay as are
+		 * done here.
+		 */
+		XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INIT_BITMAP_PAGE);
+
+		PageSetLSN(BufferGetPage(bitmapbuf), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	/* all done */
+	_hash_relbuf(rel, bitmapbuf);
+	_hash_relbuf(rel, metabuf);
+
+	return num_buckets;
+}
+
+/*
+ *	_hash_init_metabuffer() -- Initialize the metadata page of a hash index.
+ */
+void
+_hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
+					  uint16 ffactor, bool initpage)
+{
+	HashMetaPage metap;
+	HashPageOpaque pageopaque;
+	Page		page;
+	double		dnumbuckets;
+	uint32		num_buckets;
+	uint32		log2_num_buckets;
+	uint32		i;
+
 	/*
 	 * Choose the number of initial bucket pages to match the fill factor
 	 * given the estimated number of tuples.  We round up the result to the
@@ -397,30 +548,25 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	Assert(num_buckets == (((uint32) 1) << log2_num_buckets));
 	Assert(log2_num_buckets < HASH_MAX_SPLITPOINTS);
 
-	/*
-	 * We initialize the metapage, the first N bucket pages, and the first
-	 * bitmap page in sequence, using _hash_getnewbuf to cause smgrextend()
-	 * calls to occur.  This ensures that the smgr level has the right idea of
-	 * the physical index length.
-	 */
-	metabuf = _hash_getnewbuf(rel, HASH_METAPAGE, forkNum);
-	pg = BufferGetPage(metabuf);
+	page = BufferGetPage(buf);
+	if (initpage)
+		_hash_pageinit(page, BufferGetPageSize(buf));
 
-	pageopaque = (HashPageOpaque) PageGetSpecialPointer(pg);
+	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	pageopaque->hasho_prevblkno = InvalidBlockNumber;
 	pageopaque->hasho_nextblkno = InvalidBlockNumber;
 	pageopaque->hasho_bucket = -1;
 	pageopaque->hasho_flag = LH_META_PAGE;
 	pageopaque->hasho_page_id = HASHO_PAGE_ID;
 
-	metap = HashPageGetMeta(pg);
+	metap = HashPageGetMeta(page);
 
 	metap->hashm_magic = HASH_MAGIC;
 	metap->hashm_version = HASH_VERSION;
 	metap->hashm_ntuples = 0;
 	metap->hashm_nmaps = 0;
 	metap->hashm_ffactor = ffactor;
-	metap->hashm_bsize = HashGetMaxBitmapSize(pg);
+	metap->hashm_bsize = HashGetMaxBitmapSize(page);
 	/* find largest bitmap array size that will fit in page size */
 	for (i = _hash_log2(metap->hashm_bsize); i > 0; --i)
 	{
@@ -437,7 +583,7 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	 * pretty useless for normal operation (in fact, hashm_procid is not used
 	 * anywhere), but it might be handy for forensic purposes so we keep it.
 	 */
-	metap->hashm_procid = index_getprocid(rel, 1, HASHPROC);
+	metap->hashm_procid = procid;
 
 	/*
 	 * We initialize the index with N buckets, 0 .. N-1, occupying physical
@@ -456,44 +602,11 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	metap->hashm_firstfree = 0;
 
 	/*
-	 * Release buffer lock on the metapage while we initialize buckets.
-	 * Otherwise, we'll be in interrupt holdoff and the CHECK_FOR_INTERRUPTS
-	 * won't accomplish anything.  It's a bad idea to hold buffer locks for
-	 * long intervals in any case, since that can block the bgwriter.
+	 * Set pd_lower just past the end of the metadata.  This is used when a
+	 * full-page image of the metapage is logged in xloginsert.c.
 	 */
-	_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
-
-	/*
-	 * Initialize the first N buckets
-	 */
-	for (i = 0; i < num_buckets; i++)
-	{
-		/* Allow interrupts, in case N is huge */
-		CHECK_FOR_INTERRUPTS();
-
-		buf = _hash_getnewbuf(rel, BUCKET_TO_BLKNO(metap, i), forkNum);
-		pg = BufferGetPage(buf);
-		pageopaque = (HashPageOpaque) PageGetSpecialPointer(pg);
-		pageopaque->hasho_prevblkno = InvalidBlockNumber;
-		pageopaque->hasho_nextblkno = InvalidBlockNumber;
-		pageopaque->hasho_bucket = i;
-		pageopaque->hasho_flag = LH_BUCKET_PAGE;
-		pageopaque->hasho_page_id = HASHO_PAGE_ID;
-		_hash_wrtbuf(rel, buf);
-	}
-
-	/* Now reacquire buffer lock on metapage */
-	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
-
-	/*
-	 * Initialize first bitmap page
-	 */
-	_hash_initbitmap(rel, metap, num_buckets + 1, forkNum);
-
-	/* all done */
-	_hash_wrtbuf(rel, metabuf);
-
-	return num_buckets;
+	((PageHeader) page)->pd_lower =
+		((char *) metap + sizeof(HashMetaPageData)) - (char *) page;
 }
 
 /*
@@ -502,7 +615,6 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 void
 _hash_pageinit(Page page, Size size)
 {
-	Assert(PageIsNew(page));
 	PageInit(page, size, sizeof(HashPageOpaqueData));
 }
 
@@ -530,10 +642,14 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	Buffer		buf_nblkno;
 	Buffer		buf_oblkno;
 	Page		opage;
+	Page		npage;
 	HashPageOpaque oopaque;
+	HashPageOpaque nopaque;
 	uint32		maxbucket;
 	uint32		highmask;
 	uint32		lowmask;
+	bool		metap_update_masks = false;
+	bool		metap_update_splitpoint = false;
 
 restart_expand:
 
@@ -569,7 +685,7 @@ restart_expand:
 	 * than a disk block then this would be an independent constraint.
 	 *
 	 * If you change this, see also the maximum initial number of buckets in
-	 * _hash_metapinit().
+	 * _hash_init().
 	 */
 	if (metap->hashm_maxbucket >= (uint32) 0x7FFFFFFE)
 		goto fail;
@@ -697,7 +813,11 @@ restart_expand:
 		 * The number of buckets in the new splitpoint is equal to the total
 		 * number already in existence, i.e. new_bucket.  Currently this maps
 		 * one-to-one to blocks required, but someday we may need a more
-		 * complicated calculation here.
+		 * complicated calculation here.  We treat the allocation of buckets
+		 * as a separate WAL-logged action.  Even if we fail after this
+		 * operation, the space won't be leaked on a standby, as the next
+		 * split will consume it.  In any case, even without a failure we
+		 * don't use all of the space in one split operation.
 		 */
 		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
 		{
@@ -722,18 +842,44 @@ restart_expand:
 		goto fail;
 	}
 
-
 	/*
-	 * Okay to proceed with split.  Update the metapage bucket mapping info.
-	 *
-	 * Since we are scribbling on the metapage data right in the shared
-	 * buffer, any failure in this next little bit leaves us with a big
+	 * Since we are scribbling on the pages in shared buffers, establish a
+	 * critical section.  Any failure in the code below leaves us with a big
 	 * problem: the metapage is effectively corrupt but could get written back
-	 * to disk.  We don't really expect any failure, but just to be sure,
-	 * establish a critical section.
+	 * to disk.
 	 */
 	START_CRIT_SECTION();
 
+	opage = BufferGetPage(buf_oblkno);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	/*
+	 * Mark the old bucket to indicate that a split is in progress and that it
+	 * has deletable tuples.  At the end of the operation, we clear the
+	 * split-in-progress flag; vacuum will clear the page_has_garbage flag
+	 * after deleting such tuples.
+	 */
+	oopaque->hasho_flag |= LH_BUCKET_PAGE_HAS_GARBAGE | LH_BUCKET_OLD_PAGE_SPLIT;
+
+	MarkBufferDirty(buf_oblkno);
+
+	npage = BufferGetPage(buf_nblkno);
+
+	/*
+	 * Initialize the new bucket's primary page and mark it to indicate that
+	 * a split is in progress.
+	 */
+	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	nopaque->hasho_prevblkno = InvalidBlockNumber;
+	nopaque->hasho_nextblkno = InvalidBlockNumber;
+	nopaque->hasho_bucket = new_bucket;
+	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_NEW_PAGE_SPLIT;
+	nopaque->hasho_page_id = HASHO_PAGE_ID;
+
+	MarkBufferDirty(buf_nblkno);
+
+	/*
+	 * Okay to proceed with split.  Update the metapage bucket mapping info.
+	 */
 	metap->hashm_maxbucket = new_bucket;
 
 	if (new_bucket > metap->hashm_highmask)
@@ -741,6 +887,7 @@ restart_expand:
 		/* Starting a new doubling */
 		metap->hashm_lowmask = metap->hashm_highmask;
 		metap->hashm_highmask = new_bucket | metap->hashm_lowmask;
+		metap_update_masks = true;
 	}
 
 	/*
@@ -753,10 +900,10 @@ restart_expand:
 	{
 		metap->hashm_spares[spare_ndx] = metap->hashm_spares[metap->hashm_ovflpoint];
 		metap->hashm_ovflpoint = spare_ndx;
+		metap_update_splitpoint = true;
 	}
 
-	/* Done mucking with metapage */
-	END_CRIT_SECTION();
+	MarkBufferDirty(metabuf);
 
 	/*
 	 * Copy bucket mapping info now; this saves re-accessing the meta page
@@ -769,15 +916,71 @@ restart_expand:
 	highmask = metap->hashm_highmask;
 	lowmask = metap->hashm_lowmask;
 
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_split_allocpage xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.new_bucket = maxbucket;
+		xlrec.old_bucket_flag = oopaque->hasho_flag;
+		xlrec.new_bucket_flag = nopaque->hasho_flag;
+		xlrec.flags = 0;
+
+		XLogBeginInsert();
+
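+		/*
+		 * Register the old bucket page, the new bucket page (re-initialized
+		 * during replay, hence REGBUF_WILL_INIT) and the metapage.  The
+		 * updated mask and splitpoint values are attached to the metapage
+		 * buffer only when they actually changed.
+		 */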
+		XLogRegisterBuffer(0, buf_oblkno, REGBUF_STANDARD);
+		XLogRegisterBuffer(1, buf_nblkno, REGBUF_WILL_INIT);
+		XLogRegisterBuffer(2, metabuf, REGBUF_STANDARD);
+
+		if (metap_update_masks)
+		{
+			xlrec.flags |= XLH_SPLIT_META_UPDATE_MASKS;
+			XLogRegisterBufData(2, (char *) &metap->hashm_lowmask, sizeof(uint32));
+			XLogRegisterBufData(2, (char *) &metap->hashm_highmask, sizeof(uint32));
+		}
+
+		if (metap_update_splitpoint)
+		{
+			xlrec.flags |= XLH_SPLIT_META_UPDATE_SPLITPOINT;
+			XLogRegisterBufData(2, (char *) &metap->hashm_ovflpoint,
+								sizeof(uint32));
+			XLogRegisterBufData(2,
+					   (char *) &metap->hashm_spares[metap->hashm_ovflpoint],
+								sizeof(uint32));
+		}
+
+		XLogRegisterData((char *) &xlrec, SizeOfHashSplitAllocPage);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_ALLOCATE_PAGE);
+
+		PageSetLSN(BufferGetPage(buf_oblkno), recptr);
+		PageSetLSN(BufferGetPage(buf_nblkno), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
 	/* Write out the metapage and drop lock, but keep pin */
-	_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
+	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+	/*
+	 * We need to release and reacquire the lock on the new bucket buffer so
+	 * that a standby doesn't see an intermediate state of it.
+	 */
+	_hash_chgbufaccess(rel, buf_nblkno, HASH_READ, HASH_NOLOCK);
+	_hash_chgbufaccess(rel, buf_nblkno, HASH_NOLOCK, HASH_WRITE);
 
 	/* Relocate records to the new bucket */
 	_hash_splitbucket(rel, metabuf,
 					  old_bucket, new_bucket,
-					  buf_oblkno, buf_nblkno,
+					  buf_oblkno, buf_nblkno, NULL,
 					  maxbucket, highmask, lowmask);
 
+	/* all done, now release the locks and pins on primary buckets. */
+	_hash_relbuf(rel, buf_oblkno);
+	_hash_relbuf(rel, buf_nblkno);
+
 	return;
 
 	/* Here if decide not to split or fail to acquire old bucket lock */
@@ -817,6 +1020,7 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 {
 	BlockNumber lastblock;
 	char		zerobuf[BLCKSZ];
+	Page		page;
 
 	lastblock = firstblock + nblocks - 1;
 
@@ -827,7 +1031,21 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 	if (lastblock < firstblock || lastblock == InvalidBlockNumber)
 		return false;
 
-	MemSet(zerobuf, 0, sizeof(zerobuf));
+	page = (Page) zerobuf;
+
+	/*
+	 * Initialize the new bucket page.  We can't leave the page completely
+	 * zeroed, because WAL replay routines expect pages to be initialized.
+	 * See explanation of RBM_NORMAL mode atop XLogReadBufferExtended.
+	 */
+	_hash_pageinit(page, BLCKSZ);
+
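+	/*
+	 * Only the last block of the newly allocated range is logged and written
+	 * out; the blocks before it are initialized when later splits actually
+	 * use them.
+	 */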
+	if (RelationNeedsWAL(rel))
+		log_newpage(&rel->rd_node,
+					MAIN_FORKNUM,
+					lastblock,
+					zerobuf,
+					true);
 
 	RelationOpenSmgr(rel);
 	smgrextend(rel->rd_smgr, MAIN_FORKNUM, lastblock, zerobuf, false);
@@ -835,14 +1053,18 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 	return true;
 }
 
-
 /*
  * _hash_splitbucket -- split 'obucket' into 'obucket' and 'nbucket'
  *
+ * This routine is used to partition the tuples between the old and new
+ * buckets, and also to finish an incomplete split operation.  To finish a
+ * previously interrupted split, the caller needs to fill htab.  If htab is
+ * set, we skip moving the tuples that are present in htab; a NULL htab means
+ * that all tuples belonging to the new bucket are moved.
+ *
  * We are splitting a bucket that consists of a base bucket page and zero
  * or more overflow (bucket chain) pages.  We must relocate tuples that
- * belong in the new bucket, and compress out any free space in the old
- * bucket.
+ * belong in the new bucket.
  *
  * The caller must hold cleanup locks on both buckets to ensure that
  * no one else is trying to access them (see README).
@@ -868,70 +1090,11 @@ _hash_splitbucket(Relation rel,
 				  Bucket nbucket,
 				  Buffer obuf,
 				  Buffer nbuf,
+				  HTAB *htab,
 				  uint32 maxbucket,
 				  uint32 highmask,
 				  uint32 lowmask)
 {
-	Page		opage;
-	Page		npage;
-	HashPageOpaque oopaque;
-	HashPageOpaque nopaque;
-
-	opage = BufferGetPage(obuf);
-	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
-
-	/*
-	 * Mark the old bucket to indicate that split is in progress and it has
-	 * deletable tuples. At operation end, we clear split in progress flag and
-	 * vacuum will clear page_has_garbage flag after deleting such tuples.
-	 */
-	oopaque->hasho_flag |= LH_BUCKET_PAGE_HAS_GARBAGE | LH_BUCKET_OLD_PAGE_SPLIT;
-
-	npage = BufferGetPage(nbuf);
-
-	/*
-	 * initialize the new bucket's primary page and mark it to indicate that
-	 * split is in progress.
-	 */
-	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
-	nopaque->hasho_prevblkno = InvalidBlockNumber;
-	nopaque->hasho_nextblkno = InvalidBlockNumber;
-	nopaque->hasho_bucket = nbucket;
-	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_NEW_PAGE_SPLIT;
-	nopaque->hasho_page_id = HASHO_PAGE_ID;
-
-	_hash_splitbucket_guts(rel, metabuf, obucket,
-						   nbucket, obuf, nbuf, NULL,
-						   maxbucket, highmask, lowmask);
-
-	/* all done, now release the locks and pins on primary buckets. */
-	_hash_relbuf(rel, obuf);
-	_hash_relbuf(rel, nbuf);
-}
-
-/*
- * _hash_splitbucket_guts -- Helper function to perform the split operation
- *
- * This routine is used to partition the tuples between old and new bucket and
- * is used to finish the incomplete split operations.  To finish the previously
- * interrupted split operation, caller needs to fill htab.  If htab is set, then
- * we skip the movement of tuples that exists in htab, otherwise NULL value of
- * htab indicates movement of all the tuples that belong to new bucket.
- *
- * Caller needs to lock and unlock the old and new primary buckets.
- */
-static void
-_hash_splitbucket_guts(Relation rel,
-					   Buffer metabuf,
-					   Bucket obucket,
-					   Bucket nbucket,
-					   Buffer obuf,
-					   Buffer nbuf,
-					   HTAB *htab,
-					   uint32 maxbucket,
-					   uint32 highmask,
-					   uint32 lowmask)
-{
 	Buffer		bucket_obuf;
 	Buffer		bucket_nbuf;
 	Page		opage;
@@ -1015,7 +1178,7 @@ _hash_splitbucket_guts(Relation rel,
 				itemsz = IndexTupleDSize(*new_itup);
 				itemsz = MAXALIGN(itemsz);
 
-				if (PageGetFreeSpace(npage) < itemsz)
+				while (PageGetFreeSpace(npage) < itemsz)
 				{
 					bool		retain_pin = false;
 
@@ -1025,8 +1188,12 @@ _hash_splitbucket_guts(Relation rel,
 					 */
 					retain_pin = nopaque->hasho_flag & LH_BUCKET_PAGE;
 
-					/* write out nbuf and drop lock, but keep pin */
-					_hash_chgbufaccess(rel, nbuf, HASH_WRITE, HASH_NOLOCK);
+					/* log the split operation before releasing the lock */
+					log_split_page(rel, nbuf);
+
+					/* drop lock, but keep pin */
+					_hash_chgbufaccess(rel, nbuf, HASH_READ, HASH_NOLOCK);
+
 					/* chain to a new overflow page */
 					nbuf = _hash_addovflpage(rel, metabuf, nbuf, retain_pin);
 					npage = BufferGetPage(nbuf);
@@ -1034,6 +1201,13 @@ _hash_splitbucket_guts(Relation rel,
 				}
 
 				/*
+				 * Change the shared-buffer state in a critical section;
+				 * otherwise, an error here could leave the page in a state
+				 * that recovery cannot repair.
+				 */
+				START_CRIT_SECTION();
+
+				/*
 				 * Insert tuple on new page, using _hash_pgaddtup to ensure
 				 * correct ordering by hashkey.  This is a tad inefficient
 				 * since we may have to shuffle itempointers repeatedly.
@@ -1042,6 +1216,8 @@ _hash_splitbucket_guts(Relation rel,
 				 */
 				(void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
 
+				END_CRIT_SECTION();
+
 				/* be tidy */
 				pfree(new_itup);
 			}
@@ -1064,7 +1240,16 @@ _hash_splitbucket_guts(Relation rel,
 
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
+		{
+			/* log the split operation before releasing the lock */
+			log_split_page(rel, nbuf);
+
+			if (nopaque->hasho_flag & LH_BUCKET_PAGE)
+				_hash_chgbufaccess(rel, nbuf, HASH_READ, HASH_NOLOCK);
+			else
+				_hash_relbuf(rel, nbuf);
 			break;
+		}
 
 		/* Else, advance to next old page */
 		obuf = _hash_getbuf(rel, oblkno, HASH_READ, LH_OVERFLOW_PAGE);
@@ -1080,15 +1265,6 @@ _hash_splitbucket_guts(Relation rel,
 	 * To avoid deadlocks due to locking order of buckets, first lock the old
 	 * bucket and then the new bucket.
 	 */
-	if (nopaque->hasho_flag & LH_BUCKET_PAGE)
-		_hash_chgbufaccess(rel, bucket_nbuf, HASH_WRITE, HASH_NOLOCK);
-	else
-		_hash_wrtbuf(rel, nbuf);
-
-	/*
-	 * Acquiring cleanup lock to clear the split-in-progress flag ensures that
-	 * there is no pending scan that has seen the flag after it is cleared.
-	 */
 	_hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
 	opage = BufferGetPage(bucket_obuf);
 	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
@@ -1097,6 +1273,8 @@ _hash_splitbucket_guts(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	START_CRIT_SECTION();
+
 	/* indicate that split is finished */
 	oopaque->hasho_flag &= ~LH_BUCKET_OLD_PAGE_SPLIT;
 	nopaque->hasho_flag &= ~LH_BUCKET_NEW_PAGE_SPLIT;
@@ -1107,6 +1285,29 @@ _hash_splitbucket_guts(Relation rel,
 	 */
 	MarkBufferDirty(bucket_obuf);
 	MarkBufferDirty(bucket_nbuf);
+
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+		xl_hash_split_complete xlrec;
+
+		xlrec.old_bucket_flag = oopaque->hasho_flag;
+		xlrec.new_bucket_flag = nopaque->hasho_flag;
+
+		XLogBeginInsert();
+
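+		/*
+		 * Register both primary bucket pages so that replay can clear their
+		 * split-in-progress flags.
+		 */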
+		XLogRegisterData((char *) &xlrec, SizeOfHashSplitComplete);
+
+		XLogRegisterBuffer(0, bucket_obuf, REGBUF_STANDARD);
+		XLogRegisterBuffer(1, bucket_nbuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_COMPLETE);
+
+		PageSetLSN(BufferGetPage(bucket_obuf), recptr);
+		PageSetLSN(BufferGetPage(bucket_nbuf), recptr);
+	}
+
+	END_CRIT_SECTION();
 }
 
 /*
@@ -1211,9 +1412,41 @@ _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Buffer nbuf,
 	opageopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
 	obucket = opageopaque->hasho_bucket;
 
-	_hash_splitbucket_guts(rel, metabuf, obucket,
-						   nbucket, obuf, bucket_nbuf, tidhtab,
-						   maxbucket, highmask, lowmask);
+	_hash_splitbucket(rel, metabuf, obucket,
+					  nbucket, obuf, bucket_nbuf, tidhtab,
+					  maxbucket, highmask, lowmask);
 
 	hash_destroy(tidhtab);
 }
+
+/*
+ *	log_split_page() -- Log the split operation
+ *
+ *	We log the split operation when a new page in the new bucket gets full,
+ *	so we log the entire page.
+ *
+ *	'buf' must be locked by the caller, which is also responsible for
+ *	unlocking it.
+ */
+static void
+log_split_page(Relation rel, Buffer buf)
+{
+	START_CRIT_SECTION();
+
+	MarkBufferDirty(buf);
+
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+
+		XLogBeginInsert();
+
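+		/*
+		 * REGBUF_FORCE_IMAGE makes replay restore the page from a full-page
+		 * image, which is why the whole page is logged here rather than the
+		 * individual tuples moved to it.
+		 */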
+		XLogRegisterBuffer(0, buf, REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_PAGE);
+
+		PageSetLSN(BufferGetPage(buf), recptr);
+	}
+
+	END_CRIT_SECTION();
+}
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 8723e13..0df64a8 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -301,6 +301,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	_hash_dropbuf(rel, metabuf);
 
 	page = BufferGetPage(buf);
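+	/*
+	 * Check for snapshot-too-old; this relies on the page LSN, which hash
+	 * pages now maintain because the index is WAL-logged.
+	 */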
+	TestForOldSnapshot(scan->xs_snapshot, rel, page);
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	Assert(opaque->hasho_bucket == bucket);
 
@@ -357,6 +358,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	{
 		while (BlockNumberIsValid(opaque->hasho_nextblkno))
 			_hash_readnext(rel, &buf, &page, &opaque);
+
+		if (BufferIsValid(buf))
+			TestForOldSnapshot(scan->xs_snapshot, rel, page);
 	}
 
 	/* Now find the first tuple satisfying the qualification */
@@ -475,6 +479,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					_hash_readnext(rel, &buf, &page, &opaque);
 					if (BufferIsValid(buf))
 					{
+						TestForOldSnapshot(scan->xs_snapshot, rel, page);
 						maxoff = PageGetMaxOffsetNumber(page);
 						offnum = _hash_binsearch(page, so->hashso_sk_hash);
 					}
@@ -498,6 +503,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 							_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
 
 							page = BufferGetPage(buf);
+							TestForOldSnapshot(scan->xs_snapshot, rel, page);
 							opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 							maxoff = PageGetMaxOffsetNumber(page);
 							offnum = _hash_binsearch(page, so->hashso_sk_hash);
@@ -562,6 +568,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					_hash_readprev(rel, &buf, &page, &opaque);
 					if (BufferIsValid(buf))
 					{
+						TestForOldSnapshot(scan->xs_snapshot, rel, page);
 						maxoff = PageGetMaxOffsetNumber(page);
 						offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
 					}
@@ -585,6 +592,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 							_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
 
 							page = BufferGetPage(buf);
+							TestForOldSnapshot(scan->xs_snapshot, rel, page);
 							opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 							maxoff = PageGetMaxOffsetNumber(page);
 							offnum = _hash_binsearch(page, so->hashso_sk_hash);
diff --git a/src/backend/access/rmgrdesc/hashdesc.c b/src/backend/access/rmgrdesc/hashdesc.c
index 12e1818..245ce97 100644
--- a/src/backend/access/rmgrdesc/hashdesc.c
+++ b/src/backend/access/rmgrdesc/hashdesc.c
@@ -19,10 +19,143 @@
 void
 hash_desc(StringInfo buf, XLogReaderState *record)
 {
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			{
+				xl_hash_init_meta_page *xlrec = (xl_hash_init_meta_page *) rec;
+
+				appendStringInfo(buf, "num_tuples %g, fillfactor %d",
+								 xlrec->num_tuples, xlrec->ffactor);
+				break;
+			}
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			{
+				xl_hash_init_bitmap_page *xlrec = (xl_hash_init_bitmap_page *) rec;
+
+				appendStringInfo(buf, "bmsize %d", xlrec->bmsize);
+				break;
+			}
+		case XLOG_HASH_INSERT:
+			{
+				xl_hash_insert *xlrec = (xl_hash_insert *) rec;
+
+				appendStringInfo(buf, "off %u", xlrec->offnum);
+				break;
+			}
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			{
+				xl_hash_addovflpage *xlrec = (xl_hash_addovflpage *) rec;
+
+				appendStringInfo(buf, "bmsize %d, bmpage_found %c",
+						   xlrec->bmsize, (xlrec->bmpage_found) ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			{
+				xl_hash_split_allocpage *xlrec = (xl_hash_split_allocpage *) rec;
+
+				appendStringInfo(buf, "new_bucket %u, meta_page_masks_updated %c, issplitpoint_changed %c",
+								 xlrec->new_bucket,
+					(xlrec->flags & XLH_SPLIT_META_UPDATE_MASKS) ? 'T' : 'F',
+								 (xlrec->flags & XLH_SPLIT_META_UPDATE_SPLITPOINT) ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SPLIT_COMPLETE:
+			{
+				xl_hash_split_complete *xlrec = (xl_hash_split_complete *) rec;
+
+				appendStringInfo(buf, "split_complete_old_bucket %c, split_complete_new_bucket %c",
+								 (xlrec->old_bucket_flag & LH_BUCKET_OLD_PAGE_SPLIT) ? 'F' : 'T',
+								 (xlrec->new_bucket_flag & LH_BUCKET_NEW_PAGE_SPLIT) ? 'F' : 'T');
+				break;
+			}
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			{
+				xl_hash_move_page_contents *xlrec = (xl_hash_move_page_contents *) rec;
+
+				appendStringInfo(buf, "ntups %d, is_primary %c",
+								 xlrec->ntups,
+								 xlrec->is_prim_bucket_same_wrt ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SQUEEZE_PAGE:
+			{
+				xl_hash_squeeze_page *xlrec = (xl_hash_squeeze_page *) rec;
+
+				appendStringInfo(buf, "prevblkno %u, nextblkno %u, ntups %d, is_primary %c",
+								 xlrec->prevblkno,
+								 xlrec->nextblkno,
+								 xlrec->ntups,
+								 xlrec->is_prim_bucket_same_wrt ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_DELETE:
+			{
+				xl_hash_delete *xlrec = (xl_hash_delete *) rec;
+
+				appendStringInfo(buf, "is_primary %c",
+								 xlrec->is_primary_bucket_page ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_UPDATE_META_PAGE:
+			{
+				xl_hash_update_meta_page *xlrec = (xl_hash_update_meta_page *) rec;
+
+				appendStringInfo(buf, "ntuples %g",
+								 xlrec->ntuples);
+				break;
+			}
+	}
 }
 
 const char *
 hash_identify(uint8 info)
 {
-	return NULL;
+	const char *id = NULL;
+
+	switch (info & ~XLR_INFO_MASK)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			id = "INIT_META_PAGE";
+			break;
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			id = "INIT_BITMAP_PAGE";
+			break;
+		case XLOG_HASH_INSERT:
+			id = "INSERT";
+			break;
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			id = "ADD_OVFL_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			id = "SPLIT_ALLOCATE_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_PAGE:
+			id = "SPLIT_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_COMPLETE:
+			id = "SPLIT_COMPLETE";
+			break;
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			id = "MOVE_PAGE_CONTENTS";
+			break;
+		case XLOG_HASH_SQUEEZE_PAGE:
+			id = "SQUEEZE_PAGE";
+			break;
+		case XLOG_HASH_DELETE:
+			id = "DELETE";
+			break;
+		case XLOG_HASH_CLEAR_GARBAGE:
+			id = "CLEAR_GARBAGE";
+			break;
+		case XLOG_HASH_UPDATE_META_PAGE:
+			id = "UPDATE_META_PAGE";
+			break;
+	}
+
+	return id;
 }
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 85817c6..b868fd8 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -499,11 +499,6 @@ DefineIndex(Oid relationId,
 	accessMethodForm = (Form_pg_am) GETSTRUCT(tuple);
 	amRoutine = GetIndexAmRoutine(accessMethodForm->amhandler);
 
-	if (strcmp(accessMethodName, "hash") == 0 &&
-		RelationNeedsWAL(rel))
-		ereport(WARNING,
-				(errmsg("hash indexes are not WAL-logged and their use is discouraged")));
-
 	if (stmt->unique && !amRoutine->amcanunique)
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 73aa0c0..31b66d2 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -598,6 +598,33 @@ PageGetFreeSpace(Page page)
 }
 
 /*
+ * PageGetFreeSpaceForMulTups
+ *		Returns the size of the free (allocatable) space on a page,
+ *		reduced by the space needed for multiple new line pointers.
+ *
+ * Note: this should usually only be used on index pages.  Use
+ * PageGetHeapFreeSpace on heap pages.
+ */
+Size
+PageGetFreeSpaceForMulTups(Page page, int ntups)
+{
+	int			space;
+
+	/*
+	 * Use signed arithmetic here so that we behave sensibly if pd_lower >
+	 * pd_upper.
+	 */
+	space = (int) ((PageHeader) page)->pd_upper -
+		(int) ((PageHeader) page)->pd_lower;
+
+	if (space < (int) (ntups * sizeof(ItemIdData)))
+		return 0;
+	space -= ntups * sizeof(ItemIdData);
+
+	return (Size) space;
+}
+
+/*
  * PageGetExactFreeSpace
  *		Returns the size of the free (allocatable) space on a page,
  *		without any consideration for adding/removing line pointers.
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 79e0b1f..a918c6d 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -5379,13 +5379,10 @@ RelationIdIsInInitFile(Oid relationId)
 /*
  * Tells whether any index for the relation is unlogged.
  *
- * Any index using the hash AM is implicitly unlogged.
- *
  * Note: There doesn't seem to be any way to have an unlogged index attached
- * to a permanent table except to create a hash index, but it seems best to
- * keep this general so that it returns sensible results even when they seem
- * obvious (like for an unlogged table) and to handle possible future unlogged
- * indexes on permanent tables.
+ * to a permanent table, but it seems best to keep this general so that it
+ * returns sensible results even when they seem obvious (like for an unlogged
+ * table) and to handle possible future unlogged indexes on permanent tables.
  */
 bool
 RelationHasUnloggedIndex(Relation rel)
@@ -5407,8 +5404,7 @@ RelationHasUnloggedIndex(Relation rel)
 			elog(ERROR, "cache lookup failed for relation %u", indexoid);
 		reltup = (Form_pg_class) GETSTRUCT(tp);
 
-		if (reltup->relpersistence == RELPERSISTENCE_UNLOGGED
-			|| reltup->relam == HASH_AM_OID)
+		if (reltup->relpersistence == RELPERSISTENCE_UNLOGGED)
 			result = true;
 
 		ReleaseSysCache(tp);
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 2967ba7..17ee6b8 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -306,13 +306,16 @@ extern Datum hash_uint32(uint32 k);
 extern void _hash_doinsert(Relation rel, IndexTuple itup);
 extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
 			   Size itemsize, IndexTuple itup);
+extern void _hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
+					OffsetNumber *itup_offsets, uint16 nitups);
 
 /* hashovfl.c */
 extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin);
-extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf,
-				   BlockNumber bucket_blkno, BufferAccessStrategy bstrategy);
-extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
-				 BlockNumber blkno, ForkNumber forkNum);
+extern BlockNumber _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
+				   Buffer wbuf, IndexTuple *itups, OffsetNumber *itup_offsets,
+				   Size *tups_size, uint16 nitups,
+				   BufferAccessStrategy bstrategy);
+extern void _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage);
 extern void _hash_squeezebucket(Relation rel,
 					Bucket bucket, BlockNumber bucket_blkno,
 					Buffer bucket_buf,
@@ -324,6 +327,8 @@ extern Buffer _hash_getbuf(Relation rel, BlockNumber blkno,
 extern Buffer _hash_getbuf_with_condlock_cleanup(Relation rel,
 								   BlockNumber blkno, int flags);
 extern Buffer _hash_getinitbuf(Relation rel, BlockNumber blkno);
+extern void _hash_initbuf(Buffer buf, uint32 num_bucket, uint32 flag,
+			  bool initpage);
 extern Buffer _hash_getnewbuf(Relation rel, BlockNumber blkno,
 				ForkNumber forkNum);
 extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
@@ -332,11 +337,12 @@ extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
 extern void _hash_relbuf(Relation rel, Buffer buf);
 extern void _hash_dropbuf(Relation rel, Buffer buf);
 extern void _hash_dropscanbuf(Relation rel, HashScanOpaque so);
-extern void _hash_wrtbuf(Relation rel, Buffer buf);
 extern void _hash_chgbufaccess(Relation rel, Buffer buf, int from_access,
 				   int to_access);
-extern uint32 _hash_metapinit(Relation rel, double num_tuples,
-				ForkNumber forkNum);
+extern uint32 _hash_init(Relation rel, double num_tuples,
+		   ForkNumber forkNum);
+extern void _hash_init_metabuffer(Buffer buf, double num_tuples,
+					  RegProcedure procid, uint16 ffactor, bool initpage);
 extern void _hash_pageinit(Page page, Size size);
 extern void _hash_expandtable(Relation rel, Buffer metabuf);
 extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index 5f941a9..30e16c0 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -17,6 +17,245 @@
 #include "access/hash.h"
 #include "access/xlogreader.h"
 
+/* Number of buffers required for XLOG_HASH_SQUEEZE_PAGE operation */
+#define HASH_XLOG_FREE_OVFL_BUFS	6
+
+/*
+ * XLOG records for hash operations
+ */
+#define XLOG_HASH_INIT_META_PAGE	0x00		/* initialize the meta page */
+#define XLOG_HASH_INIT_BITMAP_PAGE	0x10		/* initialize the bitmap page */
+#define XLOG_HASH_INSERT		0x20	/* add index tuple without split */
+#define XLOG_HASH_ADD_OVFL_PAGE 0x30	/* add overflow page */
+#define XLOG_HASH_SPLIT_ALLOCATE_PAGE	0x40	/* allocate new page for split */
+#define XLOG_HASH_SPLIT_PAGE	0x50	/* split page */
+#define XLOG_HASH_SPLIT_COMPLETE	0x60		/* completion of split
+												 * operation */
+#define XLOG_HASH_MOVE_PAGE_CONTENTS	0x70	/* remove tuples from one page
+												 * and add to another page */
+#define XLOG_HASH_SQUEEZE_PAGE	0x80	/* add tuples to one of the previous
+										 * pages in chain and free the ovfl
+										 * page */
+#define XLOG_HASH_DELETE		0x90	/* delete index tuples from a page */
+#define XLOG_HASH_CLEAR_GARBAGE 0xA0	/* clear garbage flag in primary
+										 * bucket page after deleting tuples
+										 * that are moved due to split	*/
+#define XLOG_HASH_UPDATE_META_PAGE	0xB0		/* update meta page after
+												 * vacuum */
+
+
+/*
+ * xl_hash_split_allocpage flag values, 8 bits are available.
+ */
+#define XLH_SPLIT_META_UPDATE_MASKS		(1<<0)
+#define XLH_SPLIT_META_UPDATE_SPLITPOINT		(1<<1)
+
+/*
+ * Data to regenerate the meta-data page
+ */
+typedef struct xl_hash_metadata
+{
+	HashMetaPageData metadata;
+}	xl_hash_metadata;
+
+/*
+ * This is what we need to know about a HASH index create.
+ *
+ * Backup block 0: metapage
+ */
+typedef struct xl_hash_createidx
+{
+	double		num_tuples;
+	RegProcedure procid;
+	uint16		ffactor;
+}	xl_hash_createidx;
+#define SizeOfHashCreateIdx (offsetof(xl_hash_createidx, ffactor) + sizeof(uint16))
+
+/*
+ * This is what we need to know about simple (without split) insert.
+ *
+ * This data record is used for XLOG_HASH_INSERT
+ *
+ * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 1: metapage (xl_hash_metadata)
+ */
+typedef struct xl_hash_insert
+{
+	OffsetNumber offnum;
+}	xl_hash_insert;
+
+#define SizeOfHashInsert	(offsetof(xl_hash_insert, offnum) + sizeof(OffsetNumber))
+
+/*
+ * This is what we need to know about addition of overflow page.
+ *
+ * This data record is used for XLOG_HASH_ADD_OVFL_PAGE
+ *
+ * Backup Blk 0: newly allocated overflow page
+ * Backup Blk 1: page before new overflow page in the bucket chain
+ * Backup Blk 2: bitmap page
+ * Backup Blk 3: new bitmap page
+ * Backup Blk 4: metapage
+ */
+typedef struct xl_hash_addovflpage
+{
+	uint16		bmsize;
+	bool		bmpage_found;
+}	xl_hash_addovflpage;
+
+#define SizeOfHashAddOvflPage	\
+	(offsetof(xl_hash_addovflpage, bmpage_found) + sizeof(bool))
+
+/*
+ * This is what we need to know about allocating a page for split.
+ *
+ * This data record is used for XLOG_HASH_SPLIT_ALLOCATE_PAGE
+ *
+ * Backup Blk 0: page for old bucket
+ * Backup Blk 1: page for new bucket
+ * Backup Blk 2: metapage
+ */
+typedef struct xl_hash_split_allocpage
+{
+	uint32		new_bucket;
+	uint16		old_bucket_flag;
+	uint16		new_bucket_flag;
+	uint8		flags;
+}	xl_hash_split_allocpage;
+
+#define SizeOfHashSplitAllocPage	\
+	(offsetof(xl_hash_split_allocpage, flags) + sizeof(uint8))
+
+/*
+ * This is what we need to know about completing the split operation.
+ *
+ * This data record is used for XLOG_HASH_SPLIT_COMPLETE
+ *
+ * Backup Blk 0: page for old bucket
+ * Backup Blk 1: page for new bucket
+ */
+typedef struct xl_hash_split_complete
+{
+	uint16		old_bucket_flag;
+	uint16		new_bucket_flag;
+}	xl_hash_split_complete;
+
+#define SizeOfHashSplitComplete \
+	(offsetof(xl_hash_split_complete, new_bucket_flag) + sizeof(uint16))
+
+/*
+ * This is what we need to know about move page contents required during
+ * squeeze operation.
+ *
+ * This data record is used for XLOG_HASH_MOVE_PAGE_CONTENTS
+ *
+ * Backup Blk 0: bucket page
+ * Backup Blk 1: page containing moved tuples
+ * Backup Blk 2: page from which tuples will be removed
+ */
+typedef struct xl_hash_move_page_contents
+{
+	uint16		ntups;
+	bool		is_prim_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is same as
+												 * primary bucket page */
+}	xl_hash_move_page_contents;
+
+#define SizeOfHashMovePageContents	\
+	(offsetof(xl_hash_move_page_contents, is_prim_bucket_same_wrt) + sizeof(bool))
+
+/*
+ * This is what we need to know about the squeeze page operation.
+ *
+ * This data record is used for XLOG_HASH_SQUEEZE_PAGE
+ *
+ * Backup Blk 0: page containing tuples moved from freed overflow page
+ * Backup Blk 1: freed overflow page
+ * Backup Blk 2: page previous to the freed overflow page
+ * Backup Blk 3: page next to the freed overflow page
+ * Backup Blk 4: bitmap page containing info of freed overflow page
+ * Backup Blk 5: meta page
+ */
+typedef struct xl_hash_squeeze_page
+{
+	BlockNumber prevblkno;
+	BlockNumber nextblkno;
+	uint16		ntups;
+	bool		is_prim_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is same as
+												 * primary bucket page */
+	bool		is_prev_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is the
+												 * page previous to the freed
+												 * overflow page */
+}	xl_hash_squeeze_page;
+
+#define SizeOfHashSqueezePage	\
+	(offsetof(xl_hash_squeeze_page, is_prev_bucket_same_wrt) + sizeof(bool))
+
+/*
+ * This is what we need to know about the deletion of index tuples from a page.
+ *
+ * This data record is used for XLOG_HASH_DELETE
+ *
+ * Backup Blk 0: primary bucket page
+ * Backup Blk 1: page from which tuples are deleted
+ */
+typedef struct xl_hash_delete
+{
+	bool		is_primary_bucket_page; /* TRUE if the operation is for
+										 * primary bucket page */
+}	xl_hash_delete;
+
+#define SizeOfHashDelete	(offsetof(xl_hash_delete, is_primary_bucket_page) + sizeof(bool))
+
+/*
+ * This is what we need for metapage update operation.
+ *
+ * This data record is used for XLOG_HASH_UPDATE_META_PAGE
+ *
+ * Backup Blk 0: meta page
+ */
+typedef struct xl_hash_update_meta_page
+{
+	double		ntuples;
+}	xl_hash_update_meta_page;
+
+#define SizeOfHashUpdateMetaPage	\
+	(offsetof(xl_hash_update_meta_page, ntuples) + sizeof(double))
+
+/*
+ * This is what we need to initialize metapage.
+ *
+ * This data record is used for XLOG_HASH_INIT_META_PAGE
+ *
+ * Backup Blk 0: meta page
+ */
+typedef struct xl_hash_init_meta_page
+{
+	double		num_tuples;
+	RegProcedure procid;
+	uint16		ffactor;
+}	xl_hash_init_meta_page;
+
+#define SizeOfHashInitMetaPage		\
+	(offsetof(xl_hash_init_meta_page, ffactor) + sizeof(uint16))
+
+/*
+ * This is what we need to initialize bitmap page.
+ *
+ * This data record is used for XLOG_HASH_INIT_BITMAP_PAGE
+ *
+ * Backup Blk 0: bitmap page
+ * Backup Blk 1: meta page
+ */
+typedef struct xl_hash_init_bitmap_page
+{
+	uint16		bmsize;
+}	xl_hash_init_bitmap_page;
+
+#define SizeOfHashInitBitmapPage	\
+	(offsetof(xl_hash_init_bitmap_page, bmsize) + sizeof(uint16))
 
 extern void hash_redo(XLogReaderState *record);
 extern void hash_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index ad4ab5f..6ea46ef 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -425,6 +425,7 @@ extern Page PageGetTempPageCopySpecial(Page page);
 extern void PageRestoreTempPage(Page tempPage, Page oldPage);
 extern void PageRepairFragmentation(Page page);
 extern Size PageGetFreeSpace(Page page);
+extern Size PageGetFreeSpaceForMulTups(Page page, int ntups);
 extern Size PageGetExactFreeSpace(Page page);
 extern Size PageGetHeapFreeSpace(Page page);
 extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 76593e1..80089b9 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -2335,13 +2335,9 @@ Options: fastupdate=on, gin_pending_list_limit=128
 -- HASH
 --
 CREATE INDEX hash_i4_index ON hash_i4_heap USING hash (random int4_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_name_index ON hash_name_heap USING hash (random name_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_txt_index ON hash_txt_heap USING hash (random text_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_f8_index ON hash_f8_heap USING hash (random float8_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE UNLOGGED TABLE unlogged_hash_table (id int4);
 CREATE INDEX unlogged_hash_index ON unlogged_hash_table USING hash (id int4_ops);
 DROP TABLE unlogged_hash_table;
@@ -2350,7 +2346,6 @@ DROP TABLE unlogged_hash_table;
 -- maintenance_work_mem setting and fillfactor:
 SET maintenance_work_mem = '1MB';
 CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 EXPLAIN (COSTS OFF)
 SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
                       QUERY PLAN                       
diff --git a/src/test/regress/expected/enum.out b/src/test/regress/expected/enum.out
index 514d1d0..0e60304 100644
--- a/src/test/regress/expected/enum.out
+++ b/src/test/regress/expected/enum.out
@@ -383,7 +383,6 @@ DROP INDEX enumtest_btree;
 -- Hash index / opclass with the = operator
 --
 CREATE INDEX enumtest_hash ON enumtest USING hash (col);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 SELECT * FROM enumtest WHERE col = 'orange';
   col   
 --------
diff --git a/src/test/regress/expected/macaddr.out b/src/test/regress/expected/macaddr.out
index e84ff5f..151f9ce 100644
--- a/src/test/regress/expected/macaddr.out
+++ b/src/test/regress/expected/macaddr.out
@@ -41,7 +41,6 @@ SELECT * FROM macaddr_data;
 
 CREATE INDEX macaddr_data_btree ON macaddr_data USING btree (b);
 CREATE INDEX macaddr_data_hash ON macaddr_data USING hash (b);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 SELECT a, b, trunc(b) FROM macaddr_data ORDER BY 2, 1;
  a  |         b         |       trunc       
 ----+-------------------+-------------------
diff --git a/src/test/regress/expected/replica_identity.out b/src/test/regress/expected/replica_identity.out
index 39a60a5..62f2ad7 100644
--- a/src/test/regress/expected/replica_identity.out
+++ b/src/test/regress/expected/replica_identity.out
@@ -12,7 +12,6 @@ CREATE UNIQUE INDEX test_replica_identity_keyab_key ON test_replica_identity (ke
 CREATE UNIQUE INDEX test_replica_identity_oid_idx ON test_replica_identity (oid);
 CREATE UNIQUE INDEX test_replica_identity_nonkey ON test_replica_identity (keya, nonkey);
 CREATE INDEX test_replica_identity_hash ON test_replica_identity USING hash (nonkey);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE UNIQUE INDEX test_replica_identity_expr ON test_replica_identity (keya, keyb, (3));
 CREATE UNIQUE INDEX test_replica_identity_partial ON test_replica_identity (keya, keyb) WHERE keyb != '3';
 -- default is 'd'/DEFAULT for user created tables
diff --git a/src/test/regress/expected/uuid.out b/src/test/regress/expected/uuid.out
index 59cb1e0..d907519 100644
--- a/src/test/regress/expected/uuid.out
+++ b/src/test/regress/expected/uuid.out
@@ -114,7 +114,6 @@ SELECT COUNT(*) FROM guid1 WHERE guid_field >= '22222222-2222-2222-2222-22222222
 -- btree and hash index creation test
 CREATE INDEX guid1_btree ON guid1 USING BTREE (guid_field);
 CREATE INDEX guid1_hash  ON guid1 USING HASH  (guid_field);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 -- unique index test
 CREATE UNIQUE INDEX guid1_unique_BTREE ON guid1 USING BTREE (guid_field);
 -- should fail
#53Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#52)
Re: Write Ahead Logging for Hash Indexes

On Sun, Sep 25, 2016 at 10:30 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Sep 23, 2016 at 5:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think I was slightly wrong here. For full page writes, we do use
RBM_ZERO_AND_LOCK mode to read the page, and in that mode we don't do
the page verification check but rather blindly set the page to zero
and then overwrite it with the full page image. So after my fix, you
will not see the checksum failure error. I have a fix ready, but am
still doing some more verification. If everything passes, I will
share the patch in a day or so.

Attached patch fixes the problem, now we do perform full page writes
for bitmap pages. Apart from that, I have rebased the patch based on
latest concurrent index patch [1]. I have updated the README as well
to reflect the WAL logging related information for different
operations.

With attached patch, all the review comments or issues found till now
are addressed.

I forgot to mention that Ashutosh has tested this patch for a day
using Jeff's tool and he didn't find any problem. Also, he has found
a way to easily reproduce the problem. Ashutosh, can you share your
changes to the script using which you reproduced the problem?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#54Ashutosh Sharma
ashu.coek88@gmail.com
In reply to: Amit Kapila (#53)
Re: Write Ahead Logging for Hash Indexes

Hi All,

I forgot to mention that Ashutosh has tested this patch for a day
using Jeff's tool and he didn't find any problem. Also, he has found
a way to easily reproduce the problem. Ashutosh, can you share your
changes to the script using which you reproduced the problem?

I made slight changes to Jeff Janes' patch (crash_REL10.patch) to
reproduce the issue reported by him earlier [1]. I changed
crash_REL10.patch such that a torn page write happens only for a
bitmap page in a hash index. As described by Amit in [2] & [3], we
were not logging a full page image for the bitmap page, and that's the
reason we were not able to restore the bitmap page in case it was a
torn page, which would eventually result in the error below.

38422 01000 2016-09-19 16:25:50.061 PDT:WARNING: page verification
failed, calculated checksum 65067 but expected 21260
38422 01000 2016-09-19 16:25:50.061 PDT:CONTEXT: xlog redo at
3F/22053B50 for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
38422 XX001 2016-09-19 16:25:50.071 PDT:FATAL: invalid page in block
9 of relation base/16384/17334
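
For reference, here is a minimal sketch of the pattern that makes the fix
work, not the actual hunk from Amit's patch (mapbuf, xlrec and recptr are
illustrative local variables): when the modified bitmap buffer is registered
in the WAL record without REGBUF_NO_IMAGE, XLogInsert() can attach a full
page image, and on replay XLogReadBufferForRedo() then restores the page
using RBM_ZERO_AND_LOCK instead of verifying the possibly torn on-disk
contents.

    XLogBeginInsert();
    XLogRegisterData((char *) &xlrec, SizeOfHashAddOvflPage);
    /* register the bitmap page as a backup block so an FPI can be taken */
    XLogRegisterBuffer(2, mapbuf, REGBUF_STANDARD);
    recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_ADD_OVFL_PAGE);
    PageSetLSN(BufferGetPage(mapbuf), recptr);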

Jeff Janes' original patch had the following conditions for a torn page write:

if (JJ_torn_page > 0 && counter++ > JJ_torn_page && !RecoveryInProgress())
nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ/3);

I have added some more conditions to the above if () to ensure that
the issue associated with the bitmap page gets reproduced easily:

if (JJ_torn_page > 0 && counter++ > JJ_torn_page &&
!RecoveryInProgress() && bucket_opaque->hasho_page_id == HASHO_PAGE_ID
&& bucket_opaque->hasho_flag & LH_BITMAP_PAGE)
nbytes = FileWrite(v->mdfd_vfd, buffer, 24);

Also, please note that I am allowing only 24 bytes of data, i.e. just
the page header, to be written to the disk file, so that the issue
will be observed even if a single overflow page is added to the hash
index.
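
To spell out how bucket_opaque can be obtained inside mdwrite(), here is a
sketch (not the exact diff that was applied; it assumes md.c includes
access/hash.h, and the opaque-pointer derivation shown is just one way to do
it). The hasho_page_id and LH_BITMAP_PAGE checks make the torn write a
no-op for anything that is not a hash bitmap page:

    HashPageOpaque bucket_opaque = (HashPageOpaque)
        ((char *) buffer + BLCKSZ - MAXALIGN(sizeof(HashPageOpaqueData)));

    if (JJ_torn_page > 0 && counter++ > JJ_torn_page &&
        !RecoveryInProgress() &&
        bucket_opaque->hasho_page_id == HASHO_PAGE_ID &&
        (bucket_opaque->hasho_flag & LH_BITMAP_PAGE) != 0)
        nbytes = FileWrite(v->mdfd_vfd, buffer, 24);    /* torn write: page header only */
    else
        nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);    /* normal full-page write */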

Note: You can find Jeff Janes' patch for torn page writes at [4].

[1]: /messages/by-id/CAMkU=1zGcfNTWWxnud8j_p0vb1ENGcngkwqZgPUEnwZmAN+XQw@mail.gmail.com
[2]: /messages/by-id/CAA4eK1LmQZGnYhSHXDDCOsSb_0U-gsxReEmSDRgCZr=AdKbTEg@mail.gmail.com
[3]: /messages/by-id/CAA4eK1+k9tGPw7vOACRo+4h5n=utHnSuGgMoJrLANybAjNBW9w@mail.gmail.com
[4]: /messages/by-id/CAMkU=1xRt8jBBB7g_7K41W00=br9UrxMVn_rhWhKPLaHfEdM5A@mail.gmail.com

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com


#55Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: Amit Kapila (#52)
Re: Write Ahead Logging for Hash Indexes

On 09/25/2016 01:00 AM, Amit Kapila wrote:

Attached patch fixes the problem, now we do perform full page writes
for bitmap pages. Apart from that, I have rebased the patch based on
latest concurrent index patch [1]. I have updated the README as well
to reflect the WAL logging related information for different
operations.

With attached patch, all the review comments or issues found till now
are addressed.

I have been running various tests and applications with this patch
together with the CHI v8 patch [1].

As I haven't seen any failures and don't currently have additional
feedback, I'm moving this patch to "Ready for Committer" for their feedback.

If others have comments, move the patch status back in the CommitFest
application, please.

[1]: /messages/by-id/CAA4eK1+X=8sUd1UCZDZnE3D9CGi9kw+kjxp2Tnw7SX5w8pLBNw@mail.gmail.com

Best regards,
Jesper


#56Jeff Janes
jeff.janes@gmail.com
In reply to: Amit Kapila (#52)
Re: Write Ahead Logging for Hash Indexes

On Sat, Sep 24, 2016 at 10:00 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Fri, Sep 23, 2016 at 5:34 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

I think I was slightly wrong here. For full page writes, we do use
RBM_ZERO_AND_LOCK mode to read the page, and in that mode we don't do
the page verification check but rather blindly set the page to zero
and then overwrite it with the full page image. So after my fix, you
will not see the checksum failure error. I have a fix ready, but am
still doing some more verification. If everything passes, I will
share the patch in a day or so.

Attached patch fixes the problem, now we do perform full page writes
for bitmap pages. Apart from that, I have rebased the patch based on
latest concurrent index patch [1]. I have updated the README as well
to reflect the WAL logging related information for different
operations.

With attached patch, all the review comments or issues found till now
are addressed.

This needs to be updated to apply over concurrent_hash_index_v10.patch.

Unless we want to wait until that work is committed before doing more
review and testing on this.

Thanks,

Jeff

#57Amit Kapila
amit.kapila16@gmail.com
In reply to: Jeff Janes (#56)
Re: Write Ahead Logging for Hash Indexes

On Tue, Nov 8, 2016 at 10:56 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Sat, Sep 24, 2016 at 10:00 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Fri, Sep 23, 2016 at 5:34 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

I think I was slightly wrong here. For full page writes, we do use
RBM_ZERO_AND_LOCK mode to read the page, and in that mode we don't do
the page verification check but rather blindly set the page to zero
and then overwrite it with the full page image. So after my fix, you
will not see the checksum failure error. I have a fix ready, but am
still doing some more verification. If everything passes, I will
share the patch in a day or so.

Attached patch fixes the problem, now we do perform full page writes
for bitmap pages. Apart from that, I have rebased the patch based on
latest concurrent index patch [1]. I have updated the README as well
to reflect the WAL logging related information for different
operations.

With attached patch, all the review comments or issues found till now
are addressed.

This needs to be updated to apply over concurrent_hash_index_v10.patch.

Unless we want to wait until that work is committed before doing more review
and testing on this.

The concurrent hash index patch is getting changed, and some of the
changes need corresponding changes in this patch as well. So, I think
after it gets somewhat stabilized, I will update this patch as well.
I am not sure it is a good idea to update it for every version of the
hash index patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#58Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#57)
Re: Write Ahead Logging for Hash Indexes

On Wed, Nov 9, 2016 at 7:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Nov 8, 2016 at 10:56 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

Unless we want to wait until that work is committed before doing more review
and testing on this.

The concurrent hash index patch is getting changed, and some of the
changes need corresponding changes in this patch as well. So, I think
after it gets somewhat stabilized, I will update this patch as well.

Now that concurrent hash index patch is committed [1], I will work on
rebasing this patch. Note, I have moved this to next CF.

[1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=6d46f4783efe457f74816a75173eb23ed8930020

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#59Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#58)
Re: Write Ahead Logging for Hash Indexes

On Thu, Dec 1, 2016 at 1:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Nov 9, 2016 at 7:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Nov 8, 2016 at 10:56 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

Unless we want to wait until that work is committed before doing more review
and testing on this.

The concurrent hash index patch is getting changed, and some of the
changes need corresponding changes in this patch as well. So, I think
after it gets somewhat stabilized, I will update this patch as well.

Now that concurrent hash index patch is committed [1], I will work on
rebasing this patch. Note, I have moved this to next CF.

Thanks. I am thinking that it might make sense to try to get the
"microvacuum support for hash index" and "cache hash index meta page"
patches committed before this one, because I'm guessing they are much
simpler than this one, and I think therefore that the review of those
patches can probably move fairly quickly. Of course, ideally I can
also start reviewing this one in the meantime. Does that make sense
to you?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#60Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#59)
Re: Write Ahead Logging for Hash Indexes

On Thu, Dec 1, 2016 at 9:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Dec 1, 2016 at 1:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Nov 9, 2016 at 7:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Nov 8, 2016 at 10:56 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

Unless we want to wait until that work is committed before doing more review
and testing on this.

The concurrent hash index patch is getting changed, and some of the
changes need corresponding changes in this patch as well. So, I think
after it gets somewhat stabilized, I will update this patch as well.

Now that concurrent hash index patch is committed [1], I will work on
rebasing this patch. Note, I have moved this to next CF.

Thanks. I am thinking that it might make sense to try to get the
"microvacuum support for hash index" and "cache hash index meta page"
patches committed before this one, because I'm guessing they are much
simpler than this one, and I think therefore that the review of those
patches can probably move fairly quickly.

I think it makes sense to move "cache hash index meta page" first;
however, "microvacuum support for hash index" is based on the WAL
patch, as the action in that patch (the delete op) also needs to be
logged. One idea could be to try to split the patch so that the WAL
logging can be done as a separate patch, but I am not sure it is
worth it.

Of course, ideally I can
also start reviewing this one in the meantime. Does that make sense
to you?

You can start reviewing some of the operations like "Create Index"
and "Insert". However, some changes are required because of the
change in locking strategy for Vacuum. I am planning to work on
rebasing it and making the required changes next week.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#61Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#60)
Re: Write Ahead Logging for Hash Indexes

On Thu, Dec 1, 2016 at 6:51 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Thanks. I am thinking that it might make sense to try to get the
"microvacuum support for hash index" and "cache hash index meta page"
patches committed before this one, because I'm guessing they are much
simpler than this one, and I think therefore that the review of those
patches can probably move fairly quickly.

I think it makes sense to move "cache hash index meta page" first;
however, "microvacuum support for hash index" is based on the WAL
patch, as the action in that patch (the delete op) also needs to be
logged. One idea could be to try to split the patch so that the WAL
logging can be done as a separate patch, but I am not sure it is
worth it.

The thing is, there's a fair amount of locking stupidity in what just
got committed because of the requirement that the TID doesn't decrease
within a page. I'd really like to get that fixed.

Of course, ideally I can
also start reviewing this one in the meantime. Does that make sense
to you?

You can start reviewing some of the operations like "Create Index"
and "Insert". However, some changes are required because of the
change in locking strategy for Vacuum. I am planning to work on
rebasing it and making the required changes next week.

I'll review after that, since I have other things to review meanwhile.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#62Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#61)
1 attachment(s)
Re: Write Ahead Logging for Hash Indexes

On Fri, Dec 2, 2016 at 7:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Dec 1, 2016 at 6:51 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Of course, ideally I can
also start reviewing this one in the meantime. Does that make sense
to you?

You can start reviewing some of the operations like "Create Index"
and "Insert". However, some changes are required because of the
change in locking strategy for Vacuum. I am planning to work on
rebasing it and making the required changes next week.

I'll review after that, since I have other things to review meanwhile.

Please find the rebased patch attached to this e-mail. There is no
fundamental change in the patch except for adapting the new locking
strategy in the squeeze operation. I have done sanity testing of the
patch with a master-standby setup, and Kuntal has helped me verify it
with his wal-consistency checker patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

wal_hash_index_v6.patch (application/octet-stream)
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 6eaed1e..dd6e851 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1536,19 +1536,6 @@ archive_command = 'local_backup_script.sh "%p" "%f"'
   <itemizedlist>
    <listitem>
     <para>
-     Operations on hash indexes are not presently WAL-logged, so
-     replay will not update these indexes.  This will mean that any new inserts
-     will be ignored by the index, updated rows will apparently disappear and
-     deleted rows will still retain pointers. In other words, if you modify a
-     table with a hash index on it then you will get incorrect query results
-     on a standby server.  When recovery completes it is recommended that you
-     manually <xref linkend="sql-reindex">
-     each such index after completing a recovery operation.
-    </para>
-   </listitem>
-
-   <listitem>
-    <para>
      If a <xref linkend="sql-createdatabase">
      command is executed while a base backup is being taken, and then
      the template database that the <command>CREATE DATABASE</> copied
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index fdf8b3e..512f78d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2128,10 +2128,9 @@ include_dir 'conf.d'
          has materialized a result set, no error will be generated even if the
          underlying rows in the referenced table have been vacuumed away.
          Some tables cannot safely be vacuumed early, and so will not be
-         affected by this setting.  Examples include system catalogs and any
-         table which has a hash index.  For such tables this setting will
-         neither reduce bloat nor create a possibility of a <literal>snapshot
-         too old</> error on scanning.
+         affected by this setting.  Examples include system catalogs.  For
+         such tables this setting will neither reduce bloat nor create a
+         possibility of a <literal>snapshot too old</> error on scanning.
         </para>
        </listitem>
       </varlistentry>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 5bedaf2..f0f42d8 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2323,12 +2323,6 @@ LOG:  database system is ready to accept read only connections
   <itemizedlist>
    <listitem>
     <para>
-     Operations on hash indexes are not presently WAL-logged, so
-     replay will not update these indexes.
-    </para>
-   </listitem>
-   <listitem>
-    <para>
      Full knowledge of running transactions is required before snapshots
      can be taken. Transactions that use large numbers of subtransactions
      (currently greater than 64) will delay the start of read only
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 271c135..e40750e 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -193,18 +193,6 @@ CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable>
 </synopsis>
   </para>
 
-  <caution>
-   <para>
-    Hash index operations are not presently WAL-logged,
-    so hash indexes might need to be rebuilt with <command>REINDEX</>
-    after a database crash if there were unwritten changes.
-    Also, changes to hash indexes are not replicated over streaming or
-    file-based replication after the initial base backup, so they
-    give wrong answers to queries that subsequently use them.
-    For these reasons, hash index use is presently discouraged.
-   </para>
-  </caution>
-
   <para>
    <indexterm>
     <primary>index</primary>
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index fcb7a60..7163b03 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -510,19 +510,6 @@ Indexes:
    they can be useful.
   </para>
 
-  <caution>
-   <para>
-    Hash index operations are not presently WAL-logged,
-    so hash indexes might need to be rebuilt with <command>REINDEX</>
-    after a database crash if there were unwritten changes.
-    Also, changes to hash indexes are not replicated over streaming or
-    file-based replication after the initial base backup, so they
-    give wrong answers to queries that subsequently use them.
-    Hash indexes are also not properly restored during point-in-time
-    recovery.  For these reasons, hash index use is presently discouraged.
-   </para>
-  </caution>
-
   <para>
    Currently, only the B-tree, GiST, GIN, and BRIN index methods support
    multicolumn indexes. Up to 32 fields can be specified by default.
diff --git a/src/backend/access/hash/Makefile b/src/backend/access/hash/Makefile
index e2e7e91..b154569 100644
--- a/src/backend/access/hash/Makefile
+++ b/src/backend/access/hash/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = hash.o hashfunc.o hashinsert.o hashovfl.o hashpage.o hashsearch.o \
-       hashsort.o hashutil.o hashvalidate.o
+       hashsort.o hashutil.o hashvalidate.o hash_xlog.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 01ea115..06ef477 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -248,7 +248,6 @@ The insertion algorithm is rather similar:
          split happened)
 		take the buffer content lock on bucket page in exclusive mode
 		retake meta page buffer content lock in shared mode
-	release pin on metapage
 -- (so far same as reader, except for acquisition of buffer content lock in
 	exclusive mode on primary bucket page)
 	if the bucket-being-split flag is set for a bucket and pin count on it is
@@ -263,13 +262,17 @@ The insertion algorithm is rather similar:
 	if current page is full, release lock but not pin, read/exclusive-lock
      next page; repeat as needed
 	>> see below if no space in any page of bucket
+	take buffer content lock in exclusive mode on metapage
 	insert tuple at appropriate place in page
-	mark current page dirty and release buffer content lock and pin
-	if the current page is not a bucket page, release the pin on bucket page
-	pin meta page and take buffer content lock in exclusive mode
+	mark current page dirty
 	increment tuple count, decide if split needed
-	mark meta page dirty and release buffer content lock and pin
-	done if no split needed, else enter Split algorithm below
+	mark meta page dirty
+	write WAL for insertion of tuple
+	release the buffer content lock on metapage
+	release buffer content lock on current page
+	if current page is not a bucket page, release the pin on bucket page
+	if split is needed, enter Split algorithm below
+	release the pin on metapage
 
 To speed searches, the index entries within any individual index page are
 kept sorted by hash code; the insertion code must take care to insert new
@@ -304,12 +307,17 @@ existing bucket in two, thereby lowering the fill ratio:
        try to finish the split and the cleanup work
        if that succeeds, start over; if it fails, give up
 	mark the old and new buckets indicating split is in progress
+	mark both old and new buckets as dirty
+	write WAL for allocation of new page for split
 	copy the tuples that belongs to new bucket from old bucket, marking
      them as moved-by-split
+	write WAL record for moving tuples to new page once the new page is full
+	or all the pages of old bucket are finished
 	release lock but not pin for primary bucket page of old bucket,
 	 read/shared-lock next page; repeat as needed
 	clear the bucket-being-split and bucket-being-populated flags
 	mark the old bucket indicating split-cleanup
+	write WAL for changing the flags on both old and new buckets
 
 The split operation's attempt to acquire cleanup-lock on the old bucket number
 could fail if another process holds any lock or pin on it.  We do not want to
@@ -345,6 +353,8 @@ The fourth operation is garbage collection (bulk deletion):
 		acquire cleanup lock on primary bucket page
 		loop:
 			scan and remove tuples
+			mark the target page dirty
+			write WAL for deleting tuples from target page
 			if this is the last bucket page, break out of loop
 			pin and x-lock next page
 			release prior lock and pin (except keep pin on primary bucket page)
@@ -359,7 +369,8 @@ The fourth operation is garbage collection (bulk deletion):
 	check if number of buckets changed
 	if so, release content lock and pin and return to for-each-bucket loop
 	else update metapage tuple count
-	mark meta page dirty and release buffer content lock and pin
+	 mark meta page dirty and write WAL for update of metapage
+	 release buffer content lock and pin
 
 Note that this is designed to allow concurrent splits and scans.  If a split
 occurs, tuples relocated into the new bucket will be visited twice by the
@@ -401,18 +412,16 @@ Obtaining an overflow page:
 	search for a free page (zero bit in bitmap)
 	if found:
 		set bit in bitmap
-		mark bitmap page dirty and release content lock
+		mark bitmap page dirty
 		take metapage buffer content lock in exclusive mode
 		if first-free-bit value did not change,
 			update it and mark meta page dirty
-		release meta page buffer content lock
-		return page number
 	else (not found):
 	release bitmap page buffer content lock
 	loop back to try next bitmap page, if any
 -- here when we have checked all bitmap pages; we hold meta excl. lock
 	extend index to add another overflow page; update meta information
-	mark meta page dirty and release buffer content lock
+	mark meta page dirty
 	return page number
 
 It is slightly annoying to release and reacquire the metapage lock
@@ -432,12 +441,15 @@ like this:
 
 	-- having determined that no space is free in the target bucket:
 	remember last page of bucket, drop write lock on it
-	call free-page-acquire routine
 	re-write-lock last page of bucket
 	if it is not last anymore, step to the last page
-	update (former) last page to point to new page
+	execute free-page-acquire (Obtaining an overflow page) mechanism described above
+	update (former) last page to point to the new page and mark the  buffer dirty.
 	write-lock and initialize new page, with back link to former last page
-	write and release former last page
+	write WAL for addition of overflow page
+	release the locks on meta page and bitmap page acquired in free-page-acquire algorithm
+	release the lock on former last page
+	release the lock on new overflow page
 	insert tuple into new page
 	-- etc.
 
@@ -464,12 +476,14 @@ accessors of pages in the bucket.  The algorithm is:
 	determine which bitmap page contains the free space bit for page
 	release meta page buffer content lock
 	pin bitmap page and take buffer content lock in exclusive mode
-	update bitmap bit
-	mark bitmap page dirty and release buffer content lock and pin
-	if page number is less than what we saw as first-free-bit in meta:
 	retake meta page buffer content lock in exclusive mode
+	move (insert) tuples that belong to the overflow page being freed
+	update bitmap bit
+	mark bitmap page dirty
 	if page number is still less than first-free-bit,
 		update first-free-bit field and mark meta page dirty
+	write WAL for delinking overflow page operation
+	release buffer content lock and pin
 	release meta page buffer content lock and pin
 
 We have to do it this way because we must clear the bitmap bit before
@@ -480,8 +494,101 @@ page acquirer will scan more bitmap bits than he needs to.  What must be
 avoided is having first-free-bit greater than the actual first free bit,
 because then that free page would never be found by searchers.
 
-All the freespace operations should be called while holding no buffer
-locks.  Since they need no lmgr locks, deadlock is not possible.
+The reason for moving tuples from the overflow page while delinking the latter is
+to make that an atomic operation.  Not doing so could lead to spurious reads
+on standby.  Basically, the user might see the same tuple twice.
+
+
+WAL Considerations
+------------------
+
+The hash index operations like create index, insert, delete, bucket split,
+allocate overflow page, squeeze (overflow pages are freed in this operation)
+in themselves doesn't guarantee hash index consistency after a crash.  To
+provide robustness, we write WAL for each of these operations which get
+replayed on crash recovery.
+
+Multiple WAL records are being written for create index operation, first for
+initializing the metapage, followed by one for each new bucket created during
+operation followed by one for initializing the bitmap page.  If the system
+crashes after any operation, the whole operation is rolled back.  We have
+considered to write a single WAL record for the whole operation, but for that
+we need to limit the number of initial buckets that can be created during the
+operation. As we can log only the fixed number of pages XLR_MAX_BLOCK_ID (32)
+with current XLog machinery, it is better to write multiple WAL records for this
+operation.  The downside of restricting the number of buckets is that we need
+to perform the split operation if the number of tuples is more than what can be
+accommodated in the initial set of buckets, and it is not unusual to have a large
+number of tuples during a create index operation.
+
+Ordinary item insertions (that don't force a page split or need a new overflow
+page) are single WAL entries.  They touch a single bucket page and meta page,
+metapage is updated during replay as it is updated during original operation.
+
+An insertion that causes an addition of an overflow page is logged as a single
+WAL entry preceded by a WAL entry for new overflow page required to insert
+a tuple.  There is a corner case where by the time we try to use newly
+allocated overflow page, it already gets used by concurrent insertions, for
+such a case, a new overflow page will be allocated and a separate WAL entry
+will be made for the same.
+
+An insertion that causes a bucket split is logged as a single WAL entry,
+followed by a WAL entry for allocating a new bucket, followed by a WAL entry
+for each overflow bucket page in the new bucket to which the tuples are moved
+from old bucket, followed by a WAL entry to indicate that split is complete
+for both old and new buckets.
+
+A split operation which requires overflow pages to complete the operation
+will need to write a WAL record for each new allocation of an overflow page.
+
+As splitting involves multiple atomic actions, it's possible that the
+system crashes between moving tuples from bucket pages of the old bucket to new
+bucket.  In such a case, after recovery, both the old and new buckets will be
+marked with bucket-being-split and bucket-being-populated flags respectively
+which indicates that split is in progress for those buckets.  The reader
+algorithm works correctly, as it will scan both the old and new buckets when
+the split is in progress as explained in the reader algorithm section above.
+
+We finish the split at next insert or split operation on the old bucket as
+explained in insert and split algorithm above.  It could be done during
+searches, too, but it seems best not to put any extra updates in what would
+otherwise be a read-only operation (updating is not possible in hot standby
+mode anyway).  It would seem natural to complete the split in VACUUM, but since
+splitting a bucket might require allocating a new page, it might fail if you
+run out of disk space.  That would be bad during VACUUM - the reason for
+running VACUUM in the first place might be that you run out of disk space,
+and now VACUUM won't finish because you're out of disk space.  In contrast,
+an insertion can require enlarging the physical file anyway.
+
+Deletion of tuples from a bucket is performed for two reasons, one for
+removing the dead tuples and other for removing the tuples that are moved by
+split.  WAL entry is made for each bucket page from which tuples are removed,
+followed by a WAL entry to clear the garbage flag if the tuples moved by split
+are removed.  Another separate WAL entry is made for updating the metapage if
+the deletion is performed for removing the dead tuples by vacuum.
+
+As deletion involves multiple atomic operations, it is quite possible that
+system crashes after (a) removing tuples from some of the bucket pages
+(b) before clearing the garbage flag (c) before updating the metapage.  If the
+system crashes before completing (b), it will again try to clean the bucket
+during next vacuum or insert after recovery which can have some performance
+impact, but it will work fine. If the system crashes before completing (c),
+after recovery there could be some additional splits till the next vacuum
+updates the metapage, but the other operations like insert, delete and scan
+will work correctly.  We can fix this problem by actually updating the metapage
+based on delete operation during replay, but not sure if it is worth the
+complication.
+
+Squeeze operation moves tuples from one of the buckets later in the chain to
+one of the buckets earlier in the chain and writes a WAL record when either the
+bucket to which it is writing tuples is filled or the bucket from which it
+is removing the tuples becomes empty.
+
+As the squeeze operation involves multiple atomic operations, it is
+quite possible that the system crashes before completing the operation on the
+entire bucket.  After recovery, the operations will work correctly, but
+the index will remain bloated and can impact performance of read and
+insert operations until the next vacuum squeezes the bucket completely.
 
 
 Other Notes
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 6806e32..04fe0c2 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -27,6 +27,7 @@
 #include "optimizer/plancat.h"
 #include "utils/index_selfuncs.h"
 #include "utils/rel.h"
+#include "miscadmin.h"
 
 
 /* Working state for hashbuild and its callback */
@@ -115,7 +116,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	estimate_rel_size(heap, NULL, &relpages, &reltuples, &allvisfrac);
 
 	/* Initialize the hash index metadata page and initial buckets */
-	num_buckets = _hash_metapinit(index, reltuples, MAIN_FORKNUM);
+	num_buckets = _hash_init(index, reltuples, MAIN_FORKNUM);
 
 	/*
 	 * If we just insert the tuples into the index in scan order, then
@@ -177,7 +178,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 void
 hashbuildempty(Relation index)
 {
-	_hash_metapinit(index, 0, INIT_FORKNUM);
+	_hash_init(index, 0, INIT_FORKNUM);
 }
 
 /*
@@ -297,6 +298,11 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		buf = so->hashso_curbuf;
 		Assert(BufferIsValid(buf));
 		page = BufferGetPage(buf);
+
+		/*
+		 * We don't need test for old snapshot here as the current buffer is
+		 * pinned, so vacuum can't clean the page.
+		 */
 		maxoffnum = PageGetMaxOffsetNumber(page);
 		for (offnum = ItemPointerGetOffsetNumber(current);
 			 offnum <= maxoffnum;
@@ -594,6 +600,7 @@ loop_top:
 	}
 
 	/* Okay, we're really done.  Update tuple count in metapage. */
+	START_CRIT_SECTION();
 
 	if (orig_maxbucket == metap->hashm_maxbucket &&
 		orig_ntuples == metap->hashm_ntuples)
@@ -619,7 +626,28 @@ loop_top:
 		num_index_tuples = metap->hashm_ntuples;
 	}
 
-	_hash_wrtbuf(rel, metabuf);
+	MarkBufferDirty(metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_update_meta_page xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.ntuples = metap->hashm_ntuples;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, sizeof(SizeOfHashUpdateMetaPage));
+
+		XLogRegisterBuffer(0, metabuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_UPDATE_META_PAGE);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	_hash_relbuf(rel, metabuf);
 
 	/* return statistics */
 	if (stats == NULL)
@@ -708,7 +736,6 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		OffsetNumber deletable[MaxOffsetNumber];
 		int			ndeletable = 0;
 		bool		retain_pin = false;
-		bool		curr_page_dirty = false;
 
 		vacuum_delay_point();
 
@@ -787,9 +814,41 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			/* No ereport(ERROR) until changes are logged */
+			START_CRIT_SECTION();
+
 			PageIndexMultiDelete(page, deletable, ndeletable);
 			bucket_dirty = true;
-			curr_page_dirty = true;
+
+			MarkBufferDirty(buf);
+
+			/* XLOG stuff */
+			if (RelationNeedsWAL(rel))
+			{
+				xl_hash_delete xlrec;
+				XLogRecPtr	recptr;
+
+				xlrec.is_primary_bucket_page = (buf == bucket_buf) ? true : false;
+
+				XLogBeginInsert();
+				XLogRegisterData((char *) &xlrec, SizeOfHashDelete);
+
+				/*
+				 * bucket buffer needs to be registered to ensure that we can
+				 * acquire a cleanup lock on it during replay.
+				 */
+				if (!xlrec.is_primary_bucket_page)
+					XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+				XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
+				XLogRegisterBufData(1, (char *) deletable,
+									ndeletable * sizeof(OffsetNumber));
+
+				recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_DELETE);
+				PageSetLSN(BufferGetPage(buf), recptr);
+			}
+
+			END_CRIT_SECTION();
 		}
 
 		/* bail out if there are no more pages to scan. */
@@ -804,15 +863,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 * release the lock on previous page after acquiring the lock on next
 		 * page
 		 */
-		if (curr_page_dirty)
-		{
-			if (retain_pin)
-				_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
-			else
-				_hash_wrtbuf(rel, buf);
-			curr_page_dirty = false;
-		}
-		else if (retain_pin)
+		if (retain_pin)
 			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
 		else
 			_hash_relbuf(rel, buf);
@@ -845,10 +896,36 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		/* No ereport(ERROR) until changes are logged */
+		START_CRIT_SECTION();
+
 		bucket_opaque->hasho_flag &= ~LH_BUCKET_NEEDS_SPLIT_CLEANUP;
+
+		MarkBufferDirty(bucket_buf);
+
+		/* XLOG stuff */
+		if (RelationNeedsWAL(rel))
+		{
+			XLogRecPtr	recptr;
+
+			XLogBeginInsert();
+			XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+			recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_CLEANUP);
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
 	}
 
 	/*
+	 * we need to release and reacquire the lock on bucket buffer to ensure
+	 * that standby shouldn't see an intermediate state of it.
+	 */
+	_hash_chgbufaccess(rel, bucket_buf, HASH_READ, HASH_NOLOCK);
+	_hash_chgbufaccess(rel, bucket_buf, HASH_NOLOCK, HASH_WRITE);
+
+	/*
 	 * If we have deleted anything, try to compact free space.  For squeezing
 	 * the bucket, we must have a cleanup lock, else it can impact the
 	 * ordering of tuples for a scan that has started before it.
@@ -857,11 +934,5 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		_hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf,
 							bstrategy);
 	else
-		_hash_chgbufaccess(rel, bucket_buf, HASH_WRITE, HASH_NOLOCK);
-}
-
-void
-hash_redo(XLogReaderState *record)
-{
-	elog(PANIC, "hash_redo: unimplemented");
+		_hash_chgbufaccess(rel, bucket_buf, HASH_READ, HASH_NOLOCK);
 }
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
new file mode 100644
index 0000000..41429a7
--- /dev/null
+++ b/src/backend/access/hash/hash_xlog.c
@@ -0,0 +1,970 @@
+/*-------------------------------------------------------------------------
+ *
+ * hash_xlog.c
+ *	  WAL replay logic for hash index.
+ *
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/hash/hash_xlog.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/hash_xlog.h"
+#include "access/xlogutils.h"
+
+/*
+ * replay a hash index meta page
+ */
+static void
+hash_xlog_init_meta_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Page		page;
+	Buffer		metabuf;
+
+	xl_hash_init_meta_page *xlrec = (xl_hash_init_meta_page *) XLogRecGetData(record);
+
+	/* create the index' metapage */
+	metabuf = XLogInitBufferForRedo(record, 0);
+	Assert(BufferIsValid(metabuf));
+	_hash_init_metabuffer(metabuf, xlrec->num_tuples, xlrec->procid,
+						  xlrec->ffactor, true);
+	page = (Page) BufferGetPage(metabuf);
+	PageSetLSN(page, lsn);
+	MarkBufferDirty(metabuf);
+	/* all done */
+	UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay a hash index bitmap page
+ */
+static void
+hash_xlog_init_bitmap_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		bitmapbuf;
+	Buffer		metabuf;
+	Page		page;
+	HashMetaPage metap;
+	uint32		num_buckets;
+
+	xl_hash_init_bitmap_page *xlrec = (xl_hash_init_bitmap_page *) XLogRecGetData(record);
+
+	/*
+	 * Initialize bitmap page
+	 */
+	bitmapbuf = XLogInitBufferForRedo(record, 0);
+	_hash_initbitmapbuffer(bitmapbuf, xlrec->bmsize, true);
+	PageSetLSN(BufferGetPage(bitmapbuf), lsn);
+	MarkBufferDirty(bitmapbuf);
+	UnlockReleaseBuffer(bitmapbuf);
+
+	/* add the new bitmap page to the metapage's list of bitmaps */
+	if (XLogReadBufferForRedo(record, 1, &metabuf) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Note: in normal operation, we'd update the metapage while still
+		 * holding lock on the page we inserted into.  But during replay it's
+		 * not necessary to hold that lock, since no other index updates can
+		 * be happening concurrently.
+		 */
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+
+		num_buckets = metap->hashm_maxbucket + 1;
+		metap->hashm_mapp[metap->hashm_nmaps] = num_buckets + 1;
+		metap->hashm_nmaps++;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay a hash index insert without split
+ */
+static void
+hash_xlog_insert(XLogReaderState *record)
+{
+	HashMetaPage metap;
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_insert *xlrec = (xl_hash_insert *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		Size		datalen;
+		char	   *datapos = XLogRecGetBlockData(record, 0, &datalen);
+
+		page = BufferGetPage(buffer);
+
+		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+						false, false) == InvalidOffsetNumber)
+			elog(PANIC, "hash_insert_redo: failed to add item");
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+
+	if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Note: in normal operation, we'd update the metapage while still
+		 * holding lock on the page we inserted into.  But during replay it's
+		 * not necessary to hold that lock, since no other index updates can
+		 * be happening concurrently.
+		 */
+		page = BufferGetPage(buffer);
+		metap = HashPageGetMeta(page);
+		metap->hashm_ntuples += 1;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
+
+/*
+ * replay addition of overflow page for hash index
+ */
+static void
+hash_xlog_addovflpage(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_addovflpage *xlrec = (xl_hash_addovflpage *) XLogRecGetData(record);
+	Buffer		leftbuf;
+	Buffer		ovflbuf;
+	Buffer		metabuf;
+	BlockNumber leftblk;
+	BlockNumber rightblk;
+	BlockNumber newmapblk = InvalidBlockNumber;
+	Page		ovflpage;
+	HashPageOpaque ovflopaque;
+	uint32	   *num_bucket;
+	char	   *data;
+	Size datalen PG_USED_FOR_ASSERTS_ONLY;
+	bool		new_bmpage = false;
+
+	XLogRecGetBlockTag(record, 0, NULL, NULL, &rightblk);
+	XLogRecGetBlockTag(record, 1, NULL, NULL, &leftblk);
+
+	ovflbuf = XLogInitBufferForRedo(record, 0);
+	Assert(BufferIsValid(ovflbuf));
+
+	data = XLogRecGetBlockData(record, 0, &datalen);
+	num_bucket = (uint32 *) data;
+	Assert(datalen == sizeof(uint32));
+	_hash_initbuf(ovflbuf, *num_bucket, LH_OVERFLOW_PAGE, true);
+	/* update backlink */
+	ovflpage = BufferGetPage(ovflbuf);
+	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
+	ovflopaque->hasho_prevblkno = leftblk;
+
+	PageSetLSN(ovflpage, lsn);
+	MarkBufferDirty(ovflbuf);
+
+	if (XLogReadBufferForRedo(record, 1, &leftbuf) == BLK_NEEDS_REDO)
+	{
+		Page		leftpage;
+		HashPageOpaque leftopaque;
+
+		leftpage = BufferGetPage(leftbuf);
+		leftopaque = (HashPageOpaque) PageGetSpecialPointer(leftpage);
+		leftopaque->hasho_nextblkno = rightblk;
+
+		PageSetLSN(leftpage, lsn);
+		MarkBufferDirty(leftbuf);
+	}
+
+	/*
+	 * We need to release the locks once the prev pointer of overflow bucket
+	 * and next of left bucket are set, otherwise concurrent read might skip
+	 * the bucket.
+	 */
+	if (BufferIsValid(leftbuf))
+		UnlockReleaseBuffer(leftbuf);
+	UnlockReleaseBuffer(ovflbuf);
+
+	/*
+	 * Note: in normal operation, we'd update the bitmap and meta page while
+	 * still holding lock on the overflow pages we updated.  But during replay
+	 * it's not necessary to hold those locks, since no other index updates
+	 * can be happening concurrently.
+	 */
+	if (XLogRecHasBlockRef(record, 2))
+	{
+		Buffer		mapbuffer;
+
+		if (XLogReadBufferForRedo(record, 2, &mapbuffer) == BLK_NEEDS_REDO)
+		{
+			Page		mappage = (Page) BufferGetPage(mapbuffer);
+			uint32	   *freep = NULL;
+			char	   *data;
+			uint32	   *bitmap_page_bit;
+
+			freep = HashPageGetBitmap(mappage);
+
+			data = XLogRecGetBlockData(record, 2, &datalen);
+			bitmap_page_bit = (uint32 *) data;
+
+			SETBIT(freep, *bitmap_page_bit);
+
+			PageSetLSN(mappage, lsn);
+			MarkBufferDirty(mapbuffer);
+		}
+		if (BufferIsValid(mapbuffer))
+			UnlockReleaseBuffer(mapbuffer);
+	}
+
+	if (XLogRecHasBlockRef(record, 3))
+	{
+		Buffer		newmapbuf;
+
+		newmapbuf = XLogInitBufferForRedo(record, 3);
+
+		_hash_initbitmapbuffer(newmapbuf, xlrec->bmsize, true);
+
+		new_bmpage = true;
+		newmapblk = BufferGetBlockNumber(newmapbuf);
+
+		MarkBufferDirty(newmapbuf);
+		PageSetLSN(BufferGetPage(newmapbuf), lsn);
+
+		UnlockReleaseBuffer(newmapbuf);
+	}
+
+	if (XLogReadBufferForRedo(record, 4, &metabuf) == BLK_NEEDS_REDO)
+	{
+		HashMetaPage metap;
+		Page		page;
+		uint32	   *firstfree_ovflpage;
+
+		data = XLogRecGetBlockData(record, 4, &datalen);
+		firstfree_ovflpage = (uint32 *) data;
+
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+		metap->hashm_firstfree = *firstfree_ovflpage;
+
+		if (!xlrec->bmpage_found)
+		{
+			metap->hashm_spares[metap->hashm_ovflpoint]++;
+
+			if (new_bmpage)
+			{
+				Assert(BlockNumberIsValid(newmapblk));
+
+				metap->hashm_mapp[metap->hashm_nmaps] = newmapblk;
+				metap->hashm_nmaps++;
+				metap->hashm_spares[metap->hashm_ovflpoint]++;
+			}
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay allocation of page for split operation
+ */
+static void
+hash_xlog_split_allocpage(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_split_allocpage *xlrec = (xl_hash_split_allocpage *) XLogRecGetData(record);
+	Buffer		oldbuf;
+	Buffer		newbuf;
+	Buffer		metabuf;
+	Size datalen PG_USED_FOR_ASSERTS_ONLY;
+	char	   *data;
+	XLogRedoAction action;
+
+	/* replay the record for old bucket */
+	action = XLogReadBufferForRedo(record, 0, &oldbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		oldpage;
+		HashPageOpaque oldopaque;
+
+		oldpage = BufferGetPage(oldbuf);
+		oldopaque = (HashPageOpaque) PageGetSpecialPointer(oldpage);
+
+		oldopaque->hasho_flag = xlrec->old_bucket_flag;
+
+		PageSetLSN(oldpage, lsn);
+		MarkBufferDirty(oldbuf);
+	}
+
+	/*
+	 * There is no harm in releasing the lock on old bucket before new bucket
+	 * during replay as no other bucket splits can be happening concurrently.
+	 */
+	if (BufferIsValid(oldbuf))
+		UnlockReleaseBuffer(oldbuf);
+
+	/* replay the record for new bucket */
+	newbuf = XLogInitBufferForRedo(record, 1);
+	_hash_initbuf(newbuf, xlrec->new_bucket, xlrec->new_bucket_flag, true);
+	MarkBufferDirty(newbuf);
+	PageSetLSN(BufferGetPage(newbuf), lsn);
+
+	UnlockReleaseBuffer(newbuf);
+
+	/*
+	 * Note: in normal operation, we'd update the meta page while still
+	 * holding lock on the old and new bucket pages.  But during replay it's
+	 * not necessary to hold those locks, since no other bucket splits can be
+	 * happening concurrently.
+	 */
+
+	/* replay the record for metapage changes */
+	if (XLogReadBufferForRedo(record, 2, &metabuf) == BLK_NEEDS_REDO)
+	{
+		Page		page;
+		HashMetaPage metap;
+
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+		metap->hashm_maxbucket = xlrec->new_bucket;
+
+		data = XLogRecGetBlockData(record, 2, &datalen);
+
+		if (xlrec->flags & XLH_SPLIT_META_UPDATE_MASKS)
+		{
+			uint32		lowmask;
+			uint32	   *highmask;
+
+			/* extract low and high masks. */
+			memcpy(&lowmask, data, sizeof(uint32));
+			highmask = (uint32 *) ((char *) data + sizeof(uint32));
+
+			/* update metapage */
+			metap->hashm_lowmask = lowmask;
+			metap->hashm_highmask = *highmask;
+
+			data += sizeof(uint32) * 2;
+		}
+
+		if (xlrec->flags & XLH_SPLIT_META_UPDATE_SPLITPOINT)
+		{
+			uint32		ovflpoint;
+			uint32	   *ovflpages;
+
+			/* extract information of overflow pages. */
+			memcpy(&ovflpoint, data, sizeof(uint32));
+			ovflpages = (uint32 *) ((char *) data + sizeof(uint32));
+
+			/* update metapage */
+			metap->hashm_spares[ovflpoint] = *ovflpages;
+			metap->hashm_ovflpoint = ovflpoint;
+		}
+
+		MarkBufferDirty(metabuf);
+		PageSetLSN(BufferGetPage(metabuf), lsn);
+	}
+
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay of split operation
+ */
+static void
+hash_xlog_split_page(XLogReaderState *record)
+{
+	Buffer		buf;
+
+	if (XLogReadBufferForRedo(record, 0, &buf) != BLK_RESTORED)
+		elog(ERROR, "Hash split record did not contain a full-page image");
+
+	UnlockReleaseBuffer(buf);
+}
+
+/*
+ * replay completion of split operation
+ */
+static void
+hash_xlog_split_complete(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_split_complete *xlrec = (xl_hash_split_complete *) XLogRecGetData(record);
+	Buffer		oldbuf;
+	Buffer		newbuf;
+	XLogRedoAction action;
+
+	/* replay the record for old bucket */
+	action = XLogReadBufferForRedo(record, 0, &oldbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		oldpage;
+		HashPageOpaque oldopaque;
+
+		oldpage = BufferGetPage(oldbuf);
+		oldopaque = (HashPageOpaque) PageGetSpecialPointer(oldpage);
+
+		oldopaque->hasho_flag = xlrec->old_bucket_flag;
+
+		PageSetLSN(oldpage, lsn);
+		MarkBufferDirty(oldbuf);
+	}
+	if (BufferIsValid(oldbuf))
+		UnlockReleaseBuffer(oldbuf);
+
+	/* replay the record for new bucket */
+	action = XLogReadBufferForRedo(record, 1, &newbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		newpage;
+		HashPageOpaque nopaque;
+
+		newpage = BufferGetPage(newbuf);
+		nopaque = (HashPageOpaque) PageGetSpecialPointer(newpage);
+
+		nopaque->hasho_flag = xlrec->new_bucket_flag;
+
+		PageSetLSN(newpage, lsn);
+		MarkBufferDirty(newbuf);
+	}
+	if (BufferIsValid(newbuf))
+		UnlockReleaseBuffer(newbuf);
+}
+
+/*
+ * replay move of page contents for squeeze operation of hash index
+ */
+static void
+hash_xlog_move_page_contents(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_move_page_contents *xldata = (xl_hash_move_page_contents *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		writebuf = InvalidBuffer;
+	Buffer		deletebuf = InvalidBuffer;
+	XLogRedoAction action;
+
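+	/*
+	 * Block 0 is the primary bucket page (registered only when it differs
+	 * from the write page), block 1 is the page the tuples are added to,
+	 * and block 2 is the page they are deleted from.
+	 */
+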
+	/*
+	 * Make sure we have a cleanup lock on the primary bucket page before we
+	 * start the actual replay operation.  This ensures that neither a new
+	 * scan can start nor can one already be in progress during the replay of
+	 * this operation.  If scans were allowed here, they could miss some
+	 * records or see the same record multiple times.
+	 */
+	if (xldata->is_prim_bucket_same_wrt)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &writebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bucketbuf))
+			LockBufferForCleanup(bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &writebuf);
+	}
+
+	/* replay the record for adding entries in overflow buffer */
+	if (action == BLK_NEEDS_REDO)
+	{
+		Page		writepage;
+		char	   *begin;
+		char	   *data;
+		Size		datalen;
+		uint16		ninserted = 0;
+
+		data = begin = XLogRecGetBlockData(record, 1, &datalen);
+
+		writepage = (Page) BufferGetPage(writebuf);
+
+		if (xldata->ntups > 0)
+		{
+			OffsetNumber *towrite = (OffsetNumber *) data;
+
+			data += sizeof(OffsetNumber) * xldata->ntups;
+
+			while (data - begin < datalen)
+			{
+				IndexTuple	itup = (IndexTuple) data;
+				Size		itemsz;
+				OffsetNumber l;
+
+				itemsz = IndexTupleDSize(*itup);
+				itemsz = MAXALIGN(itemsz);
+
+				data += itemsz;
+
+				l = PageAddItem(writepage, (Item) itup, itemsz, towrite[ninserted], false, false);
+				if (l == InvalidOffsetNumber)
+					elog(ERROR, "hash_xlog_move_page_contents: failed to add item to hash index page, size %d bytes",
+						 (int) itemsz);
+
+				ninserted++;
+			}
+		}
+
+		/*
+		 * The number of tuples inserted must match the count in the REDO record.
+		 */
+		Assert(ninserted == xldata->ntups);
+
+		PageSetLSN(writepage, lsn);
+		MarkBufferDirty(writebuf);
+	}
+
+	/* replay the record for deleting entries from overflow buffer */
+	if (XLogReadBufferForRedo(record, 2, &deletebuf) == BLK_NEEDS_REDO)
+	{
+		Page		page;
+		char	   *ptr;
+		Size		len;
+
+		ptr = XLogRecGetBlockData(record, 2, &len);
+
+		page = (Page) BufferGetPage(deletebuf);
+
+		if (len > 0)
+		{
+			OffsetNumber *unused;
+			OffsetNumber *unend;
+
+			unused = (OffsetNumber *) ptr;
+			unend = (OffsetNumber *) ((char *) ptr + len);
+
+			if ((unend - unused) > 0)
+				PageIndexMultiDelete(page, unused, unend - unused);
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(deletebuf);
+	}
+
+	/*
+	 * Replay is complete, now we can release the buffers.  We keep the locks
+	 * until the end of the replay operation so that the lock on the primary
+	 * bucket page is held throughout.  We could release the lock on the
+	 * write buffer as soon as we are done with it (when it is not the same
+	 * as the primary bucket page), but that doesn't seem to be worth
+	 * complicating the code.
+	 */
+	if (BufferIsValid(deletebuf))
+		UnlockReleaseBuffer(deletebuf);
+
+	if (BufferIsValid(writebuf))
+		UnlockReleaseBuffer(writebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay squeeze page operation of hash index
+ */
+static void
+hash_xlog_squeeze_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_squeeze_page *xldata = (xl_hash_squeeze_page *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		writebuf;
+	Buffer		ovflbuf;
+	Buffer		prevbuf = InvalidBuffer;
+	Buffer		mapbuf;
+	XLogRedoAction action;
+
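+	/*
+	 * Block 0 is the primary bucket page (registered only when it differs
+	 * from the write page), block 1 is the write page, block 2 is the freed
+	 * overflow page, blocks 3 and 4 (if present) are the pages before and
+	 * after the freed page, block 5 is the bitmap page, and block 6 (if
+	 * present) is the metapage.
+	 */
+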
+	/*
+	 * Make sure we have a cleanup lock on the primary bucket page before we
+	 * start the actual replay operation.  This ensures that neither a new
+	 * scan can start nor can one already be in progress during the replay of
+	 * this operation.  If scans were allowed here, they could miss some
+	 * records or see the same record multiple times.
+	 */
+	if (xldata->is_prim_bucket_same_wrt)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &writebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bucketbuf))
+			LockBufferForCleanup(bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &writebuf);
+	}
+
+	/* replay the record for adding entries in overflow buffer */
+	if (action == BLK_NEEDS_REDO)
+	{
+		Page		writepage;
+		char	   *begin;
+		char	   *data;
+		Size		datalen;
+		uint16		ninserted = 0;
+
+		data = begin = XLogRecGetBlockData(record, 1, &datalen);
+
+		writepage = (Page) BufferGetPage(writebuf);
+
+		if (xldata->ntups > 0)
+		{
+			OffsetNumber *towrite = (OffsetNumber *) data;
+
+			data += sizeof(OffsetNumber) * xldata->ntups;
+
+			while (data - begin < datalen)
+			{
+				IndexTuple	itup = (IndexTuple) data;
+				Size		itemsz;
+				OffsetNumber l;
+
+				itemsz = IndexTupleDSize(*itup);
+				itemsz = MAXALIGN(itemsz);
+
+				data += itemsz;
+
+				l = PageAddItem(writepage, (Item) itup, itemsz, towrite[ninserted], false, false);
+				if (l == InvalidOffsetNumber)
+					elog(ERROR, "hash_xlog_squeeze_page: failed to add item to hash index page, size %d bytes",
+						 (int) itemsz);
+
+				ninserted++;
+			}
+		}
+
+		/*
+		 * The number of tuples inserted must match the count in the REDO record.
+		 */
+		Assert(ninserted == xldata->ntups);
+
+		/*
+		 * If the page to which we are adding tuples is the page previous to
+		 * the freed overflow page, then also update its hasho_nextblkno.
+		 */
+		if (xldata->is_prev_bucket_same_wrt)
+		{
+			HashPageOpaque writeopaque = (HashPageOpaque) PageGetSpecialPointer(writepage);
+
+			writeopaque->hasho_nextblkno = xldata->nextblkno;
+		}
+
+		PageSetLSN(writepage, lsn);
+		MarkBufferDirty(writebuf);
+	}
+
+	/* replay the record for initializing overflow buffer */
+	if (XLogReadBufferForRedo(record, 2, &ovflbuf) == BLK_NEEDS_REDO)
+	{
+		Page		ovflpage;
+
+		ovflpage = BufferGetPage(ovflbuf);
+
+		_hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));
+
+		PageSetLSN(ovflpage, lsn);
+		MarkBufferDirty(ovflbuf);
+	}
+	if (BufferIsValid(ovflbuf))
+		UnlockReleaseBuffer(ovflbuf);
+
+	/* replay the record for page previous to the freed overflow page */
+	if (!xldata->is_prev_bucket_same_wrt &&
+		XLogReadBufferForRedo(record, 3, &prevbuf) == BLK_NEEDS_REDO)
+	{
+		Page		prevpage = BufferGetPage(prevbuf);
+		HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+		prevopaque->hasho_nextblkno = xldata->nextblkno;
+
+		PageSetLSN(prevpage, lsn);
+		MarkBufferDirty(prevbuf);
+	}
+	if (BufferIsValid(prevbuf))
+		UnlockReleaseBuffer(prevbuf);
+
+	/* replay the record for page next to the freed overflow page */
+	if (XLogRecHasBlockRef(record, 4))
+	{
+		Buffer		nextbuf;
+
+		if (XLogReadBufferForRedo(record, 4, &nextbuf) == BLK_NEEDS_REDO)
+		{
+			Page		nextpage = BufferGetPage(nextbuf);
+			HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage);
+
+			nextopaque->hasho_prevblkno = xldata->prevblkno;
+
+			PageSetLSN(nextpage, lsn);
+			MarkBufferDirty(nextbuf);
+		}
+		if (BufferIsValid(nextbuf))
+			UnlockReleaseBuffer(nextbuf);
+	}
+
+	/* replay the record for bitmap page */
+	if (XLogReadBufferForRedo(record, 5, &mapbuf) == BLK_NEEDS_REDO)
+	{
+		Page		mappage = (Page) BufferGetPage(mapbuf);
+		uint32	   *freep = NULL;
+		char	   *data;
+		uint32	   *bitmap_page_bit;
+		Size		datalen;
+
+		freep = HashPageGetBitmap(mappage);
+
+		data = XLogRecGetBlockData(record, 5, &datalen);
+		bitmap_page_bit = (uint32 *) data;
+
+		CLRBIT(freep, *bitmap_page_bit);
+
+		PageSetLSN(mappage, lsn);
+		MarkBufferDirty(mapbuf);
+	}
+	if (BufferIsValid(mapbuf))
+		UnlockReleaseBuffer(mapbuf);
+
+	/* replay the record for meta page */
+	if (XLogRecHasBlockRef(record, 6))
+	{
+		Buffer		metabuf;
+
+		if (XLogReadBufferForRedo(record, 6, &metabuf) == BLK_NEEDS_REDO)
+		{
+			HashMetaPage metap;
+			Page		page;
+			char	   *data;
+			uint32	   *firstfree_ovflpage;
+			Size		datalen;
+
+			data = XLogRecGetBlockData(record, 6, &datalen);
+			firstfree_ovflpage = (uint32 *) data;
+
+			page = BufferGetPage(metabuf);
+			metap = HashPageGetMeta(page);
+			metap->hashm_firstfree = *firstfree_ovflpage;
+
+			PageSetLSN(page, lsn);
+			MarkBufferDirty(metabuf);
+		}
+		if (BufferIsValid(metabuf))
+			UnlockReleaseBuffer(metabuf);
+	}
+
+	/*
+	 * We release the locks on writebuf and bucketbuf at the end of the
+	 * replay operation so that the lock on the primary bucket page is held
+	 * until the operation completes.  We could release the lock on the
+	 * write buffer as soon as we are done with it (when it is not the same
+	 * as the primary bucket page), but that doesn't seem to be worth
+	 * complicating the code.
+	 */
+	if (BufferIsValid(writebuf))
+		UnlockReleaseBuffer(writebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay delete operation of hash index
+ */
+static void
+hash_xlog_delete(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_delete *xldata = (xl_hash_delete *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		deletebuf;
+	Page		page;
+	XLogRedoAction action;
+
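+	/*
+	 * Block 0 is the primary bucket page (registered only when it differs
+	 * from the page being cleaned), and block 1 is the page the tuples are
+	 * deleted from.
+	 */
+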
+	/*
+	 * Make sure we have a cleanup lock on the primary bucket page before we
+	 * start the actual replay operation.  This ensures that neither a new
+	 * scan can start nor can one already be in progress during the replay of
+	 * this operation.  If scans were allowed here, they could miss some
+	 * records or see the same record multiple times.
+	 */
+	if (xldata->is_primary_bucket_page)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &deletebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bucketbuf))
+			LockBufferForCleanup(bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &deletebuf);
+	}
+
+	/* replay the record for deleting entries in bucket page */
+	if (action == BLK_NEEDS_REDO)
+	{
+		char	   *ptr;
+		Size		len;
+
+		ptr = XLogRecGetBlockData(record, 1, &len);
+
+		page = (Page) BufferGetPage(deletebuf);
+
+		if (len > 0)
+		{
+			OffsetNumber *unused;
+			OffsetNumber *unend;
+
+			unused = (OffsetNumber *) ptr;
+			unend = (OffsetNumber *) ((char *) ptr + len);
+
+			if ((unend - unused) > 0)
+				PageIndexMultiDelete(page, unused, unend - unused);
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(deletebuf);
+	}
+	if (BufferIsValid(deletebuf))
+		UnlockReleaseBuffer(deletebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay split cleanup flag operation for primary bucket page.
+ */
+static void
+hash_xlog_split_cleanup(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		HashPageOpaque bucket_opaque;
+
+		page = (Page) BufferGetPage(buffer);
+
+		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+		bucket_opaque->hasho_flag &= ~LH_BUCKET_NEEDS_SPLIT_CLEANUP;
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
+
+/*
+ * replay for update meta page
+ */
+static void
+hash_xlog_update_meta_page(XLogReaderState *record)
+{
+	HashMetaPage metap;
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_update_meta_page *xldata = (xl_hash_update_meta_page *) XLogRecGetData(record);
+	Buffer		metabuf;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &metabuf) == BLK_NEEDS_REDO)
+	{
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+
+		metap->hashm_ntuples = xldata->ntuples;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+void
+hash_redo(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			hash_xlog_init_meta_page(record);
+			break;
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			hash_xlog_init_bitmap_page(record);
+			break;
+		case XLOG_HASH_INSERT:
+			hash_xlog_insert(record);
+			break;
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			hash_xlog_addovflpage(record);
+			break;
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			hash_xlog_split_allocpage(record);
+			break;
+		case XLOG_HASH_SPLIT_PAGE:
+			hash_xlog_split_page(record);
+			break;
+		case XLOG_HASH_SPLIT_COMPLETE:
+			hash_xlog_split_complete(record);
+			break;
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			hash_xlog_move_page_contents(record);
+			break;
+		case XLOG_HASH_SQUEEZE_PAGE:
+			hash_xlog_squeeze_page(record);
+			break;
+		case XLOG_HASH_DELETE:
+			hash_xlog_delete(record);
+			break;
+		case XLOG_HASH_SPLIT_CLEANUP:
+			hash_xlog_split_cleanup(record);
+			break;
+		case XLOG_HASH_UPDATE_META_PAGE:
+			hash_xlog_update_meta_page(record);
+			break;
+		default:
+			elog(PANIC, "hash_redo: unknown op code %u", info);
+	}
+}
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 572146a..ef24fa7 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -16,6 +16,8 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "access/hash_xlog.h"
+#include "miscadmin.h"
 #include "utils/rel.h"
 
 
@@ -44,6 +46,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	uint32		maxbucket;
 	uint32		highmask;
 	uint32		lowmask;
+	OffsetNumber itup_off;
 
 	/*
 	 * Get the hash key for the item (it's stored in the index tuple itself).
@@ -204,32 +207,63 @@ restart_insert:
 		Assert(pageopaque->hasho_bucket == bucket);
 	}
 
-	/* found page with enough space, so add the item here */
-	(void) _hash_pgaddtup(rel, buf, itemsz, itup);
-
-	/*
-	 * write and release the modified page.  if the page we modified was an
-	 * overflow page, we also need to separately drop the pin we retained on
-	 * the primary bucket page.
-	 */
-	_hash_wrtbuf(rel, buf);
-	if (buf != bucket_buf)
-		_hash_dropbuf(rel, bucket_buf);
-
 	/*
 	 * Write-lock the metapage so we can increment the tuple count. After
 	 * incrementing it, check to see if it's time for a split.
 	 */
 	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
 
+	/* Do the update.  No ereport(ERROR) until changes are logged */
+	START_CRIT_SECTION();
+
+	/* found page with enough space, so add the item here */
+	itup_off = _hash_pgaddtup(rel, buf, itemsz, itup);
+
+	MarkBufferDirty(buf);
+
+	/* metapage operations */
 	metap->hashm_ntuples += 1;
 
 	/* Make sure this stays in sync with _hash_expandtable() */
 	do_expand = metap->hashm_ntuples >
 		(double) metap->hashm_ffactor * (metap->hashm_maxbucket + 1);
 
-	/* Write out the metapage and drop lock, but keep pin */
-	_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
+	MarkBufferDirty(metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_insert xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.offnum = itup_off;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInsert);
+
+		XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);
+
+		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+		XLogRegisterBufData(0, (char *) itup, IndexTupleDSize(*itup));
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INSERT);
+
+		PageSetLSN(BufferGetPage(buf), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	/* drop lock on metapage, but keep pin */
+	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+	/*
+	 * Release the modified page, and also drop the pin held on the primary
+	 * bucket page if the modified page was an overflow page.
+	 */
+	_hash_relbuf(rel, buf);
+	if (buf != bucket_buf)
+		_hash_dropbuf(rel, bucket_buf);
 
 	/* Attempt to split if a split is needed */
 	if (do_expand)
@@ -271,3 +305,44 @@ _hash_pgaddtup(Relation rel, Buffer buf, Size itemsize, IndexTuple itup)
 
 	return itup_off;
 }
+
+/*
+ *	_hash_pgaddmultitup() -- add a tuple vector to a particular page in the
+ *							 index.
+ *
+ * This routine has the same requirements for locking and tuple ordering as
+ * _hash_pgaddtup().
+ *
+ * The offsets at which the tuples are inserted are returned in itup_offsets.
+ */
+void
+_hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
+					OffsetNumber *itup_offsets, uint16 nitups)
+{
+	OffsetNumber itup_off;
+	Page		page;
+	uint32		hashkey;
+	int			i;
+
+	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+	page = BufferGetPage(buf);
+
+	for (i = 0; i < nitups; i++)
+	{
+		Size		itemsize;
+
+		itemsize = IndexTupleDSize(*itups[i]);
+		itemsize = MAXALIGN(itemsize);
+
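+	/*
+	 * Block 0 is the bucket or overflow page the tuple was inserted into;
+	 * block 1 is the metapage, whose tuple count must be updated.
+	 */
+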
+		/* Find where to insert the tuple (preserving page's hashkey ordering) */
+		hashkey = _hash_get_indextuple_hashkey(itups[i]);
+		itup_off = _hash_binsearch(page, hashkey);
+
+		itup_offsets[i] = itup_off;
+
+		if (PageAddItem(page, (Item) itups[i], itemsize, itup_off, false, false)
+			== InvalidOffsetNumber)
+			elog(ERROR, "failed to add index item to \"%s\"",
+				 RelationGetRelationName(rel));
+	}
+}
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index e2d208e..9336901 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -18,10 +18,11 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "access/hash_xlog.h"
+#include "miscadmin.h"
 #include "utils/rel.h"
 
 
-static Buffer _hash_getovflpage(Relation rel, Buffer metabuf);
 static uint32 _hash_firstfreebit(uint32 map);
 
 
@@ -84,7 +85,9 @@ blkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
  *	dropped before exiting (we assume the caller is not interested in 'buf'
  *	anymore) if not asked to retain.  The pin will be retained only for the
  *	primary bucket.  The returned overflow page will be pinned and
- *	write-locked; it is guaranteed to be empty.
+ *	write-locked; it is not guaranteed to be empty, as we release and
+ *	reacquire the lock on it, so the caller must ensure that the page has
+ *	the required free space.
  *
  *	The caller must hold a pin, but no lock, on the metapage buffer.
  *	That buffer is returned in the same state.
@@ -102,13 +105,37 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 	Page		ovflpage;
 	HashPageOpaque pageopaque;
 	HashPageOpaque ovflopaque;
-
-	/* allocate and lock an empty overflow page */
-	ovflbuf = _hash_getovflpage(rel, metabuf);
+	HashMetaPage metap;
+	Buffer		mapbuf = InvalidBuffer;
+	Buffer		newmapbuf = InvalidBuffer;
+	BlockNumber blkno;
+	uint32		orig_firstfree;
+	uint32		splitnum;
+	uint32	   *freep = NULL;
+	uint32		max_ovflpg;
+	uint32		bit;
+	uint32		bitmap_page_bit;
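+	/*
+	 * Block 0 is the newly allocated overflow page; block 1 is the page it
+	 * is chained to; block 2 (if present) is the existing bitmap page whose
+	 * bit gets set; block 3 (if present) is a newly created bitmap page; and
+	 * block 4 is the metapage.
+	 */
+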
+	uint32		first_page;
+	uint32		last_bit;
+	uint32		last_page;
+	uint32		i,
+				j;
+	bool		page_found = false;
 
 	/*
-	 * Write-lock the tail page.  It is okay to hold two buffer locks here
-	 * since there cannot be anyone else contending for access to ovflbuf.
+	 * Write-lock the tail page.  Here we must maintain a strict locking
+	 * order: first acquire the lock on the tail page of the bucket, then on
+	 * the meta page to find and lock the bitmap page; if one is found, the
+	 * lock on the meta page is released, and finally the lock on the new
+	 * overflow buffer is acquired.  We need this locking order to avoid
+	 * deadlocks with backends that are doing inserts.
+	 *
+	 * Note: We could have avoided locking many buffers here if we make two
+	 * WAL records for acquiring an overflow page (one to allocate an overflow
+	 * page and another to add it to overflow bucket chain).  However, doing
+	 * so can leak an overflow page, if the system crashes after allocation.
+	 * Needless to say, it is better to have a single record from a
+	 * performance point of view as well.
 	 */
 	_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_WRITE);
 
@@ -136,55 +163,6 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 		buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
 	}
 
-	/* now that we have correct backlink, initialize new overflow page */
-	ovflpage = BufferGetPage(ovflbuf);
-	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
-	ovflopaque->hasho_prevblkno = BufferGetBlockNumber(buf);
-	ovflopaque->hasho_nextblkno = InvalidBlockNumber;
-	ovflopaque->hasho_bucket = pageopaque->hasho_bucket;
-	ovflopaque->hasho_flag = LH_OVERFLOW_PAGE;
-	ovflopaque->hasho_page_id = HASHO_PAGE_ID;
-
-	MarkBufferDirty(ovflbuf);
-
-	/* logically chain overflow page to previous page */
-	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
-	if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
-		_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
-	else
-		_hash_wrtbuf(rel, buf);
-
-	return ovflbuf;
-}
-
-/*
- *	_hash_getovflpage()
- *
- *	Find an available overflow page and return it.  The returned buffer
- *	is pinned and write-locked, and has had _hash_pageinit() applied,
- *	but it is caller's responsibility to fill the special space.
- *
- * The caller must hold a pin, but no lock, on the metapage buffer.
- * That buffer is left in the same state at exit.
- */
-static Buffer
-_hash_getovflpage(Relation rel, Buffer metabuf)
-{
-	HashMetaPage metap;
-	Buffer		mapbuf = 0;
-	Buffer		newbuf;
-	BlockNumber blkno;
-	uint32		orig_firstfree;
-	uint32		splitnum;
-	uint32	   *freep = NULL;
-	uint32		max_ovflpg;
-	uint32		bit;
-	uint32		first_page;
-	uint32		last_bit;
-	uint32		last_page;
-	uint32		i,
-				j;
-
 	/* Get exclusive lock on the meta page */
 	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
 
@@ -233,11 +211,31 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 		for (; bit <= last_inpage; j++, bit += BITS_PER_MAP)
 		{
 			if (freep[j] != ALL_SET)
+			{
+				page_found = true;
+
+				/* Reacquire exclusive lock on the meta page */
+				_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
+
+				/* convert bit to bit number within page */
+				bit += _hash_firstfreebit(freep[j]);
+				bitmap_page_bit = bit;
+
+				/* convert bit to absolute bit number */
+				bit += (i << BMPG_SHIFT(metap));
+				/* Calculate address of the recycled overflow page */
+				blkno = bitno_to_blkno(metap, bit);
+
+				/* Fetch and init the recycled page */
+				ovflbuf = _hash_getinitbuf(rel, blkno);
+
 				goto found;
+			}
 		}
 
 		/* No free space here, try to advance to next map page */
 		_hash_relbuf(rel, mapbuf);
+		mapbuf = InvalidBuffer;
 		i++;
 		j = 0;					/* scan from start of next map page */
 		bit = 0;
@@ -261,8 +259,15 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 		 * convenient to pre-mark them as "in use" too.
 		 */
 		bit = metap->hashm_spares[splitnum];
-		_hash_initbitmap(rel, metap, bitno_to_blkno(metap, bit), MAIN_FORKNUM);
-		metap->hashm_spares[splitnum]++;
+		newmapbuf = _hash_getnewbuf(rel, bitno_to_blkno(metap, bit), MAIN_FORKNUM);
+
+		/* add the new bitmap page to the metapage's list of bitmaps */
+		/* metapage already has a write lock */
+		if (metap->hashm_nmaps >= HASH_MAX_BITMAPS)
+			ereport(ERROR,
+					(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+					 errmsg("out of overflow pages in hash index \"%s\"",
+							RelationGetRelationName(rel))));
 	}
 	else
 	{
@@ -273,7 +278,8 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 	}
 
 	/* Calculate address of the new overflow page */
-	bit = metap->hashm_spares[splitnum];
+	bit = BufferIsValid(newmapbuf) ?
+		metap->hashm_spares[splitnum] + 1 : metap->hashm_spares[splitnum];
 	blkno = bitno_to_blkno(metap, bit);
 
 	/*
@@ -281,39 +287,51 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 	 * relation length stays in sync with ours.  XXX It's annoying to do this
 	 * with metapage write lock held; would be better to use a lock that
 	 * doesn't block incoming searches.
+	 *
+	 * It is okay to hold two buffer locks here (one on the tail page of the
+	 * bucket and the other on the new overflow page) since there cannot be
+	 * anyone else contending for access to ovflbuf.
 	 */
-	newbuf = _hash_getnewbuf(rel, blkno, MAIN_FORKNUM);
+	ovflbuf = _hash_getnewbuf(rel, blkno, MAIN_FORKNUM);
 
-	metap->hashm_spares[splitnum]++;
+found:
 
 	/*
-	 * Adjust hashm_firstfree to avoid redundant searches.  But don't risk
-	 * changing it if someone moved it while we were searching bitmap pages.
+	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
+	 * log the changes for bitmap page and overflow page together to avoid
+	 * loss of pages in case the new page is added.
 	 */
-	if (metap->hashm_firstfree == orig_firstfree)
-		metap->hashm_firstfree = bit + 1;
-
-	/* Write updated metapage and release lock, but not pin */
-	_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
+	START_CRIT_SECTION();
 
-	return newbuf;
-
-found:
-	/* convert bit to bit number within page */
-	bit += _hash_firstfreebit(freep[j]);
+	if (page_found)
+	{
+		Assert(BufferIsValid(mapbuf));
 
-	/* mark page "in use" in the bitmap */
-	SETBIT(freep, bit);
-	_hash_wrtbuf(rel, mapbuf);
+		/* mark page "in use" in the bitmap */
+		SETBIT(freep, bitmap_page_bit);
+		MarkBufferDirty(mapbuf);
+	}
+	else
+	{
+		/* update the count to indicate new overflow page is added */
+		metap->hashm_spares[splitnum]++;
 
-	/* Reacquire exclusive lock on the meta page */
-	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
+		if (BufferIsValid(newmapbuf))
+		{
+			_hash_initbitmapbuffer(newmapbuf, metap->hashm_bmsize, false);
+			MarkBufferDirty(newmapbuf);
 
-	/* convert bit to absolute bit number */
-	bit += (i << BMPG_SHIFT(metap));
+			metap->hashm_mapp[metap->hashm_nmaps] = BufferGetBlockNumber(newmapbuf);
+			metap->hashm_nmaps++;
+			metap->hashm_spares[splitnum]++;
+			MarkBufferDirty(metabuf);
+		}
 
-	/* Calculate address of the recycled overflow page */
-	blkno = bitno_to_blkno(metap, bit);
+		/*
+		 * for new overflow page, we don't need to explicitly set the bit in
+		 * bitmap page, as by default that will be set to "in use".
+		 */
+	}
 
 	/*
 	 * Adjust hashm_firstfree to avoid redundant searches.  But don't risk
@@ -322,18 +340,101 @@ found:
 	if (metap->hashm_firstfree == orig_firstfree)
 	{
 		metap->hashm_firstfree = bit + 1;
-
-		/* Write updated metapage and release lock, but not pin */
-		_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
+		MarkBufferDirty(metabuf);
 	}
-	else
+
+	/* now that we have correct backlink, initialize new overflow page */
+	ovflpage = BufferGetPage(ovflbuf);
+	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
+	ovflopaque->hasho_prevblkno = BufferGetBlockNumber(buf);
+	ovflopaque->hasho_nextblkno = InvalidBlockNumber;
+	ovflopaque->hasho_bucket = pageopaque->hasho_bucket;
+	ovflopaque->hasho_flag = LH_OVERFLOW_PAGE;
+	ovflopaque->hasho_page_id = HASHO_PAGE_ID;
+
+	MarkBufferDirty(ovflbuf);
+
+	/* logically chain overflow page to previous page */
+	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
+
+	MarkBufferDirty(buf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
 	{
-		/* We didn't change the metapage, so no need to write */
-		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+		XLogRecPtr	recptr;
+		xl_hash_addovflpage xlrec;
+
+		xlrec.bmpage_found = page_found;
+		xlrec.bmsize = metap->hashm_bmsize;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashAddOvflPage);
+
+		XLogRegisterBuffer(0, ovflbuf, REGBUF_WILL_INIT);
+		XLogRegisterBufData(0, (char *) &pageopaque->hasho_bucket, sizeof(Bucket));
+
+		XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
+
+		if (BufferIsValid(mapbuf))
+		{
+			/*
+			 * The bitmap page doesn't have a standard page layout, so
+			 * register the changed bit number so that it can be replayed.
+			 */
+			XLogRegisterBuffer(2, mapbuf, REGBUF_STANDARD);
+			XLogRegisterBufData(2, (char *) &bitmap_page_bit, sizeof(uint32));
+		}
+
+		if (BufferIsValid(newmapbuf))
+			XLogRegisterBuffer(3, newmapbuf, REGBUF_WILL_INIT);
+
+		/*
+		 * To replay the meta page changes, we could log the entire metapage
+		 * (not advisable given the size of the hash metapage) or log each
+		 * updated value individually (doable), but instead we prefer to
+		 * perform the same operations on the metapage during replay as were
+		 * done during the original operation.  That is straightforward and
+		 * has the advantage of using less space in the WAL.
+		 */
+		XLogRegisterBuffer(4, metabuf, REGBUF_STANDARD);
+		XLogRegisterBufData(4, (char *) &metap->hashm_firstfree, sizeof(uint32));
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_ADD_OVFL_PAGE);
+
+		PageSetLSN(BufferGetPage(ovflbuf), recptr);
+		PageSetLSN(BufferGetPage(buf), recptr);
+
+		if (BufferIsValid(mapbuf))
+			PageSetLSN(BufferGetPage(mapbuf), recptr);
+
+		if (BufferIsValid(newmapbuf))
+			PageSetLSN(BufferGetPage(newmapbuf), recptr);
 	}
 
-	/* Fetch, init, and return the recycled page */
-	return _hash_getinitbuf(rel, blkno);
+	END_CRIT_SECTION();
+
+	if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+		_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+	else
+		_hash_relbuf(rel, buf);
+
+	if (BufferIsValid(mapbuf))
+		_hash_relbuf(rel, mapbuf);
+
+	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+	if (BufferIsValid(newmapbuf))
+		_hash_relbuf(rel, newmapbuf);
+
+	/*
+	 * We need to release and reacquire the lock on the overflow buffer so
+	 * that a standby doesn't see it in an intermediate state.
+	 */
+	_hash_chgbufaccess(rel, ovflbuf, HASH_READ, HASH_NOLOCK);
+	_hash_chgbufaccess(rel, ovflbuf, HASH_NOLOCK, HASH_WRITE);
+
+	return ovflbuf;
 }
 
 /*
@@ -366,6 +467,12 @@ _hash_firstfreebit(uint32 map)
  *	Remove this overflow page from its bucket's chain, and mark the page as
  *	free.  On entry, ovflbuf is write-locked; it is released before exiting.
  *
+ *	The tuples (itups) are added to wbuf in this function; we could do that
+ *	in the caller instead, but doing it here makes it easy to write the WAL
+ *	record for the XLOG_HASH_SQUEEZE_PAGE operation.  Adding the tuples and
+ *	removing the overflow page must be done atomically; otherwise, during
+ *	replay on a standby, users might see duplicate records.
+ *
  *	Since this function is invoked in VACUUM, we provide an access strategy
  *	parameter that controls fetches of the bucket pages.
  *
@@ -379,13 +486,14 @@ _hash_firstfreebit(uint32 map)
  *	is responsible for releasing the pin on same.
  */
 BlockNumber
-_hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
-				   bool wbuf_dirty, BufferAccessStrategy bstrategy)
+_hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
+				   Buffer wbuf, IndexTuple *itups, OffsetNumber *itup_offsets,
+				   Size *tups_size, uint16 nitups,
+				   BufferAccessStrategy bstrategy)
 {
 	HashMetaPage metap;
 	Buffer		metabuf;
 	Buffer		mapbuf;
-	Buffer		prevbuf = InvalidBuffer;
 	BlockNumber ovflblkno;
 	BlockNumber prevblkno;
 	BlockNumber blkno;
@@ -399,6 +507,9 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 	int32		bitmappage,
 				bitmapbit;
 	Bucket bucket PG_USED_FOR_ASSERTS_ONLY;
+	Buffer		prevbuf = InvalidBuffer;
+	Buffer		nextbuf = InvalidBuffer;
+	bool		update_metap = false;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -411,14 +522,6 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 	bucket = ovflopaque->hasho_bucket;
 
 	/*
-	 * Zero the page for debugging's sake; then write and release it. (Note:
-	 * if we failed to zero the page here, we'd have problems with the Assert
-	 * in _hash_pageinit() when the page is reused.)
-	 */
-	MemSet(ovflpage, 0, BufferGetPageSize(ovflbuf));
-	_hash_wrtbuf(rel, ovflbuf);
-
-	/*
 	 * Fix up the bucket chain.  this is a doubly-linked list, so we must fix
 	 * up the bucket chain members behind and ahead of the overflow page being
 	 * deleted.  Concurrency issues are avoided by using lock chaining as
@@ -426,8 +529,6 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 	 */
 	if (BlockNumberIsValid(prevblkno))
 	{
-		Page		prevpage;
-		HashPageOpaque prevopaque;
 
 		if (prevblkno == writeblkno)
 			prevbuf = wbuf;
@@ -437,37 +538,14 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 												 HASH_WRITE,
 										   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
 												 bstrategy);
-
-		prevpage = BufferGetPage(prevbuf);
-		prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
-
-		Assert(prevopaque->hasho_bucket == bucket);
-		prevopaque->hasho_nextblkno = nextblkno;
-
-		if (prevblkno != writeblkno)
-			_hash_wrtbuf(rel, prevbuf);
 	}
 
-	/* write and unlock the write buffer */
-	if (wbuf_dirty)
-		_hash_chgbufaccess(rel, wbuf, HASH_WRITE, HASH_NOLOCK);
-	else
-		_hash_chgbufaccess(rel, wbuf, HASH_READ, HASH_NOLOCK);
-
 	if (BlockNumberIsValid(nextblkno))
-	{
-		Buffer		nextbuf = _hash_getbuf_with_strategy(rel,
-														 nextblkno,
-														 HASH_WRITE,
-														 LH_OVERFLOW_PAGE,
-														 bstrategy);
-		Page		nextpage = BufferGetPage(nextbuf);
-		HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage);
-
-		Assert(nextopaque->hasho_bucket == bucket);
-		nextopaque->hasho_prevblkno = prevblkno;
-		_hash_wrtbuf(rel, nextbuf);
-	}
+		nextbuf = _hash_getbuf_with_strategy(rel,
+											 nextblkno,
+											 HASH_WRITE,
+											 LH_OVERFLOW_PAGE,
+											 bstrategy);
 
 	/* Note: bstrategy is intentionally not used for metapage and bitmap */
 
@@ -493,60 +571,183 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 	mappage = BufferGetPage(mapbuf);
 	freep = HashPageGetBitmap(mappage);
 	Assert(ISSET(freep, bitmapbit));
-	CLRBIT(freep, bitmapbit);
-	_hash_wrtbuf(rel, mapbuf);
 
 	/* Get write-lock on metapage to update firstfree */
 	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
 
+	/* This operation needs to log multiple tuples, prepare WAL for that */
+	if (RelationNeedsWAL(rel))
+		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, XLR_NORMAL_RDATAS + nitups);
+
+	START_CRIT_SECTION();
+
+	/*
+	 * we have to insert tuples on the "write" page, being careful to preserve
+	 * hashkey ordering.  (If we insert many tuples into the same "write" page
+	 * it would be worth qsort'ing them).
+	 */
+	if (nitups > 0)
+	{
+		_hash_pgaddmultitup(rel, wbuf, itups, itup_offsets, nitups);
+		MarkBufferDirty(wbuf);
+	}
+
+	/*
+	 * Initialize the freed overflow page.  We can't completely zero the page
+	 * here, as WAL replay routines expect pages to be initialized; see the
+	 * explanation of RBM_NORMAL mode atop XLogReadBufferExtended.
+	 */
+	_hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));
+	MarkBufferDirty(ovflbuf);
+
+	if (BufferIsValid(prevbuf))
+	{
+		Page		prevpage = BufferGetPage(prevbuf);
+		HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+		Assert(prevopaque->hasho_bucket == bucket);
+		prevopaque->hasho_nextblkno = nextblkno;
+		MarkBufferDirty(prevbuf);
+	}
+
+	if (BufferIsValid(nextbuf))
+	{
+		Page		nextpage = BufferGetPage(nextbuf);
+		HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage);
+
+		Assert(nextopaque->hasho_bucket == bucket);
+		nextopaque->hasho_prevblkno = prevblkno;
+		MarkBufferDirty(nextbuf);
+	}
+
+	CLRBIT(freep, bitmapbit);
+	MarkBufferDirty(mapbuf);
+
 	/* if this is now the first free page, update hashm_firstfree */
 	if (ovflbitno < metap->hashm_firstfree)
 	{
 		metap->hashm_firstfree = ovflbitno;
-		_hash_wrtbuf(rel, metabuf);
+		update_metap = true;
+		MarkBufferDirty(metabuf);
 	}
-	else
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
 	{
-		/* no need to change metapage */
-		_hash_relbuf(rel, metabuf);
+		xl_hash_squeeze_page xlrec;
+		XLogRecPtr	recptr;
+		int			i;
+
+		xlrec.prevblkno = prevblkno;
+		xlrec.nextblkno = nextblkno;
+		xlrec.ntups = nitups;
+		xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf) ? true : false;
+		xlrec.is_prev_bucket_same_wrt = (wbuf == prevbuf) ? true : false;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashSqueezePage);
+
+		/*
+		 * bucket buffer needs to be registered to ensure that we can acquire
+		 * a cleanup lock on it during replay.
+		 */
+		if (!xlrec.is_prim_bucket_same_wrt)
+			XLogRegisterBuffer(0, bucketbuf, REGBUF_STANDARD);
+
+		XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
+		if (xlrec.ntups > 0)
+		{
+			XLogRegisterBufData(1, (char *) itup_offsets,
+								nitups * sizeof(OffsetNumber));
+			for (i = 0; i < nitups; i++)
+				XLogRegisterBufData(1, (char *) itups[i], tups_size[i]);
+		}
+
+		XLogRegisterBuffer(2, ovflbuf, REGBUF_STANDARD);
+
+		/*
+		 * If prevpage and the writepage (the block to which we are moving
+		 * tuples from the overflow page) are the same, there is no need to
+		 * register prevpage separately.  During replay, we can directly
+		 * update the next-block pointer in the writepage.
+		 */
+		if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt)
+			XLogRegisterBuffer(3, prevbuf, REGBUF_STANDARD);
+
+		if (BufferIsValid(nextbuf))
+			XLogRegisterBuffer(4, nextbuf, REGBUF_STANDARD);
+
+		XLogRegisterBuffer(5, mapbuf, REGBUF_STANDARD);
+		XLogRegisterBufData(5, (char *) &bitmapbit, sizeof(uint32));
+
+		if (update_metap)
+		{
+			XLogRegisterBuffer(6, metabuf, REGBUF_STANDARD);
+			XLogRegisterBufData(6, (char *) &metap->hashm_firstfree, sizeof(uint32));
+		}
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SQUEEZE_PAGE);
+
+		PageSetLSN(BufferGetPage(wbuf), recptr);
+		PageSetLSN(BufferGetPage(ovflbuf), recptr);
+
+		if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt)
+			PageSetLSN(BufferGetPage(prevbuf), recptr);
+		if (BufferIsValid(nextbuf))
+			PageSetLSN(BufferGetPage(nextbuf), recptr);
+
+		PageSetLSN(BufferGetPage(mapbuf), recptr);
+
+		if (update_metap)
+			PageSetLSN(BufferGetPage(metabuf), recptr);
 	}
 
+	END_CRIT_SECTION();
+
+	/*
+	 * Release the lock on write buffer, caller will decide whether to release
+	 * the pin or reacquire the lock.
+	 */
+	_hash_chgbufaccess(rel, wbuf, HASH_READ, HASH_NOLOCK);
+
+	if (BufferIsValid(ovflbuf))
+		_hash_relbuf(rel, ovflbuf);
+
+	/* release previous bucket if it is not same as write bucket */
+	if (BufferIsValid(prevbuf) && prevblkno != writeblkno)
+		_hash_relbuf(rel, prevbuf);
+
+	if (BufferIsValid(nextbuf))
+		_hash_relbuf(rel, nextbuf);
+
+	if (BufferIsValid(mapbuf))
+		_hash_relbuf(rel, mapbuf);
+
+	if (BufferIsValid(metabuf))
+		_hash_relbuf(rel, metabuf);
+
 	return nextblkno;
 }
 
-
 /*
- *	_hash_initbitmap()
- *
- *	 Initialize a new bitmap page.  The metapage has a write-lock upon
- *	 entering the function, and must be written by caller after return.
+ *	_hash_initbitmapbuffer()
  *
- * 'blkno' is the block number of the new bitmap page.
- *
- * All bits in the new bitmap page are set to "1", indicating "in use".
+ *	 Initialize a new bitmap page.  All bits in the new bitmap page are set to
+ *	 "1", indicating "in use".
  */
 void
-_hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
-				 ForkNumber forkNum)
+_hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage)
 {
-	Buffer		buf;
 	Page		pg;
 	HashPageOpaque op;
 	uint32	   *freep;
 
-	/*
-	 * It is okay to write-lock the new bitmap page while holding metapage
-	 * write lock, because no one else could be contending for the new page.
-	 * Also, the metapage lock makes it safe to extend the index using
-	 * _hash_getnewbuf.
-	 *
-	 * There is some loss of concurrency in possibly doing I/O for the new
-	 * page while holding the metapage lock, but this path is taken so seldom
-	 * that it's not worth worrying about.
-	 */
-	buf = _hash_getnewbuf(rel, blkno, forkNum);
 	pg = BufferGetPage(buf);
 
+	/* initialize the page */
+	if (initpage)
+		_hash_pageinit(pg, BufferGetPageSize(buf));
+
 	/* initialize the page's special space */
 	op = (HashPageOpaque) PageGetSpecialPointer(pg);
 	op->hasho_prevblkno = InvalidBlockNumber;
@@ -557,22 +758,14 @@ _hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
 
 	/* set all of the bits to 1 */
 	freep = HashPageGetBitmap(pg);
-	MemSet(freep, 0xFF, BMPGSZ_BYTE(metap));
-
-	/* write out the new bitmap page (releasing write lock and pin) */
-	_hash_wrtbuf(rel, buf);
+	MemSet(freep, 0xFF, bmsize);
 
-	/* add the new bitmap page to the metapage's list of bitmaps */
-	/* metapage already has a write lock */
-	if (metap->hashm_nmaps >= HASH_MAX_BITMAPS)
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("out of overflow pages in hash index \"%s\"",
-						RelationGetRelationName(rel))));
-
-	metap->hashm_mapp[metap->hashm_nmaps] = blkno;
-
-	metap->hashm_nmaps++;
+	/*
+	 * Set pd_lower just past the end of the bitmap page data.  We could even
+	 * set pd_lower equal to pd_upper, but this is more precise and makes the
+	 * page look compressible to xlog.c.
+	 */
+	((PageHeader) pg)->pd_lower = ((char *) freep + bmsize) - (char *) pg;
 }
 
 
@@ -622,7 +815,6 @@ _hash_squeezebucket(Relation rel,
 	Page		rpage;
 	HashPageOpaque wopaque;
 	HashPageOpaque ropaque;
-	bool		wbuf_dirty;
 
 	/*
 	 * start squeezing into the primary bucket page.
@@ -638,7 +830,7 @@ _hash_squeezebucket(Relation rel,
 	 */
 	if (!BlockNumberIsValid(wopaque->hasho_nextblkno))
 	{
-		_hash_chgbufaccess(rel, wbuf, HASH_WRITE, HASH_NOLOCK);
+		_hash_chgbufaccess(rel, wbuf, HASH_READ, HASH_NOLOCK);
 		return;
 	}
 
@@ -668,15 +860,23 @@ _hash_squeezebucket(Relation rel,
 	/*
 	 * squeeze the tuples.
 	 */
-	wbuf_dirty = false;
 	for (;;)
 	{
 		OffsetNumber roffnum;
 		OffsetNumber maxroffnum;
 		OffsetNumber deletable[MaxOffsetNumber];
-		int			ndeletable = 0;
+		IndexTuple	itups[MaxIndexTuplesPerPage];
+		Size		tups_size[MaxIndexTuplesPerPage];
+		OffsetNumber *itup_offsets;
+		uint16		ndeletable = 0;
+		uint16		nitups = 0;
+		Size		all_tups_size = 0;
+		int			i;
 		bool		retain_pin = false;
 
+		itup_offsets = (OffsetNumber *) palloc(MaxIndexTuplesPerPage * sizeof(OffsetNumber));
+
+readpage:
 		/* Scan each tuple in "read" page */
 		maxroffnum = PageGetMaxOffsetNumber(rpage);
 		for (roffnum = FirstOffsetNumber;
@@ -697,10 +897,12 @@ _hash_squeezebucket(Relation rel,
 
 			/*
 			 * Walk up the bucket chain, looking for a page big enough for
-			 * this item.  Exit if we reach the read page.
+			 * this item and all other accumulated items.  Exit if we reach
+			 * the read page.
 			 */
-			while (PageGetFreeSpace(wpage) < itemsz)
+			while (PageGetFreeSpaceForMulTups(wpage, nitups + 1) < (all_tups_size + itemsz))
 			{
+				bool		tups_moved = false;
 				Buffer		next_wbuf = InvalidBuffer;
 
 				Assert(!PageIsEmpty(wpage));
@@ -718,56 +920,132 @@ _hash_squeezebucket(Relation rel,
 														   HASH_WRITE,
 														   LH_OVERFLOW_PAGE,
 														   bstrategy);
+				if (nitups > 0)
+				{
+					Assert(nitups == ndeletable);
+
+					/*
+					 * This operation needs to log multiple tuples, prepare
+					 * WAL for that.
+					 */
+					if (RelationNeedsWAL(rel))
+						XLogEnsureRecordSpace(0, XLR_NORMAL_RDATAS + nitups);
+
+					START_CRIT_SECTION();
+
+					/*
+					 * we have to insert tuples on the "write" page, being
+					 * careful to preserve hashkey ordering.  (If we insert
+					 * many tuples into the same "write" page it would be
+					 * worth qsort'ing them).
+					 */
+					_hash_pgaddmultitup(rel, wbuf, itups, itup_offsets, nitups);
+					MarkBufferDirty(wbuf);
+
+					/* Delete tuples we already moved off read page */
+					PageIndexMultiDelete(rpage, deletable, ndeletable);
+					MarkBufferDirty(rbuf);
+
+					/* XLOG stuff */
+					if (RelationNeedsWAL(rel))
+					{
+						XLogRecPtr	recptr;
+						xl_hash_move_page_contents xlrec;
+
+						xlrec.ntups = nitups;
+						xlrec.is_prim_bucket_same_wrt = (wbuf == bucket_buf) ? true : false;
+
+						XLogBeginInsert();
+						XLogRegisterData((char *) &xlrec, SizeOfHashMovePageContents);
+
+						/*
+						 * bucket buffer needs to be registered to ensure that
+						 * we can acquire a cleanup lock on it during replay.
+						 */
+						if (!xlrec.is_prim_bucket_same_wrt)
+							XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+						XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
+						XLogRegisterBufData(1, (char *) itup_offsets,
+											nitups * sizeof(OffsetNumber));
+						for (i = 0; i < nitups; i++)
+							XLogRegisterBufData(1, (char *) itups[i], tups_size[i]);
+
+						XLogRegisterBuffer(2, rbuf, REGBUF_STANDARD);
+						XLogRegisterBufData(2, (char *) deletable,
+										  ndeletable * sizeof(OffsetNumber));
+
+						recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_MOVE_PAGE_CONTENTS);
+
+						PageSetLSN(BufferGetPage(wbuf), recptr);
+						PageSetLSN(BufferGetPage(rbuf), recptr);
+					}
+
+					END_CRIT_SECTION();
+
+					tups_moved = true;
+				}
+
 
 				/*
 				 * release the lock on previous page after acquiring the lock
 				 * on next page
 				 */
-				if (wbuf_dirty)
-				{
-					if (retain_pin)
-						_hash_chgbufaccess(rel, wbuf, HASH_WRITE, HASH_NOLOCK);
-					else
-						_hash_wrtbuf(rel, wbuf);
-				}
-				else if (retain_pin)
+				if (retain_pin)
 					_hash_chgbufaccess(rel, wbuf, HASH_READ, HASH_NOLOCK);
 				else
 					_hash_relbuf(rel, wbuf);
 
+				/*
+				 * We need to release and, if required, reacquire the lock on
+				 * rbuf so that the standby doesn't see an intermediate state
+				 * of it.  If we don't release the lock, then after replay of
+				 * XLOG_HASH_SQUEEZE_PAGE on the standby, users would be able
+				 * to see the results of a partial deletion on rblkno.
+				 */
+				_hash_chgbufaccess(rel, rbuf, HASH_READ, HASH_NOLOCK);
+
 				/* nothing more to do if we reached the read page */
 				if (rblkno == wblkno)
 				{
-					if (ndeletable > 0)
-					{
-						/* Delete tuples we already moved off read page */
-						PageIndexMultiDelete(rpage, deletable, ndeletable);
-						_hash_wrtbuf(rel, rbuf);
-					}
-					else
-						_hash_relbuf(rel, rbuf);
+					_hash_dropbuf(rel, rbuf);
 					return;
 				}
 
+				_hash_chgbufaccess(rel, rbuf, HASH_NOLOCK, HASH_WRITE);
+
 				wbuf = next_wbuf;
 				wpage = BufferGetPage(wbuf);
 				wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
 				Assert(wopaque->hasho_bucket == bucket);
-				wbuf_dirty = false;
 				retain_pin = false;
-			}
 
-			/*
-			 * we have found room so insert on the "write" page, being careful
-			 * to preserve hashkey ordering.  (If we insert many tuples into
-			 * the same "write" page it would be worth qsort'ing instead of
-			 * doing repeated _hash_pgaddtup.)
-			 */
-			(void) _hash_pgaddtup(rel, wbuf, itemsz, itup);
-			wbuf_dirty = true;
+				/* be tidy */
+				for (i = 0; i < nitups; i++)
+					pfree(itups[i]);
+				nitups = 0;
+				all_tups_size = 0;
+				ndeletable = 0;
+
+				/*
+				 * after moving the tuples, rpage would have been compacted,
+				 * so we need to rescan it.
+				 */
+				if (tups_moved)
+					goto readpage;
+			}
 
 			/* remember tuple for deletion from "read" page */
 			deletable[ndeletable++] = roffnum;
+
+			/*
+			 * we need a copy of the index tuples here, because the originals
+			 * can be freed when the overflow page is freed; however, we still
+			 * need them to write a WAL record in _hash_freeovflpage.
+			 */
+			itups[nitups] = CopyIndexTuple(itup);
+			tups_size[nitups++] = itemsz;
+			all_tups_size += itemsz;
 		}
 
 		/*
@@ -782,14 +1060,20 @@ _hash_squeezebucket(Relation rel,
 		 * removed page.  In that case, we don't need to lock it again and we
 		 * always release the lock on wbuf in _hash_freeovflpage and then
 		 * retake it again here.  This will not only simplify the code, but is
-		 * required to atomically log the changes which will be helpful when
-		 * we write WAL for hash indexes.
+		 * required to atomically log the changes.
 		 */
 		rblkno = ropaque->hasho_prevblkno;
 		Assert(BlockNumberIsValid(rblkno));
 
 		/* free this overflow page (releases rbuf) */
-		_hash_freeovflpage(rel, rbuf, wbuf, wbuf_dirty, bstrategy);
+		_hash_freeovflpage(rel, bucket_buf, rbuf, wbuf, itups, itup_offsets,
+						   tups_size, nitups, bstrategy);
+
+		/* be tidy */
+		for (i = 0; i < nitups; i++)
+			pfree(itups[i]);
+
+		pfree(itup_offsets);
 
 		/* are we freeing the page adjacent to wbuf? */
 		if (rblkno == wblkno)
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 74ffa9d..4862b66 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -29,6 +29,7 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "access/hash_xlog.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
@@ -40,12 +41,11 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
 				  Bucket obucket, Bucket nbucket,
 				  Buffer obuf,
 				  Buffer nbuf,
+				  HTAB *htab,
 				  uint32 maxbucket,
 				  uint32 highmask, uint32 lowmask);
-static void _hash_splitbucket_guts(Relation rel, Buffer metabuf,
-					   Bucket obucket, Bucket nbucket, Buffer obuf,
-					   Buffer nbuf, HTAB *htab, uint32 maxbucket,
-					   uint32 highmask, uint32 lowmask);
+static void
+			log_split_page(Relation rel, Buffer buf);
 
 
 /*
@@ -160,6 +160,29 @@ _hash_getinitbuf(Relation rel, BlockNumber blkno)
 }
 
 /*
+ *	_hash_initbuf() -- Get and initialize a buffer by bucket number.
+ */
+void
+_hash_initbuf(Buffer buf, uint32 num_bucket, uint32 flag, bool initpage)
+{
+	HashPageOpaque pageopaque;
+	Page		page;
+
+	page = BufferGetPage(buf);
+
+	/* initialize the page */
+	if (initpage)
+		_hash_pageinit(page, BufferGetPageSize(buf));
+
+	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+	pageopaque->hasho_prevblkno = InvalidBlockNumber;
+	pageopaque->hasho_nextblkno = InvalidBlockNumber;
+	pageopaque->hasho_bucket = num_bucket;
+	pageopaque->hasho_flag = flag;
+	pageopaque->hasho_page_id = HASHO_PAGE_ID;
+}
+
+/*
  *	_hash_getnewbuf() -- Get a new page at the end of the index.
  *
  *		This has the same API as _hash_getinitbuf, except that we are adding
@@ -290,35 +313,17 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 }
 
 /*
- *	_hash_wrtbuf() -- write a hash page to disk.
- *
- *		This routine releases the lock held on the buffer and our refcount
- *		for it.  It is an error to call _hash_wrtbuf() without a write lock
- *		and a pin on the buffer.
- *
- * NOTE: this routine should go away when/if hash indexes are WAL-ified.
- * The correct sequence of operations is to mark the buffer dirty, then
- * write the WAL record, then release the lock and pin; so marking dirty
- * can't be combined with releasing.
- */
-void
-_hash_wrtbuf(Relation rel, Buffer buf)
-{
-	MarkBufferDirty(buf);
-	UnlockReleaseBuffer(buf);
-}
-
-/*
  * _hash_chgbufaccess() -- Change the lock type on a buffer, without
  *			dropping our pin on it.
  *
- * from_access and to_access may be HASH_READ, HASH_WRITE, or HASH_NOLOCK,
- * the last indicating that no buffer-level lock is held or wanted.
+ * from_access can be HASH_READ or HASH_NOLOCK, the latter indicating that
+ * no buffer-level lock is held or wanted.
+ *
+ * to_access can be HASH_READ, HASH_WRITE, or HASH_NOLOCK, the last indicating
+ * that no buffer-level lock is held or wanted.
  *
- * When from_access == HASH_WRITE, we assume the buffer is dirty and tell
- * bufmgr it must be written out.  If the caller wants to release a write
- * lock on a page that's not been modified, it's okay to pass from_access
- * as HASH_READ (a bit ugly, but handy in some places).
+ * If the caller wants to release a write lock on a page, it's okay to pass
+ * from_access as HASH_READ (a bit ugly, but quite handy when required).
  */
 void
 _hash_chgbufaccess(Relation rel,
@@ -326,8 +331,7 @@ _hash_chgbufaccess(Relation rel,
 				   int from_access,
 				   int to_access)
 {
-	if (from_access == HASH_WRITE)
-		MarkBufferDirty(buf);
+	Assert(from_access != HASH_WRITE);
 	if (from_access != HASH_NOLOCK)
 		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 	if (to_access != HASH_NOLOCK)
@@ -336,7 +340,7 @@ _hash_chgbufaccess(Relation rel,
 
 
 /*
- *	_hash_metapinit() -- Initialize the metadata page of a hash index,
+ *	_hash_init() -- Initialize the metadata page of a hash index,
  *				the initial buckets, and the initial bitmap page.
  *
  * The initial number of buckets is dependent on num_tuples, an estimate
@@ -348,19 +352,18 @@ _hash_chgbufaccess(Relation rel,
  * multiple buffer locks is ignored.
  */
 uint32
-_hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
+_hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
 {
-	HashMetaPage metap;
-	HashPageOpaque pageopaque;
 	Buffer		metabuf;
 	Buffer		buf;
+	Buffer		bitmapbuf;
 	Page		pg;
+	HashMetaPage metap;
+	RegProcedure procid;
 	int32		data_width;
 	int32		item_width;
 	int32		ffactor;
-	double		dnumbuckets;
 	uint32		num_buckets;
-	uint32		log2_num_buckets;
 	uint32		i;
 
 	/* safety check */
@@ -382,6 +385,154 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	if (ffactor < 10)
 		ffactor = 10;
 
+	procid = index_getprocid(rel, 1, HASHPROC);
+
+	/*
+	 * We initialize the metapage, the first N bucket pages, and the first
+	 * bitmap page in sequence, using _hash_getnewbuf to cause smgrextend()
+	 * calls to occur.  This ensures that the smgr level has the right idea of
+	 * the physical index length.
+	 *
+	 * Critical section not required, because on error the creation of the
+	 * whole relation will be rolled back.
+	 */
+	metabuf = _hash_getnewbuf(rel, HASH_METAPAGE, forkNum);
+	_hash_init_metabuffer(metabuf, num_tuples, procid, ffactor, false);
+	MarkBufferDirty(metabuf);
+
+	pg = BufferGetPage(metabuf);
+	metap = HashPageGetMeta(pg);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_init_meta_page xlrec;
+		XLogRecPtr	recptr;
+
+		/*
+		 * We could have copied the entire metapage into the WAL record and
+		 * restored it as-is during replay.  However, the metapage is not
+		 * small (more than 600 bytes), so we record just the information
+		 * required to reconstruct it.
+		 */
+		xlrec.num_tuples = num_tuples;
+		xlrec.procid = metap->hashm_procid;
+		xlrec.ffactor = metap->hashm_ffactor;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInitMetaPage);
+		XLogRegisterBuffer(0, metabuf, REGBUF_WILL_INIT);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INIT_META_PAGE);
+
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	num_buckets = metap->hashm_maxbucket + 1;
+
+	/*
+	 * Release buffer lock on the metapage while we initialize buckets.
+	 * Otherwise, we'll be in interrupt holdoff and the CHECK_FOR_INTERRUPTS
+	 * won't accomplish anything.  It's a bad idea to hold buffer locks for
+	 * long intervals in any case, since that can block the bgwriter.
+	 */
+	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+	/*
+	 * Initialize and WAL Log the first N buckets
+	 */
+	for (i = 0; i < num_buckets; i++)
+	{
+		BlockNumber blkno;
+
+		/* Allow interrupts, in case N is huge */
+		CHECK_FOR_INTERRUPTS();
+
+		blkno = BUCKET_TO_BLKNO(metap, i);
+		buf = _hash_getnewbuf(rel, blkno, forkNum);
+		_hash_initbuf(buf, i, LH_BUCKET_PAGE, false);
+		MarkBufferDirty(buf);
+
+		log_newpage(&rel->rd_node,
+					forkNum,
+					blkno,
+					BufferGetPage(buf),
+					true);
+		_hash_relbuf(rel, buf);
+	}
+
+	/* Now reacquire buffer lock on metapage */
+	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
+
+	/*
+	 * Initialize bitmap page
+	 */
+	bitmapbuf = _hash_getnewbuf(rel, num_buckets + 1, forkNum);
+	_hash_initbitmapbuffer(bitmapbuf, metap->hashm_bmsize, false);
+	MarkBufferDirty(bitmapbuf);
+
+	/* add the new bitmap page to the metapage's list of bitmaps */
+	/* metapage already has a write lock */
+	if (metap->hashm_nmaps >= HASH_MAX_BITMAPS)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+				 errmsg("out of overflow pages in hash index \"%s\"",
+						RelationGetRelationName(rel))));
+
+	metap->hashm_mapp[metap->hashm_nmaps] = num_buckets + 1;
+
+	metap->hashm_nmaps++;
+	MarkBufferDirty(metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_init_bitmap_page xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.bmsize = metap->hashm_bmsize;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInitBitmapPage);
+		XLogRegisterBuffer(0, bitmapbuf, REGBUF_WILL_INIT);
+
+		/*
+		 * We could log the exact changes made to the meta page; however, as
+		 * no concurrent operation can see the index during the replay of
+		 * this record, we simply perform the same operations during replay
+		 * as are done here.
+		 */
+		XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INIT_BITMAP_PAGE);
+
+		PageSetLSN(BufferGetPage(bitmapbuf), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	/* all done */
+	_hash_relbuf(rel, bitmapbuf);
+	_hash_relbuf(rel, metabuf);
+
+	return num_buckets;
+}
+
+/*
+ *	_hash_init_metabuffer() -- Initialize the metadata page of a hash index.
+ */
+void
+_hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
+					  uint16 ffactor, bool initpage)
+{
+	HashMetaPage metap;
+	HashPageOpaque pageopaque;
+	Page		page;
+	double		dnumbuckets;
+	uint32		num_buckets;
+	uint32		log2_num_buckets;
+	uint32		i;
+
+
 	/*
 	 * Choose the number of initial bucket pages to match the fill factor
 	 * given the estimated number of tuples.  We round up the result to the
@@ -401,30 +552,25 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	Assert(num_buckets == (((uint32) 1) << log2_num_buckets));
 	Assert(log2_num_buckets < HASH_MAX_SPLITPOINTS);
 
-	/*
-	 * We initialize the metapage, the first N bucket pages, and the first
-	 * bitmap page in sequence, using _hash_getnewbuf to cause smgrextend()
-	 * calls to occur.  This ensures that the smgr level has the right idea of
-	 * the physical index length.
-	 */
-	metabuf = _hash_getnewbuf(rel, HASH_METAPAGE, forkNum);
-	pg = BufferGetPage(metabuf);
+	page = BufferGetPage(buf);
+	if (initpage)
+		_hash_pageinit(page, BufferGetPageSize(buf));
 
-	pageopaque = (HashPageOpaque) PageGetSpecialPointer(pg);
+	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	pageopaque->hasho_prevblkno = InvalidBlockNumber;
 	pageopaque->hasho_nextblkno = InvalidBlockNumber;
 	pageopaque->hasho_bucket = -1;
 	pageopaque->hasho_flag = LH_META_PAGE;
 	pageopaque->hasho_page_id = HASHO_PAGE_ID;
 
-	metap = HashPageGetMeta(pg);
+	metap = HashPageGetMeta(page);
 
 	metap->hashm_magic = HASH_MAGIC;
 	metap->hashm_version = HASH_VERSION;
 	metap->hashm_ntuples = 0;
 	metap->hashm_nmaps = 0;
 	metap->hashm_ffactor = ffactor;
-	metap->hashm_bsize = HashGetMaxBitmapSize(pg);
+	metap->hashm_bsize = HashGetMaxBitmapSize(page);
 	/* find largest bitmap array size that will fit in page size */
 	for (i = _hash_log2(metap->hashm_bsize); i > 0; --i)
 	{
@@ -441,7 +587,7 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	 * pretty useless for normal operation (in fact, hashm_procid is not used
 	 * anywhere), but it might be handy for forensic purposes so we keep it.
 	 */
-	metap->hashm_procid = index_getprocid(rel, 1, HASHPROC);
+	metap->hashm_procid = procid;
 
 	/*
 	 * We initialize the index with N buckets, 0 .. N-1, occupying physical
@@ -460,44 +606,11 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	metap->hashm_firstfree = 0;
 
 	/*
-	 * Release buffer lock on the metapage while we initialize buckets.
-	 * Otherwise, we'll be in interrupt holdoff and the CHECK_FOR_INTERRUPTS
-	 * won't accomplish anything.  It's a bad idea to hold buffer locks for
-	 * long intervals in any case, since that can block the bgwriter.
-	 */
-	_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
-
-	/*
-	 * Initialize the first N buckets
+	 * Set pd_lower just past the end of the metadata.  This is to log a
+	 * full page image of the metapage in xloginsert.c.
 	 */
-	for (i = 0; i < num_buckets; i++)
-	{
-		/* Allow interrupts, in case N is huge */
-		CHECK_FOR_INTERRUPTS();
-
-		buf = _hash_getnewbuf(rel, BUCKET_TO_BLKNO(metap, i), forkNum);
-		pg = BufferGetPage(buf);
-		pageopaque = (HashPageOpaque) PageGetSpecialPointer(pg);
-		pageopaque->hasho_prevblkno = InvalidBlockNumber;
-		pageopaque->hasho_nextblkno = InvalidBlockNumber;
-		pageopaque->hasho_bucket = i;
-		pageopaque->hasho_flag = LH_BUCKET_PAGE;
-		pageopaque->hasho_page_id = HASHO_PAGE_ID;
-		_hash_wrtbuf(rel, buf);
-	}
-
-	/* Now reacquire buffer lock on metapage */
-	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
-
-	/*
-	 * Initialize first bitmap page
-	 */
-	_hash_initbitmap(rel, metap, num_buckets + 1, forkNum);
-
-	/* all done */
-	_hash_wrtbuf(rel, metabuf);
-
-	return num_buckets;
+	((PageHeader) page)->pd_lower =
+		((char *) metap + sizeof(HashMetaPageData)) - (char *) page;
 }
 
 /*
@@ -506,7 +619,6 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 void
 _hash_pageinit(Page page, Size size)
 {
-	Assert(PageIsNew(page));
 	PageInit(page, size, sizeof(HashPageOpaqueData));
 }
 
@@ -534,10 +646,14 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	Buffer		buf_nblkno;
 	Buffer		buf_oblkno;
 	Page		opage;
+	Page		npage;
 	HashPageOpaque oopaque;
+	HashPageOpaque nopaque;
 	uint32		maxbucket;
 	uint32		highmask;
 	uint32		lowmask;
+	bool		metap_update_masks = false;
+	bool		metap_update_splitpoint = false;
 
 restart_expand:
 
@@ -573,7 +689,7 @@ restart_expand:
 	 * than a disk block then this would be an independent constraint.
 	 *
 	 * If you change this, see also the maximum initial number of buckets in
-	 * _hash_metapinit().
+	 * _hash_init().
 	 */
 	if (metap->hashm_maxbucket >= (uint32) 0x7FFFFFFE)
 		goto fail;
@@ -689,7 +805,11 @@ restart_expand:
 		 * The number of buckets in the new splitpoint is equal to the total
 		 * number already in existence, i.e. new_bucket.  Currently this maps
 		 * one-to-one to blocks required, but someday we may need a more
-		 * complicated calculation here.
+		 * complicated calculation here.  We treat the allocation of buckets
+		 * as a separate WAL action.  Even if we fail after this operation,
+		 * it won't leak space on the standby, as the next split will consume
+		 * it.  In any case, even without a failure we don't use all the
+		 * space in one split operation.
 		 */
 		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
 		{
@@ -714,18 +834,43 @@ restart_expand:
 		goto fail;
 	}
 
-
 	/*
-	 * Okay to proceed with split.  Update the metapage bucket mapping info.
-	 *
-	 * Since we are scribbling on the metapage data right in the shared
-	 * buffer, any failure in this next little bit leaves us with a big
+	 * Since we are scribbling on the pages in the shared buffers, establish a
+	 * critical section.  Any failure in this next code leaves us with a big
 	 * problem: the metapage is effectively corrupt but could get written back
-	 * to disk.  We don't really expect any failure, but just to be sure,
-	 * establish a critical section.
+	 * to disk.
 	 */
 	START_CRIT_SECTION();
 
+	opage = BufferGetPage(buf_oblkno);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	/*
+	 * Mark the old bucket to indicate that split is in progress.  At
+	 * operation end, we clear split-in-progress flag.
+	 */
+	oopaque->hasho_flag |= LH_BUCKET_BEING_SPLIT;
+
+	MarkBufferDirty(buf_oblkno);
+
+	npage = BufferGetPage(buf_nblkno);
+
+	/*
+	 * initialize the new bucket's primary page and mark it to indicate that
+	 * split is in progress.
+	 */
+	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	nopaque->hasho_prevblkno = InvalidBlockNumber;
+	nopaque->hasho_nextblkno = InvalidBlockNumber;
+	nopaque->hasho_bucket = new_bucket;
+	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_BEING_POPULATED;
+	nopaque->hasho_page_id = HASHO_PAGE_ID;
+
+	MarkBufferDirty(buf_nblkno);
+
+	/*
+	 * Okay to proceed with split.  Update the metapage bucket mapping info.
+	 */
 	metap->hashm_maxbucket = new_bucket;
 
 	if (new_bucket > metap->hashm_highmask)
@@ -733,6 +878,7 @@ restart_expand:
 		/* Starting a new doubling */
 		metap->hashm_lowmask = metap->hashm_highmask;
 		metap->hashm_highmask = new_bucket | metap->hashm_lowmask;
+		metap_update_masks = true;
 	}
 
 	/*
@@ -745,10 +891,10 @@ restart_expand:
 	{
 		metap->hashm_spares[spare_ndx] = metap->hashm_spares[metap->hashm_ovflpoint];
 		metap->hashm_ovflpoint = spare_ndx;
+		metap_update_splitpoint = true;
 	}
 
-	/* Done mucking with metapage */
-	END_CRIT_SECTION();
+	MarkBufferDirty(metabuf);
 
 	/*
 	 * Copy bucket mapping info now; this saves re-accessing the meta page
@@ -761,15 +907,71 @@ restart_expand:
 	highmask = metap->hashm_highmask;
 	lowmask = metap->hashm_lowmask;
 
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_split_allocpage xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.new_bucket = maxbucket;
+		xlrec.old_bucket_flag = oopaque->hasho_flag;
+		xlrec.new_bucket_flag = nopaque->hasho_flag;
+		xlrec.flags = 0;
+
+		XLogBeginInsert();
+
+		XLogRegisterBuffer(0, buf_oblkno, REGBUF_STANDARD);
+		XLogRegisterBuffer(1, buf_nblkno, REGBUF_WILL_INIT);
+		XLogRegisterBuffer(2, metabuf, REGBUF_STANDARD);
+
+		if (metap_update_masks)
+		{
+			xlrec.flags |= XLH_SPLIT_META_UPDATE_MASKS;
+			XLogRegisterBufData(2, (char *) &metap->hashm_lowmask, sizeof(uint32));
+			XLogRegisterBufData(2, (char *) &metap->hashm_highmask, sizeof(uint32));
+		}
+
+		if (metap_update_splitpoint)
+		{
+			xlrec.flags |= XLH_SPLIT_META_UPDATE_SPLITPOINT;
+			XLogRegisterBufData(2, (char *) &metap->hashm_ovflpoint,
+								sizeof(uint32));
+			XLogRegisterBufData(2,
+					   (char *) &metap->hashm_spares[metap->hashm_ovflpoint],
+								sizeof(uint32));
+		}
+
+		XLogRegisterData((char *) &xlrec, SizeOfHashSplitAllocPage);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_ALLOCATE_PAGE);
+
+		PageSetLSN(BufferGetPage(buf_oblkno), recptr);
+		PageSetLSN(BufferGetPage(buf_nblkno), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
 	/* Write out the metapage and drop lock, but keep pin */
-	_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
+	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+
+	/*
+	 * we need to release and reacquire the lock on the new bucket buffer so
+	 * that the standby doesn't see an intermediate state of it.
+	 */
+	_hash_chgbufaccess(rel, buf_nblkno, HASH_READ, HASH_NOLOCK);
+	_hash_chgbufaccess(rel, buf_nblkno, HASH_NOLOCK, HASH_WRITE);
 
 	/* Relocate records to the new bucket */
 	_hash_splitbucket(rel, metabuf,
 					  old_bucket, new_bucket,
-					  buf_oblkno, buf_nblkno,
+					  buf_oblkno, buf_nblkno, NULL,
 					  maxbucket, highmask, lowmask);
 
+	/* all done, now release the locks and pins on primary buckets. */
+	_hash_relbuf(rel, buf_oblkno);
+	_hash_relbuf(rel, buf_nblkno);
+
 	return;
 
 	/* Here if decide not to split or fail to acquire old bucket lock */
@@ -809,6 +1011,7 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 {
 	BlockNumber lastblock;
 	char		zerobuf[BLCKSZ];
+	Page		page;
 
 	lastblock = firstblock + nblocks - 1;
 
@@ -819,7 +1022,21 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 	if (lastblock < firstblock || lastblock == InvalidBlockNumber)
 		return false;
 
-	MemSet(zerobuf, 0, sizeof(zerobuf));
+	page = (Page) zerobuf;
+
+	/*
+	 * Initialize the new bucket page; we can't leave it completely zeroed
+	 * because WAL replay routines expect pages to be initialized.  See the
+	 * explanation of RBM_NORMAL mode atop XLogReadBufferExtended.
+	 */
+	_hash_pageinit(page, BLCKSZ);
+
+	if (RelationNeedsWAL(rel))
+		log_newpage(&rel->rd_node,
+					MAIN_FORKNUM,
+					lastblock,
+					zerobuf,
+					true);
 
 	RelationOpenSmgr(rel);
 	smgrextend(rel->rd_smgr, MAIN_FORKNUM, lastblock, zerobuf, false);
@@ -827,14 +1044,19 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 	return true;
 }
 
-
 /*
  * _hash_splitbucket -- split 'obucket' into 'obucket' and 'nbucket'
  *
+ * This routine is used to partition the tuples between the old and new
+ * buckets and also to finish incomplete split operations.  To finish a
+ * previously interrupted split operation, the caller needs to fill htab.
+ * If htab is set, we skip moving the tuples that already exist in htab;
+ * a NULL htab means that all the tuples belonging to the new bucket are
+ * moved.
+ *
  * We are splitting a bucket that consists of a base bucket page and zero
  * or more overflow (bucket chain) pages.  We must relocate tuples that
- * belong in the new bucket, and compress out any free space in the old
- * bucket.
+ * belong in the new bucket.
  *
  * The caller must hold cleanup locks on both buckets to ensure that
  * no one else is trying to access them (see README).
@@ -860,69 +1082,11 @@ _hash_splitbucket(Relation rel,
 				  Bucket nbucket,
 				  Buffer obuf,
 				  Buffer nbuf,
+				  HTAB *htab,
 				  uint32 maxbucket,
 				  uint32 highmask,
 				  uint32 lowmask)
 {
-	Page		opage;
-	Page		npage;
-	HashPageOpaque oopaque;
-	HashPageOpaque nopaque;
-
-	opage = BufferGetPage(obuf);
-	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
-
-	/*
-	 * Mark the old bucket to indicate that split is in progress.  At
-	 * operation end, we clear split-in-progress flag.
-	 */
-	oopaque->hasho_flag |= LH_BUCKET_BEING_SPLIT;
-
-	npage = BufferGetPage(nbuf);
-
-	/*
-	 * initialize the new bucket's primary page and mark it to indicate that
-	 * split is in progress.
-	 */
-	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
-	nopaque->hasho_prevblkno = InvalidBlockNumber;
-	nopaque->hasho_nextblkno = InvalidBlockNumber;
-	nopaque->hasho_bucket = nbucket;
-	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_BEING_POPULATED;
-	nopaque->hasho_page_id = HASHO_PAGE_ID;
-
-	_hash_splitbucket_guts(rel, metabuf, obucket,
-						   nbucket, obuf, nbuf, NULL,
-						   maxbucket, highmask, lowmask);
-
-	/* all done, now release the locks and pins on primary buckets. */
-	_hash_relbuf(rel, obuf);
-	_hash_relbuf(rel, nbuf);
-}
-
-/*
- * _hash_splitbucket_guts -- Helper function to perform the split operation
- *
- * This routine is used to partition the tuples between old and new bucket and
- * to finish incomplete split operations.  To finish the previously
- * interrupted split operation, caller needs to fill htab.  If htab is set, then
- * we skip the movement of tuples that exists in htab, otherwise NULL value of
- * htab indicates movement of all the tuples that belong to new bucket.
- *
- * Caller needs to lock and unlock the old and new primary buckets.
- */
-static void
-_hash_splitbucket_guts(Relation rel,
-					   Buffer metabuf,
-					   Bucket obucket,
-					   Bucket nbucket,
-					   Buffer obuf,
-					   Buffer nbuf,
-					   HTAB *htab,
-					   uint32 maxbucket,
-					   uint32 highmask,
-					   uint32 lowmask)
-{
 	Buffer		bucket_obuf;
 	Buffer		bucket_nbuf;
 	Page		opage;
@@ -1007,10 +1171,14 @@ _hash_splitbucket_guts(Relation rel,
 				itemsz = IndexTupleDSize(*new_itup);
 				itemsz = MAXALIGN(itemsz);
 
-				if (PageGetFreeSpace(npage) < itemsz)
+				while (PageGetFreeSpace(npage) < itemsz)
 				{
-					/* write out nbuf and drop lock, but keep pin */
-					_hash_chgbufaccess(rel, nbuf, HASH_WRITE, HASH_NOLOCK);
+					/* log the split operation before releasing the lock */
+					log_split_page(rel, nbuf);
+
+					/* drop lock, but keep pin */
+					_hash_chgbufaccess(rel, nbuf, HASH_READ, HASH_NOLOCK);
+
 					/* chain to a new overflow page */
 					nbuf = _hash_addovflpage(rel, metabuf, nbuf, (nbuf == bucket_nbuf) ? true : false);
 					npage = BufferGetPage(nbuf);
@@ -1018,6 +1186,13 @@ _hash_splitbucket_guts(Relation rel,
 				}
 
 				/*
+				 * Change the shared buffer state in a critical section,
+				 * otherwise any error could leave the page in a state that
+				 * cannot be recovered.
+				 */
+				START_CRIT_SECTION();
+
+				/*
 				 * Insert tuple on new page, using _hash_pgaddtup to ensure
 				 * correct ordering by hashkey.  This is a tad inefficient
 				 * since we may have to shuffle itempointers repeatedly.
@@ -1026,6 +1201,8 @@ _hash_splitbucket_guts(Relation rel,
 				 */
 				(void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
 
+				END_CRIT_SECTION();
+
 				/* be tidy */
 				pfree(new_itup);
 			}
@@ -1048,7 +1225,16 @@ _hash_splitbucket_guts(Relation rel,
 
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
+		{
+			/* log the split operation before releasing the lock */
+			log_split_page(rel, nbuf);
+
+			if (nbuf == bucket_nbuf)
+				_hash_chgbufaccess(rel, nbuf, HASH_READ, HASH_NOLOCK);
+			else
+				_hash_relbuf(rel, nbuf);
 			break;
+		}
 
 		/* Else, advance to next old page */
 		obuf = _hash_getbuf(rel, oblkno, HASH_READ, LH_OVERFLOW_PAGE);
@@ -1064,11 +1250,6 @@ _hash_splitbucket_guts(Relation rel,
 	 * To avoid deadlocks due to locking order of buckets, first lock the old
 	 * bucket and then the new bucket.
 	 */
-	if (nbuf == bucket_nbuf)
-		_hash_chgbufaccess(rel, bucket_nbuf, HASH_WRITE, HASH_NOLOCK);
-	else
-		_hash_wrtbuf(rel, nbuf);
-
 	_hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
 	opage = BufferGetPage(bucket_obuf);
 	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
@@ -1077,6 +1258,8 @@ _hash_splitbucket_guts(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	START_CRIT_SECTION();
+
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
 	nopaque->hasho_flag &= ~LH_BUCKET_BEING_POPULATED;
 
@@ -1093,6 +1276,29 @@ _hash_splitbucket_guts(Relation rel,
 	 */
 	MarkBufferDirty(bucket_obuf);
 	MarkBufferDirty(bucket_nbuf);
+
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+		xl_hash_split_complete xlrec;
+
+		xlrec.old_bucket_flag = oopaque->hasho_flag;
+		xlrec.new_bucket_flag = nopaque->hasho_flag;
+
+		XLogBeginInsert();
+
+		XLogRegisterData((char *) &xlrec, SizeOfHashSplitComplete);
+
+		XLogRegisterBuffer(0, bucket_obuf, REGBUF_STANDARD);
+		XLogRegisterBuffer(1, bucket_nbuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_COMPLETE);
+
+		PageSetLSN(BufferGetPage(bucket_obuf), recptr);
+		PageSetLSN(BufferGetPage(bucket_nbuf), recptr);
+	}
+
+	END_CRIT_SECTION();
 }
 
 /*
@@ -1209,11 +1415,43 @@ _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Bucket obucket,
 	npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 	nbucket = npageopaque->hasho_bucket;
 
-	_hash_splitbucket_guts(rel, metabuf, obucket,
-						   nbucket, obuf, bucket_nbuf, tidhtab,
-						   maxbucket, highmask, lowmask);
+	_hash_splitbucket(rel, metabuf, obucket,
+					  nbucket, obuf, bucket_nbuf, tidhtab,
+					  maxbucket, highmask, lowmask);
 
 	_hash_relbuf(rel, bucket_nbuf);
 	_hash_chgbufaccess(rel, obuf, HASH_READ, HASH_NOLOCK);
 	hash_destroy(tidhtab);
 }
+
+/*
+ *	log_split_page() -- Log the split operation
+ *
+ *	We log the split operation whenever a page of the new bucket gets full,
+ *	so we log the entire page.
+ *
+ *	'buf' must be locked by the caller which is also responsible for unlocking
+ *	it.
+ */
+static void
+log_split_page(Relation rel, Buffer buf)
+{
+	START_CRIT_SECTION();
+
+	MarkBufferDirty(buf);
+
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+
+		XLogBeginInsert();
+
+		XLogRegisterBuffer(0, buf, REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_PAGE);
+
+		PageSetLSN(BufferGetPage(buf), recptr);
+	}
+
+	END_CRIT_SECTION();
+}
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 8d43b38..18155ef 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -123,6 +123,7 @@ _hash_readnext(IndexScanDesc scan,
 	if (block_found)
 	{
 		*pagep = BufferGetPage(*bufp);
+		TestForOldSnapshot(scan->xs_snapshot, rel, *pagep);
 		*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
 	}
 }
@@ -159,6 +160,7 @@ _hash_readprev(IndexScanDesc scan,
 		*bufp = _hash_getbuf(rel, blkno, HASH_READ,
 							 LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
 		*pagep = BufferGetPage(*bufp);
+		TestForOldSnapshot(scan->xs_snapshot, rel, *pagep);
 		*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
 
 		/*
@@ -328,6 +330,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	_hash_dropbuf(rel, metabuf);
 
 	page = BufferGetPage(buf);
+	TestForOldSnapshot(scan->xs_snapshot, rel, page);
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	Assert(opaque->hasho_bucket == bucket);
 
@@ -362,6 +365,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 		_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
 
 		old_buf = _hash_getbuf(rel, old_blkno, HASH_READ, LH_BUCKET_PAGE);
+		TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(old_buf));
 
 		/*
 		 * remember the split bucket buffer so as to use it later for
@@ -564,6 +568,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					_hash_readprev(scan, &buf, &page, &opaque);
 					if (BufferIsValid(buf))
 					{
+						TestForOldSnapshot(scan->xs_snapshot, rel, page);
 						maxoff = PageGetMaxOffsetNumber(page);
 						offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
 					}
diff --git a/src/backend/access/rmgrdesc/hashdesc.c b/src/backend/access/rmgrdesc/hashdesc.c
index 12e1818..fc04363 100644
--- a/src/backend/access/rmgrdesc/hashdesc.c
+++ b/src/backend/access/rmgrdesc/hashdesc.c
@@ -19,10 +19,143 @@
 void
 hash_desc(StringInfo buf, XLogReaderState *record)
 {
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			{
+				xl_hash_init_meta_page *xlrec = (xl_hash_init_meta_page *) rec;
+
+				appendStringInfo(buf, "num_tuples %g, fillfactor %d",
+								 xlrec->num_tuples, xlrec->ffactor);
+				break;
+			}
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			{
+				xl_hash_init_bitmap_page *xlrec = (xl_hash_init_bitmap_page *) rec;
+
+				appendStringInfo(buf, "bmsize %d", xlrec->bmsize);
+				break;
+			}
+		case XLOG_HASH_INSERT:
+			{
+				xl_hash_insert *xlrec = (xl_hash_insert *) rec;
+
+				appendStringInfo(buf, "off %u", xlrec->offnum);
+				break;
+			}
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			{
+				xl_hash_addovflpage *xlrec = (xl_hash_addovflpage *) rec;
+
+				appendStringInfo(buf, "bmsize %d, bmpage_found %c",
+						   xlrec->bmsize, (xlrec->bmpage_found) ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			{
+				xl_hash_split_allocpage *xlrec = (xl_hash_split_allocpage *) rec;
+
+				appendStringInfo(buf, "new_bucket %u, meta_page_masks_updated %c, issplitpoint_changed %c",
+								 xlrec->new_bucket,
+					(xlrec->flags & XLH_SPLIT_META_UPDATE_MASKS) ? 'T' : 'F',
+								 (xlrec->flags & XLH_SPLIT_META_UPDATE_SPLITPOINT) ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SPLIT_COMPLETE:
+			{
+				xl_hash_split_complete *xlrec = (xl_hash_split_complete *) rec;
+
+				appendStringInfo(buf, "split_complete_old_bucket %c, split_complete_new_bucket %c",
+				(xlrec->old_bucket_flag & LH_BUCKET_BEING_SPLIT) ? 'F' : 'T',
+								 (xlrec->new_bucket_flag & LH_BUCKET_BEING_POPULATED) ? 'F' : 'T');
+				break;
+			}
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			{
+				xl_hash_move_page_contents *xlrec = (xl_hash_move_page_contents *) rec;
+
+				appendStringInfo(buf, "ntups %d, is_primary %c",
+								 xlrec->ntups,
+								 xlrec->is_prim_bucket_same_wrt ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SQUEEZE_PAGE:
+			{
+				xl_hash_squeeze_page *xlrec = (xl_hash_squeeze_page *) rec;
+
+				appendStringInfo(buf, "prevblkno %u, nextblkno %u, ntups %d, is_primary %c",
+								 xlrec->prevblkno,
+								 xlrec->nextblkno,
+								 xlrec->ntups,
+								 xlrec->is_prim_bucket_same_wrt ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_DELETE:
+			{
+				xl_hash_delete *xlrec = (xl_hash_delete *) rec;
+
+				appendStringInfo(buf, "is_primary %c",
+								 xlrec->is_primary_bucket_page ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_UPDATE_META_PAGE:
+			{
+				xl_hash_update_meta_page *xlrec = (xl_hash_update_meta_page *) rec;
+
+				appendStringInfo(buf, "ntuples %g",
+								 xlrec->ntuples);
+				break;
+			}
+	}
 }
 
 const char *
 hash_identify(uint8 info)
 {
-	return NULL;
+	const char *id = NULL;
+
+	switch (info & ~XLR_INFO_MASK)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			id = "INIT_META_PAGE";
+			break;
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			id = "INIT_BITMAP_PAGE";
+			break;
+		case XLOG_HASH_INSERT:
+			id = "INSERT";
+			break;
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			id = "ADD_OVFL_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			id = "SPLIT_ALLOCATE_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_PAGE:
+			id = "SPLIT_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_COMPLETE:
+			id = "SPLIT_COMPLETE";
+			break;
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			id = "MOVE_PAGE_CONTENTS";
+			break;
+		case XLOG_HASH_SQUEEZE_PAGE:
+			id = "SQUEEZE_PAGE";
+			break;
+		case XLOG_HASH_DELETE:
+			id = "DELETE";
+			break;
+		case XLOG_HASH_SPLIT_CLEANUP:
+			id = "SPLIT_CLEANUP";
+			break;
+		case XLOG_HASH_UPDATE_META_PAGE:
+			id = "UPDATE_META_PAGE";
+			break;
+	}
+
+	return id;
 }
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 85817c6..b868fd8 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -499,11 +499,6 @@ DefineIndex(Oid relationId,
 	accessMethodForm = (Form_pg_am) GETSTRUCT(tuple);
 	amRoutine = GetIndexAmRoutine(accessMethodForm->amhandler);
 
-	if (strcmp(accessMethodName, "hash") == 0 &&
-		RelationNeedsWAL(rel))
-		ereport(WARNING,
-				(errmsg("hash indexes are not WAL-logged and their use is discouraged")));
-
 	if (stmt->unique && !amRoutine->amcanunique)
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 73aa0c0..31b66d2 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -598,6 +598,33 @@ PageGetFreeSpace(Page page)
 }
 
 /*
+ * PageGetFreeSpaceForMulTups
+ *		Returns the size of the free (allocatable) space on a page,
+ *		reduced by the space needed for multiple new line pointers.
+ *
+ * Note: this should usually only be used on index pages.  Use
+ * PageGetHeapFreeSpace on heap pages.
+ */
+Size
+PageGetFreeSpaceForMulTups(Page page, int ntups)
+{
+	int			space;
+
+	/*
+	 * Use signed arithmetic here so that we behave sensibly if pd_lower >
+	 * pd_upper.
+	 */
+	space = (int) ((PageHeader) page)->pd_upper -
+		(int) ((PageHeader) page)->pd_lower;
+
+	if (space < (int) (ntups * sizeof(ItemIdData)))
+		return 0;
+	space -= ntups * sizeof(ItemIdData);
+
+	return (Size) space;
+}
+
+/*
  * PageGetExactFreeSpace
  *		Returns the size of the free (allocatable) space on a page,
  *		without any consideration for adding/removing line pointers.
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 79e0b1f..a918c6d 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -5379,13 +5379,10 @@ RelationIdIsInInitFile(Oid relationId)
 /*
  * Tells whether any index for the relation is unlogged.
  *
- * Any index using the hash AM is implicitly unlogged.
- *
  * Note: There doesn't seem to be any way to have an unlogged index attached
- * to a permanent table except to create a hash index, but it seems best to
- * keep this general so that it returns sensible results even when they seem
- * obvious (like for an unlogged table) and to handle possible future unlogged
- * indexes on permanent tables.
+ * to a permanent table, but it seems best to keep this general so that it
+ * returns sensible results even when they seem obvious (like for an unlogged
+ * table) and to handle possible future unlogged indexes on permanent tables.
  */
 bool
 RelationHasUnloggedIndex(Relation rel)
@@ -5407,8 +5404,7 @@ RelationHasUnloggedIndex(Relation rel)
 			elog(ERROR, "cache lookup failed for relation %u", indexoid);
 		reltup = (Form_pg_class) GETSTRUCT(tp);
 
-		if (reltup->relpersistence == RELPERSISTENCE_UNLOGGED
-			|| reltup->relam == HASH_AM_OID)
+		if (reltup->relpersistence == RELPERSISTENCE_UNLOGGED)
 			result = true;
 
 		ReleaseSysCache(tp);
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 6dfc41f..1a41ba7 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -310,13 +310,15 @@ extern Datum hash_uint32(uint32 k);
 extern void _hash_doinsert(Relation rel, IndexTuple itup);
 extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
 			   Size itemsize, IndexTuple itup);
+extern void _hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
+					OffsetNumber *itup_offsets, uint16 nitups);
 
 /* hashovfl.c */
 extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin);
-extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
-				   bool wbuf_dirty, BufferAccessStrategy bstrategy);
-extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
-				 BlockNumber blkno, ForkNumber forkNum);
+extern BlockNumber _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
+				   Buffer wbuf, IndexTuple *itups, OffsetNumber *itup_offsets,
+			 Size *tups_size, uint16 nitups, BufferAccessStrategy bstrategy);
+extern void _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage);
 extern void _hash_squeezebucket(Relation rel,
 					Bucket bucket, BlockNumber bucket_blkno,
 					Buffer bucket_buf,
@@ -328,6 +330,8 @@ extern Buffer _hash_getbuf(Relation rel, BlockNumber blkno,
 extern Buffer _hash_getbuf_with_condlock_cleanup(Relation rel,
 								   BlockNumber blkno, int flags);
 extern Buffer _hash_getinitbuf(Relation rel, BlockNumber blkno);
+extern void _hash_initbuf(Buffer buf, uint32 num_bucket, uint32 flag,
+			  bool initpage);
 extern Buffer _hash_getnewbuf(Relation rel, BlockNumber blkno,
 				ForkNumber forkNum);
 extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
@@ -336,11 +340,12 @@ extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
 extern void _hash_relbuf(Relation rel, Buffer buf);
 extern void _hash_dropbuf(Relation rel, Buffer buf);
 extern void _hash_dropscanbuf(Relation rel, HashScanOpaque so);
-extern void _hash_wrtbuf(Relation rel, Buffer buf);
 extern void _hash_chgbufaccess(Relation rel, Buffer buf, int from_access,
 				   int to_access);
-extern uint32 _hash_metapinit(Relation rel, double num_tuples,
-				ForkNumber forkNum);
+extern uint32 _hash_init(Relation rel, double num_tuples,
+		   ForkNumber forkNum);
+extern void _hash_init_metabuffer(Buffer buf, double num_tuples,
+					  RegProcedure procid, uint16 ffactor, bool initpage);
 extern void _hash_pageinit(Page page, Size size);
 extern void _hash_expandtable(Relation rel, Buffer metabuf);
 extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index 5f941a9..35c1b8f 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -17,6 +17,245 @@
 #include "access/hash.h"
 #include "access/xlogreader.h"
 
+/* Number of buffers required for XLOG_HASH_SQUEEZE_PAGE operation */
+#define HASH_XLOG_FREE_OVFL_BUFS	6
+
+/*
+ * XLOG records for hash operations
+ */
+#define XLOG_HASH_INIT_META_PAGE	0x00		/* initialize the meta page */
+#define XLOG_HASH_INIT_BITMAP_PAGE	0x10		/* initialize the bitmap page */
+#define XLOG_HASH_INSERT		0x20	/* add index tuple without split */
+#define XLOG_HASH_ADD_OVFL_PAGE 0x30	/* add overflow page */
+#define XLOG_HASH_SPLIT_ALLOCATE_PAGE	0x40	/* allocate new page for split */
+#define XLOG_HASH_SPLIT_PAGE	0x50	/* split page */
+#define XLOG_HASH_SPLIT_COMPLETE	0x60		/* completion of split
+												 * operation */
+#define XLOG_HASH_MOVE_PAGE_CONTENTS	0x70	/* remove tuples from one page
+												 * and add to another page */
+#define XLOG_HASH_SQUEEZE_PAGE	0x80	/* add tuples to one of the previous
+										 * pages in chain and free the ovfl
+										 * page */
+#define XLOG_HASH_DELETE		0x90	/* delete index tuples from a page */
+#define XLOG_HASH_SPLIT_CLEANUP 0xA0	/* clear split-cleanup flag in primary
+										 * bucket page after deleting tuples
+										 * that are moved due to split	*/
+#define XLOG_HASH_UPDATE_META_PAGE	0xB0		/* update meta page after
+												 * vacuum */
+
+
+/*
+ * xl_hash_split_allocpage flag values, 8 bits are available.
+ */
+#define XLH_SPLIT_META_UPDATE_MASKS		(1<<0)
+#define XLH_SPLIT_META_UPDATE_SPLITPOINT		(1<<1)
+
+/*
+ * Data to regenerate the meta-data page
+ */
+typedef struct xl_hash_metadata
+{
+	HashMetaPageData metadata;
+}	xl_hash_metadata;
+
+/*
+ * This is what we need to know about a HASH index create.
+ *
+ * Backup block 0: metapage
+ */
+typedef struct xl_hash_createidx
+{
+	double		num_tuples;
+	RegProcedure procid;
+	uint16		ffactor;
+}	xl_hash_createidx;
+#define SizeOfHashCreateIdx (offsetof(xl_hash_createidx, ffactor) + sizeof(uint16))
+
+/*
+ * This is what we need to know about simple (without split) insert.
+ *
+ * This data record is used for XLOG_HASH_INSERT
+ *
+ * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 1: metapage (xl_hash_metadata)
+ */
+typedef struct xl_hash_insert
+{
+	OffsetNumber offnum;
+}	xl_hash_insert;
+
+#define SizeOfHashInsert	(offsetof(xl_hash_insert, offnum) + sizeof(OffsetNumber))
+
+/*
+ * This is what we need to know about addition of overflow page.
+ *
+ * This data record is used for XLOG_HASH_ADD_OVFL_PAGE
+ *
+ * Backup Blk 0: newly allocated overflow page
+ * Backup Blk 1: page before new overflow page in the bucket chain
+ * Backup Blk 2: bitmap page
+ * Backup Blk 3: new bitmap page
+ * Backup Blk 4: metapage
+ */
+typedef struct xl_hash_addovflpage
+{
+	uint16		bmsize;
+	bool		bmpage_found;
+}	xl_hash_addovflpage;
+
+#define SizeOfHashAddOvflPage	\
+	(offsetof(xl_hash_addovflpage, bmpage_found) + sizeof(bool))
+
+/*
+ * This is what we need to know about allocating a page for split.
+ *
+ * This data record is used for XLOG_HASH_SPLIT_ALLOCATE_PAGE
+ *
+ * Backup Blk 0: page for old bucket
+ * Backup Blk 1: page for new bucket
+ * Backup Blk 2: metapage
+ */
+typedef struct xl_hash_split_allocpage
+{
+	uint32		new_bucket;
+	uint16		old_bucket_flag;
+	uint16		new_bucket_flag;
+	uint8		flags;
+}	xl_hash_split_allocpage;
+
+#define SizeOfHashSplitAllocPage	\
+	(offsetof(xl_hash_split_allocpage, flags) + sizeof(uint8))
+
+/*
+ * This is what we need to know about completing the split operation.
+ *
+ * This data record is used for XLOG_HASH_SPLIT_COMPLETE
+ *
+ * Backup Blk 0: page for old bucket
+ * Backup Blk 1: page for new bucket
+ */
+typedef struct xl_hash_split_complete
+{
+	uint16		old_bucket_flag;
+	uint16		new_bucket_flag;
+}	xl_hash_split_complete;
+
+#define SizeOfHashSplitComplete \
+	(offsetof(xl_hash_split_complete, new_bucket_flag) + sizeof(uint16))
+
+/*
+ * This is what we need to know about move page contents required during
+ * squeeze operation.
+ *
+ * This data record is used for XLOG_HASH_MOVE_PAGE_CONTENTS
+ *
+ * Backup Blk 0: bucket page
+ * Backup Blk 1: page containing moved tuples
+ * Backup Blk 2: page from which tuples will be removed
+ */
+typedef struct xl_hash_move_page_contents
+{
+	uint16		ntups;
+	bool		is_prim_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is same as
+												 * primary bucket page */
+}	xl_hash_move_page_contents;
+
+#define SizeOfHashMovePageContents	\
+	(offsetof(xl_hash_move_page_contents, is_prim_bucket_same_wrt) + sizeof(bool))
+
+/*
+ * This is what we need to know about the squeeze page operation.
+ *
+ * This data record is used for XLOG_HASH_SQUEEZE_PAGE
+ *
+ * Backup Blk 0: page containing tuples moved from freed overflow page
+ * Backup Blk 1: freed overflow page
+ * Backup Blk 2: page previous to the freed overflow page
+ * Backup Blk 3: page next to the freed overflow page
+ * Backup Blk 4: bitmap page containing info of freed overflow page
+ * Backup Blk 5: meta page
+ */
+typedef struct xl_hash_squeeze_page
+{
+	BlockNumber prevblkno;
+	BlockNumber nextblkno;
+	uint16		ntups;
+	bool		is_prim_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is same as
+												 * primary bucket page */
+	bool		is_prev_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is the
+												 * page previous to the freed
+												 * overflow page */
+}	xl_hash_squeeze_page;
+
+#define SizeOfHashSqueezePage	\
+	(offsetof(xl_hash_squeeze_page, is_prev_bucket_same_wrt) + sizeof(bool))
+
+/*
+ * This is what we need to know about the deletion of index tuples from a page.
+ *
+ * This data record is used for XLOG_HASH_DELETE
+ *
+ * Backup Blk 0: primary bucket page
+ * Backup Blk 1: page from which tuples are deleted
+ */
+typedef struct xl_hash_delete
+{
+	bool		is_primary_bucket_page; /* TRUE if the operation is for
+										 * primary bucket page */
+}	xl_hash_delete;
+
+#define SizeOfHashDelete	(offsetof(xl_hash_delete, is_primary_bucket_page) + sizeof(bool))
+
+/*
+ * This is what we need for metapage update operation.
+ *
+ * This data record is used for XLOG_HASH_UPDATE_META_PAGE
+ *
+ * Backup Blk 0: meta page
+ */
+typedef struct xl_hash_update_meta_page
+{
+	double		ntuples;
+}	xl_hash_update_meta_page;
+
+#define SizeOfHashUpdateMetaPage	\
+	(offsetof(xl_hash_update_meta_page, ntuples) + sizeof(double))
+
+/*
+ * This is what we need to initialize metapage.
+ *
+ * This data record is used for XLOG_HASH_INIT_META_PAGE
+ *
+ * Backup Blk 0: meta page
+ */
+typedef struct xl_hash_init_meta_page
+{
+	double		num_tuples;
+	RegProcedure procid;
+	uint16		ffactor;
+}	xl_hash_init_meta_page;
+
+#define SizeOfHashInitMetaPage		\
+	(offsetof(xl_hash_init_meta_page, ffactor) + sizeof(uint16))
+
+/*
+ * This is what we need to initialize bitmap page.
+ *
+ * This data record is used for XLOG_HASH_INIT_BITMAP_PAGE
+ *
+ * Backup Blk 0: bitmap page
+ * Backup Blk 1: meta page
+ */
+typedef struct xl_hash_init_bitmap_page
+{
+	uint16		bmsize;
+}	xl_hash_init_bitmap_page;
+
+#define SizeOfHashInitBitmapPage	\
+	(offsetof(xl_hash_init_bitmap_page, bmsize) + sizeof(uint16))
 
 extern void hash_redo(XLogReaderState *record);
 extern void hash_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index ad4ab5f..6ea46ef 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -425,6 +425,7 @@ extern Page PageGetTempPageCopySpecial(Page page);
 extern void PageRestoreTempPage(Page tempPage, Page oldPage);
 extern void PageRepairFragmentation(Page page);
 extern Size PageGetFreeSpace(Page page);
+extern Size PageGetFreeSpaceForMulTups(Page page, int ntups);
 extern Size PageGetExactFreeSpace(Page page);
 extern Size PageGetHeapFreeSpace(Page page);
 extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index e663f9a..6b2f693 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -2335,13 +2335,9 @@ Options: fastupdate=on, gin_pending_list_limit=128
 -- HASH
 --
 CREATE INDEX hash_i4_index ON hash_i4_heap USING hash (random int4_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_name_index ON hash_name_heap USING hash (random name_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_txt_index ON hash_txt_heap USING hash (random text_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_f8_index ON hash_f8_heap USING hash (random float8_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE UNLOGGED TABLE unlogged_hash_table (id int4);
 CREATE INDEX unlogged_hash_index ON unlogged_hash_table USING hash (id int4_ops);
 DROP TABLE unlogged_hash_table;
@@ -2350,7 +2346,6 @@ DROP TABLE unlogged_hash_table;
 -- maintenance_work_mem setting and fillfactor:
 SET maintenance_work_mem = '1MB';
 CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 EXPLAIN (COSTS OFF)
 SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
                       QUERY PLAN                       
diff --git a/src/test/regress/expected/enum.out b/src/test/regress/expected/enum.out
index 514d1d0..0e60304 100644
--- a/src/test/regress/expected/enum.out
+++ b/src/test/regress/expected/enum.out
@@ -383,7 +383,6 @@ DROP INDEX enumtest_btree;
 -- Hash index / opclass with the = operator
 --
 CREATE INDEX enumtest_hash ON enumtest USING hash (col);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 SELECT * FROM enumtest WHERE col = 'orange';
   col   
 --------
diff --git a/src/test/regress/expected/hash_index.out b/src/test/regress/expected/hash_index.out
index f8b9f02..0a18efa 100644
--- a/src/test/regress/expected/hash_index.out
+++ b/src/test/regress/expected/hash_index.out
@@ -201,7 +201,6 @@ SELECT h.seqno AS f20000
 --
 CREATE TABLE hash_split_heap (keycol INT);
 CREATE INDEX hash_split_index on hash_split_heap USING HASH (keycol);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 INSERT INTO hash_split_heap SELECT 1 FROM generate_series(1, 70000) a;
 VACUUM FULL hash_split_heap;
 -- Let's do a backward scan.
@@ -230,5 +229,4 @@ DROP TABLE hash_temp_heap CASCADE;
 CREATE TABLE hash_heap_float4 (x float4, y int);
 INSERT INTO hash_heap_float4 VALUES (1.1,1);
 CREATE INDEX hash_idx ON hash_heap_float4 USING hash (x);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 DROP TABLE hash_heap_float4 CASCADE;
diff --git a/src/test/regress/expected/macaddr.out b/src/test/regress/expected/macaddr.out
index e84ff5f..151f9ce 100644
--- a/src/test/regress/expected/macaddr.out
+++ b/src/test/regress/expected/macaddr.out
@@ -41,7 +41,6 @@ SELECT * FROM macaddr_data;
 
 CREATE INDEX macaddr_data_btree ON macaddr_data USING btree (b);
 CREATE INDEX macaddr_data_hash ON macaddr_data USING hash (b);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 SELECT a, b, trunc(b) FROM macaddr_data ORDER BY 2, 1;
  a  |         b         |       trunc       
 ----+-------------------+-------------------
diff --git a/src/test/regress/expected/replica_identity.out b/src/test/regress/expected/replica_identity.out
index 1a04ec5..f0bb253 100644
--- a/src/test/regress/expected/replica_identity.out
+++ b/src/test/regress/expected/replica_identity.out
@@ -12,7 +12,6 @@ CREATE UNIQUE INDEX test_replica_identity_keyab_key ON test_replica_identity (ke
 CREATE UNIQUE INDEX test_replica_identity_oid_idx ON test_replica_identity (oid);
 CREATE UNIQUE INDEX test_replica_identity_nonkey ON test_replica_identity (keya, nonkey);
 CREATE INDEX test_replica_identity_hash ON test_replica_identity USING hash (nonkey);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE UNIQUE INDEX test_replica_identity_expr ON test_replica_identity (keya, keyb, (3));
 CREATE UNIQUE INDEX test_replica_identity_partial ON test_replica_identity (keya, keyb) WHERE keyb != '3';
 -- default is 'd'/DEFAULT for user created tables
diff --git a/src/test/regress/expected/uuid.out b/src/test/regress/expected/uuid.out
index 59cb1e0..d907519 100644
--- a/src/test/regress/expected/uuid.out
+++ b/src/test/regress/expected/uuid.out
@@ -114,7 +114,6 @@ SELECT COUNT(*) FROM guid1 WHERE guid_field >= '22222222-2222-2222-2222-22222222
 -- btree and hash index creation test
 CREATE INDEX guid1_btree ON guid1 USING BTREE (guid_field);
 CREATE INDEX guid1_hash  ON guid1 USING HASH  (guid_field);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 -- unique index test
 CREATE UNIQUE INDEX guid1_unique_BTREE ON guid1 USING BTREE (guid_field);
 -- should fail
#63Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#62)
1 attachment(s)
Re: Write Ahead Logging for Hash Indexes

On Mon, Dec 5, 2016 at 2:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I'll review after that, since I have other things to review meanwhile.

Attached, please find the rebased patch.  There is no fundamental
change in the patch except for adapting the new locking strategy in
the squeeze operation.  I have done sanity testing of the patch with a
master-standby setup, and Kuntal has helped me verify it with his
WAL-consistency checker patch.

This patch again needs a rebase, but before you do that I'd like to
make it harder by applying the attached patch to remove
_hash_chgbufaccess(), which I think is a bad plan for more or less the
same reasons that motivated the removal of _hash_wrtbuf() in commit
25216c98938495fd741bf585dcbef45b3a9ffd40. I think there's probably
more simplification and cleanup that can be done afterward in the wake
of this; what I've done here is just a mechanical replacement of
_hash_chgbufaccess() with LockBuffer() and/or MarkBufferDirty(). The
point is that having MarkBufferDirty() calls happen implicitly inside
some other function is not what we want for write-ahead logging. Your
patch gets rid of that, too; this is just doing it somewhat more
thoroughly.
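
To make the mechanical replacement concrete, here is a minimal sketch of
what the substitution looks like at a typical call site.  It is not taken
from the patch below; the wrapper function and the "metabuf" variable are
illustrative only.

/*
 * Sketch only: "metabuf" stands for any pinned, exclusive-locked buffer;
 * the surrounding function is hypothetical and not part of the patch.
 */
static void
release_dirty_metapage_sketch(Buffer metabuf)
{
	/*
	 * Before: the dirty-marking happened implicitly inside the helper:
	 *     _hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
	 */

	/*
	 * After: both steps are explicit at the call site, leaving room to
	 * insert a WAL record between dirtying the page and dropping the lock.
	 */
	MarkBufferDirty(metabuf);
	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
}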

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

remove-chgbufaccess.patch (text/x-diff; charset=US-ASCII)
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0eeb37d..1fa087a 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -274,7 +274,7 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 	 * Reacquire the read lock here.
 	 */
 	if (BufferIsValid(so->hashso_curbuf))
-		_hash_chgbufaccess(rel, so->hashso_curbuf, HASH_NOLOCK, HASH_READ);
+		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
 
 	/*
 	 * If we've already initialized this scan, we can just advance it in the
@@ -354,7 +354,7 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 
 	/* Release read lock on current buffer, but keep it pinned */
 	if (BufferIsValid(so->hashso_curbuf))
-		_hash_chgbufaccess(rel, so->hashso_curbuf, HASH_READ, HASH_NOLOCK);
+		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
 
 	/* Return current heap TID on success */
 	scan->xs_ctup.t_self = so->hashso_heappos;
@@ -524,7 +524,7 @@ hashbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	orig_ntuples = metap->hashm_ntuples;
 	memcpy(&local_metapage, metap, sizeof(local_metapage));
 	/* release the lock, but keep pin */
-	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 
 	/* Scan the buckets that we know exist */
 	cur_bucket = 0;
@@ -576,9 +576,9 @@ loop_top:
 			 * (and thus can't be further split), update our cached metapage
 			 * data.
 			 */
-			_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+			LockBuffer(metabuf, BUFFER_LOCK_SHARE);
 			memcpy(&local_metapage, metap, sizeof(local_metapage));
-			_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+			LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 		}
 
 		bucket_buf = buf;
@@ -597,7 +597,7 @@ loop_top:
 	}
 
 	/* Write-lock metapage and check for split since we started */
-	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
+	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 	metap = HashPageGetMeta(BufferGetPage(metabuf));
 
 	if (cur_maxbucket != metap->hashm_maxbucket)
@@ -605,7 +605,7 @@ loop_top:
 		/* There's been a split, so process the additional bucket(s) */
 		cur_maxbucket = metap->hashm_maxbucket;
 		memcpy(&local_metapage, metap, sizeof(local_metapage));
-		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+		LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 		goto loop_top;
 	}
 
@@ -821,7 +821,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 * page
 		 */
 		if (retain_pin)
-			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 		else
 			_hash_relbuf(rel, buf);
 
@@ -836,7 +836,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	if (buf != bucket_buf)
 	{
 		_hash_relbuf(rel, buf);
-		_hash_chgbufaccess(rel, bucket_buf, HASH_NOLOCK, HASH_WRITE);
+		LockBuffer(bucket_buf, BUFFER_LOCK_EXCLUSIVE);
 	}
 
 	/*
@@ -866,7 +866,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		_hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf,
 							bstrategy);
 	else
-		_hash_chgbufaccess(rel, bucket_buf, HASH_READ, HASH_NOLOCK);
+		LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
 }
 
 void
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 59c4213..b51b487 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -102,7 +102,7 @@ restart_insert:
 		lowmask = metap->hashm_lowmask;
 
 		/* Release metapage lock, but keep pin. */
-		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+		LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 
 		/*
 		 * If the previous iteration of this loop locked the primary page of
@@ -123,7 +123,7 @@ restart_insert:
 		 * Reacquire metapage lock and check that no bucket split has taken
 		 * place while we were awaiting the bucket lock.
 		 */
-		_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+		LockBuffer(metabuf, BUFFER_LOCK_SHARE);
 		oldblkno = blkno;
 		retry = true;
 	}
@@ -147,7 +147,7 @@ restart_insert:
 	if (H_BUCKET_BEING_SPLIT(pageopaque) && IsBufferCleanupOK(buf))
 	{
 		/* release the lock on bucket buffer, before completing the split. */
-		_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
 		_hash_finish_split(rel, metabuf, buf, pageopaque->hasho_bucket,
 						   maxbucket, highmask, lowmask);
@@ -178,7 +178,7 @@ restart_insert:
 			if (buf != bucket_buf)
 				_hash_relbuf(rel, buf);
 			else
-				_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+				LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 			buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
 			page = BufferGetPage(buf);
 		}
@@ -190,7 +190,7 @@ restart_insert:
 			 */
 
 			/* release our write lock without modifying buffer */
-			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
 			/* chain to a new overflow page */
 			buf = _hash_addovflpage(rel, metabuf, buf, (buf == bucket_buf) ? true : false);
@@ -221,7 +221,7 @@ restart_insert:
 	 * Write-lock the metapage so we can increment the tuple count. After
 	 * incrementing it, check to see if it's time for a split.
 	 */
-	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
+	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
 	metap->hashm_ntuples += 1;
 
@@ -230,7 +230,8 @@ restart_insert:
 		(double) metap->hashm_ffactor * (metap->hashm_maxbucket + 1);
 
 	/* Write out the metapage and drop lock, but keep pin */
-	_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
+	MarkBufferDirty(metabuf);
+	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 
 	/* Attempt to split if a split is needed */
 	if (do_expand)
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index df7838c..6b106f3 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -110,7 +110,7 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 	 * Write-lock the tail page.  It is okay to hold two buffer locks here
 	 * since there cannot be anyone else contending for access to ovflbuf.
 	 */
-	_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_WRITE);
+	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
 	/* probably redundant... */
 	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
@@ -129,7 +129,7 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 
 		/* we assume we do not need to write the unmodified page */
 		if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
-			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 		else
 			_hash_relbuf(rel, buf);
 
@@ -151,7 +151,7 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
 	MarkBufferDirty(buf);
 	if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
-		_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 	else
 		_hash_relbuf(rel, buf);
 
@@ -187,7 +187,7 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 				j;
 
 	/* Get exclusive lock on the meta page */
-	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
+	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
 	_hash_checkpage(rel, metabuf, LH_META_PAGE);
 	metap = HashPageGetMeta(BufferGetPage(metabuf));
@@ -225,7 +225,7 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 			last_inpage = BMPGSZ_BIT(metap) - 1;
 
 		/* Release exclusive lock on metapage while reading bitmap page */
-		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+		LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 
 		mapbuf = _hash_getbuf(rel, mapblkno, HASH_WRITE, LH_BITMAP_PAGE);
 		mappage = BufferGetPage(mapbuf);
@@ -244,7 +244,7 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 		bit = 0;
 
 		/* Reacquire exclusive lock on the meta page */
-		_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
+		LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 	}
 
 	/*
@@ -295,7 +295,8 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 		metap->hashm_firstfree = bit + 1;
 
 	/* Write updated metapage and release lock, but not pin */
-	_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
+	MarkBufferDirty(metabuf);
+	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 
 	return newbuf;
 
@@ -309,7 +310,7 @@ found:
 	_hash_relbuf(rel, mapbuf);
 
 	/* Reacquire exclusive lock on the meta page */
-	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
+	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
 	/* convert bit to absolute bit number */
 	bit += (i << BMPG_SHIFT(metap));
@@ -326,12 +327,13 @@ found:
 		metap->hashm_firstfree = bit + 1;
 
 		/* Write updated metapage and release lock, but not pin */
-		_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
+		MarkBufferDirty(metabuf);
+		LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 	}
 	else
 	{
 		/* We didn't change the metapage, so no need to write */
-		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+		LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 	}
 
 	/* Fetch, init, and return the recycled page */
@@ -483,7 +485,7 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 	blkno = metap->hashm_mapp[bitmappage];
 
 	/* Release metapage lock while we access the bitmap page */
-	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 
 	/* Clear the bitmap bit to indicate that this overflow page is free */
 	mapbuf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BITMAP_PAGE);
@@ -495,7 +497,7 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 	_hash_relbuf(rel, mapbuf);
 
 	/* Get write-lock on metapage to update firstfree */
-	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
+	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
 	/* if this is now the first free page, update hashm_firstfree */
 	if (ovflbitno < metap->hashm_firstfree)
@@ -633,7 +635,7 @@ _hash_squeezebucket(Relation rel,
 	 */
 	if (!BlockNumberIsValid(wopaque->hasho_nextblkno))
 	{
-		_hash_chgbufaccess(rel, wbuf, HASH_READ, HASH_NOLOCK);
+		LockBuffer(wbuf, BUFFER_LOCK_UNLOCK);
 		return;
 	}
 
@@ -721,7 +723,7 @@ _hash_squeezebucket(Relation rel,
 				if (wbuf_dirty)
 					MarkBufferDirty(wbuf);
 				if (retain_pin)
-					_hash_chgbufaccess(rel, wbuf, HASH_READ, HASH_NOLOCK);
+					LockBuffer(wbuf, BUFFER_LOCK_UNLOCK);
 				else
 					_hash_relbuf(rel, wbuf);
 
@@ -784,7 +786,7 @@ _hash_squeezebucket(Relation rel,
 		{
 			/* retain the pin on primary bucket page till end of bucket scan */
 			if (wblkno == bucket_blkno)
-				_hash_chgbufaccess(rel, wbuf, HASH_READ, HASH_NOLOCK);
+				LockBuffer(wbuf, BUFFER_LOCK_UNLOCK);
 			else
 				_hash_relbuf(rel, wbuf);
 			return;
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index a3d2138..45e184c 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -289,32 +289,6 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 	so->hashso_buc_split = false;
 }
 
-/*
- * _hash_chgbufaccess() -- Change the lock type on a buffer, without
- *			dropping our pin on it.
- *
- * from_access and to_access may be HASH_READ, HASH_WRITE, or HASH_NOLOCK,
- * the last indicating that no buffer-level lock is held or wanted.
- *
- * When from_access == HASH_WRITE, we assume the buffer is dirty and tell
- * bufmgr it must be written out.  If the caller wants to release a write
- * lock on a page that's not been modified, it's okay to pass from_access
- * as HASH_READ (a bit ugly, but handy in some places).
- */
-void
-_hash_chgbufaccess(Relation rel,
-				   Buffer buf,
-				   int from_access,
-				   int to_access)
-{
-	if (from_access == HASH_WRITE)
-		MarkBufferDirty(buf);
-	if (from_access != HASH_NOLOCK)
-		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-	if (to_access != HASH_NOLOCK)
-		LockBuffer(buf, to_access);
-}
-
 
 /*
  *	_hash_metapinit() -- Initialize the metadata page of a hash index,
@@ -446,7 +420,8 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	 * won't accomplish anything.  It's a bad idea to hold buffer locks for
 	 * long intervals in any case, since that can block the bgwriter.
 	 */
-	_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
+	MarkBufferDirty(metabuf);
+	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 
 	/*
 	 * Initialize the first N buckets
@@ -469,7 +444,7 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	}
 
 	/* Now reacquire buffer lock on metapage */
-	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
+	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
 	/*
 	 * Initialize first bitmap page
@@ -528,7 +503,7 @@ restart_expand:
 	 * Write-lock the meta page.  It used to be necessary to acquire a
 	 * heavyweight lock to begin a split, but that is no longer required.
 	 */
-	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
+	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
 	_hash_checkpage(rel, metabuf, LH_META_PAGE);
 	metap = HashPageGetMeta(BufferGetPage(metabuf));
@@ -609,8 +584,8 @@ restart_expand:
 		 * Release the lock on metapage and old_bucket, before completing the
 		 * split.
 		 */
-		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
-		_hash_chgbufaccess(rel, buf_oblkno, HASH_READ, HASH_NOLOCK);
+		LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+		LockBuffer(buf_oblkno, BUFFER_LOCK_UNLOCK);
 
 		_hash_finish_split(rel, metabuf, buf_oblkno, old_bucket, maxbucket,
 						   highmask, lowmask);
@@ -646,7 +621,7 @@ restart_expand:
 		lowmask = metap->hashm_lowmask;
 
 		/* Release the metapage lock. */
-		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+		LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 
 		hashbucketcleanup(rel, old_bucket, buf_oblkno, start_oblkno, NULL,
 						  maxbucket, highmask, lowmask, NULL, NULL, true,
@@ -753,7 +728,8 @@ restart_expand:
 	lowmask = metap->hashm_lowmask;
 
 	/* Write out the metapage and drop lock, but keep pin */
-	_hash_chgbufaccess(rel, metabuf, HASH_WRITE, HASH_NOLOCK);
+	MarkBufferDirty(metabuf);
+	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 
 	/* Relocate records to the new bucket */
 	_hash_splitbucket(rel, metabuf,
@@ -767,7 +743,7 @@ restart_expand:
 fail:
 
 	/* We didn't write the metapage, so just drop lock */
-	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 }
 
 
@@ -1001,7 +977,8 @@ _hash_splitbucket_guts(Relation rel,
 				if (PageGetFreeSpace(npage) < itemsz)
 				{
 					/* write out nbuf and drop lock, but keep pin */
-					_hash_chgbufaccess(rel, nbuf, HASH_WRITE, HASH_NOLOCK);
+					MarkBufferDirty(nbuf);
+					LockBuffer(nbuf, BUFFER_LOCK_UNLOCK);
 					/* chain to a new overflow page */
 					nbuf = _hash_addovflpage(rel, metabuf, nbuf, (nbuf == bucket_nbuf) ? true : false);
 					npage = BufferGetPage(nbuf);
@@ -1033,7 +1010,7 @@ _hash_splitbucket_guts(Relation rel,
 
 		/* retain the pin on the old primary bucket */
 		if (obuf == bucket_obuf)
-			_hash_chgbufaccess(rel, obuf, HASH_READ, HASH_NOLOCK);
+			LockBuffer(obuf, BUFFER_LOCK_UNLOCK);
 		else
 			_hash_relbuf(rel, obuf);
 
@@ -1056,18 +1033,21 @@ _hash_splitbucket_guts(Relation rel,
 	 * bucket and then the new bucket.
 	 */
 	if (nbuf == bucket_nbuf)
-		_hash_chgbufaccess(rel, bucket_nbuf, HASH_WRITE, HASH_NOLOCK);
+	{
+		MarkBufferDirty(bucket_nbuf);
+		LockBuffer(bucket_nbuf, BUFFER_LOCK_UNLOCK);
+	}
 	else
 	{
 		MarkBufferDirty(nbuf);
 		_hash_relbuf(rel, nbuf);
 	}
 
-	_hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
+	LockBuffer(bucket_obuf, BUFFER_LOCK_EXCLUSIVE);
 	opage = BufferGetPage(bucket_obuf);
 	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
 
-	_hash_chgbufaccess(rel, bucket_nbuf, HASH_NOLOCK, HASH_WRITE);
+	LockBuffer(bucket_nbuf, BUFFER_LOCK_EXCLUSIVE);
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
@@ -1172,7 +1152,7 @@ _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Bucket obucket,
 		 * retain the pin on primary bucket.
 		 */
 		if (nbuf == bucket_nbuf)
-			_hash_chgbufaccess(rel, nbuf, HASH_READ, HASH_NOLOCK);
+			LockBuffer(nbuf, BUFFER_LOCK_UNLOCK);
 		else
 			_hash_relbuf(rel, nbuf);
 
@@ -1194,7 +1174,7 @@ _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Bucket obucket,
 	}
 	if (!ConditionalLockBufferForCleanup(bucket_nbuf))
 	{
-		_hash_chgbufaccess(rel, obuf, HASH_READ, HASH_NOLOCK);
+		LockBuffer(obuf, BUFFER_LOCK_UNLOCK);
 		hash_destroy(tidhtab);
 		return;
 	}
@@ -1208,6 +1188,6 @@ _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Bucket obucket,
 						   maxbucket, highmask, lowmask);
 
 	_hash_relbuf(rel, bucket_nbuf);
-	_hash_chgbufaccess(rel, obuf, HASH_READ, HASH_NOLOCK);
+	LockBuffer(obuf, BUFFER_LOCK_UNLOCK);
 	hash_destroy(tidhtab);
 }
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 8d43b38..913b87c 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -83,7 +83,7 @@ _hash_readnext(IndexScanDesc scan,
 	 * comments in _hash_first to know the reason of retaining pin.
 	 */
 	if (*bufp == so->hashso_bucket_buf || *bufp == so->hashso_split_bucket_buf)
-		_hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+		LockBuffer(*bufp, BUFFER_LOCK_UNLOCK);
 	else
 		_hash_relbuf(rel, *bufp);
 
@@ -109,7 +109,7 @@ _hash_readnext(IndexScanDesc scan,
 		 */
 		Assert(BufferIsValid(*bufp));
 
-		_hash_chgbufaccess(rel, *bufp, HASH_NOLOCK, HASH_READ);
+		LockBuffer(*bufp, BUFFER_LOCK_SHARE);
 
 		/*
 		 * setting hashso_buc_split to true indicates that we are scanning
@@ -147,7 +147,7 @@ _hash_readprev(IndexScanDesc scan,
 	 * comments in _hash_first to know the reason of retaining pin.
 	 */
 	if (*bufp == so->hashso_bucket_buf || *bufp == so->hashso_split_bucket_buf)
-		_hash_chgbufaccess(rel, *bufp, HASH_READ, HASH_NOLOCK);
+		LockBuffer(*bufp, BUFFER_LOCK_UNLOCK);
 	else
 		_hash_relbuf(rel, *bufp);
 
@@ -182,7 +182,7 @@ _hash_readprev(IndexScanDesc scan,
 		 */
 		Assert(BufferIsValid(*bufp));
 
-		_hash_chgbufaccess(rel, *bufp, HASH_NOLOCK, HASH_READ);
+		LockBuffer(*bufp, BUFFER_LOCK_SHARE);
 		*pagep = BufferGetPage(*bufp);
 		*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
 
@@ -298,7 +298,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 		blkno = BUCKET_TO_BLKNO(metap, bucket);
 
 		/* Release metapage lock, but keep pin. */
-		_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
+		LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 
 		/*
 		 * If the previous iteration of this loop locked what is still the
@@ -319,7 +319,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 		 * Reacquire metapage lock and check that no bucket split has taken
 		 * place while we were awaiting the bucket lock.
 		 */
-		_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_READ);
+		LockBuffer(metabuf, BUFFER_LOCK_SHARE);
 		oldblkno = blkno;
 		retry = true;
 	}
@@ -359,7 +359,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 		 * release the lock on new bucket and re-acquire it after acquiring
 		 * the lock on old bucket.
 		 */
-		_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
 		old_buf = _hash_getbuf(rel, old_blkno, HASH_READ, LH_BUCKET_PAGE);
 
@@ -368,9 +368,9 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 		 * scanning.
 		 */
 		so->hashso_split_bucket_buf = old_buf;
-		_hash_chgbufaccess(rel, old_buf, HASH_READ, HASH_NOLOCK);
+		LockBuffer(old_buf, BUFFER_LOCK_UNLOCK);
 
-		_hash_chgbufaccess(rel, buf, HASH_NOLOCK, HASH_READ);
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
 		page = BufferGetPage(buf);
 		opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 		Assert(opaque->hasho_bucket == bucket);
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 627fa2c..bc08f81 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -336,8 +336,6 @@ extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
 extern void _hash_relbuf(Relation rel, Buffer buf);
 extern void _hash_dropbuf(Relation rel, Buffer buf);
 extern void _hash_dropscanbuf(Relation rel, HashScanOpaque so);
-extern void _hash_chgbufaccess(Relation rel, Buffer buf, int from_access,
-				   int to_access);
 extern uint32 _hash_metapinit(Relation rel, double num_tuples,
 				ForkNumber forkNum);
 extern void _hash_pageinit(Page page, Size size);
#64Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#63)
Re: Write Ahead Logging for Hash Indexes

On Thu, Dec 22, 2016 at 9:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> On Mon, Dec 5, 2016 at 2:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>>> I'll review after that, since I have other things to review meanwhile.
>>
>> Attached, please find the rebased patch attached with this e-mail.
>> There is no fundamental change in patch except for adapting the new
>> locking strategy in squeeze operation. I have done the sanity testing
>> of the patch with master-standby setup and Kuntal has helped me to
>> verify it with his wal-consistency checker patch.
>
> This patch again needs a rebase,

I had completed it last night and left it to run overnight tests to
check the sanity of the patch, but I guess now I need another rebase.
Anyway, I feel this is all for the betterment of the final patch.

> but before you do that I'd like to
> make it harder by applying the attached patch to remove
> _hash_chgbufaccess(), which I think is a bad plan for more or less the
> same reasons that motivated the removal of _hash_wrtbuf() in commit
> 25216c98938495fd741bf585dcbef45b3a9ffd40. I think there's probably
> more simplification and cleanup that can be done afterward in the wake
> of this; what I've done here is just a mechanical replacement of
> _hash_chgbufaccess() with LockBuffer() and/or MarkBufferDirty().

The patch has replaced usage of HASH_READ/HASH_WRITE with
BUFFER_LOCK_SHARE/BUFFER_LOCK_EXCLUSIVE, which will leave the hash code
using two sets of defines for the same purpose.  One could argue that it
is better to replace HASH_READ/HASH_WRITE throughout the hash index
code, or that the usage this patch introduces is okay; after all, btree
uses similar terminology (BT_READ/BT_WRITE).  Other than that, your
patch looks good.
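
For reference, a minimal sketch of the two naming schemes in question.
The defines below mirror the existing ones in access/hash.h and
access/nbtree.h; the usage comment is illustrative, not taken from
either patch.

/* hash AM aliases (access/hash.h) */
#define HASH_READ	BUFFER_LOCK_SHARE
#define HASH_WRITE	BUFFER_LOCK_EXCLUSIVE
#define HASH_NOLOCK	(-1)

/* btree AM aliases (access/nbtree.h) follow the same pattern */
#define BT_READ		BUFFER_LOCK_SHARE
#define BT_WRITE	BUFFER_LOCK_EXCLUSIVE

/*
 * So the following two calls are equivalent; the open question above is
 * only which spelling the hash code should use consistently:
 *
 *     LockBuffer(metabuf, HASH_WRITE);
 *     LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 */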

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#65Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#64)
1 attachment(s)
Re: Write Ahead Logging for Hash Indexes

On Fri, Dec 23, 2016 at 6:40 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

> On Thu, Dec 22, 2016 at 9:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
>> On Mon, Dec 5, 2016 at 2:46 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>>>> I'll review after that, since I have other things to review meanwhile.
>>>
>>> Attached, please find the rebased patch attached with this e-mail.
>>> There is no fundamental change in patch except for adapting the new
>>> locking strategy in squeeze operation. I have done the sanity testing
>>> of the patch with master-standby setup and Kuntal has helped me to
>>> verify it with his wal-consistency checker patch.
>>
>> This patch again needs a rebase,
>
> I had completed it last night and left it to run overnight tests to
> check the sanity of the patch, but I guess now I need another rebase.

After recent commits 7819ba1e and 25216c98, this patch requires a
rebase. Attached is the rebased patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

wal_hash_index_v7.patch (application/octet-stream)
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 6eaed1e..dd6e851 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1536,19 +1536,6 @@ archive_command = 'local_backup_script.sh "%p" "%f"'
   <itemizedlist>
    <listitem>
     <para>
-     Operations on hash indexes are not presently WAL-logged, so
-     replay will not update these indexes.  This will mean that any new inserts
-     will be ignored by the index, updated rows will apparently disappear and
-     deleted rows will still retain pointers. In other words, if you modify a
-     table with a hash index on it then you will get incorrect query results
-     on a standby server.  When recovery completes it is recommended that you
-     manually <xref linkend="sql-reindex">
-     each such index after completing a recovery operation.
-    </para>
-   </listitem>
-
-   <listitem>
-    <para>
      If a <xref linkend="sql-createdatabase">
      command is executed while a base backup is being taken, and then
      the template database that the <command>CREATE DATABASE</> copied
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8d7b3bf..10f8074 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2138,10 +2138,9 @@ include_dir 'conf.d'
          has materialized a result set, no error will be generated even if the
          underlying rows in the referenced table have been vacuumed away.
          Some tables cannot safely be vacuumed early, and so will not be
-         affected by this setting.  Examples include system catalogs and any
-         table which has a hash index.  For such tables this setting will
-         neither reduce bloat nor create a possibility of a <literal>snapshot
-         too old</> error on scanning.
+         affected by this setting.  Examples include system catalogs.  For
+         such tables this setting will neither reduce bloat nor create a
+         possibility of a <literal>snapshot too old</> error on scanning.
         </para>
        </listitem>
       </varlistentry>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index a1a9532..964af84 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2353,12 +2353,6 @@ LOG:  database system is ready to accept read only connections
   <itemizedlist>
    <listitem>
     <para>
-     Operations on hash indexes are not presently WAL-logged, so
-     replay will not update these indexes.
-    </para>
-   </listitem>
-   <listitem>
-    <para>
      Full knowledge of running transactions is required before snapshots
      can be taken. Transactions that use large numbers of subtransactions
      (currently greater than 64) will delay the start of read only
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 271c135..e40750e 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -193,18 +193,6 @@ CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable>
 </synopsis>
   </para>
 
-  <caution>
-   <para>
-    Hash index operations are not presently WAL-logged,
-    so hash indexes might need to be rebuilt with <command>REINDEX</>
-    after a database crash if there were unwritten changes.
-    Also, changes to hash indexes are not replicated over streaming or
-    file-based replication after the initial base backup, so they
-    give wrong answers to queries that subsequently use them.
-    For these reasons, hash index use is presently discouraged.
-   </para>
-  </caution>
-
   <para>
    <indexterm>
     <primary>index</primary>
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index fcb7a60..7163b03 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -510,19 +510,6 @@ Indexes:
    they can be useful.
   </para>
 
-  <caution>
-   <para>
-    Hash index operations are not presently WAL-logged,
-    so hash indexes might need to be rebuilt with <command>REINDEX</>
-    after a database crash if there were unwritten changes.
-    Also, changes to hash indexes are not replicated over streaming or
-    file-based replication after the initial base backup, so they
-    give wrong answers to queries that subsequently use them.
-    Hash indexes are also not properly restored during point-in-time
-    recovery.  For these reasons, hash index use is presently discouraged.
-   </para>
-  </caution>
-
   <para>
    Currently, only the B-tree, GiST, GIN, and BRIN index methods support
    multicolumn indexes. Up to 32 fields can be specified by default.
diff --git a/src/backend/access/hash/Makefile b/src/backend/access/hash/Makefile
index e2e7e91..b154569 100644
--- a/src/backend/access/hash/Makefile
+++ b/src/backend/access/hash/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = hash.o hashfunc.o hashinsert.o hashovfl.o hashpage.o hashsearch.o \
-       hashsort.o hashutil.o hashvalidate.o
+       hashsort.o hashutil.o hashvalidate.o hash_xlog.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 01ea115..06ef477 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -248,7 +248,6 @@ The insertion algorithm is rather similar:
          split happened)
 		take the buffer content lock on bucket page in exclusive mode
 		retake meta page buffer content lock in shared mode
-	release pin on metapage
 -- (so far same as reader, except for acquisition of buffer content lock in
 	exclusive mode on primary bucket page)
 	if the bucket-being-split flag is set for a bucket and pin count on it is
@@ -263,13 +262,17 @@ The insertion algorithm is rather similar:
 	if current page is full, release lock but not pin, read/exclusive-lock
      next page; repeat as needed
 	>> see below if no space in any page of bucket
+	take buffer content lock in exclusive mode on metapage
 	insert tuple at appropriate place in page
-	mark current page dirty and release buffer content lock and pin
-	if the current page is not a bucket page, release the pin on bucket page
-	pin meta page and take buffer content lock in exclusive mode
+	mark current page dirty
 	increment tuple count, decide if split needed
-	mark meta page dirty and release buffer content lock and pin
-	done if no split needed, else enter Split algorithm below
+	mark meta page dirty
+	write WAL for insertion of tuple
+	release the buffer content lock on metapage
+	release buffer content lock on current page
+	if current page is not a bucket page, release the pin on bucket page
+	if split is needed, enter Split algorithm below
+	release the pin on metapage
 
 To speed searches, the index entries within any individual index page are
 kept sorted by hash code; the insertion code must take care to insert new
@@ -304,12 +307,17 @@ existing bucket in two, thereby lowering the fill ratio:
        try to finish the split and the cleanup work
        if that succeeds, start over; if it fails, give up
 	mark the old and new buckets indicating split is in progress
+	mark both old and new buckets as dirty
+	write WAL for allocation of new page for split
 	copy the tuples that belongs to new bucket from old bucket, marking
      them as moved-by-split
+	write WAL record for moving tuples to new page once the new page is full
+	or all the pages of old bucket are finished
 	release lock but not pin for primary bucket page of old bucket,
 	 read/shared-lock next page; repeat as needed
 	clear the bucket-being-split and bucket-being-populated flags
 	mark the old bucket indicating split-cleanup
+	write WAL for changing the flags on both old and new buckets
 
 The split operation's attempt to acquire cleanup-lock on the old bucket number
 could fail if another process holds any lock or pin on it.  We do not want to
@@ -345,6 +353,8 @@ The fourth operation is garbage collection (bulk deletion):
 		acquire cleanup lock on primary bucket page
 		loop:
 			scan and remove tuples
+			mark the target page dirty
+			write WAL for deleting tuples from target page
 			if this is the last bucket page, break out of loop
 			pin and x-lock next page
 			release prior lock and pin (except keep pin on primary bucket page)
@@ -359,7 +369,8 @@ The fourth operation is garbage collection (bulk deletion):
 	check if number of buckets changed
 	if so, release content lock and pin and return to for-each-bucket loop
 	else update metapage tuple count
-	mark meta page dirty and release buffer content lock and pin
+	 mark meta page dirty and write WAL for update of metapage
+	 release buffer content lock and pin
 
 Note that this is designed to allow concurrent splits and scans.  If a split
 occurs, tuples relocated into the new bucket will be visited twice by the
@@ -401,18 +412,16 @@ Obtaining an overflow page:
 	search for a free page (zero bit in bitmap)
 	if found:
 		set bit in bitmap
-		mark bitmap page dirty and release content lock
+		mark bitmap page dirty
 		take metapage buffer content lock in exclusive mode
 		if first-free-bit value did not change,
 			update it and mark meta page dirty
-		release meta page buffer content lock
-		return page number
 	else (not found):
 	release bitmap page buffer content lock
 	loop back to try next bitmap page, if any
 -- here when we have checked all bitmap pages; we hold meta excl. lock
 	extend index to add another overflow page; update meta information
-	mark meta page dirty and release buffer content lock
+	mark meta page dirty
 	return page number
 
 It is slightly annoying to release and reacquire the metapage lock
@@ -432,12 +441,15 @@ like this:
 
 	-- having determined that no space is free in the target bucket:
 	remember last page of bucket, drop write lock on it
-	call free-page-acquire routine
 	re-write-lock last page of bucket
 	if it is not last anymore, step to the last page
-	update (former) last page to point to new page
+	execute free-page-acquire (Obtaining an overflow page) mechanism described above
+	update (former) last page to point to the new page and mark the buffer dirty.
 	write-lock and initialize new page, with back link to former last page
-	write and release former last page
+	write WAL for addition of overflow page
+	release the locks on meta page and bitmap page acquired in free-page-acquire algorithm
+	release the lock on former last page
+	release the lock on new overflow page
 	insert tuple into new page
 	-- etc.
 
@@ -464,12 +476,14 @@ accessors of pages in the bucket.  The algorithm is:
 	determine which bitmap page contains the free space bit for page
 	release meta page buffer content lock
 	pin bitmap page and take buffer content lock in exclusive mode
-	update bitmap bit
-	mark bitmap page dirty and release buffer content lock and pin
-	if page number is less than what we saw as first-free-bit in meta:
 	retake meta page buffer content lock in exclusive mode
+	move (insert) tuples that belong to the overflow page being freed
+	update bitmap bit
+	mark bitmap page dirty
 	if page number is still less than first-free-bit,
 		update first-free-bit field and mark meta page dirty
+	write WAL for delinking overflow page operation
+	release buffer content lock and pin
 	release meta page buffer content lock and pin
 
 We have to do it this way because we must clear the bitmap bit before
@@ -480,8 +494,101 @@ page acquirer will scan more bitmap bits than he needs to.  What must be
 avoided is having first-free-bit greater than the actual first free bit,
 because then that free page would never be found by searchers.
 
-All the freespace operations should be called while holding no buffer
-locks.  Since they need no lmgr locks, deadlock is not possible.
+The reason for moving tuples from the overflow page while delinking the latter
+is to make that a single atomic operation.  Not doing so could lead to spurious
+reads on standby.  Basically, the user might see the same tuple twice.
+
+
+WAL Considerations
+------------------
+
+The hash index operations like create index, insert, delete, bucket split,
+allocate overflow page, and squeeze (overflow pages are freed in this
+operation) don't in themselves guarantee hash index consistency after a crash.
+To provide robustness, we write WAL for each of these operations, which is
+replayed during crash recovery.
+
+Multiple WAL records are being written for create index operation, first for
+initializing the metapage, followed by one for each new bucket created during
+operation followed by one for initializing the bitmap page.  If the system
+crashes after any operation, the whole operation is rolled back.  We have
+considered to write a single WAL record for the whole operation, but for that
+we need to limit the number of initial buckets that can be created during the
+operation. As we can log only the fixed number of pages XLR_MAX_BLOCK_ID (32)
+with current XLog machinery, it is better to write multiple WAL records for this
+operation.  The downside of restricting the number of buckets is that we need
+to perform the split operation if the number of tuples is more than what can be
+accommodated in the initial set of buckets and it is not unusual to have a large
+number of tuples during create index operation.
+
+Ordinary item insertions (that don't force a page split or need a new overflow
+page) are single WAL entries.  They touch a single bucket page and the meta
+page; the metapage is updated during replay just as in the original operation.
+
+An insertion that causes an addition of an overflow page is logged as a single
+WAL entry preceded by a WAL entry for new overflow page required to insert
+a tuple.  There is a corner case where, by the time we try to use the newly
+allocated overflow page, it has already been used by concurrent insertions; in
+such a case, a new overflow page will be allocated and a separate WAL entry
+will be made for the same.
+
+An insertion that causes a bucket split is logged as a single WAL entry,
+followed by a WAL entry for allocating a new bucket, followed by a WAL entry
+for each overflow bucket page in the new bucket to which the tuples are moved
+from old bucket, followed by a WAL entry to indicate that split is complete
+for both old and new buckets.
+
+A split operation which requires overflow pages to complete the operation
+will need to write a WAL record for each new allocation of an overflow page.
+
+As splitting involves multiple atomic actions, it's possible that the
+system crashes between moving tuples from bucket pages of the old bucket to new
+bucket.  In such a case, after recovery, both the old and new buckets will be
+marked with bucket-being-split and bucket-being-populated flags respectively
+which indicates that split is in progress for those buckets.  The reader
+algorithm works correctly, as it will scan both the old and new buckets when
+the split is in progress as explained in the reader algorithm section above.
+
+We finish the split at next insert or split operation on the old bucket as
+explained in insert and split algorithm above.  It could be done during
+searches, too, but it seems best not to put any extra updates in what would
+otherwise be a read-only operation (updating is not possible in hot standby
+mode anyway).  It would seem natural to complete the split in VACUUM, but since
+splitting a bucket might require allocating a new page, it might fail if you
+run out of disk space.  That would be bad during VACUUM - the reason for
+running VACUUM in the first place might be that you run out of disk space,
+and now VACUUM won't finish because you're out of disk space.  In contrast,
+an insertion can require enlarging the physical file anyway.
+
+Deletion of tuples from a bucket is performed for two reasons: one to remove
+dead tuples and the other to remove tuples that were moved by split.  A WAL
+entry is made for each bucket page from which tuples are removed, followed by
+a WAL entry to clear the garbage flag if the tuples moved by split are
+removed.  A separate WAL entry is made for updating the metapage if the
+deletion is performed to remove dead tuples by vacuum.
+
+As deletion involves multiple atomic operations, it is quite possible that the
+system crashes (a) after removing tuples from some of the bucket pages,
+(b) before clearing the garbage flag, or (c) before updating the metapage.  If
+the system crashes before completing (b), it will again try to clean the
+bucket during the next vacuum or insert after recovery, which can have some
+performance impact, but it will work fine.  If the system crashes before
+completing (c), after recovery there could be some additional splits until the
+next vacuum updates the metapage, but the other operations like insert, delete
+and scan will work correctly.  We could fix this problem by actually updating
+the metapage based on the delete operation during replay, but it is not clear
+that it is worth the complication.
+
+The squeeze operation moves tuples from one of the buckets later in the chain
+to one of the buckets earlier in the chain and writes a WAL record when either
+the bucket to which it is writing tuples is filled or the bucket from which it
+is removing the tuples becomes empty.
+
+As the squeeze operation involves multiple atomic operations, it is quite
+possible that the system crashes before completing the operation on the
+entire bucket.  After recovery, the operations will work correctly, but the
+index will remain bloated, which can impact the performance of read and
+insert operations until the next vacuum squeezes the bucket completely.
 
 
 Other Notes
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 1fa087a..3f663c9 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -27,6 +27,7 @@
 #include "optimizer/plancat.h"
 #include "utils/index_selfuncs.h"
 #include "utils/rel.h"
+#include "miscadmin.h"
 
 
 /* Working state for hashbuild and its callback */
@@ -115,7 +116,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	estimate_rel_size(heap, NULL, &relpages, &reltuples, &allvisfrac);
 
 	/* Initialize the hash index metadata page and initial buckets */
-	num_buckets = _hash_metapinit(index, reltuples, MAIN_FORKNUM);
+	num_buckets = _hash_init(index, reltuples, MAIN_FORKNUM);
 
 	/*
 	 * If we just insert the tuples into the index in scan order, then
@@ -177,7 +178,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 void
 hashbuildempty(Relation index)
 {
-	_hash_metapinit(index, 0, INIT_FORKNUM);
+	_hash_init(index, 0, INIT_FORKNUM);
 }
 
 /*
@@ -297,6 +298,11 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		buf = so->hashso_curbuf;
 		Assert(BufferIsValid(buf));
 		page = BufferGetPage(buf);
+
+		/*
+		 * We don't need to test for an old snapshot here, as the current buffer is
+		 * pinned, so vacuum can't clean the page.
+		 */
 		maxoffnum = PageGetMaxOffsetNumber(page);
 		for (offnum = ItemPointerGetOffsetNumber(current);
 			 offnum <= maxoffnum;
@@ -610,6 +616,7 @@ loop_top:
 	}
 
 	/* Okay, we're really done.  Update tuple count in metapage. */
+	START_CRIT_SECTION();
 
 	if (orig_maxbucket == metap->hashm_maxbucket &&
 		orig_ntuples == metap->hashm_ntuples)
@@ -636,6 +643,26 @@ loop_top:
 	}
 
 	MarkBufferDirty(metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_update_meta_page xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.ntuples = metap->hashm_ntuples;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashUpdateMetaPage);
+
+		XLogRegisterBuffer(0, metabuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_UPDATE_META_PAGE);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
 	_hash_relbuf(rel, metabuf);
 
 	/* return statistics */
@@ -803,9 +830,40 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			/* No ereport(ERROR) until changes are logged */
+			START_CRIT_SECTION();
+
 			PageIndexMultiDelete(page, deletable, ndeletable);
 			bucket_dirty = true;
 			MarkBufferDirty(buf);
+
+			/* XLOG stuff */
+			if (RelationNeedsWAL(rel))
+			{
+				xl_hash_delete xlrec;
+				XLogRecPtr	recptr;
+
+				xlrec.is_primary_bucket_page = (buf == bucket_buf) ? true : false;
+
+				XLogBeginInsert();
+				XLogRegisterData((char *) &xlrec, SizeOfHashDelete);
+
+				/*
+				 * bucket buffer needs to be registered to ensure that we can
+				 * acquire a cleanup lock on it during replay.
+				 */
+				if (!xlrec.is_primary_bucket_page)
+					XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+				XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
+				XLogRegisterBufData(1, (char *) deletable,
+									ndeletable * sizeof(OffsetNumber));
+
+				recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_DELETE);
+				PageSetLSN(BufferGetPage(buf), recptr);
+			}
+
+			END_CRIT_SECTION();
 		}
 
 		/* bail out if there are no more pages to scan. */
@@ -853,8 +911,25 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		/* No ereport(ERROR) until changes are logged */
+		START_CRIT_SECTION();
+
 		bucket_opaque->hasho_flag &= ~LH_BUCKET_NEEDS_SPLIT_CLEANUP;
 		MarkBufferDirty(bucket_buf);
+
+		/* XLOG stuff */
+		if (RelationNeedsWAL(rel))
+		{
+			XLogRecPtr	recptr;
+
+			XLogBeginInsert();
+			XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+			recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_CLEANUP);
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
 	}
 
 	/*
@@ -868,9 +943,3 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	else
 		LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
 }
-
-void
-hash_redo(XLogReaderState *record)
-{
-	elog(PANIC, "hash_redo: unimplemented");
-}
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
new file mode 100644
index 0000000..41429a7
--- /dev/null
+++ b/src/backend/access/hash/hash_xlog.c
@@ -0,0 +1,970 @@
+/*-------------------------------------------------------------------------
+ *
+ * hash_xlog.c
+ *	  WAL replay logic for hash index.
+ *
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/hash/hash_xlog.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/hash_xlog.h"
+#include "access/xlogutils.h"
+
+/*
+ * replay a hash index meta page
+ */
+static void
+hash_xlog_init_meta_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Page		page;
+	Buffer		metabuf;
+
+	xl_hash_init_meta_page *xlrec = (xl_hash_init_meta_page *) XLogRecGetData(record);
+
+	/* create the index' metapage */
+	metabuf = XLogInitBufferForRedo(record, 0);
+	Assert(BufferIsValid(metabuf));
+	_hash_init_metabuffer(metabuf, xlrec->num_tuples, xlrec->procid,
+						  xlrec->ffactor, true);
+	page = (Page) BufferGetPage(metabuf);
+	PageSetLSN(page, lsn);
+	MarkBufferDirty(metabuf);
+	/* all done */
+	UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay a hash index bitmap page
+ */
+static void
+hash_xlog_init_bitmap_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		bitmapbuf;
+	Buffer		metabuf;
+	Page		page;
+	HashMetaPage metap;
+	uint32		num_buckets;
+
+	xl_hash_init_bitmap_page *xlrec = (xl_hash_init_bitmap_page *) XLogRecGetData(record);
+
+	/*
+	 * Initialize bitmap page
+	 */
+	bitmapbuf = XLogInitBufferForRedo(record, 0);
+	_hash_initbitmapbuffer(bitmapbuf, xlrec->bmsize, true);
+	PageSetLSN(BufferGetPage(bitmapbuf), lsn);
+	MarkBufferDirty(bitmapbuf);
+	UnlockReleaseBuffer(bitmapbuf);
+
+	/* add the new bitmap page to the metapage's list of bitmaps */
+	if (XLogReadBufferForRedo(record, 1, &metabuf) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Note: in normal operation, we'd update the metapage while still
+		 * holding lock on the page we inserted into.  But during replay it's
+		 * not necessary to hold that lock, since no other index updates can
+		 * be happening concurrently.
+		 */
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+
+		num_buckets = metap->hashm_maxbucket + 1;
+		metap->hashm_mapp[metap->hashm_nmaps] = num_buckets + 1;
+		metap->hashm_nmaps++;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay a hash index insert without split
+ */
+static void
+hash_xlog_insert(XLogReaderState *record)
+{
+	HashMetaPage metap;
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_insert *xlrec = (xl_hash_insert *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		Size		datalen;
+		char	   *datapos = XLogRecGetBlockData(record, 0, &datalen);
+
+		page = BufferGetPage(buffer);
+
+		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+						false, false) == InvalidOffsetNumber)
+			elog(PANIC, "hash_insert_redo: failed to add item");
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+
+	if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Note: in normal operation, we'd update the metapage while still
+		 * holding lock on the page we inserted into.  But during replay it's
+		 * not necessary to hold that lock, since no other index updates can
+		 * be happening concurrently.
+		 */
+		page = BufferGetPage(buffer);
+		metap = HashPageGetMeta(page);
+		metap->hashm_ntuples += 1;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
+
+/*
+ * replay addition of overflow page for hash index
+ */
+static void
+hash_xlog_addovflpage(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_addovflpage *xlrec = (xl_hash_addovflpage *) XLogRecGetData(record);
+	Buffer		leftbuf;
+	Buffer		ovflbuf;
+	Buffer		metabuf;
+	BlockNumber leftblk;
+	BlockNumber rightblk;
+	BlockNumber newmapblk = InvalidBlockNumber;
+	Page		ovflpage;
+	HashPageOpaque ovflopaque;
+	uint32	   *num_bucket;
+	char	   *data;
+	Size datalen PG_USED_FOR_ASSERTS_ONLY;
+	bool		new_bmpage = false;
+
+	XLogRecGetBlockTag(record, 0, NULL, NULL, &rightblk);
+	XLogRecGetBlockTag(record, 1, NULL, NULL, &leftblk);
+
+	ovflbuf = XLogInitBufferForRedo(record, 0);
+	Assert(BufferIsValid(ovflbuf));
+
+	data = XLogRecGetBlockData(record, 0, &datalen);
+	num_bucket = (uint32 *) data;
+	Assert(datalen == sizeof(uint32));
+	_hash_initbuf(ovflbuf, *num_bucket, LH_OVERFLOW_PAGE, true);
+	/* update backlink */
+	ovflpage = BufferGetPage(ovflbuf);
+	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
+	ovflopaque->hasho_prevblkno = leftblk;
+
+	PageSetLSN(ovflpage, lsn);
+	MarkBufferDirty(ovflbuf);
+
+	if (XLogReadBufferForRedo(record, 1, &leftbuf) == BLK_NEEDS_REDO)
+	{
+		Page		leftpage;
+		HashPageOpaque leftopaque;
+
+		leftpage = BufferGetPage(leftbuf);
+		leftopaque = (HashPageOpaque) PageGetSpecialPointer(leftpage);
+		leftopaque->hasho_nextblkno = rightblk;
+
+		PageSetLSN(leftpage, lsn);
+		MarkBufferDirty(leftbuf);
+	}
+
+	/*
+	 * We need to release the locks once the prev pointer of overflow bucket
+	 * and next of left bucket are set, otherwise concurrent read might skip
+	 * the bucket.
+	 */
+	if (BufferIsValid(leftbuf))
+		UnlockReleaseBuffer(leftbuf);
+	UnlockReleaseBuffer(ovflbuf);
+
+	/*
+	 * Note: in normal operation, we'd update the bitmap and meta page while
+	 * still holding lock on the overflow pages we updated.  But during replay
+	 * it's not necessary to hold those locks, since no other index updates
+	 * can be happening concurrently.
+	 */
+	if (XLogRecHasBlockRef(record, 2))
+	{
+		Buffer		mapbuffer;
+
+		if (XLogReadBufferForRedo(record, 2, &mapbuffer) == BLK_NEEDS_REDO)
+		{
+			Page		mappage = (Page) BufferGetPage(mapbuffer);
+			uint32	   *freep = NULL;
+			char	   *data;
+			uint32	   *bitmap_page_bit;
+
+			freep = HashPageGetBitmap(mappage);
+
+			data = XLogRecGetBlockData(record, 2, &datalen);
+			bitmap_page_bit = (uint32 *) data;
+
+			SETBIT(freep, *bitmap_page_bit);
+
+			PageSetLSN(mappage, lsn);
+			MarkBufferDirty(mapbuffer);
+		}
+		if (BufferIsValid(mapbuffer))
+			UnlockReleaseBuffer(mapbuffer);
+	}
+
+	if (XLogRecHasBlockRef(record, 3))
+	{
+		Buffer		newmapbuf;
+
+		newmapbuf = XLogInitBufferForRedo(record, 3);
+
+		_hash_initbitmapbuffer(newmapbuf, xlrec->bmsize, true);
+
+		new_bmpage = true;
+		newmapblk = BufferGetBlockNumber(newmapbuf);
+
+		MarkBufferDirty(newmapbuf);
+		PageSetLSN(BufferGetPage(newmapbuf), lsn);
+
+		UnlockReleaseBuffer(newmapbuf);
+	}
+
+	if (XLogReadBufferForRedo(record, 4, &metabuf) == BLK_NEEDS_REDO)
+	{
+		HashMetaPage metap;
+		Page		page;
+		uint32	   *firstfree_ovflpage;
+
+		data = XLogRecGetBlockData(record, 4, &datalen);
+		firstfree_ovflpage = (uint32 *) data;
+
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+		metap->hashm_firstfree = *firstfree_ovflpage;
+
+		if (!xlrec->bmpage_found)
+		{
+			metap->hashm_spares[metap->hashm_ovflpoint]++;
+
+			if (new_bmpage)
+			{
+				Assert(BlockNumberIsValid(newmapblk));
+
+				metap->hashm_mapp[metap->hashm_nmaps] = newmapblk;
+				metap->hashm_nmaps++;
+				metap->hashm_spares[metap->hashm_ovflpoint]++;
+			}
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay allocation of page for split operation
+ */
+static void
+hash_xlog_split_allocpage(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_split_allocpage *xlrec = (xl_hash_split_allocpage *) XLogRecGetData(record);
+	Buffer		oldbuf;
+	Buffer		newbuf;
+	Buffer		metabuf;
+	Size datalen PG_USED_FOR_ASSERTS_ONLY;
+	char	   *data;
+	XLogRedoAction action;
+
+	/* replay the record for old bucket */
+	action = XLogReadBufferForRedo(record, 0, &oldbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		oldpage;
+		HashPageOpaque oldopaque;
+
+		oldpage = BufferGetPage(oldbuf);
+		oldopaque = (HashPageOpaque) PageGetSpecialPointer(oldpage);
+
+		oldopaque->hasho_flag = xlrec->old_bucket_flag;
+
+		PageSetLSN(oldpage, lsn);
+		MarkBufferDirty(oldbuf);
+	}
+
+	/*
+	 * There is no harm in releasing the lock on old bucket before new bucket
+	 * during replay as no other bucket splits can be happening concurrently.
+	 */
+	if (BufferIsValid(oldbuf))
+		UnlockReleaseBuffer(oldbuf);
+
+	/* replay the record for new bucket */
+	newbuf = XLogInitBufferForRedo(record, 1);
+	_hash_initbuf(newbuf, xlrec->new_bucket, xlrec->new_bucket_flag, true);
+	MarkBufferDirty(newbuf);
+	PageSetLSN(BufferGetPage(newbuf), lsn);
+
+	UnlockReleaseBuffer(newbuf);
+
+	/*
+	 * Note: in normal operation, we'd update the meta page while still
+	 * holding lock on the old and new bucket pages.  But during replay it's
+	 * not necessary to hold those locks, since no other bucket splits can be
+	 * happening concurrently.
+	 */
+
+	/* replay the record for metapage changes */
+	if (XLogReadBufferForRedo(record, 2, &metabuf) == BLK_NEEDS_REDO)
+	{
+		Page		page;
+		HashMetaPage metap;
+
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+		metap->hashm_maxbucket = xlrec->new_bucket;
+
+		data = XLogRecGetBlockData(record, 2, &datalen);
+
+		if (xlrec->flags & XLH_SPLIT_META_UPDATE_MASKS)
+		{
+			uint32		lowmask;
+			uint32	   *highmask;
+
+			/* extract low and high masks. */
+			memcpy(&lowmask, data, sizeof(uint32));
+			highmask = (uint32 *) ((char *) data + sizeof(uint32));
+
+			/* update metapage */
+			metap->hashm_lowmask = lowmask;
+			metap->hashm_highmask = *highmask;
+
+			data += sizeof(uint32) * 2;
+		}
+
+		if (xlrec->flags & XLH_SPLIT_META_UPDATE_SPLITPOINT)
+		{
+			uint32		ovflpoint;
+			uint32	   *ovflpages;
+
+			/* extract information of overflow pages. */
+			memcpy(&ovflpoint, data, sizeof(uint32));
+			ovflpages = (uint32 *) ((char *) data + sizeof(uint32));
+
+			/* update metapage */
+			metap->hashm_spares[ovflpoint] = *ovflpages;
+			metap->hashm_ovflpoint = ovflpoint;
+		}
+
+		MarkBufferDirty(metabuf);
+		PageSetLSN(BufferGetPage(metabuf), lsn);
+	}
+
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay of split operation
+ */
+static void
+hash_xlog_split_page(XLogReaderState *record)
+{
+	Buffer		buf;
+
+	if (XLogReadBufferForRedo(record, 0, &buf) != BLK_RESTORED)
+		elog(ERROR, "hash split record did not contain a full-page image");
+
+	UnlockReleaseBuffer(buf);
+}
+
+/*
+ * replay completion of split operation
+ */
+static void
+hash_xlog_split_complete(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_split_complete *xlrec = (xl_hash_split_complete *) XLogRecGetData(record);
+	Buffer		oldbuf;
+	Buffer		newbuf;
+	XLogRedoAction action;
+
+	/* replay the record for old bucket */
+	action = XLogReadBufferForRedo(record, 0, &oldbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		oldpage;
+		HashPageOpaque oldopaque;
+
+		oldpage = BufferGetPage(oldbuf);
+		oldopaque = (HashPageOpaque) PageGetSpecialPointer(oldpage);
+
+		oldopaque->hasho_flag = xlrec->old_bucket_flag;
+
+		PageSetLSN(oldpage, lsn);
+		MarkBufferDirty(oldbuf);
+	}
+	if (BufferIsValid(oldbuf))
+		UnlockReleaseBuffer(oldbuf);
+
+	/* replay the record for new bucket */
+	action = XLogReadBufferForRedo(record, 1, &newbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		newpage;
+		HashPageOpaque nopaque;
+
+		newpage = BufferGetPage(newbuf);
+		nopaque = (HashPageOpaque) PageGetSpecialPointer(newpage);
+
+		nopaque->hasho_flag = xlrec->new_bucket_flag;
+
+		PageSetLSN(newpage, lsn);
+		MarkBufferDirty(newbuf);
+	}
+	if (BufferIsValid(newbuf))
+		UnlockReleaseBuffer(newbuf);
+}
+
+/*
+ * replay move of page contents for squeeze operation of hash index
+ */
+static void
+hash_xlog_move_page_contents(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_move_page_contents *xldata = (xl_hash_move_page_contents *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		writebuf = InvalidBuffer;
+	Buffer		deletebuf = InvalidBuffer;
+	XLogRedoAction action;
+
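+	/*
+	 * Block references in this record (see the WAL logging in
+	 * _hash_squeezebucket): 0 is the primary bucket page (registered only
+	 * when it differs from the write page), 1 is the page to which tuples are
+	 * added, and 2 is the page from which tuples are deleted.
+	 */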
+	/*
+	 * Make sure we have a cleanup lock on the primary bucket page before we
+	 * start the actual replay.  This guarantees that no scan can start, and
+	 * that no scan can already be in progress, while this operation is
+	 * replayed.  If scans were allowed, they could miss some records or see
+	 * the same record multiple times.
+	 */
+	if (xldata->is_prim_bucket_same_wrt)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &writebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bucketbuf))
+			LockBufferForCleanup(bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &writebuf);
+	}
+
+	/* replay the record for adding entries in overflow buffer */
+	if (action == BLK_NEEDS_REDO)
+	{
+		Page		writepage;
+		char	   *begin;
+		char	   *data;
+		Size		datalen;
+		uint16		ninserted = 0;
+
+		data = begin = XLogRecGetBlockData(record, 1, &datalen);
+
+		writepage = (Page) BufferGetPage(writebuf);
+
+		if (xldata->ntups > 0)
+		{
+			OffsetNumber *towrite = (OffsetNumber *) data;
+
+			data += sizeof(OffsetNumber) * xldata->ntups;
+
+			while (data - begin < datalen)
+			{
+				IndexTuple	itup = (IndexTuple) data;
+				Size		itemsz;
+				OffsetNumber l;
+
+				itemsz = IndexTupleDSize(*itup);
+				itemsz = MAXALIGN(itemsz);
+
+				data += itemsz;
+
+				l = PageAddItem(writepage, (Item) itup, itemsz, towrite[ninserted], false, false);
+				if (l == InvalidOffsetNumber)
+					elog(ERROR, "hash_xlog_move_page_contents: failed to add item to hash index page, size %d bytes",
+						 (int) itemsz);
+
+				ninserted++;
+			}
+		}
+
+		/*
+		 * The number of tuples inserted must be the same as requested in the
+		 * REDO record.
+		 */
+		Assert(ninserted == xldata->ntups);
+
+		PageSetLSN(writepage, lsn);
+		MarkBufferDirty(writebuf);
+	}
+
+	/* replay the record for deleting entries from overflow buffer */
+	if (XLogReadBufferForRedo(record, 2, &deletebuf) == BLK_NEEDS_REDO)
+	{
+		Page		page;
+		char	   *ptr;
+		Size		len;
+
+		ptr = XLogRecGetBlockData(record, 2, &len);
+
+		page = (Page) BufferGetPage(deletebuf);
+
+		if (len > 0)
+		{
+			OffsetNumber *unused;
+			OffsetNumber *unend;
+
+			unused = (OffsetNumber *) ptr;
+			unend = (OffsetNumber *) ((char *) ptr + len);
+
+			if ((unend - unused) > 0)
+				PageIndexMultiDelete(page, unused, unend - unused);
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(deletebuf);
+	}
+
+	/*
+	 * Replay is complete, so now we can release the buffers.  We release the
+	 * locks at the end of the replay operation to ensure that the lock on the
+	 * primary bucket page is held until the whole operation is done.  We
+	 * could release the lock on the write buffer as soon as we are done with
+	 * it, when it is not the primary bucket page, but that doesn't seem worth
+	 * complicating the code for.
+	 */
+	if (BufferIsValid(deletebuf))
+		UnlockReleaseBuffer(deletebuf);
+
+	if (BufferIsValid(writebuf))
+		UnlockReleaseBuffer(writebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay squeeze page operation of hash index
+ */
+static void
+hash_xlog_squeeze_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_squeeze_page *xldata = (xl_hash_squeeze_page *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		writebuf;
+	Buffer		ovflbuf;
+	Buffer		prevbuf = InvalidBuffer;
+	Buffer		mapbuf;
+	XLogRedoAction action;
+
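+	/*
+	 * Block references in this record (see the WAL logging in
+	 * _hash_freeovflpage): 0 is the primary bucket page (registered only when
+	 * it differs from the write page), 1 is the write page, 2 is the freed
+	 * overflow page, 3 is the page previous to the freed page (when it is not
+	 * the write page), 4 is the page next to the freed page, 5 is the bitmap
+	 * page, and 6 is the metapage (registered only when hashm_firstfree
+	 * changes).
+	 */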
+	/*
+	 * Make sure we have a cleanup lock on the primary bucket page before we
+	 * start the actual replay.  This guarantees that no scan can start, and
+	 * that no scan can already be in progress, while this operation is
+	 * replayed.  If scans were allowed, they could miss some records or see
+	 * the same record multiple times.
+	 */
+	if (xldata->is_prim_bucket_same_wrt)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &writebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bucketbuf))
+			LockBufferForCleanup(bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &writebuf);
+	}
+
+	/* replay the record for adding entries in overflow buffer */
+	if (action == BLK_NEEDS_REDO)
+	{
+		Page		writepage;
+		char	   *begin;
+		char	   *data;
+		Size		datalen;
+		uint16		ninserted = 0;
+
+		data = begin = XLogRecGetBlockData(record, 1, &datalen);
+
+		writepage = (Page) BufferGetPage(writebuf);
+
+		if (xldata->ntups > 0)
+		{
+			OffsetNumber *towrite = (OffsetNumber *) data;
+
+			data += sizeof(OffsetNumber) * xldata->ntups;
+
+			while (data - begin < datalen)
+			{
+				IndexTuple	itup = (IndexTuple) data;
+				Size		itemsz;
+				OffsetNumber l;
+
+				itemsz = IndexTupleDSize(*itup);
+				itemsz = MAXALIGN(itemsz);
+
+				data += itemsz;
+
+				l = PageAddItem(writepage, (Item) itup, itemsz, towrite[ninserted], false, false);
+				if (l == InvalidOffsetNumber)
+					elog(ERROR, "hash_xlog_squeeze_page: failed to add item to hash index page, size %d bytes",
+						 (int) itemsz);
+
+				ninserted++;
+			}
+		}
+
+		/*
+		 * The number of tuples inserted must be the same as requested in the
+		 * REDO record.
+		 */
+		Assert(ninserted == xldata->ntups);
+
+		/*
+		 * If the page to which we are adding tuples is the page previous to
+		 * the freed overflow page, update its hasho_nextblkno here.
+		 */
+		if (xldata->is_prev_bucket_same_wrt)
+		{
+			HashPageOpaque writeopaque = (HashPageOpaque) PageGetSpecialPointer(writepage);
+
+			writeopaque->hasho_nextblkno = xldata->nextblkno;
+		}
+
+		PageSetLSN(writepage, lsn);
+		MarkBufferDirty(writebuf);
+	}
+
+	/* replay the record for initializing overflow buffer */
+	if (XLogReadBufferForRedo(record, 2, &ovflbuf) == BLK_NEEDS_REDO)
+	{
+		Page		ovflpage;
+
+		ovflpage = BufferGetPage(ovflbuf);
+
+		_hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));
+
+		PageSetLSN(ovflpage, lsn);
+		MarkBufferDirty(ovflbuf);
+	}
+	if (BufferIsValid(ovflbuf))
+		UnlockReleaseBuffer(ovflbuf);
+
+	/* replay the record for page previous to the freed overflow page */
+	if (!xldata->is_prev_bucket_same_wrt &&
+		XLogReadBufferForRedo(record, 3, &prevbuf) == BLK_NEEDS_REDO)
+	{
+		Page		prevpage = BufferGetPage(prevbuf);
+		HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+		prevopaque->hasho_nextblkno = xldata->nextblkno;
+
+		PageSetLSN(prevpage, lsn);
+		MarkBufferDirty(prevbuf);
+	}
+	if (BufferIsValid(prevbuf))
+		UnlockReleaseBuffer(prevbuf);
+
+	/* replay the record for page next to the freed overflow page */
+	if (XLogRecHasBlockRef(record, 4))
+	{
+		Buffer		nextbuf;
+
+		if (XLogReadBufferForRedo(record, 4, &nextbuf) == BLK_NEEDS_REDO)
+		{
+			Page		nextpage = BufferGetPage(nextbuf);
+			HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage);
+
+			nextopaque->hasho_prevblkno = xldata->prevblkno;
+
+			PageSetLSN(nextpage, lsn);
+			MarkBufferDirty(nextbuf);
+		}
+		if (BufferIsValid(nextbuf))
+			UnlockReleaseBuffer(nextbuf);
+	}
+
+	/* replay the record for bitmap page */
+	if (XLogReadBufferForRedo(record, 5, &mapbuf) == BLK_NEEDS_REDO)
+	{
+		Page		mappage = (Page) BufferGetPage(mapbuf);
+		uint32	   *freep = NULL;
+		char	   *data;
+		uint32	   *bitmap_page_bit;
+		Size		datalen;
+
+		freep = HashPageGetBitmap(mappage);
+
+		data = XLogRecGetBlockData(record, 5, &datalen);
+		bitmap_page_bit = (uint32 *) data;
+
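+		/* clearing the bit marks the freed overflow page as free for reuse */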
+		CLRBIT(freep, *bitmap_page_bit);
+
+		PageSetLSN(mappage, lsn);
+		MarkBufferDirty(mapbuf);
+	}
+	if (BufferIsValid(mapbuf))
+		UnlockReleaseBuffer(mapbuf);
+
+	/* replay the record for meta page */
+	if (XLogRecHasBlockRef(record, 6))
+	{
+		Buffer		metabuf;
+
+		if (XLogReadBufferForRedo(record, 6, &metabuf) == BLK_NEEDS_REDO)
+		{
+			HashMetaPage metap;
+			Page		page;
+			char	   *data;
+			uint32	   *firstfree_ovflpage;
+			Size		datalen;
+
+			data = XLogRecGetBlockData(record, 6, &datalen);
+			firstfree_ovflpage = (uint32 *) data;
+
+			page = BufferGetPage(metabuf);
+			metap = HashPageGetMeta(page);
+			metap->hashm_firstfree = *firstfree_ovflpage;
+
+			PageSetLSN(page, lsn);
+			MarkBufferDirty(metabuf);
+		}
+		if (BufferIsValid(metabuf))
+			UnlockReleaseBuffer(metabuf);
+	}
+
+	/*
+	 * We release the locks on writebuf and bucketbuf at the end of the replay
+	 * operation to ensure that the lock on the primary bucket page is held
+	 * until the whole operation is done.  We could release the lock on the
+	 * write buffer as soon as we are done with it, when it is not the primary
+	 * bucket page, but that doesn't seem worth complicating the code for.
+	 */
+	if (BufferIsValid(writebuf))
+		UnlockReleaseBuffer(writebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay delete operation of hash index
+ */
+static void
+hash_xlog_delete(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_delete *xldata = (xl_hash_delete *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		deletebuf;
+	Page		page;
+	XLogRedoAction action;
+
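+	/*
+	 * Block 0 is the primary bucket page; it is registered separately only
+	 * when it is not the same page we are deleting from (block 1), and is
+	 * locked here solely so that we hold a cleanup lock during replay.
+	 */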
+	/*
+	 * Make sure we have a cleanup lock on the primary bucket page before we
+	 * start the actual replay.  This guarantees that no scan can start, and
+	 * that no scan can already be in progress, while this operation is
+	 * replayed.  If scans were allowed, they could miss some records or see
+	 * the same record multiple times.
+	 */
+	if (xldata->is_primary_bucket_page)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &deletebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bucketbuf))
+			LockBufferForCleanup(bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &deletebuf);
+	}
+
+	/* replay the record for deleting entries in bucket page */
+	if (action == BLK_NEEDS_REDO)
+	{
+		char	   *ptr;
+		Size		len;
+
+		ptr = XLogRecGetBlockData(record, 1, &len);
+
+		page = (Page) BufferGetPage(deletebuf);
+
+		if (len > 0)
+		{
+			OffsetNumber *unused;
+			OffsetNumber *unend;
+
+			unused = (OffsetNumber *) ptr;
+			unend = (OffsetNumber *) ((char *) ptr + len);
+
+			if ((unend - unused) > 0)
+				PageIndexMultiDelete(page, unused, unend - unused);
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(deletebuf);
+	}
+	if (BufferIsValid(deletebuf))
+		UnlockReleaseBuffer(deletebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay split cleanup flag operation for primary bucket page.
+ */
+static void
+hash_xlog_split_cleanup(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		HashPageOpaque bucket_opaque;
+
+		page = (Page) BufferGetPage(buffer);
+
+		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+		bucket_opaque->hasho_flag &= ~LH_BUCKET_NEEDS_SPLIT_CLEANUP;
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
+
+/*
+ * replay for update meta page
+ */
+static void
+hash_xlog_update_meta_page(XLogReaderState *record)
+{
+	HashMetaPage metap;
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_update_meta_page *xldata = (xl_hash_update_meta_page *) XLogRecGetData(record);
+	Buffer		metabuf;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &metabuf) == BLK_NEEDS_REDO)
+	{
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+
+		metap->hashm_ntuples = xldata->ntuples;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+void
+hash_redo(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			hash_xlog_init_meta_page(record);
+			break;
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			hash_xlog_init_bitmap_page(record);
+			break;
+		case XLOG_HASH_INSERT:
+			hash_xlog_insert(record);
+			break;
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			hash_xlog_addovflpage(record);
+			break;
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			hash_xlog_split_allocpage(record);
+			break;
+		case XLOG_HASH_SPLIT_PAGE:
+			hash_xlog_split_page(record);
+			break;
+		case XLOG_HASH_SPLIT_COMPLETE:
+			hash_xlog_split_complete(record);
+			break;
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			hash_xlog_move_page_contents(record);
+			break;
+		case XLOG_HASH_SQUEEZE_PAGE:
+			hash_xlog_squeeze_page(record);
+			break;
+		case XLOG_HASH_DELETE:
+			hash_xlog_delete(record);
+			break;
+		case XLOG_HASH_SPLIT_CLEANUP:
+			hash_xlog_split_cleanup(record);
+			break;
+		case XLOG_HASH_UPDATE_META_PAGE:
+			hash_xlog_update_meta_page(record);
+			break;
+		default:
+			elog(PANIC, "hash_redo: unknown op code %u", info);
+	}
+}
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 46df589..9e78b40 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -16,6 +16,8 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "access/hash_xlog.h"
+#include "miscadmin.h"
 #include "utils/rel.h"
 
 
@@ -45,6 +47,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	uint32		maxbucket;
 	uint32		highmask;
 	uint32		lowmask;
+	OffsetNumber itup_off;
 
 	/*
 	 * Get the hash key for the item (it's stored in the index tuple itself).
@@ -206,35 +209,63 @@ restart_insert:
 		Assert(pageopaque->hasho_bucket == bucket);
 	}
 
-	/* found page with enough space, so add the item here */
-	(void) _hash_pgaddtup(rel, buf, itemsz, itup);
-
-	/*
-	 * dirty and release the modified page.  if the page we modified was an
-	 * overflow page, we also need to separately drop the pin we retained on
-	 * the primary bucket page.
-	 */
-	MarkBufferDirty(buf);
-	_hash_relbuf(rel, buf);
-	if (buf != bucket_buf)
-		_hash_dropbuf(rel, bucket_buf);
-
 	/*
 	 * Write-lock the metapage so we can increment the tuple count. After
 	 * incrementing it, check to see if it's time for a split.
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Do the update.  No ereport(ERROR) until changes are logged */
+	START_CRIT_SECTION();
+
+	/* found page with enough space, so add the item here */
+	itup_off = _hash_pgaddtup(rel, buf, itemsz, itup);
+	MarkBufferDirty(buf);
+
+	/* metapage operations */
 	metap->hashm_ntuples += 1;
 
 	/* Make sure this stays in sync with _hash_expandtable() */
 	do_expand = metap->hashm_ntuples >
 		(double) metap->hashm_ffactor * (metap->hashm_maxbucket + 1);
 
-	/* Write out the metapage and drop lock, but keep pin */
 	MarkBufferDirty(metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_insert xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.offnum = itup_off;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInsert);
+
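+		/*
+		 * Register the metapage as well, so that replay can increment
+		 * hashm_ntuples just as the original operation does.
+		 */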
+		XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);
+
+		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+		XLogRegisterBufData(0, (char *) itup, IndexTupleDSize(*itup));
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INSERT);
+
+		PageSetLSN(BufferGetPage(buf), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	/* drop lock on metapage, but keep pin */
 	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 
+	/*
+	 * Release the modified page.  If it was an overflow page, also drop the
+	 * pin we retained on the primary bucket page.
+	 */
+	_hash_relbuf(rel, buf);
+	if (buf != bucket_buf)
+		_hash_dropbuf(rel, bucket_buf);
+
 	/* Attempt to split if a split is needed */
 	if (do_expand)
 		_hash_expandtable(rel, metabuf);
@@ -275,3 +306,44 @@ _hash_pgaddtup(Relation rel, Buffer buf, Size itemsize, IndexTuple itup)
 
 	return itup_off;
 }
+
+/*
+ *	_hash_pgaddmultitup() -- add a tuple vector to a particular page in the
+ *							 index.
+ *
+ * This routine has the same requirements for locking and tuple ordering as
+ * _hash_pgaddtup().
+ *
+ * The offset numbers at which the tuples were inserted are returned in the
+ * itup_offsets array.
+ */
+void
+_hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
+					OffsetNumber *itup_offsets, uint16 nitups)
+{
+	OffsetNumber itup_off;
+	Page		page;
+	uint32		hashkey;
+	int			i;
+
+	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+	page = BufferGetPage(buf);
+
+	for (i = 0; i < nitups; i++)
+	{
+		Size		itemsize;
+
+		itemsize = IndexTupleDSize(*itups[i]);
+		itemsize = MAXALIGN(itemsize);
+
+		/* Find where to insert the tuple (preserving page's hashkey ordering) */
+		hashkey = _hash_get_indextuple_hashkey(itups[i]);
+		itup_off = _hash_binsearch(page, hashkey);
+
+		itup_offsets[i] = itup_off;
+
+		if (PageAddItem(page, (Item) itups[i], itemsize, itup_off, false, false)
+			== InvalidOffsetNumber)
+			elog(ERROR, "failed to add index item to \"%s\"",
+				 RelationGetRelationName(rel));
+	}
+}
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 6b106f3..c7c4922 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -18,10 +18,11 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "access/hash_xlog.h"
+#include "miscadmin.h"
 #include "utils/rel.h"
 
 
-static Buffer _hash_getovflpage(Relation rel, Buffer metabuf);
 static uint32 _hash_firstfreebit(uint32 map);
 
 
@@ -84,7 +85,9 @@ blkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
  *	dropped before exiting (we assume the caller is not interested in 'buf'
  *	anymore) if not asked to retain.  The pin will be retained only for the
  *	primary bucket.  The returned overflow page will be pinned and
- *	write-locked; it is guaranteed to be empty.
+ *	write-locked; it is not guaranteed to be empty, as we release and
+ *	reacquire the lock on it, so the caller must ensure that the page has
+ *	the required free space.
  *
  *	The caller must hold a pin, but no lock, on the metapage buffer.
  *	That buffer is returned in the same state.
@@ -102,13 +105,37 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 	Page		ovflpage;
 	HashPageOpaque pageopaque;
 	HashPageOpaque ovflopaque;
-
-	/* allocate and lock an empty overflow page */
-	ovflbuf = _hash_getovflpage(rel, metabuf);
+	HashMetaPage metap;
+	Buffer		mapbuf = InvalidBuffer;
+	Buffer		newmapbuf = InvalidBuffer;
+	BlockNumber blkno;
+	uint32		orig_firstfree;
+	uint32		splitnum;
+	uint32	   *freep = NULL;
+	uint32		max_ovflpg;
+	uint32		bit;
+	uint32		bitmap_page_bit;
+	uint32		first_page;
+	uint32		last_bit;
+	uint32		last_page;
+	uint32		i,
+				j;
+	bool		page_found = false;
 
 	/*
-	 * Write-lock the tail page.  It is okay to hold two buffer locks here
-	 * since there cannot be anyone else contending for access to ovflbuf.
+	 * Write-lock the tail page.  Here we need to maintain the following
+	 * locking order: first acquire the lock on the tail page of the bucket,
+	 * then the lock on the meta page to find and lock a bitmap page; once the
+	 * bitmap page is found, the lock on the meta page is released, and
+	 * finally the lock on the new overflow buffer is acquired.  We need this
+	 * locking order to avoid deadlocks with backends that are doing inserts.
+	 *
+	 * Note: We could have avoided locking many buffers here if we wrote two
+	 * WAL records for acquiring an overflow page (one to allocate the
+	 * overflow page and another to add it to the overflow bucket chain).
+	 * However, doing so can leak an overflow page if the system crashes after
+	 * the allocation.  It is also better to have a single record from a
+	 * performance point of view.
 	 */
 	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
@@ -136,56 +163,6 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 		buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
 	}
 
-	/* now that we have correct backlink, initialize new overflow page */
-	ovflpage = BufferGetPage(ovflbuf);
-	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
-	ovflopaque->hasho_prevblkno = BufferGetBlockNumber(buf);
-	ovflopaque->hasho_nextblkno = InvalidBlockNumber;
-	ovflopaque->hasho_bucket = pageopaque->hasho_bucket;
-	ovflopaque->hasho_flag = LH_OVERFLOW_PAGE;
-	ovflopaque->hasho_page_id = HASHO_PAGE_ID;
-
-	MarkBufferDirty(ovflbuf);
-
-	/* logically chain overflow page to previous page */
-	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
-	MarkBufferDirty(buf);
-	if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
-		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-	else
-		_hash_relbuf(rel, buf);
-
-	return ovflbuf;
-}
-
-/*
- *	_hash_getovflpage()
- *
- *	Find an available overflow page and return it.  The returned buffer
- *	is pinned and write-locked, and has had _hash_pageinit() applied,
- *	but it is caller's responsibility to fill the special space.
- *
- * The caller must hold a pin, but no lock, on the metapage buffer.
- * That buffer is left in the same state at exit.
- */
-static Buffer
-_hash_getovflpage(Relation rel, Buffer metabuf)
-{
-	HashMetaPage metap;
-	Buffer		mapbuf = 0;
-	Buffer		newbuf;
-	BlockNumber blkno;
-	uint32		orig_firstfree;
-	uint32		splitnum;
-	uint32	   *freep = NULL;
-	uint32		max_ovflpg;
-	uint32		bit;
-	uint32		first_page;
-	uint32		last_bit;
-	uint32		last_page;
-	uint32		i,
-				j;
-
 	/* Get exclusive lock on the meta page */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
@@ -234,11 +211,31 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 		for (; bit <= last_inpage; j++, bit += BITS_PER_MAP)
 		{
 			if (freep[j] != ALL_SET)
+			{
+				page_found = true;
+
+				/* Reacquire exclusive lock on the meta page */
+				LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+
+				/* convert bit to bit number within page */
+				bit += _hash_firstfreebit(freep[j]);
+				bitmap_page_bit = bit;
+
+				/* convert bit to absolute bit number */
+				bit += (i << BMPG_SHIFT(metap));
+				/* Calculate address of the recycled overflow page */
+				blkno = bitno_to_blkno(metap, bit);
+
+				/* Fetch and init the recycled page */
+				ovflbuf = _hash_getinitbuf(rel, blkno);
+
 				goto found;
+			}
 		}
 
 		/* No free space here, try to advance to next map page */
 		_hash_relbuf(rel, mapbuf);
+		mapbuf = InvalidBuffer;
 		i++;
 		j = 0;					/* scan from start of next map page */
 		bit = 0;
@@ -262,8 +259,15 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 		 * convenient to pre-mark them as "in use" too.
 		 */
 		bit = metap->hashm_spares[splitnum];
-		_hash_initbitmap(rel, metap, bitno_to_blkno(metap, bit), MAIN_FORKNUM);
-		metap->hashm_spares[splitnum]++;
+		newmapbuf = _hash_getnewbuf(rel, bitno_to_blkno(metap, bit), MAIN_FORKNUM);
+
+		/* add the new bitmap page to the metapage's list of bitmaps */
+		/* metapage already has a write lock */
+		if (metap->hashm_nmaps >= HASH_MAX_BITMAPS)
+			ereport(ERROR,
+					(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+					 errmsg("out of overflow pages in hash index \"%s\"",
+							RelationGetRelationName(rel))));
 	}
 	else
 	{
@@ -274,7 +278,8 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 	}
 
 	/* Calculate address of the new overflow page */
-	bit = metap->hashm_spares[splitnum];
+	bit = BufferIsValid(newmapbuf) ?
+		metap->hashm_spares[splitnum] + 1 : metap->hashm_spares[splitnum];
 	blkno = bitno_to_blkno(metap, bit);
 
 	/*
@@ -282,41 +287,51 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 	 * relation length stays in sync with ours.  XXX It's annoying to do this
 	 * with metapage write lock held; would be better to use a lock that
 	 * doesn't block incoming searches.
+	 *
+	 * It is okay to hold two buffer locks here (one on tail page of bucket
+	 * and other on new overflow page) since there cannot be anyone else
+	 * contending for access to ovflbuf.
 	 */
-	newbuf = _hash_getnewbuf(rel, blkno, MAIN_FORKNUM);
+	ovflbuf = _hash_getnewbuf(rel, blkno, MAIN_FORKNUM);
 
-	metap->hashm_spares[splitnum]++;
+found:
 
 	/*
-	 * Adjust hashm_firstfree to avoid redundant searches.  But don't risk
-	 * changing it if someone moved it while we were searching bitmap pages.
+	 * Do the update.  No ereport(ERROR) until the changes are logged.  We log
+	 * the changes for the bitmap page and the overflow page together so that
+	 * we don't leak a page if the system crashes after a new page has been
+	 * allocated.
 	 */
-	if (metap->hashm_firstfree == orig_firstfree)
-		metap->hashm_firstfree = bit + 1;
+	START_CRIT_SECTION();
 
-	/* Write updated metapage and release lock, but not pin */
-	MarkBufferDirty(metabuf);
-	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
-
-	return newbuf;
-
-found:
-	/* convert bit to bit number within page */
-	bit += _hash_firstfreebit(freep[j]);
+	if (page_found)
+	{
+		Assert(BufferIsValid(mapbuf));
 
-	/* mark page "in use" in the bitmap */
-	SETBIT(freep, bit);
-	MarkBufferDirty(mapbuf);
-	_hash_relbuf(rel, mapbuf);
+		/* mark page "in use" in the bitmap */
+		SETBIT(freep, bitmap_page_bit);
+		MarkBufferDirty(mapbuf);
+	}
+	else
+	{
+		/* update the count to indicate new overflow page is added */
+		metap->hashm_spares[splitnum]++;
 
-	/* Reacquire exclusive lock on the meta page */
-	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+		if (BufferIsValid(newmapbuf))
+		{
+			_hash_initbitmapbuffer(newmapbuf, metap->hashm_bmsize, false);
+			MarkBufferDirty(newmapbuf);
 
-	/* convert bit to absolute bit number */
-	bit += (i << BMPG_SHIFT(metap));
+			metap->hashm_mapp[metap->hashm_nmaps] = BufferGetBlockNumber(newmapbuf);
+			metap->hashm_nmaps++;
+			metap->hashm_spares[splitnum]++;
+			MarkBufferDirty(metabuf);
+		}
 
-	/* Calculate address of the recycled overflow page */
-	blkno = bitno_to_blkno(metap, bit);
+		/*
+		 * For a new overflow page we don't need to explicitly set its bit in
+		 * the bitmap page, as every bit in a bitmap page is initialized to
+		 * "in use".
+		 */
+	}
 
 	/*
 	 * Adjust hashm_firstfree to avoid redundant searches.  But don't risk
@@ -325,19 +340,101 @@ found:
 	if (metap->hashm_firstfree == orig_firstfree)
 	{
 		metap->hashm_firstfree = bit + 1;
-
-		/* Write updated metapage and release lock, but not pin */
 		MarkBufferDirty(metabuf);
-		LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 	}
-	else
+
+	/* now that we have correct backlink, initialize new overflow page */
+	ovflpage = BufferGetPage(ovflbuf);
+	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
+	ovflopaque->hasho_prevblkno = BufferGetBlockNumber(buf);
+	ovflopaque->hasho_nextblkno = InvalidBlockNumber;
+	ovflopaque->hasho_bucket = pageopaque->hasho_bucket;
+	ovflopaque->hasho_flag = LH_OVERFLOW_PAGE;
+	ovflopaque->hasho_page_id = HASHO_PAGE_ID;
+
+	MarkBufferDirty(ovflbuf);
+
+	/* logically chain overflow page to previous page */
+	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
+
+	MarkBufferDirty(buf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
 	{
-		/* We didn't change the metapage, so no need to write */
-		LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+		XLogRecPtr	recptr;
+		xl_hash_addovflpage xlrec;
+
+		xlrec.bmpage_found = page_found;
+		xlrec.bmsize = metap->hashm_bmsize;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashAddOvflPage);
+
+		XLogRegisterBuffer(0, ovflbuf, REGBUF_WILL_INIT);
+		XLogRegisterBufData(0, (char *) &pageopaque->hasho_bucket, sizeof(Bucket));
+
+		XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
+
+		if (BufferIsValid(mapbuf))
+		{
+			/*
+			 * Since the bitmap page doesn't use the standard page layout for
+			 * its contents, log the bit number explicitly so that replay can
+			 * set the exact bit in the bitmap.
+			 */
+			XLogRegisterBuffer(2, mapbuf, REGBUF_STANDARD);
+			XLogRegisterBufData(2, (char *) &bitmap_page_bit, sizeof(uint32));
+		}
+
+		if (BufferIsValid(newmapbuf))
+			XLogRegisterBuffer(3, newmapbuf, REGBUF_WILL_INIT);
+
+		/*
+		 * To replay the metapage changes we could log the entire metapage,
+		 * which doesn't seem advisable given the size of the hash metapage,
+		 * or we could log each individually updated value, which is doable.
+		 * Instead, we prefer to have replay perform the same operations on
+		 * the metapage as the original operation does.  That is
+		 * straightforward and has the advantage of using less space in WAL.
+		 */
+		XLogRegisterBuffer(4, metabuf, REGBUF_STANDARD);
+		XLogRegisterBufData(4, (char *) &metap->hashm_firstfree, sizeof(uint32));
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_ADD_OVFL_PAGE);
+
+		PageSetLSN(BufferGetPage(ovflbuf), recptr);
+		PageSetLSN(BufferGetPage(buf), recptr);
+
+		if (BufferIsValid(mapbuf))
+			PageSetLSN(BufferGetPage(mapbuf), recptr);
+
+		if (BufferIsValid(newmapbuf))
+			PageSetLSN(BufferGetPage(newmapbuf), recptr);
 	}
 
-	/* Fetch, init, and return the recycled page */
-	return _hash_getinitbuf(rel, blkno);
+	END_CRIT_SECTION();
+
+	if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+	else
+		_hash_relbuf(rel, buf);
+
+	if (BufferIsValid(mapbuf))
+		_hash_relbuf(rel, mapbuf);
+
+	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+
+	if (BufferIsValid(newmapbuf))
+		_hash_relbuf(rel, newmapbuf);
+
+	/*
+	 * We need to release and reacquire the lock on the overflow buffer so
+	 * that a standby doesn't see an intermediate state of it.
+	 */
+	LockBuffer(ovflbuf, BUFFER_LOCK_UNLOCK);
+	LockBuffer(ovflbuf, BUFFER_LOCK_EXCLUSIVE);
+
+	return ovflbuf;
 }
 
 /*
@@ -370,6 +467,12 @@ _hash_firstfreebit(uint32 map)
  *	Remove this overflow page from its bucket's chain, and mark the page as
  *	free.  On entry, ovflbuf is write-locked; it is released before exiting.
  *
+ *	Add the tuples (itups) to wbuf in this function; we could do that in the
+ *	caller as well, but doing it here makes it easy to write the WAL record
+ *	for the XLOG_HASH_SQUEEZE_PAGE operation.  The addition of the tuples and
+ *	the removal of the overflow page have to be done as one atomic operation;
+ *	otherwise, during replay, users on a standby might find duplicate records.
+ *
  *	Since this function is invoked in VACUUM, we provide an access strategy
  *	parameter that controls fetches of the bucket pages.
  *
@@ -382,13 +485,14 @@ _hash_firstfreebit(uint32 map)
  *	has a lock on same.
  */
 BlockNumber
-_hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
+_hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
+				   Buffer wbuf, IndexTuple *itups, OffsetNumber *itup_offsets,
+				   Size *tups_size, uint16 nitups,
 				   BufferAccessStrategy bstrategy)
 {
 	HashMetaPage metap;
 	Buffer		metabuf;
 	Buffer		mapbuf;
-	Buffer		prevbuf = InvalidBuffer;
 	BlockNumber ovflblkno;
 	BlockNumber prevblkno;
 	BlockNumber blkno;
@@ -402,6 +506,9 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 	int32		bitmappage,
 				bitmapbit;
 	Bucket bucket PG_USED_FOR_ASSERTS_ONLY;
+	Buffer		prevbuf = InvalidBuffer;
+	Buffer		nextbuf = InvalidBuffer;
+	bool		update_metap = false;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -414,15 +521,6 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 	bucket = ovflopaque->hasho_bucket;
 
 	/*
-	 * Zero the page for debugging's sake; then write and release it. (Note:
-	 * if we failed to zero the page here, we'd have problems with the Assert
-	 * in _hash_pageinit() when the page is reused.)
-	 */
-	MemSet(ovflpage, 0, BufferGetPageSize(ovflbuf));
-	MarkBufferDirty(ovflbuf);
-	_hash_relbuf(rel, ovflbuf);
-
-	/*
 	 * Fix up the bucket chain.  this is a doubly-linked list, so we must fix
 	 * up the bucket chain members behind and ahead of the overflow page being
 	 * deleted.  Concurrency issues are avoided by using lock chaining as
@@ -430,8 +528,6 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 	 */
 	if (BlockNumberIsValid(prevblkno))
 	{
-		Page		prevpage;
-		HashPageOpaque prevopaque;
 
 		if (prevblkno == writeblkno)
 			prevbuf = wbuf;
@@ -441,32 +537,13 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 												 HASH_WRITE,
 										   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
 												 bstrategy);
-
-		prevpage = BufferGetPage(prevbuf);
-		prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
-
-		Assert(prevopaque->hasho_bucket == bucket);
-		prevopaque->hasho_nextblkno = nextblkno;
-
-		MarkBufferDirty(prevbuf);
-		if (prevblkno != writeblkno)
-			_hash_relbuf(rel, prevbuf);
 	}
 	if (BlockNumberIsValid(nextblkno))
-	{
-		Buffer		nextbuf = _hash_getbuf_with_strategy(rel,
-														 nextblkno,
-														 HASH_WRITE,
-														 LH_OVERFLOW_PAGE,
-														 bstrategy);
-		Page		nextpage = BufferGetPage(nextbuf);
-		HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage);
-
-		Assert(nextopaque->hasho_bucket == bucket);
-		nextopaque->hasho_prevblkno = prevblkno;
-		MarkBufferDirty(nextbuf);
-		_hash_relbuf(rel, nextbuf);
-	}
+		nextbuf = _hash_getbuf_with_strategy(rel,
+											 nextblkno,
+											 HASH_WRITE,
+											 LH_OVERFLOW_PAGE,
+											 bstrategy);
 
 	/* Note: bstrategy is intentionally not used for metapage and bitmap */
 
@@ -487,62 +564,184 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 	/* Release metapage lock while we access the bitmap page */
 	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 
-	/* Clear the bitmap bit to indicate that this overflow page is free */
+	/* read the bitmap page to clear the bitmap bit */
 	mapbuf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BITMAP_PAGE);
 	mappage = BufferGetPage(mapbuf);
 	freep = HashPageGetBitmap(mappage);
 	Assert(ISSET(freep, bitmapbit));
-	CLRBIT(freep, bitmapbit);
-	MarkBufferDirty(mapbuf);
-	_hash_relbuf(rel, mapbuf);
 
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* This operation needs to log multiple tuples, prepare WAL for that */
+	if (RelationNeedsWAL(rel))
+		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, XLR_NORMAL_RDATAS + nitups);
+
+	START_CRIT_SECTION();
+
+	/*
+	 * we have to insert tuples on the "write" page, being careful to preserve
+	 * hashkey ordering.  (If we insert many tuples into the same "write" page
+	 * it would be worth qsort'ing them).
+	 */
+	if (nitups > 0)
+	{
+		_hash_pgaddmultitup(rel, wbuf, itups, itup_offsets, nitups);
+		MarkBufferDirty(wbuf);
+	}
+
+	/*
+	 * Reinitialize the freed overflow page.  We can't simply zero the page
+	 * here, because WAL replay routines expect pages to be initialized; see
+	 * the explanation of RBM_NORMAL mode atop XLogReadBufferExtended.
+	 */
+	_hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));
+	MarkBufferDirty(ovflbuf);
+
+	if (BufferIsValid(prevbuf))
+	{
+		Page		prevpage = BufferGetPage(prevbuf);
+		HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+		Assert(prevopaque->hasho_bucket == bucket);
+		prevopaque->hasho_nextblkno = nextblkno;
+		MarkBufferDirty(prevbuf);
+	}
+	if (BufferIsValid(nextbuf))
+	{
+		Page		nextpage = BufferGetPage(nextbuf);
+		HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage);
+
+		Assert(nextopaque->hasho_bucket == bucket);
+		nextopaque->hasho_prevblkno = prevblkno;
+		MarkBufferDirty(nextbuf);
+	}
+
+	/* Clear the bitmap bit to indicate that this overflow page is free */
+	CLRBIT(freep, bitmapbit);
+	MarkBufferDirty(mapbuf);
+
+
 	/* if this is now the first free page, update hashm_firstfree */
 	if (ovflbitno < metap->hashm_firstfree)
 	{
 		metap->hashm_firstfree = ovflbitno;
+		update_metap = true;
 		MarkBufferDirty(metabuf);
 	}
-	_hash_relbuf(rel, metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_squeeze_page xlrec;
+		XLogRecPtr	recptr;
+		int			i;
+
+		xlrec.prevblkno = prevblkno;
+		xlrec.nextblkno = nextblkno;
+		xlrec.ntups = nitups;
+		xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf) ? true : false;
+		xlrec.is_prev_bucket_same_wrt = (wbuf == prevbuf) ? true : false;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashSqueezePage);
+
+		/*
+		 * bucket buffer needs to be registered to ensure that we can acquire
+		 * a cleanup lock on it during replay.
+		 */
+		if (!xlrec.is_prim_bucket_same_wrt)
+			XLogRegisterBuffer(0, bucketbuf, REGBUF_STANDARD);
+
+		XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
+		if (xlrec.ntups > 0)
+		{
+			XLogRegisterBufData(1, (char *) itup_offsets,
+								nitups * sizeof(OffsetNumber));
+			for (i = 0; i < nitups; i++)
+				XLogRegisterBufData(1, (char *) itups[i], tups_size[i]);
+		}
+
+		XLogRegisterBuffer(2, ovflbuf, REGBUF_STANDARD);
+
+		/*
+		 * If prevpage and the write page (the block into which we are moving
+		 * tuples from the overflow page) are the same, there is no need to
+		 * register prevpage separately.  During replay we can update its
+		 * next-block pointer directly on the write page.
+		 */
+		if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt)
+			XLogRegisterBuffer(3, prevbuf, REGBUF_STANDARD);
+
+		if (BufferIsValid(nextbuf))
+			XLogRegisterBuffer(4, nextbuf, REGBUF_STANDARD);
+
+		XLogRegisterBuffer(5, mapbuf, REGBUF_STANDARD);
+		XLogRegisterBufData(5, (char *) &bitmapbit, sizeof(uint32));
+
+		if (update_metap)
+		{
+			XLogRegisterBuffer(6, metabuf, REGBUF_STANDARD);
+			XLogRegisterBufData(6, (char *) &metap->hashm_firstfree, sizeof(uint32));
+		}
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SQUEEZE_PAGE);
+
+		PageSetLSN(BufferGetPage(wbuf), recptr);
+		PageSetLSN(BufferGetPage(ovflbuf), recptr);
+
+		if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt)
+			PageSetLSN(BufferGetPage(prevbuf), recptr);
+		if (BufferIsValid(nextbuf))
+			PageSetLSN(BufferGetPage(nextbuf), recptr);
+
+		PageSetLSN(BufferGetPage(mapbuf), recptr);
+
+		if (update_metap)
+			PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	/* release previous bucket if it is not same as write bucket */
+	if (BufferIsValid(prevbuf) && prevblkno != writeblkno)
+		_hash_relbuf(rel, prevbuf);
+
+	if (BufferIsValid(ovflbuf))
+		_hash_relbuf(rel, ovflbuf);
+
+	if (BufferIsValid(nextbuf))
+		_hash_relbuf(rel, nextbuf);
+
+	if (BufferIsValid(mapbuf))
+		_hash_relbuf(rel, mapbuf);
+
+	if (BufferIsValid(metabuf))
+		_hash_relbuf(rel, metabuf);
 
 	return nextblkno;
 }
 
 
 /*
- *	_hash_initbitmap()
+ *	_hash_initbitmapbuffer()
  *
- *	 Initialize a new bitmap page.  The metapage has a write-lock upon
- *	 entering the function, and must be written by caller after return.
- *
- * 'blkno' is the block number of the new bitmap page.
- *
- * All bits in the new bitmap page are set to "1", indicating "in use".
+ *	 Initialize a new bitmap page.  All bits in the new bitmap page are set to
+ *	 "1", indicating "in use".
  */
 void
-_hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
-				 ForkNumber forkNum)
+_hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage)
 {
-	Buffer		buf;
 	Page		pg;
 	HashPageOpaque op;
 	uint32	   *freep;
 
-	/*
-	 * It is okay to write-lock the new bitmap page while holding metapage
-	 * write lock, because no one else could be contending for the new page.
-	 * Also, the metapage lock makes it safe to extend the index using
-	 * _hash_getnewbuf.
-	 *
-	 * There is some loss of concurrency in possibly doing I/O for the new
-	 * page while holding the metapage lock, but this path is taken so seldom
-	 * that it's not worth worrying about.
-	 */
-	buf = _hash_getnewbuf(rel, blkno, forkNum);
 	pg = BufferGetPage(buf);
 
+	/* initialize the page */
+	if (initpage)
+		_hash_pageinit(pg, BufferGetPageSize(buf));
+
 	/* initialize the page's special space */
 	op = (HashPageOpaque) PageGetSpecialPointer(pg);
 	op->hasho_prevblkno = InvalidBlockNumber;
@@ -553,23 +752,14 @@ _hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
 
 	/* set all of the bits to 1 */
 	freep = HashPageGetBitmap(pg);
-	MemSet(freep, 0xFF, BMPGSZ_BYTE(metap));
-
-	/* dirty the new bitmap page, and release write lock and pin */
-	MarkBufferDirty(buf);
-	_hash_relbuf(rel, buf);
+	MemSet(freep, 0xFF, bmsize);
 
-	/* add the new bitmap page to the metapage's list of bitmaps */
-	/* metapage already has a write lock */
-	if (metap->hashm_nmaps >= HASH_MAX_BITMAPS)
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("out of overflow pages in hash index \"%s\"",
-						RelationGetRelationName(rel))));
-
-	metap->hashm_mapp[metap->hashm_nmaps] = blkno;
-
-	metap->hashm_nmaps++;
+	/*
+	 * Set pd_lower just past the end of the bitmap page data.  We could even
+	 * set pd_lower equal to pd_upper, but this is more precise and makes the
+	 * page look compressible to xlog.c.
+	 */
+	((PageHeader) pg)->pd_lower = ((char *) freep + bmsize) - (char *) pg;
 }
 
 
@@ -619,7 +809,6 @@ _hash_squeezebucket(Relation rel,
 	Page		rpage;
 	HashPageOpaque wopaque;
 	HashPageOpaque ropaque;
-	bool		wbuf_dirty;
 
 	/*
 	 * start squeezing into the primary bucket page.
@@ -665,15 +854,23 @@ _hash_squeezebucket(Relation rel,
 	/*
 	 * squeeze the tuples.
 	 */
-	wbuf_dirty = false;
 	for (;;)
 	{
 		OffsetNumber roffnum;
 		OffsetNumber maxroffnum;
 		OffsetNumber deletable[MaxOffsetNumber];
-		int			ndeletable = 0;
+		IndexTuple	itups[MaxIndexTuplesPerPage];
+		Size		tups_size[MaxIndexTuplesPerPage];
+		OffsetNumber *itup_offsets;
+		uint16		ndeletable = 0;
+		uint16		nitups = 0;
+		Size		all_tups_size = 0;
+		int			i;
 		bool		retain_pin = false;
 
+		itup_offsets = (OffsetNumber *) palloc(MaxIndexTuplesPerPage * sizeof(OffsetNumber));
+
+readpage:
 		/* Scan each tuple in "read" page */
 		maxroffnum = PageGetMaxOffsetNumber(rpage);
 		for (roffnum = FirstOffsetNumber;
@@ -694,11 +891,13 @@ _hash_squeezebucket(Relation rel,
 
 			/*
 			 * Walk up the bucket chain, looking for a page big enough for
-			 * this item.  Exit if we reach the read page.
+			 * this item and all other accumulated items.  Exit if we reach
+			 * the read page.
 			 */
-			while (PageGetFreeSpace(wpage) < itemsz)
+			while (PageGetFreeSpaceForMulTups(wpage, nitups + 1) < (all_tups_size + itemsz))
 			{
 				Buffer		next_wbuf = InvalidBuffer;
+				bool		tups_moved = false;
 
 				Assert(!PageIsEmpty(wpage));
 
@@ -715,50 +914,130 @@ _hash_squeezebucket(Relation rel,
 														   HASH_WRITE,
 														   LH_OVERFLOW_PAGE,
 														   bstrategy);
+				if (nitups > 0)
+				{
+					Assert(nitups == ndeletable);
+
+					/*
+					 * This operation needs to log multiple tuples, prepare
+					 * WAL for that.
+					 */
+					if (RelationNeedsWAL(rel))
+						XLogEnsureRecordSpace(0, XLR_NORMAL_RDATAS + nitups);
+
+					START_CRIT_SECTION();
+
+					/*
+					 * we have to insert tuples on the "write" page, being
+					 * careful to preserve hashkey ordering.  (If we insert
+					 * many tuples into the same "write" page it would be
+					 * worth qsort'ing them).
+					 */
+					_hash_pgaddmultitup(rel, wbuf, itups, itup_offsets, nitups);
+					MarkBufferDirty(wbuf);
+
+					/* Delete tuples we already moved off read page */
+					PageIndexMultiDelete(rpage, deletable, ndeletable);
+					MarkBufferDirty(rbuf);
+
+					/* XLOG stuff */
+					if (RelationNeedsWAL(rel))
+					{
+						XLogRecPtr	recptr;
+						xl_hash_move_page_contents xlrec;
+
+						xlrec.ntups = nitups;
+						xlrec.is_prim_bucket_same_wrt = (wbuf == bucket_buf) ? true : false;
+
+						XLogBeginInsert();
+						XLogRegisterData((char *) &xlrec, SizeOfHashMovePageContents);
+
+						/*
+						 * bucket buffer needs to be registered to ensure that
+						 * we can acquire a cleanup lock on it during replay.
+						 */
+						if (!xlrec.is_prim_bucket_same_wrt)
+							XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+						XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
+						XLogRegisterBufData(1, (char *) itup_offsets,
+											nitups * sizeof(OffsetNumber));
+						for (i = 0; i < nitups; i++)
+							XLogRegisterBufData(1, (char *) itups[i], tups_size[i]);
+
+						XLogRegisterBuffer(2, rbuf, REGBUF_STANDARD);
+						XLogRegisterBufData(2, (char *) deletable,
+										  ndeletable * sizeof(OffsetNumber));
+
+						recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_MOVE_PAGE_CONTENTS);
+
+						PageSetLSN(BufferGetPage(wbuf), recptr);
+						PageSetLSN(BufferGetPage(rbuf), recptr);
+					}
+
+					END_CRIT_SECTION();
+
+					tups_moved = true;
+				}
 
 				/*
 				 * release the lock on previous page after acquiring the lock
 				 * on next page
 				 */
-				if (wbuf_dirty)
-					MarkBufferDirty(wbuf);
 				if (retain_pin)
 					LockBuffer(wbuf, BUFFER_LOCK_UNLOCK);
 				else
 					_hash_relbuf(rel, wbuf);
 
+				/*
+				 * We need to release and, if required, reacquire the lock on
+				 * rbuf so that a standby doesn't see an intermediate state of
+				 * it.  If we didn't release the lock, then after replay of
+				 * XLOG_HASH_SQUEEZE_PAGE a user on a standby could see the
+				 * results of a partial deletion on rblkno.
+				 */
+				LockBuffer(rbuf, BUFFER_LOCK_UNLOCK);
+
 				/* nothing more to do if we reached the read page */
 				if (rblkno == wblkno)
 				{
-					if (ndeletable > 0)
-					{
-						/* Delete tuples we already moved off read page */
-						PageIndexMultiDelete(rpage, deletable, ndeletable);
-						MarkBufferDirty(rbuf);
-					}
-					_hash_relbuf(rel, rbuf);
+					_hash_dropbuf(rel, rbuf);
 					return;
 				}
 
+				LockBuffer(rbuf, BUFFER_LOCK_EXCLUSIVE);
+
 				wbuf = next_wbuf;
 				wpage = BufferGetPage(wbuf);
 				wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
 				Assert(wopaque->hasho_bucket == bucket);
-				wbuf_dirty = false;
 				retain_pin = false;
-			}
 
-			/*
-			 * we have found room so insert on the "write" page, being careful
-			 * to preserve hashkey ordering.  (If we insert many tuples into
-			 * the same "write" page it would be worth qsort'ing instead of
-			 * doing repeated _hash_pgaddtup.)
-			 */
-			(void) _hash_pgaddtup(rel, wbuf, itemsz, itup);
-			wbuf_dirty = true;
+				/* be tidy */
+				for (i = 0; i < nitups; i++)
+					pfree(itups[i]);
+				nitups = 0;
+				all_tups_size = 0;
+				ndeletable = 0;
 
+				/*
+				 * after moving the tuples, rpage would have been compacted,
+				 * so we need to rescan it.
+				 */
+				if (tups_moved)
+					goto readpage;
+			}
 			/* remember tuple for deletion from "read" page */
 			deletable[ndeletable++] = roffnum;
+
+			/*
+			 * We need a copy of the index tuples, as they can be freed along
+			 * with the overflow page, but we still need them to write the WAL
+			 * record in _hash_freeovflpage.
+			 */
+			itups[nitups] = CopyIndexTuple(itup);
+			tups_size[nitups++] = itemsz;
+			all_tups_size += itemsz;
 		}
 
 		/*
@@ -776,10 +1055,14 @@ _hash_squeezebucket(Relation rel,
 		Assert(BlockNumberIsValid(rblkno));
 
 		/* free this overflow page (releases rbuf) */
-		_hash_freeovflpage(rel, rbuf, wbuf, bstrategy);
+		_hash_freeovflpage(rel, bucket_buf, rbuf, wbuf, itups, itup_offsets,
+						   tups_size, nitups, bstrategy);
+
+		/* be tidy */
+		for (i = 0; i < nitups; i++)
+			pfree(itups[i]);
 
-		if (wbuf_dirty)
-			MarkBufferDirty(wbuf);
+		pfree(itup_offsets);
 
 		/* are we freeing the page adjacent to wbuf? */
 		if (rblkno == wblkno)
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 45e184c..d7eb8c7 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -29,6 +29,7 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "access/hash_xlog.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
@@ -40,12 +41,11 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
 				  Bucket obucket, Bucket nbucket,
 				  Buffer obuf,
 				  Buffer nbuf,
+				  HTAB *htab,
 				  uint32 maxbucket,
 				  uint32 highmask, uint32 lowmask);
-static void _hash_splitbucket_guts(Relation rel, Buffer metabuf,
-					   Bucket obucket, Bucket nbucket, Buffer obuf,
-					   Buffer nbuf, HTAB *htab, uint32 maxbucket,
-					   uint32 highmask, uint32 lowmask);
+static void
+			log_split_page(Relation rel, Buffer buf);
 
 
 /*
@@ -160,6 +160,29 @@ _hash_getinitbuf(Relation rel, BlockNumber blkno)
 }
 
 /*
+ *	_hash_initbuf() -- Get and initialize a buffer by bucket number.
+ */
+void
+_hash_initbuf(Buffer buf, uint32 num_bucket, uint32 flag, bool initpage)
+{
+	HashPageOpaque pageopaque;
+	Page		page;
+
+	page = BufferGetPage(buf);
+
+	/* initialize the page */
+	if (initpage)
+		_hash_pageinit(page, BufferGetPageSize(buf));
+
+	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+	pageopaque->hasho_prevblkno = InvalidBlockNumber;
+	pageopaque->hasho_nextblkno = InvalidBlockNumber;
+	pageopaque->hasho_bucket = num_bucket;
+	pageopaque->hasho_flag = flag;
+	pageopaque->hasho_page_id = HASHO_PAGE_ID;
+}
+
+/*
  *	_hash_getnewbuf() -- Get a new page at the end of the index.
  *
  *		This has the same API as _hash_getinitbuf, except that we are adding
@@ -291,7 +314,7 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 
 
 /*
- *	_hash_metapinit() -- Initialize the metadata page of a hash index,
+ *	_hash_init() -- Initialize the metadata page of a hash index,
  *				the initial buckets, and the initial bitmap page.
  *
  * The initial number of buckets is dependent on num_tuples, an estimate
@@ -303,19 +326,18 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
  * multiple buffer locks is ignored.
  */
 uint32
-_hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
+_hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
 {
-	HashMetaPage metap;
-	HashPageOpaque pageopaque;
 	Buffer		metabuf;
 	Buffer		buf;
+	Buffer		bitmapbuf;
 	Page		pg;
+	HashMetaPage metap;
+	RegProcedure procid;
 	int32		data_width;
 	int32		item_width;
 	int32		ffactor;
-	double		dnumbuckets;
 	uint32		num_buckets;
-	uint32		log2_num_buckets;
 	uint32		i;
 
 	/* safety check */
@@ -337,6 +359,154 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	if (ffactor < 10)
 		ffactor = 10;
 
+	procid = index_getprocid(rel, 1, HASHPROC);
+
+	/*
+	 * We initialize the metapage, the first N bucket pages, and the first
+	 * bitmap page in sequence, using _hash_getnewbuf to cause smgrextend()
+	 * calls to occur.  This ensures that the smgr level has the right idea of
+	 * the physical index length.
+	 *
+	 * Critical section not required, because on error the creation of the
+	 * whole relation will be rolled back.
+	 */
+	metabuf = _hash_getnewbuf(rel, HASH_METAPAGE, forkNum);
+	_hash_init_metabuffer(metabuf, num_tuples, procid, ffactor, false);
+	MarkBufferDirty(metabuf);
+
+	pg = BufferGetPage(metabuf);
+	metap = HashPageGetMeta(pg);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_init_meta_page xlrec;
+		XLogRecPtr	recptr;
+		/*
+		 * We could copy the entire metapage into the WAL record and restore
+		 * it as-is during replay.  However, the metapage is not small (more
+		 * than 600 bytes), so we record just the information required to
+		 * reconstruct it.
+		 */
+		 */
+		xlrec.num_tuples = num_tuples;
+		xlrec.procid = metap->hashm_procid;
+		xlrec.ffactor = metap->hashm_ffactor;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInitMetaPage);
+		XLogRegisterBuffer(0, metabuf, REGBUF_WILL_INIT);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INIT_META_PAGE);
+
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	num_buckets = metap->hashm_maxbucket + 1;
+
+	/*
+	 * Release buffer lock on the metapage while we initialize buckets.
+	 * Otherwise, we'll be in interrupt holdoff and the CHECK_FOR_INTERRUPTS
+	 * won't accomplish anything.  It's a bad idea to hold buffer locks for
+	 * long intervals in any case, since that can block the bgwriter.
+	 */
+	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+
+	/*
+	 * Initialize and WAL Log the first N buckets
+	 */
+	for (i = 0; i < num_buckets; i++)
+	{
+		BlockNumber blkno;
+
+		/* Allow interrupts, in case N is huge */
+		CHECK_FOR_INTERRUPTS();
+
+		blkno = BUCKET_TO_BLKNO(metap, i);
+		buf = _hash_getnewbuf(rel, blkno, forkNum);
+		_hash_initbuf(buf, i, LH_BUCKET_PAGE, false);
+		MarkBufferDirty(buf);
+
+		log_newpage(&rel->rd_node,
+					forkNum,
+					blkno,
+					BufferGetPage(buf),
+					true);
+		_hash_relbuf(rel, buf);
+	}
+
+	/* Now reacquire buffer lock on metapage */
+	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+
+	/*
+	 * Initialize bitmap page
+	 */
+	bitmapbuf = _hash_getnewbuf(rel, num_buckets + 1, forkNum);
+	_hash_initbitmapbuffer(bitmapbuf, metap->hashm_bmsize, false);
+	MarkBufferDirty(bitmapbuf);
+
+	/* add the new bitmap page to the metapage's list of bitmaps */
+	/* metapage already has a write lock */
+	if (metap->hashm_nmaps >= HASH_MAX_BITMAPS)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+				 errmsg("out of overflow pages in hash index \"%s\"",
+						RelationGetRelationName(rel))));
+
+	metap->hashm_mapp[metap->hashm_nmaps] = num_buckets + 1;
+
+	metap->hashm_nmaps++;
+	MarkBufferDirty(metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_init_bitmap_page xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.bmsize = metap->hashm_bmsize;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInitBitmapPage);
+		XLogRegisterBuffer(0, bitmapbuf, REGBUF_WILL_INIT);
+
+		/*
+		 * We could log the exact changes made to the meta page; however,
+		 * since no concurrent operation can see the index during the replay
+		 * of this record, we simply perform the same operations during replay
+		 * as are done here.
+		 */
+		XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INIT_BITMAP_PAGE);
+
+		PageSetLSN(BufferGetPage(bitmapbuf), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	/* all done */
+	_hash_relbuf(rel, bitmapbuf);
+	_hash_relbuf(rel, metabuf);
+
+	return num_buckets;
+}
+
+/*
+ *	_hash_init_metabuffer() -- Initialize the metadata page of a hash index.
+ */
+void
+_hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
+					  uint16 ffactor, bool initpage)
+{
+	HashMetaPage metap;
+	HashPageOpaque pageopaque;
+	Page		page;
+	double		dnumbuckets;
+	uint32		num_buckets;
+	uint32		log2_num_buckets;
+	uint32		i;
+
+
 	/*
 	 * Choose the number of initial bucket pages to match the fill factor
 	 * given the estimated number of tuples.  We round up the result to the
@@ -356,30 +526,25 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	Assert(num_buckets == (((uint32) 1) << log2_num_buckets));
 	Assert(log2_num_buckets < HASH_MAX_SPLITPOINTS);
 
-	/*
-	 * We initialize the metapage, the first N bucket pages, and the first
-	 * bitmap page in sequence, using _hash_getnewbuf to cause smgrextend()
-	 * calls to occur.  This ensures that the smgr level has the right idea of
-	 * the physical index length.
-	 */
-	metabuf = _hash_getnewbuf(rel, HASH_METAPAGE, forkNum);
-	pg = BufferGetPage(metabuf);
+	page = BufferGetPage(buf);
+	if (initpage)
+		_hash_pageinit(page, BufferGetPageSize(buf));
 
-	pageopaque = (HashPageOpaque) PageGetSpecialPointer(pg);
+	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	pageopaque->hasho_prevblkno = InvalidBlockNumber;
 	pageopaque->hasho_nextblkno = InvalidBlockNumber;
 	pageopaque->hasho_bucket = -1;
 	pageopaque->hasho_flag = LH_META_PAGE;
 	pageopaque->hasho_page_id = HASHO_PAGE_ID;
 
-	metap = HashPageGetMeta(pg);
+	metap = HashPageGetMeta(page);
 
 	metap->hashm_magic = HASH_MAGIC;
 	metap->hashm_version = HASH_VERSION;
 	metap->hashm_ntuples = 0;
 	metap->hashm_nmaps = 0;
 	metap->hashm_ffactor = ffactor;
-	metap->hashm_bsize = HashGetMaxBitmapSize(pg);
+	metap->hashm_bsize = HashGetMaxBitmapSize(page);
 	/* find largest bitmap array size that will fit in page size */
 	for (i = _hash_log2(metap->hashm_bsize); i > 0; --i)
 	{
@@ -396,7 +561,7 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	 * pretty useless for normal operation (in fact, hashm_procid is not used
 	 * anywhere), but it might be handy for forensic purposes so we keep it.
 	 */
-	metap->hashm_procid = index_getprocid(rel, 1, HASHPROC);
+	metap->hashm_procid = procid;
 
 	/*
 	 * We initialize the index with N buckets, 0 .. N-1, occupying physical
@@ -415,47 +580,11 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	metap->hashm_firstfree = 0;
 
 	/*
-	 * Release buffer lock on the metapage while we initialize buckets.
-	 * Otherwise, we'll be in interrupt holdoff and the CHECK_FOR_INTERRUPTS
-	 * won't accomplish anything.  It's a bad idea to hold buffer locks for
-	 * long intervals in any case, since that can block the bgwriter.
-	 */
-	MarkBufferDirty(metabuf);
-	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
-
-	/*
-	 * Initialize the first N buckets
-	 */
-	for (i = 0; i < num_buckets; i++)
-	{
-		/* Allow interrupts, in case N is huge */
-		CHECK_FOR_INTERRUPTS();
-
-		buf = _hash_getnewbuf(rel, BUCKET_TO_BLKNO(metap, i), forkNum);
-		pg = BufferGetPage(buf);
-		pageopaque = (HashPageOpaque) PageGetSpecialPointer(pg);
-		pageopaque->hasho_prevblkno = InvalidBlockNumber;
-		pageopaque->hasho_nextblkno = InvalidBlockNumber;
-		pageopaque->hasho_bucket = i;
-		pageopaque->hasho_flag = LH_BUCKET_PAGE;
-		pageopaque->hasho_page_id = HASHO_PAGE_ID;
-		MarkBufferDirty(buf);
-		_hash_relbuf(rel, buf);
-	}
-
-	/* Now reacquire buffer lock on metapage */
-	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
-
-	/*
-	 * Initialize first bitmap page
+	 * Set pd_lower just past the end of the metadata.  This is to log the
+	 * full page image of the metapage in xloginsert.c.
 	 */
-	_hash_initbitmap(rel, metap, num_buckets + 1, forkNum);
-
-	/* all done */
-	MarkBufferDirty(metabuf);
-	_hash_relbuf(rel, metabuf);
-
-	return num_buckets;
+	((PageHeader) page)->pd_lower =
+		((char *) metap + sizeof(HashMetaPageData)) - (char *) page;
 }
 
 /*
@@ -464,7 +593,6 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 void
 _hash_pageinit(Page page, Size size)
 {
-	Assert(PageIsNew(page));
 	PageInit(page, size, sizeof(HashPageOpaqueData));
 }
 
@@ -492,10 +620,14 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	Buffer		buf_nblkno;
 	Buffer		buf_oblkno;
 	Page		opage;
+	Page		npage;
 	HashPageOpaque oopaque;
+	HashPageOpaque nopaque;
 	uint32		maxbucket;
 	uint32		highmask;
 	uint32		lowmask;
+	bool		metap_update_masks = false;
+	bool		metap_update_splitpoint = false;
 
 restart_expand:
 
@@ -531,7 +663,7 @@ restart_expand:
 	 * than a disk block then this would be an independent constraint.
 	 *
 	 * If you change this, see also the maximum initial number of buckets in
-	 * _hash_metapinit().
+	 * _hash_init().
 	 */
 	if (metap->hashm_maxbucket >= (uint32) 0x7FFFFFFE)
 		goto fail;
@@ -655,7 +787,11 @@ restart_expand:
 		 * The number of buckets in the new splitpoint is equal to the total
 		 * number already in existence, i.e. new_bucket.  Currently this maps
 		 * one-to-one to blocks required, but someday we may need a more
-		 * complicated calculation here.
+		 * complicated calculation here.  We treat allocation of buckets as a
+		 * separate WAL-logged action.  Even if we fail after this operation,
+		 * it won't leak space on the standby, as the next split will consume
+		 * this space.  In any case, even without a failure we don't use all
+		 * the space in one split operation.
 		 */
 		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
 		{
@@ -680,18 +816,43 @@ restart_expand:
 		goto fail;
 	}
 
-
 	/*
-	 * Okay to proceed with split.  Update the metapage bucket mapping info.
-	 *
-	 * Since we are scribbling on the metapage data right in the shared
-	 * buffer, any failure in this next little bit leaves us with a big
+	 * Since we are scribbling on the pages in the shared buffers, establish a
+	 * critical section.  Any failure in this next code leaves us with a big
 	 * problem: the metapage is effectively corrupt but could get written back
-	 * to disk.  We don't really expect any failure, but just to be sure,
-	 * establish a critical section.
+	 * to disk.
 	 */
 	START_CRIT_SECTION();
 
+	opage = BufferGetPage(buf_oblkno);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	/*
+	 * Mark the old bucket to indicate that split is in progress.  At
+	 * operation end, we clear split-in-progress flag.
+	 */
+	oopaque->hasho_flag |= LH_BUCKET_BEING_SPLIT;
+
+	MarkBufferDirty(buf_oblkno);
+
+	npage = BufferGetPage(buf_nblkno);
+
+	/*
+	 * initialize the new bucket's primary page and mark it to indicate that
+	 * split is in progress.
+	 */
+	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	nopaque->hasho_prevblkno = InvalidBlockNumber;
+	nopaque->hasho_nextblkno = InvalidBlockNumber;
+	nopaque->hasho_bucket = new_bucket;
+	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_BEING_POPULATED;
+	nopaque->hasho_page_id = HASHO_PAGE_ID;
+
+	MarkBufferDirty(buf_nblkno);
+
+	/*
+	 * Okay to proceed with split.  Update the metapage bucket mapping info.
+	 */
 	metap->hashm_maxbucket = new_bucket;
 
 	if (new_bucket > metap->hashm_highmask)
@@ -699,6 +860,7 @@ restart_expand:
 		/* Starting a new doubling */
 		metap->hashm_lowmask = metap->hashm_highmask;
 		metap->hashm_highmask = new_bucket | metap->hashm_lowmask;
+		metap_update_masks = true;
 	}
 
 	/*
@@ -711,10 +873,10 @@ restart_expand:
 	{
 		metap->hashm_spares[spare_ndx] = metap->hashm_spares[metap->hashm_ovflpoint];
 		metap->hashm_ovflpoint = spare_ndx;
+		metap_update_splitpoint = true;
 	}
 
-	/* Done mucking with metapage */
-	END_CRIT_SECTION();
+	MarkBufferDirty(metabuf);
 
 	/*
 	 * Copy bucket mapping info now; this saves re-accessing the meta page
@@ -727,16 +889,71 @@ restart_expand:
 	highmask = metap->hashm_highmask;
 	lowmask = metap->hashm_lowmask;
 
-	/* Write out the metapage and drop lock, but keep pin */
-	MarkBufferDirty(metabuf);
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_split_allocpage xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.new_bucket = maxbucket;
+		xlrec.old_bucket_flag = oopaque->hasho_flag;
+		xlrec.new_bucket_flag = nopaque->hasho_flag;
+		xlrec.flags = 0;
+
+		XLogBeginInsert();
+
+		XLogRegisterBuffer(0, buf_oblkno, REGBUF_STANDARD);
+		XLogRegisterBuffer(1, buf_nblkno, REGBUF_WILL_INIT);
+		XLogRegisterBuffer(2, metabuf, REGBUF_STANDARD);
+
+		if (metap_update_masks)
+		{
+			xlrec.flags |= XLH_SPLIT_META_UPDATE_MASKS;
+			XLogRegisterBufData(2, (char *) &metap->hashm_lowmask, sizeof(uint32));
+			XLogRegisterBufData(2, (char *) &metap->hashm_highmask, sizeof(uint32));
+		}
+
+		if (metap_update_splitpoint)
+		{
+			xlrec.flags |= XLH_SPLIT_META_UPDATE_SPLITPOINT;
+			XLogRegisterBufData(2, (char *) &metap->hashm_ovflpoint,
+								sizeof(uint32));
+			XLogRegisterBufData(2,
+					   (char *) &metap->hashm_spares[metap->hashm_ovflpoint],
+								sizeof(uint32));
+		}
+
+		XLogRegisterData((char *) &xlrec, SizeOfHashSplitAllocPage);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_ALLOCATE_PAGE);
+
+		PageSetLSN(BufferGetPage(buf_oblkno), recptr);
+		PageSetLSN(BufferGetPage(buf_nblkno), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	/* drop lock, but keep pin */
 	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 
+	/*
+	 * We need to release and reacquire the lock on the new bucket buffer to
+	 * ensure that a standby doesn't see it in an intermediate state.
+	 */
+	LockBuffer(buf_nblkno, BUFFER_LOCK_UNLOCK);
+	LockBuffer(buf_nblkno, BUFFER_LOCK_EXCLUSIVE);
+
 	/* Relocate records to the new bucket */
 	_hash_splitbucket(rel, metabuf,
 					  old_bucket, new_bucket,
-					  buf_oblkno, buf_nblkno,
+					  buf_oblkno, buf_nblkno, NULL,
 					  maxbucket, highmask, lowmask);
 
+	/* all done, now release the locks and pins on primary buckets. */
+	_hash_relbuf(rel, buf_oblkno);
+	_hash_relbuf(rel, buf_nblkno);
+
 	return;
 
 	/* Here if decide not to split or fail to acquire old bucket lock */
@@ -776,6 +993,7 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 {
 	BlockNumber lastblock;
 	char		zerobuf[BLCKSZ];
+	Page		page;
 
 	lastblock = firstblock + nblocks - 1;
 
@@ -786,7 +1004,21 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 	if (lastblock < firstblock || lastblock == InvalidBlockNumber)
 		return false;
 
-	MemSet(zerobuf, 0, sizeof(zerobuf));
+	page = (Page) zerobuf;
+
+	/*
+	 * Initialize the new bucket page.  Here we can't use a completely zeroed
+	 * page, as WAL replay routines expect pages to be initialized; see the
+	 * explanation of RBM_NORMAL mode atop XLogReadBufferExtended.
+	 */
+	_hash_pageinit(page, BLCKSZ);
+
+	if (RelationNeedsWAL(rel))
+		log_newpage(&rel->rd_node,
+					MAIN_FORKNUM,
+					lastblock,
+					zerobuf,
+					true);
 
 	RelationOpenSmgr(rel);
 	smgrextend(rel->rd_smgr, MAIN_FORKNUM, lastblock, zerobuf, false);
@@ -794,14 +1026,19 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 	return true;
 }
 
-
 /*
  * _hash_splitbucket -- split 'obucket' into 'obucket' and 'nbucket'
  *
+ * This routine is used to partition the tuples between the old and new bucket
+ * and is also used to finish incomplete split operations.  To finish a
+ * previously interrupted split operation, the caller needs to fill htab.  If
+ * htab is set, we skip the movement of tuples that exist in htab; otherwise,
+ * a NULL htab indicates movement of all the tuples that belong to the new
+ * bucket.
+ *
  * We are splitting a bucket that consists of a base bucket page and zero
  * or more overflow (bucket chain) pages.  We must relocate tuples that
- * belong in the new bucket, and compress out any free space in the old
- * bucket.
+ * belong in the new bucket.
  *
  * The caller must hold cleanup locks on both buckets to ensure that
  * no one else is trying to access them (see README).
@@ -827,69 +1064,11 @@ _hash_splitbucket(Relation rel,
 				  Bucket nbucket,
 				  Buffer obuf,
 				  Buffer nbuf,
+				  HTAB *htab,
 				  uint32 maxbucket,
 				  uint32 highmask,
 				  uint32 lowmask)
 {
-	Page		opage;
-	Page		npage;
-	HashPageOpaque oopaque;
-	HashPageOpaque nopaque;
-
-	opage = BufferGetPage(obuf);
-	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
-
-	/*
-	 * Mark the old bucket to indicate that split is in progress.  At
-	 * operation end, we clear split-in-progress flag.
-	 */
-	oopaque->hasho_flag |= LH_BUCKET_BEING_SPLIT;
-
-	npage = BufferGetPage(nbuf);
-
-	/*
-	 * initialize the new bucket's primary page and mark it to indicate that
-	 * split is in progress.
-	 */
-	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
-	nopaque->hasho_prevblkno = InvalidBlockNumber;
-	nopaque->hasho_nextblkno = InvalidBlockNumber;
-	nopaque->hasho_bucket = nbucket;
-	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_BEING_POPULATED;
-	nopaque->hasho_page_id = HASHO_PAGE_ID;
-
-	_hash_splitbucket_guts(rel, metabuf, obucket,
-						   nbucket, obuf, nbuf, NULL,
-						   maxbucket, highmask, lowmask);
-
-	/* all done, now release the locks and pins on primary buckets. */
-	_hash_relbuf(rel, obuf);
-	_hash_relbuf(rel, nbuf);
-}
-
-/*
- * _hash_splitbucket_guts -- Helper function to perform the split operation
- *
- * This routine is used to partition the tuples between old and new bucket and
- * to finish incomplete split operations.  To finish the previously
- * interrupted split operation, caller needs to fill htab.  If htab is set, then
- * we skip the movement of tuples that exists in htab, otherwise NULL value of
- * htab indicates movement of all the tuples that belong to new bucket.
- *
- * Caller needs to lock and unlock the old and new primary buckets.
- */
-static void
-_hash_splitbucket_guts(Relation rel,
-					   Buffer metabuf,
-					   Bucket obucket,
-					   Bucket nbucket,
-					   Buffer obuf,
-					   Buffer nbuf,
-					   HTAB *htab,
-					   uint32 maxbucket,
-					   uint32 highmask,
-					   uint32 lowmask)
-{
 	Buffer		bucket_obuf;
 	Buffer		bucket_nbuf;
 	Page		opage;
@@ -974,11 +1153,14 @@ _hash_splitbucket_guts(Relation rel,
 				itemsz = IndexTupleDSize(*new_itup);
 				itemsz = MAXALIGN(itemsz);
 
-				if (PageGetFreeSpace(npage) < itemsz)
+				while (PageGetFreeSpace(npage) < itemsz)
 				{
-					/* write out nbuf and drop lock, but keep pin */
-					MarkBufferDirty(nbuf);
+					/* log the split operation before releasing the lock */
+					log_split_page(rel, nbuf);
+
+					/* drop lock, but keep pin */
 					LockBuffer(nbuf, BUFFER_LOCK_UNLOCK);
+
 					/* chain to a new overflow page */
 					nbuf = _hash_addovflpage(rel, metabuf, nbuf, (nbuf == bucket_nbuf) ? true : false);
 					npage = BufferGetPage(nbuf);
@@ -986,6 +1168,13 @@ _hash_splitbucket_guts(Relation rel,
 				}
 
 				/*
+				 * Change the shared buffer state in a critical section;
+				 * otherwise, an error in between could leave the page in a
+				 * state that recovery cannot repair.
+				 */
+				START_CRIT_SECTION();
+
+				/*
 				 * Insert tuple on new page, using _hash_pgaddtup to ensure
 				 * correct ordering by hashkey.  This is a tad inefficient
 				 * since we may have to shuffle itempointers repeatedly.
@@ -994,6 +1183,8 @@ _hash_splitbucket_guts(Relation rel,
 				 */
 				(void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
 
+				END_CRIT_SECTION();
+
 				/* be tidy */
 				pfree(new_itup);
 			}
@@ -1016,7 +1207,16 @@ _hash_splitbucket_guts(Relation rel,
 
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
+		{
+			/* log the split operation before releasing the lock */
+			log_split_page(rel, nbuf);
+
+			if (nbuf == bucket_nbuf)
+				LockBuffer(nbuf, BUFFER_LOCK_UNLOCK);
+			else
+				_hash_relbuf(rel, nbuf);
 			break;
+		}
 
 		/* Else, advance to next old page */
 		obuf = _hash_getbuf(rel, oblkno, HASH_READ, LH_OVERFLOW_PAGE);
@@ -1032,17 +1232,6 @@ _hash_splitbucket_guts(Relation rel,
 	 * To avoid deadlocks due to locking order of buckets, first lock the old
 	 * bucket and then the new bucket.
 	 */
-	if (nbuf == bucket_nbuf)
-	{
-		MarkBufferDirty(bucket_nbuf);
-		LockBuffer(bucket_nbuf, BUFFER_LOCK_UNLOCK);
-	}
-	else
-	{
-		MarkBufferDirty(nbuf);
-		_hash_relbuf(rel, nbuf);
-	}
-
 	LockBuffer(bucket_obuf, BUFFER_LOCK_EXCLUSIVE);
 	opage = BufferGetPage(bucket_obuf);
 	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
@@ -1051,6 +1240,8 @@ _hash_splitbucket_guts(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	START_CRIT_SECTION();
+
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
 	nopaque->hasho_flag &= ~LH_BUCKET_BEING_POPULATED;
 
@@ -1067,6 +1258,29 @@ _hash_splitbucket_guts(Relation rel,
 	 */
 	MarkBufferDirty(bucket_obuf);
 	MarkBufferDirty(bucket_nbuf);
+
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+		xl_hash_split_complete xlrec;
+
+		xlrec.old_bucket_flag = oopaque->hasho_flag;
+		xlrec.new_bucket_flag = nopaque->hasho_flag;
+
+		XLogBeginInsert();
+
+		XLogRegisterData((char *) &xlrec, SizeOfHashSplitComplete);
+
+		XLogRegisterBuffer(0, bucket_obuf, REGBUF_STANDARD);
+		XLogRegisterBuffer(1, bucket_nbuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_COMPLETE);
+
+		PageSetLSN(BufferGetPage(bucket_obuf), recptr);
+		PageSetLSN(BufferGetPage(bucket_nbuf), recptr);
+	}
+
+	END_CRIT_SECTION();
 }
 
 /*
@@ -1183,11 +1397,44 @@ _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Bucket obucket,
 	npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 	nbucket = npageopaque->hasho_bucket;
 
-	_hash_splitbucket_guts(rel, metabuf, obucket,
-						   nbucket, obuf, bucket_nbuf, tidhtab,
-						   maxbucket, highmask, lowmask);
+	_hash_splitbucket(rel, metabuf, obucket,
+					  nbucket, obuf, bucket_nbuf, tidhtab,
+					  maxbucket, highmask, lowmask);
 
 	_hash_relbuf(rel, bucket_nbuf);
 	LockBuffer(obuf, BUFFER_LOCK_UNLOCK);
 	hash_destroy(tidhtab);
 }
+
+/*
+ *	log_split_page() -- Log the split operation
+ *
+ *	We log the split operation when a new page in the new bucket gets full,
+ *	so we log the entire page.
+ *
+ *	'buf' must be locked by the caller which is also responsible for unlocking
+ *	it.
+ */
+static void
+log_split_page(Relation rel, Buffer buf)
+{
+	START_CRIT_SECTION();
+
+	MarkBufferDirty(buf);
+
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+
+		XLogBeginInsert();
+
+		XLogRegisterBuffer(0, buf, REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_PAGE);
+
+		PageSetLSN(BufferGetPage(buf), recptr);
+	}
+
+	END_CRIT_SECTION();
+}
+
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 913b87c..4d741ee 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -123,6 +123,7 @@ _hash_readnext(IndexScanDesc scan,
 	if (block_found)
 	{
 		*pagep = BufferGetPage(*bufp);
+		TestForOldSnapshot(scan->xs_snapshot, rel, *pagep);
 		*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
 	}
 }
@@ -159,6 +160,7 @@ _hash_readprev(IndexScanDesc scan,
 		*bufp = _hash_getbuf(rel, blkno, HASH_READ,
 							 LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
 		*pagep = BufferGetPage(*bufp);
+		TestForOldSnapshot(scan->xs_snapshot, rel, *pagep);
 		*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
 
 		/*
@@ -328,6 +330,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	_hash_dropbuf(rel, metabuf);
 
 	page = BufferGetPage(buf);
+	TestForOldSnapshot(scan->xs_snapshot, rel, page);
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	Assert(opaque->hasho_bucket == bucket);
 
@@ -362,6 +365,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
 		old_buf = _hash_getbuf(rel, old_blkno, HASH_READ, LH_BUCKET_PAGE);
+		TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(old_buf));
 
 		/*
 		 * remember the split bucket buffer so as to use it later for
@@ -564,6 +568,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					_hash_readprev(scan, &buf, &page, &opaque);
 					if (BufferIsValid(buf))
 					{
+						TestForOldSnapshot(scan->xs_snapshot, rel, page);
 						maxoff = PageGetMaxOffsetNumber(page);
 						offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
 					}
diff --git a/src/backend/access/rmgrdesc/hashdesc.c b/src/backend/access/rmgrdesc/hashdesc.c
index 12e1818..fc04363 100644
--- a/src/backend/access/rmgrdesc/hashdesc.c
+++ b/src/backend/access/rmgrdesc/hashdesc.c
@@ -19,10 +19,143 @@
 void
 hash_desc(StringInfo buf, XLogReaderState *record)
 {
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			{
+				xl_hash_init_meta_page *xlrec = (xl_hash_init_meta_page *) rec;
+
+				appendStringInfo(buf, "num_tuples %g, fillfactor %d",
+								 xlrec->num_tuples, xlrec->ffactor);
+				break;
+			}
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			{
+				xl_hash_init_bitmap_page *xlrec = (xl_hash_init_bitmap_page *) rec;
+
+				appendStringInfo(buf, "bmsize %d", xlrec->bmsize);
+				break;
+			}
+		case XLOG_HASH_INSERT:
+			{
+				xl_hash_insert *xlrec = (xl_hash_insert *) rec;
+
+				appendStringInfo(buf, "off %u", xlrec->offnum);
+				break;
+			}
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			{
+				xl_hash_addovflpage *xlrec = (xl_hash_addovflpage *) rec;
+
+				appendStringInfo(buf, "bmsize %d, bmpage_found %c",
+						   xlrec->bmsize, (xlrec->bmpage_found) ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			{
+				xl_hash_split_allocpage *xlrec = (xl_hash_split_allocpage *) rec;
+
+				appendStringInfo(buf, "new_bucket %u, meta_page_masks_updated %c, issplitpoint_changed %c",
+								 xlrec->new_bucket,
+					(xlrec->flags & XLH_SPLIT_META_UPDATE_MASKS) ? 'T' : 'F',
+								 (xlrec->flags & XLH_SPLIT_META_UPDATE_SPLITPOINT) ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SPLIT_COMPLETE:
+			{
+				xl_hash_split_complete *xlrec = (xl_hash_split_complete *) rec;
+
+				appendStringInfo(buf, "split_complete_old_bucket %c, split_complete_new_bucket %c",
+				(xlrec->old_bucket_flag & LH_BUCKET_BEING_SPLIT) ? 'F' : 'T',
+								 (xlrec->new_bucket_flag & LH_BUCKET_BEING_POPULATED) ? 'F' : 'T');
+				break;
+			}
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			{
+				xl_hash_move_page_contents *xlrec = (xl_hash_move_page_contents *) rec;
+
+				appendStringInfo(buf, "ntups %d, is_primary %c",
+								 xlrec->ntups,
+								 xlrec->is_prim_bucket_same_wrt ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SQUEEZE_PAGE:
+			{
+				xl_hash_squeeze_page *xlrec = (xl_hash_squeeze_page *) rec;
+
+				appendStringInfo(buf, "prevblkno %u, nextblkno %u, ntups %d, is_primary %c",
+								 xlrec->prevblkno,
+								 xlrec->nextblkno,
+								 xlrec->ntups,
+								 xlrec->is_prim_bucket_same_wrt ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_DELETE:
+			{
+				xl_hash_delete *xlrec = (xl_hash_delete *) rec;
+
+				appendStringInfo(buf, "is_primary %c",
+								 xlrec->is_primary_bucket_page ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_UPDATE_META_PAGE:
+			{
+				xl_hash_update_meta_page *xlrec = (xl_hash_update_meta_page *) rec;
+
+				appendStringInfo(buf, "ntuples %g",
+								 xlrec->ntuples);
+				break;
+			}
+	}
 }
 
 const char *
 hash_identify(uint8 info)
 {
-	return NULL;
+	const char *id = NULL;
+
+	switch (info & ~XLR_INFO_MASK)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			id = "INIT_META_PAGE";
+			break;
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			id = "INIT_BITMAP_PAGE";
+			break;
+		case XLOG_HASH_INSERT:
+			id = "INSERT";
+			break;
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			id = "ADD_OVFL_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			id = "SPLIT_ALLOCATE_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_PAGE:
+			id = "SPLIT_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_COMPLETE:
+			id = "SPLIT_COMPLETE";
+			break;
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			id = "MOVE_PAGE_CONTENTS";
+			break;
+		case XLOG_HASH_SQUEEZE_PAGE:
+			id = "SQUEEZE_PAGE";
+			break;
+		case XLOG_HASH_DELETE:
+			id = "DELETE";
+			break;
+		case XLOG_HASH_SPLIT_CLEANUP:
+			id = "SPLIT_CLEANUP";
+			break;
+		case XLOG_HASH_UPDATE_META_PAGE:
+			id = "UPDATE_META_PAGE";
+			break;
+	}
+
+	return id;
 }
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index eeb2b1f..82f019d 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -502,11 +502,6 @@ DefineIndex(Oid relationId,
 	accessMethodForm = (Form_pg_am) GETSTRUCT(tuple);
 	amRoutine = GetIndexAmRoutine(accessMethodForm->amhandler);
 
-	if (strcmp(accessMethodName, "hash") == 0 &&
-		RelationNeedsWAL(rel))
-		ereport(WARNING,
-				(errmsg("hash indexes are not WAL-logged and their use is discouraged")));
-
 	if (stmt->unique && !amRoutine->amcanunique)
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 73aa0c0..31b66d2 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -598,6 +598,33 @@ PageGetFreeSpace(Page page)
 }
 
 /*
+ * PageGetFreeSpaceForMulTups
+ *		Returns the size of the free (allocatable) space on a page,
+ *		reduced by the space needed for multiple new line pointers.
+ *
+ * Note: this should usually only be used on index pages.  Use
+ * PageGetHeapFreeSpace on heap pages.
+ */
+Size
+PageGetFreeSpaceForMulTups(Page page, int ntups)
+{
+	int			space;
+
+	/*
+	 * Use signed arithmetic here so that we behave sensibly if pd_lower >
+	 * pd_upper.
+	 */
+	space = (int) ((PageHeader) page)->pd_upper -
+		(int) ((PageHeader) page)->pd_lower;
+
+	if (space < (int) (ntups * sizeof(ItemIdData)))
+		return 0;
+	space -= ntups * sizeof(ItemIdData);
+
+	return (Size) space;
+}
+
+/*
  * PageGetExactFreeSpace
  *		Returns the size of the free (allocatable) space on a page,
  *		without any consideration for adding/removing line pointers.
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 2a68359..266122a 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -5731,13 +5731,10 @@ RelationIdIsInInitFile(Oid relationId)
 /*
  * Tells whether any index for the relation is unlogged.
  *
- * Any index using the hash AM is implicitly unlogged.
- *
  * Note: There doesn't seem to be any way to have an unlogged index attached
- * to a permanent table except to create a hash index, but it seems best to
- * keep this general so that it returns sensible results even when they seem
- * obvious (like for an unlogged table) and to handle possible future unlogged
- * indexes on permanent tables.
+ * to a permanent table, but it seems best to keep this general so that it
+ * returns sensible results even when they seem obvious (like for an unlogged
+ * table) and to handle possible future unlogged indexes on permanent tables.
  */
 bool
 RelationHasUnloggedIndex(Relation rel)
@@ -5759,8 +5756,7 @@ RelationHasUnloggedIndex(Relation rel)
 			elog(ERROR, "cache lookup failed for relation %u", indexoid);
 		reltup = (Form_pg_class) GETSTRUCT(tp);
 
-		if (reltup->relpersistence == RELPERSISTENCE_UNLOGGED
-			|| reltup->relam == HASH_AM_OID)
+		if (reltup->relpersistence == RELPERSISTENCE_UNLOGGED)
 			result = true;
 
 		ReleaseSysCache(tp);
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index bc08f81..bd56ee2 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -310,13 +310,15 @@ extern Datum hash_uint32(uint32 k);
 extern void _hash_doinsert(Relation rel, IndexTuple itup);
 extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
 			   Size itemsize, IndexTuple itup);
+extern void _hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
+					OffsetNumber *itup_offsets, uint16 nitups);
 
 /* hashovfl.c */
 extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin);
-extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
-				   BufferAccessStrategy bstrategy);
-extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
-				 BlockNumber blkno, ForkNumber forkNum);
+extern BlockNumber _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
+				   Buffer wbuf, IndexTuple *itups, OffsetNumber *itup_offsets,
+			 Size *tups_size, uint16 nitups, BufferAccessStrategy bstrategy);
+extern void _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage);
 extern void _hash_squeezebucket(Relation rel,
 					Bucket bucket, BlockNumber bucket_blkno,
 					Buffer bucket_buf,
@@ -328,6 +330,8 @@ extern Buffer _hash_getbuf(Relation rel, BlockNumber blkno,
 extern Buffer _hash_getbuf_with_condlock_cleanup(Relation rel,
 								   BlockNumber blkno, int flags);
 extern Buffer _hash_getinitbuf(Relation rel, BlockNumber blkno);
+extern void _hash_initbuf(Buffer buf, uint32 num_bucket, uint32 flag,
+			  bool initpage);
 extern Buffer _hash_getnewbuf(Relation rel, BlockNumber blkno,
 				ForkNumber forkNum);
 extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
@@ -336,8 +340,10 @@ extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
 extern void _hash_relbuf(Relation rel, Buffer buf);
 extern void _hash_dropbuf(Relation rel, Buffer buf);
 extern void _hash_dropscanbuf(Relation rel, HashScanOpaque so);
-extern uint32 _hash_metapinit(Relation rel, double num_tuples,
-				ForkNumber forkNum);
+extern uint32 _hash_init(Relation rel, double num_tuples,
+		   ForkNumber forkNum);
+extern void _hash_init_metabuffer(Buffer buf, double num_tuples,
+					  RegProcedure procid, uint16 ffactor, bool initpage);
 extern void _hash_pageinit(Page page, Size size);
 extern void _hash_expandtable(Relation rel, Buffer metabuf);
 extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index 5f941a9..35c1b8f 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -17,6 +17,245 @@
 #include "access/hash.h"
 #include "access/xlogreader.h"
 
+/* Number of buffers required for XLOG_HASH_SQUEEZE_PAGE operation */
+#define HASH_XLOG_FREE_OVFL_BUFS	6
+
+/*
+ * XLOG records for hash operations
+ */
+#define XLOG_HASH_INIT_META_PAGE	0x00		/* initialize the meta page */
+#define XLOG_HASH_INIT_BITMAP_PAGE	0x10		/* initialize the bitmap page */
+#define XLOG_HASH_INSERT		0x20	/* add index tuple without split */
+#define XLOG_HASH_ADD_OVFL_PAGE 0x30	/* add overflow page */
+#define XLOG_HASH_SPLIT_ALLOCATE_PAGE	0x40	/* allocate new page for split */
+#define XLOG_HASH_SPLIT_PAGE	0x50	/* split page */
+#define XLOG_HASH_SPLIT_COMPLETE	0x60		/* completion of split
+												 * operation */
+#define XLOG_HASH_MOVE_PAGE_CONTENTS	0x70	/* remove tuples from one page
+												 * and add to another page */
+#define XLOG_HASH_SQUEEZE_PAGE	0x80	/* add tuples to one of the previous
+										 * pages in chain and free the ovfl
+										 * page */
+#define XLOG_HASH_DELETE		0x90	/* delete index tuples from a page */
+#define XLOG_HASH_SPLIT_CLEANUP 0xA0	/* clear split-cleanup flag in primary
+										 * bucket page after deleting tuples
+										 * that are moved due to split	*/
+#define XLOG_HASH_UPDATE_META_PAGE	0xB0		/* update meta page after
+												 * vacuum */
+
+
+/*
+ * xl_hash_split_allocpage flag values, 8 bits are available.
+ */
+#define XLH_SPLIT_META_UPDATE_MASKS		(1<<0)
+#define XLH_SPLIT_META_UPDATE_SPLITPOINT		(1<<1)
+
+/*
+ * Data to regenerate the meta-data page
+ */
+typedef struct xl_hash_metadata
+{
+	HashMetaPageData metadata;
+}	xl_hash_metadata;
+
+/*
+ * This is what we need to know about a HASH index create.
+ *
+ * Backup block 0: metapage
+ */
+typedef struct xl_hash_createidx
+{
+	double		num_tuples;
+	RegProcedure procid;
+	uint16		ffactor;
+}	xl_hash_createidx;
+#define SizeOfHashCreateIdx (offsetof(xl_hash_createidx, ffactor) + sizeof(uint16))
+
+/*
+ * This is what we need to know about simple (without split) insert.
+ *
+ * This data record is used for XLOG_HASH_INSERT
+ *
+ * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 1: metapage (xl_hash_metadata)
+ */
+typedef struct xl_hash_insert
+{
+	OffsetNumber offnum;
+}	xl_hash_insert;
+
+#define SizeOfHashInsert	(offsetof(xl_hash_insert, offnum) + sizeof(OffsetNumber))
+
+/*
+ * This is what we need to know about addition of overflow page.
+ *
+ * This data record is used for XLOG_HASH_ADD_OVFL_PAGE
+ *
+ * Backup Blk 0: newly allocated overflow page
+ * Backup Blk 1: page before new overflow page in the bucket chain
+ * Backup Blk 2: bitmap page
+ * Backup Blk 3: new bitmap page
+ * Backup Blk 4: metapage
+ */
+typedef struct xl_hash_addovflpage
+{
+	uint16		bmsize;
+	bool		bmpage_found;
+}	xl_hash_addovflpage;
+
+#define SizeOfHashAddOvflPage	\
+	(offsetof(xl_hash_addovflpage, bmpage_found) + sizeof(bool))
+
+/*
+ * This is what we need to know about allocating a page for split.
+ *
+ * This data record is used for XLOG_HASH_SPLIT_ALLOCATE_PAGE
+ *
+ * Backup Blk 0: page for old bucket
+ * Backup Blk 1: page for new bucket
+ * Backup Blk 2: metapage
+ */
+typedef struct xl_hash_split_allocpage
+{
+	uint32		new_bucket;
+	uint16		old_bucket_flag;
+	uint16		new_bucket_flag;
+	uint8		flags;
+}	xl_hash_split_allocpage;
+
+#define SizeOfHashSplitAllocPage	\
+	(offsetof(xl_hash_split_allocpage, flags) + sizeof(uint8))
+
+/*
+ * This is what we need to know about completing the split operation.
+ *
+ * This data record is used for XLOG_HASH_SPLIT_COMPLETE
+ *
+ * Backup Blk 0: page for old bucket
+ * Backup Blk 1: page for new bucket
+ */
+typedef struct xl_hash_split_complete
+{
+	uint16		old_bucket_flag;
+	uint16		new_bucket_flag;
+}	xl_hash_split_complete;
+
+#define SizeOfHashSplitComplete \
+	(offsetof(xl_hash_split_complete, new_bucket_flag) + sizeof(uint16))
+
+/*
+ * This is what we need to know about the move-page-contents operation
+ * performed during a squeeze operation.
+ *
+ * This data record is used for XLOG_HASH_MOVE_PAGE_CONTENTS
+ *
+ * Backup Blk 0: bucket page
+ * Backup Blk 1: page containing moved tuples
+ * Backup Blk 2: page from which tuples will be removed
+ */
+typedef struct xl_hash_move_page_contents
+{
+	uint16		ntups;
+	bool		is_prim_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is same as
+												 * primary bucket page */
+}	xl_hash_move_page_contents;
+
+#define SizeOfHashMovePageContents	\
+	(offsetof(xl_hash_move_page_contents, is_prim_bucket_same_wrt) + sizeof(bool))
+
+/*
+ * This is what we need to know about the squeeze page operation.
+ *
+ * This data record is used for XLOG_HASH_SQUEEZE_PAGE
+ *
+ * Backup Blk 0: page containing tuples moved from freed overflow page
+ * Backup Blk 1: freed overflow page
+ * Backup Blk 2: page previous to the freed overflow page
+ * Backup Blk 3: page next to the freed overflow page
+ * Backup Blk 4: bitmap page containing info of freed overflow page
+ * Backup Blk 5: meta page
+ */
+typedef struct xl_hash_squeeze_page
+{
+	BlockNumber prevblkno;
+	BlockNumber nextblkno;
+	uint16		ntups;
+	bool		is_prim_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is same as
+												 * primary bucket page */
+	bool		is_prev_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is the
+												 * page previous to the freed
+												 * overflow page */
+}	xl_hash_squeeze_page;
+
+#define SizeOfHashSqueezePage	\
+	(offsetof(xl_hash_squeeze_page, is_prev_bucket_same_wrt) + sizeof(bool))
+
+/*
+ * This is what we need to know about the deletion of index tuples from a page.
+ *
+ * This data record is used for XLOG_HASH_DELETE
+ *
+ * Backup Blk 0: primary bucket page
+ * Backup Blk 1: page from which tuples are deleted
+ */
+typedef struct xl_hash_delete
+{
+	bool		is_primary_bucket_page; /* TRUE if the operation is for
+										 * primary bucket page */
+}	xl_hash_delete;
+
+#define SizeOfHashDelete	(offsetof(xl_hash_delete, is_primary_bucket_page) + sizeof(bool))
+
+/*
+ * This is what we need for metapage update operation.
+ *
+ * This data record is used for XLOG_HASH_UPDATE_META_PAGE
+ *
+ * Backup Blk 0: meta page
+ */
+typedef struct xl_hash_update_meta_page
+{
+	double		ntuples;
+}	xl_hash_update_meta_page;
+
+#define SizeOfHashUpdateMetaPage	\
+	(offsetof(xl_hash_update_meta_page, ntuples) + sizeof(double))
+
+/*
+ * This is what we need to initialize metapage.
+ *
+ * This data record is used for XLOG_HASH_INIT_META_PAGE
+ *
+ * Backup Blk 0: meta page
+ */
+typedef struct xl_hash_init_meta_page
+{
+	double		num_tuples;
+	RegProcedure procid;
+	uint16		ffactor;
+}	xl_hash_init_meta_page;
+
+#define SizeOfHashInitMetaPage		\
+	(offsetof(xl_hash_init_meta_page, ffactor) + sizeof(uint16))
+
+/*
+ * This is what we need to initialize bitmap page.
+ *
+ * This data record is used for XLOG_HASH_INIT_BITMAP_PAGE
+ *
+ * Backup Blk 0: bitmap page
+ * Backup Blk 1: meta page
+ */
+typedef struct xl_hash_init_bitmap_page
+{
+	uint16		bmsize;
+}	xl_hash_init_bitmap_page;
+
+#define SizeOfHashInitBitmapPage	\
+	(offsetof(xl_hash_init_bitmap_page, bmsize) + sizeof(uint16))
 
 extern void hash_redo(XLogReaderState *record);
 extern void hash_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index ad4ab5f..6ea46ef 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -425,6 +425,7 @@ extern Page PageGetTempPageCopySpecial(Page page);
 extern void PageRestoreTempPage(Page tempPage, Page oldPage);
 extern void PageRepairFragmentation(Page page);
 extern Size PageGetFreeSpace(Page page);
+extern Size PageGetFreeSpaceForMulTups(Page page, int ntups);
 extern Size PageGetExactFreeSpace(Page page);
 extern Size PageGetHeapFreeSpace(Page page);
 extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index e663f9a..6b2f693 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -2335,13 +2335,9 @@ Options: fastupdate=on, gin_pending_list_limit=128
 -- HASH
 --
 CREATE INDEX hash_i4_index ON hash_i4_heap USING hash (random int4_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_name_index ON hash_name_heap USING hash (random name_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_txt_index ON hash_txt_heap USING hash (random text_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_f8_index ON hash_f8_heap USING hash (random float8_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE UNLOGGED TABLE unlogged_hash_table (id int4);
 CREATE INDEX unlogged_hash_index ON unlogged_hash_table USING hash (id int4_ops);
 DROP TABLE unlogged_hash_table;
@@ -2350,7 +2346,6 @@ DROP TABLE unlogged_hash_table;
 -- maintenance_work_mem setting and fillfactor:
 SET maintenance_work_mem = '1MB';
 CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 EXPLAIN (COSTS OFF)
 SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
                       QUERY PLAN                       
diff --git a/src/test/regress/expected/enum.out b/src/test/regress/expected/enum.out
index 514d1d0..0e60304 100644
--- a/src/test/regress/expected/enum.out
+++ b/src/test/regress/expected/enum.out
@@ -383,7 +383,6 @@ DROP INDEX enumtest_btree;
 -- Hash index / opclass with the = operator
 --
 CREATE INDEX enumtest_hash ON enumtest USING hash (col);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 SELECT * FROM enumtest WHERE col = 'orange';
   col   
 --------
diff --git a/src/test/regress/expected/hash_index.out b/src/test/regress/expected/hash_index.out
index f8b9f02..0a18efa 100644
--- a/src/test/regress/expected/hash_index.out
+++ b/src/test/regress/expected/hash_index.out
@@ -201,7 +201,6 @@ SELECT h.seqno AS f20000
 --
 CREATE TABLE hash_split_heap (keycol INT);
 CREATE INDEX hash_split_index on hash_split_heap USING HASH (keycol);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 INSERT INTO hash_split_heap SELECT 1 FROM generate_series(1, 70000) a;
 VACUUM FULL hash_split_heap;
 -- Let's do a backward scan.
@@ -230,5 +229,4 @@ DROP TABLE hash_temp_heap CASCADE;
 CREATE TABLE hash_heap_float4 (x float4, y int);
 INSERT INTO hash_heap_float4 VALUES (1.1,1);
 CREATE INDEX hash_idx ON hash_heap_float4 USING hash (x);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 DROP TABLE hash_heap_float4 CASCADE;
diff --git a/src/test/regress/expected/macaddr.out b/src/test/regress/expected/macaddr.out
index e84ff5f..151f9ce 100644
--- a/src/test/regress/expected/macaddr.out
+++ b/src/test/regress/expected/macaddr.out
@@ -41,7 +41,6 @@ SELECT * FROM macaddr_data;
 
 CREATE INDEX macaddr_data_btree ON macaddr_data USING btree (b);
 CREATE INDEX macaddr_data_hash ON macaddr_data USING hash (b);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 SELECT a, b, trunc(b) FROM macaddr_data ORDER BY 2, 1;
  a  |         b         |       trunc       
 ----+-------------------+-------------------
diff --git a/src/test/regress/expected/replica_identity.out b/src/test/regress/expected/replica_identity.out
index 1a04ec5..f0bb253 100644
--- a/src/test/regress/expected/replica_identity.out
+++ b/src/test/regress/expected/replica_identity.out
@@ -12,7 +12,6 @@ CREATE UNIQUE INDEX test_replica_identity_keyab_key ON test_replica_identity (ke
 CREATE UNIQUE INDEX test_replica_identity_oid_idx ON test_replica_identity (oid);
 CREATE UNIQUE INDEX test_replica_identity_nonkey ON test_replica_identity (keya, nonkey);
 CREATE INDEX test_replica_identity_hash ON test_replica_identity USING hash (nonkey);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE UNIQUE INDEX test_replica_identity_expr ON test_replica_identity (keya, keyb, (3));
 CREATE UNIQUE INDEX test_replica_identity_partial ON test_replica_identity (keya, keyb) WHERE keyb != '3';
 -- default is 'd'/DEFAULT for user created tables
diff --git a/src/test/regress/expected/uuid.out b/src/test/regress/expected/uuid.out
index 59cb1e0..d907519 100644
--- a/src/test/regress/expected/uuid.out
+++ b/src/test/regress/expected/uuid.out
@@ -114,7 +114,6 @@ SELECT COUNT(*) FROM guid1 WHERE guid_field >= '22222222-2222-2222-2222-22222222
 -- btree and hash index creation test
 CREATE INDEX guid1_btree ON guid1 USING BTREE (guid_field);
 CREATE INDEX guid1_hash  ON guid1 USING HASH  (guid_field);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 -- unique index test
 CREATE UNIQUE INDEX guid1_unique_BTREE ON guid1 USING BTREE (guid_field);
 -- should fail
#66Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: Amit Kapila (#65)
Re: Write Ahead Logging for Hash Indexes

On 12/27/2016 01:58 AM, Amit Kapila wrote:

After recent commits 7819ba1e and 25216c98, this patch requires a
rebase. Attached is the rebased patch.

This needs a rebase after commit e898437.

Best regards,
Jesper


#67Amit Kapila
amit.kapila16@gmail.com
In reply to: Jesper Pedersen (#66)
1 attachment(s)
Re: Write Ahead Logging for Hash Indexes

On Fri, Jan 13, 2017 at 1:04 AM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:

On 12/27/2016 01:58 AM, Amit Kapila wrote:

After recent commits 7819ba1e and 25216c98, this patch requires a
rebase. Attached is the rebased patch.

This needs a rebase after commit e898437.

Attached find the rebased patch.

Thanks!

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

wal_hash_index_v8.patch (application/octet-stream)
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 6eaed1e..dd6e851 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1536,19 +1536,6 @@ archive_command = 'local_backup_script.sh "%p" "%f"'
   <itemizedlist>
    <listitem>
     <para>
-     Operations on hash indexes are not presently WAL-logged, so
-     replay will not update these indexes.  This will mean that any new inserts
-     will be ignored by the index, updated rows will apparently disappear and
-     deleted rows will still retain pointers. In other words, if you modify a
-     table with a hash index on it then you will get incorrect query results
-     on a standby server.  When recovery completes it is recommended that you
-     manually <xref linkend="sql-reindex">
-     each such index after completing a recovery operation.
-    </para>
-   </listitem>
-
-   <listitem>
-    <para>
      If a <xref linkend="sql-createdatabase">
      command is executed while a base backup is being taken, and then
      the template database that the <command>CREATE DATABASE</> copied
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 30dd54c..e8d8dc4 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2152,10 +2152,9 @@ include_dir 'conf.d'
          has materialized a result set, no error will be generated even if the
          underlying rows in the referenced table have been vacuumed away.
          Some tables cannot safely be vacuumed early, and so will not be
-         affected by this setting.  Examples include system catalogs and any
-         table which has a hash index.  For such tables this setting will
-         neither reduce bloat nor create a possibility of a <literal>snapshot
-         too old</> error on scanning.
+         affected by this setting.  Examples include system catalogs.  For
+         such tables this setting will neither reduce bloat nor create a
+         possibility of a <literal>snapshot too old</> error on scanning.
         </para>
        </listitem>
       </varlistentry>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index a1a9532..964af84 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2353,12 +2353,6 @@ LOG:  database system is ready to accept read only connections
   <itemizedlist>
    <listitem>
     <para>
-     Operations on hash indexes are not presently WAL-logged, so
-     replay will not update these indexes.
-    </para>
-   </listitem>
-   <listitem>
-    <para>
      Full knowledge of running transactions is required before snapshots
      can be taken. Transactions that use large numbers of subtransactions
      (currently greater than 64) will delay the start of read only
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 271c135..e40750e 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -193,18 +193,6 @@ CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable>
 </synopsis>
   </para>
 
-  <caution>
-   <para>
-    Hash index operations are not presently WAL-logged,
-    so hash indexes might need to be rebuilt with <command>REINDEX</>
-    after a database crash if there were unwritten changes.
-    Also, changes to hash indexes are not replicated over streaming or
-    file-based replication after the initial base backup, so they
-    give wrong answers to queries that subsequently use them.
-    For these reasons, hash index use is presently discouraged.
-   </para>
-  </caution>
-
   <para>
    <indexterm>
     <primary>index</primary>
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index fcb7a60..7163b03 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -510,19 +510,6 @@ Indexes:
    they can be useful.
   </para>
 
-  <caution>
-   <para>
-    Hash index operations are not presently WAL-logged,
-    so hash indexes might need to be rebuilt with <command>REINDEX</>
-    after a database crash if there were unwritten changes.
-    Also, changes to hash indexes are not replicated over streaming or
-    file-based replication after the initial base backup, so they
-    give wrong answers to queries that subsequently use them.
-    Hash indexes are also not properly restored during point-in-time
-    recovery.  For these reasons, hash index use is presently discouraged.
-   </para>
-  </caution>
-
   <para>
    Currently, only the B-tree, GiST, GIN, and BRIN index methods support
    multicolumn indexes. Up to 32 fields can be specified by default.
diff --git a/src/backend/access/hash/Makefile b/src/backend/access/hash/Makefile
index e2e7e91..b154569 100644
--- a/src/backend/access/hash/Makefile
+++ b/src/backend/access/hash/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = hash.o hashfunc.o hashinsert.o hashovfl.o hashpage.o hashsearch.o \
-       hashsort.o hashutil.o hashvalidate.o
+       hashsort.o hashutil.o hashvalidate.o hash_xlog.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 01ea115..06ef477 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -248,7 +248,6 @@ The insertion algorithm is rather similar:
          split happened)
 		take the buffer content lock on bucket page in exclusive mode
 		retake meta page buffer content lock in shared mode
-	release pin on metapage
 -- (so far same as reader, except for acquisition of buffer content lock in
 	exclusive mode on primary bucket page)
 	if the bucket-being-split flag is set for a bucket and pin count on it is
@@ -263,13 +262,17 @@ The insertion algorithm is rather similar:
 	if current page is full, release lock but not pin, read/exclusive-lock
      next page; repeat as needed
 	>> see below if no space in any page of bucket
+	take buffer content lock in exclusive mode on metapage
 	insert tuple at appropriate place in page
-	mark current page dirty and release buffer content lock and pin
-	if the current page is not a bucket page, release the pin on bucket page
-	pin meta page and take buffer content lock in exclusive mode
+	mark current page dirty
 	increment tuple count, decide if split needed
-	mark meta page dirty and release buffer content lock and pin
-	done if no split needed, else enter Split algorithm below
+	mark meta page dirty
+	write WAL for insertion of tuple
+	release the buffer content lock on metapage
+	release buffer content lock on current page
+	if current page is not a bucket page, release the pin on bucket page
+	if split is needed, enter Split algorithm below
+	release the pin on metapage
 
 To speed searches, the index entries within any individual index page are
 kept sorted by hash code; the insertion code must take care to insert new
@@ -304,12 +307,17 @@ existing bucket in two, thereby lowering the fill ratio:
        try to finish the split and the cleanup work
        if that succeeds, start over; if it fails, give up
 	mark the old and new buckets indicating split is in progress
+	mark both old and new buckets as dirty
+	write WAL for allocation of new page for split
 	copy the tuples that belongs to new bucket from old bucket, marking
      them as moved-by-split
+	write WAL record for moving tuples to new page once the new page is full
+	or all the pages of the old bucket have been processed
 	release lock but not pin for primary bucket page of old bucket,
 	 read/shared-lock next page; repeat as needed
 	clear the bucket-being-split and bucket-being-populated flags
 	mark the old bucket indicating split-cleanup
+	write WAL for changing the flags on both old and new buckets
 
 The split operation's attempt to acquire cleanup-lock on the old bucket number
 could fail if another process holds any lock or pin on it.  We do not want to
@@ -345,6 +353,8 @@ The fourth operation is garbage collection (bulk deletion):
 		acquire cleanup lock on primary bucket page
 		loop:
 			scan and remove tuples
+			mark the target page dirty
+			write WAL for deleting tuples from target page
 			if this is the last bucket page, break out of loop
 			pin and x-lock next page
 			release prior lock and pin (except keep pin on primary bucket page)
@@ -359,7 +369,8 @@ The fourth operation is garbage collection (bulk deletion):
 	check if number of buckets changed
 	if so, release content lock and pin and return to for-each-bucket loop
 	else update metapage tuple count
-	mark meta page dirty and release buffer content lock and pin
+	mark meta page dirty and write WAL for update of metapage
+	release buffer content lock and pin
 
 Note that this is designed to allow concurrent splits and scans.  If a split
 occurs, tuples relocated into the new bucket will be visited twice by the
@@ -401,18 +412,16 @@ Obtaining an overflow page:
 	search for a free page (zero bit in bitmap)
 	if found:
 		set bit in bitmap
-		mark bitmap page dirty and release content lock
+		mark bitmap page dirty
 		take metapage buffer content lock in exclusive mode
 		if first-free-bit value did not change,
 			update it and mark meta page dirty
-		release meta page buffer content lock
-		return page number
 	else (not found):
 	release bitmap page buffer content lock
 	loop back to try next bitmap page, if any
 -- here when we have checked all bitmap pages; we hold meta excl. lock
 	extend index to add another overflow page; update meta information
-	mark meta page dirty and release buffer content lock
+	mark meta page dirty
 	return page number
 
 It is slightly annoying to release and reacquire the metapage lock
@@ -432,12 +441,15 @@ like this:
 
 	-- having determined that no space is free in the target bucket:
 	remember last page of bucket, drop write lock on it
-	call free-page-acquire routine
 	re-write-lock last page of bucket
 	if it is not last anymore, step to the last page
-	update (former) last page to point to new page
+	execute free-page-acquire (Obtaining an overflow page) mechanism described above
+	update (former) last page to point to the new page and mark the buffer dirty
 	write-lock and initialize new page, with back link to former last page
-	write and release former last page
+	write WAL for addition of overflow page
+	release the locks on meta page and bitmap page acquired in free-page-acquire algorithm
+	release the lock on former last page
+	release the lock on new overflow page
 	insert tuple into new page
 	-- etc.
 
@@ -464,12 +476,14 @@ accessors of pages in the bucket.  The algorithm is:
 	determine which bitmap page contains the free space bit for page
 	release meta page buffer content lock
 	pin bitmap page and take buffer content lock in exclusive mode
-	update bitmap bit
-	mark bitmap page dirty and release buffer content lock and pin
-	if page number is less than what we saw as first-free-bit in meta:
 	retake meta page buffer content lock in exclusive mode
+	move (insert) tuples that belong to the overflow page being freed
+	update bitmap bit
+	mark bitmap page dirty
 	if page number is still less than first-free-bit,
 		update first-free-bit field and mark meta page dirty
+	write WAL for delinking overflow page operation
+	release buffer content lock and pin
 	release meta page buffer content lock and pin
 
 We have to do it this way because we must clear the bitmap bit before
@@ -480,8 +494,101 @@ page acquirer will scan more bitmap bits than he needs to.  What must be
 avoided is having first-free-bit greater than the actual first free bit,
 because then that free page would never be found by searchers.
 
-All the freespace operations should be called while holding no buffer
-locks.  Since they need no lmgr locks, deadlock is not possible.
+The reason for moving tuples from the overflow page while delinking the latter
+is to make that a single atomic operation.  Not doing so could lead to spurious
+reads on standby; basically, the user might see the same tuple twice.
+
+
+WAL Considerations
+------------------
+
+Hash index operations like create index, insert, delete, bucket split,
+allocate overflow page, and squeeze (the operation in which overflow pages are
+freed) do not in themselves guarantee hash index consistency after a crash.
+To provide robustness, we write WAL for each of these operations, which gets
+replayed during crash recovery.
+
+Multiple WAL records are written for the create index operation: first one for
+initializing the metapage, followed by one for each new bucket created during
+the operation, followed by one for initializing the bitmap page.  If the
+system crashes after any operation, the whole operation is rolled back.  We
+have considered writing a single WAL record for the whole operation, but for
+that we need to limit the number of initial buckets that can be created during
+the operation.  As we can log only a fixed number of pages, XLR_MAX_BLOCK_ID
+(32), with the current XLog machinery, it is better to write multiple WAL
+records for this operation.  The downside of restricting the number of buckets
+is that we would need to perform the split operation if the number of tuples
+is more than what can be accommodated in the initial set of buckets, and it is
+not unusual to have a large number of tuples during the create index
+operation.
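+
+For the metapage record, the buffer is registered with REGBUF_WILL_INIT so
+that replay can recreate the page from scratch with XLogInitBufferForRedo()
+rather than relying on an earlier page image.  Roughly (a sketch only; the
+actual code in _hash_init() may differ in details such as where the logged
+values come from):
+
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_init_meta_page xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.num_tuples = num_tuples;
+		xlrec.procid = metap->hashm_procid;
+		xlrec.ffactor = metap->hashm_ffactor;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+		XLogRegisterBuffer(0, metabuf, REGBUF_WILL_INIT);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INIT_META_PAGE);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}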
+
+Ordinary item insertions (that don't force a page split or need a new overflow
+page) are single WAL entries.  They touch a single bucket page and the
+metapage; the metapage is updated during replay just as it is updated during
+the original operation.
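+
+In code terms (condensed from _hash_doinsert() in hashinsert.c), the record
+registers the page receiving the tuple as block 0, with the tuple itself as
+buffer data, and the metapage as block 1; replay re-adds the tuple at the
+logged offset and increments hashm_ntuples:
+
+	xlrec.offnum = itup_off;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec, SizeOfHashInsert);
+
+	XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+	XLogRegisterBufData(0, (char *) itup, IndexTupleDSize(*itup));
+
+	XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);
+
+	recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INSERT);
+
+	PageSetLSN(BufferGetPage(buf), recptr);
+	PageSetLSN(BufferGetPage(metabuf), recptr);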
+
+An insertion that causes an addition of an overflow page is logged as a single
+WAL entry, preceded by a WAL entry for the new overflow page required to
+insert a tuple.  There is a corner case where, by the time we try to use the
+newly allocated overflow page, it has already been filled by concurrent
+insertions; in such a case, a new overflow page will be allocated and a
+separate WAL entry will be made for it.
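+
+Since the overflow page handed back by _hash_addovflpage() may therefore not
+be empty, the inserter simply rechecks free space and, if necessary, asks for
+yet another overflow page.  A minimal sketch of that caller-side loop (it
+omits the pin handling for the primary bucket page that the real
+_hash_doinsert() performs):
+
+	while (PageGetFreeSpace(page) < itemsz)
+	{
+		BlockNumber nextblkno = pageopaque->hasho_nextblkno;
+
+		if (BlockNumberIsValid(nextblkno))
+		{
+			/* the chain has another page; advance to it */
+			_hash_relbuf(rel, buf);
+			buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
+		}
+		else
+		{
+			/* chain exhausted; extend it (the new page may fill up again) */
+			/* retain_pin: true only for the primary bucket page */
+			buf = _hash_addovflpage(rel, metabuf, buf, retain_pin);
+		}
+
+		page = BufferGetPage(buf);
+		pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+	}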
+
+An insertion that causes a bucket split is logged as a single WAL entry,
+followed by a WAL entry for allocating a new bucket, followed by a WAL entry
+for each overflow bucket page in the new bucket to which the tuples are moved
+from the old bucket, followed by a WAL entry to indicate that the split is
+complete for both old and new buckets.
+
+A split operation that requires overflow pages to complete will need to write
+a WAL record for each new allocation of an overflow page.
+
+As splitting involves multiple atomic actions, it's possible that the system
+crashes while moving tuples from the bucket pages of the old bucket to the new
+bucket.  In such a case, after recovery, the old and new buckets will be
+marked with the bucket-being-split and bucket-being-populated flags
+respectively, which indicate that a split is in progress for those buckets.
+The reader algorithm works correctly, as it will scan both the old and new
+buckets when a split is in progress, as explained in the reader algorithm
+section above.
+
+We finish the split at the next insert or split operation on the old bucket,
+as explained in the insert and split algorithms above.  It could be done during
+searches, too, but it seems best not to put any extra updates in what would
+otherwise be a read-only operation (updating is not possible in hot standby
+mode anyway).  It would seem natural to complete the split in VACUUM, but since
+splitting a bucket might require allocating a new page, it might fail if you
+run out of disk space.  That would be bad during VACUUM - the reason for
+running VACUUM in the first place might be that you run out of disk space,
+and now VACUUM won't finish because you're out of disk space.  In contrast,
+an insertion can require enlarging the physical file anyway.
+
+Deletion of tuples from a bucket is performed for two reasons: one is to
+remove dead tuples, and the other is to remove tuples that were moved by a
+split.  A WAL entry is made for each bucket page from which tuples are
+removed, followed by a WAL entry to clear the garbage flag if the tuples moved
+by a split are removed.  Another separate WAL entry is made for updating the
+metapage if the deletion is performed for removing the dead tuples by vacuum.
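+
+Note that the delete record always registers the page being cleaned; when
+that page is an overflow page, the bucket's primary page is additionally
+registered (unmodified) purely so that replay can take a cleanup lock on it
+and keep concurrent standby scans out of the bucket.  Condensed from
+hash_xlog_delete() in hash_xlog.c:
+
+	if (xldata->is_primary_bucket_page)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true,
+											   &deletebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		/* take a cleanup lock on the primary bucket page first */
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+		bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bucketbuf))
+			LockBufferForCleanup(bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &deletebuf);
+	}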
+
+As deletion involves multiple atomic operations, it is quite possible that the
+system crashes (a) after removing tuples from only some of the bucket pages,
+(b) before clearing the garbage flag, or (c) before updating the metapage.  If
+the system crashes before completing (b), it will again try to clean the
+bucket during the next vacuum or insert after recovery, which can have some
+performance impact, but it will work fine.  If the system crashes before
+completing (c), after recovery there could be some additional splits until the
+next vacuum updates the metapage, but the other operations like insert, delete
+and scan will work correctly.  We could fix this problem by actually updating
+the metapage based on the delete operation during replay, but it is not clear
+that this is worth the complication.
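+
+If we ever decide to do that, the delete record would also have to carry the
+number of removed tuples and register the metapage, so that redo could keep
+hashm_ntuples in step.  Purely as an illustration (nothing below exists in
+the current patch; the ndeleted field and the extra block are hypothetical):
+
+	/* hypothetical extra step at the end of hash_xlog_delete() */
+	if (XLogReadBufferForRedo(record, 2, &metabuf) == BLK_NEEDS_REDO)
+	{
+		Page		metapage = BufferGetPage(metabuf);
+		HashMetaPage metap = HashPageGetMeta(metapage);
+
+		metap->hashm_ntuples -= xldata->ndeleted;	/* ndeleted: hypothetical */
+
+		PageSetLSN(metapage, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);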
+
+The squeeze operation moves tuples from a bucket page later in the chain to a
+bucket page earlier in the chain, and writes a WAL record when either the page
+to which it is writing tuples becomes full or the page from which it is
+removing tuples becomes empty.
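+
+For reference, the squeeze record written by _hash_freeovflpage() (see
+hashovfl.c) can reference up to seven blocks:
+
+	/*
+	 * Block layout of an XLOG_HASH_SQUEEZE_PAGE record:
+	 *	0 - primary bucket page, registered only when it differs from the
+	 *		write page (so replay can take a cleanup lock on it)
+	 *	1 - write page receiving the moved tuples (offsets and tuples are
+	 *		attached as buffer data)
+	 *	2 - overflow page being freed (reinitialized during replay)
+	 *	3 - previous page in the chain, unless it is the write page itself
+	 *	4 - next page in the chain, if any
+	 *	5 - bitmap page (the bit number to clear is attached as data)
+	 *	6 - metapage, only when hashm_firstfree changes
+	 */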
+
+As the squeeze operation involves multiple atomic operations, it is quite
+possible that the system crashes before completing the operation on the
+entire bucket.  After recovery, the operations will work correctly, but the
+index will remain bloated, which can impact the performance of read and
+insert operations until the next vacuum squeezes the bucket completely.
 
 
 Other Notes
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0cbf6b0..f186e52 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -27,6 +27,7 @@
 #include "optimizer/plancat.h"
 #include "utils/index_selfuncs.h"
 #include "utils/rel.h"
+#include "miscadmin.h"
 
 
 /* Working state for hashbuild and its callback */
@@ -115,7 +116,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	estimate_rel_size(heap, NULL, &relpages, &reltuples, &allvisfrac);
 
 	/* Initialize the hash index metadata page and initial buckets */
-	num_buckets = _hash_metapinit(index, reltuples, MAIN_FORKNUM);
+	num_buckets = _hash_init(index, reltuples, MAIN_FORKNUM);
 
 	/*
 	 * If we just insert the tuples into the index in scan order, then
@@ -177,7 +178,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 void
 hashbuildempty(Relation index)
 {
-	_hash_metapinit(index, 0, INIT_FORKNUM);
+	_hash_init(index, 0, INIT_FORKNUM);
 }
 
 /*
@@ -297,6 +298,11 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		buf = so->hashso_curbuf;
 		Assert(BufferIsValid(buf));
 		page = BufferGetPage(buf);
+
+		/*
+		 * We don't need to test for an old snapshot here, as the current
+		 * buffer is pinned, so vacuum can't clean the page.
+		 */
 		maxoffnum = PageGetMaxOffsetNumber(page);
 		for (offnum = ItemPointerGetOffsetNumber(current);
 			 offnum <= maxoffnum;
@@ -610,6 +616,7 @@ loop_top:
 	}
 
 	/* Okay, we're really done.  Update tuple count in metapage. */
+	START_CRIT_SECTION();
 
 	if (orig_maxbucket == metap->hashm_maxbucket &&
 		orig_ntuples == metap->hashm_ntuples)
@@ -636,6 +643,26 @@ loop_top:
 	}
 
 	MarkBufferDirty(metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_update_meta_page xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.ntuples = metap->hashm_ntuples;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashUpdateMetaPage);
+
+		XLogRegisterBuffer(0, metabuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_UPDATE_META_PAGE);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
 	_hash_relbuf(rel, metabuf);
 
 	/* return statistics */
@@ -803,9 +830,40 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			/* No ereport(ERROR) until changes are logged */
+			START_CRIT_SECTION();
+
 			PageIndexMultiDelete(page, deletable, ndeletable);
 			bucket_dirty = true;
 			MarkBufferDirty(buf);
+
+			/* XLOG stuff */
+			if (RelationNeedsWAL(rel))
+			{
+				xl_hash_delete xlrec;
+				XLogRecPtr	recptr;
+
+				xlrec.is_primary_bucket_page = (buf == bucket_buf) ? true : false;
+
+				XLogBeginInsert();
+				XLogRegisterData((char *) &xlrec, SizeOfHashDelete);
+
+				/*
+				 * bucket buffer needs to be registered to ensure that we can
+				 * acquire a cleanup lock on it during replay.
+				 */
+				if (!xlrec.is_primary_bucket_page)
+					XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+				XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
+				XLogRegisterBufData(1, (char *) deletable,
+									ndeletable * sizeof(OffsetNumber));
+
+				recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_DELETE);
+				PageSetLSN(BufferGetPage(buf), recptr);
+			}
+
+			END_CRIT_SECTION();
 		}
 
 		/* bail out if there are no more pages to scan. */
@@ -853,8 +911,25 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		/* No ereport(ERROR) until changes are logged */
+		START_CRIT_SECTION();
+
 		bucket_opaque->hasho_flag &= ~LH_BUCKET_NEEDS_SPLIT_CLEANUP;
 		MarkBufferDirty(bucket_buf);
+
+		/* XLOG stuff */
+		if (RelationNeedsWAL(rel))
+		{
+			XLogRecPtr	recptr;
+
+			XLogBeginInsert();
+			XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+			recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_CLEANUP);
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
 	}
 
 	/*
@@ -868,9 +943,3 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	else
 		LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
 }
-
-void
-hash_redo(XLogReaderState *record)
-{
-	elog(PANIC, "hash_redo: unimplemented");
-}
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
new file mode 100644
index 0000000..41429a7
--- /dev/null
+++ b/src/backend/access/hash/hash_xlog.c
@@ -0,0 +1,970 @@
+/*-------------------------------------------------------------------------
+ *
+ * hash_xlog.c
+ *	  WAL replay logic for hash index.
+ *
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/hash/hash_xlog.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/hash_xlog.h"
+#include "access/xlogutils.h"
+
+/*
+ * replay a hash index meta page
+ */
+static void
+hash_xlog_init_meta_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Page		page;
+	Buffer		metabuf;
+
+	xl_hash_init_meta_page *xlrec = (xl_hash_init_meta_page *) XLogRecGetData(record);
+
+	/* create the index's metapage */
+	metabuf = XLogInitBufferForRedo(record, 0);
+	Assert(BufferIsValid(metabuf));
+	_hash_init_metabuffer(metabuf, xlrec->num_tuples, xlrec->procid,
+						  xlrec->ffactor, true);
+	page = (Page) BufferGetPage(metabuf);
+	PageSetLSN(page, lsn);
+	MarkBufferDirty(metabuf);
+	/* all done */
+	UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay a hash index bitmap page
+ */
+static void
+hash_xlog_init_bitmap_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		bitmapbuf;
+	Buffer		metabuf;
+	Page		page;
+	HashMetaPage metap;
+	uint32		num_buckets;
+
+	xl_hash_init_bitmap_page *xlrec = (xl_hash_init_bitmap_page *) XLogRecGetData(record);
+
+	/*
+	 * Initialize bitmap page
+	 */
+	bitmapbuf = XLogInitBufferForRedo(record, 0);
+	_hash_initbitmapbuffer(bitmapbuf, xlrec->bmsize, true);
+	PageSetLSN(BufferGetPage(bitmapbuf), lsn);
+	MarkBufferDirty(bitmapbuf);
+	UnlockReleaseBuffer(bitmapbuf);
+
+	/* add the new bitmap page to the metapage's list of bitmaps */
+	if (XLogReadBufferForRedo(record, 1, &metabuf) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Note: in normal operation, we'd update the metapage while still
+		 * holding lock on the page we inserted into.  But during replay it's
+		 * not necessary to hold that lock, since no other index updates can
+		 * be happening concurrently.
+		 */
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+
+		num_buckets = metap->hashm_maxbucket + 1;
+		metap->hashm_mapp[metap->hashm_nmaps] = num_buckets + 1;
+		metap->hashm_nmaps++;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay a hash index insert without split
+ */
+static void
+hash_xlog_insert(XLogReaderState *record)
+{
+	HashMetaPage metap;
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_insert *xlrec = (xl_hash_insert *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		Size		datalen;
+		char	   *datapos = XLogRecGetBlockData(record, 0, &datalen);
+
+		page = BufferGetPage(buffer);
+
+		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+						false, false) == InvalidOffsetNumber)
			elog(PANIC, "hash_xlog_insert: failed to add item");
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+
+	if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Note: in normal operation, we'd update the metapage while still
+		 * holding lock on the page we inserted into.  But during replay it's
+		 * not necessary to hold that lock, since no other index updates can
+		 * be happening concurrently.
+		 */
+		page = BufferGetPage(buffer);
+		metap = HashPageGetMeta(page);
+		metap->hashm_ntuples += 1;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
+
+/*
+ * replay addition of overflow page for hash index
+ */
+static void
+hash_xlog_addovflpage(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_addovflpage *xlrec = (xl_hash_addovflpage *) XLogRecGetData(record);
+	Buffer		leftbuf;
+	Buffer		ovflbuf;
+	Buffer		metabuf;
+	BlockNumber leftblk;
+	BlockNumber rightblk;
+	BlockNumber newmapblk = InvalidBlockNumber;
+	Page		ovflpage;
+	HashPageOpaque ovflopaque;
+	uint32	   *num_bucket;
+	char	   *data;
+	Size datalen PG_USED_FOR_ASSERTS_ONLY;
+	bool		new_bmpage = false;
+
+	XLogRecGetBlockTag(record, 0, NULL, NULL, &rightblk);
+	XLogRecGetBlockTag(record, 1, NULL, NULL, &leftblk);
+
+	ovflbuf = XLogInitBufferForRedo(record, 0);
+	Assert(BufferIsValid(ovflbuf));
+
+	data = XLogRecGetBlockData(record, 0, &datalen);
+	num_bucket = (uint32 *) data;
+	Assert(datalen == sizeof(uint32));
+	_hash_initbuf(ovflbuf, *num_bucket, LH_OVERFLOW_PAGE, true);
+	/* update backlink */
+	ovflpage = BufferGetPage(ovflbuf);
+	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
+	ovflopaque->hasho_prevblkno = leftblk;
+
+	PageSetLSN(ovflpage, lsn);
+	MarkBufferDirty(ovflbuf);
+
+	if (XLogReadBufferForRedo(record, 1, &leftbuf) == BLK_NEEDS_REDO)
+	{
+		Page		leftpage;
+		HashPageOpaque leftopaque;
+
+		leftpage = BufferGetPage(leftbuf);
+		leftopaque = (HashPageOpaque) PageGetSpecialPointer(leftpage);
+		leftopaque->hasho_nextblkno = rightblk;
+
+		PageSetLSN(leftpage, lsn);
+		MarkBufferDirty(leftbuf);
+	}
+
+	/*
+	 * We need to release the locks only once both the prev pointer of the
+	 * overflow page and the next pointer of the left page are set; otherwise,
+	 * a concurrent scan might skip the overflow page.
+	 */
+	if (BufferIsValid(leftbuf))
+		UnlockReleaseBuffer(leftbuf);
+	UnlockReleaseBuffer(ovflbuf);
+
+	/*
+	 * Note: in normal operation, we'd update the bitmap and meta page while
+	 * still holding lock on the overflow pages we updated.  But during replay
+	 * it's not necessary to hold those locks, since no other index updates
+	 * can be happening concurrently.
+	 */
+	if (XLogRecHasBlockRef(record, 2))
+	{
+		Buffer		mapbuffer;
+
+		if (XLogReadBufferForRedo(record, 2, &mapbuffer) == BLK_NEEDS_REDO)
+		{
+			Page		mappage = (Page) BufferGetPage(mapbuffer);
+			uint32	   *freep = NULL;
+			char	   *data;
+			uint32	   *bitmap_page_bit;
+
+			freep = HashPageGetBitmap(mappage);
+
+			data = XLogRecGetBlockData(record, 2, &datalen);
+			bitmap_page_bit = (uint32 *) data;
+
+			SETBIT(freep, *bitmap_page_bit);
+
+			PageSetLSN(mappage, lsn);
+			MarkBufferDirty(mapbuffer);
+		}
+		if (BufferIsValid(mapbuffer))
+			UnlockReleaseBuffer(mapbuffer);
+	}
+
+	if (XLogRecHasBlockRef(record, 3))
+	{
+		Buffer		newmapbuf;
+
+		newmapbuf = XLogInitBufferForRedo(record, 3);
+
+		_hash_initbitmapbuffer(newmapbuf, xlrec->bmsize, true);
+
+		new_bmpage = true;
+		newmapblk = BufferGetBlockNumber(newmapbuf);
+
+		MarkBufferDirty(newmapbuf);
+		PageSetLSN(BufferGetPage(newmapbuf), lsn);
+
+		UnlockReleaseBuffer(newmapbuf);
+	}
+
+	if (XLogReadBufferForRedo(record, 4, &metabuf) == BLK_NEEDS_REDO)
+	{
+		HashMetaPage metap;
+		Page		page;
+		uint32	   *firstfree_ovflpage;
+
+		data = XLogRecGetBlockData(record, 4, &datalen);
+		firstfree_ovflpage = (uint32 *) data;
+
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+		metap->hashm_firstfree = *firstfree_ovflpage;
+
+		if (!xlrec->bmpage_found)
+		{
+			metap->hashm_spares[metap->hashm_ovflpoint]++;
+
+			if (new_bmpage)
+			{
+				Assert(BlockNumberIsValid(newmapblk));
+
+				metap->hashm_mapp[metap->hashm_nmaps] = newmapblk;
+				metap->hashm_nmaps++;
+				metap->hashm_spares[metap->hashm_ovflpoint]++;
+			}
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay allocation of page for split operation
+ */
+static void
+hash_xlog_split_allocpage(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_split_allocpage *xlrec = (xl_hash_split_allocpage *) XLogRecGetData(record);
+	Buffer		oldbuf;
+	Buffer		newbuf;
+	Buffer		metabuf;
+	Size datalen PG_USED_FOR_ASSERTS_ONLY;
+	char	   *data;
+	XLogRedoAction action;
+
+	/* replay the record for old bucket */
+	action = XLogReadBufferForRedo(record, 0, &oldbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		oldpage;
+		HashPageOpaque oldopaque;
+
+		oldpage = BufferGetPage(oldbuf);
+		oldopaque = (HashPageOpaque) PageGetSpecialPointer(oldpage);
+
+		oldopaque->hasho_flag = xlrec->old_bucket_flag;
+
+		PageSetLSN(oldpage, lsn);
+		MarkBufferDirty(oldbuf);
+	}
+
+	/*
+	 * There is no harm in releasing the lock on old bucket before new bucket
+	 * during replay as no other bucket splits can be happening concurrently.
+	 */
+	if (BufferIsValid(oldbuf))
+		UnlockReleaseBuffer(oldbuf);
+
+	/* replay the record for new bucket */
+	newbuf = XLogInitBufferForRedo(record, 1);
+	_hash_initbuf(newbuf, xlrec->new_bucket, xlrec->new_bucket_flag, true);
+	MarkBufferDirty(newbuf);
+	PageSetLSN(BufferGetPage(newbuf), lsn);
+
+	UnlockReleaseBuffer(newbuf);
+
+	/*
+	 * Note: in normal operation, we'd update the meta page while still
+	 * holding lock on the old and new bucket pages.  But during replay it's
+	 * not necessary to hold those locks, since no other bucket splits can be
+	 * happening concurrently.
+	 */
+
+	/* replay the record for metapage changes */
+	if (XLogReadBufferForRedo(record, 2, &metabuf) == BLK_NEEDS_REDO)
+	{
+		Page		page;
+		HashMetaPage metap;
+
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+		metap->hashm_maxbucket = xlrec->new_bucket;
+
+		data = XLogRecGetBlockData(record, 2, &datalen);
+
+		if (xlrec->flags & XLH_SPLIT_META_UPDATE_MASKS)
+		{
+			uint32		lowmask;
+			uint32	   *highmask;
+
+			/* extract low and high masks. */
+			memcpy(&lowmask, data, sizeof(uint32));
+			highmask = (uint32 *) ((char *) data + sizeof(uint32));
+
+			/* update metapage */
+			metap->hashm_lowmask = lowmask;
+			metap->hashm_highmask = *highmask;
+
+			data += sizeof(uint32) * 2;
+		}
+
+		if (xlrec->flags & XLH_SPLIT_META_UPDATE_SPLITPOINT)
+		{
+			uint32		ovflpoint;
+			uint32	   *ovflpages;
+
+			/* extract information of overflow pages. */
+			memcpy(&ovflpoint, data, sizeof(uint32));
+			ovflpages = (uint32 *) ((char *) data + sizeof(uint32));
+
+			/* update metapage */
+			metap->hashm_spares[ovflpoint] = *ovflpages;
+			metap->hashm_ovflpoint = ovflpoint;
+		}
+
+		MarkBufferDirty(metabuf);
+		PageSetLSN(BufferGetPage(metabuf), lsn);
+	}
+
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay of split operation
+ */
+static void
+hash_xlog_split_page(XLogReaderState *record)
+{
+	Buffer		buf;
+
+	if (XLogReadBufferForRedo(record, 0, &buf) != BLK_RESTORED)
+		elog(ERROR, "Hash split record did not contain a full-page image");
+
+	UnlockReleaseBuffer(buf);
+}
+
+/*
+ * replay completion of split operation
+ */
+static void
+hash_xlog_split_complete(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_split_complete *xlrec = (xl_hash_split_complete *) XLogRecGetData(record);
+	Buffer		oldbuf;
+	Buffer		newbuf;
+	XLogRedoAction action;
+
+	/* replay the record for old bucket */
+	action = XLogReadBufferForRedo(record, 0, &oldbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		oldpage;
+		HashPageOpaque oldopaque;
+
+		oldpage = BufferGetPage(oldbuf);
+		oldopaque = (HashPageOpaque) PageGetSpecialPointer(oldpage);
+
+		oldopaque->hasho_flag = xlrec->old_bucket_flag;
+
+		PageSetLSN(oldpage, lsn);
+		MarkBufferDirty(oldbuf);
+	}
+	if (BufferIsValid(oldbuf))
+		UnlockReleaseBuffer(oldbuf);
+
+	/* replay the record for new bucket */
+	action = XLogReadBufferForRedo(record, 1, &newbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		newpage;
+		HashPageOpaque nopaque;
+
+		newpage = BufferGetPage(newbuf);
+		nopaque = (HashPageOpaque) PageGetSpecialPointer(newpage);
+
+		nopaque->hasho_flag = xlrec->new_bucket_flag;
+
+		PageSetLSN(newpage, lsn);
+		MarkBufferDirty(newbuf);
+	}
+	if (BufferIsValid(newbuf))
+		UnlockReleaseBuffer(newbuf);
+}
+
+/*
+ * replay move of page contents for squeeze operation of hash index
+ */
+static void
+hash_xlog_move_page_contents(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_move_page_contents *xldata = (xl_hash_move_page_contents *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		writebuf = InvalidBuffer;
+	Buffer		deletebuf = InvalidBuffer;
+	XLogRedoAction action;
+
+	/*
+	 * Ensure we have a cleanup lock on the primary bucket page before we
+	 * start the actual replay operation.  This is to ensure that no scan can
+	 * start, and no scan can already be in progress, during the replay of
+	 * this operation.  If we allowed scans during this operation, they could
+	 * miss some records or see the same record multiple times.
+	 */
+	if (xldata->is_prim_bucket_same_wrt)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &writebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bucketbuf))
+			LockBufferForCleanup(bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &writebuf);
+	}
+
+	/* replay the record for adding entries in overflow buffer */
+	if (action == BLK_NEEDS_REDO)
+	{
+		Page		writepage;
+		char	   *begin;
+		char	   *data;
+		Size		datalen;
+		uint16		ninserted = 0;
+
+		data = begin = XLogRecGetBlockData(record, 1, &datalen);
+
+		writepage = (Page) BufferGetPage(writebuf);
+
+		if (xldata->ntups > 0)
+		{
+			OffsetNumber *towrite = (OffsetNumber *) data;
+
+			data += sizeof(OffsetNumber) * xldata->ntups;
+
+			while (data - begin < datalen)
+			{
+				IndexTuple	itup = (IndexTuple) data;
+				Size		itemsz;
+				OffsetNumber l;
+
+				itemsz = IndexTupleDSize(*itup);
+				itemsz = MAXALIGN(itemsz);
+
+				data += itemsz;
+
+				l = PageAddItem(writepage, (Item) itup, itemsz, towrite[ninserted], false, false);
+				if (l == InvalidOffsetNumber)
+					elog(ERROR, "hash_xlog_move_page_contents: failed to add item to hash index page, size %d bytes",
+						 (int) itemsz);
+
+				ninserted++;
+			}
+		}
+
+		/*
+		 * number of tuples inserted must be same as requested in REDO record.
+		 */
+		Assert(ninserted == xldata->ntups);
+
+		PageSetLSN(writepage, lsn);
+		MarkBufferDirty(writebuf);
+	}
+
+	/* replay the record for deleting entries from overflow buffer */
+	if (XLogReadBufferForRedo(record, 2, &deletebuf) == BLK_NEEDS_REDO)
+	{
+		Page		page;
+		char	   *ptr;
+		Size		len;
+
+		ptr = XLogRecGetBlockData(record, 2, &len);
+
+		page = (Page) BufferGetPage(deletebuf);
+
+		if (len > 0)
+		{
+			OffsetNumber *unused;
+			OffsetNumber *unend;
+
+			unused = (OffsetNumber *) ptr;
+			unend = (OffsetNumber *) ((char *) ptr + len);
+
+			if ((unend - unused) > 0)
+				PageIndexMultiDelete(page, unused, unend - unused);
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(deletebuf);
+	}
+
+	/*
+	 * Replay is complete, so now we can release the buffers.  We release the
+	 * locks at the end of the replay operation to ensure that we hold the
+	 * lock on the primary bucket page until the end of the operation.  We
+	 * could optimize by releasing the lock on the write buffer as soon as the
+	 * operation on it is complete, if it is not the same as the primary
+	 * bucket page, but that doesn't seem worth complicating the code.
+	 */
+	if (BufferIsValid(deletebuf))
+		UnlockReleaseBuffer(deletebuf);
+
+	if (BufferIsValid(writebuf))
+		UnlockReleaseBuffer(writebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay squeeze page operation of hash index
+ */
+static void
+hash_xlog_squeeze_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_squeeze_page *xldata = (xl_hash_squeeze_page *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		writebuf;
+	Buffer		ovflbuf;
+	Buffer		prevbuf = InvalidBuffer;
+	Buffer		mapbuf;
+	XLogRedoAction action;
+
+	/*
+	 * Ensure we have a cleanup lock on the primary bucket page before we
+	 * start the actual replay operation.  This is to ensure that no scan can
+	 * start, and no scan can already be in progress, during the replay of
+	 * this operation.  If we allowed scans during this operation, they could
+	 * miss some records or see the same record multiple times.
+	 */
+	if (xldata->is_prim_bucket_same_wrt)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &writebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bucketbuf))
+			LockBufferForCleanup(bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &writebuf);
+	}
+
+	/* replay the record for adding entries in overflow buffer */
+	if (action == BLK_NEEDS_REDO)
+	{
+		Page		writepage;
+		char	   *begin;
+		char	   *data;
+		Size		datalen;
+		uint16		ninserted = 0;
+
+		data = begin = XLogRecGetBlockData(record, 1, &datalen);
+
+		writepage = (Page) BufferGetPage(writebuf);
+
+		if (xldata->ntups > 0)
+		{
+			OffsetNumber *towrite = (OffsetNumber *) data;
+
+			data += sizeof(OffsetNumber) * xldata->ntups;
+
+			while (data - begin < datalen)
+			{
+				IndexTuple	itup = (IndexTuple) data;
+				Size		itemsz;
+				OffsetNumber l;
+
+				itemsz = IndexTupleDSize(*itup);
+				itemsz = MAXALIGN(itemsz);
+
+				data += itemsz;
+
+				l = PageAddItem(writepage, (Item) itup, itemsz, towrite[ninserted], false, false);
+				if (l == InvalidOffsetNumber)
+					elog(ERROR, "hash_xlog_squeeze_page: failed to add item to hash index page, size %d bytes",
+						 (int) itemsz);
+
+				ninserted++;
+			}
+		}
+
+		/*
+		 * number of tuples inserted must be same as requested in REDO record.
+		 */
+		Assert(ninserted == xldata->ntups);
+
+		/*
+		 * If the page to which we are adding tuples is the page previous to
+		 * the freed overflow page, then update its nextblkno.
+		 */
+		if (xldata->is_prev_bucket_same_wrt)
+		{
+			HashPageOpaque writeopaque = (HashPageOpaque) PageGetSpecialPointer(writepage);
+
+			writeopaque->hasho_nextblkno = xldata->nextblkno;
+		}
+
+		PageSetLSN(writepage, lsn);
+		MarkBufferDirty(writebuf);
+	}
+
+	/* replay the record for initializing overflow buffer */
+	if (XLogReadBufferForRedo(record, 2, &ovflbuf) == BLK_NEEDS_REDO)
+	{
+		Page		ovflpage;
+
+		ovflpage = BufferGetPage(ovflbuf);
+
+		_hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));
+
+		PageSetLSN(ovflpage, lsn);
+		MarkBufferDirty(ovflbuf);
+	}
+	if (BufferIsValid(ovflbuf))
+		UnlockReleaseBuffer(ovflbuf);
+
+	/* replay the record for page previous to the freed overflow page */
+	if (!xldata->is_prev_bucket_same_wrt &&
+		XLogReadBufferForRedo(record, 3, &prevbuf) == BLK_NEEDS_REDO)
+	{
+		Page		prevpage = BufferGetPage(prevbuf);
+		HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+		prevopaque->hasho_nextblkno = xldata->nextblkno;
+
+		PageSetLSN(prevpage, lsn);
+		MarkBufferDirty(prevbuf);
+	}
+	if (BufferIsValid(prevbuf))
+		UnlockReleaseBuffer(prevbuf);
+
+	/* replay the record for page next to the freed overflow page */
+	if (XLogRecHasBlockRef(record, 4))
+	{
+		Buffer		nextbuf;
+
+		if (XLogReadBufferForRedo(record, 4, &nextbuf) == BLK_NEEDS_REDO)
+		{
+			Page		nextpage = BufferGetPage(nextbuf);
+			HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage);
+
+			nextopaque->hasho_prevblkno = xldata->prevblkno;
+
+			PageSetLSN(nextpage, lsn);
+			MarkBufferDirty(nextbuf);
+		}
+		if (BufferIsValid(nextbuf))
+			UnlockReleaseBuffer(nextbuf);
+	}
+
+	/* replay the record for bitmap page */
+	if (XLogReadBufferForRedo(record, 5, &mapbuf) == BLK_NEEDS_REDO)
+	{
+		Page		mappage = (Page) BufferGetPage(mapbuf);
+		uint32	   *freep = NULL;
+		char	   *data;
+		uint32	   *bitmap_page_bit;
+		Size		datalen;
+
+		freep = HashPageGetBitmap(mappage);
+
+		data = XLogRecGetBlockData(record, 5, &datalen);
+		bitmap_page_bit = (uint32 *) data;
+
+		CLRBIT(freep, *bitmap_page_bit);
+
+		PageSetLSN(mappage, lsn);
+		MarkBufferDirty(mapbuf);
+	}
+	if (BufferIsValid(mapbuf))
+		UnlockReleaseBuffer(mapbuf);
+
+	/* replay the record for meta page */
+	if (XLogRecHasBlockRef(record, 6))
+	{
+		Buffer		metabuf;
+
+		if (XLogReadBufferForRedo(record, 6, &metabuf) == BLK_NEEDS_REDO)
+		{
+			HashMetaPage metap;
+			Page		page;
+			char	   *data;
+			uint32	   *firstfree_ovflpage;
+			Size		datalen;
+
+			data = XLogRecGetBlockData(record, 6, &datalen);
+			firstfree_ovflpage = (uint32 *) data;
+
+			page = BufferGetPage(metabuf);
+			metap = HashPageGetMeta(page);
+			metap->hashm_firstfree = *firstfree_ovflpage;
+
+			PageSetLSN(page, lsn);
+			MarkBufferDirty(metabuf);
+		}
+		if (BufferIsValid(metabuf))
+			UnlockReleaseBuffer(metabuf);
+	}
+
+	/*
+	 * We release the locks on writebuf and bucketbuf at the end of the replay
+	 * operation to ensure that we hold the lock on the primary bucket page
+	 * until the end of the operation.  We could optimize by releasing the
+	 * lock on the write buffer as soon as the operation on it is complete, if
+	 * it is not the same as the primary bucket page, but that doesn't seem
+	 * worth complicating the code.
+	 */
+	if (BufferIsValid(writebuf))
+		UnlockReleaseBuffer(writebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay delete operation of hash index
+ */
+static void
+hash_xlog_delete(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_delete *xldata = (xl_hash_delete *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		deletebuf;
+	Page		page;
+	XLogRedoAction action;
+
+	/*
+	 * Ensure we have a cleanup lock on the primary bucket page before we
+	 * start the actual replay operation.  This is to ensure that no scan can
+	 * start, and no scan can already be in progress, during the replay of
+	 * this operation.  If we allowed scans during this operation, they could
+	 * miss some records or see the same record multiple times.
+	 */
+	if (xldata->is_primary_bucket_page)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &deletebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bucketbuf))
+			LockBufferForCleanup(bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &deletebuf);
+	}
+
+	/* replay the record for deleting entries in bucket page */
+	if (action == BLK_NEEDS_REDO)
+	{
+		char	   *ptr;
+		Size		len;
+
+		ptr = XLogRecGetBlockData(record, 1, &len);
+
+		page = (Page) BufferGetPage(deletebuf);
+
+		if (len > 0)
+		{
+			OffsetNumber *unused;
+			OffsetNumber *unend;
+
+			unused = (OffsetNumber *) ptr;
+			unend = (OffsetNumber *) ((char *) ptr + len);
+
+			if ((unend - unused) > 0)
+				PageIndexMultiDelete(page, unused, unend - unused);
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(deletebuf);
+	}
+	if (BufferIsValid(deletebuf))
+		UnlockReleaseBuffer(deletebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay split cleanup flag operation for primary bucket page.
+ */
+static void
+hash_xlog_split_cleanup(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		HashPageOpaque bucket_opaque;
+
+		page = (Page) BufferGetPage(buffer);
+
+		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+		bucket_opaque->hasho_flag &= ~LH_BUCKET_NEEDS_SPLIT_CLEANUP;
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
+
+/*
+ * replay for update meta page
+ */
+static void
+hash_xlog_update_meta_page(XLogReaderState *record)
+{
+	HashMetaPage metap;
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_update_meta_page *xldata = (xl_hash_update_meta_page *) XLogRecGetData(record);
+	Buffer		metabuf;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &metabuf) == BLK_NEEDS_REDO)
+	{
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+
+		metap->hashm_ntuples = xldata->ntuples;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+void
+hash_redo(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			hash_xlog_init_meta_page(record);
+			break;
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			hash_xlog_init_bitmap_page(record);
+			break;
+		case XLOG_HASH_INSERT:
+			hash_xlog_insert(record);
+			break;
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			hash_xlog_addovflpage(record);
+			break;
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			hash_xlog_split_allocpage(record);
+			break;
+		case XLOG_HASH_SPLIT_PAGE:
+			hash_xlog_split_page(record);
+			break;
+		case XLOG_HASH_SPLIT_COMPLETE:
+			hash_xlog_split_complete(record);
+			break;
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			hash_xlog_move_page_contents(record);
+			break;
+		case XLOG_HASH_SQUEEZE_PAGE:
+			hash_xlog_squeeze_page(record);
+			break;
+		case XLOG_HASH_DELETE:
+			hash_xlog_delete(record);
+			break;
+		case XLOG_HASH_SPLIT_CLEANUP:
+			hash_xlog_split_cleanup(record);
+			break;
+		case XLOG_HASH_UPDATE_META_PAGE:
+			hash_xlog_update_meta_page(record);
+			break;
+		default:
+			elog(PANIC, "hash_redo: unknown op code %u", info);
+	}
+}
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 39c70d3..057bd3c 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -16,6 +16,8 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "access/hash_xlog.h"
+#include "miscadmin.h"
 #include "utils/rel.h"
 
 
@@ -45,6 +47,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	uint32		maxbucket;
 	uint32		highmask;
 	uint32		lowmask;
+	OffsetNumber itup_off;
 
 	/*
 	 * Get the hash key for the item (it's stored in the index tuple itself).
@@ -206,35 +209,63 @@ restart_insert:
 		Assert(pageopaque->hasho_bucket == bucket);
 	}
 
-	/* found page with enough space, so add the item here */
-	(void) _hash_pgaddtup(rel, buf, itemsz, itup);
-
-	/*
-	 * dirty and release the modified page.  if the page we modified was an
-	 * overflow page, we also need to separately drop the pin we retained on
-	 * the primary bucket page.
-	 */
-	MarkBufferDirty(buf);
-	_hash_relbuf(rel, buf);
-	if (buf != bucket_buf)
-		_hash_dropbuf(rel, bucket_buf);
-
 	/*
 	 * Write-lock the metapage so we can increment the tuple count. After
 	 * incrementing it, check to see if it's time for a split.
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Do the update.  No ereport(ERROR) until changes are logged */
+	START_CRIT_SECTION();
+
+	/* found page with enough space, so add the item here */
+	itup_off = _hash_pgaddtup(rel, buf, itemsz, itup);
+	MarkBufferDirty(buf);
+
+	/* metapage operations */
 	metap->hashm_ntuples += 1;
 
 	/* Make sure this stays in sync with _hash_expandtable() */
 	do_expand = metap->hashm_ntuples >
 		(double) metap->hashm_ffactor * (metap->hashm_maxbucket + 1);
 
-	/* Write out the metapage and drop lock, but keep pin */
 	MarkBufferDirty(metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_insert xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.offnum = itup_off;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInsert);
+
+		XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);
+
+		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+		XLogRegisterBufData(0, (char *) itup, IndexTupleDSize(*itup));
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INSERT);
+
+		PageSetLSN(BufferGetPage(buf), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	/* drop lock on metapage, but keep pin */
 	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 
+	/*
+	 * Release the modified page.  If it was an overflow page, also release
+	 * the pin retained on the primary bucket page.
+	 */
+	_hash_relbuf(rel, buf);
+	if (buf != bucket_buf)
+		_hash_dropbuf(rel, bucket_buf);
+
 	/* Attempt to split if a split is needed */
 	if (do_expand)
 		_hash_expandtable(rel, metabuf);
@@ -275,3 +306,44 @@ _hash_pgaddtup(Relation rel, Buffer buf, Size itemsize, IndexTuple itup)
 
 	return itup_off;
 }
+
+/*
+ *	_hash_pgaddmultitup() -- add a tuple vector to a particular page in the
+ *							 index.
+ *
+ * This routine has the same requirements for locking and tuple ordering as
+ * _hash_pgaddtup().
+ *
+ * The offsets at which the tuples are inserted are returned in the
+ * itup_offsets array.
+ */
+void
+_hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
+					OffsetNumber *itup_offsets, uint16 nitups)
+{
+	OffsetNumber itup_off;
+	Page		page;
+	uint32		hashkey;
+	int			i;
+
+	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+	page = BufferGetPage(buf);
+
+	for (i = 0; i < nitups; i++)
+	{
+		Size		itemsize;
+
+		itemsize = IndexTupleDSize(*itups[i]);
+		itemsize = MAXALIGN(itemsize);
+
+		/* Find where to insert the tuple (preserving page's hashkey ordering) */
+		hashkey = _hash_get_indextuple_hashkey(itups[i]);
+		itup_off = _hash_binsearch(page, hashkey);
+
+		itup_offsets[i] = itup_off;
+
+		if (PageAddItem(page, (Item) itups[i], itemsize, itup_off, false, false)
+			== InvalidOffsetNumber)
+			elog(ERROR, "failed to add index item to \"%s\"",
+				 RelationGetRelationName(rel));
+	}
+}
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index e8928ef..1b5bc52 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -18,10 +18,11 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "access/hash_xlog.h"
+#include "miscadmin.h"
 #include "utils/rel.h"
 
 
-static Buffer _hash_getovflpage(Relation rel, Buffer metabuf);
 static uint32 _hash_firstfreebit(uint32 map);
 
 
@@ -84,7 +85,9 @@ blkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
  *	dropped before exiting (we assume the caller is not interested in 'buf'
  *	anymore) if not asked to retain.  The pin will be retained only for the
  *	primary bucket.  The returned overflow page will be pinned and
- *	write-locked; it is guaranteed to be empty.
+ *	write-locked; it is not guaranteed to be empty, as we release and
+ *	reacquire the lock on it, so the caller must check that there is enough
+ *	space in the page for the tuple it wants to insert.
  *
  *	The caller must hold a pin, but no lock, on the metapage buffer.
  *	That buffer is returned in the same state.
@@ -102,13 +105,37 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 	Page		ovflpage;
 	HashPageOpaque pageopaque;
 	HashPageOpaque ovflopaque;
-
-	/* allocate and lock an empty overflow page */
-	ovflbuf = _hash_getovflpage(rel, metabuf);
+	HashMetaPage metap;
+	Buffer		mapbuf = InvalidBuffer;
+	Buffer		newmapbuf = InvalidBuffer;
+	BlockNumber blkno;
+	uint32		orig_firstfree;
+	uint32		splitnum;
+	uint32	   *freep = NULL;
+	uint32		max_ovflpg;
+	uint32		bit;
+	uint32		bitmap_page_bit;
+	uint32		first_page;
+	uint32		last_bit;
+	uint32		last_page;
+	uint32		i,
+				j;
+	bool		page_found = false;
 
 	/*
-	 * Write-lock the tail page.  It is okay to hold two buffer locks here
-	 * since there cannot be anyone else contending for access to ovflbuf.
+	 * Write-lock the tail page.  Here, we need to maintain the following
+	 * locking order: first acquire the lock on the tail page of the bucket,
+	 * then the lock on the meta page to find and lock a bitmap page; once a
+	 * bitmap page is found, the lock on the meta page is released, and
+	 * finally the lock on the new overflow page is acquired.  We need this
+	 * locking order to avoid deadlocks with backends that are doing inserts.
+	 *
+	 * Note: We could have avoided locking many buffers here if we make two
+	 * WAL records for acquiring an overflow page (one to allocate an overflow
+	 * page and another to add it to overflow bucket chain).  However, doing
+	 * so can leak an overflow page, if the system crashes after allocation.
+	 * Needless to say, it is better to have a single record from a
+	 * performance point of view as well.
 	 */
 	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
@@ -142,60 +169,6 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 		buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
 	}
 
-	/* now that we have correct backlink, initialize new overflow page */
-	ovflpage = BufferGetPage(ovflbuf);
-	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
-	ovflopaque->hasho_prevblkno = BufferGetBlockNumber(buf);
-	ovflopaque->hasho_nextblkno = InvalidBlockNumber;
-	ovflopaque->hasho_bucket = pageopaque->hasho_bucket;
-	ovflopaque->hasho_flag = LH_OVERFLOW_PAGE;
-	ovflopaque->hasho_page_id = HASHO_PAGE_ID;
-
-	MarkBufferDirty(ovflbuf);
-
-	/* logically chain overflow page to previous page */
-	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
-	MarkBufferDirty(buf);
-	if (retain_pin)
-	{
-		/* pin will be retained only for the primary bucket page */
-		Assert(pageopaque->hasho_flag & LH_BUCKET_PAGE);
-		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-	}
-	else
-		_hash_relbuf(rel, buf);
-
-	return ovflbuf;
-}
-
-/*
- *	_hash_getovflpage()
- *
- *	Find an available overflow page and return it.  The returned buffer
- *	is pinned and write-locked, and has had _hash_pageinit() applied,
- *	but it is caller's responsibility to fill the special space.
- *
- * The caller must hold a pin, but no lock, on the metapage buffer.
- * That buffer is left in the same state at exit.
- */
-static Buffer
-_hash_getovflpage(Relation rel, Buffer metabuf)
-{
-	HashMetaPage metap;
-	Buffer		mapbuf = 0;
-	Buffer		newbuf;
-	BlockNumber blkno;
-	uint32		orig_firstfree;
-	uint32		splitnum;
-	uint32	   *freep = NULL;
-	uint32		max_ovflpg;
-	uint32		bit;
-	uint32		first_page;
-	uint32		last_bit;
-	uint32		last_page;
-	uint32		i,
-				j;
-
 	/* Get exclusive lock on the meta page */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
@@ -244,11 +217,31 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 		for (; bit <= last_inpage; j++, bit += BITS_PER_MAP)
 		{
 			if (freep[j] != ALL_SET)
+			{
+				page_found = true;
+
+				/* Reacquire exclusive lock on the meta page */
+				LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+
+				/* convert bit to bit number within page */
+				bit += _hash_firstfreebit(freep[j]);
+				bitmap_page_bit = bit;
+
+				/* convert bit to absolute bit number */
+				bit += (i << BMPG_SHIFT(metap));
+				/* Calculate address of the recycled overflow page */
+				blkno = bitno_to_blkno(metap, bit);
+
+				/* Fetch and init the recycled page */
+				ovflbuf = _hash_getinitbuf(rel, blkno);
+
 				goto found;
+			}
 		}
 
 		/* No free space here, try to advance to next map page */
 		_hash_relbuf(rel, mapbuf);
+		mapbuf = InvalidBuffer;
 		i++;
 		j = 0;					/* scan from start of next map page */
 		bit = 0;
@@ -272,8 +265,15 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 		 * convenient to pre-mark them as "in use" too.
 		 */
 		bit = metap->hashm_spares[splitnum];
-		_hash_initbitmap(rel, metap, bitno_to_blkno(metap, bit), MAIN_FORKNUM);
-		metap->hashm_spares[splitnum]++;
+		newmapbuf = _hash_getnewbuf(rel, bitno_to_blkno(metap, bit), MAIN_FORKNUM);
+
+		/* add the new bitmap page to the metapage's list of bitmaps */
+		/* metapage already has a write lock */
+		if (metap->hashm_nmaps >= HASH_MAX_BITMAPS)
+			ereport(ERROR,
+					(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+					 errmsg("out of overflow pages in hash index \"%s\"",
+							RelationGetRelationName(rel))));
 	}
 	else
 	{
@@ -284,7 +284,8 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 	}
 
 	/* Calculate address of the new overflow page */
-	bit = metap->hashm_spares[splitnum];
+	bit = BufferIsValid(newmapbuf) ?
+		metap->hashm_spares[splitnum] + 1 : metap->hashm_spares[splitnum];
 	blkno = bitno_to_blkno(metap, bit);
 
 	/*
@@ -292,41 +293,51 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 	 * relation length stays in sync with ours.  XXX It's annoying to do this
 	 * with metapage write lock held; would be better to use a lock that
 	 * doesn't block incoming searches.
+	 *
+	 * It is okay to hold two buffer locks here (one on tail page of bucket
+	 * and other on new overflow page) since there cannot be anyone else
+	 * contending for access to ovflbuf.
 	 */
-	newbuf = _hash_getnewbuf(rel, blkno, MAIN_FORKNUM);
+	ovflbuf = _hash_getnewbuf(rel, blkno, MAIN_FORKNUM);
 
-	metap->hashm_spares[splitnum]++;
+found:
 
 	/*
-	 * Adjust hashm_firstfree to avoid redundant searches.  But don't risk
-	 * changing it if someone moved it while we were searching bitmap pages.
+	 * Do the update.  No ereport(ERROR) until changes are logged.  We want to
+	 * log the changes for the bitmap page and the overflow page together, to
+	 * avoid leaking an overflow page if the system crashes after allocating a
+	 * new page.
 	 */
-	if (metap->hashm_firstfree == orig_firstfree)
-		metap->hashm_firstfree = bit + 1;
-
-	/* Write updated metapage and release lock, but not pin */
-	MarkBufferDirty(metabuf);
-	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+	START_CRIT_SECTION();
 
-	return newbuf;
-
-found:
-	/* convert bit to bit number within page */
-	bit += _hash_firstfreebit(freep[j]);
+	if (page_found)
+	{
+		Assert(BufferIsValid(mapbuf));
 
-	/* mark page "in use" in the bitmap */
-	SETBIT(freep, bit);
-	MarkBufferDirty(mapbuf);
-	_hash_relbuf(rel, mapbuf);
+		/* mark page "in use" in the bitmap */
+		SETBIT(freep, bitmap_page_bit);
+		MarkBufferDirty(mapbuf);
+	}
+	else
+	{
+		/* update the count to indicate new overflow page is added */
+		metap->hashm_spares[splitnum]++;
 
-	/* Reacquire exclusive lock on the meta page */
-	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+		if (BufferIsValid(newmapbuf))
+		{
+			_hash_initbitmapbuffer(newmapbuf, metap->hashm_bmsize, false);
+			MarkBufferDirty(newmapbuf);
 
-	/* convert bit to absolute bit number */
-	bit += (i << BMPG_SHIFT(metap));
+			metap->hashm_mapp[metap->hashm_nmaps] = BufferGetBlockNumber(newmapbuf);
+			metap->hashm_nmaps++;
+			metap->hashm_spares[splitnum]++;
+			MarkBufferDirty(metabuf);
+		}
 
-	/* Calculate address of the recycled overflow page */
-	blkno = bitno_to_blkno(metap, bit);
+		/*
+		 * for new overflow page, we don't need to explicitly set the bit in
+		 * bitmap page, as by default that will be set to "in use".
+		 */
+	}
 
 	/*
 	 * Adjust hashm_firstfree to avoid redundant searches.  But don't risk
@@ -335,19 +346,101 @@ found:
 	if (metap->hashm_firstfree == orig_firstfree)
 	{
 		metap->hashm_firstfree = bit + 1;
-
-		/* Write updated metapage and release lock, but not pin */
 		MarkBufferDirty(metabuf);
-		LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 	}
-	else
+
+	/* now that we have correct backlink, initialize new overflow page */
+	ovflpage = BufferGetPage(ovflbuf);
+	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
+	ovflopaque->hasho_prevblkno = BufferGetBlockNumber(buf);
+	ovflopaque->hasho_nextblkno = InvalidBlockNumber;
+	ovflopaque->hasho_bucket = pageopaque->hasho_bucket;
+	ovflopaque->hasho_flag = LH_OVERFLOW_PAGE;
+	ovflopaque->hasho_page_id = HASHO_PAGE_ID;
+
+	MarkBufferDirty(ovflbuf);
+
+	/* logically chain overflow page to previous page */
+	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
+
+	MarkBufferDirty(buf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
 	{
-		/* We didn't change the metapage, so no need to write */
-		LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+		XLogRecPtr	recptr;
+		xl_hash_addovflpage xlrec;
+
+		xlrec.bmpage_found = page_found;
+		xlrec.bmsize = metap->hashm_bmsize;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashAddOvflPage);
+
+		XLogRegisterBuffer(0, ovflbuf, REGBUF_WILL_INIT);
+		XLogRegisterBufData(0, (char *) &pageopaque->hasho_bucket, sizeof(Bucket));
+
+		XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
+
+		if (BufferIsValid(mapbuf))
+		{
+			/*
+			 * The bitmap page doesn't have a standard page layout, so we log
+			 * the changed bit number as buffer data and set it during replay.
+			 */
+			XLogRegisterBuffer(2, mapbuf, REGBUF_STANDARD);
+			XLogRegisterBufData(2, (char *) &bitmap_page_bit, sizeof(uint32));
+		}
+
+		if (BufferIsValid(newmapbuf))
+			XLogRegisterBuffer(3, newmapbuf, REGBUF_WILL_INIT);
+
+		/*
+		 * To replay meta page changes, we could log the entire metapage
+		 * (which doesn't seem advisable considering the size of the hash
+		 * metapage) or log the individually updated values; instead, we
+		 * prefer to perform the same operations on the metapage during replay
+		 * as are done during the original operation.  That is straightforward
+		 * and has the advantage of using less space in WAL.
+		 */
+		XLogRegisterBuffer(4, metabuf, REGBUF_STANDARD);
+		XLogRegisterBufData(4, (char *) &metap->hashm_firstfree, sizeof(uint32));
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_ADD_OVFL_PAGE);
+
+		PageSetLSN(BufferGetPage(ovflbuf), recptr);
+		PageSetLSN(BufferGetPage(buf), recptr);
+
+		if (BufferIsValid(mapbuf))
+			PageSetLSN(BufferGetPage(mapbuf), recptr);
+
+		if (BufferIsValid(newmapbuf))
+			PageSetLSN(BufferGetPage(newmapbuf), recptr);
 	}
 
-	/* Fetch, init, and return the recycled page */
-	return _hash_getinitbuf(rel, blkno);
+	END_CRIT_SECTION();
+
+	if (retain_pin)
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+	else
+		_hash_relbuf(rel, buf);
+
+	if (BufferIsValid(mapbuf))
+		_hash_relbuf(rel, mapbuf);
+
+	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+
+	if (BufferIsValid(newmapbuf))
+		_hash_relbuf(rel, newmapbuf);
+
+	/*
+	 * We need to release and reacquire the lock on the overflow buffer to
+	 * ensure that a standby doesn't see an intermediate state of it.
+	 */
+	LockBuffer(ovflbuf, BUFFER_LOCK_UNLOCK);
+	LockBuffer(ovflbuf, BUFFER_LOCK_EXCLUSIVE);
+
+	return ovflbuf;
 }
 
 /*
@@ -380,6 +473,12 @@ _hash_firstfreebit(uint32 map)
  *	Remove this overflow page from its bucket's chain, and mark the page as
  *	free.  On entry, ovflbuf is write-locked; it is released before exiting.
  *
+ *	Add the tuples (itups) to wbuf in this function; we could do that in the
+ *	caller as well, but the advantage of doing it here is that we can easily
+ *	write the WAL for the XLOG_HASH_SQUEEZE_PAGE operation.  The addition of
+ *	tuples and the removal of the overflow page have to be done as one atomic
+ *	operation; otherwise, during replay on a standby, users might find
+ *	duplicate records.
+ *
  *	Since this function is invoked in VACUUM, we provide an access strategy
  *	parameter that controls fetches of the bucket pages.
  *
@@ -392,13 +491,14 @@ _hash_firstfreebit(uint32 map)
  *	has a lock on same.
  */
 BlockNumber
-_hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
+_hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
+				   Buffer wbuf, IndexTuple *itups, OffsetNumber *itup_offsets,
+				   Size *tups_size, uint16 nitups,
 				   BufferAccessStrategy bstrategy)
 {
 	HashMetaPage metap;
 	Buffer		metabuf;
 	Buffer		mapbuf;
-	Buffer		prevbuf = InvalidBuffer;
 	BlockNumber ovflblkno;
 	BlockNumber prevblkno;
 	BlockNumber blkno;
@@ -412,6 +512,9 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 	int32		bitmappage,
 				bitmapbit;
 	Bucket bucket PG_USED_FOR_ASSERTS_ONLY;
+	Buffer		prevbuf = InvalidBuffer;
+	Buffer		nextbuf = InvalidBuffer;
+	bool		update_metap = false;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -424,15 +527,6 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 	bucket = ovflopaque->hasho_bucket;
 
 	/*
-	 * Zero the page for debugging's sake; then write and release it. (Note:
-	 * if we failed to zero the page here, we'd have problems with the Assert
-	 * in _hash_pageinit() when the page is reused.)
-	 */
-	MemSet(ovflpage, 0, BufferGetPageSize(ovflbuf));
-	MarkBufferDirty(ovflbuf);
-	_hash_relbuf(rel, ovflbuf);
-
-	/*
 	 * Fix up the bucket chain.  this is a doubly-linked list, so we must fix
 	 * up the bucket chain members behind and ahead of the overflow page being
 	 * deleted.  Concurrency issues are avoided by using lock chaining as
@@ -440,8 +534,6 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 	 */
 	if (BlockNumberIsValid(prevblkno))
 	{
-		Page		prevpage;
-		HashPageOpaque prevopaque;
 
 		if (prevblkno == writeblkno)
 			prevbuf = wbuf;
@@ -451,32 +543,13 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 												 HASH_WRITE,
 										   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
 												 bstrategy);
-
-		prevpage = BufferGetPage(prevbuf);
-		prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
-
-		Assert(prevopaque->hasho_bucket == bucket);
-		prevopaque->hasho_nextblkno = nextblkno;
-
-		MarkBufferDirty(prevbuf);
-		if (prevblkno != writeblkno)
-			_hash_relbuf(rel, prevbuf);
 	}
 	if (BlockNumberIsValid(nextblkno))
-	{
-		Buffer		nextbuf = _hash_getbuf_with_strategy(rel,
-														 nextblkno,
-														 HASH_WRITE,
-														 LH_OVERFLOW_PAGE,
-														 bstrategy);
-		Page		nextpage = BufferGetPage(nextbuf);
-		HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage);
-
-		Assert(nextopaque->hasho_bucket == bucket);
-		nextopaque->hasho_prevblkno = prevblkno;
-		MarkBufferDirty(nextbuf);
-		_hash_relbuf(rel, nextbuf);
-	}
+		nextbuf = _hash_getbuf_with_strategy(rel,
+											 nextblkno,
+											 HASH_WRITE,
+											 LH_OVERFLOW_PAGE,
+											 bstrategy);
 
 	/* Note: bstrategy is intentionally not used for metapage and bitmap */
 
@@ -497,62 +570,184 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 	/* Release metapage lock while we access the bitmap page */
 	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 
-	/* Clear the bitmap bit to indicate that this overflow page is free */
+	/* read the bitmap page to clear the bitmap bit */
 	mapbuf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BITMAP_PAGE);
 	mappage = BufferGetPage(mapbuf);
 	freep = HashPageGetBitmap(mappage);
 	Assert(ISSET(freep, bitmapbit));
-	CLRBIT(freep, bitmapbit);
-	MarkBufferDirty(mapbuf);
-	_hash_relbuf(rel, mapbuf);
 
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* This operation needs to log multiple tuples, prepare WAL for that */
+	if (RelationNeedsWAL(rel))
+		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, XLR_NORMAL_RDATAS + nitups);
+
+	START_CRIT_SECTION();
+
+	/*
+	 * we have to insert tuples on the "write" page, being careful to preserve
+	 * hashkey ordering.  (If we insert many tuples into the same "write" page
+	 * it would be worth qsort'ing them).
+	 */
+	if (nitups > 0)
+	{
+		_hash_pgaddmultitup(rel, wbuf, itups, itup_offsets, nitups);
+		MarkBufferDirty(wbuf);
+	}
+
+	/*
+	 * Initialize the freed overflow page.  Here we can't simply zero the
+	 * page, as WAL replay routines expect pages to be initialized.  See the
+	 * explanation of RBM_NORMAL mode atop XLogReadBufferExtended.
+	 */
+	_hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));
+	MarkBufferDirty(ovflbuf);
+
+	if (BufferIsValid(prevbuf))
+	{
+		Page		prevpage = BufferGetPage(prevbuf);
+		HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+		Assert(prevopaque->hasho_bucket == bucket);
+		prevopaque->hasho_nextblkno = nextblkno;
+		MarkBufferDirty(prevbuf);
+	}
+	if (BufferIsValid(nextbuf))
+	{
+		Page		nextpage = BufferGetPage(nextbuf);
+		HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage);
+
+		Assert(nextopaque->hasho_bucket == bucket);
+		nextopaque->hasho_prevblkno = prevblkno;
+		MarkBufferDirty(nextbuf);
+	}
+
+	/* Clear the bitmap bit to indicate that this overflow page is free */
+	CLRBIT(freep, bitmapbit);
+	MarkBufferDirty(mapbuf);
+
+
 	/* if this is now the first free page, update hashm_firstfree */
 	if (ovflbitno < metap->hashm_firstfree)
 	{
 		metap->hashm_firstfree = ovflbitno;
+		update_metap = true;
 		MarkBufferDirty(metabuf);
 	}
-	_hash_relbuf(rel, metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_squeeze_page xlrec;
+		XLogRecPtr	recptr;
+		int			i;
+
+		xlrec.prevblkno = prevblkno;
+		xlrec.nextblkno = nextblkno;
+		xlrec.ntups = nitups;
+		xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf) ? true : false;
+		xlrec.is_prev_bucket_same_wrt = (wbuf == prevbuf) ? true : false;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashSqueezePage);
+
+		/*
+		 * bucket buffer needs to be registered to ensure that we can acquire
+		 * a cleanup lock on it during replay.
+		 */
+		if (!xlrec.is_prim_bucket_same_wrt)
+			XLogRegisterBuffer(0, bucketbuf, REGBUF_STANDARD);
+
+		XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
+		if (xlrec.ntups > 0)
+		{
+			XLogRegisterBufData(1, (char *) itup_offsets,
+								nitups * sizeof(OffsetNumber));
+			for (i = 0; i < nitups; i++)
+				XLogRegisterBufData(1, (char *) itups[i], tups_size[i]);
+		}
+
+		XLogRegisterBuffer(2, ovflbuf, REGBUF_STANDARD);
+
+		/*
+		 * If prevpage and the writepage (the block into which we are moving
+		 * tuples from the overflow page) are the same, then there is no need
+		 * to separately register prevpage.  During replay, we can directly
+		 * update the next block number in writepage.
+		 */
+		if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt)
+			XLogRegisterBuffer(3, prevbuf, REGBUF_STANDARD);
+
+		if (BufferIsValid(nextbuf))
+			XLogRegisterBuffer(4, nextbuf, REGBUF_STANDARD);
+
+		XLogRegisterBuffer(5, mapbuf, REGBUF_STANDARD);
+		XLogRegisterBufData(5, (char *) &bitmapbit, sizeof(uint32));
+
+		if (update_metap)
+		{
+			XLogRegisterBuffer(6, metabuf, REGBUF_STANDARD);
+			XLogRegisterBufData(6, (char *) &metap->hashm_firstfree, sizeof(uint32));
+		}
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SQUEEZE_PAGE);
+
+		PageSetLSN(BufferGetPage(wbuf), recptr);
+		PageSetLSN(BufferGetPage(ovflbuf), recptr);
+
+		if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt)
+			PageSetLSN(BufferGetPage(prevbuf), recptr);
+		if (BufferIsValid(nextbuf))
+			PageSetLSN(BufferGetPage(nextbuf), recptr);
+
+		PageSetLSN(BufferGetPage(mapbuf), recptr);
+
+		if (update_metap)
+			PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	/* release previous bucket if it is not the same as the write bucket */
+	if (BufferIsValid(prevbuf) && prevblkno != writeblkno)
+		_hash_relbuf(rel, prevbuf);
+
+	if (BufferIsValid(ovflbuf))
+		_hash_relbuf(rel, ovflbuf);
+
+	if (BufferIsValid(nextbuf))
+		_hash_relbuf(rel, nextbuf);
+
+	if (BufferIsValid(mapbuf))
+		_hash_relbuf(rel, mapbuf);
+
+	if (BufferIsValid(metabuf))
+		_hash_relbuf(rel, metabuf);
 
 	return nextblkno;
 }
 
 
 /*
- *	_hash_initbitmap()
+ *	_hash_initbitmapbuffer()
  *
- *	 Initialize a new bitmap page.  The metapage has a write-lock upon
- *	 entering the function, and must be written by caller after return.
- *
- * 'blkno' is the block number of the new bitmap page.
- *
- * All bits in the new bitmap page are set to "1", indicating "in use".
+ *	 Initialize a new bitmap page.  All bits in the new bitmap page are set to
+ *	 "1", indicating "in use".
  */
 void
-_hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
-				 ForkNumber forkNum)
+_hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage)
 {
-	Buffer		buf;
 	Page		pg;
 	HashPageOpaque op;
 	uint32	   *freep;
 
-	/*
-	 * It is okay to write-lock the new bitmap page while holding metapage
-	 * write lock, because no one else could be contending for the new page.
-	 * Also, the metapage lock makes it safe to extend the index using
-	 * _hash_getnewbuf.
-	 *
-	 * There is some loss of concurrency in possibly doing I/O for the new
-	 * page while holding the metapage lock, but this path is taken so seldom
-	 * that it's not worth worrying about.
-	 */
-	buf = _hash_getnewbuf(rel, blkno, forkNum);
 	pg = BufferGetPage(buf);
 
+	/* initialize the page */
+	if (initpage)
+		_hash_pageinit(pg, BufferGetPageSize(buf));
+
 	/* initialize the page's special space */
 	op = (HashPageOpaque) PageGetSpecialPointer(pg);
 	op->hasho_prevblkno = InvalidBlockNumber;
@@ -563,23 +758,14 @@ _hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
 
 	/* set all of the bits to 1 */
 	freep = HashPageGetBitmap(pg);
-	MemSet(freep, 0xFF, BMPGSZ_BYTE(metap));
-
-	/* dirty the new bitmap page, and release write lock and pin */
-	MarkBufferDirty(buf);
-	_hash_relbuf(rel, buf);
+	MemSet(freep, 0xFF, bmsize);
 
-	/* add the new bitmap page to the metapage's list of bitmaps */
-	/* metapage already has a write lock */
-	if (metap->hashm_nmaps >= HASH_MAX_BITMAPS)
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("out of overflow pages in hash index \"%s\"",
-						RelationGetRelationName(rel))));
-
-	metap->hashm_mapp[metap->hashm_nmaps] = blkno;
-
-	metap->hashm_nmaps++;
+	/*
+	 * Set pd_lower just past the end of the bitmap page data.  We could even
+	 * set pd_lower equal to pd_upper, but this is more precise and makes the
+	 * page look compressible to xlog.c.
+	 */
+	((PageHeader) pg)->pd_lower = ((char *) freep + bmsize) - (char *) pg;
 }
 
 
@@ -629,7 +815,6 @@ _hash_squeezebucket(Relation rel,
 	Page		rpage;
 	HashPageOpaque wopaque;
 	HashPageOpaque ropaque;
-	bool		wbuf_dirty;
 
 	/*
 	 * start squeezing into the primary bucket page.
@@ -675,15 +860,23 @@ _hash_squeezebucket(Relation rel,
 	/*
 	 * squeeze the tuples.
 	 */
-	wbuf_dirty = false;
 	for (;;)
 	{
 		OffsetNumber roffnum;
 		OffsetNumber maxroffnum;
 		OffsetNumber deletable[MaxOffsetNumber];
-		int			ndeletable = 0;
+		IndexTuple	itups[MaxIndexTuplesPerPage];
+		Size		tups_size[MaxIndexTuplesPerPage];
+		OffsetNumber *itup_offsets;
+		uint16		ndeletable = 0;
+		uint16		nitups = 0;
+		Size		all_tups_size = 0;
+		int			i;
 		bool		retain_pin = false;
 
+		itup_offsets = (OffsetNumber *) palloc(MaxIndexTuplesPerPage * sizeof(OffsetNumber));
+
+readpage:
 		/* Scan each tuple in "read" page */
 		maxroffnum = PageGetMaxOffsetNumber(rpage);
 		for (roffnum = FirstOffsetNumber;
@@ -704,11 +897,13 @@ _hash_squeezebucket(Relation rel,
 
 			/*
 			 * Walk up the bucket chain, looking for a page big enough for
-			 * this item.  Exit if we reach the read page.
+			 * this item and all other accumulated items.  Exit if we reach
+			 * the read page.
 			 */
-			while (PageGetFreeSpace(wpage) < itemsz)
+			while (PageGetFreeSpaceForMulTups(wpage, nitups + 1) < (all_tups_size + itemsz))
 			{
 				Buffer		next_wbuf = InvalidBuffer;
+				bool		tups_moved = false;
 
 				Assert(!PageIsEmpty(wpage));
 
@@ -725,50 +920,130 @@ _hash_squeezebucket(Relation rel,
 														   HASH_WRITE,
 														   LH_OVERFLOW_PAGE,
 														   bstrategy);
+				if (nitups > 0)
+				{
+					Assert(nitups == ndeletable);
+
+					/*
+					 * This operation needs to log multiple tuples, prepare
+					 * WAL for that.
+					 */
+					if (RelationNeedsWAL(rel))
+						XLogEnsureRecordSpace(0, XLR_NORMAL_RDATAS + nitups);
+
+					START_CRIT_SECTION();
+
+					/*
+					 * we have to insert tuples on the "write" page, being
+					 * careful to preserve hashkey ordering.  (If we insert
+					 * many tuples into the same "write" page it would be
+					 * worth qsort'ing them).
+					 */
+					_hash_pgaddmultitup(rel, wbuf, itups, itup_offsets, nitups);
+					MarkBufferDirty(wbuf);
+
+					/* Delete tuples we already moved off read page */
+					PageIndexMultiDelete(rpage, deletable, ndeletable);
+					MarkBufferDirty(rbuf);
+
+					/* XLOG stuff */
+					if (RelationNeedsWAL(rel))
+					{
+						XLogRecPtr	recptr;
+						xl_hash_move_page_contents xlrec;
+
+						xlrec.ntups = nitups;
+						xlrec.is_prim_bucket_same_wrt = (wbuf == bucket_buf) ? true : false;
+
+						XLogBeginInsert();
+						XLogRegisterData((char *) &xlrec, SizeOfHashMovePageContents);
+
+						/*
+						 * bucket buffer needs to be registered to ensure that
+						 * we can acquire a cleanup lock on it during replay.
+						 */
+						if (!xlrec.is_prim_bucket_same_wrt)
+							XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+						XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
+						XLogRegisterBufData(1, (char *) itup_offsets,
+											nitups * sizeof(OffsetNumber));
+						for (i = 0; i < nitups; i++)
+							XLogRegisterBufData(1, (char *) itups[i], tups_size[i]);
+
+						XLogRegisterBuffer(2, rbuf, REGBUF_STANDARD);
+						XLogRegisterBufData(2, (char *) deletable,
+										  ndeletable * sizeof(OffsetNumber));
+
+						recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_MOVE_PAGE_CONTENTS);
+
+						PageSetLSN(BufferGetPage(wbuf), recptr);
+						PageSetLSN(BufferGetPage(rbuf), recptr);
+					}
+
+					END_CRIT_SECTION();
+
+					tups_moved = true;
+				}
 
 				/*
 				 * release the lock on previous page after acquiring the lock
 				 * on next page
 				 */
-				if (wbuf_dirty)
-					MarkBufferDirty(wbuf);
 				if (retain_pin)
 					LockBuffer(wbuf, BUFFER_LOCK_UNLOCK);
 				else
 					_hash_relbuf(rel, wbuf);
 
+				/*
+				 * We need to release and, if required, reacquire the lock on
+				 * rbuf to ensure that the standby doesn't see an intermediate
+				 * state of it.  Without the release, after replay of
+				 * XLOG_HASH_MOVE_PAGE_CONTENTS on the standby, users would be
+				 * able to view the results of a partial deletion on rblkno.
+				 */
+				LockBuffer(rbuf, BUFFER_LOCK_UNLOCK);
+
 				/* nothing more to do if we reached the read page */
 				if (rblkno == wblkno)
 				{
-					if (ndeletable > 0)
-					{
-						/* Delete tuples we already moved off read page */
-						PageIndexMultiDelete(rpage, deletable, ndeletable);
-						MarkBufferDirty(rbuf);
-					}
-					_hash_relbuf(rel, rbuf);
+					_hash_dropbuf(rel, rbuf);
 					return;
 				}
 
+				LockBuffer(rbuf, BUFFER_LOCK_EXCLUSIVE);
+
 				wbuf = next_wbuf;
 				wpage = BufferGetPage(wbuf);
 				wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
 				Assert(wopaque->hasho_bucket == bucket);
-				wbuf_dirty = false;
 				retain_pin = false;
-			}
 
-			/*
-			 * we have found room so insert on the "write" page, being careful
-			 * to preserve hashkey ordering.  (If we insert many tuples into
-			 * the same "write" page it would be worth qsort'ing instead of
-			 * doing repeated _hash_pgaddtup.)
-			 */
-			(void) _hash_pgaddtup(rel, wbuf, itemsz, itup);
-			wbuf_dirty = true;
+				/* be tidy */
+				for (i = 0; i < nitups; i++)
+					pfree(itups[i]);
+				nitups = 0;
+				all_tups_size = 0;
+				ndeletable = 0;
 
+				/*
+				 * after moving the tuples, rpage would have been compacted,
+				 * so we need to rescan it.
+				 */
+				if (tups_moved)
+					goto readpage;
+			}
 			/* remember tuple for deletion from "read" page */
 			deletable[ndeletable++] = roffnum;
+
+			/*
+			 * We need a copy of the index tuples, as they can be freed along
+			 * with the overflow page; however, we need them to write a WAL
+			 * record in _hash_freeovflpage.
+			 */
+			itups[nitups] = CopyIndexTuple(itup);
+			tups_size[nitups++] = itemsz;
+			all_tups_size += itemsz;
 		}
 
 		/*
@@ -786,10 +1061,14 @@ _hash_squeezebucket(Relation rel,
 		Assert(BlockNumberIsValid(rblkno));
 
 		/* free this overflow page (releases rbuf) */
-		_hash_freeovflpage(rel, rbuf, wbuf, bstrategy);
+		_hash_freeovflpage(rel, bucket_buf, rbuf, wbuf, itups, itup_offsets,
+						   tups_size, nitups, bstrategy);
+
+		/* be tidy */
+		for (i = 0; i < nitups; i++)
+			pfree(itups[i]);
 
-		if (wbuf_dirty)
-			MarkBufferDirty(wbuf);
+		pfree(itup_offsets);
 
 		/* are we freeing the page adjacent to wbuf? */
 		if (rblkno == wblkno)
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 9430794..d1d21bd 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -29,6 +29,7 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "access/hash_xlog.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
@@ -40,12 +41,11 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
 				  Bucket obucket, Bucket nbucket,
 				  Buffer obuf,
 				  Buffer nbuf,
+				  HTAB *htab,
 				  uint32 maxbucket,
 				  uint32 highmask, uint32 lowmask);
-static void _hash_splitbucket_guts(Relation rel, Buffer metabuf,
-					   Bucket obucket, Bucket nbucket, Buffer obuf,
-					   Buffer nbuf, HTAB *htab, uint32 maxbucket,
-					   uint32 highmask, uint32 lowmask);
+static void
+			log_split_page(Relation rel, Buffer buf);
 
 
 /*
@@ -160,6 +160,29 @@ _hash_getinitbuf(Relation rel, BlockNumber blkno)
 }
 
 /*
+ *	_hash_initbuf() -- Get and initialize a buffer by bucket number.
+ */
+void
+_hash_initbuf(Buffer buf, uint32 num_bucket, uint32 flag, bool initpage)
+{
+	HashPageOpaque pageopaque;
+	Page		page;
+
+	page = BufferGetPage(buf);
+
+	/* initialize the page */
+	if (initpage)
+		_hash_pageinit(page, BufferGetPageSize(buf));
+
+	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+	pageopaque->hasho_prevblkno = InvalidBlockNumber;
+	pageopaque->hasho_nextblkno = InvalidBlockNumber;
+	pageopaque->hasho_bucket = num_bucket;
+	pageopaque->hasho_flag = flag;
+	pageopaque->hasho_page_id = HASHO_PAGE_ID;
+}
+
+/*
  *	_hash_getnewbuf() -- Get a new page at the end of the index.
  *
  *		This has the same API as _hash_getinitbuf, except that we are adding
@@ -291,7 +314,7 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 
 
 /*
- *	_hash_metapinit() -- Initialize the metadata page of a hash index,
+ *	_hash_init() -- Initialize the metadata page of a hash index,
  *				the initial buckets, and the initial bitmap page.
  *
  * The initial number of buckets is dependent on num_tuples, an estimate
@@ -303,19 +326,18 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
  * multiple buffer locks is ignored.
  */
 uint32
-_hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
+_hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
 {
-	HashMetaPage metap;
-	HashPageOpaque pageopaque;
 	Buffer		metabuf;
 	Buffer		buf;
+	Buffer		bitmapbuf;
 	Page		pg;
+	HashMetaPage metap;
+	RegProcedure procid;
 	int32		data_width;
 	int32		item_width;
 	int32		ffactor;
-	double		dnumbuckets;
 	uint32		num_buckets;
-	uint32		log2_num_buckets;
 	uint32		i;
 
 	/* safety check */
@@ -337,6 +359,154 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	if (ffactor < 10)
 		ffactor = 10;
 
+	procid = index_getprocid(rel, 1, HASHPROC);
+
+	/*
+	 * We initialize the metapage, the first N bucket pages, and the first
+	 * bitmap page in sequence, using _hash_getnewbuf to cause smgrextend()
+	 * calls to occur.  This ensures that the smgr level has the right idea of
+	 * the physical index length.
+	 *
+	 * Critical section not required, because on error the creation of the
+	 * whole relation will be rolled back.
+	 */
+	metabuf = _hash_getnewbuf(rel, HASH_METAPAGE, forkNum);
+	_hash_init_metabuffer(metabuf, num_tuples, procid, ffactor, false);
+	MarkBufferDirty(metabuf);
+
+	pg = BufferGetPage(metabuf);
+	metap = HashPageGetMeta(pg);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_init_meta_page xlrec;
+		XLogRecPtr	recptr;
+
+		/*
+		 * Here, we could have copied the entire metapage into the WAL record
+		 * and restored it as-is during replay.  However, the metapage is not
+		 * small (more than 600 bytes), so we record just the information
+		 * required to reconstruct it.
+		 */
+		xlrec.num_tuples = num_tuples;
+		xlrec.procid = metap->hashm_procid;
+		xlrec.ffactor = metap->hashm_ffactor;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInitMetaPage);
+		XLogRegisterBuffer(0, metabuf, REGBUF_WILL_INIT);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INIT_META_PAGE);
+
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	num_buckets = metap->hashm_maxbucket + 1;
+
+	/*
+	 * Release buffer lock on the metapage while we initialize buckets.
+	 * Otherwise, we'll be in interrupt holdoff and the CHECK_FOR_INTERRUPTS
+	 * won't accomplish anything.  It's a bad idea to hold buffer locks for
+	 * long intervals in any case, since that can block the bgwriter.
+	 */
+	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+
+	/*
+	 * Initialize and WAL-log the first N buckets
+	 */
+	for (i = 0; i < num_buckets; i++)
+	{
+		BlockNumber blkno;
+
+		/* Allow interrupts, in case N is huge */
+		CHECK_FOR_INTERRUPTS();
+
+		blkno = BUCKET_TO_BLKNO(metap, i);
+		buf = _hash_getnewbuf(rel, blkno, forkNum);
+		_hash_initbuf(buf, i, LH_BUCKET_PAGE, false);
+		MarkBufferDirty(buf);
+
+		log_newpage(&rel->rd_node,
+					forkNum,
+					blkno,
+					BufferGetPage(buf),
+					true);
+		_hash_relbuf(rel, buf);
+	}
+
+	/* Now reacquire buffer lock on metapage */
+	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+
+	/*
+	 * Initialize bitmap page
+	 */
+	bitmapbuf = _hash_getnewbuf(rel, num_buckets + 1, forkNum);
+	_hash_initbitmapbuffer(bitmapbuf, metap->hashm_bmsize, false);
+	MarkBufferDirty(bitmapbuf);
+
+	/* add the new bitmap page to the metapage's list of bitmaps */
+	/* metapage already has a write lock */
+	if (metap->hashm_nmaps >= HASH_MAX_BITMAPS)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+				 errmsg("out of overflow pages in hash index \"%s\"",
+						RelationGetRelationName(rel))));
+
+	metap->hashm_mapp[metap->hashm_nmaps] = num_buckets + 1;
+
+	metap->hashm_nmaps++;
+	MarkBufferDirty(metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_init_bitmap_page xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.bmsize = metap->hashm_bmsize;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInitBitmapPage);
+		XLogRegisterBuffer(0, bitmapbuf, REGBUF_WILL_INIT);
+
+		/*
+		 * We could log the exact changes made to the meta page; however, as
+		 * no concurrent operation can see the index during the replay of this
+		 * record, we can simply perform the operations during replay as they
+		 * are done here.
+		 */
+		XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INIT_BITMAP_PAGE);
+
+		PageSetLSN(BufferGetPage(bitmapbuf), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	/* all done */
+	_hash_relbuf(rel, bitmapbuf);
+	_hash_relbuf(rel, metabuf);
+
+	return num_buckets;
+}
+
+/*
+ *	_hash_init_metabuffer() -- Initialize the metadata page of a hash index.
+ */
+void
+_hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
+					  uint16 ffactor, bool initpage)
+{
+	HashMetaPage metap;
+	HashPageOpaque pageopaque;
+	Page		page;
+	double		dnumbuckets;
+	uint32		num_buckets;
+	uint32		log2_num_buckets;
+	uint32		i;
+
+
 	/*
 	 * Choose the number of initial bucket pages to match the fill factor
 	 * given the estimated number of tuples.  We round up the result to the
@@ -356,30 +526,25 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	Assert(num_buckets == (((uint32) 1) << log2_num_buckets));
 	Assert(log2_num_buckets < HASH_MAX_SPLITPOINTS);
 
-	/*
-	 * We initialize the metapage, the first N bucket pages, and the first
-	 * bitmap page in sequence, using _hash_getnewbuf to cause smgrextend()
-	 * calls to occur.  This ensures that the smgr level has the right idea of
-	 * the physical index length.
-	 */
-	metabuf = _hash_getnewbuf(rel, HASH_METAPAGE, forkNum);
-	pg = BufferGetPage(metabuf);
+	page = BufferGetPage(buf);
+	if (initpage)
+		_hash_pageinit(page, BufferGetPageSize(buf));
 
-	pageopaque = (HashPageOpaque) PageGetSpecialPointer(pg);
+	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	pageopaque->hasho_prevblkno = InvalidBlockNumber;
 	pageopaque->hasho_nextblkno = InvalidBlockNumber;
 	pageopaque->hasho_bucket = -1;
 	pageopaque->hasho_flag = LH_META_PAGE;
 	pageopaque->hasho_page_id = HASHO_PAGE_ID;
 
-	metap = HashPageGetMeta(pg);
+	metap = HashPageGetMeta(page);
 
 	metap->hashm_magic = HASH_MAGIC;
 	metap->hashm_version = HASH_VERSION;
 	metap->hashm_ntuples = 0;
 	metap->hashm_nmaps = 0;
 	metap->hashm_ffactor = ffactor;
-	metap->hashm_bsize = HashGetMaxBitmapSize(pg);
+	metap->hashm_bsize = HashGetMaxBitmapSize(page);
 	/* find largest bitmap array size that will fit in page size */
 	for (i = _hash_log2(metap->hashm_bsize); i > 0; --i)
 	{
@@ -396,7 +561,7 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	 * pretty useless for normal operation (in fact, hashm_procid is not used
 	 * anywhere), but it might be handy for forensic purposes so we keep it.
 	 */
-	metap->hashm_procid = index_getprocid(rel, 1, HASHPROC);
+	metap->hashm_procid = procid;
 
 	/*
 	 * We initialize the index with N buckets, 0 .. N-1, occupying physical
@@ -415,47 +580,11 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	metap->hashm_firstfree = 0;
 
 	/*
-	 * Release buffer lock on the metapage while we initialize buckets.
-	 * Otherwise, we'll be in interrupt holdoff and the CHECK_FOR_INTERRUPTS
-	 * won't accomplish anything.  It's a bad idea to hold buffer locks for
-	 * long intervals in any case, since that can block the bgwriter.
-	 */
-	MarkBufferDirty(metabuf);
-	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
-
-	/*
-	 * Initialize the first N buckets
-	 */
-	for (i = 0; i < num_buckets; i++)
-	{
-		/* Allow interrupts, in case N is huge */
-		CHECK_FOR_INTERRUPTS();
-
-		buf = _hash_getnewbuf(rel, BUCKET_TO_BLKNO(metap, i), forkNum);
-		pg = BufferGetPage(buf);
-		pageopaque = (HashPageOpaque) PageGetSpecialPointer(pg);
-		pageopaque->hasho_prevblkno = InvalidBlockNumber;
-		pageopaque->hasho_nextblkno = InvalidBlockNumber;
-		pageopaque->hasho_bucket = i;
-		pageopaque->hasho_flag = LH_BUCKET_PAGE;
-		pageopaque->hasho_page_id = HASHO_PAGE_ID;
-		MarkBufferDirty(buf);
-		_hash_relbuf(rel, buf);
-	}
-
-	/* Now reacquire buffer lock on metapage */
-	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
-
-	/*
-	 * Initialize first bitmap page
+	 * Set pd_lower just past the end of the metadata.  This is to log the
+	 * full-page image of the metapage in xloginsert.c.
 	 */
-	_hash_initbitmap(rel, metap, num_buckets + 1, forkNum);
-
-	/* all done */
-	MarkBufferDirty(metabuf);
-	_hash_relbuf(rel, metabuf);
-
-	return num_buckets;
+	((PageHeader) page)->pd_lower =
+		((char *) metap + sizeof(HashMetaPageData)) - (char *) page;
 }
 
 /*
@@ -464,7 +593,6 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 void
 _hash_pageinit(Page page, Size size)
 {
-	Assert(PageIsNew(page));
 	PageInit(page, size, sizeof(HashPageOpaqueData));
 }
 
@@ -492,10 +620,14 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	Buffer		buf_nblkno;
 	Buffer		buf_oblkno;
 	Page		opage;
+	Page		npage;
 	HashPageOpaque oopaque;
+	HashPageOpaque nopaque;
 	uint32		maxbucket;
 	uint32		highmask;
 	uint32		lowmask;
+	bool		metap_update_masks = false;
+	bool		metap_update_splitpoint = false;
 
 restart_expand:
 
@@ -531,7 +663,7 @@ restart_expand:
 	 * than a disk block then this would be an independent constraint.
 	 *
 	 * If you change this, see also the maximum initial number of buckets in
-	 * _hash_metapinit().
+	 * _hash_init().
 	 */
 	if (metap->hashm_maxbucket >= (uint32) 0x7FFFFFFE)
 		goto fail;
@@ -655,7 +787,11 @@ restart_expand:
 		 * The number of buckets in the new splitpoint is equal to the total
 		 * number already in existence, i.e. new_bucket.  Currently this maps
 		 * one-to-one to blocks required, but someday we may need a more
-		 * complicated calculation here.
+		 * complicated calculation here.  We treat the allocation of buckets
+		 * as a separate WAL action.  Even if we fail after this operation, it
+		 * won't leak space on the standby, as the next split will consume it.
+		 * In any case, even without a failure we don't use all the space in
+		 * one split operation.
 		 */
 		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
 		{
@@ -680,18 +816,43 @@ restart_expand:
 		goto fail;
 	}
 
-
 	/*
-	 * Okay to proceed with split.  Update the metapage bucket mapping info.
-	 *
-	 * Since we are scribbling on the metapage data right in the shared
-	 * buffer, any failure in this next little bit leaves us with a big
+	 * Since we are scribbling on the pages in the shared buffers, establish a
+	 * critical section.  Any failure in this next code leaves us with a big
 	 * problem: the metapage is effectively corrupt but could get written back
-	 * to disk.  We don't really expect any failure, but just to be sure,
-	 * establish a critical section.
+	 * to disk.
 	 */
 	START_CRIT_SECTION();
 
+	opage = BufferGetPage(buf_oblkno);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	/*
+	 * Mark the old bucket to indicate that split is in progress.  At
+	 * operation end, we clear split-in-progress flag.
+	 */
+	oopaque->hasho_flag |= LH_BUCKET_BEING_SPLIT;
+
+	MarkBufferDirty(buf_oblkno);
+
+	npage = BufferGetPage(buf_nblkno);
+
+	/*
+	 * initialize the new bucket's primary page and mark it to indicate that
+	 * split is in progress.
+	 */
+	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	nopaque->hasho_prevblkno = InvalidBlockNumber;
+	nopaque->hasho_nextblkno = InvalidBlockNumber;
+	nopaque->hasho_bucket = new_bucket;
+	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_BEING_POPULATED;
+	nopaque->hasho_page_id = HASHO_PAGE_ID;
+
+	MarkBufferDirty(buf_nblkno);
+
+	/*
+	 * Okay to proceed with split.  Update the metapage bucket mapping info.
+	 */
 	metap->hashm_maxbucket = new_bucket;
 
 	if (new_bucket > metap->hashm_highmask)
@@ -699,6 +860,7 @@ restart_expand:
 		/* Starting a new doubling */
 		metap->hashm_lowmask = metap->hashm_highmask;
 		metap->hashm_highmask = new_bucket | metap->hashm_lowmask;
+		metap_update_masks = true;
 	}
 
 	/*
@@ -711,10 +873,10 @@ restart_expand:
 	{
 		metap->hashm_spares[spare_ndx] = metap->hashm_spares[metap->hashm_ovflpoint];
 		metap->hashm_ovflpoint = spare_ndx;
+		metap_update_splitpoint = true;
 	}
 
-	/* Done mucking with metapage */
-	END_CRIT_SECTION();
+	MarkBufferDirty(metabuf);
 
 	/*
 	 * Copy bucket mapping info now; this saves re-accessing the meta page
@@ -727,16 +889,71 @@ restart_expand:
 	highmask = metap->hashm_highmask;
 	lowmask = metap->hashm_lowmask;
 
-	/* Write out the metapage and drop lock, but keep pin */
-	MarkBufferDirty(metabuf);
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_split_allocpage xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.new_bucket = maxbucket;
+		xlrec.old_bucket_flag = oopaque->hasho_flag;
+		xlrec.new_bucket_flag = nopaque->hasho_flag;
+		xlrec.flags = 0;
+
+		XLogBeginInsert();
+
+		XLogRegisterBuffer(0, buf_oblkno, REGBUF_STANDARD);
+		XLogRegisterBuffer(1, buf_nblkno, REGBUF_WILL_INIT);
+		XLogRegisterBuffer(2, metabuf, REGBUF_STANDARD);
+
+		if (metap_update_masks)
+		{
+			xlrec.flags |= XLH_SPLIT_META_UPDATE_MASKS;
+			XLogRegisterBufData(2, (char *) &metap->hashm_lowmask, sizeof(uint32));
+			XLogRegisterBufData(2, (char *) &metap->hashm_highmask, sizeof(uint32));
+		}
+
+		if (metap_update_splitpoint)
+		{
+			xlrec.flags |= XLH_SPLIT_META_UPDATE_SPLITPOINT;
+			XLogRegisterBufData(2, (char *) &metap->hashm_ovflpoint,
+								sizeof(uint32));
+			XLogRegisterBufData(2,
+					   (char *) &metap->hashm_spares[metap->hashm_ovflpoint],
+								sizeof(uint32));
+		}
+
+		XLogRegisterData((char *) &xlrec, SizeOfHashSplitAllocPage);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_ALLOCATE_PAGE);
+
+		PageSetLSN(BufferGetPage(buf_oblkno), recptr);
+		PageSetLSN(BufferGetPage(buf_nblkno), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	/* drop lock, but keep pin */
 	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 
+	/*
+	 * We need to release and reacquire the lock on the new bucket's buffer to
+	 * ensure that the standby doesn't see an intermediate state of it.
+	 */
+	LockBuffer(buf_nblkno, BUFFER_LOCK_UNLOCK);
+	LockBuffer(buf_nblkno, BUFFER_LOCK_EXCLUSIVE);
+
 	/* Relocate records to the new bucket */
 	_hash_splitbucket(rel, metabuf,
 					  old_bucket, new_bucket,
-					  buf_oblkno, buf_nblkno,
+					  buf_oblkno, buf_nblkno, NULL,
 					  maxbucket, highmask, lowmask);
 
+	/* all done, now release the locks and pins on primary buckets. */
+	_hash_relbuf(rel, buf_oblkno);
+	_hash_relbuf(rel, buf_nblkno);
+
 	return;
 
 	/* Here if decide not to split or fail to acquire old bucket lock */
@@ -776,6 +993,7 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 {
 	BlockNumber lastblock;
 	char		zerobuf[BLCKSZ];
+	Page		page;
 
 	lastblock = firstblock + nblocks - 1;
 
@@ -786,7 +1004,21 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 	if (lastblock < firstblock || lastblock == InvalidBlockNumber)
 		return false;
 
-	MemSet(zerobuf, 0, sizeof(zerobuf));
+	page = (Page) zerobuf;
+
+	/*
+	 * Initialize the new bucket page.  Here we can't completely zero the
+	 * page, as WAL replay routines expect pages to be initialized.  See the
+	 * explanation of RBM_NORMAL mode atop XLogReadBufferExtended.
+	 */
+	_hash_pageinit(page, BLCKSZ);
+
+	if (RelationNeedsWAL(rel))
+		log_newpage(&rel->rd_node,
+					MAIN_FORKNUM,
+					lastblock,
+					zerobuf,
+					true);
 
 	RelationOpenSmgr(rel);
 	smgrextend(rel->rd_smgr, MAIN_FORKNUM, lastblock, zerobuf, false);
@@ -794,14 +1026,19 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 	return true;
 }
 
-
 /*
  * _hash_splitbucket -- split 'obucket' into 'obucket' and 'nbucket'
  *
+ * This routine is used to partition the tuples between the old and new bucket
+ * and to finish incomplete split operations.  To finish a previously
+ * interrupted split operation, the caller needs to fill htab.  If htab is set,
+ * then we skip the movement of tuples that exist in htab; otherwise, a NULL
+ * value of htab indicates movement of all the tuples that belong to the new
+ * bucket.
+ *
  * We are splitting a bucket that consists of a base bucket page and zero
  * or more overflow (bucket chain) pages.  We must relocate tuples that
- * belong in the new bucket, and compress out any free space in the old
- * bucket.
+ * belong in the new bucket.
  *
  * The caller must hold cleanup locks on both buckets to ensure that
  * no one else is trying to access them (see README).
@@ -827,69 +1064,11 @@ _hash_splitbucket(Relation rel,
 				  Bucket nbucket,
 				  Buffer obuf,
 				  Buffer nbuf,
+				  HTAB *htab,
 				  uint32 maxbucket,
 				  uint32 highmask,
 				  uint32 lowmask)
 {
-	Page		opage;
-	Page		npage;
-	HashPageOpaque oopaque;
-	HashPageOpaque nopaque;
-
-	opage = BufferGetPage(obuf);
-	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
-
-	/*
-	 * Mark the old bucket to indicate that split is in progress.  At
-	 * operation end, we clear split-in-progress flag.
-	 */
-	oopaque->hasho_flag |= LH_BUCKET_BEING_SPLIT;
-
-	npage = BufferGetPage(nbuf);
-
-	/*
-	 * initialize the new bucket's primary page and mark it to indicate that
-	 * split is in progress.
-	 */
-	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
-	nopaque->hasho_prevblkno = InvalidBlockNumber;
-	nopaque->hasho_nextblkno = InvalidBlockNumber;
-	nopaque->hasho_bucket = nbucket;
-	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_BEING_POPULATED;
-	nopaque->hasho_page_id = HASHO_PAGE_ID;
-
-	_hash_splitbucket_guts(rel, metabuf, obucket,
-						   nbucket, obuf, nbuf, NULL,
-						   maxbucket, highmask, lowmask);
-
-	/* all done, now release the locks and pins on primary buckets. */
-	_hash_relbuf(rel, obuf);
-	_hash_relbuf(rel, nbuf);
-}
-
-/*
- * _hash_splitbucket_guts -- Helper function to perform the split operation
- *
- * This routine is used to partition the tuples between old and new bucket and
- * to finish incomplete split operations.  To finish the previously
- * interrupted split operation, caller needs to fill htab.  If htab is set, then
- * we skip the movement of tuples that exists in htab, otherwise NULL value of
- * htab indicates movement of all the tuples that belong to new bucket.
- *
- * Caller needs to lock and unlock the old and new primary buckets.
- */
-static void
-_hash_splitbucket_guts(Relation rel,
-					   Buffer metabuf,
-					   Bucket obucket,
-					   Bucket nbucket,
-					   Buffer obuf,
-					   Buffer nbuf,
-					   HTAB *htab,
-					   uint32 maxbucket,
-					   uint32 highmask,
-					   uint32 lowmask)
-{
 	Buffer		bucket_obuf;
 	Buffer		bucket_nbuf;
 	Page		opage;
@@ -974,11 +1153,14 @@ _hash_splitbucket_guts(Relation rel,
 				itemsz = IndexTupleDSize(*new_itup);
 				itemsz = MAXALIGN(itemsz);
 
-				if (PageGetFreeSpace(npage) < itemsz)
+				while (PageGetFreeSpace(npage) < itemsz)
 				{
-					/* write out nbuf and drop lock, but keep pin */
-					MarkBufferDirty(nbuf);
+					/* log the split operation before releasing the lock */
+					log_split_page(rel, nbuf);
+
+					/* drop lock, but keep pin */
 					LockBuffer(nbuf, BUFFER_LOCK_UNLOCK);
+
 					/* chain to a new overflow page */
 					nbuf = _hash_addovflpage(rel, metabuf, nbuf, (nbuf == bucket_nbuf) ? true : false);
 					npage = BufferGetPage(nbuf);
@@ -986,6 +1168,13 @@ _hash_splitbucket_guts(Relation rel,
 				}
 
 				/*
+				 * Change the shared buffer state in a critical section;
+				 * otherwise, any error could make it unrecoverable after
+				 * recovery.
+				 */
+				START_CRIT_SECTION();
+
+				/*
 				 * Insert tuple on new page, using _hash_pgaddtup to ensure
 				 * correct ordering by hashkey.  This is a tad inefficient
 				 * since we may have to shuffle itempointers repeatedly.
@@ -994,6 +1183,8 @@ _hash_splitbucket_guts(Relation rel,
 				 */
 				(void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
 
+				END_CRIT_SECTION();
+
 				/* be tidy */
 				pfree(new_itup);
 			}
@@ -1016,7 +1207,16 @@ _hash_splitbucket_guts(Relation rel,
 
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
+		{
+			/* log the split operation before releasing the lock */
+			log_split_page(rel, nbuf);
+
+			if (nbuf == bucket_nbuf)
+				LockBuffer(nbuf, BUFFER_LOCK_UNLOCK);
+			else
+				_hash_relbuf(rel, nbuf);
 			break;
+		}
 
 		/* Else, advance to next old page */
 		obuf = _hash_getbuf(rel, oblkno, HASH_READ, LH_OVERFLOW_PAGE);
@@ -1032,17 +1232,6 @@ _hash_splitbucket_guts(Relation rel,
 	 * To avoid deadlocks due to locking order of buckets, first lock the old
 	 * bucket and then the new bucket.
 	 */
-	if (nbuf == bucket_nbuf)
-	{
-		MarkBufferDirty(bucket_nbuf);
-		LockBuffer(bucket_nbuf, BUFFER_LOCK_UNLOCK);
-	}
-	else
-	{
-		MarkBufferDirty(nbuf);
-		_hash_relbuf(rel, nbuf);
-	}
-
 	LockBuffer(bucket_obuf, BUFFER_LOCK_EXCLUSIVE);
 	opage = BufferGetPage(bucket_obuf);
 	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
@@ -1051,6 +1240,8 @@ _hash_splitbucket_guts(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	START_CRIT_SECTION();
+
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
 	nopaque->hasho_flag &= ~LH_BUCKET_BEING_POPULATED;
 
@@ -1067,6 +1258,29 @@ _hash_splitbucket_guts(Relation rel,
 	 */
 	MarkBufferDirty(bucket_obuf);
 	MarkBufferDirty(bucket_nbuf);
+
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+		xl_hash_split_complete xlrec;
+
+		xlrec.old_bucket_flag = oopaque->hasho_flag;
+		xlrec.new_bucket_flag = nopaque->hasho_flag;
+
+		XLogBeginInsert();
+
+		XLogRegisterData((char *) &xlrec, SizeOfHashSplitComplete);
+
+		XLogRegisterBuffer(0, bucket_obuf, REGBUF_STANDARD);
+		XLogRegisterBuffer(1, bucket_nbuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_COMPLETE);
+
+		PageSetLSN(BufferGetPage(bucket_obuf), recptr);
+		PageSetLSN(BufferGetPage(bucket_nbuf), recptr);
+	}
+
+	END_CRIT_SECTION();
 }
 
 /*
@@ -1183,11 +1397,44 @@ _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Bucket obucket,
 	npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 	nbucket = npageopaque->hasho_bucket;
 
-	_hash_splitbucket_guts(rel, metabuf, obucket,
-						   nbucket, obuf, bucket_nbuf, tidhtab,
-						   maxbucket, highmask, lowmask);
+	_hash_splitbucket(rel, metabuf, obucket,
+					  nbucket, obuf, bucket_nbuf, tidhtab,
+					  maxbucket, highmask, lowmask);
 
 	_hash_relbuf(rel, bucket_nbuf);
 	LockBuffer(obuf, BUFFER_LOCK_UNLOCK);
 	hash_destroy(tidhtab);
 }
+
+/*
+ *	log_split_page() -- Log the split operation
+ *
+ *	We log the split operation when the new page in the new bucket gets
+ *	full, so we log the entire page.
+ *
+ *	'buf' must be locked by the caller, which is also responsible for
+ *	unlocking it.
+ */
+static void
+log_split_page(Relation rel, Buffer buf)
+{
+	START_CRIT_SECTION();
+
+	MarkBufferDirty(buf);
+
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+
+		XLogBeginInsert();
+
+		XLogRegisterBuffer(0, buf, REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_PAGE);
+
+		PageSetLSN(BufferGetPage(buf), recptr);
+	}
+
+	END_CRIT_SECTION();
+}
+
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index c0bdfe6..9aaee1e 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -123,6 +123,7 @@ _hash_readnext(IndexScanDesc scan,
 	if (block_found)
 	{
 		*pagep = BufferGetPage(*bufp);
+		TestForOldSnapshot(scan->xs_snapshot, rel, *pagep);
 		*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
 	}
 }
@@ -159,6 +160,7 @@ _hash_readprev(IndexScanDesc scan,
 		*bufp = _hash_getbuf(rel, blkno, HASH_READ,
 							 LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
 		*pagep = BufferGetPage(*bufp);
+		TestForOldSnapshot(scan->xs_snapshot, rel, *pagep);
 		*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
 
 		/*
@@ -328,6 +330,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 	_hash_dropbuf(rel, metabuf);
 
 	page = BufferGetPage(buf);
+	TestForOldSnapshot(scan->xs_snapshot, rel, page);
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	Assert(opaque->hasho_bucket == bucket);
 
@@ -362,6 +365,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
 		old_buf = _hash_getbuf(rel, old_blkno, HASH_READ, LH_BUCKET_PAGE);
+		TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(old_buf));
 
 		/*
 		 * remember the split bucket buffer so as to use it later for
@@ -564,6 +568,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					_hash_readprev(scan, &buf, &page, &opaque);
 					if (BufferIsValid(buf))
 					{
+						TestForOldSnapshot(scan->xs_snapshot, rel, page);
 						maxoff = PageGetMaxOffsetNumber(page);
 						offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
 					}
diff --git a/src/backend/access/rmgrdesc/hashdesc.c b/src/backend/access/rmgrdesc/hashdesc.c
index 7eac819..5e3f7d8 100644
--- a/src/backend/access/rmgrdesc/hashdesc.c
+++ b/src/backend/access/rmgrdesc/hashdesc.c
@@ -19,10 +19,143 @@
 void
 hash_desc(StringInfo buf, XLogReaderState *record)
 {
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			{
+				xl_hash_init_meta_page *xlrec = (xl_hash_init_meta_page *) rec;
+
+				appendStringInfo(buf, "num_tuples %g, fillfactor %d",
+								 xlrec->num_tuples, xlrec->ffactor);
+				break;
+			}
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			{
+				xl_hash_init_bitmap_page *xlrec = (xl_hash_init_bitmap_page *) rec;
+
+				appendStringInfo(buf, "bmsize %d", xlrec->bmsize);
+				break;
+			}
+		case XLOG_HASH_INSERT:
+			{
+				xl_hash_insert *xlrec = (xl_hash_insert *) rec;
+
+				appendStringInfo(buf, "off %u", xlrec->offnum);
+				break;
+			}
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			{
+				xl_hash_addovflpage *xlrec = (xl_hash_addovflpage *) rec;
+
+				appendStringInfo(buf, "bmsize %d, bmpage_found %c",
+						   xlrec->bmsize, (xlrec->bmpage_found) ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			{
+				xl_hash_split_allocpage *xlrec = (xl_hash_split_allocpage *) rec;
+
+				appendStringInfo(buf, "new_bucket %u, meta_page_masks_updated %c, issplitpoint_changed %c",
+								 xlrec->new_bucket,
+					(xlrec->flags & XLH_SPLIT_META_UPDATE_MASKS) ? 'T' : 'F',
+								 (xlrec->flags & XLH_SPLIT_META_UPDATE_SPLITPOINT) ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SPLIT_COMPLETE:
+			{
+				xl_hash_split_complete *xlrec = (xl_hash_split_complete *) rec;
+
+				appendStringInfo(buf, "split_complete_old_bucket %c, split_complete_new_bucket %c",
+				(xlrec->old_bucket_flag & LH_BUCKET_BEING_SPLIT) ? 'F' : 'T',
+								 (xlrec->new_bucket_flag & LH_BUCKET_BEING_POPULATED) ? 'F' : 'T');
+				break;
+			}
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			{
+				xl_hash_move_page_contents *xlrec = (xl_hash_move_page_contents *) rec;
+
+				appendStringInfo(buf, "ntups %d, is_primary %c",
+								 xlrec->ntups,
+								 xlrec->is_prim_bucket_same_wrt ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SQUEEZE_PAGE:
+			{
+				xl_hash_squeeze_page *xlrec = (xl_hash_squeeze_page *) rec;
+
+				appendStringInfo(buf, "prevblkno %u, nextblkno %u, ntups %d, is_primary %c",
+								 xlrec->prevblkno,
+								 xlrec->nextblkno,
+								 xlrec->ntups,
+								 xlrec->is_prim_bucket_same_wrt ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_DELETE:
+			{
+				xl_hash_delete *xlrec = (xl_hash_delete *) rec;
+
+				appendStringInfo(buf, "is_primary %c",
+								 xlrec->is_primary_bucket_page ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_UPDATE_META_PAGE:
+			{
+				xl_hash_update_meta_page *xlrec = (xl_hash_update_meta_page *) rec;
+
+				appendStringInfo(buf, "ntuples %g",
+								 xlrec->ntuples);
+				break;
+			}
+	}
 }
 
 const char *
 hash_identify(uint8 info)
 {
-	return NULL;
+	const char *id = NULL;
+
+	switch (info & ~XLR_INFO_MASK)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			id = "INIT_META_PAGE";
+			break;
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			id = "INIT_BITMAP_PAGE";
+			break;
+		case XLOG_HASH_INSERT:
+			id = "INSERT";
+			break;
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			id = "ADD_OVFL_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			id = "SPLIT_ALLOCATE_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_PAGE:
+			id = "SPLIT_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_COMPLETE:
+			id = "SPLIT_COMPLETE";
+			break;
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			id = "MOVE_PAGE_CONTENTS";
+			break;
+		case XLOG_HASH_SQUEEZE_PAGE:
+			id = "SQUEEZE_PAGE";
+			break;
+		case XLOG_HASH_DELETE:
+			id = "DELETE";
+			break;
+		case XLOG_HASH_SPLIT_CLEANUP:
+			id = "SPLIT_CLEANUP";
+			break;
+		case XLOG_HASH_UPDATE_META_PAGE:
+			id = "UPDATE_META_PAGE";
+			break;
+	}
+
+	return id;
 }
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index b6fa5a0..5bfa3b4 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -502,11 +502,6 @@ DefineIndex(Oid relationId,
 	accessMethodForm = (Form_pg_am) GETSTRUCT(tuple);
 	amRoutine = GetIndexAmRoutine(accessMethodForm->amhandler);
 
-	if (strcmp(accessMethodName, "hash") == 0 &&
-		RelationNeedsWAL(rel))
-		ereport(WARNING,
-				(errmsg("hash indexes are not WAL-logged and their use is discouraged")));
-
 	if (stmt->unique && !amRoutine->amcanunique)
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 6fc5fa4..0e0605a 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -598,6 +598,33 @@ PageGetFreeSpace(Page page)
 }
 
 /*
+ * PageGetFreeSpaceForMulTups
+ *		Returns the size of the free (allocatable) space on a page,
+ *		reduced by the space needed for multiple new line pointers.
+ *
+ * Note: this should usually only be used on index pages.  Use
+ * PageGetHeapFreeSpace on heap pages.
+ */
+Size
+PageGetFreeSpaceForMulTups(Page page, int ntups)
+{
+	int			space;
+
+	/*
+	 * Use signed arithmetic here so that we behave sensibly if pd_lower >
+	 * pd_upper.
+	 */
+	space = (int) ((PageHeader) page)->pd_upper -
+		(int) ((PageHeader) page)->pd_lower;
+
+	if (space < (int) (ntups * sizeof(ItemIdData)))
+		return 0;
+	space -= ntups * sizeof(ItemIdData);
+
+	return (Size) space;
+}
+
+/*
  * PageGetExactFreeSpace
  *		Returns the size of the free (allocatable) space on a page,
  *		without any consideration for adding/removing line pointers.
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 24678fc..b37181c 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -5731,13 +5731,10 @@ RelationIdIsInInitFile(Oid relationId)
 /*
  * Tells whether any index for the relation is unlogged.
  *
- * Any index using the hash AM is implicitly unlogged.
- *
  * Note: There doesn't seem to be any way to have an unlogged index attached
- * to a permanent table except to create a hash index, but it seems best to
- * keep this general so that it returns sensible results even when they seem
- * obvious (like for an unlogged table) and to handle possible future unlogged
- * indexes on permanent tables.
+ * to a permanent table, but it seems best to keep this general so that it
+ * returns sensible results even when they seem obvious (like for an unlogged
+ * table) and to handle possible future unlogged indexes on permanent tables.
  */
 bool
 RelationHasUnloggedIndex(Relation rel)
@@ -5759,8 +5756,7 @@ RelationHasUnloggedIndex(Relation rel)
 			elog(ERROR, "cache lookup failed for relation %u", indexoid);
 		reltup = (Form_pg_class) GETSTRUCT(tp);
 
-		if (reltup->relpersistence == RELPERSISTENCE_UNLOGGED
-			|| reltup->relam == HASH_AM_OID)
+		if (reltup->relpersistence == RELPERSISTENCE_UNLOGGED)
 			result = true;
 
 		ReleaseSysCache(tp);
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index b0a1131..8328fc5 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -310,13 +310,15 @@ extern Datum hash_uint32(uint32 k);
 extern void _hash_doinsert(Relation rel, IndexTuple itup);
 extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
 			   Size itemsize, IndexTuple itup);
+extern void _hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
+					OffsetNumber *itup_offsets, uint16 nitups);
 
 /* hashovfl.c */
 extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin);
-extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
-				   BufferAccessStrategy bstrategy);
-extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
-				 BlockNumber blkno, ForkNumber forkNum);
+extern BlockNumber _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
+				   Buffer wbuf, IndexTuple *itups, OffsetNumber *itup_offsets,
+			 Size *tups_size, uint16 nitups, BufferAccessStrategy bstrategy);
+extern void _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage);
 extern void _hash_squeezebucket(Relation rel,
 					Bucket bucket, BlockNumber bucket_blkno,
 					Buffer bucket_buf,
@@ -328,6 +330,8 @@ extern Buffer _hash_getbuf(Relation rel, BlockNumber blkno,
 extern Buffer _hash_getbuf_with_condlock_cleanup(Relation rel,
 								   BlockNumber blkno, int flags);
 extern Buffer _hash_getinitbuf(Relation rel, BlockNumber blkno);
+extern void _hash_initbuf(Buffer buf, uint32 num_bucket, uint32 flag,
+			  bool initpage);
 extern Buffer _hash_getnewbuf(Relation rel, BlockNumber blkno,
 				ForkNumber forkNum);
 extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
@@ -336,8 +340,10 @@ extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
 extern void _hash_relbuf(Relation rel, Buffer buf);
 extern void _hash_dropbuf(Relation rel, Buffer buf);
 extern void _hash_dropscanbuf(Relation rel, HashScanOpaque so);
-extern uint32 _hash_metapinit(Relation rel, double num_tuples,
-				ForkNumber forkNum);
+extern uint32 _hash_init(Relation rel, double num_tuples,
+		   ForkNumber forkNum);
+extern void _hash_init_metabuffer(Buffer buf, double num_tuples,
+					  RegProcedure procid, uint16 ffactor, bool initpage);
 extern void _hash_pageinit(Page page, Size size);
 extern void _hash_expandtable(Relation rel, Buffer metabuf);
 extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index ed3c37f..c53f878 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -17,6 +17,245 @@
 #include "access/hash.h"
 #include "access/xlogreader.h"
 
+/* Number of buffers required for XLOG_HASH_SQUEEZE_PAGE operation */
+#define HASH_XLOG_FREE_OVFL_BUFS	6
+
+/*
+ * XLOG records for hash operations
+ */
+#define XLOG_HASH_INIT_META_PAGE	0x00		/* initialize the meta page */
+#define XLOG_HASH_INIT_BITMAP_PAGE	0x10		/* initialize the bitmap page */
+#define XLOG_HASH_INSERT		0x20	/* add index tuple without split */
+#define XLOG_HASH_ADD_OVFL_PAGE 0x30	/* add overflow page */
+#define XLOG_HASH_SPLIT_ALLOCATE_PAGE	0x40	/* allocate new page for split */
+#define XLOG_HASH_SPLIT_PAGE	0x50	/* split page */
+#define XLOG_HASH_SPLIT_COMPLETE	0x60		/* completion of split
+												 * operation */
+#define XLOG_HASH_MOVE_PAGE_CONTENTS	0x70	/* remove tuples from one page
+												 * and add to another page */
+#define XLOG_HASH_SQUEEZE_PAGE	0x80	/* add tuples to one of the previous
+										 * pages in chain and free the ovfl
+										 * page */
+#define XLOG_HASH_DELETE		0x90	/* delete index tuples from a page */
+#define XLOG_HASH_SPLIT_CLEANUP 0xA0	/* clear split-cleanup flag in primary
+										 * bucket page after deleting tuples
+										 * that are moved due to split	*/
+#define XLOG_HASH_UPDATE_META_PAGE	0xB0		/* update meta page after
+												 * vacuum */
+
+
+/*
+ * xl_hash_split_allocpage flag values, 8 bits are available.
+ */
+#define XLH_SPLIT_META_UPDATE_MASKS		(1<<0)
+#define XLH_SPLIT_META_UPDATE_SPLITPOINT		(1<<1)
+
+/*
+ * Data to regenerate the meta-data page
+ */
+typedef struct xl_hash_metadata
+{
+	HashMetaPageData metadata;
+}	xl_hash_metadata;
+
+/*
+ * This is what we need to know about a HASH index create.
+ *
+ * Backup block 0: metapage
+ */
+typedef struct xl_hash_createidx
+{
+	double		num_tuples;
+	RegProcedure procid;
+	uint16		ffactor;
+}	xl_hash_createidx;
+#define SizeOfHashCreateIdx (offsetof(xl_hash_createidx, ffactor) + sizeof(uint16))
+
+/*
+ * This is what we need to know about simple (without split) insert.
+ *
+ * This data record is used for XLOG_HASH_INSERT
+ *
+ * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 1: metapage (xl_hash_metadata)
+ */
+typedef struct xl_hash_insert
+{
+	OffsetNumber offnum;
+}	xl_hash_insert;
+
+#define SizeOfHashInsert	(offsetof(xl_hash_insert, offnum) + sizeof(OffsetNumber))
+
+/*
+ * This is what we need to know about addition of overflow page.
+ *
+ * This data record is used for XLOG_HASH_ADD_OVFL_PAGE
+ *
+ * Backup Blk 0: newly allocated overflow page
+ * Backup Blk 1: page before new overflow page in the bucket chain
+ * Backup Blk 2: bitmap page
+ * Backup Blk 3: new bitmap page
+ * Backup Blk 4: metapage
+ */
+typedef struct xl_hash_addovflpage
+{
+	uint16		bmsize;
+	bool		bmpage_found;
+}	xl_hash_addovflpage;
+
+#define SizeOfHashAddOvflPage	\
+	(offsetof(xl_hash_addovflpage, bmpage_found) + sizeof(bool))
+
+/*
+ * This is what we need to know about allocating a page for split.
+ *
+ * This data record is used for XLOG_HASH_SPLIT_ALLOCATE_PAGE
+ *
+ * Backup Blk 0: page for old bucket
+ * Backup Blk 1: page for new bucket
+ * Backup Blk 2: metapage
+ */
+typedef struct xl_hash_split_allocpage
+{
+	uint32		new_bucket;
+	uint16		old_bucket_flag;
+	uint16		new_bucket_flag;
+	uint8		flags;
+}	xl_hash_split_allocpage;
+
+#define SizeOfHashSplitAllocPage	\
+	(offsetof(xl_hash_split_allocpage, flags) + sizeof(uint8))
+
+/*
+ * This is what we need to know about completing the split operation.
+ *
+ * This data record is used for XLOG_HASH_SPLIT_COMPLETE
+ *
+ * Backup Blk 0: page for old bucket
+ * Backup Blk 1: page for new bucket
+ */
+typedef struct xl_hash_split_complete
+{
+	uint16		old_bucket_flag;
+	uint16		new_bucket_flag;
+}	xl_hash_split_complete;
+
+#define SizeOfHashSplitComplete \
+	(offsetof(xl_hash_split_complete, new_bucket_flag) + sizeof(uint16))
+
+/*
+ * This is what we need to know about moving page contents, as required
+ * during a squeeze operation.
+ *
+ * This data record is used for XLOG_HASH_MOVE_PAGE_CONTENTS
+ *
+ * Backup Blk 0: bucket page
+ * Backup Blk 1: page containing moved tuples
+ * Backup Blk 2: page from which tuples will be removed
+ */
+typedef struct xl_hash_move_page_contents
+{
+	uint16		ntups;
+	bool		is_prim_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is same as
+												 * primary bucket page */
+}	xl_hash_move_page_contents;
+
+#define SizeOfHashMovePageContents	\
+	(offsetof(xl_hash_move_page_contents, is_prim_bucket_same_wrt) + sizeof(bool))
+
+/*
+ * This is what we need to know about the squeeze page operation.
+ *
+ * This data record is used for XLOG_HASH_SQUEEZE_PAGE
+ *
+ * Backup Blk 0: page containing tuples moved from freed overflow page
+ * Backup Blk 1: freed overflow page
+ * Backup Blk 2: page previous to the freed overflow page
+ * Backup Blk 3: page next to the freed overflow page
+ * Backup Blk 4: bitmap page containing info of freed overflow page
+ * Backup Blk 5: meta page
+ */
+typedef struct xl_hash_squeeze_page
+{
+	BlockNumber prevblkno;
+	BlockNumber nextblkno;
+	uint16		ntups;
+	bool		is_prim_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is same as
+												 * primary bucket page */
+	bool		is_prev_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is the
+												 * page previous to the freed
+												 * overflow page */
+}	xl_hash_squeeze_page;
+
+#define SizeOfHashSqueezePage	\
+	(offsetof(xl_hash_squeeze_page, is_prev_bucket_same_wrt) + sizeof(bool))
+
+/*
+ * This is what we need to know about the deletion of index tuples from a page.
+ *
+ * This data record is used for XLOG_HASH_DELETE
+ *
+ * Backup Blk 0: primary bucket page
+ * Backup Blk 1: page from which tuples are deleted
+ */
+typedef struct xl_hash_delete
+{
+	bool		is_primary_bucket_page; /* TRUE if the operation is for
+										 * primary bucket page */
+}	xl_hash_delete;
+
+#define SizeOfHashDelete	(offsetof(xl_hash_delete, is_primary_bucket_page) + sizeof(bool))
+
+/*
+ * This is what we need for metapage update operation.
+ *
+ * This data record is used for XLOG_HASH_UPDATE_META_PAGE
+ *
+ * Backup Blk 0: meta page
+ */
+typedef struct xl_hash_update_meta_page
+{
+	double		ntuples;
+}	xl_hash_update_meta_page;
+
+#define SizeOfHashUpdateMetaPage	\
+	(offsetof(xl_hash_update_meta_page, ntuples) + sizeof(double))
+
+/*
+ * This is what we need to initialize metapage.
+ *
+ * This data record is used for XLOG_HASH_INIT_META_PAGE
+ *
+ * Backup Blk 0: meta page
+ */
+typedef struct xl_hash_init_meta_page
+{
+	double		num_tuples;
+	RegProcedure procid;
+	uint16		ffactor;
+}	xl_hash_init_meta_page;
+
+#define SizeOfHashInitMetaPage		\
+	(offsetof(xl_hash_init_meta_page, ffactor) + sizeof(uint16))
+
+/*
+ * This is what we need to initialize bitmap page.
+ *
+ * This data record is used for XLOG_HASH_INIT_BITMAP_PAGE
+ *
+ * Backup Blk 0: bitmap page
+ * Backup Blk 1: meta page
+ */
+typedef struct xl_hash_init_bitmap_page
+{
+	uint16		bmsize;
+}	xl_hash_init_bitmap_page;
+
+#define SizeOfHashInitBitmapPage	\
+	(offsetof(xl_hash_init_bitmap_page, bmsize) + sizeof(uint16))
 
 extern void hash_redo(XLogReaderState *record);
 extern void hash_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 294f9cb..dfbc55e 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -425,6 +425,7 @@ extern Page PageGetTempPageCopySpecial(Page page);
 extern void PageRestoreTempPage(Page tempPage, Page oldPage);
 extern void PageRepairFragmentation(Page page);
 extern Size PageGetFreeSpace(Page page);
+extern Size PageGetFreeSpaceForMulTups(Page page, int ntups);
 extern Size PageGetExactFreeSpace(Page page);
 extern Size PageGetHeapFreeSpace(Page page);
 extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index e663f9a..6b2f693 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -2335,13 +2335,9 @@ Options: fastupdate=on, gin_pending_list_limit=128
 -- HASH
 --
 CREATE INDEX hash_i4_index ON hash_i4_heap USING hash (random int4_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_name_index ON hash_name_heap USING hash (random name_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_txt_index ON hash_txt_heap USING hash (random text_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_f8_index ON hash_f8_heap USING hash (random float8_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE UNLOGGED TABLE unlogged_hash_table (id int4);
 CREATE INDEX unlogged_hash_index ON unlogged_hash_table USING hash (id int4_ops);
 DROP TABLE unlogged_hash_table;
@@ -2350,7 +2346,6 @@ DROP TABLE unlogged_hash_table;
 -- maintenance_work_mem setting and fillfactor:
 SET maintenance_work_mem = '1MB';
 CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 EXPLAIN (COSTS OFF)
 SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
                       QUERY PLAN                       
diff --git a/src/test/regress/expected/enum.out b/src/test/regress/expected/enum.out
index 514d1d0..0e60304 100644
--- a/src/test/regress/expected/enum.out
+++ b/src/test/regress/expected/enum.out
@@ -383,7 +383,6 @@ DROP INDEX enumtest_btree;
 -- Hash index / opclass with the = operator
 --
 CREATE INDEX enumtest_hash ON enumtest USING hash (col);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 SELECT * FROM enumtest WHERE col = 'orange';
   col   
 --------
diff --git a/src/test/regress/expected/hash_index.out b/src/test/regress/expected/hash_index.out
index f8b9f02..0a18efa 100644
--- a/src/test/regress/expected/hash_index.out
+++ b/src/test/regress/expected/hash_index.out
@@ -201,7 +201,6 @@ SELECT h.seqno AS f20000
 --
 CREATE TABLE hash_split_heap (keycol INT);
 CREATE INDEX hash_split_index on hash_split_heap USING HASH (keycol);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 INSERT INTO hash_split_heap SELECT 1 FROM generate_series(1, 70000) a;
 VACUUM FULL hash_split_heap;
 -- Let's do a backward scan.
@@ -230,5 +229,4 @@ DROP TABLE hash_temp_heap CASCADE;
 CREATE TABLE hash_heap_float4 (x float4, y int);
 INSERT INTO hash_heap_float4 VALUES (1.1,1);
 CREATE INDEX hash_idx ON hash_heap_float4 USING hash (x);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 DROP TABLE hash_heap_float4 CASCADE;
diff --git a/src/test/regress/expected/macaddr.out b/src/test/regress/expected/macaddr.out
index e84ff5f..151f9ce 100644
--- a/src/test/regress/expected/macaddr.out
+++ b/src/test/regress/expected/macaddr.out
@@ -41,7 +41,6 @@ SELECT * FROM macaddr_data;
 
 CREATE INDEX macaddr_data_btree ON macaddr_data USING btree (b);
 CREATE INDEX macaddr_data_hash ON macaddr_data USING hash (b);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 SELECT a, b, trunc(b) FROM macaddr_data ORDER BY 2, 1;
  a  |         b         |       trunc       
 ----+-------------------+-------------------
diff --git a/src/test/regress/expected/replica_identity.out b/src/test/regress/expected/replica_identity.out
index 1a04ec5..f0bb253 100644
--- a/src/test/regress/expected/replica_identity.out
+++ b/src/test/regress/expected/replica_identity.out
@@ -12,7 +12,6 @@ CREATE UNIQUE INDEX test_replica_identity_keyab_key ON test_replica_identity (ke
 CREATE UNIQUE INDEX test_replica_identity_oid_idx ON test_replica_identity (oid);
 CREATE UNIQUE INDEX test_replica_identity_nonkey ON test_replica_identity (keya, nonkey);
 CREATE INDEX test_replica_identity_hash ON test_replica_identity USING hash (nonkey);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE UNIQUE INDEX test_replica_identity_expr ON test_replica_identity (keya, keyb, (3));
 CREATE UNIQUE INDEX test_replica_identity_partial ON test_replica_identity (keya, keyb) WHERE keyb != '3';
 -- default is 'd'/DEFAULT for user created tables
diff --git a/src/test/regress/expected/uuid.out b/src/test/regress/expected/uuid.out
index 59cb1e0..d907519 100644
--- a/src/test/regress/expected/uuid.out
+++ b/src/test/regress/expected/uuid.out
@@ -114,7 +114,6 @@ SELECT COUNT(*) FROM guid1 WHERE guid_field >= '22222222-2222-2222-2222-22222222
 -- btree and hash index creation test
 CREATE INDEX guid1_btree ON guid1 USING BTREE (guid_field);
 CREATE INDEX guid1_hash  ON guid1 USING HASH  (guid_field);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 -- unique index test
 CREATE UNIQUE INDEX guid1_unique_BTREE ON guid1 USING BTREE (guid_field);
 -- should fail
#68Michael Paquier
michael.paquier@gmail.com
In reply to: Amit Kapila (#67)
Re: Write Ahead Logging for Hash Indexes

On Fri, Jan 13, 2017 at 12:23 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jan 13, 2017 at 1:04 AM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:

On 12/27/2016 01:58 AM, Amit Kapila wrote:

After recent commits 7819ba1e and 25216c98, this patch requires a
rebase. Attached is the rebased patch.

This needs a rebase after commit e898437.

Attached find the rebased patch.

The patch applies, and there have been no reviews since the last
version was submitted, so it has been moved to CF 2017-03.
--
Michael


#69Jeff Janes
jeff.janes@gmail.com
In reply to: Amit Kapila (#67)
Re: Write Ahead Logging for Hash Indexes

On Thu, Jan 12, 2017 at 7:23 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Fri, Jan 13, 2017 at 1:04 AM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:

On 12/27/2016 01:58 AM, Amit Kapila wrote:

After recent commits 7819ba1e and 25216c98, this patch requires a
rebase. Attached is the rebased patch.

This needs a rebase after commit e898437.

Attached find the rebased patch.

Thanks!

I've put this through a lot of crash-recovery testing using different
variations of my previously posted testing harness, and have not had any
problems.

I occasionally get deadlocks which are properly detected. I think this is
just a user-space problem, not a bug, but will describe it anyway just in
case...

What happens is that connections occasionally create 10,000 nuisance tuples
all using the same randomly chosen negative integer index value (the one
the hash index is on), and then some time later delete those tuples using
the memorized negative index value, to force the page split and squeeze to
get exercised. If two connections happen to choose the same negative index
for their nuisance tuples, and then try to delete "their" tuples at the
same time, and they are deleting the same 20,000 "nuisance" tuples
concurrently with a bucket split/squeeze/something, they might see the
tuples in a different order from each other and so deadlock against each
other waiting on each other's transaction locks due to "locked" tuples. Is
that a plausible explanation?

Cheers,

Jeff

#70Amit Kapila
amit.kapila16@gmail.com
In reply to: Jeff Janes (#69)
Re: Write Ahead Logging for Hash Indexes

On Sat, Feb 4, 2017 at 4:37 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Thu, Jan 12, 2017 at 7:23 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Fri, Jan 13, 2017 at 1:04 AM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:

On 12/27/2016 01:58 AM, Amit Kapila wrote:

After recent commits 7819ba1e and 25216c98, this patch requires a
rebase. Attached is the rebased patch.

This needs a rebase after commit e898437.

Attached find the rebased patch.

Thanks!

I've put this through a lot of crash-recovery testing using different
variations of my previously posted testing harness, and have not had any
problems.

Thanks for detailed testing of this patch.

I occasionally get deadlocks which are properly detected. I think this is
just a user-space problem, not a bug, but will describe it anyway just in
case...

What happens is that connections occasionally create 10,000 nuisance tuples
all using the same randomly chosen negative integer index value (the one the
hash index is on), and then some time later delete those tuples using the
memorized negative index value, to force the page split and squeeze to get
exercised. If two connections happen to choose the same negative index for
their nuisance tuples, and then try to delete "their" tuples at the same
time, and they are deleting the same 20,000 "nuisance" tuples concurrently
with a bucket split/squeeze/something, they might see the tuples in a
different order from each other and so deadlock against each other waiting
on each other's transaction locks due to "locked" tuples. Is that a
plausible explanation?

Yes, that is the right explanation for the deadlock you are seeing. A few
days back Jesper and Ashutosh Sharma also faced the same problem [1].
To explain in brief, how this can happen is that the squeeze operation
moves tuples from the far end of the bucket to the rear end, and it is
quite possible that different backends see tuples with the same values
in a different order.

[1]: /messages/by-id/CAE9k0PmBGrNO_s+yKn72VO--ypaXWemOsizsWr5cHjEv0X4ERg@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#71Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#67)
Re: Write Ahead Logging for Hash Indexes

On Thu, Jan 12, 2017 at 10:23 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jan 13, 2017 at 1:04 AM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:

On 12/27/2016 01:58 AM, Amit Kapila wrote:

After recent commits 7819ba1e and 25216c98, this patch requires a
rebase. Attached is the rebased patch.

This needs a rebase after commit e898437.

Attached find the rebased patch.

Well, I've managed to break this again by committing more things.

I think it would be helpful if you broke this into a series of
preliminary refactoring patches and then a final patch that actually
adds WAL-logging. The things that look like preliminary refactoring to
me are:

- Adding _hash_pgaddmultitup and using it in various places.
- Adding and freeing overflow pages has been extensively reworked.
- Similarly, there is some refactoring of how bitmap pages get initialized.
- Index initialization has been rejiggered significantly.
- Bucket splits have been rejiggered.

Individually those changes don't look all that troublesome to review,
but altogether it's quite a lot.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#72Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#71)
1 attachment(s)
Re: Write Ahead Logging for Hash Indexes

On Thu, Feb 9, 2017 at 10:28 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jan 12, 2017 at 10:23 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jan 13, 2017 at 1:04 AM, Jesper Pedersen
<jesper.pedersen@redhat.com> wrote:

On 12/27/2016 01:58 AM, Amit Kapila wrote:

After recent commits 7819ba1e and 25216c98, this patch requires a
rebase. Attached is the rebased patch.

This needs a rebase after commit e898437.

Attached find the rebased patch.

Well, I've managed to break this again by committing more things.

I have again rebased it; this time it required a few more changes due to
the previous commits. As we now have to store the hashm_maxbucket number
in the prevblock number of the primary bucket page, we need to take care
of the same during replay wherever applicable. Also, there were a few new
tests in contrib modules which were failing. The attached patch is still a
rebased version of the previous patch; I will work on splitting it as per
your suggestion below, but I thought it better to post the rebased patch
as well.
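
To illustrate the kind of replay adjustment this implies, here is a minimal,
hypothetical sketch. The xl_hash_split_allocpage record and its new_bucket
field are taken from the attached patch, but the function name and the exact
placement of the assignment are assumptions, not the patch's actual code:

#include "postgres.h"

#include "access/hash_xlog.h"
#include "access/xlogutils.h"

/*
 * Hypothetical sketch: restore the max-bucket value that primary bucket
 * pages now cache in hasho_prevblkno, using the new bucket number carried
 * by the split-allocation WAL record.
 */
static void
hash_xlog_restore_prevblkno_sketch(XLogReaderState *record)
{
	XLogRecPtr	lsn = record->EndRecPtr;
	xl_hash_split_allocpage *xlrec =
		(xl_hash_split_allocpage *) XLogRecGetData(record);
	Buffer		oldbuf;

	if (XLogReadBufferForRedo(record, 0, &oldbuf) == BLK_NEEDS_REDO)
	{
		Page		oldpage = BufferGetPage(oldbuf);
		HashPageOpaque oldopaque = (HashPageOpaque) PageGetSpecialPointer(oldpage);

		/* primary bucket page caches the current split point here */
		oldopaque->hasho_prevblkno = xlrec->new_bucket;

		PageSetLSN(oldpage, lsn);
		MarkBufferDirty(oldbuf);
	}
	if (BufferIsValid(oldbuf))
		UnlockReleaseBuffer(oldbuf);
}

In practice such an assignment would live inside the existing redo routines
(for example the split-allocation replay) rather than in a separate function;
the sketch only isolates the one field that the rebase has to start maintaining.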

I think it would be helpful if you broke this into a series of
preliminary refactoring patches and then a final patch that actually
adds WAL-logging.

Okay, that makes sense to me as well.

The things that look like preliminary refactoring to
me are:

- Adding _hash_pgaddmultitup and using it in various places.
- Adding and freeing overflow pages has been extensively reworked.

Freeing the overflow page is too tightly coupled with the changes related
to _hash_pgaddmultitup, so it might be better to keep those together.
However, I think we can prepare a separate patch for the changes related
to adding the overflow page.

- Similarly, there is some refactoring of how bitmap pages get initialized.
- Index initialization has been rejiggered significantly.
- Bucket splits have been rejiggered.

Will try to prepare separate patches for the above three.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

wal_hash_index_v9.patch (application/octet-stream)
diff --git a/contrib/pageinspect/expected/hash.out b/contrib/pageinspect/expected/hash.out
index 8ed60bc..3ba01f6 100644
--- a/contrib/pageinspect/expected/hash.out
+++ b/contrib/pageinspect/expected/hash.out
@@ -1,7 +1,6 @@
 CREATE TABLE test_hash (a int, b text);
 INSERT INTO test_hash VALUES (1, 'one');
 CREATE INDEX test_hash_a_idx ON test_hash USING hash (a);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 \x
 SELECT hash_page_type(get_raw_page('test_hash_a_idx', 0));
 -[ RECORD 1 ]--+---------
diff --git a/contrib/pgstattuple/expected/pgstattuple.out b/contrib/pgstattuple/expected/pgstattuple.out
index 169d193..7d7ade8 100644
--- a/contrib/pgstattuple/expected/pgstattuple.out
+++ b/contrib/pgstattuple/expected/pgstattuple.out
@@ -131,7 +131,6 @@ select * from pgstatginindex('test_ginidx');
 (1 row)
 
 create index test_hashidx on test using hash (b);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 select * from pgstathashindex('test_hashidx');
  version | bucket_pages | overflow_pages | bitmap_pages | zero_pages | live_items | dead_items | free_percent 
 ---------+--------------+----------------+--------------+------------+------------+------------+--------------
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 5f009ee..5a7c730 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1537,19 +1537,6 @@ archive_command = 'local_backup_script.sh "%p" "%f"'
   <itemizedlist>
    <listitem>
     <para>
-     Operations on hash indexes are not presently WAL-logged, so
-     replay will not update these indexes.  This will mean that any new inserts
-     will be ignored by the index, updated rows will apparently disappear and
-     deleted rows will still retain pointers. In other words, if you modify a
-     table with a hash index on it then you will get incorrect query results
-     on a standby server.  When recovery completes it is recommended that you
-     manually <xref linkend="sql-reindex">
-     each such index after completing a recovery operation.
-    </para>
-   </listitem>
-
-   <listitem>
-    <para>
      If a <xref linkend="sql-createdatabase">
      command is executed while a base backup is being taken, and then
      the template database that the <command>CREATE DATABASE</> copied
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index dc63d7d..a9d926d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2152,10 +2152,9 @@ include_dir 'conf.d'
          has materialized a result set, no error will be generated even if the
          underlying rows in the referenced table have been vacuumed away.
          Some tables cannot safely be vacuumed early, and so will not be
-         affected by this setting.  Examples include system catalogs and any
-         table which has a hash index.  For such tables this setting will
-         neither reduce bloat nor create a possibility of a <literal>snapshot
-         too old</> error on scanning.
+         affected by this setting.  Examples include system catalogs.  For
+         such tables this setting will neither reduce bloat nor create a
+         possibility of a <literal>snapshot too old</> error on scanning.
         </para>
        </listitem>
       </varlistentry>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 48de2ce..0e61991 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2353,12 +2353,6 @@ LOG:  database system is ready to accept read only connections
   <itemizedlist>
    <listitem>
     <para>
-     Operations on hash indexes are not presently WAL-logged, so
-     replay will not update these indexes.
-    </para>
-   </listitem>
-   <listitem>
-    <para>
      Full knowledge of running transactions is required before snapshots
      can be taken. Transactions that use large numbers of subtransactions
      (currently greater than 64) will delay the start of read only
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 271c135..e40750e 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -193,18 +193,6 @@ CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable>
 </synopsis>
   </para>
 
-  <caution>
-   <para>
-    Hash index operations are not presently WAL-logged,
-    so hash indexes might need to be rebuilt with <command>REINDEX</>
-    after a database crash if there were unwritten changes.
-    Also, changes to hash indexes are not replicated over streaming or
-    file-based replication after the initial base backup, so they
-    give wrong answers to queries that subsequently use them.
-    For these reasons, hash index use is presently discouraged.
-   </para>
-  </caution>
-
   <para>
    <indexterm>
     <primary>index</primary>
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index fcb7a60..7163b03 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -510,19 +510,6 @@ Indexes:
    they can be useful.
   </para>
 
-  <caution>
-   <para>
-    Hash index operations are not presently WAL-logged,
-    so hash indexes might need to be rebuilt with <command>REINDEX</>
-    after a database crash if there were unwritten changes.
-    Also, changes to hash indexes are not replicated over streaming or
-    file-based replication after the initial base backup, so they
-    give wrong answers to queries that subsequently use them.
-    Hash indexes are also not properly restored during point-in-time
-    recovery.  For these reasons, hash index use is presently discouraged.
-   </para>
-  </caution>
-
   <para>
    Currently, only the B-tree, GiST, GIN, and BRIN index methods support
    multicolumn indexes. Up to 32 fields can be specified by default.
diff --git a/src/backend/access/hash/Makefile b/src/backend/access/hash/Makefile
index e2e7e91..b154569 100644
--- a/src/backend/access/hash/Makefile
+++ b/src/backend/access/hash/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = hash.o hashfunc.o hashinsert.o hashovfl.o hashpage.o hashsearch.o \
-       hashsort.o hashutil.o hashvalidate.o
+       hashsort.o hashutil.o hashvalidate.o hash_xlog.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 703ae98..1c1f0db 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -287,13 +287,17 @@ The insertion algorithm is rather similar:
 	if current page is full, release lock but not pin, read/exclusive-lock
      next page; repeat as needed
 	>> see below if no space in any page of bucket
+	take buffer content lock in exclusive mode on metapage
 	insert tuple at appropriate place in page
-	mark current page dirty and release buffer content lock and pin
-	if the current page is not a bucket page, release the pin on bucket page
-	pin meta page and take buffer content lock in exclusive mode
+	mark current page dirty
 	increment tuple count, decide if split needed
-	mark meta page dirty and release buffer content lock and pin
-	done if no split needed, else enter Split algorithm below
+	mark meta page dirty
+	write WAL for insertion of tuple
+	release the buffer content lock on metapage
+	release buffer content lock on current page
+	if current page is not a bucket page, release the pin on bucket page
+	if split is needed, enter Split algorithm below
+	release the pin on metapage
 
 To speed searches, the index entries within any individual index page are
 kept sorted by hash code; the insertion code must take care to insert new
@@ -328,12 +332,17 @@ existing bucket in two, thereby lowering the fill ratio:
        try to finish the split and the cleanup work
        if that succeeds, start over; if it fails, give up
 	mark the old and new buckets indicating split is in progress
+	mark both old and new buckets as dirty
+	write WAL for allocation of new page for split
 	copy the tuples that belongs to new bucket from old bucket, marking
      them as moved-by-split
+	write WAL record for moving tuples to new page once the new page is full
+	or all the pages of old bucket are finished
 	release lock but not pin for primary bucket page of old bucket,
 	 read/shared-lock next page; repeat as needed
 	clear the bucket-being-split and bucket-being-populated flags
 	mark the old bucket indicating split-cleanup
+	write WAL for changing the flags on both old and new buckets
 
 The split operation's attempt to acquire cleanup-lock on the old bucket number
 could fail if another process holds any lock or pin on it.  We do not want to
@@ -369,6 +378,8 @@ The fourth operation is garbage collection (bulk deletion):
 		acquire cleanup lock on primary bucket page
 		loop:
 			scan and remove tuples
+			mark the target page dirty
+			write WAL for deleting tuples from target page
 			if this is the last bucket page, break out of loop
 			pin and x-lock next page
 			release prior lock and pin (except keep pin on primary bucket page)
@@ -383,7 +394,8 @@ The fourth operation is garbage collection (bulk deletion):
 	check if number of buckets changed
 	if so, release content lock and pin and return to for-each-bucket loop
 	else update metapage tuple count
-	mark meta page dirty and release buffer content lock and pin
+	 mark meta page dirty and write WAL for update of metapage
+	 release buffer content lock and pin
 
 Note that this is designed to allow concurrent splits and scans.  If a split
 occurs, tuples relocated into the new bucket will be visited twice by the
@@ -425,18 +437,16 @@ Obtaining an overflow page:
 	search for a free page (zero bit in bitmap)
 	if found:
 		set bit in bitmap
-		mark bitmap page dirty and release content lock
+		mark bitmap page dirty
 		take metapage buffer content lock in exclusive mode
 		if first-free-bit value did not change,
 			update it and mark meta page dirty
-		release meta page buffer content lock
-		return page number
 	else (not found):
 	release bitmap page buffer content lock
 	loop back to try next bitmap page, if any
 -- here when we have checked all bitmap pages; we hold meta excl. lock
 	extend index to add another overflow page; update meta information
-	mark meta page dirty and release buffer content lock
+	mark meta page dirty
 	return page number
 
 It is slightly annoying to release and reacquire the metapage lock
@@ -456,12 +466,15 @@ like this:
 
 	-- having determined that no space is free in the target bucket:
 	remember last page of bucket, drop write lock on it
-	call free-page-acquire routine
 	re-write-lock last page of bucket
 	if it is not last anymore, step to the last page
-	update (former) last page to point to new page
+	execute free-page-acquire (Obtaining an overflow page) mechanism described above
+	update (former) last page to point to the new page and mark the buffer dirty.
 	write-lock and initialize new page, with back link to former last page
-	write and release former last page
+	write WAL for addition of overflow page
+	release the locks on meta page and bitmap page acquired in free-page-acquire algorithm
+	release the lock on former last page
+	release the lock on new overflow page
 	insert tuple into new page
 	-- etc.
 
@@ -488,12 +501,14 @@ accessors of pages in the bucket.  The algorithm is:
 	determine which bitmap page contains the free space bit for page
 	release meta page buffer content lock
 	pin bitmap page and take buffer content lock in exclusive mode
-	update bitmap bit
-	mark bitmap page dirty and release buffer content lock and pin
-	if page number is less than what we saw as first-free-bit in meta:
 	retake meta page buffer content lock in exclusive mode
+	move (insert) tuples that belong to the overflow page being freed
+	update bitmap bit
+	mark bitmap page dirty
 	if page number is still less than first-free-bit,
 		update first-free-bit field and mark meta page dirty
+	write WAL for delinking overflow page operation
+	release buffer content lock and pin
 	release meta page buffer content lock and pin
 
 We have to do it this way because we must clear the bitmap bit before
@@ -504,8 +519,101 @@ page acquirer will scan more bitmap bits than he needs to.  What must be
 avoided is having first-free-bit greater than the actual first free bit,
 because then that free page would never be found by searchers.
 
-All the freespace operations should be called while holding no buffer
-locks.  Since they need no lmgr locks, deadlock is not possible.
+The reason for moving tuples from the overflow page while delinking the latter
+is to make that a single atomic operation.  Not doing so could lead to spurious
+reads on standby.  Basically, the user might see the same tuple twice.
+
+
+WAL Considerations
+------------------
+
+The hash index operations like create index, insert, delete, bucket split,
+allocate overflow page, and squeeze (overflow pages are freed in this operation)
+do not by themselves guarantee hash index consistency after a crash.  To
+provide robustness, we write WAL for each of these operations, which is
+replayed during crash recovery.
+
+Multiple WAL records are written for the create index operation: first one for
+initializing the metapage, followed by one for each new bucket created during
+the operation, followed by one for initializing the bitmap page.  If the system
+crashes after any operation, the whole operation is rolled back.  We considered
+writing a single WAL record for the whole operation, but for that we would need
+to limit the number of initial buckets that can be created during the
+operation.  As we can log only a fixed number of pages, XLR_MAX_BLOCK_ID (32),
+with the current XLog machinery, it is better to write multiple WAL records for
+this operation.  The downside of restricting the number of buckets is that we
+would need to perform the split operation if the number of tuples is more than
+what can be accommodated in the initial set of buckets, and it is not unusual
+to have a large number of tuples during a create index operation.
+
+Ordinary item insertions (that don't force a page split or need a new overflow
+page) are single WAL entries.  They touch a single bucket page and the meta page;
+the metapage is updated during replay as it is during the original operation.
+
+An insertion that causes the addition of an overflow page is logged as a single
+WAL entry preceded by a WAL entry for the new overflow page required to insert
+a tuple.  There is a corner case where, by the time we try to use the newly
+allocated overflow page, it has already been used by concurrent insertions; in
+such a case, a new overflow page will be allocated and a separate WAL entry
+will be made for the same.
+
+An insertion that causes a bucket split is logged as a single WAL entry,
+followed by a WAL entry for allocating a new bucket, followed by a WAL entry
+for each overflow bucket page in the new bucket to which the tuples are moved
+from old bucket, followed by a WAL entry to indicate that split is complete
+for both old and new buckets.
+
+A split operation which requires overflow pages to complete the operation
+will need to write a WAL record for each new allocation of an overflow page.
+
+As splitting involves multiple atomic actions, it's possible that the
+system crashes while moving tuples from bucket pages of the old bucket to the
+new bucket.  In such a case, after recovery, both the old and new buckets will
+be marked with the bucket-being-split and bucket-being-populated flags
+respectively, which indicate that a split is in progress for them.  The reader
+algorithm works correctly, as it will scan both the old and new buckets when
+the split is in progress as explained in the reader algorithm section above.
+
+We finish the split at next insert or split operation on the old bucket as
+explained in insert and split algorithm above.  It could be done during
+searches, too, but it seems best not to put any extra updates in what would
+otherwise be a read-only operation (updating is not possible in hot standby
+mode anyway).  It would seem natural to complete the split in VACUUM, but since
+splitting a bucket might require allocating a new page, it might fail if you
+run out of disk space.  That would be bad during VACUUM - the reason for
+running VACUUM in the first place might be that you run out of disk space,
+and now VACUUM won't finish because you're out of disk space.  In contrast,
+an insertion can require enlarging the physical file anyway.
+
+Deletion of tuples from a bucket is performed for two reasons: one for
+removing dead tuples and the other for removing tuples that were moved by a
+split.  A WAL entry is made for each bucket page from which tuples are removed,
+followed by a WAL entry to clear the garbage flag if the tuples moved by the
+split are removed.  A separate WAL entry is made for updating the metapage if
+the deletion is performed by vacuum to remove dead tuples.
+
+As deletion involves multiple atomic operations, it is quite possible that the
+system crashes (a) after removing tuples from some of the bucket pages,
+(b) before clearing the garbage flag, or (c) before updating the metapage.  If
+the system crashes before completing (b), it will again try to clean the bucket
+during the next vacuum or insert after recovery, which can have some performance
+impact, but it will work fine.  If the system crashes before completing (c),
+after recovery there could be some additional splits until the next vacuum
+updates the metapage, but the other operations like insert, delete and scan
+will work correctly.  We could fix this problem by actually updating the
+metapage based on the delete operation during replay, but it is not clear
+whether that is worth the complication.
+
+The squeeze operation moves tuples from one of the pages later in the bucket
+chain to one of the pages earlier in the chain, and writes a WAL record when
+either the page to which it is writing tuples is filled or the page from which
+it is removing tuples becomes empty.
+
+As the squeeze operation involves writing multiple atomic operations, it is
+quite possible that the system crashes before completing the operation on the
+entire bucket.  After recovery, the operations will work correctly, but the
+index will remain bloated, which can impact the performance of read and
+insert operations until the next vacuum squeezes the bucket completely.
 
 
 Other Notes
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index bca77a8..bb9a363 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -28,6 +28,7 @@
 #include "utils/builtins.h"
 #include "utils/index_selfuncs.h"
 #include "utils/rel.h"
+#include "miscadmin.h"
 
 
 /* Working state for hashbuild and its callback */
@@ -119,7 +120,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	estimate_rel_size(heap, NULL, &relpages, &reltuples, &allvisfrac);
 
 	/* Initialize the hash index metadata page and initial buckets */
-	num_buckets = _hash_metapinit(index, reltuples, MAIN_FORKNUM);
+	num_buckets = _hash_init(index, reltuples, MAIN_FORKNUM);
 
 	/*
 	 * If we just insert the tuples into the index in scan order, then
@@ -181,7 +182,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 void
 hashbuildempty(Relation index)
 {
-	_hash_metapinit(index, 0, INIT_FORKNUM);
+	_hash_init(index, 0, INIT_FORKNUM);
 }
 
 /*
@@ -302,6 +303,11 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		buf = so->hashso_curbuf;
 		Assert(BufferIsValid(buf));
 		page = BufferGetPage(buf);
+
+		/*
+		 * We don't need to test for an old snapshot here, as the current buffer is
+		 * pinned, so vacuum can't clean the page.
+		 */
 		maxoffnum = PageGetMaxOffsetNumber(page);
 		for (offnum = ItemPointerGetOffsetNumber(current);
 			 offnum <= maxoffnum;
@@ -622,6 +628,7 @@ loop_top:
 	}
 
 	/* Okay, we're really done.  Update tuple count in metapage. */
+	START_CRIT_SECTION();
 
 	if (orig_maxbucket == metap->hashm_maxbucket &&
 		orig_ntuples == metap->hashm_ntuples)
@@ -648,6 +655,26 @@ loop_top:
 	}
 
 	MarkBufferDirty(metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_update_meta_page xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.ntuples = metap->hashm_ntuples;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashUpdateMetaPage);
+
+		XLogRegisterBuffer(0, metabuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_UPDATE_META_PAGE);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
 	_hash_relbuf(rel, metabuf);
 
 	/* return statistics */
@@ -815,9 +842,40 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			/* No ereport(ERROR) until changes are logged */
+			START_CRIT_SECTION();
+
 			PageIndexMultiDelete(page, deletable, ndeletable);
 			bucket_dirty = true;
 			MarkBufferDirty(buf);
+
+			/* XLOG stuff */
+			if (RelationNeedsWAL(rel))
+			{
+				xl_hash_delete xlrec;
+				XLogRecPtr	recptr;
+
+				xlrec.is_primary_bucket_page = (buf == bucket_buf) ? true : false;
+
+				XLogBeginInsert();
+				XLogRegisterData((char *) &xlrec, SizeOfHashDelete);
+
+				/*
+				 * bucket buffer needs to be registered to ensure that we can
+				 * acquire a cleanup lock on it during replay.
+				 */
+				if (!xlrec.is_primary_bucket_page)
+					XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+				XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
+				XLogRegisterBufData(1, (char *) deletable,
+									ndeletable * sizeof(OffsetNumber));
+
+				recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_DELETE);
+				PageSetLSN(BufferGetPage(buf), recptr);
+			}
+
+			END_CRIT_SECTION();
 		}
 
 		/* bail out if there are no more pages to scan. */
@@ -865,8 +923,25 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		/* No ereport(ERROR) until changes are logged */
+		START_CRIT_SECTION();
+
 		bucket_opaque->hasho_flag &= ~LH_BUCKET_NEEDS_SPLIT_CLEANUP;
 		MarkBufferDirty(bucket_buf);
+
+		/* XLOG stuff */
+		if (RelationNeedsWAL(rel))
+		{
+			XLogRecPtr	recptr;
+
+			XLogBeginInsert();
+			XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+			recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_CLEANUP);
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
 	}
 
 	/*
@@ -880,9 +955,3 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	else
 		LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
 }
-
-void
-hash_redo(XLogReaderState *record)
-{
-	elog(PANIC, "hash_redo: unimplemented");
-}
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
new file mode 100644
index 0000000..e527d71
--- /dev/null
+++ b/src/backend/access/hash/hash_xlog.c
@@ -0,0 +1,972 @@
+/*-------------------------------------------------------------------------
+ *
+ * hash_xlog.c
+ *	  WAL replay logic for hash index.
+ *
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/hash/hash_xlog.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/hash_xlog.h"
+#include "access/xlogutils.h"
+
+/*
+ * replay a hash index meta page
+ */
+static void
+hash_xlog_init_meta_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Page		page;
+	Buffer		metabuf;
+
+	xl_hash_init_meta_page *xlrec = (xl_hash_init_meta_page *) XLogRecGetData(record);
+
+	/* create the index' metapage */
+	metabuf = XLogInitBufferForRedo(record, 0);
+	Assert(BufferIsValid(metabuf));
+	_hash_init_metabuffer(metabuf, xlrec->num_tuples, xlrec->procid,
+						  xlrec->ffactor, true);
+	page = (Page) BufferGetPage(metabuf);
+	PageSetLSN(page, lsn);
+	MarkBufferDirty(metabuf);
+	/* all done */
+	UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay a hash index bitmap page
+ */
+static void
+hash_xlog_init_bitmap_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		bitmapbuf;
+	Buffer		metabuf;
+	Page		page;
+	HashMetaPage metap;
+	uint32		num_buckets;
+
+	xl_hash_init_bitmap_page *xlrec = (xl_hash_init_bitmap_page *) XLogRecGetData(record);
+
+	/*
+	 * Initialize bitmap page
+	 */
+	bitmapbuf = XLogInitBufferForRedo(record, 0);
+	_hash_initbitmapbuffer(bitmapbuf, xlrec->bmsize, true);
+	PageSetLSN(BufferGetPage(bitmapbuf), lsn);
+	MarkBufferDirty(bitmapbuf);
+	UnlockReleaseBuffer(bitmapbuf);
+
+	/* add the new bitmap page to the metapage's list of bitmaps */
+	if (XLogReadBufferForRedo(record, 1, &metabuf) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Note: in normal operation, we'd update the metapage while still
+		 * holding lock on the page we inserted into.  But during replay it's
+		 * not necessary to hold that lock, since no other index updates can
+		 * be happening concurrently.
+		 */
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+
+		num_buckets = metap->hashm_maxbucket + 1;
+		metap->hashm_mapp[metap->hashm_nmaps] = num_buckets + 1;
+		metap->hashm_nmaps++;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay a hash index insert without split
+ */
+static void
+hash_xlog_insert(XLogReaderState *record)
+{
+	HashMetaPage metap;
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_insert *xlrec = (xl_hash_insert *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		Size		datalen;
+		char	   *datapos = XLogRecGetBlockData(record, 0, &datalen);
+
+		page = BufferGetPage(buffer);
+
+		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+						false, false) == InvalidOffsetNumber)
+			elog(PANIC, "hash_insert_redo: failed to add item");
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+
+	if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Note: in normal operation, we'd update the metapage while still
+		 * holding lock on the page we inserted into.  But during replay it's
+		 * not necessary to hold that lock, since no other index updates can
+		 * be happening concurrently.
+		 */
+		page = BufferGetPage(buffer);
+		metap = HashPageGetMeta(page);
+		metap->hashm_ntuples += 1;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
+
+/*
+ * replay addition of overflow page for hash index
+ */
+static void
+hash_xlog_addovflpage(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_addovflpage *xlrec = (xl_hash_addovflpage *) XLogRecGetData(record);
+	Buffer		leftbuf;
+	Buffer		ovflbuf;
+	Buffer		metabuf;
+	BlockNumber leftblk;
+	BlockNumber rightblk;
+	BlockNumber newmapblk = InvalidBlockNumber;
+	Page		ovflpage;
+	HashPageOpaque ovflopaque;
+	uint32	   *num_bucket;
+	char	   *data;
+	Size datalen PG_USED_FOR_ASSERTS_ONLY;
+	bool		new_bmpage = false;
+
+	XLogRecGetBlockTag(record, 0, NULL, NULL, &rightblk);
+	XLogRecGetBlockTag(record, 1, NULL, NULL, &leftblk);
+
+	ovflbuf = XLogInitBufferForRedo(record, 0);
+	Assert(BufferIsValid(ovflbuf));
+
+	data = XLogRecGetBlockData(record, 0, &datalen);
+	num_bucket = (uint32 *) data;
+	Assert(datalen == sizeof(uint32));
+	_hash_initbuf(ovflbuf, InvalidBlockNumber, *num_bucket, LH_OVERFLOW_PAGE,
+				  true);
+	/* update backlink */
+	ovflpage = BufferGetPage(ovflbuf);
+	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
+	ovflopaque->hasho_prevblkno = leftblk;
+
+	PageSetLSN(ovflpage, lsn);
+	MarkBufferDirty(ovflbuf);
+
+	if (XLogReadBufferForRedo(record, 1, &leftbuf) == BLK_NEEDS_REDO)
+	{
+		Page		leftpage;
+		HashPageOpaque leftopaque;
+
+		leftpage = BufferGetPage(leftbuf);
+		leftopaque = (HashPageOpaque) PageGetSpecialPointer(leftpage);
+		leftopaque->hasho_nextblkno = rightblk;
+
+		PageSetLSN(leftpage, lsn);
+		MarkBufferDirty(leftbuf);
+	}
+
+	/*
+	 * We need to release the locks once the prev pointer of overflow bucket
+	 * and next of left bucket are set, otherwise concurrent read might skip
+	 * the bucket.
+	 */
+	if (BufferIsValid(leftbuf))
+		UnlockReleaseBuffer(leftbuf);
+	UnlockReleaseBuffer(ovflbuf);
+
+	/*
+	 * Note: in normal operation, we'd update the bitmap and meta page while
+	 * still holding lock on the overflow pages we updated.  But during replay
+	 * it's not necessary to hold those locks, since no other index updates
+	 * can be happening concurrently.
+	 */
+	if (XLogRecHasBlockRef(record, 2))
+	{
+		Buffer		mapbuffer;
+
+		if (XLogReadBufferForRedo(record, 2, &mapbuffer) == BLK_NEEDS_REDO)
+		{
+			Page		mappage = (Page) BufferGetPage(mapbuffer);
+			uint32	   *freep = NULL;
+			char	   *data;
+			uint32	   *bitmap_page_bit;
+
+			freep = HashPageGetBitmap(mappage);
+
+			data = XLogRecGetBlockData(record, 2, &datalen);
+			bitmap_page_bit = (uint32 *) data;
+
+			SETBIT(freep, *bitmap_page_bit);
+
+			PageSetLSN(mappage, lsn);
+			MarkBufferDirty(mapbuffer);
+		}
+		if (BufferIsValid(mapbuffer))
+			UnlockReleaseBuffer(mapbuffer);
+	}
+
+	if (XLogRecHasBlockRef(record, 3))
+	{
+		Buffer		newmapbuf;
+
+		newmapbuf = XLogInitBufferForRedo(record, 3);
+
+		_hash_initbitmapbuffer(newmapbuf, xlrec->bmsize, true);
+
+		new_bmpage = true;
+		newmapblk = BufferGetBlockNumber(newmapbuf);
+
+		MarkBufferDirty(newmapbuf);
+		PageSetLSN(BufferGetPage(newmapbuf), lsn);
+
+		UnlockReleaseBuffer(newmapbuf);
+	}
+
+	if (XLogReadBufferForRedo(record, 4, &metabuf) == BLK_NEEDS_REDO)
+	{
+		HashMetaPage metap;
+		Page		page;
+		uint32	   *firstfree_ovflpage;
+
+		data = XLogRecGetBlockData(record, 4, &datalen);
+		firstfree_ovflpage = (uint32 *) data;
+
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+		metap->hashm_firstfree = *firstfree_ovflpage;
+
+		if (!xlrec->bmpage_found)
+		{
+			metap->hashm_spares[metap->hashm_ovflpoint]++;
+
+			if (new_bmpage)
+			{
+				Assert(BlockNumberIsValid(newmapblk));
+
+				metap->hashm_mapp[metap->hashm_nmaps] = newmapblk;
+				metap->hashm_nmaps++;
+				metap->hashm_spares[metap->hashm_ovflpoint]++;
+			}
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay allocation of page for split operation
+ */
+static void
+hash_xlog_split_allocpage(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_split_allocpage *xlrec = (xl_hash_split_allocpage *) XLogRecGetData(record);
+	Buffer		oldbuf;
+	Buffer		newbuf;
+	Buffer		metabuf;
+	Size datalen PG_USED_FOR_ASSERTS_ONLY;
+	char	   *data;
+	XLogRedoAction action;
+
+	/* replay the record for old bucket */
+	action = XLogReadBufferForRedo(record, 0, &oldbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		oldpage;
+		HashPageOpaque oldopaque;
+
+		oldpage = BufferGetPage(oldbuf);
+		oldopaque = (HashPageOpaque) PageGetSpecialPointer(oldpage);
+
+		oldopaque->hasho_flag = xlrec->old_bucket_flag;
+
+		PageSetLSN(oldpage, lsn);
+		MarkBufferDirty(oldbuf);
+	}
+
+	/*
+	 * There is no harm in releasing the lock on old bucket before new bucket
+	 * during replay as no other bucket splits can be happening concurrently.
+	 */
+	if (BufferIsValid(oldbuf))
+		UnlockReleaseBuffer(oldbuf);
+
+	/* replay the record for new bucket */
+	newbuf = XLogInitBufferForRedo(record, 1);
+	_hash_initbuf(newbuf, xlrec->new_bucket, xlrec->new_bucket,
+				  xlrec->new_bucket_flag, true);
+	MarkBufferDirty(newbuf);
+	PageSetLSN(BufferGetPage(newbuf), lsn);
+
+	UnlockReleaseBuffer(newbuf);
+
+	/*
+	 * Note: in normal operation, we'd update the meta page while still
+	 * holding lock on the old and new bucket pages.  But during replay it's
+	 * not necessary to hold those locks, since no other bucket splits can be
+	 * happening concurrently.
+	 */
+
+	/* replay the record for metapage changes */
+	if (XLogReadBufferForRedo(record, 2, &metabuf) == BLK_NEEDS_REDO)
+	{
+		Page		page;
+		HashMetaPage metap;
+
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+		metap->hashm_maxbucket = xlrec->new_bucket;
+
+		data = XLogRecGetBlockData(record, 2, &datalen);
+
+		if (xlrec->flags & XLH_SPLIT_META_UPDATE_MASKS)
+		{
+			uint32		lowmask;
+			uint32	   *highmask;
+
+			/* extract low and high masks. */
+			memcpy(&lowmask, data, sizeof(uint32));
+			highmask = (uint32 *) ((char *) data + sizeof(uint32));
+
+			/* update metapage */
+			metap->hashm_lowmask = lowmask;
+			metap->hashm_highmask = *highmask;
+
+			data += sizeof(uint32) * 2;
+		}
+
+		if (xlrec->flags & XLH_SPLIT_META_UPDATE_SPLITPOINT)
+		{
+			uint32		ovflpoint;
+			uint32	   *ovflpages;
+
+			/* extract information of overflow pages. */
+			memcpy(&ovflpoint, data, sizeof(uint32));
+			ovflpages = (uint32 *) ((char *) data + sizeof(uint32));
+
+			/* update metapage */
+			metap->hashm_spares[ovflpoint] = *ovflpages;
+			metap->hashm_ovflpoint = ovflpoint;
+		}
+
+		MarkBufferDirty(metabuf);
+		PageSetLSN(BufferGetPage(metabuf), lsn);
+	}
+
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay of split operation
+ */
+static void
+hash_xlog_split_page(XLogReaderState *record)
+{
+	Buffer		buf;
+
+	if (XLogReadBufferForRedo(record, 0, &buf) != BLK_RESTORED)
+		elog(ERROR, "Hash split record did not contain a full-page image");
+
+	UnlockReleaseBuffer(buf);
+}
+
+/*
+ * replay completion of split operation
+ */
+static void
+hash_xlog_split_complete(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_split_complete *xlrec = (xl_hash_split_complete *) XLogRecGetData(record);
+	Buffer		oldbuf;
+	Buffer		newbuf;
+	XLogRedoAction action;
+
+	/* replay the record for old bucket */
+	action = XLogReadBufferForRedo(record, 0, &oldbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		oldpage;
+		HashPageOpaque oldopaque;
+
+		oldpage = BufferGetPage(oldbuf);
+		oldopaque = (HashPageOpaque) PageGetSpecialPointer(oldpage);
+
+		oldopaque->hasho_flag = xlrec->old_bucket_flag;
+
+		PageSetLSN(oldpage, lsn);
+		MarkBufferDirty(oldbuf);
+	}
+	if (BufferIsValid(oldbuf))
+		UnlockReleaseBuffer(oldbuf);
+
+	/* replay the record for new bucket */
+	action = XLogReadBufferForRedo(record, 1, &newbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		newpage;
+		HashPageOpaque nopaque;
+
+		newpage = BufferGetPage(newbuf);
+		nopaque = (HashPageOpaque) PageGetSpecialPointer(newpage);
+
+		nopaque->hasho_flag = xlrec->new_bucket_flag;
+
+		PageSetLSN(newpage, lsn);
+		MarkBufferDirty(newbuf);
+	}
+	if (BufferIsValid(newbuf))
+		UnlockReleaseBuffer(newbuf);
+}
+
+/*
+ * replay move of page contents for squeeze operation of hash index
+ */
+static void
+hash_xlog_move_page_contents(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_move_page_contents *xldata = (xl_hash_move_page_contents *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		writebuf = InvalidBuffer;
+	Buffer		deletebuf = InvalidBuffer;
+	XLogRedoAction action;
+
+	/*
+	 * Ensure to have a cleanup lock on primary bucket page before we start
+	 * with the actual replay operation.  This is to ensure that neither a
+	 * scan can start nor a scan can be already-in-progress during the replay
+	 * of this operation.  If we allow scans during this operation, then they
+	 * can miss some records or show the same record multiple times.
+	 */
+	if (xldata->is_prim_bucket_same_wrt)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &writebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bucketbuf))
+			LockBufferForCleanup(bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &writebuf);
+	}
+
+	/* replay the record for adding entries in overflow buffer */
+	if (action == BLK_NEEDS_REDO)
+	{
+		Page		writepage;
+		char	   *begin;
+		char	   *data;
+		Size		datalen;
+		uint16		ninserted = 0;
+
+		data = begin = XLogRecGetBlockData(record, 1, &datalen);
+
+		writepage = (Page) BufferGetPage(writebuf);
+
+		if (xldata->ntups > 0)
+		{
+			OffsetNumber *towrite = (OffsetNumber *) data;
+
+			data += sizeof(OffsetNumber) * xldata->ntups;
+
+			while (data - begin < datalen)
+			{
+				IndexTuple	itup = (IndexTuple) data;
+				Size		itemsz;
+				OffsetNumber l;
+
+				itemsz = IndexTupleDSize(*itup);
+				itemsz = MAXALIGN(itemsz);
+
+				data += itemsz;
+
+				l = PageAddItem(writepage, (Item) itup, itemsz, towrite[ninserted], false, false);
+				if (l == InvalidOffsetNumber)
+					elog(ERROR, "hash_xlog_move_page_contents: failed to add item to hash index page, size %d bytes",
+						 (int) itemsz);
+
+				ninserted++;
+			}
+		}
+
+		/*
+		 * number of tuples inserted must be same as requested in REDO record.
+		 */
+		Assert(ninserted == xldata->ntups);
+
+		PageSetLSN(writepage, lsn);
+		MarkBufferDirty(writebuf);
+	}
+
+	/* replay the record for deleting entries from overflow buffer */
+	if (XLogReadBufferForRedo(record, 2, &deletebuf) == BLK_NEEDS_REDO)
+	{
+		Page		page;
+		char	   *ptr;
+		Size		len;
+
+		ptr = XLogRecGetBlockData(record, 2, &len);
+
+		page = (Page) BufferGetPage(deletebuf);
+
+		if (len > 0)
+		{
+			OffsetNumber *unused;
+			OffsetNumber *unend;
+
+			unused = (OffsetNumber *) ptr;
+			unend = (OffsetNumber *) ((char *) ptr + len);
+
+			if ((unend - unused) > 0)
+				PageIndexMultiDelete(page, unused, unend - unused);
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(deletebuf);
+	}
+
+	/*
+	 * Replay is complete, now we can release the buffers. We release locks at
+	 * end of replay operation to ensure that we hold lock on primary bucket
+	 * page till end of operation.  We can optimize by releasing the lock on
+	 * write buffer as soon as the operation for same is complete, if it is
+	 * not same as primary bucket page, but that doesn't seem to be worth
+	 * complicating the code.
+	 */
+	if (BufferIsValid(deletebuf))
+		UnlockReleaseBuffer(deletebuf);
+
+	if (BufferIsValid(writebuf))
+		UnlockReleaseBuffer(writebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay squeeze page operation of hash index
+ */
+static void
+hash_xlog_squeeze_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_squeeze_page *xldata = (xl_hash_squeeze_page *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		writebuf;
+	Buffer		ovflbuf;
+	Buffer		prevbuf = InvalidBuffer;
+	Buffer		mapbuf;
+	XLogRedoAction action;
+
+	/*
+	 * Ensure to have a cleanup lock on primary bucket page before we start
+	 * with the actual replay operation.  This is to ensure that neither a
+	 * scan can start nor a scan can be already-in-progress during the replay
+	 * of this operation.  If we allow scans during this operation, then they
+	 * can miss some records or show the same record multiple times.
+	 */
+	if (xldata->is_prim_bucket_same_wrt)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &writebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bucketbuf))
+			LockBufferForCleanup(bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &writebuf);
+	}
+
+	/* replay the record for adding entries in overflow buffer */
+	if (action == BLK_NEEDS_REDO)
+	{
+		Page		writepage;
+		char	   *begin;
+		char	   *data;
+		Size		datalen;
+		uint16		ninserted = 0;
+
+		data = begin = XLogRecGetBlockData(record, 1, &datalen);
+
+		writepage = (Page) BufferGetPage(writebuf);
+
+		if (xldata->ntups > 0)
+		{
+			OffsetNumber *towrite = (OffsetNumber *) data;
+
+			data += sizeof(OffsetNumber) * xldata->ntups;
+
+			while (data - begin < datalen)
+			{
+				IndexTuple	itup = (IndexTuple) data;
+				Size		itemsz;
+				OffsetNumber l;
+
+				itemsz = IndexTupleDSize(*itup);
+				itemsz = MAXALIGN(itemsz);
+
+				data += itemsz;
+
+				l = PageAddItem(writepage, (Item) itup, itemsz, towrite[ninserted], false, false);
+				if (l == InvalidOffsetNumber)
+					elog(ERROR, "hash_xlog_squeeze_page: failed to add item to hash index page, size %d bytes",
+						 (int) itemsz);
+
+				ninserted++;
+			}
+		}
+
+		/*
+		 * number of tuples inserted must be same as requested in REDO record.
+		 */
+		Assert(ninserted == xldata->ntups);
+
+		/*
+		 * If the page to which we are adding tuples is the page previous to the
+		 * freed overflow page, then update its nextblkno.
+		 */
+		if (xldata->is_prev_bucket_same_wrt)
+		{
+			HashPageOpaque writeopaque = (HashPageOpaque) PageGetSpecialPointer(writepage);
+
+			writeopaque->hasho_nextblkno = xldata->nextblkno;
+		}
+
+		PageSetLSN(writepage, lsn);
+		MarkBufferDirty(writebuf);
+	}
+
+	/* replay the record for initializing overflow buffer */
+	if (XLogReadBufferForRedo(record, 2, &ovflbuf) == BLK_NEEDS_REDO)
+	{
+		Page		ovflpage;
+
+		ovflpage = BufferGetPage(ovflbuf);
+
+		_hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));
+
+		PageSetLSN(ovflpage, lsn);
+		MarkBufferDirty(ovflbuf);
+	}
+	if (BufferIsValid(ovflbuf))
+		UnlockReleaseBuffer(ovflbuf);
+
+	/* replay the record for page previous to the freed overflow page */
+	if (!xldata->is_prev_bucket_same_wrt &&
+		XLogReadBufferForRedo(record, 3, &prevbuf) == BLK_NEEDS_REDO)
+	{
+		Page		prevpage = BufferGetPage(prevbuf);
+		HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+		prevopaque->hasho_nextblkno = xldata->nextblkno;
+
+		PageSetLSN(prevpage, lsn);
+		MarkBufferDirty(prevbuf);
+	}
+	if (BufferIsValid(prevbuf))
+		UnlockReleaseBuffer(prevbuf);
+
+	/* replay the record for page next to the freed overflow page */
+	if (XLogRecHasBlockRef(record, 4))
+	{
+		Buffer		nextbuf;
+
+		if (XLogReadBufferForRedo(record, 4, &nextbuf) == BLK_NEEDS_REDO)
+		{
+			Page		nextpage = BufferGetPage(nextbuf);
+			HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage);
+
+			nextopaque->hasho_prevblkno = xldata->prevblkno;
+
+			PageSetLSN(nextpage, lsn);
+			MarkBufferDirty(nextbuf);
+		}
+		if (BufferIsValid(nextbuf))
+			UnlockReleaseBuffer(nextbuf);
+	}
+
+	/* replay the record for bitmap page */
+	if (XLogReadBufferForRedo(record, 5, &mapbuf) == BLK_NEEDS_REDO)
+	{
+		Page		mappage = (Page) BufferGetPage(mapbuf);
+		uint32	   *freep = NULL;
+		char	   *data;
+		uint32	   *bitmap_page_bit;
+		Size		datalen;
+
+		freep = HashPageGetBitmap(mappage);
+
+		data = XLogRecGetBlockData(record, 5, &datalen);
+		bitmap_page_bit = (uint32 *) data;
+
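+		/* clear the bit to mark the freed overflow page as free again */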
+		CLRBIT(freep, *bitmap_page_bit);
+
+		PageSetLSN(mappage, lsn);
+		MarkBufferDirty(mapbuf);
+	}
+	if (BufferIsValid(mapbuf))
+		UnlockReleaseBuffer(mapbuf);
+
+	/* replay the record for meta page */
+	if (XLogRecHasBlockRef(record, 6))
+	{
+		Buffer		metabuf;
+
+		if (XLogReadBufferForRedo(record, 6, &metabuf) == BLK_NEEDS_REDO)
+		{
+			HashMetaPage metap;
+			Page		page;
+			char	   *data;
+			uint32	   *firstfree_ovflpage;
+			Size		datalen;
+
+			data = XLogRecGetBlockData(record, 6, &datalen);
+			firstfree_ovflpage = (uint32 *) data;
+
+			page = BufferGetPage(metabuf);
+			metap = HashPageGetMeta(page);
+			metap->hashm_firstfree = *firstfree_ovflpage;
+
+			PageSetLSN(page, lsn);
+			MarkBufferDirty(metabuf);
+		}
+		if (BufferIsValid(metabuf))
+			UnlockReleaseBuffer(metabuf);
+	}
+
+	/*
+	 * We release locks on writebuf and bucketbuf at end of replay operation
+	 * to ensure that we hold lock on primary bucket page till end of
+	 * operation.  We can optimize by releasing the lock on write buffer as
+	 * soon as the operation for same is complete, if it is not same as
+	 * primary bucket page, but that doesn't seem to be worth complicating the
+	 * code.
+	 */
+	if (BufferIsValid(writebuf))
+		UnlockReleaseBuffer(writebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay delete operation of hash index
+ */
+static void
+hash_xlog_delete(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_delete *xldata = (xl_hash_delete *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		deletebuf;
+	Page		page;
+	XLogRedoAction action;
+
+	/*
+	 * Make sure we have a cleanup lock on the primary bucket page before we
+	 * start the actual replay operation.  This guarantees that no scan can
+	 * start, and that no scan can already be in progress, while we replay
+	 * this operation.  If scans were allowed here, they could miss some
+	 * records or see the same record multiple times.
+	 */
+	if (xldata->is_primary_bucket_page)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &deletebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bucketbuf))
+			LockBufferForCleanup(bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &deletebuf);
+	}
+
+	/* replay the record for deleting entries in bucket page */
+	if (action == BLK_NEEDS_REDO)
+	{
+		char	   *ptr;
+		Size		len;
+
+		ptr = XLogRecGetBlockData(record, 1, &len);
+
+		page = (Page) BufferGetPage(deletebuf);
+
+		if (len > 0)
+		{
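+			/* block data is an array of offsets of the tuples to be deleted */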
+			OffsetNumber *unused;
+			OffsetNumber *unend;
+
+			unused = (OffsetNumber *) ptr;
+			unend = (OffsetNumber *) ((char *) ptr + len);
+
+			if ((unend - unused) > 0)
+				PageIndexMultiDelete(page, unused, unend - unused);
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(deletebuf);
+	}
+	if (BufferIsValid(deletebuf))
+		UnlockReleaseBuffer(deletebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay split cleanup flag operation for primary bucket page.
+ */
+static void
+hash_xlog_split_cleanup(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		HashPageOpaque bucket_opaque;
+
+		page = (Page) BufferGetPage(buffer);
+
+		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+		bucket_opaque->hasho_flag &= ~LH_BUCKET_NEEDS_SPLIT_CLEANUP;
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
+
+/*
+ * replay for update meta page
+ */
+static void
+hash_xlog_update_meta_page(XLogReaderState *record)
+{
+	HashMetaPage metap;
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_update_meta_page *xldata = (xl_hash_update_meta_page *) XLogRecGetData(record);
+	Buffer		metabuf;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &metabuf) == BLK_NEEDS_REDO)
+	{
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+
+		metap->hashm_ntuples = xldata->ntuples;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+void
+hash_redo(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			hash_xlog_init_meta_page(record);
+			break;
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			hash_xlog_init_bitmap_page(record);
+			break;
+		case XLOG_HASH_INSERT:
+			hash_xlog_insert(record);
+			break;
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			hash_xlog_addovflpage(record);
+			break;
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			hash_xlog_split_allocpage(record);
+			break;
+		case XLOG_HASH_SPLIT_PAGE:
+			hash_xlog_split_page(record);
+			break;
+		case XLOG_HASH_SPLIT_COMPLETE:
+			hash_xlog_split_complete(record);
+			break;
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			hash_xlog_move_page_contents(record);
+			break;
+		case XLOG_HASH_SQUEEZE_PAGE:
+			hash_xlog_squeeze_page(record);
+			break;
+		case XLOG_HASH_DELETE:
+			hash_xlog_delete(record);
+			break;
+		case XLOG_HASH_SPLIT_CLEANUP:
+			hash_xlog_split_cleanup(record);
+			break;
+		case XLOG_HASH_UPDATE_META_PAGE:
+			hash_xlog_update_meta_page(record);
+			break;
+		default:
+			elog(PANIC, "hash_redo: unknown op code %u", info);
+	}
+}
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index dc63063..241728f 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -16,6 +16,8 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "access/hash_xlog.h"
+#include "miscadmin.h"
 #include "utils/rel.h"
 
 
@@ -40,6 +42,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	bool		do_expand;
 	uint32		hashkey;
 	Bucket		bucket;
+	OffsetNumber itup_off;
 
 	/*
 	 * Get the hash key for the item (it's stored in the index tuple itself).
@@ -158,25 +161,20 @@ restart_insert:
 		Assert(pageopaque->hasho_bucket == bucket);
 	}
 
-	/* found page with enough space, so add the item here */
-	(void) _hash_pgaddtup(rel, buf, itemsz, itup);
-
-	/*
-	 * dirty and release the modified page.  if the page we modified was an
-	 * overflow page, we also need to separately drop the pin we retained on
-	 * the primary bucket page.
-	 */
-	MarkBufferDirty(buf);
-	_hash_relbuf(rel, buf);
-	if (buf != bucket_buf)
-		_hash_dropbuf(rel, bucket_buf);
-
 	/*
 	 * Write-lock the metapage so we can increment the tuple count. After
 	 * incrementing it, check to see if it's time for a split.
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Do the update.  No ereport(ERROR) until changes are logged */
+	START_CRIT_SECTION();
+
+	/* found page with enough space, so add the item here */
+	itup_off = _hash_pgaddtup(rel, buf, itemsz, itup);
+	MarkBufferDirty(buf);
+
+	/* metapage operations */
 	metap = HashPageGetMeta(metapage);
 	metap->hashm_ntuples += 1;
 
@@ -184,10 +182,43 @@ restart_insert:
 	do_expand = metap->hashm_ntuples >
 		(double) metap->hashm_ffactor * (metap->hashm_maxbucket + 1);
 
-	/* Write out the metapage and drop lock, but keep pin */
 	MarkBufferDirty(metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_insert xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.offnum = itup_off;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInsert);
+
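+		/*
+		 * The metapage is registered as well, because hashm_ntuples is
+		 * incremented again during replay.
+		 */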
+		XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);
+
+		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+		XLogRegisterBufData(0, (char *) itup, IndexTupleDSize(*itup));
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INSERT);
+
+		PageSetLSN(BufferGetPage(buf), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	/* drop lock on metapage, but keep pin */
 	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 
+	/*
+	 * Release the modified page, and make sure we also release the pin on the
+	 * primary bucket page.
+	 */
+	_hash_relbuf(rel, buf);
+	if (buf != bucket_buf)
+		_hash_dropbuf(rel, bucket_buf);
+
 	/* Attempt to split if a split is needed */
 	if (do_expand)
 		_hash_expandtable(rel, metabuf);
@@ -228,3 +259,44 @@ _hash_pgaddtup(Relation rel, Buffer buf, Size itemsize, IndexTuple itup)
 
 	return itup_off;
 }
+
+/*
+ *	_hash_pgaddmultitup() -- add a tuple vector to a particular page in the
+ *							 index.
+ *
+ * This routine has the same requirements for locking and tuple ordering as
+ * _hash_pgaddtup().
+ *
+ * The offset numbers at which the tuples were inserted are returned in the
+ * caller-supplied itup_offsets array.
+ */
+void
+_hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
+					OffsetNumber *itup_offsets, uint16 nitups)
+{
+	OffsetNumber itup_off;
+	Page		page;
+	uint32		hashkey;
+	int			i;
+
+	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+	page = BufferGetPage(buf);
+
+	for (i = 0; i < nitups; i++)
+	{
+		Size		itemsize;
+
+		itemsize = IndexTupleDSize(*itups[i]);
+		itemsize = MAXALIGN(itemsize);
+
+		/* Find where to insert the tuple (preserving page's hashkey ordering) */
+		hashkey = _hash_get_indextuple_hashkey(itups[i]);
+		itup_off = _hash_binsearch(page, hashkey);
+
+		itup_offsets[i] = itup_off;
+
+		if (PageAddItem(page, (Item) itups[i], itemsize, itup_off, false, false)
+			== InvalidOffsetNumber)
+			elog(ERROR, "failed to add index item to \"%s\"",
+				 RelationGetRelationName(rel));
+	}
+}
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 3334089..69f3882 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -18,10 +18,11 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "access/hash_xlog.h"
+#include "miscadmin.h"
 #include "utils/rel.h"
 
 
-static Buffer _hash_getovflpage(Relation rel, Buffer metabuf);
 static uint32 _hash_firstfreebit(uint32 map);
 
 
@@ -95,7 +96,9 @@ _hash_ovflblkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
  *	dropped before exiting (we assume the caller is not interested in 'buf'
  *	anymore) if not asked to retain.  The pin will be retained only for the
  *	primary bucket.  The returned overflow page will be pinned and
- *	write-locked; it is guaranteed to be empty.
+ *	write-locked; it is not guaranteed to be empty, because we release and
+ *	reacquire the lock on it, so the caller must check that the page has
+ *	enough space for the intended insertion.
  *
  *	The caller must hold a pin, but no lock, on the metapage buffer.
  *	That buffer is returned in the same state.
@@ -113,13 +116,37 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 	Page		ovflpage;
 	HashPageOpaque pageopaque;
 	HashPageOpaque ovflopaque;
-
-	/* allocate and lock an empty overflow page */
-	ovflbuf = _hash_getovflpage(rel, metabuf);
+	HashMetaPage metap;
+	Buffer		mapbuf = InvalidBuffer;
+	Buffer		newmapbuf = InvalidBuffer;
+	BlockNumber blkno;
+	uint32		orig_firstfree;
+	uint32		splitnum;
+	uint32	   *freep = NULL;
+	uint32		max_ovflpg;
+	uint32		bit;
+	uint32		bitmap_page_bit;
+	uint32		first_page;
+	uint32		last_bit;
+	uint32		last_page;
+	uint32		i,
+				j;
+	bool		page_found = false;
 
 	/*
-	 * Write-lock the tail page.  It is okay to hold two buffer locks here
-	 * since there cannot be anyone else contending for access to ovflbuf.
+	 * Write-lock the tail page.  The locking order here is: first acquire the
+	 * lock on the tail page of the bucket, then on the meta page to find and
+	 * lock the bitmap page; once the bitmap page is found, the lock on the
+	 * meta page is released, and finally the lock on the new overflow buffer
+	 * is acquired.  We need this locking order to avoid deadlocking with
+	 * backends that are doing inserts.
+	 *
+	 * Note: We could have avoided locking many buffers here if we made two
+	 * WAL records for acquiring an overflow page (one to allocate an overflow
+	 * page and another to add it to the overflow bucket chain).  However,
+	 * doing so can leak an overflow page if the system crashes after the
+	 * allocation.  Needless to say, a single record is also better from a
+	 * performance point of view.
 	 */
 	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
@@ -153,60 +180,6 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 		buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
 	}
 
-	/* now that we have correct backlink, initialize new overflow page */
-	ovflpage = BufferGetPage(ovflbuf);
-	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
-	ovflopaque->hasho_prevblkno = BufferGetBlockNumber(buf);
-	ovflopaque->hasho_nextblkno = InvalidBlockNumber;
-	ovflopaque->hasho_bucket = pageopaque->hasho_bucket;
-	ovflopaque->hasho_flag = LH_OVERFLOW_PAGE;
-	ovflopaque->hasho_page_id = HASHO_PAGE_ID;
-
-	MarkBufferDirty(ovflbuf);
-
-	/* logically chain overflow page to previous page */
-	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
-	MarkBufferDirty(buf);
-	if (retain_pin)
-	{
-		/* pin will be retained only for the primary bucket page */
-		Assert(pageopaque->hasho_flag & LH_BUCKET_PAGE);
-		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-	}
-	else
-		_hash_relbuf(rel, buf);
-
-	return ovflbuf;
-}
-
-/*
- *	_hash_getovflpage()
- *
- *	Find an available overflow page and return it.  The returned buffer
- *	is pinned and write-locked, and has had _hash_pageinit() applied,
- *	but it is caller's responsibility to fill the special space.
- *
- * The caller must hold a pin, but no lock, on the metapage buffer.
- * That buffer is left in the same state at exit.
- */
-static Buffer
-_hash_getovflpage(Relation rel, Buffer metabuf)
-{
-	HashMetaPage metap;
-	Buffer		mapbuf = 0;
-	Buffer		newbuf;
-	BlockNumber blkno;
-	uint32		orig_firstfree;
-	uint32		splitnum;
-	uint32	   *freep = NULL;
-	uint32		max_ovflpg;
-	uint32		bit;
-	uint32		first_page;
-	uint32		last_bit;
-	uint32		last_page;
-	uint32		i,
-				j;
-
 	/* Get exclusive lock on the meta page */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
@@ -255,11 +228,31 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 		for (; bit <= last_inpage; j++, bit += BITS_PER_MAP)
 		{
 			if (freep[j] != ALL_SET)
+			{
+				page_found = true;
+
+				/* Reacquire exclusive lock on the meta page */
+				LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+
+				/* convert bit to bit number within page */
+				bit += _hash_firstfreebit(freep[j]);
+				bitmap_page_bit = bit;
+
+				/* convert bit to absolute bit number */
+				bit += (i << BMPG_SHIFT(metap));
+				/* Calculate address of the recycled overflow page */
+				blkno = bitno_to_blkno(metap, bit);
+
+				/* Fetch and init the recycled page */
+				ovflbuf = _hash_getinitbuf(rel, blkno);
+
 				goto found;
+			}
 		}
 
 		/* No free space here, try to advance to next map page */
 		_hash_relbuf(rel, mapbuf);
+		mapbuf = InvalidBuffer;
 		i++;
 		j = 0;					/* scan from start of next map page */
 		bit = 0;
@@ -283,8 +276,15 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 		 * convenient to pre-mark them as "in use" too.
 		 */
 		bit = metap->hashm_spares[splitnum];
-		_hash_initbitmap(rel, metap, bitno_to_blkno(metap, bit), MAIN_FORKNUM);
-		metap->hashm_spares[splitnum]++;
+		newmapbuf = _hash_getnewbuf(rel, bitno_to_blkno(metap, bit), MAIN_FORKNUM);
+
+		/* add the new bitmap page to the metapage's list of bitmaps */
+		/* metapage already has a write lock */
+		if (metap->hashm_nmaps >= HASH_MAX_BITMAPS)
+			ereport(ERROR,
+					(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+					 errmsg("out of overflow pages in hash index \"%s\"",
+							RelationGetRelationName(rel))));
 	}
 	else
 	{
@@ -295,7 +295,8 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 	}
 
 	/* Calculate address of the new overflow page */
-	bit = metap->hashm_spares[splitnum];
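+	/*
+	 * If a new bitmap page is being added as well, it consumes the bit at
+	 * hashm_spares[splitnum], so the new overflow page gets the next bit.
+	 */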
+	bit = BufferIsValid(newmapbuf) ?
+		metap->hashm_spares[splitnum] + 1 : metap->hashm_spares[splitnum];
 	blkno = bitno_to_blkno(metap, bit);
 
 	/*
@@ -303,41 +304,51 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 	 * relation length stays in sync with ours.  XXX It's annoying to do this
 	 * with metapage write lock held; would be better to use a lock that
 	 * doesn't block incoming searches.
+	 *
+	 * It is okay to hold two buffer locks here (one on tail page of bucket
+	 * and other on new overflow page) since there cannot be anyone else
+	 * contending for access to ovflbuf.
 	 */
-	newbuf = _hash_getnewbuf(rel, blkno, MAIN_FORKNUM);
+	ovflbuf = _hash_getnewbuf(rel, blkno, MAIN_FORKNUM);
 
-	metap->hashm_spares[splitnum]++;
+found:
 
 	/*
-	 * Adjust hashm_firstfree to avoid redundant searches.  But don't risk
-	 * changing it if someone moved it while we were searching bitmap pages.
+	 * Do the update.  No ereport(ERROR) until the changes are logged.  We
+	 * want to log the changes for the bitmap page and the overflow page
+	 * together, so that a newly added page cannot be lost if we crash in
+	 * between.
 	 */
-	if (metap->hashm_firstfree == orig_firstfree)
-		metap->hashm_firstfree = bit + 1;
-
-	/* Write updated metapage and release lock, but not pin */
-	MarkBufferDirty(metabuf);
-	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+	START_CRIT_SECTION();
 
-	return newbuf;
-
-found:
-	/* convert bit to bit number within page */
-	bit += _hash_firstfreebit(freep[j]);
+	if (page_found)
+	{
+		Assert(BufferIsValid(mapbuf));
 
-	/* mark page "in use" in the bitmap */
-	SETBIT(freep, bit);
-	MarkBufferDirty(mapbuf);
-	_hash_relbuf(rel, mapbuf);
+		/* mark page "in use" in the bitmap */
+		SETBIT(freep, bitmap_page_bit);
+		MarkBufferDirty(mapbuf);
+	}
+	else
+	{
+		/* update the count to indicate that a new overflow page is added */
+		metap->hashm_spares[splitnum]++;
 
-	/* Reacquire exclusive lock on the meta page */
-	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+		if (BufferIsValid(newmapbuf))
+		{
+			_hash_initbitmapbuffer(newmapbuf, metap->hashm_bmsize, false);
+			MarkBufferDirty(newmapbuf);
 
-	/* convert bit to absolute bit number */
-	bit += (i << BMPG_SHIFT(metap));
+			metap->hashm_mapp[metap->hashm_nmaps] = BufferGetBlockNumber(newmapbuf);
+			metap->hashm_nmaps++;
+			metap->hashm_spares[splitnum]++;
+			MarkBufferDirty(metabuf);
+		}
 
-	/* Calculate address of the recycled overflow page */
-	blkno = bitno_to_blkno(metap, bit);
+		/*
+		 * For a new overflow page, we don't need to explicitly set the bit in
+		 * the bitmap page, as by default it will be set to "in use".
+		 */
+	}
 
 	/*
 	 * Adjust hashm_firstfree to avoid redundant searches.  But don't risk
@@ -346,19 +357,101 @@ found:
 	if (metap->hashm_firstfree == orig_firstfree)
 	{
 		metap->hashm_firstfree = bit + 1;
-
-		/* Write updated metapage and release lock, but not pin */
 		MarkBufferDirty(metabuf);
-		LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 	}
-	else
+
+	/* now that we have correct backlink, initialize new overflow page */
+	ovflpage = BufferGetPage(ovflbuf);
+	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
+	ovflopaque->hasho_prevblkno = BufferGetBlockNumber(buf);
+	ovflopaque->hasho_nextblkno = InvalidBlockNumber;
+	ovflopaque->hasho_bucket = pageopaque->hasho_bucket;
+	ovflopaque->hasho_flag = LH_OVERFLOW_PAGE;
+	ovflopaque->hasho_page_id = HASHO_PAGE_ID;
+
+	MarkBufferDirty(ovflbuf);
+
+	/* logically chain overflow page to previous page */
+	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
+
+	MarkBufferDirty(buf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
 	{
-		/* We didn't change the metapage, so no need to write */
-		LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+		XLogRecPtr	recptr;
+		xl_hash_addovflpage xlrec;
+
+		xlrec.bmpage_found = page_found;
+		xlrec.bmsize = metap->hashm_bmsize;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashAddOvflPage);
+
+		XLogRegisterBuffer(0, ovflbuf, REGBUF_WILL_INIT);
+		XLogRegisterBufData(0, (char *) &pageopaque->hasho_bucket, sizeof(Bucket));
+
+		XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
+
+		if (BufferIsValid(mapbuf))
+		{
+			/*
+			 * As the bitmap page doesn't have a standard page layout, this
+			 * allows us to log the data.
+			 */
+			XLogRegisterBuffer(2, mapbuf, REGBUF_STANDARD);
+			XLogRegisterBufData(2, (char *) &bitmap_page_bit, sizeof(uint32));
+		}
+
+		if (BufferIsValid(newmapbuf))
+			XLogRegisterBuffer(3, newmapbuf, REGBUF_WILL_INIT);
+
+		/*
+		 * To replay the meta page changes, we could log the entire metapage,
+		 * which doesn't seem advisable given its size, or we could log the
+		 * individual updated values, which is doable; but we prefer to
+		 * perform the same operations on the metapage during replay as are
+		 * done during the original operation.  That is straightforward and
+		 * has the advantage of using less space in WAL.
+		 */
+		XLogRegisterBuffer(4, metabuf, REGBUF_STANDARD);
+		XLogRegisterBufData(4, (char *) &metap->hashm_firstfree, sizeof(uint32));
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_ADD_OVFL_PAGE);
+
+		PageSetLSN(BufferGetPage(ovflbuf), recptr);
+		PageSetLSN(BufferGetPage(buf), recptr);
+
+		if (BufferIsValid(mapbuf))
+			PageSetLSN(BufferGetPage(mapbuf), recptr);
+
+		if (BufferIsValid(newmapbuf))
+			PageSetLSN(BufferGetPage(newmapbuf), recptr);
 	}
 
-	/* Fetch, init, and return the recycled page */
-	return _hash_getinitbuf(rel, blkno);
+	END_CRIT_SECTION();
+
+	if (retain_pin)
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+	else
+		_hash_relbuf(rel, buf);
+
+	if (BufferIsValid(mapbuf))
+		_hash_relbuf(rel, mapbuf);
+
+	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+
+	if (BufferIsValid(newmapbuf))
+		_hash_relbuf(rel, newmapbuf);
+
+	/*
+	 * We need to release and reacquire the lock on the overflow buffer so
+	 * that the standby doesn't see an intermediate state of it.
+	 */
+	LockBuffer(ovflbuf, BUFFER_LOCK_UNLOCK);
+	LockBuffer(ovflbuf, BUFFER_LOCK_EXCLUSIVE);
+
+	return ovflbuf;
 }
 
 /*
@@ -391,6 +484,12 @@ _hash_firstfreebit(uint32 map)
  *	Remove this overflow page from its bucket's chain, and mark the page as
  *	free.  On entry, ovflbuf is write-locked; it is released before exiting.
  *
+ *	The tuples (itups) are added to wbuf in this function; we could do that
+ *	in the caller as well.  The advantage of doing it here is that it is easy
+ *	to write the WAL for the XLOG_HASH_SQUEEZE_PAGE operation.  The addition
+ *	of tuples and the removal of the overflow page have to be done as one
+ *	atomic operation; otherwise, during replay, users on the standby might
+ *	find duplicate records.
+ *
  *	Since this function is invoked in VACUUM, we provide an access strategy
  *	parameter that controls fetches of the bucket pages.
  *
@@ -403,13 +502,14 @@ _hash_firstfreebit(uint32 map)
  *	has a lock on same.
  */
 BlockNumber
-_hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
+_hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
+				   Buffer wbuf, IndexTuple *itups, OffsetNumber *itup_offsets,
+				   Size *tups_size, uint16 nitups,
 				   BufferAccessStrategy bstrategy)
 {
 	HashMetaPage metap;
 	Buffer		metabuf;
 	Buffer		mapbuf;
-	Buffer		prevbuf = InvalidBuffer;
 	BlockNumber ovflblkno;
 	BlockNumber prevblkno;
 	BlockNumber blkno;
@@ -423,6 +523,9 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 	int32		bitmappage,
 				bitmapbit;
 	Bucket bucket PG_USED_FOR_ASSERTS_ONLY;
+	Buffer		prevbuf = InvalidBuffer;
+	Buffer		nextbuf = InvalidBuffer;
+	bool		update_metap = false;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -435,15 +538,6 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 	bucket = ovflopaque->hasho_bucket;
 
 	/*
-	 * Zero the page for debugging's sake; then write and release it. (Note:
-	 * if we failed to zero the page here, we'd have problems with the Assert
-	 * in _hash_pageinit() when the page is reused.)
-	 */
-	MemSet(ovflpage, 0, BufferGetPageSize(ovflbuf));
-	MarkBufferDirty(ovflbuf);
-	_hash_relbuf(rel, ovflbuf);
-
-	/*
 	 * Fix up the bucket chain.  this is a doubly-linked list, so we must fix
 	 * up the bucket chain members behind and ahead of the overflow page being
 	 * deleted.  Concurrency issues are avoided by using lock chaining as
@@ -451,8 +545,6 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 	 */
 	if (BlockNumberIsValid(prevblkno))
 	{
-		Page		prevpage;
-		HashPageOpaque prevopaque;
 
 		if (prevblkno == writeblkno)
 			prevbuf = wbuf;
@@ -462,32 +554,13 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 												 HASH_WRITE,
 										   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
 												 bstrategy);
-
-		prevpage = BufferGetPage(prevbuf);
-		prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
-
-		Assert(prevopaque->hasho_bucket == bucket);
-		prevopaque->hasho_nextblkno = nextblkno;
-
-		MarkBufferDirty(prevbuf);
-		if (prevblkno != writeblkno)
-			_hash_relbuf(rel, prevbuf);
 	}
 	if (BlockNumberIsValid(nextblkno))
-	{
-		Buffer		nextbuf = _hash_getbuf_with_strategy(rel,
-														 nextblkno,
-														 HASH_WRITE,
-														 LH_OVERFLOW_PAGE,
-														 bstrategy);
-		Page		nextpage = BufferGetPage(nextbuf);
-		HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage);
-
-		Assert(nextopaque->hasho_bucket == bucket);
-		nextopaque->hasho_prevblkno = prevblkno;
-		MarkBufferDirty(nextbuf);
-		_hash_relbuf(rel, nextbuf);
-	}
+		nextbuf = _hash_getbuf_with_strategy(rel,
+											 nextblkno,
+											 HASH_WRITE,
+											 LH_OVERFLOW_PAGE,
+											 bstrategy);
 
 	/* Note: bstrategy is intentionally not used for metapage and bitmap */
 
@@ -508,62 +581,184 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 	/* Release metapage lock while we access the bitmap page */
 	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 
-	/* Clear the bitmap bit to indicate that this overflow page is free */
+	/* read the bitmap page to clear the bitmap bit */
 	mapbuf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BITMAP_PAGE);
 	mappage = BufferGetPage(mapbuf);
 	freep = HashPageGetBitmap(mappage);
 	Assert(ISSET(freep, bitmapbit));
-	CLRBIT(freep, bitmapbit);
-	MarkBufferDirty(mapbuf);
-	_hash_relbuf(rel, mapbuf);
 
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* This operation needs to log multiple tuples; prepare WAL for that */
+	if (RelationNeedsWAL(rel))
+		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, XLR_NORMAL_RDATAS + nitups);
+
+	START_CRIT_SECTION();
+
+	/*
+	 * we have to insert tuples on the "write" page, being careful to preserve
+	 * hashkey ordering.  (If we insert many tuples into the same "write" page
+	 * it would be worth qsort'ing them).
+	 */
+	if (nitups > 0)
+	{
+		_hash_pgaddmultitup(rel, wbuf, itups, itup_offsets, nitups);
+		MarkBufferDirty(wbuf);
+	}
+
+	/*
+	 * Initialize the freed overflow page.  We can't completely zero the page
+	 * here, because WAL replay routines expect pages to be initialized.  See
+	 * the explanation of RBM_NORMAL mode atop XLogReadBufferExtended.
+	 */
+	_hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));
+	MarkBufferDirty(ovflbuf);
+
+	if (BufferIsValid(prevbuf))
+	{
+		Page		prevpage = BufferGetPage(prevbuf);
+		HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+		Assert(prevopaque->hasho_bucket == bucket);
+		prevopaque->hasho_nextblkno = nextblkno;
+		MarkBufferDirty(prevbuf);
+	}
+	if (BufferIsValid(nextbuf))
+	{
+		Page		nextpage = BufferGetPage(nextbuf);
+		HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage);
+
+		Assert(nextopaque->hasho_bucket == bucket);
+		nextopaque->hasho_prevblkno = prevblkno;
+		MarkBufferDirty(nextbuf);
+	}
+
+	/* Clear the bitmap bit to indicate that this overflow page is free */
+	CLRBIT(freep, bitmapbit);
+	MarkBufferDirty(mapbuf);
+
 	/* if this is now the first free page, update hashm_firstfree */
 	if (ovflbitno < metap->hashm_firstfree)
 	{
 		metap->hashm_firstfree = ovflbitno;
+		update_metap = true;
 		MarkBufferDirty(metabuf);
 	}
-	_hash_relbuf(rel, metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_squeeze_page xlrec;
+		XLogRecPtr	recptr;
+		int			i;
+
+		xlrec.prevblkno = prevblkno;
+		xlrec.nextblkno = nextblkno;
+		xlrec.ntups = nitups;
+		xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf) ? true : false;
+		xlrec.is_prev_bucket_same_wrt = (wbuf == prevbuf) ? true : false;
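+
+		/*
+		 * These flags tell replay whether the write page is also the primary
+		 * bucket page and/or the page preceding the freed overflow page, so
+		 * that the same page is not registered twice.
+		 */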
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashSqueezePage);
+
+		/*
+		 * bucket buffer needs to be registered to ensure that we can acquire
+		 * a cleanup lock on it during replay.
+		 */
+		if (!xlrec.is_prim_bucket_same_wrt)
+			XLogRegisterBuffer(0, bucketbuf, REGBUF_STANDARD);
+
+		XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
+		if (xlrec.ntups > 0)
+		{
+			XLogRegisterBufData(1, (char *) itup_offsets,
+								nitups * sizeof(OffsetNumber));
+			for (i = 0; i < nitups; i++)
+				XLogRegisterBufData(1, (char *) itups[i], tups_size[i]);
+		}
+
+		XLogRegisterBuffer(2, ovflbuf, REGBUF_STANDARD);
+
+		/*
+		 * If prevpage and the writepage (the page into which we are moving
+		 * tuples from the overflow page) are the same, there is no need to
+		 * register prevpage separately.  During replay, we can directly
+		 * update the next-block pointer in the writepage.
+		 */
+		if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt)
+			XLogRegisterBuffer(3, prevbuf, REGBUF_STANDARD);
+
+		if (BufferIsValid(nextbuf))
+			XLogRegisterBuffer(4, nextbuf, REGBUF_STANDARD);
+
+		XLogRegisterBuffer(5, mapbuf, REGBUF_STANDARD);
+		XLogRegisterBufData(5, (char *) &bitmapbit, sizeof(uint32));
+
+		if (update_metap)
+		{
+			XLogRegisterBuffer(6, metabuf, REGBUF_STANDARD);
+			XLogRegisterBufData(6, (char *) &metap->hashm_firstfree, sizeof(uint32));
+		}
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SQUEEZE_PAGE);
+
+		PageSetLSN(BufferGetPage(wbuf), recptr);
+		PageSetLSN(BufferGetPage(ovflbuf), recptr);
+
+		if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt)
+			PageSetLSN(BufferGetPage(prevbuf), recptr);
+		if (BufferIsValid(nextbuf))
+			PageSetLSN(BufferGetPage(nextbuf), recptr);
+
+		PageSetLSN(BufferGetPage(mapbuf), recptr);
+
+		if (update_metap)
+			PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	/* release the previous page's buffer if it is not the same as wbuf */
+	if (BufferIsValid(prevbuf) && prevblkno != writeblkno)
+		_hash_relbuf(rel, prevbuf);
+
+	if (BufferIsValid(ovflbuf))
+		_hash_relbuf(rel, ovflbuf);
+
+	if (BufferIsValid(nextbuf))
+		_hash_relbuf(rel, nextbuf);
+
+	if (BufferIsValid(mapbuf))
+		_hash_relbuf(rel, mapbuf);
+
+	if (BufferIsValid(metabuf))
+		_hash_relbuf(rel, metabuf);
 
 	return nextblkno;
 }
 
 
 /*
- *	_hash_initbitmap()
+ *	_hash_initbitmapbuffer()
  *
- *	 Initialize a new bitmap page.  The metapage has a write-lock upon
- *	 entering the function, and must be written by caller after return.
- *
- * 'blkno' is the block number of the new bitmap page.
- *
- * All bits in the new bitmap page are set to "1", indicating "in use".
+ *	 Initialize a new bitmap page.  All bits in the new bitmap page are set to
+ *	 "1", indicating "in use".
  */
 void
-_hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
-				 ForkNumber forkNum)
+_hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage)
 {
-	Buffer		buf;
 	Page		pg;
 	HashPageOpaque op;
 	uint32	   *freep;
 
-	/*
-	 * It is okay to write-lock the new bitmap page while holding metapage
-	 * write lock, because no one else could be contending for the new page.
-	 * Also, the metapage lock makes it safe to extend the index using
-	 * _hash_getnewbuf.
-	 *
-	 * There is some loss of concurrency in possibly doing I/O for the new
-	 * page while holding the metapage lock, but this path is taken so seldom
-	 * that it's not worth worrying about.
-	 */
-	buf = _hash_getnewbuf(rel, blkno, forkNum);
 	pg = BufferGetPage(buf);
 
+	/* initialize the page */
+	if (initpage)
+		_hash_pageinit(pg, BufferGetPageSize(buf));
+
 	/* initialize the page's special space */
 	op = (HashPageOpaque) PageGetSpecialPointer(pg);
 	op->hasho_prevblkno = InvalidBlockNumber;
@@ -574,23 +769,14 @@ _hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
 
 	/* set all of the bits to 1 */
 	freep = HashPageGetBitmap(pg);
-	MemSet(freep, 0xFF, BMPGSZ_BYTE(metap));
-
-	/* dirty the new bitmap page, and release write lock and pin */
-	MarkBufferDirty(buf);
-	_hash_relbuf(rel, buf);
+	MemSet(freep, 0xFF, bmsize);
 
-	/* add the new bitmap page to the metapage's list of bitmaps */
-	/* metapage already has a write lock */
-	if (metap->hashm_nmaps >= HASH_MAX_BITMAPS)
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("out of overflow pages in hash index \"%s\"",
-						RelationGetRelationName(rel))));
-
-	metap->hashm_mapp[metap->hashm_nmaps] = blkno;
-
-	metap->hashm_nmaps++;
+	/*
+	 * Set pd_lower just past the end of the bitmap page data.  We could even
+	 * set pd_lower equal to pd_upper, but this is more precise and makes the
+	 * page look compressible to xlog.c.
+	 */
+	((PageHeader) pg)->pd_lower = ((char *) freep + bmsize) - (char *) pg;
 }
 
 
@@ -640,7 +826,6 @@ _hash_squeezebucket(Relation rel,
 	Page		rpage;
 	HashPageOpaque wopaque;
 	HashPageOpaque ropaque;
-	bool		wbuf_dirty;
 
 	/*
 	 * start squeezing into the primary bucket page.
@@ -686,15 +871,23 @@ _hash_squeezebucket(Relation rel,
 	/*
 	 * squeeze the tuples.
 	 */
-	wbuf_dirty = false;
 	for (;;)
 	{
 		OffsetNumber roffnum;
 		OffsetNumber maxroffnum;
 		OffsetNumber deletable[MaxOffsetNumber];
-		int			ndeletable = 0;
+		IndexTuple	itups[MaxIndexTuplesPerPage];
+		Size		tups_size[MaxIndexTuplesPerPage];
+		OffsetNumber *itup_offsets;
+		uint16		ndeletable = 0;
+		uint16		nitups = 0;
+		Size		all_tups_size = 0;
+		int			i;
 		bool		retain_pin = false;
 
+		itup_offsets = (OffsetNumber *) palloc(MaxIndexTuplesPerPage * sizeof(OffsetNumber));
+
+readpage:
 		/* Scan each tuple in "read" page */
 		maxroffnum = PageGetMaxOffsetNumber(rpage);
 		for (roffnum = FirstOffsetNumber;
@@ -715,11 +908,13 @@ _hash_squeezebucket(Relation rel,
 
 			/*
 			 * Walk up the bucket chain, looking for a page big enough for
-			 * this item.  Exit if we reach the read page.
+			 * this item and all other accumulated items.  Exit if we reach
+			 * the read page.
 			 */
-			while (PageGetFreeSpace(wpage) < itemsz)
+			while (PageGetFreeSpaceForMulTups(wpage, nitups + 1) < (all_tups_size + itemsz))
 			{
 				Buffer		next_wbuf = InvalidBuffer;
+				bool		tups_moved = false;
 
 				Assert(!PageIsEmpty(wpage));
 
@@ -736,50 +931,130 @@ _hash_squeezebucket(Relation rel,
 														   HASH_WRITE,
 														   LH_OVERFLOW_PAGE,
 														   bstrategy);
+				if (nitups > 0)
+				{
+					Assert(nitups == ndeletable);
+
+					 * This operation needs to log multiple tuples; prepare
+					 * WAL for that.
+					 * WAL for that.
+					 */
+					if (RelationNeedsWAL(rel))
+						XLogEnsureRecordSpace(0, XLR_NORMAL_RDATAS + nitups);
+
+					START_CRIT_SECTION();
+
+					/*
+					 * we have to insert tuples on the "write" page, being
+					 * careful to preserve hashkey ordering.  (If we insert
+					 * many tuples into the same "write" page it would be
+					 * worth qsort'ing them).
+					 */
+					_hash_pgaddmultitup(rel, wbuf, itups, itup_offsets, nitups);
+					MarkBufferDirty(wbuf);
+
+					/* Delete tuples we already moved off read page */
+					PageIndexMultiDelete(rpage, deletable, ndeletable);
+					MarkBufferDirty(rbuf);
+
+					/* XLOG stuff */
+					if (RelationNeedsWAL(rel))
+					{
+						XLogRecPtr	recptr;
+						xl_hash_move_page_contents xlrec;
+
+						xlrec.ntups = nitups;
+						xlrec.is_prim_bucket_same_wrt = (wbuf == bucket_buf) ? true : false;
+
+						XLogBeginInsert();
+						XLogRegisterData((char *) &xlrec, SizeOfHashMovePageContents);
+
+						/*
+						 * bucket buffer needs to be registered to ensure that
+						 * we can acquire a cleanup lock on it during replay.
+						 */
+						if (!xlrec.is_prim_bucket_same_wrt)
+							XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+						XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
+						XLogRegisterBufData(1, (char *) itup_offsets,
+											nitups * sizeof(OffsetNumber));
+						for (i = 0; i < nitups; i++)
+							XLogRegisterBufData(1, (char *) itups[i], tups_size[i]);
+
+						XLogRegisterBuffer(2, rbuf, REGBUF_STANDARD);
+						XLogRegisterBufData(2, (char *) deletable,
+										  ndeletable * sizeof(OffsetNumber));
+
+						recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_MOVE_PAGE_CONTENTS);
+
+						PageSetLSN(BufferGetPage(wbuf), recptr);
+						PageSetLSN(BufferGetPage(rbuf), recptr);
+					}
+
+					END_CRIT_SECTION();
+
+					tups_moved = true;
+				}
 
 				/*
 				 * release the lock on previous page after acquiring the lock
 				 * on next page
 				 */
-				if (wbuf_dirty)
-					MarkBufferDirty(wbuf);
 				if (retain_pin)
 					LockBuffer(wbuf, BUFFER_LOCK_UNLOCK);
 				else
 					_hash_relbuf(rel, wbuf);
 
+				/*
+				 * We need to release, and if required reacquire, the lock on
+				 * rbuf so that the standby doesn't see an intermediate state
+				 * of it.  If we didn't release the lock, then after replay of
+				 * XLOG_HASH_SQUEEZE_PAGE users on the standby would be able
+				 * to see the results of a partial deletion on rblkno.
+				 */
+				LockBuffer(rbuf, BUFFER_LOCK_UNLOCK);
+
 				/* nothing more to do if we reached the read page */
 				if (rblkno == wblkno)
 				{
-					if (ndeletable > 0)
-					{
-						/* Delete tuples we already moved off read page */
-						PageIndexMultiDelete(rpage, deletable, ndeletable);
-						MarkBufferDirty(rbuf);
-					}
-					_hash_relbuf(rel, rbuf);
+					_hash_dropbuf(rel, rbuf);
 					return;
 				}
 
+				LockBuffer(rbuf, BUFFER_LOCK_EXCLUSIVE);
+
 				wbuf = next_wbuf;
 				wpage = BufferGetPage(wbuf);
 				wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
 				Assert(wopaque->hasho_bucket == bucket);
-				wbuf_dirty = false;
 				retain_pin = false;
-			}
 
-			/*
-			 * we have found room so insert on the "write" page, being careful
-			 * to preserve hashkey ordering.  (If we insert many tuples into
-			 * the same "write" page it would be worth qsort'ing instead of
-			 * doing repeated _hash_pgaddtup.)
-			 */
-			(void) _hash_pgaddtup(rel, wbuf, itemsz, itup);
-			wbuf_dirty = true;
+				/* be tidy */
+				for (i = 0; i < nitups; i++)
+					pfree(itups[i]);
+				nitups = 0;
+				all_tups_size = 0;
+				ndeletable = 0;
 
+				/*
+				 * after moving the tuples, rpage would have been compacted,
+				 * so we need to rescan it.
+				 */
+				if (tups_moved)
+					goto readpage;
+			}
 			/* remember tuple for deletion from "read" page */
 			deletable[ndeletable++] = roffnum;
+
+			/*
+			 * We need a copy of the index tuples, as they can be freed when
+			 * the overflow page is released; however, we still need them to
+			 * write the WAL record in _hash_freeovflpage.
+			 */
+			itups[nitups] = CopyIndexTuple(itup);
+			tups_size[nitups++] = itemsz;
+			all_tups_size += itemsz;
 		}
 
 		/*
@@ -797,10 +1072,14 @@ _hash_squeezebucket(Relation rel,
 		Assert(BlockNumberIsValid(rblkno));
 
 		/* free this overflow page (releases rbuf) */
-		_hash_freeovflpage(rel, rbuf, wbuf, bstrategy);
+		_hash_freeovflpage(rel, bucket_buf, rbuf, wbuf, itups, itup_offsets,
+						   tups_size, nitups, bstrategy);
+
+		/* be tidy */
+		for (i = 0; i < nitups; i++)
+			pfree(itups[i]);
 
-		if (wbuf_dirty)
-			MarkBufferDirty(wbuf);
+		pfree(itup_offsets);
 
 		/* are we freeing the page adjacent to wbuf? */
 		if (rblkno == wblkno)
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 9485978..300a15d 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -29,6 +29,7 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "access/hash_xlog.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
@@ -40,12 +41,11 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
 				  Bucket obucket, Bucket nbucket,
 				  Buffer obuf,
 				  Buffer nbuf,
+				  HTAB *htab,
 				  uint32 maxbucket,
 				  uint32 highmask, uint32 lowmask);
-static void _hash_splitbucket_guts(Relation rel, Buffer metabuf,
-					   Bucket obucket, Bucket nbucket, Buffer obuf,
-					   Buffer nbuf, HTAB *htab, uint32 maxbucket,
-					   uint32 highmask, uint32 lowmask);
+static void log_split_page(Relation rel, Buffer buf);
 
 
 /*
@@ -160,6 +160,36 @@ _hash_getinitbuf(Relation rel, BlockNumber blkno)
 }
 
 /*
+ *	_hash_initbuf() -- Get and initialize a buffer by bucket number.
+ */
+void
+_hash_initbuf(Buffer buf, uint32 max_bucket, uint32 num_bucket, uint32 flag,
+			  bool initpage)
+{
+	HashPageOpaque pageopaque;
+	Page		page;
+
+	page = BufferGetPage(buf);
+
+	/* initialize the page */
+	if (initpage)
+		_hash_pageinit(page, BufferGetPageSize(buf));
+
+	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+	/*
+	 * Set hasho_prevblkno with current hashm_maxbucket. This value will
+	 * be used to validate cached HashMetaPageData. See
+	 * _hash_getbucketbuf_from_hashkey().
+	 */
+	pageopaque->hasho_prevblkno = max_bucket;
+	pageopaque->hasho_nextblkno = InvalidBlockNumber;
+	pageopaque->hasho_bucket = num_bucket;
+	pageopaque->hasho_flag = flag;
+	pageopaque->hasho_page_id = HASHO_PAGE_ID;
+}
+
+/*
  *	_hash_getnewbuf() -- Get a new page at the end of the index.
  *
  *		This has the same API as _hash_getinitbuf, except that we are adding
@@ -291,7 +321,7 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 
 
 /*
- *	_hash_metapinit() -- Initialize the metadata page of a hash index,
+ *	_hash_init() -- Initialize the metadata page of a hash index,
  *				the initial buckets, and the initial bitmap page.
  *
  * The initial number of buckets is dependent on num_tuples, an estimate
@@ -303,19 +333,18 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
  * multiple buffer locks is ignored.
  */
 uint32
-_hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
+_hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
 {
-	HashMetaPage metap;
-	HashPageOpaque pageopaque;
 	Buffer		metabuf;
 	Buffer		buf;
+	Buffer		bitmapbuf;
 	Page		pg;
+	HashMetaPage metap;
+	RegProcedure procid;
 	int32		data_width;
 	int32		item_width;
 	int32		ffactor;
-	double		dnumbuckets;
 	uint32		num_buckets;
-	uint32		log2_num_buckets;
 	uint32		i;
 
 	/* safety check */
@@ -337,6 +366,154 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	if (ffactor < 10)
 		ffactor = 10;
 
+	procid = index_getprocid(rel, 1, HASHPROC);
+
+	/*
+	 * We initialize the metapage, the first N bucket pages, and the first
+	 * bitmap page in sequence, using _hash_getnewbuf to cause smgrextend()
+	 * calls to occur.  This ensures that the smgr level has the right idea of
+	 * the physical index length.
+	 *
+	 * Critical section not required, because on error the creation of the
+	 * whole relation will be rolled back.
+	 */
+	metabuf = _hash_getnewbuf(rel, HASH_METAPAGE, forkNum);
+	_hash_init_metabuffer(metabuf, num_tuples, procid, ffactor, false);
+	MarkBufferDirty(metabuf);
+
+	pg = BufferGetPage(metabuf);
+	metap = HashPageGetMeta(pg);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_init_meta_page xlrec;
+		XLogRecPtr	recptr;
+
+		/*
+		 * We could have copied the entire metapage into the WAL record and
+		 * restored it as-is during replay.  However, the metapage is not
+		 * small (more than 600 bytes), so we record just the information
+		 * required to reconstruct it.
+		 */
+		xlrec.num_tuples = num_tuples;
+		xlrec.procid = metap->hashm_procid;
+		xlrec.ffactor = metap->hashm_ffactor;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInitMetaPage);
+		XLogRegisterBuffer(0, metabuf, REGBUF_WILL_INIT);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INIT_META_PAGE);
+
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	num_buckets = metap->hashm_maxbucket + 1;
+
+	/*
+	 * Release buffer lock on the metapage while we initialize buckets.
+	 * Otherwise, we'll be in interrupt holdoff and the CHECK_FOR_INTERRUPTS
+	 * won't accomplish anything.  It's a bad idea to hold buffer locks for
+	 * long intervals in any case, since that can block the bgwriter.
+	 */
+	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+
+	/*
+	 * Initialize and WAL-log the first N buckets
+	 */
+	for (i = 0; i < num_buckets; i++)
+	{
+		BlockNumber blkno;
+
+		/* Allow interrupts, in case N is huge */
+		CHECK_FOR_INTERRUPTS();
+
+		blkno = BUCKET_TO_BLKNO(metap, i);
+		buf = _hash_getnewbuf(rel, blkno, forkNum);
+		_hash_initbuf(buf, metap->hashm_maxbucket, i, LH_BUCKET_PAGE, false);
+		MarkBufferDirty(buf);
+
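+		/* WAL-log each new bucket page as a full-page image */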
+		log_newpage(&rel->rd_node,
+					forkNum,
+					blkno,
+					BufferGetPage(buf),
+					true);
+		_hash_relbuf(rel, buf);
+	}
+
+	/* Now reacquire buffer lock on metapage */
+	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+
+	/*
+	 * Initialize bitmap page
+	 */
+	bitmapbuf = _hash_getnewbuf(rel, num_buckets + 1, forkNum);
+	_hash_initbitmapbuffer(bitmapbuf, metap->hashm_bmsize, false);
+	MarkBufferDirty(bitmapbuf);
+
+	/* add the new bitmap page to the metapage's list of bitmaps */
+	/* metapage already has a write lock */
+	if (metap->hashm_nmaps >= HASH_MAX_BITMAPS)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+				 errmsg("out of overflow pages in hash index \"%s\"",
+						RelationGetRelationName(rel))));
+
+	metap->hashm_mapp[metap->hashm_nmaps] = num_buckets + 1;
+
+	metap->hashm_nmaps++;
+	MarkBufferDirty(metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_init_bitmap_page xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.bmsize = metap->hashm_bmsize;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInitBitmapPage);
+		XLogRegisterBuffer(0, bitmapbuf, REGBUF_WILL_INIT);
+
+		/*
+		 * We could log the exact changes made to the meta page; however,
+		 * since no concurrent operation can see the index during the replay
+		 * of this record, we can simply perform the same operations during
+		 * replay as are done here.
+		 */
+		XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INIT_BITMAP_PAGE);
+
+		PageSetLSN(BufferGetPage(bitmapbuf), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	/* all done */
+	_hash_relbuf(rel, bitmapbuf);
+	_hash_relbuf(rel, metabuf);
+
+	return num_buckets;
+}
+
+/*
+ *	_hash_init_metabuffer() -- Initialize the metadata page of a hash index.
+ */
+void
+_hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
+					  uint16 ffactor, bool initpage)
+{
+	HashMetaPage metap;
+	HashPageOpaque pageopaque;
+	Page		page;
+	double		dnumbuckets;
+	uint32		num_buckets;
+	uint32		log2_num_buckets;
+	uint32		i;
+
+
 	/*
 	 * Choose the number of initial bucket pages to match the fill factor
 	 * given the estimated number of tuples.  We round up the result to the
@@ -356,30 +533,25 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	Assert(num_buckets == (((uint32) 1) << log2_num_buckets));
 	Assert(log2_num_buckets < HASH_MAX_SPLITPOINTS);
 
-	/*
-	 * We initialize the metapage, the first N bucket pages, and the first
-	 * bitmap page in sequence, using _hash_getnewbuf to cause smgrextend()
-	 * calls to occur.  This ensures that the smgr level has the right idea of
-	 * the physical index length.
-	 */
-	metabuf = _hash_getnewbuf(rel, HASH_METAPAGE, forkNum);
-	pg = BufferGetPage(metabuf);
+	page = BufferGetPage(buf);
+	if (initpage)
+		_hash_pageinit(page, BufferGetPageSize(buf));
 
-	pageopaque = (HashPageOpaque) PageGetSpecialPointer(pg);
+	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	pageopaque->hasho_prevblkno = InvalidBlockNumber;
 	pageopaque->hasho_nextblkno = InvalidBlockNumber;
 	pageopaque->hasho_bucket = -1;
 	pageopaque->hasho_flag = LH_META_PAGE;
 	pageopaque->hasho_page_id = HASHO_PAGE_ID;
 
-	metap = HashPageGetMeta(pg);
+	metap = HashPageGetMeta(page);
 
 	metap->hashm_magic = HASH_MAGIC;
 	metap->hashm_version = HASH_VERSION;
 	metap->hashm_ntuples = 0;
 	metap->hashm_nmaps = 0;
 	metap->hashm_ffactor = ffactor;
-	metap->hashm_bsize = HashGetMaxBitmapSize(pg);
+	metap->hashm_bsize = HashGetMaxBitmapSize(page);
 	/* find largest bitmap array size that will fit in page size */
 	for (i = _hash_log2(metap->hashm_bsize); i > 0; --i)
 	{
@@ -396,7 +568,7 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	 * pretty useless for normal operation (in fact, hashm_procid is not used
 	 * anywhere), but it might be handy for forensic purposes so we keep it.
 	 */
-	metap->hashm_procid = index_getprocid(rel, 1, HASHPROC);
+	metap->hashm_procid = procid;
 
 	/*
 	 * We initialize the index with N buckets, 0 .. N-1, occupying physical
@@ -415,53 +587,11 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	metap->hashm_firstfree = 0;
 
 	/*
-	 * Release buffer lock on the metapage while we initialize buckets.
-	 * Otherwise, we'll be in interrupt holdoff and the CHECK_FOR_INTERRUPTS
-	 * won't accomplish anything.  It's a bad idea to hold buffer locks for
-	 * long intervals in any case, since that can block the bgwriter.
-	 */
-	MarkBufferDirty(metabuf);
-	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
-
-	/*
-	 * Initialize the first N buckets
-	 */
-	for (i = 0; i < num_buckets; i++)
-	{
-		/* Allow interrupts, in case N is huge */
-		CHECK_FOR_INTERRUPTS();
-
-		buf = _hash_getnewbuf(rel, BUCKET_TO_BLKNO(metap, i), forkNum);
-		pg = BufferGetPage(buf);
-		pageopaque = (HashPageOpaque) PageGetSpecialPointer(pg);
-
-		/*
-		 * Set hasho_prevblkno with current hashm_maxbucket. This value will
-		 * be used to validate cached HashMetaPageData. See
-		 * _hash_getbucketbuf_from_hashkey().
-		 */
-		pageopaque->hasho_prevblkno = metap->hashm_maxbucket;
-		pageopaque->hasho_nextblkno = InvalidBlockNumber;
-		pageopaque->hasho_bucket = i;
-		pageopaque->hasho_flag = LH_BUCKET_PAGE;
-		pageopaque->hasho_page_id = HASHO_PAGE_ID;
-		MarkBufferDirty(buf);
-		_hash_relbuf(rel, buf);
-	}
-
-	/* Now reacquire buffer lock on metapage */
-	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
-
-	/*
-	 * Initialize first bitmap page
+	 * Set pd_lower just past the end of the metadata.  This is needed to log
+	 * the full-page image of the metapage in xloginsert.c.
 	 */
-	_hash_initbitmap(rel, metap, num_buckets + 1, forkNum);
-
-	/* all done */
-	MarkBufferDirty(metabuf);
-	_hash_relbuf(rel, metabuf);
-
-	return num_buckets;
+	((PageHeader) page)->pd_lower =
+		((char *) metap + sizeof(HashMetaPageData)) - (char *) page;
 }
 
 /*
@@ -470,7 +600,6 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 void
 _hash_pageinit(Page page, Size size)
 {
-	Assert(PageIsNew(page));
 	PageInit(page, size, sizeof(HashPageOpaqueData));
 }
 
@@ -498,10 +627,14 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	Buffer		buf_nblkno;
 	Buffer		buf_oblkno;
 	Page		opage;
+	Page		npage;
 	HashPageOpaque oopaque;
+	HashPageOpaque nopaque;
 	uint32		maxbucket;
 	uint32		highmask;
 	uint32		lowmask;
+	bool		metap_update_masks = false;
+	bool		metap_update_splitpoint = false;
 
 restart_expand:
 
@@ -537,7 +670,7 @@ restart_expand:
 	 * than a disk block then this would be an independent constraint.
 	 *
 	 * If you change this, see also the maximum initial number of buckets in
-	 * _hash_metapinit().
+	 * _hash_init().
 	 */
 	if (metap->hashm_maxbucket >= (uint32) 0x7FFFFFFE)
 		goto fail;
@@ -661,7 +794,11 @@ restart_expand:
 		 * The number of buckets in the new splitpoint is equal to the total
 		 * number already in existence, i.e. new_bucket.  Currently this maps
 		 * one-to-one to blocks required, but someday we may need a more
-		 * complicated calculation here.
+		 * complicated calculation here.  We treat the allocation of buckets
+		 * as a separate WAL action.  Even if we fail after this operation, it
+		 * won't leak space on the standby, as the next split will consume it.
+		 * In any case, even without a failure we don't use all the space in
+		 * one split operation.
 		 */
 		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
 		{
@@ -686,18 +823,17 @@ restart_expand:
 		goto fail;
 	}
 
-
 	/*
-	 * Okay to proceed with split.  Update the metapage bucket mapping info.
-	 *
-	 * Since we are scribbling on the metapage data right in the shared
-	 * buffer, any failure in this next little bit leaves us with a big
+	 * Since we are scribbling on the pages in the shared buffers, establish a
+	 * critical section.  Any failure in this next code leaves us with a big
 	 * problem: the metapage is effectively corrupt but could get written back
-	 * to disk.  We don't really expect any failure, but just to be sure,
-	 * establish a critical section.
+	 * to disk.
 	 */
 	START_CRIT_SECTION();
 
+	/*
+	 * Okay to proceed with split.  Update the metapage bucket mapping info.
+	 */
 	metap->hashm_maxbucket = new_bucket;
 
 	if (new_bucket > metap->hashm_highmask)
@@ -705,6 +841,7 @@ restart_expand:
 		/* Starting a new doubling */
 		metap->hashm_lowmask = metap->hashm_highmask;
 		metap->hashm_highmask = new_bucket | metap->hashm_lowmask;
+		metap_update_masks = true;
 	}
 
 	/*
@@ -717,10 +854,10 @@ restart_expand:
 	{
 		metap->hashm_spares[spare_ndx] = metap->hashm_spares[metap->hashm_ovflpoint];
 		metap->hashm_ovflpoint = spare_ndx;
+		metap_update_splitpoint = true;
 	}
 
-	/* Done mucking with metapage */
-	END_CRIT_SECTION();
+	MarkBufferDirty(metabuf);
 
 	/*
 	 * Copy bucket mapping info now; this saves re-accessing the meta page
@@ -733,16 +870,101 @@ restart_expand:
 	highmask = metap->hashm_highmask;
 	lowmask = metap->hashm_lowmask;
 
-	/* Write out the metapage and drop lock, but keep pin */
-	MarkBufferDirty(metabuf);
+	opage = BufferGetPage(buf_oblkno);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	/*
+	 * Mark the old bucket to indicate that split is in progress.  (At
+	 * operation end, we will clear the split-in-progress flag.)  Also,
+	 * for a primary bucket page, hasho_prevblkno stores the number of
+	 * buckets that existed as of the last split, so we must update that
+	 * value here.
+	 */
+	oopaque->hasho_flag |= LH_BUCKET_BEING_SPLIT;
+	oopaque->hasho_prevblkno = maxbucket;
+
+	MarkBufferDirty(buf_oblkno);
+
+	npage = BufferGetPage(buf_nblkno);
+
+	/*
+	 * initialize the new bucket's primary page and mark it to indicate that
+	 * split is in progress.
+	 */
+	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	nopaque->hasho_prevblkno = maxbucket;
+	nopaque->hasho_nextblkno = InvalidBlockNumber;
+	nopaque->hasho_bucket = new_bucket;
+	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_BEING_POPULATED;
+	nopaque->hasho_page_id = HASHO_PAGE_ID;
+
+	MarkBufferDirty(buf_nblkno);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_split_allocpage xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.new_bucket = maxbucket;
+		xlrec.old_bucket_flag = oopaque->hasho_flag;
+		xlrec.new_bucket_flag = nopaque->hasho_flag;
+		xlrec.flags = 0;
+
+		XLogBeginInsert();
+
+		XLogRegisterBuffer(0, buf_oblkno, REGBUF_STANDARD);
+		XLogRegisterBuffer(1, buf_nblkno, REGBUF_WILL_INIT);
+		XLogRegisterBuffer(2, metabuf, REGBUF_STANDARD);
+
+		if (metap_update_masks)
+		{
+			xlrec.flags |= XLH_SPLIT_META_UPDATE_MASKS;
+			XLogRegisterBufData(2, (char *) &metap->hashm_lowmask, sizeof(uint32));
+			XLogRegisterBufData(2, (char *) &metap->hashm_highmask, sizeof(uint32));
+		}
+
+		if (metap_update_splitpoint)
+		{
+			xlrec.flags |= XLH_SPLIT_META_UPDATE_SPLITPOINT;
+			XLogRegisterBufData(2, (char *) &metap->hashm_ovflpoint,
+								sizeof(uint32));
+			XLogRegisterBufData(2,
+					   (char *) &metap->hashm_spares[metap->hashm_ovflpoint],
+								sizeof(uint32));
+		}
+
+		XLogRegisterData((char *) &xlrec, SizeOfHashSplitAllocPage);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_ALLOCATE_PAGE);
+
+		PageSetLSN(BufferGetPage(buf_oblkno), recptr);
+		PageSetLSN(BufferGetPage(buf_nblkno), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	/* drop lock, but keep pin */
 	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 
+	/*
+	 * We need to release and reacquire the lock on the new bucket's buffer so
+	 * that a standby never sees it in an intermediate state.
+	 */
+	LockBuffer(buf_nblkno, BUFFER_LOCK_UNLOCK);
+	LockBuffer(buf_nblkno, BUFFER_LOCK_EXCLUSIVE);
+
 	/* Relocate records to the new bucket */
 	_hash_splitbucket(rel, metabuf,
 					  old_bucket, new_bucket,
-					  buf_oblkno, buf_nblkno,
+					  buf_oblkno, buf_nblkno, NULL,
 					  maxbucket, highmask, lowmask);
 
+	/* all done, now release the locks and pins on primary buckets. */
+	_hash_relbuf(rel, buf_oblkno);
+	_hash_relbuf(rel, buf_nblkno);
+
 	return;
 
 	/* Here if decide not to split or fail to acquire old bucket lock */
@@ -782,6 +1004,7 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 {
 	BlockNumber lastblock;
 	char		zerobuf[BLCKSZ];
+	Page		page;
 
 	lastblock = firstblock + nblocks - 1;
 
@@ -792,7 +1015,21 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 	if (lastblock < firstblock || lastblock == InvalidBlockNumber)
 		return false;
 
-	MemSet(zerobuf, 0, sizeof(zerobuf));
+	page = (Page) zerobuf;
+
+	/*
+	 * Initialize the new bucket page.  We can't leave the page completely
+	 * zeroed here, because WAL replay routines expect pages to be
+	 * initialized; see the explanation of RBM_NORMAL mode atop
+	 * XLogReadBufferExtended.
+	 */
+	_hash_pageinit(page, BLCKSZ);
+
+	if (RelationNeedsWAL(rel))
+		log_newpage(&rel->rd_node,
+					MAIN_FORKNUM,
+					lastblock,
+					zerobuf,
+					true);
 
 	RelationOpenSmgr(rel);
 	smgrextend(rel->rd_smgr, MAIN_FORKNUM, lastblock, zerobuf, false);
@@ -800,14 +1037,19 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 	return true;
 }
 
-
 /*
  * _hash_splitbucket -- split 'obucket' into 'obucket' and 'nbucket'
  *
+ * This routine is used to partition the tuples between the old and new
+ * buckets and also to finish an incomplete split operation.  To finish a
+ * previously interrupted split, the caller must fill htab; in that case we
+ * skip moving any tuple that already exists in htab.  A NULL htab means
+ * that all tuples belonging to the new bucket are moved.
+ *
  * We are splitting a bucket that consists of a base bucket page and zero
  * or more overflow (bucket chain) pages.  We must relocate tuples that
- * belong in the new bucket, and compress out any free space in the old
- * bucket.
+ * belong in the new bucket.
  *
  * The caller must hold cleanup locks on both buckets to ensure that
  * no one else is trying to access them (see README).
@@ -833,73 +1075,11 @@ _hash_splitbucket(Relation rel,
 				  Bucket nbucket,
 				  Buffer obuf,
 				  Buffer nbuf,
+				  HTAB *htab,
 				  uint32 maxbucket,
 				  uint32 highmask,
 				  uint32 lowmask)
 {
-	Page		opage;
-	Page		npage;
-	HashPageOpaque oopaque;
-	HashPageOpaque nopaque;
-
-	opage = BufferGetPage(obuf);
-	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
-
-	/*
-	 * Mark the old bucket to indicate that split is in progress.  (At
-	 * operation end, we will clear the split-in-progress flag.)  Also,
-	 * for a primary bucket page, hasho_prevblkno stores the number of
-	 * buckets that existed as of the last split, so we must update that
-	 * value here.
-	 */
-	oopaque->hasho_flag |= LH_BUCKET_BEING_SPLIT;
-	oopaque->hasho_prevblkno = maxbucket;
-
-	npage = BufferGetPage(nbuf);
-
-	/*
-	 * initialize the new bucket's primary page and mark it to indicate that
-	 * split is in progress.
-	 */
-	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
-	nopaque->hasho_prevblkno = maxbucket;
-	nopaque->hasho_nextblkno = InvalidBlockNumber;
-	nopaque->hasho_bucket = nbucket;
-	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_BEING_POPULATED;
-	nopaque->hasho_page_id = HASHO_PAGE_ID;
-
-	_hash_splitbucket_guts(rel, metabuf, obucket,
-						   nbucket, obuf, nbuf, NULL,
-						   maxbucket, highmask, lowmask);
-
-	/* all done, now release the locks and pins on primary buckets. */
-	_hash_relbuf(rel, obuf);
-	_hash_relbuf(rel, nbuf);
-}
-
-/*
- * _hash_splitbucket_guts -- Helper function to perform the split operation
- *
- * This routine is used to partition the tuples between old and new bucket and
- * to finish incomplete split operations.  To finish the previously
- * interrupted split operation, caller needs to fill htab.  If htab is set, then
- * we skip the movement of tuples that exists in htab, otherwise NULL value of
- * htab indicates movement of all the tuples that belong to new bucket.
- *
- * Caller needs to lock and unlock the old and new primary buckets.
- */
-static void
-_hash_splitbucket_guts(Relation rel,
-					   Buffer metabuf,
-					   Bucket obucket,
-					   Bucket nbucket,
-					   Buffer obuf,
-					   Buffer nbuf,
-					   HTAB *htab,
-					   uint32 maxbucket,
-					   uint32 highmask,
-					   uint32 lowmask)
-{
 	Buffer		bucket_obuf;
 	Buffer		bucket_nbuf;
 	Page		opage;
@@ -984,11 +1164,14 @@ _hash_splitbucket_guts(Relation rel,
 				itemsz = IndexTupleDSize(*new_itup);
 				itemsz = MAXALIGN(itemsz);
 
-				if (PageGetFreeSpace(npage) < itemsz)
+				while (PageGetFreeSpace(npage) < itemsz)
 				{
-					/* write out nbuf and drop lock, but keep pin */
-					MarkBufferDirty(nbuf);
+					/* log the split operation before releasing the lock */
+					log_split_page(rel, nbuf);
+
+					/* drop lock, but keep pin */
 					LockBuffer(nbuf, BUFFER_LOCK_UNLOCK);
+
 					/* chain to a new overflow page */
 					nbuf = _hash_addovflpage(rel, metabuf, nbuf, (nbuf == bucket_nbuf) ? true : false);
 					npage = BufferGetPage(nbuf);
@@ -996,6 +1179,13 @@ _hash_splitbucket_guts(Relation rel,
 				}
 
 				/*
+				 * Change the shared buffer state within a critical section;
+				 * otherwise, an error in between could leave the page in a
+				 * state that crash recovery cannot repair.
+				 */
+				START_CRIT_SECTION();
+
+				/*
 				 * Insert tuple on new page, using _hash_pgaddtup to ensure
 				 * correct ordering by hashkey.  This is a tad inefficient
 				 * since we may have to shuffle itempointers repeatedly.
@@ -1004,6 +1194,8 @@ _hash_splitbucket_guts(Relation rel,
 				 */
 				(void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
 
+				END_CRIT_SECTION();
+
 				/* be tidy */
 				pfree(new_itup);
 			}
@@ -1026,7 +1218,16 @@ _hash_splitbucket_guts(Relation rel,
 
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
+		{
+			/* log the split operation before releasing the lock */
+			log_split_page(rel, nbuf);
+
+			if (nbuf == bucket_nbuf)
+				LockBuffer(nbuf, BUFFER_LOCK_UNLOCK);
+			else
+				_hash_relbuf(rel, nbuf);
 			break;
+		}
 
 		/* Else, advance to next old page */
 		obuf = _hash_getbuf(rel, oblkno, HASH_READ, LH_OVERFLOW_PAGE);
@@ -1042,17 +1243,6 @@ _hash_splitbucket_guts(Relation rel,
 	 * To avoid deadlocks due to locking order of buckets, first lock the old
 	 * bucket and then the new bucket.
 	 */
-	if (nbuf == bucket_nbuf)
-	{
-		MarkBufferDirty(bucket_nbuf);
-		LockBuffer(bucket_nbuf, BUFFER_LOCK_UNLOCK);
-	}
-	else
-	{
-		MarkBufferDirty(nbuf);
-		_hash_relbuf(rel, nbuf);
-	}
-
 	LockBuffer(bucket_obuf, BUFFER_LOCK_EXCLUSIVE);
 	opage = BufferGetPage(bucket_obuf);
 	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
@@ -1061,6 +1251,8 @@ _hash_splitbucket_guts(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	START_CRIT_SECTION();
+
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
 	nopaque->hasho_flag &= ~LH_BUCKET_BEING_POPULATED;
 
@@ -1077,6 +1269,29 @@ _hash_splitbucket_guts(Relation rel,
 	 */
 	MarkBufferDirty(bucket_obuf);
 	MarkBufferDirty(bucket_nbuf);
+
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+		xl_hash_split_complete xlrec;
+
+		xlrec.old_bucket_flag = oopaque->hasho_flag;
+		xlrec.new_bucket_flag = nopaque->hasho_flag;
+
+		XLogBeginInsert();
+
+		XLogRegisterData((char *) &xlrec, SizeOfHashSplitComplete);
+
+		XLogRegisterBuffer(0, bucket_obuf, REGBUF_STANDARD);
+		XLogRegisterBuffer(1, bucket_nbuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_COMPLETE);
+
+		PageSetLSN(BufferGetPage(bucket_obuf), recptr);
+		PageSetLSN(BufferGetPage(bucket_nbuf), recptr);
+	}
+
+	END_CRIT_SECTION();
 }
 
 /*
@@ -1193,9 +1408,9 @@ _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Bucket obucket,
 	npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 	nbucket = npageopaque->hasho_bucket;
 
-	_hash_splitbucket_guts(rel, metabuf, obucket,
-						   nbucket, obuf, bucket_nbuf, tidhtab,
-						   maxbucket, highmask, lowmask);
+	_hash_splitbucket(rel, metabuf, obucket,
+					  nbucket, obuf, bucket_nbuf, tidhtab,
+					  maxbucket, highmask, lowmask);
 
 	_hash_relbuf(rel, bucket_nbuf);
 	LockBuffer(obuf, BUFFER_LOCK_UNLOCK);
@@ -1203,6 +1418,38 @@ _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Bucket obucket,
 }
 
 /*
+ *	log_split_page() -- Log the split operation
+ *
+ *	We log the split operation when a page in the new bucket gets filled,
+ *	logging the entire page image rather than the individual insertions.
+ *
+ *	'buf' must be locked by the caller, which is also responsible for
+ *	unlocking it.
+ */
+static void
+log_split_page(Relation rel, Buffer buf)
+{
+	START_CRIT_SECTION();
+
+	MarkBufferDirty(buf);
+
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+
+		XLogBeginInsert();
+
+		XLogRegisterBuffer(0, buf, REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_PAGE);
+
+		PageSetLSN(BufferGetPage(buf), recptr);
+	}
+
+	END_CRIT_SECTION();
+}
+
+/*
  *	_hash_getcachedmetap() -- Returns cached metapage data.
  *
  *	If metabuf is not InvalidBuffer, caller must hold a pin, but no lock, on
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 9e5d7e4..d733770 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -123,6 +123,7 @@ _hash_readnext(IndexScanDesc scan,
 	if (block_found)
 	{
 		*pagep = BufferGetPage(*bufp);
+		TestForOldSnapshot(scan->xs_snapshot, rel, *pagep);
 		*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
 	}
 }
@@ -168,6 +169,7 @@ _hash_readprev(IndexScanDesc scan,
 		*bufp = _hash_getbuf(rel, blkno, HASH_READ,
 							 LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
 		*pagep = BufferGetPage(*bufp);
+		TestForOldSnapshot(scan->xs_snapshot, rel, *pagep);
 		*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
 
 		/*
@@ -283,6 +285,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 
 	buf = _hash_getbucketbuf_from_hashkey(rel, hashkey, HASH_READ, NULL);
 	page = BufferGetPage(buf);
+	TestForOldSnapshot(scan->xs_snapshot, rel, page);
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	bucket = opaque->hasho_bucket;
 
@@ -318,6 +321,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
 		old_buf = _hash_getbuf(rel, old_blkno, HASH_READ, LH_BUCKET_PAGE);
+		TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(old_buf));
 
 		/*
 		 * remember the split bucket buffer so as to use it later for
@@ -520,6 +524,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					_hash_readprev(scan, &buf, &page, &opaque);
 					if (BufferIsValid(buf))
 					{
+						TestForOldSnapshot(scan->xs_snapshot, rel, page);
 						maxoff = PageGetMaxOffsetNumber(page);
 						offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
 					}
diff --git a/src/backend/access/rmgrdesc/hashdesc.c b/src/backend/access/rmgrdesc/hashdesc.c
index 7eac819..5e3f7d8 100644
--- a/src/backend/access/rmgrdesc/hashdesc.c
+++ b/src/backend/access/rmgrdesc/hashdesc.c
@@ -19,10 +19,143 @@
 void
 hash_desc(StringInfo buf, XLogReaderState *record)
 {
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			{
+				xl_hash_init_meta_page *xlrec = (xl_hash_init_meta_page *) rec;
+
+				appendStringInfo(buf, "num_tuples %g, fillfactor %d",
+								 xlrec->num_tuples, xlrec->ffactor);
+				break;
+			}
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			{
+				xl_hash_init_bitmap_page *xlrec = (xl_hash_init_bitmap_page *) rec;
+
+				appendStringInfo(buf, "bmsize %d", xlrec->bmsize);
+				break;
+			}
+		case XLOG_HASH_INSERT:
+			{
+				xl_hash_insert *xlrec = (xl_hash_insert *) rec;
+
+				appendStringInfo(buf, "off %u", xlrec->offnum);
+				break;
+			}
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			{
+				xl_hash_addovflpage *xlrec = (xl_hash_addovflpage *) rec;
+
+				appendStringInfo(buf, "bmsize %d, bmpage_found %c",
+						   xlrec->bmsize, (xlrec->bmpage_found) ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			{
+				xl_hash_split_allocpage *xlrec = (xl_hash_split_allocpage *) rec;
+
+				appendStringInfo(buf, "new_bucket %u, meta_page_masks_updated %c, issplitpoint_changed %c",
+								 xlrec->new_bucket,
+					(xlrec->flags & XLH_SPLIT_META_UPDATE_MASKS) ? 'T' : 'F',
+								 (xlrec->flags & XLH_SPLIT_META_UPDATE_SPLITPOINT) ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SPLIT_COMPLETE:
+			{
+				xl_hash_split_complete *xlrec = (xl_hash_split_complete *) rec;
+
+				appendStringInfo(buf, "split_complete_old_bucket %c, split_complete_new_bucket %c",
+				(xlrec->old_bucket_flag & LH_BUCKET_BEING_SPLIT) ? 'F' : 'T',
+								 (xlrec->new_bucket_flag & LH_BUCKET_BEING_POPULATED) ? 'F' : 'T');
+				break;
+			}
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			{
+				xl_hash_move_page_contents *xlrec = (xl_hash_move_page_contents *) rec;
+
+				appendStringInfo(buf, "ntups %d, is_primary %c",
+								 xlrec->ntups,
+								 xlrec->is_prim_bucket_same_wrt ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SQUEEZE_PAGE:
+			{
+				xl_hash_squeeze_page *xlrec = (xl_hash_squeeze_page *) rec;
+
+				appendStringInfo(buf, "prevblkno %u, nextblkno %u, ntups %d, is_primary %c",
+								 xlrec->prevblkno,
+								 xlrec->nextblkno,
+								 xlrec->ntups,
+								 xlrec->is_prim_bucket_same_wrt ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_DELETE:
+			{
+				xl_hash_delete *xlrec = (xl_hash_delete *) rec;
+
+				appendStringInfo(buf, "is_primary %c",
+								 xlrec->is_primary_bucket_page ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_UPDATE_META_PAGE:
+			{
+				xl_hash_update_meta_page *xlrec = (xl_hash_update_meta_page *) rec;
+
+				appendStringInfo(buf, "ntuples %g",
+								 xlrec->ntuples);
+				break;
+			}
+	}
 }
 
 const char *
 hash_identify(uint8 info)
 {
-	return NULL;
+	const char *id = NULL;
+
+	switch (info & ~XLR_INFO_MASK)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			id = "INIT_META_PAGE";
+			break;
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			id = "INIT_BITMAP_PAGE";
+			break;
+		case XLOG_HASH_INSERT:
+			id = "INSERT";
+			break;
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			id = "ADD_OVFL_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			id = "SPLIT_ALLOCATE_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_PAGE:
+			id = "SPLIT_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_COMPLETE:
+			id = "SPLIT_COMPLETE";
+			break;
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			id = "MOVE_PAGE_CONTENTS";
+			break;
+		case XLOG_HASH_SQUEEZE_PAGE:
+			id = "SQUEEZE_PAGE";
+			break;
+		case XLOG_HASH_DELETE:
+			id = "DELETE";
+			break;
+		case XLOG_HASH_SPLIT_CLEANUP:
+			id = "SPLIT_CLEANUP";
+			break;
+		case XLOG_HASH_UPDATE_META_PAGE:
+			id = "UPDATE_META_PAGE";
+			break;
+	}
+
+	return id;
 }
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 265e9b3..a66bc09 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -505,11 +505,6 @@ DefineIndex(Oid relationId,
 	accessMethodForm = (Form_pg_am) GETSTRUCT(tuple);
 	amRoutine = GetIndexAmRoutine(accessMethodForm->amhandler);
 
-	if (strcmp(accessMethodName, "hash") == 0 &&
-		RelationNeedsWAL(rel))
-		ereport(WARNING,
-				(errmsg("hash indexes are not WAL-logged and their use is discouraged")));
-
 	if (stmt->unique && !amRoutine->amcanunique)
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 6fc5fa4..0e0605a 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -598,6 +598,33 @@ PageGetFreeSpace(Page page)
 }
 
 /*
+ * PageGetFreeSpaceForMulTups
+ *		Returns the size of the free (allocatable) space on a page,
+ *		reduced by the space needed for multiple new line pointers.
+ *
+ * Note: this should usually only be used on index pages.  Use
+ * PageGetHeapFreeSpace on heap pages.
+ */
+Size
+PageGetFreeSpaceForMulTups(Page page, int ntups)
+{
+	int			space;
+
+	/*
+	 * Use signed arithmetic here so that we behave sensibly if pd_lower >
+	 * pd_upper.
+	 */
+	space = (int) ((PageHeader) page)->pd_upper -
+		(int) ((PageHeader) page)->pd_lower;
+
+	if (space < (int) (ntups * sizeof(ItemIdData)))
+		return 0;
+	space -= ntups * sizeof(ItemIdData);
+
+	return (Size) space;
+}
+
+/*
  * PageGetExactFreeSpace
  *		Returns the size of the free (allocatable) space on a page,
  *		without any consideration for adding/removing line pointers.
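
A minimal caller-side sketch of how PageGetFreeSpaceForMulTups is meant to be
used (not part of the patch; the helper name and variables are invented for
illustration): accumulate index tuples for a batch insert and stop as soon as
the target page can no longer hold the batch plus one line pointer per tuple,
mirroring the check the reworked _hash_squeezebucket performs before calling
_hash_pgaddmultitup.

#include "postgres.h"
#include "access/itup.h"
#include "storage/bufpage.h"

/*
 * Hypothetical helper: how many of the candidate tuples fit on 'page' if
 * they are all added in one shot, accounting for one line pointer each?
 */
static int
batch_that_fits(Page page, IndexTuple *candidates, int ncandidates)
{
	Size		all_tups_size = 0;
	int			nitups = 0;

	while (nitups < ncandidates)
	{
		Size		itemsz = MAXALIGN(IndexTupleDSize(*candidates[nitups]));

		/* stop once the page cannot hold the batch plus this tuple */
		if (PageGetFreeSpaceForMulTups(page, nitups + 1) < all_tups_size + itemsz)
			break;

		all_tups_size += itemsz;
		nitups++;
	}

	return nitups;
}
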
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 9001e20..ce55fc5 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -5880,13 +5880,10 @@ RelationIdIsInInitFile(Oid relationId)
 /*
  * Tells whether any index for the relation is unlogged.
  *
- * Any index using the hash AM is implicitly unlogged.
- *
  * Note: There doesn't seem to be any way to have an unlogged index attached
- * to a permanent table except to create a hash index, but it seems best to
- * keep this general so that it returns sensible results even when they seem
- * obvious (like for an unlogged table) and to handle possible future unlogged
- * indexes on permanent tables.
+ * to a permanent table, but it seems best to keep this general so that it
+ * returns sensible results even when they seem obvious (like for an unlogged
+ * table) and to handle possible future unlogged indexes on permanent tables.
  */
 bool
 RelationHasUnloggedIndex(Relation rel)
@@ -5908,8 +5905,7 @@ RelationHasUnloggedIndex(Relation rel)
 			elog(ERROR, "cache lookup failed for relation %u", indexoid);
 		reltup = (Form_pg_class) GETSTRUCT(tp);
 
-		if (reltup->relpersistence == RELPERSISTENCE_UNLOGGED
-			|| reltup->relam == HASH_AM_OID)
+		if (reltup->relpersistence == RELPERSISTENCE_UNLOGGED)
 			result = true;
 
 		ReleaseSysCache(tp);
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 3bf587b..bfdfed8 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -303,13 +303,15 @@ extern Datum hash_uint32(uint32 k);
 extern void _hash_doinsert(Relation rel, IndexTuple itup);
 extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
 			   Size itemsize, IndexTuple itup);
+extern void _hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
+					OffsetNumber *itup_offsets, uint16 nitups);
 
 /* hashovfl.c */
 extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin);
-extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
-				   BufferAccessStrategy bstrategy);
-extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
-				 BlockNumber blkno, ForkNumber forkNum);
+extern BlockNumber _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
+				   Buffer wbuf, IndexTuple *itups, OffsetNumber *itup_offsets,
+			 Size *tups_size, uint16 nitups, BufferAccessStrategy bstrategy);
+extern void _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage);
 extern void _hash_squeezebucket(Relation rel,
 					Bucket bucket, BlockNumber bucket_blkno,
 					Buffer bucket_buf,
@@ -327,6 +329,8 @@ extern Buffer _hash_getbucketbuf_from_hashkey(Relation rel, uint32 hashkey,
 								int access,
 								HashMetaPage *cachedmetap);
 extern Buffer _hash_getinitbuf(Relation rel, BlockNumber blkno);
+extern void _hash_initbuf(Buffer buf, uint32 max_bucket, uint32 num_bucket,
+				uint32 flag, bool initpage);
 extern Buffer _hash_getnewbuf(Relation rel, BlockNumber blkno,
 				ForkNumber forkNum);
 extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
@@ -335,8 +339,10 @@ extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
 extern void _hash_relbuf(Relation rel, Buffer buf);
 extern void _hash_dropbuf(Relation rel, Buffer buf);
 extern void _hash_dropscanbuf(Relation rel, HashScanOpaque so);
-extern uint32 _hash_metapinit(Relation rel, double num_tuples,
-				ForkNumber forkNum);
+extern uint32 _hash_init(Relation rel, double num_tuples,
+		   ForkNumber forkNum);
+extern void _hash_init_metabuffer(Buffer buf, double num_tuples,
+					  RegProcedure procid, uint16 ffactor, bool initpage);
 extern void _hash_pageinit(Page page, Size size);
 extern void _hash_expandtable(Relation rel, Buffer metabuf);
 extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index ed3c37f..c53f878 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -17,6 +17,245 @@
 #include "access/hash.h"
 #include "access/xlogreader.h"
 
+/* Number of buffers required for XLOG_HASH_SQUEEZE_PAGE operation */
+#define HASH_XLOG_FREE_OVFL_BUFS	6
+
+/*
+ * XLOG records for hash operations
+ */
+#define XLOG_HASH_INIT_META_PAGE	0x00		/* initialize the meta page */
+#define XLOG_HASH_INIT_BITMAP_PAGE	0x10		/* initialize the bitmap page */
+#define XLOG_HASH_INSERT		0x20	/* add index tuple without split */
+#define XLOG_HASH_ADD_OVFL_PAGE 0x30	/* add overflow page */
+#define XLOG_HASH_SPLIT_ALLOCATE_PAGE	0x40	/* allocate new page for split */
+#define XLOG_HASH_SPLIT_PAGE	0x50	/* split page */
+#define XLOG_HASH_SPLIT_COMPLETE	0x60		/* completion of split
+												 * operation */
+#define XLOG_HASH_MOVE_PAGE_CONTENTS	0x70	/* remove tuples from one page
+												 * and add to another page */
+#define XLOG_HASH_SQUEEZE_PAGE	0x80	/* add tuples to one of the previous
+										 * pages in chain and free the ovfl
+										 * page */
+#define XLOG_HASH_DELETE		0x90	/* delete index tuples from a page */
+#define XLOG_HASH_SPLIT_CLEANUP 0xA0	/* clear split-cleanup flag in primary
+										 * bucket page after deleting tuples
+										 * that are moved due to split	*/
+#define XLOG_HASH_UPDATE_META_PAGE	0xB0		/* update meta page after
+												 * vacuum */
+
+
+/*
+ * xl_hash_split_allocpage flag values, 8 bits are available.
+ */
+#define XLH_SPLIT_META_UPDATE_MASKS		(1<<0)
+#define XLH_SPLIT_META_UPDATE_SPLITPOINT		(1<<1)
+
+/*
+ * Data to regenerate the meta-data page
+ */
+typedef struct xl_hash_metadata
+{
+	HashMetaPageData metadata;
+}	xl_hash_metadata;
+
+/*
+ * This is what we need to know about a HASH index create.
+ *
+ * Backup block 0: metapage
+ */
+typedef struct xl_hash_createidx
+{
+	double		num_tuples;
+	RegProcedure procid;
+	uint16		ffactor;
+}	xl_hash_createidx;
+#define SizeOfHashCreateIdx (offsetof(xl_hash_createidx, ffactor) + sizeof(uint16))
+
+/*
+ * This is what we need to know about a simple (non-split) insert.
+ *
+ * This data record is used for XLOG_HASH_INSERT
+ *
+ * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 1: metapage (xl_hash_metadata)
+ */
+typedef struct xl_hash_insert
+{
+	OffsetNumber offnum;
+}	xl_hash_insert;
+
+#define SizeOfHashInsert	(offsetof(xl_hash_insert, offnum) + sizeof(OffsetNumber))
+
+/*
+ * This is what we need to know about the addition of an overflow page.
+ *
+ * This data record is used for XLOG_HASH_ADD_OVFL_PAGE
+ *
+ * Backup Blk 0: newly allocated overflow page
+ * Backup Blk 1: page before new overflow page in the bucket chain
+ * Backup Blk 2: bitmap page
+ * Backup Blk 3: new bitmap page
+ * Backup Blk 4: metapage
+ */
+typedef struct xl_hash_addovflpage
+{
+	uint16		bmsize;
+	bool		bmpage_found;
+}	xl_hash_addovflpage;
+
+#define SizeOfHashAddOvflPage	\
+	(offsetof(xl_hash_addovflpage, bmpage_found) + sizeof(bool))
+
+/*
+ * This is what we need to know about allocating a page for a split.
+ *
+ * This data record is used for XLOG_HASH_SPLIT_ALLOCATE_PAGE
+ *
+ * Backup Blk 0: page for old bucket
+ * Backup Blk 1: page for new bucket
+ * Backup Blk 2: metapage
+ */
+typedef struct xl_hash_split_allocpage
+{
+	uint32		new_bucket;
+	uint16		old_bucket_flag;
+	uint16		new_bucket_flag;
+	uint8		flags;
+}	xl_hash_split_allocpage;
+
+#define SizeOfHashSplitAllocPage	\
+	(offsetof(xl_hash_split_allocpage, flags) + sizeof(uint8))
+
+/*
+ * This is what we need to know about completing the split operation.
+ *
+ * This data record is used for XLOG_HASH_SPLIT_COMPLETE
+ *
+ * Backup Blk 0: page for old bucket
+ * Backup Blk 1: page for new bucket
+ */
+typedef struct xl_hash_split_complete
+{
+	uint16		old_bucket_flag;
+	uint16		new_bucket_flag;
+}	xl_hash_split_complete;
+
+#define SizeOfHashSplitComplete \
+	(offsetof(xl_hash_split_complete, new_bucket_flag) + sizeof(uint16))
+
+/*
+ * This is what we need to know about moving page contents during a squeeze
+ * operation.
+ *
+ * This data record is used for XLOG_HASH_MOVE_PAGE_CONTENTS
+ *
+ * Backup Blk 0: bucket page
+ * Backup Blk 1: page containing moved tuples
+ * Backup Blk 2: page from which tuples will be removed
+ */
+typedef struct xl_hash_move_page_contents
+{
+	uint16		ntups;
+	bool		is_prim_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is same as
+												 * primary bucket page */
+}	xl_hash_move_page_contents;
+
+#define SizeOfHashMovePageContents	\
+	(offsetof(xl_hash_move_page_contents, is_prim_bucket_same_wrt) + sizeof(bool))
+
+/*
+ * This is what we need to know about the squeeze page operation.
+ *
+ * This data record is used for XLOG_HASH_SQUEEZE_PAGE
+ *
+ * Backup Blk 0: page containing tuples moved from freed overflow page
+ * Backup Blk 1: freed overflow page
+ * Backup Blk 2: page previous to the freed overflow page
+ * Backup Blk 3: page next to the freed overflow page
+ * Backup Blk 4: bitmap page containing info of freed overflow page
+ * Backup Blk 5: meta page
+ */
+typedef struct xl_hash_squeeze_page
+{
+	BlockNumber prevblkno;
+	BlockNumber nextblkno;
+	uint16		ntups;
+	bool		is_prim_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is same as
+												 * primary bucket page */
+	bool		is_prev_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is the
+												 * page previous to the freed
+												 * overflow page */
+}	xl_hash_squeeze_page;
+
+#define SizeOfHashSqueezePage	\
+	(offsetof(xl_hash_squeeze_page, is_prev_bucket_same_wrt) + sizeof(bool))
+
+/*
+ * This is what we need to know about the deletion of index tuples from a page.
+ *
+ * This data record is used for XLOG_HASH_DELETE
+ *
+ * Backup Blk 0: primary bucket page
+ * Backup Blk 1: page from which tuples are deleted
+ */
+typedef struct xl_hash_delete
+{
+	bool		is_primary_bucket_page; /* TRUE if the operation is for
+										 * primary bucket page */
+}	xl_hash_delete;
+
+#define SizeOfHashDelete	(offsetof(xl_hash_delete, is_primary_bucket_page) + sizeof(bool))
+
+/*
+ * This is what we need for a metapage update operation.
+ *
+ * This data record is used for XLOG_HASH_UPDATE_META_PAGE
+ *
+ * Backup Blk 0: meta page
+ */
+typedef struct xl_hash_update_meta_page
+{
+	double		ntuples;
+}	xl_hash_update_meta_page;
+
+#define SizeOfHashUpdateMetaPage	\
+	(offsetof(xl_hash_update_meta_page, ntuples) + sizeof(double))
+
+/*
+ * This is what we need to initialize metapage.
+ *
+ * This data record is used for XLOG_HASH_INIT_META_PAGE
+ *
+ * Backup Blk 0: meta page
+ */
+typedef struct xl_hash_init_meta_page
+{
+	double		num_tuples;
+	RegProcedure procid;
+	uint16		ffactor;
+}	xl_hash_init_meta_page;
+
+#define SizeOfHashInitMetaPage		\
+	(offsetof(xl_hash_init_meta_page, ffactor) + sizeof(uint16))
+
+/*
+ * This is what we need to initialize a bitmap page.
+ *
+ * This data record is used for XLOG_HASH_INIT_BITMAP_PAGE
+ *
+ * Backup Blk 0: bitmap page
+ * Backup Blk 1: meta page
+ */
+typedef struct xl_hash_init_bitmap_page
+{
+	uint16		bmsize;
+}	xl_hash_init_bitmap_page;
+
+#define SizeOfHashInitBitmapPage	\
+	(offsetof(xl_hash_init_bitmap_page, bmsize) + sizeof(uint16))
 
 extern void hash_redo(XLogReaderState *record);
 extern void hash_desc(StringInfo buf, XLogReaderState *record);
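
To make the record layouts above more concrete, here is a rough sketch of how
a redo routine behind hash_redo might consume an XLOG_HASH_INSERT record.
This is only an assumption about the shape of the replay code
(hash_xlog_insert is a hypothetical name here, and the metapage update for
backup block 1 is omitted); the real replay routines live in the main WAL
patch.

#include "postgres.h"
#include "access/hash_xlog.h"
#include "access/xlogutils.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"

/* Hypothetical redo sketch for XLOG_HASH_INSERT; not taken from the patch. */
static void
hash_xlog_insert(XLogReaderState *record)
{
	XLogRecPtr	lsn = record->EndRecPtr;
	xl_hash_insert *xlrec = (xl_hash_insert *) XLogRecGetData(record);
	Buffer		buffer;

	/* Backup Blk 0: the page that received the tuple */
	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
	{
		Size		datalen;
		char	   *datapos = XLogRecGetBlockData(record, 0, &datalen);
		Page		page = BufferGetPage(buffer);

		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
						false, false) == InvalidOffsetNumber)
			elog(PANIC, "hash_xlog_insert: failed to add item");

		PageSetLSN(page, lsn);
		MarkBufferDirty(buffer);
	}
	if (BufferIsValid(buffer))
		UnlockReleaseBuffer(buffer);

	/* Backup Blk 1 (the metapage tuple count) would be handled similarly. */
}
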
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 294f9cb..dfbc55e 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -425,6 +425,7 @@ extern Page PageGetTempPageCopySpecial(Page page);
 extern void PageRestoreTempPage(Page tempPage, Page oldPage);
 extern void PageRepairFragmentation(Page page);
 extern Size PageGetFreeSpace(Page page);
+extern Size PageGetFreeSpaceForMulTups(Page page, int ntups);
 extern Size PageGetExactFreeSpace(Page page);
 extern Size PageGetHeapFreeSpace(Page page);
 extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index e519fdb..26cd059 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -2335,13 +2335,9 @@ Options: fastupdate=on, gin_pending_list_limit=128
 -- HASH
 --
 CREATE INDEX hash_i4_index ON hash_i4_heap USING hash (random int4_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_name_index ON hash_name_heap USING hash (random name_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_txt_index ON hash_txt_heap USING hash (random text_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_f8_index ON hash_f8_heap USING hash (random float8_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE UNLOGGED TABLE unlogged_hash_table (id int4);
 CREATE INDEX unlogged_hash_index ON unlogged_hash_table USING hash (id int4_ops);
 DROP TABLE unlogged_hash_table;
@@ -2350,7 +2346,6 @@ DROP TABLE unlogged_hash_table;
 -- maintenance_work_mem setting and fillfactor:
 SET maintenance_work_mem = '1MB';
 CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 EXPLAIN (COSTS OFF)
 SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
                       QUERY PLAN                       
diff --git a/src/test/regress/expected/enum.out b/src/test/regress/expected/enum.out
index 514d1d0..0e60304 100644
--- a/src/test/regress/expected/enum.out
+++ b/src/test/regress/expected/enum.out
@@ -383,7 +383,6 @@ DROP INDEX enumtest_btree;
 -- Hash index / opclass with the = operator
 --
 CREATE INDEX enumtest_hash ON enumtest USING hash (col);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 SELECT * FROM enumtest WHERE col = 'orange';
   col   
 --------
diff --git a/src/test/regress/expected/hash_index.out b/src/test/regress/expected/hash_index.out
index f8b9f02..0a18efa 100644
--- a/src/test/regress/expected/hash_index.out
+++ b/src/test/regress/expected/hash_index.out
@@ -201,7 +201,6 @@ SELECT h.seqno AS f20000
 --
 CREATE TABLE hash_split_heap (keycol INT);
 CREATE INDEX hash_split_index on hash_split_heap USING HASH (keycol);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 INSERT INTO hash_split_heap SELECT 1 FROM generate_series(1, 70000) a;
 VACUUM FULL hash_split_heap;
 -- Let's do a backward scan.
@@ -230,5 +229,4 @@ DROP TABLE hash_temp_heap CASCADE;
 CREATE TABLE hash_heap_float4 (x float4, y int);
 INSERT INTO hash_heap_float4 VALUES (1.1,1);
 CREATE INDEX hash_idx ON hash_heap_float4 USING hash (x);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 DROP TABLE hash_heap_float4 CASCADE;
diff --git a/src/test/regress/expected/macaddr.out b/src/test/regress/expected/macaddr.out
index e84ff5f..151f9ce 100644
--- a/src/test/regress/expected/macaddr.out
+++ b/src/test/regress/expected/macaddr.out
@@ -41,7 +41,6 @@ SELECT * FROM macaddr_data;
 
 CREATE INDEX macaddr_data_btree ON macaddr_data USING btree (b);
 CREATE INDEX macaddr_data_hash ON macaddr_data USING hash (b);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 SELECT a, b, trunc(b) FROM macaddr_data ORDER BY 2, 1;
  a  |         b         |       trunc       
 ----+-------------------+-------------------
diff --git a/src/test/regress/expected/replica_identity.out b/src/test/regress/expected/replica_identity.out
index fa63235..67c34a9 100644
--- a/src/test/regress/expected/replica_identity.out
+++ b/src/test/regress/expected/replica_identity.out
@@ -12,7 +12,6 @@ CREATE UNIQUE INDEX test_replica_identity_keyab_key ON test_replica_identity (ke
 CREATE UNIQUE INDEX test_replica_identity_oid_idx ON test_replica_identity (oid);
 CREATE UNIQUE INDEX test_replica_identity_nonkey ON test_replica_identity (keya, nonkey);
 CREATE INDEX test_replica_identity_hash ON test_replica_identity USING hash (nonkey);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE UNIQUE INDEX test_replica_identity_expr ON test_replica_identity (keya, keyb, (3));
 CREATE UNIQUE INDEX test_replica_identity_partial ON test_replica_identity (keya, keyb) WHERE keyb != '3';
 -- default is 'd'/DEFAULT for user created tables
diff --git a/src/test/regress/expected/uuid.out b/src/test/regress/expected/uuid.out
index 423f277..db66dc7 100644
--- a/src/test/regress/expected/uuid.out
+++ b/src/test/regress/expected/uuid.out
@@ -114,7 +114,6 @@ SELECT COUNT(*) FROM guid1 WHERE guid_field >= '22222222-2222-2222-2222-22222222
 -- btree and hash index creation test
 CREATE INDEX guid1_btree ON guid1 USING BTREE (guid_field);
 CREATE INDEX guid1_hash  ON guid1 USING HASH  (guid_field);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 -- unique index test
 CREATE UNIQUE INDEX guid1_unique_BTREE ON guid1 USING BTREE (guid_field);
 -- should fail
#73Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#72)
6 attachment(s)
Re: Write Ahead Logging for Hash Indexes

On Fri, Feb 10, 2017 at 12:15 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Feb 9, 2017 at 10:28 PM, Robert Haas <robertmhaas@gmail.com> wrote:

The things that look like preliminary refactoring to
me are:

- Adding _hash_pgaddmultitup and using it in various places.
- Adding and freeing overflow pages has been extensively reworked.

Freeing the overflow page is too tightly coupled with changes related
to _hash_pgaddmultitup, so it might be better to keep it along with
it. However, I think we can prepare a separate patch for changes
related to adding the overflow page.

- Similarly, there is some refactoring of how bitmap pages get initialized.
- Index initialization has been rejiggered significantly.
- Bucket splits have been rejiggered.

As discussed, attached are refactoring patches and a patch to enable
WAL for the hash index on top of them.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0001-Expose-a-new-API-_hash_pgaddmultitup-to-add-multiple.patch (application/octet-stream)
From 3fc46c7b27ca66371e6b365c90f6a5bb2833d934 Mon Sep 17 00:00:00 2001
From: Amit Kapila <amit.kapila@enterprisedb.com>
Date: Mon, 13 Feb 2017 16:17:30 +0530
Subject: [PATCH 1/6] Expose a new API _hash_pgaddmultitup to add multiple
 tuples on a page in one shot.  This helps in making the free overflow
 operation atomic.

---
 src/backend/access/hash/hashinsert.c |  41 +++++++
 src/backend/access/hash/hashovfl.c   | 224 ++++++++++++++++++++++++-----------
 src/backend/access/hash/hashpage.c   |   1 -
 src/backend/storage/page/bufpage.c   |  27 +++++
 src/include/access/hash.h            |   7 +-
 src/include/storage/bufpage.h        |   1 +
 6 files changed, 230 insertions(+), 71 deletions(-)

diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index dc63063..354e733 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -228,3 +228,44 @@ _hash_pgaddtup(Relation rel, Buffer buf, Size itemsize, IndexTuple itup)
 
 	return itup_off;
 }
+
+/*
+ *	_hash_pgaddmultitup() -- add a tuple vector to a particular page in the
+ *							 index.
+ *
+ * This routine has the same requirements for locking and tuple ordering as
+ * _hash_pgaddtup().
+ *
+ * The offsets at which the tuples were inserted are returned in the
+ * itup_offsets array.
+ */
+void
+_hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
+					OffsetNumber *itup_offsets, uint16 nitups)
+{
+	OffsetNumber itup_off;
+	Page		page;
+	uint32		hashkey;
+	int			i;
+
+	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+	page = BufferGetPage(buf);
+
+	for (i = 0; i < nitups; i++)
+	{
+		Size		itemsize;
+
+		itemsize = IndexTupleDSize(*itups[i]);
+		itemsize = MAXALIGN(itemsize);
+
+		/* Find where to insert the tuple (preserving page's hashkey ordering) */
+		hashkey = _hash_get_indextuple_hashkey(itups[i]);
+		itup_off = _hash_binsearch(page, hashkey);
+
+		itup_offsets[i] = itup_off;
+
+		if (PageAddItem(page, (Item) itups[i], itemsize, itup_off, false, false)
+			== InvalidOffsetNumber)
+			elog(ERROR, "failed to add index item to \"%s\"",
+				 RelationGetRelationName(rel));
+	}
+}
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 3334089..8b59cad 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -18,6 +18,7 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "miscadmin.h"
 #include "utils/rel.h"
 
 
@@ -391,6 +392,12 @@ _hash_firstfreebit(uint32 map)
  *	Remove this overflow page from its bucket's chain, and mark the page as
  *	free.  On entry, ovflbuf is write-locked; it is released before exiting.
  *
+ *	The tuples (itups) are added to wbuf in this function; we could do that in
+ *	the caller instead, but doing it here makes it easy to write the WAL for
+ *	the whole operation.  The addition of the tuples and the removal of the
+ *	overflow page have to be done atomically, otherwise users might find
+ *	duplicate records during replay on a standby.
+ *
  *	Since this function is invoked in VACUUM, we provide an access strategy
  *	parameter that controls fetches of the bucket pages.
  *
@@ -403,13 +410,16 @@ _hash_firstfreebit(uint32 map)
  *	has a lock on same.
  */
 BlockNumber
-_hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
+_hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
+				   Buffer wbuf, IndexTuple *itups, OffsetNumber *itup_offsets,
+				   Size *tups_size, uint16 nitups,
 				   BufferAccessStrategy bstrategy)
 {
 	HashMetaPage metap;
 	Buffer		metabuf;
 	Buffer		mapbuf;
 	Buffer		prevbuf = InvalidBuffer;
+	Buffer		nextbuf = InvalidBuffer;
 	BlockNumber ovflblkno;
 	BlockNumber prevblkno;
 	BlockNumber blkno;
@@ -435,15 +445,6 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 	bucket = ovflopaque->hasho_bucket;
 
 	/*
-	 * Zero the page for debugging's sake; then write and release it. (Note:
-	 * if we failed to zero the page here, we'd have problems with the Assert
-	 * in _hash_pageinit() when the page is reused.)
-	 */
-	MemSet(ovflpage, 0, BufferGetPageSize(ovflbuf));
-	MarkBufferDirty(ovflbuf);
-	_hash_relbuf(rel, ovflbuf);
-
-	/*
 	 * Fix up the bucket chain.  this is a doubly-linked list, so we must fix
 	 * up the bucket chain members behind and ahead of the overflow page being
 	 * deleted.  Concurrency issues are avoided by using lock chaining as
@@ -451,8 +452,6 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 	 */
 	if (BlockNumberIsValid(prevblkno))
 	{
-		Page		prevpage;
-		HashPageOpaque prevopaque;
 
 		if (prevblkno == writeblkno)
 			prevbuf = wbuf;
@@ -462,32 +461,13 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 												 HASH_WRITE,
 										   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
 												 bstrategy);
-
-		prevpage = BufferGetPage(prevbuf);
-		prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
-
-		Assert(prevopaque->hasho_bucket == bucket);
-		prevopaque->hasho_nextblkno = nextblkno;
-
-		MarkBufferDirty(prevbuf);
-		if (prevblkno != writeblkno)
-			_hash_relbuf(rel, prevbuf);
 	}
 	if (BlockNumberIsValid(nextblkno))
-	{
-		Buffer		nextbuf = _hash_getbuf_with_strategy(rel,
-														 nextblkno,
-														 HASH_WRITE,
-														 LH_OVERFLOW_PAGE,
-														 bstrategy);
-		Page		nextpage = BufferGetPage(nextbuf);
-		HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage);
-
-		Assert(nextopaque->hasho_bucket == bucket);
-		nextopaque->hasho_prevblkno = prevblkno;
-		MarkBufferDirty(nextbuf);
-		_hash_relbuf(rel, nextbuf);
-	}
+		nextbuf = _hash_getbuf_with_strategy(rel,
+											 nextblkno,
+											 HASH_WRITE,
+											 LH_OVERFLOW_PAGE,
+											 bstrategy);
 
 	/* Note: bstrategy is intentionally not used for metapage and bitmap */
 
@@ -508,25 +488,83 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 	/* Release metapage lock while we access the bitmap page */
 	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 
-	/* Clear the bitmap bit to indicate that this overflow page is free */
+	/* read the bitmap page to clear the bitmap bit */
 	mapbuf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BITMAP_PAGE);
 	mappage = BufferGetPage(mapbuf);
 	freep = HashPageGetBitmap(mappage);
 	Assert(ISSET(freep, bitmapbit));
-	CLRBIT(freep, bitmapbit);
-	MarkBufferDirty(mapbuf);
-	_hash_relbuf(rel, mapbuf);
 
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	START_CRIT_SECTION();
+
+	/*
+	 * we have to insert tuples on the "write" page, being careful to preserve
+	 * hashkey ordering.  (If we insert many tuples into the same "write" page
+	 * it would be worth qsort'ing them).
+	 */
+	if (nitups > 0)
+	{
+		_hash_pgaddmultitup(rel, wbuf, itups, itup_offsets, nitups);
+		MarkBufferDirty(wbuf);
+	}
+
+	/*
+	 * Initialize the freed overflow page.  We can't leave the page completely
+	 * zeroed here, because WAL replay routines expect pages to be
+	 * initialized; see the explanation of RBM_NORMAL mode atop
+	 * XLogReadBufferExtended.
+	 */
+	_hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));
+	MarkBufferDirty(ovflbuf);
+
+	if (BufferIsValid(prevbuf))
+	{
+		Page		prevpage = BufferGetPage(prevbuf);
+		HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+		Assert(prevopaque->hasho_bucket == bucket);
+		prevopaque->hasho_nextblkno = nextblkno;
+		MarkBufferDirty(prevbuf);
+	}
+	if (BufferIsValid(nextbuf))
+	{
+		Page		nextpage = BufferGetPage(nextbuf);
+		HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage);
+
+		Assert(nextopaque->hasho_bucket == bucket);
+		nextopaque->hasho_prevblkno = prevblkno;
+		MarkBufferDirty(nextbuf);
+	}
+
+	/* Clear the bitmap bit to indicate that this overflow page is free */
+	CLRBIT(freep, bitmapbit);
+	MarkBufferDirty(mapbuf);
+
 	/* if this is now the first free page, update hashm_firstfree */
 	if (ovflbitno < metap->hashm_firstfree)
 	{
 		metap->hashm_firstfree = ovflbitno;
 		MarkBufferDirty(metabuf);
 	}
-	_hash_relbuf(rel, metabuf);
+
+	END_CRIT_SECTION();
+
+	/* release previous bucket if it is not same as write bucket */
+	if (BufferIsValid(prevbuf) && prevblkno != writeblkno)
+		_hash_relbuf(rel, prevbuf);
+
+	if (BufferIsValid(ovflbuf))
+		_hash_relbuf(rel, ovflbuf);
+
+	if (BufferIsValid(nextbuf))
+		_hash_relbuf(rel, nextbuf);
+
+	if (BufferIsValid(mapbuf))
+		_hash_relbuf(rel, mapbuf);
+
+	if (BufferIsValid(metabuf))
+		_hash_relbuf(rel, metabuf);
 
 	return nextblkno;
 }
@@ -640,7 +678,6 @@ _hash_squeezebucket(Relation rel,
 	Page		rpage;
 	HashPageOpaque wopaque;
 	HashPageOpaque ropaque;
-	bool		wbuf_dirty;
 
 	/*
 	 * start squeezing into the primary bucket page.
@@ -686,15 +723,23 @@ _hash_squeezebucket(Relation rel,
 	/*
 	 * squeeze the tuples.
 	 */
-	wbuf_dirty = false;
 	for (;;)
 	{
 		OffsetNumber roffnum;
 		OffsetNumber maxroffnum;
 		OffsetNumber deletable[MaxOffsetNumber];
-		int			ndeletable = 0;
+		IndexTuple	itups[MaxIndexTuplesPerPage];
+		Size		tups_size[MaxIndexTuplesPerPage];
+		OffsetNumber *itup_offsets;
+		uint16		ndeletable = 0;
+		uint16		nitups = 0;
+		Size		all_tups_size = 0;
+		int			i;
 		bool		retain_pin = false;
 
+		itup_offsets = (OffsetNumber *) palloc(MaxIndexTuplesPerPage * sizeof(OffsetNumber));
+
+readpage:
 		/* Scan each tuple in "read" page */
 		maxroffnum = PageGetMaxOffsetNumber(rpage);
 		for (roffnum = FirstOffsetNumber;
@@ -715,11 +760,13 @@ _hash_squeezebucket(Relation rel,
 
 			/*
 			 * Walk up the bucket chain, looking for a page big enough for
-			 * this item.  Exit if we reach the read page.
+			 * this item and all other accumulated items.  Exit if we reach
+			 * the read page.
 			 */
-			while (PageGetFreeSpace(wpage) < itemsz)
+			while (PageGetFreeSpaceForMulTups(wpage, nitups + 1) < (all_tups_size + itemsz))
 			{
 				Buffer		next_wbuf = InvalidBuffer;
+				bool		tups_moved = false;
 
 				Assert(!PageIsEmpty(wpage));
 
@@ -737,49 +784,86 @@ _hash_squeezebucket(Relation rel,
 														   LH_OVERFLOW_PAGE,
 														   bstrategy);
 
+				if (nitups > 0)
+				{
+					Assert(nitups == ndeletable);
+
+					START_CRIT_SECTION();
+
+					/*
+					 * we have to insert tuples on the "write" page, being
+					 * careful to preserve hashkey ordering.  (If we insert
+					 * many tuples into the same "write" page it would be
+					 * worth qsort'ing them).
+					 */
+					_hash_pgaddmultitup(rel, wbuf, itups, itup_offsets, nitups);
+					MarkBufferDirty(wbuf);
+
+					/* Delete tuples we already moved off read page */
+					PageIndexMultiDelete(rpage, deletable, ndeletable);
+					MarkBufferDirty(rbuf);
+
+					END_CRIT_SECTION();
+
+					tups_moved = true;
+				}
+
 				/*
 				 * release the lock on previous page after acquiring the lock
 				 * on next page
 				 */
-				if (wbuf_dirty)
-					MarkBufferDirty(wbuf);
 				if (retain_pin)
 					LockBuffer(wbuf, BUFFER_LOCK_UNLOCK);
 				else
 					_hash_relbuf(rel, wbuf);
 
+				/*
+				 * We need to release and, if required, reacquire the lock on
+				 * rbuf so that a standby never sees it in an intermediate
+				 * state.
+				 */
+				LockBuffer(rbuf, BUFFER_LOCK_UNLOCK);
+
 				/* nothing more to do if we reached the read page */
 				if (rblkno == wblkno)
 				{
-					if (ndeletable > 0)
-					{
-						/* Delete tuples we already moved off read page */
-						PageIndexMultiDelete(rpage, deletable, ndeletable);
-						MarkBufferDirty(rbuf);
-					}
-					_hash_relbuf(rel, rbuf);
+					_hash_dropbuf(rel, rbuf);
 					return;
 				}
 
+				LockBuffer(rbuf, BUFFER_LOCK_EXCLUSIVE);
+
 				wbuf = next_wbuf;
 				wpage = BufferGetPage(wbuf);
 				wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
 				Assert(wopaque->hasho_bucket == bucket);
-				wbuf_dirty = false;
 				retain_pin = false;
-			}
 
-			/*
-			 * we have found room so insert on the "write" page, being careful
-			 * to preserve hashkey ordering.  (If we insert many tuples into
-			 * the same "write" page it would be worth qsort'ing instead of
-			 * doing repeated _hash_pgaddtup.)
-			 */
-			(void) _hash_pgaddtup(rel, wbuf, itemsz, itup);
-			wbuf_dirty = true;
+				/* be tidy */
+				for (i = 0; i < nitups; i++)
+					pfree(itups[i]);
+				nitups = 0;
+				all_tups_size = 0;
+				ndeletable = 0;
 
+				/*
+				 * after moving the tuples, rpage would have been compacted,
+				 * so we need to rescan it.
+				 */
+				if (tups_moved)
+					goto readpage;
+			}
 			/* remember tuple for deletion from "read" page */
 			deletable[ndeletable++] = roffnum;
+
+			/*
+			 * We need a copy of the index tuples, because they can be freed
+			 * as part of freeing the overflow page, and yet we still need
+			 * them to write a WAL record in _hash_freeovflpage.
+			 */
+			itups[nitups] = CopyIndexTuple(itup);
+			tups_size[nitups++] = itemsz;
+			all_tups_size += itemsz;
 		}
 
 		/*
@@ -797,10 +881,14 @@ _hash_squeezebucket(Relation rel,
 		Assert(BlockNumberIsValid(rblkno));
 
 		/* free this overflow page (releases rbuf) */
-		_hash_freeovflpage(rel, rbuf, wbuf, bstrategy);
+		_hash_freeovflpage(rel, bucket_buf, rbuf, wbuf, itups, itup_offsets,
+						   tups_size, nitups, bstrategy);
+
+		/* be tidy */
+		for (i = 0; i < nitups; i++)
+			pfree(itups[i]);
 
-		if (wbuf_dirty)
-			MarkBufferDirty(wbuf);
+		pfree(itup_offsets);
 
 		/* are we freeing the page adjacent to wbuf? */
 		if (rblkno == wblkno)
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 9485978..00f3ea8 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -470,7 +470,6 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 void
 _hash_pageinit(Page page, Size size)
 {
-	Assert(PageIsNew(page));
 	PageInit(page, size, sizeof(HashPageOpaqueData));
 }
 
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 6fc5fa4..0e0605a 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -598,6 +598,33 @@ PageGetFreeSpace(Page page)
 }
 
 /*
+ * PageGetFreeSpaceForMulTups
+ *		Returns the size of the free (allocatable) space on a page,
+ *		reduced by the space needed for multiple new line pointers.
+ *
+ * Note: this should usually only be used on index pages.  Use
+ * PageGetHeapFreeSpace on heap pages.
+ */
+Size
+PageGetFreeSpaceForMulTups(Page page, int ntups)
+{
+	int			space;
+
+	/*
+	 * Use signed arithmetic here so that we behave sensibly if pd_lower >
+	 * pd_upper.
+	 */
+	space = (int) ((PageHeader) page)->pd_upper -
+		(int) ((PageHeader) page)->pd_lower;
+
+	if (space < (int) (ntups * sizeof(ItemIdData)))
+		return 0;
+	space -= ntups * sizeof(ItemIdData);
+
+	return (Size) space;
+}
+
+/*
  * PageGetExactFreeSpace
  *		Returns the size of the free (allocatable) space on a page,
  *		without any consideration for adding/removing line pointers.
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 3bf587b..5767deb 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -303,11 +303,14 @@ extern Datum hash_uint32(uint32 k);
 extern void _hash_doinsert(Relation rel, IndexTuple itup);
 extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
 			   Size itemsize, IndexTuple itup);
+extern void _hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
+					OffsetNumber *itup_offsets, uint16 nitups);
 
 /* hashovfl.c */
 extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin);
-extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
-				   BufferAccessStrategy bstrategy);
+extern BlockNumber _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
+				   Buffer wbuf, IndexTuple *itups, OffsetNumber *itup_offsets,
+			 Size *tups_size, uint16 nitups, BufferAccessStrategy bstrategy);
 extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
 				 BlockNumber blkno, ForkNumber forkNum);
 extern void _hash_squeezebucket(Relation rel,
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 294f9cb..dfbc55e 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -425,6 +425,7 @@ extern Page PageGetTempPageCopySpecial(Page page);
 extern void PageRestoreTempPage(Page tempPage, Page oldPage);
 extern void PageRepairFragmentation(Page page);
 extern Size PageGetFreeSpace(Page page);
+extern Size PageGetFreeSpaceForMulTups(Page page, int ntups);
 extern Size PageGetExactFreeSpace(Page page);
 extern Size PageGetHeapFreeSpace(Page page);
 extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
-- 
1.8.4.msysgit.0
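
For context on the comment above that adding the tuples in _hash_freeovflpage
makes it easy to write the WAL: with the xl_hash_squeeze_page layout shown
earlier in the thread, a single record can presumably cover both the tuple
moves and the page free.  A rough, illustrative fragment of what that logging
could look like inside the critical section is below (variable names follow
the surrounding code; registration of the prev/next, bitmap, and meta pages is
omitted; this is not a quote of the actual WAL patch).

	if (RelationNeedsWAL(rel))
	{
		xl_hash_squeeze_page xlrec;
		XLogRecPtr	recptr;
		int			i;

		xlrec.prevblkno = prevblkno;
		xlrec.nextblkno = nextblkno;
		xlrec.ntups = nitups;
		xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf);
		xlrec.is_prev_bucket_same_wrt = (wbuf == prevbuf);

		XLogBeginInsert();
		XLogRegisterData((char *) &xlrec, SizeOfHashSqueezePage);

		/* the "write" page, with the moved tuples attached as block data */
		XLogRegisterBuffer(0, wbuf, REGBUF_STANDARD);
		for (i = 0; i < nitups; i++)
		{
			XLogRegisterBufData(0, (char *) &itup_offsets[i], sizeof(OffsetNumber));
			XLogRegisterBufData(0, (char *) itups[i], tups_size[i]);
		}

		/* the freed overflow page, reinitialized during replay */
		XLogRegisterBuffer(1, ovflbuf, REGBUF_WILL_INIT);

		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SQUEEZE_PAGE);

		PageSetLSN(BufferGetPage(wbuf), recptr);
		PageSetLSN(BufferGetPage(ovflbuf), recptr);
	}
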

0002-Expose-hashinitbitmapbuffer-to-initialize-bitmap-pag.patch (application/octet-stream)
From 5b4869f657fdba8443e1870ead8b563ba71e4b43 Mon Sep 17 00:00:00 2001
From: Amit Kapila <amit.kapila@enterprisedb.com>
Date: Mon, 13 Feb 2017 16:18:18 +0530
Subject: [PATCH 2/6] Expose _hash_initbitmapbuffer to initialize bitmap page
 buffer.  This will be used by future patches.

---
 src/backend/access/hash/hashovfl.c | 40 ++++++++++++++++++++++++++++++++++++++
 src/include/access/hash.h          |  1 +
 2 files changed, 41 insertions(+)

diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 8b59cad..b3cc7b9 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -633,6 +633,46 @@ _hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
 
 
 /*
+ *	_hash_initbitmapbuffer()
+ *
+ *	 Initialize a new bitmap page.  All bits in the new bitmap page are set to
+ *	 "1", indicating "in use".
+ */
+void
+_hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage)
+{
+	Page		pg;
+	HashPageOpaque op;
+	uint32	   *freep;
+
+	pg = BufferGetPage(buf);
+
+	/* initialize the page */
+	if (initpage)
+		_hash_pageinit(pg, BufferGetPageSize(buf));
+
+	/* initialize the page's special space */
+	op = (HashPageOpaque) PageGetSpecialPointer(pg);
+	op->hasho_prevblkno = InvalidBlockNumber;
+	op->hasho_nextblkno = InvalidBlockNumber;
+	op->hasho_bucket = -1;
+	op->hasho_flag = LH_BITMAP_PAGE;
+	op->hasho_page_id = HASHO_PAGE_ID;
+
+	/* set all of the bits to 1 */
+	freep = HashPageGetBitmap(pg);
+	MemSet(freep, 0xFF, bmsize);
+
+	/*
+	 * Set pd_lower just past the end of the bitmap page data.  We could even
+	 * set pd_lower equal to pd_upper, but this is more precise and makes the
+	 * page look compressible to xlog.c.
+	 */
+	((PageHeader) pg)->pd_lower = ((char *) freep + bmsize) - (char *) pg;
+}
+
+
+/*
  *	_hash_squeezebucket(rel, bucket)
  *
  *	Try to squeeze the tuples onto pages occurring earlier in the
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 5767deb..9c0b79f 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -313,6 +313,7 @@ extern BlockNumber _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovf
 			 Size *tups_size, uint16 nitups, BufferAccessStrategy bstrategy);
 extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
 				 BlockNumber blkno, ForkNumber forkNum);
+extern void _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage);
 extern void _hash_squeezebucket(Relation rel,
 					Bucket bucket, BlockNumber bucket_blkno,
 					Buffer bucket_buf,
-- 
1.8.4.msysgit.0
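
The point of exposing _hash_initbitmapbuffer in terms of a bare buffer (with
an optional page re-init) is presumably that the same code can serve both the
normal path and WAL replay.  Assuming the xl_hash_init_bitmap_page record
defined earlier in the thread, a replay-side use might look roughly like the
fragment below ('record' and 'lsn' come from the surrounding redo routine;
this is illustrative, not the patch's actual redo code).

	xl_hash_init_bitmap_page *xlrec =
		(xl_hash_init_bitmap_page *) XLogRecGetData(record);
	Buffer		bitmapbuf;

	/* Backup Blk 0: rebuild the new bitmap page from scratch */
	bitmapbuf = XLogInitBufferForRedo(record, 0);
	_hash_initbitmapbuffer(bitmapbuf, xlrec->bmsize, true);

	PageSetLSN(BufferGetPage(bitmapbuf), lsn);
	MarkBufferDirty(bitmapbuf);
	UnlockReleaseBuffer(bitmapbuf);
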

0003-Restructure-_hash_addovflpage-so-that-the-operation-.patch (application/octet-stream)
From 564259f35a3d30df8cc1848b5afe06448d028494 Mon Sep 17 00:00:00 2001
From: Amit Kapila <amit.kapila@enterprisedb.com>
Date: Mon, 13 Feb 2017 16:19:06 +0530
Subject: [PATCH 3/6] Restructure _hash_addovflpage so that the operation can
 be performed atomically and can be WAL-logged.

---
 src/backend/access/hash/hashovfl.c | 232 +++++++++++++++++++++----------------
 src/backend/access/hash/hashpage.c |   2 +-
 2 files changed, 135 insertions(+), 99 deletions(-)

diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index b3cc7b9..051410b 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -22,7 +22,6 @@
 #include "utils/rel.h"
 
 
-static Buffer _hash_getovflpage(Relation rel, Buffer metabuf);
 static uint32 _hash_firstfreebit(uint32 map);
 
 
@@ -96,7 +95,9 @@ _hash_ovflblkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
  *	dropped before exiting (we assume the caller is not interested in 'buf'
  *	anymore) if not asked to retain.  The pin will be retained only for the
  *	primary bucket.  The returned overflow page will be pinned and
- *	write-locked; it is guaranteed to be empty.
+ *	write-locked; it is not guaranteed to be empty, since we release and
+ *	reacquire the lock on it, so the caller must ensure that the page has
+ *	the required free space.
  *
  *	The caller must hold a pin, but no lock, on the metapage buffer.
  *	That buffer is returned in the same state.
@@ -114,13 +115,37 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 	Page		ovflpage;
 	HashPageOpaque pageopaque;
 	HashPageOpaque ovflopaque;
-
-	/* allocate and lock an empty overflow page */
-	ovflbuf = _hash_getovflpage(rel, metabuf);
+	HashMetaPage metap;
+	Buffer		mapbuf = InvalidBuffer;
+	Buffer		newmapbuf = InvalidBuffer;
+	BlockNumber blkno;
+	uint32		orig_firstfree;
+	uint32		splitnum;
+	uint32	   *freep = NULL;
+	uint32		max_ovflpg;
+	uint32		bit;
+	uint32		bitmap_page_bit;
+	uint32		first_page;
+	uint32		last_bit;
+	uint32		last_page;
+	uint32		i,
+				j;
+	bool		page_found = false;
 
 	/*
-	 * Write-lock the tail page.  It is okay to hold two buffer locks here
-	 * since there cannot be anyone else contending for access to ovflbuf.
+	 * Write-lock the tail page.  Here we need to maintain the locking order:
+	 * first acquire the lock on the tail page of the bucket, then on the meta
+	 * page to find and lock the bitmap page; once the bitmap page is found,
+	 * the lock on the meta page is released, and finally the lock on the new
+	 * overflow buffer is acquired.  We need this locking order to avoid
+	 * deadlock with backends that are doing inserts.
+	 *
+	 * Note: We could have avoided locking many buffers here if we made two
+	 * WAL records for acquiring an overflow page (one to allocate an overflow
+	 * page and another to add it to the overflow bucket chain).  However,
+	 * doing so could leak an overflow page if the system crashes after the
+	 * allocation.  Needless to say, a single record is also better from a
+	 * performance point of view.
 	 */
 	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
@@ -154,60 +179,6 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 		buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
 	}
 
-	/* now that we have correct backlink, initialize new overflow page */
-	ovflpage = BufferGetPage(ovflbuf);
-	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
-	ovflopaque->hasho_prevblkno = BufferGetBlockNumber(buf);
-	ovflopaque->hasho_nextblkno = InvalidBlockNumber;
-	ovflopaque->hasho_bucket = pageopaque->hasho_bucket;
-	ovflopaque->hasho_flag = LH_OVERFLOW_PAGE;
-	ovflopaque->hasho_page_id = HASHO_PAGE_ID;
-
-	MarkBufferDirty(ovflbuf);
-
-	/* logically chain overflow page to previous page */
-	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
-	MarkBufferDirty(buf);
-	if (retain_pin)
-	{
-		/* pin will be retained only for the primary bucket page */
-		Assert(pageopaque->hasho_flag & LH_BUCKET_PAGE);
-		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-	}
-	else
-		_hash_relbuf(rel, buf);
-
-	return ovflbuf;
-}
-
-/*
- *	_hash_getovflpage()
- *
- *	Find an available overflow page and return it.  The returned buffer
- *	is pinned and write-locked, and has had _hash_pageinit() applied,
- *	but it is caller's responsibility to fill the special space.
- *
- * The caller must hold a pin, but no lock, on the metapage buffer.
- * That buffer is left in the same state at exit.
- */
-static Buffer
-_hash_getovflpage(Relation rel, Buffer metabuf)
-{
-	HashMetaPage metap;
-	Buffer		mapbuf = 0;
-	Buffer		newbuf;
-	BlockNumber blkno;
-	uint32		orig_firstfree;
-	uint32		splitnum;
-	uint32	   *freep = NULL;
-	uint32		max_ovflpg;
-	uint32		bit;
-	uint32		first_page;
-	uint32		last_bit;
-	uint32		last_page;
-	uint32		i,
-				j;
-
 	/* Get exclusive lock on the meta page */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
@@ -256,11 +227,31 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 		for (; bit <= last_inpage; j++, bit += BITS_PER_MAP)
 		{
 			if (freep[j] != ALL_SET)
+			{
+				page_found = true;
+
+				/* Reacquire exclusive lock on the meta page */
+				LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+
+				/* convert bit to bit number within page */
+				bit += _hash_firstfreebit(freep[j]);
+				bitmap_page_bit = bit;
+
+				/* convert bit to absolute bit number */
+				bit += (i << BMPG_SHIFT(metap));
+				/* Calculate address of the recycled overflow page */
+				blkno = bitno_to_blkno(metap, bit);
+
+				/* Fetch and init the recycled page */
+				ovflbuf = _hash_getinitbuf(rel, blkno);
+
 				goto found;
+			}
 		}
 
 		/* No free space here, try to advance to next map page */
 		_hash_relbuf(rel, mapbuf);
+		mapbuf = InvalidBuffer;
 		i++;
 		j = 0;					/* scan from start of next map page */
 		bit = 0;
@@ -284,8 +275,15 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 		 * convenient to pre-mark them as "in use" too.
 		 */
 		bit = metap->hashm_spares[splitnum];
-		_hash_initbitmap(rel, metap, bitno_to_blkno(metap, bit), MAIN_FORKNUM);
-		metap->hashm_spares[splitnum]++;
+		newmapbuf = _hash_getnewbuf(rel, bitno_to_blkno(metap, bit), MAIN_FORKNUM);
+
+		/* add the new bitmap page to the metapage's list of bitmaps */
+		/* metapage already has a write lock */
+		if (metap->hashm_nmaps >= HASH_MAX_BITMAPS)
+			ereport(ERROR,
+					(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+					 errmsg("out of overflow pages in hash index \"%s\"",
+							RelationGetRelationName(rel))));
 	}
 	else
 	{
@@ -296,7 +294,8 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 	}
 
 	/* Calculate address of the new overflow page */
-	bit = metap->hashm_spares[splitnum];
+	bit = BufferIsValid(newmapbuf) ?
+		metap->hashm_spares[splitnum] + 1 : metap->hashm_spares[splitnum];
 	blkno = bitno_to_blkno(metap, bit);
 
 	/*
@@ -304,41 +303,49 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 	 * relation length stays in sync with ours.  XXX It's annoying to do this
 	 * with metapage write lock held; would be better to use a lock that
 	 * doesn't block incoming searches.
+	 *
+	 * It is okay to hold two buffer locks here (one on the tail page of the
+	 * bucket and the other on the new overflow page) since there cannot be
+	 * anyone else contending for access to ovflbuf.
 	 */
-	newbuf = _hash_getnewbuf(rel, blkno, MAIN_FORKNUM);
+	ovflbuf = _hash_getnewbuf(rel, blkno, MAIN_FORKNUM);
 
-	metap->hashm_spares[splitnum]++;
+found:
 
 	/*
-	 * Adjust hashm_firstfree to avoid redundant searches.  But don't risk
-	 * changing it if someone moved it while we were searching bitmap pages.
+	 * Do the update.
 	 */
-	if (metap->hashm_firstfree == orig_firstfree)
-		metap->hashm_firstfree = bit + 1;
-
-	/* Write updated metapage and release lock, but not pin */
-	MarkBufferDirty(metabuf);
-	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
-
-	return newbuf;
+	START_CRIT_SECTION();
 
-found:
-	/* convert bit to bit number within page */
-	bit += _hash_firstfreebit(freep[j]);
+	if (page_found)
+	{
+		Assert(BufferIsValid(mapbuf));
 
-	/* mark page "in use" in the bitmap */
-	SETBIT(freep, bit);
-	MarkBufferDirty(mapbuf);
-	_hash_relbuf(rel, mapbuf);
+		/* mark page "in use" in the bitmap */
+		SETBIT(freep, bitmap_page_bit);
+		MarkBufferDirty(mapbuf);
+	}
+	else
+	{
+		/* update the count to indicate new overflow page is added */
+		metap->hashm_spares[splitnum]++;
 
-	/* Reacquire exclusive lock on the meta page */
-	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+		if (BufferIsValid(newmapbuf))
+		{
+			_hash_initbitmapbuffer(newmapbuf, metap->hashm_bmsize, false);
+			MarkBufferDirty(newmapbuf);
 
-	/* convert bit to absolute bit number */
-	bit += (i << BMPG_SHIFT(metap));
+			metap->hashm_mapp[metap->hashm_nmaps] = BufferGetBlockNumber(newmapbuf);
+			metap->hashm_nmaps++;
+			metap->hashm_spares[splitnum]++;
+			MarkBufferDirty(metabuf);
+		}
 
-	/* Calculate address of the recycled overflow page */
-	blkno = bitno_to_blkno(metap, bit);
+		/*
+		 * For a new overflow page, we don't need to explicitly set the bit
+		 * in the bitmap page, as by default it will be set to "in use".
+		 */
+	}
 
 	/*
 	 * Adjust hashm_firstfree to avoid redundant searches.  But don't risk
@@ -347,19 +354,48 @@ found:
 	if (metap->hashm_firstfree == orig_firstfree)
 	{
 		metap->hashm_firstfree = bit + 1;
-
-		/* Write updated metapage and release lock, but not pin */
 		MarkBufferDirty(metabuf);
-		LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 	}
+
+	/* now that we have correct backlink, initialize new overflow page */
+	ovflpage = BufferGetPage(ovflbuf);
+	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
+	ovflopaque->hasho_prevblkno = BufferGetBlockNumber(buf);
+	ovflopaque->hasho_nextblkno = InvalidBlockNumber;
+	ovflopaque->hasho_bucket = pageopaque->hasho_bucket;
+	ovflopaque->hasho_flag = LH_OVERFLOW_PAGE;
+	ovflopaque->hasho_page_id = HASHO_PAGE_ID;
+
+	MarkBufferDirty(ovflbuf);
+
+	/* logically chain overflow page to previous page */
+	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
+
+	MarkBufferDirty(buf);
+
+	END_CRIT_SECTION();
+
+	if (retain_pin)
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 	else
-	{
-		/* We didn't change the metapage, so no need to write */
-		LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
-	}
+		_hash_relbuf(rel, buf);
+
+	if (BufferIsValid(mapbuf))
+		_hash_relbuf(rel, mapbuf);
 
-	/* Fetch, init, and return the recycled page */
-	return _hash_getinitbuf(rel, blkno);
+	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+
+	if (BufferIsValid(newmapbuf))
+		_hash_relbuf(rel, newmapbuf);
+
+	/*
+	 * We need to release and reacquire the lock on the overflow buffer to
+	 * ensure that the standby doesn't see an intermediate state of it.
+	 */
+	LockBuffer(ovflbuf, BUFFER_LOCK_UNLOCK);
+	LockBuffer(ovflbuf, BUFFER_LOCK_EXCLUSIVE);
+
+	return ovflbuf;
 }
 
 /*
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 00f3ea8..ea72892 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -983,7 +983,7 @@ _hash_splitbucket_guts(Relation rel,
 				itemsz = IndexTupleDSize(*new_itup);
 				itemsz = MAXALIGN(itemsz);
 
-				if (PageGetFreeSpace(npage) < itemsz)
+				while (PageGetFreeSpace(npage) < itemsz)
 				{
 					/* write out nbuf and drop lock, but keep pin */
 					MarkBufferDirty(nbuf);
-- 
1.8.4.msysgit.0
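
To summarize the restructuring in this patch: the page modifications that used to be spread across _hash_addovflpage and _hash_getovflpage now happen under a single critical section, so a later patch can cover the whole allocation with one WAL record.  A heavily condensed outline of the new flow (pseudo-C distilled from the hunks above, not the literal function body; error paths and the bitmap-search loop are elided):

    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);      /* tail page of the bucket */
    LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);  /* then the meta page */

    /*
     * Either a recycled page is found in one of the bitmap pages (page_found)
     * or the index is extended with _hash_getnewbuf(), possibly creating a
     * new bitmap page (newmapbuf) as well.
     */

    START_CRIT_SECTION();
    /* set the bitmap bit, or bump hashm_spares and install the new bitmap
     * page; adjust hashm_firstfree; fill the overflow page's special space;
     * link it from the (former) tail page */
    MarkBufferDirty(ovflbuf);
    MarkBufferDirty(buf);
    END_CRIT_SECTION();

    /* release buf (or just its lock, if retain_pin), mapbuf, the meta page
     * lock and newmapbuf; finally cycle the lock on ovflbuf so a standby
     * never sees it half-initialized */
    return ovflbuf;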

0004-restructure-split-bucket-code-so-that-the-operation-.patch (application/octet-stream)
From aaa599f695aa18043af474b2679c4afc476f4ccd Mon Sep 17 00:00:00 2001
From: Amit Kapila <amit.kapila@enterprisedb.com>
Date: Mon, 13 Feb 2017 16:20:22 +0530
Subject: [PATCH 4/6] restructure split bucket code so that the operation can
 be performed atomically and can be WAL-logged.

---
 src/backend/access/hash/hashpage.c | 186 ++++++++++++++++++-------------------
 1 file changed, 89 insertions(+), 97 deletions(-)

diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index ea72892..14b685e 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -40,12 +40,9 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
 				  Bucket obucket, Bucket nbucket,
 				  Buffer obuf,
 				  Buffer nbuf,
+				  HTAB *htab,
 				  uint32 maxbucket,
 				  uint32 highmask, uint32 lowmask);
-static void _hash_splitbucket_guts(Relation rel, Buffer metabuf,
-					   Bucket obucket, Bucket nbucket, Buffer obuf,
-					   Buffer nbuf, HTAB *htab, uint32 maxbucket,
-					   uint32 highmask, uint32 lowmask);
 
 
 /*
@@ -497,7 +494,9 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	Buffer		buf_nblkno;
 	Buffer		buf_oblkno;
 	Page		opage;
+	Page		npage;
 	HashPageOpaque oopaque;
+	HashPageOpaque nopaque;
 	uint32		maxbucket;
 	uint32		highmask;
 	uint32		lowmask;
@@ -685,18 +684,17 @@ restart_expand:
 		goto fail;
 	}
 
-
 	/*
-	 * Okay to proceed with split.  Update the metapage bucket mapping info.
-	 *
-	 * Since we are scribbling on the metapage data right in the shared
-	 * buffer, any failure in this next little bit leaves us with a big
+	 * Since we are scribbling on the pages in the shared buffers, establish a
+	 * critical section.  Any failure in this next code leaves us with a big
 	 * problem: the metapage is effectively corrupt but could get written back
-	 * to disk.  We don't really expect any failure, but just to be sure,
-	 * establish a critical section.
+	 * to disk.
 	 */
 	START_CRIT_SECTION();
 
+	/*
+	 * Okay to proceed with split.  Update the metapage bucket mapping info.
+	 */
 	metap->hashm_maxbucket = new_bucket;
 
 	if (new_bucket > metap->hashm_highmask)
@@ -718,8 +716,7 @@ restart_expand:
 		metap->hashm_ovflpoint = spare_ndx;
 	}
 
-	/* Done mucking with metapage */
-	END_CRIT_SECTION();
+	MarkBufferDirty(metabuf);
 
 	/*
 	 * Copy bucket mapping info now; this saves re-accessing the meta page
@@ -732,16 +729,58 @@ restart_expand:
 	highmask = metap->hashm_highmask;
 	lowmask = metap->hashm_lowmask;
 
-	/* Write out the metapage and drop lock, but keep pin */
-	MarkBufferDirty(metabuf);
+	opage = BufferGetPage(buf_oblkno);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	/*
+	 * Mark the old bucket to indicate that split is in progress.  (At
+	 * operation end, we will clear the split-in-progress flag.)  Also,
+	 * for a primary bucket page, hasho_prevblkno stores the number of
+	 * buckets that existed as of the last split, so we must update that
+	 * value here.
+	 */
+	oopaque->hasho_flag |= LH_BUCKET_BEING_SPLIT;
+	oopaque->hasho_prevblkno = maxbucket;
+
+	MarkBufferDirty(buf_oblkno);
+
+	npage = BufferGetPage(buf_nblkno);
+
+	/*
+	 * initialize the new bucket's primary page and mark it to indicate that
+	 * split is in progress.
+	 */
+	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	nopaque->hasho_prevblkno = maxbucket;
+	nopaque->hasho_nextblkno = InvalidBlockNumber;
+	nopaque->hasho_bucket = new_bucket;
+	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_BEING_POPULATED;
+	nopaque->hasho_page_id = HASHO_PAGE_ID;
+
+	MarkBufferDirty(buf_nblkno);
+
+	END_CRIT_SECTION();
+
+	/* drop lock, but keep pin */
 	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 
+	/*
+	 * We need to release and reacquire the lock on the new bucket's buffer
+	 * to ensure that the standby doesn't see an intermediate state of it.
+	 */
+	LockBuffer(buf_nblkno, BUFFER_LOCK_UNLOCK);
+	LockBuffer(buf_nblkno, BUFFER_LOCK_EXCLUSIVE);
+
 	/* Relocate records to the new bucket */
 	_hash_splitbucket(rel, metabuf,
 					  old_bucket, new_bucket,
-					  buf_oblkno, buf_nblkno,
+					  buf_oblkno, buf_nblkno, NULL,
 					  maxbucket, highmask, lowmask);
 
+	/* all done, now release the locks and pins on primary buckets. */
+	_hash_relbuf(rel, buf_oblkno);
+	_hash_relbuf(rel, buf_nblkno);
+
 	return;
 
 	/* Here if decide not to split or fail to acquire old bucket lock */
@@ -803,10 +842,16 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 /*
  * _hash_splitbucket -- split 'obucket' into 'obucket' and 'nbucket'
  *
+ * This routine is used to partition the tuples between the old and new bucket
+ * and to finish incomplete split operations.  To finish a previously
+ * interrupted split operation, the caller needs to fill htab.  If htab is set,
+ * we skip the movement of tuples that exist in htab; otherwise, a NULL value
+ * of htab indicates movement of all the tuples that belong to the new
+ * bucket.
+ *
  * We are splitting a bucket that consists of a base bucket page and zero
  * or more overflow (bucket chain) pages.  We must relocate tuples that
- * belong in the new bucket, and compress out any free space in the old
- * bucket.
+ * belong in the new bucket.
  *
  * The caller must hold cleanup locks on both buckets to ensure that
  * no one else is trying to access them (see README).
@@ -832,73 +877,11 @@ _hash_splitbucket(Relation rel,
 				  Bucket nbucket,
 				  Buffer obuf,
 				  Buffer nbuf,
+				  HTAB *htab,
 				  uint32 maxbucket,
 				  uint32 highmask,
 				  uint32 lowmask)
 {
-	Page		opage;
-	Page		npage;
-	HashPageOpaque oopaque;
-	HashPageOpaque nopaque;
-
-	opage = BufferGetPage(obuf);
-	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
-
-	/*
-	 * Mark the old bucket to indicate that split is in progress.  (At
-	 * operation end, we will clear the split-in-progress flag.)  Also,
-	 * for a primary bucket page, hasho_prevblkno stores the number of
-	 * buckets that existed as of the last split, so we must update that
-	 * value here.
-	 */
-	oopaque->hasho_flag |= LH_BUCKET_BEING_SPLIT;
-	oopaque->hasho_prevblkno = maxbucket;
-
-	npage = BufferGetPage(nbuf);
-
-	/*
-	 * initialize the new bucket's primary page and mark it to indicate that
-	 * split is in progress.
-	 */
-	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
-	nopaque->hasho_prevblkno = maxbucket;
-	nopaque->hasho_nextblkno = InvalidBlockNumber;
-	nopaque->hasho_bucket = nbucket;
-	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_BEING_POPULATED;
-	nopaque->hasho_page_id = HASHO_PAGE_ID;
-
-	_hash_splitbucket_guts(rel, metabuf, obucket,
-						   nbucket, obuf, nbuf, NULL,
-						   maxbucket, highmask, lowmask);
-
-	/* all done, now release the locks and pins on primary buckets. */
-	_hash_relbuf(rel, obuf);
-	_hash_relbuf(rel, nbuf);
-}
-
-/*
- * _hash_splitbucket_guts -- Helper function to perform the split operation
- *
- * This routine is used to partition the tuples between old and new bucket and
- * to finish incomplete split operations.  To finish the previously
- * interrupted split operation, caller needs to fill htab.  If htab is set, then
- * we skip the movement of tuples that exists in htab, otherwise NULL value of
- * htab indicates movement of all the tuples that belong to new bucket.
- *
- * Caller needs to lock and unlock the old and new primary buckets.
- */
-static void
-_hash_splitbucket_guts(Relation rel,
-					   Buffer metabuf,
-					   Bucket obucket,
-					   Bucket nbucket,
-					   Buffer obuf,
-					   Buffer nbuf,
-					   HTAB *htab,
-					   uint32 maxbucket,
-					   uint32 highmask,
-					   uint32 lowmask)
-{
 	Buffer		bucket_obuf;
 	Buffer		bucket_nbuf;
 	Page		opage;
@@ -985,9 +968,9 @@ _hash_splitbucket_guts(Relation rel,
 
 				while (PageGetFreeSpace(npage) < itemsz)
 				{
-					/* write out nbuf and drop lock, but keep pin */
-					MarkBufferDirty(nbuf);
+					/* drop lock, but keep pin */
 					LockBuffer(nbuf, BUFFER_LOCK_UNLOCK);
+
 					/* chain to a new overflow page */
 					nbuf = _hash_addovflpage(rel, metabuf, nbuf, (nbuf == bucket_nbuf) ? true : false);
 					npage = BufferGetPage(nbuf);
@@ -995,6 +978,13 @@ _hash_splitbucket_guts(Relation rel,
 				}
 
 				/*
+				 * Change the shared buffer state inside a critical section;
+				 * otherwise any error could leave the page in a state that
+				 * recovery cannot reconstruct.
+				 */
+				START_CRIT_SECTION();
+
+				/*
 				 * Insert tuple on new page, using _hash_pgaddtup to ensure
 				 * correct ordering by hashkey.  This is a tad inefficient
 				 * since we may have to shuffle itempointers repeatedly.
@@ -1003,6 +993,8 @@ _hash_splitbucket_guts(Relation rel,
 				 */
 				(void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
 
+				END_CRIT_SECTION();
+
 				/* be tidy */
 				pfree(new_itup);
 			}
@@ -1025,7 +1017,13 @@ _hash_splitbucket_guts(Relation rel,
 
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
+		{
+			if (nbuf == bucket_nbuf)
+				LockBuffer(nbuf, BUFFER_LOCK_UNLOCK);
+			else
+				_hash_relbuf(rel, nbuf);
 			break;
+		}
 
 		/* Else, advance to next old page */
 		obuf = _hash_getbuf(rel, oblkno, HASH_READ, LH_OVERFLOW_PAGE);
@@ -1041,17 +1039,6 @@ _hash_splitbucket_guts(Relation rel,
 	 * To avoid deadlocks due to locking order of buckets, first lock the old
 	 * bucket and then the new bucket.
 	 */
-	if (nbuf == bucket_nbuf)
-	{
-		MarkBufferDirty(bucket_nbuf);
-		LockBuffer(bucket_nbuf, BUFFER_LOCK_UNLOCK);
-	}
-	else
-	{
-		MarkBufferDirty(nbuf);
-		_hash_relbuf(rel, nbuf);
-	}
-
 	LockBuffer(bucket_obuf, BUFFER_LOCK_EXCLUSIVE);
 	opage = BufferGetPage(bucket_obuf);
 	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
@@ -1060,6 +1047,8 @@ _hash_splitbucket_guts(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	START_CRIT_SECTION();
+
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
 	nopaque->hasho_flag &= ~LH_BUCKET_BEING_POPULATED;
 
@@ -1076,8 +1065,11 @@ _hash_splitbucket_guts(Relation rel,
 	 */
 	MarkBufferDirty(bucket_obuf);
 	MarkBufferDirty(bucket_nbuf);
+
+	END_CRIT_SECTION();
 }
 
+
 /*
  *	_hash_finish_split() -- Finish the previously interrupted split operation
  *
@@ -1192,9 +1184,9 @@ _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Bucket obucket,
 	npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 	nbucket = npageopaque->hasho_bucket;
 
-	_hash_splitbucket_guts(rel, metabuf, obucket,
-						   nbucket, obuf, bucket_nbuf, tidhtab,
-						   maxbucket, highmask, lowmask);
+	_hash_splitbucket(rel, metabuf, obucket,
+					  nbucket, obuf, bucket_nbuf, tidhtab,
+					  maxbucket, highmask, lowmask);
 
 	_hash_relbuf(rel, bucket_nbuf);
 	LockBuffer(obuf, BUFFER_LOCK_UNLOCK);
-- 
1.8.4.msysgit.0
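
The practical effect of this patch, condensed: the flag manipulation that used to sit at the top of _hash_splitbucket is pulled up into _hash_expandtable and grouped with the metapage update, so one critical section now covers everything that a single split-initialization WAL record will later describe.  Roughly (pseudo-C distilled from the hunks above, not the literal code):

    START_CRIT_SECTION();
    metap->hashm_maxbucket = new_bucket;           /* plus mask/spares updates */
    MarkBufferDirty(metabuf);

    oopaque->hasho_flag |= LH_BUCKET_BEING_SPLIT;  /* old bucket */
    oopaque->hasho_prevblkno = maxbucket;
    MarkBufferDirty(buf_oblkno);

    /* init the new bucket's primary page with LH_BUCKET_BEING_POPULATED set */
    MarkBufferDirty(buf_nblkno);
    END_CRIT_SECTION();

    /* the lock on buf_nblkno is then cycled, _hash_splitbucket() moves the
     * tuples, and the final flag clearing on both primary pages sits in its
     * own critical section at the end of that function */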

0005-Restructure-hash-index-creation-and-metapage-initial.patch (application/octet-stream)
From 94de0641a7776bcc9a7ed9fa2c25b2812a43402d Mon Sep 17 00:00:00 2001
From: Amit Kapila <amit.kapila@enterprisedb.com>
Date: Mon, 13 Feb 2017 16:38:12 +0530
Subject: [PATCH 5/6] Restructure hash index creation and metapage
 initialization code so that the operation can be performed atomically and can
 be WAL-logged.

---
 src/backend/access/hash/hash.c     |   4 +-
 src/backend/access/hash/hashovfl.c |  62 -----------
 src/backend/access/hash/hashpage.c | 203 +++++++++++++++++++++++++------------
 src/include/access/hash.h          |  10 +-
 4 files changed, 146 insertions(+), 133 deletions(-)

diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index bca77a8..da4e6a2 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -119,7 +119,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	estimate_rel_size(heap, NULL, &relpages, &reltuples, &allvisfrac);
 
 	/* Initialize the hash index metadata page and initial buckets */
-	num_buckets = _hash_metapinit(index, reltuples, MAIN_FORKNUM);
+	num_buckets = _hash_init(index, reltuples, MAIN_FORKNUM);
 
 	/*
 	 * If we just insert the tuples into the index in scan order, then
@@ -181,7 +181,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 void
 hashbuildempty(Relation index)
 {
-	_hash_metapinit(index, 0, INIT_FORKNUM);
+	_hash_init(index, 0, INIT_FORKNUM);
 }
 
 /*
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 051410b..084ad92 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -607,68 +607,6 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 
 
 /*
- *	_hash_initbitmap()
- *
- *	 Initialize a new bitmap page.  The metapage has a write-lock upon
- *	 entering the function, and must be written by caller after return.
- *
- * 'blkno' is the block number of the new bitmap page.
- *
- * All bits in the new bitmap page are set to "1", indicating "in use".
- */
-void
-_hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
-				 ForkNumber forkNum)
-{
-	Buffer		buf;
-	Page		pg;
-	HashPageOpaque op;
-	uint32	   *freep;
-
-	/*
-	 * It is okay to write-lock the new bitmap page while holding metapage
-	 * write lock, because no one else could be contending for the new page.
-	 * Also, the metapage lock makes it safe to extend the index using
-	 * _hash_getnewbuf.
-	 *
-	 * There is some loss of concurrency in possibly doing I/O for the new
-	 * page while holding the metapage lock, but this path is taken so seldom
-	 * that it's not worth worrying about.
-	 */
-	buf = _hash_getnewbuf(rel, blkno, forkNum);
-	pg = BufferGetPage(buf);
-
-	/* initialize the page's special space */
-	op = (HashPageOpaque) PageGetSpecialPointer(pg);
-	op->hasho_prevblkno = InvalidBlockNumber;
-	op->hasho_nextblkno = InvalidBlockNumber;
-	op->hasho_bucket = -1;
-	op->hasho_flag = LH_BITMAP_PAGE;
-	op->hasho_page_id = HASHO_PAGE_ID;
-
-	/* set all of the bits to 1 */
-	freep = HashPageGetBitmap(pg);
-	MemSet(freep, 0xFF, BMPGSZ_BYTE(metap));
-
-	/* dirty the new bitmap page, and release write lock and pin */
-	MarkBufferDirty(buf);
-	_hash_relbuf(rel, buf);
-
-	/* add the new bitmap page to the metapage's list of bitmaps */
-	/* metapage already has a write lock */
-	if (metap->hashm_nmaps >= HASH_MAX_BITMAPS)
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("out of overflow pages in hash index \"%s\"",
-						RelationGetRelationName(rel))));
-
-	metap->hashm_mapp[metap->hashm_nmaps] = blkno;
-
-	metap->hashm_nmaps++;
-}
-
-
-/*
  *	_hash_initbitmapbuffer()
  *
  *	 Initialize a new bitmap page.  All bits in the new bitmap page are set to
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 14b685e..8523409 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -157,6 +157,36 @@ _hash_getinitbuf(Relation rel, BlockNumber blkno)
 }
 
 /*
+ *	_hash_initbuf() -- Initialize a given buffer for the specified bucket.
+ */
+void
+_hash_initbuf(Buffer buf, uint32 max_bucket, uint32 num_bucket, uint32 flag,
+			  bool initpage)
+{
+	HashPageOpaque pageopaque;
+	Page		page;
+
+	page = BufferGetPage(buf);
+
+	/* initialize the page */
+	if (initpage)
+		_hash_pageinit(page, BufferGetPageSize(buf));
+
+	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+	/*
+	 * Set hasho_prevblkno with current hashm_maxbucket. This value will
+	 * be used to validate cached HashMetaPageData. See
+	 * _hash_getbucketbuf_from_hashkey().
+	 */
+	pageopaque->hasho_prevblkno = max_bucket;
+	pageopaque->hasho_nextblkno = InvalidBlockNumber;
+	pageopaque->hasho_bucket = num_bucket;
+	pageopaque->hasho_flag = flag;
+	pageopaque->hasho_page_id = HASHO_PAGE_ID;
+}
+
+/*
  *	_hash_getnewbuf() -- Get a new page at the end of the index.
  *
  *		This has the same API as _hash_getinitbuf, except that we are adding
@@ -288,7 +318,7 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 
 
 /*
- *	_hash_metapinit() -- Initialize the metadata page of a hash index,
+ *	_hash_init() -- Initialize the metadata page of a hash index,
  *				the initial buckets, and the initial bitmap page.
  *
  * The initial number of buckets is dependent on num_tuples, an estimate
@@ -300,19 +330,18 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
  * multiple buffer locks is ignored.
  */
 uint32
-_hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
+_hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
 {
-	HashMetaPage metap;
-	HashPageOpaque pageopaque;
 	Buffer		metabuf;
 	Buffer		buf;
+	Buffer		bitmapbuf;
 	Page		pg;
+	HashMetaPage metap;
+	RegProcedure procid;
 	int32		data_width;
 	int32		item_width;
 	int32		ffactor;
-	double		dnumbuckets;
 	uint32		num_buckets;
-	uint32		log2_num_buckets;
 	uint32		i;
 
 	/* safety check */
@@ -334,6 +363,97 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	if (ffactor < 10)
 		ffactor = 10;
 
+	procid = index_getprocid(rel, 1, HASHPROC);
+
+	/*
+	 * We initialize the metapage, the first N bucket pages, and the first
+	 * bitmap page in sequence, using _hash_getnewbuf to cause smgrextend()
+	 * calls to occur.  This ensures that the smgr level has the right idea of
+	 * the physical index length.
+	 *
+	 * Critical section not required, because on error the creation of the
+	 * whole relation will be rolled back.
+	 */
+	metabuf = _hash_getnewbuf(rel, HASH_METAPAGE, forkNum);
+	_hash_init_metabuffer(metabuf, num_tuples, procid, ffactor, false);
+	MarkBufferDirty(metabuf);
+
+	pg = BufferGetPage(metabuf);
+	metap = HashPageGetMeta(pg);
+
+	num_buckets = metap->hashm_maxbucket + 1;
+
+	/*
+	 * Release buffer lock on the metapage while we initialize buckets.
+	 * Otherwise, we'll be in interrupt holdoff and the CHECK_FOR_INTERRUPTS
+	 * won't accomplish anything.  It's a bad idea to hold buffer locks for
+	 * long intervals in any case, since that can block the bgwriter.
+	 */
+	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+
+	/*
+	 * Initialize and WAL Log the first N buckets
+	 */
+	for (i = 0; i < num_buckets; i++)
+	{
+		BlockNumber blkno;
+
+		/* Allow interrupts, in case N is huge */
+		CHECK_FOR_INTERRUPTS();
+
+		blkno = BUCKET_TO_BLKNO(metap, i);
+		buf = _hash_getnewbuf(rel, blkno, forkNum);
+		_hash_initbuf(buf, metap->hashm_maxbucket, i, LH_BUCKET_PAGE, false);
+		MarkBufferDirty(buf);
+		_hash_relbuf(rel, buf);
+	}
+
+	/* Now reacquire buffer lock on metapage */
+	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+
+	/*
+	 * Initialize bitmap page
+	 */
+	bitmapbuf = _hash_getnewbuf(rel, num_buckets + 1, forkNum);
+	_hash_initbitmapbuffer(bitmapbuf, metap->hashm_bmsize, false);
+	MarkBufferDirty(bitmapbuf);
+
+	/* add the new bitmap page to the metapage's list of bitmaps */
+	/* metapage already has a write lock */
+	if (metap->hashm_nmaps >= HASH_MAX_BITMAPS)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+				 errmsg("out of overflow pages in hash index \"%s\"",
+						RelationGetRelationName(rel))));
+
+	metap->hashm_mapp[metap->hashm_nmaps] = num_buckets + 1;
+
+	metap->hashm_nmaps++;
+	MarkBufferDirty(metabuf);
+
+	/* all done */
+	_hash_relbuf(rel, bitmapbuf);
+	_hash_relbuf(rel, metabuf);
+
+	return num_buckets;
+}
+
+/*
+ *	_hash_init_metabuffer() -- Initialize the metadata page of a hash index.
+ */
+void
+_hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
+					  uint16 ffactor, bool initpage)
+{
+	HashMetaPage metap;
+	HashPageOpaque pageopaque;
+	Page		page;
+	double		dnumbuckets;
+	uint32		num_buckets;
+	uint32		log2_num_buckets;
+	uint32		i;
+
+
 	/*
 	 * Choose the number of initial bucket pages to match the fill factor
 	 * given the estimated number of tuples.  We round up the result to the
@@ -353,30 +473,25 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	Assert(num_buckets == (((uint32) 1) << log2_num_buckets));
 	Assert(log2_num_buckets < HASH_MAX_SPLITPOINTS);
 
-	/*
-	 * We initialize the metapage, the first N bucket pages, and the first
-	 * bitmap page in sequence, using _hash_getnewbuf to cause smgrextend()
-	 * calls to occur.  This ensures that the smgr level has the right idea of
-	 * the physical index length.
-	 */
-	metabuf = _hash_getnewbuf(rel, HASH_METAPAGE, forkNum);
-	pg = BufferGetPage(metabuf);
+	page = BufferGetPage(buf);
+	if (initpage)
+		_hash_pageinit(page, BufferGetPageSize(buf));
 
-	pageopaque = (HashPageOpaque) PageGetSpecialPointer(pg);
+	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	pageopaque->hasho_prevblkno = InvalidBlockNumber;
 	pageopaque->hasho_nextblkno = InvalidBlockNumber;
 	pageopaque->hasho_bucket = -1;
 	pageopaque->hasho_flag = LH_META_PAGE;
 	pageopaque->hasho_page_id = HASHO_PAGE_ID;
 
-	metap = HashPageGetMeta(pg);
+	metap = HashPageGetMeta(page);
 
 	metap->hashm_magic = HASH_MAGIC;
 	metap->hashm_version = HASH_VERSION;
 	metap->hashm_ntuples = 0;
 	metap->hashm_nmaps = 0;
 	metap->hashm_ffactor = ffactor;
-	metap->hashm_bsize = HashGetMaxBitmapSize(pg);
+	metap->hashm_bsize = HashGetMaxBitmapSize(page);
 	/* find largest bitmap array size that will fit in page size */
 	for (i = _hash_log2(metap->hashm_bsize); i > 0; --i)
 	{
@@ -393,7 +508,7 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	 * pretty useless for normal operation (in fact, hashm_procid is not used
 	 * anywhere), but it might be handy for forensic purposes so we keep it.
 	 */
-	metap->hashm_procid = index_getprocid(rel, 1, HASHPROC);
+	metap->hashm_procid = procid;
 
 	/*
 	 * We initialize the index with N buckets, 0 .. N-1, occupying physical
@@ -412,53 +527,11 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	metap->hashm_firstfree = 0;
 
 	/*
-	 * Release buffer lock on the metapage while we initialize buckets.
-	 * Otherwise, we'll be in interrupt holdoff and the CHECK_FOR_INTERRUPTS
-	 * won't accomplish anything.  It's a bad idea to hold buffer locks for
-	 * long intervals in any case, since that can block the bgwriter.
-	 */
-	MarkBufferDirty(metabuf);
-	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
-
-	/*
-	 * Initialize the first N buckets
+	 * Set pd_lower just past the end of the metadata.  This is to log the
+	 * full-page image of the metapage in xloginsert.c.
 	 */
-	for (i = 0; i < num_buckets; i++)
-	{
-		/* Allow interrupts, in case N is huge */
-		CHECK_FOR_INTERRUPTS();
-
-		buf = _hash_getnewbuf(rel, BUCKET_TO_BLKNO(metap, i), forkNum);
-		pg = BufferGetPage(buf);
-		pageopaque = (HashPageOpaque) PageGetSpecialPointer(pg);
-
-		/*
-		 * Set hasho_prevblkno with current hashm_maxbucket. This value will
-		 * be used to validate cached HashMetaPageData. See
-		 * _hash_getbucketbuf_from_hashkey().
-		 */
-		pageopaque->hasho_prevblkno = metap->hashm_maxbucket;
-		pageopaque->hasho_nextblkno = InvalidBlockNumber;
-		pageopaque->hasho_bucket = i;
-		pageopaque->hasho_flag = LH_BUCKET_PAGE;
-		pageopaque->hasho_page_id = HASHO_PAGE_ID;
-		MarkBufferDirty(buf);
-		_hash_relbuf(rel, buf);
-	}
-
-	/* Now reacquire buffer lock on metapage */
-	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
-
-	/*
-	 * Initialize first bitmap page
-	 */
-	_hash_initbitmap(rel, metap, num_buckets + 1, forkNum);
-
-	/* all done */
-	MarkBufferDirty(metabuf);
-	_hash_relbuf(rel, metabuf);
-
-	return num_buckets;
+	((PageHeader) page)->pd_lower =
+		((char *) metap + sizeof(HashMetaPageData)) - (char *) page;
 }
 
 /*
@@ -535,7 +608,7 @@ restart_expand:
 	 * than a disk block then this would be an independent constraint.
 	 *
 	 * If you change this, see also the maximum initial number of buckets in
-	 * _hash_metapinit().
+	 * _hash_init().
 	 */
 	if (metap->hashm_maxbucket >= (uint32) 0x7FFFFFFE)
 		goto fail;
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 9c0b79f..bfdfed8 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -311,8 +311,6 @@ extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool r
 extern BlockNumber _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 				   Buffer wbuf, IndexTuple *itups, OffsetNumber *itup_offsets,
 			 Size *tups_size, uint16 nitups, BufferAccessStrategy bstrategy);
-extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
-				 BlockNumber blkno, ForkNumber forkNum);
 extern void _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage);
 extern void _hash_squeezebucket(Relation rel,
 					Bucket bucket, BlockNumber bucket_blkno,
@@ -331,6 +329,8 @@ extern Buffer _hash_getbucketbuf_from_hashkey(Relation rel, uint32 hashkey,
 								int access,
 								HashMetaPage *cachedmetap);
 extern Buffer _hash_getinitbuf(Relation rel, BlockNumber blkno);
+extern void _hash_initbuf(Buffer buf, uint32 max_bucket, uint32 num_bucket,
+				uint32 flag, bool initpage);
 extern Buffer _hash_getnewbuf(Relation rel, BlockNumber blkno,
 				ForkNumber forkNum);
 extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
@@ -339,8 +339,10 @@ extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
 extern void _hash_relbuf(Relation rel, Buffer buf);
 extern void _hash_dropbuf(Relation rel, Buffer buf);
 extern void _hash_dropscanbuf(Relation rel, HashScanOpaque so);
-extern uint32 _hash_metapinit(Relation rel, double num_tuples,
-				ForkNumber forkNum);
+extern uint32 _hash_init(Relation rel, double num_tuples,
+		   ForkNumber forkNum);
+extern void _hash_init_metabuffer(Buffer buf, double num_tuples,
+					  RegProcedure procid, uint16 ffactor, bool initpage);
 extern void _hash_pageinit(Page page, Size size);
 extern void _hash_expandtable(Relation rel, Buffer metabuf);
 extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
-- 
1.8.4.msysgit.0
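
For readers comparing this with the old _hash_metapinit, the new division of labour can be sketched as follows (condensed from the hunks above; the buffer-level helpers only touch the page image, while _hash_init does the smgr extension, locking and dirtying, which is what allows each step to be WAL-logged separately in the next patch):

    metabuf = _hash_getnewbuf(rel, HASH_METAPAGE, forkNum);
    _hash_init_metabuffer(metabuf, num_tuples, procid, ffactor, false);
    MarkBufferDirty(metabuf);

    /* one loop iteration per initial bucket page */
    for (i = 0; i < num_buckets; i++)
    {
        buf = _hash_getnewbuf(rel, BUCKET_TO_BLKNO(metap, i), forkNum);
        _hash_initbuf(buf, metap->hashm_maxbucket, i, LH_BUCKET_PAGE, false);
        MarkBufferDirty(buf);
        _hash_relbuf(rel, buf);
    }

    bitmapbuf = _hash_getnewbuf(rel, num_buckets + 1, forkNum);
    _hash_initbitmapbuffer(bitmapbuf, metap->hashm_bmsize, false);
    MarkBufferDirty(bitmapbuf);
    /* register the bitmap page in the metapage, then release both buffers */

(The metapage lock release and reacquisition around the bucket loop is omitted here for brevity.)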

0006-Enable-WAL-for-Hash-Indexes.patch (application/octet-stream)
From 67ef33c4cecb4d008de20f8c115152c4b04b8c9d Mon Sep 17 00:00:00 2001
From: Amit Kapila <amit.kapila@enterprisedb.com>
Date: Mon, 13 Feb 2017 19:09:35 +0530
Subject: [PATCH 6/6] Enable WAL for Hash Indexes.

This will make hash indexes durable, which means that data
for hash indexes can be recovered after a crash.  This will
also allow us to replicate hash indexes.
---
 contrib/pageinspect/expected/hash.out          |   1 -
 contrib/pgstattuple/expected/pgstattuple.out   |   1 -
 doc/src/sgml/backup.sgml                       |  13 -
 doc/src/sgml/config.sgml                       |   7 +-
 doc/src/sgml/high-availability.sgml            |   6 -
 doc/src/sgml/indices.sgml                      |  12 -
 doc/src/sgml/ref/create_index.sgml             |  13 -
 src/backend/access/hash/Makefile               |   2 +-
 src/backend/access/hash/README                 | 144 +++-
 src/backend/access/hash/hash.c                 |  81 +-
 src/backend/access/hash/hash_xlog.c            | 973 +++++++++++++++++++++++++
 src/backend/access/hash/hashinsert.c           |  59 +-
 src/backend/access/hash/hashovfl.c             | 193 ++++-
 src/backend/access/hash/hashpage.c             | 191 ++++-
 src/backend/access/hash/hashsearch.c           |   5 +
 src/backend/access/rmgrdesc/hashdesc.c         | 135 +++-
 src/backend/commands/indexcmds.c               |   5 -
 src/backend/utils/cache/relcache.c             |  12 +-
 src/include/access/hash_xlog.h                 | 239 ++++++
 src/test/regress/expected/create_index.out     |   5 -
 src/test/regress/expected/enum.out             |   1 -
 src/test/regress/expected/hash_index.out       |   2 -
 src/test/regress/expected/macaddr.out          |   1 -
 src/test/regress/expected/replica_identity.out |   1 -
 src/test/regress/expected/uuid.out             |   1 -
 25 files changed, 1977 insertions(+), 126 deletions(-)
 create mode 100644 src/backend/access/hash/hash_xlog.c

diff --git a/contrib/pageinspect/expected/hash.out b/contrib/pageinspect/expected/hash.out
index 8ed60bc..3ba01f6 100644
--- a/contrib/pageinspect/expected/hash.out
+++ b/contrib/pageinspect/expected/hash.out
@@ -1,7 +1,6 @@
 CREATE TABLE test_hash (a int, b text);
 INSERT INTO test_hash VALUES (1, 'one');
 CREATE INDEX test_hash_a_idx ON test_hash USING hash (a);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 \x
 SELECT hash_page_type(get_raw_page('test_hash_a_idx', 0));
 -[ RECORD 1 ]--+---------
diff --git a/contrib/pgstattuple/expected/pgstattuple.out b/contrib/pgstattuple/expected/pgstattuple.out
index 169d193..7d7ade8 100644
--- a/contrib/pgstattuple/expected/pgstattuple.out
+++ b/contrib/pgstattuple/expected/pgstattuple.out
@@ -131,7 +131,6 @@ select * from pgstatginindex('test_ginidx');
 (1 row)
 
 create index test_hashidx on test using hash (b);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 select * from pgstathashindex('test_hashidx');
  version | bucket_pages | overflow_pages | bitmap_pages | zero_pages | live_items | dead_items | free_percent 
 ---------+--------------+----------------+--------------+------------+------------+------------+--------------
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 5f009ee..5a7c730 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1537,19 +1537,6 @@ archive_command = 'local_backup_script.sh "%p" "%f"'
   <itemizedlist>
    <listitem>
     <para>
-     Operations on hash indexes are not presently WAL-logged, so
-     replay will not update these indexes.  This will mean that any new inserts
-     will be ignored by the index, updated rows will apparently disappear and
-     deleted rows will still retain pointers. In other words, if you modify a
-     table with a hash index on it then you will get incorrect query results
-     on a standby server.  When recovery completes it is recommended that you
-     manually <xref linkend="sql-reindex">
-     each such index after completing a recovery operation.
-    </para>
-   </listitem>
-
-   <listitem>
-    <para>
      If a <xref linkend="sql-createdatabase">
      command is executed while a base backup is being taken, and then
      the template database that the <command>CREATE DATABASE</> copied
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index dc63d7d..a9d926d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2152,10 +2152,9 @@ include_dir 'conf.d'
          has materialized a result set, no error will be generated even if the
          underlying rows in the referenced table have been vacuumed away.
          Some tables cannot safely be vacuumed early, and so will not be
-         affected by this setting.  Examples include system catalogs and any
-         table which has a hash index.  For such tables this setting will
-         neither reduce bloat nor create a possibility of a <literal>snapshot
-         too old</> error on scanning.
+         affected by this setting.  Examples include system catalogs.  For
+         such tables this setting will neither reduce bloat nor create a
+         possibility of a <literal>snapshot too old</> error on scanning.
         </para>
        </listitem>
       </varlistentry>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 48de2ce..0e61991 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2353,12 +2353,6 @@ LOG:  database system is ready to accept read only connections
   <itemizedlist>
    <listitem>
     <para>
-     Operations on hash indexes are not presently WAL-logged, so
-     replay will not update these indexes.
-    </para>
-   </listitem>
-   <listitem>
-    <para>
      Full knowledge of running transactions is required before snapshots
      can be taken. Transactions that use large numbers of subtransactions
      (currently greater than 64) will delay the start of read only
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 271c135..e40750e 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -193,18 +193,6 @@ CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable>
 </synopsis>
   </para>
 
-  <caution>
-   <para>
-    Hash index operations are not presently WAL-logged,
-    so hash indexes might need to be rebuilt with <command>REINDEX</>
-    after a database crash if there were unwritten changes.
-    Also, changes to hash indexes are not replicated over streaming or
-    file-based replication after the initial base backup, so they
-    give wrong answers to queries that subsequently use them.
-    For these reasons, hash index use is presently discouraged.
-   </para>
-  </caution>
-
   <para>
    <indexterm>
     <primary>index</primary>
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index fcb7a60..7163b03 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -510,19 +510,6 @@ Indexes:
    they can be useful.
   </para>
 
-  <caution>
-   <para>
-    Hash index operations are not presently WAL-logged,
-    so hash indexes might need to be rebuilt with <command>REINDEX</>
-    after a database crash if there were unwritten changes.
-    Also, changes to hash indexes are not replicated over streaming or
-    file-based replication after the initial base backup, so they
-    give wrong answers to queries that subsequently use them.
-    Hash indexes are also not properly restored during point-in-time
-    recovery.  For these reasons, hash index use is presently discouraged.
-   </para>
-  </caution>
-
   <para>
    Currently, only the B-tree, GiST, GIN, and BRIN index methods support
    multicolumn indexes. Up to 32 fields can be specified by default.
diff --git a/src/backend/access/hash/Makefile b/src/backend/access/hash/Makefile
index e2e7e91..b154569 100644
--- a/src/backend/access/hash/Makefile
+++ b/src/backend/access/hash/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = hash.o hashfunc.o hashinsert.o hashovfl.o hashpage.o hashsearch.o \
-       hashsort.o hashutil.o hashvalidate.o
+       hashsort.o hashutil.o hashvalidate.o hash_xlog.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 703ae98..1c1f0db 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -287,13 +287,17 @@ The insertion algorithm is rather similar:
 	if current page is full, release lock but not pin, read/exclusive-lock
      next page; repeat as needed
 	>> see below if no space in any page of bucket
+	take buffer content lock in exclusive mode on metapage
 	insert tuple at appropriate place in page
-	mark current page dirty and release buffer content lock and pin
-	if the current page is not a bucket page, release the pin on bucket page
-	pin meta page and take buffer content lock in exclusive mode
+	mark current page dirty
 	increment tuple count, decide if split needed
-	mark meta page dirty and release buffer content lock and pin
-	done if no split needed, else enter Split algorithm below
+	mark meta page dirty
+	write WAL for insertion of tuple
+	release the buffer content lock on metapage
+	release buffer content lock on current page
+	if current page is not a bucket page, release the pin on bucket page
+	if split is needed, enter Split algorithm below
+	release the pin on metapage
 
 To speed searches, the index entries within any individual index page are
 kept sorted by hash code; the insertion code must take care to insert new
@@ -328,12 +332,17 @@ existing bucket in two, thereby lowering the fill ratio:
        try to finish the split and the cleanup work
        if that succeeds, start over; if it fails, give up
 	mark the old and new buckets indicating split is in progress
+	mark both old and new buckets as dirty
+	write WAL for allocation of new page for split
 	copy the tuples that belongs to new bucket from old bucket, marking
      them as moved-by-split
+	write WAL record for moving tuples to the new page once the new page is
+	full or all the pages of the old bucket have been processed
 	release lock but not pin for primary bucket page of old bucket,
 	 read/shared-lock next page; repeat as needed
 	clear the bucket-being-split and bucket-being-populated flags
 	mark the old bucket indicating split-cleanup
+	write WAL for changing the flags on both old and new buckets
 
 The split operation's attempt to acquire cleanup-lock on the old bucket number
 could fail if another process holds any lock or pin on it.  We do not want to
@@ -369,6 +378,8 @@ The fourth operation is garbage collection (bulk deletion):
 		acquire cleanup lock on primary bucket page
 		loop:
 			scan and remove tuples
+			mark the target page dirty
+			write WAL for deleting tuples from target page
 			if this is the last bucket page, break out of loop
 			pin and x-lock next page
 			release prior lock and pin (except keep pin on primary bucket page)
@@ -383,7 +394,8 @@ The fourth operation is garbage collection (bulk deletion):
 	check if number of buckets changed
 	if so, release content lock and pin and return to for-each-bucket loop
 	else update metapage tuple count
-	mark meta page dirty and release buffer content lock and pin
+	 mark meta page dirty and write WAL for update of metapage
+	 release buffer content lock and pin
 
 Note that this is designed to allow concurrent splits and scans.  If a split
 occurs, tuples relocated into the new bucket will be visited twice by the
@@ -425,18 +437,16 @@ Obtaining an overflow page:
 	search for a free page (zero bit in bitmap)
 	if found:
 		set bit in bitmap
-		mark bitmap page dirty and release content lock
+		mark bitmap page dirty
 		take metapage buffer content lock in exclusive mode
 		if first-free-bit value did not change,
 			update it and mark meta page dirty
-		release meta page buffer content lock
-		return page number
 	else (not found):
 	release bitmap page buffer content lock
 	loop back to try next bitmap page, if any
 -- here when we have checked all bitmap pages; we hold meta excl. lock
 	extend index to add another overflow page; update meta information
-	mark meta page dirty and release buffer content lock
+	mark meta page dirty
 	return page number
 
 It is slightly annoying to release and reacquire the metapage lock
@@ -456,12 +466,15 @@ like this:
 
 	-- having determined that no space is free in the target bucket:
 	remember last page of bucket, drop write lock on it
-	call free-page-acquire routine
 	re-write-lock last page of bucket
 	if it is not last anymore, step to the last page
-	update (former) last page to point to new page
+	execute free-page-acquire (Obtaining an overflow page) mechanism described above
+	update (former) last page to point to the new page and mark the buffer dirty
 	write-lock and initialize new page, with back link to former last page
-	write and release former last page
+	write WAL for addition of overflow page
+	release the locks on meta page and bitmap page acquired in free-page-acquire algorithm
+	release the lock on former last page
+	release the lock on new overflow page
 	insert tuple into new page
 	-- etc.
 
@@ -488,12 +501,14 @@ accessors of pages in the bucket.  The algorithm is:
 	determine which bitmap page contains the free space bit for page
 	release meta page buffer content lock
 	pin bitmap page and take buffer content lock in exclusive mode
-	update bitmap bit
-	mark bitmap page dirty and release buffer content lock and pin
-	if page number is less than what we saw as first-free-bit in meta:
 	retake meta page buffer content lock in exclusive mode
+	move (insert) tuples that belong to the overflow page being freed
+	update bitmap bit
+	mark bitmap page dirty
 	if page number is still less than first-free-bit,
 		update first-free-bit field and mark meta page dirty
+	write WAL for delinking overflow page operation
+	release buffer content lock and pin
 	release meta page buffer content lock and pin
 
 We have to do it this way because we must clear the bitmap bit before
@@ -504,8 +519,101 @@ page acquirer will scan more bitmap bits than he needs to.  What must be
 avoided is having first-free-bit greater than the actual first free bit,
 because then that free page would never be found by searchers.
 
-All the freespace operations should be called while holding no buffer
-locks.  Since they need no lmgr locks, deadlock is not possible.
+The reason for moving tuples from the overflow page while delinking the latter
+is to make that a single atomic operation.  Not doing so could lead to spurious
+reads on standby.  Basically, the user might see the same tuple twice.
+
+
+WAL Considerations
+------------------
+
+The hash index operations like create index, insert, delete, bucket split,
+allocate overflow page, and squeeze (overflow pages are freed in this
+operation) do not in themselves guarantee hash index consistency after a
+crash.  To provide robustness, we write WAL for each of these operations,
+which is replayed on crash recovery.
+
+Multiple WAL records are written for the create index operation: first one for
+initializing the metapage, followed by one for each new bucket created during
+the operation, followed by one for initializing the bitmap page.  If the system
+crashes after any operation, the whole operation is rolled back.  We have
+considered writing a single WAL record for the whole operation, but for that
+we would need to limit the number of initial buckets that can be created during
+the operation.  As we can log only a fixed number of pages, XLR_MAX_BLOCK_ID
+(32), with the current XLog machinery, it is better to write multiple WAL
+records for this operation.  The downside of restricting the number of buckets
+is that we need to perform the split operation if the number of tuples is more
+than what can be accommodated in the initial set of buckets, and it is not
+unusual to have a large number of tuples during a create index operation.
+
+Ordinary item insertions (that don't force a page split or need a new overflow
+page) are single WAL entries.  They touch a single bucket page and the meta
+page; the metapage is updated during replay just as in the original operation.
+
+An insertion that causes the addition of an overflow page is logged as a single
+WAL entry, preceded by a WAL entry for the new overflow page required to insert
+the tuple.  There is a corner case where, by the time we try to use the newly
+allocated overflow page, it has already been used by concurrent insertions; in
+that case, a new overflow page will be allocated and a separate WAL entry will
+be made for it.
+
+An insertion that causes a bucket split is logged as a single WAL entry,
+followed by a WAL entry for allocating a new bucket, followed by a WAL entry
+for each overflow bucket page in the new bucket to which the tuples are moved
+from the old bucket, followed by a WAL entry to indicate that the split is
+complete for both old and new buckets.
+
+A split operation which requires overflow pages to complete the operation
+will need to write a WAL record for each new allocation of an overflow page.
+
+As splitting involves multiple atomic actions, it's possible that the system
+crashes while moving tuples from the bucket pages of the old bucket to the new
+bucket.  In such a case, after recovery, the old and new buckets will be marked
+with the bucket-being-split and bucket-being-populated flags, respectively,
+which indicate that a split is in progress for those buckets.  The reader
+algorithm works correctly, as it will scan both the old and new buckets when
+the split is in progress, as explained in the reader algorithm section above.
+
+We finish the split at the next insert or split operation on the old bucket as
+explained in the insert and split algorithms above.  It could be done during
+searches, too, but it seems best not to put any extra updates in what would
+otherwise be a read-only operation (updating is not possible in hot standby
+mode anyway).  It would seem natural to complete the split in VACUUM, but since
+splitting a bucket might require allocating a new page, it might fail if you
+run out of disk space.  That would be bad during VACUUM - the reason for
+running VACUUM in the first place might be that you ran out of disk space,
+and now VACUUM won't finish because you're out of disk space.  In contrast,
+an insertion can require enlarging the physical file anyway.
+
+Deletion of tuples from a bucket is performed for two reasons: one is to
+remove dead tuples and the other is to remove tuples that were moved by a
+split.  A WAL entry is made for each bucket page from which tuples are removed,
+followed by a WAL entry to clear the garbage flag if the tuples moved by split
+are removed.  Another separate WAL entry is made for updating the metapage if
+the deletion is performed by vacuum to remove dead tuples.
+
+As deletion involves multiple atomic operations, it is quite possible that the
+system crashes (a) after removing tuples from some of the bucket pages,
+(b) before clearing the garbage flag, or (c) before updating the metapage.  If
+the system crashes before completing (b), it will again try to clean the bucket
+during the next vacuum or insert after recovery, which can have some
+performance impact, but it will work fine.  If the system crashes before
+completing (c), after recovery there could be some additional splits until the
+next vacuum updates the metapage, but the other operations like insert, delete
+and scan will work correctly.  We could fix this problem by actually updating
+the metapage based on the delete operation during replay, but it is not clear
+whether that is worth the complication.
+
+The squeeze operation moves tuples from pages later in the bucket chain to
+pages earlier in the chain and writes a WAL record when either the page to
+which it is writing tuples is filled or the page from which it is removing
+the tuples becomes empty.
+
+As the squeeze operation involves multiple atomic operations, it is quite
+possible that the system crashes before completing the operation on the
+entire bucket.  After recovery, the operations will work correctly, but the
+index will remain bloated, which can impact the performance of read and
+insert operations until the next vacuum squeezes the bucket completely.
 
 
 Other Notes
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index da4e6a2..bb9a363 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -28,6 +28,7 @@
 #include "utils/builtins.h"
 #include "utils/index_selfuncs.h"
 #include "utils/rel.h"
+#include "miscadmin.h"
 
 
 /* Working state for hashbuild and its callback */
@@ -302,6 +303,11 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		buf = so->hashso_curbuf;
 		Assert(BufferIsValid(buf));
 		page = BufferGetPage(buf);
+
+		/*
+		 * We don't need to test for an old snapshot here, as the current
+		 * buffer is pinned, so vacuum can't clean the page.
+		 */
 		maxoffnum = PageGetMaxOffsetNumber(page);
 		for (offnum = ItemPointerGetOffsetNumber(current);
 			 offnum <= maxoffnum;
@@ -622,6 +628,7 @@ loop_top:
 	}
 
 	/* Okay, we're really done.  Update tuple count in metapage. */
+	START_CRIT_SECTION();
 
 	if (orig_maxbucket == metap->hashm_maxbucket &&
 		orig_ntuples == metap->hashm_ntuples)
@@ -648,6 +655,26 @@ loop_top:
 	}
 
 	MarkBufferDirty(metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_update_meta_page xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.ntuples = metap->hashm_ntuples;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashUpdateMetaPage);
+
+		XLogRegisterBuffer(0, metabuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_UPDATE_META_PAGE);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
 	_hash_relbuf(rel, metabuf);
 
 	/* return statistics */
@@ -815,9 +842,40 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			/* No ereport(ERROR) until changes are logged */
+			START_CRIT_SECTION();
+
 			PageIndexMultiDelete(page, deletable, ndeletable);
 			bucket_dirty = true;
 			MarkBufferDirty(buf);
+
+			/* XLOG stuff */
+			if (RelationNeedsWAL(rel))
+			{
+				xl_hash_delete xlrec;
+				XLogRecPtr	recptr;
+
+				xlrec.is_primary_bucket_page = (buf == bucket_buf) ? true : false;
+
+				XLogBeginInsert();
+				XLogRegisterData((char *) &xlrec, SizeOfHashDelete);
+
+				/*
+				 * bucket buffer needs to be registered to ensure that we can
+				 * acquire a cleanup lock on it during replay.
+				 */
+				if (!xlrec.is_primary_bucket_page)
+					XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+				XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
+				XLogRegisterBufData(1, (char *) deletable,
+									ndeletable * sizeof(OffsetNumber));
+
+				recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_DELETE);
+				PageSetLSN(BufferGetPage(buf), recptr);
+			}
+
+			END_CRIT_SECTION();
 		}
 
 		/* bail out if there are no more pages to scan. */
@@ -865,8 +923,25 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		/* No ereport(ERROR) until changes are logged */
+		START_CRIT_SECTION();
+
 		bucket_opaque->hasho_flag &= ~LH_BUCKET_NEEDS_SPLIT_CLEANUP;
 		MarkBufferDirty(bucket_buf);
+
+		/* XLOG stuff */
+		if (RelationNeedsWAL(rel))
+		{
+			XLogRecPtr	recptr;
+
+			XLogBeginInsert();
+			XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+			recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_CLEANUP);
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
 	}
 
 	/*
@@ -880,9 +955,3 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	else
 		LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
 }
-
-void
-hash_redo(XLogReaderState *record)
-{
-	elog(PANIC, "hash_redo: unimplemented");
-}
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
new file mode 100644
index 0000000..e7c1e7e
--- /dev/null
+++ b/src/backend/access/hash/hash_xlog.c
@@ -0,0 +1,973 @@
+/*-------------------------------------------------------------------------
+ *
+ * hash_xlog.c
+ *	  WAL replay logic for hash index.
+ *
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/hash/hash_xlog.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/hash_xlog.h"
+#include "access/xlogutils.h"
+
+/*
+ * replay a hash index meta page
+ */
+static void
+hash_xlog_init_meta_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Page		page;
+	Buffer		metabuf;
+
+	xl_hash_init_meta_page *xlrec = (xl_hash_init_meta_page *) XLogRecGetData(record);
+
+	/* create the index' metapage */
+	metabuf = XLogInitBufferForRedo(record, 0);
+	Assert(BufferIsValid(metabuf));
+	_hash_init_metabuffer(metabuf, xlrec->num_tuples, xlrec->procid,
+						  xlrec->ffactor, true);
+	page = (Page) BufferGetPage(metabuf);
+	PageSetLSN(page, lsn);
+	MarkBufferDirty(metabuf);
+	/* all done */
+	UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay a hash index bitmap page
+ */
+static void
+hash_xlog_init_bitmap_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		bitmapbuf;
+	Buffer		metabuf;
+	Page		page;
+	HashMetaPage metap;
+	uint32		num_buckets;
+
+	xl_hash_init_bitmap_page *xlrec = (xl_hash_init_bitmap_page *) XLogRecGetData(record);
+
+	/*
+	 * Initialize bitmap page
+	 */
+	bitmapbuf = XLogInitBufferForRedo(record, 0);
+	_hash_initbitmapbuffer(bitmapbuf, xlrec->bmsize, true);
+	PageSetLSN(BufferGetPage(bitmapbuf), lsn);
+	MarkBufferDirty(bitmapbuf);
+	UnlockReleaseBuffer(bitmapbuf);
+
+	/* add the new bitmap page to the metapage's list of bitmaps */
+	if (XLogReadBufferForRedo(record, 1, &metabuf) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Note: in normal operation, we'd update the metapage while still
+		 * holding lock on the page we inserted into.  But during replay it's
+		 * not necessary to hold that lock, since no other index updates can
+		 * be happening concurrently.
+		 */
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+
+		num_buckets = metap->hashm_maxbucket + 1;
+		metap->hashm_mapp[metap->hashm_nmaps] = num_buckets + 1;
+		metap->hashm_nmaps++;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay a hash index insert without split
+ */
+static void
+hash_xlog_insert(XLogReaderState *record)
+{
+	HashMetaPage metap;
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_insert *xlrec = (xl_hash_insert *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		Size		datalen;
+		char	   *datapos = XLogRecGetBlockData(record, 0, &datalen);
+
+		page = BufferGetPage(buffer);
+
+		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+						false, false) == InvalidOffsetNumber)
+			elog(PANIC, "hash_insert_redo: failed to add item");
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+
+	if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Note: in normal operation, we'd update the metapage while still
+		 * holding lock on the page we inserted into.  But during replay it's
+		 * not necessary to hold that lock, since no other index updates can
+		 * be happening concurrently.
+		 */
+		page = BufferGetPage(buffer);
+		metap = HashPageGetMeta(page);
+		metap->hashm_ntuples += 1;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
+
+/*
+ * replay addition of overflow page for hash index
+ */
+static void
+hash_xlog_addovflpage(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_addovflpage *xlrec = (xl_hash_addovflpage *) XLogRecGetData(record);
+	Buffer		leftbuf;
+	Buffer		ovflbuf;
+	Buffer		metabuf;
+	BlockNumber leftblk;
+	BlockNumber rightblk;
+	BlockNumber newmapblk = InvalidBlockNumber;
+	Page		ovflpage;
+	HashPageOpaque ovflopaque;
+	uint32	   *num_bucket;
+	char	   *data;
+	Size datalen PG_USED_FOR_ASSERTS_ONLY;
+	bool		new_bmpage = false;
+
+	XLogRecGetBlockTag(record, 0, NULL, NULL, &rightblk);
+	XLogRecGetBlockTag(record, 1, NULL, NULL, &leftblk);
+
+	ovflbuf = XLogInitBufferForRedo(record, 0);
+	Assert(BufferIsValid(ovflbuf));
+
+	data = XLogRecGetBlockData(record, 0, &datalen);
+	num_bucket = (uint32 *) data;
+	Assert(datalen == sizeof(uint32));
+	_hash_initbuf(ovflbuf, InvalidBlockNumber, *num_bucket, LH_OVERFLOW_PAGE,
+				  true);
+	/* update backlink */
+	ovflpage = BufferGetPage(ovflbuf);
+	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
+	ovflopaque->hasho_prevblkno = leftblk;
+
+	PageSetLSN(ovflpage, lsn);
+	MarkBufferDirty(ovflbuf);
+
+	if (XLogReadBufferForRedo(record, 1, &leftbuf) == BLK_NEEDS_REDO)
+	{
+		Page		leftpage;
+		HashPageOpaque leftopaque;
+
+		leftpage = BufferGetPage(leftbuf);
+		leftopaque = (HashPageOpaque) PageGetSpecialPointer(leftpage);
+		leftopaque->hasho_nextblkno = rightblk;
+
+		PageSetLSN(leftpage, lsn);
+		MarkBufferDirty(leftbuf);
+	}
+
+	/*
+	 * We must not release the locks until the prev pointer of the overflow
+	 * page and the next pointer of the left page are both set; otherwise a
+	 * concurrent reader might skip the overflow page.
+	 */
+	if (BufferIsValid(leftbuf))
+		UnlockReleaseBuffer(leftbuf);
+	UnlockReleaseBuffer(ovflbuf);
+
+	/*
+	 * Note: in normal operation, we'd update the bitmap and meta page while
+	 * still holding lock on the overflow pages we updated.  But during replay
+	 * it's not necessary to hold those locks, since no other index updates
+	 * can be happening concurrently.
+	 */
+	if (XLogRecHasBlockRef(record, 2))
+	{
+		Buffer		mapbuffer;
+
+		if (XLogReadBufferForRedo(record, 2, &mapbuffer) == BLK_NEEDS_REDO)
+		{
+			Page		mappage = (Page) BufferGetPage(mapbuffer);
+			uint32	   *freep = NULL;
+			char	   *data;
+			uint32	   *bitmap_page_bit;
+
+			freep = HashPageGetBitmap(mappage);
+
+			data = XLogRecGetBlockData(record, 2, &datalen);
+			bitmap_page_bit = (uint32 *) data;
+
+			SETBIT(freep, *bitmap_page_bit);
+
+			PageSetLSN(mappage, lsn);
+			MarkBufferDirty(mapbuffer);
+		}
+		if (BufferIsValid(mapbuffer))
+			UnlockReleaseBuffer(mapbuffer);
+	}
+
+	if (XLogRecHasBlockRef(record, 3))
+	{
+		Buffer		newmapbuf;
+
+		newmapbuf = XLogInitBufferForRedo(record, 3);
+
+		_hash_initbitmapbuffer(newmapbuf, xlrec->bmsize, true);
+
+		new_bmpage = true;
+		newmapblk = BufferGetBlockNumber(newmapbuf);
+
+		MarkBufferDirty(newmapbuf);
+		PageSetLSN(BufferGetPage(newmapbuf), lsn);
+
+		UnlockReleaseBuffer(newmapbuf);
+	}
+
+	if (XLogReadBufferForRedo(record, 4, &metabuf) == BLK_NEEDS_REDO)
+	{
+		HashMetaPage metap;
+		Page		page;
+		uint32	   *firstfree_ovflpage;
+
+		data = XLogRecGetBlockData(record, 4, &datalen);
+		firstfree_ovflpage = (uint32 *) data;
+
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+		metap->hashm_firstfree = *firstfree_ovflpage;
+
+		if (!xlrec->bmpage_found)
+		{
+			metap->hashm_spares[metap->hashm_ovflpoint]++;
+
+			if (new_bmpage)
+			{
+				Assert(BlockNumberIsValid(newmapblk));
+
+				metap->hashm_mapp[metap->hashm_nmaps] = newmapblk;
+				metap->hashm_nmaps++;
+				metap->hashm_spares[metap->hashm_ovflpoint]++;
+			}
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay allocation of page for split operation
+ */
+static void
+hash_xlog_split_allocpage(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_split_allocpage *xlrec = (xl_hash_split_allocpage *) XLogRecGetData(record);
+	Buffer		oldbuf;
+	Buffer		newbuf;
+	Buffer		metabuf;
+	Size datalen PG_USED_FOR_ASSERTS_ONLY;
+	char	   *data;
+	XLogRedoAction action;
+
+	/* replay the record for old bucket */
+	action = XLogReadBufferForRedo(record, 0, &oldbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		oldpage;
+		HashPageOpaque oldopaque;
+
+		oldpage = BufferGetPage(oldbuf);
+		oldopaque = (HashPageOpaque) PageGetSpecialPointer(oldpage);
+
+		oldopaque->hasho_flag = xlrec->old_bucket_flag;
+		oldopaque->hasho_prevblkno = xlrec->new_bucket;
+
+		PageSetLSN(oldpage, lsn);
+		MarkBufferDirty(oldbuf);
+	}
+
+	/*
+	 * There is no harm in releasing the lock on old bucket before new bucket
+	 * during replay as no other bucket splits can be happening concurrently.
+	 */
+	if (BufferIsValid(oldbuf))
+		UnlockReleaseBuffer(oldbuf);
+
+	/* replay the record for new bucket */
+	newbuf = XLogInitBufferForRedo(record, 1);
+	_hash_initbuf(newbuf, xlrec->new_bucket, xlrec->new_bucket,
+				  xlrec->new_bucket_flag, true);
+	MarkBufferDirty(newbuf);
+	PageSetLSN(BufferGetPage(newbuf), lsn);
+
+	UnlockReleaseBuffer(newbuf);
+
+	/*
+	 * Note: in normal operation, we'd update the meta page while still
+	 * holding lock on the old and new bucket pages.  But during replay it's
+	 * not necessary to hold those locks, since no other bucket splits can be
+	 * happening concurrently.
+	 */
+
+	/* replay the record for metapage changes */
+	if (XLogReadBufferForRedo(record, 2, &metabuf) == BLK_NEEDS_REDO)
+	{
+		Page		page;
+		HashMetaPage metap;
+
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+		metap->hashm_maxbucket = xlrec->new_bucket;
+
+		data = XLogRecGetBlockData(record, 2, &datalen);
+
+		if (xlrec->flags & XLH_SPLIT_META_UPDATE_MASKS)
+		{
+			uint32		lowmask;
+			uint32	   *highmask;
+
+			/* extract low and high masks. */
+			memcpy(&lowmask, data, sizeof(uint32));
+			highmask = (uint32 *) ((char *) data + sizeof(uint32));
+
+			/* update metapage */
+			metap->hashm_lowmask = lowmask;
+			metap->hashm_highmask = *highmask;
+
+			data += sizeof(uint32) * 2;
+		}
+
+		if (xlrec->flags & XLH_SPLIT_META_UPDATE_SPLITPOINT)
+		{
+			uint32		ovflpoint;
+			uint32	   *ovflpages;
+
+			/* extract information of overflow pages. */
+			memcpy(&ovflpoint, data, sizeof(uint32));
+			ovflpages = (uint32 *) ((char *) data + sizeof(uint32));
+
+			/* update metapage */
+			metap->hashm_spares[ovflpoint] = *ovflpages;
+			metap->hashm_ovflpoint = ovflpoint;
+		}
+
+		MarkBufferDirty(metabuf);
+		PageSetLSN(BufferGetPage(metabuf), lsn);
+	}
+
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay of split operation
+ */
+static void
+hash_xlog_split_page(XLogReaderState *record)
+{
+	Buffer		buf;
+
+	if (XLogReadBufferForRedo(record, 0, &buf) != BLK_RESTORED)
+		elog(ERROR, "Hash split record did not contain a full-page image");
+
+	UnlockReleaseBuffer(buf);
+}
+
+/*
+ * replay completion of split operation
+ */
+static void
+hash_xlog_split_complete(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_split_complete *xlrec = (xl_hash_split_complete *) XLogRecGetData(record);
+	Buffer		oldbuf;
+	Buffer		newbuf;
+	XLogRedoAction action;
+
+	/* replay the record for old bucket */
+	action = XLogReadBufferForRedo(record, 0, &oldbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		oldpage;
+		HashPageOpaque oldopaque;
+
+		oldpage = BufferGetPage(oldbuf);
+		oldopaque = (HashPageOpaque) PageGetSpecialPointer(oldpage);
+
+		oldopaque->hasho_flag = xlrec->old_bucket_flag;
+
+		PageSetLSN(oldpage, lsn);
+		MarkBufferDirty(oldbuf);
+	}
+	if (BufferIsValid(oldbuf))
+		UnlockReleaseBuffer(oldbuf);
+
+	/* replay the record for new bucket */
+	action = XLogReadBufferForRedo(record, 1, &newbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		newpage;
+		HashPageOpaque nopaque;
+
+		newpage = BufferGetPage(newbuf);
+		nopaque = (HashPageOpaque) PageGetSpecialPointer(newpage);
+
+		nopaque->hasho_flag = xlrec->new_bucket_flag;
+
+		PageSetLSN(newpage, lsn);
+		MarkBufferDirty(newbuf);
+	}
+	if (BufferIsValid(newbuf))
+		UnlockReleaseBuffer(newbuf);
+}
+
+/*
+ * replay move of page contents for squeeze operation of hash index
+ */
+static void
+hash_xlog_move_page_contents(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_move_page_contents *xldata = (xl_hash_move_page_contents *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		writebuf = InvalidBuffer;
+	Buffer		deletebuf = InvalidBuffer;
+	XLogRedoAction action;
+
+	/*
+	 * Ensure we have a cleanup lock on the primary bucket page before we
+	 * start the actual replay operation.  This guarantees that no scan can
+	 * start, and that no scan is already in progress, during the replay of
+	 * this operation.  If we allowed scans during this operation, they could
+	 * miss some records or see the same record multiple times.
+	 */
+	if (xldata->is_prim_bucket_same_wrt)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &writebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bucketbuf))
+			LockBufferForCleanup(bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &writebuf);
+	}
+
+	/* replay the record for adding entries in overflow buffer */
+	if (action == BLK_NEEDS_REDO)
+	{
+		Page		writepage;
+		char	   *begin;
+		char	   *data;
+		Size		datalen;
+		uint16		ninserted = 0;
+
+		data = begin = XLogRecGetBlockData(record, 1, &datalen);
+
+		writepage = (Page) BufferGetPage(writebuf);
+
+		if (xldata->ntups > 0)
+		{
+			OffsetNumber *towrite = (OffsetNumber *) data;
+
+			data += sizeof(OffsetNumber) * xldata->ntups;
+
+			while (data - begin < datalen)
+			{
+				IndexTuple	itup = (IndexTuple) data;
+				Size		itemsz;
+				OffsetNumber l;
+
+				itemsz = IndexTupleDSize(*itup);
+				itemsz = MAXALIGN(itemsz);
+
+				data += itemsz;
+
+				l = PageAddItem(writepage, (Item) itup, itemsz, towrite[ninserted], false, false);
+				if (l == InvalidOffsetNumber)
+					elog(ERROR, "hash_xlog_move_page_contents: failed to add item to hash index page, size %d bytes",
+						 (int) itemsz);
+
+				ninserted++;
+			}
+		}
+
+		/*
+		 * The number of tuples inserted must match the count in the REDO record.
+		 */
+		Assert(ninserted == xldata->ntups);
+
+		PageSetLSN(writepage, lsn);
+		MarkBufferDirty(writebuf);
+	}
+
+	/* replay the record for deleting entries from overflow buffer */
+	if (XLogReadBufferForRedo(record, 2, &deletebuf) == BLK_NEEDS_REDO)
+	{
+		Page		page;
+		char	   *ptr;
+		Size		len;
+
+		ptr = XLogRecGetBlockData(record, 2, &len);
+
+		page = (Page) BufferGetPage(deletebuf);
+
+		if (len > 0)
+		{
+			OffsetNumber *unused;
+			OffsetNumber *unend;
+
+			unused = (OffsetNumber *) ptr;
+			unend = (OffsetNumber *) ((char *) ptr + len);
+
+			if ((unend - unused) > 0)
+				PageIndexMultiDelete(page, unused, unend - unused);
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(deletebuf);
+	}
+
+	/*
+	 * Replay is complete, so now we can release the buffers.  We release the
+	 * locks at the end of the replay operation to ensure that we hold the
+	 * lock on the primary bucket page until the end of the operation.  We
+	 * could optimize by releasing the lock on the write buffer as soon as its
+	 * part of the operation is complete, when it is not the same as the
+	 * primary bucket page, but that doesn't seem worth complicating the code.
+	 */
+	if (BufferIsValid(deletebuf))
+		UnlockReleaseBuffer(deletebuf);
+
+	if (BufferIsValid(writebuf))
+		UnlockReleaseBuffer(writebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay squeeze page operation of hash index
+ */
+static void
+hash_xlog_squeeze_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_squeeze_page *xldata = (xl_hash_squeeze_page *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		writebuf;
+	Buffer		ovflbuf;
+	Buffer		prevbuf = InvalidBuffer;
+	Buffer		mapbuf;
+	XLogRedoAction action;
+
+	/*
+	 * Ensure we have a cleanup lock on the primary bucket page before we
+	 * start the actual replay operation.  This guarantees that no scan can
+	 * start, and that no scan is already in progress, during the replay of
+	 * this operation.  If we allowed scans during this operation, they could
+	 * miss some records or see the same record multiple times.
+	 */
+	if (xldata->is_prim_bucket_same_wrt)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &writebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bucketbuf))
+			LockBufferForCleanup(bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &writebuf);
+	}
+
+	/* replay the record for adding entries in overflow buffer */
+	if (action == BLK_NEEDS_REDO)
+	{
+		Page		writepage;
+		char	   *begin;
+		char	   *data;
+		Size		datalen;
+		uint16		ninserted = 0;
+
+		data = begin = XLogRecGetBlockData(record, 1, &datalen);
+
+		writepage = (Page) BufferGetPage(writebuf);
+
+		if (xldata->ntups > 0)
+		{
+			OffsetNumber *towrite = (OffsetNumber *) data;
+
+			data += sizeof(OffsetNumber) * xldata->ntups;
+
+			while (data - begin < datalen)
+			{
+				IndexTuple	itup = (IndexTuple) data;
+				Size		itemsz;
+				OffsetNumber l;
+
+				itemsz = IndexTupleDSize(*itup);
+				itemsz = MAXALIGN(itemsz);
+
+				data += itemsz;
+
+				l = PageAddItem(writepage, (Item) itup, itemsz, towrite[ninserted], false, false);
+				if (l == InvalidOffsetNumber)
+					elog(ERROR, "hash_xlog_squeeze_page: failed to add item to hash index page, size %d bytes",
+						 (int) itemsz);
+
+				ninserted++;
+			}
+		}
+
+		/*
+		 * The number of tuples inserted must match the count in the REDO record.
+		 */
+		Assert(ninserted == xldata->ntups);
+
+		/*
+		 * If the page to which we are adding tuples is the page previous to
+		 * the freed overflow page, then update its nextblkno.
+		 */
+		if (xldata->is_prev_bucket_same_wrt)
+		{
+			HashPageOpaque writeopaque = (HashPageOpaque) PageGetSpecialPointer(writepage);
+
+			writeopaque->hasho_nextblkno = xldata->nextblkno;
+		}
+
+		PageSetLSN(writepage, lsn);
+		MarkBufferDirty(writebuf);
+	}
+
+	/* replay the record for initializing overflow buffer */
+	if (XLogReadBufferForRedo(record, 2, &ovflbuf) == BLK_NEEDS_REDO)
+	{
+		Page		ovflpage;
+
+		ovflpage = BufferGetPage(ovflbuf);
+
+		_hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));
+
+		PageSetLSN(ovflpage, lsn);
+		MarkBufferDirty(ovflbuf);
+	}
+	if (BufferIsValid(ovflbuf))
+		UnlockReleaseBuffer(ovflbuf);
+
+	/* replay the record for page previous to the freed overflow page */
+	if (!xldata->is_prev_bucket_same_wrt &&
+		XLogReadBufferForRedo(record, 3, &prevbuf) == BLK_NEEDS_REDO)
+	{
+		Page		prevpage = BufferGetPage(prevbuf);
+		HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+		prevopaque->hasho_nextblkno = xldata->nextblkno;
+
+		PageSetLSN(prevpage, lsn);
+		MarkBufferDirty(prevbuf);
+	}
+	if (BufferIsValid(prevbuf))
+		UnlockReleaseBuffer(prevbuf);
+
+	/* replay the record for page next to the freed overflow page */
+	if (XLogRecHasBlockRef(record, 4))
+	{
+		Buffer		nextbuf;
+
+		if (XLogReadBufferForRedo(record, 4, &nextbuf) == BLK_NEEDS_REDO)
+		{
+			Page		nextpage = BufferGetPage(nextbuf);
+			HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage);
+
+			nextopaque->hasho_prevblkno = xldata->prevblkno;
+
+			PageSetLSN(nextpage, lsn);
+			MarkBufferDirty(nextbuf);
+		}
+		if (BufferIsValid(nextbuf))
+			UnlockReleaseBuffer(nextbuf);
+	}
+
+	/* replay the record for bitmap page */
+	if (XLogReadBufferForRedo(record, 5, &mapbuf) == BLK_NEEDS_REDO)
+	{
+		Page		mappage = (Page) BufferGetPage(mapbuf);
+		uint32	   *freep = NULL;
+		char	   *data;
+		uint32	   *bitmap_page_bit;
+		Size		datalen;
+
+		freep = HashPageGetBitmap(mappage);
+
+		data = XLogRecGetBlockData(record, 5, &datalen);
+		bitmap_page_bit = (uint32 *) data;
+
+		CLRBIT(freep, *bitmap_page_bit);
+
+		PageSetLSN(mappage, lsn);
+		MarkBufferDirty(mapbuf);
+	}
+	if (BufferIsValid(mapbuf))
+		UnlockReleaseBuffer(mapbuf);
+
+	/* replay the record for meta page */
+	if (XLogRecHasBlockRef(record, 6))
+	{
+		Buffer		metabuf;
+
+		if (XLogReadBufferForRedo(record, 6, &metabuf) == BLK_NEEDS_REDO)
+		{
+			HashMetaPage metap;
+			Page		page;
+			char	   *data;
+			uint32	   *firstfree_ovflpage;
+			Size		datalen;
+
+			data = XLogRecGetBlockData(record, 6, &datalen);
+			firstfree_ovflpage = (uint32 *) data;
+
+			page = BufferGetPage(metabuf);
+			metap = HashPageGetMeta(page);
+			metap->hashm_firstfree = *firstfree_ovflpage;
+
+			PageSetLSN(page, lsn);
+			MarkBufferDirty(metabuf);
+		}
+		if (BufferIsValid(metabuf))
+			UnlockReleaseBuffer(metabuf);
+	}
+
+	/*
+	 * We release the locks on writebuf and bucketbuf at the end of the replay
+	 * operation to ensure that we hold the lock on the primary bucket page
+	 * until the end of the operation.  We could optimize by releasing the
+	 * lock on the write buffer as soon as its part of the operation is
+	 * complete, when it is not the same as the primary bucket page, but that
+	 * doesn't seem worth complicating the code.
+	 */
+	if (BufferIsValid(writebuf))
+		UnlockReleaseBuffer(writebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay delete operation of hash index
+ */
+static void
+hash_xlog_delete(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_delete *xldata = (xl_hash_delete *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		deletebuf;
+	Page		page;
+	XLogRedoAction action;
+
+	/*
+	 * Ensure we have a cleanup lock on the primary bucket page before we
+	 * start the actual replay operation.  This guarantees that no scan can
+	 * start, and that no scan is already in progress, during the replay of
+	 * this operation.  If we allowed scans during this operation, they could
+	 * miss some records or see the same record multiple times.
+	 */
+	if (xldata->is_primary_bucket_page)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &deletebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bucketbuf))
+			LockBufferForCleanup(bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &deletebuf);
+	}
+
+	/* replay the record for deleting entries in bucket page */
+	if (action == BLK_NEEDS_REDO)
+	{
+		char	   *ptr;
+		Size		len;
+
+		ptr = XLogRecGetBlockData(record, 1, &len);
+
+		page = (Page) BufferGetPage(deletebuf);
+
+		if (len > 0)
+		{
+			OffsetNumber *unused;
+			OffsetNumber *unend;
+
+			unused = (OffsetNumber *) ptr;
+			unend = (OffsetNumber *) ((char *) ptr + len);
+
+			if ((unend - unused) > 0)
+				PageIndexMultiDelete(page, unused, unend - unused);
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(deletebuf);
+	}
+	if (BufferIsValid(deletebuf))
+		UnlockReleaseBuffer(deletebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay split cleanup flag operation for primary bucket page.
+ */
+static void
+hash_xlog_split_cleanup(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		HashPageOpaque bucket_opaque;
+
+		page = (Page) BufferGetPage(buffer);
+
+		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+		bucket_opaque->hasho_flag &= ~LH_BUCKET_NEEDS_SPLIT_CLEANUP;
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
+
+/*
+ * replay for update meta page
+ */
+static void
+hash_xlog_update_meta_page(XLogReaderState *record)
+{
+	HashMetaPage metap;
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_update_meta_page *xldata = (xl_hash_update_meta_page *) XLogRecGetData(record);
+	Buffer		metabuf;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &metabuf) == BLK_NEEDS_REDO)
+	{
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+
+		metap->hashm_ntuples = xldata->ntuples;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+void
+hash_redo(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			hash_xlog_init_meta_page(record);
+			break;
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			hash_xlog_init_bitmap_page(record);
+			break;
+		case XLOG_HASH_INSERT:
+			hash_xlog_insert(record);
+			break;
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			hash_xlog_addovflpage(record);
+			break;
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			hash_xlog_split_allocpage(record);
+			break;
+		case XLOG_HASH_SPLIT_PAGE:
+			hash_xlog_split_page(record);
+			break;
+		case XLOG_HASH_SPLIT_COMPLETE:
+			hash_xlog_split_complete(record);
+			break;
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			hash_xlog_move_page_contents(record);
+			break;
+		case XLOG_HASH_SQUEEZE_PAGE:
+			hash_xlog_squeeze_page(record);
+			break;
+		case XLOG_HASH_DELETE:
+			hash_xlog_delete(record);
+			break;
+		case XLOG_HASH_SPLIT_CLEANUP:
+			hash_xlog_split_cleanup(record);
+			break;
+		case XLOG_HASH_UPDATE_META_PAGE:
+			hash_xlog_update_meta_page(record);
+			break;
+		default:
+			elog(PANIC, "hash_redo: unknown op code %u", info);
+	}
+}
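
All of the redo routines above share the same replay idiom: re-read each
registered block, apply the change only when the block was not restored from a
full-page image, then stamp the page LSN and mark the buffer dirty.  A
condensed sketch of that idiom, not part of the patch (the helper name is
invented; hash_xlog_split_cleanup() above is the smallest real example):

static void
replay_one_block(XLogReaderState *record, uint8 block_id)
{
	XLogRecPtr	lsn = record->EndRecPtr;
	Buffer		buffer;

	if (XLogReadBufferForRedo(record, block_id, &buffer) == BLK_NEEDS_REDO)
	{
		Page		page = BufferGetPage(buffer);

		/* ... re-apply the logged change to 'page' ... */

		PageSetLSN(page, lsn);
		MarkBufferDirty(buffer);
	}
	if (BufferIsValid(buffer))
		UnlockReleaseBuffer(buffer);
}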
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 354e733..241728f 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -16,6 +16,8 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "access/hash_xlog.h"
+#include "miscadmin.h"
 #include "utils/rel.h"
 
 
@@ -40,6 +42,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	bool		do_expand;
 	uint32		hashkey;
 	Bucket		bucket;
+	OffsetNumber itup_off;
 
 	/*
 	 * Get the hash key for the item (it's stored in the index tuple itself).
@@ -158,25 +161,20 @@ restart_insert:
 		Assert(pageopaque->hasho_bucket == bucket);
 	}
 
-	/* found page with enough space, so add the item here */
-	(void) _hash_pgaddtup(rel, buf, itemsz, itup);
-
-	/*
-	 * dirty and release the modified page.  if the page we modified was an
-	 * overflow page, we also need to separately drop the pin we retained on
-	 * the primary bucket page.
-	 */
-	MarkBufferDirty(buf);
-	_hash_relbuf(rel, buf);
-	if (buf != bucket_buf)
-		_hash_dropbuf(rel, bucket_buf);
-
 	/*
 	 * Write-lock the metapage so we can increment the tuple count. After
 	 * incrementing it, check to see if it's time for a split.
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Do the update.  No ereport(ERROR) until changes are logged */
+	START_CRIT_SECTION();
+
+	/* found page with enough space, so add the item here */
+	itup_off = _hash_pgaddtup(rel, buf, itemsz, itup);
+	MarkBufferDirty(buf);
+
+	/* metapage operations */
 	metap = HashPageGetMeta(metapage);
 	metap->hashm_ntuples += 1;
 
@@ -184,10 +182,43 @@ restart_insert:
 	do_expand = metap->hashm_ntuples >
 		(double) metap->hashm_ffactor * (metap->hashm_maxbucket + 1);
 
-	/* Write out the metapage and drop lock, but keep pin */
 	MarkBufferDirty(metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_insert xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.offnum = itup_off;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInsert);
+
+		XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);
+
+		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+		XLogRegisterBufData(0, (char *) itup, IndexTupleDSize(*itup));
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INSERT);
+
+		PageSetLSN(BufferGetPage(buf), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	/* drop lock on metapage, but keep pin */
 	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 
+	/*
+	 * Release the modified page; if it was an overflow page, also drop the
+	 * pin we retained on the primary bucket page.
+	 */
+	_hash_relbuf(rel, buf);
+	if (buf != bucket_buf)
+		_hash_dropbuf(rel, bucket_buf);
+
 	/* Attempt to split if a split is needed */
 	if (do_expand)
 		_hash_expandtable(rel, metabuf);
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 084ad92..69f3882 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -18,6 +18,7 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "access/hash_xlog.h"
 #include "miscadmin.h"
 #include "utils/rel.h"
 
@@ -313,7 +314,9 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 found:
 
 	/*
-	 * Do the update.
+	 * Do the update.  No ereport(ERROR) until the changes are logged.  We log
+	 * the changes for the bitmap page and the overflow page together so that
+	 * a newly added overflow page cannot be lost.
 	 */
 	START_CRIT_SECTION();
 
@@ -373,6 +376,59 @@ found:
 
 	MarkBufferDirty(buf);
 
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+		xl_hash_addovflpage xlrec;
+
+		xlrec.bmpage_found = page_found;
+		xlrec.bmsize = metap->hashm_bmsize;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashAddOvflPage);
+
+		XLogRegisterBuffer(0, ovflbuf, REGBUF_WILL_INIT);
+		XLogRegisterBufData(0, (char *) &pageopaque->hasho_bucket, sizeof(Bucket));
+
+		XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
+
+		if (BufferIsValid(mapbuf))
+		{
+			/*
+			 * As the bitmap page doesn't have a standard page layout, register
+			 * the changed bit number explicitly so that it can be logged.
+			 */
+			XLogRegisterBuffer(2, mapbuf, REGBUF_STANDARD);
+			XLogRegisterBufData(2, (char *) &bitmap_page_bit, sizeof(uint32));
+		}
+
+		if (BufferIsValid(newmapbuf))
+			XLogRegisterBuffer(3, newmapbuf, REGBUF_WILL_INIT);
+
+		/*
+		 * To replay the metapage changes, we could log the entire metapage,
+		 * which doesn't seem advisable considering its size, or we could log
+		 * the individual updated values, which also seems doable; but we
+		 * prefer to perform the same operations on the metapage during replay
+		 * as are done during the original operation.  That is straightforward
+		 * and has the advantage of using less space in WAL.
+		 */
+		XLogRegisterBuffer(4, metabuf, REGBUF_STANDARD);
+		XLogRegisterBufData(4, (char *) &metap->hashm_firstfree, sizeof(uint32));
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_ADD_OVFL_PAGE);
+
+		PageSetLSN(BufferGetPage(ovflbuf), recptr);
+		PageSetLSN(BufferGetPage(buf), recptr);
+
+		if (BufferIsValid(mapbuf))
+			PageSetLSN(BufferGetPage(mapbuf), recptr);
+
+		if (BufferIsValid(newmapbuf))
+			PageSetLSN(BufferGetPage(newmapbuf), recptr);
+	}
+
 	END_CRIT_SECTION();
 
 	if (retain_pin)
@@ -430,9 +486,9 @@ _hash_firstfreebit(uint32 map)
  *
  *	Add the tuples (itups) to wbuf in this function, we could do that in the
  *	caller as well.  The advantage of doing it here is we can easily write
- *	the WAL for this operation.  Addition of tuples and removal of overflow
- *	page has to done as an atomic operation, otherwise during replay on standby
- *	users might find duplicate records.
+ *	the WAL for the XLOG_HASH_SQUEEZE_PAGE operation.  Addition of tuples and
+ *	removal of the overflow page have to be done as one atomic operation;
+ *	otherwise, during replay on a standby, users might find duplicate records.
  *
  *	Since this function is invoked in VACUUM, we provide an access strategy
  *	parameter that controls fetches of the bucket pages.
@@ -454,8 +510,6 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	HashMetaPage metap;
 	Buffer		metabuf;
 	Buffer		mapbuf;
-	Buffer		prevbuf = InvalidBuffer;
-	Buffer		nextbuf = InvalidBuffer;
 	BlockNumber ovflblkno;
 	BlockNumber prevblkno;
 	BlockNumber blkno;
@@ -469,6 +523,9 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	int32		bitmappage,
 				bitmapbit;
 	Bucket bucket PG_USED_FOR_ASSERTS_ONLY;
+	Buffer		prevbuf = InvalidBuffer;
+	Buffer		nextbuf = InvalidBuffer;
+	bool		update_metap = false;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -533,6 +590,10 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* This operation needs to log multiple tuples, so prepare WAL for that */
+	if (RelationNeedsWAL(rel))
+		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, XLR_NORMAL_RDATAS + nitups);
+
 	START_CRIT_SECTION();
 
 	/*
@@ -577,13 +638,86 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	CLRBIT(freep, bitmapbit);
 	MarkBufferDirty(mapbuf);
 
+
 	/* if this is now the first free page, update hashm_firstfree */
 	if (ovflbitno < metap->hashm_firstfree)
 	{
 		metap->hashm_firstfree = ovflbitno;
+		update_metap = true;
 		MarkBufferDirty(metabuf);
 	}
 
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_squeeze_page xlrec;
+		XLogRecPtr	recptr;
+		int			i;
+
+		xlrec.prevblkno = prevblkno;
+		xlrec.nextblkno = nextblkno;
+		xlrec.ntups = nitups;
+		xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf) ? true : false;
+		xlrec.is_prev_bucket_same_wrt = (wbuf == prevbuf) ? true : false;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashSqueezePage);
+
+		/*
+		 * bucket buffer needs to be registered to ensure that we can acquire
+		 * a cleanup lock on it during replay.
+		 */
+		if (!xlrec.is_prim_bucket_same_wrt)
+			XLogRegisterBuffer(0, bucketbuf, REGBUF_STANDARD);
+
+		XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
+		if (xlrec.ntups > 0)
+		{
+			XLogRegisterBufData(1, (char *) itup_offsets,
+								nitups * sizeof(OffsetNumber));
+			for (i = 0; i < nitups; i++)
+				XLogRegisterBufData(1, (char *) itups[i], tups_size[i]);
+		}
+
+		XLogRegisterBuffer(2, ovflbuf, REGBUF_STANDARD);
+
+		/*
+		 * If prevpage and the writepage (the block into which we are moving
+		 * tuples from the overflow page) are the same, there is no need to
+		 * register prevpage separately.  During replay, we can directly
+		 * update the next block number in the writepage.
+		 */
+		if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt)
+			XLogRegisterBuffer(3, prevbuf, REGBUF_STANDARD);
+
+		if (BufferIsValid(nextbuf))
+			XLogRegisterBuffer(4, nextbuf, REGBUF_STANDARD);
+
+		XLogRegisterBuffer(5, mapbuf, REGBUF_STANDARD);
+		XLogRegisterBufData(5, (char *) &bitmapbit, sizeof(uint32));
+
+		if (update_metap)
+		{
+			XLogRegisterBuffer(6, metabuf, REGBUF_STANDARD);
+			XLogRegisterBufData(6, (char *) &metap->hashm_firstfree, sizeof(uint32));
+		}
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SQUEEZE_PAGE);
+
+		PageSetLSN(BufferGetPage(wbuf), recptr);
+		PageSetLSN(BufferGetPage(ovflbuf), recptr);
+
+		if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt)
+			PageSetLSN(BufferGetPage(prevbuf), recptr);
+		if (BufferIsValid(nextbuf))
+			PageSetLSN(BufferGetPage(nextbuf), recptr);
+
+		PageSetLSN(BufferGetPage(mapbuf), recptr);
+
+		if (update_metap)
+			PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
 	END_CRIT_SECTION();
 
 	/* release previous bucket if it is not same as write bucket */
@@ -797,11 +931,17 @@ readpage:
 														   HASH_WRITE,
 														   LH_OVERFLOW_PAGE,
 														   bstrategy);
-
 				if (nitups > 0)
 				{
 					Assert(nitups == ndeletable);
 
+					/*
+					 * This operation needs to log multiple tuples, so prepare
+					 * WAL for that.
+					 */
+					if (RelationNeedsWAL(rel))
+						XLogEnsureRecordSpace(0, XLR_NORMAL_RDATAS + nitups);
+
 					START_CRIT_SECTION();
 
 					/*
@@ -817,6 +957,41 @@ readpage:
 					PageIndexMultiDelete(rpage, deletable, ndeletable);
 					MarkBufferDirty(rbuf);
 
+					/* XLOG stuff */
+					if (RelationNeedsWAL(rel))
+					{
+						XLogRecPtr	recptr;
+						xl_hash_move_page_contents xlrec;
+
+						xlrec.ntups = nitups;
+						xlrec.is_prim_bucket_same_wrt = (wbuf == bucket_buf) ? true : false;
+
+						XLogBeginInsert();
+						XLogRegisterData((char *) &xlrec, SizeOfHashMovePageContents);
+
+						/*
+						 * bucket buffer needs to be registered to ensure that
+						 * we can acquire a cleanup lock on it during replay.
+						 */
+						if (!xlrec.is_prim_bucket_same_wrt)
+							XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+						XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
+						XLogRegisterBufData(1, (char *) itup_offsets,
+											nitups * sizeof(OffsetNumber));
+						for (i = 0; i < nitups; i++)
+							XLogRegisterBufData(1, (char *) itups[i], tups_size[i]);
+
+						XLogRegisterBuffer(2, rbuf, REGBUF_STANDARD);
+						XLogRegisterBufData(2, (char *) deletable,
+										  ndeletable * sizeof(OffsetNumber));
+
+						recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_MOVE_PAGE_CONTENTS);
+
+						PageSetLSN(BufferGetPage(wbuf), recptr);
+						PageSetLSN(BufferGetPage(rbuf), recptr);
+					}
+
 					END_CRIT_SECTION();
 
 					tups_moved = true;
@@ -834,7 +1009,9 @@ readpage:
 				/*
 				 * We need to release and if required reacquire the lock on
 				 * rbuf to ensure that standby shouldn't see an intermediate
-				 * state of it.
+				 * state of it.  If we don't release the lock, then after replay
+				 * of XLOG_HASH_SQUEEZE_PAGE on the standby, users would be able
+				 * to see the results of a partial deletion on rblkno.
 				 */
 				LockBuffer(rbuf, BUFFER_LOCK_UNLOCK);
 
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 8523409..300a15d 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -29,6 +29,7 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "access/hash_xlog.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
@@ -43,6 +44,8 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
 				  HTAB *htab,
 				  uint32 maxbucket,
 				  uint32 highmask, uint32 lowmask);
+static void
+			log_split_page(Relation rel, Buffer buf);
 
 
 /*
@@ -381,6 +384,31 @@ _hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
 	pg = BufferGetPage(metabuf);
 	metap = HashPageGetMeta(pg);
 
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_init_meta_page xlrec;
+		XLogRecPtr	recptr;
+
+		/*
+		 * Here, we could have copied the entire metapage into the WAL record
+		 * and restored it as-is during replay.  However, the metapage is not
+		 * small (more than 600 bytes), so we record just the information
+		 * required to construct it.
+		 */
+		xlrec.num_tuples = num_tuples;
+		xlrec.procid = metap->hashm_procid;
+		xlrec.ffactor = metap->hashm_ffactor;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInitMetaPage);
+		XLogRegisterBuffer(0, metabuf, REGBUF_WILL_INIT);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INIT_META_PAGE);
+
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
 	num_buckets = metap->hashm_maxbucket + 1;
 
 	/*
@@ -405,6 +433,12 @@ _hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
 		buf = _hash_getnewbuf(rel, blkno, forkNum);
 		_hash_initbuf(buf, metap->hashm_maxbucket, i, LH_BUCKET_PAGE, false);
 		MarkBufferDirty(buf);
+
+		log_newpage(&rel->rd_node,
+					forkNum,
+					blkno,
+					BufferGetPage(buf),
+					true);
 		_hash_relbuf(rel, buf);
 	}
 
@@ -431,6 +465,32 @@ _hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
 	metap->hashm_nmaps++;
 	MarkBufferDirty(metabuf);
 
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_init_bitmap_page xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.bmsize = metap->hashm_bmsize;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInitBitmapPage);
+		XLogRegisterBuffer(0, bitmapbuf, REGBUF_WILL_INIT);
+
+		/*
+		 * We could log the exact changes made to the metapage; however, as no
+		 * concurrent operation can see the index during the replay of this
+		 * record, we can simply perform the same operations during replay as
+		 * are done here.
+		 */
+		XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INIT_BITMAP_PAGE);
+
+		PageSetLSN(BufferGetPage(bitmapbuf), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
 	/* all done */
 	_hash_relbuf(rel, bitmapbuf);
 	_hash_relbuf(rel, metabuf);
@@ -573,6 +633,8 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	uint32		maxbucket;
 	uint32		highmask;
 	uint32		lowmask;
+	bool		metap_update_masks = false;
+	bool		metap_update_splitpoint = false;
 
 restart_expand:
 
@@ -732,7 +794,11 @@ restart_expand:
 		 * The number of buckets in the new splitpoint is equal to the total
 		 * number already in existence, i.e. new_bucket.  Currently this maps
 		 * one-to-one to blocks required, but someday we may need a more
-		 * complicated calculation here.
+		 * complicated calculation here.  We treat the allocation of buckets
+		 * as a separate WAL action.  Even if we fail after this operation, it
+		 * won't leak space on the standby, as the next split will consume it.
+		 * In any case, even without a failure we don't use all the space in
+		 * one split operation.
 		 */
 		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
 		{
@@ -775,6 +841,7 @@ restart_expand:
 		/* Starting a new doubling */
 		metap->hashm_lowmask = metap->hashm_highmask;
 		metap->hashm_highmask = new_bucket | metap->hashm_lowmask;
+		metap_update_masks = true;
 	}
 
 	/*
@@ -787,6 +854,7 @@ restart_expand:
 	{
 		metap->hashm_spares[spare_ndx] = metap->hashm_spares[metap->hashm_ovflpoint];
 		metap->hashm_ovflpoint = spare_ndx;
+		metap_update_splitpoint = true;
 	}
 
 	MarkBufferDirty(metabuf);
@@ -832,6 +900,49 @@ restart_expand:
 
 	MarkBufferDirty(buf_nblkno);
 
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_split_allocpage xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.new_bucket = maxbucket;
+		xlrec.old_bucket_flag = oopaque->hasho_flag;
+		xlrec.new_bucket_flag = nopaque->hasho_flag;
+		xlrec.flags = 0;
+
+		XLogBeginInsert();
+
+		XLogRegisterBuffer(0, buf_oblkno, REGBUF_STANDARD);
+		XLogRegisterBuffer(1, buf_nblkno, REGBUF_WILL_INIT);
+		XLogRegisterBuffer(2, metabuf, REGBUF_STANDARD);
+
+		if (metap_update_masks)
+		{
+			xlrec.flags |= XLH_SPLIT_META_UPDATE_MASKS;
+			XLogRegisterBufData(2, (char *) &metap->hashm_lowmask, sizeof(uint32));
+			XLogRegisterBufData(2, (char *) &metap->hashm_highmask, sizeof(uint32));
+		}
+
+		if (metap_update_splitpoint)
+		{
+			xlrec.flags |= XLH_SPLIT_META_UPDATE_SPLITPOINT;
+			XLogRegisterBufData(2, (char *) &metap->hashm_ovflpoint,
+								sizeof(uint32));
+			XLogRegisterBufData(2,
+					   (char *) &metap->hashm_spares[metap->hashm_ovflpoint],
+								sizeof(uint32));
+		}
+
+		XLogRegisterData((char *) &xlrec, SizeOfHashSplitAllocPage);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_ALLOCATE_PAGE);
+
+		PageSetLSN(BufferGetPage(buf_oblkno), recptr);
+		PageSetLSN(BufferGetPage(buf_nblkno), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
 	END_CRIT_SECTION();
 
 	/* drop lock, but keep pin */
@@ -893,6 +1004,7 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 {
 	BlockNumber lastblock;
 	char		zerobuf[BLCKSZ];
+	Page		page;
 
 	lastblock = firstblock + nblocks - 1;
 
@@ -903,7 +1015,21 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 	if (lastblock < firstblock || lastblock == InvalidBlockNumber)
 		return false;
 
-	MemSet(zerobuf, 0, sizeof(zerobuf));
+	page = (Page) zerobuf;
+
+	/*
+	 * Initialize the new bucket page.  We can't leave the page completely
+	 * zeroed, as WAL replay routines expect pages to be initialized.  See the
+	 * explanation of RBM_NORMAL mode atop XLogReadBufferExtended.
+	 */
+	_hash_pageinit(page, BLCKSZ);
+
+	if (RelationNeedsWAL(rel))
+		log_newpage(&rel->rd_node,
+					MAIN_FORKNUM,
+					lastblock,
+					zerobuf,
+					true);
 
 	RelationOpenSmgr(rel);
 	smgrextend(rel->rd_smgr, MAIN_FORKNUM, lastblock, zerobuf, false);
@@ -911,7 +1037,6 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 	return true;
 }
 
-
 /*
  * _hash_splitbucket -- split 'obucket' into 'obucket' and 'nbucket'
  *
@@ -1041,6 +1166,9 @@ _hash_splitbucket(Relation rel,
 
 				while (PageGetFreeSpace(npage) < itemsz)
 				{
+					/* log the split operation before releasing the lock */
+					log_split_page(rel, nbuf);
+
 					/* drop lock, but keep pin */
 					LockBuffer(nbuf, BUFFER_LOCK_UNLOCK);
 
@@ -1091,6 +1219,9 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			/* log the split operation before releasing the lock */
+			log_split_page(rel, nbuf);
+
 			if (nbuf == bucket_nbuf)
 				LockBuffer(nbuf, BUFFER_LOCK_UNLOCK);
 			else
@@ -1139,10 +1270,30 @@ _hash_splitbucket(Relation rel,
 	MarkBufferDirty(bucket_obuf);
 	MarkBufferDirty(bucket_nbuf);
 
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+		xl_hash_split_complete xlrec;
+
+		xlrec.old_bucket_flag = oopaque->hasho_flag;
+		xlrec.new_bucket_flag = nopaque->hasho_flag;
+
+		XLogBeginInsert();
+
+		XLogRegisterData((char *) &xlrec, SizeOfHashSplitComplete);
+
+		XLogRegisterBuffer(0, bucket_obuf, REGBUF_STANDARD);
+		XLogRegisterBuffer(1, bucket_nbuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_COMPLETE);
+
+		PageSetLSN(BufferGetPage(bucket_obuf), recptr);
+		PageSetLSN(BufferGetPage(bucket_nbuf), recptr);
+	}
+
 	END_CRIT_SECTION();
 }
 
-
 /*
  *	_hash_finish_split() -- Finish the previously interrupted split operation
  *
@@ -1267,6 +1418,38 @@ _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Bucket obucket,
 }
 
 /*
+ *	log_split_page() -- Log the split operation
+ *
+ *	We log the split operation when the new page in new bucket gets full,
+ *	We log the split operation when the new page in the new bucket becomes
+ *	full, so we log the entire page.
+ *
+ *	'buf' must be locked by the caller, which is also responsible for
+ *	unlocking it.
+static void
+log_split_page(Relation rel, Buffer buf)
+{
+	START_CRIT_SECTION();
+
+	MarkBufferDirty(buf);
+
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+
+		XLogBeginInsert();
+
+		XLogRegisterBuffer(0, buf, REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_PAGE);
+
+		PageSetLSN(BufferGetPage(buf), recptr);
+	}
+
+	END_CRIT_SECTION();
+}
+
+/*
  *	_hash_getcachedmetap() -- Returns cached metapage data.
  *
  *	If metabuf is not InvalidBuffer, caller must hold a pin, but no lock, on
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 9e5d7e4..d733770 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -123,6 +123,7 @@ _hash_readnext(IndexScanDesc scan,
 	if (block_found)
 	{
 		*pagep = BufferGetPage(*bufp);
+		TestForOldSnapshot(scan->xs_snapshot, rel, *pagep);
 		*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
 	}
 }
@@ -168,6 +169,7 @@ _hash_readprev(IndexScanDesc scan,
 		*bufp = _hash_getbuf(rel, blkno, HASH_READ,
 							 LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
 		*pagep = BufferGetPage(*bufp);
+		TestForOldSnapshot(scan->xs_snapshot, rel, *pagep);
 		*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
 
 		/*
@@ -283,6 +285,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 
 	buf = _hash_getbucketbuf_from_hashkey(rel, hashkey, HASH_READ, NULL);
 	page = BufferGetPage(buf);
+	TestForOldSnapshot(scan->xs_snapshot, rel, page);
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	bucket = opaque->hasho_bucket;
 
@@ -318,6 +321,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
 		old_buf = _hash_getbuf(rel, old_blkno, HASH_READ, LH_BUCKET_PAGE);
+		TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(old_buf));
 
 		/*
 		 * remember the split bucket buffer so as to use it later for
@@ -520,6 +524,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					_hash_readprev(scan, &buf, &page, &opaque);
 					if (BufferIsValid(buf))
 					{
+						TestForOldSnapshot(scan->xs_snapshot, rel, page);
 						maxoff = PageGetMaxOffsetNumber(page);
 						offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
 					}
diff --git a/src/backend/access/rmgrdesc/hashdesc.c b/src/backend/access/rmgrdesc/hashdesc.c
index 7eac819..5e3f7d8 100644
--- a/src/backend/access/rmgrdesc/hashdesc.c
+++ b/src/backend/access/rmgrdesc/hashdesc.c
@@ -19,10 +19,143 @@
 void
 hash_desc(StringInfo buf, XLogReaderState *record)
 {
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			{
+				xl_hash_init_meta_page *xlrec = (xl_hash_init_meta_page *) rec;
+
+				appendStringInfo(buf, "num_tuples %g, fillfactor %d",
+								 xlrec->num_tuples, xlrec->ffactor);
+				break;
+			}
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			{
+				xl_hash_init_bitmap_page *xlrec = (xl_hash_init_bitmap_page *) rec;
+
+				appendStringInfo(buf, "bmsize %d", xlrec->bmsize);
+				break;
+			}
+		case XLOG_HASH_INSERT:
+			{
+				xl_hash_insert *xlrec = (xl_hash_insert *) rec;
+
+				appendStringInfo(buf, "off %u", xlrec->offnum);
+				break;
+			}
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			{
+				xl_hash_addovflpage *xlrec = (xl_hash_addovflpage *) rec;
+
+				appendStringInfo(buf, "bmsize %d, bmpage_found %c",
+						   xlrec->bmsize, (xlrec->bmpage_found) ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			{
+				xl_hash_split_allocpage *xlrec = (xl_hash_split_allocpage *) rec;
+
+				appendStringInfo(buf, "new_bucket %u, meta_page_masks_updated %c, issplitpoint_changed %c",
+								 xlrec->new_bucket,
+					(xlrec->flags & XLH_SPLIT_META_UPDATE_MASKS) ? 'T' : 'F',
+								 (xlrec->flags & XLH_SPLIT_META_UPDATE_SPLITPOINT) ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SPLIT_COMPLETE:
+			{
+				xl_hash_split_complete *xlrec = (xl_hash_split_complete *) rec;
+
+				appendStringInfo(buf, "split_complete_old_bucket %c, split_complete_new_bucket %c",
+				(xlrec->old_bucket_flag & LH_BUCKET_BEING_SPLIT) ? 'F' : 'T',
+								 (xlrec->new_bucket_flag & LH_BUCKET_BEING_POPULATED) ? 'F' : 'T');
+				break;
+			}
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			{
+				xl_hash_move_page_contents *xlrec = (xl_hash_move_page_contents *) rec;
+
+				appendStringInfo(buf, "ntups %d, is_primary %c",
+								 xlrec->ntups,
+								 xlrec->is_prim_bucket_same_wrt ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SQUEEZE_PAGE:
+			{
+				xl_hash_squeeze_page *xlrec = (xl_hash_squeeze_page *) rec;
+
+				appendStringInfo(buf, "prevblkno %u, nextblkno %u, ntups %d, is_primary %c",
+								 xlrec->prevblkno,
+								 xlrec->nextblkno,
+								 xlrec->ntups,
+								 xlrec->is_prim_bucket_same_wrt ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_DELETE:
+			{
+				xl_hash_delete *xlrec = (xl_hash_delete *) rec;
+
+				appendStringInfo(buf, "is_primary %c",
+								 xlrec->is_primary_bucket_page ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_UPDATE_META_PAGE:
+			{
+				xl_hash_update_meta_page *xlrec = (xl_hash_update_meta_page *) rec;
+
+				appendStringInfo(buf, "ntuples %g",
+								 xlrec->ntuples);
+				break;
+			}
+	}
 }
 
 const char *
 hash_identify(uint8 info)
 {
-	return NULL;
+	const char *id = NULL;
+
+	switch (info & ~XLR_INFO_MASK)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			id = "INIT_META_PAGE";
+			break;
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			id = "INIT_BITMAP_PAGE";
+			break;
+		case XLOG_HASH_INSERT:
+			id = "INSERT";
+			break;
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			id = "ADD_OVFL_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			id = "SPLIT_ALLOCATE_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_PAGE:
+			id = "SPLIT_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_COMPLETE:
+			id = "SPLIT_COMPLETE";
+			break;
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			id = "MOVE_PAGE_CONTENTS";
+			break;
+		case XLOG_HASH_SQUEEZE_PAGE:
+			id = "SQUEEZE_PAGE";
+			break;
+		case XLOG_HASH_DELETE:
+			id = "DELETE";
+			break;
+		case XLOG_HASH_SPLIT_CLEANUP:
+			id = "SPLIT_CLEANUP";
+			break;
+		case XLOG_HASH_UPDATE_META_PAGE:
+			id = "UPDATE_META_PAGE";
+			break;
+	}
+
+	return id;
 }
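
The two routines above are what WAL inspection tooling uses to label and
describe hash records.  A hypothetical helper, not part of the patch, showing
how they compose into a one-line description, roughly the way pg_xlogdump
presents records:

static void
describe_hash_record(XLogReaderState *record, StringInfo out)
{
	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
	const char *id = hash_identify(info);

	appendStringInfo(out, "Hash/%s: ", id ? id : "UNKNOWN");
	hash_desc(out, record);
}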
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 72bb06c..9618032 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -506,11 +506,6 @@ DefineIndex(Oid relationId,
 	accessMethodForm = (Form_pg_am) GETSTRUCT(tuple);
 	amRoutine = GetIndexAmRoutine(accessMethodForm->amhandler);
 
-	if (strcmp(accessMethodName, "hash") == 0 &&
-		RelationNeedsWAL(rel))
-		ereport(WARNING,
-				(errmsg("hash indexes are not WAL-logged and their use is discouraged")));
-
 	if (stmt->unique && !amRoutine->amcanunique)
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 9001e20..ce55fc5 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -5880,13 +5880,10 @@ RelationIdIsInInitFile(Oid relationId)
 /*
  * Tells whether any index for the relation is unlogged.
  *
- * Any index using the hash AM is implicitly unlogged.
- *
  * Note: There doesn't seem to be any way to have an unlogged index attached
- * to a permanent table except to create a hash index, but it seems best to
- * keep this general so that it returns sensible results even when they seem
- * obvious (like for an unlogged table) and to handle possible future unlogged
- * indexes on permanent tables.
+ * to a permanent table, but it seems best to keep this general so that it
+ * returns sensible results even when they seem obvious (like for an unlogged
+ * table) and to handle possible future unlogged indexes on permanent tables.
  */
 bool
 RelationHasUnloggedIndex(Relation rel)
@@ -5908,8 +5905,7 @@ RelationHasUnloggedIndex(Relation rel)
 			elog(ERROR, "cache lookup failed for relation %u", indexoid);
 		reltup = (Form_pg_class) GETSTRUCT(tp);
 
-		if (reltup->relpersistence == RELPERSISTENCE_UNLOGGED
-			|| reltup->relam == HASH_AM_OID)
+		if (reltup->relpersistence == RELPERSISTENCE_UNLOGGED)
 			result = true;
 
 		ReleaseSysCache(tp);
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index ed3c37f..c53f878 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -17,6 +17,245 @@
 #include "access/hash.h"
 #include "access/xlogreader.h"
 
+/* Number of buffers required for XLOG_HASH_SQUEEZE_PAGE operation */
+#define HASH_XLOG_FREE_OVFL_BUFS	6
+
+/*
+ * XLOG records for hash operations
+ */
+#define XLOG_HASH_INIT_META_PAGE	0x00		/* initialize the meta page */
+#define XLOG_HASH_INIT_BITMAP_PAGE	0x10		/* initialize the bitmap page */
+#define XLOG_HASH_INSERT		0x20	/* add index tuple without split */
+#define XLOG_HASH_ADD_OVFL_PAGE 0x30	/* add overflow page */
+#define XLOG_HASH_SPLIT_ALLOCATE_PAGE	0x40	/* allocate new page for split */
+#define XLOG_HASH_SPLIT_PAGE	0x50	/* split page */
+#define XLOG_HASH_SPLIT_COMPLETE	0x60		/* completion of split
+												 * operation */
+#define XLOG_HASH_MOVE_PAGE_CONTENTS	0x70	/* remove tuples from one page
+												 * and add to another page */
+#define XLOG_HASH_SQUEEZE_PAGE	0x80	/* add tuples to one of the previous
+										 * pages in chain and free the ovfl
+										 * page */
+#define XLOG_HASH_DELETE		0x90	/* delete index tuples from a page */
+#define XLOG_HASH_SPLIT_CLEANUP 0xA0	/* clear split-cleanup flag in primary
+										 * bucket page after deleting tuples
+										 * that are moved due to split	*/
+#define XLOG_HASH_UPDATE_META_PAGE	0xB0		/* update meta page after
+												 * vacuum */
+
+
+/*
+ * xl_hash_split_allocpage flag values, 8 bits are available.
+ */
+#define XLH_SPLIT_META_UPDATE_MASKS		(1<<0)
+#define XLH_SPLIT_META_UPDATE_SPLITPOINT		(1<<1)
+
+/*
+ * Data to regenerate the meta-data page
+ */
+typedef struct xl_hash_metadata
+{
+	HashMetaPageData metadata;
+}	xl_hash_metadata;
+
+/*
+ * This is what we need to know about a HASH index create.
+ *
+ * Backup block 0: metapage
+ */
+typedef struct xl_hash_createidx
+{
+	double		num_tuples;
+	RegProcedure procid;
+	uint16		ffactor;
+}	xl_hash_createidx;
+#define SizeOfHashCreateIdx (offsetof(xl_hash_createidx, ffactor) + sizeof(uint16))
+
+/*
+ * This is what we need to know about simple (without split) insert.
+ *
+ * This data record is used for XLOG_HASH_INSERT
+ *
+ * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 1: metapage (xl_hash_metadata)
+ */
+typedef struct xl_hash_insert
+{
+	OffsetNumber offnum;
+}	xl_hash_insert;
+
+#define SizeOfHashInsert	(offsetof(xl_hash_insert, offnum) + sizeof(OffsetNumber))
+
+/*
+ * This is what we need to know about addition of overflow page.
+ *
+ * This data record is used for XLOG_HASH_ADD_OVFL_PAGE
+ *
+ * Backup Blk 0: newly allocated overflow page
+ * Backup Blk 1: page before new overflow page in the bucket chain
+ * Backup Blk 2: bitmap page
+ * Backup Blk 3: new bitmap page
+ * Backup Blk 4: metapage
+ */
+typedef struct xl_hash_addovflpage
+{
+	uint16		bmsize;
+	bool		bmpage_found;
+}	xl_hash_addovflpage;
+
+#define SizeOfHashAddOvflPage	\
+	(offsetof(xl_hash_addovflpage, bmpage_found) + sizeof(bool))
+
+/*
+ * This is what we need to know about allocating a page for split.
+ *
+ * This data record is used for XLOG_HASH_SPLIT_ALLOCATE_PAGE
+ *
+ * Backup Blk 0: page for old bucket
+ * Backup Blk 1: page for new bucket
+ * Backup Blk 2: metapage
+ */
+typedef struct xl_hash_split_allocpage
+{
+	uint32		new_bucket;
+	uint16		old_bucket_flag;
+	uint16		new_bucket_flag;
+	uint8		flags;
+}	xl_hash_split_allocpage;
+
+#define SizeOfHashSplitAllocPage	\
+	(offsetof(xl_hash_split_allocpage, flags) + sizeof(uint8))
+
+/*
+ * This is what we need to know about completing the split operation.
+ *
+ * This data record is used for XLOG_HASH_SPLIT_COMPLETE
+ *
+ * Backup Blk 0: page for old bucket
+ * Backup Blk 1: page for new bucket
+ */
+typedef struct xl_hash_split_complete
+{
+	uint16		old_bucket_flag;
+	uint16		new_bucket_flag;
+}	xl_hash_split_complete;
+
+#define SizeOfHashSplitComplete \
+	(offsetof(xl_hash_split_complete, new_bucket_flag) + sizeof(uint16))
+
+/*
+ * This is what we need to know about move page contents required during
+ * squeeze operation.
+ *
+ * This data record is used for XLOG_HASH_MOVE_PAGE_CONTENTS
+ *
+ * Backup Blk 0: bucket page
+ * Backup Blk 1: page containing moved tuples
+ * Backup Blk 2: page from which tuples will be removed
+ */
+typedef struct xl_hash_move_page_contents
+{
+	uint16		ntups;
+	bool		is_prim_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is same as
+												 * primary bucket page */
+}	xl_hash_move_page_contents;
+
+#define SizeOfHashMovePageContents	\
+	(offsetof(xl_hash_move_page_contents, is_prim_bucket_same_wrt) + sizeof(bool))
+
+/*
+ * This is what we need to know about the squeeze page operation.
+ *
+ * This data record is used for XLOG_HASH_SQUEEZE_PAGE
+ *
+ * Backup Blk 0: page containing tuples moved from freed overflow page
+ * Backup Blk 1: freed overflow page
+ * Backup Blk 2: page previous to the freed overflow page
+ * Backup Blk 3: page next to the freed overflow page
+ * Backup Blk 4: bitmap page containing info of freed overflow page
+ * Backup Blk 5: meta page
+ */
+typedef struct xl_hash_squeeze_page
+{
+	BlockNumber prevblkno;
+	BlockNumber nextblkno;
+	uint16		ntups;
+	bool		is_prim_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is same as
+												 * primary bucket page */
+	bool		is_prev_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is the
+												 * page previous to the freed
+												 * overflow page */
+}	xl_hash_squeeze_page;
+
+#define SizeOfHashSqueezePage	\
+	(offsetof(xl_hash_squeeze_page, is_prev_bucket_same_wrt) + sizeof(bool))
+
+/*
+ * This is what we need to know about the deletion of index tuples from a page.
+ *
+ * This data record is used for XLOG_HASH_DELETE
+ *
+ * Backup Blk 0: primary bucket page
+ * Backup Blk 1: page from which tuples are deleted
+ */
+typedef struct xl_hash_delete
+{
+	bool		is_primary_bucket_page; /* TRUE if the operation is for
+										 * primary bucket page */
+}	xl_hash_delete;
+
+#define SizeOfHashDelete	(offsetof(xl_hash_delete, is_primary_bucket_page) + sizeof(bool))
+
+/*
+ * This is what we need for metapage update operation.
+ *
+ * This data record is used for XLOG_HASH_UPDATE_META_PAGE
+ *
+ * Backup Blk 0: meta page
+ */
+typedef struct xl_hash_update_meta_page
+{
+	double		ntuples;
+}	xl_hash_update_meta_page;
+
+#define SizeOfHashUpdateMetaPage	\
+	(offsetof(xl_hash_update_meta_page, ntuples) + sizeof(double))
+
+/*
+ * This is what we need to initialize metapage.
+ *
+ * This data record is used for XLOG_HASH_INIT_META_PAGE
+ *
+ * Backup Blk 0: meta page
+ */
+typedef struct xl_hash_init_meta_page
+{
+	double		num_tuples;
+	RegProcedure procid;
+	uint16		ffactor;
+}	xl_hash_init_meta_page;
+
+#define SizeOfHashInitMetaPage		\
+	(offsetof(xl_hash_init_meta_page, ffactor) + sizeof(uint16))
+
+/*
+ * This is what we need to initialize bitmap page.
+ *
+ * This data record is used for XLOG_HASH_INIT_BITMAP_PAGE
+ *
+ * Backup Blk 0: bitmap page
+ * Backup Blk 1: meta page
+ */
+typedef struct xl_hash_init_bitmap_page
+{
+	uint16		bmsize;
+}	xl_hash_init_bitmap_page;
+
+#define SizeOfHashInitBitmapPage	\
+	(offsetof(xl_hash_init_bitmap_page, bmsize) + sizeof(uint16))
 
 extern void hash_redo(XLogReaderState *record);
 extern void hash_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index e519fdb..26cd059 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -2335,13 +2335,9 @@ Options: fastupdate=on, gin_pending_list_limit=128
 -- HASH
 --
 CREATE INDEX hash_i4_index ON hash_i4_heap USING hash (random int4_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_name_index ON hash_name_heap USING hash (random name_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_txt_index ON hash_txt_heap USING hash (random text_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_f8_index ON hash_f8_heap USING hash (random float8_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE UNLOGGED TABLE unlogged_hash_table (id int4);
 CREATE INDEX unlogged_hash_index ON unlogged_hash_table USING hash (id int4_ops);
 DROP TABLE unlogged_hash_table;
@@ -2350,7 +2346,6 @@ DROP TABLE unlogged_hash_table;
 -- maintenance_work_mem setting and fillfactor:
 SET maintenance_work_mem = '1MB';
 CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 EXPLAIN (COSTS OFF)
 SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
                       QUERY PLAN                       
diff --git a/src/test/regress/expected/enum.out b/src/test/regress/expected/enum.out
index 514d1d0..0e60304 100644
--- a/src/test/regress/expected/enum.out
+++ b/src/test/regress/expected/enum.out
@@ -383,7 +383,6 @@ DROP INDEX enumtest_btree;
 -- Hash index / opclass with the = operator
 --
 CREATE INDEX enumtest_hash ON enumtest USING hash (col);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 SELECT * FROM enumtest WHERE col = 'orange';
   col   
 --------
diff --git a/src/test/regress/expected/hash_index.out b/src/test/regress/expected/hash_index.out
index f8b9f02..0a18efa 100644
--- a/src/test/regress/expected/hash_index.out
+++ b/src/test/regress/expected/hash_index.out
@@ -201,7 +201,6 @@ SELECT h.seqno AS f20000
 --
 CREATE TABLE hash_split_heap (keycol INT);
 CREATE INDEX hash_split_index on hash_split_heap USING HASH (keycol);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 INSERT INTO hash_split_heap SELECT 1 FROM generate_series(1, 70000) a;
 VACUUM FULL hash_split_heap;
 -- Let's do a backward scan.
@@ -230,5 +229,4 @@ DROP TABLE hash_temp_heap CASCADE;
 CREATE TABLE hash_heap_float4 (x float4, y int);
 INSERT INTO hash_heap_float4 VALUES (1.1,1);
 CREATE INDEX hash_idx ON hash_heap_float4 USING hash (x);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 DROP TABLE hash_heap_float4 CASCADE;
diff --git a/src/test/regress/expected/macaddr.out b/src/test/regress/expected/macaddr.out
index e84ff5f..151f9ce 100644
--- a/src/test/regress/expected/macaddr.out
+++ b/src/test/regress/expected/macaddr.out
@@ -41,7 +41,6 @@ SELECT * FROM macaddr_data;
 
 CREATE INDEX macaddr_data_btree ON macaddr_data USING btree (b);
 CREATE INDEX macaddr_data_hash ON macaddr_data USING hash (b);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 SELECT a, b, trunc(b) FROM macaddr_data ORDER BY 2, 1;
  a  |         b         |       trunc       
 ----+-------------------+-------------------
diff --git a/src/test/regress/expected/replica_identity.out b/src/test/regress/expected/replica_identity.out
index fa63235..67c34a9 100644
--- a/src/test/regress/expected/replica_identity.out
+++ b/src/test/regress/expected/replica_identity.out
@@ -12,7 +12,6 @@ CREATE UNIQUE INDEX test_replica_identity_keyab_key ON test_replica_identity (ke
 CREATE UNIQUE INDEX test_replica_identity_oid_idx ON test_replica_identity (oid);
 CREATE UNIQUE INDEX test_replica_identity_nonkey ON test_replica_identity (keya, nonkey);
 CREATE INDEX test_replica_identity_hash ON test_replica_identity USING hash (nonkey);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE UNIQUE INDEX test_replica_identity_expr ON test_replica_identity (keya, keyb, (3));
 CREATE UNIQUE INDEX test_replica_identity_partial ON test_replica_identity (keya, keyb) WHERE keyb != '3';
 -- default is 'd'/DEFAULT for user created tables
diff --git a/src/test/regress/expected/uuid.out b/src/test/regress/expected/uuid.out
index 423f277..db66dc7 100644
--- a/src/test/regress/expected/uuid.out
+++ b/src/test/regress/expected/uuid.out
@@ -114,7 +114,6 @@ SELECT COUNT(*) FROM guid1 WHERE guid_field >= '22222222-2222-2222-2222-22222222
 -- btree and hash index creation test
 CREATE INDEX guid1_btree ON guid1 USING BTREE (guid_field);
 CREATE INDEX guid1_hash  ON guid1 USING HASH  (guid_field);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 -- unique index test
 CREATE UNIQUE INDEX guid1_unique_BTREE ON guid1 USING BTREE (guid_field);
 -- should fail
-- 
1.8.4.msysgit.0

#74Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Amit Kapila (#73)
Re: Write Ahead Logging for Hash Indexes

On Mon, Feb 13, 2017 at 8:52 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

As discussed, attached are refactoring patches and a patch to enable
WAL for the hash index on top of them.

0006-Enable-WAL-for-Hash-Indexes.patch needs to be rebased after
commit 8da9a226369e9ceec7cef1.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#75Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#73)
Re: Write Ahead Logging for Hash Indexes

On Mon, Feb 13, 2017 at 10:22 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

As discussed, attached are refactoring patches and a patch to enable
WAL for the hash index on top of them.

Thanks. I think that the refactoring patches shouldn't add
START_CRIT_SECTION() and END_CRIT_SECTION() calls; let's leave that
for the final patch. Nor should their comments reference the idea of
writing WAL; that should also be for the final patch.

PageGetFreeSpaceForMulTups doesn't seem like a great name.
PageGetFreeSpaceForMultipleTuples? Or maybe just use
PageGetExactFreeSpace and then do the account in the caller. I'm not
sure it's really worth having a function just to subtract a multiple
of sizeof(ItemIdData), and it would actually be more efficient to have
the caller take care of this, since you wouldn't need to keep
recalculating the value for every iteration of the loop.
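
For illustration, the caller-side accounting could look something like
this (tuples_fit_on_page is just a hypothetical helper name for the
sake of the example, not something in the patch):

/*
 * Hypothetical helper: check whether 'ntups' tuples of the given
 * (already MAXALIGN'ed) sizes fit on 'page', charging one line
 * pointer per tuple on top of the tuple data itself.
 */
static bool
tuples_fit_on_page(Page page, Size *tups_size, int ntups)
{
	Size		avail = PageGetExactFreeSpace(page);
	Size		needed = 0;
	int			i;

	for (i = 0; i < ntups; i++)
		needed += tups_size[i] + sizeof(ItemIdData);

	return needed <= avail;
}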

I think we ought to just rip (as a preliminary patch) out the support
for freeing an overflow page in the middle of a bucket chain. That
code is impossible to hit, I believe, and instead of complicating the
WAL machinery to cater to a nonexistent case, maybe we ought to just
get rid of it.

+                               /*
+                                * We need to release and if required reacquire the lock on
+                                * rbuf to ensure that standby shouldn't see an intermediate
+                                * state of it.
+                                */

How does releasing and reacquiring the lock on the master affect
whether the standby can see an intermediate state? That sounds
totally wrong to me (and also doesn't belong in a refactoring patch
anyway since we're not emitting any WAL here).

- Assert(PageIsNew(page));

This worries me a bit. What's going on here?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#76Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#75)
5 attachment(s)
Re: Write Ahead Logging for Hash Indexes

On Thu, Feb 16, 2017 at 7:15 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Feb 13, 2017 at 10:22 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

As discussed, attached are refactoring patches and a patch to enable
WAL for the hash index on top of them.

Thanks. I think that the refactoring patches shouldn't add
START_CRIT_SECTION() and END_CRIT_SECTION() calls; let's leave that
for the final patch. Nor should their comments reference the idea of
writing WAL; that should also be for the final patch.

Okay, changed as per suggestion.

PageGetFreeSpaceForMulTups doesn't seem like a great name.
PageGetFreeSpaceForMultipleTuples?

Okay, changed as per suggestion.

Or maybe just use
PageGetExactFreeSpace and then do the account in the caller. I'm not
sure it's really worth having a function just to subtract a multiple
of sizeof(ItemIdData), and it would actually be more efficient to have
the caller take care of this, since you wouldn't need to keep
recalculating the value for every iteration of the loop.

I have tried this, but I think even if we track the space outside, we
still need something like PageGetFreeSpaceForMultipleTuples for the
cases where we have to try the next write page(s) to see whether the
data (the set of index tuples) can be accommodated.

I think we ought to just rip (as a preliminary patch) out the support
for freeing an overflow page in the middle of a bucket chain. That
code is impossible to hit, I believe, and instead of complicating the
WAL machinery to cater to a nonexistent case, maybe we ought to just
get rid of it.

I also could not think of a case where we need to care about a page
in the middle of the bucket chain as per its current usage. In
particular, are you worried about the code related to the next block
number in _hash_freeovflpage()? Surely, we can remove it, but what
complexity are you referring to here? There is some additional
book-keeping for the primary bucket page, but there is not much
involved in maintaining a backward link. One more thing to note is
that this is an exposed API, so to avoid it being used in some wrong
way, we might want to make it static.

+                               /*
+                                * We need to release and if required reacquire the lock on
+                                * rbuf to ensure that standby shouldn't see an intermediate
+                                * state of it.
+                                */

How does releasing and reacquiring the lock on the master affect
whether the standby can see an intermediate state?

I am talking about the intermediate state of the standby with respect
to the master. We will release the lock on the standby after replaying
the WAL for moving tuples from the read page to the write page, whereas
the master will hold it for a much longer period (till we move all the
tuples from the read page and free that page). I understand that there
is no correctness issue here, but I was trying to keep the operations
on master and standby similar, and it also helps us obey the coding
rule of releasing buffer locks only after writing the WAL record. I
see no harm in maintaining the existing coding pattern; however, if you
think that is not important in this case, then I can change the code to
avoid releasing the lock on the read page.
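
The coding rule I am referring to is the usual WAL pattern: modify the
pages and insert the WAL record inside a critical section, set the page
LSNs, and only then release the buffer locks. A simplified sketch for
the squeeze case (field setup elided, and only two buffers registered
here, whereas the real record registers more) would be:

	xl_hash_squeeze_page xlrec;
	XLogRecPtr	recptr;

	/* ... fill xlrec fields ... */

	START_CRIT_SECTION();

	/* apply the page modifications */
	MarkBufferDirty(wbuf);
	MarkBufferDirty(rbuf);

	if (RelationNeedsWAL(rel))
	{
		XLogBeginInsert();
		XLogRegisterData((char *) &xlrec, SizeOfHashSqueezePage);
		XLogRegisterBuffer(0, wbuf, REGBUF_STANDARD);
		XLogRegisterBuffer(1, rbuf, REGBUF_STANDARD);

		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SQUEEZE_PAGE);

		PageSetLSN(BufferGetPage(wbuf), recptr);
		PageSetLSN(BufferGetPage(rbuf), recptr);
	}

	END_CRIT_SECTION();

	/* only now is it safe to release the buffer locks */
	LockBuffer(rbuf, BUFFER_LOCK_UNLOCK);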

That sounds
totally wrong to me (and also doesn't belong in a refactoring patch
anyway since we're not emitting any WAL here).

Agreed that even if we want to make such a change, it should be part
of the WAL patch. So for now, I have reverted that change.

- Assert(PageIsNew(page));

This worries me a bit. What's going on here?

This is related to the change below.

+ /*
+  * Initialise the freed overflow page; here we can't completely zero the
+  * page as WAL replay routines expect pages to be initialized.  See the
+  * explanation of RBM_NORMAL mode atop XLogReadBufferExtended.
+  */
+ _hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));

Basically, we need this routine to initialize freed overflow pages.
Also, if you see the similar function in btree (_bt_pageinit(Page
page, Size size)), it doesn't have any such Assertion.
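
For comparison, the btree routine is just a thin wrapper around
PageInit with no such assertion, and after removing the Assert
_hash_pageinit ends up looking the same way:

/* nbtpage.c */
void
_bt_pageinit(Page page, Size size)
{
	PageInit(page, size, sizeof(BTPageOpaqueData));
}

/* hashpage.c, after removing the Assert */
void
_hash_pageinit(Page page, Size size)
{
	PageInit(page, size, sizeof(HashPageOpaqueData));
}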

Attached are the refactoring patches. The WAL patch needs some changes
based on the above comments, so I will post it later.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0001-Expose-a-new-API-_hash_pgaddmultitup.patchapplication/octet-stream; name=0001-Expose-a-new-API-_hash_pgaddmultitup.patchDownload
From 00f6a761a0a704e6e3e256bb54232e6be57ec39e Mon Sep 17 00:00:00 2001
From: Amit Kapila <amit.kapila@enterprisedb.com>
Date: Thu, 16 Feb 2017 17:54:51 +0530
Subject: [PATCH 1/5] Expose a new API _hash_pgaddmultitup.

The purpose of this API is to add multiple tuples to a page
in one shot.  This helps make the free-overflow operation
atomic, which is a prerequisite for WAL logging.
---
 src/backend/access/hash/hashinsert.c |  41 ++++++++
 src/backend/access/hash/hashovfl.c   | 196 +++++++++++++++++++++++------------
 src/backend/access/hash/hashpage.c   |   1 -
 src/backend/storage/page/bufpage.c   |  27 +++++
 src/include/access/hash.h            |   7 +-
 src/include/storage/bufpage.h        |   1 +
 6 files changed, 203 insertions(+), 70 deletions(-)

diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index dc63063..354e733 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -228,3 +228,44 @@ _hash_pgaddtup(Relation rel, Buffer buf, Size itemsize, IndexTuple itup)
 
 	return itup_off;
 }
+
+/*
+ *	_hash_pgaddmultitup() -- add a tuple vector to a particular page in the
+ *							 index.
+ *
+ * This routine has same requirements for locking and tuple ordering as
+ * _hash_pgaddtup().
+ *
+ * Returns the offset number array at which the tuples were inserted.
+ */
+void
+_hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
+					OffsetNumber *itup_offsets, uint16 nitups)
+{
+	OffsetNumber itup_off;
+	Page		page;
+	uint32		hashkey;
+	int			i;
+
+	_hash_checkpage(rel, buf, LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
+	page = BufferGetPage(buf);
+
+	for (i = 0; i < nitups; i++)
+	{
+		Size		itemsize;
+
+		itemsize = IndexTupleDSize(*itups[i]);
+		itemsize = MAXALIGN(itemsize);
+
+		/* Find where to insert the tuple (preserving page's hashkey ordering) */
+		hashkey = _hash_get_indextuple_hashkey(itups[i]);
+		itup_off = _hash_binsearch(page, hashkey);
+
+		itup_offsets[i] = itup_off;
+
+		if (PageAddItem(page, (Item) itups[i], itemsize, itup_off, false, false)
+			== InvalidOffsetNumber)
+			elog(ERROR, "failed to add index item to \"%s\"",
+				 RelationGetRelationName(rel));
+	}
+}
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 3334089..52491e0 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -391,6 +391,8 @@ _hash_firstfreebit(uint32 map)
  *	Remove this overflow page from its bucket's chain, and mark the page as
  *	free.  On entry, ovflbuf is write-locked; it is released before exiting.
  *
+ *	Add the tuples (itups) to wbuf.
+ *
  *	Since this function is invoked in VACUUM, we provide an access strategy
  *	parameter that controls fetches of the bucket pages.
  *
@@ -403,13 +405,16 @@ _hash_firstfreebit(uint32 map)
  *	has a lock on same.
  */
 BlockNumber
-_hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
+_hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
+				   Buffer wbuf, IndexTuple *itups, OffsetNumber *itup_offsets,
+				   Size *tups_size, uint16 nitups,
 				   BufferAccessStrategy bstrategy)
 {
 	HashMetaPage metap;
 	Buffer		metabuf;
 	Buffer		mapbuf;
 	Buffer		prevbuf = InvalidBuffer;
+	Buffer		nextbuf = InvalidBuffer;
 	BlockNumber ovflblkno;
 	BlockNumber prevblkno;
 	BlockNumber blkno;
@@ -435,15 +440,6 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 	bucket = ovflopaque->hasho_bucket;
 
 	/*
-	 * Zero the page for debugging's sake; then write and release it. (Note:
-	 * if we failed to zero the page here, we'd have problems with the Assert
-	 * in _hash_pageinit() when the page is reused.)
-	 */
-	MemSet(ovflpage, 0, BufferGetPageSize(ovflbuf));
-	MarkBufferDirty(ovflbuf);
-	_hash_relbuf(rel, ovflbuf);
-
-	/*
 	 * Fix up the bucket chain.  this is a doubly-linked list, so we must fix
 	 * up the bucket chain members behind and ahead of the overflow page being
 	 * deleted.  Concurrency issues are avoided by using lock chaining as
@@ -451,8 +447,6 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 	 */
 	if (BlockNumberIsValid(prevblkno))
 	{
-		Page		prevpage;
-		HashPageOpaque prevopaque;
 
 		if (prevblkno == writeblkno)
 			prevbuf = wbuf;
@@ -462,32 +456,13 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 												 HASH_WRITE,
 										   LH_BUCKET_PAGE | LH_OVERFLOW_PAGE,
 												 bstrategy);
-
-		prevpage = BufferGetPage(prevbuf);
-		prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
-
-		Assert(prevopaque->hasho_bucket == bucket);
-		prevopaque->hasho_nextblkno = nextblkno;
-
-		MarkBufferDirty(prevbuf);
-		if (prevblkno != writeblkno)
-			_hash_relbuf(rel, prevbuf);
 	}
 	if (BlockNumberIsValid(nextblkno))
-	{
-		Buffer		nextbuf = _hash_getbuf_with_strategy(rel,
-														 nextblkno,
-														 HASH_WRITE,
-														 LH_OVERFLOW_PAGE,
-														 bstrategy);
-		Page		nextpage = BufferGetPage(nextbuf);
-		HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage);
-
-		Assert(nextopaque->hasho_bucket == bucket);
-		nextopaque->hasho_prevblkno = prevblkno;
-		MarkBufferDirty(nextbuf);
-		_hash_relbuf(rel, nextbuf);
-	}
+		nextbuf = _hash_getbuf_with_strategy(rel,
+											 nextblkno,
+											 HASH_WRITE,
+											 LH_OVERFLOW_PAGE,
+											 bstrategy);
 
 	/* Note: bstrategy is intentionally not used for metapage and bitmap */
 
@@ -508,25 +483,75 @@ _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
 	/* Release metapage lock while we access the bitmap page */
 	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 
-	/* Clear the bitmap bit to indicate that this overflow page is free */
+	/* read the bitmap page to clear the bitmap bit */
 	mapbuf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BITMAP_PAGE);
 	mappage = BufferGetPage(mapbuf);
 	freep = HashPageGetBitmap(mappage);
 	Assert(ISSET(freep, bitmapbit));
-	CLRBIT(freep, bitmapbit);
-	MarkBufferDirty(mapbuf);
-	_hash_relbuf(rel, mapbuf);
 
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/*
+	 * we have to insert tuples on the "write" page, being careful to preserve
+	 * hashkey ordering.  (If we insert many tuples into the same "write" page
+	 * it would be worth qsort'ing them).
+	 */
+	if (nitups > 0)
+	{
+		_hash_pgaddmultitup(rel, wbuf, itups, itup_offsets, nitups);
+		MarkBufferDirty(wbuf);
+	}
+
+	/* Initialise the freed overflow page. */
+	_hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));
+	MarkBufferDirty(ovflbuf);
+
+	if (BufferIsValid(prevbuf))
+	{
+		Page		prevpage = BufferGetPage(prevbuf);
+		HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+		Assert(prevopaque->hasho_bucket == bucket);
+		prevopaque->hasho_nextblkno = nextblkno;
+		MarkBufferDirty(prevbuf);
+	}
+	if (BufferIsValid(nextbuf))
+	{
+		Page		nextpage = BufferGetPage(nextbuf);
+		HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage);
+
+		Assert(nextopaque->hasho_bucket == bucket);
+		nextopaque->hasho_prevblkno = prevblkno;
+		MarkBufferDirty(nextbuf);
+	}
+
+	/* Clear the bitmap bit to indicate that this overflow page is free */
+	CLRBIT(freep, bitmapbit);
+	MarkBufferDirty(mapbuf);
+
 	/* if this is now the first free page, update hashm_firstfree */
 	if (ovflbitno < metap->hashm_firstfree)
 	{
 		metap->hashm_firstfree = ovflbitno;
 		MarkBufferDirty(metabuf);
 	}
-	_hash_relbuf(rel, metabuf);
+
+	/* release previous bucket if it is not same as write bucket */
+	if (BufferIsValid(prevbuf) && prevblkno != writeblkno)
+		_hash_relbuf(rel, prevbuf);
+
+	if (BufferIsValid(ovflbuf))
+		_hash_relbuf(rel, ovflbuf);
+
+	if (BufferIsValid(nextbuf))
+		_hash_relbuf(rel, nextbuf);
+
+	if (BufferIsValid(mapbuf))
+		_hash_relbuf(rel, mapbuf);
+
+	if (BufferIsValid(metabuf))
+		_hash_relbuf(rel, metabuf);
 
 	return nextblkno;
 }
@@ -640,7 +665,6 @@ _hash_squeezebucket(Relation rel,
 	Page		rpage;
 	HashPageOpaque wopaque;
 	HashPageOpaque ropaque;
-	bool		wbuf_dirty;
 
 	/*
 	 * start squeezing into the primary bucket page.
@@ -686,15 +710,23 @@ _hash_squeezebucket(Relation rel,
 	/*
 	 * squeeze the tuples.
 	 */
-	wbuf_dirty = false;
 	for (;;)
 	{
 		OffsetNumber roffnum;
 		OffsetNumber maxroffnum;
 		OffsetNumber deletable[MaxOffsetNumber];
-		int			ndeletable = 0;
+		IndexTuple	itups[MaxIndexTuplesPerPage];
+		Size		tups_size[MaxIndexTuplesPerPage];
+		OffsetNumber *itup_offsets;
+		uint16		ndeletable = 0;
+		uint16		nitups = 0;
+		Size		all_tups_size = 0;
+		int			i;
 		bool		retain_pin = false;
 
+		itup_offsets = (OffsetNumber *) palloc(MaxIndexTuplesPerPage * sizeof(OffsetNumber));
+
+readpage:
 		/* Scan each tuple in "read" page */
 		maxroffnum = PageGetMaxOffsetNumber(rpage);
 		for (roffnum = FirstOffsetNumber;
@@ -715,11 +747,13 @@ _hash_squeezebucket(Relation rel,
 
 			/*
 			 * Walk up the bucket chain, looking for a page big enough for
-			 * this item.  Exit if we reach the read page.
+			 * this item and all other accumulated items.  Exit if we reach
+			 * the read page.
 			 */
-			while (PageGetFreeSpace(wpage) < itemsz)
+			while (PageGetFreeSpaceForMultipleTuples(wpage, nitups + 1) < (all_tups_size + itemsz))
 			{
 				Buffer		next_wbuf = InvalidBuffer;
+				bool		tups_moved = false;
 
 				Assert(!PageIsEmpty(wpage));
 
@@ -737,12 +771,30 @@ _hash_squeezebucket(Relation rel,
 														   LH_OVERFLOW_PAGE,
 														   bstrategy);
 
+				if (nitups > 0)
+				{
+					Assert(nitups == ndeletable);
+
+					/*
+					 * we have to insert tuples on the "write" page, being
+					 * careful to preserve hashkey ordering.  (If we insert
+					 * many tuples into the same "write" page it would be
+					 * worth qsort'ing them).
+					 */
+					_hash_pgaddmultitup(rel, wbuf, itups, itup_offsets, nitups);
+					MarkBufferDirty(wbuf);
+
+					/* Delete tuples we already moved off read page */
+					PageIndexMultiDelete(rpage, deletable, ndeletable);
+					MarkBufferDirty(rbuf);
+
+					tups_moved = true;
+				}
+
 				/*
 				 * release the lock on previous page after acquiring the lock
 				 * on next page
 				 */
-				if (wbuf_dirty)
-					MarkBufferDirty(wbuf);
 				if (retain_pin)
 					LockBuffer(wbuf, BUFFER_LOCK_UNLOCK);
 				else
@@ -751,12 +803,6 @@ _hash_squeezebucket(Relation rel,
 				/* nothing more to do if we reached the read page */
 				if (rblkno == wblkno)
 				{
-					if (ndeletable > 0)
-					{
-						/* Delete tuples we already moved off read page */
-						PageIndexMultiDelete(rpage, deletable, ndeletable);
-						MarkBufferDirty(rbuf);
-					}
 					_hash_relbuf(rel, rbuf);
 					return;
 				}
@@ -765,21 +811,33 @@ _hash_squeezebucket(Relation rel,
 				wpage = BufferGetPage(wbuf);
 				wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
 				Assert(wopaque->hasho_bucket == bucket);
-				wbuf_dirty = false;
 				retain_pin = false;
-			}
 
-			/*
-			 * we have found room so insert on the "write" page, being careful
-			 * to preserve hashkey ordering.  (If we insert many tuples into
-			 * the same "write" page it would be worth qsort'ing instead of
-			 * doing repeated _hash_pgaddtup.)
-			 */
-			(void) _hash_pgaddtup(rel, wbuf, itemsz, itup);
-			wbuf_dirty = true;
+				/* be tidy */
+				for (i = 0; i < nitups; i++)
+					pfree(itups[i]);
+				nitups = 0;
+				all_tups_size = 0;
+				ndeletable = 0;
 
+				/*
+				 * after moving the tuples, rpage would have been compacted,
+				 * so we need to rescan it.
+				 */
+				if (tups_moved)
+					goto readpage;
+			}
 			/* remember tuple for deletion from "read" page */
 			deletable[ndeletable++] = roffnum;
+
+			/*
+			 * we need a copy of index tuples as they can be freed as part of
+			 * overflow page, however we need them to write a WAL record in
+			 * _hash_freeovflpage.
+			 */
+			itups[nitups] = CopyIndexTuple(itup);
+			tups_size[nitups++] = itemsz;
+			all_tups_size += itemsz;
 		}
 
 		/*
@@ -797,10 +855,14 @@ _hash_squeezebucket(Relation rel,
 		Assert(BlockNumberIsValid(rblkno));
 
 		/* free this overflow page (releases rbuf) */
-		_hash_freeovflpage(rel, rbuf, wbuf, bstrategy);
+		_hash_freeovflpage(rel, bucket_buf, rbuf, wbuf, itups, itup_offsets,
+						   tups_size, nitups, bstrategy);
+
+		/* be tidy */
+		for (i = 0; i < nitups; i++)
+			pfree(itups[i]);
 
-		if (wbuf_dirty)
-			MarkBufferDirty(wbuf);
+		pfree(itup_offsets);
 
 		/* are we freeing the page adjacent to wbuf? */
 		if (rblkno == wblkno)
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 9485978..00f3ea8 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -470,7 +470,6 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 void
 _hash_pageinit(Page page, Size size)
 {
-	Assert(PageIsNew(page));
 	PageInit(page, size, sizeof(HashPageOpaqueData));
 }
 
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 6fc5fa4..fdf045a 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -598,6 +598,33 @@ PageGetFreeSpace(Page page)
 }
 
 /*
+ * PageGetFreeSpaceForMultipleTuples
+ *		Returns the size of the free (allocatable) space on a page,
+ *		reduced by the space needed for multiple new line pointers.
+ *
+ * Note: this should usually only be used on index pages.  Use
+ * PageGetHeapFreeSpace on heap pages.
+ */
+Size
+PageGetFreeSpaceForMultipleTuples(Page page, int ntups)
+{
+	int			space;
+
+	/*
+	 * Use signed arithmetic here so that we behave sensibly if pd_lower >
+	 * pd_upper.
+	 */
+	space = (int) ((PageHeader) page)->pd_upper -
+		(int) ((PageHeader) page)->pd_lower;
+
+	if (space < (int) (ntups * sizeof(ItemIdData)))
+		return 0;
+	space -= ntups * sizeof(ItemIdData);
+
+	return (Size) space;
+}
+
+/*
  * PageGetExactFreeSpace
  *		Returns the size of the free (allocatable) space on a page,
  *		without any consideration for adding/removing line pointers.
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 3bf587b..5767deb 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -303,11 +303,14 @@ extern Datum hash_uint32(uint32 k);
 extern void _hash_doinsert(Relation rel, IndexTuple itup);
 extern OffsetNumber _hash_pgaddtup(Relation rel, Buffer buf,
 			   Size itemsize, IndexTuple itup);
+extern void _hash_pgaddmultitup(Relation rel, Buffer buf, IndexTuple *itups,
+					OffsetNumber *itup_offsets, uint16 nitups);
 
 /* hashovfl.c */
 extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin);
-extern BlockNumber _hash_freeovflpage(Relation rel, Buffer ovflbuf, Buffer wbuf,
-				   BufferAccessStrategy bstrategy);
+extern BlockNumber _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
+				   Buffer wbuf, IndexTuple *itups, OffsetNumber *itup_offsets,
+			 Size *tups_size, uint16 nitups, BufferAccessStrategy bstrategy);
 extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
 				 BlockNumber blkno, ForkNumber forkNum);
 extern void _hash_squeezebucket(Relation rel,
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 294f9cb..e956dc3 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -425,6 +425,7 @@ extern Page PageGetTempPageCopySpecial(Page page);
 extern void PageRestoreTempPage(Page tempPage, Page oldPage);
 extern void PageRepairFragmentation(Page page);
 extern Size PageGetFreeSpace(Page page);
+extern Size PageGetFreeSpaceForMultipleTuples(Page page, int ntups);
 extern Size PageGetExactFreeSpace(Page page);
 extern Size PageGetHeapFreeSpace(Page page);
 extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
-- 
1.8.4.msysgit.0

0002-Expose-an-API-hashinitbitmapbuffer.patchapplication/octet-stream; name=0002-Expose-an-API-hashinitbitmapbuffer.patchDownload
From 160b7415080616740565a31e4a2450bb2336dfb6 Mon Sep 17 00:00:00 2001
From: Amit Kapila <amit.kapila@enterprisedb.com>
Date: Thu, 16 Feb 2017 17:57:54 +0530
Subject: [PATCH 2/5] Expose an API hashinitbitmapbuffer.

This API will be used to initialize a bitmap page buffer.
It will be used by future patches to enable WAL logging.
---
 src/backend/access/hash/hashovfl.c | 36 ++++++++++++++++++++++++++++++++++++
 src/include/access/hash.h          |  1 +
 2 files changed, 37 insertions(+)

diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 52491e0..cee0efd 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -620,6 +620,42 @@ _hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
 
 
 /*
+ *	_hash_initbitmapbuffer()
+ *
+ *	 Initialize a new bitmap page.  All bits in the new bitmap page are set to
+ *	 "1", indicating "in use".
+ */
+void
+_hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage)
+{
+	Page		pg;
+	HashPageOpaque op;
+	uint32	   *freep;
+
+	pg = BufferGetPage(buf);
+
+	/* initialize the page */
+	if (initpage)
+		_hash_pageinit(pg, BufferGetPageSize(buf));
+
+	/* initialize the page's special space */
+	op = (HashPageOpaque) PageGetSpecialPointer(pg);
+	op->hasho_prevblkno = InvalidBlockNumber;
+	op->hasho_nextblkno = InvalidBlockNumber;
+	op->hasho_bucket = -1;
+	op->hasho_flag = LH_BITMAP_PAGE;
+	op->hasho_page_id = HASHO_PAGE_ID;
+
+	/* set all of the bits to 1 */
+	freep = HashPageGetBitmap(pg);
+	MemSet(freep, 0xFF, bmsize);
+
+	/* Set pd_lower just past the end of the bitmap page data. */
+	((PageHeader) pg)->pd_lower = ((char *) freep + bmsize) - (char *) pg;
+}
+
+
+/*
  *	_hash_squeezebucket(rel, bucket)
  *
  *	Try to squeeze the tuples onto pages occurring earlier in the
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 5767deb..9c0b79f 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -313,6 +313,7 @@ extern BlockNumber _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovf
 			 Size *tups_size, uint16 nitups, BufferAccessStrategy bstrategy);
 extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
 				 BlockNumber blkno, ForkNumber forkNum);
+extern void _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage);
 extern void _hash_squeezebucket(Relation rel,
 					Bucket bucket, BlockNumber bucket_blkno,
 					Buffer bucket_buf,
-- 
1.8.4.msysgit.0

0003-Restructure-_hash_addovflpage.patchapplication/octet-stream; name=0003-Restructure-_hash_addovflpage.patchDownload
From e318cf607b6df67d397dbda9cacaafe655aeec6b Mon Sep 17 00:00:00 2001
From: Amit Kapila <amit.kapila@enterprisedb.com>
Date: Thu, 16 Feb 2017 18:04:29 +0530
Subject: [PATCH 3/5] Restructure _hash_addovflpage.

The restructuring is done so that the add overflow page
operation can be performed atomically and can be
WAL-logged.
---
 src/backend/access/hash/hashovfl.c | 212 ++++++++++++++++++++-----------------
 1 file changed, 114 insertions(+), 98 deletions(-)

diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index cee0efd..3c2383f 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -21,7 +21,6 @@
 #include "utils/rel.h"
 
 
-static Buffer _hash_getovflpage(Relation rel, Buffer metabuf);
 static uint32 _hash_firstfreebit(uint32 map);
 
 
@@ -113,13 +112,30 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 	Page		ovflpage;
 	HashPageOpaque pageopaque;
 	HashPageOpaque ovflopaque;
-
-	/* allocate and lock an empty overflow page */
-	ovflbuf = _hash_getovflpage(rel, metabuf);
+	HashMetaPage metap;
+	Buffer		mapbuf = InvalidBuffer;
+	Buffer		newmapbuf = InvalidBuffer;
+	BlockNumber blkno;
+	uint32		orig_firstfree;
+	uint32		splitnum;
+	uint32	   *freep = NULL;
+	uint32		max_ovflpg;
+	uint32		bit;
+	uint32		bitmap_page_bit;
+	uint32		first_page;
+	uint32		last_bit;
+	uint32		last_page;
+	uint32		i,
+				j;
+	bool		page_found = false;
 
 	/*
-	 * Write-lock the tail page.  It is okay to hold two buffer locks here
-	 * since there cannot be anyone else contending for access to ovflbuf.
+	 * Write-lock the tail page.  Here, we need to maintain locking order such
+	 * that, first acquire the lock on tail page of bucket, then on meta page
+	 * to find and lock the bitmap page and if it is found, then lock on meta
+	 * page is released, then finally acquire the lock on new overflow buffer.
+	 * We need this locking order to avoid deadlock with backends that are
+	 * doing inserts.
 	 */
 	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
@@ -153,60 +169,6 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 		buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
 	}
 
-	/* now that we have correct backlink, initialize new overflow page */
-	ovflpage = BufferGetPage(ovflbuf);
-	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
-	ovflopaque->hasho_prevblkno = BufferGetBlockNumber(buf);
-	ovflopaque->hasho_nextblkno = InvalidBlockNumber;
-	ovflopaque->hasho_bucket = pageopaque->hasho_bucket;
-	ovflopaque->hasho_flag = LH_OVERFLOW_PAGE;
-	ovflopaque->hasho_page_id = HASHO_PAGE_ID;
-
-	MarkBufferDirty(ovflbuf);
-
-	/* logically chain overflow page to previous page */
-	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
-	MarkBufferDirty(buf);
-	if (retain_pin)
-	{
-		/* pin will be retained only for the primary bucket page */
-		Assert(pageopaque->hasho_flag & LH_BUCKET_PAGE);
-		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-	}
-	else
-		_hash_relbuf(rel, buf);
-
-	return ovflbuf;
-}
-
-/*
- *	_hash_getovflpage()
- *
- *	Find an available overflow page and return it.  The returned buffer
- *	is pinned and write-locked, and has had _hash_pageinit() applied,
- *	but it is caller's responsibility to fill the special space.
- *
- * The caller must hold a pin, but no lock, on the metapage buffer.
- * That buffer is left in the same state at exit.
- */
-static Buffer
-_hash_getovflpage(Relation rel, Buffer metabuf)
-{
-	HashMetaPage metap;
-	Buffer		mapbuf = 0;
-	Buffer		newbuf;
-	BlockNumber blkno;
-	uint32		orig_firstfree;
-	uint32		splitnum;
-	uint32	   *freep = NULL;
-	uint32		max_ovflpg;
-	uint32		bit;
-	uint32		first_page;
-	uint32		last_bit;
-	uint32		last_page;
-	uint32		i,
-				j;
-
 	/* Get exclusive lock on the meta page */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
@@ -255,11 +217,31 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 		for (; bit <= last_inpage; j++, bit += BITS_PER_MAP)
 		{
 			if (freep[j] != ALL_SET)
+			{
+				page_found = true;
+
+				/* Reacquire exclusive lock on the meta page */
+				LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+
+				/* convert bit to bit number within page */
+				bit += _hash_firstfreebit(freep[j]);
+				bitmap_page_bit = bit;
+
+				/* convert bit to absolute bit number */
+				bit += (i << BMPG_SHIFT(metap));
+				/* Calculate address of the recycled overflow page */
+				blkno = bitno_to_blkno(metap, bit);
+
+				/* Fetch and init the recycled page */
+				ovflbuf = _hash_getinitbuf(rel, blkno);
+
 				goto found;
+			}
 		}
 
 		/* No free space here, try to advance to next map page */
 		_hash_relbuf(rel, mapbuf);
+		mapbuf = InvalidBuffer;
 		i++;
 		j = 0;					/* scan from start of next map page */
 		bit = 0;
@@ -283,8 +265,15 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 		 * convenient to pre-mark them as "in use" too.
 		 */
 		bit = metap->hashm_spares[splitnum];
-		_hash_initbitmap(rel, metap, bitno_to_blkno(metap, bit), MAIN_FORKNUM);
-		metap->hashm_spares[splitnum]++;
+		newmapbuf = _hash_getnewbuf(rel, bitno_to_blkno(metap, bit), MAIN_FORKNUM);
+
+		/* add the new bitmap page to the metapage's list of bitmaps */
+		/* metapage already has a write lock */
+		if (metap->hashm_nmaps >= HASH_MAX_BITMAPS)
+			ereport(ERROR,
+					(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+					 errmsg("out of overflow pages in hash index \"%s\"",
+							RelationGetRelationName(rel))));
 	}
 	else
 	{
@@ -295,7 +284,8 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 	}
 
 	/* Calculate address of the new overflow page */
-	bit = metap->hashm_spares[splitnum];
+	bit = BufferIsValid(newmapbuf) ?
+		metap->hashm_spares[splitnum] + 1 : metap->hashm_spares[splitnum];
 	blkno = bitno_to_blkno(metap, bit);
 
 	/*
@@ -303,41 +293,47 @@ _hash_getovflpage(Relation rel, Buffer metabuf)
 	 * relation length stays in sync with ours.  XXX It's annoying to do this
 	 * with metapage write lock held; would be better to use a lock that
 	 * doesn't block incoming searches.
+	 *
+	 * It is okay to hold two buffer locks here (one on tail page of bucket
+	 * and other on new overflow page) since there cannot be anyone else
+	 * contending for access to ovflbuf.
 	 */
-	newbuf = _hash_getnewbuf(rel, blkno, MAIN_FORKNUM);
+	ovflbuf = _hash_getnewbuf(rel, blkno, MAIN_FORKNUM);
 
-	metap->hashm_spares[splitnum]++;
+found:
 
 	/*
-	 * Adjust hashm_firstfree to avoid redundant searches.  But don't risk
-	 * changing it if someone moved it while we were searching bitmap pages.
+	 * Do the update.
 	 */
-	if (metap->hashm_firstfree == orig_firstfree)
-		metap->hashm_firstfree = bit + 1;
-
-	/* Write updated metapage and release lock, but not pin */
-	MarkBufferDirty(metabuf);
-	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
-
-	return newbuf;
-
-found:
-	/* convert bit to bit number within page */
-	bit += _hash_firstfreebit(freep[j]);
+	if (page_found)
+	{
+		Assert(BufferIsValid(mapbuf));
 
-	/* mark page "in use" in the bitmap */
-	SETBIT(freep, bit);
-	MarkBufferDirty(mapbuf);
-	_hash_relbuf(rel, mapbuf);
+		/* mark page "in use" in the bitmap */
+		SETBIT(freep, bitmap_page_bit);
+		MarkBufferDirty(mapbuf);
+	}
+	else
+	{
+		/* update the count to indicate new overflow page is added */
+		metap->hashm_spares[splitnum]++;
 
-	/* Reacquire exclusive lock on the meta page */
-	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+		if (BufferIsValid(newmapbuf))
+		{
+			_hash_initbitmapbuffer(newmapbuf, metap->hashm_bmsize, false);
+			MarkBufferDirty(newmapbuf);
 
-	/* convert bit to absolute bit number */
-	bit += (i << BMPG_SHIFT(metap));
+			metap->hashm_mapp[metap->hashm_nmaps] = BufferGetBlockNumber(newmapbuf);
+			metap->hashm_nmaps++;
+			metap->hashm_spares[splitnum]++;
+			MarkBufferDirty(metabuf);
+		}
 
-	/* Calculate address of the recycled overflow page */
-	blkno = bitno_to_blkno(metap, bit);
+		/*
+		 * for new overflow page, we don't need to explicitly set the bit in
+		 * bitmap page, as by default that will be set to "in use".
+		 */
+	}
 
 	/*
 	 * Adjust hashm_firstfree to avoid redundant searches.  But don't risk
@@ -346,19 +342,39 @@ found:
 	if (metap->hashm_firstfree == orig_firstfree)
 	{
 		metap->hashm_firstfree = bit + 1;
-
-		/* Write updated metapage and release lock, but not pin */
 		MarkBufferDirty(metabuf);
-		LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 	}
+
+	/* now that we have correct backlink, initialize new overflow page */
+	ovflpage = BufferGetPage(ovflbuf);
+	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
+	ovflopaque->hasho_prevblkno = BufferGetBlockNumber(buf);
+	ovflopaque->hasho_nextblkno = InvalidBlockNumber;
+	ovflopaque->hasho_bucket = pageopaque->hasho_bucket;
+	ovflopaque->hasho_flag = LH_OVERFLOW_PAGE;
+	ovflopaque->hasho_page_id = HASHO_PAGE_ID;
+
+	MarkBufferDirty(ovflbuf);
+
+	/* logically chain overflow page to previous page */
+	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
+
+	MarkBufferDirty(buf);
+
+	if (retain_pin)
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 	else
-	{
-		/* We didn't change the metapage, so no need to write */
-		LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
-	}
+		_hash_relbuf(rel, buf);
 
-	/* Fetch, init, and return the recycled page */
-	return _hash_getinitbuf(rel, blkno);
+	if (BufferIsValid(mapbuf))
+		_hash_relbuf(rel, mapbuf);
+
+	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+
+	if (BufferIsValid(newmapbuf))
+		_hash_relbuf(rel, newmapbuf);
+
+	return ovflbuf;
 }
 
 /*
-- 
1.8.4.msysgit.0

0004-Restructure-split-bucket-code.patchapplication/octet-stream; name=0004-Restructure-split-bucket-code.patchDownload
From a0aea2a8989e644f883c87e0146a505cc8bd568b Mon Sep 17 00:00:00 2001
From: Amit Kapila <amit.kapila@enterprisedb.com>
Date: Thu, 16 Feb 2017 18:10:21 +0530
Subject: [PATCH 4/5] Restructure split bucket code.

This restructuring is done so that split operation can be
performed atomically and can be WAL-logged.
---
 src/backend/access/hash/hashpage.c | 162 +++++++++++++++----------------------
 1 file changed, 67 insertions(+), 95 deletions(-)

diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 00f3ea8..fd7344b 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -40,12 +40,9 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
 				  Bucket obucket, Bucket nbucket,
 				  Buffer obuf,
 				  Buffer nbuf,
+				  HTAB *htab,
 				  uint32 maxbucket,
 				  uint32 highmask, uint32 lowmask);
-static void _hash_splitbucket_guts(Relation rel, Buffer metabuf,
-					   Bucket obucket, Bucket nbucket, Buffer obuf,
-					   Buffer nbuf, HTAB *htab, uint32 maxbucket,
-					   uint32 highmask, uint32 lowmask);
 
 
 /*
@@ -497,7 +494,9 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	Buffer		buf_nblkno;
 	Buffer		buf_oblkno;
 	Page		opage;
+	Page		npage;
 	HashPageOpaque oopaque;
+	HashPageOpaque nopaque;
 	uint32		maxbucket;
 	uint32		highmask;
 	uint32		lowmask;
@@ -685,18 +684,18 @@ restart_expand:
 		goto fail;
 	}
 
-
 	/*
-	 * Okay to proceed with split.  Update the metapage bucket mapping info.
-	 *
-	 * Since we are scribbling on the metapage data right in the shared
-	 * buffer, any failure in this next little bit leaves us with a big
+	 * Since we are scribbling on the pages in the shared buffers, establish a
+	 * critical section.  Any failure in this next code leaves us with a big
 	 * problem: the metapage is effectively corrupt but could get written back
 	 * to disk.  We don't really expect any failure, but just to be sure,
 	 * establish a critical section.
 	 */
 	START_CRIT_SECTION();
 
+	/*
+	 * Okay to proceed with split.  Update the metapage bucket mapping info.
+	 */
 	metap->hashm_maxbucket = new_bucket;
 
 	if (new_bucket > metap->hashm_highmask)
@@ -718,8 +717,7 @@ restart_expand:
 		metap->hashm_ovflpoint = spare_ndx;
 	}
 
-	/* Done mucking with metapage */
-	END_CRIT_SECTION();
+	MarkBufferDirty(metabuf);
 
 	/*
 	 * Copy bucket mapping info now; this saves re-accessing the meta page
@@ -732,16 +730,51 @@ restart_expand:
 	highmask = metap->hashm_highmask;
 	lowmask = metap->hashm_lowmask;
 
-	/* Write out the metapage and drop lock, but keep pin */
-	MarkBufferDirty(metabuf);
+	opage = BufferGetPage(buf_oblkno);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	/*
+	 * Mark the old bucket to indicate that split is in progress.  (At
+	 * operation end, we will clear the split-in-progress flag.)  Also,
+	 * for a primary bucket page, hasho_prevblkno stores the number of
+	 * buckets that existed as of the last split, so we must update that
+	 * value here.
+	 */
+	oopaque->hasho_flag |= LH_BUCKET_BEING_SPLIT;
+	oopaque->hasho_prevblkno = maxbucket;
+
+	MarkBufferDirty(buf_oblkno);
+
+	npage = BufferGetPage(buf_nblkno);
+
+	/*
+	 * initialize the new bucket's primary page and mark it to indicate that
+	 * split is in progress.
+	 */
+	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	nopaque->hasho_prevblkno = maxbucket;
+	nopaque->hasho_nextblkno = InvalidBlockNumber;
+	nopaque->hasho_bucket = new_bucket;
+	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_BEING_POPULATED;
+	nopaque->hasho_page_id = HASHO_PAGE_ID;
+
+	MarkBufferDirty(buf_nblkno);
+
+	END_CRIT_SECTION();
+
+	/* drop lock, but keep pin */
 	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 
 	/* Relocate records to the new bucket */
 	_hash_splitbucket(rel, metabuf,
 					  old_bucket, new_bucket,
-					  buf_oblkno, buf_nblkno,
+					  buf_oblkno, buf_nblkno, NULL,
 					  maxbucket, highmask, lowmask);
 
+	/* all done, now release the locks and pins on primary buckets. */
+	_hash_relbuf(rel, buf_oblkno);
+	_hash_relbuf(rel, buf_nblkno);
+
 	return;
 
 	/* Here if decide not to split or fail to acquire old bucket lock */
@@ -803,10 +836,16 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 /*
  * _hash_splitbucket -- split 'obucket' into 'obucket' and 'nbucket'
  *
+ * This routine is used to partition the tuples between old and new bucket and
+ * is used to finish the incomplete split operations.  To finish the previously
+ * interrupted split operation, the caller needs to fill htab.  If htab is set,
+ * then we skip the movement of tuples that exists in htab, otherwise NULL
+ * value of htab indicates movement of all the tuples that belong to the new
+ * bucket.
+ *
  * We are splitting a bucket that consists of a base bucket page and zero
  * or more overflow (bucket chain) pages.  We must relocate tuples that
- * belong in the new bucket, and compress out any free space in the old
- * bucket.
+ * belong in the new bucket.
  *
  * The caller must hold cleanup locks on both buckets to ensure that
  * no one else is trying to access them (see README).
@@ -832,73 +871,11 @@ _hash_splitbucket(Relation rel,
 				  Bucket nbucket,
 				  Buffer obuf,
 				  Buffer nbuf,
+				  HTAB *htab,
 				  uint32 maxbucket,
 				  uint32 highmask,
 				  uint32 lowmask)
 {
-	Page		opage;
-	Page		npage;
-	HashPageOpaque oopaque;
-	HashPageOpaque nopaque;
-
-	opage = BufferGetPage(obuf);
-	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
-
-	/*
-	 * Mark the old bucket to indicate that split is in progress.  (At
-	 * operation end, we will clear the split-in-progress flag.)  Also,
-	 * for a primary bucket page, hasho_prevblkno stores the number of
-	 * buckets that existed as of the last split, so we must update that
-	 * value here.
-	 */
-	oopaque->hasho_flag |= LH_BUCKET_BEING_SPLIT;
-	oopaque->hasho_prevblkno = maxbucket;
-
-	npage = BufferGetPage(nbuf);
-
-	/*
-	 * initialize the new bucket's primary page and mark it to indicate that
-	 * split is in progress.
-	 */
-	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
-	nopaque->hasho_prevblkno = maxbucket;
-	nopaque->hasho_nextblkno = InvalidBlockNumber;
-	nopaque->hasho_bucket = nbucket;
-	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_BEING_POPULATED;
-	nopaque->hasho_page_id = HASHO_PAGE_ID;
-
-	_hash_splitbucket_guts(rel, metabuf, obucket,
-						   nbucket, obuf, nbuf, NULL,
-						   maxbucket, highmask, lowmask);
-
-	/* all done, now release the locks and pins on primary buckets. */
-	_hash_relbuf(rel, obuf);
-	_hash_relbuf(rel, nbuf);
-}
-
-/*
- * _hash_splitbucket_guts -- Helper function to perform the split operation
- *
- * This routine is used to partition the tuples between old and new bucket and
- * to finish incomplete split operations.  To finish the previously
- * interrupted split operation, caller needs to fill htab.  If htab is set, then
- * we skip the movement of tuples that exists in htab, otherwise NULL value of
- * htab indicates movement of all the tuples that belong to new bucket.
- *
- * Caller needs to lock and unlock the old and new primary buckets.
- */
-static void
-_hash_splitbucket_guts(Relation rel,
-					   Buffer metabuf,
-					   Bucket obucket,
-					   Bucket nbucket,
-					   Buffer obuf,
-					   Buffer nbuf,
-					   HTAB *htab,
-					   uint32 maxbucket,
-					   uint32 highmask,
-					   uint32 lowmask)
-{
 	Buffer		bucket_obuf;
 	Buffer		bucket_nbuf;
 	Page		opage;
@@ -985,9 +962,9 @@ _hash_splitbucket_guts(Relation rel,
 
 				if (PageGetFreeSpace(npage) < itemsz)
 				{
-					/* write out nbuf and drop lock, but keep pin */
-					MarkBufferDirty(nbuf);
+					/* drop lock, but keep pin */
 					LockBuffer(nbuf, BUFFER_LOCK_UNLOCK);
+
 					/* chain to a new overflow page */
 					nbuf = _hash_addovflpage(rel, metabuf, nbuf, (nbuf == bucket_nbuf) ? true : false);
 					npage = BufferGetPage(nbuf);
@@ -1025,7 +1002,13 @@ _hash_splitbucket_guts(Relation rel,
 
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
+		{
+			if (nbuf == bucket_nbuf)
+				LockBuffer(nbuf, BUFFER_LOCK_UNLOCK);
+			else
+				_hash_relbuf(rel, nbuf);
 			break;
+		}
 
 		/* Else, advance to next old page */
 		obuf = _hash_getbuf(rel, oblkno, HASH_READ, LH_OVERFLOW_PAGE);
@@ -1041,17 +1024,6 @@ _hash_splitbucket_guts(Relation rel,
 	 * To avoid deadlocks due to locking order of buckets, first lock the old
 	 * bucket and then the new bucket.
 	 */
-	if (nbuf == bucket_nbuf)
-	{
-		MarkBufferDirty(bucket_nbuf);
-		LockBuffer(bucket_nbuf, BUFFER_LOCK_UNLOCK);
-	}
-	else
-	{
-		MarkBufferDirty(nbuf);
-		_hash_relbuf(rel, nbuf);
-	}
-
 	LockBuffer(bucket_obuf, BUFFER_LOCK_EXCLUSIVE);
 	opage = BufferGetPage(bucket_obuf);
 	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
@@ -1192,9 +1164,9 @@ _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Bucket obucket,
 	npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 	nbucket = npageopaque->hasho_bucket;
 
-	_hash_splitbucket_guts(rel, metabuf, obucket,
-						   nbucket, obuf, bucket_nbuf, tidhtab,
-						   maxbucket, highmask, lowmask);
+	_hash_splitbucket(rel, metabuf, obucket,
+					  nbucket, obuf, bucket_nbuf, tidhtab,
+					  maxbucket, highmask, lowmask);
 
 	_hash_relbuf(rel, bucket_nbuf);
 	LockBuffer(obuf, BUFFER_LOCK_UNLOCK);
-- 
1.8.4.msysgit.0

0005-Restructure-hash-index-creation.patch (application/octet-stream)
From 621a67a172fe2eab6fc59f2e388e0379e4837278 Mon Sep 17 00:00:00 2001
From: Amit Kapila <amit.kapila@enterprisedb.com>
Date: Thu, 16 Feb 2017 19:05:12 +0530
Subject: [PATCH 5/5] Restructure hash index creation.

Restructure metapage initialization code so that the operation
can be performed atomically and can be WAL-logged.
---
 src/backend/access/hash/hash.c     |   4 +-
 src/backend/access/hash/hashovfl.c |  62 -----------
 src/backend/access/hash/hashpage.c | 203 +++++++++++++++++++++++++------------
 src/include/access/hash.h          |  10 +-
 4 files changed, 144 insertions(+), 135 deletions(-)

diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 24510e7..1f8a7f6 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -120,7 +120,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	estimate_rel_size(heap, NULL, &relpages, &reltuples, &allvisfrac);
 
 	/* Initialize the hash index metadata page and initial buckets */
-	num_buckets = _hash_metapinit(index, reltuples, MAIN_FORKNUM);
+	num_buckets = _hash_init(index, reltuples, MAIN_FORKNUM);
 
 	/*
 	 * If we just insert the tuples into the index in scan order, then
@@ -182,7 +182,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 void
 hashbuildempty(Relation index)
 {
-	_hash_metapinit(index, 0, INIT_FORKNUM);
+	_hash_init(index, 0, INIT_FORKNUM);
 }
 
 /*
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 3c2383f..d35089c 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -574,68 +574,6 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 
 
 /*
- *	_hash_initbitmap()
- *
- *	 Initialize a new bitmap page.  The metapage has a write-lock upon
- *	 entering the function, and must be written by caller after return.
- *
- * 'blkno' is the block number of the new bitmap page.
- *
- * All bits in the new bitmap page are set to "1", indicating "in use".
- */
-void
-_hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
-				 ForkNumber forkNum)
-{
-	Buffer		buf;
-	Page		pg;
-	HashPageOpaque op;
-	uint32	   *freep;
-
-	/*
-	 * It is okay to write-lock the new bitmap page while holding metapage
-	 * write lock, because no one else could be contending for the new page.
-	 * Also, the metapage lock makes it safe to extend the index using
-	 * _hash_getnewbuf.
-	 *
-	 * There is some loss of concurrency in possibly doing I/O for the new
-	 * page while holding the metapage lock, but this path is taken so seldom
-	 * that it's not worth worrying about.
-	 */
-	buf = _hash_getnewbuf(rel, blkno, forkNum);
-	pg = BufferGetPage(buf);
-
-	/* initialize the page's special space */
-	op = (HashPageOpaque) PageGetSpecialPointer(pg);
-	op->hasho_prevblkno = InvalidBlockNumber;
-	op->hasho_nextblkno = InvalidBlockNumber;
-	op->hasho_bucket = -1;
-	op->hasho_flag = LH_BITMAP_PAGE;
-	op->hasho_page_id = HASHO_PAGE_ID;
-
-	/* set all of the bits to 1 */
-	freep = HashPageGetBitmap(pg);
-	MemSet(freep, 0xFF, BMPGSZ_BYTE(metap));
-
-	/* dirty the new bitmap page, and release write lock and pin */
-	MarkBufferDirty(buf);
-	_hash_relbuf(rel, buf);
-
-	/* add the new bitmap page to the metapage's list of bitmaps */
-	/* metapage already has a write lock */
-	if (metap->hashm_nmaps >= HASH_MAX_BITMAPS)
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("out of overflow pages in hash index \"%s\"",
-						RelationGetRelationName(rel))));
-
-	metap->hashm_mapp[metap->hashm_nmaps] = blkno;
-
-	metap->hashm_nmaps++;
-}
-
-
-/*
  *	_hash_initbitmapbuffer()
  *
  *	 Initialize a new bitmap page.  All bits in the new bitmap page are set to
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index fd7344b..07cb4f2 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -157,6 +157,36 @@ _hash_getinitbuf(Relation rel, BlockNumber blkno)
 }
 
 /*
+ *	_hash_initbuf() -- Get and initialize a buffer by bucket number.
+ */
+void
+_hash_initbuf(Buffer buf, uint32 max_bucket, uint32 num_bucket, uint32 flag,
+			  bool initpage)
+{
+	HashPageOpaque pageopaque;
+	Page		page;
+
+	page = BufferGetPage(buf);
+
+	/* initialize the page */
+	if (initpage)
+		_hash_pageinit(page, BufferGetPageSize(buf));
+
+	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+	/*
+	 * Set hasho_prevblkno with current hashm_maxbucket. This value will
+	 * be used to validate cached HashMetaPageData. See
+	 * _hash_getbucketbuf_from_hashkey().
+	 */
+	pageopaque->hasho_prevblkno = max_bucket;
+	pageopaque->hasho_nextblkno = InvalidBlockNumber;
+	pageopaque->hasho_bucket = num_bucket;
+	pageopaque->hasho_flag = flag;
+	pageopaque->hasho_page_id = HASHO_PAGE_ID;
+}
+
+/*
  *	_hash_getnewbuf() -- Get a new page at the end of the index.
  *
  *		This has the same API as _hash_getinitbuf, except that we are adding
@@ -288,7 +318,7 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 
 
 /*
- *	_hash_metapinit() -- Initialize the metadata page of a hash index,
+ *	_hash_init() -- Initialize the metadata page of a hash index,
  *				the initial buckets, and the initial bitmap page.
  *
  * The initial number of buckets is dependent on num_tuples, an estimate
@@ -300,19 +330,18 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
  * multiple buffer locks is ignored.
  */
 uint32
-_hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
+_hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
 {
-	HashMetaPage metap;
-	HashPageOpaque pageopaque;
 	Buffer		metabuf;
 	Buffer		buf;
+	Buffer		bitmapbuf;
 	Page		pg;
+	HashMetaPage metap;
+	RegProcedure procid;
 	int32		data_width;
 	int32		item_width;
 	int32		ffactor;
-	double		dnumbuckets;
 	uint32		num_buckets;
-	uint32		log2_num_buckets;
 	uint32		i;
 
 	/* safety check */
@@ -334,6 +363,96 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	if (ffactor < 10)
 		ffactor = 10;
 
+	procid = index_getprocid(rel, 1, HASHPROC);
+
+	/*
+	 * We initialize the metapage, the first N bucket pages, and the first
+	 * bitmap page in sequence, using _hash_getnewbuf to cause smgrextend()
+	 * calls to occur.  This ensures that the smgr level has the right idea of
+	 * the physical index length.
+	 *
+	 * Critical section not required, because on error the creation of the
+	 * whole relation will be rolled back.
+	 */
+	metabuf = _hash_getnewbuf(rel, HASH_METAPAGE, forkNum);
+	_hash_init_metabuffer(metabuf, num_tuples, procid, ffactor, false);
+	MarkBufferDirty(metabuf);
+
+	pg = BufferGetPage(metabuf);
+	metap = HashPageGetMeta(pg);
+
+	num_buckets = metap->hashm_maxbucket + 1;
+
+	/*
+	 * Release buffer lock on the metapage while we initialize buckets.
+	 * Otherwise, we'll be in interrupt holdoff and the CHECK_FOR_INTERRUPTS
+	 * won't accomplish anything.  It's a bad idea to hold buffer locks for
+	 * long intervals in any case, since that can block the bgwriter.
+	 */
+	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+
+	/*
+	 * Initialize and WAL Log the first N buckets
+	 */
+	for (i = 0; i < num_buckets; i++)
+	{
+		BlockNumber blkno;
+
+		/* Allow interrupts, in case N is huge */
+		CHECK_FOR_INTERRUPTS();
+
+		blkno = BUCKET_TO_BLKNO(metap, i);
+		buf = _hash_getnewbuf(rel, blkno, forkNum);
+		_hash_initbuf(buf, metap->hashm_maxbucket, i, LH_BUCKET_PAGE, false);
+		MarkBufferDirty(buf);
+		_hash_relbuf(rel, buf);
+	}
+
+	/* Now reacquire buffer lock on metapage */
+	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+
+	/*
+	 * Initialize bitmap page
+	 */
+	bitmapbuf = _hash_getnewbuf(rel, num_buckets + 1, forkNum);
+	_hash_initbitmapbuffer(bitmapbuf, metap->hashm_bmsize, false);
+	MarkBufferDirty(bitmapbuf);
+
+	/* add the new bitmap page to the metapage's list of bitmaps */
+	/* metapage already has a write lock */
+	if (metap->hashm_nmaps >= HASH_MAX_BITMAPS)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+				 errmsg("out of overflow pages in hash index \"%s\"",
+						RelationGetRelationName(rel))));
+
+	metap->hashm_mapp[metap->hashm_nmaps] = num_buckets + 1;
+
+	metap->hashm_nmaps++;
+	MarkBufferDirty(metabuf);
+
+	/* all done */
+	_hash_relbuf(rel, bitmapbuf);
+	_hash_relbuf(rel, metabuf);
+
+	return num_buckets;
+}
+
+/*
+ *	_hash_init_metabuffer() -- Initialize the metadata page of a hash index.
+ */
+void
+_hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
+					  uint16 ffactor, bool initpage)
+{
+	HashMetaPage metap;
+	HashPageOpaque pageopaque;
+	Page		page;
+	double		dnumbuckets;
+	uint32		num_buckets;
+	uint32		log2_num_buckets;
+	uint32		i;
+
 	/*
 	 * Choose the number of initial bucket pages to match the fill factor
 	 * given the estimated number of tuples.  We round up the result to the
@@ -353,30 +472,25 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	Assert(num_buckets == (((uint32) 1) << log2_num_buckets));
 	Assert(log2_num_buckets < HASH_MAX_SPLITPOINTS);
 
-	/*
-	 * We initialize the metapage, the first N bucket pages, and the first
-	 * bitmap page in sequence, using _hash_getnewbuf to cause smgrextend()
-	 * calls to occur.  This ensures that the smgr level has the right idea of
-	 * the physical index length.
-	 */
-	metabuf = _hash_getnewbuf(rel, HASH_METAPAGE, forkNum);
-	pg = BufferGetPage(metabuf);
+	page = BufferGetPage(buf);
+	if (initpage)
+		_hash_pageinit(page, BufferGetPageSize(buf));
 
-	pageopaque = (HashPageOpaque) PageGetSpecialPointer(pg);
+	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	pageopaque->hasho_prevblkno = InvalidBlockNumber;
 	pageopaque->hasho_nextblkno = InvalidBlockNumber;
 	pageopaque->hasho_bucket = -1;
 	pageopaque->hasho_flag = LH_META_PAGE;
 	pageopaque->hasho_page_id = HASHO_PAGE_ID;
 
-	metap = HashPageGetMeta(pg);
+	metap = HashPageGetMeta(page);
 
 	metap->hashm_magic = HASH_MAGIC;
 	metap->hashm_version = HASH_VERSION;
 	metap->hashm_ntuples = 0;
 	metap->hashm_nmaps = 0;
 	metap->hashm_ffactor = ffactor;
-	metap->hashm_bsize = HashGetMaxBitmapSize(pg);
+	metap->hashm_bsize = HashGetMaxBitmapSize(page);
 	/* find largest bitmap array size that will fit in page size */
 	for (i = _hash_log2(metap->hashm_bsize); i > 0; --i)
 	{
@@ -393,7 +507,7 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	 * pretty useless for normal operation (in fact, hashm_procid is not used
 	 * anywhere), but it might be handy for forensic purposes so we keep it.
 	 */
-	metap->hashm_procid = index_getprocid(rel, 1, HASHPROC);
+	metap->hashm_procid = procid;
 
 	/*
 	 * We initialize the index with N buckets, 0 .. N-1, occupying physical
@@ -411,54 +525,9 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	metap->hashm_ovflpoint = log2_num_buckets;
 	metap->hashm_firstfree = 0;
 
-	/*
-	 * Release buffer lock on the metapage while we initialize buckets.
-	 * Otherwise, we'll be in interrupt holdoff and the CHECK_FOR_INTERRUPTS
-	 * won't accomplish anything.  It's a bad idea to hold buffer locks for
-	 * long intervals in any case, since that can block the bgwriter.
-	 */
-	MarkBufferDirty(metabuf);
-	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
-
-	/*
-	 * Initialize the first N buckets
-	 */
-	for (i = 0; i < num_buckets; i++)
-	{
-		/* Allow interrupts, in case N is huge */
-		CHECK_FOR_INTERRUPTS();
-
-		buf = _hash_getnewbuf(rel, BUCKET_TO_BLKNO(metap, i), forkNum);
-		pg = BufferGetPage(buf);
-		pageopaque = (HashPageOpaque) PageGetSpecialPointer(pg);
-
-		/*
-		 * Set hasho_prevblkno with current hashm_maxbucket. This value will
-		 * be used to validate cached HashMetaPageData. See
-		 * _hash_getbucketbuf_from_hashkey().
-		 */
-		pageopaque->hasho_prevblkno = metap->hashm_maxbucket;
-		pageopaque->hasho_nextblkno = InvalidBlockNumber;
-		pageopaque->hasho_bucket = i;
-		pageopaque->hasho_flag = LH_BUCKET_PAGE;
-		pageopaque->hasho_page_id = HASHO_PAGE_ID;
-		MarkBufferDirty(buf);
-		_hash_relbuf(rel, buf);
-	}
-
-	/* Now reacquire buffer lock on metapage */
-	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
-
-	/*
-	 * Initialize first bitmap page
-	 */
-	_hash_initbitmap(rel, metap, num_buckets + 1, forkNum);
-
-	/* all done */
-	MarkBufferDirty(metabuf);
-	_hash_relbuf(rel, metabuf);
-
-	return num_buckets;
+	/* Set pd_lower just past the end of the metadata. */
+	((PageHeader) page)->pd_lower =
+		((char *) metap + sizeof(HashMetaPageData)) - (char *) page;
 }
 
 /*
@@ -535,7 +604,7 @@ restart_expand:
 	 * than a disk block then this would be an independent constraint.
 	 *
 	 * If you change this, see also the maximum initial number of buckets in
-	 * _hash_metapinit().
+	 * _hash_init().
 	 */
 	if (metap->hashm_maxbucket >= (uint32) 0x7FFFFFFE)
 		goto fail;
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 9c0b79f..bfdfed8 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -311,8 +311,6 @@ extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool r
 extern BlockNumber _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 				   Buffer wbuf, IndexTuple *itups, OffsetNumber *itup_offsets,
 			 Size *tups_size, uint16 nitups, BufferAccessStrategy bstrategy);
-extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
-				 BlockNumber blkno, ForkNumber forkNum);
 extern void _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage);
 extern void _hash_squeezebucket(Relation rel,
 					Bucket bucket, BlockNumber bucket_blkno,
@@ -331,6 +329,8 @@ extern Buffer _hash_getbucketbuf_from_hashkey(Relation rel, uint32 hashkey,
 								int access,
 								HashMetaPage *cachedmetap);
 extern Buffer _hash_getinitbuf(Relation rel, BlockNumber blkno);
+extern void _hash_initbuf(Buffer buf, uint32 max_bucket, uint32 num_bucket,
+				uint32 flag, bool initpage);
 extern Buffer _hash_getnewbuf(Relation rel, BlockNumber blkno,
 				ForkNumber forkNum);
 extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
@@ -339,8 +339,10 @@ extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
 extern void _hash_relbuf(Relation rel, Buffer buf);
 extern void _hash_dropbuf(Relation rel, Buffer buf);
 extern void _hash_dropscanbuf(Relation rel, HashScanOpaque so);
-extern uint32 _hash_metapinit(Relation rel, double num_tuples,
-				ForkNumber forkNum);
+extern uint32 _hash_init(Relation rel, double num_tuples,
+		   ForkNumber forkNum);
+extern void _hash_init_metabuffer(Buffer buf, double num_tuples,
+					  RegProcedure procid, uint16 ffactor, bool initpage);
 extern void _hash_pageinit(Page page, Size size);
 extern void _hash_expandtable(Relation rel, Buffer metabuf);
 extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
-- 
1.8.4.msysgit.0

#77Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#76)
1 attachment(s)
Re: Write Ahead Logging for Hash Indexes

On Thu, Feb 16, 2017 at 8:16 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Feb 16, 2017 at 7:15 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Attached are refactoring patches. WAL patch needs some changes based
on above comments, so will post it later.

Attached is a rebased patch to enable WAL logging for hash indexes.
Note that this needs to be applied on top of the refactoring patches [1]/messages/by-id/CAA4eK1JCk-abtsVMMP8xqGZktUH73ZmLaZ6b_+-oCRtRkdqPrQ@mail.gmail.com
sent in the previous e-mail.
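
For anyone who wants the shape of the change without reading the whole diff:
each operation modifies its page(s) inside a critical section and then emits
one WAL record that registers the touched buffers.  Below is a minimal sketch
of that pattern, using the metapage tuple-count update as an example.  It is
for illustration only; the function name is invented here, while the record
type, SizeOfHashUpdateMetaPage and XLOG_HASH_UPDATE_META_PAGE are the ones
defined in the attached patch.

/*
 * Illustrative sketch only (not part of the patch): the general
 * single-page WAL-logging pattern the patch follows, shown for the
 * metapage tuple-count update.  The caller is assumed to hold an
 * exclusive content lock on the metapage buffer.
 */
#include "postgres.h"

#include "access/hash.h"
#include "access/hash_xlog.h"
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

static void
hash_log_meta_ntuples_sketch(Relation rel, Buffer metabuf, double ntuples)
{
	Page		metapage = BufferGetPage(metabuf);
	HashMetaPage metap = HashPageGetMeta(metapage);

	/* no ereport(ERROR) between modifying the page and logging it */
	START_CRIT_SECTION();

	metap->hashm_ntuples = ntuples;
	MarkBufferDirty(metabuf);

	if (RelationNeedsWAL(rel))
	{
		xl_hash_update_meta_page xlrec;
		XLogRecPtr	recptr;

		xlrec.ntuples = metap->hashm_ntuples;

		XLogBeginInsert();
		XLogRegisterData((char *) &xlrec, SizeOfHashUpdateMetaPage);

		/* register the buffer so redo can locate (or restore) the page */
		XLogRegisterBuffer(0, metabuf, REGBUF_STANDARD);

		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_UPDATE_META_PAGE);
		PageSetLSN(metapage, recptr);
	}

	END_CRIT_SECTION();
}

During replay, the corresponding redo routine reapplies the change whenever
the registered block is not restored from a full-page image.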

[1]: /messages/by-id/CAA4eK1JCk-abtsVMMP8xqGZktUH73ZmLaZ6b_+-oCRtRkdqPrQ@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0006-Enable-WAL-for-Hash-Indexes.patch (application/octet-stream)
From 8fa4be92fcc91c40c45d1f6de9dfa70ccdd62f8f Mon Sep 17 00:00:00 2001
From: Amit Kapila <amit.kapila@enterprisedb.com>
Date: Fri, 17 Feb 2017 11:08:29 +0530
Subject: [PATCH] Enable WAL for Hash Indexes.

This will make hash indexes durable, which means that data
for hash indexes can be recovered after a crash.  This will
also allow us to replicate hash indexes.
---
 contrib/pageinspect/expected/hash.out          |   1 -
 contrib/pgstattuple/expected/pgstattuple.out   |   1 -
 doc/src/sgml/backup.sgml                       |  13 -
 doc/src/sgml/config.sgml                       |   7 +-
 doc/src/sgml/high-availability.sgml            |   6 -
 doc/src/sgml/indices.sgml                      |  12 -
 doc/src/sgml/ref/create_index.sgml             |  13 -
 src/backend/access/hash/Makefile               |   2 +-
 src/backend/access/hash/README                 | 144 +++-
 src/backend/access/hash/hash.c                 |  81 +-
 src/backend/access/hash/hash_xlog.c            | 974 +++++++++++++++++++++++++
 src/backend/access/hash/hashinsert.c           |  59 +-
 src/backend/access/hash/hashovfl.c             | 243 +++++-
 src/backend/access/hash/hashpage.c             | 219 +++++-
 src/backend/access/hash/hashsearch.c           |   5 +
 src/backend/access/rmgrdesc/hashdesc.c         | 134 +++-
 src/backend/commands/indexcmds.c               |   5 -
 src/backend/utils/cache/relcache.c             |  12 +-
 src/include/access/hash_xlog.h                 | 232 ++++++
 src/test/regress/expected/create_index.out     |   5 -
 src/test/regress/expected/enum.out             |   1 -
 src/test/regress/expected/hash_index.out       |   2 -
 src/test/regress/expected/macaddr.out          |   1 -
 src/test/regress/expected/replica_identity.out |   1 -
 src/test/regress/expected/uuid.out             |   1 -
 25 files changed, 2046 insertions(+), 128 deletions(-)
 create mode 100644 src/backend/access/hash/hash_xlog.c

diff --git a/contrib/pageinspect/expected/hash.out b/contrib/pageinspect/expected/hash.out
index 8ed60bc..3ba01f6 100644
--- a/contrib/pageinspect/expected/hash.out
+++ b/contrib/pageinspect/expected/hash.out
@@ -1,7 +1,6 @@
 CREATE TABLE test_hash (a int, b text);
 INSERT INTO test_hash VALUES (1, 'one');
 CREATE INDEX test_hash_a_idx ON test_hash USING hash (a);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 \x
 SELECT hash_page_type(get_raw_page('test_hash_a_idx', 0));
 -[ RECORD 1 ]--+---------
diff --git a/contrib/pgstattuple/expected/pgstattuple.out b/contrib/pgstattuple/expected/pgstattuple.out
index 169d193..7d7ade8 100644
--- a/contrib/pgstattuple/expected/pgstattuple.out
+++ b/contrib/pgstattuple/expected/pgstattuple.out
@@ -131,7 +131,6 @@ select * from pgstatginindex('test_ginidx');
 (1 row)
 
 create index test_hashidx on test using hash (b);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 select * from pgstathashindex('test_hashidx');
  version | bucket_pages | overflow_pages | bitmap_pages | zero_pages | live_items | dead_items | free_percent 
 ---------+--------------+----------------+--------------+------------+------------+------------+--------------
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 5f009ee..5a7c730 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1537,19 +1537,6 @@ archive_command = 'local_backup_script.sh "%p" "%f"'
   <itemizedlist>
    <listitem>
     <para>
-     Operations on hash indexes are not presently WAL-logged, so
-     replay will not update these indexes.  This will mean that any new inserts
-     will be ignored by the index, updated rows will apparently disappear and
-     deleted rows will still retain pointers. In other words, if you modify a
-     table with a hash index on it then you will get incorrect query results
-     on a standby server.  When recovery completes it is recommended that you
-     manually <xref linkend="sql-reindex">
-     each such index after completing a recovery operation.
-    </para>
-   </listitem>
-
-   <listitem>
-    <para>
      If a <xref linkend="sql-createdatabase">
      command is executed while a base backup is being taken, and then
      the template database that the <command>CREATE DATABASE</> copied
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 95afc2c..e970671 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2152,10 +2152,9 @@ include_dir 'conf.d'
          has materialized a result set, no error will be generated even if the
          underlying rows in the referenced table have been vacuumed away.
          Some tables cannot safely be vacuumed early, and so will not be
-         affected by this setting.  Examples include system catalogs and any
-         table which has a hash index.  For such tables this setting will
-         neither reduce bloat nor create a possibility of a <literal>snapshot
-         too old</> error on scanning.
+         affected by this setting.  Examples include system catalogs.  For
+         such tables this setting will neither reduce bloat nor create a
+         possibility of a <literal>snapshot too old</> error on scanning.
         </para>
        </listitem>
       </varlistentry>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 48de2ce..0e61991 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2353,12 +2353,6 @@ LOG:  database system is ready to accept read only connections
   <itemizedlist>
    <listitem>
     <para>
-     Operations on hash indexes are not presently WAL-logged, so
-     replay will not update these indexes.
-    </para>
-   </listitem>
-   <listitem>
-    <para>
      Full knowledge of running transactions is required before snapshots
      can be taken. Transactions that use large numbers of subtransactions
      (currently greater than 64) will delay the start of read only
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 271c135..e40750e 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -193,18 +193,6 @@ CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable>
 </synopsis>
   </para>
 
-  <caution>
-   <para>
-    Hash index operations are not presently WAL-logged,
-    so hash indexes might need to be rebuilt with <command>REINDEX</>
-    after a database crash if there were unwritten changes.
-    Also, changes to hash indexes are not replicated over streaming or
-    file-based replication after the initial base backup, so they
-    give wrong answers to queries that subsequently use them.
-    For these reasons, hash index use is presently discouraged.
-   </para>
-  </caution>
-
   <para>
    <indexterm>
     <primary>index</primary>
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index fcb7a60..7163b03 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -510,19 +510,6 @@ Indexes:
    they can be useful.
   </para>
 
-  <caution>
-   <para>
-    Hash index operations are not presently WAL-logged,
-    so hash indexes might need to be rebuilt with <command>REINDEX</>
-    after a database crash if there were unwritten changes.
-    Also, changes to hash indexes are not replicated over streaming or
-    file-based replication after the initial base backup, so they
-    give wrong answers to queries that subsequently use them.
-    Hash indexes are also not properly restored during point-in-time
-    recovery.  For these reasons, hash index use is presently discouraged.
-   </para>
-  </caution>
-
   <para>
    Currently, only the B-tree, GiST, GIN, and BRIN index methods support
    multicolumn indexes. Up to 32 fields can be specified by default.
diff --git a/src/backend/access/hash/Makefile b/src/backend/access/hash/Makefile
index e2e7e91..b154569 100644
--- a/src/backend/access/hash/Makefile
+++ b/src/backend/access/hash/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = hash.o hashfunc.o hashinsert.o hashovfl.o hashpage.o hashsearch.o \
-       hashsort.o hashutil.o hashvalidate.o
+       hashsort.o hashutil.o hashvalidate.o hash_xlog.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 703ae98..1c1f0db 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -287,13 +287,17 @@ The insertion algorithm is rather similar:
 	if current page is full, release lock but not pin, read/exclusive-lock
      next page; repeat as needed
 	>> see below if no space in any page of bucket
+	take buffer content lock in exclusive mode on metapage
 	insert tuple at appropriate place in page
-	mark current page dirty and release buffer content lock and pin
-	if the current page is not a bucket page, release the pin on bucket page
-	pin meta page and take buffer content lock in exclusive mode
+	mark current page dirty
 	increment tuple count, decide if split needed
-	mark meta page dirty and release buffer content lock and pin
-	done if no split needed, else enter Split algorithm below
+	mark meta page dirty
+	write WAL for insertion of tuple
+	release the buffer content lock on metapage
+	release buffer content lock on current page
+	if current page is not a bucket page, release the pin on bucket page
+	if split is needed, enter Split algorithm below
+	release the pin on metapage
 
 To speed searches, the index entries within any individual index page are
 kept sorted by hash code; the insertion code must take care to insert new
@@ -328,12 +332,17 @@ existing bucket in two, thereby lowering the fill ratio:
        try to finish the split and the cleanup work
        if that succeeds, start over; if it fails, give up
 	mark the old and new buckets indicating split is in progress
+	mark both old and new buckets as dirty
+	write WAL for allocation of new page for split
 	copy the tuples that belongs to new bucket from old bucket, marking
      them as moved-by-split
+	write WAL record for moving tuples to the new page once the new page is
+	full or all the pages of the old bucket have been processed
 	release lock but not pin for primary bucket page of old bucket,
 	 read/shared-lock next page; repeat as needed
 	clear the bucket-being-split and bucket-being-populated flags
 	mark the old bucket indicating split-cleanup
+	write WAL for changing the flags on both old and new buckets
 
 The split operation's attempt to acquire cleanup-lock on the old bucket number
 could fail if another process holds any lock or pin on it.  We do not want to
@@ -369,6 +378,8 @@ The fourth operation is garbage collection (bulk deletion):
 		acquire cleanup lock on primary bucket page
 		loop:
 			scan and remove tuples
+			mark the target page dirty
+			write WAL for deleting tuples from target page
 			if this is the last bucket page, break out of loop
 			pin and x-lock next page
 			release prior lock and pin (except keep pin on primary bucket page)
@@ -383,7 +394,8 @@ The fourth operation is garbage collection (bulk deletion):
 	check if number of buckets changed
 	if so, release content lock and pin and return to for-each-bucket loop
 	else update metapage tuple count
-	mark meta page dirty and release buffer content lock and pin
+	 mark meta page dirty and write WAL for update of metapage
+	 release buffer content lock and pin
 
 Note that this is designed to allow concurrent splits and scans.  If a split
 occurs, tuples relocated into the new bucket will be visited twice by the
@@ -425,18 +437,16 @@ Obtaining an overflow page:
 	search for a free page (zero bit in bitmap)
 	if found:
 		set bit in bitmap
-		mark bitmap page dirty and release content lock
+		mark bitmap page dirty
 		take metapage buffer content lock in exclusive mode
 		if first-free-bit value did not change,
 			update it and mark meta page dirty
-		release meta page buffer content lock
-		return page number
 	else (not found):
 	release bitmap page buffer content lock
 	loop back to try next bitmap page, if any
 -- here when we have checked all bitmap pages; we hold meta excl. lock
 	extend index to add another overflow page; update meta information
-	mark meta page dirty and release buffer content lock
+	mark meta page dirty
 	return page number
 
 It is slightly annoying to release and reacquire the metapage lock
@@ -456,12 +466,15 @@ like this:
 
 	-- having determined that no space is free in the target bucket:
 	remember last page of bucket, drop write lock on it
-	call free-page-acquire routine
 	re-write-lock last page of bucket
 	if it is not last anymore, step to the last page
-	update (former) last page to point to new page
+	execute free-page-acquire (Obtaining an overflow page) mechanism described above
+	update (former) last page to point to the new page and mark the buffer dirty
 	write-lock and initialize new page, with back link to former last page
-	write and release former last page
+	write WAL for addition of overflow page
+	release the locks on meta page and bitmap page acquired in free-page-acquire algorithm
+	release the lock on former last page
+	release the lock on new overflow page
 	insert tuple into new page
 	-- etc.
 
@@ -488,12 +501,14 @@ accessors of pages in the bucket.  The algorithm is:
 	determine which bitmap page contains the free space bit for page
 	release meta page buffer content lock
 	pin bitmap page and take buffer content lock in exclusive mode
-	update bitmap bit
-	mark bitmap page dirty and release buffer content lock and pin
-	if page number is less than what we saw as first-free-bit in meta:
 	retake meta page buffer content lock in exclusive mode
+	move (insert) tuples that belong to the overflow page being freed
+	update bitmap bit
+	mark bitmap page dirty
 	if page number is still less than first-free-bit,
 		update first-free-bit field and mark meta page dirty
+	write WAL for delinking overflow page operation
+	release buffer content lock and pin
 	release meta page buffer content lock and pin
 
 We have to do it this way because we must clear the bitmap bit before
@@ -504,8 +519,101 @@ page acquirer will scan more bitmap bits than he needs to.  What must be
 avoided is having first-free-bit greater than the actual first free bit,
 because then that free page would never be found by searchers.
 
-All the freespace operations should be called while holding no buffer
-locks.  Since they need no lmgr locks, deadlock is not possible.
+The reason for moving tuples from the overflow page while delinking the latter
+is to make that an atomic operation.  Not doing so could lead to spurious reads
+on the standby.  Basically, the user might see the same tuple twice.
+
+
+WAL Considerations
+------------------
+
+The hash index operations like create index, insert, delete, bucket split,
+allocate overflow page, and squeeze (the operation that frees overflow pages)
+do not in themselves guarantee hash index consistency after a crash.  To
+provide robustness, we write WAL for each of these operations; that WAL is
+replayed during crash recovery.
+
+Multiple WAL records are written for the create index operation: first one for
+initializing the metapage, followed by one for each new bucket created during
+the operation, followed by one for initializing the bitmap page.  If the
+system crashes after any operation, the whole operation is rolled back.  We
+considered writing a single WAL record for the whole operation, but for that
+we would need to limit the number of initial buckets that can be created
+during the operation.  As we can log only a fixed number of pages,
+XLR_MAX_BLOCK_ID (32), with the current XLog machinery, it is better to write
+multiple WAL records for this operation.  The downside of restricting the
+number of buckets is that we would need to perform a split operation if the
+number of tuples is more than what can be accommodated in the initial set of
+buckets, and it is not unusual to have a large number of tuples during a
+create index operation.
+
+Ordinary item insertions (that don't force a page split or need a new overflow
+page) are single WAL entries.  They touch a single bucket page and the
+metapage; the metapage is updated during replay just as it is during the
+original operation.
+
+An insertion that causes the addition of an overflow page is logged as a
+single WAL entry, preceded by a WAL entry for the new overflow page required
+to insert a tuple.  There is a corner case where, by the time we try to use
+the newly allocated overflow page, it has already been used by concurrent
+insertions; in that case, a new overflow page is allocated and a separate WAL
+entry is written for it.
+
+An insertion that causes a bucket split is logged as a single WAL entry,
+followed by a WAL entry for allocating a new bucket, followed by a WAL entry
+for each overflow bucket page in the new bucket to which the tuples are moved
+from old bucket, followed by a WAL entry to indicate that split is complete
+for both old and new buckets.
+
+A split operation which requires overflow pages to complete the operation
+will need to write a WAL record for each new allocation of an overflow page.
+
+As splitting involves multiple atomic actions, it's possible that the system
+crashes between moving tuples from the bucket pages of the old bucket to the
+new bucket.  In such a case, after recovery, the old and new buckets will be
+marked with the bucket-being-split and bucket-being-populated flags
+respectively, which indicates that a split is in progress for those buckets.
+The reader algorithm works correctly, as it will scan both the old and new
+buckets while the split is in progress, as explained in the reader algorithm
+section above.
+
+We finish the split at the next insert or split operation on the old bucket,
+as explained in the insert and split algorithms above.  It could be done during
+searches, too, but it seems best not to put any extra updates in what would
+otherwise be a read-only operation (updating is not possible in hot standby
+mode anyway).  It would seem natural to complete the split in VACUUM, but since
+splitting a bucket might require allocating a new page, it might fail if you
+run out of disk space.  That would be bad during VACUUM - the reason for
+running VACUUM in the first place might be that you run out of disk space,
+and now VACUUM won't finish because you're out of disk space.  In contrast,
+an insertion can require enlarging the physical file anyway.
+
+Deletion of tuples from a bucket is performed for two reasons: one is to
+remove dead tuples, and the other is to remove tuples that were moved by a
+split.  A WAL entry is made for each bucket page from which tuples are
+removed, followed by a WAL entry to clear the garbage flag if the tuples moved
+by a split are removed.  A separate WAL entry is made for updating the
+metapage if the deletion is performed by vacuum to remove dead tuples.
+
+As deletion involves multiple atomic operations, it is quite possible that the
+system crashes (a) after removing tuples from some of the bucket pages,
+(b) before clearing the garbage flag, or (c) before updating the metapage.  If
+the system crashes before completing (b), it will again try to clean the
+bucket during the next vacuum or insert after recovery, which can have some
+performance impact, but it will work fine.  If the system crashes before
+completing (c), after recovery there could be some additional splits until the
+next vacuum updates the metapage, but the other operations like insert, delete
+and scan will work correctly.  We could fix this problem by actually updating
+the metapage based on the delete operation during replay, but it is not clear
+whether that is worth the complication.
+
+The squeeze operation moves tuples from a page later in the bucket chain to a
+page earlier in the chain, and writes a WAL record when either the page to
+which it is writing tuples becomes full or the page from which it is removing
+tuples becomes empty.
+
+As the squeeze operation involves multiple atomic operations, it is quite
+possible that the system crashes before completing the operation on the
+entire bucket.  After recovery, the operations will work correctly, but
+the index will remain bloated, which can impact performance of read and
+insert operations until the next vacuum squeezes the bucket completely.
 
 
 Other Notes
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 1f8a7f6..7f030f8 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -28,6 +28,7 @@
 #include "utils/builtins.h"
 #include "utils/index_selfuncs.h"
 #include "utils/rel.h"
+#include "miscadmin.h"
 
 
 /* Working state for hashbuild and its callback */
@@ -303,6 +304,11 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		buf = so->hashso_curbuf;
 		Assert(BufferIsValid(buf));
 		page = BufferGetPage(buf);
+
+		/*
+		 * We don't need to test for an old snapshot here, as the current
+		 * buffer is pinned, so vacuum can't clean the page.
+		 */
 		maxoffnum = PageGetMaxOffsetNumber(page);
 		for (offnum = ItemPointerGetOffsetNumber(current);
 			 offnum <= maxoffnum;
@@ -623,6 +629,7 @@ loop_top:
 	}
 
 	/* Okay, we're really done.  Update tuple count in metapage. */
+	START_CRIT_SECTION();
 
 	if (orig_maxbucket == metap->hashm_maxbucket &&
 		orig_ntuples == metap->hashm_ntuples)
@@ -649,6 +656,26 @@ loop_top:
 	}
 
 	MarkBufferDirty(metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_update_meta_page xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.ntuples = metap->hashm_ntuples;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashUpdateMetaPage);
+
+		XLogRegisterBuffer(0, metabuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_UPDATE_META_PAGE);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
 	_hash_relbuf(rel, metabuf);
 
 	/* return statistics */
@@ -816,9 +843,40 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			/* No ereport(ERROR) until changes are logged */
+			START_CRIT_SECTION();
+
 			PageIndexMultiDelete(page, deletable, ndeletable);
 			bucket_dirty = true;
 			MarkBufferDirty(buf);
+
+			/* XLOG stuff */
+			if (RelationNeedsWAL(rel))
+			{
+				xl_hash_delete xlrec;
+				XLogRecPtr	recptr;
+
+				xlrec.is_primary_bucket_page = (buf == bucket_buf) ? true : false;
+
+				XLogBeginInsert();
+				XLogRegisterData((char *) &xlrec, SizeOfHashDelete);
+
+				/*
+				 * bucket buffer needs to be registered to ensure that we can
+				 * acquire a cleanup lock on it during replay.
+				 */
+				if (!xlrec.is_primary_bucket_page)
+					XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+				XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
+				XLogRegisterBufData(1, (char *) deletable,
+									ndeletable * sizeof(OffsetNumber));
+
+				recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_DELETE);
+				PageSetLSN(BufferGetPage(buf), recptr);
+			}
+
+			END_CRIT_SECTION();
 		}
 
 		/* bail out if there are no more pages to scan. */
@@ -866,8 +924,25 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		/* No ereport(ERROR) until changes are logged */
+		START_CRIT_SECTION();
+
 		bucket_opaque->hasho_flag &= ~LH_BUCKET_NEEDS_SPLIT_CLEANUP;
 		MarkBufferDirty(bucket_buf);
+
+		/* XLOG stuff */
+		if (RelationNeedsWAL(rel))
+		{
+			XLogRecPtr	recptr;
+
+			XLogBeginInsert();
+			XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+			recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_CLEANUP);
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
 	}
 
 	/*
@@ -881,9 +956,3 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	else
 		LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
 }
-
-void
-hash_redo(XLogReaderState *record)
-{
-	elog(PANIC, "hash_redo: unimplemented");
-}
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
new file mode 100644
index 0000000..aff9b53
--- /dev/null
+++ b/src/backend/access/hash/hash_xlog.c
@@ -0,0 +1,974 @@
+/*-------------------------------------------------------------------------
+ *
+ * hash_xlog.c
+ *	  WAL replay logic for hash index.
+ *
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/hash/hash_xlog.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/hash.h"
+#include "access/hash_xlog.h"
+#include "access/xlogutils.h"
+
+/*
+ * replay a hash index meta page
+ */
+static void
+hash_xlog_init_meta_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Page		page;
+	Buffer		metabuf;
+
+	xl_hash_init_meta_page *xlrec = (xl_hash_init_meta_page *) XLogRecGetData(record);
+
+	/* create the index's metapage */
+	metabuf = XLogInitBufferForRedo(record, 0);
+	Assert(BufferIsValid(metabuf));
+	_hash_init_metabuffer(metabuf, xlrec->num_tuples, xlrec->procid,
+						  xlrec->ffactor, true);
+	page = (Page) BufferGetPage(metabuf);
+	PageSetLSN(page, lsn);
+	MarkBufferDirty(metabuf);
+	/* all done */
+	UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay a hash index bitmap page
+ */
+static void
+hash_xlog_init_bitmap_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		bitmapbuf;
+	Buffer		metabuf;
+	Page		page;
+	HashMetaPage metap;
+	uint32		num_buckets;
+
+	xl_hash_init_bitmap_page *xlrec = (xl_hash_init_bitmap_page *) XLogRecGetData(record);
+
+	/*
+	 * Initialize bitmap page
+	 */
+	bitmapbuf = XLogInitBufferForRedo(record, 0);
+	_hash_initbitmapbuffer(bitmapbuf, xlrec->bmsize, true);
+	PageSetLSN(BufferGetPage(bitmapbuf), lsn);
+	MarkBufferDirty(bitmapbuf);
+	UnlockReleaseBuffer(bitmapbuf);
+
+	/* add the new bitmap page to the metapage's list of bitmaps */
+	if (XLogReadBufferForRedo(record, 1, &metabuf) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Note: in normal operation, we'd update the metapage while still
+		 * holding lock on the page we inserted into.  But during replay it's
+		 * not necessary to hold that lock, since no other index updates can
+		 * be happening concurrently.
+		 */
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+
+		num_buckets = metap->hashm_maxbucket + 1;
+		metap->hashm_mapp[metap->hashm_nmaps] = num_buckets + 1;
+		metap->hashm_nmaps++;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay a hash index insert without split
+ */
+static void
+hash_xlog_insert(XLogReaderState *record)
+{
+	HashMetaPage metap;
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_insert *xlrec = (xl_hash_insert *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		Size		datalen;
+		char	   *datapos = XLogRecGetBlockData(record, 0, &datalen);
+
+		page = BufferGetPage(buffer);
+
+		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+						false, false) == InvalidOffsetNumber)
+			elog(PANIC, "hash_insert_redo: failed to add item");
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+
+	if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Note: in normal operation, we'd update the metapage while still
+		 * holding lock on the page we inserted into.  But during replay it's
+		 * not necessary to hold that lock, since no other index updates can
+		 * be happening concurrently.
+		 */
+		page = BufferGetPage(buffer);
+		metap = HashPageGetMeta(page);
+		metap->hashm_ntuples += 1;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
+
+/*
+ * replay addition of overflow page for hash index
+ */
+static void
+hash_xlog_addovflpage(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_addovflpage *xlrec = (xl_hash_addovflpage *) XLogRecGetData(record);
+	Buffer		leftbuf;
+	Buffer		ovflbuf;
+	Buffer		metabuf;
+	BlockNumber leftblk;
+	BlockNumber rightblk;
+	BlockNumber newmapblk = InvalidBlockNumber;
+	Page		ovflpage;
+	HashPageOpaque ovflopaque;
+	uint32	   *num_bucket;
+	char	   *data;
+	Size datalen PG_USED_FOR_ASSERTS_ONLY;
+	bool		new_bmpage = false;
+
+	XLogRecGetBlockTag(record, 0, NULL, NULL, &rightblk);
+	XLogRecGetBlockTag(record, 1, NULL, NULL, &leftblk);
+
+	ovflbuf = XLogInitBufferForRedo(record, 0);
+	Assert(BufferIsValid(ovflbuf));
+
+	data = XLogRecGetBlockData(record, 0, &datalen);
+	num_bucket = (uint32 *) data;
+	Assert(datalen == sizeof(uint32));
+	_hash_initbuf(ovflbuf, InvalidBlockNumber, *num_bucket, LH_OVERFLOW_PAGE,
+				  true);
+	/* update backlink */
+	ovflpage = BufferGetPage(ovflbuf);
+	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
+	ovflopaque->hasho_prevblkno = leftblk;
+
+	PageSetLSN(ovflpage, lsn);
+	MarkBufferDirty(ovflbuf);
+
+	if (XLogReadBufferForRedo(record, 1, &leftbuf) == BLK_NEEDS_REDO)
+	{
+		Page		leftpage;
+		HashPageOpaque leftopaque;
+
+		leftpage = BufferGetPage(leftbuf);
+		leftopaque = (HashPageOpaque) PageGetSpecialPointer(leftpage);
+		leftopaque->hasho_nextblkno = rightblk;
+
+		PageSetLSN(leftpage, lsn);
+		MarkBufferDirty(leftbuf);
+	}
+
+	/*
+	 * We need to release the locks once the prev pointer of the overflow
+	 * page and the next pointer of the left page are set, otherwise a
+	 * concurrent read might skip the bucket.
+	 */
+	if (BufferIsValid(leftbuf))
+		UnlockReleaseBuffer(leftbuf);
+	UnlockReleaseBuffer(ovflbuf);
+
+	/*
+	 * Note: in normal operation, we'd update the bitmap and meta page while
+	 * still holding lock on the overflow pages we updated.  But during replay
+	 * it's not necessary to hold those locks, since no other index updates
+	 * can be happening concurrently.
+	 */
+	if (XLogRecHasBlockRef(record, 2))
+	{
+		Buffer		mapbuffer;
+
+		if (XLogReadBufferForRedo(record, 2, &mapbuffer) == BLK_NEEDS_REDO)
+		{
+			Page		mappage = (Page) BufferGetPage(mapbuffer);
+			uint32	   *freep = NULL;
+			char	   *data;
+			uint32	   *bitmap_page_bit;
+
+			freep = HashPageGetBitmap(mappage);
+
+			data = XLogRecGetBlockData(record, 2, &datalen);
+			bitmap_page_bit = (uint32 *) data;
+
+			SETBIT(freep, *bitmap_page_bit);
+
+			PageSetLSN(mappage, lsn);
+			MarkBufferDirty(mapbuffer);
+		}
+		if (BufferIsValid(mapbuffer))
+			UnlockReleaseBuffer(mapbuffer);
+	}
+
+	if (XLogRecHasBlockRef(record, 3))
+	{
+		Buffer		newmapbuf;
+
+		newmapbuf = XLogInitBufferForRedo(record, 3);
+
+		_hash_initbitmapbuffer(newmapbuf, xlrec->bmsize, true);
+
+		new_bmpage = true;
+		newmapblk = BufferGetBlockNumber(newmapbuf);
+
+		MarkBufferDirty(newmapbuf);
+		PageSetLSN(BufferGetPage(newmapbuf), lsn);
+
+		UnlockReleaseBuffer(newmapbuf);
+	}
+
+	if (XLogReadBufferForRedo(record, 4, &metabuf) == BLK_NEEDS_REDO)
+	{
+		HashMetaPage metap;
+		Page		page;
+		uint32	   *firstfree_ovflpage;
+
+		data = XLogRecGetBlockData(record, 4, &datalen);
+		firstfree_ovflpage = (uint32 *) data;
+
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+		metap->hashm_firstfree = *firstfree_ovflpage;
+
+		if (!xlrec->bmpage_found)
+		{
+			metap->hashm_spares[metap->hashm_ovflpoint]++;
+
+			if (new_bmpage)
+			{
+				Assert(BlockNumberIsValid(newmapblk));
+
+				metap->hashm_mapp[metap->hashm_nmaps] = newmapblk;
+				metap->hashm_nmaps++;
+				metap->hashm_spares[metap->hashm_ovflpoint]++;
+			}
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay allocation of page for split operation
+ */
+static void
+hash_xlog_split_allocpage(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_split_allocpage *xlrec = (xl_hash_split_allocpage *) XLogRecGetData(record);
+	Buffer		oldbuf;
+	Buffer		newbuf;
+	Buffer		metabuf;
+	Size datalen PG_USED_FOR_ASSERTS_ONLY;
+	char	   *data;
+	XLogRedoAction action;
+
+	/* replay the record for old bucket */
+	action = XLogReadBufferForRedo(record, 0, &oldbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		oldpage;
+		HashPageOpaque oldopaque;
+
+		oldpage = BufferGetPage(oldbuf);
+		oldopaque = (HashPageOpaque) PageGetSpecialPointer(oldpage);
+
+		oldopaque->hasho_flag = xlrec->old_bucket_flag;
+		oldopaque->hasho_prevblkno = xlrec->new_bucket;
+
+		PageSetLSN(oldpage, lsn);
+		MarkBufferDirty(oldbuf);
+	}
+
+	/*
+	 * There is no harm in releasing the lock on old bucket before new bucket
+	 * during replay as no other bucket splits can be happening concurrently.
+	 */
+	if (BufferIsValid(oldbuf))
+		UnlockReleaseBuffer(oldbuf);
+
+	/* replay the record for new bucket */
+	newbuf = XLogInitBufferForRedo(record, 1);
+	_hash_initbuf(newbuf, xlrec->new_bucket, xlrec->new_bucket,
+				  xlrec->new_bucket_flag, true);
+	MarkBufferDirty(newbuf);
+	PageSetLSN(BufferGetPage(newbuf), lsn);
+
+	UnlockReleaseBuffer(newbuf);
+
+	/*
+	 * Note: in normal operation, we'd update the meta page while still
+	 * holding lock on the old and new bucket pages.  But during replay it's
+	 * not necessary to hold those locks, since no other bucket splits can be
+	 * happening concurrently.
+	 */
+
+	/* replay the record for metapage changes */
+	if (XLogReadBufferForRedo(record, 2, &metabuf) == BLK_NEEDS_REDO)
+	{
+		Page		page;
+		HashMetaPage metap;
+
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+		metap->hashm_maxbucket = xlrec->new_bucket;
+
+		data = XLogRecGetBlockData(record, 2, &datalen);
+
+		if (xlrec->flags & XLH_SPLIT_META_UPDATE_MASKS)
+		{
+			uint32		lowmask;
+			uint32	   *highmask;
+
+			/* extract low and high masks. */
+			memcpy(&lowmask, data, sizeof(uint32));
+			highmask = (uint32 *) ((char *) data + sizeof(uint32));
+
+			/* update metapage */
+			metap->hashm_lowmask = lowmask;
+			metap->hashm_highmask = *highmask;
+
+			data += sizeof(uint32) * 2;
+		}
+
+		if (xlrec->flags & XLH_SPLIT_META_UPDATE_SPLITPOINT)
+		{
+			uint32		ovflpoint;
+			uint32	   *ovflpages;
+
+			/* extract information of overflow pages. */
+			memcpy(&ovflpoint, data, sizeof(uint32));
+			ovflpages = (uint32 *) ((char *) data + sizeof(uint32));
+
+			/* update metapage */
+			metap->hashm_spares[ovflpoint] = *ovflpages;
+			metap->hashm_ovflpoint = ovflpoint;
+		}
+
+		MarkBufferDirty(metabuf);
+		PageSetLSN(BufferGetPage(metabuf), lsn);
+	}
+
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay of split operation
+ */
+static void
+hash_xlog_split_page(XLogReaderState *record)
+{
+	Buffer		buf;
+
+	if (XLogReadBufferForRedo(record, 0, &buf) != BLK_RESTORED)
+		elog(ERROR, "Hash split record did not contain a full-page image");
+
+	UnlockReleaseBuffer(buf);
+}
+
+/*
+ * replay completion of split operation
+ */
+static void
+hash_xlog_split_complete(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_split_complete *xlrec = (xl_hash_split_complete *) XLogRecGetData(record);
+	Buffer		oldbuf;
+	Buffer		newbuf;
+	XLogRedoAction action;
+
+	/* replay the record for old bucket */
+	action = XLogReadBufferForRedo(record, 0, &oldbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		oldpage;
+		HashPageOpaque oldopaque;
+
+		oldpage = BufferGetPage(oldbuf);
+		oldopaque = (HashPageOpaque) PageGetSpecialPointer(oldpage);
+
+		oldopaque->hasho_flag = xlrec->old_bucket_flag;
+
+		PageSetLSN(oldpage, lsn);
+		MarkBufferDirty(oldbuf);
+	}
+	if (BufferIsValid(oldbuf))
+		UnlockReleaseBuffer(oldbuf);
+
+	/* replay the record for new bucket */
+	action = XLogReadBufferForRedo(record, 1, &newbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		newpage;
+		HashPageOpaque nopaque;
+
+		newpage = BufferGetPage(newbuf);
+		nopaque = (HashPageOpaque) PageGetSpecialPointer(newpage);
+
+		nopaque->hasho_flag = xlrec->new_bucket_flag;
+
+		PageSetLSN(newpage, lsn);
+		MarkBufferDirty(newbuf);
+	}
+	if (BufferIsValid(newbuf))
+		UnlockReleaseBuffer(newbuf);
+}
+
+/*
+ * replay move of page contents for squeeze operation of hash index
+ */
+static void
+hash_xlog_move_page_contents(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_move_page_contents *xldata = (xl_hash_move_page_contents *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		writebuf = InvalidBuffer;
+	Buffer		deletebuf = InvalidBuffer;
+	XLogRedoAction action;
+
+	/*
+	 * Ensure to have a cleanup lock on primary bucket page before we start
+	 * with the actual replay operation.  This is to ensure that neither a
+	 * scan can start nor a scan can be already-in-progress during the replay
+	 * of this operation.  If we allow scans during this operation, then they
+	 * can miss some records or show the same record multiple times.
+	 */
+	if (xldata->is_prim_bucket_same_wrt)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &writebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bucketbuf))
+			LockBufferForCleanup(bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &writebuf);
+	}
+
+	/* replay the record for adding entries in overflow buffer */
+	if (action == BLK_NEEDS_REDO)
+	{
+		Page		writepage;
+		char	   *begin;
+		char	   *data;
+		Size		datalen;
+		uint16		ninserted = 0;
+
+		data = begin = XLogRecGetBlockData(record, 1, &datalen);
+
+		writepage = (Page) BufferGetPage(writebuf);
+
+		if (xldata->ntups > 0)
+		{
+			OffsetNumber *towrite = (OffsetNumber *) data;
+
+			data += sizeof(OffsetNumber) * xldata->ntups;
+
+			while (data - begin < datalen)
+			{
+				IndexTuple	itup = (IndexTuple) data;
+				Size		itemsz;
+				OffsetNumber l;
+
+				itemsz = IndexTupleDSize(*itup);
+				itemsz = MAXALIGN(itemsz);
+
+				data += itemsz;
+
+				l = PageAddItem(writepage, (Item) itup, itemsz, towrite[ninserted], false, false);
+				if (l == InvalidOffsetNumber)
+					elog(ERROR, "hash_xlog_move_page_contents: failed to add item to hash index page, size %d bytes",
+						 (int) itemsz);
+
+				ninserted++;
+			}
+		}
+
+		/*
+		 * The number of tuples inserted must be the same as requested in
+		 * the REDO record.
+		 */
+		Assert(ninserted == xldata->ntups);
+
+		PageSetLSN(writepage, lsn);
+		MarkBufferDirty(writebuf);
+	}
+
+	/* replay the record for deleting entries from overflow buffer */
+	if (XLogReadBufferForRedo(record, 2, &deletebuf) == BLK_NEEDS_REDO)
+	{
+		Page		page;
+		char	   *ptr;
+		Size		len;
+
+		ptr = XLogRecGetBlockData(record, 2, &len);
+
+		page = (Page) BufferGetPage(deletebuf);
+
+		if (len > 0)
+		{
+			OffsetNumber *unused;
+			OffsetNumber *unend;
+
+			unused = (OffsetNumber *) ptr;
+			unend = (OffsetNumber *) ((char *) ptr + len);
+
+			if ((unend - unused) > 0)
+				PageIndexMultiDelete(page, unused, unend - unused);
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(deletebuf);
+	}
+
+	/*
+	 * Replay is complete, so now we can release the buffers.  We release
+	 * the locks at the end of the replay operation to ensure that we hold
+	 * the lock on the primary bucket page till the end of the operation.
+	 * We could optimize by releasing the lock on the write buffer as soon
+	 * as the operation on it is complete, if it is not the same as the
+	 * primary bucket page, but that doesn't seem worth complicating the
+	 * code.
+	 */
+	if (BufferIsValid(deletebuf))
+		UnlockReleaseBuffer(deletebuf);
+
+	if (BufferIsValid(writebuf))
+		UnlockReleaseBuffer(writebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay squeeze page operation of hash index
+ */
+static void
+hash_xlog_squeeze_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_squeeze_page *xldata = (xl_hash_squeeze_page *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		writebuf;
+	Buffer		ovflbuf;
+	Buffer		prevbuf = InvalidBuffer;
+	Buffer		mapbuf;
+	XLogRedoAction action;
+
+	/*
+	 * Ensure we have a cleanup lock on the primary bucket page before we
+	 * start the actual replay operation.  This guarantees that a scan can
+	 * neither start nor already be in progress during the replay of this
+	 * operation.  If we allowed scans during this operation, they could
+	 * miss some records or see the same record multiple times.
+	 */
+	if (xldata->is_prim_bucket_same_wrt)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &writebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bucketbuf))
+			LockBufferForCleanup(bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &writebuf);
+	}
+
+	/* replay the record for adding entries in overflow buffer */
+	if (action == BLK_NEEDS_REDO)
+	{
+		Page		writepage;
+		char	   *begin;
+		char	   *data;
+		Size		datalen;
+		uint16		ninserted = 0;
+
+		data = begin = XLogRecGetBlockData(record, 1, &datalen);
+
+		writepage = (Page) BufferGetPage(writebuf);
+
+		if (xldata->ntups > 0)
+		{
+			OffsetNumber *towrite = (OffsetNumber *) data;
+
+			data += sizeof(OffsetNumber) * xldata->ntups;
+
+			while (data - begin < datalen)
+			{
+				IndexTuple	itup = (IndexTuple) data;
+				Size		itemsz;
+				OffsetNumber l;
+
+				itemsz = IndexTupleDSize(*itup);
+				itemsz = MAXALIGN(itemsz);
+
+				data += itemsz;
+
+				l = PageAddItem(writepage, (Item) itup, itemsz, towrite[ninserted], false, false);
+				if (l == InvalidOffsetNumber)
+					elog(ERROR, "hash_xlog_squeeze_page: failed to add item to hash index page, size %d bytes",
+						 (int) itemsz);
+
+				ninserted++;
+			}
+		}
+
+		/*
+		 * The number of tuples inserted must be the same as requested in
+		 * the REDO record.
+		 */
+		Assert(ninserted == xldata->ntups);
+
+		/*
+		 * If the page to which we are adding tuples is the page previous to
+		 * the freed overflow page, then update its nextblkno.
+		 */
+		if (xldata->is_prev_bucket_same_wrt)
+		{
+			HashPageOpaque writeopaque = (HashPageOpaque) PageGetSpecialPointer(writepage);
+
+			writeopaque->hasho_nextblkno = xldata->nextblkno;
+		}
+
+		PageSetLSN(writepage, lsn);
+		MarkBufferDirty(writebuf);
+	}
+
+	/* replay the record for initializing overflow buffer */
+	if (XLogReadBufferForRedo(record, 2, &ovflbuf) == BLK_NEEDS_REDO)
+	{
+		Page		ovflpage;
+
+		ovflpage = BufferGetPage(ovflbuf);
+
+		_hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));
+
+		PageSetLSN(ovflpage, lsn);
+		MarkBufferDirty(ovflbuf);
+	}
+	if (BufferIsValid(ovflbuf))
+		UnlockReleaseBuffer(ovflbuf);
+
+	/* replay the record for page previous to the freed overflow page */
+	if (!xldata->is_prev_bucket_same_wrt &&
+		XLogReadBufferForRedo(record, 3, &prevbuf) == BLK_NEEDS_REDO)
+	{
+		Page		prevpage = BufferGetPage(prevbuf);
+		HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+		prevopaque->hasho_nextblkno = xldata->nextblkno;
+
+		PageSetLSN(prevpage, lsn);
+		MarkBufferDirty(prevbuf);
+	}
+	if (BufferIsValid(prevbuf))
+		UnlockReleaseBuffer(prevbuf);
+
+	/* replay the record for page next to the freed overflow page */
+	if (XLogRecHasBlockRef(record, 4))
+	{
+		Buffer		nextbuf;
+
+		if (XLogReadBufferForRedo(record, 4, &nextbuf) == BLK_NEEDS_REDO)
+		{
+			Page		nextpage = BufferGetPage(nextbuf);
+			HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage);
+
+			nextopaque->hasho_prevblkno = xldata->prevblkno;
+
+			PageSetLSN(nextpage, lsn);
+			MarkBufferDirty(nextbuf);
+		}
+		if (BufferIsValid(nextbuf))
+			UnlockReleaseBuffer(nextbuf);
+	}
+
+	/* replay the record for bitmap page */
+	if (XLogReadBufferForRedo(record, 5, &mapbuf) == BLK_NEEDS_REDO)
+	{
+		Page		mappage = (Page) BufferGetPage(mapbuf);
+		uint32	   *freep = NULL;
+		char	   *data;
+		uint32	   *bitmap_page_bit;
+		Size		datalen;
+
+		freep = HashPageGetBitmap(mappage);
+
+		data = XLogRecGetBlockData(record, 5, &datalen);
+		bitmap_page_bit = (uint32 *) data;
+
+		CLRBIT(freep, *bitmap_page_bit);
+
+		PageSetLSN(mappage, lsn);
+		MarkBufferDirty(mapbuf);
+	}
+	if (BufferIsValid(mapbuf))
+		UnlockReleaseBuffer(mapbuf);
+
+	/* replay the record for meta page */
+	if (XLogRecHasBlockRef(record, 6))
+	{
+		Buffer		metabuf;
+
+		if (XLogReadBufferForRedo(record, 6, &metabuf) == BLK_NEEDS_REDO)
+		{
+			HashMetaPage metap;
+			Page		page;
+			char	   *data;
+			uint32	   *firstfree_ovflpage;
+			Size		datalen;
+
+			data = XLogRecGetBlockData(record, 6, &datalen);
+			firstfree_ovflpage = (uint32 *) data;
+
+			page = BufferGetPage(metabuf);
+			metap = HashPageGetMeta(page);
+			metap->hashm_firstfree = *firstfree_ovflpage;
+
+			PageSetLSN(page, lsn);
+			MarkBufferDirty(metabuf);
+		}
+		if (BufferIsValid(metabuf))
+			UnlockReleaseBuffer(metabuf);
+	}
+
+	/*
+	 * We release the locks on writebuf and bucketbuf at the end of the
+	 * replay operation to ensure that we hold the lock on the primary
+	 * bucket page till the end of the operation.  We could optimize by
+	 * releasing the lock on the write buffer as soon as the operation on it
+	 * is complete, if it is not the same as the primary bucket page, but
+	 * that doesn't seem worth complicating the code.
+	 */
+	if (BufferIsValid(writebuf))
+		UnlockReleaseBuffer(writebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay delete operation of hash index
+ */
+static void
+hash_xlog_delete(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_delete *xldata = (xl_hash_delete *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		deletebuf;
+	Page		page;
+	XLogRedoAction action;
+
+	/*
+	 * Ensure we have a cleanup lock on the primary bucket page before we
+	 * start the actual replay operation.  This guarantees that a scan can
+	 * neither start nor already be in progress during the replay of this
+	 * operation.  If we allowed scans during this operation, they could
+	 * miss some records or see the same record multiple times.
+	 */
+	if (xldata->is_primary_bucket_page)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &deletebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bucketbuf))
+			LockBufferForCleanup(bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &deletebuf);
+	}
+
+	/* replay the record for deleting entries in bucket page */
+	if (action == BLK_NEEDS_REDO)
+	{
+		char	   *ptr;
+		Size		len;
+
+		ptr = XLogRecGetBlockData(record, 1, &len);
+
+		page = (Page) BufferGetPage(deletebuf);
+
+		if (len > 0)
+		{
+			OffsetNumber *unused;
+			OffsetNumber *unend;
+
+			unused = (OffsetNumber *) ptr;
+			unend = (OffsetNumber *) ((char *) ptr + len);
+
+			if ((unend - unused) > 0)
+				PageIndexMultiDelete(page, unused, unend - unused);
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(deletebuf);
+	}
+	if (BufferIsValid(deletebuf))
+		UnlockReleaseBuffer(deletebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay split cleanup flag operation for primary bucket page.
+ */
+static void
+hash_xlog_split_cleanup(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		HashPageOpaque bucket_opaque;
+
+		page = (Page) BufferGetPage(buffer);
+
+		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+		bucket_opaque->hasho_flag &= ~LH_BUCKET_NEEDS_SPLIT_CLEANUP;
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
+
+/*
+ * replay for update meta page
+ */
+static void
+hash_xlog_update_meta_page(XLogReaderState *record)
+{
+	HashMetaPage metap;
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_update_meta_page *xldata = (xl_hash_update_meta_page *) XLogRecGetData(record);
+	Buffer		metabuf;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &metabuf) == BLK_NEEDS_REDO)
+	{
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+
+		metap->hashm_ntuples = xldata->ntuples;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+void
+hash_redo(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			hash_xlog_init_meta_page(record);
+			break;
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			hash_xlog_init_bitmap_page(record);
+			break;
+		case XLOG_HASH_INSERT:
+			hash_xlog_insert(record);
+			break;
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			hash_xlog_addovflpage(record);
+			break;
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			hash_xlog_split_allocpage(record);
+			break;
+		case XLOG_HASH_SPLIT_PAGE:
+			hash_xlog_split_page(record);
+			break;
+		case XLOG_HASH_SPLIT_COMPLETE:
+			hash_xlog_split_complete(record);
+			break;
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			hash_xlog_move_page_contents(record);
+			break;
+		case XLOG_HASH_SQUEEZE_PAGE:
+			hash_xlog_squeeze_page(record);
+			break;
+		case XLOG_HASH_DELETE:
+			hash_xlog_delete(record);
+			break;
+		case XLOG_HASH_SPLIT_CLEANUP:
+			hash_xlog_split_cleanup(record);
+			break;
+		case XLOG_HASH_UPDATE_META_PAGE:
+			hash_xlog_update_meta_page(record);
+			break;
+		default:
+			elog(PANIC, "hash_redo: unknown op code %u", info);
+	}
+}
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 354e733..241728f 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -16,6 +16,8 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "access/hash_xlog.h"
+#include "miscadmin.h"
 #include "utils/rel.h"
 
 
@@ -40,6 +42,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	bool		do_expand;
 	uint32		hashkey;
 	Bucket		bucket;
+	OffsetNumber itup_off;
 
 	/*
 	 * Get the hash key for the item (it's stored in the index tuple itself).
@@ -158,25 +161,20 @@ restart_insert:
 		Assert(pageopaque->hasho_bucket == bucket);
 	}
 
-	/* found page with enough space, so add the item here */
-	(void) _hash_pgaddtup(rel, buf, itemsz, itup);
-
-	/*
-	 * dirty and release the modified page.  if the page we modified was an
-	 * overflow page, we also need to separately drop the pin we retained on
-	 * the primary bucket page.
-	 */
-	MarkBufferDirty(buf);
-	_hash_relbuf(rel, buf);
-	if (buf != bucket_buf)
-		_hash_dropbuf(rel, bucket_buf);
-
 	/*
 	 * Write-lock the metapage so we can increment the tuple count. After
 	 * incrementing it, check to see if it's time for a split.
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Do the update.  No ereport(ERROR) until changes are logged */
+	START_CRIT_SECTION();
+
+	/* found page with enough space, so add the item here */
+	itup_off = _hash_pgaddtup(rel, buf, itemsz, itup);
+	MarkBufferDirty(buf);
+
+	/* metapage operations */
 	metap = HashPageGetMeta(metapage);
 	metap->hashm_ntuples += 1;
 
@@ -184,10 +182,43 @@ restart_insert:
 	do_expand = metap->hashm_ntuples >
 		(double) metap->hashm_ffactor * (metap->hashm_maxbucket + 1);
 
-	/* Write out the metapage and drop lock, but keep pin */
 	MarkBufferDirty(metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_insert xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.offnum = itup_off;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInsert);
+
+		XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);
+
+		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+		XLogRegisterBufData(0, (char *) itup, IndexTupleDSize(*itup));
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INSERT);
+
+		PageSetLSN(BufferGetPage(buf), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	/* drop lock on metapage, but keep pin */
 	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 
+	/*
+	 * Release the modified page, and make sure to also release the pin on
+	 * the primary bucket page.
+	 */
+	_hash_relbuf(rel, buf);
+	if (buf != bucket_buf)
+		_hash_dropbuf(rel, bucket_buf);
+
 	/* Attempt to split if a split is needed */
 	if (do_expand)
 		_hash_expandtable(rel, metabuf);
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index d35089c..12aeac3 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -18,6 +18,8 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "access/hash_xlog.h"
+#include "miscadmin.h"
 #include "utils/rel.h"
 
 
@@ -94,7 +96,9 @@ _hash_ovflblkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
  *	dropped before exiting (we assume the caller is not interested in 'buf'
  *	anymore) if not asked to retain.  The pin will be retained only for the
  *	primary bucket.  The returned overflow page will be pinned and
- *	write-locked; it is guaranteed to be empty.
+ *	write-locked; it is not guaranteed to be empty, as we release and
+ *	reacquire the lock on it, so the caller must ensure that the page has
+ *	the required free space.
  *
  *	The caller must hold a pin, but no lock, on the metapage buffer.
  *	That buffer is returned in the same state.
@@ -136,6 +140,13 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 	 * page is released, then finally acquire the lock on new overflow buffer.
 	 * We need this locking order to avoid deadlock with backends that are
 	 * doing inserts.
+	 *
+	 * Note: We could have avoided locking many buffers here if we made two
+	 * WAL records for acquiring an overflow page (one to allocate an
+	 * overflow page and another to add it to the overflow bucket chain).
+	 * However, doing so can leak an overflow page if the system crashes
+	 * after the allocation.  Needless to say, a single record is also
+	 * better from a performance point of view.
 	 */
 	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
@@ -303,8 +314,12 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 found:
 
 	/*
-	 * Do the update.
+	 * Do the update.  No ereport(ERROR) until changes are logged.  We want
+	 * to log the changes for the bitmap page and the overflow page together
+	 * to avoid losing pages in case the system crashes while the new page
+	 * is being added.
 	 */
+	START_CRIT_SECTION();
+
 	if (page_found)
 	{
 		Assert(BufferIsValid(mapbuf));
@@ -361,6 +376,61 @@ found:
 
 	MarkBufferDirty(buf);
 
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+		xl_hash_addovflpage xlrec;
+
+		xlrec.bmpage_found = page_found;
+		xlrec.bmsize = metap->hashm_bmsize;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashAddOvflPage);
+
+		XLogRegisterBuffer(0, ovflbuf, REGBUF_WILL_INIT);
+		XLogRegisterBufData(0, (char *) &pageopaque->hasho_bucket, sizeof(Bucket));
+
+		XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
+
+		if (BufferIsValid(mapbuf))
+		{
+			/*
+			 * As the bitmap page doesn't have a standard page layout, this
+			 * allows us to log the data.
+			 */
+			XLogRegisterBuffer(2, mapbuf, REGBUF_STANDARD);
+			XLogRegisterBufData(2, (char *) &bitmap_page_bit, sizeof(uint32));
+		}
+
+		if (BufferIsValid(newmapbuf))
+			XLogRegisterBuffer(3, newmapbuf, REGBUF_WILL_INIT);
+
+		/*
+		 * To replay the metapage changes, we could log the entire metapage
+		 * (which doesn't seem advisable considering the size of the hash
+		 * metapage) or log the individual updated values (which seems
+		 * doable), but we prefer to perform the same operations on the
+		 * metapage during replay as are done during the original operation.
+		 * That is straightforward and has the advantage of using less space
+		 * in WAL.
+		 */
+		XLogRegisterBuffer(4, metabuf, REGBUF_STANDARD);
+		XLogRegisterBufData(4, (char *) &metap->hashm_firstfree, sizeof(uint32));
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_ADD_OVFL_PAGE);
+
+		PageSetLSN(BufferGetPage(ovflbuf), recptr);
+		PageSetLSN(BufferGetPage(buf), recptr);
+
+		if (BufferIsValid(mapbuf))
+			PageSetLSN(BufferGetPage(mapbuf), recptr);
+
+		if (BufferIsValid(newmapbuf))
+			PageSetLSN(BufferGetPage(newmapbuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
 	if (retain_pin)
 		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 	else
@@ -374,6 +444,13 @@ found:
 	if (BufferIsValid(newmapbuf))
 		_hash_relbuf(rel, newmapbuf);
 
+	/*
+	 * We need to release and reacquire the lock on the overflow buffer to
+	 * ensure that a standby doesn't see an intermediate state of it.
+	 */
+	LockBuffer(ovflbuf, BUFFER_LOCK_UNLOCK);
+	LockBuffer(ovflbuf, BUFFER_LOCK_EXCLUSIVE);
+
 	return ovflbuf;
 }
 
@@ -407,7 +484,11 @@ _hash_firstfreebit(uint32 map)
  *	Remove this overflow page from its bucket's chain, and mark the page as
  *	free.  On entry, ovflbuf is write-locked; it is released before exiting.
  *
- *	Add the tuples (itups) to wbuf.
+ *	Add the tuples (itups) to wbuf in this function; we could do that in
+ *	the caller as well.  The advantage of doing it here is that we can
+ *	easily write the WAL for the XLOG_HASH_SQUEEZE_PAGE operation.  The
+ *	addition of tuples and the removal of the overflow page have to be done
+ *	as one atomic operation; otherwise, during replay on a standby, users
+ *	might find duplicate records.
  *
  *	Since this function is invoked in VACUUM, we provide an access strategy
  *	parameter that controls fetches of the bucket pages.
@@ -429,8 +510,6 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	HashMetaPage metap;
 	Buffer		metabuf;
 	Buffer		mapbuf;
-	Buffer		prevbuf = InvalidBuffer;
-	Buffer		nextbuf = InvalidBuffer;
 	BlockNumber ovflblkno;
 	BlockNumber prevblkno;
 	BlockNumber blkno;
@@ -444,6 +523,9 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	int32		bitmappage,
 				bitmapbit;
 	Bucket bucket PG_USED_FOR_ASSERTS_ONLY;
+	Buffer		prevbuf = InvalidBuffer;
+	Buffer		nextbuf = InvalidBuffer;
+	bool		update_metap = false;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -508,6 +590,12 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* This operation needs to log multiple tuples, so prepare WAL for that */
+	if (RelationNeedsWAL(rel))
+		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, XLR_NORMAL_RDATAS + nitups);
+
+	START_CRIT_SECTION();
+
 	/*
 	 * we have to insert tuples on the "write" page, being careful to preserve
 	 * hashkey ordering.  (If we insert many tuples into the same "write" page
@@ -519,7 +607,11 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 		MarkBufferDirty(wbuf);
 	}
 
-	/* Initialise the freed overflow page. */
+	/*
+	 * Initialise the freed overflow page.  Here we can't completely zero
+	 * the page, as WAL replay routines expect pages to be initialized.  See
+	 * the explanation of RBM_NORMAL mode atop XLogReadBufferExtended.
+	 */
 	_hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));
 	MarkBufferDirty(ovflbuf);
 
@@ -550,9 +642,83 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	if (ovflbitno < metap->hashm_firstfree)
 	{
 		metap->hashm_firstfree = ovflbitno;
+		update_metap = true;
 		MarkBufferDirty(metabuf);
 	}
 
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_squeeze_page xlrec;
+		XLogRecPtr	recptr;
+		int			i;
+
+		xlrec.prevblkno = prevblkno;
+		xlrec.nextblkno = nextblkno;
+		xlrec.ntups = nitups;
+		xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf) ? true : false;
+		xlrec.is_prev_bucket_same_wrt = (wbuf == prevbuf) ? true : false;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashSqueezePage);
+
+		/*
+		 * bucket buffer needs to be registered to ensure that we can acquire
+		 * a cleanup lock on it during replay.
+		 */
+		if (!xlrec.is_prim_bucket_same_wrt)
+			XLogRegisterBuffer(0, bucketbuf, REGBUF_STANDARD);
+
+		XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
+		if (xlrec.ntups > 0)
+		{
+			XLogRegisterBufData(1, (char *) itup_offsets,
+								nitups * sizeof(OffsetNumber));
+			for (i = 0; i < nitups; i++)
+				XLogRegisterBufData(1, (char *) itups[i], tups_size[i]);
+		}
+
+		XLogRegisterBuffer(2, ovflbuf, REGBUF_STANDARD);
+
+		/*
+		 * If prevpage and the writepage (the block into which we are moving
+		 * tuples from the overflow page) are the same, then there is no
+		 * need to register prevpage separately.  During replay, we can
+		 * directly update the next block number in the writepage.
+		 */
+		if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt)
+			XLogRegisterBuffer(3, prevbuf, REGBUF_STANDARD);
+
+		if (BufferIsValid(nextbuf))
+			XLogRegisterBuffer(4, nextbuf, REGBUF_STANDARD);
+
+		XLogRegisterBuffer(5, mapbuf, REGBUF_STANDARD);
+		XLogRegisterBufData(5, (char *) &bitmapbit, sizeof(uint32));
+
+		if (update_metap)
+		{
+			XLogRegisterBuffer(6, metabuf, REGBUF_STANDARD);
+			XLogRegisterBufData(6, (char *) &metap->hashm_firstfree, sizeof(uint32));
+		}
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SQUEEZE_PAGE);
+
+		PageSetLSN(BufferGetPage(wbuf), recptr);
+		PageSetLSN(BufferGetPage(ovflbuf), recptr);
+
+		if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt)
+			PageSetLSN(BufferGetPage(prevbuf), recptr);
+		if (BufferIsValid(nextbuf))
+			PageSetLSN(BufferGetPage(nextbuf), recptr);
+
+		PageSetLSN(BufferGetPage(mapbuf), recptr);
+
+		if (update_metap)
+			PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
 	/* release previous bucket if it is not same as write bucket */
 	if (BufferIsValid(prevbuf) && prevblkno != writeblkno)
 		_hash_relbuf(rel, prevbuf);
@@ -604,7 +770,11 @@ _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage)
 	freep = HashPageGetBitmap(pg);
 	MemSet(freep, 0xFF, bmsize);
 
-	/* Set pd_lower just past the end of the bitmap page data. */
+	/*
+	 * Set pd_lower just past the end of the bitmap page data.  We could even
+	 * set pd_lower equal to pd_upper, but this is more precise and makes the
+	 * page look compressible to xlog.c.
+	 */
 	((PageHeader) pg)->pd_lower = ((char *) freep + bmsize) - (char *) pg;
 }
 
@@ -766,6 +936,15 @@ readpage:
 					Assert(nitups == ndeletable);
 
 					/*
+					 * This operation needs to log multiple tuples, so
+					 * prepare WAL for that.
+					 */
+					if (RelationNeedsWAL(rel))
+						XLogEnsureRecordSpace(0, XLR_NORMAL_RDATAS + nitups);
+
+					START_CRIT_SECTION();
+
+					/*
 					 * we have to insert tuples on the "write" page, being
 					 * careful to preserve hashkey ordering.  (If we insert
 					 * many tuples into the same "write" page it would be
@@ -778,6 +957,43 @@ readpage:
 					PageIndexMultiDelete(rpage, deletable, ndeletable);
 					MarkBufferDirty(rbuf);
 
+					/* XLOG stuff */
+					if (RelationNeedsWAL(rel))
+					{
+						XLogRecPtr	recptr;
+						xl_hash_move_page_contents xlrec;
+
+						xlrec.ntups = nitups;
+						xlrec.is_prim_bucket_same_wrt = (wbuf == bucket_buf) ? true : false;
+
+						XLogBeginInsert();
+						XLogRegisterData((char *) &xlrec, SizeOfHashMovePageContents);
+
+						/*
+						 * bucket buffer needs to be registered to ensure that
+						 * we can acquire a cleanup lock on it during replay.
+						 */
+						if (!xlrec.is_prim_bucket_same_wrt)
+							XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+						XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
+						XLogRegisterBufData(1, (char *) itup_offsets,
+											nitups * sizeof(OffsetNumber));
+						for (i = 0; i < nitups; i++)
+							XLogRegisterBufData(1, (char *) itups[i], tups_size[i]);
+
+						XLogRegisterBuffer(2, rbuf, REGBUF_STANDARD);
+						XLogRegisterBufData(2, (char *) deletable,
+										  ndeletable * sizeof(OffsetNumber));
+
+						recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_MOVE_PAGE_CONTENTS);
+
+						PageSetLSN(BufferGetPage(wbuf), recptr);
+						PageSetLSN(BufferGetPage(rbuf), recptr);
+					}
+
+					END_CRIT_SECTION();
+
 					tups_moved = true;
 				}
 
@@ -790,13 +1006,24 @@ readpage:
 				else
 					_hash_relbuf(rel, wbuf);
 
+				/*
+				 * We need to release, and if required reacquire, the lock
+				 * on rbuf to ensure that a standby doesn't see an
+				 * intermediate state of it.  If we didn't release the lock,
+				 * then after replay of XLOG_HASH_MOVE_PAGE_CONTENTS on a
+				 * standby, users would be able to see the results of a
+				 * partial deletion on rblkno.
+				 */
+				LockBuffer(rbuf, BUFFER_LOCK_UNLOCK);
+
 				/* nothing more to do if we reached the read page */
 				if (rblkno == wblkno)
 				{
-					_hash_relbuf(rel, rbuf);
+					_hash_dropbuf(rel, rbuf);
 					return;
 				}
 
+				LockBuffer(rbuf, BUFFER_LOCK_EXCLUSIVE);
+
 				wbuf = next_wbuf;
 				wpage = BufferGetPage(wbuf);
 				wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 07cb4f2..bac030a 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -29,6 +29,7 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "access/hash_xlog.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
@@ -43,6 +44,8 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
 				  HTAB *htab,
 				  uint32 maxbucket,
 				  uint32 highmask, uint32 lowmask);
+static void
+			log_split_page(Relation rel, Buffer buf);
 
 
 /*
@@ -381,6 +384,31 @@ _hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
 	pg = BufferGetPage(metabuf);
 	metap = HashPageGetMeta(pg);
 
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_init_meta_page xlrec;
+		XLogRecPtr	recptr;
+
+		/*
+		 * Here, we could have copied the entire metapage into the WAL
+		 * record and restored it as-is during replay.  However, the
+		 * metapage is not small (more than 600 bytes), so we record just
+		 * the information required to construct the metapage.
+		 */
+		xlrec.num_tuples = num_tuples;
+		xlrec.procid = metap->hashm_procid;
+		xlrec.ffactor = metap->hashm_ffactor;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInitMetaPage);
+		XLogRegisterBuffer(0, metabuf, REGBUF_WILL_INIT);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INIT_META_PAGE);
+
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
 	num_buckets = metap->hashm_maxbucket + 1;
 
 	/*
@@ -405,6 +433,12 @@ _hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
 		buf = _hash_getnewbuf(rel, blkno, forkNum);
 		_hash_initbuf(buf, metap->hashm_maxbucket, i, LH_BUCKET_PAGE, false);
 		MarkBufferDirty(buf);
+
+		log_newpage(&rel->rd_node,
+					forkNum,
+					blkno,
+					BufferGetPage(buf),
+					true);
 		_hash_relbuf(rel, buf);
 	}
 
@@ -431,6 +465,32 @@ _hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
 	metap->hashm_nmaps++;
 	MarkBufferDirty(metabuf);
 
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_init_bitmap_page xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.bmsize = metap->hashm_bmsize;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInitBitmapPage);
+		XLogRegisterBuffer(0, bitmapbuf, REGBUF_WILL_INIT);
+
+		/*
+		 * We could log the exact changes made to the meta page; however, as
+		 * no concurrent operation can see the index during the replay of
+		 * this record, we simply perform the same operations during replay
+		 * as are done here.
+		 */
+		XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INIT_BITMAP_PAGE);
+
+		PageSetLSN(BufferGetPage(bitmapbuf), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
 	/* all done */
 	_hash_relbuf(rel, bitmapbuf);
 	_hash_relbuf(rel, metabuf);
@@ -525,7 +585,10 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 	metap->hashm_ovflpoint = log2_num_buckets;
 	metap->hashm_firstfree = 0;
 
-	/* Set pd_lower just past the end of the metadata. */
+	/*
+	 * Set pd_lower just past the end of the metadata.  This is needed so
+	 * that a full-page image of the metapage taken by xloginsert.c includes
+	 * the metadata.
+	 */
 	((PageHeader) page)->pd_lower =
 		((char *) metap + sizeof(HashMetaPageData)) - (char *) page;
 }
@@ -569,6 +632,8 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	uint32		maxbucket;
 	uint32		highmask;
 	uint32		lowmask;
+	bool		metap_update_masks = false;
+	bool		metap_update_splitpoint = false;
 
 restart_expand:
 
@@ -728,7 +793,11 @@ restart_expand:
 		 * The number of buckets in the new splitpoint is equal to the total
 		 * number already in existence, i.e. new_bucket.  Currently this maps
 		 * one-to-one to blocks required, but someday we may need a more
-		 * complicated calculation here.
+		 * complicated calculation here.  We treat the allocation of buckets
+		 * as a separate WAL-logged action.  Even if we fail after this
+		 * operation, it won't leak space on the standby, as the next split
+		 * will consume this space.  In any case, even without a failure we
+		 * don't use all of this space in one split operation.
 		 */
 		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
 		{
@@ -757,8 +826,7 @@ restart_expand:
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
 	 * problem: the metapage is effectively corrupt but could get written back
-	 * to disk.  We don't really expect any failure, but just to be sure,
-	 * establish a critical section.
+	 * to disk.
 	 */
 	START_CRIT_SECTION();
 
@@ -772,6 +840,7 @@ restart_expand:
 		/* Starting a new doubling */
 		metap->hashm_lowmask = metap->hashm_highmask;
 		metap->hashm_highmask = new_bucket | metap->hashm_lowmask;
+		metap_update_masks = true;
 	}
 
 	/*
@@ -784,6 +853,7 @@ restart_expand:
 	{
 		metap->hashm_spares[spare_ndx] = metap->hashm_spares[metap->hashm_ovflpoint];
 		metap->hashm_ovflpoint = spare_ndx;
+		metap_update_splitpoint = true;
 	}
 
 	MarkBufferDirty(metabuf);
@@ -829,11 +899,61 @@ restart_expand:
 
 	MarkBufferDirty(buf_nblkno);
 
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_split_allocpage xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.new_bucket = maxbucket;
+		xlrec.old_bucket_flag = oopaque->hasho_flag;
+		xlrec.new_bucket_flag = nopaque->hasho_flag;
+		xlrec.flags = 0;
+
+		XLogBeginInsert();
+
+		XLogRegisterBuffer(0, buf_oblkno, REGBUF_STANDARD);
+		XLogRegisterBuffer(1, buf_nblkno, REGBUF_WILL_INIT);
+		XLogRegisterBuffer(2, metabuf, REGBUF_STANDARD);
+
+		if (metap_update_masks)
+		{
+			xlrec.flags |= XLH_SPLIT_META_UPDATE_MASKS;
+			XLogRegisterBufData(2, (char *) &metap->hashm_lowmask, sizeof(uint32));
+			XLogRegisterBufData(2, (char *) &metap->hashm_highmask, sizeof(uint32));
+		}
+
+		if (metap_update_splitpoint)
+		{
+			xlrec.flags |= XLH_SPLIT_META_UPDATE_SPLITPOINT;
+			XLogRegisterBufData(2, (char *) &metap->hashm_ovflpoint,
+								sizeof(uint32));
+			XLogRegisterBufData(2,
+					   (char *) &metap->hashm_spares[metap->hashm_ovflpoint],
+								sizeof(uint32));
+		}
+
+		XLogRegisterData((char *) &xlrec, SizeOfHashSplitAllocPage);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_ALLOCATE_PAGE);
+
+		PageSetLSN(BufferGetPage(buf_oblkno), recptr);
+		PageSetLSN(BufferGetPage(buf_nblkno), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
 	END_CRIT_SECTION();
 
 	/* drop lock, but keep pin */
 	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 
+	/*
+	 * We need to release and reacquire the lock on the new bucket's buffer
+	 * to ensure that a standby doesn't see an intermediate state of it.
+	 */
+	LockBuffer(buf_nblkno, BUFFER_LOCK_UNLOCK);
+	LockBuffer(buf_nblkno, BUFFER_LOCK_EXCLUSIVE);
+
 	/* Relocate records to the new bucket */
 	_hash_splitbucket(rel, metabuf,
 					  old_bucket, new_bucket,
@@ -883,6 +1003,7 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 {
 	BlockNumber lastblock;
 	char		zerobuf[BLCKSZ];
+	Page		page;
 
 	lastblock = firstblock + nblocks - 1;
 
@@ -893,7 +1014,21 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 	if (lastblock < firstblock || lastblock == InvalidBlockNumber)
 		return false;
 
-	MemSet(zerobuf, 0, sizeof(zerobuf));
+	page = (Page) zerobuf;
+
+	/*
+	 * Initialise the new bucket page.  Here we can't completely zero the
+	 * page, as WAL replay routines expect pages to be initialized.  See the
+	 * explanation of RBM_NORMAL mode atop XLogReadBufferExtended.
+	 */
+	_hash_pageinit(page, BLCKSZ);
+
+	if (RelationNeedsWAL(rel))
+		log_newpage(&rel->rd_node,
+					MAIN_FORKNUM,
+					lastblock,
+					zerobuf,
+					true);
 
 	RelationOpenSmgr(rel);
 	smgrextend(rel->rd_smgr, MAIN_FORKNUM, lastblock, zerobuf, false);
@@ -1029,8 +1164,11 @@ _hash_splitbucket(Relation rel,
 				itemsz = IndexTupleDSize(*new_itup);
 				itemsz = MAXALIGN(itemsz);
 
-				if (PageGetFreeSpace(npage) < itemsz)
+				while (PageGetFreeSpace(npage) < itemsz)
 				{
+					/* log the split operation before releasing the lock */
+					log_split_page(rel, nbuf);
+
 					/* drop lock, but keep pin */
 					LockBuffer(nbuf, BUFFER_LOCK_UNLOCK);
 
@@ -1041,6 +1179,13 @@ _hash_splitbucket(Relation rel,
 				}
 
 				/*
+				 * Change the shared buffer state in a critical section;
+				 * otherwise, any error could leave the buffer in a state
+				 * that recovery cannot fix.
+				 */
+				START_CRIT_SECTION();
+
+				/*
 				 * Insert tuple on new page, using _hash_pgaddtup to ensure
 				 * correct ordering by hashkey.  This is a tad inefficient
 				 * since we may have to shuffle itempointers repeatedly.
@@ -1049,6 +1194,8 @@ _hash_splitbucket(Relation rel,
 				 */
 				(void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
 
+				END_CRIT_SECTION();
+
 				/* be tidy */
 				pfree(new_itup);
 			}
@@ -1072,6 +1219,9 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			/* log the split operation before releasing the lock */
+			log_split_page(rel, nbuf);
+
 			if (nbuf == bucket_nbuf)
 				LockBuffer(nbuf, BUFFER_LOCK_UNLOCK);
 			else
@@ -1101,6 +1251,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	START_CRIT_SECTION();
+
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
 	nopaque->hasho_flag &= ~LH_BUCKET_BEING_POPULATED;
 
@@ -1117,6 +1269,29 @@ _hash_splitbucket(Relation rel,
 	 */
 	MarkBufferDirty(bucket_obuf);
 	MarkBufferDirty(bucket_nbuf);
+
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+		xl_hash_split_complete xlrec;
+
+		xlrec.old_bucket_flag = oopaque->hasho_flag;
+		xlrec.new_bucket_flag = nopaque->hasho_flag;
+
+		XLogBeginInsert();
+
+		XLogRegisterData((char *) &xlrec, SizeOfHashSplitComplete);
+
+		XLogRegisterBuffer(0, bucket_obuf, REGBUF_STANDARD);
+		XLogRegisterBuffer(1, bucket_nbuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_COMPLETE);
+
+		PageSetLSN(BufferGetPage(bucket_obuf), recptr);
+		PageSetLSN(BufferGetPage(bucket_nbuf), recptr);
+	}
+
+	END_CRIT_SECTION();
 }
 
 /*
@@ -1243,6 +1418,38 @@ _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Bucket obucket,
 }
 
 /*
+ *	log_split_page() -- Log the split operation
+ *
+ *	We log the split operation when the new page in the new bucket gets
+ *	full, so we log the entire page.
+ *
+ *	'buf' must be locked by the caller, which is also responsible for
+ *	unlocking it.
+ */
+static void
+log_split_page(Relation rel, Buffer buf)
+{
+	START_CRIT_SECTION();
+
+	MarkBufferDirty(buf);
+
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+
+		XLogBeginInsert();
+
+		XLogRegisterBuffer(0, buf, REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_PAGE);
+
+		PageSetLSN(BufferGetPage(buf), recptr);
+	}
+
+	END_CRIT_SECTION();
+}
+
+/*
  *	_hash_getcachedmetap() -- Returns cached metapage data.
  *
  *	If metabuf is not InvalidBuffer, caller must hold a pin, but no lock, on
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 9e5d7e4..d733770 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -123,6 +123,7 @@ _hash_readnext(IndexScanDesc scan,
 	if (block_found)
 	{
 		*pagep = BufferGetPage(*bufp);
+		TestForOldSnapshot(scan->xs_snapshot, rel, *pagep);
 		*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
 	}
 }
@@ -168,6 +169,7 @@ _hash_readprev(IndexScanDesc scan,
 		*bufp = _hash_getbuf(rel, blkno, HASH_READ,
 							 LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
 		*pagep = BufferGetPage(*bufp);
+		TestForOldSnapshot(scan->xs_snapshot, rel, *pagep);
 		*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
 
 		/*
@@ -283,6 +285,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 
 	buf = _hash_getbucketbuf_from_hashkey(rel, hashkey, HASH_READ, NULL);
 	page = BufferGetPage(buf);
+	TestForOldSnapshot(scan->xs_snapshot, rel, page);
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	bucket = opaque->hasho_bucket;
 
@@ -318,6 +321,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
 		old_buf = _hash_getbuf(rel, old_blkno, HASH_READ, LH_BUCKET_PAGE);
+		TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(old_buf));
 
 		/*
 		 * remember the split bucket buffer so as to use it later for
@@ -520,6 +524,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					_hash_readprev(scan, &buf, &page, &opaque);
 					if (BufferIsValid(buf))
 					{
+						TestForOldSnapshot(scan->xs_snapshot, rel, page);
 						maxoff = PageGetMaxOffsetNumber(page);
 						offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
 					}
diff --git a/src/backend/access/rmgrdesc/hashdesc.c b/src/backend/access/rmgrdesc/hashdesc.c
index 7eac819..f00ec2d 100644
--- a/src/backend/access/rmgrdesc/hashdesc.c
+++ b/src/backend/access/rmgrdesc/hashdesc.c
@@ -19,10 +19,142 @@
 void
 hash_desc(StringInfo buf, XLogReaderState *record)
 {
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			{
+				xl_hash_init_meta_page *xlrec = (xl_hash_init_meta_page *) rec;
+
+				appendStringInfo(buf, "num_tuples %g, fillfactor %d",
+								 xlrec->num_tuples, xlrec->ffactor);
+				break;
+			}
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			{
+				xl_hash_init_bitmap_page *xlrec = (xl_hash_init_bitmap_page *) rec;
+
+				appendStringInfo(buf, "bmsize %d", xlrec->bmsize);
+				break;
+			}
+		case XLOG_HASH_INSERT:
+			{
+				xl_hash_insert *xlrec = (xl_hash_insert *) rec;
+
+				appendStringInfo(buf, "off %u", xlrec->offnum);
+				break;
+			}
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			{
+				xl_hash_addovflpage *xlrec = (xl_hash_addovflpage *) rec;
+
+				appendStringInfo(buf, "bmsize %d, bmpage_found %c",
+						   xlrec->bmsize, (xlrec->bmpage_found) ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			{
+				xl_hash_split_allocpage *xlrec = (xl_hash_split_allocpage *) rec;
+
+				appendStringInfo(buf, "new_bucket %u, meta_page_masks_updated %c, issplitpoint_changed %c",
+								 xlrec->new_bucket,
+					(xlrec->flags & XLH_SPLIT_META_UPDATE_MASKS) ? 'T' : 'F',
+								 (xlrec->flags & XLH_SPLIT_META_UPDATE_SPLITPOINT) ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SPLIT_COMPLETE:
+			{
+				xl_hash_split_complete *xlrec = (xl_hash_split_complete *) rec;
+
+				appendStringInfo(buf, "old_bucket_flag %u, new_bucket_flag %u",
+								 xlrec->old_bucket_flag, xlrec->new_bucket_flag);
+				break;
+			}
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			{
+				xl_hash_move_page_contents *xlrec = (xl_hash_move_page_contents *) rec;
+
+				appendStringInfo(buf, "ntups %d, is_primary %c",
+								 xlrec->ntups,
+								 xlrec->is_prim_bucket_same_wrt ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SQUEEZE_PAGE:
+			{
+				xl_hash_squeeze_page *xlrec = (xl_hash_squeeze_page *) rec;
+
+				appendStringInfo(buf, "prevblkno %u, nextblkno %u, ntups %d, is_primary %c",
+								 xlrec->prevblkno,
+								 xlrec->nextblkno,
+								 xlrec->ntups,
+								 xlrec->is_prim_bucket_same_wrt ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_DELETE:
+			{
+				xl_hash_delete *xlrec = (xl_hash_delete *) rec;
+
+				appendStringInfo(buf, "is_primary %c",
+								 xlrec->is_primary_bucket_page ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_UPDATE_META_PAGE:
+			{
+				xl_hash_update_meta_page *xlrec = (xl_hash_update_meta_page *) rec;
+
+				appendStringInfo(buf, "ntuples %g",
+								 xlrec->ntuples);
+				break;
+			}
+	}
 }
 
 const char *
 hash_identify(uint8 info)
 {
-	return NULL;
+	const char *id = NULL;
+
+	switch (info & ~XLR_INFO_MASK)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			id = "INIT_META_PAGE";
+			break;
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			id = "INIT_BITMAP_PAGE";
+			break;
+		case XLOG_HASH_INSERT:
+			id = "INSERT";
+			break;
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			id = "ADD_OVFL_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			id = "SPLIT_ALLOCATE_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_PAGE:
+			id = "SPLIT_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_COMPLETE:
+			id = "SPLIT_COMPLETE";
+			break;
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			id = "MOVE_PAGE_CONTENTS";
+			break;
+		case XLOG_HASH_SQUEEZE_PAGE:
+			id = "SQUEEZE_PAGE";
+			break;
+		case XLOG_HASH_DELETE:
+			id = "DELETE";
+			break;
+		case XLOG_HASH_SPLIT_CLEANUP:
+			id = "SPLIT_CLEANUP";
+			break;
+		case XLOG_HASH_UPDATE_META_PAGE:
+			id = "UPDATE_META_PAGE";
+			break;
+	}
+
+	return id;
 }
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 72bb06c..9618032 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -506,11 +506,6 @@ DefineIndex(Oid relationId,
 	accessMethodForm = (Form_pg_am) GETSTRUCT(tuple);
 	amRoutine = GetIndexAmRoutine(accessMethodForm->amhandler);
 
-	if (strcmp(accessMethodName, "hash") == 0 &&
-		RelationNeedsWAL(rel))
-		ereport(WARNING,
-				(errmsg("hash indexes are not WAL-logged and their use is discouraged")));
-
 	if (stmt->unique && !amRoutine->amcanunique)
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 9001e20..ce55fc5 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -5880,13 +5880,10 @@ RelationIdIsInInitFile(Oid relationId)
 /*
  * Tells whether any index for the relation is unlogged.
  *
- * Any index using the hash AM is implicitly unlogged.
- *
  * Note: There doesn't seem to be any way to have an unlogged index attached
- * to a permanent table except to create a hash index, but it seems best to
- * keep this general so that it returns sensible results even when they seem
- * obvious (like for an unlogged table) and to handle possible future unlogged
- * indexes on permanent tables.
+ * to a permanent table, but it seems best to keep this general so that it
+ * returns sensible results even when they seem obvious (like for an unlogged
+ * table) and to handle possible future unlogged indexes on permanent tables.
  */
 bool
 RelationHasUnloggedIndex(Relation rel)
@@ -5908,8 +5905,7 @@ RelationHasUnloggedIndex(Relation rel)
 			elog(ERROR, "cache lookup failed for relation %u", indexoid);
 		reltup = (Form_pg_class) GETSTRUCT(tp);
 
-		if (reltup->relpersistence == RELPERSISTENCE_UNLOGGED
-			|| reltup->relam == HASH_AM_OID)
+		if (reltup->relpersistence == RELPERSISTENCE_UNLOGGED)
 			result = true;
 
 		ReleaseSysCache(tp);
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index cc23163..8400022 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -16,7 +16,239 @@
 
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
+#include "storage/off.h"
 
+/* Number of buffers required for XLOG_HASH_SQUEEZE_PAGE operation */
+#define HASH_XLOG_FREE_OVFL_BUFS	6
+
+/*
+ * XLOG records for hash operations
+ */
+#define XLOG_HASH_INIT_META_PAGE	0x00		/* initialize the meta page */
+#define XLOG_HASH_INIT_BITMAP_PAGE	0x10		/* initialize the bitmap page */
+#define XLOG_HASH_INSERT		0x20	/* add index tuple without split */
+#define XLOG_HASH_ADD_OVFL_PAGE 0x30	/* add overflow page */
+#define XLOG_HASH_SPLIT_ALLOCATE_PAGE	0x40	/* allocate new page for split */
+#define XLOG_HASH_SPLIT_PAGE	0x50	/* split page */
+#define XLOG_HASH_SPLIT_COMPLETE	0x60		/* completion of split
+												 * operation */
+#define XLOG_HASH_MOVE_PAGE_CONTENTS	0x70	/* remove tuples from one page
+												 * and add to another page */
+#define XLOG_HASH_SQUEEZE_PAGE	0x80	/* add tuples to one of the previous
+										 * pages in chain and free the ovfl
+										 * page */
+#define XLOG_HASH_DELETE		0x90	/* delete index tuples from a page */
+#define XLOG_HASH_SPLIT_CLEANUP 0xA0	/* clear split-cleanup flag in primary
+										 * bucket page after deleting tuples
+										 * that are moved due to split	*/
+#define XLOG_HASH_UPDATE_META_PAGE	0xB0		/* update meta page after
+												 * vacuum */
+
+
+/*
+ * xl_hash_split_allocpage flag values, 8 bits are available.
+ */
+#define XLH_SPLIT_META_UPDATE_MASKS		(1<<0)
+#define XLH_SPLIT_META_UPDATE_SPLITPOINT		(1<<1)
+
+/*
+ * This is what we need to know about a HASH index create.
+ *
+ * Backup block 0: metapage
+ */
+typedef struct xl_hash_createidx
+{
+	double		num_tuples;
+	RegProcedure procid;
+	uint16		ffactor;
+}	xl_hash_createidx;
+#define SizeOfHashCreateIdx (offsetof(xl_hash_createidx, ffactor) + sizeof(uint16))
+
+/*
+ * This is what we need to know about simple (without split) insert.
+ *
+ * This data record is used for XLOG_HASH_INSERT
+ *
+ * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 1: metapage (HashMetaPageData)
+ */
+typedef struct xl_hash_insert
+{
+	OffsetNumber offnum;
+}	xl_hash_insert;
+
+#define SizeOfHashInsert	(offsetof(xl_hash_insert, offnum) + sizeof(OffsetNumber))
+
+/*
+ * This is what we need to know about addition of overflow page.
+ *
+ * This data record is used for XLOG_HASH_ADD_OVFL_PAGE
+ *
+ * Backup Blk 0: newly allocated overflow page
+ * Backup Blk 1: page before new overflow page in the bucket chain
+ * Backup Blk 2: bitmap page
+ * Backup Blk 3: new bitmap page
+ * Backup Blk 4: metapage
+ */
+typedef struct xl_hash_addovflpage
+{
+	uint16		bmsize;
+	bool		bmpage_found;
+}	xl_hash_addovflpage;
+
+#define SizeOfHashAddOvflPage	\
+	(offsetof(xl_hash_addovflpage, bmpage_found) + sizeof(bool))
+
+/*
+ * This is what we need to know about allocating a page for split.
+ *
+ * This data record is used for XLOG_HASH_SPLIT_ALLOCATE_PAGE
+ *
+ * Backup Blk 0: page for old bucket
+ * Backup Blk 1: page for new bucket
+ * Backup Blk 2: metapage
+ */
+typedef struct xl_hash_split_allocpage
+{
+	uint32		new_bucket;
+	uint16		old_bucket_flag;
+	uint16		new_bucket_flag;
+	uint8		flags;
+}	xl_hash_split_allocpage;
+
+#define SizeOfHashSplitAllocPage	\
+	(offsetof(xl_hash_split_allocpage, flags) + sizeof(uint8))
+
+/*
+ * This is what we need to know about completing the split operation.
+ *
+ * This data record is used for XLOG_HASH_SPLIT_COMPLETE
+ *
+ * Backup Blk 0: page for old bucket
+ * Backup Blk 1: page for new bucket
+ */
+typedef struct xl_hash_split_complete
+{
+	uint16		old_bucket_flag;
+	uint16		new_bucket_flag;
+}	xl_hash_split_complete;
+
+#define SizeOfHashSplitComplete \
+	(offsetof(xl_hash_split_complete, new_bucket_flag) + sizeof(uint16))
+
+/*
+ * This is what we need to know about move page contents required during
+ * squeeze operation.
+ *
+ * This data record is used for XLOG_HASH_MOVE_PAGE_CONTENTS
+ *
+ * Backup Blk 0: bucket page
+ * Backup Blk 1: page containing moved tuples
+ * Backup Blk 2: page from which tuples will be removed
+ */
+typedef struct xl_hash_move_page_contents
+{
+	uint16		ntups;
+	bool		is_prim_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is same as
+												 * primary bucket page */
+}	xl_hash_move_page_contents;
+
+#define SizeOfHashMovePageContents	\
+	(offsetof(xl_hash_move_page_contents, is_prim_bucket_same_wrt) + sizeof(bool))
+
+/*
+ * This is what we need to know about the squeeze page operation.
+ *
+ * This data record is used for XLOG_HASH_SQUEEZE_PAGE
+ *
+ * Backup Blk 0: page containing tuples moved from freed overflow page
+ * Backup Blk 1: freed overflow page
+ * Backup Blk 2: page previous to the freed overflow page
+ * Backup Blk 3: page next to the freed overflow page
+ * Backup Blk 4: bitmap page containing info of freed overflow page
+ * Backup Blk 5: meta page
+ */
+typedef struct xl_hash_squeeze_page
+{
+	BlockNumber prevblkno;
+	BlockNumber nextblkno;
+	uint16		ntups;
+	bool		is_prim_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is same as
+												 * primary bucket page */
+	bool		is_prev_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is the
+												 * page previous to the freed
+												 * overflow page */
+}	xl_hash_squeeze_page;
+
+#define SizeOfHashSqueezePage	\
+	(offsetof(xl_hash_squeeze_page, is_prev_bucket_same_wrt) + sizeof(bool))
+
+/*
+ * This is what we need to know about the deletion of index tuples from a page.
+ *
+ * This data record is used for XLOG_HASH_DELETE
+ *
+ * Backup Blk 0: primary bucket page
+ * Backup Blk 1: page from which tuples are deleted
+ */
+typedef struct xl_hash_delete
+{
+	bool		is_primary_bucket_page; /* TRUE if the operation is for
+										 * primary bucket page */
+}	xl_hash_delete;
+
+#define SizeOfHashDelete	(offsetof(xl_hash_delete, is_primary_bucket_page) + sizeof(bool))
+
+/*
+ * This is what we need for metapage update operation.
+ *
+ * This data record is used for XLOG_HASH_UPDATE_META_PAGE
+ *
+ * Backup Blk 0: meta page
+ */
+typedef struct xl_hash_update_meta_page
+{
+	double		ntuples;
+}	xl_hash_update_meta_page;
+
+#define SizeOfHashUpdateMetaPage	\
+	(offsetof(xl_hash_update_meta_page, ntuples) + sizeof(double))
+
+/*
+ * This is what we need to initialize metapage.
+ *
+ * This data record is used for XLOG_HASH_INIT_META_PAGE
+ *
+ * Backup Blk 0: meta page
+ */
+typedef struct xl_hash_init_meta_page
+{
+	double		num_tuples;
+	RegProcedure procid;
+	uint16		ffactor;
+}	xl_hash_init_meta_page;
+
+#define SizeOfHashInitMetaPage		\
+	(offsetof(xl_hash_init_meta_page, ffactor) + sizeof(uint16))
+
+/*
+ * This is what we need to initialize bitmap page.
+ *
+ * This data record is used for XLOG_HASH_INIT_BITMAP_PAGE
+ *
+ * Backup Blk 0: bitmap page
+ * Backup Blk 1: meta page
+ */
+typedef struct xl_hash_init_bitmap_page
+{
+	uint16		bmsize;
+}	xl_hash_init_bitmap_page;
+
+#define SizeOfHashInitBitmapPage	\
+	(offsetof(xl_hash_init_bitmap_page, bmsize) + sizeof(uint16))
 
 extern void hash_redo(XLogReaderState *record);
 extern void hash_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index e519fdb..26cd059 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -2335,13 +2335,9 @@ Options: fastupdate=on, gin_pending_list_limit=128
 -- HASH
 --
 CREATE INDEX hash_i4_index ON hash_i4_heap USING hash (random int4_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_name_index ON hash_name_heap USING hash (random name_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_txt_index ON hash_txt_heap USING hash (random text_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_f8_index ON hash_f8_heap USING hash (random float8_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE UNLOGGED TABLE unlogged_hash_table (id int4);
 CREATE INDEX unlogged_hash_index ON unlogged_hash_table USING hash (id int4_ops);
 DROP TABLE unlogged_hash_table;
@@ -2350,7 +2346,6 @@ DROP TABLE unlogged_hash_table;
 -- maintenance_work_mem setting and fillfactor:
 SET maintenance_work_mem = '1MB';
 CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 EXPLAIN (COSTS OFF)
 SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
                       QUERY PLAN                       
diff --git a/src/test/regress/expected/enum.out b/src/test/regress/expected/enum.out
index 514d1d0..0e60304 100644
--- a/src/test/regress/expected/enum.out
+++ b/src/test/regress/expected/enum.out
@@ -383,7 +383,6 @@ DROP INDEX enumtest_btree;
 -- Hash index / opclass with the = operator
 --
 CREATE INDEX enumtest_hash ON enumtest USING hash (col);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 SELECT * FROM enumtest WHERE col = 'orange';
   col   
 --------
diff --git a/src/test/regress/expected/hash_index.out b/src/test/regress/expected/hash_index.out
index f8b9f02..0a18efa 100644
--- a/src/test/regress/expected/hash_index.out
+++ b/src/test/regress/expected/hash_index.out
@@ -201,7 +201,6 @@ SELECT h.seqno AS f20000
 --
 CREATE TABLE hash_split_heap (keycol INT);
 CREATE INDEX hash_split_index on hash_split_heap USING HASH (keycol);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 INSERT INTO hash_split_heap SELECT 1 FROM generate_series(1, 70000) a;
 VACUUM FULL hash_split_heap;
 -- Let's do a backward scan.
@@ -230,5 +229,4 @@ DROP TABLE hash_temp_heap CASCADE;
 CREATE TABLE hash_heap_float4 (x float4, y int);
 INSERT INTO hash_heap_float4 VALUES (1.1,1);
 CREATE INDEX hash_idx ON hash_heap_float4 USING hash (x);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 DROP TABLE hash_heap_float4 CASCADE;
diff --git a/src/test/regress/expected/macaddr.out b/src/test/regress/expected/macaddr.out
index e84ff5f..151f9ce 100644
--- a/src/test/regress/expected/macaddr.out
+++ b/src/test/regress/expected/macaddr.out
@@ -41,7 +41,6 @@ SELECT * FROM macaddr_data;
 
 CREATE INDEX macaddr_data_btree ON macaddr_data USING btree (b);
 CREATE INDEX macaddr_data_hash ON macaddr_data USING hash (b);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 SELECT a, b, trunc(b) FROM macaddr_data ORDER BY 2, 1;
  a  |         b         |       trunc       
 ----+-------------------+-------------------
diff --git a/src/test/regress/expected/replica_identity.out b/src/test/regress/expected/replica_identity.out
index fa63235..67c34a9 100644
--- a/src/test/regress/expected/replica_identity.out
+++ b/src/test/regress/expected/replica_identity.out
@@ -12,7 +12,6 @@ CREATE UNIQUE INDEX test_replica_identity_keyab_key ON test_replica_identity (ke
 CREATE UNIQUE INDEX test_replica_identity_oid_idx ON test_replica_identity (oid);
 CREATE UNIQUE INDEX test_replica_identity_nonkey ON test_replica_identity (keya, nonkey);
 CREATE INDEX test_replica_identity_hash ON test_replica_identity USING hash (nonkey);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE UNIQUE INDEX test_replica_identity_expr ON test_replica_identity (keya, keyb, (3));
 CREATE UNIQUE INDEX test_replica_identity_partial ON test_replica_identity (keya, keyb) WHERE keyb != '3';
 -- default is 'd'/DEFAULT for user created tables
diff --git a/src/test/regress/expected/uuid.out b/src/test/regress/expected/uuid.out
index 423f277..db66dc7 100644
--- a/src/test/regress/expected/uuid.out
+++ b/src/test/regress/expected/uuid.out
@@ -114,7 +114,6 @@ SELECT COUNT(*) FROM guid1 WHERE guid_field >= '22222222-2222-2222-2222-22222222
 -- btree and hash index creation test
 CREATE INDEX guid1_btree ON guid1 USING BTREE (guid_field);
 CREATE INDEX guid1_hash  ON guid1 USING HASH  (guid_field);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 -- unique index test
 CREATE UNIQUE INDEX guid1_unique_BTREE ON guid1 USING BTREE (guid_field);
 -- should fail
-- 
1.8.4.msysgit.0

#78Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#76)
Re: Write Ahead Logging for Hash Indexes

On Thu, Feb 16, 2017 at 8:16 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Attached are refactoring patches. WAL patch needs some changes based
on above comments, so will post it later.

After some study, I have committed 0001, and also committed 0002 and
0003 as a single commit, with only minor tweaks.

I haven't made a full study of 0004 yet, but I'm a bit concerned about
a couple of the hunks, specifically this one:

@@ -985,9 +962,9 @@ _hash_splitbucket_guts(Relation rel,

                 if (PageGetFreeSpace(npage) < itemsz)
                 {
-                    /* write out nbuf and drop lock, but keep pin */
-                    MarkBufferDirty(nbuf);
+                    /* drop lock, but keep pin */
                     LockBuffer(nbuf, BUFFER_LOCK_UNLOCK);
+
                     /* chain to a new overflow page */
                     nbuf = _hash_addovflpage(rel, metabuf, nbuf,
(nbuf == bucket_nbuf) ? true : false);
                     npage = BufferGetPage(nbuf);

And also this one:

@@ -1041,17 +1024,6 @@ _hash_splitbucket_guts(Relation rel,
      * To avoid deadlocks due to locking order of buckets, first lock the old
      * bucket and then the new bucket.
      */
-    if (nbuf == bucket_nbuf)
-    {
-        MarkBufferDirty(bucket_nbuf);
-        LockBuffer(bucket_nbuf, BUFFER_LOCK_UNLOCK);
-    }
-    else
-    {
-        MarkBufferDirty(nbuf);
-        _hash_relbuf(rel, nbuf);
-    }
-
     LockBuffer(bucket_obuf, BUFFER_LOCK_EXCLUSIVE);
     opage = BufferGetPage(bucket_obuf);
     oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);

I haven't quite grasped what's going on here, but it looks like those
MarkBufferDirty() calls didn't move someplace else, but rather just
vanished. That would seem to be not good.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#79Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#78)
3 attachment(s)
Re: Write Ahead Logging for Hash Indexes

On Mon, Feb 27, 2017 at 11:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Feb 16, 2017 at 8:16 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Attached are refactoring patches. WAL patch needs some changes based
on above comments, so will post it later.

After some study, I have committed 0001, and also committed 0002 and
0003 as a single commit, with only minor tweaks.

I haven't made a full study of 0004 yet, but I'm a bit concerned about
a couple of the hunks, specifically this one:

@@ -985,9 +962,9 @@ _hash_splitbucket_guts(Relation rel,

if (PageGetFreeSpace(npage) < itemsz)
{
-                    /* write out nbuf and drop lock, but keep pin */
-                    MarkBufferDirty(nbuf);
+                    /* drop lock, but keep pin */
LockBuffer(nbuf, BUFFER_LOCK_UNLOCK);
+
/* chain to a new overflow page */
nbuf = _hash_addovflpage(rel, metabuf, nbuf,
(nbuf == bucket_nbuf) ? true : false);
npage = BufferGetPage(nbuf);

And also this one:

@@ -1041,17 +1024,6 @@ _hash_splitbucket_guts(Relation rel,
* To avoid deadlocks due to locking order of buckets, first lock the old
* bucket and then the new bucket.
*/
-    if (nbuf == bucket_nbuf)
-    {
-        MarkBufferDirty(bucket_nbuf);
-        LockBuffer(bucket_nbuf, BUFFER_LOCK_UNLOCK);
-    }
-    else
-    {
-        MarkBufferDirty(nbuf);
-        _hash_relbuf(rel, nbuf);
-    }
-
LockBuffer(bucket_obuf, BUFFER_LOCK_EXCLUSIVE);
opage = BufferGetPage(bucket_obuf);
oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);

I haven't quite grasped what's going on here, but it looks like those
MarkBufferDirty() calls didn't move someplace else, but rather just
vanished. That would seem to be not good.

Yeah, those calls were actually added back later in the Enable-WAL-for-Hash*
patch, but since this patch is standalone, we should not remove them from
their existing usage.  I have added them back and rebased the remaining
patches.
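
To make the pattern concrete, here is a minimal sketch of the discipline
those restored calls enforce (an illustration only, not the committed code;
the helper name finish_split_page is made up for this example):

#include "postgres.h"

#include "access/hash.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/*
 * Sketch only: a page modified in shared buffers must be marked dirty while
 * we still hold its exclusive content lock; only afterwards do we drop the
 * lock (keeping the pin if the caller still needs the page) or release the
 * lock and pin together.
 */
static void
finish_split_page(Relation rel, Buffer nbuf, Buffer bucket_nbuf)
{
	MarkBufferDirty(nbuf);		/* must happen before the lock is dropped */

	if (nbuf == bucket_nbuf)
		LockBuffer(nbuf, BUFFER_LOCK_UNLOCK);	/* drop lock, keep pin */
	else
		_hash_relbuf(rel, nbuf);	/* drop lock and pin */
}

This mirrors the hunk near the end of the relocation loop in 0001, where the
buffer is dirtied before being unlocked or released.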

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0001-Restructure-split-bucket-code.patchapplication/octet-stream; name=0001-Restructure-split-bucket-code.patchDownload
From 23b61146a4fa1d54073af95657cb5a36a4fc6eb5 Mon Sep 17 00:00:00 2001
From: Amit Kapila <amit.kapila@enterprisedb.com>
Date: Tue, 28 Feb 2017 18:32:56 +0530
Subject: [PATCH 1/3] Restructure split bucket code.

This restructuring is done so that the split operation can be
performed atomically and can be WAL-logged.
---
 src/backend/access/hash/hashpage.c | 160 ++++++++++++++++---------------------
 1 file changed, 67 insertions(+), 93 deletions(-)

diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 00f3ea8..bb1ce75 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -40,12 +40,9 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
 				  Bucket obucket, Bucket nbucket,
 				  Buffer obuf,
 				  Buffer nbuf,
+				  HTAB *htab,
 				  uint32 maxbucket,
 				  uint32 highmask, uint32 lowmask);
-static void _hash_splitbucket_guts(Relation rel, Buffer metabuf,
-					   Bucket obucket, Bucket nbucket, Buffer obuf,
-					   Buffer nbuf, HTAB *htab, uint32 maxbucket,
-					   uint32 highmask, uint32 lowmask);
 
 
 /*
@@ -497,7 +494,9 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	Buffer		buf_nblkno;
 	Buffer		buf_oblkno;
 	Page		opage;
+	Page		npage;
 	HashPageOpaque oopaque;
+	HashPageOpaque nopaque;
 	uint32		maxbucket;
 	uint32		highmask;
 	uint32		lowmask;
@@ -685,18 +684,18 @@ restart_expand:
 		goto fail;
 	}
 
-
 	/*
-	 * Okay to proceed with split.  Update the metapage bucket mapping info.
-	 *
-	 * Since we are scribbling on the metapage data right in the shared
-	 * buffer, any failure in this next little bit leaves us with a big
+	 * Since we are scribbling on the pages in the shared buffers, establish a
+	 * critical section.  Any failure in this next code leaves us with a big
 	 * problem: the metapage is effectively corrupt but could get written back
 	 * to disk.  We don't really expect any failure, but just to be sure,
 	 * establish a critical section.
 	 */
 	START_CRIT_SECTION();
 
+	/*
+	 * Okay to proceed with split.  Update the metapage bucket mapping info.
+	 */
 	metap->hashm_maxbucket = new_bucket;
 
 	if (new_bucket > metap->hashm_highmask)
@@ -718,8 +717,7 @@ restart_expand:
 		metap->hashm_ovflpoint = spare_ndx;
 	}
 
-	/* Done mucking with metapage */
-	END_CRIT_SECTION();
+	MarkBufferDirty(metabuf);
 
 	/*
 	 * Copy bucket mapping info now; this saves re-accessing the meta page
@@ -732,16 +730,51 @@ restart_expand:
 	highmask = metap->hashm_highmask;
 	lowmask = metap->hashm_lowmask;
 
-	/* Write out the metapage and drop lock, but keep pin */
-	MarkBufferDirty(metabuf);
+	opage = BufferGetPage(buf_oblkno);
+	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
+
+	/*
+	 * Mark the old bucket to indicate that split is in progress.  (At
+	 * operation end, we will clear the split-in-progress flag.)  Also,
+	 * for a primary bucket page, hasho_prevblkno stores the number of
+	 * buckets that existed as of the last split, so we must update that
+	 * value here.
+	 */
+	oopaque->hasho_flag |= LH_BUCKET_BEING_SPLIT;
+	oopaque->hasho_prevblkno = maxbucket;
+
+	MarkBufferDirty(buf_oblkno);
+
+	npage = BufferGetPage(buf_nblkno);
+
+	/*
+	 * initialize the new bucket's primary page and mark it to indicate that
+	 * split is in progress.
+	 */
+	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
+	nopaque->hasho_prevblkno = maxbucket;
+	nopaque->hasho_nextblkno = InvalidBlockNumber;
+	nopaque->hasho_bucket = new_bucket;
+	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_BEING_POPULATED;
+	nopaque->hasho_page_id = HASHO_PAGE_ID;
+
+	MarkBufferDirty(buf_nblkno);
+
+	END_CRIT_SECTION();
+
+	/* drop lock, but keep pin */
 	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 
 	/* Relocate records to the new bucket */
 	_hash_splitbucket(rel, metabuf,
 					  old_bucket, new_bucket,
-					  buf_oblkno, buf_nblkno,
+					  buf_oblkno, buf_nblkno, NULL,
 					  maxbucket, highmask, lowmask);
 
+	/* all done, now release the locks and pins on primary buckets. */
+	_hash_relbuf(rel, buf_oblkno);
+	_hash_relbuf(rel, buf_nblkno);
+
 	return;
 
 	/* Here if decide not to split or fail to acquire old bucket lock */
@@ -803,10 +836,16 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 /*
  * _hash_splitbucket -- split 'obucket' into 'obucket' and 'nbucket'
  *
+ * This routine is used to partition the tuples between the old and new bucket
+ * and to finish incomplete split operations.  To finish a previously
+ * interrupted split operation, the caller needs to fill htab.  If htab is set,
+ * then we skip the movement of tuples that exist in htab; otherwise a NULL
+ * value of htab indicates movement of all the tuples that belong to the new
+ * bucket.
+ *
  * We are splitting a bucket that consists of a base bucket page and zero
  * or more overflow (bucket chain) pages.  We must relocate tuples that
- * belong in the new bucket, and compress out any free space in the old
- * bucket.
+ * belong in the new bucket.
  *
  * The caller must hold cleanup locks on both buckets to ensure that
  * no one else is trying to access them (see README).
@@ -832,73 +871,11 @@ _hash_splitbucket(Relation rel,
 				  Bucket nbucket,
 				  Buffer obuf,
 				  Buffer nbuf,
+				  HTAB *htab,
 				  uint32 maxbucket,
 				  uint32 highmask,
 				  uint32 lowmask)
 {
-	Page		opage;
-	Page		npage;
-	HashPageOpaque oopaque;
-	HashPageOpaque nopaque;
-
-	opage = BufferGetPage(obuf);
-	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
-
-	/*
-	 * Mark the old bucket to indicate that split is in progress.  (At
-	 * operation end, we will clear the split-in-progress flag.)  Also,
-	 * for a primary bucket page, hasho_prevblkno stores the number of
-	 * buckets that existed as of the last split, so we must update that
-	 * value here.
-	 */
-	oopaque->hasho_flag |= LH_BUCKET_BEING_SPLIT;
-	oopaque->hasho_prevblkno = maxbucket;
-
-	npage = BufferGetPage(nbuf);
-
-	/*
-	 * initialize the new bucket's primary page and mark it to indicate that
-	 * split is in progress.
-	 */
-	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
-	nopaque->hasho_prevblkno = maxbucket;
-	nopaque->hasho_nextblkno = InvalidBlockNumber;
-	nopaque->hasho_bucket = nbucket;
-	nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_BEING_POPULATED;
-	nopaque->hasho_page_id = HASHO_PAGE_ID;
-
-	_hash_splitbucket_guts(rel, metabuf, obucket,
-						   nbucket, obuf, nbuf, NULL,
-						   maxbucket, highmask, lowmask);
-
-	/* all done, now release the locks and pins on primary buckets. */
-	_hash_relbuf(rel, obuf);
-	_hash_relbuf(rel, nbuf);
-}
-
-/*
- * _hash_splitbucket_guts -- Helper function to perform the split operation
- *
- * This routine is used to partition the tuples between old and new bucket and
- * to finish incomplete split operations.  To finish the previously
- * interrupted split operation, caller needs to fill htab.  If htab is set, then
- * we skip the movement of tuples that exists in htab, otherwise NULL value of
- * htab indicates movement of all the tuples that belong to new bucket.
- *
- * Caller needs to lock and unlock the old and new primary buckets.
- */
-static void
-_hash_splitbucket_guts(Relation rel,
-					   Buffer metabuf,
-					   Bucket obucket,
-					   Bucket nbucket,
-					   Buffer obuf,
-					   Buffer nbuf,
-					   HTAB *htab,
-					   uint32 maxbucket,
-					   uint32 highmask,
-					   uint32 lowmask)
-{
 	Buffer		bucket_obuf;
 	Buffer		bucket_nbuf;
 	Page		opage;
@@ -987,6 +964,7 @@ _hash_splitbucket_guts(Relation rel,
 				{
 					/* write out nbuf and drop lock, but keep pin */
 					MarkBufferDirty(nbuf);
+					/* drop lock, but keep pin */
 					LockBuffer(nbuf, BUFFER_LOCK_UNLOCK);
 					/* chain to a new overflow page */
 					nbuf = _hash_addovflpage(rel, metabuf, nbuf, (nbuf == bucket_nbuf) ? true : false);
@@ -1025,7 +1003,14 @@ _hash_splitbucket_guts(Relation rel,
 
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
+		{
+			MarkBufferDirty(nbuf);
+			if (nbuf == bucket_nbuf)
+				LockBuffer(nbuf, BUFFER_LOCK_UNLOCK);
+			else
+				_hash_relbuf(rel, nbuf);
 			break;
+		}
 
 		/* Else, advance to next old page */
 		obuf = _hash_getbuf(rel, oblkno, HASH_READ, LH_OVERFLOW_PAGE);
@@ -1041,17 +1026,6 @@ _hash_splitbucket_guts(Relation rel,
 	 * To avoid deadlocks due to locking order of buckets, first lock the old
 	 * bucket and then the new bucket.
 	 */
-	if (nbuf == bucket_nbuf)
-	{
-		MarkBufferDirty(bucket_nbuf);
-		LockBuffer(bucket_nbuf, BUFFER_LOCK_UNLOCK);
-	}
-	else
-	{
-		MarkBufferDirty(nbuf);
-		_hash_relbuf(rel, nbuf);
-	}
-
 	LockBuffer(bucket_obuf, BUFFER_LOCK_EXCLUSIVE);
 	opage = BufferGetPage(bucket_obuf);
 	oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
@@ -1192,9 +1166,9 @@ _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Bucket obucket,
 	npageopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 	nbucket = npageopaque->hasho_bucket;
 
-	_hash_splitbucket_guts(rel, metabuf, obucket,
-						   nbucket, obuf, bucket_nbuf, tidhtab,
-						   maxbucket, highmask, lowmask);
+	_hash_splitbucket(rel, metabuf, obucket,
+					  nbucket, obuf, bucket_nbuf, tidhtab,
+					  maxbucket, highmask, lowmask);
 
 	_hash_relbuf(rel, bucket_nbuf);
 	LockBuffer(obuf, BUFFER_LOCK_UNLOCK);
-- 
1.8.4.msysgit.0

0002-Restructure-hash-index-creation.patchapplication/octet-stream; name=0002-Restructure-hash-index-creation.patchDownload
From 5244165074a296f7e327b6e3c8c45a2efc56f028 Mon Sep 17 00:00:00 2001
From: Amit Kapila <amit.kapila@enterprisedb.com>
Date: Tue, 28 Feb 2017 18:35:58 +0530
Subject: [PATCH 2/3] Restructure hash index creation.

Restructure metapage initialization code so that the operation
can be performed atomically and can be WAL-logged.
---
 src/backend/access/hash/hash.c     |   4 +-
 src/backend/access/hash/hashovfl.c |  62 -----------
 src/backend/access/hash/hashpage.c | 203 +++++++++++++++++++++++++------------
 src/include/access/hash.h          |  10 +-
 4 files changed, 144 insertions(+), 135 deletions(-)

diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 24510e7..1f8a7f6 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -120,7 +120,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	estimate_rel_size(heap, NULL, &relpages, &reltuples, &allvisfrac);
 
 	/* Initialize the hash index metadata page and initial buckets */
-	num_buckets = _hash_metapinit(index, reltuples, MAIN_FORKNUM);
+	num_buckets = _hash_init(index, reltuples, MAIN_FORKNUM);
 
 	/*
 	 * If we just insert the tuples into the index in scan order, then
@@ -182,7 +182,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 void
 hashbuildempty(Relation index)
 {
-	_hash_metapinit(index, 0, INIT_FORKNUM);
+	_hash_init(index, 0, INIT_FORKNUM);
 }
 
 /*
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 9d89e86..1087480 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -571,68 +571,6 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 
 
 /*
- *	_hash_initbitmap()
- *
- *	 Initialize a new bitmap page.  The metapage has a write-lock upon
- *	 entering the function, and must be written by caller after return.
- *
- * 'blkno' is the block number of the new bitmap page.
- *
- * All bits in the new bitmap page are set to "1", indicating "in use".
- */
-void
-_hash_initbitmap(Relation rel, HashMetaPage metap, BlockNumber blkno,
-				 ForkNumber forkNum)
-{
-	Buffer		buf;
-	Page		pg;
-	HashPageOpaque op;
-	uint32	   *freep;
-
-	/*
-	 * It is okay to write-lock the new bitmap page while holding metapage
-	 * write lock, because no one else could be contending for the new page.
-	 * Also, the metapage lock makes it safe to extend the index using
-	 * _hash_getnewbuf.
-	 *
-	 * There is some loss of concurrency in possibly doing I/O for the new
-	 * page while holding the metapage lock, but this path is taken so seldom
-	 * that it's not worth worrying about.
-	 */
-	buf = _hash_getnewbuf(rel, blkno, forkNum);
-	pg = BufferGetPage(buf);
-
-	/* initialize the page's special space */
-	op = (HashPageOpaque) PageGetSpecialPointer(pg);
-	op->hasho_prevblkno = InvalidBlockNumber;
-	op->hasho_nextblkno = InvalidBlockNumber;
-	op->hasho_bucket = -1;
-	op->hasho_flag = LH_BITMAP_PAGE;
-	op->hasho_page_id = HASHO_PAGE_ID;
-
-	/* set all of the bits to 1 */
-	freep = HashPageGetBitmap(pg);
-	MemSet(freep, 0xFF, BMPGSZ_BYTE(metap));
-
-	/* dirty the new bitmap page, and release write lock and pin */
-	MarkBufferDirty(buf);
-	_hash_relbuf(rel, buf);
-
-	/* add the new bitmap page to the metapage's list of bitmaps */
-	/* metapage already has a write lock */
-	if (metap->hashm_nmaps >= HASH_MAX_BITMAPS)
-		ereport(ERROR,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg("out of overflow pages in hash index \"%s\"",
-						RelationGetRelationName(rel))));
-
-	metap->hashm_mapp[metap->hashm_nmaps] = blkno;
-
-	metap->hashm_nmaps++;
-}
-
-
-/*
  *	_hash_initbitmapbuffer()
  *
  *	 Initialize a new bitmap page.  All bits in the new bitmap page are set to
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index bb1ce75..c73929c 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -157,6 +157,36 @@ _hash_getinitbuf(Relation rel, BlockNumber blkno)
 }
 
 /*
+ *	_hash_initbuf() -- Get and initialize a buffer by bucket number.
+ */
+void
+_hash_initbuf(Buffer buf, uint32 max_bucket, uint32 num_bucket, uint32 flag,
+			  bool initpage)
+{
+	HashPageOpaque pageopaque;
+	Page		page;
+
+	page = BufferGetPage(buf);
+
+	/* initialize the page */
+	if (initpage)
+		_hash_pageinit(page, BufferGetPageSize(buf));
+
+	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+	/*
+	 * Set hasho_prevblkno with current hashm_maxbucket. This value will
+	 * be used to validate cached HashMetaPageData. See
+	 * _hash_getbucketbuf_from_hashkey().
+	 */
+	pageopaque->hasho_prevblkno = max_bucket;
+	pageopaque->hasho_nextblkno = InvalidBlockNumber;
+	pageopaque->hasho_bucket = num_bucket;
+	pageopaque->hasho_flag = flag;
+	pageopaque->hasho_page_id = HASHO_PAGE_ID;
+}
+
+/*
  *	_hash_getnewbuf() -- Get a new page at the end of the index.
  *
  *		This has the same API as _hash_getinitbuf, except that we are adding
@@ -288,7 +318,7 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
 
 
 /*
- *	_hash_metapinit() -- Initialize the metadata page of a hash index,
+ *	_hash_init() -- Initialize the metadata page of a hash index,
  *				the initial buckets, and the initial bitmap page.
  *
  * The initial number of buckets is dependent on num_tuples, an estimate
@@ -300,19 +330,18 @@ _hash_dropscanbuf(Relation rel, HashScanOpaque so)
  * multiple buffer locks is ignored.
  */
 uint32
-_hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
+_hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
 {
-	HashMetaPage metap;
-	HashPageOpaque pageopaque;
 	Buffer		metabuf;
 	Buffer		buf;
+	Buffer		bitmapbuf;
 	Page		pg;
+	HashMetaPage metap;
+	RegProcedure procid;
 	int32		data_width;
 	int32		item_width;
 	int32		ffactor;
-	double		dnumbuckets;
 	uint32		num_buckets;
-	uint32		log2_num_buckets;
 	uint32		i;
 
 	/* safety check */
@@ -334,6 +363,96 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	if (ffactor < 10)
 		ffactor = 10;
 
+	procid = index_getprocid(rel, 1, HASHPROC);
+
+	/*
+	 * We initialize the metapage, the first N bucket pages, and the first
+	 * bitmap page in sequence, using _hash_getnewbuf to cause smgrextend()
+	 * calls to occur.  This ensures that the smgr level has the right idea of
+	 * the physical index length.
+	 *
+	 * Critical section not required, because on error the creation of the
+	 * whole relation will be rolled back.
+	 */
+	metabuf = _hash_getnewbuf(rel, HASH_METAPAGE, forkNum);
+	_hash_init_metabuffer(metabuf, num_tuples, procid, ffactor, false);
+	MarkBufferDirty(metabuf);
+
+	pg = BufferGetPage(metabuf);
+	metap = HashPageGetMeta(pg);
+
+	num_buckets = metap->hashm_maxbucket + 1;
+
+	/*
+	 * Release buffer lock on the metapage while we initialize buckets.
+	 * Otherwise, we'll be in interrupt holdoff and the CHECK_FOR_INTERRUPTS
+	 * won't accomplish anything.  It's a bad idea to hold buffer locks for
+	 * long intervals in any case, since that can block the bgwriter.
+	 */
+	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
+
+	/*
+	 * Initialize and WAL Log the first N buckets
+	 */
+	for (i = 0; i < num_buckets; i++)
+	{
+		BlockNumber blkno;
+
+		/* Allow interrupts, in case N is huge */
+		CHECK_FOR_INTERRUPTS();
+
+		blkno = BUCKET_TO_BLKNO(metap, i);
+		buf = _hash_getnewbuf(rel, blkno, forkNum);
+		_hash_initbuf(buf, metap->hashm_maxbucket, i, LH_BUCKET_PAGE, false);
+		MarkBufferDirty(buf);
+		_hash_relbuf(rel, buf);
+	}
+
+	/* Now reacquire buffer lock on metapage */
+	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+
+	/*
+	 * Initialize bitmap page
+	 */
+	bitmapbuf = _hash_getnewbuf(rel, num_buckets + 1, forkNum);
+	_hash_initbitmapbuffer(bitmapbuf, metap->hashm_bmsize, false);
+	MarkBufferDirty(bitmapbuf);
+
+	/* add the new bitmap page to the metapage's list of bitmaps */
+	/* metapage already has a write lock */
+	if (metap->hashm_nmaps >= HASH_MAX_BITMAPS)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+				 errmsg("out of overflow pages in hash index \"%s\"",
+						RelationGetRelationName(rel))));
+
+	metap->hashm_mapp[metap->hashm_nmaps] = num_buckets + 1;
+
+	metap->hashm_nmaps++;
+	MarkBufferDirty(metabuf);
+
+	/* all done */
+	_hash_relbuf(rel, bitmapbuf);
+	_hash_relbuf(rel, metabuf);
+
+	return num_buckets;
+}
+
+/*
+ *	_hash_init_metabuffer() -- Initialize the metadata page of a hash index.
+ */
+void
+_hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
+					  uint16 ffactor, bool initpage)
+{
+	HashMetaPage metap;
+	HashPageOpaque pageopaque;
+	Page		page;
+	double		dnumbuckets;
+	uint32		num_buckets;
+	uint32		log2_num_buckets;
+	uint32		i;
+
 	/*
 	 * Choose the number of initial bucket pages to match the fill factor
 	 * given the estimated number of tuples.  We round up the result to the
@@ -353,30 +472,25 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	Assert(num_buckets == (((uint32) 1) << log2_num_buckets));
 	Assert(log2_num_buckets < HASH_MAX_SPLITPOINTS);
 
-	/*
-	 * We initialize the metapage, the first N bucket pages, and the first
-	 * bitmap page in sequence, using _hash_getnewbuf to cause smgrextend()
-	 * calls to occur.  This ensures that the smgr level has the right idea of
-	 * the physical index length.
-	 */
-	metabuf = _hash_getnewbuf(rel, HASH_METAPAGE, forkNum);
-	pg = BufferGetPage(metabuf);
+	page = BufferGetPage(buf);
+	if (initpage)
+		_hash_pageinit(page, BufferGetPageSize(buf));
 
-	pageopaque = (HashPageOpaque) PageGetSpecialPointer(pg);
+	pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	pageopaque->hasho_prevblkno = InvalidBlockNumber;
 	pageopaque->hasho_nextblkno = InvalidBlockNumber;
 	pageopaque->hasho_bucket = -1;
 	pageopaque->hasho_flag = LH_META_PAGE;
 	pageopaque->hasho_page_id = HASHO_PAGE_ID;
 
-	metap = HashPageGetMeta(pg);
+	metap = HashPageGetMeta(page);
 
 	metap->hashm_magic = HASH_MAGIC;
 	metap->hashm_version = HASH_VERSION;
 	metap->hashm_ntuples = 0;
 	metap->hashm_nmaps = 0;
 	metap->hashm_ffactor = ffactor;
-	metap->hashm_bsize = HashGetMaxBitmapSize(pg);
+	metap->hashm_bsize = HashGetMaxBitmapSize(page);
 	/* find largest bitmap array size that will fit in page size */
 	for (i = _hash_log2(metap->hashm_bsize); i > 0; --i)
 	{
@@ -393,7 +507,7 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	 * pretty useless for normal operation (in fact, hashm_procid is not used
 	 * anywhere), but it might be handy for forensic purposes so we keep it.
 	 */
-	metap->hashm_procid = index_getprocid(rel, 1, HASHPROC);
+	metap->hashm_procid = procid;
 
 	/*
 	 * We initialize the index with N buckets, 0 .. N-1, occupying physical
@@ -411,54 +525,9 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 	metap->hashm_ovflpoint = log2_num_buckets;
 	metap->hashm_firstfree = 0;
 
-	/*
-	 * Release buffer lock on the metapage while we initialize buckets.
-	 * Otherwise, we'll be in interrupt holdoff and the CHECK_FOR_INTERRUPTS
-	 * won't accomplish anything.  It's a bad idea to hold buffer locks for
-	 * long intervals in any case, since that can block the bgwriter.
-	 */
-	MarkBufferDirty(metabuf);
-	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
-
-	/*
-	 * Initialize the first N buckets
-	 */
-	for (i = 0; i < num_buckets; i++)
-	{
-		/* Allow interrupts, in case N is huge */
-		CHECK_FOR_INTERRUPTS();
-
-		buf = _hash_getnewbuf(rel, BUCKET_TO_BLKNO(metap, i), forkNum);
-		pg = BufferGetPage(buf);
-		pageopaque = (HashPageOpaque) PageGetSpecialPointer(pg);
-
-		/*
-		 * Set hasho_prevblkno with current hashm_maxbucket. This value will
-		 * be used to validate cached HashMetaPageData. See
-		 * _hash_getbucketbuf_from_hashkey().
-		 */
-		pageopaque->hasho_prevblkno = metap->hashm_maxbucket;
-		pageopaque->hasho_nextblkno = InvalidBlockNumber;
-		pageopaque->hasho_bucket = i;
-		pageopaque->hasho_flag = LH_BUCKET_PAGE;
-		pageopaque->hasho_page_id = HASHO_PAGE_ID;
-		MarkBufferDirty(buf);
-		_hash_relbuf(rel, buf);
-	}
-
-	/* Now reacquire buffer lock on metapage */
-	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
-
-	/*
-	 * Initialize first bitmap page
-	 */
-	_hash_initbitmap(rel, metap, num_buckets + 1, forkNum);
-
-	/* all done */
-	MarkBufferDirty(metabuf);
-	_hash_relbuf(rel, metabuf);
-
-	return num_buckets;
+	/* Set pd_lower just past the end of the metadata. */
+	((PageHeader) page)->pd_lower =
+		((char *) metap + sizeof(HashMetaPageData)) - (char *) page;
 }
 
 /*
@@ -535,7 +604,7 @@ restart_expand:
 	 * than a disk block then this would be an independent constraint.
 	 *
 	 * If you change this, see also the maximum initial number of buckets in
-	 * _hash_metapinit().
+	 * _hash_init().
 	 */
 	if (metap->hashm_maxbucket >= (uint32) 0x7FFFFFFE)
 		goto fail;
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 9c0b79f..bfdfed8 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -311,8 +311,6 @@ extern Buffer _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool r
 extern BlockNumber _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 				   Buffer wbuf, IndexTuple *itups, OffsetNumber *itup_offsets,
 			 Size *tups_size, uint16 nitups, BufferAccessStrategy bstrategy);
-extern void _hash_initbitmap(Relation rel, HashMetaPage metap,
-				 BlockNumber blkno, ForkNumber forkNum);
 extern void _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage);
 extern void _hash_squeezebucket(Relation rel,
 					Bucket bucket, BlockNumber bucket_blkno,
@@ -331,6 +329,8 @@ extern Buffer _hash_getbucketbuf_from_hashkey(Relation rel, uint32 hashkey,
 								int access,
 								HashMetaPage *cachedmetap);
 extern Buffer _hash_getinitbuf(Relation rel, BlockNumber blkno);
+extern void _hash_initbuf(Buffer buf, uint32 max_bucket, uint32 num_bucket,
+				uint32 flag, bool initpage);
 extern Buffer _hash_getnewbuf(Relation rel, BlockNumber blkno,
 				ForkNumber forkNum);
 extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
@@ -339,8 +339,10 @@ extern Buffer _hash_getbuf_with_strategy(Relation rel, BlockNumber blkno,
 extern void _hash_relbuf(Relation rel, Buffer buf);
 extern void _hash_dropbuf(Relation rel, Buffer buf);
 extern void _hash_dropscanbuf(Relation rel, HashScanOpaque so);
-extern uint32 _hash_metapinit(Relation rel, double num_tuples,
-				ForkNumber forkNum);
+extern uint32 _hash_init(Relation rel, double num_tuples,
+		   ForkNumber forkNum);
+extern void _hash_init_metabuffer(Buffer buf, double num_tuples,
+					  RegProcedure procid, uint16 ffactor, bool initpage);
 extern void _hash_pageinit(Page page, Size size);
 extern void _hash_expandtable(Relation rel, Buffer metabuf);
 extern void _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf,
-- 
1.8.4.msysgit.0

0003-Enable-WAL-for-Hash-Indexes.patchapplication/octet-stream; name=0003-Enable-WAL-for-Hash-Indexes.patchDownload
From cdba0d277848dbb1ccbf38d0010013e9ad2aaf9b Mon Sep 17 00:00:00 2001
From: Amit Kapila <amit.kapila@enterprisedb.com>
Date: Tue, 28 Feb 2017 19:19:47 +0530
Subject: [PATCH 3/3] Enable WAL for Hash Indexes.

This will make hash indexes durable, which means that data
for hash indexes can be recovered after a crash.  This will
also allow us to replicate hash indexes.
---
 contrib/pageinspect/expected/hash.out          |   1 -
 contrib/pgstattuple/expected/pgstattuple.out   |   1 -
 doc/src/sgml/backup.sgml                       |  13 -
 doc/src/sgml/config.sgml                       |   7 +-
 doc/src/sgml/high-availability.sgml            |   6 -
 doc/src/sgml/indices.sgml                      |  12 -
 doc/src/sgml/ref/create_index.sgml             |  13 -
 src/backend/access/hash/Makefile               |   2 +-
 src/backend/access/hash/README                 | 144 +++-
 src/backend/access/hash/hash.c                 |  81 +-
 src/backend/access/hash/hash_xlog.c            | 974 +++++++++++++++++++++++++
 src/backend/access/hash/hashinsert.c           |  59 +-
 src/backend/access/hash/hashovfl.c             | 243 +++++-
 src/backend/access/hash/hashpage.c             | 222 +++++-
 src/backend/access/hash/hashsearch.c           |   5 +
 src/backend/access/rmgrdesc/hashdesc.c         | 134 +++-
 src/backend/commands/indexcmds.c               |   5 -
 src/backend/utils/cache/relcache.c             |  12 +-
 src/include/access/hash_xlog.h                 | 232 ++++++
 src/test/regress/expected/create_index.out     |   5 -
 src/test/regress/expected/enum.out             |   1 -
 src/test/regress/expected/hash_index.out       |   2 -
 src/test/regress/expected/macaddr.out          |   1 -
 src/test/regress/expected/replica_identity.out |   1 -
 src/test/regress/expected/uuid.out             |   1 -
 25 files changed, 2046 insertions(+), 131 deletions(-)
 create mode 100644 src/backend/access/hash/hash_xlog.c

diff --git a/contrib/pageinspect/expected/hash.out b/contrib/pageinspect/expected/hash.out
index 8ed60bc..3ba01f6 100644
--- a/contrib/pageinspect/expected/hash.out
+++ b/contrib/pageinspect/expected/hash.out
@@ -1,7 +1,6 @@
 CREATE TABLE test_hash (a int, b text);
 INSERT INTO test_hash VALUES (1, 'one');
 CREATE INDEX test_hash_a_idx ON test_hash USING hash (a);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 \x
 SELECT hash_page_type(get_raw_page('test_hash_a_idx', 0));
 -[ RECORD 1 ]--+---------
diff --git a/contrib/pgstattuple/expected/pgstattuple.out b/contrib/pgstattuple/expected/pgstattuple.out
index 169d193..7d7ade8 100644
--- a/contrib/pgstattuple/expected/pgstattuple.out
+++ b/contrib/pgstattuple/expected/pgstattuple.out
@@ -131,7 +131,6 @@ select * from pgstatginindex('test_ginidx');
 (1 row)
 
 create index test_hashidx on test using hash (b);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 select * from pgstathashindex('test_hashidx');
  version | bucket_pages | overflow_pages | bitmap_pages | zero_pages | live_items | dead_items | free_percent 
 ---------+--------------+----------------+--------------+------------+------------+------------+--------------
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 12f2a14..69c599e 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1538,19 +1538,6 @@ archive_command = 'local_backup_script.sh "%p" "%f"'
   <itemizedlist>
    <listitem>
     <para>
-     Operations on hash indexes are not presently WAL-logged, so
-     replay will not update these indexes.  This will mean that any new inserts
-     will be ignored by the index, updated rows will apparently disappear and
-     deleted rows will still retain pointers. In other words, if you modify a
-     table with a hash index on it then you will get incorrect query results
-     on a standby server.  When recovery completes it is recommended that you
-     manually <xref linkend="sql-reindex">
-     each such index after completing a recovery operation.
-    </para>
-   </listitem>
-
-   <listitem>
-    <para>
      If a <xref linkend="sql-createdatabase">
      command is executed while a base backup is being taken, and then
      the template database that the <command>CREATE DATABASE</> copied
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 1b390a2..8f6f5a2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2152,10 +2152,9 @@ include_dir 'conf.d'
          has materialized a result set, no error will be generated even if the
          underlying rows in the referenced table have been vacuumed away.
          Some tables cannot safely be vacuumed early, and so will not be
-         affected by this setting.  Examples include system catalogs and any
-         table which has a hash index.  For such tables this setting will
-         neither reduce bloat nor create a possibility of a <literal>snapshot
-         too old</> error on scanning.
+         affected by this setting.  Examples include system catalogs.  For
+         such tables this setting will neither reduce bloat nor create a
+         possibility of a <literal>snapshot too old</> error on scanning.
         </para>
        </listitem>
       </varlistentry>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 48de2ce..0e61991 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2353,12 +2353,6 @@ LOG:  database system is ready to accept read only connections
   <itemizedlist>
    <listitem>
     <para>
-     Operations on hash indexes are not presently WAL-logged, so
-     replay will not update these indexes.
-    </para>
-   </listitem>
-   <listitem>
-    <para>
      Full knowledge of running transactions is required before snapshots
      can be taken. Transactions that use large numbers of subtransactions
      (currently greater than 64) will delay the start of read only
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 271c135..e40750e 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -193,18 +193,6 @@ CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable>
 </synopsis>
   </para>
 
-  <caution>
-   <para>
-    Hash index operations are not presently WAL-logged,
-    so hash indexes might need to be rebuilt with <command>REINDEX</>
-    after a database crash if there were unwritten changes.
-    Also, changes to hash indexes are not replicated over streaming or
-    file-based replication after the initial base backup, so they
-    give wrong answers to queries that subsequently use them.
-    For these reasons, hash index use is presently discouraged.
-   </para>
-  </caution>
-
   <para>
    <indexterm>
     <primary>index</primary>
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index fcb7a60..7163b03 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -510,19 +510,6 @@ Indexes:
    they can be useful.
   </para>
 
-  <caution>
-   <para>
-    Hash index operations are not presently WAL-logged,
-    so hash indexes might need to be rebuilt with <command>REINDEX</>
-    after a database crash if there were unwritten changes.
-    Also, changes to hash indexes are not replicated over streaming or
-    file-based replication after the initial base backup, so they
-    give wrong answers to queries that subsequently use them.
-    Hash indexes are also not properly restored during point-in-time
-    recovery.  For these reasons, hash index use is presently discouraged.
-   </para>
-  </caution>
-
   <para>
    Currently, only the B-tree, GiST, GIN, and BRIN index methods support
    multicolumn indexes. Up to 32 fields can be specified by default.
diff --git a/src/backend/access/hash/Makefile b/src/backend/access/hash/Makefile
index e2e7e91..b154569 100644
--- a/src/backend/access/hash/Makefile
+++ b/src/backend/access/hash/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = hash.o hashfunc.o hashinsert.o hashovfl.o hashpage.o hashsearch.o \
-       hashsort.o hashutil.o hashvalidate.o
+       hashsort.o hashutil.o hashvalidate.o hash_xlog.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 703ae98..1c1f0db 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -287,13 +287,17 @@ The insertion algorithm is rather similar:
 	if current page is full, release lock but not pin, read/exclusive-lock
      next page; repeat as needed
 	>> see below if no space in any page of bucket
+	take buffer content lock in exclusive mode on metapage
 	insert tuple at appropriate place in page
-	mark current page dirty and release buffer content lock and pin
-	if the current page is not a bucket page, release the pin on bucket page
-	pin meta page and take buffer content lock in exclusive mode
+	mark current page dirty
 	increment tuple count, decide if split needed
-	mark meta page dirty and release buffer content lock and pin
-	done if no split needed, else enter Split algorithm below
+	mark meta page dirty
+	write WAL for insertion of tuple
+	release the buffer content lock on metapage
+	release buffer content lock on current page
+	if current page is not a bucket page, release the pin on bucket page
+	if split is needed, enter Split algorithm below
+	release the pin on metapage
 
 To speed searches, the index entries within any individual index page are
 kept sorted by hash code; the insertion code must take care to insert new
@@ -328,12 +332,17 @@ existing bucket in two, thereby lowering the fill ratio:
        try to finish the split and the cleanup work
        if that succeeds, start over; if it fails, give up
 	mark the old and new buckets indicating split is in progress
+	mark both old and new buckets as dirty
+	write WAL for allocation of new page for split
 	copy the tuples that belongs to new bucket from old bucket, marking
      them as moved-by-split
+	write WAL record for moving tuples to new page once the new page is full
+	or all the pages of old bucket are finished
 	release lock but not pin for primary bucket page of old bucket,
 	 read/shared-lock next page; repeat as needed
 	clear the bucket-being-split and bucket-being-populated flags
 	mark the old bucket indicating split-cleanup
+	write WAL for changing the flags on both old and new buckets
 
 The split operation's attempt to acquire cleanup-lock on the old bucket number
 could fail if another process holds any lock or pin on it.  We do not want to
@@ -369,6 +378,8 @@ The fourth operation is garbage collection (bulk deletion):
 		acquire cleanup lock on primary bucket page
 		loop:
 			scan and remove tuples
+			mark the target page dirty
+			write WAL for deleting tuples from target page
 			if this is the last bucket page, break out of loop
 			pin and x-lock next page
 			release prior lock and pin (except keep pin on primary bucket page)
@@ -383,7 +394,8 @@ The fourth operation is garbage collection (bulk deletion):
 	check if number of buckets changed
 	if so, release content lock and pin and return to for-each-bucket loop
 	else update metapage tuple count
-	mark meta page dirty and release buffer content lock and pin
+	 mark meta page dirty and write WAL for update of metapage
+	 release buffer content lock and pin
 
 Note that this is designed to allow concurrent splits and scans.  If a split
 occurs, tuples relocated into the new bucket will be visited twice by the
@@ -425,18 +437,16 @@ Obtaining an overflow page:
 	search for a free page (zero bit in bitmap)
 	if found:
 		set bit in bitmap
-		mark bitmap page dirty and release content lock
+		mark bitmap page dirty
 		take metapage buffer content lock in exclusive mode
 		if first-free-bit value did not change,
 			update it and mark meta page dirty
-		release meta page buffer content lock
-		return page number
 	else (not found):
 	release bitmap page buffer content lock
 	loop back to try next bitmap page, if any
 -- here when we have checked all bitmap pages; we hold meta excl. lock
 	extend index to add another overflow page; update meta information
-	mark meta page dirty and release buffer content lock
+	mark meta page dirty
 	return page number
 
 It is slightly annoying to release and reacquire the metapage lock
@@ -456,12 +466,15 @@ like this:
 
 	-- having determined that no space is free in the target bucket:
 	remember last page of bucket, drop write lock on it
-	call free-page-acquire routine
 	re-write-lock last page of bucket
 	if it is not last anymore, step to the last page
-	update (former) last page to point to new page
+	execute free-page-acquire (Obtaining an overflow page) mechanism described above
+	update (former) last page to point to the new page and mark the buffer dirty
 	write-lock and initialize new page, with back link to former last page
-	write and release former last page
+	write WAL for addition of overflow page
+	release the locks on meta page and bitmap page acquired in free-page-acquire algorithm
+	release the lock on former last page
+	release the lock on new overflow page
 	insert tuple into new page
 	-- etc.
 
@@ -488,12 +501,14 @@ accessors of pages in the bucket.  The algorithm is:
 	determine which bitmap page contains the free space bit for page
 	release meta page buffer content lock
 	pin bitmap page and take buffer content lock in exclusive mode
-	update bitmap bit
-	mark bitmap page dirty and release buffer content lock and pin
-	if page number is less than what we saw as first-free-bit in meta:
 	retake meta page buffer content lock in exclusive mode
+	move (insert) tuples that belong to the overflow page being freed
+	update bitmap bit
+	mark bitmap page dirty
 	if page number is still less than first-free-bit,
 		update first-free-bit field and mark meta page dirty
+	write WAL for delinking overflow page operation
+	release buffer content lock and pin
 	release meta page buffer content lock and pin
 
 We have to do it this way because we must clear the bitmap bit before
@@ -504,8 +519,101 @@ page acquirer will scan more bitmap bits than he needs to.  What must be
 avoided is having first-free-bit greater than the actual first free bit,
 because then that free page would never be found by searchers.
 
-All the freespace operations should be called while holding no buffer
-locks.  Since they need no lmgr locks, deadlock is not possible.
+The reason for moving tuples from the overflow page while delinking the latter
+is to make that an atomic operation.  Not doing so could lead to spurious reads
+on the standby.  Basically, the user might see the same tuple twice.
+
+
+WAL Considerations
+------------------
+
+Hash index operations such as create index, insert, delete, bucket split,
+allocate overflow page, and squeeze (overflow pages are freed in this
+operation) do not in themselves guarantee hash index consistency after a
+crash.  To provide robustness, we write WAL for each of these operations,
+which is replayed during crash recovery.
+
+Multiple WAL records are being written for create index operation, first for
+initializing the metapage, followed by one for each new bucket created during
+operation followed by one for initializing the bitmap page.  If the system
+crashes after any operation, the whole operation is rolled back.  We have
+considered writing a single WAL record for the whole operation, but for that
+we need to limit the number of initial buckets that can be created during the
+operation.  As we can log only a fixed number of pages, XLR_MAX_BLOCK_ID (32),
+with the current XLog machinery, it is better to write multiple WAL records
+for this operation.  The downside of restricting the number of buckets is that
+we need to perform the split operation if the number of tuples is more than
+what can be accommodated in the initial set of buckets, and it is not unusual
+to have a large number of tuples during a create index operation.
+
+Ordinary item insertions (that don't force a page split or need a new overflow
+page) are single WAL entries.  They touch a single bucket page and the metapage;
+the metapage is updated during replay just as it is during the original operation.
+
+An insertion that causes the addition of an overflow page is logged as a single
+WAL entry preceded by a WAL entry for the new overflow page required to insert
+a tuple.  There is a corner case where, by the time we try to use the newly
+allocated overflow page, it has already been used by concurrent insertions; in
+such a case, a new overflow page will be allocated and a separate WAL entry
+will be made for it.
+
+An insertion that causes a bucket split is logged as a single WAL entry,
+followed by a WAL entry for allocating a new bucket, followed by a WAL entry
+for each overflow bucket page in the new bucket to which the tuples are moved
+from the old bucket, followed by a WAL entry to indicate that the split is
+complete for both the old and new buckets.
+
+A split operation which requires overflow pages to complete the operation
+will need to write a WAL record for each new allocation of an overflow page.
+
+As splitting involves multiple atomic actions, it's possible that the
+system crashes while moving tuples from bucket pages of the old bucket to
+the new bucket.  In such a case, after recovery, the old and new buckets
+will be marked with the bucket-being-split and bucket-being-populated flags
+respectively, which indicate that a split is in progress for those buckets.
+The reader algorithm works correctly, as it will scan both buckets while the
+split is in progress, as explained in the reader algorithm section above.
+
+We finish the split at the next insert or split operation on the old bucket,
+as explained in the insert and split algorithms above.  It could be done during
+searches, too, but it seems best not to put any extra updates in what would
+otherwise be a read-only operation (updating is not possible in hot standby
+mode anyway).  It would seem natural to complete the split in VACUUM, but since
+splitting a bucket might require allocating a new page, it might fail if you
+run out of disk space.  That would be bad during VACUUM - the reason for
+running VACUUM in the first place might be that you run out of disk space,
+and now VACUUM won't finish because you're out of disk space.  In contrast,
+an insertion can require enlarging the physical file anyway.
+
+Deletion of tuples from a bucket is performed for two reasons: one is to
+remove dead tuples and the other is to remove tuples that were moved by a
+split.  A WAL entry is made for each bucket page from which tuples are removed,
+followed by a WAL entry to clear the garbage flag if the tuples moved by a
+split are removed.  A separate WAL entry is made for updating the metapage if
+the deletion is performed to remove dead tuples during vacuum.
+
+As deletion involves multiple atomic operations, it is quite possible that the
+system crashes (a) after removing tuples from some of the bucket pages,
+(b) before clearing the garbage flag, or (c) before updating the metapage.  If
+the system crashes before completing (b), it will again try to clean the bucket
+during the next vacuum or insert after recovery, which can have some
+performance impact, but it will work fine.  If the system crashes before
+completing (c), after recovery there could be some additional splits until the
+next vacuum updates the metapage, but the other operations like insert, delete
+and scan will work correctly.  We could fix this problem by actually updating
+the metapage based on the delete operation during replay, but it is not clear
+that it is worth the complication.
+
+The squeeze operation moves tuples from a bucket page later in the chain to
+one earlier in the chain, and writes a WAL record when either the page to
+which it is writing tuples is filled or the page from which it is removing
+tuples becomes empty.
+
+As the squeeze operation involves multiple atomic operations, it is quite
+possible that the system crashes before completing the operation on the
+entire bucket.  After recovery, the operations will work correctly, but the
+index will remain bloated, which can impact the performance of read and
+insert operations until the next vacuum squeezes the bucket completely.
 
 
 Other Notes
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 1f8a7f6..7f030f8 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -28,6 +28,7 @@
 #include "utils/builtins.h"
 #include "utils/index_selfuncs.h"
 #include "utils/rel.h"
+#include "miscadmin.h"
 
 
 /* Working state for hashbuild and its callback */
@@ -303,6 +304,11 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		buf = so->hashso_curbuf;
 		Assert(BufferIsValid(buf));
 		page = BufferGetPage(buf);
+
+		/*
+		 * We don't need to test for an old snapshot here, as the current buffer is
+		 * pinned, so vacuum can't clean the page.
+		 */
 		maxoffnum = PageGetMaxOffsetNumber(page);
 		for (offnum = ItemPointerGetOffsetNumber(current);
 			 offnum <= maxoffnum;
@@ -623,6 +629,7 @@ loop_top:
 	}
 
 	/* Okay, we're really done.  Update tuple count in metapage. */
+	START_CRIT_SECTION();
 
 	if (orig_maxbucket == metap->hashm_maxbucket &&
 		orig_ntuples == metap->hashm_ntuples)
@@ -649,6 +656,26 @@ loop_top:
 	}
 
 	MarkBufferDirty(metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_update_meta_page xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.ntuples = metap->hashm_ntuples;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, sizeof(SizeOfHashUpdateMetaPage));
+
+		XLogRegisterBuffer(0, metabuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_UPDATE_META_PAGE);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
 	_hash_relbuf(rel, metabuf);
 
 	/* return statistics */
@@ -816,9 +843,40 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			/* No ereport(ERROR) until changes are logged */
+			START_CRIT_SECTION();
+
 			PageIndexMultiDelete(page, deletable, ndeletable);
 			bucket_dirty = true;
 			MarkBufferDirty(buf);
+
+			/* XLOG stuff */
+			if (RelationNeedsWAL(rel))
+			{
+				xl_hash_delete xlrec;
+				XLogRecPtr	recptr;
+
+				xlrec.is_primary_bucket_page = (buf == bucket_buf) ? true : false;
+
+				XLogBeginInsert();
+				XLogRegisterData((char *) &xlrec, SizeOfHashDelete);
+
+				/*
+				 * bucket buffer needs to be registered to ensure that we can
+				 * acquire a cleanup lock on it during replay.
+				 */
+				if (!xlrec.is_primary_bucket_page)
+					XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+				XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
+				XLogRegisterBufData(1, (char *) deletable,
+									ndeletable * sizeof(OffsetNumber));
+
+				recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_DELETE);
+				PageSetLSN(BufferGetPage(buf), recptr);
+			}
+
+			END_CRIT_SECTION();
 		}
 
 		/* bail out if there are no more pages to scan. */
@@ -866,8 +924,25 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		/* No ereport(ERROR) until changes are logged */
+		START_CRIT_SECTION();
+
 		bucket_opaque->hasho_flag &= ~LH_BUCKET_NEEDS_SPLIT_CLEANUP;
 		MarkBufferDirty(bucket_buf);
+
+		/* XLOG stuff */
+		if (RelationNeedsWAL(rel))
+		{
+			XLogRecPtr	recptr;
+
+			XLogBeginInsert();
+			XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+			recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_CLEANUP);
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
 	}
 
 	/*
@@ -881,9 +956,3 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	else
 		LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
 }
-
-void
-hash_redo(XLogReaderState *record)
-{
-	elog(PANIC, "hash_redo: unimplemented");
-}
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
new file mode 100644
index 0000000..aff9b53
--- /dev/null
+++ b/src/backend/access/hash/hash_xlog.c
@@ -0,0 +1,974 @@
+/*-------------------------------------------------------------------------
+ *
+ * hash_xlog.c
+ *	  WAL replay logic for hash index.
+ *
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/hash/hash_xlog.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/hash.h"
+#include "access/hash_xlog.h"
+#include "access/xlogutils.h"
+
+/*
+ * replay a hash index meta page
+ */
+static void
+hash_xlog_init_meta_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Page		page;
+	Buffer		metabuf;
+
+	xl_hash_init_meta_page *xlrec = (xl_hash_init_meta_page *) XLogRecGetData(record);
+
+	/* create the index's metapage */
+	metabuf = XLogInitBufferForRedo(record, 0);
+	Assert(BufferIsValid(metabuf));
+	_hash_init_metabuffer(metabuf, xlrec->num_tuples, xlrec->procid,
+						  xlrec->ffactor, true);
+	page = (Page) BufferGetPage(metabuf);
+	PageSetLSN(page, lsn);
+	MarkBufferDirty(metabuf);
+	/* all done */
+	UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay a hash index bitmap page
+ */
+static void
+hash_xlog_init_bitmap_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		bitmapbuf;
+	Buffer		metabuf;
+	Page		page;
+	HashMetaPage metap;
+	uint32		num_buckets;
+
+	xl_hash_init_bitmap_page *xlrec = (xl_hash_init_bitmap_page *) XLogRecGetData(record);
+
+	/*
+	 * Initialize bitmap page
+	 */
+	bitmapbuf = XLogInitBufferForRedo(record, 0);
+	_hash_initbitmapbuffer(bitmapbuf, xlrec->bmsize, true);
+	PageSetLSN(BufferGetPage(bitmapbuf), lsn);
+	MarkBufferDirty(bitmapbuf);
+	UnlockReleaseBuffer(bitmapbuf);
+
+	/* add the new bitmap page to the metapage's list of bitmaps */
+	if (XLogReadBufferForRedo(record, 1, &metabuf) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Note: in normal operation, we'd update the metapage while still
+		 * holding lock on the new bitmap page.  But during replay it's not
+		 * necessary to hold that lock, since no other index updates can be
+		 * happening concurrently.
+		 */
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+
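+		/*
+		 * The initial bitmap page follows the metapage (block 0) and the
+		 * num_buckets initial bucket pages, which is why its block number
+		 * is num_buckets + 1 below.
+		 */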
+		num_buckets = metap->hashm_maxbucket + 1;
+		metap->hashm_mapp[metap->hashm_nmaps] = num_buckets + 1;
+		metap->hashm_nmaps++;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay a hash index insert without split
+ */
+static void
+hash_xlog_insert(XLogReaderState *record)
+{
+	HashMetaPage metap;
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_insert *xlrec = (xl_hash_insert *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		Size		datalen;
+		char	   *datapos = XLogRecGetBlockData(record, 0, &datalen);
+
+		page = BufferGetPage(buffer);
+
+		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+						false, false) == InvalidOffsetNumber)
+			elog(PANIC, "hash_xlog_insert: failed to add item");
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+
+	if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Note: in normal operation, we'd update the metapage while still
+		 * holding lock on the page we inserted into.  But during replay it's
+		 * not necessary to hold that lock, since no other index updates can
+		 * be happening concurrently.
+		 */
+		page = BufferGetPage(buffer);
+		metap = HashPageGetMeta(page);
+		metap->hashm_ntuples += 1;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
+
+/*
+ * replay addition of overflow page for hash index
+ */
+static void
+hash_xlog_addovflpage(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_addovflpage *xlrec = (xl_hash_addovflpage *) XLogRecGetData(record);
+	Buffer		leftbuf;
+	Buffer		ovflbuf;
+	Buffer		metabuf;
+	BlockNumber leftblk;
+	BlockNumber rightblk;
+	BlockNumber newmapblk = InvalidBlockNumber;
+	Page		ovflpage;
+	HashPageOpaque ovflopaque;
+	uint32	   *num_bucket;
+	char	   *data;
+	Size datalen PG_USED_FOR_ASSERTS_ONLY;
+	bool		new_bmpage = false;
+
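+	/*
+	 * Block references in this record: 0 = new overflow page, 1 = page that
+	 * will point to it, 2 = existing bitmap page whose free bit is used (if
+	 * any), 3 = newly allocated bitmap page (if any), 4 = metapage.
+	 */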
+	XLogRecGetBlockTag(record, 0, NULL, NULL, &rightblk);
+	XLogRecGetBlockTag(record, 1, NULL, NULL, &leftblk);
+
+	ovflbuf = XLogInitBufferForRedo(record, 0);
+	Assert(BufferIsValid(ovflbuf));
+
+	data = XLogRecGetBlockData(record, 0, &datalen);
+	num_bucket = (uint32 *) data;
+	Assert(datalen == sizeof(uint32));
+	_hash_initbuf(ovflbuf, InvalidBlockNumber, *num_bucket, LH_OVERFLOW_PAGE,
+				  true);
+	/* update backlink */
+	ovflpage = BufferGetPage(ovflbuf);
+	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
+	ovflopaque->hasho_prevblkno = leftblk;
+
+	PageSetLSN(ovflpage, lsn);
+	MarkBufferDirty(ovflbuf);
+
+	if (XLogReadBufferForRedo(record, 1, &leftbuf) == BLK_NEEDS_REDO)
+	{
+		Page		leftpage;
+		HashPageOpaque leftopaque;
+
+		leftpage = BufferGetPage(leftbuf);
+		leftopaque = (HashPageOpaque) PageGetSpecialPointer(leftpage);
+		leftopaque->hasho_nextblkno = rightblk;
+
+		PageSetLSN(leftpage, lsn);
+		MarkBufferDirty(leftbuf);
+	}
+
+	/*
+	 * We must not release the locks until both the prev pointer of the
+	 * overflow page and the next pointer of the left page are set,
+	 * otherwise a concurrent read might skip the overflow page.
+	 */
+	if (BufferIsValid(leftbuf))
+		UnlockReleaseBuffer(leftbuf);
+	UnlockReleaseBuffer(ovflbuf);
+
+	/*
+	 * Note: in normal operation, we'd update the bitmap and meta page while
+	 * still holding lock on the overflow pages we updated.  But during replay
+	 * it's not necessary to hold those locks, since no other index updates
+	 * can be happening concurrently.
+	 */
+	if (XLogRecHasBlockRef(record, 2))
+	{
+		Buffer		mapbuffer;
+
+		if (XLogReadBufferForRedo(record, 2, &mapbuffer) == BLK_NEEDS_REDO)
+		{
+			Page		mappage = (Page) BufferGetPage(mapbuffer);
+			uint32	   *freep = NULL;
+			char	   *data;
+			uint32	   *bitmap_page_bit;
+
+			freep = HashPageGetBitmap(mappage);
+
+			data = XLogRecGetBlockData(record, 2, &datalen);
+			bitmap_page_bit = (uint32 *) data;
+
+			SETBIT(freep, *bitmap_page_bit);
+
+			PageSetLSN(mappage, lsn);
+			MarkBufferDirty(mapbuffer);
+		}
+		if (BufferIsValid(mapbuffer))
+			UnlockReleaseBuffer(mapbuffer);
+	}
+
+	if (XLogRecHasBlockRef(record, 3))
+	{
+		Buffer		newmapbuf;
+
+		newmapbuf = XLogInitBufferForRedo(record, 3);
+
+		_hash_initbitmapbuffer(newmapbuf, xlrec->bmsize, true);
+
+		new_bmpage = true;
+		newmapblk = BufferGetBlockNumber(newmapbuf);
+
+		MarkBufferDirty(newmapbuf);
+		PageSetLSN(BufferGetPage(newmapbuf), lsn);
+
+		UnlockReleaseBuffer(newmapbuf);
+	}
+
+	if (XLogReadBufferForRedo(record, 4, &metabuf) == BLK_NEEDS_REDO)
+	{
+		HashMetaPage metap;
+		Page		page;
+		uint32	   *firstfree_ovflpage;
+
+		data = XLogRecGetBlockData(record, 4, &datalen);
+		firstfree_ovflpage = (uint32 *) data;
+
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+		metap->hashm_firstfree = *firstfree_ovflpage;
+
+		if (!xlrec->bmpage_found)
+		{
+			metap->hashm_spares[metap->hashm_ovflpoint]++;
+
+			if (new_bmpage)
+			{
+				Assert(BlockNumberIsValid(newmapblk));
+
+				metap->hashm_mapp[metap->hashm_nmaps] = newmapblk;
+				metap->hashm_nmaps++;
+				metap->hashm_spares[metap->hashm_ovflpoint]++;
+			}
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay allocation of page for split operation
+ */
+static void
+hash_xlog_split_allocpage(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_split_allocpage *xlrec = (xl_hash_split_allocpage *) XLogRecGetData(record);
+	Buffer		oldbuf;
+	Buffer		newbuf;
+	Buffer		metabuf;
+	Size datalen PG_USED_FOR_ASSERTS_ONLY;
+	char	   *data;
+	XLogRedoAction action;
+
+	/* replay the record for old bucket */
+	action = XLogReadBufferForRedo(record, 0, &oldbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		oldpage;
+		HashPageOpaque oldopaque;
+
+		oldpage = BufferGetPage(oldbuf);
+		oldopaque = (HashPageOpaque) PageGetSpecialPointer(oldpage);
+
+		oldopaque->hasho_flag = xlrec->old_bucket_flag;
+		oldopaque->hasho_prevblkno = xlrec->new_bucket;
+
+		PageSetLSN(oldpage, lsn);
+		MarkBufferDirty(oldbuf);
+	}
+
+	/*
+	 * There is no harm in releasing the lock on old bucket before new bucket
+	 * during replay as no other bucket splits can be happening concurrently.
+	 */
+	if (BufferIsValid(oldbuf))
+		UnlockReleaseBuffer(oldbuf);
+
+	/* replay the record for new bucket */
+	newbuf = XLogInitBufferForRedo(record, 1);
+	_hash_initbuf(newbuf, xlrec->new_bucket, xlrec->new_bucket,
+				  xlrec->new_bucket_flag, true);
+	MarkBufferDirty(newbuf);
+	PageSetLSN(BufferGetPage(newbuf), lsn);
+
+	UnlockReleaseBuffer(newbuf);
+
+	/*
+	 * Note: in normal operation, we'd update the meta page while still
+	 * holding lock on the old and new bucket pages.  But during replay it's
+	 * not necessary to hold those locks, since no other bucket splits can be
+	 * happening concurrently.
+	 */
+
+	/* replay the record for metapage changes */
+	if (XLogReadBufferForRedo(record, 2, &metabuf) == BLK_NEEDS_REDO)
+	{
+		Page		page;
+		HashMetaPage metap;
+
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+		metap->hashm_maxbucket = xlrec->new_bucket;
+
+		data = XLogRecGetBlockData(record, 2, &datalen);
+
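+		/*
+		 * The data registered for the metapage contains, in order: the new
+		 * low and high masks (if XLH_SPLIT_META_UPDATE_MASKS is set),
+		 * followed by the new split point and its spares count (if
+		 * XLH_SPLIT_META_UPDATE_SPLITPOINT is set).
+		 */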
+		if (xlrec->flags & XLH_SPLIT_META_UPDATE_MASKS)
+		{
+			uint32		lowmask;
+			uint32	   *highmask;
+
+			/* extract low and high masks. */
+			memcpy(&lowmask, data, sizeof(uint32));
+			highmask = (uint32 *) ((char *) data + sizeof(uint32));
+
+			/* update metapage */
+			metap->hashm_lowmask = lowmask;
+			metap->hashm_highmask = *highmask;
+
+			data += sizeof(uint32) * 2;
+		}
+
+		if (xlrec->flags & XLH_SPLIT_META_UPDATE_SPLITPOINT)
+		{
+			uint32		ovflpoint;
+			uint32	   *ovflpages;
+
+			/* extract information of overflow pages. */
+			memcpy(&ovflpoint, data, sizeof(uint32));
+			ovflpages = (uint32 *) ((char *) data + sizeof(uint32));
+
+			/* update metapage */
+			metap->hashm_spares[ovflpoint] = *ovflpages;
+			metap->hashm_ovflpoint = ovflpoint;
+		}
+
+		MarkBufferDirty(metabuf);
+		PageSetLSN(BufferGetPage(metabuf), lsn);
+	}
+
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay of split operation
+ */
+static void
+hash_xlog_split_page(XLogReaderState *record)
+{
+	Buffer		buf;
+
+	if (XLogReadBufferForRedo(record, 0, &buf) != BLK_RESTORED)
+		elog(ERROR, "Hash split record did not contain a full-page image");
+
+	UnlockReleaseBuffer(buf);
+}
+
+/*
+ * replay completion of split operation
+ */
+static void
+hash_xlog_split_complete(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_split_complete *xlrec = (xl_hash_split_complete *) XLogRecGetData(record);
+	Buffer		oldbuf;
+	Buffer		newbuf;
+	XLogRedoAction action;
+
+	/* replay the record for old bucket */
+	action = XLogReadBufferForRedo(record, 0, &oldbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		oldpage;
+		HashPageOpaque oldopaque;
+
+		oldpage = BufferGetPage(oldbuf);
+		oldopaque = (HashPageOpaque) PageGetSpecialPointer(oldpage);
+
+		oldopaque->hasho_flag = xlrec->old_bucket_flag;
+
+		PageSetLSN(oldpage, lsn);
+		MarkBufferDirty(oldbuf);
+	}
+	if (BufferIsValid(oldbuf))
+		UnlockReleaseBuffer(oldbuf);
+
+	/* replay the record for new bucket */
+	action = XLogReadBufferForRedo(record, 1, &newbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		newpage;
+		HashPageOpaque nopaque;
+
+		newpage = BufferGetPage(newbuf);
+		nopaque = (HashPageOpaque) PageGetSpecialPointer(newpage);
+
+		nopaque->hasho_flag = xlrec->new_bucket_flag;
+
+		PageSetLSN(newpage, lsn);
+		MarkBufferDirty(newbuf);
+	}
+	if (BufferIsValid(newbuf))
+		UnlockReleaseBuffer(newbuf);
+}
+
+/*
+ * replay move of page contents for squeeze operation of hash index
+ */
+static void
+hash_xlog_move_page_contents(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_move_page_contents *xldata = (xl_hash_move_page_contents *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		writebuf = InvalidBuffer;
+	Buffer		deletebuf = InvalidBuffer;
+	XLogRedoAction action;
+
+	/*
+	 * Ensure that we have a cleanup lock on the primary bucket page before
+	 * we start the actual replay of this operation.  This guarantees that
+	 * no scan can start, and none can already be in progress, while we
+	 * replay it; if scans were allowed, they could miss some records or see
+	 * the same record multiple times.
+	 */
+	if (xldata->is_prim_bucket_same_wrt)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &writebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bucketbuf))
+			LockBufferForCleanup(bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &writebuf);
+	}
+
+	/* replay the record for adding entries in overflow buffer */
+	if (action == BLK_NEEDS_REDO)
+	{
+		Page		writepage;
+		char	   *begin;
+		char	   *data;
+		Size		datalen;
+		uint16		ninserted = 0;
+
+		data = begin = XLogRecGetBlockData(record, 1, &datalen);
+
+		writepage = (Page) BufferGetPage(writebuf);
+
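+		/*
+		 * The block data starts with an array of ntups target offsets,
+		 * followed by the tuples themselves.
+		 */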
+		if (xldata->ntups > 0)
+		{
+			OffsetNumber *towrite = (OffsetNumber *) data;
+
+			data += sizeof(OffsetNumber) * xldata->ntups;
+
+			while (data - begin < datalen)
+			{
+				IndexTuple	itup = (IndexTuple) data;
+				Size		itemsz;
+				OffsetNumber l;
+
+				itemsz = IndexTupleDSize(*itup);
+				itemsz = MAXALIGN(itemsz);
+
+				data += itemsz;
+
+				l = PageAddItem(writepage, (Item) itup, itemsz, towrite[ninserted], false, false);
+				if (l == InvalidOffsetNumber)
+					elog(ERROR, "hash_xlog_move_page_contents: failed to add item to hash index page, size %d bytes",
+						 (int) itemsz);
+
+				ninserted++;
+			}
+		}
+
+		/*
+		 * The number of tuples inserted must match what the REDO record says.
+		 */
+		Assert(ninserted == xldata->ntups);
+
+		PageSetLSN(writepage, lsn);
+		MarkBufferDirty(writebuf);
+	}
+
+	/* replay the record for deleting entries from overflow buffer */
+	if (XLogReadBufferForRedo(record, 2, &deletebuf) == BLK_NEEDS_REDO)
+	{
+		Page		page;
+		char	   *ptr;
+		Size		len;
+
+		ptr = XLogRecGetBlockData(record, 2, &len);
+
+		page = (Page) BufferGetPage(deletebuf);
+
+		if (len > 0)
+		{
+			OffsetNumber *unused;
+			OffsetNumber *unend;
+
+			unused = (OffsetNumber *) ptr;
+			unend = (OffsetNumber *) ((char *) ptr + len);
+
+			if ((unend - unused) > 0)
+				PageIndexMultiDelete(page, unused, unend - unused);
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(deletebuf);
+	}
+
+	/*
+	 * Replay is complete, so now we can release the buffers.  We release the
+	 * locks at the end of the replay operation to ensure that the lock on
+	 * the primary bucket page is held until the operation is finished.  We
+	 * could release the lock on the write buffer as soon as we are done with
+	 * it, when it is not the primary bucket page, but that doesn't seem
+	 * worth complicating the code.
+	 */
+	if (BufferIsValid(deletebuf))
+		UnlockReleaseBuffer(deletebuf);
+
+	if (BufferIsValid(writebuf))
+		UnlockReleaseBuffer(writebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay squeeze page operation of hash index
+ */
+static void
+hash_xlog_squeeze_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_squeeze_page *xldata = (xl_hash_squeeze_page *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		writebuf;
+	Buffer		ovflbuf;
+	Buffer		prevbuf = InvalidBuffer;
+	Buffer		mapbuf;
+	XLogRedoAction action;
+
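+	/*
+	 * Block references in this record: 0 = primary bucket page (only if it
+	 * differs from the write page), 1 = write page, 2 = freed overflow
+	 * page, 3 = page before the freed page, 4 = page after the freed page,
+	 * 5 = bitmap page, 6 = metapage (only if hashm_firstfree changed).
+	 */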
+	/*
+	 * Ensure that we have a cleanup lock on the primary bucket page before
+	 * we start the actual replay of this operation.  This guarantees that
+	 * no scan can start, and none can already be in progress, while we
+	 * replay it; if scans were allowed, they could miss some records or see
+	 * the same record multiple times.
+	 */
+	if (xldata->is_prim_bucket_same_wrt)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &writebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bucketbuf))
+			LockBufferForCleanup(bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &writebuf);
+	}
+
+	/* replay the record for adding entries in overflow buffer */
+	if (action == BLK_NEEDS_REDO)
+	{
+		Page		writepage;
+		char	   *begin;
+		char	   *data;
+		Size		datalen;
+		uint16		ninserted = 0;
+
+		data = begin = XLogRecGetBlockData(record, 1, &datalen);
+
+		writepage = (Page) BufferGetPage(writebuf);
+
+		if (xldata->ntups > 0)
+		{
+			OffsetNumber *towrite = (OffsetNumber *) data;
+
+			data += sizeof(OffsetNumber) * xldata->ntups;
+
+			while (data - begin < datalen)
+			{
+				IndexTuple	itup = (IndexTuple) data;
+				Size		itemsz;
+				OffsetNumber l;
+
+				itemsz = IndexTupleDSize(*itup);
+				itemsz = MAXALIGN(itemsz);
+
+				data += itemsz;
+
+				l = PageAddItem(writepage, (Item) itup, itemsz, towrite[ninserted], false, false);
+				if (l == InvalidOffsetNumber)
+					elog(ERROR, "hash_xlog_squeeze_page: failed to add item to hash index page, size %d bytes",
+						 (int) itemsz);
+
+				ninserted++;
+			}
+		}
+
+		/*
+		 * The number of tuples inserted must match what the REDO record says.
+		 */
+		Assert(ninserted == xldata->ntups);
+
+		/*
+		 * If the page we are adding tuples to is the page previous to the
+		 * freed overflow page, update its nextblkno here as well.
+		 */
+		if (xldata->is_prev_bucket_same_wrt)
+		{
+			HashPageOpaque writeopaque = (HashPageOpaque) PageGetSpecialPointer(writepage);
+
+			writeopaque->hasho_nextblkno = xldata->nextblkno;
+		}
+
+		PageSetLSN(writepage, lsn);
+		MarkBufferDirty(writebuf);
+	}
+
+	/* replay the record for initializing overflow buffer */
+	if (XLogReadBufferForRedo(record, 2, &ovflbuf) == BLK_NEEDS_REDO)
+	{
+		Page		ovflpage;
+
+		ovflpage = BufferGetPage(ovflbuf);
+
+		_hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));
+
+		PageSetLSN(ovflpage, lsn);
+		MarkBufferDirty(ovflbuf);
+	}
+	if (BufferIsValid(ovflbuf))
+		UnlockReleaseBuffer(ovflbuf);
+
+	/* replay the record for page previous to the freed overflow page */
+	if (!xldata->is_prev_bucket_same_wrt &&
+		XLogReadBufferForRedo(record, 3, &prevbuf) == BLK_NEEDS_REDO)
+	{
+		Page		prevpage = BufferGetPage(prevbuf);
+		HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+		prevopaque->hasho_nextblkno = xldata->nextblkno;
+
+		PageSetLSN(prevpage, lsn);
+		MarkBufferDirty(prevbuf);
+	}
+	if (BufferIsValid(prevbuf))
+		UnlockReleaseBuffer(prevbuf);
+
+	/* replay the record for page next to the freed overflow page */
+	if (XLogRecHasBlockRef(record, 4))
+	{
+		Buffer		nextbuf;
+
+		if (XLogReadBufferForRedo(record, 4, &nextbuf) == BLK_NEEDS_REDO)
+		{
+			Page		nextpage = BufferGetPage(nextbuf);
+			HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage);
+
+			nextopaque->hasho_prevblkno = xldata->prevblkno;
+
+			PageSetLSN(nextpage, lsn);
+			MarkBufferDirty(nextbuf);
+		}
+		if (BufferIsValid(nextbuf))
+			UnlockReleaseBuffer(nextbuf);
+	}
+
+	/* replay the record for bitmap page */
+	if (XLogReadBufferForRedo(record, 5, &mapbuf) == BLK_NEEDS_REDO)
+	{
+		Page		mappage = (Page) BufferGetPage(mapbuf);
+		uint32	   *freep = NULL;
+		char	   *data;
+		uint32	   *bitmap_page_bit;
+		Size		datalen;
+
+		freep = HashPageGetBitmap(mappage);
+
+		data = XLogRecGetBlockData(record, 5, &datalen);
+		bitmap_page_bit = (uint32 *) data;
+
+		CLRBIT(freep, *bitmap_page_bit);
+
+		PageSetLSN(mappage, lsn);
+		MarkBufferDirty(mapbuf);
+	}
+	if (BufferIsValid(mapbuf))
+		UnlockReleaseBuffer(mapbuf);
+
+	/* replay the record for meta page */
+	if (XLogRecHasBlockRef(record, 6))
+	{
+		Buffer		metabuf;
+
+		if (XLogReadBufferForRedo(record, 6, &metabuf) == BLK_NEEDS_REDO)
+		{
+			HashMetaPage metap;
+			Page		page;
+			char	   *data;
+			uint32	   *firstfree_ovflpage;
+			Size		datalen;
+
+			data = XLogRecGetBlockData(record, 6, &datalen);
+			firstfree_ovflpage = (uint32 *) data;
+
+			page = BufferGetPage(metabuf);
+			metap = HashPageGetMeta(page);
+			metap->hashm_firstfree = *firstfree_ovflpage;
+
+			PageSetLSN(page, lsn);
+			MarkBufferDirty(metabuf);
+		}
+		if (BufferIsValid(metabuf))
+			UnlockReleaseBuffer(metabuf);
+	}
+
+	/*
+	 * We release the locks on writebuf and bucketbuf at the end of the
+	 * replay operation to ensure that the lock on the primary bucket page
+	 * is held until the operation is finished.  We could release the lock
+	 * on the write buffer as soon as we are done with it, when it is not
+	 * the primary bucket page, but that doesn't seem worth complicating
+	 * the code.
+	 */
+	if (BufferIsValid(writebuf))
+		UnlockReleaseBuffer(writebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay delete operation of hash index
+ */
+static void
+hash_xlog_delete(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_delete *xldata = (xl_hash_delete *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		deletebuf;
+	Page		page;
+	XLogRedoAction action;
+
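+	/*
+	 * Block references in this record: 0 = primary bucket page (only if it
+	 * differs from the page being cleaned), 1 = page the tuples are deleted
+	 * from.
+	 */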
+	/*
+	 * Ensure that we have a cleanup lock on the primary bucket page before
+	 * we start the actual replay of this operation.  This guarantees that
+	 * no scan can start, and none can already be in progress, while we
+	 * replay it; if scans were allowed, they could miss some records or see
+	 * the same record multiple times.
+	 */
+	if (xldata->is_primary_bucket_page)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &deletebuf);
+	else
+	{
+		RelFileNode rnode;
+		BlockNumber blkno;
+
+		XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
+
+		bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno,
+										   RBM_NORMAL);
+		if (BufferIsValid(bucketbuf))
+			LockBufferForCleanup(bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &deletebuf);
+	}
+
+	/* replay the record for deleting entries in bucket page */
+	if (action == BLK_NEEDS_REDO)
+	{
+		char	   *ptr;
+		Size		len;
+
+		ptr = XLogRecGetBlockData(record, 1, &len);
+
+		page = (Page) BufferGetPage(deletebuf);
+
+		if (len > 0)
+		{
+			OffsetNumber *unused;
+			OffsetNumber *unend;
+
+			unused = (OffsetNumber *) ptr;
+			unend = (OffsetNumber *) ((char *) ptr + len);
+
+			if ((unend - unused) > 0)
+				PageIndexMultiDelete(page, unused, unend - unused);
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(deletebuf);
+	}
+	if (BufferIsValid(deletebuf))
+		UnlockReleaseBuffer(deletebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay split cleanup flag operation for primary bucket page.
+ */
+static void
+hash_xlog_split_cleanup(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		HashPageOpaque bucket_opaque;
+
+		page = (Page) BufferGetPage(buffer);
+
+		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+		bucket_opaque->hasho_flag &= ~LH_BUCKET_NEEDS_SPLIT_CLEANUP;
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
+
+/*
+ * replay for update meta page
+ */
+static void
+hash_xlog_update_meta_page(XLogReaderState *record)
+{
+	HashMetaPage metap;
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_update_meta_page *xldata = (xl_hash_update_meta_page *) XLogRecGetData(record);
+	Buffer		metabuf;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &metabuf) == BLK_NEEDS_REDO)
+	{
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+
+		metap->hashm_ntuples = xldata->ntuples;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+void
+hash_redo(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			hash_xlog_init_meta_page(record);
+			break;
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			hash_xlog_init_bitmap_page(record);
+			break;
+		case XLOG_HASH_INSERT:
+			hash_xlog_insert(record);
+			break;
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			hash_xlog_addovflpage(record);
+			break;
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			hash_xlog_split_allocpage(record);
+			break;
+		case XLOG_HASH_SPLIT_PAGE:
+			hash_xlog_split_page(record);
+			break;
+		case XLOG_HASH_SPLIT_COMPLETE:
+			hash_xlog_split_complete(record);
+			break;
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			hash_xlog_move_page_contents(record);
+			break;
+		case XLOG_HASH_SQUEEZE_PAGE:
+			hash_xlog_squeeze_page(record);
+			break;
+		case XLOG_HASH_DELETE:
+			hash_xlog_delete(record);
+			break;
+		case XLOG_HASH_SPLIT_CLEANUP:
+			hash_xlog_split_cleanup(record);
+			break;
+		case XLOG_HASH_UPDATE_META_PAGE:
+			hash_xlog_update_meta_page(record);
+			break;
+		default:
+			elog(PANIC, "hash_redo: unknown op code %u", info);
+	}
+}
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 354e733..241728f 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -16,6 +16,8 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "access/hash_xlog.h"
+#include "miscadmin.h"
 #include "utils/rel.h"
 
 
@@ -40,6 +42,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	bool		do_expand;
 	uint32		hashkey;
 	Bucket		bucket;
+	OffsetNumber itup_off;
 
 	/*
 	 * Get the hash key for the item (it's stored in the index tuple itself).
@@ -158,25 +161,20 @@ restart_insert:
 		Assert(pageopaque->hasho_bucket == bucket);
 	}
 
-	/* found page with enough space, so add the item here */
-	(void) _hash_pgaddtup(rel, buf, itemsz, itup);
-
-	/*
-	 * dirty and release the modified page.  if the page we modified was an
-	 * overflow page, we also need to separately drop the pin we retained on
-	 * the primary bucket page.
-	 */
-	MarkBufferDirty(buf);
-	_hash_relbuf(rel, buf);
-	if (buf != bucket_buf)
-		_hash_dropbuf(rel, bucket_buf);
-
 	/*
 	 * Write-lock the metapage so we can increment the tuple count. After
 	 * incrementing it, check to see if it's time for a split.
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Do the update.  No ereport(ERROR) until changes are logged */
+	START_CRIT_SECTION();
+
+	/* found page with enough space, so add the item here */
+	itup_off = _hash_pgaddtup(rel, buf, itemsz, itup);
+	MarkBufferDirty(buf);
+
+	/* metapage operations */
 	metap = HashPageGetMeta(metapage);
 	metap->hashm_ntuples += 1;
 
@@ -184,10 +182,43 @@ restart_insert:
 	do_expand = metap->hashm_ntuples >
 		(double) metap->hashm_ffactor * (metap->hashm_maxbucket + 1);
 
-	/* Write out the metapage and drop lock, but keep pin */
 	MarkBufferDirty(metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_insert xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.offnum = itup_off;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInsert);
+
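+		/*
+		 * No data needs to be registered for the metapage; replay simply
+		 * increments hashm_ntuples by one.
+		 */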
+		XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);
+
+		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+		XLogRegisterBufData(0, (char *) itup, IndexTupleDSize(*itup));
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INSERT);
+
+		PageSetLSN(BufferGetPage(buf), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	/* drop lock on metapage, but keep pin */
 	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 
+	/*
+	 * Release the modified page; if it is not the primary bucket page, also
+	 * drop the pin we retained on the primary bucket page.
+	 */
+	_hash_relbuf(rel, buf);
+	if (buf != bucket_buf)
+		_hash_dropbuf(rel, bucket_buf);
+
 	/* Attempt to split if a split is needed */
 	if (do_expand)
 		_hash_expandtable(rel, metabuf);
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 1087480..3e5a85a 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -18,6 +18,8 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "access/hash_xlog.h"
+#include "miscadmin.h"
 #include "utils/rel.h"
 
 
@@ -94,7 +96,9 @@ _hash_ovflblkno_to_bitno(HashMetaPage metap, BlockNumber ovflblkno)
  *	dropped before exiting (we assume the caller is not interested in 'buf'
  *	anymore) if not asked to retain.  The pin will be retained only for the
  *	primary bucket.  The returned overflow page will be pinned and
- *	write-locked; it is guaranteed to be empty.
+ *	write-locked; it is not guaranteed to be empty, since we release and
+ *	reacquire the lock on it, so the caller must ensure that the page has
+ *	enough space for the tuple it is going to add.
  *
  *	The caller must hold a pin, but no lock, on the metapage buffer.
  *	That buffer is returned in the same state.
@@ -136,6 +140,13 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 	 * page is released, then finally acquire the lock on new overflow buffer.
 	 * We need this locking order to avoid deadlock with backends that are
 	 * doing inserts.
+	 *
+	 * Note: We could have avoided locking many buffers here if we made two
+	 * WAL records for acquiring an overflow page (one to allocate the
+	 * overflow page and another to add it to the overflow bucket chain).
+	 * However, doing so could leak an overflow page if the system crashed
+	 * after the allocation.  Needless to say, a single record is also
+	 * better from a performance point of view.
 	 */
 	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
@@ -303,8 +314,12 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 found:
 
 	/*
-	 * Do the update.
+	 * Do the update.  No ereport(ERROR) until the changes are logged.  We
+	 * log the changes to the bitmap page and the overflow page together so
+	 * that a crash after adding the new page cannot lose it.
 	 */
+	START_CRIT_SECTION();
+
 	if (page_found)
 	{
 		Assert(BufferIsValid(mapbuf));
@@ -362,6 +377,61 @@ found:
 
 	MarkBufferDirty(buf);
 
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+		xl_hash_addovflpage xlrec;
+
+		xlrec.bmpage_found = page_found;
+		xlrec.bmsize = metap->hashm_bmsize;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashAddOvflPage);
+
+		XLogRegisterBuffer(0, ovflbuf, REGBUF_WILL_INIT);
+		XLogRegisterBufData(0, (char *) &pageopaque->hasho_bucket, sizeof(Bucket));
+
+		XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
+
+		if (BufferIsValid(mapbuf))
+		{
+			/*
+			 * Since the bitmap page doesn't have a standard page layout,
+			 * register the changed bit explicitly so it can be applied
+			 * during replay.
+			 */
+			XLogRegisterBuffer(2, mapbuf, REGBUF_STANDARD);
+			XLogRegisterBufData(2, (char *) &bitmap_page_bit, sizeof(uint32));
+		}
+
+		if (BufferIsValid(newmapbuf))
+			XLogRegisterBuffer(3, newmapbuf, REGBUF_WILL_INIT);
+
+		/*
+		 * To replay the meta page changes we could log the entire metapage,
+		 * which does not seem advisable given its size, or we could log the
+		 * individual updated values.  Instead we prefer to perform the same
+		 * operations on the metapage during replay as are done during the
+		 * original operation; that is straightforward and uses less space
+		 * in WAL.
+		 */
+		XLogRegisterBuffer(4, metabuf, REGBUF_STANDARD);
+		XLogRegisterBufData(4, (char *) &metap->hashm_firstfree, sizeof(uint32));
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_ADD_OVFL_PAGE);
+
+		PageSetLSN(BufferGetPage(ovflbuf), recptr);
+		PageSetLSN(BufferGetPage(buf), recptr);
+
+		if (BufferIsValid(mapbuf))
+			PageSetLSN(BufferGetPage(mapbuf), recptr);
+
+		if (BufferIsValid(newmapbuf))
+			PageSetLSN(BufferGetPage(newmapbuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
 	if (retain_pin)
 		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 	else
@@ -375,6 +445,13 @@ found:
 	if (BufferIsValid(newmapbuf))
 		_hash_relbuf(rel, newmapbuf);
 
+	/*
+	 * We need to release and reacquire the lock on the overflow buffer so
+	 * that a standby does not see an intermediate state of it.
+	 */
+	LockBuffer(ovflbuf, BUFFER_LOCK_UNLOCK);
+	LockBuffer(ovflbuf, BUFFER_LOCK_EXCLUSIVE);
+
 	return ovflbuf;
 }
 
@@ -408,7 +485,11 @@ _hash_firstfreebit(uint32 map)
  *	Remove this overflow page from its bucket's chain, and mark the page as
  *	free.  On entry, ovflbuf is write-locked; it is released before exiting.
  *
- *	Add the tuples (itups) to wbuf.
+ *	Add the tuples (itups) to wbuf in this function.  We could do that in
+ *	the caller as well, but doing it here lets us easily write the WAL for
+ *	the XLOG_HASH_SQUEEZE_PAGE operation.  The addition of the tuples and
+ *	the removal of the overflow page have to be done atomically, otherwise
+ *	users might see duplicate records during replay on a standby.
  *
  *	Since this function is invoked in VACUUM, we provide an access strategy
  *	parameter that controls fetches of the bucket pages.
@@ -430,8 +511,6 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	HashMetaPage metap;
 	Buffer		metabuf;
 	Buffer		mapbuf;
-	Buffer		prevbuf = InvalidBuffer;
-	Buffer		nextbuf = InvalidBuffer;
 	BlockNumber ovflblkno;
 	BlockNumber prevblkno;
 	BlockNumber blkno;
@@ -445,6 +524,9 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	int32		bitmappage,
 				bitmapbit;
 	Bucket bucket PG_USED_FOR_ASSERTS_ONLY;
+	Buffer		prevbuf = InvalidBuffer;
+	Buffer		nextbuf = InvalidBuffer;
+	bool		update_metap = false;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -508,6 +590,12 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* This operation needs to log multiple tuples, prepare WAL for that */
+	if (RelationNeedsWAL(rel))
+		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, XLR_NORMAL_RDATAS + nitups);
+
+	START_CRIT_SECTION();
+
 	/*
 	 * we have to insert tuples on the "write" page, being careful to preserve
 	 * hashkey ordering.  (If we insert many tuples into the same "write" page
@@ -519,7 +607,11 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 		MarkBufferDirty(wbuf);
 	}
 
-	/* Initialize the freed overflow page. */
+	/*
+	 * Initialize the freed overflow page.  We can't leave the page
+	 * completely zeroed, because WAL replay routines expect pages to be
+	 * initialized; see the explanation of RBM_NORMAL mode atop
+	 * XLogReadBufferExtended.
+	 */
 	_hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));
 	MarkBufferDirty(ovflbuf);
 
@@ -550,9 +642,83 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	if (ovflbitno < metap->hashm_firstfree)
 	{
 		metap->hashm_firstfree = ovflbitno;
+		update_metap = true;
 		MarkBufferDirty(metabuf);
 	}
 
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_squeeze_page xlrec;
+		XLogRecPtr	recptr;
+		int			i;
+
+		xlrec.prevblkno = prevblkno;
+		xlrec.nextblkno = nextblkno;
+		xlrec.ntups = nitups;
+		xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf) ? true : false;
+		xlrec.is_prev_bucket_same_wrt = (wbuf == prevbuf) ? true : false;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashSqueezePage);
+
+		/*
+		 * bucket buffer needs to be registered to ensure that we can acquire
+		 * a cleanup lock on it during replay.
+		 */
+		if (!xlrec.is_prim_bucket_same_wrt)
+			XLogRegisterBuffer(0, bucketbuf, REGBUF_STANDARD);
+
+		XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
+		if (xlrec.ntups > 0)
+		{
+			XLogRegisterBufData(1, (char *) itup_offsets,
+								nitups * sizeof(OffsetNumber));
+			for (i = 0; i < nitups; i++)
+				XLogRegisterBufData(1, (char *) itups[i], tups_size[i]);
+		}
+
+		XLogRegisterBuffer(2, ovflbuf, REGBUF_STANDARD);
+
+		/*
+		 * If the previous page and the write page (the block into which we
+		 * are moving tuples from the overflow page) are the same, there is
+		 * no need to register the previous page separately; during replay
+		 * we can update its next block number directly on the write page.
+		 */
+		if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt)
+			XLogRegisterBuffer(3, prevbuf, REGBUF_STANDARD);
+
+		if (BufferIsValid(nextbuf))
+			XLogRegisterBuffer(4, nextbuf, REGBUF_STANDARD);
+
+		XLogRegisterBuffer(5, mapbuf, REGBUF_STANDARD);
+		XLogRegisterBufData(5, (char *) &bitmapbit, sizeof(uint32));
+
+		if (update_metap)
+		{
+			XLogRegisterBuffer(6, metabuf, REGBUF_STANDARD);
+			XLogRegisterBufData(6, (char *) &metap->hashm_firstfree, sizeof(uint32));
+		}
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SQUEEZE_PAGE);
+
+		PageSetLSN(BufferGetPage(wbuf), recptr);
+		PageSetLSN(BufferGetPage(ovflbuf), recptr);
+
+		if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt)
+			PageSetLSN(BufferGetPage(prevbuf), recptr);
+		if (BufferIsValid(nextbuf))
+			PageSetLSN(BufferGetPage(nextbuf), recptr);
+
+		PageSetLSN(BufferGetPage(mapbuf), recptr);
+
+		if (update_metap)
+			PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
 	/* release previous bucket if it is not same as write bucket */
 	if (BufferIsValid(prevbuf) && prevblkno != writeblkno)
 		_hash_relbuf(rel, prevbuf);
@@ -601,7 +767,11 @@ _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage)
 	freep = HashPageGetBitmap(pg);
 	MemSet(freep, 0xFF, bmsize);
 
-	/* Set pd_lower just past the end of the bitmap page data. */
+	/*
+	 * Set pd_lower just past the end of the bitmap page data.  We could even
+	 * set pd_lower equal to pd_upper, but this is more precise and makes the
+	 * page look compressible to xlog.c.
+	 */
 	((PageHeader) pg)->pd_lower = ((char *) freep + bmsize) - (char *) pg;
 }
 
@@ -761,6 +931,15 @@ readpage:
 					Assert(nitups == ndeletable);
 
 					/*
+					 * This operation needs to log multiple tuples, prepare
+					 * WAL for that.
+					 */
+					if (RelationNeedsWAL(rel))
+						XLogEnsureRecordSpace(0, XLR_NORMAL_RDATAS + nitups);
+
+					START_CRIT_SECTION();
+
+					/*
 					 * we have to insert tuples on the "write" page, being
 					 * careful to preserve hashkey ordering.  (If we insert
 					 * many tuples into the same "write" page it would be
@@ -773,6 +952,43 @@ readpage:
 					PageIndexMultiDelete(rpage, deletable, ndeletable);
 					MarkBufferDirty(rbuf);
 
+					/* XLOG stuff */
+					if (RelationNeedsWAL(rel))
+					{
+						XLogRecPtr	recptr;
+						xl_hash_move_page_contents xlrec;
+
+						xlrec.ntups = nitups;
+						xlrec.is_prim_bucket_same_wrt = (wbuf == bucket_buf) ? true : false;
+
+						XLogBeginInsert();
+						XLogRegisterData((char *) &xlrec, SizeOfHashMovePageContents);
+
+						/*
+						 * bucket buffer needs to be registered to ensure that
+						 * we can acquire a cleanup lock on it during replay.
+						 */
+						if (!xlrec.is_prim_bucket_same_wrt)
+							XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+						XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
+						XLogRegisterBufData(1, (char *) itup_offsets,
+											nitups * sizeof(OffsetNumber));
+						for (i = 0; i < nitups; i++)
+							XLogRegisterBufData(1, (char *) itups[i], tups_size[i]);
+
+						XLogRegisterBuffer(2, rbuf, REGBUF_STANDARD);
+						XLogRegisterBufData(2, (char *) deletable,
+										  ndeletable * sizeof(OffsetNumber));
+
+						recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_MOVE_PAGE_CONTENTS);
+
+						PageSetLSN(BufferGetPage(wbuf), recptr);
+						PageSetLSN(BufferGetPage(rbuf), recptr);
+					}
+
+					END_CRIT_SECTION();
+
 					tups_moved = true;
 				}
 
@@ -785,13 +1001,24 @@ readpage:
 				else
 					_hash_relbuf(rel, wbuf);
 
+				/*
+				 * We need to release, and if required reacquire, the lock
+				 * on rbuf so that a standby does not see an intermediate
+				 * state of it.  If we did not release the lock, users on a
+				 * standby could see the results of a partial deletion on
+				 * rblkno after replay of XLOG_HASH_MOVE_PAGE_CONTENTS.
+				 */
+				LockBuffer(rbuf, BUFFER_LOCK_UNLOCK);
+
 				/* nothing more to do if we reached the read page */
 				if (rblkno == wblkno)
 				{
-					_hash_relbuf(rel, rbuf);
+					_hash_dropbuf(rel, rbuf);
 					return;
 				}
 
+				LockBuffer(rbuf, BUFFER_LOCK_EXCLUSIVE);
+
 				wbuf = next_wbuf;
 				wpage = BufferGetPage(wbuf);
 				wopaque = (HashPageOpaque) PageGetSpecialPointer(wpage);
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index c73929c..1b4cdb4 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -29,6 +29,7 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "access/hash_xlog.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
@@ -43,6 +44,8 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
 				  HTAB *htab,
 				  uint32 maxbucket,
 				  uint32 highmask, uint32 lowmask);
+static void
+			log_split_page(Relation rel, Buffer buf);
 
 
 /*
@@ -381,6 +384,31 @@ _hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
 	pg = BufferGetPage(metabuf);
 	metap = HashPageGetMeta(pg);
 
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_init_meta_page xlrec;
+		XLogRecPtr	recptr;
+
+		/*
+		 * We could have copied the entire metapage into the WAL record and
+		 * restored it as-is during replay.  However, the metapage is not
+		 * small (more than 600 bytes), so we record just the information
+		 * required to reconstruct it.
+		 */
+		xlrec.num_tuples = num_tuples;
+		xlrec.procid = metap->hashm_procid;
+		xlrec.ffactor = metap->hashm_ffactor;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInitMetaPage);
+		XLogRegisterBuffer(0, metabuf, REGBUF_WILL_INIT);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INIT_META_PAGE);
+
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
 	num_buckets = metap->hashm_maxbucket + 1;
 
 	/*
@@ -405,6 +433,12 @@ _hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
 		buf = _hash_getnewbuf(rel, blkno, forkNum);
 		_hash_initbuf(buf, metap->hashm_maxbucket, i, LH_BUCKET_PAGE, false);
 		MarkBufferDirty(buf);
+
+		log_newpage(&rel->rd_node,
+					forkNum,
+					blkno,
+					BufferGetPage(buf),
+					true);
 		_hash_relbuf(rel, buf);
 	}
 
@@ -431,6 +465,32 @@ _hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
 	metap->hashm_nmaps++;
 	MarkBufferDirty(metabuf);
 
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_init_bitmap_page xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.bmsize = metap->hashm_bmsize;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInitBitmapPage);
+		XLogRegisterBuffer(0, bitmapbuf, REGBUF_WILL_INIT);
+
+		/*
+		 * We could log the exact changes made to the metapage; however,
+		 * since no concurrent operation can see the index during the replay
+		 * of this record, it is simpler to redo during replay the same
+		 * operations that are done here.
+		 */
+		XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INIT_BITMAP_PAGE);
+
+		PageSetLSN(BufferGetPage(bitmapbuf), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
 	/* all done */
 	_hash_relbuf(rel, bitmapbuf);
 	_hash_relbuf(rel, metabuf);
@@ -525,7 +585,10 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 	metap->hashm_ovflpoint = log2_num_buckets;
 	metap->hashm_firstfree = 0;
 
-	/* Set pd_lower just past the end of the metadata. */
+	/*
+	 * Set pd_lower just past the end of the metadata.  This lets
+	 * xloginsert.c omit the unused space between pd_lower and pd_upper when
+	 * it takes a full-page image of the metapage.
+	 */
 	((PageHeader) page)->pd_lower =
 		((char *) metap + sizeof(HashMetaPageData)) - (char *) page;
 }
@@ -569,6 +632,8 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	uint32		maxbucket;
 	uint32		highmask;
 	uint32		lowmask;
+	bool		metap_update_masks = false;
+	bool		metap_update_splitpoint = false;
 
 restart_expand:
 
@@ -728,7 +793,11 @@ restart_expand:
 		 * The number of buckets in the new splitpoint is equal to the total
 		 * number already in existence, i.e. new_bucket.  Currently this maps
 		 * one-to-one to blocks required, but someday we may need a more
-		 * complicated calculation here.
+		 * complicated calculation here.  We treat the allocation of buckets
+		 * as a separate WAL-logged action.  Even if we fail just after it,
+		 * the space is not leaked on a standby, since the next split will
+		 * consume it.  In any case, even without a failure we don't use all
+		 * of the space in a single split operation.
 		 */
 		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
 		{
@@ -757,8 +826,7 @@ restart_expand:
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
 	 * problem: the metapage is effectively corrupt but could get written back
-	 * to disk.  We don't really expect any failure, but just to be sure,
-	 * establish a critical section.
+	 * to disk.
 	 */
 	START_CRIT_SECTION();
 
@@ -772,6 +840,7 @@ restart_expand:
 		/* Starting a new doubling */
 		metap->hashm_lowmask = metap->hashm_highmask;
 		metap->hashm_highmask = new_bucket | metap->hashm_lowmask;
+		metap_update_masks = true;
 	}
 
 	/*
@@ -784,6 +853,7 @@ restart_expand:
 	{
 		metap->hashm_spares[spare_ndx] = metap->hashm_spares[metap->hashm_ovflpoint];
 		metap->hashm_ovflpoint = spare_ndx;
+		metap_update_splitpoint = true;
 	}
 
 	MarkBufferDirty(metabuf);
@@ -829,11 +899,61 @@ restart_expand:
 
 	MarkBufferDirty(buf_nblkno);
 
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_split_allocpage xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.new_bucket = maxbucket;
+		xlrec.old_bucket_flag = oopaque->hasho_flag;
+		xlrec.new_bucket_flag = nopaque->hasho_flag;
+		xlrec.flags = 0;
+
+		XLogBeginInsert();
+
+		XLogRegisterBuffer(0, buf_oblkno, REGBUF_STANDARD);
+		XLogRegisterBuffer(1, buf_nblkno, REGBUF_WILL_INIT);
+		XLogRegisterBuffer(2, metabuf, REGBUF_STANDARD);
+
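+		/*
+		 * The order of the per-buffer data registered below must match the
+		 * order in which hash_xlog_split_allocpage() consumes it during
+		 * replay.
+		 */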
+		if (metap_update_masks)
+		{
+			xlrec.flags |= XLH_SPLIT_META_UPDATE_MASKS;
+			XLogRegisterBufData(2, (char *) &metap->hashm_lowmask, sizeof(uint32));
+			XLogRegisterBufData(2, (char *) &metap->hashm_highmask, sizeof(uint32));
+		}
+
+		if (metap_update_splitpoint)
+		{
+			xlrec.flags |= XLH_SPLIT_META_UPDATE_SPLITPOINT;
+			XLogRegisterBufData(2, (char *) &metap->hashm_ovflpoint,
+								sizeof(uint32));
+			XLogRegisterBufData(2,
+					   (char *) &metap->hashm_spares[metap->hashm_ovflpoint],
+								sizeof(uint32));
+		}
+
+		XLogRegisterData((char *) &xlrec, SizeOfHashSplitAllocPage);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_ALLOCATE_PAGE);
+
+		PageSetLSN(BufferGetPage(buf_oblkno), recptr);
+		PageSetLSN(BufferGetPage(buf_nblkno), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
 	END_CRIT_SECTION();
 
 	/* drop lock, but keep pin */
 	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 
+	/*
+	 * We need to release and reacquire the lock on the new bucket buffer so
+	 * that a standby does not see an intermediate state of it.
+	 */
+	LockBuffer(buf_nblkno, BUFFER_LOCK_UNLOCK);
+	LockBuffer(buf_nblkno, BUFFER_LOCK_EXCLUSIVE);
+
 	/* Relocate records to the new bucket */
 	_hash_splitbucket(rel, metabuf,
 					  old_bucket, new_bucket,
@@ -883,6 +1003,7 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 {
 	BlockNumber lastblock;
 	char		zerobuf[BLCKSZ];
+	Page		page;
 
 	lastblock = firstblock + nblocks - 1;
 
@@ -893,7 +1014,21 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 	if (lastblock < firstblock || lastblock == InvalidBlockNumber)
 		return false;
 
-	MemSet(zerobuf, 0, sizeof(zerobuf));
+	page = (Page) zerobuf;
+
+	/*
+	 * Initialize the new bucket page.  We can't leave the page completely
+	 * zeroed, because WAL replay routines expect pages to be initialized;
+	 * see the explanation of RBM_NORMAL mode atop XLogReadBufferExtended.
+	 */
+	_hash_pageinit(page, BLCKSZ);
+
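+	/*
+	 * Only the last block of the new range is written out and WAL-logged;
+	 * extending the relation up to that block is enough, and the individual
+	 * bucket pages are initialized later, when a split first uses them.
+	 */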
+	if (RelationNeedsWAL(rel))
+		log_newpage(&rel->rd_node,
+					MAIN_FORKNUM,
+					lastblock,
+					zerobuf,
+					true);
 
 	RelationOpenSmgr(rel);
 	smgrextend(rel->rd_smgr, MAIN_FORKNUM, lastblock, zerobuf, false);
@@ -1029,10 +1164,11 @@ _hash_splitbucket(Relation rel,
 				itemsz = IndexTupleDSize(*new_itup);
 				itemsz = MAXALIGN(itemsz);
 
-				if (PageGetFreeSpace(npage) < itemsz)
+				while (PageGetFreeSpace(npage) < itemsz)
 				{
-					/* write out nbuf and drop lock, but keep pin */
-					MarkBufferDirty(nbuf);
+					/* log the split operation before releasing the lock */
+					log_split_page(rel, nbuf);
+
 					/* drop lock, but keep pin */
 					LockBuffer(nbuf, BUFFER_LOCK_UNLOCK);
 					/* chain to a new overflow page */
@@ -1042,6 +1178,13 @@ _hash_splitbucket(Relation rel,
 				}
 
 				/*
+				 * Change the shared buffer state within a critical section,
+				 * otherwise an error here could leave the buffer in a state
+				 * that crash recovery cannot reproduce.
+				 */
+				START_CRIT_SECTION();
+
+				/*
 				 * Insert tuple on new page, using _hash_pgaddtup to ensure
 				 * correct ordering by hashkey.  This is a tad inefficient
 				 * since we may have to shuffle itempointers repeatedly.
@@ -1050,6 +1193,8 @@ _hash_splitbucket(Relation rel,
 				 */
 				(void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
 
+				END_CRIT_SECTION();
+
 				/* be tidy */
 				pfree(new_itup);
 			}
@@ -1073,7 +1218,9 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
-			MarkBufferDirty(nbuf);
+			/* log the split operation before releasing the lock */
+			log_split_page(rel, nbuf);
+
 			if (nbuf == bucket_nbuf)
 				LockBuffer(nbuf, BUFFER_LOCK_UNLOCK);
 			else
@@ -1103,6 +1250,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	START_CRIT_SECTION();
+
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
 	nopaque->hasho_flag &= ~LH_BUCKET_BEING_POPULATED;
 
@@ -1119,6 +1268,29 @@ _hash_splitbucket(Relation rel,
 	 */
 	MarkBufferDirty(bucket_obuf);
 	MarkBufferDirty(bucket_nbuf);
+
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+		xl_hash_split_complete xlrec;
+
+		xlrec.old_bucket_flag = oopaque->hasho_flag;
+		xlrec.new_bucket_flag = nopaque->hasho_flag;
+
+		XLogBeginInsert();
+
+		XLogRegisterData((char *) &xlrec, SizeOfHashSplitComplete);
+
+		XLogRegisterBuffer(0, bucket_obuf, REGBUF_STANDARD);
+		XLogRegisterBuffer(1, bucket_nbuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_COMPLETE);
+
+		PageSetLSN(BufferGetPage(bucket_obuf), recptr);
+		PageSetLSN(BufferGetPage(bucket_nbuf), recptr);
+	}
+
+	END_CRIT_SECTION();
 }
 
 /*
@@ -1245,6 +1417,38 @@ _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Bucket obucket,
 }
 
 /*
+ *	log_split_page() -- Log the split operation
+ *
+ *	We log the split operation whenever a page of the new bucket gets full,
+ *	and at that point we log the entire page image.
+ *
+ *	'buf' must be locked by the caller which is also responsible for unlocking
+ *	it.
+ */
+static void
+log_split_page(Relation rel, Buffer buf)
+{
+	START_CRIT_SECTION();
+
+	MarkBufferDirty(buf);
+
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+
+		XLogBeginInsert();
+
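+		/*
+		 * Force a full-page image; hash_xlog_split_page() relies on the
+		 * page being restored from that image rather than on replaying
+		 * individual changes.
+		 */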
+		XLogRegisterBuffer(0, buf, REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_PAGE);
+
+		PageSetLSN(BufferGetPage(buf), recptr);
+	}
+
+	END_CRIT_SECTION();
+}
+
+/*
  *	_hash_getcachedmetap() -- Returns cached metapage data.
  *
  *	If metabuf is not InvalidBuffer, caller must hold a pin, but no lock, on
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 9e5d7e4..d733770 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -123,6 +123,7 @@ _hash_readnext(IndexScanDesc scan,
 	if (block_found)
 	{
 		*pagep = BufferGetPage(*bufp);
+		TestForOldSnapshot(scan->xs_snapshot, rel, *pagep);
 		*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
 	}
 }
@@ -168,6 +169,7 @@ _hash_readprev(IndexScanDesc scan,
 		*bufp = _hash_getbuf(rel, blkno, HASH_READ,
 							 LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
 		*pagep = BufferGetPage(*bufp);
+		TestForOldSnapshot(scan->xs_snapshot, rel, *pagep);
 		*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
 
 		/*
@@ -283,6 +285,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 
 	buf = _hash_getbucketbuf_from_hashkey(rel, hashkey, HASH_READ, NULL);
 	page = BufferGetPage(buf);
+	TestForOldSnapshot(scan->xs_snapshot, rel, page);
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	bucket = opaque->hasho_bucket;
 
@@ -318,6 +321,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
 		old_buf = _hash_getbuf(rel, old_blkno, HASH_READ, LH_BUCKET_PAGE);
+		TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(old_buf));
 
 		/*
 		 * remember the split bucket buffer so as to use it later for
@@ -520,6 +524,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					_hash_readprev(scan, &buf, &page, &opaque);
 					if (BufferIsValid(buf))
 					{
+						TestForOldSnapshot(scan->xs_snapshot, rel, page);
 						maxoff = PageGetMaxOffsetNumber(page);
 						offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
 					}
diff --git a/src/backend/access/rmgrdesc/hashdesc.c b/src/backend/access/rmgrdesc/hashdesc.c
index 7eac819..f00ec2d 100644
--- a/src/backend/access/rmgrdesc/hashdesc.c
+++ b/src/backend/access/rmgrdesc/hashdesc.c
@@ -19,10 +19,142 @@
 void
 hash_desc(StringInfo buf, XLogReaderState *record)
 {
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			{
+				xl_hash_init_meta_page *xlrec = (xl_hash_init_meta_page *) rec;
+
+				appendStringInfo(buf, "num_tuples %g, fillfactor %d",
+								 xlrec->num_tuples, xlrec->ffactor);
+				break;
+			}
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			{
+				xl_hash_init_bitmap_page *xlrec = (xl_hash_init_bitmap_page *) rec;
+
+				appendStringInfo(buf, "bmsize %d", xlrec->bmsize);
+				break;
+			}
+		case XLOG_HASH_INSERT:
+			{
+				xl_hash_insert *xlrec = (xl_hash_insert *) rec;
+
+				appendStringInfo(buf, "off %u", xlrec->offnum);
+				break;
+			}
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			{
+				xl_hash_addovflpage *xlrec = (xl_hash_addovflpage *) rec;
+
+				appendStringInfo(buf, "bmsize %d, bmpage_found %c",
+						   xlrec->bmsize, (xlrec->bmpage_found) ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			{
+				xl_hash_split_allocpage *xlrec = (xl_hash_split_allocpage *) rec;
+
+				appendStringInfo(buf, "new_bucket %u, meta_page_masks_updated %c, issplitpoint_changed %c",
+								 xlrec->new_bucket,
+					(xlrec->flags & XLH_SPLIT_META_UPDATE_MASKS) ? 'T' : 'F',
+								 (xlrec->flags & XLH_SPLIT_META_UPDATE_SPLITPOINT) ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SPLIT_COMPLETE:
+			{
+				xl_hash_split_complete *xlrec = (xl_hash_split_complete *) rec;
+
+				appendStringInfo(buf, "old_bucket_flag %u, new_bucket_flag %u",
+								 xlrec->old_bucket_flag, xlrec->new_bucket_flag);
+				break;
+			}
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			{
+				xl_hash_move_page_contents *xlrec = (xl_hash_move_page_contents *) rec;
+
+				appendStringInfo(buf, "ntups %d, is_primary %c",
+								 xlrec->ntups,
+								 xlrec->is_prim_bucket_same_wrt ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SQUEEZE_PAGE:
+			{
+				xl_hash_squeeze_page *xlrec = (xl_hash_squeeze_page *) rec;
+
+				appendStringInfo(buf, "prevblkno %u, nextblkno %u, ntups %d, is_primary %c",
+								 xlrec->prevblkno,
+								 xlrec->nextblkno,
+								 xlrec->ntups,
+								 xlrec->is_prim_bucket_same_wrt ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_DELETE:
+			{
+				xl_hash_delete *xlrec = (xl_hash_delete *) rec;
+
+				appendStringInfo(buf, "is_primary %c",
+								 xlrec->is_primary_bucket_page ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_UPDATE_META_PAGE:
+			{
+				xl_hash_update_meta_page *xlrec = (xl_hash_update_meta_page *) rec;
+
+				appendStringInfo(buf, "ntuples %g",
+								 xlrec->ntuples);
+				break;
+			}
+	}
 }
 
 const char *
 hash_identify(uint8 info)
 {
-	return NULL;
+	const char *id = NULL;
+
+	switch (info & ~XLR_INFO_MASK)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			id = "INIT_META_PAGE";
+			break;
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			id = "INIT_BITMAP_PAGE";
+			break;
+		case XLOG_HASH_INSERT:
+			id = "INSERT";
+			break;
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			id = "ADD_OVFL_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			id = "SPLIT_ALLOCATE_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_PAGE:
+			id = "SPLIT_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_COMPLETE:
+			id = "SPLIT_COMPLETE";
+			break;
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			id = "MOVE_PAGE_CONTENTS";
+			break;
+		case XLOG_HASH_SQUEEZE_PAGE:
+			id = "SQUEEZE_PAGE";
+			break;
+		case XLOG_HASH_DELETE:
+			id = "DELETE";
+			break;
+		case XLOG_HASH_SPLIT_CLEANUP:
+			id = "SPLIT_CLEANUP";
+			break;
+		case XLOG_HASH_UPDATE_META_PAGE:
+			id = "UPDATE_META_PAGE";
+			break;
+	}
+
+	return id;
 }
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 72bb06c..9618032 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -506,11 +506,6 @@ DefineIndex(Oid relationId,
 	accessMethodForm = (Form_pg_am) GETSTRUCT(tuple);
 	amRoutine = GetIndexAmRoutine(accessMethodForm->amhandler);
 
-	if (strcmp(accessMethodName, "hash") == 0 &&
-		RelationNeedsWAL(rel))
-		ereport(WARNING,
-				(errmsg("hash indexes are not WAL-logged and their use is discouraged")));
-
 	if (stmt->unique && !amRoutine->amcanunique)
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 9001e20..ce55fc5 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -5880,13 +5880,10 @@ RelationIdIsInInitFile(Oid relationId)
 /*
  * Tells whether any index for the relation is unlogged.
  *
- * Any index using the hash AM is implicitly unlogged.
- *
  * Note: There doesn't seem to be any way to have an unlogged index attached
- * to a permanent table except to create a hash index, but it seems best to
- * keep this general so that it returns sensible results even when they seem
- * obvious (like for an unlogged table) and to handle possible future unlogged
- * indexes on permanent tables.
+ * to a permanent table, but it seems best to keep this general so that it
+ * returns sensible results even when they seem obvious (like for an unlogged
+ * table) and to handle possible future unlogged indexes on permanent tables.
  */
 bool
 RelationHasUnloggedIndex(Relation rel)
@@ -5908,8 +5905,7 @@ RelationHasUnloggedIndex(Relation rel)
 			elog(ERROR, "cache lookup failed for relation %u", indexoid);
 		reltup = (Form_pg_class) GETSTRUCT(tp);
 
-		if (reltup->relpersistence == RELPERSISTENCE_UNLOGGED
-			|| reltup->relam == HASH_AM_OID)
+		if (reltup->relpersistence == RELPERSISTENCE_UNLOGGED)
 			result = true;
 
 		ReleaseSysCache(tp);
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index cc23163..8400022 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -16,7 +16,239 @@
 
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
+#include "storage/off.h"
 
+/* Number of buffers required for XLOG_HASH_SQUEEZE_PAGE operation */
+#define HASH_XLOG_FREE_OVFL_BUFS	6
+
+/*
+ * XLOG records for hash operations
+ */
+#define XLOG_HASH_INIT_META_PAGE	0x00		/* initialize the meta page */
+#define XLOG_HASH_INIT_BITMAP_PAGE	0x10		/* initialize the bitmap page */
+#define XLOG_HASH_INSERT		0x20	/* add index tuple without split */
+#define XLOG_HASH_ADD_OVFL_PAGE 0x30	/* add overflow page */
+#define XLOG_HASH_SPLIT_ALLOCATE_PAGE	0x40	/* allocate new page for split */
+#define XLOG_HASH_SPLIT_PAGE	0x50	/* split page */
+#define XLOG_HASH_SPLIT_COMPLETE	0x60		/* completion of split
+												 * operation */
+#define XLOG_HASH_MOVE_PAGE_CONTENTS	0x70	/* remove tuples from one page
+												 * and add to another page */
+#define XLOG_HASH_SQUEEZE_PAGE	0x80	/* add tuples to one of the previous
+										 * pages in chain and free the ovfl
+										 * page */
+#define XLOG_HASH_DELETE		0x90	/* delete index tuples from a page */
+#define XLOG_HASH_SPLIT_CLEANUP 0xA0	/* clear split-cleanup flag in primary
+										 * bucket page after deleting tuples
+										 * that are moved due to split	*/
+#define XLOG_HASH_UPDATE_META_PAGE	0xB0		/* update meta page after
+												 * vacuum */
+
+
+/*
+ * xl_hash_split_allocpage flag values, 8 bits are available.
+ */
+#define XLH_SPLIT_META_UPDATE_MASKS		(1<<0)
+#define XLH_SPLIT_META_UPDATE_SPLITPOINT		(1<<1)
+
+/*
+ * This is what we need to know about a HASH index create.
+ *
+ * Backup block 0: metapage
+ */
+typedef struct xl_hash_createidx
+{
+	double		num_tuples;
+	RegProcedure procid;
+	uint16		ffactor;
+}	xl_hash_createidx;
+#define SizeOfHashCreateIdx (offsetof(xl_hash_createidx, ffactor) + sizeof(uint16))
+
+/*
+ * This is what we need to know about simple (without split) insert.
+ *
+ * This data record is used for XLOG_HASH_INSERT
+ *
+ * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 1: metapage (HashMetaPageData)
+ */
+typedef struct xl_hash_insert
+{
+	OffsetNumber offnum;
+}	xl_hash_insert;
+
+#define SizeOfHashInsert	(offsetof(xl_hash_insert, offnum) + sizeof(OffsetNumber))
+
+/*
+ * This is what we need to know about addition of overflow page.
+ *
+ * This data record is used for XLOG_HASH_ADD_OVFL_PAGE
+ *
+ * Backup Blk 0: newly allocated overflow page
+ * Backup Blk 1: page before new overflow page in the bucket chain
+ * Backup Blk 2: bitmap page
+ * Backup Blk 3: new bitmap page
+ * Backup Blk 4: metapage
+ */
+typedef struct xl_hash_addovflpage
+{
+	uint16		bmsize;
+	bool		bmpage_found;
+}	xl_hash_addovflpage;
+
+#define SizeOfHashAddOvflPage	\
+	(offsetof(xl_hash_addovflpage, bmpage_found) + sizeof(bool))
+
+/*
+ * This is what we need to know about allocating a page for split.
+ *
+ * This data record is used for XLOG_HASH_SPLIT_ALLOCATE_PAGE
+ *
+ * Backup Blk 0: page for old bucket
+ * Backup Blk 1: page for new bucket
+ * Backup Blk 2: metapage
+ */
+typedef struct xl_hash_split_allocpage
+{
+	uint32		new_bucket;
+	uint16		old_bucket_flag;
+	uint16		new_bucket_flag;
+	uint8		flags;
+}	xl_hash_split_allocpage;
+
+#define SizeOfHashSplitAllocPage	\
+	(offsetof(xl_hash_split_allocpage, flags) + sizeof(uint8))
+
+/*
+ * This is what we need to know about completing the split operation.
+ *
+ * This data record is used for XLOG_HASH_SPLIT_COMPLETE
+ *
+ * Backup Blk 0: page for old bucket
+ * Backup Blk 1: page for new bucket
+ */
+typedef struct xl_hash_split_complete
+{
+	uint16		old_bucket_flag;
+	uint16		new_bucket_flag;
+}	xl_hash_split_complete;
+
+#define SizeOfHashSplitComplete \
+	(offsetof(xl_hash_split_complete, new_bucket_flag) + sizeof(uint16))
+
+/*
+ * This is what we need to know about move page contents required during
+ * squeeze operation.
+ *
+ * This data record is used for XLOG_HASH_MOVE_PAGE_CONTENTS
+ *
+ * Backup Blk 0: bucket page
+ * Backup Blk 1: page containing moved tuples
+ * Backup Blk 2: page from which tuples will be removed
+ */
+typedef struct xl_hash_move_page_contents
+{
+	uint16		ntups;
+	bool		is_prim_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is same as
+												 * primary bucket page */
+}	xl_hash_move_page_contents;
+
+#define SizeOfHashMovePageContents	\
+	(offsetof(xl_hash_move_page_contents, is_prim_bucket_same_wrt) + sizeof(bool))
+
+/*
+ * This is what we need to know about the squeeze page operation.
+ *
+ * This data record is used for XLOG_HASH_SQUEEZE_PAGE
+ *
+ * Backup Blk 0: page containing tuples moved from freed overflow page
+ * Backup Blk 1: freed overflow page
+ * Backup Blk 2: page previous to the freed overflow page
+ * Backup Blk 3: page next to the freed overflow page
+ * Backup Blk 4: bitmap page containing info of freed overflow page
+ * Backup Blk 5: meta page
+ */
+typedef struct xl_hash_squeeze_page
+{
+	BlockNumber prevblkno;
+	BlockNumber nextblkno;
+	uint16		ntups;
+	bool		is_prim_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is same as
+												 * primary bucket page */
+	bool		is_prev_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is the
+												 * page previous to the freed
+												 * overflow page */
+}	xl_hash_squeeze_page;
+
+#define SizeOfHashSqueezePage	\
+	(offsetof(xl_hash_squeeze_page, is_prev_bucket_same_wrt) + sizeof(bool))
+
+/*
+ * This is what we need to know about the deletion of index tuples from a page.
+ *
+ * This data record is used for XLOG_HASH_DELETE
+ *
+ * Backup Blk 0: primary bucket page
+ * Backup Blk 1: page from which tuples are deleted
+ */
+typedef struct xl_hash_delete
+{
+	bool		is_primary_bucket_page; /* TRUE if the operation is for
+										 * primary bucket page */
+}	xl_hash_delete;
+
+#define SizeOfHashDelete	(offsetof(xl_hash_delete, is_primary_bucket_page) + sizeof(bool))
+
+/*
+ * This is what we need for metapage update operation.
+ *
+ * This data record is used for XLOG_HASH_UPDATE_META_PAGE
+ *
+ * Backup Blk 0: meta page
+ */
+typedef struct xl_hash_update_meta_page
+{
+	double		ntuples;
+}	xl_hash_update_meta_page;
+
+#define SizeOfHashUpdateMetaPage	\
+	(offsetof(xl_hash_update_meta_page, ntuples) + sizeof(double))
+
+/*
+ * This is what we need to initialize metapage.
+ *
+ * This data record is used for XLOG_HASH_INIT_META_PAGE
+ *
+ * Backup Blk 0: meta page
+ */
+typedef struct xl_hash_init_meta_page
+{
+	double		num_tuples;
+	RegProcedure procid;
+	uint16		ffactor;
+}	xl_hash_init_meta_page;
+
+#define SizeOfHashInitMetaPage		\
+	(offsetof(xl_hash_init_meta_page, ffactor) + sizeof(uint16))
+
+/*
+ * This is what we need to initialize bitmap page.
+ *
+ * This data record is used for XLOG_HASH_INIT_BITMAP_PAGE
+ *
+ * Backup Blk 0: bitmap page
+ * Backup Blk 1: meta page
+ */
+typedef struct xl_hash_init_bitmap_page
+{
+	uint16		bmsize;
+}	xl_hash_init_bitmap_page;
+
+#define SizeOfHashInitBitmapPage	\
+	(offsetof(xl_hash_init_bitmap_page, bmsize) + sizeof(uint16))
 
 extern void hash_redo(XLogReaderState *record);
 extern void hash_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index e519fdb..26cd059 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -2335,13 +2335,9 @@ Options: fastupdate=on, gin_pending_list_limit=128
 -- HASH
 --
 CREATE INDEX hash_i4_index ON hash_i4_heap USING hash (random int4_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_name_index ON hash_name_heap USING hash (random name_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_txt_index ON hash_txt_heap USING hash (random text_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_f8_index ON hash_f8_heap USING hash (random float8_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE UNLOGGED TABLE unlogged_hash_table (id int4);
 CREATE INDEX unlogged_hash_index ON unlogged_hash_table USING hash (id int4_ops);
 DROP TABLE unlogged_hash_table;
@@ -2350,7 +2346,6 @@ DROP TABLE unlogged_hash_table;
 -- maintenance_work_mem setting and fillfactor:
 SET maintenance_work_mem = '1MB';
 CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 EXPLAIN (COSTS OFF)
 SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
                       QUERY PLAN                       
diff --git a/src/test/regress/expected/enum.out b/src/test/regress/expected/enum.out
index 514d1d0..0e60304 100644
--- a/src/test/regress/expected/enum.out
+++ b/src/test/regress/expected/enum.out
@@ -383,7 +383,6 @@ DROP INDEX enumtest_btree;
 -- Hash index / opclass with the = operator
 --
 CREATE INDEX enumtest_hash ON enumtest USING hash (col);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 SELECT * FROM enumtest WHERE col = 'orange';
   col   
 --------
diff --git a/src/test/regress/expected/hash_index.out b/src/test/regress/expected/hash_index.out
index f8b9f02..0a18efa 100644
--- a/src/test/regress/expected/hash_index.out
+++ b/src/test/regress/expected/hash_index.out
@@ -201,7 +201,6 @@ SELECT h.seqno AS f20000
 --
 CREATE TABLE hash_split_heap (keycol INT);
 CREATE INDEX hash_split_index on hash_split_heap USING HASH (keycol);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 INSERT INTO hash_split_heap SELECT 1 FROM generate_series(1, 70000) a;
 VACUUM FULL hash_split_heap;
 -- Let's do a backward scan.
@@ -230,5 +229,4 @@ DROP TABLE hash_temp_heap CASCADE;
 CREATE TABLE hash_heap_float4 (x float4, y int);
 INSERT INTO hash_heap_float4 VALUES (1.1,1);
 CREATE INDEX hash_idx ON hash_heap_float4 USING hash (x);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 DROP TABLE hash_heap_float4 CASCADE;
diff --git a/src/test/regress/expected/macaddr.out b/src/test/regress/expected/macaddr.out
index e84ff5f..151f9ce 100644
--- a/src/test/regress/expected/macaddr.out
+++ b/src/test/regress/expected/macaddr.out
@@ -41,7 +41,6 @@ SELECT * FROM macaddr_data;
 
 CREATE INDEX macaddr_data_btree ON macaddr_data USING btree (b);
 CREATE INDEX macaddr_data_hash ON macaddr_data USING hash (b);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 SELECT a, b, trunc(b) FROM macaddr_data ORDER BY 2, 1;
  a  |         b         |       trunc       
 ----+-------------------+-------------------
diff --git a/src/test/regress/expected/replica_identity.out b/src/test/regress/expected/replica_identity.out
index fa63235..67c34a9 100644
--- a/src/test/regress/expected/replica_identity.out
+++ b/src/test/regress/expected/replica_identity.out
@@ -12,7 +12,6 @@ CREATE UNIQUE INDEX test_replica_identity_keyab_key ON test_replica_identity (ke
 CREATE UNIQUE INDEX test_replica_identity_oid_idx ON test_replica_identity (oid);
 CREATE UNIQUE INDEX test_replica_identity_nonkey ON test_replica_identity (keya, nonkey);
 CREATE INDEX test_replica_identity_hash ON test_replica_identity USING hash (nonkey);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE UNIQUE INDEX test_replica_identity_expr ON test_replica_identity (keya, keyb, (3));
 CREATE UNIQUE INDEX test_replica_identity_partial ON test_replica_identity (keya, keyb) WHERE keyb != '3';
 -- default is 'd'/DEFAULT for user created tables
diff --git a/src/test/regress/expected/uuid.out b/src/test/regress/expected/uuid.out
index 423f277..db66dc7 100644
--- a/src/test/regress/expected/uuid.out
+++ b/src/test/regress/expected/uuid.out
@@ -114,7 +114,6 @@ SELECT COUNT(*) FROM guid1 WHERE guid_field >= '22222222-2222-2222-2222-22222222
 -- btree and hash index creation test
 CREATE INDEX guid1_btree ON guid1 USING BTREE (guid_field);
 CREATE INDEX guid1_hash  ON guid1 USING HASH  (guid_field);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 -- unique index test
 CREATE UNIQUE INDEX guid1_unique_BTREE ON guid1 USING BTREE (guid_field);
 -- should fail
-- 
1.8.4.msysgit.0
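
As an illustrative sketch of how the record types declared above are meant to
be used (this is not code from the patch; buf, metabuf, itup, and itup_off are
placeholder variables, and the surrounding critical section is omitted for
brevity), a simple insertion would be logged on the primary roughly like this:

    if (RelationNeedsWAL(rel))
    {
        xl_hash_insert xlrec;
        XLogRecPtr     recptr;

        xlrec.offnum = itup_off;

        XLogBeginInsert();
        XLogRegisterData((char *) &xlrec, SizeOfHashInsert);

        /* Backup Blk 0: bucket or overflow page receiving the tuple */
        XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
        /* the tuple travels with block 0 so redo can re-insert it */
        XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));

        /* Backup Blk 1: metapage, whose tuple count is bumped */
        XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);

        recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INSERT);

        PageSetLSN(BufferGetPage(buf), recptr);
        PageSetLSN(BufferGetPage(metabuf), recptr);
    }

On replay, hash_redo() dispatches on the same info bits shown in
hash_identify() above, and the strings built by hash_desc() are what tools
such as pg_xlogdump display for these records.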

#80Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#79)
Re: Write Ahead Logging for Hash Indexes

On Tue, Feb 28, 2017 at 7:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Yeah, actually those were added later in the Enable-WAL-for-Hash* patch,
but since this patch is standalone, I think we should not remove them
from their existing usage, so I have added those back and rebased the
remaining patches.

Great, thanks. 0001 looks good to me now, so committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#81Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#80)
Re: Write Ahead Logging for Hash Indexes

On Wed, Mar 1, 2017 at 4:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Feb 28, 2017 at 7:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Yeah, actually those were added later in the Enable-WAL-for-Hash* patch,
but since this patch is standalone, I think we should not remove them
from their existing usage, so I have added those back and rebased the
remaining patches.

Great, thanks. 0001 looks good to me now, so committed.

Committed 0002.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#82Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#81)
Re: Write Ahead Logging for Hash Indexes

On Tue, Mar 7, 2017 at 5:08 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Mar 1, 2017 at 4:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Feb 28, 2017 at 7:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Yeah, actually those were added later in the Enable-WAL-for-Hash* patch,
but since this patch is standalone, I think we should not remove them
from their existing usage, so I have added those back and rebased the
remaining patches.

Great, thanks. 0001 looks good to me now, so committed.

Committed 0002.

Here are some initial review thoughts on 0003 based on a first read-through.

+ affected by this setting. Example include system catalogs. For

Not grammatical. Perhaps "affected by this setting, such as system catalogs".

-    mark meta page dirty and release buffer content lock and pin
+     mark meta page dirty and write WAL for update of metapage
+     release buffer content lock and pin

Was the change in the indentation level intentional?

+accomodated in the initial set of buckets and it is not unusual to have large

Spelling.

+As deletion involves multiple atomic operations, it is quite possible that
+system crashes after (a) removing tuples from some of the bucket pages
+(b) before clearing the garbage flag (c) before updating the metapage.  If the

Need two commas and an "or": (a) removing tuples from some of the
bucket pages, (b) before clearing the garbage flag, or (c)

+quite possible, that system crashes before completing the operation on

No comma. Also, "that the system crashes".

+the index will remain bloated and can impact performance of read and

and this can impact

+    /*
+     * we need to release and reacquire the lock on overflow buffer to ensure
+     * that standby shouldn't see an intermediate state of it.
+     */
+    LockBuffer(ovflbuf, BUFFER_LOCK_UNLOCK);
+    LockBuffer(ovflbuf, BUFFER_LOCK_EXCLUSIVE);

I still think this is a bad idea. Releasing and reacquiring the lock
on the master doesn't prevent the standby from seeing intermediate
states; the comment, if I understand correctly, is just plain wrong.
What it does do is make the code on the master more complicated, since
now the page returned by _hash_addovflpage might theoretically fill up
while you are busy releasing and reacquiring the lock.

+ *    Add the tuples (itups) to wbuf in this function, we could do that in the
+ *    caller as well.  The advantage of doing it here is we can easily write

Change comma to period. Then: "We could easily do that in the caller,
but the advantage..."

+ * Initialise the freed overflow page, here we can't complete zeroed the

Don't use British spelling, and don't use a comma to join two
sentences. Also "here we can't complete zeroed" doesn't seem right; I
don't know what that means.

+         * To replay meta page changes, we can log the entire metapage which
+         * doesn't seem advisable considering size of hash metapage or we can
+         * log the individual updated values which seems doable, but we prefer
+         * to perform exact operations on metapage during replay as are done
+         * during actual operation.  That looks straight forward and has an
+         * advantage of using lesser space in WAL.

This sounds like it is describing two alternatives that were not
selected and then the one that was selected, but I don't understand
the difference between the second non-selected alternative and what
was actually picked. I'd just write "We need only log the one field
from the metapage that has changed." or maybe drop the comment
altogether.

+ XLogEnsureRecordSpace(0, XLR_NORMAL_RDATAS + nitups);

The use of XLR_NORMAL_RDATAS here seems wrong. Presumably you should
pass the number of rdatas you will actually need, which is some
constant plus nitups. I think you should just write that constant
here directly, whether or not it happens to be equal to
XLR_NORMAL_RDATAS.
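
As a sketch of that suggestion (the leading constant here is made up rather
than taken from the patch), the call would become something like:

    XLogEnsureRecordSpace(0, 4 + nitups);   /* fixed chunks plus one per tuple */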

+                /*
+                 * We need to release and if required reacquire the lock on
+                 * rbuf to ensure that standby shouldn't see an intermediate
+                 * state of it.  If we don't release the lock, after replay of
+                 * XLOG_HASH_MOVE_PAGE_CONTENTS on standby users will
be able to
+                 * view the results of partial deletion on rblkno.
+                 */
+                LockBuffer(rbuf, BUFFER_LOCK_UNLOCK);

If you DO release the lock, this will STILL be true, because what
matters on the standby is what the redo code does. This code doesn't
even execute on the standby, so how could it possibly affect what
intermediate states are visible there?

+        /*
+         * Here, we could have copied the entire metapage as well to WAL
+         * record and restore it as it is during replay.  However the size of
+         * metapage is not small (more than 600 bytes), so recording just the
+         * information required to construct metapage.
+         */

Again, this seems sorta obvious. You've got a bunch of these comments
that say similar things. I'd either make them shorter or just remove
them.

+         * We can log the exact changes made to meta page, however as no
+         * concurrent operation could see the index during the replay of this
+         * record, we can perform the operations during replay as they are
+         * done here.

Don't use a comma to join two sentences. Also, I don't understand
anything up to the "however". I think you mean something like "We
don't need hash index creation to be atomic from the point of view of
replay, because until the creating transaction commits it's not
visible to anything on the standby anyway."

+     * Set pd_lower just past the end of the metadata.  This is to log
+     * full_page_image of metapage in xloginsert.c.

Why the underscores?

+ * won't be a leak on standby, as next split will consume this space.

The master and the standby had better be the same, so I don't know
what it means to talk about a leak on the standby.

+                 * Change the shared buffer state in critical section,
+                 * otherwise any error could make it unrecoverable after
+                 * recovery.

Uh, what? We only recover things during recovery. Nothing gets
recovered after recovery.

Maybe this could get some tests, via the isolation tester, for things
like snapshot too old?

I haven't gone through all of this in detail yet; I think most of it
is right on target. But there is a lot more to look at yet.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#83Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#81)
Re: Write Ahead Logging for Hash Indexes

On Wed, Mar 8, 2017 at 3:38 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Mar 1, 2017 at 4:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Feb 28, 2017 at 7:31 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Yeah, actually those were added later in the Enable-WAL-for-Hash* patch,
but since this patch is standalone, I think we should not remove them
from their existing usage, so I have added those back and rebased the
remaining patches.

Great, thanks. 0001 looks good to me now, so committed.

Committed 0002.

Thanks.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#84Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#82)
Re: Write Ahead Logging for Hash Indexes

On Wed, Mar 8, 2017 at 5:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Here are some initial review thoughts on 0003 based on a first read-through.

+    /*
+     * we need to release and reacquire the lock on overflow buffer to ensure
+     * that standby shouldn't see an intermediate state of it.
+     */
+    LockBuffer(ovflbuf, BUFFER_LOCK_UNLOCK);
+    LockBuffer(ovflbuf, BUFFER_LOCK_EXCLUSIVE);

I still think this is a bad idea. Releasing and reacquiring the lock
on the master doesn't prevent the standby from seeing intermediate
states; the comment, if I understand correctly, is just plain wrong.

I can remove this release-and-reacquire stuff, but first let me try
to explain what intermediate state I am talking about here. If we
don't release and reacquire the lock here, the standby could see an
empty overflow page at that location, whereas the master will be able
to see the new page only after the insertion of tuple(s) in that
bucket is finished. I understand that this change has little
functional impact, but I still wanted to explain why I added this
code. I am okay with removing it if you still feel this is just a
marginal case and we should not bother about it. Let me know what you
think.

+ * Initialise the freed overflow page, here we can't complete zeroed the

Don't use British spelling, and don't use a comma to join two
sentences. Also "here we can't complete zeroed" doesn't seem right; I
don't know what that means.

What the comment means to say is that we can't initialize page as zero
(something like below)
MemSet(ovflpage, 0, BufferGetPageSize(ovflbuf));

typo in sentence
/complete/completely

+         * To replay meta page changes, we can log the entire metapage which
+         * doesn't seem advisable considering size of hash metapage or we can
+         * log the individual updated values which seems doable, but we prefer
+         * to perform exact operations on metapage during replay as are done
+         * during actual operation.  That looks straight forward and has an
+         * advantage of using lesser space in WAL.

This sounds like it is describing two alternatives that were not
selected and then the one that was selected, but I don't understand
the difference between the second non-selected alternative and what
was actually picked.

Actually, the second alternative was picked.

I'd just write "We need only log the one field
from the metapage that has changed." or maybe drop the comment
altogether.

I have added the comment so that others can understand why we don't
want to log the entire meta page (I think in some other indexes, we
always log the entire meta page, so such a question can arise).
However, after reading your comment, I think it is overkill and I
prefer to drop the entire comment unless you feel differently.

+                /*
+                 * We need to release and if required reacquire the lock on
+                 * rbuf to ensure that standby shouldn't see an intermediate
+                 * state of it.  If we don't release the lock, after replay of
+                 * XLOG_HASH_MOVE_PAGE_CONTENTS on standby users will
be able to
+                 * view the results of partial deletion on rblkno.
+                 */
+                LockBuffer(rbuf, BUFFER_LOCK_UNLOCK);

If you DO release the lock, this will STILL be true, because what
matters on the standby is what the redo code does.

That's right; what I intend to do here is to release the lock on the
master at the point where it will be released on the standby. In this
case, we can't ensure that what a user can see on the master is the
same as on the standby after this WAL record is replayed, because on
the master we hold an exclusive lock on the write buf, so no one can
read the contents of the read buf (the location of the read buf will
be after the write buf), whereas on the standby it will be possible to
read the contents of the read buf. I think this is not a correctness
issue, so we might want to leave it as it is; what do you say?

+         * We can log the exact changes made to meta page, however as no
+         * concurrent operation could see the index during the replay of this
+         * record, we can perform the operations during replay as they are
+         * done here.

Don't use a comma to join two sentences. Also, I don't understand
anything up to the "however".

I mean the below changes made to the meta buf (refer to the existing code):
metap->hashm_mapp[metap->hashm_nmaps] = num_buckets + 1;

metap->hashm_nmaps++;

I think it will be clear if you look at both the DO and REDO operations of
XLOG_HASH_INIT_BITMAP_PAGE. Let me know if you still think that the
comment needs to be changed?

+     * Set pd_lower just past the end of the metadata.  This is to log
+     * full_page_image of metapage in xloginsert.c.

Why the underscores?

No specific reason, just trying to resemble full_page_writes, but can
change if you feel it doesn't make much sense.

+ * won't be a leak on standby, as next split will consume this space.

The master and the standby had better be the same,

That's right.

so I don't know
what it means to talk about a leak on the standby.

I just want to say that even if we fail after allocation of buckets
and before the split operation is completed, the newly allocated
buckets will neither be leaked on master nor on standby. Do you think
we should add a comment for same or is it too obvious? How about
changing it to something like:

"Even if we fail after this operation, it won't leak buckets, as next
split will consume this space. In any case, even without failure, we
don't use all the space in one split operation."

+                 * Change the shared buffer state in critical section,
+                 * otherwise any error could make it unrecoverable after
+                 * recovery.

Uh, what? We only recover things during recovery. Nothing gets
recovered after recovery.

Right, how about "Change the shared buffer state in a critical
section, otherwise any error could make the page unrecoverable."

Maybe this could get some tests, via the isolation tester, for things
like snapshot too old?

Okay, I can try, but note that currently there is no test related to
"snapshot too old" for any other indexes.

I haven't gone through all of this in detail yet; I think most of it
is right on target. But there is a lot more to look at yet.

Thanks for your valuable comments.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#85Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#84)
Re: Write Ahead Logging for Hash Indexes

On Wed, Mar 8, 2017 at 7:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I still think this is a bad idea. Releasing and reacquiring the lock
on the master doesn't prevent the standby from seeing intermediate
states; the comment, if I understand correctly, is just plain wrong.

I can remove this release-and-reacquire stuff, but first let me try
to explain what intermediate state I am talking about here. If we
don't release and reacquire the lock here, the standby could see an
empty overflow page at that location, whereas the master will be able
to see the new page only after the insertion of tuple(s) in that
bucket is finished.

That's true, but you keep speaking as if the release and reacquire
prevents the standby from seeing that state. It doesn't. Quite the
opposite: it allows the master to see that intermediate state as well.
I don't think there is a problem with the standby seeing an
intermediate state that can't be seen on the master, provided we've
set up the code to handle that intermediate state properly, which
hopefully we have.

+ * Initialise the freed overflow page, here we can't complete zeroed the

Don't use British spelling, and don't use a comma to join two
sentences. Also "here we can't complete zeroed" doesn't seem right; I
don't know what that means.

What the comment means to say is that we can't initialize page as zero
(something like below)
MemSet(ovflpage, 0, BufferGetPageSize(ovflbuf));

typo in sentence
/complete/completely

I think you can just say we initialize the page. Saying we can't
initialize it by zeroing it seems pretty obvious, both from the code
and from the general theory of how stuff works. If you did want to
explain it, I think you'd write something like "Just zeroing the page
won't work, because <reasons>."

+                /*
+                 * We need to release and if required reacquire the lock on
+                 * rbuf to ensure that standby shouldn't see an intermediate
+                 * state of it.  If we don't release the lock, after replay of
+                 * XLOG_HASH_MOVE_PAGE_CONTENTS on standby users will
be able to
+                 * view the results of partial deletion on rblkno.
+                 */
+                LockBuffer(rbuf, BUFFER_LOCK_UNLOCK);

If you DO release the lock, this will STILL be true, because what
matters on the standby is what the redo code does.

That's right; what I intend to do here is to release the lock on the
master at the point where it will be released on the standby. In this
case, we can't ensure that what a user can see on the master is the
same as on the standby after this WAL record is replayed, because on
the master we hold an exclusive lock on the write buf, so no one can
read the contents of the read buf (the location of the read buf will
be after the write buf), whereas on the standby it will be possible to
read the contents of the read buf. I think this is not a correctness
issue, so we might want to leave it as it is; what do you say?

Well, as in the case above, you're doing extra work to make sure that
every state which can be observed on the standby can also be observed
on the master. But that has no benefit, and it does have a cost: you
have to release and reacquire a lock that you could otherwise retain,
saving CPU cycles and code complexity. So I think the way you've got
it is not good.

+         * We can log the exact changes made to meta page, however as no
+         * concurrent operation could see the index during the replay of this
+         * record, we can perform the operations during replay as they are
+         * done here.

Don't use a comma to join two sentences. Also, I don't understand
anything up to the "however".

I mean the below changes made to the meta buf (refer to the existing code):
metap->hashm_mapp[metap->hashm_nmaps] = num_buckets + 1;

metap->hashm_nmaps++;

I think it will be clear if you look at both the DO and REDO operations of
XLOG_HASH_INIT_BITMAP_PAGE. Let me know if you still think that the
comment needs to be changed?

I think it's not very clear. I think what this boils down to is that
this is another place where you've got a comment to explain that you
didn't log the entire metapage, but I don't think anyone would expect
that anyway, so it doesn't seem like a very important thing to
explain. If you want a comment here, maybe something like /* This is
safe only because nobody else can be modifying the index at this
stage; it's only visible to the transaction that is creating it */

+     * Set pd_lower just past the end of the metadata.  This is to log
+     * full_page_image of metapage in xloginsert.c.

Why the underscores?

No specific reason, just trying to resemble full_page_writes, but can
change if you feel it doesn't make much sense.

Well, usually I don't separate_words_in_a_sentence_with_underscores.
It makes sense to do that if you are referring to an identifier in the
code with that name, but otherwise not.

+ * won't be a leak on standby, as next split will consume this space.

The master and the standby had better be the same,

That's right.

so I don't know
what it means to talk about a leak on the standby.

I just want to say that even if we fail after allocation of buckets
and before the split operation is completed, the newly allocated
buckets will neither be leaked on master nor on standby. Do you think
we should add a comment for same or is it too obvious? How about
changing it to something like:

"Even if we fail after this operation, it won't leak buckets, as next
split will consume this space. In any case, even without failure, we
don't use all the space in one split operation."

Sure.

+                 * Change the shared buffer state in critical section,
+                 * otherwise any error could make it unrecoverable after
+                 * recovery.

Uh, what? We only recover things during recovery. Nothing gets
recovered after recovery.

Right, how about "Change the shared buffer state in a critical
section, otherwise any error could make the page unrecoverable."

Sure.

Maybe this could get some tests, via the isolation tester, for things
like snapshot too old?

Okay, I can try, but note that currently there is no test related to
"snapshot too old" for any other indexes.

Wow, that's surprising. It seems the snapshot_too_old test only
checks that this works for a table that has no indexes. Have you,
anyway, tested it manually?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#86Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#82)
Re: Write Ahead Logging for Hash Indexes

On Tue, Mar 7, 2017 at 6:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Great, thanks. 0001 looks good to me now, so committed.

Committed 0002.

Here are some initial review thoughts on 0003 based on a first read-through.

More thoughts on the main patch:

The text you've added to the hash index README throws around the term
"single WAL entry" pretty freely. For example: "An insertion that
causes an addition of an overflow page is logged as a single WAL entry
preceded by a WAL entry for new overflow page required to insert a
tuple." When you say a single WAL entry, you make it sound like
there's only one, but because it's preceded by another WAL entry,
there are actually two. I would rephrase this to just avoid using the
word "single" this way. For example: "If an insertion causes the
addition of an overflow page, there will be one WAL entry for the new
overflow page and a second entry for the insert itself."

+mode anyway).  It would seem natural to complete the split in VACUUM, but since
+splitting a bucket might require allocating a new page, it might fail if you
+run out of disk space.  That would be bad during VACUUM - the reason for
+running VACUUM in the first place might be that you run out of disk space,
+and now VACUUM won't finish because you're out of disk space.  In contrast,
+an insertion can require enlarging the physical file anyway.

But VACUUM actually does try to complete the splits, so this is wrong, I think.

+Squeeze operation moves tuples from one of the buckets later in the chain to

A squeeze operation

+As Squeeze operation involves writing multiple atomic operations, it is

As a squeeze operation involves multiple atomic operations

+                if (!xlrec.is_primary_bucket_page)
+                    XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);

So, in hashbucketcleanup, the page is registered and therefore an FPI
will be taken if this is the first reference to this page since the
checkpoint, but hash_xlog_delete won't restore the FPI. I think that
exposes us to a torn-page hazard.
_hash_freeovflpage/hash_xlog_squeeze_page seems to have the same
disease, as does _hash_squeezebucket/hash_xlog_move_page_contents.
Not sure exactly what the fix should be.

+            /*
+             * As bitmap page doesn't have standard page layout, so this will
+             * allow us to log the data.
+             */
+            XLogRegisterBuffer(2, mapbuf, REGBUF_STANDARD);
+            XLogRegisterBufData(2, (char *) &bitmap_page_bit, sizeof(uint32));

If the page doesn't have a standard page layout, why are we passing
REGBUF_STANDARD? But I think you've hacked things so that bitmap
pages DO have a standard page layout, per the change to
_hash_initbitmapbuffer, in which case the comment seems wrong.

(The comment is also a little confusing grammatically, but let's iron
out the substantive issue first.)

+        recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_ADD_OVFL_PAGE);
+
+        PageSetLSN(BufferGetPage(ovflbuf), recptr);
+        PageSetLSN(BufferGetPage(buf), recptr);
+
+        if (BufferIsValid(mapbuf))
+            PageSetLSN(BufferGetPage(mapbuf), recptr);
+
+        if (BufferIsValid(newmapbuf))
+            PageSetLSN(BufferGetPage(newmapbuf), recptr);

I think you forgot to call PageSetLSN() on metabuf. (There are 5
block references in the write-ahead log record but only 4 calls to
PageSetLSN ... surely that can't be right!)

+    /*
+     * We need to release the locks once the prev pointer of overflow bucket
+     * and next of left bucket are set, otherwise concurrent read might skip
+     * the bucket.
+     */
+    if (BufferIsValid(leftbuf))
+        UnlockReleaseBuffer(leftbuf);
+    UnlockReleaseBuffer(ovflbuf);

I don't believe the comment. Postponing the lock release wouldn't
cause the concurrent read to skip the bucket. It might cause it to be
delayed unnecessarily, but the following comment already addresses
that point. I think you can just nuke this comment.

+        xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf) ? true : false;
+        xlrec.is_prev_bucket_same_wrt = (wbuf == prevbuf) ? true : false;

Maybe just xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf); and similar?

+    /*
+     * We release locks on writebuf and bucketbuf at end of replay operation
+     * to ensure that we hold lock on primary bucket page till end of
+     * operation.  We can optimize by releasing the lock on write buffer as
+     * soon as the operation for same is complete, if it is not same as
+     * primary bucket page, but that doesn't seem to be worth complicating the
+     * code.
+     */
+    if (BufferIsValid(writebuf))
+        UnlockReleaseBuffer(writebuf);
+
+    if (BufferIsValid(bucketbuf))
+        UnlockReleaseBuffer(bucketbuf);

Here in hash_xlog_squeeze_page(), you wait until the very end to
release these locks, but in e.g. hash_xlog_addovflpage you release
them considerably earlier for reasons that seem like they would also
be valid here. I think you could move this back to before the updates
to the metapage and bitmap page.

+        /*
+         * Note: in normal operation, we'd update the metapage while still
+         * holding lock on the page we inserted into.  But during replay it's
+         * not necessary to hold that lock, since no other index updates can
+         * be happening concurrently.
+         */

This comment in hash_xlog_init_bitmap_page seems to be off-base,
because we didn't just insert into a page. I'm not disputing that
it's safe, though: not only can nobody else be modifying it because
we're in recovery, but also nobody else can see it yet; the creating
transaction hasn't yet committed.

+    /*
+     * Note that we still update the page even if it was restored from a full
+     * page image, because the bucket flag is not included in the image.
+     */

Really? Why not? (Because it's part of the special space?)

Doesn't hash_xlog_split_allocpage need to acquire cleanup locks to
guard against concurrent scans? And also hash_xlog_split_complete?
And hash_xlog_delete? If the regular code path is getting a cleanup
lock to protect against concurrent scans, recovery better do it, too.

+    /*
+     * There is no harm in releasing the lock on old bucket before new bucket
+     * during replay as no other bucket splits can be happening concurrently.
+     */
+    if (BufferIsValid(oldbuf))
+        UnlockReleaseBuffer(oldbuf);

And I'm not sure this is safe either. The issue isn't that there
might be another bucket split happening concurrently, but that there
might be scans that get confused by seeing the bucket split flag set
before the new bucket is created or the metapage updated. Seems like
it would be safer to keep all the locks for this one.

+ * Initialise the new bucket page, here we can't complete zeroed the page

I think maybe a good way to write these would be "we can't simply zero
the page". And Initialise -> Initialize.

                 /*
+                 * Change the shared buffer state in critical section,
+                 * otherwise any error could make it unrecoverable after
+                 * recovery.
+                 */
+                START_CRIT_SECTION();
+
+                /*
                  * Insert tuple on new page, using _hash_pgaddtup to ensure
                  * correct ordering by hashkey.  This is a tad inefficient
                  * since we may have to shuffle itempointers repeatedly.
                  * Possible future improvement: accumulate all the items for
                  * the new page and qsort them before insertion.
                  */
                 (void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);

+ END_CRIT_SECTION();

No way. You have to start the critical section before making any page
modifications and keep it alive until all changes have been logged.
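
For reference, the ordering that src/backend/access/transam/README calls for
looks roughly like the sketch below; apart from the _hash_pgaddtup() call and
its arguments, which come from the hunk quoted above, the names are
placeholders rather than the patch's actual variables:

    xl_hash_insert xlrec;
    XLogRecPtr     recptr;

    START_CRIT_SECTION();

    /* 1. apply the change to the shared buffer */
    xlrec.offnum = _hash_pgaddtup(rel, nbuf, itemsz, new_itup);

    /* 2. mark the buffer dirty before logging it */
    MarkBufferDirty(nbuf);

    /* 3. build and insert the WAL record */
    if (RelationNeedsWAL(rel))
    {
        XLogBeginInsert();
        XLogRegisterData((char *) &xlrec, SizeOfHashInsert);
        XLogRegisterBuffer(0, nbuf, REGBUF_STANDARD);

        recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INSERT);

        /* 4. stamp the modified page with the record's LSN */
        PageSetLSN(BufferGetPage(nbuf), recptr);
    }

    END_CRIT_SECTION();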

hash_redo's mapping of record types to function names is a little
less regular than seems desirable. Generally, the function name is
the lower-case version of the record type with "hash" and "xlog"
switched. But XLOG_HASH_ADD_OVFL_PAGE and
XLOG_HASH_SPLIT_ALLOCATE_PAGE break the pattern.

It seems like a good test to do with this patch would be to set up a
pgbench test on the master with a hash index replacing the usual btree
index. Then, set up a standby and run a read-only pgbench on the
standby while a read-write pgbench test runs on the master. Maybe
you've already tried something like that?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#87Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#86)
Re: Write Ahead Logging for Hash Indexes

On Thu, Mar 9, 2017 at 3:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Mar 7, 2017 at 6:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Great, thanks. 0001 looks good to me now, so committed.

Committed 0002.

Here are some initial review thoughts on 0003 based on a first read-through.

More thoughts on the main patch:

+mode anyway).  It would seem natural to complete the split in VACUUM, but since
+splitting a bucket might require allocating a new page, it might fail if you
+run out of disk space.  That would be bad during VACUUM - the reason for
+running VACUUM in the first place might be that you run out of disk space,
+and now VACUUM won't finish because you're out of disk space.  In contrast,
+an insertion can require enlarging the physical file anyway.

But VACUUM actually does try to complete the splits, so this is wrong, I think.

Vacuum just tries to clean the remaining tuples from old bucket. Only
insert or split operation tries to complete the split.

+                if (!xlrec.is_primary_bucket_page)
+                    XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);

So, in hashbucketcleanup, the page is registered and therefore an FPI
will be taken if this is the first reference to this page since the
checkpoint, but hash_xlog_delete won't restore the FPI. I think that
exposes us to a torn-page hazard.
_hash_freeovflpage/hash_xlog_squeeze_page seems to have the same
disease, as does _hash_squeezebucket/hash_xlog_move_page_contents.

Right, if we use XLogReadBufferForRedoExtended() instead of
XLogReadBufferExtended()/LockBufferForCleanup in the replay routine,
then it will restore the FPI when required and serve our purpose of
acquiring a cleanup lock on the primary bucket page. Yet another way
could be to store some information about the primary bucket, such as
block number, relfilenode, and forknum (maybe we can get away with
less), in the fixed part of each of these records and then use that to
read the page and take a cleanup lock. But since in most cases this
buffer needs to be registered only when there are multiple overflow
pages, having that fixed information might hurt in some cases.
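
As a sketch of the first option (not the patch's actual redo code; the block
id and variable names are placeholders), the replay routine would do something
like:

    Buffer          bucketbuf;
    XLogRedoAction  action;

    /* restores the FPI if one was logged, and takes a cleanup lock */
    action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL,
                                           true /* get_cleanup_lock */ ,
                                           &bucketbuf);
    if (action == BLK_NEEDS_REDO)
    {
        /* apply the deletions to the page and set its LSN here */
    }
    if (BufferIsValid(bucketbuf))
        UnlockReleaseBuffer(bucketbuf);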

+            /*
+             * As bitmap page doesn't have standard page layout, so this will
+             * allow us to log the data.
+             */
+            XLogRegisterBuffer(2, mapbuf, REGBUF_STANDARD);
+            XLogRegisterBufData(2, (char *) &bitmap_page_bit, sizeof(uint32));

If the page doesn't have a standard page layout, why are we passing
REGBUF_STANDARD? But I think you've hacked things so that bitmap
pages DO have a standard page layout, per the change to
_hash_initbitmapbuffer, in which case the comment seems wrong.

Yes, the comment is obsolete and I have removed it. We need the
standard page layout for the bitmap page so that full page writes
won't exclude the bitmap page data.
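
To illustrate why that works (a sketch, not the patch's exact code, with
bitmapbuf and bmsize as placeholders): because the bitmap page keeps pd_lower
past the bitmap words, a REGBUF_STANDARD full-page image, which omits only the
space between pd_lower and pd_upper, still carries all of the bitmap data:

    Page        pg = BufferGetPage(bitmapbuf);
    uint32     *freep = HashPageGetBitmap(pg);

    /* ... bitmap words initialized here ... */

    /* keep pd_lower past the bitmap so the FPI "hole" cannot swallow it */
    ((PageHeader) pg)->pd_lower = ((char *) freep + bmsize) - (char *) pg;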

+    /*
+     * Note that we still update the page even if it was restored from a full
+     * page image, because the bucket flag is not included in the image.
+     */

Really? Why not? (Because it's part of the special space?)

Yes, because it's part of the special space. Do you want that to be
mentioned in the comments?

Doesn't hash_xlog_split_allocpage need to acquire cleanup locks to
guard against concurrent scans?

I don't think so, because regular code takes it on old bucket to
protect the split from concurrent inserts and cleanup lock on new
bucket is just for the sake of consistency.

And also hash_xlog_split_complete?

In regular code, we are not taking cleanup lock for this operation.

And hash_xlog_delete? If the regular code path is getting a cleanup
lock to protect against concurrent scans, recovery better do it, too.

We are already taking cleanup lock for this, not sure what exactly is
missed here, can you be more specific?

+    /*
+     * There is no harm in releasing the lock on old bucket before new bucket
+     * during replay as no other bucket splits can be happening concurrently.
+     */
+    if (BufferIsValid(oldbuf))
+        UnlockReleaseBuffer(oldbuf);

And I'm not sure this is safe either. The issue isn't that there
might be another bucket split happening concurrently, but that there
might be scans that get confused by seeing the bucket split flag set
before the new bucket is created or the metapage updated.

During scan we check for bucket being populated, so this should be safe.

Seems like
it would be safer to keep all the locks for this one.

If you feel better that way, I can change it and add a comment saying
something like "we could release lock on bucket being split early as
well, but doing here to be consistent with normal operation"

It seems like a good test to do with this patch would be to set up a
pgbench test on the master with a hash index replacing the usual btree
index. Then, set up a standby and run a read-only pgbench on the
standby while a read-write pgbench test runs on the master. Maybe
you've already tried something like that?

Something similar has been done, but I will do it again with the final version.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#88Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#85)
Re: Write Ahead Logging for Hash Indexes

On Thu, Mar 9, 2017 at 12:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Mar 8, 2017 at 7:45 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Okay, I can try, but note that currently there is no test related to
"snapshot too old" for any other indexes.

Wow, that's surprising. It seems the snapshot_too_old test only
checks that this works for a table that has no indexes. Have you,
anyway, tested it manually?

Yes, I have tested it manually. I think we need to ensure that the
modified tuple falls on the same page as the old tuple to make the
test work. The slight difficulty with an index is ensuring that the
modified tuple is inserted into the same page as the old tuple, and
this is especially true for hash indexes. Also, for the heap I think
it relies on hot-pruning stuff, whereas for an index we need to
perform a manual vacuum. Basically, if we want, we can write a test
for the index, but I am not sure it is worth the pain to write one
for hash indexes when the test for btree is not there.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#89Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#87)
Re: Write Ahead Logging for Hash Indexes

On Thu, Mar 9, 2017 at 10:23 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

+mode anyway).  It would seem natural to complete the split in VACUUM, but since
+splitting a bucket might require allocating a new page, it might fail if you
+run out of disk space.  That would be bad during VACUUM - the reason for
+running VACUUM in the first place might be that you run out of disk space,
+and now VACUUM won't finish because you're out of disk space.  In contrast,
+an insertion can require enlarging the physical file anyway.

But VACUUM actually does try to complete the splits, so this is wrong, I think.

Vacuum just tries to clean the remaining tuples from old bucket. Only
insert or split operation tries to complete the split.

Oh, right. Sorry.

+                if (!xlrec.is_primary_bucket_page)
+                    XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);

So, in hashbucketcleanup, the page is registered and therefore an FPI
will be taken if this is the first reference to this page since the
checkpoint, but hash_xlog_delete won't restore the FPI. I think that
exposes us to a torn-page hazard.
_hash_freeovflpage/hash_xlog_squeeze_page seems to have the same
disease, as does _hash_squeezebucket/hash_xlog_move_page_contents.

Right, if we use XLogReadBufferForRedoExtended() instead of
XLogReadBufferExtended()/LockBufferForCleanup in the replay routine,
then it will restore the FPI when required and serve our purpose of
acquiring a cleanup lock on the primary bucket page. Yet another way
could be to store some information about the primary bucket, such as
block number, relfilenode, and forknum (maybe we can get away with
less), in the fixed part of each of these records and then use that to
read the page and take a cleanup lock. But since in most cases this
buffer needs to be registered only when there are multiple overflow
pages, having that fixed information might hurt in some cases.

Just because the WAL record gets larger? I think you could arrange to
record the data only in the cases where it is needed, but I'm also not
sure it would matter that much anyway. Off-hand, it seems better than
having to mark the primary bucket page dirty (because you have to set
the LSN) every time any page in the chain is modified.

+    /*
+     * Note that we still update the page even if it was restored from a full
+     * page image, because the bucket flag is not included in the image.
+     */

Really? Why not? (Because it's part of the special space?)

Yes, because it's part of the special space. Do you want that to be
mentioned in the comments?

Seems like a good idea. Maybe just change the end to "because the
special space is not included in the image" to make it slightly more
explicit.

Doesn't hash_xlog_split_allocpage need to acquire cleanup locks to
guard against concurrent scans?

I don't think so, because regular code takes it on old bucket to
protect the split from concurrent inserts and cleanup lock on new
bucket is just for the sake of consistency.

You might be right, but this definitely does create a state on the
standby that can't occur on the master: the bucket being split is
flagged as such, but the bucket being populated doesn't appear to
exist yet. Maybe that's OK. It makes me slightly nervous to have the
standby take a weaker lock than the master does, but perhaps it's
harmless.

And also hash_xlog_split_complete?

In regular code, we are not taking cleanup lock for this operation.

You're right, sorry.

And hash_xlog_delete? If the regular code path is getting a cleanup
lock to protect against concurrent scans, recovery better do it, too.

We are already taking cleanup lock for this, not sure what exactly is
missed here, can you be more specific?

Never mind, I'm wrong.

+    /*
+     * There is no harm in releasing the lock on old bucket before new bucket
+     * during replay as no other bucket splits can be happening concurrently.
+     */
+    if (BufferIsValid(oldbuf))
+        UnlockReleaseBuffer(oldbuf);

And I'm not sure this is safe either. The issue isn't that there
might be another bucket split happening concurrently, but that there
might be scans that get confused by seeing the bucket split flag set
before the new bucket is created or the metapage updated.

During scan we check for bucket being populated, so this should be safe.

Yeah, I guess so. This business of the standby taking weaker locks
still seems scary to me, as in hash_xlog_split_allocpage above.

Seems like
it would be safer to keep all the locks for this one.

If you feel better that way, I can change it and add a comment saying
something like "we could release the lock on the bucket being split early as
well, but we do it here to be consistent with normal operation".

I think that might not be a bad idea.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#90Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#89)
Re: Write Ahead Logging for Hash Indexes

On Thu, Mar 9, 2017 at 11:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Mar 9, 2017 at 10:23 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Right, if we use XLogReadBufferForRedoExtended() instead of
XLogReadBufferExtended()/LockBufferForCleanup in the replay routine,
then it will restore the FPI when required and can serve our purpose of
acquiring a cleanup lock on the primary bucket page. Yet another way could
be to store some information like block number, relfilenode, and forknum
(maybe we can get away with somewhat less info) of the primary bucket in the
fixed part of each of these records, and then use that to read the page
and take a cleanup lock on it. Now, as in most cases this buffer
needs to be registered only when there are multiple overflow
pages, I think having the fixed information might hurt in some cases.

Just because the WAL record gets larger? I think you could arrange to
record the data only in the cases where it is needed, but I'm also not
sure it would matter that much anyway. Off-hand, it seems better than
having to mark the primary bucket page dirty (because you have to set
the LSN) every time any page in the chain is modified.

Do we really need to set LSN on this page (or mark it dirty), if so
why? Are you worried about restoration of FPI or something else?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#91Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#90)
Re: Write Ahead Logging for Hash Indexes

On Thu, Mar 9, 2017 at 9:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Do we really need to set LSN on this page (or mark it dirty), if so
why? Are you worried about restoration of FPI or something else?

I haven't thought through all of the possible consequences and am a
bit too tired to do so just now, but doesn't it seem rather risky to
invent a whole new way of using these xlog functions?
src/backend/access/transam/README describes how to do write-ahead
logging properly, and neither MarkBufferDirty() nor PageSetLSN() is
described as an optional step.
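
For reference, the pattern that README prescribes, sketched here with the
patch's hash delete record as an example, is roughly:

    START_CRIT_SECTION();

    /* apply the change to the shared buffer */
    PageIndexMultiDelete(page, deletable, ndeletable);
    MarkBufferDirty(buf);

    if (RelationNeedsWAL(rel))
    {
        XLogRecPtr  recptr;

        XLogBeginInsert();
        XLogRegisterData((char *) &xlrec, SizeOfHashDelete);
        XLogRegisterBuffer(1, buf, REGBUF_STANDARD);

        recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_DELETE);
        PageSetLSN(BufferGetPage(buf), recptr);
    }

    END_CRIT_SECTION();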

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#92Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#91)
Re: Write Ahead Logging for Hash Indexes

On Fri, Mar 10, 2017 at 8:49 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Mar 9, 2017 at 9:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Do we really need to set LSN on this page (or mark it dirty), if so
why? Are you worried about restoration of FPI or something else?

I haven't thought through all of the possible consequences and am a
bit too tired to do so just now, but doesn't it seem rather risky to
invent a whole new way of using these xlog functions?
src/backend/access/transam/README describes how to do write-ahead
logging properly, and neither MarkBufferDirty() nor PageSetLSN() is
described as an optional step.

Just to salvage my point, I think this is not the first place where we
register buffer, but don't set lsn. For XLOG_HEAP2_VISIBLE, we
register heap and vm buffers but set the LSN conditionally on heap
buffer. Having said that, I see the value of your point and I am open
to doing it that way if you feel that is a better way.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#93Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#92)
Re: Write Ahead Logging for Hash Indexes

On Fri, Mar 10, 2017 at 7:08 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Mar 10, 2017 at 8:49 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Mar 9, 2017 at 9:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Do we really need to set LSN on this page (or mark it dirty), if so
why? Are you worried about restoration of FPI or something else?

I haven't thought through all of the possible consequences and am a
bit too tired to do so just now, but doesn't it seem rather risky to
invent a whole new way of using these xlog functions?
src/backend/access/transam/README describes how to do write-ahead
logging properly, and neither MarkBufferDirty() nor PageSetLSN() is
described as an optional step.

Just to salvage my point, I think this is not the first place where we
register buffer, but don't set lsn. For XLOG_HEAP2_VISIBLE, we
register heap and vm buffers but set the LSN conditionally on heap
buffer. Having said that, I see the value of your point and I am open
to doing it that way if you feel that is a better way.

Right, we did that, and it seems to have worked. But it was a scary
exception that required a lot of thought. Now, since we did it once,
we could do it again, but I am not sure it is for the best. In the
case of vacuum, we knew that a vacuum on a large table could otherwise
emit an FPI for every page, which would almost double the amount of
write I/O generated by a vacuum - instead of WAL records + heap pages,
you'd be writing FPIs + heap pages, a big increase. Now here I think
that the amount of write I/O that it's costing us is not so clear.
Unless it's going to be really big, I'd rather do this in some way we
can think is definitely safe.

Also, if we want to avoid dirtying the primary bucket page by setting
the LSN, IMHO the way to do that is to store the block number of the
primary bucket page without registering the buffer, and then during
recovery, lock that block. That seems cleaner than hoping that we can
take an FPI without setting the page LSN.
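
A rough sketch of that alternative (prim_bucket_blkno here is a hypothetical
field carried in the record's fixed data, just for illustration):

    RelFileNode rnode;
    BlockNumber blkno;
    Buffer      bucketbuf;

    /* reuse the relfilenode of the block that is actually registered */
    XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);

    /* read the primary bucket page named in the record and cleanup-lock it */
    bucketbuf = XLogReadBufferExtended(rnode, MAIN_FORKNUM,
                                       xlrec->prim_bucket_blkno, RBM_NORMAL);
    if (BufferIsValid(bucketbuf))
        LockBufferForCleanup(bucketbuf);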

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#94Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#93)
Re: Write Ahead Logging for Hash Indexes

On Fri, Mar 10, 2017 at 6:21 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Mar 10, 2017 at 7:08 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Mar 10, 2017 at 8:49 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Mar 9, 2017 at 9:34 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Do we really need to set LSN on this page (or mark it dirty), if so
why? Are you worried about restoration of FPI or something else?

I haven't thought through all of the possible consequences and am a
bit too tired to do so just now, but doesn't it seem rather risky to
invent a whole new way of using these xlog functions?
src/backend/access/transam/README describes how to do write-ahead
logging properly, and neither MarkBufferDirty() nor PageSetLSN() is
described as an optional step.

Just to salvage my point, I think this is not the first place where we
register buffer, but don't set lsn. For XLOG_HEAP2_VISIBLE, we
register heap and vm buffers but set the LSN conditionally on heap
buffer. Having said that, I see the value of your point and I am open
to doing it that way if you feel that is a better way.

Right, we did that, and it seems to have worked. But it was a scary
exception that required a lot of thought. Now, since we did it once,
we could do it again, but I am not sure it is for the best. In the
case of vacuum, we knew that a vacuum on a large table could otherwise
emit an FPI for every page, which would almost double the amount of
write I/O generated by a vacuum - instead of WAL records + heap pages,
you'd be writing FPIs + heap pages, a big increase. Now here I think
that the amount of write I/O that it's costing us is not so clear.
Unless it's going to be really big, I'd rather do this in some way we
can think is definitely safe.

Also, if we want to avoid dirtying the primary bucket page by setting
the LSN, IMHO the way to do that is to store the block number of the
primary bucket page without registering the buffer, and then during
recovery, lock that block. That seems cleaner than hoping that we can
take an FPI without setting the page LSN.

I was thinking that we would use the REGBUF_NO_IMAGE flag, as is done for
the heap buffer in the XLOG_HEAP2_VISIBLE record; that will avoid any extra
I/O and will make it safe as well. I think that is what makes registering
the buffer without setting its LSN safe for the XLOG_HEAP2_VISIBLE record.
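
The registration in hashbucketcleanup would then look roughly like this
(a sketch; the primary bucket page is registered only so that replay can take
a cleanup lock on it):

    if (!xlrec.is_primary_bucket_page)
        XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD | REGBUF_NO_IMAGE);

    XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
    XLogRegisterBufData(1, (char *) deletable,
                        ndeletable * sizeof(OffsetNumber));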

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#95Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#94)
Re: Write Ahead Logging for Hash Indexes

On Fri, Mar 10, 2017 at 8:02 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I was thinking that we would use the REGBUF_NO_IMAGE flag, as is done for
the heap buffer in the XLOG_HEAP2_VISIBLE record; that will avoid any extra
I/O and will make it safe as well. I think that is what makes registering
the buffer without setting its LSN safe for the XLOG_HEAP2_VISIBLE record.

Hmm, yeah, that might work, though I would probably want to spend some
time studying the patch and thinking about it before deciding for
sure.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#96Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#86)
Re: Write Ahead Logging for Hash Indexes

On Thu, Mar 9, 2017 at 3:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Mar 7, 2017 at 6:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Great, thanks. 0001 looks good to me now, so committed.

Committed 0002.

Here are some initial review thoughts on 0003 based on a first read-through.

More thoughts on the main patch:

/*
+                 * Change the shared buffer state in critical section,
+                 * otherwise any error could make it unrecoverable after
+                 * recovery.
+                 */
+                START_CRIT_SECTION();
+
+                /*
* Insert tuple on new page, using _hash_pgaddtup to ensure
* correct ordering by hashkey.  This is a tad inefficient
* since we may have to shuffle itempointers repeatedly.
* Possible future improvement: accumulate all the items for
* the new page and qsort them before insertion.
*/
(void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);

+ END_CRIT_SECTION();

No way. You have to start the critical section before making any page
modifications and keep it alive until all changes have been logged.

I think what we need to do here is to accumulate all the tuples that
need to be added to the new bucket page until either that page has no more
space or there are no more tuples remaining in the old bucket. Then, in
a critical section, add them to the page using _hash_pgaddmultitup and
log the entire new bucket page contents as is currently done in the patch's
log_split_page(). Now, here we could choose to log the individual
tuples instead of the complete page, however I am not sure there is any
benefit in doing so, because XLogRecordAssemble() will remove the empty
space from the page anyway. Let me know if you have
something else in mind.
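
Roughly, I have the following shape in mind (a sketch only; the record and
flag names follow the patch, and itups/itup_offsets are the accumulated
tuples and their target offsets):

    START_CRIT_SECTION();

    _hash_pgaddmultitup(rel, nbuf, itups, itup_offsets, nitups);
    MarkBufferDirty(nbuf);

    if (RelationNeedsWAL(rel))
    {
        XLogRecPtr  recptr;

        XLogBeginInsert();
        /* log the whole new bucket page; force a full page image */
        XLogRegisterBuffer(0, nbuf, REGBUF_STANDARD | REGBUF_FORCE_IMAGE);

        recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_PAGE);
        PageSetLSN(BufferGetPage(nbuf), recptr);
    }

    END_CRIT_SECTION();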

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#97Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#96)
Re: Write Ahead Logging for Hash Indexes

On Sat, Mar 11, 2017 at 12:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

/*
+                 * Change the shared buffer state in critical section,
+                 * otherwise any error could make it unrecoverable after
+                 * recovery.
+                 */
+                START_CRIT_SECTION();
+
+                /*
* Insert tuple on new page, using _hash_pgaddtup to ensure
* correct ordering by hashkey.  This is a tad inefficient
* since we may have to shuffle itempointers repeatedly.
* Possible future improvement: accumulate all the items for
* the new page and qsort them before insertion.
*/
(void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);

+ END_CRIT_SECTION();

No way. You have to start the critical section before making any page
modifications and keep it alive until all changes have been logged.

I think what we need to do here is to accumulate all the tuples that
need to be added to the new bucket page until either that page has no more
space or there are no more tuples remaining in the old bucket. Then, in
a critical section, add them to the page using _hash_pgaddmultitup and
log the entire new bucket page contents as is currently done in the patch's
log_split_page().

I agree.

Now, here we could choose to log the individual
tuples instead of the complete page, however I am not sure there is any
benefit in doing so, because XLogRecordAssemble() will remove the empty
space from the page anyway. Let me know if you have
something else in mind.

Well, if you have two pages that are 75% full, and you move a third of
the tuples from one of them into the other, it's going to be about
four times more efficient to log only the moved tuples than the whole
page. But logging the whole filled page wouldn't be wrong, just
somewhat inefficient.
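
A rough sketch of logging just the moved tuples instead (illustrative only;
the offsets and tuples are the ones accumulated for the new page):

    XLogBeginInsert();
    XLogRegisterBuffer(0, nbuf, REGBUF_STANDARD);

    /* log the target offsets followed by the tuples themselves */
    XLogRegisterBufData(0, (char *) itup_offsets,
                        nitups * sizeof(OffsetNumber));
    for (i = 0; i < nitups; i++)
        XLogRegisterBufData(0, (char *) itups[i],
                            IndexTupleSize(itups[i]));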

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#98Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#86)
2 attachment(s)
Re: Write Ahead Logging for Hash Indexes

On Thu, Mar 9, 2017 at 3:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Mar 7, 2017 at 6:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Great, thanks. 0001 looks good to me now, so committed.

Committed 0002.

Here are some initial review thoughts on 0003 based on a first read-through.

More thoughts on the main patch:

The text you've added to the hash index README throws around the term
"single WAL entry" pretty freely. For example: "An insertion that
causes an addition of an overflow page is logged as a single WAL entry
preceded by a WAL entry for new overflow page required to insert a
tuple." When you say a single WAL entry, you make it sound like
there's only one, but because it's preceded by another WAL entry,
there are actually two. I would rephase this to just avoid using the
word "single" this way. For example: "If an insertion causes the
addition of an overflow page, there will be one WAL entry for the new
overflow page and a second entry for the insert itself."

Okay, changed as per suggestion. I have also removed the second part
of the above comment, which states that a newly allocated overflow page
can be used by concurrent transactions. That won't be true after
fixing the release/reacquire stuff in _hash_addovflpage. I have also
rephrased another similar usage of "single WAL entry".

+mode anyway).  It would seem natural to complete the split in VACUUM, but since
+splitting a bucket might require allocating a new page, it might fail if you
+run out of disk space.  That would be bad during VACUUM - the reason for
+running VACUUM in the first place might be that you run out of disk space,
+and now VACUUM won't finish because you're out of disk space.  In contrast,
+an insertion can require enlarging the physical file anyway.

But VACUUM actually does try to complete the splits, so this is wrong, I think.

As discussed up thread, there is no need to change.

+Squeeze operation moves tuples from one of the buckets later in the chain to

A squeeze operation

+As Squeeze operation involves writing multiple atomic operations, it is

As a squeeze operation involves multiple atomic operations

Changed as per suggestion.

+                if (!xlrec.is_primary_bucket_page)
+                    XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);

So, in hashbucketcleanup, the page is registered and therefore an FPI
will be taken if this is the first reference to this page since the
checkpoint, but hash_xlog_delete won't restore the FPI. I think that
exposes us to a torn-page hazard.
_hash_freeovflpage/hash_xlog_squeeze_page seems to have the same
disease, as does _hash_squeezebucket/hash_xlog_move_page_contents.
Not sure exactly what the fix should be.

I have used REGBUF_NO_IMAGE while registering the buffer to avoid this
hazard, and during replay I have used XLogReadBufferForRedoExtended() to
read the block and take a cleanup lock on it.

+            /*
+             * As bitmap page doesn't have standard page layout, so this will
+             * allow us to log the data.
+             */
+            XLogRegisterBuffer(2, mapbuf, REGBUF_STANDARD);
+            XLogRegisterBufData(2, (char *) &bitmap_page_bit, sizeof(uint32));

If the page doesn't have a standard page layout, why are we passing
REGBUF_STANDARD? But I think you've hacked things so that bitmap
pages DO have a standard page layout, per the change to
_hash_initbitmapbuffer, in which case the comment seems wrong.

Yes, the comment is obsolete and I have removed it. We need a standard
page layout for the bitmap page so that full page writes won't exclude
the bitmap page data.
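
For reference, making the bitmap page use the standard layout mostly amounts
to keeping pd_lower past the bitmap words, roughly (a sketch of what
_hash_initbitmapbuffer now does; freep points to the bitmap and bmsize is its
size in bytes):

    pg = BufferGetPage(buf);

    /*
     * Set pd_lower just past the end of the bitmap data, so that a
     * "standard" full page image (which omits only the pd_lower..pd_upper
     * hole) still carries the bitmap.
     */
    ((PageHeader) pg)->pd_lower = ((char *) freep + bmsize) - (char *) pg;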

+        recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_ADD_OVFL_PAGE);
+
+        PageSetLSN(BufferGetPage(ovflbuf), recptr);
+        PageSetLSN(BufferGetPage(buf), recptr);
+
+        if (BufferIsValid(mapbuf))
+            PageSetLSN(BufferGetPage(mapbuf), recptr);
+
+        if (BufferIsValid(newmapbuf))
+            PageSetLSN(BufferGetPage(newmapbuf), recptr);

I think you forgot to call PageSetLSN() on metabuf. (There are 5
block references in the write-ahead log record but only 4 calls to
PageSetLSN ... surely that can't be right!)

Right, fixed.

+    /*
+     * We need to release the locks once the prev pointer of overflow bucket
+     * and next of left bucket are set, otherwise concurrent read might skip
+     * the bucket.
+     */
+    if (BufferIsValid(leftbuf))
+        UnlockReleaseBuffer(leftbuf);
+    UnlockReleaseBuffer(ovflbuf);

I don't believe the comment. Postponing the lock release wouldn't
cause the concurrent read to skip the bucket. It might cause it to be
delayed unnecessarily, but the following comment already addresses
that point. I think you can just nuke this comment.

Okay, removed the comment.

+        xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf) ? true : false;
+        xlrec.is_prev_bucket_same_wrt = (wbuf == prevbuf) ? true : false;

Maybe just xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf); and similar?

Changed as per suggestion.

+    /*
+     * We release locks on writebuf and bucketbuf at end of replay operation
+     * to ensure that we hold lock on primary bucket page till end of
+     * operation.  We can optimize by releasing the lock on write buffer as
+     * soon as the operation for same is complete, if it is not same as
+     * primary bucket page, but that doesn't seem to be worth complicating the
+     * code.
+     */
+    if (BufferIsValid(writebuf))
+        UnlockReleaseBuffer(writebuf);
+
+    if (BufferIsValid(bucketbuf))
+        UnlockReleaseBuffer(bucketbuf);

Here in hash_xlog_squeeze_page(), you wait until the very end to
release these locks, but in e.g. hash_xlog_addovflpage you release
them considerably earlier for reasons that seem like they would also
be valid here. I think you could move this back to before the updates
to the metapage and bitmap page.

Sure, I don't see any problem with releasing the locks early, so changed
as per suggestion.

+        /*
+         * Note: in normal operation, we'd update the metapage while still
+         * holding lock on the page we inserted into.  But during replay it's
+         * not necessary to hold that lock, since no other index updates can
+         * be happening concurrently.
+         */

This comment in hash_xlog_init_bitmap_page seems to be off-base,
because we didn't just insert into a page. I'm not disputing that
it's safe, though: not only can nobody else be modifying it because
we're in recovery, but also nobody else can see it yet; the creating
transaction hasn't yet committed.

Fixed the comment and mentioned the second reason as suggested by you
(that sounds appropriate here).

+    /*
+     * Note that we still update the page even if it was restored from a full
+     * page image, because the bucket flag is not included in the image.
+     */

Really? Why not? (Because it's part of the special space?)

Added a comment as discussed above.

Doesn't hash_xlog_split_allocpage need to acquire cleanup locks to
guard against concurrent scans?

Changed it and made it similar to normal operation by taking a cleanup lock.

+    /*
+     * There is no harm in releasing the lock on old bucket before new bucket
+     * during replay as no other bucket splits can be happening concurrently.
+     */
+    if (BufferIsValid(oldbuf))
+        UnlockReleaseBuffer(oldbuf);

And I'm not sure this is safe either. The issue isn't that there
might be another bucket split happening concurrently, but that there
might be scans that get confused by seeing the bucket split flag set
before the new bucket is created or the metapage updated. Seems like
it would be safer to keep all the locks for this one.

Changed as discussed.

+ * Initialise the new bucket page, here we can't complete zeroed the page

I think maybe a good way to write these would be "we can't simply zero
the page". And Initialise -> Initialize.

I have changed the comment; see if that looks okay to you.

/*
+                 * Change the shared buffer state in critical section,
+                 * otherwise any error could make it unrecoverable after
+                 * recovery.
+                 */
+                START_CRIT_SECTION();
+
+                /*
* Insert tuple on new page, using _hash_pgaddtup to ensure
* correct ordering by hashkey.  This is a tad inefficient
* since we may have to shuffle itempointers repeatedly.
* Possible future improvement: accumulate all the items for
* the new page and qsort them before insertion.
*/
(void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);

+ END_CRIT_SECTION();

No way. You have to start the critical section before making any page
modifications and keep it alive until all changes have been logged.

Changed as discussed up thread.

hash_redo's mappings of record types to function names is a little
less regular than seems desirable. Generally, the function name is
the lower-case version of the record type with "hash" and "xlog"
switched. But XLOG_HASH_ADD_OVFL_PAGE and
XLOG_HASH_SPLIT_ALLOCATE_PAGE break the pattern.

Okay, changed as per suggestion, and I went ahead and changed the names
of the corresponding xlog record structures as well so that those also
match the function name pattern.
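
With the renaming, the dispatch in hash_redo() follows one regular pattern,
roughly:

    void
    hash_redo(XLogReaderState *record)
    {
        uint8       info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;

        switch (info)
        {
            case XLOG_HASH_INIT_META_PAGE:
                hash_xlog_init_meta_page(record);
                break;
            case XLOG_HASH_ADD_OVFL_PAGE:
                hash_xlog_add_ovfl_page(record);
                break;
            case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
                hash_xlog_split_allocate_page(record);
                break;
            /* ... and so on for the remaining record types ... */
            default:
                elog(PANIC, "hash_redo: unknown op code %u", info);
        }
    }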

It seems like a good test to do with this patch would be to set up a
pgbench test on the master with a hash index replacing the usual btree
index. Then, set up a standby and run a read-only pgbench on the
standby while a read-write pgbench test runs on the master. Maybe
you've already tried something like that?

I also think so, and apart from that I think it makes sense to perform
a recovery test with Jeff Janes's tool and probably tests with
wal_consistency_check. These tests have already been running for the past
seven hours or so and I will keep them running for the whole night to see
if there is any discrepancy. Thanks to Kuntal and Ashutosh for helping
me with these stress tests.

Note that I have fixed all the other comments raised by you in another
e-mail, including adding the test for "snapshot too old" (I have prepared
a separate patch for the test. I have added just one test to keep the
timing short, because each test takes a minimum of 6s to complete; this
6s requirement is the same as for the existing tests or any other test
for "snapshot too old"). Let me know if I have missed anything.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

wal_hash_index_v10.patchapplication/octet-stream; name=wal_hash_index_v10.patchDownload
diff --git a/contrib/pageinspect/expected/hash.out b/contrib/pageinspect/expected/hash.out
index 8ed60bc..3ba01f6 100644
--- a/contrib/pageinspect/expected/hash.out
+++ b/contrib/pageinspect/expected/hash.out
@@ -1,7 +1,6 @@
 CREATE TABLE test_hash (a int, b text);
 INSERT INTO test_hash VALUES (1, 'one');
 CREATE INDEX test_hash_a_idx ON test_hash USING hash (a);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 \x
 SELECT hash_page_type(get_raw_page('test_hash_a_idx', 0));
 -[ RECORD 1 ]--+---------
diff --git a/contrib/pgstattuple/expected/pgstattuple.out b/contrib/pgstattuple/expected/pgstattuple.out
index baee2bd..2c3515b 100644
--- a/contrib/pgstattuple/expected/pgstattuple.out
+++ b/contrib/pgstattuple/expected/pgstattuple.out
@@ -131,7 +131,6 @@ select * from pgstatginindex('test_ginidx');
 (1 row)
 
 create index test_hashidx on test using hash (b);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 select * from pgstathashindex('test_hashidx');
  version | bucket_pages | overflow_pages | bitmap_pages | zero_pages | live_items | dead_items | free_percent 
 ---------+--------------+----------------+--------------+------------+------------+------------+--------------
@@ -226,7 +225,6 @@ ERROR:  "test_partition" is not an index
 -- an actual index of a partitioned table should work though
 create index test_partition_idx on test_partition(a);
 create index test_partition_hash_idx on test_partition using hash (a);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 -- these should work
 select pgstatindex('test_partition_idx');
          pgstatindex          
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 12f2a14..69c599e 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1538,19 +1538,6 @@ archive_command = 'local_backup_script.sh "%p" "%f"'
   <itemizedlist>
    <listitem>
     <para>
-     Operations on hash indexes are not presently WAL-logged, so
-     replay will not update these indexes.  This will mean that any new inserts
-     will be ignored by the index, updated rows will apparently disappear and
-     deleted rows will still retain pointers. In other words, if you modify a
-     table with a hash index on it then you will get incorrect query results
-     on a standby server.  When recovery completes it is recommended that you
-     manually <xref linkend="sql-reindex">
-     each such index after completing a recovery operation.
-    </para>
-   </listitem>
-
-   <listitem>
-    <para>
      If a <xref linkend="sql-createdatabase">
      command is executed while a base backup is being taken, and then
      the template database that the <command>CREATE DATABASE</> copied
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 69844e5..eadbfcd 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2153,10 +2153,9 @@ include_dir 'conf.d'
          has materialized a result set, no error will be generated even if the
          underlying rows in the referenced table have been vacuumed away.
          Some tables cannot safely be vacuumed early, and so will not be
-         affected by this setting.  Examples include system catalogs and any
-         table which has a hash index.  For such tables this setting will
-         neither reduce bloat nor create a possibility of a <literal>snapshot
-         too old</> error on scanning.
+         affected by this setting, such as system catalogs.  For such tables
+         this setting will neither reduce bloat nor create a possibility
+         of a <literal>snapshot too old</> error on scanning.
         </para>
        </listitem>
       </varlistentry>
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 48de2ce..0e61991 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2353,12 +2353,6 @@ LOG:  database system is ready to accept read only connections
   <itemizedlist>
    <listitem>
     <para>
-     Operations on hash indexes are not presently WAL-logged, so
-     replay will not update these indexes.
-    </para>
-   </listitem>
-   <listitem>
-    <para>
      Full knowledge of running transactions is required before snapshots
      can be taken. Transactions that use large numbers of subtransactions
      (currently greater than 64) will delay the start of read only
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 271c135..e40750e 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -193,18 +193,6 @@ CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable>
 </synopsis>
   </para>
 
-  <caution>
-   <para>
-    Hash index operations are not presently WAL-logged,
-    so hash indexes might need to be rebuilt with <command>REINDEX</>
-    after a database crash if there were unwritten changes.
-    Also, changes to hash indexes are not replicated over streaming or
-    file-based replication after the initial base backup, so they
-    give wrong answers to queries that subsequently use them.
-    For these reasons, hash index use is presently discouraged.
-   </para>
-  </caution>
-
   <para>
    <indexterm>
     <primary>index</primary>
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index fcb7a60..7163b03 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -510,19 +510,6 @@ Indexes:
    they can be useful.
   </para>
 
-  <caution>
-   <para>
-    Hash index operations are not presently WAL-logged,
-    so hash indexes might need to be rebuilt with <command>REINDEX</>
-    after a database crash if there were unwritten changes.
-    Also, changes to hash indexes are not replicated over streaming or
-    file-based replication after the initial base backup, so they
-    give wrong answers to queries that subsequently use them.
-    Hash indexes are also not properly restored during point-in-time
-    recovery.  For these reasons, hash index use is presently discouraged.
-   </para>
-  </caution>
-
   <para>
    Currently, only the B-tree, GiST, GIN, and BRIN index methods support
    multicolumn indexes. Up to 32 fields can be specified by default.
diff --git a/src/backend/access/hash/Makefile b/src/backend/access/hash/Makefile
index e2e7e91..b154569 100644
--- a/src/backend/access/hash/Makefile
+++ b/src/backend/access/hash/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = hash.o hashfunc.o hashinsert.o hashovfl.o hashpage.o hashsearch.o \
-       hashsort.o hashutil.o hashvalidate.o
+       hashsort.o hashutil.o hashvalidate.o hash_xlog.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README
index 703ae98..bd13d07 100644
--- a/src/backend/access/hash/README
+++ b/src/backend/access/hash/README
@@ -287,13 +287,17 @@ The insertion algorithm is rather similar:
 	if current page is full, release lock but not pin, read/exclusive-lock
      next page; repeat as needed
 	>> see below if no space in any page of bucket
+	take buffer content lock in exclusive mode on metapage
 	insert tuple at appropriate place in page
-	mark current page dirty and release buffer content lock and pin
-	if the current page is not a bucket page, release the pin on bucket page
-	pin meta page and take buffer content lock in exclusive mode
+	mark current page dirty
 	increment tuple count, decide if split needed
-	mark meta page dirty and release buffer content lock and pin
-	done if no split needed, else enter Split algorithm below
+	mark meta page dirty
+	write WAL for insertion of tuple
+	release the buffer content lock on metapage
+	release buffer content lock on current page
+	if current page is not a bucket page, release the pin on bucket page
+	if split is needed, enter Split algorithm below
+	release the pin on metapage
 
 To speed searches, the index entries within any individual index page are
 kept sorted by hash code; the insertion code must take care to insert new
@@ -328,12 +332,17 @@ existing bucket in two, thereby lowering the fill ratio:
        try to finish the split and the cleanup work
        if that succeeds, start over; if it fails, give up
 	mark the old and new buckets indicating split is in progress
+	mark both old and new buckets as dirty
+	write WAL for allocation of new page for split
 	copy the tuples that belongs to new bucket from old bucket, marking
      them as moved-by-split
+	write WAL record for moving tuples to new page once the new page is full
+	or all the pages of old bucket are finished
 	release lock but not pin for primary bucket page of old bucket,
 	 read/shared-lock next page; repeat as needed
 	clear the bucket-being-split and bucket-being-populated flags
 	mark the old bucket indicating split-cleanup
+	write WAL for changing the flags on both old and new buckets
 
 The split operation's attempt to acquire cleanup-lock on the old bucket number
 could fail if another process holds any lock or pin on it.  We do not want to
@@ -369,6 +378,8 @@ The fourth operation is garbage collection (bulk deletion):
 		acquire cleanup lock on primary bucket page
 		loop:
 			scan and remove tuples
+			mark the target page dirty
+			write WAL for deleting tuples from target page
 			if this is the last bucket page, break out of loop
 			pin and x-lock next page
 			release prior lock and pin (except keep pin on primary bucket page)
@@ -383,7 +394,8 @@ The fourth operation is garbage collection (bulk deletion):
 	check if number of buckets changed
 	if so, release content lock and pin and return to for-each-bucket loop
 	else update metapage tuple count
-	mark meta page dirty and release buffer content lock and pin
+	mark meta page dirty and write WAL for update of metapage
+	release buffer content lock and pin
 
 Note that this is designed to allow concurrent splits and scans.  If a split
 occurs, tuples relocated into the new bucket will be visited twice by the
@@ -425,18 +437,16 @@ Obtaining an overflow page:
 	search for a free page (zero bit in bitmap)
 	if found:
 		set bit in bitmap
-		mark bitmap page dirty and release content lock
+		mark bitmap page dirty
 		take metapage buffer content lock in exclusive mode
 		if first-free-bit value did not change,
 			update it and mark meta page dirty
-		release meta page buffer content lock
-		return page number
 	else (not found):
 	release bitmap page buffer content lock
 	loop back to try next bitmap page, if any
 -- here when we have checked all bitmap pages; we hold meta excl. lock
 	extend index to add another overflow page; update meta information
-	mark meta page dirty and release buffer content lock
+	mark meta page dirty
 	return page number
 
 It is slightly annoying to release and reacquire the metapage lock
@@ -456,12 +466,15 @@ like this:
 
 	-- having determined that no space is free in the target bucket:
 	remember last page of bucket, drop write lock on it
-	call free-page-acquire routine
 	re-write-lock last page of bucket
 	if it is not last anymore, step to the last page
-	update (former) last page to point to new page
+	execute free-page-acquire (Obtaining an overflow page) mechanism described above
+	update (former) last page to point to the new page and mark the  buffer dirty.
 	write-lock and initialize new page, with back link to former last page
-	write and release former last page
+	write WAL for addition of overflow page
+	release the locks on meta page and bitmap page acquired in free-page-acquire algorithm
+	release the lock on former last page
+	release the lock on new overflow page
 	insert tuple into new page
 	-- etc.
 
@@ -488,12 +501,14 @@ accessors of pages in the bucket.  The algorithm is:
 	determine which bitmap page contains the free space bit for page
 	release meta page buffer content lock
 	pin bitmap page and take buffer content lock in exclusive mode
-	update bitmap bit
-	mark bitmap page dirty and release buffer content lock and pin
-	if page number is less than what we saw as first-free-bit in meta:
 	retake meta page buffer content lock in exclusive mode
+	move (insert) tuples that belong to the overflow page being freed
+	update bitmap bit
+	mark bitmap page dirty
 	if page number is still less than first-free-bit,
 		update first-free-bit field and mark meta page dirty
+	write WAL for delinking overflow page operation
+	release buffer content lock and pin
 	release meta page buffer content lock and pin
 
 We have to do it this way because we must clear the bitmap bit before
@@ -504,8 +519,97 @@ page acquirer will scan more bitmap bits than he needs to.  What must be
 avoided is having first-free-bit greater than the actual first free bit,
 because then that free page would never be found by searchers.
 
-All the freespace operations should be called while holding no buffer
-locks.  Since they need no lmgr locks, deadlock is not possible.
+The reason of moving tuples from overflow page while delinking the later is
+to make that as an atomic operation.  Not doing so could lead to spurious reads
+on standby.  Basically, the user might see the same tuple twice.
+
+
+WAL Considerations
+------------------
+
+The hash index operations like create index, insert, delete, bucket split,
+allocate overflow page, squeeze (overflow pages are freed in this operation)
+in themselves doesn't guarantee hash index consistency after a crash.  To
+provide robustness, we write WAL for each of these operations which get
+replayed on crash recovery.
+
+Multiple WAL records are being written for create index operation, first for
+initializing the metapage, followed by one for each new bucket created during
+operation followed by one for initializing the bitmap page.  If the system
+crashes after any operation, the whole operation is rolled back.  We have
+considered to write a single WAL record for the whole operation, but for that
+we need to limit the number of initial buckets that can be created during the
+operation. As we can log only the fixed number of pages XLR_MAX_BLOCK_ID (32)
+with current XLog machinery, it is better to write multiple WAL records for this
+operation.  The downside of restricting the number of buckets is that we need
+to perform the split operation if the number of tuples are more than what can be
+accommodated in the initial set of buckets and it is not unusual to have large
+number of tuples during create index operation.
+
+Ordinary item insertions (that don't force a page split or need a new overflow
+page) are single WAL entries.  They touch a single bucket page and meta page,
+metapage is updated during replay as it is updated during original operation.
+
+If an insertion causes the addition of an overflow page, there will be one
+WAL entry for the new overflow page and second entry for insert itself.
+
+If an insertion causes a bucket split, there will be one WAL entry for insert
+itself, followed by a WAL entry for allocating a new bucket, followed by a WAL
+entry for each overflow bucket page in the new bucket to which the tuples are
+moved from old bucket, followed by a WAL entry to indicate that split is complete
+for both old and new buckets.
+
+A split operation which requires overflow pages to complete the operation
+will need to write a WAL record for each new allocation of an overflow page.
+
+As splitting involves multiple atomic actions, it's possible that the
+system crashes between moving tuples from bucket pages of the old bucket to new
+bucket.  In such a case, after recovery, both the old and new buckets will be
+marked with bucket-being-split and bucket-being-populated flags respectively
+which indicates that split is in progress for those buckets.  The reader
+algorithm works correctly, as it will scan both the old and new buckets when
+the split is in progress as explained in the reader algorithm section above.
+
+We finish the split at next insert or split operation on the old bucket as
+explained in insert and split algorithm above.  It could be done during
+searches, too, but it seems best not to put any extra updates in what would
+otherwise be a read-only operation (updating is not possible in hot standby
+mode anyway).  It would seem natural to complete the split in VACUUM, but since
+splitting a bucket might require allocating a new page, it might fail if you
+run out of disk space.  That would be bad during VACUUM - the reason for
+running VACUUM in the first place might be that you run out of disk space,
+and now VACUUM won't finish because you're out of disk space.  In contrast,
+an insertion can require enlarging the physical file anyway.
+
+Deletion of tuples from a bucket is performed for two reasons, one for
+removing the dead tuples and other for removing the tuples that are moved by
+split.  WAL entry is made for each bucket page from which tuples are removed,
+followed by a WAL entry to clear the garbage flag if the tuples moved by split
+are removed.  Another separate WAL entry is made for updating the metapage if
+the deletion is performed for removing the dead tuples by vacuum.
+
+As deletion involves multiple atomic operations, it is quite possible that
+system crashes after (a) removing tuples from some of the bucket pages,
+(b) before clearing the garbage flag, or (c) before updating the metapage.
+If the system crashes before completing (b), it will again try to clean the
+bucket during next vacuum or insert after recovery which can have some performance
+impact, but it will work fine. If the system crashes before completing (c),
+after recovery there could be some additional splits till the next vacuum
+updates the metapage, but the other operations like insert, delete and scan
+will work correctly.  We can fix this problem by actually updating the metapage
+based on delete operation during replay, but not sure if it is worth the
+complication.
+
+A squeeze operation moves tuples from one of the buckets later in the chain to
+one of the bucket earlier in chain and writes WAL record when either the
+bucket to which it is writing tuples is filled or bucket from which it
+is removing the tuples becomes empty.
+
+As a squeeze operation involves writing multiple atomic operations, it is
+quite possible that the system crashes before completing the operation on
+entire bucket.  After recovery, the operations will work correctly, but
+the index will remain bloated and this can impact performance of read and
+insert operations until the next vacuum squeeze the bucket completely.
 
 
 Other Notes
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 1f8a7f6..6416769 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -28,6 +28,7 @@
 #include "utils/builtins.h"
 #include "utils/index_selfuncs.h"
 #include "utils/rel.h"
+#include "miscadmin.h"
 
 
 /* Working state for hashbuild and its callback */
@@ -303,6 +304,11 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 		buf = so->hashso_curbuf;
 		Assert(BufferIsValid(buf));
 		page = BufferGetPage(buf);
+
+		/*
+		 * We don't need test for old snapshot here as the current buffer is
+		 * pinned, so vacuum can't clean the page.
+		 */
 		maxoffnum = PageGetMaxOffsetNumber(page);
 		for (offnum = ItemPointerGetOffsetNumber(current);
 			 offnum <= maxoffnum;
@@ -623,6 +629,7 @@ loop_top:
 	}
 
 	/* Okay, we're really done.  Update tuple count in metapage. */
+	START_CRIT_SECTION();
 
 	if (orig_maxbucket == metap->hashm_maxbucket &&
 		orig_ntuples == metap->hashm_ntuples)
@@ -649,6 +656,26 @@ loop_top:
 	}
 
 	MarkBufferDirty(metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_update_meta_page xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.ntuples = metap->hashm_ntuples;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, sizeof(SizeOfHashUpdateMetaPage));
+
+		XLogRegisterBuffer(0, metabuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_UPDATE_META_PAGE);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
 	_hash_relbuf(rel, metabuf);
 
 	/* return statistics */
@@ -816,9 +843,40 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 */
 		if (ndeletable > 0)
 		{
+			/* No ereport(ERROR) until changes are logged */
+			START_CRIT_SECTION();
+
 			PageIndexMultiDelete(page, deletable, ndeletable);
 			bucket_dirty = true;
 			MarkBufferDirty(buf);
+
+			/* XLOG stuff */
+			if (RelationNeedsWAL(rel))
+			{
+				xl_hash_delete xlrec;
+				XLogRecPtr	recptr;
+
+				xlrec.is_primary_bucket_page = (buf == bucket_buf) ? true : false;
+
+				XLogBeginInsert();
+				XLogRegisterData((char *) &xlrec, SizeOfHashDelete);
+
+				/*
+				 * bucket buffer needs to be registered to ensure that we can
+				 * acquire a cleanup lock on it during replay.
+				 */
+				if (!xlrec.is_primary_bucket_page)
+					XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD | REGBUF_NO_IMAGE);
+
+				XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
+				XLogRegisterBufData(1, (char *) deletable,
+									ndeletable * sizeof(OffsetNumber));
+
+				recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_DELETE);
+				PageSetLSN(BufferGetPage(buf), recptr);
+			}
+
+			END_CRIT_SECTION();
 		}
 
 		/* bail out if there are no more pages to scan. */
@@ -866,8 +924,25 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		page = BufferGetPage(bucket_buf);
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
+		/* No ereport(ERROR) until changes are logged */
+		START_CRIT_SECTION();
+
 		bucket_opaque->hasho_flag &= ~LH_BUCKET_NEEDS_SPLIT_CLEANUP;
 		MarkBufferDirty(bucket_buf);
+
+		/* XLOG stuff */
+		if (RelationNeedsWAL(rel))
+		{
+			XLogRecPtr	recptr;
+
+			XLogBeginInsert();
+			XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD);
+
+			recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_CLEANUP);
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
 	}
 
 	/*
@@ -881,9 +956,3 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 	else
 		LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
 }
-
-void
-hash_redo(XLogReaderState *record)
-{
-	elog(PANIC, "hash_redo: unimplemented");
-}
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
new file mode 100644
index 0000000..4b6811e
--- /dev/null
+++ b/src/backend/access/hash/hash_xlog.c
@@ -0,0 +1,963 @@
+/*-------------------------------------------------------------------------
+ *
+ * hash_xlog.c
+ *	  WAL replay logic for hash index.
+ *
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/hash/hash_xlog.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/hash.h"
+#include "access/hash_xlog.h"
+#include "access/xlogutils.h"
+
+/*
+ * replay a hash index meta page
+ */
+static void
+hash_xlog_init_meta_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Page		page;
+	Buffer		metabuf;
+
+	xl_hash_init_meta_page *xlrec = (xl_hash_init_meta_page *) XLogRecGetData(record);
+
+	/* create the index' metapage */
+	metabuf = XLogInitBufferForRedo(record, 0);
+	Assert(BufferIsValid(metabuf));
+	_hash_init_metabuffer(metabuf, xlrec->num_tuples, xlrec->procid,
+						  xlrec->ffactor, true);
+	page = (Page) BufferGetPage(metabuf);
+	PageSetLSN(page, lsn);
+	MarkBufferDirty(metabuf);
+	/* all done */
+	UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay a hash index bitmap page
+ */
+static void
+hash_xlog_init_bitmap_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		bitmapbuf;
+	Buffer		metabuf;
+	Page		page;
+	HashMetaPage metap;
+	uint32		num_buckets;
+
+	xl_hash_init_bitmap_page *xlrec = (xl_hash_init_bitmap_page *) XLogRecGetData(record);
+
+	/*
+	 * Initialize bitmap page
+	 */
+	bitmapbuf = XLogInitBufferForRedo(record, 0);
+	_hash_initbitmapbuffer(bitmapbuf, xlrec->bmsize, true);
+	PageSetLSN(BufferGetPage(bitmapbuf), lsn);
+	MarkBufferDirty(bitmapbuf);
+	UnlockReleaseBuffer(bitmapbuf);
+
+	/* add the new bitmap page to the metapage's list of bitmaps */
+	if (XLogReadBufferForRedo(record, 1, &metabuf) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Note: in normal operation, we'd update the metapage while still
+		 * holding lock on the bitmap page.  But during replay it's not
+		 * necessary to hold that lock, since nobody can see it yet; the
+		 * creating transaction hasn't yet committed.
+		 */
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+
+		num_buckets = metap->hashm_maxbucket + 1;
+		metap->hashm_mapp[metap->hashm_nmaps] = num_buckets + 1;
+		metap->hashm_nmaps++;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay a hash index insert without split
+ */
+static void
+hash_xlog_insert(XLogReaderState *record)
+{
+	HashMetaPage metap;
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_insert *xlrec = (xl_hash_insert *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		Size		datalen;
+		char	   *datapos = XLogRecGetBlockData(record, 0, &datalen);
+
+		page = BufferGetPage(buffer);
+
+		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+						false, false) == InvalidOffsetNumber)
+			elog(PANIC, "hash_xlog_insert: failed to add item");
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+
+	if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Note: in normal operation, we'd update the metapage while still
+		 * holding lock on the page we inserted into.  But during replay it's
+		 * not necessary to hold that lock, since no other index updates can
+		 * be happening concurrently.
+		 */
+		page = BufferGetPage(buffer);
+		metap = HashPageGetMeta(page);
+		metap->hashm_ntuples += 1;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
+
+/*
+ * replay addition of overflow page for hash index
+ */
+static void
+hash_xlog_add_ovfl_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_add_ovfl_page *xlrec = (xl_hash_add_ovfl_page *) XLogRecGetData(record);
+	Buffer		leftbuf;
+	Buffer		ovflbuf;
+	Buffer		metabuf;
+	BlockNumber leftblk;
+	BlockNumber rightblk;
+	BlockNumber newmapblk = InvalidBlockNumber;
+	Page		ovflpage;
+	HashPageOpaque ovflopaque;
+	uint32	   *num_bucket;
+	char	   *data;
+	Size datalen PG_USED_FOR_ASSERTS_ONLY;
+	bool		new_bmpage = false;
+
+	XLogRecGetBlockTag(record, 0, NULL, NULL, &rightblk);
+	XLogRecGetBlockTag(record, 1, NULL, NULL, &leftblk);
+
+	ovflbuf = XLogInitBufferForRedo(record, 0);
+	Assert(BufferIsValid(ovflbuf));
+
+	data = XLogRecGetBlockData(record, 0, &datalen);
+	num_bucket = (uint32 *) data;
+	Assert(datalen == sizeof(uint32));
+	_hash_initbuf(ovflbuf, InvalidBlockNumber, *num_bucket, LH_OVERFLOW_PAGE,
+				  true);
+	/* update backlink */
+	ovflpage = BufferGetPage(ovflbuf);
+	ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
+	ovflopaque->hasho_prevblkno = leftblk;
+
+	PageSetLSN(ovflpage, lsn);
+	MarkBufferDirty(ovflbuf);
+
+	if (XLogReadBufferForRedo(record, 1, &leftbuf) == BLK_NEEDS_REDO)
+	{
+		Page		leftpage;
+		HashPageOpaque leftopaque;
+
+		leftpage = BufferGetPage(leftbuf);
+		leftopaque = (HashPageOpaque) PageGetSpecialPointer(leftpage);
+		leftopaque->hasho_nextblkno = rightblk;
+
+		PageSetLSN(leftpage, lsn);
+		MarkBufferDirty(leftbuf);
+	}
+
+	if (BufferIsValid(leftbuf))
+		UnlockReleaseBuffer(leftbuf);
+	UnlockReleaseBuffer(ovflbuf);
+
+	/*
+	 * Note: in normal operation, we'd update the bitmap and meta page while
+	 * still holding lock on the overflow pages.  But during replay it's not
+	 * necessary to hold those locks, since no other index updates can be
+	 * happening concurrently.
+	 */
+	if (XLogRecHasBlockRef(record, 2))
+	{
+		Buffer		mapbuffer;
+
+		if (XLogReadBufferForRedo(record, 2, &mapbuffer) == BLK_NEEDS_REDO)
+		{
+			Page		mappage = (Page) BufferGetPage(mapbuffer);
+			uint32	   *freep = NULL;
+			char	   *data;
+			uint32	   *bitmap_page_bit;
+
+			freep = HashPageGetBitmap(mappage);
+
+			data = XLogRecGetBlockData(record, 2, &datalen);
+			bitmap_page_bit = (uint32 *) data;
+
+			SETBIT(freep, *bitmap_page_bit);
+
+			PageSetLSN(mappage, lsn);
+			MarkBufferDirty(mapbuffer);
+		}
+		if (BufferIsValid(mapbuffer))
+			UnlockReleaseBuffer(mapbuffer);
+	}
+
+	if (XLogRecHasBlockRef(record, 3))
+	{
+		Buffer		newmapbuf;
+
+		newmapbuf = XLogInitBufferForRedo(record, 3);
+
+		_hash_initbitmapbuffer(newmapbuf, xlrec->bmsize, true);
+
+		new_bmpage = true;
+		newmapblk = BufferGetBlockNumber(newmapbuf);
+
+		MarkBufferDirty(newmapbuf);
+		PageSetLSN(BufferGetPage(newmapbuf), lsn);
+
+		UnlockReleaseBuffer(newmapbuf);
+	}
+
+	if (XLogReadBufferForRedo(record, 4, &metabuf) == BLK_NEEDS_REDO)
+	{
+		HashMetaPage metap;
+		Page		page;
+		uint32	   *firstfree_ovflpage;
+
+		data = XLogRecGetBlockData(record, 4, &datalen);
+		firstfree_ovflpage = (uint32 *) data;
+
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+		metap->hashm_firstfree = *firstfree_ovflpage;
+
+		if (!xlrec->bmpage_found)
+		{
+			metap->hashm_spares[metap->hashm_ovflpoint]++;
+
+			if (new_bmpage)
+			{
+				Assert(BlockNumberIsValid(newmapblk));
+
+				metap->hashm_mapp[metap->hashm_nmaps] = newmapblk;
+				metap->hashm_nmaps++;
+				metap->hashm_spares[metap->hashm_ovflpoint]++;
+			}
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay allocation of page for split operation
+ */
+static void
+hash_xlog_split_allocate_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_split_allocate_page *xlrec = (xl_hash_split_allocate_page *) XLogRecGetData(record);
+	Buffer		oldbuf;
+	Buffer		newbuf;
+	Buffer		metabuf;
+	Size datalen PG_USED_FOR_ASSERTS_ONLY;
+	char	   *data;
+	XLogRedoAction action;
+
+	/*
+	 * To be consitent with normal operation, here we do take cleanup locks on
+	 * both old and new buckets even though there can't be any concurrent
+	 * inserts.
+	 */
+
+	/* replay the record for old bucket */
+	action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &oldbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the special space is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		oldpage;
+		HashPageOpaque oldopaque;
+
+		oldpage = BufferGetPage(oldbuf);
+		oldopaque = (HashPageOpaque) PageGetSpecialPointer(oldpage);
+
+		oldopaque->hasho_flag = xlrec->old_bucket_flag;
+		oldopaque->hasho_prevblkno = xlrec->new_bucket;
+
+		PageSetLSN(oldpage, lsn);
+		MarkBufferDirty(oldbuf);
+	}
+
+	/* replay the record for new bucket */
+	newbuf = XLogInitBufferForRedo(record, 1);
+	_hash_initbuf(newbuf, xlrec->new_bucket, xlrec->new_bucket,
+				  xlrec->new_bucket_flag, true);
+	if (!IsBufferCleanupOK(newbuf))
+		elog(PANIC, "hash_xlog_split_allocate_page: failed to acquire cleanup lock");
+	MarkBufferDirty(newbuf);
+	PageSetLSN(BufferGetPage(newbuf), lsn);
+
+	/*
+	 * We could release the lock on the old bucket earlier, but we do it here
+	 * to stay consistent with normal operation.
+	 */
+	if (BufferIsValid(oldbuf))
+		UnlockReleaseBuffer(oldbuf);
+	if (BufferIsValid(newbuf))
+		UnlockReleaseBuffer(newbuf);
+
+	/*
+	 * Note: in normal operation, we'd update the meta page while still
+	 * holding lock on the old and new bucket pages.  But during replay it's
+	 * not necessary to hold those locks, since no other bucket splits can be
+	 * happening concurrently.
+	 */
+
+	/* replay the record for metapage changes */
+	if (XLogReadBufferForRedo(record, 2, &metabuf) == BLK_NEEDS_REDO)
+	{
+		Page		page;
+		HashMetaPage metap;
+
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+		metap->hashm_maxbucket = xlrec->new_bucket;
+
+		data = XLogRecGetBlockData(record, 2, &datalen);
+
+		if (xlrec->flags & XLH_SPLIT_META_UPDATE_MASKS)
+		{
+			uint32		lowmask;
+			uint32	   *highmask;
+
+			/* extract low and high masks. */
+			memcpy(&lowmask, data, sizeof(uint32));
+			highmask = (uint32 *) ((char *) data + sizeof(uint32));
+
+			/* update metapage */
+			metap->hashm_lowmask = lowmask;
+			metap->hashm_highmask = *highmask;
+
+			data += sizeof(uint32) * 2;
+		}
+
+		if (xlrec->flags & XLH_SPLIT_META_UPDATE_SPLITPOINT)
+		{
+			uint32		ovflpoint;
+			uint32	   *ovflpages;
+
+			/* extract information of overflow pages. */
+			memcpy(&ovflpoint, data, sizeof(uint32));
+			ovflpages = (uint32 *) ((char *) data + sizeof(uint32));
+
+			/* update metapage */
+			metap->hashm_spares[ovflpoint] = *ovflpages;
+			metap->hashm_ovflpoint = ovflpoint;
+		}
+
+		MarkBufferDirty(metabuf);
+		PageSetLSN(BufferGetPage(metabuf), lsn);
+	}
+
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+/*
+ * replay of split operation
+ */
+static void
+hash_xlog_split_page(XLogReaderState *record)
+{
+	Buffer		buf;
+
+	if (XLogReadBufferForRedo(record, 0, &buf) != BLK_RESTORED)
+		elog(ERROR, "Hash split record did not contain a full-page image");
+
+	UnlockReleaseBuffer(buf);
+}
+
+/*
+ * replay completion of split operation
+ */
+static void
+hash_xlog_split_complete(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_split_complete *xlrec = (xl_hash_split_complete *) XLogRecGetData(record);
+	Buffer		oldbuf;
+	Buffer		newbuf;
+	XLogRedoAction action;
+
+	/* replay the record for old bucket */
+	action = XLogReadBufferForRedo(record, 0, &oldbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		oldpage;
+		HashPageOpaque oldopaque;
+
+		oldpage = BufferGetPage(oldbuf);
+		oldopaque = (HashPageOpaque) PageGetSpecialPointer(oldpage);
+
+		oldopaque->hasho_flag = xlrec->old_bucket_flag;
+
+		PageSetLSN(oldpage, lsn);
+		MarkBufferDirty(oldbuf);
+	}
+	if (BufferIsValid(oldbuf))
+		UnlockReleaseBuffer(oldbuf);
+
+	/* replay the record for new bucket */
+	action = XLogReadBufferForRedo(record, 1, &newbuf);
+
+	/*
+	 * Note that we still update the page even if it was restored from a full
+	 * page image, because the bucket flag is not included in the image.
+	 */
+	if (action == BLK_NEEDS_REDO || action == BLK_RESTORED)
+	{
+		Page		newpage;
+		HashPageOpaque nopaque;
+
+		newpage = BufferGetPage(newbuf);
+		nopaque = (HashPageOpaque) PageGetSpecialPointer(newpage);
+
+		nopaque->hasho_flag = xlrec->new_bucket_flag;
+
+		PageSetLSN(newpage, lsn);
+		MarkBufferDirty(newbuf);
+	}
+	if (BufferIsValid(newbuf))
+		UnlockReleaseBuffer(newbuf);
+}
+
+/*
+ * replay move of page contents for squeeze operation of hash index
+ */
+static void
+hash_xlog_move_page_contents(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_move_page_contents *xldata = (xl_hash_move_page_contents *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		writebuf = InvalidBuffer;
+	Buffer		deletebuf = InvalidBuffer;
+	XLogRedoAction action;
+
+	/*
+	 * Ensure we have a cleanup lock on the primary bucket page before we
+	 * start the actual replay operation.  This guarantees that no scan can
+	 * start, and no scan can already be in progress, during the replay of
+	 * this operation.  If scans were allowed, they could miss some records
+	 * or see the same record multiple times.
+	 */
+	if (xldata->is_prim_bucket_same_wrt)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &writebuf);
+	else
+	{
+		/*
+		 * We don't care about the return value; the only purpose of reading
+		 * bucketbuf is to acquire a cleanup lock on the primary bucket page.
+		 */
+		(void) XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &writebuf);
+	}
+
+	/* replay the record for adding entries in overflow buffer */
+	if (action == BLK_NEEDS_REDO)
+	{
+		Page		writepage;
+		char	   *begin;
+		char	   *data;
+		Size		datalen;
+		uint16		ninserted = 0;
+
+		data = begin = XLogRecGetBlockData(record, 1, &datalen);
+
+		writepage = (Page) BufferGetPage(writebuf);
+
+		if (xldata->ntups > 0)
+		{
+			OffsetNumber *towrite = (OffsetNumber *) data;
+
+			data += sizeof(OffsetNumber) * xldata->ntups;
+
+			while (data - begin < datalen)
+			{
+				IndexTuple	itup = (IndexTuple) data;
+				Size		itemsz;
+				OffsetNumber l;
+
+				itemsz = IndexTupleDSize(*itup);
+				itemsz = MAXALIGN(itemsz);
+
+				data += itemsz;
+
+				l = PageAddItem(writepage, (Item) itup, itemsz, towrite[ninserted], false, false);
+				if (l == InvalidOffsetNumber)
+					elog(ERROR, "hash_xlog_move_page_contents: failed to add item to hash index page, size %d bytes",
+						 (int) itemsz);
+
+				ninserted++;
+			}
+		}
+
+		/*
+		 * The number of tuples inserted must be the same as requested in the
+		 * REDO record.
+		 */
+		Assert(ninserted == xldata->ntups);
+
+		PageSetLSN(writepage, lsn);
+		MarkBufferDirty(writebuf);
+	}
+
+	/* replay the record for deleting entries from overflow buffer */
+	if (XLogReadBufferForRedo(record, 2, &deletebuf) == BLK_NEEDS_REDO)
+	{
+		Page		page;
+		char	   *ptr;
+		Size		len;
+
+		ptr = XLogRecGetBlockData(record, 2, &len);
+
+		page = (Page) BufferGetPage(deletebuf);
+
+		if (len > 0)
+		{
+			OffsetNumber *unused;
+			OffsetNumber *unend;
+
+			unused = (OffsetNumber *) ptr;
+			unend = (OffsetNumber *) ((char *) ptr + len);
+
+			if ((unend - unused) > 0)
+				PageIndexMultiDelete(page, unused, unend - unused);
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(deletebuf);
+	}
+
+	/*
+	 * Replay is complete, so now we can release the buffers.  We release the
+	 * locks only at the end of the replay operation to ensure that the lock
+	 * on the primary bucket page is held until the whole operation finishes.
+	 * We could release the lock on the write buffer as soon as its part of
+	 * the operation is done, if it is not the same as the primary bucket
+	 * page, but that doesn't seem worth complicating the code.
+	 */
+	if (BufferIsValid(deletebuf))
+		UnlockReleaseBuffer(deletebuf);
+
+	if (BufferIsValid(writebuf))
+		UnlockReleaseBuffer(writebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay squeeze page operation of hash index
+ */
+static void
+hash_xlog_squeeze_page(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_squeeze_page *xldata = (xl_hash_squeeze_page *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		writebuf;
+	Buffer		ovflbuf;
+	Buffer		prevbuf = InvalidBuffer;
+	Buffer		mapbuf;
+	XLogRedoAction action;
+
+	/*
+	 * Ensure we have a cleanup lock on the primary bucket page before we
+	 * start the actual replay operation.  This guarantees that no scan can
+	 * start, and no scan can already be in progress, during the replay of
+	 * this operation.  If scans were allowed, they could miss some records
+	 * or see the same record multiple times.
+	 */
+	if (xldata->is_prim_bucket_same_wrt)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &writebuf);
+	else
+	{
+		/*
+		 * We don't care about the return value; the only purpose of reading
+		 * bucketbuf is to acquire a cleanup lock on the primary bucket page.
+		 */
+		(void) XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &writebuf);
+	}
+
+	/* replay the record for adding entries in overflow buffer */
+	if (action == BLK_NEEDS_REDO)
+	{
+		Page		writepage;
+		char	   *begin;
+		char	   *data;
+		Size		datalen;
+		uint16		ninserted = 0;
+
+		data = begin = XLogRecGetBlockData(record, 1, &datalen);
+
+		writepage = (Page) BufferGetPage(writebuf);
+
+		if (xldata->ntups > 0)
+		{
+			OffsetNumber *towrite = (OffsetNumber *) data;
+
+			data += sizeof(OffsetNumber) * xldata->ntups;
+
+			while (data - begin < datalen)
+			{
+				IndexTuple	itup = (IndexTuple) data;
+				Size		itemsz;
+				OffsetNumber l;
+
+				itemsz = IndexTupleDSize(*itup);
+				itemsz = MAXALIGN(itemsz);
+
+				data += itemsz;
+
+				l = PageAddItem(writepage, (Item) itup, itemsz, towrite[ninserted], false, false);
+				if (l == InvalidOffsetNumber)
+					elog(ERROR, "hash_xlog_squeeze_page: failed to add item to hash index page, size %d bytes",
+						 (int) itemsz);
+
+				ninserted++;
+			}
+		}
+
+		/*
+		 * The number of tuples inserted must be the same as requested in the
+		 * REDO record.
+		 */
+		Assert(ninserted == xldata->ntups);
+
+		/*
+		 * If the page to which we are adding tuples is the page previous to
+		 * the freed overflow page, then update its nextblkno.
+		 */
+		if (xldata->is_prev_bucket_same_wrt)
+		{
+			HashPageOpaque writeopaque = (HashPageOpaque) PageGetSpecialPointer(writepage);
+
+			writeopaque->hasho_nextblkno = xldata->nextblkno;
+		}
+
+		PageSetLSN(writepage, lsn);
+		MarkBufferDirty(writebuf);
+	}
+
+	/* replay the record for initializing overflow buffer */
+	if (XLogReadBufferForRedo(record, 2, &ovflbuf) == BLK_NEEDS_REDO)
+	{
+		Page		ovflpage;
+
+		ovflpage = BufferGetPage(ovflbuf);
+
+		_hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));
+
+		PageSetLSN(ovflpage, lsn);
+		MarkBufferDirty(ovflbuf);
+	}
+	if (BufferIsValid(ovflbuf))
+		UnlockReleaseBuffer(ovflbuf);
+
+	/* replay the record for page previous to the freed overflow page */
+	if (!xldata->is_prev_bucket_same_wrt &&
+		XLogReadBufferForRedo(record, 3, &prevbuf) == BLK_NEEDS_REDO)
+	{
+		Page		prevpage = BufferGetPage(prevbuf);
+		HashPageOpaque prevopaque = (HashPageOpaque) PageGetSpecialPointer(prevpage);
+
+		prevopaque->hasho_nextblkno = xldata->nextblkno;
+
+		PageSetLSN(prevpage, lsn);
+		MarkBufferDirty(prevbuf);
+	}
+	if (BufferIsValid(prevbuf))
+		UnlockReleaseBuffer(prevbuf);
+
+	/* replay the record for page next to the freed overflow page */
+	if (XLogRecHasBlockRef(record, 4))
+	{
+		Buffer		nextbuf;
+
+		if (XLogReadBufferForRedo(record, 4, &nextbuf) == BLK_NEEDS_REDO)
+		{
+			Page		nextpage = BufferGetPage(nextbuf);
+			HashPageOpaque nextopaque = (HashPageOpaque) PageGetSpecialPointer(nextpage);
+
+			nextopaque->hasho_prevblkno = xldata->prevblkno;
+
+			PageSetLSN(nextpage, lsn);
+			MarkBufferDirty(nextbuf);
+		}
+		if (BufferIsValid(nextbuf))
+			UnlockReleaseBuffer(nextbuf);
+	}
+
+	if (BufferIsValid(writebuf))
+		UnlockReleaseBuffer(writebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+
+	/*
+	 * Note: in normal operation, we'd update the bitmap and meta page while
+	 * still holding lock on the primary bucket page and overflow pages.  But
+	 * during replay it's not necessary to hold those locks, since no other
+	 * index updates can be happening concurrently.
+	 */
+	/* replay the record for bitmap page */
+	if (XLogReadBufferForRedo(record, 5, &mapbuf) == BLK_NEEDS_REDO)
+	{
+		Page		mappage = (Page) BufferGetPage(mapbuf);
+		uint32	   *freep = NULL;
+		char	   *data;
+		uint32	   *bitmap_page_bit;
+		Size		datalen;
+
+		freep = HashPageGetBitmap(mappage);
+
+		data = XLogRecGetBlockData(record, 5, &datalen);
+		bitmap_page_bit = (uint32 *) data;
+
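+		/* clear the freed overflow page's bit, marking it free in the bitmap */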
+		CLRBIT(freep, *bitmap_page_bit);
+
+		PageSetLSN(mappage, lsn);
+		MarkBufferDirty(mapbuf);
+	}
+	if (BufferIsValid(mapbuf))
+		UnlockReleaseBuffer(mapbuf);
+
+	/* replay the record for meta page */
+	if (XLogRecHasBlockRef(record, 6))
+	{
+		Buffer		metabuf;
+
+		if (XLogReadBufferForRedo(record, 6, &metabuf) == BLK_NEEDS_REDO)
+		{
+			HashMetaPage metap;
+			Page		page;
+			char	   *data;
+			uint32	   *firstfree_ovflpage;
+			Size		datalen;
+
+			data = XLogRecGetBlockData(record, 6, &datalen);
+			firstfree_ovflpage = (uint32 *) data;
+
+			page = BufferGetPage(metabuf);
+			metap = HashPageGetMeta(page);
+			metap->hashm_firstfree = *firstfree_ovflpage;
+
+			PageSetLSN(page, lsn);
+			MarkBufferDirty(metabuf);
+		}
+		if (BufferIsValid(metabuf))
+			UnlockReleaseBuffer(metabuf);
+	}
+}
+
+/*
+ * replay delete operation of hash index
+ */
+static void
+hash_xlog_delete(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_delete *xldata = (xl_hash_delete *) XLogRecGetData(record);
+	Buffer		bucketbuf = InvalidBuffer;
+	Buffer		deletebuf;
+	Page		page;
+	XLogRedoAction action;
+
+	/*
+	 * Ensure we have a cleanup lock on the primary bucket page before we
+	 * start the actual replay operation.  This guarantees that no scan can
+	 * start, and no scan can already be in progress, during the replay of
+	 * this operation.  If scans were allowed, they could miss some records
+	 * or see the same record multiple times.
+	 */
+	if (xldata->is_primary_bucket_page)
+		action = XLogReadBufferForRedoExtended(record, 1, RBM_NORMAL, true, &deletebuf);
+	else
+	{
+		/*
+		 * We don't care about the return value; the only purpose of reading
+		 * bucketbuf is to acquire a cleanup lock on the primary bucket page.
+		 */
+		(void) XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true, &bucketbuf);
+
+		action = XLogReadBufferForRedo(record, 1, &deletebuf);
+	}
+
+	/* replay the record for deleting entries in bucket page */
+	if (action == BLK_NEEDS_REDO)
+	{
+		char	   *ptr;
+		Size		len;
+
+		ptr = XLogRecGetBlockData(record, 1, &len);
+
+		page = (Page) BufferGetPage(deletebuf);
+
+		if (len > 0)
+		{
+			OffsetNumber *unused;
+			OffsetNumber *unend;
+
+			unused = (OffsetNumber *) ptr;
+			unend = (OffsetNumber *) ((char *) ptr + len);
+
+			if ((unend - unused) > 0)
+				PageIndexMultiDelete(page, unused, unend - unused);
+		}
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(deletebuf);
+	}
+	if (BufferIsValid(deletebuf))
+		UnlockReleaseBuffer(deletebuf);
+
+	if (BufferIsValid(bucketbuf))
+		UnlockReleaseBuffer(bucketbuf);
+}
+
+/*
+ * replay split cleanup flag operation for primary bucket page.
+ */
+static void
+hash_xlog_split_cleanup(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		HashPageOpaque bucket_opaque;
+
+		page = (Page) BufferGetPage(buffer);
+
+		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
+		bucket_opaque->hasho_flag &= ~LH_BUCKET_NEEDS_SPLIT_CLEANUP;
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
+
+/*
+ * replay for update meta page
+ */
+static void
+hash_xlog_update_meta_page(XLogReaderState *record)
+{
+	HashMetaPage metap;
+	XLogRecPtr	lsn = record->EndRecPtr;
+	xl_hash_update_meta_page *xldata = (xl_hash_update_meta_page *) XLogRecGetData(record);
+	Buffer		metabuf;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &metabuf) == BLK_NEEDS_REDO)
+	{
+		page = BufferGetPage(metabuf);
+		metap = HashPageGetMeta(page);
+
+		metap->hashm_ntuples = xldata->ntuples;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(metabuf);
+	}
+	if (BufferIsValid(metabuf))
+		UnlockReleaseBuffer(metabuf);
+}
+
+void
+hash_redo(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			hash_xlog_init_meta_page(record);
+			break;
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			hash_xlog_init_bitmap_page(record);
+			break;
+		case XLOG_HASH_INSERT:
+			hash_xlog_insert(record);
+			break;
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			hash_xlog_add_ovfl_page(record);
+			break;
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			hash_xlog_split_allocate_page(record);
+			break;
+		case XLOG_HASH_SPLIT_PAGE:
+			hash_xlog_split_page(record);
+			break;
+		case XLOG_HASH_SPLIT_COMPLETE:
+			hash_xlog_split_complete(record);
+			break;
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			hash_xlog_move_page_contents(record);
+			break;
+		case XLOG_HASH_SQUEEZE_PAGE:
+			hash_xlog_squeeze_page(record);
+			break;
+		case XLOG_HASH_DELETE:
+			hash_xlog_delete(record);
+			break;
+		case XLOG_HASH_SPLIT_CLEANUP:
+			hash_xlog_split_cleanup(record);
+			break;
+		case XLOG_HASH_UPDATE_META_PAGE:
+			hash_xlog_update_meta_page(record);
+			break;
+		default:
+			elog(PANIC, "hash_redo: unknown op code %u", info);
+	}
+}
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 354e733..241728f 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -16,6 +16,8 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "access/hash_xlog.h"
+#include "miscadmin.h"
 #include "utils/rel.h"
 
 
@@ -40,6 +42,7 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	bool		do_expand;
 	uint32		hashkey;
 	Bucket		bucket;
+	OffsetNumber itup_off;
 
 	/*
 	 * Get the hash key for the item (it's stored in the index tuple itself).
@@ -158,25 +161,20 @@ restart_insert:
 		Assert(pageopaque->hasho_bucket == bucket);
 	}
 
-	/* found page with enough space, so add the item here */
-	(void) _hash_pgaddtup(rel, buf, itemsz, itup);
-
-	/*
-	 * dirty and release the modified page.  if the page we modified was an
-	 * overflow page, we also need to separately drop the pin we retained on
-	 * the primary bucket page.
-	 */
-	MarkBufferDirty(buf);
-	_hash_relbuf(rel, buf);
-	if (buf != bucket_buf)
-		_hash_dropbuf(rel, bucket_buf);
-
 	/*
 	 * Write-lock the metapage so we can increment the tuple count. After
 	 * incrementing it, check to see if it's time for a split.
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* Do the update.  No ereport(ERROR) until changes are logged */
+	START_CRIT_SECTION();
+
+	/* found page with enough space, so add the item here */
+	itup_off = _hash_pgaddtup(rel, buf, itemsz, itup);
+	MarkBufferDirty(buf);
+
+	/* metapage operations */
 	metap = HashPageGetMeta(metapage);
 	metap->hashm_ntuples += 1;
 
@@ -184,10 +182,43 @@ restart_insert:
 	do_expand = metap->hashm_ntuples >
 		(double) metap->hashm_ffactor * (metap->hashm_maxbucket + 1);
 
-	/* Write out the metapage and drop lock, but keep pin */
 	MarkBufferDirty(metabuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_insert xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.offnum = itup_off;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInsert);
+
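+		/*
+		 * The metapage is registered as block 1 so that its tuple-count
+		 * update is replayed; the inserted tuple itself is attached as data
+		 * to block 0 (the bucket or overflow page receiving it).
+		 */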
+		XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);
+
+		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+		XLogRegisterBufData(0, (char *) itup, IndexTupleDSize(*itup));
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INSERT);
+
+		PageSetLSN(BufferGetPage(buf), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	/* drop lock on metapage, but keep pin */
 	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
 
+	/*
+	 * Release the modified page; if it was an overflow page, also drop the
+	 * pin we retained on the primary bucket page.
+	 */
+	_hash_relbuf(rel, buf);
+	if (buf != bucket_buf)
+		_hash_dropbuf(rel, bucket_buf);
+
 	/* Attempt to split if a split is needed */
 	if (do_expand)
 		_hash_expandtable(rel, metabuf);
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 1087480..b7ee6af 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -18,6 +18,8 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "access/hash_xlog.h"
+#include "miscadmin.h"
 #include "utils/rel.h"
 
 
@@ -136,6 +138,13 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 	 * page is released, then finally acquire the lock on new overflow buffer.
 	 * We need this locking order to avoid deadlock with backends that are
 	 * doing inserts.
+	 *
+	 * Note: We could have avoided locking many buffers here if we made two
+	 * WAL records for acquiring an overflow page (one to allocate an overflow
+	 * page and another to add it to the overflow bucket chain).  However,
+	 * doing so could leak an overflow page if the system crashed after the
+	 * allocation.  Needless to say, a single record is also better from a
+	 * performance point of view.
 	 */
 	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 
@@ -303,8 +312,12 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 found:
 
 	/*
-	 * Do the update.
+	 * Do the update.  No ereport(ERROR) until changes are logged. We want to
+	 * Do the update.  No ereport(ERROR) until the changes are logged.  We
+	 * log the changes for the bitmap page and the overflow page together so
+	 * that a newly added page cannot be leaked if we crash partway through.
+	START_CRIT_SECTION();
+
 	if (page_found)
 	{
 		Assert(BufferIsValid(mapbuf));
@@ -362,6 +375,51 @@ found:
 
 	MarkBufferDirty(buf);
 
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+		xl_hash_add_ovfl_page xlrec;
+
+		xlrec.bmpage_found = page_found;
+		xlrec.bmsize = metap->hashm_bmsize;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashAddOvflPage);
+
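+		/*
+		 * The new overflow page is reinitialized from scratch during replay,
+		 * hence REGBUF_WILL_INIT; the bucket number it belongs to is attached
+		 * as block data.
+		 */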
+		XLogRegisterBuffer(0, ovflbuf, REGBUF_WILL_INIT);
+		XLogRegisterBufData(0, (char *) &pageopaque->hasho_bucket, sizeof(Bucket));
+
+		XLogRegisterBuffer(1, buf, REGBUF_STANDARD);
+
+		if (BufferIsValid(mapbuf))
+		{
+			XLogRegisterBuffer(2, mapbuf, REGBUF_STANDARD);
+			XLogRegisterBufData(2, (char *) &bitmap_page_bit, sizeof(uint32));
+		}
+
+		if (BufferIsValid(newmapbuf))
+			XLogRegisterBuffer(3, newmapbuf, REGBUF_WILL_INIT);
+
+		XLogRegisterBuffer(4, metabuf, REGBUF_STANDARD);
+		XLogRegisterBufData(4, (char *) &metap->hashm_firstfree, sizeof(uint32));
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_ADD_OVFL_PAGE);
+
+		PageSetLSN(BufferGetPage(ovflbuf), recptr);
+		PageSetLSN(BufferGetPage(buf), recptr);
+
+		if (BufferIsValid(mapbuf))
+			PageSetLSN(BufferGetPage(mapbuf), recptr);
+
+		if (BufferIsValid(newmapbuf))
+			PageSetLSN(BufferGetPage(newmapbuf), recptr);
+
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
 	if (retain_pin)
 		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 	else
@@ -408,7 +466,11 @@ _hash_firstfreebit(uint32 map)
  *	Remove this overflow page from its bucket's chain, and mark the page as
  *	free.  On entry, ovflbuf is write-locked; it is released before exiting.
  *
- *	Add the tuples (itups) to wbuf.
+ *	Add the tuples (itups) to wbuf in this function.  We could do that in the
+ *	caller as well, but the advantage of doing it here is that we can easily
+ *	write the WAL for the XLOG_HASH_SQUEEZE_PAGE operation.  The addition of
+ *	tuples and the removal of the overflow page have to be done as one atomic
+ *	operation; otherwise, during replay on a standby, users might find
+ *	duplicate records.
  *
  *	Since this function is invoked in VACUUM, we provide an access strategy
  *	parameter that controls fetches of the bucket pages.
@@ -430,8 +492,6 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	HashMetaPage metap;
 	Buffer		metabuf;
 	Buffer		mapbuf;
-	Buffer		prevbuf = InvalidBuffer;
-	Buffer		nextbuf = InvalidBuffer;
 	BlockNumber ovflblkno;
 	BlockNumber prevblkno;
 	BlockNumber blkno;
@@ -445,6 +505,9 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	int32		bitmappage,
 				bitmapbit;
 	Bucket bucket PG_USED_FOR_ASSERTS_ONLY;
+	Buffer		prevbuf = InvalidBuffer;
+	Buffer		nextbuf = InvalidBuffer;
+	bool		update_metap = false;
 
 	/* Get information from the doomed page */
 	_hash_checkpage(rel, ovflbuf, LH_OVERFLOW_PAGE);
@@ -508,6 +571,12 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	/* Get write-lock on metapage to update firstfree */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	/* This operation needs to log multiple tuples, prepare WAL for that */
+	if (RelationNeedsWAL(rel))
+		XLogEnsureRecordSpace(HASH_XLOG_FREE_OVFL_BUFS, 4 + nitups);
+
+	START_CRIT_SECTION();
+
 	/*
 	 * we have to insert tuples on the "write" page, being careful to preserve
 	 * hashkey ordering.  (If we insert many tuples into the same "write" page
@@ -519,7 +588,11 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 		MarkBufferDirty(wbuf);
 	}
 
-	/* Initialize the freed overflow page. */
+	/*
+	 * Initialize the freed overflow page.  Just zeroing the page won't work,
+	 * because WAL replay routines expect pages to be initialized. See
+	 * explanation of RBM_NORMAL mode atop XLogReadBufferExtended.
+	 */
 	_hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));
 	MarkBufferDirty(ovflbuf);
 
@@ -550,9 +623,83 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 	if (ovflbitno < metap->hashm_firstfree)
 	{
 		metap->hashm_firstfree = ovflbitno;
+		update_metap = true;
 		MarkBufferDirty(metabuf);
 	}
 
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_squeeze_page xlrec;
+		XLogRecPtr	recptr;
+		int			i;
+
+		xlrec.prevblkno = prevblkno;
+		xlrec.nextblkno = nextblkno;
+		xlrec.ntups = nitups;
+		xlrec.is_prim_bucket_same_wrt = (wbuf == bucketbuf);
+		xlrec.is_prev_bucket_same_wrt = (wbuf == prevbuf);
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashSqueezePage);
+
+		/*
+		 * bucket buffer needs to be registered to ensure that we can acquire
+		 * a cleanup lock on it during replay.
+		 */
+		if (!xlrec.is_prim_bucket_same_wrt)
+			XLogRegisterBuffer(0, bucketbuf, REGBUF_STANDARD | REGBUF_NO_IMAGE);
+
+		XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
+		if (xlrec.ntups > 0)
+		{
+			XLogRegisterBufData(1, (char *) itup_offsets,
+								nitups * sizeof(OffsetNumber));
+			for (i = 0; i < nitups; i++)
+				XLogRegisterBufData(1, (char *) itups[i], tups_size[i]);
+		}
+
+		XLogRegisterBuffer(2, ovflbuf, REGBUF_STANDARD);
+
+		/*
+		 * If prevpage and the writepage (the block into which we are moving
+		 * tuples from the overflow page) are the same, there is no need to
+		 * register prevpage separately.  During replay, we can directly
+		 * update the nextblkno in writepage.
+		 */
+		if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt)
+			XLogRegisterBuffer(3, prevbuf, REGBUF_STANDARD);
+
+		if (BufferIsValid(nextbuf))
+			XLogRegisterBuffer(4, nextbuf, REGBUF_STANDARD);
+
+		XLogRegisterBuffer(5, mapbuf, REGBUF_STANDARD);
+		XLogRegisterBufData(5, (char *) &bitmapbit, sizeof(uint32));
+
+		if (update_metap)
+		{
+			XLogRegisterBuffer(6, metabuf, REGBUF_STANDARD);
+			XLogRegisterBufData(6, (char *) &metap->hashm_firstfree, sizeof(uint32));
+		}
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SQUEEZE_PAGE);
+
+		PageSetLSN(BufferGetPage(wbuf), recptr);
+		PageSetLSN(BufferGetPage(ovflbuf), recptr);
+
+		if (BufferIsValid(prevbuf) && !xlrec.is_prev_bucket_same_wrt)
+			PageSetLSN(BufferGetPage(prevbuf), recptr);
+		if (BufferIsValid(nextbuf))
+			PageSetLSN(BufferGetPage(nextbuf), recptr);
+
+		PageSetLSN(BufferGetPage(mapbuf), recptr);
+
+		if (update_metap)
+			PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
+	END_CRIT_SECTION();
+
 	/* release previous bucket if it is not same as write bucket */
 	if (BufferIsValid(prevbuf) && prevblkno != writeblkno)
 		_hash_relbuf(rel, prevbuf);
@@ -601,7 +748,11 @@ _hash_initbitmapbuffer(Buffer buf, uint16 bmsize, bool initpage)
 	freep = HashPageGetBitmap(pg);
 	MemSet(freep, 0xFF, bmsize);
 
-	/* Set pd_lower just past the end of the bitmap page data. */
+	/*
+	 * Set pd_lower just past the end of the bitmap page data.  We could even
+	 * set pd_lower equal to pd_upper, but this is more precise and makes the
+	 * page look compressible to xlog.c.
+	 */
 	((PageHeader) pg)->pd_lower = ((char *) freep + bmsize) - (char *) pg;
 }
 
@@ -761,6 +912,15 @@ readpage:
 					Assert(nitups == ndeletable);
 
 					/*
+					 * This operation needs to log multiple tuples, prepare
+					 * WAL for that.
+					 */
+					if (RelationNeedsWAL(rel))
+						XLogEnsureRecordSpace(0, 3 + nitups);
+
+					START_CRIT_SECTION();
+
+					/*
 					 * we have to insert tuples on the "write" page, being
 					 * careful to preserve hashkey ordering.  (If we insert
 					 * many tuples into the same "write" page it would be
@@ -773,6 +933,43 @@ readpage:
 					PageIndexMultiDelete(rpage, deletable, ndeletable);
 					MarkBufferDirty(rbuf);
 
+					/* XLOG stuff */
+					if (RelationNeedsWAL(rel))
+					{
+						XLogRecPtr	recptr;
+						xl_hash_move_page_contents xlrec;
+
+						xlrec.ntups = nitups;
+						xlrec.is_prim_bucket_same_wrt = (wbuf == bucket_buf) ? true : false;
+
+						XLogBeginInsert();
+						XLogRegisterData((char *) &xlrec, SizeOfHashMovePageContents);
+
+						/*
+						 * bucket buffer needs to be registered to ensure that
+						 * we can acquire a cleanup lock on it during replay.
+						 */
+						if (!xlrec.is_prim_bucket_same_wrt)
+							XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD | REGBUF_NO_IMAGE);
+
+						XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
+						XLogRegisterBufData(1, (char *) itup_offsets,
+											nitups * sizeof(OffsetNumber));
+						for (i = 0; i < nitups; i++)
+							XLogRegisterBufData(1, (char *) itups[i], tups_size[i]);
+
+						XLogRegisterBuffer(2, rbuf, REGBUF_STANDARD);
+						XLogRegisterBufData(2, (char *) deletable,
+										  ndeletable * sizeof(OffsetNumber));
+
+						recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_MOVE_PAGE_CONTENTS);
+
+						PageSetLSN(BufferGetPage(wbuf), recptr);
+						PageSetLSN(BufferGetPage(rbuf), recptr);
+					}
+
+					END_CRIT_SECTION();
+
 					tups_moved = true;
 				}
 
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index c73929c..7f2dedc 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -29,6 +29,7 @@
 #include "postgres.h"
 
 #include "access/hash.h"
+#include "access/hash_xlog.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
@@ -43,6 +44,8 @@ static void _hash_splitbucket(Relation rel, Buffer metabuf,
 				  HTAB *htab,
 				  uint32 maxbucket,
 				  uint32 highmask, uint32 lowmask);
+static void
+			log_split_page(Relation rel, Buffer buf);
 
 
 /*
@@ -381,6 +384,25 @@ _hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
 	pg = BufferGetPage(metabuf);
 	metap = HashPageGetMeta(pg);
 
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_init_meta_page xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.num_tuples = num_tuples;
+		xlrec.procid = metap->hashm_procid;
+		xlrec.ffactor = metap->hashm_ffactor;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInitMetaPage);
+		XLogRegisterBuffer(0, metabuf, REGBUF_WILL_INIT);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INIT_META_PAGE);
+
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
 	num_buckets = metap->hashm_maxbucket + 1;
 
 	/*
@@ -405,6 +427,12 @@ _hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
 		buf = _hash_getnewbuf(rel, blkno, forkNum);
 		_hash_initbuf(buf, metap->hashm_maxbucket, i, LH_BUCKET_PAGE, false);
 		MarkBufferDirty(buf);
+
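+		/* WAL-log the freshly initialized bucket page as a full-page image */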
+		log_newpage(&rel->rd_node,
+					forkNum,
+					blkno,
+					BufferGetPage(buf),
+					true);
 		_hash_relbuf(rel, buf);
 	}
 
@@ -431,6 +459,31 @@ _hash_init(Relation rel, double num_tuples, ForkNumber forkNum)
 	metap->hashm_nmaps++;
 	MarkBufferDirty(metabuf);
 
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_init_bitmap_page xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.bmsize = metap->hashm_bmsize;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, SizeOfHashInitBitmapPage);
+		XLogRegisterBuffer(0, bitmapbuf, REGBUF_WILL_INIT);
+
+		/*
+		 * This is safe only because nobody else can be modifying the index at
+		 * this stage; it's only visible to the transaction that is creating
+		 * it.
+		 */
+		XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INIT_BITMAP_PAGE);
+
+		PageSetLSN(BufferGetPage(bitmapbuf), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
 	/* all done */
 	_hash_relbuf(rel, bitmapbuf);
 	_hash_relbuf(rel, metabuf);
@@ -525,7 +578,10 @@ _hash_init_metabuffer(Buffer buf, double num_tuples, RegProcedure procid,
 	metap->hashm_ovflpoint = log2_num_buckets;
 	metap->hashm_firstfree = 0;
 
-	/* Set pd_lower just past the end of the metadata. */
+	/*
+	 * Set pd_lower just past the end of the metadata.  This lets xloginsert.c
+	 * treat the unused space as a hole when logging a full page image of the
+	 * metapage.
+	 */
 	((PageHeader) page)->pd_lower =
 		((char *) metap + sizeof(HashMetaPageData)) - (char *) page;
 }
@@ -569,6 +625,8 @@ _hash_expandtable(Relation rel, Buffer metabuf)
 	uint32		maxbucket;
 	uint32		highmask;
 	uint32		lowmask;
+	bool		metap_update_masks = false;
+	bool		metap_update_splitpoint = false;
 
 restart_expand:
 
@@ -728,7 +786,11 @@ restart_expand:
 		 * The number of buckets in the new splitpoint is equal to the total
 		 * number already in existence, i.e. new_bucket.  Currently this maps
 		 * one-to-one to blocks required, but someday we may need a more
-		 * complicated calculation here.
+		 * complicated calculation here.  We treat the allocation of buckets
+		 * as a separate WAL-logged action.  Even if we fail after this
+		 * operation, it won't leak bucket pages; the next split will consume
+		 * this space.  In any case, even without a failure we don't use all
+		 * the space in one split operation.
 		 */
 		if (!_hash_alloc_buckets(rel, start_nblkno, new_bucket))
 		{
@@ -757,8 +819,7 @@ restart_expand:
 	 * Since we are scribbling on the pages in the shared buffers, establish a
 	 * critical section.  Any failure in this next code leaves us with a big
 	 * problem: the metapage is effectively corrupt but could get written back
-	 * to disk.  We don't really expect any failure, but just to be sure,
-	 * establish a critical section.
+	 * to disk.
 	 */
 	START_CRIT_SECTION();
 
@@ -772,6 +833,7 @@ restart_expand:
 		/* Starting a new doubling */
 		metap->hashm_lowmask = metap->hashm_highmask;
 		metap->hashm_highmask = new_bucket | metap->hashm_lowmask;
+		metap_update_masks = true;
 	}
 
 	/*
@@ -784,6 +846,7 @@ restart_expand:
 	{
 		metap->hashm_spares[spare_ndx] = metap->hashm_spares[metap->hashm_ovflpoint];
 		metap->hashm_ovflpoint = spare_ndx;
+		metap_update_splitpoint = true;
 	}
 
 	MarkBufferDirty(metabuf);
@@ -829,6 +892,49 @@ restart_expand:
 
 	MarkBufferDirty(buf_nblkno);
 
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_hash_split_allocate_page xlrec;
+		XLogRecPtr	recptr;
+
+		xlrec.new_bucket = maxbucket;
+		xlrec.old_bucket_flag = oopaque->hasho_flag;
+		xlrec.new_bucket_flag = nopaque->hasho_flag;
+		xlrec.flags = 0;
+
+		XLogBeginInsert();
+
+		XLogRegisterBuffer(0, buf_oblkno, REGBUF_STANDARD);
+		XLogRegisterBuffer(1, buf_nblkno, REGBUF_WILL_INIT);
+		XLogRegisterBuffer(2, metabuf, REGBUF_STANDARD);
+
+		if (metap_update_masks)
+		{
+			xlrec.flags |= XLH_SPLIT_META_UPDATE_MASKS;
+			XLogRegisterBufData(2, (char *) &metap->hashm_lowmask, sizeof(uint32));
+			XLogRegisterBufData(2, (char *) &metap->hashm_highmask, sizeof(uint32));
+		}
+
+		if (metap_update_splitpoint)
+		{
+			xlrec.flags |= XLH_SPLIT_META_UPDATE_SPLITPOINT;
+			XLogRegisterBufData(2, (char *) &metap->hashm_ovflpoint,
+								sizeof(uint32));
+			XLogRegisterBufData(2,
+					   (char *) &metap->hashm_spares[metap->hashm_ovflpoint],
+								sizeof(uint32));
+		}
+
+		XLogRegisterData((char *) &xlrec, SizeOfHashSplitAllocPage);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_ALLOCATE_PAGE);
+
+		PageSetLSN(BufferGetPage(buf_oblkno), recptr);
+		PageSetLSN(BufferGetPage(buf_nblkno), recptr);
+		PageSetLSN(BufferGetPage(metabuf), recptr);
+	}
+
 	END_CRIT_SECTION();
 
 	/* drop lock, but keep pin */
@@ -883,6 +989,7 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 {
 	BlockNumber lastblock;
 	char		zerobuf[BLCKSZ];
+	Page		page;
 
 	lastblock = firstblock + nblocks - 1;
 
@@ -893,7 +1000,20 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks)
 	if (lastblock < firstblock || lastblock == InvalidBlockNumber)
 		return false;
 
-	MemSet(zerobuf, 0, sizeof(zerobuf));
+	page = (Page) zerobuf;
+
+	/*
+	 * Initialize the page.  Just zeroing the page won't work, because WAL
+	 * replay routines expect pages to be initialized; see _hash_freeovflpage
+	 * for similar usage.
+	 */
+	_hash_pageinit(page, BLCKSZ);
+
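+	/*
+	 * WAL-log the initialized last page of the new range so that the
+	 * relation extension is replayed on a standby.
+	 */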
+	if (RelationNeedsWAL(rel))
+		log_newpage(&rel->rd_node,
+					MAIN_FORKNUM,
+					lastblock,
+					zerobuf,
+					true);
 
 	RelationOpenSmgr(rel);
 	smgrextend(rel->rd_smgr, MAIN_FORKNUM, lastblock, zerobuf, false);
@@ -951,6 +1071,11 @@ _hash_splitbucket(Relation rel,
 	Page		npage;
 	HashPageOpaque oopaque;
 	HashPageOpaque nopaque;
+	OffsetNumber itup_offsets[MaxIndexTuplesPerPage];
+	IndexTuple	itups[MaxIndexTuplesPerPage];
+	Size		all_tups_size = 0;
+	int			i;
+	uint16		nitups = 0;
 
 	bucket_obuf = obuf;
 	opage = BufferGetPage(obuf);
@@ -1029,29 +1154,38 @@ _hash_splitbucket(Relation rel,
 				itemsz = IndexTupleDSize(*new_itup);
 				itemsz = MAXALIGN(itemsz);
 
-				if (PageGetFreeSpace(npage) < itemsz)
+				if (PageGetFreeSpaceForMultipleTuples(npage, nitups + 1) < (all_tups_size + itemsz))
 				{
-					/* write out nbuf and drop lock, but keep pin */
+					/*
+					 * Change the shared buffer state in a critical section;
+					 * otherwise, any error could make it unrecoverable.
+					 */
+					START_CRIT_SECTION();
+
+					_hash_pgaddmultitup(rel, nbuf, itups, itup_offsets, nitups);
 					MarkBufferDirty(nbuf);
+					/* log the split operation before releasing the lock */
+					log_split_page(rel, nbuf);
+
+					END_CRIT_SECTION();
+
 					/* drop lock, but keep pin */
 					LockBuffer(nbuf, BUFFER_LOCK_UNLOCK);
+
+					/* be tidy */
+					for (i = 0; i < nitups; i++)
+						pfree(itups[i]);
+					nitups = 0;
+					all_tups_size = 0;
+
 					/* chain to a new overflow page */
 					nbuf = _hash_addovflpage(rel, metabuf, nbuf, (nbuf == bucket_nbuf) ? true : false);
 					npage = BufferGetPage(nbuf);
 					nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 				}
 
-				/*
-				 * Insert tuple on new page, using _hash_pgaddtup to ensure
-				 * correct ordering by hashkey.  This is a tad inefficient
-				 * since we may have to shuffle itempointers repeatedly.
-				 * Possible future improvement: accumulate all the items for
-				 * the new page and qsort them before insertion.
-				 */
-				(void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);
-
-				/* be tidy */
-				pfree(new_itup);
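+				/*
+				 * Accumulate the tuple; the accumulated tuples are added to
+				 * the new page in one shot inside a critical section, so the
+				 * whole page can be WAL-logged together (see log_split_page).
+				 */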
+				itups[nitups++] = new_itup;
+				all_tups_size += itemsz;
 			}
 			else
 			{
@@ -1073,11 +1207,27 @@ _hash_splitbucket(Relation rel,
 		/* Exit loop if no more overflow pages in old bucket */
 		if (!BlockNumberIsValid(oblkno))
 		{
+			/*
+			 * Change the shared buffer state in a critical section;
+			 * otherwise, any error could make it unrecoverable.
+			 */
+			START_CRIT_SECTION();
+
+			_hash_pgaddmultitup(rel, nbuf, itups, itup_offsets, nitups);
 			MarkBufferDirty(nbuf);
+			/* log the split operation before releasing the lock */
+			log_split_page(rel, nbuf);
+
+			END_CRIT_SECTION();
+
 			if (nbuf == bucket_nbuf)
 				LockBuffer(nbuf, BUFFER_LOCK_UNLOCK);
 			else
 				_hash_relbuf(rel, nbuf);
+
+			/* be tidy */
+			for (i = 0; i < nitups; i++)
+				pfree(itups[i]);
 			break;
 		}
 
@@ -1103,6 +1253,8 @@ _hash_splitbucket(Relation rel,
 	npage = BufferGetPage(bucket_nbuf);
 	nopaque = (HashPageOpaque) PageGetSpecialPointer(npage);
 
+	START_CRIT_SECTION();
+
 	oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
 	nopaque->hasho_flag &= ~LH_BUCKET_BEING_POPULATED;
 
@@ -1119,6 +1271,29 @@ _hash_splitbucket(Relation rel,
 	 */
 	MarkBufferDirty(bucket_obuf);
 	MarkBufferDirty(bucket_nbuf);
+
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+		xl_hash_split_complete xlrec;
+
+		xlrec.old_bucket_flag = oopaque->hasho_flag;
+		xlrec.new_bucket_flag = nopaque->hasho_flag;
+
+		XLogBeginInsert();
+
+		XLogRegisterData((char *) &xlrec, SizeOfHashSplitComplete);
+
+		XLogRegisterBuffer(0, bucket_obuf, REGBUF_STANDARD);
+		XLogRegisterBuffer(1, bucket_nbuf, REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_COMPLETE);
+
+		PageSetLSN(BufferGetPage(bucket_obuf), recptr);
+		PageSetLSN(BufferGetPage(bucket_nbuf), recptr);
+	}
+
+	END_CRIT_SECTION();
 }
 
 /*
@@ -1245,6 +1420,32 @@ _hash_finish_split(Relation rel, Buffer metabuf, Buffer obuf, Bucket obucket,
 }
 
 /*
+ *	log_split_page() -- Log the split operation
+ *
+ *	We log the split operation once the page in the new bucket to which we
+ *	are moving tuples is filled (or the split finishes), so we log the entire
+ *	page rather than individual tuple moves.
+ *
+ *	'buf' must be locked by the caller which is also responsible for unlocking
+ *	it.
+ */
+static void
+log_split_page(Relation rel, Buffer buf)
+{
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+
+		XLogBeginInsert();
+
+		XLogRegisterBuffer(0, buf, REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
+
+		recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_PAGE);
+
+		PageSetLSN(BufferGetPage(buf), recptr);
+	}
+}
+
+/*
  *	_hash_getcachedmetap() -- Returns cached metapage data.
  *
  *	If metabuf is not InvalidBuffer, caller must hold a pin, but no lock, on
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 9e5d7e4..d733770 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -123,6 +123,7 @@ _hash_readnext(IndexScanDesc scan,
 	if (block_found)
 	{
 		*pagep = BufferGetPage(*bufp);
+		TestForOldSnapshot(scan->xs_snapshot, rel, *pagep);
 		*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
 	}
 }
@@ -168,6 +169,7 @@ _hash_readprev(IndexScanDesc scan,
 		*bufp = _hash_getbuf(rel, blkno, HASH_READ,
 							 LH_BUCKET_PAGE | LH_OVERFLOW_PAGE);
 		*pagep = BufferGetPage(*bufp);
+		TestForOldSnapshot(scan->xs_snapshot, rel, *pagep);
 		*opaquep = (HashPageOpaque) PageGetSpecialPointer(*pagep);
 
 		/*
@@ -283,6 +285,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 
 	buf = _hash_getbucketbuf_from_hashkey(rel, hashkey, HASH_READ, NULL);
 	page = BufferGetPage(buf);
+	TestForOldSnapshot(scan->xs_snapshot, rel, page);
 	opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 	bucket = opaque->hasho_bucket;
 
@@ -318,6 +321,7 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
 		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
 		old_buf = _hash_getbuf(rel, old_blkno, HASH_READ, LH_BUCKET_PAGE);
+		TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(old_buf));
 
 		/*
 		 * remember the split bucket buffer so as to use it later for
@@ -520,6 +524,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 					_hash_readprev(scan, &buf, &page, &opaque);
 					if (BufferIsValid(buf))
 					{
+						TestForOldSnapshot(scan->xs_snapshot, rel, page);
 						maxoff = PageGetMaxOffsetNumber(page);
 						offnum = _hash_binsearch_last(page, so->hashso_sk_hash);
 					}
diff --git a/src/backend/access/rmgrdesc/hashdesc.c b/src/backend/access/rmgrdesc/hashdesc.c
index 7eac819..f1cc9ff 100644
--- a/src/backend/access/rmgrdesc/hashdesc.c
+++ b/src/backend/access/rmgrdesc/hashdesc.c
@@ -19,10 +19,142 @@
 void
 hash_desc(StringInfo buf, XLogReaderState *record)
 {
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			{
+				xl_hash_init_meta_page *xlrec = (xl_hash_init_meta_page *) rec;
+
+				appendStringInfo(buf, "num_tuples %g, fillfactor %d",
+								 xlrec->num_tuples, xlrec->ffactor);
+				break;
+			}
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			{
+				xl_hash_init_bitmap_page *xlrec = (xl_hash_init_bitmap_page *) rec;
+
+				appendStringInfo(buf, "bmsize %d", xlrec->bmsize);
+				break;
+			}
+		case XLOG_HASH_INSERT:
+			{
+				xl_hash_insert *xlrec = (xl_hash_insert *) rec;
+
+				appendStringInfo(buf, "off %u", xlrec->offnum);
+				break;
+			}
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			{
+				xl_hash_add_ovfl_page *xlrec = (xl_hash_add_ovfl_page *) rec;
+
+				appendStringInfo(buf, "bmsize %d, bmpage_found %c",
+						   xlrec->bmsize, (xlrec->bmpage_found) ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			{
+				xl_hash_split_allocate_page *xlrec = (xl_hash_split_allocate_page *) rec;
+
+				appendStringInfo(buf, "new_bucket %u, meta_page_masks_updated %c, issplitpoint_changed %c",
+								 xlrec->new_bucket,
+					(xlrec->flags & XLH_SPLIT_META_UPDATE_MASKS) ? 'T' : 'F',
+								 (xlrec->flags & XLH_SPLIT_META_UPDATE_SPLITPOINT) ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SPLIT_COMPLETE:
+			{
+				xl_hash_split_complete *xlrec = (xl_hash_split_complete *) rec;
+
+				appendStringInfo(buf, "old_bucket_flag %u, new_bucket_flag %u",
+							 xlrec->old_bucket_flag, xlrec->new_bucket_flag);
+				break;
+			}
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			{
+				xl_hash_move_page_contents *xlrec = (xl_hash_move_page_contents *) rec;
+
+				appendStringInfo(buf, "ntups %d, is_primary %c",
+								 xlrec->ntups,
+								 xlrec->is_prim_bucket_same_wrt ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_SQUEEZE_PAGE:
+			{
+				xl_hash_squeeze_page *xlrec = (xl_hash_squeeze_page *) rec;
+
+				appendStringInfo(buf, "prevblkno %u, nextblkno %u, ntups %d, is_primary %c",
+								 xlrec->prevblkno,
+								 xlrec->nextblkno,
+								 xlrec->ntups,
+								 xlrec->is_prim_bucket_same_wrt ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_DELETE:
+			{
+				xl_hash_delete *xlrec = (xl_hash_delete *) rec;
+
+				appendStringInfo(buf, "is_primary %c",
+								 xlrec->is_primary_bucket_page ? 'T' : 'F');
+				break;
+			}
+		case XLOG_HASH_UPDATE_META_PAGE:
+			{
+				xl_hash_update_meta_page *xlrec = (xl_hash_update_meta_page *) rec;
+
+				appendStringInfo(buf, "ntuples %g",
+								 xlrec->ntuples);
+				break;
+			}
+	}
 }
 
 const char *
 hash_identify(uint8 info)
 {
-	return NULL;
+	const char *id = NULL;
+
+	switch (info & ~XLR_INFO_MASK)
+	{
+		case XLOG_HASH_INIT_META_PAGE:
+			id = "INIT_META_PAGE";
+			break;
+		case XLOG_HASH_INIT_BITMAP_PAGE:
+			id = "INIT_BITMAP_PAGE";
+			break;
+		case XLOG_HASH_INSERT:
+			id = "INSERT";
+			break;
+		case XLOG_HASH_ADD_OVFL_PAGE:
+			id = "ADD_OVFL_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_ALLOCATE_PAGE:
+			id = "SPLIT_ALLOCATE_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_PAGE:
+			id = "SPLIT_PAGE";
+			break;
+		case XLOG_HASH_SPLIT_COMPLETE:
+			id = "SPLIT_COMPLETE";
+			break;
+		case XLOG_HASH_MOVE_PAGE_CONTENTS:
+			id = "MOVE_PAGE_CONTENTS";
+			break;
+		case XLOG_HASH_SQUEEZE_PAGE:
+			id = "SQUEEZE_PAGE";
+			break;
+		case XLOG_HASH_DELETE:
+			id = "DELETE";
+			break;
+		case XLOG_HASH_SPLIT_CLEANUP:
+			id = "SPLIT_CLEANUP";
+			break;
+		case XLOG_HASH_UPDATE_META_PAGE:
+			id = "UPDATE_META_PAGE";
+			break;
+	}
+
+	return id;
 }
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 72bb06c..9618032 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -506,11 +506,6 @@ DefineIndex(Oid relationId,
 	accessMethodForm = (Form_pg_am) GETSTRUCT(tuple);
 	amRoutine = GetIndexAmRoutine(accessMethodForm->amhandler);
 
-	if (strcmp(accessMethodName, "hash") == 0 &&
-		RelationNeedsWAL(rel))
-		ereport(WARNING,
-				(errmsg("hash indexes are not WAL-logged and their use is discouraged")));
-
 	if (stmt->unique && !amRoutine->amcanunique)
 		ereport(ERROR,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 9001e20..ce55fc5 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -5880,13 +5880,10 @@ RelationIdIsInInitFile(Oid relationId)
 /*
  * Tells whether any index for the relation is unlogged.
  *
- * Any index using the hash AM is implicitly unlogged.
- *
  * Note: There doesn't seem to be any way to have an unlogged index attached
- * to a permanent table except to create a hash index, but it seems best to
- * keep this general so that it returns sensible results even when they seem
- * obvious (like for an unlogged table) and to handle possible future unlogged
- * indexes on permanent tables.
+ * to a permanent table, but it seems best to keep this general so that it
+ * returns sensible results even when they seem obvious (like for an unlogged
+ * table) and to handle possible future unlogged indexes on permanent tables.
  */
 bool
 RelationHasUnloggedIndex(Relation rel)
@@ -5908,8 +5905,7 @@ RelationHasUnloggedIndex(Relation rel)
 			elog(ERROR, "cache lookup failed for relation %u", indexoid);
 		reltup = (Form_pg_class) GETSTRUCT(tp);
 
-		if (reltup->relpersistence == RELPERSISTENCE_UNLOGGED
-			|| reltup->relam == HASH_AM_OID)
+		if (reltup->relpersistence == RELPERSISTENCE_UNLOGGED)
 			result = true;
 
 		ReleaseSysCache(tp);
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index cc23163..2075ab7 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -16,7 +16,239 @@
 
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
+#include "storage/off.h"
 
+/* Number of buffers required for XLOG_HASH_SQUEEZE_PAGE operation */
+#define HASH_XLOG_FREE_OVFL_BUFS	6
+
+/*
+ * XLOG records for hash operations
+ */
+#define XLOG_HASH_INIT_META_PAGE	0x00		/* initialize the meta page */
+#define XLOG_HASH_INIT_BITMAP_PAGE	0x10		/* initialize the bitmap page */
+#define XLOG_HASH_INSERT		0x20	/* add index tuple without split */
+#define XLOG_HASH_ADD_OVFL_PAGE 0x30	/* add overflow page */
+#define XLOG_HASH_SPLIT_ALLOCATE_PAGE	0x40	/* allocate new page for split */
+#define XLOG_HASH_SPLIT_PAGE	0x50	/* split page */
+#define XLOG_HASH_SPLIT_COMPLETE	0x60		/* completion of split
+												 * operation */
+#define XLOG_HASH_MOVE_PAGE_CONTENTS	0x70	/* remove tuples from one page
+												 * and add to another page */
+#define XLOG_HASH_SQUEEZE_PAGE	0x80	/* add tuples to one of the previous
+										 * pages in chain and free the ovfl
+										 * page */
+#define XLOG_HASH_DELETE		0x90	/* delete index tuples from a page */
+#define XLOG_HASH_SPLIT_CLEANUP 0xA0	/* clear split-cleanup flag in primary
+										 * bucket page after deleting tuples
+										 * that are moved due to split	*/
+#define XLOG_HASH_UPDATE_META_PAGE	0xB0		/* update meta page after
+												 * vacuum */
+
+
+/*
+ * xl_hash_split_allocate_page flag values, 8 bits are available.
+ */
+#define XLH_SPLIT_META_UPDATE_MASKS		(1<<0)
+#define XLH_SPLIT_META_UPDATE_SPLITPOINT		(1<<1)
+
+/*
+ * This is what we need to know about a HASH index create.
+ *
+ * Backup block 0: metapage
+ */
+typedef struct xl_hash_createidx
+{
+	double		num_tuples;
+	RegProcedure procid;
+	uint16		ffactor;
+}	xl_hash_createidx;
+#define SizeOfHashCreateIdx (offsetof(xl_hash_createidx, ffactor) + sizeof(uint16))
+
+/*
+ * This is what we need to know about simple (without split) insert.
+ *
+ * This data record is used for XLOG_HASH_INSERT
+ *
+ * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 1: metapage (HashMetaPageData)
+ */
+typedef struct xl_hash_insert
+{
+	OffsetNumber offnum;
+}	xl_hash_insert;
+
+#define SizeOfHashInsert	(offsetof(xl_hash_insert, offnum) + sizeof(OffsetNumber))
+
+/*
+ * This is what we need to know about addition of overflow page.
+ *
+ * This data record is used for XLOG_HASH_ADD_OVFL_PAGE
+ *
+ * Backup Blk 0: newly allocated overflow page
+ * Backup Blk 1: page before new overflow page in the bucket chain
+ * Backup Blk 2: bitmap page
+ * Backup Blk 3: new bitmap page
+ * Backup Blk 4: metapage
+ */
+typedef struct xl_hash_add_ovfl_page
+{
+	uint16		bmsize;
+	bool		bmpage_found;
+}	xl_hash_add_ovfl_page;
+
+#define SizeOfHashAddOvflPage	\
+	(offsetof(xl_hash_add_ovfl_page, bmpage_found) + sizeof(bool))
+
+/*
+ * This is what we need to know about allocating a page for split.
+ *
+ * This data record is used for XLOG_HASH_SPLIT_ALLOCATE_PAGE
+ *
+ * Backup Blk 0: page for old bucket
+ * Backup Blk 1: page for new bucket
+ * Backup Blk 2: metapage
+ */
+typedef struct xl_hash_split_allocate_page
+{
+	uint32		new_bucket;
+	uint16		old_bucket_flag;
+	uint16		new_bucket_flag;
+	uint8		flags;
+}	xl_hash_split_allocate_page;
+
+#define SizeOfHashSplitAllocPage	\
+	(offsetof(xl_hash_split_allocate_page, flags) + sizeof(uint8))
+
+/*
+ * This is what we need to know about completing the split operation.
+ *
+ * This data record is used for XLOG_HASH_SPLIT_COMPLETE
+ *
+ * Backup Blk 0: page for old bucket
+ * Backup Blk 1: page for new bucket
+ */
+typedef struct xl_hash_split_complete
+{
+	uint16		old_bucket_flag;
+	uint16		new_bucket_flag;
+}	xl_hash_split_complete;
+
+#define SizeOfHashSplitComplete \
+	(offsetof(xl_hash_split_complete, new_bucket_flag) + sizeof(uint16))
+
+/*
+ * This is what we need to know about move page contents required during
+ * squeeze operation.
+ *
+ * This data record is used for XLOG_HASH_MOVE_PAGE_CONTENTS
+ *
+ * Backup Blk 0: bucket page
+ * Backup Blk 1: page containing moved tuples
+ * Backup Blk 2: page from which tuples will be removed
+ */
+typedef struct xl_hash_move_page_contents
+{
+	uint16		ntups;
+	bool		is_prim_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is same as
+												 * primary bucket page */
+}	xl_hash_move_page_contents;
+
+#define SizeOfHashMovePageContents	\
+	(offsetof(xl_hash_move_page_contents, is_prim_bucket_same_wrt) + sizeof(bool))
+
+/*
+ * This is what we need to know about the squeeze page operation.
+ *
+ * This data record is used for XLOG_HASH_SQUEEZE_PAGE
+ *
+ * Backup Blk 0: primary bucket page (if different from the write page)
+ * Backup Blk 1: page containing tuples moved from the freed overflow page
+ * Backup Blk 2: freed overflow page
+ * Backup Blk 3: page previous to the freed overflow page
+ * Backup Blk 4: page next to the freed overflow page
+ * Backup Blk 5: bitmap page containing info of the freed overflow page
+ * Backup Blk 6: meta page
+ */
+typedef struct xl_hash_squeeze_page
+{
+	BlockNumber prevblkno;
+	BlockNumber nextblkno;
+	uint16		ntups;
+	bool		is_prim_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is same as
+												 * primary bucket page */
+	bool		is_prev_bucket_same_wrt;		/* TRUE if the page to which
+												 * tuples are moved is the
+												 * page previous to the freed
+												 * overflow page */
+}	xl_hash_squeeze_page;
+
+#define SizeOfHashSqueezePage	\
+	(offsetof(xl_hash_squeeze_page, is_prev_bucket_same_wrt) + sizeof(bool))
+
+/*
+ * This is what we need to know about the deletion of index tuples from a page.
+ *
+ * This data record is used for XLOG_HASH_DELETE
+ *
+ * Backup Blk 0: primary bucket page
+ * Backup Blk 1: page from which tuples are deleted
+ */
+typedef struct xl_hash_delete
+{
+	bool		is_primary_bucket_page; /* TRUE if the operation is for
+										 * primary bucket page */
+}	xl_hash_delete;
+
+#define SizeOfHashDelete	(offsetof(xl_hash_delete, is_primary_bucket_page) + sizeof(bool))
+
+/*
+ * This is what we need for metapage update operation.
+ *
+ * This data record is used for XLOG_HASH_UPDATE_META_PAGE
+ *
+ * Backup Blk 0: meta page
+ */
+typedef struct xl_hash_update_meta_page
+{
+	double		ntuples;
+}	xl_hash_update_meta_page;
+
+#define SizeOfHashUpdateMetaPage	\
+	(offsetof(xl_hash_update_meta_page, ntuples) + sizeof(double))
+
+/*
+ * This is what we need to initialize metapage.
+ *
+ * This data record is used for XLOG_HASH_INIT_META_PAGE
+ *
+ * Backup Blk 0: meta page
+ */
+typedef struct xl_hash_init_meta_page
+{
+	double		num_tuples;
+	RegProcedure procid;
+	uint16		ffactor;
+}	xl_hash_init_meta_page;
+
+#define SizeOfHashInitMetaPage		\
+	(offsetof(xl_hash_init_meta_page, ffactor) + sizeof(uint16))
+
+/*
+ * This is what we need to initialize bitmap page.
+ *
+ * This data record is used for XLOG_HASH_INIT_BITMAP_PAGE
+ *
+ * Backup Blk 0: bitmap page
+ * Backup Blk 1: meta page
+ */
+typedef struct xl_hash_init_bitmap_page
+{
+	uint16		bmsize;
+}	xl_hash_init_bitmap_page;
+
+#define SizeOfHashInitBitmapPage	\
+	(offsetof(xl_hash_init_bitmap_page, bmsize) + sizeof(uint16))
 
 extern void hash_redo(XLogReaderState *record);
 extern void hash_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index e519fdb..26cd059 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -2335,13 +2335,9 @@ Options: fastupdate=on, gin_pending_list_limit=128
 -- HASH
 --
 CREATE INDEX hash_i4_index ON hash_i4_heap USING hash (random int4_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_name_index ON hash_name_heap USING hash (random name_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_txt_index ON hash_txt_heap USING hash (random text_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE INDEX hash_f8_index ON hash_f8_heap USING hash (random float8_ops);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE UNLOGGED TABLE unlogged_hash_table (id int4);
 CREATE INDEX unlogged_hash_index ON unlogged_hash_table USING hash (id int4_ops);
 DROP TABLE unlogged_hash_table;
@@ -2350,7 +2346,6 @@ DROP TABLE unlogged_hash_table;
 -- maintenance_work_mem setting and fillfactor:
 SET maintenance_work_mem = '1MB';
 CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 EXPLAIN (COSTS OFF)
 SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
                       QUERY PLAN                       
diff --git a/src/test/regress/expected/enum.out b/src/test/regress/expected/enum.out
index 514d1d0..0e60304 100644
--- a/src/test/regress/expected/enum.out
+++ b/src/test/regress/expected/enum.out
@@ -383,7 +383,6 @@ DROP INDEX enumtest_btree;
 -- Hash index / opclass with the = operator
 --
 CREATE INDEX enumtest_hash ON enumtest USING hash (col);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 SELECT * FROM enumtest WHERE col = 'orange';
   col   
 --------
diff --git a/src/test/regress/expected/hash_index.out b/src/test/regress/expected/hash_index.out
index f8b9f02..0a18efa 100644
--- a/src/test/regress/expected/hash_index.out
+++ b/src/test/regress/expected/hash_index.out
@@ -201,7 +201,6 @@ SELECT h.seqno AS f20000
 --
 CREATE TABLE hash_split_heap (keycol INT);
 CREATE INDEX hash_split_index on hash_split_heap USING HASH (keycol);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 INSERT INTO hash_split_heap SELECT 1 FROM generate_series(1, 70000) a;
 VACUUM FULL hash_split_heap;
 -- Let's do a backward scan.
@@ -230,5 +229,4 @@ DROP TABLE hash_temp_heap CASCADE;
 CREATE TABLE hash_heap_float4 (x float4, y int);
 INSERT INTO hash_heap_float4 VALUES (1.1,1);
 CREATE INDEX hash_idx ON hash_heap_float4 USING hash (x);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 DROP TABLE hash_heap_float4 CASCADE;
diff --git a/src/test/regress/expected/macaddr.out b/src/test/regress/expected/macaddr.out
index e84ff5f..151f9ce 100644
--- a/src/test/regress/expected/macaddr.out
+++ b/src/test/regress/expected/macaddr.out
@@ -41,7 +41,6 @@ SELECT * FROM macaddr_data;
 
 CREATE INDEX macaddr_data_btree ON macaddr_data USING btree (b);
 CREATE INDEX macaddr_data_hash ON macaddr_data USING hash (b);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 SELECT a, b, trunc(b) FROM macaddr_data ORDER BY 2, 1;
  a  |         b         |       trunc       
 ----+-------------------+-------------------
diff --git a/src/test/regress/expected/replica_identity.out b/src/test/regress/expected/replica_identity.out
index fa63235..67c34a9 100644
--- a/src/test/regress/expected/replica_identity.out
+++ b/src/test/regress/expected/replica_identity.out
@@ -12,7 +12,6 @@ CREATE UNIQUE INDEX test_replica_identity_keyab_key ON test_replica_identity (ke
 CREATE UNIQUE INDEX test_replica_identity_oid_idx ON test_replica_identity (oid);
 CREATE UNIQUE INDEX test_replica_identity_nonkey ON test_replica_identity (keya, nonkey);
 CREATE INDEX test_replica_identity_hash ON test_replica_identity USING hash (nonkey);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 CREATE UNIQUE INDEX test_replica_identity_expr ON test_replica_identity (keya, keyb, (3));
 CREATE UNIQUE INDEX test_replica_identity_partial ON test_replica_identity (keya, keyb) WHERE keyb != '3';
 -- default is 'd'/DEFAULT for user created tables
diff --git a/src/test/regress/expected/uuid.out b/src/test/regress/expected/uuid.out
index 423f277..db66dc7 100644
--- a/src/test/regress/expected/uuid.out
+++ b/src/test/regress/expected/uuid.out
@@ -114,7 +114,6 @@ SELECT COUNT(*) FROM guid1 WHERE guid_field >= '22222222-2222-2222-2222-22222222
 -- btree and hash index creation test
 CREATE INDEX guid1_btree ON guid1 USING BTREE (guid_field);
 CREATE INDEX guid1_hash  ON guid1 USING HASH  (guid_field);
-WARNING:  hash indexes are not WAL-logged and their use is discouraged
 -- unique index test
 CREATE UNIQUE INDEX guid1_unique_BTREE ON guid1 USING BTREE (guid_field);
 -- should fail
Attachment: wal_hash_index_test_v1.patch (application/octet-stream)
diff --git a/src/test/modules/snapshot_too_old/Makefile b/src/test/modules/snapshot_too_old/Makefile
index 16339f0..81b87b6 100644
--- a/src/test/modules/snapshot_too_old/Makefile
+++ b/src/test/modules/snapshot_too_old/Makefile
@@ -2,7 +2,7 @@
 
 EXTRA_CLEAN = ./isolation_output
 
-ISOLATIONCHECKS=sto_using_cursor sto_using_select
+ISOLATIONCHECKS=sto_using_cursor sto_using_select sto_using_hash_index
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/src/test/modules/snapshot_too_old/expected/sto_using_hash_index.out b/src/test/modules/snapshot_too_old/expected/sto_using_hash_index.out
new file mode 100644
index 0000000..bf94054
--- /dev/null
+++ b/src/test/modules/snapshot_too_old/expected/sto_using_hash_index.out
@@ -0,0 +1,15 @@
+Parsed test spec with 2 sessions
+
+starting permutation: noseq s1f1 s2sleep s2u s1f2
+step noseq: SET enable_seqscan = false;
+step s1f1: SELECT c FROM sto1 where c = 1000;
+c              
+
+1000           
+step s2sleep: SELECT setting, pg_sleep(6) FROM pg_settings WHERE name = 'old_snapshot_threshold';
+setting        pg_sleep       
+
+0                             
+step s2u: UPDATE sto1 SET c = 1001 WHERE c = 1000;
+step s1f2: SELECT c FROM sto1 where c = 1001;
+ERROR:  snapshot too old
diff --git a/src/test/modules/snapshot_too_old/specs/sto_using_hash_index.spec b/src/test/modules/snapshot_too_old/specs/sto_using_hash_index.spec
new file mode 100644
index 0000000..d05a6d9
--- /dev/null
+++ b/src/test/modules/snapshot_too_old/specs/sto_using_hash_index.spec
@@ -0,0 +1,39 @@
+# This test provokes a "snapshot too old" error using a hash index.
+#
+# The sleep is needed because with a threshold of zero a statement could error
+# on changes it made.  With more normal settings no external delay is needed,
+# but we don't want these tests to run long enough to see that, since
+# granularity is in minutes.
+#
+# Since results depend on the value of old_snapshot_threshold, sneak that into
+# the line generated by the sleep, so that a surprising value isn't so hard
+# to identify.
+
+setup
+{
+    CREATE TABLE sto1 (c int NOT NULL);
+    INSERT INTO sto1 SELECT generate_series(1, 1000);
+    CREATE INDEX idx_sto1 ON sto1 USING HASH (c);
+}
+setup
+{
+    VACUUM ANALYZE sto1;
+}
+
+teardown
+{
+    DROP TABLE sto1;
+}
+
+session "s1"
+setup			{ BEGIN ISOLATION LEVEL REPEATABLE READ; }
+step "noseq"	{ SET enable_seqscan = false; }
+step "s1f1"		{ SELECT c FROM sto1 where c = 1000; }
+step "s1f2"		{ SELECT c FROM sto1 where c = 1001; }
+teardown		{ ROLLBACK; }
+
+session "s2"
+step "s2sleep"	{ SELECT setting, pg_sleep(6) FROM pg_settings WHERE name = 'old_snapshot_threshold'; }
+step "s2u"		{ UPDATE sto1 SET c = 1001 WHERE c = 1000; }
+
+permutation "noseq" "s1f1" "s2sleep" "s2u" "s1f2"
#99Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#97)
Re: Write Ahead Logging for Hash Indexes

On Sun, Mar 12, 2017 at 8:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, Mar 11, 2017 at 12:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

/*
+                 * Change the shared buffer state in critical section,
+                 * otherwise any error could make it unrecoverable after
+                 * recovery.
+                 */
+                START_CRIT_SECTION();
+
+                /*
* Insert tuple on new page, using _hash_pgaddtup to ensure
* correct ordering by hashkey.  This is a tad inefficient
* since we may have to shuffle itempointers repeatedly.
* Possible future improvement: accumulate all the items for
* the new page and qsort them before insertion.
*/
(void) _hash_pgaddtup(rel, nbuf, itemsz, new_itup);

+ END_CRIT_SECTION();

No way. You have to start the critical section before making any page
modifications and keep it alive until all changes have been logged.

I think what we need to do here is to accumulate all the tuples that
need to be added to the new bucket page until either that page has no
more space or there are no more tuples remaining in the old bucket.
Then, in a critical section, add them to the page using
_hash_pgaddmultitup and log the entire new bucket page contents, as is
currently done in the patch's log_split_page().

I agree.

Okay, I have changed it like that in the patch.
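
For readers following the thread, the approach agreed on above can be
sketched roughly as follows. This is only a fragment: the loop
condition and the array names are placeholders, not taken from the
actual patch, while rel and nbuf are the variables visible in the
quoted snippet above.

    /* Sketch of the agreed approach (illustrative fragment, not the patch). */
    IndexTuple	itups[MaxIndexTuplesPerPage];
    OffsetNumber itup_offsets[MaxIndexTuplesPerPage];
    uint16		nitups = 0;

    /* gather, outside the critical section, all tuples that fit on nbuf */
    while (more_tuples_fit_on_new_page())	/* placeholder predicate */
        itups[nitups++] = next_tuple_for_new_bucket();	/* placeholder */

    START_CRIT_SECTION();

    /* add all accumulated tuples in one shot, preserving hashkey order */
    _hash_pgaddmultitup(rel, nbuf, itups, itup_offsets, nitups);
    MarkBufferDirty(nbuf);

    /* log the whole new bucket page, as log_split_page() does */
    if (RelationNeedsWAL(rel))
    {
        XLogRecPtr	recptr;

        XLogBeginInsert();
        XLogRegisterBuffer(0, nbuf, REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
        recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SPLIT_PAGE);
        PageSetLSN(BufferGetPage(nbuf), recptr);
    }

    END_CRIT_SECTION();

Logging a full page image with REGBUF_FORCE_IMAGE keeps the record
simple at the cost of record size, which is the trade-off discussed
next.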

Now, here we could choose to log the individual tuples instead of the
complete page; however, I am not sure there is any benefit in doing so,
because XLogRecordAssemble() will remove the empty space from the page
anyway. Let me know if you have something else in mind.

Well, if you have two pages that are 75% full, and you move a third of
the tuples from one of them into the other, it's going to be about
four times more efficient to log only the moved tuples than the whole
page.

I don't see how this could happen during a split. I mean, if you are
moving 25% of the tuples from an old bucket page to a new bucket page
which is 75% full, the new bucket page will be logged only after it
gets full.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#100Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#98)
Re: Write Ahead Logging for Hash Indexes

On Mon, Mar 13, 2017 at 6:26 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Mar 9, 2017 at 3:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Mar 7, 2017 at 6:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Great, thanks. 0001 looks good to me now, so committed.

Committed 0002.

Here are some initial review thoughts on 0003 based on a first read-through.

It seems like a good test to do with this patch would be to set up a
pgbench test on the master with a hash index replacing the usual btree
index. Then, set up a standby and run a read-only pgbench on the
standby while a read-write pgbench test runs on the master. Maybe
you've already tried something like that?

I also think so, and apart from that I think it makes sense to perform
recovery testing with Jeff Janes's tool and probably tests with
wal_consistency_check. These tests have already been running for the
past seven hours or so, and I will keep them running for the whole
night to see if there is any discrepancy.

We didn't find any issues with the above testing.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#101Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#100)
Re: Write Ahead Logging for Hash Indexes

On Mon, Mar 13, 2017 at 11:48 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

We didn't find any issues with the above testing.

Great! I've committed the latest version of the patch, with some
cosmetic changes.

It would be astonishing if there weren't a bug or two left, but I
think overall this is very solid work, and I think it's time to put
this out there and see how things go.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#102Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#101)
Re: Write Ahead Logging for Hash Indexes

Robert Haas <robertmhaas@gmail.com> writes:

Great! I've committed the latest version of the patch, with some
cosmetic changes.

Woo hoo! That's been a bee in the bonnet for, um, decades.

regards, tom lane


#103Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#102)
Re: Write Ahead Logging for Hash Indexes

On Tue, Mar 14, 2017 at 1:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

Great! I've committed the latest version of the patch, with some
cosmetic changes.

Woo hoo! That's been a bee in the bonnet for, um, decades.

Yeah. I'm pretty happy to see that go in.

It's become pretty clear to me that there are a bunch of other things
about hash indexes which are not exactly great, the worst of which is
the way they grow by DOUBLING IN SIZE. Mithun has submitted a patch
which - if it's acceptable - would alleviate that to some degree by
splitting up each doubling into four separate increments. But that
still could lead to very large growth increments on big indexes, like
your 32GB index suddenly growing - in one fell swoop - to 40GB.
That's admittedly a lot better than what will happen right now, which
is instantaneous growth to 64GB, but it's not great, and it's not
altogether clear how it could be fixed in a really satisfying way.

Other things that are not so great:

- no multi-column support
- no amcanunique support
- every insert dirties the metapage
- splitting is generally too aggressive; very few overflow pages are
ever created unless you have piles of duplicates

Still, this is a big step forward. I think it will be useful for
people with long text strings or other very wide data that they want
to index, and hopefully we'll get some feedback from users about where
this works well and where it doesn't.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#104Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#103)
Re: Write Ahead Logging for Hash Indexes

Robert Haas <robertmhaas@gmail.com> writes:

It's become pretty clear to me that there are a bunch of other things
about hash indexes which are not exactly great, the worst of which is
the way they grow by DOUBLING IN SIZE.

Uh, what? Growth should happen one bucket-split at a time.

Other things that are not so great:

- no multi-column support
- no amcanunique support
- every insert dirties the metapage
- splitting is generally too aggressive; very few overflow pages are
ever created unless you have piles of duplicates

Yeah. It's a bit hard to see how to add multi-column support unless you
give up the property of allowing queries on a subset of the index columns.
Lack of amcanunique seems like mostly a round-tuit shortage. The other
two are implementation deficiencies that maybe we can remedy someday.

Another thing I'd like to see is support for 64-bit hash values.

But all of these were mainly blocked by people not wanting to sink effort
into hash indexes as long as they were unusable for production due to lack
of WAL support. So this is a huge step forward.

regards, tom lane


#105Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#104)
Re: Write Ahead Logging for Hash Indexes

On Tue, Mar 14, 2017 at 2:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

It's become pretty clear to me that there are a bunch of other things
about hash indexes which are not exactly great, the worst of which is
the way they grow by DOUBLING IN SIZE.

Uh, what? Growth should happen one bucket-split at a time.

Technically, the buckets are created one at a time, but because of the
way hashm_spares works, the primary bucket pages for all buckets from
2^N to 2^{N+1}-1 must be physically consecutive. See
_hash_alloc_buckets().
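
To make the doubling concrete, here is a small standalone illustration
of that property; splitpoint_of() below is a simplified stand-in for
the real _hash_spareindex()/_hash_alloc_buckets() logic, not a copy of
it.

    #include <stdio.h>

    /* Buckets 2^N .. 2^(N+1)-1 share a splitpoint, so their primary
     * bucket pages are reserved as one physically consecutive chunk. */
    static int
    splitpoint_of(unsigned int bucket)
    {
        int		sp = 0;

        while ((1U << sp) <= bucket)	/* smallest sp with 2^sp > bucket */
            sp++;
        return sp;
    }

    int
    main(void)
    {
        unsigned int b;
        int		sp;

        for (b = 0; b < 9; b++)
            printf("bucket %u -> splitpoint %d\n", b, splitpoint_of(b));

        /* the first bucket of splitpoint N reserves 2^(N-1) pages */
        for (sp = 1; sp <= 5; sp++)
            printf("splitpoint %d reserves %u consecutive primary bucket pages\n",
                   sp, 1U << (sp - 1));
        return 0;
    }

So the moment bucket number 2^N is first needed, 2^N consecutive
primary bucket pages have to be reserved at once, which is the
"doubling" under discussion.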

Other things that are not so great:

- no multi-column support
- no amcanunique support
- every insert dirties the metapage
- splitting is generally too aggressive; very few overflow pages are
ever created unless you have piles of duplicates

Yeah. It's a bit hard to see how to add multi-column support unless you
give up the property of allowing queries on a subset of the index columns.
Lack of amcanunique seems like mostly a round-tuit shortage. The other
two are implementation deficiencies that maybe we can remedy someday.

Another thing I'd like to see is support for 64-bit hash values.

But all of these were mainly blocked by people not wanting to sink effort
into hash indexes as long as they were unusable for production due to lack
of WAL support. So this is a huge step forward.

Agreed, on all points.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#106Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#105)
Re: Write Ahead Logging for Hash Indexes

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Mar 14, 2017 at 2:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

It's become pretty clear to me that there are a bunch of other things
about hash indexes which are not exactly great, the worst of which is
the way they grow by DOUBLING IN SIZE.

Uh, what? Growth should happen one bucket-split at a time.

Technically, the buckets are created one at a time, but because of the
way hashm_spares works, the primary bucket pages for all buckets from
2^N to 2^{N+1}-1 must be physically consecutive. See
_hash_alloc_buckets().

Right, but we only fill those pages one at a time.

It's true that as soon as we need another overflow page, that's going to
get dropped beyond the 2^{N+1}-1 point, and the *apparent* size of the
index will grow quite a lot. But any modern filesystem should handle
that without much difficulty by treating the index as a sparse file.

There may be some work to be done in places like pg_basebackup to
recognize and deal with sparse files, but it doesn't seem like a
reason to panic.

regards, tom lane


#107Stephen Frost
sfrost@snowman.net
In reply to: Tom Lane (#106)
Re: Write Ahead Logging for Hash Indexes

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Mar 14, 2017 at 2:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

It's become pretty clear to me that there are a bunch of other things
about hash indexes which are not exactly great, the worst of which is
the way they grow by DOUBLING IN SIZE.

Uh, what? Growth should happen one bucket-split at a time.

Technically, the buckets are created one at a time, but because of the
way hashm_spares works, the primary bucket pages for all buckets from
2^N to 2^{N+1}-1 must be physically consecutive. See
_hash_alloc_buckets().

Right, but we only fill those pages one at a time.

It's true that as soon as we need another overflow page, that's going to
get dropped beyond the 2^{N+1}-1 point, and the *apparent* size of the
index will grow quite a lot. But any modern filesystem should handle
that without much difficulty by treating the index as a sparse file.

Uh, last I heard we didn't allow or want sparse files in the backend
because then we have to handle a possible out-of-disk-space failure on
every write.

If we think they're OK to do, it'd be awfully nice to figure out a way
for VACUUM to turn an entirely-empty 1G chunk into a sparse file.

There may be some work to be done in places like pg_basebackup to
recognize and deal with sparse files, but it doesn't seem like a
reason to panic.

Well, and every file-based backup tool out there...

Thanks!

Stephen

#108Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephen Frost (#107)
Re: Write Ahead Logging for Hash Indexes

Stephen Frost <sfrost@snowman.net> writes:

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

It's true that as soon as we need another overflow page, that's going to
get dropped beyond the 2^{N+1}-1 point, and the *apparent* size of the
index will grow quite a lot. But any modern filesystem should handle
that without much difficulty by treating the index as a sparse file.

Uh, last I heard we didn't allow or want sparse files in the backend
because then we have to handle a possible out-of-disk-space failure on
every write.

For a hash index, this would happen during a bucket split, which would
need to be resilient against out-of-disk-space anyway.

There may be some work to be done in places like pg_basebackup to
recognize and deal with sparse files, but it doesn't seem like a
reason to panic.

Well, and every file-based backup tool out there..

Weren't you the one leading the charge to deprecate use of file-based
backup?

regards, tom lane


#109Stephen Frost
sfrost@snowman.net
In reply to: Tom Lane (#108)
Re: Write Ahead Logging for Hash Indexes

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

Stephen Frost <sfrost@snowman.net> writes:

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

It's true that as soon as we need another overflow page, that's going to
get dropped beyond the 2^{N+1}-1 point, and the *apparent* size of the
index will grow quite a lot. But any modern filesystem should handle
that without much difficulty by treating the index as a sparse file.

Uh, last I heard we didn't allow or want sparse files in the backend
because then we have to handle a possible out-of-disk-space failure on
every write.

For a hash index, this would happen during a bucket split, which would
need to be resilient against out-of-disk-space anyway.

We wouldn't attempt to use the area of the file which is not yet
allocated except when doing a bucket split? If that's the case then
this does seem to at least be less of an issue, though I hope we put in
appropriate comments about it.

There may be some work to be done in places like pg_basebackup to
recognize and deal with sparse files, but it doesn't seem like a
reason to panic.

Well, and every file-based backup tool out there..

Weren't you the one leading the charge to deprecate use of file-based
backup?

No, nor do I see how we would ever be able to deprecate file-based
backups. If anything, I'd like to see us improve our support for
them. I'm certainly curious where the notion that I was ever in favor
of deprecating them came from, particularly given all of the effort that
David and I have been pouring into our favorite file-based backup tool
over the past few years.

Thanks!

Stephen

#110Mark Kirkwood
mark.kirkwood@catalyst.net.nz
In reply to: Robert Haas (#101)
Re: Write Ahead Logging for Hash Indexes

On 15/03/17 06:29, Robert Haas wrote:

Great! I've committed the latest version of the patch, with some
cosmetic changes.

It would be astonishing if there weren't a bug or two left, but I
think overall this is very solid work, and I think it's time to put
this out there and see how things go.

Awesome, great work!

Cheers

Mark


#111Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#101)
Re: Write Ahead Logging for Hash Indexes

On Tue, Mar 14, 2017 at 10:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Mar 13, 2017 at 11:48 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

We didn't find any issues with the above testing.

Great! I've committed the latest version of the patch, with some
cosmetic changes.

Thanks a lot. I have noticed that the test case patch for "snapshot
too old" is not committed. Do you want to leave that out due to the
timing requirements, or do you want me to submit it separately so that
we can deal with it on its own and perhaps take input from Kevin?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#112Amit Kapila
amit.kapila16@gmail.com
In reply to: Stephen Frost (#109)
Re: Write Ahead Logging for Hash Indexes

On Wed, Mar 15, 2017 at 12:53 AM, Stephen Frost <sfrost@snowman.net> wrote:

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

Stephen Frost <sfrost@snowman.net> writes:

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

It's true that as soon as we need another overflow page, that's going to
get dropped beyond the 2^{N+1}-1 point, and the *apparent* size of the
index will grow quite a lot. But any modern filesystem should handle
that without much difficulty by treating the index as a sparse file.

Uh, last I heard we didn't allow or want sparse files in the backend
because then we have to handle a possible out-of-disk-space failure on
every write.

For a hash index, this would happen during a bucket split, which would
need to be resilient against out-of-disk-space anyway.

We wouldn't attempt to use the area of the file which is not yet
allocated except when doing a bucket split?

That's right.

If that's the case then
this does seem to at least be less of an issue, though I hope we put in
appropriate comments about it.

I think we have sufficient comments in the code, especially on top of
the function _hash_alloc_buckets().

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#113Stephen Frost
sfrost@snowman.net
In reply to: Amit Kapila (#112)
Re: Write Ahead Logging for Hash Indexes

Amit,

* Amit Kapila (amit.kapila16@gmail.com) wrote:

On Wed, Mar 15, 2017 at 12:53 AM, Stephen Frost <sfrost@snowman.net> wrote:

If that's the case then
this does seem to at least be less of an issue, though I hope we put in
appropriate comments about it.

I think we have sufficient comments in the code, especially on top of
the function _hash_alloc_buckets().

I don't see any comments regarding how we have to be sure to handle
an out-of-space case properly in the middle of a file because we've made
it sparse.

I do see that mdwrite() should handle an out-of-disk-space case, though
that just makes me wonder what's different here compared to normal
relations, such that we don't have an issue with a sparse WAL'd hash
index but we can't handle a normal relation being sparse.

Additional comments here would be overkill if we think that the lower
layers are already all set up to handle such a case properly and we
don't have to do anything special, but I had understood that we didn't
generally think that sparse files would just work. I'm certainly happy
to be wrong there because, if that's the case, it'd be a great way to
improve the problems we have returning large chunks of free space in the
middle of a relation to the OS, but if there is a real concern here then
we need to work out what it is and probably add comments or possibly add
code to address whatever it is.

Thanks!

Stephen

#114Robert Haas
robertmhaas@gmail.com
In reply to: Stephen Frost (#113)
Re: Write Ahead Logging for Hash Indexes

On Wed, Mar 15, 2017 at 9:18 AM, Stephen Frost <sfrost@snowman.net> wrote:

I think we have sufficient comments in the code, especially on top of
the function _hash_alloc_buckets().

I don't see any comments regarding how we have to be sure to handle
an out-of-space case properly in the middle of a file because we've made
it sparse.

I do see that mdwrite() should handle an out-of-disk-space case, though
that just makes me wonder what's different here compared to normal
relations, such that we don't have an issue with a sparse WAL'd hash
index but we can't handle a normal relation being sparse.

I agree. I think that what hash indexes are doing here is
inconsistent with what we do for btree indexes and the heap. And I
don't think it would be bad to fix that. We could, of course, go the
other way and do what Tom is suggesting - insist that everybody's got
to be prepared for sparse files, but I would view that as something of
a U-turn. I think hash indexes are like this because nobody's really
worried about hash indexes because they haven't been crash-safe or
performant. Now that we've made them crash safe and are on the way to
making them performant, fixing other things is, as Tom already said, a
lot more interesting.

Now, that having been said, I'm not sure it's a good idea to tinker
with the behavior for v10. We could change the new-splitpoint code so
that it loops over all the pages in the new splitpoint and zeroes them
all, instead of just the last one. If we all agree that's how it
should work, then it's probably not a lot of work to make the change.
But if what's needed is anything more than that or we don't all agree,
then we'd better just leave it alone and we can revisit it for v11.
It's too late to start making significant or controversial design
changes at this point, and you could argue that this is a preexisting
design defect which the WAL-logging patch wasn't necessarily obligated
to fix.
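
In code, the minimal change floated here might look roughly like the
following "dense" variant of _hash_alloc_buckets(). This is a sketch
that assumes the function keeps its current signature and the v10-era
smgr API; it is not the committed code.

    #include "postgres.h"
    #include "storage/smgr.h"
    #include "utils/rel.h"

    /* Write a zero page for every block of the new splitpoint instead
     * of only the last one, so the file is extended without holes. */
    static bool
    hash_alloc_buckets_dense(Relation rel, BlockNumber firstblock, uint32 nblocks)
    {
        BlockNumber endblock = firstblock + nblocks;
        BlockNumber blkno;
        char		zerobuf[BLCKSZ];

        if (endblock < firstblock)		/* BlockNumber overflow? */
            return false;

        MemSet(zerobuf, 0, sizeof(zerobuf));

        RelationOpenSmgr(rel);
        for (blkno = firstblock; blkno < endblock; blkno++)
            smgrextend(rel->rd_smgr, MAIN_FORKNUM, blkno, zerobuf, false);

        return true;
    }

The obvious cost is that this writes the whole splitpoint's worth of
zero pages up front, which is exactly the slowdown Amit worries about
later in the thread.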

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#115Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#114)
Re: Write Ahead Logging for Hash Indexes

Robert Haas <robertmhaas@gmail.com> writes:

Now, that having been said, I'm not sure it's a good idea to tinker
with the behavior for v10. We could change the new-splitpoint code so
that it loops over all the pages in the new splitpoint and zeroes them
all, instead of just the last one.

Why would we do that? That would change the behavior from something
that's arguably OK (at least given the right filesystem) to something
that's clearly not very OK.

It's too late to start making significant or controversial design
changes at this point, and you could argue that this is a preexisting
design defect which the WAL-logging patch wasn't necessarily obligated
to fix.

Yes, and it is certainly that, and no, this patch wasn't chartered to
fix it. I don't have a problem with leaving things like this for v10.

FWIW, I'm not certain that Stephen is correct to claim that we have
some concrete problem with sparse files. We certainly don't *depend*
on sparse storage anyplace else, nor write data in a way that would be
likely to trigger it; but I'm not aware that we need to work hard to
avoid it.

regards, tom lane


#116Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephen Frost (#113)
Re: Write Ahead Logging for Hash Indexes

Stephen Frost <sfrost@snowman.net> writes:

I do see that mdwrite() should handle an out-of-disk-space case, though
that just makes me wonder what's different here compared to normal
relations that we don't have an issue with a sparse WAL'd hash index but
we can't handle it if a normal relation is sparse.

*Any* write has to be prepared to handle errors. There's always a risk of
EIO, and on a COW filesystem you might well get ENOSPC even when you think
you're overwriting previously-allocated storage. All that we are doing by
pre-allocating storage is reducing the risks a bit, not guaranteeing that
no error will happen.

regards, tom lane


#117Stephen Frost
sfrost@snowman.net
In reply to: Tom Lane (#115)
Re: Write Ahead Logging for Hash Indexes

Tom,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

FWIW, I'm not certain that Stephen is correct to claim that we have
some concrete problem with sparse files. We certainly don't *depend*
on sparse storage anyplace else, nor write data in a way that would be
likely to trigger it; but I'm not aware that we need to work hard to
avoid it.

I don't have any concrete evidence that there is an issue, just a
recollection of bringing up, in a discussion somewhere along the way,
exactly the idea of having VACUUM turn entirely empty 1G segments into
sparse files to free that space back to the filesystem, and being told
that it'd be bad because we would run into issues later when we try to
write those pages back out and discover we're out of disk space.

Naturally, there's a lot to consider with such a change to VACUUM, like
how we would WAL-log it and how we'd make sure that nothing is about to
try and use any of those pages (perhaps start at the end of the segment
and lock the pages, or just try to get an exclusive lock on the table as
we do when we try to truncate the relation because there's free space at
the end?), but it'd be a very interesting project to consider, if we are
willing to accept sparse heap files.

Another interesting idea might be to somehow change the now-empty
relation into just a place-holder that says "assume I'm 1G in size and
empty", but actually have the file be shrunk.

Anyway, I tend to agree with the sentiment that we don't need this patch
to change the behavior here. Perhaps it'll be good to see what happens
when people start using these sparse hash indexes; maybe that will shed
some light on whether we have anything to worry about here or not, and
if not, then we can consider having sparse files elsewhere.

Thanks!

Stephen

#118Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#115)
Re: Write Ahead Logging for Hash Indexes

On Wed, Mar 15, 2017 at 10:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

FWIW, I'm not certain that Stephen is correct to claim that we have
some concrete problem with sparse files. We certainly don't *depend*
on sparse storage anyplace else, nor write data in a way that would be
likely to trigger it; but I'm not aware that we need to work hard to
avoid it.

That theory seems inconsistent with how mdextend() works. My
understanding is that we zero-fill the new blocks before populating
them with actual data precisely to avoid running out of disk space due
to deferred allocation at the OS level. If we don't care about
failures due to deferred allocation at the OS level, we can rip that
logic out and improve the performance of relation extension
considerably. If we do care about failures due to deferred
allocation, then leaving holes in the file is a bad idea.
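
For reference, the zero-fill Robert describes is the ordinary
relation-extension path; a condensed, illustrative sketch of that
pattern (not the actual bufmgr.c/md.c code) is:

    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "storage/bufpage.h"
    #include "utils/rel.h"

    /* Extend a relation by one block: the block is first written out as
     * all zeroes (forcing the filesystem to allocate the space), and
     * only then initialized and filled with real data. */
    static Buffer
    extend_relation_by_one_block(Relation rel)
    {
        Buffer		buf;
        Page		page;

        /* P_NEW extends the relation; the new block goes out as zeroes */
        buf = ReadBufferExtended(rel, MAIN_FORKNUM, P_NEW, RBM_NORMAL, NULL);
        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

        /* only once the zero-filled block exists do we initialize it */
        page = BufferGetPage(buf);
        PageInit(page, BufferGetPageSize(buf), 0);

        return buf;		/* caller marks dirty, WAL-logs, releases */
    }

With this pattern, an out-of-space failure normally surfaces at the
zero-fill extension step, before any real data has been put on the page
(Tom's COW-filesystem caveat above notwithstanding).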

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#119Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#111)
Re: Write Ahead Logging for Hash Indexes

On Tue, Mar 14, 2017 at 10:30 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Mar 14, 2017 at 10:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Mar 13, 2017 at 11:48 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

We didn't find any issues with the above testing.

Great! I've committed the latest version of the patch, with some
cosmetic changes.

Thanks a lot. I have noticed that the test case patch for "snapshot
too old" is not committed. Do you want to leave that out due to the
timing requirements, or do you want me to submit it separately so that
we can deal with it on its own and perhaps take input from Kevin?

I've committed that now as well. Let's see what the buildfarm thinks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#120Stephen Frost
sfrost@snowman.net
In reply to: Robert Haas (#118)
Re: Write Ahead Logging for Hash Indexes

Robert,

* Robert Haas (robertmhaas@gmail.com) wrote:

On Wed, Mar 15, 2017 at 10:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

FWIW, I'm not certain that Stephen is correct to claim that we have
some concrete problem with sparse files. We certainly don't *depend*
on sparse storage anyplace else, nor write data in a way that would be
likely to trigger it; but I'm not aware that we need to work hard to
avoid it.

That theory seems inconsistent with how mdextend() works. My
understanding is that we zero-fill the new blocks before populating
them with actual data precisely to avoid running out of disk space due
to deferred allocation at the OS level. If we don't care about
failures due to deferred allocation at the OS level, we can rip that
logic out and improve the performance of relation extension
considerably. If we do care about failures due to deferred
allocation, then leaving holes in the file is a bad idea.

That is a fantastic point.

Thanks!

Stephen

#121Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#118)
Re: Write Ahead Logging for Hash Indexes

Robert Haas <robertmhaas@gmail.com> writes:

That theory seems inconsistent with how mdextend() works. My
understanding is that we zero-fill the new blocks before populating
them with actual data precisely to avoid running out of disk space due
to deferred allocation at the OS level. If we don't care about
failures due to deferred allocation at the OS level, we can rip that
logic out and improve the performance of relation extension
considerably.

See my reply to Stephen. The fact that this fails to guarantee no
ENOSPC on COW filesystems doesn't mean that it's not worth doing on
other filesystems. We're reducing the risk, not eliminating it,
but reducing risk is still a worthwhile activity.

regards, tom lane


#122Stephen Frost
sfrost@snowman.net
In reply to: Tom Lane (#121)
Re: Write Ahead Logging for Hash Indexes

Tom,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

Robert Haas <robertmhaas@gmail.com> writes:

That theory seems inconsistent with how mdextend() works. My
understanding is that we zero-fill the new blocks before populating
them with actual data precisely to avoid running out of disk space due
to deferred allocation at the OS level. If we don't care about
failures due to deferred allocation at the OS level, we can rip that
logic out and improve the performance of relation extension
considerably.

See my reply to Stephen. The fact that this fails to guarantee no
ENOSPC on COW filesystems doesn't mean that it's not worth doing on
other filesystems. We're reducing the risk, not eliminating it,
but reducing risk is still a worthwhile activity.

Considering how much work we end up doing to extend a relation and how
we know that's been a hotspot, I'm not entirely sure I agree that
avoiding the relatively infrequent out-of-disk-space concern when
extending the relation (instead of letting it happen when we go to
actually write data into the page) really is a good trade-off to make.

Thanks!

Stephen

#123Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#121)
Re: Write Ahead Logging for Hash Indexes

On Wed, Mar 15, 2017 at 11:02 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

That theory seems inconsistent with how mdextend() works. My
understanding is that we zero-fill the new blocks before populating
them with actual data precisely to avoid running out of disk space due
to deferred allocation at the OS level. If we don't care about
failures due to deferred allocation at the OS level, we can rip that
logic out and improve the performance of relation extension
considerably.

See my reply to Stephen. The fact that this fails to guarantee no
ENOSPC on COW filesystems doesn't mean that it's not worth doing on
other filesystems. We're reducing the risk, not eliminating it,
but reducing risk is still a worthwhile activity.

Well, then it would presumably be worth reducing for hash indexes, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#124Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#114)
Re: Write Ahead Logging for Hash Indexes

On Wed, Mar 15, 2017 at 7:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Mar 15, 2017 at 9:18 AM, Stephen Frost <sfrost@snowman.net> wrote:

I think we have sufficient comments in the code, especially on top of
the function _hash_alloc_buckets().

I don't see any comments regarding how we have to be sure to handle
an out-of-space case properly in the middle of a file because we've made
it sparse.

I do see that mdwrite() should handle an out-of-disk-space case, though
that just makes me wonder what's different here compared to normal
relations, such that we don't have an issue with a sparse WAL'd hash
index but we can't handle a normal relation being sparse.

I agree. I think that what hash indexes are doing here is
inconsistent with what we do for btree indexes and the heap. And I
don't think it would be bad to fix that. We could, of course, go the
other way and do what Tom is suggesting - insist that everybody's got
to be prepared for sparse files, but I would view that as something of
a U-turn. I think hash indexes are like this because nobody's really
worried about hash indexes because they haven't been crash-safe or
performant. Now that we've made them crash safe and are on the way to
making them performant, fixing other things is, as Tom already said, a
lot more interesting.

Now, that having been said, I'm not sure it's a good idea to tinker
with the behavior for v10. We could change the new-splitpoint code so
that it loops over all the pages in the new splitpoint and zeroes them
all, instead of just the last one.

Yeah, but I am slightly afraid that, apart from consuming too much
space, it will make the operation a lot slower when we have to perform
the allocation step. Another possibility is to do that when we
actually use/allocate the bucket. As of now, _hash_getnewbuf assumes
that the block is pre-existing and avoids the actual read/extend by
using RBM_ZERO_AND_LOCK mode. I think we can change it so that it
forces a new allocation (if the block doesn't exist) on need. I agree
this is a somewhat bigger change than what you have proposed to
circumvent the sparse file behavior, but it may not be a ton more work
if we decide to fix this, and it has the advantage of allocating the
space on disk only when it is actually needed.
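
A very rough sketch of that alternative, with the allocate-on-demand
idea made explicit. The function name and details are hypothetical,
and handling of concurrent extension is ignored.

    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "utils/rel.h"

    /* Allocate-on-demand variant of _hash_getnewbuf(): if the requested
     * block doesn't physically exist yet, extend the relation up to it
     * instead of assuming it was pre-allocated at split time. */
    static Buffer
    hash_getnewbuf_on_demand(Relation rel, BlockNumber blkno)
    {
        BlockNumber nblocks = RelationGetNumberOfBlocks(rel);
        Buffer		buf;

        if (blkno < nblocks)
        {
            /* block already exists on disk: just zero and lock it, as today */
            buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
                                     RBM_ZERO_AND_LOCK, NULL);
        }
        else
        {
            /* extend the relation until the requested block appears */
            for (;;)
            {
                buf = ReadBufferExtended(rel, MAIN_FORKNUM, P_NEW,
                                         RBM_NORMAL, NULL);
                if (BufferGetBlockNumber(buf) >= blkno)
                    break;
                ReleaseBuffer(buf);
            }
            LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
        }

        return buf;
    }

That way the disk space for a bucket is consumed only when the bucket
is actually used, at the price of a more invasive change than zeroing
the whole splitpoint at split time.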

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
