Bug in new buffering GiST build code

Started by Heikki Linnakangasover 13 years ago13 messages
#1Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
1 attachment(s)

I bumped into a bug in the new buffering GiST build code. I did this:

create table gisttest (t text);
insert into gisttest select
a||'fooooooooooooooooooooooooooooooooooooooooooooooo' from
generate_series(1,10000000) a;

create index i_gisttest on gisttest using gist (t collate "C") WITH
(fillfactor=10);

After a while, this segfaults.

I debugged this, and traced this into a bug in the
gistRelocateBuildBuffersOnSplit() function. It splits a node buffer into
two (or more) node buffers, when the corresponding index page is split.
It first makes a copy of the old GISTNodeBuffer struct, and then
repurposes the struct to hold the new buffer for the new leftmost page
of the split. The temporary copy of the old page is only needed while
the function moves all the tuples from the old buffer into the new
buffers, after that it can be discarded. The temporary copy of the
struct is kept in the stack. However, the temporary copy can find its
way into the list of "loaded buffers". After the function exits, and
it's time to unload all the currently loaded buffers, you get a segfault
because the pointer now points to garbage. I think the reason this
doesn't always crash is that the copy in the stack usually still happens
to be valid enough that gistUnloadNodeBuffer() just skips it.

I'll commit the attached fix for that, marking the temporary copy
explicitly, so that we can avoid leaking it into any long-lived data
structures.

After fixing that, however, I'm now getting another error, much later in
the build process:

ERROR: failed to re-find parent for block 123002
STATEMENT: create index i_gisttest on gisttest using gist (t collate
"C") WITH (fillfactor=10);

I'll continue debugging that, but it seems to be another, unrelated, bug.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

fix-relocate-segfault-1.patchtext/x-diff; name=fix-relocate-segfault-1.patchDownload
diff --git a/src/backend/access/gist/gistbuildbuffers.c b/src/backend/access/gist/gistbuildbuffers.c
index 34a12bc..a40b838 100644
--- a/src/backend/access/gist/gistbuildbuffers.c
+++ b/src/backend/access/gist/gistbuildbuffers.c
@@ -148,8 +148,10 @@ gistGetNodeBuffer(GISTBuildBuffers *gfbb, GISTSTATE *giststate,
 		int			level;
 		MemoryContext oldcxt = MemoryContextSwitchTo(gfbb->context);
 
-		nodeBuffer->pageBuffer = NULL;
+		/* nodeBuffer->nodeBlocknum is the hash key and was filled in already */
 		nodeBuffer->blocksCount = 0;
+		nodeBuffer->pageBlocknum = InvalidBlockNumber;
+		nodeBuffer->pageBuffer = NULL;
 		nodeBuffer->queuedForEmptying = false;
 
 		/*
@@ -244,11 +246,15 @@ gistAllocateNewPageBuffer(GISTBuildBuffers *gfbb)
 }
 
 /*
- * Add specified block number into loadedBuffers array.
+ * Add specified buffer into loadedBuffers array.
  */
 static void
 gistAddLoadedBuffer(GISTBuildBuffers *gfbb, GISTNodeBuffer *nodeBuffer)
 {
+	/* Never add a temporary buffer to the array */
+	if (nodeBuffer->isTemp)
+		return;
+
 	/* Enlarge the array if needed */
 	if (gfbb->loadedBuffersCount >= gfbb->loadedBuffersLen)
 	{
@@ -591,7 +597,7 @@ gistRelocateBuildBuffersOnSplit(GISTBuildBuffers *gfbb, GISTSTATE *giststate,
 				i;
 	GISTENTRY	entry[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	GISTNodeBuffer nodebuf;
+	GISTNodeBuffer oldBuf;
 	ListCell   *lc;
 
 	/* If the splitted page doesn't have buffers, we have nothing to do. */
@@ -619,16 +625,14 @@ gistRelocateBuildBuffersOnSplit(GISTBuildBuffers *gfbb, GISTSTATE *giststate,
 	 * read the tuples straight from the heap instead of the root buffer.
 	 */
 	Assert(blocknum != GIST_ROOT_BLKNO);
-	memcpy(&nodebuf, nodeBuffer, sizeof(GISTNodeBuffer));
+	memcpy(&oldBuf, nodeBuffer, sizeof(GISTNodeBuffer));
+	oldBuf.isTemp = true;
 
 	/* Reset the old buffer, used for the new left page from now on */
 	nodeBuffer->blocksCount = 0;
 	nodeBuffer->pageBuffer = NULL;
 	nodeBuffer->pageBlocknum = InvalidBlockNumber;
 
-	/* Reassign pointer to the saved copy. */
-	nodeBuffer = &nodebuf;
-
 	/*
 	 * Allocate memory for information about relocation buffers.
 	 */
@@ -675,7 +679,7 @@ gistRelocateBuildBuffersOnSplit(GISTBuildBuffers *gfbb, GISTSTATE *giststate,
 	 * Loop through all index tuples on the buffer on the splitted page,
 	 * moving them to buffers on the new pages.
 	 */
-	while (gistPopItupFromNodeBuffer(gfbb, nodeBuffer, &itup))
+	while (gistPopItupFromNodeBuffer(gfbb, &oldBuf, &itup))
 	{
 		float		sum_grow,
 					which_grow[INDEX_MAX_KEYS];
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 0d6b625..481ef5f 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -326,6 +326,9 @@ typedef struct
 	/* is this buffer queued for emptying? */
 	bool		queuedForEmptying;
 
+	/* is this a temporary copy, not in the hash table? */
+	bool		isTemp;
+
 	struct GISTBufferingInsertStack *path;
 } GISTNodeBuffer;
 
#2Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#1)
Re: Bug in new buffering GiST build code

On Fri, May 18, 2012 at 8:27 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

After fixing that, however, I'm now getting another error, much later in
the build process:

ERROR: failed to re-find parent for block 123002
STATEMENT: create index i_gisttest on gisttest using gist (t collate "C")
WITH (fillfactor=10);

I'll continue debugging that, but it seems to be another, unrelated, bug.

Thanks for debugging and fixing that. I'm going to take a look on the
remaining bug.

------
With best regards,
Alexander Korotkov.

#3Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#2)
Re: Bug in new buffering GiST build code

On 18.05.2012 20:34, Alexander Korotkov wrote:

On Fri, May 18, 2012 at 8:27 PM, Heikki Linnakangas<
heikki.linnakangas@enterprisedb.com> wrote:

After fixing that, however, I'm now getting another error, much later in
the build process:

ERROR: failed to re-find parent for block 123002
STATEMENT: create index i_gisttest on gisttest using gist (t collate "C")
WITH (fillfactor=10);

I'll continue debugging that, but it seems to be another, unrelated, bug.

Thanks for debugging and fixing that. I'm going to take a look on the
remaining bug.

After staring at graphs built from gist trees for the whole day, I think
I finally understand what's wrong:

There's a thinko in the way we maintain the parent paths during
insertions. It boils down to the fact that in a GiST index, the
left-to-right ordering as determined by the right-links on the upper
level does not necessarily match the left-to-right ordering at a lower
level. I'm afraid we've inadvertently made that assumption in the code.

This can happen:

1. Let's imagine that we have a tree that looks like this:

root
|
...
|
A (internal node at upper level)
|
|
B
|
|
C (internal node at a lower level)
|
...

2. While we descend down the tree to insert a tuple, we memorize the
path A..B..C. This is stored in the node buffer associated with node C.

3. More tuples are inserted to another subtree below B (not shown),
until node B needs to be split. This produces tree:

A
|\
| \
B->B2
|
|
C

We still have the path A..B..C memorized in C's node buffer. The
downlink for C is now actually in B2, but that's ok, because we have the
code to follow the right links if we can't find the downlink for a node
in the memorized parent.

4. More tuples are added to another subtree of A, until A has to be
split. Picksplit decides to keep the downlink for B2 on the original
page, and moves the downlink for B on the new page, A2:

A->A2
\ /
X
/ \
B->B2
|
|
C

Remember that we still have the path A..B..C memorized in C's node buffer.

5. More tuples are buffered, and we traverse down the tree along the
path A2->B->... When we look up the node buffer for page B, we update
the path stored there. It's now A2..B. This fragment of the path is
shared by the path in C's node buffer.

6. At this point, the path memorized in C's node buffer is A2..B..C.
This is where things go wrong. While it's true that A2 is the parent of
B, and it's true that the parent of C can be found by following the
rightlink from B, A2 is *not* a grandparent of C.

7. More tuples are added below C, and C has to be split. To insert the
downlink for the new sibling, we re-find the parent for C. The memorized
path is A2..B..C. We begin by searching for the downlink for C in page
B. It's not there, so we move right, and find it in B2. The path we're
working with is now A2..B2..C. When we insert the new downlink into B2,
it also fills up and has to be split, so recurse up and have to refind
the parent of B2. We begin looking in the memorized parent, A2. The
downlink is not there, so we move right. But the downlink for B2 is to
the left from A2, so we never find it. We walk right until we fall off
the cliff, and you get the "failed to re-find parent" error.

I think the minimal fix is that after moving right, look up the node
buffer of the page we moved onto, and use the path memorized for that if
we have to recurse further up the tree. So in the above example, at step
7 after we've moved right to node B2, we should look up the memorized
path for B2 in B2's node buffer. That would give us the correct path,
A..B2..C.

The management of the path stacks is a bit complicated, anyway. I'll
think about this some more tomorrow, maybe we can make it simpler,
knowing that we have to do those extra lookups.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#4Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#2)
Re: Bug in new buffering GiST build code

On 18.05.2012 20:34, Alexander Korotkov wrote:

On Fri, May 18, 2012 at 8:27 PM, Heikki Linnakangas<
heikki.linnakangas@enterprisedb.com> wrote:

After fixing that, however, I'm now getting another error, much later in
the build process:

ERROR: failed to re-find parent for block 123002
STATEMENT: create index i_gisttest on gisttest using gist (t collate "C")
WITH (fillfactor=10);

I'll continue debugging that, but it seems to be another, unrelated, bug.

Thanks for debugging and fixing that. I'm going to take a look on the
remaining bug.

After staring at graphs built from gist trees for the whole day, I think
I finally understand what's wrong:

There's a thinko in the way we maintain the parent paths during
insertions. It boils down to the fact that in a GiST index, the
left-to-right ordering as determined by the right-links on the upper
level does not necessarily match the left-to-right ordering at a lower
level. I'm afraid we've inadvertently made that assumption in the code.

This can happen:

1. Let's imagine that we have a tree that looks like this:

root
|
...
|
A (internal node at upper level)
|
|
B
|
|
C (internal node at a lower level)
|
...

2. While we descend down the tree to insert a tuple, we memorize the
path A..B..C. This is stored in the node buffer associated with node C.

3. More tuples are inserted to another subtree below B (not shown),
until node B needs to be split. This produces tree:

A
|\
| \
B->B2
|
|
C

We still have the path A..B..C memorized in C's node buffer. The
downlink for C is now actually in B2, but that's ok, because we have the
code to follow the right links if we can't find the downlink for a node
in the memorized parent.

4. More tuples are added to another subtree of A, until A has to be
split. Picksplit decides to keep the downlink for B2 on the original
page, and moves the downlink for B on the new page, A2:

A->A2
\ /
X
/ \
B->B2
|
|
C

Remember that we still have the path A..B..C memorized in C's node buffer.

5. More tuples are buffered, and we traverse down the tree along the
path A2->B->... When we look up the node buffer for page B, we update
the path stored there. It's now A2..B. This fragment of the path is
shared by the path in C's node buffer.

6. At this point, the path memorized in C's node buffer is A2..B..C.
This is where things go wrong. While it's true that A2 is the parent of
B, and it's true that the parent of C can be found by following the
rightlink from B, A2 is *not* a grandparent of C.

7. More tuples are added below C, and C has to be split. To insert the
downlink for the new sibling, we re-find the parent for C. The memorized
path is A2..B..C. We begin by searching for the downlink for C in page
B. It's not there, so we move right, and find it in B2. The path we're
working with is now A2..B2..C. When we insert the new downlink into B2,
it also fills up and has to be split, so recurse up and have to refind
the parent of B2. We begin looking in the memorized parent, A2. The
downlink is not there, so we move right. But the downlink for B2 is to
the left from A2, so we never find it. We walk right until we fall off
the cliff, and you get the "failed to re-find parent" error.

I think the minimal fix is that after moving right, look up the node
buffer of the page we moved onto, and use the path memorized for that if
we have to recurse further up the tree. So in the above example, at step
7 after we've moved right to node B2, we should look up the memorized
path for B2 in B2's node buffer. That would give us the correct path,
A..B2..C.

The management of the path stacks is a bit complicated, anyway. I'll
think about this some more tomorrow, maybe we can make it simpler,
knowing that we have to do those extra lookups.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#5Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#3)
Re: Bug in new buffering GiST build code

Hi!

On Tue, May 22, 2012 at 12:56 AM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

After staring at graphs built from gist trees for the whole day, I think I
finally understand what's wrong:

There's a thinko in the way we maintain the parent paths during
insertions. It boils down to the fact that in a GiST index, the
left-to-right ordering as determined by the right-links on the upper level
does not necessarily match the left-to-right ordering at a lower level. I'm
afraid we've inadvertently made that assumption in the code.

This can happen:

1. Let's imagine that we have a tree that looks like this:

root
|
...
|
A (internal node at upper level)
|
|
B
|
|
C (internal node at a lower level)
|
...

2. While we descend down the tree to insert a tuple, we memorize the path
A..B..C. This is stored in the node buffer associated with node C.

3. More tuples are inserted to another subtree below B (not shown), until
node B needs to be split. This produces tree:

A
|\
| \
B->B2
|
|
C

We still have the path A..B..C memorized in C's node buffer. The downlink
for C is now actually in B2, but that's ok, because we have the code to
follow the right links if we can't find the downlink for a node in the
memorized parent.

4. More tuples are added to another subtree of A, until A has to be split.
Picksplit decides to keep the downlink for B2 on the original page, and
moves the downlink for B on the new page, A2:

A->A2
\ /
X
/ \
B->B2
|
|
C

Remember that we still have the path A..B..C memorized in C's node buffer.

5. More tuples are buffered, and we traverse down the tree along the path
A2->B->... When we look up the node buffer for page B, we update the path
stored there. It's now A2..B. This fragment of the path is shared by the
path in C's node buffer.

6. At this point, the path memorized in C's node buffer is A2..B..C. This
is where things go wrong. While it's true that A2 is the parent of B, and
it's true that the parent of C can be found by following the rightlink from
B, A2 is *not* a grandparent of C.

7. More tuples are added below C, and C has to be split. To insert the
downlink for the new sibling, we re-find the parent for C. The memorized
path is A2..B..C. We begin by searching for the downlink for C in page B.
It's not there, so we move right, and find it in B2. The path we're working
with is now A2..B2..C. When we insert the new downlink into B2, it also
fills up and has to be split, so recurse up and have to refind the parent
of B2. We begin looking in the memorized parent, A2. The downlink is not
there, so we move right. But the downlink for B2 is to the left from A2, so
we never find it. We walk right until we fall off the cliff, and you get
the "failed to re-find parent" error.

I think the minimal fix is that after moving right, look up the node
buffer of the page we moved onto, and use the path memorized for that if we
have to recurse further up the tree. So in the above example, at step 7
after we've moved right to node B2, we should look up the memorized path
for B2 in B2's node buffer. That would give us the correct path, A..B2..C.

The management of the path stacks is a bit complicated, anyway. I'll think
about this some more tomorrow, maybe we can make it simpler, knowing that
we have to do those extra lookups.

WOW! You did enormous work on exploring that!
I just arrived from PGCon and start looking at it when find you've already
done comprehensive research of this problem.
On the step 5 if we've NSN in GISTBufferingInsertStack structure, we could
detect situation of changing parent of splitted page. Using this we could
save copy of GISTBufferingInsertStack for B2 with original parent A,
because we know split of B to occur after creating GISTBufferingInsertStack
but before split of A. The question is how to find this copy from C, hash?

------
With best regards,
Alexander Korotkov.

#6Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#5)
1 attachment(s)
Re: Bug in new buffering GiST build code

On 22.05.2012 01:09, Alexander Korotkov wrote:

Hi!

On Tue, May 22, 2012 at 12:56 AM, Heikki Linnakangas<
heikki.linnakangas@enterprisedb.com> wrote:

The management of the path stacks is a bit complicated, anyway. I'll think
about this some more tomorrow, maybe we can make it simpler, knowing that
we have to do those extra lookups.

WOW! You did enormous work on exploring that!
I just arrived from PGCon and start looking at it when find you've already
done comprehensive research of this problem.
On the step 5 if we've NSN in GISTBufferingInsertStack structure, we could
detect situation of changing parent of splitted page. Using this we could
save copy of GISTBufferingInsertStack for B2 with original parent A,
because we know split of B to occur after creating GISTBufferingInsertStack
but before split of A. The question is how to find this copy from C, hash?

I tested a patch that adds the extra getNodeBuffer() call after
refinding the parent, as discussed. However, I'm still getting a "failed
to-refind parent" error later in the build, so I think we're still
missing some corner case.

I think we should rewrite the way we track the parents completely.
Rather than keep a path stack attached to every node buffer, let's just
maintain a second hash table that contains the parent of every internal
node. Whenever a downlink is moved to another page, update the hash
table with the new information. That way we always have up-to-date
information about the parent of every internal node.

That's much easier to understand than the path stack structures we have
now. I think the overall memory consumption will be about the same too.
Although we need the extra hash table with one entry for every internal
node, we get rid of the path stack structs, which are also one per every
internal node at the moment.

I believe it is faster too. I added some instrumentation to the existing
gist code (with the additional getNodeBuffer() call added to fix this
bug), to measure the time spent moving right, when refinding the parent
of a page. I added gettimeofday() calls before and after moving right,
and summed the total. In my test case, the final index size was about
19GB, and the index build took 3545 seconds (59 minutes). Of that time,
580 seconds (~ 10 minutes) was spent moving right to refind parents.
That's a lot. I also printed a line whenever a refind operation had to
move right 20 pages or more. That happened 2482 times during the build,
in the worst case we moved right over 40000 pages.

Attached is a patch to replace the path stacks with a hash table. With
this patch, the index build time in my test case dropped from 59 minutes
to about 55 minutes. I don'ẗ know how representative or repeatable this
test case is, but this definitely seems very worthwhile, not only
because it fixes the bug and makes the code simpler, but also on
performance grounds.

Alexander, do you still have the test environments and data lying around
that you used for GiST buffering testing last summer? Could you rerun
some of those tests with this patch?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

use-hash-table-for-parents-1.patchtext/x-diff; name=use-hash-table-for-parents-1.patchDownload
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 988896d..9d9a031 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -55,16 +55,24 @@ typedef struct
 {
 	Relation	indexrel;
 	GISTSTATE  *giststate;
-	GISTBuildBuffers *gfbb;
 
 	int64		indtuples;		/* number of tuples indexed */
 	int64		indtuplesSize;	/* total size of all indexed tuples */
 
 	Size		freespace;		/* amount of free space to leave on pages */
 
+	/*
+	 * Extra data structures used during a buffering build. GISTBuildBuffers
+	 * contains information related to managing the build buffers. parentMap
+	 * is a lookup table of the parent of each internal page.
+	 */
+	GISTBuildBuffers *gfbb;
+	HTAB	   *parentMap;
+
 	GistBufferingMode bufferingMode;
 } GISTBuildState;
 
+/* prototypes for private functions */
 static void gistInitBuffering(GISTBuildState *buildstate);
 static int	calculatePagesPerBuffer(GISTBuildState *buildstate, int levelStep);
 static void gistBuildCallback(Relation index,
@@ -76,18 +84,23 @@ static void gistBuildCallback(Relation index,
 static void gistBufferingBuildInsert(GISTBuildState *buildstate,
 						 IndexTuple itup);
 static bool gistProcessItup(GISTBuildState *buildstate, IndexTuple itup,
-				GISTBufferingInsertStack *startparent);
+				BlockNumber startblkno, int startlevel);
 static void gistbufferinginserttuples(GISTBuildState *buildstate,
 						  Buffer buffer,
 						  IndexTuple *itup, int ntup, OffsetNumber oldoffnum,
-						  GISTBufferingInsertStack *path);
-static void gistBufferingFindCorrectParent(GISTBuildState *buildstate,
-							   GISTBufferingInsertStack *child);
+						  int level, BlockNumber parent);
+static Buffer gistBufferingFindCorrectParent(GISTBuildState *buildstate,
+							   BlockNumber childblkno,
+							   OffsetNumber *downlinkoffnum,
+							   BlockNumber *parentblkno, int level);
 static void gistProcessEmptyingQueue(GISTBuildState *buildstate);
 static void gistEmptyAllBuffers(GISTBuildState *buildstate);
-static void gistFreeUnreferencedPath(GISTBufferingInsertStack *path);
 static int	gistGetMaxLevel(Relation index);
 
+static void gistInitParentMap(GISTBuildState *buildstate);
+static void gistMemorizeParent(GISTBuildState *buildstate, BlockNumber child, BlockNumber parent);
+static void gistMemorizeAllDownlinks(GISTBuildState *buildstate, Buffer parent);
+static BlockNumber gistGetParent(GISTBuildState *buildstate, BlockNumber child);
 
 /*
  * Main entry point to GiST index build. Initially calls insert over and over,
@@ -392,6 +405,8 @@ gistInitBuffering(GISTBuildState *buildstate)
 	buildstate->gfbb = gistInitBuildBuffers(pagesPerBuffer, levelStep,
 											gistGetMaxLevel(index));
 
+	gistInitParentMap(buildstate);
+
 	buildstate->bufferingMode = GIST_BUFFERING_ACTIVE;
 
 	elog(DEBUG1, "switched to buffered GiST build; level step = %d, pagesPerBuffer = %d",
@@ -514,7 +529,7 @@ static void
 gistBufferingBuildInsert(GISTBuildState *buildstate, IndexTuple itup)
 {
 	/* Insert the tuple to buffers. */
-	gistProcessItup(buildstate, itup, NULL);
+	gistProcessItup(buildstate, itup, 0, buildstate->gfbb->rootlevel);
 
 	/* If we filled up (half of a) buffer, process buffer emptying. */
 	gistProcessEmptyingQueue(buildstate);
@@ -528,30 +543,28 @@ gistBufferingBuildInsert(GISTBuildState *buildstate, IndexTuple itup)
  */
 static bool
 gistProcessItup(GISTBuildState *buildstate, IndexTuple itup,
-				GISTBufferingInsertStack *startparent)
+				BlockNumber startblkno, int startlevel)
 {
 	GISTSTATE  *giststate = buildstate->giststate;
 	GISTBuildBuffers *gfbb = buildstate->gfbb;
 	Relation	indexrel = buildstate->indexrel;
-	GISTBufferingInsertStack *path;
 	BlockNumber childblkno;
 	Buffer		buffer;
 	bool		result = false;
+	BlockNumber	blkno;
+	int			level;
+	OffsetNumber downlinkoffnum;
+	BlockNumber parentblkno = InvalidBlockNumber;
 
-	/*
-	 * NULL passed in startparent means that we start index tuple processing
-	 * from the root.
-	 */
-	if (!startparent)
-		path = gfbb->rootitem;
-	else
-		path = startparent;
+	CHECK_FOR_INTERRUPTS();
 
 	/*
 	 * Loop until we reach a leaf page (level == 0) or a level with buffers
 	 * (not including the level we start at, because we would otherwise make
 	 * no progress).
 	 */
+	blkno = startblkno;
+	level = startlevel;
 	for (;;)
 	{
 		ItemId		iid;
@@ -559,21 +572,21 @@ gistProcessItup(GISTBuildState *buildstate, IndexTuple itup,
 					newtup;
 		Page		page;
 		OffsetNumber childoffnum;
-		GISTBufferingInsertStack *parent;
 
 		/* Have we reached a level with buffers? */
-		if (LEVEL_HAS_BUFFERS(path->level, gfbb) && path != startparent)
+		if (LEVEL_HAS_BUFFERS(level, gfbb) && level != startlevel)
 			break;
 
 		/* Have we reached a leaf page? */
-		if (path->level == 0)
+		if (level == 0)
 			break;
 
 		/*
 		 * Nope. Descend down to the next level then. Choose a child to
 		 * descend down to.
 		 */
-		buffer = ReadBuffer(indexrel, path->blkno);
+
+		buffer = ReadBuffer(indexrel, blkno);
 		LockBuffer(buffer, GIST_EXCLUSIVE);
 
 		page = (Page) BufferGetPage(buffer);
@@ -582,32 +595,32 @@ gistProcessItup(GISTBuildState *buildstate, IndexTuple itup,
 		idxtuple = (IndexTuple) PageGetItem(page, iid);
 		childblkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
 
+		if (level > 1)
+			gistMemorizeParent(buildstate, childblkno, blkno);
+
 		/*
 		 * Check that the key representing the target child node is consistent
 		 * with the key we're inserting. Update it if it's not.
 		 */
 		newtup = gistgetadjusted(indexrel, idxtuple, itup, giststate);
 		if (newtup)
+		{
 			gistbufferinginserttuples(buildstate, buffer, &newtup, 1,
-									  childoffnum, path);
-		UnlockReleaseBuffer(buffer);
+									  childoffnum, level, InvalidBlockNumber);
+			/* gistbufferinginserttuples() released the buffer */
+		}
+		else
+			UnlockReleaseBuffer(buffer);
 
-		/* Create new path item representing current page */
-		parent = path;
-		path = (GISTBufferingInsertStack *) MemoryContextAlloc(gfbb->context,
-										   sizeof(GISTBufferingInsertStack));
-		path->parent = parent;
-		path->level = parent->level - 1;
-		path->blkno = childblkno;
-		path->downlinkoffnum = childoffnum;
-		path->refCount = 0;		/* it's unreferenced for now */
-
-		/* Adjust reference count of parent */
-		if (parent)
-			parent->refCount++;
+		/* Descend to the child */
+		parentblkno = blkno;
+		blkno = childblkno;
+		downlinkoffnum = childoffnum;
+		Assert(level > 0);
+		level--;
 	}
 
-	if (LEVEL_HAS_BUFFERS(path->level, gfbb))
+	if (LEVEL_HAS_BUFFERS(level, gfbb))
 	{
 		/*
 		 * We've reached level with buffers. Place the index tuple to the
@@ -616,8 +629,7 @@ gistProcessItup(GISTBuildState *buildstate, IndexTuple itup,
 		GISTNodeBuffer *childNodeBuffer;
 
 		/* Find the buffer or create a new one */
-		childNodeBuffer = gistGetNodeBuffer(gfbb, giststate, path->blkno,
-										 path->downlinkoffnum, path->parent);
+		childNodeBuffer = gistGetNodeBuffer(gfbb, giststate, blkno, level);
 
 		/* Add index tuple to it */
 		gistPushItupToNodeBuffer(gfbb, childNodeBuffer, itup);
@@ -630,19 +642,16 @@ gistProcessItup(GISTBuildState *buildstate, IndexTuple itup,
 		/*
 		 * We've reached a leaf page. Place the tuple here.
 		 */
-		buffer = ReadBuffer(indexrel, path->blkno);
+		Assert(level == 0);
+		buffer = ReadBuffer(indexrel, blkno);
 		LockBuffer(buffer, GIST_EXCLUSIVE);
 		gistbufferinginserttuples(buildstate, buffer, &itup, 1,
-								  InvalidOffsetNumber, path);
-		UnlockReleaseBuffer(buffer);
+								  InvalidOffsetNumber, level, parentblkno);
+		/* TODO: we should pass the downlink offnum, to avoid having to scan
+		 * the whole parent page to find the downlink */
+		/* gistbufferinginserttuples() released the buffer */
 	}
 
-	/*
-	 * Free unreferenced path items, if any. Path item may be referenced by
-	 * node buffer.
-	 */
-	gistFreeUnreferencedPath(path);
-
 	return result;
 }
 
@@ -650,11 +659,14 @@ gistProcessItup(GISTBuildState *buildstate, IndexTuple itup,
  * Insert tuples to a given page.
  *
  * This is analogous with gistinserttuples() in the regular insertion code.
+ *
+ * Caller should hold a lock on 'buffer' on entry. This function will unlock
+ * and unpin it.
  */
 static void
 gistbufferinginserttuples(GISTBuildState *buildstate, Buffer buffer,
 						  IndexTuple *itup, int ntup, OffsetNumber oldoffnum,
-						  GISTBufferingInsertStack *path)
+						  int level, BlockNumber parentblkno)
 {
 	GISTBuildBuffers *gfbb = buildstate->gfbb;
 	List	   *splitinfo;
@@ -677,33 +689,40 @@ gistbufferinginserttuples(GISTBuildState *buildstate, Buffer buffer,
 	 */
 	if (is_split && BufferGetBlockNumber(buffer) == GIST_ROOT_BLKNO)
 	{
-		GISTBufferingInsertStack *oldroot = gfbb->rootitem;
 		Page		page = BufferGetPage(buffer);
-		ItemId		iid;
-		IndexTuple	idxtuple;
-		BlockNumber leftmostchild;
+		OffsetNumber off;
+		OffsetNumber maxoff;
+
+		Assert(level == gfbb->rootlevel);
+		gfbb->rootlevel++;
 
-		gfbb->rootitem = (GISTBufferingInsertStack *) MemoryContextAlloc(
-							gfbb->context, sizeof(GISTBufferingInsertStack));
-		gfbb->rootitem->parent = NULL;
-		gfbb->rootitem->blkno = GIST_ROOT_BLKNO;
-		gfbb->rootitem->downlinkoffnum = InvalidOffsetNumber;
-		gfbb->rootitem->level = oldroot->level + 1;
-		gfbb->rootitem->refCount = 1;
+		elog(DEBUG1, "splitting GiST root page, now %d levels deep", gfbb->rootlevel);
 
 		/*
 		 * All the downlinks on the old root page are now on one of the child
-		 * pages. Change the block number of the old root entry in the stack
-		 * to point to the leftmost child. The other child pages will be
-		 * accessible from there by walking right.
+		 * pages. Visit all the new child pages to memorize the parents of
+		 * the grandchildren.
 		 */
-		iid = PageGetItemId(page, FirstOffsetNumber);
-		idxtuple = (IndexTuple) PageGetItem(page, iid);
-		leftmostchild = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
+		if (gfbb->rootlevel > 1)
+		{
+			maxoff = PageGetMaxOffsetNumber(page);
+			for (off = FirstOffsetNumber; off <= maxoff; off++)
+			{
+				ItemId iid = PageGetItemId(page, FirstOffsetNumber);
+				IndexTuple idxtuple = (IndexTuple) PageGetItem(page, iid);
+				BlockNumber childblkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
+				Buffer childbuf = ReadBuffer(buildstate->indexrel, childblkno);
+				LockBuffer(childbuf, GIST_SHARE);
+				gistMemorizeAllDownlinks(buildstate, childbuf);
+				UnlockReleaseBuffer(childbuf);
 
-		oldroot->parent = gfbb->rootitem;
-		oldroot->blkno = leftmostchild;
-		oldroot->downlinkoffnum = InvalidOffsetNumber;
+				/*
+				 * Also remember that the parent of the new child page is
+				 * the root block.
+				 */
+				gistMemorizeParent(buildstate, childblkno, GIST_ROOT_BLKNO);
+			}
+		}
 	}
 
 	if (splitinfo)
@@ -711,16 +730,22 @@ gistbufferinginserttuples(GISTBuildState *buildstate, Buffer buffer,
 		/*
 		 * Insert the downlinks to the parent. This is analogous with
 		 * gistfinishsplit() in the regular insertion code, but the locking is
-		 * simpler, and we have to maintain the buffers.
+		 * simpler, and we have to maintain the buffers on internal nodes and
+		 * the parent map.
 		 */
 		IndexTuple *downlinks;
 		int			ndownlinks,
 					i;
 		Buffer		parentBuffer;
 		ListCell   *lc;
+		OffsetNumber downlinkoffnum = InvalidOffsetNumber;
 
 		/* Parent may have changed since we memorized this path. */
-		gistBufferingFindCorrectParent(buildstate, path);
+		parentBuffer = gistBufferingFindCorrectParent(buildstate,
+													  BufferGetBlockNumber(buffer),
+													  &downlinkoffnum,
+													  &parentblkno,
+													  level);
 
 		/*
 		 * If there's a buffer associated with this page, that needs to be
@@ -732,7 +757,8 @@ gistbufferinginserttuples(GISTBuildState *buildstate, Buffer buffer,
 		gistRelocateBuildBuffersOnSplit(gfbb,
 										buildstate->giststate,
 										buildstate->indexrel,
-										path, buffer, splitinfo);
+										level,
+										buffer, splitinfo);
 
 		/* Create an array of all the downlink tuples */
 		ndownlinks = list_length(splitinfo);
@@ -743,124 +769,115 @@ gistbufferinginserttuples(GISTBuildState *buildstate, Buffer buffer,
 			GISTPageSplitInfo *splitinfo = lfirst(lc);
 
 			/*
+			 * Remember the parent of each new child page in our parent map.
+			 * This assumes that the downlinks fit on the parent page. If the
+			 * parent page is split, too, when we recurse up to insert the
+			 * downlinks, the recursive gistbufferinginserttuples() call
+			 * will update the map again.
+			 */
+			if (level > 0)
+				gistMemorizeParent(buildstate,
+								   BufferGetBlockNumber(splitinfo->buf),
+								   BufferGetBlockNumber(parentBuffer));
+
+			/*
+			 * Also update the parent map for all the downlinks that got moved
+			 * to a different page. (actually this also loops through the
+			 * downlinks that stayed on the original page, but it does no
+			 * harm).
+			 */
+			if (level > 1)
+				gistMemorizeAllDownlinks(buildstate, splitinfo->buf);
+
+			/*
 			 * Since there's no concurrent access, we can release the lower
-			 * level buffers immediately. Don't release the buffer for the
-			 * original page, though, because the caller will release that.
+			 * level buffers immediately. This includes the original page.
 			 */
-			if (splitinfo->buf != buffer)
-				UnlockReleaseBuffer(splitinfo->buf);
+			UnlockReleaseBuffer(splitinfo->buf);
 			downlinks[i++] = splitinfo->downlink;
 		}
 
 		/* Insert them into parent. */
-		parentBuffer = ReadBuffer(buildstate->indexrel, path->parent->blkno);
-		LockBuffer(parentBuffer, GIST_EXCLUSIVE);
 		gistbufferinginserttuples(buildstate, parentBuffer,
 								  downlinks, ndownlinks,
-								  path->downlinkoffnum, path->parent);
-		UnlockReleaseBuffer(parentBuffer);
+								  downlinkoffnum, level + 1, InvalidBlockNumber);
 
 		list_free_deep(splitinfo);		/* we don't need this anymore */
 	}
+	else
+		UnlockReleaseBuffer(buffer);
 }
 
 /*
  * Find correct parent by following rightlinks in buffering index build. This
  * method of parent searching is possible because no concurrent activity is
  * possible while index builds.
+ *
+ * Returns a buffer containing the parent page, exclusively-locked.
+ * *downlinkoffnum and *parentblkno are set to point to the downlink.
  */
-static void
+static Buffer
 gistBufferingFindCorrectParent(GISTBuildState *buildstate,
-							   GISTBufferingInsertStack *child)
+							   BlockNumber childblkno,
+							   OffsetNumber *downlinkoffnum,
+							   BlockNumber *parentblkno,
+							   int level)
 {
-	GISTBuildBuffers *gfbb = buildstate->gfbb;
-	Relation	indexrel = buildstate->indexrel;
-	GISTBufferingInsertStack *parent = child->parent;
-	OffsetNumber i,
-				maxoff;
-	ItemId		iid;
-	IndexTuple	idxtuple;
+	BlockNumber parent;
 	Buffer		buffer;
 	Page		page;
-	bool		copied = false;
+	OffsetNumber maxoff;
+	OffsetNumber off;
+
+	if (level > 0)
+		parent = gistGetParent(buildstate, childblkno);
+	else
+	{
+		if (*parentblkno == InvalidBlockNumber)
+			elog(ERROR, "no parent buffer provided of child %d", childblkno);
+		parent = *parentblkno;
+	}
 
-	buffer = ReadBuffer(indexrel, parent->blkno);
+	buffer = ReadBuffer(buildstate->indexrel, parent);
 	page = BufferGetPage(buffer);
 	LockBuffer(buffer, GIST_EXCLUSIVE);
-	gistcheckpage(indexrel, buffer);
+	gistcheckpage(buildstate->indexrel, buffer);
+	maxoff = PageGetMaxOffsetNumber(page);
 
 	/* Check if it was not moved */
-	if (child->downlinkoffnum != InvalidOffsetNumber &&
-		child->downlinkoffnum <= PageGetMaxOffsetNumber(page))
+	if (parent == *parentblkno && *parentblkno != InvalidBlockNumber &&
+		*downlinkoffnum != InvalidOffsetNumber && *downlinkoffnum <= maxoff)
 	{
-		iid = PageGetItemId(page, child->downlinkoffnum);
-		idxtuple = (IndexTuple) PageGetItem(page, iid);
-		if (ItemPointerGetBlockNumber(&(idxtuple->t_tid)) == child->blkno)
+		ItemId iid = PageGetItemId(page, *downlinkoffnum);
+		IndexTuple idxtuple = (IndexTuple) PageGetItem(page, iid);
+		if (ItemPointerGetBlockNumber(&(idxtuple->t_tid)) == childblkno)
 		{
 			/* Still there */
-			UnlockReleaseBuffer(buffer);
-			return;
+			return buffer;
 		}
 	}
 
-	/* parent has changed, look child in right links until found */
-	while (true)
+	/*
+	 * Downlink was not at the offset where it used to be. Scan the page
+	 * to find it. During normal gist insertions, it might've moved to another
+	 * page, to the right, but during a buffering build, we keep track of
+	 * the parent of each page in the lookup table so we should always know
+	 * what page it's on.
+	 */
+	for (off = FirstOffsetNumber; off <= maxoff; off = OffsetNumberNext(off))
 	{
-		/* Search for relevant downlink in the current page */
-		maxoff = PageGetMaxOffsetNumber(page);
-		for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+		ItemId iid = PageGetItemId(page, off);
+		IndexTuple idxtuple = (IndexTuple) PageGetItem(page, iid);
+		if (ItemPointerGetBlockNumber(&(idxtuple->t_tid)) == childblkno)
 		{
-			iid = PageGetItemId(page, i);
-			idxtuple = (IndexTuple) PageGetItem(page, iid);
-			if (ItemPointerGetBlockNumber(&(idxtuple->t_tid)) == child->blkno)
-			{
-				/* yes!!, found */
-				child->downlinkoffnum = i;
-				UnlockReleaseBuffer(buffer);
-				return;
-			}
+			/* yes!!, found it */
+			*downlinkoffnum = off;
+			return buffer;
 		}
-
-		/*
-		 * We should copy parent path item because some other path items can
-		 * refer to it.
-		 */
-		if (!copied)
-		{
-			parent = (GISTBufferingInsertStack *) MemoryContextAlloc(gfbb->context,
-										   sizeof(GISTBufferingInsertStack));
-			memcpy(parent, child->parent, sizeof(GISTBufferingInsertStack));
-			if (parent->parent)
-				parent->parent->refCount++;
-			gistDecreasePathRefcount(child->parent);
-			child->parent = parent;
-			parent->refCount = 1;
-			copied = true;
-		}
-
-		/*
-		 * Not found in current page. Move towards rightlink.
-		 */
-		parent->blkno = GistPageGetOpaque(page)->rightlink;
-		UnlockReleaseBuffer(buffer);
-
-		if (parent->blkno == InvalidBlockNumber)
-		{
-			/*
-			 * End of chain and still didn't find parent. Should not happen
-			 * during index build.
-			 */
-			break;
-		}
-
-		/* Get the next page */
-		buffer = ReadBuffer(indexrel, parent->blkno);
-		page = BufferGetPage(buffer);
-		LockBuffer(buffer, GIST_EXCLUSIVE);
-		gistcheckpage(indexrel, buffer);
 	}
 
-	elog(ERROR, "failed to re-find parent for block %u", child->blkno);
+	elog(ERROR, "failed to re-find parent for block %u", childblkno);
+	return InvalidBuffer; /* keep compiler quiet */
 }
 
 /*
@@ -919,7 +936,7 @@ gistProcessEmptyingQueue(GISTBuildState *buildstate)
 			 * threshold, but we might as well keep flushing tuples from it
 			 * until we fill a lower-level buffer.
 			 */
-			if (gistProcessItup(buildstate, itup, emptyingNodeBuffer->path))
+			if (gistProcessItup(buildstate, itup, emptyingNodeBuffer->nodeBlocknum, emptyingNodeBuffer->level))
 			{
 				/*
 				 * A lower level buffer filled up. Stop emptying this buffer,
@@ -988,46 +1005,12 @@ gistEmptyAllBuffers(GISTBuildState *buildstate)
 				gfbb->buffersOnLevels[i] =
 					list_delete_first(gfbb->buffersOnLevels[i]);
 		}
+		elog(DEBUG2, "emptied all buffers at level %d", i);
 	}
 	MemoryContextSwitchTo(oldCtx);
 }
 
 /*
- * Free unreferenced parts of a path stack.
- */
-static void
-gistFreeUnreferencedPath(GISTBufferingInsertStack *path)
-{
-	while (path->refCount == 0)
-	{
-		/*
-		 * Path part is unreferenced. We can free it and decrease reference
-		 * count of parent. If parent becomes unreferenced too procedure
-		 * should be repeated for it.
-		 */
-		GISTBufferingInsertStack *tmp = path->parent;
-
-		pfree(path);
-		path = tmp;
-		if (path)
-			path->refCount--;
-		else
-			break;
-	}
-}
-
-/*
- * Decrease reference count of a path part, and free any unreferenced parts of
- * the path stack.
- */
-void
-gistDecreasePathRefcount(GISTBufferingInsertStack *path)
-{
-	path->refCount--;
-	gistFreeUnreferencedPath(path);
-}
-
-/*
  * Get the depth of the GiST index.
  */
 static int
@@ -1076,9 +1059,112 @@ gistGetMaxLevel(Relation index)
 
 		/*
 		 * We're going down on the tree. It means that there is yet one more
-		 * level is the tree.
+		 * level in the tree.
 		 */
 		maxLevel++;
 	}
 	return maxLevel;
 }
+
+
+/*
+ * Routines for managing the parent map.
+ *
+ * Whenever a page is split, we need to insert the downlinks into the parent.
+ * We need to somehow find the parent page to do that. In normal insertions,
+ * we keep a stack of nodes visited when we descend the tree. However, in
+ * buffering build, we can start descending the tree from any internal node,
+ * when we empty a buffer by cascading tuples to its children. So we don't
+ * have a full stack up to the root available at that time.
+ *
+ * So instead, we maintain a hash table to track the parent of every internal
+ * page. We don't need to track the parents of leaf nodes, however. Whenever
+ * we insert to a leaf, we've just descended down from its parent, so we know
+ * its immediate parent already. This helps a lot to limit the memory used
+ * by this hash table.
+ *
+ * Whenever an internal node is split, the parent map needs to be updated.
+ * the parent of the new child page needs to be recorded, and also the
+ * entries for all page whose downlinks are moved to a new page at the split
+ * needs to be updated.
+ *
+ * We also update the parent map whenever we descend the tree. That might seem
+ * unnecessary, because we maintain the map whenever a downlink is moved or
+ * created, but it is needed because we switch to buffering mode after
+ * creating a tree with regular index inserts. Any pages created before
+ * switching to buffering mode will not be present in the parent map initially,
+ * but will be added there the first time we visit them.
+ */
+
+typedef struct
+{
+	BlockNumber childblkno; /* hash key */
+	BlockNumber parentblkno;
+	int childlevel;
+} ParentMapEntry;
+
+static void
+gistInitParentMap(GISTBuildState *buildstate)
+{
+	HASHCTL		hashCtl;
+
+	hashCtl.keysize = sizeof(BlockNumber);
+	hashCtl.entrysize = sizeof(ParentMapEntry);
+	hashCtl.hcxt = CurrentMemoryContext;
+	hashCtl.hash = oid_hash;
+	buildstate->parentMap = hash_create("gistbuild parent map",
+										1024,
+										&hashCtl,
+										HASH_ELEM | HASH_CONTEXT
+										| HASH_FUNCTION);
+}
+
+static void
+gistMemorizeParent(GISTBuildState *buildstate, BlockNumber child, BlockNumber parent)
+{
+	ParentMapEntry *entry;
+	bool		found;
+
+	entry = (ParentMapEntry *) hash_search(buildstate->parentMap,
+										  (const void *) &child,
+										  HASH_ENTER,
+										  &found);
+	entry->parentblkno = parent;
+}
+
+static void
+gistMemorizeAllDownlinks(GISTBuildState *buildstate, Buffer parentbuf)
+{
+	OffsetNumber maxoff;
+	OffsetNumber off;
+	BlockNumber parentblkno = BufferGetBlockNumber(parentbuf);
+	Page page = BufferGetPage(parentbuf);
+
+	Assert(!GistPageIsLeaf(page));
+
+	maxoff = PageGetMaxOffsetNumber(page);
+	for (off = FirstOffsetNumber; off <= maxoff; off++)
+	{
+		ItemId iid = PageGetItemId(page, off);
+		IndexTuple idxtuple = (IndexTuple) PageGetItem(page, iid);
+		BlockNumber childblkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
+		gistMemorizeParent(buildstate, childblkno, parentblkno);
+	}
+}
+
+static BlockNumber
+gistGetParent(GISTBuildState *buildstate, BlockNumber child)
+{
+	ParentMapEntry *entry;
+	bool		found;
+
+	/* Find node buffer in hash table */
+	entry = (ParentMapEntry *) hash_search(buildstate->parentMap,
+										  (const void *) &child,
+										  HASH_FIND,
+										  &found);
+	if (!found)
+		elog(ERROR, "could not find parent of block %d in lookup table", child);
+
+	return entry->parentblkno;
+}
diff --git a/src/backend/access/gist/gistbuildbuffers.c b/src/backend/access/gist/gistbuildbuffers.c
index a40b838..211ad21 100644
--- a/src/backend/access/gist/gistbuildbuffers.c
+++ b/src/backend/access/gist/gistbuildbuffers.c
@@ -107,16 +107,7 @@ gistInitBuildBuffers(int pagesPerBuffer, int levelStep, int maxLevel)
 												   sizeof(GISTNodeBuffer *));
 	gfbb->loadedBuffersCount = 0;
 
-	/*
-	 * Root path item of the tree. Updated on each root node split.
-	 */
-	gfbb->rootitem = (GISTBufferingInsertStack *) MemoryContextAlloc(
-							gfbb->context, sizeof(GISTBufferingInsertStack));
-	gfbb->rootitem->parent = NULL;
-	gfbb->rootitem->blkno = GIST_ROOT_BLKNO;
-	gfbb->rootitem->downlinkoffnum = InvalidOffsetNumber;
-	gfbb->rootitem->level = maxLevel;
-	gfbb->rootitem->refCount = 1;
+	gfbb->rootlevel = maxLevel;
 
 	return gfbb;
 }
@@ -127,9 +118,7 @@ gistInitBuildBuffers(int pagesPerBuffer, int levelStep, int maxLevel)
  */
 GISTNodeBuffer *
 gistGetNodeBuffer(GISTBuildBuffers *gfbb, GISTSTATE *giststate,
-				  BlockNumber nodeBlocknum,
-				  OffsetNumber downlinkoffnum,
-				  GISTBufferingInsertStack *parent)
+				  BlockNumber nodeBlocknum, int level)
 {
 	GISTNodeBuffer *nodeBuffer;
 	bool		found;
@@ -144,8 +133,6 @@ gistGetNodeBuffer(GISTBuildBuffers *gfbb, GISTSTATE *giststate,
 		/*
 		 * Node buffer wasn't found. Initialize the new buffer as empty.
 		 */
-		GISTBufferingInsertStack *path;
-		int			level;
 		MemoryContext oldcxt = MemoryContextSwitchTo(gfbb->context);
 
 		/* nodeBuffer->nodeBlocknum is the hash key and was filled in already */
@@ -153,33 +140,12 @@ gistGetNodeBuffer(GISTBuildBuffers *gfbb, GISTSTATE *giststate,
 		nodeBuffer->pageBlocknum = InvalidBlockNumber;
 		nodeBuffer->pageBuffer = NULL;
 		nodeBuffer->queuedForEmptying = false;
-
-		/*
-		 * Create a path stack for the page.
-		 */
-		if (nodeBlocknum != GIST_ROOT_BLKNO)
-		{
-			path = (GISTBufferingInsertStack *) palloc(
-										   sizeof(GISTBufferingInsertStack));
-			path->parent = parent;
-			path->blkno = nodeBlocknum;
-			path->downlinkoffnum = downlinkoffnum;
-			path->level = parent->level - 1;
-			path->refCount = 0; /* initially unreferenced */
-			parent->refCount++; /* this path references its parent */
-			Assert(path->level > 0);
-		}
-		else
-			path = gfbb->rootitem;
-
-		nodeBuffer->path = path;
-		path->refCount++;
+		nodeBuffer->level = level;
 
 		/*
 		 * Add this buffer to the list of buffers on this level. Enlarge
 		 * buffersOnLevels array if needed.
 		 */
-		level = path->level;
 		if (level >= gfbb->buffersOnLevelsLen)
 		{
 			int			i;
@@ -210,20 +176,6 @@ gistGetNodeBuffer(GISTBuildBuffers *gfbb, GISTSTATE *giststate,
 
 		MemoryContextSwitchTo(oldcxt);
 	}
-	else
-	{
-		if (parent != nodeBuffer->path->parent)
-		{
-			/*
-			 * A different parent path item was provided than we've
-			 * remembered. We trust caller to provide more correct parent than
-			 * we have. Previous parent may be outdated by page split.
-			 */
-			gistDecreasePathRefcount(nodeBuffer->path->parent);
-			nodeBuffer->path->parent = parent;
-			parent->refCount++;
-		}
-	}
 
 	return nodeBuffer;
 }
@@ -585,7 +537,7 @@ typedef struct
  */
 void
 gistRelocateBuildBuffersOnSplit(GISTBuildBuffers *gfbb, GISTSTATE *giststate,
-								Relation r, GISTBufferingInsertStack *path,
+								Relation r, int level,
 								Buffer buffer, List *splitinfo)
 {
 	RelocationBufferInfo *relocationBuffersInfos;
@@ -601,7 +553,7 @@ gistRelocateBuildBuffersOnSplit(GISTBuildBuffers *gfbb, GISTSTATE *giststate,
 	ListCell   *lc;
 
 	/* If the splitted page doesn't have buffers, we have nothing to do. */
-	if (!LEVEL_HAS_BUFFERS(path->level, gfbb))
+	if (!LEVEL_HAS_BUFFERS(level, gfbb))
 		return;
 
 	/*
@@ -660,14 +612,11 @@ gistRelocateBuildBuffersOnSplit(GISTBuildBuffers *gfbb, GISTSTATE *giststate,
 		/*
 		 * Create a node buffer for the page. The leftmost half is on the same
 		 * block as the old page before split, so for the leftmost half this
-		 * will return the original buffer, which was emptied earlier in this
-		 * function.
+		 * will return the original buffer. The tuples on the original buffer
+		 * were relinked to the temporary buffer, so the original one is now
+		 * empty.
 		 */
-		newNodeBuffer = gistGetNodeBuffer(gfbb,
-										  giststate,
-										  BufferGetBlockNumber(si->buf),
-										  path->downlinkoffnum,
-										  path->parent);
+		newNodeBuffer = gistGetNodeBuffer(gfbb, giststate, BufferGetBlockNumber(si->buf), level);
 
 		relocationBuffersInfos[i].nodeBuffer = newNodeBuffer;
 		relocationBuffersInfos[i].splitinfo = si;
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 481ef5f..5ad9858 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -329,7 +329,7 @@ typedef struct
 	/* is this a temporary copy, not in the hash table? */
 	bool		isTemp;
 
-	struct GISTBufferingInsertStack *path;
+	int			level;			/* 0 == leaf */
 } GISTNodeBuffer;
 
 /*
@@ -338,7 +338,7 @@ typedef struct
  */
 #define LEVEL_HAS_BUFFERS(nlevel, gfbb) \
 	((nlevel) != 0 && (nlevel) % (gfbb)->levelStep == 0 && \
-	 (nlevel) != (gfbb)->rootitem->level)
+	 (nlevel) != (gfbb)->rootlevel)
 
 /* Is specified buffer at least half-filled (should be queued for emptying)? */
 #define BUFFER_HALF_FILLED(nodeBuffer, gfbb) \
@@ -353,26 +353,6 @@ typedef struct
 	((nodeBuffer)->blocksCount > (gfbb)->pagesPerBuffer)
 
 /*
- * Extended GISTInsertStack for buffering GiST index build.
- */
-typedef struct GISTBufferingInsertStack
-{
-	/* current page */
-	BlockNumber blkno;
-
-	/* offset of the downlink in the parent page, that points to this page */
-	OffsetNumber downlinkoffnum;
-
-	/* pointer to parent */
-	struct GISTBufferingInsertStack *parent;
-
-	int			refCount;
-
-	/* level number */
-	int			level;
-} GISTBufferingInsertStack;
-
-/*
  * Data structure with general information about build buffers.
  */
 typedef struct GISTBuildBuffers
@@ -416,8 +396,8 @@ typedef struct GISTBuildBuffers
 	int			loadedBuffersCount;		/* # of entries in loadedBuffers */
 	int			loadedBuffersLen;		/* allocated size of loadedBuffers */
 
-	/* A path item that points to the current root node */
-	GISTBufferingInsertStack *rootitem;
+	/* Level of the current root node (= height of the index tree - 1) */
+	int			rootlevel;
 } GISTBuildBuffers;
 
 /*
@@ -551,15 +531,13 @@ extern void gistSplitByKey(Relation r, Page page, IndexTuple *itup,
 /* gistbuild.c */
 extern Datum gistbuild(PG_FUNCTION_ARGS);
 extern void gistValidateBufferingOption(char *value);
-extern void gistDecreasePathRefcount(GISTBufferingInsertStack *path);
 
 /* gistbuildbuffers.c */
 extern GISTBuildBuffers *gistInitBuildBuffers(int pagesPerBuffer, int levelStep,
 					 int maxLevel);
 extern GISTNodeBuffer *gistGetNodeBuffer(GISTBuildBuffers *gfbb,
 				  GISTSTATE *giststate,
-				  BlockNumber blkno, OffsetNumber downlinkoffnum,
-				  GISTBufferingInsertStack *parent);
+				  BlockNumber blkno, int level);
 extern void gistPushItupToNodeBuffer(GISTBuildBuffers *gfbb,
 						 GISTNodeBuffer *nodeBuffer, IndexTuple item);
 extern bool gistPopItupFromNodeBuffer(GISTBuildBuffers *gfbb,
@@ -567,7 +545,7 @@ extern bool gistPopItupFromNodeBuffer(GISTBuildBuffers *gfbb,
 extern void gistFreeBuildBuffers(GISTBuildBuffers *gfbb);
 extern void gistRelocateBuildBuffersOnSplit(GISTBuildBuffers *gfbb,
 								GISTSTATE *giststate, Relation r,
-								GISTBufferingInsertStack *path, Buffer buffer,
+								int level, Buffer buffer,
 								List *splitinfo);
 extern void gistUnloadNodeBuffers(GISTBuildBuffers *gfbb);
 
#7Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#6)
Re: Bug in new buffering GiST build code

On Sat, May 26, 2012 at 12:33 AM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

I think we should rewrite the way we track the parents completely. Rather
than keep a path stack attached to every node buffer, let's just maintain a
second hash table that contains the parent of every internal node. Whenever
a downlink is moved to another page, update the hash table with the new
information. That way we always have up-to-date information about the
parent of every internal node.

That's much easier to understand than the path stack structures we have
now. I think the overall memory consumption will be about the same too.
Although we need the extra hash table with one entry for every internal
node, we get rid of the path stack structs, which are also one per every
internal node at the moment.

I believe it is faster too. I added some instrumentation to the existing
gist code (with the additional getNodeBuffer() call added to fix this bug),
to measure the time spent moving right, when refinding the parent of a
page. I added gettimeofday() calls before and after moving right, and
summed the total. In my test case, the final index size was about 19GB, and
the index build took 3545 seconds (59 minutes). Of that time, 580 seconds
(~ 10 minutes) was spent moving right to refind parents. That's a lot. I
also printed a line whenever a refind operation had to move right 20 pages
or more. That happened 2482 times during the build, in the worst case we
moved right over 40000 pages.

Attached is a patch to replace the path stacks with a hash table. With
this patch, the index build time in my test case dropped from 59 minutes to
about 55 minutes. I don'ẗ know how representative or repeatable this test
case is, but this definitely seems very worthwhile, not only because it
fixes the bug and makes the code simpler, but also on performance grounds.

Cool, seems that we've both simplier and faster implementation of finding
parent. Thanks!

Alexander, do you still have the test environments and data lying around
that you used for GiST buffering testing last summer? Could you rerun some
of those tests with this patch?

I think I can restore test environment and data. Will rerun tests soon.

------
With best regards,
Alexander Korotkov.

#8Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#7)
Re: Bug in new buffering GiST build code

On 28.05.2012 00:46, Alexander Korotkov wrote:

On Sat, May 26, 2012 at 12:33 AM, Heikki Linnakangas<
heikki.linnakangas@enterprisedb.com> wrote:

Attached is a patch to replace the path stacks with a hash table. With
this patch, the index build time in my test case dropped from 59 minutes to
about 55 minutes. I don'ẗ know how representative or repeatable this test
case is, but this definitely seems very worthwhile, not only because it
fixes the bug and makes the code simpler, but also on performance grounds.

Cool, seems that we've both simplier and faster implementation of finding
parent. Thanks!

Ok, I committed this now.

I also spotted and fixed another little oversight: the temporary file
didn't get deleted after the index build.

Alexander, do you still have the test environments and data lying around
that you used for GiST buffering testing last summer? Could you rerun some
of those tests with this patch?

I think I can restore test environment and data. Will rerun tests soon.

Thanks!

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#9Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#8)
Re: Bug in new buffering GiST build code

On Wed, May 30, 2012 at 1:01 PM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

I also spotted and fixed another little oversight: the temporary file
didn't get deleted after the index build.

I've one note not directly related to buffering build. While I debugging
buffering GiST index build, backend was frequently crashed. After recovery
partially built index file was remain. Do we have some tool to detect such
"dead" files? If not, probably we need some?

------
With best regards,
Alexander Korotkov.

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alexander Korotkov (#9)
Re: Bug in new buffering GiST build code

Alexander Korotkov <aekorotkov@gmail.com> writes:

I've one note not directly related to buffering build. While I debugging
buffering GiST index build, backend was frequently crashed. After recovery
partially built index file was remain. Do we have some tool to detect such
"dead" files? If not, probably we need some?

Well, it's not that hard to check for orphan files (I think
contrib/oid2name can do that, or perhaps could be extended to). I don't
like the idea of the postmaster automatically removing such files, if
that's where you were headed. Too much risk of deleting important data.

regards, tom lane

#11Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#7)
Re: Bug in new buffering GiST build code

On Mon, May 28, 2012 at 1:46 AM, Alexander Korotkov <aekorotkov@gmail.com>wrote:

On Sat, May 26, 2012 at 12:33 AM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

Alexander, do you still have the test environments and data lying around
that you used for GiST buffering testing last summer? Could you rerun some
of those tests with this patch?

I think I can restore test environment and data. Will rerun tests soon.

I rerun some of tests. There are index build times in seconds for old way
of parent refind and new way of it.

old new
usnoa2 2385 2452
usnoa2_shuffled 8131 8055
uniform 8327 8359

I thinks difference can be described by round error.
Indexes seem to be exactly same. It's predictable because changing
algorithm of parent refind shouldn't change the result.

------
With best regards,
Alexander Korotkov.

#12Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#11)
Re: Bug in new buffering GiST build code

On Tue, Jun 5, 2012 at 2:00 AM, Alexander Korotkov <aekorotkov@gmail.com>wrote:

On Mon, May 28, 2012 at 1:46 AM, Alexander Korotkov <aekorotkov@gmail.com>wrote:

On Sat, May 26, 2012 at 12:33 AM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

Alexander, do you still have the test environments and data lying around
that you used for GiST buffering testing last summer? Could you rerun some
of those tests with this patch?

I think I can restore test environment and data. Will rerun tests soon.

I rerun some of tests. There are index build times in seconds for old way
of parent refind and new way of it.

old new
usnoa2 2385 2452
usnoa2_shuffled 8131 8055
uniform 8327 8359

I thinks difference can be described by round error.

Oh, I mean not "round" error, but "random". I.e. not exactly same state of
shared buffers at index build start and so on.

------
With best regards,
Alexander Korotkov.

#13Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alexander Korotkov (#12)
Re: Bug in new buffering GiST build code

On 05.06.2012 09:45, Alexander Korotkov wrote:

On Tue, Jun 5, 2012 at 2:00 AM, Alexander Korotkov<aekorotkov@gmail.com>wrote:

On Mon, May 28, 2012 at 1:46 AM, Alexander Korotkov<aekorotkov@gmail.com>wrote:

On Sat, May 26, 2012 at 12:33 AM, Heikki Linnakangas<
heikki.linnakangas@enterprisedb.com> wrote:

Alexander, do you still have the test environments and data lying around
that you used for GiST buffering testing last summer? Could you rerun some
of those tests with this patch?

I think I can restore test environment and data. Will rerun tests soon.

I rerun some of tests. There are index build times in seconds for old way
of parent refind and new way of it.

old new
usnoa2 2385 2452
usnoa2_shuffled 8131 8055
uniform 8327 8359

I thinks difference can be described by round error.

Oh, I mean not "round" error, but "random". I.e. not exactly same state of
shared buffers at index build start and so on.

Thanks. I was expecting a small performance gain from the new approach,
but I guess not. Oh well, that would've just been a bonus, the important
thing is that it now works.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com