All-zero page in GIN index causes assertion failure

Started by Heikki Linnakangas, over 10 years ago. 11 messages.
#1 Heikki Linnakangas
hlinnaka@iki.fi

This is a continuation of the discussion at
/messages/by-id/CAMkU=1zUc=h0oCZntaJaqqW7gxxVxCWsYq8DD2t7oHgsgVEsgA@mail.gmail.com,
I'm starting a new thread as this is a separate issue from the original
LWLock bug.

On Thu, Jul 16, 2015 at 12:03 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Wed, Jul 15, 2015 at 8:44 AM, Heikki Linnakangas <hlinnaka@iki.fi>
wrote:

I don't see how this is related to the LWLock issue, but I didn't see it
without your patch. Perhaps the system just didn't survive long enough to
uncover it without the patch (although it shows up pretty quickly). It
could just be an overzealous Assert, since the cassert-off build didn't
show problems.

bt and bt full are shown below.

Cheers,

Jeff

#0 0x0000003dcb632625 in raise () from /lib64/libc.so.6
#1 0x0000003dcb633e05 in abort () from /lib64/libc.so.6
#2 0x0000000000930b7a in ExceptionalCondition (
conditionName=0x9a1440 "!(((PageHeader) (page))->pd_special >=
(__builtin_offsetof (PageHeaderData, pd_linp)))", errorType=0x9a12bc
"FailedAssertion",
fileName=0x9a12b0 "ginvacuum.c", lineNumber=713) at assert.c:54
#3 0x00000000004947cf in ginvacuumcleanup (fcinfo=0x7fffee073a90) at
ginvacuum.c:713

It now looks like this *is* unrelated to the LWLock issue. The assert that
it is tripping over was added just recently (302ac7f27197855afa8c) and so I
had not been testing under its presence until now. It looks like it is
finding all-zero pages (index extended but then a crash before initializing
the page?) and it doesn't like them.

(gdb) f 3
(gdb) p *(char[8192]*)(page)
$11 = '\000' <repeats 8191 times>

Presumably before this assert, such pages would just be permanently
orphaned.

Yeah, so it seems. It's normal to have all-zero pages in the index, if
you crash immediately after the relation has been extended, but before
the new page has been WAL-logged. What is your test case like; did you
do crash-testing?

ISTM ginvacuumcleanup should check for PageIsNew, and put the page in
the FSM. That's what btvacuumpage() and gistvacuumcleanup() do.
spgvacuumpage() also checks for PageIsNew(), but it seems broken in a
different way: it initializes the page and marks the buffer dirty, but
the change is not WAL-logged. That is a problem at least if checksums
are enabled: after a crash you might have a torn page on disk, with an
invalid checksum.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#1)
Re: All-zero page in GIN index causes assertion failure

On 07/20/2015 11:14 AM, Heikki Linnakangas wrote:

ISTM ginvacuumcleanup should check for PageIsNew, and put the page in
the FSM. That's what btvacuumpage() and gistvacuumcleanup() do.
spgvacuumpage() also checks for PageIsNew(), but it seems broken in a
different way: it initializes the page and marks the buffer dirty, but
the change is not WAL-logged. That is a problem at least if checksums
are enabled: after a crash you might have a torn page on disk, with an
invalid checksum.

Looking closer, heap vacuum does a similar thing, but there are two
mitigating factors that make it safe in practice, I think:

1. An empty heap page is all-zeroes except for the small page header in
the beginning of the page. For a torn page to matter, the page would
need to be torn in the header, but we rely elsewhere (control file)
anyway that a 512-byte sector update is atomic, so that shouldn't
happen. Note that this hinges on the fact that there is no special area
on heap pages, so you cannot rely on this for index pages.

2. The redo of the first insert/update on a heap page will always
re-initialize the page, even when full-page-writes are disabled. This is
the XLOG_HEAP_INIT_PAGE optimization. So it's not just an optimization,
it's required for correctness.

Heap update can also leave behind a page in the buffer cache that's been
initialized by RelationGetBufferForTuple but not yet WAL-logged.
However, it doesn't mark the buffer dirty, so the torn-page problem
cannot happen because the page will not be flushed to disk if nothing
else touches it. The XLOG_HEAP_INIT_PAGE optimization is needed in that
case too, however.

B-tree, GiST, and SP-GiST's relation extension work similarly, but they
have other mitigating factors. If a newly-initialized B-tree page is
left behind in the relation, it won't be reused for anything, and vacuum
will ignore it (by accident, I think; there is no explicit comment on
what will happen to such pages, but it will be treated like an internal
page and ignored). Eventually the buffer will be evicted from cache, and
because it's not marked as dirty, it will not be flushed to disk, and
will later be read back as all-zeros and vacuum will recycle it.

BRIN update is not quite right, however. brin_getinsertbuffer() can
initialize a page, but the caller might bail out without using the page
and WAL-logging the change. If that happens, the next update that uses
the same page will WAL-log the change but it will not use the
XLOG_BRIN_INIT_PAGE option, and redo will not initialize the page. Redo
will fail.

BTW, shouldn't there be a step in BRIN vacuum that scans all the BRIN
pages? If an empty page is missing from the FSM for any reason, nothing
will ever add it back there.

This is all very subtle. The whole business of leaving behind an
already-initialized page in the buffer cache, without marking the buffer
as dirty, is pretty ugly. I wish we had a more robust pattern to handle
all-zero pages and relation extension.

Thoughts? As a minimal backpatchable fix, I think we should add a check
in ginvacuumcleanup() to initialize any all-zero pages it encounters.
That needs to be WAL-logged, and WAL-logging needs to be added to the
page initialization in spgvacuumpage too.

- Heikki


#3 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Heikki Linnakangas (#2)
Re: All-zero page in GIN index causes assertion failure

Heikki Linnakangas wrote:

Looking closer, heap vacuum does a similar thing, but there are two
mitigating factors that make it safe in practice, I think:

1. An empty heap page is all-zeroes except for the small page header in the
beginning of the page. For a torn page to matter, the page would need to be
torn in the header, but we rely elsewhere (control file) anyway that a
512-byte sector update is atomic, so that shouldn't happen. Note that this
hinges on the fact that there is no special area on heap pages, so you
cannot rely on this for index pages.

2. The redo of the first insert/update on a heap page will always
re-initialize the page, even when full-page-writes are disabled. This is the
XLOG_HEAP_INIT_PAGE optimization. So it's not just an optimization, it's
required for correctness.

Hmm, I guess this is a case of an optimization hiding a bug :-( I mean,
if we didn't have that, we would have noticed this a lot sooner, I
imagine.

BRIN update is not quite right, however. brin_getinsertbuffer() can
initialize a page, but the caller might bail out without using the page and
WAL-logging the change. If that happens, the next update that uses the same
page will WAL-log the change but it will not use the XLOG_BRIN_INIT_PAGE
option, and redo will not initialize the page. Redo will fail.

Hmm, interesting.

BTW, shouldn't there be a step in BRIN vacuum that scans all the BRIN pages?
If an empty page is missing from the FSM for any reason, there's nothing to
add it there.

Probably, yeah, makes sense. If you recall, the original design I
proposed had a scan of the whole index (though for other reasons I am
thankful we got rid of that).

This is all very subtle.

No kidding ...

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#4 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Alvaro Herrera (#3)
1 attachment(s)
Re: All-zero page in GIN index causes assertion failure

On 07/20/2015 07:11 PM, Alvaro Herrera wrote:

Heikki Linnakangas wrote:

Looking closer, heap vacuum does a similar thing, but there are two
mitigating factors that make it safe in practice, I think:

1. An empty heap page is all-zeroes except for the small page header in the
beginning of the page. For a torn page to matter, the page would need to be
torn in the header, but we rely elsewhere (control file) anyway that a
512-byte sector update is atomic, so that shouldn't happen. Note that this
hinges on the fact that there is no special area on heap pages, so you
cannot rely on this for index pages.

2. The redo of the first insert/update on a heap page will always
re-initialize the page, even when full-page-writes are disabled. This is the
XLOG_HEAP_INIT_PAGE optimization. So it's not just an optimization, it's
required for correctness.

Hmm, I guess this is a case of an optimization hiding a bug :-( I mean,
if we didn't have that, we would have noticed this a lot sooner, I
imagine.

Yeah, probably.

I came up with the attached, for GIN and SP-GiST. Instead of WAL-logging
the page initialization in SP-GiST vacuum, I changed it so that it
simply leaves the page as all-zeroes and adds it to the FSM. The code
that grabs a target page from the FSM handles that, and initializes the
page anyway, so that was simpler. This made the SP-GiST is-deleted flag
obsolete: it's no longer set, although the code still checks for it for
backwards compatibility. (Even that may be unnecessary, as a page that's
marked as deleted must also be empty, and wherever we check for the
deleted flag, we also check for PageIsEmpty().)

- Heikki

Attachments:

fix-gin-spgist-vacuum-1.patch (application/x-patch)
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index eba572b..1315762 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -710,7 +710,7 @@ ginvacuumcleanup(PG_FUNCTION_ARGS)
 		LockBuffer(buffer, GIN_SHARE);
 		page = (Page) BufferGetPage(buffer);
 
-		if (GinPageIsDeleted(page))
+		if (PageIsNew(page) || GinPageIsDeleted(page))
 		{
 			Assert(blkno != GIN_ROOT_BLKNO);
 			RecordFreeIndexPage(index, blkno);
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index dc69d1e..36e93ad 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -621,14 +621,10 @@ spgvacuumpage(spgBulkDeleteState *bds, BlockNumber blkno)
 	{
 		/*
 		 * We found an all-zero page, which could happen if the database
-		 * crashed just after extending the file.  Initialize and recycle it.
+		 * crashed just after extending the file.  Recycle it.
 		 */
-		SpGistInitBuffer(buffer, 0);
-		SpGistPageSetDeleted(page);
-		/* We don't bother to WAL-log this action; easy to redo */
-		MarkBufferDirty(buffer);
 	}
-	else if (SpGistPageIsDeleted(page))
+	else if (PageIsEmpty(page))
 	{
 		/* nothing to do */
 	}
@@ -659,25 +655,18 @@ spgvacuumpage(spgBulkDeleteState *bds, BlockNumber blkno)
 	 */
 	if (!SpGistBlockIsRoot(blkno))
 	{
-		/* If page is now empty, mark it deleted */
-		if (PageIsEmpty(page) && !SpGistPageIsDeleted(page))
-		{
-			SpGistPageSetDeleted(page);
-			/* We don't bother to WAL-log this action; easy to redo */
-			MarkBufferDirty(buffer);
-		}
-
-		if (SpGistPageIsDeleted(page))
+		if (PageIsEmpty(page))
 		{
 			RecordFreeIndexPage(index, blkno);
 			bds->stats->pages_deleted++;
 		}
 		else
+		{
+			SpGistSetLastUsedPage(index, buffer);
 			bds->lastFilledBlock = blkno;
+		}
 	}
 
-	SpGistSetLastUsedPage(index, buffer);
-
 	UnlockReleaseBuffer(buffer);
 }
 
diff --git a/src/include/access/spgist_private.h b/src/include/access/spgist_private.h
index 413f71e..48dadd5 100644
--- a/src/include/access/spgist_private.h
+++ b/src/include/access/spgist_private.h
@@ -48,14 +48,14 @@ typedef SpGistPageOpaqueData *SpGistPageOpaque;
 
 /* Flag bits in page special space */
 #define SPGIST_META			(1<<0)
-#define SPGIST_DELETED		(1<<1)
+#define SPGIST_DELETED		(1<<1)		/* never set, but keep for backwards
+										 * compatibility */
 #define SPGIST_LEAF			(1<<2)
 #define SPGIST_NULLS		(1<<3)
 
 #define SpGistPageGetOpaque(page) ((SpGistPageOpaque) PageGetSpecialPointer(page))
 #define SpGistPageIsMeta(page) (SpGistPageGetOpaque(page)->flags & SPGIST_META)
 #define SpGistPageIsDeleted(page) (SpGistPageGetOpaque(page)->flags & SPGIST_DELETED)
-#define SpGistPageSetDeleted(page) (SpGistPageGetOpaque(page)->flags |= SPGIST_DELETED)
 #define SpGistPageIsLeaf(page) (SpGistPageGetOpaque(page)->flags & SPGIST_LEAF)
 #define SpGistPageStoresNulls(page) (SpGistPageGetOpaque(page)->flags & SPGIST_NULLS)
 
#5 Jeff Janes
jeff.janes@gmail.com
In reply to: Heikki Linnakangas (#4)
Re: All-zero page in GIN index causes assertion failure

On Mon, Jul 20, 2015 at 1:23 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I came up with the attached, for GIN and SP-GiST. Instead of WAL-logging
the page initialization in SP-GiST vacuum, I changed it so that it simply
leaves the page as all-zeroes and adds it to the FSM. The code that grabs
a target page from the FSM handles that, and initializes the page anyway,
so that was simpler. This made the SP-GiST is-deleted flag obsolete: it's
no longer set, although the code still checks for it for backwards
compatibility. (Even that may be unnecessary, as a page that's marked as
deleted must also be empty, and wherever we check for the deleted flag,
we also check for PageIsEmpty().)

This patch, in conjunction with the LWLock deadlock patch, fixes all the
issues I was having with GIN indexes in 9.5.

I haven't tried SP-GiST.

Cheers,

Jeff

#6 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Heikki Linnakangas (#2)
Re: All-zero page in GIN index causes assertion failure

Heikki Linnakangas wrote:

BRIN update is not quite right, however. brin_getinsertbuffer() can
initialize a page, but the caller might bail out without using the page and
WAL-logging the change. If that happens, the next update that uses the same
page will WAL-log the change but it will not use the XLOG_BRIN_INIT_PAGE
option, and redo will not initialize the page. Redo will fail.

There's an even worse case. In brin_getinsertbuffer, if an old buffer is
passed and the function extends the index, but then finds that the old
buffer has been used to extend the revmap, we don't even return the
newly extended page. We do enter it into the FSM, though, and the FSM is
vacuumed.

So the next time somebody requests a page from the FSM, it might return
that page; but since it was never initialized, the insertion will fail.
Everybody expects the page to have been initialized previously, but that
will not be the case here.

I guess that function needs some restructuring, but it's pretty hairy
already ...

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#7 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Jeff Janes (#5)
Re: All-zero page in GIN index causes assertion failure

On 07/23/2015 07:43 PM, Jeff Janes wrote:

On Mon, Jul 20, 2015 at 1:23 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I came up with the attached, for GIN and SP-GiST. Instead of WAL-logging
the page initialization in SP-GiST vacuum, I changed it so that it simply
leaves the page as all-zeroes and adds it to the FSM. The code that grabs
a target page from the FSM handles that, and initializes the page anyway,
so that was simpler. This made the SP-GiST is-deleted flag obsolete: it's
no longer set, although the code still checks for it for backwards
compatibility. (Even that may be unnecessary, as a page that's marked as
deleted must also be empty, and wherever we check for the deleted flag,
we also check for PageIsEmpty().)

This patch, in conjunction with the LWLock deadlock patch, fixes all the
issues I was having with GIN indexes in 9.5.

Thanks, I've pushed this, as well a fix to similar failure from B-tree
vacuum that Amit Langote reported in the other thread.

- Heikki


#8 Amit Langote
amitlangote09@gmail.com
In reply to: Heikki Linnakangas (#7)
Re: All-zero page in GIN index causes assertion failure

On Tue, Jul 28, 2015 at 12:21 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Thanks, I've pushed this, as well a fix to similar failure from B-tree
vacuum that Amit Langote reported in the other thread.

Thanks, Heikki, and sorry I didn't notice this new thread.

Regards,
Amit


#9 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Heikki Linnakangas (#2)
Re: All-zero page in GIN index causes assertion failure

Heikki Linnakangas wrote:

BRIN update is not quite right, however. brin_getinsertbuffer() can
initialize a page, but the caller might bail out without using the page and
WAL-logging the change. If that happens, the next update that uses the same
page will WAL-log the change but it will not use the XLOG_BRIN_INIT_PAGE
option, and redo will not initialize the page. Redo will fail.

Here's a fix for BRIN: when brin_getinsertbuffer extends the relation
and doesn't return the page, it initializes and WAL-logs the page as an
empty regular page before returning, and also records it in the FSM.
That way, some future process that gets the page from the FSM will use
it, preventing this type of problem from bloating the index forever.
Also, this function no longer initializes the page when it returns it to
the caller: instead the caller (brin_doinsert or brin_doupdate) must do
that. (Previously, the code initialized the page outside the critical
section that WAL-logs the action.)

BTW, shouldn't there be a step in BRIN vacuum that scans all the BRIN pages?
If an empty page is missing from the FSM for any reason, there's nothing to
add it there.

Probably. I didn't change this part yet. There are two things to fix:
1. Since we use log_newpage_buffer(), we log the initialization but not
the recording in the FSM, so the page could be forgotten about. Such
pages can be detected with PageIsEmpty(). An alternative to the vacuum
scan is to use our own WAL record that logs not only the initialization
itself but also the FSM update. Not sure this is worth the trouble.

2. Additionally, if brin_getinsertbuffer extends the relation but we
crash before the caller initializes the page, the page would be detected
by PageIsNew instead and would also need initialization.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#10 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#9)
1 attachment(s)
Re: All-zero page in GIN index causes assertion failure

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

brin-page-init.patch (text/x-diff; charset=us-ascii)
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 0b257d9..909e4c6 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -24,8 +24,9 @@
 
 
 static Buffer brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
-					 bool *was_extended);
+					 bool *extended);
 static Size br_page_get_freespace(Page page);
+static void initialize_empty_new_buffer(Relation idxrel, Buffer buffer);
 
 
 /*
@@ -53,7 +54,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	BrinTuple  *oldtup;
 	Size		oldsz;
 	Buffer		newbuf;
-	bool		extended = false;
+	bool		extended;
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -64,11 +65,11 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	{
 		/* need a page on which to put the item */
 		newbuf = brin_getinsertbuffer(idxrel, oldbuf, newsz, &extended);
-		/* XXX delay vacuuming FSM until locks are released? */
-		if (extended)
-			FreeSpaceMapVacuum(idxrel);
 		if (!BufferIsValid(newbuf))
+		{
+			Assert(!extended);
 			return false;
+		}
 
 		/*
 		 * Note: it's possible (though unlikely) that the returned newbuf is
@@ -76,7 +77,10 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		 * buffer does in fact have enough space.
 		 */
 		if (newbuf == oldbuf)
+		{
+			Assert(!extended);
 			newbuf = InvalidBuffer;
+		}
 	}
 	else
 	{
@@ -94,7 +98,13 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	{
 		LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
 		if (BufferIsValid(newbuf))
+		{
+			if (extended)
+				initialize_empty_new_buffer(idxrel, newbuf);
 			UnlockReleaseBuffer(newbuf);
+			if (extended)
+				FreeSpaceMapVacuum(idxrel);
+		}
 		return false;
 	}
 
@@ -106,9 +116,14 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	 */
 	if (!brin_tuples_equal(oldtup, oldsz, origtup, origsz))
 	{
-		LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
 		if (BufferIsValid(newbuf))
+		{
+			if (extended)
+				initialize_empty_new_buffer(idxrel, newbuf);
 			UnlockReleaseBuffer(newbuf);
+			if (extended)
+				FreeSpaceMapVacuum(idxrel);
+		}
 		return false;
 	}
 
@@ -125,7 +140,12 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
 		if (BufferIsValid(newbuf))
+		{
+			/* as above */
+			if (extended)
+				initialize_empty_new_buffer(idxrel, newbuf);
 			UnlockReleaseBuffer(newbuf);
+		}
 
 		START_CRIT_SECTION();
 		PageIndexDeleteNoCompact(oldpage, &oldoff, 1);
@@ -157,6 +177,10 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		END_CRIT_SECTION();
 
 		LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+
+		if (extended)
+			FreeSpaceMapVacuum(idxrel);
+
 		return true;
 	}
 	else if (newbuf == InvalidBuffer)
@@ -178,11 +202,21 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		Buffer		revmapbuf;
 		ItemPointerData newtid;
 		OffsetNumber newoff;
+		BlockNumber	newblk = InvalidBlockNumber;
+		Size		freespace = 0;
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
 		START_CRIT_SECTION();
 
+		/*
+		 * We need to initialize the page if it's newly obtained.  Note we will
+		 * WAL-log the initialization as part of the update, so we don't need
+		 * to do that here.
+		 */
+		if (extended)
+			brin_page_init(BufferGetPage(newbuf), BRIN_PAGETYPE_REGULAR);
+
 		PageIndexDeleteNoCompact(oldpage, &oldoff, 1);
 		newoff = PageAddItem(newpage, (Item) newtup, newsz,
 							 InvalidOffsetNumber, false, false);
@@ -191,6 +225,13 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(oldbuf);
 		MarkBufferDirty(newbuf);
 
+		/* needed to update FSM below */
+		if (extended)
+		{
+			newblk = BufferGetBlockNumber(newbuf);
+			freespace = br_page_get_freespace(newpage);
+		}
+
 		ItemPointerSet(&newtid, BufferGetBlockNumber(newbuf), newoff);
 		brinSetHeapBlockItemptr(revmapbuf, pagesPerRange, heapBlk, newtid);
 		MarkBufferDirty(revmapbuf);
@@ -235,6 +276,14 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		LockBuffer(revmapbuf, BUFFER_LOCK_UNLOCK);
 		LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
 		UnlockReleaseBuffer(newbuf);
+
+		if (extended)
+		{
+			Assert(BlockNumberIsValid(newblk));
+			RecordPageWithFreeSpace(idxrel, newblk, freespace);
+			FreeSpaceMapVacuum(idxrel);
+		}
+
 		return true;
 	}
 }
@@ -271,7 +320,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	OffsetNumber off;
 	Buffer		revmapbuf;
 	ItemPointerData tid;
-	bool		extended = false;
+	bool		extended;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -302,7 +351,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	{
 		*buffer = brin_getinsertbuffer(idxrel, InvalidBuffer, itemsz, &extended);
 		Assert(BufferIsValid(*buffer));
-		Assert(br_page_get_freespace(BufferGetPage(*buffer)) >= itemsz);
+		Assert(extended || br_page_get_freespace(BufferGetPage(*buffer)) >= itemsz);
 	}
 
 	/* Now obtain lock on revmap buffer */
@@ -312,6 +361,8 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	blk = BufferGetBlockNumber(*buffer);
 
 	START_CRIT_SECTION();
+	if (extended)
+		brin_page_init(BufferGetPage(*buffer), BRIN_PAGETYPE_REGULAR);
 	off = PageAddItem(page, (Item) tup, itemsz, InvalidOffsetNumber,
 					  false, false);
 	if (off == InvalidOffsetNumber)
@@ -494,22 +545,46 @@ brin_evacuate_page(Relation idxRel, BlockNumber pagesPerRange,
  * index item of size itemsz.  If oldbuf is a valid buffer, it is also locked
  * (in an order determined to avoid deadlocks.)
  *
- * If there's no existing page with enough free space to accommodate the new
- * item, the relation is extended.  If this happens, *extended is set to true.
- *
  * If we find that the old page is no longer a regular index page (because
  * of a revmap extension), the old buffer is unlocked and we return
  * InvalidBuffer.
+ *
+ * If there's no existing page with enough free space to accommodate the new
+ * item, the relation is extended.  If this happens, *extended is set to true,
+ * and it is the caller's responsibility to initialize the page (and WAL-log
+ * that fact) prior to use.
+ *
+ * Note that in some corner cases it is possible for this routine to extend the
+ * relation and then not return the buffer.  It is this routine's
+ * responsibility to WAL-log the page initialization and to record the page in
+ * FSM if that happens.  Such a buffer may later be reused by this routine.
  */
 static Buffer
 brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
-					 bool *was_extended)
+					 bool *extended)
 {
 	BlockNumber oldblk;
 	BlockNumber newblk;
 	Page		page;
 	int			freespace;
 
+	/*
+	 * If the item is oversized, don't bother.  We have another, more precise
+	 * check below.
+	 */
+	if (itemsz > BLCKSZ - sizeof(BrinSpecialSpace))
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+				 errmsg("index row size %lu exceeds maximum %lu for index \"%s\"",
+						(unsigned long) itemsz,
+						(unsigned long) BLCKSZ - sizeof(BrinSpecialSpace),
+						RelationGetRelationName(irel))));
+		return InvalidBuffer;		/* keep compiler quiet */
+	}
+
+	*extended = false;
+
 	if (BufferIsValid(oldbuf))
 		oldblk = BufferGetBlockNumber(oldbuf);
 	else
@@ -528,7 +603,6 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
 	{
 		Buffer		buf;
 		bool		extensionLockHeld = false;
-		bool		extended = false;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -546,7 +620,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
 			}
 			buf = ReadBuffer(irel, P_NEW);
 			newblk = BufferGetBlockNumber(buf);
-			*was_extended = extended = true;
+			*extended = true;
 
 			BRIN_elog((DEBUG2, "brin_getinsertbuffer: extending to page %u",
 					   BufferGetBlockNumber(buf)));
@@ -576,6 +650,21 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
 			if (!BRIN_IS_REGULAR_PAGE(BufferGetPage(oldbuf)))
 			{
 				LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+
+				/*
+				 * It is possible that the new page was obtained from extending
+				 * the relation.  In that case, we must be sure to record it in
+				 * the FSM before leaving, because otherwise the space would be
+				 * lost forever.  However, we cannot let an uninitialized page
+				 * get in the FSM, so we need to initialize it first.
+				 */
+				if (*extended)
+				{
+					initialize_empty_new_buffer(irel, buf);
+					/* shouldn't matter, but don't confuse caller */
+					*extended = false;
+				}
+
 				ReleaseBuffer(buf);
 				return InvalidBuffer;
 			}
@@ -588,9 +677,6 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
 
 		page = BufferGetPage(buf);
 
-		if (extended)
-			brin_page_init(page, BRIN_PAGETYPE_REGULAR);
-
 		/*
 		 * We have a new buffer to insert into.  Check that the new page has
 		 * enough free space, and return it if it does; otherwise start over.
@@ -600,7 +686,8 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
 		 * (br_page_get_freespace also checks that the FSM didn't hand us a
 		 * page that has since been repurposed for the revmap.)
 		 */
-		freespace = br_page_get_freespace(page);
+		freespace = *extended ?
+			BLCKSZ - sizeof(BrinSpecialSpace) : br_page_get_freespace(page);
 		if (freespace >= itemsz)
 		{
 			RelationSetTargetBlock(irel, BufferGetBlockNumber(buf));
@@ -610,7 +697,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
 			 * invalidations, make sure we update the more permanent FSM with
 			 * data about it before going away.
 			 */
-			if (extended)
+			if (*extended)
 				RecordPageWithFreeSpace(irel, BufferGetBlockNumber(buf),
 										freespace);
 
@@ -634,12 +721,13 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
 		/*
 		 * If an entirely new page does not contain enough free space for the
 		 * new item, then surely that item is oversized.  Complain loudly; but
-		 * first make sure we record the page as free, for next time.
+		 * first make sure we initialize the page and record it as free, for
+		 * next time.
 		 */
-		if (extended)
+		if (*extended)
 		{
-			RecordPageWithFreeSpace(irel, BufferGetBlockNumber(buf),
-									freespace);
+			initialize_empty_new_buffer(irel, buf);
+
 			ereport(ERROR,
 					(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
 			errmsg("index row size %lu exceeds maximum %lu for index \"%s\"",
@@ -659,6 +747,35 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
 }
 
 /*
+ * Initialize a page as an empty regular BRIN page, WAL-log this, and record
+ * the page in FSM.
+ *
+ * There are several corner situations in which we extend the relation to
+ * obtain a new page and later find that we cannot use it immediately.  When
+ * that happens, we don't want to let the page go unrecorded in FSM, because
+ * there is no mechanism to get the space back and the index would bloat.
+ * Also, because we would not WAL-log the action that would initialize the
+ * page, the page would go uninitialized in a standby (or after recovery).
+ */
+static void
+initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
+{
+	Page	page;
+
+	START_CRIT_SECTION();
+	page = BufferGetPage(buffer);
+	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
+	MarkBufferDirty(buffer);
+	/* FIXME use own WAL record, so that FSM update is replayed also? */
+	log_newpage_buffer(buffer, true);
+	END_CRIT_SECTION();
+
+	RecordPageWithFreeSpace(idxrel, BufferGetBlockNumber(buffer),
+							br_page_get_freespace(page));
+}
+
+
+/*
  * Return the amount of free space on a regular BRIN index page.
  *
  * If the page is not a regular page, or has been marked with the
#11 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#9)
1 attachment(s)
Re: All-zero page in GIN index causes assertion failure

Alvaro Herrera wrote:

BTW, shouldn't there be a step in BRIN vacuum that scans all the BRIN pages?
If an empty page is missing from the FSM for any reason, there's nothing to
add it there.

Probably. I didn't change this part yet. There are two things to fix:
1. Since we use log_newpage_buffer(), we log the initialization but not
the recording in the FSM, so the page could be forgotten about. Such
pages can be detected with PageIsEmpty(). An alternative to the vacuum
scan is to use our own WAL record that logs not only the initialization
itself but also the FSM update. Not sure this is worth the trouble.

2. Additionally, if brin_getinsertbuffer extends the relation but we
crash before the caller initializes the page, the page would be detected
by PageIsNew instead and would also need initialization.

Added this part. It's using log_newpage_buffer too. The vacuum scan
fixes the whole FSM, though, so after vacuum the FSM is up to date.
I think we could shave off a few bytes by using a separate WAL record,
but I'm not sure it's worth the trouble.

I intend to push this tomorrow.

I now think the free space calculations are broken, but I'll leave that
for later.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

brin-page-init-2.patchtext/x-diff; charset=us-asciiDownload
commit 7e75695cf90e2fdea09dc5b5a2071026736fa749
Author:     Alvaro Herrera <alvherre@alvh.no-ip.org>
AuthorDate: Thu Aug 6 16:20:29 2015 -0300
CommitDate: Tue Aug 11 22:55:46 2015 -0300

    Close some holes in BRIN page assignment
    
    In some corner cases, it is possible for the BRIN index relation to be
    extended by brin_getinsertbuffer but the new page not be used
    immediately for anything in brin_doupdate; when this happens, the page
    is initialized and the FSM is updated with the info about that page.  A
    later index insert/update can use the page, but since the page is
    already initialized, the initialization itself is not WAL-logged.  This
    causes recovery to fail altogether.
    
    There is another, related corner case inside brin_getinsertbuffer
    itself, in which we extend the relation to put a new index tuple there,
    but later find out that we cannot do so, and do not return the buffer;
    the page obtained from extension is not even initialized.  The resulting
    page is lost forever.
    
    To fix, shuffle the code so that initialization is not done in
    brin_getinsertbuffer at all, in normal cases; instead, the
    initialization is done by its callers (brin_doinsert and brin_doupdate)
    once they're certain that the page is going to be used.  When either
    those functions or brin_getinsertbuffer itself determines that the new
    page cannot be used, before bailing out they initialize the page as an
    empty regular page, enter it in FSM and WAL-log all this.  This way, the
    page is usable for future index insertions, and WAL replay doesn't end up
    trying to insert tuples into pages whose initialization never made it to
    the WAL.
    
    Additionally, add a new step to vacuuming so that all pages of the index
    are scanned; whenever an uninitialized page is found, it is initialized
    as empty and WAL-logged.  We also take this opportunity to update the
    FSM, in case it has gotten out of date (which is worthwhile because
    those operations are not WAL-logged).
    
    Thanks to Heikki Linnakangas for finding the problem that kicked off
    some additional analysis of BRIN page assignment.
---
 src/backend/access/brin/brin.c         |  43 ++++++
 src/backend/access/brin/brin_pageops.c | 232 +++++++++++++++++++++++++++++----
 src/include/access/brin_page.h         |   1 +
 src/include/access/brin_pageops.h      |   2 +
 4 files changed, 256 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index fc9f964..cbd40c6 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -68,6 +68,7 @@ static void brinsummarize(Relation index, Relation heapRel,
 static void form_and_insert_tuple(BrinBuildState *state);
 static void union_tuples(BrinDesc *bdesc, BrinMemTuple *a,
 			 BrinTuple *b);
+static void brin_vacuum_scan(Relation idxrel, BufferAccessStrategy strategy);
 
 
 /*
@@ -736,6 +737,8 @@ brinvacuumcleanup(PG_FUNCTION_ARGS)
 	heapRel = heap_open(IndexGetRelation(RelationGetRelid(info->index), false),
 						AccessShareLock);
 
+	brin_vacuum_scan(info->index, info->strategy);
+
 	brinsummarize(info->index, heapRel,
 				  &stats->num_index_tuples, &stats->num_index_tuples);
 
@@ -1150,3 +1153,43 @@ union_tuples(BrinDesc *bdesc, BrinMemTuple *a, BrinTuple *b)
 
 	MemoryContextDelete(cxt);
 }
+
+/*
+ * brin_vacuum_scan
+ *		Do a complete scan of the index during VACUUM.
+ *
+ * This routine scans the complete index looking for uncatalogued index pages,
+ * i.e. those that might have been lost due to a crash after index extension
+ * and such.
+ */
+static void
+brin_vacuum_scan(Relation idxrel, BufferAccessStrategy strategy)
+{
+	bool		vacuum_fsm = false;
+	BlockNumber	blkno;
+
+	/*
+	 * Scan the index in physical order, and clean up any possible mess in
+	 * each page.
+	 */
+	for (blkno = 0; blkno < RelationGetNumberOfBlocks(idxrel); blkno++)
+	{
+		Buffer	buf;
+
+		CHECK_FOR_INTERRUPTS();
+
+		buf = ReadBufferExtended(idxrel, MAIN_FORKNUM, blkno,
+								 RBM_NORMAL, strategy);
+
+		vacuum_fsm |= brin_page_cleanup(idxrel, buf);
+
+		ReleaseBuffer(buf);
+	}
+
+	/*
+	 * If we made any change to the FSM, make sure the new info is visible all
+	 * the way to the top.
+	 */
+	if (vacuum_fsm)
+		FreeSpaceMapVacuum(idxrel);
+}
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 0b257d9..c4c5c07 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -24,8 +24,9 @@
 
 
 static Buffer brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
-					 bool *was_extended);
+					 bool *extended);
 static Size br_page_get_freespace(Page page);
+static void brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer);
 
 
 /*
@@ -53,7 +54,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	BrinTuple  *oldtup;
 	Size		oldsz;
 	Buffer		newbuf;
-	bool		extended = false;
+	bool		extended;
 
 	Assert(newsz == MAXALIGN(newsz));
 
@@ -64,11 +65,11 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	{
 		/* need a page on which to put the item */
 		newbuf = brin_getinsertbuffer(idxrel, oldbuf, newsz, &extended);
-		/* XXX delay vacuuming FSM until locks are released? */
-		if (extended)
-			FreeSpaceMapVacuum(idxrel);
 		if (!BufferIsValid(newbuf))
+		{
+			Assert(!extended);
 			return false;
+		}
 
 		/*
 		 * Note: it's possible (though unlikely) that the returned newbuf is
@@ -76,7 +77,10 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		 * buffer does in fact have enough space.
 		 */
 		if (newbuf == oldbuf)
+		{
+			Assert(!extended);
 			newbuf = InvalidBuffer;
+		}
 	}
 	else
 	{
@@ -93,8 +97,19 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	if (!ItemIdIsNormal(oldlp))
 	{
 		LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+		/*
+		 * If this happens, and the new buffer was obtained by extending the
+		 * relation, then we need to ensure we don't leave it uninitialized
+		 * or forget about it.
+		 */
 		if (BufferIsValid(newbuf))
+		{
+			if (extended)
+				brin_initialize_empty_new_buffer(idxrel, newbuf);
 			UnlockReleaseBuffer(newbuf);
+			if (extended)
+				FreeSpaceMapVacuum(idxrel);
+		}
 		return false;
 	}
 
@@ -108,7 +123,13 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 	{
 		LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
 		if (BufferIsValid(newbuf))
+		{
+			if (extended)
+				brin_initialize_empty_new_buffer(idxrel, newbuf);
 			UnlockReleaseBuffer(newbuf);
+			if (extended)
+				FreeSpaceMapVacuum(idxrel);
+		}
 		return false;
 	}
 
@@ -125,7 +146,12 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		brin_can_do_samepage_update(oldbuf, origsz, newsz))
 	{
 		if (BufferIsValid(newbuf))
+		{
+			/* as above */
+			if (extended)
+				brin_initialize_empty_new_buffer(idxrel, newbuf);
 			UnlockReleaseBuffer(newbuf);
+		}
 
 		START_CRIT_SECTION();
 		PageIndexDeleteNoCompact(oldpage, &oldoff, 1);
@@ -157,6 +183,10 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		END_CRIT_SECTION();
 
 		LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+
+		if (extended)
+			FreeSpaceMapVacuum(idxrel);
+
 		return true;
 	}
 	else if (newbuf == InvalidBuffer)
@@ -178,11 +208,21 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		Buffer		revmapbuf;
 		ItemPointerData newtid;
 		OffsetNumber newoff;
+		BlockNumber	newblk = InvalidBlockNumber;
+		Size		freespace = 0;
 
 		revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
 
 		START_CRIT_SECTION();
 
+		/*
+		 * We need to initialize the page if it's newly obtained.  Note we will
+		 * WAL-log the initialization as part of the update, so we don't need
+		 * to do that here.
+		 */
+		if (extended)
+			brin_page_init(BufferGetPage(newbuf), BRIN_PAGETYPE_REGULAR);
+
 		PageIndexDeleteNoCompact(oldpage, &oldoff, 1);
 		newoff = PageAddItem(newpage, (Item) newtup, newsz,
 							 InvalidOffsetNumber, false, false);
@@ -191,6 +231,13 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		MarkBufferDirty(oldbuf);
 		MarkBufferDirty(newbuf);
 
+		/* needed to update FSM below */
+		if (extended)
+		{
+			newblk = BufferGetBlockNumber(newbuf);
+			freespace = br_page_get_freespace(newpage);
+		}
+
 		ItemPointerSet(&newtid, BufferGetBlockNumber(newbuf), newoff);
 		brinSetHeapBlockItemptr(revmapbuf, pagesPerRange, heapBlk, newtid);
 		MarkBufferDirty(revmapbuf);
@@ -235,6 +282,14 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		LockBuffer(revmapbuf, BUFFER_LOCK_UNLOCK);
 		LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
 		UnlockReleaseBuffer(newbuf);
+
+		if (extended)
+		{
+			Assert(BlockNumberIsValid(newblk));
+			RecordPageWithFreeSpace(idxrel, newblk, freespace);
+			FreeSpaceMapVacuum(idxrel);
+		}
+
 		return true;
 	}
 }
@@ -271,7 +326,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	OffsetNumber off;
 	Buffer		revmapbuf;
 	ItemPointerData tid;
-	bool		extended = false;
+	bool		extended;
 
 	Assert(itemsz == MAXALIGN(itemsz));
 
@@ -302,7 +357,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	{
 		*buffer = brin_getinsertbuffer(idxrel, InvalidBuffer, itemsz, &extended);
 		Assert(BufferIsValid(*buffer));
-		Assert(br_page_get_freespace(BufferGetPage(*buffer)) >= itemsz);
+		Assert(extended || br_page_get_freespace(BufferGetPage(*buffer)) >= itemsz);
 	}
 
 	/* Now obtain lock on revmap buffer */
@@ -312,6 +367,8 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	blk = BufferGetBlockNumber(*buffer);
 
 	START_CRIT_SECTION();
+	if (extended)
+		brin_page_init(BufferGetPage(*buffer), BRIN_PAGETYPE_REGULAR);
 	off = PageAddItem(page, (Item) tup, itemsz, InvalidOffsetNumber,
 					  false, false);
 	if (off == InvalidOffsetNumber)
@@ -490,26 +547,104 @@ brin_evacuate_page(Relation idxRel, BlockNumber pagesPerRange,
 }
 
 /*
+ * Given a BRIN index page, initialize it if necessary, and record it into the
+ * FSM if necessary.  Return value is true if the FSM itself needs "vacuuming".
+ * The main use for this is when, during vacuuming, an uninitialized page is
+ * found, which could be the result of relation extension followed by a crash
+ * before the page can be used.
+ */
+bool
+brin_page_cleanup(Relation idxrel, Buffer buf)
+{
+	Page	page = BufferGetPage(buf);
+	Size	freespace;
+
+	/*
+	 * If a page was left uninitialized, initialize it now; also record it
+	 * in FSM.
+	 *
+	 * Somebody else might be extending the relation concurrently.  To avoid
+	 * re-initializing the page before they can grab the buffer lock, we
+	 * acquire the extension lock momentarily.  Since they hold the extension
+	 * lock from before getting the page until after it's been initialized,
+	 * we're sure to see their initialization.
+	 */
+	if (PageIsNew(page))
+	{
+		LockRelationForExtension(idxrel, ShareLock);
+		UnlockRelationForExtension(idxrel, ShareLock);
+
+		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+		if (PageIsNew(page))
+		{
+			brin_initialize_empty_new_buffer(idxrel, buf);
+			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+			return true;
+		}
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+	}
+
+	/* Nothing to be done for non-regular index pages */
+	if (BRIN_IS_META_PAGE(BufferGetPage(buf)) ||
+		BRIN_IS_REVMAP_PAGE(BufferGetPage(buf)))
+		return false;
+
+	/* Measure free space and record it */
+	freespace = br_page_get_freespace(page);
+	if (freespace > GetRecordedFreeSpace(idxrel, BufferGetBlockNumber(buf)))
+	{
+		RecordPageWithFreeSpace(idxrel, BufferGetBlockNumber(buf), freespace);
+		return true;
+	}
+
+	return false;
+}
+
+/*
  * Return a pinned and exclusively locked buffer which can be used to insert an
  * index item of size itemsz.  If oldbuf is a valid buffer, it is also locked
  * (in an order determined to avoid deadlocks.)
  *
- * If there's no existing page with enough free space to accommodate the new
- * item, the relation is extended.  If this happens, *extended is set to true.
- *
  * If we find that the old page is no longer a regular index page (because
  * of a revmap extension), the old buffer is unlocked and we return
  * InvalidBuffer.
+ *
+ * If there's no existing page with enough free space to accommodate the new
+ * item, the relation is extended.  If this happens, *extended is set to true,
+ * and it is the caller's responsibility to initialize the page (and WAL-log
+ * that fact) prior to use.
+ *
+ * Note that in some corner cases it is possible for this routine to extend the
+ * relation and then not return the buffer.  It is this routine's
+ * responsibility to WAL-log the page initialization and to record the page in
+ * FSM if that happens.  Such a buffer may later be reused by this routine.
  */
 static Buffer
 brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
-					 bool *was_extended)
+					 bool *extended)
 {
 	BlockNumber oldblk;
 	BlockNumber newblk;
 	Page		page;
 	int			freespace;
 
+	/*
+	 * If the item is oversized, don't bother.  We have another, more precise
+	 * check below.
+	 */
+	if (itemsz > BLCKSZ - sizeof(BrinSpecialSpace))
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+				 errmsg("index row size %lu exceeds maximum %lu for index \"%s\"",
+						(unsigned long) itemsz,
+						(unsigned long) BLCKSZ - sizeof(BrinSpecialSpace),
+						RelationGetRelationName(irel))));
+		return InvalidBuffer;		/* keep compiler quiet */
+	}
+
+	*extended = false;
+
 	if (BufferIsValid(oldbuf))
 		oldblk = BufferGetBlockNumber(oldbuf);
 	else
@@ -528,7 +663,6 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
 	{
 		Buffer		buf;
 		bool		extensionLockHeld = false;
-		bool		extended = false;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -546,7 +680,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
 			}
 			buf = ReadBuffer(irel, P_NEW);
 			newblk = BufferGetBlockNumber(buf);
-			*was_extended = extended = true;
+			*extended = true;
 
 			BRIN_elog((DEBUG2, "brin_getinsertbuffer: extending to page %u",
 					   BufferGetBlockNumber(buf)));
@@ -576,6 +710,24 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
 			if (!BRIN_IS_REGULAR_PAGE(BufferGetPage(oldbuf)))
 			{
 				LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+
+				/*
+				 * It is possible that the new page was obtained from extending
+				 * the relation.  In that case, we must be sure to record it in
+				 * the FSM before leaving, because otherwise the space would be
+				 * lost forever.  However, we cannot let an uninitialized page
+				 * get in the FSM, so we need to initialize it first.
+				 */
+				if (*extended)
+				{
+					brin_initialize_empty_new_buffer(irel, buf);
+					/* shouldn't matter, but don't confuse caller */
+					*extended = false;
+				}
+
+				if (extensionLockHeld)
+					UnlockRelationForExtension(irel, ExclusiveLock);
+
 				ReleaseBuffer(buf);
 				return InvalidBuffer;
 			}
@@ -588,9 +740,6 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
 
 		page = BufferGetPage(buf);
 
-		if (extended)
-			brin_page_init(page, BRIN_PAGETYPE_REGULAR);
-
 		/*
 		 * We have a new buffer to insert into.  Check that the new page has
 		 * enough free space, and return it if it does; otherwise start over.
@@ -600,7 +749,8 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
 		 * (br_page_get_freespace also checks that the FSM didn't hand us a
 		 * page that has since been repurposed for the revmap.)
 		 */
-		freespace = br_page_get_freespace(page);
+		freespace = *extended ?
+			BLCKSZ - sizeof(BrinSpecialSpace) : br_page_get_freespace(page);
 		if (freespace >= itemsz)
 		{
 			RelationSetTargetBlock(irel, BufferGetBlockNumber(buf));
@@ -610,7 +760,7 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
 			 * invalidations, make sure we update the more permanent FSM with
 			 * data about it before going away.
 			 */
-			if (extended)
+			if (*extended)
 				RecordPageWithFreeSpace(irel, BufferGetBlockNumber(buf),
 										freespace);
 
@@ -634,12 +784,13 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
 		/*
 		 * If an entirely new page does not contain enough free space for the
 		 * new item, then surely that item is oversized.  Complain loudly; but
-		 * first make sure we record the page as free, for next time.
+		 * first make sure we initialize the page and record it as free, for
+		 * next time.
 		 */
-		if (extended)
+		if (*extended)
 		{
-			RecordPageWithFreeSpace(irel, BufferGetBlockNumber(buf),
-									freespace);
+			brin_initialize_empty_new_buffer(irel, buf);
+
 			ereport(ERROR,
 					(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
 			errmsg("index row size %lu exceeds maximum %lu for index \"%s\"",
@@ -659,6 +810,43 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
 }
 
 /*
+ * Initialize a page as an empty regular BRIN page, WAL-log this, and record
+ * the page in FSM.
+ *
+ * There are several corner situations in which we extend the relation to
+ * obtain a new page and later find that we cannot use it immediately.  When
+ * that happens, we don't want to let the page go unrecorded in FSM, because
+ * there is no mechanism to get the space back and the index would bloat.
+ * Also, because we would not WAL-log the action that would initialize the
+ * page, the page would go uninitialized in a standby (or after recovery).
+ */
+static void
+brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
+{
+	Page	page;
+
+	BRIN_elog((DEBUG2,
+			   "brin_initialize_empty_new_buffer: initializing blank page %u",
+			   BufferGetBlockNumber(buffer)));
+
+	START_CRIT_SECTION();
+	page = BufferGetPage(buffer);
+	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
+	MarkBufferDirty(buffer);
+	log_newpage_buffer(buffer, true);
+	END_CRIT_SECTION();
+
+	/*
+	 * We update the FSM for this page, but this is not WAL-logged.  This is
+	 * acceptable because VACUUM will scan the index and update the FSM with
+	 * pages whose FSM records were forgotten in a crash.
+	 */
+	RecordPageWithFreeSpace(idxrel, BufferGetBlockNumber(buffer),
+							br_page_get_freespace(page));
+}
+
+
+/*
  * Return the amount of free space on a regular BRIN index page.
  *
  * If the page is not a regular page, or has been marked with the
diff --git a/src/include/access/brin_page.h b/src/include/access/brin_page.h
index ecbd13a..c827ca3 100644
--- a/src/include/access/brin_page.h
+++ b/src/include/access/brin_page.h
@@ -52,6 +52,7 @@ typedef struct BrinSpecialSpace
 #define		BRIN_PAGETYPE_REVMAP		0xF092
 #define		BRIN_PAGETYPE_REGULAR		0xF093
 
+#define BRIN_IS_META_PAGE(page) (BrinPageType(page) == BRIN_PAGETYPE_META)
 #define BRIN_IS_REVMAP_PAGE(page) (BrinPageType(page) == BRIN_PAGETYPE_REVMAP)
 #define BRIN_IS_REGULAR_PAGE(page) (BrinPageType(page) == BRIN_PAGETYPE_REGULAR)
 
diff --git a/src/include/access/brin_pageops.h b/src/include/access/brin_pageops.h
index 6ebe8ef..1007e07 100644
--- a/src/include/access/brin_pageops.h
+++ b/src/include/access/brin_pageops.h
@@ -33,4 +33,6 @@ extern bool brin_start_evacuating_page(Relation idxRel, Buffer buf);
 extern void brin_evacuate_page(Relation idxRel, BlockNumber pagesPerRange,
 				   BrinRevmap *revmap, Buffer buf);
 
+extern bool brin_page_cleanup(Relation idxrel, Buffer buf);
+
 #endif   /* BRIN_PAGEOPS_H */