Should buffers of the initialization fork have the BM_PERMANENT flag?
An unlogged table has an initialization fork. The initialization fork does
not have the BM_PERMANENT flag set when a buffer is obtained for it.
During a checkpoint (other than a shutdown or end-of-recovery checkpoint),
such a buffer is therefore not written to disk. After crash recovery, the
pages of the initialization fork are not correct, which in turn makes the
main fork incorrect.
Here is an example for GIN index.
create unlogged table gin_test_tbl(i int4[]);
create index gin_test_idx on gin_test_tbl using gin (i);
checkpoint;
kill all the postgres processes, and restart.
vacuum gin_test_tbl; -- crash.
The same problem seems to exist for BRIN, GIN, GiST and HASH indexes,
which use a buffer to initialize the meta page in their ambuildempty
function.
(Adding Robert in CC.)
On Thu, Jan 26, 2017 at 4:34 AM, Wang Hao <whberet@gmail.com> wrote:
An unlogged table has an initialization fork. The initialization fork does
not have the BM_PERMANENT flag set when a buffer is obtained for it.
During a checkpoint (other than a shutdown or end-of-recovery checkpoint),
such a buffer is therefore not written to disk. After crash recovery, the
pages of the initialization fork are not correct, which in turn makes the
main fork incorrect.
For init forks the flush absolutely needs to happen, so that's really
not good. We ought to fix BufferAlloc() appropriately here.
Here is an example for GIN index.
create unlogged table gin_test_tbl(i int4[]);
create index gin_test_idx on gin_test_tbl using gin (i);
checkpoint;
kill all the postgres processes, and restart.
vacuum gin_test_tbl; -- crash.
The same problem seems to exist for BRIN, GIN, GiST and HASH indexes,
which use a buffer to initialize the meta page in their ambuildempty
function.
Yeah, the other index AMs deal directly with syncing the page; that's
why there is no issue for them.
So the attached patch fixes the problem by changing BufferAlloc() in
such a way that initialization-fork buffers are always written to disk,
which is what you are suggesting. As a simple fix for back-branches
that's enough, though on HEAD I think that we should really rework the
empty() routines so that the writes go through shared buffers first;
that seems more solid than relying on the smgr routines to do this
work. Robert, what do you think?
--
Michael
Attachments:
unlogged-flush-fix.patch
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3cb51204dc..66fe9ea529 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1302,6 +1302,14 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
else
buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
+ /*
+ * Initialization forks of unlogged tables need to have their dirty
+ * buffers written permanently to survive in case of a crash if the
+ * redo point is moved past the WAL-logging of the page itself.
+ */
+ if (forkNum == INIT_FORKNUM)
+ buf_state |= BM_PERMANENT;
+
UnlockBufHdr(buf, buf_state);
if (oldPartitionLock != NULL)
On Thu, Jan 26, 2017 at 9:14 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
So the attached patch fixes the problem by changing BufferAlloc() in
such a way that initialization-fork buffers are always written to disk,
which is what you are suggesting. As a simple fix for back-branches
that's enough, though on HEAD I think that we should really rework the
empty() routines so that the writes go through shared buffers first;
that seems more solid than relying on the smgr routines to do this
work. Robert, what do you think?
Attached is what I have in mind for HEAD. btree, gist, spgist and
bloom indexes are changed so that the init forks they create go through
shared buffers instead of having their empty() routines handle the
flush of the pages created. This removes any kind of race condition
between the checkpointer and the init fork creation, which I think is
a good thing.
Here are the tests I have done.
First, run these commands to create all types of indexes.
create extension bloom;
create extension btree_gist;
create extension btree_gin;
create unlogged table foo (a int);
create index foo_bt on foo(a);
create index foo_bloom on foo using bloom(a);
create index foo_gin on foo using gin (a);
create index foo_gist on foo using gist (a);
create index foo_brin on foo using brin (a);
create unlogged table foo_geo (a box);
create index foo_spgist ON foo_geo using spgist(a);
checkpoint;
Then crash the server, restart it, and the following vacuums are able
to complete.
vacuum foo;
vacuum foo_geo;
I have as well created a CF entry for this set of patches:
https://commitfest.postgresql.org/13/971/
Thanks,
--
Michael
Attachments:
unlogged-flush-fix-head.patch
diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index 29bc5cea79..d04298add5 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -158,31 +158,24 @@ blbuild(Relation heap, Relation index, IndexInfo *indexInfo)
void
blbuildempty(Relation index)
{
- Page metapage;
+ Buffer MetaBuffer;
- /* Construct metapage. */
- metapage = (Page) palloc(BLCKSZ);
- BloomFillMetapage(index, metapage);
+ /* An empty bloom index has one meta page */
+ MetaBuffer =
+ ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
+ LockBuffer(MetaBuffer, BUFFER_LOCK_EXCLUSIVE);
- /*
- * Write the page and log it. It might seem that an immediate sync
- * would be sufficient to guarantee that the file exists on disk, but
- * recovery itself might remove it while replaying, for example, an
- * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record. Therefore, we
- * need this even when wal_level=minimal.
- */
- PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
- smgrwrite(index->rd_smgr, INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
- (char *) metapage, true);
- log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
- BLOOM_METAPAGE_BLKNO, metapage, false);
+ /* Initialize the meta buffer */
+ BloomFillMetapage(index, BufferGetPage(MetaBuffer));
- /*
- * An immediate sync is required even if we xlog'd the page, because the
- * write did not go through shared_buffers and therefore a concurrent
- * checkpoint may have moved the redo pointer past our xlog record.
- */
- smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+ /* And log it */
+ START_CRIT_SECTION();
+ MarkBufferDirty(MetaBuffer);
+ log_newpage_buffer(MetaBuffer, false);
+ END_CRIT_SECTION();
+
+ /* Unlock and release the buffer */
+ UnlockReleaseBuffer(MetaBuffer);
}
/*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 1bb1acfea6..215d3c6c02 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,7 @@
#include "access/xlog.h"
#include "catalog/index.h"
#include "commands/vacuum.h"
+#include "miscadmin.h"
#include "storage/indexfsm.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
@@ -237,31 +238,22 @@ btbuildCallback(Relation index,
void
btbuildempty(Relation index)
{
- Page metapage;
-
- /* Construct metapage. */
- metapage = (Page) palloc(BLCKSZ);
- _bt_initmetapage(metapage, P_NONE, 0);
-
- /*
- * Write the page and log it. It might seem that an immediate sync
- * would be sufficient to guarantee that the file exists on disk, but
- * recovery itself might remove it while replaying, for example, an
- * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record. Therefore, we
- * need this even when wal_level=minimal.
- */
- PageSetChecksumInplace(metapage, BTREE_METAPAGE);
- smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
- (char *) metapage, true);
- log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
- BTREE_METAPAGE, metapage, false);
-
- /*
- * An immediate sync is required even if we xlog'd the page, because the
- * write did not go through shared_buffers and therefore a concurrent
- * checkpoint may have moved the redo pointer past our xlog record.
- */
- smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+ Buffer metabuffer;
+
+ /* An empty btree index has one meta page */
+ metabuffer =
+ ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
+ LockBuffer(metabuffer, BUFFER_LOCK_EXCLUSIVE);
+
+ /* Initialize and xlog meta buffer */
+ START_CRIT_SECTION();
+ _bt_initmetapage(BufferGetPage(metabuffer), P_NONE, 0);
+ MarkBufferDirty(metabuffer);
+ log_newpage_buffer(metabuffer, false);
+ END_CRIT_SECTION();
+
+ /* Unlock and release the buffer */
+ UnlockReleaseBuffer(metabuffer);
}
/*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index b42f4b7139..7f2997c92a 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -155,49 +155,50 @@ spgbuild(Relation heap, Relation index, IndexInfo *indexInfo)
void
spgbuildempty(Relation index)
{
- Page page;
-
- /* Construct metapage. */
- page = (Page) palloc(BLCKSZ);
- SpGistInitMetapage(page);
+ Buffer MetaBuffer,
+ RootBuffer,
+ TupleBuffer;
/*
- * Write the page and log it unconditionally. This is important
- * particularly for indexes created on tablespaces and databases
- * whose creation happened after the last redo pointer as recovery
- * removes any of their existing content when the corresponding
- * create records are replayed.
+ * An empty SPGist index has three pages:
+ * - one meta page.
+ * - one root page.
+ * - one null-tuple root page.
*/
- PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO);
- smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
- (char *) page, true);
- log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
- SPGIST_METAPAGE_BLKNO, page, false);
-
- /* Likewise for the root page. */
- SpGistInitPage(page, SPGIST_LEAF);
-
- PageSetChecksumInplace(page, SPGIST_ROOT_BLKNO);
- smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_ROOT_BLKNO,
- (char *) page, true);
- log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
- SPGIST_ROOT_BLKNO, page, true);
-
- /* Likewise for the null-tuples root page. */
- SpGistInitPage(page, SPGIST_LEAF | SPGIST_NULLS);
-
- PageSetChecksumInplace(page, SPGIST_NULL_BLKNO);
- smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_NULL_BLKNO,
- (char *) page, true);
- log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
- SPGIST_NULL_BLKNO, page, true);
+ MetaBuffer =
+ ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
+ LockBuffer(MetaBuffer, BUFFER_LOCK_EXCLUSIVE);
+ RootBuffer =
+ ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
+ LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
+ TupleBuffer =
+ ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
+ LockBuffer(TupleBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+ /* Initialize and log all the pages */
+ START_CRIT_SECTION();
- /*
- * An immediate sync is required even if we xlog'd the pages, because the
- * writes did not go through shared buffers and therefore a concurrent
- * checkpoint may have moved the redo pointer past our xlog record.
- */
- smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+ /* Construct and log the meta page */
+ SpGistInitMetapage(BufferGetPage(MetaBuffer));
+ MarkBufferDirty(MetaBuffer);
+ log_newpage_buffer(MetaBuffer, false);
+
+ /* root page */
+ SpGistInitPage(BufferGetPage(RootBuffer), SPGIST_LEAF);
+ MarkBufferDirty(RootBuffer);
+ log_newpage_buffer(RootBuffer, false);
+
+ /* null-tuples root page. */
+ SpGistInitPage(BufferGetPage(TupleBuffer), SPGIST_LEAF | SPGIST_NULLS);
+ MarkBufferDirty(TupleBuffer);
+ log_newpage_buffer(TupleBuffer, false);
+
+ END_CRIT_SECTION();
+
+ /* Unlock and release the buffers. */
+ UnlockReleaseBuffer(MetaBuffer);
+ UnlockReleaseBuffer(RootBuffer);
+ UnlockReleaseBuffer(TupleBuffer);
}
/*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3cb51204dc..66fe9ea529 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1302,6 +1302,14 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
else
buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
+ /*
+ * Initialization forks of unlogged tables need to have their dirty
+ * buffers written permanently to survive in case of a crash if the
+ * redo point is moved past the WAL-logging of the page itself.
+ */
+ if (forkNum == INIT_FORKNUM)
+ buf_state |= BM_PERMANENT;
+
UnlockBufHdr(buf, buf_state);
if (oldPartitionLock != NULL)
Hello,
I wanted to review the patch, but it applies with errors. I've
rebased my local copy and done the review on that. I'm not sure whether
it is proper for a reviewer to send a rebased patch, so I haven't sent
it, to avoid confusion.
On 29.01.2017 17:00, Michael Paquier wrote:
Attached is what I have in mind for HEAD. btree, gist, spgist and
bloom indexes are changed so that the init forks they create go through
shared buffers instead of having their empty() routines handle the
flush of the pages created. This removes any kind of race condition
between the checkpointer and the init fork creation, which I think is
a good thing.
I think these are good fixes. I've checked them, and in my opinion
they are correct. The code is also good.
Here are the tests I have done.
First, run these commands to create all types of indexes.
create extension bloom;
create extension btree_gist;
create extension btree_gin;
create unlogged table foo (a int);
create index foo_bt on foo(a);
create index foo_bloom on foo using bloom(a);
create index foo_gin on foo using gin (a);
create index foo_gist on foo using gist (a);
create index foo_brin on foo using brin (a);
create unlogged table foo_geo (a box);
create index foo_spgist ON foo_geo using spgist(a);
checkpoint;
Then crash the server, restart it, and the following vacuums are able
to complete.
vacuum foo;
vacuum foo_geo;
I've done these tests. Before the patch, the server crashes on the
vacuum commands; after applying the patch, it no longer does.
I have run the regression and TAP tests. They all passed without error.
I think the patch can be marked as "Ready for Committer" after a rebase.
--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Mar 9, 2017 at 10:25 PM, Artur Zakirov <a.zakirov@postgrespro.ru> wrote:
I think these are good fixes. I've checked them, and in my opinion
they are correct. The code is also good.
Having something with conflicts is not nice, so attached is a rebased version.
I have run regression and TAP tests. They all passed without error.
I think the patch can be marked as "Ready for Committer" after rebase.
Thanks for the review.
--
Michael
Attachments:
unlogged-flush-fix-head-v2.patch
diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index 913f1f8a51..3557b106d8 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -158,31 +158,24 @@ blbuild(Relation heap, Relation index, IndexInfo *indexInfo)
void
blbuildempty(Relation index)
{
- Page metapage;
+ Buffer MetaBuffer;
- /* Construct metapage. */
- metapage = (Page) palloc(BLCKSZ);
- BloomFillMetapage(index, metapage);
+ /* An empty bloom index has one meta page */
+ MetaBuffer =
+ ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
+ LockBuffer(MetaBuffer, BUFFER_LOCK_EXCLUSIVE);
- /*
- * Write the page and log it. It might seem that an immediate sync
- * would be sufficient to guarantee that the file exists on disk, but
- * recovery itself might remove it while replaying, for example, an
- * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record. Therefore, we
- * need this even when wal_level=minimal.
- */
- PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
- smgrwrite(index->rd_smgr, INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
- (char *) metapage, true);
- log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
- BLOOM_METAPAGE_BLKNO, metapage, false);
+ /* Initialize the meta buffer */
+ BloomFillMetapage(index, BufferGetPage(MetaBuffer));
- /*
- * An immediate sync is required even if we xlog'd the page, because the
- * write did not go through shared_buffers and therefore a concurrent
- * checkpoint may have moved the redo pointer past our xlog record.
- */
- smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+ /* And log it */
+ START_CRIT_SECTION();
+ MarkBufferDirty(MetaBuffer);
+ log_newpage_buffer(MetaBuffer, false);
+ END_CRIT_SECTION();
+
+ /* Unlock and release the buffer */
+ UnlockReleaseBuffer(MetaBuffer);
}
/*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 775f2ff1f8..e1801ea939 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,7 @@
#include "access/xlog.h"
#include "catalog/index.h"
#include "commands/vacuum.h"
+#include "miscadmin.h"
#include "pgstat.h"
#include "storage/condition_variable.h"
#include "storage/indexfsm.h"
@@ -282,31 +283,22 @@ btbuildCallback(Relation index,
void
btbuildempty(Relation index)
{
- Page metapage;
-
- /* Construct metapage. */
- metapage = (Page) palloc(BLCKSZ);
- _bt_initmetapage(metapage, P_NONE, 0);
-
- /*
- * Write the page and log it. It might seem that an immediate sync
- * would be sufficient to guarantee that the file exists on disk, but
- * recovery itself might remove it while replaying, for example, an
- * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record. Therefore, we
- * need this even when wal_level=minimal.
- */
- PageSetChecksumInplace(metapage, BTREE_METAPAGE);
- smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
- (char *) metapage, true);
- log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
- BTREE_METAPAGE, metapage, false);
-
- /*
- * An immediate sync is required even if we xlog'd the page, because the
- * write did not go through shared_buffers and therefore a concurrent
- * checkpoint may have moved the redo pointer past our xlog record.
- */
- smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+ Buffer metabuffer;
+
+ /* An empty btree index has one meta page */
+ metabuffer =
+ ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
+ LockBuffer(metabuffer, BUFFER_LOCK_EXCLUSIVE);
+
+ /* Initialize and xlog meta buffer */
+ START_CRIT_SECTION();
+ _bt_initmetapage(BufferGetPage(metabuffer), P_NONE, 0);
+ MarkBufferDirty(metabuffer);
+ log_newpage_buffer(metabuffer, false);
+ END_CRIT_SECTION();
+
+ /* Unlock and release the buffer */
+ UnlockReleaseBuffer(metabuffer);
}
/*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 00a0ab4438..4252c2eb53 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -156,49 +156,50 @@ spgbuild(Relation heap, Relation index, IndexInfo *indexInfo)
void
spgbuildempty(Relation index)
{
- Page page;
-
- /* Construct metapage. */
- page = (Page) palloc(BLCKSZ);
- SpGistInitMetapage(page);
+ Buffer MetaBuffer,
+ RootBuffer,
+ TupleBuffer;
/*
- * Write the page and log it unconditionally. This is important
- * particularly for indexes created on tablespaces and databases
- * whose creation happened after the last redo pointer as recovery
- * removes any of their existing content when the corresponding
- * create records are replayed.
+ * An empty SPGist index has three pages:
+ * - one meta page.
+ * - one root page.
+ * - one null-tuple root page.
*/
- PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO);
- smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
- (char *) page, true);
- log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
- SPGIST_METAPAGE_BLKNO, page, false);
-
- /* Likewise for the root page. */
- SpGistInitPage(page, SPGIST_LEAF);
-
- PageSetChecksumInplace(page, SPGIST_ROOT_BLKNO);
- smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_ROOT_BLKNO,
- (char *) page, true);
- log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
- SPGIST_ROOT_BLKNO, page, true);
-
- /* Likewise for the null-tuples root page. */
- SpGistInitPage(page, SPGIST_LEAF | SPGIST_NULLS);
-
- PageSetChecksumInplace(page, SPGIST_NULL_BLKNO);
- smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_NULL_BLKNO,
- (char *) page, true);
- log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
- SPGIST_NULL_BLKNO, page, true);
+ MetaBuffer =
+ ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
+ LockBuffer(MetaBuffer, BUFFER_LOCK_EXCLUSIVE);
+ RootBuffer =
+ ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
+ LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
+ TupleBuffer =
+ ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
+ LockBuffer(TupleBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+ /* Initialize and log all the pages */
+ START_CRIT_SECTION();
- /*
- * An immediate sync is required even if we xlog'd the pages, because the
- * writes did not go through shared buffers and therefore a concurrent
- * checkpoint may have moved the redo pointer past our xlog record.
- */
- smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+ /* Construct and log the meta page */
+ SpGistInitMetapage(BufferGetPage(MetaBuffer));
+ MarkBufferDirty(MetaBuffer);
+ log_newpage_buffer(MetaBuffer, false);
+
+ /* root page */
+ SpGistInitPage(BufferGetPage(RootBuffer), SPGIST_LEAF);
+ MarkBufferDirty(RootBuffer);
+ log_newpage_buffer(RootBuffer, false);
+
+ /* null-tuples root page. */
+ SpGistInitPage(BufferGetPage(TupleBuffer), SPGIST_LEAF | SPGIST_NULLS);
+ MarkBufferDirty(TupleBuffer);
+ log_newpage_buffer(TupleBuffer, false);
+
+ END_CRIT_SECTION();
+
+ /* Unlock and release the buffers. */
+ UnlockReleaseBuffer(MetaBuffer);
+ UnlockReleaseBuffer(RootBuffer);
+ UnlockReleaseBuffer(TupleBuffer);
}
/*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3cb51204dc..66fe9ea529 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1302,6 +1302,14 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
else
buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
+ /*
+ * Initialization forks of unlogged tables need to have their dirty
+ * buffers written permanently to survive in case of a crash if the
+ * redo point is moved past the WAL-logging of the page itself.
+ */
+ if (forkNum == INIT_FORKNUM)
+ buf_state |= BM_PERMANENT;
+
UnlockBufHdr(buf, buf_state);
if (oldPartitionLock != NULL)
On 10.03.2017 04:00, Michael Paquier wrote:
On Thu, Mar 9, 2017 at 10:25 PM, Artur Zakirov <a.zakirov@postgrespro.ru> wrote:
I think these are good fixes. I've checked them, and in my opinion
they are correct. The code is also good.
Having something with conflicts is not nice, so attached is a rebased version.
Thank you!
I've rerun the regression and TAP tests. They all passed.
Also, it might be good to fix some comments.
In buf_internals.h:
#define BM_PERMANENT (1U << 31) /* permanent relation (not
* unlogged) */
And in FlushBuffer():
/*
* Force XLOG flush up to buffer's LSN. This implements the basic WAL
* rule that log updates must hit disk before any of the data-file changes
* they describe do.
*
* However, this rule does not apply to unlogged relations, which will be
* lost after a crash anyway. Most unlogged relation pages do not bear
Because BM_PERMANENT is now also used for init forks of unlogged indexes.
--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
On Sat, Mar 11, 2017 at 12:03 AM, Artur Zakirov
<a.zakirov@postgrespro.ru> wrote:
Because BM_PERMANENT is used for init forks of unlogged indexes now.
Yes, indeed.
--
Michael
Attachments:
unlogged-flush-fix-head-v3.patch
diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index 913f1f8a51..3557b106d8 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -158,31 +158,24 @@ blbuild(Relation heap, Relation index, IndexInfo *indexInfo)
void
blbuildempty(Relation index)
{
- Page metapage;
+ Buffer MetaBuffer;
- /* Construct metapage. */
- metapage = (Page) palloc(BLCKSZ);
- BloomFillMetapage(index, metapage);
+ /* An empty bloom index has one meta page */
+ MetaBuffer =
+ ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
+ LockBuffer(MetaBuffer, BUFFER_LOCK_EXCLUSIVE);
- /*
- * Write the page and log it. It might seem that an immediate sync
- * would be sufficient to guarantee that the file exists on disk, but
- * recovery itself might remove it while replaying, for example, an
- * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record. Therefore, we
- * need this even when wal_level=minimal.
- */
- PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
- smgrwrite(index->rd_smgr, INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
- (char *) metapage, true);
- log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
- BLOOM_METAPAGE_BLKNO, metapage, false);
+ /* Initialize the meta buffer */
+ BloomFillMetapage(index, BufferGetPage(MetaBuffer));
- /*
- * An immediate sync is required even if we xlog'd the page, because the
- * write did not go through shared_buffers and therefore a concurrent
- * checkpoint may have moved the redo pointer past our xlog record.
- */
- smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+ /* And log it */
+ START_CRIT_SECTION();
+ MarkBufferDirty(MetaBuffer);
+ log_newpage_buffer(MetaBuffer, false);
+ END_CRIT_SECTION();
+
+ /* Unlock and release the buffer */
+ UnlockReleaseBuffer(MetaBuffer);
}
/*
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 775f2ff1f8..e1801ea939 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,7 @@
#include "access/xlog.h"
#include "catalog/index.h"
#include "commands/vacuum.h"
+#include "miscadmin.h"
#include "pgstat.h"
#include "storage/condition_variable.h"
#include "storage/indexfsm.h"
@@ -282,31 +283,22 @@ btbuildCallback(Relation index,
void
btbuildempty(Relation index)
{
- Page metapage;
-
- /* Construct metapage. */
- metapage = (Page) palloc(BLCKSZ);
- _bt_initmetapage(metapage, P_NONE, 0);
-
- /*
- * Write the page and log it. It might seem that an immediate sync
- * would be sufficient to guarantee that the file exists on disk, but
- * recovery itself might remove it while replaying, for example, an
- * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record. Therefore, we
- * need this even when wal_level=minimal.
- */
- PageSetChecksumInplace(metapage, BTREE_METAPAGE);
- smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
- (char *) metapage, true);
- log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
- BTREE_METAPAGE, metapage, false);
-
- /*
- * An immediate sync is required even if we xlog'd the page, because the
- * write did not go through shared_buffers and therefore a concurrent
- * checkpoint may have moved the redo pointer past our xlog record.
- */
- smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+ Buffer metabuffer;
+
+ /* An empty btree index has one meta page */
+ metabuffer =
+ ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
+ LockBuffer(metabuffer, BUFFER_LOCK_EXCLUSIVE);
+
+ /* Initialize and xlog meta buffer */
+ START_CRIT_SECTION();
+ _bt_initmetapage(BufferGetPage(metabuffer), P_NONE, 0);
+ MarkBufferDirty(metabuffer);
+ log_newpage_buffer(metabuffer, false);
+ END_CRIT_SECTION();
+
+ /* Unlock and release the buffer */
+ UnlockReleaseBuffer(metabuffer);
}
/*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 00a0ab4438..4252c2eb53 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -156,49 +156,50 @@ spgbuild(Relation heap, Relation index, IndexInfo *indexInfo)
void
spgbuildempty(Relation index)
{
- Page page;
-
- /* Construct metapage. */
- page = (Page) palloc(BLCKSZ);
- SpGistInitMetapage(page);
+ Buffer MetaBuffer,
+ RootBuffer,
+ TupleBuffer;
/*
- * Write the page and log it unconditionally. This is important
- * particularly for indexes created on tablespaces and databases
- * whose creation happened after the last redo pointer as recovery
- * removes any of their existing content when the corresponding
- * create records are replayed.
+ * An empty SPGist index has three pages:
+ * - one meta page.
+ * - one root page.
+ * - one null-tuple root page.
*/
- PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO);
- smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
- (char *) page, true);
- log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
- SPGIST_METAPAGE_BLKNO, page, false);
-
- /* Likewise for the root page. */
- SpGistInitPage(page, SPGIST_LEAF);
-
- PageSetChecksumInplace(page, SPGIST_ROOT_BLKNO);
- smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_ROOT_BLKNO,
- (char *) page, true);
- log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
- SPGIST_ROOT_BLKNO, page, true);
-
- /* Likewise for the null-tuples root page. */
- SpGistInitPage(page, SPGIST_LEAF | SPGIST_NULLS);
-
- PageSetChecksumInplace(page, SPGIST_NULL_BLKNO);
- smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_NULL_BLKNO,
- (char *) page, true);
- log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
- SPGIST_NULL_BLKNO, page, true);
+ MetaBuffer =
+ ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
+ LockBuffer(MetaBuffer, BUFFER_LOCK_EXCLUSIVE);
+ RootBuffer =
+ ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
+ LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
+ TupleBuffer =
+ ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
+ LockBuffer(TupleBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+ /* Initialize and log all the pages */
+ START_CRIT_SECTION();
- /*
- * An immediate sync is required even if we xlog'd the pages, because the
- * writes did not go through shared buffers and therefore a concurrent
- * checkpoint may have moved the redo pointer past our xlog record.
- */
- smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+ /* Construct and log the meta page */
+ SpGistInitMetapage(BufferGetPage(MetaBuffer));
+ MarkBufferDirty(MetaBuffer);
+ log_newpage_buffer(MetaBuffer, false);
+
+ /* root page */
+ SpGistInitPage(BufferGetPage(RootBuffer), SPGIST_LEAF);
+ MarkBufferDirty(RootBuffer);
+ log_newpage_buffer(RootBuffer, false);
+
+ /* null-tuples root page. */
+ SpGistInitPage(BufferGetPage(TupleBuffer), SPGIST_LEAF | SPGIST_NULLS);
+ MarkBufferDirty(TupleBuffer);
+ log_newpage_buffer(TupleBuffer, false);
+
+ END_CRIT_SECTION();
+
+ /* Unlock and release the buffers. */
+ UnlockReleaseBuffer(MetaBuffer);
+ UnlockReleaseBuffer(RootBuffer);
+ UnlockReleaseBuffer(TupleBuffer);
}
/*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3cb51204dc..66fe9ea529 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1302,6 +1302,14 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
else
buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
+ /*
+ * Initialization forks of unlogged tables need to have their dirty
+ * buffers written permanently to survive in case of a crash if the
+ * redo point is moved past the WAL-logging of the page itself.
+ */
+ if (forkNum == INIT_FORKNUM)
+ buf_state |= BM_PERMANENT;
+
UnlockBufHdr(buf, buf_state);
if (oldPartitionLock != NULL)
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index d117b66537..23054a6f6f 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -64,8 +64,9 @@
#define BM_JUST_DIRTIED (1U << 28) /* dirtied since write started */
#define BM_PIN_COUNT_WAITER (1U << 29) /* have waiter for sole pin */
#define BM_CHECKPOINT_NEEDED (1U << 30) /* must write for checkpoint */
-#define BM_PERMANENT (1U << 31) /* permanent relation (not
- * unlogged) */
+#define BM_PERMANENT (1U << 31) /* permanent relation (init
+ * forks for unlogged) */
+
/*
* The maximum allowed value of usage_count represents a tradeoff between
* accuracy and speed of the clock-sweep buffer management algorithm. A
On Wed, Jan 25, 2017 at 7:14 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
(Adding Robert in CC.)
On Thu, Jan 26, 2017 at 4:34 AM, Wang Hao <whberet@gmail.com> wrote:
An unlogged table has an initialization fork. The initialization fork does
not have the BM_PERMANENT flag set when a buffer is allocated for it.
At a checkpoint (other than a shutdown or end-of-recovery checkpoint), such buffers are therefore not written to disk.
After crash recovery, the pages of the initialization fork are then incorrect,
which makes the main fork incorrect as well.

For init forks the flush absolutely needs to happen, so that's really
not good. We ought to fix BufferAlloc() appropriately here.
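For the archives: per the report, the same failure mode should apply to the other AMs that initialize a meta page through shared buffers in ambuildempty. An untested, hypothetical hash-index variant of the reported GIN recipe would be:

```sql
create unlogged table hash_test_tbl(i int4);
create index hash_test_idx on hash_test_tbl using hash (i);
checkpoint;
-- kill -9 all postgres processes, then restart the server
vacuum hash_test_tbl;  -- before the fix, expected to crash on a bad metapage
```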
I agree with that, but I propose the attached version instead. It
seems cleaner to have the entire test for setting BM_PERMANENT in one
place rather than splitting it up as you did.
I believe this sets a record for the longest-lived data corruption bug
in a commit made by me. Six years and change, woohoo. :-(
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
unlogged-flush-fix-rmh.patch (application/octet-stream)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3cb5120..7d3d8cc 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1292,12 +1292,17 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* paranoia. We also reset the usage_count since any recency of use of
* the old content is no longer relevant. (The usage_count starts out at
* 1 so that the buffer can survive one clock-sweep pass.)
+ *
+ * Make sure BM_PERMANENT is set for buffers that must be written at
+ * every checkpoint. Unlogged buffers only need to be written at shutdown
+ * checkpoints, except for their "init" forks, which need to be treated
+ * just like permanent relations.
*/
buf->tag = newTag;
buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
BUF_USAGECOUNT_MASK);
- if (relpersistence == RELPERSISTENCE_PERMANENT)
+ if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
else
buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
On Tue, Mar 14, 2017 at 4:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jan 25, 2017 at 7:14 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
(Adding Robert in CC.)

On Thu, Jan 26, 2017 at 4:34 AM, Wang Hao <whberet@gmail.com> wrote:
An unlogged table has an initialization fork. The initialization fork does
not have the BM_PERMANENT flag set when a buffer is allocated for it.
At a checkpoint (other than a shutdown or end-of-recovery checkpoint), such buffers are therefore not written to disk.
After crash recovery, the pages of the initialization fork are then incorrect,
which makes the main fork incorrect as well.

For init forks the flush absolutely needs to happen, so that's really
not good. We ought to fix BufferAlloc() appropriately here.

I agree with that, but I propose the attached version instead. It
seems cleaner to have the entire test for setting BM_PERMANENT in one
place rather than splitting it up as you did.
Fine for me. You may want to update the comment of BM_PERMANENT in
buf_internals.h as Artur has mentioned upthread. For example by just
adding "and init forks".
I believe this sets a record for the longest-lived data corruption bug
in a commit made by me.
Really? I'll need to double-check the git history here.
Six years and change, woohoo. :-(
And that long for someone to report it.
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Mar 13, 2017 at 8:51 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
I agree with that, but I propose the attached version instead. It
seems cleaner to have the entire test for setting BM_PERMANENT in one
place rather than splitting it up as you did.

Fine for me. You may want to update the comment of BM_PERMANENT in
buf_internals.h as Artur has mentioned upthread. For example by just
adding "and init forks".
OK, done, and back-patched all the way.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Mar 15, 2017 at 1:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Mar 13, 2017 at 8:51 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
I agree with that, but I propose the attached version instead. It
seems cleaner to have the entire test for setting BM_PERMANENT in one
place rather than splitting it up as you did.

Fine for me. You may want to update the comment of BM_PERMANENT in
buf_internals.h as Artur has mentioned upthread. For example by just
adding "and init forks".

OK, done, and back-patched all the way.
Thanks.
--
Michael