Unlogged relations and WAL-logging
Unlogged relations are not WAL-logged, but creating the init-fork is.
There are a few things around that seem sloppy:
1. In index_build(), we do this:
/*
* If this is an unlogged index, we may need to write out an init fork for
* it -- but we must first check whether one already exists. If, for
* example, an unlogged relation is truncated in the transaction that
* created it, or truncated twice in a subsequent transaction, the
* relfilenode won't change, and nothing needs to be done here.
*/
if (indexRelation->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
!smgrexists(RelationGetSmgr(indexRelation), INIT_FORKNUM))
{
smgrcreate(RelationGetSmgr(indexRelation), INIT_FORKNUM, false);
indexRelation->rd_indam->ambuildempty(indexRelation);
}
Shouldn't we call log_smgrcreate() here? Creating the init fork is
otherwise not WAL-logged at all. In practice, all the ambuildempty()
implementations create and WAL-log a metapage and replay of that will
implicitly create the underlying relation. But if you created an index
access method whose ambuildempty() function does nothing, i.e. the init
fork is just an empty file, there would be no trace in the WAL about
creating the init fork, which is bad. In fact,
src/test/modules/dummy_index_am is an example of that, but because it
does nothing at all with the file, you cannot see any ill effect from
the missing init fork.
2. Some implementations of ambuildempty() use the buffer cache (hash,
gist, gin, brin), while others bypass it and call smgrimmedsync()
instead (btree, spgist, bloom). I don't see any particular reason for
those decisions, it seems to be based purely on which example the author
happened to copy-paste.
Using the buffer cache seems better to me, because it avoids the
smgrimmedsync() call. That makes creating unlogged indexes faster.
That's not significant for any real world application, but still.
3. Those ambuildempty implementations that bypass the buffer cache use
smgrwrite() to write the pages. That doesn't make any difference in
practice, but in principle it's wrong: You are supposed to use
smgrextend() when extending a relation. Using smgrwrite() skips updating
the relation size cache, which is harmless in this case because it
wasn't initialized previously either, and I'm not sure if we ever call
smgrnblocks() on the init-fork. But if you build with
CHECK_WRITE_VS_EXTEND, you get an assertion failure.
4. Also, the smgrwrite() calls are performed before WAL-logging the
pages, so the page that's written to disk has 0/0 as the LSN, not the
LSN of the WAL record. That's harmless too, but seems a bit sloppy.
(I remember we had a discussion once whether we should always extend the
relation first, and WAL-log only after that, but I can't find the thread
right now. The point is to avoid the situation that the operation fails
because you run out of disk space, and then you crash and WAL replay
also fails because you are still out of disk space. But most places
currently call log_newpage() first, and smgrextend() after that.)
5. In heapam_relation_set_new_filenode(), we do this:
/*
* If required, set up an init fork for an unlogged table so that it can
* be correctly reinitialized on restart. An immediate sync is required
* even if the page has been logged, because the write did not go through
* shared_buffers and therefore a concurrent checkpoint may have moved the
* redo pointer past our xlog record. Recovery may as well remove it
* while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
* record. Therefore, logging is necessary even if wal_level=minimal.
*/
if (persistence == RELPERSISTENCE_UNLOGGED)
{
Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
rel->rd_rel->relkind == RELKIND_MATVIEW ||
rel->rd_rel->relkind == RELKIND_TOASTVALUE);
smgrcreate(srel, INIT_FORKNUM, false);
log_smgrcreate(newrnode, INIT_FORKNUM);
smgrimmedsync(srel, INIT_FORKNUM);
}
The comment doesn't make much sense, we haven't written nor WAL-logged
any page here, with nor without the buffer cache. It made more sense
before commit fa0f466d53.
This is what actually led me to discover the bug I just reported at [1]/messages/by-id/d47d8122-415e-425c-d0a2-e0160829702d@iki.fi,
with regular tables. If we fix that bug I like I proposed there, then
smgrcreate() will register and fsync request, and we won't need
smgrimmedsync here anymore.
Attached is a patch to tighten those up. The third one should arguably
be part of [1]/messages/by-id/d47d8122-415e-425c-d0a2-e0160829702d@iki.fi, not this thread, but here it goes too.
[1]: /messages/by-id/d47d8122-415e-425c-d0a2-e0160829702d@iki.fi
/messages/by-id/d47d8122-415e-425c-d0a2-e0160829702d@iki.fi
- Heikki
Attachments:
0001-WAL-log-the-creation-of-the-init-fork-in-index_build.patchtext/x-patch; charset=UTF-8; name=0001-WAL-log-the-creation-of-the-init-fork-in-index_build.patchDownload
From af05b18ff62977653b68ef15625dfce71be800b5 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Thu, 27 Jan 2022 21:19:58 +0200
Subject: [PATCH 1/3] WAL-log the creation of the init fork in index_build().
That's harmless in practice, because all the ambuildempty()
implementations in tree, and in known extensions, write a metapage to
the init fork, and that is WAL-logged, which implicitly logs the
cration of the fork. But it would fail if ambuildempty() didn't write
any page to the fork.
---
src/backend/catalog/index.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 2308d402565..865fddcfe57 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -54,6 +54,7 @@
#include "catalog/pg_trigger.h"
#include "catalog/pg_type.h"
#include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
#include "commands/event_trigger.h"
#include "commands/progress.h"
#include "commands/tablecmds.h"
@@ -3027,6 +3028,7 @@ index_build(Relation heapRelation,
!smgrexists(RelationGetSmgr(indexRelation), INIT_FORKNUM))
{
smgrcreate(RelationGetSmgr(indexRelation), INIT_FORKNUM, false);
+ log_smgrcreate(&indexRelation->rd_node, INIT_FORKNUM);
indexRelation->rd_indam->ambuildempty(indexRelation);
}
--
2.30.2
0002-Use-the-buffer-cache-when-initializing-an-unlogged-i.patchtext/x-patch; charset=UTF-8; name=0002-Use-the-buffer-cache-when-initializing-an-unlogged-i.patchDownload
From 91975d892b47f4b3180218d753f462933b9faf49 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Thu, 27 Jan 2022 21:24:49 +0200
Subject: [PATCH 2/3] Use the buffer cache when initializing an unlogged index.
Some of the ambuildempty functions used smgrwrite() directly, followed
by smgrimmedsync(). A few small problems with that:
Firstly, one is supposed to use smgrextend() when extending the
relation, not smgrwrite(). That doesn't make much difference in
practice. smgrextend() updates the relation size cache, so you miss
that, but we don't use that for the init fork of an unlogged relation.
But if you compile with CHECK_WRITE_VS_EXTEND, you get an assertion
failure.
Secondly, the smgrwrite() calls were performed before WAL-logging, so
that the page image written to disk had 0/0 as the LSN, not the LSN of
the WAL record. That's harmless in practice, but seems sloppy.
Thirdly, it's better to use the buffer cache, because then you don't
need to smgrimmedsync() the relation to disk, which adds latency.
Bypassing the cache makes sense for bulk operations like index
creation, but not when you're just initializing an empty index.
Creation of unlogged tables is hardly performance bottleneck in any
real world applications, but nevertheless.
---
contrib/bloom/blinsert.c | 29 ++----------
contrib/bloom/bloom.h | 2 +-
contrib/bloom/blutils.c | 8 ++--
src/backend/access/nbtree/nbtree.c | 41 ++++++++--------
src/backend/access/spgist/spginsert.c | 68 ++++++++++++---------------
5 files changed, 59 insertions(+), 89 deletions(-)
diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index c94cf34e698..b20694d6250 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -129,7 +129,7 @@ blbuild(Relation heap, Relation index, IndexInfo *indexInfo)
RelationGetRelationName(index));
/* Initialize the meta page */
- BloomInitMetapage(index);
+ BloomInitMetapage(index, MAIN_FORKNUM);
/* Initialize the bloom build state */
memset(&buildstate, 0, sizeof(buildstate));
@@ -163,31 +163,8 @@ blbuild(Relation heap, Relation index, IndexInfo *indexInfo)
void
blbuildempty(Relation index)
{
- Page metapage;
-
- /* Construct metapage. */
- metapage = (Page) palloc(BLCKSZ);
- BloomFillMetapage(index, metapage);
-
- /*
- * Write the page and log it. It might seem that an immediate sync would
- * be sufficient to guarantee that the file exists on disk, but recovery
- * itself might remove it while replaying, for example, an
- * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record. Therefore, we need
- * this even when wal_level=minimal.
- */
- PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
- smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
- (char *) metapage, true);
- log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
- BLOOM_METAPAGE_BLKNO, metapage, true);
-
- /*
- * An immediate sync is required even if we xlog'd the page, because the
- * write did not go through shared_buffers and therefore a concurrent
- * checkpoint may have moved the redo pointer past our xlog record.
- */
- smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
+ /* Initialize the meta page */
+ BloomInitMetapage(index, INIT_FORKNUM);
}
/*
diff --git a/contrib/bloom/bloom.h b/contrib/bloom/bloom.h
index 9966f3f46e1..e3259b4d126 100644
--- a/contrib/bloom/bloom.h
+++ b/contrib/bloom/bloom.h
@@ -178,7 +178,7 @@ typedef BloomScanOpaqueData *BloomScanOpaque;
extern void _PG_init(void);
extern void initBloomState(BloomState *state, Relation index);
extern void BloomFillMetapage(Relation index, Page metaPage);
-extern void BloomInitMetapage(Relation index);
+extern void BloomInitMetapage(Relation index, ForkNumber forknum);
extern void BloomInitPage(Page page, uint16 flags);
extern Buffer BloomNewBuffer(Relation index);
extern void signValue(BloomState *state, BloomSignatureWord *sign, Datum value, int attno);
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index a434cf93efd..6ff6dad5cb6 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -451,7 +451,7 @@ BloomFillMetapage(Relation index, Page metaPage)
* Initialize metapage for bloom index.
*/
void
-BloomInitMetapage(Relation index)
+BloomInitMetapage(Relation index, ForkNumber forknum)
{
Buffer metaBuffer;
Page metaPage;
@@ -459,9 +459,11 @@ BloomInitMetapage(Relation index)
/*
* Make a new page; since it is first page it should be associated with
- * block number 0 (BLOOM_METAPAGE_BLKNO).
+ * block number 0 (BLOOM_METAPAGE_BLKNO). No need to hold the extension
+ * lock because there cannot be concurrent inserters yet.
*/
- metaBuffer = BloomNewBuffer(index);
+ metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+ LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
Assert(BufferGetBlockNumber(metaBuffer) == BLOOM_METAPAGE_BLKNO);
/* Initialize contents of meta page */
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 13024af2faa..a3ef3f50cea 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -150,31 +150,32 @@ bthandler(PG_FUNCTION_ARGS)
void
btbuildempty(Relation index)
{
+ Buffer metabuf;
Page metapage;
- /* Construct metapage. */
- metapage = (Page) palloc(BLCKSZ);
- _bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
-
/*
- * Write the page and log it. It might seem that an immediate sync would
- * be sufficient to guarantee that the file exists on disk, but recovery
- * itself might remove it while replaying, for example, an
- * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record. Therefore, we need
- * this even when wal_level=minimal.
+ * Initalize the metapage.
+ *
+ * Regular index build bypasses the buffer manager and uses smgr functions
+ * directly, with an smgrimmedsync() call at the end. That makes sense
+ * when the index is large, but for an empty index, it's better to use
+ * the buffer cache to avoid the smgrimmedsync().
*/
- PageSetChecksumInplace(metapage, BTREE_METAPAGE);
- smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BTREE_METAPAGE,
- (char *) metapage, true);
- log_newpage(&RelationGetSmgr(index)->smgr_rnode.node, INIT_FORKNUM,
- BTREE_METAPAGE, metapage, true);
+ metabuf = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
+ Assert(BufferGetBlockNumber(metabuf) == BTREE_METAPAGE);
+ _bt_lockbuf(index, metabuf, BT_WRITE);
- /*
- * An immediate sync is required even if we xlog'd the page, because the
- * write did not go through shared_buffers and therefore a concurrent
- * checkpoint may have moved the redo pointer past our xlog record.
- */
- smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
+ START_CRIT_SECTION();
+
+ metapage = BufferGetPage(metabuf);
+ _bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
+ MarkBufferDirty(metabuf);
+ log_newpage_buffer(metabuf, true);
+
+ END_CRIT_SECTION();
+
+ _bt_unlockbuf(index, metabuf);
+ ReleaseBuffer(metabuf);
}
/*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index bfb74049d0c..2676b0ad74a 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -155,49 +155,39 @@ spgbuild(Relation heap, Relation index, IndexInfo *indexInfo)
void
spgbuildempty(Relation index)
{
- Page page;
-
- /* Construct metapage. */
- page = (Page) palloc(BLCKSZ);
- SpGistInitMetapage(page);
+ Buffer metabuffer,
+ rootbuffer,
+ nullbuffer;
/*
- * Write the page and log it unconditionally. This is important
- * particularly for indexes created on tablespaces and databases whose
- * creation happened after the last redo pointer as recovery removes any
- * of their existing content when the corresponding create records are
- * replayed.
+ * Initialize the meta page and root pages
*/
- PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO);
- smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
- (char *) page, true);
- log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
- SPGIST_METAPAGE_BLKNO, page, true);
-
- /* Likewise for the root page. */
- SpGistInitPage(page, SPGIST_LEAF);
-
- PageSetChecksumInplace(page, SPGIST_ROOT_BLKNO);
- smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, SPGIST_ROOT_BLKNO,
- (char *) page, true);
- log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
- SPGIST_ROOT_BLKNO, page, true);
-
- /* Likewise for the null-tuples root page. */
- SpGistInitPage(page, SPGIST_LEAF | SPGIST_NULLS);
-
- PageSetChecksumInplace(page, SPGIST_NULL_BLKNO);
- smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, SPGIST_NULL_BLKNO,
- (char *) page, true);
- log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
- SPGIST_NULL_BLKNO, page, true);
+ metabuffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
+ rootbuffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
+ nullbuffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
- /*
- * An immediate sync is required even if we xlog'd the pages, because the
- * writes did not go through shared buffers and therefore a concurrent
- * checkpoint may have moved the redo pointer past our xlog record.
- */
- smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
+ Assert(BufferGetBlockNumber(metabuffer) == SPGIST_METAPAGE_BLKNO);
+ Assert(BufferGetBlockNumber(rootbuffer) == SPGIST_ROOT_BLKNO);
+ Assert(BufferGetBlockNumber(nullbuffer) == SPGIST_NULL_BLKNO);
+
+ START_CRIT_SECTION();
+
+ SpGistInitMetapage(BufferGetPage(metabuffer));
+ MarkBufferDirty(metabuffer);
+ SpGistInitBuffer(rootbuffer, SPGIST_LEAF);
+ MarkBufferDirty(rootbuffer);
+ SpGistInitBuffer(nullbuffer, SPGIST_LEAF | SPGIST_NULLS);
+ MarkBufferDirty(nullbuffer);
+
+ log_newpage_buffer(metabuffer, true);
+ log_newpage_buffer(rootbuffer, true);
+ log_newpage_buffer(nullbuffer, true);
+
+ END_CRIT_SECTION();
+
+ UnlockReleaseBuffer(metabuffer);
+ UnlockReleaseBuffer(rootbuffer);
+ UnlockReleaseBuffer(nullbuffer);
}
/*
--
2.30.2
0003-Remove-unnecessary-smgrimmedsync-when-creating-unlog.patchtext/x-patch; charset=UTF-8; name=0003-Remove-unnecessary-smgrimmedsync-when-creating-unlog.patchDownload
From 409c47797b3beba03f88e9386b0f0967ba28bf41 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Thu, 27 Jan 2022 21:24:04 +0200
Subject: [PATCH 3/3] Remove unnecessary smgrimmedsync() when creating unlogged
table.
In the passing, fix a comment in smgrDoPendingSyncs().
TODO: only push this after we have fixed smgrcreate() to register the new
relation for fsyncing at next checkpoint.
---
src/backend/access/heap/heapam_handler.c | 8 +-------
src/backend/catalog/storage.c | 2 +-
2 files changed, 2 insertions(+), 8 deletions(-)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 39ef8a0b77d..a7259936fec 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -597,12 +597,7 @@ heapam_relation_set_new_filenode(Relation rel,
/*
* If required, set up an init fork for an unlogged table so that it can
- * be correctly reinitialized on restart. An immediate sync is required
- * even if the page has been logged, because the write did not go through
- * shared_buffers and therefore a concurrent checkpoint may have moved the
- * redo pointer past our xlog record. Recovery may as well remove it
- * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
- * record. Therefore, logging is necessary even if wal_level=minimal.
+ * be correctly reinitialized on restart.
*/
if (persistence == RELPERSISTENCE_UNLOGGED)
{
@@ -611,7 +606,6 @@ heapam_relation_set_new_filenode(Relation rel,
rel->rd_rel->relkind == RELKIND_TOASTVALUE);
smgrcreate(srel, INIT_FORKNUM, false);
log_smgrcreate(newrnode, INIT_FORKNUM);
- smgrimmedsync(srel, INIT_FORKNUM);
}
smgrclose(srel);
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 9b8075536a7..468b6dafcb6 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -726,7 +726,7 @@ smgrDoPendingSyncs(bool isCommit, bool isParallelWorker)
/*
* We emit newpage WAL records for smaller relations.
*
- * Small WAL records have a chance to be emitted along with other
+ * Small WAL records have a chance to be flushed along with other
* backends' WAL records. We emit WAL records instead of syncing for
* files that are smaller than a certain threshold, expecting faster
* commit. The threshold is defined by the GUC wal_skip_threshold.
--
2.30.2
On Thu, Jan 27, 2022 at 2:32 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Unlogged relations are not WAL-logged, but creating the init-fork is.
There are a few things around that seem sloppy:1. In index_build(), we do this:
*/
if (indexRelation->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
!smgrexists(RelationGetSmgr(indexRelation), INIT_FORKNUM))
{
smgrcreate(RelationGetSmgr(indexRelation), INIT_FORKNUM, false);
indexRelation->rd_indam->ambuildempty(indexRelation);
}Shouldn't we call log_smgrcreate() here? Creating the init fork is
otherwise not WAL-logged at all.
Yes, that's a bug.
2. Some implementations of ambuildempty() use the buffer cache (hash,
gist, gin, brin), while others bypass it and call smgrimmedsync()
instead (btree, spgist, bloom). I don't see any particular reason for
those decisions, it seems to be based purely on which example the author
happened to copy-paste.
I thought that this inconsistency was odd when I was developing the
unlogged feature, but I tried to keep each routine's ambuildempty()
consistent with whatever ambuild() was doing. I don't mind if you want
to change it, though.
3. Those ambuildempty implementations that bypass the buffer cache use
smgrwrite() to write the pages. That doesn't make any difference in
practice, but in principle it's wrong: You are supposed to use
smgrextend() when extending a relation.
That's a mistake on my part.
4. Also, the smgrwrite() calls are performed before WAL-logging the
pages, so the page that's written to disk has 0/0 as the LSN, not the
LSN of the WAL record. That's harmless too, but seems a bit sloppy.
That is also a mistake on my part.
5. In heapam_relation_set_new_filenode(), we do this:
/*
* If required, set up an init fork for an unlogged table so that it can
* be correctly reinitialized on restart. An immediate sync is required
* even if the page has been logged, because the write did not go through
* shared_buffers and therefore a concurrent checkpoint may have moved the
* redo pointer past our xlog record. Recovery may as well remove it
* while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
* record. Therefore, logging is necessary even if wal_level=minimal.
*/
if (persistence == RELPERSISTENCE_UNLOGGED)
{
Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
rel->rd_rel->relkind == RELKIND_MATVIEW ||
rel->rd_rel->relkind == RELKIND_TOASTVALUE);
smgrcreate(srel, INIT_FORKNUM, false);
log_smgrcreate(newrnode, INIT_FORKNUM);
smgrimmedsync(srel, INIT_FORKNUM);
}The comment doesn't make much sense, we haven't written nor WAL-logged
any page here, with nor without the buffer cache. It made more sense
before commit fa0f466d53.
Well, it seems to me (and perhaps I am just confused) that complaining
that there's no page written here might be a technicality. The point
is that there's no synchronization between the work we're doing here
-- which is creating a fork, not writing a page -- and any concurrent
checkpoint. So we both need to log it, and also sync it immediately.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 28/01/2022 15:57, Robert Haas wrote:
On Thu, Jan 27, 2022 at 2:32 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Unlogged relations are not WAL-logged, but creating the init-fork is.
There are a few things around that seem sloppy:1. In index_build(), we do this:
*/
if (indexRelation->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
!smgrexists(RelationGetSmgr(indexRelation), INIT_FORKNUM))
{
smgrcreate(RelationGetSmgr(indexRelation), INIT_FORKNUM, false);
indexRelation->rd_indam->ambuildempty(indexRelation);
}Shouldn't we call log_smgrcreate() here? Creating the init fork is
otherwise not WAL-logged at all.Yes, that's a bug.
Pushed and backpatched this patch (commit 3142a8845b).
2. Some implementations of ambuildempty() use the buffer cache (hash,
gist, gin, brin), while others bypass it and call smgrimmedsync()
instead (btree, spgist, bloom). I don't see any particular reason for
those decisions, it seems to be based purely on which example the author
happened to copy-paste.I thought that this inconsistency was odd when I was developing the
unlogged feature, but I tried to keep each routine's ambuildempty()
consistent with whatever ambuild() was doing. I don't mind if you want
to change it, though.3. Those ambuildempty implementations that bypass the buffer cache use
smgrwrite() to write the pages. That doesn't make any difference in
practice, but in principle it's wrong: You are supposed to use
smgrextend() when extending a relation.That's a mistake on my part.
4. Also, the smgrwrite() calls are performed before WAL-logging the
pages, so the page that's written to disk has 0/0 as the LSN, not the
LSN of the WAL record. That's harmless too, but seems a bit sloppy.That is also a mistake on my part.
I'm still sitting on these fixes. I think the patch I posted still makes
sense, but I got carried away with a more invasive approach that
introduces a whole new set of functions for bulk-creating a relation,
which would handle WAL-logging, smgrimmedsync() and all that (see
below). We have some repetitive, error-prone code in all the index build
functions for that. But that's not backpatchable, so I'll rebase the
original approach next week.
5. In heapam_relation_set_new_filenode(), we do this:
/*
* If required, set up an init fork for an unlogged table so that it can
* be correctly reinitialized on restart. An immediate sync is required
* even if the page has been logged, because the write did not go through
* shared_buffers and therefore a concurrent checkpoint may have moved the
* redo pointer past our xlog record. Recovery may as well remove it
* while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
* record. Therefore, logging is necessary even if wal_level=minimal.
*/
if (persistence == RELPERSISTENCE_UNLOGGED)
{
Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
rel->rd_rel->relkind == RELKIND_MATVIEW ||
rel->rd_rel->relkind == RELKIND_TOASTVALUE);
smgrcreate(srel, INIT_FORKNUM, false);
log_smgrcreate(newrnode, INIT_FORKNUM);
smgrimmedsync(srel, INIT_FORKNUM);
}The comment doesn't make much sense, we haven't written nor WAL-logged
any page here, with nor without the buffer cache. It made more sense
before commit fa0f466d53.Well, it seems to me (and perhaps I am just confused) that complaining
that there's no page written here might be a technicality. The point
is that there's no synchronization between the work we're doing here
-- which is creating a fork, not writing a page -- and any concurrent
checkpoint. So we both need to log it, and also sync it immediately.
I see. I pushed the fix from the other thread that makes smgrcreate()
call register_dirty_segment (commit 4b4798e13). I believe that makes
this smgrimmedsync() unnecessary. If a concurrent checkpoint happens
with a redo pointer greater than this WAL record, it must've received
the fsync request created by smgrcreate(). That depends on the fact that
we write the WAL record *after* smgrcreate(). Subtle..
Hmm, we have a similar smgrimmedsync() call after index build, because
we have written pages directly with smgrextend(skipFsync=true). If no
checkpoints have occurred during the index build, we could call
register_dirty_segment() instead of smgrimmedsync(). That would avoid
the fsync() latency when creating an index on an empty or small index.
This is all very subtle to get right though. That's why I'd like to
invent a new bulk-creation facility that would handle this stuff, and
make the callers less error-prone.
--
Heikki Linnakangas
Neon (https://neon.tech)
On 07/07/2023 18:21, Heikki Linnakangas wrote:
On 28/01/2022 15:57, Robert Haas wrote:
4. Also, the smgrwrite() calls are performed before WAL-logging the
pages, so the page that's written to disk has 0/0 as the LSN, not the
LSN of the WAL record. That's harmless too, but seems a bit sloppy.That is also a mistake on my part.
I'm still sitting on these fixes. I think the patch I posted still makes
sense, but I got carried away with a more invasive approach that
introduces a whole new set of functions for bulk-creating a relation,
which would handle WAL-logging, smgrimmedsync() and all that (see
below). We have some repetitive, error-prone code in all the index build
functions for that. But that's not backpatchable, so I'll rebase the
original approach next week.
Committed this fix to master and v16. Didn't seem worth backpatching
further than that, given that there is no live user-visible issue here.
5. In heapam_relation_set_new_filenode(), we do this:
/*
* If required, set up an init fork for an unlogged table so that it can
* be correctly reinitialized on restart. An immediate sync is required
* even if the page has been logged, because the write did not go through
* shared_buffers and therefore a concurrent checkpoint may have moved the
* redo pointer past our xlog record. Recovery may as well remove it
* while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
* record. Therefore, logging is necessary even if wal_level=minimal.
*/
if (persistence == RELPERSISTENCE_UNLOGGED)
{
Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
rel->rd_rel->relkind == RELKIND_MATVIEW ||
rel->rd_rel->relkind == RELKIND_TOASTVALUE);
smgrcreate(srel, INIT_FORKNUM, false);
log_smgrcreate(newrnode, INIT_FORKNUM);
smgrimmedsync(srel, INIT_FORKNUM);
}The comment doesn't make much sense, we haven't written nor WAL-logged
any page here, with nor without the buffer cache. It made more sense
before commit fa0f466d53.Well, it seems to me (and perhaps I am just confused) that complaining
that there's no page written here might be a technicality. The point
is that there's no synchronization between the work we're doing here
-- which is creating a fork, not writing a page -- and any concurrent
checkpoint. So we both need to log it, and also sync it immediately.I see. I pushed the fix from the other thread that makes smgrcreate()
call register_dirty_segment (commit 4b4798e13). I believe that makes
this smgrimmedsync() unnecessary. If a concurrent checkpoint happens
with a redo pointer greater than this WAL record, it must've received
the fsync request created by smgrcreate(). That depends on the fact that
we write the WAL record *after* smgrcreate(). Subtle..Hmm, we have a similar smgrimmedsync() call after index build, because
we have written pages directly with smgrextend(skipFsync=true). If no
checkpoints have occurred during the index build, we could call
register_dirty_segment() instead of smgrimmedsync(). That would avoid
the fsync() latency when creating an index on an empty or small index.This is all very subtle to get right though. That's why I'd like to
invent a new bulk-creation facility that would handle this stuff, and
make the callers less error-prone.
Having a more generic and less error-prone bulk-creation mechanism is
still on my long TODO list..
--
Heikki Linnakangas
Neon (https://neon.tech)
Is the patch
0003-Remove-unnecessary-smgrimmedsync-when-creating-unlog.patch still
relevant, or can this commitfest entry be closed?
Show quoted text
On 23.08.23 16:40, Heikki Linnakangas wrote:
5. In heapam_relation_set_new_filenode(), we do this:
/*
* If required, set up an init fork for an unlogged table
so that it can
* be correctly reinitialized on restart. An immediate
sync is required
* even if the page has been logged, because the write did
not go through
* shared_buffers and therefore a concurrent checkpoint may
have moved the
* redo pointer past our xlog record. Recovery may as well
remove it
* while replaying, for example, XLOG_DBASE_CREATE or
XLOG_TBLSPC_CREATE
* record. Therefore, logging is necessary even if
wal_level=minimal.
*/
if (persistence == RELPERSISTENCE_UNLOGGED)
{
Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
rel->rd_rel->relkind == RELKIND_MATVIEW ||
rel->rd_rel->relkind ==
RELKIND_TOASTVALUE);
smgrcreate(srel, INIT_FORKNUM, false);
log_smgrcreate(newrnode, INIT_FORKNUM);
smgrimmedsync(srel, INIT_FORKNUM);
}The comment doesn't make much sense, we haven't written nor WAL-logged
any page here, with nor without the buffer cache. It made more sense
before commit fa0f466d53.Well, it seems to me (and perhaps I am just confused) that complaining
that there's no page written here might be a technicality. The point
is that there's no synchronization between the work we're doing here
-- which is creating a fork, not writing a page -- and any concurrent
checkpoint. So we both need to log it, and also sync it immediately.I see. I pushed the fix from the other thread that makes smgrcreate()
call register_dirty_segment (commit 4b4798e13). I believe that makes
this smgrimmedsync() unnecessary. If a concurrent checkpoint happens
with a redo pointer greater than this WAL record, it must've received
the fsync request created by smgrcreate(). That depends on the fact that
we write the WAL record *after* smgrcreate(). Subtle..Hmm, we have a similar smgrimmedsync() call after index build, because
we have written pages directly with smgrextend(skipFsync=true). If no
checkpoints have occurred during the index build, we could call
register_dirty_segment() instead of smgrimmedsync(). That would avoid
the fsync() latency when creating an index on an empty or small index.This is all very subtle to get right though. That's why I'd like to
invent a new bulk-creation facility that would handle this stuff, and
make the callers less error-prone.Having a more generic and less error-prone bulk-creation mechanism is
still on my long TODO list..
On 01/09/2023 15:49, Peter Eisentraut wrote:
Is the patch
0003-Remove-unnecessary-smgrimmedsync-when-creating-unlog.patch still
relevant, or can this commitfest entry be closed?
Yes. Pushed it now, thanks!
--
Heikki Linnakangas
Neon (https://neon.tech)