In-place persistence change of a relation
Hello. This is a thread for an alternative solution to wal_level=none
[*1] for bulk data loading.
*1: /messages/by-id/TYAPR01MB29901EBE5A3ACCE55BA99186FE320@TYAPR01MB2990.jpnprd01.prod.outlook.com
At Tue, 10 Nov 2020 09:33:12 -0500, Stephen Frost <sfrost@snowman.net> wrote in
Greetings,
* Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote:
For fuel(?) of the discussion, I tried a very-quick PoC for in-place
ALTER TABLE SET LOGGED/UNLOGGED and resulted as attached. After some
trials of several ways, I drifted to the following way after poking
several ways.
1. Flip BM_PERMANENT of active buffers
2. adding/removing init fork
3. sync files,
4. Flip pg_class.relpersistence.
It always skips table copy in the SET UNLOGGED case, and only when
wal_level=minimal in the SET LOGGED case. Crash recovery seems
working by some brief testing by hand.
Somehow missed that this patch more-or-less does what I was referring to
down-thread, but I did want to mention that it looks like it's missing a
necessary FlushRelationBuffers() call before the sync, otherwise there
could be dirty buffers for the relation that's being set to LOGGED (with
wal_level=minimal), which wouldn't be good. See the comments above
smgrimmedsync().
Right. Thanks. However, SetRelFileNodeBuffersPersistence(), which is
called just above, already scans shared buffers, so I don't want to
call FlushRelationBuffers() separately. Instead, I added the buffer
flush to SetRelFileNodeBuffersPersistence().
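The single-pass idea described above (flip the flag and flush dirty pages in the same scan) can be sketched with a toy model. This is only an illustration under stated assumptions: a flat array stands in for shared buffers, and `ToyBuffer`, the `TOY_BM_*` flag values and `toy_set_persistence()` are hypothetical names, not PostgreSQL code. The real routine would also need buffer header locks, pins and content locks.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits; the real BM_* values live in buf_internals.h. */
#define TOY_BM_DIRTY     (1u << 0)
#define TOY_BM_VALID     (1u << 1)
#define TOY_BM_PERMANENT (1u << 2)

/* Hypothetical stand-in for a buffer descriptor. */
typedef struct ToyBuffer
{
	uint32_t	rel_id;		/* which relation the page belongs to */
	uint32_t	state;		/* combination of TOY_BM_* flags */
} ToyBuffer;

/*
 * Scan the pool once. For every buffer of rel_id, set or clear the
 * PERMANENT flag; when switching to permanent, also "flush" valid dirty
 * pages (modelled here by clearing the dirty bit). Returns the number of
 * pages flushed.
 */
size_t
toy_set_persistence(ToyBuffer *pool, size_t nbuffers,
					uint32_t rel_id, bool permanent)
{
	size_t		flushed = 0;

	for (size_t i = 0; i < nbuffers; i++)
	{
		ToyBuffer  *buf = &pool[i];

		if (buf->rel_id != rel_id)
			continue;			/* buffer of another relation */

		if (permanent)
		{
			buf->state |= TOY_BM_PERMANENT;

			if ((buf->state & (TOY_BM_VALID | TOY_BM_DIRTY)) ==
				(TOY_BM_VALID | TOY_BM_DIRTY))
			{
				buf->state &= ~TOY_BM_DIRTY;	/* pretend it was written */
				flushed++;
			}
		}
		else
			buf->state &= ~TOY_BM_PERMANENT;
	}

	return flushed;
}
```

Note that writing pages out in this pass only hands them to the kernel; any sync of the files has to come after the scan.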
FWIW this is a revised version of the PoC, which has some known
problems.
- Flipping of buffer persistence is not WAL-logged, nor can it even be
safely rolled back. (It might be better to drop buffers.)
- This version handles indexes but does not yet handle toast relations.
- tableAMs are supposed to support this feature. (but I'm not sure
it's worth allowing them not to do so).
Of course, I haven't performed intensive testing on it.
Reading through the thread, it didn't seem very clear, but we should
definitely make sure that it does the right thing on replicas when going
between unlogged and logged (and between logged and unlogged too), of
course.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
PoC_in-place_set_persistence_v2.patch (text/x-patch; charset=us-ascii)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index dcaea7135f..0c6ce70484 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -613,6 +613,27 @@ heapam_relation_set_new_filenode(Relation rel,
smgrclose(srel);
}
+static void
+heapam_relation_set_persistence(Relation rel, char persistence)
+{
+ Assert(rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT ||
+ rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED);
+
+ Assert (rel->rd_rel->relpersistence != persistence);
+
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ {
+ Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
+ rel->rd_rel->relkind == RELKIND_MATVIEW ||
+ rel->rd_rel->relkind == RELKIND_TOASTVALUE);
+
+ RelationCreateInitFork(rel->rd_node, false);
+ }
+ else
+ RelationDropInitFork(rel->rd_node);
+}
+
+
static void
heapam_relation_nontransactional_truncate(Relation rel)
{
@@ -2540,6 +2561,7 @@ static const TableAmRoutine heapam_methods = {
.compute_xid_horizon_for_tuples = heap_compute_xid_horizon_for_tuples,
.relation_set_new_filenode = heapam_relation_set_new_filenode,
+ .relation_set_persistence = heapam_relation_set_persistence,
.relation_nontransactional_truncate = heapam_relation_nontransactional_truncate,
.relation_copy_data = heapam_relation_copy_data,
.relation_copy_for_cluster = heapam_relation_copy_for_cluster,
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index a7c0cb1bc3..8397002613 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,14 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+ char *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+ appendStringInfoString(buf, path);
+ pfree(path);
+ }
}
const char *
@@ -55,6 +63,9 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_UNLINK:
+ id = "UNLINK";
+ break;
}
return id;
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index d538f25726..ac5aea3d38 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -60,6 +60,8 @@ int wal_skip_threshold = 2048; /* in kilobytes */
typedef struct PendingRelDelete
{
RelFileNode relnode; /* relation that may need to be deleted */
+ bool deleteinitfork; /* delete only init fork if true */
+ bool createinitfork; /* create init fork if true */
BackendId backend; /* InvalidBackendId if not a temp rel */
bool atCommit; /* T=delete at commit; F=delete at abort */
int nestLevel; /* xact nesting level of request */
@@ -153,6 +155,8 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
pending = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
pending->relnode = rnode;
+ pending->deleteinitfork = false;
+ pending->createinitfork = false;
pending->backend = backend;
pending->atCommit = false; /* delete if abort */
pending->nestLevel = GetCurrentTransactionNestLevel();
@@ -168,6 +172,95 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return srel;
}
+/*
+ * RelationCreateInitFork
+ * Create the init fork for a relation.
+ *
+ * Create the init fork of the relation's underlying disk file storage,
+ * unless a pending drop of that init fork registered earlier in this
+ * transaction can simply be cancelled.
+ *
+ * This function is transactional. The creation is WAL-logged, and if the
+ * transaction aborts later on, the init fork will be removed.
+ */
+void
+RelationCreateInitFork(RelFileNode rnode, bool isRedo)
+{
+ PendingRelDelete *pending;
+ SMgrRelation srel;
+ PendingRelDelete *prev;
+ PendingRelDelete *next;
+
+ prev = NULL;
+ for (pending = pendingDeletes; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ pending->deleteinitfork && pending->atCommit)
+ {
+ /* unlink and delete list entry */
+ if (prev)
+ prev->next = next;
+ else
+ pendingDeletes = next;
+ pfree(pending);
+ return;
+ }
+ else
+ {
+ /* unrelated entry, don't touch it */
+ prev = pending;
+ }
+ }
+ srel = smgropen(rnode, InvalidBackendId);
+ smgrcreate(srel, INIT_FORKNUM, isRedo);
+ if (!isRedo)
+ log_smgrcreate(&rnode, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+
+ /* Add the relation to the list of stuff to delete at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->deleteinitfork = true;
+ pending->createinitfork = false;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false; /* delete if abort */
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+}
+
+void
+RelationDropInitFork(RelFileNode rnode)
+{
+ PendingRelDelete *pending;
+ PendingRelDelete *next;
+
+ for (pending = pendingDeletes; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ pending->deleteinitfork && pending->atCommit)
+ {
+ /* We're done. */
+ return;
+ }
+ }
+
+ /* Add the init fork to the list of stuff to delete at commit */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->deleteinitfork = true;
+ pending->createinitfork = false;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true; /* delete if commit */
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+}
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -187,6 +280,25 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum)
+{
+ xl_smgr_unlink xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file unlink.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -200,6 +312,8 @@ RelationDropStorage(Relation rel)
pending = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
pending->relnode = rel->rd_node;
+ pending->createinitfork = false;
+ pending->deleteinitfork = false;
pending->backend = rel->rd_backend;
pending->atCommit = true; /* delete if commit */
pending->nestLevel = GetCurrentTransactionNestLevel();
@@ -626,19 +740,27 @@ smgrDoPendingDeletes(bool isCommit)
srel = smgropen(pending->relnode, pending->backend);
- /* allocate the initial array, or extend it, if needed */
- if (maxrels == 0)
+ if (pending->deleteinitfork)
{
- maxrels = 8;
- srels = palloc(sizeof(SMgrRelation) * maxrels);
+ log_smgrunlink(&pending->relnode, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
}
- else if (maxrels <= nrels)
+ else
{
- maxrels *= 2;
- srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
- }
+ /* allocate the initial array, or extend it, if needed */
+ if (maxrels == 0)
+ {
+ maxrels = 8;
+ srels = palloc(sizeof(SMgrRelation) * maxrels);
+ }
+ else if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
- srels[nrels++] = srel;
+ srels[nrels++] = srel;
+ }
}
/* must explicitly free the list entry */
pfree(pending);
@@ -917,6 +1039,14 @@ smgr_redo(XLogReaderState *record)
reln = smgropen(xlrec->rnode, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ smgrunlink(reln, xlrec->forkNum, true);
+ }
else if (info == XLOG_SMGR_TRUNCATE)
{
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index e3cfaf8b07..e358174b01 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4918,6 +4918,137 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
return newcmd;
}
+static bool
+try_inplace_persistence_change(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ int i;
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * Under the following condition, we need to call ATRewriteTable, which
+ * cannot be false in the AT_REWRITE_ALTER_PERSISTENCE case.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ /*
+ * When wal_level is replica or higher we need that the initial state of
+ * the relation be recoverable from WAL. When wal_level >= replica
+ * switching to PERMANENT needs to emit the WAL records to reconstruct the
+ * current data. This could be done by writing XLOG_FPI for all pages but
+ * it is not obvious that that is performant than normal rewriting.
+ * Otherwise what we need for the relation data is just establishing
+ * initial state on storage and no need of WAL to reconstruct it.
+ */
+ if (tab->newrelpersistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded())
+ return false;
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ RelationOpenSmgr(rel);
+
+ /* Change persistence then flush-out buffers of the relation */
+
+ /* Get the list of index OIDs for this relation */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ table_close(rel, lockmode);
+
+ /* Done change on storage. Update catalog including indexes. */
+ /* add the heap oid to the relation ID list */
+
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach (lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+
+ RelationOpenSmgr(r);
+
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ {
+ RelationCreateInitFork(r->rd_node, false);
+
+ if (r->rd_rel->relkind == RELKIND_INDEX ||
+ r->rd_rel->relkind == RELKIND_PARTITIONED_INDEX)
+ r->rd_indam->ambuildempty(r);
+ else
+ {
+ Assert(r->rd_rel->relkind == RELKIND_RELATION ||
+ r->rd_rel->relkind == RELKIND_MATVIEW ||
+ r->rd_rel->relkind == RELKIND_TOASTVALUE);
+ }
+ }
+ else
+ RelationDropInitFork(r->rd_node);
+
+ table_close(r, NoLock);
+
+ /*
+ * This relation is now WAL-logged. Sync all files immediately to
+ * establish the initial state on storage.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT)
+ {
+ for (i = 0 ; i < MAX_FORKNUM ; i++)
+ {
+ if (smgrexists(r->rd_smgr, i))
+ smgrimmedsync(r->rd_smgr, i);
+ }
+ }
+
+
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+ }
+
+ foreach (lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+
+ RelationOpenSmgr(r);
+ SetRelationBuffersPersistence(r, persistence == RELPERSISTENCE_PERMANENT);
+ table_close(r, NoLock);
+ }
+ table_close(classRel, RowExclusiveLock);
+
+ return true;
+}
+
/*
* ATRewriteTables: ALTER TABLE phase 3
*/
@@ -5038,45 +5169,51 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- /*
- * Create transient table that will receive the modified data.
- *
- * Ensure it is marked correctly as logged or unlogged. We have
- * to do this here so that buffers for the new relfilenode will
- * have the right persistence set, and at the same time ensure
- * that the original filenode's buffers will get read in with the
- * correct setting (i.e. the original one). Otherwise a rollback
- * after the rewrite would possibly result with buffers for the
- * original filenode having the wrong persistence setting.
- *
- * NB: This relies on swap_relation_files() also swapping the
- * persistence. That wouldn't work for pg_class, but that can't be
- * unlogged anyway.
- */
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, persistence,
- lockmode);
+ if (tab->rewrite != AT_REWRITE_ALTER_PERSISTENCE ||
+ !try_inplace_persistence_change(tab, persistence, lockmode))
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new relfilenode
+ * will have the right persistence set, and at the same time
+ * ensure that the original filenode's buffers will get read in
+ * with the correct setting (i.e. the original one). Otherwise
+ * a rollback after the rewrite would possibly result with
+ * buffers for the original filenode having the wrong
+ * persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that can't
+ * be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, persistence,
+ lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
+ /*
+ * Swap the physical files of the old and new heaps, then
+ * rebuild indexes and discard the old heap. We can use
+ * RecentXmin for the table's new relfrozenxid because we
+ * rewrote all the tuples in ATRewriteTable, so no older Xid
+ * remains in the table. Also, we never try to swap toast
+ * tables by content, since we have no interest in letting this
+ * code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
+ }
}
else
{
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ad0d1a9abc..c71e1a5f92 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3033,6 +3033,80 @@ DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
}
}
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a relation
+ * and, when switching to PERMANENT, writes all dirty pages of the relation
+ * out to disk (or more accurately, out to kernel disk buffers),
+ * ensuring that the kernel has an up-to-date view of the relation.
+ *
+ * Generally, the caller should be holding AccessExclusiveLock on the
+ * target relation to ensure that no other backend is busy dirtying
+ * more blocks of the relation; the effects can't be expected to last
+ * after the lock is released.
+ *
+ * XXX currently it sequentially searches the buffer pool, should be
+ * changed to more clever ways of searching. This routine is not
+ * used in any performance-critical code paths, so it's not worth
+ * adding additional overhead to normal paths to make it go faster;
+ * but see also DropRelFileNodeBuffers.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(Relation rel, bool permanent)
+{
+ int i;
+ RelFileNodeBackend rnode = rel->rd_smgr->smgr_rnode;
+
+ Assert (!RelFileNodeBackendIsTemp(rnode));
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ {
+ ereport(LOG, (errmsg ("#%d: %d", i, (buf_state & BM_PERMANENT) == 0), errhidestmt(true)));
+ if (permanent)
+ {
+ Assert ((buf_state & BM_PERMANENT) == 0);
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /* we flush this buffer when switching to PERMANENT */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) ==
+ (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, rel->rd_smgr);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr, true);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ Assert ((buf_state & BM_PERMANENT) != 0);
+ buf_state &= ~BM_PERMANENT;
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ ereport(LOG, (errmsg ("#%d: -> %d", i, (buf_state & BM_PERMANENT) == 0), errhidestmt(true)));
+ }
+ }
+}
+
/* ---------------------------------------------------------------------
* DropRelFileNodesAllBuffers
*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index dcc09df0c7..5eb9e97b3d 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -645,6 +645,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 387eb34a61..1d19278a18 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -451,6 +451,15 @@ typedef struct TableAmRoutine
TransactionId *freezeXid,
MultiXactId *minmulti);
+ /*
+ * This callback needs to switch persistence of the relation between
+ * RELPERSISTENCE_PERMANENT and RELPERSISTENCE_UNLOGGED. Actual change on
+ * storage is performed elsewhere.
+ *
+ * See also table_relation_set_persistence().
+ */
+ void (*relation_set_persistence) (Relation rel, char persistence);
+
/*
* This callback needs to remove all contents from `rel`'s current
* relfilenode. No provisions for transactional behaviour need to be made.
@@ -1404,6 +1413,18 @@ table_relation_set_new_filenode(Relation rel,
freezeXid, minmulti);
}
+/*
+ * Switch storage persistence between RELPERSISTENCE_PERMANENT and
+ * RELPERSISTENCE_UNLOGGED.
+ *
+ * This is used during in-place persistence switching
+ */
+static inline void
+table_relation_set_persistence(Relation rel, char persistence)
+{
+ rel->rd_tableam->relation_set_persistence(rel, persistence);
+}
+
/*
* Remove all table contents from `rel`, in a non-transactional manner.
* Non-transactional meaning that there's no need to support rollbacks. This
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 30c38e0ca6..43d2eb0fb4 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -23,6 +23,8 @@
extern int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
+extern void RelationCreateInitFork(RelFileNode rel, bool isRedo);
+extern void RelationDropInitFork(RelFileNode rel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationPreTruncate(Relation rel);
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index 7b21cab2e0..73ad2ae89e 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -29,6 +29,7 @@
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_UNLINK 0x30
typedef struct xl_smgr_create
{
@@ -36,6 +37,12 @@ typedef struct xl_smgr_create
ForkNumber forkNum;
} xl_smgr_create;
+typedef struct xl_smgr_unlink
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+} xl_smgr_unlink;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +58,7 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ee91b8fa26..f65a273999 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -205,6 +205,7 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels)
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
+extern void SetRelationBuffersPersistence(Relation rnode, bool permanent);
extern void DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes);
extern void DropDatabaseBuffers(Oid dbid);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index f28a842401..5d74631006 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -86,6 +86,7 @@ extern void smgrclose(SMgrRelation reln);
extern void smgrcloseall(void);
extern void smgrclosenode(RelFileNodeBackend rnode);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
Greetings,
* Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote:
At Tue, 10 Nov 2020 09:33:12 -0500, Stephen Frost <sfrost@snowman.net> wrote in
* Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote:
For fuel(?) of the discussion, I tried a very-quick PoC for in-place
ALTER TABLE SET LOGGED/UNLOGGED and resulted as attached. After some
trials of several ways, I drifted to the following way after poking
several ways.
1. Flip BM_PERMANENT of active buffers
2. adding/removing init fork
3. sync files,
4. Flip pg_class.relpersistence.
It always skips table copy in the SET UNLOGGED case, and only when
wal_level=minimal in the SET LOGGED case. Crash recovery seems
working by some brief testing by hand.
Somehow missed that this patch more-or-less does what I was referring to
down-thread, but I did want to mention that it looks like it's missing a
necessary FlushRelationBuffers() call before the sync, otherwise there
could be dirty buffers for the relation that's being set to LOGGED (with
wal_level=minimal), which wouldn't be good. See the comments above
smgrimmedsync().
Right. Thanks. However, SetRelFileNodeBuffersPersistence(), which is
called just above, already scans shared buffers, so I don't want to
call FlushRelationBuffers() separately. Instead, I added the buffer
flush to SetRelFileNodeBuffersPersistence().
Maybe I'm missing something, but it sure looks like in the patch that
SetRelFileNodeBuffersPersistence() is being called after the
smgrimmedsync() call, and I don't think you get to just switch the order
of those- the sync is telling the kernel to make sure it's written to
disk, while the FlushBuffer() is just writing it into the kernel but
doesn't provide any guarantee that the data has actually made it to
disk. We have to FlushBuffer() first, and then call smgrimmedsync().
Perhaps there's a way to avoid having to go through shared buffers
twice, and I generally agree it'd be good if we could avoid doing so,
but this approach doesn't look like it actually works.
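The ordering constraint can be illustrated outside PostgreSQL with plain POSIX calls. In this minimal sketch, write() plays the role of FlushBuffer() (the data only reaches the kernel) and fsync() plays the role of smgrimmedsync() (the kernel is asked to make it durable). The function name `flush_then_sync` is hypothetical, and the analogy is an assumption made for illustration, not how the patch itself is structured.

```c
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper, not PostgreSQL code: write a page to a file and
 * then force it to stable storage. Returns 0 on success, -1 on failure.
 */
int
flush_then_sync(const char *path, const char *page, size_t len)
{
	int			fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

	if (fd < 0)
		return -1;

	/* Step 1: hand the data to the kernel (analogous to FlushBuffer()). */
	if (write(fd, page, len) != (ssize_t) len)
	{
		close(fd);
		return -1;
	}

	/*
	 * Step 2: only now ask the kernel to persist it (analogous to
	 * smgrimmedsync()). Syncing before the write would not cover data
	 * that had not yet been handed to the kernel.
	 */
	if (fsync(fd) != 0)
	{
		close(fd);
		return -1;
	}

	return close(fd);
}
```

Reversing the two steps would still compile and run, which is why the mistake is easy to miss: the sync simply would not cover the pages written afterwards.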
FWIW this is a revised version of the PoC, which has some known
problems.
- Flipping of buffer persistence is not WAL-logged, nor can it even be
safely rolled back. (It might be better to drop buffers.)
Not sure if it'd be better to drop buffers or not, but figuring out how
to deal with rollback seems pretty important. How is the persistence
change in the catalog not WAL-logged though..?
- This version handles indexes but does not yet handle toast relations.
Would need to be fixed, of course.
- tableAMs are supposed to support this feature. (but I'm not sure
it's worth allowing them not to do so).
Seems like they should.
Thanks,
Stephen
Hi,
I suggest outlining what you are trying to achieve here. Starting a new
thread and expecting people to dig through another thread to infer what
you are actually trying to achieve isn't great.
FWIW, I'm *extremely* doubtful it's worth adding features that depend on
a PGC_POSTMASTER wal_level=minimal being used. Which this does, as far as
I understand. If somebody added support for dynamically adapting
wal_level (e.g. wal_level=auto, that increases wal_level to
replica/logical depending on the presence of replication slots), it'd
perhaps be different.
On 2020-11-11 17:33:17 +0900, Kyotaro Horiguchi wrote:
FWIW this is a revised version of the PoC, which has some known
problems.
- Flipping of buffer persistence is not WAL-logged, nor can it even be
safely rolled back. (It might be better to drop buffers.)
That's obviously a no-go. I think you might be able to address this if
you accept that the command cannot be run in a transaction (like
CONCURRENTLY). Then you can first do the catalog changes, change the
persistence level, and commit.
Greetings,
Andres Freund
At Wed, 11 Nov 2020 14:18:04 -0800, Andres Freund <andres@anarazel.de> wrote in
Hi,
I suggest outlining what you are trying to achieve here. Starting a new
thread and expecting people to dig through another thread to infer what
you are actually trying to achieve isn't great.
Agreed. I'll post that. Thanks.
FWIW, I'm *extremely* doubtful it's worth adding features that depend on
a PGC_POSTMASTER wal_level=minimal being used. Which this does, as far as
I understand. If somebody added support for dynamically adapting
wal_level (e.g. wal_level=auto, that increases wal_level to
replica/logical depending on the presence of replication slots), it'd
perhaps be different.
Yes, this depends on wal_level=minimal for switching from UNLOGGED to
LOGGED. That is similar to the wal_level=minimal optimization for
COPY/INSERT into tables created in the same transaction, and it extends
that optimization to COPY/INSERT into existing tables, which seems
worth doing.
Switching to LOGGED needs to emit the initial state to WAL... Hmm. I
have come to think that even in that case skipping the table copy
reduces I/O significantly, even though FPI WAL is emitted.
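The rule applied by the posted PoC (always in place for SET UNLOGGED, in place for SET LOGGED only under wal_level=minimal) can be written as a small predicate. This is a sketch, not the patch's code: `can_change_in_place()` is a hypothetical name, and the enum and persistence constants are local stand-ins defined here so the example compiles on its own.

```c
#include <stdbool.h>

/* Local stand-ins for the server's wal_level settings. */
typedef enum ToyWalLevel
{
	TOY_WAL_LEVEL_MINIMAL,
	TOY_WAL_LEVEL_REPLICA,
	TOY_WAL_LEVEL_LOGICAL
} ToyWalLevel;

/* Local stand-ins for pg_class.relpersistence values. */
#define TOY_RELPERSISTENCE_PERMANENT 'p'
#define TOY_RELPERSISTENCE_UNLOGGED  'u'

/*
 * Return true if the persistence change can skip the table copy under
 * the PoC's rule. Switching to permanent with wal_level above minimal
 * would require the table contents to be reconstructible from WAL, so
 * the PoC falls back to the normal rewrite in that case.
 */
bool
can_change_in_place(char new_persistence, ToyWalLevel wal_level)
{
	if (new_persistence == TOY_RELPERSISTENCE_PERMANENT &&
		wal_level >= TOY_WAL_LEVEL_REPLICA)
		return false;

	return true;
}
```

If FPI records were emitted for every page when switching to LOGGED, as considered above, the first branch would no longer need to refuse; that variant is not shown here.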
On 2020-11-11 17:33:17 +0900, Kyotaro Horiguchi wrote:
FWIW this is a revised version of the PoC, which has some known
problems.
- Flipping of buffer persistence is not WAL-logged, nor can it even be
safely rolled back. (It might be better to drop buffers.)
That's obviously a no-go. I think you might be able to address this if
you accept that the command cannot be run in a transaction (like
CONCURRENTLY). Then you can first do the catalog changes, change the
persistence level, and commit.
Of course. The next version reverts persistence change at abort.
Thanks!
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Wed, 11 Nov 2020 09:56:44 -0500, Stephen Frost <sfrost@snowman.net> wrote in
Greetings,
* Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote:
At Tue, 10 Nov 2020 09:33:12 -0500, Stephen Frost <sfrost@snowman.net> wrote in
* Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote:
For fuel(?) of the discussion, I tried a very-quick PoC for in-place
ALTER TABLE SET LOGGED/UNLOGGED and resulted as attached. After some
trials of several ways, I drifted to the following way after poking
several ways.1. Flip BM_PERMANENT of active buffers
2. adding/removing init fork
3. sync files,
4. Flip pg_class.relpersistence.It always skips table copy in the SET UNLOGGED case, and only when
wal_level=minimal in the SET LOGGED case. Crash recovery seems
working by some brief testing by hand.Somehow missed that this patch more-or-less does what I was referring to
down-thread, but I did want to mention that it looks like it's missing a
necessary FlushRelationBuffers() call before the sync, otherwise there
could be dirty buffers for the relation that's being set to LOGGED (with
wal_level=minimal), which wouldn't be good. See the comments above
smgrimmedsync().
Right. Thanks. However, since SetRelFileNodeBuffersPersistence()
called just above scans shared buffers so I don't want to just call
FlushRelationBuffers() separately. Instead, I added buffer-flush to
SetRelFileNodeBuffersPersistence().
Maybe I'm missing something, but it sure looks like in the patch that
SetRelFileNodeBuffersPersistence() is being called after the
smgrimmedsync() call, and I don't think you get to just switch the order
of those- the sync is telling the kernel to make sure it's written to
disk, while the FlushBuffer() is just writing it into the kernel but
doesn't provide any guarantee that the data has actually made it to
disk. We have to FlushBuffer() first, and then call smgrimmedsync().
Perhaps there's a way to avoid having to go through shared buffers
twice, and I generally agreed it'd be good if we could avoid doing so,
but this approach doesn't look like it actually works.
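The ordering argument above can be illustrated outside PostgreSQL with plain POSIX calls. The following is only a minimal sketch, not code from the patch: `flush_then_sync` is a hypothetical helper in which `write()` plays the role of FlushBuffer() (data reaches the kernel only) and `fsync()` plays the role of smgrimmedsync() (the kernel is asked to make what it already has durable).

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical analogue of "FlushBuffer() first, then smgrimmedsync()".
 * Returns 0 on success, -1 on any error.
 */
static int
flush_then_sync(const char *path, const char *page, size_t len)
{
    int     fd;
    ssize_t written;

    assert(path != NULL && page != NULL);

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* Step 1: write() only copies the page into kernel buffers. */
    written = write(fd, page, len);
    if (written < 0 || (size_t) written != len)
    {
        close(fd);
        return -1;
    }

    /*
     * Step 2: fsync() covers only data the kernel has already received,
     * so it has to come after the write, never before it.
     */
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }

    return close(fd);
}
```

Calling `fsync()` before the `write()` would return successfully but guarantee nothing about the page written afterwards, which is the problem pointed out for the patch.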
Yeah, sorry for the half-baked version. I was confused about the
order at the time. The next version works like this:
LOGGED->UNLOGGED
  <collect reloids to process>
  for each relation:
    <set buffer persistence to !BM_PERMANENT (WAL-logged if wal_level > minimal)>
    <create init fork>
    if it is an index, call ambuildempty() (which syncs the init fork)
    else WAL-log smgr_create, then sync the init file.
    <update catalog>
  ...
  commit time:
    <do nothing>
  abort time:
    <unlink init fork>
    <revert buffer persistence>
UNLOGGED->LOGGED
  <collect reloids to process>
  for each relation:
    <set buffer persistence to BM_PERMANENT (WAL-logged if wal_level > minimal)>
    <record drop-init-fork to pending-deletes>
    <sync storage files>
    <update catalog>
  ...
  commit time:
    <log smgrunlink>
    <smgrunlink init fork>
  abort time:
    <revert buffer persistence>
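The two outlines above can be checked against a toy model. This is a sketch under stated assumptions, not the patch's code: `ToyRel` and the `toy_*` functions are hypothetical names, the three fields stand in for pg_class.relpersistence, the BM_PERMANENT flag of the relation's buffers, and the existence of the init fork, and the catalog rollback is modeled by hand.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct ToyRel
{
    char relpersistence;    /* 'p' = permanent, 'u' = unlogged */
    bool buffers_permanent; /* stands in for BM_PERMANENT */
    bool has_init_fork;
} ToyRel;

/* LOGGED -> UNLOGGED: all work is done up front, abort undoes it. */
static void
toy_set_unlogged(ToyRel *rel, bool commit)
{
    assert(rel->relpersistence == 'p');

    rel->buffers_permanent = false; /* flip buffers */
    rel->has_init_fork = true;      /* create init fork */
    rel->relpersistence = 'u';      /* update catalog */

    if (!commit)
    {
        rel->has_init_fork = false;     /* unlink init fork */
        rel->buffers_permanent = true;  /* revert buffer persistence */
        rel->relpersistence = 'p';      /* catalog change rolls back */
    }
}

/* UNLOGGED -> LOGGED: the init fork is only dropped at commit. */
static void
toy_set_logged(ToyRel *rel, bool commit)
{
    assert(rel->relpersistence == 'u');

    rel->buffers_permanent = true;  /* flip buffers */
    rel->relpersistence = 'p';      /* update catalog */

    if (commit)
        rel->has_init_fork = false;     /* pending unlink runs */
    else
    {
        rel->buffers_permanent = false; /* revert buffer persistence */
        rel->relpersistence = 'u';      /* catalog change rolls back */
    }
}

/* An unlogged relation has an init fork and non-permanent buffers. */
static bool
toy_is_consistent(const ToyRel *rel)
{
    bool unlogged = (rel->relpersistence == 'u');

    return rel->has_init_fork == unlogged &&
           rel->buffers_permanent == !unlogged;
}
```

In this model every combination of direction and outcome ends in a consistent state, which is the property the commit-time and abort-time actions are meant to provide.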
FWIW this is a revised version of the PoC, which has some known
problems.
- Flipping of Buffer persistence is not WAL-logged nor even be able to
be safely roll-backed. (It might be better to drop buffers).
Not sure if it'd be better to drop buffers or not, but figuring out how
to deal with rollback seems pretty important. How is the persistence
change in the catalog not WAL-logged though..?
Rollback works as described above. The buffer persistence change is
registered in pending-deletes. The persistence change in the catalog is
rolled back in the ordinary way (that is, automatically).
If wal_level > minimal, the persistence change of buffers is propagated
to standbys by WAL. I'm not sure we need WAL-logging otherwise, but the
next version emits WAL anyway, since SMGR_CREATE is always logged by
existing code.
- This version handles indexes but not yet handle toast relations.
Would need to be fixed, of course.
Fixed.
- tableAMs are supposed to support this feature. (but I'm not sure
it's worth allowing them not to do so).
Seems like they should.
The init fork of an index relation needs a call to ambuildempty() instead
of the "log_smgrcreate-smgrimmedsync" sequence after smgrcreate. Instead
of adding a similar interface to indexAM, I reverted the tableam changes
and made RelationCreate/DropInitFork() do that directly. That introduces
a new include of amapi.h into storage.c, which makes me a bit uneasy.
The previous version gave up the in-place persistence change in the
case of SET LOGGED with wal_level > minimal, since that needs WAL to
be emitted. However, the in-place change still has the advantage of not
running a table copy, so the next version always performs the
persistence change in-place.
As suggested by Andres, I'll send a summary of this patch. The patch
will be attached to the coming mail.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello. Before posting the next version, I'd like to explain what this
patch is.
1. The Issue
Bulk data loading is a time-consuming, I/O-intensive task. Many
DBAs want that task to be faster, even at the cost of an increased risk
of data loss. wal_level=minimal is an answer to such a
request. Data loading into a table that is created in the current
transaction omits WAL-logging, and the table is synced at commit.
However, the optimization doesn't benefit the case where the
data is loaded into existing tables. There are quite a few
cases where data is loaded into tables that already contain a lot of
data. Those cases get no benefit from the optimization.
Another possible solution for bulk data loading is UNLOGGED
tables. But when we switch a table between LOGGED and UNLOGGED, all
the table content is copied to a newly created heap, which is costly.
2. Proposed Solutions
Two solutions have been proposed and discussed on this mailing
list. One is wal_level = none (*1), which omits WAL-logging almost
entirely. The other is extending the existing optimization to the ALTER
TABLE SET LOGGED/UNLOGGED cases, which is to be discussed in this new
thread.
3. In-place Persistence Change
The attached is a PoC patch of the latter solution. When we
want to change table persistence in-place, basically we need to do the
following steps.
(the table is exclusively locked)
(1) Flip BM_PERMANENT flag of all shared buffer blocks for the heap.
(2) Create or delete the init fork for existing heap.
(3) Flush all buffers of the relation to file system.
(4) Sync heap files.
(5) Make catalog changes.
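Steps (1) and (3) above amount to one pass over the buffer pool. The sketch below is a simplified stand-alone model, not the patch's SetRelationBuffersPersistence(): `ToyBuffer`, the `TOY_BM_*` flags and `toy_set_buffers_persistence` are hypothetical, locking and pinning are omitted, and "flushing" is modeled as clearing the dirty bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BM_PERMANENT (1u << 0)
#define TOY_BM_DIRTY     (1u << 1)

typedef struct ToyBuffer
{
    uint32_t relnode;   /* which relation this buffer belongs to */
    uint32_t state;     /* TOY_BM_* flags */
} ToyBuffer;

/*
 * Flip the permanent flag on every buffer of one relation in a single
 * scan.  When switching to permanent, dirty buffers are "flushed" in the
 * same pass.  Returns the number of buffers flushed.
 */
static size_t
toy_set_buffers_persistence(ToyBuffer *bufs, size_t nbufs,
                            uint32_t relnode, bool permanent)
{
    size_t flushed = 0;

    assert(bufs != NULL || nbufs == 0);

    for (size_t i = 0; i < nbufs; i++)
    {
        if (bufs[i].relnode != relnode)
            continue;           /* buffer of some other relation */

        if (permanent)
        {
            bufs[i].state |= TOY_BM_PERMANENT;
            if (bufs[i].state & TOY_BM_DIRTY)
            {
                bufs[i].state &= ~TOY_BM_DIRTY; /* stands in for a flush */
                flushed++;
            }
        }
        else
            bufs[i].state &= ~TOY_BM_PERMANENT;
    }

    return flushed;
}
```

Doing the flush inside the same loop is what avoids a second scan of shared buffers; the file sync of step (4) still has to happen after this pass.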
4. Transactionality
Steps 1, 2 and 5 above need to be abortable. Step 5 is rolled back by
existing infrastructure, and rollback of steps 1 and 2 is achieved by
piggybacking on the pendingDeletes mechanism.
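The piggybacking idea can be sketched as a list of pending entries, each carrying an operation mask and a flag that says whether it fires at commit or at abort. This is a simplified model with hypothetical names (`ToyPending`, `TOY_OP_*`, `toy_register`, `toy_do_pending`); it ignores transaction nesting levels and WAL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define TOY_OP_UNLINK_INIT 0x01 /* remove the init fork */
#define TOY_OP_SET_PERSIST 0x02 /* set buffer persistence */

typedef struct ToyPending
{
    int                op;          /* mask of TOY_OP_* */
    bool               persistence; /* value for TOY_OP_SET_PERSIST */
    bool               at_commit;   /* true: run at commit; false: at abort */
    struct ToyPending *next;
} ToyPending;

typedef struct ToyState
{
    bool has_init_fork;
    bool buffers_permanent;
} ToyState;

/* Push a new entry onto the list and return the new head. */
static ToyPending *
toy_register(ToyPending *head, int op, bool persistence, bool at_commit)
{
    ToyPending *p = malloc(sizeof(ToyPending));

    assert(p != NULL);
    p->op = op;
    p->persistence = persistence;
    p->at_commit = at_commit;
    p->next = head;
    return p;
}

/* Run the entries that match the transaction outcome, free all of them. */
static void
toy_do_pending(ToyPending *head, bool is_commit, ToyState *st)
{
    while (head != NULL)
    {
        ToyPending *next = head->next;

        if (head->at_commit == is_commit)
        {
            if (head->op & TOY_OP_UNLINK_INIT)
                st->has_init_fork = false;
            if (head->op & TOY_OP_SET_PERSIST)
                st->buffers_permanent = head->persistence;
        }
        free(head);
        head = next;
    }
}
```

For SET UNLOGGED, for example, one abort-time entry with both bits set and `persistence = true` is enough to undo steps 1 and 2, while commit leaves the new state untouched.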
5. Replication
Furthermore, those changes ought to be replicable to standbys. Catalog
changes are replicated as usual.
On-the-fly creation of the init fork leads to a recovery mess. Even
though it is removed at abort, if the server crashes before the
transaction ends, the file is left behind and corrupts the database in
the next recovery. I sought a way to create the init fork in
smgrPendingDelete, but that needs relcache, and relcache is not
available that late in commit. Finally, I introduced a fifth fork
kind "INITTMP" (_itmp) only to signal that the init file is not
committed. I don't like that way, but it seems to work fine...
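One plausible reading of the sentinel fork is the following decision at crash recovery. This sketch is an assumption about the intended behavior rather than a transcription of the patch's reinit.c changes; the `TOY_FORK_*` flags, `ToyRecoveryAction` and `toy_classify` are hypothetical names.

```c
#include <assert.h>

#define TOY_FORK_MAIN    0x01
#define TOY_FORK_INIT    0x02
#define TOY_FORK_INITTMP 0x04

typedef enum ToyRecoveryAction
{
    TOY_KEEP_AS_IS,                 /* ordinary permanent relation */
    TOY_RESET_FROM_INIT,            /* committed unlogged relation */
    TOY_DISCARD_UNCOMMITTED_INIT    /* init fork was never committed */
} ToyRecoveryAction;

/*
 * Decide what recovery should do with a relation, given which of its
 * fork files were found on disk.
 */
static ToyRecoveryAction
toy_classify(unsigned forks)
{
    /* Only the three forks modeled here are expected. */
    assert((forks & ~(TOY_FORK_MAIN | TOY_FORK_INIT | TOY_FORK_INITTMP)) == 0);

    /*
     * The sentinel wins: if it is still present, the transaction that
     * created the init fork did not finish, so the init fork must not be
     * used to reset the relation.
     */
    if (forks & TOY_FORK_INITTMP)
        return TOY_DISCARD_UNCOMMITTED_INIT;

    if (forks & TOY_FORK_INIT)
        return TOY_RESET_FROM_INIT;

    return TOY_KEEP_AS_IS;
}
```

Without the sentinel, the second and third cases would be indistinguishable on disk, and recovery would wipe a table that is still LOGGED according to the catalog.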
6. SQL Command
The second file in the patchset adds a syntax that changes persistence
of all tables in a tablespace.
ALTER TABLE ALL IN TABLESPACE <tsp> SET LOGGED/UNLOGGED [ NOWAIT ];
7. Testing
I tried to write a TAP test for this, but IPC::Run::harness (or
interactive_psql) doesn't seem to work for me. I'm not sure what
exactly is happening, but pty redirection doesn't work.
$in = "ls\n"; $out = ""; run ["/usr/bin/bash"], \$in, \$out; print $out;
works but
$in = "ls\n"; $out = ""; run ["/usr/bin/bash"], '<pty<', \$in, '>pty>', \$out; print $out;
doesn't respond.
The patch is attached.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v3-0001-In-place-table-persistence-change.patch (text/x-patch; charset=us-ascii)
From 05d1971d0f4f0f42899f5d6857892128487eeb40 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v3 1/2] In-place table persistence change
Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs a heap rewrite, which causes a large amount
of file I/O. This patch makes the command run without a heap rewrite.
In addition, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of a massive number of HEAP_INSERTs, which should be
smaller.
---
src/backend/access/rmgrdesc/smgrdesc.c | 23 ++
src/backend/catalog/storage.c | 355 +++++++++++++++++++++++--
src/backend/commands/tablecmds.c | 217 ++++++++++++---
src/backend/storage/buffer/bufmgr.c | 88 ++++++
src/backend/storage/file/reinit.c | 206 ++++++++------
src/backend/storage/smgr/smgr.c | 6 +
src/common/relpath.c | 3 +-
src/include/catalog/storage.h | 2 +
src/include/catalog/storage_xlog.h | 16 ++
src/include/common/relpath.h | 5 +-
src/include/storage/bufmgr.h | 4 +
src/include/storage/smgr.h | 1 +
12 files changed, 784 insertions(+), 142 deletions(-)
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index a7c0cb1bc3..097dacfee6 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,23 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+ char *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+ appendStringInfoString(buf, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+ char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+ appendStringInfoString(buf, path);
+ appendStringInfo(buf, " persistence %d", xlrec->persistence);
+ pfree(path);
+ }
}
const char *
@@ -55,6 +72,12 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_UNLINK:
+ id = "UNLINK";
+ break;
+ case XLOG_SMGR_BUFPERSISTENCE:
+ id = "BUFPERSISTENCE";
+ break;
}
return id;
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index d538f25726..0f1649758f 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,6 +19,7 @@
#include "postgres.h"
+#include "access/amapi.h"
#include "access/parallel.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -57,9 +58,19 @@ int wal_skip_threshold = 2048; /* in kilobytes */
* but I'm being paranoid.
*/
+
+/* These are bit flags, not ordinal numbers */
+#define PDOP_DELETE 0x00
+#define PDOP_UNLINK_FORK 0x01
+#define PDOP_SET_PERSISTENCE 0x02
+
+
typedef struct PendingRelDelete
{
RelFileNode relnode; /* relation that may need to be deleted */
+ int op; /* operation mask */
+ bool bufpersistence; /* buffer persistence to set */
+ int unlink_forknum; /* forknum to unlink */
BackendId backend; /* InvalidBackendId if not a temp rel */
bool atCommit; /* T=delete at commit; F=delete at abort */
int nestLevel; /* xact nesting level of request */
@@ -153,6 +164,7 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
pending = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
pending->relnode = rnode;
+ pending->op = PDOP_DELETE;
pending->backend = backend;
pending->atCommit = false; /* delete if abort */
pending->nestLevel = GetCurrentTransactionNestLevel();
@@ -168,6 +180,209 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return srel;
}
+/*
+ * RelationCreateInitFork
+ * Create physical storage for the init fork of a relation.
+ *
+ * Create the init fork for the relation.
+ *
+ * This function is transactional. The creation is WAL-logged, and if the
+ * transaction aborts later on, the init fork will be removed.
+ */
+void
+RelationCreateInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingRelDelete *pending;
+ SMgrRelation srel;
+ PendingRelDelete *prev;
+ PendingRelDelete *next;
+ bool create = true;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(rel->rd_smgr, false, false);
+
+ /*
+ * If we have entries for init-fork operation of this relation, that means
+ * that we have already registered pending sync entries to drop preexisting
+ * init fork since before the current transaction started. This function
+ * reverts that change just by removing the entries.
+ */
+ prev = NULL;
+ for (pending = pendingDeletes; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ pending->op != PDOP_DELETE)
+ {
+ if (prev)
+ prev->next = next;
+ else
+ pendingDeletes = next;
+ pfree(pending);
+
+ create = false;
+ }
+ else
+ {
+ /* unrelated entry, don't touch it */
+ prev = pending;
+ }
+ }
+
+ if (!create)
+ return;
+
+ /* We don't have existing init fork, create it. */
+ srel = smgropen(rnode, InvalidBackendId);
+ smgrcreate(srel, INIT_FORKNUM, false);
+
+ /*
+ * index-init fork needs further initialization. ambuildempty should do
+ * WAL-logging and file sync by itself; otherwise we do that ourselves.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rnode, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /*
+ * We have created the init fork. If the server crashes before the current
+ * transaction ends, the init fork left behind corrupts data during recovery.
+ * The inittmp fork works as the sentinel to identify that situation.
+ */
+ smgrcreate(srel, INITTMP_FORKNUM, false);
+ log_smgrcreate(&rnode, INITTMP_FORKNUM);
+ smgrimmedsync(srel, INITTMP_FORKNUM);
+
+ /* drop this init fork file at abort and revert persistence */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_FORK | PDOP_SET_PERSISTENCE;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->bufpersistence = true;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+
+ /* drop inittmp fork at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_FORK;
+ pending->unlink_forknum = INITTMP_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+
+ /* drop inittmp fork at commit*/
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_FORK;
+ pending->unlink_forknum = INITTMP_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+}
+
+/*
+ * RelationDropInitFork
+ * Delete physical storage for the init fork of a relation.
+ *
+ * Register pending-delete of the init fork. The real deletion is performed by
+ * smgrDoPendingDeletes at commit.
+ *
+ * This function is transactional. If the transaction aborts later on, the
+ * deletion doesn't happen.
+ */
+void
+RelationDropInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingRelDelete *pending;
+ PendingRelDelete *prev;
+ PendingRelDelete *next;
+ bool inxact_created = false;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(rel->rd_smgr, true, false);
+
+ /*
+ * If we have entries for init-fork operation of this relation, that means
+ * that we have created the init fork in the current transaction. We
+ * remove the init and inittmp forks immediately in that case.
+ * Otherwise just register pending-delete for the existing init fork.
+ */
+ prev = NULL;
+ for (pending = pendingDeletes; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ pending->op != PDOP_DELETE)
+ {
+ /* unlink list entry */
+ if (prev)
+ prev->next = next;
+ else
+ pendingDeletes = next;
+ pfree(pending);
+
+ inxact_created = true;
+ }
+ else
+ {
+ /* unrelated entry, don't touch it */
+ prev = pending;
+ }
+ }
+
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rnode, InvalidBackendId);
+ smgrclose(srel);
+ log_smgrunlink(&rnode, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ log_smgrunlink(&rnode, INITTMP_FORKNUM);
+ smgrunlink(srel, INITTMP_FORKNUM, false);
+ return;
+ }
+
+ /* register drop of this init fork file at commit */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_FORK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_SET_PERSISTENCE;
+ pending->bufpersistence = false;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+}
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -187,6 +402,44 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum)
+{
+ xl_smgr_unlink xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file unlink.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileNode *rnode, bool persistence)
+{
+ xl_smgr_bufpersistence xlrec;
+
+ /*
+ * Make an XLOG entry reporting the buffer persistence change.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.persistence = persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -200,6 +453,7 @@ RelationDropStorage(Relation rel)
pending = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
pending->relnode = rel->rd_node;
+ pending->op = PDOP_DELETE;
pending->backend = rel->rd_backend;
pending->atCommit = true; /* delete if commit */
pending->nestLevel = GetCurrentTransactionNestLevel();
@@ -606,43 +860,68 @@ smgrDoPendingDeletes(bool isCommit)
prev = NULL;
for (pending = pendingDeletes; pending != NULL; pending = next)
{
+ SMgrRelation srel;
+
next = pending->next;
if (pending->nestLevel < nestLevel)
{
/* outer-level entries should not be processed yet */
prev = pending;
+ continue;
}
+
+ /* unlink list entry first, so we don't retry on failure */
+ if (prev)
+ prev->next = next;
else
+ pendingDeletes = next;
+
+ if (pending->atCommit != isCommit)
{
- /* unlink list entry first, so we don't retry on failure */
- if (prev)
- prev->next = next;
- else
- pendingDeletes = next;
- /* do deletion if called for */
- if (pending->atCommit == isCommit)
- {
- SMgrRelation srel;
-
- srel = smgropen(pending->relnode, pending->backend);
-
- /* allocate the initial array, or extend it, if needed */
- if (maxrels == 0)
- {
- maxrels = 8;
- srels = palloc(sizeof(SMgrRelation) * maxrels);
- }
- else if (maxrels <= nrels)
- {
- maxrels *= 2;
- srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
- }
-
- srels[nrels++] = srel;
- }
/* must explicitly free the list entry */
pfree(pending);
/* prev does not change */
+ continue;
+ }
+
+ srel = smgropen(pending->relnode, pending->backend);
+
+ if (pending->op != PDOP_DELETE)
+ {
+ if (pending->op & PDOP_UNLINK_FORK)
+ {
+ BlockNumber block = 0;
+ RelFileNodeBackend rbnode;
+
+ rbnode.node = pending->relnode;
+ rbnode.backend = InvalidBackendId;
+
+ DropRelFileNodeBuffers(rbnode, &pending->unlink_forknum, 1,
+ &block);
+ smgrclose(srel);
+ log_smgrunlink(&pending->relnode, pending->unlink_forknum);
+ smgrunlink(srel, pending->unlink_forknum, false);
+ }
+
+ if (pending->op & PDOP_SET_PERSISTENCE)
+ SetRelationBuffersPersistence(srel, pending->bufpersistence,
+ false);
+ }
+ else
+ {
+ /* allocate the initial array, or extend it, if needed */
+ if (maxrels == 0)
+ {
+ maxrels = 8;
+ srels = palloc(sizeof(SMgrRelation) * maxrels);
+ }
+ else if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ srels[nrels++] = srel;
}
}
@@ -824,7 +1103,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
- && pending->backend == InvalidBackendId)
+ && pending->backend == InvalidBackendId &&
+ pending->op == PDOP_DELETE)
nrels++;
}
if (nrels == 0)
@@ -837,7 +1117,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
- && pending->backend == InvalidBackendId)
+ && pending->backend == InvalidBackendId &&
+ pending->op == PDOP_DELETE)
{
*rptr = pending->relnode;
rptr++;
@@ -917,6 +1198,15 @@ smgr_redo(XLogReaderState *record)
reln = smgropen(xlrec->rnode, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ smgrclose(reln);
+ smgrunlink(reln, xlrec->forkNum, true);
+ }
else if (info == XLOG_SMGR_TRUNCATE)
{
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
@@ -1005,6 +1295,15 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index e3cfaf8b07..29f786142a 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4916,6 +4916,142 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
tab->afterStmts = list_concat(tab->afterStmts, afterStmts);
return newcmd;
+}
+
+/*
+ * RelationChangePersistence: do in-place persistence change of a relation
+ */
+static void
+RelationChangePersistence(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ int i;
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * Under the following condition, we need to call ATRewriteTable, which
+ * cannot be false in the AT_REWRITE_ALTER_PERSISTENCE case.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ RelationOpenSmgr(rel);
+
+ /*
+ * First we collect all relations that we need to change persistence.
+ */
+
+ /* Collect OIDs of indexes and toast relations */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ /* Add toast relation if any */
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ List *toastidx;
+ Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
+
+ RelationOpenSmgr(toastrel);
+ relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
+ toastidx = RelationGetIndexList(toastrel);
+ relids = list_concat(relids, toastidx);
+ pfree(toastidx);
+ table_close(toastrel, NoLock);
+ }
+
+ table_close(rel, lockmode);
+
+ /* Make changes in storage */
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach (lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+
+ RelationOpenSmgr(r);
+
+ /* Create or drop init fork */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ RelationCreateInitFork(r);
+ else
+ RelationDropInitFork(r);
+
+ /*
+ * When this relation gets WAL-logged, immediately sync all files but
+ * initfork to establish the initial state on storage. Buffers have
+ * already been flushed out by RelationCreate(Drop)InitFork called just
+ * above. Initfork should have been synced as needed.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT)
+ {
+ for (i = 0 ; i < INIT_FORKNUM ; i++)
+ {
+ if (smgrexists(r->rd_smgr, i))
+ smgrimmedsync(r->rd_smgr, i);
+ }
+ }
+
+ /* Update catalog */
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+
+ /*
+ * While wal_level >= replica, switching to LOGGED requires the
+ * relation content to be WAL-logged so that the table can be recovered.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded())
+ {
+ ForkNumber fork;
+
+ for (fork = 0; fork < INIT_FORKNUM ; fork++)
+ {
+ if (smgrexists(r->rd_smgr, fork))
+ log_newpage_range(r, fork,
+ 0, smgrnblocks(r->rd_smgr, fork), false);
+ }
+ }
+
+ table_close(r, NoLock);
+ }
+
+ table_close(classRel, NoLock);
+
+
+
+
}
/*
@@ -5038,45 +5174,52 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- /*
- * Create transient table that will receive the modified data.
- *
- * Ensure it is marked correctly as logged or unlogged. We have
- * to do this here so that buffers for the new relfilenode will
- * have the right persistence set, and at the same time ensure
- * that the original filenode's buffers will get read in with the
- * correct setting (i.e. the original one). Otherwise a rollback
- * after the rewrite would possibly result with buffers for the
- * original filenode having the wrong persistence setting.
- *
- * NB: This relies on swap_relation_files() also swapping the
- * persistence. That wouldn't work for pg_class, but that can't be
- * unlogged anyway.
- */
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, persistence,
- lockmode);
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
+ RelationChangePersistence(tab, persistence, lockmode);
+ else
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new relfilenode
+ * will have the right persistence set, and at the same time
+ * ensure that the original filenode's buffers will get read in
+ * with the correct setting (i.e. the original one). Otherwise
+ * a rollback after the rewrite would possibly result with
+ * buffers for the original filenode having the wrong
+ * persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that can't
+ * be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, persistence,
+ lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
+ /*
+ * Swap the physical files of the old and new heaps, then
+ * rebuild indexes and discard the old heap. We can use
+ * RecentXmin for the table's new relfrozenxid because we
+ * rewrote all the tuples in ATRewriteTable, so no older Xid
+ * remains in the table. Also, we never try to swap toast
+ * tables by content, since we have no interest in letting this
+ * code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
+ }
}
else
{
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ad0d1a9abc..ddd0133cdf 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -37,6 +37,7 @@
#include "access/xlog.h"
#include "catalog/catalog.h"
#include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
#include "executor/instrument.h"
#include "lib/binaryheap.h"
#include "miscadmin.h"
@@ -3033,6 +3034,93 @@ DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
}
}
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a relation
+ * then writes all dirty pages of the relation out to disk when switching
+ * to PERMANENT. (or more accurately, out to kernel disk buffers),
+ * ensuring that the kernel has an up-to-date view of the relation.
+ *
+ * Generally, the caller should be holding AccessExclusiveLock on the
+ * target relation to ensure that no other backend is busy dirtying
+ * more blocks of the relation; the effects can't be expected to last
+ * after the lock is released.
+ *
+ * XXX currently it sequentially searches the buffer pool, should be
+ * changed to more clever ways of searching. This routine is not
+ * used in any performance-critical code paths, so it's not worth
+ * adding additional overhead to normal paths to make it go faster;
+ * but see also DropRelFileNodeBuffers.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
+{
+ int i;
+ RelFileNodeBackend rnode = srel->smgr_rnode;
+
+ Assert (!RelFileNodeBackendIsTemp(rnode));
+
+ if (!isRedo)
+ log_smgrbufpersistence(&srel->smgr_rnode.node, permanent);
+
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ {
+ UnlockBufHdr(bufHdr, buf_state);
+ continue;
+ }
+
+ if (permanent)
+ {
+ /* Init fork is being dropped, drop buffers for it. */
+ if (bufHdr->tag.forkNum == INIT_FORKNUM)
+ {
+ InvalidateBuffer(bufHdr);
+ continue;
+ }
+
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /* we flush this buffer when switching to PERMANENT */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, srel);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr, true);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ /* init fork is always BM_PERMANENT. See BufferAlloc */
+ if (bufHdr->tag.forkNum != INIT_FORKNUM)
+ buf_state &= ~BM_PERMANENT;
+
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ }
+}
+
/* ---------------------------------------------------------------------
* DropRelFileNodesAllBuffers
*
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 0c2094f766..6524262a74 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -31,6 +31,7 @@ static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
typedef struct
{
char oid[OIDCHARS + 1];
+ bool dirty;
} unlogged_relation_entry;
/*
@@ -151,6 +152,8 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
DIR *dbspace_dir;
struct dirent *de;
char rm_path[MAXPGPATH * 2];
+ HTAB *hash;
+ HASHCTL ctl;
/* Caller must specify at least one operation. */
Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0);
@@ -160,62 +163,73 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
* the files with init forks. Then, we go through again and nuke
* everything with the same OID except the init fork.
*/
+
+ /*
+ * It's possible that someone could create a ton of unlogged relations
+ * in the same database & tablespace, so we'd better use a hash table
+ * rather than an array or linked list to keep track of which files
+ * need to be reset. Otherwise, this cleanup operation would be
+ * O(n^2).
+ */
+ memset(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(unlogged_relation_entry);
+ ctl.entrysize = sizeof(unlogged_relation_entry);
+ hash = hash_create("unlogged hash", 32, &ctl, HASH_ELEM);
+
+ /* Scan the directory. */
+ dbspace_dir = AllocateDir(dbspacedirname);
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
+ {
+ ForkNumber forkNum;
+ int oidchars;
+ bool found;
+ unlogged_relation_entry key;
+ unlogged_relation_entry *ent;
+
+ /* Skip anything that doesn't look like a relation data file. */
+ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
+ &forkNum))
+ continue;
+
+ /* Also skip it unless this is the init fork. */
+ if (forkNum != INIT_FORKNUM && forkNum != INITTMP_FORKNUM)
+ continue;
+
+ /*
+ * Put the OID portion of the name into the hash table, if it
+ * isn't already.
+ */
+ memset(key.oid, 0, sizeof(key.oid));
+ memcpy(key.oid, de->d_name, oidchars);
+ ent = hash_search(hash, &key, HASH_ENTER, &found);
+
+ if (!found)
+ ent->dirty = 0;
+
+ /*
+ * If we have the inittmp fork, the transaction that created the
+ * corresponding init file was not committed nor aborted. Mark this
+ * init fork as dirty so that we can clean up them properly.
+ */
+ if (forkNum == INITTMP_FORKNUM)
+ ent->dirty = true;
+ }
+
+ /* Done with the first pass. */
+ FreeDir(dbspace_dir);
+
+ /*
+ * If we didn't find any init forks, there's no point in continuing;
+ * we can bail out now.
+ */
+ if (hash_get_num_entries(hash) == 0)
+ {
+ hash_destroy(hash);
+ return;
+ }
+
if ((op & UNLOGGED_RELATION_CLEANUP) != 0)
{
- HTAB *hash;
- HASHCTL ctl;
-
- /*
- * It's possible that someone could create a ton of unlogged relations
- * in the same database & tablespace, so we'd better use a hash table
- * rather than an array or linked list to keep track of which files
- * need to be reset. Otherwise, this cleanup operation would be
- * O(n^2).
- */
- memset(&ctl, 0, sizeof(ctl));
- ctl.keysize = sizeof(unlogged_relation_entry);
- ctl.entrysize = sizeof(unlogged_relation_entry);
- hash = hash_create("unlogged hash", 32, &ctl, HASH_ELEM);
-
- /* Scan the directory. */
- dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
- {
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
-
- /* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* Also skip it unless this is the init fork. */
- if (forkNum != INIT_FORKNUM)
- continue;
-
- /*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
- */
- memset(ent.oid, 0, sizeof(ent.oid));
- memcpy(ent.oid, de->d_name, oidchars);
- hash_search(hash, &ent, HASH_ENTER, NULL);
- }
-
- /* Done with the first pass. */
- FreeDir(dbspace_dir);
-
- /*
- * If we didn't find any init forks, there's no point in continuing;
- * we can bail out now.
- */
- if (hash_get_num_entries(hash) == 0)
- {
- hash_destroy(hash);
- return;
- }
-
/*
* Now, make a second pass and remove anything that matches.
*/
@@ -224,39 +238,48 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
{
ForkNumber forkNum;
int oidchars;
- bool found;
- unlogged_relation_entry ent;
+ unlogged_relation_entry key;
+ unlogged_relation_entry *ent;
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
&forkNum))
continue;
- /* We never remove the init fork. */
- if (forkNum == INIT_FORKNUM)
- continue;
-
/*
* See whether the OID portion of the name shows up in the hash
* table.
*/
- memset(ent.oid, 0, sizeof(ent.oid));
- memcpy(ent.oid, de->d_name, oidchars);
- hash_search(hash, &ent, HASH_FIND, &found);
+ memset(key.oid, 0, sizeof(key.oid));
+ memcpy(key.oid, de->d_name, oidchars);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);
- /* If so, nuke it! */
- if (found)
+ /* Don't remove files if corresponding init fork is not found */
+ if (!ent)
+ continue;
+
+ if (!ent->dirty)
+ {
+ /* Don't remove clean init file */
+ if (forkNum == INIT_FORKNUM)
+ continue;
+ }else
{
- snprintf(rm_path, sizeof(rm_path), "%s/%s",
- dbspacedirname, de->d_name);
- if (unlink(rm_path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m",
- rm_path)));
- else
- elog(DEBUG2, "unlinked file \"%s\"", rm_path);
+ /* Remove dirty init file, together with inittmp file */
+ if (forkNum != INIT_FORKNUM && forkNum != INITTMP_FORKNUM)
+ continue;
}
+
+ /* so, nuke it! */
+ snprintf(rm_path, sizeof(rm_path), "%s/%s",
+ dbspacedirname, de->d_name);
+ if (unlink(rm_path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ rm_path)));
+ else
+ elog(DEBUG2, "unlinked file \"%s\"", rm_path);
}
/* Cleanup is complete. */
@@ -273,6 +296,9 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
*/
if ((op & UNLOGGED_RELATION_INIT) != 0)
{
+ unlogged_relation_entry key;
+ unlogged_relation_entry *ent;
+
/* Scan the directory. */
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
@@ -288,6 +314,38 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
&forkNum))
continue;
+ /*
+ * See whether the OID portion of the name shows up in the hash
+ * table.
+ */
+ memset(key.oid, 0, sizeof(key.oid));
+ memcpy(key.oid, de->d_name, oidchars);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);
+
+ /* Don't init file that doesn't have the init fork. */
+ if (!ent)
+ continue;
+
+ if (ent->dirty &&
+ (forkNum == INIT_FORKNUM || forkNum == INITTMP_FORKNUM))
+ {
+ /*
+ * The init file is dirty. The files has been removed once at
+ * cleanup time but recovery can create them again. Remove both
+ * INIT and INITTMP files.
+ */
+ snprintf(rm_path, sizeof(rm_path), "%s/%s",
+ dbspacedirname, de->d_name);
+ if (unlink(rm_path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ rm_path)));
+ else
+ elog(DEBUG2, "unlinked file \"%s\"", rm_path);
+ continue;
+ }
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index dcc09df0c7..5eb9e97b3d 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -645,6 +645,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/common/relpath.c b/src/common/relpath.c
index ad733d1363..2a5e5fa990 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -34,7 +34,8 @@ const char *const forkNames[] = {
"main", /* MAIN_FORKNUM */
"fsm", /* FSM_FORKNUM */
"vm", /* VISIBILITYMAP_FORKNUM */
- "init" /* INIT_FORKNUM */
+ "init", /* INIT_FORKNUM */
+ "itmp" /* INITTMP_FORKNUM */
};
StaticAssertDecl(lengthof(forkNames) == (MAX_FORKNUM + 1),
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 30c38e0ca6..c2259cd7e3 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -23,6 +23,8 @@
extern int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
+extern void RelationCreateInitFork(Relation rel);
+extern void RelationDropInitFork(Relation rel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationPreTruncate(Relation rel);
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index 7b21cab2e0..d48b5288ce 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -29,6 +29,8 @@
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_UNLINK 0x30
+#define XLOG_SMGR_BUFPERSISTENCE 0x40
typedef struct xl_smgr_create
{
@@ -36,6 +38,18 @@ typedef struct xl_smgr_create
ForkNumber forkNum;
} xl_smgr_create;
+typedef struct xl_smgr_unlink
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+} xl_smgr_unlink;
+
+typedef struct xl_smgr_bufpersistence
+{
+ RelFileNode rnode;
+ bool persistence;
+} xl_smgr_bufpersistence;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +65,8 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrbufpersistence(const RelFileNode *rnode, bool persistence);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index 869cabcc0d..f6e1a74a38 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -43,7 +43,8 @@ typedef enum ForkNumber
MAIN_FORKNUM = 0,
FSM_FORKNUM,
VISIBILITYMAP_FORKNUM,
- INIT_FORKNUM
+ INIT_FORKNUM,
+ INITTMP_FORKNUM
/*
* NOTE: if you add a new fork, change MAX_FORKNUM and possibly
@@ -52,7 +53,7 @@ typedef enum ForkNumber
*/
} ForkNumber;
-#define MAX_FORKNUM INIT_FORKNUM
+#define MAX_FORKNUM INITTMP_FORKNUM
#define FORKNAMECHARS 4 /* max chars for a fork name */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ee91b8fa26..e2496ed1c8 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -168,6 +168,8 @@ extern PGDLLIMPORT int32 *LocalRefCount;
*/
#define BufferGetPage(buffer) ((Page)BufferGetBlock(buffer))
+struct SmgrRelationData;
+
/*
* prototypes for functions in bufmgr.c
*/
@@ -205,6 +207,8 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels)
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+ bool permanent, bool isRedo);
extern void DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes);
extern void DropDatabaseBuffers(Oid dbid);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index f28a842401..5d74631006 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -86,6 +86,7 @@ extern void smgrclose(SMgrRelation reln);
extern void smgrcloseall(void);
extern void smgrclosenode(RelFileNodeBackend rnode);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
--
2.18.4
Attachment: v3-0002-New-command-ALTER-TABLE-ALL-IN-TABLESPACE-SET-LOG.patch (text/x-patch; charset=us-ascii)
From 5ce0551b9685dcd742bdcdf610ac80424327a9b5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 23:21:09 +0900
Subject: [PATCH v3 2/2] New command ALTER TABLE ALL IN TABLESPACE SET
LOGGED/UNLOGGED
To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes
relation persistence of all tables in the specified tablespace.
---
src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++++++++++
src/backend/nodes/copyfuncs.c | 16 ++++
src/backend/nodes/equalfuncs.c | 15 ++++
src/backend/parser/gram.y | 20 +++++
src/backend/tcop/utility.c | 11 +++
src/include/commands/tablecmds.h | 2 +
src/include/nodes/nodes.h | 1 +
src/include/nodes/parsenodes.h | 9 ++
8 files changed, 214 insertions(+)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 29f786142a..ec2a45357b 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -13665,6 +13665,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
return new_tablespaceoid;
}
+/*
+ * Alter Table ALL ... SET LOGGED/UNLOGGED
+ *
+ * Allows a user to change persistence of all objects in a given tablespace in
+ * the current database. Objects can be chosen based on the owner of the
+ * object also, to allow users to change persistene only their objects. The
+ * main permissions handling is done by the lower-level change persistence
+ * function.
+ *
+ * All to-be-modified objects are locked first. If NOWAIT is specified and the
+ * lock can't be acquired then we ereport(ERROR).
+ */
+void
+AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt)
+{
+ List *relations = NIL;
+ ListCell *l;
+ ScanKeyData key[1];
+ Relation rel;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ Oid tablespaceoid;
+ List *role_oids = roleSpecsToIds(NIL);
+
+ /* Ensure we were not asked to change something we can't */
+ if (stmt->objtype != OBJECT_TABLE)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("only tables can be specified")));
+
+ /* Get the tablespace OID */
+ tablespaceoid = get_tablespace_oid(stmt->tablespacename, false);
+
+ /*
+ * Now that the checks are done, check if we should set either to
+ * InvalidOid because it is our database's default tablespace.
+ */
+ if (tablespaceoid == MyDatabaseTableSpace)
+ tablespaceoid = InvalidOid;
+
+ /*
+ * Walk the list of objects in the tablespace to pick up them. This will
+ * only find objects in our database, of course.
+ */
+ ScanKeyInit(&key[0],
+ Anum_pg_class_reltablespace,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(tablespaceoid));
+
+ rel = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(rel, 1, key);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relOid = relForm->oid;
+
+ /*
+ * Do not pick-up objects in pg_catalog as part of this, if an admin
+ * really wishes to do so, they can issue the individual ALTER
+ * commands directly.
+ *
+ * Also, explicitly avoid any shared tables, temp tables, or TOAST
+ * (TOAST will be changed with the main table).
+ */
+ if (IsCatalogNamespace(relForm->relnamespace) ||
+ relForm->relisshared ||
+ isAnyTempNamespace(relForm->relnamespace) ||
+ IsToastNamespace(relForm->relnamespace))
+ continue;
+
+ /* Only pick up the object type requested */
+ if (relForm->relkind != RELKIND_RELATION)
+ continue;
+
+ /* Check if we are only picking-up objects owned by certain roles */
+ if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner))
+ continue;
+
+ /*
+ * Handle permissions-checking here since we are locking the tables
+ * and also to avoid doing a bunch of work only to fail part-way. Note
+ * that permissions will also be checked by AlterTableInternal().
+ *
+ * Caller must be considered an owner on the table of which we're going
+ * to change persistence.
+ */
+ if (!pg_class_ownercheck(relOid, GetUserId()))
+ aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)),
+ NameStr(relForm->relname));
+
+ if (stmt->nowait &&
+ !ConditionalLockRelationOid(relOid, AccessExclusiveLock))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("aborting because lock on relation \"%s.%s\" is not available",
+ get_namespace_name(relForm->relnamespace),
+ NameStr(relForm->relname))));
+ else
+ LockRelationOid(relOid, AccessExclusiveLock);
+
+ /*
+ * Add to our list of objects of which we're going to change
+ * persistence.
+ */
+ relations = lappend_oid(relations, relOid);
+ }
+
+ table_endscan(scan);
+ table_close(rel, AccessShareLock);
+
+ if (relations == NIL)
+ ereport(NOTICE,
+ (errcode(ERRCODE_NO_DATA_FOUND),
+ errmsg("no matching relations in tablespace \"%s\" found",
+ tablespaceoid == InvalidOid ? "(database default)" :
+ get_tablespace_name(tablespaceoid))));
+
+ /*
+ * Everything is locked, loop through and change persistence of all of the
+ * relations.
+ */
+ foreach(l, relations)
+ {
+ List *cmds = NIL;
+ AlterTableCmd *cmd = makeNode(AlterTableCmd);
+
+ if (stmt->logged)
+ cmd->subtype = AT_SetLogged;
+ else
+ cmd->subtype = AT_SetUnLogged;
+
+ cmds = lappend(cmds, cmd);
+
+ EventTriggerAlterTableStart((Node *) stmt);
+ /* OID is set by AlterTableInternal */
+ AlterTableInternal(lfirst_oid(l), cmds, false);
+ EventTriggerAlterTableEnd();
+ }
+}
+
static void
index_copy_data(Relation rel, RelFileNode newrnode)
{
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 3031c52991..7bb8fc767b 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4120,6 +4120,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from)
return newnode;
}
+static AlterTableSetLoggedAllStmt *
+_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *from)
+{
+ AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt);
+
+ COPY_STRING_FIELD(tablespacename);
+ COPY_SCALAR_FIELD(objtype);
+ COPY_SCALAR_FIELD(logged);
+ COPY_SCALAR_FIELD(nowait);
+
+ return newnode;
+}
+
static CreateExtensionStmt *
_copyCreateExtensionStmt(const CreateExtensionStmt *from)
{
@@ -5419,6 +5432,9 @@ copyObjectImpl(const void *from)
case T_AlterTableMoveAllStmt:
retval = _copyAlterTableMoveAllStmt(from);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _copyAlterTableSetLoggedAllStmt(from);
+ break;
case T_CreateExtensionStmt:
retval = _copyCreateExtensionStmt(from);
break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 9aa853748d..55ab3d7039 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1856,6 +1856,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a,
return true;
}
+static bool
+_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *a,
+ const AlterTableSetLoggedAllStmt *b)
+{
+ COMPARE_STRING_FIELD(tablespacename);
+ COMPARE_SCALAR_FIELD(objtype);
+ COMPARE_SCALAR_FIELD(logged);
+ COMPARE_SCALAR_FIELD(nowait);
+
+ return true;
+}
+
static bool
_equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b)
{
@@ -3474,6 +3486,9 @@ equal(const void *a, const void *b)
case T_AlterTableMoveAllStmt:
retval = _equalAlterTableMoveAllStmt(a, b);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _equalAlterTableSetLoggedAllStmt(a, b);
+ break;
case T_CreateExtensionStmt:
retval = _equalCreateExtensionStmt(a, b);
break;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 051f1f1d49..08da69e32f 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -1893,6 +1893,26 @@ AlterTableStmt:
n->nowait = $13;
$$ = (Node *)n;
}
+ | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = true;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = false;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
| ALTER INDEX qualified_name alter_table_cmds
{
AlterTableStmt *n = makeNode(AlterTableStmt);
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index f398027fa6..8066e7a607 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -160,6 +160,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_AlterTSConfigurationStmt:
case T_AlterTSDictionaryStmt:
case T_AlterTableMoveAllStmt:
+ case T_AlterTableSetLoggedAllStmt:
case T_AlterTableSpaceOptionsStmt:
case T_AlterTableStmt:
case T_AlterTypeStmt:
@@ -1733,6 +1734,12 @@ ProcessUtilitySlow(ParseState *pstate,
commandCollected = true;
break;
+ case T_AlterTableSetLoggedAllStmt:
+ AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree);
+ /* commands are stashed in AlterTableSetLoggedAll */
+ commandCollected = true;
+ break;
+
case T_DropStmt:
ExecDropStmt((DropStmt *) parsetree, isTopLevel);
/* no commands stashed for DROP */
@@ -2616,6 +2623,10 @@ CreateCommandTag(Node *parsetree)
tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype);
+ break;
+
case T_AlterTableStmt:
tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype);
break;
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index c1581ad178..206de61154 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
+extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt);
+
extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
Oid *oldschema);
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 7ddd8c011b..74bf050b67 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -424,6 +424,7 @@ typedef enum NodeTag
T_AlterCollationStmt,
T_CallStmt,
T_AlterStatsStmt,
+ T_AlterTableSetLoggedAllStmt,
/*
* TAGS FOR PARSE TREE NODES (parsenodes.h)
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 7ef9b0eac0..f5b4976ae1 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2235,6 +2235,15 @@ typedef struct AlterTableMoveAllStmt
bool nowait;
} AlterTableMoveAllStmt;
+typedef struct AlterTableSetLoggedAllStmt
+{
+ NodeTag type;
+ char *tablespacename;
+ ObjectType objtype; /* Object type to move */
+ bool logged;
+ bool nowait;
+} AlterTableSetLoggedAllStmt;
+
/* ----------------------
* Create/Alter Extension Statements
* ----------------------
--
2.18.4
Hi Horiguchi-san,
Thank you for making a patch so quickly. I've started looking at it.
What makes you think this is a PoC? Documentation and test cases? If there's something you think doesn't work or that you are concerned about, can you share it?
Do you know the reason why data copy was done before? And, it may be odd for me to ask this, but I think I saw someone refer to a past discussion saying that eliminating the data copy is difficult due to some processing at commit. I can't find it.
(1)
@@ -168,6 +168,8 @@ extern PGDLLIMPORT int32 *LocalRefCount;
*/
#define BufferGetPage(buffer) ((Page)BufferGetBlock(buffer))
+struct SmgrRelationData;
This declaration is already in the file:
/* forward declared, to avoid having to expose buf_internals.h here */
struct WritebackContext;
/* forward declared, to avoid including smgr.h here */
struct SMgrRelationData;
Regards
Takayuki Tsunakawa
Hello, Tsunakawa-San
Do you know the reason why data copy was done before? And, it may be
odd for me to ask this, but I think I saw someone refer to a past
discussion saying that eliminating the data copy is difficult due to
some processing at commit. I can't find it.
I can share two sources from the hackers list that explain why eliminating the data copy is difficult.
The first is Tom's remark, together with the context for copying the relation's data.
/messages/by-id/31724.1394163360@sss.pgh.pa.us
Amit-San quoted this thread and mentioned that point in another thread.
/messages/by-id/CAA4eK1+HDqS+1fhs5Jf9o4ZujQT=XBZ6sU0kOuEh2hqQAC+t=w@mail.gmail.com
Best,
Takamichi Osumi
At Fri, 13 Nov 2020 06:43:13 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in
Hi Horiguchi-san,
Thank you for making a patch so quickly. I've started looking at it.
What makes you think this is a PoC? Documentation and test cases? If there's something you think doesn't work or that you are concerned about, can you share it?
The latest version is heavily revised and much more thoroughly
commented, so it might have left the PoC state. The need for
documentation is doubtful since this patch doesn't change user-facing
behavior other than speed. Some tests are required, especially from
the recovery and replication perspective, but I haven't been able to
write them. (One of the tests needs to cause a crash while a
transaction is running.)
Do you know the reason why data copy was done before? And, it may be odd for me to ask this, but I think I saw someone refer to a past discussion saying that eliminating the data copy is difficult due to some processing at commit. I can't find it.
My guess is that it was done that way simply because it is simpler
with respect to rollback and code sharing, and maybe no one has
complained that SET LOGGED/UNLOGGED takes longer than required or
expected.
The current implementation is simple. It's enough to just discard the
old or the new relfilenode depending on whether the current
transaction ends with commit or abort. Tweaking a relfilenode that is
in use causes some skew in several places. I used the pendingDelete
mechanism in a somewhat more complicated way, violated an abstraction
(I think calling AM routines from storage.c is not good), and even
introduced a new fork kind only to mark an init fork as "not committed
yet". There might be a better way, but I haven't found it.
(The patch scans all shared buffer blocks for each relation).
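To illustrate why the copy-based approach is easy to roll back, here is a minimal, self-contained sketch of the pending-delete idea: each relfilenode is registered for removal either at commit or at abort, and at transaction end only the entries matching the outcome are unlinked. The names PendingDelete and collect_deletes are hypothetical simplifications for illustration and are not the actual definitions in storage.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a pending-delete list entry. */
typedef struct PendingDelete
{
    unsigned relfilenode;   /* storage to remove */
    bool     at_commit;     /* true: remove on commit, false: remove on abort */
} PendingDelete;

/*
 * Collect the relfilenodes that must be removed for the given transaction
 * outcome. Entries whose at_commit flag matches is_commit are selected;
 * all other entries are kept. Returns the number of entries written to out,
 * which must have room for n elements.
 */
size_t
collect_deletes(const PendingDelete *pending, size_t n,
                bool is_commit, unsigned *out)
{
    size_t count = 0;

    assert(pending != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
    {
        if (pending[i].at_commit == is_commit)
            out[count++] = pending[i].relfilenode;
    }
    return count;
}
```

With a copying SET LOGGED/UNLOGGED, the old relfilenode would be registered with at_commit = true and the new one with at_commit = false, so exactly one of the two survives whichever way the transaction ends. An in-place change has no second relfilenode to fall back on, which is where the extra machinery in the patch comes from.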
(1)
@@ -168,6 +168,8 @@ extern PGDLLIMPORT int32 *LocalRefCount;
*/
#define BufferGetPage(buffer) ((Page)BufferGetBlock(buffer))
+struct SmgrRelationData;
This declaration is already in the file:
/* forward declared, to avoid having to expose buf_internals.h here */
struct WritebackContext;
/* forward declared, to avoid including smgr.h here */
struct SMgrRelationData;
Hmmm. Nice catch. I will fix it in the next version.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Fri, 13 Nov 2020 07:15:41 +0000, "osumi.takamichi@fujitsu.com" <osumi.takamichi@fujitsu.com> wrote in
Hello, Tsunakawa-San
Thanks for sharing it!
Do you know the reason why data copy was done before? And, it may be
odd for me to ask this, but I think I saw someone refer to a past
discussion saying that eliminating the data copy is difficult due to
some processing at commit. I can't find it.
I can share two sources from the hackers list that explain why eliminating the data copy is difficult.
The first is Tom's remark, together with the context for copying the relation's data.
/messages/by-id/31724.1394163360@sss.pgh.pa.us
/messages/by-id/CA+Tgmob44LNwwU73N1aJsGQyzQ61SdhKJRC_89wCm0+aLg=x2Q@mail.gmail.com
No, not really. The issue is more around what happens if we crash
part way through. At crash recovery time, the system catalogs are not
available, because the database isn't consistent yet and, anyway, the
startup process can't be bound to a database, let alone every database
that might contain unlogged tables. So the sentinel that's used to
decide whether to flush the contents of a table or index is the
presence or absence of an _init fork, which the startup process
obviously can see just fine. The _init fork also tells us what to
stick in the relation when we reset it; for a table, we can just reset
to an empty file, but that's not legal for indexes, so the _init fork
contains a pre-initialized empty index that we can just copy over.
Now, to make an unlogged table logged, you've got to at some stage
remove those _init forks. But this is not a transactional operation.
If you remove the _init forks and then the transaction rolls back,
you've left the system an inconsistent state. If you postpone the
removal until commit time, then you have a problem if it fails,
That's true. That is the cause of the headache.
particularly if it works for the first file but fails for the second.
And if you crash at any point before you've fsync'd the containing
directory, you have no idea which files will still be on disk after a
hard reboot.
This is not an issue in this patch *except* in the case where removing
the init fork fails but the subsequent removal of the inittmp fork
succeeds. Another idea is to add a "not-yet-committed" property to a
fork. I added a new fork type to keep the patch simple, but I could
go that way if that is an issue.
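The cleanup-time decision that the reinit.c part of the patch makes for each file can be condensed into the sketch below. It is a simplified model rather than the patch code: the Fork enum and remove_at_cleanup() are hypothetical names, has_init stands for "the relation's OID was found in the hash table", and dirty stands for "an inittmp fork was seen for that OID".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for ForkNumber, including the new inittmp fork. */
typedef enum
{
    FORK_MAIN,
    FORK_FSM,
    FORK_VM,
    FORK_INIT,
    FORK_INITTMP
} Fork;

/*
 * Decide whether a relation file should be unlinked during the
 * UNLOGGED_RELATION_CLEANUP pass.
 *
 * - No init/inittmp fork for this OID: not an unlogged relation, keep it.
 * - Clean init fork: remove every fork except the init fork itself.
 * - Dirty init fork (inittmp present, creating transaction unfinished):
 *   remove only the init and inittmp forks and keep the data forks.
 */
bool
remove_at_cleanup(bool has_init, bool dirty, Fork fork)
{
    /* A dirty entry can only exist if some init-type fork was found. */
    assert(has_init || !dirty);

    if (!has_init)
        return false;
    if (!dirty)
        return fork != FORK_INIT;
    return fork == FORK_INIT || fork == FORK_INITTMP;
}
```

The failure case mentioned above corresponds to ending up with has_init true and dirty false after a crash: if the inittmp file is gone but the init file remains, the relation looks like a committed unlogged relation and its data forks would be reset.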
Amit-San quoted this thread and mentioned that point in another thread.
/messages/by-id/CAA4eK1+HDqS+1fhs5Jf9o4ZujQT=XBZ6sU0kOuEh2hqQAC+t=w@mail.gmail.com
This sounds like a somewhat different discussion. Making part of a
table UNLOGGED looks far more difficult to me.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
No, not really. The issue is more around what happens if we crash
part way through. At crash recovery time, the system catalogs are not
available, because the database isn't consistent yet and, anyway, the
startup process can't be bound to a database, let alone every database
that might contain unlogged tables. So the sentinel that's used to
decide whether to flush the contents of a table or index is the
presence or absence of an _init fork, which the startup process
obviously can see just fine. The _init fork also tells us what to
stick in the relation when we reset it; for a table, we can just reset
to an empty file, but that's not legal for indexes, so the _init fork
contains a pre-initialized empty index that we can just copy over.
Now, to make an unlogged table logged, you've got to at some stage
remove those _init forks. But this is not a transactional operation.
If you remove the _init forks and then the transaction rolls back,
you've left the system an inconsistent state. If you postpone the
removal until commit time, then you have a problem if it fails,
That's true. That is the cause of the headache.
...
The current implementation is simple. It's enough to just discard the
old or the new relfilenode depending on whether the current
transaction ends with commit or abort. Tweaking a relfilenode that is
in use causes some skew in several places. I used the pendingDelete
mechanism in a somewhat more complicated way, violated an abstraction
(I think calling AM routines from storage.c is not good), and even
introduced a new fork kind only to mark an init fork as "not committed
yet". There might be a better way, but I haven't found it.
I have no alternative idea yet, either. I agree that we want to avoid them, especially introducing the inittmp fork... Anyway, below are the rest of my review comments for 0001. I want to review 0002 once we have decided to go with 0001.
(2)
XLOG_SMGR_UNLINK seems to necessitate modification of the following comments:
[src/include/catalog/storage_xlog.h]
/*
* Declarations for smgr-related XLOG records
*
* Note: we log file creation and truncation here, but logging of deletion
* actions is handled by xact.c, because it is part of transaction commit.
*/
[src/backend/access/transam/README]
3. Deleting a table, which requires an unlink() that could fail.
Our approach here is to WAL-log the operation first, but to treat failure
of the actual unlink() call as a warning rather than error condition.
Again, this can leave an orphan file behind, but that's cheap compared to
the alternatives. Since we can't actually do the unlink() until after
we've committed the DROP TABLE transaction, throwing an error would be out
of the question anyway. (It may be worth noting that the WAL entry about
the file deletion is actually part of the commit record for the dropping
transaction.)
(3)
+/* This is bit-map, not ordianal numbers */
There seem to be no comments using "bit-map". "Flags for ..." can be seen here and there.
(4)
Some wrong spellings:
+ /* we flush this buffer when swithing to PERMANENT */
swithing -> switching
+ * alredy flushed out by RelationCreate(Drop)InitFork called just
alredy -> already
+ * relation content to be WAL-logged to recovery the table.
recovery -> recover
+ * The inittmp fork works as the sentinel to identify that situaton.
situaton -> situation
(5)
+ table_close(classRel, NoLock);
+
+
+
+
}
These empty lines can be deleted.
(6)
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileNode *rnode, bool persistence)
...
+ * Make an XLOG entry reporting the file unlink.
Not unlink but buffer persistence?
(7)
+ /*
+ * index-init fork needs further initialization. ambuildempty shoud do
+ * WAL-log and file sync by itself but otherwise we do that by myself.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rnode, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /*
+ * We have created the init fork. If server crashes before the current
+ * transaction ends the init fork left alone corrupts data while recovery.
+ * The inittmp fork works as the sentinel to identify that situaton.
+ */
+ smgrcreate(srel, INITTMP_FORKNUM, false);
+ log_smgrcreate(&rnode, INITTMP_FORKNUM);
+ smgrimmedsync(srel, INITTMP_FORKNUM);
If the server crashes between these two steps, only the init fork exists. Wouldn't it be correct to create the inittmp fork first?
(8)
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rnode, InvalidBackendId);
+ smgrclose(srel);
+ log_smgrunlink(&rnode, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ log_smgrunlink(&rnode, INITTMP_FORKNUM);
+ smgrunlink(srel, INITTMP_FORKNUM, false);
+ return;
+ }
smgrclose() should be called just before return.
Isn't it necessary here to revert buffer persistence state change?
(9)
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo);
+}
Maybe it's better to restore smgrdounlinkfork(), which was removed in an older release. That function includes dropping shared buffers, which can clean up the shared buffers that may be cached by this transaction.
(10)
[RelationDropInitFork]
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_SET_PERSISTENCE;
+ pending->bufpersistence = false;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+}
bufpersistence should be true.
(11)
+ BlockNumber block = 0;
...
+ DropRelFileNodeBuffers(rbnode, &pending->unlink_forknum, 1,
+ &block);
"block" is unnecessary and 0 can be passed directly.
(12)
- && pending->backend == InvalidBackendId)
+ && pending->backend == InvalidBackendId &&
+ pending->op == PDOP_DELETE)
nrels++;
It's better to put && at the beginning of the line to follow the existing code here.
(13)
+ table_close(rel, lockmode);
lockmode should be NoLock to retain the lock until transaction completion.
(14)
+ ctl.keysize = sizeof(unlogged_relation_entry);
+ ctl.entrysize = sizeof(unlogged_relation_entry);
+ hash = hash_create("unlogged hash", 32, &ctl, HASH_ELEM);
...
+ memset(key.oid, 0, sizeof(key.oid));
+ memcpy(key.oid, de->d_name, oidchars);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);
keysize should be the size of the oid member of the struct.
Regards
Takayuki Tsunakawa
Thanks for the comment! Sorry for the late reply.
At Fri, 4 Dec 2020 07:49:22 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
No, not really. The issue is more around what happens if we crash
part way through. At crash recovery time, the system catalogs are not
available, because the database isn't consistent yet and, anyway, the
startup process can't be bound to a database, let alone every database
that might contain unlogged tables. So the sentinel that's used to
decide whether to flush the contents of a table or index is the
presence or absence of an _init fork, which the startup process
obviously can see just fine. The _init fork also tells us what to
stick in the relation when we reset it; for a table, we can just reset
to an empty file, but that's not legal for indexes, so the _init fork
contains a pre-initialized empty index that we can just copy over.
Now, to make an unlogged table logged, you've got to at some stage
remove those _init forks. But this is not a transactional operation.
If you remove the _init forks and then the transaction rolls back,
you've left the system in an inconsistent state. If you postpone the
removal until commit time, then you have a problem if it fails,
It's true. That is the cause of the headache.
...
The current implementation is simple: it is enough to discard either
the old or the new relfilenode depending on whether the current
transaction ends with commit or abort. Tweaking a relfilenode that is
in use introduces some skew in several places. I used the
pendingDelete mechanism in a somewhat more complicated way, violated
an abstraction (I think calling AM routines from storage.c is not
good), and even introduced a new fork kind only to mark an init fork
as "not committed yet". There might be a better way, but I haven't
found it.
I have no alternative idea yet, either. I agree that we want to avoid them, especially introducing the inittmp fork... Anyway, below are the rest of my review comments for 0001. I want to review 0002 when we have decided to go with 0001.
(2)
XLOG_SMGR_UNLINK seems to necessitate modification of the following comments:
[src/include/catalog/storage_xlog.h]
/*
* Declarations for smgr-related XLOG records
*
* Note: we log file creation and truncation here, but logging of deletion
* actions is handled by xact.c, because it is part of transaction commit.
*/
Sure. Rewrote it.
[src/backend/access/transam/README]
3. Deleting a table, which requires an unlink() that could fail.
Our approach here is to WAL-log the operation first, but to treat failure
of the actual unlink() call as a warning rather than error condition.
Again, this can leave an orphan file behind, but that's cheap compared to
the alternatives. Since we can't actually do the unlink() until after
we've committed the DROP TABLE transaction, throwing an error would be out
of the question anyway. (It may be worth noting that the WAL entry about
the file deletion is actually part of the commit record for the dropping
transaction.)
Mmm. I didn't touch the DROP TABLE (RelationDropStorage) path, but I
added a brief description of the INITTMP fork to the file.
====
The INITTMP fork file
--------------------------------
An INITTMP fork is created when new relation file is created to mark
the relfilenode needs to be cleaned up at recovery time. The file is
removed at transaction end but is left when the process crashes before
the transaction ends. In contrast to 4 above, failure to remove an
INITTMP file will lead to data loss, in which case the server will
shut down.
====
(3)
+/* This is bit-map, not ordianal numbers */
There seem to be no comments using "bit-map". "Flags for ..." can be seen here and there.
I removed the comment and used (1 << n) notation to show that
instead.
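A small sketch of what the (1 << n) notation implies for callers. The TOY_OP_* macros mirror the shape of the PDOP_* defines in the attached patch (0 for plain delete, single bits for the other operations); the helper names are hypothetical. Since the delete value is 0, it has to be tested with ==, while the other operations are tested with &.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_OP_DELETE           (0)
#define TOY_OP_UNLINK_FORK      (1 << 0)
#define TOY_OP_SET_PERSISTENCE  (1 << 1)

/* The delete operation is the absence of any flag, so compare for equality. */
bool
toy_is_delete(int op)
{
    return op == TOY_OP_DELETE;
}

/* Flag operations may be combined, so test the individual bit. */
bool
toy_wants_unlink(int op)
{
    assert(op >= 0);
    return (op & TOY_OP_UNLINK_FORK) != 0;
}

bool
toy_wants_set_persistence(int op)
{
    assert(op >= 0);
    return (op & TOY_OP_SET_PERSISTENCE) != 0;
}
```

With this layout a single entry can request both an unlink and a persistence change, which ordinal numbering could not express.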
(4)
Some wrong spellings:
swithing -> switching
alredy -> already
recovery -> recover
situaton -> situation
Oops! Fixed them.
(5)
+ table_close(classRel, NoLock);
+
+
+
+
}
These empty lines can be deleted.
s/can/should/ :p. Fixed.
(6)
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileNode *rnode, bool persistence)
...
+ * Make an XLOG entry reporting the file unlink.
Not unlink but buffer persistence?
Silly copy-pasto. Fixed.
(7)
+ /*
+ * index-init fork needs further initialization. ambuildempty shoud do
+ * WAL-log and file sync by itself but otherwise we do that by myself.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rnode, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /*
+ * We have created the init fork. If server crashes before the current
+ * transaction ends the init fork left alone corrupts data while recovery.
+ * The inittmp fork works as the sentinel to identify that situaton.
+ */
+ smgrcreate(srel, INITTMP_FORKNUM, false);
+ log_smgrcreate(&rnode, INITTMP_FORKNUM);
+ smgrimmedsync(srel, INITTMP_FORKNUM);
If the server crashes between these two steps, only the init fork exists. Wouldn't it be correct to create the inittmp fork first?
Right. I changed it that way, and did the same with the new code added
to RelationCreateStorage.
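The ordering argument can be checked with a toy model. This is only a sketch under the assumption that each fork creation is one step and a crash can happen between steps; the function names and flag values are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum
{
    TOY_INIT = 1 << 0,      /* init fork exists on disk */
    TOY_INITTMP = 1 << 1    /* inittmp (sentinel) fork exists on disk */
};

/* Forks on disk if the server crashes after "done" of the two creations. */
unsigned
toy_forks_after_crash(bool sentinel_first, int done)
{
    unsigned first = sentinel_first ? TOY_INITTMP : TOY_INIT;
    unsigned second = sentinel_first ? TOY_INIT : TOY_INITTMP;
    unsigned forks = 0;

    assert(done >= 0 && done <= 2);
    if (done >= 1)
        forks |= first;
    if (done >= 2)
        forks |= second;
    return forks;
}

/* True if some crash point leaves an init fork without its sentinel. */
bool
toy_unmarked_init_possible(bool sentinel_first)
{
    for (int done = 0; done < 2; done++)
    {
        if (toy_forks_after_crash(sentinel_first, done) == TOY_INIT)
            return true;
    }
    return false;
}
```

Creating the sentinel first means the worst case after a crash is a lone inittmp fork, which recovery can recognize, rather than a lone init fork that looks committed.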
(8)
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rnode, InvalidBackendId);
+ smgrclose(srel);
+ log_smgrunlink(&rnode, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ log_smgrunlink(&rnode, INITTMP_FORKNUM);
+ smgrunlink(srel, INITTMP_FORKNUM, false);
+ return;
+ }
smgrclose() should be called just before return.
Isn't it necessary here to revert buffer persistence state change?
Mmm. It's a thinko. I was confused with the case of
close/unlink. Fixed all instances of the same.
(9)
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo);
+}
Maybe it's better to restore smgrdounlinkfork(), which was removed in an older release. That function includes dropping shared buffers, which can clean up the shared buffers that may be cached by this transaction.
INIT/INITTMP forks are never loaded into shared buffers, so there is no
point in dropping buffers. I added a comment to that effect.
| /*
| * INIT/INITTMP forks never be loaded to shared buffer so no point in
| * dropping buffers for these files.
| */
| log_smgrunlink(&rnode, INIT_FORKNUM);
I removed DropRelFileNodeBuffers from PDOP_UNLINK_FORK branch in
smgrDoPendingDeletes and added an assertion and a comment instead.
| /* other forks needs to drop buffers */
| Assert(pending->unlink_forknum == INIT_FORKNUM ||
| pending->unlink_forknum == INITTMP_FORKNUM);
|
| log_smgrunlink(&pending->relnode, pending->unlink_forknum);
| smgrunlink(srel, pending->unlink_forknum, false);
(10)
[RelationDropInitFork]
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_SET_PERSISTENCE;
+ pending->bufpersistence = false;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+}
bufpersistence should be true.
RelationDropInitFork() changes the relation persistence to
"persistent", so it should be reverted to "non-persistent (= false)" at
abort. (I agree that the function name is somewhat confusing...)
(11)
+ BlockNumber block = 0;
...
+ DropRelFileNodeBuffers(rbnode, &pending->unlink_forknum, 1,
+ &block);
"block" is unnecessary and 0 can be passed directly.
I removed the entire function call.
But, I don't think you're right here.
| DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
| int nforks, BlockNumber *firstDelBlock)
Doesn't just passing 0 lead to SEGV?
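To make the concern concrete, here is a toy function with the same parameter shape as the quoted signature (parallel arrays indexed by fork). It is a sketch, not PostgreSQL code: the name toy_count_dropped and the block-counting logic are hypothetical, and it assumes the callee reads one array element per fork. In C, a literal 0 passed for a pointer parameter is a null pointer constant, so the callee would dereference NULL.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int ToyBlockNumber;

/*
 * fork_nblocks[i] is the size of fork i and first_del_block[i] is the
 * first block to drop in it. Returns the number of blocks dropped.
 */
unsigned int
toy_count_dropped(const ToyBlockNumber *fork_nblocks, int nforks,
                  const ToyBlockNumber *first_del_block)
{
    unsigned int dropped = 0;

    /* A literal 0 passed by the caller arrives here as NULL. */
    assert(first_del_block != NULL);

    for (int i = 0; i < nforks; i++)
    {
        if (fork_nblocks[i] > first_del_block[i])
            dropped += fork_nblocks[i] - first_del_block[i];
    }
    return dropped;
}
```

So a caller that wants "from block 0" still needs a variable holding 0 and has to pass its address, as the original code did with &block.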
(12)
- && pending->backend == InvalidBackendId)
+ && pending->backend == InvalidBackendId &&
+ pending->op == PDOP_DELETE)
nrels++;
It's better to put && at the beginning of the line to follow the existing code here.
It's terrible.. Fixed.
(13)
+ table_close(rel, lockmode);
lockmode should be NoLock to retain the lock until transaction completion.
I tried to recall the reason for that, but didn't come up with
anything. Fixed.
(14)
+ ctl.keysize = sizeof(unlogged_relation_entry);
+ ctl.entrysize = sizeof(unlogged_relation_entry);
+ hash = hash_create("unlogged hash", 32, &ctl, HASH_ELEM);
...
+ memset(key.oid, 0, sizeof(key.oid));
+ memcpy(key.oid, de->d_name, oidchars);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);
keysize should be the size of the oid member of the struct.
It's not a problem in practice, since the first member is the oid;
perhaps I intended to do something more with it. I no longer recall
what that was, and in the first place the key should be just Oid in
the context above. Fixed.
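A sketch of why sizing the key to the key member is the safer convention. It assumes a byte-wise key comparison, which is an assumption made for illustration only (the hash table in the patch may well compare keys as strings, in which case the original code behaves the same). ToyEntry and the helpers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct ToyEntry
{
    char oid[16];       /* lookup key */
    bool has_init;      /* payload, not part of the key */
} ToyEntry;

/* Zero the whole entry first so unused key bytes compare equal. */
void
toy_entry_init(ToyEntry *e, const char *oid, bool has_init)
{
    assert(strlen(oid) < sizeof(e->oid));
    memset(e, 0, sizeof(*e));
    memcpy(e->oid, oid, strlen(oid));
    e->has_init = has_init;
}

/* Byte-wise comparison of the first keysize bytes of two entries. */
bool
toy_key_equal(const ToyEntry *a, const ToyEntry *b, size_t keysize)
{
    return memcmp(a, b, keysize) == 0;
}
```

With keysize set to sizeof(ToyEntry), two entries for the same oid stop matching as soon as their payload differs; with sizeof of the oid member they always match.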
The patch is attached to the next message.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello.
At Thu, 24 Dec 2020 17:02:20 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
The patch is attached to the next message.
The reason for separating this message is that I modified this so that
it could solve another issue.
There's a complaint about orphan files after a crash. [1]
1: /messages/by-id/16771-cbef7d97ba93f4b9@postgresql.org
That is, the case where a relation file is left alone after a server
crash that happened before the end of the transaction that has created
a relation. As I read this, I noticed this feature can solve the
issue with a small change.
This version gets changes in RelationCreateStorage and
smgrDoPendingDeletes.
Previously, an inittmp fork was created only along with an init fork.
This version always creates one when a relation storage file is
created. As a result, ResetUnloggedRelationsInDbspaceDir removes all
forks if the inittmp fork of a logged relation is found. Now that
pendingDeletes can contain multiple entries for the same relation, it
has been modified not to close the same smgr multiple times.
- It might be better to split 0001 into two pieces.
- The function name ResetUnloggedRelationsInDbspaceDir no longer
represents the function correctly.
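The "multiple entries for the same relation" point can be sketched with a toy pending list. This is a simplified model, not the patch's code: ToyPending, toy_push and toy_finish_xact are hypothetical, and the "action" is only counted. Each entry says whether it fires at commit or at abort; at transaction end every entry is unlinked and freed, and only those matching the outcome run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct ToyPending
{
    int relnode;                /* relation the action applies to */
    bool at_commit;             /* true: run at commit; false: at abort */
    struct ToyPending *next;
} ToyPending;

/* Push a new entry onto the list and return the new head. */
ToyPending *
toy_push(ToyPending *head, int relnode, bool at_commit)
{
    ToyPending *p = malloc(sizeof(ToyPending));

    assert(p != NULL);
    p->relnode = relnode;
    p->at_commit = at_commit;
    p->next = head;
    return p;
}

/* Consume the whole list; return how many actions matched the outcome. */
int
toy_finish_xact(ToyPending **head, bool is_commit)
{
    int ran = 0;

    while (*head != NULL)
    {
        ToyPending *p = *head;

        *head = p->next;        /* unlink first, so nothing is retried */
        if (p->at_commit == is_commit)
            ran++;
        free(p);
    }
    return ran;
}
```

Registering the same unlink once for commit and once for abort, as the patch does for the inittmp fork, guarantees it runs exactly once whichever way the transaction ends.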
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v2-0001-In-place-table-persistence-change.patch (text/x-patch; charset=us-ascii)
From dbe9ef477df8570b0b0def2b5f089b0001aa2eab Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v2 1/2] In-place table persistence change
Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs heap rewrite which causes large amount of
file I/O. This patch makes the command run without heap rewrite.
Addition to that, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of massive number of HEAP_INSERT's, which should be
smaller.
Also this allows for the cleanup of files left behind in the crash of
the transaction that created it.
---
src/backend/access/rmgrdesc/smgrdesc.c | 23 ++
src/backend/access/transam/README | 10 +
src/backend/catalog/storage.c | 394 +++++++++++++++++++++++--
src/backend/commands/tablecmds.c | 213 ++++++++++---
src/backend/storage/buffer/bufmgr.c | 88 ++++++
src/backend/storage/file/reinit.c | 164 +++++-----
src/backend/storage/smgr/md.c | 4 +-
src/backend/storage/smgr/smgr.c | 6 +
src/common/relpath.c | 3 +-
src/include/catalog/storage.h | 2 +
src/include/catalog/storage_xlog.h | 22 +-
src/include/common/relpath.h | 5 +-
src/include/storage/bufmgr.h | 2 +
src/include/storage/smgr.h | 1 +
14 files changed, 800 insertions(+), 137 deletions(-)
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index a7c0cb1bc3..097dacfee6 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,23 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+ char *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+ appendStringInfoString(buf, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+ char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+ appendStringInfoString(buf, path);
+ appendStringInfo(buf, " persistence %d", xlrec->persistence);
+ pfree(path);
+ }
}
const char *
@@ -55,6 +72,12 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_UNLINK:
+ id = "UNLINK";
+ break;
+ case XLOG_SMGR_BUFPERSISTENCE:
+ id = "BUFPERSISTENCE";
+ break;
}
return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c1..51616b2458 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -724,6 +724,16 @@ we must panic and abort recovery. The DBA will have to manually clean up
then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.
+The INITTMP fork file
+--------------------------------
+
+An INITTMP fork is created when new relation file is created to mark
+the relfilenode needs to be cleaned up at recovery time. The file is
+removed at transaction end but is left when the process crashes before
+the transaction ends. In contrast to 4 above, failure to remove an
+INITTMP file will lead to data loss, in which case the server will
+shut down.
+
Skipping WAL for New RelFileNode
--------------------------------
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index d538f25726..f4dddbad55 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,6 +19,7 @@
#include "postgres.h"
+#include "access/amapi.h"
#include "access/parallel.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -57,9 +58,16 @@ int wal_skip_threshold = 2048; /* in kilobytes */
* but I'm being paranoid.
*/
+#define PDOP_DELETE (0)
+#define PDOP_UNLINK_FORK (1 << 0)
+#define PDOP_SET_PERSISTENCE (1 << 1)
+
typedef struct PendingRelDelete
{
RelFileNode relnode; /* relation that may need to be deleted */
+ int op; /* operation mask */
+ bool bufpersistence; /* buffer persistence to set */
+ int unlink_forknum; /* forknum to unlink */
BackendId backend; /* InvalidBackendId if not a temp rel */
bool atCommit; /* T=delete at commit; F=delete at abort */
int nestLevel; /* xact nesting level of request */
@@ -143,7 +151,17 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return NULL; /* placate compiler */
}
+ /*
+ * We are going to create a new storage file. If server crashes before the
+ * current transaction ends the file needs to be cleaned up but there's no
+ * clue to the orphan files. The inittmp fork works as the sentinel to
+ * identify that situation.
+ */
srel = smgropen(rnode, backend);
+ smgrcreate(srel, INITTMP_FORKNUM, false);
+ log_smgrcreate(&rnode, INITTMP_FORKNUM);
+ smgrimmedsync(srel, INITTMP_FORKNUM);
+
smgrcreate(srel, MAIN_FORKNUM, false);
if (needs_wal)
@@ -153,12 +171,37 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
pending = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
pending->relnode = rnode;
+ pending->op = PDOP_DELETE;
pending->backend = backend;
pending->atCommit = false; /* delete if abort */
pending->nestLevel = GetCurrentTransactionNestLevel();
pending->next = pendingDeletes;
pendingDeletes = pending;
+ /* drop inittmp fork at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_FORK;
+ pending->unlink_forknum = INITTMP_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+
+ /* drop inittmp fork at commit */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_FORK;
+ pending->unlink_forknum = INITTMP_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+
if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
{
Assert(backend == InvalidBackendId);
@@ -168,6 +211,215 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return srel;
}
+/*
+ * RelationCreateInitFork
+ * Create physical storage for the init fork of a relation.
+ *
+ * Create the init fork for the relation.
+ *
+ * This function is transactional. The creation is WAL-logged, and if the
+ * transaction aborts later on, the init fork will be removed.
+ */
+void
+RelationCreateInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingRelDelete *pending;
+ SMgrRelation srel;
+ PendingRelDelete *prev;
+ PendingRelDelete *next;
+ bool create = true;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(rel->rd_smgr, false, false);
+
+ /*
+ * If we have entries for init-fork operation of this relation, that means
+ * that we have already registered pending sync entries to drop preexisting
+ * init fork since before the current transaction started. This function
+ * reverts that change just by removing the entries.
+ */
+ prev = NULL;
+ for (pending = pendingDeletes; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ pending->op != PDOP_DELETE)
+ {
+ if (prev)
+ prev->next = next;
+ else
+ pendingDeletes = next;
+ pfree(pending);
+
+ create = false;
+ }
+ else
+ {
+ /* unrelated entry, don't touch it */
+ prev = pending;
+ }
+ }
+
+ if (!create)
+ return;
+
+ /*
+ * We are going to create the init fork. If server crashes before the
+ * current transaction ends the init fork left alone corrupts data while
+ * recovery. The inittmp fork works as the sentinel to identify that
+ * situation.
+ */
+ srel = smgropen(rnode, InvalidBackendId);
+ smgrcreate(srel, INITTMP_FORKNUM, false);
+ log_smgrcreate(&rnode, INITTMP_FORKNUM);
+ smgrimmedsync(srel, INITTMP_FORKNUM);
+
+ /* We don't have existing init fork, create it. */
+ smgrcreate(srel, INIT_FORKNUM, false);
+
+ /*
+ * index-init fork needs further initialization. ambuildempty shoud do
+ * WAL-log and file sync by itself but otherwise we do that by myself.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rnode, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /* drop this init fork file at abort and revert persistence */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_FORK | PDOP_SET_PERSISTENCE;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->bufpersistence = true;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+
+ /* drop inittmp fork at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_FORK;
+ pending->unlink_forknum = INITTMP_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+
+ /* drop inittmp fork at commit */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_FORK;
+ pending->unlink_forknum = INITTMP_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+}
+
+/*
+ * RelationDropInitFork
+ * Delete physical storage for the init fork of a relation.
+ *
+ * Register pending-delete of the init fork. The real deletion is performed by
+ * smgrDoPendingDeletes at commit.
+ *
+ * This function is transactional. If the transaction aborts later on, the
+ * deletion doesn't happen.
+ */
+void
+RelationDropInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingRelDelete *pending;
+ PendingRelDelete *prev;
+ PendingRelDelete *next;
+ bool inxact_created = false;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(rel->rd_smgr, true, false);
+
+ /*
+ * If we have entries for init-fork operation of this relation, that means
+ * that we have created the init fork in the current transaction. We
+ * immediately remove the init and inittmp forks immediately in that case.
+ * Otherwise just reister pending-delete for the existing init fork.
+ */
+ prev = NULL;
+ for (pending = pendingDeletes; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ pending->op != PDOP_DELETE)
+ {
+ /* unlink list entry */
+ if (prev)
+ prev->next = next;
+ else
+ pendingDeletes = next;
+ pfree(pending);
+
+ inxact_created = true;
+ }
+ else
+ {
+ /* unrelated entry, don't touch it */
+ prev = pending;
+ }
+ }
+
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rnode, InvalidBackendId);
+
+ /*
+ * INIT/INITTMP forks never be loaded to shared buffer so no point in
+ * dropping buffers for these files.
+ */
+ log_smgrunlink(&rnode, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ log_smgrunlink(&rnode, INITTMP_FORKNUM);
+ smgrunlink(srel, INITTMP_FORKNUM, false);
+ smgrclose(srel);
+ return;
+ }
+
+ /* register drop of this init fork file at commit */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_FORK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_SET_PERSISTENCE;
+ pending->bufpersistence = false;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+}
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -187,6 +439,44 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum)
+{
+ xl_smgr_unlink xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file unlink.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileNode *rnode, bool persistence)
+{
+ xl_smgr_bufpersistence xlrec;
+
+ /*
+ * Make an XLOG entry reporting the change of buffer persistence.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.persistence = persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -200,6 +490,7 @@ RelationDropStorage(Relation rel)
pending = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
pending->relnode = rel->rd_node;
+ pending->op = PDOP_DELETE;
pending->backend = rel->rd_backend;
pending->atCommit = true; /* delete if commit */
pending->nestLevel = GetCurrentTransactionNestLevel();
@@ -606,43 +897,70 @@ smgrDoPendingDeletes(bool isCommit)
prev = NULL;
for (pending = pendingDeletes; pending != NULL; pending = next)
{
+ SMgrRelation srel;
+
next = pending->next;
if (pending->nestLevel < nestLevel)
{
/* outer-level entries should not be processed yet */
prev = pending;
+ continue;
}
+
+ /* unlink list entry first, so we don't retry on failure */
+ if (prev)
+ prev->next = next;
else
+ pendingDeletes = next;
+
+ if (pending->atCommit != isCommit)
{
- /* unlink list entry first, so we don't retry on failure */
- if (prev)
- prev->next = next;
- else
- pendingDeletes = next;
- /* do deletion if called for */
- if (pending->atCommit == isCommit)
- {
- SMgrRelation srel;
-
- srel = smgropen(pending->relnode, pending->backend);
-
- /* allocate the initial array, or extend it, if needed */
- if (maxrels == 0)
- {
- maxrels = 8;
- srels = palloc(sizeof(SMgrRelation) * maxrels);
- }
- else if (maxrels <= nrels)
- {
- maxrels *= 2;
- srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
- }
-
- srels[nrels++] = srel;
- }
/* must explicitly free the list entry */
pfree(pending);
/* prev does not change */
+ continue;
+ }
+
+ srel = smgropen(pending->relnode, pending->backend);
+
+ if (pending->op != PDOP_DELETE)
+ {
+ if (pending->op & PDOP_UNLINK_FORK)
+ {
+ BlockNumber block = 0;
+ RelFileNodeBackend rbnode;
+
+ rbnode.node = pending->relnode;
+ rbnode.backend = InvalidBackendId;
+
+ /* other forks needs to drop buffers */
+ Assert(pending->unlink_forknum == INIT_FORKNUM ||
+ pending->unlink_forknum == INITTMP_FORKNUM);
+
+ log_smgrunlink(&pending->relnode, pending->unlink_forknum);
+ smgrunlink(srel, pending->unlink_forknum, false);
+ smgrclose(srel);
+ }
+
+ if (pending->op & PDOP_SET_PERSISTENCE)
+ SetRelationBuffersPersistence(srel, pending->bufpersistence,
+ false);
+ }
+ else
+ {
+ /* allocate the initial array, or extend it, if needed */
+ if (maxrels == 0)
+ {
+ maxrels = 8;
+ srels = palloc(sizeof(SMgrRelation) * maxrels);
+ }
+ else if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ srels[nrels++] = srel;
}
}
@@ -824,7 +1142,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
- && pending->backend == InvalidBackendId)
+ && pending->backend == InvalidBackendId
+ && pending->op == PDOP_DELETE)
nrels++;
}
if (nrels == 0)
@@ -837,7 +1156,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
- && pending->backend == InvalidBackendId)
+ && pending->backend == InvalidBackendId &&
+ pending->op == PDOP_DELETE)
{
*rptr = pending->relnode;
rptr++;
@@ -917,6 +1237,15 @@ smgr_redo(XLogReaderState *record)
reln = smgropen(xlrec->rnode, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ smgrunlink(reln, xlrec->forkNum, true);
+ smgrclose(reln);
+ }
else if (info == XLOG_SMGR_TRUNCATE)
{
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
@@ -1005,6 +1334,15 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 1fa9f19f08..45be633d9f 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4917,6 +4917,138 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
return newcmd;
}
+/*
+ * RelationChangePersistence: do in-place persistence change of a relation
+ */
+static void
+RelationChangePersistence(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ int i;
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * ATRewriteTable is needed only when one of the following is set, which
+ * never happens in the AT_REWRITE_ALTER_PERSISTENCE case.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ RelationOpenSmgr(rel);
+
+ /*
+ * First we collect all relations that we need to change persistence.
+ */
+
+ /* Collect OIDs of indexes and toast relations */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ /* Add toast relation if any */
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ List *toastidx;
+ Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
+
+ RelationOpenSmgr(toastrel);
+ relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
+ toastidx = RelationGetIndexList(toastrel);
+ relids = list_concat(relids, toastidx);
+ list_free(toastidx);
+ table_close(toastrel, NoLock);
+ }
+
+ table_close(rel, NoLock);
+
+ /* Make changes in storage */
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach (lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+
+ RelationOpenSmgr(r);
+
+ /* Create or drop init fork */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ RelationCreateInitFork(r);
+ else
+ RelationDropInitFork(r);
+
+ /*
+ * When this relation gets WAL-logged, immediately sync all files but
+ * the init fork to establish the initial state on storage. Buffers
+ * have already been flushed out by RelationCreate(Drop)InitFork called
+ * just above. The init fork should have been synced as needed.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT)
+ {
+ for (i = 0 ; i < INIT_FORKNUM ; i++)
+ {
+ if (smgrexists(r->rd_smgr, i))
+ smgrimmedsync(r->rd_smgr, i);
+ }
+ }
+
+ /* Update catalog */
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+
+ /*
+ * While wal_level >= replica, switching to LOGGED requires the
+ * relation content to be WAL-logged to recover the table.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded())
+ {
+ ForkNumber fork;
+
+ for (fork = 0; fork < INIT_FORKNUM ; fork++)
+ {
+ if (smgrexists(r->rd_smgr, fork))
+ log_newpage_range(r, fork,
+ 0, smgrnblocks(r->rd_smgr, fork), false);
+ }
+ }
+
+ table_close(r, NoLock);
+ }
+
+ table_close(classRel, NoLock);
+}
+
/*
* ATRewriteTables: ALTER TABLE phase 3
*/
@@ -5037,45 +5169,52 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- /*
- * Create transient table that will receive the modified data.
- *
- * Ensure it is marked correctly as logged or unlogged. We have
- * to do this here so that buffers for the new relfilenode will
- * have the right persistence set, and at the same time ensure
- * that the original filenode's buffers will get read in with the
- * correct setting (i.e. the original one). Otherwise a rollback
- * after the rewrite would possibly result with buffers for the
- * original filenode having the wrong persistence setting.
- *
- * NB: This relies on swap_relation_files() also swapping the
- * persistence. That wouldn't work for pg_class, but that can't be
- * unlogged anyway.
- */
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, persistence,
- lockmode);
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
+ RelationChangePersistence(tab, persistence, lockmode);
+ else
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new relfilenode
+ * will have the right persistence set, and at the same time
+ * ensure that the original filenode's buffers will get read in
+ * with the correct setting (i.e. the original one). Otherwise
+ * a rollback after the rewrite would possibly result with
+ * buffers for the original filenode having the wrong
+ * persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that can't
+ * be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, persistence,
+ lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
+ /*
+ * Swap the physical files of the old and new heaps, then
+ * rebuild indexes and discard the old heap. We can use
+ * RecentXmin for the table's new relfrozenxid because we
+ * rewrote all the tuples in ATRewriteTable, so no older Xid
+ * remains in the table. Also, we never try to swap toast
+ * tables by content, since we have no interest in letting this
+ * code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
+ }
}
else
{
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index c5e8707151..6ff46fb86d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -37,6 +37,7 @@
#include "access/xlog.h"
#include "catalog/catalog.h"
#include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
#include "executor/instrument.h"
#include "lib/binaryheap.h"
#include "miscadmin.h"
@@ -3032,6 +3033,93 @@ DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
}
}
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a relation
+ * and, when switching to PERMANENT, writes all dirty pages of the relation
+ * out to disk (or more accurately, out to kernel disk buffers),
+ * ensuring that the kernel has an up-to-date view of the relation.
+ *
+ * Generally, the caller should be holding AccessExclusiveLock on the
+ * target relation to ensure that no other backend is busy dirtying
+ * more blocks of the relation; the effects can't be expected to last
+ * after the lock is released.
+ *
+ * XXX currently it sequentially searches the buffer pool, should be
+ * changed to more clever ways of searching. This routine is not
+ * used in any performance-critical code paths, so it's not worth
+ * adding additional overhead to normal paths to make it go faster;
+ * but see also DropRelFileNodeBuffers.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
+{
+ int i;
+ RelFileNodeBackend rnode = srel->smgr_rnode;
+
+ Assert(!RelFileNodeBackendIsTemp(rnode));
+
+ if (!isRedo)
+ log_smgrbufpersistence(&srel->smgr_rnode.node, permanent);
+
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ {
+ UnlockBufHdr(bufHdr, buf_state);
+ continue;
+ }
+
+ if (permanent)
+ {
+ /* Init fork is being dropped, drop buffers for it. */
+ if (bufHdr->tag.forkNum == INIT_FORKNUM)
+ {
+ InvalidateBuffer(bufHdr);
+ continue;
+ }
+
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /* we flush this buffer when switching to PERMANENT */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, srel);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr, true);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ /* init fork is always BM_PERMANENT. See BufferAlloc */
+ if (bufHdr->tag.forkNum != INIT_FORKNUM)
+ buf_state &= ~BM_PERMANENT;
+
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ }
+}
+
/* ---------------------------------------------------------------------
* DropRelFileNodesAllBuffers
*
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 8700f7f19a..80a1e61408 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -31,7 +31,8 @@ static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
typedef struct
{
Oid reloid; /* hash key */
-} unlogged_relation_entry;
+ bool dirty; /* to be removed */
+} relfile_entry;
/*
* Reset unlogged relations from before the last restart.
@@ -151,6 +152,8 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
DIR *dbspace_dir;
struct dirent *de;
char rm_path[MAXPGPATH * 2];
+ HTAB *hash;
+ HASHCTL ctl;
/* Caller must specify at least one operation. */
Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0);
@@ -160,88 +163,86 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
* the files with init forks. Then, we go through again and nuke
* everything with the same OID except the init fork.
*/
+
+ /*
+ * It's possible that someone could create a ton of unlogged relations
+ * in the same database & tablespace, so we'd better use a hash table
+ * rather than an array or linked list to keep track of which files
+ * need to be reset. Otherwise, this cleanup operation would be
+ * O(n^2).
+ */
+ memset(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(relfile_entry);
+ hash = hash_create("inittmp hash", 32, &ctl, HASH_ELEM | HASH_BLOBS);
+
+ /* Collect init and inittmp forks in the directory. */
+ dbspace_dir = AllocateDir(dbspacedirname);
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
+ {
+ int oidchars;
+ ForkNumber forkNum;
+
+ /* Skip anything that doesn't look like a relation data file. */
+ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
+ &forkNum))
+ continue;
+
+ /* Record init and inittmp forks */
+ if (forkNum == INIT_FORKNUM || forkNum == INITTMP_FORKNUM)
+ {
+ Oid key;
+ relfile_entry *ent;
+ bool found;
+
+ /*
+ * Record the relfilenode. If it has an INITTMP fork, all of its
+ * files need to be cleaned up. Otherwise the relfilenode is cleaned
+ * up according to its unloggedness.
+ */
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_ENTER, &found);
+
+ if (!found)
+ ent->dirty = false;
+
+ if (forkNum == INITTMP_FORKNUM)
+ ent->dirty = true;
+ }
+ }
+
+ /* Done with the first pass. */
+ FreeDir(dbspace_dir);
+
if ((op & UNLOGGED_RELATION_CLEANUP) != 0)
{
- HTAB *hash;
- HASHCTL ctl;
-
- /*
- * It's possible that someone could create a ton of unlogged relations
- * in the same database & tablespace, so we'd better use a hash table
- * rather than an array or linked list to keep track of which files
- * need to be reset. Otherwise, this cleanup operation would be
- * O(n^2).
- */
- ctl.keysize = sizeof(Oid);
- ctl.entrysize = sizeof(unlogged_relation_entry);
- ctl.hcxt = CurrentMemoryContext;
- hash = hash_create("unlogged relation OIDs", 32, &ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- /* Scan the directory. */
- dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
- {
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
-
- /* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* Also skip it unless this is the init fork. */
- if (forkNum != INIT_FORKNUM)
- continue;
-
- /*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
- */
- ent.reloid = atooid(de->d_name);
- (void) hash_search(hash, &ent, HASH_ENTER, NULL);
- }
-
- /* Done with the first pass. */
- FreeDir(dbspace_dir);
-
- /*
- * If we didn't find any init forks, there's no point in continuing;
- * we can bail out now.
- */
- if (hash_get_num_entries(hash) == 0)
- {
- hash_destroy(hash);
- return;
- }
-
/*
* Now, make a second pass and remove anything that matches.
*/
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
+ ForkNumber forkNum;
+ int oidchars;
+ Oid key;
+ relfile_entry *ent;
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
&forkNum))
continue;
- /* We never remove the init fork. */
- if (forkNum == INIT_FORKNUM)
- continue;
-
/*
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
- if (hash_search(hash, &ent, HASH_FIND, NULL))
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);
+
+ /* we don't remove clean init file */
+ if (ent && (ent->dirty || forkNum != INIT_FORKNUM))
{
+ /* so, nuke it! */
snprintf(rm_path, sizeof(rm_path), "%s/%s",
dbspacedirname, de->d_name);
if (unlink(rm_path) < 0)
@@ -250,13 +251,12 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
errmsg("could not remove file \"%s\": %m",
rm_path)));
else
- elog(DEBUG2, "unlinked file \"%s\"", rm_path);
+ elog(LOG, "unlinked file \"%s\"", rm_path);
}
}
/* Cleanup is complete. */
FreeDir(dbspace_dir);
- hash_destroy(hash);
}
/*
@@ -277,12 +277,42 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
char oidbuf[OIDCHARS + 1];
char srcpath[MAXPGPATH * 2];
char dstpath[MAXPGPATH];
+ Oid key;
+ relfile_entry *ent;
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
&forkNum))
continue;
+ /*
+ * See whether the OID portion of the name shows up in the hash
+ * table.
+ */
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);
+
+ /* we don't remove clean init file */
+ if (ent && (ent->dirty || forkNum != INIT_FORKNUM))
+ {
+ /*
+ * The file is dirty. It should have been removed once at
+ * cleanup time but recovery can create it again. Remove
+ * it.
+ */
+ snprintf(rm_path, sizeof(rm_path), "%s/%s",
+ dbspacedirname, de->d_name);
+ if (unlink(rm_path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ rm_path)));
+ else
+ elog(LOG, "unlinked file \"%s\"", rm_path);
+
+ continue;
+ }
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -351,6 +381,8 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
*/
fsync_fname(dbspacedirname, true);
}
+
+ hash_destroy(hash);
}
/*
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 9889ad6ad8..32dad72ed3 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -338,8 +338,10 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
if (ret == 0 || errno != ENOENT)
{
ret = unlink(path);
+
+ /* failing to remove an inittmp fork leads to data loss */
if (ret < 0 && errno != ENOENT)
- ereport(WARNING,
+ ereport((forkNum != INITTMP_FORKNUM ? WARNING : ERROR),
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m", path)));
}
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 072bdd118f..2a1d87dc33 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -644,6 +644,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/common/relpath.c b/src/common/relpath.c
index ad733d1363..2a5e5fa990 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -34,7 +34,8 @@ const char *const forkNames[] = {
"main", /* MAIN_FORKNUM */
"fsm", /* FSM_FORKNUM */
"vm", /* VISIBILITYMAP_FORKNUM */
- "init" /* INIT_FORKNUM */
+ "init", /* INIT_FORKNUM */
+ "itmp" /* INITTMP_FORKNUM */
};
StaticAssertDecl(lengthof(forkNames) == (MAX_FORKNUM + 1),
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 30c38e0ca6..c2259cd7e3 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -23,6 +23,8 @@
extern int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
+extern void RelationCreateInitFork(Relation rel);
+extern void RelationDropInitFork(Relation rel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationPreTruncate(Relation rel);
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index 7b21cab2e0..dcf1e605c0 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -22,13 +22,17 @@
/*
* Declarations for smgr-related XLOG records
*
- * Note: we log file creation and truncation here, but logging of deletion
- * actions is handled by xact.c, because it is part of transaction commit.
+ * Note: we log file creation, truncation, deletion and persistence change
+ * here. Logging of deletion actions is mainly handled by xact.c, because it
+ * is part of transaction commit, but we log deletions that happen outside of
+ * a transaction here.
*/
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_UNLINK 0x30
+#define XLOG_SMGR_BUFPERSISTENCE 0x40
typedef struct xl_smgr_create
{
@@ -36,6 +40,18 @@ typedef struct xl_smgr_create
ForkNumber forkNum;
} xl_smgr_create;
+typedef struct xl_smgr_unlink
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+} xl_smgr_unlink;
+
+typedef struct xl_smgr_bufpersistence
+{
+ RelFileNode rnode;
+ bool persistence;
+} xl_smgr_bufpersistence;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +67,8 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrbufpersistence(const RelFileNode *rnode, bool persistence);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index 869cabcc0d..f6e1a74a38 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -43,7 +43,8 @@ typedef enum ForkNumber
MAIN_FORKNUM = 0,
FSM_FORKNUM,
VISIBILITYMAP_FORKNUM,
- INIT_FORKNUM
+ INIT_FORKNUM,
+ INITTMP_FORKNUM
/*
* NOTE: if you add a new fork, change MAX_FORKNUM and possibly
@@ -52,7 +53,7 @@ typedef enum ForkNumber
*/
} ForkNumber;
-#define MAX_FORKNUM INIT_FORKNUM
+#define MAX_FORKNUM INITTMP_FORKNUM
#define FORKNAMECHARS 4 /* max chars for a fork name */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ee91b8fa26..9697449938 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -205,6 +205,8 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels)
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+ bool permanent, bool isRedo);
extern void DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes);
extern void DropDatabaseBuffers(Oid dbid);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index f28a842401..5d74631006 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -86,6 +86,7 @@ extern void smgrclose(SMgrRelation reln);
extern void smgrcloseall(void);
extern void smgrclosenode(RelFileNodeBackend rnode);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
--
2.27.0
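
For readers following the reinit.c changes above, here is a minimal,
self-contained sketch of the cleanup rule as I read it from the patch: a
relfilenode that has an inittmp ("itmp") fork is treated as dirty and loses
all of its files, while a relfilenode that has only an init fork keeps the
init fork and loses the rest. This is a simplified model, not PostgreSQL
code. RelFile, RelState, scan_state() and should_unlink() are hypothetical
names made up for illustration, and the linear scan stands in for the hash
table that the patch builds in its first pass.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Fork numbers, mirroring the ForkNumber enum as extended by the patch. */
typedef enum
{
	MAIN_FORKNUM = 0,
	FSM_FORKNUM,
	VISIBILITYMAP_FORKNUM,
	INIT_FORKNUM,
	INITTMP_FORKNUM
} ForkNumber;

/* Hypothetical stand-in for one directory entry: relfilenode plus fork. */
typedef struct
{
	unsigned int relfilenode;
	ForkNumber	fork;
} RelFile;

/* Hypothetical summary of what the first pass learns about a relfilenode. */
typedef struct
{
	bool		tracked;	/* has an init or inittmp fork */
	bool		dirty;		/* has an inittmp fork */
} RelState;

/* First pass: look for init and inittmp forks of the given relfilenode. */
static RelState
scan_state(const RelFile *files, size_t nfiles, unsigned int relfilenode)
{
	RelState	st = {false, false};

	for (size_t i = 0; i < nfiles; i++)
	{
		if (files[i].relfilenode != relfilenode)
			continue;
		if (files[i].fork == INIT_FORKNUM || files[i].fork == INITTMP_FORKNUM)
			st.tracked = true;
		if (files[i].fork == INITTMP_FORKNUM)
			st.dirty = true;
	}
	return st;
}

/*
 * Second pass: decide whether files[idx] would be unlinked.  This follows
 * the condition "ent && (ent->dirty || forkNum != INIT_FORKNUM)" used in
 * the patch; a clean init fork is the only tracked file that survives.
 */
static bool
should_unlink(const RelFile *files, size_t nfiles, size_t idx)
{
	assert(idx < nfiles);

	RelState	st = scan_state(files, nfiles, files[idx].relfilenode);

	return st.tracked && (st.dirty || files[idx].fork != INIT_FORKNUM);
}
```

With a directory holding main and init forks for relfilenode 100, main,
init and itmp forks for 200, and only a main fork for 300, this model
removes 100's main fork, keeps 100's init fork, removes every file of 200,
and leaves 300 alone since it has neither an init nor an inittmp fork.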
Attachment: v2-0002-New-command-ALTER-TABLE-ALL-IN-TABLESPACE-SET-LOG.patch (text/x-patch; charset=us-ascii)
From 421e0652fe94753921ad382e27da4010ce5db520 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 23:21:09 +0900
Subject: [PATCH v2 2/2] New command ALTER TABLE ALL IN TABLESPACE SET
LOGGED/UNLOGGED
To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes
relation persistence of all tables in the specified tablespace.
---
src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++++++++++
src/backend/nodes/copyfuncs.c | 16 ++++
src/backend/nodes/equalfuncs.c | 15 ++++
src/backend/parser/gram.y | 20 +++++
src/backend/tcop/utility.c | 11 +++
src/include/commands/tablecmds.h | 2 +
src/include/nodes/nodes.h | 1 +
src/include/nodes/parsenodes.h | 9 ++
8 files changed, 214 insertions(+)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 45be633d9f..002749094b 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -13663,6 +13663,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
return new_tablespaceoid;
}
+/*
+ * Alter Table ALL ... SET LOGGED/UNLOGGED
+ *
+ * Allows a user to change persistence of all objects in a given tablespace in
+ * the current database. Objects can be chosen based on the owner of the
+ * object also, to allow users to change persistence of only their objects. The
+ * main permissions handling is done by the lower-level change persistence
+ * function.
+ *
+ * All to-be-modified objects are locked first. If NOWAIT is specified and the
+ * lock can't be acquired then we ereport(ERROR).
+ */
+void
+AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt)
+{
+ List *relations = NIL;
+ ListCell *l;
+ ScanKeyData key[1];
+ Relation rel;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ Oid tablespaceoid;
+ List *role_oids = roleSpecsToIds(NIL);
+
+ /* Ensure we were not asked to change something we can't */
+ if (stmt->objtype != OBJECT_TABLE)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("only tables can be specified")));
+
+ /* Get the tablespace OID */
+ tablespaceoid = get_tablespace_oid(stmt->tablespacename, false);
+
+ /*
+ * Now that the checks are done, set the tablespace OID to InvalidOid if
+ * it is our database's default tablespace.
+ */
+ if (tablespaceoid == MyDatabaseTableSpace)
+ tablespaceoid = InvalidOid;
+
+ /*
+ * Walk the list of objects in the tablespace to pick them up. This will
+ * only find objects in our database, of course.
+ */
+ ScanKeyInit(&key[0],
+ Anum_pg_class_reltablespace,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(tablespaceoid));
+
+ rel = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(rel, 1, key);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relOid = relForm->oid;
+
+ /*
+ * Do not pick-up objects in pg_catalog as part of this, if an admin
+ * really wishes to do so, they can issue the individual ALTER
+ * commands directly.
+ *
+ * Also, explicitly avoid any shared tables, temp tables, or TOAST
+ * (TOAST will be changed with the main table).
+ */
+ if (IsCatalogNamespace(relForm->relnamespace) ||
+ relForm->relisshared ||
+ isAnyTempNamespace(relForm->relnamespace) ||
+ IsToastNamespace(relForm->relnamespace))
+ continue;
+
+ /* Only pick up the object type requested */
+ if (relForm->relkind != RELKIND_RELATION)
+ continue;
+
+ /* Check if we are only picking-up objects owned by certain roles */
+ if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner))
+ continue;
+
+ /*
+ * Handle permissions-checking here since we are locking the tables
+ * and also to avoid doing a bunch of work only to fail part-way. Note
+ * that permissions will also be checked by AlterTableInternal().
+ *
+ * Caller must be considered an owner on the table of which we're going
+ * to change persistence.
+ */
+ if (!pg_class_ownercheck(relOid, GetUserId()))
+ aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)),
+ NameStr(relForm->relname));
+
+ if (stmt->nowait &&
+ !ConditionalLockRelationOid(relOid, AccessExclusiveLock))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("aborting because lock on relation \"%s.%s\" is not available",
+ get_namespace_name(relForm->relnamespace),
+ NameStr(relForm->relname))));
+ else
+ LockRelationOid(relOid, AccessExclusiveLock);
+
+ /*
+ * Add to our list of objects of which we're going to change
+ * persistence.
+ */
+ relations = lappend_oid(relations, relOid);
+ }
+
+ table_endscan(scan);
+ table_close(rel, AccessShareLock);
+
+ if (relations == NIL)
+ ereport(NOTICE,
+ (errcode(ERRCODE_NO_DATA_FOUND),
+ errmsg("no matching relations in tablespace \"%s\" found",
+ tablespaceoid == InvalidOid ? "(database default)" :
+ get_tablespace_name(tablespaceoid))));
+
+ /*
+ * Everything is locked, loop through and change persistence of all of the
+ * relations.
+ */
+ foreach(l, relations)
+ {
+ List *cmds = NIL;
+ AlterTableCmd *cmd = makeNode(AlterTableCmd);
+
+ if (stmt->logged)
+ cmd->subtype = AT_SetLogged;
+ else
+ cmd->subtype = AT_SetUnLogged;
+
+ cmds = lappend(cmds, cmd);
+
+ EventTriggerAlterTableStart((Node *) stmt);
+ /* OID is set by AlterTableInternal */
+ AlterTableInternal(lfirst_oid(l), cmds, false);
+ EventTriggerAlterTableEnd();
+ }
+}
+
static void
index_copy_data(Relation rel, RelFileNode newrnode)
{
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 70f8b718e0..222b81724a 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4124,6 +4124,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from)
return newnode;
}
+static AlterTableSetLoggedAllStmt *
+_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *from)
+{
+ AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt);
+
+ COPY_STRING_FIELD(tablespacename);
+ COPY_SCALAR_FIELD(objtype);
+ COPY_SCALAR_FIELD(logged);
+ COPY_SCALAR_FIELD(nowait);
+
+ return newnode;
+}
+
static CreateExtensionStmt *
_copyCreateExtensionStmt(const CreateExtensionStmt *from)
{
@@ -5424,6 +5437,9 @@ copyObjectImpl(const void *from)
case T_AlterTableMoveAllStmt:
retval = _copyAlterTableMoveAllStmt(from);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _copyAlterTableSetLoggedAllStmt(from);
+ break;
case T_CreateExtensionStmt:
retval = _copyCreateExtensionStmt(from);
break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 541e0e6b48..898f78d899 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1860,6 +1860,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a,
return true;
}
+static bool
+_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *a,
+ const AlterTableSetLoggedAllStmt *b)
+{
+ COMPARE_STRING_FIELD(tablespacename);
+ COMPARE_SCALAR_FIELD(objtype);
+ COMPARE_SCALAR_FIELD(logged);
+ COMPARE_SCALAR_FIELD(nowait);
+
+ return true;
+}
+
static bool
_equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b)
{
@@ -3479,6 +3491,9 @@ equal(const void *a, const void *b)
case T_AlterTableMoveAllStmt:
retval = _equalAlterTableMoveAllStmt(a, b);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _equalAlterTableSetLoggedAllStmt(a, b);
+ break;
case T_CreateExtensionStmt:
retval = _equalCreateExtensionStmt(a, b);
break;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 8f341ac006..afc4ff0447 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -1885,6 +1885,26 @@ AlterTableStmt:
n->nowait = $13;
$$ = (Node *)n;
}
+ | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = true;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = false;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
| ALTER INDEX qualified_name alter_table_cmds
{
AlterTableStmt *n = makeNode(AlterTableStmt);
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index a42ead7d69..f866b8cab2 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -161,6 +161,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_AlterTSConfigurationStmt:
case T_AlterTSDictionaryStmt:
case T_AlterTableMoveAllStmt:
+ case T_AlterTableSetLoggedAllStmt:
case T_AlterTableSpaceOptionsStmt:
case T_AlterTableStmt:
case T_AlterTypeStmt:
@@ -1732,6 +1733,12 @@ ProcessUtilitySlow(ParseState *pstate,
commandCollected = true;
break;
+ case T_AlterTableSetLoggedAllStmt:
+ AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree);
+ /* commands are stashed in AlterTableSetLoggedAll */
+ commandCollected = true;
+ break;
+
case T_DropStmt:
ExecDropStmt((DropStmt *) parsetree, isTopLevel);
/* no commands stashed for DROP */
@@ -2615,6 +2622,10 @@ CreateCommandTag(Node *parsetree)
tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype);
+ break;
+
case T_AlterTableStmt:
tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype);
break;
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index c1581ad178..206de61154 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
+extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt);
+
extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
Oid *oldschema);
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 3684f87a88..7fb6437973 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -423,6 +423,7 @@ typedef enum NodeTag
T_AlterCollationStmt,
T_CallStmt,
T_AlterStatsStmt,
+ T_AlterTableSetLoggedAllStmt,
/*
* TAGS FOR PARSE TREE NODES (parsenodes.h)
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 48a79a7657..5d549b2476 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2234,6 +2234,15 @@ typedef struct AlterTableMoveAllStmt
bool nowait;
} AlterTableMoveAllStmt;
+typedef struct AlterTableSetLoggedAllStmt
+{
+ NodeTag type;
+ char *tablespacename;
+ ObjectType objtype; /* Object type to alter */
+ bool logged;
+ bool nowait;
+} AlterTableSetLoggedAllStmt;
+
/* ----------------------
* Create/Alter Extension Statements
* ----------------------
--
2.27.0
At Fri, 25 Dec 2020 09:12:52 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Hello.
At Thu, 24 Dec 2020 17:02:20 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
The patch is attached to the next message.
The reason for separating this message is that I modified the patch so
that it can also solve another issue.
There's a complaint about orphan files after a crash. [1]
1: /messages/by-id/16771-cbef7d97ba93f4b9@postgresql.org
That is, the case where a relation file is left behind after a server
crash that happened before the end of the transaction that created the
relation. As I read this, I noticed that this feature can solve the
issue with a small change.
This version changes RelationCreateStorage and smgrDoPendingDeletes.
Previously the inittmp fork was created only along with an init fork.
This version always creates one when a relation storage file is created.
As a result, ResetUnloggedRelationsInDbspaceDir removes all forks if the
inittmp fork of a logged relation is found. Now that pendingDeletes
can contain multiple entries for the same relation, it has been
modified not to close the same smgr multiple times.
- It might be better to split 0001 into two pieces.
- The function name ResetUnloggedRelationsInDbspaceDir no longer
describes the function correctly.
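The crash-cleanup rule described above can be sketched roughly as follows. This is a standalone illustration, not code from the patch: the ForkFile struct, the "inittmp" fork label and mark_orphans() are hypothetical names, and the sketch covers only the logged-relation case (a leftover inittmp fork means every fork of that relfilenode is removed). The real logic lives in ResetUnloggedRelationsInDbspaceDir and works on directory entries.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified view of one relation file in a database directory. */
typedef struct ForkFile
{
	unsigned	relfilenode;	/* numeric part of the file name */
	const char *fork;			/* "main", "fsm", "vm" or "inittmp" (assumed labels) */
	bool		remove;			/* set by mark_orphans() */
} ForkFile;

/*
 * Mark every fork of a relfilenode for removal when that relfilenode still
 * has an inittmp fork, i.e. the creating transaction never reached its end.
 * Returns the number of files marked.
 */
static size_t
mark_orphans(ForkFile *files, size_t nfiles)
{
	size_t		nmarked = 0;

	for (size_t i = 0; i < nfiles; i++)
	{
		files[i].remove = false;

		/* look for a sentinel that belongs to the same relfilenode */
		for (size_t j = 0; j < nfiles; j++)
		{
			if (files[j].relfilenode == files[i].relfilenode &&
				strcmp(files[j].fork, "inittmp") == 0)
			{
				files[i].remove = true;
				nmarked++;
				break;
			}
		}
	}

	return nmarked;
}
```

The quadratic scan keeps the sketch short; a real implementation would presumably collect the sentinel relfilenodes in a hash table in a first pass and remove files in a second one.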
As pointed out by Robert in another thread [1], the persistence of (at
least) GiST indexes cannot be flipped in-place because fake LSNs are
incompatible with real ones.
In this version, RelationChangePersistence() is changed not to choose
the in-place method for indexes other than btree. The in-place method
seems to be usable with all kinds of indexes other than GiST, but at the
moment it is applied only to btrees.
1: /messages/by-id/CA+TgmoZEZ5RONS49C7mEpjhjndqMQtVrz_LCQUkpRWdmRevDnQ@mail.gmail.com
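The choice between the two methods can be sketched as below. This is only an illustration under assumptions: the enums and choose_persistence_method() are hypothetical stand-ins, whereas the patch itself tests relkind and relam directly and calls reindex_index() for the rebuild case.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for relkind and index access method identifiers. */
typedef enum RelKindSketch { KIND_TABLE, KIND_INDEX } RelKindSketch;
typedef enum IndexAmSketch { AM_NONE, AM_BTREE, AM_GIST, AM_HASH } IndexAmSketch;

typedef enum PersistenceMethod
{
	METHOD_IN_PLACE,			/* flip buffers, init fork and catalog entry */
	METHOD_REBUILD				/* rebuild the index with the new persistence */
} PersistenceMethod;

/*
 * Tables and btree indexes are changed in place.  Every other index type is
 * rebuilt, since (at least) GiST fake LSNs are incompatible with real LSNs.
 */
static PersistenceMethod
choose_persistence_method(RelKindSketch kind, IndexAmSketch am)
{
	if (kind == KIND_INDEX && am != AM_BTREE)
		return METHOD_REBUILD;

	return METHOD_IN_PLACE;
}
```

Restricting the in-place path to an allow-list (btree only) rather than a deny-list (everything but GiST) is the conservative choice: an access method that has not been examined falls back to the rebuild.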
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v3-0001-In-place-table-persistence-change.patch (text/x-patch; charset=us-ascii)
From 1d47e7872d1e7ef18007f752e55cec9772373cc9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v3 1/2] In-place table persistence change
Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs a heap rewrite, which causes a large amount
of file I/O. This patch makes the command run without a heap rewrite.
In addition, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of a massive number of HEAP_INSERTs, which should be
smaller.
Also this allows for the cleanup of files left behind when the
transaction that created them crashes.
---
src/backend/access/rmgrdesc/smgrdesc.c | 23 ++
src/backend/access/transam/README | 10 +
src/backend/catalog/storage.c | 420 +++++++++++++++++++++++--
src/backend/commands/tablecmds.c | 246 ++++++++++++---
src/backend/storage/buffer/bufmgr.c | 88 ++++++
src/backend/storage/file/reinit.c | 162 ++++++----
src/backend/storage/smgr/md.c | 4 +-
src/backend/storage/smgr/smgr.c | 6 +
src/common/relpath.c | 3 +-
src/include/catalog/storage.h | 2 +
src/include/catalog/storage_xlog.h | 22 +-
src/include/common/relpath.h | 5 +-
src/include/storage/bufmgr.h | 2 +
src/include/storage/smgr.h | 1 +
14 files changed, 854 insertions(+), 140 deletions(-)
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 7755553d57..2c109b8ca4 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,23 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+ char *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+ appendStringInfoString(buf, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+ char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+ appendStringInfoString(buf, path);
+ appendStringInfo(buf, " persistence %d", xlrec->persistence);
+ pfree(path);
+ }
}
const char *
@@ -55,6 +72,12 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_UNLINK:
+ id = "UNLINK";
+ break;
+ case XLOG_SMGR_BUFPERSISTENCE:
+ id = "BUFPERSISTENCE";
+ break;
}
return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c1..51616b2458 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -724,6 +724,16 @@ we must panic and abort recovery. The DBA will have to manually clean up
then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.
+The INITTMP fork file
+--------------------------------
+
+An INITTMP fork is created when a new relation file is created, to mark
+that the relfilenode needs to be cleaned up at recovery time. The file is
+removed at transaction end but is left behind when the process crashes
+before the transaction ends. In contrast to 4 above, failure to remove an
+INITTMP file will lead to data loss, in which case the server will
+shut down.
+
Skipping WAL for New RelFileNode
--------------------------------
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index cba7a9ada0..bd9680583b 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,6 +19,7 @@
#include "postgres.h"
+#include "access/amapi.h"
#include "access/parallel.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -27,6 +28,7 @@
#include "access/xlogutils.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+#include "common/hashfn.h"
#include "miscadmin.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
@@ -57,9 +59,16 @@ int wal_skip_threshold = 2048; /* in kilobytes */
* but I'm being paranoid.
*/
+#define PDOP_DELETE (0)
+#define PDOP_UNLINK_FORK (1 << 0)
+#define PDOP_SET_PERSISTENCE (1 << 1)
+
typedef struct PendingRelDelete
{
RelFileNode relnode; /* relation that may need to be deleted */
+ int op; /* operation mask */
+ bool bufpersistence; /* buffer persistence to set */
+ int unlink_forknum; /* forknum to unlink */
BackendId backend; /* InvalidBackendId if not a temp rel */
bool atCommit; /* T=delete at commit; F=delete at abort */
int nestLevel; /* xact nesting level of request */
@@ -75,6 +84,24 @@ typedef struct PendingRelSync
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
HTAB *pendingSyncHash = NULL;
+typedef struct SRelHashEntry
+{
+ SMgrRelation srel;
+ char status; /* for simplehash use */
+} SRelHashEntry;
+
+/* define hashtable for workarea for pending deletes */
+#define SH_PREFIX srelhash
+#define SH_ELEMENT_TYPE SRelHashEntry
+#define SH_KEY_TYPE SMgrRelation
+#define SH_KEY srel
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((unsigned char *)&key, sizeof(SMgrRelation))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(SMgrRelation)) == 0)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
/*
* AddPendingSync
@@ -143,7 +170,17 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return NULL; /* placate compiler */
}
+ /*
+ * We are going to create a new storage file. If the server crashes before
+ * the current transaction ends, the file needs to be cleaned up, but
+ * nothing else would identify it as an orphan. The inittmp fork works as
+ * the sentinel to identify that situation.
+ */
srel = smgropen(rnode, backend);
+ smgrcreate(srel, INITTMP_FORKNUM, false);
+ log_smgrcreate(&rnode, INITTMP_FORKNUM);
+ smgrimmedsync(srel, INITTMP_FORKNUM);
+
smgrcreate(srel, MAIN_FORKNUM, false);
if (needs_wal)
@@ -153,12 +190,25 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
pending = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
pending->relnode = rnode;
+ pending->op = PDOP_DELETE;
pending->backend = backend;
pending->atCommit = false; /* delete if abort */
pending->nestLevel = GetCurrentTransactionNestLevel();
pending->next = pendingDeletes;
pendingDeletes = pending;
+ /* drop inittmp fork at commit */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_FORK;
+ pending->unlink_forknum = INITTMP_FORKNUM;
+ pending->backend = backend;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+
if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
{
Assert(backend == InvalidBackendId);
@@ -168,6 +218,215 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return srel;
}
+/*
+ * RelationCreateInitFork
+ * Create physical storage for the init fork of a relation.
+ *
+ * Create the init fork for the relation.
+ *
+ * This function is transactional. The creation is WAL-logged, and if the
+ * transaction aborts later on, the init fork will be removed.
+ */
+void
+RelationCreateInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingRelDelete *pending;
+ SMgrRelation srel;
+ PendingRelDelete *prev;
+ PendingRelDelete *next;
+ bool create = true;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(rel->rd_smgr, false, false);
+
+ /*
+ * If we have entries for init-fork operations of this relation, that means
+ * that we have already registered pending-delete entries to drop an init
+ * fork that existed before the current transaction started. This function
+ * reverts that change just by removing the entries.
+ */
+ prev = NULL;
+ for (pending = pendingDeletes; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ pending->op != PDOP_DELETE)
+ {
+ if (prev)
+ prev->next = next;
+ else
+ pendingDeletes = next;
+ pfree(pending);
+
+ create = false;
+ }
+ else
+ {
+ /* unrelated entry, don't touch it */
+ prev = pending;
+ }
+ }
+
+ if (!create)
+ return;
+
+ /*
+ * We are going to create the init fork. If the server crashes before the
+ * current transaction ends, the init fork left behind would corrupt data
+ * during recovery. The inittmp fork works as the sentinel to identify that
+ * situation.
+ */
+ srel = smgropen(rnode, InvalidBackendId);
+ smgrcreate(srel, INITTMP_FORKNUM, false);
+ log_smgrcreate(&rnode, INITTMP_FORKNUM);
+ smgrimmedsync(srel, INITTMP_FORKNUM);
+
+ /* We don't have existing init fork, create it. */
+ smgrcreate(srel, INIT_FORKNUM, false);
+
+ /*
+ * index-init fork needs further initialization. ambuildempty should do
+ * WAL-log and file sync by itself but otherwise we do that ourselves.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rnode, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /* drop this init fork file at abort and revert persistence */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_FORK | PDOP_SET_PERSISTENCE;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->bufpersistence = true;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+
+ /* drop inittmp fork at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_FORK;
+ pending->unlink_forknum = INITTMP_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+
+ /* drop inittmp fork at commit */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_FORK;
+ pending->unlink_forknum = INITTMP_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+}
+
+/*
+ * RelationDropInitFork
+ * Delete physical storage for the init fork of a relation.
+ *
+ * Register pending-delete of the init fork. The real deletion is performed by
+ * smgrDoPendingDeletes at commit.
+ *
+ * This function is transactional. If the transaction aborts later on, the
+ * deletion doesn't happen.
+ */
+void
+RelationDropInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingRelDelete *pending;
+ PendingRelDelete *prev;
+ PendingRelDelete *next;
+ bool inxact_created = false;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(rel->rd_smgr, true, false);
+
+ /*
+ * If we have entries for init-fork operations of this relation, that means
+ * that we have created the init fork in the current transaction. We
+ * remove the init and inittmp forks immediately in that case.
+ * Otherwise just register pending-delete for the existing init fork.
+ */
+ prev = NULL;
+ for (pending = pendingDeletes; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ pending->op != PDOP_DELETE)
+ {
+ /* unlink list entry */
+ if (prev)
+ prev->next = next;
+ else
+ pendingDeletes = next;
+ pfree(pending);
+
+ inxact_created = true;
+ }
+ else
+ {
+ /* unrelated entry, don't touch it */
+ prev = pending;
+ }
+ }
+
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rnode, InvalidBackendId);
+
+ /*
+ * INIT/INITTMP forks are never loaded into shared buffers, so there is
+ * no point in dropping buffers for these files.
+ */
+ log_smgrunlink(&rnode, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ log_smgrunlink(&rnode, INITTMP_FORKNUM);
+ smgrunlink(srel, INITTMP_FORKNUM, false);
+ smgrclose(srel);
+ return;
+ }
+
+ /* register drop of this init fork file at commit */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_FORK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_SET_PERSISTENCE;
+ pending->bufpersistence = false;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+}
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -187,6 +446,44 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum)
+{
+ xl_smgr_unlink xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file unlink.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileNode *rnode, bool persistence)
+{
+ xl_smgr_bufpersistence xlrec;
+
+ /*
+ * Make an XLOG entry reporting the change of buffer persistence.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.persistence = persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -200,6 +497,7 @@ RelationDropStorage(Relation rel)
pending = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
pending->relnode = rel->rd_node;
+ pending->op = PDOP_DELETE;
pending->backend = rel->rd_backend;
pending->atCommit = true; /* delete if commit */
pending->nestLevel = GetCurrentTransactionNestLevel();
@@ -602,59 +900,97 @@ smgrDoPendingDeletes(bool isCommit)
int nrels = 0,
maxrels = 0;
SMgrRelation *srels = NULL;
+ srelhash_hash *close_srels = NULL;
+ bool found;
+
prev = NULL;
for (pending = pendingDeletes; pending != NULL; pending = next)
{
+ SMgrRelation srel;
+
next = pending->next;
if (pending->nestLevel < nestLevel)
{
/* outer-level entries should not be processed yet */
prev = pending;
+ continue;
}
+
+ /* unlink list entry first, so we don't retry on failure */
+ if (prev)
+ prev->next = next;
else
+ pendingDeletes = next;
+
+ if (pending->atCommit != isCommit)
{
- /* unlink list entry first, so we don't retry on failure */
- if (prev)
- prev->next = next;
- else
- pendingDeletes = next;
- /* do deletion if called for */
- if (pending->atCommit == isCommit)
- {
- SMgrRelation srel;
-
- srel = smgropen(pending->relnode, pending->backend);
-
- /* allocate the initial array, or extend it, if needed */
- if (maxrels == 0)
- {
- maxrels = 8;
- srels = palloc(sizeof(SMgrRelation) * maxrels);
- }
- else if (maxrels <= nrels)
- {
- maxrels *= 2;
- srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
- }
-
- srels[nrels++] = srel;
- }
/* must explicitly free the list entry */
pfree(pending);
/* prev does not change */
+ continue;
+ }
+
+ if (close_srels == NULL)
+ close_srels = srelhash_create(CurrentMemoryContext, 32, NULL);
+
+ srel = smgropen(pending->relnode, pending->backend);
+
+ /* Uniquify the smgr relations */
+ srelhash_insert(close_srels, srel, &found);
+
+ if (pending->op != PDOP_DELETE)
+ {
+ if (pending->op & PDOP_UNLINK_FORK)
+ {
+ /* other forks would need their buffers dropped */
+ Assert(pending->unlink_forknum == INIT_FORKNUM ||
+ pending->unlink_forknum == INITTMP_FORKNUM);
+
+ log_smgrunlink(&pending->relnode, pending->unlink_forknum);
+ smgrunlink(srel, pending->unlink_forknum, false);
+
+ }
+
+ if (pending->op & PDOP_SET_PERSISTENCE)
+ SetRelationBuffersPersistence(srel, pending->bufpersistence,
+ false);
+ }
+ else
+ {
+ /* allocate the initial array, or extend it, if needed */
+ if (maxrels == 0)
+ {
+ maxrels = 8;
+ srels = palloc(sizeof(SMgrRelation) * maxrels);
+ }
+ else if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ srels[nrels++] = srel;
}
}
if (nrels > 0)
{
smgrdounlinkall(srels, nrels, false);
-
- for (int i = 0; i < nrels; i++)
- smgrclose(srels[i]);
-
pfree(srels);
}
+
+ if (close_srels)
+ {
+ srelhash_iterator i;
+ SRelHashEntry *ent;
+
+ /* close smgr relations */
+ srelhash_start_iterate(close_srels, &i);
+ while ((ent = srelhash_iterate(close_srels, &i)) != NULL)
+ smgrclose(ent->srel);
+ srelhash_destroy(close_srels);
+ }
}
/*
@@ -824,7 +1160,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
- && pending->backend == InvalidBackendId)
+ && pending->backend == InvalidBackendId
+ && pending->op == PDOP_DELETE)
nrels++;
}
if (nrels == 0)
@@ -837,7 +1174,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
- && pending->backend == InvalidBackendId)
+ && pending->backend == InvalidBackendId &&
+ pending->op == PDOP_DELETE)
{
*rptr = pending->relnode;
rptr++;
@@ -917,6 +1255,15 @@ smgr_redo(XLogReaderState *record)
reln = smgropen(xlrec->rnode, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ smgrunlink(reln, xlrec->forkNum, true);
+ smgrclose(reln);
+ }
else if (info == XLOG_SMGR_TRUNCATE)
{
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
@@ -1005,6 +1352,15 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 993da56d43..37a15d31ee 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -50,6 +50,7 @@
#include "commands/defrem.h"
#include "commands/event_trigger.h"
#include "commands/policy.h"
+#include "commands/progress.h"
#include "commands/sequence.h"
#include "commands/tablecmds.h"
#include "commands/tablespace.h"
@@ -4917,6 +4918,170 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
return newcmd;
}
+/*
+ * RelationChangePersistence: do in-place persistence change of a relation
+ */
+static void
+RelationChangePersistence(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ int i;
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * If any of the following were set, we would need to call ATRewriteTable.
+ * That cannot happen in the AT_REWRITE_ALTER_PERSISTENCE case.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ RelationOpenSmgr(rel);
+
+ /*
+ * First we collect all relations that we need to change persistence.
+ */
+
+ /* Collect OIDs of indexes and toast relations */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ /* Add toast relation if any */
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ List *toastidx;
+ Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
+
+ RelationOpenSmgr(toastrel);
+ relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
+ toastidx = RelationGetIndexList(toastrel);
+ relids = list_concat(relids, toastidx);
+ pfree(toastidx);
+ table_close(toastrel, NoLock);
+ }
+
+ table_close(rel, NoLock);
+
+ /* Make changes in storage */
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach (lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+
+ /*
+ * Some access methods do not accept in-place persistence change. For
+ * example, GiST uses page LSNs to figure out whether a block has
+ * changed, where UNLOGGED GiST indexes use fake LSNs that are
+ * incompatible with real LSNs used for LOGGED ones.
+ *
+ * XXXX: We don't bother allowing in-place persistence change for index
+ * methods other than btree for now.
+ */
+ if (r->rd_rel->relkind == RELKIND_INDEX &&
+ r->rd_rel->relam != BTREE_AM_OID)
+ {
+ int reindex_flags;
+
+ /* reindex doesn't allow concurrent use of the index */
+ table_close(r, NoLock);
+
+ reindex_flags =
+ REINDEX_REL_SUPPRESS_INDEX_USE |
+ REINDEX_REL_CHECK_CONSTRAINTS;
+
+ /* Set the same persistence as the parent relation. */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+
+ reindex_index(reloid, reindex_flags, persistence, 0);
+
+ continue;
+ }
+
+ RelationOpenSmgr(r);
+
+ /* Create or drop init fork */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ RelationCreateInitFork(r);
+ else
+ RelationDropInitFork(r);
+
+ /*
+ * When this relation gets WAL-logged, immediately sync all files but
+ * initfork to establish the initial state on storage. Buffers have
+ * already been flushed out by RelationCreate(Drop)InitFork called just
+ * above. Initfork should have been synced as needed.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT)
+ {
+ for (i = 0 ; i < INIT_FORKNUM ; i++)
+ {
+ if (smgrexists(r->rd_smgr, i))
+ smgrimmedsync(r->rd_smgr, i);
+ }
+ }
+
+ /* Update catalog */
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+
+ /*
+ * While wal_level >= replica, switching to LOGGED requires the
+ * relation content to be WAL-logged to recover the table.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded())
+ {
+ ForkNumber fork;
+
+ for (fork = 0; fork < INIT_FORKNUM ; fork++)
+ {
+ if (smgrexists(r->rd_smgr, fork))
+ log_newpage_range(r, fork,
+ 0, smgrnblocks(r->rd_smgr, fork), false);
+ }
+ }
+
+ table_close(r, NoLock);
+ }
+
+ table_close(classRel, NoLock);
+}
+
/*
* ATRewriteTables: ALTER TABLE phase 3
*/
@@ -5037,45 +5202,52 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- /*
- * Create transient table that will receive the modified data.
- *
- * Ensure it is marked correctly as logged or unlogged. We have
- * to do this here so that buffers for the new relfilenode will
- * have the right persistence set, and at the same time ensure
- * that the original filenode's buffers will get read in with the
- * correct setting (i.e. the original one). Otherwise a rollback
- * after the rewrite would possibly result with buffers for the
- * original filenode having the wrong persistence setting.
- *
- * NB: This relies on swap_relation_files() also swapping the
- * persistence. That wouldn't work for pg_class, but that can't be
- * unlogged anyway.
- */
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, persistence,
- lockmode);
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
+ RelationChangePersistence(tab, persistence, lockmode);
+ else
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new relfilenode
+ * will have the right persistence set, and at the same time
+ * ensure that the original filenode's buffers will get read in
+ * with the correct setting (i.e. the original one). Otherwise
+ * a rollback after the rewrite would possibly result with
+ * buffers for the original filenode having the wrong
+ * persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that can't
+ * be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, persistence,
+ lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
+ /*
+ * Swap the physical files of the old and new heaps, then
+ * rebuild indexes and discard the old heap. We can use
+ * RecentXmin for the table's new relfrozenxid because we
+ * rewrote all the tuples in ATRewriteTable, so no older Xid
+ * remains in the table. Also, we never try to swap toast
+ * tables by content, since we have no interest in letting this
+ * code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
+ }
}
else
{
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 71b5852224..b730b4417c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -37,6 +37,7 @@
#include "access/xlog.h"
#include "catalog/catalog.h"
#include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
#include "executor/instrument.h"
#include "lib/binaryheap.h"
#include "miscadmin.h"
@@ -3032,6 +3033,93 @@ DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
}
}
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a relation
+ * and, when switching to PERMANENT, writes all dirty pages of the relation
+ * out to disk (or more accurately, out to kernel disk buffers), ensuring
+ * that the kernel has an up-to-date view of the relation.
+ *
+ * Generally, the caller should be holding AccessExclusiveLock on the
+ * target relation to ensure that no other backend is busy dirtying
+ * more blocks of the relation; the effects can't be expected to last
+ * after the lock is released.
+ *
+ * XXX currently it sequentially searches the buffer pool, should be
+ * changed to more clever ways of searching. This routine is not
+ * used in any performance-critical code paths, so it's not worth
+ * adding additional overhead to normal paths to make it go faster;
+ * but see also DropRelFileNodeBuffers.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
+{
+ int i;
+ RelFileNodeBackend rnode = srel->smgr_rnode;
+
+ Assert (!RelFileNodeBackendIsTemp(rnode));
+
+ if (!isRedo)
+ log_smgrbufpersistence(&srel->smgr_rnode.node, permanent);
+
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ {
+ UnlockBufHdr(bufHdr, buf_state);
+ continue;
+ }
+
+ if (permanent)
+ {
+ /* Init fork is being dropped, drop buffers for it. */
+ if (bufHdr->tag.forkNum == INIT_FORKNUM)
+ {
+ InvalidateBuffer(bufHdr);
+ continue;
+ }
+
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /* we flush this buffer when switching to PERMANENT */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, srel);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr, true);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ /* init fork is always BM_PERMANENT. See BufferAlloc */
+ if (bufHdr->tag.forkNum != INIT_FORKNUM)
+ buf_state &= ~BM_PERMANENT;
+
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ }
+}
+
/* ---------------------------------------------------------------------
* DropRelFileNodesAllBuffers
*
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 40c758d789..adcb54b0fa 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -31,7 +31,8 @@ static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
typedef struct
{
Oid reloid; /* hash key */
-} unlogged_relation_entry;
+ bool dirty; /* to be removed */
+} relfile_entry;
/*
* Reset unlogged relations from before the last restart.
@@ -151,6 +152,8 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
DIR *dbspace_dir;
struct dirent *de;
char rm_path[MAXPGPATH * 2];
+ HTAB *hash;
+ HASHCTL ctl;
/* Caller must specify at least one operation. */
Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0);
@@ -160,88 +163,86 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
* the files with init forks. Then, we go through again and nuke
* everything with the same OID except the init fork.
*/
+
+ /*
+ * It's possible that someone could create a ton of unlogged relations
+ * in the same database & tablespace, so we'd better use a hash table
+ * rather than an array or linked list to keep track of which files
+ * need to be reset. Otherwise, this cleanup operation would be
+ * O(n^2).
+ */
+ memset(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(relfile_entry);
+ hash = hash_create("inittmp hash", 32, &ctl, HASH_ELEM | HASH_BLOBS);
+
+ /* Collect init and inittmp forks in the directory. */
+ dbspace_dir = AllocateDir(dbspacedirname);
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
+ {
+ int oidchars;
+ ForkNumber forkNum;
+
+ /* Skip anything that doesn't look like a relation data file. */
+ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
+ &forkNum))
+ continue;
+
+ /* Record init and inittmp forks */
+ if (forkNum == INIT_FORKNUM || forkNum == INITTMP_FORKNUM)
+ {
+ Oid key;
+ relfile_entry *ent;
+ bool found;
+
+ /*
+ * Record the relfilenode. If it has an INITTMP fork, all of its
+ * files need to be cleaned up. Otherwise the relfilenode is cleaned
+ * up according to its unloggedness.
+ */
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_ENTER, &found);
+
+ if (!found)
+ ent->dirty = false;
+
+ if (forkNum == INITTMP_FORKNUM)
+ ent->dirty = true;
+ }
+ }
+
+ /* Done with the first pass. */
+ FreeDir(dbspace_dir);
+
if ((op & UNLOGGED_RELATION_CLEANUP) != 0)
{
- HTAB *hash;
- HASHCTL ctl;
-
- /*
- * It's possible that someone could create a ton of unlogged relations
- * in the same database & tablespace, so we'd better use a hash table
- * rather than an array or linked list to keep track of which files
- * need to be reset. Otherwise, this cleanup operation would be
- * O(n^2).
- */
- ctl.keysize = sizeof(Oid);
- ctl.entrysize = sizeof(unlogged_relation_entry);
- ctl.hcxt = CurrentMemoryContext;
- hash = hash_create("unlogged relation OIDs", 32, &ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- /* Scan the directory. */
- dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
- {
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
-
- /* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* Also skip it unless this is the init fork. */
- if (forkNum != INIT_FORKNUM)
- continue;
-
- /*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
- */
- ent.reloid = atooid(de->d_name);
- (void) hash_search(hash, &ent, HASH_ENTER, NULL);
- }
-
- /* Done with the first pass. */
- FreeDir(dbspace_dir);
-
- /*
- * If we didn't find any init forks, there's no point in continuing;
- * we can bail out now.
- */
- if (hash_get_num_entries(hash) == 0)
- {
- hash_destroy(hash);
- return;
- }
-
/*
* Now, make a second pass and remove anything that matches.
*/
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
+ ForkNumber forkNum;
+ int oidchars;
+ Oid key;
+ relfile_entry *ent;
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
&forkNum))
continue;
- /* We never remove the init fork. */
- if (forkNum == INIT_FORKNUM)
- continue;
-
/*
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
- if (hash_search(hash, &ent, HASH_FIND, NULL))
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);
+
+ /* we don't remove clean init file */
+ if (ent && (ent->dirty || forkNum != INIT_FORKNUM))
{
+ /* so, nuke it! */
snprintf(rm_path, sizeof(rm_path), "%s/%s",
dbspacedirname, de->d_name);
if (unlink(rm_path) < 0)
@@ -256,7 +257,6 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
/* Cleanup is complete. */
FreeDir(dbspace_dir);
- hash_destroy(hash);
}
/*
@@ -277,12 +277,42 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
char oidbuf[OIDCHARS + 1];
char srcpath[MAXPGPATH * 2];
char dstpath[MAXPGPATH];
+ Oid key;
+ relfile_entry *ent;
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
&forkNum))
continue;
+ /*
+ * See whether the OID portion of the name shows up in the hash
+ * table.
+ */
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);
+
+ /* we don't remove clean init file */
+ if (ent && (ent->dirty || forkNum != INIT_FORKNUM))
+ {
+ /*
+ * The file is dirty. It should have been removed once at
+ * cleanup time, but recovery can create it again. Remove
+ * it.
+ */
+ snprintf(rm_path, sizeof(rm_path), "%s/%s",
+ dbspacedirname, de->d_name);
+ if (unlink(rm_path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ rm_path)));
+ else
+ elog(DEBUG2, "unlinked file \"%s\"", rm_path);
+
+ continue;
+ }
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -351,6 +381,8 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
*/
fsync_fname(dbspacedirname, true);
}
+
+ hash_destroy(hash);
}
/*
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 0643d714fb..416fd859e6 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -338,8 +338,10 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
if (ret == 0 || errno != ENOENT)
{
ret = unlink(path);
+
+ /* Failing to remove an inittmp fork leads to data loss. */
if (ret < 0 && errno != ENOENT)
- ereport(WARNING,
+ ereport((forkNum != INITTMP_FORKNUM ? WARNING : ERROR),
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m", path)));
}
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 0f31ff3822..4102d3d59c 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -644,6 +644,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 1f5c426ec0..2954cd9c24 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -34,7 +34,8 @@ const char *const forkNames[] = {
"main", /* MAIN_FORKNUM */
"fsm", /* FSM_FORKNUM */
"vm", /* VISIBILITYMAP_FORKNUM */
- "init" /* INIT_FORKNUM */
+ "init", /* INIT_FORKNUM */
+ "itmp" /* INITTMP_FORKNUM */
};
StaticAssertDecl(lengthof(forkNames) == (MAX_FORKNUM + 1),
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 0ab32b44e9..382623159c 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -23,6 +23,8 @@
extern int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
+extern void RelationCreateInitFork(Relation rel);
+extern void RelationDropInitFork(Relation rel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationPreTruncate(Relation rel);
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index f0814f1458..0fd0832a8b 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -22,13 +22,17 @@
/*
* Declarations for smgr-related XLOG records
*
- * Note: we log file creation and truncation here, but logging of deletion
- * actions is handled by xact.c, because it is part of transaction commit.
+ * Note: we log file creation, truncation, deletion and persistence change
+ * here. Logging of deletion actions is mainly handled by xact.c, because it
+ * is part of transaction commit, but we log deletions that happen outside of
+ * a transaction here.
*/
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_UNLINK 0x30
+#define XLOG_SMGR_BUFPERSISTENCE 0x40
typedef struct xl_smgr_create
{
@@ -36,6 +40,18 @@ typedef struct xl_smgr_create
ForkNumber forkNum;
} xl_smgr_create;
+typedef struct xl_smgr_unlink
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+} xl_smgr_unlink;
+
+typedef struct xl_smgr_bufpersistence
+{
+ RelFileNode rnode;
+ bool persistence;
+} xl_smgr_bufpersistence;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +67,8 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrbufpersistence(const RelFileNode *rnode, bool persistence);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index a44be11ca0..4305bdbe96 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -43,7 +43,8 @@ typedef enum ForkNumber
MAIN_FORKNUM = 0,
FSM_FORKNUM,
VISIBILITYMAP_FORKNUM,
- INIT_FORKNUM
+ INIT_FORKNUM,
+ INITTMP_FORKNUM
/*
* NOTE: if you add a new fork, change MAX_FORKNUM and possibly
@@ -52,7 +53,7 @@ typedef enum ForkNumber
*/
} ForkNumber;
-#define MAX_FORKNUM INIT_FORKNUM
+#define MAX_FORKNUM INITTMP_FORKNUM
#define FORKNAMECHARS 4 /* max chars for a fork name */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ff6cd0fc54..d9752a8317 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -205,6 +205,8 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels)
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+ bool permanent, bool isRedo);
extern void DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes);
extern void DropDatabaseBuffers(Oid dbid);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index ebf4a199dc..8be17d9afc 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -86,6 +86,7 @@ extern void smgrclose(SMgrRelation reln);
extern void smgrcloseall(void);
extern void smgrclosenode(RelFileNodeBackend rnode);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
--
2.27.0
v3-0002-New-command-ALTER-TABLE-ALL-IN-TABLESPACE-SET-LOG.patch (text/x-patch; charset=us-ascii)
From d5dfe5943ea790384faf431fc0bdfeff6efd49fd Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 23:21:09 +0900
Subject: [PATCH v3 2/2] New command ALTER TABLE ALL IN TABLESPACE SET
LOGGED/UNLOGGED
To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes
relation persistence of all tables in the specified tablespace.
---
src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++++++++++
src/backend/nodes/copyfuncs.c | 16 ++++
src/backend/nodes/equalfuncs.c | 15 ++++
src/backend/parser/gram.y | 20 +++++
src/backend/tcop/utility.c | 11 +++
src/include/commands/tablecmds.h | 2 +
src/include/nodes/nodes.h | 1 +
src/include/nodes/parsenodes.h | 9 ++
8 files changed, 214 insertions(+)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 37a15d31ee..2f65abb19b 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -13696,6 +13696,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
return new_tablespaceoid;
}
+/*
+ * Alter Table ALL ... SET LOGGED/UNLOGGED
+ *
+ * Allows a user to change persistence of all objects in a given tablespace in
+ * the current database. Objects can be chosen based on the owner of the
+ * object also, to allow users to change persistence of only their objects. The
+ * main permissions handling is done by the lower-level change persistence
+ * function.
+ *
+ * All to-be-modified objects are locked first. If NOWAIT is specified and the
+ * lock can't be acquired then we ereport(ERROR).
+ */
+void
+AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt)
+{
+ List *relations = NIL;
+ ListCell *l;
+ ScanKeyData key[1];
+ Relation rel;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ Oid tablespaceoid;
+ List *role_oids = roleSpecsToIds(NIL);
+
+ /* Ensure we were not asked to change something we can't */
+ if (stmt->objtype != OBJECT_TABLE)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("only tables can be specified")));
+
+ /* Get the tablespace OID */
+ tablespaceoid = get_tablespace_oid(stmt->tablespacename, false);
+
+ /*
+ * Now that the checks are done, check if we should set it to
+ * InvalidOid because it is our database's default tablespace.
+ */
+ if (tablespaceoid == MyDatabaseTableSpace)
+ tablespaceoid = InvalidOid;
+
+ /*
+ * Walk the list of objects in the tablespace to pick them up. This will
+ * only find objects in our database, of course.
+ */
+ ScanKeyInit(&key[0],
+ Anum_pg_class_reltablespace,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(tablespaceoid));
+
+ rel = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(rel, 1, key);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relOid = relForm->oid;
+
+ /*
+ * Do not pick-up objects in pg_catalog as part of this, if an admin
+ * really wishes to do so, they can issue the individual ALTER
+ * commands directly.
+ *
+ * Also, explicitly avoid any shared tables, temp tables, or TOAST
+ * (TOAST will be changed with the main table).
+ */
+ if (IsCatalogNamespace(relForm->relnamespace) ||
+ relForm->relisshared ||
+ isAnyTempNamespace(relForm->relnamespace) ||
+ IsToastNamespace(relForm->relnamespace))
+ continue;
+
+ /* Only pick up the object type requested */
+ if (relForm->relkind != RELKIND_RELATION)
+ continue;
+
+ /* Check if we are only picking-up objects owned by certain roles */
+ if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner))
+ continue;
+
+ /*
+ * Handle permissions-checking here since we are locking the tables
+ * and also to avoid doing a bunch of work only to fail part-way. Note
+ * that permissions will also be checked by AlterTableInternal().
+ *
+ * Caller must be considered an owner on the table of which we're going
+ * to change persistence.
+ */
+ if (!pg_class_ownercheck(relOid, GetUserId()))
+ aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)),
+ NameStr(relForm->relname));
+
+ if (stmt->nowait &&
+ !ConditionalLockRelationOid(relOid, AccessExclusiveLock))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("aborting because lock on relation \"%s.%s\" is not available",
+ get_namespace_name(relForm->relnamespace),
+ NameStr(relForm->relname))));
+ else
+ LockRelationOid(relOid, AccessExclusiveLock);
+
+ /*
+ * Add to our list of objects of which we're going to change
+ * persistence.
+ */
+ relations = lappend_oid(relations, relOid);
+ }
+
+ table_endscan(scan);
+ table_close(rel, AccessShareLock);
+
+ if (relations == NIL)
+ ereport(NOTICE,
+ (errcode(ERRCODE_NO_DATA_FOUND),
+ errmsg("no matching relations in tablespace \"%s\" found",
+ tablespaceoid == InvalidOid ? "(database default)" :
+ get_tablespace_name(tablespaceoid))));
+
+ /*
+ * Everything is locked, loop through and change persistence of all of the
+ * relations.
+ */
+ foreach(l, relations)
+ {
+ List *cmds = NIL;
+ AlterTableCmd *cmd = makeNode(AlterTableCmd);
+
+ if (stmt->logged)
+ cmd->subtype = AT_SetLogged;
+ else
+ cmd->subtype = AT_SetUnLogged;
+
+ cmds = lappend(cmds, cmd);
+
+ EventTriggerAlterTableStart((Node *) stmt);
+ /* OID is set by AlterTableInternal */
+ AlterTableInternal(lfirst_oid(l), cmds, false);
+ EventTriggerAlterTableEnd();
+ }
+}
+
static void
index_copy_data(Relation rel, RelFileNode newrnode)
{
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index ba3ccc712c..127da5151d 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4138,6 +4138,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from)
return newnode;
}
+static AlterTableSetLoggedAllStmt *
+_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *from)
+{
+ AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt);
+
+ COPY_STRING_FIELD(tablespacename);
+ COPY_SCALAR_FIELD(objtype);
+ COPY_SCALAR_FIELD(logged);
+ COPY_SCALAR_FIELD(nowait);
+
+ return newnode;
+}
+
static CreateExtensionStmt *
_copyCreateExtensionStmt(const CreateExtensionStmt *from)
{
@@ -5441,6 +5454,9 @@ copyObjectImpl(const void *from)
case T_AlterTableMoveAllStmt:
retval = _copyAlterTableMoveAllStmt(from);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _copyAlterTableSetLoggedAllStmt(from);
+ break;
case T_CreateExtensionStmt:
retval = _copyCreateExtensionStmt(from);
break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index a2ef853dc2..4f13a1762b 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1872,6 +1872,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a,
return true;
}
+static bool
+_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *a,
+ const AlterTableSetLoggedAllStmt *b)
+{
+ COMPARE_STRING_FIELD(tablespacename);
+ COMPARE_SCALAR_FIELD(objtype);
+ COMPARE_SCALAR_FIELD(logged);
+ COMPARE_SCALAR_FIELD(nowait);
+
+ return true;
+}
+
static bool
_equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b)
{
@@ -3494,6 +3506,9 @@ equal(const void *a, const void *b)
case T_AlterTableMoveAllStmt:
retval = _equalAlterTableMoveAllStmt(a, b);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _equalAlterTableSetLoggedAllStmt(a, b);
+ break;
case T_CreateExtensionStmt:
retval = _equalCreateExtensionStmt(a, b);
break;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 31c95443a5..2222fd8fe3 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -1934,6 +1934,26 @@ AlterTableStmt:
n->nowait = $13;
$$ = (Node *)n;
}
+ | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = true;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = false;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
| ALTER INDEX qualified_name alter_table_cmds
{
AlterTableStmt *n = makeNode(AlterTableStmt);
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 53a511f1da..16606448bf 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -161,6 +161,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_AlterTSConfigurationStmt:
case T_AlterTSDictionaryStmt:
case T_AlterTableMoveAllStmt:
+ case T_AlterTableSetLoggedAllStmt:
case T_AlterTableSpaceOptionsStmt:
case T_AlterTableStmt:
case T_AlterTypeStmt:
@@ -1732,6 +1733,12 @@ ProcessUtilitySlow(ParseState *pstate,
commandCollected = true;
break;
+ case T_AlterTableSetLoggedAllStmt:
+ AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree);
+ /* commands are stashed in AlterTableSetLoggedAll */
+ commandCollected = true;
+ break;
+
case T_DropStmt:
ExecDropStmt((DropStmt *) parsetree, isTopLevel);
/* no commands stashed for DROP */
@@ -2619,6 +2626,10 @@ CreateCommandTag(Node *parsetree)
tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype);
+ break;
+
case T_AlterTableStmt:
tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype);
break;
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index 08c463d3c4..646928466d 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
+extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt);
+
extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
Oid *oldschema);
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index caed683ba9..16d91d3e1d 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -424,6 +424,7 @@ typedef enum NodeTag
T_AlterCollationStmt,
T_CallStmt,
T_AlterStatsStmt,
+ T_AlterTableSetLoggedAllStmt,
/*
* TAGS FOR PARSE TREE NODES (parsenodes.h)
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index dc2bb40926..c3eab6f1ab 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2253,6 +2253,15 @@ typedef struct AlterTableMoveAllStmt
bool nowait;
} AlterTableMoveAllStmt;
+typedef struct AlterTableSetLoggedAllStmt
+{
+ NodeTag type;
+ char *tablespacename;
+ ObjectType objtype; /* Object type to move */
+ bool logged;
+ bool nowait;
+} AlterTableSetLoggedAllStmt;
+
/* ----------------------
* Create/Alter Extension Statements
* ----------------------
--
2.27.0
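The relation filter applied in the pg_class scan of AlterTableSetLoggedAll() above can be summarized as one predicate. The sketch below is a standalone approximation, assuming a hypothetical DemoRel struct in place of Form_pg_class and plain booleans in place of the namespace checks (IsCatalogNamespace() and friends); demo_is_candidate() is not a function from the patch. Note that in the patch as posted role_oids is always NIL, so the owner filter never excludes anything; the owners array here only shows how that check would behave if a role list were supplied.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of the pg_class fields the scan inspects. */
typedef struct DemoRel
{
    bool     in_catalog_ns;  /* relation lives in pg_catalog */
    bool     is_shared;      /* shared across databases */
    bool     in_temp_ns;     /* relation lives in a temp namespace */
    bool     in_toast_ns;    /* TOAST table, handled via its main table */
    char     relkind;        /* 'r' is an ordinary table (RELKIND_RELATION) */
    unsigned owner;          /* owning role's OID */
} DemoRel;

/*
 * Return true if the relation would be picked up for the persistence
 * change. An empty owner list (nowners == 0) means "no owner filter".
 */
static bool
demo_is_candidate(const DemoRel *rel, const unsigned *owners, size_t nowners)
{
    assert(rel != NULL);

    /* Skip catalog, shared, temporary and TOAST relations. */
    if (rel->in_catalog_ns || rel->is_shared ||
        rel->in_temp_ns || rel->in_toast_ns)
        return false;

    /* Only ordinary tables are requested. */
    if (rel->relkind != 'r')
        return false;

    /* Optional owner filter. */
    if (nowners > 0)
    {
        size_t i;

        for (i = 0; i < nowners; i++)
            if (owners[i] == rel->owner)
                break;
        if (i == nowners)
            return false;
    }

    return true;
}
```

In the patch itself, relations that pass this filter are then ownership-checked and locked with AccessExclusiveLock before any persistence change is attempted.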
At Fri, 08 Jan 2021 14:47:05 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
In this version, RelationChangePersistence() is changed not to choose
the in-place method for indexes other than btree. It seems to be usable
with all kinds of indexes other than GiST, but at the moment it applies
only to btrees.
1: /messages/by-id/CA+TgmoZEZ5RONS49C7mEpjhjndqMQtVrz_LCQUkpRWdmRevDnQ@mail.gmail.com
Hmm. This is not working correctly. I'll repost after fixing that.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Fri, 08 Jan 2021 17:52:21 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Fri, 08 Jan 2021 14:47:05 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
In this version, RelationChangePersistence() is changed not to choose
the in-place method for indexes other than btree. It seems to be usable
with all kinds of indexes other than GiST, but at the moment it applies
only to btrees.
1: /messages/by-id/CA+TgmoZEZ5RONS49C7mEpjhjndqMQtVrz_LCQUkpRWdmRevDnQ@mail.gmail.com
Hmm. This is not working correctly. I'll repost after fixing that.
I think I fixed the misbehavior. ResetUnloggedRelationsInDbspaceDir()
handled file operations in the wrong order and with the wrong logic.
It also needed to drop buffers and forget fsync requests.
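For reference, the file-selection rule in the reinit.c hunk of the v3 patch earlier in this thread can be sketched as two passes over a directory listing. This is a standalone approximation of that earlier rule only; it may not be the exact version this message fixes, and it is not the corrected v4 logic. DemoFork, DemoFile, DemoEntry and both functions are hypothetical stand-ins, and a linear search replaces the hash table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the fork numbers used by the patch. */
typedef enum { DEMO_MAIN, DEMO_FSM, DEMO_VM, DEMO_INIT, DEMO_INITTMP } DemoFork;

typedef struct { unsigned oid; DemoFork fork; } DemoFile;   /* one file */
typedef struct { unsigned oid; bool dirty; } DemoEntry;     /* one relfilenode */

#define DEMO_MAX_ENTRIES 64

/*
 * Pass 1: record every relfilenode that has an init or inittmp fork.
 * An inittmp fork marks the whole relfilenode as "dirty".
 */
static size_t
demo_collect(const DemoFile *files, size_t nfiles, DemoEntry *entries)
{
    size_t n = 0;

    for (size_t i = 0; i < nfiles; i++)
    {
        size_t j;

        if (files[i].fork != DEMO_INIT && files[i].fork != DEMO_INITTMP)
            continue;

        for (j = 0; j < n; j++)
            if (entries[j].oid == files[i].oid)
                break;

        if (j == n)
        {
            assert(n < DEMO_MAX_ENTRIES);
            entries[n].oid = files[i].oid;
            entries[n].dirty = false;
            n++;
        }

        if (files[i].fork == DEMO_INITTMP)
            entries[j].dirty = true;
    }

    return n;
}

/*
 * Pass 2: a file is removed if its relfilenode was recorded and either the
 * relfilenode is dirty or the file is not the (clean) init fork.
 */
static bool
demo_should_unlink(const DemoEntry *entries, size_t n, DemoFile f)
{
    for (size_t i = 0; i < n; i++)
        if (entries[i].oid == f.oid)
            return entries[i].dirty || f.fork != DEMO_INIT;

    return false;   /* no init or inittmp fork: not touched */
}
```

Under this rule a relation with only a clean init fork keeps that fork and loses its other forks, a relation with an inittmp fork loses every file, and a relation with neither fork is left alone.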
I thought that the two cases that this patch is expected to fix
(orphan relation files and uncommitted init files) could share the same
"cleanup" fork, but that is wrong. I had to add one more
fork to differentiate the cases of SET UNLOGGED and of creation of
UNLOGGED tables...
The attached is a new version, which seems to work correctly but looks
somewhat messy. I'll continue working on it.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v4-0001-In-place-table-persistence-change.patch (text/x-patch; charset=us-ascii)
From 88e9374529cbd8f983f2c82baadea94b475e46dd Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v4 1/2] In-place table persistence change
Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs a heap rewrite, which causes a large amount
of file I/O. This patch makes the command run without a heap rewrite.
In addition, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of a massive number of HEAP_INSERTs, which should be
smaller.
Also this allows for the cleanup of files left behind by a crash of
the transaction that created them.
---
src/backend/access/rmgrdesc/smgrdesc.c | 23 ++
src/backend/access/transam/README | 8 +
src/backend/access/transam/xlog.c | 17 +
src/backend/catalog/storage.c | 436 +++++++++++++++++++++++--
src/backend/commands/tablecmds.c | 246 +++++++++++---
src/backend/storage/buffer/bufmgr.c | 88 +++++
src/backend/storage/file/reinit.c | 322 ++++++++++++------
src/backend/storage/smgr/md.c | 13 +-
src/backend/storage/smgr/smgr.c | 6 +
src/common/relpath.c | 4 +-
src/include/catalog/storage.h | 2 +
src/include/catalog/storage_xlog.h | 22 +-
src/include/common/relpath.h | 6 +-
src/include/storage/bufmgr.h | 2 +
src/include/storage/md.h | 2 +
src/include/storage/reinit.h | 3 +-
src/include/storage/smgr.h | 1 +
17 files changed, 1034 insertions(+), 167 deletions(-)
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 7755553d57..2c109b8ca4 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,23 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+ char *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+ appendStringInfoString(buf, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+ char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+ appendStringInfoString(buf, path);
+ appendStringInfo(buf, " persistence %d", xlrec->persistence);
+ pfree(path);
+ }
}
const char *
@@ -55,6 +72,12 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_UNLINK:
+ id = "UNLINK";
+ break;
+ case XLOG_SMGR_BUFPERSISTENCE:
+ id = "BUFPERSISTENCE";
+ break;
}
return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c1..547107a771 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -724,6 +724,14 @@ we must panic and abort recovery. The DBA will have to manually clean up
then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.
+The CLEANUP fork file
+--------------------------------
+
+A CLEANUP fork is created when a new relation file is created, to mark
+that the relfilenode needs to be cleaned up at recovery time. In contrast
+to 4 above, failure to remove a CLEANUP fork file will lead to data
+loss, in which case the server will shut down.
+
Skipping WAL for New RelFileNode
--------------------------------
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b18257c198..6dcbcbe387 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -40,6 +40,7 @@
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
#include "catalog/pg_database.h"
+#include "catalog/storage.h"
#include "commands/progress.h"
#include "commands/tablespace.h"
#include "common/controldata_utils.h"
@@ -4442,6 +4443,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
{
ereport(DEBUG1,
(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+
+ /* cleanup garbage files left during crash recovery */
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
InArchiveRecovery = true;
if (StandbyModeRequested)
StandbyMode = true;
@@ -7455,6 +7464,14 @@ StartupXLOG(void)
}
}
+ /* clean up garbage files left during crash recovery */
+ if (!InArchiveRecovery)
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
/* Allow resource managers to do any required cleanup. */
for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
{
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index cba7a9ada0..c54d70747f 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,6 +19,7 @@
#include "postgres.h"
+#include "access/amapi.h"
#include "access/parallel.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -27,6 +28,7 @@
#include "access/xlogutils.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+#include "common/hashfn.h"
#include "miscadmin.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
@@ -57,9 +59,16 @@ int wal_skip_threshold = 2048; /* in kilobytes */
* but I'm being paranoid.
*/
+#define PDOP_DELETE (0)
+#define PDOP_UNLINK_FORK (1 << 0)
+#define PDOP_SET_PERSISTENCE (1 << 1)
+
typedef struct PendingRelDelete
{
RelFileNode relnode; /* relation that may need to be deleted */
+ int op; /* operation mask */
+ bool bufpersistence; /* buffer persistence to set */
+ int unlink_forknum; /* forknum to unlink */
BackendId backend; /* InvalidBackendId if not a temp rel */
bool atCommit; /* T=delete at commit; F=delete at abort */
int nestLevel; /* xact nesting level of request */
@@ -75,6 +84,24 @@ typedef struct PendingRelSync
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
HTAB *pendingSyncHash = NULL;
+typedef struct SRelHashEntry
+{
+ SMgrRelation srel;
+ char status; /* for simplehash use */
+} SRelHashEntry;
+
+/* define a hash table used as a work area for pending deletes */
+#define SH_PREFIX srelhash
+#define SH_ELEMENT_TYPE SRelHashEntry
+#define SH_KEY_TYPE SMgrRelation
+#define SH_KEY srel
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((unsigned char *)&key, sizeof(SMgrRelation))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(SMgrRelation)) == 0)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
/*
* AddPendingSync
@@ -143,7 +170,17 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return NULL; /* placate compiler */
}
+ /*
+ * We are going to create a new storage file. If the server crashes before
+ * the current transaction ends, the file needs to be cleaned up, but nothing
+ * else identifies it as an orphan. The cleanup fork works as the sentinel
+ * to identify that situation.
+ */
srel = smgropen(rnode, backend);
+ smgrcreate(srel, CLEANUP2_FORKNUM, false);
+ log_smgrcreate(&rnode, CLEANUP2_FORKNUM);
+ smgrimmedsync(srel, CLEANUP2_FORKNUM);
+
smgrcreate(srel, MAIN_FORKNUM, false);
if (needs_wal)
@@ -153,12 +190,25 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
pending = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
pending->relnode = rnode;
+ pending->op = PDOP_DELETE;
pending->backend = backend;
pending->atCommit = false; /* delete if abort */
pending->nestLevel = GetCurrentTransactionNestLevel();
pending->next = pendingDeletes;
pendingDeletes = pending;
+ /* drop cleanup fork at commit */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_FORK;
+ pending->unlink_forknum = CLEANUP2_FORKNUM;
+ pending->backend = backend;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+
if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
{
Assert(backend == InvalidBackendId);
@@ -168,6 +218,218 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return srel;
}
+/*
+ * RelationCreateInitFork
+ * Create physical storage for the init fork of a relation.
+ *
+ * Also switches the relation's buffers to non-permanent.
+ *
+ * This function is transactional. The creation is WAL-logged, and if the
+ * transaction aborts later on, the init fork will be removed.
+ */
+void
+RelationCreateInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingRelDelete *pending;
+ SMgrRelation srel;
+ PendingRelDelete *prev;
+ PendingRelDelete *next;
+ bool create = true;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(rel->rd_smgr, false, false);
+
+ /*
+ * If we have entries for init-fork operations on this relation, that means
+ * we have already registered pending-delete entries to drop an init fork
+ * that has existed since before the current transaction started. This
+ * function reverts that change just by removing the entries.
+ */
+ prev = NULL;
+ for (pending = pendingDeletes; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ pending->op != PDOP_DELETE &&
+ ((pending->op & PDOP_UNLINK_FORK) != 0 &&
+ pending->unlink_forknum == CLEANUP_FORKNUM))
+ {
+ if (prev)
+ prev->next = next;
+ else
+ pendingDeletes = next;
+ pfree(pending);
+
+ create = false;
+ }
+ else
+ {
+ /* unrelated entry, don't touch it */
+ prev = pending;
+ }
+ }
+
+ if (!create)
+ return;
+
+ /*
+ * We are going to create the init fork. If the server crashes before the
+ * current transaction ends, the init fork left behind would corrupt data
+ * during recovery. The cleanup fork works as the sentinel to identify that
+ * situation.
+ */
+ srel = smgropen(rnode, InvalidBackendId);
+ smgrcreate(srel, CLEANUP_FORKNUM, false);
+ log_smgrcreate(&rnode, CLEANUP_FORKNUM);
+ smgrimmedsync(srel, CLEANUP_FORKNUM);
+
+ /* We don't have an existing init fork; create it. */
+ smgrcreate(srel, INIT_FORKNUM, false);
+
+ /*
+ * The init fork of an index needs further initialization. ambuildempty
+ * should WAL-log and sync the file by itself; otherwise we do that ourselves.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rnode, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /* drop this init fork file at abort and revert persistence */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_FORK | PDOP_SET_PERSISTENCE;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->bufpersistence = true;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+
+ /* drop cleanup fork at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_FORK;
+ pending->unlink_forknum = CLEANUP_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+
+ /* drop cleanup fork at commit */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_FORK;
+ pending->unlink_forknum = CLEANUP_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+}
+
+/*
+ * RelationDropInitFork
+ * Delete physical storage for the init fork of a relation.
+ *
+ * Register pending-delete of the init fork. The real deletion is performed by
+ * smgrDoPendingDeletes at commit.
+ *
+ * This function is transactional. If the transaction aborts later on, the
+ * deletion doesn't happen.
+ */
+void
+RelationDropInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingRelDelete *pending;
+ PendingRelDelete *prev;
+ PendingRelDelete *next;
+ bool inxact_created = false;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(rel->rd_smgr, true, false);
+
+ /*
+ * If we have entries for init-fork operations on this relation, that means
+ * we have created the init fork in the current transaction. In that case
+ * we remove the init and cleanup forks immediately. Otherwise we just
+ * register a pending-delete for the existing init fork.
+ */
+ prev = NULL;
+ for (pending = pendingDeletes; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ pending->op != PDOP_DELETE &&
+ ((pending->op & PDOP_UNLINK_FORK) != 0 &&
+ pending->unlink_forknum == CLEANUP_FORKNUM))
+ {
+ /* unlink list entry */
+ if (prev)
+ prev->next = next;
+ else
+ pendingDeletes = next;
+ pfree(pending);
+
+ inxact_created = true;
+ }
+ else
+ {
+ /* unrelated entry, don't touch it */
+ prev = pending;
+ }
+ }
+
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rnode, InvalidBackendId);
+
+ /*
+ * INIT/CLEANUP forks are never loaded into shared buffers, so there is
+ * no point in dropping buffers for these files.
+ */
+ log_smgrunlink(&rnode, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ log_smgrunlink(&rnode, CLEANUP_FORKNUM);
+ smgrunlink(srel, CLEANUP_FORKNUM, false);
+ return;
+ }
+
+ /* register drop of this init fork file at commit */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_FORK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_SET_PERSISTENCE;
+ pending->bufpersistence = false;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+}
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -187,6 +449,44 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum)
+{
+ xl_smgr_unlink xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file unlink.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileNode *rnode, bool persistence)
+{
+ xl_smgr_bufpersistence xlrec;
+
+ /*
+ * Make an XLOG entry reporting the change of buffer persistence.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.persistence = persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -200,6 +500,7 @@ RelationDropStorage(Relation rel)
pending = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
pending->relnode = rel->rd_node;
+ pending->op = PDOP_DELETE;
pending->backend = rel->rd_backend;
pending->atCommit = true; /* delete if commit */
pending->nestLevel = GetCurrentTransactionNestLevel();
@@ -602,59 +903,97 @@ smgrDoPendingDeletes(bool isCommit)
int nrels = 0,
maxrels = 0;
SMgrRelation *srels = NULL;
+ srelhash_hash *close_srels = NULL;
+ bool found;
prev = NULL;
for (pending = pendingDeletes; pending != NULL; pending = next)
{
+ SMgrRelation srel;
+
next = pending->next;
if (pending->nestLevel < nestLevel)
{
/* outer-level entries should not be processed yet */
prev = pending;
+ continue;
}
+
+ /* unlink list entry first, so we don't retry on failure */
+ if (prev)
+ prev->next = next;
else
+ pendingDeletes = next;
+
+ if (pending->atCommit != isCommit)
{
- /* unlink list entry first, so we don't retry on failure */
- if (prev)
- prev->next = next;
- else
- pendingDeletes = next;
- /* do deletion if called for */
- if (pending->atCommit == isCommit)
- {
- SMgrRelation srel;
-
- srel = smgropen(pending->relnode, pending->backend);
-
- /* allocate the initial array, or extend it, if needed */
- if (maxrels == 0)
- {
- maxrels = 8;
- srels = palloc(sizeof(SMgrRelation) * maxrels);
- }
- else if (maxrels <= nrels)
- {
- maxrels *= 2;
- srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
- }
-
- srels[nrels++] = srel;
- }
/* must explicitly free the list entry */
pfree(pending);
/* prev does not change */
+ continue;
+ }
+
+ if (close_srels == NULL)
+ close_srels = srelhash_create(CurrentMemoryContext, 32, NULL);
+
+ srel = smgropen(pending->relnode, pending->backend);
+
+ /* Uniquify the smgr relations */
+ srelhash_insert(close_srels, srel, &found);
+
+ if (pending->op != PDOP_DELETE)
+ {
+ if (pending->op & PDOP_UNLINK_FORK)
+ {
+ /* other forks would need their buffers dropped */
+ Assert(pending->unlink_forknum == INIT_FORKNUM ||
+ pending->unlink_forknum == CLEANUP_FORKNUM ||
+ pending->unlink_forknum == CLEANUP2_FORKNUM);
+
+ log_smgrunlink(&pending->relnode, pending->unlink_forknum);
+ smgrunlink(srel, pending->unlink_forknum, false);
+
+ }
+
+ if (pending->op & PDOP_SET_PERSISTENCE)
+ SetRelationBuffersPersistence(srel, pending->bufpersistence,
+ InRecovery);
+ }
+ else
+ {
+ /* allocate the initial array, or extend it, if needed */
+ if (maxrels == 0)
+ {
+ maxrels = 8;
+ srels = palloc(sizeof(SMgrRelation) * maxrels);
+ }
+ else if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ srels[nrels++] = srel;
}
}
if (nrels > 0)
{
smgrdounlinkall(srels, nrels, false);
-
- for (int i = 0; i < nrels; i++)
- smgrclose(srels[i]);
-
pfree(srels);
}
+
+ if (close_srels)
+ {
+ srelhash_iterator i;
+ SRelHashEntry *ent;
+
+ /* close smgr relations */
+ srelhash_start_iterate(close_srels, &i);
+ while ((ent = srelhash_iterate(close_srels, &i)) != NULL)
+ smgrclose(ent->srel);
+ srelhash_destroy(close_srels);
+ }
}
/*
@@ -824,7 +1163,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
- && pending->backend == InvalidBackendId)
+ && pending->backend == InvalidBackendId
+ && pending->op == PDOP_DELETE)
nrels++;
}
if (nrels == 0)
@@ -837,7 +1177,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
- && pending->backend == InvalidBackendId)
+ && pending->backend == InvalidBackendId &&
+ pending->op == PDOP_DELETE)
{
*rptr = pending->relnode;
rptr++;
@@ -917,6 +1258,15 @@ smgr_redo(XLogReaderState *record)
reln = smgropen(xlrec->rnode, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ smgrunlink(reln, xlrec->forkNum, true);
+ smgrclose(reln);
+ }
else if (info == XLOG_SMGR_TRUNCATE)
{
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
@@ -1005,6 +1355,28 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingRelDelete *pending;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = xlrec->rnode;
+ pending->op = PDOP_SET_PERSISTENCE;
+ pending->bufpersistence = !xlrec->persistence;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 993da56d43..37a15d31ee 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -50,6 +50,7 @@
#include "commands/defrem.h"
#include "commands/event_trigger.h"
#include "commands/policy.h"
+#include "commands/progress.h"
#include "commands/sequence.h"
#include "commands/tablecmds.h"
#include "commands/tablespace.h"
@@ -4917,6 +4918,170 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
return newcmd;
}
+/*
+ * RelationChangePersistence: do in-place persistence change of a relation
+ */
+static void
+RelationChangePersistence(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ int i;
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * If any condition below did not hold we would have to call ATRewriteTable,
+ * which cannot happen in the AT_REWRITE_ALTER_PERSISTENCE case.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ RelationOpenSmgr(rel);
+
+ /*
+ * First we collect all relations that we need to change persistence.
+ */
+
+ /* Collect OIDs of indexes and toast relations */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ /* Add toast relation if any */
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ List *toastidx;
+ Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
+
+ RelationOpenSmgr(toastrel);
+ relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
+ toastidx = RelationGetIndexList(toastrel);
+ relids = list_concat(relids, toastidx);
+ pfree(toastidx);
+ table_close(toastrel, NoLock);
+ }
+
+ table_close(rel, NoLock);
+
+ /* Make changes in storage */
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach (lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+
+ /*
+ * Some access methods do not accept in-place persistence change. For
+ * example, GiST uses page LSNs to figure out whether a block has
+ * changed, where UNLOGGED GiST indexes use fake LSNs that are
+ * incompatible with real LSNs used for LOGGED ones.
+ *
+ * XXXX: We don't bother allowing in-place persistence change for index
+ * methods other than btree for now.
+ */
+ if (r->rd_rel->relkind == RELKIND_INDEX &&
+ r->rd_rel->relam != BTREE_AM_OID)
+ {
+ int reindex_flags;
+
+ /* reindex doesn't allow concurrent use of the index */
+ table_close(r, NoLock);
+
+ reindex_flags =
+ REINDEX_REL_SUPPRESS_INDEX_USE |
+ REINDEX_REL_CHECK_CONSTRAINTS;
+
+ /* Set the same persistence as the parent relation. */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+
+ reindex_index(reloid, reindex_flags, persistence, 0);
+
+ continue;
+ }
+
+ RelationOpenSmgr(r);
+
+ /* Create or drop init fork */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ RelationCreateInitFork(r);
+ else
+ RelationDropInitFork(r);
+
+ /*
+ * When this relation gets WAL-logged, immediately sync all files but
+ * the init fork to establish the initial state on storage. Buffers have
+ * already been flushed out by RelationCreate(Drop)InitFork called just
+ * above. The init fork should have been synced as needed.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT)
+ {
+ for (i = 0 ; i < INIT_FORKNUM ; i++)
+ {
+ if (smgrexists(r->rd_smgr, i))
+ smgrimmedsync(r->rd_smgr, i);
+ }
+ }
+
+ /* Update catalog */
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+
+ /*
+ * While wal_level >= replica, switching to LOGGED requires the
+ * relation content to be WAL-logged to recover the table.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded())
+ {
+ ForkNumber fork;
+
+ for (fork = 0; fork < INIT_FORKNUM ; fork++)
+ {
+ if (smgrexists(r->rd_smgr, fork))
+ log_newpage_range(r, fork,
+ 0, smgrnblocks(r->rd_smgr, fork), false);
+ }
+ }
+
+ table_close(r, NoLock);
+ }
+
+ table_close(classRel, NoLock);
+}
+
/*
* ATRewriteTables: ALTER TABLE phase 3
*/
@@ -5037,45 +5202,52 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- /*
- * Create transient table that will receive the modified data.
- *
- * Ensure it is marked correctly as logged or unlogged. We have
- * to do this here so that buffers for the new relfilenode will
- * have the right persistence set, and at the same time ensure
- * that the original filenode's buffers will get read in with the
- * correct setting (i.e. the original one). Otherwise a rollback
- * after the rewrite would possibly result with buffers for the
- * original filenode having the wrong persistence setting.
- *
- * NB: This relies on swap_relation_files() also swapping the
- * persistence. That wouldn't work for pg_class, but that can't be
- * unlogged anyway.
- */
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, persistence,
- lockmode);
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
+ RelationChangePersistence(tab, persistence, lockmode);
+ else
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new relfilenode
+ * will have the right persistence set, and at the same time
+ * ensure that the original filenode's buffers will get read in
+ * with the correct setting (i.e. the original one). Otherwise
+ * a rollback after the rewrite would possibly result with
+ * buffers for the original filenode having the wrong
+ * persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that can't
+ * be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, persistence,
+ lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
+ /*
+ * Swap the physical files of the old and new heaps, then
+ * rebuild indexes and discard the old heap. We can use
+ * RecentXmin for the table's new relfrozenxid because we
+ * rewrote all the tuples in ATRewriteTable, so no older Xid
+ * remains in the table. Also, we never try to swap toast
+ * tables by content, since we have no interest in letting this
+ * code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
+ }
}
else
{
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 71b5852224..b730b4417c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -37,6 +37,7 @@
#include "access/xlog.h"
#include "catalog/catalog.h"
#include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
#include "executor/instrument.h"
#include "lib/binaryheap.h"
#include "miscadmin.h"
@@ -3032,6 +3033,93 @@ DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
}
}
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a relation
+ * and, when switching to PERMANENT, writes all dirty pages of the relation
+ * out to disk (or more accurately, out to kernel disk buffers),
+ * ensuring that the kernel has an up-to-date view of the relation.
+ *
+ * Generally, the caller should be holding AccessExclusiveLock on the
+ * target relation to ensure that no other backend is busy dirtying
+ * more blocks of the relation; the effects can't be expected to last
+ * after the lock is released.
+ *
+ * XXX currently it sequentially searches the buffer pool, should be
+ * changed to more clever ways of searching. This routine is not
+ * used in any performance-critical code paths, so it's not worth
+ * adding additional overhead to normal paths to make it go faster;
+ * but see also DropRelFileNodeBuffers.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
+{
+ int i;
+ RelFileNodeBackend rnode = srel->smgr_rnode;
+
+ Assert (!RelFileNodeBackendIsTemp(rnode));
+
+ if (!isRedo)
+ log_smgrbufpersistence(&srel->smgr_rnode.node, permanent);
+
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ {
+ UnlockBufHdr(bufHdr, buf_state);
+ continue;
+ }
+
+ if (permanent)
+ {
+ /* Init fork is being dropped, drop buffers for it. */
+ if (bufHdr->tag.forkNum == INIT_FORKNUM)
+ {
+ InvalidateBuffer(bufHdr);
+ continue;
+ }
+
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /* we flush this buffer when switching to PERMANENT */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, srel);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr, true);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ /* init fork is always BM_PERMANENT. See BufferAlloc */
+ if (bufHdr->tag.forkNum != INIT_FORKNUM)
+ buf_state &= ~BM_PERMANENT;
+
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ }
+}
+
/* ---------------------------------------------------------------------
* DropRelFileNodesAllBuffers
*
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 40c758d789..b07709bc4f 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -16,29 +16,50 @@
#include <unistd.h>
+#include "access/xlog.h"
+#include "catalog/pg_tablespace_d.h"
#include "common/relpath.h"
+#include "storage/bufmgr.h"
#include "storage/copydir.h"
#include "storage/fd.h"
+#include "storage/md.h"
#include "storage/reinit.h"
+#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
- int op);
+ Oid tspid, int op);
static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
- int op);
+ Oid tspid, Oid dbid, int op);
typedef struct
{
Oid reloid; /* hash key */
-} unlogged_relation_entry;
+ bool has_init; /* has INIT fork */
+ bool dirty_init; /* needs to remove INIT fork */
+ bool dirty_all; /* needs to remove all forks */
+} relfile_entry;
/*
- * Reset unlogged relations from before the last restart.
+ * Clean up and reset relation files from before the last restart.
*
- * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any
- * relation with an "init" fork, except for the "init" fork itself.
+ * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations
+ * depending on the existence of the "cleanup" forks.
*
+ * If CLEANUP_FORKNUM (clup) is present, we remove the init fork of the same
+ * relation along with the clup fork.
+ *
+ * If CLEANUP2_FORKNUM (cln2) is present we remove the whole relation along
+ * with the cln2 fork.
+ *
+ * Otherwise, if the "init" fork is found, we remove all forks of any relation
+ * with the "init" fork, except for the "init" fork itself.
+ *
+ *
+ * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all
+ * relations that have the "cleanup" and/or the "init" forks.
+ *
* If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main
* fork.
*/
@@ -68,7 +89,7 @@ ResetUnloggedRelations(int op)
/*
* First process unlogged files in pg_default ($PGDATA/base)
*/
- ResetUnloggedRelationsInTablespaceDir("base", op);
+ ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op);
/*
* Cycle through directories for all non-default tablespaces.
@@ -77,13 +98,19 @@ ResetUnloggedRelations(int op)
while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
{
+ Oid tspid;
+
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
continue;
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
- ResetUnloggedRelationsInTablespaceDir(temp_path, op);
+
+ tspid = atooid(spc_de->d_name);
+ Assert(tspid != 0);
+
+ ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op);
}
FreeDir(spc_dir);
@@ -99,7 +126,8 @@ ResetUnloggedRelations(int op)
* Process one tablespace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
+ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
+ Oid tspid, int op)
{
DIR *ts_dir;
struct dirent *de;
@@ -126,6 +154,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
while ((de = ReadDir(ts_dir, tsdirname)) != NULL)
{
+ Oid dbid;
+
/*
* We're only interested in the per-database directories, which have
* numeric names. Note that this code will also (properly) ignore "."
@@ -136,7 +166,10 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
snprintf(dbspace_path, sizeof(dbspace_path), "%s/%s",
tsdirname, de->d_name);
- ResetUnloggedRelationsInDbspaceDir(dbspace_path, op);
+ dbid = atooid(de->d_name);
+ Assert(dbid != 0);
+
+ ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op);
}
FreeDir(ts_dir);
@@ -146,125 +179,232 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
* Process one per-dbspace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
+ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
+ Oid tspid, Oid dbid, int op)
{
DIR *dbspace_dir;
struct dirent *de;
char rm_path[MAXPGPATH * 2];
+ HTAB *hash;
+ HASHCTL ctl;
/* Caller must specify at least one operation. */
- Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0);
+ Assert((op & (UNLOGGED_RELATION_CLEANUP |
+ UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_INIT)) != 0);
/*
* Cleanup is a two-pass operation. First, we go through and identify all
* the files with init forks. Then, we go through again and nuke
* everything with the same OID except the init fork.
*/
+
+ /*
+ * It's possible that someone could create a ton of unlogged relations
+ * in the same database & tablespace, so we'd better use a hash table
+ * rather than an array or linked list to keep track of which files
+ * need to be reset. Otherwise, this cleanup operation would be
+ * O(n^2).
+ */
+ memset(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(relfile_entry);
+ hash = hash_create("relfilenode cleanup hash",
+ 32, &ctl, HASH_ELEM | HASH_BLOBS);
+
+ /* Collect INIT and CLEANUP forks in the directory. */
+ dbspace_dir = AllocateDir(dbspacedirname);
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
+ {
+ int oidchars;
+ ForkNumber forkNum;
+
+ /* Skip anything that doesn't look like a relation data file. */
+ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
+ &forkNum))
+ continue;
+
+ if (forkNum == INIT_FORKNUM ||
+ forkNum == CLEANUP_FORKNUM || forkNum == CLEANUP2_FORKNUM)
+ {
+ Oid key;
+ relfile_entry *ent;
+ bool found;
+
+ /*
+ * Record the relfilenode information. If it has the CLEANUP fork,
+ * the relfilenode is in dirty state, where clean up is needed.
+ */
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_ENTER, &found);
+
+ if (!found)
+ {
+ ent->has_init = false;
+ ent->dirty_init = false;
+ ent->dirty_all = false;
+ }
+
+ if (forkNum == CLEANUP_FORKNUM)
+ ent->dirty_init = true;
+ else if (forkNum == CLEANUP2_FORKNUM)
+ ent->dirty_all = true;
+ else
+ {
+ Assert(forkNum == INIT_FORKNUM);
+ ent->has_init = true;
+ }
+ }
+ }
+
+ /* Done with the first pass. */
+ FreeDir(dbspace_dir);
+
+ /* nothing to do if we don't have init nor cleanup forks */
+ if (hash_get_num_entries(hash) < 1)
+ {
+ hash_destroy(hash);
+ return;
+ }
+
+ if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0)
+ {
+ /*
+ * When we come here after recovery, smgr object for this file might
+ * have been created. In that case we need to drop all buffers then the
+ * smgr object before initializing the unlogged relation. This is safe
+ * as far as no other backends have accessed the relation before
+ * starting archive recovery.
+ */
+ HASH_SEQ_STATUS status;
+ relfile_entry *ent;
+ SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8);
+ int maxrels = 8;
+ int nrels = 0;
+ RelFileNodeBackend *rnodes;
+ int i;
+
+ Assert(!HotStandbyActive());
+
+ hash_seq_init(&status, hash);
+ while((ent = (relfile_entry *) hash_seq_search(&status)) != NULL)
+ {
+ RelFileNodeBackend rel;
+
+ /*
+ * The relation is persistent and remains persistent. Don't
+ * drop the buffers for this relation.
+ */
+ if (ent->has_init && ent->dirty_init)
+ continue;
+
+ if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = ent->reloid;
+
+ srels[nrels++] = smgropen(rel.node, InvalidBackendId);
+ }
+
+ rnodes = palloc(sizeof(RelFileNodeBackend) * nrels);
+
+ for (i = 0 ; i < nrels ; i++)
+ rnodes[i] = srels[i]->smgr_rnode;
+
+ DropRelFileNodesAllBuffers(rnodes, nrels);
+
+ for (i = 0 ; i < nrels ; i++)
+ smgrclose(srels[i]);
+ }
+
+ /*
+ * Now, make a second pass and remove anything that matches.
+ */
if ((op & UNLOGGED_RELATION_CLEANUP) != 0)
{
- HTAB *hash;
- HASHCTL ctl;
-
- /*
- * It's possible that someone could create a ton of unlogged relations
- * in the same database & tablespace, so we'd better use a hash table
- * rather than an array or linked list to keep track of which files
- * need to be reset. Otherwise, this cleanup operation would be
- * O(n^2).
- */
- ctl.keysize = sizeof(Oid);
- ctl.entrysize = sizeof(unlogged_relation_entry);
- ctl.hcxt = CurrentMemoryContext;
- hash = hash_create("unlogged relation OIDs", 32, &ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- /* Scan the directory. */
- dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
- {
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
-
- /* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* Also skip it unless this is the init fork. */
- if (forkNum != INIT_FORKNUM)
- continue;
-
- /*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
- */
- ent.reloid = atooid(de->d_name);
- (void) hash_search(hash, &ent, HASH_ENTER, NULL);
- }
-
- /* Done with the first pass. */
- FreeDir(dbspace_dir);
-
- /*
- * If we didn't find any init forks, there's no point in continuing;
- * we can bail out now.
- */
- if (hash_get_num_entries(hash) == 0)
- {
- hash_destroy(hash);
- return;
- }
-
- /*
- * Now, make a second pass and remove anything that matches.
- */
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
+ ForkNumber forkNum;
+ int oidchars;
+ Oid key;
+ relfile_entry *ent;
+ RelFileNodeBackend rel;
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
&forkNum))
continue;
- /* We never remove the init fork. */
- if (forkNum == INIT_FORKNUM)
- continue;
-
/*
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
- if (hash_search(hash, &ent, HASH_FIND, NULL))
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);
+
+ if (!ent)
+ continue;
+
+ if (!ent->dirty_all)
{
- snprintf(rm_path, sizeof(rm_path), "%s/%s",
- dbspacedirname, de->d_name);
- if (unlink(rm_path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m",
- rm_path)));
+ /* clean permanent relations don't need cleanup */
+ if (!ent->has_init)
+ continue;
+
+ if (ent->dirty_init)
+ {
+ /*
+ * The crashed transaction did SET UNLOGGED. This relation
+ * is restored to a LOGGED relation.
+ */
+ if (forkNum != INIT_FORKNUM && forkNum != CLEANUP_FORKNUM)
+ continue;
+ }
else
- elog(DEBUG2, "unlinked file \"%s\"", rm_path);
+ {
+ /*
+ * we don't remove the INIT fork of a non-dirty
+ * relfilenode
+ */
+ if (forkNum == INIT_FORKNUM)
+ continue;
+ }
}
+
+ /* so, nuke it! */
+ snprintf(rm_path, sizeof(rm_path), "%s/%s",
+ dbspacedirname, de->d_name);
+ if (unlink(rm_path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ rm_path)));
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = atooid(de->d_name);
+
+ ForgetRelationForkSyncRequests(rel, forkNum);
}
/* Cleanup is complete. */
FreeDir(dbspace_dir);
- hash_destroy(hash);
}
+ hash_destroy(hash);
+ hash = NULL;
+
/*
* Initialization happens after cleanup is complete: we copy each init
- * fork file to the corresponding main fork file. Note that if we are
- * asked to do both cleanup and init, we may never get here: if the
- * cleanup code determines that there are no init forks in this dbspace,
- * it will return before we get to this point.
+ * fork file to the corresponding main fork file.
*/
if ((op & UNLOGGED_RELATION_INIT) != 0)
{
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 0643d714fb..6b37195c52 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -338,8 +338,10 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
if (ret == 0 || errno != ENOENT)
{
ret = unlink(path);
+
+ /* failure of removing cleanup fork leads to a data loss. */
if (ret < 0 && errno != ENOENT)
- ereport(WARNING,
+ ereport((forkNum != CLEANUP_FORKNUM ? WARNING : ERROR),
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m", path)));
}
@@ -1024,6 +1026,15 @@ register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum,
RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ );
}
+/*
+ * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork
+ */
+void
+ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, ForkNumber forknum)
+{
+ register_forget_request(rnode, forknum, 0);
+}
+
/*
* ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB
*/
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 0f31ff3822..4102d3d59c 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -644,6 +644,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 1f5c426ec0..479dcc248e 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -34,7 +34,9 @@ const char *const forkNames[] = {
"main", /* MAIN_FORKNUM */
"fsm", /* FSM_FORKNUM */
"vm", /* VISIBILITYMAP_FORKNUM */
- "init" /* INIT_FORKNUM */
+ "init", /* INIT_FORKNUM */
+ "clup", /* CLEANUP_FORKNUM */
+ "cln2" /* CLEANUP2_FORKNUM */
};
StaticAssertDecl(lengthof(forkNames) == (MAX_FORKNUM + 1),
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 0ab32b44e9..382623159c 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -23,6 +23,8 @@
extern int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
+extern void RelationCreateInitFork(Relation rel);
+extern void RelationDropInitFork(Relation rel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationPreTruncate(Relation rel);
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index f0814f1458..0fd0832a8b 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -22,13 +22,17 @@
/*
* Declarations for smgr-related XLOG records
*
- * Note: we log file creation and truncation here, but logging of deletion
- * actions is handled by xact.c, because it is part of transaction commit.
+ * Note: we log file creation, truncation, deletion and persistence change
+ * here. Logging of deletion actions is mainly handled by xact.c, because it is
+ * part of transaction commit, but we also log deletions that happen outside
+ * of a transaction.
*/
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_UNLINK 0x30
+#define XLOG_SMGR_BUFPERSISTENCE 0x40
typedef struct xl_smgr_create
{
@@ -36,6 +40,18 @@ typedef struct xl_smgr_create
ForkNumber forkNum;
} xl_smgr_create;
+typedef struct xl_smgr_unlink
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+} xl_smgr_unlink;
+
+typedef struct xl_smgr_bufpersistence
+{
+ RelFileNode rnode;
+ bool persistence;
+} xl_smgr_bufpersistence;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +67,8 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrbufpersistence(const RelFileNode *rnode, bool persistence);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index a44be11ca0..040070aa2b 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -43,7 +43,9 @@ typedef enum ForkNumber
MAIN_FORKNUM = 0,
FSM_FORKNUM,
VISIBILITYMAP_FORKNUM,
- INIT_FORKNUM
+ INIT_FORKNUM,
+ CLEANUP_FORKNUM,
+ CLEANUP2_FORKNUM
/*
* NOTE: if you add a new fork, change MAX_FORKNUM and possibly
@@ -52,7 +54,7 @@ typedef enum ForkNumber
*/
} ForkNumber;
-#define MAX_FORKNUM INIT_FORKNUM
+#define MAX_FORKNUM CLEANUP2_FORKNUM
#define FORKNAMECHARS 4 /* max chars for a fork name */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ff6cd0fc54..d9752a8317 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -205,6 +205,8 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels)
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+ bool permanent, bool isRedo);
extern void DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes);
extern void DropDatabaseBuffers(Oid dbid);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 752b440864..3cbbbf2edd 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -41,6 +41,8 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void ForgetRelationForkSyncRequests(RelFileNodeBackend rnode,
+ ForkNumber forknum);
extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index fad1e5c473..b969ba8e86 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -23,6 +23,7 @@ extern bool parse_filename_for_nontemp_relation(const char *name,
int *oidchars, ForkNumber *fork);
#define UNLOGGED_RELATION_CLEANUP 0x0001
-#define UNLOGGED_RELATION_INIT 0x0002
+#define UNLOGGED_RELATION_DROP_BUFFER 0x0002
+#define UNLOGGED_RELATION_INIT 0x0004
#endif /* REINIT_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index ebf4a199dc..8be17d9afc 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -86,6 +86,7 @@ extern void smgrclose(SMgrRelation reln);
extern void smgrcloseall(void);
extern void smgrclosenode(RelFileNodeBackend rnode);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
--
2.27.0
v4-0002-New-command-ALTER-TABLE-ALL-IN-TABLESPACE-SET-LOG.patch (text/x-patch; charset=us-ascii)
From 70d300969fbd2aae6c66b36f6100d3d2516a0dab Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 23:21:09 +0900
Subject: [PATCH v4 2/2] New command ALTER TABLE ALL IN TABLESPACE SET
LOGGED/UNLOGGED
To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes
relation persistence of all tables in the specified tablespace.
---
src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++++++++++
src/backend/nodes/copyfuncs.c | 16 ++++
src/backend/nodes/equalfuncs.c | 15 ++++
src/backend/parser/gram.y | 20 +++++
src/backend/tcop/utility.c | 11 +++
src/include/commands/tablecmds.h | 2 +
src/include/nodes/nodes.h | 1 +
src/include/nodes/parsenodes.h | 9 ++
8 files changed, 214 insertions(+)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 37a15d31ee..2f65abb19b 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -13696,6 +13696,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
return new_tablespaceoid;
}
+/*
+ * Alter Table ALL ... SET LOGGED/UNLOGGED
+ *
+ * Allows a user to change persistence of all objects in a given tablespace in
+ * the current database. Objects can be chosen based on the owner of the
+ * object also, to allow users to change the persistence of only their objects. The
+ * main permissions handling is done by the lower-level change persistence
+ * function.
+ *
+ * All to-be-modified objects are locked first. If NOWAIT is specified and the
+ * lock can't be acquired then we ereport(ERROR).
+ */
+void
+AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt)
+{
+ List *relations = NIL;
+ ListCell *l;
+ ScanKeyData key[1];
+ Relation rel;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ Oid tablespaceoid;
+ List *role_oids = roleSpecsToIds(NIL);
+
+ /* Ensure we were not asked to change something we can't */
+ if (stmt->objtype != OBJECT_TABLE)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("only tables can be specified")));
+
+ /* Get the tablespace OID */
+ tablespaceoid = get_tablespace_oid(stmt->tablespacename, false);
+
+ /*
+ * Now that the checks are done, check if we should set either to
+ * InvalidOid because it is our database's default tablespace.
+ */
+ if (tablespaceoid == MyDatabaseTableSpace)
+ tablespaceoid = InvalidOid;
+
+ /*
+ * Walk the list of objects in the tablespace to pick them up. This will
+ * only find objects in our database, of course.
+ */
+ ScanKeyInit(&key[0],
+ Anum_pg_class_reltablespace,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(tablespaceoid));
+
+ rel = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(rel, 1, key);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relOid = relForm->oid;
+
+ /*
+ * Do not pick-up objects in pg_catalog as part of this, if an admin
+ * really wishes to do so, they can issue the individual ALTER
+ * commands directly.
+ *
+ * Also, explicitly avoid any shared tables, temp tables, or TOAST
+ * (TOAST will be changed with the main table).
+ */
+ if (IsCatalogNamespace(relForm->relnamespace) ||
+ relForm->relisshared ||
+ isAnyTempNamespace(relForm->relnamespace) ||
+ IsToastNamespace(relForm->relnamespace))
+ continue;
+
+ /* Only pick up the object type requested */
+ if (relForm->relkind != RELKIND_RELATION)
+ continue;
+
+ /* Check if we are only picking-up objects owned by certain roles */
+ if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner))
+ continue;
+
+ /*
+ * Handle permissions-checking here since we are locking the tables
+ * and also to avoid doing a bunch of work only to fail part-way. Note
+ * that permissions will also be checked by AlterTableInternal().
+ *
+ * Caller must be considered an owner on the table of which we're going
+ * to change persistence.
+ */
+ if (!pg_class_ownercheck(relOid, GetUserId()))
+ aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)),
+ NameStr(relForm->relname));
+
+ if (stmt->nowait &&
+ !ConditionalLockRelationOid(relOid, AccessExclusiveLock))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("aborting because lock on relation \"%s.%s\" is not available",
+ get_namespace_name(relForm->relnamespace),
+ NameStr(relForm->relname))));
+ else
+ LockRelationOid(relOid, AccessExclusiveLock);
+
+ /*
+ * Add to our list of objects of which we're going to change
+ * persistence.
+ */
+ relations = lappend_oid(relations, relOid);
+ }
+
+ table_endscan(scan);
+ table_close(rel, AccessShareLock);
+
+ if (relations == NIL)
+ ereport(NOTICE,
+ (errcode(ERRCODE_NO_DATA_FOUND),
+ errmsg("no matching relations in tablespace \"%s\" found",
+ tablespaceoid == InvalidOid ? "(database default)" :
+ get_tablespace_name(tablespaceoid))));
+
+ /*
+ * Everything is locked, loop through and change persistence of all of the
+ * relations.
+ */
+ foreach(l, relations)
+ {
+ List *cmds = NIL;
+ AlterTableCmd *cmd = makeNode(AlterTableCmd);
+
+ if (stmt->logged)
+ cmd->subtype = AT_SetLogged;
+ else
+ cmd->subtype = AT_SetUnLogged;
+
+ cmds = lappend(cmds, cmd);
+
+ EventTriggerAlterTableStart((Node *) stmt);
+ /* OID is set by AlterTableInternal */
+ AlterTableInternal(lfirst_oid(l), cmds, false);
+ EventTriggerAlterTableEnd();
+ }
+}
+
static void
index_copy_data(Relation rel, RelFileNode newrnode)
{
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index ba3ccc712c..127da5151d 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4138,6 +4138,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from)
return newnode;
}
+static AlterTableSetLoggedAllStmt *
+_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *from)
+{
+ AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt);
+
+ COPY_STRING_FIELD(tablespacename);
+ COPY_SCALAR_FIELD(objtype);
+ COPY_SCALAR_FIELD(logged);
+ COPY_SCALAR_FIELD(nowait);
+
+ return newnode;
+}
+
static CreateExtensionStmt *
_copyCreateExtensionStmt(const CreateExtensionStmt *from)
{
@@ -5441,6 +5454,9 @@ copyObjectImpl(const void *from)
case T_AlterTableMoveAllStmt:
retval = _copyAlterTableMoveAllStmt(from);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _copyAlterTableSetLoggedAllStmt(from);
+ break;
case T_CreateExtensionStmt:
retval = _copyCreateExtensionStmt(from);
break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index a2ef853dc2..4f13a1762b 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1872,6 +1872,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a,
return true;
}
+static bool
+_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *a,
+ const AlterTableSetLoggedAllStmt *b)
+{
+ COMPARE_STRING_FIELD(tablespacename);
+ COMPARE_SCALAR_FIELD(objtype);
+ COMPARE_SCALAR_FIELD(logged);
+ COMPARE_SCALAR_FIELD(nowait);
+
+ return true;
+}
+
static bool
_equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b)
{
@@ -3494,6 +3506,9 @@ equal(const void *a, const void *b)
case T_AlterTableMoveAllStmt:
retval = _equalAlterTableMoveAllStmt(a, b);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _equalAlterTableSetLoggedAllStmt(a, b);
+ break;
case T_CreateExtensionStmt:
retval = _equalCreateExtensionStmt(a, b);
break;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 31c95443a5..2222fd8fe3 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -1934,6 +1934,26 @@ AlterTableStmt:
n->nowait = $13;
$$ = (Node *)n;
}
+ | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = true;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = false;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
| ALTER INDEX qualified_name alter_table_cmds
{
AlterTableStmt *n = makeNode(AlterTableStmt);
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 53a511f1da..16606448bf 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -161,6 +161,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_AlterTSConfigurationStmt:
case T_AlterTSDictionaryStmt:
case T_AlterTableMoveAllStmt:
+ case T_AlterTableSetLoggedAllStmt:
case T_AlterTableSpaceOptionsStmt:
case T_AlterTableStmt:
case T_AlterTypeStmt:
@@ -1732,6 +1733,12 @@ ProcessUtilitySlow(ParseState *pstate,
commandCollected = true;
break;
+ case T_AlterTableSetLoggedAllStmt:
+ AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree);
+ /* commands are stashed in AlterTableSetLoggedAll */
+ commandCollected = true;
+ break;
+
case T_DropStmt:
ExecDropStmt((DropStmt *) parsetree, isTopLevel);
/* no commands stashed for DROP */
@@ -2619,6 +2626,10 @@ CreateCommandTag(Node *parsetree)
tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype);
+ break;
+
case T_AlterTableStmt:
tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype);
break;
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index 08c463d3c4..646928466d 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
+extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt);
+
extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
Oid *oldschema);
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index caed683ba9..16d91d3e1d 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -424,6 +424,7 @@ typedef enum NodeTag
T_AlterCollationStmt,
T_CallStmt,
T_AlterStatsStmt,
+ T_AlterTableSetLoggedAllStmt,
/*
* TAGS FOR PARSE TREE NODES (parsenodes.h)
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index dc2bb40926..c3eab6f1ab 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2253,6 +2253,15 @@ typedef struct AlterTableMoveAllStmt
bool nowait;
} AlterTableMoveAllStmt;
+typedef struct AlterTableSetLoggedAllStmt
+{
+ NodeTag type;
+ char *tablespacename;
+ ObjectType objtype; /* Object type to move */
+ bool logged;
+ bool nowait;
+} AlterTableSetLoggedAllStmt;
+
/* ----------------------
* Create/Alter Extension Statements
* ----------------------
--
2.27.0
At Tue, 12 Jan 2021 18:58:08 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Fri, 08 Jan 2021 17:52:21 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Fri, 08 Jan 2021 14:47:05 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
In this version, RelationChangePersistence() is changed not to choose
the in-place method for indexes other than btree. It seems to be usable
with all kinds of indexes other than GiST, but at the moment it applies
only to btrees.
1: /messages/by-id/CA+TgmoZEZ5RONS49C7mEpjhjndqMQtVrz_LCQUkpRWdmRevDnQ@mail.gmail.com
Hmm. This is not working correctly. I'll repost after fixing that.
I think I fixed the misbehavior. ResetUnloggedRelationsInDbspaceDir()
handled file operations in the wrong order and with the wrong logic.
It also needed to drop buffers and forget fsync requests.
I thought that the two cases that this patch is expected to fix
(orphan relation files and uncommitted init files) could share the same
"cleanup" fork, but that was wrong. I had to add one more fork to
differentiate the cases of SET UNLOGGED and of creation of
UNLOGGED tables...
The attached is a new version, which seems to work correctly but looks
somewhat messy. I'll continue working.
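For readers following the fork logic described above, here is a minimal standalone sketch of how the second pass of ResetUnloggedRelationsInDbspaceDir() in the v4 patch appears to decide which files to unlink. The names Fork, RelfileEntry, record_fork and should_unlink are hypothetical simplifications introduced for illustration and are not identifiers from the patch; the real code keeps the entries in a dynahash table keyed by relfilenode OID and also drops buffers and forgets sync requests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the patch's ForkNumber values. */
typedef enum
{
    FORK_MAIN,
    FORK_FSM,
    FORK_VM,
    FORK_INIT,
    FORK_CLUP,                  /* "clup": marks an uncommitted init fork */
    FORK_CLN2                   /* "cln2": marks an uncommitted relation */
} Fork;

/* Simplified version of the per-relfilenode entry built in the first pass. */
typedef struct
{
    bool        has_init;       /* an "init" fork was seen */
    bool        dirty_init;     /* a "clup" fork was seen */
    bool        dirty_all;      /* a "cln2" fork was seen */
} RelfileEntry;

/* First pass: remember which marker forks exist for a relfilenode. */
static void
record_fork(RelfileEntry *ent, Fork fork)
{
    assert(ent != NULL);

    if (fork == FORK_CLUP)
        ent->dirty_init = true;
    else if (fork == FORK_CLN2)
        ent->dirty_all = true;
    else if (fork == FORK_INIT)
        ent->has_init = true;
}

/*
 * Second pass: should the file for this fork be unlinked?
 * ent == NULL models "relfilenode not found in the hash table".
 */
static bool
should_unlink(const RelfileEntry *ent, Fork fork)
{
    if (ent == NULL)
        return false;           /* no marker forks at all: leave it alone */

    if (ent->dirty_all)
        return true;            /* creation never committed: remove everything */

    if (!ent->has_init)
        return false;           /* clean permanent relation */

    if (ent->dirty_init)
    {
        /* Crashed SET UNLOGGED: drop only the init and clup forks. */
        return fork == FORK_INIT || fork == FORK_CLUP;
    }

    /* Ordinary unlogged relation: remove all forks except init. */
    return fork != FORK_INIT;
}
```

Under these assumptions, a relfilenode with only an init fork loses everything but the init fork, one with both init and clup forks keeps its data and goes back to being a logged relation, and one with a cln2 fork is removed entirely.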
Commit bea449c635 conflicts with this patch because it changes the
definition of DropRelFileNodeBuffers. That change simplified this patch a bit :p
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v5-0001-In-place-table-persistence-change.patch (text/x-patch; charset=us-ascii)
From 5f785f181acdac18952f504ec45ce41f285c05bc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v5 1/2] In-place table persistence change
Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, it currently runs a heap rewrite, which causes a large amount
of file I/O. This patch makes the command run without a heap rewrite.
In addition, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of a massive number of HEAP_INSERTs, which should be
smaller.
This also allows cleanup of files left behind when the transaction
that created them crashes.
---
src/backend/access/rmgrdesc/smgrdesc.c | 23 ++
src/backend/access/transam/README | 8 +
src/backend/access/transam/xlog.c | 17 +
src/backend/catalog/storage.c | 436 +++++++++++++++++++++++--
src/backend/commands/tablecmds.c | 246 +++++++++++---
src/backend/storage/buffer/bufmgr.c | 88 +++++
src/backend/storage/file/reinit.c | 316 ++++++++++++------
src/backend/storage/smgr/md.c | 13 +-
src/backend/storage/smgr/smgr.c | 6 +
src/common/relpath.c | 4 +-
src/include/catalog/storage.h | 2 +
src/include/catalog/storage_xlog.h | 22 +-
src/include/common/relpath.h | 6 +-
src/include/storage/bufmgr.h | 2 +
src/include/storage/md.h | 2 +
src/include/storage/reinit.h | 3 +-
src/include/storage/smgr.h | 1 +
17 files changed, 1028 insertions(+), 167 deletions(-)
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 7755553d57..2c109b8ca4 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,23 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+ char *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+ appendStringInfoString(buf, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+ char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+ appendStringInfoString(buf, path);
+ appendStringInfo(buf, " persistence %d", xlrec->persistence);
+ pfree(path);
+ }
}
const char *
@@ -55,6 +72,12 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_UNLINK:
+ id = "UNLINK";
+ break;
+ case XLOG_SMGR_BUFPERSISTENCE:
+ id = "BUFPERSISTENCE";
+ break;
}
return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c1..547107a771 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -724,6 +724,14 @@ we must panic and abort recovery. The DBA will have to manually clean up
then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.
+The CLEANUP fork file
+--------------------------------
+
+A CLEANUP fork is created when a new relation file is created, to mark
+that the relfilenode needs to be cleaned up at recovery time. In contrast
+to 4 above, failure to remove a CLEANUP fork file will lead to data
+loss, in which case the server will shut down.
+
Skipping WAL for New RelFileNode
--------------------------------
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b18257c198..6dcbcbe387 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -40,6 +40,7 @@
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
#include "catalog/pg_database.h"
+#include "catalog/storage.h"
#include "commands/progress.h"
#include "commands/tablespace.h"
#include "common/controldata_utils.h"
@@ -4442,6 +4443,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
{
ereport(DEBUG1,
(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+
+ /* cleanup garbage files left during crash recovery */
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
InArchiveRecovery = true;
if (StandbyModeRequested)
StandbyMode = true;
@@ -7455,6 +7464,14 @@ StartupXLOG(void)
}
}
+ /* cleanup garbage files left during crash recovery */
+ if (!InArchiveRecovery)
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
/* Allow resource managers to do any required cleanup. */
for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
{
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index cba7a9ada0..c54d70747f 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,6 +19,7 @@
#include "postgres.h"
+#include "access/amapi.h"
#include "access/parallel.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -27,6 +28,7 @@
#include "access/xlogutils.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+#include "common/hashfn.h"
#include "miscadmin.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
@@ -57,9 +59,16 @@ int wal_skip_threshold = 2048; /* in kilobytes */
* but I'm being paranoid.
*/
+#define PDOP_DELETE (0)
+#define PDOP_UNLINK_FORK (1 << 0)
+#define PDOP_SET_PERSISTENCE (1 << 1)
+
typedef struct PendingRelDelete
{
RelFileNode relnode; /* relation that may need to be deleted */
+ int op; /* operation mask */
+ bool bufpersistence; /* buffer persistence to set */
+ int unlink_forknum; /* forknum to unlink */
BackendId backend; /* InvalidBackendId if not a temp rel */
bool atCommit; /* T=delete at commit; F=delete at abort */
int nestLevel; /* xact nesting level of request */
@@ -75,6 +84,24 @@ typedef struct PendingRelSync
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
HTAB *pendingSyncHash = NULL;
+typedef struct SRelHashEntry
+{
+ SMgrRelation srel;
+ char status; /* for simplehash use */
+} SRelHashEntry;
+
+/* define hashtable for workarea for pending deletes */
+#define SH_PREFIX srelhash
+#define SH_ELEMENT_TYPE SRelHashEntry
+#define SH_KEY_TYPE SMgrRelation
+#define SH_KEY srel
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((unsigned char *)&key, sizeof(SMgrRelation))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(SMgrRelation)) == 0)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
/*
* AddPendingSync
@@ -143,7 +170,17 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return NULL; /* placate compiler */
}
+ /*
+ * We are going to create a new storage file. If the server crashes before
+ * the current transaction ends, the file needs to be cleaned up, but
+ * nothing would otherwise identify it as orphaned. The cleanup fork works
+ * as the sentinel to identify that situation.
+ */
srel = smgropen(rnode, backend);
+ smgrcreate(srel, CLEANUP2_FORKNUM, false);
+ log_smgrcreate(&rnode, CLEANUP2_FORKNUM);
+ smgrimmedsync(srel, CLEANUP2_FORKNUM);
+
smgrcreate(srel, MAIN_FORKNUM, false);
if (needs_wal)
@@ -153,12 +190,25 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
pending = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
pending->relnode = rnode;
+ pending->op = PDOP_DELETE;
pending->backend = backend;
pending->atCommit = false; /* delete if abort */
pending->nestLevel = GetCurrentTransactionNestLevel();
pending->next = pendingDeletes;
pendingDeletes = pending;
+ /* drop cleanup fork at commit */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_FORK;
+ pending->unlink_forknum = CLEANUP2_FORKNUM;
+ pending->backend = backend;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+
if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
{
Assert(backend == InvalidBackendId);
@@ -168,6 +218,218 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return srel;
}
+/*
+ * RelationCreateInitFork
+ * Create physical storage for the init fork of a relation.
+ *
+ * Create the init fork for the relation.
+ *
+ * This function is transactional. The creation is WAL-logged, and if the
+ * transaction aborts later on, the init fork will be removed.
+ */
+void
+RelationCreateInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingRelDelete *pending;
+ SMgrRelation srel;
+ PendingRelDelete *prev;
+ PendingRelDelete *next;
+ bool create = true;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(rel->rd_smgr, false, false);
+
+ /*
+ * If we have entries for init-fork operations on this relation, that means
+ * that we have already registered pending-delete entries to drop an init
+ * fork that existed before the current transaction started. This function
+ * reverts that change just by removing the entries.
+ */
+ prev = NULL;
+ for (pending = pendingDeletes; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ pending->op != PDOP_DELETE &&
+ ((pending->op & PDOP_UNLINK_FORK) != 0 &&
+ pending->unlink_forknum == CLEANUP_FORKNUM))
+ {
+ if (prev)
+ prev->next = next;
+ else
+ pendingDeletes = next;
+ pfree(pending);
+
+ create = false;
+ }
+ else
+ {
+ /* unrelated entry, don't touch it */
+ prev = pending;
+ }
+ }
+
+ if (!create)
+ return;
+
+ /*
+ * We are going to create the init fork. If the server crashes before the
+ * current transaction ends, an init fork left behind would corrupt data
+ * during recovery. The cleanup fork works as the sentinel to identify that
+ * situation.
+ */
+ srel = smgropen(rnode, InvalidBackendId);
+ smgrcreate(srel, CLEANUP_FORKNUM, false);
+ log_smgrcreate(&rnode, CLEANUP_FORKNUM);
+ smgrimmedsync(srel, CLEANUP_FORKNUM);
+
+ /* We don't have existing init fork, create it. */
+ smgrcreate(srel, INIT_FORKNUM, false);
+
+ /*
+ * The init fork of an index needs further initialization. ambuildempty
+ * should WAL-log and sync the file by itself; otherwise we do that here.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rnode, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /* drop this init fork file at abort and revert persistence */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_FORK | PDOP_SET_PERSISTENCE;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->bufpersistence = true;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+
+ /* drop cleanup fork at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_FORK;
+ pending->unlink_forknum = CLEANUP_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+
+ /* drop cleanup fork at commit */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_FORK;
+ pending->unlink_forknum = CLEANUP_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+}
+
+/*
+ * RelationDropInitFork
+ * Delete physical storage for the init fork of a relation.
+ *
+ * Register pending-delete of the init fork. The real deletion is performed by
+ * smgrDoPendingDeletes at commit.
+ *
+ * This function is transactional. If the transaction aborts later on, the
+ * deletion doesn't happen.
+ */
+void
+RelationDropInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingRelDelete *pending;
+ PendingRelDelete *prev;
+ PendingRelDelete *next;
+ bool inxact_created = false;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(rel->rd_smgr, true, false);
+
+ /*
+ * If we have entries for init-fork operations on this relation, that means
+ * that we have created the init fork in the current transaction. In that
+ * case we remove the init and cleanup forks immediately. Otherwise we
+ * just register a pending-delete for the existing init fork.
+ */
+ prev = NULL;
+ for (pending = pendingDeletes; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ pending->op != PDOP_DELETE &&
+ ((pending->op & PDOP_UNLINK_FORK) != 0 &&
+ pending->unlink_forknum == CLEANUP_FORKNUM))
+ {
+ /* unlink list entry */
+ if (prev)
+ prev->next = next;
+ else
+ pendingDeletes = next;
+ pfree(pending);
+
+ inxact_created = true;
+ }
+ else
+ {
+ /* unrelated entry, don't touch it */
+ prev = pending;
+ }
+ }
+
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rnode, InvalidBackendId);
+
+ /*
+ * INIT/CLEANUP forks are never loaded into shared buffers, so there is
+ * no point in dropping buffers for these files.
+ */
+ log_smgrunlink(&rnode, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ log_smgrunlink(&rnode, CLEANUP_FORKNUM);
+ smgrunlink(srel, CLEANUP_FORKNUM, false);
+ return;
+ }
+
+ /* register drop of this init fork file at commit */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_FORK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_SET_PERSISTENCE;
+ pending->bufpersistence = false;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+}
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -187,6 +449,44 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum)
+{
+ xl_smgr_unlink xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file unlink.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileNode *rnode, bool persistence)
+{
+ xl_smgr_bufpersistence xlrec;
+
+ /*
+ * Make an XLOG entry reporting the change of buffer persistence.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.persistence = persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -200,6 +500,7 @@ RelationDropStorage(Relation rel)
pending = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
pending->relnode = rel->rd_node;
+ pending->op = PDOP_DELETE;
pending->backend = rel->rd_backend;
pending->atCommit = true; /* delete if commit */
pending->nestLevel = GetCurrentTransactionNestLevel();
@@ -602,59 +903,97 @@ smgrDoPendingDeletes(bool isCommit)
int nrels = 0,
maxrels = 0;
SMgrRelation *srels = NULL;
+ srelhash_hash *close_srels = NULL;
+ bool found;
prev = NULL;
for (pending = pendingDeletes; pending != NULL; pending = next)
{
+ SMgrRelation srel;
+
next = pending->next;
if (pending->nestLevel < nestLevel)
{
/* outer-level entries should not be processed yet */
prev = pending;
+ continue;
}
+
+ /* unlink list entry first, so we don't retry on failure */
+ if (prev)
+ prev->next = next;
else
+ pendingDeletes = next;
+
+ if (pending->atCommit != isCommit)
{
- /* unlink list entry first, so we don't retry on failure */
- if (prev)
- prev->next = next;
- else
- pendingDeletes = next;
- /* do deletion if called for */
- if (pending->atCommit == isCommit)
- {
- SMgrRelation srel;
-
- srel = smgropen(pending->relnode, pending->backend);
-
- /* allocate the initial array, or extend it, if needed */
- if (maxrels == 0)
- {
- maxrels = 8;
- srels = palloc(sizeof(SMgrRelation) * maxrels);
- }
- else if (maxrels <= nrels)
- {
- maxrels *= 2;
- srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
- }
-
- srels[nrels++] = srel;
- }
/* must explicitly free the list entry */
pfree(pending);
/* prev does not change */
+ continue;
+ }
+
+ if (close_srels == NULL)
+ close_srels = srelhash_create(CurrentMemoryContext, 32, NULL);
+
+ srel = smgropen(pending->relnode, pending->backend);
+
+ /* Uniquify the smgr relations */
+ srelhash_insert(close_srels, srel, &found);
+
+ if (pending->op != PDOP_DELETE)
+ {
+ if (pending->op & PDOP_UNLINK_FORK)
+ {
+ /* unlinking any other fork would require dropping its buffers */
+ Assert(pending->unlink_forknum == INIT_FORKNUM ||
+ pending->unlink_forknum == CLEANUP_FORKNUM ||
+ pending->unlink_forknum == CLEANUP2_FORKNUM);
+
+ log_smgrunlink(&pending->relnode, pending->unlink_forknum);
+ smgrunlink(srel, pending->unlink_forknum, false);
+
+ }
+
+ if (pending->op & PDOP_SET_PERSISTENCE)
+ SetRelationBuffersPersistence(srel, pending->bufpersistence,
+ InRecovery);
+ }
+ else
+ {
+ /* allocate the initial array, or extend it, if needed */
+ if (maxrels == 0)
+ {
+ maxrels = 8;
+ srels = palloc(sizeof(SMgrRelation) * maxrels);
+ }
+ else if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ srels[nrels++] = srel;
}
}
if (nrels > 0)
{
smgrdounlinkall(srels, nrels, false);
-
- for (int i = 0; i < nrels; i++)
- smgrclose(srels[i]);
-
pfree(srels);
}
+
+ if (close_srels)
+ {
+ srelhash_iterator i;
+ SRelHashEntry *ent;
+
+ /* close smgr relations */
+ srelhash_start_iterate(close_srels, &i);
+ while ((ent = srelhash_iterate(close_srels, &i)) != NULL)
+ smgrclose(ent->srel);
+ srelhash_destroy(close_srels);
+ }
}
/*
@@ -824,7 +1163,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
- && pending->backend == InvalidBackendId)
+ && pending->backend == InvalidBackendId
+ && pending->op == PDOP_DELETE)
nrels++;
}
if (nrels == 0)
@@ -837,7 +1177,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
- && pending->backend == InvalidBackendId)
+ && pending->backend == InvalidBackendId &&
+ pending->op == PDOP_DELETE)
{
*rptr = pending->relnode;
rptr++;
@@ -917,6 +1258,15 @@ smgr_redo(XLogReaderState *record)
reln = smgropen(xlrec->rnode, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ smgrunlink(reln, xlrec->forkNum, true);
+ smgrclose(reln);
+ }
else if (info == XLOG_SMGR_TRUNCATE)
{
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
@@ -1005,6 +1355,28 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingRelDelete *pending;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = xlrec->rnode;
+ pending->op = PDOP_SET_PERSISTENCE;
+ pending->bufpersistence = !xlrec->persistence;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 993da56d43..37a15d31ee 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -50,6 +50,7 @@
#include "commands/defrem.h"
#include "commands/event_trigger.h"
#include "commands/policy.h"
+#include "commands/progress.h"
#include "commands/sequence.h"
#include "commands/tablecmds.h"
#include "commands/tablespace.h"
@@ -4917,6 +4918,170 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
return newcmd;
}
+/*
+ * RelationChangePersistence: do in-place persistence change of a relation
+ */
+static void
+RelationChangePersistence(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ int i;
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * If any of the following were set we would need to call ATRewriteTable,
+ * which cannot happen in the AT_REWRITE_ALTER_PERSISTENCE case.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ RelationOpenSmgr(rel);
+
+ /*
+ * First we collect all relations whose persistence we need to change.
+ */
+
+ /* Collect OIDs of indexes and toast relations */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ /* Add toast relation if any */
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ List *toastidx;
+ Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
+
+ RelationOpenSmgr(toastrel);
+ relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
+ toastidx = RelationGetIndexList(toastrel);
+ relids = list_concat(relids, toastidx);
+ pfree(toastidx);
+ table_close(toastrel, NoLock);
+ }
+
+ table_close(rel, NoLock);
+
+ /* Make changes in storage */
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach (lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+
+ /*
+ * Some access methods do not accept in-place persistence change. For
+ * example, GiST uses page LSNs to figure out whether a block has
+ * changed, where UNLOGGED GiST indexes use fake LSNs that are
+ * incompatible with real LSNs used for LOGGED ones.
+ *
+ * XXXX: We don't bother allowing in-place persistence change for index
+ * methods other than btree for now.
+ */
+ if (r->rd_rel->relkind == RELKIND_INDEX &&
+ r->rd_rel->relam != BTREE_AM_OID)
+ {
+ int reindex_flags;
+
+ /* reindex doesn't allow concurrent use of the index */
+ table_close(r, NoLock);
+
+ reindex_flags =
+ REINDEX_REL_SUPPRESS_INDEX_USE |
+ REINDEX_REL_CHECK_CONSTRAINTS;
+
+ /* Set the same persistence as the parent relation. */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+
+ reindex_index(reloid, reindex_flags, persistence, 0);
+
+ continue;
+ }
+
+ RelationOpenSmgr(r);
+
+ /* Create or drop init fork */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ RelationCreateInitFork(r);
+ else
+ RelationDropInitFork(r);
+
+ /*
+ * When this relation gets WAL-logged, immediately sync all files but
+ * the init fork to establish the initial state on storage. Buffers have
+ * already been flushed out by RelationCreate(Drop)InitFork called just
+ * above. The init fork should have been synced as needed.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT)
+ {
+ for (i = 0; i < INIT_FORKNUM; i++)
+ {
+ if (smgrexists(r->rd_smgr, i))
+ smgrimmedsync(r->rd_smgr, i);
+ }
+ }
+
+ /* Update catalog */
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+
+ /*
+ * While wal_level >= replica, switching to LOGGED requires the
+ * relation content to be WAL-logged to recover the table.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded())
+ {
+ ForkNumber fork;
+
+ for (fork = 0; fork < INIT_FORKNUM; fork++)
+ {
+ if (smgrexists(r->rd_smgr, fork))
+ log_newpage_range(r, fork,
+ 0, smgrnblocks(r->rd_smgr, fork), false);
+ }
+ }
+
+ table_close(r, NoLock);
+ }
+
+ table_close(classRel, NoLock);
+}
+
/*
* ATRewriteTables: ALTER TABLE phase 3
*/
@@ -5037,45 +5202,52 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- /*
- * Create transient table that will receive the modified data.
- *
- * Ensure it is marked correctly as logged or unlogged. We have
- * to do this here so that buffers for the new relfilenode will
- * have the right persistence set, and at the same time ensure
- * that the original filenode's buffers will get read in with the
- * correct setting (i.e. the original one). Otherwise a rollback
- * after the rewrite would possibly result with buffers for the
- * original filenode having the wrong persistence setting.
- *
- * NB: This relies on swap_relation_files() also swapping the
- * persistence. That wouldn't work for pg_class, but that can't be
- * unlogged anyway.
- */
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, persistence,
- lockmode);
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
+ RelationChangePersistence(tab, persistence, lockmode);
+ else
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new relfilenode
+ * will have the right persistence set, and at the same time
+ * ensure that the original filenode's buffers will get read in
+ * with the correct setting (i.e. the original one). Otherwise
+ * a rollback after the rewrite would possibly result with
+ * buffers for the original filenode having the wrong
+ * persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that can't
+ * be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, persistence,
+ lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
+ /*
+ * Swap the physical files of the old and new heaps, then
+ * rebuild indexes and discard the old heap. We can use
+ * RecentXmin for the table's new relfrozenxid because we
+ * rewrote all the tuples in ATRewriteTable, so no older Xid
+ * remains in the table. Also, we never try to swap toast
+ * tables by content, since we have no interest in letting this
+ * code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
+ }
}
else
{
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 561c212092..eacbdc6447 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -37,6 +37,7 @@
#include "access/xlog.h"
#include "catalog/catalog.h"
#include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
#include "executor/instrument.h"
#include "lib/binaryheap.h"
#include "miscadmin.h"
@@ -3094,6 +3095,93 @@ DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
}
}
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a relation
+ * and, when switching to PERMANENT, writes all dirty pages of the relation
+ * out to disk (or more accurately, out to kernel disk buffers), ensuring
+ * that the kernel has an up-to-date view of the relation.
+ *
+ * Generally, the caller should be holding AccessExclusiveLock on the
+ * target relation to ensure that no other backend is busy dirtying
+ * more blocks of the relation; the effects can't be expected to last
+ * after the lock is released.
+ *
+ * XXX currently it sequentially searches the buffer pool, should be
+ * changed to more clever ways of searching. This routine is not
+ * used in any performance-critical code paths, so it's not worth
+ * adding additional overhead to normal paths to make it go faster;
+ * but see also DropRelFileNodeBuffers.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
+{
+ int i;
+ RelFileNodeBackend rnode = srel->smgr_rnode;
+
+ Assert(!RelFileNodeBackendIsTemp(rnode));
+
+ if (!isRedo)
+ log_smgrbufpersistence(&srel->smgr_rnode.node, permanent);
+
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ {
+ UnlockBufHdr(bufHdr, buf_state);
+ continue;
+ }
+
+ if (permanent)
+ {
+ /* Init fork is being dropped, drop buffers for it. */
+ if (bufHdr->tag.forkNum == INIT_FORKNUM)
+ {
+ InvalidateBuffer(bufHdr);
+ continue;
+ }
+
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /* we flush this buffer when switching to PERMANENT */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, srel);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr, true);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ /* init fork is always BM_PERMANENT. See BufferAlloc */
+ if (bufHdr->tag.forkNum != INIT_FORKNUM)
+ buf_state &= ~BM_PERMANENT;
+
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ }
+}
+
/* ---------------------------------------------------------------------
* DropRelFileNodesAllBuffers
*
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 40c758d789..0eac1956cc 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -16,29 +16,50 @@
#include <unistd.h>
+#include "access/xlog.h"
+#include "catalog/pg_tablespace_d.h"
#include "common/relpath.h"
+#include "storage/bufmgr.h"
#include "storage/copydir.h"
#include "storage/fd.h"
+#include "storage/md.h"
#include "storage/reinit.h"
+#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
- int op);
+ Oid tspid, int op);
static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
- int op);
+ Oid tspid, Oid dbid, int op);
typedef struct
{
Oid reloid; /* hash key */
-} unlogged_relation_entry;
+ bool has_init; /* has INIT fork */
+ bool dirty_init; /* needs to remove INIT fork */
+ bool dirty_all; /* needs to remove all forks */
+} relfile_entry;
/*
- * Reset unlogged relations from before the last restart.
+ * Clean up and reset relation files from before the last restart.
*
- * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any
- * relation with an "init" fork, except for the "init" fork itself.
+ * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations
+ * depending on the existence of the "cleanup" forks.
*
+ * If CLEANUP_FORKNUM (clup) is present, we remove the init fork of the same
+ * relation along with the clup fork.
+ *
+ * If CLEANUP2_FORKNUM (cln2) is present we remove the whole relation along
+ * with the cln2 fork.
+ *
+ * Otherwise, if the "init" fork is found, we remove all forks of any relation
+ * with the "init" fork, except for the "init" fork itself.
+ *
+ *
+ * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all
+ * relations that have the "cleanup" and/or the "init" forks.
+ *
* If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main
* fork.
*/
@@ -68,7 +89,7 @@ ResetUnloggedRelations(int op)
/*
* First process unlogged files in pg_default ($PGDATA/base)
*/
- ResetUnloggedRelationsInTablespaceDir("base", op);
+ ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op);
/*
* Cycle through directories for all non-default tablespaces.
@@ -77,13 +98,19 @@ ResetUnloggedRelations(int op)
while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
{
+ Oid tspid;
+
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
continue;
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
- ResetUnloggedRelationsInTablespaceDir(temp_path, op);
+
+ tspid = atooid(spc_de->d_name);
+ Assert(tspid != 0);
+
+ ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op);
}
FreeDir(spc_dir);
@@ -99,7 +126,8 @@ ResetUnloggedRelations(int op)
* Process one tablespace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
+ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
+ Oid tspid, int op)
{
DIR *ts_dir;
struct dirent *de;
@@ -126,6 +154,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
while ((de = ReadDir(ts_dir, tsdirname)) != NULL)
{
+ Oid dbid;
+
/*
* We're only interested in the per-database directories, which have
* numeric names. Note that this code will also (properly) ignore "."
@@ -136,7 +166,10 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
snprintf(dbspace_path, sizeof(dbspace_path), "%s/%s",
tsdirname, de->d_name);
- ResetUnloggedRelationsInDbspaceDir(dbspace_path, op);
+ dbid = atooid(de->d_name);
+ Assert(dbid != 0);
+
+ ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op);
}
FreeDir(ts_dir);
@@ -146,125 +179,226 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
* Process one per-dbspace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
+ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
+ Oid tspid, Oid dbid, int op)
{
DIR *dbspace_dir;
struct dirent *de;
char rm_path[MAXPGPATH * 2];
+ HTAB *hash;
+ HASHCTL ctl;
/* Caller must specify at least one operation. */
- Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0);
+ Assert((op & (UNLOGGED_RELATION_CLEANUP |
+ UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_INIT)) != 0);
/*
* Cleanup is a two-pass operation. First, we go through and identify all
* the files with init forks. Then, we go through again and nuke
* everything with the same OID except the init fork.
*/
+
+ /*
+ * It's possible that someone could create a ton of unlogged relations
+ * in the same database & tablespace, so we'd better use a hash table
+ * rather than an array or linked list to keep track of which files
+ * need to be reset. Otherwise, this cleanup operation would be
+ * O(n^2).
+ */
+ memset(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(relfile_entry);
+ hash = hash_create("relfilenode cleanup hash",
+ 32, &ctl, HASH_ELEM | HASH_BLOBS);
+
+ /* Collect INIT and CLEANUP forks in the directory. */
+ dbspace_dir = AllocateDir(dbspacedirname);
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
+ {
+ int oidchars;
+ ForkNumber forkNum;
+
+ /* Skip anything that doesn't look like a relation data file. */
+ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
+ &forkNum))
+ continue;
+
+ if (forkNum == INIT_FORKNUM ||
+ forkNum == CLEANUP_FORKNUM || forkNum == CLEANUP2_FORKNUM)
+ {
+ Oid key;
+ relfile_entry *ent;
+ bool found;
+
+ /*
+ * Record the relfilenode information. If it has the CLEANUP fork,
+ * the relfilenode is in dirty state, where clean up is needed.
+ */
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_ENTER, &found);
+
+ if (!found)
+ {
+ ent->has_init = false;
+ ent->dirty_init = false;
+ ent->dirty_all = false;
+ }
+
+ if (forkNum == CLEANUP_FORKNUM)
+ ent->dirty_init = true;
+ else if (forkNum == CLEANUP2_FORKNUM)
+ ent->dirty_all = true;
+ else
+ {
+ Assert(forkNum == INIT_FORKNUM);
+ ent->has_init = true;
+ }
+ }
+ }
+
+ /* Done with the first pass. */
+ FreeDir(dbspace_dir);
+
+ /* nothing to do if we have neither init nor cleanup forks */
+ if (hash_get_num_entries(hash) < 1)
+ {
+ hash_destroy(hash);
+ return;
+ }
+
+ if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0)
+ {
+ /*
+ * When we come here after recovery, an smgr object for this file might
+ * have been created. In that case we need to drop all buffers and then
+ * the smgr object before initializing the unlogged relation. This is
+ * safe as long as no other backends have accessed the relation before
+ * starting archive recovery.
+ */
+ HASH_SEQ_STATUS status;
+ relfile_entry *ent;
+ SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8);
+ int maxrels = 8;
+ int nrels = 0;
+ int i;
+
+ Assert(!HotStandbyActive());
+
+ hash_seq_init(&status, hash);
+ while((ent = (relfile_entry *) hash_seq_search(&status)) != NULL)
+ {
+ RelFileNodeBackend rel;
+
+ /*
+ * The relation is persistent and remains persistent. Don't
+ * drop the buffers for this relation.
+ */
+ if (ent->has_init && ent->dirty_init)
+ continue;
+
+ if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = ent->reloid;
+
+ srels[nrels++] = smgropen(rel.node, InvalidBackendId);
+ }
+
+ DropRelFileNodesAllBuffers(srels, nrels);
+
+ for (i = 0 ; i < nrels ; i++)
+ smgrclose(srels[i]);
+ }
+
+ /*
+ * Now, make a second pass and remove anything that matches.
+ */
if ((op & UNLOGGED_RELATION_CLEANUP) != 0)
{
- HTAB *hash;
- HASHCTL ctl;
-
- /*
- * It's possible that someone could create a ton of unlogged relations
- * in the same database & tablespace, so we'd better use a hash table
- * rather than an array or linked list to keep track of which files
- * need to be reset. Otherwise, this cleanup operation would be
- * O(n^2).
- */
- ctl.keysize = sizeof(Oid);
- ctl.entrysize = sizeof(unlogged_relation_entry);
- ctl.hcxt = CurrentMemoryContext;
- hash = hash_create("unlogged relation OIDs", 32, &ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- /* Scan the directory. */
- dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
- {
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
-
- /* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* Also skip it unless this is the init fork. */
- if (forkNum != INIT_FORKNUM)
- continue;
-
- /*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
- */
- ent.reloid = atooid(de->d_name);
- (void) hash_search(hash, &ent, HASH_ENTER, NULL);
- }
-
- /* Done with the first pass. */
- FreeDir(dbspace_dir);
-
- /*
- * If we didn't find any init forks, there's no point in continuing;
- * we can bail out now.
- */
- if (hash_get_num_entries(hash) == 0)
- {
- hash_destroy(hash);
- return;
- }
-
- /*
- * Now, make a second pass and remove anything that matches.
- */
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
+ ForkNumber forkNum;
+ int oidchars;
+ Oid key;
+ relfile_entry *ent;
+ RelFileNodeBackend rel;
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
&forkNum))
continue;
- /* We never remove the init fork. */
- if (forkNum == INIT_FORKNUM)
- continue;
-
/*
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
- if (hash_search(hash, &ent, HASH_FIND, NULL))
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);
+
+ if (!ent)
+ continue;
+
+ if (!ent->dirty_all)
{
- snprintf(rm_path, sizeof(rm_path), "%s/%s",
- dbspacedirname, de->d_name);
- if (unlink(rm_path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m",
- rm_path)));
+ /* clean permanent relations don't need cleanup */
+ if (!ent->has_init)
+ continue;
+
+ if (ent->dirty_init)
+ {
+ /*
+ * The crashed transaction did SET UNLOGGED. This relation
+ * is restored to a LOGGED relation.
+ */
+ if (forkNum != INIT_FORKNUM && forkNum != CLEANUP_FORKNUM)
+ continue;
+ }
else
- elog(DEBUG2, "unlinked file \"%s\"", rm_path);
+ {
+ /*
+ * we don't remove the INIT fork of a non-dirty
+ * relfilenode
+ */
+ if (forkNum == INIT_FORKNUM)
+ continue;
+ }
}
+
+ /* so, nuke it! */
+ snprintf(rm_path, sizeof(rm_path), "%s/%s",
+ dbspacedirname, de->d_name);
+ if (unlink(rm_path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ rm_path)));
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = atooid(de->d_name);
+
+ ForgetRelationForkSyncRequests(rel, forkNum);
}
/* Cleanup is complete. */
FreeDir(dbspace_dir);
- hash_destroy(hash);
}
+ hash_destroy(hash);
+ hash = NULL;
+
/*
* Initialization happens after cleanup is complete: we copy each init
- * fork file to the corresponding main fork file. Note that if we are
- * asked to do both cleanup and init, we may never get here: if the
- * cleanup code determines that there are no init forks in this dbspace,
- * it will return before we get to this point.
+ * fork file to the corresponding main fork file.
*/
if ((op & UNLOGGED_RELATION_INIT) != 0)
{
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 0643d714fb..6b37195c52 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -338,8 +338,10 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
if (ret == 0 || errno != ENOENT)
{
ret = unlink(path);
+
+ /* Failure to remove the cleanup fork leads to data loss. */
if (ret < 0 && errno != ENOENT)
- ereport(WARNING,
+ ereport((forkNum != CLEANUP_FORKNUM ? WARNING : ERROR),
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m", path)));
}
@@ -1024,6 +1026,15 @@ register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum,
RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ );
}
+/*
+ * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork
+ */
+void
+ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, ForkNumber forknum)
+{
+ register_forget_request(rnode, forknum, 0);
+}
+
/*
* ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB
*/
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 4dc24649df..96480e321d 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -662,6 +662,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 1f5c426ec0..479dcc248e 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -34,7 +34,9 @@ const char *const forkNames[] = {
"main", /* MAIN_FORKNUM */
"fsm", /* FSM_FORKNUM */
"vm", /* VISIBILITYMAP_FORKNUM */
- "init" /* INIT_FORKNUM */
+ "init", /* INIT_FORKNUM */
+ "clup", /* CLEANUP_FORKNUM */
+ "cln2" /* CLEANUP2_FORKNUM */
};
StaticAssertDecl(lengthof(forkNames) == (MAX_FORKNUM + 1),
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 0ab32b44e9..382623159c 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -23,6 +23,8 @@
extern int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
+extern void RelationCreateInitFork(Relation rel);
+extern void RelationDropInitFork(Relation rel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationPreTruncate(Relation rel);
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index f0814f1458..0fd0832a8b 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -22,13 +22,17 @@
/*
* Declarations for smgr-related XLOG records
*
- * Note: we log file creation and truncation here, but logging of deletion
- * actions is handled by xact.c, because it is part of transaction commit.
+ * Note: we log file creation, truncation, deletion and persistence change
+ * here. Logging of deletion actions is mainly handled by xact.c, because it is
+ * part of transaction commit, but we log deletions that happen outside of a
+ * transaction.
*/
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_UNLINK 0x30
+#define XLOG_SMGR_BUFPERSISTENCE 0x40
typedef struct xl_smgr_create
{
@@ -36,6 +40,18 @@ typedef struct xl_smgr_create
ForkNumber forkNum;
} xl_smgr_create;
+typedef struct xl_smgr_unlink
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+} xl_smgr_unlink;
+
+typedef struct xl_smgr_bufpersistence
+{
+ RelFileNode rnode;
+ bool persistence;
+} xl_smgr_bufpersistence;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +67,8 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrbufpersistence(const RelFileNode *rnode, bool persistence);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index a44be11ca0..040070aa2b 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -43,7 +43,9 @@ typedef enum ForkNumber
MAIN_FORKNUM = 0,
FSM_FORKNUM,
VISIBILITYMAP_FORKNUM,
- INIT_FORKNUM
+ INIT_FORKNUM,
+ CLEANUP_FORKNUM,
+ CLEANUP2_FORKNUM
/*
* NOTE: if you add a new fork, change MAX_FORKNUM and possibly
@@ -52,7 +54,7 @@ typedef enum ForkNumber
*/
} ForkNumber;
-#define MAX_FORKNUM INIT_FORKNUM
+#define MAX_FORKNUM CLEANUP2_FORKNUM
#define FORKNAMECHARS 4 /* max chars for a fork name */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index fb00fda6a7..ccb0a388f6 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -205,6 +205,8 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels)
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+ bool permanent, bool isRedo);
extern void DropRelFileNodesAllBuffers(struct SMgrRelationData **smgr_reln, int nnodes);
extern void DropDatabaseBuffers(Oid dbid);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 752b440864..3cbbbf2edd 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -41,6 +41,8 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void ForgetRelationForkSyncRequests(RelFileNodeBackend rnode,
+ ForkNumber forknum);
extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index fad1e5c473..b969ba8e86 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -23,6 +23,7 @@ extern bool parse_filename_for_nontemp_relation(const char *name,
int *oidchars, ForkNumber *fork);
#define UNLOGGED_RELATION_CLEANUP 0x0001
-#define UNLOGGED_RELATION_INIT 0x0002
+#define UNLOGGED_RELATION_DROP_BUFFER 0x0002
+#define UNLOGGED_RELATION_INIT 0x0004
#endif /* REINIT_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a6fbf7b6a6..1ac3e4a74a 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -86,6 +86,7 @@ extern void smgrclose(SMgrRelation reln);
extern void smgrcloseall(void);
extern void smgrclosenode(RelFileNodeBackend rnode);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
--
2.27.0
v5-0002-New-command-ALTER-TABLE-ALL-IN-TABLESPACE-SET-LOG.patch (text/x-patch; charset=us-ascii)
From 89dbb62355befa7dde815030c95cf4902a8941f1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 23:21:09 +0900
Subject: [PATCH v5 2/2] New command ALTER TABLE ALL IN TABLESPACE SET
LOGGED/UNLOGGED
To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes
relation persistence of all tables in the specified tablespace.
---
src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++++++++++
src/backend/nodes/copyfuncs.c | 16 ++++
src/backend/nodes/equalfuncs.c | 15 ++++
src/backend/parser/gram.y | 20 +++++
src/backend/tcop/utility.c | 11 +++
src/include/commands/tablecmds.h | 2 +
src/include/nodes/nodes.h | 1 +
src/include/nodes/parsenodes.h | 9 ++
8 files changed, 214 insertions(+)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 37a15d31ee..2f65abb19b 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -13696,6 +13696,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
return new_tablespaceoid;
}
+/*
+ * Alter Table ALL ... SET LOGGED/UNLOGGED
+ *
+ * Allows a user to change persistence of all objects in a given tablespace in
+ * the current database. Objects can be chosen based on the owner of the
+ * object also, to allow users to change persistence of only their objects. The
+ * main permissions handling is done by the lower-level change persistence
+ * function.
+ *
+ * All to-be-modified objects are locked first. If NOWAIT is specified and the
+ * lock can't be acquired then we ereport(ERROR).
+ */
+void
+AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt)
+{
+ List *relations = NIL;
+ ListCell *l;
+ ScanKeyData key[1];
+ Relation rel;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ Oid tablespaceoid;
+ List *role_oids = roleSpecsToIds(NIL);
+
+ /* Ensure we were not asked to change something we can't */
+ if (stmt->objtype != OBJECT_TABLE)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("only tables can be specified")));
+
+ /* Get the tablespace OID */
+ tablespaceoid = get_tablespace_oid(stmt->tablespacename, false);
+
+ /*
+ * Now that the checks are done, check if we should set it to
+ * InvalidOid because it is our database's default tablespace.
+ */
+ if (tablespaceoid == MyDatabaseTableSpace)
+ tablespaceoid = InvalidOid;
+
+ /*
+ * Walk the list of objects in the tablespace to pick them up. This will
+ * only find objects in our database, of course.
+ */
+ ScanKeyInit(&key[0],
+ Anum_pg_class_reltablespace,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(tablespaceoid));
+
+ rel = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(rel, 1, key);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relOid = relForm->oid;
+
+ /*
+ * Do not pick-up objects in pg_catalog as part of this, if an admin
+ * really wishes to do so, they can issue the individual ALTER
+ * commands directly.
+ *
+ * Also, explicitly avoid any shared tables, temp tables, or TOAST
+ * (TOAST will be changed with the main table).
+ */
+ if (IsCatalogNamespace(relForm->relnamespace) ||
+ relForm->relisshared ||
+ isAnyTempNamespace(relForm->relnamespace) ||
+ IsToastNamespace(relForm->relnamespace))
+ continue;
+
+ /* Only pick up the object type requested */
+ if (relForm->relkind != RELKIND_RELATION)
+ continue;
+
+ /* Check if we are only picking-up objects owned by certain roles */
+ if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner))
+ continue;
+
+ /*
+ * Handle permissions-checking here since we are locking the tables
+ * and also to avoid doing a bunch of work only to fail part-way. Note
+ * that permissions will also be checked by AlterTableInternal().
+ *
+ * Caller must be considered an owner on the table of which we're going
+ * to change persistence.
+ */
+ if (!pg_class_ownercheck(relOid, GetUserId()))
+ aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)),
+ NameStr(relForm->relname));
+
+ if (stmt->nowait &&
+ !ConditionalLockRelationOid(relOid, AccessExclusiveLock))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("aborting because lock on relation \"%s.%s\" is not available",
+ get_namespace_name(relForm->relnamespace),
+ NameStr(relForm->relname))));
+ else
+ LockRelationOid(relOid, AccessExclusiveLock);
+
+ /*
+ * Add to our list of objects of which we're going to change
+ * persistence.
+ */
+ relations = lappend_oid(relations, relOid);
+ }
+
+ table_endscan(scan);
+ table_close(rel, AccessShareLock);
+
+ if (relations == NIL)
+ ereport(NOTICE,
+ (errcode(ERRCODE_NO_DATA_FOUND),
+ errmsg("no matching relations in tablespace \"%s\" found",
+ tablespaceoid == InvalidOid ? "(database default)" :
+ get_tablespace_name(tablespaceoid))));
+
+ /*
+ * Everything is locked, loop through and change persistence of all of the
+ * relations.
+ */
+ foreach(l, relations)
+ {
+ List *cmds = NIL;
+ AlterTableCmd *cmd = makeNode(AlterTableCmd);
+
+ if (stmt->logged)
+ cmd->subtype = AT_SetLogged;
+ else
+ cmd->subtype = AT_SetUnLogged;
+
+ cmds = lappend(cmds, cmd);
+
+ EventTriggerAlterTableStart((Node *) stmt);
+ /* OID is set by AlterTableInternal */
+ AlterTableInternal(lfirst_oid(l), cmds, false);
+ EventTriggerAlterTableEnd();
+ }
+}
+
static void
index_copy_data(Relation rel, RelFileNode newrnode)
{
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index ba3ccc712c..127da5151d 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4138,6 +4138,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from)
return newnode;
}
+static AlterTableSetLoggedAllStmt *
+_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *from)
+{
+ AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt);
+
+ COPY_STRING_FIELD(tablespacename);
+ COPY_SCALAR_FIELD(objtype);
+ COPY_SCALAR_FIELD(logged);
+ COPY_SCALAR_FIELD(nowait);
+
+ return newnode;
+}
+
static CreateExtensionStmt *
_copyCreateExtensionStmt(const CreateExtensionStmt *from)
{
@@ -5441,6 +5454,9 @@ copyObjectImpl(const void *from)
case T_AlterTableMoveAllStmt:
retval = _copyAlterTableMoveAllStmt(from);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _copyAlterTableSetLoggedAllStmt(from);
+ break;
case T_CreateExtensionStmt:
retval = _copyCreateExtensionStmt(from);
break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index a2ef853dc2..4f13a1762b 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1872,6 +1872,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a,
return true;
}
+static bool
+_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *a,
+ const AlterTableSetLoggedAllStmt *b)
+{
+ COMPARE_STRING_FIELD(tablespacename);
+ COMPARE_SCALAR_FIELD(objtype);
+ COMPARE_SCALAR_FIELD(logged);
+ COMPARE_SCALAR_FIELD(nowait);
+
+ return true;
+}
+
static bool
_equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b)
{
@@ -3494,6 +3506,9 @@ equal(const void *a, const void *b)
case T_AlterTableMoveAllStmt:
retval = _equalAlterTableMoveAllStmt(a, b);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _equalAlterTableSetLoggedAllStmt(a, b);
+ break;
case T_CreateExtensionStmt:
retval = _equalCreateExtensionStmt(a, b);
break;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 31c95443a5..2222fd8fe3 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -1934,6 +1934,26 @@ AlterTableStmt:
n->nowait = $13;
$$ = (Node *)n;
}
+ | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = true;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = false;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
| ALTER INDEX qualified_name alter_table_cmds
{
AlterTableStmt *n = makeNode(AlterTableStmt);
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 53a511f1da..16606448bf 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -161,6 +161,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_AlterTSConfigurationStmt:
case T_AlterTSDictionaryStmt:
case T_AlterTableMoveAllStmt:
+ case T_AlterTableSetLoggedAllStmt:
case T_AlterTableSpaceOptionsStmt:
case T_AlterTableStmt:
case T_AlterTypeStmt:
@@ -1732,6 +1733,12 @@ ProcessUtilitySlow(ParseState *pstate,
commandCollected = true;
break;
+ case T_AlterTableSetLoggedAllStmt:
+ AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree);
+ /* commands are stashed in AlterTableSetLoggedAll */
+ commandCollected = true;
+ break;
+
case T_DropStmt:
ExecDropStmt((DropStmt *) parsetree, isTopLevel);
/* no commands stashed for DROP */
@@ -2619,6 +2626,10 @@ CreateCommandTag(Node *parsetree)
tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype);
+ break;
+
case T_AlterTableStmt:
tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype);
break;
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index 08c463d3c4..646928466d 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
+extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt);
+
extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
Oid *oldschema);
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index caed683ba9..16d91d3e1d 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -424,6 +424,7 @@ typedef enum NodeTag
T_AlterCollationStmt,
T_CallStmt,
T_AlterStatsStmt,
+ T_AlterTableSetLoggedAllStmt,
/*
* TAGS FOR PARSE TREE NODES (parsenodes.h)
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index dc2bb40926..c3eab6f1ab 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2253,6 +2253,15 @@ typedef struct AlterTableMoveAllStmt
bool nowait;
} AlterTableMoveAllStmt;
+typedef struct AlterTableSetLoggedAllStmt
+{
+ NodeTag type;
+ char *tablespacename;
+ ObjectType objtype; /* Object type to change */
+ bool logged;
+ bool nowait;
+} AlterTableSetLoggedAllStmt;
+
/* ----------------------
* Create/Alter Extension Statements
* ----------------------
--
2.27.0
(I'm not sure when the subject was broken..)
At Thu, 14 Jan 2021 17:32:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Commit bea449c635 conflicts with this on the change of the definition
of DropRelFileNodeBuffers. The change simplified this patch by a bit:p
In this version, I got rid of the "CLEANUP FORK"s and added a new
mechanism, "Smgr marks". A mark file has the name of the
corresponding fork file followed by ".u" (which means Uncommitted).
An "uncommitted"-marked main fork means the same as CLEANUP2_FORKNUM,
and an uncommitted-marked init fork means the same as CLEANUP_FORKNUM
in the previous version.
I noticed that the previous version of the patch still leaves an
orphan main fork file after "BEGIN; CREATE TABLE x; ROLLBACK; (crash
before checkpoint)" since the "mark" file (or CLEANUP2_FORKNUM) is
removed at rollback. In this version the responsibility to remove the
mark files is moved to SyncPostCheckpoint, where main fork files are
actually removed.
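The ordering problem can be shown with a toy model. This is only a sketch under simplifying assumptions: one relation, two booleans standing in for the files on disk, and hypothetical function names that do not exist in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy on-disk state for one relation created in an aborted transaction. */
typedef struct
{
    bool main_file;   /* main fork file still exists */
    bool mark_file;   /* "uncommitted" mark still exists */
} disk_state;

/* Previous behaviour: rollback removes the mark right away. */
static void
rollback_old(disk_state *d)
{
    d->mark_file = false;   /* main file stays until the next checkpoint */
}

/* New behaviour: rollback leaves both files; removal is deferred. */
static void
rollback_new(disk_state *d)
{
    (void) d;
}

/* Post-checkpoint step: the main file and its mark go away together. */
static void
post_checkpoint(disk_state *d)
{
    d->main_file = false;
    d->mark_file = false;
}

/* Crash recovery: only marked files are recognised as garbage. */
static void
crash_recovery(disk_state *d)
{
    if (d->mark_file)
    {
        d->main_file = false;
        d->mark_file = false;
    }
}

/* True if a main file is left behind with no mark pointing at it. */
static bool
is_orphan(const disk_state *d)
{
    return d->main_file && !d->mark_file;
}
```

Starting from {true, true}, rollback_old() followed by crash_recovery() leaves an orphan, while rollback_new() followed by either crash_recovery() or post_checkpoint() leaves nothing behind.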
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v6-0001-In-place-table-persistence-change.patch (text/x-patch; charset=us-ascii)
From 27ea96d84dfc2f3e0d62c4b8f7d20cc30771cf86 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v6 1/2] In-place table persistence change
Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs a heap rewrite, which causes a large amount
of file I/O. This patch makes the command run without a heap rewrite.
In addition, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of a massive number of HEAP_INSERTs, which should be
smaller.
Also this allows for the cleanup of files left behind in the crash of
the transaction that created them.
---
src/backend/access/rmgrdesc/smgrdesc.c | 52 +++
src/backend/access/transam/README | 8 +
src/backend/access/transam/xlog.c | 17 +
src/backend/catalog/storage.c | 520 +++++++++++++++++++++++--
src/backend/commands/tablecmds.c | 246 ++++++++++--
src/backend/replication/basebackup.c | 3 +-
src/backend/storage/buffer/bufmgr.c | 88 +++++
src/backend/storage/file/fd.c | 4 +-
src/backend/storage/file/reinit.c | 346 +++++++++++-----
src/backend/storage/smgr/md.c | 92 ++++-
src/backend/storage/smgr/smgr.c | 32 ++
src/backend/storage/sync/sync.c | 20 +-
src/bin/pg_rewind/parsexlog.c | 24 ++
src/common/relpath.c | 47 ++-
src/include/catalog/storage.h | 2 +
src/include/catalog/storage_xlog.h | 42 +-
src/include/common/relpath.h | 9 +-
src/include/storage/bufmgr.h | 2 +
src/include/storage/fd.h | 1 +
src/include/storage/md.h | 8 +-
src/include/storage/reinit.h | 10 +-
src/include/storage/smgr.h | 17 +
22 files changed, 1384 insertions(+), 206 deletions(-)
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 7755553d57..d251f22207 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,49 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+ char *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+ appendStringInfoString(buf, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+ char *path = GetRelationPath(xlrec->rnode.dbNode,
+ xlrec->rnode.spcNode,
+ xlrec->rnode.relNode,
+ InvalidBackendId,
+ xlrec->forkNum, xlrec->mark);
+ char *action;
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ action = "CREATE";
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ action = "DELETE";
+ break;
+ default:
+ action = "<unknown action>";
+ break;
+ }
+
+ appendStringInfo(buf, "%s %s", action, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+ char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+ appendStringInfoString(buf, path);
+ appendStringInfo(buf, " persistence %d", xlrec->persistence);
+ pfree(path);
+ }
}
const char *
@@ -55,6 +98,15 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_UNLINK:
+ id = "UNLINK";
+ break;
+ case XLOG_SMGR_MARK:
+ id = "MARK";
+ break;
+ case XLOG_SMGR_BUFPERSISTENCE:
+ id = "BUFPERSISTENCE";
+ break;
}
return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c1..7cf77e4a02 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -724,6 +724,14 @@ we must panic and abort recovery. The DBA will have to manually clean up
then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.
+The Smgr MARK files
+--------------------------------
+
+An smgr mark file is created when a new relation file is created, to
+mark that the relfilenode needs to be cleaned up at recovery time. In
+contrast to 4 above, failure to remove smgr mark files will lead to
+data loss, in which case the server will shut down.
+
Skipping WAL for New RelFileNode
--------------------------------
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6f8810e149..27bbe17395 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -40,6 +40,7 @@
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
#include "catalog/pg_database.h"
+#include "catalog/storage.h"
#include "commands/progress.h"
#include "commands/tablespace.h"
#include "common/controldata_utils.h"
@@ -4458,6 +4459,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
{
ereport(DEBUG1,
(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+
+ /* clean up garbage files left during crash recovery */
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
InArchiveRecovery = true;
if (StandbyModeRequested)
StandbyMode = true;
@@ -7577,6 +7586,14 @@ StartupXLOG(void)
}
}
+ /* clean up garbage files left during crash recovery */
+ if (!InArchiveRecovery)
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
/* Allow resource managers to do any required cleanup. */
for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
{
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index cba7a9ada0..7302a3fad4 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,6 +19,7 @@
#include "postgres.h"
+#include "access/amapi.h"
#include "access/parallel.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -27,6 +28,7 @@
#include "access/xlogutils.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+#include "common/hashfn.h"
#include "miscadmin.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
@@ -57,9 +59,18 @@ int wal_skip_threshold = 2048; /* in kilobytes */
* but I'm being paranoid.
*/
+#define PDOP_DELETE (1 << 0)
+#define PDOP_UNLINK_FORK (1 << 1)
+#define PDOP_UNLINK_MARK (1 << 2)
+#define PDOP_SET_PERSISTENCE (1 << 3)
+
typedef struct PendingRelDelete
{
RelFileNode relnode; /* relation that may need to be deleted */
+ int op; /* operation mask */
+ bool bufpersistence; /* buffer persistence to set */
+ int unlink_forknum; /* forknum to unlink */
+ StorageMarks unlink_mark; /* mark to unlink */
BackendId backend; /* InvalidBackendId if not a temp rel */
bool atCommit; /* T=delete at commit; F=delete at abort */
int nestLevel; /* xact nesting level of request */
@@ -75,6 +86,24 @@ typedef struct PendingRelSync
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
HTAB *pendingSyncHash = NULL;
+typedef struct SRelHashEntry
+{
+ SMgrRelation srel;
+ char status; /* for simplehash use */
+} SRelHashEntry;
+
+/* define a hash table used as a work area for pending deletes */
+#define SH_PREFIX srelhash
+#define SH_ELEMENT_TYPE SRelHashEntry
+#define SH_KEY_TYPE SMgrRelation
+#define SH_KEY srel
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((unsigned char *)&key, sizeof(SMgrRelation))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(SMgrRelation)) == 0)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
/*
* AddPendingSync
@@ -143,22 +172,48 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return NULL; /* placate compiler */
}
+ /*
+ * We are going to create a new storage file. If the server crashes before
+ * the current transaction ends, the file needs to be cleaned up, but there
+ * would be no clue to identify such orphan files. The SMGR_MARK_UNCOMMITTED
+ * mark file works as the signal of that situation.
+ */
srel = smgropen(rnode, backend);
+ log_smgrcreatemark(&rnode, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
smgrcreate(srel, MAIN_FORKNUM, false);
if (needs_wal)
log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM);
- /* Add the relation to the list of stuff to delete at abort */
+ /*
+ * Add the relation to the list of stuff to delete at abort. We don't
+ * remove the mark file at commit. It needs to persist until the main fork
+ * file is actually deleted. See SyncPostCheckpoint.
+ */
pending = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
pending->relnode = rnode;
+ pending->op = PDOP_DELETE;
pending->backend = backend;
pending->atCommit = false; /* delete if abort */
pending->nestLevel = GetCurrentTransactionNestLevel();
pending->next = pendingDeletes;
pendingDeletes = pending;
+ /* drop the mark file at commit */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_MARK;
+ pending->unlink_forknum = MAIN_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->backend = backend;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+
if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
{
Assert(backend == InvalidBackendId);
@@ -168,6 +223,207 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return srel;
}
+/*
+ * RelationCreateInitFork
+ * Create physical storage for the init fork of a relation.
+ *
+ * Create the init fork for the relation.
+ *
+ * This function is transactional. The creation is WAL-logged, and if the
+ * transaction aborts later on, the init fork will be removed.
+ */
+void
+RelationCreateInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingRelDelete *pending;
+ SMgrRelation srel;
+ PendingRelDelete *prev;
+ PendingRelDelete *next;
+ bool create = true;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(rel->rd_smgr, false, false);
+
+ /*
+ * If we have entries for init-fork operations on this relation, that means
+ * that we have already registered pending-delete entries to drop an init
+ * fork that existed before the current transaction started. This
+ * function reverts that change just by removing the entries.
+ */
+ prev = NULL;
+ for (pending = pendingDeletes; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ (pending->op & PDOP_DELETE) == 0 &&
+ (pending->unlink_forknum == INIT_FORKNUM ||
+ (pending->op & PDOP_SET_PERSISTENCE) != 0))
+ {
+ if (prev)
+ prev->next = next;
+ else
+ pendingDeletes = next;
+ pfree(pending);
+
+ create = false;
+ }
+ else
+ {
+ /* unrelated entry, don't touch it */
+ prev = pending;
+ }
+ }
+
+ if (!create)
+ return;
+
+ /*
+ * We are going to create the init fork. If the server crashes before the
+ * current transaction ends, the leftover init fork would corrupt data
+ * during recovery. The mark file works as the sentinel to identify that
+ * situation.
+ */
+ srel = smgropen(rnode, InvalidBackendId);
+ log_smgrcreatemark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+
+ /* We don't have existing init fork, create it. */
+ smgrcreate(srel, INIT_FORKNUM, false);
+
+ /*
+ * The init fork of an index needs further initialization. ambuildempty
+ * should do WAL-logging and file sync by itself; otherwise we do that here.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rnode, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /* drop the init fork, mark file and revert persistence at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_FORK | PDOP_UNLINK_MARK | PDOP_SET_PERSISTENCE;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->bufpersistence = true;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+
+ /* drop mark file at commit */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_MARK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+}
+
+/*
+ * RelationDropInitFork
+ * Delete physical storage for the init fork of a relation.
+ *
+ * Register pending-delete of the init fork. The real deletion is performed by
+ * smgrDoPendingDeletes at commit.
+ *
+ * This function is transactional. If the transaction aborts later on, the
+ * deletion doesn't happen.
+ */
+void
+RelationDropInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingRelDelete *pending;
+ PendingRelDelete *prev;
+ PendingRelDelete *next;
+ bool inxact_created = false;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(rel->rd_smgr, true, false);
+
+ /*
+ * If we have entries for init-fork operations of this relation, that means
+ * that we have created the init fork in the current transaction. We
+ * remove the init fork and mark file immediately in that case. Otherwise
+ * just register pending-delete for the existing init fork.
+ */
+ prev = NULL;
+ for (pending = pendingDeletes; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ (pending->op & PDOP_DELETE) == 0 &&
+ (pending->unlink_forknum == INIT_FORKNUM ||
+ (pending->op & PDOP_SET_PERSISTENCE) != 0))
+ {
+ /* unlink list entry */
+ if (prev)
+ prev->next = next;
+ else
+ pendingDeletes = next;
+ pfree(pending);
+
+ inxact_created = true;
+ }
+ else
+ {
+ /* unrelated entry, don't touch it */
+ prev = pending;
+ }
+ }
+
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rnode, InvalidBackendId);
+
+ /*
+ * INIT forks are never loaded into shared buffers, so there is no point
+ * in dropping buffers for such files.
+ */
+ log_smgrunlinkmark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrunlinkmark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+ log_smgrunlink(&rnode, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ return;
+ }
+
+ /* register drop of this init fork file at commit */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_FORK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_SET_PERSISTENCE;
+ pending->bufpersistence = false;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+}
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -187,6 +443,88 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum)
+{
+ xl_smgr_unlink xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file unlink.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK record (XLOG_SMGR_MARK_CREATE) to WAL.
+ */
+void
+log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the mark file creation.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_CREATE;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK record (XLOG_SMGR_MARK_UNLINK) to WAL.
+ */
+void
+log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the mark file removal.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_UNLINK;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileNode *rnode, bool persistence)
+{
+ xl_smgr_bufpersistence xlrec;
+
+ /*
+ * Make an XLOG entry reporting the change of buffer persistence.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.persistence = persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -200,6 +538,7 @@ RelationDropStorage(Relation rel)
pending = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
pending->relnode = rel->rd_node;
+ pending->op = PDOP_DELETE;
pending->backend = rel->rd_backend;
pending->atCommit = true; /* delete if commit */
pending->nestLevel = GetCurrentTransactionNestLevel();
@@ -602,59 +941,104 @@ smgrDoPendingDeletes(bool isCommit)
int nrels = 0,
maxrels = 0;
SMgrRelation *srels = NULL;
+ srelhash_hash *close_srels = NULL;
+ bool found;
prev = NULL;
for (pending = pendingDeletes; pending != NULL; pending = next)
{
+ SMgrRelation srel;
+
next = pending->next;
if (pending->nestLevel < nestLevel)
{
/* outer-level entries should not be processed yet */
prev = pending;
+ continue;
}
+
+ /* unlink list entry first, so we don't retry on failure */
+ if (prev)
+ prev->next = next;
else
+ pendingDeletes = next;
+
+ if (pending->atCommit != isCommit)
{
- /* unlink list entry first, so we don't retry on failure */
- if (prev)
- prev->next = next;
- else
- pendingDeletes = next;
- /* do deletion if called for */
- if (pending->atCommit == isCommit)
- {
- SMgrRelation srel;
-
- srel = smgropen(pending->relnode, pending->backend);
-
- /* allocate the initial array, or extend it, if needed */
- if (maxrels == 0)
- {
- maxrels = 8;
- srels = palloc(sizeof(SMgrRelation) * maxrels);
- }
- else if (maxrels <= nrels)
- {
- maxrels *= 2;
- srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
- }
-
- srels[nrels++] = srel;
- }
/* must explicitly free the list entry */
pfree(pending);
/* prev does not change */
+ continue;
}
+
+ if (close_srels == NULL)
+ close_srels = srelhash_create(CurrentMemoryContext, 32, NULL);
+
+ srel = smgropen(pending->relnode, pending->backend);
+
+ /* Uniquify the smgr relations */
+ srelhash_insert(close_srels, srel, &found);
+
+ if (pending->op & PDOP_DELETE)
+ {
+ /* allocate the initial array, or extend it, if needed */
+ if (maxrels == 0)
+ {
+ maxrels = 8;
+ srels = palloc(sizeof(SMgrRelation) * maxrels);
+ }
+ else if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ srels[nrels++] = srel;
+ }
+
+ if (pending->op & PDOP_UNLINK_FORK)
+ {
+ /* unlinking any other fork would also require dropping its buffers */
+ Assert(pending->unlink_forknum == INIT_FORKNUM);
+
+ /* Don't emit WAL during recovery. */
+ if (!InRecovery)
+ log_smgrunlink(&pending->relnode, pending->unlink_forknum);
+ smgrunlink(srel, pending->unlink_forknum, false);
+ }
+
+ if (pending->op & PDOP_UNLINK_MARK)
+ {
+ if (!InRecovery)
+ log_smgrunlinkmark(&pending->relnode,
+ pending->unlink_forknum,
+ pending->unlink_mark);
+ smgrunlinkmark(srel, pending->unlink_forknum,
+ pending->unlink_mark, InRecovery);
+ }
+
+ if (pending->op & PDOP_SET_PERSISTENCE)
+ SetRelationBuffersPersistence(srel, pending->bufpersistence,
+ InRecovery);
}
if (nrels > 0)
{
smgrdounlinkall(srels, nrels, false);
-
- for (int i = 0; i < nrels; i++)
- smgrclose(srels[i]);
-
pfree(srels);
}
+
+ if (close_srels)
+ {
+ srelhash_iterator i;
+ SRelHashEntry *ent;
+
+ /* close smgr relations */
+ srelhash_start_iterate(close_srels, &i);
+ while ((ent = srelhash_iterate(close_srels, &i)) != NULL)
+ smgrclose(ent->srel);
+ srelhash_destroy(close_srels);
+ }
}
/*
@@ -824,7 +1208,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
- && pending->backend == InvalidBackendId)
+ && pending->backend == InvalidBackendId
+ && pending->op == PDOP_DELETE)
nrels++;
}
if (nrels == 0)
@@ -837,7 +1222,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
- && pending->backend == InvalidBackendId)
+ && pending->backend == InvalidBackendId &&
+ pending->op == PDOP_DELETE)
{
*rptr = pending->relnode;
rptr++;
@@ -917,6 +1303,15 @@ smgr_redo(XLogReaderState *record)
reln = smgropen(xlrec->rnode, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ smgrunlink(reln, xlrec->forkNum, true);
+ smgrclose(reln);
+ }
else if (info == XLOG_SMGR_TRUNCATE)
{
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
@@ -1005,6 +1400,65 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingRelDelete *pending;
+ bool created = false;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ smgrcreatemark(reln, xlrec->forkNum, xlrec->mark, true);
+ created = true;
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ smgrunlinkmark(reln, xlrec->forkNum, xlrec->mark, true);
+ break;
+ default:
+ elog(ERROR, "unknown smgr_mark action \"%c\"", xlrec->action);
+ }
+
+ if (created)
+ {
+ /* revert mark file operation at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = xlrec->rnode;
+ pending->op = PDOP_UNLINK_MARK;
+ pending->unlink_forknum = xlrec->forkNum;
+ pending->unlink_mark = xlrec->mark;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+ }
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingRelDelete *pending;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = xlrec->rnode;
+ pending->op = PDOP_SET_PERSISTENCE;
+ pending->bufpersistence = !xlrec->persistence;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 3349bcfaa7..4e2bceffda 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -51,6 +51,7 @@
#include "commands/defrem.h"
#include "commands/event_trigger.h"
#include "commands/policy.h"
+#include "commands/progress.h"
#include "commands/sequence.h"
#include "commands/tablecmds.h"
#include "commands/tablespace.h"
@@ -5085,6 +5086,170 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
return newcmd;
}
+/*
+ * RelationChangePersistence: do in-place persistence change of a relation
+ */
+static void
+RelationChangePersistence(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ int i;
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * If any of the following were set, we would need to call ATRewriteTable.
+ * That cannot happen in the AT_REWRITE_ALTER_PERSISTENCE case.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ RelationOpenSmgr(rel);
+
+ /*
+ * First we collect all relations that we need to change persistence.
+ */
+
+ /* Collect OIDs of indexes and toast relations */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ /* Add toast relation if any */
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ List *toastidx;
+ Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
+
+ RelationOpenSmgr(toastrel);
+ relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
+ toastidx = RelationGetIndexList(toastrel);
+ relids = list_concat(relids, toastidx);
+ list_free(toastidx);
+ table_close(toastrel, NoLock);
+ }
+
+ table_close(rel, NoLock);
+
+ /* Make changes in storage */
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach (lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+
+ /*
+ * Some access methods do not accept in-place persistence change. For
+ * example, GiST uses page LSNs to figure out whether a block has
+ * changed, and UNLOGGED GiST indexes use fake LSNs that are
+ * incompatible with the real LSNs used for LOGGED ones.
+ *
+ * XXX: We don't bother allowing in-place persistence change for index
+ * methods other than btree for now.
+ */
+ if (r->rd_rel->relkind == RELKIND_INDEX &&
+ r->rd_rel->relam != BTREE_AM_OID)
+ {
+ int reindex_flags;
+
+ /* reindex doesn't allow concurrent use of the index */
+ table_close(r, NoLock);
+
+ reindex_flags =
+ REINDEX_REL_SUPPRESS_INDEX_USE |
+ REINDEX_REL_CHECK_CONSTRAINTS;
+
+ /* Set the same persistence as the parent relation. */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+
+ reindex_index(reloid, reindex_flags, persistence, 0);
+
+ continue;
+ }
+
+ RelationOpenSmgr(r);
+
+ /* Create or drop init fork */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ RelationCreateInitFork(r);
+ else
+ RelationDropInitFork(r);
+
+ /*
+ * When this relation gets WAL-logged, immediately sync all files but
+ * the init fork to establish the initial state on storage. Buffers have
+ * already been flushed out by RelationCreate(Drop)InitFork called just
+ * above. The init fork should have been synced as needed.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT)
+ {
+ for (i = 0 ; i < INIT_FORKNUM ; i++)
+ {
+ if (smgrexists(r->rd_smgr, i))
+ smgrimmedsync(r->rd_smgr, i);
+ }
+ }
+
+ /* Update catalog */
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+
+ /*
+ * While wal_level >= replica, switching to LOGGED requires the
+ * relation content to be WAL-logged to recover the table.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded())
+ {
+ ForkNumber fork;
+
+ for (fork = 0; fork < INIT_FORKNUM ; fork++)
+ {
+ if (smgrexists(r->rd_smgr, fork))
+ log_newpage_range(r, fork,
+ 0, smgrnblocks(r->rd_smgr, fork), false);
+ }
+ }
+
+ table_close(r, NoLock);
+ }
+
+ table_close(classRel, NoLock);
+}
+
/*
* ATRewriteTables: ALTER TABLE phase 3
*/
@@ -5205,45 +5370,52 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- /*
- * Create transient table that will receive the modified data.
- *
- * Ensure it is marked correctly as logged or unlogged. We have
- * to do this here so that buffers for the new relfilenode will
- * have the right persistence set, and at the same time ensure
- * that the original filenode's buffers will get read in with the
- * correct setting (i.e. the original one). Otherwise a rollback
- * after the rewrite would possibly result with buffers for the
- * original filenode having the wrong persistence setting.
- *
- * NB: This relies on swap_relation_files() also swapping the
- * persistence. That wouldn't work for pg_class, but that can't be
- * unlogged anyway.
- */
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, persistence,
- lockmode);
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
+ RelationChangePersistence(tab, persistence, lockmode);
+ else
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new relfilenode
+ * will have the right persistence set, and at the same time
+ * ensure that the original filenode's buffers will get read in
+ * with the correct setting (i.e. the original one). Otherwise
+ * a rollback after the rewrite would possibly result with
+ * buffers for the original filenode having the wrong
+ * persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that can't
+ * be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, persistence,
+ lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
+ /*
+ * Swap the physical files of the old and new heaps, then
+ * rebuild indexes and discard the old heap. We can use
+ * RecentXmin for the table's new relfrozenxid because we
+ * rewrote all the tuples in ATRewriteTable, so no older Xid
+ * remains in the table. Also, we never try to swap toast
+ * tables by content, since we have no interest in letting this
+ * code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
+ }
}
else
{
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 56cd473f9f..bc5288de05 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1255,6 +1255,7 @@ sendDir(const char *path, int basepathlen, bool sizeonly, List *tablespaces,
bool excludeFound;
ForkNumber relForkNum; /* Type of fork if file is a relation */
int relOidChars; /* Chars in filename that are the rel oid */
+ StorageMarks mark;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1305,7 +1306,7 @@ sendDir(const char *path, int basepathlen, bool sizeonly, List *tablespaces,
/* Exclude all forks for unlogged tables except the init fork */
if (isDbDir &&
parse_filename_for_nontemp_relation(de->d_name, &relOidChars,
- &relForkNum))
+ &relForkNum, &mark))
{
/* Never exclude init forks */
if (relForkNum != INIT_FORKNUM)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 852138f9c9..50674fd027 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -37,6 +37,7 @@
#include "access/xlog.h"
#include "catalog/catalog.h"
#include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
#include "executor/instrument.h"
#include "lib/binaryheap.h"
#include "miscadmin.h"
@@ -3100,6 +3101,93 @@ DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
}
}
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a relation.
+ * When switching to PERMANENT, it also writes all dirty pages of the
+ * relation out to disk (or more accurately, out to kernel disk buffers),
+ * ensuring that the kernel has an up-to-date view of the relation.
+ *
+ * Generally, the caller should be holding AccessExclusiveLock on the
+ * target relation to ensure that no other backend is busy dirtying
+ * more blocks of the relation; the effects can't be expected to last
+ * after the lock is released.
+ *
+ * XXX currently it sequentially searches the buffer pool, should be
+ * changed to more clever ways of searching. This routine is not
+ * used in any performance-critical code paths, so it's not worth
+ * adding additional overhead to normal paths to make it go faster;
+ * but see also DropRelFileNodeBuffers.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
+{
+ int i;
+ RelFileNodeBackend rnode = srel->smgr_rnode;
+
+ Assert (!RelFileNodeBackendIsTemp(rnode));
+
+ if (!isRedo)
+ log_smgrbufpersistence(&srel->smgr_rnode.node, permanent);
+
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ {
+ UnlockBufHdr(bufHdr, buf_state);
+ continue;
+ }
+
+ if (permanent)
+ {
+ /* Init fork is being dropped, drop buffers for it. */
+ if (bufHdr->tag.forkNum == INIT_FORKNUM)
+ {
+ InvalidateBuffer(bufHdr);
+ continue;
+ }
+
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /* we flush this buffer when switching to PERMANENT */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, srel);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr, true);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ /* init fork is always BM_PERMANENT. See BufferAlloc */
+ if (bufHdr->tag.forkNum != INIT_FORKNUM)
+ buf_state &= ~BM_PERMANENT;
+
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ }
+}
+
/* ---------------------------------------------------------------------
* DropRelFileNodesAllBuffers
*
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 06b57ae71f..bdf6916d63 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -342,8 +342,6 @@ static void pre_sync_fname(const char *fname, bool isdir, int elevel);
static void datadir_fsync_fname(const char *fname, bool isdir, int elevel);
static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel);
-static int fsync_parent_path(const char *fname, int elevel);
-
/*
* pg_fsync --- do fsync with or without writethrough
@@ -3647,7 +3645,7 @@ fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel)
* This is aimed at making file operations persistent on disk in case of
* an OS crash or power failure.
*/
-static int
+int
fsync_parent_path(const char *fname, int elevel)
{
char parentpath[MAXPGPATH];
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 40c758d789..f52d2ac199 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -16,29 +16,50 @@
#include <unistd.h>
+#include "access/xlog.h"
+#include "catalog/pg_tablespace_d.h"
#include "common/relpath.h"
+#include "storage/bufmgr.h"
#include "storage/copydir.h"
#include "storage/fd.h"
+#include "storage/md.h"
#include "storage/reinit.h"
+#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
- int op);
+ Oid tspid, int op);
static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
- int op);
+ Oid tspid, Oid dbid, int op);
typedef struct
{
Oid reloid; /* hash key */
-} unlogged_relation_entry;
+ bool has_init; /* has INIT fork */
+ bool dirty_init; /* needs to remove INIT fork */
+ bool dirty_all; /* needs to remove all forks */
+} relfile_entry;
/*
- * Reset unlogged relations from before the last restart.
+ * Clean up and reset relation files from before the last restart.
*
- * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any
- * relation with an "init" fork, except for the "init" fork itself.
+ * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations
+ * depending on the existence of mark files.
*
+ * If the SMGR_MARK_UNCOMMITTED mark file for the init fork is present, we
+ * remove the init fork along with the mark file.
+ *
+ * If the SMGR_MARK_UNCOMMITTED mark file for the main fork is present, we
+ * remove the whole relation along with the mark file.
+ *
+ * Otherwise, if the "init" fork is found, we remove all forks of any relation
+ * with the "init" fork, except for the "init" fork itself.
+ *
+ *
+ * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all
+ * relations that have mark files and/or "init" forks.
+ *
* If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main
* fork.
*/
@@ -68,7 +89,7 @@ ResetUnloggedRelations(int op)
/*
* First process unlogged files in pg_default ($PGDATA/base)
*/
- ResetUnloggedRelationsInTablespaceDir("base", op);
+ ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op);
/*
* Cycle through directories for all non-default tablespaces.
@@ -77,13 +98,19 @@ ResetUnloggedRelations(int op)
while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
{
+ Oid tspid;
+
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
continue;
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
- ResetUnloggedRelationsInTablespaceDir(temp_path, op);
+
+ tspid = atooid(spc_de->d_name);
+ Assert(tspid != 0);
+
+ ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op);
}
FreeDir(spc_dir);
@@ -99,7 +126,8 @@ ResetUnloggedRelations(int op)
* Process one tablespace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
+ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
+ Oid tspid, int op)
{
DIR *ts_dir;
struct dirent *de;
@@ -126,6 +154,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
while ((de = ReadDir(ts_dir, tsdirname)) != NULL)
{
+ Oid dbid;
+
/*
* We're only interested in the per-database directories, which have
* numeric names. Note that this code will also (properly) ignore "."
@@ -136,7 +166,10 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
snprintf(dbspace_path, sizeof(dbspace_path), "%s/%s",
tsdirname, de->d_name);
- ResetUnloggedRelationsInDbspaceDir(dbspace_path, op);
+ dbid = atooid(de->d_name);
+ Assert(dbid != 0);
+
+ ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op);
}
FreeDir(ts_dir);
@@ -146,125 +179,228 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
* Process one per-dbspace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
+ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
+ Oid tspid, Oid dbid, int op)
{
DIR *dbspace_dir;
struct dirent *de;
char rm_path[MAXPGPATH * 2];
+ HTAB *hash;
+ HASHCTL ctl;
/* Caller must specify at least one operation. */
- Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0);
+ Assert((op & (UNLOGGED_RELATION_CLEANUP |
+ UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_INIT)) != 0);
/*
* Cleanup is a two-pass operation. First, we go through and identify all
* the files with init forks. Then, we go through again and nuke
* everything with the same OID except the init fork.
*/
+
+ /*
+ * It's possible that someone could create a ton of unlogged relations
+ * in the same database & tablespace, so we'd better use a hash table
+ * rather than an array or linked list to keep track of which files
+ * need to be reset. Otherwise, this cleanup operation would be
+ * O(n^2).
+ */
+ memset(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(relfile_entry);
+ hash = hash_create("relfilenode cleanup hash",
+ 32, &ctl, HASH_ELEM | HASH_BLOBS);
+
+ /* Collect INIT fork and mark files in the directory. */
+ dbspace_dir = AllocateDir(dbspacedirname);
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
+ {
+ int oidchars;
+ ForkNumber forkNum;
+ StorageMarks mark;
+
+ /* Skip anything that doesn't look like a relation data file. */
+ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
+ &forkNum, &mark))
+ continue;
+
+ if (forkNum == INIT_FORKNUM || mark == SMGR_MARK_UNCOMMITTED)
+ {
+ Oid key;
+ relfile_entry *ent;
+ bool found;
+
+ /*
+ * Record the relfilenode information. If it has
+ * SMGR_MARK_UNCOMMITTED mark files, the relfilenode is in dirty
+ * state, where clean up is needed.
+ */
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_ENTER, &found);
+
+ if (!found)
+ {
+ ent->has_init = false;
+ ent->dirty_init = false;
+ ent->dirty_all = false;
+ }
+
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_init = true;
+ else if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_all = true;
+ else
+ {
+ Assert(forkNum == INIT_FORKNUM);
+ ent->has_init = true;
+ }
+ }
+ }
+
+ /* Done with the first pass. */
+ FreeDir(dbspace_dir);
+
+ /* nothing to do if we have neither init forks nor mark files */
+ if (hash_get_num_entries(hash) < 1)
+ {
+ hash_destroy(hash);
+ return;
+ }
+
+ if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0)
+ {
+ /*
+ * When we come here after recovery, an smgr object for this file might
+ * have been created. In that case we need to drop all buffers and then
+ * the smgr object before initializing the unlogged relation. This is
+ * safe as long as no other backends have accessed the relation before
+ * starting archive recovery.
+ */
+ HASH_SEQ_STATUS status;
+ relfile_entry *ent;
+ SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8);
+ int maxrels = 8;
+ int nrels = 0;
+ int i;
+
+ Assert(!HotStandbyActive());
+
+ hash_seq_init(&status, hash);
+ while ((ent = (relfile_entry *) hash_seq_search(&status)) != NULL)
+ {
+ RelFileNodeBackend rel;
+
+ /*
+ * The relation is persistent and remains persistent. Don't
+ * drop the buffers for this relation.
+ */
+ if (ent->has_init && ent->dirty_init)
+ continue;
+
+ if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = ent->reloid;
+
+ srels[nrels++] = smgropen(rel.node, InvalidBackendId);
+ }
+
+ DropRelFileNodesAllBuffers(srels, nrels);
+
+ for (i = 0; i < nrels; i++)
+ smgrclose(srels[i]);
+ }
+
+ /*
+ * Now, make a second pass and remove anything that matches.
+ */
if ((op & UNLOGGED_RELATION_CLEANUP) != 0)
{
- HTAB *hash;
- HASHCTL ctl;
-
- /*
- * It's possible that someone could create a ton of unlogged relations
- * in the same database & tablespace, so we'd better use a hash table
- * rather than an array or linked list to keep track of which files
- * need to be reset. Otherwise, this cleanup operation would be
- * O(n^2).
- */
- ctl.keysize = sizeof(Oid);
- ctl.entrysize = sizeof(unlogged_relation_entry);
- ctl.hcxt = CurrentMemoryContext;
- hash = hash_create("unlogged relation OIDs", 32, &ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- /* Scan the directory. */
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ int oidchars;
+ Oid key;
+ relfile_entry *ent;
+ RelFileNodeBackend rel;
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* Also skip it unless this is the init fork. */
- if (forkNum != INIT_FORKNUM)
- continue;
-
- /*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
- */
- ent.reloid = atooid(de->d_name);
- (void) hash_search(hash, &ent, HASH_ENTER, NULL);
- }
-
- /* Done with the first pass. */
- FreeDir(dbspace_dir);
-
- /*
- * If we didn't find any init forks, there's no point in continuing;
- * we can bail out now.
- */
- if (hash_get_num_entries(hash) == 0)
- {
- hash_destroy(hash);
- return;
- }
-
- /*
- * Now, make a second pass and remove anything that matches.
- */
- dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
- {
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
-
- /* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* We never remove the init fork. */
- if (forkNum == INIT_FORKNUM)
+ &forkNum, &mark))
continue;
/*
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
- if (hash_search(hash, &ent, HASH_FIND, NULL))
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);
+
+ if (!ent)
+ continue;
+
+ if (!ent->dirty_all)
{
- snprintf(rm_path, sizeof(rm_path), "%s/%s",
- dbspacedirname, de->d_name);
- if (unlink(rm_path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m",
- rm_path)));
+ /* clean permanent relations don't need cleanup */
+ if (!ent->has_init)
+ continue;
+
+ if (ent->dirty_init)
+ {
+ /*
+ * The crashed transaction did SET UNLOGGED. This relation
+ * is restored to a LOGGED relation.
+ */
+ if (forkNum != INIT_FORKNUM)
+ continue;
+ }
else
- elog(DEBUG2, "unlinked file \"%s\"", rm_path);
+ {
+ /*
+ * we don't remove the INIT fork of a non-dirty
+ * relfilenode
+ */
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE)
+ continue;
+ }
}
+
+ /* so, nuke it! */
+ snprintf(rm_path, sizeof(rm_path), "%s/%s",
+ dbspacedirname, de->d_name);
+ if (unlink(rm_path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ rm_path)));
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = atooid(de->d_name);
+
+ ForgetRelationForkSyncRequests(rel, forkNum);
}
/* Cleanup is complete. */
FreeDir(dbspace_dir);
- hash_destroy(hash);
}
+ hash_destroy(hash);
+ hash = NULL;
+
/*
* Initialization happens after cleanup is complete: we copy each init
- * fork file to the corresponding main fork file. Note that if we are
- * asked to do both cleanup and init, we may never get here: if the
- * cleanup code determines that there are no init forks in this dbspace,
- * it will return before we get to this point.
+ * fork file to the corresponding main fork file.
*/
if ((op & UNLOGGED_RELATION_INIT) != 0)
{
@@ -273,6 +409,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char srcpath[MAXPGPATH * 2];
@@ -280,9 +417,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -316,15 +455,18 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char mainpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -367,7 +509,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
*/
bool
parse_filename_for_nontemp_relation(const char *name, int *oidchars,
- ForkNumber *fork)
+ ForkNumber *fork, StorageMarks *mark)
{
int pos;
@@ -398,11 +540,19 @@ parse_filename_for_nontemp_relation(const char *name, int *oidchars,
for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar)
;
- if (segchar <= 1)
- return false;
- pos += segchar;
+ if (segchar > 1)
+ pos += segchar;
}
+ /* mark file? */
+ if (name[pos] == '.' && name[pos + 1] != 0)
+ {
+ *mark = name[pos + 1];
+ pos += 2;
+ }
+ else
+ *mark = SMGR_MARK_NONE;
+
/* Now we should be at the end. */
if (name[pos] != '\0')
return false;
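The second-pass cleanup rules in ResetUnloggedRelationsInDbspaceDir above can be summarized as a small decision function. The following is only a sketch for discussion, not code from the patch: `Fork`, `CleanupEntry` and `should_unlink` are hypothetical names, and `is_mark_file` stands in for `mark != SMGR_MARK_NONE`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for ForkNumber; only the init fork matters here. */
typedef enum { FORK_MAIN, FORK_FSM, FORK_VM, FORK_INIT } Fork;

/* Mirrors the per-relfilenode flags collected in the first pass. */
typedef struct
{
    bool has_init;    /* an init fork file was seen */
    bool dirty_init;  /* a mark file for the init fork was seen */
    bool dirty_all;   /* a mark file for the main fork was seen */
} CleanupEntry;

/*
 * Decide whether one file belonging to a relfilenode should be unlinked
 * during cleanup. ent is NULL when the relfilenode has no hash entry.
 */
static bool
should_unlink(const CleanupEntry *ent, Fork fork, bool is_mark_file)
{
    if (ent == NULL)
        return false;           /* not tracked: leave the file alone */
    if (ent->dirty_all)
        return true;            /* uncommitted creation: remove everything */
    if (!ent->has_init)
        return false;           /* no init fork: nothing to reset */
    if (ent->dirty_init)
        return fork == FORK_INIT;   /* crashed SET UNLOGGED: drop init only */

    /* ordinary unlogged relation: keep only the init fork itself */
    return !(fork == FORK_INIT && !is_mark_file);
}
```

Under these assumptions, a relation whose SET UNLOGGED crashed (has_init and dirty_init both set) keeps its main fork and loses only the init fork and its mark file, which matches the "restored to a LOGGED relation" comment in the patch.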
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 1e12cfad8e..87a777b307 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -139,7 +139,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
BlockNumber blkno, bool skipFsync, int behavior);
static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
MdfdVec *seg);
-
+static bool mdmarkexists(SMgrRelation reln, ForkNumber forkNum,
+ StorageMarks mark);
/*
* mdinit() -- Initialize private state for magnetic disk storage manager.
@@ -169,6 +170,80 @@ mdexists(SMgrRelation reln, ForkNumber forkNum)
return (mdopenfork(reln, forkNum, EXTENSION_RETURN_NULL) != NULL);
}
+/*
+ * mdcreatemark() -- Create a mark file.
+ *
+ * If isRedo is true, it's okay for the file to exist already.
+ */
+void
+mdcreatemark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+ int fd;
+
+ /* See mdcreate for details. */
+ TablespaceCreateDbspace(reln->smgr_rnode.node.spcNode,
+ reln->smgr_rnode.node.dbNode,
+ isRedo);
+
+ fd = BasicOpenFile(path, O_WRONLY | O_CREAT | O_EXCL);
+ if (fd < 0 && (!isRedo || errno != EEXIST))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create mark file \"%s\": %m", path)));
+
+ if (fd >= 0)
+ {
+ pg_fsync(fd);
+ close(fd);
+ }
+
+ /* fsync the parent directory to make the file creation persistent. */
+ fsync_parent_path(path, ERROR);
+ pfree(path);
+}
+
+
+/*
+ * mdunlinkmark() -- Delete the mark file
+ *
+ * If isRedo is true, it's okay if the file is not found.
+ */
+void
+mdunlinkmark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+
+ if (!isRedo || mdmarkexists(reln, forkNum, mark))
+ durable_unlink(path, ERROR);
+
+ pfree(path);
+}
+
+/*
+ * mdmarkexists() -- Check if the file exists.
+ */
+static bool
+mdmarkexists(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+ int fd;
+
+ fd = BasicOpenFile(path, O_RDONLY);
+ if (fd < 0 && errno != ENOENT)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not access mark file \"%s\": %m", path)));
+ pfree(path);
+
+ if (fd < 0)
+ return false;
+ close(fd);
+ return true;
+}
+
/*
* mdcreate() -- Create a new relation on magnetic disk.
*
@@ -1024,6 +1099,15 @@ register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum,
RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ );
}
+/*
+ * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork
+ */
+void
+ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, ForkNumber forknum)
+{
+ register_forget_request(rnode, forknum, 0);
+}
+
/*
* ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB
*/
@@ -1377,12 +1461,14 @@ mdsyncfiletag(const FileTag *ftag, char *path)
* Return 0 on success, -1 on failure, with errno set.
*/
int
-mdunlinkfiletag(const FileTag *ftag, char *path)
+mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark)
{
char *p;
/* Compute the path. */
- p = relpathperm(ftag->rnode, MAIN_FORKNUM);
+ p = GetRelationPath(ftag->rnode.dbNode, ftag->rnode.spcNode,
+ ftag->rnode.relNode, InvalidBackendId, MAIN_FORKNUM,
+ mark);
strlcpy(path, p, MAXPGPATH);
pfree(p);
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 4dc24649df..dd3496cf51 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -62,6 +62,10 @@ typedef struct f_smgr
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+ void (*smgr_createmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+ void (*smgr_unlinkmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
} f_smgr;
static const f_smgr smgrsw[] = {
@@ -82,6 +86,8 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
+ .smgr_createmark = mdcreatemark,
+ .smgr_unlinkmark = mdunlinkmark,
}
};
@@ -335,6 +341,26 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo);
}
+/*
+ * smgrcreatemark() -- Create a mark file
+ */
+void
+smgrcreatemark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_createmark(reln, forknum, mark, isRedo);
+}
+
+/*
+ * smgrunlinkmark() -- Delete a mark file
+ */
+void
+smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlinkmark(reln, forknum, mark, isRedo);
+}
+
/*
* smgrdosyncall() -- Immediately sync all forks of all given relations
*
@@ -662,6 +688,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 708215614d..a23c03ca3e 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -88,7 +88,8 @@ static CycleCtr checkpoint_cycle_ctr = 0;
typedef struct SyncOps
{
int (*sync_syncfiletag) (const FileTag *ftag, char *path);
- int (*sync_unlinkfiletag) (const FileTag *ftag, char *path);
+ int (*sync_unlinkfiletag) (const FileTag *ftag, char *path,
+ StorageMarks mark);
bool (*sync_filetagmatches) (const FileTag *ftag,
const FileTag *candidate);
} SyncOps;
@@ -216,7 +217,8 @@ SyncPostCheckpoint(void)
/* Unlink the file */
if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
- path) < 0)
+ path,
+ SMGR_MARK_NONE) < 0)
{
/*
* There's a race condition, when the database is dropped at the
@@ -230,6 +232,20 @@ SyncPostCheckpoint(void)
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m", path)));
}
+ else if (syncsw[entry->tag.handler].sync_unlinkfiletag(
+ &entry->tag, path,
+ SMGR_MARK_UNCOMMITTED) < 0)
+ {
+ /*
+ * We may also have an SMGR_MARK_UNCOMMITTED file. Remove it if the
+ * fork file has been successfully removed. It's OK if the mark file
+ * does not exist.
+ */
+ if (errno != ENOENT)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
/* And remove the list entry */
pendingUnlinks = list_delete_first(pendingUnlinks);
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 59ebac7d6a..db6b658489 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -407,6 +407,30 @@ extractPageInfo(XLogReaderState *record)
* source system.
*/
}
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_UNLINK)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_MARK)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
else if (rmid == RM_XACT_ID &&
((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT ||
(rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED ||
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 1f5c426ec0..67f24890d6 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -139,9 +139,15 @@ GetDatabasePath(Oid dbNode, Oid spcNode)
*/
char *
GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
- int backendId, ForkNumber forkNumber)
+ int backendId, ForkNumber forkNumber, char mark)
{
char *path;
+ char markstr[10];
+
+ if (mark == 0)
+ markstr[0] = 0;
+ else
+ snprintf(markstr, sizeof(markstr), ".%c", mark);
if (spcNode == GLOBALTABLESPACE_OID)
{
@@ -149,10 +155,10 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
Assert(dbNode == 0);
Assert(backendId == InvalidBackendId);
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("global/%u_%s",
- relNode, forkNames[forkNumber]);
+ path = psprintf("global/%u_%s%s",
+ relNode, forkNames[forkNumber], markstr);
else
- path = psprintf("global/%u", relNode);
+ path = psprintf("global/%u%s", relNode, markstr);
}
else if (spcNode == DEFAULTTABLESPACE_OID)
{
@@ -160,22 +166,22 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/%u_%s",
+ path = psprintf("base/%u/%u_%s%s",
dbNode, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/%u",
- dbNode, relNode);
+ path = psprintf("base/%u/%u%s",
+ dbNode, relNode, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/t%d_%u_%s",
+ path = psprintf("base/%u/t%d_%u_%s%s",
dbNode, backendId, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/t%d_%u",
- dbNode, backendId, relNode);
+ path = psprintf("base/%u/t%d_%u%s",
+ dbNode, backendId, relNode, markstr);
}
}
else
@@ -184,27 +190,28 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u_%s%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
dbNode, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
- dbNode, relNode);
+ dbNode, relNode, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
dbNode, backendId, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
- dbNode, backendId, relNode);
+ dbNode, backendId, relNode, markstr);
}
}
+
return path;
}
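To make the resulting file names concrete, here is a minimal standalone sketch of how the mark suffix is appended for a relation in the default tablespace. It is not the patch's code: `sketch_relpath` is a hypothetical helper, it only covers the non-temp "base/" case, and the mark character 'u' used in the usage note is an assumption, since the actual value of SMGR_MARK_UNCOMMITTED is defined elsewhere in the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format "base/<db>/<rel>[_<fork>][.<mark>]" into buf.
 * forkname is NULL for the main fork; mark is 0 for "no mark".
 * Returns whatever snprintf returns for the final path.
 */
static int
sketch_relpath(char *buf, size_t len, unsigned db, unsigned rel,
               const char *forkname, char mark)
{
    char markstr[3] = "";       /* ".", one mark character, terminator */

    assert(buf != NULL);

    if (mark != 0)
        snprintf(markstr, sizeof(markstr), ".%c", mark);

    if (forkname != NULL)
        return snprintf(buf, len, "base/%u/%u_%s%s",
                        db, rel, forkname, markstr);

    return snprintf(buf, len, "base/%u/%u%s", db, rel, markstr);
}
```

With a hypothetical mark character 'u', the init fork of relfilenode 16384 in database 5 would get the mark file "base/5/16384_init.u", and the main fork would get "base/5/16384.u". Segment files use a numeric suffix such as ".1", so a non-digit mark character keeps the two kinds of suffix distinguishable when file names are parsed.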
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 0ab32b44e9..382623159c 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -23,6 +23,8 @@
extern int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
+extern void RelationCreateInitFork(Relation rel);
+extern void RelationDropInitFork(Relation rel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationPreTruncate(Relation rel);
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index f0814f1458..12346ed7f6 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -18,17 +18,23 @@
#include "lib/stringinfo.h"
#include "storage/block.h"
#include "storage/relfilenode.h"
+#include "storage/smgr.h"
/*
* Declarations for smgr-related XLOG records
*
- * Note: we log file creation and truncation here, but logging of deletion
- * actions is handled by xact.c, because it is part of transaction commit.
+ * Note: we log file creation, truncation and buffer persistence change here,
+ * but logging of deletion actions is handled mainly by xact.c, because it is
+ * part of transaction commit in most cases. However, there's a case where
+ * init forks are deleted outside the control of a transaction.
*/
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_UNLINK 0x30
+#define XLOG_SMGR_MARK 0x40
+#define XLOG_SMGR_BUFPERSISTENCE 0x50
typedef struct xl_smgr_create
{
@@ -36,6 +42,32 @@ typedef struct xl_smgr_create
ForkNumber forkNum;
} xl_smgr_create;
+typedef struct xl_smgr_unlink
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+} xl_smgr_unlink;
+
+typedef enum smgr_mark_action
+{
+ XLOG_SMGR_MARK_CREATE = 'c',
+ XLOG_SMGR_MARK_UNLINK = 'u'
+} smgr_mark_action;
+
+typedef struct xl_smgr_mark
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ smgr_mark_action action;
+} xl_smgr_mark;
+
+typedef struct xl_smgr_bufpersistence
+{
+ RelFileNode rnode;
+ bool persistence;
+} xl_smgr_bufpersistence;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +83,12 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark);
+extern void log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark);
+extern void log_smgrbufpersistence(const RelFileNode *rnode, bool persistence);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index a44be11ca0..106a5cf508 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -67,7 +67,7 @@ extern int forkname_chars(const char *str, ForkNumber *fork);
extern char *GetDatabasePath(Oid dbNode, Oid spcNode);
extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
- int backendId, ForkNumber forkNumber);
+ int backendId, ForkNumber forkNumber, char mark);
/*
* Wrapper macros for GetRelationPath. Beware of multiple
@@ -77,7 +77,7 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
/* First argument is a RelFileNode */
#define relpathbackend(rnode, backend, forknum) \
GetRelationPath((rnode).dbNode, (rnode).spcNode, (rnode).relNode, \
- backend, forknum)
+ backend, forknum, 0)
/* First argument is a RelFileNode */
#define relpathperm(rnode, forknum) \
@@ -87,4 +87,9 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
#define relpath(rnode, forknum) \
relpathbackend((rnode).node, (rnode).backend, forknum)
+/* First argument is a RelFileNodeBackend */
+#define markpath(rnode, forknum, mark) \
+ GetRelationPath((rnode).node.dbNode, (rnode).node.spcNode, \
+ (rnode).node.relNode, \
+ (rnode).backend, forknum, mark)
#endif /* RELPATH_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index fb00fda6a7..ccb0a388f6 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -205,6 +205,8 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels)
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+ bool permanent, bool isRedo);
extern void DropRelFileNodesAllBuffers(struct SMgrRelationData **smgr_reln, int nnodes);
extern void DropDatabaseBuffers(Oid dbid);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 328473bdc9..485c58e5f1 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -167,6 +167,7 @@ extern ssize_t pg_pwritev_with_retry(int fd,
extern int pg_truncate(const char *path, off_t length);
extern void fsync_fname(const char *fname, bool isdir);
extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
+extern int fsync_parent_path(const char *fname, int elevel);
extern int durable_rename(const char *oldfile, const char *newfile, int loglevel);
extern int durable_unlink(const char *fname, int loglevel);
extern int durable_rename_excl(const char *oldfile, const char *newfile, int loglevel);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 752b440864..99620816b5 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -23,6 +23,10 @@
extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
extern void mdclose(SMgrRelation reln, ForkNumber forknum);
+extern void mdcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void mdunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
@@ -41,12 +45,14 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void ForgetRelationForkSyncRequests(RelFileNodeBackend rnode,
+ ForkNumber forknum);
extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
/* md sync callbacks */
extern int mdsyncfiletag(const FileTag *ftag, char *path);
-extern int mdunlinkfiletag(const FileTag *ftag, char *path);
+extern int mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark);
extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *candidate);
#endif /* MD_H */
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index fad1e5c473..e1f97e9b89 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -16,13 +16,15 @@
#define REINIT_H
#include "common/relpath.h"
-
+#include "storage/smgr.h"
extern void ResetUnloggedRelations(int op);
-extern bool parse_filename_for_nontemp_relation(const char *name,
- int *oidchars, ForkNumber *fork);
+extern bool parse_filename_for_nontemp_relation(const char *name, int *oidchars,
+ ForkNumber *fork,
+ StorageMarks *mark);
#define UNLOGGED_RELATION_CLEANUP 0x0001
-#define UNLOGGED_RELATION_INIT 0x0002
+#define UNLOGGED_RELATION_DROP_BUFFER 0x0002
+#define UNLOGGED_RELATION_INIT 0x0004
#endif /* REINIT_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a6fbf7b6a6..201ecace8a 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,18 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
+/*
+ * A storage mark is a file whose existence indicates something about another
+ * file. The name of such a file is "<filename>.<mark>", where the mark is one
+ * of the values of StorageMarks. Since ".<digit>" denotes segment files, don't
+ * use digits for the mark character.
+ */
+typedef enum StorageMarks
+{
+ SMGR_MARK_NONE = 0,
+ SMGR_MARK_UNCOMMITTED = 'u' /* the file is not committed yet */
+} StorageMarks;
+
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
@@ -85,7 +97,12 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
extern void smgrclose(SMgrRelation reln);
extern void smgrcloseall(void);
extern void smgrclosenode(RelFileNodeBackend rnode);
+extern void smgrcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void smgrunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
--
2.27.0
v6-0002-New-command-ALTER-TABLE-ALL-IN-TABLESPACE-SET-LOG.patch (text/x-patch; charset=us-ascii)
From 625bbc0e05a698aa2c19b5fba4947009358bd560 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 23:21:09 +0900
Subject: [PATCH v6 2/2] New command ALTER TABLE ALL IN TABLESPACE SET
LOGGED/UNLOGGED
To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes
relation persistence of all tables in the specified tablespace.
---
src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++++++++++
src/backend/nodes/copyfuncs.c | 16 ++++
src/backend/nodes/equalfuncs.c | 15 ++++
src/backend/parser/gram.y | 20 +++++
src/backend/tcop/utility.c | 11 +++
src/include/commands/tablecmds.h | 2 +
src/include/nodes/nodes.h | 1 +
src/include/nodes/parsenodes.h | 9 ++
8 files changed, 214 insertions(+)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 4e2bceffda..26bf8298e9 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -13843,6 +13843,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
return new_tablespaceoid;
}
+/*
+ * Alter Table ALL ... SET LOGGED/UNLOGGED
+ *
+ * Allows a user to change persistence of all objects in a given tablespace in
+ * the current database. Objects can be chosen based on the owner of the
+ * object also, to allow users to change persistence of only their objects. The
+ * main permissions handling is done by the lower-level change persistence
+ * function.
+ *
+ * All to-be-modified objects are locked first. If NOWAIT is specified and the
+ * lock can't be acquired then we ereport(ERROR).
+ */
+void
+AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt)
+{
+ List *relations = NIL;
+ ListCell *l;
+ ScanKeyData key[1];
+ Relation rel;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ Oid tablespaceoid;
+ List *role_oids = roleSpecsToIds(NIL);
+
+ /* Ensure we were not asked to change something we can't */
+ if (stmt->objtype != OBJECT_TABLE)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("only tables can be specified")));
+
+ /* Get the tablespace OID */
+ tablespaceoid = get_tablespace_oid(stmt->tablespacename, false);
+
+ /*
+ * Now that the checks are done, check if we should set either to
+ * InvalidOid because it is our database's default tablespace.
+ */
+ if (tablespaceoid == MyDatabaseTableSpace)
+ tablespaceoid = InvalidOid;
+
+ /*
+ * Walk the list of objects in the tablespace to pick them up. This will
+ * only find objects in our database, of course.
+ */
+ ScanKeyInit(&key[0],
+ Anum_pg_class_reltablespace,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(tablespaceoid));
+
+ rel = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(rel, 1, key);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relOid = relForm->oid;
+
+ /*
+ * Do not pick-up objects in pg_catalog as part of this, if an admin
+ * really wishes to do so, they can issue the individual ALTER
+ * commands directly.
+ *
+ * Also, explicitly avoid any shared tables, temp tables, or TOAST
+ * (TOAST will be changed with the main table).
+ */
+ if (IsCatalogNamespace(relForm->relnamespace) ||
+ relForm->relisshared ||
+ isAnyTempNamespace(relForm->relnamespace) ||
+ IsToastNamespace(relForm->relnamespace))
+ continue;
+
+ /* Only pick up the object type requested */
+ if (relForm->relkind != RELKIND_RELATION)
+ continue;
+
+ /* Check if we are only picking-up objects owned by certain roles */
+ if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner))
+ continue;
+
+ /*
+ * Handle permissions-checking here since we are locking the tables
+ * and also to avoid doing a bunch of work only to fail part-way. Note
+ * that permissions will also be checked by AlterTableInternal().
+ *
+ * Caller must be considered an owner on the table of which we're going
+ * to change persistence.
+ */
+ if (!pg_class_ownercheck(relOid, GetUserId()))
+ aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)),
+ NameStr(relForm->relname));
+
+ if (stmt->nowait &&
+ !ConditionalLockRelationOid(relOid, AccessExclusiveLock))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("aborting because lock on relation \"%s.%s\" is not available",
+ get_namespace_name(relForm->relnamespace),
+ NameStr(relForm->relname))));
+ else
+ LockRelationOid(relOid, AccessExclusiveLock);
+
+ /*
+ * Add to our list of objects of which we're going to change
+ * persistence.
+ */
+ relations = lappend_oid(relations, relOid);
+ }
+
+ table_endscan(scan);
+ table_close(rel, AccessShareLock);
+
+ if (relations == NIL)
+ ereport(NOTICE,
+ (errcode(ERRCODE_NO_DATA_FOUND),
+ errmsg("no matching relations in tablespace \"%s\" found",
+ tablespaceoid == InvalidOid ? "(database default)" :
+ get_tablespace_name(tablespaceoid))));
+
+ /*
+ * Everything is locked, loop through and change persistence of all of the
+ * relations.
+ */
+ foreach(l, relations)
+ {
+ List *cmds = NIL;
+ AlterTableCmd *cmd = makeNode(AlterTableCmd);
+
+ if (stmt->logged)
+ cmd->subtype = AT_SetLogged;
+ else
+ cmd->subtype = AT_SetUnLogged;
+
+ cmds = lappend(cmds, cmd);
+
+ EventTriggerAlterTableStart((Node *) stmt);
+ /* OID is set by AlterTableInternal */
+ AlterTableInternal(lfirst_oid(l), cmds, false);
+ EventTriggerAlterTableEnd();
+ }
+}
+
static void
index_copy_data(Relation rel, RelFileNode newrnode)
{
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 82d7cce5d5..3471b8e2cc 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4197,6 +4197,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from)
return newnode;
}
+static AlterTableSetLoggedAllStmt *
+_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *from)
+{
+ AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt);
+
+ COPY_STRING_FIELD(tablespacename);
+ COPY_SCALAR_FIELD(objtype);
+ COPY_SCALAR_FIELD(logged);
+ COPY_SCALAR_FIELD(nowait);
+
+ return newnode;
+}
+
static CreateExtensionStmt *
_copyCreateExtensionStmt(const CreateExtensionStmt *from)
{
@@ -5503,6 +5516,9 @@ copyObjectImpl(const void *from)
case T_AlterTableMoveAllStmt:
retval = _copyAlterTableMoveAllStmt(from);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _copyAlterTableSetLoggedAllStmt(from);
+ break;
case T_CreateExtensionStmt:
retval = _copyCreateExtensionStmt(from);
break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 3e980c457c..d05aef4fde 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1875,6 +1875,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a,
return true;
}
+static bool
+_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *a,
+ const AlterTableSetLoggedAllStmt *b)
+{
+ COMPARE_STRING_FIELD(tablespacename);
+ COMPARE_SCALAR_FIELD(objtype);
+ COMPARE_SCALAR_FIELD(logged);
+ COMPARE_SCALAR_FIELD(nowait);
+
+ return true;
+}
+
static bool
_equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b)
{
@@ -3528,6 +3540,9 @@ equal(const void *a, const void *b)
case T_AlterTableMoveAllStmt:
retval = _equalAlterTableMoveAllStmt(a, b);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _equalAlterTableSetLoggedAllStmt(a, b);
+ break;
case T_CreateExtensionStmt:
retval = _equalCreateExtensionStmt(a, b);
break;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index bc43641ffe..5c3fd1998e 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -1948,6 +1948,26 @@ AlterTableStmt:
n->nowait = $13;
$$ = (Node *)n;
}
+ | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = true;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = false;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
| ALTER INDEX qualified_name alter_table_cmds
{
AlterTableStmt *n = makeNode(AlterTableStmt);
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 05bb698cf4..3c18312367 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -161,6 +161,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_AlterTSConfigurationStmt:
case T_AlterTSDictionaryStmt:
case T_AlterTableMoveAllStmt:
+ case T_AlterTableSetLoggedAllStmt:
case T_AlterTableSpaceOptionsStmt:
case T_AlterTableStmt:
case T_AlterTypeStmt:
@@ -1694,6 +1695,12 @@ ProcessUtilitySlow(ParseState *pstate,
commandCollected = true;
break;
+ case T_AlterTableSetLoggedAllStmt:
+ AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree);
+ /* commands are stashed in AlterTableSetLoggedAll */
+ commandCollected = true;
+ break;
+
case T_DropStmt:
ExecDropStmt((DropStmt *) parsetree, isTopLevel);
/* no commands stashed for DROP */
@@ -2582,6 +2589,10 @@ CreateCommandTag(Node *parsetree)
tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype);
+ break;
+
case T_AlterTableStmt:
tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype);
break;
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index b3d30acc35..b4af2db6f0 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
+extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt);
+
extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
Oid *oldschema);
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index e22df890ef..91dfc77978 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -427,6 +427,7 @@ typedef enum NodeTag
T_AlterCollationStmt,
T_CallStmt,
T_AlterStatsStmt,
+ T_AlterTableSetLoggedAllStmt,
/*
* TAGS FOR PARSE TREE NODES (parsenodes.h)
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 68425eb2c0..b9b75dc45b 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2293,6 +2293,15 @@ typedef struct AlterTableMoveAllStmt
bool nowait;
} AlterTableMoveAllStmt;
+typedef struct AlterTableSetLoggedAllStmt
+{
+ NodeTag type;
+ char *tablespacename;
+ ObjectType objtype; /* Object type to change persistence of */
+ bool logged;
+ bool nowait;
+} AlterTableSetLoggedAllStmt;
+
/* ----------------------
* Create/Alter Extension Statements
* ----------------------
--
2.27.0
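The filter that AlterTableSetLoggedAll() in the v6-0002 patch above applies to each pg_class row can be summarized as a single predicate. The following is only a minimal sketch under stated assumptions: RelInfo and is_candidate() are hypothetical names that do not appear in the patch, and the boolean fields stand in for the IsCatalogNamespace(), relisshared, isAnyTempNamespace() and IsToastNamespace() checks. The owner check is left out because role_oids is built from NIL in the patch, so that filter appears never to exclude anything.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the pg_class fields the scan looks at. */
typedef struct RelInfo
{
    bool in_pg_catalog;      /* relation lives in pg_catalog */
    bool is_shared;          /* shared across databases */
    bool in_temp_namespace;  /* any backend's temporary schema */
    bool in_toast_namespace; /* TOAST relation, follows its main table */
    char relkind;            /* 'r' is an ordinary table (RELKIND_RELATION) */
} RelInfo;

/* Return true if the scan would pick this relation up for SET LOGGED/UNLOGGED. */
static bool
is_candidate(const RelInfo *rel)
{
    assert(rel != NULL);

    /* Catalog, shared, temporary and TOAST relations are always skipped. */
    if (rel->in_pg_catalog || rel->is_shared ||
        rel->in_temp_namespace || rel->in_toast_namespace)
        return false;

    /* Only ordinary tables are collected. */
    return rel->relkind == 'r';
}
```

With this predicate an ordinary user table is accepted, while an index, a catalog table or a TOAST table is skipped; in the patch, indexes and TOAST relations are expected to follow their table through AlterTableInternal().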
At Thu, 25 Mar 2021 14:08:05 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
(I'm not sure when the subject was broken..)
At Thu, 14 Jan 2021 17:32:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Commit bea449c635 conflicts with this on the change of the definition
of DropRelFileNodeBuffers. The change simplified this patch by a bit :p
In this version, I got rid of the "CLEANUP FORK"s, and added a new
system, "Smgr marks". A mark file has the name of the
corresponding fork file followed by ".u" (which means Uncommitted).
An "uncommitted"-marked main fork means the same as CLEANUP2_FORKNUM
and an uncommitted-marked init fork means the same as CLEANUP_FORKNUM
in the previous version.
I noticed that the previous version of the patch still leaves an
orphan main fork file after "BEGIN; CREATE TABLE x; ROLLBACK; (crash
before checkpoint)" since the "mark" file (or CLEANUP2_FORKNUM) is
removed at rollback. In this version the responsibility to remove the
mark files is moved to SyncPostCheckpoint, where main fork files are
actually removed.
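The naming rule for mark files can be sketched in a few lines of C. This is an illustration only, not code from the patch: build_mark_name() and split_mark_name() are hypothetical helpers, MARK_NONE and MARK_UNCOMMITTED stand in for the patch's StorageMarks values, and the rule that a mark is a single non-digit character after the last dot is taken from the comment in smgr.h (".<digit>" is reserved for segment files).

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the patch's StorageMarks values. */
enum { MARK_NONE = 0, MARK_UNCOMMITTED = 'u' };

/*
 * Write "<forkfile>.<mark>" into buf. Returns false if the mark is missing,
 * is a digit (reserved for segment numbers), or the result does not fit.
 */
static bool
build_mark_name(char *buf, size_t buflen, const char *forkfile, int mark)
{
    int n;

    assert(buf != NULL && forkfile != NULL);
    if (mark == MARK_NONE || isdigit((unsigned char) mark))
        return false;
    n = snprintf(buf, buflen, "%s.%c", forkfile, mark);
    return n > 0 && (size_t) n < buflen;
}

/*
 * If name ends in ".<mark>" with a single non-digit mark character, return
 * the mark and store the length of the fork file name in *baselen.
 * Otherwise return MARK_NONE and store the full length.
 */
static int
split_mark_name(const char *name, size_t *baselen)
{
    size_t len = strlen(name);

    if (len >= 3 && name[len - 2] == '.' &&
        !isdigit((unsigned char) name[len - 1]))
    {
        *baselen = len - 2;
        return (unsigned char) name[len - 1];
    }
    *baselen = len;
    return MARK_NONE;
}
```

Under these assumptions a main fork file "16384" gets the mark file "16384.u" and an init fork "16384_init" gets "16384_init.u", while "16384.1" is still recognized as a plain segment file.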
For the record, I noticed that basebackup could be confused by the
mark files, but I haven't looked into that yet.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Kyotaro wrote:
In this version, I got rid of the "CLEANUP FORK"s, and added a new
system "Smgr marks". The mark files have the name of the
corresponding fork file followed by ".u" (which means Uncommitted.).
"Uncommitted"-marked main fork means the same as the CLEANUP2_FORKNUM
and uncommitted-marked init fork means the same as the
CLEANUP_FORKNUM
in the previous version.
I noticed that the previous version of the patch still leaves an
orphan main fork file after "BEGIN; CREATE TABLE x; ROLLBACK; (crash
before checkpoint)" since the "mark" file (or CLEANUP2_FORKNUM) is
removed at rollback. In this version the responsibility to remove the
mark files is moved to SyncPostCheckpoint, where main fork files are
actually removed.
For the record, I noticed that basebackup could be confused by the mark files
but I haven't looked that yet.
Good morning Kyotaro,
The patch didn't apply cleanly (it's from March; some hunks were failing), so I've fixed it up and attached the combined git format-patch. It conflicted with the following commits:
b0483263dda - Add support for SET ACCESS METHOD in ALTER TABLE
7b565843a94 - Add call to object access hook at the end of table rewrite in ALTER TABLE
9ce346eabf3 - Report progress of startup operations that take a long time.
f10f0ae420 - Replace RelationOpenSmgr() with RelationGetSmgr().
I'm especially worried that I may have broken or forgotten something related to the last one (the rd->rd_smgr changes), but I'm getting "All 210 tests passed".
Basic demonstration of this patch (with wal_level=minimal):
create unlogged table t6 (id bigint, t text);
-- produces 110GB table, takes ~5mins
insert into t6 select nextval('s1'), repeat('A', 1000) from generate_series(1, 100000000);
alter table t6 set logged;
on baseline SET LOGGED takes: ~7min10s
on patched SET LOGGED takes: 25s
So, thanks to this patch, one can keep using an application as it is (performing classic INSERTs/UPDATEs/DELETEs, without rewriting it to use COPY), perform what is effectively a batch upload into unlogged tables, and then just switch the tables to LOGGED.
Some more intensive testing also looks good, using a table set up to put pressure on WAL:
create unlogged table t_unlogged (id bigint, t text) partition by hash (id);
create unlogged table t_unlogged_h0 partition of t_unlogged FOR VALUES WITH (modulus 4, remainder 0);
[..]
create unlogged table t_unlogged_h3 partition of t_unlogged FOR VALUES WITH (modulus 4, remainder 3);
The workload would still be pretty heavy on LWLock/BufferContent, LWLock/WALInsert and Lock/extend.
t_logged.sql = insert into t_logged select nextval('s1'), repeat('A', 1000) from generate_series(1, 1000); # according to pg_stat_wal.wal_bytes, generates ~1MB of WAL
t_unlogged.sql = insert into t_unlogged select nextval('s1'), repeat('A', 1000) from generate_series(1, 1000); # according to pg_stat_wal.wal_bytes, generates ~3kB of WAL
So, using: pgbench -f <tabletypetest>.sql -T 30 -P 1 -c 32 -j 3 t
with synchronous_commit = on (default):
with t_logged.sql: tps = 229 (lat avg = 138ms)
with t_unlogged.sql: tps = 283 (lat avg = 112ms) # almost all on LWLock/WALWrite
with synchronous_commit = off:
with t_logged.sql: tps = 413 (lat avg = 77ms)
with t_unlogged.sql: tps = 782 (lat avg = 40ms)
Afterwards, switching the unlogged ~16GB partitions takes 5s per partition.
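The figures above can be cross-checked with a little arithmetic. The sketch below is not part of the patch or of the test setup; speedup(), expected_latency_ms() and print_report() are hypothetical helpers. It assumes that pgbench with 32 clients and no think time behaves as a closed loop, so the average latency should be close to clients / tps (Little's law), and it takes ~7min10s as 430 seconds.

```c
#include <assert.h>
#include <stdio.h>

/* How many times faster the patched run is than the baseline run. */
static double
speedup(double baseline_sec, double patched_sec)
{
    assert(patched_sec > 0);
    return baseline_sec / patched_sec;
}

/*
 * Average latency in milliseconds expected from a closed loop of nclients
 * connections sustaining the given throughput (Little's law).
 */
static double
expected_latency_ms(int nclients, double tps)
{
    assert(tps > 0);
    return 1000.0 * nclients / tps;
}

/* Print the derived figures for the numbers reported in this message. */
static void
print_report(void)
{
    printf("SET LOGGED speedup:               %.1fx\n",
           speedup(7 * 60 + 10, 25));
    printf("synchronous_commit on,  logged:   %.1f ms\n",
           expected_latency_ms(32, 229));
    printf("synchronous_commit on,  unlogged: %.1f ms\n",
           expected_latency_ms(32, 283));
    printf("synchronous_commit off, logged:   %.1f ms\n",
           expected_latency_ms(32, 413));
    printf("synchronous_commit off, unlogged: %.1f ms\n",
           expected_latency_ms(32, 782));
}
```

Calling print_report() from a main() gives a speedup of 17.2x for SET LOGGED, and each derived latency lands within about 2 ms of the reported average, so the tps and latency figures are consistent with each other.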
As the thread didn't get a lot of traction, I've registered it in the current commitfest https://commitfest.postgresql.org/36/3461/ with you as the author and in 'Ready for review' state.
I think it behaves like an almost finished patch. After reading all the discussions about this feature, which go back more than 10 years, and seeing the many failed efforts towards wal_level=noWAL, I think it would be nice to finally start getting some of this into core.
-Jakub Wartak.
Attachments:
v7-0001-In-place-table-persistence-change-with-new-comman.patch (application/octet-stream)
From 82ea53c317b5c785d7ee91bcdaea43e9ad2c8f77 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <Jakub.Wartak@tomtom.com>
Date: Thu, 16 Dec 2021 12:03:42 +0000
Subject: [PATCH v7] In-place table persistence change with new command ALTER
TABLE ALL IN TABLESPACE SET LOGGED/UNLOGGED
Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs a heap rewrite, which causes a large amount of
file I/O. This patch makes the command run without a heap rewrite.
In addition, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of a massive number of HEAP_INSERTs, which should be
smaller.
This also allows for the cleanup of files left behind when the
transaction that created them crashes.
ALTER TABLE ALL IN TABLESPACE SET LOGGED/UNLOGGED: To ease invoking
ALTER TABLE SET LOGGED/UNLOGGED, this command changes relation persistence
of all tables in the specified tablespace.
---
src/backend/access/rmgrdesc/smgrdesc.c | 52 ++++
src/backend/access/transam/README | 8 +
src/backend/access/transam/xlog.c | 17 ++
src/backend/catalog/storage.c | 518 +++++++++++++++++++++++++++++++--
src/backend/commands/tablecmds.c | 374 ++++++++++++++++++++++--
src/backend/nodes/copyfuncs.c | 16 +
src/backend/nodes/equalfuncs.c | 15 +
src/backend/parser/gram.y | 20 ++
src/backend/replication/basebackup.c | 3 +-
src/backend/storage/buffer/bufmgr.c | 88 ++++++
src/backend/storage/file/fd.c | 4 +-
src/backend/storage/file/reinit.c | 318 ++++++++++++++------
src/backend/storage/smgr/md.c | 92 +++++-
src/backend/storage/smgr/smgr.c | 32 ++
src/backend/storage/sync/sync.c | 20 +-
src/backend/tcop/utility.c | 11 +
src/bin/pg_rewind/parsexlog.c | 24 ++
src/common/relpath.c | 47 +--
src/include/catalog/storage.h | 2 +
src/include/catalog/storage_xlog.h | 42 ++-
src/include/commands/tablecmds.h | 2 +
src/include/common/relpath.h | 9 +-
src/include/nodes/nodes.h | 1 +
src/include/nodes/parsenodes.h | 9 +
src/include/storage/bufmgr.h | 2 +
src/include/storage/fd.h | 1 +
src/include/storage/md.h | 8 +-
src/include/storage/reinit.h | 10 +-
src/include/storage/smgr.h | 17 ++
29 files changed, 1583 insertions(+), 179 deletions(-)
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 7755553d57..d251f22207 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,49 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+ char *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+ appendStringInfoString(buf, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+ char *path = GetRelationPath(xlrec->rnode.dbNode,
+ xlrec->rnode.spcNode,
+ xlrec->rnode.relNode,
+ InvalidBackendId,
+ xlrec->forkNum, xlrec->mark);
+ char *action;
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ action = "CREATE";
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ action = "DELETE";
+ break;
+ default:
+ action = "<unknown action>";
+ break;
+ }
+
+ appendStringInfo(buf, "%s %s", action, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+ char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+ appendStringInfoString(buf, path);
+ appendStringInfo(buf, " persistence %d", xlrec->persistence);
+ pfree(path);
+ }
}
const char *
@@ -55,6 +98,15 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_UNLINK:
+ id = "UNLINK";
+ break;
+ case XLOG_SMGR_MARK:
+ id = "MARK";
+ break;
+ case XLOG_SMGR_BUFPERSISTENCE:
+ id = "BUFPERSISTENCE";
+ break;
}
return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c1..7cf77e4a02 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -724,6 +724,14 @@ we must panic and abort recovery. The DBA will have to manually clean up
then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.
+The Smgr MARK files
+--------------------------------
+
+An smgr mark file is created when a new relation file is created, to
+mark that the relfilenode needs to be cleaned up at recovery time. In
+contrast to 4 above, failure to remove smgr mark files will lead to
+data loss, in which case the server will shut down.
+
Skipping WAL for New RelFileNode
--------------------------------
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1e1fbe957f..59f4c2eacf 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -40,6 +40,7 @@
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
#include "catalog/pg_database.h"
+#include "catalog/storage.h"
#include "commands/progress.h"
#include "commands/tablespace.h"
#include "common/controldata_utils.h"
@@ -4564,6 +4565,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
{
ereport(DEBUG1,
(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+
+ /* cleanup garbage files left during crash recovery */
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
InArchiveRecovery = true;
if (StandbyModeRequested)
StandbyMode = true;
@@ -7824,6 +7833,14 @@ StartupXLOG(void)
}
}
+ /* cleanup garbage files left during crash recovery */
+ if (!InArchiveRecovery)
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
/* Allow resource managers to do any required cleanup. */
for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
{
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index c5ad28d71f..a3e250515c 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,6 +19,7 @@
#include "postgres.h"
+#include "access/amapi.h"
#include "access/parallel.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -27,6 +28,7 @@
#include "access/xlogutils.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+#include "common/hashfn.h"
#include "miscadmin.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
@@ -57,9 +59,18 @@ int wal_skip_threshold = 2048; /* in kilobytes */
* but I'm being paranoid.
*/
+#define PDOP_DELETE (1 << 0)
+#define PDOP_UNLINK_FORK (1 << 1)
+#define PDOP_UNLINK_MARK (1 << 2)
+#define PDOP_SET_PERSISTENCE (1 << 3)
+
typedef struct PendingRelDelete
{
RelFileNode relnode; /* relation that may need to be deleted */
+ int op; /* operation mask */
+ bool bufpersistence; /* buffer persistence to set */
+ int unlink_forknum; /* forknum to unlink */
+ StorageMarks unlink_mark; /* mark to unlink */
BackendId backend; /* InvalidBackendId if not a temp rel */
bool atCommit; /* T=delete at commit; F=delete at abort */
int nestLevel; /* xact nesting level of request */
@@ -75,6 +86,24 @@ typedef struct PendingRelSync
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
HTAB *pendingSyncHash = NULL;
+typedef struct SRelHashEntry
+{
+ SMgrRelation srel;
+ char status; /* for simplehash use */
+} SRelHashEntry;
+
+/* define hashtable for workarea for pending deletes */
+#define SH_PREFIX srelhash
+#define SH_ELEMENT_TYPE SRelHashEntry
+#define SH_KEY_TYPE SMgrRelation
+#define SH_KEY srel
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((unsigned char *)&key, sizeof(SMgrRelation))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(SMgrRelation)) == 0)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
/*
* AddPendingSync
@@ -143,22 +172,48 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return NULL; /* placate compiler */
}
+ /*
+ * We are going to create a new storage file. If the server crashes before
+ * the current transaction ends, the file needs to be cleaned up, but there
+ * is otherwise no clue to the orphan files. The SMGR_MARK_UNCOMMITTED mark
+ * file works as the signal of that situation.
+ */
srel = smgropen(rnode, backend);
+ log_smgrcreatemark(&rnode, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
smgrcreate(srel, MAIN_FORKNUM, false);
if (needs_wal)
log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM);
- /* Add the relation to the list of stuff to delete at abort */
+ /*
+ * Add the relation to the list of stuff to delete at abort. We don't
+ * remove the mark file at commit. It needs to persist until the main fork
+ * file is actually deleted. See SyncPostCheckpoint.
+ */
pending = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
pending->relnode = rnode;
+ pending->op = PDOP_DELETE;
pending->backend = backend;
pending->atCommit = false; /* delete if abort */
pending->nestLevel = GetCurrentTransactionNestLevel();
pending->next = pendingDeletes;
pendingDeletes = pending;
+ /* drop the mark file at commit */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_MARK;
+ pending->unlink_forknum = MAIN_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->backend = backend;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+
if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
{
Assert(backend == InvalidBackendId);
@@ -169,6 +224,207 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
}
/*
+ * RelationCreateInitFork
+ * Create physical storage for the init fork of a relation.
+ *
+ * Create the init fork for the relation.
+ *
+ * This function is transactional. The creation is WAL-logged, and if the
+ * transaction aborts later on, the init fork will be removed.
+ */
+void
+RelationCreateInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingRelDelete *pending;
+ SMgrRelation srel;
+ PendingRelDelete *prev;
+ PendingRelDelete *next;
+ bool create = true;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), false, false);
+
+ /*
+ * If we have entries for init-fork operation of this relation, that means
+ * that we have already registered pending delete entries to drop
+ * preexisting init fork since before the current transaction started. This
+ * function reverts that change just by removing the entries.
+ */
+ prev = NULL;
+ for (pending = pendingDeletes; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ (pending->op & PDOP_DELETE) == 0 &&
+ (pending->unlink_forknum == INIT_FORKNUM ||
+ (pending->op & PDOP_SET_PERSISTENCE) != 0))
+ {
+ if (prev)
+ prev->next = next;
+ else
+ pendingDeletes = next;
+ pfree(pending);
+
+ create = false;
+ }
+ else
+ {
+ /* unrelated entry, don't touch it */
+ prev = pending;
+ }
+ }
+
+ if (!create)
+ return;
+
+ /*
+ * We are going to create the init fork. If the server crashes before the
+ * current transaction ends, the init fork left behind corrupts data during
+ * recovery. The mark file works as the sentinel to identify that
+ * situation.
+ */
+ srel = smgropen(rnode, InvalidBackendId);
+ log_smgrcreatemark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+
+ /* We don't have existing init fork, create it. */
+ smgrcreate(srel, INIT_FORKNUM, false);
+
+ /*
+ * index-init fork needs further initialization. ambuildempty should do
+ * WAL-log and file sync by itself but otherwise we do that by ourselves.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rnode, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /* drop the init fork, mark file and revert persistence at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_FORK | PDOP_UNLINK_MARK | PDOP_SET_PERSISTENCE;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->bufpersistence = true;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+
+ /* drop mark file at commit */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_MARK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+}
+
+/*
+ * RelationDropInitFork
+ * Delete physical storage for the init fork of a relation.
+ *
+ * Register pending-delete of the init fork. The real deletion is performed by
+ * smgrDoPendingDeletes at commit.
+ *
+ * This function is transactional. If the transaction aborts later on, the
+ * deletion doesn't happen.
+ */
+void
+RelationDropInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingRelDelete *pending;
+ PendingRelDelete *prev;
+ PendingRelDelete *next;
+ bool inxact_created = false;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), true, false);
+
+ /*
+ * If we have entries for init-fork operations of this relation, that means
+ * that we have created the init fork in the current transaction. We
+ * remove the init fork and mark file immediately in that case. Otherwise
+ * just register pending-delete for the existing init fork.
+ */
+ prev = NULL;
+ for (pending = pendingDeletes; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ (pending->op & PDOP_DELETE) == 0 &&
+ (pending->unlink_forknum == INIT_FORKNUM ||
+ (pending->op & PDOP_SET_PERSISTENCE) != 0))
+ {
+ /* unlink list entry */
+ if (prev)
+ prev->next = next;
+ else
+ pendingDeletes = next;
+ pfree(pending);
+
+ inxact_created = true;
+ }
+ else
+ {
+ /* unrelated entry, don't touch it */
+ prev = pending;
+ }
+ }
+
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rnode, InvalidBackendId);
+
+ /*
+ * INIT forks are never loaded into shared buffers, so there is no point
+ * in dropping buffers for such files.
+ */
+ log_smgrunlinkmark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrunlinkmark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+ log_smgrunlink(&rnode, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ return;
+ }
+
+ /* register drop of this init fork file at commit */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_FORK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_SET_PERSISTENCE;
+ pending->bufpersistence = false;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+}
+
+/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
void
@@ -188,6 +444,88 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
}
/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum)
+{
+ xl_smgr_unlink xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file unlink.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK record (mark creation) to WAL.
+ */
+void
+log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the mark file creation.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_CREATE;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK record (mark unlink) to WAL.
+ */
+void
+log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the mark file unlink.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_UNLINK;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileNode *rnode, bool persistence)
+{
+ xl_smgr_bufpersistence xlrec;
+
+ /*
+ * Make an XLOG entry reporting the change of buffer persistence.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.persistence = persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
*/
@@ -200,6 +538,7 @@ RelationDropStorage(Relation rel)
pending = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
pending->relnode = rel->rd_node;
+ pending->op = PDOP_DELETE;
pending->backend = rel->rd_backend;
pending->atCommit = true; /* delete if commit */
pending->nestLevel = GetCurrentTransactionNestLevel();
@@ -618,59 +957,104 @@ smgrDoPendingDeletes(bool isCommit)
int nrels = 0,
maxrels = 0;
SMgrRelation *srels = NULL;
+ srelhash_hash *close_srels = NULL;
+ bool found;
prev = NULL;
for (pending = pendingDeletes; pending != NULL; pending = next)
{
+ SMgrRelation srel;
+
next = pending->next;
if (pending->nestLevel < nestLevel)
{
/* outer-level entries should not be processed yet */
prev = pending;
+ continue;
}
+
+ /* unlink list entry first, so we don't retry on failure */
+ if (prev)
+ prev->next = next;
else
+ pendingDeletes = next;
+
+ if (pending->atCommit != isCommit)
{
- /* unlink list entry first, so we don't retry on failure */
- if (prev)
- prev->next = next;
- else
- pendingDeletes = next;
- /* do deletion if called for */
- if (pending->atCommit == isCommit)
- {
- SMgrRelation srel;
+ /* must explicitly free the list entry */
+ pfree(pending);
+ /* prev does not change */
+ continue;
+ }
- srel = smgropen(pending->relnode, pending->backend);
+ if (close_srels == NULL)
+ close_srels = srelhash_create(CurrentMemoryContext, 32, NULL);
- /* allocate the initial array, or extend it, if needed */
- if (maxrels == 0)
- {
- maxrels = 8;
- srels = palloc(sizeof(SMgrRelation) * maxrels);
- }
- else if (maxrels <= nrels)
- {
- maxrels *= 2;
- srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
- }
+ srel = smgropen(pending->relnode, pending->backend);
+
+ /* Uniquify the smgr relations */
+ srelhash_insert(close_srels, srel, &found);
- srels[nrels++] = srel;
+ if (pending->op & PDOP_DELETE)
+ {
+ /* allocate the initial array, or extend it, if needed */
+ if (maxrels == 0)
+ {
+ maxrels = 8;
+ srels = palloc(sizeof(SMgrRelation) * maxrels);
}
- /* must explicitly free the list entry */
- pfree(pending);
- /* prev does not change */
+ else if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ srels[nrels++] = srel;
}
+
+ if (pending->op & PDOP_UNLINK_FORK)
+ {
+ /* other forks would need their buffers dropped */
+ Assert(pending->unlink_forknum == INIT_FORKNUM);
+
+ /* Don't emit WAL during recovery. */
+ if (!InRecovery)
+ log_smgrunlink(&pending->relnode, pending->unlink_forknum);
+ smgrunlink(srel, pending->unlink_forknum, false);
+ }
+
+ if (pending->op & PDOP_UNLINK_MARK)
+ {
+ if (!InRecovery)
+ log_smgrunlinkmark(&pending->relnode,
+ pending->unlink_forknum,
+ pending->unlink_mark);
+ smgrunlinkmark(srel, pending->unlink_forknum,
+ pending->unlink_mark, InRecovery);
+ }
+
+ if (pending->op & PDOP_SET_PERSISTENCE)
+ SetRelationBuffersPersistence(srel, pending->bufpersistence,
+ InRecovery);
}
if (nrels > 0)
{
smgrdounlinkall(srels, nrels, false);
-
- for (int i = 0; i < nrels; i++)
- smgrclose(srels[i]);
-
pfree(srels);
}
+
+ if (close_srels)
+ {
+ srelhash_iterator i;
+ SRelHashEntry *ent;
+
+ /* close smgr relations */
+ srelhash_start_iterate(close_srels, &i);
+ while ((ent = srelhash_iterate(close_srels, &i)) != NULL)
+ smgrclose(ent->srel);
+ srelhash_destroy(close_srels);
+ }
}
/*
@@ -840,7 +1224,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
- && pending->backend == InvalidBackendId)
+ && pending->backend == InvalidBackendId
+ && pending->op == PDOP_DELETE)
nrels++;
}
if (nrels == 0)
@@ -853,7 +1238,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
- && pending->backend == InvalidBackendId)
+ && pending->backend == InvalidBackendId &&
+ pending->op == PDOP_DELETE)
{
*rptr = pending->relnode;
rptr++;
@@ -933,6 +1319,15 @@ smgr_redo(XLogReaderState *record)
reln = smgropen(xlrec->rnode, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ smgrunlink(reln, xlrec->forkNum, true);
+ smgrclose(reln);
+ }
else if (info == XLOG_SMGR_TRUNCATE)
{
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
@@ -1021,6 +1416,65 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingRelDelete *pending;
+ bool created = false;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ smgrcreatemark(reln, xlrec->forkNum, xlrec->mark, true);
+ created = true;
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ smgrunlinkmark(reln, xlrec->forkNum, xlrec->mark, true);
+ break;
+ default:
+ elog(ERROR, "unknown smgr_mark action %d", (int) xlrec->action);
+ }
+
+ if (created)
+ {
+ /* revert mark file operation at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = xlrec->rnode;
+ pending->op = PDOP_UNLINK_MARK;
+ pending->unlink_forknum = xlrec->forkNum;
+ pending->unlink_mark = xlrec->mark;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+ }
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingRelDelete *pending;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = xlrec->rnode;
+ pending->op = PDOP_SET_PERSISTENCE;
+ pending->bufpersistence = !xlrec->persistence;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index bf42587e38..726a0484f9 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -52,6 +52,7 @@
#include "commands/defrem.h"
#include "commands/event_trigger.h"
#include "commands/policy.h"
+#include "commands/progress.h"
#include "commands/sequence.h"
#include "commands/tablecmds.h"
#include "commands/tablespace.h"
@@ -5330,6 +5331,168 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
}
/*
+ * RelationChangePersistence: do in-place persistence change of a relation
+ */
+static void
+RelationChangePersistence(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ int i;
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * If any of the following conditions held, we would need to call
+ * ATRewriteTable; that cannot happen in the AT_REWRITE_ALTER_PERSISTENCE case.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ RelationGetSmgr(rel);
+
+ /*
+ * First we collect all relations whose persistence we need to change.
+ */
+
+ /* Collect OIDs of indexes and toast relations */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ /* Add toast relation if any */
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ List *toastidx;
+ Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
+
+ RelationGetSmgr(toastrel);
+ relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
+ toastidx = RelationGetIndexList(toastrel);
+ relids = list_concat(relids, toastidx);
+ pfree(toastidx);
+ table_close(toastrel, NoLock);
+ }
+
+ table_close(rel, NoLock);
+
+ /* Make changes in storage */
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach (lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+
+ /*
+ * Some access methods do not accept in-place persistence change. For
+ * example, GiST uses page LSNs to figure out whether a block has
+ * changed, where UNLOGGED GiST indexes use fake LSNs that are
+ * incompatible with real LSNs used for LOGGED ones.
+ *
+ * XXXX: We don't bother allowing in-place persistence change for index
+ * methods other than btree for now.
+ */
+ if (r->rd_rel->relkind == RELKIND_INDEX &&
+ r->rd_rel->relam != BTREE_AM_OID)
+ {
+ int reindex_flags;
+
+ /* reindex doesn't allow concurrent use of the index */
+ table_close(r, NoLock);
+
+ reindex_flags =
+ REINDEX_REL_SUPPRESS_INDEX_USE |
+ REINDEX_REL_CHECK_CONSTRAINTS;
+
+ /* Set the same persistence as the parent relation. */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+
+ reindex_index(reloid, reindex_flags, persistence, 0);
+
+ continue;
+ }
+
+ /* Create or drop init fork */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ RelationCreateInitFork(r);
+ else
+ RelationDropInitFork(r);
+
+ /*
+ * When this relation gets WAL-logged, immediately sync all files except
+ * the init fork to establish the initial state on storage. Buffers have
+ * already been flushed out by RelationCreate(Drop)InitFork called just
+ * above. The init fork should have been synced as needed.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT)
+ {
+ for (i = 0 ; i < INIT_FORKNUM ; i++)
+ {
+ if (smgrexists(RelationGetSmgr(r), i))
+ smgrimmedsync(RelationGetSmgr(r), i);
+ }
+ }
+
+ /* Update catalog */
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+
+ /*
+ * While wal_level >= replica, switching to LOGGED requires the
+ * relation content to be WAL-logged to recover the table.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded())
+ {
+ ForkNumber fork;
+
+ for (fork = 0; fork < INIT_FORKNUM ; fork++)
+ {
+ if (smgrexists(RelationGetSmgr(r), fork))
+ log_newpage_range(r, fork,
+ 0, smgrnblocks(RelationGetSmgr(r), fork), false);
+ }
+ }
+
+ table_close(r, NoLock);
+ }
+
+ table_close(classRel, NoLock);
+}
+
+/*
* ATRewriteTables: ALTER TABLE phase 3
*/
static void
@@ -5474,32 +5637,55 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
* persistence. That wouldn't work for pg_class, but that can't be
* unlogged anyway.
*/
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
+
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
+ RelationChangePersistence(tab, persistence, lockmode);
+ else
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new relfilenode
+ * will have the right persistence set, and at the same time
+ * ensure that the original filenode's buffers will get read in
+ * with the correct setting (i.e. the original one). Otherwise
+ * a rollback after the rewrite would possibly result with
+ * buffers for the original filenode having the wrong
+ * persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that can't
+ * be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
persistence, lockmode);
+
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
-
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
-
- InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ /*
+ * Swap the physical files of the old and new heaps, then rebuild
+ * indexes and discard the old heap. We can use RecentXmin for
+ * the table's new relfrozenxid because we rewrote all the tuples
+ * in ATRewriteTable, so no older Xid remains in the table. Also,
+ * we never try to swap toast tables by content, since we have no
+ * interest in letting this code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
+
+ InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ }
+
}
else
{
@@ -14319,6 +14505,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
return new_tablespaceoid;
}
+/*
+ * Alter Table ALL ... SET LOGGED/UNLOGGED
+ *
+ * Allows a user to change persistence of all objects in a given tablespace in
+ * the current database. Objects can be chosen based on the owner of the
+ * object also, to allow users to change the persistence of only their objects.
+ * The main permissions handling is done by the lower-level change persistence
+ * function.
+ *
+ * All to-be-modified objects are locked first. If NOWAIT is specified and the
+ * lock can't be acquired then we ereport(ERROR).
+ */
+void
+AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt)
+{
+ List *relations = NIL;
+ ListCell *l;
+ ScanKeyData key[1];
+ Relation rel;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ Oid tablespaceoid;
+ List *role_oids = roleSpecsToIds(NIL);
+
+ /* Ensure we were not asked to change something we can't */
+ if (stmt->objtype != OBJECT_TABLE)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("only tables can be specified")));
+
+ /* Get the tablespace OID */
+ tablespaceoid = get_tablespace_oid(stmt->tablespacename, false);
+
+ /*
+ * Now that the checks are done, check if we should set it to InvalidOid
+ * because it is our database's default tablespace.
+ */
+ if (tablespaceoid == MyDatabaseTableSpace)
+ tablespaceoid = InvalidOid;
+
+ /*
+ * Walk the list of objects in the tablespace to pick them up. This will
+ * only find objects in our database, of course.
+ */
+ ScanKeyInit(&key[0],
+ Anum_pg_class_reltablespace,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(tablespaceoid));
+
+ rel = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(rel, 1, key);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relOid = relForm->oid;
+
+ /*
+ * Do not pick-up objects in pg_catalog as part of this, if an admin
+ * really wishes to do so, they can issue the individual ALTER
+ * commands directly.
+ *
+ * Also, explicitly avoid any shared tables, temp tables, or TOAST
+ * (TOAST will be changed with the main table).
+ */
+ if (IsCatalogNamespace(relForm->relnamespace) ||
+ relForm->relisshared ||
+ isAnyTempNamespace(relForm->relnamespace) ||
+ IsToastNamespace(relForm->relnamespace))
+ continue;
+
+ /* Only pick up the object type requested */
+ if (relForm->relkind != RELKIND_RELATION)
+ continue;
+
+ /* Check if we are only picking-up objects owned by certain roles */
+ if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner))
+ continue;
+
+ /*
+ * Handle permissions-checking here since we are locking the tables
+ * and also to avoid doing a bunch of work only to fail part-way. Note
+ * that permissions will also be checked by AlterTableInternal().
+ *
+ * Caller must be considered an owner on the table of which we're going
+ * to change persistence.
+ */
+ if (!pg_class_ownercheck(relOid, GetUserId()))
+ aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)),
+ NameStr(relForm->relname));
+
+ if (stmt->nowait &&
+ !ConditionalLockRelationOid(relOid, AccessExclusiveLock))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("aborting because lock on relation \"%s.%s\" is not available",
+ get_namespace_name(relForm->relnamespace),
+ NameStr(relForm->relname))));
+ else
+ LockRelationOid(relOid, AccessExclusiveLock);
+
+ /*
+ * Add to our list of objects of which we're going to change
+ * persistence.
+ */
+ relations = lappend_oid(relations, relOid);
+ }
+
+ table_endscan(scan);
+ table_close(rel, AccessShareLock);
+
+ if (relations == NIL)
+ ereport(NOTICE,
+ (errcode(ERRCODE_NO_DATA_FOUND),
+ errmsg("no matching relations in tablespace \"%s\" found",
+ tablespaceoid == InvalidOid ? "(database default)" :
+ get_tablespace_name(tablespaceoid))));
+
+ /*
+ * Everything is locked, loop through and change persistence of all of the
+ * relations.
+ */
+ foreach(l, relations)
+ {
+ List *cmds = NIL;
+ AlterTableCmd *cmd = makeNode(AlterTableCmd);
+
+ if (stmt->logged)
+ cmd->subtype = AT_SetLogged;
+ else
+ cmd->subtype = AT_SetUnLogged;
+
+ cmds = lappend(cmds, cmd);
+
+ EventTriggerAlterTableStart((Node *) stmt);
+ /* OID is set by AlterTableInternal */
+ AlterTableInternal(lfirst_oid(l), cmds, false);
+ EventTriggerAlterTableEnd();
+ }
+}
+
static void
index_copy_data(Relation rel, RelFileNode newrnode)
{
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index df0b747883..55e38cfe3f 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4269,6 +4269,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from)
return newnode;
}
+static AlterTableSetLoggedAllStmt *
+_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *from)
+{
+ AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt);
+
+ COPY_STRING_FIELD(tablespacename);
+ COPY_SCALAR_FIELD(objtype);
+ COPY_SCALAR_FIELD(logged);
+ COPY_SCALAR_FIELD(nowait);
+
+ return newnode;
+}
+
static CreateExtensionStmt *
_copyCreateExtensionStmt(const CreateExtensionStmt *from)
{
@@ -5622,6 +5635,9 @@ copyObjectImpl(const void *from)
case T_AlterTableMoveAllStmt:
retval = _copyAlterTableMoveAllStmt(from);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _copyAlterTableSetLoggedAllStmt(from);
+ break;
case T_CreateExtensionStmt:
retval = _copyCreateExtensionStmt(from);
break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index cb7ddd463c..a19b7874d7 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1917,6 +1917,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a,
}
static bool
+_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *a,
+ const AlterTableSetLoggedAllStmt *b)
+{
+ COMPARE_STRING_FIELD(tablespacename);
+ COMPARE_SCALAR_FIELD(objtype);
+ COMPARE_SCALAR_FIELD(logged);
+ COMPARE_SCALAR_FIELD(nowait);
+
+ return true;
+}
+
+static bool
_equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b)
{
COMPARE_STRING_FIELD(extname);
@@ -3625,6 +3637,9 @@ equal(const void *a, const void *b)
case T_AlterTableMoveAllStmt:
retval = _equalAlterTableMoveAllStmt(a, b);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _equalAlterTableSetLoggedAllStmt(a, b);
+ break;
case T_CreateExtensionStmt:
retval = _equalCreateExtensionStmt(a, b);
break;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 3d4dd43e47..9823d57a54 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -1984,6 +1984,26 @@ AlterTableStmt:
n->nowait = $13;
$$ = (Node *)n;
}
+ | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = true;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = false;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
| ALTER INDEX qualified_name alter_table_cmds
{
AlterTableStmt *n = makeNode(AlterTableStmt);
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index ec0485705d..45e1a5d817 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1070,6 +1070,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
bool excludeFound;
ForkNumber relForkNum; /* Type of fork if file is a relation */
int relOidChars; /* Chars in filename that are the rel oid */
+ StorageMarks mark;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1120,7 +1121,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
/* Exclude all forks for unlogged tables except the init fork */
if (isDbDir &&
parse_filename_for_nontemp_relation(de->d_name, &relOidChars,
- &relForkNum))
+ &relForkNum, &mark))
{
/* Never exclude init forks */
if (relForkNum != INIT_FORKNUM)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index b4532948d3..dab74bf99a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -37,6 +37,7 @@
#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
#include "executor/instrument.h"
#include "lib/binaryheap.h"
#include "miscadmin.h"
@@ -3155,6 +3156,93 @@ DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
}
/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a relation
+ * and, when switching to PERMANENT, writes all dirty pages of the relation
+ * out to disk (or more accurately, out to kernel disk buffers),
+ * ensuring that the kernel has an up-to-date view of the relation.
+ *
+ * Generally, the caller should be holding AccessExclusiveLock on the
+ * target relation to ensure that no other backend is busy dirtying
+ * more blocks of the relation; the effects can't be expected to last
+ * after the lock is released.
+ *
+ * XXX currently it sequentially searches the buffer pool, should be
+ * changed to more clever ways of searching. This routine is not
+ * used in any performance-critical code paths, so it's not worth
+ * adding additional overhead to normal paths to make it go faster;
+ * but see also DropRelFileNodeBuffers.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
+{
+ int i;
+ RelFileNodeBackend rnode = srel->smgr_rnode;
+
+ Assert (!RelFileNodeBackendIsTemp(rnode));
+
+ if (!isRedo)
+ log_smgrbufpersistence(&srel->smgr_rnode.node, permanent);
+
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ {
+ UnlockBufHdr(bufHdr, buf_state);
+ continue;
+ }
+
+ if (permanent)
+ {
+ /* Init fork is being dropped, drop buffers for it. */
+ if (bufHdr->tag.forkNum == INIT_FORKNUM)
+ {
+ InvalidateBuffer(bufHdr);
+ continue;
+ }
+
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /* we flush this buffer when switching to PERMANENT */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, srel);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr, true);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ /* init fork is always BM_PERMANENT. See BufferAlloc */
+ if (bufHdr->tag.forkNum != INIT_FORKNUM)
+ buf_state &= ~BM_PERMANENT;
+
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ }
+}
+
+/* ---------------------------------------------------------------------
* DropRelFileNodesAllBuffers
*
* This function removes from the buffer pool all the pages of all
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 263057841d..8487ae1f02 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -349,8 +349,6 @@ static void pre_sync_fname(const char *fname, bool isdir, int elevel);
static void datadir_fsync_fname(const char *fname, bool isdir, int elevel);
static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel);
-static int fsync_parent_path(const char *fname, int elevel);
-
/*
* pg_fsync --- do fsync with or without writethrough
@@ -3759,7 +3757,7 @@ fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel)
* This is aimed at making file operations persistent on disk in case of
* an OS crash or power failure.
*/
-static int
+int
fsync_parent_path(const char *fname, int elevel)
{
char parentpath[MAXPGPATH];
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 0ae3fb6902..a34aa8e9af 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -16,29 +16,49 @@
#include <unistd.h>
+#include "access/xlog.h"
+#include "catalog/pg_tablespace_d.h"
#include "common/relpath.h"
#include "postmaster/startup.h"
+#include "storage/bufmgr.h"
#include "storage/copydir.h"
#include "storage/fd.h"
+#include "storage/md.h"
#include "storage/reinit.h"
+#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
- int op);
+ Oid tspid, int op);
static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
- int op);
+ Oid tspid, Oid dbid, int op);
typedef struct
{
Oid reloid; /* hash key */
-} unlogged_relation_entry;
+ bool has_init; /* has INIT fork */
+ bool dirty_init; /* needs to remove INIT fork */
+ bool dirty_all; /* needs to remove all forks */
+} relfile_entry;
/*
- * Reset unlogged relations from before the last restart.
+ * Clean up and reset relation files from before the last restart.
*
- * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any
- * relation with an "init" fork, except for the "init" fork itself.
+ * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations
+ * depending on the existence of the "cleanup" forks.
+ *
+ * If an SMGR_MARK_UNCOMMITTED mark file for the init fork is present, we
+ * remove the init fork along with the mark file.
+ *
+ * If an SMGR_MARK_UNCOMMITTED mark file for the main fork is present, we
+ * remove the whole relation along with the mark file.
+ *
+ * Otherwise, if the "init" fork is found, we remove all forks of any relation
+ * with the "init" fork, except for the "init" fork itself.
+ *
+ * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all
+ * relations that have the "cleanup" and/or the "init" forks.
*
* If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main
* fork.
@@ -72,7 +92,7 @@ ResetUnloggedRelations(int op)
/*
* First process unlogged files in pg_default ($PGDATA/base)
*/
- ResetUnloggedRelationsInTablespaceDir("base", op);
+ ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op);
/*
* Cycle through directories for all non-default tablespaces.
@@ -81,13 +101,19 @@ ResetUnloggedRelations(int op)
while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
{
+ Oid tspid;
+
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
continue;
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
- ResetUnloggedRelationsInTablespaceDir(temp_path, op);
+
+ tspid = atooid(spc_de->d_name);
+ Assert(tspid != 0);
+
+ ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op);
}
FreeDir(spc_dir);
@@ -103,7 +129,8 @@ ResetUnloggedRelations(int op)
* Process one tablespace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
+ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
+ Oid tspid, int op)
{
DIR *ts_dir;
struct dirent *de;
@@ -130,6 +157,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
while ((de = ReadDir(ts_dir, tsdirname)) != NULL)
{
+ Oid dbid;
+
/*
* We're only interested in the per-database directories, which have
* numeric names. Note that this code will also (properly) ignore "."
@@ -148,7 +177,9 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
ereport_startup_progress("resetting unlogged relations (cleanup), elapsed time: %ld.%02d s, current path: %s",
dbspace_path);
- ResetUnloggedRelationsInDbspaceDir(dbspace_path, op);
+ dbid = atooid(de->d_name);
+ Assert(dbid != 0);
+ ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op);
}
FreeDir(ts_dir);
@@ -158,125 +189,228 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
* Process one per-dbspace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
+ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
+ Oid tspid, Oid dbid, int op)
{
DIR *dbspace_dir;
struct dirent *de;
char rm_path[MAXPGPATH * 2];
+ HTAB *hash;
+ HASHCTL ctl;
/* Caller must specify at least one operation. */
- Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0);
+ Assert((op & (UNLOGGED_RELATION_CLEANUP |
+ UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_INIT)) != 0);
/*
* Cleanup is a two-pass operation. First, we go through and identify all
* the files with init forks. Then, we go through again and nuke
* everything with the same OID except the init fork.
*/
- if ((op & UNLOGGED_RELATION_CLEANUP) != 0)
+
+ /*
+ * It's possible that someone could create a ton of unlogged relations
+ * in the same database & tablespace, so we'd better use a hash table
+ * rather than an array or linked list to keep track of which files
+ * need to be reset. Otherwise, this cleanup operation would be
+ * O(n^2).
+ */
+ memset(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(relfile_entry);
+ hash = hash_create("relfilenode cleanup hash",
+ 32, &ctl, HASH_ELEM | HASH_BLOBS);
+
+ /* Collect INIT fork and mark files in the directory. */
+ dbspace_dir = AllocateDir(dbspacedirname);
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
- HTAB *hash;
- HASHCTL ctl;
+ int oidchars;
+ ForkNumber forkNum;
+ StorageMarks mark;
- /*
- * It's possible that someone could create a ton of unlogged relations
- * in the same database & tablespace, so we'd better use a hash table
- * rather than an array or linked list to keep track of which files
- * need to be reset. Otherwise, this cleanup operation would be
- * O(n^2).
- */
- ctl.keysize = sizeof(Oid);
- ctl.entrysize = sizeof(unlogged_relation_entry);
- ctl.hcxt = CurrentMemoryContext;
- hash = hash_create("unlogged relation OIDs", 32, &ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ /* Skip anything that doesn't look like a relation data file. */
+ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
+ &forkNum, &mark))
+ continue;
- /* Scan the directory. */
- dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
+ if (forkNum == INIT_FORKNUM || mark == SMGR_MARK_UNCOMMITTED)
{
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
-
- /* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* Also skip it unless this is the init fork. */
- if (forkNum != INIT_FORKNUM)
- continue;
+ Oid key;
+ relfile_entry *ent;
+ bool found;
/*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
+ * Record the relfilenode information. If it has
+ * SMGR_MARK_UNCOMMITTED mark files, the relfilenode is in dirty
+ * state, where clean up is needed.
*/
- ent.reloid = atooid(de->d_name);
- (void) hash_search(hash, &ent, HASH_ENTER, NULL);
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_ENTER, &found);
+
+ if (!found)
+ {
+ ent->has_init = false;
+ ent->dirty_init = false;
+ ent->dirty_all = false;
+ }
+
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_init = true;
+ else if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_all = true;
+ else
+ {
+ Assert(forkNum == INIT_FORKNUM);
+ ent->has_init = true;
+ }
}
+ }
- /* Done with the first pass. */
- FreeDir(dbspace_dir);
+ /* Done with the first pass. */
+ FreeDir(dbspace_dir);
+
+ /* nothing to do if we have neither init nor cleanup forks */
+ if (hash_get_num_entries(hash) < 1)
+ {
+ hash_destroy(hash);
+ return;
+ }
+ if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0)
+ {
/*
- * If we didn't find any init forks, there's no point in continuing;
- * we can bail out now.
+ * When we come here after recovery, an smgr object for this file might
+ * have been created. In that case we need to drop all buffers and then
+ * the smgr object before initializing the unlogged relation. This is safe
+ * as long as no other backends have accessed the relation before
+ * starting archive recovery.
*/
- if (hash_get_num_entries(hash) == 0)
+ HASH_SEQ_STATUS status;
+ relfile_entry *ent;
+ SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8);
+ int maxrels = 8;
+ int nrels = 0;
+ int i;
+
+ Assert(!HotStandbyActive());
+
+ hash_seq_init(&status, hash);
+ while ((ent = (relfile_entry *) hash_seq_search(&status)) != NULL)
{
- hash_destroy(hash);
- return;
+ RelFileNodeBackend rel;
+
+ /*
+ * The relation is persistent and remains persistent. Don't
+ * drop the buffers for this relation.
+ */
+ if (ent->has_init && ent->dirty_init)
+ continue;
+
+ if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = ent->reloid;
+
+ srels[nrels++] = smgropen(rel.node, InvalidBackendId);
}
- /*
- * Now, make a second pass and remove anything that matches.
- */
+ DropRelFileNodesAllBuffers(srels, nrels);
+
+ for (i = 0; i < nrels; i++)
+ smgrclose(srels[i]);
+
+ /* the array was palloc'ed above and is no longer needed */
+ pfree(srels);
+ }
+
+ /*
+ * Now, make a second pass and remove anything that matches.
+ */
+ if ((op & UNLOGGED_RELATION_CLEANUP) != 0)
+ {
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ int oidchars;
+ Oid key;
+ relfile_entry *ent;
+ RelFileNodeBackend rel;
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* We never remove the init fork. */
- if (forkNum == INIT_FORKNUM)
+ &forkNum, &mark))
continue;
/*
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
- if (hash_search(hash, &ent, HASH_FIND, NULL))
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);
+
+ if (!ent)
+ continue;
+
+ if (!ent->dirty_all)
{
- snprintf(rm_path, sizeof(rm_path), "%s/%s",
- dbspacedirname, de->d_name);
- if (unlink(rm_path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m",
- rm_path)));
+ /* clean permanent relations don't need cleanup */
+ if (!ent->has_init)
+ continue;
+
+ if (ent->dirty_init)
+ {
+ /*
+ * The crashed transaction did SET UNLOGGED. This relation
+ * is restored to a LOGGED relation.
+ */
+ if (forkNum != INIT_FORKNUM)
+ continue;
+ }
else
- elog(DEBUG2, "unlinked file \"%s\"", rm_path);
+ {
+ /*
+ * we don't remove the INIT fork of a non-dirty
+ * relfilenode
+ */
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE)
+ continue;
+ }
}
+
+ /* so, nuke it! */
+ snprintf(rm_path, sizeof(rm_path), "%s/%s",
+ dbspacedirname, de->d_name);
+ if (unlink(rm_path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ rm_path)));
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = atooid(de->d_name);
+
+ ForgetRelationForkSyncRequests(rel, forkNum);
}
/* Cleanup is complete. */
FreeDir(dbspace_dir);
- hash_destroy(hash);
}
+ hash_destroy(hash);
+ hash = NULL;
+
/*
* Initialization happens after cleanup is complete: we copy each init
- * fork file to the corresponding main fork file. Note that if we are
- * asked to do both cleanup and init, we may never get here: if the
- * cleanup code determines that there are no init forks in this dbspace,
- * it will return before we get to this point.
+ * fork file to the corresponding main fork file.
*/
if ((op & UNLOGGED_RELATION_INIT) != 0)
{
@@ -285,6 +419,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char srcpath[MAXPGPATH * 2];
@@ -292,9 +427,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -328,15 +465,18 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char mainpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -379,7 +519,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
*/
bool
parse_filename_for_nontemp_relation(const char *name, int *oidchars,
- ForkNumber *fork)
+ ForkNumber *fork, StorageMarks *mark)
{
int pos;
@@ -410,11 +550,19 @@ parse_filename_for_nontemp_relation(const char *name, int *oidchars,
for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar)
;
- if (segchar <= 1)
- return false;
- pos += segchar;
+ if (segchar > 1)
+ pos += segchar;
}
+ /* mark file? */
+ if (name[pos] == '.' && name[pos + 1] != 0)
+ {
+ *mark = name[pos + 1];
+ pos += 2;
+ }
+ else
+ *mark = SMGR_MARK_NONE;
+
/* Now we should be at the end. */
if (name[pos] != '\0')
return false;
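
The second-pass cleanup rules in ResetUnloggedRelationsInDbspaceDir() above are easier to check when written as a single predicate. What follows is only a sketch of that decision, not code from the patch: the sk_* names are hypothetical stand-ins for relfile_entry, ForkNumber and StorageMarks, and the 'u' value for the uncommitted mark is an assumption, since its definition does not appear in this part of the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the patch's types; names are illustrative. */
typedef enum { SK_MAIN_FORK, SK_FSM_FORK, SK_VM_FORK, SK_INIT_FORK } sk_fork;

/* Assumption: the uncommitted mark is spelled 'u'; not shown in this hunk. */
typedef enum { SK_MARK_NONE = 0, SK_MARK_UNCOMMITTED = 'u' } sk_mark;

typedef struct
{
    bool has_init;   /* a plain init fork file was seen */
    bool dirty_init; /* the init fork carries an uncommitted mark file */
    bool dirty_all;  /* the main fork carries an uncommitted mark file */
} sk_entry;

/*
 * Returns true if the file identified by (fork, mark) should be unlinked
 * during cleanup, mirroring the nested conditions of the second pass.
 */
static bool
sk_should_remove(const sk_entry *ent, sk_fork fork, sk_mark mark)
{
    assert(ent != NULL);

    /* The whole relation was created by a crashed transaction. */
    if (ent->dirty_all)
        return true;

    /* Clean permanent relation: nothing to remove. */
    if (!ent->has_init)
        return false;

    /* Crashed SET UNLOGGED: drop only the init fork and its mark file. */
    if (ent->dirty_init)
        return fork == SK_INIT_FORK;

    /* Ordinary unlogged relation: keep only the unmarked init fork. */
    return !(fork == SK_INIT_FORK && mark == SK_MARK_NONE);
}
```

Under these assumptions, an ordinary unlogged relation loses every file except its init fork, while a relation caught mid-way through SET UNLOGGED loses only the init fork and goes back to being logged.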
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index b4bca7eed6..27ca9c1ca2 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -139,7 +139,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
BlockNumber blkno, bool skipFsync, int behavior);
static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
MdfdVec *seg);
-
+static bool mdmarkexists(SMgrRelation reln, ForkNumber forkNum,
+ StorageMarks mark);
/*
* mdinit() -- Initialize private state for magnetic disk storage manager.
@@ -170,6 +171,80 @@ mdexists(SMgrRelation reln, ForkNumber forkNum)
}
/*
+ * mdcreatemark() -- Create a mark file.
+ *
+ * If isRedo is true, it's okay for the file to exist already.
+ */
+void
+mdcreatemark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+ int fd;
+
+ /* See mdcreate for details. */
+ TablespaceCreateDbspace(reln->smgr_rnode.node.spcNode,
+ reln->smgr_rnode.node.dbNode,
+ isRedo);
+
+ fd = BasicOpenFile(path, O_WRONLY | O_CREAT | O_EXCL);
+ if (fd < 0 && (!isRedo || errno != EEXIST))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create mark file \"%s\": %m", path)));
+
+ /* fd is negative here only if the file already existed during redo */
+ if (fd >= 0)
+ {
+ pg_fsync(fd);
+ close(fd);
+ }
+
+ /*
+ * To guarantee that the creation of the file is persistent, fsync its
+ * parent directory.
+ */
+ fsync_parent_path(path, ERROR);
+
+ /* path is still used by fsync_parent_path, so free it last */
+ pfree(path);
+}
+
+
+/*
+ * mdunlinkmark() -- Delete the mark file
+ *
+ * If isRedo is true, it's okay if the file is not found.
+ */
+void
+mdunlinkmark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+
+ if (!isRedo || mdmarkexists(reln, forkNum, mark))
+ durable_unlink(path, ERROR);
+
+ pfree(path);
+}
+
+/*
+ * mdmarkexists() -- Check if the file exists.
+ */
+static bool
+mdmarkexists(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+ int fd;
+
+ fd = BasicOpenFile(path, O_RDONLY);
+ if (fd < 0 && errno != ENOENT)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not access mark file \"%s\": %m", path)));
+ pfree(path);
+
+ if (fd < 0)
+ return false;
+
+ /* the descriptor was only needed for the existence check */
+ close(fd);
+ return true;
+}
+
+/*
* mdcreate() -- Create a new relation on magnetic disk.
*
* If isRedo is true, it's okay for the relation to exist already.
@@ -1026,6 +1101,15 @@ register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum,
}
/*
+ * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork
+ */
+void
+ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, ForkNumber forknum)
+{
+ register_forget_request(rnode, forknum, 0);
+}
+
+/*
* ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB
*/
void
@@ -1378,12 +1462,14 @@ mdsyncfiletag(const FileTag *ftag, char *path)
* Return 0 on success, -1 on failure, with errno set.
*/
int
-mdunlinkfiletag(const FileTag *ftag, char *path)
+mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark)
{
char *p;
/* Compute the path. */
- p = relpathperm(ftag->rnode, MAIN_FORKNUM);
+ p = GetRelationPath(ftag->rnode.dbNode, ftag->rnode.spcNode,
+ ftag->rnode.relNode, InvalidBackendId, MAIN_FORKNUM,
+ mark);
strlcpy(path, p, MAXPGPATH);
pfree(p);
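
For reference, the durability sequence that mdcreatemark() aims for (create the file exclusively, fsync it, then fsync the parent directory) can be sketched with plain POSIX calls. This is an illustration under stated assumptions, not the patch's code: sk_create_mark() and its dir/name split are hypothetical, and it returns an error code where the backend would ereport().

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: durably create the empty file dir/name.
 * With is_redo set, an already existing file is not an error.
 * Returns 0 on success, -1 on failure.
 */
static int
sk_create_mark(const char *dir, const char *name, bool is_redo)
{
    char path[1024];
    int  fd;
    int  dirfd;
    int  rc;

    if (snprintf(path, sizeof(path), "%s/%s", dir, name) >= (int) sizeof(path))
    {
        errno = ENAMETOOLONG;
        return -1;
    }

    fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
    {
        /* Only replay may find the file already in place. */
        if (!is_redo || errno != EEXIST)
            return -1;
    }
    else
    {
        /* Flush the new file itself before touching the directory. */
        rc = fsync(fd);
        close(fd);
        if (rc < 0)
            return -1;
    }

    /* The directory entry is durable only once the directory is synced. */
    dirfd = open(dir, O_RDONLY);
    if (dirfd < 0)
        return -1;
    rc = fsync(dirfd);
    close(dirfd);
    return rc;
}
```

The point of the ordering is that the path string and the file descriptor are each used only while they are still valid, and that the directory sync happens whether or not the file was newly created.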
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 0fcef4994b..110e64b0b2 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -62,6 +62,10 @@ typedef struct f_smgr
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+ void (*smgr_createmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+ void (*smgr_unlinkmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
} f_smgr;
static const f_smgr smgrsw[] = {
@@ -82,6 +86,8 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
+ .smgr_createmark = mdcreatemark,
+ .smgr_unlinkmark = mdunlinkmark,
}
};
@@ -336,6 +342,26 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
}
/*
+ * smgrcreatemark() -- Create a mark file
+ */
+void
+smgrcreatemark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_createmark(reln, forknum, mark, isRedo);
+}
+
+/*
+ * smgrunlinkmark() -- Delete a mark file
+ */
+void
+smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlinkmark(reln, forknum, mark, isRedo);
+}
+
+/*
* smgrdosyncall() -- Immediately sync all forks of all given relations
*
* All forks of all given relations are synced out to the store.
@@ -662,6 +688,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+/*
+ * smgrunlink() -- Unlink the storage of the given fork of a relation
+ */
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index d4083e8a56..9563940d45 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -89,7 +89,8 @@ static CycleCtr checkpoint_cycle_ctr = 0;
typedef struct SyncOps
{
int (*sync_syncfiletag) (const FileTag *ftag, char *path);
- int (*sync_unlinkfiletag) (const FileTag *ftag, char *path);
+ int (*sync_unlinkfiletag) (const FileTag *ftag, char *path,
+ StorageMarks mark);
bool (*sync_filetagmatches) (const FileTag *ftag,
const FileTag *candidate);
} SyncOps;
@@ -222,7 +223,8 @@ SyncPostCheckpoint(void)
/* Unlink the file */
if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
- path) < 0)
+ path,
+ SMGR_MARK_NONE) < 0)
{
/*
* There's a race condition, when the database is dropped at the
@@ -236,6 +238,20 @@ SyncPostCheckpoint(void)
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m", path)));
}
+ else if (syncsw[entry->tag.handler].sync_unlinkfiletag(
+ &entry->tag, path,
+ SMGR_MARK_UNCOMMITTED) < 0)
+ {
+ /*
+ * We may also have an SMGR_MARK_UNCOMMITTED file. Remove it if the
+ * fork files have been successfully removed. It's ok if the file
+ * does not exist.
+ */
+ if (errno != ENOENT)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
/* Mark the list entry as canceled, just in case */
entry->canceled = true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 1fbc387d47..1483f9a475 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -162,6 +162,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_AlterTSConfigurationStmt:
case T_AlterTSDictionaryStmt:
case T_AlterTableMoveAllStmt:
+ case T_AlterTableSetLoggedAllStmt:
case T_AlterTableSpaceOptionsStmt:
case T_AlterTableStmt:
case T_AlterTypeStmt:
@@ -1747,6 +1748,12 @@ ProcessUtilitySlow(ParseState *pstate,
commandCollected = true;
break;
+ case T_AlterTableSetLoggedAllStmt:
+ AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree);
+ /* commands are stashed in AlterTableSetLoggedAll */
+ commandCollected = true;
+ break;
+
case T_DropStmt:
ExecDropStmt((DropStmt *) parsetree, isTopLevel);
/* no commands stashed for DROP */
@@ -2669,6 +2676,10 @@ CreateCommandTag(Node *parsetree)
tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype);
+ break;
+
case T_AlterTableStmt:
tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype);
break;
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 436df54120..dbc0da5da5 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -407,6 +407,30 @@ extractPageInfo(XLogReaderState *record)
* source system.
*/
}
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_UNLINK)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_MARK)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
else if (rmid == RM_XACT_ID &&
((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT ||
(rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED ||
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 1f5c426ec0..67f24890d6 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -139,9 +139,15 @@ GetDatabasePath(Oid dbNode, Oid spcNode)
*/
char *
GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
- int backendId, ForkNumber forkNumber)
+ int backendId, ForkNumber forkNumber, char mark)
{
char *path;
+ char markstr[10];
+
+ if (mark == 0)
+ markstr[0] = 0;
+ else
+ snprintf(markstr, sizeof(markstr), ".%c", mark);
if (spcNode == GLOBALTABLESPACE_OID)
{
@@ -149,10 +155,10 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
Assert(dbNode == 0);
Assert(backendId == InvalidBackendId);
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("global/%u_%s",
- relNode, forkNames[forkNumber]);
+ path = psprintf("global/%u_%s%s",
+ relNode, forkNames[forkNumber], markstr);
else
- path = psprintf("global/%u", relNode);
+ path = psprintf("global/%u%s", relNode, markstr);
}
else if (spcNode == DEFAULTTABLESPACE_OID)
{
@@ -160,22 +166,22 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/%u_%s",
+ path = psprintf("base/%u/%u_%s%s",
dbNode, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/%u",
- dbNode, relNode);
+ path = psprintf("base/%u/%u%s",
+ dbNode, relNode, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/t%d_%u_%s",
+ path = psprintf("base/%u/t%d_%u_%s%s",
dbNode, backendId, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/t%d_%u",
- dbNode, backendId, relNode);
+ path = psprintf("base/%u/t%d_%u%s",
+ dbNode, backendId, relNode, markstr);
}
}
else
@@ -184,27 +190,28 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u_%s%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
dbNode, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
- dbNode, relNode);
+ dbNode, relNode, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
dbNode, backendId, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
- dbNode, backendId, relNode);
+ dbNode, backendId, relNode, markstr);
}
}
+
return path;
}
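
The effect of the new mark argument on file names is easiest to see in isolation. The sketch below covers only the default-tablespace, non-temporary case of GetRelationPath(); sk_mark_path() is a hypothetical name, it writes into a caller-supplied buffer instead of using psprintf(), and the 'u' mark character in the usage note is an assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build "base/<db>/<rel>[_<fork>][.<mark>]" into buf.
 * fork_name is NULL for the main fork; mark == 0 means "no mark file".
 * Returns the snprintf() result.
 */
static int
sk_mark_path(char *buf, size_t buflen, unsigned db, unsigned rel,
             const char *fork_name, char mark)
{
    char markstr[3] = "";   /* ".", one mark character, terminator */

    assert(buf != NULL && buflen > 0);

    if (mark != 0)
        snprintf(markstr, sizeof(markstr), ".%c", mark);

    if (fork_name != NULL)
        return snprintf(buf, buflen, "base/%u/%u_%s%s",
                        db, rel, fork_name, markstr);

    return snprintf(buf, buflen, "base/%u/%u%s", db, rel, markstr);
}
```

With these assumptions, sk_mark_path(buf, sizeof(buf), 5, 16384, "init", 'u') yields "base/5/16384_init.u", which is the kind of name parse_filename_for_nontemp_relation() has to recognize as a fork file plus a mark suffix.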
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 0ab32b44e9..382623159c 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -23,6 +23,8 @@
extern int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
+extern void RelationCreateInitFork(Relation rel);
+extern void RelationDropInitFork(Relation rel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationPreTruncate(Relation rel);
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index f0814f1458..12346ed7f6 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -18,17 +18,23 @@
#include "lib/stringinfo.h"
#include "storage/block.h"
#include "storage/relfilenode.h"
+#include "storage/smgr.h"
/*
* Declarations for smgr-related XLOG records
*
- * Note: we log file creation and truncation here, but logging of deletion
- * actions is handled by xact.c, because it is part of transaction commit.
+ * Note: we log file creation, truncation and buffer persistence change here,
+ * but logging of deletion actions is handled mainly by xact.c, because it is
+ * part of transaction commit in most cases. However, there's a case where
+ * init forks are deleted outside the control of a transaction.
*/
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_UNLINK 0x30
+#define XLOG_SMGR_MARK 0x40
+#define XLOG_SMGR_BUFPERSISTENCE 0x50
typedef struct xl_smgr_create
{
@@ -36,6 +42,32 @@ typedef struct xl_smgr_create
ForkNumber forkNum;
} xl_smgr_create;
+typedef struct xl_smgr_unlink
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+} xl_smgr_unlink;
+
+typedef enum smgr_mark_action
+{
+ XLOG_SMGR_MARK_CREATE = 'c',
+ XLOG_SMGR_MARK_UNLINK = 'u'
+} smgr_mark_action;
+
+typedef struct xl_smgr_mark
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ smgr_mark_action action;
+} xl_smgr_mark;
+
+typedef struct xl_smgr_bufpersistence
+{
+ RelFileNode rnode;
+ bool persistence;
+} xl_smgr_bufpersistence;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +83,12 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark);
+extern void log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark);
+extern void log_smgrbufpersistence(const RelFileNode *rnode, bool persistence);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index 336549cc5f..714077ff4c 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
+extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt);
+
extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
Oid *oldschema);
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index a44be11ca0..106a5cf508 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -67,7 +67,7 @@ extern int forkname_chars(const char *str, ForkNumber *fork);
extern char *GetDatabasePath(Oid dbNode, Oid spcNode);
extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
- int backendId, ForkNumber forkNumber);
+ int backendId, ForkNumber forkNumber, char mark);
/*
* Wrapper macros for GetRelationPath. Beware of multiple
@@ -77,7 +77,7 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
/* First argument is a RelFileNode */
#define relpathbackend(rnode, backend, forknum) \
GetRelationPath((rnode).dbNode, (rnode).spcNode, (rnode).relNode, \
- backend, forknum)
+ backend, forknum, 0)
/* First argument is a RelFileNode */
#define relpathperm(rnode, forknum) \
@@ -87,4 +87,9 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
#define relpath(rnode, forknum) \
relpathbackend((rnode).node, (rnode).backend, forknum)
+/* First argument is a RelFileNodeBackend */
+#define markpath(rnode, forknum, mark) \
+ GetRelationPath((rnode).node.dbNode, (rnode).node.spcNode, \
+ (rnode).node.relNode, \
+ (rnode).backend, forknum, mark)
#endif /* RELPATH_H */
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 7c657c1241..8860b2e548 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -428,6 +428,7 @@ typedef enum NodeTag
T_AlterCollationStmt,
T_CallStmt,
T_AlterStatsStmt,
+ T_AlterTableSetLoggedAllStmt,
/*
* TAGS FOR PARSE TREE NODES (parsenodes.h)
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 4c5a8a39bf..c3e1bc66d1 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2350,6 +2350,15 @@ typedef struct AlterTableMoveAllStmt
bool nowait;
} AlterTableMoveAllStmt;
+typedef struct AlterTableSetLoggedAllStmt
+{
+ NodeTag type;
+ char *tablespacename;
+ ObjectType objtype; /* Object type to alter */
+ bool logged;
+ bool nowait;
+} AlterTableSetLoggedAllStmt;
+
/* ----------------------
* Create/Alter Extension Statements
* ----------------------
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index cfce23ecbc..f5a7df87a4 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -206,6 +206,8 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels)
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+ bool permanent, bool isRedo);
extern void DropRelFileNodesAllBuffers(struct SMgrRelationData **smgr_reln, int nnodes);
extern void DropDatabaseBuffers(Oid dbid);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 34602ae006..2dc0357ad5 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -185,6 +185,7 @@ extern ssize_t pg_pwritev_with_retry(int fd,
extern int pg_truncate(const char *path, off_t length);
extern void fsync_fname(const char *fname, bool isdir);
extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
+extern int fsync_parent_path(const char *fname, int elevel);
extern int durable_rename(const char *oldfile, const char *newfile, int loglevel);
extern int durable_unlink(const char *fname, int loglevel);
extern int durable_rename_excl(const char *oldfile, const char *newfile, int loglevel);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 752b440864..99620816b5 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -23,6 +23,10 @@
extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
extern void mdclose(SMgrRelation reln, ForkNumber forknum);
+extern void mdcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void mdunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
@@ -41,12 +45,14 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void ForgetRelationForkSyncRequests(RelFileNodeBackend rnode,
+ ForkNumber forknum);
extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
/* md sync callbacks */
extern int mdsyncfiletag(const FileTag *ftag, char *path);
-extern int mdunlinkfiletag(const FileTag *ftag, char *path);
+extern int mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark);
extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *candidate);
#endif /* MD_H */
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index fad1e5c473..e1f97e9b89 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -16,13 +16,15 @@
#define REINIT_H
#include "common/relpath.h"
-
+#include "storage/smgr.h"
extern void ResetUnloggedRelations(int op);
-extern bool parse_filename_for_nontemp_relation(const char *name,
- int *oidchars, ForkNumber *fork);
+extern bool parse_filename_for_nontemp_relation(const char *name, int *oidchars,
+ ForkNumber *fork,
+ StorageMarks *mark);
#define UNLOGGED_RELATION_CLEANUP 0x0001
-#define UNLOGGED_RELATION_INIT 0x0002
+#define UNLOGGED_RELATION_DROP_BUFFER 0x0002
+#define UNLOGGED_RELATION_INIT 0x0004
#endif /* REINIT_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a6fbf7b6a6..201ecace8a 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -19,6 +19,18 @@
#include "storage/relfilenode.h"
/*
+ * A storage mark is a file whose existence signals something about another
+ * file. Such a file is named "<filename>.<mark>", where the mark is one of
+ * the values of StorageMarks. Since ".<digit>" denotes a segment file, don't
+ * use digits for the mark character.
+ */
+typedef enum StorageMarks
+{
+ SMGR_MARK_NONE = 0,
+ SMGR_MARK_UNCOMMITTED = 'u' /* the file is not committed yet */
+} StorageMarks;
+
+/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
* by smgropen(), and destroyed by smgrclose(). Note that neither of these
@@ -85,7 +97,12 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
extern void smgrclose(SMgrRelation reln);
extern void smgrcloseall(void);
extern void smgrclosenode(RelFileNodeBackend rnode);
+extern void smgrcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void smgrunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
--
2.11.0
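The StorageMarks comment in the smgr.h hunk above names mark files "<filename>.<mark>" and rules out digits, because ".<digit>" already denotes a segment file. A minimal sketch of that naming rule in plain C might look as follows. It is not the patch's GetRelationPath() code: `build_mark_path` and `MARK_UNCOMMITTED` are hypothetical names used only for illustration, and the relation path in the usage note is made up.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the patch's SMGR_MARK_UNCOMMITTED ('u'). */
#define MARK_UNCOMMITTED 'u'

/*
 * Write "<filename>.<mark>" into buf.
 *
 * Returns 0 on success, or -1 if the mark is NUL, the mark is a digit
 * (digits are reserved for segment suffixes such as ".1"), or the result
 * does not fit into buf.
 */
static int
build_mark_path(char *buf, size_t buflen, const char *filename, char mark)
{
	int			n;

	assert(buf != NULL && filename != NULL);

	if (mark == '\0' || isdigit((unsigned char) mark))
		return -1;

	n = snprintf(buf, buflen, "%s.%c", filename, mark);
	if (n < 0 || (size_t) n >= buflen)
		return -1;

	return 0;
}
```

Calling `build_mark_path(buf, sizeof(buf), "base/5/16384", MARK_UNCOMMITTED)` fills `buf` with `base/5/16384.u`, while passing `'1'` as the mark is rejected so that the result can never be confused with a segment file.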
On Fri, Dec 17, 2021 at 09:10:30AM +0000, Jakub Wartak wrote:
I'm especially worried that I screwed up or forgot something related to the last one (rd->rd_smgr changes), but I'm getting "All 210 tests passed".
As the thread didn't get a lot of traction, I've registered it in the current commitfest https://commitfest.postgresql.org/36/3461/ with you as the author and in 'Ready for review' state.
I think it behaves like an almost finished patch, and after reading all the discussions about this feature, which go back more than 10 years, and the many failed efforts towards wal_level=noWAL, I think it would be nice to finally start getting some of it into core.
The patch is failing:
http://cfbot.cputube.org/kyotaro-horiguchi.html
https://api.cirrus-ci.com/v1/artifact/task/5564333871595520/regress_diffs/src/bin/pg_upgrade/tmp_check/regress/regression.diffs
I think you ran "make check", but should run something like this:
make check-world -j8 >check-world.log 2>&1 && echo Success
--
Justin
Justin wrote:
On Fri, Dec 17, 2021 at 09:10:30AM +0000, Jakub Wartak wrote:
As the thread didn't get a lot of traction, I've registered it in the current
commitfest https://commitfest.postgresql.org/36/3461/ with you as the author
and in 'Ready for review' state.
The patch is failing:
[..]
I think you ran "make check", but should run something like this:
make check-world -j8 >check-world.log 2>&1 && echo Success
Hi Justin,
I've repeated the check-world run and it reports that all is OK locally (also with --enable-cassert --enable-debug, at least on Amazon Linux 2), and installcheck with default parameters also seems to be OK.
I don't understand why the test farm reports errors for tests like "path, polygon, rowsecurity", e.g. on Linux/graviton2 and FreeBSD, but not on macOS(?).
Could someone point me to where to start looking/fixing?
-J.
On Fri, Dec 17, 2021 at 02:33:25PM +0000, Jakub Wartak wrote:
Justin wrote:
On Fri, Dec 17, 2021 at 09:10:30AM +0000, Jakub Wartak wrote:
As the thread didn't get a lot of traction, I've registered it in the current
commitfest https://commitfest.postgresql.org/36/3461/ with you as the author
and in 'Ready for review' state.
The patch is failing:
[..]
I think you ran "make check", but should run something like this:
make check-world -j8 >check-world.log 2>&1 && echo Success
Hi Justin,
I've repeated the check-world run and it reports that all is OK locally (also with --enable-cassert --enable-debug, at least on Amazon Linux 2), and installcheck with default parameters also seems to be OK.
I don't understand why the test farm reports errors for tests like "path, polygon, rowsecurity", e.g. on Linux/graviton2 and FreeBSD, but not on macOS(?).
Could someone point me to where to start looking/fixing?
Since it reports the error below, it looks a lot like a memory error such as a
use-after-free, perhaps in fsync_parent_path():
CREATE TABLE PATH_TBL (f1 path);
+ERROR: could not open file <....> Pacific": No such file or directory
I see at least this one is still failing, though:
time make -C src/test/recovery check
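An error message whose file name ends in unrelated text (here, part of a time zone name) is typical of a path string being read after its buffer was freed or reused, which is the kind of bug suspected above. One defensive pattern is to copy the parent directory into a caller-owned buffer before using it. The sketch below only illustrates that pattern and assumes nothing about how the patch's fsync_parent_path() is written; `parent_dir_copy` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the parent directory of 'path' into the caller-owned buffer 'out'.
 *
 * Returns 0 on success, or -1 if 'out' is too small.  A path without any
 * '/' yields ".", and a path directly under the root yields "/".  Because
 * the result lives in the caller's buffer, it stays valid even if the
 * memory behind 'path' is freed afterwards.
 */
static int
parent_dir_copy(const char *path, char *out, size_t outlen)
{
	const char *slash;
	size_t		len;

	assert(path != NULL && out != NULL);

	slash = strrchr(path, '/');
	if (slash == NULL)
	{
		if (outlen < 2)
			return -1;
		strcpy(out, ".");
		return 0;
	}

	/* Keep the leading "/" when the only slash is the first character. */
	len = (slash == path) ? 1 : (size_t) (slash - path);
	if (len + 1 > outlen)
		return -1;

	memcpy(out, path, len);
	out[len] = '\0';
	return 0;
}
```

For example, `parent_dir_copy("base/5/16384", buf, sizeof(buf))` leaves `base/5` in `buf`, which the caller can then pass to an fsync routine without depending on the lifetime of the original string.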
Attachments:
0001-In-place-table-persistence-change-with-new-command-A.patch (text/x-diff; charset=us-ascii)
From 676ecf794b2b0e98d8f31e4245f6f455da5e19cb Mon Sep 17 00:00:00 2001
From: Jakub Wartak <Jakub.Wartak@tomtom.com>
Date: Thu, 16 Dec 2021 12:03:42 +0000
Subject: [PATCH 1/2] In-place table persistence change with new command ALTER
TABLE ALL IN TABLESPACE SET LOGGED/UNLOGGED
Even though ALTER TABLE SET LOGGED/UNLOGGED does not require rewriting
data, it currently runs a heap rewrite, which causes a large amount of
file I/O. This patch makes the command run without a heap rewrite.
In addition, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of a massive number of HEAP_INSERTs, which should be
smaller.
This also allows cleaning up files left behind when the transaction
that created them crashes.
ALTER TABLE ALL IN TABLESPACE SET LOGGED/UNLOGGED: To ease invoking
ALTER TABLE SET LOGGED/UNLOGGED, this command changes the persistence
of all tables in the specified tablespace.
---
src/backend/access/rmgrdesc/smgrdesc.c | 52 +++
src/backend/access/transam/README | 8 +
src/backend/access/transam/xlog.c | 17 +
src/backend/catalog/storage.c | 518 +++++++++++++++++++++++--
src/backend/commands/tablecmds.c | 374 ++++++++++++++++--
src/backend/nodes/copyfuncs.c | 16 +
src/backend/nodes/equalfuncs.c | 15 +
src/backend/parser/gram.y | 20 +
src/backend/replication/basebackup.c | 3 +-
src/backend/storage/buffer/bufmgr.c | 88 +++++
src/backend/storage/file/fd.c | 4 +-
src/backend/storage/file/reinit.c | 318 +++++++++++----
src/backend/storage/smgr/md.c | 92 ++++-
src/backend/storage/smgr/smgr.c | 32 ++
src/backend/storage/sync/sync.c | 20 +-
src/backend/tcop/utility.c | 11 +
src/bin/pg_rewind/parsexlog.c | 24 ++
src/common/relpath.c | 47 ++-
src/include/catalog/storage.h | 2 +
src/include/catalog/storage_xlog.h | 42 +-
src/include/commands/tablecmds.h | 2 +
src/include/common/relpath.h | 9 +-
src/include/nodes/nodes.h | 1 +
src/include/nodes/parsenodes.h | 9 +
src/include/storage/bufmgr.h | 2 +
src/include/storage/fd.h | 1 +
src/include/storage/md.h | 8 +-
src/include/storage/reinit.h | 10 +-
src/include/storage/smgr.h | 17 +
29 files changed, 1583 insertions(+), 179 deletions(-)
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 7755553d57..d251f22207 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,49 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+ char *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+ appendStringInfoString(buf, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+ char *path = GetRelationPath(xlrec->rnode.dbNode,
+ xlrec->rnode.spcNode,
+ xlrec->rnode.relNode,
+ InvalidBackendId,
+ xlrec->forkNum, xlrec->mark);
+ char *action;
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ action = "CREATE";
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ action = "DELETE";
+ break;
+ default:
+ action = "<unknown action>";
+ break;
+ }
+
+ appendStringInfo(buf, "%s %s", action, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+ char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+ appendStringInfoString(buf, path);
+ appendStringInfo(buf, " persistence %d", xlrec->persistence);
+ pfree(path);
+ }
}
const char *
@@ -55,6 +98,15 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_UNLINK:
+ id = "UNLINK";
+ break;
+ case XLOG_SMGR_MARK:
+ id = "MARK";
+ break;
+ case XLOG_SMGR_BUFPERSISTENCE:
+ id = "BUFPERSISTENCE";
+ break;
}
return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c1..7cf77e4a02 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -724,6 +724,14 @@ we must panic and abort recovery. The DBA will have to manually clean up
then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.
+The Smgr MARK files
+--------------------------------
+
+An smgr mark file is created when a new relation file is created, to
+mark that the relfilenode needs to be cleaned up at recovery time. In
+contrast to 4 above, failure to remove smgr mark files will lead to
+data loss, in which case the server will shut down.
+
Skipping WAL for New RelFileNode
--------------------------------
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1e1fbe957f..59f4c2eacf 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -40,6 +40,7 @@
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
#include "catalog/pg_database.h"
+#include "catalog/storage.h"
#include "commands/progress.h"
#include "commands/tablespace.h"
#include "common/controldata_utils.h"
@@ -4564,6 +4565,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
{
ereport(DEBUG1,
(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+
+ /* cleanup garbage files left during crash recovery */
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
InArchiveRecovery = true;
if (StandbyModeRequested)
StandbyMode = true;
@@ -7824,6 +7833,14 @@ StartupXLOG(void)
}
}
+ /* cleanup garbage files left during crash recovery */
+ if (!InArchiveRecovery)
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
/* Allow resource managers to do any required cleanup. */
for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
{
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index c5ad28d71f..a3e250515c 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,6 +19,7 @@
#include "postgres.h"
+#include "access/amapi.h"
#include "access/parallel.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -27,6 +28,7 @@
#include "access/xlogutils.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+#include "common/hashfn.h"
#include "miscadmin.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
@@ -57,9 +59,18 @@ int wal_skip_threshold = 2048; /* in kilobytes */
* but I'm being paranoid.
*/
+#define PDOP_DELETE (1 << 0)
+#define PDOP_UNLINK_FORK (1 << 1)
+#define PDOP_UNLINK_MARK (1 << 2)
+#define PDOP_SET_PERSISTENCE (1 << 3)
+
typedef struct PendingRelDelete
{
RelFileNode relnode; /* relation that may need to be deleted */
+ int op; /* operation mask */
+ bool bufpersistence; /* buffer persistence to set */
+ int unlink_forknum; /* forknum to unlink */
+ StorageMarks unlink_mark; /* mark to unlink */
BackendId backend; /* InvalidBackendId if not a temp rel */
bool atCommit; /* T=delete at commit; F=delete at abort */
int nestLevel; /* xact nesting level of request */
@@ -75,6 +86,24 @@ typedef struct PendingRelSync
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
HTAB *pendingSyncHash = NULL;
+typedef struct SRelHashEntry
+{
+ SMgrRelation srel;
+ char status; /* for simplehash use */
+} SRelHashEntry;
+
+/* define hashtable for workarea for pending deletes */
+#define SH_PREFIX srelhash
+#define SH_ELEMENT_TYPE SRelHashEntry
+#define SH_KEY_TYPE SMgrRelation
+#define SH_KEY srel
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((unsigned char *)&key, sizeof(SMgrRelation))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(SMgrRelation)) == 0)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
/*
* AddPendingSync
@@ -143,22 +172,48 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return NULL; /* placate compiler */
}
+ /*
+ * We are going to create a new storage file. If the server crashes before
+ * the current transaction ends, the file needs to be cleaned up, but nothing
+ * else identifies such orphan files. The SMGR_MARK_UNCOMMITTED mark file
+ * serves as the signal of that situation.
+ */
srel = smgropen(rnode, backend);
+ log_smgrcreatemark(&rnode, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
smgrcreate(srel, MAIN_FORKNUM, false);
if (needs_wal)
log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM);
- /* Add the relation to the list of stuff to delete at abort */
+ /*
+ * Add the relation to the list of stuff to delete at abort. We don't
+ * remove the mark file at commit. It needs to persist until the main fork
+ * file is actually deleted. See SyncPostCheckpoint.
+ */
pending = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
pending->relnode = rnode;
+ pending->op = PDOP_DELETE;
pending->backend = backend;
pending->atCommit = false; /* delete if abort */
pending->nestLevel = GetCurrentTransactionNestLevel();
pending->next = pendingDeletes;
pendingDeletes = pending;
+ /* drop cleanup fork at commit */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_MARK;
+ pending->unlink_forknum = MAIN_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->backend = backend;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+
if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
{
Assert(backend == InvalidBackendId);
@@ -168,6 +223,207 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return srel;
}
+/*
+ * RelationCreateInitFork
+ * Create physical storage for the init fork of a relation.
+ *
+ * Create the init fork for the relation.
+ *
+ * This function is transactional. The creation is WAL-logged, and if the
+ * transaction aborts later on, the init fork will be removed.
+ */
+void
+RelationCreateInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingRelDelete *pending;
+ SMgrRelation srel;
+ PendingRelDelete *prev;
+ PendingRelDelete *next;
+ bool create = true;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), false, false);
+
+ /*
+ * If we have entries for init-fork operations on this relation, that means
+ * that we have already registered pending-delete entries to drop an init
+ * fork that existed before the current transaction started. This function
+ * reverts that change just by removing the entries.
+ */
+ prev = NULL;
+ for (pending = pendingDeletes; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ (pending->op & PDOP_DELETE) == 0 &&
+ (pending->unlink_forknum == INIT_FORKNUM ||
+ (pending->op & PDOP_SET_PERSISTENCE) != 0))
+ {
+ if (prev)
+ prev->next = next;
+ else
+ pendingDeletes = next;
+ pfree(pending);
+
+ create = false;
+ }
+ else
+ {
+ /* unrelated entry, don't touch it */
+ prev = pending;
+ }
+ }
+
+ if (!create)
+ return;
+
+ /*
+ * We are going to create the init fork. If the server crashes before the
+ * current transaction ends, the init fork left behind corrupts data during
+ * recovery. The cleanup fork works as the sentinel to identify that
+ * situation.
+ */
+ srel = smgropen(rnode, InvalidBackendId);
+ log_smgrcreatemark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+
+ /* We don't have existing init fork, create it. */
+ smgrcreate(srel, INIT_FORKNUM, false);
+
+ /*
+ * An index's init fork needs further initialization. ambuildempty should
+ * WAL-log and sync the file by itself; otherwise we do that ourselves.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rnode, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /* drop the init fork, mark file and revert persistence at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_FORK | PDOP_UNLINK_MARK | PDOP_SET_PERSISTENCE;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->bufpersistence = true;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+
+ /* drop mark file at commit */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_MARK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+}
+
+/*
+ * RelationDropInitFork
+ * Delete physical storage for the init fork of a relation.
+ *
+ * Register pending-delete of the init fork. The real deletion is performed by
+ * smgrDoPendingDeletes at commit.
+ *
+ * This function is transactional. If the transaction aborts later on, the
+ * deletion doesn't happen.
+ */
+void
+RelationDropInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingRelDelete *pending;
+ PendingRelDelete *prev;
+ PendingRelDelete *next;
+ bool inxact_created = false;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), true, false);
+
+ /*
+ * If we have entries for init-fork operations of this relation, that means
+ * that we have created the init fork in the current transaction. We
+ * remove the init fork and mark file immediately in that case. Otherwise
+ * just register pending-delete for the existing init fork.
+ */
+ prev = NULL;
+ for (pending = pendingDeletes; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ (pending->op & PDOP_DELETE) == 0 &&
+ (pending->unlink_forknum == INIT_FORKNUM ||
+ (pending->op & PDOP_SET_PERSISTENCE) != 0))
+ {
+ /* unlink list entry */
+ if (prev)
+ prev->next = next;
+ else
+ pendingDeletes = next;
+ pfree(pending);
+
+ inxact_created = true;
+ }
+ else
+ {
+ /* unrelated entry, don't touch it */
+ prev = pending;
+ }
+ }
+
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rnode, InvalidBackendId);
+
+ /*
+ * INIT forks are never loaded into shared buffers, so there is no point
+ * in dropping buffers for such files.
+ */
+ log_smgrunlinkmark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrunlinkmark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+ log_smgrunlink(&rnode, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ return;
+ }
+
+ /* register drop of this init fork file at commit */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_FORK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_SET_PERSISTENCE;
+ pending->bufpersistence = false;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+}
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -187,6 +443,88 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum)
+{
+ xl_smgr_unlink xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file unlink.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK record (mark creation) to WAL.
+ */
+void
+log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file creation.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_CREATE;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK record (mark removal) to WAL.
+ */
+void
+log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the mark file removal.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_UNLINK;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileNode *rnode, bool persistence)
+{
+ xl_smgr_bufpersistence xlrec;
+
+ /*
+ * Make an XLOG entry reporting the change of buffer persistence.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.persistence = persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -200,6 +538,7 @@ RelationDropStorage(Relation rel)
pending = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
pending->relnode = rel->rd_node;
+ pending->op = PDOP_DELETE;
pending->backend = rel->rd_backend;
pending->atCommit = true; /* delete if commit */
pending->nestLevel = GetCurrentTransactionNestLevel();
@@ -618,59 +957,104 @@ smgrDoPendingDeletes(bool isCommit)
int nrels = 0,
maxrels = 0;
SMgrRelation *srels = NULL;
+ srelhash_hash *close_srels = NULL;
+ bool found;
prev = NULL;
for (pending = pendingDeletes; pending != NULL; pending = next)
{
+ SMgrRelation srel;
+
next = pending->next;
if (pending->nestLevel < nestLevel)
{
/* outer-level entries should not be processed yet */
prev = pending;
+ continue;
}
+
+ /* unlink list entry first, so we don't retry on failure */
+ if (prev)
+ prev->next = next;
else
+ pendingDeletes = next;
+
+ if (pending->atCommit != isCommit)
{
- /* unlink list entry first, so we don't retry on failure */
- if (prev)
- prev->next = next;
- else
- pendingDeletes = next;
- /* do deletion if called for */
- if (pending->atCommit == isCommit)
- {
- SMgrRelation srel;
+ /* must explicitly free the list entry */
+ pfree(pending);
+ /* prev does not change */
+ continue;
+ }
- srel = smgropen(pending->relnode, pending->backend);
+ if (close_srels == NULL)
+ close_srels = srelhash_create(CurrentMemoryContext, 32, NULL);
- /* allocate the initial array, or extend it, if needed */
- if (maxrels == 0)
- {
- maxrels = 8;
- srels = palloc(sizeof(SMgrRelation) * maxrels);
- }
- else if (maxrels <= nrels)
- {
- maxrels *= 2;
- srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
- }
+ srel = smgropen(pending->relnode, pending->backend);
+
+ /* Uniquify the smgr relations */
+ srelhash_insert(close_srels, srel, &found);
- srels[nrels++] = srel;
+ if (pending->op & PDOP_DELETE)
+ {
+ /* allocate the initial array, or extend it, if needed */
+ if (maxrels == 0)
+ {
+ maxrels = 8;
+ srels = palloc(sizeof(SMgrRelation) * maxrels);
}
- /* must explicitly free the list entry */
- pfree(pending);
- /* prev does not change */
+ else if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ srels[nrels++] = srel;
}
+
+ if (pending->op & PDOP_UNLINK_FORK)
+ {
+ /* other forks would need their buffers dropped */
+ Assert(pending->unlink_forknum == INIT_FORKNUM);
+
+ /* Don't emit WAL during recovery. */
+ if (!InRecovery)
+ log_smgrunlink(&pending->relnode, pending->unlink_forknum);
+ smgrunlink(srel, pending->unlink_forknum, false);
+ }
+
+ if (pending->op & PDOP_UNLINK_MARK)
+ {
+ if (!InRecovery)
+ log_smgrunlinkmark(&pending->relnode,
+ pending->unlink_forknum,
+ pending->unlink_mark);
+ smgrunlinkmark(srel, pending->unlink_forknum,
+ pending->unlink_mark, InRecovery);
+ }
+
+ if (pending->op & PDOP_SET_PERSISTENCE)
+ SetRelationBuffersPersistence(srel, pending->bufpersistence,
+ InRecovery);
}
if (nrels > 0)
{
smgrdounlinkall(srels, nrels, false);
-
- for (int i = 0; i < nrels; i++)
- smgrclose(srels[i]);
-
pfree(srels);
}
+
+ if (close_srels)
+ {
+ srelhash_iterator i;
+ SRelHashEntry *ent;
+
+ /* close smgr relations */
+ srelhash_start_iterate(close_srels, &i);
+ while ((ent = srelhash_iterate(close_srels, &i)) != NULL)
+ smgrclose(ent->srel);
+ srelhash_destroy(close_srels);
+ }
}
/*
@@ -840,7 +1224,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
- && pending->backend == InvalidBackendId)
+ && pending->backend == InvalidBackendId
+ && pending->op == PDOP_DELETE)
nrels++;
}
if (nrels == 0)
@@ -853,7 +1238,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
- && pending->backend == InvalidBackendId)
+ && pending->backend == InvalidBackendId &&
+ pending->op == PDOP_DELETE)
{
*rptr = pending->relnode;
rptr++;
@@ -933,6 +1319,15 @@ smgr_redo(XLogReaderState *record)
reln = smgropen(xlrec->rnode, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ smgrunlink(reln, xlrec->forkNum, true);
+ smgrclose(reln);
+ }
else if (info == XLOG_SMGR_TRUNCATE)
{
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
@@ -1021,6 +1416,65 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingRelDelete *pending;
+ bool created = false;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ smgrcreatemark(reln, xlrec->forkNum, xlrec->mark, true);
+ created = true;
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ smgrunlinkmark(reln, xlrec->forkNum, xlrec->mark, true);
+ break;
+ default:
+ elog(ERROR, "unknown smgr_mark action \"%c\"", xlrec->action);
+ }
+
+ if (created)
+ {
+ /* revert mark file operation at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = xlrec->rnode;
+ pending->op = PDOP_UNLINK_MARK;
+ pending->unlink_forknum = xlrec->forkNum;
+ pending->unlink_mark = xlrec->mark;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+ }
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingRelDelete *pending;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = xlrec->rnode;
+ pending->op = PDOP_SET_PERSISTENCE;
+ pending->bufpersistence = !xlrec->persistence;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
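The smgrGetPendingDeletes() hunks above add a `pending->op == PDOP_DELETE` test so that mark-file and buffer-persistence entries on the pending list are not reported as relation deletions. A minimal standalone sketch of that filter follows. The type and constant names (PendingEntry, OP_DELETE and so on) are hypothetical stand-ins, not the patch's definitions, and the numeric values are assumptions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the patch's PDOP_* operation kinds. */
typedef enum
{
	OP_DELETE = 1,
	OP_UNLINK_MARK = 2,
	OP_SET_PERSISTENCE = 4
} PendingOp;

/* Hypothetical, reduced version of a pending-operation list entry. */
typedef struct PendingEntry
{
	PendingOp	op;
	bool		at_commit;
	int			nest_level;
	struct PendingEntry *next;
} PendingEntry;

/*
 * Count only the entries that are real deletions for the given
 * commit/abort case and nesting level, skipping the other operation kinds.
 */
size_t
count_pending_deletes(const PendingEntry *head, bool for_commit, int nest_level)
{
	size_t		n = 0;

	for (const PendingEntry *p = head; p != NULL; p = p->next)
	{
		if (p->nest_level >= nest_level &&
			p->at_commit == for_commit &&
			p->op == OP_DELETE)
			n++;
	}
	return n;
}
```

Without the extra test, an abort-time "revert persistence" entry would be counted as a file to drop, which is presumably what the added condition guards against.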
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 0bde972af6..dbfbf12f64 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -52,6 +52,7 @@
#include "commands/defrem.h"
#include "commands/event_trigger.h"
#include "commands/policy.h"
+#include "commands/progress.h"
#include "commands/sequence.h"
#include "commands/tablecmds.h"
#include "commands/tablespace.h"
@@ -5329,6 +5330,168 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
return newcmd;
}
+/*
+ * RelationChangePersistence: do in-place persistence change of a relation
+ */
+static void
+RelationChangePersistence(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ int i;
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * If the condition asserted below were false, we would need to call
+ * ATRewriteTable; that cannot happen in the AT_REWRITE_ALTER_PERSISTENCE case.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ RelationGetSmgr(rel);
+
+ /*
+ * First we collect all relations that we need to change persistence.
+ */
+
+ /* Collect OIDs of indexes and toast relations */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ /* Add toast relation if any */
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ List *toastidx;
+ Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
+
+ RelationGetSmgr(toastrel);
+ relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
+ toastidx = RelationGetIndexList(toastrel);
+ relids = list_concat(relids, toastidx);
+ pfree(toastidx);
+ table_close(toastrel, NoLock);
+ }
+
+ table_close(rel, NoLock);
+
+ /* Make changes in storage */
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach (lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+
+ /*
+ * Some access methods do not accept in-place persistence change. For
+ * example, GiST uses page LSNs to figure out whether a block has
+ * changed, where UNLOGGED GiST indexes use fake LSNs that are
+ * incompatible with real LSNs used for LOGGED ones.
+ *
+ * XXXX: We don't bother allowing in-place persistence change for index
+ * methods other than btree for now.
+ */
+ if (r->rd_rel->relkind == RELKIND_INDEX &&
+ r->rd_rel->relam != BTREE_AM_OID)
+ {
+ int reindex_flags;
+
+ /* reindex doesn't allow concurrent use of the index */
+ table_close(r, NoLock);
+
+ reindex_flags =
+ REINDEX_REL_SUPPRESS_INDEX_USE |
+ REINDEX_REL_CHECK_CONSTRAINTS;
+
+ /* Set the same persistence as the parent relation. */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+
+ reindex_index(reloid, reindex_flags, persistence, 0);
+
+ continue;
+ }
+
+ /* Create or drop init fork */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ RelationCreateInitFork(r);
+ else
+ RelationDropInitFork(r);
+
+ /*
+ * When this relation gets WAL-logged, immediately sync all forks except
+ * the init fork to establish the initial state on storage. Buffers have
+ * already been flushed out by RelationCreate(Drop)InitFork called just
+ * above. The init fork should have been synced as needed.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT)
+ {
+ for (i = 0 ; i < INIT_FORKNUM ; i++)
+ {
+ if (smgrexists(RelationGetSmgr(r), i))
+ smgrimmedsync(RelationGetSmgr(r), i);
+ }
+ }
+
+ /* Update catalog */
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+
+ /*
+ * While wal_level >= replica, switching to LOGGED requires the
+ * relation content to be WAL-logged to recover the table.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded())
+ {
+ ForkNumber fork;
+
+ for (fork = 0; fork < INIT_FORKNUM ; fork++)
+ {
+ if (smgrexists(RelationGetSmgr(r), fork))
+ log_newpage_range(r, fork,
+ 0, smgrnblocks(RelationGetSmgr(r), fork), false);
+ }
+ }
+
+ table_close(r, NoLock);
+ }
+
+ table_close(classRel, NoLock);
+}
+
/*
* ATRewriteTables: ALTER TABLE phase 3
*/
@@ -5474,32 +5637,55 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
* persistence. That wouldn't work for pg_class, but that can't be
* unlogged anyway.
*/
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
+
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
+ RelationChangePersistence(tab, persistence, lockmode);
+ else
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new relfilenode
+ * will have the right persistence set, and at the same time
+ * ensure that the original filenode's buffers will get read in
+ * with the correct setting (i.e. the original one). Otherwise
+ * a rollback after the rewrite would possibly result with
+ * buffers for the original filenode having the wrong
+ * persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that can't
+ * be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
persistence, lockmode);
+
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
-
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
-
- InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ /*
+ * Swap the physical files of the old and new heaps, then rebuild
+ * indexes and discard the old heap. We can use RecentXmin for
+ * the table's new relfrozenxid because we rewrote all the tuples
+ * in ATRewriteTable, so no older Xid remains in the table. Also,
+ * we never try to swap toast tables by content, since we have no
+ * interest in letting this code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
+
+ InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ }
+
}
else
{
@@ -14319,6 +14505,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
return new_tablespaceoid;
}
+/*
+ * Alter Table ALL ... SET LOGGED/UNLOGGED
+ *
+ * Allows a user to change persistence of all objects in a given tablespace in
+ * the current database. Objects can be chosen based on the owner of the
+ * object also, to allow users to change the persistence of only their own
+ * objects. The main permissions handling is done by the lower-level change
+ * persistence function.
+ *
+ * All to-be-modified objects are locked first. If NOWAIT is specified and the
+ * lock can't be acquired then we ereport(ERROR).
+ */
+void
+AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt)
+{
+ List *relations = NIL;
+ ListCell *l;
+ ScanKeyData key[1];
+ Relation rel;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ Oid tablespaceoid;
+ List *role_oids = roleSpecsToIds(NIL);
+
+ /* Ensure we were not asked to change something we can't */
+ if (stmt->objtype != OBJECT_TABLE)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("only tables can be specified")));
+
+ /* Get the tablespace OID */
+ tablespaceoid = get_tablespace_oid(stmt->tablespacename, false);
+
+ /*
+ * Now that the checks are done, check whether we should set it to
+ * InvalidOid because it is our database's default tablespace.
+ */
+ if (tablespaceoid == MyDatabaseTableSpace)
+ tablespaceoid = InvalidOid;
+
+ /*
+ * Walk the list of objects in the tablespace to pick them up. This will
+ * only find objects in our database, of course.
+ */
+ ScanKeyInit(&key[0],
+ Anum_pg_class_reltablespace,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(tablespaceoid));
+
+ rel = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(rel, 1, key);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relOid = relForm->oid;
+
+ /*
+ * Do not pick-up objects in pg_catalog as part of this, if an admin
+ * really wishes to do so, they can issue the individual ALTER
+ * commands directly.
+ *
+ * Also, explicitly avoid any shared tables, temp tables, or TOAST
+ * (TOAST will be changed with the main table).
+ */
+ if (IsCatalogNamespace(relForm->relnamespace) ||
+ relForm->relisshared ||
+ isAnyTempNamespace(relForm->relnamespace) ||
+ IsToastNamespace(relForm->relnamespace))
+ continue;
+
+ /* Only pick up the object type requested */
+ if (relForm->relkind != RELKIND_RELATION)
+ continue;
+
+ /* Check if we are only picking-up objects owned by certain roles */
+ if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner))
+ continue;
+
+ /*
+ * Handle permissions-checking here since we are locking the tables
+ * and also to avoid doing a bunch of work only to fail part-way. Note
+ * that permissions will also be checked by AlterTableInternal().
+ *
+ * Caller must be considered an owner on the table of which we're going
+ * to change persistence.
+ */
+ if (!pg_class_ownercheck(relOid, GetUserId()))
+ aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)),
+ NameStr(relForm->relname));
+
+ if (stmt->nowait &&
+ !ConditionalLockRelationOid(relOid, AccessExclusiveLock))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("aborting because lock on relation \"%s.%s\" is not available",
+ get_namespace_name(relForm->relnamespace),
+ NameStr(relForm->relname))));
+ else
+ LockRelationOid(relOid, AccessExclusiveLock);
+
+ /*
+ * Add to our list of objects of which we're going to change
+ * persistence.
+ */
+ relations = lappend_oid(relations, relOid);
+ }
+
+ table_endscan(scan);
+ table_close(rel, AccessShareLock);
+
+ if (relations == NIL)
+ ereport(NOTICE,
+ (errcode(ERRCODE_NO_DATA_FOUND),
+ errmsg("no matching relations in tablespace \"%s\" found",
+ tablespaceoid == InvalidOid ? "(database default)" :
+ get_tablespace_name(tablespaceoid))));
+
+ /*
+ * Everything is locked, loop through and change persistence of all of the
+ * relations.
+ */
+ foreach(l, relations)
+ {
+ List *cmds = NIL;
+ AlterTableCmd *cmd = makeNode(AlterTableCmd);
+
+ if (stmt->logged)
+ cmd->subtype = AT_SetLogged;
+ else
+ cmd->subtype = AT_SetUnLogged;
+
+ cmds = lappend(cmds, cmd);
+
+ EventTriggerAlterTableStart((Node *) stmt);
+ /* OID is set by AlterTableInternal */
+ AlterTableInternal(lfirst_oid(l), cmds, false);
+ EventTriggerAlterTableEnd();
+ }
+}
+
static void
index_copy_data(Relation rel, RelFileNode newrnode)
{
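The per-relation loop in RelationChangePersistence() above can be summarized as a small decision table: non-btree indexes are rebuilt, everything else has its init fork created or dropped, is synced when becoming LOGGED, gets its pg_class row updated, and is WAL-logged page by page only when XLogIsNeeded() holds. A minimal sketch of that decision follows. The ACT_* flags and the function name are hypothetical and exist only for illustration.

```c
#include <stdbool.h>

/* Hypothetical action flags; they do not exist in the patch. */
enum
{
	ACT_REINDEX = 1 << 0,			/* rebuild via reindex_index() */
	ACT_CREATE_INIT_FORK = 1 << 1,
	ACT_DROP_INIT_FORK = 1 << 2,
	ACT_SYNC_FORKS = 1 << 3,		/* smgrimmedsync() on non-init forks */
	ACT_UPDATE_CATALOG = 1 << 4,	/* flip pg_class.relpersistence */
	ACT_WAL_LOG_PAGES = 1 << 5		/* log_newpage_range() per fork */
};

/*
 * Return the set of actions the loop would take for one relation.
 * wal_needed corresponds to XLogIsNeeded(), i.e. wal_level >= replica.
 */
int
plan_persistence_change(bool to_logged, bool is_index, bool is_btree,
						bool wal_needed)
{
	int			actions;

	/* Index AMs other than btree are rebuilt instead of changed in place. */
	if (is_index && !is_btree)
		return ACT_REINDEX;

	actions = ACT_UPDATE_CATALOG;
	if (to_logged)
	{
		actions |= ACT_DROP_INIT_FORK | ACT_SYNC_FORKS;
		if (wal_needed)
			actions |= ACT_WAL_LOG_PAGES;
	}
	else
		actions |= ACT_CREATE_INIT_FORK;

	return actions;
}
```

For example, SET LOGGED on a heap under wal_level=minimal yields drop-init-fork, sync, and catalog update, with no page logging, which matches the "skip the table copy" case described at the top of the thread.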
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index dba82860fe..7be1f08735 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4316,6 +4316,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from)
return newnode;
}
+static AlterTableSetLoggedAllStmt *
+_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *from)
+{
+ AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt);
+
+ COPY_STRING_FIELD(tablespacename);
+ COPY_SCALAR_FIELD(objtype);
+ COPY_SCALAR_FIELD(logged);
+ COPY_SCALAR_FIELD(nowait);
+
+ return newnode;
+}
+
static CreateExtensionStmt *
_copyCreateExtensionStmt(const CreateExtensionStmt *from)
{
@@ -5678,6 +5691,9 @@ copyObjectImpl(const void *from)
case T_AlterTableMoveAllStmt:
retval = _copyAlterTableMoveAllStmt(from);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _copyAlterTableSetLoggedAllStmt(from);
+ break;
case T_CreateExtensionStmt:
retval = _copyCreateExtensionStmt(from);
break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 02b9d7f1c2..4e89728524 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1956,6 +1956,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a,
return true;
}
+static bool
+_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *a,
+ const AlterTableSetLoggedAllStmt *b)
+{
+ COMPARE_STRING_FIELD(tablespacename);
+ COMPARE_SCALAR_FIELD(objtype);
+ COMPARE_SCALAR_FIELD(logged);
+ COMPARE_SCALAR_FIELD(nowait);
+
+ return true;
+}
+
static bool
_equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b)
{
@@ -3674,6 +3686,9 @@ equal(const void *a, const void *b)
case T_AlterTableMoveAllStmt:
retval = _equalAlterTableMoveAllStmt(a, b);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _equalAlterTableSetLoggedAllStmt(a, b);
+ break;
case T_CreateExtensionStmt:
retval = _equalCreateExtensionStmt(a, b);
break;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 6d1aabb812..a11def3646 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -1991,6 +1991,26 @@ AlterTableStmt:
n->nowait = $13;
$$ = (Node *)n;
}
+ | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = true;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = false;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
| ALTER INDEX qualified_name alter_table_cmds
{
AlterTableStmt *n = makeNode(AlterTableStmt);
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index ec0485705d..45e1a5d817 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1070,6 +1070,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
bool excludeFound;
ForkNumber relForkNum; /* Type of fork if file is a relation */
int relOidChars; /* Chars in filename that are the rel oid */
+ StorageMarks mark;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1120,7 +1121,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
/* Exclude all forks for unlogged tables except the init fork */
if (isDbDir &&
parse_filename_for_nontemp_relation(de->d_name, &relOidChars,
- &relForkNum))
+ &relForkNum, &mark))
{
/* Never exclude init forks */
if (relForkNum != INIT_FORKNUM)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index b4532948d3..dab74bf99a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -37,6 +37,7 @@
#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
#include "executor/instrument.h"
#include "lib/binaryheap.h"
#include "miscadmin.h"
@@ -3154,6 +3155,93 @@ DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
}
}
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a relation,
+ * then, when switching to PERMANENT, writes all dirty pages of the relation
+ * out to disk (or more accurately, out to kernel disk buffers), ensuring
+ * that the kernel has an up-to-date view of the relation.
+ *
+ * Generally, the caller should be holding AccessExclusiveLock on the
+ * target relation to ensure that no other backend is busy dirtying
+ * more blocks of the relation; the effects can't be expected to last
+ * after the lock is released.
+ *
+ * XXX currently it sequentially searches the buffer pool, should be
+ * changed to more clever ways of searching. This routine is not
+ * used in any performance-critical code paths, so it's not worth
+ * adding additional overhead to normal paths to make it go faster;
+ * but see also DropRelFileNodeBuffers.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
+{
+ int i;
+ RelFileNodeBackend rnode = srel->smgr_rnode;
+
+ Assert (!RelFileNodeBackendIsTemp(rnode));
+
+ if (!isRedo)
+ log_smgrbufpersistence(&srel->smgr_rnode.node, permanent);
+
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ {
+ UnlockBufHdr(bufHdr, buf_state);
+ continue;
+ }
+
+ if (permanent)
+ {
+ /* Init fork is being dropped, drop buffers for it. */
+ if (bufHdr->tag.forkNum == INIT_FORKNUM)
+ {
+ InvalidateBuffer(bufHdr);
+ continue;
+ }
+
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /* we flush this buffer when switching to PERMANENT */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, srel);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr, true);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ /* init fork is always BM_PERMANENT. See BufferAlloc */
+ if (bufHdr->tag.forkNum != INIT_FORKNUM)
+ buf_state &= ~BM_PERMANENT;
+
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ }
+}
+
/* ---------------------------------------------------------------------
* DropRelFileNodesAllBuffers
*
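The core of SetRelationBuffersPersistence() above is a flag flip on each matching buffer's state word: BM_PERMANENT is set when switching to PERMANENT and cleared otherwise, except for init-fork buffers, which always stay permanent. A minimal sketch of just that bit manipulation follows. FLAG_PERMANENT and FORK_INIT are hypothetical stand-ins with assumed values, and the sketch does not model header locking, flushing, or the invalidation of init-fork buffers in the PERMANENT case.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real bit position and fork number may differ. */
#define FLAG_PERMANENT	(1u << 0)
#define FORK_INIT		3

/*
 * Return the new state word for a buffer of the given fork.
 * Other flag bits in the state word are left untouched.
 */
uint32_t
set_persistence_flag(uint32_t state, int fork, bool permanent)
{
	if (permanent)
		return state | FLAG_PERMANENT;

	/* Init-fork buffers keep the permanent flag even when going UNLOGGED. */
	if (fork != FORK_INIT)
		return state & ~FLAG_PERMANENT;

	return state;
}
```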
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 263057841d..8487ae1f02 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -349,8 +349,6 @@ static void pre_sync_fname(const char *fname, bool isdir, int elevel);
static void datadir_fsync_fname(const char *fname, bool isdir, int elevel);
static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel);
-static int fsync_parent_path(const char *fname, int elevel);
-
/*
* pg_fsync --- do fsync with or without writethrough
@@ -3759,7 +3757,7 @@ fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel)
* This is aimed at making file operations persistent on disk in case of
* an OS crash or power failure.
*/
-static int
+int
fsync_parent_path(const char *fname, int elevel)
{
char parentpath[MAXPGPATH];
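The cleanup pass that this patch adds to reinit.c decides, file by file, whether to unlink it based on three per-relfilenode facts: whether an init fork exists, whether the init fork carries an uncommitted mark, and whether the main fork carries one. A minimal sketch of that decision, written to follow the branch structure of the patch's second pass, is given here. The enum and struct names are hypothetical simplifications of ForkNumber, StorageMarks and relfile_entry.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for ForkNumber and StorageMarks. */
typedef enum { FORK_MAIN, FORK_FSM, FORK_VM, FORK_INIT } Fork;
typedef enum { MARK_NONE, MARK_UNCOMMITTED } Mark;

/* Mirrors the three flags of the patch's relfile_entry. */
typedef struct
{
	bool		has_init;		/* an init fork exists */
	bool		dirty_init;		/* init fork has an uncommitted mark */
	bool		dirty_all;		/* main fork has an uncommitted mark */
} RelfileState;

/* Return true if the file for (fork, mark) should be unlinked at cleanup. */
bool
should_unlink(const RelfileState *ent, Fork fork, Mark mark)
{
	if (ent == NULL)
		return false;			/* relfilenode not in the hash: keep */

	if (!ent->dirty_all)
	{
		if (!ent->has_init)
			return false;		/* clean permanent relation */

		if (ent->dirty_init)
		{
			/* crashed SET UNLOGGED: drop only init-fork files */
			if (fork != FORK_INIT)
				return false;
		}
		else
		{
			/* ordinary unlogged relation: keep the init fork itself */
			if (fork == FORK_INIT && mark == MARK_NONE)
				return false;
		}
	}

	return true;				/* uncommitted creation, or a fork to reset */
}
```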
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 0ae3fb6902..a34aa8e9af 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -16,29 +16,49 @@
#include <unistd.h>
+#include "access/xlog.h"
+#include "catalog/pg_tablespace_d.h"
#include "common/relpath.h"
#include "postmaster/startup.h"
+#include "storage/bufmgr.h"
#include "storage/copydir.h"
#include "storage/fd.h"
+#include "storage/md.h"
#include "storage/reinit.h"
+#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
- int op);
+ Oid tspid, int op);
static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
- int op);
+ Oid tspid, Oid dbid, int op);
typedef struct
{
Oid reloid; /* hash key */
-} unlogged_relation_entry;
+ bool has_init; /* has INIT fork */
+ bool dirty_init; /* needs to remove INIT fork */
+ bool dirty_all; /* needs to remove all forks */
+} relfile_entry;
/*
- * Reset unlogged relations from before the last restart.
+ * Clean up and reset relation files from before the last restart.
*
- * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any
- * relation with an "init" fork, except for the "init" fork itself.
+ * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations
+ * depending on the existence of the "cleanup" forks.
+ *
+ * If SMGR_MARK_UNCOMMITTED mark file for init fork is present, we remove the
+ * init fork along with the mark file.
+ *
+ * If SMGR_MARK_UNCOMMITTED mark file for main fork is present we remove the
+ * whole relation along with the mark file.
+ *
+ * Otherwise, if the "init" fork is found, we remove all forks of any relation
+ * with the "init" fork, except for the "init" fork itself.
+ *
+ * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all
+ * relations that have the "cleanup" and/or the "init" forks.
*
* If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main
* fork.
@@ -72,7 +92,7 @@ ResetUnloggedRelations(int op)
/*
* First process unlogged files in pg_default ($PGDATA/base)
*/
- ResetUnloggedRelationsInTablespaceDir("base", op);
+ ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op);
/*
* Cycle through directories for all non-default tablespaces.
@@ -81,13 +101,19 @@ ResetUnloggedRelations(int op)
while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
{
+ Oid tspid;
+
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
continue;
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
- ResetUnloggedRelationsInTablespaceDir(temp_path, op);
+
+ tspid = atooid(spc_de->d_name);
+ Assert(tspid != 0);
+
+ ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op);
}
FreeDir(spc_dir);
@@ -103,7 +129,8 @@ ResetUnloggedRelations(int op)
* Process one tablespace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
+ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
+ Oid tspid, int op)
{
DIR *ts_dir;
struct dirent *de;
@@ -130,6 +157,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
while ((de = ReadDir(ts_dir, tsdirname)) != NULL)
{
+ Oid dbid;
+
/*
* We're only interested in the per-database directories, which have
* numeric names. Note that this code will also (properly) ignore "."
@@ -148,7 +177,9 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
ereport_startup_progress("resetting unlogged relations (cleanup), elapsed time: %ld.%02d s, current path: %s",
dbspace_path);
- ResetUnloggedRelationsInDbspaceDir(dbspace_path, op);
+ dbid = atooid(de->d_name);
+ Assert(dbid != 0);
+ ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op);
}
FreeDir(ts_dir);
@@ -158,125 +189,228 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
* Process one per-dbspace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
+ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
+ Oid tspid, Oid dbid, int op)
{
DIR *dbspace_dir;
struct dirent *de;
char rm_path[MAXPGPATH * 2];
+ HTAB *hash;
+ HASHCTL ctl;
/* Caller must specify at least one operation. */
- Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0);
+ Assert((op & (UNLOGGED_RELATION_CLEANUP |
+ UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_INIT)) != 0);
/*
* Cleanup is a two-pass operation. First, we go through and identify all
* the files with init forks. Then, we go through again and nuke
* everything with the same OID except the init fork.
*/
- if ((op & UNLOGGED_RELATION_CLEANUP) != 0)
+
+ /*
+ * It's possible that someone could create a ton of unlogged relations
+ * in the same database & tablespace, so we'd better use a hash table
+ * rather than an array or linked list to keep track of which files
+ * need to be reset. Otherwise, this cleanup operation would be
+ * O(n^2).
+ */
+ memset(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(relfile_entry);
+ hash = hash_create("relfilenode cleanup hash",
+ 32, &ctl, HASH_ELEM | HASH_BLOBS);
+
+ /* Collect INIT fork and mark files in the directory. */
+ dbspace_dir = AllocateDir(dbspacedirname);
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
- HTAB *hash;
- HASHCTL ctl;
+ int oidchars;
+ ForkNumber forkNum;
+ StorageMarks mark;
- /*
- * It's possible that someone could create a ton of unlogged relations
- * in the same database & tablespace, so we'd better use a hash table
- * rather than an array or linked list to keep track of which files
- * need to be reset. Otherwise, this cleanup operation would be
- * O(n^2).
- */
- ctl.keysize = sizeof(Oid);
- ctl.entrysize = sizeof(unlogged_relation_entry);
- ctl.hcxt = CurrentMemoryContext;
- hash = hash_create("unlogged relation OIDs", 32, &ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ /* Skip anything that doesn't look like a relation data file. */
+ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
+ &forkNum, &mark))
+ continue;
- /* Scan the directory. */
- dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
+ if (forkNum == INIT_FORKNUM || mark == SMGR_MARK_UNCOMMITTED)
{
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
-
- /* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* Also skip it unless this is the init fork. */
- if (forkNum != INIT_FORKNUM)
- continue;
+ Oid key;
+ relfile_entry *ent;
+ bool found;
/*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
+ * Record the relfilenode information. If it has
+ * SMGR_MARK_UNCOMMITTED mark files, the relfilenode is in dirty
+ * state, where clean up is needed.
*/
- ent.reloid = atooid(de->d_name);
- (void) hash_search(hash, &ent, HASH_ENTER, NULL);
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_ENTER, &found);
+
+ if (!found)
+ {
+ ent->has_init = false;
+ ent->dirty_init = false;
+ ent->dirty_all = false;
+ }
+
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_init = true;
+ else if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_all = true;
+ else
+ {
+ Assert(forkNum == INIT_FORKNUM);
+ ent->has_init = true;
+ }
}
+ }
- /* Done with the first pass. */
- FreeDir(dbspace_dir);
+ /* Done with the first pass. */
+ FreeDir(dbspace_dir);
+
+ /* nothing to do if we have neither init nor cleanup forks */
+ if (hash_get_num_entries(hash) < 1)
+ {
+ hash_destroy(hash);
+ return;
+ }
+ if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0)
+ {
/*
- * If we didn't find any init forks, there's no point in continuing;
- * we can bail out now.
+ * When we come here after recovery, an smgr object for this file might
+ * have been created. In that case we need to drop all buffers and then
+ * the smgr object before initializing the unlogged relation. This is
+ * safe as long as no other backends have accessed the relation before
+ * starting archive recovery.
*/
- if (hash_get_num_entries(hash) == 0)
+ HASH_SEQ_STATUS status;
+ relfile_entry *ent;
+ SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8);
+ int maxrels = 8;
+ int nrels = 0;
+ int i;
+
+ Assert(!HotStandbyActive());
+
+ hash_seq_init(&status, hash);
+ while((ent = (relfile_entry *) hash_seq_search(&status)) != NULL)
{
- hash_destroy(hash);
- return;
+ RelFileNodeBackend rel;
+
+ /*
+ * The relation is persistent and remains persistent. Don't
+ * drop the buffers for this relation.
+ */
+ if (ent->has_init && ent->dirty_init)
+ continue;
+
+ if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = ent->reloid;
+
+ srels[nrels++] = smgropen(rel.node, InvalidBackendId);
}
- /*
- * Now, make a second pass and remove anything that matches.
- */
+ DropRelFileNodesAllBuffers(srels, nrels);
+
+ for (i = 0 ; i < nrels ; i++)
+ smgrclose(srels[i]);
+ }
+
+ /*
+ * Now, make a second pass and remove anything that matches.
+ */
+ if ((op & UNLOGGED_RELATION_CLEANUP) != 0)
+ {
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ int oidchars;
+ Oid key;
+ relfile_entry *ent;
+ RelFileNodeBackend rel;
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* We never remove the init fork. */
- if (forkNum == INIT_FORKNUM)
+ &forkNum, &mark))
continue;
/*
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
- if (hash_search(hash, &ent, HASH_FIND, NULL))
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);
+
+ if (!ent)
+ continue;
+
+ if (!ent->dirty_all)
{
- snprintf(rm_path, sizeof(rm_path), "%s/%s",
- dbspacedirname, de->d_name);
- if (unlink(rm_path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m",
- rm_path)));
+ /* clean permanent relations don't need cleanup */
+ if (!ent->has_init)
+ continue;
+
+ if (ent->dirty_init)
+ {
+ /*
+ * The crashed trasaction did SET UNLOGGED. This relation
+ * is restored to a LOGGED relation.
+ */
+ if (forkNum != INIT_FORKNUM)
+ continue;
+ }
else
- elog(DEBUG2, "unlinked file \"%s\"", rm_path);
+ {
+ /*
+ * we don't remove the INIT fork of a non-dirty
+ * relfilenode
+ */
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE)
+ continue;
+ }
}
+
+ /* so, nuke it! */
+ snprintf(rm_path, sizeof(rm_path), "%s/%s",
+ dbspacedirname, de->d_name);
+ if (unlink(rm_path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ rm_path)));
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = atooid(de->d_name);
+
+ ForgetRelationForkSyncRequests(rel, forkNum);
}
/* Cleanup is complete. */
FreeDir(dbspace_dir);
- hash_destroy(hash);
}
+ hash_destroy(hash);
+ hash = NULL;
+
/*
* Initialization happens after cleanup is complete: we copy each init
- * fork file to the corresponding main fork file. Note that if we are
- * asked to do both cleanup and init, we may never get here: if the
- * cleanup code determines that there are no init forks in this dbspace,
- * it will return before we get to this point.
+ * fork file to the corresponding main fork file.
*/
if ((op & UNLOGGED_RELATION_INIT) != 0)
{
@@ -285,6 +419,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char srcpath[MAXPGPATH * 2];
@@ -292,9 +427,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -328,15 +465,18 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char mainpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -379,7 +519,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
*/
bool
parse_filename_for_nontemp_relation(const char *name, int *oidchars,
- ForkNumber *fork)
+ ForkNumber *fork, StorageMarks *mark)
{
int pos;
@@ -410,11 +550,19 @@ parse_filename_for_nontemp_relation(const char *name, int *oidchars,
for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar)
;
- if (segchar <= 1)
- return false;
- pos += segchar;
+ if (segchar > 1)
+ pos += segchar;
}
+ /* mark file? */
+ if (name[pos] == '.' && name[pos + 1] != 0)
+ {
+ *mark = name[pos + 1];
+ pos += 2;
+ }
+ else
+ *mark = SMGR_MARK_NONE;
+
/* Now we should be at the end. */
if (name[pos] != '\0')
return false;
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index b4bca7eed6..27ca9c1ca2 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -139,7 +139,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
BlockNumber blkno, bool skipFsync, int behavior);
static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
MdfdVec *seg);
-
+static bool mdmarkexists(SMgrRelation reln, ForkNumber forkNum,
+ StorageMarks mark);
/*
* mdinit() -- Initialize private state for magnetic disk storage manager.
@@ -169,6 +170,80 @@ mdexists(SMgrRelation reln, ForkNumber forkNum)
return (mdopenfork(reln, forkNum, EXTENSION_RETURN_NULL) != NULL);
}
+/*
+ * mdcreatemark() -- Create a mark file.
+ *
+ * If isRedo is true, it's okay for the file to exist already.
+ */
+void
+mdcreatemark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+ int fd;
+
+ /* See mdcreate for details. */
+ TablespaceCreateDbspace(reln->smgr_rnode.node.spcNode,
+ reln->smgr_rnode.node.dbNode,
+ isRedo);
+
+ fd = BasicOpenFile(path, O_WRONLY | O_CREAT | O_EXCL);
+ if (fd < 0 && (!isRedo || errno != EEXIST))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create mark file \"%s\": %m", path)));
+
+ pg_fsync(fd);
+ close(fd);
+
+ /*
+ * To guarantee that the creation of the file is persistent, fsync its
+ * parent directory.
+ */
+ fsync_parent_path(path, ERROR);
+ pfree(path);
+}
+
+
+/*
+ * mdunlinkmark() -- Delete the mark file
+ *
+ * If isRedo is true, it's okay for the file to be missing.
+ */
+void
+mdunlinkmark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+
+ if (!isRedo || mdmarkexists(reln, forkNum, mark))
+ durable_unlink(path, ERROR);
+
+ pfree(path);
+}
+
+/*
+ * mdmarkexists() -- Check if the file exists.
+ */
+static bool
+mdmarkexists(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+ int fd;
+
+ fd = BasicOpenFile(path, O_RDONLY);
+ if (fd < 0 && errno != ENOENT)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not access mark file \"%s\": %m", path)));
+ pfree(path);
+
+ if (fd < 0)
+ return false;
+ close(fd);
+ return true;
+}
+
/*
* mdcreate() -- Create a new relation on magnetic disk.
*
@@ -1025,6 +1100,15 @@ register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum,
RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ );
}
+/*
+ * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork
+ */
+void
+ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, ForkNumber forknum)
+{
+ register_forget_request(rnode, forknum, 0);
+}
+
/*
* ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB
*/
@@ -1378,12 +1462,14 @@ mdsyncfiletag(const FileTag *ftag, char *path)
* Return 0 on success, -1 on failure, with errno set.
*/
int
-mdunlinkfiletag(const FileTag *ftag, char *path)
+mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark)
{
char *p;
/* Compute the path. */
- p = relpathperm(ftag->rnode, MAIN_FORKNUM);
+ p = GetRelationPath(ftag->rnode.dbNode, ftag->rnode.spcNode,
+ ftag->rnode.relNode, InvalidBackendId, MAIN_FORKNUM,
+ mark);
strlcpy(path, p, MAXPGPATH);
pfree(p);
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 0fcef4994b..110e64b0b2 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -62,6 +62,10 @@ typedef struct f_smgr
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+ void (*smgr_createmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+ void (*smgr_unlinkmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
} f_smgr;
static const f_smgr smgrsw[] = {
@@ -82,6 +86,8 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
+ .smgr_createmark = mdcreatemark,
+ .smgr_unlinkmark = mdunlinkmark,
}
};
@@ -335,6 +341,26 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo);
}
+/*
+ * smgrcreatemark() -- Create a mark file
+ */
+void
+smgrcreatemark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_createmark(reln, forknum, mark, isRedo);
+}
+
+/*
+ * smgrunlinkmark() -- Delete a mark file
+ */
+void
+smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlinkmark(reln, forknum, mark, isRedo);
+}
+
/*
* smgrdosyncall() -- Immediately sync all forks of all given relations
*
@@ -662,6 +688,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index d4083e8a56..9563940d45 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -89,7 +89,8 @@ static CycleCtr checkpoint_cycle_ctr = 0;
typedef struct SyncOps
{
int (*sync_syncfiletag) (const FileTag *ftag, char *path);
- int (*sync_unlinkfiletag) (const FileTag *ftag, char *path);
+ int (*sync_unlinkfiletag) (const FileTag *ftag, char *path,
+ StorageMarks mark);
bool (*sync_filetagmatches) (const FileTag *ftag,
const FileTag *candidate);
} SyncOps;
@@ -222,7 +223,8 @@ SyncPostCheckpoint(void)
/* Unlink the file */
if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
- path) < 0)
+ path,
+ SMGR_MARK_NONE) < 0)
{
/*
* There's a race condition, when the database is dropped at the
@@ -236,6 +238,20 @@ SyncPostCheckpoint(void)
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m", path)));
}
+ else if (syncsw[entry->tag.handler].sync_unlinkfiletag(
+ &entry->tag, path,
+ SMGR_MARK_UNCOMMITTED) < 0)
+ {
+ /*
+ * We may also have an SMGR_MARK_UNCOMMITTED file. Remove it if the
+ * fork file has been successfully removed. It's ok if the mark file
+ * does not exist.
+ */
+ if (errno != ENOENT)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
/* Mark the list entry as canceled, just in case */
entry->canceled = true;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 9a15fd4c57..f80e2cde1b 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -163,6 +163,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_AlterTSConfigurationStmt:
case T_AlterTSDictionaryStmt:
case T_AlterTableMoveAllStmt:
+ case T_AlterTableSetLoggedAllStmt:
case T_AlterTableSpaceOptionsStmt:
case T_AlterTableStmt:
case T_AlterTypeStmt:
@@ -1748,6 +1749,12 @@ ProcessUtilitySlow(ParseState *pstate,
commandCollected = true;
break;
+ case T_AlterTableSetLoggedAllStmt:
+ AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree);
+ /* commands are stashed in AlterTableSetLoggedAll */
+ commandCollected = true;
+ break;
+
case T_DropStmt:
ExecDropStmt((DropStmt *) parsetree, isTopLevel);
/* no commands stashed for DROP */
@@ -2676,6 +2683,10 @@ CreateCommandTag(Node *parsetree)
tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype);
+ break;
+
case T_AlterTableStmt:
tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype);
break;
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 436df54120..dbc0da5da5 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -407,6 +407,30 @@ extractPageInfo(XLogReaderState *record)
* source system.
*/
}
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_UNLINK)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_MARK)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
else if (rmid == RM_XACT_ID &&
((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT ||
(rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED ||
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 1f5c426ec0..67f24890d6 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -139,9 +139,15 @@ GetDatabasePath(Oid dbNode, Oid spcNode)
*/
char *
GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
- int backendId, ForkNumber forkNumber)
+ int backendId, ForkNumber forkNumber, char mark)
{
char *path;
+ char markstr[10];
+
+ if (mark == 0)
+ markstr[0] = 0;
+ else
+ snprintf(markstr, 10, ".%c", mark);
if (spcNode == GLOBALTABLESPACE_OID)
{
@@ -149,10 +155,10 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
Assert(dbNode == 0);
Assert(backendId == InvalidBackendId);
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("global/%u_%s",
- relNode, forkNames[forkNumber]);
+ path = psprintf("global/%u_%s%s",
+ relNode, forkNames[forkNumber], markstr);
else
- path = psprintf("global/%u", relNode);
+ path = psprintf("global/%u%s", relNode, markstr);
}
else if (spcNode == DEFAULTTABLESPACE_OID)
{
@@ -160,22 +166,22 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/%u_%s",
+ path = psprintf("base/%u/%u_%s%s",
dbNode, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/%u",
- dbNode, relNode);
+ path = psprintf("base/%u/%u%s",
+ dbNode, relNode, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/t%d_%u_%s",
+ path = psprintf("base/%u/t%d_%u_%s%s",
dbNode, backendId, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/t%d_%u",
- dbNode, backendId, relNode);
+ path = psprintf("base/%u/t%d_%u%s",
+ dbNode, backendId, relNode, markstr);
}
}
else
@@ -184,27 +190,28 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u_%s%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
dbNode, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
- dbNode, relNode);
+ dbNode, relNode, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
dbNode, backendId, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
- dbNode, backendId, relNode);
+ dbNode, backendId, relNode, markstr);
}
}
+
return path;
}
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 0ab32b44e9..382623159c 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -23,6 +23,8 @@
extern int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
+extern void RelationCreateInitFork(Relation rel);
+extern void RelationDropInitFork(Relation rel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationPreTruncate(Relation rel);
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index f0814f1458..12346ed7f6 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -18,17 +18,23 @@
#include "lib/stringinfo.h"
#include "storage/block.h"
#include "storage/relfilenode.h"
+#include "storage/smgr.h"
/*
* Declarations for smgr-related XLOG records
*
- * Note: we log file creation and truncation here, but logging of deletion
- * actions is handled by xact.c, because it is part of transaction commit.
+ * Note: we log file creation, truncation and buffer persistence change here,
+ * but logging of deletion actions is handled mainly by xact.c, because it is
+ * part of transaction commit in most cases. However, there's a case where
+ * init forks are deleted outside the control of a transaction.
*/
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_UNLINK 0x30
+#define XLOG_SMGR_MARK 0x40
+#define XLOG_SMGR_BUFPERSISTENCE 0x50
typedef struct xl_smgr_create
{
@@ -36,6 +42,32 @@ typedef struct xl_smgr_create
ForkNumber forkNum;
} xl_smgr_create;
+typedef struct xl_smgr_unlink
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+} xl_smgr_unlink;
+
+typedef enum smgr_mark_action
+{
+ XLOG_SMGR_MARK_CREATE = 'c',
+ XLOG_SMGR_MARK_UNLINK = 'u'
+} smgr_mark_action;
+
+typedef struct xl_smgr_mark
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ smgr_mark_action action;
+} xl_smgr_mark;
+
+typedef struct xl_smgr_bufpersistence
+{
+ RelFileNode rnode;
+ bool persistence;
+} xl_smgr_bufpersistence;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +83,12 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark);
+extern void log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark);
+extern void log_smgrbufpersistence(const RelFileNode *rnode, bool persistence);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index 336549cc5f..714077ff4c 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
+extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt);
+
extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
Oid *oldschema);
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index a44be11ca0..106a5cf508 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -67,7 +67,7 @@ extern int forkname_chars(const char *str, ForkNumber *fork);
extern char *GetDatabasePath(Oid dbNode, Oid spcNode);
extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
- int backendId, ForkNumber forkNumber);
+ int backendId, ForkNumber forkNumber, char mark);
/*
* Wrapper macros for GetRelationPath. Beware of multiple
@@ -77,7 +77,7 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
/* First argument is a RelFileNode */
#define relpathbackend(rnode, backend, forknum) \
GetRelationPath((rnode).dbNode, (rnode).spcNode, (rnode).relNode, \
- backend, forknum)
+ backend, forknum, 0)
/* First argument is a RelFileNode */
#define relpathperm(rnode, forknum) \
@@ -87,4 +87,9 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
#define relpath(rnode, forknum) \
relpathbackend((rnode).node, (rnode).backend, forknum)
+/* First argument is a RelFileNodeBackend */
+#define markpath(rnode, forknum, mark) \
+ GetRelationPath((rnode).node.dbNode, (rnode).node.spcNode, \
+ (rnode).node.relNode, \
+ (rnode).backend, forknum, mark)
#endif /* RELPATH_H */
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 886ce9ec2f..e2b9c7cc4e 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -431,6 +431,7 @@ typedef enum NodeTag
T_AlterCollationStmt,
T_CallStmt,
T_AlterStatsStmt,
+ T_AlterTableSetLoggedAllStmt,
/*
* TAGS FOR PARSE TREE NODES (parsenodes.h)
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 5beafc9e11..76e3ac6c60 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2395,6 +2395,15 @@ typedef struct AlterTableMoveAllStmt
bool nowait;
} AlterTableMoveAllStmt;
+typedef struct AlterTableSetLoggedAllStmt
+{
+ NodeTag type;
+ char *tablespacename;
+ ObjectType objtype; /* Object type to move */
+ bool logged;
+ bool nowait;
+} AlterTableSetLoggedAllStmt;
+
/* ----------------------
* Create/Alter Extension Statements
* ----------------------
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index cfce23ecbc..f5a7df87a4 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -206,6 +206,8 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels)
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+ bool permanent, bool isRedo);
extern void DropRelFileNodesAllBuffers(struct SMgrRelationData **smgr_reln, int nnodes);
extern void DropDatabaseBuffers(Oid dbid);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 34602ae006..2dc0357ad5 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -185,6 +185,7 @@ extern ssize_t pg_pwritev_with_retry(int fd,
extern int pg_truncate(const char *path, off_t length);
extern void fsync_fname(const char *fname, bool isdir);
extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
+extern int fsync_parent_path(const char *fname, int elevel);
extern int durable_rename(const char *oldfile, const char *newfile, int loglevel);
extern int durable_unlink(const char *fname, int loglevel);
extern int durable_rename_excl(const char *oldfile, const char *newfile, int loglevel);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 752b440864..99620816b5 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -23,6 +23,10 @@
extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
extern void mdclose(SMgrRelation reln, ForkNumber forknum);
+extern void mdcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void mdunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
@@ -41,12 +45,14 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void ForgetRelationForkSyncRequests(RelFileNodeBackend rnode,
+ ForkNumber forknum);
extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
/* md sync callbacks */
extern int mdsyncfiletag(const FileTag *ftag, char *path);
-extern int mdunlinkfiletag(const FileTag *ftag, char *path);
+extern int mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark);
extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *candidate);
#endif /* MD_H */
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index fad1e5c473..e1f97e9b89 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -16,13 +16,15 @@
#define REINIT_H
#include "common/relpath.h"
-
+#include "storage/smgr.h"
extern void ResetUnloggedRelations(int op);
-extern bool parse_filename_for_nontemp_relation(const char *name,
- int *oidchars, ForkNumber *fork);
+extern bool parse_filename_for_nontemp_relation(const char *name, int *oidchars,
+ ForkNumber *fork,
+ StorageMarks *mark);
#define UNLOGGED_RELATION_CLEANUP 0x0001
-#define UNLOGGED_RELATION_INIT 0x0002
+#define UNLOGGED_RELATION_DROP_BUFFER 0x0002
+#define UNLOGGED_RELATION_INIT 0x0004
#endif /* REINIT_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a6fbf7b6a6..201ecace8a 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,18 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
+/*
+ * Storage marks is a file of which existence suggests something about a
+ * file. The name of such files is "<filename>.<mark>", where the mark is one
+ * of the values of StorageMarks. Since ".<digit>" means segment files so don't
+ * use digits for the mark character.
+ */
+typedef enum StorageMarks
+{
+ SMGR_MARK_NONE = 0,
+ SMGR_MARK_UNCOMMITTED = 'u' /* the file is not committed yet */
+} StorageMarks;
+
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
@@ -85,7 +97,12 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
extern void smgrclose(SMgrRelation reln);
extern void smgrcloseall(void);
extern void smgrclosenode(RelFileNodeBackend rnode);
+extern void smgrcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void smgrunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
--
2.17.0
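A note for readers on the file naming used above: the mark is appended as ".<mark>" after the optional fork suffix (see the markstr handling in GetRelationPath()), and parse_filename_for_nontemp_relation() tells it apart from a ".<digits>" segment suffix. The following is only a standalone sketch of that naming rule, not code from the patch. The names build_name(), parse_name(), MARK_NONE and MARK_UNCOMMITTED are hypothetical stand-ins, and the fork list is reduced to three names for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the patch's StorageMarks values. */
#define MARK_NONE        '\0'
#define MARK_UNCOMMITTED 'u'

/* Build "<relnode>[_<fork>][.<mark>]"; fork may be NULL for the main fork. */
static void
build_name(char *buf, size_t len, unsigned relnode, const char *fork, char mark)
{
    char markstr[3] = "";

    assert(buf != NULL && len > 0);
    if (mark != MARK_NONE)
        snprintf(markstr, sizeof(markstr), ".%c", mark);
    if (fork != NULL)
        snprintf(buf, len, "%u_%s%s", relnode, fork, markstr);
    else
        snprintf(buf, len, "%u%s", relnode, markstr);
}

/*
 * Parse "<digits>[_<fork>][.<segno>][.<mark>]".  Returns false if the name
 * does not follow that shape.  A mark must not be a digit, because
 * ".<digits>" already means a segment file.
 */
static bool
parse_name(const char *name, unsigned *relnode, char *mark)
{
    static const char *const forks[] = {"init", "fsm", "vm"};
    size_t pos = 0;

    while (isdigit((unsigned char) name[pos]))
        pos++;
    if (pos == 0)
        return false;
    *relnode = (unsigned) strtoul(name, NULL, 10);

    /* Optional fork suffix. */
    if (name[pos] == '_')
    {
        bool found = false;

        for (size_t i = 0; i < sizeof(forks) / sizeof(forks[0]); i++)
        {
            size_t n = strlen(forks[i]);

            if (strncmp(name + pos + 1, forks[i], n) == 0)
            {
                pos += 1 + n;
                found = true;
                break;
            }
        }
        if (!found)
            return false;
    }

    /* Optional segment number. */
    if (name[pos] == '.' && isdigit((unsigned char) name[pos + 1]))
    {
        pos++;
        while (isdigit((unsigned char) name[pos]))
            pos++;
    }

    /* Optional single-character mark. */
    if (name[pos] == '.' && name[pos + 1] != '\0' &&
        !isdigit((unsigned char) name[pos + 1]))
    {
        *mark = name[pos + 1];
        pos += 2;
    }
    else
        *mark = MARK_NONE;

    /* Now we should be at the end. */
    return name[pos] == '\0';
}
```

With this sketch, "16384.u" and "16384_init.u" parse as mark files, "16384.1" parses as an ordinary segment with no mark, and "16384.u.x" is rejected.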
Attachment: 0002-fixes-from-justin.patch (text/x-diff; charset=us-ascii)
From 34daf017c9ace03326e4151b00200ab4a0c123f0 Mon Sep 17 00:00:00 2001
From: Justin Pryzby <pryzbyj@telsasoft.com>
Date: Fri, 17 Dec 2021 12:05:14 -0600
Subject: [PATCH 2/2] fixes from justin
---
src/backend/access/transam/README | 2 +-
src/backend/catalog/storage.c | 10 +++++-----
src/backend/commands/tablecmds.c | 15 +++++++--------
src/backend/storage/file/reinit.c | 8 ++++----
src/backend/storage/smgr/md.c | 8 ++++----
src/common/relpath.c | 4 ++--
src/include/storage/smgr.h | 6 +++---
7 files changed, 26 insertions(+), 27 deletions(-)
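For readers following the md.c part: the durability pattern that mdcreatemark() in the first patch relies on (create the file exclusively, fsync it, then fsync the parent directory so the directory entry itself survives a crash) can be sketched in plain POSIX C as below. This is an illustration under assumptions, not the patch's code. create_mark_durably() is a hypothetical helper, it returns -1 instead of raising an error, and it assumes a platform where fsync() on a directory descriptor is permitted.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <libgen.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: create an empty mark file durably.
 * Returns 0 on success, -1 on failure with errno set.
 * If is_redo is true, an already existing file is not an error.
 */
static int
create_mark_durably(const char *path, bool is_redo)
{
    char dirbuf[1024];
    int  fd;
    int  dirfd;
    int  rc;
    int  save_errno;

    assert(path != NULL);

    fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
    {
        if (!is_redo || errno != EEXIST)
            return -1;
        /* Redo case: the file is already there, nothing to flush here. */
    }
    else
    {
        if (fsync(fd) != 0)
        {
            save_errno = errno;
            close(fd);
            errno = save_errno;
            return -1;
        }
        if (close(fd) != 0)
            return -1;
    }

    /* dirname() may modify its argument, so work on a copy of the path. */
    if (strlen(path) >= sizeof(dirbuf))
    {
        errno = ENAMETOOLONG;
        return -1;
    }
    strcpy(dirbuf, path);

    /* Flush the parent directory so the new entry is persistent. */
    dirfd = open(dirname(dirbuf), O_RDONLY);
    if (dirfd < 0)
        return -1;
    rc = fsync(dirfd);
    save_errno = errno;
    close(dirfd);
    errno = save_errno;
    return rc;
}
```

The path string stays valid until after the directory flush, and the file descriptor is only flushed and closed when the open actually succeeded.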
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 7cf77e4a02..5c0fd3f489 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -728,7 +728,7 @@ The Smgr MARK files
--------------------------------
A smgr mark files is created when a new relation file is created to
-mark the relfilenode needs to be cleaned up at recovery time. In
+mark that the relfilenode needs to be cleaned up at recovery time. In
contrast to 4 above, failure to remove smgr mark files will lead to
data loss, in which case the server will shut down.
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index a3e250515c..9ff1520946 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -174,7 +174,7 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
/*
* We are going to create a new storage file. If server crashes before the
- * current transaction ends the file needs to be cleaned up but there's no
+ * current transaction ends, the file needs to be cleaned up but there's no
* clue to the orphan files. The SMGR_MARK_UNCOMMITED mark file works as
* the signal of that situation.
*/
@@ -188,7 +188,7 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
/*
* Add the relation to the list of stuff to delete at abort. We don't
- * remove the mark file at commit. It needs to persiste until the main fork
+ * remove the mark file at commit. It needs to persist until the main fork
* file is actually deleted. See SyncPostCheckpoint.
*/
pending = (PendingRelDelete *)
@@ -280,7 +280,7 @@ RelationCreateInitFork(Relation rel)
/*
* We are going to create the init fork. If server crashes before the
- * current transaction ends the init fork left alone corrupts data while
+ * current transaction ends, the init fork left alone corrupts data during
* recovery. The cleanup fork works as the sentinel to identify that
* situation.
*/
@@ -292,7 +292,7 @@ RelationCreateInitFork(Relation rel)
smgrcreate(srel, INIT_FORKNUM, false);
/*
- * index-init fork needs further initialization. ambuildempty shoud do
+ * index-init fork needs further initialization. ambuildempty should do
* WAL-log and file sync by itself but otherwise we do that by ourselves.
*/
if (rel->rd_rel->relkind == RELKIND_INDEX)
@@ -357,7 +357,7 @@ RelationDropInitFork(Relation rel)
* If we have entries for init-fork operations of this relation, that means
* that we have created the init fork in the current transaction. We
* remove the init fork and mark file immediately in that case. Otherwise
- * just reister pending-delete for the existing init fork.
+ * just register pending-delete for the existing init fork.
*/
prev = NULL;
for (pending = pendingDeletes; pending != NULL; pending = next)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index dbfbf12f64..a9bcf90960 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -5362,7 +5362,7 @@ RelationChangePersistence(AlteredTableInfo *tab, char persistence,
Assert(rel->rd_rel->relpersistence != persistence);
- elog(DEBUG1, "perform im-place persistnce change");
+ elog(DEBUG1, "perform in-place persistence change");
RelationGetSmgr(rel);
@@ -5379,7 +5379,7 @@ RelationChangePersistence(AlteredTableInfo *tab, char persistence,
{
List *toastidx;
Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
-
+
RelationGetSmgr(toastrel);
relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
toastidx = RelationGetIndexList(toastrel);
@@ -5397,8 +5397,8 @@ RelationChangePersistence(AlteredTableInfo *tab, char persistence,
{
Oid reloid = lfirst_oid(lc_oid);
Relation r = relation_open(reloid, lockmode);
-
- /*
+
+ /*
* Some access methods do not accept in-place persistence change. For
* example, GiST uses page LSNs to figure out whether a block has
* changed, where UNLOGGED GiST indexes use fake LSNs that are
@@ -5637,7 +5637,7 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
* persistence. That wouldn't work for pg_class, but that can't be
* unlogged anyway.
*/
-
+
if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
RelationChangePersistence(tab, persistence, lockmode);
else
@@ -5660,7 +5660,7 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
*/
OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
persistence, lockmode);
-
+
/*
* Copy the heap data into the new table with the desired
* modifications, and test the current data within the table
@@ -5685,7 +5685,6 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
}
-
}
else
{
@@ -14510,7 +14509,7 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
*
* Allows a user to change persistence of all objects in a given tablespace in
* the current database. Objects can be chosen based on the owner of the
- * object also, to allow users to change persistene only their objects. The
+ * object also, to allow users to change persistence only their objects. The
* main permissions handling is done by the lower-level change persistence
* function.
*
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index a34aa8e9af..acbcf606ce 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -54,7 +54,7 @@ typedef struct
* If SMGR_MARK_UNCOMMITTED mark file for main fork is present we remove the
* whole relation along with the mark file.
*
- * Otherwise, if the "init" fork is found. we remove all forks of any relation
+ * Otherwise, if the "init" fork is found, we remove all forks of any relation
* with the "init" fork, except for the "init" fork itself.
*
* If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all
@@ -284,7 +284,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
* When we come here after recovery, smgr object for this file might
* have been created. In that case we need to drop all buffers then the
* smgr object before initializing the unlogged relation. This is safe
- * as far as no other backends have accessed the relation before
+ * as long as no other backends have accessed the relation before
* starting archive recovery.
*/
HASH_SEQ_STATUS status;
@@ -302,7 +302,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
RelFileNodeBackend rel;
/*
- * The relation is persistent and stays remain persistent. Don't
+ * The relation is persistent and stays persistent. Don't
* drop the buffers for this relation.
*/
if (ent->has_init && ent->dirty_init)
@@ -367,7 +367,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
if (ent->dirty_init)
{
/*
- * The crashed trasaction did SET UNLOGGED. This relation
+ * The crashed transaction did SET UNLOGGED. This relation
* is restored to a LOGGED relation.
*/
if (forkNum != INIT_FORKNUM)
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 27ca9c1ca2..492bd91c9e 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -193,15 +193,15 @@ mdcreatemark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
(errcode_for_file_access(),
errmsg("could not crete mark file \"%s\": %m", path)));
- pfree(path);
- pg_fsync(fd);
- close(fd);
-
/*
* To guarantee that the creation of the file is persistent, fsync its
* parent directory.
*/
fsync_parent_path(path, ERROR);
+
+ pfree(path);
+ pg_fsync(fd);
+ close(fd);
}
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 67f24890d6..4b1de9cf1e 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -142,12 +142,12 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
int backendId, ForkNumber forkNumber, char mark)
{
char *path;
- char markstr[10];
+ char markstr[3];
if (mark == 0)
markstr[0] = 0;
else
- snprintf(markstr, 10, ".%c", mark);
+ snprintf(markstr, sizeof(markstr), ".%c", mark);
if (spcNode == GLOBALTABLESPACE_OID)
{
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 201ecace8a..c49b3142eb 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -19,9 +19,9 @@
#include "storage/relfilenode.h"
/*
- * Storage marks is a file of which existence suggests something about a
- * file. The name of such files is "<filename>.<mark>", where the mark is one
- * of the values of StorageMarks. Since ".<digit>" means segment files so don't
+ * Storage marks is a file whose existence suggests something about a file.
+ * The name of such files is "<filename>.<mark>", where the mark is one
+ * of the values of StorageMarks. Since ".<digit>" means segment files, don't
* use digits for the mark character.
*/
typedef enum StorageMarks
--
2.17.0
Hello, Jakub.
At Fri, 17 Dec 2021 09:10:30 +0000, Jakub Wartak <Jakub.Wartak@tomtom.com> wrote in
The patch didn't apply cleanly (it's from March; some hunks were failing), so I've fixed it and the combined git format-patch is attached. It did conflict with the following:
Thanks for looking at this. Thanks also to Justin for finding the silly
use-after-free bug. (Now I see that the regression test fails, and I'm
not sure how I missed this before.)
b0483263dda - Add support for SET ACCESS METHOD in ALTER TABLE
7b565843a94 - Add call to object access hook at the end of table rewrite in ALTER TABLE
9ce346eabf3 - Report progress of startup operations that take a long time.
f10f0ae420 - Replace RelationOpenSmgr() with RelationGetSmgr().
I'm especially worried that I screwed up or forgot something related to the last one (rd->rd_smgr changes), but I'm getting "All 210 tests passed".
About the last one, all rel->rd_smgr accesses need to be replaced with
RelationGetSmgr(). On the other hand we can simply remove
RelationOpenSmgr() calls since the target smgrrelation is guaranteed
to be loaded by RelationGetSmgr().
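To illustrate why the explicit open calls become unnecessary, here is a minimal sketch of the accessor pattern in plain C. All names (ToySmgr, ToyRelation, toy_smgr_open, toy_relation_get_smgr) are hypothetical stand-ins, not the PostgreSQL definitions; the only point is that an accessor which opens the handle on demand never hands a NULL handle to its caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins; these are NOT the PostgreSQL structures. */
typedef struct ToySmgr
{
	int			open_seq;		/* which open call produced this handle */
} ToySmgr;

typedef struct ToyRelation
{
	ToySmgr    *smgr;			/* NULL until first use */
} ToyRelation;

static int	toy_opens = 0;		/* number of times a handle was opened */

/* Hypothetical "open" step that the accessor hides from callers. */
static ToySmgr *
toy_smgr_open(void)
{
	ToySmgr    *s = malloc(sizeof(ToySmgr));

	if (s == NULL)
		abort();
	s->open_seq = ++toy_opens;
	return s;
}

/*
 * Accessor: opens the handle on demand and returns it.  Callers always
 * go through this function instead of reading rel->smgr directly.
 */
static ToySmgr *
toy_relation_get_smgr(ToyRelation *rel)
{
	if (rel->smgr == NULL)
		rel->smgr = toy_smgr_open();
	assert(rel->smgr != NULL);
	return rel->smgr;
}
```

In this toy model a second call returns the same handle without opening it again, which is why a separate "open" call before each access adds nothing.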
The fix you made for RelationCreate/DropInitFork is correct and the
changes you made would work, but I prefer that the code not be too
permissive for unknown (or unexpected) states.
Basic demonstration of this patch (with wal_level=minimal):
create unlogged table t6 (id bigint, t text);
-- produces 110GB table, takes ~5mins
insert into t6 select nextval('s1'), repeat('A', 1000) from generate_series(1, 100000000);
alter table t6 set logged;
on baseline SET LOGGED takes: ~7min10s
on patched SET LOGGED takes: 25s
So basically one can - thanks to this patch - use his application (performing classic INSERTs/UPDATEs/DELETEs, so without the need to rewrite to use COPY) and perform literally batch upload and then just switch the tables to LOGGED.
This result is significant. That operation still requires WAL writes
in the end, but I was not sure how much gain FPIs (or bulk WAL
logging) would give compared with ordinary per-operation WAL records.
Some more intensive testing also looks good, assuming table prepared to put pressure on WAL:
create unlogged table t_unlogged (id bigint, t text) partition by hash (id);
create unlogged table t_unlogged_h0 partition of t_unlogged FOR VALUES WITH (modulus 4, remainder 0);
[..]
create unlogged table t_unlogged_h3 partition of t_unlogged FOR VALUES WITH (modulus 4, remainder 3);
Workload would still be pretty heavy on LWLock/BufferContent, WALInsert and Lock/extend.
t_logged.sql = insert into t_logged select nextval('s1'), repeat('A', 1000) from generate_series(1, 1000); # according to pg_stat_wal.wal_bytes generates ~1MB of WAL
t_unlogged.sql = insert into t_unlogged select nextval('s1'), repeat('A', 1000) from generate_series(1, 1000); # according to pg_stat_wal.wal_bytes generates ~3kB of WAL
so using: pgbench -f <tabletypetest>.sql -T 30 -P 1 -c 32 -j 3 t
with synchronous_commit = ON (default):
with t_logged.sql: tps = 229 (lat avg = 138ms)
with t_unlogged.sql: tps = 283 (lat avg = 112ms) # almost all on LWLock/WALWrite
with synchronous_commit = OFF:
with t_logged.sql: tps = 413 (lat avg = 77ms)
with t_unlogged.sql: tps = 782 (lat avg = 40ms)
Afterwards switching the unlogged ~16GB partitions takes 5s per partition.
As the thread didn't get a lot of traction, I've registered it in the current commitfest https://commitfest.postgresql.org/36/3461/ with you as the author and in 'Ready for review' state.
I think it behaves like an almost finished patch, and after reading all the discussions about this feature, which go back more than 10 years, and the many failed efforts towards wal_level=noWAL, I think it would be nice to finally start getting some of it into core.
Thanks for running the performance benchmark.
I hadn't registered it, for the following reasons.
1. I'm not sure that we want to have the new mark files.
2. Aside from possible bugs, I'm not confident that the crash-safety of
this patch is actually water-tight. At least we need tests for some
failure cases.
3. As mentioned in transam/README, failure to remove smgr mark files
leads to an immediate shutdown. I'm not sure this behavior is acceptable.
4. For the reasons above, among others, this is not fully functional yet.
For example, if we execute the following commands on the primary,
the replica doesn't work correctly. (boom!)
=# CREATE UNLOGGED TABLE t (a int);
=# ALTER TABLE t SET LOGGED;
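For readers unfamiliar with the mark files mentioned in point 1: the patch names them "<filename>.<mark>", and digits are avoided for the mark character because ".<digit>" already denotes segment files. A minimal sketch of that naming rule follows. The helper name toy_mark_path, the example path, and the mark character 'u' are assumptions made for illustration only; they are not taken from the patch.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build "<relpath>.<mark>" in dst.
 * Returns 0 on success, -1 if the mark is unusable or dst is too small.
 */
static int
toy_mark_path(char *dst, size_t dstlen, const char *relpath, char mark)
{
	int			n;

	assert(dst != NULL && relpath != NULL);

	/* ".<digit>" is reserved for segment files, e.g. "16384.1" */
	if (mark == 0 || isdigit((unsigned char) mark))
		return -1;

	n = snprintf(dst, dstlen, "%s.%c", relpath, mark);

	/* snprintf reports the length it wanted; reject truncated results */
	if (n < 0 || (size_t) n >= dstlen)
		return -1;
	return 0;
}
```

Under these assumptions, toy_mark_path(buf, sizeof(buf), "base/5/16384", 'u') yields "base/5/16384.u", while a digit mark is rejected because it would collide with a segment file name.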
The following fixes are done in the attached v8.
- Rebased. Referring to Jakub and Justin's work, I replaced direct
access to ->rd_smgr with RelationGetSmgr() and removed calls to
RelationOpenSmgr(). I still separate the "ALTER TABLE ALL IN
TABLESPACE SET LOGGED/UNLOGGED" statement part.
- Fixed RelationCreate/DropInitFork's behavior for non-target
relations. (From Jakub's work).
- Fixed wording of some comments.
- On revisiting the patch, I found a bug around recovery. If the
logged-ness of a relation gets flipped repeatedly in a transaction,
duplicate pending-delete entries accumulate during recovery and
behave incorrectly. smgr_redo now adds at most one entry per action.
- The issue 4 above is not fixed (yet).
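The duplicate-entry problem described in the list above can be sketched with a toy pending-action list. This is only an illustration under assumed names (ToyPending, toy_add_pending, toy_do_pending, the TOY_OP_* flags); it is not how the patch implements the fix. It shows the general idea of registering at most one entry per relation, action and commit/abort outcome.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical action flags, loosely modeled on an operation mask. */
#define TOY_OP_UNLINK_MARK		(1 << 0)
#define TOY_OP_SET_PERSISTENCE	(1 << 1)

typedef struct ToyPending
{
	unsigned	relid;			/* relation the action applies to */
	int			op;				/* one of the TOY_OP_* flags */
	bool		at_commit;		/* true = run at commit, false = at abort */
	struct ToyPending *next;
} ToyPending;

static ToyPending *toy_pending = NULL;	/* head of the linked list */

/*
 * Register an action unless an identical one is already queued.
 * Returns true if a new entry was added.
 */
static bool
toy_add_pending(unsigned relid, int op, bool at_commit)
{
	ToyPending *p;

	assert(op != 0);

	for (p = toy_pending; p != NULL; p = p->next)
	{
		if (p->relid == relid && p->op == op && p->at_commit == at_commit)
			return false;		/* duplicate: keep the list unchanged */
	}

	p = malloc(sizeof(ToyPending));
	if (p == NULL)
		abort();
	p->relid = relid;
	p->op = op;
	p->at_commit = at_commit;
	p->next = toy_pending;
	toy_pending = p;
	return true;
}

/*
 * Process the list at transaction end.  Entries matching the outcome are
 * "executed" (only counted here); all entries are unlinked and freed.
 */
static int
toy_do_pending(bool is_commit)
{
	int			done = 0;

	while (toy_pending != NULL)
	{
		ToyPending *p = toy_pending;

		toy_pending = p->next;	/* unlink first, so nothing is retried */
		if (p->at_commit == is_commit)
			done++;				/* a real action would run here */
		free(p);
	}
	return done;
}
```

With this shape, flipping the same relation back and forth many times leaves a single entry per action rather than one entry per flip.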
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v8-0001-In-place-table-persistence-change.patch (text/x-patch; charset=us-ascii)
From c665734c9e056e80a0d56281011b95e55ea14507 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v8 1/2] In-place table persistence change
Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, it currently runs a heap rewrite, which causes a large
amount of file I/O. This patch makes the command run without a heap
rewrite. In addition, SET LOGGED while wal_level > minimal emits WAL
using XLOG_FPI instead of a massive number of HEAP_INSERTs, which
should be smaller.
This also allows for the cleanup of files left behind when the
transaction that created them crashes.
---
src/backend/access/rmgrdesc/smgrdesc.c | 52 +++
src/backend/access/transam/README | 8 +
src/backend/access/transam/xlog.c | 17 +
src/backend/catalog/storage.c | 539 +++++++++++++++++++++++--
src/backend/commands/tablecmds.c | 245 +++++++++--
src/backend/replication/basebackup.c | 3 +-
src/backend/storage/buffer/bufmgr.c | 88 ++++
src/backend/storage/file/fd.c | 4 +-
src/backend/storage/file/reinit.c | 345 +++++++++++-----
src/backend/storage/smgr/md.c | 93 ++++-
src/backend/storage/smgr/smgr.c | 32 ++
src/backend/storage/sync/sync.c | 20 +-
src/bin/pg_rewind/parsexlog.c | 24 ++
src/common/relpath.c | 47 ++-
src/include/catalog/storage.h | 2 +
src/include/catalog/storage_xlog.h | 42 +-
src/include/common/relpath.h | 9 +-
src/include/storage/bufmgr.h | 2 +
src/include/storage/fd.h | 1 +
src/include/storage/md.h | 8 +-
src/include/storage/reinit.h | 10 +-
src/include/storage/smgr.h | 17 +
22 files changed, 1401 insertions(+), 207 deletions(-)
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 7755553d57..d251f22207 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,49 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+ char *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+ appendStringInfoString(buf, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+ char *path = GetRelationPath(xlrec->rnode.dbNode,
+ xlrec->rnode.spcNode,
+ xlrec->rnode.relNode,
+ InvalidBackendId,
+ xlrec->forkNum, xlrec->mark);
+ char *action;
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ action = "CREATE";
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ action = "DELETE";
+ break;
+ default:
+ action = "<unknown action>";
+ break;
+ }
+
+ appendStringInfo(buf, "%s %s", action, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+ char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+ appendStringInfoString(buf, path);
+ appendStringInfo(buf, " persistence %d", xlrec->persistence);
+ pfree(path);
+ }
}
const char *
@@ -55,6 +98,15 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_UNLINK:
+ id = "UNLINK";
+ break;
+ case XLOG_SMGR_MARK:
+ id = "MARK";
+ break;
+ case XLOG_SMGR_BUFPERSISTENCE:
+ id = "BUFPERSISTENCE";
+ break;
}
return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c1..b344bbe511 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -724,6 +724,14 @@ we must panic and abort recovery. The DBA will have to manually clean up
then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.
+The Smgr MARK files
+--------------------------------
+
+An smgr mark file is created when a new relation file is created to
+mark that the relfilenode needs to be cleaned up at recovery time. In
+contrast to the four actions above, failure to remove smgr mark files
+will lead to data loss, in which case the server will shut down.
+
Skipping WAL for New RelFileNode
--------------------------------
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1e1fbe957f..59f4c2eacf 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -40,6 +40,7 @@
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
#include "catalog/pg_database.h"
+#include "catalog/storage.h"
#include "commands/progress.h"
#include "commands/tablespace.h"
#include "common/controldata_utils.h"
@@ -4564,6 +4565,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
{
ereport(DEBUG1,
(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+
+ /* cleanup garbage files left during crash recovery */
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
InArchiveRecovery = true;
if (StandbyModeRequested)
StandbyMode = true;
@@ -7824,6 +7833,14 @@ StartupXLOG(void)
}
}
+ /* cleanup garbage files left during crash recovery */
+ if (!InArchiveRecovery)
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
/* Allow resource managers to do any required cleanup. */
for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
{
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index c5ad28d71f..f2bcc12958 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,6 +19,7 @@
#include "postgres.h"
+#include "access/amapi.h"
#include "access/parallel.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -27,6 +28,7 @@
#include "access/xlogutils.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+#include "common/hashfn.h"
#include "miscadmin.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
@@ -57,9 +59,18 @@ int wal_skip_threshold = 2048; /* in kilobytes */
* but I'm being paranoid.
*/
+#define PDOP_DELETE (1 << 0)
+#define PDOP_UNLINK_FORK (1 << 1)
+#define PDOP_UNLINK_MARK (1 << 2)
+#define PDOP_SET_PERSISTENCE (1 << 3)
+
typedef struct PendingRelDelete
{
RelFileNode relnode; /* relation that may need to be deleted */
+ int op; /* operation mask */
+ bool bufpersistence; /* buffer persistence to set */
+ int unlink_forknum; /* forknum to unlink */
+ StorageMarks unlink_mark; /* mark to unlink */
BackendId backend; /* InvalidBackendId if not a temp rel */
bool atCommit; /* T=delete at commit; F=delete at abort */
int nestLevel; /* xact nesting level of request */
@@ -75,6 +86,24 @@ typedef struct PendingRelSync
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
HTAB *pendingSyncHash = NULL;
+typedef struct SRelHashEntry
+{
+ SMgrRelation srel;
+ char status; /* for simplehash use */
+} SRelHashEntry;
+
+/* define hashtable for workarea for pending deletes */
+#define SH_PREFIX srelhash
+#define SH_ELEMENT_TYPE SRelHashEntry
+#define SH_KEY_TYPE SMgrRelation
+#define SH_KEY srel
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((unsigned char *)&key, sizeof(SMgrRelation))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(SMgrRelation)) == 0)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
/*
* AddPendingSync
@@ -143,22 +172,48 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return NULL; /* placate compiler */
}
+ /*
+ * We are going to create a new storage file. If server crashes before the
+ * current transaction ends the file needs to be cleaned up but there's no
+ * clue to the orphan files. The SMGR_MARK_UNCOMMITTED mark file works as
+ * the signal of that situation.
+ */
srel = smgropen(rnode, backend);
+ log_smgrcreatemark(&rnode, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
smgrcreate(srel, MAIN_FORKNUM, false);
if (needs_wal)
log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM);
- /* Add the relation to the list of stuff to delete at abort */
+ /*
+ * Add the relation to the list of stuff to delete at abort. We don't
+ * remove the mark file at commit. It needs to persist until the main fork
+ * file is actually deleted. See SyncPostCheckpoint.
+ */
pending = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
pending->relnode = rnode;
+ pending->op = PDOP_DELETE;
pending->backend = backend;
pending->atCommit = false; /* delete if abort */
pending->nestLevel = GetCurrentTransactionNestLevel();
pending->next = pendingDeletes;
pendingDeletes = pending;
+ /* drop cleanup fork at commit */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_MARK;
+ pending->unlink_forknum = MAIN_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->backend = backend;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+
if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
{
Assert(backend == InvalidBackendId);
@@ -168,6 +223,226 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return srel;
}
+/*
+ * RelationCreateInitFork
+ * Create physical storage for the init fork of a relation.
+ *
+ * Create the init fork for the relation.
+ *
+ * This function is transactional. The creation is WAL-logged, and if the
+ * transaction aborts later on, the init fork will be removed.
+ */
+void
+RelationCreateInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingRelDelete *pending;
+ SMgrRelation srel;
+ PendingRelDelete *prev;
+ PendingRelDelete *next;
+ bool create = true;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), false, false);
+
+ /*
+ * If we have entries for init-fork operations on this relation, that means
+ * that we have already registered pending delete entries to drop
+ * preexisting init-fork since before the current transaction started. This
+ * function reverts that change just by removing the entries.
+ */
+ prev = NULL;
+ for (pending = pendingDeletes; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ /*
+ * We don't touch unrelated entries. Although init-fork related entries
+ * are not useful if the relation is created or dropped in this
+ * transaction, we don't bother to avoid registering entries for such
+ * relations here.
+ */
+ if (!RelFileNodeEquals(rnode, pending->relnode) ||
+ (pending->op & PDOP_DELETE) != 0 ||
+ pending->unlink_forknum != INIT_FORKNUM)
+ {
+ prev = pending;
+ continue;
+ }
+
+ /* make sure the entry is what we're expecting here */
+ Assert(((pending->op & (PDOP_UNLINK_FORK|PDOP_UNLINK_MARK)) != 0 &&
+ pending->unlink_forknum == INIT_FORKNUM) ||
+ (pending->op & PDOP_SET_PERSISTENCE) != 0);
+
+ /* unlink list entry */
+ if (prev)
+ prev->next = next;
+ else
+ pendingDeletes = next;
+ pfree(pending);
+
+ create = false;
+ }
+
+ if (!create)
+ return;
+
+ /*
+ * We are going to create an init fork. If server crashes before the
+ * current transaction ends the init fork left alone corrupts data while
+ * recovery. The mark file works as the sentinel to identify that
+ * situation.
+ */
+ srel = smgropen(rnode, InvalidBackendId);
+ log_smgrcreatemark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+
+ /* We don't have existing init fork, create it. */
+ smgrcreate(srel, INIT_FORKNUM, false);
+
+ /*
+ * index-init fork needs further initialization. ambuildempty should do
+ * WAL-log and file sync by itself but otherwise we do that by ourselves.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rnode, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /* drop the init fork, mark file and revert persistence at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_FORK | PDOP_UNLINK_MARK | PDOP_SET_PERSISTENCE;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->bufpersistence = true;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+
+ /* drop mark file at commit */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_MARK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+}
+
+/*
+ * RelationDropInitFork
+ * Delete physical storage for the init fork of a relation.
+ *
+ * Register pending-delete of the init fork. The real deletion is performed by
+ * smgrDoPendingDeletes at commit.
+ *
+ * This function is transactional. If the transaction aborts later on, the
+ * deletion doesn't happen.
+ */
+void
+RelationDropInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingRelDelete *pending;
+ PendingRelDelete *prev;
+ PendingRelDelete *next;
+ bool inxact_created = false;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), true, false);
+
+ /*
+ * If we have entries for init-fork operations of this relation, that means
+ * that we have created the init fork in the current transaction. We
+ * remove the init fork and mark file immediately in that case. Otherwise
+ * just register pending-delete for the existing init fork.
+ */
+ prev = NULL;
+ for (pending = pendingDeletes; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ /*
+ * We don't touch unrelated entries. Although init-fork related entries
+ * are not useful if the relation is created or dropped in this
+ * transaction, we don't bother to avoid registering entries for such
+ * relations here.
+ */
+ if (!RelFileNodeEquals(rnode, pending->relnode) ||
+ (pending->op & PDOP_DELETE) != 0 ||
+ pending->unlink_forknum != INIT_FORKNUM)
+ {
+ prev = pending;
+ continue;
+ }
+
+ /* make sure the entry is what we're expecting here */
+ Assert(((pending->op & (PDOP_UNLINK_FORK|PDOP_UNLINK_MARK)) != 0 &&
+ pending->unlink_forknum == INIT_FORKNUM) ||
+ (pending->op & PDOP_SET_PERSISTENCE) != 0);
+
+ /* unlink list entry */
+ if (prev)
+ prev->next = next;
+ else
+ pendingDeletes = next;
+ pfree(pending);
+
+ inxact_created = true;
+ }
+
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rnode, InvalidBackendId);
+
+ /*
+ * INIT forks are never loaded into shared buffers, so there is no
+ * point in dropping buffers for such files.
+ */
+ log_smgrunlinkmark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrunlinkmark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+ log_smgrunlink(&rnode, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ return;
+ }
+
+ /* register drop of this init fork file at commit */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_FORK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_SET_PERSISTENCE;
+ pending->bufpersistence = false;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+}
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -187,6 +462,88 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum)
+{
+ xl_smgr_unlink xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file unlink.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK (create) record to WAL.
+ */
+void
+log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file creation.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_CREATE;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK (unlink) record to WAL.
+ */
+void
+log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the mark file removal.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_UNLINK;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileNode *rnode, bool persistence)
+{
+ xl_smgr_bufpersistence xlrec;
+
+ /*
+ * Make an XLOG entry reporting the change of buffer persistence.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.persistence = persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -200,6 +557,7 @@ RelationDropStorage(Relation rel)
pending = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
pending->relnode = rel->rd_node;
+ pending->op = PDOP_DELETE;
pending->backend = rel->rd_backend;
pending->atCommit = true; /* delete if commit */
pending->nestLevel = GetCurrentTransactionNestLevel();
@@ -618,59 +976,104 @@ smgrDoPendingDeletes(bool isCommit)
int nrels = 0,
maxrels = 0;
SMgrRelation *srels = NULL;
+ srelhash_hash *close_srels = NULL;
+ bool found;
prev = NULL;
for (pending = pendingDeletes; pending != NULL; pending = next)
{
+ SMgrRelation srel;
+
next = pending->next;
if (pending->nestLevel < nestLevel)
{
/* outer-level entries should not be processed yet */
prev = pending;
+ continue;
}
+
+ /* unlink list entry first, so we don't retry on failure */
+ if (prev)
+ prev->next = next;
else
+ pendingDeletes = next;
+
+ if (pending->atCommit != isCommit)
{
- /* unlink list entry first, so we don't retry on failure */
- if (prev)
- prev->next = next;
- else
- pendingDeletes = next;
- /* do deletion if called for */
- if (pending->atCommit == isCommit)
- {
- SMgrRelation srel;
-
- srel = smgropen(pending->relnode, pending->backend);
-
- /* allocate the initial array, or extend it, if needed */
- if (maxrels == 0)
- {
- maxrels = 8;
- srels = palloc(sizeof(SMgrRelation) * maxrels);
- }
- else if (maxrels <= nrels)
- {
- maxrels *= 2;
- srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
- }
-
- srels[nrels++] = srel;
- }
/* must explicitly free the list entry */
pfree(pending);
/* prev does not change */
+ continue;
}
+
+ if (close_srels == NULL)
+ close_srels = srelhash_create(CurrentMemoryContext, 32, NULL);
+
+ srel = smgropen(pending->relnode, pending->backend);
+
+ /* Uniquify the smgr relations */
+ srelhash_insert(close_srels, srel, &found);
+
+ if (pending->op & PDOP_DELETE)
+ {
+ /* allocate the initial array, or extend it, if needed */
+ if (maxrels == 0)
+ {
+ maxrels = 8;
+ srels = palloc(sizeof(SMgrRelation) * maxrels);
+ }
+ else if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ srels[nrels++] = srel;
+ }
+
+ if (pending->op & PDOP_UNLINK_FORK)
+ {
+ /* other forks needs to drop buffers */
+ Assert(pending->unlink_forknum == INIT_FORKNUM);
+
+ /* Don't emit wal while recovery. */
+ if (!InRecovery)
+ log_smgrunlink(&pending->relnode, pending->unlink_forknum);
+ smgrunlink(srel, pending->unlink_forknum, false);
+ }
+
+ if (pending->op & PDOP_UNLINK_MARK)
+ {
+ if (!InRecovery)
+ log_smgrunlinkmark(&pending->relnode,
+ pending->unlink_forknum,
+ pending->unlink_mark);
+ smgrunlinkmark(srel, pending->unlink_forknum,
+ pending->unlink_mark, InRecovery);
+ }
+
+ if (pending->op & PDOP_SET_PERSISTENCE)
+ SetRelationBuffersPersistence(srel, pending->bufpersistence,
+ InRecovery);
}
if (nrels > 0)
{
smgrdounlinkall(srels, nrels, false);
-
- for (int i = 0; i < nrels; i++)
- smgrclose(srels[i]);
-
pfree(srels);
}
+
+ if (close_srels)
+ {
+ srelhash_iterator i;
+ SRelHashEntry *ent;
+
+ /* close smgr relations */
+ srelhash_start_iterate(close_srels, &i);
+ while ((ent = srelhash_iterate(close_srels, &i)) != NULL)
+ smgrclose(ent->srel);
+ srelhash_destroy(close_srels);
+ }
}
/*
@@ -840,7 +1243,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
- && pending->backend == InvalidBackendId)
+ && pending->backend == InvalidBackendId
+ && pending->op == PDOP_DELETE)
nrels++;
}
if (nrels == 0)
@@ -853,7 +1257,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
- && pending->backend == InvalidBackendId)
+ && pending->backend == InvalidBackendId &&
+ pending->op == PDOP_DELETE)
{
*rptr = pending->relnode;
rptr++;
@@ -933,6 +1338,15 @@ smgr_redo(XLogReaderState *record)
reln = smgropen(xlrec->rnode, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ smgrunlink(reln, xlrec->forkNum, true);
+ smgrclose(reln);
+ }
else if (info == XLOG_SMGR_TRUNCATE)
{
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
@@ -1021,6 +1435,65 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingRelDelete *pending;
+ bool created = false;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ smgrcreatemark(reln, xlrec->forkNum, xlrec->mark, true);
+ created = true;
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ smgrunlinkmark(reln, xlrec->forkNum, xlrec->mark, true);
+ break;
+ default:
+ elog(ERROR, "unknown smgr_mark action \"%c\"", xlrec->action);
+ }
+
+ if (created)
+ {
+ /* revert mark file operation at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = xlrec->rnode;
+ pending->op = PDOP_UNLINK_MARK;
+ pending->unlink_forknum = xlrec->forkNum;
+ pending->unlink_mark = xlrec->mark;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+ }
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingRelDelete *pending;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = xlrec->rnode;
+ pending->op = PDOP_SET_PERSISTENCE;
+ pending->bufpersistence = !xlrec->persistence;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index bf42587e38..afc77f0d98 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -52,6 +52,7 @@
#include "commands/defrem.h"
#include "commands/event_trigger.h"
#include "commands/policy.h"
+#include "commands/progress.h"
#include "commands/sequence.h"
#include "commands/tablecmds.h"
#include "commands/tablespace.h"
@@ -5329,6 +5330,166 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
return newcmd;
}
+/*
+ * RelationChangePersistence: do in-place persistence change of a relation
+ */
+static void
+RelationChangePersistence(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ int i;
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * If any of the following were set we would need to call ATRewriteTable,
+ * but none of them can be set in the AT_REWRITE_ALTER_PERSISTENCE case.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ /*
+ * First we collect all relations whose persistence we need to change.
+ */
+
+ /* Collect OIDs of indexes and toast relations */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ /* Add toast relation if any */
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ List *toastidx;
+ Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
+
+ relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
+ toastidx = RelationGetIndexList(toastrel);
+ relids = list_concat(relids, toastidx);
+ pfree(toastidx);
+ table_close(toastrel, NoLock);
+ }
+
+ table_close(rel, NoLock);
+
+ /* Make changes in storage */
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach (lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+
+ /*
+ * Some access methods do not accept in-place persistence change. For
+ * example, GiST uses page LSNs to figure out whether a block has
+ * changed, where UNLOGGED GiST indexes use fake LSNs that are
+ * incompatible with real LSNs used for LOGGED ones.
+ *
+ * XXXX: We don't bother to allow in-place persistence change for index
+ * methods other than btree for now.
+ */
+ if (r->rd_rel->relkind == RELKIND_INDEX &&
+ r->rd_rel->relam != BTREE_AM_OID)
+ {
+ int reindex_flags;
+
+ /* reindex doesn't allow concurrent use of the index */
+ table_close(r, NoLock);
+
+ reindex_flags =
+ REINDEX_REL_SUPPRESS_INDEX_USE |
+ REINDEX_REL_CHECK_CONSTRAINTS;
+
+ /* Set the same persistence as the parent relation. */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+
+ reindex_index(reloid, reindex_flags, persistence, 0);
+
+ continue;
+ }
+
+ /* Create or drop init fork */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ RelationCreateInitFork(r);
+ else
+ RelationDropInitFork(r);
+
+ /*
+ * When this relation gets WAL-logged, immediately sync all files except
+ * the init fork to establish the initial state on storage. Buffers have
+ * already been flushed out by RelationCreate(Drop)InitFork called just
+ * above. The init fork should have been synced as needed.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT)
+ {
+ for (i = 0 ; i < INIT_FORKNUM ; i++)
+ {
+ if (smgrexists(RelationGetSmgr(r), i))
+ smgrimmedsync(RelationGetSmgr(r), i);
+ }
+ }
+
+ /* Update catalog */
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+
+ /*
+ * While wal_level >= replica, switching to LOGGED requires the
+ * relation content to be WAL-logged to recover the table.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded())
+ {
+ ForkNumber fork;
+
+ for (fork = 0; fork < INIT_FORKNUM ; fork++)
+ {
+ if (smgrexists(RelationGetSmgr(r), fork))
+ log_newpage_range(r, fork, 0,
+ smgrnblocks(RelationGetSmgr(r), fork),
+ false);
+ }
+ }
+
+ table_close(r, NoLock);
+ }
+
+ table_close(classRel, NoLock);
+}
+
/*
* ATRewriteTables: ALTER TABLE phase 3
*/
@@ -5459,47 +5620,55 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- /*
- * Create transient table that will receive the modified data.
- *
- * Ensure it is marked correctly as logged or unlogged. We have
- * to do this here so that buffers for the new relfilenode will
- * have the right persistence set, and at the same time ensure
- * that the original filenode's buffers will get read in with the
- * correct setting (i.e. the original one). Otherwise a rollback
- * after the rewrite would possibly result with buffers for the
- * original filenode having the wrong persistence setting.
- *
- * NB: This relies on swap_relation_files() also swapping the
- * persistence. That wouldn't work for pg_class, but that can't be
- * unlogged anyway.
- */
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
- persistence, lockmode);
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
+ RelationChangePersistence(tab, persistence, lockmode);
+ else
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new relfilenode
+ * will have the right persistence set, and at the same time
+ * ensure that the original filenode's buffers will get read in
+ * with the correct setting (i.e. the original one). Otherwise
+ * a rollback after the rewrite would possibly result with
+ * buffers for the original filenode having the wrong
+ * persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that can't
+ * be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace,
+ NewAccessMethod,
+ persistence, lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
+ /*
+ * Swap the physical files of the old and new heaps, then
+ * rebuild indexes and discard the old heap. We can use
+ * RecentXmin for the table's new relfrozenxid because we
+ * rewrote all the tuples in ATRewriteTable, so no older Xid
+ * remains in the table. Also, we never try to swap toast
+ * tables by content, since we have no interest in letting this
+ * code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
- InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ }
}
else
{
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index ec0485705d..45e1a5d817 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1070,6 +1070,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
bool excludeFound;
ForkNumber relForkNum; /* Type of fork if file is a relation */
int relOidChars; /* Chars in filename that are the rel oid */
+ StorageMarks mark;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1120,7 +1121,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
/* Exclude all forks for unlogged tables except the init fork */
if (isDbDir &&
parse_filename_for_nontemp_relation(de->d_name, &relOidChars,
- &relForkNum))
+ &relForkNum, &mark))
{
/* Never exclude init forks */
if (relForkNum != INIT_FORKNUM)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index b4532948d3..dab74bf99a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -37,6 +37,7 @@
#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
#include "executor/instrument.h"
#include "lib/binaryheap.h"
#include "miscadmin.h"
@@ -3154,6 +3155,93 @@ DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
}
}
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a relation.
+ * When switching to PERMANENT, it also writes all dirty pages of the
+ * relation out to disk (or more accurately, out to kernel disk buffers),
+ * ensuring that the kernel has an up-to-date view of the relation.
+ *
+ * Generally, the caller should be holding AccessExclusiveLock on the
+ * target relation to ensure that no other backend is busy dirtying
+ * more blocks of the relation; the effects can't be expected to last
+ * after the lock is released.
+ *
+ * XXX currently it sequentially searches the buffer pool, should be
+ * changed to more clever ways of searching. This routine is not
+ * used in any performance-critical code paths, so it's not worth
+ * adding additional overhead to normal paths to make it go faster;
+ * but see also DropRelFileNodeBuffers.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
+{
+ int i;
+ RelFileNodeBackend rnode = srel->smgr_rnode;
+
+ Assert (!RelFileNodeBackendIsTemp(rnode));
+
+ if (!isRedo)
+ log_smgrbufpersistence(&srel->smgr_rnode.node, permanent);
+
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ {
+ UnlockBufHdr(bufHdr, buf_state);
+ continue;
+ }
+
+ if (permanent)
+ {
+ /* Init fork is being dropped, drop buffers for it. */
+ if (bufHdr->tag.forkNum == INIT_FORKNUM)
+ {
+ InvalidateBuffer(bufHdr);
+ continue;
+ }
+
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /* we flush this buffer when switching to PERMANENT */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, srel);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr, true);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ /* init fork is always BM_PERMANENT. See BufferAlloc */
+ if (bufHdr->tag.forkNum != INIT_FORKNUM)
+ buf_state &= ~BM_PERMANENT;
+
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ }
+}
+
/* ---------------------------------------------------------------------
* DropRelFileNodesAllBuffers
*
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 263057841d..8487ae1f02 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -349,8 +349,6 @@ static void pre_sync_fname(const char *fname, bool isdir, int elevel);
static void datadir_fsync_fname(const char *fname, bool isdir, int elevel);
static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel);
-static int fsync_parent_path(const char *fname, int elevel);
-
/*
* pg_fsync --- do fsync with or without writethrough
@@ -3759,7 +3757,7 @@ fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel)
* This is aimed at making file operations persistent on disk in case of
* an OS crash or power failure.
*/
-static int
+int
fsync_parent_path(const char *fname, int elevel)
{
char parentpath[MAXPGPATH];
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 0ae3fb6902..f8458a1e1e 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -16,29 +16,49 @@
#include <unistd.h>
+#include "access/xlog.h"
+#include "catalog/pg_tablespace_d.h"
#include "common/relpath.h"
#include "postmaster/startup.h"
+#include "storage/bufmgr.h"
#include "storage/copydir.h"
#include "storage/fd.h"
+#include "storage/md.h"
#include "storage/reinit.h"
+#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
- int op);
+ Oid tspid, int op);
static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
- int op);
+ Oid tspid, Oid dbid, int op);
typedef struct
{
Oid reloid; /* hash key */
-} unlogged_relation_entry;
+ bool has_init; /* has INIT fork */
+ bool dirty_init; /* needs to remove INIT fork */
+ bool dirty_all; /* needs to remove all forks */
+} relfile_entry;
/*
- * Reset unlogged relations from before the last restart.
+ * Clean up and reset relation files from before the last restart.
*
- * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any
- * relation with an "init" fork, except for the "init" fork itself.
+ * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations
+ * depending on the existence of the "cleanup" forks.
+ *
+ * If SMGR_MARK_UNCOMMITTED mark file for init fork is present, we remove the
+ * init fork along with the mark file.
+ *
+ * If SMGR_MARK_UNCOMMITTED mark file for main fork is present we remove the
+ * whole relation along with the mark file.
+ *
+ * Otherwise, if the "init" fork is found, we remove all forks of any relation
+ * with the "init" fork, except for the "init" fork itself.
+ *
+ * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all
+ * relations that have the "cleanup" and/or the "init" forks.
*
* If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main
* fork.
@@ -72,7 +92,7 @@ ResetUnloggedRelations(int op)
/*
* First process unlogged files in pg_default ($PGDATA/base)
*/
- ResetUnloggedRelationsInTablespaceDir("base", op);
+ ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op);
/*
* Cycle through directories for all non-default tablespaces.
@@ -81,13 +101,19 @@ ResetUnloggedRelations(int op)
while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
{
+ Oid tspid;
+
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
continue;
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
- ResetUnloggedRelationsInTablespaceDir(temp_path, op);
+
+ tspid = atooid(spc_de->d_name);
+
+ Assert(tspid != 0);
+ ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op);
}
FreeDir(spc_dir);
@@ -103,7 +129,8 @@ ResetUnloggedRelations(int op)
* Process one tablespace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
+ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
+ Oid tspid, int op)
{
DIR *ts_dir;
struct dirent *de;
@@ -130,6 +157,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
while ((de = ReadDir(ts_dir, tsdirname)) != NULL)
{
+ Oid dbid;
+
/*
* We're only interested in the per-database directories, which have
* numeric names. Note that this code will also (properly) ignore "."
@@ -148,7 +177,10 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
ereport_startup_progress("resetting unlogged relations (cleanup), elapsed time: %ld.%02d s, current path: %s",
dbspace_path);
- ResetUnloggedRelationsInDbspaceDir(dbspace_path, op);
+ dbid = atooid(de->d_name);
+ Assert(dbid != 0);
+
+ ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op);
}
FreeDir(ts_dir);
@@ -158,125 +190,228 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
* Process one per-dbspace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
+ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
+ Oid tspid, Oid dbid, int op)
{
DIR *dbspace_dir;
struct dirent *de;
char rm_path[MAXPGPATH * 2];
+ HTAB *hash;
+ HASHCTL ctl;
/* Caller must specify at least one operation. */
- Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0);
+ Assert((op & (UNLOGGED_RELATION_CLEANUP |
+ UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_INIT)) != 0);
/*
* Cleanup is a two-pass operation. First, we go through and identify all
* the files with init forks. Then, we go through again and nuke
* everything with the same OID except the init fork.
*/
+
+ /*
+ * It's possible that someone could create a ton of unlogged relations
+ * in the same database & tablespace, so we'd better use a hash table
+ * rather than an array or linked list to keep track of which files
+ * need to be reset. Otherwise, this cleanup operation would be
+ * O(n^2).
+ */
+ memset(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(relfile_entry);
+ hash = hash_create("relfilenode cleanup hash",
+ 32, &ctl, HASH_ELEM | HASH_BLOBS);
+
+ /* Collect INIT fork and mark files in the directory. */
+ dbspace_dir = AllocateDir(dbspacedirname);
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
+ {
+ int oidchars;
+ ForkNumber forkNum;
+ StorageMarks mark;
+
+ /* Skip anything that doesn't look like a relation data file. */
+ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
+ &forkNum, &mark))
+ continue;
+
+ if (forkNum == INIT_FORKNUM || mark == SMGR_MARK_UNCOMMITTED)
+ {
+ Oid key;
+ relfile_entry *ent;
+ bool found;
+
+ /*
+ * Record the relfilenode information. If it has
+ * SMGR_MARK_UNCOMMITTED mark files, the relfilenode is in a dirty
+ * state and cleanup is needed.
+ */
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_ENTER, &found);
+
+ if (!found)
+ {
+ ent->has_init = false;
+ ent->dirty_init = false;
+ ent->dirty_all = false;
+ }
+
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_init = true;
+ else if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_all = true;
+ else
+ {
+ Assert(forkNum == INIT_FORKNUM);
+ ent->has_init = true;
+ }
+ }
+ }
+
+ /* Done with the first pass. */
+ FreeDir(dbspace_dir);
+
+ /* nothing to do if we have neither init nor cleanup forks */
+ if (hash_get_num_entries(hash) < 1)
+ {
+ hash_destroy(hash);
+ return;
+ }
+
+ if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0)
+ {
+ /*
+ * When we come here after recovery, an smgr object for this file might
+ * have been created. In that case we need to drop all buffers and then
+ * the smgr object before initializing the unlogged relation. This is
+ * safe as long as no other backend has accessed the relation before
+ * starting archive recovery.
+ */
+ HASH_SEQ_STATUS status;
+ relfile_entry *ent;
+ SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8);
+ int maxrels = 8;
+ int nrels = 0;
+ int i;
+
+ Assert(!HotStandbyActive());
+
+ hash_seq_init(&status, hash);
+ while((ent = (relfile_entry *) hash_seq_search(&status)) != NULL)
+ {
+ RelFileNodeBackend rel;
+
+ /*
+ * The relation is persistent and remains persistent. Don't
+ * drop the buffers for this relation.
+ */
+ if (ent->has_init && ent->dirty_init)
+ continue;
+
+ if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = ent->reloid;
+
+ srels[nrels++] = smgropen(rel.node, InvalidBackendId);
+ }
+
+ DropRelFileNodesAllBuffers(srels, nrels);
+
+ for (i = 0 ; i < nrels ; i++)
+ smgrclose(srels[i]);
+ }
+
+ /*
+ * Now, make a second pass and remove anything that matches.
+ */
if ((op & UNLOGGED_RELATION_CLEANUP) != 0)
{
- HTAB *hash;
- HASHCTL ctl;
-
- /*
- * It's possible that someone could create a ton of unlogged relations
- * in the same database & tablespace, so we'd better use a hash table
- * rather than an array or linked list to keep track of which files
- * need to be reset. Otherwise, this cleanup operation would be
- * O(n^2).
- */
- ctl.keysize = sizeof(Oid);
- ctl.entrysize = sizeof(unlogged_relation_entry);
- ctl.hcxt = CurrentMemoryContext;
- hash = hash_create("unlogged relation OIDs", 32, &ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- /* Scan the directory. */
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ int oidchars;
+ Oid key;
+ relfile_entry *ent;
+ RelFileNodeBackend rel;
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* Also skip it unless this is the init fork. */
- if (forkNum != INIT_FORKNUM)
- continue;
-
- /*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
- */
- ent.reloid = atooid(de->d_name);
- (void) hash_search(hash, &ent, HASH_ENTER, NULL);
- }
-
- /* Done with the first pass. */
- FreeDir(dbspace_dir);
-
- /*
- * If we didn't find any init forks, there's no point in continuing;
- * we can bail out now.
- */
- if (hash_get_num_entries(hash) == 0)
- {
- hash_destroy(hash);
- return;
- }
-
- /*
- * Now, make a second pass and remove anything that matches.
- */
- dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
- {
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
-
- /* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* We never remove the init fork. */
- if (forkNum == INIT_FORKNUM)
+ &forkNum, &mark))
continue;
/*
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
- if (hash_search(hash, &ent, HASH_FIND, NULL))
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);
+
+ if (!ent)
+ continue;
+
+ if (!ent->dirty_all)
{
- snprintf(rm_path, sizeof(rm_path), "%s/%s",
- dbspacedirname, de->d_name);
- if (unlink(rm_path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m",
- rm_path)));
+ /* clean permanent relations don't need cleanup */
+ if (!ent->has_init)
+ continue;
+
+ if (ent->dirty_init)
+ {
+ /*
+ * The crashed transaction did SET UNLOGGED. This relation
+ * is restored to a LOGGED relation.
+ */
+ if (forkNum != INIT_FORKNUM)
+ continue;
+ }
else
- elog(DEBUG2, "unlinked file \"%s\"", rm_path);
+ {
+ /*
+ * we don't remove the INIT fork of a non-dirty
+ * relfilenode
+ */
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE)
+ continue;
+ }
}
+
+ /* so, nuke it! */
+ snprintf(rm_path, sizeof(rm_path), "%s/%s",
+ dbspacedirname, de->d_name);
+ if (unlink(rm_path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ rm_path)));
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = atooid(de->d_name);
+
+ ForgetRelationForkSyncRequests(rel, forkNum);
}
/* Cleanup is complete. */
FreeDir(dbspace_dir);
- hash_destroy(hash);
}
+ hash_destroy(hash);
+ hash = NULL;
+
/*
* Initialization happens after cleanup is complete: we copy each init
- * fork file to the corresponding main fork file. Note that if we are
- * asked to do both cleanup and init, we may never get here: if the
- * cleanup code determines that there are no init forks in this dbspace,
- * it will return before we get to this point.
+ * fork file to the corresponding main fork file.
*/
if ((op & UNLOGGED_RELATION_INIT) != 0)
{
@@ -285,6 +420,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char srcpath[MAXPGPATH * 2];
@@ -292,9 +428,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -328,15 +466,18 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char mainpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -379,7 +520,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
*/
bool
parse_filename_for_nontemp_relation(const char *name, int *oidchars,
- ForkNumber *fork)
+ ForkNumber *fork, StorageMarks *mark)
{
int pos;
@@ -410,11 +551,19 @@ parse_filename_for_nontemp_relation(const char *name, int *oidchars,
for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar)
;
- if (segchar <= 1)
- return false;
- pos += segchar;
+ if (segchar > 1)
+ pos += segchar;
}
+ /* mark file? */
+ if (name[pos] == '.' && name[pos + 1] != 0)
+ {
+ *mark = name[pos + 1];
+ pos += 2;
+ }
+ else
+ *mark = SMGR_MARK_NONE;
+
/* Now we should be at the end. */
if (name[pos] != '\0')
return false;
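For reviewers, here is a minimal standalone sketch of the file-name grammar that the modified parse_filename_for_nontemp_relation() appears to accept: digits for the relfilenode, an optional "_<fork>" suffix, an optional ".<digits>" segment number, and an optional ".<char>" mark. The function name parse_relfile_name, the MARK_NONE constant and the integer fork numbering are assumptions made for illustration only; the sketch also omits the OIDCHARS length check of the real function.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

#define MARK_NONE 0				/* hypothetical stand-in for SMGR_MARK_NONE */

/* Fork suffixes as they appear in relation file names (main fork has none). */
static const char *const fork_suffixes[] = {"fsm", "vm", "init"};

/*
 * Parse "<digits>[_fork][.<segno>][.<mark>]".  Returns true only if the whole
 * name matches.  *fork is 0 for the main fork, 1..3 for fsm/vm/init.
 */
static bool
parse_relfile_name(const char *name, int *oidchars, int *fork, char *mark)
{
	int			pos = 0;

	assert(name != NULL);

	/* Leading digits: the relfilenode. */
	while (isdigit((unsigned char) name[pos]))
		pos++;
	if (pos == 0)
		return false;
	*oidchars = pos;

	/* Optional "_<fork>" suffix. */
	*fork = 0;
	if (name[pos] == '_')
	{
		bool		matched = false;

		for (int i = 0; i < 3; i++)
		{
			size_t		len = strlen(fork_suffixes[i]);

			if (strncmp(name + pos + 1, fork_suffixes[i], len) == 0)
			{
				*fork = i + 1;
				pos += 1 + (int) len;
				matched = true;
				break;
			}
		}
		if (!matched)
			return false;
	}

	/* Optional ".<digits>" segment number; not an error if absent. */
	if (name[pos] == '.')
	{
		int			seg = 1;

		while (isdigit((unsigned char) name[pos + seg]))
			seg++;
		if (seg > 1)
			pos += seg;
	}

	/* Optional ".<char>" mark, a single character after the dot. */
	if (name[pos] == '.' && name[pos + 1] != '\0')
	{
		*mark = name[pos + 1];
		pos += 2;
	}
	else
		*mark = MARK_NONE;

	/* Now we should be at the end. */
	return name[pos] == '\0';
}
```

Under these assumptions, "16384_init.u" parses as the init fork carrying mark 'u', "16384.1" as a plain segment file without a mark, and a trailing dot as in "16384." is rejected because nothing follows it.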
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index b4bca7eed6..580b74839f 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -139,7 +139,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
BlockNumber blkno, bool skipFsync, int behavior);
static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
MdfdVec *seg);
-
+static bool mdmarkexists(SMgrRelation reln, ForkNumber forkNum,
+ StorageMarks mark);
/*
* mdinit() -- Initialize private state for magnetic disk storage manager.
@@ -169,6 +170,81 @@ mdexists(SMgrRelation reln, ForkNumber forkNum)
return (mdopenfork(reln, forkNum, EXTENSION_RETURN_NULL) != NULL);
}
+/*
+ * mdcreatemark() -- Create a mark file.
+ *
+ * If isRedo is true, it's okay for the file to exist already.
+ */
+void
+mdcreatemark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+ int fd;
+
+ /* See mdcreate for details. */
+ TablespaceCreateDbspace(reln->smgr_rnode.node.spcNode,
+ reln->smgr_rnode.node.dbNode,
+ isRedo);
+
+ fd = BasicOpenFile(path, O_WRONLY | O_CREAT | O_EXCL);
+ if (fd < 0 && (!isRedo || errno != EEXIST))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create mark file \"%s\": %m", path)));
+
+ pg_fsync(fd);
+ close(fd);
+
+ /*
+ * To guarantee that the creation of the file is persistent, fsync its
+ * parent directory.
+ */
+ fsync_parent_path(path, ERROR);
+
+ pfree(path);
+}
+
+
+/*
+ * mdunlinkmark() -- Delete the mark file
+ *
+ * If isRedo is true, it's okay if the file is not found.
+ */
+void
+mdunlinkmark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+
+ if (!isRedo || mdmarkexists(reln, forkNum, mark))
+ durable_unlink(path, ERROR);
+
+ pfree(path);
+}
+
+/*
+ * mdmarkexists() -- Check if the file exists.
+ */
+static bool
+mdmarkexists(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+ int fd;
+
+ fd = BasicOpenFile(path, O_RDONLY);
+ if (fd < 0 && errno != ENOENT)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not access mark file \"%s\": %m", path)));
+ pfree(path);
+
+ if (fd < 0)
+ return false;
+ close(fd); /* don't leak the descriptor */
+ return true;
+}
+
/*
* mdcreate() -- Create a new relation on magnetic disk.
*
@@ -1025,6 +1101,15 @@ register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum,
RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ );
}
+/*
+ * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork
+ */
+void
+ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, ForkNumber forknum)
+{
+ register_forget_request(rnode, forknum, 0);
+}
+
/*
* ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB
*/
@@ -1378,12 +1463,14 @@ mdsyncfiletag(const FileTag *ftag, char *path)
* Return 0 on success, -1 on failure, with errno set.
*/
int
-mdunlinkfiletag(const FileTag *ftag, char *path)
+mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark)
{
char *p;
/* Compute the path. */
- p = relpathperm(ftag->rnode, MAIN_FORKNUM);
+ p = GetRelationPath(ftag->rnode.dbNode, ftag->rnode.spcNode,
+ ftag->rnode.relNode, InvalidBackendId, MAIN_FORKNUM,
+ mark);
strlcpy(path, p, MAXPGPATH);
pfree(p);
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 0fcef4994b..110e64b0b2 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -62,6 +62,10 @@ typedef struct f_smgr
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+ void (*smgr_createmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+ void (*smgr_unlinkmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
} f_smgr;
static const f_smgr smgrsw[] = {
@@ -82,6 +86,8 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
+ .smgr_createmark = mdcreatemark,
+ .smgr_unlinkmark = mdunlinkmark,
}
};
@@ -335,6 +341,26 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo);
}
+/*
+ * smgrcreatemark() -- Create a mark file
+ */
+void
+smgrcreatemark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_createmark(reln, forknum, mark, isRedo);
+}
+
+/*
+ * smgrunlinkmark() -- Delete a mark file
+ */
+void
+smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlinkmark(reln, forknum, mark, isRedo);
+}
+
/*
* smgrdosyncall() -- Immediately sync all forks of all given relations
*
@@ -662,6 +688,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index d4083e8a56..9563940d45 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -89,7 +89,8 @@ static CycleCtr checkpoint_cycle_ctr = 0;
typedef struct SyncOps
{
int (*sync_syncfiletag) (const FileTag *ftag, char *path);
- int (*sync_unlinkfiletag) (const FileTag *ftag, char *path);
+ int (*sync_unlinkfiletag) (const FileTag *ftag, char *path,
+ StorageMarks mark);
bool (*sync_filetagmatches) (const FileTag *ftag,
const FileTag *candidate);
} SyncOps;
@@ -222,7 +223,8 @@ SyncPostCheckpoint(void)
/* Unlink the file */
if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
- path) < 0)
+ path,
+ SMGR_MARK_NONE) < 0)
{
/*
* There's a race condition, when the database is dropped at the
@@ -236,6 +238,20 @@ SyncPostCheckpoint(void)
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m", path)));
}
+ else if (syncsw[entry->tag.handler].sync_unlinkfiletag(
+ &entry->tag, path,
+ SMGR_MARK_UNCOMMITTED) < 0)
+ {
+ /*
+ * There may also be an SMGR_MARK_UNCOMMITTED file. Remove it if the
+ * fork files have been successfully removed. It's ok if the file
+ * does not exist.
+ */
+ if (errno != ENOENT)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
/* Mark the list entry as canceled, just in case */
entry->canceled = true;
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 436df54120..dbc0da5da5 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -407,6 +407,30 @@ extractPageInfo(XLogReaderState *record)
* source system.
*/
}
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_UNLINK)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_MARK)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
else if (rmid == RM_XACT_ID &&
((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT ||
(rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED ||
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 1f5c426ec0..67f24890d6 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -139,9 +139,15 @@ GetDatabasePath(Oid dbNode, Oid spcNode)
*/
char *
GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
- int backendId, ForkNumber forkNumber)
+ int backendId, ForkNumber forkNumber, char mark)
{
char *path;
+ char markstr[10];
+
+ if (mark == 0)
+ markstr[0] = 0;
+ else
+ snprintf(markstr, 10, ".%c", mark);
if (spcNode == GLOBALTABLESPACE_OID)
{
@@ -149,10 +155,10 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
Assert(dbNode == 0);
Assert(backendId == InvalidBackendId);
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("global/%u_%s",
- relNode, forkNames[forkNumber]);
+ path = psprintf("global/%u_%s%s",
+ relNode, forkNames[forkNumber], markstr);
else
- path = psprintf("global/%u", relNode);
+ path = psprintf("global/%u%s", relNode, markstr);
}
else if (spcNode == DEFAULTTABLESPACE_OID)
{
@@ -160,22 +166,22 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/%u_%s",
+ path = psprintf("base/%u/%u_%s%s",
dbNode, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/%u",
- dbNode, relNode);
+ path = psprintf("base/%u/%u%s",
+ dbNode, relNode, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/t%d_%u_%s",
+ path = psprintf("base/%u/t%d_%u_%s%s",
dbNode, backendId, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/t%d_%u",
- dbNode, backendId, relNode);
+ path = psprintf("base/%u/t%d_%u%s",
+ dbNode, backendId, relNode, markstr);
}
}
else
@@ -184,27 +190,28 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u_%s%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
dbNode, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
- dbNode, relNode);
+ dbNode, relNode, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
dbNode, backendId, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
- dbNode, backendId, relNode);
+ dbNode, backendId, relNode, markstr);
}
}
+
return path;
}
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 0ab32b44e9..382623159c 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -23,6 +23,8 @@
extern int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
+extern void RelationCreateInitFork(Relation rel);
+extern void RelationDropInitFork(Relation rel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationPreTruncate(Relation rel);
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index f0814f1458..12346ed7f6 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -18,17 +18,23 @@
#include "lib/stringinfo.h"
#include "storage/block.h"
#include "storage/relfilenode.h"
+#include "storage/smgr.h"
/*
* Declarations for smgr-related XLOG records
*
- * Note: we log file creation and truncation here, but logging of deletion
- * actions is handled by xact.c, because it is part of transaction commit.
+ * Note: we log file creation, truncation and buffer persistence change here,
+ * but logging of deletion actions is handled mainly by xact.c, because it is
+ * part of transaction commit in most cases. However, there's a case where
+ * init forks are deleted outside control of transaction.
*/
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_UNLINK 0x30
+#define XLOG_SMGR_MARK 0x40
+#define XLOG_SMGR_BUFPERSISTENCE 0x50
typedef struct xl_smgr_create
{
@@ -36,6 +42,32 @@ typedef struct xl_smgr_create
ForkNumber forkNum;
} xl_smgr_create;
+typedef struct xl_smgr_unlink
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+} xl_smgr_unlink;
+
+typedef enum smgr_mark_action
+{
+ XLOG_SMGR_MARK_CREATE = 'c',
+ XLOG_SMGR_MARK_UNLINK = 'u'
+} smgr_mark_action;
+
+typedef struct xl_smgr_mark
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ smgr_mark_action action;
+} xl_smgr_mark;
+
+typedef struct xl_smgr_bufpersistence
+{
+ RelFileNode rnode;
+ bool persistence;
+} xl_smgr_bufpersistence;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +83,12 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark);
+extern void log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark);
+extern void log_smgrbufpersistence(const RelFileNode *rnode, bool persistence);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index a44be11ca0..106a5cf508 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -67,7 +67,7 @@ extern int forkname_chars(const char *str, ForkNumber *fork);
extern char *GetDatabasePath(Oid dbNode, Oid spcNode);
extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
- int backendId, ForkNumber forkNumber);
+ int backendId, ForkNumber forkNumber, char mark);
/*
* Wrapper macros for GetRelationPath. Beware of multiple
@@ -77,7 +77,7 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
/* First argument is a RelFileNode */
#define relpathbackend(rnode, backend, forknum) \
GetRelationPath((rnode).dbNode, (rnode).spcNode, (rnode).relNode, \
- backend, forknum)
+ backend, forknum, 0)
/* First argument is a RelFileNode */
#define relpathperm(rnode, forknum) \
@@ -87,4 +87,9 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
#define relpath(rnode, forknum) \
relpathbackend((rnode).node, (rnode).backend, forknum)
+/* First argument is a RelFileNodeBackend */
+#define markpath(rnode, forknum, mark) \
+ GetRelationPath((rnode).node.dbNode, (rnode).node.spcNode, \
+ (rnode).node.relNode, \
+ (rnode).backend, forknum, mark)
#endif /* RELPATH_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index cfce23ecbc..f5a7df87a4 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -206,6 +206,8 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels)
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+ bool permanent, bool isRedo);
extern void DropRelFileNodesAllBuffers(struct SMgrRelationData **smgr_reln, int nnodes);
extern void DropDatabaseBuffers(Oid dbid);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 34602ae006..2dc0357ad5 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -185,6 +185,7 @@ extern ssize_t pg_pwritev_with_retry(int fd,
extern int pg_truncate(const char *path, off_t length);
extern void fsync_fname(const char *fname, bool isdir);
extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
+extern int fsync_parent_path(const char *fname, int elevel);
extern int durable_rename(const char *oldfile, const char *newfile, int loglevel);
extern int durable_unlink(const char *fname, int loglevel);
extern int durable_rename_excl(const char *oldfile, const char *newfile, int loglevel);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 752b440864..99620816b5 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -23,6 +23,10 @@
extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
extern void mdclose(SMgrRelation reln, ForkNumber forknum);
+extern void mdcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void mdunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
@@ -41,12 +45,14 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void ForgetRelationForkSyncRequests(RelFileNodeBackend rnode,
+ ForkNumber forknum);
extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
/* md sync callbacks */
extern int mdsyncfiletag(const FileTag *ftag, char *path);
-extern int mdunlinkfiletag(const FileTag *ftag, char *path);
+extern int mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark);
extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *candidate);
#endif /* MD_H */
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index fad1e5c473..e1f97e9b89 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -16,13 +16,15 @@
#define REINIT_H
#include "common/relpath.h"
-
+#include "storage/smgr.h"
extern void ResetUnloggedRelations(int op);
-extern bool parse_filename_for_nontemp_relation(const char *name,
- int *oidchars, ForkNumber *fork);
+extern bool parse_filename_for_nontemp_relation(const char *name, int *oidchars,
+ ForkNumber *fork,
+ StorageMarks *mark);
#define UNLOGGED_RELATION_CLEANUP 0x0001
-#define UNLOGGED_RELATION_INIT 0x0002
+#define UNLOGGED_RELATION_DROP_BUFFER 0x0002
+#define UNLOGGED_RELATION_INIT 0x0004
#endif /* REINIT_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a6fbf7b6a6..201ecace8a 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,18 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
+/*
+ * A storage mark is a file whose existence indicates something about
+ * another file. Such a file is named "<filename>.<mark>", where the mark is
+ * one of the values of StorageMarks. Since ".<digit>" denotes a segment file,
+ * digits must not be used as the mark character.
+ */
+typedef enum StorageMarks
+{
+ SMGR_MARK_NONE = 0,
+ SMGR_MARK_UNCOMMITTED = 'u' /* the file is not committed yet */
+} StorageMarks;
+
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
@@ -85,7 +97,12 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
extern void smgrclose(SMgrRelation reln);
extern void smgrcloseall(void);
extern void smgrclosenode(RelFileNodeBackend rnode);
+extern void smgrcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void smgrunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
--
2.27.0
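A minimal sketch of the mark-file lifecycle introduced by the md.c, relpath.c and smgr.h hunks above, assuming a POSIX system and using plain open/fsync/unlink rather than PostgreSQL's fd.c layer. The helper names (mark_path, fsync_parent, create_mark, mark_exists, unlink_mark) are hypothetical and exist only for illustration; the patch itself goes through BasicOpenFile, pg_fsync, fsync_parent_path and durable_unlink, and reports failures with ereport instead of return codes.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <libgen.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: "<relfile>.<mark>", or just "<relfile>" when mark is 0. */
static void
mark_path(char *dst, size_t dstlen, const char *relfile, char mark)
{
    assert(dstlen > 0);
    if (mark == 0)
        snprintf(dst, dstlen, "%s", relfile);
    else
        snprintf(dst, dstlen, "%s.%c", relfile, mark);
}

/* fsync the containing directory so the directory entry change is durable. */
static int
fsync_parent(const char *path)
{
    char    buf[1024];
    int     fd;
    int     rc;

    snprintf(buf, sizeof(buf), "%s", path);   /* dirname() may modify its input */
    fd = open(dirname(buf), O_RDONLY);
    if (fd < 0)
        return -1;
    rc = fsync(fd);
    close(fd);
    return rc;
}

/* Create the mark file; during redo an already existing file is accepted. */
static int
create_mark(const char *path, bool is_redo)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);

    if (fd < 0)
        return (is_redo && errno == EEXIST) ? 0 : -1;
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;
    return fsync_parent(path);
}

/* Existence check that does not leave a descriptor open. */
static bool
mark_exists(const char *path)
{
    return access(path, F_OK) == 0;
}

/* Remove the mark file; during redo a missing file is accepted. */
static int
unlink_mark(const char *path, bool is_redo)
{
    if (unlink(path) != 0)
        return (is_redo && errno == ENOENT) ? 0 : -1;
    return fsync_parent(path);
}
```

With these helpers, mark_path(buf, sizeof(buf), "base/1/16384", 'u') yields "base/1/16384.u", matching the "<filename>.<mark>" naming described in smgr.h. Note that the sketch returns early when the file already exists during redo; in the mdcreatemark() hunk above that path appears to fall through to pg_fsync() and close() with a negative descriptor, and mdmarkexists() appears not to close the descriptor it opens, so both may be worth a second look.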
v8-0002-New-command-ALTER-TABLE-ALL-IN-TABLESPACE-SET-LOG.patch (text/x-patch; charset=us-ascii)
From 2d74ca97ae66dff87a883e2efa60f02fb8c883c3 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 23:21:09 +0900
Subject: [PATCH v8 2/2] New command ALTER TABLE ALL IN TABLESPACE SET
LOGGED/UNLOGGED
To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes
relation persistence of all tables in the specified tablespace.
---
src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++++++++++
src/backend/nodes/copyfuncs.c | 16 ++++
src/backend/nodes/equalfuncs.c | 15 ++++
src/backend/parser/gram.y | 20 +++++
src/backend/tcop/utility.c | 11 +++
src/include/commands/tablecmds.h | 2 +
src/include/nodes/nodes.h | 1 +
src/include/nodes/parsenodes.h | 9 ++
8 files changed, 214 insertions(+)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index afc77f0d98..211ca3641a 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14488,6 +14488,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
return new_tablespaceoid;
}
+/*
+ * Alter Table ALL ... SET LOGGED/UNLOGGED
+ *
+ * Allows a user to change persistence of all objects in a given tablespace in
+ * the current database. Objects can be chosen based on the owner of the
+ * object also, to allow users to change persistence of only their objects. The
+ * main permissions handling is done by the lower-level change persistence
+ * function.
+ *
+ * All to-be-modified objects are locked first. If NOWAIT is specified and the
+ * lock can't be acquired then we ereport(ERROR).
+ */
+void
+AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt)
+{
+ List *relations = NIL;
+ ListCell *l;
+ ScanKeyData key[1];
+ Relation rel;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ Oid tablespaceoid;
+ List *role_oids = roleSpecsToIds(NIL);
+
+ /* Ensure we were not asked to change something we can't */
+ if (stmt->objtype != OBJECT_TABLE)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("only tables can be specified")));
+
+ /* Get the tablespace OID */
+ tablespaceoid = get_tablespace_oid(stmt->tablespacename, false);
+
+ /*
+ * Now that the checks are done, check if we should set either to
+ * InvalidOid because it is our database's default tablespace.
+ */
+ if (tablespaceoid == MyDatabaseTableSpace)
+ tablespaceoid = InvalidOid;
+
+ /*
+ * Walk the list of objects in the tablespace to pick them up. This will
+ * only find objects in our database, of course.
+ */
+ ScanKeyInit(&key[0],
+ Anum_pg_class_reltablespace,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(tablespaceoid));
+
+ rel = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(rel, 1, key);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relOid = relForm->oid;
+
+ /*
+ * Do not pick-up objects in pg_catalog as part of this, if an admin
+ * really wishes to do so, they can issue the individual ALTER
+ * commands directly.
+ *
+ * Also, explicitly avoid any shared tables, temp tables, or TOAST
+ * (TOAST will be changed with the main table).
+ */
+ if (IsCatalogNamespace(relForm->relnamespace) ||
+ relForm->relisshared ||
+ isAnyTempNamespace(relForm->relnamespace) ||
+ IsToastNamespace(relForm->relnamespace))
+ continue;
+
+ /* Only pick up the object type requested */
+ if (relForm->relkind != RELKIND_RELATION)
+ continue;
+
+ /* Check if we are only picking-up objects owned by certain roles */
+ if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner))
+ continue;
+
+ /*
+ * Handle permissions-checking here since we are locking the tables
+ * and also to avoid doing a bunch of work only to fail part-way. Note
+ * that permissions will also be checked by AlterTableInternal().
+ *
+ * Caller must be considered an owner on the table of which we're going
+ * to change persistence.
+ */
+ if (!pg_class_ownercheck(relOid, GetUserId()))
+ aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)),
+ NameStr(relForm->relname));
+
+ if (stmt->nowait &&
+ !ConditionalLockRelationOid(relOid, AccessExclusiveLock))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("aborting because lock on relation \"%s.%s\" is not available",
+ get_namespace_name(relForm->relnamespace),
+ NameStr(relForm->relname))));
+ else
+ LockRelationOid(relOid, AccessExclusiveLock);
+
+ /*
+ * Add to our list of objects of which we're going to change
+ * persistence.
+ */
+ relations = lappend_oid(relations, relOid);
+ }
+
+ table_endscan(scan);
+ table_close(rel, AccessShareLock);
+
+ if (relations == NIL)
+ ereport(NOTICE,
+ (errcode(ERRCODE_NO_DATA_FOUND),
+ errmsg("no matching relations in tablespace \"%s\" found",
+ tablespaceoid == InvalidOid ? "(database default)" :
+ get_tablespace_name(tablespaceoid))));
+
+ /*
+ * Everything is locked, loop through and change persistence of all of the
+ * relations.
+ */
+ foreach(l, relations)
+ {
+ List *cmds = NIL;
+ AlterTableCmd *cmd = makeNode(AlterTableCmd);
+
+ if (stmt->logged)
+ cmd->subtype = AT_SetLogged;
+ else
+ cmd->subtype = AT_SetUnLogged;
+
+ cmds = lappend(cmds, cmd);
+
+ EventTriggerAlterTableStart((Node *) stmt);
+ /* OID is set by AlterTableInternal */
+ AlterTableInternal(lfirst_oid(l), cmds, false);
+ EventTriggerAlterTableEnd();
+ }
+}
+
static void
index_copy_data(Relation rel, RelFileNode newrnode)
{
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index df0b747883..55e38cfe3f 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4269,6 +4269,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from)
return newnode;
}
+static AlterTableSetLoggedAllStmt *
+_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *from)
+{
+ AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt);
+
+ COPY_STRING_FIELD(tablespacename);
+ COPY_SCALAR_FIELD(objtype);
+ COPY_SCALAR_FIELD(logged);
+ COPY_SCALAR_FIELD(nowait);
+
+ return newnode;
+}
+
static CreateExtensionStmt *
_copyCreateExtensionStmt(const CreateExtensionStmt *from)
{
@@ -5622,6 +5635,9 @@ copyObjectImpl(const void *from)
case T_AlterTableMoveAllStmt:
retval = _copyAlterTableMoveAllStmt(from);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _copyAlterTableSetLoggedAllStmt(from);
+ break;
case T_CreateExtensionStmt:
retval = _copyCreateExtensionStmt(from);
break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index cb7ddd463c..a19b7874d7 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1916,6 +1916,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a,
return true;
}
+static bool
+_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *a,
+ const AlterTableSetLoggedAllStmt *b)
+{
+ COMPARE_STRING_FIELD(tablespacename);
+ COMPARE_SCALAR_FIELD(objtype);
+ COMPARE_SCALAR_FIELD(logged);
+ COMPARE_SCALAR_FIELD(nowait);
+
+ return true;
+}
+
static bool
_equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b)
{
@@ -3625,6 +3637,9 @@ equal(const void *a, const void *b)
case T_AlterTableMoveAllStmt:
retval = _equalAlterTableMoveAllStmt(a, b);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _equalAlterTableSetLoggedAllStmt(a, b);
+ break;
case T_CreateExtensionStmt:
retval = _equalCreateExtensionStmt(a, b);
break;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 3d4dd43e47..9823d57a54 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -1984,6 +1984,26 @@ AlterTableStmt:
n->nowait = $13;
$$ = (Node *)n;
}
+ | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = true;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = false;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
| ALTER INDEX qualified_name alter_table_cmds
{
AlterTableStmt *n = makeNode(AlterTableStmt);
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 1fbc387d47..1483f9a475 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -162,6 +162,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_AlterTSConfigurationStmt:
case T_AlterTSDictionaryStmt:
case T_AlterTableMoveAllStmt:
+ case T_AlterTableSetLoggedAllStmt:
case T_AlterTableSpaceOptionsStmt:
case T_AlterTableStmt:
case T_AlterTypeStmt:
@@ -1747,6 +1748,12 @@ ProcessUtilitySlow(ParseState *pstate,
commandCollected = true;
break;
+ case T_AlterTableSetLoggedAllStmt:
+ AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree);
+ /* commands are stashed in AlterTableSetLoggedAll */
+ commandCollected = true;
+ break;
+
case T_DropStmt:
ExecDropStmt((DropStmt *) parsetree, isTopLevel);
/* no commands stashed for DROP */
@@ -2669,6 +2676,10 @@ CreateCommandTag(Node *parsetree)
tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype);
+ break;
+
case T_AlterTableStmt:
tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype);
break;
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index 336549cc5f..714077ff4c 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
+extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt);
+
extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
Oid *oldschema);
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 7c657c1241..8860b2e548 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -428,6 +428,7 @@ typedef enum NodeTag
T_AlterCollationStmt,
T_CallStmt,
T_AlterStatsStmt,
+ T_AlterTableSetLoggedAllStmt,
/*
* TAGS FOR PARSE TREE NODES (parsenodes.h)
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 4c5a8a39bf..c3e1bc66d1 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2350,6 +2350,15 @@ typedef struct AlterTableMoveAllStmt
bool nowait;
} AlterTableMoveAllStmt;
+typedef struct AlterTableSetLoggedAllStmt
+{
+ NodeTag type;
+ char *tablespacename;
+ ObjectType objtype; /* Object type to move */
+ bool logged;
+ bool nowait;
+} AlterTableSetLoggedAllStmt;
+
/* ----------------------
* Create/Alter Extension Statements
* ----------------------
--
2.27.0
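For readers following the storage_xlog.h hunk in patch 1/2 of the series above, here is a small standalone sketch of how the three new smgr record types fit into the high four bits of the WAL info byte. It assumes XLR_INFO_MASK is 0x0F, as in PostgreSQL's xlogrecord.h; smgr_identify_sketch is a hypothetical name and only mirrors the general shape of an rmgr identify routine, it is not the patch's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: the low 4 bits of the info byte belong to the WAL machinery. */
#define XLR_INFO_MASK             0x0F

/* Record types; the last three are the ones added by the patch. */
#define XLOG_SMGR_CREATE          0x10
#define XLOG_SMGR_TRUNCATE        0x20
#define XLOG_SMGR_UNLINK          0x30
#define XLOG_SMGR_MARK            0x40
#define XLOG_SMGR_BUFPERSISTENCE  0x50

/* Map an info byte to a record name, or NULL when the type is unknown. */
static const char *
smgr_identify_sketch(uint8_t info)
{
    switch (info & ~XLR_INFO_MASK)
    {
        case XLOG_SMGR_CREATE:
            return "CREATE";
        case XLOG_SMGR_TRUNCATE:
            return "TRUNCATE";
        case XLOG_SMGR_UNLINK:
            return "UNLINK";
        case XLOG_SMGR_MARK:
            return "MARK";
        case XLOG_SMGR_BUFPERSISTENCE:
            return "BUFPERSISTENCE";
        default:
            return NULL;
    }
}
```

Because the low bits are masked off before the switch, an info byte such as 0x41 still identifies as a MARK record, while an unassigned code such as 0x60 yields NULL.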
At Mon, 20 Dec 2021 15:28:23 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
4. Including the reasons above, this is not fully functional.
For example, if we execute the following commands on primary,
replica doesn't work correctly. (boom!)
=# CREATE UNLOGGED TABLE t (a int);
=# ALTER TABLE t SET LOGGED;
- The issue 4 above is not fixed (yet).
Not only for that case, RelationChangePersistence needs to send a
truncate record before the FPIs. If the primary crashes in the middle of
the operation, the table content will vanish along with the persistence
change. That is the correct behavior.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v9-0001-In-place-table-persistence-change.patch (text/x-patch; charset=us-ascii)
From b28163fd7b3527e69f5b76f252891f800d7ac98c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v9 1/2] In-place table persistence change
Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, it currently runs a heap rewrite, which causes a large amount of
file I/O. This patch makes the command run without a heap rewrite.
In addition, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of a massive number of HEAP_INSERTs, which should be
smaller.
This also allows for the cleanup of files left behind by the crash of
the transaction that created them.
---
src/backend/access/rmgrdesc/smgrdesc.c | 52 +++
src/backend/access/transam/README | 8 +
src/backend/access/transam/xlog.c | 17 +
src/backend/catalog/storage.c | 593 +++++++++++++++++++++++--
src/backend/commands/tablecmds.c | 256 +++++++++--
src/backend/replication/basebackup.c | 3 +-
src/backend/storage/buffer/bufmgr.c | 88 ++++
src/backend/storage/file/fd.c | 4 +-
src/backend/storage/file/reinit.c | 344 ++++++++++----
src/backend/storage/smgr/md.c | 93 +++-
src/backend/storage/smgr/smgr.c | 32 ++
src/backend/storage/sync/sync.c | 20 +-
src/bin/pg_rewind/parsexlog.c | 24 +
src/common/relpath.c | 47 +-
src/include/catalog/storage.h | 2 +
src/include/catalog/storage_xlog.h | 42 +-
src/include/common/relpath.h | 9 +-
src/include/storage/bufmgr.h | 2 +
src/include/storage/fd.h | 1 +
src/include/storage/md.h | 8 +-
src/include/storage/reinit.h | 10 +-
src/include/storage/smgr.h | 17 +
22 files changed, 1465 insertions(+), 207 deletions(-)
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 7755553d57..d251f22207 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,49 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+ char *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+ appendStringInfoString(buf, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+ char *path = GetRelationPath(xlrec->rnode.dbNode,
+ xlrec->rnode.spcNode,
+ xlrec->rnode.relNode,
+ InvalidBackendId,
+ xlrec->forkNum, xlrec->mark);
+ char *action;
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ action = "CREATE";
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ action = "DELETE";
+ break;
+ default:
+ action = "<unknown action>";
+ break;
+ }
+
+ appendStringInfo(buf, "%s %s", action, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+ char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+ appendStringInfoString(buf, path);
+ appendStringInfo(buf, " persistence %d", xlrec->persistence);
+ pfree(path);
+ }
}
const char *
@@ -55,6 +98,15 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_UNLINK:
+ id = "UNLINK";
+ break;
+ case XLOG_SMGR_MARK:
+ id = "MARK";
+ break;
+ case XLOG_SMGR_BUFPERSISTENCE:
+ id = "BUFPERSISTENCE";
+ break;
}
return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c1..b344bbe511 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -724,6 +724,14 @@ we must panic and abort recovery. The DBA will have to manually clean up
then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.
+The Smgr MARK files
+-------------------
+
+An smgr mark file is created alongside a new relation file to mark
+that the relfilenode needs to be cleaned up at recovery time. In
+contrast to the four actions above, failure to remove smgr mark files
+will lead to data loss, in which case the server will shut down.
+
Skipping WAL for New RelFileNode
--------------------------------
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1e1fbe957f..59f4c2eacf 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -40,6 +40,7 @@
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
#include "catalog/pg_database.h"
+#include "catalog/storage.h"
#include "commands/progress.h"
#include "commands/tablespace.h"
#include "common/controldata_utils.h"
@@ -4564,6 +4565,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
{
ereport(DEBUG1,
(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+
+ /* clean up garbage files left during crash recovery */
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
InArchiveRecovery = true;
if (StandbyModeRequested)
StandbyMode = true;
@@ -7824,6 +7833,14 @@ StartupXLOG(void)
}
}
+ /* clean up garbage files left during crash recovery */
+ if (!InArchiveRecovery)
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
/* Allow resource managers to do any required cleanup. */
for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
{
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index c5ad28d71f..03fccc3c3b 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,6 +19,7 @@
#include "postgres.h"
+#include "access/amapi.h"
#include "access/parallel.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -27,6 +28,7 @@
#include "access/xlogutils.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+#include "common/hashfn.h"
#include "miscadmin.h"
#include "storage/freespace.h"
#include "storage/smgr.h"
@@ -57,9 +59,18 @@ int wal_skip_threshold = 2048; /* in kilobytes */
* but I'm being paranoid.
*/
+#define PDOP_DELETE (1 << 0)
+#define PDOP_UNLINK_FORK (1 << 1)
+#define PDOP_UNLINK_MARK (1 << 2)
+#define PDOP_SET_PERSISTENCE (1 << 3)
+
typedef struct PendingRelDelete
{
RelFileNode relnode; /* relation that may need to be deleted */
+ int op; /* operation mask */
+ bool bufpersistence; /* buffer persistence to set */
+ int unlink_forknum; /* forknum to unlink */
+ StorageMarks unlink_mark; /* mark to unlink */
BackendId backend; /* InvalidBackendId if not a temp rel */
bool atCommit; /* T=delete at commit; F=delete at abort */
int nestLevel; /* xact nesting level of request */
@@ -75,6 +86,24 @@ typedef struct PendingRelSync
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
HTAB *pendingSyncHash = NULL;
+typedef struct SRelHashEntry
+{
+ SMgrRelation srel;
+ char status; /* for simplehash use */
+} SRelHashEntry;
+
+/* define hashtable for workarea for pending deletes */
+#define SH_PREFIX srelhash
+#define SH_ELEMENT_TYPE SRelHashEntry
+#define SH_KEY_TYPE SMgrRelation
+#define SH_KEY srel
+#define SH_HASH_KEY(tb, key) \
+ hash_bytes((unsigned char *)&key, sizeof(SMgrRelation))
+#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(SMgrRelation)) == 0)
+#define SH_SCOPE static inline
+#define SH_DEFINE
+#define SH_DECLARE
+#include "lib/simplehash.h"
/*
* AddPendingSync
@@ -143,22 +172,47 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return NULL; /* placate compiler */
}
+ /*
+ * We are going to create a new storage file. If the server crashes before
+ * the current transaction ends, the file needs to be cleaned up. The
+ * SMGR_MARK_UNCOMMITTED mark file signals that the file is an orphan.
+ */
srel = smgropen(rnode, backend);
+ log_smgrcreatemark(&rnode, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
smgrcreate(srel, MAIN_FORKNUM, false);
if (needs_wal)
log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM);
- /* Add the relation to the list of stuff to delete at abort */
+ /*
+ * Add the relation to the list of stuff to delete at abort. We don't
+ * remove the mark file at commit. It needs to persist until the main fork
+ * file is actually deleted. See SyncPostCheckpoint.
+ */
pending = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
pending->relnode = rnode;
+ pending->op = PDOP_DELETE;
pending->backend = backend;
pending->atCommit = false; /* delete if abort */
pending->nestLevel = GetCurrentTransactionNestLevel();
pending->next = pendingDeletes;
pendingDeletes = pending;
+ /* drop the mark file at commit */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_MARK;
+ pending->unlink_forknum = MAIN_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->backend = backend;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+
if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
{
Assert(backend == InvalidBackendId);
@@ -168,6 +222,226 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return srel;
}
+/*
+ * RelationCreateInitFork
+ * Create physical storage for the init fork of a relation.
+ *
+ * Create the init fork for the relation.
+ *
+ * This function is transactional. The creation is WAL-logged, and if the
+ * transaction aborts later on, the init fork will be removed.
+ */
+void
+RelationCreateInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingRelDelete *pending;
+ SMgrRelation srel;
+ PendingRelDelete *prev;
+ PendingRelDelete *next;
+ bool create = true;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), false, false);
+
+ /*
+ * If we have entries for init-fork operations on this relation, that means
+ * that we have already registered pending-delete entries to drop an init
+ * fork that existed before the current transaction started. This function
+ * reverts that change just by removing the entries.
+ */
+ prev = NULL;
+ for (pending = pendingDeletes; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ /*
+ * We don't touch unrelated entries. Although init-fork related entries
+ * are not useful if the relation is created or dropped in this
+ * transaction, we don't bother to avoid registering entries for such
+ * relations here.
+ */
+ if (!RelFileNodeEquals(rnode, pending->relnode) ||
+ pending->unlink_forknum != INIT_FORKNUM ||
+ (pending->op & PDOP_DELETE) != 0)
+ {
+ prev = pending;
+ continue;
+ }
+
+ /* make sure the entry is what we're expecting here */
+ Assert(((pending->op & (PDOP_UNLINK_FORK|PDOP_UNLINK_MARK)) != 0 &&
+ pending->unlink_forknum == INIT_FORKNUM) ||
+ (pending->op & PDOP_SET_PERSISTENCE) != 0);
+
+ /* unlink list entry */
+ if (prev)
+ prev->next = next;
+ else
+ pendingDeletes = next;
+ pfree(pending);
+
+ create = false;
+ }
+
+ if (!create)
+ return;
+
+ /*
+ * We are going to create an init fork. If the server crashes before the
+ * current transaction ends, the leftover init fork would corrupt data
+ * during recovery. The mark file works as the sentinel to identify that
+ * situation.
+ */
+ srel = smgropen(rnode, InvalidBackendId);
+ log_smgrcreatemark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+
+ /* We don't have an existing init fork; create it. */
+ smgrcreate(srel, INIT_FORKNUM, false);
+
+ /*
+ * An index's init fork needs further initialization. ambuildempty should
+ * do the WAL-logging and file sync by itself; otherwise we do that ourselves.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rnode, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /* drop the init fork, mark file and revert persistence at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_FORK | PDOP_UNLINK_MARK | PDOP_SET_PERSISTENCE;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->bufpersistence = true;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+
+ /* drop mark file at commit */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_MARK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+}
+
+/*
+ * RelationDropInitFork
+ * Delete physical storage for the init fork of a relation.
+ *
+ * Register pending-delete of the init fork. The real deletion is performed by
+ * smgrDoPendingDeletes at commit.
+ *
+ * This function is transactional. If the transaction aborts later on, the
+ * deletion doesn't happen.
+ */
+void
+RelationDropInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingRelDelete *pending;
+ PendingRelDelete *prev;
+ PendingRelDelete *next;
+ bool inxact_created = false;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), true, false);
+
+ /*
+ * If we have entries for init-fork operations of this relation, that means
+ * that we have created the init fork in the current transaction. We
+ * remove the init fork and mark file immediately in that case. Otherwise
+ * just register pending-delete for the existing init fork.
+ */
+ prev = NULL;
+ for (pending = pendingDeletes; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ /*
+ * We don't touch unrelated entries. Although init-fork related entries
+ * are not useful if the relation is created or dropped in this
+ * transaction, we don't bother to avoid registering entries for such
+ * relations here.
+ */
+ if (!RelFileNodeEquals(rnode, pending->relnode) ||
+ pending->unlink_forknum != INIT_FORKNUM ||
+ (pending->op & PDOP_DELETE) != 0)
+ {
+ prev = pending;
+ continue;
+ }
+
+ /* make sure the entry is what we're expecting here */
+ Assert(((pending->op & (PDOP_UNLINK_FORK|PDOP_UNLINK_MARK)) != 0 &&
+ pending->unlink_forknum == INIT_FORKNUM) ||
+ (pending->op & PDOP_SET_PERSISTENCE) != 0);
+
+ /* unlink list entry */
+ if (prev)
+ prev->next = next;
+ else
+ pendingDeletes = next;
+ pfree(pending);
+
+ inxact_created = true;
+ }
+
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rnode, InvalidBackendId);
+
+ /*
+ * INIT forks are never loaded into shared buffers, so there is no point
+ * in dropping buffers for such files.
+ */
+ log_smgrunlinkmark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrunlinkmark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+ log_smgrunlink(&rnode, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ return;
+ }
+
+ /* register drop of this init fork file at commit */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_UNLINK_FORK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = rnode;
+ pending->op = PDOP_SET_PERSISTENCE;
+ pending->bufpersistence = false;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+}
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -187,6 +461,88 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum)
+{
+ xl_smgr_unlink xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file unlink.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK record (mark creation) to WAL.
+ */
+void
+log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the mark file creation.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_CREATE;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK record (mark removal) to WAL.
+ */
+void
+log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the mark file removal.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_UNLINK;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileNode *rnode, bool persistence)
+{
+ xl_smgr_bufpersistence xlrec;
+
+ /*
+ * Make an XLOG entry reporting the change of buffer persistence.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.persistence = persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -200,6 +556,7 @@ RelationDropStorage(Relation rel)
pending = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
pending->relnode = rel->rd_node;
+ pending->op = PDOP_DELETE;
pending->backend = rel->rd_backend;
pending->atCommit = true; /* delete if commit */
pending->nestLevel = GetCurrentTransactionNestLevel();
@@ -618,59 +975,104 @@ smgrDoPendingDeletes(bool isCommit)
int nrels = 0,
maxrels = 0;
SMgrRelation *srels = NULL;
+ srelhash_hash *close_srels = NULL;
+ bool found;
prev = NULL;
for (pending = pendingDeletes; pending != NULL; pending = next)
{
+ SMgrRelation srel;
+
next = pending->next;
if (pending->nestLevel < nestLevel)
{
/* outer-level entries should not be processed yet */
prev = pending;
+ continue;
}
+
+ /* unlink list entry first, so we don't retry on failure */
+ if (prev)
+ prev->next = next;
else
+ pendingDeletes = next;
+
+ if (pending->atCommit != isCommit)
{
- /* unlink list entry first, so we don't retry on failure */
- if (prev)
- prev->next = next;
- else
- pendingDeletes = next;
- /* do deletion if called for */
- if (pending->atCommit == isCommit)
- {
- SMgrRelation srel;
-
- srel = smgropen(pending->relnode, pending->backend);
-
- /* allocate the initial array, or extend it, if needed */
- if (maxrels == 0)
- {
- maxrels = 8;
- srels = palloc(sizeof(SMgrRelation) * maxrels);
- }
- else if (maxrels <= nrels)
- {
- maxrels *= 2;
- srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
- }
-
- srels[nrels++] = srel;
- }
/* must explicitly free the list entry */
pfree(pending);
/* prev does not change */
+ continue;
}
+
+ if (close_srels == NULL)
+ close_srels = srelhash_create(CurrentMemoryContext, 32, NULL);
+
+ srel = smgropen(pending->relnode, pending->backend);
+
+ /* Uniquify the smgr relations */
+ srelhash_insert(close_srels, srel, &found);
+
+ if (pending->op & PDOP_DELETE)
+ {
+ /* allocate the initial array, or extend it, if needed */
+ if (maxrels == 0)
+ {
+ maxrels = 8;
+ srels = palloc(sizeof(SMgrRelation) * maxrels);
+ }
+ else if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ srels[nrels++] = srel;
+ }
+
+ if (pending->op & PDOP_UNLINK_FORK)
+ {
+ /* forks other than init would need their buffers dropped first */
+ Assert(pending->unlink_forknum == INIT_FORKNUM);
+
+ /* Don't emit WAL during recovery. */
+ if (!InRecovery)
+ log_smgrunlink(&pending->relnode, pending->unlink_forknum);
+ smgrunlink(srel, pending->unlink_forknum, false);
+ }
+
+ if (pending->op & PDOP_UNLINK_MARK)
+ {
+ if (!InRecovery)
+ log_smgrunlinkmark(&pending->relnode,
+ pending->unlink_forknum,
+ pending->unlink_mark);
+ smgrunlinkmark(srel, pending->unlink_forknum,
+ pending->unlink_mark, InRecovery);
+ }
+
+ if (pending->op & PDOP_SET_PERSISTENCE)
+ SetRelationBuffersPersistence(srel, pending->bufpersistence,
+ InRecovery);
}
if (nrels > 0)
{
smgrdounlinkall(srels, nrels, false);
-
- for (int i = 0; i < nrels; i++)
- smgrclose(srels[i]);
-
pfree(srels);
}
+
+ if (close_srels)
+ {
+ srelhash_iterator i;
+ SRelHashEntry *ent;
+
+ /* close smgr relations */
+ srelhash_start_iterate(close_srels, &i);
+ while ((ent = srelhash_iterate(close_srels, &i)) != NULL)
+ smgrclose(ent->srel);
+ srelhash_destroy(close_srels);
+ }
}
/*
@@ -840,7 +1242,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
- && pending->backend == InvalidBackendId)
+ && pending->backend == InvalidBackendId
+ && pending->op == PDOP_DELETE)
nrels++;
}
if (nrels == 0)
@@ -853,7 +1256,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
- && pending->backend == InvalidBackendId)
+ && pending->backend == InvalidBackendId &&
+ pending->op == PDOP_DELETE)
{
*rptr = pending->relnode;
rptr++;
@@ -933,6 +1337,15 @@ smgr_redo(XLogReaderState *record)
reln = smgropen(xlrec->rnode, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ smgrunlink(reln, xlrec->forkNum, true);
+ smgrclose(reln);
+ }
else if (info == XLOG_SMGR_TRUNCATE)
{
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
@@ -1021,6 +1434,120 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingRelDelete *pending;
+ bool created = false;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ smgrcreatemark(reln, xlrec->forkNum, xlrec->mark, true);
+ created = true;
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ smgrunlinkmark(reln, xlrec->forkNum, xlrec->mark, true);
+ break;
+ default:
+ elog(ERROR, "unknown smgr_mark action %d", (int) xlrec->action);
+ }
+
+ if (created)
+ {
+ /* revert mark file operation at abort */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = xlrec->rnode;
+ pending->op = PDOP_UNLINK_MARK;
+ pending->unlink_forknum = xlrec->forkNum;
+ pending->unlink_mark = xlrec->mark;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+ }
+ else
+ {
+ /*
+ * Delete pending action for this mark file if any. We should have
+ * at most one entry for this action.
+ */
+ PendingRelDelete *prev = NULL;
+
+ for (pending = pendingDeletes; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileNodeEquals(xlrec->rnode, pending->relnode) &&
+ pending->unlink_forknum == xlrec->forkNum &&
+ (pending->op & PDOP_UNLINK_MARK) != 0)
+ {
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingDeletes = pending->next;
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+ }
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingRelDelete *pending;
+ PendingRelDelete *prev = NULL;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+
+ /*
+ * Delete pending action for persistence change if any. We should have
+ * at most one entry for this action.
+ */
+ for (pending = pendingDeletes; pending != NULL; pending = pending->next)
+ {
+ if (RelFileNodeEquals(xlrec->rnode, pending->relnode) &&
+ (pending->op & PDOP_SET_PERSISTENCE) != 0)
+ {
+ Assert(pending->bufpersistence == xlrec->persistence);
+
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingDeletes = pending->next;
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+
+ /*
+ * Revert buffer-persistence changes at abort if the relation's persistence
+ * differs from what it was before this transaction.
+ */
+ if (!pending)
+ {
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->relnode = xlrec->rnode;
+ pending->op = PDOP_SET_PERSISTENCE;
+ pending->bufpersistence = !xlrec->persistence;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+ }
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index bf42587e38..0d9c801535 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -52,6 +52,7 @@
#include "commands/defrem.h"
#include "commands/event_trigger.h"
#include "commands/policy.h"
+#include "commands/progress.h"
#include "commands/sequence.h"
#include "commands/tablecmds.h"
#include "commands/tablespace.h"
@@ -5329,6 +5330,177 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
return newcmd;
}
+/*
+ * RelationChangePersistence: do in-place persistence change of a relation
+ */
+static void
+RelationChangePersistence(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ int i;
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * If any of the following were set, we would need to call ATRewriteTable;
+ * that cannot happen in the AT_REWRITE_ALTER_PERSISTENCE case.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ /*
+ * First we collect all relations that we need to change persistence.
+ */
+
+ /* Collect OIDs of indexes and toast relations */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ /* Add toast relation if any */
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ List *toastidx;
+ Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
+
+ relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
+ toastidx = RelationGetIndexList(toastrel);
+ relids = list_concat(relids, toastidx);
+ pfree(toastidx);
+ table_close(toastrel, NoLock);
+ }
+
+ table_close(rel, NoLock);
+
+ /* Make changes in storage */
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach (lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+
+ /*
+ * Some access methods do not accept in-place persistence change. For
+ * example, GiST uses page LSNs to figure out whether a block has
+ * changed, where UNLOGGED GiST indexes use fake LSNs that are
+ * incompatible with real LSNs used for LOGGED ones.
+ *
+ * XXX: We don't bother to allow in-place persistence change for index
+ * methods other than btree for now.
+ */
+ if (r->rd_rel->relkind == RELKIND_INDEX &&
+ r->rd_rel->relam != BTREE_AM_OID)
+ {
+ int reindex_flags;
+
+ /* reindex doesn't allow concurrent use of the index */
+ table_close(r, NoLock);
+
+ reindex_flags =
+ REINDEX_REL_SUPPRESS_INDEX_USE |
+ REINDEX_REL_CHECK_CONSTRAINTS;
+
+ /* Set the same persistence as the parent relation. */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+
+ reindex_index(reloid, reindex_flags, persistence, 0);
+
+ continue;
+ }
+
+ /* Create or drop init fork */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ RelationCreateInitFork(r);
+ else
+ RelationDropInitFork(r);
+
+ /*
+ * When this relation gets WAL-logged, immediately sync all files but
+ * the init fork to establish the initial state on storage. Buffers have
+ * already been flushed out by RelationCreate(Drop)InitFork called just
+ * above. The init fork should have been synced as needed.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT)
+ {
+ for (i = 0 ; i < INIT_FORKNUM ; i++)
+ {
+ if (smgrexists(RelationGetSmgr(r), i))
+ smgrimmedsync(RelationGetSmgr(r), i);
+ }
+ }
+
+ /* Update catalog */
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+
+ /*
+ * While wal_level >= replica, switching to LOGGED requires the
+ * relation content to be WAL-logged to recover the table.
+ * We don't emit this while wal_level = minimal.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded())
+ {
+ ForkNumber fork;
+ xl_smgr_truncate xlrec;
+
+ xlrec.blkno = 0;
+ xlrec.rnode = r->rd_node;
+ xlrec.flags = SMGR_TRUNCATE_ALL;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ for (fork = 0; fork < INIT_FORKNUM ; fork++)
+ {
+ if (smgrexists(RelationGetSmgr(r), fork))
+ log_newpage_range(r, fork, 0,
+ smgrnblocks(RelationGetSmgr(r), fork),
+ false);
+ }
+ }
+
+ table_close(r, NoLock);
+ }
+
+ table_close(classRel, NoLock);
+}
+
/*
* ATRewriteTables: ALTER TABLE phase 3
*/
@@ -5459,47 +5631,55 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- /*
- * Create transient table that will receive the modified data.
- *
- * Ensure it is marked correctly as logged or unlogged. We have
- * to do this here so that buffers for the new relfilenode will
- * have the right persistence set, and at the same time ensure
- * that the original filenode's buffers will get read in with the
- * correct setting (i.e. the original one). Otherwise a rollback
- * after the rewrite would possibly result with buffers for the
- * original filenode having the wrong persistence setting.
- *
- * NB: This relies on swap_relation_files() also swapping the
- * persistence. That wouldn't work for pg_class, but that can't be
- * unlogged anyway.
- */
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
- persistence, lockmode);
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
+ RelationChangePersistence(tab, persistence, lockmode);
+ else
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new relfilenode
+ * will have the right persistence set, and at the same time
+ * ensure that the original filenode's buffers will get read in
+ * with the correct setting (i.e. the original one). Otherwise
+ * a rollback after the rewrite would possibly result with
+ * buffers for the original filenode having the wrong
+ * persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that can't
+ * be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace,
+ NewAccessMethod,
+ persistence, lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
+ /*
+ * Swap the physical files of the old and new heaps, then
+ * rebuild indexes and discard the old heap. We can use
+ * RecentXmin for the table's new relfrozenxid because we
+ * rewrote all the tuples in ATRewriteTable, so no older Xid
+ * remains in the table. Also, we never try to swap toast
+ * tables by content, since we have no interest in letting this
+ * code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
- InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ }
}
else
{
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index ec0485705d..45e1a5d817 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1070,6 +1070,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
bool excludeFound;
ForkNumber relForkNum; /* Type of fork if file is a relation */
int relOidChars; /* Chars in filename that are the rel oid */
+ StorageMarks mark;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1120,7 +1121,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
/* Exclude all forks for unlogged tables except the init fork */
if (isDbDir &&
parse_filename_for_nontemp_relation(de->d_name, &relOidChars,
- &relForkNum))
+ &relForkNum, &mark))
{
/* Never exclude init forks */
if (relForkNum != INIT_FORKNUM)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index b4532948d3..dab74bf99a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -37,6 +37,7 @@
#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
#include "executor/instrument.h"
#include "lib/binaryheap.h"
#include "miscadmin.h"
@@ -3154,6 +3155,93 @@ DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
}
}
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a relation
+ * and, when switching to PERMANENT, writes all dirty pages of the relation
+ * out to disk (or more accurately, out to kernel disk buffers), ensuring
+ * that the kernel has an up-to-date view of the relation.
+ *
+ * Generally, the caller should be holding AccessExclusiveLock on the
+ * target relation to ensure that no other backend is busy dirtying
+ * more blocks of the relation; the effects can't be expected to last
+ * after the lock is released.
+ *
+ * XXX currently it sequentially searches the buffer pool, should be
+ * changed to more clever ways of searching. This routine is not
+ * used in any performance-critical code paths, so it's not worth
+ * adding additional overhead to normal paths to make it go faster;
+ * but see also DropRelFileNodeBuffers.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
+{
+ int i;
+ RelFileNodeBackend rnode = srel->smgr_rnode;
+
+ Assert (!RelFileNodeBackendIsTemp(rnode));
+
+ if (!isRedo)
+ log_smgrbufpersistence(&srel->smgr_rnode.node, permanent);
+
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ {
+ UnlockBufHdr(bufHdr, buf_state);
+ continue;
+ }
+
+ if (permanent)
+ {
+ /* Init fork is being dropped, drop buffers for it. */
+ if (bufHdr->tag.forkNum == INIT_FORKNUM)
+ {
+ InvalidateBuffer(bufHdr);
+ continue;
+ }
+
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /* we flush this buffer when switching to PERMANENT */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, srel);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr, true);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ /* init fork is always BM_PERMANENT. See BufferAlloc */
+ if (bufHdr->tag.forkNum != INIT_FORKNUM)
+ buf_state &= ~BM_PERMANENT;
+
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ }
+}
+
/* ---------------------------------------------------------------------
* DropRelFileNodesAllBuffers
*
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 263057841d..8487ae1f02 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -349,8 +349,6 @@ static void pre_sync_fname(const char *fname, bool isdir, int elevel);
static void datadir_fsync_fname(const char *fname, bool isdir, int elevel);
static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel);
-static int fsync_parent_path(const char *fname, int elevel);
-
/*
* pg_fsync --- do fsync with or without writethrough
@@ -3759,7 +3757,7 @@ fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel)
* This is aimed at making file operations persistent on disk in case of
* an OS crash or power failure.
*/
-static int
+int
fsync_parent_path(const char *fname, int elevel)
{
char parentpath[MAXPGPATH];
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 0ae3fb6902..0137902bb2 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -16,29 +16,49 @@
#include <unistd.h>
+#include "access/xlog.h"
+#include "catalog/pg_tablespace_d.h"
#include "common/relpath.h"
#include "postmaster/startup.h"
+#include "storage/bufmgr.h"
#include "storage/copydir.h"
#include "storage/fd.h"
+#include "storage/md.h"
#include "storage/reinit.h"
+#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
- int op);
+ Oid tspid, int op);
static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
- int op);
+ Oid tspid, Oid dbid, int op);
typedef struct
{
Oid reloid; /* hash key */
-} unlogged_relation_entry;
+ bool has_init; /* has INIT fork */
+ bool dirty_init; /* needs to remove INIT fork */
+ bool dirty_all; /* needs to remove all forks */
+} relfile_entry;
/*
- * Reset unlogged relations from before the last restart.
+ * Clean up and reset relation files from before the last restart.
*
- * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any
- * relation with an "init" fork, except for the "init" fork itself.
+ * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations
+ * depending on the existence of the "cleanup" forks.
+ *
+ * If an SMGR_MARK_UNCOMMITTED mark file for the init fork is present, we
+ * remove the init fork along with the mark file.
+ *
+ * If an SMGR_MARK_UNCOMMITTED mark file for the main fork is present, we
+ * remove the whole relation along with the mark file.
+ *
+ * Otherwise, if the "init" fork is found, we remove all forks of any relation
+ * with the "init" fork, except for the "init" fork itself.
+ *
+ * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all
+ * relations that have the "cleanup" and/or the "init" forks.
*
* If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main
* fork.
@@ -72,7 +92,7 @@ ResetUnloggedRelations(int op)
/*
* First process unlogged files in pg_default ($PGDATA/base)
*/
- ResetUnloggedRelationsInTablespaceDir("base", op);
+ ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op);
/*
* Cycle through directories for all non-default tablespaces.
@@ -81,13 +101,19 @@ ResetUnloggedRelations(int op)
while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
{
+ Oid tspid;
+
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
continue;
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
- ResetUnloggedRelationsInTablespaceDir(temp_path, op);
+
+ tspid = atooid(spc_de->d_name);
+
+ Assert(tspid != 0);
+ ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op);
}
FreeDir(spc_dir);
@@ -103,7 +129,8 @@ ResetUnloggedRelations(int op)
* Process one tablespace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
+ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
+ Oid tspid, int op)
{
DIR *ts_dir;
struct dirent *de;
@@ -130,6 +157,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
while ((de = ReadDir(ts_dir, tsdirname)) != NULL)
{
+ Oid dbid;
+
/*
* We're only interested in the per-database directories, which have
* numeric names. Note that this code will also (properly) ignore "."
@@ -148,7 +177,10 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
ereport_startup_progress("resetting unlogged relations (cleanup), elapsed time: %ld.%02d s, current path: %s",
dbspace_path);
- ResetUnloggedRelationsInDbspaceDir(dbspace_path, op);
+ dbid = atooid(de->d_name);
+ Assert(dbid != 0);
+
+ ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op);
}
FreeDir(ts_dir);
@@ -158,125 +190,227 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
* Process one per-dbspace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
+ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
+ Oid tspid, Oid dbid, int op)
{
DIR *dbspace_dir;
struct dirent *de;
char rm_path[MAXPGPATH * 2];
+ HTAB *hash;
+ HASHCTL ctl;
/* Caller must specify at least one operation. */
- Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0);
+ Assert((op & (UNLOGGED_RELATION_CLEANUP |
+ UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_INIT)) != 0);
/*
* Cleanup is a two-pass operation. First, we go through and identify all
* the files with init forks. Then, we go through again and nuke
* everything with the same OID except the init fork.
*/
+
+ /*
+ * It's possible that someone could create tons of unlogged relations in
+ * the same database & tablespace, so we'd better use a hash table rather
+ * than an array or linked list to keep track of which files need to be
+ * reset. Otherwise, this cleanup operation would be O(n^2).
+ */
+ memset(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(relfile_entry);
+ hash = hash_create("relfilenode cleanup hash",
+ 32, &ctl, HASH_ELEM | HASH_BLOBS);
+
+ /* Collect INIT fork and mark files in the directory. */
+ dbspace_dir = AllocateDir(dbspacedirname);
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
+ {
+ int oidchars;
+ ForkNumber forkNum;
+ StorageMarks mark;
+
+ /* Skip anything that doesn't look like a relation data file. */
+ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
+ &forkNum, &mark))
+ continue;
+
+ if (forkNum == INIT_FORKNUM || mark == SMGR_MARK_UNCOMMITTED)
+ {
+ Oid key;
+ relfile_entry *ent;
+ bool found;
+
+ /*
+ * Record the relfilenode information. If it has
+ * SMGR_MARK_UNCOMMITTED mark files, the relfilenode is in dirty
+ * state, where clean up is needed.
+ */
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_ENTER, &found);
+
+ if (!found)
+ {
+ ent->has_init = false;
+ ent->dirty_init = false;
+ ent->dirty_all = false;
+ }
+
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_init = true;
+ else if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_all = true;
+ else
+ {
+ Assert(forkNum == INIT_FORKNUM);
+ ent->has_init = true;
+ }
+ }
+ }
+
+ /* Done with the first pass. */
+ FreeDir(dbspace_dir);
+
+ /* nothing to do if we have neither init nor cleanup forks */
+ if (hash_get_num_entries(hash) < 1)
+ {
+ hash_destroy(hash);
+ return;
+ }
+
+ if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0)
+ {
+ /*
+ * When we come here after recovery, an smgr object for this file might
+ * have been created. In that case we need to drop all buffers and then
+ * the smgr object before initializing the unlogged relation. This is
+ * safe as long as no other backends have accessed the relation before
+ * starting archive recovery.
+ */
+ HASH_SEQ_STATUS status;
+ relfile_entry *ent;
+ SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8);
+ int maxrels = 8;
+ int nrels = 0;
+ int i;
+
+ Assert(!HotStandbyActive());
+
+ hash_seq_init(&status, hash);
+ while((ent = (relfile_entry *) hash_seq_search(&status)) != NULL)
+ {
+ RelFileNodeBackend rel;
+
+ /*
+ * The relation is persistent and remains persistent. Don't
+ * drop the buffers for this relation.
+ */
+ if (ent->has_init && ent->dirty_init)
+ continue;
+
+ if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = ent->reloid;
+
+ srels[nrels++] = smgropen(rel.node, InvalidBackendId);
+ }
+
+ DropRelFileNodesAllBuffers(srels, nrels);
+
+ for (i = 0 ; i < nrels ; i++)
+ smgrclose(srels[i]);
+ }
+
+ /*
+ * Now, make a second pass and remove anything that matches.
+ */
if ((op & UNLOGGED_RELATION_CLEANUP) != 0)
{
- HTAB *hash;
- HASHCTL ctl;
-
- /*
- * It's possible that someone could create a ton of unlogged relations
- * in the same database & tablespace, so we'd better use a hash table
- * rather than an array or linked list to keep track of which files
- * need to be reset. Otherwise, this cleanup operation would be
- * O(n^2).
- */
- ctl.keysize = sizeof(Oid);
- ctl.entrysize = sizeof(unlogged_relation_entry);
- ctl.hcxt = CurrentMemoryContext;
- hash = hash_create("unlogged relation OIDs", 32, &ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- /* Scan the directory. */
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ int oidchars;
+ Oid key;
+ relfile_entry *ent;
+ RelFileNodeBackend rel;
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* Also skip it unless this is the init fork. */
- if (forkNum != INIT_FORKNUM)
- continue;
-
- /*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
- */
- ent.reloid = atooid(de->d_name);
- (void) hash_search(hash, &ent, HASH_ENTER, NULL);
- }
-
- /* Done with the first pass. */
- FreeDir(dbspace_dir);
-
- /*
- * If we didn't find any init forks, there's no point in continuing;
- * we can bail out now.
- */
- if (hash_get_num_entries(hash) == 0)
- {
- hash_destroy(hash);
- return;
- }
-
- /*
- * Now, make a second pass and remove anything that matches.
- */
- dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
- {
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
-
- /* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* We never remove the init fork. */
- if (forkNum == INIT_FORKNUM)
+ &forkNum, &mark))
continue;
/*
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
- if (hash_search(hash, &ent, HASH_FIND, NULL))
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);
+
+ if (!ent)
+ continue;
+
+ if (!ent->dirty_all)
{
- snprintf(rm_path, sizeof(rm_path), "%s/%s",
- dbspacedirname, de->d_name);
- if (unlink(rm_path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m",
- rm_path)));
+ /* clean permanent relations don't need cleanup */
+ if (!ent->has_init)
+ continue;
+
+ if (ent->dirty_init)
+ {
+ /*
+ * The crashed transaction did SET UNLOGGED. This relation
+ * is restored to a LOGGED relation.
+ */
+ if (forkNum != INIT_FORKNUM)
+ continue;
+ }
else
- elog(DEBUG2, "unlinked file \"%s\"", rm_path);
+ {
+ /*
+ * we don't remove the INIT fork of a non-dirty
+ * relfilenode
+ */
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE)
+ continue;
+ }
}
+
+ /* so, nuke it! */
+ snprintf(rm_path, sizeof(rm_path), "%s/%s",
+ dbspacedirname, de->d_name);
+ if (unlink(rm_path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ rm_path)));
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = atooid(de->d_name);
+
+ ForgetRelationForkSyncRequests(rel, forkNum);
}
/* Cleanup is complete. */
FreeDir(dbspace_dir);
- hash_destroy(hash);
}
+ hash_destroy(hash);
+ hash = NULL;
+
/*
* Initialization happens after cleanup is complete: we copy each init
- * fork file to the corresponding main fork file. Note that if we are
- * asked to do both cleanup and init, we may never get here: if the
- * cleanup code determines that there are no init forks in this dbspace,
- * it will return before we get to this point.
+ * fork file to the corresponding main fork file.
*/
if ((op & UNLOGGED_RELATION_INIT) != 0)
{
@@ -285,6 +419,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char srcpath[MAXPGPATH * 2];
@@ -292,9 +427,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -328,15 +465,18 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char mainpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -379,7 +519,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
*/
bool
parse_filename_for_nontemp_relation(const char *name, int *oidchars,
- ForkNumber *fork)
+ ForkNumber *fork, StorageMarks *mark)
{
int pos;
@@ -410,11 +550,19 @@ parse_filename_for_nontemp_relation(const char *name, int *oidchars,
for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar)
;
- if (segchar <= 1)
- return false;
- pos += segchar;
+ if (segchar > 1)
+ pos += segchar;
}
+ /* mark file? */
+ if (name[pos] == '.' && name[pos + 1] != 0)
+ {
+ *mark = name[pos + 1];
+ pos += 2;
+ }
+ else
+ *mark = SMGR_MARK_NONE;
+
/* Now we should be at the end. */
if (name[pos] != '\0')
return false;
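A note for readers following the second pass in ResetUnloggedRelationsInDbspaceDir() above: the per-file keep/remove decision depends only on the three flags collected in the first pass plus the fork and mark of the file at hand. The following is a minimal standalone sketch of that decision, not code from the patch. The type and function names (Fork, Mark, RelfileState, should_remove_file) are hypothetical stand-ins, and the value 'u' for the uncommitted mark is an assumption, since the definition of SMGR_MARK_UNCOMMITTED is not part of this excerpt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for ForkNumber and StorageMarks. */
typedef enum { FORK_MAIN, FORK_FSM, FORK_VM, FORK_INIT } Fork;
typedef enum { MARK_NONE = 0, MARK_UNCOMMITTED = 'u' /* assumed value */ } Mark;

/* Mirrors the flags of relfile_entry gathered in the first pass. */
typedef struct
{
    bool has_init;      /* an init fork file exists */
    bool dirty_init;    /* a mark file exists for the init fork */
    bool dirty_all;     /* a mark file exists for the main fork */
} RelfileState;

/*
 * Decide whether one file belonging to a relfilenode that is present in
 * the hash should be unlinked during cleanup.
 */
static bool
should_remove_file(const RelfileState *st, Fork fork, Mark mark)
{
    assert(st != NULL);

    /* Creation of the whole relation did not commit: remove everything. */
    if (st->dirty_all)
        return true;

    /* No init fork: treated as a permanent relation, nothing to remove. */
    if (!st->has_init)
        return false;

    /* SET UNLOGGED did not commit: drop only the init fork and its mark. */
    if (st->dirty_init)
        return fork == FORK_INIT;

    /* Ordinary unlogged relation: keep only the unmarked init fork. */
    return !(fork == FORK_INIT && mark == MARK_NONE);
}
```

Under these assumptions, a plain unlogged relation (has_init only) loses every file except its init fork, while a relation whose SET UNLOGGED did not commit (has_init and dirty_init) loses only the init fork and its mark file.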
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index b4bca7eed6..580b74839f 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -139,7 +139,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
BlockNumber blkno, bool skipFsync, int behavior);
static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
MdfdVec *seg);
-
+static bool mdmarkexists(SMgrRelation reln, ForkNumber forkNum,
+ StorageMarks mark);
/*
* mdinit() -- Initialize private state for magnetic disk storage manager.
@@ -169,6 +170,81 @@ mdexists(SMgrRelation reln, ForkNumber forkNum)
return (mdopenfork(reln, forkNum, EXTENSION_RETURN_NULL) != NULL);
}
+/*
+ * mdcreatemark() -- Create a mark file.
+ *
+ * If isRedo is true, it's okay for the file to exist already.
+ */
+void
+mdcreatemark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+ int fd;
+
+ /* See mdcreate for details. */
+ TablespaceCreateDbspace(reln->smgr_rnode.node.spcNode,
+ reln->smgr_rnode.node.dbNode,
+ isRedo);
+
+ fd = BasicOpenFile(path, O_WRONLY | O_CREAT | O_EXCL);
+ if (fd < 0 && (!isRedo || errno != EEXIST))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create mark file \"%s\": %m", path)));
+
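+ /* In redo the file may already exist, leaving fd invalid; skip it then. */
+ if (fd >= 0)
+ {
+ pg_fsync(fd);
+ close(fd);
+ }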
+
+ /*
+ * To guarantee that the creation of the file is persistent, fsync its
+ * parent directory.
+ */
+ fsync_parent_path(path, ERROR);
+
+ pfree(path);
+}
+
+
+/*
+ * mdunlinkmark() -- Delete the mark file
+ *
+ * If isRedo is true, it's okay if the file is not found.
+ */
+void
+mdunlinkmark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+
+ if (!isRedo || mdmarkexists(reln, forkNum, mark))
+ durable_unlink(path, ERROR);
+
+ pfree(path);
+}
+
+/*
+ * mdmarkexists() -- Check if the file exists.
+ */
+static bool
+mdmarkexists(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+ int fd;
+
+ fd = BasicOpenFile(path, O_RDONLY);
+ if (fd < 0 && errno != ENOENT)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not access mark file \"%s\": %m", path)));
+ pfree(path);
+
+ if (fd < 0)
+ return false;
+
+ close(fd);
+ return true;
+}
+
/*
* mdcreate() -- Create a new relation on magnetic disk.
*
@@ -1025,6 +1101,15 @@ register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum,
RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ );
}
+/*
+ * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork
+ */
+void
+ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, ForkNumber forknum)
+{
+ register_forget_request(rnode, forknum, 0);
+}
+
/*
* ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB
*/
@@ -1378,12 +1463,14 @@ mdsyncfiletag(const FileTag *ftag, char *path)
* Return 0 on success, -1 on failure, with errno set.
*/
int
-mdunlinkfiletag(const FileTag *ftag, char *path)
+mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark)
{
char *p;
/* Compute the path. */
- p = relpathperm(ftag->rnode, MAIN_FORKNUM);
+ p = GetRelationPath(ftag->rnode.dbNode, ftag->rnode.spcNode,
+ ftag->rnode.relNode, InvalidBackendId, MAIN_FORKNUM,
+ mark);
strlcpy(path, p, MAXPGPATH);
pfree(p);
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 0fcef4994b..110e64b0b2 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -62,6 +62,10 @@ typedef struct f_smgr
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+ void (*smgr_createmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+ void (*smgr_unlinkmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
} f_smgr;
static const f_smgr smgrsw[] = {
@@ -82,6 +86,8 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
+ .smgr_createmark = mdcreatemark,
+ .smgr_unlinkmark = mdunlinkmark,
}
};
@@ -335,6 +341,26 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo);
}
+/*
+ * smgrcreatemark() -- Create a mark file
+ */
+void
+smgrcreatemark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_createmark(reln, forknum, mark, isRedo);
+}
+
+/*
+ * smgrunlinkmark() -- Delete a mark file
+ */
+void
+smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlinkmark(reln, forknum, mark, isRedo);
+}
+
/*
* smgrdosyncall() -- Immediately sync all forks of all given relations
*
@@ -662,6 +688,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index d4083e8a56..9563940d45 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -89,7 +89,8 @@ static CycleCtr checkpoint_cycle_ctr = 0;
typedef struct SyncOps
{
int (*sync_syncfiletag) (const FileTag *ftag, char *path);
- int (*sync_unlinkfiletag) (const FileTag *ftag, char *path);
+ int (*sync_unlinkfiletag) (const FileTag *ftag, char *path,
+ StorageMarks mark);
bool (*sync_filetagmatches) (const FileTag *ftag,
const FileTag *candidate);
} SyncOps;
@@ -222,7 +223,8 @@ SyncPostCheckpoint(void)
/* Unlink the file */
if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
- path) < 0)
+ path,
+ SMGR_MARK_NONE) < 0)
{
/*
* There's a race condition, when the database is dropped at the
@@ -236,6 +238,20 @@ SyncPostCheckpoint(void)
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m", path)));
}
+ else if (syncsw[entry->tag.handler].sync_unlinkfiletag(
+ &entry->tag, path,
+ SMGR_MARK_UNCOMMITTED) < 0)
+ {
+ /*
+ * We may also have an SMGR_MARK_UNCOMMITTED file. Remove it if the
+ * fork file has been successfully removed. It's ok if the mark file
+ * does not exist.
+ */
+ if (errno != ENOENT)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
/* Mark the list entry as canceled, just in case */
entry->canceled = true;
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 436df54120..dbc0da5da5 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -407,6 +407,30 @@ extractPageInfo(XLogReaderState *record)
* source system.
*/
}
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_UNLINK)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_MARK)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
else if (rmid == RM_XACT_ID &&
((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT ||
(rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED ||
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 1f5c426ec0..67f24890d6 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -139,9 +139,15 @@ GetDatabasePath(Oid dbNode, Oid spcNode)
*/
char *
GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
- int backendId, ForkNumber forkNumber)
+ int backendId, ForkNumber forkNumber, char mark)
{
char *path;
+ char markstr[10];
+
+ if (mark == 0)
+ markstr[0] = 0;
+ else
+ snprintf(markstr, 10, ".%c", mark);
if (spcNode == GLOBALTABLESPACE_OID)
{
@@ -149,10 +155,10 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
Assert(dbNode == 0);
Assert(backendId == InvalidBackendId);
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("global/%u_%s",
- relNode, forkNames[forkNumber]);
+ path = psprintf("global/%u_%s%s",
+ relNode, forkNames[forkNumber], markstr);
else
- path = psprintf("global/%u", relNode);
+ path = psprintf("global/%u%s", relNode, markstr);
}
else if (spcNode == DEFAULTTABLESPACE_OID)
{
@@ -160,22 +166,22 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/%u_%s",
+ path = psprintf("base/%u/%u_%s%s",
dbNode, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/%u",
- dbNode, relNode);
+ path = psprintf("base/%u/%u%s",
+ dbNode, relNode, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/t%d_%u_%s",
+ path = psprintf("base/%u/t%d_%u_%s%s",
dbNode, backendId, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/t%d_%u",
- dbNode, backendId, relNode);
+ path = psprintf("base/%u/t%d_%u%s",
+ dbNode, backendId, relNode, markstr);
}
}
else
@@ -184,27 +190,28 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u_%s%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
dbNode, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
- dbNode, relNode);
+ dbNode, relNode, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
dbNode, backendId, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
- dbNode, backendId, relNode);
+ dbNode, backendId, relNode, markstr);
}
}
+
return path;
}
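To make the naming scheme in the relpath.c changes above concrete: a mark file carries the same name as the fork file it belongs to, with ".<mark>" appended. The following is a small standalone sketch of building such a name and reading the mark back, assuming the mark is a single non-digit character. The helper names (build_relfile_name, mark_of) are hypothetical and exist only for illustration; the patch itself does this in GetRelationPath() and parse_filename_for_nontemp_relation(). The mark character 'u' used in the usage note is likewise an assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "<relnode>[_<forkname>][.<mark>]" into buf.  forkname is NULL for
 * the main fork; mark is 0 when no mark suffix is wanted.  Returns the
 * value of snprintf().
 */
static int
build_relfile_name(char *buf, size_t buflen, unsigned relnode,
                   const char *forkname, char mark)
{
    char markstr[3] = "";

    assert(buf != NULL && buflen > 0);

    if (mark != 0)
        snprintf(markstr, sizeof(markstr), ".%c", mark);

    if (forkname != NULL)
        return snprintf(buf, buflen, "%u_%s%s", relnode, forkname, markstr);
    return snprintf(buf, buflen, "%u%s", relnode, markstr);
}

/*
 * Return the mark character of a file name, or 0 if it has none.  A
 * trailing ".<digits>" is a segment number, not a mark, so a digit right
 * after the last dot is not treated as a mark.
 */
static char
mark_of(const char *name)
{
    const char *dot = strrchr(name, '.');

    if (dot != NULL && dot[1] != '\0' && dot[2] == '\0' &&
        !isdigit((unsigned char) dot[1]))
        return dot[1];
    return 0;
}
```

For example, build_relfile_name(buf, sizeof(buf), 16384, "init", 'u') produces "16384_init.u", and mark_of() on that name gives 'u', while a segment file such as "16384.1" is reported as having no mark.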
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 0ab32b44e9..382623159c 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -23,6 +23,8 @@
extern int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
+extern void RelationCreateInitFork(Relation rel);
+extern void RelationDropInitFork(Relation rel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationPreTruncate(Relation rel);
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index f0814f1458..12346ed7f6 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -18,17 +18,23 @@
#include "lib/stringinfo.h"
#include "storage/block.h"
#include "storage/relfilenode.h"
+#include "storage/smgr.h"
/*
* Declarations for smgr-related XLOG records
*
- * Note: we log file creation and truncation here, but logging of deletion
- * actions is handled by xact.c, because it is part of transaction commit.
+ * Note: we log file creation, truncation and buffer persistence changes here,
+ * but logging of deletion actions is handled mainly by xact.c, because it is
+ * part of transaction commit in most cases. However, there's a case where
+ * init forks are deleted outside the control of a transaction.
*/
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_UNLINK 0x30
+#define XLOG_SMGR_MARK 0x40
+#define XLOG_SMGR_BUFPERSISTENCE 0x50
typedef struct xl_smgr_create
{
@@ -36,6 +42,32 @@ typedef struct xl_smgr_create
ForkNumber forkNum;
} xl_smgr_create;
+typedef struct xl_smgr_unlink
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+} xl_smgr_unlink;
+
+typedef enum smgr_mark_action
+{
+ XLOG_SMGR_MARK_CREATE = 'c',
+ XLOG_SMGR_MARK_UNLINK = 'u'
+} smgr_mark_action;
+
+typedef struct xl_smgr_mark
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ smgr_mark_action action;
+} xl_smgr_mark;
+
+typedef struct xl_smgr_bufpersistence
+{
+ RelFileNode rnode;
+ bool persistence;
+} xl_smgr_bufpersistence;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +83,12 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark);
+extern void log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark);
+extern void log_smgrbufpersistence(const RelFileNode *rnode, bool persistence);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index a44be11ca0..106a5cf508 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -67,7 +67,7 @@ extern int forkname_chars(const char *str, ForkNumber *fork);
extern char *GetDatabasePath(Oid dbNode, Oid spcNode);
extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
- int backendId, ForkNumber forkNumber);
+ int backendId, ForkNumber forkNumber, char mark);
/*
* Wrapper macros for GetRelationPath. Beware of multiple
@@ -77,7 +77,7 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
/* First argument is a RelFileNode */
#define relpathbackend(rnode, backend, forknum) \
GetRelationPath((rnode).dbNode, (rnode).spcNode, (rnode).relNode, \
- backend, forknum)
+ backend, forknum, 0)
/* First argument is a RelFileNode */
#define relpathperm(rnode, forknum) \
@@ -87,4 +87,9 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
#define relpath(rnode, forknum) \
relpathbackend((rnode).node, (rnode).backend, forknum)
+/* First argument is a RelFileNodeBackend */
+#define markpath(rnode, forknum, mark) \
+ GetRelationPath((rnode).node.dbNode, (rnode).node.spcNode, \
+ (rnode).node.relNode, \
+ (rnode).backend, forknum, mark)
#endif /* RELPATH_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index cfce23ecbc..f5a7df87a4 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -206,6 +206,8 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels)
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+ bool permanent, bool isRedo);
extern void DropRelFileNodesAllBuffers(struct SMgrRelationData **smgr_reln, int nnodes);
extern void DropDatabaseBuffers(Oid dbid);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 34602ae006..2dc0357ad5 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -185,6 +185,7 @@ extern ssize_t pg_pwritev_with_retry(int fd,
extern int pg_truncate(const char *path, off_t length);
extern void fsync_fname(const char *fname, bool isdir);
extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
+extern int fsync_parent_path(const char *fname, int elevel);
extern int durable_rename(const char *oldfile, const char *newfile, int loglevel);
extern int durable_unlink(const char *fname, int loglevel);
extern int durable_rename_excl(const char *oldfile, const char *newfile, int loglevel);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 752b440864..99620816b5 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -23,6 +23,10 @@
extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
extern void mdclose(SMgrRelation reln, ForkNumber forknum);
+extern void mdcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void mdunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
@@ -41,12 +45,14 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void ForgetRelationForkSyncRequests(RelFileNodeBackend rnode,
+ ForkNumber forknum);
extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
/* md sync callbacks */
extern int mdsyncfiletag(const FileTag *ftag, char *path);
-extern int mdunlinkfiletag(const FileTag *ftag, char *path);
+extern int mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark);
extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *candidate);
#endif /* MD_H */
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index fad1e5c473..e1f97e9b89 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -16,13 +16,15 @@
#define REINIT_H
#include "common/relpath.h"
-
+#include "storage/smgr.h"
extern void ResetUnloggedRelations(int op);
-extern bool parse_filename_for_nontemp_relation(const char *name,
- int *oidchars, ForkNumber *fork);
+extern bool parse_filename_for_nontemp_relation(const char *name, int *oidchars,
+ ForkNumber *fork,
+ StorageMarks *mark);
#define UNLOGGED_RELATION_CLEANUP 0x0001
-#define UNLOGGED_RELATION_INIT 0x0002
+#define UNLOGGED_RELATION_DROP_BUFFER 0x0002
+#define UNLOGGED_RELATION_INIT 0x0004
#endif /* REINIT_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a6fbf7b6a6..201ecace8a 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,18 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
+/*
+ * Storage marks is a file of which existence suggests something about a
+ * file. The name of such files is "<filename>.<mark>", where the mark is one
+ * of the values of StorageMarks. Since ".<digit>" means segment files so don't
+ * use digits for the mark character.
+ */
+typedef enum StorageMarks
+{
+ SMGR_MARK_NONE = 0,
+ SMGR_MARK_UNCOMMITTED = 'u' /* the file is not committed yet */
+} StorageMarks;
+
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
@@ -85,7 +97,12 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
extern void smgrclose(SMgrRelation reln);
extern void smgrcloseall(void);
extern void smgrclosenode(RelFileNodeBackend rnode);
+extern void smgrcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void smgrunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
--
2.27.0
Attachment: v9-0002-New-command-ALTER-TABLE-ALL-IN-TABLESPACE-SET-LOG.patch (text/x-patch; charset=us-ascii)
From 951e264c26bbb0523a872268fb28981227dda041 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 23:21:09 +0900
Subject: [PATCH v9 2/2] New command ALTER TABLE ALL IN TABLESPACE SET
LOGGED/UNLOGGED
To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes
relation persistence of all tables in the specified tablespace.
---
src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++++++++++
src/backend/nodes/copyfuncs.c | 16 ++++
src/backend/nodes/equalfuncs.c | 15 ++++
src/backend/parser/gram.y | 20 +++++
src/backend/tcop/utility.c | 11 +++
src/include/commands/tablecmds.h | 2 +
src/include/nodes/nodes.h | 1 +
src/include/nodes/parsenodes.h | 9 ++
8 files changed, 214 insertions(+)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 0d9c801535..7c18ed9e75 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14499,6 +14499,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
return new_tablespaceoid;
}
+/*
+ * Alter Table ALL ... SET LOGGED/UNLOGGED
+ *
+ * Allows a user to change persistence of all objects in a given tablespace in
+ * the current database. Objects can be chosen based on the owner of the
+ * object also, to allow users to change persistence of only their objects. The
+ * main permissions handling is done by the lower-level change persistence
+ * function.
+ *
+ * All to-be-modified objects are locked first. If NOWAIT is specified and the
+ * lock can't be acquired then we ereport(ERROR).
+ */
+void
+AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt)
+{
+ List *relations = NIL;
+ ListCell *l;
+ ScanKeyData key[1];
+ Relation rel;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ Oid tablespaceoid;
+ List *role_oids = roleSpecsToIds(NIL);
+
+ /* Ensure we were not asked to change something we can't */
+ if (stmt->objtype != OBJECT_TABLE)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("only tables can be specified")));
+
+ /* Get the tablespace OID */
+ tablespaceoid = get_tablespace_oid(stmt->tablespacename, false);
+
+ /*
+ * Now that the checks are done, check if we should set either to
+ * InvalidOid because it is our database's default tablespace.
+ */
+ if (tablespaceoid == MyDatabaseTableSpace)
+ tablespaceoid = InvalidOid;
+
+ /*
+ * Walk the list of objects in the tablespace to pick up them. This will
+ * only find objects in our database, of course.
+ */
+ ScanKeyInit(&key[0],
+ Anum_pg_class_reltablespace,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(tablespaceoid));
+
+ rel = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(rel, 1, key);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relOid = relForm->oid;
+
+ /*
+ * Do not pick-up objects in pg_catalog as part of this, if an admin
+ * really wishes to do so, they can issue the individual ALTER
+ * commands directly.
+ *
+ * Also, explicitly avoid any shared tables, temp tables, or TOAST
+ * (TOAST will be changed with the main table).
+ */
+ if (IsCatalogNamespace(relForm->relnamespace) ||
+ relForm->relisshared ||
+ isAnyTempNamespace(relForm->relnamespace) ||
+ IsToastNamespace(relForm->relnamespace))
+ continue;
+
+ /* Only pick up the object type requested */
+ if (relForm->relkind != RELKIND_RELATION)
+ continue;
+
+ /* Check if we are only picking-up objects owned by certain roles */
+ if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner))
+ continue;
+
+ /*
+ * Handle permissions-checking here since we are locking the tables
+ * and also to avoid doing a bunch of work only to fail part-way. Note
+ * that permissions will also be checked by AlterTableInternal().
+ *
+ * Caller must be considered an owner on the table of which we're going
+ * to change persistence.
+ */
+ if (!pg_class_ownercheck(relOid, GetUserId()))
+ aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)),
+ NameStr(relForm->relname));
+
+ if (stmt->nowait &&
+ !ConditionalLockRelationOid(relOid, AccessExclusiveLock))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("aborting because lock on relation \"%s.%s\" is not available",
+ get_namespace_name(relForm->relnamespace),
+ NameStr(relForm->relname))));
+ else
+ LockRelationOid(relOid, AccessExclusiveLock);
+
+ /*
+ * Add to our list of objects of which we're going to change
+ * persistence.
+ */
+ relations = lappend_oid(relations, relOid);
+ }
+
+ table_endscan(scan);
+ table_close(rel, AccessShareLock);
+
+ if (relations == NIL)
+ ereport(NOTICE,
+ (errcode(ERRCODE_NO_DATA_FOUND),
+ errmsg("no matching relations in tablespace \"%s\" found",
+ tablespaceoid == InvalidOid ? "(database default)" :
+ get_tablespace_name(tablespaceoid))));
+
+ /*
+ * Everything is locked, loop through and change persistence of all of the
+ * relations.
+ */
+ foreach(l, relations)
+ {
+ List *cmds = NIL;
+ AlterTableCmd *cmd = makeNode(AlterTableCmd);
+
+ if (stmt->logged)
+ cmd->subtype = AT_SetLogged;
+ else
+ cmd->subtype = AT_SetUnLogged;
+
+ cmds = lappend(cmds, cmd);
+
+ EventTriggerAlterTableStart((Node *) stmt);
+ /* OID is set by AlterTableInternal */
+ AlterTableInternal(lfirst_oid(l), cmds, false);
+ EventTriggerAlterTableEnd();
+ }
+}
+
static void
index_copy_data(Relation rel, RelFileNode newrnode)
{
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index df0b747883..55e38cfe3f 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4269,6 +4269,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from)
return newnode;
}
+static AlterTableSetLoggedAllStmt *
+_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *from)
+{
+ AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt);
+
+ COPY_STRING_FIELD(tablespacename);
+ COPY_SCALAR_FIELD(objtype);
+ COPY_SCALAR_FIELD(logged);
+ COPY_SCALAR_FIELD(nowait);
+
+ return newnode;
+}
+
static CreateExtensionStmt *
_copyCreateExtensionStmt(const CreateExtensionStmt *from)
{
@@ -5622,6 +5635,9 @@ copyObjectImpl(const void *from)
case T_AlterTableMoveAllStmt:
retval = _copyAlterTableMoveAllStmt(from);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _copyAlterTableSetLoggedAllStmt(from);
+ break;
case T_CreateExtensionStmt:
retval = _copyCreateExtensionStmt(from);
break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index cb7ddd463c..a19b7874d7 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1916,6 +1916,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a,
return true;
}
+static bool
+_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *a,
+ const AlterTableSetLoggedAllStmt *b)
+{
+ COMPARE_STRING_FIELD(tablespacename);
+ COMPARE_SCALAR_FIELD(objtype);
+ COMPARE_SCALAR_FIELD(logged);
+ COMPARE_SCALAR_FIELD(nowait);
+
+ return true;
+}
+
static bool
_equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b)
{
@@ -3625,6 +3637,9 @@ equal(const void *a, const void *b)
case T_AlterTableMoveAllStmt:
retval = _equalAlterTableMoveAllStmt(a, b);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _equalAlterTableSetLoggedAllStmt(a, b);
+ break;
case T_CreateExtensionStmt:
retval = _equalCreateExtensionStmt(a, b);
break;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 3d4dd43e47..9823d57a54 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -1984,6 +1984,26 @@ AlterTableStmt:
n->nowait = $13;
$$ = (Node *)n;
}
+ | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = true;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = false;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
| ALTER INDEX qualified_name alter_table_cmds
{
AlterTableStmt *n = makeNode(AlterTableStmt);
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 1fbc387d47..1483f9a475 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -162,6 +162,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_AlterTSConfigurationStmt:
case T_AlterTSDictionaryStmt:
case T_AlterTableMoveAllStmt:
+ case T_AlterTableSetLoggedAllStmt:
case T_AlterTableSpaceOptionsStmt:
case T_AlterTableStmt:
case T_AlterTypeStmt:
@@ -1747,6 +1748,12 @@ ProcessUtilitySlow(ParseState *pstate,
commandCollected = true;
break;
+ case T_AlterTableSetLoggedAllStmt:
+ AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree);
+ /* commands are stashed in AlterTableSetLoggedAll */
+ commandCollected = true;
+ break;
+
case T_DropStmt:
ExecDropStmt((DropStmt *) parsetree, isTopLevel);
/* no commands stashed for DROP */
@@ -2669,6 +2676,10 @@ CreateCommandTag(Node *parsetree)
tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype);
+ break;
+
case T_AlterTableStmt:
tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype);
break;
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index 336549cc5f..714077ff4c 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
+extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt);
+
extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
Oid *oldschema);
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 7c657c1241..8860b2e548 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -428,6 +428,7 @@ typedef enum NodeTag
T_AlterCollationStmt,
T_CallStmt,
T_AlterStatsStmt,
+ T_AlterTableSetLoggedAllStmt,
/*
* TAGS FOR PARSE TREE NODES (parsenodes.h)
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 4c5a8a39bf..c3e1bc66d1 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2350,6 +2350,15 @@ typedef struct AlterTableMoveAllStmt
bool nowait;
} AlterTableMoveAllStmt;
+typedef struct AlterTableSetLoggedAllStmt
+{
+ NodeTag type;
+ char *tablespacename;
+ ObjectType objtype; /* Object type to move */
+ bool logged;
+ bool nowait;
+} AlterTableSetLoggedAllStmt;
+
/* ----------------------
* Create/Alter Extension Statements
* ----------------------
--
2.27.0
Hi Kyotaro, I'm glad you are still into this
I didn't register for some reasons.
Right now in v8 there's a typo in ./src/backend/catalog/storage.c :
storage.c: In function 'RelationDropInitFork':
storage.c:385:44: error: expected statement before ')' token
pending->unlink_forknum != INIT_FORKNUM)) <-- here, one ) too much
1. I'm not sure that we want to have the new mark files.
I can't help with such a design decision, but if there are doubts, then maybe add return-code checks around:
a) pg_fsync() and fsync_parent_path() (??) inside mdcreatemark()
b) mdunlinkmark() inside mdunlinkmark()
and PANIC if something goes wrong?
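As a rough illustration of the return-code checking suggested above, here is a minimal POSIX sketch that creates an empty "<path>.<mark>" file, fsyncs it, fsyncs its parent directory, and aborts on any unexpected failure. It is not the patch's mdcreatemark(): the names create_mark_file(), sync_parent_dir() and panic() are hypothetical, and abort() merely stands in for ereport(PANIC, ...).

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <libgen.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-in for ereport(PANIC, ...): report the error and abort. */
static void
panic(const char *what, const char *path)
{
	fprintf(stderr, "PANIC: could not %s \"%s\": %s\n",
			what, path, strerror(errno));
	abort();
}

/* fsync the directory holding "path" so the new directory entry is durable. */
static void
sync_parent_dir(const char *path)
{
	char		buf[1024];
	const char *dir;
	int			fd;

	snprintf(buf, sizeof(buf), "%s", path);
	dir = dirname(buf);			/* may modify buf, hence the copy */

	fd = open(dir, O_RDONLY);
	if (fd < 0)
		panic("open directory", dir);
	if (fsync(fd) != 0)
		panic("fsync directory", dir);
	if (close(fd) != 0)
		panic("close directory", dir);
}

/*
 * Create the empty file "<path>.<mark>" and make it durable.
 * Returns 0 on success and -1 if the mark already exists; any other
 * failure is treated as fatal.
 */
int
create_mark_file(const char *path, char mark)
{
	char		markpath[1024];
	int			fd;

	assert(path != NULL);
	snprintf(markpath, sizeof(markpath), "%s.%c", path, mark);

	fd = open(markpath, O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (fd < 0)
	{
		if (errno == EEXIST)
			return -1;
		panic("create file", markpath);
	}
	if (fsync(fd) != 0)
		panic("fsync file", markpath);
	if (close(fd) != 0)
		panic("close file", markpath);

	sync_parent_dir(markpath);
	return 0;
}
```

Calling create_mark_file("/tmp/16384_init", 'u') would leave an empty, synced file /tmp/16384_init.u behind. Whether an already existing mark should be an error or a no-op is a design choice this sketch does not try to settle.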
2. Aside from possible bugs, I'm not confident that the crash-safety of
this patch is actually water-tight. At least we need tests for some
failure cases.
3. As mentioned in transam/README, failure in removing smgr mark files
leads to immediate shut down. I'm not sure this behavior is acceptable.
Doesn't it happen for most of the stuff already? There's even data_sync_retry GUC.
4. Including the reasons above, this is not fully functional.
For example, if we execute the following commands on primary,
replica doesn't work correctly. (boom!)
=# CREATE UNLOGGED TABLE t (a int);
=# ALTER TABLE t SET LOGGED;
The following fixes are done in the attached v8.
- Rebased. Referring to Jakub and Justin's work, I replaced direct
access to ->rd_smgr with RelationGetSmgr() and removed calls to
RelationOpenSmgr(). I still separate the "ALTER TABLE ALL IN
TABLESPACE SET LOGGED/UNLOGGED" statement part.
- Fixed RelationCreate/DropInitFork's behavior for non-target
relations. (From Jakub's work).
- Fixed wording of some comments.
- As revisited, I found a bug around recovery. If the logged-ness of a
relation gets flipped repeatedly in a transaction, duplicate
pending-delete entries are accumulated during recovery and work in a
wrong way. smgr_redo now adds at most one entry per action.
- The issue 4 above is not fixed (yet).
Thanks again. If you have a list of crash-test ideas, maybe I'll have some minutes
to try them out. Is there any go-to list of things to check to add confidence
in this patch (as per point #2)?
BTW fast feedback regarding that ALTER patch (there were 4 unlogged tables):
# ALTER TABLE ALL IN TABLESPACE tbs1 set logged;
WARNING: unrecognized node type: 349
-J.
At Mon, 20 Dec 2021 07:59:29 +0000, Jakub Wartak <Jakub.Wartak@tomtom.com> wrote in
Hi Kyotaro, I'm glad you are still into this
I didn't register for some reasons.
Right now in v8 there's a typo in ./src/backend/catalog/storage.c :
storage.c: In function 'RelationDropInitFork':
storage.c:385:44: error: expected statement before ')' token
pending->unlink_forknum != INIT_FORKNUM)) <-- here, one ) too much
Yeah, I thought that I had removed it. I believe the v9 patch is correct.
1. I'm not sure that we want to have the new mark files.
I can't help with such a design decision, but if there are doubts, then maybe add return-code checks around:
a) pg_fsync() and fsync_parent_path() (??) inside mdcreatemark()
b) mdunlinkmark() inside mdunlinkmark()
and PANIC if something goes wrong?
The point is whether it is worth the complexity it adds. Since the mark
file can resolve another existing issue (which I don't recall in detail)
and this patchset actually fixes it, it can be said to have a certain
degree of persuasiveness. But that doesn't change the fact that it's
additional complexity.
2. Aside from possible bugs, I'm not confident that the crash-safety of
this patch is actually water-tight. At least we need tests for some
failure cases.
3. As mentioned in transam/README, failure in removing smgr mark files
leads to immediate shut down. I'm not sure this behavior is acceptable.
Doesn't it happen for most of the stuff already? There's even data_sync_retry GUC.
Hmm. Yes, actually it is "as water-tight as possible". I just want
others' eyes on that perspective. CF could be the entry point for
others but I'm a bit hesitant to add a new entry..
4. Including the reasons above, this is not fully functional.
For example, if we execute the following commands on primary,
replica doesn't work correctly. (boom!)
=# CREATE UNLOGGED TABLE t (a int);
=# ALTER TABLE t SET LOGGED;
The following fixes are done in the attached v8.
- Rebased. Referring to Jakub and Justin's work, I replaced direct
access to ->rd_smgr with RelationGetSmgr() and removed calls to
RelationOpenSmgr(). I still separate the "ALTER TABLE ALL IN
TABLESPACE SET LOGGED/UNLOGGED" statement part.
- Fixed RelationCreate/DropInitFork's behavior for non-target
relations. (From Jakub's work).
- Fixed wording of some comments.
- As revisited, I found a bug around recovery. If the logged-ness of a
relation gets flipped repeatedly in a transaction, duplicate
pending-delete entries are accumulated during recovery and work in a
wrong way. smgr_redo now adds at most one entry per action.
- The issue 4 above is not fixed (yet).
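To make the "at most one entry per action" fix quoted above more concrete, here is a small sketch of a pending list that refuses duplicates. It only assumes that an entry is identified by relation, fork and action; the struct layout, the action codes and the function names are made up for illustration and do not reflect the patch's actual smgr_redo logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* One pending action recorded during recovery (hypothetical layout). */
typedef struct PendingEntry
{
	unsigned int relnode;		/* relation file number */
	int			forknum;		/* fork the action applies to */
	char		action;			/* made-up codes, e.g. 'c' create, 'u' unlink */
	struct PendingEntry *next;
} PendingEntry;

/*
 * Add an entry unless an identical one is already pending.
 * Returns true if a new entry was added, false if it was a duplicate.
 */
bool
add_pending_once(PendingEntry **head, unsigned int relnode,
				 int forknum, char action)
{
	PendingEntry *p;

	assert(head != NULL);
	for (p = *head; p != NULL; p = p->next)
	{
		if (p->relnode == relnode &&
			p->forknum == forknum &&
			p->action == action)
			return false;		/* already recorded once */
	}

	p = malloc(sizeof(PendingEntry));
	assert(p != NULL);
	p->relnode = relnode;
	p->forknum = forknum;
	p->action = action;
	p->next = *head;
	*head = p;
	return true;
}

/* Number of entries currently pending. */
int
pending_count(const PendingEntry *head)
{
	int			n = 0;

	for (; head != NULL; head = head->next)
		n++;
	return n;
}

/* Release the whole list. */
void
free_pending(PendingEntry **head)
{
	while (*head != NULL)
	{
		PendingEntry *next = (*head)->next;

		free(*head);
		*head = next;
	}
}
```

With this shape, replaying a thousand SET UNLOGGED / SET LOGGED flips for the same relation and fork leaves two entries in the list rather than two thousand.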
Thanks again. If you have a list of crash-test ideas, maybe I'll have some minutes
to try them out. Is there any go-to list of things to check to add confidence
in this patch (as per point #2)?
Just causing a crash (kill -9) after executing a problem-prone command
sequence, then checking that recovery works well, would be sufficient.
For example:
create unlogged table; begin; insert ..; alter table set logged;
<crash>. Recovery works.
"create logged; begin; {alter unlogged; alter logged;} * 1000; alter
logged; commit/abort" doesn't pollute pgdata.
BTW fast feedback regarding that ALTER patch (there were 4 unlogged tables):
# ALTER TABLE ALL IN TABLESPACE tbs1 set logged;
WARNING: unrecognized node type: 349
lol I met a server crash. Will fix. Thanks!
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Mon, 20 Dec 2021 17:39:27 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Mon, 20 Dec 2021 07:59:29 +0000, Jakub Wartak <Jakub.Wartak@tomtom.com> wrote in
BTW fast feedback regarding that ALTER patch (there were 4 unlogged tables):
# ALTER TABLE ALL IN TABLESPACE tbs1 set logged;
WARNING: unrecognized node type: 349
lol I met a server crash. Will fix. Thanks!
That crash vanished after a recompilation for me and I don't see that
error. On my dev env node# 349 is T_AlterTableSetLoggedAllStmt, which
0002 adds. So perhaps make clean/make all would fix that.
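The stale-build hypothesis above can be sketched in a few lines: adding a member to an enum shifts every later member, so an object file compiled against the old header reads the new tag as something else. Only the value 349 comes from this thread; the neighbouring tags, their numbers and all function names are assumptions made for the sketch, and it does not establish what actually caused the warning.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Illustrative tag values only: 349 is taken from the thread, the
 * neighbouring tags and their numbers are made up for this sketch.
 */
enum OldNodeTag					/* header before 0002 */
{
	OLD_T_AlterStatsStmt = 348,
	OLD_T_A_Expr				/* 349 */
};

enum NewNodeTag					/* header after 0002 */
{
	NEW_T_AlterStatsStmt = 348,
	NEW_T_AlterTableSetLoggedAllStmt,	/* 349 */
	NEW_T_A_Expr				/* 350 */
};

/* Stands for an object file that was compiled against the old header. */
const char *
stale_tag_name(int tag)
{
	switch (tag)
	{
		case OLD_T_AlterStatsStmt:
			return "AlterStatsStmt";
		case OLD_T_A_Expr:
			return "A_Expr";
		default:
			return NULL;		/* "unrecognized node type" */
	}
}

/* Stands for an object file rebuilt against the new header. */
const char *
fresh_tag_name(int tag)
{
	switch (tag)
	{
		case NEW_T_AlterStatsStmt:
			return "AlterStatsStmt";
		case NEW_T_AlterTableSetLoggedAllStmt:
			return "AlterTableSetLoggedAllStmt";
		case NEW_T_A_Expr:
			return "A_Expr";
		default:
			return NULL;
	}
}

/* Do the stale and the fresh build interpret this tag the same way? */
int
builds_agree(int tag)
{
	const char *stale = stale_tag_name(tag);
	const char *fresh = fresh_tag_name(tag);

	if (stale == NULL || fresh == NULL)
		return stale == fresh;
	return strcmp(stale, fresh) == 0;
}
```

In this sketch the stale code takes tag 349 for an A_Expr and does not recognize tag 350 at all, which is the kind of mismatch a full rebuild removes.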
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi Kyotaro,
At Mon, 20 Dec 2021 17:39:27 +0900 (JST), Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote in
At Mon, 20 Dec 2021 07:59:29 +0000, Jakub Wartak
<Jakub.Wartak@tomtom.com> wrote in
BTW fast feedback regarding that ALTER patch (there were 4 unlogged
tables):
# ALTER TABLE ALL IN TABLESPACE tbs1 set logged;
WARNING: unrecognized node type: 349
lol I met a server crash. Will fix. Thanks!
That crash vanished after a recompilation for me and I don't see that error. On
my dev env node# 349 is T_AlterTableSetLoggedAllStmt, which
0002 adds. So perhaps make clean/make all would fix that.
As fast as I could, I've repeated the whole cycle for that one with a fresh v9 (make clean, configure, make install, fresh initdb) and found two problems:
1) check-worlds seems OK but make -C src/test/recovery check shows a couple of failing tests here locally and in https://cirrus-ci.com/task/4699985735319552?logs=test#L807 :
t/009_twophase.pl (Wstat: 256 Tests: 24 Failed: 1)
Failed test: 21
Non-zero exit status: 1
t/014_unlogged_reinit.pl (Wstat: 512 Tests: 12 Failed: 2)
Failed tests: 9-10
Non-zero exit status: 2
t/018_wal_optimize.pl (Wstat: 7424 Tests: 0 Failed: 0)
Non-zero exit status: 29
Parse errors: Bad plan. You planned 38 tests but ran 0.
t/022_crash_temp_files.pl (Wstat: 7424 Tests: 6 Failed: 0)
Non-zero exit status: 29
Parse errors: Bad plan. You planned 9 tests but ran 6.
018 made no sense. I've tried to take a quick look, with wal_level=minimal, at why it is failing. It is a mystery to me, as the sequence seems to be pretty basic but the outcome is not:
~> cat repro.sql
create tablespace tbs1 location '/tbs1';
CREATE TABLE moved (id int);
INSERT INTO moved VALUES (1);
BEGIN;
ALTER TABLE moved SET TABLESPACE tbs1;
CREATE TABLE originated (id int);
INSERT INTO originated VALUES (1);
CREATE UNIQUE INDEX ON originated(id) TABLESPACE tbs1;
COMMIT;
~> psql -f repro.sql z3; sleep 1; /usr/pgsql-15/bin/pg_ctl -D /var/lib/pgsql/15/data -l logfile -m immediate stop
CREATE TABLESPACE
CREATE TABLE
INSERT 0 1
BEGIN
ALTER TABLE
CREATE TABLE
INSERT 0 1
CREATE INDEX
COMMIT
waiting for server to shut down.... done
server stopped
~> /usr/pgsql-15/bin/pg_ctl -D /var/lib/pgsql/15/data -l logfile start
waiting for server to start.... done
server started
z3# select * from moved;
ERROR: could not open file "pg_tblspc/32834/PG_15_202112131/32833/32838": No such file or directory
z3=# select * from originated;
ERROR: could not open file "base/32833/32839": No such file or directory
z3=# \dt+
List of relations
Schema | Name | Type | Owner | Persistence | Size | Description
--------+------------+-------+----------+-------------+---------+-------------
public | moved | table | postgres | permanent | 0 bytes |
public | originated | table | postgres | permanent | 0 bytes |
This happens even without placing anything on a tablespace at all (for the originated table, but not for the moved one). Either some major mishap is there (commit should guarantee correctness) or I'm tired and having sloppy fingers.
2) A minor test case; still, something is odd.
drop tablespace tbs1;
create tablespace tbs1 location '/tbs1';
CREATE UNLOGGED TABLE t4 (a int) tablespace tbs1;
CREATE UNLOGGED TABLE t5 (a int) tablespace tbs1;
CREATE UNLOGGED TABLE t6 (a int) tablespace tbs1;
CREATE TABLE t7 (a int) tablespace tbs1;
insert into t7 values (1);
insert into t5 values (1);
insert into t6 values (1);
\dt+
List of relations
Schema | Name | Type | Owner | Persistence | Size | Description
--------+------+-------+----------+-------------+------------+-------------
public | t4 | table | postgres | unlogged | 0 bytes |
public | t5 | table | postgres | unlogged | 8192 bytes |
public | t6 | table | postgres | unlogged | 8192 bytes |
public | t7 | table | postgres | permanent | 8192 bytes |
(4 rows)
ALTER TABLE ALL IN TABLESPACE tbs1 set logged;
==> STILL WARNING: unrecognized node type: 349
\dt+
List of relations
Schema | Name | Type | Owner | Persistence | Size | Description
--------+------+-------+----------+-------------+------------+-------------
public | t4 | table | postgres | permanent | 0 bytes |
public | t5 | table | postgres | permanent | 8192 bytes |
public | t6 | table | postgres | permanent | 8192 bytes |
public | t7 | table | postgres | permanent | 8192 bytes |
So it did rewrite; however, this warning seems to be unfixed. I've tested on e2c52beecdea152ca680a22ef35c6a7da55aa30f.
-J.
Ugh! I completely forgot about TAP tests.. Thanks for the testing and
sorry for the bugs.
This is a fairly big change so I need a bit of time before posting the
next version.
At Mon, 20 Dec 2021 13:38:35 +0000, Jakub Wartak <Jakub.Wartak@tomtom.com> wrote in
1) check-worlds seems OK but make -C src/test/recovery check shows a couple of failing tests here locally and in https://cirrus-ci.com/task/4699985735319552?logs=test#L807 :
t/009_twophase.pl (Wstat: 256 Tests: 24 Failed: 1)
Failed test: 21
Non-zero exit status: 1
PREPARE TRANSACTION requires uncommitted file creation to be
committed. Concretely, we need to remove the "mark" files for the
relation files created in the transaction during PREPARE TRANSACTION.
pendingSync is not a parallel mechanism with pendingDeletes, so we
cannot move mark deletion to pendingSync.
After all, I decided to add a separate list, pendingCleanups, for pending
non-deletion tasks, kept apart from pendingDeletes, and to execute it
before inserting the commit record. Not only the above failure but also
all of the following failures vanished with the change.
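The ordering described above (run the pending non-deletion tasks first, then insert the commit record) can be sketched as follows. This is only a toy model of the idea: the list layout, the event trace and the functions register_cleanup(), remove_mark_file(), write_commit_record() and commit_transaction() are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Event log so the ordering can be inspected: 'm' = mark removed, 'C' = commit. */
char		event_trace[64];

static void
trace_event(char c)
{
	size_t		len = strlen(event_trace);

	assert(len + 1 < sizeof(event_trace));
	event_trace[len] = c;
	event_trace[len + 1] = '\0';
}

/* One pending non-deletion task to run at commit (hypothetical layout). */
typedef struct PendingCleanup
{
	void		(*action) (void);
	struct PendingCleanup *next;
} PendingCleanup;

PendingCleanup *pending_cleanups = NULL;

/* Remember a task; tasks run in reverse order of registration. */
void
register_cleanup(void (*action) (void))
{
	PendingCleanup *p = malloc(sizeof(PendingCleanup));

	assert(p != NULL);
	p->action = action;
	p->next = pending_cleanups;
	pending_cleanups = p;
}

/* Run and release every pending task, leaving the list empty. */
void
run_pending_cleanups(void)
{
	while (pending_cleanups != NULL)
	{
		PendingCleanup *p = pending_cleanups;

		pending_cleanups = p->next;
		p->action();
		free(p);
	}
}

/* Placeholder for removing one mark file. */
void
remove_mark_file(void)
{
	trace_event('m');
}

/* Placeholder for inserting the commit record. */
void
write_commit_record(void)
{
	trace_event('C');
}

/* Commit path: cleanups run before the commit record is written. */
void
commit_transaction(void)
{
	run_pending_cleanups();
	write_commit_record();
}
```

After registering remove_mark_file twice and calling commit_transaction(), event_trace holds "mmC": both marks are gone before the commit record exists, which is the property the PREPARE TRANSACTION case needs.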
t/014_unlogged_reinit.pl (Wstat: 512 Tests: 12 Failed: 2)
Failed tests: 9-10
Non-zero exit status: 2
t/018_wal_optimize.pl (Wstat: 7424 Tests: 0 Failed: 0)
Non-zero exit status: 29
Parse errors: Bad plan. You planned 38 tests but ran 0.
t/022_crash_temp_files.pl (Wstat: 7424 Tests: 6 Failed: 0)
Non-zero exit status: 29
Parse errors: Bad plan. You planned 9 tests but ran 6.
018 made no sense. I've tried to take a quick look, with wal_level=minimal, at why it is failing. It is a mystery to me, as the sequence seems to be pretty basic but the outcome is not:
I think this shares the same cause.
~> cat repro.sql
create tablespace tbs1 location '/tbs1';
CREATE TABLE moved (id int);
INSERT INTO moved VALUES (1);
BEGIN;
ALTER TABLE moved SET TABLESPACE tbs1;
CREATE TABLE originated (id int);
INSERT INTO originated VALUES (1);
CREATE UNIQUE INDEX ON originated(id) TABLESPACE tbs1;
COMMIT;
..
ERROR: could not open file "base/32833/32839": No such file or directory
z3=# \dt+
List of relations
Schema | Name | Type | Owner | Persistence | Size | Description
--------+------------+-------+----------+-------------+---------+-------------
public | moved | table | postgres | permanent | 0 bytes |
public | originated | table | postgres | permanent | 0 bytes |
This happens even without placing anything on a tablespace at all (for the originated table, but not for the moved one). Either some major mishap is there (commit should guarantee correctness) or I'm tired and having sloppy fingers.
2) A minor test case; still, something is odd.
drop tablespace tbs1;
create tablespace tbs1 location '/tbs1';
CREATE UNLOGGED TABLE t4 (a int) tablespace tbs1;
CREATE UNLOGGED TABLE t5 (a int) tablespace tbs1;
CREATE UNLOGGED TABLE t6 (a int) tablespace tbs1;
CREATE TABLE t7 (a int) tablespace tbs1;
insert into t7 values (1);
insert into t5 values (1);
insert into t6 values (1);
\dt+
List of relations
Schema | Name | Type | Owner | Persistence | Size | Description
--------+------+-------+----------+-------------+------------+-------------
public | t4 | table | postgres | unlogged | 0 bytes |
public | t5 | table | postgres | unlogged | 8192 bytes |
public | t6 | table | postgres | unlogged | 8192 bytes |
public | t7 | table | postgres | permanent | 8192 bytes |
(4 rows)
ALTER TABLE ALL IN TABLESPACE tbs1 set logged;
==> STILL WARNING: unrecognized node type: 349
\dt+
List of relations
Schema | Name | Type | Owner | Persistence | Size | Description
--------+------+-------+----------+-------------+------------+-------------
public | t4 | table | postgres | permanent | 0 bytes |
public | t5 | table | postgres | permanent | 8192 bytes |
public | t6 | table | postgres | permanent | 8192 bytes |
public | t7 | table | postgres | permanent | 8192 bytes |
So it did rewrite; however, this warning seems to be unfixed. I've tested on e2c52beecdea152ca680a22ef35c6a7da55aa30f.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Tue, 21 Dec 2021 17:13:21 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Ugh! I completely forgot about the TAP tests. Thanks for the testing,
and sorry for the bugs.
This is a fairly big change, so I need a bit of time before posting the
next version.
I took a somewhat long detour, but the patch now passes make-world for
me.
In this version:
- When relation persistence is changed from logged to unlogged, buffer
persistence is flipped, then an init fork is created along with a mark
file for the fork (RelationCreateInitFork). The mark file is removed
at commit but left alone after a crash before commit. At the next
startup, ResetUnloggedRelationsInDbspaceDir() removes the init fork
file if it finds the corresponding mark file.
- When relation persistence is changed from unlogged to logged, buffer
persistence is flipped, then the existing init fork is marked to be
dropped at commit (RelationDropInitFork). Finally, the whole content
is WAL-logged page by page (RelationChangePersistence).
- The two operations above are repeatable within a transaction: commit
makes the last operation persist, and rollback abandons all of
them.
- Storage files are created along with a "mark" file for the
relfilenode. It behaves the same way as above, except that the mark
file corresponds to the whole relfilenode.
- The at-commit operations this patch adds need to be WAL-logged, so
they don't fit the pendingDeletes list, which is executed after commit. I
added a new pending-work list, pendingCleanups, that is executed just
after pendingSyncs. (new in this version)
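The mark-file life cycle described in the list above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: it tracks a single relation as two booleans, touches no real files or WAL, and every name in it (ToyRel, ToyCleanup, toy_set_unlogged, toy_end_xact, toy_crash, toy_startup_cleanup) is hypothetical and not part of the patch or of PostgreSQL. It shows only the ordering idea: the mark file is dropped at commit, both files are dropped at abort, and a mark file that survives a crash tells startup to remove the init fork.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy model only; hypothetical names, not PostgreSQL code. */
typedef struct ToyRel
{
    bool has_init_fork;   /* stands in for the init fork file */
    bool has_mark_file;   /* stands in for the "uncommitted" mark file */
} ToyRel;

enum { TOY_UNLINK_FORK = 1 << 0, TOY_UNLINK_MARK = 1 << 1 };

typedef struct ToyCleanup
{
    int  op;                    /* bitmask of TOY_UNLINK_* */
    bool at_commit;             /* true: run at commit; false: run at abort */
    struct ToyCleanup *next;
} ToyCleanup;

static ToyCleanup *toy_pending = NULL;   /* in-memory list, lost on crash */

static void
toy_register(int op, bool at_commit)
{
    ToyCleanup *c = malloc(sizeof(ToyCleanup));

    if (c == NULL)
        abort();
    c->op = op;
    c->at_commit = at_commit;
    c->next = toy_pending;
    toy_pending = c;
}

/* Logged -> unlogged: create the mark file first, then the init fork. */
static void
toy_set_unlogged(ToyRel *rel)
{
    rel->has_mark_file = true;
    rel->has_init_fork = true;
    toy_register(TOY_UNLINK_FORK | TOY_UNLINK_MARK, false); /* undo at abort */
    toy_register(TOY_UNLINK_MARK, true);                    /* drop mark at commit */
}

/* End of transaction: run only the entries registered for this outcome. */
static void
toy_end_xact(ToyRel *rel, bool is_commit)
{
    while (toy_pending != NULL)
    {
        ToyCleanup *c = toy_pending;

        toy_pending = c->next;
        if (c->at_commit == is_commit)
        {
            if (c->op & TOY_UNLINK_FORK)
                rel->has_init_fork = false;
            if (c->op & TOY_UNLINK_MARK)
                rel->has_mark_file = false;
        }
        free(c);
    }
}

/* Crash: the pending list vanishes, but the files stay on disk. */
static void
toy_crash(void)
{
    while (toy_pending != NULL)
    {
        ToyCleanup *c = toy_pending;

        toy_pending = c->next;
        free(c);
    }
}

/* Startup: a surviving mark file means the init fork was never committed. */
static void
toy_startup_cleanup(ToyRel *rel)
{
    if (rel->has_mark_file)
    {
        rel->has_init_fork = false;
        rel->has_mark_file = false;
    }
}
```

In this model, toy_set_unlogged() followed by toy_end_xact(&rel, true) leaves the init fork in place with no mark file, while toy_crash() followed by toy_startup_cleanup() removes both, which mirrors what the description above says about ResetUnloggedRelationsInDbspaceDir().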
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v10-0001-In-place-table-persistence-change.patch (text/x-patch; charset=us-ascii)
From bc7e14b8af3c72e4ab99c964688d18ef4545f8b9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v10 1/2] In-place table persistence change
Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, it currently runs a heap rewrite, which causes a large amount
of file I/O. This patch makes the command run without a heap rewrite.
In addition, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of a massive number of HEAP_INSERTs, which should be
smaller.
Also, this allows for the cleanup of files left behind when the
transaction that created them crashes.
---
src/backend/access/rmgrdesc/smgrdesc.c | 52 +++
src/backend/access/transam/README | 8 +
src/backend/access/transam/xact.c | 7 +
src/backend/access/transam/xlog.c | 17 +
src/backend/catalog/storage.c | 545 ++++++++++++++++++++++++-
src/backend/commands/tablecmds.c | 256 ++++++++++--
src/backend/replication/basebackup.c | 3 +-
src/backend/storage/buffer/bufmgr.c | 88 ++++
src/backend/storage/file/fd.c | 4 +-
src/backend/storage/file/reinit.c | 344 +++++++++++-----
src/backend/storage/smgr/md.c | 93 ++++-
src/backend/storage/smgr/smgr.c | 32 ++
src/backend/storage/sync/sync.c | 20 +-
src/bin/pg_rewind/parsexlog.c | 24 ++
src/common/relpath.c | 47 ++-
src/include/catalog/storage.h | 3 +
src/include/catalog/storage_xlog.h | 42 +-
src/include/common/relpath.h | 9 +-
src/include/storage/bufmgr.h | 2 +
src/include/storage/fd.h | 1 +
src/include/storage/md.h | 8 +-
src/include/storage/reinit.h | 10 +-
src/include/storage/smgr.h | 17 +
23 files changed, 1450 insertions(+), 182 deletions(-)
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 7755553d57..d251f22207 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,49 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+ char *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+ appendStringInfoString(buf, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+ char *path = GetRelationPath(xlrec->rnode.dbNode,
+ xlrec->rnode.spcNode,
+ xlrec->rnode.relNode,
+ InvalidBackendId,
+ xlrec->forkNum, xlrec->mark);
+ char *action;
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ action = "CREATE";
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ action = "DELETE";
+ break;
+ default:
+ action = "<unknown action>";
+ break;
+ }
+
+ appendStringInfo(buf, "%s %s", action, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+ char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+ appendStringInfoString(buf, path);
+ appendStringInfo(buf, " persistence %d", xlrec->persistence);
+ pfree(path);
+ }
}
const char *
@@ -55,6 +98,15 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_UNLINK:
+ id = "UNLINK";
+ break;
+ case XLOG_SMGR_MARK:
+ id = "MARK";
+ break;
+ case XLOG_SMGR_BUFPERSISTENCE:
+ id = "BUFPERSISTENCE";
+ break;
}
return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c1..b344bbe511 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -724,6 +724,14 @@ we must panic and abort recovery. The DBA will have to manually clean up
then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.
+The Smgr MARK files
+--------------------------------
+
+An smgr mark file is created when a new relation file is created to
+mark the relfilenode needs to be cleaned up at recovery time. In
+contrast to the four actions above, failure to remove smgr mark files
+will lead to data loss, in which case the server will shut down.
+
Skipping WAL for New RelFileNode
--------------------------------
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e7b0bc804d..b41186d6d8 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2197,6 +2197,9 @@ CommitTransaction(void)
*/
smgrDoPendingSyncs(true, is_parallel_worker);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2447,6 +2450,9 @@ PrepareTransaction(void)
*/
smgrDoPendingSyncs(true, false);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2772,6 +2778,7 @@ AbortTransaction(void)
AfterTriggerEndXact(false); /* 'false' means it's abort */
AtAbort_Portals();
smgrDoPendingSyncs(false, is_parallel_worker);
+ smgrDoPendingCleanups(false);
AtEOXact_LargeObject(false);
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1e1fbe957f..59f4c2eacf 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -40,6 +40,7 @@
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
#include "catalog/pg_database.h"
+#include "catalog/storage.h"
#include "commands/progress.h"
#include "commands/tablespace.h"
#include "common/controldata_utils.h"
@@ -4564,6 +4565,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
{
ereport(DEBUG1,
(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+
+ /* cleanup garbage files left during crash recovery */
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
InArchiveRecovery = true;
if (StandbyModeRequested)
StandbyMode = true;
@@ -7824,6 +7833,14 @@ StartupXLOG(void)
}
}
+ /* cleanup garbage files left during crash recovery */
+ if (!InArchiveRecovery)
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
/* Allow resource managers to do any required cleanup. */
for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
{
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index c5ad28d71f..d6b30387e9 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,6 +19,7 @@
#include "postgres.h"
+#include "access/amapi.h"
#include "access/parallel.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -66,6 +67,23 @@ typedef struct PendingRelDelete
struct PendingRelDelete *next; /* linked-list link */
} PendingRelDelete;
+#define PCOP_UNLINK_FORK (1 << 0)
+#define PCOP_UNLINK_MARK (1 << 1)
+#define PCOP_SET_PERSISTENCE (1 << 2)
+
+typedef struct PendingCleanup
+{
+ RelFileNode relnode; /* relation that may need to be deleted */
+ int op; /* operation mask */
+ bool bufpersistence; /* buffer persistence to set */
+ int unlink_forknum; /* forknum to unlink */
+ StorageMarks unlink_mark; /* mark to unlink */
+ BackendId backend; /* InvalidBackendId if not a temp rel */
+ bool atCommit; /* T=delete at commit; F=delete at abort */
+ int nestLevel; /* xact nesting level of request */
+ struct PendingCleanup *next; /* linked-list link */
+} PendingCleanup;
+
typedef struct PendingRelSync
{
RelFileNode rnode;
@@ -73,6 +91,7 @@ typedef struct PendingRelSync
} PendingRelSync;
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingCleanup *pendingCleanups = NULL; /* head of linked list */
HTAB *pendingSyncHash = NULL;
@@ -117,7 +136,8 @@ AddPendingSync(const RelFileNode *rnode)
SMgrRelation
RelationCreateStorage(RelFileNode rnode, char relpersistence)
{
- PendingRelDelete *pending;
+ PendingRelDelete *pendingdel;
+ PendingCleanup *pendingclean;
SMgrRelation srel;
BackendId backend;
bool needs_wal;
@@ -143,21 +163,41 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return NULL; /* placate compiler */
}
+ /*
+ * We are going to create a new storage file. If server crashes before the
+ * current transaction ends the file needs to be cleaned up. The
+ * SMGR_MARK_UNCOMMITTED mark file prompts that work at the next startup.
+ */
srel = smgropen(rnode, backend);
+ log_smgrcreatemark(&rnode, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
smgrcreate(srel, MAIN_FORKNUM, false);
if (needs_wal)
log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM);
/* Add the relation to the list of stuff to delete at abort */
- pending = (PendingRelDelete *)
+ pendingdel = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
- pending->relnode = rnode;
- pending->backend = backend;
- pending->atCommit = false; /* delete if abort */
- pending->nestLevel = GetCurrentTransactionNestLevel();
- pending->next = pendingDeletes;
- pendingDeletes = pending;
+ pendingdel->relnode = rnode;
+ pendingdel->backend = backend;
+ pendingdel->atCommit = false; /* delete if abort */
+ pendingdel->nestLevel = GetCurrentTransactionNestLevel();
+ pendingdel->next = pendingDeletes;
+ pendingDeletes = pendingdel;
+
+ /* drop mark files at commit */
+ pendingclean = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pendingclean->relnode = rnode;
+ pendingclean->op = PCOP_UNLINK_MARK;
+ pendingclean->unlink_forknum = MAIN_FORKNUM;
+ pendingclean->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pendingclean->backend = backend;
+ pendingclean->atCommit = true;
+ pendingclean->nestLevel = GetCurrentTransactionNestLevel();
+ pendingclean->next = pendingCleanups;
+ pendingCleanups = pendingclean;
if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
{
@@ -168,6 +208,203 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return srel;
}
+/*
+ * RelationCreateInitFork
+ * Create physical storage for the init fork of a relation.
+ *
+ * Create the init fork for the relation.
+ *
+ * This function is transactional. The creation is WAL-logged, and if the
+ * transaction aborts later on, the init fork will be removed.
+ */
+void
+RelationCreateInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ SMgrRelation srel;
+ bool create = true;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), false, false);
+
+ /*
+ * If we have entries for init-fork operations on this relation, that means
+ * that we have already registered pending delete entries to drop an
+ * init-fork preexisting since before the current transaction started. This
+ * function reverts that change just by removing the entries.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ create = false;
+ }
+ else
+ prev = pending;
+ }
+
+ if (!create)
+ return;
+
+ /*
+ * We are going to create an init fork. If server crashes before the
+ * current transaction ends the init fork left alone corrupts data while
+ * recovery. The mark file works as the sentinel to identify that
+ * situation.
+ */
+ srel = smgropen(rnode, InvalidBackendId);
+ log_smgrcreatemark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+
+ /* We don't have existing init fork, create it. */
+ smgrcreate(srel, INIT_FORKNUM, false);
+
+ /*
+ * index-init fork needs further initialization. ambuildempty should do
+ * WAL-log and file sync by itself but otherwise we do that by ourselves.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rnode, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /* drop the init fork, mark file and revert persistence at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_UNLINK_FORK | PCOP_UNLINK_MARK | PCOP_SET_PERSISTENCE;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->bufpersistence = true;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* drop mark file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
+/*
+ * RelationDropInitFork
+ * Delete physical storage for the init fork of a relation.
+ *
+ * Register pending-delete of the init fork. The real deletion is performed by
+ * smgrDoPendingCleanups at commit.
+ *
+ * This function is transactional. If the transaction aborts later on, the
+ * deletion doesn't happen.
+ */
+void
+RelationDropInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ bool inxact_created = false;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), true, false);
+
+ /*
+ * If we have entries for init-fork operations of this relation, that means
+ * that we have created the init fork in the current transaction. We
+ * remove the init fork and mark file immediately in that case. Otherwise
+ * just register pending-delete for the existing init fork.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ /* unlink list entry */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ inxact_created = true;
+ }
+ else
+ prev = pending;
+ }
+
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rnode, InvalidBackendId);
+
+ /*
+ * INIT forks never be loaded to shared buffer so no point in dropping
+ * buffers for such files.
+ */
+ log_smgrunlinkmark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrunlinkmark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+ log_smgrunlink(&rnode, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ return;
+ }
+
+ /* register drop of this init fork file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_UNLINK_FORK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = false;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -187,6 +424,88 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum)
+{
+ xl_smgr_unlink xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file unlink.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_CREATEMARK record to WAL.
+ */
+void
+log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file creation.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_CREATE;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINKMARK record to WAL.
+ */
+void
+log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the mark file removal.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_UNLINK;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileNode *rnode, bool persistence)
+{
+ xl_smgr_bufpersistence xlrec;
+
+ /*
+ * Make an XLOG entry reporting the change of buffer persistence.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.persistence = persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -255,6 +574,7 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)
prev->next = next;
else
pendingDeletes = next;
+
pfree(pending);
/* prev does not change */
}
@@ -673,6 +993,88 @@ smgrDoPendingDeletes(bool isCommit)
}
}
+/*
+ * smgrDoPendingCleanups() -- Clean up work that emits WAL records
+ *
+ * The operations handled in this function emit WAL records, which must be
+ * emitted before the commit record for the current transaction.
+ */
+void
+smgrDoPendingCleanups(bool isCommit)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (pending->nestLevel < nestLevel)
+ {
+ /* outer-level entries should not be processed yet */
+ prev = pending;
+ }
+ else
+ {
+ /* unlink list entry first, so we don't retry on failure */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ /* do cleanup if called for */
+ if (pending->atCommit == isCommit)
+ {
+ SMgrRelation srel;
+
+ srel = smgropen(pending->relnode, pending->backend);
+
+ Assert ((pending->op &
+ ~(PCOP_UNLINK_FORK | PCOP_UNLINK_MARK |
+ PCOP_SET_PERSISTENCE)) == 0);
+
+ if (pending->op & PCOP_UNLINK_FORK)
+ {
+ /* other forks needs to drop buffers */
+ Assert(pending->unlink_forknum == INIT_FORKNUM);
+
+ /* Don't emit wal while recovery. */
+ if (!InRecovery)
+ log_smgrunlink(&pending->relnode,
+ pending->unlink_forknum);
+ smgrunlink(srel, pending->unlink_forknum, false);
+ }
+
+ if (pending->op & PCOP_UNLINK_MARK)
+ {
+ SMgrRelation srel;
+
+ if (!InRecovery)
+ log_smgrunlinkmark(&pending->relnode,
+ pending->unlink_forknum,
+ pending->unlink_mark);
+ srel = smgropen(pending->relnode, pending->backend);
+ smgrunlinkmark(srel, pending->unlink_forknum,
+ pending->unlink_mark, InRecovery);
+ smgrclose(srel);
+ }
+
+ if (pending->op & PCOP_SET_PERSISTENCE)
+ {
+ SetRelationBuffersPersistence(srel, pending->bufpersistence,
+ InRecovery);
+ }
+ }
+
+ /* must explicitly free the list entry */
+ pfree(pending);
+ /* prev does not change */
+ }
+ }
+}
+
/*
* smgrDoPendingSyncs() -- Take care of relation syncs at end of xact.
*/
@@ -933,6 +1335,15 @@ smgr_redo(XLogReaderState *record)
reln = smgropen(xlrec->rnode, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ smgrunlink(reln, xlrec->forkNum, true);
+ smgrclose(reln);
+ }
else if (info == XLOG_SMGR_TRUNCATE)
{
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
@@ -1021,6 +1432,124 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ bool created = false;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ smgrcreatemark(reln, xlrec->forkNum, xlrec->mark, true);
+ created = true;
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ smgrunlinkmark(reln, xlrec->forkNum, xlrec->mark, true);
+ break;
+ default:
+ elog(ERROR, "unknown smgr_mark action \"%c\"", xlrec->mark);
+ }
+
+ if (created)
+ {
+ /* revert mark file operation at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = xlrec->rnode;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = xlrec->forkNum;
+ pending->unlink_mark = xlrec->mark;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ else
+ {
+ /*
+ * Delete pending action for this mark file if any. We should have
+ * at most one entry for this action.
+ */
+ PendingCleanup *prev = NULL;
+
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileNodeEquals(xlrec->rnode, pending->relnode) &&
+ pending->unlink_forknum == xlrec->forkNum &&
+ (pending->op & PCOP_UNLINK_MARK) != 0)
+ {
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+ }
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ PendingCleanup *prev = NULL;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+
+ /*
+ * Delete pending action for persistence change if any. We should have
+ * at most one entry for this action.
+ */
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileNodeEquals(xlrec->rnode, pending->relnode) &&
+ (pending->op & PCOP_SET_PERSISTENCE) != 0)
+ {
+ Assert (pending->bufpersistence == xlrec->persistence);
+
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+
+ /*
+ * Revert buffer-persistence changes at abort if the relation is going
+ * to different persistence from before this transaction.
+ */
+ if (!pending)
+ {
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = xlrec->rnode;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = !xlrec->persistence;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index bf42587e38..0d9c801535 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -52,6 +52,7 @@
#include "commands/defrem.h"
#include "commands/event_trigger.h"
#include "commands/policy.h"
+#include "commands/progress.h"
#include "commands/sequence.h"
#include "commands/tablecmds.h"
#include "commands/tablespace.h"
@@ -5329,6 +5330,177 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
return newcmd;
}
+/*
+ * RelationChangePersistence: do in-place persistence change of a relation
+ */
+static void
+RelationChangePersistence(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ int i;
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * Under the following condition, we need to call ATRewriteTable, which
+ * cannot be false in the AT_REWRITE_ALTER_PERSISTENCE case.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ /*
+ * First we collect all relations that we need to change persistence.
+ */
+
+ /* Collect OIDs of indexes and toast relations */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ /* Add toast relation if any */
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ List *toastidx;
+ Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
+
+ relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
+ toastidx = RelationGetIndexList(toastrel);
+ relids = list_concat(relids, toastidx);
+ pfree(toastidx);
+ table_close(toastrel, NoLock);
+ }
+
+ table_close(rel, NoLock);
+
+ /* Make changes in storage */
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach (lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+
+ /*
+ * Some access methods do not accept in-place persistence change. For
+ * example, GiST uses page LSNs to figure out whether a block has
+ * changed, where UNLOGGED GiST indexes use fake LSNs that are
+ * incompatible with real LSNs used for LOGGED ones.
+ *
+ * XXXX: We don't bother to allow in-place persistence change for index
+ * methods other than btree for now.
+ */
+ if (r->rd_rel->relkind == RELKIND_INDEX &&
+ r->rd_rel->relam != BTREE_AM_OID)
+ {
+ int reindex_flags;
+
+ /* reindex doesn't allow concurrent use of the index */
+ table_close(r, NoLock);
+
+ reindex_flags =
+ REINDEX_REL_SUPPRESS_INDEX_USE |
+ REINDEX_REL_CHECK_CONSTRAINTS;
+
+ /* Set the same persistence with the parent relation. */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+
+ reindex_index(reloid, reindex_flags, persistence, 0);
+
+ continue;
+ }
+
+ /* Create or drop init fork */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ RelationCreateInitFork(r);
+ else
+ RelationDropInitFork(r);
+
+ /*
+ * When this relation gets WAL-logged, immediately sync all files but
+ * initfork to establish the initial state on storage. Buffers have
+ * already flushed out by RelationCreate(Drop)InitFork called just
+ * above. Initfork should have been synced as needed.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT)
+ {
+ for (i = 0 ; i < INIT_FORKNUM ; i++)
+ {
+ if (smgrexists(RelationGetSmgr(r), i))
+ smgrimmedsync(RelationGetSmgr(r), i);
+ }
+ }
+
+ /* Update catalog */
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+
+ /*
+ * While wal_level >= replica, switching to LOGGED requires the
+ * relation content to be WAL-logged to recover the table.
+ * We don't emit this while wal_level = minimal.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded())
+ {
+ ForkNumber fork;
+ xl_smgr_truncate xlrec;
+
+ xlrec.blkno = 0;
+ xlrec.rnode = r->rd_node;
+ xlrec.flags = SMGR_TRUNCATE_ALL;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ for (fork = 0; fork < INIT_FORKNUM ; fork++)
+ {
+ if (smgrexists(RelationGetSmgr(r), fork))
+ log_newpage_range(r, fork, 0,
+ smgrnblocks(RelationGetSmgr(r), fork),
+ false);
+ }
+ }
+
+ table_close(r, NoLock);
+ }
+
+ table_close(classRel, NoLock);
+}
+
/*
* ATRewriteTables: ALTER TABLE phase 3
*/
@@ -5459,47 +5631,55 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- /*
- * Create transient table that will receive the modified data.
- *
- * Ensure it is marked correctly as logged or unlogged. We have
- * to do this here so that buffers for the new relfilenode will
- * have the right persistence set, and at the same time ensure
- * that the original filenode's buffers will get read in with the
- * correct setting (i.e. the original one). Otherwise a rollback
- * after the rewrite would possibly result with buffers for the
- * original filenode having the wrong persistence setting.
- *
- * NB: This relies on swap_relation_files() also swapping the
- * persistence. That wouldn't work for pg_class, but that can't be
- * unlogged anyway.
- */
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
- persistence, lockmode);
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
+ RelationChangePersistence(tab, persistence, lockmode);
+ else
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new relfilenode
+ * will have the right persistence set, and at the same time
+ * ensure that the original filenode's buffers will get read in
+ * with the correct setting (i.e. the original one). Otherwise
+ * a rollback after the rewrite would possibly result with
+ * buffers for the original filenode having the wrong
+ * persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that can't
+ * be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace,
+ NewAccessMethod,
+ persistence, lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
+ /*
+ * Swap the physical files of the old and new heaps, then
+ * rebuild indexes and discard the old heap. We can use
+ * RecentXmin for the table's new relfrozenxid because we
+ * rewrote all the tuples in ATRewriteTable, so no older Xid
+ * remains in the table. Also, we never try to swap toast
+ * tables by content, since we have no interest in letting this
+ * code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
- InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ }
}
else
{
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index ec0485705d..45e1a5d817 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1070,6 +1070,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
bool excludeFound;
ForkNumber relForkNum; /* Type of fork if file is a relation */
int relOidChars; /* Chars in filename that are the rel oid */
+ StorageMarks mark;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1120,7 +1121,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
/* Exclude all forks for unlogged tables except the init fork */
if (isDbDir &&
parse_filename_for_nontemp_relation(de->d_name, &relOidChars,
- &relForkNum))
+ &relForkNum, &mark))
{
/* Never exclude init forks */
if (relForkNum != INIT_FORKNUM)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index b4532948d3..dab74bf99a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -37,6 +37,7 @@
#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
#include "executor/instrument.h"
#include "lib/binaryheap.h"
#include "miscadmin.h"
@@ -3154,6 +3155,93 @@ DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
}
}
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a relation
+ * and, when switching to PERMANENT, writes all dirty pages of the relation
+ * out to disk (or more accurately, out to kernel disk buffers), ensuring
+ * that the kernel has an up-to-date view of the relation.
+ *
+ * Generally, the caller should be holding AccessExclusiveLock on the
+ * target relation to ensure that no other backend is busy dirtying
+ * more blocks of the relation; the effects can't be expected to last
+ * after the lock is released.
+ *
+ * XXX currently it sequentially searches the buffer pool, should be
+ * changed to more clever ways of searching. This routine is not
+ * used in any performance-critical code paths, so it's not worth
+ * adding additional overhead to normal paths to make it go faster;
+ * but see also DropRelFileNodeBuffers.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
+{
+ int i;
+ RelFileNodeBackend rnode = srel->smgr_rnode;
+
+ Assert (!RelFileNodeBackendIsTemp(rnode));
+
+ if (!isRedo)
+ log_smgrbufpersistence(&srel->smgr_rnode.node, permanent);
+
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ {
+ UnlockBufHdr(bufHdr, buf_state);
+ continue;
+ }
+
+ if (permanent)
+ {
+ /* Init fork is being dropped, drop buffers for it. */
+ if (bufHdr->tag.forkNum == INIT_FORKNUM)
+ {
+ InvalidateBuffer(bufHdr);
+ continue;
+ }
+
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /* we flush this buffer when switching to PERMANENT */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, srel);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr, true);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ /* init fork is always BM_PERMANENT. See BufferAlloc */
+ if (bufHdr->tag.forkNum != INIT_FORKNUM)
+ buf_state &= ~BM_PERMANENT;
+
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ }
+}
+
/* ---------------------------------------------------------------------
* DropRelFileNodesAllBuffers
*
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 263057841d..8487ae1f02 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -349,8 +349,6 @@ static void pre_sync_fname(const char *fname, bool isdir, int elevel);
static void datadir_fsync_fname(const char *fname, bool isdir, int elevel);
static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel);
-static int fsync_parent_path(const char *fname, int elevel);
-
/*
* pg_fsync --- do fsync with or without writethrough
@@ -3759,7 +3757,7 @@ fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel)
* This is aimed at making file operations persistent on disk in case of
* an OS crash or power failure.
*/
-static int
+int
fsync_parent_path(const char *fname, int elevel)
{
char parentpath[MAXPGPATH];
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 0ae3fb6902..0137902bb2 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -16,29 +16,49 @@
#include <unistd.h>
+#include "access/xlog.h"
+#include "catalog/pg_tablespace_d.h"
#include "common/relpath.h"
#include "postmaster/startup.h"
+#include "storage/bufmgr.h"
#include "storage/copydir.h"
#include "storage/fd.h"
+#include "storage/md.h"
#include "storage/reinit.h"
+#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
- int op);
+ Oid tspid, int op);
static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
- int op);
+ Oid tspid, Oid dbid, int op);
typedef struct
{
Oid reloid; /* hash key */
-} unlogged_relation_entry;
+ bool has_init; /* has INIT fork */
+ bool dirty_init; /* needs to remove INIT fork */
+ bool dirty_all; /* needs to remove all forks */
+} relfile_entry;
/*
- * Reset unlogged relations from before the last restart.
+ * Clean up and reset relation files from before the last restart.
*
- * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any
- * relation with an "init" fork, except for the "init" fork itself.
+ * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations
+ * depending on the existence of mark files.
+ *
+ * If an SMGR_MARK_UNCOMMITTED mark file for the init fork is present, we
+ * remove the init fork along with the mark file.
+ *
+ * If an SMGR_MARK_UNCOMMITTED mark file for the main fork is present, we
+ * remove the whole relation along with the mark file.
+ *
+ * Otherwise, if the "init" fork is found, we remove all forks of any relation
+ * with the "init" fork, except for the "init" fork itself.
+ *
+ * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all
+ * relations that have mark files and/or "init" forks.
*
* If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main
* fork.
@@ -72,7 +92,7 @@ ResetUnloggedRelations(int op)
/*
* First process unlogged files in pg_default ($PGDATA/base)
*/
- ResetUnloggedRelationsInTablespaceDir("base", op);
+ ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op);
/*
* Cycle through directories for all non-default tablespaces.
@@ -81,13 +101,19 @@ ResetUnloggedRelations(int op)
while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
{
+ Oid tspid;
+
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
continue;
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
- ResetUnloggedRelationsInTablespaceDir(temp_path, op);
+
+ tspid = atooid(spc_de->d_name);
+
+ Assert(tspid != 0);
+ ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op);
}
FreeDir(spc_dir);
@@ -103,7 +129,8 @@ ResetUnloggedRelations(int op)
* Process one tablespace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
+ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
+ Oid tspid, int op)
{
DIR *ts_dir;
struct dirent *de;
@@ -130,6 +157,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
while ((de = ReadDir(ts_dir, tsdirname)) != NULL)
{
+ Oid dbid;
+
/*
* We're only interested in the per-database directories, which have
* numeric names. Note that this code will also (properly) ignore "."
@@ -148,7 +177,10 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
ereport_startup_progress("resetting unlogged relations (cleanup), elapsed time: %ld.%02d s, current path: %s",
dbspace_path);
- ResetUnloggedRelationsInDbspaceDir(dbspace_path, op);
+ dbid = atooid(de->d_name);
+ Assert(dbid != 0);
+
+ ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op);
}
FreeDir(ts_dir);
@@ -158,125 +190,227 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
* Process one per-dbspace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
+ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
+ Oid tspid, Oid dbid, int op)
{
DIR *dbspace_dir;
struct dirent *de;
char rm_path[MAXPGPATH * 2];
+ HTAB *hash;
+ HASHCTL ctl;
/* Caller must specify at least one operation. */
- Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0);
+ Assert((op & (UNLOGGED_RELATION_CLEANUP |
+ UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_INIT)) != 0);
/*
* Cleanup is a two-pass operation. First, we go through and identify all
* the files with init forks. Then, we go through again and nuke
* everything with the same OID except the init fork.
*/
+
+ /*
+ * It's possible that someone could create tons of unlogged relations in
+ * the same database & tablespace, so we'd better use a hash table rather
+ * than an array or linked list to keep track of which files need to be
+ * reset. Otherwise, this cleanup operation would be O(n^2).
+ */
+ memset(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(relfile_entry);
+ hash = hash_create("relfilenode cleanup hash",
+ 32, &ctl, HASH_ELEM | HASH_BLOBS);
+
+ /* Collect INIT fork and mark files in the directory. */
+ dbspace_dir = AllocateDir(dbspacedirname);
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
+ {
+ int oidchars;
+ ForkNumber forkNum;
+ StorageMarks mark;
+
+ /* Skip anything that doesn't look like a relation data file. */
+ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
+ &forkNum, &mark))
+ continue;
+
+ if (forkNum == INIT_FORKNUM || mark == SMGR_MARK_UNCOMMITTED)
+ {
+ Oid key;
+ relfile_entry *ent;
+ bool found;
+
+ /*
+ * Record the relfilenode information. If it has
+ * SMGR_MARK_UNCOMMITTED mark files, the relfilenode is in dirty
+ * state, where clean up is needed.
+ */
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_ENTER, &found);
+
+ if (!found)
+ {
+ ent->has_init = false;
+ ent->dirty_init = false;
+ ent->dirty_all = false;
+ }
+
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_init = true;
+ else if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_all = true;
+ else
+ {
+ Assert(forkNum == INIT_FORKNUM);
+ ent->has_init = true;
+ }
+ }
+ }
+
+ /* Done with the first pass. */
+ FreeDir(dbspace_dir);
+
+ /* Nothing to do if we have neither init forks nor mark files. */
+ if (hash_get_num_entries(hash) < 1)
+ {
+ hash_destroy(hash);
+ return;
+ }
+
+ if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0)
+ {
+ /*
+ * When we come here after recovery, an smgr object for this file might
+ * have been created. In that case we need to drop all buffers and then
+ * the smgr object before initializing the unlogged relation. This is
+ * safe as long as no other backends have accessed the relation before
+ * starting archive recovery.
+ */
+ HASH_SEQ_STATUS status;
+ relfile_entry *ent;
+ SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8);
+ int maxrels = 8;
+ int nrels = 0;
+ int i;
+
+ Assert(!HotStandbyActive());
+
+ hash_seq_init(&status, hash);
+ while((ent = (relfile_entry *) hash_seq_search(&status)) != NULL)
+ {
+ RelFileNodeBackend rel;
+
+ /*
+ * The relation is persistent and remains persistent. Don't drop the
+ * buffers for this relation.
+ */
+ if (ent->has_init && ent->dirty_init)
+ continue;
+
+ if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = ent->reloid;
+
+ srels[nrels++] = smgropen(rel.node, InvalidBackendId);
+ }
+
+ DropRelFileNodesAllBuffers(srels, nrels);
+
+ for (i = 0 ; i < nrels ; i++)
+ smgrclose(srels[i]);
+ }
+
+ /*
+ * Now, make a second pass and remove anything that matches.
+ */
if ((op & UNLOGGED_RELATION_CLEANUP) != 0)
{
- HTAB *hash;
- HASHCTL ctl;
-
- /*
- * It's possible that someone could create a ton of unlogged relations
- * in the same database & tablespace, so we'd better use a hash table
- * rather than an array or linked list to keep track of which files
- * need to be reset. Otherwise, this cleanup operation would be
- * O(n^2).
- */
- ctl.keysize = sizeof(Oid);
- ctl.entrysize = sizeof(unlogged_relation_entry);
- ctl.hcxt = CurrentMemoryContext;
- hash = hash_create("unlogged relation OIDs", 32, &ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- /* Scan the directory. */
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ int oidchars;
+ Oid key;
+ relfile_entry *ent;
+ RelFileNodeBackend rel;
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* Also skip it unless this is the init fork. */
- if (forkNum != INIT_FORKNUM)
- continue;
-
- /*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
- */
- ent.reloid = atooid(de->d_name);
- (void) hash_search(hash, &ent, HASH_ENTER, NULL);
- }
-
- /* Done with the first pass. */
- FreeDir(dbspace_dir);
-
- /*
- * If we didn't find any init forks, there's no point in continuing;
- * we can bail out now.
- */
- if (hash_get_num_entries(hash) == 0)
- {
- hash_destroy(hash);
- return;
- }
-
- /*
- * Now, make a second pass and remove anything that matches.
- */
- dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
- {
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
-
- /* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* We never remove the init fork. */
- if (forkNum == INIT_FORKNUM)
+ &forkNum, &mark))
continue;
/*
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
- if (hash_search(hash, &ent, HASH_FIND, NULL))
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);
+
+ if (!ent)
+ continue;
+
+ if (!ent->dirty_all)
{
- snprintf(rm_path, sizeof(rm_path), "%s/%s",
- dbspacedirname, de->d_name);
- if (unlink(rm_path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m",
- rm_path)));
+ /* clean permanent relations don't need cleanup */
+ if (!ent->has_init)
+ continue;
+
+ if (ent->dirty_init)
+ {
+ /*
+ * The crashed transaction did SET UNLOGGED. This relation
+ * is restored to a LOGGED relation.
+ */
+ if (forkNum != INIT_FORKNUM)
+ continue;
+ }
else
- elog(DEBUG2, "unlinked file \"%s\"", rm_path);
+ {
+ /*
+ * we don't remove the INIT fork of a non-dirty
+ * relfilenode
+ */
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE)
+ continue;
+ }
}
+
+ /* so, nuke it! */
+ snprintf(rm_path, sizeof(rm_path), "%s/%s",
+ dbspacedirname, de->d_name);
+ if (unlink(rm_path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ rm_path)));
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = atooid(de->d_name);
+
+ ForgetRelationForkSyncRequests(rel, forkNum);
}
/* Cleanup is complete. */
FreeDir(dbspace_dir);
- hash_destroy(hash);
}
+ hash_destroy(hash);
+ hash = NULL;
+
/*
* Initialization happens after cleanup is complete: we copy each init
- * fork file to the corresponding main fork file. Note that if we are
- * asked to do both cleanup and init, we may never get here: if the
- * cleanup code determines that there are no init forks in this dbspace,
- * it will return before we get to this point.
+ * fork file to the corresponding main fork file.
*/
if ((op & UNLOGGED_RELATION_INIT) != 0)
{
@@ -285,6 +419,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char srcpath[MAXPGPATH * 2];
@@ -292,9 +427,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -328,15 +465,18 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char mainpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -379,7 +519,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
*/
bool
parse_filename_for_nontemp_relation(const char *name, int *oidchars,
- ForkNumber *fork)
+ ForkNumber *fork, StorageMarks *mark)
{
int pos;
@@ -410,11 +550,19 @@ parse_filename_for_nontemp_relation(const char *name, int *oidchars,
for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar)
;
- if (segchar <= 1)
- return false;
- pos += segchar;
+ if (segchar > 1)
+ pos += segchar;
}
+ /* mark file? */
+ if (name[pos] == '.' && name[pos + 1] != 0)
+ {
+ *mark = name[pos + 1];
+ pos += 2;
+ }
+ else
+ *mark = SMGR_MARK_NONE;
+
/* Now we should be at the end. */
if (name[pos] != '\0')
return false;
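As a reading aid for the reinit.c changes above, the per-file cleanup decision made in the second pass of ResetUnloggedRelationsInDbspaceDir() can be restated as a small predicate. This is only a sketch under assumptions: `RelfileFlags` and `cleanup_should_unlink` are hypothetical names that do not appear in the patch, and the struct merely mirrors the three flags of `relfile_entry`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the flags kept in the patch's relfile_entry. */
typedef struct
{
	bool		has_init;	/* a plain init fork was seen */
	bool		dirty_init;	/* mark file found for the init fork */
	bool		dirty_all;	/* mark file found for the main fork */
} RelfileFlags;

/*
 * Hypothetical helper: returns true if the cleanup pass would unlink a
 * file belonging to this relfilenode.  is_init_fork says whether the file
 * belongs to the init fork; is_mark_file says whether it carries a mark
 * suffix.
 */
static bool
cleanup_should_unlink(const RelfileFlags *ent, bool is_init_fork,
					  bool is_mark_file)
{
	/* Uncommitted creation of the whole relation: remove everything. */
	if (ent->dirty_all)
		return true;

	/* No init fork: nothing is removed for this relfilenode. */
	if (!ent->has_init)
		return false;

	/* Crashed SET UNLOGGED: drop only init-fork files, keep the data. */
	if (ent->dirty_init)
		return is_init_fork;

	/* Ordinary unlogged relation: keep only the unmarked init fork. */
	return !(is_init_fork && !is_mark_file);
}
```

For an ordinary unlogged relation this reproduces the pre-patch behavior (every fork except the init fork is removed), while the two "dirty" states cover a crash in the middle of a persistence change or a relation creation.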
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index b4bca7eed6..580b74839f 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -139,7 +139,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
BlockNumber blkno, bool skipFsync, int behavior);
static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
MdfdVec *seg);
-
+static bool mdmarkexists(SMgrRelation reln, ForkNumber forkNum,
+ StorageMarks mark);
/*
* mdinit() -- Initialize private state for magnetic disk storage manager.
@@ -169,6 +170,81 @@ mdexists(SMgrRelation reln, ForkNumber forkNum)
return (mdopenfork(reln, forkNum, EXTENSION_RETURN_NULL) != NULL);
}
+/*
+ * mdcreatemark() -- Create a mark file.
+ *
+ * If isRedo is true, it's okay for the file to exist already.
+ */
+void
+mdcreatemark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+ int fd;
+
+ /* See mdcreate for details. */
+ TablespaceCreateDbspace(reln->smgr_rnode.node.spcNode,
+ reln->smgr_rnode.node.dbNode,
+ isRedo);
+
+ fd = BasicOpenFile(path, O_WRONLY | O_CREAT | O_EXCL);
+ if (fd < 0 && (!isRedo || errno != EEXIST))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create mark file \"%s\": %m", path)));
+
+ pg_fsync(fd);
+ close(fd);
+
+ /*
+ * To guarantee that the creation of the file is persistent, fsync its
+ * parent directory.
+ */
+ fsync_parent_path(path, ERROR);
+
+ pfree(path);
+}
+
+
+/*
+ * mdunlinkmark() -- Delete the mark file
+ *
+ * If isRedo is true, it's okay if the file is not found.
+ */
+void
+mdunlinkmark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+
+ if (!isRedo || mdmarkexists(reln, forkNum, mark))
+ durable_unlink(path, ERROR);
+
+ pfree(path);
+}
+
+/*
+ * mdmarkexists() -- Check if the file exists.
+ */
+static bool
+mdmarkexists(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+ int fd;
+
+ fd = BasicOpenFile(path, O_RDONLY);
+ if (fd < 0 && errno != ENOENT)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not access mark file \"%s\": %m", path)));
+ pfree(path);
+
+ if (fd < 0)
+ return false;
+ close(fd); /* don't leak the descriptor opened above */
+ return true;
+}
+
/*
* mdcreate() -- Create a new relation on magnetic disk.
*
@@ -1025,6 +1101,15 @@ register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum,
RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ );
}
+/*
+ * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork
+ */
+void
+ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, ForkNumber forknum)
+{
+ register_forget_request(rnode, forknum, 0);
+}
+
/*
* ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB
*/
@@ -1378,12 +1463,14 @@ mdsyncfiletag(const FileTag *ftag, char *path)
* Return 0 on success, -1 on failure, with errno set.
*/
int
-mdunlinkfiletag(const FileTag *ftag, char *path)
+mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark)
{
char *p;
/* Compute the path. */
- p = relpathperm(ftag->rnode, MAIN_FORKNUM);
+ p = GetRelationPath(ftag->rnode.dbNode, ftag->rnode.spcNode,
+ ftag->rnode.relNode, InvalidBackendId, MAIN_FORKNUM,
+ mark);
strlcpy(path, p, MAXPGPATH);
pfree(p);
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 0fcef4994b..110e64b0b2 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -62,6 +62,10 @@ typedef struct f_smgr
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+ void (*smgr_createmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+ void (*smgr_unlinkmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
} f_smgr;
static const f_smgr smgrsw[] = {
@@ -82,6 +86,8 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
+ .smgr_createmark = mdcreatemark,
+ .smgr_unlinkmark = mdunlinkmark,
}
};
@@ -335,6 +341,26 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo);
}
+/*
+ * smgrcreatemark() -- Create a mark file
+ */
+void
+smgrcreatemark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_createmark(reln, forknum, mark, isRedo);
+}
+
+/*
+ * smgrunlinkmark() -- Delete a mark file
+ */
+void
+smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlinkmark(reln, forknum, mark, isRedo);
+}
+
/*
* smgrdosyncall() -- Immediately sync all forks of all given relations
*
@@ -662,6 +688,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index d4083e8a56..9563940d45 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -89,7 +89,8 @@ static CycleCtr checkpoint_cycle_ctr = 0;
typedef struct SyncOps
{
int (*sync_syncfiletag) (const FileTag *ftag, char *path);
- int (*sync_unlinkfiletag) (const FileTag *ftag, char *path);
+ int (*sync_unlinkfiletag) (const FileTag *ftag, char *path,
+ StorageMarks mark);
bool (*sync_filetagmatches) (const FileTag *ftag,
const FileTag *candidate);
} SyncOps;
@@ -222,7 +223,8 @@ SyncPostCheckpoint(void)
/* Unlink the file */
if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
- path) < 0)
+ path,
+ SMGR_MARK_NONE) < 0)
{
/*
* There's a race condition, when the database is dropped at the
@@ -236,6 +238,20 @@ SyncPostCheckpoint(void)
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m", path)));
}
+ else if (syncsw[entry->tag.handler].sync_unlinkfiletag(
+ &entry->tag, path,
+ SMGR_MARK_UNCOMMITTED) < 0)
+ {
+ /*
+ * And we may have an SMGR_MARK_UNCOMMITTED file. Remove it if the
+ * fork files have been successfully removed. It's ok if the file
+ * does not exist.
+ */
+ if (errno != ENOENT)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
/* Mark the list entry as canceled, just in case */
entry->canceled = true;
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 436df54120..dbc0da5da5 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -407,6 +407,30 @@ extractPageInfo(XLogReaderState *record)
* source system.
*/
}
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_UNLINK)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_MARK)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
else if (rmid == RM_XACT_ID &&
((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT ||
(rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED ||
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 1f5c426ec0..4945b111cc 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -139,9 +139,15 @@ GetDatabasePath(Oid dbNode, Oid spcNode)
*/
char *
GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
- int backendId, ForkNumber forkNumber)
+ int backendId, ForkNumber forkNumber, char mark)
{
char *path;
+ char markstr[4];
+
+ if (mark == 0)
+ markstr[0] = 0;
+ else
+ snprintf(markstr, sizeof(markstr), ".%c", mark);
if (spcNode == GLOBALTABLESPACE_OID)
{
@@ -149,10 +155,10 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
Assert(dbNode == 0);
Assert(backendId == InvalidBackendId);
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("global/%u_%s",
- relNode, forkNames[forkNumber]);
+ path = psprintf("global/%u_%s%s",
+ relNode, forkNames[forkNumber], markstr);
else
- path = psprintf("global/%u", relNode);
+ path = psprintf("global/%u%s", relNode, markstr);
}
else if (spcNode == DEFAULTTABLESPACE_OID)
{
@@ -160,22 +166,22 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/%u_%s",
+ path = psprintf("base/%u/%u_%s%s",
dbNode, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/%u",
- dbNode, relNode);
+ path = psprintf("base/%u/%u%s",
+ dbNode, relNode, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/t%d_%u_%s",
+ path = psprintf("base/%u/t%d_%u_%s%s",
dbNode, backendId, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/t%d_%u",
- dbNode, backendId, relNode);
+ path = psprintf("base/%u/t%d_%u%s",
+ dbNode, backendId, relNode, markstr);
}
}
else
@@ -184,27 +190,28 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u_%s%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
dbNode, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
- dbNode, relNode);
+ dbNode, relNode, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
dbNode, backendId, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
- dbNode, backendId, relNode);
+ dbNode, backendId, relNode, markstr);
}
}
+
return path;
}
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 0ab32b44e9..584ebac391 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -23,6 +23,8 @@
extern int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
+extern void RelationCreateInitFork(Relation rel);
+extern void RelationDropInitFork(Relation rel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationPreTruncate(Relation rel);
@@ -41,6 +43,7 @@ extern void RestorePendingSyncs(char *startAddress);
extern void smgrDoPendingDeletes(bool isCommit);
extern void smgrDoPendingSyncs(bool isCommit, bool isParallelWorker);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingCleanups(bool isCommit);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index f0814f1458..12346ed7f6 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -18,17 +18,23 @@
#include "lib/stringinfo.h"
#include "storage/block.h"
#include "storage/relfilenode.h"
+#include "storage/smgr.h"
/*
* Declarations for smgr-related XLOG records
*
- * Note: we log file creation and truncation here, but logging of deletion
- * actions is handled by xact.c, because it is part of transaction commit.
+ * Note: we log file creation, truncation and buffer persistence change here,
+ * but logging of deletion actions is handled mainly by xact.c, because it is
+ * part of transaction commit in most cases. However, there's a case where
+ * init forks are deleted outside control of transaction.
*/
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_UNLINK 0x30
+#define XLOG_SMGR_MARK 0x40
+#define XLOG_SMGR_BUFPERSISTENCE 0x50
typedef struct xl_smgr_create
{
@@ -36,6 +42,32 @@ typedef struct xl_smgr_create
ForkNumber forkNum;
} xl_smgr_create;
+typedef struct xl_smgr_unlink
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+} xl_smgr_unlink;
+
+typedef enum smgr_mark_action
+{
+ XLOG_SMGR_MARK_CREATE = 'c',
+ XLOG_SMGR_MARK_UNLINK = 'u'
+} smgr_mark_action;
+
+typedef struct xl_smgr_mark
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ smgr_mark_action action;
+} xl_smgr_mark;
+
+typedef struct xl_smgr_bufpersistence
+{
+ RelFileNode rnode;
+ bool persistence;
+} xl_smgr_bufpersistence;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +83,12 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark);
+extern void log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark);
+extern void log_smgrbufpersistence(const RelFileNode *rnode, bool persistence);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index a44be11ca0..106a5cf508 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -67,7 +67,7 @@ extern int forkname_chars(const char *str, ForkNumber *fork);
extern char *GetDatabasePath(Oid dbNode, Oid spcNode);
extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
- int backendId, ForkNumber forkNumber);
+ int backendId, ForkNumber forkNumber, char mark);
/*
* Wrapper macros for GetRelationPath. Beware of multiple
@@ -77,7 +77,7 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
/* First argument is a RelFileNode */
#define relpathbackend(rnode, backend, forknum) \
GetRelationPath((rnode).dbNode, (rnode).spcNode, (rnode).relNode, \
- backend, forknum)
+ backend, forknum, 0)
/* First argument is a RelFileNode */
#define relpathperm(rnode, forknum) \
@@ -87,4 +87,9 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
#define relpath(rnode, forknum) \
relpathbackend((rnode).node, (rnode).backend, forknum)
+/* First argument is a RelFileNodeBackend */
+#define markpath(rnode, forknum, mark) \
+ GetRelationPath((rnode).node.dbNode, (rnode).node.spcNode, \
+ (rnode).node.relNode, \
+ (rnode).backend, forknum, mark)
#endif /* RELPATH_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index cfce23ecbc..f5a7df87a4 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -206,6 +206,8 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels)
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+ bool permanent, bool isRedo);
extern void DropRelFileNodesAllBuffers(struct SMgrRelationData **smgr_reln, int nnodes);
extern void DropDatabaseBuffers(Oid dbid);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 34602ae006..2dc0357ad5 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -185,6 +185,7 @@ extern ssize_t pg_pwritev_with_retry(int fd,
extern int pg_truncate(const char *path, off_t length);
extern void fsync_fname(const char *fname, bool isdir);
extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
+extern int fsync_parent_path(const char *fname, int elevel);
extern int durable_rename(const char *oldfile, const char *newfile, int loglevel);
extern int durable_unlink(const char *fname, int loglevel);
extern int durable_rename_excl(const char *oldfile, const char *newfile, int loglevel);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 752b440864..99620816b5 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -23,6 +23,10 @@
extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
extern void mdclose(SMgrRelation reln, ForkNumber forknum);
+extern void mdcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void mdunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
@@ -41,12 +45,14 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void ForgetRelationForkSyncRequests(RelFileNodeBackend rnode,
+ ForkNumber forknum);
extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
/* md sync callbacks */
extern int mdsyncfiletag(const FileTag *ftag, char *path);
-extern int mdunlinkfiletag(const FileTag *ftag, char *path);
+extern int mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark);
extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *candidate);
#endif /* MD_H */
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index fad1e5c473..e1f97e9b89 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -16,13 +16,15 @@
#define REINIT_H
#include "common/relpath.h"
-
+#include "storage/smgr.h"
extern void ResetUnloggedRelations(int op);
-extern bool parse_filename_for_nontemp_relation(const char *name,
- int *oidchars, ForkNumber *fork);
+extern bool parse_filename_for_nontemp_relation(const char *name, int *oidchars,
+ ForkNumber *fork,
+ StorageMarks *mark);
#define UNLOGGED_RELATION_CLEANUP 0x0001
-#define UNLOGGED_RELATION_INIT 0x0002
+#define UNLOGGED_RELATION_DROP_BUFFER 0x0002
+#define UNLOGGED_RELATION_INIT 0x0004
#endif /* REINIT_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a6fbf7b6a6..201ecace8a 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,18 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
+/*
+ * A storage mark is a file whose existence indicates something about another
+ * file. Such files are named "<filename>.<mark>", where the mark is one of
+ * the values of StorageMarks. Since ".<digit>" denotes segment files, don't
+ * use digits for the mark character.
+ */
+typedef enum StorageMarks
+{
+ SMGR_MARK_NONE = 0,
+ SMGR_MARK_UNCOMMITTED = 'u' /* the file is not committed yet */
+} StorageMarks;
+
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
@@ -85,7 +97,12 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
extern void smgrclose(SMgrRelation reln);
extern void smgrcloseall(void);
extern void smgrclosenode(RelFileNodeBackend rnode);
+extern void smgrcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void smgrunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
--
2.27.0
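As a reading aid for the relpath.c hunks above: the mark is appended to the ordinary relation path as a one-character suffix, and an empty suffix is used when no mark is requested. The following is only a minimal sketch of that naming rule, assuming the simplest case (default tablespace, main fork, non-temporary relation). The name sketch_mark_path is hypothetical and is not a function in the patch; it uses plain snprintf where the patch uses psprintf.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): build
 * "base/<dbNode>/<relNode>[.<mark>]" for a main-fork file in the default
 * tablespace.  mark == 0 means "no mark", as in the patched GetRelationPath.
 */
void
sketch_mark_path(char *buf, size_t buflen,
                 unsigned dbNode, unsigned relNode, char mark)
{
    char        markstr[4];

    /* Digits are reserved for segment files such as "<relNode>.1". */
    assert(mark < '0' || mark > '9');

    if (mark == 0)
        markstr[0] = '\0';
    else
        snprintf(markstr, sizeof(markstr), ".%c", mark);

    snprintf(buf, buflen, "base/%u/%u%s", dbNode, relNode, markstr);
}
```

With dbNode 73740 and relNode 73741, a mark of 0 gives "base/73740/73741", and the 'u' mark (SMGR_MARK_UNCOMMITTED in the patch) gives "base/73740/73741.u".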
v10-0002-New-command-ALTER-TABLE-ALL-IN-TABLESPACE-SET-LO.patch (text/x-patch; charset=us-ascii)
From f2d6ccc64183b0d177b523faaa5c0b8777bfc195 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 23:21:09 +0900
Subject: [PATCH v10 2/2] New command ALTER TABLE ALL IN TABLESPACE SET
LOGGED/UNLOGGED
To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes
relation persistence of all tables in the specified tablespace.
---
src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++++++++++
src/backend/nodes/copyfuncs.c | 16 ++++
src/backend/nodes/equalfuncs.c | 15 ++++
src/backend/parser/gram.y | 20 +++++
src/backend/tcop/utility.c | 11 +++
src/include/commands/tablecmds.h | 2 +
src/include/nodes/nodes.h | 1 +
src/include/nodes/parsenodes.h | 9 ++
8 files changed, 214 insertions(+)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 0d9c801535..7c18ed9e75 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14499,6 +14499,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
return new_tablespaceoid;
}
+/*
+ * Alter Table ALL ... SET LOGGED/UNLOGGED
+ *
+ * Allows a user to change persistence of all objects in a given tablespace in
+ * the current database. Objects can be chosen based on the owner of the
+ * object also, to allow users to change persistence of only their objects. The
+ * main permissions handling is done by the lower-level change persistence
+ * function.
+ *
+ * All to-be-modified objects are locked first. If NOWAIT is specified and the
+ * lock can't be acquired then we ereport(ERROR).
+ */
+void
+AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt)
+{
+ List *relations = NIL;
+ ListCell *l;
+ ScanKeyData key[1];
+ Relation rel;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ Oid tablespaceoid;
+ List *role_oids = roleSpecsToIds(NIL);
+
+ /* Ensure we were not asked to change something we can't */
+ if (stmt->objtype != OBJECT_TABLE)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("only tables can be specified")));
+
+ /* Get the tablespace OID */
+ tablespaceoid = get_tablespace_oid(stmt->tablespacename, false);
+
+ /*
+ * Now that the checks are done, check if we should set the tablespace OID
+ * to InvalidOid because it is our database's default tablespace.
+ */
+ if (tablespaceoid == MyDatabaseTableSpace)
+ tablespaceoid = InvalidOid;
+
+ /*
+ * Walk the list of objects in the tablespace to pick them up. This will
+ * only find objects in our database, of course.
+ */
+ ScanKeyInit(&key[0],
+ Anum_pg_class_reltablespace,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(tablespaceoid));
+
+ rel = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(rel, 1, key);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relOid = relForm->oid;
+
+ /*
+ * Do not pick-up objects in pg_catalog as part of this, if an admin
+ * really wishes to do so, they can issue the individual ALTER
+ * commands directly.
+ *
+ * Also, explicitly avoid any shared tables, temp tables, or TOAST
+ * (TOAST will be changed with the main table).
+ */
+ if (IsCatalogNamespace(relForm->relnamespace) ||
+ relForm->relisshared ||
+ isAnyTempNamespace(relForm->relnamespace) ||
+ IsToastNamespace(relForm->relnamespace))
+ continue;
+
+ /* Only pick up the object type requested */
+ if (relForm->relkind != RELKIND_RELATION)
+ continue;
+
+ /* Check if we are only picking-up objects owned by certain roles */
+ if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner))
+ continue;
+
+ /*
+ * Handle permissions-checking here since we are locking the tables
+ * and also to avoid doing a bunch of work only to fail part-way. Note
+ * that permissions will also be checked by AlterTableInternal().
+ *
+ * Caller must be considered an owner on the table of which we're going
+ * to change persistence.
+ */
+ if (!pg_class_ownercheck(relOid, GetUserId()))
+ aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)),
+ NameStr(relForm->relname));
+
+ if (stmt->nowait &&
+ !ConditionalLockRelationOid(relOid, AccessExclusiveLock))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("aborting because lock on relation \"%s.%s\" is not available",
+ get_namespace_name(relForm->relnamespace),
+ NameStr(relForm->relname))));
+ else
+ LockRelationOid(relOid, AccessExclusiveLock);
+
+ /*
+ * Add to our list of objects of which we're going to change
+ * persistence.
+ */
+ relations = lappend_oid(relations, relOid);
+ }
+
+ table_endscan(scan);
+ table_close(rel, AccessShareLock);
+
+ if (relations == NIL)
+ ereport(NOTICE,
+ (errcode(ERRCODE_NO_DATA_FOUND),
+ errmsg("no matching relations in tablespace \"%s\" found",
+ tablespaceoid == InvalidOid ? "(database default)" :
+ get_tablespace_name(tablespaceoid))));
+
+ /*
+ * Everything is locked, loop through and change persistence of all of the
+ * relations.
+ */
+ foreach(l, relations)
+ {
+ List *cmds = NIL;
+ AlterTableCmd *cmd = makeNode(AlterTableCmd);
+
+ if (stmt->logged)
+ cmd->subtype = AT_SetLogged;
+ else
+ cmd->subtype = AT_SetUnLogged;
+
+ cmds = lappend(cmds, cmd);
+
+ EventTriggerAlterTableStart((Node *) stmt);
+ /* OID is set by AlterTableInternal */
+ AlterTableInternal(lfirst_oid(l), cmds, false);
+ EventTriggerAlterTableEnd();
+ }
+}
+
static void
index_copy_data(Relation rel, RelFileNode newrnode)
{
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index df0b747883..55e38cfe3f 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4269,6 +4269,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from)
return newnode;
}
+static AlterTableSetLoggedAllStmt *
+_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *from)
+{
+ AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt);
+
+ COPY_STRING_FIELD(tablespacename);
+ COPY_SCALAR_FIELD(objtype);
+ COPY_SCALAR_FIELD(logged);
+ COPY_SCALAR_FIELD(nowait);
+
+ return newnode;
+}
+
static CreateExtensionStmt *
_copyCreateExtensionStmt(const CreateExtensionStmt *from)
{
@@ -5622,6 +5635,9 @@ copyObjectImpl(const void *from)
case T_AlterTableMoveAllStmt:
retval = _copyAlterTableMoveAllStmt(from);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _copyAlterTableSetLoggedAllStmt(from);
+ break;
case T_CreateExtensionStmt:
retval = _copyCreateExtensionStmt(from);
break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index cb7ddd463c..a19b7874d7 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1916,6 +1916,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a,
return true;
}
+static bool
+_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *a,
+ const AlterTableSetLoggedAllStmt *b)
+{
+ COMPARE_STRING_FIELD(tablespacename);
+ COMPARE_SCALAR_FIELD(objtype);
+ COMPARE_SCALAR_FIELD(logged);
+ COMPARE_SCALAR_FIELD(nowait);
+
+ return true;
+}
+
static bool
_equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b)
{
@@ -3625,6 +3637,9 @@ equal(const void *a, const void *b)
case T_AlterTableMoveAllStmt:
retval = _equalAlterTableMoveAllStmt(a, b);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _equalAlterTableSetLoggedAllStmt(a, b);
+ break;
case T_CreateExtensionStmt:
retval = _equalCreateExtensionStmt(a, b);
break;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 3d4dd43e47..9823d57a54 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -1984,6 +1984,26 @@ AlterTableStmt:
n->nowait = $13;
$$ = (Node *)n;
}
+ | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = true;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = false;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
| ALTER INDEX qualified_name alter_table_cmds
{
AlterTableStmt *n = makeNode(AlterTableStmt);
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 1fbc387d47..1483f9a475 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -162,6 +162,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_AlterTSConfigurationStmt:
case T_AlterTSDictionaryStmt:
case T_AlterTableMoveAllStmt:
+ case T_AlterTableSetLoggedAllStmt:
case T_AlterTableSpaceOptionsStmt:
case T_AlterTableStmt:
case T_AlterTypeStmt:
@@ -1747,6 +1748,12 @@ ProcessUtilitySlow(ParseState *pstate,
commandCollected = true;
break;
+ case T_AlterTableSetLoggedAllStmt:
+ AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree);
+ /* commands are stashed in AlterTableSetLoggedAll */
+ commandCollected = true;
+ break;
+
case T_DropStmt:
ExecDropStmt((DropStmt *) parsetree, isTopLevel);
/* no commands stashed for DROP */
@@ -2669,6 +2676,10 @@ CreateCommandTag(Node *parsetree)
tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype);
+ break;
+
case T_AlterTableStmt:
tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype);
break;
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index 336549cc5f..714077ff4c 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
+extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt);
+
extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
Oid *oldschema);
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 7c657c1241..8860b2e548 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -428,6 +428,7 @@ typedef enum NodeTag
T_AlterCollationStmt,
T_CallStmt,
T_AlterStatsStmt,
+ T_AlterTableSetLoggedAllStmt,
/*
* TAGS FOR PARSE TREE NODES (parsenodes.h)
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 4c5a8a39bf..c3e1bc66d1 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2350,6 +2350,15 @@ typedef struct AlterTableMoveAllStmt
bool nowait;
} AlterTableMoveAllStmt;
+typedef struct AlterTableSetLoggedAllStmt
+{
+ NodeTag type;
+ char *tablespacename;
+ ObjectType objtype; /* Object type to move */
+ bool logged;
+ bool nowait;
+} AlterTableSetLoggedAllStmt;
+
/* ----------------------
* Create/Alter Extension Statements
* ----------------------
--
2.27.0
Hi Kyotaro,
I took a bit too long detour but the patch gets to pass make-world for me.
Good news: v10 passes all the tests for me (including the TAP recovery ones). However, I think there is a major problem:
drop table t6;
create unlogged table t6 (id bigint, t text);
create sequence s1;
insert into t6 select nextval('s1'), repeat('A', 1000) from generate_series(1, 100);
alter table t6 set logged;
select pg_sleep(1);
<--optional checkpoint, more on this later.
/usr/pgsql-15/bin/pg_ctl -D /var/lib/pgsql/15/data -l logfile -m immediate stop
/usr/pgsql-15/bin/pg_ctl -D /var/lib/pgsql/15/data -l logfile start
select count(*) from t6; -- shows 0 rows
But if I perform a checkpoint before the crash, the data is there; apparently the missing steps done by the checkpointer
seem to help. If no checkpoint is done, some peeking reveals that there is a truncation upon startup (?!):
$ /usr/pgsql-15/bin/pg_ctl -D /var/lib/pgsql/15/data -l logfile -m immediate stop
$ find /var/lib/pgsql/15/data/ -name '73741*' -ls
112723206 120 -rw------- 1 postgres postgres 122880 Dec 21 12:42 /var/lib/pgsql/15/data/base/73740/73741
112723202 24 -rw------- 1 postgres postgres 24576 Dec 21 12:42 /var/lib/pgsql/15/data/base/73740/73741_fsm
$ /usr/pgsql-15/bin/pg_ctl -D /var/lib/pgsql/15/data -l logfile start
waiting for server to start.... done
server started
$ find /var/lib/pgsql/15/data/ -name '73741*' -ls
112723206 0 -rw------- 1 postgres postgres 0 Dec 21 12:42 /var/lib/pgsql/15/data/base/73740/73741
112723202 16 -rw------- 1 postgres postgres 16384 Dec 21 12:42 /var/lib/pgsql/15/data/base/73740/73741_fsm
So what's suspicious is the 122880 -> 0 file size truncation. I've investigated the WAL, and it seems to contain TRUNCATE records
after the logged FPI images, so when crash recovery kicks in, it probably clears this table (which it shouldn't).
However, if I perform a CHECKPOINT just before the crash, the WAL stream contains just RUNNING_XACTS and CHECKPOINT_ONLINE
redo records, which probably prevents the truncation. I'm a newbie here, so please take this theory with a grain of salt; it could be
something completely different.
-J.
Hello, Jakub.
At Tue, 21 Dec 2021 13:07:28 +0000, Jakub Wartak <Jakub.Wartak@tomtom.com> wrote in
So what's suspicious is the 122880 -> 0 file size truncation. I've investigated the WAL, and it seems to contain TRUNCATE records
after the logged FPI images, so when crash recovery kicks in, it probably clears this table (which it shouldn't).
Darn. That was silly of me: I wrongly issued truncate records for the
target relation of the function (rel) instead of the relation on which
we're operating at that time (r).
However, if I perform a CHECKPOINT just before the crash, the WAL stream contains just RUNNING_XACTS and CHECKPOINT_ONLINE
redo records, which probably prevents the truncation. I'm a newbie here, so please take this theory with a grain of salt; it could be
something completely different.
That is because the WAL records are inconsistent with the on-disk state.
If a crash happens after the SET LOGGED but before the next checkpoint,
recovery replays the broken WAL records. Once a checkpoint has completed,
the on-disk state is persisted and the broken WAL records are no longer
replayed.
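To make that sequence concrete, here is a toy model of it. This is an illustration only: ToyRecord, ToyRecKind and toy_recover are made-up names, the record kinds are not PostgreSQL's real WAL record types, and "replay" is reduced to tracking one file's size. The only point it shows is that recovery starts at the redo point, so records written before the last checkpoint are never applied.

```c
#include <assert.h>
#include <stddef.h>

/* Toy record kinds; these are NOT PostgreSQL's real WAL record types. */
typedef enum ToyRecKind
{
    TOY_FPI,                    /* full-page images restoring the data */
    TOY_TRUNCATE,               /* the wrongly issued truncate record */
    TOY_CHECKPOINT              /* marks where a later recovery starts */
} ToyRecKind;

typedef struct ToyRecord
{
    ToyRecKind  kind;
    long        size_after;     /* file size this record leaves behind */
} ToyRecord;

/*
 * Replay records[redo .. nrecords-1] against a file whose on-disk size is
 * disk_size and return the size after recovery.  Records before the redo
 * point are never looked at.
 */
long
toy_recover(const ToyRecord *records, size_t nrecords, size_t redo,
            long disk_size)
{
    assert(redo <= nrecords);

    for (size_t i = redo; i < nrecords; i++)
    {
        if (records[i].kind == TOY_FPI || records[i].kind == TOY_TRUNCATE)
            disk_size = records[i].size_after;
        /* TOY_CHECKPOINT changes nothing on replay */
    }
    return disk_size;
}
```

With the records {FPI to 122880, TRUNCATE to 0, CHECKPOINT} and 122880 bytes on disk, replaying from record 0 (no checkpoint before the crash) ends at size 0, while replaying from the checkpoint record leaves the file at 122880, which matches what Jakub observed.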
The following fix works.
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -5478,7 +5478,7 @@ RelationChangePersistence(AlteredTableInfo *tab, char persistence,
xl_smgr_truncate xlrec;
xlrec.blkno = 0;
- xlrec.rnode = rel->rd_node;
+ xlrec.rnode = r->rd_node;
xlrec.flags = SMGR_TRUNCATE_ALL;
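The shape of the mistake can be sketched in isolation. This is a simplified stand-in, not the code in tablecmds.c: ToyRel, needs_truncate and toy_log_truncates are hypothetical, and the sketch assumes the function walks several relations (r) that belong to one target relation (rel) and logs a truncate only for some of them. The use_fix flag exists only to contrast the two behaviours.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a relation; not PostgreSQL's Relation. */
typedef struct ToyRel
{
    unsigned    relNode;        /* file node the WAL record would name */
    int         needs_truncate; /* assumed per-relation decision */
} ToyRel;

/*
 * Record which relNode each truncate record would name.  "rel" is the
 * target of the whole operation, "r" the relation currently processed.
 * Returns the number of records "logged".
 */
size_t
toy_log_truncates(const ToyRel *rel, const ToyRel *members, size_t n,
                  int use_fix, unsigned *logged)
{
    size_t      nlogged = 0;

    assert(rel != NULL && members != NULL && logged != NULL);

    for (size_t i = 0; i < n; i++)
    {
        const ToyRel *r = &members[i];

        if (!r->needs_truncate)
            continue;

        /* v10 named rel here; the fix names r. */
        logged[nlogged++] = use_fix ? r->relNode : rel->relNode;
    }
    return nlogged;
}
```

If rel is the table with relNode 73741 and only a second relation (say relNode 73744, a made-up number) needs truncating, the old behaviour logs a truncate for 73741, the table itself, while the fixed behaviour logs it for 73744.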
I made another change in this version. Previously, btree was the only
index AM processed in the in-place manner. In this version we do that
for all AMs except GiST. Maybe if gistGetFakeLSN behaved the same
way for permanent and unlogged indexes, we could skip the index rebuild
in exchange for some extra WAL records emitted while it is unlogged.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v11-0001-In-place-table-persistence-change.patch (text/x-patch; charset=us-ascii)
From 0cac0fade05322c1aa8b7ec020f8fe1f9e5fb50e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v11 1/2] In-place table persistence change
Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, it currently runs a heap rewrite, which causes a large amount
of file I/O. This patch makes the command run without a heap rewrite.
In addition, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of a massive number of HEAP_INSERTs, which should be
smaller.
This also allows cleanup of files left behind when the transaction
that created them crashes.
---
src/backend/access/rmgrdesc/smgrdesc.c | 52 +++
src/backend/access/transam/README | 8 +
src/backend/access/transam/xact.c | 7 +
src/backend/access/transam/xlog.c | 17 +
src/backend/catalog/storage.c | 545 ++++++++++++++++++++++++-
src/backend/commands/tablecmds.c | 265 ++++++++++--
src/backend/replication/basebackup.c | 3 +-
src/backend/storage/buffer/bufmgr.c | 88 ++++
src/backend/storage/file/fd.c | 4 +-
src/backend/storage/file/reinit.c | 344 +++++++++++-----
src/backend/storage/smgr/md.c | 93 ++++-
src/backend/storage/smgr/smgr.c | 32 ++
src/backend/storage/sync/sync.c | 20 +-
src/bin/pg_rewind/parsexlog.c | 24 ++
src/common/relpath.c | 47 ++-
src/include/catalog/storage.h | 3 +
src/include/catalog/storage_xlog.h | 42 +-
src/include/common/relpath.h | 9 +-
src/include/storage/bufmgr.h | 2 +
src/include/storage/fd.h | 1 +
src/include/storage/md.h | 8 +-
src/include/storage/reinit.h | 10 +-
src/include/storage/smgr.h | 17 +
23 files changed, 1459 insertions(+), 182 deletions(-)
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 7755553d57..d251f22207 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,49 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+ char *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+ appendStringInfoString(buf, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+ char *path = GetRelationPath(xlrec->rnode.dbNode,
+ xlrec->rnode.spcNode,
+ xlrec->rnode.relNode,
+ InvalidBackendId,
+ xlrec->forkNum, xlrec->mark);
+ char *action;
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ action = "CREATE";
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ action = "DELETE";
+ break;
+ default:
+ action = "<unknown action>";
+ break;
+ }
+
+ appendStringInfo(buf, "%s %s", action, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+ char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+ appendStringInfoString(buf, path);
+ appendStringInfo(buf, " persistence %d", xlrec->persistence);
+ pfree(path);
+ }
}
const char *
@@ -55,6 +98,15 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_UNLINK:
+ id = "UNLINK";
+ break;
+ case XLOG_SMGR_MARK:
+ id = "MARK";
+ break;
+ case XLOG_SMGR_BUFPERSISTENCE:
+ id = "BUFPERSISTENCE";
+ break;
}
return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c1..b344bbe511 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -724,6 +724,14 @@ we must panic and abort recovery. The DBA will have to manually clean up
then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.
+The Smgr MARK files
+--------------------------------
+
+An smgr mark file is created when a new relation file is created, to
+mark that the relfilenode needs to be cleaned up at recovery time. In
+contrast to the four actions above, failure to remove smgr mark files
+will lead to data loss, in which case the server will shut down.
+
Skipping WAL for New RelFileNode
--------------------------------
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e7b0bc804d..b41186d6d8 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2197,6 +2197,9 @@ CommitTransaction(void)
*/
smgrDoPendingSyncs(true, is_parallel_worker);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2447,6 +2450,9 @@ PrepareTransaction(void)
*/
smgrDoPendingSyncs(true, false);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2772,6 +2778,7 @@ AbortTransaction(void)
AfterTriggerEndXact(false); /* 'false' means it's abort */
AtAbort_Portals();
smgrDoPendingSyncs(false, is_parallel_worker);
+ smgrDoPendingCleanups(false);
AtEOXact_LargeObject(false);
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1e1fbe957f..59f4c2eacf 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -40,6 +40,7 @@
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
#include "catalog/pg_database.h"
+#include "catalog/storage.h"
#include "commands/progress.h"
#include "commands/tablespace.h"
#include "common/controldata_utils.h"
@@ -4564,6 +4565,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
{
ereport(DEBUG1,
(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+
+ /* cleanup garbage files left during crash recovery */
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
InArchiveRecovery = true;
if (StandbyModeRequested)
StandbyMode = true;
@@ -7824,6 +7833,14 @@ StartupXLOG(void)
}
}
+ /* clean up garbage files left during crash recovery */
+ if (!InArchiveRecovery)
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
/* Allow resource managers to do any required cleanup. */
for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
{
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index c5ad28d71f..d6b30387e9 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,6 +19,7 @@
#include "postgres.h"
+#include "access/amapi.h"
#include "access/parallel.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -66,6 +67,23 @@ typedef struct PendingRelDelete
struct PendingRelDelete *next; /* linked-list link */
} PendingRelDelete;
+#define PCOP_UNLINK_FORK (1 << 0)
+#define PCOP_UNLINK_MARK (1 << 1)
+#define PCOP_SET_PERSISTENCE (1 << 2)
+
+typedef struct PendingCleanup
+{
+ RelFileNode relnode; /* relation that may need to be deleted */
+ int op; /* operation mask */
+ bool bufpersistence; /* buffer persistence to set */
+ int unlink_forknum; /* forknum to unlink */
+ StorageMarks unlink_mark; /* mark to unlink */
+ BackendId backend; /* InvalidBackendId if not a temp rel */
+ bool atCommit; /* T=delete at commit; F=delete at abort */
+ int nestLevel; /* xact nesting level of request */
+ struct PendingCleanup *next; /* linked-list link */
+} PendingCleanup;
+
typedef struct PendingRelSync
{
RelFileNode rnode;
@@ -73,6 +91,7 @@ typedef struct PendingRelSync
} PendingRelSync;
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingCleanup *pendingCleanups = NULL; /* head of linked list */
HTAB *pendingSyncHash = NULL;
@@ -117,7 +136,8 @@ AddPendingSync(const RelFileNode *rnode)
SMgrRelation
RelationCreateStorage(RelFileNode rnode, char relpersistence)
{
- PendingRelDelete *pending;
+ PendingRelDelete *pendingdel;
+ PendingCleanup *pendingclean;
SMgrRelation srel;
BackendId backend;
bool needs_wal;
@@ -143,21 +163,41 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return NULL; /* placate compiler */
}
+ /*
+ * We are going to create a new storage file. If the server crashes before
+ * the current transaction ends, the file needs to be cleaned up. The
+ * SMGR_MARK_UNCOMMITTED mark file prompts that work at the next startup.
+ */
srel = smgropen(rnode, backend);
+ log_smgrcreatemark(&rnode, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
smgrcreate(srel, MAIN_FORKNUM, false);
if (needs_wal)
log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM);
/* Add the relation to the list of stuff to delete at abort */
- pending = (PendingRelDelete *)
+ pendingdel = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
- pending->relnode = rnode;
- pending->backend = backend;
- pending->atCommit = false; /* delete if abort */
- pending->nestLevel = GetCurrentTransactionNestLevel();
- pending->next = pendingDeletes;
- pendingDeletes = pending;
+ pendingdel->relnode = rnode;
+ pendingdel->backend = backend;
+ pendingdel->atCommit = false; /* delete if abort */
+ pendingdel->nestLevel = GetCurrentTransactionNestLevel();
+ pendingdel->next = pendingDeletes;
+ pendingDeletes = pendingdel;
+
+ /* drop mark files at commit */
+ pendingclean = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pendingclean->relnode = rnode;
+ pendingclean->op = PCOP_UNLINK_MARK;
+ pendingclean->unlink_forknum = MAIN_FORKNUM;
+ pendingclean->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pendingclean->backend = backend;
+ pendingclean->atCommit = true;
+ pendingclean->nestLevel = GetCurrentTransactionNestLevel();
+ pendingclean->next = pendingCleanups;
+ pendingCleanups = pendingclean;
if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
{
@@ -168,6 +208,203 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return srel;
}
+/*
+ * RelationCreateInitFork
+ * Create physical storage for the init fork of a relation.
+ *
+ * Creates the init fork and switches the relation's buffers to non-permanent.
+ *
+ * This function is transactional. The creation is WAL-logged, and if the
+ * transaction aborts later on, the init fork will be removed.
+ */
+void
+RelationCreateInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ SMgrRelation srel;
+ bool create = true;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), false, false);
+
+ /*
+ * If we already have entries for init-fork operations on this relation,
+ * we have registered pending cleanup entries to drop an init fork that
+ * existed before the current transaction started. This function reverts
+ * that change simply by removing those entries.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ create = false;
+ }
+ else
+ prev = pending;
+ }
+
+ if (!create)
+ return;
+
+ /*
+ * We are going to create an init fork. If the server crashes before the
+ * current transaction ends, an init fork left behind would corrupt data
+ * during recovery. The mark file works as a sentinel to identify that
+ * situation.
+ */
+ srel = smgropen(rnode, InvalidBackendId);
+ log_smgrcreatemark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+
+ /* We don't have an existing init fork; create it. */
+ smgrcreate(srel, INIT_FORKNUM, false);
+
+ /*
+ * An index init fork needs further initialization. ambuildempty should
+ * WAL-log and sync the file by itself; otherwise we do that ourselves.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rnode, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /* drop the init fork, mark file and revert persistence at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_UNLINK_FORK | PCOP_UNLINK_MARK | PCOP_SET_PERSISTENCE;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->bufpersistence = true;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* drop mark file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
+/*
+ * RelationDropInitFork
+ * Delete physical storage for the init fork of a relation.
+ *
+ * Register pending-delete of the init fork. The real deletion is performed by
+ * smgrDoPendingCleanups at commit.
+ *
+ * This function is transactional. If the transaction aborts later on, the
+ * deletion doesn't happen.
+ */
+void
+RelationDropInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ bool inxact_created = false;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), true, false);
+
+ /*
+ * If we have entries for init-fork operations of this relation, that means
+ * that we have created the init fork in the current transaction. We
+ * remove the init fork and mark file immediately in that case. Otherwise
+ * just register pending-delete for the existing init fork.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ /* unlink list entry */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ inxact_created = true;
+ }
+ else
+ prev = pending;
+ }
+
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rnode, InvalidBackendId);
+
+ /*
+ * INIT forks are never loaded into shared buffers, so there is no point
+ * in dropping buffers for such files.
+ */
+ log_smgrunlinkmark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrunlinkmark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+ log_smgrunlink(&rnode, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ return;
+ }
+
+ /* register drop of this init fork file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_UNLINK_FORK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = false;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -187,6 +424,88 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum)
+{
+ xl_smgr_unlink xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file unlink.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK record (mark creation) to WAL.
+ */
+void
+log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the mark file creation.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_CREATE;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK record (mark removal) to WAL.
+ */
+void
+log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the mark file removal.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_UNLINK;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileNode *rnode, bool persistence)
+{
+ xl_smgr_bufpersistence xlrec;
+
+ /*
+ * Make an XLOG entry reporting the change of buffer persistence.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.persistence = persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -255,6 +574,7 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)
prev->next = next;
else
pendingDeletes = next;
+
pfree(pending);
/* prev does not change */
}
@@ -673,6 +993,88 @@ smgrDoPendingDeletes(bool isCommit)
}
}
+/*
+ * smgrDoPendingCleanups() -- Clean up work that emits WAL records
+ *
+ * The operations handled in this function emit WAL records, which must be
+ * emitted before the commit record for the current transaction.
+ */
+void
+smgrDoPendingCleanups(bool isCommit)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (pending->nestLevel < nestLevel)
+ {
+ /* outer-level entries should not be processed yet */
+ prev = pending;
+ }
+ else
+ {
+ /* unlink list entry first, so we don't retry on failure */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ /* do cleanup if called for */
+ if (pending->atCommit == isCommit)
+ {
+ SMgrRelation srel;
+
+ srel = smgropen(pending->relnode, pending->backend);
+
+ Assert ((pending->op &
+ ~(PCOP_UNLINK_FORK | PCOP_UNLINK_MARK |
+ PCOP_SET_PERSISTENCE)) == 0);
+
+ if (pending->op & PCOP_UNLINK_FORK)
+ {
+ /* other forks would need their buffers dropped first */
+ Assert(pending->unlink_forknum == INIT_FORKNUM);
+
+ /* Don't emit WAL during recovery. */
+ if (!InRecovery)
+ log_smgrunlink(&pending->relnode,
+ pending->unlink_forknum);
+ smgrunlink(srel, pending->unlink_forknum, false);
+ }
+
+ if (pending->op & PCOP_UNLINK_MARK)
+ {
+ /*
+ * Reuse srel opened above. Closing it here would invalidate
+ * the pointer used for PCOP_SET_PERSISTENCE below.
+ */
+ if (!InRecovery)
+ log_smgrunlinkmark(&pending->relnode,
+ pending->unlink_forknum,
+ pending->unlink_mark);
+ smgrunlinkmark(srel, pending->unlink_forknum,
+ pending->unlink_mark, InRecovery);
+ }
+
+ if (pending->op & PCOP_SET_PERSISTENCE)
+ {
+ SetRelationBuffersPersistence(srel, pending->bufpersistence,
+ InRecovery);
+ }
+ }
+
+ /* must explicitly free the list entry */
+ pfree(pending);
+ /* prev does not change */
+ }
+ }
+}
+
/*
* smgrDoPendingSyncs() -- Take care of relation syncs at end of xact.
*/
@@ -933,6 +1335,15 @@ smgr_redo(XLogReaderState *record)
reln = smgropen(xlrec->rnode, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ smgrunlink(reln, xlrec->forkNum, true);
+ smgrclose(reln);
+ }
else if (info == XLOG_SMGR_TRUNCATE)
{
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
@@ -1021,6 +1432,124 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ bool created = false;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ smgrcreatemark(reln, xlrec->forkNum, xlrec->mark, true);
+ created = true;
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ smgrunlinkmark(reln, xlrec->forkNum, xlrec->mark, true);
+ break;
+ default:
+ elog(ERROR, "unknown smgr_mark action \"%c\"", xlrec->action);
+ }
+
+ if (created)
+ {
+ /* revert mark file operation at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = xlrec->rnode;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = xlrec->forkNum;
+ pending->unlink_mark = xlrec->mark;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ else
+ {
+ /*
+ * Delete pending action for this mark file if any. We should have
+ * at most one entry for this action.
+ */
+ PendingCleanup *prev = NULL;
+
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileNodeEquals(xlrec->rnode, pending->relnode) &&
+ pending->unlink_forknum == xlrec->forkNum &&
+ (pending->op & PCOP_UNLINK_MARK) != 0)
+ {
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+ }
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ PendingCleanup *prev = NULL;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+
+ /*
+ * Delete pending action for persistence change if any. We should have
+ * at most one entry for this action.
+ */
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileNodeEquals(xlrec->rnode, pending->relnode) &&
+ (pending->op & PCOP_SET_PERSISTENCE) != 0)
+ {
+ Assert (pending->bufpersistence == xlrec->persistence);
+
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+
+ /*
+ * Revert buffer-persistence changes at abort if the relation is changing
+ * to a persistence different from the one before this transaction.
+ */
+ if (!pending)
+ {
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = xlrec->rnode;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = !xlrec->persistence;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index bf42587e38..451ed9adb1 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -52,6 +52,7 @@
#include "commands/defrem.h"
#include "commands/event_trigger.h"
#include "commands/policy.h"
+#include "commands/progress.h"
#include "commands/sequence.h"
#include "commands/tablecmds.h"
#include "commands/tablespace.h"
@@ -5329,6 +5330,186 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
return newcmd;
}
+/*
+ * RelationChangePersistence: do in-place persistence change of a relation
+ */
+static void
+RelationChangePersistence(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ int i;
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * ATRewriteTable would be required if any of the following were set,
+ * which cannot happen in the AT_REWRITE_ALTER_PERSISTENCE case.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ /*
+ * First we collect all relations that we need to change persistence.
+ */
+
+ /* Collect OIDs of indexes and toast relations */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ /* Add toast relation if any */
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ List *toastidx;
+ Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
+
+ relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
+ toastidx = RelationGetIndexList(toastrel);
+ relids = list_concat(relids, toastidx);
+ list_free(toastidx);
+ table_close(toastrel, NoLock);
+ }
+
+ table_close(rel, NoLock);
+
+ /* Make changes in storage */
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach (lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+
+ /*
+ * XXXX: Some access methods do not tolerate an in-place persistence
+ * change. Specifically, GiST uses page LSNs to figure out whether a
+ * block has changed, where UNLOGGED GiST indexes use fake LSNs that
+ * are incompatible with real LSNs used for LOGGED ones.
+ *
+ * Maybe if gistGetFakeLSN behaved the same way for permanent and
+ * unlogged indexes, we could skip index rebuild in exchange for some
+ * extra WAL records emitted while it is unlogged.
+ *
+ * Check relam against a positive list so that unknown AMs also take
+ * this path.
+ */
+ if (r->rd_rel->relkind == RELKIND_INDEX &&
+ /* GiST is excluded */
+ r->rd_rel->relam != BTREE_AM_OID &&
+ r->rd_rel->relam != HASH_AM_OID &&
+ r->rd_rel->relam != GIN_AM_OID &&
+ r->rd_rel->relam != SPGIST_AM_OID &&
+ r->rd_rel->relam != BRIN_AM_OID)
+ {
+ int reindex_flags;
+
+ /* reindex doesn't allow concurrent use of the index */
+ table_close(r, NoLock);
+
+ reindex_flags =
+ REINDEX_REL_SUPPRESS_INDEX_USE |
+ REINDEX_REL_CHECK_CONSTRAINTS;
+
+ /* Set the same persistence with the parent relation. */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+
+ reindex_index(reloid, reindex_flags, persistence, 0);
+
+ continue;
+ }
+
+ /* Create or drop init fork */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ RelationCreateInitFork(r);
+ else
+ RelationDropInitFork(r);
+
+ /*
+ * When this relation becomes WAL-logged, immediately sync all files but
+ * the init fork to establish the initial state on storage. Buffers have
+ * already been flushed out by RelationCreate(Drop)InitFork called just
+ * above. The init fork should have been synced as needed.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT)
+ {
+ for (i = 0 ; i < INIT_FORKNUM ; i++)
+ {
+ if (smgrexists(RelationGetSmgr(r), i))
+ smgrimmedsync(RelationGetSmgr(r), i);
+ }
+ }
+
+ /* Update catalog */
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+
+ /*
+ * While wal_level >= replica, switching to LOGGED requires the
+ * relation content to be WAL-logged to recover the table.
+ * We don't emit this while wal_level = minimal.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded())
+ {
+ ForkNumber fork;
+ xl_smgr_truncate xlrec;
+
+ xlrec.blkno = 0;
+ xlrec.rnode = r->rd_node;
+ xlrec.flags = SMGR_TRUNCATE_ALL;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ for (fork = 0; fork < INIT_FORKNUM ; fork++)
+ {
+ if (smgrexists(RelationGetSmgr(r), fork))
+ log_newpage_range(r, fork, 0,
+ smgrnblocks(RelationGetSmgr(r), fork),
+ false);
+ }
+ }
+
+ table_close(r, NoLock);
+ }
+
+ table_close(classRel, NoLock);
+}
+
/*
* ATRewriteTables: ALTER TABLE phase 3
*/
@@ -5459,47 +5640,55 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- /*
- * Create transient table that will receive the modified data.
- *
- * Ensure it is marked correctly as logged or unlogged. We have
- * to do this here so that buffers for the new relfilenode will
- * have the right persistence set, and at the same time ensure
- * that the original filenode's buffers will get read in with the
- * correct setting (i.e. the original one). Otherwise a rollback
- * after the rewrite would possibly result with buffers for the
- * original filenode having the wrong persistence setting.
- *
- * NB: This relies on swap_relation_files() also swapping the
- * persistence. That wouldn't work for pg_class, but that can't be
- * unlogged anyway.
- */
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
- persistence, lockmode);
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
+ RelationChangePersistence(tab, persistence, lockmode);
+ else
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new relfilenode
+ * will have the right persistence set, and at the same time
+ * ensure that the original filenode's buffers will get read in
+ * with the correct setting (i.e. the original one). Otherwise
+ * a rollback after the rewrite would possibly result with
+ * buffers for the original filenode having the wrong
+ * persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that can't
+ * be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace,
+ NewAccessMethod,
+ persistence, lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
+ /*
+ * Swap the physical files of the old and new heaps, then
+ * rebuild indexes and discard the old heap. We can use
+ * RecentXmin for the table's new relfrozenxid because we
+ * rewrote all the tuples in ATRewriteTable, so no older Xid
+ * remains in the table. Also, we never try to swap toast
+ * tables by content, since we have no interest in letting this
+ * code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
- InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ }
}
else
{
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index ec0485705d..45e1a5d817 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1070,6 +1070,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
bool excludeFound;
ForkNumber relForkNum; /* Type of fork if file is a relation */
int relOidChars; /* Chars in filename that are the rel oid */
+ StorageMarks mark;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1120,7 +1121,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
/* Exclude all forks for unlogged tables except the init fork */
if (isDbDir &&
parse_filename_for_nontemp_relation(de->d_name, &relOidChars,
- &relForkNum))
+ &relForkNum, &mark))
{
/* Never exclude init forks */
if (relForkNum != INIT_FORKNUM)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index b4532948d3..dab74bf99a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -37,6 +37,7 @@
#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
#include "executor/instrument.h"
#include "lib/binaryheap.h"
#include "miscadmin.h"
@@ -3154,6 +3155,93 @@ DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
}
}
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a relation.
+ * When switching to PERMANENT, it then writes all dirty pages of the
+ * relation out to disk (or more accurately, out to kernel disk buffers),
+ * ensuring that the kernel has an up-to-date view of the relation.
+ *
+ * Generally, the caller should be holding AccessExclusiveLock on the
+ * target relation to ensure that no other backend is busy dirtying
+ * more blocks of the relation; the effects can't be expected to last
+ * after the lock is released.
+ *
+ * XXX currently it sequentially searches the buffer pool, should be
+ * changed to more clever ways of searching. This routine is not
+ * used in any performance-critical code paths, so it's not worth
+ * adding additional overhead to normal paths to make it go faster;
+ * but see also DropRelFileNodeBuffers.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
+{
+ int i;
+ RelFileNodeBackend rnode = srel->smgr_rnode;
+
+ Assert (!RelFileNodeBackendIsTemp(rnode));
+
+ if (!isRedo)
+ log_smgrbufpersistence(&srel->smgr_rnode.node, permanent);
+
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ {
+ UnlockBufHdr(bufHdr, buf_state);
+ continue;
+ }
+
+ if (permanent)
+ {
+ /* Init fork is being dropped, drop buffers for it. */
+ if (bufHdr->tag.forkNum == INIT_FORKNUM)
+ {
+ InvalidateBuffer(bufHdr);
+ continue;
+ }
+
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /* we flush this buffer when switching to PERMANENT */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, srel);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr, true);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ /* init fork is always BM_PERMANENT. See BufferAlloc */
+ if (bufHdr->tag.forkNum != INIT_FORKNUM)
+ buf_state &= ~BM_PERMANENT;
+
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ }
+}
+
/* ---------------------------------------------------------------------
* DropRelFileNodesAllBuffers
*
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 263057841d..8487ae1f02 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -349,8 +349,6 @@ static void pre_sync_fname(const char *fname, bool isdir, int elevel);
static void datadir_fsync_fname(const char *fname, bool isdir, int elevel);
static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel);
-static int fsync_parent_path(const char *fname, int elevel);
-
/*
* pg_fsync --- do fsync with or without writethrough
@@ -3759,7 +3757,7 @@ fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel)
* This is aimed at making file operations persistent on disk in case of
* an OS crash or power failure.
*/
-static int
+int
fsync_parent_path(const char *fname, int elevel)
{
char parentpath[MAXPGPATH];
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 0ae3fb6902..0137902bb2 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -16,29 +16,49 @@
#include <unistd.h>
+#include "access/xlog.h"
+#include "catalog/pg_tablespace_d.h"
#include "common/relpath.h"
#include "postmaster/startup.h"
+#include "storage/bufmgr.h"
#include "storage/copydir.h"
#include "storage/fd.h"
+#include "storage/md.h"
#include "storage/reinit.h"
+#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
- int op);
+ Oid tspid, int op);
static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
- int op);
+ Oid tspid, Oid dbid, int op);
typedef struct
{
Oid reloid; /* hash key */
-} unlogged_relation_entry;
+ bool has_init; /* has INIT fork */
+ bool dirty_init; /* needs to remove INIT fork */
+ bool dirty_all; /* needs to remove all forks */
+} relfile_entry;
/*
- * Reset unlogged relations from before the last restart.
+ * Clean up and reset relation files from before the last restart.
*
- * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any
- * relation with an "init" fork, except for the "init" fork itself.
+ * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations
+ * depending on which mark files exist.
+ *
+ * If the SMGR_MARK_UNCOMMITTED mark file for the init fork is present, we
+ * remove the init fork along with the mark file.
+ *
+ * If the SMGR_MARK_UNCOMMITTED mark file for the main fork is present, we
+ * remove the whole relation along with the mark file.
+ *
+ * Otherwise, if the "init" fork is found, we remove all forks of any relation
+ * with the "init" fork, except for the "init" fork itself.
+ *
+ * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all
+ * relations that have mark files and/or "init" forks.
*
* If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main
* fork.
@@ -72,7 +92,7 @@ ResetUnloggedRelations(int op)
/*
* First process unlogged files in pg_default ($PGDATA/base)
*/
- ResetUnloggedRelationsInTablespaceDir("base", op);
+ ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op);
/*
* Cycle through directories for all non-default tablespaces.
@@ -81,13 +101,19 @@ ResetUnloggedRelations(int op)
while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
{
+ Oid tspid;
+
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
continue;
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
- ResetUnloggedRelationsInTablespaceDir(temp_path, op);
+
+ tspid = atooid(spc_de->d_name);
+
+ Assert(tspid != 0);
+ ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op);
}
FreeDir(spc_dir);
@@ -103,7 +129,8 @@ ResetUnloggedRelations(int op)
* Process one tablespace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
+ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
+ Oid tspid, int op)
{
DIR *ts_dir;
struct dirent *de;
@@ -130,6 +157,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
while ((de = ReadDir(ts_dir, tsdirname)) != NULL)
{
+ Oid dbid;
+
/*
* We're only interested in the per-database directories, which have
* numeric names. Note that this code will also (properly) ignore "."
@@ -148,7 +177,10 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
ereport_startup_progress("resetting unlogged relations (cleanup), elapsed time: %ld.%02d s, current path: %s",
dbspace_path);
- ResetUnloggedRelationsInDbspaceDir(dbspace_path, op);
+ dbid = atooid(de->d_name);
+ Assert(dbid != 0);
+
+ ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op);
}
FreeDir(ts_dir);
@@ -158,125 +190,227 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
* Process one per-dbspace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
+ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
+ Oid tspid, Oid dbid, int op)
{
DIR *dbspace_dir;
struct dirent *de;
char rm_path[MAXPGPATH * 2];
+ HTAB *hash;
+ HASHCTL ctl;
/* Caller must specify at least one operation. */
- Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0);
+ Assert((op & (UNLOGGED_RELATION_CLEANUP |
+ UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_INIT)) != 0);
/*
* Cleanup is a two-pass operation. First, we go through and identify all
* the files with init forks. Then, we go through again and nuke
* everything with the same OID except the init fork.
*/
+
+ /*
+ * It's possible that someone could create tons of unlogged relations in
+ * the same database & tablespace, so we'd better use a hash table rather
+ * than an array or linked list to keep track of which files need to be
+ * reset. Otherwise, this cleanup operation would be O(n^2).
+ */
+ memset(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(relfile_entry);
+ hash = hash_create("relfilenode cleanup hash",
+ 32, &ctl, HASH_ELEM | HASH_BLOBS);
+
+ /* Collect INIT fork and mark files in the directory. */
+ dbspace_dir = AllocateDir(dbspacedirname);
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
+ {
+ int oidchars;
+ ForkNumber forkNum;
+ StorageMarks mark;
+
+ /* Skip anything that doesn't look like a relation data file. */
+ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
+ &forkNum, &mark))
+ continue;
+
+ if (forkNum == INIT_FORKNUM || mark == SMGR_MARK_UNCOMMITTED)
+ {
+ Oid key;
+ relfile_entry *ent;
+ bool found;
+
+ /*
+ * Record the relfilenode information. If it has an
+ * SMGR_MARK_UNCOMMITTED mark file, the relfilenode is in a dirty
+ * state and needs cleanup.
+ */
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_ENTER, &found);
+
+ if (!found)
+ {
+ ent->has_init = false;
+ ent->dirty_init = false;
+ ent->dirty_all = false;
+ }
+
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_init = true;
+ else if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_all = true;
+ else
+ {
+ Assert(forkNum == INIT_FORKNUM);
+ ent->has_init = true;
+ }
+ }
+ }
+
+ /* Done with the first pass. */
+ FreeDir(dbspace_dir);
+
+ /* Nothing to do if we have neither init forks nor mark files. */
+ if (hash_get_num_entries(hash) < 1)
+ {
+ hash_destroy(hash);
+ return;
+ }
+
+ if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0)
+ {
+ /*
+ * When we come here after recovery, an smgr object for this file might
+ * have been created. In that case we need to drop all buffers and then
+ * the smgr object before initializing the unlogged relation. This is
+ * safe as long as no other backend has accessed the relation before
+ * starting archive recovery.
+ */
+ HASH_SEQ_STATUS status;
+ relfile_entry *ent;
+ SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8);
+ int maxrels = 8;
+ int nrels = 0;
+ int i;
+
+ Assert(!HotStandbyActive());
+
+ hash_seq_init(&status, hash);
+ while ((ent = (relfile_entry *) hash_seq_search(&status)) != NULL)
+ {
+ RelFileNodeBackend rel;
+
+ /*
+ * The relation is persistent and remains persistent. Don't
+ * drop the buffers for this relation.
+ */
+ if (ent->has_init && ent->dirty_init)
+ continue;
+
+ if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = ent->reloid;
+
+ srels[nrels++] = smgropen(rel.node, InvalidBackendId);
+ }
+
+ DropRelFileNodesAllBuffers(srels, nrels);
+
+ for (i = 0; i < nrels; i++)
+ smgrclose(srels[i]);
+ }
+
+ /*
+ * Now, make a second pass and remove anything that matches.
+ */
if ((op & UNLOGGED_RELATION_CLEANUP) != 0)
{
- HTAB *hash;
- HASHCTL ctl;
-
- /*
- * It's possible that someone could create a ton of unlogged relations
- * in the same database & tablespace, so we'd better use a hash table
- * rather than an array or linked list to keep track of which files
- * need to be reset. Otherwise, this cleanup operation would be
- * O(n^2).
- */
- ctl.keysize = sizeof(Oid);
- ctl.entrysize = sizeof(unlogged_relation_entry);
- ctl.hcxt = CurrentMemoryContext;
- hash = hash_create("unlogged relation OIDs", 32, &ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- /* Scan the directory. */
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ int oidchars;
+ Oid key;
+ relfile_entry *ent;
+ RelFileNodeBackend rel;
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* Also skip it unless this is the init fork. */
- if (forkNum != INIT_FORKNUM)
- continue;
-
- /*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
- */
- ent.reloid = atooid(de->d_name);
- (void) hash_search(hash, &ent, HASH_ENTER, NULL);
- }
-
- /* Done with the first pass. */
- FreeDir(dbspace_dir);
-
- /*
- * If we didn't find any init forks, there's no point in continuing;
- * we can bail out now.
- */
- if (hash_get_num_entries(hash) == 0)
- {
- hash_destroy(hash);
- return;
- }
-
- /*
- * Now, make a second pass and remove anything that matches.
- */
- dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
- {
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
-
- /* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* We never remove the init fork. */
- if (forkNum == INIT_FORKNUM)
+ &forkNum, &mark))
continue;
/*
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
- if (hash_search(hash, &ent, HASH_FIND, NULL))
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);
+
+ if (!ent)
+ continue;
+
+ if (!ent->dirty_all)
{
- snprintf(rm_path, sizeof(rm_path), "%s/%s",
- dbspacedirname, de->d_name);
- if (unlink(rm_path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m",
- rm_path)));
+ /* clean permanent relations don't need cleanup */
+ if (!ent->has_init)
+ continue;
+
+ if (ent->dirty_init)
+ {
+ /*
+ * The crashed transaction did SET UNLOGGED. This relation
+ * is restored to a LOGGED relation.
+ */
+ if (forkNum != INIT_FORKNUM)
+ continue;
+ }
else
- elog(DEBUG2, "unlinked file \"%s\"", rm_path);
+ {
+ /*
+ * we don't remove the INIT fork of a non-dirty
+ * relfilenode
+ */
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE)
+ continue;
+ }
}
+
+ /* so, nuke it! */
+ snprintf(rm_path, sizeof(rm_path), "%s/%s",
+ dbspacedirname, de->d_name);
+ if (unlink(rm_path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ rm_path)));
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = atooid(de->d_name);
+
+ ForgetRelationForkSyncRequests(rel, forkNum);
}
/* Cleanup is complete. */
FreeDir(dbspace_dir);
- hash_destroy(hash);
}
+ hash_destroy(hash);
+ hash = NULL;
+
/*
* Initialization happens after cleanup is complete: we copy each init
- * fork file to the corresponding main fork file. Note that if we are
- * asked to do both cleanup and init, we may never get here: if the
- * cleanup code determines that there are no init forks in this dbspace,
- * it will return before we get to this point.
+ * fork file to the corresponding main fork file.
*/
if ((op & UNLOGGED_RELATION_INIT) != 0)
{
@@ -285,6 +419,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char srcpath[MAXPGPATH * 2];
@@ -292,9 +427,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -328,15 +465,18 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char mainpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -379,7 +519,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
*/
bool
parse_filename_for_nontemp_relation(const char *name, int *oidchars,
- ForkNumber *fork)
+ ForkNumber *fork, StorageMarks *mark)
{
int pos;
@@ -410,11 +550,19 @@ parse_filename_for_nontemp_relation(const char *name, int *oidchars,
for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar)
;
- if (segchar <= 1)
- return false;
- pos += segchar;
+ if (segchar > 1)
+ pos += segchar;
}
+ /* mark file? */
+ if (name[pos] == '.' && name[pos + 1] != 0)
+ {
+ *mark = name[pos + 1];
+ pos += 2;
+ }
+ else
+ *mark = SMGR_MARK_NONE;
+
/* Now we should be at the end. */
if (name[pos] != '\0')
return false;
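For reviewers, the per-file decision taken by the second pass of ResetUnloggedRelationsInDbspaceDir() in the hunk above can be summarized as a small standalone sketch. This is not part of the patch: the names Fork, Mark, RelFlags, should_remove and self_check are hypothetical stand-ins for ForkNumber, StorageMarks and relfile_entry, and only the branching is meant to mirror the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the patch's ForkNumber and StorageMarks. */
typedef enum { FORK_MAIN, FORK_FSM, FORK_VM, FORK_INIT } Fork;
typedef enum { MARK_NONE = 0, MARK_UNCOMMITTED = 'u' } Mark;

/* Per-relfilenode flags collected by the first directory pass. */
typedef struct
{
	bool		has_init;	/* an unmarked init fork was seen */
	bool		dirty_init;	/* the init fork has an "uncommitted" mark */
	bool		dirty_all;	/* the main fork has an "uncommitted" mark */
} RelFlags;

/* Returns true if the second pass would unlink the given file. */
static bool
should_remove(const RelFlags *ent, Fork fork, Mark mark)
{
	if (ent->dirty_all)
		return true;		/* uncommitted relation: remove every file */
	if (!ent->has_init)
		return false;		/* clean permanent relation */
	if (ent->dirty_init)
		return fork == FORK_INIT;	/* crashed SET UNLOGGED: drop init fork and mark */

	/* ordinary unlogged relation: keep only the unmarked init fork */
	return !(fork == FORK_INIT && mark == MARK_NONE);
}

/* A few cases spelled out; aborts via assert() if the table above is wrong. */
static void
self_check(void)
{
	RelFlags	unlogged = {.has_init = true};
	RelFlags	crashed_set_unlogged = {.has_init = true, .dirty_init = true};
	RelFlags	crashed_create = {.dirty_all = true};

	assert(should_remove(&unlogged, FORK_MAIN, MARK_NONE));
	assert(!should_remove(&unlogged, FORK_INIT, MARK_NONE));
	assert(should_remove(&crashed_set_unlogged, FORK_INIT, MARK_UNCOMMITTED));
	assert(!should_remove(&crashed_set_unlogged, FORK_MAIN, MARK_NONE));
	assert(should_remove(&crashed_create, FORK_INIT, MARK_NONE));
}
```

Read this way, a crashed SET UNLOGGED leaves the main fork untouched and only loses its init fork and mark file, which is what restores the relation to LOGGED.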
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index b4bca7eed6..580b74839f 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -139,7 +139,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
BlockNumber blkno, bool skipFsync, int behavior);
static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
MdfdVec *seg);
-
+static bool mdmarkexists(SMgrRelation reln, ForkNumber forkNum,
+ StorageMarks mark);
/*
* mdinit() -- Initialize private state for magnetic disk storage manager.
@@ -169,6 +170,81 @@ mdexists(SMgrRelation reln, ForkNumber forkNum)
return (mdopenfork(reln, forkNum, EXTENSION_RETURN_NULL) != NULL);
}
+/*
+ * mdcreatemark() -- Create a mark file.
+ *
+ * If isRedo is true, it's okay for the file to exist already.
+ */
+void
+mdcreatemark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+ int fd;
+
+ /* See mdcreate for details. */
+ TablespaceCreateDbspace(reln->smgr_rnode.node.spcNode,
+ reln->smgr_rnode.node.dbNode,
+ isRedo);
+
+ fd = BasicOpenFile(path, O_WRONLY | O_CREAT | O_EXCL);
+ if (fd < 0 && (!isRedo || errno != EEXIST))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create mark file \"%s\": %m", path)));
+
+ pg_fsync(fd);
+ close(fd);
+
+ /*
+ * To guarantee that the creation of the file is persistent, fsync its
+ * parent directory.
+ */
+ fsync_parent_path(path, ERROR);
+
+ pfree(path);
+}
+
+
+/*
+ * mdunlinkmark() -- Delete the mark file
+ *
+ * If isRedo is true, it's okay if the file is not found.
+ */
+void
+mdunlinkmark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+
+ if (!isRedo || mdmarkexists(reln, forkNum, mark))
+ durable_unlink(path, ERROR);
+
+ pfree(path);
+}
+
+/*
+ * mdmarkexists() -- Check if the mark file exists.
+ */
+static bool
+mdmarkexists(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+ int fd;
+
+ fd = BasicOpenFile(path, O_RDONLY);
+ if (fd < 0 && errno != ENOENT)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not access mark file \"%s\": %m", path)));
+ pfree(path);
+
+ if (fd < 0)
+ return false;
+ close(fd);
+ return true;
+}
+
/*
* mdcreate() -- Create a new relation on magnetic disk.
*
@@ -1025,6 +1101,15 @@ register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum,
RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ );
}
+/*
+ * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork
+ */
+void
+ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, ForkNumber forknum)
+{
+ register_forget_request(rnode, forknum, 0);
+}
+
/*
* ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB
*/
@@ -1378,12 +1463,14 @@ mdsyncfiletag(const FileTag *ftag, char *path)
* Return 0 on success, -1 on failure, with errno set.
*/
int
-mdunlinkfiletag(const FileTag *ftag, char *path)
+mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark)
{
char *p;
/* Compute the path. */
- p = relpathperm(ftag->rnode, MAIN_FORKNUM);
+ p = GetRelationPath(ftag->rnode.dbNode, ftag->rnode.spcNode,
+ ftag->rnode.relNode, InvalidBackendId, MAIN_FORKNUM,
+ mark);
strlcpy(path, p, MAXPGPATH);
pfree(p);
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 0fcef4994b..110e64b0b2 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -62,6 +62,10 @@ typedef struct f_smgr
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+ void (*smgr_createmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+ void (*smgr_unlinkmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
} f_smgr;
static const f_smgr smgrsw[] = {
@@ -82,6 +86,8 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
+ .smgr_createmark = mdcreatemark,
+ .smgr_unlinkmark = mdunlinkmark,
}
};
@@ -335,6 +341,26 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo);
}
+/*
+ * smgrcreatemark() -- Create a mark file
+ */
+void
+smgrcreatemark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_createmark(reln, forknum, mark, isRedo);
+}
+
+/*
+ * smgrunlinkmark() -- Delete a mark file
+ */
+void
+smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlinkmark(reln, forknum, mark, isRedo);
+}
+
/*
* smgrdosyncall() -- Immediately sync all forks of all given relations
*
@@ -662,6 +688,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index d4083e8a56..9563940d45 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -89,7 +89,8 @@ static CycleCtr checkpoint_cycle_ctr = 0;
typedef struct SyncOps
{
int (*sync_syncfiletag) (const FileTag *ftag, char *path);
- int (*sync_unlinkfiletag) (const FileTag *ftag, char *path);
+ int (*sync_unlinkfiletag) (const FileTag *ftag, char *path,
+ StorageMarks mark);
bool (*sync_filetagmatches) (const FileTag *ftag,
const FileTag *candidate);
} SyncOps;
@@ -222,7 +223,8 @@ SyncPostCheckpoint(void)
/* Unlink the file */
if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
- path) < 0)
+ path,
+ SMGR_MARK_NONE) < 0)
{
/*
* There's a race condition, when the database is dropped at the
@@ -236,6 +238,20 @@ SyncPostCheckpoint(void)
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m", path)));
}
+ else if (syncsw[entry->tag.handler].sync_unlinkfiletag(
+ &entry->tag, path,
+ SMGR_MARK_UNCOMMITTED) < 0)
+ {
+ /*
+ * There may also be an SMGR_MARK_UNCOMMITTED mark file. Remove it
+ * once the fork file has been successfully removed. It's ok if the
+ * mark file does not exist.
+ */
+ if (errno != ENOENT)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
/* Mark the list entry as canceled, just in case */
entry->canceled = true;
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 436df54120..dbc0da5da5 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -407,6 +407,30 @@ extractPageInfo(XLogReaderState *record)
* source system.
*/
}
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_UNLINK)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_MARK)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
else if (rmid == RM_XACT_ID &&
((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT ||
(rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED ||
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 1f5c426ec0..4945b111cc 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -139,9 +139,15 @@ GetDatabasePath(Oid dbNode, Oid spcNode)
*/
char *
GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
- int backendId, ForkNumber forkNumber)
+ int backendId, ForkNumber forkNumber, char mark)
{
char *path;
+ char markstr[4];
+
+ if (mark == 0)
+ markstr[0] = 0;
+ else
+ snprintf(markstr, sizeof(markstr), ".%c", mark);
if (spcNode == GLOBALTABLESPACE_OID)
{
@@ -149,10 +155,10 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
Assert(dbNode == 0);
Assert(backendId == InvalidBackendId);
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("global/%u_%s",
- relNode, forkNames[forkNumber]);
+ path = psprintf("global/%u_%s%s",
+ relNode, forkNames[forkNumber], markstr);
else
- path = psprintf("global/%u", relNode);
+ path = psprintf("global/%u%s", relNode, markstr);
}
else if (spcNode == DEFAULTTABLESPACE_OID)
{
@@ -160,22 +166,22 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/%u_%s",
+ path = psprintf("base/%u/%u_%s%s",
dbNode, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/%u",
- dbNode, relNode);
+ path = psprintf("base/%u/%u%s",
+ dbNode, relNode, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/t%d_%u_%s",
+ path = psprintf("base/%u/t%d_%u_%s%s",
dbNode, backendId, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/t%d_%u",
- dbNode, backendId, relNode);
+ path = psprintf("base/%u/t%d_%u%s",
+ dbNode, backendId, relNode, markstr);
}
}
else
@@ -184,27 +190,28 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u_%s%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
dbNode, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
- dbNode, relNode);
+ dbNode, relNode, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
dbNode, backendId, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
- dbNode, backendId, relNode);
+ dbNode, backendId, relNode, markstr);
}
}
+
return path;
}
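To make the resulting file names concrete, here is a minimal sketch of the suffixing that the GetRelationPath() change above performs, reduced to the default-tablespace, non-temporary case. It is an illustration only: build_mark_path and check_mark_paths are hypothetical names, the fork name is passed as a plain string instead of indexing forkNames[], and the OIDs in the checks are made up.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "base/<db>/<rel>[_<fork>][.<mark>]" into buf.
 * fork_name is NULL for the main fork; mark is 0 when there is no mark.
 */
static void
build_mark_path(char *buf, size_t buflen, unsigned db, unsigned rel,
				const char *fork_name, char mark)
{
	char		markstr[3] = "";	/* '.', mark character, terminator */

	if (mark != 0)
		snprintf(markstr, sizeof(markstr), ".%c", mark);

	if (fork_name != NULL)
		snprintf(buf, buflen, "base/%u/%u_%s%s", db, rel, fork_name, markstr);
	else
		snprintf(buf, buflen, "base/%u/%u%s", db, rel, markstr);
}

/* Example names for an init-fork mark, a main-fork mark, and no mark. */
static void
check_mark_paths(void)
{
	char		buf[64];

	build_mark_path(buf, sizeof(buf), 5, 16384, "init", 'u');
	assert(strcmp(buf, "base/5/16384_init.u") == 0);

	build_mark_path(buf, sizeof(buf), 5, 16384, NULL, 'u');
	assert(strcmp(buf, "base/5/16384.u") == 0);

	build_mark_path(buf, sizeof(buf), 5, 16384, NULL, 0);
	assert(strcmp(buf, "base/5/16384") == 0);
}
```

Because segment files already use a ".<digit>" suffix, a letter such as 'u' keeps mark files distinguishable from segments when the directory is scanned.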
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 0ab32b44e9..584ebac391 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -23,6 +23,8 @@
extern int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
+extern void RelationCreateInitFork(Relation rel);
+extern void RelationDropInitFork(Relation rel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationPreTruncate(Relation rel);
@@ -41,6 +43,7 @@ extern void RestorePendingSyncs(char *startAddress);
extern void smgrDoPendingDeletes(bool isCommit);
extern void smgrDoPendingSyncs(bool isCommit, bool isParallelWorker);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingCleanups(bool isCommit);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index f0814f1458..12346ed7f6 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -18,17 +18,23 @@
#include "lib/stringinfo.h"
#include "storage/block.h"
#include "storage/relfilenode.h"
+#include "storage/smgr.h"
/*
* Declarations for smgr-related XLOG records
*
- * Note: we log file creation and truncation here, but logging of deletion
- * actions is handled by xact.c, because it is part of transaction commit.
+ * Note: we log file creation, truncation and buffer persistence change here,
+ * but logging of deletion actions is handled mainly by xact.c, because it is
+ * part of transaction commit in most cases. However, there's a case where
+ * init forks are deleted outside the control of a transaction.
*/
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_UNLINK 0x30
+#define XLOG_SMGR_MARK 0x40
+#define XLOG_SMGR_BUFPERSISTENCE 0x50
typedef struct xl_smgr_create
{
@@ -36,6 +42,32 @@ typedef struct xl_smgr_create
ForkNumber forkNum;
} xl_smgr_create;
+typedef struct xl_smgr_unlink
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+} xl_smgr_unlink;
+
+typedef enum smgr_mark_action
+{
+ XLOG_SMGR_MARK_CREATE = 'c',
+ XLOG_SMGR_MARK_UNLINK = 'u'
+} smgr_mark_action;
+
+typedef struct xl_smgr_mark
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ smgr_mark_action action;
+} xl_smgr_mark;
+
+typedef struct xl_smgr_bufpersistence
+{
+ RelFileNode rnode;
+ bool persistence;
+} xl_smgr_bufpersistence;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +83,12 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark);
+extern void log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark);
+extern void log_smgrbufpersistence(const RelFileNode *rnode, bool persistence);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index a44be11ca0..106a5cf508 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -67,7 +67,7 @@ extern int forkname_chars(const char *str, ForkNumber *fork);
extern char *GetDatabasePath(Oid dbNode, Oid spcNode);
extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
- int backendId, ForkNumber forkNumber);
+ int backendId, ForkNumber forkNumber, char mark);
/*
* Wrapper macros for GetRelationPath. Beware of multiple
@@ -77,7 +77,7 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
/* First argument is a RelFileNode */
#define relpathbackend(rnode, backend, forknum) \
GetRelationPath((rnode).dbNode, (rnode).spcNode, (rnode).relNode, \
- backend, forknum)
+ backend, forknum, 0)
/* First argument is a RelFileNode */
#define relpathperm(rnode, forknum) \
@@ -87,4 +87,9 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
#define relpath(rnode, forknum) \
relpathbackend((rnode).node, (rnode).backend, forknum)
+/* First argument is a RelFileNodeBackend */
+#define markpath(rnode, forknum, mark) \
+ GetRelationPath((rnode).node.dbNode, (rnode).node.spcNode, \
+ (rnode).node.relNode, \
+ (rnode).backend, forknum, mark)
#endif /* RELPATH_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index cfce23ecbc..f5a7df87a4 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -206,6 +206,8 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels)
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+ bool permanent, bool isRedo);
extern void DropRelFileNodesAllBuffers(struct SMgrRelationData **smgr_reln, int nnodes);
extern void DropDatabaseBuffers(Oid dbid);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 34602ae006..2dc0357ad5 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -185,6 +185,7 @@ extern ssize_t pg_pwritev_with_retry(int fd,
extern int pg_truncate(const char *path, off_t length);
extern void fsync_fname(const char *fname, bool isdir);
extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
+extern int fsync_parent_path(const char *fname, int elevel);
extern int durable_rename(const char *oldfile, const char *newfile, int loglevel);
extern int durable_unlink(const char *fname, int loglevel);
extern int durable_rename_excl(const char *oldfile, const char *newfile, int loglevel);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 752b440864..99620816b5 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -23,6 +23,10 @@
extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
extern void mdclose(SMgrRelation reln, ForkNumber forknum);
+extern void mdcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void mdunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
@@ -41,12 +45,14 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void ForgetRelationForkSyncRequests(RelFileNodeBackend rnode,
+ ForkNumber forknum);
extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
/* md sync callbacks */
extern int mdsyncfiletag(const FileTag *ftag, char *path);
-extern int mdunlinkfiletag(const FileTag *ftag, char *path);
+extern int mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark);
extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *candidate);
#endif /* MD_H */
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index fad1e5c473..e1f97e9b89 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -16,13 +16,15 @@
#define REINIT_H
#include "common/relpath.h"
-
+#include "storage/smgr.h"
extern void ResetUnloggedRelations(int op);
-extern bool parse_filename_for_nontemp_relation(const char *name,
- int *oidchars, ForkNumber *fork);
+extern bool parse_filename_for_nontemp_relation(const char *name, int *oidchars,
+ ForkNumber *fork,
+ StorageMarks *mark);
#define UNLOGGED_RELATION_CLEANUP 0x0001
-#define UNLOGGED_RELATION_INIT 0x0002
+#define UNLOGGED_RELATION_DROP_BUFFER 0x0002
+#define UNLOGGED_RELATION_INIT 0x0004
#endif /* REINIT_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a6fbf7b6a6..201ecace8a 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,18 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
+/*
+ * A storage mark is a file whose existence indicates something about another
+ * file. Such a file is named "<filename>.<mark>", where the mark is one of
+ * the values of StorageMarks. Since ".<digit>" denotes a segment file, don't
+ * use digits for the mark character.
+ */
+typedef enum StorageMarks
+{
+ SMGR_MARK_NONE = 0,
+ SMGR_MARK_UNCOMMITTED = 'u' /* the file is not committed yet */
+} StorageMarks;
+
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
@@ -85,7 +97,12 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
extern void smgrclose(SMgrRelation reln);
extern void smgrcloseall(void);
extern void smgrclosenode(RelFileNodeBackend rnode);
+extern void smgrcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void smgrunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
--
2.27.0
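The smgr.h comment in the patch above describes mark files as "<filename>.<mark>", with digits reserved for segment suffixes. A minimal standalone sketch of that naming rule follows. The helper names mark_file_name and classify_suffix are hypothetical and do not appear in the patch; only the StorageMarks enum is copied from it.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Copied from the patch: the mark is a single non-digit character. */
typedef enum StorageMarks
{
	SMGR_MARK_NONE = 0,
	SMGR_MARK_UNCOMMITTED = 'u'
} StorageMarks;

/*
 * Hypothetical helper: write "<filename>.<mark>" into buf.
 * Returns false if the mark is unusable or the buffer is too small.
 */
static bool
mark_file_name(char *buf, size_t buflen, const char *filename, StorageMarks mark)
{
	int			n;

	/* digits are reserved for segment suffixes such as ".1" */
	if (mark == SMGR_MARK_NONE || isdigit((unsigned char) mark))
		return false;

	n = snprintf(buf, buflen, "%s.%c", filename, (char) mark);
	return n > 0 && (size_t) n < buflen;
}

/*
 * Hypothetical helper: classify the suffix after the last '.'.
 * Returns 's' for a segment suffix (all digits), 'm' for a one-character
 * non-digit mark, and 0 for anything else (including no suffix at all).
 */
static char
classify_suffix(const char *name)
{
	const char *dot = strrchr(name, '.');
	const char *p;

	if (dot == NULL || dot[1] == '\0')
		return 0;

	if (!isdigit((unsigned char) dot[1]))
		return dot[2] == '\0' ? 'm' : 0;

	for (p = dot + 1; *p != '\0'; p++)
		if (!isdigit((unsigned char) *p))
			return 0;
	return 's';
}
```

Keeping the mark a non-digit is what lets a directory scan tell the two kinds of suffix apart without any extra state. How the patch itself parses these names (parse_filename_for_nontemp_relation) is not reproduced here.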
v11-0002-New-command-ALTER-TABLE-ALL-IN-TABLESPACE-SET-LO.patch (text/x-patch; charset=us-ascii)
From 71cdfe083182d5cf6872571f48cf6ddd2602c043 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 23:21:09 +0900
Subject: [PATCH v11 2/2] New command ALTER TABLE ALL IN TABLESPACE SET
LOGGED/UNLOGGED
To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes
relation persistence of all tables in the specified tablespace.
---
src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++++++++++
src/backend/nodes/copyfuncs.c | 16 ++++
src/backend/nodes/equalfuncs.c | 15 ++++
src/backend/parser/gram.y | 20 +++++
src/backend/tcop/utility.c | 11 +++
src/include/commands/tablecmds.h | 2 +
src/include/nodes/nodes.h | 1 +
src/include/nodes/parsenodes.h | 9 ++
8 files changed, 214 insertions(+)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 451ed9adb1..4a42390928 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14508,6 +14508,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
return new_tablespaceoid;
}
+/*
+ * Alter Table ALL ... SET LOGGED/UNLOGGED
+ *
+ * Allows a user to change persistence of all objects in a given tablespace in
+ * the current database. Objects can be chosen based on the owner of the
+ * object also, to allow users to change persistence of only their objects. The
+ * main permissions handling is done by the lower-level change persistence
+ * function.
+ *
+ * All to-be-modified objects are locked first. If NOWAIT is specified and the
+ * lock can't be acquired then we ereport(ERROR).
+ */
+void
+AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt)
+{
+ List *relations = NIL;
+ ListCell *l;
+ ScanKeyData key[1];
+ Relation rel;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ Oid tablespaceoid;
+ List *role_oids = roleSpecsToIds(NIL);
+
+ /* Ensure we were not asked to change something we can't */
+ if (stmt->objtype != OBJECT_TABLE)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("only tables can be specified")));
+
+ /* Get the tablespace OID */
+ tablespaceoid = get_tablespace_oid(stmt->tablespacename, false);
+
+ /*
+ * Now that the checks are done, check if we should set either to
+ * InvalidOid because it is our database's default tablespace.
+ */
+ if (tablespaceoid == MyDatabaseTableSpace)
+ tablespaceoid = InvalidOid;
+
+ /*
+ * Walk the list of objects in the tablespace and pick them up. This will
+ * only find objects in our database, of course.
+ */
+ ScanKeyInit(&key[0],
+ Anum_pg_class_reltablespace,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(tablespaceoid));
+
+ rel = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(rel, 1, key);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relOid = relForm->oid;
+
+ /*
+ * Do not pick-up objects in pg_catalog as part of this, if an admin
+ * really wishes to do so, they can issue the individual ALTER
+ * commands directly.
+ *
+ * Also, explicitly avoid any shared tables, temp tables, or TOAST
+ * (TOAST will be changed with the main table).
+ */
+ if (IsCatalogNamespace(relForm->relnamespace) ||
+ relForm->relisshared ||
+ isAnyTempNamespace(relForm->relnamespace) ||
+ IsToastNamespace(relForm->relnamespace))
+ continue;
+
+ /* Only pick up the object type requested */
+ if (relForm->relkind != RELKIND_RELATION)
+ continue;
+
+ /* Check if we are only picking-up objects owned by certain roles */
+ if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner))
+ continue;
+
+ /*
+ * Handle permissions-checking here since we are locking the tables
+ * and also to avoid doing a bunch of work only to fail part-way. Note
+ * that permissions will also be checked by AlterTableInternal().
+ *
+ * Caller must be considered an owner on the table of which we're going
+ * to change persistence.
+ */
+ if (!pg_class_ownercheck(relOid, GetUserId()))
+ aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)),
+ NameStr(relForm->relname));
+
+ if (stmt->nowait &&
+ !ConditionalLockRelationOid(relOid, AccessExclusiveLock))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("aborting because lock on relation \"%s.%s\" is not available",
+ get_namespace_name(relForm->relnamespace),
+ NameStr(relForm->relname))));
+ else
+ LockRelationOid(relOid, AccessExclusiveLock);
+
+ /*
+ * Add to our list of objects of which we're going to change
+ * persistence.
+ */
+ relations = lappend_oid(relations, relOid);
+ }
+
+ table_endscan(scan);
+ table_close(rel, AccessShareLock);
+
+ if (relations == NIL)
+ ereport(NOTICE,
+ (errcode(ERRCODE_NO_DATA_FOUND),
+ errmsg("no matching relations in tablespace \"%s\" found",
+ tablespaceoid == InvalidOid ? "(database default)" :
+ get_tablespace_name(tablespaceoid))));
+
+ /*
+ * Everything is locked, loop through and change persistence of all of the
+ * relations.
+ */
+ foreach(l, relations)
+ {
+ List *cmds = NIL;
+ AlterTableCmd *cmd = makeNode(AlterTableCmd);
+
+ if (stmt->logged)
+ cmd->subtype = AT_SetLogged;
+ else
+ cmd->subtype = AT_SetUnLogged;
+
+ cmds = lappend(cmds, cmd);
+
+ EventTriggerAlterTableStart((Node *) stmt);
+ /* OID is set by AlterTableInternal */
+ AlterTableInternal(lfirst_oid(l), cmds, false);
+ EventTriggerAlterTableEnd();
+ }
+}
+
static void
index_copy_data(Relation rel, RelFileNode newrnode)
{
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index df0b747883..55e38cfe3f 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4269,6 +4269,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from)
return newnode;
}
+static AlterTableSetLoggedAllStmt *
+_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *from)
+{
+ AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt);
+
+ COPY_STRING_FIELD(tablespacename);
+ COPY_SCALAR_FIELD(objtype);
+ COPY_SCALAR_FIELD(logged);
+ COPY_SCALAR_FIELD(nowait);
+
+ return newnode;
+}
+
static CreateExtensionStmt *
_copyCreateExtensionStmt(const CreateExtensionStmt *from)
{
@@ -5622,6 +5635,9 @@ copyObjectImpl(const void *from)
case T_AlterTableMoveAllStmt:
retval = _copyAlterTableMoveAllStmt(from);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _copyAlterTableSetLoggedAllStmt(from);
+ break;
case T_CreateExtensionStmt:
retval = _copyCreateExtensionStmt(from);
break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index cb7ddd463c..a19b7874d7 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1916,6 +1916,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a,
return true;
}
+static bool
+_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *a,
+ const AlterTableSetLoggedAllStmt *b)
+{
+ COMPARE_STRING_FIELD(tablespacename);
+ COMPARE_SCALAR_FIELD(objtype);
+ COMPARE_SCALAR_FIELD(logged);
+ COMPARE_SCALAR_FIELD(nowait);
+
+ return true;
+}
+
static bool
_equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b)
{
@@ -3625,6 +3637,9 @@ equal(const void *a, const void *b)
case T_AlterTableMoveAllStmt:
retval = _equalAlterTableMoveAllStmt(a, b);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _equalAlterTableSetLoggedAllStmt(a, b);
+ break;
case T_CreateExtensionStmt:
retval = _equalCreateExtensionStmt(a, b);
break;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 3d4dd43e47..9823d57a54 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -1984,6 +1984,26 @@ AlterTableStmt:
n->nowait = $13;
$$ = (Node *)n;
}
+ | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = true;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = false;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
| ALTER INDEX qualified_name alter_table_cmds
{
AlterTableStmt *n = makeNode(AlterTableStmt);
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 1fbc387d47..1483f9a475 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -162,6 +162,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_AlterTSConfigurationStmt:
case T_AlterTSDictionaryStmt:
case T_AlterTableMoveAllStmt:
+ case T_AlterTableSetLoggedAllStmt:
case T_AlterTableSpaceOptionsStmt:
case T_AlterTableStmt:
case T_AlterTypeStmt:
@@ -1747,6 +1748,12 @@ ProcessUtilitySlow(ParseState *pstate,
commandCollected = true;
break;
+ case T_AlterTableSetLoggedAllStmt:
+ AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree);
+ /* commands are stashed in AlterTableSetLoggedAll */
+ commandCollected = true;
+ break;
+
case T_DropStmt:
ExecDropStmt((DropStmt *) parsetree, isTopLevel);
/* no commands stashed for DROP */
@@ -2669,6 +2676,10 @@ CreateCommandTag(Node *parsetree)
tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype);
+ break;
+
case T_AlterTableStmt:
tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype);
break;
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index 336549cc5f..714077ff4c 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
+extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt);
+
extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
Oid *oldschema);
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 7c657c1241..8860b2e548 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -428,6 +428,7 @@ typedef enum NodeTag
T_AlterCollationStmt,
T_CallStmt,
T_AlterStatsStmt,
+ T_AlterTableSetLoggedAllStmt,
/*
* TAGS FOR PARSE TREE NODES (parsenodes.h)
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 4c5a8a39bf..c3e1bc66d1 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2350,6 +2350,15 @@ typedef struct AlterTableMoveAllStmt
bool nowait;
} AlterTableMoveAllStmt;
+typedef struct AlterTableSetLoggedAllStmt
+{
+ NodeTag type;
+ char *tablespacename;
+ ObjectType objtype; /* Object type to move */
+ bool logged;
+ bool nowait;
+} AlterTableSetLoggedAllStmt;
+
/* ----------------------
* Create/Alter Extension Statements
* ----------------------
--
2.27.0
Hi Kyotaro,
At Tue, 21 Dec 2021 13:07:28 +0000, Jakub Wartak
<Jakub.Wartak@tomtom.com> wrote in
So what's suspicious is that 122880 -> 0 file size truncation. I've
investigated WAL and it seems to contain TRUNCATE records after logged
FPI images, so when the crash recovery would kick in it probably clears this
table (while it shouldn't).
Darn.. It is too silly that I wrongly issued truncate records for the target
relation of the function (rel) instead of the relation on which we're currently
operating at that time (r).
[..]
The following fix works.
Cool, I have verified the basic stuff that came to mind; now even cfbot is happy with v11. You should be happy too, I hope :)
I made another change in this version. Previously only btree among all index
AMs was processed in the in-place manner. In this version we do that for all
AMs except GiST. Maybe if gistGetFakeLSN behaved the same way for
permanent and unlogged indexes, we could skip index rebuild in exchange for
some extra WAL records emitted while it is unlogged.
I think there's a slight omission:
-- unlogged table -> logged with GiST:
DROP TABLE IF EXISTS testcase;
CREATE UNLOGGED TABLE testcase(geom geometry not null);
CREATE INDEX idx_testcase_gist ON testcase USING gist(geom);
INSERT INTO testcase(geom) SELECT ST_Buffer(ST_SetSRID(ST_MakePoint(-1.0, 2.0),4326), 0.0001);
ALTER TABLE testcase SET LOGGED;
-- crashes with:
(gdb) where
#0 reindex_index (indexId=indexId@entry=65541, skip_constraint_checks=skip_constraint_checks@entry=true, persistence=persistence@entry=112 'p', params=params@entry=0x0) at index.c:3521
#1 0x000000000062f494 in RelationChangePersistence (tab=tab@entry=0x1947258, persistence=112 'p', lockmode=lockmode@entry=8) at tablecmds.c:5434
#2 0x0000000000642819 in ATRewriteTables (context=0x7ffc19c04520, lockmode=<optimized out>, wqueue=0x7ffc19c04388, parsetree=0x1925ec8) at tablecmds.c:5644
[..]
#10 0x00000000007f078f in exec_simple_query (query_string=0x1925340 "ALTER TABLE testcase SET LOGGED;") at postgres.c:1215
Apparently the params argument of reindex_index() cannot be NULL. The same happens when switching
a persistent table to an unlogged one too (with GiST).
I'll also try to give another shot to the patch early next year - as we are starting long Christmas/holiday break here
- with verifying WAL for GiST and more advanced setup (more crashes, and standby/archiving/barman to see
how it's possible to use wal_level=minimal <-> replica transitions).
-J.
At Wed, 22 Dec 2021 08:42:14 +0000, Jakub Wartak <Jakub.Wartak@tomtom.com> wrote in
I think there's slight omission:
...
apparently reindex_index() params cannot be NULL - the same happens with switching persistent
Hmm. a3dc926009 has changed the interface. (But the name is also
changed after that.)
-reindex_relation(Oid relid, int flags, int options)
+reindex_relation(Oid relid, int flags, ReindexParams *params)
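The crash reported above comes from a NULL pointer reaching a function that now takes its options through a struct pointer (the backtrace shows params=0x0 in reindex_index). A minimal standalone sketch of the usual remedy, passing the address of a zero-initialized struct instead of NULL, is below. ReindexParamsSketch, reindex_sketch and change_persistence_sketch are hypothetical stand-ins, not the real PostgreSQL definitions, and this is not necessarily how the next patch version fixes it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for ReindexParams; the real struct differs. */
typedef struct ReindexParamsSketch
{
	uint32_t	options;		/* bitmask of options; 0 means none set */
} ReindexParamsSketch;

/* Hypothetical callee: reads *params unconditionally, so NULL is invalid. */
static uint32_t
reindex_sketch(const ReindexParamsSketch *params)
{
	assert(params != NULL);
	return params->options;
}

/* Caller side: hand over a zero-initialized struct rather than NULL. */
static uint32_t
change_persistence_sketch(void)
{
	ReindexParamsSketch params = {0};	/* all options off */

	return reindex_sketch(&params);
}
```

With the old integer "options" argument, passing 0 meant "no options"; after the switch to a pointer, a literal 0 silently becomes a NULL pointer, which is why the call site has to be updated rather than just recompiled.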
I'll also try to give another shot to the patch early next year - as we are starting long Christmas/holiday break here
- with verifying WAL for GiST and more advanced setup (more crashes, and standby/archiving/barman to see
how it's possible to use wal_level=minimal <-> replica transitions).
Thanks. I added a TAP test to exercise the in-place persistence change.
Have a nice holiday, Jakub!
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v12-0001-In-place-table-persistence-change.patch (text/x-patch; charset=us-ascii)
From 19a8bb14dc863981e33ce8a10ecb9b87a4aa3937 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v12 1/2] In-place table persistence change
Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs a heap rewrite, which causes a large amount of
file I/O. This patch makes the command run without a heap rewrite.
In addition, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of a massive number of HEAP_INSERTs, which should be
smaller.
This also allows for the cleanup of files left behind when the
transaction that created them crashes.
---
src/backend/access/rmgrdesc/smgrdesc.c | 52 ++
src/backend/access/transam/README | 8 +
src/backend/access/transam/xact.c | 7 +
src/backend/access/transam/xlog.c | 17 +
src/backend/catalog/storage.c | 545 +++++++++++++++++-
src/backend/commands/tablecmds.c | 266 +++++++--
src/backend/replication/basebackup.c | 3 +-
src/backend/storage/buffer/bufmgr.c | 88 +++
src/backend/storage/file/fd.c | 4 +-
src/backend/storage/file/reinit.c | 344 +++++++----
src/backend/storage/smgr/md.c | 93 ++-
src/backend/storage/smgr/smgr.c | 32 +
src/backend/storage/sync/sync.c | 20 +-
src/bin/pg_rewind/parsexlog.c | 24 +
src/common/relpath.c | 47 +-
src/include/catalog/storage.h | 3 +
src/include/catalog/storage_xlog.h | 42 +-
src/include/common/relpath.h | 9 +-
src/include/storage/bufmgr.h | 2 +
src/include/storage/fd.h | 1 +
src/include/storage/md.h | 8 +-
src/include/storage/reinit.h | 10 +-
src/include/storage/smgr.h | 17 +
src/test/recovery/t/027_persistence_change.pl | 268 +++++++++
24 files changed, 1728 insertions(+), 182 deletions(-)
create mode 100644 src/test/recovery/t/027_persistence_change.pl
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 7755553d57..d251f22207 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,49 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+ char *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+ appendStringInfoString(buf, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+ char *path = GetRelationPath(xlrec->rnode.dbNode,
+ xlrec->rnode.spcNode,
+ xlrec->rnode.relNode,
+ InvalidBackendId,
+ xlrec->forkNum, xlrec->mark);
+ char *action;
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ action = "CREATE";
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ action = "DELETE";
+ break;
+ default:
+ action = "<unknown action>";
+ break;
+ }
+
+ appendStringInfo(buf, "%s %s", action, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+ char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+ appendStringInfoString(buf, path);
+ appendStringInfo(buf, " persistence %d", xlrec->persistence);
+ pfree(path);
+ }
}
const char *
@@ -55,6 +98,15 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_UNLINK:
+ id = "UNLINK";
+ break;
+ case XLOG_SMGR_MARK:
+ id = "MARK";
+ break;
+ case XLOG_SMGR_BUFPERSISTENCE:
+ id = "BUFPERSISTENCE";
+ break;
}
return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c1..b344bbe511 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -724,6 +724,14 @@ we must panic and abort recovery. The DBA will have to manually clean up
then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.
+The Smgr MARK files
+--------------------------------
+
+An smgr mark file is created when a new relation file is created, to
+mark that the relfilenode needs to be cleaned up at recovery time. In
+contrast to the four actions above, failure to remove smgr mark files
+will lead to data loss, in which case the server will shut down.
+
Skipping WAL for New RelFileNode
--------------------------------
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e7b0bc804d..b41186d6d8 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2197,6 +2197,9 @@ CommitTransaction(void)
*/
smgrDoPendingSyncs(true, is_parallel_worker);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2447,6 +2450,9 @@ PrepareTransaction(void)
*/
smgrDoPendingSyncs(true, false);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2772,6 +2778,7 @@ AbortTransaction(void)
AfterTriggerEndXact(false); /* 'false' means it's abort */
AtAbort_Portals();
smgrDoPendingSyncs(false, is_parallel_worker);
+ smgrDoPendingCleanups(false);
AtEOXact_LargeObject(false);
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1e1fbe957f..59f4c2eacf 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -40,6 +40,7 @@
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
#include "catalog/pg_database.h"
+#include "catalog/storage.h"
#include "commands/progress.h"
#include "commands/tablespace.h"
#include "common/controldata_utils.h"
@@ -4564,6 +4565,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
{
ereport(DEBUG1,
(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+
+ /* cleanup garbage files left during crash recovery */
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
InArchiveRecovery = true;
if (StandbyModeRequested)
StandbyMode = true;
@@ -7824,6 +7833,14 @@ StartupXLOG(void)
}
}
+ /* cleanup garbage files left during crash recovery */
+ if (!InArchiveRecovery)
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
/* Allow resource managers to do any required cleanup. */
for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
{
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index c5ad28d71f..d6b30387e9 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,6 +19,7 @@
#include "postgres.h"
+#include "access/amapi.h"
#include "access/parallel.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -66,6 +67,23 @@ typedef struct PendingRelDelete
struct PendingRelDelete *next; /* linked-list link */
} PendingRelDelete;
+#define PCOP_UNLINK_FORK (1 << 0)
+#define PCOP_UNLINK_MARK (1 << 1)
+#define PCOP_SET_PERSISTENCE (1 << 2)
+
+typedef struct PendingCleanup
+{
+ RelFileNode relnode; /* relation that may need to be deleted */
+ int op; /* operation mask */
+ bool bufpersistence; /* buffer persistence to set */
+ int unlink_forknum; /* forknum to unlink */
+ StorageMarks unlink_mark; /* mark to unlink */
+ BackendId backend; /* InvalidBackendId if not a temp rel */
+ bool atCommit; /* T=delete at commit; F=delete at abort */
+ int nestLevel; /* xact nesting level of request */
+ struct PendingCleanup *next; /* linked-list link */
+} PendingCleanup;
+
typedef struct PendingRelSync
{
RelFileNode rnode;
@@ -73,6 +91,7 @@ typedef struct PendingRelSync
} PendingRelSync;
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingCleanup *pendingCleanups = NULL; /* head of linked list */
HTAB *pendingSyncHash = NULL;
@@ -117,7 +136,8 @@ AddPendingSync(const RelFileNode *rnode)
SMgrRelation
RelationCreateStorage(RelFileNode rnode, char relpersistence)
{
- PendingRelDelete *pending;
+ PendingRelDelete *pendingdel;
+ PendingCleanup *pendingclean;
SMgrRelation srel;
BackendId backend;
bool needs_wal;
@@ -143,21 +163,41 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return NULL; /* placate compiler */
}
+ /*
+ * We are going to create a new storage file. If the server crashes before the
+ * current transaction ends, the file needs to be cleaned up. The
+ * SMGR_MARK_UNCOMMITTED mark file prompts that work at the next startup.
+ */
srel = smgropen(rnode, backend);
+ log_smgrcreatemark(&rnode, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
smgrcreate(srel, MAIN_FORKNUM, false);
if (needs_wal)
log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM);
/* Add the relation to the list of stuff to delete at abort */
- pending = (PendingRelDelete *)
+ pendingdel = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
- pending->relnode = rnode;
- pending->backend = backend;
- pending->atCommit = false; /* delete if abort */
- pending->nestLevel = GetCurrentTransactionNestLevel();
- pending->next = pendingDeletes;
- pendingDeletes = pending;
+ pendingdel->relnode = rnode;
+ pendingdel->backend = backend;
+ pendingdel->atCommit = false; /* delete if abort */
+ pendingdel->nestLevel = GetCurrentTransactionNestLevel();
+ pendingdel->next = pendingDeletes;
+ pendingDeletes = pendingdel;
+
+ /* drop mark files at commit */
+ pendingclean = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pendingclean->relnode = rnode;
+ pendingclean->op = PCOP_UNLINK_MARK;
+ pendingclean->unlink_forknum = MAIN_FORKNUM;
+ pendingclean->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pendingclean->backend = backend;
+ pendingclean->atCommit = true;
+ pendingclean->nestLevel = GetCurrentTransactionNestLevel();
+ pendingclean->next = pendingCleanups;
+ pendingCleanups = pendingclean;
if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
{
@@ -168,6 +208,203 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return srel;
}
+/*
+ * RelationCreateInitFork
+ * Create physical storage for the init fork of a relation.
+ *
+ * Create the init fork for the relation.
+ *
+ * This function is transactional. The creation is WAL-logged, and if the
+ * transaction aborts later on, the init fork will be removed.
+ */
+void
+RelationCreateInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ SMgrRelation srel;
+ bool create = true;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), false, false);
+
+ /*
+ * If we have entries for init-fork operations on this relation, that means
+ * that we have already registered pending delete entries to drop an
+ * init-fork preexisting since before the current transaction started. This
+ * function reverts that change just by removing the entries.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ create = false;
+ }
+ else
+ prev = pending;
+ }
+
+ if (!create)
+ return;
+
+ /*
+ * We are going to create an init fork. If the server crashes before the
+ * current transaction ends, the init fork left behind would corrupt data
+ * during recovery. The mark file works as the sentinel to identify that
+ * situation.
+ */
+ srel = smgropen(rnode, InvalidBackendId);
+ log_smgrcreatemark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+
+ /* We don't have existing init fork, create it. */
+ smgrcreate(srel, INIT_FORKNUM, false);
+
+ /*
+ * An index init fork needs further initialization. ambuildempty should do
+ * the WAL-logging and file sync by itself; otherwise we do that ourselves.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rnode, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /* drop the init fork, mark file and revert persistence at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_UNLINK_FORK | PCOP_UNLINK_MARK | PCOP_SET_PERSISTENCE;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->bufpersistence = true;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* drop mark file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
+/*
+ * RelationDropInitFork
+ * Delete physical storage for the init fork of a relation.
+ *
+ * Register pending-delete of the init fork. The real deletion is performed by
+ * smgrDoPendingDeletes at commit.
+ *
+ * This function is transactional. If the transaction aborts later on, the
+ * deletion doesn't happen.
+ */
+void
+RelationDropInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ bool inxact_created = false;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), true, false);
+
+ /*
+ * If we have entries for init-fork operations of this relation, that means
+ * that we have created the init fork in the current transaction. We
+ * remove the init fork and mark file immediately in that case. Otherwise
+ * just register pending-delete for the existing init fork.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ /* unlink list entry */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ inxact_created = true;
+ }
+ else
+ prev = pending;
+ }
+
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rnode, InvalidBackendId);
+
+ /*
+ * INIT forks are never loaded into shared buffers, so there is no point
+ * in dropping buffers for such files.
+ */
+ log_smgrunlinkmark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrunlinkmark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+ log_smgrunlink(&rnode, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ return;
+ }
+
+ /* register drop of this init fork file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_UNLINK_FORK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = false;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -187,6 +424,88 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum)
+{
+ xl_smgr_unlink xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file unlink.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_CREATEMARK record to WAL.
+ */
+void
+log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the mark file creation.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_CREATE;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINKMARK record to WAL.
+ */
+void
+log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the mark file removal.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_UNLINK;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileNode *rnode, bool persistence)
+{
+ xl_smgr_bufpersistence xlrec;
+
+ /*
+ * Make an XLOG entry reporting the change of buffer persistence.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.persistence = persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -255,6 +574,7 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)
prev->next = next;
else
pendingDeletes = next;
+
pfree(pending);
/* prev does not change */
}
@@ -673,6 +993,88 @@ smgrDoPendingDeletes(bool isCommit)
}
}
+/*
+ * smgrDoPendingCleanups() -- Clean up work that emits WAL records
+ *
+ * The operations handled in this function emit WAL records, which must be
+ * emitted before the commit record for the current transaction.
+ */
+void
+smgrDoPendingCleanups(bool isCommit)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (pending->nestLevel < nestLevel)
+ {
+ /* outer-level entries should not be processed yet */
+ prev = pending;
+ }
+ else
+ {
+ /* unlink list entry first, so we don't retry on failure */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ /* do cleanup if called for */
+ if (pending->atCommit == isCommit)
+ {
+ SMgrRelation srel;
+
+ srel = smgropen(pending->relnode, pending->backend);
+
+ Assert ((pending->op &
+ ~(PCOP_UNLINK_FORK | PCOP_UNLINK_MARK |
+ PCOP_SET_PERSISTENCE)) == 0);
+
+ if (pending->op & PCOP_UNLINK_FORK)
+ {
+ /* other forks would need their buffers dropped */
+ Assert(pending->unlink_forknum == INIT_FORKNUM);
+
+ /* Don't emit WAL during recovery. */
+ if (!InRecovery)
+ log_smgrunlink(&pending->relnode,
+ pending->unlink_forknum);
+ smgrunlink(srel, pending->unlink_forknum, false);
+ }
+
+ if (pending->op & PCOP_UNLINK_MARK)
+ {
+ SMgrRelation srel;
+
+ if (!InRecovery)
+ log_smgrunlinkmark(&pending->relnode,
+ pending->unlink_forknum,
+ pending->unlink_mark);
+ srel = smgropen(pending->relnode, pending->backend);
+ smgrunlinkmark(srel, pending->unlink_forknum,
+ pending->unlink_mark, InRecovery);
+ smgrclose(srel);
+ }
+
+ if (pending->op & PCOP_SET_PERSISTENCE)
+ {
+ SetRelationBuffersPersistence(srel, pending->bufpersistence,
+ InRecovery);
+ }
+ }
+
+ /* must explicitly free the list entry */
+ pfree(pending);
+ /* prev does not change */
+ }
+ }
+}
+
/*
* smgrDoPendingSyncs() -- Take care of relation syncs at end of xact.
*/
@@ -933,6 +1335,15 @@ smgr_redo(XLogReaderState *record)
reln = smgropen(xlrec->rnode, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ smgrunlink(reln, xlrec->forkNum, true);
+ smgrclose(reln);
+ }
else if (info == XLOG_SMGR_TRUNCATE)
{
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
@@ -1021,6 +1432,124 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ bool created = false;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ smgrcreatemark(reln, xlrec->forkNum, xlrec->mark, true);
+ created = true;
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ smgrunlinkmark(reln, xlrec->forkNum, xlrec->mark, true);
+ break;
+ default:
+ elog(ERROR, "unknown smgr_mark action \"%c\"", xlrec->action);
+ }
+
+ if (created)
+ {
+ /* revert mark file operation at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = xlrec->rnode;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = xlrec->forkNum;
+ pending->unlink_mark = xlrec->mark;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ else
+ {
+ /*
+ * Delete pending action for this mark file if any. We should have
+ * at most one entry for this action.
+ */
+ PendingCleanup *prev = NULL;
+
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileNodeEquals(xlrec->rnode, pending->relnode) &&
+ pending->unlink_forknum == xlrec->forkNum &&
+ (pending->op & PCOP_UNLINK_MARK) != 0)
+ {
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+ }
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ PendingCleanup *prev = NULL;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+
+ /*
+ * Delete pending action for persistence change if any. We should have
+ * at most one entry for this action.
+ */
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileNodeEquals(xlrec->rnode, pending->relnode) &&
+ (pending->op & PCOP_SET_PERSISTENCE) != 0)
+ {
+ Assert (pending->bufpersistence == xlrec->persistence);
+
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+
+ /*
+ * Revert buffer-persistence changes at abort if the relation is going
+ * to different persistence from before this transaction.
+ */
+ if (!pending)
+ {
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = xlrec->rnode;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = !xlrec->persistence;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 45e59e3d5c..41e77e1072 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -52,6 +52,7 @@
#include "commands/defrem.h"
#include "commands/event_trigger.h"
#include "commands/policy.h"
+#include "commands/progress.h"
#include "commands/sequence.h"
#include "commands/tablecmds.h"
#include "commands/tablespace.h"
@@ -5329,6 +5330,187 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
return newcmd;
}
+/*
+ * RelationChangePersistence: do in-place persistence change of a relation
+ */
+static void
+RelationChangePersistence(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ int i;
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * If any of the following were set, we would need to call ATRewriteTable;
+ * that cannot happen in the AT_REWRITE_ALTER_PERSISTENCE case.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ /*
+ * First we collect all relations whose persistence we need to change.
+ */
+
+ /* Collect OIDs of indexes and toast relations */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ /* Add toast relation if any */
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ List *toastidx;
+ Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
+
+ relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
+ toastidx = RelationGetIndexList(toastrel);
+ relids = list_concat(relids, toastidx);
+ pfree(toastidx);
+ table_close(toastrel, NoLock);
+ }
+
+ table_close(rel, NoLock);
+
+ /* Make changes in storage */
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach (lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+
+ /*
+ * XXXX: Some access methods cannot tolerate an in-place persistence
+ * change. Specifically, GiST uses page LSNs to figure out whether a
+ * block has changed, where UNLOGGED GiST indexes use fake LSNs that
+ * are incompatible with real LSNs used for LOGGED ones.
+ *
+ * Maybe if gistGetFakeLSN behaved the same way for permanent and
+ * unlogged indexes, we could skip index rebuild in exchange for some
+ * extra WAL records emitted while it is unlogged.
+ *
+ * Check relam against a positive list so that we also take this path
+ * for unknown AMs.
+ */
+ if (r->rd_rel->relkind == RELKIND_INDEX &&
+ /* GiST is excluded */
+ r->rd_rel->relam != BTREE_AM_OID &&
+ r->rd_rel->relam != HASH_AM_OID &&
+ r->rd_rel->relam != GIN_AM_OID &&
+ r->rd_rel->relam != SPGIST_AM_OID &&
+ r->rd_rel->relam != BRIN_AM_OID)
+ {
+ int reindex_flags;
+ ReindexParams params = {0};
+
+ /* reindex doesn't allow concurrent use of the index */
+ table_close(r, NoLock);
+
+ reindex_flags =
+ REINDEX_REL_SUPPRESS_INDEX_USE |
+ REINDEX_REL_CHECK_CONSTRAINTS;
+
+ /* Set the same persistence as the parent relation. */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+
+ reindex_index(reloid, reindex_flags, persistence, &params);
+
+ continue;
+ }
+
+ /* Create or drop init fork */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ RelationCreateInitFork(r);
+ else
+ RelationDropInitFork(r);
+
+ /*
+ * When this relation gets WAL-logged, immediately sync all files but
+ * initfork to establish the initial state on storage. Buffers have
+ * already been flushed out by RelationCreate(Drop)InitFork called just
+ * above. Initfork should have been synced as needed.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT)
+ {
+ for (i = 0 ; i < INIT_FORKNUM ; i++)
+ {
+ if (smgrexists(RelationGetSmgr(r), i))
+ smgrimmedsync(RelationGetSmgr(r), i);
+ }
+ }
+
+ /* Update catalog */
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+
+ /*
+ * While wal_level >= replica, switching to LOGGED requires the
+ * relation content to be WAL-logged to recover the table.
+ * We don't emit this while wal_level = minimal.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded())
+ {
+ ForkNumber fork;
+ xl_smgr_truncate xlrec;
+
+ xlrec.blkno = 0;
+ xlrec.rnode = r->rd_node;
+ xlrec.flags = SMGR_TRUNCATE_ALL;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ for (fork = 0; fork < INIT_FORKNUM ; fork++)
+ {
+ if (smgrexists(RelationGetSmgr(r), fork))
+ log_newpage_range(r, fork, 0,
+ smgrnblocks(RelationGetSmgr(r), fork),
+ false);
+ }
+ }
+
+ table_close(r, NoLock);
+ }
+
+ table_close(classRel, NoLock);
+}
+
/*
* ATRewriteTables: ALTER TABLE phase 3
*/
@@ -5459,47 +5641,55 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- /*
- * Create transient table that will receive the modified data.
- *
- * Ensure it is marked correctly as logged or unlogged. We have
- * to do this here so that buffers for the new relfilenode will
- * have the right persistence set, and at the same time ensure
- * that the original filenode's buffers will get read in with the
- * correct setting (i.e. the original one). Otherwise a rollback
- * after the rewrite would possibly result with buffers for the
- * original filenode having the wrong persistence setting.
- *
- * NB: This relies on swap_relation_files() also swapping the
- * persistence. That wouldn't work for pg_class, but that can't be
- * unlogged anyway.
- */
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
- persistence, lockmode);
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
+ RelationChangePersistence(tab, persistence, lockmode);
+ else
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new relfilenode
+ * will have the right persistence set, and at the same time
+ * ensure that the original filenode's buffers will get read in
+ * with the correct setting (i.e. the original one). Otherwise
+ * a rollback after the rewrite would possibly result with
+ * buffers for the original filenode having the wrong
+ * persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that can't
+ * be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace,
+ NewAccessMethod,
+ persistence, lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
+ /*
+ * Swap the physical files of the old and new heaps, then
+ * rebuild indexes and discard the old heap. We can use
+ * RecentXmin for the table's new relfrozenxid because we
+ * rewrote all the tuples in ATRewriteTable, so no older Xid
+ * remains in the table. Also, we never try to swap toast
+ * tables by content, since we have no interest in letting this
+ * code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
- InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ }
}
else
{
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index ec0485705d..45e1a5d817 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1070,6 +1070,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
bool excludeFound;
ForkNumber relForkNum; /* Type of fork if file is a relation */
int relOidChars; /* Chars in filename that are the rel oid */
+ StorageMarks mark;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1120,7 +1121,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
/* Exclude all forks for unlogged tables except the init fork */
if (isDbDir &&
parse_filename_for_nontemp_relation(de->d_name, &relOidChars,
- &relForkNum))
+ &relForkNum, &mark))
{
/* Never exclude init forks */
if (relForkNum != INIT_FORKNUM)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index b4532948d3..dab74bf99a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -37,6 +37,7 @@
#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
#include "executor/instrument.h"
#include "lib/binaryheap.h"
#include "miscadmin.h"
@@ -3154,6 +3155,93 @@ DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
}
}
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a relation
+ * and, when switching to PERMANENT, writes all dirty pages of the relation
+ * out to disk (or more accurately, out to kernel disk buffers),
+ * ensuring that the kernel has an up-to-date view of the relation.
+ *
+ * Generally, the caller should be holding AccessExclusiveLock on the
+ * target relation to ensure that no other backend is busy dirtying
+ * more blocks of the relation; the effects can't be expected to last
+ * after the lock is released.
+ *
+ * XXX currently it sequentially searches the buffer pool, should be
+ * changed to more clever ways of searching. This routine is not
+ * used in any performance-critical code paths, so it's not worth
+ * adding additional overhead to normal paths to make it go faster;
+ * but see also DropRelFileNodeBuffers.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
+{
+ int i;
+ RelFileNodeBackend rnode = srel->smgr_rnode;
+
+ Assert (!RelFileNodeBackendIsTemp(rnode));
+
+ if (!isRedo)
+ log_smgrbufpersistence(&srel->smgr_rnode.node, permanent);
+
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ {
+ UnlockBufHdr(bufHdr, buf_state);
+ continue;
+ }
+
+ if (permanent)
+ {
+ /* Init fork is being dropped, drop buffers for it. */
+ if (bufHdr->tag.forkNum == INIT_FORKNUM)
+ {
+ InvalidateBuffer(bufHdr);
+ continue;
+ }
+
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /* we flush this buffer when switching to PERMANENT */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, srel);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr, true);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ /* init fork is always BM_PERMANENT. See BufferAlloc */
+ if (bufHdr->tag.forkNum != INIT_FORKNUM)
+ buf_state &= ~BM_PERMANENT;
+
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ }
+}
+
/* ---------------------------------------------------------------------
* DropRelFileNodesAllBuffers
*
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 263057841d..8487ae1f02 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -349,8 +349,6 @@ static void pre_sync_fname(const char *fname, bool isdir, int elevel);
static void datadir_fsync_fname(const char *fname, bool isdir, int elevel);
static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel);
-static int fsync_parent_path(const char *fname, int elevel);
-
/*
* pg_fsync --- do fsync with or without writethrough
@@ -3759,7 +3757,7 @@ fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel)
* This is aimed at making file operations persistent on disk in case of
* an OS crash or power failure.
*/
-static int
+int
fsync_parent_path(const char *fname, int elevel)
{
char parentpath[MAXPGPATH];
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 0ae3fb6902..0137902bb2 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -16,29 +16,49 @@
#include <unistd.h>
+#include "access/xlog.h"
+#include "catalog/pg_tablespace_d.h"
#include "common/relpath.h"
#include "postmaster/startup.h"
+#include "storage/bufmgr.h"
#include "storage/copydir.h"
#include "storage/fd.h"
+#include "storage/md.h"
#include "storage/reinit.h"
+#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
- int op);
+ Oid tspid, int op);
static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
- int op);
+ Oid tspid, Oid dbid, int op);
typedef struct
{
Oid reloid; /* hash key */
-} unlogged_relation_entry;
+ bool has_init; /* has INIT fork */
+ bool dirty_init; /* needs to remove INIT fork */
+ bool dirty_all; /* needs to remove all forks */
+} relfile_entry;
/*
- * Reset unlogged relations from before the last restart.
+ * Clean up and reset relation files from before the last restart.
*
- * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any
- * relation with an "init" fork, except for the "init" fork itself.
+ * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations
+ * depending on the existence of the "cleanup" forks.
+ *
+ * If SMGR_MARK_UNCOMMITTED mark file for init fork is present, we remove the
+ * init fork along with the mark file.
+ *
+ * If SMGR_MARK_UNCOMMITTED mark file for main fork is present, we remove the
+ * whole relation along with the mark file.
+ *
+ * Otherwise, if the "init" fork is found, we remove all forks of any relation
+ * with the "init" fork, except for the "init" fork itself.
+ *
+ * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all
+ * relations that have the "cleanup" and/or the "init" forks.
*
* If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main
* fork.
@@ -72,7 +92,7 @@ ResetUnloggedRelations(int op)
/*
* First process unlogged files in pg_default ($PGDATA/base)
*/
- ResetUnloggedRelationsInTablespaceDir("base", op);
+ ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op);
/*
* Cycle through directories for all non-default tablespaces.
@@ -81,13 +101,19 @@ ResetUnloggedRelations(int op)
while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
{
+ Oid tspid;
+
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
continue;
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
- ResetUnloggedRelationsInTablespaceDir(temp_path, op);
+
+ tspid = atooid(spc_de->d_name);
+
+ Assert(tspid != 0);
+ ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op);
}
FreeDir(spc_dir);
@@ -103,7 +129,8 @@ ResetUnloggedRelations(int op)
* Process one tablespace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
+ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
+ Oid tspid, int op)
{
DIR *ts_dir;
struct dirent *de;
@@ -130,6 +157,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
while ((de = ReadDir(ts_dir, tsdirname)) != NULL)
{
+ Oid dbid;
+
/*
* We're only interested in the per-database directories, which have
* numeric names. Note that this code will also (properly) ignore "."
@@ -148,7 +177,10 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
ereport_startup_progress("resetting unlogged relations (cleanup), elapsed time: %ld.%02d s, current path: %s",
dbspace_path);
- ResetUnloggedRelationsInDbspaceDir(dbspace_path, op);
+ dbid = atooid(de->d_name);
+ Assert(dbid != 0);
+
+ ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op);
}
FreeDir(ts_dir);
@@ -158,125 +190,227 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
* Process one per-dbspace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
+ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
+ Oid tspid, Oid dbid, int op)
{
DIR *dbspace_dir;
struct dirent *de;
char rm_path[MAXPGPATH * 2];
+ HTAB *hash;
+ HASHCTL ctl;
/* Caller must specify at least one operation. */
- Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0);
+ Assert((op & (UNLOGGED_RELATION_CLEANUP |
+ UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_INIT)) != 0);
/*
* Cleanup is a two-pass operation. First, we go through and identify all
* the files with init forks. Then, we go through again and nuke
* everything with the same OID except the init fork.
*/
+
+ /*
+ * It's possible that someone could create tons of unlogged relations in
+ * the same database & tablespace, so we'd better use a hash table rather
+ * than an array or linked list to keep track of which files need to be
+ * reset. Otherwise, this cleanup operation would be O(n^2).
+ */
+ memset(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(relfile_entry);
+ hash = hash_create("relfilenode cleanup hash",
+ 32, &ctl, HASH_ELEM | HASH_BLOBS);
+
+ /* Collect INIT fork and mark files in the directory. */
+ dbspace_dir = AllocateDir(dbspacedirname);
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
+ {
+ int oidchars;
+ ForkNumber forkNum;
+ StorageMarks mark;
+
+ /* Skip anything that doesn't look like a relation data file. */
+ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
+ &forkNum, &mark))
+ continue;
+
+ if (forkNum == INIT_FORKNUM || mark == SMGR_MARK_UNCOMMITTED)
+ {
+ Oid key;
+ relfile_entry *ent;
+ bool found;
+
+ /*
+ * Record the relfilenode information. If it has
+ * SMGR_MARK_UNCOMMITTED mark files, the relfilenode is in dirty
+ * state, where clean up is needed.
+ */
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_ENTER, &found);
+
+ if (!found)
+ {
+ ent->has_init = false;
+ ent->dirty_init = false;
+ ent->dirty_all = false;
+ }
+
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_init = true;
+ else if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_all = true;
+ else
+ {
+ Assert(forkNum == INIT_FORKNUM);
+ ent->has_init = true;
+ }
+ }
+ }
+
+ /* Done with the first pass. */
+ FreeDir(dbspace_dir);
+
+ /* nothing to do if we have neither init nor cleanup forks */
+ if (hash_get_num_entries(hash) < 1)
+ {
+ hash_destroy(hash);
+ return;
+ }
+
+ if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0)
+ {
+ /*
+ * When we come here after recovery, an smgr object for this file might
+ * have been created. In that case we need to drop all buffers and then
+ * the smgr object before initializing the unlogged relation. This is safe
+ * as long as no other backends have accessed the relation before
+ * starting archive recovery.
+ */
+ HASH_SEQ_STATUS status;
+ relfile_entry *ent;
+ SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8);
+ int maxrels = 8;
+ int nrels = 0;
+ int i;
+
+ Assert(!HotStandbyActive());
+
+ hash_seq_init(&status, hash);
+ while((ent = (relfile_entry *) hash_seq_search(&status)) != NULL)
+ {
+ RelFileNodeBackend rel;
+
+ /*
+ * The relation is persistent and remains persistent. Don't
+ * drop the buffers for this relation.
+ */
+ if (ent->has_init && ent->dirty_init)
+ continue;
+
+ if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = ent->reloid;
+
+ srels[nrels++] = smgropen(rel.node, InvalidBackendId);
+ }
+
+ DropRelFileNodesAllBuffers(srels, nrels);
+
+ for (i = 0 ; i < nrels ; i++)
+ smgrclose(srels[i]);
+ }
+
+ /*
+ * Now, make a second pass and remove anything that matches.
+ */
if ((op & UNLOGGED_RELATION_CLEANUP) != 0)
{
- HTAB *hash;
- HASHCTL ctl;
-
- /*
- * It's possible that someone could create a ton of unlogged relations
- * in the same database & tablespace, so we'd better use a hash table
- * rather than an array or linked list to keep track of which files
- * need to be reset. Otherwise, this cleanup operation would be
- * O(n^2).
- */
- ctl.keysize = sizeof(Oid);
- ctl.entrysize = sizeof(unlogged_relation_entry);
- ctl.hcxt = CurrentMemoryContext;
- hash = hash_create("unlogged relation OIDs", 32, &ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- /* Scan the directory. */
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ int oidchars;
+ Oid key;
+ relfile_entry *ent;
+ RelFileNodeBackend rel;
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* Also skip it unless this is the init fork. */
- if (forkNum != INIT_FORKNUM)
- continue;
-
- /*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
- */
- ent.reloid = atooid(de->d_name);
- (void) hash_search(hash, &ent, HASH_ENTER, NULL);
- }
-
- /* Done with the first pass. */
- FreeDir(dbspace_dir);
-
- /*
- * If we didn't find any init forks, there's no point in continuing;
- * we can bail out now.
- */
- if (hash_get_num_entries(hash) == 0)
- {
- hash_destroy(hash);
- return;
- }
-
- /*
- * Now, make a second pass and remove anything that matches.
- */
- dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
- {
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
-
- /* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* We never remove the init fork. */
- if (forkNum == INIT_FORKNUM)
+ &forkNum, &mark))
continue;
/*
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
- if (hash_search(hash, &ent, HASH_FIND, NULL))
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);
+
+ if (!ent)
+ continue;
+
+ if (!ent->dirty_all)
{
- snprintf(rm_path, sizeof(rm_path), "%s/%s",
- dbspacedirname, de->d_name);
- if (unlink(rm_path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m",
- rm_path)));
+ /* clean permanent relations don't need cleanup */
+ if (!ent->has_init)
+ continue;
+
+ if (ent->dirty_init)
+ {
+ /*
+ * The crashed transaction did SET UNLOGGED. This relation
+ * is restored to a LOGGED relation.
+ */
+ if (forkNum != INIT_FORKNUM)
+ continue;
+ }
else
- elog(DEBUG2, "unlinked file \"%s\"", rm_path);
+ {
+ /*
+ * we don't remove the INIT fork of a non-dirty
+ * relfilenode
+ */
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE)
+ continue;
+ }
}
+
+ /* so, nuke it! */
+ snprintf(rm_path, sizeof(rm_path), "%s/%s",
+ dbspacedirname, de->d_name);
+ if (unlink(rm_path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ rm_path)));
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = atooid(de->d_name);
+
+ ForgetRelationForkSyncRequests(rel, forkNum);
}
/* Cleanup is complete. */
FreeDir(dbspace_dir);
- hash_destroy(hash);
}
+ hash_destroy(hash);
+ hash = NULL;
+
/*
* Initialization happens after cleanup is complete: we copy each init
- * fork file to the corresponding main fork file. Note that if we are
- * asked to do both cleanup and init, we may never get here: if the
- * cleanup code determines that there are no init forks in this dbspace,
- * it will return before we get to this point.
+ * fork file to the corresponding main fork file.
*/
if ((op & UNLOGGED_RELATION_INIT) != 0)
{
@@ -285,6 +419,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char srcpath[MAXPGPATH * 2];
@@ -292,9 +427,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -328,15 +465,18 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char mainpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -379,7 +519,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
*/
bool
parse_filename_for_nontemp_relation(const char *name, int *oidchars,
- ForkNumber *fork)
+ ForkNumber *fork, StorageMarks *mark)
{
int pos;
@@ -410,11 +550,19 @@ parse_filename_for_nontemp_relation(const char *name, int *oidchars,
for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar)
;
- if (segchar <= 1)
- return false;
- pos += segchar;
+ if (segchar > 1)
+ pos += segchar;
}
+ /* mark file? */
+ if (name[pos] == '.' && name[pos + 1] != 0)
+ {
+ *mark = name[pos + 1];
+ pos += 2;
+ }
+ else
+ *mark = SMGR_MARK_NONE;
+
/* Now we should be at the end. */
if (name[pos] != '\0')
return false;
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index b4bca7eed6..580b74839f 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -139,7 +139,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
BlockNumber blkno, bool skipFsync, int behavior);
static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
MdfdVec *seg);
-
+static bool mdmarkexists(SMgrRelation reln, ForkNumber forkNum,
+ StorageMarks mark);
/*
* mdinit() -- Initialize private state for magnetic disk storage manager.
@@ -169,6 +170,81 @@ mdexists(SMgrRelation reln, ForkNumber forkNum)
return (mdopenfork(reln, forkNum, EXTENSION_RETURN_NULL) != NULL);
}
+/*
+ * mdcreatemark() -- Create a mark file.
+ *
+ * If isRedo is true, it's okay for the file to exist already.
+ */
+void
+mdcreatemark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+ int fd;
+
+ /* See mdcreate for details. */
+ TablespaceCreateDbspace(reln->smgr_rnode.node.spcNode,
+ reln->smgr_rnode.node.dbNode,
+ isRedo);
+
+ fd = BasicOpenFile(path, O_WRONLY | O_CREAT | O_EXCL);
+ if (fd < 0 && (!isRedo || errno != EEXIST))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create mark file \"%s\": %m", path)));
+
+ pg_fsync(fd);
+ close(fd);
+
+ /*
+ * To guarantee that the creation of the file is persistent, fsync its
+ * parent directory.
+ */
+ fsync_parent_path(path, ERROR);
+
+ pfree(path);
+}
+
+
+/*
+ * mdunlinkmark() -- Delete the mark file
+ *
+ * If isRedo is true, it's okay if the file is not found.
+ */
+void
+mdunlinkmark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+
+ if (!isRedo || mdmarkexists(reln, forkNum, mark))
+ durable_unlink(path, ERROR);
+
+ pfree(path);
+}
+
+/*
+ * mdmarkexists() -- Check if the file exists.
+ */
+static bool
+mdmarkexists(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+ int fd;
+
+ fd = BasicOpenFile(path, O_RDONLY);
+ if (fd < 0 && errno != ENOENT)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not access mark file \"%s\": %m", path)));
+ pfree(path);
+
+ if (fd < 0)
+ return false;
+
+ return true;
+}
+
/*
* mdcreate() -- Create a new relation on magnetic disk.
*
@@ -1025,6 +1101,15 @@ register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum,
RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ );
}
+/*
+ * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork
+ */
+void
+ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, ForkNumber forknum)
+{
+ register_forget_request(rnode, forknum, 0);
+}
+
/*
* ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB
*/
@@ -1378,12 +1463,14 @@ mdsyncfiletag(const FileTag *ftag, char *path)
* Return 0 on success, -1 on failure, with errno set.
*/
int
-mdunlinkfiletag(const FileTag *ftag, char *path)
+mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark)
{
char *p;
/* Compute the path. */
- p = relpathperm(ftag->rnode, MAIN_FORKNUM);
+ p = GetRelationPath(ftag->rnode.dbNode, ftag->rnode.spcNode,
+ ftag->rnode.relNode, InvalidBackendId, MAIN_FORKNUM,
+ mark);
strlcpy(path, p, MAXPGPATH);
pfree(p);
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 0fcef4994b..110e64b0b2 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -62,6 +62,10 @@ typedef struct f_smgr
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+ void (*smgr_createmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+ void (*smgr_unlinkmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
} f_smgr;
static const f_smgr smgrsw[] = {
@@ -82,6 +86,8 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
+ .smgr_createmark = mdcreatemark,
+ .smgr_unlinkmark = mdunlinkmark,
}
};
@@ -335,6 +341,26 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo);
}
+/*
+ * smgrcreatemark() -- Create a mark file
+ */
+void
+smgrcreatemark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_createmark(reln, forknum, mark, isRedo);
+}
+
+/*
+ * smgrunlinkmark() -- Delete a mark file
+ */
+void
+smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlinkmark(reln, forknum, mark, isRedo);
+}
+
/*
* smgrdosyncall() -- Immediately sync all forks of all given relations
*
@@ -662,6 +688,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index d4083e8a56..9563940d45 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -89,7 +89,8 @@ static CycleCtr checkpoint_cycle_ctr = 0;
typedef struct SyncOps
{
int (*sync_syncfiletag) (const FileTag *ftag, char *path);
- int (*sync_unlinkfiletag) (const FileTag *ftag, char *path);
+ int (*sync_unlinkfiletag) (const FileTag *ftag, char *path,
+ StorageMarks mark);
bool (*sync_filetagmatches) (const FileTag *ftag,
const FileTag *candidate);
} SyncOps;
@@ -222,7 +223,8 @@ SyncPostCheckpoint(void)
/* Unlink the file */
if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
- path) < 0)
+ path,
+ SMGR_MARK_NONE) < 0)
{
/*
* There's a race condition, when the database is dropped at the
@@ -236,6 +238,20 @@ SyncPostCheckpoint(void)
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m", path)));
}
+ else if (syncsw[entry->tag.handler].sync_unlinkfiletag(
+ &entry->tag, path,
+ SMGR_MARK_UNCOMMITTED) < 0)
+ {
+ /*
+ * We may also have an SMGR_MARK_UNCOMMITTED file. Remove it if the
+ * fork file has been successfully removed. It's OK if the mark file
+ * does not exist.
+ */
+ if (errno != ENOENT)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
/* Mark the list entry as canceled, just in case */
entry->canceled = true;
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 436df54120..dbc0da5da5 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -407,6 +407,30 @@ extractPageInfo(XLogReaderState *record)
* source system.
*/
}
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_UNLINK)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_MARK)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
else if (rmid == RM_XACT_ID &&
((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT ||
(rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED ||
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 1f5c426ec0..4945b111cc 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -139,9 +139,15 @@ GetDatabasePath(Oid dbNode, Oid spcNode)
*/
char *
GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
- int backendId, ForkNumber forkNumber)
+ int backendId, ForkNumber forkNumber, char mark)
{
char *path;
+ char markstr[4];
+
+ if (mark == 0)
+ markstr[0] = 0;
+ else
+ snprintf(markstr, sizeof(markstr), ".%c", mark);
if (spcNode == GLOBALTABLESPACE_OID)
{
@@ -149,10 +155,10 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
Assert(dbNode == 0);
Assert(backendId == InvalidBackendId);
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("global/%u_%s",
- relNode, forkNames[forkNumber]);
+ path = psprintf("global/%u_%s%s",
+ relNode, forkNames[forkNumber], markstr);
else
- path = psprintf("global/%u", relNode);
+ path = psprintf("global/%u%s", relNode, markstr);
}
else if (spcNode == DEFAULTTABLESPACE_OID)
{
@@ -160,22 +166,22 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/%u_%s",
+ path = psprintf("base/%u/%u_%s%s",
dbNode, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/%u",
- dbNode, relNode);
+ path = psprintf("base/%u/%u%s",
+ dbNode, relNode, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/t%d_%u_%s",
+ path = psprintf("base/%u/t%d_%u_%s%s",
dbNode, backendId, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/t%d_%u",
- dbNode, backendId, relNode);
+ path = psprintf("base/%u/t%d_%u%s",
+ dbNode, backendId, relNode, markstr);
}
}
else
@@ -184,27 +190,28 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u_%s%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
dbNode, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
- dbNode, relNode);
+ dbNode, relNode, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
dbNode, backendId, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
- dbNode, backendId, relNode);
+ dbNode, backendId, relNode, markstr);
}
}
+
return path;
}
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 0ab32b44e9..584ebac391 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -23,6 +23,8 @@
extern int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
+extern void RelationCreateInitFork(Relation rel);
+extern void RelationDropInitFork(Relation rel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationPreTruncate(Relation rel);
@@ -41,6 +43,7 @@ extern void RestorePendingSyncs(char *startAddress);
extern void smgrDoPendingDeletes(bool isCommit);
extern void smgrDoPendingSyncs(bool isCommit, bool isParallelWorker);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingCleanups(bool isCommit);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index f0814f1458..12346ed7f6 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -18,17 +18,23 @@
#include "lib/stringinfo.h"
#include "storage/block.h"
#include "storage/relfilenode.h"
+#include "storage/smgr.h"
/*
* Declarations for smgr-related XLOG records
*
- * Note: we log file creation and truncation here, but logging of deletion
- * actions is handled by xact.c, because it is part of transaction commit.
+ * Note: we log file creation, truncation and buffer persistence change here,
+ * but logging of deletion actions is handled mainly by xact.c, because it is
+ * part of transaction commit in most cases. However, there's a case where
+ * init forks are deleted outside the control of a transaction.
*/
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_UNLINK 0x30
+#define XLOG_SMGR_MARK 0x40
+#define XLOG_SMGR_BUFPERSISTENCE 0x50
typedef struct xl_smgr_create
{
@@ -36,6 +42,32 @@ typedef struct xl_smgr_create
ForkNumber forkNum;
} xl_smgr_create;
+typedef struct xl_smgr_unlink
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+} xl_smgr_unlink;
+
+typedef enum smgr_mark_action
+{
+ XLOG_SMGR_MARK_CREATE = 'c',
+ XLOG_SMGR_MARK_UNLINK = 'u'
+} smgr_mark_action;
+
+typedef struct xl_smgr_mark
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ smgr_mark_action action;
+} xl_smgr_mark;
+
+typedef struct xl_smgr_bufpersistence
+{
+ RelFileNode rnode;
+ bool persistence;
+} xl_smgr_bufpersistence;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +83,12 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark);
+extern void log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark);
+extern void log_smgrbufpersistence(const RelFileNode *rnode, bool persistence);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index a44be11ca0..106a5cf508 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -67,7 +67,7 @@ extern int forkname_chars(const char *str, ForkNumber *fork);
extern char *GetDatabasePath(Oid dbNode, Oid spcNode);
extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
- int backendId, ForkNumber forkNumber);
+ int backendId, ForkNumber forkNumber, char mark);
/*
* Wrapper macros for GetRelationPath. Beware of multiple
@@ -77,7 +77,7 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
/* First argument is a RelFileNode */
#define relpathbackend(rnode, backend, forknum) \
GetRelationPath((rnode).dbNode, (rnode).spcNode, (rnode).relNode, \
- backend, forknum)
+ backend, forknum, 0)
/* First argument is a RelFileNode */
#define relpathperm(rnode, forknum) \
@@ -87,4 +87,9 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
#define relpath(rnode, forknum) \
relpathbackend((rnode).node, (rnode).backend, forknum)
+/* First argument is a RelFileNodeBackend */
+#define markpath(rnode, forknum, mark) \
+ GetRelationPath((rnode).node.dbNode, (rnode).node.spcNode, \
+ (rnode).node.relNode, \
+ (rnode).backend, forknum, mark)
#endif /* RELPATH_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index cfce23ecbc..f5a7df87a4 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -206,6 +206,8 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels)
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+ bool permanent, bool isRedo);
extern void DropRelFileNodesAllBuffers(struct SMgrRelationData **smgr_reln, int nnodes);
extern void DropDatabaseBuffers(Oid dbid);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 34602ae006..2dc0357ad5 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -185,6 +185,7 @@ extern ssize_t pg_pwritev_with_retry(int fd,
extern int pg_truncate(const char *path, off_t length);
extern void fsync_fname(const char *fname, bool isdir);
extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
+extern int fsync_parent_path(const char *fname, int elevel);
extern int durable_rename(const char *oldfile, const char *newfile, int loglevel);
extern int durable_unlink(const char *fname, int loglevel);
extern int durable_rename_excl(const char *oldfile, const char *newfile, int loglevel);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 752b440864..99620816b5 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -23,6 +23,10 @@
extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
extern void mdclose(SMgrRelation reln, ForkNumber forknum);
+extern void mdcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void mdunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
@@ -41,12 +45,14 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void ForgetRelationForkSyncRequests(RelFileNodeBackend rnode,
+ ForkNumber forknum);
extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
/* md sync callbacks */
extern int mdsyncfiletag(const FileTag *ftag, char *path);
-extern int mdunlinkfiletag(const FileTag *ftag, char *path);
+extern int mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark);
extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *candidate);
#endif /* MD_H */
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index fad1e5c473..e1f97e9b89 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -16,13 +16,15 @@
#define REINIT_H
#include "common/relpath.h"
-
+#include "storage/smgr.h"
extern void ResetUnloggedRelations(int op);
-extern bool parse_filename_for_nontemp_relation(const char *name,
- int *oidchars, ForkNumber *fork);
+extern bool parse_filename_for_nontemp_relation(const char *name, int *oidchars,
+ ForkNumber *fork,
+ StorageMarks *mark);
#define UNLOGGED_RELATION_CLEANUP 0x0001
-#define UNLOGGED_RELATION_INIT 0x0002
+#define UNLOGGED_RELATION_DROP_BUFFER 0x0002
+#define UNLOGGED_RELATION_INIT 0x0004
#endif /* REINIT_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a6fbf7b6a6..201ecace8a 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,18 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
+/*
+ * A storage mark is a file whose existence indicates something about
+ * another file. Such files are named "<filename>.<mark>", where the mark is
+ * one of the values of StorageMarks. Since ".<digit>" denotes a segment file,
+ * don't use digits for the mark character.
+ */
+typedef enum StorageMarks
+{
+ SMGR_MARK_NONE = 0,
+ SMGR_MARK_UNCOMMITTED = 'u' /* the file is not committed yet */
+} StorageMarks;
+
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
@@ -85,7 +97,12 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
extern void smgrclose(SMgrRelation reln);
extern void smgrcloseall(void);
extern void smgrclosenode(RelFileNodeBackend rnode);
+extern void smgrcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void smgrunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/test/recovery/t/027_persistence_change.pl b/src/test/recovery/t/027_persistence_change.pl
new file mode 100644
index 0000000000..38c5388093
--- /dev/null
+++ b/src/test/recovery/t/027_persistence_change.pl
@@ -0,0 +1,268 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test relation persistence change
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use Test::More tests => 57;
+use IPC::Run qw(pump finish timer);
+use Config;
+
+my $data_unit = 2000;
+
+# Initialize primary node.
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+# we don't want checkpointing
+$node->append_conf('postgresql.conf', qq(
+checkpoint_timeout = '24h'
+));
+$node->start;
+create($node);
+
+my $relfilenodes1 = relfilenodes();
+
+# correctly recover empty tables
+$node->stop('immediate');
+$node->start;
+insert($node, 0, $data_unit);
+
+# data persists after a crash
+$node->stop('immediate');
+$node->start;
+checkdataloss($data_unit, 'crash logged 1');
+
+set_unlogged($node);
+# SET UNLOGGED didn't change relfilenode
+my $relfilenodes2 = relfilenodes();
+checkrelfilenodes($relfilenodes1, $relfilenodes2, 'logged->unlogged');
+
+# data cleanly vanishes after a crash
+$node->stop('immediate');
+$node->start;
+checkdataloss(0, 'crash unlogged');
+
+insert($node, 0, $data_unit);
+set_logged($node);
+
+$node->stop('immediate');
+$node->start;
+# SET LOGGED didn't change relfilenode and data survives a crash
+my $relfilenodes3 = relfilenodes();
+checkrelfilenodes($relfilenodes2, $relfilenodes3, 'unlogged->logged');
+checkdataloss($data_unit, 'crash logged 2');
+
+# unlogged insert -> graceful stop
+set_unlogged($node);
+insert($node, $data_unit, $data_unit, 0);
+$node->stop;
+$node->start;
+checkdataloss($data_unit * 2, 'unlogged graceful restart');
+
+# crash during transaction
+set_logged($node);
+$node->stop('immediate');
+$node->start;
+insert($node, $data_unit * 2, $data_unit);
+
+my $h = insert($node, $data_unit * 3, $data_unit, 1); ## this is aborted
+$node->stop('immediate');
+
+# finishing $h stalls this case, just tear it off.
+$h = undef;
+
+# check if indexes are working
+$node->start;
+check($node, $data_unit * 3, 'final');
+
+sub create
+{
+ my ($node) = @_;
+
+ $node->psql('postgres', qq(
+ CREATE TABLE t_bt (a int);
+ CREATE INDEX t_bt_i ON t_bt USING btree (a);
+ CREATE TABLE t_gin (a int[]);
+ CREATE INDEX t_gin_i ON t_gin USING gin (a);
+ CREATE TABLE t_gist (a point);
+ CREATE INDEX t_gist_i ON t_gist USING gist (a);
+ CREATE TABLE t_hash (a int);
+ CREATE INDEX t_hash_i ON t_hash USING hash (a);
+ CREATE TABLE t_brin (a int);
+ CREATE INDEX t_brin_i ON t_brin USING brin (a);
+ CREATE TABLE t_spgist (a point);
+ CREATE INDEX t_spgist_i ON t_spgist USING spgist (a);));
+}
+
+
+sub insert
+{
+ my ($node, $st, $num, $interactive) = @_;
+ my $ed = $st + $num - 1;
+ my $query = qq(BEGIN;
+INSERT INTO t_bt (SELECT i FROM generate_series($st, $ed) i);
+INSERT INTO t_gin
+ (SELECT ARRAY[i, i * 2] FROM generate_series($st, $ed) i);
+INSERT INTO t_gist
+ (SELECT point(i, i * 2) FROM generate_series($st, $ed) i);
+INSERT INTO t_hash (SELECT i FROM generate_series($st, $ed) i);
+INSERT INTO t_brin (SELECT i FROM generate_series($st, $ed) i);
+INSERT INTO t_spgist
+ (SELECT point(i,i) FROM generate_series($st, $ed) i);
+);
+
+ if ($interactive)
+ {
+ my $in = '';
+ my $out = '';
+ my $timer = timer(10);
+
+ my $h = $node->interactive_psql('postgres', \$in, \$out, $timer);
+ like($out, qr/psql/, "print startup banner");
+
+ $timer->start(10);
+ $in .= $query . "SELECT 'END';\n";
+ pump $h until ($out =~ /\nEND/ || $timer->is_expired);
+ ok(($out =~ /\nEND/ && !$timer->is_expired), "inserted");
+ return $h
+ # the transaction is not terminated
+ }
+ else
+ {
+ $node->psql('postgres', $query . "COMMIT;");
+ return undef;
+ }
+}
+
+sub check
+{
+ my ($node, $num_data, $head) = @_;
+ my $st = 0;
+ my $ed = $num_data - 1;
+
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO true;
+ SET enable_indexscan TO false;
+ SELECT COUNT(*) FROM t_bt, generate_series($st, $ed) i
+ WHERE a = i)),
+ $num_data, "$head: heap is not broken");
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t_bt, generate_series($st, $ed) i
+ WHERE a = i)),
+ $num_data, "$head: btree is not broken");
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t_gin, generate_series($st, $ed) i
+ WHERE a = ARRAY[i, i * 2];)),
+ $num_data, "$head: gin is not broken");
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t_gist, generate_series($st, $ed) i
+ WHERE a <@ box(point(i-0.5, i*2-0.5),point(i+0.5, i*2+0.5));)),
+ $num_data, "$head: gist is not broken");
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t_hash, generate_series($st, $ed) i
+ WHERE a = i;)),
+ $num_data, "$head: hash is not broken");
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t_brin, generate_series($st, $ed) i
+ WHERE a = i;)),
+ $num_data, "$head: brin is not broken");
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t_spgist, generate_series($st, $ed) i
+ WHERE a <@ box(point(i-0.5,i-0.5),point(i+0.5,i+0.5));)),
+ $num_data, "$head: spgist is not broken");
+}
+
+sub set_unlogged
+{
+ my ($node) = @_;
+
+ $node->psql('postgres', qq(
+ ALTER TABLE t_bt SET UNLOGGED;
+ ALTER TABLE t_gin SET UNLOGGED;
+ ALTER TABLE t_gist SET UNLOGGED;
+ ALTER TABLE t_hash SET UNLOGGED;
+ ALTER TABLE t_brin SET UNLOGGED;
+ ALTER TABLE t_spgist SET UNLOGGED;));
+}
+
+sub set_logged
+{
+ my ($node) = @_;
+
+ $node->psql('postgres', qq(
+ ALTER TABLE t_bt SET LOGGED;
+ ALTER TABLE t_gin SET LOGGED;
+ ALTER TABLE t_gist SET LOGGED;
+ ALTER TABLE t_hash SET LOGGED;
+ ALTER TABLE t_brin SET LOGGED;
+ ALTER TABLE t_spgist SET LOGGED;));
+}
+
+sub relfilenodes
+{
+ my $result = $node->safe_psql('postgres', qq(
+ SELECT relname, relfilenode FROM pg_class
+ WHERE relname
+ IN
+ (SELECT unnest(ARRAY[n, n||'_i'])
+ FROM unnest(ARRAY['t_bt','t_gin','t_gist','t_hash','t_brin','t_spgist'])
+ as n(n));
+));
+
+ my %relfilenodes;
+
+ foreach my $l (split(/\n/, $result))
+ {
+ die "unexpected format: $l" if ($l !~ /^([^|]+)\|([0-9]+)$/);
+ $relfilenodes{$1} = $2;
+ }
+
+ return \%relfilenodes;
+}
+
+sub checkrelfilenodes
+{
+ my ($rnodes1, $rnodes2, $s) = @_;
+
+ foreach my $n (keys %{$rnodes1})
+ {
+ if ($n eq 't_gist_i')
+ {
+ # persistence of GiST index is not changed in-place
+ isnt($rnodes1->{$n}, $rnodes2->{$n},
+ "$s: relfilenode is changed: $n");
+ }
+ else
+ {
+ # otherwise all relations are processed in-place
+ is($rnodes1->{$n}, $rnodes2->{$n},
+ "$s: relfilenode is not changed: $n");
+ }
+ }
+}
+
+sub checkdataloss
+{
+ my ($expected, $s) = @_;
+
+ foreach my $n ('t_bt','t_gin','t_gist','t_hash','t_brin','t_spgist')
+ {
+ is($node->safe_psql('postgres', "SELECT count(*) FROM $n;"), $expected,
+ "$s: data in table $n is in the expected state");
+ }
+}
--
2.27.0
v12-0002-New-command-ALTER-TABLE-ALL-IN-TABLESPACE-SET-LO.patch (text/x-patch; charset=us-ascii)
From 500b33149ac2db982fe95ee5e6bcfee285ab7dd1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 23:21:09 +0900
Subject: [PATCH v12 2/2] New command ALTER TABLE ALL IN TABLESPACE SET
LOGGED/UNLOGGED
To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes
relation persistence of all tables in the specified tablespace.
---
src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++++++++++
src/backend/nodes/copyfuncs.c | 16 ++++
src/backend/nodes/equalfuncs.c | 15 ++++
src/backend/parser/gram.y | 20 +++++
src/backend/tcop/utility.c | 11 +++
src/include/commands/tablecmds.h | 2 +
src/include/nodes/nodes.h | 1 +
src/include/nodes/parsenodes.h | 9 ++
8 files changed, 214 insertions(+)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 41e77e1072..08caeec931 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14509,6 +14509,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
return new_tablespaceoid;
}
+/*
+ * Alter Table ALL ... SET LOGGED/UNLOGGED
+ *
+ * Allows a user to change persistence of all objects in a given tablespace in
+ * the current database. Objects can also be chosen based on their owner,
+ * to allow users to change the persistence of only their own objects. The
+ * main permissions handling is done by the lower-level change persistence
+ * function.
+ *
+ * All to-be-modified objects are locked first. If NOWAIT is specified and the
+ * lock can't be acquired then we ereport(ERROR).
+ */
+void
+AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt)
+{
+ List *relations = NIL;
+ ListCell *l;
+ ScanKeyData key[1];
+ Relation rel;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ Oid tablespaceoid;
+ List *role_oids = roleSpecsToIds(NIL);
+
+ /* Ensure we were not asked to change something we can't */
+ if (stmt->objtype != OBJECT_TABLE)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("only tables can be specified")));
+
+ /* Get the tablespace OID */
+ tablespaceoid = get_tablespace_oid(stmt->tablespacename, false);
+
+ /*
+ * Now that the checks are done, check if we should set it to
+ * InvalidOid because it is our database's default tablespace.
+ */
+ if (tablespaceoid == MyDatabaseTableSpace)
+ tablespaceoid = InvalidOid;
+
+ /*
+ * Walk the list of objects in the tablespace to pick them up. This will
+ * only find objects in our database, of course.
+ */
+ ScanKeyInit(&key[0],
+ Anum_pg_class_reltablespace,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(tablespaceoid));
+
+ rel = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(rel, 1, key);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relOid = relForm->oid;
+
+ /*
+ * Do not pick-up objects in pg_catalog as part of this, if an admin
+ * really wishes to do so, they can issue the individual ALTER
+ * commands directly.
+ *
+ * Also, explicitly avoid any shared tables, temp tables, or TOAST
+ * (TOAST will be changed with the main table).
+ */
+ if (IsCatalogNamespace(relForm->relnamespace) ||
+ relForm->relisshared ||
+ isAnyTempNamespace(relForm->relnamespace) ||
+ IsToastNamespace(relForm->relnamespace))
+ continue;
+
+ /* Only pick up the object type requested */
+ if (relForm->relkind != RELKIND_RELATION)
+ continue;
+
+ /* Check if we are only picking-up objects owned by certain roles */
+ if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner))
+ continue;
+
+ /*
+ * Handle permissions-checking here since we are locking the tables
+ * and also to avoid doing a bunch of work only to fail part-way. Note
+ * that permissions will also be checked by AlterTableInternal().
+ *
+ * Caller must be considered an owner on the table of which we're going
+ * to change persistence.
+ */
+ if (!pg_class_ownercheck(relOid, GetUserId()))
+ aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)),
+ NameStr(relForm->relname));
+
+ if (stmt->nowait &&
+ !ConditionalLockRelationOid(relOid, AccessExclusiveLock))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("aborting because lock on relation \"%s.%s\" is not available",
+ get_namespace_name(relForm->relnamespace),
+ NameStr(relForm->relname))));
+ else
+ LockRelationOid(relOid, AccessExclusiveLock);
+
+ /*
+ * Add to our list of objects of which we're going to change
+ * persistence.
+ */
+ relations = lappend_oid(relations, relOid);
+ }
+
+ table_endscan(scan);
+ table_close(rel, AccessShareLock);
+
+ if (relations == NIL)
+ ereport(NOTICE,
+ (errcode(ERRCODE_NO_DATA_FOUND),
+ errmsg("no matching relations in tablespace \"%s\" found",
+ tablespaceoid == InvalidOid ? "(database default)" :
+ get_tablespace_name(tablespaceoid))));
+
+ /*
+ * Everything is locked, loop through and change persistence of all of the
+ * relations.
+ */
+ foreach(l, relations)
+ {
+ List *cmds = NIL;
+ AlterTableCmd *cmd = makeNode(AlterTableCmd);
+
+ if (stmt->logged)
+ cmd->subtype = AT_SetLogged;
+ else
+ cmd->subtype = AT_SetUnLogged;
+
+ cmds = lappend(cmds, cmd);
+
+ EventTriggerAlterTableStart((Node *) stmt);
+ /* OID is set by AlterTableInternal */
+ AlterTableInternal(lfirst_oid(l), cmds, false);
+ EventTriggerAlterTableEnd();
+ }
+}
+
static void
index_copy_data(Relation rel, RelFileNode newrnode)
{
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index df0b747883..55e38cfe3f 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4269,6 +4269,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from)
return newnode;
}
+static AlterTableSetLoggedAllStmt *
+_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *from)
+{
+ AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt);
+
+ COPY_STRING_FIELD(tablespacename);
+ COPY_SCALAR_FIELD(objtype);
+ COPY_SCALAR_FIELD(logged);
+ COPY_SCALAR_FIELD(nowait);
+
+ return newnode;
+}
+
static CreateExtensionStmt *
_copyCreateExtensionStmt(const CreateExtensionStmt *from)
{
@@ -5622,6 +5635,9 @@ copyObjectImpl(const void *from)
case T_AlterTableMoveAllStmt:
retval = _copyAlterTableMoveAllStmt(from);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _copyAlterTableSetLoggedAllStmt(from);
+ break;
case T_CreateExtensionStmt:
retval = _copyCreateExtensionStmt(from);
break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index cb7ddd463c..a19b7874d7 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1916,6 +1916,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a,
return true;
}
+static bool
+_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *a,
+ const AlterTableSetLoggedAllStmt *b)
+{
+ COMPARE_STRING_FIELD(tablespacename);
+ COMPARE_SCALAR_FIELD(objtype);
+ COMPARE_SCALAR_FIELD(logged);
+ COMPARE_SCALAR_FIELD(nowait);
+
+ return true;
+}
+
static bool
_equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b)
{
@@ -3625,6 +3637,9 @@ equal(const void *a, const void *b)
case T_AlterTableMoveAllStmt:
retval = _equalAlterTableMoveAllStmt(a, b);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _equalAlterTableSetLoggedAllStmt(a, b);
+ break;
case T_CreateExtensionStmt:
retval = _equalCreateExtensionStmt(a, b);
break;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 3d4dd43e47..9823d57a54 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -1984,6 +1984,26 @@ AlterTableStmt:
n->nowait = $13;
$$ = (Node *)n;
}
+ | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = true;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = false;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
| ALTER INDEX qualified_name alter_table_cmds
{
AlterTableStmt *n = makeNode(AlterTableStmt);
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 1fbc387d47..1483f9a475 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -162,6 +162,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_AlterTSConfigurationStmt:
case T_AlterTSDictionaryStmt:
case T_AlterTableMoveAllStmt:
+ case T_AlterTableSetLoggedAllStmt:
case T_AlterTableSpaceOptionsStmt:
case T_AlterTableStmt:
case T_AlterTypeStmt:
@@ -1747,6 +1748,12 @@ ProcessUtilitySlow(ParseState *pstate,
commandCollected = true;
break;
+ case T_AlterTableSetLoggedAllStmt:
+ AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree);
+ /* commands are stashed in AlterTableSetLoggedAll */
+ commandCollected = true;
+ break;
+
case T_DropStmt:
ExecDropStmt((DropStmt *) parsetree, isTopLevel);
/* no commands stashed for DROP */
@@ -2669,6 +2676,10 @@ CreateCommandTag(Node *parsetree)
tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype);
+ break;
+
case T_AlterTableStmt:
tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype);
break;
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index 336549cc5f..714077ff4c 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
+extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt);
+
extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
Oid *oldschema);
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 7c657c1241..8860b2e548 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -428,6 +428,7 @@ typedef enum NodeTag
T_AlterCollationStmt,
T_CallStmt,
T_AlterStatsStmt,
+ T_AlterTableSetLoggedAllStmt,
/*
* TAGS FOR PARSE TREE NODES (parsenodes.h)
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 4c5a8a39bf..c3e1bc66d1 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2350,6 +2350,15 @@ typedef struct AlterTableMoveAllStmt
bool nowait;
} AlterTableMoveAllStmt;
+typedef struct AlterTableSetLoggedAllStmt
+{
+ NodeTag type;
+ char *tablespacename;
+ ObjectType objtype; /* Object type to change */
+ bool logged;
+ bool nowait;
+} AlterTableSetLoggedAllStmt;
+
/* ----------------------
* Create/Alter Extension Statements
* ----------------------
--
2.27.0
At Thu, 23 Dec 2021 15:01:41 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
I added a TAP test to exercise the in-place persistence change.
We don't need a base table for every index. I revised the TAP test.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v13-0001-In-place-table-persistence-change.patch (text/x-patch; charset=us-ascii)
From 112c077561bb24a0b40995e2d6ada7b33edd6475 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v13 1/2] In-place table persistence change
Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, it currently runs a heap rewrite, which causes a large amount
of file I/O. This patch makes the command run without a heap rewrite.
In addition, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of a massive number of HEAP_INSERTs, which should be
smaller.
This also allows for the cleanup of files left behind when the
transaction that created them crashes.
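
The commit/abort cleanup bookkeeping that this patch introduces can be
illustrated with a small standalone sketch. This is not the patch's code:
DemoCleanup, demo_register, demo_do_cleanups and demo_pending_count are
hypothetical names, and the "action" is reduced to recording an integer id.
It only assumes what the diff below shows for PendingCleanup: each entry
carries an atCommit flag and a nesting level, entries from outer levels are
left queued, and an entry is unlinked from the list before it is acted on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for the patch's PendingCleanup entry. */
typedef struct DemoCleanup
{
	int			id;			/* illustrative label for the action */
	bool		at_commit;	/* true: run at commit; false: run at abort */
	int			nest_level;	/* nesting level that registered the entry */
	struct DemoCleanup *next;
} DemoCleanup;

static DemoCleanup *demo_pending = NULL;	/* head of linked list */

/* Push a new entry onto the head of the list. */
void
demo_register(int id, bool at_commit, int nest_level)
{
	DemoCleanup *c = malloc(sizeof(DemoCleanup));

	if (c == NULL)
		abort();
	c->id = id;
	c->at_commit = at_commit;
	c->nest_level = nest_level;
	c->next = demo_pending;
	demo_pending = c;
}

/*
 * Process entries registered at nest_level or deeper.  Entries whose
 * at_commit flag matches is_commit are "run" (their id is stored in ran[]);
 * the others are simply discarded.  Entries from outer levels stay queued.
 * Returns the number of ids written to ran[].
 */
int
demo_do_cleanups(bool is_commit, int nest_level, int *ran, int max_ran)
{
	DemoCleanup *cur;
	DemoCleanup *prev = NULL;
	DemoCleanup *next;
	int			n = 0;

	for (cur = demo_pending; cur != NULL; cur = next)
	{
		next = cur->next;
		if (cur->nest_level < nest_level)
		{
			/* outer-level entries are not processed yet */
			prev = cur;
			continue;
		}

		/* unlink the entry first, so a failure would not retry it */
		if (prev)
			prev->next = next;
		else
			demo_pending = next;

		if (cur->at_commit == is_commit && n < max_ran)
			ran[n++] = cur->id;

		free(cur);
		/* prev does not change */
	}
	return n;
}

/* Number of entries still queued. */
int
demo_pending_count(void)
{
	int			count = 0;
	DemoCleanup *cur;

	for (cur = demo_pending; cur != NULL; cur = cur->next)
		count++;
	return count;
}
```

In this sketch, registering one entry with at_commit = true and another
with at_commit = false for the same file mirrors how the patch pairs
"drop the mark file at commit" with "drop the fork and mark file at abort":
whichever way the transaction ends, exactly one of the two runs and the
other is discarded.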
---
src/backend/access/rmgrdesc/smgrdesc.c | 52 ++
src/backend/access/transam/README | 8 +
src/backend/access/transam/xact.c | 7 +
src/backend/access/transam/xlog.c | 17 +
src/backend/catalog/storage.c | 545 +++++++++++++++++-
src/backend/commands/tablecmds.c | 266 +++++++--
src/backend/replication/basebackup.c | 3 +-
src/backend/storage/buffer/bufmgr.c | 88 +++
src/backend/storage/file/fd.c | 4 +-
src/backend/storage/file/reinit.c | 344 +++++++----
src/backend/storage/smgr/md.c | 93 ++-
src/backend/storage/smgr/smgr.c | 32 +
src/backend/storage/sync/sync.c | 20 +-
src/bin/pg_rewind/parsexlog.c | 24 +
src/common/relpath.c | 47 +-
src/include/catalog/storage.h | 3 +
src/include/catalog/storage_xlog.h | 42 +-
src/include/common/relpath.h | 9 +-
src/include/storage/bufmgr.h | 2 +
src/include/storage/fd.h | 1 +
src/include/storage/md.h | 8 +-
src/include/storage/reinit.h | 10 +-
src/include/storage/smgr.h | 17 +
src/test/recovery/t/027_persistence_change.pl | 244 ++++++++
24 files changed, 1704 insertions(+), 182 deletions(-)
create mode 100644 src/test/recovery/t/027_persistence_change.pl
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 7755553d57..d251f22207 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,49 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+ char *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+ appendStringInfoString(buf, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+ char *path = GetRelationPath(xlrec->rnode.dbNode,
+ xlrec->rnode.spcNode,
+ xlrec->rnode.relNode,
+ InvalidBackendId,
+ xlrec->forkNum, xlrec->mark);
+ char *action;
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ action = "CREATE";
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ action = "DELETE";
+ break;
+ default:
+ action = "<unknown action>";
+ break;
+ }
+
+ appendStringInfo(buf, "%s %s", action, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+ char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+ appendStringInfoString(buf, path);
+ appendStringInfo(buf, " persistence %d", xlrec->persistence);
+ pfree(path);
+ }
}
const char *
@@ -55,6 +98,15 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_UNLINK:
+ id = "UNLINK";
+ break;
+ case XLOG_SMGR_MARK:
+ id = "MARK";
+ break;
+ case XLOG_SMGR_BUFPERSISTENCE:
+ id = "BUFPERSISTENCE";
+ break;
}
return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c1..b344bbe511 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -724,6 +724,14 @@ we must panic and abort recovery. The DBA will have to manually clean up
then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.
+The Smgr MARK files
+--------------------------------
+
+An smgr mark file is created when a new relation file is created to
+mark that the relfilenode needs to be cleaned up at recovery time. In
+contrast to the four actions above, failure to remove smgr mark files
+will lead to data loss, in which case the server will shut down.
+
Skipping WAL for New RelFileNode
--------------------------------
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e7b0bc804d..b41186d6d8 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2197,6 +2197,9 @@ CommitTransaction(void)
*/
smgrDoPendingSyncs(true, is_parallel_worker);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2447,6 +2450,9 @@ PrepareTransaction(void)
*/
smgrDoPendingSyncs(true, false);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2772,6 +2778,7 @@ AbortTransaction(void)
AfterTriggerEndXact(false); /* 'false' means it's abort */
AtAbort_Portals();
smgrDoPendingSyncs(false, is_parallel_worker);
+ smgrDoPendingCleanups(false);
AtEOXact_LargeObject(false);
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1e1fbe957f..59f4c2eacf 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -40,6 +40,7 @@
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
#include "catalog/pg_database.h"
+#include "catalog/storage.h"
#include "commands/progress.h"
#include "commands/tablespace.h"
#include "common/controldata_utils.h"
@@ -4564,6 +4565,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
{
ereport(DEBUG1,
(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+
+ /* cleanup garbage files left during crash recovery */
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
InArchiveRecovery = true;
if (StandbyModeRequested)
StandbyMode = true;
@@ -7824,6 +7833,14 @@ StartupXLOG(void)
}
}
+ /* cleanup garbage files left during crash recovery */
+ if (!InArchiveRecovery)
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
/* Allow resource managers to do any required cleanup. */
for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
{
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index c5ad28d71f..d6b30387e9 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,6 +19,7 @@
#include "postgres.h"
+#include "access/amapi.h"
#include "access/parallel.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -66,6 +67,23 @@ typedef struct PendingRelDelete
struct PendingRelDelete *next; /* linked-list link */
} PendingRelDelete;
+#define PCOP_UNLINK_FORK (1 << 0)
+#define PCOP_UNLINK_MARK (1 << 1)
+#define PCOP_SET_PERSISTENCE (1 << 2)
+
+typedef struct PendingCleanup
+{
+ RelFileNode relnode; /* relation that may need to be deleted */
+ int op; /* operation mask */
+ bool bufpersistence; /* buffer persistence to set */
+ int unlink_forknum; /* forknum to unlink */
+ StorageMarks unlink_mark; /* mark to unlink */
+ BackendId backend; /* InvalidBackendId if not a temp rel */
+ bool atCommit; /* T=delete at commit; F=delete at abort */
+ int nestLevel; /* xact nesting level of request */
+ struct PendingCleanup *next; /* linked-list link */
+} PendingCleanup;
+
typedef struct PendingRelSync
{
RelFileNode rnode;
@@ -73,6 +91,7 @@ typedef struct PendingRelSync
} PendingRelSync;
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingCleanup *pendingCleanups = NULL; /* head of linked list */
HTAB *pendingSyncHash = NULL;
@@ -117,7 +136,8 @@ AddPendingSync(const RelFileNode *rnode)
SMgrRelation
RelationCreateStorage(RelFileNode rnode, char relpersistence)
{
- PendingRelDelete *pending;
+ PendingRelDelete *pendingdel;
+ PendingCleanup *pendingclean;
SMgrRelation srel;
BackendId backend;
bool needs_wal;
@@ -143,21 +163,41 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return NULL; /* placate compiler */
}
+ /*
+ * We are going to create a new storage file. If the server crashes before
+ * the current transaction ends, the file needs to be cleaned up. The
+ * SMGR_MARK_UNCOMMITTED mark file prompts that work at the next startup.
+ */
srel = smgropen(rnode, backend);
+ log_smgrcreatemark(&rnode, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
smgrcreate(srel, MAIN_FORKNUM, false);
if (needs_wal)
log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM);
/* Add the relation to the list of stuff to delete at abort */
- pending = (PendingRelDelete *)
+ pendingdel = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
- pending->relnode = rnode;
- pending->backend = backend;
- pending->atCommit = false; /* delete if abort */
- pending->nestLevel = GetCurrentTransactionNestLevel();
- pending->next = pendingDeletes;
- pendingDeletes = pending;
+ pendingdel->relnode = rnode;
+ pendingdel->backend = backend;
+ pendingdel->atCommit = false; /* delete if abort */
+ pendingdel->nestLevel = GetCurrentTransactionNestLevel();
+ pendingdel->next = pendingDeletes;
+ pendingDeletes = pendingdel;
+
+ /* drop mark files at commit */
+ pendingclean = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pendingclean->relnode = rnode;
+ pendingclean->op = PCOP_UNLINK_MARK;
+ pendingclean->unlink_forknum = MAIN_FORKNUM;
+ pendingclean->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pendingclean->backend = backend;
+ pendingclean->atCommit = true;
+ pendingclean->nestLevel = GetCurrentTransactionNestLevel();
+ pendingclean->next = pendingCleanups;
+ pendingCleanups = pendingclean;
if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
{
@@ -168,6 +208,203 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return srel;
}
+/*
+ * RelationCreateInitFork
+ * Create physical storage for the init fork of a relation.
+ *
+ * Create the init fork for the relation.
+ *
+ * This function is transactional. The creation is WAL-logged, and if the
+ * transaction aborts later on, the init fork will be removed.
+ */
+void
+RelationCreateInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ SMgrRelation srel;
+ bool create = true;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), false, false);
+
+ /*
+ * If we have entries for init-fork operations on this relation, that means
+ * that we have already registered pending delete entries to drop an
+ * init-fork preexisting since before the current transaction started. This
+ * function reverts that change just by removing the entries.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ create = false;
+ }
+ else
+ prev = pending;
+ }
+
+ if (!create)
+ return;
+
+ /*
+ * We are going to create an init fork. If the server crashes before the
+ * current transaction ends, an init fork left behind would corrupt data
+ * during recovery. The mark file works as the sentinel to identify that
+ * situation.
+ */
+ srel = smgropen(rnode, InvalidBackendId);
+ log_smgrcreatemark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+
+ /* We don't have existing init fork, create it. */
+ smgrcreate(srel, INIT_FORKNUM, false);
+
+ /*
+ * An index init fork needs further initialization. ambuildempty should do
+ * the WAL-logging and file sync by itself; otherwise we do that ourselves.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rnode, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /* drop the init fork, mark file and revert persistence at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_UNLINK_FORK | PCOP_UNLINK_MARK | PCOP_SET_PERSISTENCE;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->bufpersistence = true;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* drop mark file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
+/*
+ * RelationDropInitFork
+ * Delete physical storage for the init fork of a relation.
+ *
+ * Register pending-delete of the init fork. The real deletion is performed by
+ * smgrDoPendingCleanups at commit.
+ *
+ * This function is transactional. If the transaction aborts later on, the
+ * deletion doesn't happen.
+ */
+void
+RelationDropInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ bool inxact_created = false;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), true, false);
+
+ /*
+ * If we have entries for init-fork operations of this relation, that means
+ * that we have created the init fork in the current transaction. We
+ * remove the init fork and mark file immediately in that case. Otherwise
+ * just register pending-delete for the existing init fork.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ /* unlink list entry */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ inxact_created = true;
+ }
+ else
+ prev = pending;
+ }
+
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rnode, InvalidBackendId);
+
+ /*
+ * INIT forks are never loaded into shared buffers, so there is no
+ * point in dropping buffers for such files.
+ */
+ log_smgrunlinkmark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrunlinkmark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+ log_smgrunlink(&rnode, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ return;
+ }
+
+ /* register drop of this init fork file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_UNLINK_FORK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = false;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -187,6 +424,88 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum)
+{
+ xl_smgr_unlink xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file unlink.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_CREATEMARK record to WAL.
+ */
+void
+log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file creation.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_CREATE;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINKMARK record to WAL.
+ */
+void
+log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the mark file removal.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_UNLINK;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileNode *rnode, bool persistence)
+{
+ xl_smgr_bufpersistence xlrec;
+
+ /*
+ * Make an XLOG entry reporting the change of buffer persistence.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.persistence = persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -255,6 +574,7 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)
prev->next = next;
else
pendingDeletes = next;
+
pfree(pending);
/* prev does not change */
}
@@ -673,6 +993,88 @@ smgrDoPendingDeletes(bool isCommit)
}
}
+/*
+ * smgrDoPendingCleanups() -- Clean up work that emits WAL records
+ *
+ * The operations handled in this function emit WAL records, which must be
+ * emitted before the commit record for the current transaction.
+ */
+void
+smgrDoPendingCleanups(bool isCommit)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (pending->nestLevel < nestLevel)
+ {
+ /* outer-level entries should not be processed yet */
+ prev = pending;
+ }
+ else
+ {
+ /* unlink list entry first, so we don't retry on failure */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ /* do cleanup if called for */
+ if (pending->atCommit == isCommit)
+ {
+ SMgrRelation srel;
+
+ srel = smgropen(pending->relnode, pending->backend);
+
+ Assert ((pending->op &
+ ~(PCOP_UNLINK_FORK | PCOP_UNLINK_MARK |
+ PCOP_SET_PERSISTENCE)) == 0);
+
+ if (pending->op & PCOP_UNLINK_FORK)
+ {
+ /* unlinking any other fork would also require dropping its buffers */
+ Assert(pending->unlink_forknum == INIT_FORKNUM);
+
+ /* Don't emit WAL during recovery. */
+ if (!InRecovery)
+ log_smgrunlink(&pending->relnode,
+ pending->unlink_forknum);
+ smgrunlink(srel, pending->unlink_forknum, false);
+ }
+
+ if (pending->op & PCOP_UNLINK_MARK)
+ {
+ SMgrRelation srel;
+
+ if (!InRecovery)
+ log_smgrunlinkmark(&pending->relnode,
+ pending->unlink_forknum,
+ pending->unlink_mark);
+ srel = smgropen(pending->relnode, pending->backend);
+ smgrunlinkmark(srel, pending->unlink_forknum,
+ pending->unlink_mark, InRecovery);
+ smgrclose(srel);
+ }
+
+ if (pending->op & PCOP_SET_PERSISTENCE)
+ {
+ SetRelationBuffersPersistence(srel, pending->bufpersistence,
+ InRecovery);
+ }
+ }
+
+ /* must explicitly free the list entry */
+ pfree(pending);
+ /* prev does not change */
+ }
+ }
+}
+
/*
* smgrDoPendingSyncs() -- Take care of relation syncs at end of xact.
*/
@@ -933,6 +1335,15 @@ smgr_redo(XLogReaderState *record)
reln = smgropen(xlrec->rnode, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ smgrunlink(reln, xlrec->forkNum, true);
+ smgrclose(reln);
+ }
else if (info == XLOG_SMGR_TRUNCATE)
{
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
@@ -1021,6 +1432,124 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ bool created = false;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ smgrcreatemark(reln, xlrec->forkNum, xlrec->mark, true);
+ created = true;
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ smgrunlinkmark(reln, xlrec->forkNum, xlrec->mark, true);
+ break;
+ default:
+ elog(ERROR, "unknown smgr_mark action \"%c\"", xlrec->action);
+ }
+
+ if (created)
+ {
+ /* revert mark file operation at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = xlrec->rnode;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = xlrec->forkNum;
+ pending->unlink_mark = xlrec->mark;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ else
+ {
+ /*
+ * Delete pending action for this mark file if any. We should have
+ * at most one entry for this action.
+ */
+ PendingCleanup *prev = NULL;
+
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileNodeEquals(xlrec->rnode, pending->relnode) &&
+ pending->unlink_forknum == xlrec->forkNum &&
+ (pending->op & PCOP_UNLINK_MARK) != 0)
+ {
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+ }
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ PendingCleanup *prev = NULL;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+
+ /*
+ * Delete pending action for persistence change if any. We should have
+ * at most one entry for this action.
+ */
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileNodeEquals(xlrec->rnode, pending->relnode) &&
+ (pending->op & PCOP_SET_PERSISTENCE) != 0)
+ {
+ Assert (pending->bufpersistence == xlrec->persistence);
+
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+
+ /*
+ * Revert buffer-persistence changes at abort if the relation is changing
+ * to a different persistence than it had before this transaction.
+ */
+ if (!pending)
+ {
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = xlrec->rnode;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = !xlrec->persistence;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 45e59e3d5c..41e77e1072 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -52,6 +52,7 @@
#include "commands/defrem.h"
#include "commands/event_trigger.h"
#include "commands/policy.h"
+#include "commands/progress.h"
#include "commands/sequence.h"
#include "commands/tablecmds.h"
#include "commands/tablespace.h"
@@ -5329,6 +5330,187 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
return newcmd;
}
+/*
+ * RelationChangePersistence: do in-place persistence change of a relation
+ */
+static void
+RelationChangePersistence(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ int i;
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * If any of the following were set, we would need to call ATRewriteTable.
+ * That cannot happen in the AT_REWRITE_ALTER_PERSISTENCE case.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ /*
+ * First we collect all relations that we need to change persistence.
+ */
+
+ /* Collect OIDs of indexes and toast relations */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ /* Add toast relation if any */
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ List *toastidx;
+ Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
+
+ relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
+ toastidx = RelationGetIndexList(toastrel);
+ relids = list_concat(relids, toastidx);
+ pfree(toastidx);
+ table_close(toastrel, NoLock);
+ }
+
+ table_close(rel, NoLock);
+
+ /* Make changes in storage */
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach (lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+
+ /*
+ * XXX: Some access methods cannot tolerate an in-place persistence
+ * change. Specifically, GiST uses page LSNs to figure out whether a
+ * block has changed, and UNLOGGED GiST indexes use fake LSNs that
+ * are incompatible with the real LSNs used for LOGGED ones.
+ *
+ * Maybe if gistGetFakeLSN behaved the same way for permanent and
+ * unlogged indexes, we could skip the index rebuild in exchange for
+ * some extra WAL records emitted while it is unlogged.
+ *
+ * Check relam against a positive list so that unknown AMs also take
+ * the rebuild path.
+ */
+ if (r->rd_rel->relkind == RELKIND_INDEX &&
+ /* GiST is excluded */
+ r->rd_rel->relam != BTREE_AM_OID &&
+ r->rd_rel->relam != HASH_AM_OID &&
+ r->rd_rel->relam != GIN_AM_OID &&
+ r->rd_rel->relam != SPGIST_AM_OID &&
+ r->rd_rel->relam != BRIN_AM_OID)
+ {
+ int reindex_flags;
+ ReindexParams params = {0};
+
+ /* reindex doesn't allow concurrent use of the index */
+ table_close(r, NoLock);
+
+ reindex_flags =
+ REINDEX_REL_SUPPRESS_INDEX_USE |
+ REINDEX_REL_CHECK_CONSTRAINTS;
+
+ /* Set the same persistence as the parent relation. */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+
+ reindex_index(reloid, reindex_flags, persistence, &params);
+
+ continue;
+ }
+
+ /* Create or drop init fork */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ RelationCreateInitFork(r);
+ else
+ RelationDropInitFork(r);
+
+ /*
+ * When this relation becomes WAL-logged, immediately sync all files
+ * except the init fork to establish the initial state on storage.
+ * Buffers have already been flushed out by RelationCreate(Drop)InitFork
+ * called just above. The init fork should have been synced as needed.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT)
+ {
+ for (i = 0 ; i < INIT_FORKNUM ; i++)
+ {
+ if (smgrexists(RelationGetSmgr(r), i))
+ smgrimmedsync(RelationGetSmgr(r), i);
+ }
+ }
+
+ /* Update catalog */
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+
+ /*
+ * While wal_level >= replica, switching to LOGGED requires the
+ * relation content to be WAL-logged so that the table can be recovered.
+ * We don't emit this while wal_level = minimal.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded())
+ {
+ ForkNumber fork;
+ xl_smgr_truncate xlrec;
+
+ xlrec.blkno = 0;
+ xlrec.rnode = r->rd_node;
+ xlrec.flags = SMGR_TRUNCATE_ALL;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ for (fork = 0; fork < INIT_FORKNUM ; fork++)
+ {
+ if (smgrexists(RelationGetSmgr(r), fork))
+ log_newpage_range(r, fork, 0,
+ smgrnblocks(RelationGetSmgr(r), fork),
+ false);
+ }
+ }
+
+ table_close(r, NoLock);
+ }
+
+ table_close(classRel, NoLock);
+}
+
/*
* ATRewriteTables: ALTER TABLE phase 3
*/
@@ -5459,47 +5641,55 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- /*
- * Create transient table that will receive the modified data.
- *
- * Ensure it is marked correctly as logged or unlogged. We have
- * to do this here so that buffers for the new relfilenode will
- * have the right persistence set, and at the same time ensure
- * that the original filenode's buffers will get read in with the
- * correct setting (i.e. the original one). Otherwise a rollback
- * after the rewrite would possibly result with buffers for the
- * original filenode having the wrong persistence setting.
- *
- * NB: This relies on swap_relation_files() also swapping the
- * persistence. That wouldn't work for pg_class, but that can't be
- * unlogged anyway.
- */
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
- persistence, lockmode);
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
+ RelationChangePersistence(tab, persistence, lockmode);
+ else
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new relfilenode
+ * will have the right persistence set, and at the same time
+ * ensure that the original filenode's buffers will get read in
+ * with the correct setting (i.e. the original one). Otherwise
+ * a rollback after the rewrite would possibly result with
+ * buffers for the original filenode having the wrong
+ * persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that can't
+ * be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace,
+ NewAccessMethod,
+ persistence, lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
+ /*
+ * Swap the physical files of the old and new heaps, then
+ * rebuild indexes and discard the old heap. We can use
+ * RecentXmin for the table's new relfrozenxid because we
+ * rewrote all the tuples in ATRewriteTable, so no older Xid
+ * remains in the table. Also, we never try to swap toast
+ * tables by content, since we have no interest in letting this
+ * code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
- InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ }
}
else
{
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index ec0485705d..45e1a5d817 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1070,6 +1070,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
bool excludeFound;
ForkNumber relForkNum; /* Type of fork if file is a relation */
int relOidChars; /* Chars in filename that are the rel oid */
+ StorageMarks mark;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1120,7 +1121,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
/* Exclude all forks for unlogged tables except the init fork */
if (isDbDir &&
parse_filename_for_nontemp_relation(de->d_name, &relOidChars,
- &relForkNum))
+ &relForkNum, &mark))
{
/* Never exclude init forks */
if (relForkNum != INIT_FORKNUM)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index b4532948d3..dab74bf99a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -37,6 +37,7 @@
#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
#include "executor/instrument.h"
#include "lib/binaryheap.h"
#include "miscadmin.h"
@@ -3154,6 +3155,93 @@ DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
}
}
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a relation.
+ * When switching to PERMANENT, it also writes all dirty pages of the
+ * relation out to disk (or more accurately, out to kernel disk buffers),
+ * ensuring that the kernel has an up-to-date view of the relation.
+ *
+ * Generally, the caller should be holding AccessExclusiveLock on the
+ * target relation to ensure that no other backend is busy dirtying
+ * more blocks of the relation; the effects can't be expected to last
+ * after the lock is released.
+ *
+ * XXX currently it sequentially searches the buffer pool, should be
+ * changed to more clever ways of searching. This routine is not
+ * used in any performance-critical code paths, so it's not worth
+ * adding additional overhead to normal paths to make it go faster;
+ * but see also DropRelFileNodeBuffers.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
+{
+ int i;
+ RelFileNodeBackend rnode = srel->smgr_rnode;
+
+ Assert (!RelFileNodeBackendIsTemp(rnode));
+
+ if (!isRedo)
+ log_smgrbufpersistence(&srel->smgr_rnode.node, permanent);
+
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ {
+ UnlockBufHdr(bufHdr, buf_state);
+ continue;
+ }
+
+ if (permanent)
+ {
+ /* Init fork is being dropped, drop buffers for it. */
+ if (bufHdr->tag.forkNum == INIT_FORKNUM)
+ {
+ InvalidateBuffer(bufHdr);
+ continue;
+ }
+
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /* we flush this buffer when switching to PERMANENT */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, srel);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr, true);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ /* init fork is always BM_PERMANENT. See BufferAlloc */
+ if (bufHdr->tag.forkNum != INIT_FORKNUM)
+ buf_state &= ~BM_PERMANENT;
+
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ }
+}
+
/* ---------------------------------------------------------------------
* DropRelFileNodesAllBuffers
*
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 263057841d..8487ae1f02 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -349,8 +349,6 @@ static void pre_sync_fname(const char *fname, bool isdir, int elevel);
static void datadir_fsync_fname(const char *fname, bool isdir, int elevel);
static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel);
-static int fsync_parent_path(const char *fname, int elevel);
-
/*
* pg_fsync --- do fsync with or without writethrough
@@ -3759,7 +3757,7 @@ fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel)
* This is aimed at making file operations persistent on disk in case of
* an OS crash or power failure.
*/
-static int
+int
fsync_parent_path(const char *fname, int elevel)
{
char parentpath[MAXPGPATH];
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 0ae3fb6902..0137902bb2 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -16,29 +16,49 @@
#include <unistd.h>
+#include "access/xlog.h"
+#include "catalog/pg_tablespace_d.h"
#include "common/relpath.h"
#include "postmaster/startup.h"
+#include "storage/bufmgr.h"
#include "storage/copydir.h"
#include "storage/fd.h"
+#include "storage/md.h"
#include "storage/reinit.h"
+#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
- int op);
+ Oid tspid, int op);
static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
- int op);
+ Oid tspid, Oid dbid, int op);
typedef struct
{
Oid reloid; /* hash key */
-} unlogged_relation_entry;
+ bool has_init; /* has INIT fork */
+ bool dirty_init; /* needs to remove INIT fork */
+ bool dirty_all; /* needs to remove all forks */
+} relfile_entry;
/*
- * Reset unlogged relations from before the last restart.
+ * Clean up and reset relation files from before the last restart.
*
- * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any
- * relation with an "init" fork, except for the "init" fork itself.
+ * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations
+ * depending on the existence of the "cleanup" forks.
+ *
+ * If SMGR_MARK_UNCOMMITTED mark file for init fork is present, we remove the
+ * init fork along with the mark file.
+ *
+ * If SMGR_MARK_UNCOMMITTED mark file for main fork is present, we remove the
+ * whole relation along with the mark file.
+ *
+ * Otherwise, if the "init" fork is found, we remove all forks of any relation
+ * with the "init" fork, except for the "init" fork itself.
+ *
+ * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all
+ * relations that have the "cleanup" and/or the "init" forks.
*
* If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main
* fork.
@@ -72,7 +92,7 @@ ResetUnloggedRelations(int op)
/*
* First process unlogged files in pg_default ($PGDATA/base)
*/
- ResetUnloggedRelationsInTablespaceDir("base", op);
+ ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op);
/*
* Cycle through directories for all non-default tablespaces.
@@ -81,13 +101,19 @@ ResetUnloggedRelations(int op)
while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
{
+ Oid tspid;
+
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
continue;
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
- ResetUnloggedRelationsInTablespaceDir(temp_path, op);
+
+ tspid = atooid(spc_de->d_name);
+
+ Assert(tspid != 0);
+ ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op);
}
FreeDir(spc_dir);
@@ -103,7 +129,8 @@ ResetUnloggedRelations(int op)
* Process one tablespace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
+ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
+ Oid tspid, int op)
{
DIR *ts_dir;
struct dirent *de;
@@ -130,6 +157,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
while ((de = ReadDir(ts_dir, tsdirname)) != NULL)
{
+ Oid dbid;
+
/*
* We're only interested in the per-database directories, which have
* numeric names. Note that this code will also (properly) ignore "."
@@ -148,7 +177,10 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
ereport_startup_progress("resetting unlogged relations (cleanup), elapsed time: %ld.%02d s, current path: %s",
dbspace_path);
- ResetUnloggedRelationsInDbspaceDir(dbspace_path, op);
+ dbid = atooid(de->d_name);
+ Assert(dbid != 0);
+
+ ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op);
}
FreeDir(ts_dir);
@@ -158,125 +190,227 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
* Process one per-dbspace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
+ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
+ Oid tspid, Oid dbid, int op)
{
DIR *dbspace_dir;
struct dirent *de;
char rm_path[MAXPGPATH * 2];
+ HTAB *hash;
+ HASHCTL ctl;
/* Caller must specify at least one operation. */
- Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0);
+ Assert((op & (UNLOGGED_RELATION_CLEANUP |
+ UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_INIT)) != 0);
/*
* Cleanup is a two-pass operation. First, we go through and identify all
* the files with init forks. Then, we go through again and nuke
* everything with the same OID except the init fork.
*/
+
+ /*
+ * It's possible that someone could create tons of unlogged relations in
+ * the same database & tablespace, so we'd better use a hash table rather
+ * than an array or linked list to keep track of which files need to be
+ * reset. Otherwise, this cleanup operation would be O(n^2).
+ */
+ memset(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(relfile_entry);
+ hash = hash_create("relfilenode cleanup hash",
+ 32, &ctl, HASH_ELEM | HASH_BLOBS);
+
+ /* Collect INIT fork and mark files in the directory. */
+ dbspace_dir = AllocateDir(dbspacedirname);
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
+ {
+ int oidchars;
+ ForkNumber forkNum;
+ StorageMarks mark;
+
+ /* Skip anything that doesn't look like a relation data file. */
+ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
+ &forkNum, &mark))
+ continue;
+
+ if (forkNum == INIT_FORKNUM || mark == SMGR_MARK_UNCOMMITTED)
+ {
+ Oid key;
+ relfile_entry *ent;
+ bool found;
+
+ /*
+ * Record the relfilenode information. If it has
+ * SMGR_MARK_UNCOMMITTED mark files, the relfilenode is in dirty
+ * state, where clean up is needed.
+ */
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_ENTER, &found);
+
+ if (!found)
+ {
+ ent->has_init = false;
+ ent->dirty_init = false;
+ ent->dirty_all = false;
+ }
+
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_init = true;
+ else if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_all = true;
+ else
+ {
+ Assert(forkNum == INIT_FORKNUM);
+ ent->has_init = true;
+ }
+ }
+ }
+
+ /* Done with the first pass. */
+ FreeDir(dbspace_dir);
+
+ /* nothing to do if we have neither init forks nor mark files */
+ if (hash_get_num_entries(hash) < 1)
+ {
+ hash_destroy(hash);
+ return;
+ }
+
+ if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0)
+ {
+ /*
+ * When we come here after recovery, an smgr object for this file might
+ * have been created. In that case we need to drop all buffers and then
+ * the smgr object before initializing the unlogged relation. This is
+ * safe as long as no other backends have accessed the relation before
+ * starting archive recovery.
+ */
+ HASH_SEQ_STATUS status;
+ relfile_entry *ent;
+ SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8);
+ int maxrels = 8;
+ int nrels = 0;
+ int i;
+
+ Assert(!HotStandbyActive());
+
+ hash_seq_init(&status, hash);
+ while((ent = (relfile_entry *) hash_seq_search(&status)) != NULL)
+ {
+ RelFileNodeBackend rel;
+
+ /*
+ * The relation is persistent and remains persistent. Don't drop
+ * the buffers for this relation.
+ */
+ if (ent->has_init && ent->dirty_init)
+ continue;
+
+ if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = ent->reloid;
+
+ srels[nrels++] = smgropen(rel.node, InvalidBackendId);
+ }
+
+ DropRelFileNodesAllBuffers(srels, nrels);
+
+ for (i = 0 ; i < nrels ; i++)
+ smgrclose(srels[i]);
+ }
+
+ /*
+ * Now, make a second pass and remove anything that matches.
+ */
if ((op & UNLOGGED_RELATION_CLEANUP) != 0)
{
- HTAB *hash;
- HASHCTL ctl;
-
- /*
- * It's possible that someone could create a ton of unlogged relations
- * in the same database & tablespace, so we'd better use a hash table
- * rather than an array or linked list to keep track of which files
- * need to be reset. Otherwise, this cleanup operation would be
- * O(n^2).
- */
- ctl.keysize = sizeof(Oid);
- ctl.entrysize = sizeof(unlogged_relation_entry);
- ctl.hcxt = CurrentMemoryContext;
- hash = hash_create("unlogged relation OIDs", 32, &ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- /* Scan the directory. */
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ int oidchars;
+ Oid key;
+ relfile_entry *ent;
+ RelFileNodeBackend rel;
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* Also skip it unless this is the init fork. */
- if (forkNum != INIT_FORKNUM)
- continue;
-
- /*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
- */
- ent.reloid = atooid(de->d_name);
- (void) hash_search(hash, &ent, HASH_ENTER, NULL);
- }
-
- /* Done with the first pass. */
- FreeDir(dbspace_dir);
-
- /*
- * If we didn't find any init forks, there's no point in continuing;
- * we can bail out now.
- */
- if (hash_get_num_entries(hash) == 0)
- {
- hash_destroy(hash);
- return;
- }
-
- /*
- * Now, make a second pass and remove anything that matches.
- */
- dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
- {
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
-
- /* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* We never remove the init fork. */
- if (forkNum == INIT_FORKNUM)
+ &forkNum, &mark))
continue;
/*
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
- if (hash_search(hash, &ent, HASH_FIND, NULL))
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);
+
+ if (!ent)
+ continue;
+
+ if (!ent->dirty_all)
{
- snprintf(rm_path, sizeof(rm_path), "%s/%s",
- dbspacedirname, de->d_name);
- if (unlink(rm_path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m",
- rm_path)));
+ /* clean permanent relations don't need cleanup */
+ if (!ent->has_init)
+ continue;
+
+ if (ent->dirty_init)
+ {
+ /*
+ * The crashed transaction did SET UNLOGGED. This relation
+ * is restored to a LOGGED relation.
+ */
+ if (forkNum != INIT_FORKNUM)
+ continue;
+ }
else
- elog(DEBUG2, "unlinked file \"%s\"", rm_path);
+ {
+ /*
+ * we don't remove the INIT fork of a non-dirty
+ * relfilenode
+ */
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE)
+ continue;
+ }
}
+
+ /* so, nuke it! */
+ snprintf(rm_path, sizeof(rm_path), "%s/%s",
+ dbspacedirname, de->d_name);
+ if (unlink(rm_path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ rm_path)));
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = atooid(de->d_name);
+
+ ForgetRelationForkSyncRequests(rel, forkNum);
}
/* Cleanup is complete. */
FreeDir(dbspace_dir);
- hash_destroy(hash);
}
+ hash_destroy(hash);
+ hash = NULL;
+
/*
* Initialization happens after cleanup is complete: we copy each init
- * fork file to the corresponding main fork file. Note that if we are
- * asked to do both cleanup and init, we may never get here: if the
- * cleanup code determines that there are no init forks in this dbspace,
- * it will return before we get to this point.
+ * fork file to the corresponding main fork file.
*/
if ((op & UNLOGGED_RELATION_INIT) != 0)
{
@@ -285,6 +419,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char srcpath[MAXPGPATH * 2];
@@ -292,9 +427,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -328,15 +465,18 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char mainpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -379,7 +519,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
*/
bool
parse_filename_for_nontemp_relation(const char *name, int *oidchars,
- ForkNumber *fork)
+ ForkNumber *fork, StorageMarks *mark)
{
int pos;
@@ -410,11 +550,19 @@ parse_filename_for_nontemp_relation(const char *name, int *oidchars,
for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar)
;
- if (segchar <= 1)
- return false;
- pos += segchar;
+ if (segchar > 1)
+ pos += segchar;
}
+ /* mark file? */
+ if (name[pos] == '.' && name[pos + 1] != 0)
+ {
+ *mark = name[pos + 1];
+ pos += 2;
+ }
+ else
+ *mark = SMGR_MARK_NONE;
+
/* Now we should be at the end. */
if (name[pos] != '\0')
return false;
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index b4bca7eed6..580b74839f 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -139,7 +139,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
BlockNumber blkno, bool skipFsync, int behavior);
static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
MdfdVec *seg);
-
+static bool mdmarkexists(SMgrRelation reln, ForkNumber forkNum,
+ StorageMarks mark);
/*
* mdinit() -- Initialize private state for magnetic disk storage manager.
@@ -169,6 +170,81 @@ mdexists(SMgrRelation reln, ForkNumber forkNum)
return (mdopenfork(reln, forkNum, EXTENSION_RETURN_NULL) != NULL);
}
+/*
+ * mdcreatemark() -- Create a mark file.
+ *
+ * If isRedo is true, it's okay for the file to exist already.
+ */
+void
+mdcreatemark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+ int fd;
+
+ /* See mdcreate for details. */
+ TablespaceCreateDbspace(reln->smgr_rnode.node.spcNode,
+ reln->smgr_rnode.node.dbNode,
+ isRedo);
+
+ fd = BasicOpenFile(path, O_WRONLY | O_CREAT | O_EXCL);
+ if (fd < 0 && (!isRedo || errno != EEXIST))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create mark file \"%s\": %m", path)));
+
+ pg_fsync(fd);
+ close(fd);
+
+ /*
+ * To guarantee that the creation of the file is persistent, fsync its
+ * parent directory.
+ */
+ fsync_parent_path(path, ERROR);
+
+ pfree(path);
+}
+
+
+/*
+ * mdunlinkmark() -- Delete the mark file
+ *
+ * If isRedo is true, it's okay if the file is not found.
+ */
+void
+mdunlinkmark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+
+ if (!isRedo || mdmarkexists(reln, forkNum, mark))
+ durable_unlink(path, ERROR);
+
+ pfree(path);
+}
+
+/*
+ * mdmarkexists() -- Check if the file exists.
+ */
+static bool
+mdmarkexists(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+ int fd;
+
+ fd = BasicOpenFile(path, O_RDONLY);
+ if (fd < 0 && errno != ENOENT)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not access mark file \"%s\": %m", path)));
+ pfree(path);
+
+ if (fd < 0)
+ return false;
+ close(fd);
+ return true;
+}
+
/*
* mdcreate() -- Create a new relation on magnetic disk.
*
@@ -1025,6 +1101,15 @@ register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum,
RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ );
}
+/*
+ * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork
+ */
+void
+ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, ForkNumber forknum)
+{
+ register_forget_request(rnode, forknum, 0);
+}
+
/*
* ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB
*/
@@ -1378,12 +1463,14 @@ mdsyncfiletag(const FileTag *ftag, char *path)
* Return 0 on success, -1 on failure, with errno set.
*/
int
-mdunlinkfiletag(const FileTag *ftag, char *path)
+mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark)
{
char *p;
/* Compute the path. */
- p = relpathperm(ftag->rnode, MAIN_FORKNUM);
+ p = GetRelationPath(ftag->rnode.dbNode, ftag->rnode.spcNode,
+ ftag->rnode.relNode, InvalidBackendId, MAIN_FORKNUM,
+ mark);
strlcpy(path, p, MAXPGPATH);
pfree(p);
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 0fcef4994b..110e64b0b2 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -62,6 +62,10 @@ typedef struct f_smgr
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+ void (*smgr_createmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+ void (*smgr_unlinkmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
} f_smgr;
static const f_smgr smgrsw[] = {
@@ -82,6 +86,8 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
+ .smgr_createmark = mdcreatemark,
+ .smgr_unlinkmark = mdunlinkmark,
}
};
@@ -335,6 +341,26 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo);
}
+/*
+ * smgrcreatemark() -- Create a mark file
+ */
+void
+smgrcreatemark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_createmark(reln, forknum, mark, isRedo);
+}
+
+/*
+ * smgrunlinkmark() -- Delete a mark file
+ */
+void
+smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlinkmark(reln, forknum, mark, isRedo);
+}
+
/*
* smgrdosyncall() -- Immediately sync all forks of all given relations
*
@@ -662,6 +688,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index d4083e8a56..9563940d45 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -89,7 +89,8 @@ static CycleCtr checkpoint_cycle_ctr = 0;
typedef struct SyncOps
{
int (*sync_syncfiletag) (const FileTag *ftag, char *path);
- int (*sync_unlinkfiletag) (const FileTag *ftag, char *path);
+ int (*sync_unlinkfiletag) (const FileTag *ftag, char *path,
+ StorageMarks mark);
bool (*sync_filetagmatches) (const FileTag *ftag,
const FileTag *candidate);
} SyncOps;
@@ -222,7 +223,8 @@ SyncPostCheckpoint(void)
/* Unlink the file */
if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
- path) < 0)
+ path,
+ SMGR_MARK_NONE) < 0)
{
/*
* There's a race condition, when the database is dropped at the
@@ -236,6 +238,20 @@ SyncPostCheckpoint(void)
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m", path)));
}
+ else if (syncsw[entry->tag.handler].sync_unlinkfiletag(
+ &entry->tag, path,
+ SMGR_MARK_UNCOMMITTED) < 0)
+ {
+ /*
+ * We may also have an SMGR_MARK_UNCOMMITTED file. Remove it if the
+ * fork file has been successfully removed. It's OK if the file
+ * does not exist.
+ */
+ if (errno != ENOENT)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
/* Mark the list entry as canceled, just in case */
entry->canceled = true;
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 436df54120..dbc0da5da5 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -407,6 +407,30 @@ extractPageInfo(XLogReaderState *record)
* source system.
*/
}
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_UNLINK)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_MARK)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
else if (rmid == RM_XACT_ID &&
((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT ||
(rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED ||
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 1f5c426ec0..4945b111cc 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -139,9 +139,15 @@ GetDatabasePath(Oid dbNode, Oid spcNode)
*/
char *
GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
- int backendId, ForkNumber forkNumber)
+ int backendId, ForkNumber forkNumber, char mark)
{
char *path;
+ char markstr[4];
+
+ if (mark == 0)
+ markstr[0] = 0;
+ else
+ snprintf(markstr, sizeof(markstr), ".%c", mark);
if (spcNode == GLOBALTABLESPACE_OID)
{
@@ -149,10 +155,10 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
Assert(dbNode == 0);
Assert(backendId == InvalidBackendId);
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("global/%u_%s",
- relNode, forkNames[forkNumber]);
+ path = psprintf("global/%u_%s%s",
+ relNode, forkNames[forkNumber], markstr);
else
- path = psprintf("global/%u", relNode);
+ path = psprintf("global/%u%s", relNode, markstr);
}
else if (spcNode == DEFAULTTABLESPACE_OID)
{
@@ -160,22 +166,22 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/%u_%s",
+ path = psprintf("base/%u/%u_%s%s",
dbNode, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/%u",
- dbNode, relNode);
+ path = psprintf("base/%u/%u%s",
+ dbNode, relNode, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/t%d_%u_%s",
+ path = psprintf("base/%u/t%d_%u_%s%s",
dbNode, backendId, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/t%d_%u",
- dbNode, backendId, relNode);
+ path = psprintf("base/%u/t%d_%u%s",
+ dbNode, backendId, relNode, markstr);
}
}
else
@@ -184,27 +190,28 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u_%s%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
dbNode, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
- dbNode, relNode);
+ dbNode, relNode, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
dbNode, backendId, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
- dbNode, backendId, relNode);
+ dbNode, backendId, relNode, markstr);
}
}
+
return path;
}
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 0ab32b44e9..584ebac391 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -23,6 +23,8 @@
extern int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
+extern void RelationCreateInitFork(Relation rel);
+extern void RelationDropInitFork(Relation rel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationPreTruncate(Relation rel);
@@ -41,6 +43,7 @@ extern void RestorePendingSyncs(char *startAddress);
extern void smgrDoPendingDeletes(bool isCommit);
extern void smgrDoPendingSyncs(bool isCommit, bool isParallelWorker);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingCleanups(bool isCommit);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index f0814f1458..12346ed7f6 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -18,17 +18,23 @@
#include "lib/stringinfo.h"
#include "storage/block.h"
#include "storage/relfilenode.h"
+#include "storage/smgr.h"
/*
* Declarations for smgr-related XLOG records
*
- * Note: we log file creation and truncation here, but logging of deletion
- * actions is handled by xact.c, because it is part of transaction commit.
+ * Note: we log file creation, truncation and buffer persistence change here,
+ * but logging of deletion actions is handled mainly by xact.c, because it is
+ * part of transaction commit in most cases. However, there's a case where
+ * init forks are deleted outside the control of a transaction.
*/
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_UNLINK 0x30
+#define XLOG_SMGR_MARK 0x40
+#define XLOG_SMGR_BUFPERSISTENCE 0x50
typedef struct xl_smgr_create
{
@@ -36,6 +42,32 @@ typedef struct xl_smgr_create
ForkNumber forkNum;
} xl_smgr_create;
+typedef struct xl_smgr_unlink
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+} xl_smgr_unlink;
+
+typedef enum smgr_mark_action
+{
+ XLOG_SMGR_MARK_CREATE = 'c',
+ XLOG_SMGR_MARK_UNLINK = 'u'
+} smgr_mark_action;
+
+typedef struct xl_smgr_mark
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ smgr_mark_action action;
+} xl_smgr_mark;
+
+typedef struct xl_smgr_bufpersistence
+{
+ RelFileNode rnode;
+ bool persistence;
+} xl_smgr_bufpersistence;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +83,12 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark);
+extern void log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark);
+extern void log_smgrbufpersistence(const RelFileNode *rnode, bool persistence);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index a44be11ca0..106a5cf508 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -67,7 +67,7 @@ extern int forkname_chars(const char *str, ForkNumber *fork);
extern char *GetDatabasePath(Oid dbNode, Oid spcNode);
extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
- int backendId, ForkNumber forkNumber);
+ int backendId, ForkNumber forkNumber, char mark);
/*
* Wrapper macros for GetRelationPath. Beware of multiple
@@ -77,7 +77,7 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
/* First argument is a RelFileNode */
#define relpathbackend(rnode, backend, forknum) \
GetRelationPath((rnode).dbNode, (rnode).spcNode, (rnode).relNode, \
- backend, forknum)
+ backend, forknum, 0)
/* First argument is a RelFileNode */
#define relpathperm(rnode, forknum) \
@@ -87,4 +87,9 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
#define relpath(rnode, forknum) \
relpathbackend((rnode).node, (rnode).backend, forknum)
+/* First argument is a RelFileNodeBackend */
+#define markpath(rnode, forknum, mark) \
+ GetRelationPath((rnode).node.dbNode, (rnode).node.spcNode, \
+ (rnode).node.relNode, \
+ (rnode).backend, forknum, mark)
#endif /* RELPATH_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index cfce23ecbc..f5a7df87a4 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -206,6 +206,8 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels)
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+ bool permanent, bool isRedo);
extern void DropRelFileNodesAllBuffers(struct SMgrRelationData **smgr_reln, int nnodes);
extern void DropDatabaseBuffers(Oid dbid);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 34602ae006..2dc0357ad5 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -185,6 +185,7 @@ extern ssize_t pg_pwritev_with_retry(int fd,
extern int pg_truncate(const char *path, off_t length);
extern void fsync_fname(const char *fname, bool isdir);
extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
+extern int fsync_parent_path(const char *fname, int elevel);
extern int durable_rename(const char *oldfile, const char *newfile, int loglevel);
extern int durable_unlink(const char *fname, int loglevel);
extern int durable_rename_excl(const char *oldfile, const char *newfile, int loglevel);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 752b440864..99620816b5 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -23,6 +23,10 @@
extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
extern void mdclose(SMgrRelation reln, ForkNumber forknum);
+extern void mdcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void mdunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
@@ -41,12 +45,14 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void ForgetRelationForkSyncRequests(RelFileNodeBackend rnode,
+ ForkNumber forknum);
extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
/* md sync callbacks */
extern int mdsyncfiletag(const FileTag *ftag, char *path);
-extern int mdunlinkfiletag(const FileTag *ftag, char *path);
+extern int mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark);
extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *candidate);
#endif /* MD_H */
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index fad1e5c473..e1f97e9b89 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -16,13 +16,15 @@
#define REINIT_H
#include "common/relpath.h"
-
+#include "storage/smgr.h"
extern void ResetUnloggedRelations(int op);
-extern bool parse_filename_for_nontemp_relation(const char *name,
- int *oidchars, ForkNumber *fork);
+extern bool parse_filename_for_nontemp_relation(const char *name, int *oidchars,
+ ForkNumber *fork,
+ StorageMarks *mark);
#define UNLOGGED_RELATION_CLEANUP 0x0001
-#define UNLOGGED_RELATION_INIT 0x0002
+#define UNLOGGED_RELATION_DROP_BUFFER 0x0002
+#define UNLOGGED_RELATION_INIT 0x0004
#endif /* REINIT_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a6fbf7b6a6..201ecace8a 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,18 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
+/*
+ * A storage mark is a file whose existence indicates something about another
+ * file. Such files are named "<filename>.<mark>", where the mark is one of the
+ * values of StorageMarks. Since ".<digit>" denotes a segment file, don't use
+ * digits for the mark character.
+ */
+typedef enum StorageMarks
+{
+ SMGR_MARK_NONE = 0,
+ SMGR_MARK_UNCOMMITTED = 'u' /* the file is not committed yet */
+} StorageMarks;
+
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
@@ -85,7 +97,12 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
extern void smgrclose(SMgrRelation reln);
extern void smgrcloseall(void);
extern void smgrclosenode(RelFileNodeBackend rnode);
+extern void smgrcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void smgrunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/test/recovery/t/027_persistence_change.pl b/src/test/recovery/t/027_persistence_change.pl
new file mode 100644
index 0000000000..a45bacc9b2
--- /dev/null
+++ b/src/test/recovery/t/027_persistence_change.pl
@@ -0,0 +1,244 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test relation persistence change
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use Test::More tests => 25;
+use IPC::Run qw(pump finish timer);
+use Config;
+
+my $data_unit = 2000;
+
+# Initialize primary node.
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+# we don't want checkpointing
+$node->append_conf('postgresql.conf', qq(
+checkpoint_timeout = '24h'
+));
+$node->start;
+create($node);
+
+my $relfilenodes1 = relfilenodes();
+
+# correctly recover empty tables
+$node->stop('immediate');
+$node->start;
+insert($node, 0, $data_unit);
+
+# data persists after a crash
+$node->stop('immediate');
+$node->start;
+checkdataloss($data_unit, 'crash logged 1');
+
+set_unlogged($node);
+# SET UNLOGGED didn't change relfilenode
+my $relfilenodes2 = relfilenodes();
+checkrelfilenodes($relfilenodes1, $relfilenodes2, 'logged->unlogged');
+
+# data cleanly vanishes after a crash
+$node->stop('immediate');
+$node->start;
+checkdataloss(0, 'crash unlogged');
+
+insert($node, 0, $data_unit);
+set_logged($node);
+
+$node->stop('immediate');
+$node->start;
+# SET LOGGED didn't change relfilenode and data survive a crash
+my $relfilenodes3 = relfilenodes();
+checkrelfilenodes($relfilenodes2, $relfilenodes3, 'unlogged->logged');
+checkdataloss($data_unit, 'crash logged 2');
+
+# unlogged insert -> graceful stop
+set_unlogged($node);
+insert($node, $data_unit, $data_unit, 0);
+$node->stop;
+$node->start;
+checkdataloss($data_unit * 2, 'unlogged graceful restart');
+
+# crash during transaction
+set_logged($node);
+$node->stop('immediate');
+$node->start;
+insert($node, $data_unit * 2, $data_unit);
+
+my $h = insert($node, $data_unit * 3, $data_unit, 1); ## this is aborted
+$node->stop('immediate');
+
+# finishing $h stalls this case, just tear it off.
+$h = undef;
+
+# check if indexes are working
+$node->start;
+# drop first half of data to reduce run time
+$node->safe_psql('postgres', 'DELETE FROM t WHERE bt < ' . $data_unit * 2);
+check($node, $data_unit * 2, $data_unit * 3 - 1, 'final check');
+
+sub create
+{
+ my ($node) = @_;
+
+ $node->psql('postgres', qq(
+ CREATE TABLE t (bt int, gin int[], gist point, hash int,
+ brin int, spgist point);
+ CREATE INDEX i_bt ON t USING btree (bt);
+ CREATE INDEX i_gin ON t USING gin (gin);
+ CREATE INDEX i_gist ON t USING gist (gist);
+ CREATE INDEX i_hash ON t USING hash (hash);
+ CREATE INDEX i_brin ON t USING brin (brin);
+ CREATE INDEX i_spgist ON t_spgist USING spgist (spgist);));
+}
+
+
+sub insert
+{
+ my ($node, $st, $num, $interactive) = @_;
+ my $ed = $st + $num - 1;
+ my $query = qq(BEGIN;
+INSERT INTO t
+ (SELECT i, ARRAY[i, i * 2], point(i, i * 2), i, i, point(i, i)
+ FROM generate_series($st, $ed) i);
+);
+
+ if ($interactive)
+ {
+ my $in = '';
+ my $out = '';
+ my $timer = timer(10);
+
+ my $h = $node->interactive_psql('postgres', \$in, \$out, $timer);
+ like($out, qr/psql/, "print startup banner");
+
+ $timer->start(10);
+ $in .= $query . "SELECT 'END';\n";
+ pump $h until ($out =~ /\nEND/ || $timer->is_expired);
+ ok(($out =~ /\nEND/ && !$timer->is_expired), "inserted");
+ return $h;
+ # the transaction is not terminated
+ }
+ else
+ {
+ $node->psql('postgres', $query . "COMMIT;");
+ return undef;
+ }
+}
+
+sub check
+{
+ my ($node, $st, $ed, $head) = @_;
+ my $num_data = $ed - $st + 1;
+
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO true;
+ SET enable_indexscan TO false;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE bt = i)),
+ $num_data, "$head: heap is not broken");
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE bt = i)),
+ $num_data, "$head: btree is not broken");
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE gin = ARRAY[i, i * 2];)),
+ $num_data, "$head: gin is not broken");
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE gist <@ box(point(i-0.5, i*2-0.5),point(i+0.5, i*2+0.5));)),
+ $num_data, "$head: gist is not broken");
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE hash = i;)),
+ $num_data, "$head: hash is not broken");
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE brin = i;)),
+ $num_data, "$head: brin is not broken");
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE spgist <@ box(point(i-0.5,i-0.5),point(i+0.5,i+0.5));)),
+ $num_data, "$head: spgist is not broken");
+}
+
+sub set_unlogged
+{
+ my ($node) = @_;
+
+ $node->psql('postgres', qq(
+ ALTER TABLE t SET UNLOGGED;
+));
+}
+
+sub set_logged
+{
+ my ($node) = @_;
+
+ $node->psql('postgres', qq(
+ ALTER TABLE t SET LOGGED;
+));
+}
+
+sub relfilenodes
+{
+ my $result = $node->safe_psql('postgres', qq{
+ SELECT relname, relfilenode FROM pg_class
+ WHERE relname
+ IN ('t', 'i_bt','i_gin','i_gist','i_hash','i_brin','i_spgist');});
+
+ my %relfilenodes;
+
+ foreach my $l (split(/\n/, $result))
+ {
+ die "unexpected format: $l" if ($l !~ /^([^|]+)\|([0-9]+)$/);
+ $relfilenodes{$1} = $2;
+ }
+
+ return \%relfilenodes;
+}
+
+sub checkrelfilenodes
+{
+ my ($rnodes1, $rnodes2, $s) = @_;
+
+ foreach my $n (keys %{$rnodes1})
+ {
+ if ($n eq 'i_gist')
+ {
+ # persistence of GiST index is not changed in-place
+ isnt($rnodes1->{$n}, $rnodes2->{$n},
+ "$s: relfilenode is changed: $n");
+ }
+ else
+ {
+ # otherwise all relations are processed in-place
+ is($rnodes1->{$n}, $rnodes2->{$n},
+ "$s: relfilenode is not changed: $n");
+ }
+ }
+}
+
+sub checkdataloss
+{
+ my ($expected, $s) = @_;
+
+ is($node->safe_psql('postgres', "SELECT count(*) FROM t;"), $expected,
+ "$s: data in table t is in the expected state");
+}
--
2.27.0
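The file-name convention introduced by the patch above ("<filename>.<mark>", with ".<digit>" reserved for segment files) can be illustrated with a small standalone sketch. This is not the patch's parse_filename_for_nontemp_relation(): the function name parse_relfile_name, the MARK_* macros and the buffer handling are hypothetical, the fork suffix is accepted as any run of lowercase letters instead of being checked against the known fork names, and a digit is rejected as a mark character, which is stricter than the patch.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the patch's StorageMarks values. */
#define MARK_NONE        '\0'
#define MARK_UNCOMMITTED 'u'

/*
 * Split "<oid>[_<fork>][.<segno>][.<mark>]" into its parts.
 *
 * Returns false if the name does not have that shape.  fork_out receives
 * the fork suffix ("" for the main fork); mark_out receives the mark
 * character, or MARK_NONE if the name carries no mark.
 */
static bool
parse_relfile_name(const char *name, char *fork_out, size_t fork_len,
                   char *mark_out)
{
    size_t pos = 0;

    assert(name != NULL && fork_out != NULL && mark_out != NULL);
    assert(fork_len > 0);

    /* Leading relfilenode digits are mandatory. */
    while (isdigit((unsigned char) name[pos]))
        pos++;
    if (pos == 0)
        return false;

    /* Optional fork suffix: '_' followed by lowercase letters. */
    fork_out[0] = '\0';
    if (name[pos] == '_')
    {
        size_t start = ++pos;
        size_t n;

        while (islower((unsigned char) name[pos]))
            pos++;
        n = pos - start;
        if (n == 0 || n >= fork_len)
            return false;
        memcpy(fork_out, name + start, n);
        fork_out[n] = '\0';
    }

    /* Optional segment number: '.' followed by at least one digit. */
    if (name[pos] == '.' && isdigit((unsigned char) name[pos + 1]))
    {
        pos++;
        while (isdigit((unsigned char) name[pos]))
            pos++;
    }

    /* Optional mark: '.' followed by exactly one non-digit character. */
    if (name[pos] == '.' && name[pos + 1] != '\0' &&
        !isdigit((unsigned char) name[pos + 1]))
    {
        *mark_out = name[pos + 1];
        pos += 2;
    }
    else
        *mark_out = MARK_NONE;

    /* Anything left over means the name is not a relation file. */
    return name[pos] == '\0';
}
```

Under these assumptions, "16384_init.u" parses as fork "init" with mark 'u', "16384.1" as a plain segment file with no mark, and "16384.1.2" is rejected because a digit cannot serve as a mark.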
Attachment: v13-0002-New-command-ALTER-TABLE-ALL-IN-TABLESPACE-SET-LO.patch (text/x-patch; charset=us-ascii)
From cf197e934133f10286cef210f3e943086d015cb1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 23:21:09 +0900
Subject: [PATCH v13 2/2] New command ALTER TABLE ALL IN TABLESPACE SET
LOGGED/UNLOGGED
To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes
relation persistence of all tables in the specified tablespace.
---
src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++++++++++
src/backend/nodes/copyfuncs.c | 16 ++++
src/backend/nodes/equalfuncs.c | 15 ++++
src/backend/parser/gram.y | 20 +++++
src/backend/tcop/utility.c | 11 +++
src/include/commands/tablecmds.h | 2 +
src/include/nodes/nodes.h | 1 +
src/include/nodes/parsenodes.h | 9 ++
8 files changed, 214 insertions(+)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 41e77e1072..08caeec931 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14509,6 +14509,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
return new_tablespaceoid;
}
+/*
+ * Alter Table ALL ... SET LOGGED/UNLOGGED
+ *
+ * Allows a user to change the persistence of all objects in a given tablespace
+ * in the current database. Objects can also be chosen based on their owner,
+ * to allow users to change the persistence of only their own objects. The
+ * main permissions handling is done by the lower-level change persistence
+ * function.
+ *
+ * All to-be-modified objects are locked first. If NOWAIT is specified and the
+ * lock can't be acquired then we ereport(ERROR).
+ */
+void
+AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt)
+{
+ List *relations = NIL;
+ ListCell *l;
+ ScanKeyData key[1];
+ Relation rel;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ Oid tablespaceoid;
+ List *role_oids = roleSpecsToIds(NIL);
+
+ /* Ensure we were not asked to change something we can't */
+ if (stmt->objtype != OBJECT_TABLE)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("only tables can be specified")));
+
+ /* Get the tablespace OID */
+ tablespaceoid = get_tablespace_oid(stmt->tablespacename, false);
+
+ /*
+ * Now that the checks are done, check if we should set it to InvalidOid
+ * because it is our database's default tablespace.
+ */
+ if (tablespaceoid == MyDatabaseTableSpace)
+ tablespaceoid = InvalidOid;
+
+ /*
+ * Walk the list of objects in the tablespace to pick them up. This will
+ * only find objects in our database, of course.
+ */
+ ScanKeyInit(&key[0],
+ Anum_pg_class_reltablespace,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(tablespaceoid));
+
+ rel = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(rel, 1, key);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relOid = relForm->oid;
+
+ /*
+ * Do not pick-up objects in pg_catalog as part of this, if an admin
+ * really wishes to do so, they can issue the individual ALTER
+ * commands directly.
+ *
+ * Also, explicitly avoid any shared tables, temp tables, or TOAST
+ * (TOAST will be changed with the main table).
+ */
+ if (IsCatalogNamespace(relForm->relnamespace) ||
+ relForm->relisshared ||
+ isAnyTempNamespace(relForm->relnamespace) ||
+ IsToastNamespace(relForm->relnamespace))
+ continue;
+
+ /* Only pick up the object type requested */
+ if (relForm->relkind != RELKIND_RELATION)
+ continue;
+
+ /* Check if we are only picking-up objects owned by certain roles */
+ if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner))
+ continue;
+
+ /*
+ * Handle permissions-checking here since we are locking the tables
+ * and also to avoid doing a bunch of work only to fail part-way. Note
+ * that permissions will also be checked by AlterTableInternal().
+ *
+ * Caller must be considered an owner on the table of which we're going
+ * to change persistence.
+ */
+ if (!pg_class_ownercheck(relOid, GetUserId()))
+ aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)),
+ NameStr(relForm->relname));
+
+ if (stmt->nowait &&
+ !ConditionalLockRelationOid(relOid, AccessExclusiveLock))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("aborting because lock on relation \"%s.%s\" is not available",
+ get_namespace_name(relForm->relnamespace),
+ NameStr(relForm->relname))));
+ else
+ LockRelationOid(relOid, AccessExclusiveLock);
+
+ /*
+ * Add to our list of objects of which we're going to change
+ * persistence.
+ */
+ relations = lappend_oid(relations, relOid);
+ }
+
+ table_endscan(scan);
+ table_close(rel, AccessShareLock);
+
+ if (relations == NIL)
+ ereport(NOTICE,
+ (errcode(ERRCODE_NO_DATA_FOUND),
+ errmsg("no matching relations in tablespace \"%s\" found",
+ tablespaceoid == InvalidOid ? "(database default)" :
+ get_tablespace_name(tablespaceoid))));
+
+ /*
+ * Everything is locked, loop through and change persistence of all of the
+ * relations.
+ */
+ foreach(l, relations)
+ {
+ List *cmds = NIL;
+ AlterTableCmd *cmd = makeNode(AlterTableCmd);
+
+ if (stmt->logged)
+ cmd->subtype = AT_SetLogged;
+ else
+ cmd->subtype = AT_SetUnLogged;
+
+ cmds = lappend(cmds, cmd);
+
+ EventTriggerAlterTableStart((Node *) stmt);
+ /* OID is set by AlterTableInternal */
+ AlterTableInternal(lfirst_oid(l), cmds, false);
+ EventTriggerAlterTableEnd();
+ }
+}
+
static void
index_copy_data(Relation rel, RelFileNode newrnode)
{
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index df0b747883..55e38cfe3f 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4269,6 +4269,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from)
return newnode;
}
+static AlterTableSetLoggedAllStmt *
+_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *from)
+{
+ AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt);
+
+ COPY_STRING_FIELD(tablespacename);
+ COPY_SCALAR_FIELD(objtype);
+ COPY_SCALAR_FIELD(logged);
+ COPY_SCALAR_FIELD(nowait);
+
+ return newnode;
+}
+
static CreateExtensionStmt *
_copyCreateExtensionStmt(const CreateExtensionStmt *from)
{
@@ -5622,6 +5635,9 @@ copyObjectImpl(const void *from)
case T_AlterTableMoveAllStmt:
retval = _copyAlterTableMoveAllStmt(from);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _copyAlterTableSetLoggedAllStmt(from);
+ break;
case T_CreateExtensionStmt:
retval = _copyCreateExtensionStmt(from);
break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index cb7ddd463c..a19b7874d7 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1916,6 +1916,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a,
return true;
}
+static bool
+_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *a,
+ const AlterTableSetLoggedAllStmt *b)
+{
+ COMPARE_STRING_FIELD(tablespacename);
+ COMPARE_SCALAR_FIELD(objtype);
+ COMPARE_SCALAR_FIELD(logged);
+ COMPARE_SCALAR_FIELD(nowait);
+
+ return true;
+}
+
static bool
_equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b)
{
@@ -3625,6 +3637,9 @@ equal(const void *a, const void *b)
case T_AlterTableMoveAllStmt:
retval = _equalAlterTableMoveAllStmt(a, b);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _equalAlterTableSetLoggedAllStmt(a, b);
+ break;
case T_CreateExtensionStmt:
retval = _equalCreateExtensionStmt(a, b);
break;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 3d4dd43e47..9823d57a54 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -1984,6 +1984,26 @@ AlterTableStmt:
n->nowait = $13;
$$ = (Node *)n;
}
+ | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = true;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = false;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
| ALTER INDEX qualified_name alter_table_cmds
{
AlterTableStmt *n = makeNode(AlterTableStmt);
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 1fbc387d47..1483f9a475 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -162,6 +162,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_AlterTSConfigurationStmt:
case T_AlterTSDictionaryStmt:
case T_AlterTableMoveAllStmt:
+ case T_AlterTableSetLoggedAllStmt:
case T_AlterTableSpaceOptionsStmt:
case T_AlterTableStmt:
case T_AlterTypeStmt:
@@ -1747,6 +1748,12 @@ ProcessUtilitySlow(ParseState *pstate,
commandCollected = true;
break;
+ case T_AlterTableSetLoggedAllStmt:
+ AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree);
+ /* commands are stashed in AlterTableSetLoggedAll */
+ commandCollected = true;
+ break;
+
case T_DropStmt:
ExecDropStmt((DropStmt *) parsetree, isTopLevel);
/* no commands stashed for DROP */
@@ -2669,6 +2676,10 @@ CreateCommandTag(Node *parsetree)
tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype);
+ break;
+
case T_AlterTableStmt:
tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype);
break;
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index 336549cc5f..714077ff4c 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
+extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt);
+
extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
Oid *oldschema);
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 7c657c1241..8860b2e548 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -428,6 +428,7 @@ typedef enum NodeTag
T_AlterCollationStmt,
T_CallStmt,
T_AlterStatsStmt,
+ T_AlterTableSetLoggedAllStmt,
/*
* TAGS FOR PARSE TREE NODES (parsenodes.h)
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 4c5a8a39bf..c3e1bc66d1 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2350,6 +2350,15 @@ typedef struct AlterTableMoveAllStmt
bool nowait;
} AlterTableMoveAllStmt;
+typedef struct AlterTableSetLoggedAllStmt
+{
+ NodeTag type;
+ char *tablespacename;
+ ObjectType objtype; /* Object type to change persistence of */
+ bool logged;
+ bool nowait;
+} AlterTableSetLoggedAllStmt;
+
/* ----------------------
* Create/Alter Extension Statements
* ----------------------
--
2.27.0
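
As a reading aid for the scan loop in AlterTableSetLoggedAll() above, the relation filter can be sketched in isolation. This is only an illustration under stated assumptions: RelInfo, owner_selected(), should_change_persistence() and collect_targets() are hypothetical names that do not exist in PostgreSQL, and the boolean fields stand in for the IsCatalogNamespace(), relisshared, isAnyTempNamespace() and IsToastNamespace() checks in the patch. Locking and the ownership check are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for one pg_class row. */
typedef struct RelInfo
{
    unsigned int oid;
    unsigned int owner;
    char         relkind;     /* 'r' = ordinary table, 'i' = index, ... */
    bool         in_catalog;  /* lives in pg_catalog */
    bool         is_shared;   /* shared across databases */
    bool         in_temp_ns;  /* lives in a temp namespace */
    bool         in_toast_ns; /* lives in a TOAST namespace */
} RelInfo;

/* An empty owner list means "do not filter by owner". */
static bool
owner_selected(const unsigned int *owners, size_t nowners, unsigned int owner)
{
    if (nowners == 0)
        return true;
    for (size_t i = 0; i < nowners; i++)
        if (owners[i] == owner)
            return true;
    return false;
}

/* Mirrors the order of the "continue" checks in the scan loop. */
static bool
should_change_persistence(const RelInfo *rel,
                          const unsigned int *owners, size_t nowners)
{
    assert(rel != NULL);

    if (rel->in_catalog || rel->is_shared ||
        rel->in_temp_ns || rel->in_toast_ns)
        return false;
    if (rel->relkind != 'r')
        return false;
    return owner_selected(owners, nowners, rel->owner);
}

/* Collect the OIDs that pass the filter; "out" must hold nrels entries. */
size_t
collect_targets(const RelInfo *rels, size_t nrels,
                const unsigned int *owners, size_t nowners,
                unsigned int *out)
{
    size_t n = 0;

    for (size_t i = 0; i < nrels; i++)
        if (should_change_persistence(&rels[i], owners, nowners))
            out[n++] = rels[i].oid;
    return n;
}
```

TOAST relations are skipped here for the same reason the patch comment gives: they are expected to follow their main table.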
On 2021-12-23 15:33:35 +0900, Kyotaro Horiguchi wrote:
At Thu, 23 Dec 2021 15:01:41 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
I added a TAP test to exercise the in-place persistence change.
We don't need a base table for every index. TAP test revised.
The TAP tests seem to fail on all platforms. See
https://cirrus-ci.com/build/4911549314760704
E.g. the linux failure is
[16:45:15.569]
[16:45:15.569] # Failed test 'inserted'
[16:45:15.569] # at t/027_persistence_change.pl line 121.
[16:45:15.569] # Looks like you failed 1 test of 25.
[16:45:15.569] [16:45:15] t/027_persistence_change.pl ..........
[16:45:15.569] Dubious, test returned 1 (wstat 256, 0x100)
[16:45:15.569] Failed 1/25 subtests
[16:45:15.569] [16:45:15]
[16:45:15.569]
[16:45:15.569] Test Summary Report
[16:45:15.569] -------------------
[16:45:15.569] t/027_persistence_change.pl (Wstat: 256 Tests: 25 Failed: 1)
[16:45:15.569] Failed test: 18
[16:45:15.569] Non-zero exit status: 1
[16:45:15.569] Files=27, Tests=315, 220 wallclock secs ( 0.14 usr 0.03 sys + 48.94 cusr 17.13 csys = 66.24 CPU)
Greetings,
Andres Freund
At Tue, 4 Jan 2022 16:05:08 -0800, Andres Freund <andres@anarazel.de> wrote in
The TAP tests seem to fail on all platforms. See
https://cirrus-ci.com/build/4911549314760704
E.g. the linux failure is
[16:45:15.569]
[16:45:15.569] # Failed test 'inserted'
[16:45:15.569] # at t/027_persistence_change.pl line 121.
[16:45:15.569] # Looks like you failed 1 test of 25.
[16:45:15.569] [16:45:15] t/027_persistence_change.pl ..........
[16:45:15.569] Dubious, test returned 1 (wstat 256, 0x100)
[16:45:15.569] Failed 1/25 subtests
[16:45:15.569] [16:45:15]
[16:45:15.569]
[16:45:15.569] Test Summary Report
[16:45:15.569] -------------------
[16:45:15.569] t/027_persistence_change.pl (Wstat: 256 Tests: 25 Failed: 1)
[16:45:15.569] Failed test: 18
[16:45:15.569] Non-zero exit status: 1
[16:45:15.569] Files=27, Tests=315, 220 wallclock secs ( 0.14 usr 0.03 sys + 48.94 cusr 17.13 csys = 66.24 CPU)
Thank you very much. It still doesn't fail in my development
environment (CentOS 8), but I found a silly bug in the test script.
I'm still not sure why the test item failed, but I'll repost the
updated version and watch what the CI says.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v14-0001-In-place-table-persistence-change.patch (text/x-patch; charset=us-ascii)
From e2deae1bef19827803e0e8f85b1e45e3fcd88505 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v14 1/2] In-place table persistence change
Even though ALTER TABLE SET LOGGED/UNLOGGED does not require rewriting
data, it currently runs a heap rewrite, which causes a large amount of
file I/O. This patch makes the command run without a heap rewrite.
In addition, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of a massive number of HEAP_INSERTs, which should be
smaller.
This also allows cleanup of files left behind when the transaction that
created them crashes.
---
src/backend/access/rmgrdesc/smgrdesc.c | 52 ++
src/backend/access/transam/README | 8 +
src/backend/access/transam/xact.c | 7 +
src/backend/access/transam/xlog.c | 17 +
src/backend/catalog/storage.c | 545 +++++++++++++++++-
src/backend/commands/tablecmds.c | 266 +++++++--
src/backend/replication/basebackup.c | 3 +-
src/backend/storage/buffer/bufmgr.c | 88 +++
src/backend/storage/file/fd.c | 4 +-
src/backend/storage/file/reinit.c | 344 +++++++----
src/backend/storage/smgr/md.c | 93 ++-
src/backend/storage/smgr/smgr.c | 32 +
src/backend/storage/sync/sync.c | 20 +-
src/bin/pg_rewind/parsexlog.c | 24 +
src/common/relpath.c | 47 +-
src/include/catalog/storage.h | 3 +
src/include/catalog/storage_xlog.h | 42 +-
src/include/common/relpath.h | 9 +-
src/include/storage/bufmgr.h | 2 +
src/include/storage/fd.h | 1 +
src/include/storage/md.h | 8 +-
src/include/storage/reinit.h | 10 +-
src/include/storage/smgr.h | 17 +
src/test/recovery/t/027_persistence_change.pl | 247 ++++++++
24 files changed, 1707 insertions(+), 182 deletions(-)
create mode 100644 src/test/recovery/t/027_persistence_change.pl
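
The mark-file idea from the commit message (cleaning up files left behind by a crashed transaction) can be sketched with an in-memory directory. This is a toy model under stated assumptions, not the patch's implementation: FakeDir, create_storage(), commit_storage(), cleanup_after_crash() and the ".uncommitted" suffix are hypothetical names chosen for illustration, and no WAL logging or fsync is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_FILES   16
#define MAX_NAME    32
#define MARK_SUFFIX ".uncommitted"  /* hypothetical mark-file suffix */

/* Toy directory: just a bag of file names. */
typedef struct FakeDir
{
    char names[MAX_FILES][MAX_NAME];
    int  nfiles;
} FakeDir;

static int
dir_find(const FakeDir *dir, const char *name)
{
    for (int i = 0; i < dir->nfiles; i++)
        if (strcmp(dir->names[i], name) == 0)
            return i;
    return -1;
}

static bool
dir_exists(const FakeDir *dir, const char *name)
{
    return dir_find(dir, name) >= 0;
}

static void
dir_create(FakeDir *dir, const char *name)
{
    if (dir_exists(dir, name))
        return;
    assert(dir->nfiles < MAX_FILES && strlen(name) < MAX_NAME);
    strcpy(dir->names[dir->nfiles++], name);
}

static void
dir_unlink(FakeDir *dir, const char *name)
{
    int i = dir_find(dir, name);

    if (i < 0)
        return;
    dir->nfiles--;
    if (i != dir->nfiles)       /* move the last entry into the hole */
        strcpy(dir->names[i], dir->names[dir->nfiles]);
}

/* Create the mark first, so a crash can never leave an unmarked new file. */
void
create_storage(FakeDir *dir, const char *relfile)
{
    char mark[MAX_NAME];

    snprintf(mark, sizeof(mark), "%s%s", relfile, MARK_SUFFIX);
    dir_create(dir, mark);
    dir_create(dir, relfile);
}

/* At commit the file becomes permanent: only the mark is removed. */
void
commit_storage(FakeDir *dir, const char *relfile)
{
    char mark[MAX_NAME];

    snprintf(mark, sizeof(mark), "%s%s", relfile, MARK_SUFFIX);
    dir_unlink(dir, mark);
}

/* After a crash, every surviving mark condemns its file. Returns the count. */
int
cleanup_after_crash(FakeDir *dir)
{
    size_t suffixlen = strlen(MARK_SUFFIX);
    int    removed = 0;
    int    i = 0;

    while (i < dir->nfiles)
    {
        size_t len = strlen(dir->names[i]);

        if (len > suffixlen &&
            strcmp(dir->names[i] + len - suffixlen, MARK_SUFFIX) == 0)
        {
            char relfile[MAX_NAME];
            char mark[MAX_NAME];

            strcpy(mark, dir->names[i]);
            memcpy(relfile, mark, len - suffixlen);
            relfile[len - suffixlen] = '\0';
            dir_unlink(dir, relfile);
            dir_unlink(dir, mark);
            removed++;
            i = 0;              /* entries were moved; rescan from the start */
        }
        else
            i++;
    }
    return removed;
}
```

A file whose creating transaction committed has no mark and survives the cleanup pass; a file whose mark is still present is removed together with the mark.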
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 7755553d57..d251f22207 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,49 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+ char *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+ appendStringInfoString(buf, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+ char *path = GetRelationPath(xlrec->rnode.dbNode,
+ xlrec->rnode.spcNode,
+ xlrec->rnode.relNode,
+ InvalidBackendId,
+ xlrec->forkNum, xlrec->mark);
+ char *action;
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ action = "CREATE";
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ action = "DELETE";
+ break;
+ default:
+ action = "<unknown action>";
+ break;
+ }
+
+ appendStringInfo(buf, "%s %s", action, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+ char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+ appendStringInfoString(buf, path);
+ appendStringInfo(buf, " persistence %d", xlrec->persistence);
+ pfree(path);
+ }
}
const char *
@@ -55,6 +98,15 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_UNLINK:
+ id = "UNLINK";
+ break;
+ case XLOG_SMGR_MARK:
+ id = "MARK";
+ break;
+ case XLOG_SMGR_BUFPERSISTENCE:
+ id = "BUFPERSISTENCE";
+ break;
}
return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c1..b344bbe511 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -724,6 +724,14 @@ we must panic and abort recovery. The DBA will have to manually clean up
then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.
+The Smgr MARK files
+--------------------------------
+
+An smgr mark file is created together with a new relation file to mark
+that the relfilenode needs to be cleaned up at recovery time. In
+contrast to the four actions above, failure to remove smgr mark files
+will lead to data loss, in which case the server will shut down.
+
Skipping WAL for New RelFileNode
--------------------------------
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e7b0bc804d..b41186d6d8 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2197,6 +2197,9 @@ CommitTransaction(void)
*/
smgrDoPendingSyncs(true, is_parallel_worker);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2447,6 +2450,9 @@ PrepareTransaction(void)
*/
smgrDoPendingSyncs(true, false);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2772,6 +2778,7 @@ AbortTransaction(void)
AfterTriggerEndXact(false); /* 'false' means it's abort */
AtAbort_Portals();
smgrDoPendingSyncs(false, is_parallel_worker);
+ smgrDoPendingCleanups(false);
AtEOXact_LargeObject(false);
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 87cd05c945..243860fcb1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -40,6 +40,7 @@
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
#include "catalog/pg_database.h"
+#include "catalog/storage.h"
#include "commands/progress.h"
#include "commands/tablespace.h"
#include "common/controldata_utils.h"
@@ -4564,6 +4565,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
{
ereport(DEBUG1,
(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+
+ /* cleanup garbage files left during crash recovery */
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
InArchiveRecovery = true;
if (StandbyModeRequested)
StandbyMode = true;
@@ -7824,6 +7833,14 @@ StartupXLOG(void)
}
}
+ /* cleanup garbage files left during crash recovery */
+ if (!InArchiveRecovery)
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
/* Allow resource managers to do any required cleanup. */
for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
{
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index c5ad28d71f..d6b30387e9 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,6 +19,7 @@
#include "postgres.h"
+#include "access/amapi.h"
#include "access/parallel.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -66,6 +67,23 @@ typedef struct PendingRelDelete
struct PendingRelDelete *next; /* linked-list link */
} PendingRelDelete;
+#define PCOP_UNLINK_FORK (1 << 0)
+#define PCOP_UNLINK_MARK (1 << 1)
+#define PCOP_SET_PERSISTENCE (1 << 2)
+
+typedef struct PendingCleanup
+{
+ RelFileNode relnode; /* relation that may need to be deleted */
+ int op; /* operation mask */
+ bool bufpersistence; /* buffer persistence to set */
+ int unlink_forknum; /* forknum to unlink */
+ StorageMarks unlink_mark; /* mark to unlink */
+ BackendId backend; /* InvalidBackendId if not a temp rel */
+ bool atCommit; /* T=delete at commit; F=delete at abort */
+ int nestLevel; /* xact nesting level of request */
+ struct PendingCleanup *next; /* linked-list link */
+} PendingCleanup;
+
typedef struct PendingRelSync
{
RelFileNode rnode;
@@ -73,6 +91,7 @@ typedef struct PendingRelSync
} PendingRelSync;
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingCleanup *pendingCleanups = NULL; /* head of linked list */
HTAB *pendingSyncHash = NULL;
@@ -117,7 +136,8 @@ AddPendingSync(const RelFileNode *rnode)
SMgrRelation
RelationCreateStorage(RelFileNode rnode, char relpersistence)
{
- PendingRelDelete *pending;
+ PendingRelDelete *pendingdel;
+ PendingCleanup *pendingclean;
SMgrRelation srel;
BackendId backend;
bool needs_wal;
@@ -143,21 +163,41 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return NULL; /* placate compiler */
}
+ /*
+ * We are going to create a new storage file. If the server crashes before
+ * the current transaction ends, the file needs to be cleaned up. The
+ * SMGR_MARK_UNCOMMITTED mark file prompts that work at the next startup.
+ */
srel = smgropen(rnode, backend);
+ log_smgrcreatemark(&rnode, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
smgrcreate(srel, MAIN_FORKNUM, false);
if (needs_wal)
log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM);
/* Add the relation to the list of stuff to delete at abort */
- pending = (PendingRelDelete *)
+ pendingdel = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
- pending->relnode = rnode;
- pending->backend = backend;
- pending->atCommit = false; /* delete if abort */
- pending->nestLevel = GetCurrentTransactionNestLevel();
- pending->next = pendingDeletes;
- pendingDeletes = pending;
+ pendingdel->relnode = rnode;
+ pendingdel->backend = backend;
+ pendingdel->atCommit = false; /* delete if abort */
+ pendingdel->nestLevel = GetCurrentTransactionNestLevel();
+ pendingdel->next = pendingDeletes;
+ pendingDeletes = pendingdel;
+
+ /* drop mark files at commit */
+ pendingclean = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pendingclean->relnode = rnode;
+ pendingclean->op = PCOP_UNLINK_MARK;
+ pendingclean->unlink_forknum = MAIN_FORKNUM;
+ pendingclean->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pendingclean->backend = backend;
+ pendingclean->atCommit = true;
+ pendingclean->nestLevel = GetCurrentTransactionNestLevel();
+ pendingclean->next = pendingCleanups;
+ pendingCleanups = pendingclean;
if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
{
@@ -168,6 +208,203 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return srel;
}
+/*
+ * RelationCreateInitFork
+ * Create physical storage for the init fork of a relation.
+ *
+ * Create the init fork for the relation.
+ *
+ * This function is transactional. The creation is WAL-logged, and if the
+ * transaction aborts later on, the init fork will be removed.
+ */
+void
+RelationCreateInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ SMgrRelation srel;
+ bool create = true;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), false, false);
+
+ /*
+ * If we have entries for init-fork operations on this relation, that means
+ * that we have already registered pending delete entries to drop an
+ * init-fork preexisting since before the current transaction started. This
+ * function reverts that change just by removing the entries.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ create = false;
+ }
+ else
+ prev = pending;
+ }
+
+ if (!create)
+ return;
+
+ /*
+ * We are going to create an init fork. If the server crashes before the
+ * current transaction ends, an init fork left behind would corrupt data
+ * during recovery. The mark file works as a sentinel to identify that
+ * situation.
+ */
+ srel = smgropen(rnode, InvalidBackendId);
+ log_smgrcreatemark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+
+ /* We don't have existing init fork, create it. */
+ smgrcreate(srel, INIT_FORKNUM, false);
+
+ /*
+ * An index init fork needs further initialization. ambuildempty should
+ * WAL-log and sync the file by itself; otherwise we do that ourselves.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rnode, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /* drop the init fork, mark file and revert persistence at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_UNLINK_FORK | PCOP_UNLINK_MARK | PCOP_SET_PERSISTENCE;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->bufpersistence = true;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* drop mark file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
+/*
+ * RelationDropInitFork
+ * Delete physical storage for the init fork of a relation.
+ *
+ * Register pending-delete of the init fork. The real deletion is performed by
+ * smgrDoPendingCleanups at commit.
+ *
+ * This function is transactional. If the transaction aborts later on, the
+ * deletion doesn't happen.
+ */
+void
+RelationDropInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ bool inxact_created = false;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), true, false);
+
+ /*
+ * If we have entries for init-fork operations of this relation, that means
+ * that we have created the init fork in the current transaction. We
+ * remove the init fork and mark file immediately in that case. Otherwise
+ * just register pending-delete for the existing init fork.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ /* unlink list entry */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ inxact_created = true;
+ }
+ else
+ prev = pending;
+ }
+
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rnode, InvalidBackendId);
+
+ /*
+ * INIT forks are never loaded into shared buffers, so there is no point
+ * in dropping buffers for such files.
+ */
+ log_smgrunlinkmark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrunlinkmark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+ log_smgrunlink(&rnode, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ return;
+ }
+
+ /* register drop of this init fork file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_UNLINK_FORK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = false;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -187,6 +424,88 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum)
+{
+ xl_smgr_unlink xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file unlink.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK record (mark file creation) to WAL.
+ */
+void
+log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file creation.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_CREATE;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK record (mark file removal) to WAL.
+ */
+void
+log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the mark file removal.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_UNLINK;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileNode *rnode, bool persistence)
+{
+ xl_smgr_bufpersistence xlrec;
+
+ /*
+ * Make an XLOG entry reporting the change of buffer persistence.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.persistence = persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -255,6 +574,7 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)
prev->next = next;
else
pendingDeletes = next;
+
pfree(pending);
/* prev does not change */
}
@@ -673,6 +993,88 @@ smgrDoPendingDeletes(bool isCommit)
}
}
+/*
+ * smgrDoPendingCleanups() -- Clean up work that emits WAL records
+ *
+ * The operations handled in this function emit WAL records, which must be
+ * emitted before the commit record for the current transaction.
+ */
+void
+smgrDoPendingCleanups(bool isCommit)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (pending->nestLevel < nestLevel)
+ {
+ /* outer-level entries should not be processed yet */
+ prev = pending;
+ }
+ else
+ {
+ /* unlink list entry first, so we don't retry on failure */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ /* do cleanup if called for */
+ if (pending->atCommit == isCommit)
+ {
+ SMgrRelation srel;
+
+ srel = smgropen(pending->relnode, pending->backend);
+
+ Assert ((pending->op &
+ ~(PCOP_UNLINK_FORK | PCOP_UNLINK_MARK |
+ PCOP_SET_PERSISTENCE)) == 0);
+
+ if (pending->op & PCOP_UNLINK_FORK)
+ {
+ /* other forks would need their buffers dropped */
+ Assert(pending->unlink_forknum == INIT_FORKNUM);
+
+ /* Don't emit WAL during recovery. */
+ if (!InRecovery)
+ log_smgrunlink(&pending->relnode,
+ pending->unlink_forknum);
+ smgrunlink(srel, pending->unlink_forknum, false);
+ }
+
+ if (pending->op & PCOP_UNLINK_MARK)
+ {
+ SMgrRelation srel;
+
+ if (!InRecovery)
+ log_smgrunlinkmark(&pending->relnode,
+ pending->unlink_forknum,
+ pending->unlink_mark);
+ srel = smgropen(pending->relnode, pending->backend);
+ smgrunlinkmark(srel, pending->unlink_forknum,
+ pending->unlink_mark, InRecovery);
+ smgrclose(srel);
+ }
+
+ if (pending->op & PCOP_SET_PERSISTENCE)
+ {
+ SetRelationBuffersPersistence(srel, pending->bufpersistence,
+ InRecovery);
+ }
+ }
+
+ /* must explicitly free the list entry */
+ pfree(pending);
+ /* prev does not change */
+ }
+ }
+}
+
/*
* smgrDoPendingSyncs() -- Take care of relation syncs at end of xact.
*/
@@ -933,6 +1335,15 @@ smgr_redo(XLogReaderState *record)
reln = smgropen(xlrec->rnode, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ smgrunlink(reln, xlrec->forkNum, true);
+ smgrclose(reln);
+ }
else if (info == XLOG_SMGR_TRUNCATE)
{
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
@@ -1021,6 +1432,124 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ bool created = false;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ smgrcreatemark(reln, xlrec->forkNum, xlrec->mark, true);
+ created = true;
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ smgrunlinkmark(reln, xlrec->forkNum, xlrec->mark, true);
+ break;
+ default:
+ elog(ERROR, "unknown smgr_mark action %d", (int) xlrec->action);
+ }
+
+ if (created)
+ {
+ /* revert mark file operation at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = xlrec->rnode;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = xlrec->forkNum;
+ pending->unlink_mark = xlrec->mark;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ else
+ {
+ /*
+ * Delete pending action for this mark file if any. We should have
+ * at most one entry for this action.
+ */
+ PendingCleanup *prev = NULL;
+
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileNodeEquals(xlrec->rnode, pending->relnode) &&
+ pending->unlink_forknum == xlrec->forkNum &&
+ (pending->op & PCOP_UNLINK_MARK) != 0)
+ {
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+ }
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ PendingCleanup *prev = NULL;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+
+ /*
+ * Delete pending action for persistence change if any. We should have
+ * at most one entry for this action.
+ */
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileNodeEquals(xlrec->rnode, pending->relnode) &&
+ (pending->op & PCOP_SET_PERSISTENCE) != 0)
+ {
+ Assert (pending->bufpersistence == xlrec->persistence);
+
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+
+ /*
+ * Revert buffer-persistence changes at abort if the relation's
+ * persistence now differs from what it was before this transaction.
+ */
+ if (!pending)
+ {
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = xlrec->rnode;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = !xlrec->persistence;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 3631b8a929..848fda40ca 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -52,6 +52,7 @@
#include "commands/defrem.h"
#include "commands/event_trigger.h"
#include "commands/policy.h"
+#include "commands/progress.h"
#include "commands/sequence.h"
#include "commands/tablecmds.h"
#include "commands/tablespace.h"
@@ -5329,6 +5330,187 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
return newcmd;
}
+/*
+ * RelationChangePersistence: do in-place persistence change of a relation
+ */
+static void
+RelationChangePersistence(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ int i;
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * ATRewriteTable would be needed if any of the following were set, which
+ * cannot happen in the AT_REWRITE_ALTER_PERSISTENCE case.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ /*
+ * First we collect all relations whose persistence we need to change.
+ */
+
+ /* Collect OIDs of the relation itself and its indexes */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ /* Add toast relation if any */
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ List *toastidx;
+ Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
+
+ relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
+ toastidx = RelationGetIndexList(toastrel);
+ relids = list_concat(relids, toastidx);
+ list_free(toastidx);
+ table_close(toastrel, NoLock);
+ }
+
+ table_close(rel, NoLock);
+
+ /* Make changes in storage */
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach (lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+
+ /*
+ * XXXX: Some access methods cannot tolerate an in-place persistence
+ * change. Specifically, GiST uses page LSNs to figure out whether a
+ * block has changed, and UNLOGGED GiST indexes use fake LSNs that
+ * are incompatible with the real LSNs used for LOGGED ones.
+ *
+ * Maybe if gistGetFakeLSN behaved the same way for permanent and
+ * unlogged indexes, we could skip the index rebuild in exchange for
+ * some extra WAL records emitted while it is unlogged.
+ *
+ * Check relam against an allow list so that unknown AMs also take the
+ * rebuild path.
+ */
+ if (r->rd_rel->relkind == RELKIND_INDEX &&
+ /* GiST is excluded */
+ r->rd_rel->relam != BTREE_AM_OID &&
+ r->rd_rel->relam != HASH_AM_OID &&
+ r->rd_rel->relam != GIN_AM_OID &&
+ r->rd_rel->relam != SPGIST_AM_OID &&
+ r->rd_rel->relam != BRIN_AM_OID)
+ {
+ int reindex_flags;
+ ReindexParams params = {0};
+
+ /* reindex doesn't allow concurrent use of the index */
+ table_close(r, NoLock);
+
+ reindex_flags =
+ REINDEX_REL_SUPPRESS_INDEX_USE |
+ REINDEX_REL_CHECK_CONSTRAINTS;
+
+ /* Set the same persistence with the parent relation. */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+
+ reindex_index(reloid, reindex_flags, persistence, &params);
+
+ continue;
+ }
+
+ /* Create or drop init fork */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ RelationCreateInitFork(r);
+ else
+ RelationDropInitFork(r);
+
+ /*
+ * When this relation becomes WAL-logged, immediately sync all files
+ * except the init fork to establish the initial state on storage.
+ * Buffers have already been flushed out by RelationCreate(Drop)InitFork
+ * called just above. The init fork should have been synced as needed.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT)
+ {
+ for (i = 0 ; i < INIT_FORKNUM ; i++)
+ {
+ if (smgrexists(RelationGetSmgr(r), i))
+ smgrimmedsync(RelationGetSmgr(r), i);
+ }
+ }
+
+ /* Update catalog */
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+
+ /*
+ * While wal_level >= replica, switching to LOGGED requires the
+ * relation content to be WAL-logged so that the table can be recovered.
+ * We don't emit this while wal_level = minimal.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded())
+ {
+ ForkNumber fork;
+ xl_smgr_truncate xlrec;
+
+ xlrec.blkno = 0;
+ xlrec.rnode = r->rd_node;
+ xlrec.flags = SMGR_TRUNCATE_ALL;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ for (fork = 0; fork < INIT_FORKNUM ; fork++)
+ {
+ if (smgrexists(RelationGetSmgr(r), fork))
+ log_newpage_range(r, fork, 0,
+ smgrnblocks(RelationGetSmgr(r), fork),
+ false);
+ }
+ }
+
+ table_close(r, NoLock);
+ }
+
+ table_close(classRel, NoLock);
+}
+
/*
* ATRewriteTables: ALTER TABLE phase 3
*/
@@ -5459,47 +5641,55 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- /*
- * Create transient table that will receive the modified data.
- *
- * Ensure it is marked correctly as logged or unlogged. We have
- * to do this here so that buffers for the new relfilenode will
- * have the right persistence set, and at the same time ensure
- * that the original filenode's buffers will get read in with the
- * correct setting (i.e. the original one). Otherwise a rollback
- * after the rewrite would possibly result with buffers for the
- * original filenode having the wrong persistence setting.
- *
- * NB: This relies on swap_relation_files() also swapping the
- * persistence. That wouldn't work for pg_class, but that can't be
- * unlogged anyway.
- */
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
- persistence, lockmode);
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
+ RelationChangePersistence(tab, persistence, lockmode);
+ else
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new relfilenode
+ * will have the right persistence set, and at the same time
+ * ensure that the original filenode's buffers will get read in
+ * with the correct setting (i.e. the original one). Otherwise
+ * a rollback after the rewrite would possibly result with
+ * buffers for the original filenode having the wrong
+ * persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that can't
+ * be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace,
+ NewAccessMethod,
+ persistence, lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
+ /*
+ * Swap the physical files of the old and new heaps, then
+ * rebuild indexes and discard the old heap. We can use
+ * RecentXmin for the table's new relfrozenxid because we
+ * rewrote all the tuples in ATRewriteTable, so no older Xid
+ * remains in the table. Also, we never try to swap toast
+ * tables by content, since we have no interest in letting this
+ * code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
- InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ }
}
else
{
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index ec0485705d..45e1a5d817 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1070,6 +1070,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
bool excludeFound;
ForkNumber relForkNum; /* Type of fork if file is a relation */
int relOidChars; /* Chars in filename that are the rel oid */
+ StorageMarks mark;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1120,7 +1121,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
/* Exclude all forks for unlogged tables except the init fork */
if (isDbDir &&
parse_filename_for_nontemp_relation(de->d_name, &relOidChars,
- &relForkNum))
+ &relForkNum, &mark))
{
/* Never exclude init forks */
if (relForkNum != INIT_FORKNUM)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index b4532948d3..dab74bf99a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -37,6 +37,7 @@
#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
#include "executor/instrument.h"
#include "lib/binaryheap.h"
#include "miscadmin.h"
@@ -3154,6 +3155,93 @@ DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
}
}
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a relation
+ * and, when switching to PERMANENT, writes all dirty pages of the relation
+ * out to disk (or more accurately, out to kernel disk buffers), ensuring
+ * that the kernel has an up-to-date view of the relation.
+ *
+ * Generally, the caller should be holding AccessExclusiveLock on the
+ * target relation to ensure that no other backend is busy dirtying
+ * more blocks of the relation; the effects can't be expected to last
+ * after the lock is released.
+ *
+ * XXX currently it sequentially searches the buffer pool, should be
+ * changed to more clever ways of searching. This routine is not
+ * used in any performance-critical code paths, so it's not worth
+ * adding additional overhead to normal paths to make it go faster;
+ * but see also DropRelFileNodeBuffers.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
+{
+ int i;
+ RelFileNodeBackend rnode = srel->smgr_rnode;
+
+ Assert (!RelFileNodeBackendIsTemp(rnode));
+
+ if (!isRedo)
+ log_smgrbufpersistence(&srel->smgr_rnode.node, permanent);
+
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ {
+ UnlockBufHdr(bufHdr, buf_state);
+ continue;
+ }
+
+ if (permanent)
+ {
+ /* Init fork is being dropped, drop buffers for it. */
+ if (bufHdr->tag.forkNum == INIT_FORKNUM)
+ {
+ InvalidateBuffer(bufHdr);
+ continue;
+ }
+
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /* we flush this buffer when switching to PERMANENT */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, srel);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr, true);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ /* init fork is always BM_PERMANENT. See BufferAlloc */
+ if (bufHdr->tag.forkNum != INIT_FORKNUM)
+ buf_state &= ~BM_PERMANENT;
+
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ }
+}
+
/* ---------------------------------------------------------------------
* DropRelFileNodesAllBuffers
*
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 263057841d..8487ae1f02 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -349,8 +349,6 @@ static void pre_sync_fname(const char *fname, bool isdir, int elevel);
static void datadir_fsync_fname(const char *fname, bool isdir, int elevel);
static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel);
-static int fsync_parent_path(const char *fname, int elevel);
-
/*
* pg_fsync --- do fsync with or without writethrough
@@ -3759,7 +3757,7 @@ fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel)
* This is aimed at making file operations persistent on disk in case of
* an OS crash or power failure.
*/
-static int
+int
fsync_parent_path(const char *fname, int elevel)
{
char parentpath[MAXPGPATH];
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 0ae3fb6902..0137902bb2 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -16,29 +16,49 @@
#include <unistd.h>
+#include "access/xlog.h"
+#include "catalog/pg_tablespace_d.h"
#include "common/relpath.h"
#include "postmaster/startup.h"
+#include "storage/bufmgr.h"
#include "storage/copydir.h"
#include "storage/fd.h"
+#include "storage/md.h"
#include "storage/reinit.h"
+#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
- int op);
+ Oid tspid, int op);
static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
- int op);
+ Oid tspid, Oid dbid, int op);
typedef struct
{
Oid reloid; /* hash key */
-} unlogged_relation_entry;
+ bool has_init; /* has INIT fork */
+ bool dirty_init; /* needs to remove INIT fork */
+ bool dirty_all; /* needs to remove all forks */
+} relfile_entry;
/*
- * Reset unlogged relations from before the last restart.
+ * Clean up and reset relation files from before the last restart.
*
- * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any
- * relation with an "init" fork, except for the "init" fork itself.
+ * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations
+ * depending on the existence of mark files.
+ *
+ * If the SMGR_MARK_UNCOMMITTED mark file for the init fork is present, we
+ * remove the init fork along with the mark file.
+ *
+ * If the SMGR_MARK_UNCOMMITTED mark file for the main fork is present, we
+ * remove the whole relation along with the mark file.
+ *
+ * Otherwise, if the "init" fork is found, we remove all forks of any relation
+ * with the "init" fork, except for the "init" fork itself.
+ *
+ * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all
+ * relations that have mark files and/or "init" forks.
*
* If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main
* fork.
@@ -72,7 +92,7 @@ ResetUnloggedRelations(int op)
/*
* First process unlogged files in pg_default ($PGDATA/base)
*/
- ResetUnloggedRelationsInTablespaceDir("base", op);
+ ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op);
/*
* Cycle through directories for all non-default tablespaces.
@@ -81,13 +101,19 @@ ResetUnloggedRelations(int op)
while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
{
+ Oid tspid;
+
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
continue;
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
- ResetUnloggedRelationsInTablespaceDir(temp_path, op);
+
+ tspid = atooid(spc_de->d_name);
+
+ Assert(tspid != 0);
+ ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op);
}
FreeDir(spc_dir);
@@ -103,7 +129,8 @@ ResetUnloggedRelations(int op)
* Process one tablespace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
+ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
+ Oid tspid, int op)
{
DIR *ts_dir;
struct dirent *de;
@@ -130,6 +157,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
while ((de = ReadDir(ts_dir, tsdirname)) != NULL)
{
+ Oid dbid;
+
/*
* We're only interested in the per-database directories, which have
* numeric names. Note that this code will also (properly) ignore "."
@@ -148,7 +177,10 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
ereport_startup_progress("resetting unlogged relations (cleanup), elapsed time: %ld.%02d s, current path: %s",
dbspace_path);
- ResetUnloggedRelationsInDbspaceDir(dbspace_path, op);
+ dbid = atooid(de->d_name);
+ Assert(dbid != 0);
+
+ ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op);
}
FreeDir(ts_dir);
@@ -158,125 +190,227 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
* Process one per-dbspace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
+ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
+ Oid tspid, Oid dbid, int op)
{
DIR *dbspace_dir;
struct dirent *de;
char rm_path[MAXPGPATH * 2];
+ HTAB *hash;
+ HASHCTL ctl;
/* Caller must specify at least one operation. */
- Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0);
+ Assert((op & (UNLOGGED_RELATION_CLEANUP |
+ UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_INIT)) != 0);
/*
* Cleanup is a two-pass operation. First, we go through and identify all
* the files with init forks. Then, we go through again and nuke
* everything with the same OID except the init fork.
*/
+
+ /*
+ * It's possible that someone could create tons of unlogged relations in
+ * the same database & tablespace, so we'd better use a hash table rather
+ * than an array or linked list to keep track of which files need to be
+ * reset. Otherwise, this cleanup operation would be O(n^2).
+ */
+ memset(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(relfile_entry);
+ hash = hash_create("relfilenode cleanup hash",
+ 32, &ctl, HASH_ELEM | HASH_BLOBS);
+
+ /* Collect INIT fork and mark files in the directory. */
+ dbspace_dir = AllocateDir(dbspacedirname);
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
+ {
+ int oidchars;
+ ForkNumber forkNum;
+ StorageMarks mark;
+
+ /* Skip anything that doesn't look like a relation data file. */
+ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
+ &forkNum, &mark))
+ continue;
+
+ if (forkNum == INIT_FORKNUM || mark == SMGR_MARK_UNCOMMITTED)
+ {
+ Oid key;
+ relfile_entry *ent;
+ bool found;
+
+ /*
+ * Record the relfilenode information. If it has
+ * SMGR_MARK_UNCOMMITTED mark files, the relfilenode is in dirty
+ * state, where clean up is needed.
+ */
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_ENTER, &found);
+
+ if (!found)
+ {
+ ent->has_init = false;
+ ent->dirty_init = false;
+ ent->dirty_all = false;
+ }
+
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_init = true;
+ else if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_all = true;
+ else
+ {
+ Assert(forkNum == INIT_FORKNUM);
+ ent->has_init = true;
+ }
+ }
+ }
+
+ /* Done with the first pass. */
+ FreeDir(dbspace_dir);
+
+ /* nothing to do if we have neither init forks nor mark files */
+ if (hash_get_num_entries(hash) < 1)
+ {
+ hash_destroy(hash);
+ return;
+ }
+
+ if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0)
+ {
+ /*
+ * When we come here after recovery, an smgr object for this file might
+ * have been created. In that case we need to drop all buffers and then
+ * the smgr object before initializing the unlogged relation. This is
+ * safe as long as no other backends have accessed the relation before
+ * starting archive recovery.
+ */
+ HASH_SEQ_STATUS status;
+ relfile_entry *ent;
+ SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8);
+ int maxrels = 8;
+ int nrels = 0;
+ int i;
+
+ Assert(!HotStandbyActive());
+
+ hash_seq_init(&status, hash);
+ while((ent = (relfile_entry *) hash_seq_search(&status)) != NULL)
+ {
+ RelFileNodeBackend rel;
+
+ /*
+ * The relation is persistent and remains persistent. Don't drop
+ * the buffers for this relation.
+ */
+ if (ent->has_init && ent->dirty_init)
+ continue;
+
+ if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = ent->reloid;
+
+ srels[nrels++] = smgropen(rel.node, InvalidBackendId);
+ }
+
+ DropRelFileNodesAllBuffers(srels, nrels);
+
+ for (i = 0 ; i < nrels ; i++)
+ smgrclose(srels[i]);
+ }
+
+ /*
+ * Now, make a second pass and remove anything that matches.
+ */
if ((op & UNLOGGED_RELATION_CLEANUP) != 0)
{
- HTAB *hash;
- HASHCTL ctl;
-
- /*
- * It's possible that someone could create a ton of unlogged relations
- * in the same database & tablespace, so we'd better use a hash table
- * rather than an array or linked list to keep track of which files
- * need to be reset. Otherwise, this cleanup operation would be
- * O(n^2).
- */
- ctl.keysize = sizeof(Oid);
- ctl.entrysize = sizeof(unlogged_relation_entry);
- ctl.hcxt = CurrentMemoryContext;
- hash = hash_create("unlogged relation OIDs", 32, &ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- /* Scan the directory. */
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ int oidchars;
+ Oid key;
+ relfile_entry *ent;
+ RelFileNodeBackend rel;
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* Also skip it unless this is the init fork. */
- if (forkNum != INIT_FORKNUM)
- continue;
-
- /*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
- */
- ent.reloid = atooid(de->d_name);
- (void) hash_search(hash, &ent, HASH_ENTER, NULL);
- }
-
- /* Done with the first pass. */
- FreeDir(dbspace_dir);
-
- /*
- * If we didn't find any init forks, there's no point in continuing;
- * we can bail out now.
- */
- if (hash_get_num_entries(hash) == 0)
- {
- hash_destroy(hash);
- return;
- }
-
- /*
- * Now, make a second pass and remove anything that matches.
- */
- dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
- {
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
-
- /* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* We never remove the init fork. */
- if (forkNum == INIT_FORKNUM)
+ &forkNum, &mark))
continue;
/*
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
- if (hash_search(hash, &ent, HASH_FIND, NULL))
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);
+
+ if (!ent)
+ continue;
+
+ if (!ent->dirty_all)
{
- snprintf(rm_path, sizeof(rm_path), "%s/%s",
- dbspacedirname, de->d_name);
- if (unlink(rm_path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m",
- rm_path)));
+ /* clean permanent relations don't need cleanup */
+ if (!ent->has_init)
+ continue;
+
+ if (ent->dirty_init)
+ {
+ /*
+ * The crashed transaction did SET UNLOGGED. This relation
+ * is restored to a LOGGED relation.
+ */
+ if (forkNum != INIT_FORKNUM)
+ continue;
+ }
else
- elog(DEBUG2, "unlinked file \"%s\"", rm_path);
+ {
+ /*
+ * we don't remove the INIT fork of a non-dirty
+ * relfilenode
+ */
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE)
+ continue;
+ }
}
+
+ /* so, nuke it! */
+ snprintf(rm_path, sizeof(rm_path), "%s/%s",
+ dbspacedirname, de->d_name);
+ if (unlink(rm_path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ rm_path)));
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = atooid(de->d_name);
+
+ ForgetRelationForkSyncRequests(rel, forkNum);
}
/* Cleanup is complete. */
FreeDir(dbspace_dir);
- hash_destroy(hash);
}
+ hash_destroy(hash);
+ hash = NULL;
+
/*
* Initialization happens after cleanup is complete: we copy each init
- * fork file to the corresponding main fork file. Note that if we are
- * asked to do both cleanup and init, we may never get here: if the
- * cleanup code determines that there are no init forks in this dbspace,
- * it will return before we get to this point.
+ * fork file to the corresponding main fork file.
*/
if ((op & UNLOGGED_RELATION_INIT) != 0)
{
@@ -285,6 +419,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char srcpath[MAXPGPATH * 2];
@@ -292,9 +427,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -328,15 +465,18 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char mainpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -379,7 +519,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
*/
bool
parse_filename_for_nontemp_relation(const char *name, int *oidchars,
- ForkNumber *fork)
+ ForkNumber *fork, StorageMarks *mark)
{
int pos;
@@ -410,11 +550,19 @@ parse_filename_for_nontemp_relation(const char *name, int *oidchars,
for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar)
;
- if (segchar <= 1)
- return false;
- pos += segchar;
+ if (segchar > 1)
+ pos += segchar;
}
+ /* mark file? */
+ if (name[pos] == '.' && name[pos + 1] != 0)
+ {
+ *mark = name[pos + 1];
+ pos += 2;
+ }
+ else
+ *mark = SMGR_MARK_NONE;
+
/* Now we should be at the end. */
if (name[pos] != '\0')
return false;
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index b4bca7eed6..580b74839f 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -139,7 +139,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
BlockNumber blkno, bool skipFsync, int behavior);
static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
MdfdVec *seg);
-
+static bool mdmarkexists(SMgrRelation reln, ForkNumber forkNum,
+ StorageMarks mark);
/*
* mdinit() -- Initialize private state for magnetic disk storage manager.
@@ -169,6 +170,81 @@ mdexists(SMgrRelation reln, ForkNumber forkNum)
return (mdopenfork(reln, forkNum, EXTENSION_RETURN_NULL) != NULL);
}
+/*
+ * mdcreatemark() -- Create a mark file.
+ *
+ * If isRedo is true, it's okay for the file to exist already.
+ */
+void
+mdcreatemark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+ int fd;
+
+ /* See mdcreate for details. */
+ TablespaceCreateDbspace(reln->smgr_rnode.node.spcNode,
+ reln->smgr_rnode.node.dbNode,
+ isRedo);
+
+ fd = BasicOpenFile(path, O_WRONLY | O_CREAT | O_EXCL);
+ if (fd < 0 && (!isRedo || errno != EEXIST))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create mark file \"%s\": %m", path)));
+
+ /* fd is invalid if the file already existed during redo */
+ if (fd >= 0)
+ {
+ pg_fsync(fd);
+ close(fd);
+ }
+
+ /*
+ * To guarantee that the creation of the file is persistent, fsync its
+ * parent directory.
+ */
+ fsync_parent_path(path, ERROR);
+
+ pfree(path);
+}
+
+
+/*
+ * mdunlinkmark() -- Delete the mark file
+ *
+ * If isRedo is true, it's okay if the file is not found.
+ */
+void
+mdunlinkmark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+
+ if (!isRedo || mdmarkexists(reln, forkNum, mark))
+ durable_unlink(path, ERROR);
+
+ pfree(path);
+}
+
+/*
+ * mdmarkexists() -- Check if the file exists.
+ */
+static bool
+mdmarkexists(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+ int fd;
+
+ fd = BasicOpenFile(path, O_RDONLY);
+ if (fd < 0 && errno != ENOENT)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not access mark file \"%s\": %m", path)));
+ pfree(path);
+
+ if (fd < 0)
+ return false;
+ close(fd);
+ return true;
+}
+
/*
* mdcreate() -- Create a new relation on magnetic disk.
*
@@ -1025,6 +1101,15 @@ register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum,
RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ );
}
+/*
+ * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork
+ */
+void
+ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, ForkNumber forknum)
+{
+ register_forget_request(rnode, forknum, 0);
+}
+
/*
* ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB
*/
@@ -1378,12 +1463,14 @@ mdsyncfiletag(const FileTag *ftag, char *path)
* Return 0 on success, -1 on failure, with errno set.
*/
int
-mdunlinkfiletag(const FileTag *ftag, char *path)
+mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark)
{
char *p;
/* Compute the path. */
- p = relpathperm(ftag->rnode, MAIN_FORKNUM);
+ p = GetRelationPath(ftag->rnode.dbNode, ftag->rnode.spcNode,
+ ftag->rnode.relNode, InvalidBackendId, MAIN_FORKNUM,
+ mark);
strlcpy(path, p, MAXPGPATH);
pfree(p);
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 0fcef4994b..110e64b0b2 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -62,6 +62,10 @@ typedef struct f_smgr
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+ void (*smgr_createmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+ void (*smgr_unlinkmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
} f_smgr;
static const f_smgr smgrsw[] = {
@@ -82,6 +86,8 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
+ .smgr_createmark = mdcreatemark,
+ .smgr_unlinkmark = mdunlinkmark,
}
};
@@ -335,6 +341,26 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo);
}
+/*
+ * smgrcreatemark() -- Create a mark file
+ */
+void
+smgrcreatemark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_createmark(reln, forknum, mark, isRedo);
+}
+
+/*
+ * smgrunlinkmark() -- Delete a mark file
+ */
+void
+smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlinkmark(reln, forknum, mark, isRedo);
+}
+
/*
* smgrdosyncall() -- Immediately sync all forks of all given relations
*
@@ -662,6 +688,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index d4083e8a56..9563940d45 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -89,7 +89,8 @@ static CycleCtr checkpoint_cycle_ctr = 0;
typedef struct SyncOps
{
int (*sync_syncfiletag) (const FileTag *ftag, char *path);
- int (*sync_unlinkfiletag) (const FileTag *ftag, char *path);
+ int (*sync_unlinkfiletag) (const FileTag *ftag, char *path,
+ StorageMarks mark);
bool (*sync_filetagmatches) (const FileTag *ftag,
const FileTag *candidate);
} SyncOps;
@@ -222,7 +223,8 @@ SyncPostCheckpoint(void)
/* Unlink the file */
if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
- path) < 0)
+ path,
+ SMGR_MARK_NONE) < 0)
{
/*
* There's a race condition, when the database is dropped at the
@@ -236,6 +238,20 @@ SyncPostCheckpoint(void)
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m", path)));
}
+ else if (syncsw[entry->tag.handler].sync_unlinkfiletag(
+ &entry->tag, path,
+ SMGR_MARK_UNCOMMITTED) < 0)
+ {
+ /*
+ * The fork file is gone. Also remove its SMGR_MARK_UNCOMMITTED
+ * mark file, if there is one. It is okay if the mark file does
+ * not exist.
+ */
+ if (errno != ENOENT)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
/* Mark the list entry as canceled, just in case */
entry->canceled = true;
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 436df54120..dbc0da5da5 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -407,6 +407,30 @@ extractPageInfo(XLogReaderState *record)
* source system.
*/
}
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_UNLINK)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_MARK)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
else if (rmid == RM_XACT_ID &&
((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT ||
(rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED ||
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 1f5c426ec0..4945b111cc 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -139,9 +139,15 @@ GetDatabasePath(Oid dbNode, Oid spcNode)
*/
char *
GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
- int backendId, ForkNumber forkNumber)
+ int backendId, ForkNumber forkNumber, char mark)
{
char *path;
+ char markstr[4];
+
+ if (mark == 0)
+ markstr[0] = 0;
+ else
+ snprintf(markstr, sizeof(markstr), ".%c", mark);
if (spcNode == GLOBALTABLESPACE_OID)
{
@@ -149,10 +155,10 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
Assert(dbNode == 0);
Assert(backendId == InvalidBackendId);
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("global/%u_%s",
- relNode, forkNames[forkNumber]);
+ path = psprintf("global/%u_%s%s",
+ relNode, forkNames[forkNumber], markstr);
else
- path = psprintf("global/%u", relNode);
+ path = psprintf("global/%u%s", relNode, markstr);
}
else if (spcNode == DEFAULTTABLESPACE_OID)
{
@@ -160,22 +166,22 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/%u_%s",
+ path = psprintf("base/%u/%u_%s%s",
dbNode, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/%u",
- dbNode, relNode);
+ path = psprintf("base/%u/%u%s",
+ dbNode, relNode, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/t%d_%u_%s",
+ path = psprintf("base/%u/t%d_%u_%s%s",
dbNode, backendId, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/t%d_%u",
- dbNode, backendId, relNode);
+ path = psprintf("base/%u/t%d_%u%s",
+ dbNode, backendId, relNode, markstr);
}
}
else
@@ -184,27 +190,28 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u_%s%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
dbNode, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
- dbNode, relNode);
+ dbNode, relNode, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
dbNode, backendId, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
- dbNode, backendId, relNode);
+ dbNode, backendId, relNode, markstr);
}
}
+
return path;
}
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 0ab32b44e9..584ebac391 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -23,6 +23,8 @@
extern int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
+extern void RelationCreateInitFork(Relation rel);
+extern void RelationDropInitFork(Relation rel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationPreTruncate(Relation rel);
@@ -41,6 +43,7 @@ extern void RestorePendingSyncs(char *startAddress);
extern void smgrDoPendingDeletes(bool isCommit);
extern void smgrDoPendingSyncs(bool isCommit, bool isParallelWorker);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingCleanups(bool isCommit);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index f0814f1458..12346ed7f6 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -18,17 +18,23 @@
#include "lib/stringinfo.h"
#include "storage/block.h"
#include "storage/relfilenode.h"
+#include "storage/smgr.h"
/*
* Declarations for smgr-related XLOG records
*
- * Note: we log file creation and truncation here, but logging of deletion
- * actions is handled by xact.c, because it is part of transaction commit.
+ * Note: we log file creation, truncation and buffer persistence change here,
+ * but logging of deletion actions is handled mainly by xact.c, because it is
+ * part of transaction commit in most cases. However, there's a case where
+ * init forks are deleted outside control of transaction.
*/
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_UNLINK 0x30
+#define XLOG_SMGR_MARK 0x40
+#define XLOG_SMGR_BUFPERSISTENCE 0x50
typedef struct xl_smgr_create
{
@@ -36,6 +42,32 @@ typedef struct xl_smgr_create
ForkNumber forkNum;
} xl_smgr_create;
+typedef struct xl_smgr_unlink
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+} xl_smgr_unlink;
+
+typedef enum smgr_mark_action
+{
+ XLOG_SMGR_MARK_CREATE = 'c',
+ XLOG_SMGR_MARK_UNLINK = 'u'
+} smgr_mark_action;
+
+typedef struct xl_smgr_mark
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ smgr_mark_action action;
+} xl_smgr_mark;
+
+typedef struct xl_smgr_bufpersistence
+{
+ RelFileNode rnode;
+ bool persistence;
+} xl_smgr_bufpersistence;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +83,12 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark);
+extern void log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark);
+extern void log_smgrbufpersistence(const RelFileNode *rnode, bool persistence);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index a44be11ca0..106a5cf508 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -67,7 +67,7 @@ extern int forkname_chars(const char *str, ForkNumber *fork);
extern char *GetDatabasePath(Oid dbNode, Oid spcNode);
extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
- int backendId, ForkNumber forkNumber);
+ int backendId, ForkNumber forkNumber, char mark);
/*
* Wrapper macros for GetRelationPath. Beware of multiple
@@ -77,7 +77,7 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
/* First argument is a RelFileNode */
#define relpathbackend(rnode, backend, forknum) \
GetRelationPath((rnode).dbNode, (rnode).spcNode, (rnode).relNode, \
- backend, forknum)
+ backend, forknum, 0)
/* First argument is a RelFileNode */
#define relpathperm(rnode, forknum) \
@@ -87,4 +87,9 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
#define relpath(rnode, forknum) \
relpathbackend((rnode).node, (rnode).backend, forknum)
+/* First argument is a RelFileNodeBackend */
+#define markpath(rnode, forknum, mark) \
+ GetRelationPath((rnode).node.dbNode, (rnode).node.spcNode, \
+ (rnode).node.relNode, \
+ (rnode).backend, forknum, mark)
#endif /* RELPATH_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index cfce23ecbc..f5a7df87a4 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -206,6 +206,8 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels)
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+ bool permanent, bool isRedo);
extern void DropRelFileNodesAllBuffers(struct SMgrRelationData **smgr_reln, int nnodes);
extern void DropDatabaseBuffers(Oid dbid);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 34602ae006..2dc0357ad5 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -185,6 +185,7 @@ extern ssize_t pg_pwritev_with_retry(int fd,
extern int pg_truncate(const char *path, off_t length);
extern void fsync_fname(const char *fname, bool isdir);
extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
+extern int fsync_parent_path(const char *fname, int elevel);
extern int durable_rename(const char *oldfile, const char *newfile, int loglevel);
extern int durable_unlink(const char *fname, int loglevel);
extern int durable_rename_excl(const char *oldfile, const char *newfile, int loglevel);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 752b440864..99620816b5 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -23,6 +23,10 @@
extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
extern void mdclose(SMgrRelation reln, ForkNumber forknum);
+extern void mdcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void mdunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
@@ -41,12 +45,14 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void ForgetRelationForkSyncRequests(RelFileNodeBackend rnode,
+ ForkNumber forknum);
extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
/* md sync callbacks */
extern int mdsyncfiletag(const FileTag *ftag, char *path);
-extern int mdunlinkfiletag(const FileTag *ftag, char *path);
+extern int mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark);
extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *candidate);
#endif /* MD_H */
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index fad1e5c473..e1f97e9b89 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -16,13 +16,15 @@
#define REINIT_H
#include "common/relpath.h"
-
+#include "storage/smgr.h"
extern void ResetUnloggedRelations(int op);
-extern bool parse_filename_for_nontemp_relation(const char *name,
- int *oidchars, ForkNumber *fork);
+extern bool parse_filename_for_nontemp_relation(const char *name, int *oidchars,
+ ForkNumber *fork,
+ StorageMarks *mark);
#define UNLOGGED_RELATION_CLEANUP 0x0001
-#define UNLOGGED_RELATION_INIT 0x0002
+#define UNLOGGED_RELATION_DROP_BUFFER 0x0002
+#define UNLOGGED_RELATION_INIT 0x0004
#endif /* REINIT_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a6fbf7b6a6..201ecace8a 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,18 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
+/*
+ * A storage mark is a file whose existence indicates something about the
+ * file it accompanies. Such a file is named "<filename>.<mark>", where the
+ * mark is one of the values of StorageMarks. Since ".<digit>" denotes a
+ * segment file, don't use digits for the mark character.
+ */
+typedef enum StorageMarks
+{
+ SMGR_MARK_NONE = 0,
+ SMGR_MARK_UNCOMMITTED = 'u' /* the file is not committed yet */
+} StorageMarks;
+
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
@@ -85,7 +97,12 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
extern void smgrclose(SMgrRelation reln);
extern void smgrcloseall(void);
extern void smgrclosenode(RelFileNodeBackend rnode);
+extern void smgrcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void smgrunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/test/recovery/t/027_persistence_change.pl b/src/test/recovery/t/027_persistence_change.pl
new file mode 100644
index 0000000000..c2f7076ea9
--- /dev/null
+++ b/src/test/recovery/t/027_persistence_change.pl
@@ -0,0 +1,247 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test relation persistence change
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use Test::More tests => 30;
+use IPC::Run qw(pump finish timer);
+use Config;
+
+my $data_unit = 2000;
+
+# Initialize primary node.
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+# we don't want checkpointing
+$node->append_conf('postgresql.conf', qq(
+checkpoint_timeout = '24h'
+));
+$node->start;
+create($node);
+
+my $relfilenodes1 = relfilenodes();
+
+# correctly recover empty tables
+$node->stop('immediate');
+$node->start;
+insert($node, 0, $data_unit);
+
+# data persists after a crash
+$node->stop('immediate');
+$node->start;
+checkdataloss($data_unit, 'crash logged 1');
+
+set_unlogged($node);
+# SET UNLOGGED didn't change relfilenode
+my $relfilenodes2 = relfilenodes();
+checkrelfilenodes($relfilenodes1, $relfilenodes2, 'logged->unlogged');
+
+# data cleanly vanishes after a crash
+$node->stop('immediate');
+$node->start;
+checkdataloss(0, 'crash unlogged');
+
+insert($node, 0, $data_unit);
+set_logged($node);
+
+$node->stop('immediate');
+$node->start;
+# SET LOGGED didn't change relfilenode and data survives a crash
+my $relfilenodes3 = relfilenodes();
+checkrelfilenodes($relfilenodes2, $relfilenodes3, 'unlogged->logged');
+checkdataloss($data_unit, 'crash logged 2');
+
+# unlogged insert -> graceful stop
+set_unlogged($node);
+insert($node, $data_unit, $data_unit, 0);
+$node->stop;
+$node->start;
+checkdataloss($data_unit * 2, 'unlogged graceful restart');
+
+# crash during transaction
+set_logged($node);
+$node->stop('immediate');
+$node->start;
+insert($node, $data_unit * 2, $data_unit);
+
+my $h = insert($node, $data_unit * 3, $data_unit, 1); ## this is aborted
+$node->stop('immediate');
+
+# finishing $h stalls this case, just tear it off.
+$h = undef;
+
+# check if indexes are working
+$node->start;
+# drop first half of data to reduce run time
+$node->safe_psql('postgres', 'DELETE FROM t WHERE bt < ' . $data_unit * 2);
+check($node, $data_unit * 2, $data_unit * 3 - 1, 'final check');
+
+sub create
+{
+ my ($node) = @_;
+
+ $node->psql('postgres', qq(
+ CREATE TABLE t (bt int, gin int[], gist point, hash int,
+ brin int, spgist point);
+ CREATE INDEX i_bt ON t USING btree (bt);
+ CREATE INDEX i_gin ON t USING gin (gin);
+ CREATE INDEX i_gist ON t USING gist (gist);
+ CREATE INDEX i_hash ON t USING hash (hash);
+ CREATE INDEX i_brin ON t USING brin (brin);
+ CREATE INDEX i_spgist ON t USING spgist (spgist);));
+}
+
+
+sub insert
+{
+ my ($node, $st, $num, $interactive) = @_;
+ my $ed = $st + $num - 1;
+ my $query = qq(BEGIN;
+INSERT INTO t
+ (SELECT i, ARRAY[i, i * 2], point(i, i * 2), i, i, point(i, i)
+ FROM generate_series($st, $ed) i);
+);
+
+ if ($interactive)
+ {
+ my $in = '';
+ my $out = '';
+ my $timer = timer(10);
+
+ my $h = $node->interactive_psql('postgres', \$in, \$out, $timer);
+ like($out, qr/psql/, "print startup banner");
+
+ $timer->start(10);
+ $in .= $query . "SELECT 'END';\n";
+ pump $h until ($out =~ /\nEND/ || $timer->is_expired);
+ ok(($out =~ /\nEND/ && !$timer->is_expired), "inserted-$st-$num");
+ return $h;
+ # the transaction is not terminated
+ }
+ else
+ {
+ $node->psql('postgres', $query . "COMMIT;");
+ return undef;
+ }
+}
+
+sub check
+{
+ my ($node, $st, $ed, $head) = @_;
+ my $num_data = $ed - $st + 1;
+
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO true;
+ SET enable_indexscan TO false;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE bt = i)),
+ $num_data, "$head: heap is not broken");
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE bt = i)),
+ $num_data, "$head: btree is not broken");
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE gin = ARRAY[i, i * 2];)),
+ $num_data, "$head: gin is not broken");
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE gist <@ box(point(i-0.5, i*2-0.5),point(i+0.5, i*2+0.5));)),
+ $num_data, "$head: gist is not broken");
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE hash = i;)),
+ $num_data, "$head: hash is not broken");
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE brin = i;)),
+ $num_data, "$head: brin is not broken");
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE spgist <@ box(point(i-0.5,i-0.5),point(i+0.5,i+0.5));)),
+ $num_data, "$head: spgist is not broken");
+}
+
+sub set_unlogged
+{
+ my ($node) = @_;
+
+ $node->psql('postgres', qq(
+ ALTER TABLE t SET UNLOGGED;
+));
+}
+
+sub set_logged
+{
+ my ($node) = @_;
+
+ $node->psql('postgres', qq(
+ ALTER TABLE t SET LOGGED;
+));
+}
+
+sub relfilenodes
+{
+ my $result = $node->safe_psql('postgres', qq{
+ SELECT relname, relfilenode FROM pg_class
+ WHERE relname
+ IN ('t', 'i_bt','i_gin','i_gist','i_hash','i_brin','i_spgist');});
+
+ my %relfilenodes;
+
+ foreach my $l (split(/\n/, $result))
+ {
+ die "unexpected format: $l" if ($l !~ /^([^|]+)\|([0-9]+)$/);
+ $relfilenodes{$1} = $2;
+ }
+
+ # the number must correspond to the IN list above
+ is (scalar(keys %relfilenodes), 7, "number of relations is correct");
+
+ return \%relfilenodes;
+}
+
+sub checkrelfilenodes
+{
+ my ($rnodes1, $rnodes2, $s) = @_;
+
+ foreach my $n (keys %{$rnodes1})
+ {
+ if ($n eq 'i_gist')
+ {
+ # persistence of GiST index is not changed in-place
+ isnt($rnodes1->{$n}, $rnodes2->{$n},
+ "$s: relfilenode is changed: $n");
+ }
+ else
+ {
+ # otherwise all relations are processed in-place
+ is($rnodes1->{$n}, $rnodes2->{$n},
+ "$s: relfilenode is not changed: $n");
+ }
+ }
+}
+
+sub checkdataloss
+{
+ my ($expected, $s) = @_;
+
+ is($node->safe_psql('postgres', "SELECT count(*) FROM t;"), $expected,
+ "$s: data in table t is in the expected state");
+}
--
2.27.0
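As a reading aid for the mark-file part of the patch above, here is a minimal standalone sketch of two ideas it relies on: naming a mark file as "<filename>.<mark>", and creating that file durably (fsync the file, then fsync its parent directory). This is not the patch's code. The helper names mark_path() and create_mark_file() are hypothetical, and plain POSIX calls stand in for BasicOpenFile(), pg_fsync() and fsync_parent_path(). One deliberate difference, stated as an assumption about the intended behavior: when the file already exists during redo, the sketch skips fsync/close on the invalid descriptor, whereas mdcreatemark() as posted falls through to pg_fsync(fd) with fd < 0.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <libgen.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: write "<relpath>" or "<relpath>.<mark>" into dst,
 * mirroring the markstr handling added to GetRelationPath().  A mark of 0
 * means "no mark".  Returns what snprintf() returns.
 */
static int
mark_path(char *dst, size_t dstlen, const char *relpath, char mark)
{
    if (mark == 0)
        return snprintf(dst, dstlen, "%s", relpath);
    return snprintf(dst, dstlen, "%s.%c", relpath, mark);
}

/*
 * Hypothetical helper: create an empty mark file and make its creation
 * durable.  If is_redo is true, an already existing file is accepted.
 * Returns 0 on success, -1 on failure.
 */
static int
create_mark_file(const char *path, bool is_redo)
{
    char dirbuf[1024];
    int  fd;
    int  dirfd;
    int  rc;

    fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
    {
        /* Outside redo, or on any error other than EEXIST, give up. */
        if (!is_redo || errno != EEXIST)
            return -1;
        /* Redo: the file is already there, so there is no fd to sync. */
    }
    else
    {
        if (fsync(fd) != 0)
        {
            close(fd);
            return -1;
        }
        if (close(fd) != 0)
            return -1;
    }

    /* fsync the parent directory so the directory entry survives a crash. */
    snprintf(dirbuf, sizeof(dirbuf), "%s", path);
    dirfd = open(dirname(dirbuf), O_RDONLY);
    if (dirfd < 0)
        return -1;
    rc = fsync(dirfd);
    close(dirfd);
    return rc;
}
```

With these helpers, mark_path(buf, sizeof(buf), "base/5/16384_init", 'u') yields "base/5/16384_init.u", which is the shape of name the patch's markpath() macro would produce for an init fork carrying SMGR_MARK_UNCOMMITTED. The relation path used here is an illustrative value only.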
Attachment: v14-0002-New-command-ALTER-TABLE-ALL-IN-TABLESPACE-SET-LO.patch (text/x-patch; charset=us-ascii)
From f7a23cafbbdbca874ac5ecdbc15360d0408de160 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 23:21:09 +0900
Subject: [PATCH v14 2/2] New command ALTER TABLE ALL IN TABLESPACE SET
LOGGED/UNLOGGED
To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes
relation persistence of all tables in the specified tablespace.
---
src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++++++++++
src/backend/nodes/copyfuncs.c | 16 ++++
src/backend/nodes/equalfuncs.c | 15 ++++
src/backend/parser/gram.y | 20 +++++
src/backend/tcop/utility.c | 11 +++
src/include/commands/tablecmds.h | 2 +
src/include/nodes/nodes.h | 1 +
src/include/nodes/parsenodes.h | 9 ++
8 files changed, 214 insertions(+)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 848fda40ca..9aa263db65 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14509,6 +14509,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
return new_tablespaceoid;
}
+/*
+ * Alter Table ALL ... SET LOGGED/UNLOGGED
+ *
+ * Allows a user to change persistence of all objects in a given tablespace in
+ * the current database. Objects can be chosen based on the owner of the
+ * object also, to allow users to change persistence of only their objects. The
+ * main permissions handling is done by the lower-level change persistence
+ * function.
+ *
+ * All to-be-modified objects are locked first. If NOWAIT is specified and the
+ * lock can't be acquired then we ereport(ERROR).
+ */
+void
+AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt)
+{
+ List *relations = NIL;
+ ListCell *l;
+ ScanKeyData key[1];
+ Relation rel;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ Oid tablespaceoid;
+ List *role_oids = roleSpecsToIds(NIL);
+
+ /* Ensure we were not asked to change something we can't */
+ if (stmt->objtype != OBJECT_TABLE)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("only tables can be specified")));
+
+ /* Get the tablespace OID */
+ tablespaceoid = get_tablespace_oid(stmt->tablespacename, false);
+
+ /*
+ * Now that the checks are done, check if we should set it to
+ * InvalidOid because it is our database's default tablespace.
+ */
+ if (tablespaceoid == MyDatabaseTableSpace)
+ tablespaceoid = InvalidOid;
+
+ /*
+ * Walk the list of objects in the tablespace to pick them up. This will
+ * only find objects in our database, of course.
+ */
+ ScanKeyInit(&key[0],
+ Anum_pg_class_reltablespace,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(tablespaceoid));
+
+ rel = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(rel, 1, key);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relOid = relForm->oid;
+
+ /*
+ * Do not pick-up objects in pg_catalog as part of this, if an admin
+ * really wishes to do so, they can issue the individual ALTER
+ * commands directly.
+ *
+ * Also, explicitly avoid any shared tables, temp tables, or TOAST
+ * (TOAST will be changed with the main table).
+ */
+ if (IsCatalogNamespace(relForm->relnamespace) ||
+ relForm->relisshared ||
+ isAnyTempNamespace(relForm->relnamespace) ||
+ IsToastNamespace(relForm->relnamespace))
+ continue;
+
+ /* Only pick up the object type requested */
+ if (relForm->relkind != RELKIND_RELATION)
+ continue;
+
+ /* Check if we are only picking-up objects owned by certain roles */
+ if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner))
+ continue;
+
+ /*
+ * Handle permissions-checking here since we are locking the tables
+ * and also to avoid doing a bunch of work only to fail part-way. Note
+ * that permissions will also be checked by AlterTableInternal().
+ *
+ * Caller must be considered an owner on the table of which we're going
+ * to change persistence.
+ */
+ if (!pg_class_ownercheck(relOid, GetUserId()))
+ aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)),
+ NameStr(relForm->relname));
+
+ if (stmt->nowait &&
+ !ConditionalLockRelationOid(relOid, AccessExclusiveLock))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("aborting because lock on relation \"%s.%s\" is not available",
+ get_namespace_name(relForm->relnamespace),
+ NameStr(relForm->relname))));
+ else
+ LockRelationOid(relOid, AccessExclusiveLock);
+
+ /*
+ * Add to our list of objects of which we're going to change
+ * persistence.
+ */
+ relations = lappend_oid(relations, relOid);
+ }
+
+ table_endscan(scan);
+ table_close(rel, AccessShareLock);
+
+ if (relations == NIL)
+ ereport(NOTICE,
+ (errcode(ERRCODE_NO_DATA_FOUND),
+ errmsg("no matching relations in tablespace \"%s\" found",
+ tablespaceoid == InvalidOid ? "(database default)" :
+ get_tablespace_name(tablespaceoid))));
+
+ /*
+ * Everything is locked, loop through and change persistence of all of the
+ * relations.
+ */
+ foreach(l, relations)
+ {
+ List *cmds = NIL;
+ AlterTableCmd *cmd = makeNode(AlterTableCmd);
+
+ if (stmt->logged)
+ cmd->subtype = AT_SetLogged;
+ else
+ cmd->subtype = AT_SetUnLogged;
+
+ cmds = lappend(cmds, cmd);
+
+ EventTriggerAlterTableStart((Node *) stmt);
+ /* OID is set by AlterTableInternal */
+ AlterTableInternal(lfirst_oid(l), cmds, false);
+ EventTriggerAlterTableEnd();
+ }
+}
+
static void
index_copy_data(Relation rel, RelFileNode newrnode)
{
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 18e778e856..51b6ad757f 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4270,6 +4270,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from)
return newnode;
}
+static AlterTableSetLoggedAllStmt *
+_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *from)
+{
+ AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt);
+
+ COPY_STRING_FIELD(tablespacename);
+ COPY_SCALAR_FIELD(objtype);
+ COPY_SCALAR_FIELD(logged);
+ COPY_SCALAR_FIELD(nowait);
+
+ return newnode;
+}
+
static CreateExtensionStmt *
_copyCreateExtensionStmt(const CreateExtensionStmt *from)
{
@@ -5623,6 +5636,9 @@ copyObjectImpl(const void *from)
case T_AlterTableMoveAllStmt:
retval = _copyAlterTableMoveAllStmt(from);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _copyAlterTableSetLoggedAllStmt(from);
+ break;
case T_CreateExtensionStmt:
retval = _copyCreateExtensionStmt(from);
break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index cb7ddd463c..a19b7874d7 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1916,6 +1916,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a,
return true;
}
+static bool
+_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *a,
+ const AlterTableSetLoggedAllStmt *b)
+{
+ COMPARE_STRING_FIELD(tablespacename);
+ COMPARE_SCALAR_FIELD(objtype);
+ COMPARE_SCALAR_FIELD(logged);
+ COMPARE_SCALAR_FIELD(nowait);
+
+ return true;
+}
+
static bool
_equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b)
{
@@ -3625,6 +3637,9 @@ equal(const void *a, const void *b)
case T_AlterTableMoveAllStmt:
retval = _equalAlterTableMoveAllStmt(a, b);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _equalAlterTableSetLoggedAllStmt(a, b);
+ break;
case T_CreateExtensionStmt:
retval = _equalCreateExtensionStmt(a, b);
break;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 6dddc07947..a55ea302c1 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -1984,6 +1984,26 @@ AlterTableStmt:
n->nowait = $13;
$$ = (Node *)n;
}
+ | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = true;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = false;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
| ALTER INDEX qualified_name alter_table_cmds
{
AlterTableStmt *n = makeNode(AlterTableStmt);
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 1fbc387d47..1483f9a475 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -162,6 +162,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_AlterTSConfigurationStmt:
case T_AlterTSDictionaryStmt:
case T_AlterTableMoveAllStmt:
+ case T_AlterTableSetLoggedAllStmt:
case T_AlterTableSpaceOptionsStmt:
case T_AlterTableStmt:
case T_AlterTypeStmt:
@@ -1747,6 +1748,12 @@ ProcessUtilitySlow(ParseState *pstate,
commandCollected = true;
break;
+ case T_AlterTableSetLoggedAllStmt:
+ AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree);
+ /* commands are stashed in AlterTableSetLoggedAll */
+ commandCollected = true;
+ break;
+
case T_DropStmt:
ExecDropStmt((DropStmt *) parsetree, isTopLevel);
/* no commands stashed for DROP */
@@ -2669,6 +2676,10 @@ CreateCommandTag(Node *parsetree)
tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype);
+ break;
+
case T_AlterTableStmt:
tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype);
break;
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index 336549cc5f..714077ff4c 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
+extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt);
+
extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
Oid *oldschema);
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 7c657c1241..8860b2e548 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -428,6 +428,7 @@ typedef enum NodeTag
T_AlterCollationStmt,
T_CallStmt,
T_AlterStatsStmt,
+ T_AlterTableSetLoggedAllStmt,
/*
* TAGS FOR PARSE TREE NODES (parsenodes.h)
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 593e301f7a..b9226a7cd9 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2350,6 +2350,15 @@ typedef struct AlterTableMoveAllStmt
bool nowait;
} AlterTableMoveAllStmt;
+typedef struct AlterTableSetLoggedAllStmt
+{
+ NodeTag type;
+ char *tablespacename;
+ ObjectType objtype; /* Object type to alter */
+ bool logged;
+ bool nowait;
+} AlterTableSetLoggedAllStmt;
+
/* ----------------------
* Create/Alter Extension Statements
* ----------------------
--
2.27.0
Hi,
On January 5, 2022 8:30:17 PM PST, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
At Tue, 4 Jan 2022 16:05:08 -0800, Andres Freund <andres@anarazel.de> wrote in
The tap tests seems to fail on all platforms. See
https://cirrus-ci.com/build/4911549314760704
E.g. the linux failure is
[16:45:15.569]
[16:45:15.569] # Failed test 'inserted'
[16:45:15.569] # at t/027_persistence_change.pl line 121.
[16:45:15.569] # Looks like you failed 1 test of 25.
[16:45:15.569] [16:45:15] t/027_persistence_change.pl ..........
[16:45:15.569] Dubious, test returned 1 (wstat 256, 0x100)
[16:45:15.569] Failed 1/25 subtests
[16:45:15.569] [16:45:15]
[16:45:15.569]
[16:45:15.569] Test Summary Report
[16:45:15.569] -------------------
[16:45:15.569] t/027_persistence_change.pl (Wstat: 256 Tests: 25 Failed: 1)
[16:45:15.569] Failed test: 18
[16:45:15.569] Non-zero exit status: 1
[16:45:15.569] Files=27, Tests=315, 220 wallclock secs ( 0.14 usr 0.03 sys + 48.94 cusr 17.13 csys = 66.24 CPU)
Thank you very much. It still doesn't fail on my development
environment (CentOS8), but I found a silly bug in the test script.
I'm still not sure why the test item failed, but I'll repost the
updated version and then watch what the CI says.
Fwiw, you can now test the same way as cfbot does with a lower turnaround time, as explained in src/tools/ci/README
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
At Wed, 05 Jan 2022 20:42:32 -0800, Andres Freund <andres@anarazel.de> wrote in
Hi,
On January 5, 2022 8:30:17 PM PST, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
I'm still not sure why the test item failed, but I'll repost the
updated version and then watch what the CI says.
Fwiw, you can now test the same way as cfbot does with a lower turnaround time, as explained in src/tools/ci/README
Fantastic! I'll give it a try. Thanks!
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Thu, 06 Jan 2022 16:39:21 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Fantastic! I'll give it a try. Thanks!
I did that and found that the test stumbled on newlines.
Tests succeeded on every platform other than Windows.
The Windows version fails because of a real, known issue.
[7916]: [postmaster] LOG: database system is shut down
[7916]: [postmaster] LOG: database system is shut down
[6228]: [postmaster] LOG: startup process (PID 2948) exited with exit code 1
[6228]: [postmaster] LOG: startup process (PID 2948) exited with exit code 1
[2948]: [startup] FATAL: could not remove file "base/12759/16384.u": Permission denied
[2948]: [startup] FATAL: could not remove file "base/12759/16384.u": Permission denied
[2948]: [startup] FATAL: could not remove file "base/12759/16384.u": Permission denied
[2948]: [startup] FATAL: could not remove file "base/12759/16384.u": Permission denied
[2948]: [startup] FATAL: could not remove file "base/12759/16384.u": Permission denied
[6228]: [postmaster] LOG: startup process (PID 2948) exited with exit code 1
Hmm... Is someone still holding the file open after restart?
Anyway, I'm posting the fixed version. It still fails on Windows.
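As background for the log above: the file name ending in ".u" looks like the SMGR_MARK_UNCOMMITTED mark file that the patch creates next to new storage (that is a reading of the path in the log; the suffix is not spelled out in this mail). A minimal sketch of the intended life cycle follows. It uses plain C file operations and hypothetical helper names, not the smgr API, and it ignores WAL entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* All names below are hypothetical; none of them are PostgreSQL functions. */

/* Return true if a file exists at 'path'. */
static bool
file_exists(const char *path)
{
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return false;
    fclose(fp);
    return true;
}

/* Create (or truncate) an empty file; return true on success. */
static bool
touch_file(const char *path)
{
    FILE *fp;

    assert(path != NULL);
    fp = fopen(path, "wb");
    if (fp == NULL)
        return false;
    return fclose(fp) == 0;
}

/* Creating storage: write the mark first, then the data file. */
static bool
create_storage_with_mark(const char *data_path, const char *mark_path)
{
    return touch_file(mark_path) && touch_file(data_path);
}

/* Commit: the data file becomes permanent, so only the mark is removed. */
static bool
commit_storage(const char *mark_path)
{
    return remove(mark_path) == 0;
}

/*
 * Startup after a crash: a surviving mark means the creating transaction
 * never committed.  Returns true only if a mark was found and both files
 * were removed successfully.
 */
static bool
cleanup_after_crash(const char *data_path, const char *mark_path)
{
    if (!file_exists(mark_path))
        return false;           /* committed, or never created */

    /* Remove the data file first so the mark survives a second crash. */
    if (file_exists(data_path) && remove(data_path) != 0)
        return false;

    return remove(mark_path) == 0;
}
```

In this sketch a failure to remove the mark (as in the Windows log) leaves the cleanup unfinished, which is why the startup process cannot simply ignore it.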
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v15-0001-In-place-table-persistence-change.patch (text/x-patch; charset=us-ascii)
From 48527df0c7d094a8ca7cc8d0c90df02bfd7c2614 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v15 1/2] In-place table persistence change
Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, it currently runs a heap rewrite, which causes a large amount
of file I/O. This patch makes the command run without a heap rewrite.
In addition, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of a massive number of HEAP_INSERTs, which should be
smaller.
Also, this allows cleaning up files left behind when the transaction
that created them crashes.
---
src/backend/access/rmgrdesc/smgrdesc.c | 52 ++
src/backend/access/transam/README | 8 +
src/backend/access/transam/xact.c | 7 +
src/backend/access/transam/xlog.c | 17 +
src/backend/catalog/storage.c | 545 +++++++++++++++++-
src/backend/commands/tablecmds.c | 266 +++++++--
src/backend/replication/basebackup.c | 3 +-
src/backend/storage/buffer/bufmgr.c | 88 +++
src/backend/storage/file/fd.c | 4 +-
src/backend/storage/file/reinit.c | 344 +++++++----
src/backend/storage/smgr/md.c | 93 ++-
src/backend/storage/smgr/smgr.c | 32 +
src/backend/storage/sync/sync.c | 20 +-
src/bin/pg_rewind/parsexlog.c | 24 +
src/common/relpath.c | 47 +-
src/include/catalog/storage.h | 3 +
src/include/catalog/storage_xlog.h | 42 +-
src/include/common/relpath.h | 9 +-
src/include/storage/bufmgr.h | 2 +
src/include/storage/fd.h | 1 +
src/include/storage/md.h | 8 +-
src/include/storage/reinit.h | 10 +-
src/include/storage/smgr.h | 17 +
src/test/recovery/t/027_persistence_change.pl | 247 ++++++++
24 files changed, 1707 insertions(+), 182 deletions(-)
create mode 100644 src/test/recovery/t/027_persistence_change.pl
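Note for readers of the diff: much of the storage.c change revolves around a list of pending cleanups, each tagged with whether it runs at commit or at abort and with the transaction nesting level that requested it. The following is a simplified, self-contained sketch of that bookkeeping. The struct and function names are hypothetical stand-ins, not the patch's PendingCleanup or smgrDoPendingCleanups, and the sketch does no real file or WAL work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for the patch's PendingCleanup. */
typedef struct Cleanup
{
    int             rel_id;     /* stand-in for a RelFileNode */
    bool            at_commit;  /* true = act at commit; false = act at abort */
    int             nest_level; /* nesting level of the request */
    struct Cleanup *next;       /* linked-list link */
} Cleanup;

static Cleanup *pending_list = NULL;    /* head of linked list */

/* Push a new request onto the head of the list. */
static void
register_cleanup(int rel_id, bool at_commit, int nest_level)
{
    Cleanup *c = malloc(sizeof(Cleanup));

    assert(c != NULL);
    c->rel_id = rel_id;
    c->at_commit = at_commit;
    c->nest_level = nest_level;
    c->next = pending_list;
    pending_list = c;
}

/* Count the entries still waiting in the list. */
static int
pending_count(void)
{
    int      n = 0;
    Cleanup *c;

    for (c = pending_list; c != NULL; c = c->next)
        n++;
    return n;
}

/*
 * Process the entries of the current nesting level.  Entries from outer
 * levels stay in the list.  Every other entry is unlinked and freed, but
 * only those whose at_commit flag matches the outcome are acted on.
 * Returns the number of entries acted on.
 */
static int
do_pending_cleanups(bool is_commit, int nest_level)
{
    Cleanup *cur;
    Cleanup *prev = NULL;
    Cleanup *next;
    int      performed = 0;

    for (cur = pending_list; cur != NULL; cur = next)
    {
        next = cur->next;
        if (cur->nest_level < nest_level)
        {
            prev = cur;         /* outer-level entry: not yet */
            continue;
        }

        /* unlink first, so a failure does not cause a retry */
        if (prev)
            prev->next = next;
        else
            pending_list = next;

        if (cur->at_commit == is_commit)
            performed++;        /* real code would unlink a fork or mark */

        free(cur);
    }
    return performed;
}
```

The point of carrying both flags is that one operation can register two entries, for example "drop the mark file at commit" and "drop the init fork at abort", and exactly one of them fires whichever way the transaction ends.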
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 7755553d57..d251f22207 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,49 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+ char *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+ appendStringInfoString(buf, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+ char *path = GetRelationPath(xlrec->rnode.dbNode,
+ xlrec->rnode.spcNode,
+ xlrec->rnode.relNode,
+ InvalidBackendId,
+ xlrec->forkNum, xlrec->mark);
+ char *action;
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ action = "CREATE";
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ action = "DELETE";
+ break;
+ default:
+ action = "<unknown action>";
+ break;
+ }
+
+ appendStringInfo(buf, "%s %s", action, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+ char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+ appendStringInfoString(buf, path);
+ appendStringInfo(buf, " persistence %d", xlrec->persistence);
+ pfree(path);
+ }
}
const char *
@@ -55,6 +98,15 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_UNLINK:
+ id = "UNLINK";
+ break;
+ case XLOG_SMGR_MARK:
+ id = "MARK";
+ break;
+ case XLOG_SMGR_BUFPERSISTENCE:
+ id = "BUFPERSISTENCE";
+ break;
}
return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c1..b344bbe511 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -724,6 +724,14 @@ we must panic and abort recovery. The DBA will have to manually clean up
then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.
+The Smgr MARK files
+--------------------------------
+
+An smgr mark file is created when a new relation file is created, to
+mark that the relfilenode needs to be cleaned up at recovery time. In
+contrast to the four actions above, failure to remove smgr mark files
+will lead to data loss, in which case the server will shut down.
+
Skipping WAL for New RelFileNode
--------------------------------
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e7b0bc804d..b41186d6d8 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2197,6 +2197,9 @@ CommitTransaction(void)
*/
smgrDoPendingSyncs(true, is_parallel_worker);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2447,6 +2450,9 @@ PrepareTransaction(void)
*/
smgrDoPendingSyncs(true, false);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2772,6 +2778,7 @@ AbortTransaction(void)
AfterTriggerEndXact(false); /* 'false' means it's abort */
AtAbort_Portals();
smgrDoPendingSyncs(false, is_parallel_worker);
+ smgrDoPendingCleanups(false);
AtEOXact_LargeObject(false);
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 87cd05c945..243860fcb1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -40,6 +40,7 @@
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
#include "catalog/pg_database.h"
+#include "catalog/storage.h"
#include "commands/progress.h"
#include "commands/tablespace.h"
#include "common/controldata_utils.h"
@@ -4564,6 +4565,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
{
ereport(DEBUG1,
(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+
+ /* cleanup garbage files left during crash recovery */
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
InArchiveRecovery = true;
if (StandbyModeRequested)
StandbyMode = true;
@@ -7824,6 +7833,14 @@ StartupXLOG(void)
}
}
+ /* cleanup garbage files left during crash recovery */
+ if (!InArchiveRecovery)
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
/* Allow resource managers to do any required cleanup. */
for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
{
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index c5ad28d71f..d6b30387e9 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,6 +19,7 @@
#include "postgres.h"
+#include "access/amapi.h"
#include "access/parallel.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -66,6 +67,23 @@ typedef struct PendingRelDelete
struct PendingRelDelete *next; /* linked-list link */
} PendingRelDelete;
+#define PCOP_UNLINK_FORK (1 << 0)
+#define PCOP_UNLINK_MARK (1 << 1)
+#define PCOP_SET_PERSISTENCE (1 << 2)
+
+typedef struct PendingCleanup
+{
+ RelFileNode relnode; /* relation that may need to be deleted */
+ int op; /* operation mask */
+ bool bufpersistence; /* buffer persistence to set */
+ int unlink_forknum; /* forknum to unlink */
+ StorageMarks unlink_mark; /* mark to unlink */
+ BackendId backend; /* InvalidBackendId if not a temp rel */
+ bool atCommit; /* T=delete at commit; F=delete at abort */
+ int nestLevel; /* xact nesting level of request */
+ struct PendingCleanup *next; /* linked-list link */
+} PendingCleanup;
+
typedef struct PendingRelSync
{
RelFileNode rnode;
@@ -73,6 +91,7 @@ typedef struct PendingRelSync
} PendingRelSync;
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingCleanup *pendingCleanups = NULL; /* head of linked list */
HTAB *pendingSyncHash = NULL;
@@ -117,7 +136,8 @@ AddPendingSync(const RelFileNode *rnode)
SMgrRelation
RelationCreateStorage(RelFileNode rnode, char relpersistence)
{
- PendingRelDelete *pending;
+ PendingRelDelete *pendingdel;
+ PendingCleanup *pendingclean;
SMgrRelation srel;
BackendId backend;
bool needs_wal;
@@ -143,21 +163,41 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return NULL; /* placate compiler */
}
+ /*
+ * We are going to create a new storage file. If the server crashes before
+ * the current transaction ends, the file needs to be cleaned up. The
+ * SMGR_MARK_UNCOMMITTED mark file prompts that work at the next startup.
+ */
srel = smgropen(rnode, backend);
+ log_smgrcreatemark(&rnode, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
smgrcreate(srel, MAIN_FORKNUM, false);
if (needs_wal)
log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM);
/* Add the relation to the list of stuff to delete at abort */
- pending = (PendingRelDelete *)
+ pendingdel = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
- pending->relnode = rnode;
- pending->backend = backend;
- pending->atCommit = false; /* delete if abort */
- pending->nestLevel = GetCurrentTransactionNestLevel();
- pending->next = pendingDeletes;
- pendingDeletes = pending;
+ pendingdel->relnode = rnode;
+ pendingdel->backend = backend;
+ pendingdel->atCommit = false; /* delete if abort */
+ pendingdel->nestLevel = GetCurrentTransactionNestLevel();
+ pendingdel->next = pendingDeletes;
+ pendingDeletes = pendingdel;
+
+ /* drop mark files at commit */
+ pendingclean = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pendingclean->relnode = rnode;
+ pendingclean->op = PCOP_UNLINK_MARK;
+ pendingclean->unlink_forknum = MAIN_FORKNUM;
+ pendingclean->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pendingclean->backend = backend;
+ pendingclean->atCommit = true;
+ pendingclean->nestLevel = GetCurrentTransactionNestLevel();
+ pendingclean->next = pendingCleanups;
+ pendingCleanups = pendingclean;
if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
{
@@ -168,6 +208,203 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return srel;
}
+/*
+ * RelationCreateInitFork
+ * Create physical storage for the init fork of a relation.
+ *
+ * Create the init fork for the relation.
+ *
+ * This function is transactional. The creation is WAL-logged, and if the
+ * transaction aborts later on, the init fork will be removed.
+ */
+void
+RelationCreateInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ SMgrRelation srel;
+ bool create = true;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), false, false);
+
+ /*
+ * If we have entries for init-fork operations on this relation, that means
+ * that we have already registered pending delete entries to drop an
+ * init-fork preexisting since before the current transaction started. This
+ * function reverts that change just by removing the entries.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ create = false;
+ }
+ else
+ prev = pending;
+ }
+
+ if (!create)
+ return;
+
+ /*
+ * We are going to create an init fork. If the server crashes before the
+ * current transaction ends, the leftover init fork corrupts data during
+ * recovery. The mark file works as the sentinel to identify that
+ * situation.
+ */
+ srel = smgropen(rnode, InvalidBackendId);
+ log_smgrcreatemark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+
+ /* We don't have existing init fork, create it. */
+ smgrcreate(srel, INIT_FORKNUM, false);
+
+ /*
+ * An index init fork needs further initialization. ambuildempty should
+ * WAL-log and sync the file by itself; otherwise we do that ourselves.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rnode, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /* drop the init fork, mark file and revert persistence at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_UNLINK_FORK | PCOP_UNLINK_MARK | PCOP_SET_PERSISTENCE;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->bufpersistence = true;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* drop mark file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
+/*
+ * RelationDropInitFork
+ * Delete physical storage for the init fork of a relation.
+ *
+ * Register pending-delete of the init fork. The real deletion is performed by
+ * smgrDoPendingCleanups at commit.
+ *
+ * This function is transactional. If the transaction aborts later on, the
+ * deletion doesn't happen.
+ */
+void
+RelationDropInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ bool inxact_created = false;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), true, false);
+
+ /*
+ * If we have entries for init-fork operations of this relation, that means
+ * that we have created the init fork in the current transaction. We
+ * remove the init fork and mark file immediately in that case. Otherwise
+ * just register pending-delete for the existing init fork.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ /* unlink list entry */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ inxact_created = true;
+ }
+ else
+ prev = pending;
+ }
+
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rnode, InvalidBackendId);
+
+ /*
+ * INIT forks are never loaded into shared buffers, so there is no point
+ * in dropping buffers for such files.
+ */
+ log_smgrunlinkmark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrunlinkmark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+ log_smgrunlink(&rnode, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ return;
+ }
+
+ /* register drop of this init fork file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_UNLINK_FORK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = false;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -187,6 +424,88 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum)
+{
+ xl_smgr_unlink xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file unlink.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK (mark creation) record to WAL.
+ */
+void
+log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file creation.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_CREATE;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK (mark removal) record to WAL.
+ */
+void
+log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the mark file removal.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_UNLINK;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileNode *rnode, bool persistence)
+{
+ xl_smgr_bufpersistence xlrec;
+
+ /*
+ * Make an XLOG entry reporting the change of buffer persistence.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.persistence = persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -255,6 +574,7 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)
prev->next = next;
else
pendingDeletes = next;
+
pfree(pending);
/* prev does not change */
}
@@ -673,6 +993,88 @@ smgrDoPendingDeletes(bool isCommit)
}
}
+/*
+ * smgrDoPendingCleanups() -- Clean up work that emits WAL records
+ *
+ * The operations handled in this function emit WAL records, which must be
+ * emitted before the commit record for the current transaction.
+ */
+void
+smgrDoPendingCleanups(bool isCommit)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (pending->nestLevel < nestLevel)
+ {
+ /* outer-level entries should not be processed yet */
+ prev = pending;
+ }
+ else
+ {
+ /* unlink list entry first, so we don't retry on failure */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ /* do cleanup if called for */
+ if (pending->atCommit == isCommit)
+ {
+ SMgrRelation srel;
+
+ srel = smgropen(pending->relnode, pending->backend);
+
+ Assert ((pending->op &
+ ~(PCOP_UNLINK_FORK | PCOP_UNLINK_MARK |
+ PCOP_SET_PERSISTENCE)) == 0);
+
+ if (pending->op & PCOP_UNLINK_FORK)
+ {
+ /* other forks would need their buffers dropped */
+ Assert(pending->unlink_forknum == INIT_FORKNUM);
+
+ /* Don't emit WAL during recovery. */
+ if (!InRecovery)
+ log_smgrunlink(&pending->relnode,
+ pending->unlink_forknum);
+ smgrunlink(srel, pending->unlink_forknum, false);
+ }
+
+ if (pending->op & PCOP_UNLINK_MARK)
+ {
+ SMgrRelation srel;
+
+ if (!InRecovery)
+ log_smgrunlinkmark(&pending->relnode,
+ pending->unlink_forknum,
+ pending->unlink_mark);
+ srel = smgropen(pending->relnode, pending->backend);
+ smgrunlinkmark(srel, pending->unlink_forknum,
+ pending->unlink_mark, InRecovery);
+ smgrclose(srel);
+ }
+
+ if (pending->op & PCOP_SET_PERSISTENCE)
+ {
+ SetRelationBuffersPersistence(srel, pending->bufpersistence,
+ InRecovery);
+ }
+ }
+
+ /* must explicitly free the list entry */
+ pfree(pending);
+ /* prev does not change */
+ }
+ }
+}
+
/*
* smgrDoPendingSyncs() -- Take care of relation syncs at end of xact.
*/
@@ -933,6 +1335,15 @@ smgr_redo(XLogReaderState *record)
reln = smgropen(xlrec->rnode, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ smgrunlink(reln, xlrec->forkNum, true);
+ smgrclose(reln);
+ }
else if (info == XLOG_SMGR_TRUNCATE)
{
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
@@ -1021,6 +1432,124 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ bool created = false;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ smgrcreatemark(reln, xlrec->forkNum, xlrec->mark, true);
+ created = true;
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ smgrunlinkmark(reln, xlrec->forkNum, xlrec->mark, true);
+ break;
+ default:
+ elog(ERROR, "unknown smgr_mark action \"%c\"", xlrec->mark);
+ }
+
+ if (created)
+ {
+ /* revert mark file operation at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = xlrec->rnode;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = xlrec->forkNum;
+ pending->unlink_mark = xlrec->mark;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ else
+ {
+ /*
+ * Delete pending action for this mark file if any. We should have
+ * at most one entry for this action.
+ */
+ PendingCleanup *prev = NULL;
+
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileNodeEquals(xlrec->rnode, pending->relnode) &&
+ pending->unlink_forknum == xlrec->forkNum &&
+ (pending->op & PCOP_UNLINK_MARK) != 0)
+ {
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+ }
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ PendingCleanup *prev = NULL;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+
+ /*
+ * Delete pending action for persistence change if any. We should have
+ * at most one entry for this action.
+ */
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileNodeEquals(xlrec->rnode, pending->relnode) &&
+ (pending->op & PCOP_SET_PERSISTENCE) != 0)
+ {
+ Assert (pending->bufpersistence == xlrec->persistence);
+
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+
+ /*
+ * Revert buffer-persistence changes at abort if the relation is going
+ * to different persistence from before this transaction.
+ */
+ if (!pending)
+ {
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = xlrec->rnode;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = !xlrec->persistence;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 89bc865e28..51fcf9ca5f 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -52,6 +52,7 @@
#include "commands/defrem.h"
#include "commands/event_trigger.h"
#include "commands/policy.h"
+#include "commands/progress.h"
#include "commands/sequence.h"
#include "commands/tablecmds.h"
#include "commands/tablespace.h"
@@ -5346,6 +5347,187 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
return newcmd;
}
+/*
+ * RelationChangePersistence: do in-place persistence change of a relation
+ */
+static void
+RelationChangePersistence(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ int i;
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * If any of the following were set we would need to call ATRewriteTable;
+ * none of them can be set in the AT_REWRITE_ALTER_PERSISTENCE case.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ /*
+ * First we collect all relations that we need to change persistence.
+ */
+
+ /* Collect OIDs of indexes and toast relations */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ /* Add toast relation if any */
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ List *toastidx;
+ Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
+
+ relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
+ toastidx = RelationGetIndexList(toastrel);
+ relids = list_concat(relids, toastidx);
+ pfree(toastidx);
+ table_close(toastrel, NoLock);
+ }
+
+ table_close(rel, NoLock);
+
+ /* Make changes in storage */
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach (lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+
+ /*
+ * XXXX: Some access methods cannot tolerate an in-place persistence
+ * change. Specifically, GiST uses page LSNs to figure out whether a
+ * block has changed, and UNLOGGED GiST indexes use fake LSNs that
+ * are incompatible with the real LSNs used for LOGGED ones.
+ *
+ * Maybe if gistGetFakeLSN behaved the same way for permanent and
+ * unlogged indexes, we could skip the index rebuild in exchange for
+ * some extra WAL records emitted while it is unlogged.
+ *
+ * Check relam against a positive list so that unknown AMs also take
+ * the reindex path.
+ */
+ if (r->rd_rel->relkind == RELKIND_INDEX &&
+ /* GiST is excluded */
+ r->rd_rel->relam != BTREE_AM_OID &&
+ r->rd_rel->relam != HASH_AM_OID &&
+ r->rd_rel->relam != GIN_AM_OID &&
+ r->rd_rel->relam != SPGIST_AM_OID &&
+ r->rd_rel->relam != BRIN_AM_OID)
+ {
+ int reindex_flags;
+ ReindexParams params = {0};
+
+ /* reindex doesn't allow concurrent use of the index */
+ table_close(r, NoLock);
+
+ reindex_flags =
+ REINDEX_REL_SUPPRESS_INDEX_USE |
+ REINDEX_REL_CHECK_CONSTRAINTS;
+
+ /* Set the same persistence as the parent relation. */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+
+ reindex_index(reloid, reindex_flags, persistence, &params);
+
+ continue;
+ }
+
+ /* Create or drop init fork */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ RelationCreateInitFork(r);
+ else
+ RelationDropInitFork(r);
+
+ /*
+ * When this relation gets WAL-logged, immediately sync all files
+ * except the init fork to establish the initial state on storage.
+ * Buffers have already been flushed out by RelationCreate(Drop)InitFork
+ * called just above. The init fork should have been synced as needed.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT)
+ {
+ for (i = 0 ; i < INIT_FORKNUM ; i++)
+ {
+ if (smgrexists(RelationGetSmgr(r), i))
+ smgrimmedsync(RelationGetSmgr(r), i);
+ }
+ }
+
+ /* Update catalog */
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+
+ /*
+ * While wal_level >= replica, switching to LOGGED requires the
+ * relation content to be WAL-logged to recover the table.
+ * We don't emit this while wal_level = minimal.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded())
+ {
+ ForkNumber fork;
+ xl_smgr_truncate xlrec;
+
+ xlrec.blkno = 0;
+ xlrec.rnode = r->rd_node;
+ xlrec.flags = SMGR_TRUNCATE_ALL;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ for (fork = 0; fork < INIT_FORKNUM ; fork++)
+ {
+ if (smgrexists(RelationGetSmgr(r), fork))
+ log_newpage_range(r, fork, 0,
+ smgrnblocks(RelationGetSmgr(r), fork),
+ false);
+ }
+ }
+
+ table_close(r, NoLock);
+ }
+
+ table_close(classRel, NoLock);
+}
+
/*
* ATRewriteTables: ALTER TABLE phase 3
*/
@@ -5476,47 +5658,55 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- /*
- * Create transient table that will receive the modified data.
- *
- * Ensure it is marked correctly as logged or unlogged. We have
- * to do this here so that buffers for the new relfilenode will
- * have the right persistence set, and at the same time ensure
- * that the original filenode's buffers will get read in with the
- * correct setting (i.e. the original one). Otherwise a rollback
- * after the rewrite would possibly result with buffers for the
- * original filenode having the wrong persistence setting.
- *
- * NB: This relies on swap_relation_files() also swapping the
- * persistence. That wouldn't work for pg_class, but that can't be
- * unlogged anyway.
- */
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
- persistence, lockmode);
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
+ RelationChangePersistence(tab, persistence, lockmode);
+ else
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new relfilenode
+ * will have the right persistence set, and at the same time
+ * ensure that the original filenode's buffers will get read in
+ * with the correct setting (i.e. the original one). Otherwise
+ * a rollback after the rewrite would possibly result with
+ * buffers for the original filenode having the wrong
+ * persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that can't
+ * be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace,
+ NewAccessMethod,
+ persistence, lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
+ /*
+ * Swap the physical files of the old and new heaps, then
+ * rebuild indexes and discard the old heap. We can use
+ * RecentXmin for the table's new relfrozenxid because we
+ * rewrote all the tuples in ATRewriteTable, so no older Xid
+ * remains in the table. Also, we never try to swap toast
+ * tables by content, since we have no interest in letting this
+ * code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
- InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ }
}
else
{
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index ec0485705d..45e1a5d817 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1070,6 +1070,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
bool excludeFound;
ForkNumber relForkNum; /* Type of fork if file is a relation */
int relOidChars; /* Chars in filename that are the rel oid */
+ StorageMarks mark;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1120,7 +1121,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
/* Exclude all forks for unlogged tables except the init fork */
if (isDbDir &&
parse_filename_for_nontemp_relation(de->d_name, &relOidChars,
- &relForkNum))
+ &relForkNum, &mark))
{
/* Never exclude init forks */
if (relForkNum != INIT_FORKNUM)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index b4532948d3..dab74bf99a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -37,6 +37,7 @@
#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
#include "executor/instrument.h"
#include "lib/binaryheap.h"
#include "miscadmin.h"
@@ -3154,6 +3155,93 @@ DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
}
}
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a relation,
+ * then, when switching to PERMANENT, writes all dirty pages of the relation
+ * out to disk (or more accurately, out to kernel disk buffers), ensuring
+ * that the kernel has an up-to-date view of the relation.
+ *
+ * Generally, the caller should be holding AccessExclusiveLock on the
+ * target relation to ensure that no other backend is busy dirtying
+ * more blocks of the relation; the effects can't be expected to last
+ * after the lock is released.
+ *
+ * XXX currently it sequentially searches the buffer pool, should be
+ * changed to more clever ways of searching. This routine is not
+ * used in any performance-critical code paths, so it's not worth
+ * adding additional overhead to normal paths to make it go faster;
+ * but see also DropRelFileNodeBuffers.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
+{
+ int i;
+ RelFileNodeBackend rnode = srel->smgr_rnode;
+
+ Assert (!RelFileNodeBackendIsTemp(rnode));
+
+ if (!isRedo)
+ log_smgrbufpersistence(&srel->smgr_rnode.node, permanent);
+
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ {
+ UnlockBufHdr(bufHdr, buf_state);
+ continue;
+ }
+
+ if (permanent)
+ {
+ /* Init fork is being dropped, drop buffers for it. */
+ if (bufHdr->tag.forkNum == INIT_FORKNUM)
+ {
+ InvalidateBuffer(bufHdr);
+ continue;
+ }
+
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /* we flush this buffer when switching to PERMANENT */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, srel);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr, true);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ /* init fork is always BM_PERMANENT. See BufferAlloc */
+ if (bufHdr->tag.forkNum != INIT_FORKNUM)
+ buf_state &= ~BM_PERMANENT;
+
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ }
+}
+
/* ---------------------------------------------------------------------
* DropRelFileNodesAllBuffers
*
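As a reading aid for the bufmgr.c hunk above (not part of the patch), here is a minimal sketch of the per-buffer decision that SetRelationBuffersPersistence() appears to make. Everything in it is a simplified stand-in: the flag values, the BufSketch struct and flip_persistence() are hypothetical names, whereas the real code operates on BufferDesc under the buffer header lock and calls InvalidateBuffer() and FlushBuffer() itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits; the real definitions live in buf_internals.h. */
#define FLAG_VALID      (1u << 0)
#define FLAG_DIRTY      (1u << 1)
#define FLAG_PERMANENT  (1u << 2)

/* Simplified stand-in for a buffer header. */
typedef struct
{
    uint32_t state;         /* flag bits */
    bool     is_init_fork;  /* buffer belongs to the init fork */
} BufSketch;

/*
 * Apply a persistence change to one buffer.
 *
 * Returns true if the caller should flush the buffer to storage.
 * Sets *drop to true if the buffer should be invalidated instead,
 * which happens for init-fork buffers when switching to PERMANENT.
 */
static bool
flip_persistence(BufSketch *buf, bool permanent, bool *drop)
{
    *drop = false;

    if (permanent)
    {
        /* The init fork is going away, so its buffers are dropped. */
        if (buf->is_init_fork)
        {
            *drop = true;
            return false;
        }

        buf->state |= FLAG_PERMANENT;

        /* Only valid and dirty buffers need to be written out. */
        return (buf->state & (FLAG_VALID | FLAG_DIRTY)) ==
               (FLAG_VALID | FLAG_DIRTY);
    }

    /* Init-fork buffers stay permanent even for unlogged relations. */
    if (!buf->is_init_fork)
        buf->state &= ~FLAG_PERMANENT;

    return false;
}
```

The sketch separates "decide" from "act" only so that it can be read and tested in isolation; the patch does both inside one loop over shared buffers.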
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 263057841d..8487ae1f02 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -349,8 +349,6 @@ static void pre_sync_fname(const char *fname, bool isdir, int elevel);
static void datadir_fsync_fname(const char *fname, bool isdir, int elevel);
static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel);
-static int fsync_parent_path(const char *fname, int elevel);
-
/*
* pg_fsync --- do fsync with or without writethrough
@@ -3759,7 +3757,7 @@ fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel)
* This is aimed at making file operations persistent on disk in case of
* an OS crash or power failure.
*/
-static int
+int
fsync_parent_path(const char *fname, int elevel)
{
char parentpath[MAXPGPATH];
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 0ae3fb6902..0137902bb2 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -16,29 +16,49 @@
#include <unistd.h>
+#include "access/xlog.h"
+#include "catalog/pg_tablespace_d.h"
#include "common/relpath.h"
#include "postmaster/startup.h"
+#include "storage/bufmgr.h"
#include "storage/copydir.h"
#include "storage/fd.h"
+#include "storage/md.h"
#include "storage/reinit.h"
+#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
- int op);
+ Oid tspid, int op);
static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
- int op);
+ Oid tspid, Oid dbid, int op);
typedef struct
{
Oid reloid; /* hash key */
-} unlogged_relation_entry;
+ bool has_init; /* has INIT fork */
+ bool dirty_init; /* needs to remove INIT fork */
+ bool dirty_all; /* needs to remove all forks */
+} relfile_entry;
/*
- * Reset unlogged relations from before the last restart.
+ * Clean up and reset relation files from before the last restart.
*
- * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any
- * relation with an "init" fork, except for the "init" fork itself.
+ * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations
+ * depending on the existence of the "cleanup" forks.
+ *
+ * If SMGR_MARK_UNCOMMITTED mark file for init fork is present, we remove the
+ * init fork along with the mark file.
+ *
+ * If SMGR_MARK_UNCOMMITTED mark file for main fork is present we remove the
+ * whole relation along with the mark file.
+ *
+ * Otherwise, if the "init" fork is found, we remove all forks of any relation
+ * with the "init" fork, except for the "init" fork itself.
+ *
+ * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all
+ * relations that have the "cleanup" and/or the "init" forks.
*
* If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main
* fork.
@@ -72,7 +92,7 @@ ResetUnloggedRelations(int op)
/*
* First process unlogged files in pg_default ($PGDATA/base)
*/
- ResetUnloggedRelationsInTablespaceDir("base", op);
+ ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op);
/*
* Cycle through directories for all non-default tablespaces.
@@ -81,13 +101,19 @@ ResetUnloggedRelations(int op)
while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
{
+ Oid tspid;
+
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
continue;
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
- ResetUnloggedRelationsInTablespaceDir(temp_path, op);
+
+ tspid = atooid(spc_de->d_name);
+
+ Assert(tspid != 0);
+ ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op);
}
FreeDir(spc_dir);
@@ -103,7 +129,8 @@ ResetUnloggedRelations(int op)
* Process one tablespace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
+ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
+ Oid tspid, int op)
{
DIR *ts_dir;
struct dirent *de;
@@ -130,6 +157,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
while ((de = ReadDir(ts_dir, tsdirname)) != NULL)
{
+ Oid dbid;
+
/*
* We're only interested in the per-database directories, which have
* numeric names. Note that this code will also (properly) ignore "."
@@ -148,7 +177,10 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
ereport_startup_progress("resetting unlogged relations (cleanup), elapsed time: %ld.%02d s, current path: %s",
dbspace_path);
- ResetUnloggedRelationsInDbspaceDir(dbspace_path, op);
+ dbid = atooid(de->d_name);
+ Assert(dbid != 0);
+
+ ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op);
}
FreeDir(ts_dir);
@@ -158,125 +190,227 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
* Process one per-dbspace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
+ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
+ Oid tspid, Oid dbid, int op)
{
DIR *dbspace_dir;
struct dirent *de;
char rm_path[MAXPGPATH * 2];
+ HTAB *hash;
+ HASHCTL ctl;
/* Caller must specify at least one operation. */
- Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0);
+ Assert((op & (UNLOGGED_RELATION_CLEANUP |
+ UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_INIT)) != 0);
/*
* Cleanup is a two-pass operation. First, we go through and identify all
* the files with init forks. Then, we go through again and nuke
* everything with the same OID except the init fork.
*/
+
+ /*
+ * It's possible that someone could create tons of unlogged relations in
+ * the same database & tablespace, so we'd better use a hash table rather
+ * than an array or linked list to keep track of which files need to be
+ * reset. Otherwise, this cleanup operation would be O(n^2).
+ */
+ memset(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(relfile_entry);
+ hash = hash_create("relfilenode cleanup hash",
+ 32, &ctl, HASH_ELEM | HASH_BLOBS);
+
+ /* Collect INIT fork and mark files in the directory. */
+ dbspace_dir = AllocateDir(dbspacedirname);
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
+ {
+ int oidchars;
+ ForkNumber forkNum;
+ StorageMarks mark;
+
+ /* Skip anything that doesn't look like a relation data file. */
+ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
+ &forkNum, &mark))
+ continue;
+
+ if (forkNum == INIT_FORKNUM || mark == SMGR_MARK_UNCOMMITTED)
+ {
+ Oid key;
+ relfile_entry *ent;
+ bool found;
+
+ /*
+ * Record the relfilenode information. If it has
+ * SMGR_MARK_UNCOMMITTED mark files, the relfilenode is in dirty
+ * state, where clean up is needed.
+ */
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_ENTER, &found);
+
+ if (!found)
+ {
+ ent->has_init = false;
+ ent->dirty_init = false;
+ ent->dirty_all = false;
+ }
+
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_init = true;
+ else if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_all = true;
+ else
+ {
+ Assert(forkNum == INIT_FORKNUM);
+ ent->has_init = true;
+ }
+ }
+ }
+
+ /* Done with the first pass. */
+ FreeDir(dbspace_dir);
+
+ /* nothing to do if we have neither init nor cleanup forks */
+ if (hash_get_num_entries(hash) < 1)
+ {
+ hash_destroy(hash);
+ return;
+ }
+
+ if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0)
+ {
+ /*
+ * When we come here after recovery, an smgr object for this file might
+ * have been created. In that case we need to drop all buffers and then
+ * the smgr object before initializing the unlogged relation. This is
+ * safe as long as no other backends have accessed the relation before
+ * starting archive recovery.
+ */
+ HASH_SEQ_STATUS status;
+ relfile_entry *ent;
+ SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8);
+ int maxrels = 8;
+ int nrels = 0;
+ int i;
+
+ Assert(!HotStandbyActive());
+
+ hash_seq_init(&status, hash);
+ while((ent = (relfile_entry *) hash_seq_search(&status)) != NULL)
+ {
+ RelFileNodeBackend rel;
+
+ /*
+ * The relation is persistent and stays persistent. Don't
+ * drop the buffers for this relation.
+ */
+ if (ent->has_init && ent->dirty_init)
+ continue;
+
+ if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = ent->reloid;
+
+ srels[nrels++] = smgropen(rel.node, InvalidBackendId);
+ }
+
+ DropRelFileNodesAllBuffers(srels, nrels);
+
+ for (i = 0 ; i < nrels ; i++)
+ smgrclose(srels[i]);
+ }
+
+ /*
+ * Now, make a second pass and remove anything that matches.
+ */
if ((op & UNLOGGED_RELATION_CLEANUP) != 0)
{
- HTAB *hash;
- HASHCTL ctl;
-
- /*
- * It's possible that someone could create a ton of unlogged relations
- * in the same database & tablespace, so we'd better use a hash table
- * rather than an array or linked list to keep track of which files
- * need to be reset. Otherwise, this cleanup operation would be
- * O(n^2).
- */
- ctl.keysize = sizeof(Oid);
- ctl.entrysize = sizeof(unlogged_relation_entry);
- ctl.hcxt = CurrentMemoryContext;
- hash = hash_create("unlogged relation OIDs", 32, &ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- /* Scan the directory. */
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ int oidchars;
+ Oid key;
+ relfile_entry *ent;
+ RelFileNodeBackend rel;
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* Also skip it unless this is the init fork. */
- if (forkNum != INIT_FORKNUM)
- continue;
-
- /*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
- */
- ent.reloid = atooid(de->d_name);
- (void) hash_search(hash, &ent, HASH_ENTER, NULL);
- }
-
- /* Done with the first pass. */
- FreeDir(dbspace_dir);
-
- /*
- * If we didn't find any init forks, there's no point in continuing;
- * we can bail out now.
- */
- if (hash_get_num_entries(hash) == 0)
- {
- hash_destroy(hash);
- return;
- }
-
- /*
- * Now, make a second pass and remove anything that matches.
- */
- dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
- {
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
-
- /* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* We never remove the init fork. */
- if (forkNum == INIT_FORKNUM)
+ &forkNum, &mark))
continue;
/*
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
- if (hash_search(hash, &ent, HASH_FIND, NULL))
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);
+
+ if (!ent)
+ continue;
+
+ if (!ent->dirty_all)
{
- snprintf(rm_path, sizeof(rm_path), "%s/%s",
- dbspacedirname, de->d_name);
- if (unlink(rm_path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m",
- rm_path)));
+ /* clean permanent relations don't need cleanup */
+ if (!ent->has_init)
+ continue;
+
+ if (ent->dirty_init)
+ {
+ /*
+ * The crashed transaction did SET UNLOGGED. This relation
+ * is restored to a LOGGED relation.
+ */
+ if (forkNum != INIT_FORKNUM)
+ continue;
+ }
else
- elog(DEBUG2, "unlinked file \"%s\"", rm_path);
+ {
+ /*
+ * we don't remove the INIT fork of a non-dirty
+ * relfilenode
+ */
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE)
+ continue;
+ }
}
+
+ /* so, nuke it! */
+ snprintf(rm_path, sizeof(rm_path), "%s/%s",
+ dbspacedirname, de->d_name);
+ if (unlink(rm_path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ rm_path)));
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = atooid(de->d_name);
+
+ ForgetRelationForkSyncRequests(rel, forkNum);
}
/* Cleanup is complete. */
FreeDir(dbspace_dir);
- hash_destroy(hash);
}
+ hash_destroy(hash);
+ hash = NULL;
+
/*
* Initialization happens after cleanup is complete: we copy each init
- * fork file to the corresponding main fork file. Note that if we are
- * asked to do both cleanup and init, we may never get here: if the
- * cleanup code determines that there are no init forks in this dbspace,
- * it will return before we get to this point.
+ * fork file to the corresponding main fork file.
*/
if ((op & UNLOGGED_RELATION_INIT) != 0)
{
@@ -285,6 +419,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char srcpath[MAXPGPATH * 2];
@@ -292,9 +427,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -328,15 +465,18 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char mainpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -379,7 +519,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
*/
bool
parse_filename_for_nontemp_relation(const char *name, int *oidchars,
- ForkNumber *fork)
+ ForkNumber *fork, StorageMarks *mark)
{
int pos;
@@ -410,11 +550,19 @@ parse_filename_for_nontemp_relation(const char *name, int *oidchars,
for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar)
;
- if (segchar <= 1)
- return false;
- pos += segchar;
+ if (segchar > 1)
+ pos += segchar;
}
+ /* mark file? */
+ if (name[pos] == '.' && name[pos + 1] != 0)
+ {
+ *mark = name[pos + 1];
+ pos += 2;
+ }
+ else
+ *mark = SMGR_MARK_NONE;
+
/* Now we should be at the end. */
if (name[pos] != '\0')
return false;
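The cleanup pass in the reinit.c changes above decides, file by file, what to unlink based on three per-relfilenode flags. The following is a minimal sketch of that decision as I read it from the diff, not code from the patch: the Fork enum, RelfileState and should_remove() are hypothetical stand-ins for ForkNumber, relfile_entry and the inline conditions in ResetUnloggedRelationsInDbspaceDir().

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the ForkNumber values. */
typedef enum { FORK_MAIN, FORK_FSM, FORK_VM, FORK_INIT } Fork;

/* Per-relfilenode state gathered in the first directory pass. */
typedef struct
{
    bool has_init;    /* an unmarked init fork file exists */
    bool dirty_init;  /* the init fork has an "uncommitted" mark file */
    bool dirty_all;   /* the main fork has an "uncommitted" mark file */
} RelfileState;

/*
 * Return true if a file of this relfilenode should be unlinked during
 * cleanup. "fork" is the fork the file name refers to; "is_mark_file"
 * is true when the file is a mark file rather than a fork file.
 */
static bool
should_remove(const RelfileState *st, Fork fork, bool is_mark_file)
{
    /* The relation's creation never committed: remove everything. */
    if (st->dirty_all)
        return true;

    /* Clean permanent relation: nothing to clean up. */
    if (!st->has_init)
        return false;

    /*
     * The crashed transaction did SET UNLOGGED: drop only the init fork
     * and its mark file, which restores a LOGGED relation.
     */
    if (st->dirty_init)
        return fork == FORK_INIT;

    /* Ordinary unlogged relation: keep only the unmarked init fork. */
    return !(fork == FORK_INIT && !is_mark_file);
}
```

For example, with has_init set and both dirty flags clear, the main fork is removed and the init fork survives, which matches the behaviour before the patch.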
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index b4bca7eed6..580b74839f 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -139,7 +139,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
BlockNumber blkno, bool skipFsync, int behavior);
static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
MdfdVec *seg);
-
+static bool mdmarkexists(SMgrRelation reln, ForkNumber forkNum,
+ StorageMarks mark);
/*
* mdinit() -- Initialize private state for magnetic disk storage manager.
@@ -169,6 +170,81 @@ mdexists(SMgrRelation reln, ForkNumber forkNum)
return (mdopenfork(reln, forkNum, EXTENSION_RETURN_NULL) != NULL);
}
+/*
+ * mdcreatemark() -- Create a mark file.
+ *
+ * If isRedo is true, it's okay for the file to exist already.
+ */
+void
+mdcreatemark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+ int fd;
+
+ /* See mdcreate for details. */
+ TablespaceCreateDbspace(reln->smgr_rnode.node.spcNode,
+ reln->smgr_rnode.node.dbNode,
+ isRedo);
+
+ fd = BasicOpenFile(path, O_WRONLY | O_CREAT | O_EXCL);
+ if (fd < 0 && (!isRedo || errno != EEXIST))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create mark file \"%s\": %m", path)));
+
+ pg_fsync(fd);
+ close(fd);
+
+ /*
+ * To guarantee that the creation of the file is persistent, fsync its
+ * parent directory.
+ */
+ fsync_parent_path(path, ERROR);
+
+ pfree(path);
+}
+
+
+/*
+ * mdunlinkmark() -- Delete the mark file
+ *
+ * If isRedo is true, it's okay for the file not to be found.
+ */
+void
+mdunlinkmark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+
+ if (!isRedo || mdmarkexists(reln, forkNum, mark))
+ durable_unlink(path, ERROR);
+
+ pfree(path);
+}
+
+/*
+ * mdmarkexists() -- Check if the file exists.
+ */
+static bool
+mdmarkexists(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+ int fd;
+
+ fd = BasicOpenFile(path, O_RDONLY);
+ if (fd < 0 && errno != ENOENT)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not access mark file \"%s\": %m", path)));
+ pfree(path);
+
+ if (fd < 0)
+ return false;
+
+ return true;
+}
+
/*
* mdcreate() -- Create a new relation on magnetic disk.
*
@@ -1025,6 +1101,15 @@ register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum,
RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ );
}
+/*
+ * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork
+ */
+void
+ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, ForkNumber forknum)
+{
+ register_forget_request(rnode, forknum, 0);
+}
+
/*
* ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB
*/
@@ -1378,12 +1463,14 @@ mdsyncfiletag(const FileTag *ftag, char *path)
* Return 0 on success, -1 on failure, with errno set.
*/
int
-mdunlinkfiletag(const FileTag *ftag, char *path)
+mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark)
{
char *p;
/* Compute the path. */
- p = relpathperm(ftag->rnode, MAIN_FORKNUM);
+ p = GetRelationPath(ftag->rnode.dbNode, ftag->rnode.spcNode,
+ ftag->rnode.relNode, InvalidBackendId, MAIN_FORKNUM,
+ mark);
strlcpy(path, p, MAXPGPATH);
pfree(p);
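For readers less familiar with the pattern used by mdcreatemark() in the md.c changes above, here is a minimal POSIX-only sketch of creating a marker file exclusively while tolerating an already existing file during replay. It is an illustration under assumptions, not the patch's code: create_mark_file() is a hypothetical name, plain open()/fsync() stand in for BasicOpenFile()/pg_fsync(), and the parent-directory fsync that the patch performs is omitted. The sketch only syncs when it actually holds a valid descriptor, which is one possible way to handle the "already exists" case.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Create an empty marker file at "path".
 *
 * Returns 0 on success and -1 on failure (errno is set by the failing
 * call). When is_redo is true, an already existing file counts as
 * success, since replay may repeat the operation.
 */
static int
create_mark_file(const char *path, bool is_redo)
{
    /* O_EXCL makes creation fail with EEXIST if the file is present. */
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);

    if (fd < 0)
    {
        if (is_redo && errno == EEXIST)
            return 0;
        return -1;
    }

    /* Flush the new file before reporting success. */
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }

    return close(fd);
}
```

A caller outside of replay would pass is_redo = false and treat -1 as an error, mirroring the ereport(ERROR, ...) in the patch.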
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 0fcef4994b..110e64b0b2 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -62,6 +62,10 @@ typedef struct f_smgr
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+ void (*smgr_createmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+ void (*smgr_unlinkmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
} f_smgr;
static const f_smgr smgrsw[] = {
@@ -82,6 +86,8 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
+ .smgr_createmark = mdcreatemark,
+ .smgr_unlinkmark = mdunlinkmark,
}
};
@@ -335,6 +341,26 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo);
}
+/*
+ * smgrcreatemark() -- Create a mark file
+ */
+void
+smgrcreatemark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_createmark(reln, forknum, mark, isRedo);
+}
+
+/*
+ * smgrunlinkmark() -- Delete a mark file
+ */
+void
+smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlinkmark(reln, forknum, mark, isRedo);
+}
+
/*
* smgrdosyncall() -- Immediately sync all forks of all given relations
*
@@ -662,6 +688,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index d4083e8a56..9563940d45 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -89,7 +89,8 @@ static CycleCtr checkpoint_cycle_ctr = 0;
typedef struct SyncOps
{
int (*sync_syncfiletag) (const FileTag *ftag, char *path);
- int (*sync_unlinkfiletag) (const FileTag *ftag, char *path);
+ int (*sync_unlinkfiletag) (const FileTag *ftag, char *path,
+ StorageMarks mark);
bool (*sync_filetagmatches) (const FileTag *ftag,
const FileTag *candidate);
} SyncOps;
@@ -222,7 +223,8 @@ SyncPostCheckpoint(void)
/* Unlink the file */
if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
- path) < 0)
+ path,
+ SMGR_MARK_NONE) < 0)
{
/*
* There's a race condition, when the database is dropped at the
@@ -236,6 +238,20 @@ SyncPostCheckpoint(void)
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m", path)));
}
+ else if (syncsw[entry->tag.handler].sync_unlinkfiletag(
+ &entry->tag, path,
+ SMGR_MARK_UNCOMMITTED) < 0)
+ {
+ /*
+ * An SMGR_MARK_UNCOMMITTED file may also exist. Remove it once the
+ * fork file has been removed successfully. It is fine if the mark
+ * file does not exist.
+ */
+ if (errno != ENOENT)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
/* Mark the list entry as canceled, just in case */
entry->canceled = true;
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 436df54120..dbc0da5da5 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -407,6 +407,30 @@ extractPageInfo(XLogReaderState *record)
* source system.
*/
}
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_UNLINK)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_MARK)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
else if (rmid == RM_XACT_ID &&
((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT ||
(rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED ||
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 1f5c426ec0..4945b111cc 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -139,9 +139,15 @@ GetDatabasePath(Oid dbNode, Oid spcNode)
*/
char *
GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
- int backendId, ForkNumber forkNumber)
+ int backendId, ForkNumber forkNumber, char mark)
{
char *path;
+ char markstr[4];
+
+ if (mark == 0)
+ markstr[0] = 0;
+ else
+ snprintf(markstr, sizeof(markstr), ".%c", mark);
if (spcNode == GLOBALTABLESPACE_OID)
{
@@ -149,10 +155,10 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
Assert(dbNode == 0);
Assert(backendId == InvalidBackendId);
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("global/%u_%s",
- relNode, forkNames[forkNumber]);
+ path = psprintf("global/%u_%s%s",
+ relNode, forkNames[forkNumber], markstr);
else
- path = psprintf("global/%u", relNode);
+ path = psprintf("global/%u%s", relNode, markstr);
}
else if (spcNode == DEFAULTTABLESPACE_OID)
{
@@ -160,22 +166,22 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/%u_%s",
+ path = psprintf("base/%u/%u_%s%s",
dbNode, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/%u",
- dbNode, relNode);
+ path = psprintf("base/%u/%u%s",
+ dbNode, relNode, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/t%d_%u_%s",
+ path = psprintf("base/%u/t%d_%u_%s%s",
dbNode, backendId, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/t%d_%u",
- dbNode, backendId, relNode);
+ path = psprintf("base/%u/t%d_%u%s",
+ dbNode, backendId, relNode, markstr);
}
}
else
@@ -184,27 +190,28 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u_%s%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
dbNode, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
- dbNode, relNode);
+ dbNode, relNode, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
dbNode, backendId, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
- dbNode, backendId, relNode);
+ dbNode, backendId, relNode, markstr);
}
}
+
return path;
}
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 0ab32b44e9..584ebac391 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -23,6 +23,8 @@
extern int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
+extern void RelationCreateInitFork(Relation rel);
+extern void RelationDropInitFork(Relation rel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationPreTruncate(Relation rel);
@@ -41,6 +43,7 @@ extern void RestorePendingSyncs(char *startAddress);
extern void smgrDoPendingDeletes(bool isCommit);
extern void smgrDoPendingSyncs(bool isCommit, bool isParallelWorker);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingCleanups(bool isCommit);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index f0814f1458..12346ed7f6 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -18,17 +18,23 @@
#include "lib/stringinfo.h"
#include "storage/block.h"
#include "storage/relfilenode.h"
+#include "storage/smgr.h"
/*
* Declarations for smgr-related XLOG records
*
- * Note: we log file creation and truncation here, but logging of deletion
- * actions is handled by xact.c, because it is part of transaction commit.
+ * Note: we log file creation, truncation and buffer persistence changes here,
+ * but logging of deletion actions is handled mainly by xact.c, because
+ * deletion is part of transaction commit in most cases. The exception is
+ * init forks, which can be deleted outside transaction control.
*/
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_UNLINK 0x30
+#define XLOG_SMGR_MARK 0x40
+#define XLOG_SMGR_BUFPERSISTENCE 0x50
typedef struct xl_smgr_create
{
@@ -36,6 +42,32 @@ typedef struct xl_smgr_create
ForkNumber forkNum;
} xl_smgr_create;
+typedef struct xl_smgr_unlink
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+} xl_smgr_unlink;
+
+typedef enum smgr_mark_action
+{
+ XLOG_SMGR_MARK_CREATE = 'c',
+ XLOG_SMGR_MARK_UNLINK = 'u'
+} smgr_mark_action;
+
+typedef struct xl_smgr_mark
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ smgr_mark_action action;
+} xl_smgr_mark;
+
+typedef struct xl_smgr_bufpersistence
+{
+ RelFileNode rnode;
+ bool persistence;
+} xl_smgr_bufpersistence;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +83,12 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark);
+extern void log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark);
+extern void log_smgrbufpersistence(const RelFileNode *rnode, bool persistence);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index a44be11ca0..106a5cf508 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -67,7 +67,7 @@ extern int forkname_chars(const char *str, ForkNumber *fork);
extern char *GetDatabasePath(Oid dbNode, Oid spcNode);
extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
- int backendId, ForkNumber forkNumber);
+ int backendId, ForkNumber forkNumber, char mark);
/*
* Wrapper macros for GetRelationPath. Beware of multiple
@@ -77,7 +77,7 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
/* First argument is a RelFileNode */
#define relpathbackend(rnode, backend, forknum) \
GetRelationPath((rnode).dbNode, (rnode).spcNode, (rnode).relNode, \
- backend, forknum)
+ backend, forknum, 0)
/* First argument is a RelFileNode */
#define relpathperm(rnode, forknum) \
@@ -87,4 +87,9 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
#define relpath(rnode, forknum) \
relpathbackend((rnode).node, (rnode).backend, forknum)
+/* First argument is a RelFileNodeBackend */
+#define markpath(rnode, forknum, mark) \
+ GetRelationPath((rnode).node.dbNode, (rnode).node.spcNode, \
+ (rnode).node.relNode, \
+ (rnode).backend, forknum, mark)
#endif /* RELPATH_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index cfce23ecbc..f5a7df87a4 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -206,6 +206,8 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels)
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+ bool permanent, bool isRedo);
extern void DropRelFileNodesAllBuffers(struct SMgrRelationData **smgr_reln, int nnodes);
extern void DropDatabaseBuffers(Oid dbid);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 34602ae006..2dc0357ad5 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -185,6 +185,7 @@ extern ssize_t pg_pwritev_with_retry(int fd,
extern int pg_truncate(const char *path, off_t length);
extern void fsync_fname(const char *fname, bool isdir);
extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
+extern int fsync_parent_path(const char *fname, int elevel);
extern int durable_rename(const char *oldfile, const char *newfile, int loglevel);
extern int durable_unlink(const char *fname, int loglevel);
extern int durable_rename_excl(const char *oldfile, const char *newfile, int loglevel);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 752b440864..99620816b5 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -23,6 +23,10 @@
extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
extern void mdclose(SMgrRelation reln, ForkNumber forknum);
+extern void mdcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void mdunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
@@ -41,12 +45,14 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void ForgetRelationForkSyncRequests(RelFileNodeBackend rnode,
+ ForkNumber forknum);
extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
/* md sync callbacks */
extern int mdsyncfiletag(const FileTag *ftag, char *path);
-extern int mdunlinkfiletag(const FileTag *ftag, char *path);
+extern int mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark);
extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *candidate);
#endif /* MD_H */
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index fad1e5c473..e1f97e9b89 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -16,13 +16,15 @@
#define REINIT_H
#include "common/relpath.h"
-
+#include "storage/smgr.h"
extern void ResetUnloggedRelations(int op);
-extern bool parse_filename_for_nontemp_relation(const char *name,
- int *oidchars, ForkNumber *fork);
+extern bool parse_filename_for_nontemp_relation(const char *name, int *oidchars,
+ ForkNumber *fork,
+ StorageMarks *mark);
#define UNLOGGED_RELATION_CLEANUP 0x0001
-#define UNLOGGED_RELATION_INIT 0x0002
+#define UNLOGGED_RELATION_DROP_BUFFER 0x0002
+#define UNLOGGED_RELATION_INIT 0x0004
#endif /* REINIT_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a6fbf7b6a6..201ecace8a 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,18 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
+/*
+ * A storage mark is a file whose existence indicates something about the
+ * state of another file. A mark file is named "<filename>.<mark>", where
+ * <mark> is one of the StorageMarks values. Because ".<digit>" denotes a
+ * segment file, digits must not be used as mark characters.
+ */
+typedef enum StorageMarks
+{
+ SMGR_MARK_NONE = 0,
+ SMGR_MARK_UNCOMMITTED = 'u' /* the file is not committed yet */
+} StorageMarks;
+
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
@@ -85,7 +97,12 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
extern void smgrclose(SMgrRelation reln);
extern void smgrcloseall(void);
extern void smgrclosenode(RelFileNodeBackend rnode);
+extern void smgrcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void smgrunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/test/recovery/t/027_persistence_change.pl b/src/test/recovery/t/027_persistence_change.pl
new file mode 100644
index 0000000000..526b19cbda
--- /dev/null
+++ b/src/test/recovery/t/027_persistence_change.pl
@@ -0,0 +1,247 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test relation persistence change
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More tests => 30;
+use IPC::Run qw(pump finish timer);
+use Config;
+
+my $data_unit = 2000;
+
+# Initialize primary node.
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+# we don't want checkpointing
+$node->append_conf('postgresql.conf', qq(
+checkpoint_timeout = '24h'
+));
+$node->start;
+create($node);
+
+my $relfilenodes1 = relfilenodes();
+
+# correctly recover empty tables
+$node->stop('immediate');
+$node->start;
+insert($node, 0, $data_unit, 0);
+
+# data persists after a crash
+$node->stop('immediate');
+$node->start;
+checkdataloss($data_unit, 'crash logged 1');
+
+set_unlogged($node);
+# SET UNLOGGED didn't change relfilenode
+my $relfilenodes2 = relfilenodes();
+checkrelfilenodes($relfilenodes1, $relfilenodes2, 'logged->unlogged');
+
+# data cleanly vanishes after a crash
+$node->stop('immediate');
+$node->start;
+checkdataloss(0, 'crash unlogged');
+
+insert($node, 0, $data_unit, 0);
+set_logged($node);
+
+$node->stop('immediate');
+$node->start;
+# SET LOGGED didn't change relfilenode and data survive a crash
+my $relfilenodes3 = relfilenodes();
+checkrelfilenodes($relfilenodes2, $relfilenodes3, 'unlogged->logged');
+checkdataloss($data_unit, 'crash logged 2');
+
+# unlogged insert -> graceful stop
+set_unlogged($node);
+insert($node, $data_unit, $data_unit, 0);
+$node->stop;
+$node->start;
+checkdataloss($data_unit * 2, 'unlogged graceful restart');
+
+# crash during transaction
+set_logged($node);
+$node->stop('immediate');
+$node->start;
+insert($node, $data_unit * 2, $data_unit, 0);
+
+my $h = insert($node, $data_unit * 3, $data_unit, 1); ## this is aborted
+$node->stop('immediate');
+
+# finishing $h would stall in this case, so just discard the handle.
+$h = undef;
+
+# check if indexes are working
+$node->start;
+# drop first half of data to reduce run time
+$node->safe_psql('postgres', 'DELETE FROM t WHERE bt < ' . $data_unit * 2);
+check($node, $data_unit * 2, $data_unit * 3 - 1, 'final check');
+
+sub create
+{
+ my ($node) = @_;
+
+ $node->psql('postgres', qq(
+ CREATE TABLE t (bt int, gin int[], gist point, hash int,
+ brin int, spgist point);
+ CREATE INDEX i_bt ON t USING btree (bt);
+ CREATE INDEX i_gin ON t USING gin (gin);
+ CREATE INDEX i_gist ON t USING gist (gist);
+ CREATE INDEX i_hash ON t USING hash (hash);
+ CREATE INDEX i_brin ON t USING brin (brin);
+ CREATE INDEX i_spgist ON t USING spgist (spgist);));
+}
+
+
+sub insert
+{
+ my ($node, $st, $num, $interactive) = @_;
+ my $ed = $st + $num - 1;
+ my $query = qq(BEGIN;
+INSERT INTO t
+ (SELECT i, ARRAY[i, i * 2], point(i, i * 2), i, i, point(i, i)
+ FROM generate_series($st, $ed) i);
+);
+
+ if ($interactive)
+ {
+ my $in = '';
+ my $out = '';
+ my $timer = timer(10);
+
+ my $h = $node->interactive_psql('postgres', \$in, \$out, $timer);
+ like($out, qr/psql/, "print startup banner");
+
+ $in .= "$query\n";
+ pump $h until ($out =~ /[\n\r]+INSERT 0 $num[\n\r]+/ ||
+ $timer->is_expired);
+ ok(($out =~ /[\n\r]+INSERT 0 $num[\n\r]+/), "inserted-$st-$num");
+ # the transaction is intentionally left unterminated
+ return $h;
+ }
+ else
+ {
+ $node->psql('postgres', $query . "COMMIT;");
+ return undef;
+ }
+}
+
+sub check
+{
+ my ($node, $st, $ed, $head) = @_;
+ my $num_data = $ed - $st + 1;
+
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO true;
+ SET enable_indexscan TO false;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE bt = i)),
+ $num_data, "$head: heap is not broken");
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE bt = i)),
+ $num_data, "$head: btree is not broken");
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE gin = ARRAY[i, i * 2];)),
+ $num_data, "$head: gin is not broken");
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE gist <@ box(point(i-0.5, i*2-0.5),point(i+0.5, i*2+0.5));)),
+ $num_data, "$head: gist is not broken");
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE hash = i;)),
+ $num_data, "$head: hash is not broken");
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE brin = i;)),
+ $num_data, "$head: brin is not broken");
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE spgist <@ box(point(i-0.5,i-0.5),point(i+0.5,i+0.5));)),
+ $num_data, "$head: spgist is not broken");
+}
+
+sub set_unlogged
+{
+ my ($node) = @_;
+
+ $node->psql('postgres', qq(
+ ALTER TABLE t SET UNLOGGED;
+));
+}
+
+sub set_logged
+{
+ my ($node) = @_;
+
+ $node->psql('postgres', qq(
+ ALTER TABLE t SET LOGGED;
+));
+}
+
+sub relfilenodes
+{
+ my $result = $node->safe_psql('postgres', qq{
+ SELECT relname, relfilenode FROM pg_class
+ WHERE relname
+ IN ('t', 'i_bt','i_gin','i_gist','i_hash','i_brin','i_spgist');});
+
+ my %relfilenodes;
+
+ foreach my $l (split(/\n/, $result))
+ {
+ die "unexpected format: $l" if ($l !~ /^([^|]+)\|([0-9]+)$/);
+ $relfilenodes{$1} = $2;
+ }
+
+ # the number must correspond to the IN list above
+ is (scalar keys %relfilenodes, 7, "number of relations is correct");
+
+ return \%relfilenodes;
+}
+
+sub checkrelfilenodes
+{
+ my ($rnodes1, $rnodes2, $s) = @_;
+
+ foreach my $n (keys %{$rnodes1})
+ {
+ if ($n eq 'i_gist')
+ {
+ # persistence of GiST index is not changed in-place
+ isnt($rnodes1->{$n}, $rnodes2->{$n},
+ "$s: relfilenode is changed: $n");
+ }
+ else
+ {
+ # otherwise all relations are processed in-place
+ is($rnodes1->{$n}, $rnodes2->{$n},
+ "$s: relfilenode is not changed: $n");
+ }
+ }
+}
+
+sub checkdataloss
+{
+ my ($expected, $s) = @_;
+
+ is($node->safe_psql('postgres', "SELECT count(*) FROM t;"), $expected,
+ "$s: data in table t is in the expected state");
+}
--
2.27.0
Attachment: v15-0002-New-command-ALTER-TABLE-ALL-IN-TABLESPACE-SET-LO.patch (text/x-patch; charset=us-ascii)
From 4e1d78c0eaf0c34f58c2ab2708244a75f3791add Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 23:21:09 +0900
Subject: [PATCH v15 2/2] New command ALTER TABLE ALL IN TABLESPACE SET
LOGGED/UNLOGGED
To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes
relation persistence of all tables in the specified tablespace.
---
src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++++++++++
src/backend/nodes/copyfuncs.c | 16 ++++
src/backend/nodes/equalfuncs.c | 15 ++++
src/backend/parser/gram.y | 20 +++++
src/backend/tcop/utility.c | 11 +++
src/include/commands/tablecmds.h | 2 +
src/include/nodes/nodes.h | 1 +
src/include/nodes/parsenodes.h | 9 ++
8 files changed, 214 insertions(+)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 51fcf9ca5f..1620fe771d 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14770,6 +14770,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
return new_tablespaceoid;
}
+/*
+ * Alter Table ALL ... SET LOGGED/UNLOGGED
+ *
+ * Allows a user to change persistence of all objects in a given tablespace in
+ * the current database. Objects can be chosen based on the owner of the
+ * object also, to allow users to change the persistence of only their own
+ * objects. The main permissions handling is done by the lower-level change
+ * persistence function.
+ *
+ * All to-be-modified objects are locked first. If NOWAIT is specified and the
+ * lock can't be acquired then we ereport(ERROR).
+ */
+void
+AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt)
+{
+ List *relations = NIL;
+ ListCell *l;
+ ScanKeyData key[1];
+ Relation rel;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ Oid tablespaceoid;
+ List *role_oids = roleSpecsToIds(NIL);
+
+ /* Ensure we were not asked to change something we can't */
+ if (stmt->objtype != OBJECT_TABLE)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("only tables can be specified")));
+
+ /* Get the tablespace OID */
+ tablespaceoid = get_tablespace_oid(stmt->tablespacename, false);
+
+ /*
+ * Now that the checks are done, check if we should set it to
+ * InvalidOid because it is our database's default tablespace.
+ */
+ if (tablespaceoid == MyDatabaseTableSpace)
+ tablespaceoid = InvalidOid;
+
+ /*
+ * Walk the list of objects in the tablespace and pick them up. This will
+ * only find objects in our database, of course.
+ */
+ ScanKeyInit(&key[0],
+ Anum_pg_class_reltablespace,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(tablespaceoid));
+
+ rel = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(rel, 1, key);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relOid = relForm->oid;
+
+ /*
+ * Do not pick up objects in pg_catalog as part of this; if an admin
+ * really wishes to do so, they can issue the individual ALTER
+ * commands directly.
+ *
+ * Also, explicitly avoid any shared tables, temp tables, or TOAST
+ * (TOAST will be changed with the main table).
+ */
+ if (IsCatalogNamespace(relForm->relnamespace) ||
+ relForm->relisshared ||
+ isAnyTempNamespace(relForm->relnamespace) ||
+ IsToastNamespace(relForm->relnamespace))
+ continue;
+
+ /* Only pick up the object type requested */
+ if (relForm->relkind != RELKIND_RELATION)
+ continue;
+
+ /* Check if we are only picking-up objects owned by certain roles */
+ if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner))
+ continue;
+
+ /*
+ * Handle permissions-checking here since we are locking the tables
+ * and also to avoid doing a bunch of work only to fail part-way. Note
+ * that permissions will also be checked by AlterTableInternal().
+ *
+ * The caller must be considered an owner of each table whose
+ * persistence is going to be changed.
+ */
+ if (!pg_class_ownercheck(relOid, GetUserId()))
+ aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)),
+ NameStr(relForm->relname));
+
+ if (stmt->nowait &&
+ !ConditionalLockRelationOid(relOid, AccessExclusiveLock))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("aborting because lock on relation \"%s.%s\" is not available",
+ get_namespace_name(relForm->relnamespace),
+ NameStr(relForm->relname))));
+ else
+ LockRelationOid(relOid, AccessExclusiveLock);
+
+ /*
+ * Add to our list of objects whose persistence we're going to
+ * change.
+ */
+ relations = lappend_oid(relations, relOid);
+ }
+
+ table_endscan(scan);
+ table_close(rel, AccessShareLock);
+
+ if (relations == NIL)
+ ereport(NOTICE,
+ (errcode(ERRCODE_NO_DATA_FOUND),
+ errmsg("no matching relations in tablespace \"%s\" found",
+ tablespaceoid == InvalidOid ? "(database default)" :
+ get_tablespace_name(tablespaceoid))));
+
+ /*
+ * Everything is locked, loop through and change persistence of all of the
+ * relations.
+ */
+ foreach(l, relations)
+ {
+ List *cmds = NIL;
+ AlterTableCmd *cmd = makeNode(AlterTableCmd);
+
+ if (stmt->logged)
+ cmd->subtype = AT_SetLogged;
+ else
+ cmd->subtype = AT_SetUnLogged;
+
+ cmds = lappend(cmds, cmd);
+
+ EventTriggerAlterTableStart((Node *) stmt);
+ /* OID is set by AlterTableInternal */
+ AlterTableInternal(lfirst_oid(l), cmds, false);
+ EventTriggerAlterTableEnd();
+ }
+}
+
static void
index_copy_data(Relation rel, RelFileNode newrnode)
{
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 18e778e856..51b6ad757f 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4270,6 +4270,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from)
return newnode;
}
+static AlterTableSetLoggedAllStmt *
+_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *from)
+{
+ AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt);
+
+ COPY_STRING_FIELD(tablespacename);
+ COPY_SCALAR_FIELD(objtype);
+ COPY_SCALAR_FIELD(logged);
+ COPY_SCALAR_FIELD(nowait);
+
+ return newnode;
+}
+
static CreateExtensionStmt *
_copyCreateExtensionStmt(const CreateExtensionStmt *from)
{
@@ -5623,6 +5636,9 @@ copyObjectImpl(const void *from)
case T_AlterTableMoveAllStmt:
retval = _copyAlterTableMoveAllStmt(from);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _copyAlterTableSetLoggedAllStmt(from);
+ break;
case T_CreateExtensionStmt:
retval = _copyCreateExtensionStmt(from);
break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index cb7ddd463c..a19b7874d7 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1916,6 +1916,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a,
return true;
}
+static bool
+_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *a,
+ const AlterTableSetLoggedAllStmt *b)
+{
+ COMPARE_STRING_FIELD(tablespacename);
+ COMPARE_SCALAR_FIELD(objtype);
+ COMPARE_SCALAR_FIELD(logged);
+ COMPARE_SCALAR_FIELD(nowait);
+
+ return true;
+}
+
static bool
_equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b)
{
@@ -3625,6 +3637,9 @@ equal(const void *a, const void *b)
case T_AlterTableMoveAllStmt:
retval = _equalAlterTableMoveAllStmt(a, b);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _equalAlterTableSetLoggedAllStmt(a, b);
+ break;
case T_CreateExtensionStmt:
retval = _equalCreateExtensionStmt(a, b);
break;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 6dddc07947..a55ea302c1 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -1984,6 +1984,26 @@ AlterTableStmt:
n->nowait = $13;
$$ = (Node *)n;
}
+ | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = true;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = false;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
| ALTER INDEX qualified_name alter_table_cmds
{
AlterTableStmt *n = makeNode(AlterTableStmt);
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 1fbc387d47..1483f9a475 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -162,6 +162,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_AlterTSConfigurationStmt:
case T_AlterTSDictionaryStmt:
case T_AlterTableMoveAllStmt:
+ case T_AlterTableSetLoggedAllStmt:
case T_AlterTableSpaceOptionsStmt:
case T_AlterTableStmt:
case T_AlterTypeStmt:
@@ -1747,6 +1748,12 @@ ProcessUtilitySlow(ParseState *pstate,
commandCollected = true;
break;
+ case T_AlterTableSetLoggedAllStmt:
+ AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree);
+ /* commands are stashed in AlterTableSetLoggedAll */
+ commandCollected = true;
+ break;
+
case T_DropStmt:
ExecDropStmt((DropStmt *) parsetree, isTopLevel);
/* no commands stashed for DROP */
@@ -2669,6 +2676,10 @@ CreateCommandTag(Node *parsetree)
tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype);
+ break;
+
case T_AlterTableStmt:
tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype);
break;
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index 336549cc5f..714077ff4c 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
+extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt);
+
extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
Oid *oldschema);
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 7c657c1241..8860b2e548 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -428,6 +428,7 @@ typedef enum NodeTag
T_AlterCollationStmt,
T_CallStmt,
T_AlterStatsStmt,
+ T_AlterTableSetLoggedAllStmt,
/*
* TAGS FOR PARSE TREE NODES (parsenodes.h)
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 593e301f7a..b9226a7cd9 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2350,6 +2350,15 @@ typedef struct AlterTableMoveAllStmt
bool nowait;
} AlterTableMoveAllStmt;
+typedef struct AlterTableSetLoggedAllStmt
+{
+ NodeTag type;
+ char *tablespacename;
+ ObjectType objtype; /* Object type to move */
+ bool logged;
+ bool nowait;
+} AlterTableSetLoggedAllStmt;
+
/* ----------------------
* Create/Alter Extension Statements
* ----------------------
--
2.27.0
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: not tested
I've retested v15 of the patch with everything that came to my mind. The patch passes all my tests (well, there's just this Windows / cfbot issue). The patch looks good to me. I haven't looked in depth at the code, so the patch might still need review.
FYI, about potential usage of this patch: the most advanced test that I did was continually bouncing wal_level, and it works. So the chain is:
1. wal_level=replica->minimal
2. alter table set unlogged and load a lot of data, set logged
3. wal_level=minimal->replica
4. barman incremental backup # rsync(1) backs up only the files changed in steps 2 and 3 (not the whole DB)
5. some other (logged) work
The idea is that when performing mass alterations to the DB (think nightly ETL/ELT on TB-sized DBs), one could skip backing up most of the DB and then quickly back up only the changed files during the maintenance window, e.g. thanks to the local-rsync barman mode. This is the output of barman show-backups after loading data into an unlogged table in each such cycle:
mydb 20220110T100236 - Mon Jan 10 10:05:14 2022 - Size: 144.1 GiB - WAL Size: 16.0 KiB
mydb 20220110T094905 - Mon Jan 10 09:50:12 2022 - Size: 128.5 GiB - WAL Size: 80.2 KiB
mydb 20220110T094016 - Mon Jan 10 09:40:17 2022 - Size: 109.1 GiB - WAL Size: 496.3 KiB
And dedupe ratio of the last one: Backup size: 144.1 GiB. Actual size on disk: 36.1 GiB (-74.96% deduplication ratio).
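The deduplication figure can be roughly reproduced from the two sizes quoted above. The following is only a sketch: it assumes the ratio is defined as one minus on-disk size over logical backup size, and the function name is hypothetical. With the rounded sizes shown here it yields roughly 74.95%, so the reported 74.96% presumably comes from the unrounded byte counts.

```c
/*
 * Hypothetical helper: percentage of the logical backup size that did
 * not need new disk space, assuming ratio = 1 - on_disk / backup_size.
 */
double
dedup_ratio_percent(double backup_size_gib, double on_disk_gib)
{
    return (1.0 - on_disk_gib / backup_size_gib) * 100.0;
}
```

Calling `dedup_ratio_percent(144.1, 36.1)` uses the sizes of the last backup listed above.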
The only thing I've found out is that bouncing wal_level also forces max_wal_senders=X -> 0 -> X, which in turn requires dropping the replication slot for pg_receivewal (e.g. barman receive-wal --create-slot/--drop-slot/--reset). I have tested the restore using barman recover afterwards to backup 20220110T094905 and indeed it worked OK using this patch too.
The new status of this patch is: Needs review
I found a bug.
mdmarkexists() didn't close the tentatively opened fd. This is a silent
leak on Linux and similar systems, and it causes a delete failure on
Windows. It was the reason for the CI failure.
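As an illustration of this class of bug, here is a minimal sketch of an existence check that opens the file and always closes the descriptor again. It is not the patch's mdmarkexists(); the function name is hypothetical and plain POSIX open()/close() are used instead of PostgreSQL's fd.c layer.

```c
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Hypothetical helper (not the patch's mdmarkexists()): report whether a
 * file exists by tentatively opening it.
 */
bool
file_exists_by_open(const char *path)
{
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return false;   /* treat any open failure as "not there" */

    /*
     * Without this close() the descriptor silently leaks. On Windows a
     * still-open handle can also make a later unlink of the file fail.
     */
    close(fd);
    return true;
}
```

The point of the sketch is only that the descriptor obtained for the check is released before returning, so that a subsequent unlink of the same path is not blocked.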
027_persistence_change.pl uses interactive_psql(), which doesn't work on
the Windows VM on the CI.
In this version the following changes have been made in 0001.
- Properly close file descriptor in mdmarkexists.
- Skip some tests when IO::Pty is not available.
It might be better to separate that part.
Looking again at the ALTER TABLE ALL IN TABLESPACE SET LOGGED patch, I
noticed that it doesn't implement the OWNED BY part and has neither tests
nor documentation (it was a PoC). Added all of them to 0002.
At Tue, 11 Jan 2022 09:33:55 +0000, Jakub Wartak <jakub.wartak@tomtom.com> wrote in
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: not tested
I've retested v15 of the patch with everything that came to my mind. The patch passes all my tests (well, there's just this Windows / cfbot issue). The patch looks good to me. I haven't looked in depth at the code, so the patch might still need review.
Thanks for checking.
FYI, about potential usage of this patch: the most advanced test that I did was continually bouncing wal_level, and it works. So the chain is:
1. wal_level=replica->minimal
2. alter table set unlogged and load a lot of data, set logged
3. wal_level=minimal->replica
4. barman incremental backup # rsync(1) backs up only the files changed in steps 2 and 3 (not the whole DB)
5. some other (logged) work
The idea is that when performing mass alterations to the DB (think nightly ETL/ELT on TB-sized DBs), one could skip backing up most of the DB and then quickly back up only the changed files during the maintenance window, e.g. thanks to the local-rsync barman mode. This is the output of barman show-backups after loading data into an unlogged table in each such cycle:
mydb 20220110T100236 - Mon Jan 10 10:05:14 2022 - Size: 144.1 GiB - WAL Size: 16.0 KiB
mydb 20220110T094905 - Mon Jan 10 09:50:12 2022 - Size: 128.5 GiB - WAL Size: 80.2 KiB
mydb 20220110T094016 - Mon Jan 10 09:40:17 2022 - Size: 109.1 GiB - WAL Size: 496.3 KiB
And dedupe ratio of the last one: Backup size: 144.1 GiB. Actual size on disk: 36.1 GiB (-74.96% deduplication ratio).
Ah, the patch skips duplicating relation files. This is advantageous in
that it not only eliminates the I/O activity the duplication causes but
also reduces the size of incremental backups. I hadn't noticed the
latter advantage.
The only thing I've found out is that bouncing wal_level also forces max_wal_senders=X -> 0 -> X, which in turn requires dropping the replication slot for pg_receivewal (e.g. barman receive-wal --create-slot/--drop-slot/--reset). I have tested the restore using barman recover afterwards to backup 20220110T094905 and indeed it worked OK using this patch too.
Yeah, it is irrelevant to this patch, but I'm annoyed by the
restriction. I think it would be okay for max_wal_senders to be
forcibly set to 0 while wal_level=minimal.
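The idea floated here can be sketched as a small clamp on the effective value. This is purely illustrative and is not part of the patch or of PostgreSQL: the enum and the function name are hypothetical stand-ins, not the server's GUC machinery.

```c
/* Hypothetical mirror of the WAL levels, for illustration only. */
typedef enum
{
    SKETCH_WAL_MINIMAL,
    SKETCH_WAL_REPLICA,
    SKETCH_WAL_LOGICAL
} SketchWalLevel;

/*
 * Hypothetical: return the number of WAL senders that would actually be
 * used, forcing it to 0 while wal_level is minimal instead of requiring
 * the administrator to change both settings together.
 */
int
effective_max_wal_senders(SketchWalLevel level, int configured)
{
    if (level == SKETCH_WAL_MINIMAL)
        return 0;
    return configured;
}
```

With such a rule, bouncing wal_level between replica and minimal would not require editing max_wal_senders by hand; whether replication slots could then survive the bounce is a separate question that this sketch does not address.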
The new status of this patch is: Needs review
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v16-0001-In-place-table-persistence-change.patch (text/x-patch; charset=us-ascii)
From d6bf0bd0d60391b24d5be7942b546acfffa3d7b1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v16 1/2] In-place table persistence change
Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, it currently runs a heap rewrite, which causes a large amount
of file I/O. This patch makes the command run without a heap rewrite.
In addition, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of a massive number of HEAP_INSERTs, which should be
smaller.
This also allows for the cleanup of files left behind when the
transaction that created them crashes.
---
src/backend/access/rmgrdesc/smgrdesc.c | 52 ++
src/backend/access/transam/README | 8 +
src/backend/access/transam/xact.c | 7 +
src/backend/access/transam/xlog.c | 17 +
src/backend/catalog/storage.c | 545 +++++++++++++++++-
src/backend/commands/tablecmds.c | 266 +++++++--
src/backend/replication/basebackup.c | 3 +-
src/backend/storage/buffer/bufmgr.c | 88 +++
src/backend/storage/file/fd.c | 4 +-
src/backend/storage/file/reinit.c | 344 +++++++----
src/backend/storage/smgr/md.c | 94 ++-
src/backend/storage/smgr/smgr.c | 32 +
src/backend/storage/sync/sync.c | 20 +-
src/bin/pg_rewind/parsexlog.c | 24 +
src/common/relpath.c | 47 +-
src/include/catalog/storage.h | 3 +
src/include/catalog/storage_xlog.h | 42 +-
src/include/common/relpath.h | 9 +-
src/include/storage/bufmgr.h | 2 +
src/include/storage/fd.h | 1 +
src/include/storage/md.h | 8 +-
src/include/storage/reinit.h | 10 +-
src/include/storage/smgr.h | 17 +
src/test/recovery/t/027_persistence_change.pl | 263 +++++++++
24 files changed, 1724 insertions(+), 182 deletions(-)
create mode 100644 src/test/recovery/t/027_persistence_change.pl
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 7755553d57..d251f22207 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,49 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+ char *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+ appendStringInfoString(buf, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+ char *path = GetRelationPath(xlrec->rnode.dbNode,
+ xlrec->rnode.spcNode,
+ xlrec->rnode.relNode,
+ InvalidBackendId,
+ xlrec->forkNum, xlrec->mark);
+ char *action;
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ action = "CREATE";
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ action = "DELETE";
+ break;
+ default:
+ action = "<unknown action>";
+ break;
+ }
+
+ appendStringInfo(buf, "%s %s", action, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+ char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+ appendStringInfoString(buf, path);
+ appendStringInfo(buf, " persistence %d", xlrec->persistence);
+ pfree(path);
+ }
}
const char *
@@ -55,6 +98,15 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_UNLINK:
+ id = "UNLINK";
+ break;
+ case XLOG_SMGR_MARK:
+ id = "MARK";
+ break;
+ case XLOG_SMGR_BUFPERSISTENCE:
+ id = "BUFPERSISTENCE";
+ break;
}
return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c1..b344bbe511 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -724,6 +724,14 @@ we must panic and abort recovery. The DBA will have to manually clean up
then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.
+The Smgr MARK files
+--------------------------------
+
+An smgr mark file is created when a new relation file is created to
+mark that the relfilenode needs to be cleaned up at recovery time. In
+contrast to the four actions above, failure to remove smgr mark files
+will lead to data loss, in which case the server will shut down.
+
Skipping WAL for New RelFileNode
--------------------------------
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e7b0bc804d..b41186d6d8 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2197,6 +2197,9 @@ CommitTransaction(void)
*/
smgrDoPendingSyncs(true, is_parallel_worker);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2447,6 +2450,9 @@ PrepareTransaction(void)
*/
smgrDoPendingSyncs(true, false);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2772,6 +2778,7 @@ AbortTransaction(void)
AfterTriggerEndXact(false); /* 'false' means it's abort */
AtAbort_Portals();
smgrDoPendingSyncs(false, is_parallel_worker);
+ smgrDoPendingCleanups(false);
AtEOXact_LargeObject(false);
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 87cd05c945..243860fcb1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -40,6 +40,7 @@
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
#include "catalog/pg_database.h"
+#include "catalog/storage.h"
#include "commands/progress.h"
#include "commands/tablespace.h"
#include "common/controldata_utils.h"
@@ -4564,6 +4565,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
{
ereport(DEBUG1,
(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+
+ /* cleanup garbage files left during crash recovery */
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
InArchiveRecovery = true;
if (StandbyModeRequested)
StandbyMode = true;
@@ -7824,6 +7833,14 @@ StartupXLOG(void)
}
}
+ /* cleanup garbage files left during crash recovery */
+ if (!InArchiveRecovery)
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
/* Allow resource managers to do any required cleanup. */
for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
{
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index c5ad28d71f..d6b30387e9 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,6 +19,7 @@
#include "postgres.h"
+#include "access/amapi.h"
#include "access/parallel.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -66,6 +67,23 @@ typedef struct PendingRelDelete
struct PendingRelDelete *next; /* linked-list link */
} PendingRelDelete;
+#define PCOP_UNLINK_FORK (1 << 0)
+#define PCOP_UNLINK_MARK (1 << 1)
+#define PCOP_SET_PERSISTENCE (1 << 2)
+
+typedef struct PendingCleanup
+{
+ RelFileNode relnode; /* relation that may need to be deleted */
+ int op; /* operation mask */
+ bool bufpersistence; /* buffer persistence to set */
+ int unlink_forknum; /* forknum to unlink */
+ StorageMarks unlink_mark; /* mark to unlink */
+ BackendId backend; /* InvalidBackendId if not a temp rel */
+ bool atCommit; /* T=delete at commit; F=delete at abort */
+ int nestLevel; /* xact nesting level of request */
+ struct PendingCleanup *next; /* linked-list link */
+} PendingCleanup;
+
typedef struct PendingRelSync
{
RelFileNode rnode;
@@ -73,6 +91,7 @@ typedef struct PendingRelSync
} PendingRelSync;
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingCleanup *pendingCleanups = NULL; /* head of linked list */
HTAB *pendingSyncHash = NULL;
@@ -117,7 +136,8 @@ AddPendingSync(const RelFileNode *rnode)
SMgrRelation
RelationCreateStorage(RelFileNode rnode, char relpersistence)
{
- PendingRelDelete *pending;
+ PendingRelDelete *pendingdel;
+ PendingCleanup *pendingclean;
SMgrRelation srel;
BackendId backend;
bool needs_wal;
@@ -143,21 +163,41 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return NULL; /* placate compiler */
}
+ /*
+ * We are going to create a new storage file. If server crashes before the
+ * current transaction ends the file needs to be cleaned up. The
+ * SMGR_MARK_UNCOMMITTED mark file prompts that work at the next startup.
+ */
srel = smgropen(rnode, backend);
+ log_smgrcreatemark(&rnode, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
smgrcreate(srel, MAIN_FORKNUM, false);
if (needs_wal)
log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM);
/* Add the relation to the list of stuff to delete at abort */
- pending = (PendingRelDelete *)
+ pendingdel = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
- pending->relnode = rnode;
- pending->backend = backend;
- pending->atCommit = false; /* delete if abort */
- pending->nestLevel = GetCurrentTransactionNestLevel();
- pending->next = pendingDeletes;
- pendingDeletes = pending;
+ pendingdel->relnode = rnode;
+ pendingdel->backend = backend;
+ pendingdel->atCommit = false; /* delete if abort */
+ pendingdel->nestLevel = GetCurrentTransactionNestLevel();
+ pendingdel->next = pendingDeletes;
+ pendingDeletes = pendingdel;
+
+ /* drop mark files at commit */
+ pendingclean = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pendingclean->relnode = rnode;
+ pendingclean->op = PCOP_UNLINK_MARK;
+ pendingclean->unlink_forknum = MAIN_FORKNUM;
+ pendingclean->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pendingclean->backend = backend;
+ pendingclean->atCommit = true;
+ pendingclean->nestLevel = GetCurrentTransactionNestLevel();
+ pendingclean->next = pendingCleanups;
+ pendingCleanups = pendingclean;
if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
{
@@ -168,6 +208,203 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return srel;
}
+/*
+ * RelationCreateInitFork
+ * Create physical storage for the init fork of a relation.
+ *
+ * Create the init fork for the relation.
+ *
+ * This function is transactional. The creation is WAL-logged, and if the
+ * transaction aborts later on, the init fork will be removed.
+ */
+void
+RelationCreateInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ SMgrRelation srel;
+ bool create = true;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), false, false);
+
+ /*
+ * If we have entries for init-fork operations on this relation, that means
+ * that we have already registered pending delete entries to drop an
+ * init-fork preexisting since before the current transaction started. This
+ * function reverts that change just by removing the entries.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ create = false;
+ }
+ else
+ prev = pending;
+ }
+
+ if (!create)
+ return;
+
+ /*
+ * We are going to create an init fork. If server crashes before the
+ * current transaction ends the init fork left alone corrupts data during
+ * recovery. The mark file works as the sentinel to identify that
+ * situation.
+ */
+ srel = smgropen(rnode, InvalidBackendId);
+ log_smgrcreatemark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+
+ /* We don't have existing init fork, create it. */
+ smgrcreate(srel, INIT_FORKNUM, false);
+
+ /*
+ * index-init fork needs further initialization. ambuildempty should do
+ * WAL-log and file sync by itself but otherwise we do that by ourselves.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rnode, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /* drop the init fork, mark file and revert persistence at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_UNLINK_FORK | PCOP_UNLINK_MARK | PCOP_SET_PERSISTENCE;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->bufpersistence = true;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* drop mark file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
+/*
+ * RelationDropInitFork
+ * Delete physical storage for the init fork of a relation.
+ *
+ * Register pending-delete of the init fork. The real deletion is performed by
+ * smgrDoPendingDeletes at commit.
+ *
+ * This function is transactional. If the transaction aborts later on, the
+ * deletion doesn't happen.
+ */
+void
+RelationDropInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ bool inxact_created = false;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), true, false);
+
+ /*
+ * If we have entries for init-fork operations of this relation, that means
+ * that we have created the init fork in the current transaction. We
+ * remove the init fork and mark file immediately in that case. Otherwise
+ * just register pending-delete for the existing init fork.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ pending->unlink_forknum != INIT_FORKNUM)
+ {
+ /* unlink list entry */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ inxact_created = true;
+ }
+ else
+ prev = pending;
+ }
+
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rnode, InvalidBackendId);
+
+ /*
+ * INIT forks are never loaded into shared buffers, so there is no point
+ * in dropping buffers for such files.
+ */
+ log_smgrunlinkmark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrunlinkmark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+ log_smgrunlink(&rnode, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ return;
+ }
+
+ /* register drop of this init fork file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_UNLINK_FORK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = false;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -187,6 +424,88 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum)
+{
+ xl_smgr_unlink xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file unlink.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK record (action CREATE) to WAL.
+ */
+void
+log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file creation.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_CREATE;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK record (action UNLINK) to WAL.
+ */
+void
+log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the mark file removal.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_UNLINK;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileNode *rnode, bool persistence)
+{
+ xl_smgr_bufpersistence xlrec;
+
+ /*
+ * Make an XLOG entry reporting the change of buffer persistence.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.persistence = persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -255,6 +574,7 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)
prev->next = next;
else
pendingDeletes = next;
+
pfree(pending);
/* prev does not change */
}
@@ -673,6 +993,88 @@ smgrDoPendingDeletes(bool isCommit)
}
}
+/*
+ * smgrDoPendingCleanups() -- Clean up work that emits WAL records
+ *
+ * The operations handled in this function emit WAL records, which must be
+ * emitted before the commit record for the current transaction.
+ */
+void
+smgrDoPendingCleanups(bool isCommit)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (pending->nestLevel < nestLevel)
+ {
+ /* outer-level entries should not be processed yet */
+ prev = pending;
+ }
+ else
+ {
+ /* unlink list entry first, so we don't retry on failure */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ /* do cleanup if called for */
+ if (pending->atCommit == isCommit)
+ {
+ SMgrRelation srel;
+
+ srel = smgropen(pending->relnode, pending->backend);
+
+ Assert ((pending->op &
+ ~(PCOP_UNLINK_FORK | PCOP_UNLINK_MARK |
+ PCOP_SET_PERSISTENCE)) == 0);
+
+ if (pending->op & PCOP_UNLINK_FORK)
+ {
+ /* other forks would need their buffers dropped */
+ Assert(pending->unlink_forknum == INIT_FORKNUM);
+
+ /* Don't emit WAL during recovery. */
+ if (!InRecovery)
+ log_smgrunlink(&pending->relnode,
+ pending->unlink_forknum);
+ smgrunlink(srel, pending->unlink_forknum, false);
+ }
+
+ if (pending->op & PCOP_UNLINK_MARK)
+ {
+ SMgrRelation srel;
+
+ if (!InRecovery)
+ log_smgrunlinkmark(&pending->relnode,
+ pending->unlink_forknum,
+ pending->unlink_mark);
+ srel = smgropen(pending->relnode, pending->backend);
+ smgrunlinkmark(srel, pending->unlink_forknum,
+ pending->unlink_mark, InRecovery);
+ smgrclose(srel);
+ }
+
+ if (pending->op & PCOP_SET_PERSISTENCE)
+ {
+ SetRelationBuffersPersistence(srel, pending->bufpersistence,
+ InRecovery);
+ }
+ }
+
+ /* must explicitly free the list entry */
+ pfree(pending);
+ /* prev does not change */
+ }
+ }
+}
+
/*
* smgrDoPendingSyncs() -- Take care of relation syncs at end of xact.
*/
@@ -933,6 +1335,15 @@ smgr_redo(XLogReaderState *record)
reln = smgropen(xlrec->rnode, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ smgrunlink(reln, xlrec->forkNum, true);
+ smgrclose(reln);
+ }
else if (info == XLOG_SMGR_TRUNCATE)
{
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
@@ -1021,6 +1432,124 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ bool created = false;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ smgrcreatemark(reln, xlrec->forkNum, xlrec->mark, true);
+ created = true;
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ smgrunlinkmark(reln, xlrec->forkNum, xlrec->mark, true);
+ break;
+ default:
+ elog(ERROR, "unknown smgr_mark action \"%d\"", (int) xlrec->action);
+ }
+
+ if (created)
+ {
+ /* revert mark file operation at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = xlrec->rnode;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = xlrec->forkNum;
+ pending->unlink_mark = xlrec->mark;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ else
+ {
+ /*
+ * Delete pending action for this mark file if any. We should have
+ * at most one entry for this action.
+ */
+ PendingCleanup *prev = NULL;
+
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileNodeEquals(xlrec->rnode, pending->relnode) &&
+ pending->unlink_forknum == xlrec->forkNum &&
+ (pending->op & PCOP_UNLINK_MARK) != 0)
+ {
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+ }
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ PendingCleanup *prev = NULL;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+
+ /*
+ * Delete pending action for persistence change if any. We should have
+ * at most one entry for this action.
+ */
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileNodeEquals(xlrec->rnode, pending->relnode) &&
+ (pending->op & PCOP_SET_PERSISTENCE) != 0)
+ {
+ Assert (pending->bufpersistence == xlrec->persistence);
+
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+
+ /*
+ * Revert buffer-persistence changes at abort if the relation is going
+ * to different persistence from before this transaction.
+ */
+ if (!pending)
+ {
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = xlrec->rnode;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = !xlrec->persistence;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 89bc865e28..51fcf9ca5f 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -52,6 +52,7 @@
#include "commands/defrem.h"
#include "commands/event_trigger.h"
#include "commands/policy.h"
+#include "commands/progress.h"
#include "commands/sequence.h"
#include "commands/tablecmds.h"
#include "commands/tablespace.h"
@@ -5346,6 +5347,187 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
return newcmd;
}
+/*
+ * RelationChangePersistence: do in-place persistence change of a relation
+ */
+static void
+RelationChangePersistence(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ int i;
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * If any of the following conditions did not hold, we would need to call
+ * ATRewriteTable; that cannot happen in the AT_REWRITE_ALTER_PERSISTENCE
+ * case.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ /*
+ * First we collect all relations that we need to change persistence.
+ */
+
+ /* Collect OIDs of indexes and toast relations */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ /* Add toast relation if any */
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ List *toastidx;
+ Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
+
+ relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
+ toastidx = RelationGetIndexList(toastrel);
+ relids = list_concat(relids, toastidx);
+ pfree(toastidx);
+ table_close(toastrel, NoLock);
+ }
+
+ table_close(rel, NoLock);
+
+ /* Make changes in storage */
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach (lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+
+ /*
+ * XXXX: Some access methods cannot tolerate an in-place persistence
+ * change. Specifically, GiST uses page LSNs to figure out whether a
+ * block has changed, and UNLOGGED GiST indexes use fake LSNs that
+ * are incompatible with the real LSNs used for LOGGED ones.
+ *
+ * Maybe if gistGetFakeLSN behaved the same way for permanent and
+ * unlogged indexes, we could skip the index rebuild in exchange for
+ * some extra WAL records emitted while it is unlogged.
+ *
+ * Check relam against a positive list so that unknown AMs also take
+ * the rebuild path.
+ */
+ if (r->rd_rel->relkind == RELKIND_INDEX &&
+ /* GiST is excluded */
+ r->rd_rel->relam != BTREE_AM_OID &&
+ r->rd_rel->relam != HASH_AM_OID &&
+ r->rd_rel->relam != GIN_AM_OID &&
+ r->rd_rel->relam != SPGIST_AM_OID &&
+ r->rd_rel->relam != BRIN_AM_OID)
+ {
+ int reindex_flags;
+ ReindexParams params = {0};
+
+ /* reindex doesn't allow concurrent use of the index */
+ table_close(r, NoLock);
+
+ reindex_flags =
+ REINDEX_REL_SUPPRESS_INDEX_USE |
+ REINDEX_REL_CHECK_CONSTRAINTS;
+
+ /* Set the same persistence with the parent relation. */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+
+ reindex_index(reloid, reindex_flags, persistence, &params);
+
+ continue;
+ }
+
+ /* Create or drop init fork */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ RelationCreateInitFork(r);
+ else
+ RelationDropInitFork(r);
+
+ /*
+ * When this relation gets WAL-logged, immediately sync all files except
+ * the init fork to establish the initial state on storage. Buffers have
+ * already been flushed out by RelationCreate(Drop)InitFork called just
+ * above. The init fork should have been synced as needed.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT)
+ {
+ for (i = 0 ; i < INIT_FORKNUM ; i++)
+ {
+ if (smgrexists(RelationGetSmgr(r), i))
+ smgrimmedsync(RelationGetSmgr(r), i);
+ }
+ }
+
+ /* Update catalog */
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+
+ /*
+ * While wal_level >= replica, switching to LOGGED requires the
+ * relation content to be WAL-logged to recover the table.
+ * We don't emit this while wal_level = minimal.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded())
+ {
+ ForkNumber fork;
+ xl_smgr_truncate xlrec;
+
+ xlrec.blkno = 0;
+ xlrec.rnode = r->rd_node;
+ xlrec.flags = SMGR_TRUNCATE_ALL;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ for (fork = 0; fork < INIT_FORKNUM ; fork++)
+ {
+ if (smgrexists(RelationGetSmgr(r), fork))
+ log_newpage_range(r, fork, 0,
+ smgrnblocks(RelationGetSmgr(r), fork),
+ false);
+ }
+ }
+
+ table_close(r, NoLock);
+ }
+
+ table_close(classRel, NoLock);
+}
+
/*
* ATRewriteTables: ALTER TABLE phase 3
*/
@@ -5476,47 +5658,55 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- /*
- * Create transient table that will receive the modified data.
- *
- * Ensure it is marked correctly as logged or unlogged. We have
- * to do this here so that buffers for the new relfilenode will
- * have the right persistence set, and at the same time ensure
- * that the original filenode's buffers will get read in with the
- * correct setting (i.e. the original one). Otherwise a rollback
- * after the rewrite would possibly result with buffers for the
- * original filenode having the wrong persistence setting.
- *
- * NB: This relies on swap_relation_files() also swapping the
- * persistence. That wouldn't work for pg_class, but that can't be
- * unlogged anyway.
- */
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
- persistence, lockmode);
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
+ RelationChangePersistence(tab, persistence, lockmode);
+ else
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new relfilenode
+ * will have the right persistence set, and at the same time
+ * ensure that the original filenode's buffers will get read in
+ * with the correct setting (i.e. the original one). Otherwise
+ * a rollback after the rewrite would possibly result with
+ * buffers for the original filenode having the wrong
+ * persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that can't
+ * be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace,
+ NewAccessMethod,
+ persistence, lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
+ /*
+ * Swap the physical files of the old and new heaps, then
+ * rebuild indexes and discard the old heap. We can use
+ * RecentXmin for the table's new relfrozenxid because we
+ * rewrote all the tuples in ATRewriteTable, so no older Xid
+ * remains in the table. Also, we never try to swap toast
+ * tables by content, since we have no interest in letting this
+ * code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
- InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ }
}
else
{
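RelationChangePersistence above decides per index whether the in-place path is allowed by checking relam against a positive list, so GiST and any unknown AM fall back to a rebuild. A minimal sketch of that decision follows. The IndexAm enum and index_needs_rebuild are hypothetical names for illustration; the patch itself compares pg_am OIDs such as BTREE_AM_OID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical AM identifiers; the patch compares pg_am OIDs instead. */
typedef enum
{
    AM_BTREE,
    AM_HASH,
    AM_GIN,
    AM_SPGIST,
    AM_BRIN,
    AM_GIST,
    AM_SOME_EXTENSION           /* an AM this code knows nothing about */
} IndexAm;

/*
 * Return true when the relation must be rebuilt rather than switched
 * in place. Only indexes can require a rebuild, and only when their AM
 * is absent from the positive list.
 */
bool
index_needs_rebuild(bool is_index, IndexAm am)
{
    /* AMs considered safe for an in-place persistence change. */
    static const IndexAm inplace_ok[] = {
        AM_BTREE, AM_HASH, AM_GIN, AM_SPGIST, AM_BRIN
    };
    size_t      i;

    if (!is_index)
        return false;

    for (i = 0; i < sizeof(inplace_ok) / sizeof(inplace_ok[0]); i++)
    {
        if (inplace_ok[i] == am)
            return false;
    }

    /* GiST and everything unknown take the conservative path. */
    return true;
}
```

Using a positive list rather than excluding GiST by name means a newly added or third-party AM is rebuilt by default, which costs time but not correctness.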
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index ec0485705d..45e1a5d817 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1070,6 +1070,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
bool excludeFound;
ForkNumber relForkNum; /* Type of fork if file is a relation */
int relOidChars; /* Chars in filename that are the rel oid */
+ StorageMarks mark;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1120,7 +1121,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
/* Exclude all forks for unlogged tables except the init fork */
if (isDbDir &&
parse_filename_for_nontemp_relation(de->d_name, &relOidChars,
- &relForkNum))
+ &relForkNum, &mark))
{
/* Never exclude init forks */
if (relForkNum != INIT_FORKNUM)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index b4532948d3..dab74bf99a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -37,6 +37,7 @@
#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
#include "executor/instrument.h"
#include "lib/binaryheap.h"
#include "miscadmin.h"
@@ -3154,6 +3155,93 @@ DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
}
}
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a relation
+ * and, when switching to PERMANENT, writes all dirty pages of the relation
+ * out to disk (or more accurately, out to kernel disk buffers), ensuring
+ * that the kernel has an up-to-date view of the relation.
+ *
+ * Generally, the caller should be holding AccessExclusiveLock on the
+ * target relation to ensure that no other backend is busy dirtying
+ * more blocks of the relation; the effects can't be expected to last
+ * after the lock is released.
+ *
+ * XXX currently it sequentially searches the buffer pool, should be
+ * changed to more clever ways of searching. This routine is not
+ * used in any performance-critical code paths, so it's not worth
+ * adding additional overhead to normal paths to make it go faster;
+ * but see also DropRelFileNodeBuffers.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
+{
+ int i;
+ RelFileNodeBackend rnode = srel->smgr_rnode;
+
+ Assert (!RelFileNodeBackendIsTemp(rnode));
+
+ if (!isRedo)
+ log_smgrbufpersistence(&srel->smgr_rnode.node, permanent);
+
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ {
+ UnlockBufHdr(bufHdr, buf_state);
+ continue;
+ }
+
+ if (permanent)
+ {
+ /* Init fork is being dropped, drop buffers for it. */
+ if (bufHdr->tag.forkNum == INIT_FORKNUM)
+ {
+ InvalidateBuffer(bufHdr);
+ continue;
+ }
+
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /* we flush this buffer when switching to PERMANENT */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, srel);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr, true);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ /* init fork is always BM_PERMANENT. See BufferAlloc */
+ if (bufHdr->tag.forkNum != INIT_FORKNUM)
+ buf_state &= ~BM_PERMANENT;
+
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ }
+}
+
/* ---------------------------------------------------------------------
* DropRelFileNodesAllBuffers
*
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 263057841d..8487ae1f02 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -349,8 +349,6 @@ static void pre_sync_fname(const char *fname, bool isdir, int elevel);
static void datadir_fsync_fname(const char *fname, bool isdir, int elevel);
static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel);
-static int fsync_parent_path(const char *fname, int elevel);
-
/*
* pg_fsync --- do fsync with or without writethrough
@@ -3759,7 +3757,7 @@ fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel)
* This is aimed at making file operations persistent on disk in case of
* an OS crash or power failure.
*/
-static int
+int
fsync_parent_path(const char *fname, int elevel)
{
char parentpath[MAXPGPATH];
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 0ae3fb6902..0137902bb2 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -16,29 +16,49 @@
#include <unistd.h>
+#include "access/xlog.h"
+#include "catalog/pg_tablespace_d.h"
#include "common/relpath.h"
#include "postmaster/startup.h"
+#include "storage/bufmgr.h"
#include "storage/copydir.h"
#include "storage/fd.h"
+#include "storage/md.h"
#include "storage/reinit.h"
+#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
- int op);
+ Oid tspid, int op);
static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
- int op);
+ Oid tspid, Oid dbid, int op);
typedef struct
{
Oid reloid; /* hash key */
-} unlogged_relation_entry;
+ bool has_init; /* has INIT fork */
+ bool dirty_init; /* needs to remove INIT fork */
+ bool dirty_all; /* needs to remove all forks */
+} relfile_entry;
/*
- * Reset unlogged relations from before the last restart.
+ * Clean up and reset relation files from before the last restart.
*
- * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any
- * relation with an "init" fork, except for the "init" fork itself.
+ * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations
+ * depending on the existence of the "cleanup" forks.
+ *
+ * If SMGR_MARK_UNCOMMITTED mark file for init fork is present, we remove the
+ * init fork along with the mark file.
+ *
+ * If SMGR_MARK_UNCOMMITTED mark file for main fork is present we remove the
+ * whole relation along with the mark file.
+ *
+ * Otherwise, if the "init" fork is found, we remove all forks of any relation
+ * with the "init" fork, except for the "init" fork itself.
+ *
+ * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all
+ * relations that have the "cleanup" and/or the "init" forks.
*
* If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main
* fork.
@@ -72,7 +92,7 @@ ResetUnloggedRelations(int op)
/*
* First process unlogged files in pg_default ($PGDATA/base)
*/
- ResetUnloggedRelationsInTablespaceDir("base", op);
+ ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op);
/*
* Cycle through directories for all non-default tablespaces.
@@ -81,13 +101,19 @@ ResetUnloggedRelations(int op)
while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
{
+ Oid tspid;
+
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
continue;
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
- ResetUnloggedRelationsInTablespaceDir(temp_path, op);
+
+ tspid = atooid(spc_de->d_name);
+
+ Assert(tspid != 0);
+ ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op);
}
FreeDir(spc_dir);
@@ -103,7 +129,8 @@ ResetUnloggedRelations(int op)
* Process one tablespace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
+ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
+ Oid tspid, int op)
{
DIR *ts_dir;
struct dirent *de;
@@ -130,6 +157,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
while ((de = ReadDir(ts_dir, tsdirname)) != NULL)
{
+ Oid dbid;
+
/*
* We're only interested in the per-database directories, which have
* numeric names. Note that this code will also (properly) ignore "."
@@ -148,7 +177,10 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
ereport_startup_progress("resetting unlogged relations (cleanup), elapsed time: %ld.%02d s, current path: %s",
dbspace_path);
- ResetUnloggedRelationsInDbspaceDir(dbspace_path, op);
+ dbid = atooid(de->d_name);
+ Assert(dbid != 0);
+
+ ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op);
}
FreeDir(ts_dir);
@@ -158,125 +190,227 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
* Process one per-dbspace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
+ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
+ Oid tspid, Oid dbid, int op)
{
DIR *dbspace_dir;
struct dirent *de;
char rm_path[MAXPGPATH * 2];
+ HTAB *hash;
+ HASHCTL ctl;
/* Caller must specify at least one operation. */
- Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0);
+ Assert((op & (UNLOGGED_RELATION_CLEANUP |
+ UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_INIT)) != 0);
/*
* Cleanup is a two-pass operation. First, we go through and identify all
* the files with init forks. Then, we go through again and nuke
* everything with the same OID except the init fork.
*/
+
+ /*
+ * It's possible that someone could create tons of unlogged relations in
+ * the same database & tablespace, so we'd better use a hash table rather
+ * than an array or linked list to keep track of which files need to be
+ * reset. Otherwise, this cleanup operation would be O(n^2).
+ */
+ memset(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(relfile_entry);
+ hash = hash_create("relfilenode cleanup hash",
+ 32, &ctl, HASH_ELEM | HASH_BLOBS);
+
+ /* Collect INIT fork and mark files in the directory. */
+ dbspace_dir = AllocateDir(dbspacedirname);
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
+ {
+ int oidchars;
+ ForkNumber forkNum;
+ StorageMarks mark;
+
+ /* Skip anything that doesn't look like a relation data file. */
+ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
+ &forkNum, &mark))
+ continue;
+
+ if (forkNum == INIT_FORKNUM || mark == SMGR_MARK_UNCOMMITTED)
+ {
+ Oid key;
+ relfile_entry *ent;
+ bool found;
+
+ /*
+ * Record the relfilenode information. If it has
+ * SMGR_MARK_UNCOMMITTED mark files, the relfilenode is in dirty
+ * state, where clean up is needed.
+ */
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_ENTER, &found);
+
+ if (!found)
+ {
+ ent->has_init = false;
+ ent->dirty_init = false;
+ ent->dirty_all = false;
+ }
+
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_init = true;
+ else if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_all = true;
+ else
+ {
+ Assert(forkNum == INIT_FORKNUM);
+ ent->has_init = true;
+ }
+ }
+ }
+
+ /* Done with the first pass. */
+ FreeDir(dbspace_dir);
+
+ /* nothing to do if we have neither init nor cleanup forks */
+ if (hash_get_num_entries(hash) < 1)
+ {
+ hash_destroy(hash);
+ return;
+ }
+
+ if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0)
+ {
+ /*
+ * When we come here after recovery, smgr object for this file might
+ * have been created. In that case we need to drop all buffers then the
+ * smgr object before initializing the unlogged relation. This is safe
+ * as long as no other backends have accessed the relation before
+ * starting archive recovery.
+ */
+ HASH_SEQ_STATUS status;
+ relfile_entry *ent;
+ SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8);
+ int maxrels = 8;
+ int nrels = 0;
+ int i;
+
+ Assert(!HotStandbyActive());
+
+ hash_seq_init(&status, hash);
+ while((ent = (relfile_entry *) hash_seq_search(&status)) != NULL)
+ {
+ RelFileNodeBackend rel;
+
+ /*
+ * The relation is persistent and remains persistent. Don't
+ * drop the buffers for this relation.
+ */
+ if (ent->has_init && ent->dirty_init)
+ continue;
+
+ if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = ent->reloid;
+
+ srels[nrels++] = smgropen(rel.node, InvalidBackendId);
+ }
+
+ DropRelFileNodesAllBuffers(srels, nrels);
+
+ for (i = 0 ; i < nrels ; i++)
+ smgrclose(srels[i]);
+ }
+
+ /*
+ * Now, make a second pass and remove anything that matches.
+ */
if ((op & UNLOGGED_RELATION_CLEANUP) != 0)
{
- HTAB *hash;
- HASHCTL ctl;
-
- /*
- * It's possible that someone could create a ton of unlogged relations
- * in the same database & tablespace, so we'd better use a hash table
- * rather than an array or linked list to keep track of which files
- * need to be reset. Otherwise, this cleanup operation would be
- * O(n^2).
- */
- ctl.keysize = sizeof(Oid);
- ctl.entrysize = sizeof(unlogged_relation_entry);
- ctl.hcxt = CurrentMemoryContext;
- hash = hash_create("unlogged relation OIDs", 32, &ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- /* Scan the directory. */
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ int oidchars;
+ Oid key;
+ relfile_entry *ent;
+ RelFileNodeBackend rel;
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* Also skip it unless this is the init fork. */
- if (forkNum != INIT_FORKNUM)
- continue;
-
- /*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
- */
- ent.reloid = atooid(de->d_name);
- (void) hash_search(hash, &ent, HASH_ENTER, NULL);
- }
-
- /* Done with the first pass. */
- FreeDir(dbspace_dir);
-
- /*
- * If we didn't find any init forks, there's no point in continuing;
- * we can bail out now.
- */
- if (hash_get_num_entries(hash) == 0)
- {
- hash_destroy(hash);
- return;
- }
-
- /*
- * Now, make a second pass and remove anything that matches.
- */
- dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
- {
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
-
- /* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* We never remove the init fork. */
- if (forkNum == INIT_FORKNUM)
+ &forkNum, &mark))
continue;
/*
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
- if (hash_search(hash, &ent, HASH_FIND, NULL))
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);
+
+ if (!ent)
+ continue;
+
+ if (!ent->dirty_all)
{
- snprintf(rm_path, sizeof(rm_path), "%s/%s",
- dbspacedirname, de->d_name);
- if (unlink(rm_path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m",
- rm_path)));
+ /* clean permanent relations don't need cleanup */
+ if (!ent->has_init)
+ continue;
+
+ if (ent->dirty_init)
+ {
+ /*
+ * The crashed transaction did SET UNLOGGED. This relation
+ * is restored to a LOGGED relation.
+ */
+ if (forkNum != INIT_FORKNUM)
+ continue;
+ }
else
- elog(DEBUG2, "unlinked file \"%s\"", rm_path);
+ {
+ /*
+ * we don't remove the INIT fork of a non-dirty
+ * relfilenode
+ */
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE)
+ continue;
+ }
}
+
+ /* so, nuke it! */
+ snprintf(rm_path, sizeof(rm_path), "%s/%s",
+ dbspacedirname, de->d_name);
+ if (unlink(rm_path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ rm_path)));
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = atooid(de->d_name);
+
+ ForgetRelationForkSyncRequests(rel, forkNum);
}
/* Cleanup is complete. */
FreeDir(dbspace_dir);
- hash_destroy(hash);
}
+ hash_destroy(hash);
+ hash = NULL;
+
/*
* Initialization happens after cleanup is complete: we copy each init
- * fork file to the corresponding main fork file. Note that if we are
- * asked to do both cleanup and init, we may never get here: if the
- * cleanup code determines that there are no init forks in this dbspace,
- * it will return before we get to this point.
+ * fork file to the corresponding main fork file.
*/
if ((op & UNLOGGED_RELATION_INIT) != 0)
{
@@ -285,6 +419,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char srcpath[MAXPGPATH * 2];
@@ -292,9 +427,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -328,15 +465,18 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char mainpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -379,7 +519,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
*/
bool
parse_filename_for_nontemp_relation(const char *name, int *oidchars,
- ForkNumber *fork)
+ ForkNumber *fork, StorageMarks *mark)
{
int pos;
@@ -410,11 +550,19 @@ parse_filename_for_nontemp_relation(const char *name, int *oidchars,
for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar)
;
- if (segchar <= 1)
- return false;
- pos += segchar;
+ if (segchar > 1)
+ pos += segchar;
}
+ /* mark file? */
+ if (name[pos] == '.' && name[pos + 1] != 0)
+ {
+ *mark = name[pos + 1];
+ pos += 2;
+ }
+ else
+ *mark = SMGR_MARK_NONE;
+
/* Now we should be at the end. */
if (name[pos] != '\0')
return false;
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index b4bca7eed6..1f3aac5bcc 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -139,7 +139,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
BlockNumber blkno, bool skipFsync, int behavior);
static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
MdfdVec *seg);
-
+static bool mdmarkexists(SMgrRelation reln, ForkNumber forkNum,
+ StorageMarks mark);
/*
* mdinit() -- Initialize private state for magnetic disk storage manager.
@@ -169,6 +170,82 @@ mdexists(SMgrRelation reln, ForkNumber forkNum)
return (mdopenfork(reln, forkNum, EXTENSION_RETURN_NULL) != NULL);
}
+/*
+ * mdcreatemark() -- Create a mark file.
+ *
+ * If isRedo is true, it's okay for the file to exist already.
+ */
+void
+mdcreatemark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+ int fd;
+
+ /* See mdcreate for details. */
+ TablespaceCreateDbspace(reln->smgr_rnode.node.spcNode,
+ reln->smgr_rnode.node.dbNode,
+ isRedo);
+
+ fd = BasicOpenFile(path, O_WRONLY | O_CREAT | O_EXCL);
+ if (fd < 0 && (!isRedo || errno != EEXIST))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create mark file \"%s\": %m", path)));
+
+ pg_fsync(fd);
+ close(fd);
+
+ /*
+ * To guarantee that the creation of the file is persistent, fsync its
+ * parent directory.
+ */
+ fsync_parent_path(path, ERROR);
+
+ pfree(path);
+}
+
+
+/*
+ * mdunlinkmark() -- Delete the mark file
+ *
+ * If isRedo is true, it's okay if the file is not found.
+ */
+void
+mdunlinkmark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+
+ if (!isRedo || mdmarkexists(reln, forkNum, mark))
+ durable_unlink(path, ERROR);
+
+ pfree(path);
+}
+
+/*
+ * mdmarkexists() -- Check if the file exists.
+ */
+static bool
+mdmarkexists(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+ int fd;
+
+ fd = BasicOpenFile(path, O_RDONLY);
+ if (fd < 0 && errno != ENOENT)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not access mark file \"%s\": %m", path)));
+ pfree(path);
+
+ if (fd < 0)
+ return false;
+
+ close(fd);
+ return true;
+}
+
/*
* mdcreate() -- Create a new relation on magnetic disk.
*
@@ -1025,6 +1102,15 @@ register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum,
RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ );
}
+/*
+ * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork
+ */
+void
+ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, ForkNumber forknum)
+{
+ register_forget_request(rnode, forknum, 0);
+}
+
/*
* ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB
*/
@@ -1378,12 +1464,14 @@ mdsyncfiletag(const FileTag *ftag, char *path)
* Return 0 on success, -1 on failure, with errno set.
*/
int
-mdunlinkfiletag(const FileTag *ftag, char *path)
+mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark)
{
char *p;
/* Compute the path. */
- p = relpathperm(ftag->rnode, MAIN_FORKNUM);
+ p = GetRelationPath(ftag->rnode.dbNode, ftag->rnode.spcNode,
+ ftag->rnode.relNode, InvalidBackendId, MAIN_FORKNUM,
+ mark);
strlcpy(path, p, MAXPGPATH);
pfree(p);
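mdcreatemark and mdunlinkmark above follow a create-exclusively / unlink pattern in which replay tolerates a file that already exists or is already gone. A minimal POSIX sketch of that pattern follows, outside of PostgreSQL. create_mark_file and remove_mark_file are hypothetical names, the functions return 0 or -1 instead of raising errors, and the fsync of the parent directory that the patch performs is left out for brevity. The sketch returns early when the file already exists during replay, on the assumption that there is then no descriptor to sync.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Create an empty mark file. Returns 0 on success, -1 on failure with
 * errno set. With is_redo, an already existing file counts as success.
 */
int
create_mark_file(const char *path, bool is_redo)
{
    int         fd;

    assert(path != NULL);

    fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
    {
        /* During replay the file may be left over from an earlier run. */
        if (is_redo && errno == EEXIST)
            return 0;
        return -1;
    }

    /* Make the new file durable before reporting success. */
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}

/*
 * Remove a mark file. Returns 0 on success, -1 on failure with errno
 * set. With is_redo, a missing file counts as success.
 */
int
remove_mark_file(const char *path, bool is_redo)
{
    assert(path != NULL);

    if (unlink(path) == 0)
        return 0;
    if (is_redo && errno == ENOENT)
        return 0;
    return -1;
}
```

O_EXCL is what makes the non-redo case strict: a second creation attempt outside replay fails instead of silently reusing a stale mark.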
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 0fcef4994b..110e64b0b2 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -62,6 +62,10 @@ typedef struct f_smgr
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+ void (*smgr_createmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+ void (*smgr_unlinkmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
} f_smgr;
static const f_smgr smgrsw[] = {
@@ -82,6 +86,8 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
+ .smgr_createmark = mdcreatemark,
+ .smgr_unlinkmark = mdunlinkmark,
}
};
@@ -335,6 +341,26 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo);
}
+/*
+ * smgrcreatemark() -- Create a mark file
+ */
+void
+smgrcreatemark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_createmark(reln, forknum, mark, isRedo);
+}
+
+/*
+ * smgrunlinkmark() -- Delete a mark file
+ */
+void
+smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlinkmark(reln, forknum, mark, isRedo);
+}
+
/*
* smgrdosyncall() -- Immediately sync all forks of all given relations
*
@@ -662,6 +688,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index d4083e8a56..9563940d45 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -89,7 +89,8 @@ static CycleCtr checkpoint_cycle_ctr = 0;
typedef struct SyncOps
{
int (*sync_syncfiletag) (const FileTag *ftag, char *path);
- int (*sync_unlinkfiletag) (const FileTag *ftag, char *path);
+ int (*sync_unlinkfiletag) (const FileTag *ftag, char *path,
+ StorageMarks mark);
bool (*sync_filetagmatches) (const FileTag *ftag,
const FileTag *candidate);
} SyncOps;
@@ -222,7 +223,8 @@ SyncPostCheckpoint(void)
/* Unlink the file */
if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
- path) < 0)
+ path,
+ SMGR_MARK_NONE) < 0)
{
/*
* There's a race condition, when the database is dropped at the
@@ -236,6 +238,20 @@ SyncPostCheckpoint(void)
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m", path)));
}
+ else if (syncsw[entry->tag.handler].sync_unlinkfiletag(
+ &entry->tag, path,
+ SMGR_MARK_UNCOMMITTED) < 0)
+ {
+ /*
+ * An SMGR_MARK_UNCOMMITTED file may also exist. Remove it once the
+ * fork file has been successfully removed. It's OK if the mark file
+ * does not exist.
+ */
+ if (errno != ENOENT)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
/* Mark the list entry as canceled, just in case */
entry->canceled = true;
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 436df54120..dbc0da5da5 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -407,6 +407,30 @@ extractPageInfo(XLogReaderState *record)
* source system.
*/
}
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_UNLINK)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_MARK)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
else if (rmid == RM_XACT_ID &&
((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT ||
(rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED ||
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 1f5c426ec0..4945b111cc 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -139,9 +139,15 @@ GetDatabasePath(Oid dbNode, Oid spcNode)
*/
char *
GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
- int backendId, ForkNumber forkNumber)
+ int backendId, ForkNumber forkNumber, char mark)
{
char *path;
+ char markstr[4];
+
+ if (mark == 0)
+ markstr[0] = 0;
+ else
+ snprintf(markstr, sizeof(markstr), ".%c", mark);
if (spcNode == GLOBALTABLESPACE_OID)
{
@@ -149,10 +155,10 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
Assert(dbNode == 0);
Assert(backendId == InvalidBackendId);
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("global/%u_%s",
- relNode, forkNames[forkNumber]);
+ path = psprintf("global/%u_%s%s",
+ relNode, forkNames[forkNumber], markstr);
else
- path = psprintf("global/%u", relNode);
+ path = psprintf("global/%u%s", relNode, markstr);
}
else if (spcNode == DEFAULTTABLESPACE_OID)
{
@@ -160,22 +166,22 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/%u_%s",
+ path = psprintf("base/%u/%u_%s%s",
dbNode, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/%u",
- dbNode, relNode);
+ path = psprintf("base/%u/%u%s",
+ dbNode, relNode, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/t%d_%u_%s",
+ path = psprintf("base/%u/t%d_%u_%s%s",
dbNode, backendId, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/t%d_%u",
- dbNode, backendId, relNode);
+ path = psprintf("base/%u/t%d_%u%s",
+ dbNode, backendId, relNode, markstr);
}
}
else
@@ -184,27 +190,28 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u_%s%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
dbNode, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
- dbNode, relNode);
+ dbNode, relNode, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
dbNode, backendId, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
- dbNode, backendId, relNode);
+ dbNode, backendId, relNode, markstr);
}
}
+
return path;
}
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 0ab32b44e9..584ebac391 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -23,6 +23,8 @@
extern int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
+extern void RelationCreateInitFork(Relation rel);
+extern void RelationDropInitFork(Relation rel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationPreTruncate(Relation rel);
@@ -41,6 +43,7 @@ extern void RestorePendingSyncs(char *startAddress);
extern void smgrDoPendingDeletes(bool isCommit);
extern void smgrDoPendingSyncs(bool isCommit, bool isParallelWorker);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingCleanups(bool isCommit);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index f0814f1458..12346ed7f6 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -18,17 +18,23 @@
#include "lib/stringinfo.h"
#include "storage/block.h"
#include "storage/relfilenode.h"
+#include "storage/smgr.h"
/*
* Declarations for smgr-related XLOG records
*
- * Note: we log file creation and truncation here, but logging of deletion
- * actions is handled by xact.c, because it is part of transaction commit.
+ * Note: we log file creation, truncation and buffer persistence change here,
+ * but logging of deletion actions is handled mainly by xact.c, because it is
+ * part of transaction commit in most cases. However, there's a case where
+ * init forks are deleted outside of transaction control.
*/
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_UNLINK 0x30
+#define XLOG_SMGR_MARK 0x40
+#define XLOG_SMGR_BUFPERSISTENCE 0x50
typedef struct xl_smgr_create
{
@@ -36,6 +42,32 @@ typedef struct xl_smgr_create
ForkNumber forkNum;
} xl_smgr_create;
+typedef struct xl_smgr_unlink
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+} xl_smgr_unlink;
+
+typedef enum smgr_mark_action
+{
+ XLOG_SMGR_MARK_CREATE = 'c',
+ XLOG_SMGR_MARK_UNLINK = 'u'
+} smgr_mark_action;
+
+typedef struct xl_smgr_mark
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ smgr_mark_action action;
+} xl_smgr_mark;
+
+typedef struct xl_smgr_bufpersistence
+{
+ RelFileNode rnode;
+ bool persistence;
+} xl_smgr_bufpersistence;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +83,12 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark);
+extern void log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark);
+extern void log_smgrbufpersistence(const RelFileNode *rnode, bool persistence);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index a44be11ca0..106a5cf508 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -67,7 +67,7 @@ extern int forkname_chars(const char *str, ForkNumber *fork);
extern char *GetDatabasePath(Oid dbNode, Oid spcNode);
extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
- int backendId, ForkNumber forkNumber);
+ int backendId, ForkNumber forkNumber, char mark);
/*
* Wrapper macros for GetRelationPath. Beware of multiple
@@ -77,7 +77,7 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
/* First argument is a RelFileNode */
#define relpathbackend(rnode, backend, forknum) \
GetRelationPath((rnode).dbNode, (rnode).spcNode, (rnode).relNode, \
- backend, forknum)
+ backend, forknum, 0)
/* First argument is a RelFileNode */
#define relpathperm(rnode, forknum) \
@@ -87,4 +87,9 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
#define relpath(rnode, forknum) \
relpathbackend((rnode).node, (rnode).backend, forknum)
+/* First argument is a RelFileNodeBackend */
+#define markpath(rnode, forknum, mark) \
+ GetRelationPath((rnode).node.dbNode, (rnode).node.spcNode, \
+ (rnode).node.relNode, \
+ (rnode).backend, forknum, mark)
#endif /* RELPATH_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index cfce23ecbc..f5a7df87a4 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -206,6 +206,8 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels)
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+ bool permanent, bool isRedo);
extern void DropRelFileNodesAllBuffers(struct SMgrRelationData **smgr_reln, int nnodes);
extern void DropDatabaseBuffers(Oid dbid);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 34602ae006..2dc0357ad5 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -185,6 +185,7 @@ extern ssize_t pg_pwritev_with_retry(int fd,
extern int pg_truncate(const char *path, off_t length);
extern void fsync_fname(const char *fname, bool isdir);
extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
+extern int fsync_parent_path(const char *fname, int elevel);
extern int durable_rename(const char *oldfile, const char *newfile, int loglevel);
extern int durable_unlink(const char *fname, int loglevel);
extern int durable_rename_excl(const char *oldfile, const char *newfile, int loglevel);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 752b440864..99620816b5 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -23,6 +23,10 @@
extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
extern void mdclose(SMgrRelation reln, ForkNumber forknum);
+extern void mdcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void mdunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
@@ -41,12 +45,14 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void ForgetRelationForkSyncRequests(RelFileNodeBackend rnode,
+ ForkNumber forknum);
extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
/* md sync callbacks */
extern int mdsyncfiletag(const FileTag *ftag, char *path);
-extern int mdunlinkfiletag(const FileTag *ftag, char *path);
+extern int mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark);
extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *candidate);
#endif /* MD_H */
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index fad1e5c473..e1f97e9b89 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -16,13 +16,15 @@
#define REINIT_H
#include "common/relpath.h"
-
+#include "storage/smgr.h"
extern void ResetUnloggedRelations(int op);
-extern bool parse_filename_for_nontemp_relation(const char *name,
- int *oidchars, ForkNumber *fork);
+extern bool parse_filename_for_nontemp_relation(const char *name, int *oidchars,
+ ForkNumber *fork,
+ StorageMarks *mark);
#define UNLOGGED_RELATION_CLEANUP 0x0001
-#define UNLOGGED_RELATION_INIT 0x0002
+#define UNLOGGED_RELATION_DROP_BUFFER 0x0002
+#define UNLOGGED_RELATION_INIT 0x0004
#endif /* REINIT_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a6fbf7b6a6..201ecace8a 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,18 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
+/*
+ * A storage mark is a file whose existence indicates something about another
+ * file. Such a file is named "<filename>.<mark>", where the mark is one of
+ * the values of StorageMarks. Because ".<digit>" denotes a segment file, digits
+ * must not be used as the mark character.
+ */
+typedef enum StorageMarks
+{
+ SMGR_MARK_NONE = 0,
+ SMGR_MARK_UNCOMMITTED = 'u' /* the file is not committed yet */
+} StorageMarks;
+
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
@@ -85,7 +97,12 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
extern void smgrclose(SMgrRelation reln);
extern void smgrcloseall(void);
extern void smgrclosenode(RelFileNodeBackend rnode);
+extern void smgrcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void smgrunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/test/recovery/t/027_persistence_change.pl b/src/test/recovery/t/027_persistence_change.pl
new file mode 100644
index 0000000000..261c4cf943
--- /dev/null
+++ b/src/test/recovery/t/027_persistence_change.pl
@@ -0,0 +1,263 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test relation persistence change
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use Test::More tests => 30;
+use IPC::Run qw(pump finish timer);
+use Config;
+
+my $data_unit = 2000;
+
+# Initialize primary node.
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+# we don't want checkpointing
+$node->append_conf('postgresql.conf', qq(
+checkpoint_timeout = '24h'
+));
+$node->start;
+create($node);
+
+my $relfilenodes1 = relfilenodes();
+
+# correctly recover empty tables
+$node->stop('immediate');
+$node->start;
+insert($node, 0, $data_unit, 0);
+
+# data persists after a crash
+$node->stop('immediate');
+$node->start;
+checkdataloss($data_unit, 'crash logged 1');
+
+set_unlogged($node);
+# SET UNLOGGED shouldn't change relfilenode
+my $relfilenodes2 = relfilenodes();
+checkrelfilenodes($relfilenodes1, $relfilenodes2, 'logged->unlogged');
+
+# data cleanly vanishes after a crash
+$node->stop('immediate');
+$node->start;
+checkdataloss(0, 'crash unlogged');
+
+insert($node, 0, $data_unit, 0);
+set_logged($node);
+
+$node->stop('immediate');
+$node->start;
+# SET LOGGED shouldn't change relfilenode and data should survive the crash
+my $relfilenodes3 = relfilenodes();
+checkrelfilenodes($relfilenodes2, $relfilenodes3, 'unlogged->logged');
+checkdataloss($data_unit, 'crash logged 2');
+
+# unlogged insert -> graceful stop
+set_unlogged($node);
+insert($node, $data_unit, $data_unit, 0);
+$node->stop;
+$node->start;
+checkdataloss($data_unit * 2, 'unlogged graceful restart');
+
+# crash during transaction
+set_logged($node);
+$node->stop('immediate');
+$node->start;
+insert($node, $data_unit * 2, $data_unit, 0);
+
+my $h;
+
+# insert(,,,1) requires IO::Pty. Skip the test if the module is not
+# available, but still do the insert to set up the state that the
+# later tests expect.
+eval { require IO::Pty; };
+if ($@)
+{
+ insert($node, $data_unit * 3, $data_unit, 0);
+ ok (1, 'SKIPPED: IO::Pty is needed');
+ ok (1, 'SKIPPED: IO::Pty is needed');
+}
+else
+{
+ $h = insert($node, $data_unit * 3, $data_unit, 1); ## this is aborted
+}
+
+$node->stop('immediate');
+
+# finishing $h stalls this case, just tear it off.
+$h = undef;
+
+# check if indexes are working
+$node->start;
+# drop first half of data to reduce run time
+$node->safe_psql('postgres', 'DELETE FROM t WHERE bt < ' . $data_unit * 2);
+check($node, $data_unit * 2, $data_unit * 3 - 1, 'final check');
+
+sub create
+{
+ my ($node) = @_;
+
+ $node->psql('postgres', qq(
+ CREATE TABLE t (bt int, gin int[], gist point, hash int,
+ brin int, spgist point);
+ CREATE INDEX i_bt ON t USING btree (bt);
+ CREATE INDEX i_gin ON t USING gin (gin);
+ CREATE INDEX i_gist ON t USING gist (gist);
+ CREATE INDEX i_hash ON t USING hash (hash);
+ CREATE INDEX i_brin ON t USING brin (brin);
+ CREATE INDEX i_spgist ON t USING spgist (spgist);));
+}
+
+
+sub insert
+{
+ my ($node, $st, $num, $interactive) = @_;
+ my $ed = $st + $num - 1;
+ my $query = qq(BEGIN;
+INSERT INTO t
+ (SELECT i, ARRAY[i, i * 2], point(i, i * 2), i, i, point(i, i)
+ FROM generate_series($st, $ed) i);
+);
+
+ if ($interactive)
+ {
+ my $in = '';
+ my $out = '';
+ my $timer = timer(10);
+
+ my $h = $node->interactive_psql('postgres', \$in, \$out, $timer);
+ like($out, qr/psql/, "print startup banner");
+
+ $in .= "$query\n";
+ pump $h until ($out =~ /[\n\r]+INSERT 0 $num[\n\r]+/ ||
+ $timer->is_expired);
+ ok(($out =~ /[\n\r]+INSERT 0 $num[\n\r]+/), "inserted-$st-$num");
+ # the transaction is left open (not terminated)
+ return $h;
+ }
+ else
+ {
+ $node->psql('postgres', $query . "COMMIT;");
+ return undef;
+ }
+}
+
+sub check
+{
+ my ($node, $st, $ed, $head) = @_;
+ my $num_data = $ed - $st + 1;
+
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO true;
+ SET enable_indexscan TO false;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE bt = i)),
+ $num_data, "$head: heap is not broken");
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE bt = i)),
+ $num_data, "$head: btree is not broken");
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE gin = ARRAY[i, i * 2];)),
+ $num_data, "$head: gin is not broken");
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE gist <@ box(point(i-0.5, i*2-0.5),point(i+0.5, i*2+0.5));)),
+ $num_data, "$head: gist is not broken");
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE hash = i;)),
+ $num_data, "$head: hash is not broken");
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE brin = i;)),
+ $num_data, "$head: brin is not broken");
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE spgist <@ box(point(i-0.5,i-0.5),point(i+0.5,i+0.5));)),
+ $num_data, "$head: spgist is not broken");
+}
+
+sub set_unlogged
+{
+ my ($node) = @_;
+
+ $node->psql('postgres', qq(
+ ALTER TABLE t SET UNLOGGED;
+));
+}
+
+sub set_logged
+{
+ my ($node) = @_;
+
+ $node->psql('postgres', qq(
+ ALTER TABLE t SET LOGGED;
+));
+}
+
+sub relfilenodes
+{
+ my $result = $node->safe_psql('postgres', qq{
+ SELECT relname, relfilenode FROM pg_class
+ WHERE relname
+ IN ('t', 'i_bt','i_gin','i_gist','i_hash','i_brin','i_spgist');});
+
+ my %relfilenodes;
+
+ foreach my $l (split(/\n/, $result))
+ {
+ die "unexpected format: $l" if ($l !~ /^([^|]+)\|([0-9]+)$/);
+ $relfilenodes{$1} = $2;
+ }
+
+ # the number must correspond to the in list above
+ is (scalar keys %relfilenodes, 7, "number of relations is correct");
+
+ return \%relfilenodes;
+}
+
+sub checkrelfilenodes
+{
+ my ($rnodes1, $rnodes2, $s) = @_;
+
+ foreach my $n (keys %{$rnodes1})
+ {
+ if ($n eq 'i_gist')
+ {
+ # persistence of GiST index is not changed in-place
+ isnt($rnodes1->{$n}, $rnodes2->{$n},
+ "$s: relfilenode is changed: $n");
+ }
+ else
+ {
+ # otherwise all relations are processed in-place
+ is($rnodes1->{$n}, $rnodes2->{$n},
+ "$s: relfilenode is not changed: $n");
+ }
+ }
+}
+
+sub checkdataloss
+{
+ my ($expected, $s) = @_;
+
+ is($node->safe_psql('postgres', "SELECT count(*) FROM t;"), $expected,
+ "$s: data in table t is in the expected state");
+}
--
2.27.0
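
For readers following the relpath.c hunks above: the mark-file naming reduces to appending ".<mark>" to the ordinary relation path. The following is only a minimal standalone sketch of that rule for the default-tablespace, non-temporary case. The names DemoMark and demo_mark_path are hypothetical and do not appear in the patch; the real code goes through GetRelationPath() and psprintf().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the patch's StorageMarks enum. */
typedef enum DemoMark
{
	DEMO_MARK_NONE = 0,
	DEMO_MARK_UNCOMMITTED = 'u'
} DemoMark;

/*
 * Build "base/<db>/<rel>[_<fork>][.<mark>]" into dst.
 * forkName is NULL for the main fork, mirroring the MAIN_FORKNUM branch.
 */
static void
demo_mark_path(char *dst, size_t dstlen, unsigned dbNode, unsigned relNode,
			   const char *forkName, DemoMark mark)
{
	char		markstr[4];

	assert(dst != NULL && dstlen > 0);

	/* Same idea as the markstr logic in the patched GetRelationPath() */
	if (mark == DEMO_MARK_NONE)
		markstr[0] = '\0';
	else
		snprintf(markstr, sizeof(markstr), ".%c", (char) mark);

	if (forkName != NULL)
		snprintf(dst, dstlen, "base/%u/%u_%s%s",
				 dbNode, relNode, forkName, markstr);
	else
		snprintf(dst, dstlen, "base/%u/%u%s", dbNode, relNode, markstr);
}
```

Under these assumptions, calling demo_mark_path() with dbNode 5, relNode 16384, fork name "init" and DEMO_MARK_UNCOMMITTED yields "base/5/16384_init.u", while DEMO_MARK_NONE leaves the ordinary path unchanged. A letter is used for the mark because ".<digit>" suffixes are already taken by segment files.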
Attachment: v16-0002-New-command-ALTER-TABLE-ALL-IN-TABLESPACE-SET-LO.patch (text/x-patch; charset=us-ascii)
From edb09262d0793df84dfcb9138bad0309f84cfe87 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 23:21:09 +0900
Subject: [PATCH v16 2/2] New command ALTER TABLE ALL IN TABLESPACE SET
LOGGED/UNLOGGED
To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes
relation persistence of all tables in the specified tablespace.
---
doc/src/sgml/ref/alter_table.sgml | 15 +++
src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++
src/backend/nodes/copyfuncs.c | 16 +++
src/backend/nodes/equalfuncs.c | 15 +++
src/backend/parser/gram.y | 42 +++++++
src/backend/tcop/utility.c | 11 ++
src/include/commands/tablecmds.h | 2 +
src/include/nodes/nodes.h | 1 +
src/include/nodes/parsenodes.h | 10 ++
src/test/regress/expected/tablespace.out | 76 ++++++++++++
src/test/regress/sql/tablespace.sql | 41 +++++++
11 files changed, 369 insertions(+)
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index a76e2e7322..6f108980af 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -33,6 +33,8 @@ ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
SET SCHEMA <replaceable class="parameter">new_schema</replaceable>
ALTER TABLE ALL IN TABLESPACE <replaceable class="parameter">name</replaceable> [ OWNED BY <replaceable class="parameter">role_name</replaceable> [, ... ] ]
SET TABLESPACE <replaceable class="parameter">new_tablespace</replaceable> [ NOWAIT ]
+ALTER TABLE ALL IN TABLESPACE <replaceable class="parameter">name</replaceable> [ OWNED BY <replaceable class="parameter">role_name</replaceable> [, ... ] ]
+ SET { LOGGED | UNLOGGED } [ NOWAIT ]
ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
ATTACH PARTITION <replaceable class="parameter">partition_name</replaceable> { FOR VALUES <replaceable class="parameter">partition_bound_spec</replaceable> | DEFAULT }
ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
@@ -753,6 +755,19 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
(see <xref linkend="sql-createtable-unlogged"/>). It cannot be applied
to a temporary table.
</para>
+
+ <para>
+ All tables in the current database in a tablespace can be changed by using
+ the <literal>ALL IN TABLESPACE</literal> form, which will lock all tables
+ to be changed first and then change each one. This form also supports
+ <literal>OWNED BY</literal>, which will only change tables owned by the
+ roles specified. If the <literal>NOWAIT</literal> option is specified
+ then the command will fail if it is unable to acquire all of the locks
+ required immediately. The <literal>information_schema</literal>
+ relations are not considered part of the system catalogs and will be
+ changed. See also
+ <link linkend="sql-createtablespace"><command>CREATE TABLESPACE</command></link>.
+ </para>
</listitem>
</varlistentry>
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 51fcf9ca5f..524c9d5c1b 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14770,6 +14770,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
return new_tablespaceoid;
}
+/*
+ * Alter Table ALL ... SET LOGGED/UNLOGGED
+ *
+ * Allows a user to change persistence of all objects in a given tablespace in
+ * the current database. Objects can also be chosen based on their owner, to
+ * allow users to change the persistence of only their own objects. The
+ * main permissions handling is done by the lower-level change persistence
+ * function.
+ *
+ * All to-be-modified objects are locked first. If NOWAIT is specified and the
+ * lock can't be acquired then we ereport(ERROR).
+ */
+void
+AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt)
+{
+ List *relations = NIL;
+ ListCell *l;
+ ScanKeyData key[1];
+ Relation rel;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ Oid tablespaceoid;
+ List *role_oids = roleSpecsToIds(stmt->roles);
+
+ /* Ensure we were not asked to change something we can't */
+ if (stmt->objtype != OBJECT_TABLE)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("only tables can be specified")));
+
+ /* Get the tablespace OID */
+ tablespaceoid = get_tablespace_oid(stmt->tablespacename, false);
+
+ /*
+ * Now that the checks are done, check whether we should set it to
+ * InvalidOid because it is our database's default tablespace.
+ */
+ if (tablespaceoid == MyDatabaseTableSpace)
+ tablespaceoid = InvalidOid;
+
+ /*
+ * Walk the list of objects in the tablespace to pick them up. This will
+ * only find objects in our database, of course.
+ */
+ ScanKeyInit(&key[0],
+ Anum_pg_class_reltablespace,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(tablespaceoid));
+
+ rel = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(rel, 1, key);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relOid = relForm->oid;
+
+ /*
+ * Do not pick up objects in pg_catalog as part of this; if an admin
+ * really wishes to do so, they can issue the individual ALTER
+ * commands directly.
+ *
+ * Also, explicitly avoid any shared tables, temp tables, or TOAST
+ * (TOAST will be changed with the main table).
+ */
+ if (IsCatalogNamespace(relForm->relnamespace) ||
+ relForm->relisshared ||
+ isAnyTempNamespace(relForm->relnamespace) ||
+ IsToastNamespace(relForm->relnamespace))
+ continue;
+
+ /* Only pick up the object type requested */
+ if (relForm->relkind != RELKIND_RELATION)
+ continue;
+
+ /* Check if we are only picking-up objects owned by certain roles */
+ if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner))
+ continue;
+
+ /*
+ * Handle permissions-checking here since we are locking the tables
+ * and also to avoid doing a bunch of work only to fail part-way. Note
+ * that permissions will also be checked by AlterTableInternal().
+ *
+ * Caller must be considered an owner of the table whose persistence
+ * we're going to change.
+ */
+ if (!pg_class_ownercheck(relOid, GetUserId()))
+ aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)),
+ NameStr(relForm->relname));
+
+ if (stmt->nowait &&
+ !ConditionalLockRelationOid(relOid, AccessExclusiveLock))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("aborting because lock on relation \"%s.%s\" is not available",
+ get_namespace_name(relForm->relnamespace),
+ NameStr(relForm->relname))));
+ else
+ LockRelationOid(relOid, AccessExclusiveLock);
+
+ /*
+ * Add to our list of objects whose persistence we're going to
+ * change.
+ */
+ relations = lappend_oid(relations, relOid);
+ }
+
+ table_endscan(scan);
+ table_close(rel, AccessShareLock);
+
+ if (relations == NIL)
+ ereport(NOTICE,
+ (errcode(ERRCODE_NO_DATA_FOUND),
+ errmsg("no matching relations in tablespace \"%s\" found",
+ tablespaceoid == InvalidOid ? "(database default)" :
+ get_tablespace_name(tablespaceoid))));
+
+ /*
+ * Everything is locked, loop through and change persistence of all of the
+ * relations.
+ */
+ foreach(l, relations)
+ {
+ List *cmds = NIL;
+ AlterTableCmd *cmd = makeNode(AlterTableCmd);
+
+ if (stmt->logged)
+ cmd->subtype = AT_SetLogged;
+ else
+ cmd->subtype = AT_SetUnLogged;
+
+ cmds = lappend(cmds, cmd);
+
+ EventTriggerAlterTableStart((Node *) stmt);
+ /* OID is set by AlterTableInternal */
+ AlterTableInternal(lfirst_oid(l), cmds, false);
+ EventTriggerAlterTableEnd();
+ }
+}
+
static void
index_copy_data(Relation rel, RelFileNode newrnode)
{
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 18e778e856..51b6ad757f 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4270,6 +4270,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from)
return newnode;
}
+static AlterTableSetLoggedAllStmt *
+_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *from)
+{
+ AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt);
+
+ COPY_STRING_FIELD(tablespacename);
+ COPY_SCALAR_FIELD(objtype);
+ COPY_SCALAR_FIELD(logged);
+ COPY_SCALAR_FIELD(nowait);
+
+ return newnode;
+}
+
static CreateExtensionStmt *
_copyCreateExtensionStmt(const CreateExtensionStmt *from)
{
@@ -5623,6 +5636,9 @@ copyObjectImpl(const void *from)
case T_AlterTableMoveAllStmt:
retval = _copyAlterTableMoveAllStmt(from);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _copyAlterTableSetLoggedAllStmt(from);
+ break;
case T_CreateExtensionStmt:
retval = _copyCreateExtensionStmt(from);
break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index cb7ddd463c..a19b7874d7 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1916,6 +1916,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a,
return true;
}
+static bool
+_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *a,
+ const AlterTableSetLoggedAllStmt *b)
+{
+ COMPARE_STRING_FIELD(tablespacename);
+ COMPARE_SCALAR_FIELD(objtype);
+ COMPARE_SCALAR_FIELD(logged);
+ COMPARE_SCALAR_FIELD(nowait);
+
+ return true;
+}
+
static bool
_equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b)
{
@@ -3625,6 +3637,9 @@ equal(const void *a, const void *b)
case T_AlterTableMoveAllStmt:
retval = _equalAlterTableMoveAllStmt(a, b);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _equalAlterTableSetLoggedAllStmt(a, b);
+ break;
case T_CreateExtensionStmt:
retval = _equalCreateExtensionStmt(a, b);
break;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 6dddc07947..50bc3190de 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -1984,6 +1984,48 @@ AlterTableStmt:
n->nowait = $13;
$$ = (Node *)n;
}
+ | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = true;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name OWNED BY role_list SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->roles = $9;
+ n->logged = true;
+ n->nowait = $12;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = false;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name OWNED BY role_list SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->roles = $9;
+ n->logged = false;
+ n->nowait = $12;
+ $$ = (Node *)n;
+ }
| ALTER INDEX qualified_name alter_table_cmds
{
AlterTableStmt *n = makeNode(AlterTableStmt);
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 1fbc387d47..1483f9a475 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -162,6 +162,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_AlterTSConfigurationStmt:
case T_AlterTSDictionaryStmt:
case T_AlterTableMoveAllStmt:
+ case T_AlterTableSetLoggedAllStmt:
case T_AlterTableSpaceOptionsStmt:
case T_AlterTableStmt:
case T_AlterTypeStmt:
@@ -1747,6 +1748,12 @@ ProcessUtilitySlow(ParseState *pstate,
commandCollected = true;
break;
+ case T_AlterTableSetLoggedAllStmt:
+ AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree);
+ /* commands are stashed in AlterTableSetLoggedAll */
+ commandCollected = true;
+ break;
+
case T_DropStmt:
ExecDropStmt((DropStmt *) parsetree, isTopLevel);
/* no commands stashed for DROP */
@@ -2669,6 +2676,10 @@ CreateCommandTag(Node *parsetree)
tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype);
+ break;
+
case T_AlterTableStmt:
tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype);
break;
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index 336549cc5f..714077ff4c 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
+extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt);
+
extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
Oid *oldschema);
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 7c657c1241..8860b2e548 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -428,6 +428,7 @@ typedef enum NodeTag
T_AlterCollationStmt,
T_CallStmt,
T_AlterStatsStmt,
+ T_AlterTableSetLoggedAllStmt,
/*
* TAGS FOR PARSE TREE NODES (parsenodes.h)
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 593e301f7a..01661e9622 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2350,6 +2350,16 @@ typedef struct AlterTableMoveAllStmt
bool nowait;
} AlterTableMoveAllStmt;
+typedef struct AlterTableSetLoggedAllStmt
+{
+ NodeTag type;
+ char *tablespacename;
+ ObjectType objtype; /* Object type to move */
+ List *roles; /* List of roles to change objects of */
+ bool logged;
+ bool nowait;
+} AlterTableSetLoggedAllStmt;
+
/* ----------------------
* Create/Alter Extension Statements
* ----------------------
diff --git a/src/test/regress/expected/tablespace.out b/src/test/regress/expected/tablespace.out
index 864f4b6e20..420eed0717 100644
--- a/src/test/regress/expected/tablespace.out
+++ b/src/test/regress/expected/tablespace.out
@@ -935,5 +935,81 @@ drop cascades to table testschema.asexecute
drop cascades to table testschema.part
drop cascades to table testschema.atable
drop cascades to table testschema.tablespace_acl
+--
+-- Check persistence change in a tablespace
+CREATE SCHEMA testschema;
+GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1;
+CREATE TABLESPACE regress_tablespace LOCATION :'testtablespace';
+GRANT CREATE ON TABLESPACE regress_tablespace TO regress_tablespace_user1;
+CREATE TABLE testschema.lsu(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.usu(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lsu(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._usu(a int) TABLESPACE pg_default;
+SET ROLE regress_tablespace_user1;
+CREATE TABLE testschema.lu1(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.uu1(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lu1(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._uu1(a int) TABLESPACE pg_default;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | p
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | p
+ uu1 | regress_tablespace | u
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace
+ OWNED BY regress_tablespace_user1 SET LOGGED;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | p
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | p
+ uu1 | regress_tablespace | p
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+RESET ROLE;
+ALTER TABLE ALL IN TABLESPACE regress_tablespace SET UNLOGGED;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | u
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | u
+ uu1 | regress_tablespace | u
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+-- Should succeed
+DROP SCHEMA testschema CASCADE;
+NOTICE: drop cascades to 8 other objects
+DETAIL: drop cascades to table testschema.lsu
+drop cascades to table testschema.usu
+drop cascades to table testschema._lsu
+drop cascades to table testschema._usu
+drop cascades to table testschema.lu1
+drop cascades to table testschema.uu1
+drop cascades to table testschema._lu1
+drop cascades to table testschema._uu1
+DROP TABLESPACE regress_tablespace;
DROP ROLE regress_tablespace_user1;
DROP ROLE regress_tablespace_user2;
diff --git a/src/test/regress/sql/tablespace.sql b/src/test/regress/sql/tablespace.sql
index 92076db9a1..0025c56401 100644
--- a/src/test/regress/sql/tablespace.sql
+++ b/src/test/regress/sql/tablespace.sql
@@ -412,5 +412,46 @@ DROP TABLESPACE regress_tblspace_renamed;
DROP SCHEMA testschema CASCADE;
+
+--
+-- Check persistence change in a tablespace
+CREATE SCHEMA testschema;
+GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1;
+CREATE TABLESPACE regress_tablespace LOCATION :'testtablespace';
+GRANT CREATE ON TABLESPACE regress_tablespace TO regress_tablespace_user1;
+
+CREATE TABLE testschema.lsu(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.usu(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lsu(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._usu(a int) TABLESPACE pg_default;
+SET ROLE regress_tablespace_user1;
+CREATE TABLE testschema.lu1(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.uu1(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lu1(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._uu1(a int) TABLESPACE pg_default;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace
+ OWNED BY regress_tablespace_user1 SET LOGGED;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+RESET ROLE;
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace SET UNLOGGED;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+-- Should succeed
+DROP SCHEMA testschema CASCADE;
+DROP TABLESPACE regress_tablespace;
+
DROP ROLE regress_tablespace_user1;
DROP ROLE regress_tablespace_user2;
--
2.27.0
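The regression test in the patch above checks two things: OWNED BY narrows the set of affected tables, and tables outside the named tablespace are left alone. The following is only a toy in-memory model of that selection rule, written to make the expected output easier to follow. It is not how AlterTableSetLoggedAll works (the patch collects relations and issues AT_SetLogged/AT_SetUnLogged through AlterTableInternal); ToyRel and toy_set_logged_all are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the few pg_class columns the test looks at. */
typedef struct ToyRel
{
	const char *name;
	const char *tablespace;		/* NULL stands for the database default */
	const char *owner;
	char		relpersistence; /* 'p' = permanent, 'u' = unlogged */
} ToyRel;

/*
 * Flip persistence of every relation in "tablespace", optionally restricted
 * to one owner (owner == NULL means no OWNED BY clause).  Returns the number
 * of relations whose persistence actually changed.
 */
int
toy_set_logged_all(ToyRel *rels, size_t nrels, const char *tablespace,
				   const char *owner, int logged)
{
	char		target = logged ? 'p' : 'u';
	int			changed = 0;

	assert(tablespace != NULL);

	for (size_t i = 0; i < nrels; i++)
	{
		ToyRel	   *r = &rels[i];

		/* relations in the default tablespace never match a named one */
		if (r->tablespace == NULL || strcmp(r->tablespace, tablespace) != 0)
			continue;
		/* OWNED BY filter, if given */
		if (owner != NULL && strcmp(r->owner, owner) != 0)
			continue;

		if (r->relpersistence != target)
		{
			r->relpersistence = target;
			changed++;
		}
	}
	return changed;
}
```

Fed the eight tables from the test, the OWNED BY regress_tablespace_user1 SET LOGGED step changes only uu1, and the following SET UNLOGGED step changes lsu, lu1 and uu1, which matches the relpersistence columns in the expected output.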
Hi,
On Fri, Jan 14, 2022 at 11:43:10AM +0900, Kyotaro Horiguchi wrote:
I found a bug.
mdmarkexists() didn't close the tentatively opened fd. This is a silent
leak on Linux and similar systems, and it causes delete failures on Windows.
It was the reason for the CI failure.
027_persistence_change.pl uses interactive_psql(), which doesn't work on
the Windows VM on the CI.
In this version the following changes have been made in 0001.
- Properly close file descriptor in mdmarkexists.
- Skip some tests when IO::Pty is not available.
It might be better to separate that part.
Looking again at the ALTER TABLE ALL IN TABLESPACE SET LOGGED patch, I
noticed that it doesn't implement the OWNED BY part and doesn't have tests
or documentation (it was a PoC). Added all of them to 0002.
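The descriptor leak described in the quoted message is easy to reproduce in isolation. Below is a minimal sketch, assuming a POSIX system, of an "exists" check that opens the file tentatively and always closes it again. file_exists_by_open is a hypothetical helper for illustration, not the patch's mdmarkexists(), whose body is not shown here.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Hypothetical helper: report whether a file exists by opening it.
 * The descriptor is only a probe, so it must be closed before returning.
 */
bool
file_exists_by_open(const char *path)
{
	int			fd;

	assert(path != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return false;			/* this sketch treats any failure as "absent" */

	/*
	 * Without this close(), every successful call leaks one descriptor.
	 * According to the message above, on Windows the still-open handle also
	 * makes a later delete of the file fail.
	 */
	close(fd);
	return true;
}
```

Calling the helper in a loop a few thousand times on an existing file is a cheap way to notice a missing close(): with a typical per-process descriptor limit, the leaking variant starts returning false once the limit is reached.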
The cfbot is failing on all OS with this version of the patch. Apparently
v16-0002 introduces some usage of "testtablespace" client-side variable that's
never defined, e.g.
https://api.cirrus-ci.com/v1/artifact/task/4670105480069120/regress_diffs/src/bin/pg_upgrade/tmp_check/regress/regression.diffs:
diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/tablespace.out /tmp/cirrus-ci-build/src/bin/pg_upgrade/tmp_check/regress/results/tablespace.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/tablespace.out 2022-01-18 04:26:38.744707547 +0000
+++ /tmp/cirrus-ci-build/src/bin/pg_upgrade/tmp_check/regress/results/tablespace.out 2022-01-18 04:30:37.557078083 +0000
@@ -948,76 +948,71 @@
CREATE SCHEMA testschema;
GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1;
CREATE TABLESPACE regress_tablespace LOCATION :'testtablespace';
+ERROR: syntax error at or near ":"
+LINE 1: CREATE TABLESPACE regress_tablespace LOCATION :'testtablespa...
Julien Rouhaud <rjuju123@gmail.com> writes:
The cfbot is failing on all OS with this version of the patch. Apparently
v16-0002 introduces some usage of "testtablespace" client-side variable that's
never defined, e.g.
That test infrastructure got rearranged very recently, see d6d317dbf.
regards, tom lane
At Tue, 18 Jan 2022 10:37:53 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in
Julien Rouhaud <rjuju123@gmail.com> writes:
The cfbot is failing on all OS with this version of the patch. Apparently
v16-0002 introduces some usage of "testtablespace" client-side variable that's
never defined, e.g.
That test infrastructure got rearranged very recently, see d6d317dbf.
Thanks to both. Even though I knew about the change, I forgot to bring
my repo up to date before checking.
The attached v17 changes only the following point (as well as the
corresponding "expected" file).
-+CREATE TABLESPACE regress_tablespace LOCATION :'testtablespace';
++CREATE TABLESPACE regress_tablespace LOCATION '';
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v17-0001-In-place-table-persistence-change.patch (text/x-patch; charset=us-ascii)
From c227842521de00d5da9dffb2f2dd86e8c1c171a8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v17 1/2] In-place table persistence change
Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, it currently runs a heap rewrite, which causes a large amount
of file I/O. This patch makes the command run without a heap rewrite.
In addition, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of a massive number of HEAP_INSERTs, which should be
smaller.
Also, this allows for the cleanup of files left behind when the
transaction that created them crashes.
---
src/backend/access/rmgrdesc/smgrdesc.c | 52 ++
src/backend/access/transam/README | 8 +
src/backend/access/transam/xact.c | 7 +
src/backend/access/transam/xlog.c | 17 +
src/backend/catalog/storage.c | 545 +++++++++++++++++-
src/backend/commands/tablecmds.c | 266 +++++++--
src/backend/replication/basebackup.c | 3 +-
src/backend/storage/buffer/bufmgr.c | 88 +++
src/backend/storage/file/fd.c | 4 +-
src/backend/storage/file/reinit.c | 344 +++++++----
src/backend/storage/smgr/md.c | 94 ++-
src/backend/storage/smgr/smgr.c | 32 +
src/backend/storage/sync/sync.c | 20 +-
src/bin/pg_rewind/parsexlog.c | 24 +
src/common/relpath.c | 47 +-
src/include/catalog/storage.h | 3 +
src/include/catalog/storage_xlog.h | 42 +-
src/include/common/relpath.h | 9 +-
src/include/storage/bufmgr.h | 2 +
src/include/storage/fd.h | 1 +
src/include/storage/md.h | 8 +-
src/include/storage/reinit.h | 10 +-
src/include/storage/smgr.h | 17 +
src/test/recovery/t/027_persistence_change.pl | 263 +++++++++
24 files changed, 1724 insertions(+), 182 deletions(-)
create mode 100644 src/test/recovery/t/027_persistence_change.pl
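The commit message says the patch also allows cleanup of files left behind when the creating transaction crashes. The idea, as far as it can be read from the patch, is a sentinel "mark" file that is created before the relation file and removed at commit, so that a mark surviving to the next startup identifies an orphan. The following is a rough file-level sketch of that protocol, assuming POSIX. The ".uncommitted" suffix and all function names are hypothetical; the patch's real naming lives in relpath.c and its WAL logging is omitted here.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define MARK_SUFFIX ".uncommitted"	/* hypothetical, not the patch's naming */

static void
mark_path(char *buf, size_t len, const char *path)
{
	int			n = snprintf(buf, len, "%s%s", path, MARK_SUFFIX);

	assert(n > 0 && (size_t) n < len);
}

static bool
touch_file(const char *path)
{
	int			fd = open(path, O_CREAT | O_WRONLY, 0600);

	if (fd < 0)
		return false;
	close(fd);
	return true;
}

/* Create the mark first, so a crash can never leave an unmarked new file. */
bool
create_marked(const char *path)
{
	char		mark[1024];

	mark_path(mark, sizeof(mark), path);
	return touch_file(mark) && touch_file(path);
}

/* At commit: dropping the mark makes the data file permanent. */
bool
commit_created(const char *path)
{
	char		mark[1024];

	mark_path(mark, sizeof(mark), path);
	return unlink(mark) == 0;
}

/*
 * At startup: a surviving mark means the creating transaction never
 * committed.  Remove the data file and the mark; return true if cleaned up.
 */
bool
recover_cleanup(const char *path)
{
	char		mark[1024];

	mark_path(mark, sizeof(mark), path);
	if (access(mark, F_OK) != 0)
		return false;			/* no mark: file was committed, keep it */

	unlink(path);				/* may already be missing; that is fine */
	unlink(mark);
	return true;
}
```

A crash is simulated by calling create_marked() and then recover_cleanup() without commit_created() in between: the data file disappears. With commit_created() in between, recover_cleanup() finds no mark and leaves the file alone.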
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 7547813254..2c674e5de0 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,49 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+ char *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+ appendStringInfoString(buf, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+ char *path = GetRelationPath(xlrec->rnode.dbNode,
+ xlrec->rnode.spcNode,
+ xlrec->rnode.relNode,
+ InvalidBackendId,
+ xlrec->forkNum, xlrec->mark);
+ char *action;
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ action = "CREATE";
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ action = "DELETE";
+ break;
+ default:
+ action = "<unknown action>";
+ break;
+ }
+
+ appendStringInfo(buf, "%s %s", action, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+ char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+ appendStringInfoString(buf, path);
+ appendStringInfo(buf, " persistence %d", xlrec->persistence);
+ pfree(path);
+ }
}
const char *
@@ -55,6 +98,15 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_UNLINK:
+ id = "UNLINK";
+ break;
+ case XLOG_SMGR_MARK:
+ id = "MARK";
+ break;
+ case XLOG_SMGR_BUFPERSISTENCE:
+ id = "BUFPERSISTENCE";
+ break;
}
return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c1..b344bbe511 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -724,6 +724,14 @@ we must panic and abort recovery. The DBA will have to manually clean up
then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.
+The Smgr MARK files
+--------------------------------
+
+An smgr mark file is created when a new relation file is created, to
+mark that the relfilenode needs to be cleaned up at recovery time. In
+contrast to the four actions above, failure to remove smgr mark files
+will lead to data loss, in which case the server will shut down.
+
Skipping WAL for New RelFileNode
--------------------------------
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c9516e03fa..3c7010eb0f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2197,6 +2197,9 @@ CommitTransaction(void)
*/
smgrDoPendingSyncs(true, is_parallel_worker);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2447,6 +2450,9 @@ PrepareTransaction(void)
*/
smgrDoPendingSyncs(true, false);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2772,6 +2778,7 @@ AbortTransaction(void)
AfterTriggerEndXact(false); /* 'false' means it's abort */
AtAbort_Portals();
smgrDoPendingSyncs(false, is_parallel_worker);
+ smgrDoPendingCleanups(false);
AtEOXact_LargeObject(false);
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c9d4cbf3ff..7cab6a0170 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -40,6 +40,7 @@
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
#include "catalog/pg_database.h"
+#include "catalog/storage.h"
#include "commands/progress.h"
#include "commands/tablespace.h"
#include "common/controldata_utils.h"
@@ -4564,6 +4565,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
{
ereport(DEBUG1,
(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+
+ /* cleanup garbage files left during crash recovery */
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
InArchiveRecovery = true;
if (StandbyModeRequested)
StandbyMode = true;
@@ -7824,6 +7833,14 @@ StartupXLOG(void)
}
}
+ /* cleanup garbage files left during crash recovery */
+ if (!InArchiveRecovery)
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
/* Allow resource managers to do any required cleanup. */
for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
{
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 9b8075536a..92a9451e90 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,6 +19,7 @@
#include "postgres.h"
+#include "access/amapi.h"
#include "access/parallel.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -66,6 +67,23 @@ typedef struct PendingRelDelete
struct PendingRelDelete *next; /* linked-list link */
} PendingRelDelete;
+#define PCOP_UNLINK_FORK (1 << 0)
+#define PCOP_UNLINK_MARK (1 << 1)
+#define PCOP_SET_PERSISTENCE (1 << 2)
+
+typedef struct PendingCleanup
+{
+ RelFileNode relnode; /* relation that may need to be deleted */
+ int op; /* operation mask */
+ bool bufpersistence; /* buffer persistence to set */
+ int unlink_forknum; /* forknum to unlink */
+ StorageMarks unlink_mark; /* mark to unlink */
+ BackendId backend; /* InvalidBackendId if not a temp rel */
+ bool atCommit; /* T=delete at commit; F=delete at abort */
+ int nestLevel; /* xact nesting level of request */
+ struct PendingCleanup *next; /* linked-list link */
+} PendingCleanup;
+
typedef struct PendingRelSync
{
RelFileNode rnode;
@@ -73,6 +91,7 @@ typedef struct PendingRelSync
} PendingRelSync;
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingCleanup *pendingCleanups = NULL; /* head of linked list */
HTAB *pendingSyncHash = NULL;
@@ -117,7 +136,8 @@ AddPendingSync(const RelFileNode *rnode)
SMgrRelation
RelationCreateStorage(RelFileNode rnode, char relpersistence)
{
- PendingRelDelete *pending;
+ PendingRelDelete *pendingdel;
+ PendingCleanup *pendingclean;
SMgrRelation srel;
BackendId backend;
bool needs_wal;
@@ -143,21 +163,41 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return NULL; /* placate compiler */
}
+ /*
+ * We are going to create a new storage file. If the server crashes before
+ * the current transaction ends, the file needs to be cleaned up. The
+ * SMGR_MARK_UNCOMMITTED mark file prompts that work at the next startup.
+ */
srel = smgropen(rnode, backend);
+ log_smgrcreatemark(&rnode, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
smgrcreate(srel, MAIN_FORKNUM, false);
if (needs_wal)
log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM);
/* Add the relation to the list of stuff to delete at abort */
- pending = (PendingRelDelete *)
+ pendingdel = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
- pending->relnode = rnode;
- pending->backend = backend;
- pending->atCommit = false; /* delete if abort */
- pending->nestLevel = GetCurrentTransactionNestLevel();
- pending->next = pendingDeletes;
- pendingDeletes = pending;
+ pendingdel->relnode = rnode;
+ pendingdel->backend = backend;
+ pendingdel->atCommit = false; /* delete if abort */
+ pendingdel->nestLevel = GetCurrentTransactionNestLevel();
+ pendingdel->next = pendingDeletes;
+ pendingDeletes = pendingdel;
+
+ /* drop mark files at commit */
+ pendingclean = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pendingclean->relnode = rnode;
+ pendingclean->op = PCOP_UNLINK_MARK;
+ pendingclean->unlink_forknum = MAIN_FORKNUM;
+ pendingclean->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pendingclean->backend = backend;
+ pendingclean->atCommit = true;
+ pendingclean->nestLevel = GetCurrentTransactionNestLevel();
+ pendingclean->next = pendingCleanups;
+ pendingCleanups = pendingclean;
if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
{
@@ -168,6 +208,203 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return srel;
}
+/*
+ * RelationCreateInitFork
+ * Create physical storage for the init fork of a relation.
+ *
+ * Create the init fork for the relation.
+ *
+ * This function is transactional. The creation is WAL-logged, and if the
+ * transaction aborts later on, the init fork will be removed.
+ */
+void
+RelationCreateInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ SMgrRelation srel;
+ bool create = true;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), false, false);
+
+ /*
+ * If we have entries for init-fork operations on this relation, that means
+ * that we have already registered pending delete entries to drop an
+ * init-fork preexisting since before the current transaction started. This
+ * function reverts that change just by removing the entries.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ create = false;
+ }
+ else
+ prev = pending;
+ }
+
+ if (!create)
+ return;
+
+ /*
+ * We are going to create an init fork. If the server crashes before the
+ * current transaction ends, the init fork left behind corrupts data during
+ * recovery. The mark file works as the sentinel to identify that
+ * situation.
+ */
+ srel = smgropen(rnode, InvalidBackendId);
+ log_smgrcreatemark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+
+ /* We don't have existing init fork, create it. */
+ smgrcreate(srel, INIT_FORKNUM, false);
+
+ /*
+ * index-init fork needs further initialization. ambuildempty should do
+ * WAL-log and file sync by itself but otherwise we do that by ourselves.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rnode, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /* drop the init fork, mark file and revert persistence at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_UNLINK_FORK | PCOP_UNLINK_MARK | PCOP_SET_PERSISTENCE;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->bufpersistence = true;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* drop mark file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
+/*
+ * RelationDropInitFork
+ * Delete physical storage for the init fork of a relation.
+ *
+ * Register pending-delete of the init fork. The real deletion is performed by
+ * smgrDoPendingDeletes at commit.
+ *
+ * This function is transactional. If the transaction aborts later on, the
+ * deletion doesn't happen.
+ */
+void
+RelationDropInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ bool inxact_created = false;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), true, false);
+
+ /*
+ * If we have entries for init-fork operations of this relation, that means
+ * that we have created the init fork in the current transaction. We
+ * remove the init fork and mark file immediately in that case. Otherwise
+ * just register pending-delete for the existing init fork.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ /* unlink list entry */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ inxact_created = true;
+ }
+ else
+ prev = pending;
+ }
+
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rnode, InvalidBackendId);
+
+ /*
+ * INIT forks are never loaded into shared buffers, so there is no point
+ * in dropping buffers for such files.
+ */
+ log_smgrunlinkmark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrunlinkmark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+ log_smgrunlink(&rnode, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ return;
+ }
+
+ /* register drop of this init fork file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_UNLINK_FORK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = false;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -187,6 +424,88 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum)
+{
+ xl_smgr_unlink xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file unlink.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK record (XLOG_SMGR_MARK_CREATE) to WAL.
+ */
+void
+log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file creation.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_CREATE;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK record (XLOG_SMGR_MARK_UNLINK) to WAL.
+ */
+void
+log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the mark file removal.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_UNLINK;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileNode *rnode, bool persistence)
+{
+ xl_smgr_bufpersistence xlrec;
+
+ /*
+ * Make an XLOG entry reporting the change of buffer persistence.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.persistence = persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -255,6 +574,7 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)
prev->next = next;
else
pendingDeletes = next;
+
pfree(pending);
/* prev does not change */
}
@@ -673,6 +993,88 @@ smgrDoPendingDeletes(bool isCommit)
}
}
+/*
+ * smgrDoPendingCleanups() -- Clean up work that emits WAL records
+ *
+ * The operations handled in this function emit WAL records, which must be
+ * emitted before the commit record for the current transaction.
+ */
+void
+smgrDoPendingCleanups(bool isCommit)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (pending->nestLevel < nestLevel)
+ {
+ /* outer-level entries should not be processed yet */
+ prev = pending;
+ }
+ else
+ {
+ /* unlink list entry first, so we don't retry on failure */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ /* do cleanup if called for */
+ if (pending->atCommit == isCommit)
+ {
+ SMgrRelation srel;
+
+ srel = smgropen(pending->relnode, pending->backend);
+
+ Assert ((pending->op &
+ ~(PCOP_UNLINK_FORK | PCOP_UNLINK_MARK |
+ PCOP_SET_PERSISTENCE)) == 0);
+
+ if (pending->op & PCOP_UNLINK_FORK)
+ {
+ /* other forks would need their buffers dropped */
+ Assert(pending->unlink_forknum == INIT_FORKNUM);
+
+ /* Don't emit WAL during recovery. */
+ if (!InRecovery)
+ log_smgrunlink(&pending->relnode,
+ pending->unlink_forknum);
+ smgrunlink(srel, pending->unlink_forknum, false);
+ }
+
+ if (pending->op & PCOP_UNLINK_MARK)
+ {
+ SMgrRelation srel;
+
+ if (!InRecovery)
+ log_smgrunlinkmark(&pending->relnode,
+ pending->unlink_forknum,
+ pending->unlink_mark);
+ srel = smgropen(pending->relnode, pending->backend);
+ smgrunlinkmark(srel, pending->unlink_forknum,
+ pending->unlink_mark, InRecovery);
+ smgrclose(srel);
+ }
+
+ if (pending->op & PCOP_SET_PERSISTENCE)
+ {
+ SetRelationBuffersPersistence(srel, pending->bufpersistence,
+ InRecovery);
+ }
+ }
+
+ /* must explicitly free the list entry */
+ pfree(pending);
+ /* prev does not change */
+ }
+ }
+}
+
/*
* smgrDoPendingSyncs() -- Take care of relation syncs at end of xact.
*/
@@ -933,6 +1335,15 @@ smgr_redo(XLogReaderState *record)
reln = smgropen(xlrec->rnode, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ smgrunlink(reln, xlrec->forkNum, true);
+ smgrclose(reln);
+ }
else if (info == XLOG_SMGR_TRUNCATE)
{
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
@@ -1021,6 +1432,124 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ bool created = false;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ smgrcreatemark(reln, xlrec->forkNum, xlrec->mark, true);
+ created = true;
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ smgrunlinkmark(reln, xlrec->forkNum, xlrec->mark, true);
+ break;
+ default:
+ elog(ERROR, "unknown smgr_mark action \"%c\"", xlrec->action);
+ }
+
+ if (created)
+ {
+ /* revert mark file operation at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = xlrec->rnode;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = xlrec->forkNum;
+ pending->unlink_mark = xlrec->mark;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ else
+ {
+ /*
+ * Delete pending action for this mark file if any. We should have
+ * at most one entry for this action.
+ */
+ PendingCleanup *prev = NULL;
+
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileNodeEquals(xlrec->rnode, pending->relnode) &&
+ pending->unlink_forknum == xlrec->forkNum &&
+ (pending->op & PCOP_UNLINK_MARK) != 0)
+ {
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+ }
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ PendingCleanup *prev = NULL;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+
+ /*
+ * Delete pending action for persistence change if any. We should have
+ * at most one entry for this action.
+ */
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileNodeEquals(xlrec->rnode, pending->relnode) &&
+ (pending->op & PCOP_SET_PERSISTENCE) != 0)
+ {
+ Assert (pending->bufpersistence == xlrec->persistence);
+
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+
+ /*
+ * Revert buffer-persistence changes at abort if the relation is changing
+ * to a persistence different from the one before this transaction.
+ */
+ if (!pending)
+ {
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = xlrec->rnode;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = !xlrec->persistence;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 1f0654c2f5..9e673ba68f 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -52,6 +52,7 @@
#include "commands/defrem.h"
#include "commands/event_trigger.h"
#include "commands/policy.h"
+#include "commands/progress.h"
#include "commands/sequence.h"
#include "commands/tablecmds.h"
#include "commands/tablespace.h"
@@ -5346,6 +5347,187 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
return newcmd;
}
+/*
+ * RelationChangePersistence: do in-place persistence change of a relation
+ */
+static void
+RelationChangePersistence(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ int i;
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * If any of the following were set, we would need to call ATRewriteTable;
+ * that cannot happen in the AT_REWRITE_ALTER_PERSISTENCE case.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ /*
+ * First we collect all relations whose persistence we need to change.
+ */
+
+ /* Collect OIDs of indexes and toast relations */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ /* Add toast relation if any */
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ List *toastidx;
+ Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
+
+ relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
+ toastidx = RelationGetIndexList(toastrel);
+ relids = list_concat(relids, toastidx);
+ pfree(toastidx);
+ table_close(toastrel, NoLock);
+ }
+
+ table_close(rel, NoLock);
+
+ /* Make changes in storage */
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach (lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+
+ /*
+ * XXX: Some access methods cannot cope with an in-place persistence
+ * change. Specifically, GiST uses page LSNs to figure out whether a
+ * block has changed, and UNLOGGED GiST indexes use fake LSNs that
+ * are incompatible with the real LSNs used for LOGGED ones.
+ *
+ * Maybe if gistGetFakeLSN behaved the same way for permanent and
+ * unlogged indexes, we could skip the index rebuild in exchange for some
+ * extra WAL records emitted while it is unlogged.
+ *
+ * Check relam against a positive list so that unknown AMs also take
+ * the rebuild path.
+ */
+ if (r->rd_rel->relkind == RELKIND_INDEX &&
+ /* GiST is excluded */
+ r->rd_rel->relam != BTREE_AM_OID &&
+ r->rd_rel->relam != HASH_AM_OID &&
+ r->rd_rel->relam != GIN_AM_OID &&
+ r->rd_rel->relam != SPGIST_AM_OID &&
+ r->rd_rel->relam != BRIN_AM_OID)
+ {
+ int reindex_flags;
+ ReindexParams params = {0};
+
+ /* reindex doesn't allow concurrent use of the index */
+ table_close(r, NoLock);
+
+ reindex_flags =
+ REINDEX_REL_SUPPRESS_INDEX_USE |
+ REINDEX_REL_CHECK_CONSTRAINTS;
+
+ /* Set the same persistence as the parent relation. */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+
+ reindex_index(reloid, reindex_flags, persistence, &params);
+
+ continue;
+ }
+
+ /* Create or drop init fork */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ RelationCreateInitFork(r);
+ else
+ RelationDropInitFork(r);
+
+ /*
+ * When this relation gets WAL-logged, immediately sync all files but
+ * the init fork to establish the initial state on storage. Buffers have
+ * already been flushed out by RelationCreate(Drop)InitFork called just
+ * above. The init fork should have been synced as needed.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT)
+ {
+ for (i = 0 ; i < INIT_FORKNUM ; i++)
+ {
+ if (smgrexists(RelationGetSmgr(r), i))
+ smgrimmedsync(RelationGetSmgr(r), i);
+ }
+ }
+
+ /* Update catalog */
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+
+ /*
+ * While wal_level >= replica, switching to LOGGED requires the
+ * relation content to be WAL-logged so that the table can be recovered.
+ * We don't emit this while wal_level = minimal.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded())
+ {
+ ForkNumber fork;
+ xl_smgr_truncate xlrec;
+
+ xlrec.blkno = 0;
+ xlrec.rnode = r->rd_node;
+ xlrec.flags = SMGR_TRUNCATE_ALL;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ for (fork = 0; fork < INIT_FORKNUM ; fork++)
+ {
+ if (smgrexists(RelationGetSmgr(r), fork))
+ log_newpage_range(r, fork, 0,
+ smgrnblocks(RelationGetSmgr(r), fork),
+ false);
+ }
+ }
+
+ table_close(r, NoLock);
+ }
+
+ table_close(classRel, NoLock);
+}
+
/*
* ATRewriteTables: ALTER TABLE phase 3
*/
@@ -5476,47 +5658,55 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- /*
- * Create transient table that will receive the modified data.
- *
- * Ensure it is marked correctly as logged or unlogged. We have
- * to do this here so that buffers for the new relfilenode will
- * have the right persistence set, and at the same time ensure
- * that the original filenode's buffers will get read in with the
- * correct setting (i.e. the original one). Otherwise a rollback
- * after the rewrite would possibly result with buffers for the
- * original filenode having the wrong persistence setting.
- *
- * NB: This relies on swap_relation_files() also swapping the
- * persistence. That wouldn't work for pg_class, but that can't be
- * unlogged anyway.
- */
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
- persistence, lockmode);
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
+ RelationChangePersistence(tab, persistence, lockmode);
+ else
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new relfilenode
+ * will have the right persistence set, and at the same time
+ * ensure that the original filenode's buffers will get read in
+ * with the correct setting (i.e. the original one). Otherwise
+ * a rollback after the rewrite would possibly result with
+ * buffers for the original filenode having the wrong
+ * persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that can't
+ * be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace,
+ NewAccessMethod,
+ persistence, lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
+ /*
+ * Swap the physical files of the old and new heaps, then
+ * rebuild indexes and discard the old heap. We can use
+ * RecentXmin for the table's new relfrozenxid because we
+ * rewrote all the tuples in ATRewriteTable, so no older Xid
+ * remains in the table. Also, we never try to swap toast
+ * tables by content, since we have no interest in letting this
+ * code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
- InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ }
}
else
{
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 3afbbe7e02..3f16b5f58c 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1102,6 +1102,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
bool excludeFound;
ForkNumber relForkNum; /* Type of fork if file is a relation */
int relOidChars; /* Chars in filename that are the rel oid */
+ StorageMarks mark;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1152,7 +1153,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
/* Exclude all forks for unlogged tables except the init fork */
if (isDbDir &&
parse_filename_for_nontemp_relation(de->d_name, &relOidChars,
- &relForkNum))
+ &relForkNum, &mark))
{
/* Never exclude init forks */
if (relForkNum != INIT_FORKNUM)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a2512e750c..6384b4efbe 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -37,6 +37,7 @@
#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
#include "executor/instrument.h"
#include "lib/binaryheap.h"
#include "miscadmin.h"
@@ -3154,6 +3155,93 @@ DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
}
}
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a relation
+ * and, when switching to PERMANENT, writes all dirty pages of the relation
+ * out to disk (or more accurately, out to kernel disk buffers),
+ * ensuring that the kernel has an up-to-date view of the relation.
+ *
+ * Generally, the caller should be holding AccessExclusiveLock on the
+ * target relation to ensure that no other backend is busy dirtying
+ * more blocks of the relation; the effects can't be expected to last
+ * after the lock is released.
+ *
+ * XXX currently it sequentially searches the buffer pool, should be
+ * changed to more clever ways of searching. This routine is not
+ * used in any performance-critical code paths, so it's not worth
+ * adding additional overhead to normal paths to make it go faster;
+ * but see also DropRelFileNodeBuffers.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
+{
+ int i;
+ RelFileNodeBackend rnode = srel->smgr_rnode;
+
+ Assert (!RelFileNodeBackendIsTemp(rnode));
+
+ if (!isRedo)
+ log_smgrbufpersistence(&srel->smgr_rnode.node, permanent);
+
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ {
+ UnlockBufHdr(bufHdr, buf_state);
+ continue;
+ }
+
+ if (permanent)
+ {
+ /* Init fork is being dropped, drop buffers for it. */
+ if (bufHdr->tag.forkNum == INIT_FORKNUM)
+ {
+ InvalidateBuffer(bufHdr);
+ continue;
+ }
+
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /* we flush this buffer when switching to PERMANENT */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, srel);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr, true);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ /* init fork is always BM_PERMANENT. See BufferAlloc */
+ if (bufHdr->tag.forkNum != INIT_FORKNUM)
+ buf_state &= ~BM_PERMANENT;
+
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ }
+}
+
/* ---------------------------------------------------------------------
* DropRelFileNodesAllBuffers
*
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 14b77f2861..2fc9f17c28 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -349,8 +349,6 @@ static void pre_sync_fname(const char *fname, bool isdir, int elevel);
static void datadir_fsync_fname(const char *fname, bool isdir, int elevel);
static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel);
-static int fsync_parent_path(const char *fname, int elevel);
-
/*
* pg_fsync --- do fsync with or without writethrough
@@ -3759,7 +3757,7 @@ fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel)
* This is aimed at making file operations persistent on disk in case of
* an OS crash or power failure.
*/
-static int
+int
fsync_parent_path(const char *fname, int elevel)
{
char parentpath[MAXPGPATH];
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index f053fe0495..1124e95d0d 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -16,29 +16,49 @@
#include <unistd.h>
+#include "access/xlog.h"
+#include "catalog/pg_tablespace_d.h"
#include "common/relpath.h"
#include "postmaster/startup.h"
+#include "storage/bufmgr.h"
#include "storage/copydir.h"
#include "storage/fd.h"
+#include "storage/md.h"
#include "storage/reinit.h"
+#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
- int op);
+ Oid tspid, int op);
static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
- int op);
+ Oid tspid, Oid dbid, int op);
typedef struct
{
Oid reloid; /* hash key */
-} unlogged_relation_entry;
+ bool has_init; /* has INIT fork */
+ bool dirty_init; /* needs to remove INIT fork */
+ bool dirty_all; /* needs to remove all forks */
+} relfile_entry;
/*
- * Reset unlogged relations from before the last restart.
+ * Clean up and reset relation files from before the last restart.
*
- * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any
- * relation with an "init" fork, except for the "init" fork itself.
+ * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations
+ * depending on the existence of the "cleanup" forks.
+ *
+ * If an SMGR_MARK_UNCOMMITTED mark file for the init fork is present, we
+ * remove the init fork along with the mark file.
+ *
+ * If an SMGR_MARK_UNCOMMITTED mark file for the main fork is present, we
+ * remove the whole relation along with the mark file.
+ *
+ * Otherwise, if the "init" fork is found, we remove all forks of any relation
+ * with the "init" fork, except for the "init" fork itself.
+ *
+ * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all
+ * relations that have the "cleanup" and/or the "init" forks.
*
* If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main
* fork.
@@ -72,7 +92,7 @@ ResetUnloggedRelations(int op)
/*
* First process unlogged files in pg_default ($PGDATA/base)
*/
- ResetUnloggedRelationsInTablespaceDir("base", op);
+ ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op);
/*
* Cycle through directories for all non-default tablespaces.
@@ -81,13 +101,19 @@ ResetUnloggedRelations(int op)
while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
{
+ Oid tspid;
+
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
continue;
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
- ResetUnloggedRelationsInTablespaceDir(temp_path, op);
+
+ tspid = atooid(spc_de->d_name);
+
+ Assert(tspid != 0);
+ ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op);
}
FreeDir(spc_dir);
@@ -103,7 +129,8 @@ ResetUnloggedRelations(int op)
* Process one tablespace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
+ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
+ Oid tspid, int op)
{
DIR *ts_dir;
struct dirent *de;
@@ -130,6 +157,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
while ((de = ReadDir(ts_dir, tsdirname)) != NULL)
{
+ Oid dbid;
+
/*
* We're only interested in the per-database directories, which have
* numeric names. Note that this code will also (properly) ignore "."
@@ -148,7 +177,10 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
ereport_startup_progress("resetting unlogged relations (cleanup), elapsed time: %ld.%02d s, current path: %s",
dbspace_path);
- ResetUnloggedRelationsInDbspaceDir(dbspace_path, op);
+ dbid = atooid(de->d_name);
+ Assert(dbid != 0);
+
+ ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op);
}
FreeDir(ts_dir);
@@ -158,125 +190,227 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
* Process one per-dbspace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
+ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
+ Oid tspid, Oid dbid, int op)
{
DIR *dbspace_dir;
struct dirent *de;
char rm_path[MAXPGPATH * 2];
+ HTAB *hash;
+ HASHCTL ctl;
/* Caller must specify at least one operation. */
- Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0);
+ Assert((op & (UNLOGGED_RELATION_CLEANUP |
+ UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_INIT)) != 0);
/*
* Cleanup is a two-pass operation. First, we go through and identify all
* the files with init forks. Then, we go through again and nuke
* everything with the same OID except the init fork.
*/
+
+ /*
+ * It's possible that someone could create tons of unlogged relations in
+ * the same database & tablespace, so we'd better use a hash table rather
+ * than an array or linked list to keep track of which files need to be
+ * reset. Otherwise, this cleanup operation would be O(n^2).
+ */
+ memset(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(relfile_entry);
+ hash = hash_create("relfilenode cleanup hash",
+ 32, &ctl, HASH_ELEM | HASH_BLOBS);
+
+ /* Collect INIT fork and mark files in the directory. */
+ dbspace_dir = AllocateDir(dbspacedirname);
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
+ {
+ int oidchars;
+ ForkNumber forkNum;
+ StorageMarks mark;
+
+ /* Skip anything that doesn't look like a relation data file. */
+ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
+ &forkNum, &mark))
+ continue;
+
+ if (forkNum == INIT_FORKNUM || mark == SMGR_MARK_UNCOMMITTED)
+ {
+ Oid key;
+ relfile_entry *ent;
+ bool found;
+
+ /*
+ * Record the relfilenode information. If it has
+ * SMGR_MARK_UNCOMMITTED mark files, the relfilenode is in dirty
+ * state, where clean up is needed.
+ */
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_ENTER, &found);
+
+ if (!found)
+ {
+ ent->has_init = false;
+ ent->dirty_init = false;
+ ent->dirty_all = false;
+ }
+
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_init = true;
+ else if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_all = true;
+ else
+ {
+ Assert(forkNum == INIT_FORKNUM);
+ ent->has_init = true;
+ }
+ }
+ }
+
+ /* Done with the first pass. */
+ FreeDir(dbspace_dir);
+
+ /* nothing to do if we have neither init nor cleanup forks */
+ if (hash_get_num_entries(hash) < 1)
+ {
+ hash_destroy(hash);
+ return;
+ }
+
+ if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0)
+ {
+ /*
+ * When we come here after recovery, an smgr object for this file might
+ * have been created. In that case we need to drop all buffers and then
+ * the smgr object before initializing the unlogged relation. This is safe
+ * as long as no other backends have accessed the relation before
+ * starting archive recovery.
+ */
+ HASH_SEQ_STATUS status;
+ relfile_entry *ent;
+ SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8);
+ int maxrels = 8;
+ int nrels = 0;
+ int i;
+
+ Assert(!HotStandbyActive());
+
+ hash_seq_init(&status, hash);
+ while((ent = (relfile_entry *) hash_seq_search(&status)) != NULL)
+ {
+ RelFileNodeBackend rel;
+
+ /*
+ * The relation is persistent and remains persistent. Don't
+ * drop the buffers for this relation.
+ */
+ if (ent->has_init && ent->dirty_init)
+ continue;
+
+ if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = ent->reloid;
+
+ srels[nrels++] = smgropen(rel.node, InvalidBackendId);
+ }
+
+ DropRelFileNodesAllBuffers(srels, nrels);
+
+ for (i = 0 ; i < nrels ; i++)
+ smgrclose(srels[i]);
+ }
+
+ /*
+ * Now, make a second pass and remove anything that matches.
+ */
if ((op & UNLOGGED_RELATION_CLEANUP) != 0)
{
- HTAB *hash;
- HASHCTL ctl;
-
- /*
- * It's possible that someone could create a ton of unlogged relations
- * in the same database & tablespace, so we'd better use a hash table
- * rather than an array or linked list to keep track of which files
- * need to be reset. Otherwise, this cleanup operation would be
- * O(n^2).
- */
- ctl.keysize = sizeof(Oid);
- ctl.entrysize = sizeof(unlogged_relation_entry);
- ctl.hcxt = CurrentMemoryContext;
- hash = hash_create("unlogged relation OIDs", 32, &ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- /* Scan the directory. */
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ int oidchars;
+ Oid key;
+ relfile_entry *ent;
+ RelFileNodeBackend rel;
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* Also skip it unless this is the init fork. */
- if (forkNum != INIT_FORKNUM)
- continue;
-
- /*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
- */
- ent.reloid = atooid(de->d_name);
- (void) hash_search(hash, &ent, HASH_ENTER, NULL);
- }
-
- /* Done with the first pass. */
- FreeDir(dbspace_dir);
-
- /*
- * If we didn't find any init forks, there's no point in continuing;
- * we can bail out now.
- */
- if (hash_get_num_entries(hash) == 0)
- {
- hash_destroy(hash);
- return;
- }
-
- /*
- * Now, make a second pass and remove anything that matches.
- */
- dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
- {
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
-
- /* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* We never remove the init fork. */
- if (forkNum == INIT_FORKNUM)
+ &forkNum, &mark))
continue;
/*
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
- if (hash_search(hash, &ent, HASH_FIND, NULL))
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);
+
+ if (!ent)
+ continue;
+
+ if (!ent->dirty_all)
{
- snprintf(rm_path, sizeof(rm_path), "%s/%s",
- dbspacedirname, de->d_name);
- if (unlink(rm_path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m",
- rm_path)));
+ /* clean permanent relations don't need cleanup */
+ if (!ent->has_init)
+ continue;
+
+ if (ent->dirty_init)
+ {
+ /*
+ * The crashed transaction did SET UNLOGGED. This relation
+ * is restored to a LOGGED relation.
+ */
+ if (forkNum != INIT_FORKNUM)
+ continue;
+ }
else
- elog(DEBUG2, "unlinked file \"%s\"", rm_path);
+ {
+ /*
+ * we don't remove the INIT fork of a non-dirty
+ * relfilenode
+ */
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE)
+ continue;
+ }
}
+
+ /* so, nuke it! */
+ snprintf(rm_path, sizeof(rm_path), "%s/%s",
+ dbspacedirname, de->d_name);
+ if (unlink(rm_path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ rm_path)));
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = atooid(de->d_name);
+
+ ForgetRelationForkSyncRequests(rel, forkNum);
}
/* Cleanup is complete. */
FreeDir(dbspace_dir);
- hash_destroy(hash);
}
+ hash_destroy(hash);
+ hash = NULL;
+
/*
* Initialization happens after cleanup is complete: we copy each init
- * fork file to the corresponding main fork file. Note that if we are
- * asked to do both cleanup and init, we may never get here: if the
- * cleanup code determines that there are no init forks in this dbspace,
- * it will return before we get to this point.
+ * fork file to the corresponding main fork file.
*/
if ((op & UNLOGGED_RELATION_INIT) != 0)
{
@@ -285,6 +419,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char srcpath[MAXPGPATH * 2];
@@ -292,9 +427,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -328,15 +465,18 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char mainpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -379,7 +519,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
*/
bool
parse_filename_for_nontemp_relation(const char *name, int *oidchars,
- ForkNumber *fork)
+ ForkNumber *fork, StorageMarks *mark)
{
int pos;
@@ -410,11 +550,19 @@ parse_filename_for_nontemp_relation(const char *name, int *oidchars,
for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar)
;
- if (segchar <= 1)
- return false;
- pos += segchar;
+ if (segchar > 1)
+ pos += segchar;
}
+ /* mark file? */
+ if (name[pos] == '.' && name[pos + 1] != 0)
+ {
+ *mark = name[pos + 1];
+ pos += 2;
+ }
+ else
+ *mark = SMGR_MARK_NONE;
+
/* Now we should be at the end. */
if (name[pos] != '\0')
return false;
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index d26c915f90..007efe68a5 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -139,7 +139,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
BlockNumber blkno, bool skipFsync, int behavior);
static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
MdfdVec *seg);
-
+static bool mdmarkexists(SMgrRelation reln, ForkNumber forkNum,
+ StorageMarks mark);
/*
* mdinit() -- Initialize private state for magnetic disk storage manager.
@@ -169,6 +170,82 @@ mdexists(SMgrRelation reln, ForkNumber forkNum)
return (mdopenfork(reln, forkNum, EXTENSION_RETURN_NULL) != NULL);
}
+/*
+ * mdcreatemark() -- Create a mark file.
+ *
+ * If isRedo is true, it's okay for the file to exist already.
+ */
+void
+mdcreatemark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+ int fd;
+
+ /* See mdcreate() for details. */
+ TablespaceCreateDbspace(reln->smgr_rnode.node.spcNode,
+ reln->smgr_rnode.node.dbNode,
+ isRedo);
+
+ fd = BasicOpenFile(path, O_WRONLY | O_CREAT | O_EXCL);
+ if (fd < 0 && (!isRedo || errno != EEXIST))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create mark file \"%s\": %m", path)));
+
+ pg_fsync(fd);
+ close(fd);
+
+ /*
+ * To guarantee that the creation of the file is persistent, fsync its
+ * parent directory.
+ */
+ fsync_parent_path(path, ERROR);
+
+ pfree(path);
+}
+
+
+/*
+ * mdunlinkmark() -- Delete the mark file
+ *
+ * If isRedo is true, it's okay for the file not to exist.
+ */
+void
+mdunlinkmark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+
+ if (!isRedo || mdmarkexists(reln, forkNum, mark))
+ durable_unlink(path, ERROR);
+
+ pfree(path);
+}
+
+/*
+ * mdmarkexists() -- Check if the file exists.
+ */
+static bool
+mdmarkexists(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+ int fd;
+
+ fd = BasicOpenFile(path, O_RDONLY);
+ if (fd < 0 && errno != ENOENT)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not access mark file \"%s\": %m", path)));
+ pfree(path);
+
+ if (fd < 0)
+ return false;
+
+ close(fd);
+ return true;
+}
+
/*
* mdcreate() -- Create a new relation on magnetic disk.
*
@@ -1025,6 +1102,15 @@ register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum,
RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ );
}
+/*
+ * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork
+ */
+void
+ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, ForkNumber forknum)
+{
+ register_forget_request(rnode, forknum, 0);
+}
+
/*
* ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB
*/
@@ -1378,12 +1464,14 @@ mdsyncfiletag(const FileTag *ftag, char *path)
* Return 0 on success, -1 on failure, with errno set.
*/
int
-mdunlinkfiletag(const FileTag *ftag, char *path)
+mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark)
{
char *p;
/* Compute the path. */
- p = relpathperm(ftag->rnode, MAIN_FORKNUM);
+ p = GetRelationPath(ftag->rnode.dbNode, ftag->rnode.spcNode,
+ ftag->rnode.relNode, InvalidBackendId, MAIN_FORKNUM,
+ mark);
strlcpy(path, p, MAXPGPATH);
pfree(p);
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index eb701dce57..4819b5c404 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -62,6 +62,10 @@ typedef struct f_smgr
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+ void (*smgr_createmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+ void (*smgr_unlinkmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
} f_smgr;
static const f_smgr smgrsw[] = {
@@ -82,6 +86,8 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
+ .smgr_createmark = mdcreatemark,
+ .smgr_unlinkmark = mdunlinkmark,
}
};
@@ -335,6 +341,26 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo);
}
+/*
+ * smgrcreatemark() -- Create a mark file
+ */
+void
+smgrcreatemark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_createmark(reln, forknum, mark, isRedo);
+}
+
+/*
+ * smgrunlinkmark() -- Delete a mark file
+ */
+void
+smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlinkmark(reln, forknum, mark, isRedo);
+}
+
/*
* smgrdosyncall() -- Immediately sync all forks of all given relations
*
@@ -662,6 +688,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 11fa17ddea..ddc344dad2 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -89,7 +89,8 @@ static CycleCtr checkpoint_cycle_ctr = 0;
typedef struct SyncOps
{
int (*sync_syncfiletag) (const FileTag *ftag, char *path);
- int (*sync_unlinkfiletag) (const FileTag *ftag, char *path);
+ int (*sync_unlinkfiletag) (const FileTag *ftag, char *path,
+ StorageMarks mark);
bool (*sync_filetagmatches) (const FileTag *ftag,
const FileTag *candidate);
} SyncOps;
@@ -222,7 +223,8 @@ SyncPostCheckpoint(void)
/* Unlink the file */
if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
- path) < 0)
+ path,
+ SMGR_MARK_NONE) < 0)
{
/*
* There's a race condition, when the database is dropped at the
@@ -236,6 +238,20 @@ SyncPostCheckpoint(void)
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m", path)));
}
+ else if (syncsw[entry->tag.handler].sync_unlinkfiletag(
+ &entry->tag, path,
+ SMGR_MARK_UNCOMMITTED) < 0)
+ {
+ /*
+ * There may also be an SMGR_MARK_UNCOMMITTED file. Remove it if the
+ * fork file has been successfully removed. It's ok if the mark file
+ * does not exist.
+ */
+ if (errno != ENOENT)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
/* Mark the list entry as canceled, just in case */
entry->canceled = true;
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 9143797458..b21d01d04a 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -407,6 +407,30 @@ extractPageInfo(XLogReaderState *record)
* source system.
*/
}
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_UNLINK)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_MARK)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ /*
+ * We can safely ignore these. When we compare the sizes later on,
+ * we'll notice that they differ, and copy the missing tail from
+ * source system.
+ */
+ }
else if (rmid == RM_XACT_ID &&
((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT ||
(rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED ||
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 636c96efd3..1c19e16fea 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -139,9 +139,15 @@ GetDatabasePath(Oid dbNode, Oid spcNode)
*/
char *
GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
- int backendId, ForkNumber forkNumber)
+ int backendId, ForkNumber forkNumber, char mark)
{
char *path;
+ char markstr[4];
+
+ if (mark == 0)
+ markstr[0] = 0;
+ else
+ snprintf(markstr, sizeof(markstr), ".%c", mark);
if (spcNode == GLOBALTABLESPACE_OID)
{
@@ -149,10 +155,10 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
Assert(dbNode == 0);
Assert(backendId == InvalidBackendId);
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("global/%u_%s",
- relNode, forkNames[forkNumber]);
+ path = psprintf("global/%u_%s%s",
+ relNode, forkNames[forkNumber], markstr);
else
- path = psprintf("global/%u", relNode);
+ path = psprintf("global/%u%s", relNode, markstr);
}
else if (spcNode == DEFAULTTABLESPACE_OID)
{
@@ -160,22 +166,22 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/%u_%s",
+ path = psprintf("base/%u/%u_%s%s",
dbNode, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/%u",
- dbNode, relNode);
+ path = psprintf("base/%u/%u%s",
+ dbNode, relNode, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/t%d_%u_%s",
+ path = psprintf("base/%u/t%d_%u_%s%s",
dbNode, backendId, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/t%d_%u",
- dbNode, backendId, relNode);
+ path = psprintf("base/%u/t%d_%u%s",
+ dbNode, backendId, relNode, markstr);
}
}
else
@@ -184,27 +190,28 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u_%s%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
dbNode, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
- dbNode, relNode);
+ dbNode, relNode, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
dbNode, backendId, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
- dbNode, backendId, relNode);
+ dbNode, backendId, relNode, markstr);
}
}
+
return path;
}
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 9ffc741913..d362d62ed2 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -23,6 +23,8 @@
extern int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
+extern void RelationCreateInitFork(Relation rel);
+extern void RelationDropInitFork(Relation rel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationPreTruncate(Relation rel);
@@ -41,6 +43,7 @@ extern void RestorePendingSyncs(char *startAddress);
extern void smgrDoPendingDeletes(bool isCommit);
extern void smgrDoPendingSyncs(bool isCommit, bool isParallelWorker);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingCleanups(bool isCommit);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index 622de22b03..8139308634 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -18,17 +18,23 @@
#include "lib/stringinfo.h"
#include "storage/block.h"
#include "storage/relfilenode.h"
+#include "storage/smgr.h"
/*
* Declarations for smgr-related XLOG records
*
- * Note: we log file creation and truncation here, but logging of deletion
- * actions is handled by xact.c, because it is part of transaction commit.
+ * Note: we log file creation, truncation and buffer persistence change here,
+ * but logging of deletion actions is handled mainly by xact.c, because it is
+ * part of transaction commit in most cases. However, there's a case where
+ * init forks are deleted outside control of transaction.
*/
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_UNLINK 0x30
+#define XLOG_SMGR_MARK 0x40
+#define XLOG_SMGR_BUFPERSISTENCE 0x50
typedef struct xl_smgr_create
{
@@ -36,6 +42,32 @@ typedef struct xl_smgr_create
ForkNumber forkNum;
} xl_smgr_create;
+typedef struct xl_smgr_unlink
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+} xl_smgr_unlink;
+
+typedef enum smgr_mark_action
+{
+ XLOG_SMGR_MARK_CREATE = 'c',
+ XLOG_SMGR_MARK_UNLINK = 'u'
+} smgr_mark_action;
+
+typedef struct xl_smgr_mark
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ smgr_mark_action action;
+} xl_smgr_mark;
+
+typedef struct xl_smgr_bufpersistence
+{
+ RelFileNode rnode;
+ bool persistence;
+} xl_smgr_bufpersistence;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +83,12 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark);
+extern void log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark);
+extern void log_smgrbufpersistence(const RelFileNode *rnode, bool persistence);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index a4b5dc853b..a864c91614 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -67,7 +67,7 @@ extern int forkname_chars(const char *str, ForkNumber *fork);
extern char *GetDatabasePath(Oid dbNode, Oid spcNode);
extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
- int backendId, ForkNumber forkNumber);
+ int backendId, ForkNumber forkNumber, char mark);
/*
* Wrapper macros for GetRelationPath. Beware of multiple
@@ -77,7 +77,7 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
/* First argument is a RelFileNode */
#define relpathbackend(rnode, backend, forknum) \
GetRelationPath((rnode).dbNode, (rnode).spcNode, (rnode).relNode, \
- backend, forknum)
+ backend, forknum, 0)
/* First argument is a RelFileNode */
#define relpathperm(rnode, forknum) \
@@ -87,4 +87,9 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
#define relpath(rnode, forknum) \
relpathbackend((rnode).node, (rnode).backend, forknum)
+/* First argument is a RelFileNodeBackend */
+#define markpath(rnode, forknum, mark) \
+ GetRelationPath((rnode).node.dbNode, (rnode).node.spcNode, \
+ (rnode).node.relNode, \
+ (rnode).backend, forknum, mark)
#endif /* RELPATH_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index dd01841c30..739b386216 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -206,6 +206,8 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels)
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+ bool permanent, bool isRedo);
extern void DropRelFileNodesAllBuffers(struct SMgrRelationData **smgr_reln, int nnodes);
extern void DropDatabaseBuffers(Oid dbid);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 29209e2724..8bf746bf45 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -185,6 +185,7 @@ extern ssize_t pg_pwritev_with_retry(int fd,
extern int pg_truncate(const char *path, off_t length);
extern void fsync_fname(const char *fname, bool isdir);
extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
+extern int fsync_parent_path(const char *fname, int elevel);
extern int durable_rename(const char *oldfile, const char *newfile, int loglevel);
extern int durable_unlink(const char *fname, int loglevel);
extern int durable_rename_excl(const char *oldfile, const char *newfile, int loglevel);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index ffffa40db7..046afdb5fb 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -23,6 +23,10 @@
extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
extern void mdclose(SMgrRelation reln, ForkNumber forknum);
+extern void mdcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void mdunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
@@ -41,12 +45,14 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void ForgetRelationForkSyncRequests(RelFileNodeBackend rnode,
+ ForkNumber forknum);
extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
/* md sync callbacks */
extern int mdsyncfiletag(const FileTag *ftag, char *path);
-extern int mdunlinkfiletag(const FileTag *ftag, char *path);
+extern int mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark);
extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *candidate);
#endif /* MD_H */
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index bf2c10d443..e399aec0c7 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -16,13 +16,15 @@
#define REINIT_H
#include "common/relpath.h"
-
+#include "storage/smgr.h"
extern void ResetUnloggedRelations(int op);
-extern bool parse_filename_for_nontemp_relation(const char *name,
- int *oidchars, ForkNumber *fork);
+extern bool parse_filename_for_nontemp_relation(const char *name, int *oidchars,
+ ForkNumber *fork,
+ StorageMarks *mark);
#define UNLOGGED_RELATION_CLEANUP 0x0001
-#define UNLOGGED_RELATION_INIT 0x0002
+#define UNLOGGED_RELATION_DROP_BUFFER 0x0002
+#define UNLOGGED_RELATION_INIT 0x0004
#endif /* REINIT_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 052e0b8426..48e69ab69b 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,18 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
+/*
+ * A storage mark is a file whose existence indicates something about
+ * another file. Mark files are named "<filename>.<mark>", where the mark is
+ * one of the values of StorageMarks. Since ".<digit>" denotes a segment
+ * file, don't use digits for the mark character.
+ */
+typedef enum StorageMarks
+{
+ SMGR_MARK_NONE = 0,
+ SMGR_MARK_UNCOMMITTED = 'u' /* the file is not committed yet */
+} StorageMarks;
+
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
@@ -85,7 +97,12 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
extern void smgrclose(SMgrRelation reln);
extern void smgrcloseall(void);
extern void smgrclosenode(RelFileNodeBackend rnode);
+extern void smgrcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void smgrunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/test/recovery/t/027_persistence_change.pl b/src/test/recovery/t/027_persistence_change.pl
new file mode 100644
index 0000000000..261c4cf943
--- /dev/null
+++ b/src/test/recovery/t/027_persistence_change.pl
@@ -0,0 +1,263 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test relation persistence change
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use Test::More tests => 30;
+use IPC::Run qw(pump finish timer);
+use Config;
+
+my $data_unit = 2000;
+
+# Initialize primary node.
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+# we don't want checkpointing
+$node->append_conf('postgresql.conf', qq(
+checkpoint_timeout = '24h'
+));
+$node->start;
+create($node);
+
+my $relfilenodes1 = relfilenodes();
+
+# correctly recover empty tables
+$node->stop('immediate');
+$node->start;
+insert($node, 0, $data_unit, 0);
+
+# data persists after a crash
+$node->stop('immediate');
+$node->start;
+checkdataloss($data_unit, 'crash logged 1');
+
+set_unlogged($node);
+# SET UNLOGGED shouldn't change relfilenode
+my $relfilenodes2 = relfilenodes();
+checkrelfilenodes($relfilenodes1, $relfilenodes2, 'logged->unlogged');
+
+# data cleanly vanishes after a crash
+$node->stop('immediate');
+$node->start;
+checkdataloss(0, 'crash unlogged');
+
+insert($node, 0, $data_unit, 0);
+set_logged($node);
+
+$node->stop('immediate');
+$node->start;
+# SET LOGGED shouldn't change relfilenode and data should survive the crash
+my $relfilenodes3 = relfilenodes();
+checkrelfilenodes($relfilenodes2, $relfilenodes3, 'unlogged->logged');
+checkdataloss($data_unit, 'crash logged 2');
+
+# unlogged insert -> graceful stop
+set_unlogged($node);
+insert($node, $data_unit, $data_unit, 0);
+$node->stop;
+$node->start;
+checkdataloss($data_unit * 2, 'unlogged graceful restart');
+
+# crash during transaction
+set_logged($node);
+$node->stop('immediate');
+$node->start;
+insert($node, $data_unit * 2, $data_unit, 0);
+
+my $h;
+
+# insert(,,,1) requires IO::Pty. Skip the test if the module is not
+# available, but still do the insert so that the later tests see the
+# expected state.
+eval { require IO::Pty; };
+if ($@)
+{
+ insert($node, $data_unit * 3, $data_unit, 0);
+ ok (1, 'SKIPPED: IO::Pty is needed');
+ ok (1, 'SKIPPED: IO::Pty is needed');
+}
+else
+{
+ $h = insert($node, $data_unit * 3, $data_unit, 1); ## this is aborted
+}
+
+$node->stop('immediate');
+
+# calling finish on $h would stall here, so just discard the handle.
+$h = undef;
+
+# check if indexes are working
+$node->start;
+# drop first half of data to reduce run time
+$node->safe_psql('postgres', 'DELETE FROM t WHERE bt < ' . $data_unit * 2);
+check($node, $data_unit * 2, $data_unit * 3 - 1, 'final check');
+
+sub create
+{
+ my ($node) = @_;
+
+ $node->psql('postgres', qq(
+ CREATE TABLE t (bt int, gin int[], gist point, hash int,
+ brin int, spgist point);
+ CREATE INDEX i_bt ON t USING btree (bt);
+ CREATE INDEX i_gin ON t USING gin (gin);
+ CREATE INDEX i_gist ON t USING gist (gist);
+ CREATE INDEX i_hash ON t USING hash (hash);
+ CREATE INDEX i_brin ON t USING brin (brin);
+ CREATE INDEX i_spgist ON t USING spgist (spgist);));
+}
+
+
+sub insert
+{
+ my ($node, $st, $num, $interactive) = @_;
+ my $ed = $st + $num - 1;
+ my $query = qq(BEGIN;
+INSERT INTO t
+ (SELECT i, ARRAY[i, i * 2], point(i, i * 2), i, i, point(i, i)
+ FROM generate_series($st, $ed) i);
+);
+
+ if ($interactive)
+ {
+ my $in = '';
+ my $out = '';
+ my $timer = timer(10);
+
+ my $h = $node->interactive_psql('postgres', \$in, \$out, $timer);
+ like($out, qr/psql/, "print startup banner");
+
+ $in .= "$query\n";
+ pump $h until ($out =~ /[\n\r]+INSERT 0 $num[\n\r]+/ ||
+ $timer->is_expired);
+ ok(($out =~ /[\n\r]+INSERT 0 $num[\n\r]+/), "inserted-$st-$num");
+ return $h
+ # the transaction is not terminated
+ }
+ else
+ {
+ $node->psql('postgres', $query . "COMMIT;");
+ return undef;
+ }
+}
+
+sub check
+{
+ my ($node, $st, $ed, $head) = @_;
+ my $num_data = $ed - $st + 1;
+
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO true;
+ SET enable_indexscan TO false;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE bt = i)),
+ $num_data, "$head: heap is not broken");
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE bt = i)),
+ $num_data, "$head: btree is not broken");
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE gin = ARRAY[i, i * 2];)),
+ $num_data, "$head: gin is not broken");
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE gist <@ box(point(i-0.5, i*2-0.5),point(i+0.5, i*2+0.5));)),
+ $num_data, "$head: gist is not broken");
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE hash = i;)),
+ $num_data, "$head: hash is not broken");
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE brin = i;)),
+ $num_data, "$head: brin is not broken");
+ is($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE spgist <@ box(point(i-0.5,i-0.5),point(i+0.5,i+0.5));)),
+ $num_data, "$head: spgist is not broken");
+}
+
+sub set_unlogged
+{
+ my ($node) = @_;
+
+ $node->psql('postgres', qq(
+ ALTER TABLE t SET UNLOGGED;
+));
+}
+
+sub set_logged
+{
+ my ($node) = @_;
+
+ $node->psql('postgres', qq(
+ ALTER TABLE t SET LOGGED;
+));
+}
+
+sub relfilenodes
+{
+ my $result = $node->safe_psql('postgres', qq{
+ SELECT relname, relfilenode FROM pg_class
+ WHERE relname
+ IN ('t', 'i_bt','i_gin','i_gist','i_hash','i_brin','i_spgist');});
+
+ my %relfilenodes;
+
+ foreach my $l (split(/\n/, $result))
+ {
+ die "unexpected format: $l" if ($l !~ /^([^|]+)\|([0-9]+)$/);
+ $relfilenodes{$1} = $2;
+ }
+
+ # the number must correspond to the in list above
+ is (scalar %relfilenodes, 7, "number of relations is correct");
+
+ return \%relfilenodes;
+}
+
+sub checkrelfilenodes
+{
+ my ($rnodes1, $rnodes2, $s) = @_;
+
+ foreach my $n (keys %{$rnodes1})
+ {
+ if ($n eq 'i_gist')
+ {
+ # persistence of GiST index is not changed in-place
+ isnt($rnodes1->{$n}, $rnodes2->{$n},
+ "$s: relfilenode is changed: $n");
+ }
+ else
+ {
+ # otherwise all relations are processed in-place
+ is($rnodes1->{$n}, $rnodes2->{$n},
+ "$s: relfilenode is not changed: $n");
+ }
+ }
+}
+
+sub checkdataloss
+{
+ my ($expected, $s) = @_;
+
+ is($node->safe_psql('postgres', "SELECT count(*) FROM t;"), $expected,
+ "$s: data in table t is in the expected state");
+}
--
2.27.0
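
As a reading aid for the mark-file naming used in the patch above (the markstr suffix in GetRelationPath() and the "mark file?" branch in parse_filename_for_nontemp_relation()), here is a minimal standalone sketch of the round trip. It is not the patch's code: build_mark_name(), parse_mark_name(), mark_name_selfcheck() and the MARK_* constants are hypothetical names made up for illustration, the parser accepts any lowercase fork name instead of checking against the real fork name table, and directory prefixes and temp-relation names are left out.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the patch's StorageMarks values. */
enum { MARK_NONE = 0, MARK_UNCOMMITTED = 'u' };

/*
 * Build "<relnode>[_<fork>][.<mark>]" into buf, in the same spirit as the
 * markstr suffix the patch appends.  fork is NULL for the main fork.
 */
int
build_mark_name(char *buf, size_t len, unsigned relnode,
                const char *fork, char mark)
{
    char markstr[4] = "";

    if (mark != MARK_NONE)
        snprintf(markstr, sizeof(markstr), ".%c", mark);
    if (fork != NULL)
        return snprintf(buf, len, "%u_%s%s", relnode, fork, markstr);
    return snprintf(buf, len, "%u%s", relnode, markstr);
}

/*
 * Parse "<digits>[_<fork>][.<segno>][.<mark>]".  Returns 1 and sets *mark
 * on success, 0 if the name does not have that shape.
 */
int
parse_mark_name(const char *name, char *mark)
{
    int pos = 0;

    while (isdigit((unsigned char) name[pos]))
        pos++;
    if (pos == 0)
        return 0;

    if (name[pos] == '_')
    {
        int start = ++pos;

        while (islower((unsigned char) name[pos]))
            pos++;
        if (pos == start)
            return 0;
    }

    /* Optional segment number: '.' followed by at least one digit. */
    if (name[pos] == '.' && isdigit((unsigned char) name[pos + 1]))
    {
        pos++;
        while (isdigit((unsigned char) name[pos]))
            pos++;
    }

    /* Optional mark: '.' followed by exactly one more character. */
    if (name[pos] == '.' && name[pos + 1] != '\0')
    {
        *mark = name[pos + 1];
        pos += 2;
    }
    else
        *mark = MARK_NONE;

    /* Now we should be at the end. */
    return name[pos] == '\0';
}

/* Small self-check showing the intended round trip. */
void
mark_name_selfcheck(void)
{
    char buf[64];
    char mark;

    build_mark_name(buf, sizeof(buf), 16384, "init", MARK_UNCOMMITTED);
    assert(strcmp(buf, "16384_init.u") == 0);
    assert(parse_mark_name(buf, &mark) && mark == MARK_UNCOMMITTED);
    assert(parse_mark_name("16384.1", &mark) && mark == MARK_NONE);
}
```

Calling mark_name_selfcheck() from a main() exercises the round trip. Because a ".<digit>" suffix is consumed as a segment number before the mark check runs, a digit can never be read back as a mark in this sketch, which is the same reason the StorageMarks comment asks for non-digit mark characters.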
v17-0002-New-command-ALTER-TABLE-ALL-IN-TABLESPACE-SET-LO.patch (text/x-patch; charset=us-ascii)
From f621f134e7c48b52a65e3b60ad42c0259e226a40 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 23:21:09 +0900
Subject: [PATCH v17 2/2] New command ALTER TABLE ALL IN TABLESPACE SET
LOGGED/UNLOGGED
To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes
relation persistence of all tables in the specified tablespace.
---
doc/src/sgml/ref/alter_table.sgml | 15 +++
src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++
src/backend/nodes/copyfuncs.c | 16 +++
src/backend/nodes/equalfuncs.c | 15 +++
src/backend/parser/gram.y | 42 +++++++
src/backend/tcop/utility.c | 11 ++
src/include/commands/tablecmds.h | 2 +
src/include/nodes/nodes.h | 1 +
src/include/nodes/parsenodes.h | 10 ++
src/test/regress/expected/tablespace.out | 76 ++++++++++++
src/test/regress/sql/tablespace.sql | 41 +++++++
11 files changed, 369 insertions(+)
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index a76e2e7322..6f108980af 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -33,6 +33,8 @@ ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
SET SCHEMA <replaceable class="parameter">new_schema</replaceable>
ALTER TABLE ALL IN TABLESPACE <replaceable class="parameter">name</replaceable> [ OWNED BY <replaceable class="parameter">role_name</replaceable> [, ... ] ]
SET TABLESPACE <replaceable class="parameter">new_tablespace</replaceable> [ NOWAIT ]
+ALTER TABLE ALL IN TABLESPACE <replaceable class="parameter">name</replaceable> [ OWNED BY <replaceable class="parameter">role_name</replaceable> [, ... ] ]
+ SET { LOGGED | UNLOGGED } [ NOWAIT ]
ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
ATTACH PARTITION <replaceable class="parameter">partition_name</replaceable> { FOR VALUES <replaceable class="parameter">partition_bound_spec</replaceable> | DEFAULT }
ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
@@ -753,6 +755,19 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
(see <xref linkend="sql-createtable-unlogged"/>). It cannot be applied
to a temporary table.
</para>
+
+ <para>
+ All tables in the current database in a tablespace can be changed by using
+ the <literal>ALL IN TABLESPACE</literal> form, which will lock all tables
+ to be changed first and then change each one. This form also supports
+ <literal>OWNED BY</literal>, which will only change tables owned by the
+ roles specified. If the <literal>NOWAIT</literal> option is specified
+ then the command will fail if it is unable to acquire all of the locks
+ required immediately. The <literal>information_schema</literal>
+ relations are not considered part of the system catalogs and will be
+ changed. See also
+ <link linkend="sql-createtablespace"><command>CREATE TABLESPACE</command></link>.
+ </para>
</listitem>
</varlistentry>
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 9e673ba68f..25bbdb5664 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14769,6 +14769,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
return new_tablespaceoid;
}
+/*
+ * Alter Table ALL ... SET LOGGED/UNLOGGED
+ *
+ * Allows a user to change persistence of all objects in a given tablespace in
+ * the current database. Objects can be chosen based on the owner of the
+ * object also, to allow users to change persistence of only their objects. The
+ * main permissions handling is done by the lower-level change persistence
+ * function.
+ *
+ * All to-be-modified objects are locked first. If NOWAIT is specified and the
+ * lock can't be acquired then we ereport(ERROR).
+ */
+void
+AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt)
+{
+ List *relations = NIL;
+ ListCell *l;
+ ScanKeyData key[1];
+ Relation rel;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ Oid tablespaceoid;
+ List *role_oids = roleSpecsToIds(stmt->roles);
+
+ /* Ensure we were not asked to change something we can't */
+ if (stmt->objtype != OBJECT_TABLE)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("only tables can be specified")));
+
+ /* Get the tablespace OID */
+ tablespaceoid = get_tablespace_oid(stmt->tablespacename, false);
+
+ /*
+ * Now that the checks are done, check if we should set it to
+ * InvalidOid because it is our database's default tablespace.
+ */
+ if (tablespaceoid == MyDatabaseTableSpace)
+ tablespaceoid = InvalidOid;
+
+ /*
+ * Walk the list of objects in the tablespace to pick them up. This will
+ * only find objects in our database, of course.
+ */
+ ScanKeyInit(&key[0],
+ Anum_pg_class_reltablespace,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(tablespaceoid));
+
+ rel = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(rel, 1, key);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relOid = relForm->oid;
+
+ /*
+ * Do not pick-up objects in pg_catalog as part of this, if an admin
+ * really wishes to do so, they can issue the individual ALTER
+ * commands directly.
+ *
+ * Also, explicitly avoid any shared tables, temp tables, or TOAST
+ * (TOAST will be changed with the main table).
+ */
+ if (IsCatalogNamespace(relForm->relnamespace) ||
+ relForm->relisshared ||
+ isAnyTempNamespace(relForm->relnamespace) ||
+ IsToastNamespace(relForm->relnamespace))
+ continue;
+
+ /* Only pick up the object type requested */
+ if (relForm->relkind != RELKIND_RELATION)
+ continue;
+
+ /* Check if we are only picking-up objects owned by certain roles */
+ if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner))
+ continue;
+
+ /*
+ * Handle permissions-checking here since we are locking the tables
+ * and also to avoid doing a bunch of work only to fail part-way. Note
+ * that permissions will also be checked by AlterTableInternal().
+ *
+ * Caller must be considered an owner on the table of which we're going
+ * to change persistence.
+ */
+ if (!pg_class_ownercheck(relOid, GetUserId()))
+ aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)),
+ NameStr(relForm->relname));
+
+ if (stmt->nowait &&
+ !ConditionalLockRelationOid(relOid, AccessExclusiveLock))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("aborting because lock on relation \"%s.%s\" is not available",
+ get_namespace_name(relForm->relnamespace),
+ NameStr(relForm->relname))));
+ else
+ LockRelationOid(relOid, AccessExclusiveLock);
+
+ /*
+ * Add to our list of objects of which we're going to change
+ * persistence.
+ */
+ relations = lappend_oid(relations, relOid);
+ }
+
+ table_endscan(scan);
+ table_close(rel, AccessShareLock);
+
+ if (relations == NIL)
+ ereport(NOTICE,
+ (errcode(ERRCODE_NO_DATA_FOUND),
+ errmsg("no matching relations in tablespace \"%s\" found",
+ tablespaceoid == InvalidOid ? "(database default)" :
+ get_tablespace_name(tablespaceoid))));
+
+ /*
+ * Everything is locked, loop through and change persistence of all of the
+ * relations.
+ */
+ foreach(l, relations)
+ {
+ List *cmds = NIL;
+ AlterTableCmd *cmd = makeNode(AlterTableCmd);
+
+ if (stmt->logged)
+ cmd->subtype = AT_SetLogged;
+ else
+ cmd->subtype = AT_SetUnLogged;
+
+ cmds = lappend(cmds, cmd);
+
+ EventTriggerAlterTableStart((Node *) stmt);
+ /* OID is set by AlterTableInternal */
+ AlterTableInternal(lfirst_oid(l), cmds, false);
+ EventTriggerAlterTableEnd();
+ }
+}
+
static void
index_copy_data(Relation rel, RelFileNode newrnode)
{
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 90b5da51c9..bbc9eb28e6 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4273,6 +4273,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from)
return newnode;
}
+static AlterTableSetLoggedAllStmt *
+_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *from)
+{
+ AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt);
+
+ COPY_STRING_FIELD(tablespacename);
+ COPY_SCALAR_FIELD(objtype);
+ COPY_SCALAR_FIELD(logged);
+ COPY_SCALAR_FIELD(nowait);
+
+ return newnode;
+}
+
static CreateExtensionStmt *
_copyCreateExtensionStmt(const CreateExtensionStmt *from)
{
@@ -5639,6 +5652,9 @@ copyObjectImpl(const void *from)
case T_AlterTableMoveAllStmt:
retval = _copyAlterTableMoveAllStmt(from);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _copyAlterTableSetLoggedAllStmt(from);
+ break;
case T_CreateExtensionStmt:
retval = _copyCreateExtensionStmt(from);
break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 06345da3ba..603bd2a044 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1916,6 +1916,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a,
return true;
}
+static bool
+_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *a,
+ const AlterTableSetLoggedAllStmt *b)
+{
+ COMPARE_STRING_FIELD(tablespacename);
+ COMPARE_SCALAR_FIELD(objtype);
+ COMPARE_SCALAR_FIELD(logged);
+ COMPARE_SCALAR_FIELD(nowait);
+
+ return true;
+}
+
static bool
_equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b)
{
@@ -3636,6 +3648,9 @@ equal(const void *a, const void *b)
case T_AlterTableMoveAllStmt:
retval = _equalAlterTableMoveAllStmt(a, b);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _equalAlterTableSetLoggedAllStmt(a, b);
+ break;
case T_CreateExtensionStmt:
retval = _equalCreateExtensionStmt(a, b);
break;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index b5966712ce..682684c2ee 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -1984,6 +1984,48 @@ AlterTableStmt:
n->nowait = $13;
$$ = (Node *)n;
}
+ | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = true;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name OWNED BY role_list SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->roles = $9;
+ n->logged = true;
+ n->nowait = $12;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = false;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name OWNED BY role_list SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->roles = $9;
+ n->logged = false;
+ n->nowait = $12;
+ $$ = (Node *)n;
+ }
| ALTER INDEX qualified_name alter_table_cmds
{
AlterTableStmt *n = makeNode(AlterTableStmt);
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 83e4e37c78..750e0ecac9 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -162,6 +162,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_AlterTSConfigurationStmt:
case T_AlterTSDictionaryStmt:
case T_AlterTableMoveAllStmt:
+ case T_AlterTableSetLoggedAllStmt:
case T_AlterTableSpaceOptionsStmt:
case T_AlterTableStmt:
case T_AlterTypeStmt:
@@ -1747,6 +1748,12 @@ ProcessUtilitySlow(ParseState *pstate,
commandCollected = true;
break;
+ case T_AlterTableSetLoggedAllStmt:
+ AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree);
+ /* commands are stashed in AlterTableSetLoggedAll */
+ commandCollected = true;
+ break;
+
case T_DropStmt:
ExecDropStmt((DropStmt *) parsetree, isTopLevel);
/* no commands stashed for DROP */
@@ -2669,6 +2676,10 @@ CreateCommandTag(Node *parsetree)
tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype);
+ break;
+
case T_AlterTableStmt:
tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype);
break;
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index 5d4037f26e..c381dad3e5 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
+extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt);
+
extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
Oid *oldschema);
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index f9ddafd345..a83c66cad6 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -429,6 +429,7 @@ typedef enum NodeTag
T_AlterCollationStmt,
T_CallStmt,
T_AlterStatsStmt,
+ T_AlterTableSetLoggedAllStmt,
/*
* TAGS FOR PARSE TREE NODES (parsenodes.h)
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 3e9bdc781f..f19bd3c569 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2351,6 +2351,16 @@ typedef struct AlterTableMoveAllStmt
bool nowait;
} AlterTableMoveAllStmt;
+typedef struct AlterTableSetLoggedAllStmt
+{
+ NodeTag type;
+ char *tablespacename;
+ ObjectType objtype; /* Object type to change */
+ List *roles; /* List of roles to change objects of */
+ bool logged;
+ bool nowait;
+} AlterTableSetLoggedAllStmt;
+
/* ----------------------
* Create/Alter Extension Statements
* ----------------------
diff --git a/src/test/regress/expected/tablespace.out b/src/test/regress/expected/tablespace.out
index 2dfbcfdebe..c02afdcb68 100644
--- a/src/test/regress/expected/tablespace.out
+++ b/src/test/regress/expected/tablespace.out
@@ -943,5 +943,81 @@ drop cascades to table testschema.asexecute
drop cascades to table testschema.part
drop cascades to table testschema.atable
drop cascades to table testschema.tablespace_acl
+--
+-- Check persistence change in a tablespace
+CREATE SCHEMA testschema;
+GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1;
+CREATE TABLESPACE regress_tablespace LOCATION '';
+GRANT CREATE ON TABLESPACE regress_tablespace TO regress_tablespace_user1;
+CREATE TABLE testschema.lsu(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.usu(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lsu(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._usu(a int) TABLESPACE pg_default;
+SET ROLE regress_tablespace_user1;
+CREATE TABLE testschema.lu1(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.uu1(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lu1(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._uu1(a int) TABLESPACE pg_default;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | p
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | p
+ uu1 | regress_tablespace | u
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace
+ OWNED BY regress_tablespace_user1 SET LOGGED;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | p
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | p
+ uu1 | regress_tablespace | p
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+RESET ROLE;
+ALTER TABLE ALL IN TABLESPACE regress_tablespace SET UNLOGGED;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | u
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | u
+ uu1 | regress_tablespace | u
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+-- Should succeed
+DROP SCHEMA testschema CASCADE;
+NOTICE: drop cascades to 8 other objects
+DETAIL: drop cascades to table testschema.lsu
+drop cascades to table testschema.usu
+drop cascades to table testschema._lsu
+drop cascades to table testschema._usu
+drop cascades to table testschema.lu1
+drop cascades to table testschema.uu1
+drop cascades to table testschema._lu1
+drop cascades to table testschema._uu1
+DROP TABLESPACE regress_tablespace;
DROP ROLE regress_tablespace_user1;
DROP ROLE regress_tablespace_user2;
diff --git a/src/test/regress/sql/tablespace.sql b/src/test/regress/sql/tablespace.sql
index 896f05cea3..4e407eb8c0 100644
--- a/src/test/regress/sql/tablespace.sql
+++ b/src/test/regress/sql/tablespace.sql
@@ -419,5 +419,46 @@ DROP TABLESPACE regress_tblspace_renamed;
DROP SCHEMA testschema CASCADE;
+
+--
+-- Check persistence change in a tablespace
+CREATE SCHEMA testschema;
+GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1;
+CREATE TABLESPACE regress_tablespace LOCATION '';
+GRANT CREATE ON TABLESPACE regress_tablespace TO regress_tablespace_user1;
+
+CREATE TABLE testschema.lsu(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.usu(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lsu(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._usu(a int) TABLESPACE pg_default;
+SET ROLE regress_tablespace_user1;
+CREATE TABLE testschema.lu1(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.uu1(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lu1(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._uu1(a int) TABLESPACE pg_default;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace
+ OWNED BY regress_tablespace_user1 SET LOGGED;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+RESET ROLE;
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace SET UNLOGGED;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+-- Should succeed
+DROP SCHEMA testschema CASCADE;
+DROP TABLESPACE regress_tablespace;
+
DROP ROLE regress_tablespace_user1;
DROP ROLE regress_tablespace_user2;
--
2.27.0
Rebased on a recent xlog refactoring.
No functional changes have been made.
- Removed the default case in smgr_desc, since as far as I can see we
don't assume out-of-definition values in xlog records elsewhere.
- Simplified some of the code added to storage.c.
- Fixed copy-pasted comments in extractPageInfo().
- The previous version of smgrDoPendingCleanups() assumed that init
forks are never loaded into shared buffers, but that is wrong
(SetRelationBuffersPersistence assumes the opposite). Thus we need
to drop buffers before unlinking an init fork. That is already
guaranteed by the existing logic, so I only rewrote the comment for
PCOP_UNLINK_FORK:
* Unlink the fork file. Currently we use this only for
* init forks and we're sure that the init fork is not
* loaded on shared buffers. For RelationDropInitFork
* case, the function dropped that buffers. For
* RelationCreateInitFork case, PCOP_SET_PERSISTENCE(true)
* is set and the buffers have been dropped just before.
This logic has the same critical window as
DropRelFileNodeBuffers. That is, if file deletion fails after the
buffers have been dropped successfully, the file content of the init
fork may theoretically be stale. However, AFAICS an init fork is a
write-once fork, so I don't think that actually matters.
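To make the commit/abort handling easier to follow, here is a minimal standalone model of the pending-cleanup list. It is only a sketch, not the patch code: the PCOP_* bits are named after the ones in the patch, but the Cleanup and CleanupStats structs, the counters that stand in for real file and buffer operations, and all function names are hypothetical, and subtransaction nesting levels are ignored. It registers the two entries that RelationCreateInitFork registers and then ends the transaction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Operation bits, named after the ones in the patch. */
#define PCOP_UNLINK_FORK     (1 << 0)
#define PCOP_UNLINK_MARK     (1 << 1)
#define PCOP_SET_PERSISTENCE (1 << 2)

/* Simplified, hypothetical stand-in for the patch's PendingCleanup entry. */
typedef struct Cleanup
{
    int             op;         /* mask of PCOP_* bits */
    bool            atCommit;   /* true: run at commit; false: run at abort */
    struct Cleanup *next;
} Cleanup;

/* Counters standing in for the real file and buffer operations. */
typedef struct CleanupStats
{
    int forks_unlinked;
    int marks_unlinked;
    int persistence_resets;
} CleanupStats;

/* Push a new entry onto the head of the list. */
static Cleanup *
register_cleanup(Cleanup *head, int op, bool atCommit)
{
    Cleanup *e = malloc(sizeof(Cleanup));

    assert(e != NULL);
    e->op = op;
    e->atCommit = atCommit;
    e->next = head;
    return e;
}

/*
 * End of transaction: run the entries whose atCommit flag matches the
 * outcome, free every entry, and return the (now empty) list head.
 */
static Cleanup *
do_pending_cleanups(Cleanup *head, bool isCommit, CleanupStats *stats)
{
    while (head != NULL)
    {
        Cleanup *next = head->next;

        if (head->atCommit == isCommit)
        {
            /* Buffers are dealt with before the fork file is unlinked. */
            if (head->op & PCOP_SET_PERSISTENCE)
                stats->persistence_resets++;
            if (head->op & PCOP_UNLINK_FORK)
                stats->forks_unlinked++;
            if (head->op & PCOP_UNLINK_MARK)
                stats->marks_unlinked++;
        }
        free(head);
        head = next;
    }
    return NULL;
}

/* Model the two registrations of RelationCreateInitFork, then end the xact. */
CleanupStats
simulate_create_init_fork(bool isCommit)
{
    CleanupStats stats = {0, 0, 0};
    Cleanup     *list = NULL;

    /* at abort: drop the init fork and mark file, revert buffer persistence */
    list = register_cleanup(list,
                            PCOP_UNLINK_FORK | PCOP_UNLINK_MARK |
                            PCOP_SET_PERSISTENCE, false);
    /* at commit: only the mark file goes away */
    list = register_cleanup(list, PCOP_UNLINK_MARK, true);

    list = do_pending_cleanups(list, isCommit, &stats);
    assert(list == NULL);
    return stats;
}
```

In this model, simulate_create_init_fork(true) removes only the mark file, while simulate_create_init_fork(false) removes the init fork and the mark file and reverts buffer persistence, which is the outcome the patch aims for in the two cases.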
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v18-0001-In-place-table-persistence-change.patch (text/x-patch; charset=us-ascii)
From 420a9d9a0dae3bcfb1396c14997624ad67a3e557 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v18 1/2] In-place table persistence change
Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, it currently runs a heap rewrite, which causes a large amount
of file I/O. This patch makes the command run without a heap rewrite.
In addition, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of a massive number of HEAP_INSERTs, which should be
smaller.
This also allows for the cleanup of files left behind when the
transaction that created them crashes.
---
src/backend/access/rmgrdesc/smgrdesc.c | 49 ++
src/backend/access/transam/README | 9 +
src/backend/access/transam/xact.c | 7 +
src/backend/access/transam/xlogrecovery.c | 18 +
src/backend/catalog/storage.c | 548 +++++++++++++++++++++-
src/backend/commands/tablecmds.c | 266 +++++++++--
src/backend/replication/basebackup.c | 3 +-
src/backend/storage/buffer/bufmgr.c | 86 ++++
src/backend/storage/file/fd.c | 4 +-
src/backend/storage/file/reinit.c | 344 ++++++++++----
src/backend/storage/smgr/md.c | 94 +++-
src/backend/storage/smgr/smgr.c | 32 ++
src/backend/storage/sync/sync.c | 20 +-
src/bin/pg_rewind/parsexlog.c | 22 +
src/common/relpath.c | 47 +-
src/include/catalog/storage.h | 3 +
src/include/catalog/storage_xlog.h | 42 +-
src/include/common/relpath.h | 9 +-
src/include/storage/bufmgr.h | 2 +
src/include/storage/fd.h | 1 +
src/include/storage/md.h | 8 +-
src/include/storage/reinit.h | 10 +-
src/include/storage/smgr.h | 17 +
23 files changed, 1459 insertions(+), 182 deletions(-)
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 7547813254..f8908e2c0a 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,46 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+ char *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+ appendStringInfoString(buf, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+ char *path = GetRelationPath(xlrec->rnode.dbNode,
+ xlrec->rnode.spcNode,
+ xlrec->rnode.relNode,
+ InvalidBackendId,
+ xlrec->forkNum, xlrec->mark);
+ char *action;
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ action = "CREATE";
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ action = "DELETE";
+ break;
+ }
+
+ appendStringInfo(buf, "%s %s", action, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+ char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+ appendStringInfoString(buf, path);
+ appendStringInfo(buf, " persistence %d", xlrec->persistence);
+ pfree(path);
+ }
}
const char *
@@ -55,6 +95,15 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_UNLINK:
+ id = "UNLINK";
+ break;
+ case XLOG_SMGR_MARK:
+ id = "MARK";
+ break;
+ case XLOG_SMGR_BUFPERSISTENCE:
+ id = "BUFPERSISTENCE";
+ break;
}
return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c1..2ecd8c8c7c 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -725,6 +725,15 @@ then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.
+The Smgr MARK files
+--------------------------------
+
+An smgr mark file is created along with a new relation file to mark
+that the relfilenode needs to be cleaned up at recovery time. In
+contrast to the four actions above, failure to remove smgr mark files
+will lead to data loss, in which case the server will shut down.
+
+
Skipping WAL for New RelFileNode
--------------------------------
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index adf763a8ea..559666b802 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2198,6 +2198,9 @@ CommitTransaction(void)
*/
smgrDoPendingSyncs(true, is_parallel_worker);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2448,6 +2451,9 @@ PrepareTransaction(void)
*/
smgrDoPendingSyncs(true, false);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2773,6 +2779,7 @@ AbortTransaction(void)
AfterTriggerEndXact(false); /* 'false' means it's abort */
AtAbort_Portals();
smgrDoPendingSyncs(false, is_parallel_worker);
+ smgrDoPendingCleanups(false);
AtEOXact_LargeObject(false);
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index f9f212680b..2923b8ef8c 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "catalog/pg_control.h"
+#include "catalog/storage.h"
#include "commands/tablespace.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -53,6 +54,7 @@
#include "storage/pmsignal.h"
#include "storage/proc.h"
#include "storage/procarray.h"
+#include "storage/reinit.h"
#include "storage/spin.h"
#include "utils/builtins.h"
#include "utils/guc.h"
@@ -1746,6 +1748,14 @@ PerformWalRecovery(void)
}
}
+ /* cleanup garbage files left during crash recovery */
+ if (!InArchiveRecovery)
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
/* Allow resource managers to do any required cleanup. */
for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
{
@@ -3022,6 +3032,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
{
ereport(DEBUG1,
(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+
+ /* cleanup garbage files left during crash recovery */
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
InArchiveRecovery = true;
if (StandbyModeRequested)
StandbyMode = true;
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 9b8075536a..cd1445713a 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,6 +19,7 @@
#include "postgres.h"
+#include "access/amapi.h"
#include "access/parallel.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -66,6 +67,23 @@ typedef struct PendingRelDelete
struct PendingRelDelete *next; /* linked-list link */
} PendingRelDelete;
+#define PCOP_UNLINK_FORK (1 << 0)
+#define PCOP_UNLINK_MARK (1 << 1)
+#define PCOP_SET_PERSISTENCE (1 << 2)
+
+typedef struct PendingCleanup
+{
+ RelFileNode relnode; /* relation that may need to be deleted */
+ int op; /* operation mask */
+ bool bufpersistence; /* buffer persistence to set */
+ int unlink_forknum; /* forknum to unlink */
+ StorageMarks unlink_mark; /* mark to unlink */
+ BackendId backend; /* InvalidBackendId if not a temp rel */
+ bool atCommit; /* T=delete at commit; F=delete at abort */
+ int nestLevel; /* xact nesting level of request */
+ struct PendingCleanup *next; /* linked-list link */
+} PendingCleanup;
+
typedef struct PendingRelSync
{
RelFileNode rnode;
@@ -73,6 +91,7 @@ typedef struct PendingRelSync
} PendingRelSync;
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingCleanup *pendingCleanups = NULL; /* head of linked list */
HTAB *pendingSyncHash = NULL;
@@ -117,7 +136,8 @@ AddPendingSync(const RelFileNode *rnode)
SMgrRelation
RelationCreateStorage(RelFileNode rnode, char relpersistence)
{
- PendingRelDelete *pending;
+ PendingRelDelete *pendingdel;
+ PendingCleanup *pendingclean;
SMgrRelation srel;
BackendId backend;
bool needs_wal;
@@ -143,21 +163,41 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return NULL; /* placate compiler */
}
+ /*
+ * We are going to create a new storage file. If the server crashes before
+ * the current transaction ends, the file needs to be cleaned up. The
+ * SMGR_MARK_UNCOMMITTED mark file prompts that work at the next startup.
+ */
srel = smgropen(rnode, backend);
+ log_smgrcreatemark(&rnode, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
smgrcreate(srel, MAIN_FORKNUM, false);
if (needs_wal)
log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM);
/* Add the relation to the list of stuff to delete at abort */
- pending = (PendingRelDelete *)
+ pendingdel = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
- pending->relnode = rnode;
- pending->backend = backend;
- pending->atCommit = false; /* delete if abort */
- pending->nestLevel = GetCurrentTransactionNestLevel();
- pending->next = pendingDeletes;
- pendingDeletes = pending;
+ pendingdel->relnode = rnode;
+ pendingdel->backend = backend;
+ pendingdel->atCommit = false; /* delete if abort */
+ pendingdel->nestLevel = GetCurrentTransactionNestLevel();
+ pendingdel->next = pendingDeletes;
+ pendingDeletes = pendingdel;
+
+ /* drop mark files at commit */
+ pendingclean = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pendingclean->relnode = rnode;
+ pendingclean->op = PCOP_UNLINK_MARK;
+ pendingclean->unlink_forknum = MAIN_FORKNUM;
+ pendingclean->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pendingclean->backend = backend;
+ pendingclean->atCommit = true;
+ pendingclean->nestLevel = GetCurrentTransactionNestLevel();
+ pendingclean->next = pendingCleanups;
+ pendingCleanups = pendingclean;
if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
{
@@ -168,6 +208,200 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return srel;
}
+/*
+ * RelationCreateInitFork
+ * Create physical storage for the init fork of a relation.
+ *
+ * Create the init fork for the relation.
+ *
+ * This function is transactional. The creation is WAL-logged, and if the
+ * transaction aborts later on, the init fork will be removed.
+ */
+void
+RelationCreateInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ SMgrRelation srel;
+ bool create = true;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), false, false);
+
+ /*
+ * If we have a pending-unlink for the init-fork of this relation, that
+ * means the init-fork exists since before the current transaction
+ * started. This function reverts that change just by removing the entry.
+ * See RelationDropInitFork.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ create = false;
+ }
+ else
+ prev = pending;
+ }
+
+ if (!create)
+ return;
+
+ /* create the init fork, along with the commit-sentinel file */
+ srel = smgropen(rnode, InvalidBackendId);
+ log_smgrcreatemark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+
+ /* We don't have an existing init fork; create it. */
+ smgrcreate(srel, INIT_FORKNUM, false);
+
+ /*
+ * The init fork of an index needs further initialization. ambuildempty
+ * should do the WAL-logging and file sync by itself; otherwise we do
+ * that ourselves.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rnode, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /* drop the init fork, mark file and revert persistence at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_UNLINK_FORK | PCOP_UNLINK_MARK | PCOP_SET_PERSISTENCE;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->bufpersistence = true;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* drop mark file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
+/*
+ * RelationDropInitFork
+ * Delete physical storage for the init fork of a relation.
+ *
+ * Register pending-delete of the init fork. The real deletion is performed by
+ * smgrDoPendingCleanups at commit.
+ *
+ * This function is transactional. If the transaction aborts later on, the
+ * deletion is canceled.
+ */
+void
+RelationDropInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ bool inxact_created = false;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), true, false);
+
+ /*
+ * If we have a pending-unlink for the init-fork of this relation, that
+ * means the init fork is created in the current transaction. We remove
+ * both the init fork and mark file immediately in that case. Otherwise
+ * just register a pending-unlink for the existing init fork. See
+ * RelationCreateInitFork.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ /* unlink list entry */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ inxact_created = true;
+ }
+ else
+ prev = pending;
+ }
+
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rnode, InvalidBackendId);
+
+ /*
+ * INIT forks are never loaded to shared buffer so no point in dropping
+ * buffers for such files.
+ */
+ log_smgrunlinkmark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrunlinkmark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+ log_smgrunlink(&rnode, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ return;
+ }
+
+ /* register drop of this init fork file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_UNLINK_FORK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = false;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -187,6 +421,88 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum)
+{
+ xl_smgr_unlink xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file unlink.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_CREATEMARK record to WAL.
+ */
+void
+log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file creation.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_CREATE;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINKMARK record to WAL.
+ */
+void
+log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the mark file unlink.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_UNLINK;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileNode *rnode, bool persistence)
+{
+ xl_smgr_bufpersistence xlrec;
+
+ /*
+ * Make an XLOG entry reporting the change of buffer persistence.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.persistence = persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -673,6 +989,95 @@ smgrDoPendingDeletes(bool isCommit)
}
}
+/*
+ * smgrDoPendingCleanups() -- Clean up work that emits WAL records
+ *
+ * The operations handled in this function emit WAL records, which must be
+ * part of the current transaction.
+ */
+void
+smgrDoPendingCleanups(bool isCommit)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (pending->nestLevel < nestLevel)
+ {
+ /* outer-level entries should not be processed yet */
+ prev = pending;
+ }
+ else
+ {
+ /* unlink list entry first, so we don't retry on failure */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ /* do cleanup if called for */
+ if (pending->atCommit == isCommit)
+ {
+ SMgrRelation srel;
+
+ srel = smgropen(pending->relnode, pending->backend);
+
+ Assert ((pending->op &
+ ~(PCOP_UNLINK_FORK | PCOP_UNLINK_MARK |
+ PCOP_SET_PERSISTENCE)) == 0);
+
+ if (pending->op & PCOP_SET_PERSISTENCE)
+ {
+ SetRelationBuffersPersistence(srel, pending->bufpersistence,
+ InRecovery);
+ }
+
+ if (pending->op & PCOP_UNLINK_FORK)
+ {
+ /*
+ * Unlink the fork file. Currently we use this only for
+ * init forks and we're sure that the init fork is not
+ * loaded in shared buffers. In the RelationDropInitFork
+ * case, that function has already dropped those buffers. In
+ * the RelationCreateInitFork case, PCOP_SET_PERSISTENCE(true)
+ * is set and the buffers have been dropped just before.
+ */
+ Assert(pending->unlink_forknum == INIT_FORKNUM);
+
+ /* Don't emit WAL during recovery. */
+ if (!InRecovery)
+ log_smgrunlink(&pending->relnode,
+ pending->unlink_forknum);
+ smgrunlink(srel, pending->unlink_forknum, false);
+ }
+
+ if (pending->op & PCOP_UNLINK_MARK)
+ {
+ SMgrRelation srel;
+
+ if (!InRecovery)
+ log_smgrunlinkmark(&pending->relnode,
+ pending->unlink_forknum,
+ pending->unlink_mark);
+ srel = smgropen(pending->relnode, pending->backend);
+ smgrunlinkmark(srel, pending->unlink_forknum,
+ pending->unlink_mark, InRecovery);
+ smgrclose(srel);
+ }
+ }
+
+ /* must explicitly free the list entry */
+ pfree(pending);
+ /* prev does not change */
+ }
+ }
+}
+
/*
* smgrDoPendingSyncs() -- Take care of relation syncs at end of xact.
*/
@@ -933,6 +1338,15 @@ smgr_redo(XLogReaderState *record)
reln = smgropen(xlrec->rnode, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ smgrunlink(reln, xlrec->forkNum, true);
+ smgrclose(reln);
+ }
else if (info == XLOG_SMGR_TRUNCATE)
{
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
@@ -1021,6 +1435,124 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ bool created = false;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ smgrcreatemark(reln, xlrec->forkNum, xlrec->mark, true);
+ created = true;
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ smgrunlinkmark(reln, xlrec->forkNum, xlrec->mark, true);
+ break;
+ default:
+ elog(ERROR, "unknown smgr_mark action \"%c\"", xlrec->action);
+ }
+
+ if (created)
+ {
+ /* revert mark file operation at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = xlrec->rnode;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = xlrec->forkNum;
+ pending->unlink_mark = xlrec->mark;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ else
+ {
+ /*
+ * Delete pending action for this mark file if any. We should have
+ * at most one entry for this action.
+ */
+ PendingCleanup *prev = NULL;
+
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileNodeEquals(xlrec->rnode, pending->relnode) &&
+ pending->unlink_forknum == xlrec->forkNum &&
+ (pending->op & PCOP_UNLINK_MARK) != 0)
+ {
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+ }
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ PendingCleanup *prev = NULL;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+
+ /*
+ * Delete pending action for persistence change if any. We should have
+ * at most one entry for this action.
+ */
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileNodeEquals(xlrec->rnode, pending->relnode) &&
+ (pending->op & PCOP_SET_PERSISTENCE) != 0)
+ {
+ Assert (pending->bufpersistence == xlrec->persistence);
+
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+
+ /*
+ * Revert buffer-persistence changes at abort if the relation's
+ * persistence now differs from what it was before this transaction.
+ */
+ if (!pending)
+ {
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = xlrec->rnode;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = !xlrec->persistence;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 3e83f375b5..9e5b77e94a 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -53,6 +53,7 @@
#include "commands/defrem.h"
#include "commands/event_trigger.h"
#include "commands/policy.h"
+#include "commands/progress.h"
#include "commands/sequence.h"
#include "commands/tablecmds.h"
#include "commands/tablespace.h"
@@ -5347,6 +5348,187 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
return newcmd;
}
+/*
+ * RelationChangePersistence: do in-place persistence change of a relation
+ */
+static void
+RelationChangePersistence(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ int i;
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * If any of the following conditions did not hold, we would need to call
+ * ATRewriteTable. That cannot happen in the AT_REWRITE_ALTER_PERSISTENCE case.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ /*
+ * First we collect all relations that we need to change persistence.
+ */
+
+ /* Collect OIDs of indexes and toast relations */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ /* Add toast relation if any */
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ List *toastidx;
+ Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
+
+ relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
+ toastidx = RelationGetIndexList(toastrel);
+ relids = list_concat(relids, toastidx);
+ pfree(toastidx);
+ table_close(toastrel, NoLock);
+ }
+
+ table_close(rel, NoLock);
+
+ /* Make changes in storage */
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach (lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+
+ /*
+ * XXXX: Some access methods do not tolerate an in-place persistence
+ * change. Specifically, GiST uses page LSNs to figure out whether a
+ * block has changed, where UNLOGGED GiST indexes use fake LSNs that
+ * are incompatible with real LSNs used for LOGGED ones.
+ *
+ * Maybe if gistGetFakeLSN behaved the same way for permanent and
+ * unlogged indexes, we could skip index rebuild in exchange for some
+ * extra WAL records emitted while it is unlogged.
+ *
+ * Check relam against a list of known-safe AMs so that unknown AMs
+ * also take the reindex path.
+ */
+ if (r->rd_rel->relkind == RELKIND_INDEX &&
+ /* GiST is excluded */
+ r->rd_rel->relam != BTREE_AM_OID &&
+ r->rd_rel->relam != HASH_AM_OID &&
+ r->rd_rel->relam != GIN_AM_OID &&
+ r->rd_rel->relam != SPGIST_AM_OID &&
+ r->rd_rel->relam != BRIN_AM_OID)
+ {
+ int reindex_flags;
+ ReindexParams params = {0};
+
+ /* reindex doesn't allow concurrent use of the index */
+ table_close(r, NoLock);
+
+ reindex_flags =
+ REINDEX_REL_SUPPRESS_INDEX_USE |
+ REINDEX_REL_CHECK_CONSTRAINTS;
+
+ /* Set the same persistence with the parent relation. */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+
+ reindex_index(reloid, reindex_flags, persistence, &params);
+
+ continue;
+ }
+
+ /* Create or drop init fork */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ RelationCreateInitFork(r);
+ else
+ RelationDropInitFork(r);
+
+ /*
+ * When this relation gets WAL-logged, immediately sync all files but
+ * the init fork to establish the initial state on storage. Buffers have
+ * already been flushed out by RelationCreate(Drop)InitFork called just
+ * above. The init fork should have been synced as needed.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT)
+ {
+ for (i = 0 ; i < INIT_FORKNUM ; i++)
+ {
+ if (smgrexists(RelationGetSmgr(r), i))
+ smgrimmedsync(RelationGetSmgr(r), i);
+ }
+ }
+
+ /* Update catalog */
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+
+ /*
+ * While wal_level >= replica, switching to LOGGED requires the
+ * relation content to be WAL-logged to recover the table.
+ * We don't emit this while wal_level = minimal.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded())
+ {
+ ForkNumber fork;
+ xl_smgr_truncate xlrec;
+
+ xlrec.blkno = 0;
+ xlrec.rnode = r->rd_node;
+ xlrec.flags = SMGR_TRUNCATE_ALL;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ for (fork = 0; fork < INIT_FORKNUM ; fork++)
+ {
+ if (smgrexists(RelationGetSmgr(r), fork))
+ log_newpage_range(r, fork, 0,
+ smgrnblocks(RelationGetSmgr(r), fork),
+ false);
+ }
+ }
+
+ table_close(r, NoLock);
+ }
+
+ table_close(classRel, NoLock);
+}
+
/*
* ATRewriteTables: ALTER TABLE phase 3
*/
@@ -5477,47 +5659,55 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- /*
- * Create transient table that will receive the modified data.
- *
- * Ensure it is marked correctly as logged or unlogged. We have
- * to do this here so that buffers for the new relfilenode will
- * have the right persistence set, and at the same time ensure
- * that the original filenode's buffers will get read in with the
- * correct setting (i.e. the original one). Otherwise a rollback
- * after the rewrite would possibly result with buffers for the
- * original filenode having the wrong persistence setting.
- *
- * NB: This relies on swap_relation_files() also swapping the
- * persistence. That wouldn't work for pg_class, but that can't be
- * unlogged anyway.
- */
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
- persistence, lockmode);
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
+ RelationChangePersistence(tab, persistence, lockmode);
+ else
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new relfilenode
+ * will have the right persistence set, and at the same time
+ * ensure that the original filenode's buffers will get read in
+ * with the correct setting (i.e. the original one). Otherwise
+ * a rollback after the rewrite would possibly result with
+ * buffers for the original filenode having the wrong
+ * persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that can't
+ * be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace,
+ NewAccessMethod,
+ persistence, lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
+ /*
+ * Swap the physical files of the old and new heaps, then
+ * rebuild indexes and discard the old heap. We can use
+ * RecentXmin for the table's new relfrozenxid because we
+ * rewrote all the tuples in ATRewriteTable, so no older Xid
+ * remains in the table. Also, we never try to swap toast
+ * tables by content, since we have no interest in letting this
+ * code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
- InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ }
}
else
{
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 0bf28b55d7..17185f4e55 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1209,6 +1209,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
bool excludeFound;
ForkNumber relForkNum; /* Type of fork if file is a relation */
int relOidChars; /* Chars in filename that are the rel oid */
+ StorageMarks mark;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1259,7 +1260,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
/* Exclude all forks for unlogged tables except the init fork */
if (isDbDir &&
parse_filename_for_nontemp_relation(de->d_name, &relOidChars,
- &relForkNum))
+ &relForkNum, &mark))
{
/* Never exclude init forks */
if (relForkNum != INIT_FORKNUM)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5459c68f8..6cd010429a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -38,6 +38,7 @@
#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
#include "executor/instrument.h"
#include "lib/binaryheap.h"
#include "miscadmin.h"
@@ -3155,6 +3156,91 @@ DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
}
}
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a relation
+ * and, when switching to PERMANENT, writes all dirty pages of the relation
+ * out to disk (or more accurately, out to kernel disk buffers), ensuring
+ * that the kernel has an up-to-date view of the relation.
+ *
+ * Generally, the caller should be holding AccessExclusiveLock on the
+ * target relation to ensure that no other backend is busy dirtying
+ * more blocks of the relation; the effects can't be expected to last
+ * after the lock is released.
+ *
+ * XXX currently it sequentially searches the buffer pool, should be
+ * changed to more clever ways of searching. This routine is not
+ * used in any performance-critical code paths, so it's not worth
+ * adding additional overhead to normal paths to make it go faster;
+ * but see also DropRelFileNodeBuffers.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
+{
+ int i;
+ RelFileNodeBackend rnode = srel->smgr_rnode;
+
+ Assert (!RelFileNodeBackendIsTemp(rnode));
+
+ if (!isRedo)
+ log_smgrbufpersistence(&srel->smgr_rnode.node, permanent);
+
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ {
+ UnlockBufHdr(bufHdr, buf_state);
+ continue;
+ }
+
+ if (permanent)
+ {
+ /* Init fork is being dropped, drop buffers for it. */
+ if (bufHdr->tag.forkNum == INIT_FORKNUM)
+ {
+ InvalidateBuffer(bufHdr);
+ continue;
+ }
+
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /* we flush this buffer when switching to PERMANENT */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, srel);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr, true);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ /* There shouldn't be an init fork */
+ Assert(bufHdr->tag.forkNum != INIT_FORKNUM);
+ buf_state &= ~BM_PERMANENT;
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ }
+}
+
/* ---------------------------------------------------------------------
* DropRelFileNodesAllBuffers
*
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 14b77f2861..2fc9f17c28 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -349,8 +349,6 @@ static void pre_sync_fname(const char *fname, bool isdir, int elevel);
static void datadir_fsync_fname(const char *fname, bool isdir, int elevel);
static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel);
-static int fsync_parent_path(const char *fname, int elevel);
-
/*
* pg_fsync --- do fsync with or without writethrough
@@ -3759,7 +3757,7 @@ fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel)
* This is aimed at making file operations persistent on disk in case of
* an OS crash or power failure.
*/
-static int
+int
fsync_parent_path(const char *fname, int elevel)
{
char parentpath[MAXPGPATH];
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index f053fe0495..f28f55baa6 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -16,29 +16,49 @@
#include <unistd.h>
+#include "access/xlogrecovery.h"
+#include "catalog/pg_tablespace_d.h"
#include "common/relpath.h"
#include "postmaster/startup.h"
+#include "storage/bufmgr.h"
#include "storage/copydir.h"
#include "storage/fd.h"
+#include "storage/md.h"
#include "storage/reinit.h"
+#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
- int op);
+ Oid tspid, int op);
static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
- int op);
+ Oid tspid, Oid dbid, int op);
typedef struct
{
Oid reloid; /* hash key */
-} unlogged_relation_entry;
+ bool has_init; /* has INIT fork */
+ bool dirty_init; /* needs to remove INIT fork */
+ bool dirty_all; /* needs to remove all forks */
+} relfile_entry;
/*
- * Reset unlogged relations from before the last restart.
+ * Clean up and reset relation files from before the last restart.
*
- * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any
- * relation with an "init" fork, except for the "init" fork itself.
+ * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations
+ * depending on the existence of the "cleanup" forks.
+ *
+ * If SMGR_MARK_UNCOMMITTED mark file for init fork is present, we remove the
+ * init fork along with the mark file.
+ *
+ * If SMGR_MARK_UNCOMMITTED mark file for main fork is present, we remove the
+ * whole relation along with the mark file.
+ *
+ * Otherwise, if the "init" fork is found, we remove all forks of any relation
+ * with the "init" fork, except for the "init" fork itself.
+ *
+ * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all
+ * relations that have the "cleanup" and/or the "init" forks.
*
* If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main
* fork.
@@ -72,7 +92,7 @@ ResetUnloggedRelations(int op)
/*
* First process unlogged files in pg_default ($PGDATA/base)
*/
- ResetUnloggedRelationsInTablespaceDir("base", op);
+ ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op);
/*
* Cycle through directories for all non-default tablespaces.
@@ -81,13 +101,19 @@ ResetUnloggedRelations(int op)
while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
{
+ Oid tspid;
+
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
continue;
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
- ResetUnloggedRelationsInTablespaceDir(temp_path, op);
+
+ tspid = atooid(spc_de->d_name);
+
+ Assert(tspid != 0);
+ ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op);
}
FreeDir(spc_dir);
@@ -103,7 +129,8 @@ ResetUnloggedRelations(int op)
* Process one tablespace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
+ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
+ Oid tspid, int op)
{
DIR *ts_dir;
struct dirent *de;
@@ -130,6 +157,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
while ((de = ReadDir(ts_dir, tsdirname)) != NULL)
{
+ Oid dbid;
+
/*
* We're only interested in the per-database directories, which have
* numeric names. Note that this code will also (properly) ignore "."
@@ -148,7 +177,10 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
ereport_startup_progress("resetting unlogged relations (cleanup), elapsed time: %ld.%02d s, current path: %s",
dbspace_path);
- ResetUnloggedRelationsInDbspaceDir(dbspace_path, op);
+ dbid = atooid(de->d_name);
+ Assert(dbid != 0);
+
+ ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op);
}
FreeDir(ts_dir);
@@ -158,125 +190,227 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
* Process one per-dbspace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
+ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
+ Oid tspid, Oid dbid, int op)
{
DIR *dbspace_dir;
struct dirent *de;
char rm_path[MAXPGPATH * 2];
+ HTAB *hash;
+ HASHCTL ctl;
/* Caller must specify at least one operation. */
- Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0);
+ Assert((op & (UNLOGGED_RELATION_CLEANUP |
+ UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_INIT)) != 0);
/*
* Cleanup is a two-pass operation. First, we go through and identify all
* the files with init forks. Then, we go through again and nuke
* everything with the same OID except the init fork.
*/
+
+ /*
+ * It's possible that someone could create tons of unlogged relations in
+ * the same database & tablespace, so we'd better use a hash table rather
+ * than an array or linked list to keep track of which files need to be
+ * reset. Otherwise, this cleanup operation would be O(n^2).
+ */
+ memset(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(relfile_entry);
+ hash = hash_create("relfilenode cleanup hash",
+ 32, &ctl, HASH_ELEM | HASH_BLOBS);
+
+ /* Collect INIT fork and mark files in the directory. */
+ dbspace_dir = AllocateDir(dbspacedirname);
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
+ {
+ int oidchars;
+ ForkNumber forkNum;
+ StorageMarks mark;
+
+ /* Skip anything that doesn't look like a relation data file. */
+ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
+ &forkNum, &mark))
+ continue;
+
+ if (forkNum == INIT_FORKNUM || mark == SMGR_MARK_UNCOMMITTED)
+ {
+ Oid key;
+ relfile_entry *ent;
+ bool found;
+
+ /*
+ * Record the relfilenode information. If it has
+ * SMGR_MARK_UNCOMMITTED mark files, the relfilenode is in dirty
+ * state, where clean up is needed.
+ */
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_ENTER, &found);
+
+ if (!found)
+ {
+ ent->has_init = false;
+ ent->dirty_init = false;
+ ent->dirty_all = false;
+ }
+
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_init = true;
+ else if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_all = true;
+ else
+ {
+ Assert(forkNum == INIT_FORKNUM);
+ ent->has_init = true;
+ }
+ }
+ }
+
+ /* Done with the first pass. */
+ FreeDir(dbspace_dir);
+
+ /* nothing to do if we have neither init nor cleanup forks */
+ if (hash_get_num_entries(hash) < 1)
+ {
+ hash_destroy(hash);
+ return;
+ }
+
+ if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0)
+ {
+ /*
+ * When we come here after recovery, an smgr object for this file might
+ * have been created. In that case we need to drop all buffers and then
+ * the smgr object before initializing the unlogged relation. This is
+ * safe as long as no other backends have accessed the relation before
+ * starting archive recovery.
+ */
+ HASH_SEQ_STATUS status;
+ relfile_entry *ent;
+ SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8);
+ int maxrels = 8;
+ int nrels = 0;
+ int i;
+
+ Assert(!HotStandbyActive());
+
+ hash_seq_init(&status, hash);
+ while((ent = (relfile_entry *) hash_seq_search(&status)) != NULL)
+ {
+ RelFileNodeBackend rel;
+
+ /*
+ * The relation is persistent and remains persistent. Don't
+ * drop the buffers for this relation.
+ */
+ if (ent->has_init && ent->dirty_init)
+ continue;
+
+ if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = ent->reloid;
+
+ srels[nrels++] = smgropen(rel.node, InvalidBackendId);
+ }
+
+ DropRelFileNodesAllBuffers(srels, nrels);
+
+ for (i = 0 ; i < nrels ; i++)
+ smgrclose(srels[i]);
+ }
+
+ /*
+ * Now, make a second pass and remove anything that matches.
+ */
if ((op & UNLOGGED_RELATION_CLEANUP) != 0)
{
- HTAB *hash;
- HASHCTL ctl;
-
- /*
- * It's possible that someone could create a ton of unlogged relations
- * in the same database & tablespace, so we'd better use a hash table
- * rather than an array or linked list to keep track of which files
- * need to be reset. Otherwise, this cleanup operation would be
- * O(n^2).
- */
- ctl.keysize = sizeof(Oid);
- ctl.entrysize = sizeof(unlogged_relation_entry);
- ctl.hcxt = CurrentMemoryContext;
- hash = hash_create("unlogged relation OIDs", 32, &ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- /* Scan the directory. */
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ int oidchars;
+ Oid key;
+ relfile_entry *ent;
+ RelFileNodeBackend rel;
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* Also skip it unless this is the init fork. */
- if (forkNum != INIT_FORKNUM)
- continue;
-
- /*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
- */
- ent.reloid = atooid(de->d_name);
- (void) hash_search(hash, &ent, HASH_ENTER, NULL);
- }
-
- /* Done with the first pass. */
- FreeDir(dbspace_dir);
-
- /*
- * If we didn't find any init forks, there's no point in continuing;
- * we can bail out now.
- */
- if (hash_get_num_entries(hash) == 0)
- {
- hash_destroy(hash);
- return;
- }
-
- /*
- * Now, make a second pass and remove anything that matches.
- */
- dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
- {
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
-
- /* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* We never remove the init fork. */
- if (forkNum == INIT_FORKNUM)
+ &forkNum, &mark))
continue;
/*
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
- if (hash_search(hash, &ent, HASH_FIND, NULL))
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);
+
+ if (!ent)
+ continue;
+
+ if (!ent->dirty_all)
{
- snprintf(rm_path, sizeof(rm_path), "%s/%s",
- dbspacedirname, de->d_name);
- if (unlink(rm_path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m",
- rm_path)));
+ /* clean permanent relations don't need cleanup */
+ if (!ent->has_init)
+ continue;
+
+ if (ent->dirty_init)
+ {
+ /*
+ * The crashed transaction did SET UNLOGGED. This relation
+ * is restored to a LOGGED relation.
+ */
+ if (forkNum != INIT_FORKNUM)
+ continue;
+ }
else
- elog(DEBUG2, "unlinked file \"%s\"", rm_path);
+ {
+ /*
+ * we don't remove the INIT fork of a non-dirty
+ * relfilenode
+ */
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE)
+ continue;
+ }
}
+
+ /* so, nuke it! */
+ snprintf(rm_path, sizeof(rm_path), "%s/%s",
+ dbspacedirname, de->d_name);
+ if (unlink(rm_path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ rm_path)));
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = atooid(de->d_name);
+
+ ForgetRelationForkSyncRequests(rel, forkNum);
}
/* Cleanup is complete. */
FreeDir(dbspace_dir);
- hash_destroy(hash);
}
+ hash_destroy(hash);
+ hash = NULL;
+
/*
* Initialization happens after cleanup is complete: we copy each init
- * fork file to the corresponding main fork file. Note that if we are
- * asked to do both cleanup and init, we may never get here: if the
- * cleanup code determines that there are no init forks in this dbspace,
- * it will return before we get to this point.
+ * fork file to the corresponding main fork file.
*/
if ((op & UNLOGGED_RELATION_INIT) != 0)
{
@@ -285,6 +419,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char srcpath[MAXPGPATH * 2];
@@ -292,9 +427,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -328,15 +465,18 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char mainpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -379,7 +519,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
*/
bool
parse_filename_for_nontemp_relation(const char *name, int *oidchars,
- ForkNumber *fork)
+ ForkNumber *fork, StorageMarks *mark)
{
int pos;
@@ -410,11 +550,19 @@ parse_filename_for_nontemp_relation(const char *name, int *oidchars,
for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar)
;
- if (segchar <= 1)
- return false;
- pos += segchar;
+ if (segchar > 1)
+ pos += segchar;
}
+ /* mark file? */
+ if (name[pos] == '.' && name[pos + 1] != 0)
+ {
+ *mark = name[pos + 1];
+ pos += 2;
+ }
+ else
+ *mark = SMGR_MARK_NONE;
+
/* Now we should be at the end. */
if (name[pos] != '\0')
return false;
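
As a reading aid for the reinit.c change above: the file-name grammar that the extended parse_filename_for_nontemp_relation() accepts appears to be "<oid>[_<fork>][.<segno>][.<mark>]". The following is a minimal standalone sketch of that grammar, not the patch's code. The function name parse_relfile_name, the MARK_NONE macro and the fork-name table are made up for illustration, and the 10-digit limit is an assumption standing in for OIDCHARS.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for SMGR_MARK_NONE. */
#define MARK_NONE 0

/*
 * Split "<oid>[_<fork>][.<segno>][.<mark>]" into its parts.
 * Returns false if the name does not have that shape.
 * *fork is set to "main" when there is no fork suffix.
 */
static bool
parse_relfile_name(const char *name, unsigned long *oid,
                   const char **fork, char *mark)
{
    static const char *const forks[] = {"fsm", "vm", "init"};
    size_t      pos = 0;

    *oid = 0;
    *fork = "main";
    *mark = MARK_NONE;

    /* Leading digits are the relfilenode number. */
    while (isdigit((unsigned char) name[pos]))
        *oid = *oid * 10 + (unsigned long) (name[pos++] - '0');
    if (pos == 0 || pos > 10)
        return false;

    /* Optional "_<fork>" suffix. */
    if (name[pos] == '_')
    {
        bool        found = false;

        for (size_t i = 0; i < sizeof(forks) / sizeof(forks[0]); i++)
        {
            size_t      len = strlen(forks[i]);

            if (strncmp(name + pos + 1, forks[i], len) == 0)
            {
                *fork = forks[i];
                pos += 1 + len;
                found = true;
                break;
            }
        }
        if (!found)
            return false;
    }

    /* Optional ".<segno>" suffix: a dot followed by digits. */
    if (name[pos] == '.' && isdigit((unsigned char) name[pos + 1]))
    {
        pos++;
        while (isdigit((unsigned char) name[pos]))
            pos++;
    }

    /* Optional ".<mark>" suffix: a dot followed by one character. */
    if (name[pos] == '.' && name[pos + 1] != '\0')
    {
        *mark = name[pos + 1];
        pos += 2;
    }

    /* Anything left over means this is not a relation file name. */
    return name[pos] == '\0';
}
```

Because the segment-number check runs before the mark check, a digit can never be taken for a mark, which matches the rule in smgr.h that marks must not be digits.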
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 879f647dbc..4d44bdd78b 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -139,7 +139,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
BlockNumber blkno, bool skipFsync, int behavior);
static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
MdfdVec *seg);
-
+static bool mdmarkexists(SMgrRelation reln, ForkNumber forkNum,
+ StorageMarks mark);
/*
* mdinit() -- Initialize private state for magnetic disk storage manager.
@@ -169,6 +170,82 @@ mdexists(SMgrRelation reln, ForkNumber forkNum)
return (mdopenfork(reln, forkNum, EXTENSION_RETURN_NULL) != NULL);
}
+/*
+ * mdcreatemark() -- Create a mark file.
+ *
+ * If isRedo is true, it's okay for the file to exist already.
+ */
+void
+mdcreatemark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+ int fd;
+
+ /* See mdcreate for details. */
+ TablespaceCreateDbspace(reln->smgr_rnode.node.spcNode,
+ reln->smgr_rnode.node.dbNode,
+ isRedo);
+
+ fd = BasicOpenFile(path, O_WRONLY | O_CREAT | O_EXCL);
+ if (fd < 0 && (!isRedo || errno != EEXIST))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create mark file \"%s\": %m", path)));
+
+ pg_fsync(fd);
+ close(fd);
+
+ /*
+ * To guarantee that the creation of the file is persistent, fsync its
+ * parent directory.
+ */
+ fsync_parent_path(path, ERROR);
+
+ pfree(path);
+}
+
+
+/*
+ * mdunlinkmark() -- Delete the mark file
+ *
+ * If isRedo is true, it's okay for the file not to exist.
+ */
+void
+mdunlinkmark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+
+ if (!isRedo || mdmarkexists(reln, forkNum, mark))
+ durable_unlink(path, ERROR);
+
+ pfree(path);
+}
+
+/*
+ * mdmarkexists() -- Check if the file exists.
+ */
+static bool
+mdmarkexists(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+ int fd;
+
+ fd = BasicOpenFile(path, O_RDONLY);
+ if (fd < 0 && errno != ENOENT)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not access mark file \"%s\": %m", path)));
+ pfree(path);
+
+ if (fd < 0)
+ return false;
+
+ close(fd);
+ return true;
+}
+
/*
* mdcreate() -- Create a new relation on magnetic disk.
*
@@ -1031,6 +1108,15 @@ register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum,
RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ );
}
+/*
+ * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork
+ */
+void
+ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, ForkNumber forknum)
+{
+ register_forget_request(rnode, forknum, 0);
+}
+
/*
* ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB
*/
@@ -1384,12 +1470,14 @@ mdsyncfiletag(const FileTag *ftag, char *path)
* Return 0 on success, -1 on failure, with errno set.
*/
int
-mdunlinkfiletag(const FileTag *ftag, char *path)
+mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark)
{
char *p;
/* Compute the path. */
- p = relpathperm(ftag->rnode, MAIN_FORKNUM);
+ p = GetRelationPath(ftag->rnode.dbNode, ftag->rnode.spcNode,
+ ftag->rnode.relNode, InvalidBackendId, MAIN_FORKNUM,
+ mark);
strlcpy(path, p, MAXPGPATH);
pfree(p);
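
The create-if-absent behaviour that mdcreatemark() describes ("if isRedo is true, it's okay for the file to exist already") can be sketched with plain POSIX calls. This is only an illustration under assumptions: create_mark_file is a hypothetical name, it uses open()/fsync() rather than PostgreSQL's BasicOpenFile()/pg_fsync(), it returns an error code instead of raising ereport(ERROR), and it leaves out the parent-directory fsync.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Create an empty marker file at 'path'.
 * Returns 0 on success, -1 on failure with errno set.
 * When is_redo is true, an already existing file counts as success.
 */
static int
create_mark_file(const char *path, bool is_redo)
{
    int         fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);

    if (fd < 0)
    {
        /* During replay the file may survive from an earlier attempt. */
        if (is_redo && errno == EEXIST)
            return 0;
        return -1;
    }

    /* Flush the new file before reporting success. */
    if (fsync(fd) != 0)
    {
        int         save_errno = errno;

        close(fd);
        errno = save_errno;
        return -1;
    }

    return close(fd);
}
```

Note the early return on the EEXIST path: there is no descriptor to sync or close in that case, so the sketch never calls fsync() or close() on a negative fd.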
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index d71a557a35..0710e8b145 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -63,6 +63,10 @@ typedef struct f_smgr
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+ void (*smgr_createmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+ void (*smgr_unlinkmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
} f_smgr;
static const f_smgr smgrsw[] = {
@@ -84,6 +88,8 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
+ .smgr_createmark = mdcreatemark,
+ .smgr_unlinkmark = mdunlinkmark,
}
};
@@ -337,6 +343,26 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo);
}
+/*
+ * smgrcreatemark() -- Create a mark file
+ */
+void
+smgrcreatemark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_createmark(reln, forknum, mark, isRedo);
+}
+
+/*
+ * smgrunlinkmark() -- Delete a mark file
+ */
+void
+smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlinkmark(reln, forknum, mark, isRedo);
+}
+
/*
* smgrdosyncall() -- Immediately sync all forks of all given relations
*
@@ -664,6 +690,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index e161d57761..f5ded7cb34 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -90,7 +90,8 @@ static CycleCtr checkpoint_cycle_ctr = 0;
typedef struct SyncOps
{
int (*sync_syncfiletag) (const FileTag *ftag, char *path);
- int (*sync_unlinkfiletag) (const FileTag *ftag, char *path);
+ int (*sync_unlinkfiletag) (const FileTag *ftag, char *path,
+ StorageMarks mark);
bool (*sync_filetagmatches) (const FileTag *ftag,
const FileTag *candidate);
} SyncOps;
@@ -223,7 +224,8 @@ SyncPostCheckpoint(void)
/* Unlink the file */
if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
- path) < 0)
+ path,
+ SMGR_MARK_NONE) < 0)
{
/*
* There's a race condition, when the database is dropped at the
@@ -237,6 +239,20 @@ SyncPostCheckpoint(void)
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m", path)));
}
+ else if (syncsw[entry->tag.handler].sync_unlinkfiletag(
+ &entry->tag, path,
+ SMGR_MARK_UNCOMMITTED) < 0)
+ {
+ /*
+ * We may also have an SMGR_MARK_UNCOMMITTED file. Remove it if the
+ * fork file has been successfully removed. It's ok if the mark file
+ * does not exist.
+ */
+ if (errno != ENOENT)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
/* Mark the list entry as canceled, just in case */
entry->canceled = true;
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 56df08c64f..f1382d4c4f 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -407,6 +407,28 @@ extractPageInfo(XLogReaderState *record)
* source system.
*/
}
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_UNLINK)
+ {
+ /*
+ * We can safely ignore these. We'll see that the files don't exist in
+ * the target data dir, and copy them in from the source system. No
+ * need to do anything special here.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_MARK)
+ {
+ /*
+ * We can safely ignore these. The file will be removed from the
+ * target if it doesn't exist in the source system. The files are
+ * empty, so we don't need to bother with their content.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ /*
+ * We can safely ignore these. These don't make any on-disk changes.
+ */
+ }
else if (rmid == RM_XACT_ID &&
((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT ||
(rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED ||
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 636c96efd3..1c19e16fea 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -139,9 +139,15 @@ GetDatabasePath(Oid dbNode, Oid spcNode)
*/
char *
GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
- int backendId, ForkNumber forkNumber)
+ int backendId, ForkNumber forkNumber, char mark)
{
char *path;
+ char markstr[4];
+
+ if (mark == 0)
+ markstr[0] = 0;
+ else
+ snprintf(markstr, sizeof(markstr), ".%c", mark);
if (spcNode == GLOBALTABLESPACE_OID)
{
@@ -149,10 +155,10 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
Assert(dbNode == 0);
Assert(backendId == InvalidBackendId);
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("global/%u_%s",
- relNode, forkNames[forkNumber]);
+ path = psprintf("global/%u_%s%s",
+ relNode, forkNames[forkNumber], markstr);
else
- path = psprintf("global/%u", relNode);
+ path = psprintf("global/%u%s", relNode, markstr);
}
else if (spcNode == DEFAULTTABLESPACE_OID)
{
@@ -160,22 +166,22 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/%u_%s",
+ path = psprintf("base/%u/%u_%s%s",
dbNode, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/%u",
- dbNode, relNode);
+ path = psprintf("base/%u/%u%s",
+ dbNode, relNode, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/t%d_%u_%s",
+ path = psprintf("base/%u/t%d_%u_%s%s",
dbNode, backendId, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/t%d_%u",
- dbNode, backendId, relNode);
+ path = psprintf("base/%u/t%d_%u%s",
+ dbNode, backendId, relNode, markstr);
}
}
else
@@ -184,27 +190,28 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u_%s%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
dbNode, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
- dbNode, relNode);
+ dbNode, relNode, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
dbNode, backendId, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
- dbNode, backendId, relNode);
+ dbNode, backendId, relNode, markstr);
}
}
+
return path;
}
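
To make the naming rule in the relpath.c change concrete: every path format gains a trailing "%s" that is either empty or ".<mark>". Below is a small sketch covering only the default-tablespace, non-temporary case. build_mark_path is a hypothetical helper, not part of the patch, and it writes into a caller-supplied buffer instead of using psprintf().

```c
#include <stdio.h>

/*
 * Build "base/<db>/<rel>[_<fork>][.<mark>]" into buf.
 * fork_name is NULL for the main fork; mark is 0 for "no mark".
 * Returns the snprintf() result.
 */
static int
build_mark_path(char *buf, size_t buflen, unsigned db, unsigned rel,
                const char *fork_name, char mark)
{
    char        markstr[3] = "";

    /* A mark adds a ".<char>" suffix; no mark adds nothing. */
    if (mark != 0)
        snprintf(markstr, sizeof(markstr), ".%c", mark);

    if (fork_name != NULL)
        return snprintf(buf, buflen, "base/%u/%u_%s%s",
                        db, rel, fork_name, markstr);

    return snprintf(buf, buflen, "base/%u/%u%s", db, rel, markstr);
}
```

With these inputs, an init fork of relfilenode 16384 in database 5 marked 'u' comes out as base/5/16384_init.u, and the same call with mark 0 gives the ordinary base/5/16384_init.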
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 9ffc741913..d362d62ed2 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -23,6 +23,8 @@
extern int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
+extern void RelationCreateInitFork(Relation rel);
+extern void RelationDropInitFork(Relation rel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationPreTruncate(Relation rel);
@@ -41,6 +43,7 @@ extern void RestorePendingSyncs(char *startAddress);
extern void smgrDoPendingDeletes(bool isCommit);
extern void smgrDoPendingSyncs(bool isCommit, bool isParallelWorker);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingCleanups(bool isCommit);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index 622de22b03..8139308634 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -18,17 +18,23 @@
#include "lib/stringinfo.h"
#include "storage/block.h"
#include "storage/relfilenode.h"
+#include "storage/smgr.h"
/*
* Declarations for smgr-related XLOG records
*
- * Note: we log file creation and truncation here, but logging of deletion
- * actions is handled by xact.c, because it is part of transaction commit.
+ * Note: we log file creation, truncation and buffer persistence change here,
+ * but logging of deletion actions is handled mainly by xact.c, because it is
+ * part of transaction commit in most cases. However, there's a case where
+ * init forks are deleted outside control of transaction.
*/
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_UNLINK 0x30
+#define XLOG_SMGR_MARK 0x40
+#define XLOG_SMGR_BUFPERSISTENCE 0x50
typedef struct xl_smgr_create
{
@@ -36,6 +42,32 @@ typedef struct xl_smgr_create
ForkNumber forkNum;
} xl_smgr_create;
+typedef struct xl_smgr_unlink
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+} xl_smgr_unlink;
+
+typedef enum smgr_mark_action
+{
+ XLOG_SMGR_MARK_CREATE = 'c',
+ XLOG_SMGR_MARK_UNLINK = 'u'
+} smgr_mark_action;
+
+typedef struct xl_smgr_mark
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ smgr_mark_action action;
+} xl_smgr_mark;
+
+typedef struct xl_smgr_bufpersistence
+{
+ RelFileNode rnode;
+ bool persistence;
+} xl_smgr_bufpersistence;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +83,12 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark);
+extern void log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark);
+extern void log_smgrbufpersistence(const RelFileNode *rnode, bool persistence);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index a4b5dc853b..a864c91614 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -67,7 +67,7 @@ extern int forkname_chars(const char *str, ForkNumber *fork);
extern char *GetDatabasePath(Oid dbNode, Oid spcNode);
extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
- int backendId, ForkNumber forkNumber);
+ int backendId, ForkNumber forkNumber, char mark);
/*
* Wrapper macros for GetRelationPath. Beware of multiple
@@ -77,7 +77,7 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
/* First argument is a RelFileNode */
#define relpathbackend(rnode, backend, forknum) \
GetRelationPath((rnode).dbNode, (rnode).spcNode, (rnode).relNode, \
- backend, forknum)
+ backend, forknum, 0)
/* First argument is a RelFileNode */
#define relpathperm(rnode, forknum) \
@@ -87,4 +87,9 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
#define relpath(rnode, forknum) \
relpathbackend((rnode).node, (rnode).backend, forknum)
+/* First argument is a RelFileNodeBackend */
+#define markpath(rnode, forknum, mark) \
+ GetRelationPath((rnode).node.dbNode, (rnode).node.spcNode, \
+ (rnode).node.relNode, \
+ (rnode).backend, forknum, mark)
#endif /* RELPATH_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index dd01841c30..739b386216 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -206,6 +206,8 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels)
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+ bool permanent, bool isRedo);
extern void DropRelFileNodesAllBuffers(struct SMgrRelationData **smgr_reln, int nnodes);
extern void DropDatabaseBuffers(Oid dbid);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 29209e2724..8bf746bf45 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -185,6 +185,7 @@ extern ssize_t pg_pwritev_with_retry(int fd,
extern int pg_truncate(const char *path, off_t length);
extern void fsync_fname(const char *fname, bool isdir);
extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
+extern int fsync_parent_path(const char *fname, int elevel);
extern int durable_rename(const char *oldfile, const char *newfile, int loglevel);
extern int durable_unlink(const char *fname, int loglevel);
extern int durable_rename_excl(const char *oldfile, const char *newfile, int loglevel);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 6e46d8d96a..ef5fdaf4f8 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -24,6 +24,10 @@ extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
extern void mdclose(SMgrRelation reln, ForkNumber forknum);
extern void mdrelease(void);
+extern void mdcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void mdunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
@@ -42,12 +46,14 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void ForgetRelationForkSyncRequests(RelFileNodeBackend rnode,
+ ForkNumber forknum);
extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
/* md sync callbacks */
extern int mdsyncfiletag(const FileTag *ftag, char *path);
-extern int mdunlinkfiletag(const FileTag *ftag, char *path);
+extern int mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark);
extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *candidate);
#endif /* MD_H */
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index bf2c10d443..e399aec0c7 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -16,13 +16,15 @@
#define REINIT_H
#include "common/relpath.h"
-
+#include "storage/smgr.h"
extern void ResetUnloggedRelations(int op);
-extern bool parse_filename_for_nontemp_relation(const char *name,
- int *oidchars, ForkNumber *fork);
+extern bool parse_filename_for_nontemp_relation(const char *name, int *oidchars,
+ ForkNumber *fork,
+ StorageMarks *mark);
#define UNLOGGED_RELATION_CLEANUP 0x0001
-#define UNLOGGED_RELATION_INIT 0x0002
+#define UNLOGGED_RELATION_DROP_BUFFER 0x0002
+#define UNLOGGED_RELATION_INIT 0x0004
#endif /* REINIT_H */
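
For reviewers following the UNLOGGED_RELATION_CLEANUP pass in reinit.c, the per-file decision can be restated as one predicate. This is my reading of the second loop in ResetUnloggedRelationsInDbspaceDir() as posted, written as a sketch. The field names has_init, dirty_init and dirty_all come from the patch; the struct name relfile_state, the function should_remove and its boolean parameters are hypothetical.

```c
#include <stdbool.h>

/* Per-relfilenode flags collected by the first directory pass. */
typedef struct relfile_state
{
    bool        has_init;       /* an unmarked init fork exists */
    bool        dirty_init;     /* init fork carries an "uncommitted" mark */
    bool        dirty_all;      /* main fork carries an "uncommitted" mark */
} relfile_state;

/*
 * Decide whether one file of the relfilenode should be unlinked.
 * is_init_fork: the file belongs to the init fork.
 * is_mark_file: the file is a mark file rather than a data file.
 */
static bool
should_remove(const relfile_state *st, bool is_init_fork, bool is_mark_file)
{
    /* The whole relfilenode was created by a crashed transaction. */
    if (st->dirty_all)
        return true;

    /* Clean permanent relation: nothing to clean up. */
    if (!st->has_init)
        return false;

    /* Crashed SET UNLOGGED: drop only the init fork files. */
    if (st->dirty_init)
        return is_init_fork;

    /* Ordinary unlogged relation: keep only the init fork itself. */
    return !(is_init_fork && !is_mark_file);
}
```

Entries that are not in the hash table at all are skipped before this decision is reached, so the predicate only covers relfilenodes that the first pass recorded.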
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 8e3ef92cda..022654b7b2 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,18 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
+/*
+ * A storage mark is a file whose existence indicates something about
+ * another file. Such files are named "<filename>.<mark>", where the mark is
+ * one of the values of StorageMarks. Since ".<digit>" denotes segment files,
+ * don't use digits for the mark character.
+ */
+typedef enum StorageMarks
+{
+ SMGR_MARK_NONE = 0,
+ SMGR_MARK_UNCOMMITTED = 'u' /* the file is not committed yet */
+} StorageMarks;
+
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
@@ -85,7 +97,12 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
extern void smgrclose(SMgrRelation reln);
extern void smgrcloseall(void);
extern void smgrclosenode(RelFileNodeBackend rnode);
+extern void smgrcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void smgrunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
--
2.27.0
v18-0002-New-command-ALTER-TABLE-ALL-IN-TABLESPACE-SET-LO.patch (text/x-patch; charset=us-ascii)
From d7caa6b33f364ad1a88a8f74306a255e607a6639 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 23:21:09 +0900
Subject: [PATCH v18 2/2] New command ALTER TABLE ALL IN TABLESPACE SET
LOGGED/UNLOGGED
To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes
relation persistence of all tables in the specified tablespace.
---
doc/src/sgml/ref/alter_table.sgml | 15 +++
src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++
src/backend/nodes/copyfuncs.c | 16 +++
src/backend/nodes/equalfuncs.c | 15 +++
src/backend/parser/gram.y | 42 +++++++
src/backend/tcop/utility.c | 11 ++
src/include/commands/tablecmds.h | 2 +
src/include/nodes/nodes.h | 1 +
src/include/nodes/parsenodes.h | 10 ++
src/test/regress/expected/tablespace.out | 76 ++++++++++++
src/test/regress/sql/tablespace.sql | 41 +++++++
11 files changed, 369 insertions(+)
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 5c0735e08a..b03d5511a6 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -33,6 +33,8 @@ ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
SET SCHEMA <replaceable class="parameter">new_schema</replaceable>
ALTER TABLE ALL IN TABLESPACE <replaceable class="parameter">name</replaceable> [ OWNED BY <replaceable class="parameter">role_name</replaceable> [, ... ] ]
SET TABLESPACE <replaceable class="parameter">new_tablespace</replaceable> [ NOWAIT ]
+ALTER TABLE ALL IN TABLESPACE <replaceable class="parameter">name</replaceable> [ OWNED BY <replaceable class="parameter">role_name</replaceable> [, ... ] ]
+ SET { LOGGED | UNLOGGED } [ NOWAIT ]
ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
ATTACH PARTITION <replaceable class="parameter">partition_name</replaceable> { FOR VALUES <replaceable class="parameter">partition_bound_spec</replaceable> | DEFAULT }
ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
@@ -753,6 +755,19 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
(see <xref linkend="sql-createtable-unlogged"/>). It cannot be applied
to a temporary table.
</para>
+
+ <para>
+ All tables in the current database in a tablespace can be changed by using
+ the <literal>ALL IN TABLESPACE</literal> form, which will lock all tables
+ to be changed first and then change each one. This form also supports
+ <literal>OWNED BY</literal>, which will only change tables owned by the
+ roles specified. If the <literal>NOWAIT</literal> option is specified
+ then the command will fail if it is unable to acquire all of the locks
+ required immediately. The <literal>information_schema</literal>
+ relations are not considered part of the system catalogs and will be
+ changed. See also
+ <link linkend="sql-createtablespace"><command>CREATE TABLESPACE</command></link>.
+ </para>
</listitem>
</varlistentry>
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 9e5b77e94a..0724d0e1d2 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14770,6 +14770,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
return new_tablespaceoid;
}
+/*
+ * Alter Table ALL ... SET LOGGED/UNLOGGED
+ *
+ * Allows a user to change persistence of all objects in a given tablespace in
+ * the current database. Objects can be chosen based on the owner of the
+ * object also, to allow users to change persistence of only their objects. The
+ * main permissions handling is done by the lower-level change persistence
+ * function.
+ *
+ * All to-be-modified objects are locked first. If NOWAIT is specified and the
+ * lock can't be acquired then we ereport(ERROR).
+ */
+void
+AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt)
+{
+ List *relations = NIL;
+ ListCell *l;
+ ScanKeyData key[1];
+ Relation rel;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ Oid tablespaceoid;
+ List *role_oids = roleSpecsToIds(stmt->roles);
+
+ /* Ensure we were not asked to change something we can't */
+ if (stmt->objtype != OBJECT_TABLE)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("only tables can be specified")));
+
+ /* Get the tablespace OID */
+ tablespaceoid = get_tablespace_oid(stmt->tablespacename, false);
+
+ /*
+ * Now that the checks are done, check if we should set it to
+ * InvalidOid because it is our database's default tablespace.
+ */
+ if (tablespaceoid == MyDatabaseTableSpace)
+ tablespaceoid = InvalidOid;
+
+ /*
+ * Walk the list of objects in the tablespace to pick them up. This will
+ * only find objects in our database, of course.
+ */
+ ScanKeyInit(&key[0],
+ Anum_pg_class_reltablespace,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(tablespaceoid));
+
+ rel = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(rel, 1, key);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relOid = relForm->oid;
+
+ /*
+ * Do not pick-up objects in pg_catalog as part of this, if an admin
+ * really wishes to do so, they can issue the individual ALTER
+ * commands directly.
+ *
+ * Also, explicitly avoid any shared tables, temp tables, or TOAST
+ * (TOAST will be changed with the main table).
+ */
+ if (IsCatalogNamespace(relForm->relnamespace) ||
+ relForm->relisshared ||
+ isAnyTempNamespace(relForm->relnamespace) ||
+ IsToastNamespace(relForm->relnamespace))
+ continue;
+
+ /* Only pick up the object type requested */
+ if (relForm->relkind != RELKIND_RELATION)
+ continue;
+
+ /* Check if we are only picking-up objects owned by certain roles */
+ if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner))
+ continue;
+
+ /*
+ * Handle permissions-checking here since we are locking the tables
+ * and also to avoid doing a bunch of work only to fail part-way. Note
+ * that permissions will also be checked by AlterTableInternal().
+ *
+ * Caller must be considered an owner on the table of which we're going
+ * to change persistence.
+ */
+ if (!pg_class_ownercheck(relOid, GetUserId()))
+ aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)),
+ NameStr(relForm->relname));
+
+ if (stmt->nowait &&
+ !ConditionalLockRelationOid(relOid, AccessExclusiveLock))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("aborting because lock on relation \"%s.%s\" is not available",
+ get_namespace_name(relForm->relnamespace),
+ NameStr(relForm->relname))));
+ else
+ LockRelationOid(relOid, AccessExclusiveLock);
+
+ /*
+ * Add to our list of objects of which we're going to change
+ * persistence.
+ */
+ relations = lappend_oid(relations, relOid);
+ }
+
+ table_endscan(scan);
+ table_close(rel, AccessShareLock);
+
+ if (relations == NIL)
+ ereport(NOTICE,
+ (errcode(ERRCODE_NO_DATA_FOUND),
+ errmsg("no matching relations in tablespace \"%s\" found",
+ tablespaceoid == InvalidOid ? "(database default)" :
+ get_tablespace_name(tablespaceoid))));
+
+ /*
+ * Everything is locked, loop through and change persistence of all of the
+ * relations.
+ */
+ foreach(l, relations)
+ {
+ List *cmds = NIL;
+ AlterTableCmd *cmd = makeNode(AlterTableCmd);
+
+ if (stmt->logged)
+ cmd->subtype = AT_SetLogged;
+ else
+ cmd->subtype = AT_SetUnLogged;
+
+ cmds = lappend(cmds, cmd);
+
+ EventTriggerAlterTableStart((Node *) stmt);
+ /* OID is set by AlterTableInternal */
+ AlterTableInternal(lfirst_oid(l), cmds, false);
+ EventTriggerAlterTableEnd();
+ }
+}
+
static void
index_copy_data(Relation rel, RelFileNode newrnode)
{
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index d4f8455a2b..ba605405a9 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4285,6 +4285,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from)
return newnode;
}
+static AlterTableSetLoggedAllStmt *
+_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *from)
+{
+ AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt);
+
+ COPY_STRING_FIELD(tablespacename);
+ COPY_SCALAR_FIELD(objtype);
+ COPY_SCALAR_FIELD(logged);
+ COPY_SCALAR_FIELD(nowait);
+
+ return newnode;
+}
+
static CreateExtensionStmt *
_copyCreateExtensionStmt(const CreateExtensionStmt *from)
{
@@ -5655,6 +5668,9 @@ copyObjectImpl(const void *from)
case T_AlterTableMoveAllStmt:
retval = _copyAlterTableMoveAllStmt(from);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _copyAlterTableSetLoggedAllStmt(from);
+ break;
case T_CreateExtensionStmt:
retval = _copyCreateExtensionStmt(from);
break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index f1002afe7a..b76fc872a5 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1925,6 +1925,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a,
return true;
}
+static bool
+_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *a,
+ const AlterTableSetLoggedAllStmt *b)
+{
+ COMPARE_STRING_FIELD(tablespacename);
+ COMPARE_SCALAR_FIELD(objtype);
+ COMPARE_SCALAR_FIELD(logged);
+ COMPARE_SCALAR_FIELD(nowait);
+
+ return true;
+}
+
static bool
_equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b)
{
@@ -3650,6 +3662,9 @@ equal(const void *a, const void *b)
case T_AlterTableMoveAllStmt:
retval = _equalAlterTableMoveAllStmt(a, b);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _equalAlterTableSetLoggedAllStmt(a, b);
+ break;
case T_CreateExtensionStmt:
retval = _equalCreateExtensionStmt(a, b);
break;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index a03b33b53b..f8a41de2dd 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -1985,6 +1985,48 @@ AlterTableStmt:
n->nowait = $13;
$$ = (Node *)n;
}
+ | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = true;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name OWNED BY role_list SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->roles = $9;
+ n->logged = true;
+ n->nowait = $12;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = false;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name OWNED BY role_list SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->roles = $9;
+ n->logged = false;
+ n->nowait = $12;
+ $$ = (Node *)n;
+ }
| ALTER INDEX qualified_name alter_table_cmds
{
AlterTableStmt *n = makeNode(AlterTableStmt);
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 3780c6e812..80d1e360b3 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -163,6 +163,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_AlterTSConfigurationStmt:
case T_AlterTSDictionaryStmt:
case T_AlterTableMoveAllStmt:
+ case T_AlterTableSetLoggedAllStmt:
case T_AlterTableSpaceOptionsStmt:
case T_AlterTableStmt:
case T_AlterTypeStmt:
@@ -1753,6 +1754,12 @@ ProcessUtilitySlow(ParseState *pstate,
commandCollected = true;
break;
+ case T_AlterTableSetLoggedAllStmt:
+ AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree);
+ /* commands are stashed in AlterTableSetLoggedAll */
+ commandCollected = true;
+ break;
+
case T_DropStmt:
ExecDropStmt((DropStmt *) parsetree, isTopLevel);
/* no commands stashed for DROP */
@@ -2675,6 +2682,10 @@ CreateCommandTag(Node *parsetree)
tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype);
+ break;
+
case T_AlterTableStmt:
tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype);
break;
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index 5d4037f26e..c381dad3e5 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
+extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt);
+
extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
Oid *oldschema);
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 5d075f0c34..d8e1f223c8 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -430,6 +430,7 @@ typedef enum NodeTag
T_AlterCollationStmt,
T_CallStmt,
T_AlterStatsStmt,
+ T_AlterTableSetLoggedAllStmt,
/*
* TAGS FOR PARSE TREE NODES (parsenodes.h)
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 1617702d9d..4fa9d9360f 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2352,6 +2352,16 @@ typedef struct AlterTableMoveAllStmt
bool nowait;
} AlterTableMoveAllStmt;
+typedef struct AlterTableSetLoggedAllStmt
+{
+ NodeTag type;
+ char *tablespacename;
+ ObjectType objtype; /* Object type to change */
+ List *roles; /* List of roles to change objects of */
+ bool logged;
+ bool nowait;
+} AlterTableSetLoggedAllStmt;
+
/* ----------------------
* Create/Alter Extension Statements
* ----------------------
diff --git a/src/test/regress/expected/tablespace.out b/src/test/regress/expected/tablespace.out
index 2dfbcfdebe..c02afdcb68 100644
--- a/src/test/regress/expected/tablespace.out
+++ b/src/test/regress/expected/tablespace.out
@@ -943,5 +943,81 @@ drop cascades to table testschema.asexecute
drop cascades to table testschema.part
drop cascades to table testschema.atable
drop cascades to table testschema.tablespace_acl
+--
+-- Check persistence change in a tablespace
+CREATE SCHEMA testschema;
+GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1;
+CREATE TABLESPACE regress_tablespace LOCATION '';
+GRANT CREATE ON TABLESPACE regress_tablespace TO regress_tablespace_user1;
+CREATE TABLE testschema.lsu(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.usu(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lsu(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._usu(a int) TABLESPACE pg_default;
+SET ROLE regress_tablespace_user1;
+CREATE TABLE testschema.lu1(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.uu1(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lu1(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._uu1(a int) TABLESPACE pg_default;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | p
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | p
+ uu1 | regress_tablespace | u
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace
+ OWNED BY regress_tablespace_user1 SET LOGGED;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | p
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | p
+ uu1 | regress_tablespace | p
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+RESET ROLE;
+ALTER TABLE ALL IN TABLESPACE regress_tablespace SET UNLOGGED;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | u
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | u
+ uu1 | regress_tablespace | u
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+-- Should succeed
+DROP SCHEMA testschema CASCADE;
+NOTICE: drop cascades to 8 other objects
+DETAIL: drop cascades to table testschema.lsu
+drop cascades to table testschema.usu
+drop cascades to table testschema._lsu
+drop cascades to table testschema._usu
+drop cascades to table testschema.lu1
+drop cascades to table testschema.uu1
+drop cascades to table testschema._lu1
+drop cascades to table testschema._uu1
+DROP TABLESPACE regress_tablespace;
DROP ROLE regress_tablespace_user1;
DROP ROLE regress_tablespace_user2;
diff --git a/src/test/regress/sql/tablespace.sql b/src/test/regress/sql/tablespace.sql
index 896f05cea3..4e407eb8c0 100644
--- a/src/test/regress/sql/tablespace.sql
+++ b/src/test/regress/sql/tablespace.sql
@@ -419,5 +419,46 @@ DROP TABLESPACE regress_tblspace_renamed;
DROP SCHEMA testschema CASCADE;
+
+--
+-- Check persistence change in a tablespace
+CREATE SCHEMA testschema;
+GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1;
+CREATE TABLESPACE regress_tablespace LOCATION '';
+GRANT CREATE ON TABLESPACE regress_tablespace TO regress_tablespace_user1;
+
+CREATE TABLE testschema.lsu(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.usu(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lsu(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._usu(a int) TABLESPACE pg_default;
+SET ROLE regress_tablespace_user1;
+CREATE TABLE testschema.lu1(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.uu1(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lu1(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._uu1(a int) TABLESPACE pg_default;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace
+ OWNED BY regress_tablespace_user1 SET LOGGED;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+RESET ROLE;
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace SET UNLOGGED;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+-- Should succeed
+DROP SCHEMA testschema CASCADE;
+DROP TABLESPACE regress_tablespace;
+
DROP ROLE regress_tablespace_user1;
DROP ROLE regress_tablespace_user2;
--
2.27.0
At Tue, 01 Mar 2022 14:14:13 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
- Removed the default case in smgr_desc since it seems to me we don't
assume out-of-definition values in xlog records elsewhere.
Stupid. The compiler on the CI environment complains about an
uninitialized variable even though it (presumably) knows that all
paths of the switch statement set the variable. Added a default value
to try to silence the compiler.
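A minimal sketch of the kind of workaround described above, not the code from the patch: every name here (DemoMarkKind, demo_mark_name) is hypothetical and only stands in for a value decoded from a WAL record. The switch has no default case, and the variable is given a default value at its declaration so that no path can leave it unset.

```c
/* Hypothetical enum; not a type from the patch. */
typedef enum DemoMarkKind
{
	DEMO_MARK_CREATE,
	DEMO_MARK_DELETE
} DemoMarkKind;

/* Return a printable name for the given kind. */
const char *
demo_mark_name(DemoMarkKind kind)
{
	/*
	 * Default value assigned up front.  The switch below covers every
	 * enumerator, but a compiler may still assume that "kind" can hold a
	 * value outside the enum and warn that "name" may be used
	 * uninitialized.
	 */
	const char *name = "unknown";

	switch (kind)
	{
		case DEMO_MARK_CREATE:
			name = "CREATE";
			break;
		case DEMO_MARK_DELETE:
			name = "DELETE";
			break;
			/* no default case, so a newly added enumerator can be flagged */
	}

	return name;
}
```

Initializing at the declaration, rather than restoring a default case, keeps the switch free of a catch-all branch while still giving the compiler a defined value on every path.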
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v19-0002-New-command-ALTER-TABLE-ALL-IN-TABLESPACE-SET-LO.patch (text/x-patch; charset=us-ascii)
From 26cac5c8a65ff27e294996198333924c7e839a00 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 23:21:09 +0900
Subject: [PATCH v19 2/2] New command ALTER TABLE ALL IN TABLESPACE SET
LOGGED/UNLOGGED
To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes
relation persistence of all tables in the specified tablespace.
---
doc/src/sgml/ref/alter_table.sgml | 15 +++
src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++
src/backend/nodes/copyfuncs.c | 16 +++
src/backend/nodes/equalfuncs.c | 15 +++
src/backend/parser/gram.y | 42 +++++++
src/backend/tcop/utility.c | 11 ++
src/include/commands/tablecmds.h | 2 +
src/include/nodes/nodes.h | 1 +
src/include/nodes/parsenodes.h | 10 ++
src/test/regress/expected/tablespace.out | 76 ++++++++++++
src/test/regress/sql/tablespace.sql | 41 +++++++
11 files changed, 369 insertions(+)
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 5c0735e08a..b03d5511a6 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -33,6 +33,8 @@ ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
SET SCHEMA <replaceable class="parameter">new_schema</replaceable>
ALTER TABLE ALL IN TABLESPACE <replaceable class="parameter">name</replaceable> [ OWNED BY <replaceable class="parameter">role_name</replaceable> [, ... ] ]
SET TABLESPACE <replaceable class="parameter">new_tablespace</replaceable> [ NOWAIT ]
+ALTER TABLE ALL IN TABLESPACE <replaceable class="parameter">name</replaceable> [ OWNED BY <replaceable class="parameter">role_name</replaceable> [, ... ] ]
+ SET { LOGGED | UNLOGGED } [ NOWAIT ]
ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
ATTACH PARTITION <replaceable class="parameter">partition_name</replaceable> { FOR VALUES <replaceable class="parameter">partition_bound_spec</replaceable> | DEFAULT }
ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
@@ -753,6 +755,19 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
(see <xref linkend="sql-createtable-unlogged"/>). It cannot be applied
to a temporary table.
</para>
+
+ <para>
+ All tables in the current database in a tablespace can be changed by using
+ the <literal>ALL IN TABLESPACE</literal> form, which will lock all tables
+ to be changed first and then change each one. This form also supports
+ <literal>OWNED BY</literal>, which will only change tables owned by the
+ roles specified. If the <literal>NOWAIT</literal> option is specified
+ then the command will fail if it is unable to acquire all of the locks
+ required immediately. The <literal>information_schema</literal>
+ relations are not considered part of the system catalogs and will be
+ changed. See also
+ <link linkend="sql-createtablespace"><command>CREATE TABLESPACE</command></link>.
+ </para>
</listitem>
</varlistentry>
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 9e5b77e94a..0724d0e1d2 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14770,6 +14770,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
return new_tablespaceoid;
}
+/*
+ * Alter Table ALL ... SET LOGGED/UNLOGGED
+ *
+ * Allows a user to change persistence of all objects in a given tablespace in
+ * the current database. Objects can be chosen based on the owner of the
+ * object also, to allow users to change persistence of only their objects. The
+ * main permissions handling is done by the lower-level change persistence
+ * function.
+ *
+ * All to-be-modified objects are locked first. If NOWAIT is specified and the
+ * lock can't be acquired then we ereport(ERROR).
+ */
+void
+AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt)
+{
+ List *relations = NIL;
+ ListCell *l;
+ ScanKeyData key[1];
+ Relation rel;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ Oid tablespaceoid;
+ List *role_oids = roleSpecsToIds(stmt->roles);
+
+ /* Ensure we were not asked to change something we can't */
+ if (stmt->objtype != OBJECT_TABLE)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("only tables can be specified")));
+
+ /* Get the tablespace OID */
+ tablespaceoid = get_tablespace_oid(stmt->tablespacename, false);
+
+ /*
+ * Now that the checks are done, check if we should set it to
+ * InvalidOid because it is our database's default tablespace.
+ */
+ if (tablespaceoid == MyDatabaseTableSpace)
+ tablespaceoid = InvalidOid;
+
+ /*
+ * Walk the list of objects in the tablespace to pick them up. This will
+ * only find objects in our database, of course.
+ */
+ ScanKeyInit(&key[0],
+ Anum_pg_class_reltablespace,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(tablespaceoid));
+
+ rel = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(rel, 1, key);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relOid = relForm->oid;
+
+ /*
+ * Do not pick-up objects in pg_catalog as part of this, if an admin
+ * really wishes to do so, they can issue the individual ALTER
+ * commands directly.
+ *
+ * Also, explicitly avoid any shared tables, temp tables, or TOAST
+ * (TOAST will be changed with the main table).
+ */
+ if (IsCatalogNamespace(relForm->relnamespace) ||
+ relForm->relisshared ||
+ isAnyTempNamespace(relForm->relnamespace) ||
+ IsToastNamespace(relForm->relnamespace))
+ continue;
+
+ /* Only pick up the object type requested */
+ if (relForm->relkind != RELKIND_RELATION)
+ continue;
+
+ /* Check if we are only picking-up objects owned by certain roles */
+ if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner))
+ continue;
+
+ /*
+ * Handle permissions-checking here since we are locking the tables
+ * and also to avoid doing a bunch of work only to fail part-way. Note
+ * that permissions will also be checked by AlterTableInternal().
+ *
+ * Caller must be considered an owner on the table of which we're going
+ * to change persistence.
+ */
+ if (!pg_class_ownercheck(relOid, GetUserId()))
+ aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)),
+ NameStr(relForm->relname));
+
+ if (stmt->nowait &&
+ !ConditionalLockRelationOid(relOid, AccessExclusiveLock))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("aborting because lock on relation \"%s.%s\" is not available",
+ get_namespace_name(relForm->relnamespace),
+ NameStr(relForm->relname))));
+ else
+ LockRelationOid(relOid, AccessExclusiveLock);
+
+ /*
+ * Add to our list of objects of which we're going to change
+ * persistence.
+ */
+ relations = lappend_oid(relations, relOid);
+ }
+
+ table_endscan(scan);
+ table_close(rel, AccessShareLock);
+
+ if (relations == NIL)
+ ereport(NOTICE,
+ (errcode(ERRCODE_NO_DATA_FOUND),
+ errmsg("no matching relations in tablespace \"%s\" found",
+ tablespaceoid == InvalidOid ? "(database default)" :
+ get_tablespace_name(tablespaceoid))));
+
+ /*
+ * Everything is locked, loop through and change persistence of all of the
+ * relations.
+ */
+ foreach(l, relations)
+ {
+ List *cmds = NIL;
+ AlterTableCmd *cmd = makeNode(AlterTableCmd);
+
+ if (stmt->logged)
+ cmd->subtype = AT_SetLogged;
+ else
+ cmd->subtype = AT_SetUnLogged;
+
+ cmds = lappend(cmds, cmd);
+
+ EventTriggerAlterTableStart((Node *) stmt);
+ /* OID is set by AlterTableInternal */
+ AlterTableInternal(lfirst_oid(l), cmds, false);
+ EventTriggerAlterTableEnd();
+ }
+}
+
static void
index_copy_data(Relation rel, RelFileNode newrnode)
{
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index d4f8455a2b..ba605405a9 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4285,6 +4285,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from)
return newnode;
}
+static AlterTableSetLoggedAllStmt *
+_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *from)
+{
+ AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt);
+
+ COPY_STRING_FIELD(tablespacename);
+ COPY_SCALAR_FIELD(objtype);
+ COPY_SCALAR_FIELD(logged);
+ COPY_SCALAR_FIELD(nowait);
+
+ return newnode;
+}
+
static CreateExtensionStmt *
_copyCreateExtensionStmt(const CreateExtensionStmt *from)
{
@@ -5655,6 +5668,9 @@ copyObjectImpl(const void *from)
case T_AlterTableMoveAllStmt:
retval = _copyAlterTableMoveAllStmt(from);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _copyAlterTableSetLoggedAllStmt(from);
+ break;
case T_CreateExtensionStmt:
retval = _copyCreateExtensionStmt(from);
break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index f1002afe7a..b76fc872a5 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1925,6 +1925,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a,
return true;
}
+static bool
+_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *a,
+ const AlterTableSetLoggedAllStmt *b)
+{
+ COMPARE_STRING_FIELD(tablespacename);
+ COMPARE_SCALAR_FIELD(objtype);
+ COMPARE_SCALAR_FIELD(logged);
+ COMPARE_SCALAR_FIELD(nowait);
+
+ return true;
+}
+
static bool
_equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b)
{
@@ -3650,6 +3662,9 @@ equal(const void *a, const void *b)
case T_AlterTableMoveAllStmt:
retval = _equalAlterTableMoveAllStmt(a, b);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _equalAlterTableSetLoggedAllStmt(a, b);
+ break;
case T_CreateExtensionStmt:
retval = _equalCreateExtensionStmt(a, b);
break;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index a03b33b53b..f8a41de2dd 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -1985,6 +1985,48 @@ AlterTableStmt:
n->nowait = $13;
$$ = (Node *)n;
}
+ | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = true;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name OWNED BY role_list SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->roles = $9;
+ n->logged = true;
+ n->nowait = $12;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = false;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name OWNED BY role_list SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->roles = $9;
+ n->logged = false;
+ n->nowait = $12;
+ $$ = (Node *)n;
+ }
| ALTER INDEX qualified_name alter_table_cmds
{
AlterTableStmt *n = makeNode(AlterTableStmt);
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 3780c6e812..80d1e360b3 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -163,6 +163,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_AlterTSConfigurationStmt:
case T_AlterTSDictionaryStmt:
case T_AlterTableMoveAllStmt:
+ case T_AlterTableSetLoggedAllStmt:
case T_AlterTableSpaceOptionsStmt:
case T_AlterTableStmt:
case T_AlterTypeStmt:
@@ -1753,6 +1754,12 @@ ProcessUtilitySlow(ParseState *pstate,
commandCollected = true;
break;
+ case T_AlterTableSetLoggedAllStmt:
+ AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree);
+ /* commands are stashed in AlterTableSetLoggedAll */
+ commandCollected = true;
+ break;
+
case T_DropStmt:
ExecDropStmt((DropStmt *) parsetree, isTopLevel);
/* no commands stashed for DROP */
@@ -2675,6 +2682,10 @@ CreateCommandTag(Node *parsetree)
tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype);
+ break;
+
case T_AlterTableStmt:
tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype);
break;
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index 5d4037f26e..c381dad3e5 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
+extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt);
+
extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
Oid *oldschema);
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 5d075f0c34..d8e1f223c8 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -430,6 +430,7 @@ typedef enum NodeTag
T_AlterCollationStmt,
T_CallStmt,
T_AlterStatsStmt,
+ T_AlterTableSetLoggedAllStmt,
/*
* TAGS FOR PARSE TREE NODES (parsenodes.h)
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 1617702d9d..4fa9d9360f 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2352,6 +2352,16 @@ typedef struct AlterTableMoveAllStmt
bool nowait;
} AlterTableMoveAllStmt;
+typedef struct AlterTableSetLoggedAllStmt
+{
+ NodeTag type;
+ char *tablespacename;
+ ObjectType objtype; /* Object type to change */
+ List *roles; /* List of roles to change objects of */
+ bool logged;
+ bool nowait;
+} AlterTableSetLoggedAllStmt;
+
/* ----------------------
* Create/Alter Extension Statements
* ----------------------
diff --git a/src/test/regress/expected/tablespace.out b/src/test/regress/expected/tablespace.out
index 2dfbcfdebe..c02afdcb68 100644
--- a/src/test/regress/expected/tablespace.out
+++ b/src/test/regress/expected/tablespace.out
@@ -943,5 +943,81 @@ drop cascades to table testschema.asexecute
drop cascades to table testschema.part
drop cascades to table testschema.atable
drop cascades to table testschema.tablespace_acl
+--
+-- Check persistence change in a tablespace
+CREATE SCHEMA testschema;
+GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1;
+CREATE TABLESPACE regress_tablespace LOCATION '';
+GRANT CREATE ON TABLESPACE regress_tablespace TO regress_tablespace_user1;
+CREATE TABLE testschema.lsu(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.usu(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lsu(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._usu(a int) TABLESPACE pg_default;
+SET ROLE regress_tablespace_user1;
+CREATE TABLE testschema.lu1(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.uu1(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lu1(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._uu1(a int) TABLESPACE pg_default;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | p
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | p
+ uu1 | regress_tablespace | u
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace
+ OWNED BY regress_tablespace_user1 SET LOGGED;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | p
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | p
+ uu1 | regress_tablespace | p
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+RESET ROLE;
+ALTER TABLE ALL IN TABLESPACE regress_tablespace SET UNLOGGED;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | u
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | u
+ uu1 | regress_tablespace | u
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+-- Should succeed
+DROP SCHEMA testschema CASCADE;
+NOTICE: drop cascades to 8 other objects
+DETAIL: drop cascades to table testschema.lsu
+drop cascades to table testschema.usu
+drop cascades to table testschema._lsu
+drop cascades to table testschema._usu
+drop cascades to table testschema.lu1
+drop cascades to table testschema.uu1
+drop cascades to table testschema._lu1
+drop cascades to table testschema._uu1
+DROP TABLESPACE regress_tablespace;
DROP ROLE regress_tablespace_user1;
DROP ROLE regress_tablespace_user2;
diff --git a/src/test/regress/sql/tablespace.sql b/src/test/regress/sql/tablespace.sql
index 896f05cea3..4e407eb8c0 100644
--- a/src/test/regress/sql/tablespace.sql
+++ b/src/test/regress/sql/tablespace.sql
@@ -419,5 +419,46 @@ DROP TABLESPACE regress_tblspace_renamed;
DROP SCHEMA testschema CASCADE;
+
+--
+-- Check persistence change in a tablespace
+CREATE SCHEMA testschema;
+GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1;
+CREATE TABLESPACE regress_tablespace LOCATION '';
+GRANT CREATE ON TABLESPACE regress_tablespace TO regress_tablespace_user1;
+
+CREATE TABLE testschema.lsu(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.usu(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lsu(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._usu(a int) TABLESPACE pg_default;
+SET ROLE regress_tablespace_user1;
+CREATE TABLE testschema.lu1(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.uu1(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lu1(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._uu1(a int) TABLESPACE pg_default;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace
+ OWNED BY regress_tablespace_user1 SET LOGGED;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+RESET ROLE;
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace SET UNLOGGED;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+-- Should succeed
+DROP SCHEMA testschema CASCADE;
+DROP TABLESPACE regress_tablespace;
+
DROP ROLE regress_tablespace_user1;
DROP ROLE regress_tablespace_user2;
--
2.27.0
v19-0001-In-place-table-persistence-change.patch (text/x-patch; charset=us-ascii)
From ec75c49ffd939f6db8e0d840ef043c18845d1b9d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v19 1/2] In-place table persistence change
Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, it currently runs a heap rewrite, which causes a large
amount of file I/O. This patch makes the command run without a heap
rewrite. In addition, SET LOGGED while wal_level > minimal emits WAL
using XLOG_FPI records instead of a massive number of HEAP_INSERTs,
which should be smaller.
Also, this allows for the cleanup of files left behind when the server
crashes during the transaction that created them.
---
src/backend/access/rmgrdesc/smgrdesc.c | 49 ++
src/backend/access/transam/README | 9 +
src/backend/access/transam/xact.c | 7 +
src/backend/access/transam/xlogrecovery.c | 18 +
src/backend/catalog/storage.c | 548 +++++++++++++++++++++-
src/backend/commands/tablecmds.c | 266 +++++++++--
src/backend/replication/basebackup.c | 3 +-
src/backend/storage/buffer/bufmgr.c | 86 ++++
src/backend/storage/file/fd.c | 4 +-
src/backend/storage/file/reinit.c | 344 ++++++++++----
src/backend/storage/smgr/md.c | 94 +++-
src/backend/storage/smgr/smgr.c | 32 ++
src/backend/storage/sync/sync.c | 20 +-
src/bin/pg_rewind/parsexlog.c | 22 +
src/bin/pg_rewind/pg_rewind.c | 1 -
src/common/relpath.c | 47 +-
src/include/catalog/storage.h | 3 +
src/include/catalog/storage_xlog.h | 42 +-
src/include/common/relpath.h | 9 +-
src/include/storage/bufmgr.h | 2 +
src/include/storage/fd.h | 1 +
src/include/storage/md.h | 8 +-
src/include/storage/reinit.h | 10 +-
src/include/storage/smgr.h | 17 +
24 files changed, 1459 insertions(+), 183 deletions(-)
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 7547813254..225ffbafef 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,46 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+ char *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+ appendStringInfoString(buf, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+ char *path = GetRelationPath(xlrec->rnode.dbNode,
+ xlrec->rnode.spcNode,
+ xlrec->rnode.relNode,
+ InvalidBackendId,
+ xlrec->forkNum, xlrec->mark);
+ char *action = "<none>";
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ action = "CREATE";
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ action = "DELETE";
+ break;
+ }
+
+ appendStringInfo(buf, "%s %s", action, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+ char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+ appendStringInfoString(buf, path);
+ appendStringInfo(buf, " persistence %d", xlrec->persistence);
+ pfree(path);
+ }
}
const char *
@@ -55,6 +95,15 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_UNLINK:
+ id = "UNLINK";
+ break;
+ case XLOG_SMGR_MARK:
+ id = "MARK";
+ break;
+ case XLOG_SMGR_BUFPERSISTENCE:
+ id = "BUFPERSISTENCE";
+ break;
}
return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c1..2ecd8c8c7c 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -725,6 +725,15 @@ then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.
+The Smgr MARK files
+--------------------------------
+
+An smgr mark file is created along with a new relation file to mark
+that the relfilenode needs to be cleaned up at recovery time. In
+contrast to the four actions above, failure to remove an smgr mark file
+would lead to data loss, so the server shuts down in that case.
+
+
Skipping WAL for New RelFileNode
--------------------------------
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index adf763a8ea..559666b802 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2198,6 +2198,9 @@ CommitTransaction(void)
*/
smgrDoPendingSyncs(true, is_parallel_worker);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2448,6 +2451,9 @@ PrepareTransaction(void)
*/
smgrDoPendingSyncs(true, false);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2773,6 +2779,7 @@ AbortTransaction(void)
AfterTriggerEndXact(false); /* 'false' means it's abort */
AtAbort_Portals();
smgrDoPendingSyncs(false, is_parallel_worker);
+ smgrDoPendingCleanups(false);
AtEOXact_LargeObject(false);
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index f9f212680b..2923b8ef8c 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "catalog/pg_control.h"
+#include "catalog/storage.h"
#include "commands/tablespace.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -53,6 +54,7 @@
#include "storage/pmsignal.h"
#include "storage/proc.h"
#include "storage/procarray.h"
+#include "storage/reinit.h"
#include "storage/spin.h"
#include "utils/builtins.h"
#include "utils/guc.h"
@@ -1746,6 +1748,14 @@ PerformWalRecovery(void)
}
}
+ /* cleanup garbage files left during crash recovery */
+ if (!InArchiveRecovery)
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
/* Allow resource managers to do any required cleanup. */
for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
{
@@ -3022,6 +3032,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
{
ereport(DEBUG1,
(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+
+ /* cleanup garbage files left during crash recovery */
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
InArchiveRecovery = true;
if (StandbyModeRequested)
StandbyMode = true;
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 9b8075536a..cd1445713a 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,6 +19,7 @@
#include "postgres.h"
+#include "access/amapi.h"
#include "access/parallel.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -66,6 +67,23 @@ typedef struct PendingRelDelete
struct PendingRelDelete *next; /* linked-list link */
} PendingRelDelete;
+#define PCOP_UNLINK_FORK (1 << 0)
+#define PCOP_UNLINK_MARK (1 << 1)
+#define PCOP_SET_PERSISTENCE (1 << 2)
+
+typedef struct PendingCleanup
+{
+ RelFileNode relnode; /* relation that may need to be deleted */
+ int op; /* operation mask */
+ bool bufpersistence; /* buffer persistence to set */
+ int unlink_forknum; /* forknum to unlink */
+ StorageMarks unlink_mark; /* mark to unlink */
+ BackendId backend; /* InvalidBackendId if not a temp rel */
+ bool atCommit; /* T=delete at commit; F=delete at abort */
+ int nestLevel; /* xact nesting level of request */
+ struct PendingCleanup *next; /* linked-list link */
+} PendingCleanup;
+
typedef struct PendingRelSync
{
RelFileNode rnode;
@@ -73,6 +91,7 @@ typedef struct PendingRelSync
} PendingRelSync;
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingCleanup *pendingCleanups = NULL; /* head of linked list */
HTAB *pendingSyncHash = NULL;
@@ -117,7 +136,8 @@ AddPendingSync(const RelFileNode *rnode)
SMgrRelation
RelationCreateStorage(RelFileNode rnode, char relpersistence)
{
- PendingRelDelete *pending;
+ PendingRelDelete *pendingdel;
+ PendingCleanup *pendingclean;
SMgrRelation srel;
BackendId backend;
bool needs_wal;
@@ -143,21 +163,41 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return NULL; /* placate compiler */
}
+ /*
+ * We are going to create a new storage file. If the server crashes before
+ * the current transaction ends, the file needs to be cleaned up. The
+ * SMGR_MARK_UNCOMMITTED mark file prompts that work at the next startup.
+ */
srel = smgropen(rnode, backend);
+ log_smgrcreatemark(&rnode, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
smgrcreate(srel, MAIN_FORKNUM, false);
if (needs_wal)
log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM);
/* Add the relation to the list of stuff to delete at abort */
- pending = (PendingRelDelete *)
+ pendingdel = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
- pending->relnode = rnode;
- pending->backend = backend;
- pending->atCommit = false; /* delete if abort */
- pending->nestLevel = GetCurrentTransactionNestLevel();
- pending->next = pendingDeletes;
- pendingDeletes = pending;
+ pendingdel->relnode = rnode;
+ pendingdel->backend = backend;
+ pendingdel->atCommit = false; /* delete if abort */
+ pendingdel->nestLevel = GetCurrentTransactionNestLevel();
+ pendingdel->next = pendingDeletes;
+ pendingDeletes = pendingdel;
+
+ /* drop mark files at commit */
+ pendingclean = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pendingclean->relnode = rnode;
+ pendingclean->op = PCOP_UNLINK_MARK;
+ pendingclean->unlink_forknum = MAIN_FORKNUM;
+ pendingclean->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pendingclean->backend = backend;
+ pendingclean->atCommit = true;
+ pendingclean->nestLevel = GetCurrentTransactionNestLevel();
+ pendingclean->next = pendingCleanups;
+ pendingCleanups = pendingclean;
if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
{
@@ -168,6 +208,200 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
return srel;
}
+/*
+ * RelationCreateInitFork
+ * Create physical storage for the init fork of a relation.
+ *
+ * Create the init fork for the relation.
+ *
+ * This function is transactional. The creation is WAL-logged, and if the
+ * transaction aborts later on, the init fork will be removed.
+ */
+void
+RelationCreateInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ SMgrRelation srel;
+ bool create = true;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), false, false);
+
+ /*
+ * If we have a pending-unlink for the init-fork of this relation, that
+ * means the init-fork has existed since before the current transaction
+ * started. This function reverts that change just by removing the entry.
+ * See RelationDropInitFork.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ create = false;
+ }
+ else
+ prev = pending;
+ }
+
+ if (!create)
+ return;
+
+ /* create the init fork, along with the commit-sentinel file */
+ srel = smgropen(rnode, InvalidBackendId);
+ log_smgrcreatemark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+
+ /* We don't have existing init fork, create it. */
+ smgrcreate(srel, INIT_FORKNUM, false);
+
+ /*
+ * init fork for indexes needs further initialization. ambuildempty should
+ * do WAL-log and file sync by itself but otherwise we do that by
+ * ourselves.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rnode, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /* drop the init fork, mark file and revert persistence at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_UNLINK_FORK | PCOP_UNLINK_MARK | PCOP_SET_PERSISTENCE;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->bufpersistence = true;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* drop mark file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
+/*
+ * RelationDropInitFork
+ * Delete physical storage for the init fork of a relation.
+ *
+ * Register pending-delete of the init fork. The real deletion is performed by
+ * smgrDoPendingCleanups at commit.
+ *
+ * This function is transactional. If the transaction aborts later on, the
+ * deletion is canceled.
+ */
+void
+RelationDropInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ bool inxact_created = false;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), true, false);
+
+ /*
+ * If we have a pending-unlink for the init-fork of this relation, that
+ * means the init fork is created in the current transaction. We remove
+ * both the init fork and mark file immediately in that case. Otherwise
+ * just register a pending-unlink for the existing init fork. See
+ * RelationCreateInitFork.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ /* unlink list entry */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ inxact_created = true;
+ }
+ else
+ prev = pending;
+ }
+
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rnode, InvalidBackendId);
+
+ /*
+ * INIT forks are never loaded to shared buffer so no point in dropping
+ * buffers for such files.
+ */
+ log_smgrunlinkmark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrunlinkmark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+ log_smgrunlink(&rnode, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ return;
+ }
+
+ /* register drop of this init fork file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_UNLINK_FORK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = false;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -187,6 +421,88 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum)
+{
+ xl_smgr_unlink xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file unlink.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK record (XLOG_SMGR_MARK_CREATE) to WAL.
+ */
+void
+log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the mark file creation.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_CREATE;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK record (XLOG_SMGR_MARK_UNLINK) to WAL.
+ */
+void
+log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the mark file unlink.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_UNLINK;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileNode *rnode, bool persistence)
+{
+ xl_smgr_bufpersistence xlrec;
+
+ /*
+ * Make an XLOG entry reporting the change of buffer persistence.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.persistence = persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -673,6 +989,95 @@ smgrDoPendingDeletes(bool isCommit)
}
}
+/*
+ * smgrDoPendingCleanups() -- Clean up work that emits WAL records
+ *
+ * The operations handled in this function emit WAL records, which must be
+ * part of the current transaction.
+ */
+void
+smgrDoPendingCleanups(bool isCommit)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (pending->nestLevel < nestLevel)
+ {
+ /* outer-level entries should not be processed yet */
+ prev = pending;
+ }
+ else
+ {
+ /* unlink list entry first, so we don't retry on failure */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ /* do cleanup if called for */
+ if (pending->atCommit == isCommit)
+ {
+ SMgrRelation srel;
+
+ srel = smgropen(pending->relnode, pending->backend);
+
+ Assert ((pending->op &
+ ~(PCOP_UNLINK_FORK | PCOP_UNLINK_MARK |
+ PCOP_SET_PERSISTENCE)) == 0);
+
+ if (pending->op & PCOP_SET_PERSISTENCE)
+ {
+ SetRelationBuffersPersistence(srel, pending->bufpersistence,
+ InRecovery);
+ }
+
+ if (pending->op & PCOP_UNLINK_FORK)
+ {
+ /*
+ * Unlink the fork file. Currently we use this only for
+ * init forks and we're sure that the init fork is not
+ * loaded in shared buffers. In the RelationDropInitFork
+ * case, that function dropped the buffers. In the
+ * RelationCreateInitFork case, PCOP_SET_PERSISTENCE(true)
+ * is set and the buffers have been dropped just before.
+ */
+ Assert(pending->unlink_forknum == INIT_FORKNUM);
+
+ /* Don't emit WAL during recovery. */
+ if (!InRecovery)
+ log_smgrunlink(&pending->relnode,
+ pending->unlink_forknum);
+ smgrunlink(srel, pending->unlink_forknum, false);
+ }
+
+ if (pending->op & PCOP_UNLINK_MARK)
+ {
+ SMgrRelation srel;
+
+ if (!InRecovery)
+ log_smgrunlinkmark(&pending->relnode,
+ pending->unlink_forknum,
+ pending->unlink_mark);
+ srel = smgropen(pending->relnode, pending->backend);
+ smgrunlinkmark(srel, pending->unlink_forknum,
+ pending->unlink_mark, InRecovery);
+ smgrclose(srel);
+ }
+ }
+
+ /* must explicitly free the list entry */
+ pfree(pending);
+ /* prev does not change */
+ }
+ }
+}
+
/*
* smgrDoPendingSyncs() -- Take care of relation syncs at end of xact.
*/
@@ -933,6 +1338,15 @@ smgr_redo(XLogReaderState *record)
reln = smgropen(xlrec->rnode, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ smgrunlink(reln, xlrec->forkNum, true);
+ smgrclose(reln);
+ }
else if (info == XLOG_SMGR_TRUNCATE)
{
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
@@ -1021,6 +1435,124 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ bool created = false;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ smgrcreatemark(reln, xlrec->forkNum, xlrec->mark, true);
+ created = true;
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ smgrunlinkmark(reln, xlrec->forkNum, xlrec->mark, true);
+ break;
+ default:
+ elog(ERROR, "unknown smgr_mark action \"%c\"", xlrec->action);
+ }
+
+ if (created)
+ {
+ /* revert mark file operation at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = xlrec->rnode;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = xlrec->forkNum;
+ pending->unlink_mark = xlrec->mark;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ else
+ {
+ /*
+ * Delete pending action for this mark file if any. We should have
+ * at most one entry for this action.
+ */
+ PendingCleanup *prev = NULL;
+
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileNodeEquals(xlrec->rnode, pending->relnode) &&
+ pending->unlink_forknum == xlrec->forkNum &&
+ (pending->op & PCOP_UNLINK_MARK) != 0)
+ {
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+ }
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ PendingCleanup *prev = NULL;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+
+ /*
+ * Delete pending action for persistence change if any. We should have
+ * at most one entry for this action.
+ */
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileNodeEquals(xlrec->rnode, pending->relnode) &&
+ (pending->op & PCOP_SET_PERSISTENCE) != 0)
+ {
+ Assert (pending->bufpersistence == xlrec->persistence);
+
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+
+ /*
+ * Revert buffer-persistence changes at abort if the relation is going
+ * to different persistence from before this transaction.
+ */
+ if (!pending)
+ {
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = xlrec->rnode;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = !xlrec->persistence;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 3e83f375b5..9e5b77e94a 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -53,6 +53,7 @@
#include "commands/defrem.h"
#include "commands/event_trigger.h"
#include "commands/policy.h"
+#include "commands/progress.h"
#include "commands/sequence.h"
#include "commands/tablecmds.h"
#include "commands/tablespace.h"
@@ -5347,6 +5348,187 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
return newcmd;
}
+/*
+ * RelationChangePersistence: do in-place persistence change of a relation
+ */
+static void
+RelationChangePersistence(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ int i;
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * If any of the following were set, we would need to call ATRewriteTable;
+ * that cannot happen in the AT_REWRITE_ALTER_PERSISTENCE case.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ /*
+ * First we collect all relations that we need to change persistence.
+ */
+
+ /* Collect OIDs of indexes and toast relations */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ /* Add toast relation if any */
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ List *toastidx;
+ Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
+
+ relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
+ toastidx = RelationGetIndexList(toastrel);
+ relids = list_concat(relids, toastidx);
+ pfree(toastidx);
+ table_close(toastrel, NoLock);
+ }
+
+ table_close(rel, NoLock);
+
+ /* Make changes in storage */
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach (lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+
+ /*
+ * XXXX: Some access methods cannot tolerate an in-place persistence
+ * change. Specifically, GiST uses page LSNs to figure out whether a
+ * block has changed, and UNLOGGED GiST indexes use fake LSNs that
+ * are incompatible with the real LSNs used for LOGGED ones.
+ *
+ * Maybe if gistGetFakeLSN behaved the same way for permanent and
+ * unlogged indexes, we could skip index rebuild in exchange of some
+ * extra WAL records emitted while it is unlogged.
+ *
+ * Check relam against a positive list so that unknown AMs also take
+ * the reindex path.
+ */
+ if (r->rd_rel->relkind == RELKIND_INDEX &&
+ /* GiST is excluded */
+ r->rd_rel->relam != BTREE_AM_OID &&
+ r->rd_rel->relam != HASH_AM_OID &&
+ r->rd_rel->relam != GIN_AM_OID &&
+ r->rd_rel->relam != SPGIST_AM_OID &&
+ r->rd_rel->relam != BRIN_AM_OID)
+ {
+ int reindex_flags;
+ ReindexParams params = {0};
+
+ /* reindex doesn't allow concurrent use of the index */
+ table_close(r, NoLock);
+
+ reindex_flags =
+ REINDEX_REL_SUPPRESS_INDEX_USE |
+ REINDEX_REL_CHECK_CONSTRAINTS;
+
+ /* Set the same persistence as the parent relation. */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+
+ reindex_index(reloid, reindex_flags, persistence, &params);
+
+ continue;
+ }
+
+ /* Create or drop init fork */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ RelationCreateInitFork(r);
+ else
+ RelationDropInitFork(r);
+
+ /*
+ * When this relation gets WAL-logged, immediately sync all files but
+ * the init fork to establish the initial state on storage. Buffers have
+ * already been flushed out by RelationCreate(Drop)InitFork called just
+ * above. The init fork should have been synced as needed.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT)
+ {
+ for (i = 0 ; i < INIT_FORKNUM ; i++)
+ {
+ if (smgrexists(RelationGetSmgr(r), i))
+ smgrimmedsync(RelationGetSmgr(r), i);
+ }
+ }
+
+ /* Update catalog */
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+
+ /*
+ * While wal_level >= replica, switching to LOGGED requires the
+ * relation content to be WAL-logged so that the table can be recovered.
+ * We don't emit this while wal_level = minimal.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded())
+ {
+ ForkNumber fork;
+ xl_smgr_truncate xlrec;
+
+ xlrec.blkno = 0;
+ xlrec.rnode = r->rd_node;
+ xlrec.flags = SMGR_TRUNCATE_ALL;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ for (fork = 0; fork < INIT_FORKNUM ; fork++)
+ {
+ if (smgrexists(RelationGetSmgr(r), fork))
+ log_newpage_range(r, fork, 0,
+ smgrnblocks(RelationGetSmgr(r), fork),
+ false);
+ }
+ }
+
+ table_close(r, NoLock);
+ }
+
+ table_close(classRel, NoLock);
+}
+
/*
* ATRewriteTables: ALTER TABLE phase 3
*/
@@ -5477,47 +5659,55 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- /*
- * Create transient table that will receive the modified data.
- *
- * Ensure it is marked correctly as logged or unlogged. We have
- * to do this here so that buffers for the new relfilenode will
- * have the right persistence set, and at the same time ensure
- * that the original filenode's buffers will get read in with the
- * correct setting (i.e. the original one). Otherwise a rollback
- * after the rewrite would possibly result with buffers for the
- * original filenode having the wrong persistence setting.
- *
- * NB: This relies on swap_relation_files() also swapping the
- * persistence. That wouldn't work for pg_class, but that can't be
- * unlogged anyway.
- */
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
- persistence, lockmode);
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
+ RelationChangePersistence(tab, persistence, lockmode);
+ else
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new relfilenode
+ * will have the right persistence set, and at the same time
+ * ensure that the original filenode's buffers will get read in
+ * with the correct setting (i.e. the original one). Otherwise
+ * a rollback after the rewrite would possibly result with
+ * buffers for the original filenode having the wrong
+ * persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that can't
+ * be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace,
+ NewAccessMethod,
+ persistence, lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
+ /*
+ * Swap the physical files of the old and new heaps, then
+ * rebuild indexes and discard the old heap. We can use
+ * RecentXmin for the table's new relfrozenxid because we
+ * rewrote all the tuples in ATRewriteTable, so no older Xid
+ * remains in the table. Also, we never try to swap toast
+ * tables by content, since we have no interest in letting this
+ * code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
- InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ }
}
else
{
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 0bf28b55d7..17185f4e55 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1209,6 +1209,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
bool excludeFound;
ForkNumber relForkNum; /* Type of fork if file is a relation */
int relOidChars; /* Chars in filename that are the rel oid */
+ StorageMarks mark;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1259,7 +1260,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
/* Exclude all forks for unlogged tables except the init fork */
if (isDbDir &&
parse_filename_for_nontemp_relation(de->d_name, &relOidChars,
- &relForkNum))
+ &relForkNum, &mark))
{
/* Never exclude init forks */
if (relForkNum != INIT_FORKNUM)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5459c68f8..6cd010429a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -38,6 +38,7 @@
#include "access/xlogutils.h"
#include "catalog/catalog.h"
#include "catalog/storage.h"
+#include "catalog/storage_xlog.h"
#include "executor/instrument.h"
#include "lib/binaryheap.h"
#include "miscadmin.h"
@@ -3155,6 +3156,91 @@ DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
}
}
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a relation.
+ * When switching to PERMANENT, it also writes all dirty pages of the
+ * relation out to disk (or more accurately, out to kernel disk buffers),
+ * ensuring that the kernel has an up-to-date view of the relation.
+ *
+ * Generally, the caller should be holding AccessExclusiveLock on the
+ * target relation to ensure that no other backend is busy dirtying
+ * more blocks of the relation; the effects can't be expected to last
+ * after the lock is released.
+ *
+ * XXX currently it sequentially searches the buffer pool, should be
+ * changed to more clever ways of searching. This routine is not
+ * used in any performance-critical code paths, so it's not worth
+ * adding additional overhead to normal paths to make it go faster;
+ * but see also DropRelFileNodeBuffers.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
+{
+ int i;
+ RelFileNodeBackend rnode = srel->smgr_rnode;
+
+ Assert(!RelFileNodeBackendIsTemp(rnode));
+
+ if (!isRedo)
+ log_smgrbufpersistence(&srel->smgr_rnode.node, permanent);
+
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ {
+ UnlockBufHdr(bufHdr, buf_state);
+ continue;
+ }
+
+ if (permanent)
+ {
+ /* Init fork is being dropped, drop buffers for it. */
+ if (bufHdr->tag.forkNum == INIT_FORKNUM)
+ {
+ InvalidateBuffer(bufHdr);
+ continue;
+ }
+
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /* we flush this buffer when switching to PERMANENT */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, srel);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr, true);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ /* There shouldn't be an init fork */
+ Assert(bufHdr->tag.forkNum != INIT_FORKNUM);
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ }
+}
+
/* ---------------------------------------------------------------------
* DropRelFileNodesAllBuffers
*
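A note for readers of the bufmgr.c hunk above: the scan in SetRelationBuffersPersistence() follows the usual "check, lock, recheck" shape (compare the tag without the header lock, take the lock, compare again, then act). The following is only a minimal standalone sketch of that shape, not the patch's code. `Slot`, `slot_lock()`, `slot_unlock()`, `set_slots_permanent()` and the `SLOT_*` flags are hypothetical stand-ins for the buffer descriptor, the header spinlock and the BM_* bits; the "flush" merely clears a flag.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for BM_VALID, BM_DIRTY and BM_PERMANENT. */
#define SLOT_VALID      0x1u
#define SLOT_DIRTY      0x2u
#define SLOT_PERMANENT  0x4u

typedef struct Slot
{
    atomic_flag lock;   /* stand-in for the buffer header spinlock */
    uint32_t    relid;  /* stand-in for the buffer tag */
    uint32_t    flags;
} Slot;

static void
slot_lock(Slot *s)
{
    while (atomic_flag_test_and_set(&s->lock))
        ;               /* spin until the flag was previously clear */
}

static void
slot_unlock(Slot *s)
{
    atomic_flag_clear(&s->lock);
}

/*
 * Mark every slot belonging to relid as permanent and "flush" the ones that
 * are both valid and dirty.  Returns the number of slots changed and stores
 * the number of flushed slots in *nflushed.
 */
static int
set_slots_permanent(Slot *slots, int nslots, uint32_t relid, int *nflushed)
{
    int changed = 0;

    *nflushed = 0;
    for (int i = 0; i < nslots; i++)
    {
        Slot *s = &slots[i];

        /*
         * Cheap pre-check without the lock.  The value may be stale, which
         * is why it is repeated below.  (A real concurrent version would
         * need this read to be well-defined; here it is only illustrative.)
         */
        if (s->relid != relid)
            continue;

        slot_lock(s);

        /* Recheck under the lock before touching the flags. */
        if (s->relid != relid)
        {
            slot_unlock(s);
            continue;
        }

        s->flags |= SLOT_PERMANENT;
        changed++;

        if ((s->flags & (SLOT_VALID | SLOT_DIRTY)) == (SLOT_VALID | SLOT_DIRTY))
        {
            /* Pretend to write the page out: just clear the dirty bit. */
            s->flags &= ~SLOT_DIRTY;
            (*nflushed)++;
        }

        slot_unlock(s);
    }
    return changed;
}
```

In the sketch, slots for other relations are never locked, which is the point of the unlocked pre-check when the pool is large and the relation is small.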
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 14b77f2861..2fc9f17c28 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -349,8 +349,6 @@ static void pre_sync_fname(const char *fname, bool isdir, int elevel);
static void datadir_fsync_fname(const char *fname, bool isdir, int elevel);
static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel);
-static int fsync_parent_path(const char *fname, int elevel);
-
/*
* pg_fsync --- do fsync with or without writethrough
@@ -3759,7 +3757,7 @@ fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel)
* This is aimed at making file operations persistent on disk in case of
* an OS crash or power failure.
*/
-static int
+int
fsync_parent_path(const char *fname, int elevel)
{
char parentpath[MAXPGPATH];
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index f053fe0495..f28f55baa6 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -16,29 +16,49 @@
#include <unistd.h>
+#include "access/xlogrecovery.h"
+#include "catalog/pg_tablespace_d.h"
#include "common/relpath.h"
#include "postmaster/startup.h"
+#include "storage/bufmgr.h"
#include "storage/copydir.h"
#include "storage/fd.h"
+#include "storage/md.h"
#include "storage/reinit.h"
+#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
- int op);
+ Oid tspid, int op);
static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
- int op);
+ Oid tspid, Oid dbid, int op);
typedef struct
{
Oid reloid; /* hash key */
-} unlogged_relation_entry;
+ bool has_init; /* has INIT fork */
+ bool dirty_init; /* needs to remove INIT fork */
+ bool dirty_all; /* needs to remove all forks */
+} relfile_entry;
/*
- * Reset unlogged relations from before the last restart.
+ * Clean up and reset relation files from before the last restart.
*
- * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any
- * relation with an "init" fork, except for the "init" fork itself.
+ * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations
+ * depending on the existence of the mark files.
+ *
+ * If the SMGR_MARK_UNCOMMITTED mark file for the init fork is present, we
+ * remove the init fork along with the mark file.
+ *
+ * If the SMGR_MARK_UNCOMMITTED mark file for the main fork is present, we
+ * remove the whole relation along with the mark file.
+ *
+ * Otherwise, if the "init" fork is found, we remove all forks of any relation
+ * with the "init" fork, except for the "init" fork itself.
+ *
+ * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all
+ * relations that have mark files and/or "init" forks.
*
* If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main
* fork.
@@ -72,7 +92,7 @@ ResetUnloggedRelations(int op)
/*
* First process unlogged files in pg_default ($PGDATA/base)
*/
- ResetUnloggedRelationsInTablespaceDir("base", op);
+ ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op);
/*
* Cycle through directories for all non-default tablespaces.
@@ -81,13 +101,19 @@ ResetUnloggedRelations(int op)
while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
{
+ Oid tspid;
+
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
continue;
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
- ResetUnloggedRelationsInTablespaceDir(temp_path, op);
+
+ tspid = atooid(spc_de->d_name);
+
+ Assert(tspid != 0);
+ ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op);
}
FreeDir(spc_dir);
@@ -103,7 +129,8 @@ ResetUnloggedRelations(int op)
* Process one tablespace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
+ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
+ Oid tspid, int op)
{
DIR *ts_dir;
struct dirent *de;
@@ -130,6 +157,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
while ((de = ReadDir(ts_dir, tsdirname)) != NULL)
{
+ Oid dbid;
+
/*
* We're only interested in the per-database directories, which have
* numeric names. Note that this code will also (properly) ignore "."
@@ -148,7 +177,10 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
ereport_startup_progress("resetting unlogged relations (cleanup), elapsed time: %ld.%02d s, current path: %s",
dbspace_path);
- ResetUnloggedRelationsInDbspaceDir(dbspace_path, op);
+ dbid = atooid(de->d_name);
+ Assert(dbid != 0);
+
+ ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op);
}
FreeDir(ts_dir);
@@ -158,125 +190,227 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
* Process one per-dbspace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
+ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
+ Oid tspid, Oid dbid, int op)
{
DIR *dbspace_dir;
struct dirent *de;
char rm_path[MAXPGPATH * 2];
+ HTAB *hash;
+ HASHCTL ctl;
/* Caller must specify at least one operation. */
- Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0);
+ Assert((op & (UNLOGGED_RELATION_CLEANUP |
+ UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_INIT)) != 0);
/*
* Cleanup is a two-pass operation. First, we go through and identify all
* the files with init forks. Then, we go through again and nuke
* everything with the same OID except the init fork.
*/
+
+ /*
+ * It's possible that someone could create tons of unlogged relations in
+ * the same database & tablespace, so we'd better use a hash table rather
+ * than an array or linked list to keep track of which files need to be
+ * reset. Otherwise, this cleanup operation would be O(n^2).
+ */
+ memset(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(relfile_entry);
+ hash = hash_create("relfilenode cleanup hash",
+ 32, &ctl, HASH_ELEM | HASH_BLOBS);
+
+ /* Collect INIT fork and mark files in the directory. */
+ dbspace_dir = AllocateDir(dbspacedirname);
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
+ {
+ int oidchars;
+ ForkNumber forkNum;
+ StorageMarks mark;
+
+ /* Skip anything that doesn't look like a relation data file. */
+ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
+ &forkNum, &mark))
+ continue;
+
+ if (forkNum == INIT_FORKNUM || mark == SMGR_MARK_UNCOMMITTED)
+ {
+ Oid key;
+ relfile_entry *ent;
+ bool found;
+
+ /*
+ * Record the relfilenode information. If it has
+ * SMGR_MARK_UNCOMMITTED mark files, the relfilenode is in dirty
+ * state, where clean up is needed.
+ */
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_ENTER, &found);
+
+ if (!found)
+ {
+ ent->has_init = false;
+ ent->dirty_init = false;
+ ent->dirty_all = false;
+ }
+
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_init = true;
+ else if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_all = true;
+ else
+ {
+ Assert(forkNum == INIT_FORKNUM);
+ ent->has_init = true;
+ }
+ }
+ }
+
+ /* Done with the first pass. */
+ FreeDir(dbspace_dir);
+
+ /* Nothing to do if we have neither init forks nor mark files. */
+ if (hash_get_num_entries(hash) < 1)
+ {
+ hash_destroy(hash);
+ return;
+ }
+
+ if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0)
+ {
+ /*
+ * When we come here after recovery, an smgr object for this file
+ * might have been created. In that case we need to drop all buffers
+ * and then the smgr object before initializing the unlogged relation.
+ * This is safe as long as no other backends have accessed the
+ * relation before starting archive recovery.
+ */
+ HASH_SEQ_STATUS status;
+ relfile_entry *ent;
+ SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8);
+ int maxrels = 8;
+ int nrels = 0;
+ int i;
+
+ Assert(!HotStandbyActive());
+
+ hash_seq_init(&status, hash);
+ while ((ent = (relfile_entry *) hash_seq_search(&status)) != NULL)
+ {
+ RelFileNodeBackend rel;
+
+ /*
+ * The relation is persistent and remains persistent. Don't
+ * drop the buffers for this relation.
+ */
+ if (ent->has_init && ent->dirty_init)
+ continue;
+
+ if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = ent->reloid;
+
+ srels[nrels++] = smgropen(rel.node, InvalidBackendId);
+ }
+
+ DropRelFileNodesAllBuffers(srels, nrels);
+
+ for (i = 0; i < nrels; i++)
+ smgrclose(srels[i]);
+ }
+
+ /*
+ * Now, make a second pass and remove anything that matches.
+ */
if ((op & UNLOGGED_RELATION_CLEANUP) != 0)
{
- HTAB *hash;
- HASHCTL ctl;
-
- /*
- * It's possible that someone could create a ton of unlogged relations
- * in the same database & tablespace, so we'd better use a hash table
- * rather than an array or linked list to keep track of which files
- * need to be reset. Otherwise, this cleanup operation would be
- * O(n^2).
- */
- ctl.keysize = sizeof(Oid);
- ctl.entrysize = sizeof(unlogged_relation_entry);
- ctl.hcxt = CurrentMemoryContext;
- hash = hash_create("unlogged relation OIDs", 32, &ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- /* Scan the directory. */
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ int oidchars;
+ Oid key;
+ relfile_entry *ent;
+ RelFileNodeBackend rel;
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* Also skip it unless this is the init fork. */
- if (forkNum != INIT_FORKNUM)
- continue;
-
- /*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
- */
- ent.reloid = atooid(de->d_name);
- (void) hash_search(hash, &ent, HASH_ENTER, NULL);
- }
-
- /* Done with the first pass. */
- FreeDir(dbspace_dir);
-
- /*
- * If we didn't find any init forks, there's no point in continuing;
- * we can bail out now.
- */
- if (hash_get_num_entries(hash) == 0)
- {
- hash_destroy(hash);
- return;
- }
-
- /*
- * Now, make a second pass and remove anything that matches.
- */
- dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
- {
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
-
- /* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* We never remove the init fork. */
- if (forkNum == INIT_FORKNUM)
+ &forkNum, &mark))
continue;
/*
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
- if (hash_search(hash, &ent, HASH_FIND, NULL))
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);
+
+ if (!ent)
+ continue;
+
+ if (!ent->dirty_all)
{
- snprintf(rm_path, sizeof(rm_path), "%s/%s",
- dbspacedirname, de->d_name);
- if (unlink(rm_path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m",
- rm_path)));
+ /* clean permanent relations don't need cleanup */
+ if (!ent->has_init)
+ continue;
+
+ if (ent->dirty_init)
+ {
+ /*
+ * The crashed transaction did SET UNLOGGED. This relation
+ * is restored to a LOGGED relation.
+ */
+ if (forkNum != INIT_FORKNUM)
+ continue;
+ }
else
- elog(DEBUG2, "unlinked file \"%s\"", rm_path);
+ {
+ /*
+ * we don't remove the INIT fork of a non-dirty
+ * relfilenode
+ */
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE)
+ continue;
+ }
}
+
+ /* so, nuke it! */
+ snprintf(rm_path, sizeof(rm_path), "%s/%s",
+ dbspacedirname, de->d_name);
+ if (unlink(rm_path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ rm_path)));
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = atooid(de->d_name);
+
+ ForgetRelationForkSyncRequests(rel, forkNum);
}
/* Cleanup is complete. */
FreeDir(dbspace_dir);
- hash_destroy(hash);
}
+ hash_destroy(hash);
+ hash = NULL;
+
/*
* Initialization happens after cleanup is complete: we copy each init
- * fork file to the corresponding main fork file. Note that if we are
- * asked to do both cleanup and init, we may never get here: if the
- * cleanup code determines that there are no init forks in this dbspace,
- * it will return before we get to this point.
+ * fork file to the corresponding main fork file.
*/
if ((op & UNLOGGED_RELATION_INIT) != 0)
{
@@ -285,6 +419,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char srcpath[MAXPGPATH * 2];
@@ -292,9 +427,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -328,15 +465,18 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char mainpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -379,7 +519,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
*/
bool
parse_filename_for_nontemp_relation(const char *name, int *oidchars,
- ForkNumber *fork)
+ ForkNumber *fork, StorageMarks *mark)
{
int pos;
@@ -410,11 +550,19 @@ parse_filename_for_nontemp_relation(const char *name, int *oidchars,
for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar)
;
- if (segchar <= 1)
- return false;
- pos += segchar;
+ if (segchar > 1)
+ pos += segchar;
}
+ /* mark file? */
+ if (name[pos] == '.' && name[pos + 1] != 0)
+ {
+ *mark = name[pos + 1];
+ pos += 2;
+ }
+ else
+ *mark = SMGR_MARK_NONE;
+
/* Now we should be at the end. */
if (name[pos] != '\0')
return false;
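To make the second pass of the reinit.c hunk above easier to review, here is the per-file removal decision written out as a standalone predicate. This is a sketch that mirrors the branch structure as I read it from the hunk, not code from the patch. `RelfileState`, `Fork`, `should_remove()` and the value `'u'` for the uncommitted mark are assumptions made for illustration; the actual value of SMGR_MARK_UNCOMMITTED is not shown in this part of the diff.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for ForkNumber and StorageMarks. */
typedef enum { FORK_MAIN = 0, FORK_FSM, FORK_VM, FORK_INIT } Fork;

#define MARK_NONE        '\0'
#define MARK_UNCOMMITTED 'u'    /* assumed value, for illustration only */

/* What the first directory pass learned about one relfilenode. */
typedef struct RelfileState
{
    bool has_init;      /* a plain init fork file exists */
    bool dirty_init;    /* the init fork carries an uncommitted mark */
    bool dirty_all;     /* the main fork carries an uncommitted mark */
} RelfileState;

/* Decide whether one file (fork plus optional mark) should be unlinked. */
static bool
should_remove(const RelfileState *st, Fork fork, char mark)
{
    /* Uncommitted creation of the whole relation: remove everything. */
    if (st->dirty_all)
        return true;

    /* No init fork: nothing to clean up for this relfilenode. */
    if (!st->has_init)
        return false;

    /*
     * Crashed SET UNLOGGED: only the init fork (and its mark file) goes
     * away, so the relation is restored as a logged one.
     */
    if (st->dirty_init)
        return fork == FORK_INIT;

    /* Ordinary unlogged relation: keep only the unmarked init fork. */
    return !(fork == FORK_INIT && mark == MARK_NONE);
}
```

Written this way, one case stands out: with `dirty_init` set but `has_init` clear, the predicate keeps every file, including the mark file itself. I have not checked whether that state is reachable.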
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 879f647dbc..4d44bdd78b 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -139,7 +139,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
BlockNumber blkno, bool skipFsync, int behavior);
static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
MdfdVec *seg);
-
+static bool mdmarkexists(SMgrRelation reln, ForkNumber forkNum,
+ StorageMarks mark);
/*
* mdinit() -- Initialize private state for magnetic disk storage manager.
@@ -169,6 +170,82 @@ mdexists(SMgrRelation reln, ForkNumber forkNum)
return (mdopenfork(reln, forkNum, EXTENSION_RETURN_NULL) != NULL);
}
+/*
+ * mdcreatemark() -- Create a mark file.
+ *
+ * If isRedo is true, it's okay for the file to exist already.
+ */
+void
+mdcreatemark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+ int fd;
+
+ /* See mdcreate for details. */
+ TablespaceCreateDbspace(reln->smgr_rnode.node.spcNode,
+ reln->smgr_rnode.node.dbNode,
+ isRedo);
+
+ fd = BasicOpenFile(path, O_WRONLY | O_CREAT | O_EXCL);
+ if (fd < 0 && (!isRedo || errno != EEXIST))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not create mark file \"%s\": %m", path)));
+
+ pg_fsync(fd);
+ close(fd);
+
+ /*
+ * To guarantee that the creation of the file is persistent, fsync its
+ * parent directory.
+ */
+ fsync_parent_path(path, ERROR);
+
+ pfree(path);
+}
+
+
+/*
+ * mdunlinkmark() -- Delete the mark file
+ *
+ * If isRedo is true, it's okay if the file is not found.
+ */
+void
+mdunlinkmark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+
+ if (!isRedo || mdmarkexists(reln, forkNum, mark))
+ durable_unlink(path, ERROR);
+
+ pfree(path);
+}
+
+/*
+ * mdmarkexists() -- Check if the file exists.
+ */
+static bool
+mdmarkexists(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+ int fd;
+
+ fd = BasicOpenFile(path, O_RDONLY);
+ if (fd < 0 && errno != ENOENT)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not access mark file \"%s\": %m", path)));
+ pfree(path);
+
+ if (fd < 0)
+ return false;
+
+ close(fd);
+ return true;
+}
+
/*
* mdcreate() -- Create a new relation on magnetic disk.
*
@@ -1031,6 +1108,15 @@ register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum,
RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ );
}
+/*
+ * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork
+ */
+void
+ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, ForkNumber forknum)
+{
+ register_forget_request(rnode, forknum, 0);
+}
+
/*
* ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB
*/
@@ -1384,12 +1470,14 @@ mdsyncfiletag(const FileTag *ftag, char *path)
* Return 0 on success, -1 on failure, with errno set.
*/
int
-mdunlinkfiletag(const FileTag *ftag, char *path)
+mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark)
{
char *p;
/* Compute the path. */
- p = relpathperm(ftag->rnode, MAIN_FORKNUM);
+ p = GetRelationPath(ftag->rnode.dbNode, ftag->rnode.spcNode,
+ ftag->rnode.relNode, InvalidBackendId, MAIN_FORKNUM,
+ mark);
strlcpy(path, p, MAXPGPATH);
pfree(p);
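The durability sequence used by mdcreatemark() in the md.c hunk above (exclusive create, fsync the file, fsync the parent directory) can be sketched with plain POSIX calls as follows. This is an illustrative sketch only: `create_mark_file()` and `fsync_parent_dir()` are hypothetical names, error reporting is reduced to return codes, and it assumes a platform where fsync() on a directory descriptor opened read-only is permitted (as on Linux). One deliberate difference from the hunk as posted: the sketch calls fsync()/close() only on a valid descriptor, whereas in the hunk pg_fsync(fd) and close(fd) appear to be reached with fd < 0 in the isRedo/EEXIST case.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <libgen.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* fsync the directory that contains path; 0 on success, -1 on error. */
static int
fsync_parent_dir(const char *path)
{
    char  buf[4096];
    char *dir;
    int   fd;
    int   rc;

    if (strlen(path) >= sizeof(buf))
    {
        errno = ENAMETOOLONG;
        return -1;
    }
    strcpy(buf, path);
    dir = dirname(buf);         /* dirname() may modify its argument */

    fd = open(dir, O_RDONLY);
    if (fd < 0)
        return -1;
    rc = fsync(fd);
    close(fd);
    return rc;
}

/*
 * Create an empty mark file durably.  With is_redo set, an already existing
 * file is accepted.  Returns 0 on success, -1 on error.
 */
static int
create_mark_file(const char *path, bool is_redo)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);

    if (fd < 0)
    {
        /* During replay the file may already be there; that is fine. */
        if (is_redo && errno == EEXIST)
            return fsync_parent_dir(path);
        return -1;
    }

    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;

    /* Make the directory entry itself persistent. */
    return fsync_parent_dir(path);
}
```

A second non-redo call on the same path fails because of O_EXCL, while a redo call succeeds; that is the behaviour the "If isRedo is true, it's okay for the file to exist already" comment describes.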
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index d71a557a35..0710e8b145 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -63,6 +63,10 @@ typedef struct f_smgr
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+ void (*smgr_createmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+ void (*smgr_unlinkmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
} f_smgr;
static const f_smgr smgrsw[] = {
@@ -84,6 +88,8 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
+ .smgr_createmark = mdcreatemark,
+ .smgr_unlinkmark = mdunlinkmark,
}
};
@@ -337,6 +343,26 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo);
}
+/*
+ * smgrcreatemark() -- Create a mark file
+ */
+void
+smgrcreatemark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_createmark(reln, forknum, mark, isRedo);
+}
+
+/*
+ * smgrunlinkmark() -- Delete a mark file
+ */
+void
+smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlinkmark(reln, forknum, mark, isRedo);
+}
+
/*
* smgrdosyncall() -- Immediately sync all forks of all given relations
*
@@ -664,6 +690,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index e161d57761..f5ded7cb34 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -90,7 +90,8 @@ static CycleCtr checkpoint_cycle_ctr = 0;
typedef struct SyncOps
{
int (*sync_syncfiletag) (const FileTag *ftag, char *path);
- int (*sync_unlinkfiletag) (const FileTag *ftag, char *path);
+ int (*sync_unlinkfiletag) (const FileTag *ftag, char *path,
+ StorageMarks mark);
bool (*sync_filetagmatches) (const FileTag *ftag,
const FileTag *candidate);
} SyncOps;
@@ -223,7 +224,8 @@ SyncPostCheckpoint(void)
/* Unlink the file */
if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
- path) < 0)
+ path,
+ SMGR_MARK_NONE) < 0)
{
/*
* There's a race condition, when the database is dropped at the
@@ -237,6 +239,20 @@ SyncPostCheckpoint(void)
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m", path)));
}
+ else if (syncsw[entry->tag.handler].sync_unlinkfiletag(
+ &entry->tag, path,
+ SMGR_MARK_UNCOMMITTED) < 0)
+ {
+ /*
+ * We may also have an SMGR_MARK_UNCOMMITTED mark file. Remove it
+ * if the fork file has been successfully removed. It's OK if the
+ * mark file does not exist.
+ */
+ if (errno != ENOENT)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
/* Mark the list entry as canceled, just in case */
entry->canceled = true;
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 56df08c64f..f1382d4c4f 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -407,6 +407,28 @@ extractPageInfo(XLogReaderState *record)
* source system.
*/
}
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_UNLINK)
+ {
+ /*
+ * We can safely ignore these. We'll see that the files don't exist in
+ * the target data dir, and copy them in from the source system. No
+ * need to do anything special here.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_MARK)
+ {
+ /*
+ * We can safely ignore these. The file will be removed from the
+ * target if it doesn't exist in the source system. The files are
+ * empty, so we don't need to bother with the content.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ /*
+ * We can safely ignore these. These don't make any on-disk changes.
+ */
+ }
else if (rmid == RM_XACT_ID &&
((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT ||
(rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED ||
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index efb82a4034..b289df4060 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -412,7 +412,6 @@ main(int argc, char **argv)
if (showprogress)
pg_log_info("reading source file list");
source->traverse_files(source, &process_source_file);
-
if (showprogress)
pg_log_info("reading target file list");
traverse_datadir(datadir_target, &process_target_file);
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 636c96efd3..1c19e16fea 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -139,9 +139,15 @@ GetDatabasePath(Oid dbNode, Oid spcNode)
*/
char *
GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
- int backendId, ForkNumber forkNumber)
+ int backendId, ForkNumber forkNumber, char mark)
{
char *path;
+ char markstr[4];
+
+ if (mark == 0)
+ markstr[0] = 0;
+ else
+ snprintf(markstr, sizeof(markstr), ".%c", mark);
if (spcNode == GLOBALTABLESPACE_OID)
{
@@ -149,10 +155,10 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
Assert(dbNode == 0);
Assert(backendId == InvalidBackendId);
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("global/%u_%s",
- relNode, forkNames[forkNumber]);
+ path = psprintf("global/%u_%s%s",
+ relNode, forkNames[forkNumber], markstr);
else
- path = psprintf("global/%u", relNode);
+ path = psprintf("global/%u%s", relNode, markstr);
}
else if (spcNode == DEFAULTTABLESPACE_OID)
{
@@ -160,22 +166,22 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/%u_%s",
+ path = psprintf("base/%u/%u_%s%s",
dbNode, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/%u",
- dbNode, relNode);
+ path = psprintf("base/%u/%u%s",
+ dbNode, relNode, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/t%d_%u_%s",
+ path = psprintf("base/%u/t%d_%u_%s%s",
dbNode, backendId, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/t%d_%u",
- dbNode, backendId, relNode);
+ path = psprintf("base/%u/t%d_%u%s",
+ dbNode, backendId, relNode, markstr);
}
}
else
@@ -184,27 +190,28 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u_%s%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
dbNode, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
- dbNode, relNode);
+ dbNode, relNode, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
dbNode, backendId, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
- dbNode, backendId, relNode);
+ dbNode, backendId, relNode, markstr);
}
}
+
return path;
}
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 9ffc741913..d362d62ed2 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -23,6 +23,8 @@
extern int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
+extern void RelationCreateInitFork(Relation rel);
+extern void RelationDropInitFork(Relation rel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationPreTruncate(Relation rel);
@@ -41,6 +43,7 @@ extern void RestorePendingSyncs(char *startAddress);
extern void smgrDoPendingDeletes(bool isCommit);
extern void smgrDoPendingSyncs(bool isCommit, bool isParallelWorker);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingCleanups(bool isCommit);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index 622de22b03..8139308634 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -18,17 +18,23 @@
#include "lib/stringinfo.h"
#include "storage/block.h"
#include "storage/relfilenode.h"
+#include "storage/smgr.h"
/*
* Declarations for smgr-related XLOG records
*
- * Note: we log file creation and truncation here, but logging of deletion
- * actions is handled by xact.c, because it is part of transaction commit.
+ * Note: we log file creation, truncation and buffer persistence change here,
+ * but logging of deletion actions is handled mainly by xact.c, because it is
+ * part of transaction commit in most cases. However, there's a case where
+ * init forks are deleted outside control of transaction.
*/
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_UNLINK 0x30
+#define XLOG_SMGR_MARK 0x40
+#define XLOG_SMGR_BUFPERSISTENCE 0x50
typedef struct xl_smgr_create
{
@@ -36,6 +42,32 @@ typedef struct xl_smgr_create
ForkNumber forkNum;
} xl_smgr_create;
+typedef struct xl_smgr_unlink
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+} xl_smgr_unlink;
+
+typedef enum smgr_mark_action
+{
+ XLOG_SMGR_MARK_CREATE = 'c',
+ XLOG_SMGR_MARK_UNLINK = 'u'
+} smgr_mark_action;
+
+typedef struct xl_smgr_mark
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ smgr_mark_action action;
+} xl_smgr_mark;
+
+typedef struct xl_smgr_bufpersistence
+{
+ RelFileNode rnode;
+ bool persistence;
+} xl_smgr_bufpersistence;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +83,12 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark);
+extern void log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark);
+extern void log_smgrbufpersistence(const RelFileNode *rnode, bool persistence);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index a4b5dc853b..a864c91614 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -67,7 +67,7 @@ extern int forkname_chars(const char *str, ForkNumber *fork);
extern char *GetDatabasePath(Oid dbNode, Oid spcNode);
extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
- int backendId, ForkNumber forkNumber);
+ int backendId, ForkNumber forkNumber, char mark);
/*
* Wrapper macros for GetRelationPath. Beware of multiple
@@ -77,7 +77,7 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
/* First argument is a RelFileNode */
#define relpathbackend(rnode, backend, forknum) \
GetRelationPath((rnode).dbNode, (rnode).spcNode, (rnode).relNode, \
- backend, forknum)
+ backend, forknum, 0)
/* First argument is a RelFileNode */
#define relpathperm(rnode, forknum) \
@@ -87,4 +87,9 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
#define relpath(rnode, forknum) \
relpathbackend((rnode).node, (rnode).backend, forknum)
+/* First argument is a RelFileNodeBackend */
+#define markpath(rnode, forknum, mark) \
+ GetRelationPath((rnode).node.dbNode, (rnode).node.spcNode, \
+ (rnode).node.relNode, \
+ (rnode).backend, forknum, mark)
#endif /* RELPATH_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index dd01841c30..739b386216 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -206,6 +206,8 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels)
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+ bool permanent, bool isRedo);
extern void DropRelFileNodesAllBuffers(struct SMgrRelationData **smgr_reln, int nnodes);
extern void DropDatabaseBuffers(Oid dbid);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 29209e2724..8bf746bf45 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -185,6 +185,7 @@ extern ssize_t pg_pwritev_with_retry(int fd,
extern int pg_truncate(const char *path, off_t length);
extern void fsync_fname(const char *fname, bool isdir);
extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
+extern int fsync_parent_path(const char *fname, int elevel);
extern int durable_rename(const char *oldfile, const char *newfile, int loglevel);
extern int durable_unlink(const char *fname, int loglevel);
extern int durable_rename_excl(const char *oldfile, const char *newfile, int loglevel);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 6e46d8d96a..ef5fdaf4f8 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -24,6 +24,10 @@ extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
extern void mdclose(SMgrRelation reln, ForkNumber forknum);
extern void mdrelease(void);
+extern void mdcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void mdunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
@@ -42,12 +46,14 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void ForgetRelationForkSyncRequests(RelFileNodeBackend rnode,
+ ForkNumber forknum);
extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
/* md sync callbacks */
extern int mdsyncfiletag(const FileTag *ftag, char *path);
-extern int mdunlinkfiletag(const FileTag *ftag, char *path);
+extern int mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark);
extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *candidate);
#endif /* MD_H */
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index bf2c10d443..e399aec0c7 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -16,13 +16,15 @@
#define REINIT_H
#include "common/relpath.h"
-
+#include "storage/smgr.h"
extern void ResetUnloggedRelations(int op);
-extern bool parse_filename_for_nontemp_relation(const char *name,
- int *oidchars, ForkNumber *fork);
+extern bool parse_filename_for_nontemp_relation(const char *name, int *oidchars,
+ ForkNumber *fork,
+ StorageMarks *mark);
#define UNLOGGED_RELATION_CLEANUP 0x0001
-#define UNLOGGED_RELATION_INIT 0x0002
+#define UNLOGGED_RELATION_DROP_BUFFER 0x0002
+#define UNLOGGED_RELATION_INIT 0x0004
#endif /* REINIT_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 8e3ef92cda..022654b7b2 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,18 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
+/*
+ * A storage mark is a file whose existence indicates something about another
+ * file. Such a file is named "<filename>.<mark>", where the mark is one of
+ * the values of StorageMarks. Since ".<digit>" denotes a segment file, digits
+ * must not be used as the mark character.
+ */
+typedef enum StorageMarks
+{
+ SMGR_MARK_NONE = 0,
+ SMGR_MARK_UNCOMMITTED = 'u' /* the file is not committed yet */
+} StorageMarks;
+
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
@@ -85,7 +97,12 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
extern void smgrclose(SMgrRelation reln);
extern void smgrcloseall(void);
extern void smgrclosenode(RelFileNodeBackend rnode);
+extern void smgrcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void smgrunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
--
2.27.0
On Tue, Mar 01, 2022 at 02:14:13PM +0900, Kyotaro Horiguchi wrote:
Rebased on a recent xlog refactoring.
It'll come as no surprise that this needs to be rebased again.
At least a few typos I reported in January aren't fixed.
Set to "waiting".
Thanks! Version 20 is attached.
At Wed, 30 Mar 2022 08:44:02 -0500, Justin Pryzby <pryzby@telsasoft.com> wrote in
On Tue, Mar 01, 2022 at 02:14:13PM +0900, Kyotaro Horiguchi wrote:
Rebased on a recent xlog refactoring.
It'll come as no surprise that this needs to be rebased again.
At least a few typos I reported in January aren't fixed.
Set to "waiting".
Oh, I'm sorry for overlooking it. It somehow didn't show up on my
mailer.
I started looking at this and reviewed docs and comments again.
+typedef struct PendingCleanup
+{
+ RelFileNode relnode; /* relation that may need to be deleted */
+ int op; /* operation mask */
+ bool bufpersistence; /* buffer persistence to set */
+ int unlink_forknum; /* forknum to unlink */
This can be of data type "ForkNumber"
Right. Fixed.
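For readers following along, the way entries of this kind are consumed at transaction end can be sketched roughly as below. This is only a simplified, stand-alone illustration of the pending-list pattern, not the patch's code: SketchCleanup, sketch_register, sketch_do_pending and the OP_* masks are hypothetical names, and the sketch ignores nesting levels, backends, WAL and the actual file operations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical operation masks, mirroring the idea of the PCOP_* flags. */
#define OP_UNLINK_FORK		(1 << 0)
#define OP_UNLINK_MARK		(1 << 1)
#define OP_SET_PERSISTENCE	(1 << 2)

/* Hypothetical, stripped-down pending entry. */
typedef struct SketchCleanup
{
	int			op;			/* operation mask */
	bool		at_commit;	/* true = run at commit; false = run at abort */
	struct SketchCleanup *next;
} SketchCleanup;

SketchCleanup *sketch_pending = NULL;	/* head of linked list */

/* Push a new entry onto the head of the pending list. */
void
sketch_register(int op, bool at_commit)
{
	SketchCleanup *p = malloc(sizeof(SketchCleanup));

	assert(p != NULL);
	p->op = op;
	p->at_commit = at_commit;
	p->next = sketch_pending;
	sketch_pending = p;
}

/*
 * At transaction end, "perform" the entries registered for this outcome and
 * discard every entry.  Returns the OR of the operation masks performed, so
 * that the effect can be inspected.
 */
int
sketch_do_pending(bool is_commit)
{
	int			performed = 0;
	SketchCleanup *p = sketch_pending;

	sketch_pending = NULL;
	while (p != NULL)
	{
		SketchCleanup *next = p->next;

		if (p->at_commit == is_commit)
			performed |= p->op;
		free(p);
		p = next;
	}
	return performed;
}
```

In this sketch, registering one abort-time entry (unlink fork, unlink mark, restore persistence) plus one commit-time entry (unlink mark) means a commit only removes the mark, while an abort undoes everything.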
+ * We are going to create an init fork. If server crashes before the
+ * current transaction ends the init fork left alone corrupts data while
+ * recovery. The mark file works as the sentinel to identify that
+ * situation.
s/while/during/
This was in v17, but disappeared in v18.
+ * index-init fork needs further initialization. ambuildempty shoud do
should (I reported this before)
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rnode, InvalidBackendId);
+
+ /*
+ * INIT forks never be loaded to shared buffer so no point in dropping
"are never loaded"
It was fixed in v18.
+ elog(DEBUG1, "perform in-place persistnce change");
persistence (I reported this before)
Sorry. Fixed.
+ /*
+ * While wal_level >= replica, switching to LOGGED requires the
+ * relation content to be WAL-logged to recover the table.
+ * We don't emit this fhile wal_level = minimal.
while (or "if")
There are "While" and "fhile". I changed the latter to "if".
+ * The relation is persistent and stays remain persistent. Don't
+ * drop the buffers for this relation.
"stays remain" is redundant (I reported this before)
Thanks. I changed it to "stays persistent".
+ if (unlink(rm_path) < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ rm_path)));
The parens around errcode are unnecessary since last year.
I suggest avoiding them here and elsewhere.
That code was just moved from elsewhere without editing, but of course I
can do that. I didn't know about that change to ereport and could not find
the corresponding commit, but I found that Tom mentioned the change.
/messages/by-id/5063.1584641224@sss.pgh.pa.us
Now that we can rely on having varargs macros, I think we could
stop requiring the extra level of parentheses, ie instead of
...
ereport(ERROR,
errcode(ERRCODE_DIVISION_BY_ZERO),
errmsg("division by zero"));
(The old syntax had better still work, of course. I'm not advocating
running around and changing existing calls.)
I changed all ereport calls added by this patch to this style.
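To illustrate why both spellings are accepted once the macro is variadic, here is a small stand-alone sketch. It is only loosely modeled on the idea and does not use PostgreSQL's real definitions: sketch_ereport, sketch_errcode, sketch_errmsg and sketch_errfinish are hypothetical stand-ins that merely record what they were given.

```c
#include <assert.h>
#include <stddef.h>

/* State recorded by the hypothetical reporting helpers below. */
int			sketch_last_code = 0;
const char *sketch_last_msg = "";
int			sketch_reports = 0;

/* Stand-in for an auxiliary function such as errcode(). */
int
sketch_errcode(int code)
{
	sketch_last_code = code;
	return 0;
}

/* Stand-in for an auxiliary function such as errmsg(). */
int
sketch_errmsg(const char *msg)
{
	assert(msg != NULL);
	sketch_last_msg = msg;
	return 0;
}

/* Stand-in for the step that finishes and emits the report. */
void
sketch_errfinish(void)
{
	sketch_reports++;
}

/*
 * Variadic macro: the auxiliary calls are spliced into one comma expression.
 * A single parenthesized argument "(a(), b())" is itself a comma expression,
 * and a bare list "a(), b()" becomes one after expansion, so both call
 * styles compile and behave the same way.
 */
#define sketch_ereport(elevel, ...) \
	do { \
		(void) (elevel); \
		__VA_ARGS__, sketch_errfinish(); \
	} while (0)
```

With these stand-ins, `sketch_ereport(1, (sketch_errcode(22012), sketch_errmsg("division by zero")));` and the same call without the inner parentheses record the same code and message, and each bumps the report counter once.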
+ * And we may have SMGR_MARK_UNCOMMITTED file. Remove it if the
+ * fork files has been successfully removed. It's ok if the file
file
Fixed.
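As a reminder of what that mark file looks like on disk, the naming rule from the patch ("<filename>.<mark>", with 'u' for SMGR_MARK_UNCOMMITTED) can be sketched as follows. This is a simplified stand-alone illustration, assuming a non-temporary relation in the default tablespace; sketch_relation_path is a hypothetical helper, not the patched GetRelationPath itself.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical stand-in for the patched path builder.  fork_suffix is NULL
 * for the main fork, otherwise a fork name such as "init".  mark is 0 for
 * "no mark", otherwise a single non-digit character such as 'u'.
 * Returns 0 on success, -1 if the buffer is too small.
 */
int
sketch_relation_path(char *buf, size_t buflen,
					 unsigned db_oid, unsigned rel_oid,
					 const char *fork_suffix, char mark)
{
	char		markstr[3];
	int			n;

	/* ".<digit>" already denotes a segment file, so digits are not allowed */
	assert(mark < '0' || mark > '9');

	if (mark == 0)
		markstr[0] = '\0';
	else
		snprintf(markstr, sizeof(markstr), ".%c", mark);

	if (fork_suffix != NULL)
		n = snprintf(buf, buflen, "base/%u/%u_%s%s",
					 db_oid, rel_oid, fork_suffix, markstr);
	else
		n = snprintf(buf, buflen, "base/%u/%u%s",
					 db_oid, rel_oid, markstr);

	return (n < 0 || (size_t) n >= buflen) ? -1 : 0;
}
```

Under these assumptions, the uncommitted mark for the init fork of relation 16385 in database 16384 would be named base/16384/16385_init.u, next to the fork file base/16384/16385_init.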
+ <para>
+ All tables in the current database in a tablespace can be changed by using
given tablespace
I did s/database in a tablespace/database in the given tablespace/. Is
that right?
+ the <literal>ALL IN TABLESPACE</literal> form, which will lock all tables
which will first lock
+ to be changed first and then change each one. This form also supports
remove "first" here
This is almost a verbatim copy of the description of SET TABLESPACE. This
change makes the two nearly identical descriptions differ slightly in
wording. Anyway, in this version I made the suggested change only in the
part this patch adds.
+ <literal>OWNED BY</literal>, which will only change tables owned by the
+ roles specified. If the <literal>NOWAIT</literal> option is specified
specified roles.
is specified, (comma)
This is the same as above. I did that, but it makes the description
differ from the other, nearly identical description.
+ then the command will fail if it is unable to acquire all of the locks
if it is unable to immediately acquire
+ required immediately. The <literal>information_schema</literal>
remove immediately
Ditto.
+ relations are not considered part of the system catalogs and will be
I think you need to first say that "relations in the pg_catalog schema cannot
be changed".
Mmm. I don't agree with this. Aren't such "exception"-like descriptions
usually placed after the description of how the feature works? This
is also the same structure as SET TABLESPACE.
in patch 2/2:
typo: persistene
Hmm. Bad. I checked the spelling across both patches and found some
typos.
+ * The crashed trasaction did SET UNLOGGED. This relation
+ * is restored to a LOGGED relation.
s/trasaction/transaction/
+ errmsg("could not crete mark file \"%s\": %m", path));
s/crete/create/
Then I rebased on 9c08aea6a3 and ran pgindent.
Thanks!
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v20-0001-In-place-table-persistence-change.patch (text/x-patch; charset=us-ascii)
From d2fb934c20912d4e1fe091805ff4790addd8f77d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v20 1/2] In-place table persistence change
Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, it currently runs a heap rewrite, which causes a large amount
of file I/O. This patch makes the command run without a heap rewrite.
In addition, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of a massive number of HEAP_INSERTs, which should be
smaller.
Also this allows for the cleanup of files left behind when the
transaction that created them crashes.
---
src/backend/access/rmgrdesc/smgrdesc.c | 49 ++
src/backend/access/transam/README | 9 +
src/backend/access/transam/xact.c | 7 +
src/backend/access/transam/xlogrecovery.c | 18 +
src/backend/catalog/storage.c | 548 +++++++++++++++++++++-
src/backend/commands/tablecmds.c | 266 +++++++++--
src/backend/replication/basebackup.c | 3 +-
src/backend/storage/buffer/bufmgr.c | 85 ++++
src/backend/storage/file/fd.c | 4 +-
src/backend/storage/file/reinit.c | 340 ++++++++++----
src/backend/storage/smgr/md.c | 94 +++-
src/backend/storage/smgr/smgr.c | 32 ++
src/backend/storage/sync/sync.c | 21 +-
src/bin/pg_rewind/parsexlog.c | 22 +
src/bin/pg_rewind/pg_rewind.c | 1 -
src/common/relpath.c | 47 +-
src/include/catalog/storage.h | 3 +
src/include/catalog/storage_xlog.h | 42 +-
src/include/common/relpath.h | 9 +-
src/include/storage/bufmgr.h | 2 +
src/include/storage/fd.h | 1 +
src/include/storage/md.h | 8 +-
src/include/storage/reinit.h | 10 +-
src/include/storage/smgr.h | 17 +
src/tools/pgindent/typedefs.list | 7 +
25 files changed, 1464 insertions(+), 181 deletions(-)
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 7547813254..225ffbafef 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,46 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+ char *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+ appendStringInfoString(buf, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+ char *path = GetRelationPath(xlrec->rnode.dbNode,
+ xlrec->rnode.spcNode,
+ xlrec->rnode.relNode,
+ InvalidBackendId,
+ xlrec->forkNum, xlrec->mark);
+ char *action = "<none>";
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ action = "CREATE";
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ action = "DELETE";
+ break;
+ }
+
+ appendStringInfo(buf, "%s %s", action, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+ char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+ appendStringInfoString(buf, path);
+ appendStringInfo(buf, " persistence %d", xlrec->persistence);
+ pfree(path);
+ }
}
const char *
@@ -55,6 +95,15 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_UNLINK:
+ id = "UNLINK";
+ break;
+ case XLOG_SMGR_MARK:
+ id = "MARK";
+ break;
+ case XLOG_SMGR_BUFPERSISTENCE:
+ id = "BUFPERSISTENCE";
+ break;
}
return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c1..2ecd8c8c7c 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -725,6 +725,15 @@ then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.
+The Smgr MARK files
+--------------------------------
+
+An smgr mark file is created when a new relation file is created to
+mark that the relfilenode needs to be cleaned up at recovery time. In
+contrast to the four actions above, failure to remove smgr mark files
+will lead to data loss, in which case the server will shut down.
+
+
Skipping WAL for New RelFileNode
--------------------------------
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 3596a7d734..f48d950895 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2199,6 +2199,9 @@ CommitTransaction(void)
*/
smgrDoPendingSyncs(true, is_parallel_worker);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2449,6 +2452,9 @@ PrepareTransaction(void)
*/
smgrDoPendingSyncs(true, false);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2774,6 +2780,7 @@ AbortTransaction(void)
AfterTriggerEndXact(false); /* 'false' means it's abort */
AtAbort_Portals();
smgrDoPendingSyncs(false, is_parallel_worker);
+ smgrDoPendingCleanups(false);
AtEOXact_LargeObject(false);
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 8d2395dae2..e7786a3851 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "catalog/pg_control.h"
+#include "catalog/storage.h"
#include "commands/tablespace.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -53,6 +54,7 @@
#include "storage/pmsignal.h"
#include "storage/proc.h"
#include "storage/procarray.h"
+#include "storage/reinit.h"
#include "storage/spin.h"
#include "utils/builtins.h"
#include "utils/guc.h"
@@ -1746,6 +1748,14 @@ PerformWalRecovery(void)
}
}
+ /* cleanup garbage files left during crash recovery */
+ if (!InArchiveRecovery)
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
/* Allow resource managers to do any required cleanup. */
for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
{
@@ -3026,6 +3036,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
{
ereport(DEBUG1,
(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+
+ /* cleanup garbage files left during crash recovery */
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
InArchiveRecovery = true;
if (StandbyModeRequested)
StandbyMode = true;
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 9898701a43..ab8ec34c3d 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,6 +19,7 @@
#include "postgres.h"
+#include "access/amapi.h"
#include "access/parallel.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -66,6 +67,23 @@ typedef struct PendingRelDelete
struct PendingRelDelete *next; /* linked-list link */
} PendingRelDelete;
+#define PCOP_UNLINK_FORK (1 << 0)
+#define PCOP_UNLINK_MARK (1 << 1)
+#define PCOP_SET_PERSISTENCE (1 << 2)
+
+typedef struct PendingCleanup
+{
+ RelFileNode relnode; /* relation that may need to be deleted */
+ int op; /* operation mask */
+ bool bufpersistence; /* buffer persistence to set */
+ ForkNumber unlink_forknum; /* forknum to unlink */
+ StorageMarks unlink_mark; /* mark to unlink */
+ BackendId backend; /* InvalidBackendId if not a temp rel */
+ bool atCommit; /* T=delete at commit; F=delete at abort */
+ int nestLevel; /* xact nesting level of request */
+ struct PendingCleanup *next; /* linked-list link */
+} PendingCleanup;
+
typedef struct PendingRelSync
{
RelFileNode rnode;
@@ -73,6 +91,7 @@ typedef struct PendingRelSync
} PendingRelSync;
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingCleanup * pendingCleanups = NULL; /* head of linked list */
HTAB *pendingSyncHash = NULL;
@@ -145,7 +164,14 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence,
return NULL; /* placate compiler */
}
+ /*
+ * We are going to create a new storage file. If the server crashes before
+ * the current transaction ends, the file needs to be cleaned up. The
+ * SMGR_MARK_UNCOMMITTED mark file prompts that work at the next startup.
+ */
srel = smgropen(rnode, backend);
+ log_smgrcreatemark(&rnode, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
smgrcreate(srel, MAIN_FORKNUM, false);
if (needs_wal)
@@ -157,16 +183,30 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence,
*/
if (register_delete)
{
- PendingRelDelete *pending;
+ PendingRelDelete *pendingdel;
+ PendingCleanup *pendingclean;
- pending = (PendingRelDelete *)
+ pendingdel = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
- pending->relnode = rnode;
- pending->backend = backend;
- pending->atCommit = false; /* delete if abort */
- pending->nestLevel = GetCurrentTransactionNestLevel();
- pending->next = pendingDeletes;
- pendingDeletes = pending;
+ pendingdel->relnode = rnode;
+ pendingdel->backend = backend;
+ pendingdel->atCommit = false; /* delete if abort */
+ pendingdel->nestLevel = GetCurrentTransactionNestLevel();
+ pendingdel->next = pendingDeletes;
+ pendingDeletes = pendingdel;
+
+ /* drop mark files at commit */
+ pendingclean = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pendingclean->relnode = rnode;
+ pendingclean->op = PCOP_UNLINK_MARK;
+ pendingclean->unlink_forknum = MAIN_FORKNUM;
+ pendingclean->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pendingclean->backend = backend;
+ pendingclean->atCommit = true;
+ pendingclean->nestLevel = GetCurrentTransactionNestLevel();
+ pendingclean->next = pendingCleanups;
+ pendingCleanups = pendingclean;
}
if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
@@ -178,6 +218,200 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence,
return srel;
}
+/*
+ * RelationCreateInitFork
+ * Create physical storage for the init fork of a relation.
+ *
+ * Create the init fork for the relation.
+ *
+ * This function is transactional. The creation is WAL-logged, and if the
+ * transaction aborts later on, the init fork will be removed.
+ */
+void
+RelationCreateInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ SMgrRelation srel;
+ bool create = true;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), false, false);
+
+ /*
+ * If we have a pending-unlink for the init-fork of this relation, that
+ * means the init-fork exists since before the current transaction
+ * started. This function reverts that change just by removing the entry.
+ * See RelationDropInitFork.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ create = false;
+ }
+ else
+ prev = pending;
+ }
+
+ if (!create)
+ return;
+
+ /* create the init fork, along with the commit-sentinel file */
+ srel = smgropen(rnode, InvalidBackendId);
+ log_smgrcreatemark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+
+ /* We don't have existing init fork, create it. */
+ smgrcreate(srel, INIT_FORKNUM, false);
+
+ /*
+ * init fork for indexes needs further initialization. ambuildempty should
+ * do WAL-log and file sync by itself but otherwise we do that by
+ * ourselves.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rnode, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /* drop the init fork, mark file and revert persistence at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_UNLINK_FORK | PCOP_UNLINK_MARK | PCOP_SET_PERSISTENCE;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->bufpersistence = true;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* drop mark file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
+/*
+ * RelationDropInitFork
+ * Delete physical storage for the init fork of a relation.
+ *
+ * Register pending-delete of the init fork. The real deletion is performed by
+ * smgrDoPendingDeletes at commit.
+ *
+ * This function is transactional. If the transaction aborts later on, the
+ * deletion is canceled.
+ */
+void
+RelationDropInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ bool inxact_created = false;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), true, false);
+
+ /*
+ * If we have a pending-unlink for the init-fork of this relation, that
+ * means the init fork is created in the current transaction. We remove
+ * both the init fork and mark file immediately in that case. Otherwise
+ * just register a pending-unlink for the existing init fork. See
+ * RelationCreateInitFork.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ /* unlink list entry */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ inxact_created = true;
+ }
+ else
+ prev = pending;
+ }
+
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rnode, InvalidBackendId);
+
+ /*
+ * INIT forks are never loaded to shared buffer so no point in
+ * dropping buffers for such files.
+ */
+ log_smgrunlinkmark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrunlinkmark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+ log_smgrunlink(&rnode, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ return;
+ }
+
+ /* register drop of this init fork file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_UNLINK_FORK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = false;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -197,6 +431,88 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum)
+{
+ xl_smgr_unlink xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file unlink.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_CREATEMARK record to WAL.
+ */
+void
+log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the mark file creation.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_CREATE;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINKMARK record to WAL.
+ */
+void
+log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the mark file removal.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_UNLINK;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileNode *rnode, bool persistence)
+{
+ xl_smgr_bufpersistence xlrec;
+
+ /*
+ * Make an XLOG entry reporting the change of buffer persistence.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.persistence = persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -710,6 +1026,95 @@ smgrDoPendingDeletes(bool isCommit)
}
}
+/*
+ * smgrDoPendingCleanups() -- Clean up work that emits WAL records
+ *
+ * The operations handled in this function emit WAL records, which must be
+ * part of the current transaction.
+ */
+void
+smgrDoPendingCleanups(bool isCommit)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (pending->nestLevel < nestLevel)
+ {
+ /* outer-level entries should not be processed yet */
+ prev = pending;
+ }
+ else
+ {
+ /* unlink list entry first, so we don't retry on failure */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ /* do cleanup if called for */
+ if (pending->atCommit == isCommit)
+ {
+ SMgrRelation srel;
+
+ srel = smgropen(pending->relnode, pending->backend);
+
+ Assert((pending->op &
+ ~(PCOP_UNLINK_FORK | PCOP_UNLINK_MARK |
+ PCOP_SET_PERSISTENCE)) == 0);
+
+ if (pending->op & PCOP_SET_PERSISTENCE)
+ {
+ SetRelationBuffersPersistence(srel, pending->bufpersistence,
+ InRecovery);
+ }
+
+ if (pending->op & PCOP_UNLINK_FORK)
+ {
+ /*
+ * Unlink the fork file. Currently we use this only for
+ * init forks and we're sure that the init fork is not
+ * loaded in shared buffers. In the RelationDropInitFork
+ * case, that function dropped the buffers. In the
+ * RelationCreateInitFork case, PCOP_SET_PERSISTENCE(true)
+ * is set and the buffers have been dropped just before.
+ */
+ Assert(pending->unlink_forknum == INIT_FORKNUM);
+
+ /* Don't emit WAL during recovery. */
+ if (!InRecovery)
+ log_smgrunlink(&pending->relnode,
+ pending->unlink_forknum);
+ smgrunlink(srel, pending->unlink_forknum, false);
+ }
+
+ if (pending->op & PCOP_UNLINK_MARK)
+ {
+ SMgrRelation srel;
+
+ if (!InRecovery)
+ log_smgrunlinkmark(&pending->relnode,
+ pending->unlink_forknum,
+ pending->unlink_mark);
+ srel = smgropen(pending->relnode, pending->backend);
+ smgrunlinkmark(srel, pending->unlink_forknum,
+ pending->unlink_mark, InRecovery);
+ smgrclose(srel);
+ }
+ }
+
+ /* must explicitly free the list entry */
+ pfree(pending);
+ /* prev does not change */
+ }
+ }
+}
+
/*
* smgrDoPendingSyncs() -- Take care of relation syncs at end of xact.
*/
@@ -970,6 +1375,15 @@ smgr_redo(XLogReaderState *record)
reln = smgropen(xlrec->rnode, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ smgrunlink(reln, xlrec->forkNum, true);
+ smgrclose(reln);
+ }
else if (info == XLOG_SMGR_TRUNCATE)
{
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
@@ -1058,6 +1472,124 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ bool created = false;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ smgrcreatemark(reln, xlrec->forkNum, xlrec->mark, true);
+ created = true;
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ smgrunlinkmark(reln, xlrec->forkNum, xlrec->mark, true);
+ break;
+ default:
+ elog(ERROR, "unknown smgr_mark action \"%c\"", xlrec->action);
+ }
+
+ if (created)
+ {
+ /* revert mark file operation at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = xlrec->rnode;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = xlrec->forkNum;
+ pending->unlink_mark = xlrec->mark;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ else
+ {
+ /*
+ * Delete pending action for this mark file if any. We should have
+ * at most one entry for this action.
+ */
+ PendingCleanup *prev = NULL;
+
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileNodeEquals(xlrec->rnode, pending->relnode) &&
+ pending->unlink_forknum == xlrec->forkNum &&
+ (pending->op & PCOP_UNLINK_MARK) != 0)
+ {
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+ }
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ PendingCleanup *prev = NULL;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+
+ /*
+ * Delete pending action for persistence change if any. We should have
+ * at most one entry for this action.
+ */
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileNodeEquals(xlrec->rnode, pending->relnode) &&
+ (pending->op & PCOP_SET_PERSISTENCE) != 0)
+ {
+ Assert(pending->bufpersistence == xlrec->persistence);
+
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+
+ /*
+ * Revert buffer-persistence changes at abort if the relation is going
+ * to different persistence from before this transaction.
+ */
+ if (!pending)
+ {
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = xlrec->rnode;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = !xlrec->persistence;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 51b4a00d50..71aaf3320a 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -55,6 +55,7 @@
#include "commands/defrem.h"
#include "commands/event_trigger.h"
#include "commands/policy.h"
+#include "commands/progress.h"
#include "commands/sequence.h"
#include "commands/tablecmds.h"
#include "commands/tablespace.h"
@@ -5374,6 +5375,187 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
return newcmd;
}
+/*
+ * RelationChangePersistence: do in-place persistence change of a relation
+ */
+static void
+RelationChangePersistence(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ int i;
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * If any of the following held, we would need to call ATRewriteTable;
+ * none of them can hold in the AT_REWRITE_ALTER_PERSISTENCE case.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ /*
+ * First we collect all relations that we need to change persistence.
+ */
+
+ /* Collect OIDs of indexes and toast relations */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ /* Add toast relation if any */
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ List *toastidx;
+ Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
+
+ relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
+ toastidx = RelationGetIndexList(toastrel);
+ relids = list_concat(relids, toastidx);
+ pfree(toastidx);
+ table_close(toastrel, NoLock);
+ }
+
+ table_close(rel, NoLock);
+
+ /* Make changes in storage */
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach(lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+
+ /*
+ * XXXX: Some access methods cannot tolerate an in-place persistence
+ * change. Specifically, GiST uses page LSNs to figure out whether a
+ * block has changed, whereas UNLOGGED GiST indexes use fake LSNs that
+ * are incompatible with the real LSNs used for LOGGED ones.
+ *
+ * Maybe if gistGetFakeLSN behaved the same way for permanent and
+ * unlogged indexes, we could skip the index rebuild in exchange for
+ * some extra WAL records emitted while it is unlogged.
+ *
+ * Check relam against a positive list so that unknown AMs also take
+ * the reindex path.
+ */
+ if (r->rd_rel->relkind == RELKIND_INDEX &&
+ /* GiST is excluded */
+ r->rd_rel->relam != BTREE_AM_OID &&
+ r->rd_rel->relam != HASH_AM_OID &&
+ r->rd_rel->relam != GIN_AM_OID &&
+ r->rd_rel->relam != SPGIST_AM_OID &&
+ r->rd_rel->relam != BRIN_AM_OID)
+ {
+ int reindex_flags;
+ ReindexParams params = {0};
+
+ /* reindex doesn't allow concurrent use of the index */
+ table_close(r, NoLock);
+
+ reindex_flags =
+ REINDEX_REL_SUPPRESS_INDEX_USE |
+ REINDEX_REL_CHECK_CONSTRAINTS;
+
+ /* Set the same persistence with the parent relation. */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+
+ reindex_index(reloid, reindex_flags, persistence, &params);
+
+ continue;
+ }
+
+ /* Create or drop init fork */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ RelationCreateInitFork(r);
+ else
+ RelationDropInitFork(r);
+
+ /*
+ * When this relation gets WAL-logged, immediately sync all files but
+ * initfork to establish the initial state on storage. Buffers have
+ * already been flushed out by RelationCreate(Drop)InitFork called just
+ * above. Initfork should have been synced as needed.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT)
+ {
+ for (i = 0; i < INIT_FORKNUM; i++)
+ {
+ if (smgrexists(RelationGetSmgr(r), i))
+ smgrimmedsync(RelationGetSmgr(r), i);
+ }
+ }
+
+ /* Update catalog */
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+
+ /*
+ * While wal_level >= replica, switching to LOGGED requires the
+ * relation content to be WAL-logged to recover the table. We don't
+ * emit this if wal_level = minimal.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded())
+ {
+ ForkNumber fork;
+ xl_smgr_truncate xlrec;
+
+ xlrec.blkno = 0;
+ xlrec.rnode = r->rd_node;
+ xlrec.flags = SMGR_TRUNCATE_ALL;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ for (fork = 0; fork < INIT_FORKNUM; fork++)
+ {
+ if (smgrexists(RelationGetSmgr(r), fork))
+ log_newpage_range(r, fork, 0,
+ smgrnblocks(RelationGetSmgr(r), fork),
+ false);
+ }
+ }
+
+ table_close(r, NoLock);
+ }
+
+ table_close(classRel, NoLock);
+}
+
/*
* ATRewriteTables: ALTER TABLE phase 3
*/
@@ -5504,47 +5686,55 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- /*
- * Create transient table that will receive the modified data.
- *
- * Ensure it is marked correctly as logged or unlogged. We have
- * to do this here so that buffers for the new relfilenode will
- * have the right persistence set, and at the same time ensure
- * that the original filenode's buffers will get read in with the
- * correct setting (i.e. the original one). Otherwise a rollback
- * after the rewrite would possibly result with buffers for the
- * original filenode having the wrong persistence setting.
- *
- * NB: This relies on swap_relation_files() also swapping the
- * persistence. That wouldn't work for pg_class, but that can't be
- * unlogged anyway.
- */
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
- persistence, lockmode);
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
+ RelationChangePersistence(tab, persistence, lockmode);
+ else
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new
+ * relfilenode will have the right persistence set, and at the
+ * same time ensure that the original filenode's buffers will
+ * get read in with the correct setting (i.e. the original
+ * one). Otherwise a rollback after the rewrite would
+ * possibly result with buffers for the original filenode
+ * having the wrong persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that
+ * can't be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace,
+ NewAccessMethod,
+ persistence, lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
+ /*
+ * Swap the physical files of the old and new heaps, then
+ * rebuild indexes and discard the old heap. We can use
+ * RecentXmin for the table's new relfrozenxid because we
+ * rewrote all the tuples in ATRewriteTable, so no older Xid
+ * remains in the table. Also, we never try to swap toast
+ * tables by content, since we have no interest in letting
+ * this code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
- InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ }
}
else
{
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 6884cad2c0..c67bae34f5 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1188,6 +1188,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
bool excludeFound;
ForkNumber relForkNum; /* Type of fork if file is a relation */
int relOidChars; /* Chars in filename that are the rel oid */
+ StorageMarks mark;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1238,7 +1239,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
/* Exclude all forks for unlogged tables except the init fork */
if (isDbDir &&
parse_filename_for_nontemp_relation(de->d_name, &relOidChars,
- &relForkNum))
+ &relForkNum, &mark))
{
/* Never exclude init forks */
if (relForkNum != INIT_FORKNUM)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index d73a40c1bc..01974b71d2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3159,6 +3159,91 @@ DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
}
}
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a relation
+ * and, when switching to PERMANENT, writes all dirty pages of the relation
+ * out to disk (or more accurately, out to kernel disk buffers),
+ * ensuring that the kernel has an up-to-date view of the relation.
+ *
+ * Generally, the caller should be holding AccessExclusiveLock on the
+ * target relation to ensure that no other backend is busy dirtying
+ * more blocks of the relation; the effects can't be expected to last
+ * after the lock is released.
+ *
+ * XXX currently it sequentially searches the buffer pool, should be
+ * changed to more clever ways of searching. This routine is not
+ * used in any performance-critical code paths, so it's not worth
+ * adding additional overhead to normal paths to make it go faster;
+ * but see also DropRelFileNodeBuffers.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
+{
+ int i;
+ RelFileNodeBackend rnode = srel->smgr_rnode;
+
+ Assert(!RelFileNodeBackendIsTemp(rnode));
+
+ if (!isRedo)
+ log_smgrbufpersistence(&srel->smgr_rnode.node, permanent);
+
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ {
+ UnlockBufHdr(bufHdr, buf_state);
+ continue;
+ }
+
+ if (permanent)
+ {
+ /* Init fork is being dropped, drop buffers for it. */
+ if (bufHdr->tag.forkNum == INIT_FORKNUM)
+ {
+ InvalidateBuffer(bufHdr);
+ continue;
+ }
+
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /* we flush this buffer when switching to PERMANENT */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, srel);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr, true);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ /* There shouldn't be an init fork */
+ Assert(bufHdr->tag.forkNum != INIT_FORKNUM);
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ }
+}
+
/* ---------------------------------------------------------------------
* DropRelFileNodesAllBuffers
*
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 14b77f2861..2fc9f17c28 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -349,8 +349,6 @@ static void pre_sync_fname(const char *fname, bool isdir, int elevel);
static void datadir_fsync_fname(const char *fname, bool isdir, int elevel);
static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel);
-static int fsync_parent_path(const char *fname, int elevel);
-
/*
* pg_fsync --- do fsync with or without writethrough
@@ -3759,7 +3757,7 @@ fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel)
* This is aimed at making file operations persistent on disk in case of
* an OS crash or power failure.
*/
-static int
+int
fsync_parent_path(const char *fname, int elevel)
{
char parentpath[MAXPGPATH];
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index f053fe0495..fa4b1c0e6e 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -16,29 +16,49 @@
#include <unistd.h>
+#include "access/xlogrecovery.h"
+#include "catalog/pg_tablespace_d.h"
#include "common/relpath.h"
#include "postmaster/startup.h"
+#include "storage/bufmgr.h"
#include "storage/copydir.h"
#include "storage/fd.h"
+#include "storage/md.h"
#include "storage/reinit.h"
+#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
- int op);
+ Oid tspid, int op);
static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
- int op);
+ Oid tspid, Oid dbid, int op);
typedef struct
{
Oid reloid; /* hash key */
-} unlogged_relation_entry;
+ bool has_init; /* has INIT fork */
+ bool dirty_init; /* needs to remove INIT fork */
+ bool dirty_all; /* needs to remove all forks */
+} relfile_entry;
/*
- * Reset unlogged relations from before the last restart.
+ * Clean up and reset relation files from before the last restart.
*
- * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any
- * relation with an "init" fork, except for the "init" fork itself.
+ * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations
+ * depending on the existence of the "cleanup" forks.
+ *
+ * If SMGR_MARK_UNCOMMITTED mark file for init fork is present, we remove the
+ * init fork along with the mark file.
+ *
+ * If SMGR_MARK_UNCOMMITTED mark file for main fork is present we remove the
+ * whole relation along with the mark file.
+ *
+ * Otherwise, if the "init" fork is found, we remove all forks of any relation
+ * with the "init" fork, except for the "init" fork itself.
+ *
+ * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all
+ * relations that have the "cleanup" and/or the "init" forks.
*
* If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main
* fork.
@@ -72,7 +92,7 @@ ResetUnloggedRelations(int op)
/*
* First process unlogged files in pg_default ($PGDATA/base)
*/
- ResetUnloggedRelationsInTablespaceDir("base", op);
+ ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op);
/*
* Cycle through directories for all non-default tablespaces.
@@ -81,13 +101,19 @@ ResetUnloggedRelations(int op)
while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
{
+ Oid tspid;
+
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
continue;
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
- ResetUnloggedRelationsInTablespaceDir(temp_path, op);
+
+ tspid = atooid(spc_de->d_name);
+
+ Assert(tspid != 0);
+ ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op);
}
FreeDir(spc_dir);
@@ -103,7 +129,8 @@ ResetUnloggedRelations(int op)
* Process one tablespace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
+ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
+ Oid tspid, int op)
{
DIR *ts_dir;
struct dirent *de;
@@ -130,6 +157,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
while ((de = ReadDir(ts_dir, tsdirname)) != NULL)
{
+ Oid dbid;
+
/*
* We're only interested in the per-database directories, which have
* numeric names. Note that this code will also (properly) ignore "."
@@ -148,7 +177,10 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
ereport_startup_progress("resetting unlogged relations (cleanup), elapsed time: %ld.%02d s, current path: %s",
dbspace_path);
- ResetUnloggedRelationsInDbspaceDir(dbspace_path, op);
+ dbid = atooid(de->d_name);
+ Assert(dbid != 0);
+
+ ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op);
}
FreeDir(ts_dir);
@@ -158,125 +190,227 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
* Process one per-dbspace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
+ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
+ Oid tspid, Oid dbid, int op)
{
DIR *dbspace_dir;
struct dirent *de;
char rm_path[MAXPGPATH * 2];
+ HTAB *hash;
+ HASHCTL ctl;
/* Caller must specify at least one operation. */
- Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0);
+ Assert((op & (UNLOGGED_RELATION_CLEANUP |
+ UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_INIT)) != 0);
/*
* Cleanup is a two-pass operation. First, we go through and identify all
* the files with init forks. Then, we go through again and nuke
* everything with the same OID except the init fork.
*/
+
+ /*
+ * It's possible that someone could create tons of unlogged relations in
+ * the same database & tablespace, so we'd better use a hash table rather
+ * than an array or linked list to keep track of which files need to be
+ * reset. Otherwise, this cleanup operation would be O(n^2).
+ */
+ memset(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(relfile_entry);
+ hash = hash_create("relfilenode cleanup hash",
+ 32, &ctl, HASH_ELEM | HASH_BLOBS);
+
+ /* Collect INIT fork and mark files in the directory. */
+ dbspace_dir = AllocateDir(dbspacedirname);
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
+ {
+ int oidchars;
+ ForkNumber forkNum;
+ StorageMarks mark;
+
+ /* Skip anything that doesn't look like a relation data file. */
+ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
+ &forkNum, &mark))
+ continue;
+
+ if (forkNum == INIT_FORKNUM || mark == SMGR_MARK_UNCOMMITTED)
+ {
+ Oid key;
+ relfile_entry *ent;
+ bool found;
+
+ /*
+ * Record the relfilenode information. If it has
+ * SMGR_MARK_UNCOMMITTED mark files, the relfilenode is in dirty
+ * state, where clean up is needed.
+ */
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_ENTER, &found);
+
+ if (!found)
+ {
+ ent->has_init = false;
+ ent->dirty_init = false;
+ ent->dirty_all = false;
+ }
+
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_init = true;
+ else if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_all = true;
+ else
+ {
+ Assert(forkNum == INIT_FORKNUM);
+ ent->has_init = true;
+ }
+ }
+ }
+
+ /* Done with the first pass. */
+ FreeDir(dbspace_dir);
+
+ /* nothing to do if we have neither init nor cleanup forks */
+ if (hash_get_num_entries(hash) < 1)
+ {
+ hash_destroy(hash);
+ return;
+ }
+
+ if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0)
+ {
+ /*
+ * When we come here after recovery, an smgr object for this file might
+ * have been created. In that case we need to drop all buffers and then
+ * the smgr object before initializing the unlogged relation. This is
+ * safe as long as no other backends have accessed the relation before
+ * starting archive recovery.
+ */
+ HASH_SEQ_STATUS status;
+ relfile_entry *ent;
+ SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8);
+ int maxrels = 8;
+ int nrels = 0;
+ int i;
+
+ Assert(!HotStandbyActive());
+
+ hash_seq_init(&status, hash);
+ while ((ent = (relfile_entry *) hash_seq_search(&status)) != NULL)
+ {
+ RelFileNodeBackend rel;
+
+ /*
+ * The relation is persistent and stays persistent. Don't drop the
+ * buffers for this relation.
+ */
+ if (ent->has_init && ent->dirty_init)
+ continue;
+
+ if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = ent->reloid;
+
+ srels[nrels++] = smgropen(rel.node, InvalidBackendId);
+ }
+
+ DropRelFileNodesAllBuffers(srels, nrels);
+
+ for (i = 0; i < nrels; i++)
+ smgrclose(srels[i]);
+ }
+
+ /*
+ * Now, make a second pass and remove anything that matches.
+ */
if ((op & UNLOGGED_RELATION_CLEANUP) != 0)
{
- HTAB *hash;
- HASHCTL ctl;
-
- /*
- * It's possible that someone could create a ton of unlogged relations
- * in the same database & tablespace, so we'd better use a hash table
- * rather than an array or linked list to keep track of which files
- * need to be reset. Otherwise, this cleanup operation would be
- * O(n^2).
- */
- ctl.keysize = sizeof(Oid);
- ctl.entrysize = sizeof(unlogged_relation_entry);
- ctl.hcxt = CurrentMemoryContext;
- hash = hash_create("unlogged relation OIDs", 32, &ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- /* Scan the directory. */
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
- unlogged_relation_entry ent;
+ Oid key;
+ relfile_entry *ent;
+ RelFileNodeBackend rel;
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* Also skip it unless this is the init fork. */
- if (forkNum != INIT_FORKNUM)
- continue;
-
- /*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
- */
- ent.reloid = atooid(de->d_name);
- (void) hash_search(hash, &ent, HASH_ENTER, NULL);
- }
-
- /* Done with the first pass. */
- FreeDir(dbspace_dir);
-
- /*
- * If we didn't find any init forks, there's no point in continuing;
- * we can bail out now.
- */
- if (hash_get_num_entries(hash) == 0)
- {
- hash_destroy(hash);
- return;
- }
-
- /*
- * Now, make a second pass and remove anything that matches.
- */
- dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
- {
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
-
- /* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* We never remove the init fork. */
- if (forkNum == INIT_FORKNUM)
+ &forkNum, &mark))
continue;
/*
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
- if (hash_search(hash, &ent, HASH_FIND, NULL))
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);
+
+ if (!ent)
+ continue;
+
+ if (!ent->dirty_all)
{
- snprintf(rm_path, sizeof(rm_path), "%s/%s",
- dbspacedirname, de->d_name);
- if (unlink(rm_path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m",
- rm_path)));
+ /* clean permanent relations don't need cleanup */
+ if (!ent->has_init)
+ continue;
+
+ if (ent->dirty_init)
+ {
+ /*
+ * The crashed transaction did SET UNLOGGED. Restore the
+ * relation to LOGGED by removing only its init fork.
+ */
+ if (forkNum != INIT_FORKNUM)
+ continue;
+ }
else
- elog(DEBUG2, "unlinked file \"%s\"", rm_path);
+ {
+ /*
+ * we don't remove the INIT fork of a non-dirty
+ * relfilenode
+ */
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE)
+ continue;
+ }
}
+
+ /* so, nuke it! */
+ snprintf(rm_path, sizeof(rm_path), "%s/%s",
+ dbspacedirname, de->d_name);
+ if (unlink(rm_path) < 0)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ rm_path));
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = atooid(de->d_name);
+
+ ForgetRelationForkSyncRequests(rel, forkNum);
}
/* Cleanup is complete. */
FreeDir(dbspace_dir);
- hash_destroy(hash);
}
+ hash_destroy(hash);
+ hash = NULL;
+
/*
* Initialization happens after cleanup is complete: we copy each init
- * fork file to the corresponding main fork file. Note that if we are
- * asked to do both cleanup and init, we may never get here: if the
- * cleanup code determines that there are no init forks in this dbspace,
- * it will return before we get to this point.
+ * fork file to the corresponding main fork file.
*/
if ((op & UNLOGGED_RELATION_INIT) != 0)
{
@@ -285,6 +419,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char srcpath[MAXPGPATH * 2];
@@ -292,9 +427,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -328,15 +465,18 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char mainpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -379,7 +519,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
*/
bool
parse_filename_for_nontemp_relation(const char *name, int *oidchars,
- ForkNumber *fork)
+ ForkNumber *fork, StorageMarks *mark)
{
int pos;
@@ -410,11 +550,19 @@ parse_filename_for_nontemp_relation(const char *name, int *oidchars,
for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar)
;
- if (segchar <= 1)
- return false;
- pos += segchar;
+ if (segchar > 1)
+ pos += segchar;
}
+ /* mark file? */
+ if (name[pos] == '.' && name[pos + 1] != 0)
+ {
+ *mark = name[pos + 1];
+ pos += 2;
+ }
+ else
+ *mark = SMGR_MARK_NONE;
+
/* Now we should be at the end. */
if (name[pos] != '\0')
return false;
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 879f647dbc..692508ea98 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -139,7 +139,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
BlockNumber blkno, bool skipFsync, int behavior);
static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
MdfdVec *seg);
-
+static bool mdmarkexists(SMgrRelation reln, ForkNumber forkNum,
+ StorageMarks mark);
/*
* mdinit() -- Initialize private state for magnetic disk storage manager.
@@ -169,6 +170,82 @@ mdexists(SMgrRelation reln, ForkNumber forkNum)
return (mdopenfork(reln, forkNum, EXTENSION_RETURN_NULL) != NULL);
}
+/*
+ * mdcreatemark() -- Create a mark file.
+ *
+ * If isRedo is true, it's okay for the file to exist already.
+ */
+void
+mdcreatemark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+ int fd;
+
+ /* See mdcreate() for details. */
+ TablespaceCreateDbspace(reln->smgr_rnode.node.spcNode,
+ reln->smgr_rnode.node.dbNode,
+ isRedo);
+
+ fd = BasicOpenFile(path, O_WRONLY | O_CREAT | O_EXCL);
+ if (fd < 0 && (!isRedo || errno != EEXIST))
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not create mark file \"%s\": %m", path));
+
+ pg_fsync(fd);
+ close(fd);
+
+ /*
+ * To guarantee that the creation of the file is persistent, fsync its
+ * parent directory.
+ */
+ fsync_parent_path(path, ERROR);
+
+ pfree(path);
+}
+
+
+/*
+ * mdunlinkmark() -- Delete the mark file
+ *
+ * If isRedo is true, it's okay for the file not to exist.
+ */
+void
+mdunlinkmark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+
+ if (!isRedo || mdmarkexists(reln, forkNum, mark))
+ durable_unlink(path, ERROR);
+
+ pfree(path);
+}
+
+/*
+ * mdmarkexists() -- Check whether the mark file exists.
+ */
+static bool
+mdmarkexists(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+ int fd;
+
+ fd = BasicOpenFile(path, O_RDONLY);
+ if (fd < 0 && errno != ENOENT)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not access mark file \"%s\": %m", path));
+ pfree(path);
+
+ if (fd < 0)
+ return false;
+
+ close(fd);
+ return true;
+}
+
/*
* mdcreate() -- Create a new relation on magnetic disk.
*
@@ -1031,6 +1108,15 @@ register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum,
RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ );
}
+/*
+ * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork
+ */
+void
+ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, ForkNumber forknum)
+{
+ register_forget_request(rnode, forknum, 0);
+}
+
/*
* ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB
*/
@@ -1384,12 +1470,14 @@ mdsyncfiletag(const FileTag *ftag, char *path)
* Return 0 on success, -1 on failure, with errno set.
*/
int
-mdunlinkfiletag(const FileTag *ftag, char *path)
+mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark)
{
char *p;
/* Compute the path. */
- p = relpathperm(ftag->rnode, MAIN_FORKNUM);
+ p = GetRelationPath(ftag->rnode.dbNode, ftag->rnode.spcNode,
+ ftag->rnode.relNode, InvalidBackendId, MAIN_FORKNUM,
+ mark);
strlcpy(path, p, MAXPGPATH);
pfree(p);
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index d71a557a35..0710e8b145 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -63,6 +63,10 @@ typedef struct f_smgr
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+ void (*smgr_createmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+ void (*smgr_unlinkmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
} f_smgr;
static const f_smgr smgrsw[] = {
@@ -84,6 +88,8 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
+ .smgr_createmark = mdcreatemark,
+ .smgr_unlinkmark = mdunlinkmark,
}
};
@@ -337,6 +343,26 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo);
}
+/*
+ * smgrcreatemark() -- Create a mark file
+ */
+void
+smgrcreatemark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_createmark(reln, forknum, mark, isRedo);
+}
+
+/*
+ * smgrunlinkmark() -- Delete a mark file
+ */
+void
+smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlinkmark(reln, forknum, mark, isRedo);
+}
+
/*
* smgrdosyncall() -- Immediately sync all forks of all given relations
*
@@ -664,6 +690,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index c695d816fc..ab11600724 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -91,7 +91,8 @@ static CycleCtr checkpoint_cycle_ctr = 0;
typedef struct SyncOps
{
int (*sync_syncfiletag) (const FileTag *ftag, char *path);
- int (*sync_unlinkfiletag) (const FileTag *ftag, char *path);
+ int (*sync_unlinkfiletag) (const FileTag *ftag, char *path,
+ StorageMarks mark);
bool (*sync_filetagmatches) (const FileTag *ftag,
const FileTag *candidate);
} SyncOps;
@@ -236,7 +237,8 @@ SyncPostCheckpoint(void)
/* Unlink the file */
if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
- path) < 0)
+ path,
+ SMGR_MARK_NONE) < 0)
{
/*
* There's a race condition, when the database is dropped at the
@@ -245,6 +247,21 @@ SyncPostCheckpoint(void)
* here. rmtree() also has to ignore ENOENT errors, to deal with
* the possibility that we delete the file first.
*/
+ if (errno != ENOENT)
+ ereport(WARNING,
+ errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path));
+ }
+ else if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
+ path,
+ SMGR_MARK_UNCOMMITTED)
+ < 0)
+ {
+ /*
+ * We may also have an SMGR_MARK_UNCOMMITTED file. Remove it if the
+ * fork file has been successfully removed. It's OK if the mark file
+ * does not exist.
+ */
if (errno != ENOENT)
ereport(WARNING,
(errcode_for_file_access(),
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 49966e7b7f..c3515e5546 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -416,6 +416,28 @@ extractPageInfo(XLogReaderState *record)
* source system.
*/
}
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_UNLINK)
+ {
+ /*
+ * We can safely ignore these. We'll see that the files don't exist in
+ * the target data dir, and copy them in from the source system. No
+ * need to do anything special here.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_MARK)
+ {
+ /*
+ * We can safely ignore these. The file will be removed from the
+ * target if it doesn't exist in the source system. The files are
+ * empty, so we don't need to bother with their content.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ /*
+ * We can safely ignore these. These don't make any on-disk changes.
+ */
+ }
else if (rmid == RM_XACT_ID &&
((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT ||
(rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED ||
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index b39b5c1aac..9f7235b920 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -413,7 +413,6 @@ main(int argc, char **argv)
if (showprogress)
pg_log_info("reading source file list");
source->traverse_files(source, &process_source_file);
-
if (showprogress)
pg_log_info("reading target file list");
traverse_datadir(datadir_target, &process_target_file);
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 636c96efd3..1c19e16fea 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -139,9 +139,15 @@ GetDatabasePath(Oid dbNode, Oid spcNode)
*/
char *
GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
- int backendId, ForkNumber forkNumber)
+ int backendId, ForkNumber forkNumber, char mark)
{
char *path;
+ char markstr[4];
+
+ if (mark == 0)
+ markstr[0] = 0;
+ else
+ snprintf(markstr, sizeof(markstr), ".%c", mark);
if (spcNode == GLOBALTABLESPACE_OID)
{
@@ -149,10 +155,10 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
Assert(dbNode == 0);
Assert(backendId == InvalidBackendId);
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("global/%u_%s",
- relNode, forkNames[forkNumber]);
+ path = psprintf("global/%u_%s%s",
+ relNode, forkNames[forkNumber], markstr);
else
- path = psprintf("global/%u", relNode);
+ path = psprintf("global/%u%s", relNode, markstr);
}
else if (spcNode == DEFAULTTABLESPACE_OID)
{
@@ -160,22 +166,22 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/%u_%s",
+ path = psprintf("base/%u/%u_%s%s",
dbNode, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/%u",
- dbNode, relNode);
+ path = psprintf("base/%u/%u%s",
+ dbNode, relNode, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/t%d_%u_%s",
+ path = psprintf("base/%u/t%d_%u_%s%s",
dbNode, backendId, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/t%d_%u",
- dbNode, backendId, relNode);
+ path = psprintf("base/%u/t%d_%u%s",
+ dbNode, backendId, relNode, markstr);
}
}
else
@@ -184,27 +190,28 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u_%s%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
dbNode, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
- dbNode, relNode);
+ dbNode, relNode, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
dbNode, backendId, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
- dbNode, backendId, relNode);
+ dbNode, backendId, relNode, markstr);
}
}
+
return path;
}
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 844a023b2c..a685665fab 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -25,6 +25,8 @@ extern int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileNode rnode,
char relpersistence,
bool register_delete);
+extern void RelationCreateInitFork(Relation rel);
+extern void RelationDropInitFork(Relation rel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationPreTruncate(Relation rel);
@@ -43,6 +45,7 @@ extern void RestorePendingSyncs(char *startAddress);
extern void smgrDoPendingDeletes(bool isCommit);
extern void smgrDoPendingSyncs(bool isCommit, bool isParallelWorker);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingCleanups(bool isCommit);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index 622de22b03..d83fc6876e 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -18,17 +18,23 @@
#include "lib/stringinfo.h"
#include "storage/block.h"
#include "storage/relfilenode.h"
+#include "storage/smgr.h"
/*
* Declarations for smgr-related XLOG records
*
- * Note: we log file creation and truncation here, but logging of deletion
- * actions is handled by xact.c, because it is part of transaction commit.
+ * Note: we log file creation, truncation and buffer persistence change here,
+ * but logging of deletion actions is handled mainly by xact.c, because it is
+ * part of transaction commit in most cases. However, there's a case where
+ * init forks are deleted outside the control of a transaction.
*/
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_UNLINK 0x30
+#define XLOG_SMGR_MARK 0x40
+#define XLOG_SMGR_BUFPERSISTENCE 0x50
typedef struct xl_smgr_create
{
@@ -36,6 +42,32 @@ typedef struct xl_smgr_create
ForkNumber forkNum;
} xl_smgr_create;
+typedef struct xl_smgr_unlink
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+} xl_smgr_unlink;
+
+typedef enum smgr_mark_action
+{
+ XLOG_SMGR_MARK_CREATE = 'c',
+ XLOG_SMGR_MARK_UNLINK = 'u'
+} smgr_mark_action;
+
+typedef struct xl_smgr_mark
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ smgr_mark_action action;
+} xl_smgr_mark;
+
+typedef struct xl_smgr_bufpersistence
+{
+ RelFileNode rnode;
+ bool persistence;
+} xl_smgr_bufpersistence;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +83,12 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark);
+extern void log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark);
+extern void log_smgrbufpersistence(const RelFileNode *rnode, bool persistence);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index a4b5dc853b..a864c91614 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -67,7 +67,7 @@ extern int forkname_chars(const char *str, ForkNumber *fork);
extern char *GetDatabasePath(Oid dbNode, Oid spcNode);
extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
- int backendId, ForkNumber forkNumber);
+ int backendId, ForkNumber forkNumber, char mark);
/*
* Wrapper macros for GetRelationPath. Beware of multiple
@@ -77,7 +77,7 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
/* First argument is a RelFileNode */
#define relpathbackend(rnode, backend, forknum) \
GetRelationPath((rnode).dbNode, (rnode).spcNode, (rnode).relNode, \
- backend, forknum)
+ backend, forknum, 0)
/* First argument is a RelFileNode */
#define relpathperm(rnode, forknum) \
@@ -87,4 +87,9 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
#define relpath(rnode, forknum) \
relpathbackend((rnode).node, (rnode).backend, forknum)
+/* First argument is a RelFileNodeBackend */
+#define markpath(rnode, forknum, mark) \
+ GetRelationPath((rnode).node.dbNode, (rnode).node.spcNode, \
+ (rnode).node.relNode, \
+ (rnode).backend, forknum, mark)
#endif /* RELPATH_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index a6b657f0ba..b7db0b2922 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -210,6 +210,8 @@ extern void CreateAndCopyRelationData(RelFileNode src_rnode,
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+ bool permanent, bool isRedo);
extern void DropRelFileNodesAllBuffers(struct SMgrRelationData **smgr_reln, int nnodes);
extern void DropDatabaseBuffers(Oid dbid);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 29209e2724..8bf746bf45 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -185,6 +185,7 @@ extern ssize_t pg_pwritev_with_retry(int fd,
extern int pg_truncate(const char *path, off_t length);
extern void fsync_fname(const char *fname, bool isdir);
extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
+extern int fsync_parent_path(const char *fname, int elevel);
extern int durable_rename(const char *oldfile, const char *newfile, int loglevel);
extern int durable_unlink(const char *fname, int loglevel);
extern int durable_rename_excl(const char *oldfile, const char *newfile, int loglevel);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 6e46d8d96a..18b27d366b 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -24,6 +24,10 @@ extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
extern void mdclose(SMgrRelation reln, ForkNumber forknum);
extern void mdrelease(void);
+extern void mdcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void mdunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
@@ -42,12 +46,14 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void ForgetRelationForkSyncRequests(RelFileNodeBackend rnode,
+ ForkNumber forknum);
extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
/* md sync callbacks */
extern int mdsyncfiletag(const FileTag *ftag, char *path);
-extern int mdunlinkfiletag(const FileTag *ftag, char *path);
+extern int mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark);
extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *candidate);
#endif /* MD_H */
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index bf2c10d443..e399aec0c7 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -16,13 +16,15 @@
#define REINIT_H
#include "common/relpath.h"
-
+#include "storage/smgr.h"
extern void ResetUnloggedRelations(int op);
-extern bool parse_filename_for_nontemp_relation(const char *name,
- int *oidchars, ForkNumber *fork);
+extern bool parse_filename_for_nontemp_relation(const char *name, int *oidchars,
+ ForkNumber *fork,
+ StorageMarks *mark);
#define UNLOGGED_RELATION_CLEANUP 0x0001
-#define UNLOGGED_RELATION_INIT 0x0002
+#define UNLOGGED_RELATION_DROP_BUFFER 0x0002
+#define UNLOGGED_RELATION_INIT 0x0004
#endif /* REINIT_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 8e3ef92cda..43b33b6b8d 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,18 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
+/*
+ * A storage mark is a file whose existence indicates something about another
+ * file. The name of such a file is "<filename>.<mark>", where the mark is one
+ * of the values of StorageMarks. Since ".<digit>" denotes a segment file, don't
+ * use digits for the mark character.
+ */
+typedef enum StorageMarks
+{
+ SMGR_MARK_NONE = 0,
+ SMGR_MARK_UNCOMMITTED = 'u' /* the file is not committed yet */
+} StorageMarks;
+
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
@@ -85,7 +97,12 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
extern void smgrclose(SMgrRelation reln);
extern void smgrcloseall(void);
extern void smgrclosenode(RelFileNodeBackend rnode);
+extern void smgrcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void smgrunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 72fafb795b..181709039c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1890,6 +1890,7 @@ PatternInfoArray
Pattern_Prefix_Status
Pattern_Type
PendingFsyncEntry
+PendingMarkCleanup
PendingRelDelete
PendingRelSync
PendingUnlinkEntry
@@ -2535,6 +2536,7 @@ StdRdOptIndexCleanup
StdRdOptions
Step
StopList
+StorageMarks
StrategyNumber
StreamCtl
StreamXidHash
@@ -3502,6 +3504,7 @@ registered_buffer
regmatch_t
regoff_t
regproc
+relfile_entry
relopt_bool
relopt_enum
relopt_enum_elt_def
@@ -3555,6 +3558,7 @@ slist_iter
slist_mutable_iter
slist_node
slock_t
+smgr_mark_action
socket_set
spgBulkDeleteState
spgChooseIn
@@ -3755,8 +3759,11 @@ xl_replorigin_set
xl_restore_point
xl_running_xacts
xl_seq_rec
+xl_smgr_bufpersistence
xl_smgr_create
+xl_smgr_mark
xl_smgr_truncate
+xl_smgr_unlink
xl_standby_lock
xl_standby_locks
xl_tblspc_create_rec
--
2.27.0
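For readers following the file-naming change in the patch above, here is a
minimal standalone sketch of the "<filename>.<mark>" convention. It is not
code from the patch: sketch_mark_path() and sketch_parse_mark() are
hypothetical helpers that use only the C standard library, SketchMark is a
stand-in for StorageMarks, and only the default-tablespace, non-temp layout
(base/<db>/<rel>) is covered. The parser follows the same idea as the patched
parse_filename_for_nontemp_relation(), but as an assumption of this sketch it
does not validate fork names and it rejects a digit as a mark character.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the patch's StorageMarks enum. */
typedef enum SketchMark
{
	SKETCH_MARK_NONE = 0,
	SKETCH_MARK_UNCOMMITTED = 'u'
} SketchMark;

/*
 * Build "base/<db>/<rel>[_<fork>][.<mark>]" into buf.
 * fork may be NULL or "" for the main fork.
 * Returns what snprintf() returns.
 */
int
sketch_mark_path(char *buf, size_t buflen, unsigned db, unsigned rel,
				 const char *fork, SketchMark mark)
{
	char		markstr[3] = "";	/* ".c" plus terminator */

	if (mark != SKETCH_MARK_NONE)
		snprintf(markstr, sizeof(markstr), ".%c", (char) mark);

	if (fork != NULL && fork[0] != '\0')
		return snprintf(buf, buflen, "base/%u/%u_%s%s", db, rel, fork, markstr);
	return snprintf(buf, buflen, "base/%u/%u%s", db, rel, markstr);
}

/*
 * Parse "<digits>[_<fork>][.<segment digits>][.<mark char>]".
 * On success, stores the mark (or SKETCH_MARK_NONE) and returns true.
 */
bool
sketch_parse_mark(const char *name, SketchMark *mark)
{
	size_t		pos = 0;

	/* relfilenode digits */
	while (isdigit((unsigned char) name[pos]))
		pos++;
	if (pos == 0)
		return false;

	/* optional "_fork" suffix; lowercase letters only in this sketch */
	if (name[pos] == '_')
	{
		size_t		start = ++pos;

		while (islower((unsigned char) name[pos]))
			pos++;
		if (pos == start)
			return false;
	}

	/* optional ".<digits>" segment number */
	if (name[pos] == '.' && isdigit((unsigned char) name[pos + 1]))
	{
		pos++;
		while (isdigit((unsigned char) name[pos]))
			pos++;
	}

	/* optional ".<mark>": exactly one non-digit character */
	*mark = SKETCH_MARK_NONE;
	if (name[pos] == '.' && name[pos + 1] != '\0' &&
		!isdigit((unsigned char) name[pos + 1]))
	{
		*mark = (SketchMark) name[pos + 1];
		pos += 2;
	}

	/* anything left over means this is not a relation file name */
	return name[pos] == '\0';
}
```

With these helpers, the init fork of relfilenode 16384 in database 5 with an
uncommitted mark would be named base/5/16384_init.u, and a name such as
16384.1.u would parse as segment 1 carrying the same mark. Keeping digits out
of the mark character is what lets a segment suffix and a mark suffix be told
apart.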
Attachment: v20-0002-New-command-ALTER-TABLE-ALL-IN-TABLESPACE-SET-LO.patch (text/x-patch; charset=us-ascii)
From 510c31a90a7874a431b0ed9c669fa7c39f9e68fe Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 23:21:09 +0900
Subject: [PATCH v20 2/2] New command ALTER TABLE ALL IN TABLESPACE SET
LOGGED/UNLOGGED
To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes
relation persistence of all tables in the specified tablespace.
---
doc/src/sgml/ref/alter_table.sgml | 16 +++
src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++
src/backend/nodes/copyfuncs.c | 16 +++
src/backend/nodes/equalfuncs.c | 15 +++
src/backend/parser/gram.y | 42 +++++++
src/backend/tcop/utility.c | 11 ++
src/include/commands/tablecmds.h | 2 +
src/include/nodes/nodes.h | 1 +
src/include/nodes/parsenodes.h | 10 ++
src/test/regress/expected/tablespace.out | 76 ++++++++++++
src/test/regress/sql/tablespace.sql | 41 +++++++
11 files changed, 370 insertions(+)
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 5c0735e08a..5ae825b30f 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -33,6 +33,8 @@ ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
SET SCHEMA <replaceable class="parameter">new_schema</replaceable>
ALTER TABLE ALL IN TABLESPACE <replaceable class="parameter">name</replaceable> [ OWNED BY <replaceable class="parameter">role_name</replaceable> [, ... ] ]
SET TABLESPACE <replaceable class="parameter">new_tablespace</replaceable> [ NOWAIT ]
+ALTER TABLE ALL IN TABLESPACE <replaceable class="parameter">name</replaceable> [ OWNED BY <replaceable class="parameter">role_name</replaceable> [, ... ] ]
+ SET { LOGGED | UNLOGGED } [ NOWAIT ]
ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
ATTACH PARTITION <replaceable class="parameter">partition_name</replaceable> { FOR VALUES <replaceable class="parameter">partition_bound_spec</replaceable> | DEFAULT }
ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
@@ -753,6 +755,20 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
(see <xref linkend="sql-createtable-unlogged"/>). It cannot be applied
to a temporary table.
</para>
+
+ <para>
+ The persistence of all tables in the current database in a tablespace can be changed by
+ using the <literal>ALL IN TABLESPACE</literal> form, which will first
+ lock all tables to be changed and then change each one. This form also
+ supports
+ <literal>OWNED BY</literal>, which will only change tables owned by the
+ specified roles. If the <literal>NOWAIT</literal> option is specified,
+ then the command will fail if it is unable to immediately acquire all of
+ the locks required. The <literal>information_schema</literal> relations
+ are not considered part of the system catalogs and will be changed. See
+ also
+ <link linkend="sql-createtablespace"><command>CREATE TABLESPACE</command></link>.
+ </para>
</listitem>
</varlistentry>
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 71aaf3320a..5442f790ed 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14794,6 +14794,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
return new_tablespaceoid;
}
+/*
+ * Alter Table ALL ... SET LOGGED/UNLOGGED
+ *
+ * Allows a user to change persistence of all objects in a given tablespace in
+ * the current database. Objects can also be chosen based on the owner of the
+ * object, to allow users to change persistence of only their own objects. The
+ * main permissions handling is done by the lower-level change persistence
+ * function.
+ *
+ * All to-be-modified objects are locked first. If NOWAIT is specified and the
+ * lock can't be acquired then we ereport(ERROR).
+ */
+void
+AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt * stmt)
+{
+ List *relations = NIL;
+ ListCell *l;
+ ScanKeyData key[1];
+ Relation rel;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ Oid tablespaceoid;
+ List *role_oids = roleSpecsToIds(stmt->roles);
+
+ /* Ensure we were not asked to change something we can't */
+ if (stmt->objtype != OBJECT_TABLE)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("only tables can be specified"));
+
+ /* Get the tablespace OID */
+ tablespaceoid = get_tablespace_oid(stmt->tablespacename, false);
+
+ /*
+ * Now that the checks are done, check if we should set it to
+ * InvalidOid because it is our database's default tablespace.
+ */
+ if (tablespaceoid == MyDatabaseTableSpace)
+ tablespaceoid = InvalidOid;
+
+ /*
+ * Walk the list of objects in the tablespace to pick them up. This will
+ * only find objects in our database, of course.
+ */
+ ScanKeyInit(&key[0],
+ Anum_pg_class_reltablespace,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(tablespaceoid));
+
+ rel = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(rel, 1, key);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relOid = relForm->oid;
+
+ /*
+ * Do not pick up objects in pg_catalog as part of this; if an admin
+ * really wishes to do so, they can issue the individual ALTER
+ * commands directly.
+ *
+ * Also, explicitly avoid any shared tables, temp tables, or TOAST
+ * (TOAST will be changed with the main table).
+ */
+ if (IsCatalogNamespace(relForm->relnamespace) ||
+ relForm->relisshared ||
+ isAnyTempNamespace(relForm->relnamespace) ||
+ IsToastNamespace(relForm->relnamespace))
+ continue;
+
+ /* Only pick up the object type requested */
+ if (relForm->relkind != RELKIND_RELATION)
+ continue;
+
+ /* Check if we are only picking up objects owned by certain roles */
+ if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner))
+ continue;
+
+ /*
+ * Handle permissions-checking here since we are locking the tables
+ * and also to avoid doing a bunch of work only to fail part-way. Note
+ * that permissions will also be checked by AlterTableInternal().
+ *
+ * Caller must be considered an owner of the table whose persistence
+ * we're going to change.
+ */
+ if (!pg_class_ownercheck(relOid, GetUserId()))
+ aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)),
+ NameStr(relForm->relname));
+
+ if (stmt->nowait &&
+ !ConditionalLockRelationOid(relOid, AccessExclusiveLock))
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("aborting because lock on relation \"%s.%s\" is not available",
+ get_namespace_name(relForm->relnamespace),
+ NameStr(relForm->relname)));
+ else
+ LockRelationOid(relOid, AccessExclusiveLock);
+
+ /*
+ * Add to our list of objects of which we're going to change
+ * persistence.
+ */
+ relations = lappend_oid(relations, relOid);
+ }
+
+ table_endscan(scan);
+ table_close(rel, AccessShareLock);
+
+ if (relations == NIL)
+ ereport(NOTICE,
+ errcode(ERRCODE_NO_DATA_FOUND),
+ errmsg("no matching relations in tablespace \"%s\" found",
+ tablespaceoid == InvalidOid ? "(database default)" :
+ get_tablespace_name(tablespaceoid)));
+
+ /*
+ * Everything is locked, loop through and change persistence of all of the
+ * relations.
+ */
+ foreach(l, relations)
+ {
+ List *cmds = NIL;
+ AlterTableCmd *cmd = makeNode(AlterTableCmd);
+
+ if (stmt->logged)
+ cmd->subtype = AT_SetLogged;
+ else
+ cmd->subtype = AT_SetUnLogged;
+
+ cmds = lappend(cmds, cmd);
+
+ EventTriggerAlterTableStart((Node *) stmt);
+ /* OID is set by AlterTableInternal */
+ AlterTableInternal(lfirst_oid(l), cmds, false);
+ EventTriggerAlterTableEnd();
+ }
+}
+
static void
index_copy_data(Relation rel, RelFileNode newrnode)
{
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 56505557bf..722464ab6e 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4713,6 +4713,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from)
return newnode;
}
+static AlterTableSetLoggedAllStmt *
+_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt * from)
+{
+ AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt);
+
+ COPY_STRING_FIELD(tablespacename);
+ COPY_SCALAR_FIELD(objtype);
+ COPY_SCALAR_FIELD(logged);
+ COPY_SCALAR_FIELD(nowait);
+
+ return newnode;
+}
+
static CreateExtensionStmt *
_copyCreateExtensionStmt(const CreateExtensionStmt *from)
{
@@ -6156,6 +6169,9 @@ copyObjectImpl(const void *from)
case T_AlterTableMoveAllStmt:
retval = _copyAlterTableMoveAllStmt(from);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _copyAlterTableSetLoggedAllStmt(from);
+ break;
case T_CreateExtensionStmt:
retval = _copyCreateExtensionStmt(from);
break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 9ea3c5abf2..04e00cd7f4 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -2221,6 +2221,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a,
return true;
}
+static bool
+_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt * a,
+ const AlterTableSetLoggedAllStmt * b)
+{
+ COMPARE_STRING_FIELD(tablespacename);
+ COMPARE_SCALAR_FIELD(objtype);
+ COMPARE_SCALAR_FIELD(logged);
+ COMPARE_SCALAR_FIELD(nowait);
+
+ return true;
+}
+
static bool
_equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b)
{
@@ -4012,6 +4024,9 @@ equal(const void *a, const void *b)
case T_AlterTableMoveAllStmt:
retval = _equalAlterTableMoveAllStmt(a, b);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _equalAlterTableSetLoggedAllStmt(a, b);
+ break;
case T_CreateExtensionStmt:
retval = _equalCreateExtensionStmt(a, b);
break;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index eefcf90187..3754e758bd 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -2077,6 +2077,48 @@ AlterTableStmt:
n->nowait = $13;
$$ = (Node *)n;
}
+ | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = true;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name OWNED BY role_list SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->roles = $9;
+ n->logged = true;
+ n->nowait = $12;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = false;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name OWNED BY role_list SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->roles = $9;
+ n->logged = false;
+ n->nowait = $12;
+ $$ = (Node *)n;
+ }
| ALTER INDEX qualified_name alter_table_cmds
{
AlterTableStmt *n = makeNode(AlterTableStmt);
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index f364a9b88a..f3670a56a2 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -164,6 +164,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_AlterTSConfigurationStmt:
case T_AlterTSDictionaryStmt:
case T_AlterTableMoveAllStmt:
+ case T_AlterTableSetLoggedAllStmt:
case T_AlterTableSpaceOptionsStmt:
case T_AlterTableStmt:
case T_AlterTypeStmt:
@@ -1754,6 +1755,12 @@ ProcessUtilitySlow(ParseState *pstate,
commandCollected = true;
break;
+ case T_AlterTableSetLoggedAllStmt:
+ AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree);
+ /* commands are stashed in AlterTableSetLoggedAll */
+ commandCollected = true;
+ break;
+
case T_DropStmt:
ExecDropStmt((DropStmt *) parsetree, isTopLevel);
/* no commands stashed for DROP */
@@ -2682,6 +2689,10 @@ CreateCommandTag(Node *parsetree)
tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype);
+ break;
+
case T_AlterTableStmt:
tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype);
break;
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index 5d4037f26e..09bb75d6a0 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
+extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt * stmt);
+
extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
Oid *oldschema);
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 53f6b05a3f..c078478376 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -444,6 +444,7 @@ typedef enum NodeTag
T_AlterCollationStmt,
T_CallStmt,
T_AlterStatsStmt,
+ T_AlterTableSetLoggedAllStmt,
/*
* TAGS FOR PARSE TREE NODES (parsenodes.h)
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index c24fc26da1..dda1f67a35 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2608,6 +2608,16 @@ typedef struct AlterTableMoveAllStmt
bool nowait;
} AlterTableMoveAllStmt;
+typedef struct AlterTableSetLoggedAllStmt
+{
+ NodeTag type;
+ char *tablespacename;
+ ObjectType objtype; /* Object type to change persistence of */
+ List *roles; /* List of roles to change objects of */
+ bool logged;
+ bool nowait;
+} AlterTableSetLoggedAllStmt;
+
/* ----------------------
* Create/Alter Extension Statements
* ----------------------
diff --git a/src/test/regress/expected/tablespace.out b/src/test/regress/expected/tablespace.out
index c52cf1cfcf..a679d58553 100644
--- a/src/test/regress/expected/tablespace.out
+++ b/src/test/regress/expected/tablespace.out
@@ -966,5 +966,81 @@ drop cascades to table testschema.part
drop cascades to table testschema.atable
drop cascades to materialized view testschema.amv
drop cascades to table testschema.tablespace_acl
+--
+-- Check persistence change in a tablespace
+CREATE SCHEMA testschema;
+GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1;
+CREATE TABLESPACE regress_tablespace LOCATION '';
+GRANT CREATE ON TABLESPACE regress_tablespace TO regress_tablespace_user1;
+CREATE TABLE testschema.lsu(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.usu(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lsu(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._usu(a int) TABLESPACE pg_default;
+SET ROLE regress_tablespace_user1;
+CREATE TABLE testschema.lu1(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.uu1(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lu1(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._uu1(a int) TABLESPACE pg_default;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | p
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | p
+ uu1 | regress_tablespace | u
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace
+ OWNED BY regress_tablespace_user1 SET LOGGED;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | p
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | p
+ uu1 | regress_tablespace | p
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+RESET ROLE;
+ALTER TABLE ALL IN TABLESPACE regress_tablespace SET UNLOGGED;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | u
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | u
+ uu1 | regress_tablespace | u
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+-- Should succeed
+DROP SCHEMA testschema CASCADE;
+NOTICE: drop cascades to 8 other objects
+DETAIL: drop cascades to table testschema.lsu
+drop cascades to table testschema.usu
+drop cascades to table testschema._lsu
+drop cascades to table testschema._usu
+drop cascades to table testschema.lu1
+drop cascades to table testschema.uu1
+drop cascades to table testschema._lu1
+drop cascades to table testschema._uu1
+DROP TABLESPACE regress_tablespace;
DROP ROLE regress_tablespace_user1;
DROP ROLE regress_tablespace_user2;
diff --git a/src/test/regress/sql/tablespace.sql b/src/test/regress/sql/tablespace.sql
index 21db433f2a..a4b664f4e0 100644
--- a/src/test/regress/sql/tablespace.sql
+++ b/src/test/regress/sql/tablespace.sql
@@ -431,5 +431,46 @@ DROP TABLESPACE regress_tblspace_renamed;
DROP SCHEMA testschema CASCADE;
+
+--
+-- Check persistence change in a tablespace
+CREATE SCHEMA testschema;
+GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1;
+CREATE TABLESPACE regress_tablespace LOCATION '';
+GRANT CREATE ON TABLESPACE regress_tablespace TO regress_tablespace_user1;
+
+CREATE TABLE testschema.lsu(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.usu(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lsu(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._usu(a int) TABLESPACE pg_default;
+SET ROLE regress_tablespace_user1;
+CREATE TABLE testschema.lu1(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.uu1(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lu1(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._uu1(a int) TABLESPACE pg_default;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace
+ OWNED BY regress_tablespace_user1 SET LOGGED;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+RESET ROLE;
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace SET UNLOGGED;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+-- Should succeed
+DROP SCHEMA testschema CASCADE;
+DROP TABLESPACE regress_tablespace;
+
DROP ROLE regress_tablespace_user1;
DROP ROLE regress_tablespace_user2;
--
2.27.0
On Thu, Mar 31, 2022 at 01:58:45PM +0900, Kyotaro Horiguchi wrote:
Thanks! Version 20 is attached.
The patch failed on all CI tasks, and seems to have caused the macos task to
hang.
http://cfbot.cputube.org/kyotaro-horiguchi.html
Would you send a fixed patch, or remove this thread from the CFBOT? Otherwise
cirrus will try every day to rerun it but take 1hr to time out, which is twice
as slow as the slowest OS.
I think this patch should be moved to the next CF and set to v16.
Thanks,
--
Justin
At Thu, 31 Mar 2022 00:37:07 -0500, Justin Pryzby <pryzby@telsasoft.com> wrote in
On Thu, Mar 31, 2022 at 01:58:45PM +0900, Kyotaro Horiguchi wrote:
Thanks! Version 20 is attached.
The patch failed on all CI tasks, and seems to have caused the macos task to
hang.
http://cfbot.cputube.org/kyotaro-horiguchi.html
Would you send a fixed patch, or remove this thread from the CFBOT? Otherwise
cirrus will try every day to rerun it but take 1hr to time out, which is twice
as slow as the slowest OS.
That turned out to be a thinko that leaves mark files behind in a new
database created by the logged version of CREATE DATABASE. It is
easily fixed.
That being said, this failure revealed that pg_checksums or
pg_basebackup dislikes the mark files. Although that happens only with
quite low probability, it would need further consideration and extra
rounds of review.
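For readers following the thread, the mark-file lifecycle described in the README hunk of the attached patch can be modelled roughly as below. This is only a toy sketch in C under stated assumptions: `SketchFork` and the `sketch_*` functions are hypothetical names invented for illustration, not part of the patch, and the real code operates on files through smgr and WAL-logs each step. It shows why a mark left behind on a committed file matters: cleanup at startup treats any marked file as uncommitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-memory stand-in for one relation fork on disk. */
typedef struct SketchFork
{
    bool file_exists;   /* the storage file itself */
    bool mark_exists;   /* the "uncommitted" mark file next to it */
} SketchFork;

/*
 * Create the mark before the file, so that a crash in between can leave a
 * stray mark but never an unmarked, uncommitted file.
 */
static void
sketch_create(SketchFork *f)
{
    f->mark_exists = true;
    f->file_exists = true;
}

/* Commit: the file becomes permanent, so only the mark is removed. */
static void
sketch_commit(SketchFork *f)
{
    f->mark_exists = false;
}

/* Abort: the file and its mark are both removed. */
static void
sketch_abort(SketchFork *f)
{
    f->file_exists = false;
    f->mark_exists = false;
}

/*
 * Startup after a crash: a fork that still carries a mark belongs to a
 * transaction that never committed, so the file and the mark are removed.
 * Returns the number of forks cleaned up.
 */
static int
sketch_recover(SketchFork *forks, size_t n)
{
    int cleaned = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (forks[i].mark_exists)
        {
            forks[i].file_exists = false;
            forks[i].mark_exists = false;
            cleaned++;
        }
    }
    return cleaned;
}
```

In this model, calling `sketch_create` followed by `sketch_commit` leaves a file that `sketch_recover` never touches, while a fork that was created but neither committed nor aborted before the crash is removed at recovery.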
I think this patch should be moved to the next CF and set to v16.
I don't think this can be committed to 15. So I will post the fixed version
and then move this to the next CF.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v21-0001-In-place-table-persistence-change.patch (text/x-patch; charset=us-ascii)
From c4a2e19cd51c3a1470a916a235ab7c0e72ff498c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v21 1/2] In-place table persistence change
Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs a heap rewrite, which causes a large amount of
file I/O. This patch makes the command run without a heap rewrite.
In addition, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of a massive number of HEAP_INSERTs, which should be
smaller.
This also allows for the cleanup of files left behind when the
transaction that created them crashes.
---
src/backend/access/rmgrdesc/smgrdesc.c | 49 ++
src/backend/access/transam/README | 9 +
src/backend/access/transam/xact.c | 7 +
src/backend/access/transam/xlogrecovery.c | 18 +
src/backend/catalog/storage.c | 548 +++++++++++++++++++++-
src/backend/commands/tablecmds.c | 266 +++++++++--
src/backend/replication/basebackup.c | 3 +-
src/backend/storage/buffer/bufmgr.c | 85 ++++
src/backend/storage/file/fd.c | 4 +-
src/backend/storage/file/reinit.c | 340 ++++++++++----
src/backend/storage/smgr/md.c | 94 +++-
src/backend/storage/smgr/smgr.c | 32 ++
src/backend/storage/sync/sync.c | 21 +-
src/bin/pg_rewind/parsexlog.c | 22 +
src/bin/pg_rewind/pg_rewind.c | 1 -
src/common/relpath.c | 47 +-
src/include/catalog/storage.h | 3 +
src/include/catalog/storage_xlog.h | 42 +-
src/include/common/relpath.h | 9 +-
src/include/storage/bufmgr.h | 2 +
src/include/storage/fd.h | 1 +
src/include/storage/md.h | 8 +-
src/include/storage/reinit.h | 10 +-
src/include/storage/smgr.h | 17 +
src/tools/pgindent/typedefs.list | 7 +
25 files changed, 1464 insertions(+), 181 deletions(-)
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 7547813254..225ffbafef 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,46 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+ char *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+ appendStringInfoString(buf, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+ char *path = GetRelationPath(xlrec->rnode.dbNode,
+ xlrec->rnode.spcNode,
+ xlrec->rnode.relNode,
+ InvalidBackendId,
+ xlrec->forkNum, xlrec->mark);
+ char *action = "<none>";
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ action = "CREATE";
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ action = "DELETE";
+ break;
+ }
+
+ appendStringInfo(buf, "%s %s", action, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+ char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+ appendStringInfoString(buf, path);
+ appendStringInfo(buf, " persistence %d", xlrec->persistence);
+ pfree(path);
+ }
}
const char *
@@ -55,6 +95,15 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_UNLINK:
+ id = "UNLINK";
+ break;
+ case XLOG_SMGR_MARK:
+ id = "MARK";
+ break;
+ case XLOG_SMGR_BUFPERSISTENCE:
+ id = "BUFPERSISTENCE";
+ break;
}
return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c1..2ecd8c8c7c 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -725,6 +725,15 @@ then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.
+The Smgr MARK files
+--------------------------------
+
+An smgr mark file is created when a new relation file is created to
+mark that the relfilenode needs to be cleaned up at recovery time. In
+contrast to the four actions above, failure to remove smgr mark files
+will lead to data loss, in which case the server will shut down.
+
+
Skipping WAL for New RelFileNode
--------------------------------
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 3596a7d734..f48d950895 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2199,6 +2199,9 @@ CommitTransaction(void)
*/
smgrDoPendingSyncs(true, is_parallel_worker);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2449,6 +2452,9 @@ PrepareTransaction(void)
*/
smgrDoPendingSyncs(true, false);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2774,6 +2780,7 @@ AbortTransaction(void)
AfterTriggerEndXact(false); /* 'false' means it's abort */
AtAbort_Portals();
smgrDoPendingSyncs(false, is_parallel_worker);
+ smgrDoPendingCleanups(false);
AtEOXact_LargeObject(false);
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 8d2395dae2..e7786a3851 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -40,6 +40,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "catalog/pg_control.h"
+#include "catalog/storage.h"
#include "commands/tablespace.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -53,6 +54,7 @@
#include "storage/pmsignal.h"
#include "storage/proc.h"
#include "storage/procarray.h"
+#include "storage/reinit.h"
#include "storage/spin.h"
#include "utils/builtins.h"
#include "utils/guc.h"
@@ -1746,6 +1748,14 @@ PerformWalRecovery(void)
}
}
+ /* cleanup garbage files left during crash recovery */
+ if (!InArchiveRecovery)
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
/* Allow resource managers to do any required cleanup. */
for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
{
@@ -3026,6 +3036,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
{
ereport(DEBUG1,
(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+
+ /* cleanup garbage files left during crash recovery */
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
InArchiveRecovery = true;
if (StandbyModeRequested)
StandbyMode = true;
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 9898701a43..3607177ffb 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,6 +19,7 @@
#include "postgres.h"
+#include "access/amapi.h"
#include "access/parallel.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -66,6 +67,23 @@ typedef struct PendingRelDelete
struct PendingRelDelete *next; /* linked-list link */
} PendingRelDelete;
+#define PCOP_UNLINK_FORK (1 << 0)
+#define PCOP_UNLINK_MARK (1 << 1)
+#define PCOP_SET_PERSISTENCE (1 << 2)
+
+typedef struct PendingCleanup
+{
+ RelFileNode relnode; /* relation that may need to be deleted */
+ int op; /* operation mask */
+ bool bufpersistence; /* buffer persistence to set */
+ ForkNumber unlink_forknum; /* forknum to unlink */
+ StorageMarks unlink_mark; /* mark to unlink */
+ BackendId backend; /* InvalidBackendId if not a temp rel */
+ bool atCommit; /* T=delete at commit; F=delete at abort */
+ int nestLevel; /* xact nesting level of request */
+ struct PendingCleanup *next; /* linked-list link */
+} PendingCleanup;
+
typedef struct PendingRelSync
{
RelFileNode rnode;
@@ -73,6 +91,7 @@ typedef struct PendingRelSync
} PendingRelSync;
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingCleanup * pendingCleanups = NULL; /* head of linked list */
HTAB *pendingSyncHash = NULL;
@@ -123,6 +142,7 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence,
SMgrRelation srel;
BackendId backend;
bool needs_wal;
+ PendingCleanup *pendingclean;
Assert(!IsInParallelMode()); /* couldn't update pendingSyncHash */
@@ -145,7 +165,14 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence,
return NULL; /* placate compiler */
}
+ /*
+ * We are going to create a new storage file. If the server crashes before
+ * the current transaction ends, the file needs to be cleaned up. The
+ * SMGR_MARK_UNCOMMITTED mark file prompts that work at the next startup.
+ */
srel = smgropen(rnode, backend);
+ log_smgrcreatemark(&rnode, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
smgrcreate(srel, MAIN_FORKNUM, false);
if (needs_wal)
@@ -157,18 +184,31 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence,
*/
if (register_delete)
{
- PendingRelDelete *pending;
+ PendingRelDelete *pendingdel;
- pending = (PendingRelDelete *)
+ pendingdel = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
- pending->relnode = rnode;
- pending->backend = backend;
- pending->atCommit = false; /* delete if abort */
- pending->nestLevel = GetCurrentTransactionNestLevel();
- pending->next = pendingDeletes;
- pendingDeletes = pending;
+ pendingdel->relnode = rnode;
+ pendingdel->backend = backend;
+ pendingdel->atCommit = false; /* delete if abort */
+ pendingdel->nestLevel = GetCurrentTransactionNestLevel();
+ pendingdel->next = pendingDeletes;
+ pendingDeletes = pendingdel;
}
+ /* drop mark files at commit */
+ pendingclean = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pendingclean->relnode = rnode;
+ pendingclean->op = PCOP_UNLINK_MARK;
+ pendingclean->unlink_forknum = MAIN_FORKNUM;
+ pendingclean->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pendingclean->backend = backend;
+ pendingclean->atCommit = true;
+ pendingclean->nestLevel = GetCurrentTransactionNestLevel();
+ pendingclean->next = pendingCleanups;
+ pendingCleanups = pendingclean;
+
if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
{
Assert(backend == InvalidBackendId);
@@ -178,6 +218,200 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence,
return srel;
}
+/*
+ * RelationCreateInitFork
+ * Create physical storage for the init fork of a relation.
+ *
+ * Create the init fork for the relation.
+ *
+ * This function is transactional. The creation is WAL-logged, and if the
+ * transaction aborts later on, the init fork will be removed.
+ */
+void
+RelationCreateInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ SMgrRelation srel;
+ bool create = true;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), false, false);
+
+ /*
+ * If we have a pending-unlink for the init-fork of this relation, that
+ * means the init-fork exists since before the current transaction
+ * started. This function reverts that change just by removing the entry.
+ * See RelationDropInitFork.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ create = false;
+ }
+ else
+ prev = pending;
+ }
+
+ if (!create)
+ return;
+
+ /* create the init fork, along with the commit-sentinel file */
+ srel = smgropen(rnode, InvalidBackendId);
+ log_smgrcreatemark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+
+ /* We don't have an existing init fork, so create it. */
+ smgrcreate(srel, INIT_FORKNUM, false);
+
+ /*
+ * The init fork of an index needs further initialization. ambuildempty
+ * should do the WAL-logging and file sync by itself; otherwise we do that
+ * ourselves.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rnode, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /* drop the init fork, mark file and revert persistence at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_UNLINK_FORK | PCOP_UNLINK_MARK | PCOP_SET_PERSISTENCE;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->bufpersistence = true;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* drop mark file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
+/*
+ * RelationDropInitFork
+ * Delete physical storage for the init fork of a relation.
+ *
+ * Register pending-delete of the init fork. The real deletion is performed by
+ * smgrDoPendingCleanups at commit.
+ *
+ * This function is transactional. If the transaction aborts later on, the
+ * deletion is canceled.
+ */
+void
+RelationDropInitFork(Relation rel)
+{
+ RelFileNode rnode = rel->rd_node;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ bool inxact_created = false;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), true, false);
+
+ /*
+ * If we have a pending-unlink for the init-fork of this relation, that
+ * means the init fork is created in the current transaction. We remove
+ * both the init fork and mark file immediately in that case. Otherwise
+ * just register a pending-unlink for the existing init fork. See
+ * RelationCreateInitFork.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileNodeEquals(rnode, pending->relnode) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ /* unlink list entry */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ inxact_created = true;
+ }
+ else
+ prev = pending;
+ }
+
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rnode, InvalidBackendId);
+
+ /*
+ * INIT forks are never loaded to shared buffer so no point in
+ * dropping buffers for such files.
+ */
+ log_smgrunlinkmark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrunlinkmark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+ log_smgrunlink(&rnode, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ return;
+ }
+
+ /* register drop of this init fork file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_UNLINK_FORK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = rnode;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = false;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -197,6 +431,88 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum)
+{
+ xl_smgr_unlink xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file unlink.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK record (XLOG_SMGR_MARK_CREATE) to WAL.
+ */
+void
+log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the mark file creation.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_CREATE;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK record (XLOG_SMGR_MARK_UNLINK) to WAL.
+ */
+void
+log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the mark file removal.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_UNLINK;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileNode *rnode, bool persistence)
+{
+ xl_smgr_bufpersistence xlrec;
+
+ /*
+ * Make an XLOG entry reporting the change of buffer persistence.
+ */
+ xlrec.rnode = *rnode;
+ xlrec.persistence = persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -710,6 +1026,95 @@ smgrDoPendingDeletes(bool isCommit)
}
}
+/*
+ * smgrDoPendingCleanups() -- Clean up work that emits WAL records
+ *
+ * The operations handled in this function emit WAL records, which must be
+ * part of the current transaction.
+ */
+void
+smgrDoPendingCleanups(bool isCommit)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (pending->nestLevel < nestLevel)
+ {
+ /* outer-level entries should not be processed yet */
+ prev = pending;
+ }
+ else
+ {
+ /* unlink list entry first, so we don't retry on failure */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ /* do cleanup if called for */
+ if (pending->atCommit == isCommit)
+ {
+ SMgrRelation srel;
+
+ srel = smgropen(pending->relnode, pending->backend);
+
+ Assert((pending->op &
+ ~(PCOP_UNLINK_FORK | PCOP_UNLINK_MARK |
+ PCOP_SET_PERSISTENCE)) == 0);
+
+ if (pending->op & PCOP_SET_PERSISTENCE)
+ {
+ SetRelationBuffersPersistence(srel, pending->bufpersistence,
+ InRecovery);
+ }
+
+ if (pending->op & PCOP_UNLINK_FORK)
+ {
+ /*
+ * Unlink the fork file. Currently we use this only for
+ * init forks, and we're sure that the init fork is not
+ * loaded in shared buffers. In the RelationDropInitFork
+ * case, that function has dropped the buffers. In the
+ * RelationCreateInitFork case, PCOP_SET_PERSISTENCE(true)
+ * is set and the buffers have been dropped just before.
+ */
+ Assert(pending->unlink_forknum == INIT_FORKNUM);
+
+ /* Don't emit WAL during recovery. */
+ if (!InRecovery)
+ log_smgrunlink(&pending->relnode,
+ pending->unlink_forknum);
+ smgrunlink(srel, pending->unlink_forknum, false);
+ }
+
+ if (pending->op & PCOP_UNLINK_MARK)
+ {
+ SMgrRelation srel;
+
+ if (!InRecovery)
+ log_smgrunlinkmark(&pending->relnode,
+ pending->unlink_forknum,
+ pending->unlink_mark);
+ srel = smgropen(pending->relnode, pending->backend);
+ smgrunlinkmark(srel, pending->unlink_forknum,
+ pending->unlink_mark, InRecovery);
+ smgrclose(srel);
+ }
+ }
+
+ /* must explicitly free the list entry */
+ pfree(pending);
+ /* prev does not change */
+ }
+ }
+}
+
/*
* smgrDoPendingSyncs() -- Take care of relation syncs at end of xact.
*/
@@ -970,6 +1375,15 @@ smgr_redo(XLogReaderState *record)
reln = smgropen(xlrec->rnode, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ smgrunlink(reln, xlrec->forkNum, true);
+ smgrclose(reln);
+ }
else if (info == XLOG_SMGR_TRUNCATE)
{
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
@@ -1058,6 +1472,124 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ bool created = false;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ smgrcreatemark(reln, xlrec->forkNum, xlrec->mark, true);
+ created = true;
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ smgrunlinkmark(reln, xlrec->forkNum, xlrec->mark, true);
+ break;
+ default:
+ elog(ERROR, "unknown smgr_mark action \"%c\"", xlrec->action);
+ }
+
+ if (created)
+ {
+ /* revert mark file operation at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = xlrec->rnode;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = xlrec->forkNum;
+ pending->unlink_mark = xlrec->mark;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ else
+ {
+ /*
+ * Delete pending action for this mark file if any. We should have
+ * at most one entry for this action.
+ */
+ PendingCleanup *prev = NULL;
+
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileNodeEquals(xlrec->rnode, pending->relnode) &&
+ pending->unlink_forknum == xlrec->forkNum &&
+ (pending->op & PCOP_UNLINK_MARK) != 0)
+ {
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+ }
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ PendingCleanup *prev = NULL;
+
+ reln = smgropen(xlrec->rnode, InvalidBackendId);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+
+ /*
+ * Delete pending action for persistence change if any. We should have
+ * at most one entry for this action.
+ */
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileNodeEquals(xlrec->rnode, pending->relnode) &&
+ (pending->op & PCOP_SET_PERSISTENCE) != 0)
+ {
+ Assert(pending->bufpersistence == xlrec->persistence);
+
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+
+ /*
+ * Revert buffer-persistence changes at abort if the relation is changing
+ * to a persistence different from the one it had before this transaction.
+ */
+ if (!pending)
+ {
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->relnode = xlrec->rnode;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = !xlrec->persistence;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
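[Editor's sketch, not part of the patch.] The walk in smgrDoPendingCleanups() above processes only entries queued at the current nesting level or deeper, unlinks each one before acting on it, and runs the action only when atCommit matches the outcome. A minimal standalone model of that list discipline might look like the following. The Cleanup struct, push_cleanup(), do_pending_cleanups() and pending_count() are hypothetical stand-ins, and the "action" is reduced to a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for PendingCleanup; only the fields the walk needs. */
typedef struct Cleanup
{
    int             id;         /* illustrative payload */
    bool            atCommit;   /* run at commit (true) or at abort (false) */
    int             nestLevel;  /* transaction nesting level that queued it */
    struct Cleanup *next;
} Cleanup;

static Cleanup *pending_head = NULL;

/* Push a new entry on the front of the list, as the patch does. */
static void
push_cleanup(int id, bool atCommit, int nestLevel)
{
    Cleanup *c = malloc(sizeof(Cleanup));

    if (c == NULL)
        abort();
    c->id = id;
    c->atCommit = atCommit;
    c->nestLevel = nestLevel;
    c->next = pending_head;
    pending_head = c;
}

/*
 * Process entries queued at nestLevel or deeper.  Entries whose atCommit
 * matches isCommit are "executed" (here: only counted).  Every processed
 * entry is unlinked and freed; outer-level entries stay on the list.
 */
static int
do_pending_cleanups(bool isCommit, int nestLevel)
{
    Cleanup *prev = NULL;
    Cleanup *next;
    int      executed = 0;

    for (Cleanup *c = pending_head; c != NULL; c = next)
    {
        next = c->next;
        if (c->nestLevel < nestLevel)
        {
            prev = c;           /* outer level: keep for later */
            continue;
        }

        /* unlink first, so a failing action is never retried */
        if (prev)
            prev->next = next;
        else
            pending_head = next;

        if (c->atCommit == isCommit)
            executed++;
        free(c);                /* prev does not change */
    }
    return executed;
}

/* Number of entries still queued. */
static int
pending_count(void)
{
    int n = 0;

    for (Cleanup *c = pending_head; c != NULL; c = c->next)
        n++;
    return n;
}
```

Aborting a subtransaction at level 2 in this model runs only the abort-time entries of that level and discards its commit-time entries, leaving level-1 entries queued for the outer transaction end.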
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 51b4a00d50..71aaf3320a 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -55,6 +55,7 @@
#include "commands/defrem.h"
#include "commands/event_trigger.h"
#include "commands/policy.h"
+#include "commands/progress.h"
#include "commands/sequence.h"
#include "commands/tablecmds.h"
#include "commands/tablespace.h"
@@ -5374,6 +5375,187 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
return newcmd;
}
+/*
+ * RelationChangePersistence: do in-place persistence change of a relation
+ */
+static void
+RelationChangePersistence(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ int i;
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * If any of the following were set, we would need to call ATRewriteTable.
+ * None of them can be set in the AT_REWRITE_ALTER_PERSISTENCE case.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ /*
+ * First we collect all relations whose persistence we need to change.
+ */
+
+ /* Collect OIDs of indexes and toast relations */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ /* Add toast relation if any */
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ List *toastidx;
+ Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
+
+ relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
+ toastidx = RelationGetIndexList(toastrel);
+ relids = list_concat(relids, toastidx);
+ pfree(toastidx);
+ table_close(toastrel, NoLock);
+ }
+
+ table_close(rel, NoLock);
+
+ /* Make changes in storage */
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach(lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+
+ /*
+ * XXX: Some access methods cannot tolerate an in-place persistence
+ * change. Specifically, GiST uses page LSNs to figure out whether a
+ * block has changed, and UNLOGGED GiST indexes use fake LSNs that
+ * are incompatible with the real LSNs used for LOGGED ones.
+ *
+ * Maybe if gistGetFakeLSN behaved the same way for permanent and
+ * unlogged indexes, we could skip the index rebuild in exchange for
+ * some extra WAL records emitted while it is unlogged.
+ *
+ * Check relam against a positive list so that indexes of unknown AMs
+ * are rebuilt as well.
+ */
+ if (r->rd_rel->relkind == RELKIND_INDEX &&
+ /* GiST is excluded */
+ r->rd_rel->relam != BTREE_AM_OID &&
+ r->rd_rel->relam != HASH_AM_OID &&
+ r->rd_rel->relam != GIN_AM_OID &&
+ r->rd_rel->relam != SPGIST_AM_OID &&
+ r->rd_rel->relam != BRIN_AM_OID)
+ {
+ int reindex_flags;
+ ReindexParams params = {0};
+
+ /* reindex doesn't allow concurrent use of the index */
+ table_close(r, NoLock);
+
+ reindex_flags =
+ REINDEX_REL_SUPPRESS_INDEX_USE |
+ REINDEX_REL_CHECK_CONSTRAINTS;
+
+ /* Set the same persistence as the parent relation. */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+
+ reindex_index(reloid, reindex_flags, persistence, &params);
+
+ continue;
+ }
+
+ /* Create or drop init fork */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ RelationCreateInitFork(r);
+ else
+ RelationDropInitFork(r);
+
+ /*
+ * When this relation gets WAL-logged, immediately sync all files but
+ * the init fork to establish the initial state on storage. Buffers
+ * have already been flushed out by RelationCreate(Drop)InitFork called
+ * just above. The init fork should have been synced as needed.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT)
+ {
+ for (i = 0; i < INIT_FORKNUM; i++)
+ {
+ if (smgrexists(RelationGetSmgr(r), i))
+ smgrimmedsync(RelationGetSmgr(r), i);
+ }
+ }
+
+ /* Update catalog */
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+
+ /*
+ * While wal_level >= replica, switching to LOGGED requires the
+ * relation content to be WAL-logged to recover the table. We don't
+ * emit this if wal_level = minimal.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded())
+ {
+ ForkNumber fork;
+ xl_smgr_truncate xlrec;
+
+ xlrec.blkno = 0;
+ xlrec.rnode = r->rd_node;
+ xlrec.flags = SMGR_TRUNCATE_ALL;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ for (fork = 0; fork < INIT_FORKNUM; fork++)
+ {
+ if (smgrexists(RelationGetSmgr(r), fork))
+ log_newpage_range(r, fork, 0,
+ smgrnblocks(RelationGetSmgr(r), fork),
+ false);
+ }
+ }
+
+ table_close(r, NoLock);
+ }
+
+ table_close(classRel, NoLock);
+}
+
/*
* ATRewriteTables: ALTER TABLE phase 3
*/
@@ -5504,47 +5686,55 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- /*
- * Create transient table that will receive the modified data.
- *
- * Ensure it is marked correctly as logged or unlogged. We have
- * to do this here so that buffers for the new relfilenode will
- * have the right persistence set, and at the same time ensure
- * that the original filenode's buffers will get read in with the
- * correct setting (i.e. the original one). Otherwise a rollback
- * after the rewrite would possibly result with buffers for the
- * original filenode having the wrong persistence setting.
- *
- * NB: This relies on swap_relation_files() also swapping the
- * persistence. That wouldn't work for pg_class, but that can't be
- * unlogged anyway.
- */
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
- persistence, lockmode);
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
+ RelationChangePersistence(tab, persistence, lockmode);
+ else
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new
+ * relfilenode will have the right persistence set, and at the
+ * same time ensure that the original filenode's buffers will
+ * get read in with the correct setting (i.e. the original
+ * one). Otherwise a rollback after the rewrite would
+ * possibly result with buffers for the original filenode
+ * having the wrong persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that
+ * can't be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace,
+ NewAccessMethod,
+ persistence, lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
+ /*
+ * Swap the physical files of the old and new heaps, then
+ * rebuild indexes and discard the old heap. We can use
+ * RecentXmin for the table's new relfrozenxid because we
+ * rewrote all the tuples in ATRewriteTable, so no older Xid
+ * remains in the table. Also, we never try to swap toast
+ * tables by content, since we have no interest in letting
+ * this code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
- InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ }
}
else
{
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 6884cad2c0..c67bae34f5 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1188,6 +1188,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
bool excludeFound;
ForkNumber relForkNum; /* Type of fork if file is a relation */
int relOidChars; /* Chars in filename that are the rel oid */
+ StorageMarks mark;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1238,7 +1239,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
/* Exclude all forks for unlogged tables except the init fork */
if (isDbDir &&
parse_filename_for_nontemp_relation(de->d_name, &relOidChars,
- &relForkNum))
+ &relForkNum, &mark))
{
/* Never exclude init forks */
if (relForkNum != INIT_FORKNUM)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index d73a40c1bc..01974b71d2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3159,6 +3159,91 @@ DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
}
}
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a relation.
+ * When switching to PERMANENT, it also writes all dirty pages of the
+ * relation out to disk (or more accurately, out to kernel disk buffers),
+ * ensuring that the kernel has an up-to-date view of the relation.
+ *
+ * Generally, the caller should be holding AccessExclusiveLock on the
+ * target relation to ensure that no other backend is busy dirtying
+ * more blocks of the relation; the effects can't be expected to last
+ * after the lock is released.
+ *
+ * XXX currently it sequentially searches the buffer pool, should be
+ * changed to more clever ways of searching. This routine is not
+ * used in any performance-critical code paths, so it's not worth
+ * adding additional overhead to normal paths to make it go faster;
+ * but see also DropRelFileNodeBuffers.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
+{
+ int i;
+ RelFileNodeBackend rnode = srel->smgr_rnode;
+
+ Assert(!RelFileNodeBackendIsTemp(rnode));
+
+ if (!isRedo)
+ log_smgrbufpersistence(&srel->smgr_rnode.node, permanent);
+
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ {
+ UnlockBufHdr(bufHdr, buf_state);
+ continue;
+ }
+
+ if (permanent)
+ {
+ /* Init fork is being dropped, drop buffers for it. */
+ if (bufHdr->tag.forkNum == INIT_FORKNUM)
+ {
+ InvalidateBuffer(bufHdr);
+ continue;
+ }
+
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /* we flush this buffer when switching to PERMANENT */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, srel);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr, true);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ /* There shouldn't be an init fork */
+ Assert(bufHdr->tag.forkNum != INIT_FORKNUM);
+ UnlockBufHdr(bufHdr, buf_state & ~BM_PERMANENT);
+ }
+ }
+}
+
/* ---------------------------------------------------------------------
* DropRelFileNodesAllBuffers
*
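[Editor's sketch, not part of the patch.] The per-buffer work in SetRelationBuffersPersistence() comes down to two bit operations: flip the "permanent" flag, and flush only if the buffer is both valid and dirty when switching to PERMANENT. The standalone model below uses made-up SK_* flag values rather than the real BM_* bits, and sk_set_persistence() and sk_needs_flush() are hypothetical names. It also assumes the flag is cleared when switching to UNLOGGED, which is what the function's header comment implies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag values; the real BM_* bits are defined by PostgreSQL. */
#define SK_DIRTY      (1u << 0)
#define SK_VALID      (1u << 1)
#define SK_PERMANENT  (1u << 2)

/* Return the buffer state after switching persistence (assumed behaviour). */
static uint32_t
sk_set_persistence(uint32_t state, bool permanent)
{
    if (permanent)
        return state | SK_PERMANENT;
    return state & ~SK_PERMANENT;
}

/*
 * A buffer needs flushing only when the relation becomes permanent and
 * the buffer is both valid and dirty, mirroring the
 * (BM_VALID | BM_DIRTY) test in the patch.
 */
static bool
sk_needs_flush(uint32_t state, bool permanent)
{
    const uint32_t mask = SK_VALID | SK_DIRTY;

    return permanent && (state & mask) == mask;
}
```

Keeping the flag flip and the flush decision as separate pure functions is only a convenience of this sketch. In the patch both happen under the buffer header lock inside the single scan of shared buffers.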
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 14b77f2861..2fc9f17c28 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -349,8 +349,6 @@ static void pre_sync_fname(const char *fname, bool isdir, int elevel);
static void datadir_fsync_fname(const char *fname, bool isdir, int elevel);
static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel);
-static int fsync_parent_path(const char *fname, int elevel);
-
/*
* pg_fsync --- do fsync with or without writethrough
@@ -3759,7 +3757,7 @@ fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel)
* This is aimed at making file operations persistent on disk in case of
* an OS crash or power failure.
*/
-static int
+int
fsync_parent_path(const char *fname, int elevel)
{
char parentpath[MAXPGPATH];
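[Editor's sketch, not part of the patch.] The reinit.c changes that follow decide, file by file, what the cleanup pass removes after a crash. The decision depends on three per-relfilenode flags collected in the first pass (has_init, dirty_init, dirty_all) plus the fork and mark of the file at hand. A standalone restatement of the second-pass branches in ResetUnloggedRelationsInDbspaceDir(), as I read them, could look like this. SkFork, SkMark, SkEntry and sk_should_remove() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { FORK_MAIN, FORK_FSM, FORK_VM, FORK_INIT } SkFork;
typedef enum { MARK_NONE, MARK_UNCOMMITTED } SkMark;

/* Per-relfilenode facts gathered by the first directory pass. */
typedef struct
{
    bool has_init;      /* an init fork file exists */
    bool dirty_init;    /* uncommitted mark on the init fork */
    bool dirty_all;     /* uncommitted mark on the main fork */
} SkEntry;

/* Should the file (fork, mark) of this relfilenode be unlinked? */
static bool
sk_should_remove(const SkEntry *ent, SkFork fork, SkMark mark)
{
    if (ent == NULL)
        return false;           /* relfilenode not in the hash: untouched */
    if (ent->dirty_all)
        return true;            /* uncommitted creation: remove everything */
    if (!ent->has_init)
        return false;           /* clean permanent relation */
    if (ent->dirty_init)
        return fork == FORK_INIT;   /* crashed SET UNLOGGED: drop init only */

    /* ordinary unlogged relation: keep only the unmarked init fork */
    return !(fork == FORK_INIT && mark == MARK_NONE);
}
```

For an ordinary unlogged relation this reproduces the pre-patch behaviour: every fork except the init fork is removed, and the init fork is later copied over the main fork.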
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index f053fe0495..fa4b1c0e6e 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -16,29 +16,49 @@
#include <unistd.h>
+#include "access/xlogrecovery.h"
+#include "catalog/pg_tablespace_d.h"
#include "common/relpath.h"
#include "postmaster/startup.h"
+#include "storage/bufmgr.h"
#include "storage/copydir.h"
#include "storage/fd.h"
+#include "storage/md.h"
#include "storage/reinit.h"
+#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
- int op);
+ Oid tspid, int op);
static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
- int op);
+ Oid tspid, Oid dbid, int op);
typedef struct
{
Oid reloid; /* hash key */
-} unlogged_relation_entry;
+ bool has_init; /* has INIT fork */
+ bool dirty_init; /* needs to remove INIT fork */
+ bool dirty_all; /* needs to remove all forks */
+} relfile_entry;
/*
- * Reset unlogged relations from before the last restart.
+ * Clean up and reset relation files from before the last restart.
*
- * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any
- * relation with an "init" fork, except for the "init" fork itself.
+ * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations
+ * depending on the existence of the "cleanup" forks.
+ *
+ * If the SMGR_MARK_UNCOMMITTED mark file for the init fork is present, we
+ * remove the init fork along with the mark file.
+ *
+ * If the SMGR_MARK_UNCOMMITTED mark file for the main fork is present, we
+ * remove the whole relation along with the mark file.
+ *
+ * Otherwise, if the "init" fork is found, we remove all forks of any relation
+ * with the "init" fork, except for the "init" fork itself.
+ *
+ * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all
+ * relations that have the "cleanup" and/or the "init" forks.
*
* If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main
* fork.
@@ -72,7 +92,7 @@ ResetUnloggedRelations(int op)
/*
* First process unlogged files in pg_default ($PGDATA/base)
*/
- ResetUnloggedRelationsInTablespaceDir("base", op);
+ ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op);
/*
* Cycle through directories for all non-default tablespaces.
@@ -81,13 +101,19 @@ ResetUnloggedRelations(int op)
while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
{
+ Oid tspid;
+
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
continue;
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
- ResetUnloggedRelationsInTablespaceDir(temp_path, op);
+
+ tspid = atooid(spc_de->d_name);
+
+ Assert(tspid != 0);
+ ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op);
}
FreeDir(spc_dir);
@@ -103,7 +129,8 @@ ResetUnloggedRelations(int op)
* Process one tablespace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
+ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
+ Oid tspid, int op)
{
DIR *ts_dir;
struct dirent *de;
@@ -130,6 +157,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
while ((de = ReadDir(ts_dir, tsdirname)) != NULL)
{
+ Oid dbid;
+
/*
* We're only interested in the per-database directories, which have
* numeric names. Note that this code will also (properly) ignore "."
@@ -148,7 +177,10 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
ereport_startup_progress("resetting unlogged relations (cleanup), elapsed time: %ld.%02d s, current path: %s",
dbspace_path);
- ResetUnloggedRelationsInDbspaceDir(dbspace_path, op);
+ dbid = atooid(de->d_name);
+ Assert(dbid != 0);
+
+ ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op);
}
FreeDir(ts_dir);
@@ -158,125 +190,227 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
* Process one per-dbspace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
+ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
+ Oid tspid, Oid dbid, int op)
{
DIR *dbspace_dir;
struct dirent *de;
char rm_path[MAXPGPATH * 2];
+ HTAB *hash;
+ HASHCTL ctl;
/* Caller must specify at least one operation. */
- Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0);
+ Assert((op & (UNLOGGED_RELATION_CLEANUP |
+ UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_INIT)) != 0);
/*
* Cleanup is a two-pass operation. First, we go through and identify all
* the files with init forks. Then, we go through again and nuke
* everything with the same OID except the init fork.
*/
+
+ /*
+ * It's possible that someone could create tons of unlogged relations in
+ * the same database & tablespace, so we'd better use a hash table rather
+ * than an array or linked list to keep track of which files need to be
+ * reset. Otherwise, this cleanup operation would be O(n^2).
+ */
+ memset(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(relfile_entry);
+ hash = hash_create("relfilenode cleanup hash",
+ 32, &ctl, HASH_ELEM | HASH_BLOBS);
+
+ /* Collect INIT fork and mark files in the directory. */
+ dbspace_dir = AllocateDir(dbspacedirname);
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
+ {
+ int oidchars;
+ ForkNumber forkNum;
+ StorageMarks mark;
+
+ /* Skip anything that doesn't look like a relation data file. */
+ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
+ &forkNum, &mark))
+ continue;
+
+ if (forkNum == INIT_FORKNUM || mark == SMGR_MARK_UNCOMMITTED)
+ {
+ Oid key;
+ relfile_entry *ent;
+ bool found;
+
+ /*
+ * Record the relfilenode information. If it has
+ * SMGR_MARK_UNCOMMITTED mark files, the relfilenode is in dirty
+ * state, where clean up is needed.
+ */
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_ENTER, &found);
+
+ if (!found)
+ {
+ ent->has_init = false;
+ ent->dirty_init = false;
+ ent->dirty_all = false;
+ }
+
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_init = true;
+ else if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_all = true;
+ else
+ {
+ Assert(forkNum == INIT_FORKNUM);
+ ent->has_init = true;
+ }
+ }
+ }
+
+ /* Done with the first pass. */
+ FreeDir(dbspace_dir);
+
+ /* nothing to do if we have neither init nor cleanup forks */
+ if (hash_get_num_entries(hash) < 1)
+ {
+ hash_destroy(hash);
+ return;
+ }
+
+ if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0)
+ {
+ /*
+ * When we come here after recovery, an smgr object for this file might
+ * have been created. In that case we need to drop all buffers and then
+ * the smgr object before initializing the unlogged relation. This is
+ * safe as long as no other backends have accessed the relation before
+ * starting archive recovery.
+ */
+ HASH_SEQ_STATUS status;
+ relfile_entry *ent;
+ SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8);
+ int maxrels = 8;
+ int nrels = 0;
+ int i;
+
+ Assert(!HotStandbyActive());
+
+ hash_seq_init(&status, hash);
+ while ((ent = (relfile_entry *) hash_seq_search(&status)) != NULL)
+ {
+ RelFileNodeBackend rel;
+
+ /*
+ * The relation is persistent and stays persistent. Don't drop the
+ * buffers for this relation.
+ */
+ if (ent->has_init && ent->dirty_init)
+ continue;
+
+ if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = ent->reloid;
+
+ srels[nrels++] = smgropen(rel.node, InvalidBackendId);
+ }
+
+ DropRelFileNodesAllBuffers(srels, nrels);
+
+ for (i = 0; i < nrels; i++)
+ smgrclose(srels[i]);
+ }
+
+ /*
+ * Now, make a second pass and remove anything that matches.
+ */
if ((op & UNLOGGED_RELATION_CLEANUP) != 0)
{
- HTAB *hash;
- HASHCTL ctl;
-
- /*
- * It's possible that someone could create a ton of unlogged relations
- * in the same database & tablespace, so we'd better use a hash table
- * rather than an array or linked list to keep track of which files
- * need to be reset. Otherwise, this cleanup operation would be
- * O(n^2).
- */
- ctl.keysize = sizeof(Oid);
- ctl.entrysize = sizeof(unlogged_relation_entry);
- ctl.hcxt = CurrentMemoryContext;
- hash = hash_create("unlogged relation OIDs", 32, &ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- /* Scan the directory. */
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
- unlogged_relation_entry ent;
+ Oid key;
+ relfile_entry *ent;
+ RelFileNodeBackend rel;
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* Also skip it unless this is the init fork. */
- if (forkNum != INIT_FORKNUM)
- continue;
-
- /*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
- */
- ent.reloid = atooid(de->d_name);
- (void) hash_search(hash, &ent, HASH_ENTER, NULL);
- }
-
- /* Done with the first pass. */
- FreeDir(dbspace_dir);
-
- /*
- * If we didn't find any init forks, there's no point in continuing;
- * we can bail out now.
- */
- if (hash_get_num_entries(hash) == 0)
- {
- hash_destroy(hash);
- return;
- }
-
- /*
- * Now, make a second pass and remove anything that matches.
- */
- dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
- {
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
-
- /* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* We never remove the init fork. */
- if (forkNum == INIT_FORKNUM)
+ &forkNum, &mark))
continue;
/*
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
- if (hash_search(hash, &ent, HASH_FIND, NULL))
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);
+
+ if (!ent)
+ continue;
+
+ if (!ent->dirty_all)
{
- snprintf(rm_path, sizeof(rm_path), "%s/%s",
- dbspacedirname, de->d_name);
- if (unlink(rm_path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m",
- rm_path)));
+ /* clean permanent relations don't need cleanup */
+ if (!ent->has_init)
+ continue;
+
+ if (ent->dirty_init)
+ {
+ /*
+ * The crashed transaction did SET UNLOGGED. This relation
+ * is restored to a LOGGED relation.
+ */
+ if (forkNum != INIT_FORKNUM)
+ continue;
+ }
else
- elog(DEBUG2, "unlinked file \"%s\"", rm_path);
+ {
+ /*
+ * we don't remove the INIT fork of a non-dirty
+ * relfilenode
+ */
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE)
+ continue;
+ }
}
+
+ /* so, nuke it! */
+ snprintf(rm_path, sizeof(rm_path), "%s/%s",
+ dbspacedirname, de->d_name);
+ if (unlink(rm_path) < 0)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ rm_path));
+
+ rel.backend = InvalidBackendId;
+ rel.node.spcNode = tspid;
+ rel.node.dbNode = dbid;
+ rel.node.relNode = atooid(de->d_name);
+
+ ForgetRelationForkSyncRequests(rel, forkNum);
}
/* Cleanup is complete. */
FreeDir(dbspace_dir);
- hash_destroy(hash);
}
+ hash_destroy(hash);
+ hash = NULL;
+
/*
* Initialization happens after cleanup is complete: we copy each init
- * fork file to the corresponding main fork file. Note that if we are
- * asked to do both cleanup and init, we may never get here: if the
- * cleanup code determines that there are no init forks in this dbspace,
- * it will return before we get to this point.
+ * fork file to the corresponding main fork file.
*/
if ((op & UNLOGGED_RELATION_INIT) != 0)
{
@@ -285,6 +419,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char srcpath[MAXPGPATH * 2];
@@ -292,9 +427,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -328,15 +465,18 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char mainpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -379,7 +519,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
*/
bool
parse_filename_for_nontemp_relation(const char *name, int *oidchars,
- ForkNumber *fork)
+ ForkNumber *fork, StorageMarks *mark)
{
int pos;
@@ -410,11 +550,19 @@ parse_filename_for_nontemp_relation(const char *name, int *oidchars,
for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar)
;
- if (segchar <= 1)
- return false;
- pos += segchar;
+ if (segchar > 1)
+ pos += segchar;
}
+ /* mark file? */
+ if (name[pos] == '.' && name[pos + 1] != 0)
+ {
+ *mark = name[pos + 1];
+ pos += 2;
+ }
+ else
+ *mark = SMGR_MARK_NONE;
+
/* Now we should be at the end. */
if (name[pos] != '\0')
return false;
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 879f647dbc..692508ea98 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -139,7 +139,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
BlockNumber blkno, bool skipFsync, int behavior);
static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
MdfdVec *seg);
-
+static bool mdmarkexists(SMgrRelation reln, ForkNumber forkNum,
+ StorageMarks mark);
/*
* mdinit() -- Initialize private state for magnetic disk storage manager.
@@ -169,6 +170,82 @@ mdexists(SMgrRelation reln, ForkNumber forkNum)
return (mdopenfork(reln, forkNum, EXTENSION_RETURN_NULL) != NULL);
}
+/*
+ * mdcreatemark() -- Create a mark file.
+ *
+ * If isRedo is true, it's okay for the file to exist already.
+ */
+void
+mdcreatemark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+ int fd;
+
+ /* See mdcreate for details. */
+ TablespaceCreateDbspace(reln->smgr_rnode.node.spcNode,
+ reln->smgr_rnode.node.dbNode,
+ isRedo);
+
+ fd = BasicOpenFile(path, O_WRONLY | O_CREAT | O_EXCL);
+ if (fd < 0 && (!isRedo || errno != EEXIST))
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not create mark file \"%s\": %m", path));
+
+ pg_fsync(fd);
+ close(fd);
+
+ /*
+ * To guarantee that the creation of the file is persistent, fsync its
+ * parent directory.
+ */
+ fsync_parent_path(path, ERROR);
+
+ pfree(path);
+}
+
+
+/*
+ * mdunlinkmark() -- Delete the mark file
+ *
+ * If isRedo is true, it's okay if the file is not found.
+ */
+void
+mdunlinkmark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+
+ if (!isRedo || mdmarkexists(reln, forkNum, mark))
+ durable_unlink(path, ERROR);
+
+ pfree(path);
+}
+
+/*
+ * mdmarkexists() -- Check if the file exists.
+ */
+static bool
+mdmarkexists(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark)
+{
+ char *path = markpath(reln->smgr_rnode, forkNum, mark);
+ int fd;
+
+ fd = BasicOpenFile(path, O_RDONLY);
+ if (fd < 0 && errno != ENOENT)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not access mark file \"%s\": %m", path));
+ pfree(path);
+
+ if (fd < 0)
+ return false;
+
+ close(fd);
+ return true;
+}
+
/*
* mdcreate() -- Create a new relation on magnetic disk.
*
@@ -1031,6 +1108,15 @@ register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum,
RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ );
}
+/*
+ * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork
+ */
+void
+ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, ForkNumber forknum)
+{
+ register_forget_request(rnode, forknum, 0);
+}
+
/*
* ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB
*/
@@ -1384,12 +1470,14 @@ mdsyncfiletag(const FileTag *ftag, char *path)
* Return 0 on success, -1 on failure, with errno set.
*/
int
-mdunlinkfiletag(const FileTag *ftag, char *path)
+mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark)
{
char *p;
/* Compute the path. */
- p = relpathperm(ftag->rnode, MAIN_FORKNUM);
+ p = GetRelationPath(ftag->rnode.dbNode, ftag->rnode.spcNode,
+ ftag->rnode.relNode, InvalidBackendId, MAIN_FORKNUM,
+ mark);
strlcpy(path, p, MAXPGPATH);
pfree(p);
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index d71a557a35..0710e8b145 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -63,6 +63,10 @@ typedef struct f_smgr
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+ void (*smgr_createmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+ void (*smgr_unlinkmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
} f_smgr;
static const f_smgr smgrsw[] = {
@@ -84,6 +88,8 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
+ .smgr_createmark = mdcreatemark,
+ .smgr_unlinkmark = mdunlinkmark,
}
};
@@ -337,6 +343,26 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo);
}
+/*
+ * smgrcreatemark() -- Create a mark file
+ */
+void
+smgrcreatemark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_createmark(reln, forknum, mark, isRedo);
+}
+
+/*
+ * smgrunlinkmark() -- Delete a mark file
+ */
+void
+smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlinkmark(reln, forknum, mark, isRedo);
+}
+
/*
* smgrdosyncall() -- Immediately sync all forks of all given relations
*
@@ -664,6 +690,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index c695d816fc..ab11600724 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -91,7 +91,8 @@ static CycleCtr checkpoint_cycle_ctr = 0;
typedef struct SyncOps
{
int (*sync_syncfiletag) (const FileTag *ftag, char *path);
- int (*sync_unlinkfiletag) (const FileTag *ftag, char *path);
+ int (*sync_unlinkfiletag) (const FileTag *ftag, char *path,
+ StorageMarks mark);
bool (*sync_filetagmatches) (const FileTag *ftag,
const FileTag *candidate);
} SyncOps;
@@ -236,7 +237,8 @@ SyncPostCheckpoint(void)
/* Unlink the file */
if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
- path) < 0)
+ path,
+ SMGR_MARK_NONE) < 0)
{
/*
* There's a race condition, when the database is dropped at the
@@ -245,6 +247,21 @@ SyncPostCheckpoint(void)
* here. rmtree() also has to ignore ENOENT errors, to deal with
* the possibility that we delete the file first.
*/
+ if (errno != ENOENT)
+ ereport(WARNING,
+ errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path));
+ }
+ else if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
+ path,
+ SMGR_MARK_UNCOMMITTED)
+ < 0)
+ {
+ /*
+ * We may also have an SMGR_MARK_UNCOMMITTED file. Remove it if the
+ * fork file has been successfully removed. It's ok if the file
+ * does not exist.
+ */
if (errno != ENOENT)
ereport(WARNING,
(errcode_for_file_access(),
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 49966e7b7f..c3515e5546 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -416,6 +416,28 @@ extractPageInfo(XLogReaderState *record)
* source system.
*/
}
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_UNLINK)
+ {
+ /*
+ * We can safely ignore these. We'll see that the files don't exist in
+ * the target data dir, and copy them in from the source system. No
+ * need to do anything special here.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_MARK)
+ {
+ /*
+ * We can safely ignore these. The file will be removed from the
+ * target if it doesn't exist in the source system. The files are
+ * empty, so we don't need to bother with their contents.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ /*
+ * We can safely ignore these. These don't make any on-disk changes.
+ */
+ }
else if (rmid == RM_XACT_ID &&
((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT ||
(rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED ||
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index b39b5c1aac..9f7235b920 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -413,7 +413,6 @@ main(int argc, char **argv)
if (showprogress)
pg_log_info("reading source file list");
source->traverse_files(source, &process_source_file);
-
if (showprogress)
pg_log_info("reading target file list");
traverse_datadir(datadir_target, &process_target_file);
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 636c96efd3..1c19e16fea 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -139,9 +139,15 @@ GetDatabasePath(Oid dbNode, Oid spcNode)
*/
char *
GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
- int backendId, ForkNumber forkNumber)
+ int backendId, ForkNumber forkNumber, char mark)
{
char *path;
+ char markstr[4];
+
+ if (mark == 0)
+ markstr[0] = 0;
+ else
+ snprintf(markstr, sizeof(markstr), ".%c", mark);
if (spcNode == GLOBALTABLESPACE_OID)
{
@@ -149,10 +155,10 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
Assert(dbNode == 0);
Assert(backendId == InvalidBackendId);
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("global/%u_%s",
- relNode, forkNames[forkNumber]);
+ path = psprintf("global/%u_%s%s",
+ relNode, forkNames[forkNumber], markstr);
else
- path = psprintf("global/%u", relNode);
+ path = psprintf("global/%u%s", relNode, markstr);
}
else if (spcNode == DEFAULTTABLESPACE_OID)
{
@@ -160,22 +166,22 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/%u_%s",
+ path = psprintf("base/%u/%u_%s%s",
dbNode, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/%u",
- dbNode, relNode);
+ path = psprintf("base/%u/%u%s",
+ dbNode, relNode, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/t%d_%u_%s",
+ path = psprintf("base/%u/t%d_%u_%s%s",
dbNode, backendId, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/t%d_%u",
- dbNode, backendId, relNode);
+ path = psprintf("base/%u/t%d_%u%s",
+ dbNode, backendId, relNode, markstr);
}
}
else
@@ -184,27 +190,28 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u_%s%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
dbNode, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
- dbNode, relNode);
+ dbNode, relNode, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
dbNode, backendId, relNode,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u%s",
spcNode, TABLESPACE_VERSION_DIRECTORY,
- dbNode, backendId, relNode);
+ dbNode, backendId, relNode, markstr);
}
}
+
return path;
}
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 844a023b2c..a685665fab 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -25,6 +25,8 @@ extern int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileNode rnode,
char relpersistence,
bool register_delete);
+extern void RelationCreateInitFork(Relation rel);
+extern void RelationDropInitFork(Relation rel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
extern void RelationPreTruncate(Relation rel);
@@ -43,6 +45,7 @@ extern void RestorePendingSyncs(char *startAddress);
extern void smgrDoPendingDeletes(bool isCommit);
extern void smgrDoPendingSyncs(bool isCommit, bool isParallelWorker);
extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrDoPendingCleanups(bool isCommit);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index 622de22b03..d83fc6876e 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -18,17 +18,23 @@
#include "lib/stringinfo.h"
#include "storage/block.h"
#include "storage/relfilenode.h"
+#include "storage/smgr.h"
/*
* Declarations for smgr-related XLOG records
*
- * Note: we log file creation and truncation here, but logging of deletion
- * actions is handled by xact.c, because it is part of transaction commit.
+ * Note: we log file creation, truncation and buffer persistence change here,
+ * but logging of deletion actions is handled mainly by xact.c, because it is
+ * part of transaction commit in most cases. However, there's a case where
+ * init forks are deleted outside the control of a transaction.
*/
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_UNLINK 0x30
+#define XLOG_SMGR_MARK 0x40
+#define XLOG_SMGR_BUFPERSISTENCE 0x50
typedef struct xl_smgr_create
{
@@ -36,6 +42,32 @@ typedef struct xl_smgr_create
ForkNumber forkNum;
} xl_smgr_create;
+typedef struct xl_smgr_unlink
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+} xl_smgr_unlink;
+
+typedef enum smgr_mark_action
+{
+ XLOG_SMGR_MARK_CREATE = 'c',
+ XLOG_SMGR_MARK_UNLINK = 'u'
+} smgr_mark_action;
+
+typedef struct xl_smgr_mark
+{
+ RelFileNode rnode;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ smgr_mark_action action;
+} xl_smgr_mark;
+
+typedef struct xl_smgr_bufpersistence
+{
+ RelFileNode rnode;
+ bool persistence;
+} xl_smgr_bufpersistence;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +83,12 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark);
+extern void log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum,
+ StorageMarks mark);
+extern void log_smgrbufpersistence(const RelFileNode *rnode, bool persistence);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index a4b5dc853b..a864c91614 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -67,7 +67,7 @@ extern int forkname_chars(const char *str, ForkNumber *fork);
extern char *GetDatabasePath(Oid dbNode, Oid spcNode);
extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
- int backendId, ForkNumber forkNumber);
+ int backendId, ForkNumber forkNumber, char mark);
/*
* Wrapper macros for GetRelationPath. Beware of multiple
@@ -77,7 +77,7 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
/* First argument is a RelFileNode */
#define relpathbackend(rnode, backend, forknum) \
GetRelationPath((rnode).dbNode, (rnode).spcNode, (rnode).relNode, \
- backend, forknum)
+ backend, forknum, 0)
/* First argument is a RelFileNode */
#define relpathperm(rnode, forknum) \
@@ -87,4 +87,9 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
#define relpath(rnode, forknum) \
relpathbackend((rnode).node, (rnode).backend, forknum)
+/* First argument is a RelFileNodeBackend */
+#define markpath(rnode, forknum, mark) \
+ GetRelationPath((rnode).node.dbNode, (rnode).node.spcNode, \
+ (rnode).node.relNode, \
+ (rnode).backend, forknum, mark)
#endif /* RELPATH_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index a6b657f0ba..b7db0b2922 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -210,6 +210,8 @@ extern void CreateAndCopyRelationData(RelFileNode src_rnode,
extern void FlushDatabaseBuffers(Oid dbid);
extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+ bool permanent, bool isRedo);
extern void DropRelFileNodesAllBuffers(struct SMgrRelationData **smgr_reln, int nnodes);
extern void DropDatabaseBuffers(Oid dbid);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 29209e2724..8bf746bf45 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -185,6 +185,7 @@ extern ssize_t pg_pwritev_with_retry(int fd,
extern int pg_truncate(const char *path, off_t length);
extern void fsync_fname(const char *fname, bool isdir);
extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
+extern int fsync_parent_path(const char *fname, int elevel);
extern int durable_rename(const char *oldfile, const char *newfile, int loglevel);
extern int durable_unlink(const char *fname, int loglevel);
extern int durable_rename_excl(const char *oldfile, const char *newfile, int loglevel);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 6e46d8d96a..18b27d366b 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -24,6 +24,10 @@ extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
extern void mdclose(SMgrRelation reln, ForkNumber forknum);
extern void mdrelease(void);
+extern void mdcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void mdunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
@@ -42,12 +46,14 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void ForgetRelationForkSyncRequests(RelFileNodeBackend rnode,
+ ForkNumber forknum);
extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
/* md sync callbacks */
extern int mdsyncfiletag(const FileTag *ftag, char *path);
-extern int mdunlinkfiletag(const FileTag *ftag, char *path);
+extern int mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark);
extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *candidate);
#endif /* MD_H */
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index bf2c10d443..e399aec0c7 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -16,13 +16,15 @@
#define REINIT_H
#include "common/relpath.h"
-
+#include "storage/smgr.h"
extern void ResetUnloggedRelations(int op);
-extern bool parse_filename_for_nontemp_relation(const char *name,
- int *oidchars, ForkNumber *fork);
+extern bool parse_filename_for_nontemp_relation(const char *name, int *oidchars,
+ ForkNumber *fork,
+ StorageMarks *mark);
#define UNLOGGED_RELATION_CLEANUP 0x0001
-#define UNLOGGED_RELATION_INIT 0x0002
+#define UNLOGGED_RELATION_DROP_BUFFER 0x0002
+#define UNLOGGED_RELATION_INIT 0x0004
#endif /* REINIT_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 8e3ef92cda..43b33b6b8d 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,18 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
+/*
+ * A storage mark is a file whose existence indicates something about another
+ * file. Such files are named "<filename>.<mark>", where the mark is one
+ * of the values of StorageMarks. Since ".<digit>" denotes a segment file,
+ * don't use digits for the mark character.
+ */
+typedef enum StorageMarks
+{
+ SMGR_MARK_NONE = 0,
+ SMGR_MARK_UNCOMMITTED = 'u' /* the file is not committed yet */
+} StorageMarks;
+
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
@@ -85,7 +97,12 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
extern void smgrclose(SMgrRelation reln);
extern void smgrcloseall(void);
extern void smgrclosenode(RelFileNodeBackend rnode);
+extern void smgrcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void smgrunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 72fafb795b..181709039c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1890,6 +1890,7 @@ PatternInfoArray
Pattern_Prefix_Status
Pattern_Type
PendingFsyncEntry
+PendingMarkCleanup
PendingRelDelete
PendingRelSync
PendingUnlinkEntry
@@ -2535,6 +2536,7 @@ StdRdOptIndexCleanup
StdRdOptions
Step
StopList
+StorageMarks
StrategyNumber
StreamCtl
StreamXidHash
@@ -3502,6 +3504,7 @@ registered_buffer
regmatch_t
regoff_t
regproc
+relfile_entry
relopt_bool
relopt_enum
relopt_enum_elt_def
@@ -3555,6 +3558,7 @@ slist_iter
slist_mutable_iter
slist_node
slock_t
+smgr_mark_action
socket_set
spgBulkDeleteState
spgChooseIn
@@ -3755,8 +3759,11 @@ xl_replorigin_set
xl_restore_point
xl_running_xacts
xl_seq_rec
+xl_smgr_bufpersistence
xl_smgr_create
+xl_smgr_mark
xl_smgr_truncate
+xl_smgr_unlink
xl_standby_lock
xl_standby_locks
xl_tblspc_create_rec
--
2.27.0
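
For readers skimming the first patch, here is a rough standalone sketch of the mark-file naming it introduces: GetRelationPath() appends ".<mark>" to the usual "<relnode>[_<fork>]" name, and parse_filename_for_nontemp_relation() accepts an optional ".<segno>" followed by an optional ".<mark>". The names DemoMark, demo_mark_name and demo_parse_name are made up for this illustration and are not part of the patch; only the "_init" fork suffix is recognised, and directory components are left out.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Mirrors the values of StorageMarks in the patch; the type is hypothetical. */
typedef enum DemoMark
{
	DEMO_MARK_NONE = 0,
	DEMO_MARK_UNCOMMITTED = 'u'
} DemoMark;

/*
 * Build "<relnode>[_<fork>][.<mark>]" into buf, following the "%u_%s%s"
 * pattern used in GetRelationPath().  forkname is NULL for the main fork.
 */
static int
demo_mark_name(char *buf, size_t len, unsigned relnode,
			   const char *forkname, DemoMark mark)
{
	char		markstr[3] = "";

	assert(buf != NULL && len > 0);
	if (mark != DEMO_MARK_NONE)
		snprintf(markstr, sizeof(markstr), ".%c", (char) mark);
	if (forkname != NULL)
		return snprintf(buf, len, "%u_%s%s", relnode, forkname, markstr);
	return snprintf(buf, len, "%u%s", relnode, markstr);
}

/*
 * Parse "<digits>[_init][.<segno>][.<mark>]".  Returns false if the name
 * does not have that shape.  Simplified: only the init fork is recognised.
 */
static bool
demo_parse_name(const char *name, unsigned *relnode, bool *is_init,
				DemoMark *mark)
{
	int			pos = 0;
	int			n;

	*relnode = 0;
	while (isdigit((unsigned char) name[pos]))
		*relnode = *relnode * 10 + (unsigned) (name[pos++] - '0');
	if (pos == 0)
		return false;

	*is_init = (strncmp(name + pos, "_init", 5) == 0);
	if (*is_init)
		pos += 5;

	/* optional segment number: '.' followed by digits */
	if (name[pos] == '.' && isdigit((unsigned char) name[pos + 1]))
	{
		for (n = 1; isdigit((unsigned char) name[pos + n]); n++)
			;
		pos += n;
	}

	/* optional mark: '.' followed by a single character */
	if (name[pos] == '.' && name[pos + 1] != '\0')
	{
		*mark = (DemoMark) name[pos + 1];
		pos += 2;
	}
	else
		*mark = DEMO_MARK_NONE;

	/* now we should be at the end */
	return name[pos] == '\0';
}
```

Because a digit after the dot is always read as a segment number, a name such as "16384.1" parses as segment 1 with no mark. That is why the comment in smgr.h asks that mark characters not be digits.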
v21-0002-New-command-ALTER-TABLE-ALL-IN-TABLESPACE-SET-LO.patch
From 19e47466ab13ce805ad7a6056319cbba657ea9c0 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 23:21:09 +0900
Subject: [PATCH v21 2/2] New command ALTER TABLE ALL IN TABLESPACE SET
LOGGED/UNLOGGED
To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes
relation persistence of all tables in the specified tablespace.
---
doc/src/sgml/ref/alter_table.sgml | 16 +++
src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++
src/backend/nodes/copyfuncs.c | 16 +++
src/backend/nodes/equalfuncs.c | 15 +++
src/backend/parser/gram.y | 42 +++++++
src/backend/tcop/utility.c | 11 ++
src/include/commands/tablecmds.h | 2 +
src/include/nodes/nodes.h | 1 +
src/include/nodes/parsenodes.h | 10 ++
src/test/regress/expected/tablespace.out | 76 ++++++++++++
src/test/regress/sql/tablespace.sql | 41 +++++++
11 files changed, 370 insertions(+)
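
As a reading aid for AlterTableSetLoggedAll() in the tablecmds.c hunk of this patch, the filter it applies to each pg_class row can be sketched as a standalone predicate. This is only an illustration under assumptions: DemoRel and demo_rel_selected are hypothetical names, the boolean fields stand in for the IsCatalogNamespace(), isAnyTempNamespace() and IsToastNamespace() checks, and the ownership check and locking that the real function performs afterwards are not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, flattened stand-in for the pg_class fields that are inspected. */
typedef struct DemoRel
{
	bool		in_pg_catalog;	/* IsCatalogNamespace(relnamespace) */
	bool		is_shared;		/* relisshared */
	bool		in_temp_ns;		/* isAnyTempNamespace(relnamespace) */
	bool		in_toast_ns;	/* IsToastNamespace(relnamespace) */
	char		relkind;		/* 'r' is RELKIND_RELATION */
	unsigned	owner;			/* relowner */
} DemoRel;

/*
 * Return true if the relation would be picked up.  nowners == 0 means no
 * OWNED BY clause was given, so the owner is not used as a filter.
 */
static bool
demo_rel_selected(const DemoRel *rel, const unsigned *owners, size_t nowners)
{
	size_t		i;

	assert(rel != NULL);

	/* skip catalog, shared, temp and TOAST relations */
	if (rel->in_pg_catalog || rel->is_shared ||
		rel->in_temp_ns || rel->in_toast_ns)
		return false;

	/* only plain tables are requested */
	if (rel->relkind != 'r')
		return false;

	if (nowners == 0)
		return true;

	for (i = 0; i < nowners; i++)
	{
		if (owners[i] == rel->owner)
			return true;
	}
	return false;
}
```

TOAST relations are filtered out here because, according to the comment in the patch, they are changed together with their main table.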
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 5c0735e08a..5ae825b30f 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -33,6 +33,8 @@ ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
SET SCHEMA <replaceable class="parameter">new_schema</replaceable>
ALTER TABLE ALL IN TABLESPACE <replaceable class="parameter">name</replaceable> [ OWNED BY <replaceable class="parameter">role_name</replaceable> [, ... ] ]
SET TABLESPACE <replaceable class="parameter">new_tablespace</replaceable> [ NOWAIT ]
+ALTER TABLE ALL IN TABLESPACE <replaceable class="parameter">name</replaceable> [ OWNED BY <replaceable class="parameter">role_name</replaceable> [, ... ] ]
+ SET { LOGGED | UNLOGGED } [ NOWAIT ]
ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
ATTACH PARTITION <replaceable class="parameter">partition_name</replaceable> { FOR VALUES <replaceable class="parameter">partition_bound_spec</replaceable> | DEFAULT }
ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
@@ -753,6 +755,20 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
(see <xref linkend="sql-createtable-unlogged"/>). It cannot be applied
to a temporary table.
</para>
+
+ <para>
+ The persistence of all tables in the current database in a tablespace
+ can be changed using the <literal>ALL IN TABLESPACE</literal> form, which
+ first locks all tables to be changed and then changes each one. This form also
+ supports
+ <literal>OWNED BY</literal>, which will only change tables owned by the
+ specified roles. If the <literal>NOWAIT</literal> option is specified,
+ then the command will fail if it is unable to immediately acquire all of
+ the locks required. The <literal>information_schema</literal> relations
+ are not considered part of the system catalogs and will be changed. See
+ also
+ <link linkend="sql-createtablespace"><command>CREATE TABLESPACE</command></link>.
+ </para>
</listitem>
</varlistentry>
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 71aaf3320a..5442f790ed 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14794,6 +14794,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
return new_tablespaceoid;
}
+/*
+ * Alter Table ALL ... SET LOGGED/UNLOGGED
+ *
+ * Allows a user to change the persistence of all objects in a given tablespace
+ * in the current database. Objects can also be chosen based on their owner,
+ * to allow users to change the persistence of only their own objects. The
+ * main permissions handling is done by the lower-level change-persistence
+ * function.
+ *
+ * All to-be-modified objects are locked first. If NOWAIT is specified and the
+ * lock can't be acquired then we ereport(ERROR).
+ */
+void
+AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt * stmt)
+{
+ List *relations = NIL;
+ ListCell *l;
+ ScanKeyData key[1];
+ Relation rel;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ Oid tablespaceoid;
+ List *role_oids = roleSpecsToIds(stmt->roles);
+
+ /* Ensure we were not asked to change something we can't */
+ if (stmt->objtype != OBJECT_TABLE)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("only tables can be specified"));
+
+ /* Get the tablespace OID */
+ tablespaceoid = get_tablespace_oid(stmt->tablespacename, false);
+
+ /*
+ * Now that the checks are done, check if we should set it to
+ * InvalidOid because it is our database's default tablespace.
+ */
+ if (tablespaceoid == MyDatabaseTableSpace)
+ tablespaceoid = InvalidOid;
+
+ /*
+ * Walk the list of objects in the tablespace to pick them up. This will
+ * only find objects in our database, of course.
+ */
+ ScanKeyInit(&key[0],
+ Anum_pg_class_reltablespace,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(tablespaceoid));
+
+ rel = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(rel, 1, key);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relOid = relForm->oid;
+
+ /*
+ * Do not pick up objects in pg_catalog as part of this; if an admin
+ * really wishes to do so, they can issue the individual ALTER
+ * commands directly.
+ *
+ * Also, explicitly avoid any shared tables, temp tables, or TOAST
+ * (TOAST will be changed with the main table).
+ */
+ if (IsCatalogNamespace(relForm->relnamespace) ||
+ relForm->relisshared ||
+ isAnyTempNamespace(relForm->relnamespace) ||
+ IsToastNamespace(relForm->relnamespace))
+ continue;
+
+ /* Only pick up the object type requested */
+ if (relForm->relkind != RELKIND_RELATION)
+ continue;
+
+ /* Check if we are only picking up objects owned by certain roles */
+ if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner))
+ continue;
+
+ /*
+ * Handle permissions-checking here since we are locking the tables
+ * and also to avoid doing a bunch of work only to fail part-way. Note
+ * that permissions will also be checked by AlterTableInternal().
+ *
+ * Caller must be considered an owner of the table whose persistence
+ * we're going to change.
+ */
+ if (!pg_class_ownercheck(relOid, GetUserId()))
+ aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)),
+ NameStr(relForm->relname));
+
+ if (stmt->nowait &&
+ !ConditionalLockRelationOid(relOid, AccessExclusiveLock))
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("aborting because lock on relation \"%s.%s\" is not available",
+ get_namespace_name(relForm->relnamespace),
+ NameStr(relForm->relname)));
+ else
+ LockRelationOid(relOid, AccessExclusiveLock);
+
+ /*
+ * Add to our list of objects whose persistence we're going to
+ * change.
+ */
+ relations = lappend_oid(relations, relOid);
+ }
+
+ table_endscan(scan);
+ table_close(rel, AccessShareLock);
+
+ if (relations == NIL)
+ ereport(NOTICE,
+ errcode(ERRCODE_NO_DATA_FOUND),
+ errmsg("no matching relations in tablespace \"%s\" found",
+ tablespaceoid == InvalidOid ? "(database default)" :
+ get_tablespace_name(tablespaceoid)));
+
+ /*
+ * Everything is locked, loop through and change persistence of all of the
+ * relations.
+ */
+ foreach(l, relations)
+ {
+ List *cmds = NIL;
+ AlterTableCmd *cmd = makeNode(AlterTableCmd);
+
+ if (stmt->logged)
+ cmd->subtype = AT_SetLogged;
+ else
+ cmd->subtype = AT_SetUnLogged;
+
+ cmds = lappend(cmds, cmd);
+
+ EventTriggerAlterTableStart((Node *) stmt);
+ /* OID is set by AlterTableInternal */
+ AlterTableInternal(lfirst_oid(l), cmds, false);
+ EventTriggerAlterTableEnd();
+ }
+}
+
static void
index_copy_data(Relation rel, RelFileNode newrnode)
{
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 56505557bf..722464ab6e 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4713,6 +4713,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from)
return newnode;
}
+static AlterTableSetLoggedAllStmt *
+_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt * from)
+{
+ AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt);
+
+ COPY_STRING_FIELD(tablespacename);
+ COPY_SCALAR_FIELD(objtype);
+ COPY_SCALAR_FIELD(logged);
+ COPY_SCALAR_FIELD(nowait);
+
+ return newnode;
+}
+
static CreateExtensionStmt *
_copyCreateExtensionStmt(const CreateExtensionStmt *from)
{
@@ -6156,6 +6169,9 @@ copyObjectImpl(const void *from)
case T_AlterTableMoveAllStmt:
retval = _copyAlterTableMoveAllStmt(from);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _copyAlterTableSetLoggedAllStmt(from);
+ break;
case T_CreateExtensionStmt:
retval = _copyCreateExtensionStmt(from);
break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 9ea3c5abf2..04e00cd7f4 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -2221,6 +2221,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a,
return true;
}
+static bool
+_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt * a,
+ const AlterTableSetLoggedAllStmt * b)
+{
+ COMPARE_STRING_FIELD(tablespacename);
+ COMPARE_SCALAR_FIELD(objtype);
+ COMPARE_SCALAR_FIELD(logged);
+ COMPARE_SCALAR_FIELD(nowait);
+
+ return true;
+}
+
static bool
_equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b)
{
@@ -4012,6 +4024,9 @@ equal(const void *a, const void *b)
case T_AlterTableMoveAllStmt:
retval = _equalAlterTableMoveAllStmt(a, b);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _equalAlterTableSetLoggedAllStmt(a, b);
+ break;
case T_CreateExtensionStmt:
retval = _equalCreateExtensionStmt(a, b);
break;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index eefcf90187..3754e758bd 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -2077,6 +2077,48 @@ AlterTableStmt:
n->nowait = $13;
$$ = (Node *)n;
}
+ | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = true;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name OWNED BY role_list SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->roles = $9;
+ n->logged = true;
+ n->nowait = $12;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = false;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name OWNED BY role_list SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->roles = $9;
+ n->logged = false;
+ n->nowait = $12;
+ $$ = (Node *)n;
+ }
| ALTER INDEX qualified_name alter_table_cmds
{
AlterTableStmt *n = makeNode(AlterTableStmt);
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index f364a9b88a..f3670a56a2 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -164,6 +164,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_AlterTSConfigurationStmt:
case T_AlterTSDictionaryStmt:
case T_AlterTableMoveAllStmt:
+ case T_AlterTableSetLoggedAllStmt:
case T_AlterTableSpaceOptionsStmt:
case T_AlterTableStmt:
case T_AlterTypeStmt:
@@ -1754,6 +1755,12 @@ ProcessUtilitySlow(ParseState *pstate,
commandCollected = true;
break;
+ case T_AlterTableSetLoggedAllStmt:
+ AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree);
+ /* commands are stashed in AlterTableSetLoggedAll */
+ commandCollected = true;
+ break;
+
case T_DropStmt:
ExecDropStmt((DropStmt *) parsetree, isTopLevel);
/* no commands stashed for DROP */
@@ -2682,6 +2689,10 @@ CreateCommandTag(Node *parsetree)
tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype);
+ break;
+
case T_AlterTableStmt:
tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype);
break;
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index 5d4037f26e..09bb75d6a0 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
+extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt * stmt);
+
extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
Oid *oldschema);
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 53f6b05a3f..c078478376 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -444,6 +444,7 @@ typedef enum NodeTag
T_AlterCollationStmt,
T_CallStmt,
T_AlterStatsStmt,
+ T_AlterTableSetLoggedAllStmt,
/*
* TAGS FOR PARSE TREE NODES (parsenodes.h)
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index c24fc26da1..dda1f67a35 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2608,6 +2608,16 @@ typedef struct AlterTableMoveAllStmt
bool nowait;
} AlterTableMoveAllStmt;
+typedef struct AlterTableSetLoggedAllStmt
+{
+ NodeTag type;
+ char *tablespacename;
+ ObjectType objtype; /* Object type to change */
+ List *roles; /* List of roles to change objects of */
+ bool logged;
+ bool nowait;
+} AlterTableSetLoggedAllStmt;
+
/* ----------------------
* Create/Alter Extension Statements
* ----------------------
diff --git a/src/test/regress/expected/tablespace.out b/src/test/regress/expected/tablespace.out
index c52cf1cfcf..a679d58553 100644
--- a/src/test/regress/expected/tablespace.out
+++ b/src/test/regress/expected/tablespace.out
@@ -966,5 +966,81 @@ drop cascades to table testschema.part
drop cascades to table testschema.atable
drop cascades to materialized view testschema.amv
drop cascades to table testschema.tablespace_acl
+--
+-- Check persistence change in a tablespace
+CREATE SCHEMA testschema;
+GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1;
+CREATE TABLESPACE regress_tablespace LOCATION '';
+GRANT CREATE ON TABLESPACE regress_tablespace TO regress_tablespace_user1;
+CREATE TABLE testschema.lsu(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.usu(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lsu(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._usu(a int) TABLESPACE pg_default;
+SET ROLE regress_tablespace_user1;
+CREATE TABLE testschema.lu1(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.uu1(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lu1(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._uu1(a int) TABLESPACE pg_default;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | p
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | p
+ uu1 | regress_tablespace | u
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace
+ OWNED BY regress_tablespace_user1 SET LOGGED;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | p
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | p
+ uu1 | regress_tablespace | p
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+RESET ROLE;
+ALTER TABLE ALL IN TABLESPACE regress_tablespace SET UNLOGGED;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | u
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | u
+ uu1 | regress_tablespace | u
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+-- Should succeed
+DROP SCHEMA testschema CASCADE;
+NOTICE: drop cascades to 8 other objects
+DETAIL: drop cascades to table testschema.lsu
+drop cascades to table testschema.usu
+drop cascades to table testschema._lsu
+drop cascades to table testschema._usu
+drop cascades to table testschema.lu1
+drop cascades to table testschema.uu1
+drop cascades to table testschema._lu1
+drop cascades to table testschema._uu1
+DROP TABLESPACE regress_tablespace;
DROP ROLE regress_tablespace_user1;
DROP ROLE regress_tablespace_user2;
diff --git a/src/test/regress/sql/tablespace.sql b/src/test/regress/sql/tablespace.sql
index 21db433f2a..a4b664f4e0 100644
--- a/src/test/regress/sql/tablespace.sql
+++ b/src/test/regress/sql/tablespace.sql
@@ -431,5 +431,46 @@ DROP TABLESPACE regress_tblspace_renamed;
DROP SCHEMA testschema CASCADE;
+
+--
+-- Check persistence change in a tablespace
+CREATE SCHEMA testschema;
+GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1;
+CREATE TABLESPACE regress_tablespace LOCATION '';
+GRANT CREATE ON TABLESPACE regress_tablespace TO regress_tablespace_user1;
+
+CREATE TABLE testschema.lsu(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.usu(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lsu(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._usu(a int) TABLESPACE pg_default;
+SET ROLE regress_tablespace_user1;
+CREATE TABLE testschema.lu1(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.uu1(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lu1(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._uu1(a int) TABLESPACE pg_default;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace
+ OWNED BY regress_tablespace_user1 SET LOGGED;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+RESET ROLE;
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace SET UNLOGGED;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+-- Should succeed
+DROP SCHEMA testschema CASCADE;
+DROP TABLESPACE regress_tablespace;
+
DROP ROLE regress_tablespace_user1;
DROP ROLE regress_tablespace_user2;
--
2.27.0
At Thu, 31 Mar 2022 18:33:18 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
I don't think this can be committed to 15. So I'll post the fixed version
and then move this to the next CF.
Then done. Thanks!
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Mar 31, 2022 at 2:36 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Thu, 31 Mar 2022 18:33:18 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
I don't think this can be committed to 15. So I'll post the fixed version
and then move this to the next CF.
Then done. Thanks!
Hello! This patchset will need to be rebased over latest -- looks like
b74e94dc27f (Rethink PROCSIGNAL_BARRIER_SMGRRELEASE) and 5c279a6d350
(Custom WAL Resource Managers) are interfering.
Thanks,
--Jacob
At Wed, 6 Jul 2022 08:44:18 -0700, Jacob Champion <jchampion@timescale.com> wrote in
On Thu, Mar 31, 2022 at 2:36 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
At Thu, 31 Mar 2022 18:33:18 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
I don't think this can be committed to 15. So I'll post the fixed version
and then move this to the next CF.
Then done. Thanks!
Hello! This patchset will need to be rebased over latest -- looks like
b74e94dc27f (Rethink PROCSIGNAL_BARRIER_SMGRRELEASE) and 5c279a6d350
(Custom WAL Resource Managers) are interfering.
Thank you for checking that! The patch was hit even more widely by
b0a55e4329 (RelFileNumber). That commit's message suggests that
"relfilenode", where it refers to files, should be replaced with
"relation storage/file", so I did that in
ResetUnloggedRelationsInDbspaceDir.
This patch said:
* INIT forks are never loaded to shared buffer so no point in
* dropping buffers for such files.
But some *buildempty() functions actually use ReadBufferExtended() for
INIT_FORK, so that comment is wrong. So I made the patch drop those
buffers, but... I don't like that. Or rather, I don't like that some
AMs leave buffers for the INIT fork behind. Still, I feel I'm
misunderstanding something here, since I don't understand how the INIT
fork can work as expected after a crash that happens before the next
checkpoint flushes the buffers.
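To make the concern about leftover INIT-fork buffers concrete, here is a toy model in plain C. It is not PostgreSQL code: the `toy_*` names, the fixed-size pool, and the "every loaded page is dirty" rule are all assumptions made for illustration. It only shows that a page read through a buffer pool stays there as a dirty buffer until someone explicitly drops the buffers of that fork.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for fork numbers; not PostgreSQL's ForkNumber. */
enum toy_fork { TOY_MAIN_FORK, TOY_INIT_FORK };

struct toy_buffer
{
    bool          valid;    /* slot holds a page */
    bool          dirty;    /* page would be written by a later flush */
    unsigned      rel;      /* owning relation (toy identifier) */
    enum toy_fork fork;
    unsigned      block;
};

#define TOY_NBUFFERS 8
static struct toy_buffer toy_pool[TOY_NBUFFERS];

/* Load a page into the first free slot and mark it dirty; -1 if pool is full. */
int
toy_read_buffer(unsigned rel, enum toy_fork fork, unsigned block)
{
    assert(fork == TOY_MAIN_FORK || fork == TOY_INIT_FORK);

    for (int i = 0; i < TOY_NBUFFERS; i++)
    {
        if (!toy_pool[i].valid)
        {
            toy_pool[i] = (struct toy_buffer) {true, true, rel, fork, block};
            return i;
        }
    }
    return -1;
}

/* Invalidate every buffer of one fork of one relation; returns how many. */
int
toy_drop_fork_buffers(unsigned rel, enum toy_fork fork)
{
    int dropped = 0;

    for (int i = 0; i < TOY_NBUFFERS; i++)
    {
        if (toy_pool[i].valid &&
            toy_pool[i].rel == rel && toy_pool[i].fork == fork)
        {
            toy_pool[i].valid = false;
            toy_pool[i].dirty = false;
            dropped++;
        }
    }
    return dropped;
}

/* Count buffers of this fork that a later flush would still try to write. */
int
toy_count_dirty(unsigned rel, enum toy_fork fork)
{
    int n = 0;

    for (int i = 0; i < TOY_NBUFFERS; i++)
    {
        if (toy_pool[i].valid && toy_pool[i].dirty &&
            toy_pool[i].rel == rel && toy_pool[i].fork == fork)
            n++;
    }
    return n;
}
```

In this model, unlinking the INIT fork file without first calling `toy_drop_fork_buffers()` would leave a dirty buffer pointing at a file that no longer exists, which is the situation the patch avoids by dropping buffers before the unlink.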
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v22-0001-In-place-table-persistence-change.patch (text/x-patch; charset=us-ascii)
From 864ce55a05e67d03119462efa1820905c222e9d5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v22 1/2] In-place table persistence change
Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, it currently runs a heap rewrite, which causes a large amount
of file I/O. This patch makes the command run without a heap rewrite.
In addition, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of a massive number of HEAP_INSERTs, which should be
smaller.
This also allows for the cleanup of files left behind when the
transaction that created them crashes.
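
The crash-cleanup idea in the last paragraph can be sketched outside PostgreSQL with ordinary files. This is a minimal illustration under stated assumptions, not the patch's implementation: the `.u` suffix and the `create_storage`, `commit_storage`, and `recover_storage` names are hypothetical, and WAL-logging of each step is left out entirely. The mark file is created before the storage file and removed only at commit, so a mark that survives a crash identifies storage whose creating transaction never committed.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical naming: "<path>.u" marks storage created by an open xact. */
static void
mark_path(const char *path, char *buf, size_t len)
{
    assert(path != NULL);
    snprintf(buf, len, "%s.u", path);
}

/* Create an empty file (or keep an existing one); 0 on success. */
static int
touch_file(const char *path)
{
    int fd = open(path, O_CREAT | O_WRONLY, 0600);

    if (fd < 0)
        return -1;
    return close(fd);
}

int
file_exists(const char *path)
{
    struct stat st;

    return stat(path, &st) == 0;
}

/* Mark first, storage second: a crash in between leaves only a stray mark. */
int
create_storage(const char *path)
{
    char mark[1024];

    mark_path(path, mark, sizeof(mark));
    if (touch_file(mark) != 0)
        return -1;
    return touch_file(path);
}

/* At commit the storage becomes permanent, so only the mark is removed. */
int
commit_storage(const char *path)
{
    char mark[1024];

    mark_path(path, mark, sizeof(mark));
    return unlink(mark);
}

/*
 * At recovery, a surviving mark means the creating transaction never
 * committed.  Remove the storage first and the mark last, so that a crash
 * during cleanup still leaves the mark behind for the next attempt.
 * Returns 1 if something was cleaned up, 0 if there was no mark.
 */
int
recover_storage(const char *path)
{
    char mark[1024];

    mark_path(path, mark, sizeof(mark));
    if (!file_exists(mark))
        return 0;
    unlink(path);               /* may be absent if the crash came early */
    unlink(mark);
    return 1;
}
```

Calling `recover_storage()` after `commit_storage()` leaves the file alone, while calling it after only `create_storage()` removes both the file and its mark.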
---
src/backend/access/rmgrdesc/smgrdesc.c | 49 ++
src/backend/access/transam/README | 10 +
src/backend/access/transam/xact.c | 7 +
src/backend/access/transam/xlogrecovery.c | 18 +
src/backend/catalog/storage.c | 561 +++++++++++++++++++++-
src/backend/commands/tablecmds.c | 267 ++++++++--
src/backend/replication/basebackup.c | 9 +-
src/backend/storage/buffer/bufmgr.c | 85 ++++
src/backend/storage/file/fd.c | 4 +-
src/backend/storage/file/reinit.c | 340 +++++++++----
src/backend/storage/smgr/md.c | 95 +++-
src/backend/storage/smgr/smgr.c | 32 ++
src/backend/storage/sync/sync.c | 21 +-
src/bin/pg_rewind/parsexlog.c | 22 +
src/bin/pg_rewind/pg_rewind.c | 1 -
src/common/relpath.c | 47 +-
src/include/catalog/storage.h | 3 +
src/include/catalog/storage_xlog.h | 43 +-
src/include/common/relpath.h | 9 +-
src/include/storage/bufmgr.h | 2 +
src/include/storage/fd.h | 1 +
src/include/storage/md.h | 8 +-
src/include/storage/reinit.h | 10 +-
src/include/storage/smgr.h | 17 +
src/tools/pgindent/typedefs.list | 7 +
25 files changed, 1485 insertions(+), 183 deletions(-)
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index e0ee8a078a..2f92c06f70 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,46 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+ char *path = relpathperm(xlrec->rlocator, xlrec->forkNum);
+
+ appendStringInfoString(buf, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+ char *path = GetRelationPath(xlrec->rlocator.dbOid,
+ xlrec->rlocator.spcOid,
+ xlrec->rlocator.relNumber,
+ InvalidBackendId,
+ xlrec->forkNum, xlrec->mark);
+ char *action = "<none>";
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ action = "CREATE";
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ action = "DELETE";
+ break;
+ }
+
+ appendStringInfo(buf, "%s %s", action, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+ char *path = relpathperm(xlrec->rlocator, MAIN_FORKNUM);
+
+ appendStringInfoString(buf, path);
+ appendStringInfo(buf, " persistence %d", xlrec->persistence);
+ pfree(path);
+ }
}
const char *
@@ -55,6 +95,15 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_UNLINK:
+ id = "UNLINK";
+ break;
+ case XLOG_SMGR_MARK:
+ id = "MARK";
+ break;
+ case XLOG_SMGR_BUFPERSISTENCE:
+ id = "BUFPERSISTENCE";
+ break;
}
return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 734c39a4d0..f08bd7f42d 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -724,6 +724,16 @@ we must panic and abort recovery. The DBA will have to manually clean up
then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.
+================================
+The Smgr MARK files
+--------------------------------
+
+An smgr mark file is an empty file that is created when a new relation
+storage file is created to signal that the storage file needs to be
+cleaned up at recovery time. In contrast to the four actions above,
+failure to remove smgr mark files will lead to data loss, in which
+case the server will shut down.
+
Skipping WAL for New RelFileLocator
--------------------------------
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 116de1175b..a1d97150dd 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2213,6 +2213,9 @@ CommitTransaction(void)
*/
smgrDoPendingSyncs(true, is_parallel_worker);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2463,6 +2466,9 @@ PrepareTransaction(void)
*/
smgrDoPendingSyncs(true, false);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2788,6 +2794,7 @@ AbortTransaction(void)
AfterTriggerEndXact(false); /* 'false' means it's abort */
AtAbort_Portals();
smgrDoPendingSyncs(false, is_parallel_worker);
+ smgrDoPendingCleanups(false);
AtEOXact_LargeObject(false);
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 5d6f1b5e46..42b4f6b5c8 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -41,6 +41,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "catalog/pg_control.h"
+#include "catalog/storage.h"
#include "commands/tablespace.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -54,6 +55,7 @@
#include "storage/pmsignal.h"
#include "storage/proc.h"
#include "storage/procarray.h"
+#include "storage/reinit.h"
#include "storage/spin.h"
#include "utils/builtins.h"
#include "utils/guc.h"
@@ -1773,6 +1775,14 @@ PerformWalRecovery(void)
RmgrCleanup();
+ /* cleanup garbage files left during crash recovery */
+ if (!InArchiveRecovery)
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
ereport(LOG,
(errmsg("redo done at %X/%X system usage: %s",
LSN_FORMAT_ARGS(xlogreader->ReadRecPtr),
@@ -3052,6 +3062,14 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
{
ereport(DEBUG1,
(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+
+ /* cleanup garbage files left during crash recovery */
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
InArchiveRecovery = true;
if (StandbyModeRequested)
StandbyMode = true;
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index d708af19ed..bf21b35ba5 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,6 +19,7 @@
#include "postgres.h"
+#include "access/amapi.h"
#include "access/parallel.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -66,6 +67,23 @@ typedef struct PendingRelDelete
struct PendingRelDelete *next; /* linked-list link */
} PendingRelDelete;
+#define PCOP_UNLINK_FORK (1 << 0)
+#define PCOP_UNLINK_MARK (1 << 1)
+#define PCOP_SET_PERSISTENCE (1 << 2)
+
+typedef struct PendingCleanup
+{
+ RelFileLocator rlocator; /* relation that needs a cleanup */
+ int op; /* operation mask */
+ bool bufpersistence; /* buffer persistence to set */
+ ForkNumber unlink_forknum; /* forknum to unlink */
+ StorageMarks unlink_mark; /* mark to unlink */
+ BackendId backend; /* InvalidBackendId if not a temp rel */
+ bool atCommit; /* T=delete at commit; F=delete at abort */
+ int nestLevel; /* xact nesting level of request */
+ struct PendingCleanup *next; /* linked-list link */
+} PendingCleanup;
+
typedef struct PendingRelSync
{
RelFileLocator rlocator;
@@ -73,6 +91,7 @@ typedef struct PendingRelSync
} PendingRelSync;
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingCleanup * pendingCleanups = NULL; /* head of linked list */
static HTAB *pendingSyncHash = NULL;
@@ -123,6 +142,7 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
SMgrRelation srel;
BackendId backend;
bool needs_wal;
+ PendingCleanup *pendingclean;
Assert(!IsInParallelMode()); /* couldn't update pendingSyncHash */
@@ -145,9 +165,23 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
return NULL; /* placate compiler */
}
+ /*
+ * We are going to create a new storage file. If the server crashes before
+ * the current transaction ends, the file needs to be cleaned up. The
+ * SMGR_MARK_UNCOMMITTED mark file prompts that work at the next startup.
+ * We don't need this during WAL-logged CREATE DATABASE. See
+ * CreateAndCopyRelationData for details.
+ */
srel = smgropen(rlocator, backend);
+
+ if (register_delete)
+ {
+ log_smgrcreatemark(&rlocator, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+ }
+
smgrcreate(srel, MAIN_FORKNUM, false);
-
+
if (needs_wal)
log_smgrcreate(&srel->smgr_rlocator.locator, MAIN_FORKNUM);
@@ -157,16 +191,29 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
*/
if (register_delete)
{
- PendingRelDelete *pending;
+ PendingRelDelete *pendingdel;
- pending = (PendingRelDelete *)
+ pendingdel = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
- pending->rlocator = rlocator;
- pending->backend = backend;
- pending->atCommit = false; /* delete if abort */
- pending->nestLevel = GetCurrentTransactionNestLevel();
- pending->next = pendingDeletes;
- pendingDeletes = pending;
+ pendingdel->rlocator = rlocator;
+ pendingdel->backend = backend;
+ pendingdel->atCommit = false; /* delete if abort */
+ pendingdel->nestLevel = GetCurrentTransactionNestLevel();
+ pendingdel->next = pendingDeletes;
+ pendingDeletes = pendingdel;
+
+ /* drop mark files at commit */
+ pendingclean = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pendingclean->rlocator = rlocator;
+ pendingclean->op = PCOP_UNLINK_MARK;
+ pendingclean->unlink_forknum = MAIN_FORKNUM;
+ pendingclean->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pendingclean->backend = backend;
+ pendingclean->atCommit = true;
+ pendingclean->nestLevel = GetCurrentTransactionNestLevel();
+ pendingclean->next = pendingCleanups;
+ pendingCleanups = pendingclean;
}
if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
@@ -178,6 +225,204 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
return srel;
}
+/*
+ * RelationCreateInitFork
+ * Create physical storage for the init fork of a relation.
+ *
+ * Create the init fork for the relation.
+ *
+ * This function is transactional. The creation is WAL-logged, and if the
+ * transaction aborts later on, the init fork will be removed.
+ */
+void
+RelationCreateInitFork(Relation rel)
+{
+ RelFileLocator rlocator = rel->rd_locator;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ SMgrRelation srel;
+ bool create = true;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), false, false);
+
+ /*
+ * If we have a pending-unlink for the init-fork of this relation, that
+ * means the init-fork exists since before the current transaction
+ * started. This function reverts that change just by removing the entry.
+ * See RelationDropInitFork.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileLocatorEquals(rlocator, pending->rlocator) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ create = false;
+ }
+ else
+ prev = pending;
+ }
+
+ if (!create)
+ return;
+
+ /* create the init fork, along with the commit-sentinel file */
+ srel = smgropen(rlocator, InvalidBackendId);
+ log_smgrcreatemark(&rlocator, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+
+ /* There is no existing init fork, so create it. */
+ smgrcreate(srel, INIT_FORKNUM, false);
+
+ /*
+ * The init fork of an index needs further initialization. ambuildempty is
+ * expected to do the WAL-logging and file sync by itself; for other
+ * relation kinds we do that ourselves.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rlocator, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /* drop the init fork, mark file and revert persistence at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = rlocator;
+ pending->op = PCOP_UNLINK_FORK | PCOP_UNLINK_MARK | PCOP_SET_PERSISTENCE;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->bufpersistence = true;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* drop mark file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = rlocator;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
+/*
+ * RelationDropInitFork
+ * Delete physical storage for the init fork of a relation.
+ *
+ * Register pending-delete of the init fork. The real deletion is performed by
+ * smgrDoPendingCleanups at commit.
+ *
+ * This function is transactional. If the transaction aborts later on, the
+ * deletion is canceled.
+ */
+void
+RelationDropInitFork(Relation rel)
+{
+ RelFileLocator rlocator = rel->rd_locator;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ bool inxact_created = false;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), true, false);
+
+ /*
+ * If we have a pending-unlink for the init-fork of this relation, that
+ * means the init fork is created in the current transaction. We remove
+ * both the init fork and mark file immediately in that case. Otherwise
+ * register an at-commit pending-unlink for the existing init fork. See
+ * RelationCreateInitFork.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileLocatorEquals(rlocator, pending->rlocator) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ /* unlink list entry */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ inxact_created = true;
+ }
+ else
+ prev = pending;
+ }
+
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rlocator, InvalidBackendId);
+ ForkNumber forknum = INIT_FORKNUM;
+ BlockNumber firstblock = 0;
+
+ /*
+ * Some AMs initialize the INIT fork via the buffer manager. Drop all
+ * buffers for the INIT fork, then unlink the INIT fork along with the
+ * mark file.
+ */
+ DropRelFileLocatorBuffers(srel, &forknum, 1, &firstblock);
+ log_smgrunlinkmark(&rlocator, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrunlinkmark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+ log_smgrunlink(&rlocator, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ return;
+ }
+
+ /* register drop of this init fork file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = rlocator;
+ pending->op = PCOP_UNLINK_FORK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = rlocator;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = false;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -197,6 +442,88 @@ log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileLocator *rlocator, ForkNumber forkNum)
+{
+ xl_smgr_unlink xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file unlink.
+ */
+ xlrec.rlocator = *rlocator;
+ xlrec.forkNum = forkNum;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK record (mark creation) to WAL.
+ */
+void
+log_smgrcreatemark(const RelFileLocator *rlocator, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the mark file creation.
+ */
+ xlrec.rlocator = *rlocator;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_CREATE;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK record (mark file removal) to WAL.
+ */
+void
+log_smgrunlinkmark(const RelFileLocator *rlocator, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the mark file removal.
+ */
+ xlrec.rlocator = *rlocator;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_UNLINK;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileLocator *rlocator, bool persistence)
+{
+ xl_smgr_bufpersistence xlrec;
+
+ /*
+ * Make an XLOG entry reporting the change of buffer persistence.
+ */
+ xlrec.rlocator = *rlocator;
+ xlrec.persistence = persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -711,6 +1038,95 @@ smgrDoPendingDeletes(bool isCommit)
}
}
+/*
+ * smgrDoPendingCleanups() -- Clean up work that emits WAL records
+ *
+ * The operations handled in this function emit WAL records, which must be
+ * part of the current transaction.
+ */
+void
+smgrDoPendingCleanups(bool isCommit)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (pending->nestLevel < nestLevel)
+ {
+ /* outer-level entries should not be processed yet */
+ prev = pending;
+ }
+ else
+ {
+ /* unlink list entry first, so we don't retry on failure */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ /* do cleanup if called for */
+ if (pending->atCommit == isCommit)
+ {
+ SMgrRelation srel;
+
+ srel = smgropen(pending->rlocator, pending->backend);
+
+ Assert((pending->op &
+ ~(PCOP_UNLINK_FORK | PCOP_UNLINK_MARK |
+ PCOP_SET_PERSISTENCE)) == 0);
+
+ if (pending->op & PCOP_SET_PERSISTENCE)
+ {
+ SetRelationBuffersPersistence(srel, pending->bufpersistence,
+ InRecovery);
+ }
+
+ if (pending->op & PCOP_UNLINK_FORK)
+ {
+ /*
+ * Unlink the fork file. Currently we use this only for
+ * init forks, and we're sure that the init fork is not
+ * loaded in shared buffers. In the RelationDropInitFork
+ * case, that function dropped those buffers. In the
+ * RelationCreateInitFork case, PCOP_SET_PERSISTENCE(true)
+ * is set and the buffers have been dropped just before.
+ */
+ Assert(pending->unlink_forknum == INIT_FORKNUM);
+
+ /* Don't emit WAL during recovery. */
+ if (!InRecovery)
+ log_smgrunlink(&pending->rlocator,
+ pending->unlink_forknum);
+ smgrunlink(srel, pending->unlink_forknum, false);
+ }
+
+ if (pending->op & PCOP_UNLINK_MARK)
+ {
+ SMgrRelation srel;
+
+ if (!InRecovery)
+ log_smgrunlinkmark(&pending->rlocator,
+ pending->unlink_forknum,
+ pending->unlink_mark);
+ srel = smgropen(pending->rlocator, pending->backend);
+ smgrunlinkmark(srel, pending->unlink_forknum,
+ pending->unlink_mark, InRecovery);
+ smgrclose(srel);
+ }
+ }
+
+ /* must explicitly free the list entry */
+ pfree(pending);
+ /* prev does not change */
+ }
+ }
+}
+
/*
* smgrDoPendingSyncs() -- Take care of relation syncs at end of xact.
*/
@@ -971,6 +1387,15 @@ smgr_redo(XLogReaderState *record)
reln = smgropen(xlrec->rlocator, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rlocator, InvalidBackendId);
+ smgrunlink(reln, xlrec->forkNum, true);
+ smgrclose(reln);
+ }
else if (info == XLOG_SMGR_TRUNCATE)
{
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
@@ -1059,6 +1484,124 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ bool created = false;
+
+ reln = smgropen(xlrec->rlocator, InvalidBackendId);
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ smgrcreatemark(reln, xlrec->forkNum, xlrec->mark, true);
+ created = true;
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ smgrunlinkmark(reln, xlrec->forkNum, xlrec->mark, true);
+ break;
+ default:
+ elog(ERROR, "unknown smgr_mark action \"%c\"", xlrec->action);
+ }
+
+ if (created)
+ {
+ /* revert mark file operation at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = xlrec->rlocator;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = xlrec->forkNum;
+ pending->unlink_mark = xlrec->mark;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ else
+ {
+ /*
+ * Delete pending action for this mark file if any. We should have
+ * at most one entry for this action.
+ */
+ PendingCleanup *prev = NULL;
+
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileLocatorEquals(xlrec->rlocator, pending->rlocator) &&
+ pending->unlink_forknum == xlrec->forkNum &&
+ (pending->op & PCOP_UNLINK_MARK) != 0)
+ {
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+ }
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ PendingCleanup *prev = NULL;
+
+ reln = smgropen(xlrec->rlocator, InvalidBackendId);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+
+ /*
+ * Delete pending action for persistence change if any. We should have
+ * at most one entry for this action.
+ */
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileLocatorEquals(xlrec->rlocator, pending->rlocator) &&
+ (pending->op & PCOP_SET_PERSISTENCE) != 0)
+ {
+ Assert(pending->bufpersistence == xlrec->persistence);
+
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+
+ /*
+ * Revert buffer-persistence changes at abort if this record changes the
+ * relation's persistence from what it was before this transaction.
+ */
+ if (!pending)
+ {
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = xlrec->rlocator;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = !xlrec->persistence;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
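The pending-cleanup handling in smgrDoPendingCleanups() above follows the same list discipline as smgrDoPendingDeletes(): entries registered at an outer nesting level are skipped, and each remaining entry is unlinked from the list before its action runs so that a failure does not cause a retry. A minimal standalone sketch of that pattern follows. DemoPending, demo_register() and demo_do_pending() are hypothetical names used only for illustration; the sketch uses malloc() instead of MemoryContextAlloc() and merely records which entries would fire rather than touching storage.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for PendingCleanup; "id" replaces rlocator/op. */
typedef struct DemoPending
{
	int			id;
	bool		atCommit;	/* true: act at commit, false: act at abort */
	int			nestLevel;	/* transaction nesting level of the request */
	struct DemoPending *next;
} DemoPending;

static DemoPending *demo_pending = NULL;

/* Push a new entry on the head of the list, as the patch does. */
static void
demo_register(int id, bool atCommit, int nestLevel)
{
	DemoPending *p = malloc(sizeof(DemoPending));

	assert(p != NULL);
	p->id = id;
	p->atCommit = atCommit;
	p->nestLevel = nestLevel;
	p->next = demo_pending;
	demo_pending = p;
}

/*
 * Process entries at or above nestLevel.  The ids of entries whose action
 * would run are stored in done[]; the return value is how many were stored.
 */
static int
demo_do_pending(bool isCommit, int nestLevel, int *done, int maxdone)
{
	DemoPending *pending;
	DemoPending *prev = NULL;
	DemoPending *next;
	int			ndone = 0;

	for (pending = demo_pending; pending != NULL; pending = next)
	{
		next = pending->next;

		if (pending->nestLevel < nestLevel)
		{
			/* outer-level entries should not be processed yet */
			prev = pending;
			continue;
		}

		/* unlink list entry first, so we don't retry on failure */
		if (prev)
			prev->next = next;
		else
			demo_pending = next;

		/* do the action only if it was requested for this outcome */
		if (pending->atCommit == isCommit && ndone < maxdone)
			done[ndone++] = pending->id;

		free(pending);
		/* prev does not change */
	}

	return ndone;
}
```

Note that an entry whose atCommit does not match the outcome is still removed from the list; only its action is skipped.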
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index ef5b34a312..ab8ec38929 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -54,6 +54,7 @@
#include "commands/defrem.h"
#include "commands/event_trigger.h"
#include "commands/policy.h"
+#include "commands/progress.h"
#include "commands/sequence.h"
#include "commands/tablecmds.h"
#include "commands/tablespace.h"
@@ -5372,6 +5373,187 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
return newcmd;
}
+/*
+ * RelationChangePersistence: do in-place persistence change of a relation
+ */
+static void
+RelationChangePersistence(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ int i;
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * If any of the following were set we would need to call ATRewriteTable,
+ * which cannot happen in the AT_REWRITE_ALTER_PERSISTENCE case.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ /*
+ * First we collect all relations that we need to change persistence.
+ */
+
+ /* Collect OIDs of the relation itself and its indexes */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ /* Add toast relation if any */
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ List *toastidx;
+ Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
+
+ relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
+ toastidx = RelationGetIndexList(toastrel);
+ relids = list_concat(relids, toastidx);
+ pfree(toastidx);
+ table_close(toastrel, NoLock);
+ }
+
+ table_close(rel, NoLock);
+
+ /* Make changes in storage */
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach(lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+
+ /*
+ * XXX: Some access methods cannot tolerate an in-place persistence
+ * change. Specifically, GiST uses page LSNs to figure out whether a
+ * block has changed, and UNLOGGED GiST indexes use fake LSNs that
+ * are incompatible with the real LSNs used for LOGGED ones.
+ *
+ * Maybe if gistGetFakeLSN behaved the same way for permanent and
+ * unlogged indexes, we could skip the index rebuild in exchange for
+ * some extra WAL records emitted while it is unlogged.
+ *
+ * Check relam against a positive list so that we take the hard way for
+ * unknown AMs.
+ */
+ if (r->rd_rel->relkind == RELKIND_INDEX &&
+ /* GiST is excluded */
+ r->rd_rel->relam != BTREE_AM_OID &&
+ r->rd_rel->relam != HASH_AM_OID &&
+ r->rd_rel->relam != GIN_AM_OID &&
+ r->rd_rel->relam != SPGIST_AM_OID &&
+ r->rd_rel->relam != BRIN_AM_OID)
+ {
+ int reindex_flags;
+ ReindexParams params = {0};
+
+ /* reindex doesn't allow concurrent use of the index */
+ table_close(r, NoLock);
+
+ reindex_flags =
+ REINDEX_REL_SUPPRESS_INDEX_USE |
+ REINDEX_REL_CHECK_CONSTRAINTS;
+
+ /* Set the same persistence as the parent relation. */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+
+ reindex_index(reloid, reindex_flags, persistence, &params);
+
+ continue;
+ }
+
+ /* Create or drop init fork */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ RelationCreateInitFork(r);
+ else
+ RelationDropInitFork(r);
+
+ /*
+ * When this relation gets WAL-logged, immediately sync all files but
+ * the init fork to establish the initial state on storage. Buffers have
+ * already been flushed out by RelationCreate(Drop)InitFork called just
+ * above. The init fork should have been synced as needed.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT)
+ {
+ for (i = 0; i < INIT_FORKNUM; i++)
+ {
+ if (smgrexists(RelationGetSmgr(r), i))
+ smgrimmedsync(RelationGetSmgr(r), i);
+ }
+ }
+
+ /* Update catalog */
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+
+ /*
+ * While wal_level >= replica, switching to LOGGED requires the
+ * relation content to be WAL-logged to recover the table. We don't
+ * emit this if wal_level = minimal.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded())
+ {
+ ForkNumber fork;
+ xl_smgr_truncate xlrec;
+
+ xlrec.blkno = 0;
+ xlrec.rlocator = r->rd_locator;
+ xlrec.flags = SMGR_TRUNCATE_ALL;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ for (fork = 0; fork < INIT_FORKNUM; fork++)
+ {
+ if (smgrexists(RelationGetSmgr(r), fork))
+ log_newpage_range(r, fork, 0,
+ smgrnblocks(RelationGetSmgr(r), fork),
+ false);
+ }
+ }
+
+ table_close(r, NoLock);
+ }
+
+ table_close(classRel, NoLock);
+}
+
/*
* ATRewriteTables: ALTER TABLE phase 3
*/
@@ -5502,48 +5684,55 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- /*
- * Create transient table that will receive the modified data.
- *
- * Ensure it is marked correctly as logged or unlogged. We have
- * to do this here so that buffers for the new relfilenumber will
- * have the right persistence set, and at the same time ensure
- * that the original filenumbers's buffers will get read in with
- * the correct setting (i.e. the original one). Otherwise a
- * rollback after the rewrite would possibly result with buffers
- * for the original filenumbers having the wrong persistence
- * setting.
- *
- * NB: This relies on swap_relation_files() also swapping the
- * persistence. That wouldn't work for pg_class, but that can't be
- * unlogged anyway.
- */
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
- persistence, lockmode);
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
+ RelationChangePersistence(tab, persistence, lockmode);
+ else
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new
+ * relfilenumber will have the right persistence set, and at
+ * the same time ensure that the original filenumbers's buffers
+ * will get read in with the correct setting (i.e. the original
+ * one). Otherwise a rollback after the rewrite would possibly
+ * result with buffers for the original filenumbers having the
+ * wrong persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that
+ * can't be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace,
+ NewAccessMethod,
+ persistence, lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
+ /*
+ * Swap the physical files of the old and new heaps, then
+ * rebuild indexes and discard the old heap. We can use
+ * RecentXmin for the table's new relfrozenxid because we
+ * rewrote all the tuples in ATRewriteTable, so no older Xid
+ * remains in the table. Also, we never try to swap toast
+ * tables by content, since we have no interest in letting
+ * this code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
- InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ }
}
else if (tab->rewrite > 0 && tab->relkind == RELKIND_SEQUENCE)
{
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 95440013c0..0ef4b2bf01 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1167,6 +1167,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
bool excludeFound;
ForkNumber relForkNum; /* Type of fork if file is a relation */
int relOidChars; /* Chars in filename that are the rel oid */
+ StorageMarks mark;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1217,7 +1218,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
/* Exclude all forks for unlogged tables except the init fork */
if (isDbDir &&
parse_filename_for_nontemp_relation(de->d_name, &relOidChars,
- &relForkNum))
+ &relForkNum, &mark))
{
/* Never exclude init forks */
if (relForkNum != INIT_FORKNUM)
@@ -1442,6 +1443,7 @@ is_checksummed_file(const char *fullpath, const char *filename)
strncmp(fullpath, "/", 1) == 0)
{
int excludeIdx;
+ char *p;
/* Compare file against noChecksumFiles skip list */
for (excludeIdx = 0; noChecksumFiles[excludeIdx].name != NULL; excludeIdx++)
@@ -1455,6 +1457,11 @@ is_checksummed_file(const char *fullpath, const char *filename)
return false;
}
+ /* exclude mark files */
+ p = strchr(filename, '.');
+ if (p && isalpha(p[1]) && p[2] == 0)
+ return false;
+
return true;
}
else
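The check added to is_checksummed_file() above treats a file as a mark file when its name has a '.' followed by exactly one alphabetic character; segment files such as "16384.1" have a numeric suffix and so do not match. A small standalone sketch of that test follows. looks_like_mark_file() is a hypothetical helper that is not part of the patch, and the file names in the usage note are illustrative only. The sketch also casts the argument of isalpha() to unsigned char, which the hunk does not do; that cast is my assumption about what is safest for bytes above 0x7f.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: returns true when the first '.' in "filename" is
 * followed by exactly one alphabetic character and then the end of string.
 */
static bool
looks_like_mark_file(const char *filename)
{
	const char *p;

	assert(filename != NULL);

	p = strchr(filename, '.');

	/* isalpha('\0') is false, so p[2] is never read past the terminator */
	return p != NULL && isalpha((unsigned char) p[1]) && p[2] == '\0';
}
```

With this helper, a name like "16384.u" is reported as a mark file, while "16384", "16384.1" and "16384.ab" are not.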
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e4de4b306c..e8b8b33780 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3161,6 +3161,91 @@ DropRelFileLocatorBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
}
}
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a relation.
+ * When switching to PERMANENT, it also writes all dirty pages of the
+ * relation out to disk (or more accurately, out to kernel disk buffers),
+ * ensuring that the kernel has an up-to-date view of the relation.
+ *
+ * Generally, the caller should be holding AccessExclusiveLock on the
+ * target relation to ensure that no other backend is busy dirtying
+ * more blocks of the relation; the effects can't be expected to last
+ * after the lock is released.
+ *
+ * XXX currently it sequentially searches the buffer pool, should be
+ * changed to more clever ways of searching. This routine is not
+ * used in any performance-critical code paths, so it's not worth
+ * adding additional overhead to normal paths to make it go faster;
+ * but see also DropRelFileLocatorBuffers.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
+{
+ int i;
+ RelFileLocatorBackend rlocator = srel->smgr_rlocator;
+
+ Assert(!RelFileLocatorBackendIsTemp(rlocator));
+
+ if (!isRedo)
+ log_smgrbufpersistence(&srel->smgr_rlocator.locator, permanent);
+
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ if (!RelFileLocatorEquals(bufHdr->tag.rlocator, rlocator.locator))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (!RelFileLocatorEquals(bufHdr->tag.rlocator, rlocator.locator))
+ {
+ UnlockBufHdr(bufHdr, buf_state);
+ continue;
+ }
+
+ if (permanent)
+ {
+ /* Init fork is being dropped, drop buffers for it. */
+ if (bufHdr->tag.forkNum == INIT_FORKNUM)
+ {
+ InvalidateBuffer(bufHdr);
+ continue;
+ }
+
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /* we flush this buffer when switching to PERMANENT */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, srel);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr, true);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ /* There shouldn't be an init fork */
+ Assert(bufHdr->tag.forkNum != INIT_FORKNUM);
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ }
+}
+
/* ---------------------------------------------------------------------
* DropRelFileLocatorsAllBuffers
*
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index f904f60c08..bcdbbad0f1 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -349,8 +349,6 @@ static void pre_sync_fname(const char *fname, bool isdir, int elevel);
static void datadir_fsync_fname(const char *fname, bool isdir, int elevel);
static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel);
-static int fsync_parent_path(const char *fname, int elevel);
-
/*
* pg_fsync --- do fsync with or without writethrough
@@ -3696,7 +3694,7 @@ fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel)
* This is aimed at making file operations persistent on disk in case of
* an OS crash or power failure.
*/
-static int
+int
fsync_parent_path(const char *fname, int elevel)
{
char parentpath[MAXPGPATH];
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index f053fe0495..4c2b19ada4 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -16,29 +16,49 @@
#include <unistd.h>
+#include "access/xlogrecovery.h"
+#include "catalog/pg_tablespace_d.h"
#include "common/relpath.h"
#include "postmaster/startup.h"
+#include "storage/bufmgr.h"
#include "storage/copydir.h"
#include "storage/fd.h"
+#include "storage/md.h"
#include "storage/reinit.h"
+#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
- int op);
+ Oid tspid, int op);
static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
- int op);
+ Oid tspid, Oid dbid, int op);
typedef struct
{
Oid reloid; /* hash key */
-} unlogged_relation_entry;
+ bool has_init; /* has INIT fork */
+ bool dirty_init; /* needs to remove INIT fork */
+ bool dirty_all; /* needs to remove all forks */
+} relfile_entry;
/*
- * Reset unlogged relations from before the last restart.
+ * Clean up and reset relation files from before the last restart.
*
- * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any
- * relation with an "init" fork, except for the "init" fork itself.
+ * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations
+ * depending on the existence of the "cleanup" forks.
+ *
+ * If SMGR_MARK_UNCOMMITTED mark file for init fork is present, we remove the
+ * init fork along with the mark file.
+ *
+ * If SMGR_MARK_UNCOMMITTED mark file for main fork is present, we remove the
+ * whole relation along with the mark file.
+ *
+ * Otherwise, if the "init" fork is found, we remove all forks of any relation
+ * with the "init" fork, except for the "init" fork itself.
+ *
+ * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all
+ * relations that have the "cleanup" and/or the "init" forks.
*
* If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main
* fork.
@@ -72,7 +92,7 @@ ResetUnloggedRelations(int op)
/*
* First process unlogged files in pg_default ($PGDATA/base)
*/
- ResetUnloggedRelationsInTablespaceDir("base", op);
+ ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op);
/*
* Cycle through directories for all non-default tablespaces.
@@ -81,13 +101,19 @@ ResetUnloggedRelations(int op)
while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
{
+ Oid tspid;
+
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
continue;
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
- ResetUnloggedRelationsInTablespaceDir(temp_path, op);
+
+ tspid = atooid(spc_de->d_name);
+
+ Assert(tspid != 0);
+ ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op);
}
FreeDir(spc_dir);
@@ -103,7 +129,8 @@ ResetUnloggedRelations(int op)
* Process one tablespace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
+ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
+ Oid tspid, int op)
{
DIR *ts_dir;
struct dirent *de;
@@ -130,6 +157,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
while ((de = ReadDir(ts_dir, tsdirname)) != NULL)
{
+ Oid dbid;
+
/*
* We're only interested in the per-database directories, which have
* numeric names. Note that this code will also (properly) ignore "."
@@ -148,7 +177,10 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
ereport_startup_progress("resetting unlogged relations (cleanup), elapsed time: %ld.%02d s, current path: %s",
dbspace_path);
- ResetUnloggedRelationsInDbspaceDir(dbspace_path, op);
+ dbid = atooid(de->d_name);
+ Assert(dbid != 0);
+
+ ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op);
}
FreeDir(ts_dir);
@@ -158,125 +190,227 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
* Process one per-dbspace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
+ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
+ Oid tspid, Oid dbid, int op)
{
DIR *dbspace_dir;
struct dirent *de;
char rm_path[MAXPGPATH * 2];
+ HTAB *hash;
+ HASHCTL ctl;
/* Caller must specify at least one operation. */
- Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0);
+ Assert((op & (UNLOGGED_RELATION_CLEANUP |
+ UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_INIT)) != 0);
/*
* Cleanup is a two-pass operation. First, we go through and identify all
* the files with init forks. Then, we go through again and nuke
* everything with the same OID except the init fork.
*/
+
+ /*
+ * It's possible that someone could create tons of unlogged relations in
+ * the same database & tablespace, so we'd better use a hash table rather
+ * than an array or linked list to keep track of which files need to be
+ * reset. Otherwise, this cleanup operation would be O(n^2).
+ */
+ memset(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(relfile_entry);
+ hash = hash_create("relation files cleanup hash",
+ 32, &ctl, HASH_ELEM | HASH_BLOBS);
+
+ /* Collect INIT fork and mark files in the directory. */
+ dbspace_dir = AllocateDir(dbspacedirname);
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
+ {
+ int oidchars;
+ ForkNumber forkNum;
+ StorageMarks mark;
+
+ /* Skip anything that doesn't look like a relation data file. */
+ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
+ &forkNum, &mark))
+ continue;
+
+ if (forkNum == INIT_FORKNUM || mark == SMGR_MARK_UNCOMMITTED)
+ {
+ Oid key;
+ relfile_entry *ent;
+ bool found;
+
+ /*
+ * Record the relation file information. If it has
+ * SMGR_MARK_UNCOMMITTED mark files, the storage file is in dirty
+ * state, where clean up is needed.
+ */
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_ENTER, &found);
+
+ if (!found)
+ {
+ ent->has_init = false;
+ ent->dirty_init = false;
+ ent->dirty_all = false;
+ }
+
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_init = true;
+ else if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_all = true;
+ else
+ {
+ Assert(forkNum == INIT_FORKNUM);
+ ent->has_init = true;
+ }
+ }
+ }
+
+ /* Done with the first pass. */
+ FreeDir(dbspace_dir);
+
+ /* nothing to do if we have neither init forks nor mark files */
+ if (hash_get_num_entries(hash) < 1)
+ {
+ hash_destroy(hash);
+ return;
+ }
+
+ if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0)
+ {
+ /*
+ * When we come here after recovery, an smgr object for this file might
+ * have been created. In that case we need to drop all buffers and then
+ * the smgr object before initializing the unlogged relation. This is
+ * safe as long as no other backends have accessed the relation before
+ * starting archive recovery.
+ */
+ HASH_SEQ_STATUS status;
+ relfile_entry *ent;
+ SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8);
+ int maxrels = 8;
+ int nrels = 0;
+ int i;
+
+ Assert(!HotStandbyActive());
+
+ hash_seq_init(&status, hash);
+ while ((ent = (relfile_entry *) hash_seq_search(&status)) != NULL)
+ {
+ RelFileLocatorBackend rel;
+
+ /*
+ * The relation is persistent and stays persistent. Don't drop the
+ * buffers for this relation.
+ */
+ if (ent->has_init && ent->dirty_init)
+ continue;
+
+ if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ rel.backend = InvalidBackendId;
+ rel.locator.spcOid = tspid;
+ rel.locator.dbOid = dbid;
+ rel.locator.relNumber = ent->reloid;
+
+ srels[nrels++] = smgropen(rel.locator, InvalidBackendId);
+ }
+
+ DropRelFileLocatorsAllBuffers(srels, nrels);
+
+ for (i = 0; i < nrels; i++)
+ smgrclose(srels[i]);
+ }
+
+ /*
+ * Now, make a second pass and remove anything that matches.
+ */
if ((op & UNLOGGED_RELATION_CLEANUP) != 0)
{
- HTAB *hash;
- HASHCTL ctl;
-
- /*
- * It's possible that someone could create a ton of unlogged relations
- * in the same database & tablespace, so we'd better use a hash table
- * rather than an array or linked list to keep track of which files
- * need to be reset. Otherwise, this cleanup operation would be
- * O(n^2).
- */
- ctl.keysize = sizeof(Oid);
- ctl.entrysize = sizeof(unlogged_relation_entry);
- ctl.hcxt = CurrentMemoryContext;
- hash = hash_create("unlogged relation OIDs", 32, &ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- /* Scan the directory. */
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
- unlogged_relation_entry ent;
+ Oid key;
+ relfile_entry *ent;
+ RelFileLocatorBackend rel;
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* Also skip it unless this is the init fork. */
- if (forkNum != INIT_FORKNUM)
- continue;
-
- /*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
- */
- ent.reloid = atooid(de->d_name);
- (void) hash_search(hash, &ent, HASH_ENTER, NULL);
- }
-
- /* Done with the first pass. */
- FreeDir(dbspace_dir);
-
- /*
- * If we didn't find any init forks, there's no point in continuing;
- * we can bail out now.
- */
- if (hash_get_num_entries(hash) == 0)
- {
- hash_destroy(hash);
- return;
- }
-
- /*
- * Now, make a second pass and remove anything that matches.
- */
- dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
- {
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
-
- /* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* We never remove the init fork. */
- if (forkNum == INIT_FORKNUM)
+ &forkNum, &mark))
continue;
/*
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
- if (hash_search(hash, &ent, HASH_FIND, NULL))
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);
+
+ if (!ent)
+ continue;
+
+ if (!ent->dirty_all)
{
- snprintf(rm_path, sizeof(rm_path), "%s/%s",
- dbspacedirname, de->d_name);
- if (unlink(rm_path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m",
- rm_path)));
+ /* clean permanent relations don't need cleanup */
+ if (!ent->has_init)
+ continue;
+
+ if (ent->dirty_init)
+ {
+ /*
+ * The crashed transaction did SET UNLOGGED. This relation
+ * is restored to a LOGGED relation.
+ */
+ if (forkNum != INIT_FORKNUM)
+ continue;
+ }
else
- elog(DEBUG2, "unlinked file \"%s\"", rm_path);
+ {
+ /*
+ * We don't remove the INIT fork of a non-dirty
+ * relation.
+ */
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE)
+ continue;
+ }
}
+
+ /* so, nuke it! */
+ snprintf(rm_path, sizeof(rm_path), "%s/%s",
+ dbspacedirname, de->d_name);
+ if (unlink(rm_path) < 0)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ rm_path));
+
+ rel.backend = InvalidBackendId;
+ rel.locator.spcOid = tspid;
+ rel.locator.dbOid = dbid;
+ rel.locator.relNumber = atooid(de->d_name);
+
+ ForgetRelationForkSyncRequests(rel, forkNum);
}
/* Cleanup is complete. */
FreeDir(dbspace_dir);
- hash_destroy(hash);
}
+ hash_destroy(hash);
+ hash = NULL;
+
/*
* Initialization happens after cleanup is complete: we copy each init
- * fork file to the corresponding main fork file. Note that if we are
- * asked to do both cleanup and init, we may never get here: if the
- * cleanup code determines that there are no init forks in this dbspace,
- * it will return before we get to this point.
+ * fork file to the corresponding main fork file.
*/
if ((op & UNLOGGED_RELATION_INIT) != 0)
{
@@ -285,6 +419,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char srcpath[MAXPGPATH * 2];
@@ -292,9 +427,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -328,15 +465,18 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char mainpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -379,7 +519,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
*/
bool
parse_filename_for_nontemp_relation(const char *name, int *oidchars,
- ForkNumber *fork)
+ ForkNumber *fork, StorageMarks *mark)
{
int pos;
@@ -410,11 +550,19 @@ parse_filename_for_nontemp_relation(const char *name, int *oidchars,
for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar)
;
- if (segchar <= 1)
- return false;
- pos += segchar;
+ if (segchar > 1)
+ pos += segchar;
}
+ /* mark file? */
+ if (name[pos] == '.' && name[pos + 1] != 0)
+ {
+ *mark = name[pos + 1];
+ pos += 2;
+ }
+ else
+ *mark = SMGR_MARK_NONE;
+
/* Now we should be at the end. */
if (name[pos] != '\0')
return false;
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 3998296a62..4249df657a 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -141,7 +141,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
BlockNumber blkno, bool skipFsync, int behavior);
static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
MdfdVec *seg);
-
+static bool mdmarkexists(SMgrRelation reln, ForkNumber forkNum,
+ StorageMarks mark);
/*
* mdinit() -- Initialize private state for magnetic disk storage manager.
@@ -173,6 +174,82 @@ mdexists(SMgrRelation reln, ForkNumber forkNum)
return (mdopenfork(reln, forkNum, EXTENSION_RETURN_NULL) != NULL);
}
+/*
+ * mdcreatemark() -- Create a mark file.
+ *
+ * If isRedo is true, it's okay for the file to exist already.
+ */
+void
+mdcreatemark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rlocator, forkNum, mark);
+ int fd;
+
+ /* See mdcreate for details. */
+ TablespaceCreateDbspace(reln->smgr_rlocator.locator.spcOid,
+ reln->smgr_rlocator.locator.dbOid,
+ isRedo);
+
+ fd = BasicOpenFile(path, O_WRONLY | O_CREAT | O_EXCL);
+ if (fd < 0 && (!isRedo || errno != EEXIST))
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not create mark file \"%s\": %m", path));
+
+ pg_fsync(fd);
+ close(fd);
+
+ /*
+ * To guarantee that the creation of the file is persistent, fsync its
+ * parent directory.
+ */
+ fsync_parent_path(path, ERROR);
+
+ pfree(path);
+}
+
+
+/*
+ * mdunlinkmark() -- Delete the mark file
+ *
+ * If isRedo is true, it's okay for the file to be missing.
+ */
+void
+mdunlinkmark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rlocator, forkNum, mark);
+
+ if (!isRedo || mdmarkexists(reln, forkNum, mark))
+ durable_unlink(path, ERROR);
+
+ pfree(path);
+}
+
+/*
+ * mdmarkexists() -- Check if the file exists.
+ */
+static bool
+mdmarkexists(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark)
+{
+ char *path = markpath(reln->smgr_rlocator, forkNum, mark);
+ int fd;
+
+ fd = BasicOpenFile(path, O_RDONLY);
+ if (fd < 0 && errno != ENOENT)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not access mark file \"%s\": %m", path));
+ pfree(path);
+
+ if (fd < 0)
+ return false;
+
+ close(fd);
+ return true;
+}
+
/*
* mdcreate() -- Create a new relation on magnetic disk.
*
@@ -1032,6 +1109,16 @@ register_forget_request(RelFileLocatorBackend rlocator, ForkNumber forknum,
RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ );
}
+/*
+ * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork
+ */
+void
+ForgetRelationForkSyncRequests(RelFileLocatorBackend rlocator,
+ ForkNumber forknum)
+{
+ register_forget_request(rlocator, forknum, 0);
+}
+
/*
* ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB
*/
@@ -1390,12 +1477,14 @@ mdsyncfiletag(const FileTag *ftag, char *path)
* Return 0 on success, -1 on failure, with errno set.
*/
int
-mdunlinkfiletag(const FileTag *ftag, char *path)
+mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark)
{
char *p;
/* Compute the path. */
- p = relpathperm(ftag->rlocator, MAIN_FORKNUM);
+ p = GetRelationPath(ftag->rlocator.dbOid, ftag->rlocator.spcOid,
+ ftag->rlocator.relNumber, InvalidBackendId,
+ MAIN_FORKNUM, mark);
strlcpy(path, p, MAXPGPATH);
pfree(p);
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index b21d8c3822..b461c0d583 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -62,6 +62,10 @@ typedef struct f_smgr
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+ void (*smgr_createmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+ void (*smgr_unlinkmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
} f_smgr;
static const f_smgr smgrsw[] = {
@@ -82,6 +86,8 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
+ .smgr_createmark = mdcreatemark,
+ .smgr_unlinkmark = mdunlinkmark,
}
};
@@ -371,6 +377,26 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo);
}
+/*
+ * smgrcreatemark() -- Create a mark file
+ */
+void
+smgrcreatemark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_createmark(reln, forknum, mark, isRedo);
+}
+
+/*
+ * smgrunlinkmark() -- Delete a mark file
+ */
+void
+smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlinkmark(reln, forknum, mark, isRedo);
+}
+
/*
* smgrdosyncall() -- Immediately sync all forks of all given relations
*
@@ -693,6 +719,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rlocator, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index e1fb631003..1a72dd12bb 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -91,7 +91,8 @@ static CycleCtr checkpoint_cycle_ctr = 0;
typedef struct SyncOps
{
int (*sync_syncfiletag) (const FileTag *ftag, char *path);
- int (*sync_unlinkfiletag) (const FileTag *ftag, char *path);
+ int (*sync_unlinkfiletag) (const FileTag *ftag, char *path,
+ StorageMarks mark);
bool (*sync_filetagmatches) (const FileTag *ftag,
const FileTag *candidate);
} SyncOps;
@@ -235,7 +236,8 @@ SyncPostCheckpoint(void)
/* Unlink the file */
if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
- path) < 0)
+ path,
+ SMGR_MARK_NONE) < 0)
{
/*
* There's a race condition, when the database is dropped at the
@@ -244,6 +246,21 @@ SyncPostCheckpoint(void)
* here. rmtree() also has to ignore ENOENT errors, to deal with
* the possibility that we delete the file first.
*/
+ if (errno != ENOENT)
+ ereport(WARNING,
+ errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path));
+ }
+ else if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
+ path,
+ SMGR_MARK_UNCOMMITTED)
+ < 0)
+ {
+ /*
+ * There may also be an SMGR_MARK_UNCOMMITTED file. Remove it if the
+ * fork file has been successfully removed. It's ok if the file
+ * does not exist.
+ */
if (errno != ENOENT)
ereport(WARNING,
(errcode_for_file_access(),
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 53f011a2fe..4da62f71ce 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -418,6 +418,28 @@ extractPageInfo(XLogReaderState *record)
* source system.
*/
}
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_UNLINK)
+ {
+ /*
+ * We can safely ignore these. We'll see that the files don't exist in
+ * the target data dir, and copy them in from the source system. No
+ * need to do anything special here.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_MARK)
+ {
+ /*
+ * We can safely ignore these. The file will be removed from the
+ * target if it doesn't exist in the source system. The files are
+ * empty, so we don't need to bother with the content.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ /*
+ * We can safely ignore these. These don't make any on-disk changes.
+ */
+ }
else if (rmid == RM_XACT_ID &&
((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT ||
(rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED ||
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 1ff8da1676..c48a60cfe1 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -419,7 +419,6 @@ main(int argc, char **argv)
if (showprogress)
pg_log_info("reading source file list");
source->traverse_files(source, &process_source_file);
-
if (showprogress)
pg_log_info("reading target file list");
traverse_datadir(datadir_target, &process_target_file);
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 1b6b620ce8..46cfe38fd5 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -139,9 +139,15 @@ GetDatabasePath(Oid dbOid, Oid spcOid)
*/
char *
GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
- int backendId, ForkNumber forkNumber)
+ int backendId, ForkNumber forkNumber, char mark)
{
char *path;
+ char markstr[4];
+
+ if (mark == 0)
+ markstr[0] = 0;
+ else
+ snprintf(markstr, sizeof(markstr), ".%c", mark);
if (spcOid == GLOBALTABLESPACE_OID)
{
@@ -149,10 +155,10 @@ GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
Assert(dbOid == 0);
Assert(backendId == InvalidBackendId);
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("global/%u_%s",
- relNumber, forkNames[forkNumber]);
+ path = psprintf("global/%u_%s%s",
+ relNumber, forkNames[forkNumber], markstr);
else
- path = psprintf("global/%u", relNumber);
+ path = psprintf("global/%u%s", relNumber, markstr);
}
else if (spcOid == DEFAULTTABLESPACE_OID)
{
@@ -160,22 +166,22 @@ GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/%u_%s",
+ path = psprintf("base/%u/%u_%s%s",
dbOid, relNumber,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/%u",
- dbOid, relNumber);
+ path = psprintf("base/%u/%u%s",
+ dbOid, relNumber, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/t%d_%u_%s",
+ path = psprintf("base/%u/t%d_%u_%s%s",
dbOid, backendId, relNumber,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/t%d_%u",
- dbOid, backendId, relNumber);
+ path = psprintf("base/%u/t%d_%u%s",
+ dbOid, backendId, relNumber, markstr);
}
}
else
@@ -184,27 +190,28 @@ GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u_%s%s",
spcOid, TABLESPACE_VERSION_DIRECTORY,
dbOid, relNumber,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u%s",
spcOid, TABLESPACE_VERSION_DIRECTORY,
- dbOid, relNumber);
+ dbOid, relNumber, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s%s",
spcOid, TABLESPACE_VERSION_DIRECTORY,
dbOid, backendId, relNumber,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u%s",
spcOid, TABLESPACE_VERSION_DIRECTORY,
- dbOid, backendId, relNumber);
+ dbOid, backendId, relNumber, markstr);
}
}
+
return path;
}
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 9964c312aa..ee4179699a 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -25,6 +25,8 @@ extern PGDLLIMPORT int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileLocator rlocator,
char relpersistence,
bool register_delete);
+extern void RelationCreateInitFork(Relation rel);
+extern void RelationDropInitFork(Relation rel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileLocator rlocator, bool atCommit);
extern void RelationPreTruncate(Relation rel);
@@ -43,6 +45,7 @@ extern void RestorePendingSyncs(char *startAddress);
extern void smgrDoPendingDeletes(bool isCommit);
extern void smgrDoPendingSyncs(bool isCommit, bool isParallelWorker);
extern int smgrGetPendingDeletes(bool forCommit, RelFileLocator **ptr);
+extern void smgrDoPendingCleanups(bool isCommit);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index 44a5e2043b..9f48fb5e6f 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -18,17 +18,23 @@
#include "lib/stringinfo.h"
#include "storage/block.h"
#include "storage/relfilelocator.h"
+#include "storage/smgr.h"
/*
* Declarations for smgr-related XLOG records
*
- * Note: we log file creation and truncation here, but logging of deletion
- * actions is handled by xact.c, because it is part of transaction commit.
+ * Note: we log file creation, truncation and buffer persistence change here,
+ * but logging of deletion actions is handled mainly by xact.c, because it is
+ * part of transaction commit in most cases. However, there's a case where
+ * init forks are deleted outside the control of a transaction.
*/
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_UNLINK 0x30
+#define XLOG_SMGR_MARK 0x40
+#define XLOG_SMGR_BUFPERSISTENCE 0x50
typedef struct xl_smgr_create
{
@@ -36,6 +42,32 @@ typedef struct xl_smgr_create
ForkNumber forkNum;
} xl_smgr_create;
+typedef struct xl_smgr_unlink
+{
+ RelFileLocator rlocator;
+ ForkNumber forkNum;
+} xl_smgr_unlink;
+
+typedef enum smgr_mark_action
+{
+ XLOG_SMGR_MARK_CREATE = 'c',
+ XLOG_SMGR_MARK_UNLINK = 'u'
+} smgr_mark_action;
+
+typedef struct xl_smgr_mark
+{
+ RelFileLocator rlocator;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ smgr_mark_action action;
+} xl_smgr_mark;
+
+typedef struct xl_smgr_bufpersistence
+{
+ RelFileLocator rlocator;
+ bool persistence;
+} xl_smgr_bufpersistence;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +83,13 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileLocator *rlocator, ForkNumber forkNum);
+extern void log_smgrcreatemark(const RelFileLocator *rlocator,
+ ForkNumber forkNum, StorageMarks mark);
+extern void log_smgrunlinkmark(const RelFileLocator *rlocator,
+ ForkNumber forkNum, StorageMarks mark);
+extern void log_smgrbufpersistence(const RelFileLocator *rlocator,
+ bool persistence);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index 3ab713247f..eba7a05f4e 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -67,7 +67,7 @@ extern int forkname_chars(const char *str, ForkNumber *fork);
extern char *GetDatabasePath(Oid dbOid, Oid spcOid);
extern char *GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
- int backendId, ForkNumber forkNumber);
+ int backendId, ForkNumber forkNumber, char mark);
/*
* Wrapper macros for GetRelationPath. Beware of multiple
@@ -77,7 +77,7 @@ extern char *GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
/* First argument is a RelFileLocator */
#define relpathbackend(rlocator, backend, forknum) \
GetRelationPath((rlocator).dbOid, (rlocator).spcOid, (rlocator).relNumber, \
- backend, forknum)
+ backend, forknum, 0)
/* First argument is a RelFileLocator */
#define relpathperm(rlocator, forknum) \
@@ -87,4 +87,9 @@ extern char *GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
#define relpath(rlocator, forknum) \
relpathbackend((rlocator).locator, (rlocator).backend, forknum)
+/* First argument is a RelFileLocatorBackend */
+#define markpath(rlocator, forknum, mark) \
+ GetRelationPath((rlocator).locator.dbOid, (rlocator).locator.spcOid, \
+ (rlocator).locator.relNumber, \
+ (rlocator).backend, forknum, mark)
#endif /* RELPATH_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 7bcfaac272..a4271ebd69 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -213,6 +213,8 @@ extern void DropRelFileLocatorBuffers(struct SMgrRelationData *smgr_reln,
int nforks, BlockNumber *firstDelBlock);
extern void DropRelFileLocatorsAllBuffers(struct SMgrRelationData **smgr_reln,
int nlocators);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+ bool permanent, bool isRedo);
extern void DropDatabaseBuffers(Oid dbid);
#define RelationGetNumberOfBlocks(reln) \
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 2b4a8e0ffe..eb33a9ba4c 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -185,6 +185,7 @@ extern ssize_t pg_pwritev_with_retry(int fd,
extern int pg_truncate(const char *path, off_t length);
extern void fsync_fname(const char *fname, bool isdir);
extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
+extern int fsync_parent_path(const char *fname, int elevel);
extern int durable_rename(const char *oldfile, const char *newfile, int loglevel);
extern int durable_unlink(const char *fname, int loglevel);
extern void SyncDataDirectory(void);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 10aa1b0109..294a09444c 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -23,6 +23,10 @@
extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
extern void mdclose(SMgrRelation reln, ForkNumber forknum);
+extern void mdcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void mdunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
extern void mdunlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo);
@@ -41,12 +45,14 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void ForgetRelationForkSyncRequests(RelFileLocatorBackend rlocator,
+ ForkNumber forknum);
extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo);
/* md sync callbacks */
extern int mdsyncfiletag(const FileTag *ftag, char *path);
-extern int mdunlinkfiletag(const FileTag *ftag, char *path);
+extern int mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark);
extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *candidate);
#endif /* MD_H */
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index bf2c10d443..e399aec0c7 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -16,13 +16,15 @@
#define REINIT_H
#include "common/relpath.h"
-
+#include "storage/smgr.h"
extern void ResetUnloggedRelations(int op);
-extern bool parse_filename_for_nontemp_relation(const char *name,
- int *oidchars, ForkNumber *fork);
+extern bool parse_filename_for_nontemp_relation(const char *name, int *oidchars,
+ ForkNumber *fork,
+ StorageMarks *mark);
#define UNLOGGED_RELATION_CLEANUP 0x0001
-#define UNLOGGED_RELATION_INIT 0x0002
+#define UNLOGGED_RELATION_DROP_BUFFER 0x0002
+#define UNLOGGED_RELATION_INIT 0x0004
#endif /* REINIT_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a07715356b..75b9d41e4b 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,18 @@
#include "storage/block.h"
#include "storage/relfilelocator.h"
+/*
+ * A storage mark is a file whose existence indicates something about another
+ * file. Such a file is named "<filename>.<mark>", where the mark is one of
+ * the values of StorageMarks. Since ".<digit>" denotes a segment file, don't
+ * use digits for the mark character.
+ */
+typedef enum StorageMarks
+{
+ SMGR_MARK_NONE = 0,
+ SMGR_MARK_UNCOMMITTED = 'u' /* the file is not committed yet */
+} StorageMarks;
+
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
@@ -87,7 +99,12 @@ extern void smgrcloseall(void);
extern void smgrcloserellocator(RelFileLocatorBackend rlocator);
extern void smgrrelease(SMgrRelation reln);
extern void smgrreleaseall(void);
+extern void smgrcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void smgrunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 34a76ceb60..5445826998 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1981,6 +1981,7 @@ PatternInfoArray
Pattern_Prefix_Status
Pattern_Type
PendingFsyncEntry
+PendingMarkCleanup
PendingRelDelete
PendingRelSync
PendingUnlinkEntry
@@ -2611,6 +2612,7 @@ StdRdOptIndexCleanup
StdRdOptions
Step
StopList
+StorageMarks
StrategyNumber
StreamCtl
String
@@ -3612,6 +3614,7 @@ registered_buffer
regmatch_t
regoff_t
regproc
+relfile_entry
relopt_bool
relopt_enum
relopt_enum_elt_def
@@ -3665,6 +3668,7 @@ slist_iter
slist_mutable_iter
slist_node
slock_t
+smgr_mark_action
socket_set
socklen_t
spgBulkDeleteState
@@ -3863,8 +3867,11 @@ xl_replorigin_set
xl_restore_point
xl_running_xacts
xl_seq_rec
+xl_smgr_bufpersistence
xl_smgr_create
+xl_smgr_mark
xl_smgr_truncate
+xl_smgr_unlink
xl_standby_lock
xl_standby_locks
xl_tblspc_create_rec
--
2.31.1
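
For readers following the mark-file naming introduced in 0001 ("<filename>.<mark>", with digits reserved for segment numbers), here is a minimal standalone sketch of the idea. It is not code from the patch: build_mark_name(), trailing_mark() and the MARK_* constants are hypothetical stand-ins for GetRelationPath()'s markstr handling, the mark detection in parse_filename_for_nontemp_relation(), and the StorageMarks values, and it ignores tablespace and database directories entirely.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the patch's StorageMarks values. */
enum
{
	MARK_NONE = 0,
	MARK_UNCOMMITTED = 'u'
};

/*
 * Build "<relfile>.<mark>", or just "<relfile>" when mark is MARK_NONE.
 * Digits are rejected because ".<digit>" already means a segment file.
 */
static void
build_mark_name(char *dst, size_t dstlen, const char *relfile, char mark)
{
	assert(mark == MARK_NONE || !isdigit((unsigned char) mark));

	if (mark == MARK_NONE)
		snprintf(dst, dstlen, "%s", relfile);
	else
		snprintf(dst, dstlen, "%s.%c", relfile, mark);
}

/*
 * Return the mark character if the name ends in ".<single non-digit>",
 * otherwise MARK_NONE. A name such as "16384.1" is a segment file and
 * therefore carries no mark.
 */
static char
trailing_mark(const char *name)
{
	size_t		len = strlen(name);

	if (len >= 3 && name[len - 2] == '.' &&
		!isdigit((unsigned char) name[len - 1]))
		return name[len - 1];

	return MARK_NONE;
}
```

With these helpers, build_mark_name(buf, sizeof(buf), "16384_init", MARK_UNCOMMITTED) produces "16384_init.u", and trailing_mark() recovers 'u' from that name while reporting no mark for "16384" or "16384.1". The patch itself additionally accepts a segment number before the mark, which this sketch leaves out.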
Attachment: v22-0002-New-command-ALTER-TABLE-ALL-IN-TABLESPACE-SET-LO.patch (text/x-patch; charset=us-ascii)
From 96227d0ef81de78c47dc1e6045d67636accaaec5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 23:21:09 +0900
Subject: [PATCH v22 2/2] New command ALTER TABLE ALL IN TABLESPACE SET
LOGGED/UNLOGGED
To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes
relation persistence of all tables in the specified tablespace.
---
doc/src/sgml/ref/alter_table.sgml | 15 +++
src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++
src/backend/nodes/copyfuncs.c | 16 +++
src/backend/nodes/equalfuncs.c | 15 +++
src/backend/parser/gram.y | 42 +++++++
src/backend/tcop/utility.c | 11 ++
src/include/commands/tablecmds.h | 2 +
src/include/nodes/nodes.h | 1 +
src/include/nodes/parsenodes.h | 10 ++
src/test/regress/expected/tablespace.out | 76 ++++++++++++
src/test/regress/sql/tablespace.sql | 41 +++++++
11 files changed, 369 insertions(+)
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index a3c62bf056..6a5e63da34 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -33,6 +33,8 @@ ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
SET SCHEMA <replaceable class="parameter">new_schema</replaceable>
ALTER TABLE ALL IN TABLESPACE <replaceable class="parameter">name</replaceable> [ OWNED BY <replaceable class="parameter">role_name</replaceable> [, ... ] ]
SET TABLESPACE <replaceable class="parameter">new_tablespace</replaceable> [ NOWAIT ]
+ALTER TABLE ALL IN TABLESPACE <replaceable class="parameter">name</replaceable> [ OWNED BY <replaceable class="parameter">role_name</replaceable> [, ... ] ]
+ SET { LOGGED | UNLOGGED } [ NOWAIT ]
ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
ATTACH PARTITION <replaceable class="parameter">partition_name</replaceable> { FOR VALUES <replaceable class="parameter">partition_bound_spec</replaceable> | DEFAULT }
ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
@@ -759,6 +761,19 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
(for identity or serial columns). However, it is also possible to
change the persistence of such sequences separately.
</para>
+ <para>
+ The persistence of all tables in the current database in a tablespace can
+ be changed by using the <literal>ALL IN TABLESPACE</literal> form, which
+ first locks all tables to be changed and then changes each one. This form also
+ supports
+ <literal>OWNED BY</literal>, which will only change tables owned by the
+ specified roles. If the <literal>NOWAIT</literal> option is specified,
+ then the command will fail if it is unable to immediately acquire all of
+ the locks required. The <literal>information_schema</literal> relations
+ are not considered part of the system catalogs and will be changed. See
+ also
+ <link linkend="sql-createtablespace"><command>CREATE TABLESPACE</command></link>.
+ </para>
</listitem>
</varlistentry>
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index ab8ec38929..58e62ba90b 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14819,6 +14819,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
return new_tablespaceoid;
}
+/*
+ * Alter Table ALL ... SET LOGGED/UNLOGGED
+ *
+ * Allows a user to change the persistence of all objects in a given tablespace
+ * in the current database. Objects can also be chosen based on their owner,
+ * to allow users to change the persistence of only their own objects. The
+ * main permissions handling is done by the lower-level change persistence
+ * function.
+ *
+ * All to-be-modified objects are locked first. If NOWAIT is specified and the
+ * lock can't be acquired then we ereport(ERROR).
+ */
+void
+AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt * stmt)
+{
+ List *relations = NIL;
+ ListCell *l;
+ ScanKeyData key[1];
+ Relation rel;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ Oid tablespaceoid;
+ List *role_oids = roleSpecsToIds(stmt->roles);
+
+ /* Ensure we were not asked to change something we can't */
+ if (stmt->objtype != OBJECT_TABLE)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("only tables can be specified"));
+
+ /* Get the tablespace OID */
+ tablespaceoid = get_tablespace_oid(stmt->tablespacename, false);
+
+ /*
+ * Now that the checks are done, check if we should set it to
+ * InvalidOid because it is our database's default tablespace.
+ */
+ if (tablespaceoid == MyDatabaseTableSpace)
+ tablespaceoid = InvalidOid;
+
+ /*
+ * Walk the list of objects in the tablespace to pick them up. This will
+ * only find objects in our database, of course.
+ */
+ ScanKeyInit(&key[0],
+ Anum_pg_class_reltablespace,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(tablespaceoid));
+
+ rel = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(rel, 1, key);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relOid = relForm->oid;
+
+ /*
+ * Do not pick-up objects in pg_catalog as part of this, if an admin
+ * really wishes to do so, they can issue the individual ALTER
+ * commands directly.
+ *
+ * Also, explicitly avoid any shared tables, temp tables, or TOAST
+ * (TOAST will be changed with the main table).
+ */
+ if (IsCatalogNamespace(relForm->relnamespace) ||
+ relForm->relisshared ||
+ isAnyTempNamespace(relForm->relnamespace) ||
+ IsToastNamespace(relForm->relnamespace))
+ continue;
+
+ /* Only pick up the object type requested */
+ if (relForm->relkind != RELKIND_RELATION)
+ continue;
+
+ /* Check if we are only picking-up objects owned by certain roles */
+ if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner))
+ continue;
+
+ /*
+ * Handle permissions-checking here since we are locking the tables
+ * and also to avoid doing a bunch of work only to fail part-way. Note
+ * that permissions will also be checked by AlterTableInternal().
+ *
+ * Caller must be considered an owner of the table whose persistence
+ * we're going to change.
+ */
+ if (!pg_class_ownercheck(relOid, GetUserId()))
+ aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)),
+ NameStr(relForm->relname));
+
+ if (stmt->nowait &&
+ !ConditionalLockRelationOid(relOid, AccessExclusiveLock))
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("aborting because lock on relation \"%s.%s\" is not available",
+ get_namespace_name(relForm->relnamespace),
+ NameStr(relForm->relname)));
+ else
+ LockRelationOid(relOid, AccessExclusiveLock);
+
+ /*
+ * Add to our list of objects of which we're going to change
+ * persistence.
+ */
+ relations = lappend_oid(relations, relOid);
+ }
+
+ table_endscan(scan);
+ table_close(rel, AccessShareLock);
+
+ if (relations == NIL)
+ ereport(NOTICE,
+ errcode(ERRCODE_NO_DATA_FOUND),
+ errmsg("no matching relations in tablespace \"%s\" found",
+ tablespaceoid == InvalidOid ? "(database default)" :
+ get_tablespace_name(tablespaceoid)));
+
+ /*
+ * Everything is locked, loop through and change persistence of all of the
+ * relations.
+ */
+ foreach(l, relations)
+ {
+ List *cmds = NIL;
+ AlterTableCmd *cmd = makeNode(AlterTableCmd);
+
+ if (stmt->logged)
+ cmd->subtype = AT_SetLogged;
+ else
+ cmd->subtype = AT_SetUnLogged;
+
+ cmds = lappend(cmds, cmd);
+
+ EventTriggerAlterTableStart((Node *) stmt);
+ /* OID is set by AlterTableInternal */
+ AlterTableInternal(lfirst_oid(l), cmds, false);
+ EventTriggerAlterTableEnd();
+ }
+}
+
static void
index_copy_data(Relation rel, RelFileLocator newrlocator)
{
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 8313b5e5a7..853914ef3e 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4818,6 +4818,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from)
return newnode;
}
+static AlterTableSetLoggedAllStmt *
+_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt * from)
+{
+ AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt);
+
+ COPY_STRING_FIELD(tablespacename);
+ COPY_SCALAR_FIELD(objtype);
+ COPY_SCALAR_FIELD(logged);
+ COPY_SCALAR_FIELD(nowait);
+
+ return newnode;
+}
+
static CreateExtensionStmt *
_copyCreateExtensionStmt(const CreateExtensionStmt *from)
{
@@ -6276,6 +6289,9 @@ copyObjectImpl(const void *from)
case T_AlterTableMoveAllStmt:
retval = _copyAlterTableMoveAllStmt(from);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _copyAlterTableSetLoggedAllStmt(from);
+ break;
case T_CreateExtensionStmt:
retval = _copyCreateExtensionStmt(from);
break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 449352639f..0fb452d5b8 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -2297,6 +2297,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a,
return true;
}
+static bool
+_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt * a,
+ const AlterTableSetLoggedAllStmt * b)
+{
+ COMPARE_STRING_FIELD(tablespacename);
+ COMPARE_SCALAR_FIELD(objtype);
+ COMPARE_SCALAR_FIELD(logged);
+ COMPARE_SCALAR_FIELD(nowait);
+
+ return true;
+}
+
static bool
_equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b)
{
@@ -4094,6 +4106,9 @@ equal(const void *a, const void *b)
case T_AlterTableMoveAllStmt:
retval = _equalAlterTableMoveAllStmt(a, b);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ retval = _equalAlterTableSetLoggedAllStmt(a, b);
+ break;
case T_CreateExtensionStmt:
retval = _equalCreateExtensionStmt(a, b);
break;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 0523013f53..962f26259a 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -2177,6 +2177,48 @@ AlterTableStmt:
n->nowait = $13;
$$ = (Node *) n;
}
+ | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = true;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name OWNED BY role_list SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->roles = $9;
+ n->logged = true;
+ n->nowait = $12;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = false;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name OWNED BY role_list SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->roles = $9;
+ n->logged = false;
+ n->nowait = $12;
+ $$ = (Node *)n;
+ }
| ALTER INDEX qualified_name alter_table_cmds
{
AlterTableStmt *n = makeNode(AlterTableStmt);
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 6b0a865262..da16b33837 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -164,6 +164,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_AlterTSConfigurationStmt:
case T_AlterTSDictionaryStmt:
case T_AlterTableMoveAllStmt:
+ case T_AlterTableSetLoggedAllStmt:
case T_AlterTableSpaceOptionsStmt:
case T_AlterTableStmt:
case T_AlterTypeStmt:
@@ -1760,6 +1761,12 @@ ProcessUtilitySlow(ParseState *pstate,
commandCollected = true;
break;
+ case T_AlterTableSetLoggedAllStmt:
+ AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree);
+ /* commands are stashed in AlterTableSetLoggedAll */
+ commandCollected = true;
+ break;
+
case T_DropStmt:
ExecDropStmt((DropStmt *) parsetree, isTopLevel);
/* no commands stashed for DROP */
@@ -2688,6 +2695,10 @@ CreateCommandTag(Node *parsetree)
tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype);
+ break;
+
case T_AlterTableStmt:
tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype);
break;
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index 0c48654b96..4a4b417db6 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
+extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt * stmt);
+
extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
Oid *oldschema);
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 7ce1fc4deb..8003af83a1 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -447,6 +447,7 @@ typedef enum NodeTag
T_AlterCollationStmt,
T_CallStmt,
T_AlterStatsStmt,
+ T_AlterTableSetLoggedAllStmt,
/*
* TAGS FOR PARSE TREE NODES (parsenodes.h)
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 5f6d65b5c4..5c14e1fa37 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2699,6 +2699,16 @@ typedef struct AlterTableMoveAllStmt
bool nowait;
} AlterTableMoveAllStmt;
+typedef struct AlterTableSetLoggedAllStmt
+{
+ NodeTag type;
+ char *tablespacename;
+ ObjectType objtype; /* Object type to change persistence of */
+ List *roles; /* List of roles to change objects of */
+ bool logged;
+ bool nowait;
+} AlterTableSetLoggedAllStmt;
+
/* ----------------------
* Create/Alter Extension Statements
* ----------------------
diff --git a/src/test/regress/expected/tablespace.out b/src/test/regress/expected/tablespace.out
index c52cf1cfcf..a679d58553 100644
--- a/src/test/regress/expected/tablespace.out
+++ b/src/test/regress/expected/tablespace.out
@@ -966,5 +966,81 @@ drop cascades to table testschema.part
drop cascades to table testschema.atable
drop cascades to materialized view testschema.amv
drop cascades to table testschema.tablespace_acl
+--
+-- Check persistence change in a tablespace
+CREATE SCHEMA testschema;
+GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1;
+CREATE TABLESPACE regress_tablespace LOCATION '';
+GRANT CREATE ON TABLESPACE regress_tablespace TO regress_tablespace_user1;
+CREATE TABLE testschema.lsu(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.usu(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lsu(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._usu(a int) TABLESPACE pg_default;
+SET ROLE regress_tablespace_user1;
+CREATE TABLE testschema.lu1(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.uu1(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lu1(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._uu1(a int) TABLESPACE pg_default;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | p
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | p
+ uu1 | regress_tablespace | u
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace
+ OWNED BY regress_tablespace_user1 SET LOGGED;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | p
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | p
+ uu1 | regress_tablespace | p
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+RESET ROLE;
+ALTER TABLE ALL IN TABLESPACE regress_tablespace SET UNLOGGED;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | u
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | u
+ uu1 | regress_tablespace | u
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+-- Should succeed
+DROP SCHEMA testschema CASCADE;
+NOTICE: drop cascades to 8 other objects
+DETAIL: drop cascades to table testschema.lsu
+drop cascades to table testschema.usu
+drop cascades to table testschema._lsu
+drop cascades to table testschema._usu
+drop cascades to table testschema.lu1
+drop cascades to table testschema.uu1
+drop cascades to table testschema._lu1
+drop cascades to table testschema._uu1
+DROP TABLESPACE regress_tablespace;
DROP ROLE regress_tablespace_user1;
DROP ROLE regress_tablespace_user2;
diff --git a/src/test/regress/sql/tablespace.sql b/src/test/regress/sql/tablespace.sql
index 21db433f2a..a4b664f4e0 100644
--- a/src/test/regress/sql/tablespace.sql
+++ b/src/test/regress/sql/tablespace.sql
@@ -431,5 +431,46 @@ DROP TABLESPACE regress_tblspace_renamed;
DROP SCHEMA testschema CASCADE;
+
+--
+-- Check persistence change in a tablespace
+CREATE SCHEMA testschema;
+GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1;
+CREATE TABLESPACE regress_tablespace LOCATION '';
+GRANT CREATE ON TABLESPACE regress_tablespace TO regress_tablespace_user1;
+
+CREATE TABLE testschema.lsu(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.usu(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lsu(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._usu(a int) TABLESPACE pg_default;
+SET ROLE regress_tablespace_user1;
+CREATE TABLE testschema.lu1(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.uu1(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lu1(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._uu1(a int) TABLESPACE pg_default;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace
+ OWNED BY regress_tablespace_user1 SET LOGGED;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+RESET ROLE;
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace SET UNLOGGED;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+-- Should succeed
+DROP SCHEMA testschema CASCADE;
+DROP TABLESPACE regress_tablespace;
+
DROP ROLE regress_tablespace_user1;
DROP ROLE regress_tablespace_user2;
--
2.31.1
(Hmm. I hadn't noticed the annoying misspelling in the subject X( )
Thank you for checking that! The patch was hit more widely by b0a55e4329
(RelFileNumber). The commit message suggests "relfilenode" as files
So now I have stepped on my own foot. This version is also rebased onto
the nodefuncs autogeneration.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v23-0001-In-place-table-persistence-change.patch (text/x-patch; charset=us-ascii)
From cdd0282eb373ce79b8495b9a1160fb8c5122315e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 19 Jul 2022 13:23:01 +0900
Subject: [PATCH v23 1/2] In-place table persistence change
ALTER TABLE SET LOGGED/UNLOGGED does not inherently require rewriting
the data, but it currently performs a heap rewrite, which causes a large
amount of file I/O. This patch makes the command run without a heap
rewrite. In addition, SET LOGGED while wal_level > minimal emits WAL
using XLOG_FPI instead of a massive number of HEAP_INSERTs, which should
be smaller.
This also allows for the cleanup of files left behind when the
transaction that created them crashes.
---
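A note on the mark-file idea described above, for readers skimming the patch.
The following is only a minimal standalone sketch of the protocol (create the
mark, create the storage file, drop the mark at commit, remove marked files at
recovery). It does not use the smgr API from this patch; the function names
and the ".u" suffix are assumptions made up for illustration.

```c
/* Illustrative sketch only: names and the ".u" suffix are hypothetical. */
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define MARK_SUFFIX ".u"

/* Build "<path>.u" into buf; returns false if it does not fit. */
bool
mark_path(const char *path, char *buf, size_t buflen)
{
	int			n;

	assert(path != NULL && buf != NULL);
	n = snprintf(buf, buflen, "%s%s", path, MARK_SUFFIX);
	return n > 0 && (size_t) n < buflen;
}

/* Create an empty file; fails if it already exists. */
bool
create_empty(const char *path)
{
	int			fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);

	if (fd < 0)
		return false;
	close(fd);
	return true;
}

bool
file_exists(const char *path)
{
	return access(path, F_OK) == 0;
}

/* Creation: the mark goes first so a crash never leaves an unmarked file. */
bool
create_storage(const char *path)
{
	char		mark[1024];

	if (!mark_path(path, mark, sizeof(mark)) || !create_empty(mark))
		return false;
	if (!create_empty(path))
	{
		unlink(mark);
		return false;
	}
	return true;
}

/* Commit: the storage file is now permanent, so drop the mark. */
bool
commit_storage(const char *path)
{
	char		mark[1024];

	if (!mark_path(path, mark, sizeof(mark)))
		return false;
	return unlink(mark) == 0;
}

/*
 * Recovery: a surviving mark means the creating transaction never
 * committed.  Remove the storage file first and the mark last, so an
 * interrupted cleanup is retried.  Returns true if cleanup was done.
 */
bool
recover_storage(const char *path)
{
	char		mark[1024];

	if (!mark_path(path, mark, sizeof(mark)) || !file_exists(mark))
		return false;
	unlink(path);				/* may be absent if we crashed early */
	unlink(mark);
	return true;
}
```

Removing the mark last during recovery is the design point here: if the
cleanup itself is interrupted, the mark is still present and the next startup
can try again.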
src/backend/access/rmgrdesc/smgrdesc.c | 49 ++
src/backend/access/transam/README | 10 +
src/backend/access/transam/xact.c | 7 +
src/backend/access/transam/xlogrecovery.c | 18 +
src/backend/catalog/storage.c | 561 +++++++++++++++++++++-
src/backend/commands/tablecmds.c | 267 ++++++++--
src/backend/replication/basebackup.c | 9 +-
src/backend/storage/buffer/bufmgr.c | 85 ++++
src/backend/storage/file/fd.c | 4 +-
src/backend/storage/file/reinit.c | 340 +++++++++----
src/backend/storage/smgr/md.c | 95 +++-
src/backend/storage/smgr/smgr.c | 32 ++
src/backend/storage/sync/sync.c | 21 +-
src/bin/pg_rewind/parsexlog.c | 22 +
src/bin/pg_rewind/pg_rewind.c | 1 -
src/common/relpath.c | 47 +-
src/include/catalog/storage.h | 3 +
src/include/catalog/storage_xlog.h | 43 +-
src/include/common/relpath.h | 9 +-
src/include/storage/bufmgr.h | 2 +
src/include/storage/fd.h | 1 +
src/include/storage/md.h | 8 +-
src/include/storage/reinit.h | 10 +-
src/include/storage/smgr.h | 17 +
src/tools/pgindent/typedefs.list | 7 +
25 files changed, 1485 insertions(+), 183 deletions(-)
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index e0ee8a078a..2f92c06f70 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,46 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+ char *path = relpathperm(xlrec->rlocator, xlrec->forkNum);
+
+ appendStringInfoString(buf, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+ char *path = GetRelationPath(xlrec->rlocator.dbOid,
+ xlrec->rlocator.spcOid,
+ xlrec->rlocator.relNumber,
+ InvalidBackendId,
+ xlrec->forkNum, xlrec->mark);
+ char *action = "<none>";
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ action = "CREATE";
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ action = "DELETE";
+ break;
+ }
+
+ appendStringInfo(buf, "%s %s", action, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+ char *path = relpathperm(xlrec->rlocator, MAIN_FORKNUM);
+
+ appendStringInfoString(buf, path);
+ appendStringInfo(buf, " persistence %d", xlrec->persistence);
+ pfree(path);
+ }
}
const char *
@@ -55,6 +95,15 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_UNLINK:
+ id = "UNLINK";
+ break;
+ case XLOG_SMGR_MARK:
+ id = "MARK";
+ break;
+ case XLOG_SMGR_BUFPERSISTENCE:
+ id = "BUFPERSISTENCE";
+ break;
}
return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 734c39a4d0..f08bd7f42d 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -724,6 +724,16 @@ we must panic and abort recovery. The DBA will have to manually clean up
then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.
+================================
+The Smgr MARK files
+--------------------------------
+
+An smgr mark file is an empty file that is created when a new relation
+storage file is created to signal that the storage file needs to be
+cleaned up at recovery time. In contrast to the four actions above,
+failure to remove smgr mark files will lead to data loss, in which
+case the server will shut down.
+
Skipping WAL for New RelFileLocator
--------------------------------
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 116de1175b..a1d97150dd 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2213,6 +2213,9 @@ CommitTransaction(void)
*/
smgrDoPendingSyncs(true, is_parallel_worker);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2463,6 +2466,9 @@ PrepareTransaction(void)
*/
smgrDoPendingSyncs(true, false);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2788,6 +2794,7 @@ AbortTransaction(void)
AfterTriggerEndXact(false); /* 'false' means it's abort */
AtAbort_Portals();
smgrDoPendingSyncs(false, is_parallel_worker);
+ smgrDoPendingCleanups(false);
AtEOXact_LargeObject(false);
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 5d6f1b5e46..42b4f6b5c8 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -41,6 +41,7 @@
#include "access/xlogrecovery.h"
#include "access/xlogutils.h"
#include "catalog/pg_control.h"
+#include "catalog/storage.h"
#include "commands/tablespace.h"
#include "miscadmin.h"
#include "pgstat.h"
@@ -54,6 +55,7 @@
#include "storage/pmsignal.h"
#include "storage/proc.h"
#include "storage/procarray.h"
+#include "storage/reinit.h"
#include "storage/spin.h"
#include "utils/builtins.h"
#include "utils/guc.h"
@@ -1773,6 +1775,14 @@ PerformWalRecovery(void)
RmgrCleanup();
+ /* cleanup garbage files left during crash recovery */
+ if (!InArchiveRecovery)
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
ereport(LOG,
(errmsg("redo done at %X/%X system usage: %s",
LSN_FORMAT_ARGS(xlogreader->ReadRecPtr),
@@ -3052,6 +3062,14 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
{
ereport(DEBUG1,
(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+
+ /* cleanup garbage files left during crash recovery */
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
InArchiveRecovery = true;
if (StandbyModeRequested)
StandbyMode = true;
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index d708af19ed..317e44acfd 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,6 +19,7 @@
#include "postgres.h"
+#include "access/amapi.h"
#include "access/parallel.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -66,6 +67,23 @@ typedef struct PendingRelDelete
struct PendingRelDelete *next; /* linked-list link */
} PendingRelDelete;
+#define PCOP_UNLINK_FORK (1 << 0)
+#define PCOP_UNLINK_MARK (1 << 1)
+#define PCOP_SET_PERSISTENCE (1 << 2)
+
+typedef struct PendingCleanup
+{
+ RelFileLocator rlocator; /* relation that needs a cleanup */
+ int op; /* operation mask */
+ bool bufpersistence; /* buffer persistence to set */
+ ForkNumber unlink_forknum; /* forknum to unlink */
+ StorageMarks unlink_mark; /* mark to unlink */
+ BackendId backend; /* InvalidBackendId if not a temp rel */
+ bool atCommit; /* T=delete at commit; F=delete at abort */
+ int nestLevel; /* xact nesting level of request */
+ struct PendingCleanup *next; /* linked-list link */
+} PendingCleanup;
+
typedef struct PendingRelSync
{
RelFileLocator rlocator;
@@ -73,6 +91,7 @@ typedef struct PendingRelSync
} PendingRelSync;
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingCleanup * pendingCleanups = NULL; /* head of linked list */
static HTAB *pendingSyncHash = NULL;
@@ -123,6 +142,7 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
SMgrRelation srel;
BackendId backend;
bool needs_wal;
+ PendingCleanup *pendingclean;
Assert(!IsInParallelMode()); /* couldn't update pendingSyncHash */
@@ -145,9 +165,23 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
return NULL; /* placate compiler */
}
+ /*
+ * We are going to create a new storage file. If the server crashes before
+ * the current transaction ends, the file needs to be cleaned up. The
+ * SMGR_MARK_UNCOMMITTED mark file prompts that work at the next startup.
+ * We don't need this during WAL-logged CREATE DATABASE. See
+ * CreateAndCopyRelationData for details.
+ */
srel = smgropen(rlocator, backend);
+
+ if (register_delete)
+ {
+ log_smgrcreatemark(&rlocator, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+ }
+
smgrcreate(srel, MAIN_FORKNUM, false);
-
+
if (needs_wal)
log_smgrcreate(&srel->smgr_rlocator.locator, MAIN_FORKNUM);
@@ -157,16 +191,29 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
*/
if (register_delete)
{
- PendingRelDelete *pending;
+ PendingRelDelete *pendingdel;
- pending = (PendingRelDelete *)
+ pendingdel = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
- pending->rlocator = rlocator;
- pending->backend = backend;
- pending->atCommit = false; /* delete if abort */
- pending->nestLevel = GetCurrentTransactionNestLevel();
- pending->next = pendingDeletes;
- pendingDeletes = pending;
+ pendingdel->rlocator = rlocator;
+ pendingdel->backend = backend;
+ pendingdel->atCommit = false; /* delete if abort */
+ pendingdel->nestLevel = GetCurrentTransactionNestLevel();
+ pendingdel->next = pendingDeletes;
+ pendingDeletes = pendingdel;
+
+ /* drop mark files at commit */
+ pendingclean = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pendingclean->rlocator = rlocator;
+ pendingclean->op = PCOP_UNLINK_MARK;
+ pendingclean->unlink_forknum = MAIN_FORKNUM;
+ pendingclean->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pendingclean->backend = backend;
+ pendingclean->atCommit = true;
+ pendingclean->nestLevel = GetCurrentTransactionNestLevel();
+ pendingclean->next = pendingCleanups;
+ pendingCleanups = pendingclean;
}
if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
@@ -178,6 +225,204 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
return srel;
}
+/*
+ * RelationCreateInitFork
+ * Create physical storage for the init fork of a relation.
+ *
+ * Create the init fork for the relation.
+ *
+ * This function is transactional. The creation is WAL-logged, and if the
+ * transaction aborts later on, the init fork will be removed.
+ */
+void
+RelationCreateInitFork(Relation rel)
+{
+ RelFileLocator rlocator = rel->rd_locator;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ SMgrRelation srel;
+ bool create = true;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), false, false);
+
+ /*
+ * If we have a pending-unlink for the init-fork of this relation, that
+ * means the init-fork exists since before the current transaction
+ * started. This function reverts that change just by removing the entry.
+ * See RelationDropInitFork.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileLocatorEquals(rlocator, pending->rlocator) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ create = false;
+ }
+ else
+ prev = pending;
+ }
+
+ if (!create)
+ return;
+
+ /* create the init fork, along with the commit-sentinel file */
+ srel = smgropen(rlocator, InvalidBackendId);
+ log_smgrcreatemark(&rlocator, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+
+ /* We don't have existing init fork, create it. */
+ smgrcreate(srel, INIT_FORKNUM, false);
+
+ /*
+ * init fork for indexes needs further initialization. ambuildempty should
+ * do WAL-log and file sync by itself but otherwise we do that by
+ * ourselves.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rlocator, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /* drop the init fork, mark file and revert persistence at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = rlocator;
+ pending->op = PCOP_UNLINK_FORK | PCOP_UNLINK_MARK | PCOP_SET_PERSISTENCE;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->bufpersistence = true;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* drop mark file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = rlocator;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
+/*
+ * RelationDropInitFork
+ * Delete physical storage for the init fork of a relation.
+ *
+ * Register pending-delete of the init fork. The real deletion is performed by
+ * smgrDoPendingCleanups at commit.
+ *
+ * This function is transactional. If the transaction aborts later on, the
+ * deletion is canceled.
+ */
+void
+RelationDropInitFork(Relation rel)
+{
+ RelFileLocator rlocator = rel->rd_locator;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ bool inxact_created = false;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), true, false);
+
+ /*
+ * If we have a pending-unlink for the init-fork of this relation, that
+ * means the init fork is created in the current transaction. We remove
+ * both the init fork and mark file immediately in that case. Otherwise
+ * register an at-commit pending-unlink for the existing init fork. See
+ * RelationCreateInitFork.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileLocatorEquals(rlocator, pending->rlocator) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ /* unlink list entry */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ inxact_created = true;
+ }
+ else
+ prev = pending;
+ }
+
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rlocator, InvalidBackendId);
+ ForkNumber forknum = INIT_FORKNUM;
+ BlockNumber firstblock = 0;
+
+ /*
+ * Some AMs initialize the INIT fork via the buffer manager. Drop all buffers
+ * for the INIT fork then unlink the INIT fork along with the mark
+ * file.
+ */
+ DropRelationBuffers(srel, &forknum, 1, &firstblock);
+ log_smgrunlinkmark(&rlocator, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrunlinkmark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+ log_smgrunlink(&rlocator, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ return;
+ }
+
+ /* register drop of this init fork file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = rlocator;
+ pending->op = PCOP_UNLINK_FORK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = rlocator;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = false;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -197,6 +442,88 @@ log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileLocator *rlocator, ForkNumber forkNum)
+{
+ xl_smgr_unlink xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file unlink.
+ */
+ xlrec.rlocator = *rlocator;
+ xlrec.forkNum = forkNum;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK record (mark file creation) to WAL.
+ */
+void
+log_smgrcreatemark(const RelFileLocator *rlocator, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file creation.
+ */
+ xlrec.rlocator = *rlocator;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_CREATE;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK record (mark file removal) to WAL.
+ */
+void
+log_smgrunlinkmark(const RelFileLocator *rlocator, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the mark file removal.
+ */
+ xlrec.rlocator = *rlocator;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_UNLINK;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileLocator *rlocator, bool persistence)
+{
+ xl_smgr_bufpersistence xlrec;
+
+ /*
+ * Make an XLOG entry reporting the change of buffer persistence.
+ */
+ xlrec.rlocator = *rlocator;
+ xlrec.persistence = persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -711,6 +1038,95 @@ smgrDoPendingDeletes(bool isCommit)
}
}
+/*
+ * smgrDoPendingCleanups() -- Clean up work that emits WAL records
+ *
+ * The operations handled in this function emit WAL records, which must be
+ * part of the current transaction.
+ */
+void
+smgrDoPendingCleanups(bool isCommit)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (pending->nestLevel < nestLevel)
+ {
+ /* outer-level entries should not be processed yet */
+ prev = pending;
+ }
+ else
+ {
+ /* unlink list entry first, so we don't retry on failure */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ /* do cleanup if called for */
+ if (pending->atCommit == isCommit)
+ {
+ SMgrRelation srel;
+
+ srel = smgropen(pending->rlocator, pending->backend);
+
+ Assert((pending->op &
+ ~(PCOP_UNLINK_FORK | PCOP_UNLINK_MARK |
+ PCOP_SET_PERSISTENCE)) == 0);
+
+ if (pending->op & PCOP_SET_PERSISTENCE)
+ {
+ SetRelationBuffersPersistence(srel, pending->bufpersistence,
+ InRecovery);
+ }
+
+ if (pending->op & PCOP_UNLINK_FORK)
+ {
+ /*
+ * Unlink the fork file. Currently we use this only for
+ * init forks, and we're sure that the init fork is not
+ * loaded in shared buffers. In the RelationDropInitFork
+ * case, that function dropped those buffers. In the
+ * RelationCreateInitFork case, PCOP_SET_PERSISTENCE(true)
+ * is set and the buffers have been dropped just before.
+ */
+ Assert(pending->unlink_forknum == INIT_FORKNUM);
+
+ /* Don't emit WAL during recovery. */
+ if (!InRecovery)
+ log_smgrunlink(&pending->rlocator,
+ pending->unlink_forknum);
+ smgrunlink(srel, pending->unlink_forknum, false);
+ }
+
+ if (pending->op & PCOP_UNLINK_MARK)
+ {
+ SMgrRelation srel;
+
+ if (!InRecovery)
+ log_smgrunlinkmark(&pending->rlocator,
+ pending->unlink_forknum,
+ pending->unlink_mark);
+ srel = smgropen(pending->rlocator, pending->backend);
+ smgrunlinkmark(srel, pending->unlink_forknum,
+ pending->unlink_mark, InRecovery);
+ smgrclose(srel);
+ }
+ }
+
+ /* must explicitly free the list entry */
+ pfree(pending);
+ /* prev does not change */
+ }
+ }
+}
+
/*
* smgrDoPendingSyncs() -- Take care of relation syncs at end of xact.
*/
@@ -971,6 +1387,15 @@ smgr_redo(XLogReaderState *record)
reln = smgropen(xlrec->rlocator, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rlocator, InvalidBackendId);
+ smgrunlink(reln, xlrec->forkNum, true);
+ smgrclose(reln);
+ }
else if (info == XLOG_SMGR_TRUNCATE)
{
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
@@ -1059,6 +1484,124 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ bool created = false;
+
+ reln = smgropen(xlrec->rlocator, InvalidBackendId);
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ smgrcreatemark(reln, xlrec->forkNum, xlrec->mark, true);
+ created = true;
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ smgrunlinkmark(reln, xlrec->forkNum, xlrec->mark, true);
+ break;
+ default:
+ elog(ERROR, "unknown smgr_mark action \"%c\"", xlrec->action);
+ }
+
+ if (created)
+ {
+ /* revert mark file operation at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = xlrec->rlocator;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = xlrec->forkNum;
+ pending->unlink_mark = xlrec->mark;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ else
+ {
+ /*
+ * Delete pending action for this mark file if any. We should have
+ * at most one entry for this action.
+ */
+ PendingCleanup *prev = NULL;
+
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileLocatorEquals(xlrec->rlocator, pending->rlocator) &&
+ pending->unlink_forknum == xlrec->forkNum &&
+ (pending->op & PCOP_UNLINK_MARK) != 0)
+ {
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+ }
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ PendingCleanup *prev = NULL;
+
+ reln = smgropen(xlrec->rlocator, InvalidBackendId);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+
+ /*
+ * Delete pending action for persistence change if any. We should have
+ * at most one entry for this action.
+ */
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileLocatorEquals(xlrec->rlocator, pending->rlocator) &&
+ (pending->op & PCOP_SET_PERSISTENCE) != 0)
+ {
+ Assert(pending->bufpersistence == xlrec->persistence);
+
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+
+ /*
+ * Revert buffer-persistence changes at abort if the relation is moving
+ * to a different persistence than it had before this transaction.
+ */
+ if (!pending)
+ {
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = xlrec->rlocator;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = !xlrec->persistence;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index a2f577024a..d8de3f496b 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -54,6 +54,7 @@
#include "commands/defrem.h"
#include "commands/event_trigger.h"
#include "commands/policy.h"
+#include "commands/progress.h"
#include "commands/sequence.h"
#include "commands/tablecmds.h"
#include "commands/tablespace.h"
@@ -5376,6 +5377,187 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
return newcmd;
}
+/*
+ * RelationChangePersistence: do in-place persistence change of a relation
+ */
+static void
+RelationChangePersistence(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ int i;
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * If any of the following were set, we would need to call ATRewriteTable.
+ * None of them can be set in the AT_REWRITE_ALTER_PERSISTENCE case.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ /*
+ * First we collect all relations that we need to change persistence.
+ */
+
+ /* Collect OIDs of indexes and toast relations */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ /* Add toast relation if any */
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ List *toastidx;
+ Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
+
+ relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
+ toastidx = RelationGetIndexList(toastrel);
+ relids = list_concat(relids, toastidx);
+ pfree(toastidx);
+ table_close(toastrel, NoLock);
+ }
+
+ table_close(rel, NoLock);
+
+ /* Make changes in storage */
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach(lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+
+ /*
+ * XXX: Some access methods cannot tolerate an in-place persistence
+ * change. Specifically, GiST uses page LSNs to figure out whether a
+ * block has changed, and UNLOGGED GiST indexes use fake LSNs that
+ * are incompatible with the real LSNs used for LOGGED ones.
+ *
+ * Maybe if gistGetFakeLSN behaved the same way for permanent and
+ * unlogged indexes, we could skip the index rebuild in exchange for some
+ * extra WAL records emitted while it is unlogged.
+ *
+ * Check relam against a positive list so that we take the hard way for
+ * unknown AMs.
+ */
+ if (r->rd_rel->relkind == RELKIND_INDEX &&
+ /* GiST is excluded */
+ r->rd_rel->relam != BTREE_AM_OID &&
+ r->rd_rel->relam != HASH_AM_OID &&
+ r->rd_rel->relam != GIN_AM_OID &&
+ r->rd_rel->relam != SPGIST_AM_OID &&
+ r->rd_rel->relam != BRIN_AM_OID)
+ {
+ int reindex_flags;
+ ReindexParams params = {0};
+
+ /* reindex doesn't allow concurrent use of the index */
+ table_close(r, NoLock);
+
+ reindex_flags =
+ REINDEX_REL_SUPPRESS_INDEX_USE |
+ REINDEX_REL_CHECK_CONSTRAINTS;
+
+ /* Set the same persistence as the parent relation. */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+
+ reindex_index(reloid, reindex_flags, persistence, &params);
+
+ continue;
+ }
+
+ /* Create or drop init fork */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ RelationCreateInitFork(r);
+ else
+ RelationDropInitFork(r);
+
+ /*
+ * When this relation gets WAL-logged, immediately sync all files except
+ * the init fork to establish the initial state on storage. Buffers have
+ * already been flushed out by RelationCreate(Drop)InitFork called just
+ * above. The init fork should have been synced as needed.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT)
+ {
+ for (i = 0; i < INIT_FORKNUM; i++)
+ {
+ if (smgrexists(RelationGetSmgr(r), i))
+ smgrimmedsync(RelationGetSmgr(r), i);
+ }
+ }
+
+ /* Update catalog */
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+
+ /*
+ * While wal_level >= replica, switching to LOGGED requires the
+ * relation content to be WAL-logged to recover the table. We don't
+ * emit this if wal_level = minimal.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded())
+ {
+ ForkNumber fork;
+ xl_smgr_truncate xlrec;
+
+ xlrec.blkno = 0;
+ xlrec.rlocator = r->rd_locator;
+ xlrec.flags = SMGR_TRUNCATE_ALL;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ for (fork = 0; fork < INIT_FORKNUM; fork++)
+ {
+ if (smgrexists(RelationGetSmgr(r), fork))
+ log_newpage_range(r, fork, 0,
+ smgrnblocks(RelationGetSmgr(r), fork),
+ false);
+ }
+ }
+
+ table_close(r, NoLock);
+ }
+
+ table_close(classRel, NoLock);
+}
+
/*
* ATRewriteTables: ALTER TABLE phase 3
*/
@@ -5506,48 +5688,55 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- /*
- * Create transient table that will receive the modified data.
- *
- * Ensure it is marked correctly as logged or unlogged. We have
- * to do this here so that buffers for the new relfilenumber will
- * have the right persistence set, and at the same time ensure
- * that the original filenumbers's buffers will get read in with
- * the correct setting (i.e. the original one). Otherwise a
- * rollback after the rewrite would possibly result with buffers
- * for the original filenumbers having the wrong persistence
- * setting.
- *
- * NB: This relies on swap_relation_files() also swapping the
- * persistence. That wouldn't work for pg_class, but that can't be
- * unlogged anyway.
- */
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
- persistence, lockmode);
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
+ RelationChangePersistence(tab, persistence, lockmode);
+ else
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new
+ * relfilenumber will have the right persistence set, and at
+ * the same time ensure that the original filenumbers's buffers
+ * will get read in with the correct setting (i.e. the original
+ * one). Otherwise a rollback after the rewrite would possibly
+ * result with buffers for the original filenumbers having the
+ * wrong persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that
+ * can't be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace,
+ NewAccessMethod,
+ persistence, lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
+ /*
+ * Swap the physical files of the old and new heaps, then
+ * rebuild indexes and discard the old heap. We can use
+ * RecentXmin for the table's new relfrozenxid because we
+ * rewrote all the tuples in ATRewriteTable, so no older Xid
+ * remains in the table. Also, we never try to swap toast
+ * tables by content, since we have no interest in letting
+ * this code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
- InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ }
}
else if (tab->rewrite > 0 && tab->relkind == RELKIND_SEQUENCE)
{
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 95440013c0..0ef4b2bf01 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1167,6 +1167,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
bool excludeFound;
ForkNumber relForkNum; /* Type of fork if file is a relation */
int relOidChars; /* Chars in filename that are the rel oid */
+ StorageMarks mark;
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1217,7 +1218,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
/* Exclude all forks for unlogged tables except the init fork */
if (isDbDir &&
parse_filename_for_nontemp_relation(de->d_name, &relOidChars,
- &relForkNum))
+ &relForkNum, &mark))
{
/* Never exclude init forks */
if (relForkNum != INIT_FORKNUM)
@@ -1442,6 +1443,7 @@ is_checksummed_file(const char *fullpath, const char *filename)
strncmp(fullpath, "/", 1) == 0)
{
int excludeIdx;
+ char *p;
/* Compare file against noChecksumFiles skip list */
for (excludeIdx = 0; noChecksumFiles[excludeIdx].name != NULL; excludeIdx++)
@@ -1455,6 +1457,11 @@ is_checksummed_file(const char *fullpath, const char *filename)
return false;
}
+ /* exclude mark files */
+ p = strchr(filename, '.');
+ if (p && isalpha(p[1]) && p[2] == 0)
+ return false;
+
return true;
}
else
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index c7d7abcd73..c5d96b2fb7 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3161,6 +3161,91 @@ DropRelationBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
}
}
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a relation.
+ * When switching to PERMANENT, it then writes all dirty pages of the
+ * relation out to disk (or more accurately, out to kernel disk buffers),
+ * ensuring that the kernel has an up-to-date view of the relation.
+ *
+ * Generally, the caller should be holding AccessExclusiveLock on the
+ * target relation to ensure that no other backend is busy dirtying
+ * more blocks of the relation; the effects can't be expected to last
+ * after the lock is released.
+ *
+ * XXX currently it sequentially searches the buffer pool, should be
+ * changed to more clever ways of searching. This routine is not
+ * used in any performance-critical code paths, so it's not worth
+ * adding additional overhead to normal paths to make it go faster;
+ * but see also DropRelFileLocatorBuffers.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
+{
+ int i;
+ RelFileLocatorBackend rlocator = srel->smgr_rlocator;
+
+ Assert(!RelFileLocatorBackendIsTemp(rlocator));
+
+ if (!isRedo)
+ log_smgrbufpersistence(&srel->smgr_rlocator.locator, permanent);
+
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ if (!RelFileLocatorEquals(bufHdr->tag.rlocator, rlocator.locator))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (!RelFileLocatorEquals(bufHdr->tag.rlocator, rlocator.locator))
+ {
+ UnlockBufHdr(bufHdr, buf_state);
+ continue;
+ }
+
+ if (permanent)
+ {
+ /* Init fork is being dropped, drop buffers for it. */
+ if (bufHdr->tag.forkNum == INIT_FORKNUM)
+ {
+ InvalidateBuffer(bufHdr);
+ continue;
+ }
+
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /* we flush this buffer when switching to PERMANENT */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, srel);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr, true);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ /* There shouldn't be an init fork */
+ Assert(bufHdr->tag.forkNum != INIT_FORKNUM);
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ }
+}
+
/* ---------------------------------------------------------------------
* DropRelationsAllBuffers
*
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index f904f60c08..bcdbbad0f1 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -349,8 +349,6 @@ static void pre_sync_fname(const char *fname, bool isdir, int elevel);
static void datadir_fsync_fname(const char *fname, bool isdir, int elevel);
static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel);
-static int fsync_parent_path(const char *fname, int elevel);
-
/*
* pg_fsync --- do fsync with or without writethrough
@@ -3696,7 +3694,7 @@ fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel)
* This is aimed at making file operations persistent on disk in case of
* an OS crash or power failure.
*/
-static int
+int
fsync_parent_path(const char *fname, int elevel)
{
char parentpath[MAXPGPATH];
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index f053fe0495..684c08fc1e 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -16,29 +16,49 @@
#include <unistd.h>
+#include "access/xlogrecovery.h"
+#include "catalog/pg_tablespace_d.h"
#include "common/relpath.h"
#include "postmaster/startup.h"
+#include "storage/bufmgr.h"
#include "storage/copydir.h"
#include "storage/fd.h"
+#include "storage/md.h"
#include "storage/reinit.h"
+#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
- int op);
+ Oid tspid, int op);
static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
- int op);
+ Oid tspid, Oid dbid, int op);
typedef struct
{
Oid reloid; /* hash key */
-} unlogged_relation_entry;
+ bool has_init; /* has INIT fork */
+ bool dirty_init; /* needs to remove INIT fork */
+ bool dirty_all; /* needs to remove all forks */
+} relfile_entry;
/*
- * Reset unlogged relations from before the last restart.
+ * Clean up and reset relation files from before the last restart.
*
- * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any
- * relation with an "init" fork, except for the "init" fork itself.
+ * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations
+ * depending on the existence of the "cleanup" forks.
+ *
+ * If the SMGR_MARK_UNCOMMITTED mark file for the init fork is present, we
+ * remove the init fork along with the mark file.
+ *
+ * If the SMGR_MARK_UNCOMMITTED mark file for the main fork is present, we
+ * remove the whole relation along with the mark file.
+ *
+ * Otherwise, if the "init" fork is found, we remove all forks of any relation
+ * with the "init" fork, except for the "init" fork itself.
+ *
+ * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all
+ * relations that have the "cleanup" and/or the "init" forks.
*
* If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main
* fork.
@@ -72,7 +92,7 @@ ResetUnloggedRelations(int op)
/*
* First process unlogged files in pg_default ($PGDATA/base)
*/
- ResetUnloggedRelationsInTablespaceDir("base", op);
+ ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op);
/*
* Cycle through directories for all non-default tablespaces.
@@ -81,13 +101,19 @@ ResetUnloggedRelations(int op)
while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
{
+ Oid tspid;
+
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
continue;
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
- ResetUnloggedRelationsInTablespaceDir(temp_path, op);
+
+ tspid = atooid(spc_de->d_name);
+
+ Assert(tspid != 0);
+ ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op);
}
FreeDir(spc_dir);
@@ -103,7 +129,8 @@ ResetUnloggedRelations(int op)
* Process one tablespace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
+ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
+ Oid tspid, int op)
{
DIR *ts_dir;
struct dirent *de;
@@ -130,6 +157,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
while ((de = ReadDir(ts_dir, tsdirname)) != NULL)
{
+ Oid dbid;
+
/*
* We're only interested in the per-database directories, which have
* numeric names. Note that this code will also (properly) ignore "."
@@ -148,7 +177,10 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
ereport_startup_progress("resetting unlogged relations (cleanup), elapsed time: %ld.%02d s, current path: %s",
dbspace_path);
- ResetUnloggedRelationsInDbspaceDir(dbspace_path, op);
+ dbid = atooid(de->d_name);
+ Assert(dbid != 0);
+
+ ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op);
}
FreeDir(ts_dir);
@@ -158,125 +190,227 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
* Process one per-dbspace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
+ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
+ Oid tspid, Oid dbid, int op)
{
DIR *dbspace_dir;
struct dirent *de;
char rm_path[MAXPGPATH * 2];
+ HTAB *hash;
+ HASHCTL ctl;
/* Caller must specify at least one operation. */
- Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0);
+ Assert((op & (UNLOGGED_RELATION_CLEANUP |
+ UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_INIT)) != 0);
/*
* Cleanup is a two-pass operation. First, we go through and identify all
* the files with init forks. Then, we go through again and nuke
* everything with the same OID except the init fork.
*/
+
+ /*
+ * It's possible that someone could create tons of unlogged relations in
+ * the same database & tablespace, so we'd better use a hash table rather
+ * than an array or linked list to keep track of which files need to be
+ * reset. Otherwise, this cleanup operation would be O(n^2).
+ */
+ memset(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(Oid);
+ ctl.entrysize = sizeof(relfile_entry);
+ hash = hash_create("relation files cleanup hash",
+ 32, &ctl, HASH_ELEM | HASH_BLOBS);
+
+ /* Collect INIT fork and mark files in the directory. */
+ dbspace_dir = AllocateDir(dbspacedirname);
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
+ {
+ int oidchars;
+ ForkNumber forkNum;
+ StorageMarks mark;
+
+ /* Skip anything that doesn't look like a relation data file. */
+ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
+ &forkNum, &mark))
+ continue;
+
+ if (forkNum == INIT_FORKNUM || mark == SMGR_MARK_UNCOMMITTED)
+ {
+ Oid key;
+ relfile_entry *ent;
+ bool found;
+
+ /*
+ * Record the relation file information. If it has
+ * SMGR_MARK_UNCOMMITTED mark files, the storage file is in a
+ * dirty state and needs cleanup.
+ */
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_ENTER, &found);
+
+ if (!found)
+ {
+ ent->has_init = false;
+ ent->dirty_init = false;
+ ent->dirty_all = false;
+ }
+
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_init = true;
+ else if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_all = true;
+ else
+ {
+ Assert(forkNum == INIT_FORKNUM);
+ ent->has_init = true;
+ }
+ }
+ }
+
+ /* Done with the first pass. */
+ FreeDir(dbspace_dir);
+
+ /* nothing to do if we have neither init nor cleanup forks */
+ if (hash_get_num_entries(hash) < 1)
+ {
+ hash_destroy(hash);
+ return;
+ }
+
+ if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0)
+ {
+ /*
+ * When we come here after recovery, an smgr object for this file might
+ * have been created. In that case we need to drop all buffers and then
+ * the smgr object before initializing the unlogged relation. This is
+ * safe as long as no other backends have accessed the relation before
+ * starting archive recovery.
+ */
+ HASH_SEQ_STATUS status;
+ relfile_entry *ent;
+ SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8);
+ int maxrels = 8;
+ int nrels = 0;
+ int i;
+
+ Assert(!HotStandbyActive());
+
+ hash_seq_init(&status, hash);
+ while ((ent = (relfile_entry *) hash_seq_search(&status)) != NULL)
+ {
+ RelFileLocatorBackend rel;
+
+ /*
+ * The relation is persistent and stays persistent. Don't drop the
+ * buffers for this relation.
+ */
+ if (ent->has_init && ent->dirty_init)
+ continue;
+
+ if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ rel.backend = InvalidBackendId;
+ rel.locator.spcOid = tspid;
+ rel.locator.dbOid = dbid;
+ rel.locator.relNumber = ent->reloid;
+
+ srels[nrels++] = smgropen(rel.locator, InvalidBackendId);
+ }
+
+ DropRelationsAllBuffers(srels, nrels);
+
+ for (i = 0; i < nrels; i++)
+ smgrclose(srels[i]);
+ }
+
+ /*
+ * Now, make a second pass and remove anything that matches.
+ */
if ((op & UNLOGGED_RELATION_CLEANUP) != 0)
{
- HTAB *hash;
- HASHCTL ctl;
-
- /*
- * It's possible that someone could create a ton of unlogged relations
- * in the same database & tablespace, so we'd better use a hash table
- * rather than an array or linked list to keep track of which files
- * need to be reset. Otherwise, this cleanup operation would be
- * O(n^2).
- */
- ctl.keysize = sizeof(Oid);
- ctl.entrysize = sizeof(unlogged_relation_entry);
- ctl.hcxt = CurrentMemoryContext;
- hash = hash_create("unlogged relation OIDs", 32, &ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- /* Scan the directory. */
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
- unlogged_relation_entry ent;
+ Oid key;
+ relfile_entry *ent;
+ RelFileLocatorBackend rel;
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* Also skip it unless this is the init fork. */
- if (forkNum != INIT_FORKNUM)
- continue;
-
- /*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
- */
- ent.reloid = atooid(de->d_name);
- (void) hash_search(hash, &ent, HASH_ENTER, NULL);
- }
-
- /* Done with the first pass. */
- FreeDir(dbspace_dir);
-
- /*
- * If we didn't find any init forks, there's no point in continuing;
- * we can bail out now.
- */
- if (hash_get_num_entries(hash) == 0)
- {
- hash_destroy(hash);
- return;
- }
-
- /*
- * Now, make a second pass and remove anything that matches.
- */
- dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
- {
- ForkNumber forkNum;
- int oidchars;
- unlogged_relation_entry ent;
-
- /* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
- continue;
-
- /* We never remove the init fork. */
- if (forkNum == INIT_FORKNUM)
+ &forkNum, &mark))
continue;
/*
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
- if (hash_search(hash, &ent, HASH_FIND, NULL))
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);
+
+ if (!ent)
+ continue;
+
+ if (!ent->dirty_all)
{
- snprintf(rm_path, sizeof(rm_path), "%s/%s",
- dbspacedirname, de->d_name);
- if (unlink(rm_path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m",
- rm_path)));
+ /* clean permanent relations don't need cleanup */
+ if (!ent->has_init)
+ continue;
+
+ if (ent->dirty_init)
+ {
+ /*
+ * The crashed transaction did SET UNLOGGED. This relation
+ * is restored to a LOGGED relation.
+ */
+ if (forkNum != INIT_FORKNUM)
+ continue;
+ }
else
- elog(DEBUG2, "unlinked file \"%s\"", rm_path);
+ {
+ /*
+ * We don't remove the INIT fork of non-dirty
+ * relation files.
+ */
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE)
+ continue;
+ }
}
+
+ /* so, nuke it! */
+ snprintf(rm_path, sizeof(rm_path), "%s/%s",
+ dbspacedirname, de->d_name);
+ if (unlink(rm_path) < 0)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ rm_path));
+
+ rel.backend = InvalidBackendId;
+ rel.locator.spcOid = tspid;
+ rel.locator.dbOid = dbid;
+ rel.locator.relNumber = atooid(de->d_name);
+
+ ForgetRelationForkSyncRequests(rel, forkNum);
}
/* Cleanup is complete. */
FreeDir(dbspace_dir);
- hash_destroy(hash);
}
+ hash_destroy(hash);
+ hash = NULL;
+
/*
* Initialization happens after cleanup is complete: we copy each init
- * fork file to the corresponding main fork file. Note that if we are
- * asked to do both cleanup and init, we may never get here: if the
- * cleanup code determines that there are no init forks in this dbspace,
- * it will return before we get to this point.
+ * fork file to the corresponding main fork file.
*/
if ((op & UNLOGGED_RELATION_INIT) != 0)
{
@@ -285,6 +419,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char srcpath[MAXPGPATH * 2];
@@ -292,9 +427,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -328,15 +465,18 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int oidchars;
char oidbuf[OIDCHARS + 1];
char mainpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -379,7 +519,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
*/
bool
parse_filename_for_nontemp_relation(const char *name, int *oidchars,
- ForkNumber *fork)
+ ForkNumber *fork, StorageMarks *mark)
{
int pos;
@@ -410,11 +550,19 @@ parse_filename_for_nontemp_relation(const char *name, int *oidchars,
for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar)
;
- if (segchar <= 1)
- return false;
- pos += segchar;
+ if (segchar > 1)
+ pos += segchar;
}
+ /* mark file? */
+ if (name[pos] == '.' && name[pos + 1] != 0)
+ {
+ *mark = name[pos + 1];
+ pos += 2;
+ }
+ else
+ *mark = SMGR_MARK_NONE;
+
/* Now we should be at the end. */
if (name[pos] != '\0')
return false;
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 3998296a62..4249df657a 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -141,7 +141,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
BlockNumber blkno, bool skipFsync, int behavior);
static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
MdfdVec *seg);
-
+static bool mdmarkexists(SMgrRelation reln, ForkNumber forkNum,
+ StorageMarks mark);
/*
* mdinit() -- Initialize private state for magnetic disk storage manager.
@@ -173,6 +174,82 @@ mdexists(SMgrRelation reln, ForkNumber forkNum)
return (mdopenfork(reln, forkNum, EXTENSION_RETURN_NULL) != NULL);
}
+/*
+ * mdcreatemark() -- Create a mark file.
+ *
+ * If isRedo is true, it's okay for the file to exist already.
+ */
+void
+mdcreatemark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rlocator, forkNum, mark);
+ int fd;
+
+ /* See mdcreate for details. */
+ TablespaceCreateDbspace(reln->smgr_rlocator.locator.spcOid,
+ reln->smgr_rlocator.locator.dbOid,
+ isRedo);
+
+ fd = BasicOpenFile(path, O_WRONLY | O_CREAT | O_EXCL);
+ if (fd < 0 && (!isRedo || errno != EEXIST))
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not create mark file \"%s\": %m", path));
+
+ pg_fsync(fd);
+ close(fd);
+
+ /*
+ * To guarantee that the creation of the file is persistent, fsync its
+ * parent directory.
+ */
+ fsync_parent_path(path, ERROR);
+
+ pfree(path);
+}
+
+
+/*
+ * mdunlinkmark() -- Delete the mark file
+ *
+ * If isRedo is true, it's okay for the file not to exist.
+ */
+void
+mdunlinkmark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rlocator, forkNum, mark);
+
+ if (!isRedo || mdmarkexists(reln, forkNum, mark))
+ durable_unlink(path, ERROR);
+
+ pfree(path);
+}
+
+/*
+ * mdmarkexists() -- Check if the mark file exists.
+ */
+static bool
+mdmarkexists(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark)
+{
+ char *path = markpath(reln->smgr_rlocator, forkNum, mark);
+ int fd;
+
+ fd = BasicOpenFile(path, O_RDONLY);
+ if (fd < 0 && errno != ENOENT)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not access mark file \"%s\": %m", path));
+ pfree(path);
+
+ if (fd < 0)
+ return false;
+
+ close(fd);
+ return true;
+}
+
/*
* mdcreate() -- Create a new relation on magnetic disk.
*
@@ -1032,6 +1109,16 @@ register_forget_request(RelFileLocatorBackend rlocator, ForkNumber forknum,
RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ );
}
+/*
+ * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork
+ */
+void
+ForgetRelationForkSyncRequests(RelFileLocatorBackend rlocator,
+ ForkNumber forknum)
+{
+ register_forget_request(rlocator, forknum, 0);
+}
+
/*
* ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB
*/
@@ -1390,12 +1477,14 @@ mdsyncfiletag(const FileTag *ftag, char *path)
* Return 0 on success, -1 on failure, with errno set.
*/
int
-mdunlinkfiletag(const FileTag *ftag, char *path)
+mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark)
{
char *p;
/* Compute the path. */
- p = relpathperm(ftag->rlocator, MAIN_FORKNUM);
+ p = GetRelationPath(ftag->rlocator.dbOid, ftag->rlocator.spcOid,
+ ftag->rlocator.relNumber, InvalidBackendId,
+ MAIN_FORKNUM, mark);
strlcpy(path, p, MAXPGPATH);
pfree(p);
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index c1a5febcbf..5044dc21e4 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -62,6 +62,10 @@ typedef struct f_smgr
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+ void (*smgr_createmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+ void (*smgr_unlinkmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
} f_smgr;
static const f_smgr smgrsw[] = {
@@ -82,6 +86,8 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
+ .smgr_createmark = mdcreatemark,
+ .smgr_unlinkmark = mdunlinkmark,
}
};
@@ -371,6 +377,26 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo);
}
+/*
+ * smgrcreatemark() -- Create a mark file
+ */
+void
+smgrcreatemark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_createmark(reln, forknum, mark, isRedo);
+}
+
+/*
+ * smgrunlinkmark() -- Delete a mark file
+ */
+void
+smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlinkmark(reln, forknum, mark, isRedo);
+}
+
/*
* smgrdosyncall() -- Immediately sync all forks of all given relations
*
@@ -693,6 +719,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rlocator, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 9d6a9e9109..74357efb1c 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -91,7 +91,8 @@ static CycleCtr checkpoint_cycle_ctr = 0;
typedef struct SyncOps
{
int (*sync_syncfiletag) (const FileTag *ftag, char *path);
- int (*sync_unlinkfiletag) (const FileTag *ftag, char *path);
+ int (*sync_unlinkfiletag) (const FileTag *ftag, char *path,
+ StorageMarks mark);
bool (*sync_filetagmatches) (const FileTag *ftag,
const FileTag *candidate);
} SyncOps;
@@ -235,7 +236,8 @@ SyncPostCheckpoint(void)
/* Unlink the file */
if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
- path) < 0)
+ path,
+ SMGR_MARK_NONE) < 0)
{
/*
* There's a race condition, when the database is dropped at the
@@ -244,6 +246,21 @@ SyncPostCheckpoint(void)
* here. rmtree() also has to ignore ENOENT errors, to deal with
* the possibility that we delete the file first.
*/
+ if (errno != ENOENT)
+ ereport(WARNING,
+ errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path));
+ }
+ else if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
+ path,
+ SMGR_MARK_UNCOMMITTED)
+ < 0)
+ {
+ /*
+ * We may also have an SMGR_MARK_UNCOMMITTED file. Remove it if the
+ * fork file has been successfully removed. It's ok if the file
+ * does not exist.
+ */
if (errno != ENOENT)
ereport(WARNING,
(errcode_for_file_access(),
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 53f011a2fe..4da62f71ce 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -418,6 +418,28 @@ extractPageInfo(XLogReaderState *record)
* source system.
*/
}
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_UNLINK)
+ {
+ /*
+ * We can safely ignore these. We'll see that the files don't exist in
+ * the target data dir, and copy them in from the source system. No
+ * need to do anything special here.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_MARK)
+ {
+ /*
+ * We can safely ignore these. The file will be removed from the
+ * target if it doesn't exist in the source system. The files are
+ * empty, so we don't need to bother with their content.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ /*
+ * We can safely ignore these. These don't make any on-disk changes.
+ */
+ }
else if (rmid == RM_XACT_ID &&
((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT ||
(rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED ||
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 1ff8da1676..c48a60cfe1 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -419,7 +419,6 @@ main(int argc, char **argv)
if (showprogress)
pg_log_info("reading source file list");
source->traverse_files(source, &process_source_file);
-
if (showprogress)
pg_log_info("reading target file list");
traverse_datadir(datadir_target, &process_target_file);
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 1b6b620ce8..46cfe38fd5 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -139,9 +139,15 @@ GetDatabasePath(Oid dbOid, Oid spcOid)
*/
char *
GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
- int backendId, ForkNumber forkNumber)
+ int backendId, ForkNumber forkNumber, char mark)
{
char *path;
+ char markstr[4];
+
+ if (mark == 0)
+ markstr[0] = 0;
+ else
+ snprintf(markstr, sizeof(markstr), ".%c", mark);
if (spcOid == GLOBALTABLESPACE_OID)
{
@@ -149,10 +155,10 @@ GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
Assert(dbOid == 0);
Assert(backendId == InvalidBackendId);
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("global/%u_%s",
- relNumber, forkNames[forkNumber]);
+ path = psprintf("global/%u_%s%s",
+ relNumber, forkNames[forkNumber], markstr);
else
- path = psprintf("global/%u", relNumber);
+ path = psprintf("global/%u%s", relNumber, markstr);
}
else if (spcOid == DEFAULTTABLESPACE_OID)
{
@@ -160,22 +166,22 @@ GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/%u_%s",
+ path = psprintf("base/%u/%u_%s%s",
dbOid, relNumber,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/%u",
- dbOid, relNumber);
+ path = psprintf("base/%u/%u%s",
+ dbOid, relNumber, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/t%d_%u_%s",
+ path = psprintf("base/%u/t%d_%u_%s%s",
dbOid, backendId, relNumber,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/t%d_%u",
- dbOid, backendId, relNumber);
+ path = psprintf("base/%u/t%d_%u%s",
+ dbOid, backendId, relNumber, markstr);
}
}
else
@@ -184,27 +190,28 @@ GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u_%s%s",
spcOid, TABLESPACE_VERSION_DIRECTORY,
dbOid, relNumber,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u%s",
spcOid, TABLESPACE_VERSION_DIRECTORY,
- dbOid, relNumber);
+ dbOid, relNumber, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s%s",
spcOid, TABLESPACE_VERSION_DIRECTORY,
dbOid, backendId, relNumber,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u%s",
spcOid, TABLESPACE_VERSION_DIRECTORY,
- dbOid, backendId, relNumber);
+ dbOid, backendId, relNumber, markstr);
}
}
+
return path;
}
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 9964c312aa..ee4179699a 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -25,6 +25,8 @@ extern PGDLLIMPORT int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileLocator rlocator,
char relpersistence,
bool register_delete);
+extern void RelationCreateInitFork(Relation rel);
+extern void RelationDropInitFork(Relation rel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileLocator rlocator, bool atCommit);
extern void RelationPreTruncate(Relation rel);
@@ -43,6 +45,7 @@ extern void RestorePendingSyncs(char *startAddress);
extern void smgrDoPendingDeletes(bool isCommit);
extern void smgrDoPendingSyncs(bool isCommit, bool isParallelWorker);
extern int smgrGetPendingDeletes(bool forCommit, RelFileLocator **ptr);
+extern void smgrDoPendingCleanups(bool isCommit);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index 44a5e2043b..9f48fb5e6f 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -18,17 +18,23 @@
#include "lib/stringinfo.h"
#include "storage/block.h"
#include "storage/relfilelocator.h"
+#include "storage/smgr.h"
/*
* Declarations for smgr-related XLOG records
*
- * Note: we log file creation and truncation here, but logging of deletion
- * actions is handled by xact.c, because it is part of transaction commit.
+ * Note: we log file creation, truncation and buffer persistence change here,
+ * but logging of deletion actions is handled mainly by xact.c, because it is
+ * part of transaction commit in most cases. However, there's a case where
+ * init forks are deleted outside the control of a transaction.
*/
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_UNLINK 0x30
+#define XLOG_SMGR_MARK 0x40
+#define XLOG_SMGR_BUFPERSISTENCE 0x50
typedef struct xl_smgr_create
{
@@ -36,6 +42,32 @@ typedef struct xl_smgr_create
ForkNumber forkNum;
} xl_smgr_create;
+typedef struct xl_smgr_unlink
+{
+ RelFileLocator rlocator;
+ ForkNumber forkNum;
+} xl_smgr_unlink;
+
+typedef enum smgr_mark_action
+{
+ XLOG_SMGR_MARK_CREATE = 'c',
+ XLOG_SMGR_MARK_UNLINK = 'u'
+} smgr_mark_action;
+
+typedef struct xl_smgr_mark
+{
+ RelFileLocator rlocator;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ smgr_mark_action action;
+} xl_smgr_mark;
+
+typedef struct xl_smgr_bufpersistence
+{
+ RelFileLocator rlocator;
+ bool persistence;
+} xl_smgr_bufpersistence;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +83,13 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileLocator *rlocator, ForkNumber forkNum);
+extern void log_smgrcreatemark(const RelFileLocator *rlocator,
+ ForkNumber forkNum, StorageMarks mark);
+extern void log_smgrunlinkmark(const RelFileLocator *rlocator,
+ ForkNumber forkNum, StorageMarks mark);
+extern void log_smgrbufpersistence(const RelFileLocator *rlocator,
+ bool persistence);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index 3ab713247f..eba7a05f4e 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -67,7 +67,7 @@ extern int forkname_chars(const char *str, ForkNumber *fork);
extern char *GetDatabasePath(Oid dbOid, Oid spcOid);
extern char *GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
- int backendId, ForkNumber forkNumber);
+ int backendId, ForkNumber forkNumber, char mark);
/*
* Wrapper macros for GetRelationPath. Beware of multiple
@@ -77,7 +77,7 @@ extern char *GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
/* First argument is a RelFileLocator */
#define relpathbackend(rlocator, backend, forknum) \
GetRelationPath((rlocator).dbOid, (rlocator).spcOid, (rlocator).relNumber, \
- backend, forknum)
+ backend, forknum, 0)
/* First argument is a RelFileLocator */
#define relpathperm(rlocator, forknum) \
@@ -87,4 +87,9 @@ extern char *GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
#define relpath(rlocator, forknum) \
relpathbackend((rlocator).locator, (rlocator).backend, forknum)
+/* First argument is a RelFileLocatorBackend */
+#define markpath(rlocator, forknum, mark) \
+ GetRelationPath((rlocator).locator.dbOid, (rlocator).locator.spcOid, \
+ (rlocator).locator.relNumber, \
+ (rlocator).backend, forknum, mark)
#endif /* RELPATH_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index bf8cce7ccf..be5d63319e 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -143,6 +143,8 @@ extern void DropRelationBuffers(struct SMgrRelationData *smgr_reln,
int nforks, BlockNumber *firstDelBlock);
extern void DropRelationsAllBuffers(struct SMgrRelationData **smgr_reln,
int nlocators);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+ bool permanent, bool isRedo);
extern void DropDatabaseBuffers(Oid dbid);
#define RelationGetNumberOfBlocks(reln) \
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 2b4a8e0ffe..eb33a9ba4c 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -185,6 +185,7 @@ extern ssize_t pg_pwritev_with_retry(int fd,
extern int pg_truncate(const char *path, off_t length);
extern void fsync_fname(const char *fname, bool isdir);
extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
+extern int fsync_parent_path(const char *fname, int elevel);
extern int durable_rename(const char *oldfile, const char *newfile, int loglevel);
extern int durable_unlink(const char *fname, int loglevel);
extern void SyncDataDirectory(void);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 10aa1b0109..294a09444c 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -23,6 +23,10 @@
extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
extern void mdclose(SMgrRelation reln, ForkNumber forknum);
+extern void mdcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void mdunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
extern void mdunlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo);
@@ -41,12 +45,14 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void ForgetRelationForkSyncRequests(RelFileLocatorBackend rlocator,
+ ForkNumber forknum);
extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo);
/* md sync callbacks */
extern int mdsyncfiletag(const FileTag *ftag, char *path);
-extern int mdunlinkfiletag(const FileTag *ftag, char *path);
+extern int mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark);
extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *candidate);
#endif /* MD_H */
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index bf2c10d443..e399aec0c7 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -16,13 +16,15 @@
#define REINIT_H
#include "common/relpath.h"
-
+#include "storage/smgr.h"
extern void ResetUnloggedRelations(int op);
-extern bool parse_filename_for_nontemp_relation(const char *name,
- int *oidchars, ForkNumber *fork);
+extern bool parse_filename_for_nontemp_relation(const char *name, int *oidchars,
+ ForkNumber *fork,
+ StorageMarks *mark);
#define UNLOGGED_RELATION_CLEANUP 0x0001
-#define UNLOGGED_RELATION_INIT 0x0002
+#define UNLOGGED_RELATION_DROP_BUFFER 0x0002
+#define UNLOGGED_RELATION_INIT 0x0004
#endif /* REINIT_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a07715356b..75b9d41e4b 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,18 @@
#include "storage/block.h"
#include "storage/relfilelocator.h"
+/*
+ * A storage mark is a file whose existence indicates something about another
+ * file. Such a file is named "<filename>.<mark>", where the mark is one of
+ * the values of StorageMarks. Since ".<digit>" denotes a segment file, don't
+ * use digits for the mark character.
+ */
+typedef enum StorageMarks
+{
+ SMGR_MARK_NONE = 0,
+ SMGR_MARK_UNCOMMITTED = 'u' /* the file is not committed yet */
+} StorageMarks;
+
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
@@ -87,7 +99,12 @@ extern void smgrcloseall(void);
extern void smgrcloserellocator(RelFileLocatorBackend rlocator);
extern void smgrrelease(SMgrRelation reln);
extern void smgrreleaseall(void);
+extern void smgrcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void smgrunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 34a76ceb60..5445826998 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1981,6 +1981,7 @@ PatternInfoArray
Pattern_Prefix_Status
Pattern_Type
PendingFsyncEntry
+PendingMarkCleanup
PendingRelDelete
PendingRelSync
PendingUnlinkEntry
@@ -2611,6 +2612,7 @@ StdRdOptIndexCleanup
StdRdOptions
Step
StopList
+StorageMarks
StrategyNumber
StreamCtl
String
@@ -3612,6 +3614,7 @@ registered_buffer
regmatch_t
regoff_t
regproc
+relfile_entry
relopt_bool
relopt_enum
relopt_enum_elt_def
@@ -3665,6 +3668,7 @@ slist_iter
slist_mutable_iter
slist_node
slock_t
+smgr_mark_action
socket_set
socklen_t
spgBulkDeleteState
@@ -3863,8 +3867,11 @@ xl_replorigin_set
xl_restore_point
xl_running_xacts
xl_seq_rec
+xl_smgr_bufpersistence
xl_smgr_create
+xl_smgr_mark
xl_smgr_truncate
+xl_smgr_unlink
xl_standby_lock
xl_standby_locks
xl_tblspc_create_rec
--
2.31.1
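The mark-file naming used by the patch above ("<filename>.<mark>", parsed after the optional segment number) can be illustrated with a small standalone sketch. This is not code from the patch: the `sketch_*` names and the `SKETCH_MARK_*` constants are hypothetical stand-ins, and the parser only approximates what the patched parse_filename_for_nontemp_relation() does.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the patch's StorageMarks values. */
#define SKETCH_MARK_NONE        0
#define SKETCH_MARK_UNCOMMITTED 'u'

/*
 * Append ".<mark>" to a relation file name, in the way the patched
 * GetRelationPath() builds markstr.  A mark of 0 yields the plain name.
 */
static void
sketch_mark_name(char *buf, size_t buflen, const char *relfile, char mark)
{
    char        markstr[4];

    assert(buflen > 0);
    if (mark == SKETCH_MARK_NONE)
        markstr[0] = '\0';
    else
        snprintf(markstr, sizeof(markstr), ".%c", mark);
    snprintf(buf, buflen, "%s%s", relfile, markstr);
}

/*
 * Parse "<oid>[_<fork>][.<segment>][.<mark>]".  Returns false if the name
 * does not have that shape.  *mark is only meaningful on success.
 */
static bool
sketch_parse_name(const char *name, char *mark)
{
    static const char *const forks[] = {"fsm", "vm", "init"};
    size_t      pos = 0;

    /* OID part: at least one digit. */
    while (isdigit((unsigned char) name[pos]))
        pos++;
    if (pos == 0)
        return false;

    /* Optional fork suffix. */
    if (name[pos] == '_')
    {
        bool        found = false;

        for (size_t i = 0; i < sizeof(forks) / sizeof(forks[0]); i++)
        {
            size_t      len = strlen(forks[i]);

            if (strncmp(name + pos + 1, forks[i], len) == 0)
            {
                pos += 1 + len;
                found = true;
                break;
            }
        }
        if (!found)
            return false;
    }

    /* Optional segment number: '.' followed by digits. */
    if (name[pos] == '.')
    {
        size_t      seg = 1;

        while (isdigit((unsigned char) name[pos + seg]))
            seg++;
        if (seg > 1)
            pos += seg;
    }

    /* Optional mark: '.' followed by a single character. */
    if (name[pos] == '.' && name[pos + 1] != '\0')
    {
        *mark = name[pos + 1];
        pos += 2;
    }
    else
        *mark = SKETCH_MARK_NONE;

    /* Now we should be at the end. */
    return name[pos] == '\0';
}
```

Because a digit after the dot is always consumed as a segment number, a digit can never be read back as a mark, which is why the comment in smgr.h asks that mark characters not be digits.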
Attachment: v23-0002-New-command-ALTER-TABLE-ALL-IN-TABLESPACE-SET.patch (text/x-patch; charset=us-ascii)
From d58437fe963c2475c5ab3ca175049333f4b9b8f2 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 19 Jul 2022 13:32:13 +0900
Subject: [PATCH v23 2/2] New command ALTER TABLE ALL IN TABLESPACE SET
LOGGED/UNLOGGED
To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes
relation persistence of all tables in the specified tablespace.
---
doc/src/sgml/ref/alter_table.sgml | 15 +++
src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++
src/backend/parser/gram.y | 42 +++++++
src/backend/tcop/utility.c | 11 ++
src/include/commands/tablecmds.h | 2 +
src/include/nodes/parsenodes.h | 10 ++
src/test/regress/expected/tablespace.out | 76 ++++++++++++
src/test/regress/sql/tablespace.sql | 41 +++++++
8 files changed, 337 insertions(+)
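
The selection rule that AlterTableSetLoggedAll() applies to each pg_class row can be summarized by a standalone predicate. The sketch below is only an illustration of that rule: `SketchRel` and the `sketch_*` functions are hypothetical and are not PostgreSQL types or functions, and permission checks and locking are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a pg_class row; not a PostgreSQL type. */
typedef struct SketchRel
{
    bool        in_catalog_namespace;
    bool        is_shared;
    bool        in_temp_namespace;
    bool        in_toast_namespace;
    char        relkind;        /* 'r' = ordinary table */
    unsigned    owner;          /* owning role's OID */
} SketchRel;

/* Return true if owner appears in the roles array. */
static bool
sketch_owner_listed(const unsigned *roles, size_t nroles, unsigned owner)
{
    for (size_t i = 0; i < nroles; i++)
    {
        if (roles[i] == owner)
            return true;
    }
    return false;
}

/*
 * Decide whether a relation would be picked up by the command.
 * nroles == 0 means no OWNED BY clause was given.
 */
static bool
sketch_should_change(const SketchRel *rel,
                     const unsigned *roles, size_t nroles)
{
    assert(rel != NULL);

    /* Skip catalog, shared, temp and TOAST relations. */
    if (rel->in_catalog_namespace || rel->is_shared ||
        rel->in_temp_namespace || rel->in_toast_namespace)
        return false;

    /* Only ordinary tables are considered. */
    if (rel->relkind != 'r')
        return false;

    /* Honor the OWNED BY filter, if any. */
    if (nroles > 0 && !sketch_owner_listed(roles, nroles, rel->owner))
        return false;

    return true;
}
```

TOAST relations are skipped here because, as the comment in the function notes, they are changed together with their main table.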
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index f0f912a56c..fd3d8290b8 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -33,6 +33,8 @@ ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
SET SCHEMA <replaceable class="parameter">new_schema</replaceable>
ALTER TABLE ALL IN TABLESPACE <replaceable class="parameter">name</replaceable> [ OWNED BY <replaceable class="parameter">role_name</replaceable> [, ... ] ]
SET TABLESPACE <replaceable class="parameter">new_tablespace</replaceable> [ NOWAIT ]
+ALTER TABLE ALL IN TABLESPACE <replaceable class="parameter">name</replaceable> [ OWNED BY <replaceable class="parameter">role_name</replaceable> [, ... ] ]
+ SET { LOGGED | UNLOGGED } [ NOWAIT ]
ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
ATTACH PARTITION <replaceable class="parameter">partition_name</replaceable> { FOR VALUES <replaceable class="parameter">partition_bound_spec</replaceable> | DEFAULT }
ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
@@ -759,6 +761,19 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
(for identity or serial columns). However, it is also possible to
change the persistence of such sequences separately.
</para>
+ <para>
+ All tables in the current database in a tablespace can be changed by
+ using the <literal>ALL IN TABLESPACE</literal> form, which will first
+ lock all tables to be changed and then change each one. This form also
+ supports
+ <literal>OWNED BY</literal>, which will only change tables owned by the
+ specified roles. If the <literal>NOWAIT</literal> option is specified,
+ then the command will fail if it is unable to immediately acquire all of
+ the locks required. The <literal>information_schema</literal> relations
+ are not considered part of the system catalogs and will be changed. See
+ also
+ <link linkend="sql-createtablespace"><command>CREATE TABLESPACE</command></link>.
+ </para>
</listitem>
</varlistentry>
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index d8de3f496b..1366a4f22a 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14779,6 +14779,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
return new_tablespaceoid;
}
+/*
+ * Alter Table ALL ... SET LOGGED/UNLOGGED
+ *
+ * Allows a user to change persistence of all objects in a given tablespace in
+ * the current database. Objects can be chosen based on the owner of the
+ * object also, to allow users to change persistence of only their objects. The
+ * main permissions handling is done by the lower-level change persistence
+ * function.
+ *
+ * All to-be-modified objects are locked first. If NOWAIT is specified and the
+ * lock can't be acquired then we ereport(ERROR).
+ */
+void
+AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt)
+{
+ List *relations = NIL;
+ ListCell *l;
+ ScanKeyData key[1];
+ Relation rel;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ Oid tablespaceoid;
+ List *role_oids = roleSpecsToIds(stmt->roles);
+
+ /* Ensure we were not asked to change something we can't */
+ if (stmt->objtype != OBJECT_TABLE)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("only tables can be specified"));
+
+ /* Get the tablespace OID */
+ tablespaceoid = get_tablespace_oid(stmt->tablespacename, false);
+
+ /*
+ * Now that the checks are done, check if we should set it to
+ * InvalidOid because it is our database's default tablespace.
+ */
+ if (tablespaceoid == MyDatabaseTableSpace)
+ tablespaceoid = InvalidOid;
+
+ /*
+ * Walk the list of objects in the tablespace to pick them up. This will
+ * only find objects in our database, of course.
+ */
+ ScanKeyInit(&key[0],
+ Anum_pg_class_reltablespace,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(tablespaceoid));
+
+ rel = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(rel, 1, key);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relOid = relForm->oid;
+
+ /*
+ * Do not pick up objects in pg_catalog as part of this; if an admin
+ * really wishes to do so, they can issue the individual ALTER
+ * commands directly.
+ *
+ * Also, explicitly avoid any shared tables, temp tables, or TOAST
+ * (TOAST will be changed with the main table).
+ */
+ if (IsCatalogNamespace(relForm->relnamespace) ||
+ relForm->relisshared ||
+ isAnyTempNamespace(relForm->relnamespace) ||
+ IsToastNamespace(relForm->relnamespace))
+ continue;
+
+ /* Only pick up the object type requested */
+ if (relForm->relkind != RELKIND_RELATION)
+ continue;
+
+ /* Check if we are only picking-up objects owned by certain roles */
+ if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner))
+ continue;
+
+ /*
+ * Handle permissions-checking here since we are locking the tables
+ * and also to avoid doing a bunch of work only to fail part-way. Note
+ * that permissions will also be checked by AlterTableInternal().
+ *
+ * Caller must be considered an owner of the table whose persistence
+ * we're going to change.
+ */
+ if (!pg_class_ownercheck(relOid, GetUserId()))
+ aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)),
+ NameStr(relForm->relname));
+
+ if (stmt->nowait &&
+ !ConditionalLockRelationOid(relOid, AccessExclusiveLock))
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("aborting because lock on relation \"%s.%s\" is not available",
+ get_namespace_name(relForm->relnamespace),
+ NameStr(relForm->relname)));
+ else
+ LockRelationOid(relOid, AccessExclusiveLock);
+
+ /*
+ * Add to our list of objects whose persistence we're going to
+ * change.
+ */
+ relations = lappend_oid(relations, relOid);
+ }
+
+ table_endscan(scan);
+ table_close(rel, AccessShareLock);
+
+ if (relations == NIL)
+ ereport(NOTICE,
+ errcode(ERRCODE_NO_DATA_FOUND),
+ errmsg("no matching relations in tablespace \"%s\" found",
+ tablespaceoid == InvalidOid ? "(database default)" :
+ get_tablespace_name(tablespaceoid)));
+
+ /*
+ * Everything is locked, loop through and change persistence of all of the
+ * relations.
+ */
+ foreach(l, relations)
+ {
+ List *cmds = NIL;
+ AlterTableCmd *cmd = makeNode(AlterTableCmd);
+
+ if (stmt->logged)
+ cmd->subtype = AT_SetLogged;
+ else
+ cmd->subtype = AT_SetUnLogged;
+
+ cmds = lappend(cmds, cmd);
+
+ EventTriggerAlterTableStart((Node *) stmt);
+ /* OID is set by AlterTableInternal */
+ AlterTableInternal(lfirst_oid(l), cmds, false);
+ EventTriggerAlterTableEnd();
+ }
+}
+
static void
index_copy_data(Relation rel, RelFileLocator newrlocator)
{
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index c018140afe..96ece9747e 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -2177,6 +2177,48 @@ AlterTableStmt:
n->nowait = $13;
$$ = (Node *) n;
}
+ | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = true;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name OWNED BY role_list SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->roles = $9;
+ n->logged = true;
+ n->nowait = $12;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = false;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name OWNED BY role_list SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->roles = $9;
+ n->logged = false;
+ n->nowait = $12;
+ $$ = (Node *)n;
+ }
| ALTER INDEX qualified_name alter_table_cmds
{
AlterTableStmt *n = makeNode(AlterTableStmt);
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 6b0a865262..da16b33837 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -164,6 +164,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_AlterTSConfigurationStmt:
case T_AlterTSDictionaryStmt:
case T_AlterTableMoveAllStmt:
+ case T_AlterTableSetLoggedAllStmt:
case T_AlterTableSpaceOptionsStmt:
case T_AlterTableStmt:
case T_AlterTypeStmt:
@@ -1760,6 +1761,12 @@ ProcessUtilitySlow(ParseState *pstate,
commandCollected = true;
break;
+ case T_AlterTableSetLoggedAllStmt:
+ AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree);
+ /* commands are stashed in AlterTableSetLoggedAll */
+ commandCollected = true;
+ break;
+
case T_DropStmt:
ExecDropStmt((DropStmt *) parsetree, isTopLevel);
/* no commands stashed for DROP */
@@ -2688,6 +2695,10 @@ CreateCommandTag(Node *parsetree)
tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype);
+ break;
+
case T_AlterTableStmt:
tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype);
break;
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index 0c48654b96..4a4b417db6 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
+extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt * stmt);
+
extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
Oid *oldschema);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 98fe1abaa2..e7d7bd05a2 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2711,6 +2711,16 @@ typedef struct AlterTableMoveAllStmt
bool nowait;
} AlterTableMoveAllStmt;
+typedef struct AlterTableSetLoggedAllStmt
+{
+ NodeTag type;
+ char *tablespacename;
+ ObjectType objtype; /* Object type to change persistence of */
+ List *roles; /* List of roles to change objects of */
+ bool logged;
+ bool nowait;
+} AlterTableSetLoggedAllStmt;
+
/* ----------------------
* Create/Alter Extension Statements
* ----------------------
diff --git a/src/test/regress/expected/tablespace.out b/src/test/regress/expected/tablespace.out
index c52cf1cfcf..a679d58553 100644
--- a/src/test/regress/expected/tablespace.out
+++ b/src/test/regress/expected/tablespace.out
@@ -966,5 +966,81 @@ drop cascades to table testschema.part
drop cascades to table testschema.atable
drop cascades to materialized view testschema.amv
drop cascades to table testschema.tablespace_acl
+--
+-- Check persistence change in a tablespace
+CREATE SCHEMA testschema;
+GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1;
+CREATE TABLESPACE regress_tablespace LOCATION '';
+GRANT CREATE ON TABLESPACE regress_tablespace TO regress_tablespace_user1;
+CREATE TABLE testschema.lsu(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.usu(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lsu(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._usu(a int) TABLESPACE pg_default;
+SET ROLE regress_tablespace_user1;
+CREATE TABLE testschema.lu1(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.uu1(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lu1(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._uu1(a int) TABLESPACE pg_default;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | p
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | p
+ uu1 | regress_tablespace | u
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace
+ OWNED BY regress_tablespace_user1 SET LOGGED;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | p
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | p
+ uu1 | regress_tablespace | p
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+RESET ROLE;
+ALTER TABLE ALL IN TABLESPACE regress_tablespace SET UNLOGGED;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | u
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | u
+ uu1 | regress_tablespace | u
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+-- Should succeed
+DROP SCHEMA testschema CASCADE;
+NOTICE: drop cascades to 8 other objects
+DETAIL: drop cascades to table testschema.lsu
+drop cascades to table testschema.usu
+drop cascades to table testschema._lsu
+drop cascades to table testschema._usu
+drop cascades to table testschema.lu1
+drop cascades to table testschema.uu1
+drop cascades to table testschema._lu1
+drop cascades to table testschema._uu1
+DROP TABLESPACE regress_tablespace;
DROP ROLE regress_tablespace_user1;
DROP ROLE regress_tablespace_user2;
diff --git a/src/test/regress/sql/tablespace.sql b/src/test/regress/sql/tablespace.sql
index 21db433f2a..a4b664f4e0 100644
--- a/src/test/regress/sql/tablespace.sql
+++ b/src/test/regress/sql/tablespace.sql
@@ -431,5 +431,46 @@ DROP TABLESPACE regress_tblspace_renamed;
DROP SCHEMA testschema CASCADE;
+
+--
+-- Check persistence change in a tablespace
+CREATE SCHEMA testschema;
+GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1;
+CREATE TABLESPACE regress_tablespace LOCATION '';
+GRANT CREATE ON TABLESPACE regress_tablespace TO regress_tablespace_user1;
+
+CREATE TABLE testschema.lsu(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.usu(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lsu(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._usu(a int) TABLESPACE pg_default;
+SET ROLE regress_tablespace_user1;
+CREATE TABLE testschema.lu1(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.uu1(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lu1(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._uu1(a int) TABLESPACE pg_default;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace
+ OWNED BY regress_tablespace_user1 SET LOGGED;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+RESET ROLE;
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace SET UNLOGGED;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+-- Should succeed
+DROP SCHEMA testschema CASCADE;
+DROP TABLESPACE regress_tablespace;
+
DROP ROLE regress_tablespace_user1;
DROP ROLE regress_tablespace_user2;
--
2.31.1
Just rebased.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v24-0001-In-place-table-persistence-change.patch (text/x-patch; charset=us-ascii)
From 8c0d5bd7b519149059d1b2b86a93ffe509e09b9b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 19 Jul 2022 13:23:01 +0900
Subject: [PATCH v24 1/2] In-place table persistence change
Even though ALTER TABLE SET LOGGED/UNLOGGED does not require rewriting
data, it currently performs a heap rewrite, which causes a large amount
of file I/O. This patch makes the command run without a heap rewrite.
In addition, SET LOGGED with wal_level > minimal emits WAL using
XLOG_FPI records instead of a massive number of HEAP_INSERT records,
which should result in less WAL.
This also allows the cleanup of files left behind when the transaction
that created them crashes.
---
src/backend/access/rmgrdesc/smgrdesc.c | 49 ++
src/backend/access/transam/README | 10 +
src/backend/access/transam/xact.c | 7 +
src/backend/access/transam/xlogrecovery.c | 18 +
src/backend/backup/basebackup.c | 9 +-
src/backend/catalog/storage.c | 561 +++++++++++++++++++++-
src/backend/commands/tablecmds.c | 267 ++++++++--
src/backend/storage/buffer/bufmgr.c | 85 ++++
src/backend/storage/file/fd.c | 4 +-
src/backend/storage/file/reinit.c | 313 ++++++++----
src/backend/storage/smgr/md.c | 95 +++-
src/backend/storage/smgr/smgr.c | 32 ++
src/backend/storage/sync/sync.c | 21 +-
src/bin/pg_rewind/parsexlog.c | 22 +
src/bin/pg_rewind/pg_rewind.c | 1 -
src/common/relpath.c | 47 +-
src/include/catalog/storage.h | 3 +
src/include/catalog/storage_xlog.h | 43 +-
src/include/common/relpath.h | 9 +-
src/include/storage/bufmgr.h | 2 +
src/include/storage/fd.h | 1 +
src/include/storage/md.h | 8 +-
src/include/storage/reinit.h | 8 +-
src/include/storage/smgr.h | 17 +
src/tools/pgindent/typedefs.list | 7 +
25 files changed, 1471 insertions(+), 168 deletions(-)
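The commit message above says the patch can clean up files left behind by a crashed transaction. The following is a minimal, self-contained sketch of that idea as I read it from the patch, not the patch's actual code. All names here (Cleanup, create_storage, do_pending_cleanups, crash_and_recover, the OP_* flags and the two boolean arrays standing in for files on disk) are hypothetical. For brevity the sketch folds the abort-time removal of the storage file into the same list, whereas the patch keeps that in the separate pendingDeletes list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_RELS 8
#define OP_UNLINK_STORAGE (1 << 0)	/* remove the storage file */
#define OP_UNLINK_MARK    (1 << 1)	/* remove the mark file */

/* Toy "disk": one storage file and one mark file per relation id. */
static bool storage_exists[MAX_RELS];
static bool mark_exists[MAX_RELS];

/* Hypothetical, simplified stand-in for the patch's PendingCleanup. */
typedef struct Cleanup
{
	int			relid;
	int			op;			/* mask of OP_* flags */
	bool		at_commit;	/* true: run at commit; false: run at abort */
	int			nest_level;	/* transaction nesting level of the request */
	struct Cleanup *next;
} Cleanup;

static Cleanup *pending = NULL;	/* head of linked list */

static void
register_cleanup(int relid, int op, bool at_commit, int nest_level)
{
	Cleanup    *c = malloc(sizeof(Cleanup));

	assert(c != NULL);
	c->relid = relid;
	c->op = op;
	c->at_commit = at_commit;
	c->nest_level = nest_level;
	c->next = pending;
	pending = c;
}

/* Create the mark first, so a crash never leaves an unmarked new file. */
void
create_storage(int relid, int nest_level)
{
	assert(relid >= 0 && relid < MAX_RELS);
	mark_exists[relid] = true;
	storage_exists[relid] = true;
	/* commit: the file is now permanent, only the mark goes away */
	register_cleanup(relid, OP_UNLINK_MARK, true, nest_level);
	/* abort: both the file and its mark go away */
	register_cleanup(relid, OP_UNLINK_STORAGE | OP_UNLINK_MARK, false,
					 nest_level);
}

/* Process entries of the current nesting level; keep outer-level ones. */
void
do_pending_cleanups(bool is_commit, int nest_level)
{
	Cleanup    *prev = NULL;
	Cleanup    *next;

	for (Cleanup *c = pending; c != NULL; c = next)
	{
		next = c->next;
		if (c->nest_level < nest_level)
		{
			prev = c;		/* outer-level entry, not processed yet */
			continue;
		}
		/* unlink the list entry before acting on it */
		if (prev)
			prev->next = next;
		else
			pending = next;
		if (c->at_commit == is_commit)
		{
			if (c->op & OP_UNLINK_STORAGE)
				storage_exists[c->relid] = false;
			if (c->op & OP_UNLINK_MARK)
				mark_exists[c->relid] = false;
		}
		free(c);
	}
}

/* A crash loses the in-memory list; recovery removes every marked file. */
void
crash_and_recover(void)
{
	while (pending != NULL)
	{
		Cleanup    *next = pending->next;

		free(pending);
		pending = next;
	}
	for (int i = 0; i < MAX_RELS; i++)
	{
		if (mark_exists[i])
		{
			storage_exists[i] = false;
			mark_exists[i] = false;
		}
	}
}
```

In this model, calling create_storage() followed by do_pending_cleanups(true, level) leaves the storage file without a mark, do_pending_cleanups(false, level) removes both, and crash_and_recover() removes only storage files that still carry a mark. The point of the mark is that recovery needs no knowledge of the lost in-memory list.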
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index e0ee8a078a..2f92c06f70 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,46 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+ char *path = relpathperm(xlrec->rlocator, xlrec->forkNum);
+
+ appendStringInfoString(buf, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+ char *path = GetRelationPath(xlrec->rlocator.dbOid,
+ xlrec->rlocator.spcOid,
+ xlrec->rlocator.relNumber,
+ InvalidBackendId,
+ xlrec->forkNum, xlrec->mark);
+ char *action = "<none>";
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ action = "CREATE";
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ action = "DELETE";
+ break;
+ }
+
+ appendStringInfo(buf, "%s %s", action, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+ char *path = relpathperm(xlrec->rlocator, MAIN_FORKNUM);
+
+ appendStringInfoString(buf, path);
+ appendStringInfo(buf, " persistence %d", xlrec->persistence);
+ pfree(path);
+ }
}
const char *
@@ -55,6 +95,15 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_UNLINK:
+ id = "UNLINK";
+ break;
+ case XLOG_SMGR_MARK:
+ id = "MARK";
+ break;
+ case XLOG_SMGR_BUFPERSISTENCE:
+ id = "BUFPERSISTENCE";
+ break;
}
return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 91c2578f7a..37527e16ca 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -725,6 +725,16 @@ we must panic and abort recovery. The DBA will have to manually clean up
then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.
+================================
+The Smgr MARK files
+--------------------------------
+
+An smgr mark file is an empty file that is created when a new relation
+storage file is created to signal that the storage file needs to be
+cleaned up at recovery time. In contrast to the four actions above,
+failure to remove smgr mark files will lead to data loss, in which
+case the server will shut down.
+
Skipping WAL for New RelFileLocator
--------------------------------
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 2bb975943c..03e4bcec34 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2213,6 +2213,9 @@ CommitTransaction(void)
*/
smgrDoPendingSyncs(true, is_parallel_worker);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2463,6 +2466,9 @@ PrepareTransaction(void)
*/
smgrDoPendingSyncs(true, false);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2788,6 +2794,7 @@ AbortTransaction(void)
AfterTriggerEndXact(false); /* 'false' means it's abort */
AtAbort_Portals();
smgrDoPendingSyncs(false, is_parallel_worker);
+ smgrDoPendingCleanups(false);
AtEOXact_LargeObject(false);
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 1026ce5dcf..aac303934c 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -42,6 +42,7 @@
#include "access/xlogutils.h"
#include "backup/basebackup.h"
#include "catalog/pg_control.h"
+#include "catalog/storage.h"
#include "commands/tablespace.h"
#include "common/file_utils.h"
#include "miscadmin.h"
@@ -56,6 +57,7 @@
#include "storage/pmsignal.h"
#include "storage/proc.h"
#include "storage/procarray.h"
+#include "storage/reinit.h"
#include "storage/spin.h"
#include "utils/builtins.h"
#include "utils/datetime.h"
@@ -1777,6 +1779,14 @@ PerformWalRecovery(void)
RmgrCleanup();
+ /* cleanup garbage files left during crash recovery */
+ if (!InArchiveRecovery)
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
ereport(LOG,
(errmsg("redo done at %X/%X system usage: %s",
LSN_FORMAT_ARGS(xlogreader->ReadRecPtr),
@@ -3113,6 +3123,14 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
{
ereport(DEBUG1,
(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+
+ /* cleanup garbage files left during crash recovery */
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
InArchiveRecovery = true;
if (StandbyModeRequested)
StandbyMode = true;
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 1434bcdd85..4e0c6c4e98 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -1190,6 +1190,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
ForkNumber relForkNum; /* Type of fork if file is a relation */
int relnumchars; /* Chars in filename that are the
* relnumber */
+ StorageMarks mark; /* marker file sign */
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1240,7 +1241,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
/* Exclude all forks for unlogged tables except the init fork */
if (isDbDir &&
parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &relForkNum))
+ &relForkNum, &mark))
{
/* Never exclude init forks */
if (relForkNum != INIT_FORKNUM)
@@ -1447,6 +1448,7 @@ is_checksummed_file(const char *fullpath, const char *filename)
strncmp(fullpath, "/", 1) == 0)
{
int excludeIdx;
+ char *p;
/* Compare file against noChecksumFiles skip list */
for (excludeIdx = 0; noChecksumFiles[excludeIdx].name != NULL; excludeIdx++)
@@ -1460,6 +1462,11 @@ is_checksummed_file(const char *fullpath, const char *filename)
return false;
}
+ /* exclude mark files */
+ p = strchr(filename, '.');
+ if (p && isalpha(p[1]) && p[2] == 0)
+ return false;
+
return true;
}
else
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 38bbe32550..2c6472cfd5 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,6 +19,7 @@
#include "postgres.h"
+#include "access/amapi.h"
#include "access/parallel.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -66,6 +67,23 @@ typedef struct PendingRelDelete
struct PendingRelDelete *next; /* linked-list link */
} PendingRelDelete;
+#define PCOP_UNLINK_FORK (1 << 0)
+#define PCOP_UNLINK_MARK (1 << 1)
+#define PCOP_SET_PERSISTENCE (1 << 2)
+
+typedef struct PendingCleanup
+{
+ RelFileLocator rlocator; /* relation that needs a cleanup */
+ int op; /* operation mask */
+ bool bufpersistence; /* buffer persistence to set */
+ ForkNumber unlink_forknum; /* forknum to unlink */
+ StorageMarks unlink_mark; /* mark to unlink */
+ BackendId backend; /* InvalidBackendId if not a temp rel */
+ bool atCommit; /* T=delete at commit; F=delete at abort */
+ int nestLevel; /* xact nesting level of request */
+ struct PendingCleanup *next; /* linked-list link */
+} PendingCleanup;
+
typedef struct PendingRelSync
{
RelFileLocator rlocator;
@@ -73,6 +91,7 @@ typedef struct PendingRelSync
} PendingRelSync;
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingCleanup * pendingCleanups = NULL; /* head of linked list */
static HTAB *pendingSyncHash = NULL;
@@ -123,6 +142,7 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
SMgrRelation srel;
BackendId backend;
bool needs_wal;
+ PendingCleanup *pendingclean;
Assert(!IsInParallelMode()); /* couldn't update pendingSyncHash */
@@ -145,9 +165,23 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
return NULL; /* placate compiler */
}
+ /*
+ * We are going to create a new storage file. If the server crashes before
+ * the current transaction ends, the file needs to be cleaned up. The
+ * SMGR_MARK_UNCOMMITTED mark file prompts that work at the next startup.
+ * We don't need this during WAL-logged CREATE DATABASE. See
+ * CreateAndCopyRelationData for details.
+ */
srel = smgropen(rlocator, backend);
+
+ if (register_delete)
+ {
+ log_smgrcreatemark(&rlocator, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+ }
+
smgrcreate(srel, MAIN_FORKNUM, false);
-
+
if (needs_wal)
log_smgrcreate(&srel->smgr_rlocator.locator, MAIN_FORKNUM);
@@ -157,16 +191,29 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
*/
if (register_delete)
{
- PendingRelDelete *pending;
+ PendingRelDelete *pendingdel;
- pending = (PendingRelDelete *)
+ pendingdel = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
- pending->rlocator = rlocator;
- pending->backend = backend;
- pending->atCommit = false; /* delete if abort */
- pending->nestLevel = GetCurrentTransactionNestLevel();
- pending->next = pendingDeletes;
- pendingDeletes = pending;
+ pendingdel->rlocator = rlocator;
+ pendingdel->backend = backend;
+ pendingdel->atCommit = false; /* delete if abort */
+ pendingdel->nestLevel = GetCurrentTransactionNestLevel();
+ pendingdel->next = pendingDeletes;
+ pendingDeletes = pendingdel;
+
+ /* drop mark files at commit */
+ pendingclean = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pendingclean->rlocator = rlocator;
+ pendingclean->op = PCOP_UNLINK_MARK;
+ pendingclean->unlink_forknum = MAIN_FORKNUM;
+ pendingclean->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pendingclean->backend = backend;
+ pendingclean->atCommit = true;
+ pendingclean->nestLevel = GetCurrentTransactionNestLevel();
+ pendingclean->next = pendingCleanups;
+ pendingCleanups = pendingclean;
}
if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
@@ -178,6 +225,204 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
return srel;
}
+/*
+ * RelationCreateInitFork
+ * Create physical storage for the init fork of a relation.
+ *
+ * Create the init fork for the relation.
+ *
+ * This function is transactional. The creation is WAL-logged, and if the
+ * transaction aborts later on, the init fork will be removed.
+ */
+void
+RelationCreateInitFork(Relation rel)
+{
+ RelFileLocator rlocator = rel->rd_locator;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ SMgrRelation srel;
+ bool create = true;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), false, false);
+
+ /*
+ * If we have a pending-unlink for the init fork of this relation, the
+ * init fork has existed since before the current transaction started.
+ * This function reverts that pending change just by removing the entry.
+ * See RelationDropInitFork.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileLocatorEquals(rlocator, pending->rlocator) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ create = false;
+ }
+ else
+ prev = pending;
+ }
+
+ if (!create)
+ return;
+
+ /* create the init fork, along with the commit-sentinel file */
+ srel = smgropen(rlocator, InvalidBackendId);
+ log_smgrcreatemark(&rlocator, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+
+ /* We don't have existing init fork, create it. */
+ smgrcreate(srel, INIT_FORKNUM, false);
+
+ /*
+ * The init fork of an index needs further initialization. ambuildempty is
+ * expected to WAL-log and sync the file by itself; for other relations we
+ * do that ourselves.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rlocator, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /* drop the init fork, mark file and revert persistence at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = rlocator;
+ pending->op = PCOP_UNLINK_FORK | PCOP_UNLINK_MARK | PCOP_SET_PERSISTENCE;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->bufpersistence = true;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* drop mark file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = rlocator;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
+/*
+ * RelationDropInitFork
+ * Delete physical storage for the init fork of a relation.
+ *
+ * Register a pending cleanup of the init fork. The real deletion is performed
+ * by smgrDoPendingCleanups at commit.
+ *
+ * This function is transactional. If the transaction aborts later on, the
+ * deletion is canceled.
+ */
+void
+RelationDropInitFork(Relation rel)
+{
+ RelFileLocator rlocator = rel->rd_locator;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ bool inxact_created = false;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), true, false);
+
+ /*
+ * If we have a pending-unlink for the init-fork of this relation, that
+ * means the init fork was created in the current transaction. We remove
+ * both the init fork and mark file immediately in that case. Otherwise
+ * register an at-commit pending-unlink for the existing init fork. See
+ * RelationCreateInitFork.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileLocatorEquals(rlocator, pending->rlocator) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ /* unlink list entry */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ inxact_created = true;
+ }
+ else
+ prev = pending;
+ }
+
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rlocator, InvalidBackendId);
+ ForkNumber forknum = INIT_FORKNUM;
+ BlockNumber firstblock = 0;
+
+ /*
+ * Some AMs initialize the INIT fork via the buffer manager. Drop all
+ * buffers for the INIT fork, then unlink the INIT fork along with the
+ * mark file.
+ */
+ DropRelationBuffers(srel, &forknum, 1, &firstblock);
+ log_smgrunlinkmark(&rlocator, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrunlinkmark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+ log_smgrunlink(&rlocator, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ return;
+ }
+
+ /* register drop of this init fork file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = rlocator;
+ pending->op = PCOP_UNLINK_FORK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = rlocator;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = false;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -197,6 +442,88 @@ log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileLocator *rlocator, ForkNumber forkNum)
+{
+ xl_smgr_unlink xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file unlink.
+ */
+ xlrec.rlocator = *rlocator;
+ xlrec.forkNum = forkNum;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK record (create action) to WAL.
+ */
+void
+log_smgrcreatemark(const RelFileLocator *rlocator, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file creation.
+ */
+ xlrec.rlocator = *rlocator;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_CREATE;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK record (unlink action) to WAL.
+ */
+void
+log_smgrunlinkmark(const RelFileLocator *rlocator, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the mark file removal.
+ */
+ xlrec.rlocator = *rlocator;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_UNLINK;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileLocator *rlocator, bool persistence)
+{
+ xl_smgr_bufpersistence xlrec;
+
+ /*
+ * Make an XLOG entry reporting the change of buffer persistence.
+ */
+ xlrec.rlocator = *rlocator;
+ xlrec.persistence = persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -711,6 +1038,95 @@ smgrDoPendingDeletes(bool isCommit)
}
}
+/*
+ * smgrDoPendingCleanups() -- Clean up work that emits WAL records
+ *
+ * The operations handled in this function emit WAL records, which must be
+ * part of the current transaction.
+ */
+void
+smgrDoPendingCleanups(bool isCommit)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (pending->nestLevel < nestLevel)
+ {
+ /* outer-level entries should not be processed yet */
+ prev = pending;
+ }
+ else
+ {
+ /* unlink list entry first, so we don't retry on failure */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ /* do cleanup if called for */
+ if (pending->atCommit == isCommit)
+ {
+ SMgrRelation srel;
+
+ srel = smgropen(pending->rlocator, pending->backend);
+
+ Assert((pending->op &
+ ~(PCOP_UNLINK_FORK | PCOP_UNLINK_MARK |
+ PCOP_SET_PERSISTENCE)) == 0);
+
+ if (pending->op & PCOP_SET_PERSISTENCE)
+ {
+ SetRelationBuffersPersistence(srel, pending->bufpersistence,
+ InRecovery);
+ }
+
+ if (pending->op & PCOP_UNLINK_FORK)
+ {
+ /*
+ * Unlink the fork file. Currently we use this only for
+ * init forks, and we're sure that the init fork is not
+ * loaded in shared buffers. In the RelationDropInitFork
+ * case, that function dropped the buffers. In the
+ * RelationCreateInitFork case, PCOP_SET_PERSISTENCE(true)
+ * is set and the buffers have been dropped just before.
+ */
+ Assert(pending->unlink_forknum == INIT_FORKNUM);
+
+ /* Don't emit WAL during recovery. */
+ if (!InRecovery)
+ log_smgrunlink(&pending->rlocator,
+ pending->unlink_forknum);
+ smgrunlink(srel, pending->unlink_forknum, false);
+ }
+
+ if (pending->op & PCOP_UNLINK_MARK)
+ {
+ SMgrRelation srel;
+
+ if (!InRecovery)
+ log_smgrunlinkmark(&pending->rlocator,
+ pending->unlink_forknum,
+ pending->unlink_mark);
+ srel = smgropen(pending->rlocator, pending->backend);
+ smgrunlinkmark(srel, pending->unlink_forknum,
+ pending->unlink_mark, InRecovery);
+ smgrclose(srel);
+ }
+ }
+
+ /* must explicitly free the list entry */
+ pfree(pending);
+ /* prev does not change */
+ }
+ }
+}
+
/*
* smgrDoPendingSyncs() -- Take care of relation syncs at end of xact.
*/
@@ -975,6 +1391,15 @@ smgr_redo(XLogReaderState *record)
reln = smgropen(xlrec->rlocator, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rlocator, InvalidBackendId);
+ smgrunlink(reln, xlrec->forkNum, true);
+ smgrclose(reln);
+ }
else if (info == XLOG_SMGR_TRUNCATE)
{
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
@@ -1067,6 +1492,124 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ bool created = false;
+
+ reln = smgropen(xlrec->rlocator, InvalidBackendId);
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ smgrcreatemark(reln, xlrec->forkNum, xlrec->mark, true);
+ created = true;
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ smgrunlinkmark(reln, xlrec->forkNum, xlrec->mark, true);
+ break;
+ default:
+ elog(ERROR, "unknown smgr_mark action \"%c\"", xlrec->action);
+ }
+
+ if (created)
+ {
+ /* revert mark file operation at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = xlrec->rlocator;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = xlrec->forkNum;
+ pending->unlink_mark = xlrec->mark;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ else
+ {
+ /*
+ * Delete pending action for this mark file if any. We should have
+ * at most one entry for this action.
+ */
+ PendingCleanup *prev = NULL;
+
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileLocatorEquals(xlrec->rlocator, pending->rlocator) &&
+ pending->unlink_forknum == xlrec->forkNum &&
+ (pending->op & PCOP_UNLINK_MARK) != 0)
+ {
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+ }
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ PendingCleanup *prev = NULL;
+
+ reln = smgropen(xlrec->rlocator, InvalidBackendId);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+
+ /*
+ * Delete pending action for persistence change if any. We should have
+ * at most one entry for this action.
+ */
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileLocatorEquals(xlrec->rlocator, pending->rlocator) &&
+ (pending->op & PCOP_SET_PERSISTENCE) != 0)
+ {
+ Assert(pending->bufpersistence == xlrec->persistence);
+
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+
+ /*
+ * Revert buffer-persistence changes at abort if the relation is going
+ * to a different persistence from the one before this transaction.
+ */
+ if (!pending)
+ {
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = xlrec->rlocator;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = !xlrec->persistence;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
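For readers following the list handling in smgrDoPendingCleanups(), here is a minimal standalone sketch of the same traversal pattern: entries belonging to outer nest levels are kept, all others are unlinked before any work is done, and the action runs only when atCommit matches the transaction outcome. All names (SketchPending, sketch_push, sketch_process) are hypothetical and not part of the patch; the "action" is reduced to a counter.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for PendingCleanup; not the patch's struct. */
typedef struct SketchPending
{
    int         id;         /* illustrative payload */
    int         nestLevel;  /* xact nesting level of the request */
    bool        atCommit;   /* run at commit (true) or abort (false) */
    struct SketchPending *next;
} SketchPending;

static SketchPending *sketch_list = NULL;

/* Push a new entry at the head of the list. */
static void
sketch_push(int id, int nestLevel, bool atCommit)
{
    SketchPending *p = malloc(sizeof(SketchPending));

    if (p == NULL)
        abort();
    p->id = id;
    p->nestLevel = nestLevel;
    p->atCommit = atCommit;
    p->next = sketch_list;
    sketch_list = p;
}

/*
 * Process entries at or above nestLevel. Returns how many entries had
 * their action "executed" (atCommit == isCommit). Entries of outer
 * levels stay in the list.
 */
static int
sketch_process(int nestLevel, bool isCommit)
{
    SketchPending *prev = NULL;
    SketchPending *pending;
    SketchPending *next;
    int         executed = 0;

    for (pending = sketch_list; pending != NULL; pending = next)
    {
        next = pending->next;
        if (pending->nestLevel < nestLevel)
        {
            /* outer-level entry: keep it and advance prev */
            prev = pending;
            continue;
        }

        /* unlink first, so a failure in the action is not retried */
        if (prev)
            prev->next = next;
        else
            sketch_list = next;

        if (pending->atCommit == isCommit)
            executed++;

        free(pending);
        /* prev does not change */
    }
    return executed;
}
```

Unlinking before acting mirrors the comment in the patch: if the action raised an error, the entry would already be gone from the list and would not be retried.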
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 1b8e6d5729..1779959410 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -55,6 +55,7 @@
#include "commands/defrem.h"
#include "commands/event_trigger.h"
#include "commands/policy.h"
+#include "commands/progress.h"
#include "commands/sequence.h"
#include "commands/tablecmds.h"
#include "commands/tablespace.h"
@@ -5407,6 +5408,187 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
return newcmd;
}
+/*
+ * RelationChangePersistence: do in-place persistence change of a relation
+ */
+static void
+RelationChangePersistence(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ int i;
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * If any of the following were set, we would need to call ATRewriteTable;
+ * that cannot happen in the AT_REWRITE_ALTER_PERSISTENCE case.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ /*
+ * First we collect all relations whose persistence we need to change.
+ */
+
+ /* Collect OIDs of indexes and toast relations */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ /* Add toast relation if any */
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ List *toastidx;
+ Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
+
+ relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
+ toastidx = RelationGetIndexList(toastrel);
+ relids = list_concat(relids, toastidx);
+ list_free(toastidx);
+ table_close(toastrel, NoLock);
+ }
+
+ table_close(rel, NoLock);
+
+ /* Make changes in storage */
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach(lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+
+ /*
+ * XXXX: Some access methods do not tolerate an in-place persistence
+ * change. Specifically, GiST uses page LSNs to figure out whether a
+ * block has changed, where UNLOGGED GiST indexes use fake LSNs that
+ * are incompatible with real LSNs used for LOGGED ones.
+ *
+ * Maybe if gistGetFakeLSN behaved the same way for permanent and
+ * unlogged indexes, we could skip index rebuild in exchange for some
+ * extra WAL records emitted while it is unlogged.
+ *
+ * Check relam against a positive list so that we take the hard way for
+ * unknown AMs.
+ */
+ if (r->rd_rel->relkind == RELKIND_INDEX &&
+ /* GiST is excluded */
+ r->rd_rel->relam != BTREE_AM_OID &&
+ r->rd_rel->relam != HASH_AM_OID &&
+ r->rd_rel->relam != GIN_AM_OID &&
+ r->rd_rel->relam != SPGIST_AM_OID &&
+ r->rd_rel->relam != BRIN_AM_OID)
+ {
+ int reindex_flags;
+ ReindexParams params = {0};
+
+ /* reindex doesn't allow concurrent use of the index */
+ table_close(r, NoLock);
+
+ reindex_flags =
+ REINDEX_REL_SUPPRESS_INDEX_USE |
+ REINDEX_REL_CHECK_CONSTRAINTS;
+
+ /* Set the same persistence as the parent relation. */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+
+ reindex_index(reloid, reindex_flags, persistence, &params);
+
+ continue;
+ }
+
+ /* Create or drop init fork */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ RelationCreateInitFork(r);
+ else
+ RelationDropInitFork(r);
+
+ /*
+ * When this relation gets WAL-logged, immediately sync all files except
+ * the init fork to establish the initial state on storage. Buffers have
+ * already been flushed out by RelationCreate(Drop)InitFork called just
+ * above. The init fork should have been synced as needed.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT)
+ {
+ for (i = 0; i < INIT_FORKNUM; i++)
+ {
+ if (smgrexists(RelationGetSmgr(r), i))
+ smgrimmedsync(RelationGetSmgr(r), i);
+ }
+ }
+
+ /* Update catalog */
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+
+ /*
+ * While wal_level >= replica, switching to LOGGED requires the
+ * relation content to be WAL-logged to recover the table. We don't
+ * emit this if wal_level = minimal.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded())
+ {
+ ForkNumber fork;
+ xl_smgr_truncate xlrec;
+
+ xlrec.blkno = 0;
+ xlrec.rlocator = r->rd_locator;
+ xlrec.flags = SMGR_TRUNCATE_ALL;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ for (fork = 0; fork < INIT_FORKNUM; fork++)
+ {
+ if (smgrexists(RelationGetSmgr(r), fork))
+ log_newpage_range(r, fork, 0,
+ smgrnblocks(RelationGetSmgr(r), fork),
+ false);
+ }
+ }
+
+ table_close(r, NoLock);
+ }
+
+ table_close(classRel, NoLock);
+}
+
/*
* ATRewriteTables: ALTER TABLE phase 3
*/
@@ -5537,48 +5719,55 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- /*
- * Create transient table that will receive the modified data.
- *
- * Ensure it is marked correctly as logged or unlogged. We have
- * to do this here so that buffers for the new relfilenumber will
- * have the right persistence set, and at the same time ensure
- * that the original filenumbers's buffers will get read in with
- * the correct setting (i.e. the original one). Otherwise a
- * rollback after the rewrite would possibly result with buffers
- * for the original filenumbers having the wrong persistence
- * setting.
- *
- * NB: This relies on swap_relation_files() also swapping the
- * persistence. That wouldn't work for pg_class, but that can't be
- * unlogged anyway.
- */
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
- persistence, lockmode);
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
+ RelationChangePersistence(tab, persistence, lockmode);
+ else
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new
+ * relfilenumber will have the right persistence set, and at
+ * the same time ensure that the original filenumbers's buffers
+ * will get read in with the correct setting (i.e. the original
+ * one). Otherwise a rollback after the rewrite would possibly
+ * result with buffers for the original filenumbers having the
+ * wrong persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that
+ * can't be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace,
+ NewAccessMethod,
+ persistence, lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
+ /*
+ * Swap the physical files of the old and new heaps, then
+ * rebuild indexes and discard the old heap. We can use
+ * RecentXmin for the table's new relfrozenxid because we
+ * rewrote all the tuples in ATRewriteTable, so no older Xid
+ * remains in the table. Also, we never try to swap toast
+ * tables by content, since we have no interest in letting
+ * this code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
- InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ }
}
else if (tab->rewrite > 0 && tab->relkind == RELKIND_SEQUENCE)
{
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 5b0e531f97..796cb139c3 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3172,6 +3172,91 @@ DropRelationBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
}
}
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a relation
+ * and, when switching to PERMANENT, writes all dirty pages of the relation
+ * out to disk (or more accurately, out to kernel disk buffers),
+ * ensuring that the kernel has an up-to-date view of the relation.
+ *
+ * Generally, the caller should be holding AccessExclusiveLock on the
+ * target relation to ensure that no other backend is busy dirtying
+ * more blocks of the relation; the effects can't be expected to last
+ * after the lock is released.
+ *
+ * XXX currently it sequentially searches the buffer pool, should be
+ * changed to more clever ways of searching. This routine is not
+ * used in any performance-critical code paths, so it's not worth
+ * adding additional overhead to normal paths to make it go faster;
+ * but see also DropRelFileLocatorBuffers.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
+{
+ int i;
+ RelFileLocatorBackend rlocator = srel->smgr_rlocator;
+
+ Assert(!RelFileLocatorBackendIsTemp(rlocator));
+
+ if (!isRedo)
+ log_smgrbufpersistence(&srel->smgr_rlocator.locator, permanent);
+
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ if (!RelFileLocatorEquals(bufHdr->tag.rlocator, rlocator.locator))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (!RelFileLocatorEquals(bufHdr->tag.rlocator, rlocator.locator))
+ {
+ UnlockBufHdr(bufHdr, buf_state);
+ continue;
+ }
+
+ if (permanent)
+ {
+ /* Init fork is being dropped, drop buffers for it. */
+ if (bufHdr->tag.forkNum == INIT_FORKNUM)
+ {
+ InvalidateBuffer(bufHdr);
+ continue;
+ }
+
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /* we flush this buffer when switching to PERMANENT */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, srel);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr, true);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ /* There shouldn't be an init fork */
+ Assert(bufHdr->tag.forkNum != INIT_FORKNUM);
+ /* clear BM_PERMANENT; UnlockBufHdr stores the updated state */
+ buf_state &= ~BM_PERMANENT;
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ }
+}
+
/* ---------------------------------------------------------------------
* DropRelationsAllBuffers
*
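As a compact illustration of the per-buffer decision in SetRelationBuffersPersistence(), the sketch below models the state word as plain bit flags: switching to permanent sets the flag and reports whether the buffer needs a flush (valid and dirty), while switching to unlogged clears the flag. The flag names and the function are hypothetical, header locking and pinning are omitted, and clearing the flag in the unlogged direction is an assumption about the intended behavior ("flip BM_PERMANENT of active buffers"), not something the hunk above shows.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits; the real BM_* values live in buf_internals.h. */
#define SK_VALID     (1u << 0)
#define SK_DIRTY     (1u << 1)
#define SK_PERMANENT (1u << 2)

/*
 * Flip the persistence flag in *state. Returns true when the caller
 * should flush the buffer, which is only the case when switching to
 * permanent and the buffer is both valid and dirty.
 */
static bool
sketch_set_persistence(uint32_t *state, bool permanent)
{
    if (permanent)
    {
        *state |= SK_PERMANENT;
        return (*state & (SK_VALID | SK_DIRTY)) == (SK_VALID | SK_DIRTY);
    }

    /* assumed behavior: unlogged buffers lose the permanent flag */
    *state &= ~SK_PERMANENT;
    return false;
}
```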
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 073dab2be5..c5f577a694 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -347,8 +347,6 @@ static void pre_sync_fname(const char *fname, bool isdir, int elevel);
static void datadir_fsync_fname(const char *fname, bool isdir, int elevel);
static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel);
-static int fsync_parent_path(const char *fname, int elevel);
-
/*
* pg_fsync --- do fsync with or without writethrough
@@ -3670,7 +3668,7 @@ fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel)
* This is aimed at making file operations persistent on disk in case of
* an OS crash or power failure.
*/
-static int
+int
fsync_parent_path(const char *fname, int elevel)
{
char parentpath[MAXPGPATH];
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index c3faa68126..ea5d6bbba1 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -16,29 +16,49 @@
#include <unistd.h>
+#include "access/xlogrecovery.h"
+#include "catalog/pg_tablespace_d.h"
#include "common/relpath.h"
#include "postmaster/startup.h"
+#include "storage/bufmgr.h"
#include "storage/copydir.h"
#include "storage/fd.h"
+#include "storage/md.h"
#include "storage/reinit.h"
+#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
- int op);
+ Oid tspid, int op);
static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
- int op);
+ Oid tspid, Oid dbid, int op);
typedef struct
{
RelFileNumber relnumber; /* hash key */
-} unlogged_relation_entry;
+ bool has_init; /* has INIT fork */
+ bool dirty_init; /* needs to remove INIT fork */
+ bool dirty_all; /* needs to remove all forks */
+} relfile_entry;
/*
- * Reset unlogged relations from before the last restart.
+ * Clean up and reset relation files from before the last restart.
*
- * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any
- * relation with an "init" fork, except for the "init" fork itself.
+ * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations
+ * depending on the existence of the "cleanup" forks.
+ *
+ * If SMGR_MARK_UNCOMMITTED mark file for init fork is present, we remove the
+ * init fork along with the mark file.
+ *
+ * If SMGR_MARK_UNCOMMITTED mark file for main fork is present we remove the
+ * whole relation along with the mark file.
+ *
+ * Otherwise, if the "init" fork is found, we remove all forks of any relation
+ * with the "init" fork, except for the "init" fork itself.
+ *
+ * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all
+ * relations that have the "cleanup" and/or the "init" forks.
*
* If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main
* fork.
@@ -72,7 +92,7 @@ ResetUnloggedRelations(int op)
/*
* First process unlogged files in pg_default ($PGDATA/base)
*/
- ResetUnloggedRelationsInTablespaceDir("base", op);
+ ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op);
/*
* Cycle through directories for all non-default tablespaces.
@@ -81,13 +101,19 @@ ResetUnloggedRelations(int op)
while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
{
+ Oid tspid;
+
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
continue;
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
- ResetUnloggedRelationsInTablespaceDir(temp_path, op);
+
+ tspid = atooid(spc_de->d_name);
+
+ Assert(tspid != 0);
+ ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op);
}
FreeDir(spc_dir);
@@ -103,7 +129,8 @@ ResetUnloggedRelations(int op)
* Process one tablespace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
+ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
+ Oid tspid, int op)
{
DIR *ts_dir;
struct dirent *de;
@@ -130,6 +157,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
while ((de = ReadDir(ts_dir, tsdirname)) != NULL)
{
+ Oid dbid;
+
/*
* We're only interested in the per-database directories, which have
* numeric names. Note that this code will also (properly) ignore "."
@@ -148,7 +177,10 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
ereport_startup_progress("resetting unlogged relations (cleanup), elapsed time: %ld.%02d s, current path: %s",
dbspace_path);
- ResetUnloggedRelationsInDbspaceDir(dbspace_path, op);
+ dbid = atooid(de->d_name);
+ Assert(dbid != 0);
+
+ ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op);
}
FreeDir(ts_dir);
@@ -158,125 +190,228 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
* Process one per-dbspace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
+ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
+ Oid tspid, Oid dbid, int op)
{
DIR *dbspace_dir;
struct dirent *de;
char rm_path[MAXPGPATH * 2];
+ HTAB *hash;
+ HASHCTL ctl;
/* Caller must specify at least one operation. */
- Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0);
+ Assert((op & (UNLOGGED_RELATION_CLEANUP |
+ UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_INIT)) != 0);
/*
* Cleanup is a two-pass operation. First, we go through and identify all
* the files with init forks. Then, we go through again and nuke
* everything with the same OID except the init fork.
*/
- if ((op & UNLOGGED_RELATION_CLEANUP) != 0)
+
+ /*
+ * It's possible that someone could create tons of unlogged relations in
+ * the same database & tablespace, so we'd better use a hash table rather
+ * than an array or linked list to keep track of which files need to be
+ * reset. Otherwise, this cleanup operation would be O(n^2).
+ */
+ memset(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(RelFileNumber);
+ ctl.entrysize = sizeof(relfile_entry);
+ hash = hash_create("unlogged relation RelFileNumbers",
+ 32, &ctl, HASH_ELEM | HASH_BLOBS);
+
+ /* Collect INIT fork and mark files in the directory. */
+ dbspace_dir = AllocateDir(dbspacedirname);
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
- HTAB *hash;
- HASHCTL ctl;
+ ForkNumber forkNum;
+ int relnumchars;
+ StorageMarks mark;
- /*
- * It's possible that someone could create a ton of unlogged relations
- * in the same database & tablespace, so we'd better use a hash table
- * rather than an array or linked list to keep track of which files
- * need to be reset. Otherwise, this cleanup operation would be
- * O(n^2).
- */
- ctl.keysize = sizeof(RelFileNumber);
- ctl.entrysize = sizeof(unlogged_relation_entry);
- ctl.hcxt = CurrentMemoryContext;
- hash = hash_create("unlogged relation RelFileNumbers", 32, &ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ /* Skip anything that doesn't look like a relation data file. */
+ if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
+ &forkNum, &mark))
+ continue;
- /* Scan the directory. */
- dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
+ if (forkNum == INIT_FORKNUM || mark == SMGR_MARK_UNCOMMITTED)
{
- ForkNumber forkNum;
- int relnumchars;
- unlogged_relation_entry ent;
-
- /* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
- continue;
-
- /* Also skip it unless this is the init fork. */
- if (forkNum != INIT_FORKNUM)
- continue;
+ RelFileNumber key;
+ relfile_entry *ent;
+ bool found;
/*
* Put the RELFILENUMBER portion of the name into the hash table,
- * if it isn't already.
+ * if it isn't already. If it has SMGR_MARK_UNCOMMITTED mark
+ * files, the storage file is in a dirty state and needs
+ * cleanup.
*/
- ent.relnumber = atorelnumber(de->d_name);
- (void) hash_search(hash, &ent, HASH_ENTER, NULL);
+ key = atorelnumber(de->d_name);
+ ent = hash_search(hash, &key, HASH_ENTER, &found);
+
+ if (!found)
+ {
+ ent->has_init = false;
+ ent->dirty_init = false;
+ ent->dirty_all = false;
+ }
+
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_init = true;
+ else if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_all = true;
+ else
+ {
+ Assert(forkNum == INIT_FORKNUM);
+ ent->has_init = true;
+ }
}
+ }
- /* Done with the first pass. */
- FreeDir(dbspace_dir);
+ /* Done with the first pass. */
+ FreeDir(dbspace_dir);
+ /* nothing to do if we have neither init nor cleanup forks */
+ if (hash_get_num_entries(hash) < 1)
+ {
+ hash_destroy(hash);
+ return;
+ }
+
+ if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0)
+ {
/*
- * If we didn't find any init forks, there's no point in continuing;
- * we can bail out now.
+ * When we come here after recovery, an smgr object for this file might
+ * have been created. In that case we need to drop all buffers and then
+ * the smgr object before initializing the unlogged relation. This is
+ * safe as long as no other backends have accessed the relation before
+ * starting archive recovery.
*/
- if (hash_get_num_entries(hash) == 0)
+ HASH_SEQ_STATUS status;
+ relfile_entry *ent;
+ SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8);
+ int maxrels = 8;
+ int nrels = 0;
+ int i;
+
+ Assert(!HotStandbyActive());
+
+ hash_seq_init(&status, hash);
+ while ((ent = (relfile_entry *) hash_seq_search(&status)) != NULL)
{
- hash_destroy(hash);
- return;
+ RelFileLocatorBackend rel;
+
+ /*
+ * The relation is persistent and stays persistent. Don't drop the
+ * buffers for this relation.
+ */
+ if (ent->has_init && ent->dirty_init)
+ continue;
+
+ if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ rel.backend = InvalidBackendId;
+ rel.locator.spcOid = tspid;
+ rel.locator.dbOid = dbid;
+ rel.locator.relNumber = ent->relnumber;
+
+ srels[nrels++] = smgropen(rel.locator, InvalidBackendId);
}
- /*
- * Now, make a second pass and remove anything that matches.
- */
+ DropRelationsAllBuffers(srels, nrels);
+
+ for (i = 0; i < nrels; i++)
+ smgrclose(srels[i]);
+ }
+
+ /*
+ * Now, make a second pass and remove anything that matches.
+ */
+ if ((op & UNLOGGED_RELATION_CLEANUP) != 0)
+ {
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int relnumchars;
- unlogged_relation_entry ent;
+ RelFileNumber key;
+ relfile_entry *ent;
+ RelFileLocatorBackend rel;
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
- continue;
-
- /* We never remove the init fork. */
- if (forkNum == INIT_FORKNUM)
+ &forkNum, &mark))
continue;
/*
* See whether the RELFILENUMBER portion of the name shows up in
* the hash table. If so, nuke it!
*/
- ent.relnumber = atorelnumber(de->d_name);
- if (hash_search(hash, &ent, HASH_FIND, NULL))
+ key = atorelnumber(de->d_name);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);
+
+ if (!ent)
+ continue;
+
+ if (!ent->dirty_all)
{
- snprintf(rm_path, sizeof(rm_path), "%s/%s",
- dbspacedirname, de->d_name);
- if (unlink(rm_path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m",
- rm_path)));
+ /* clean permanent relations don't need cleanup */
+ if (!ent->has_init)
+ continue;
+
+ if (ent->dirty_init)
+ {
+ /*
+ * The crashed transaction did SET UNLOGGED. This relation
+ * is restored to a LOGGED relation.
+ */
+ if (forkNum != INIT_FORKNUM)
+ continue;
+ }
else
- elog(DEBUG2, "unlinked file \"%s\"", rm_path);
+ {
+ /*
+ * We don't remove the INIT fork of a non-dirty
+ * relation.
+ */
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE)
+ continue;
+ }
}
+
+ /* so, nuke it! */
+ snprintf(rm_path, sizeof(rm_path), "%s/%s",
+ dbspacedirname, de->d_name);
+ if (unlink(rm_path) < 0)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ rm_path));
+
+ rel.backend = InvalidBackendId;
+ rel.locator.spcOid = tspid;
+ rel.locator.dbOid = dbid;
+ rel.locator.relNumber = key;
+
+ ForgetRelationForkSyncRequests(rel, forkNum);
}
/* Cleanup is complete. */
FreeDir(dbspace_dir);
- hash_destroy(hash);
}
+ hash_destroy(hash);
+ hash = NULL;
+
/*
* Initialization happens after cleanup is complete: we copy each init
- * fork file to the corresponding main fork file. Note that if we are
- * asked to do both cleanup and init, we may never get here: if the
- * cleanup code determines that there are no init forks in this dbspace,
- * it will return before we get to this point.
+ * fork file to the corresponding main fork file.
*/
if ((op & UNLOGGED_RELATION_INIT) != 0)
{
@@ -285,6 +420,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int relnumchars;
char relnumbuf[RELNUMBERCHARS + 1];
char srcpath[MAXPGPATH * 2];
@@ -292,9 +428,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -328,15 +466,18 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int relnumchars;
char relnumbuf[RELNUMBERCHARS + 1];
char mainpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -379,7 +520,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
*/
bool
parse_filename_for_nontemp_relation(const char *name, int *relnumchars,
- ForkNumber *fork)
+ ForkNumber *fork, StorageMarks *mark)
{
int pos;
@@ -410,11 +551,19 @@ parse_filename_for_nontemp_relation(const char *name, int *relnumchars,
for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar)
;
- if (segchar <= 1)
- return false;
- pos += segchar;
+ if (segchar > 1)
+ pos += segchar;
}
+ /* mark file? */
+ if (name[pos] == '.' && name[pos + 1] != 0)
+ {
+ *mark = name[pos + 1];
+ pos += 2;
+ }
+ else
+ *mark = SMGR_MARK_NONE;
+
/* Now we should be at the end. */
if (name[pos] != '\0')
return false;
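To make the extended file-name grammar easier to follow, here is a simplified standalone sketch of a parser that accepts "<relnumber>[_fork][.segno][.mark]", where the mark is a single trailing character. It is not the patch's parse_filename_for_nontemp_relation(); the type and function names are hypothetical, only three fork suffixes are modeled, and the mark character 'u' used in the usage note is an assumption for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for ForkNumber and SMGR_MARK_NONE. */
typedef enum { SK_MAIN = 0, SK_FSM, SK_VM, SK_INIT } SketchFork;
#define SK_MARK_NONE 0

/*
 * Parse "<digits>[_fsm|_vm|_init][.<segno>][.<markchar>]".
 * Returns true and fills the outputs only if the whole name matches.
 */
static bool
sketch_parse_relname(const char *name, int *relnumchars,
                     SketchFork *fork, char *mark)
{
    static const char *const suffix[] = {NULL, "fsm", "vm", "init"};
    int         pos = 0;

    assert(name != NULL);

    /* leading relation number */
    while (isdigit((unsigned char) name[pos]))
        pos++;
    if (pos == 0)
        return false;
    *relnumchars = pos;

    /* optional fork suffix */
    *fork = SK_MAIN;
    if (name[pos] == '_')
    {
        int         f;

        for (f = SK_FSM; f <= SK_INIT; f++)
        {
            size_t      len = strlen(suffix[f]);

            if (strncmp(name + pos + 1, suffix[f], len) == 0)
            {
                *fork = (SketchFork) f;
                pos += 1 + (int) len;
                break;
            }
        }
        if (f > SK_INIT)
            return false;       /* unknown fork suffix */
    }

    /* optional segment number: '.' followed by at least one digit */
    if (name[pos] == '.' && isdigit((unsigned char) name[pos + 1]))
    {
        pos++;
        while (isdigit((unsigned char) name[pos]))
            pos++;
    }

    /* optional single-character mark: '.' followed by one character */
    *mark = SK_MARK_NONE;
    if (name[pos] == '.' && name[pos + 1] != '\0')
    {
        *mark = name[pos + 1];
        pos += 2;
    }

    /* now we should be at the end */
    return name[pos] == '\0';
}
```

With this sketch, "16384_init.u" parses as relation number length 5, the init fork, and mark 'u', while "16384.ux" is rejected because of the trailing character after the mark.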
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index bed47f07d7..d628461970 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -141,7 +141,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forknum,
BlockNumber blkno, bool skipFsync, int behavior);
static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
MdfdVec *seg);
-
+static bool mdmarkexists(SMgrRelation reln, ForkNumber forkNum,
+ StorageMarks mark);
/*
* mdinit() -- Initialize private state for magnetic disk storage manager.
@@ -173,6 +174,82 @@ mdexists(SMgrRelation reln, ForkNumber forknum)
return (mdopenfork(reln, forknum, EXTENSION_RETURN_NULL) != NULL);
}
+/*
+ * mdcreatemark() -- Create a mark file.
+ *
+ * If isRedo is true, it's okay for the file to exist already.
+ */
+void
+mdcreatemark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rlocator, forkNum, mark);
+ int fd;
+
+ /* See mdcreate for details. */
+ TablespaceCreateDbspace(reln->smgr_rlocator.locator.spcOid,
+ reln->smgr_rlocator.locator.dbOid,
+ isRedo);
+
+ fd = BasicOpenFile(path, O_WRONLY | O_CREAT | O_EXCL);
+ if (fd < 0 && (!isRedo || errno != EEXIST))
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not create mark file \"%s\": %m", path));
+
+ /* In redo the file may already exist, in which case fd is invalid. */
+ if (fd >= 0)
+ {
+ pg_fsync(fd);
+ close(fd);
+ }
+
+ /*
+ * To guarantee that the creation of the file is persistent, fsync its
+ * parent directory.
+ */
+ fsync_parent_path(path, ERROR);
+
+ pfree(path);
+}
+
+
+/*
+ * mdunlinkmark() -- Delete the mark file
+ *
+ * If isRedo is true, it's okay if the file is not found.
+ */
+void
+mdunlinkmark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rlocator, forkNum, mark);
+
+ if (!isRedo || mdmarkexists(reln, forkNum, mark))
+ durable_unlink(path, ERROR);
+
+ pfree(path);
+}
+
+/*
+ * mdmarkexists() -- Check if the file exists.
+ */
+static bool
+mdmarkexists(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark)
+{
+ char *path = markpath(reln->smgr_rlocator, forkNum, mark);
+ int fd;
+
+ fd = BasicOpenFile(path, O_RDONLY);
+ if (fd < 0 && errno != ENOENT)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not access mark file \"%s\": %m", path));
+ pfree(path);
+
+ if (fd < 0)
+ return false;
+
+ close(fd);
+ return true;
+}
+
/*
* mdcreate() -- Create a new relation on magnetic disk.
*
@@ -1056,6 +1133,16 @@ register_forget_request(RelFileLocatorBackend rlocator, ForkNumber forknum,
RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ );
}
+/*
+ * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork
+ */
+void
+ForgetRelationForkSyncRequests(RelFileLocatorBackend rlocator,
+ ForkNumber forknum)
+{
+ register_forget_request(rlocator, forknum, 0);
+}
+
/*
* ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB
*/
@@ -1414,12 +1501,14 @@ mdsyncfiletag(const FileTag *ftag, char *path)
* Return 0 on success, -1 on failure, with errno set.
*/
int
-mdunlinkfiletag(const FileTag *ftag, char *path)
+mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark)
{
char *p;
/* Compute the path. */
- p = relpathperm(ftag->rlocator, MAIN_FORKNUM);
+ p = GetRelationPath(ftag->rlocator.dbOid, ftag->rlocator.spcOid,
+ ftag->rlocator.relNumber, InvalidBackendId,
+ MAIN_FORKNUM, mark);
strlcpy(path, p, MAXPGPATH);
pfree(p);
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index ed46ac3f44..c7353d3dcf 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -62,6 +62,10 @@ typedef struct f_smgr
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+ void (*smgr_createmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+ void (*smgr_unlinkmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
} f_smgr;
static const f_smgr smgrsw[] = {
@@ -82,6 +86,8 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
+ .smgr_createmark = mdcreatemark,
+ .smgr_unlinkmark = mdunlinkmark,
}
};
@@ -371,6 +377,26 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo);
}
+/*
+ * smgrcreatemark() -- Create a mark file
+ */
+void
+smgrcreatemark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_createmark(reln, forknum, mark, isRedo);
+}
+
+/*
+ * smgrunlinkmark() -- Delete a mark file
+ */
+void
+smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlinkmark(reln, forknum, mark, isRedo);
+}
+
/*
* smgrdosyncall() -- Immediately sync all forks of all given relations
*
@@ -693,6 +719,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rlocator, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 9d6a9e9109..74357efb1c 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -91,7 +91,8 @@ static CycleCtr checkpoint_cycle_ctr = 0;
typedef struct SyncOps
{
int (*sync_syncfiletag) (const FileTag *ftag, char *path);
- int (*sync_unlinkfiletag) (const FileTag *ftag, char *path);
+ int (*sync_unlinkfiletag) (const FileTag *ftag, char *path,
+ StorageMarks mark);
bool (*sync_filetagmatches) (const FileTag *ftag,
const FileTag *candidate);
} SyncOps;
@@ -235,7 +236,8 @@ SyncPostCheckpoint(void)
/* Unlink the file */
if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
- path) < 0)
+ path,
+ SMGR_MARK_NONE) < 0)
{
/*
* There's a race condition, when the database is dropped at the
@@ -244,6 +246,21 @@ SyncPostCheckpoint(void)
* here. rmtree() also has to ignore ENOENT errors, to deal with
* the possibility that we delete the file first.
*/
+ if (errno != ENOENT)
+ ereport(WARNING,
+ errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path));
+ }
+ else if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
+ path,
+ SMGR_MARK_UNCOMMITTED)
+ < 0)
+ {
+ /*
+ * We may also have an SMGR_MARK_UNCOMMITTED file. Remove it if the
+ * fork file has been successfully removed. It's ok if the file
+ * does not exist.
+ */
if (errno != ENOENT)
ereport(WARNING,
(errcode_for_file_access(),
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 53f011a2fe..4da62f71ce 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -418,6 +418,28 @@ extractPageInfo(XLogReaderState *record)
* source system.
*/
}
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_UNLINK)
+ {
+ /*
+ * We can safely ignore these. We'll see that the files don't exist in
+ * the target data dir, and copy them in from the source system. No
+ * need to do anything special here.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_MARK)
+ {
+ /*
+ * We can safely ignore these. The file will be removed from the
+ * target if it doesn't exist in the source system. The files are
+ * empty, so we don't need to bother with their content.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ /*
+ * We can safely ignore these. These don't make any on-disk changes.
+ */
+ }
else if (rmid == RM_XACT_ID &&
((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT ||
(rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED ||
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 3cd77c09b1..f87dfda8fe 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -419,7 +419,6 @@ main(int argc, char **argv)
if (showprogress)
pg_log_info("reading source file list");
source->traverse_files(source, &process_source_file);
-
if (showprogress)
pg_log_info("reading target file list");
traverse_datadir(datadir_target, &process_target_file);
diff --git a/src/common/relpath.c b/src/common/relpath.c
index d0d83e593b..f4f7435cf9 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -139,9 +139,15 @@ GetDatabasePath(Oid dbOid, Oid spcOid)
*/
char *
GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
- int backendId, ForkNumber forkNumber)
+ int backendId, ForkNumber forkNumber, char mark)
{
char *path;
+ char markstr[4];
+
+ if (mark == 0)
+ markstr[0] = 0;
+ else
+ snprintf(markstr, sizeof(markstr), ".%c", mark);
if (spcOid == GLOBALTABLESPACE_OID)
{
@@ -149,10 +155,10 @@ GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
Assert(dbOid == 0);
Assert(backendId == InvalidBackendId);
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("global/" UINT64_FORMAT "_%s",
- relNumber, forkNames[forkNumber]);
+ path = psprintf("global/" UINT64_FORMAT "_%s%s",
+ relNumber, forkNames[forkNumber], markstr);
else
- path = psprintf("global/" UINT64_FORMAT, relNumber);
+ path = psprintf("global/" UINT64_FORMAT "%s", relNumber, markstr);
}
else if (spcOid == DEFAULTTABLESPACE_OID)
{
@@ -160,22 +166,22 @@ GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/" UINT64_FORMAT "_%s",
+ path = psprintf("base/%u/" UINT64_FORMAT "_%s%s",
dbOid, relNumber,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/" UINT64_FORMAT,
- dbOid, relNumber);
+ path = psprintf("base/%u/" UINT64_FORMAT "%s",
+ dbOid, relNumber, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/t%d_" UINT64_FORMAT "_%s",
+ path = psprintf("base/%u/t%d_" UINT64_FORMAT "_%s%s",
dbOid, backendId, relNumber,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/t%d_" UINT64_FORMAT,
- dbOid, backendId, relNumber);
+ path = psprintf("base/%u/t%d_" UINT64_FORMAT "%s",
+ dbOid, backendId, relNumber, markstr);
}
}
else
@@ -184,27 +190,28 @@ GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/" UINT64_FORMAT "_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/" UINT64_FORMAT "_%s%s",
spcOid, TABLESPACE_VERSION_DIRECTORY,
dbOid, relNumber,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/" UINT64_FORMAT,
+ path = psprintf("pg_tblspc/%u/%s/%u/" UINT64_FORMAT "%s",
spcOid, TABLESPACE_VERSION_DIRECTORY,
- dbOid, relNumber);
+ dbOid, relNumber, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_" UINT64_FORMAT "_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_" UINT64_FORMAT "_%s%s",
spcOid, TABLESPACE_VERSION_DIRECTORY,
dbOid, backendId, relNumber,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_" UINT64_FORMAT,
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_" UINT64_FORMAT "%s",
spcOid, TABLESPACE_VERSION_DIRECTORY,
- dbOid, backendId, relNumber);
+ dbOid, backendId, relNumber, markstr);
}
}
+
return path;
}
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 9964c312aa..ee4179699a 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -25,6 +25,8 @@ extern PGDLLIMPORT int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileLocator rlocator,
char relpersistence,
bool register_delete);
+extern void RelationCreateInitFork(Relation rel);
+extern void RelationDropInitFork(Relation rel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileLocator rlocator, bool atCommit);
extern void RelationPreTruncate(Relation rel);
@@ -43,6 +45,7 @@ extern void RestorePendingSyncs(char *startAddress);
extern void smgrDoPendingDeletes(bool isCommit);
extern void smgrDoPendingSyncs(bool isCommit, bool isParallelWorker);
extern int smgrGetPendingDeletes(bool forCommit, RelFileLocator **ptr);
+extern void smgrDoPendingCleanups(bool isCommit);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index 44a5e2043b..9f48fb5e6f 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -18,17 +18,23 @@
#include "lib/stringinfo.h"
#include "storage/block.h"
#include "storage/relfilelocator.h"
+#include "storage/smgr.h"
/*
* Declarations for smgr-related XLOG records
*
- * Note: we log file creation and truncation here, but logging of deletion
- * actions is handled by xact.c, because it is part of transaction commit.
+ * Note: we log file creation, truncation and buffer persistence changes here,
+ * but logging of deletion actions is handled mainly by xact.c, because it is
+ * part of transaction commit in most cases. However, there is a case where
+ * init forks are deleted outside the control of a transaction.
*/
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_UNLINK 0x30
+#define XLOG_SMGR_MARK 0x40
+#define XLOG_SMGR_BUFPERSISTENCE 0x50
typedef struct xl_smgr_create
{
@@ -36,6 +42,32 @@ typedef struct xl_smgr_create
ForkNumber forkNum;
} xl_smgr_create;
+typedef struct xl_smgr_unlink
+{
+ RelFileLocator rlocator;
+ ForkNumber forkNum;
+} xl_smgr_unlink;
+
+typedef enum smgr_mark_action
+{
+ XLOG_SMGR_MARK_CREATE = 'c',
+ XLOG_SMGR_MARK_UNLINK = 'u'
+} smgr_mark_action;
+
+typedef struct xl_smgr_mark
+{
+ RelFileLocator rlocator;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ smgr_mark_action action;
+} xl_smgr_mark;
+
+typedef struct xl_smgr_bufpersistence
+{
+ RelFileLocator rlocator;
+ bool persistence;
+} xl_smgr_bufpersistence;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +83,13 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileLocator *rlocator, ForkNumber forkNum);
+extern void log_smgrcreatemark(const RelFileLocator *rlocator,
+ ForkNumber forkNum, StorageMarks mark);
+extern void log_smgrunlinkmark(const RelFileLocator *rlocator,
+ ForkNumber forkNum, StorageMarks mark);
+extern void log_smgrbufpersistence(const RelFileLocator *rlocator,
+ bool persistence);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index 2d3b52fe0b..bcbd66ead3 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -77,7 +77,7 @@ extern int forkname_chars(const char *str, ForkNumber *fork);
extern char *GetDatabasePath(Oid dbOid, Oid spcOid);
extern char *GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
- int backendId, ForkNumber forkNumber);
+ int backendId, ForkNumber forkNumber, char mark);
/*
* Wrapper macros for GetRelationPath. Beware of multiple
@@ -87,7 +87,7 @@ extern char *GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
/* First argument is a RelFileLocator */
#define relpathbackend(rlocator, backend, forknum) \
GetRelationPath((rlocator).dbOid, (rlocator).spcOid, (rlocator).relNumber, \
- backend, forknum)
+ backend, forknum, 0)
/* First argument is a RelFileLocator */
#define relpathperm(rlocator, forknum) \
@@ -97,4 +97,9 @@ extern char *GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
#define relpath(rlocator, forknum) \
relpathbackend((rlocator).locator, (rlocator).backend, forknum)
+/* First argument is a RelFileLocatorBackend */
+#define markpath(rlocator, forknum, mark) \
+ GetRelationPath((rlocator).locator.dbOid, (rlocator).locator.spcOid, \
+ (rlocator).locator.relNumber, \
+ (rlocator).backend, forknum, mark)
#endif /* RELPATH_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 6f4dfa0960..5b753b768b 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -142,6 +142,8 @@ extern void DropRelationBuffers(struct SMgrRelationData *smgr_reln,
int nforks, BlockNumber *firstDelBlock);
extern void DropRelationsAllBuffers(struct SMgrRelationData **smgr_reln,
int nlocators);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+ bool permanent, bool isRedo);
extern void DropDatabaseBuffers(Oid dbid);
#define RelationGetNumberOfBlocks(reln) \
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 5a48fccd9c..b2773ae8bb 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -185,6 +185,7 @@ extern ssize_t pg_pwritev_with_retry(int fd,
extern int pg_truncate(const char *path, off_t length);
extern void fsync_fname(const char *fname, bool isdir);
extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
+extern int fsync_parent_path(const char *fname, int elevel);
extern int durable_rename(const char *oldfile, const char *newfile, int elevel);
extern int durable_unlink(const char *fname, int elevel);
extern void SyncDataDirectory(void);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 10aa1b0109..294a09444c 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -23,6 +23,10 @@
extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
extern void mdclose(SMgrRelation reln, ForkNumber forknum);
+extern void mdcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void mdunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
extern void mdunlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo);
@@ -41,12 +45,14 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void ForgetRelationForkSyncRequests(RelFileLocatorBackend rlocator,
+ ForkNumber forknum);
extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo);
/* md sync callbacks */
extern int mdsyncfiletag(const FileTag *ftag, char *path);
-extern int mdunlinkfiletag(const FileTag *ftag, char *path);
+extern int mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark);
extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *candidate);
#endif /* MD_H */
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index b990d28d38..cacd5e7d2c 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -16,14 +16,16 @@
#define REINIT_H
#include "common/relpath.h"
-
+#include "storage/smgr.h"
extern void ResetUnloggedRelations(int op);
extern bool parse_filename_for_nontemp_relation(const char *name,
int *relnumchars,
- ForkNumber *fork);
+ ForkNumber *fork,
+ StorageMarks *mark);
#define UNLOGGED_RELATION_CLEANUP 0x0001
-#define UNLOGGED_RELATION_INIT 0x0002
+#define UNLOGGED_RELATION_DROP_BUFFER 0x0002
+#define UNLOGGED_RELATION_INIT 0x0004
#endif /* REINIT_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a07715356b..75b9d41e4b 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,18 @@
#include "storage/block.h"
#include "storage/relfilelocator.h"
+/*
+ * A storage mark is a file whose existence indicates something about another
+ * file. The name of such a file is "<filename>.<mark>", where the mark is one
+ * of the values of StorageMarks. Since ".<digit>" denotes a segment file, don't
+ * use digits for the mark character.
+ */
+typedef enum StorageMarks
+{
+ SMGR_MARK_NONE = 0,
+ SMGR_MARK_UNCOMMITTED = 'u' /* the file is not committed yet */
+} StorageMarks;
+
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
@@ -87,7 +99,12 @@ extern void smgrcloseall(void);
extern void smgrcloserellocator(RelFileLocatorBackend rlocator);
extern void smgrrelease(SMgrRelation reln);
extern void smgrreleaseall(void);
+extern void smgrcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void smgrunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 97c9bc1861..2fdd221408 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1970,6 +1970,7 @@ PatternInfoArray
Pattern_Prefix_Status
Pattern_Type
PendingFsyncEntry
+PendingMarkCleanup
PendingRelDelete
PendingRelSync
PendingUnlinkEntry
@@ -2600,6 +2601,7 @@ StdRdOptIndexCleanup
StdRdOptions
Step
StopList
+StorageMarks
StrategyNumber
StreamCtl
String
@@ -3607,6 +3609,7 @@ registered_buffer
regmatch_t
regoff_t
regproc
+relfile_entry
relopt_bool
relopt_enum
relopt_enum_elt_def
@@ -3660,6 +3663,7 @@ slist_iter
slist_mutable_iter
slist_node
slock_t
+smgr_mark_action
socket_set
socklen_t
spgBulkDeleteState
@@ -3858,8 +3862,11 @@ xl_replorigin_set
xl_restore_point
xl_running_xacts
xl_seq_rec
+xl_smgr_bufpersistence
xl_smgr_create
+xl_smgr_mark
xl_smgr_truncate
+xl_smgr_unlink
xl_standby_lock
xl_standby_locks
xl_tblspc_create_rec
--
2.31.1
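For readers following the mark-file naming used by the patch above ("<filename>.<mark>", with digits reserved for segment numbers), here is a minimal standalone sketch of how such a name could be split into its parts. It is not the patch's parse_filename_for_nontemp_relation(); the function name parse_mark_filename, the MARK_* constants and the fork_names table are hypothetical stand-ins chosen for illustration, and the real function uses forkname_chars() and the StorageMarks enum instead.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the patch's StorageMarks values. */
#define MARK_NONE        0
#define MARK_UNCOMMITTED 'u'

/* Non-main fork suffixes as they appear in relation file names. */
static const char *const fork_names[] = {"fsm", "vm", "init"};

/*
 * Parse "<relnumber>[_<fork>][.<segno>][.<mark>]".
 *
 * On success, returns true and sets *relnumchars to the number of leading
 * digits, *fork to an index into fork_names (or -1 for the main fork), and
 * *mark to the mark character (or MARK_NONE).
 */
bool
parse_mark_filename(const char *name, int *relnumchars, int *fork, char *mark)
{
    int pos = 0;

    assert(name != NULL);

    /* Leading relation number: one or more digits. */
    while (isdigit((unsigned char) name[pos]))
        pos++;
    if (pos == 0)
        return false;
    *relnumchars = pos;

    /* Optional fork suffix such as "_init". */
    *fork = -1;
    if (name[pos] == '_')
    {
        size_t i;

        for (i = 0; i < sizeof(fork_names) / sizeof(fork_names[0]); i++)
        {
            size_t len = strlen(fork_names[i]);

            if (strncmp(name + pos + 1, fork_names[i], len) == 0)
            {
                *fork = (int) i;
                pos += 1 + (int) len;
                break;
            }
        }
        if (*fork < 0)
            return false;
    }

    /* Optional segment number: '.' followed by at least one digit. */
    if (name[pos] == '.' && isdigit((unsigned char) name[pos + 1]))
    {
        pos++;
        while (isdigit((unsigned char) name[pos]))
            pos++;
    }

    /* Optional mark: '.' followed by exactly one character. */
    *mark = MARK_NONE;
    if (name[pos] == '.' && name[pos + 1] != '\0')
    {
        *mark = name[pos + 1];
        pos += 2;
    }

    /* Anything left over means this is not a relation or mark file. */
    return name[pos] == '\0';
}
```

Because the segment check consumes only digits, a name such as "16384.1.u" is read as segment 1 carrying the 'u' mark, which is why the patch's comment asks that digits never be used as mark characters.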
Attachment: v24-0002-New-command-ALTER-TABLE-ALL-IN-TABLESPACE-SET-LO.patch (text/x-patch; charset=us-ascii)
From 2b078b9b32307707967e0ff7f473713488896b32 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 19 Jul 2022 13:32:13 +0900
Subject: [PATCH v24 2/2] New command ALTER TABLE ALL IN TABLESPACE SET
LOGGED/UNLOGGED
To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes
the persistence of all tables in the specified tablespace.
---
doc/src/sgml/ref/alter_table.sgml | 15 +++
src/backend/catalog/storage.c | 4 +-
src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++
src/backend/parser/gram.y | 42 +++++++
src/backend/storage/buffer/bufmgr.c | 12 +-
src/backend/storage/file/reinit.c | 4 +-
src/backend/tcop/utility.c | 11 ++
src/include/catalog/storage_xlog.h | 2 +-
src/include/commands/tablecmds.h | 2 +
src/include/nodes/parsenodes.h | 10 ++
src/test/regress/expected/tablespace.out | 76 ++++++++++++
src/test/regress/sql/tablespace.sql | 41 +++++++
12 files changed, 349 insertions(+), 10 deletions(-)
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index f0f912a56c..fd3d8290b8 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -33,6 +33,8 @@ ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
SET SCHEMA <replaceable class="parameter">new_schema</replaceable>
ALTER TABLE ALL IN TABLESPACE <replaceable class="parameter">name</replaceable> [ OWNED BY <replaceable class="parameter">role_name</replaceable> [, ... ] ]
SET TABLESPACE <replaceable class="parameter">new_tablespace</replaceable> [ NOWAIT ]
+ALTER TABLE ALL IN TABLESPACE <replaceable class="parameter">name</replaceable> [ OWNED BY <replaceable class="parameter">role_name</replaceable> [, ... ] ]
+ SET { LOGGED | UNLOGGED } [ NOWAIT ]
ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
ATTACH PARTITION <replaceable class="parameter">partition_name</replaceable> { FOR VALUES <replaceable class="parameter">partition_bound_spec</replaceable> | DEFAULT }
ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
@@ -759,6 +761,19 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
(for identity or serial columns). However, it is also possible to
change the persistence of such sequences separately.
</para>
+ <para>
+ The persistence of all tables in the current database in a tablespace
+ can be changed by using the <literal>ALL IN TABLESPACE</literal> form, which
+ first locks all tables to be changed and then changes each one. This form also
+ supports
+ <literal>OWNED BY</literal>, which will only change tables owned by the
+ specified roles. If the <literal>NOWAIT</literal> option is specified,
+ then the command will fail if it is unable to immediately acquire all of
+ the locks required. The <literal>information_schema</literal> relations
+ are not considered part of the system catalogs and will be changed. See
+ also
+ <link linkend="sql-createtablespace"><command>CREATE TABLESPACE</command></link>.
+ </para>
</listitem>
</varlistentry>
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 2c6472cfd5..b913f7d993 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -509,14 +509,14 @@ log_smgrunlinkmark(const RelFileLocator *rlocator, ForkNumber forkNum,
* Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
*/
void
-log_smgrbufpersistence(const RelFileLocator *rlocator, bool persistence)
+log_smgrbufpersistence(const RelFileLocator rlocator, bool persistence)
{
xl_smgr_bufpersistence xlrec;
/*
* Make an XLOG entry reporting the change of buffer persistence.
*/
- xlrec.rlocator = *rlocator;
+ xlrec.rlocator = rlocator;
xlrec.persistence = persistence;
XLogBeginInsert();
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 1779959410..cbf88f2878 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14827,6 +14827,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
return new_tablespaceoid;
}
+/*
+ * ALTER TABLE ALL IN TABLESPACE ... SET LOGGED/UNLOGGED
+ *
+ * Allows a user to change the persistence of all objects in a given
+ * tablespace in the current database. Objects can also be chosen by owner,
+ * so that users can change the persistence of only their own objects. The
+ * main permissions handling is done by the lower-level change persistence
+ * function.
+ *
+ * All to-be-modified objects are locked first. If NOWAIT is specified and the
+ * lock can't be acquired then we ereport(ERROR).
+ */
+void
+AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt)
+{
+ List *relations = NIL;
+ ListCell *l;
+ ScanKeyData key[1];
+ Relation rel;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ Oid tablespaceoid;
+ List *role_oids = roleSpecsToIds(stmt->roles);
+
+ /* Ensure we were not asked to change something we can't */
+ if (stmt->objtype != OBJECT_TABLE)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("only tables can be specified"));
+
+ /* Get the tablespace OID */
+ tablespaceoid = get_tablespace_oid(stmt->tablespacename, false);
+
+ /*
+ * Now that the checks are done, check if we should set it to
+ * InvalidOid because it is our database's default tablespace.
+ */
+ if (tablespaceoid == MyDatabaseTableSpace)
+ tablespaceoid = InvalidOid;
+
+ /*
+ * Walk the list of objects in the tablespace to pick them up. This will
+ * only find objects in our database, of course.
+ */
+ ScanKeyInit(&key[0],
+ Anum_pg_class_reltablespace,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(tablespaceoid));
+
+ rel = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(rel, 1, key);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relOid = relForm->oid;
+
+ /*
+ * Do not pick up objects in pg_catalog as part of this; if an admin
+ * really wishes to do so, they can issue the individual ALTER
+ * commands directly.
+ *
+ * Also, explicitly avoid any shared tables, temp tables, or TOAST
+ * (TOAST will be changed with the main table).
+ */
+ if (IsCatalogNamespace(relForm->relnamespace) ||
+ relForm->relisshared ||
+ isAnyTempNamespace(relForm->relnamespace) ||
+ IsToastNamespace(relForm->relnamespace))
+ continue;
+
+ /* Only pick up the object type requested */
+ if (relForm->relkind != RELKIND_RELATION)
+ continue;
+
+ /* Check if we are only picking up objects owned by certain roles */
+ if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner))
+ continue;
+
+ /*
+ * Handle permissions-checking here since we are locking the tables
+ * and also to avoid doing a bunch of work only to fail part-way. Note
+ * that permissions will also be checked by AlterTableInternal().
+ *
+ * Caller must be considered an owner of the table whose persistence
+ * we're going to change.
+ */
+ if (!pg_class_ownercheck(relOid, GetUserId()))
+ aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)),
+ NameStr(relForm->relname));
+
+ if (stmt->nowait &&
+ !ConditionalLockRelationOid(relOid, AccessExclusiveLock))
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("aborting because lock on relation \"%s.%s\" is not available",
+ get_namespace_name(relForm->relnamespace),
+ NameStr(relForm->relname)));
+ else
+ LockRelationOid(relOid, AccessExclusiveLock);
+
+ /*
+ * Add to our list of objects whose persistence we're going to
+ * change.
+ */
+ relations = lappend_oid(relations, relOid);
+ }
+
+ table_endscan(scan);
+ table_close(rel, AccessShareLock);
+
+ if (relations == NIL)
+ ereport(NOTICE,
+ errcode(ERRCODE_NO_DATA_FOUND),
+ errmsg("no matching relations in tablespace \"%s\" found",
+ tablespaceoid == InvalidOid ? "(database default)" :
+ get_tablespace_name(tablespaceoid)));
+
+ /*
+ * Everything is locked, loop through and change persistence of all of the
+ * relations.
+ */
+ foreach(l, relations)
+ {
+ List *cmds = NIL;
+ AlterTableCmd *cmd = makeNode(AlterTableCmd);
+
+ if (stmt->logged)
+ cmd->subtype = AT_SetLogged;
+ else
+ cmd->subtype = AT_SetUnLogged;
+
+ cmds = lappend(cmds, cmd);
+
+ EventTriggerAlterTableStart((Node *) stmt);
+ /* OID is set by AlterTableInternal */
+ AlterTableInternal(lfirst_oid(l), cmds, false);
+ EventTriggerAlterTableEnd();
+ }
+}
+
static void
index_copy_data(Relation rel, RelFileLocator newrlocator)
{
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 0d8d292850..b56fbbd4b2 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -2086,6 +2086,48 @@ AlterTableStmt:
n->nowait = $13;
$$ = (Node *) n;
}
+ | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = true;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name OWNED BY role_list SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->roles = $9;
+ n->logged = true;
+ n->nowait = $12;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = false;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name OWNED BY role_list SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->roles = $9;
+ n->logged = false;
+ n->nowait = $12;
+ $$ = (Node *)n;
+ }
| ALTER INDEX qualified_name alter_table_cmds
{
AlterTableStmt *n = makeNode(AlterTableStmt);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 796cb139c3..454f3bcf0a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3201,7 +3201,7 @@ SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
Assert(!RelFileLocatorBackendIsTemp(rlocator));
if (!isRedo)
- log_smgrbufpersistence(&srel->smgr_rlocator.locator, permanent);
+ log_smgrbufpersistence(srel->smgr_rlocator.locator, permanent);
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
@@ -3210,14 +3210,16 @@ SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
BufferDesc *bufHdr = GetBufferDescriptor(i);
uint32 buf_state;
- if (!RelFileLocatorEquals(bufHdr->tag.rlocator, rlocator.locator))
+ if (!RelFileLocatorEquals(BufTagGetRelFileLocator(&bufHdr->tag),
+ rlocator.locator))
continue;
ReservePrivateRefCountEntry();
buf_state = LockBufHdr(bufHdr);
- if (!RelFileLocatorEquals(bufHdr->tag.rlocator, rlocator.locator))
+ if (!RelFileLocatorEquals(BufTagGetRelFileLocator(&bufHdr->tag),
+ rlocator.locator))
{
UnlockBufHdr(bufHdr, buf_state);
continue;
@@ -3226,7 +3228,7 @@ SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
if (permanent)
{
/* Init fork is being dropped, drop buffers for it. */
- if (bufHdr->tag.forkNum == INIT_FORKNUM)
+ if (BufTagGetForkNum(&bufHdr->tag) == INIT_FORKNUM)
{
InvalidateBuffer(bufHdr);
continue;
@@ -3251,7 +3253,7 @@ SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
else
{
/* There shouldn't be an init fork */
- Assert(bufHdr->tag.forkNum != INIT_FORKNUM);
+ Assert(BufTagGetForkNum(&bufHdr->tag) != INIT_FORKNUM);
UnlockBufHdr(bufHdr, buf_state);
}
}
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index ea5d6bbba1..aae8561f9d 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -237,7 +237,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
if (forkNum == INIT_FORKNUM || mark == SMGR_MARK_UNCOMMITTED)
{
- RelFileNumbers key;
+ RelFileNumber key;
relfile_entry *ent;
bool found;
@@ -318,7 +318,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
rel.backend = InvalidBackendId;
rel.locator.spcOid = tspid;
rel.locator.dbOid = dbid;
- rel.locator.relNumber = ent->reloid;
+ rel.locator.relNumber = ent->relnumber;
srels[nrels++] = smgropen(rel.locator, InvalidBackendId);
}
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index aa00815787..3307937276 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -164,6 +164,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_AlterTSConfigurationStmt:
case T_AlterTSDictionaryStmt:
case T_AlterTableMoveAllStmt:
+ case T_AlterTableSetLoggedAllStmt:
case T_AlterTableSpaceOptionsStmt:
case T_AlterTableStmt:
case T_AlterTypeStmt:
@@ -1760,6 +1761,12 @@ ProcessUtilitySlow(ParseState *pstate,
commandCollected = true;
break;
+ case T_AlterTableSetLoggedAllStmt:
+ AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree);
+ /* commands are stashed in AlterTableSetLoggedAll */
+ commandCollected = true;
+ break;
+
case T_DropStmt:
ExecDropStmt((DropStmt *) parsetree, isTopLevel);
/* no commands stashed for DROP */
@@ -2688,6 +2695,10 @@ CreateCommandTag(Node *parsetree)
tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype);
+ break;
+
case T_AlterTableStmt:
tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype);
break;
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index 9f48fb5e6f..0a3726f3db 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -88,7 +88,7 @@ extern void log_smgrcreatemark(const RelFileLocator *rlocator,
ForkNumber forkNum, StorageMarks mark);
extern void log_smgrunlinkmark(const RelFileLocator *rlocator,
ForkNumber forkNum, StorageMarks mark);
-extern void log_smgrbufpersistence(const RelFileLocator *rlocator,
+extern void log_smgrbufpersistence(const RelFileLocator rlocator,
bool persistence);
extern void smgr_redo(XLogReaderState *record);
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index 03f14d6be1..27286fff33 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
+extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt * stmt);
+
extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
Oid *oldschema);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 633e7671b3..789e9438d1 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2426,6 +2426,16 @@ typedef struct AlterTableMoveAllStmt
bool nowait;
} AlterTableMoveAllStmt;
+typedef struct AlterTableSetLoggedAllStmt
+{
+ NodeTag type;
+ char *tablespacename;
+ ObjectType objtype; /* Object type to change */
+ List *roles; /* List of roles to change objects of */
+ bool logged;
+ bool nowait;
+} AlterTableSetLoggedAllStmt;
+
/* ----------------------
* Create/Alter Extension Statements
* ----------------------
diff --git a/src/test/regress/expected/tablespace.out b/src/test/regress/expected/tablespace.out
index c52cf1cfcf..a679d58553 100644
--- a/src/test/regress/expected/tablespace.out
+++ b/src/test/regress/expected/tablespace.out
@@ -966,5 +966,81 @@ drop cascades to table testschema.part
drop cascades to table testschema.atable
drop cascades to materialized view testschema.amv
drop cascades to table testschema.tablespace_acl
+--
+-- Check persistence change in a tablespace
+CREATE SCHEMA testschema;
+GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1;
+CREATE TABLESPACE regress_tablespace LOCATION '';
+GRANT CREATE ON TABLESPACE regress_tablespace TO regress_tablespace_user1;
+CREATE TABLE testschema.lsu(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.usu(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lsu(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._usu(a int) TABLESPACE pg_default;
+SET ROLE regress_tablespace_user1;
+CREATE TABLE testschema.lu1(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.uu1(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lu1(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._uu1(a int) TABLESPACE pg_default;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | p
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | p
+ uu1 | regress_tablespace | u
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace
+ OWNED BY regress_tablespace_user1 SET LOGGED;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | p
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | p
+ uu1 | regress_tablespace | p
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+RESET ROLE;
+ALTER TABLE ALL IN TABLESPACE regress_tablespace SET UNLOGGED;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | u
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | u
+ uu1 | regress_tablespace | u
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+-- Should succeed
+DROP SCHEMA testschema CASCADE;
+NOTICE: drop cascades to 8 other objects
+DETAIL: drop cascades to table testschema.lsu
+drop cascades to table testschema.usu
+drop cascades to table testschema._lsu
+drop cascades to table testschema._usu
+drop cascades to table testschema.lu1
+drop cascades to table testschema.uu1
+drop cascades to table testschema._lu1
+drop cascades to table testschema._uu1
+DROP TABLESPACE regress_tablespace;
DROP ROLE regress_tablespace_user1;
DROP ROLE regress_tablespace_user2;
diff --git a/src/test/regress/sql/tablespace.sql b/src/test/regress/sql/tablespace.sql
index 21db433f2a..a4b664f4e0 100644
--- a/src/test/regress/sql/tablespace.sql
+++ b/src/test/regress/sql/tablespace.sql
@@ -431,5 +431,46 @@ DROP TABLESPACE regress_tblspace_renamed;
DROP SCHEMA testschema CASCADE;
+
+--
+-- Check persistence change in a tablespace
+CREATE SCHEMA testschema;
+GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1;
+CREATE TABLESPACE regress_tablespace LOCATION '';
+GRANT CREATE ON TABLESPACE regress_tablespace TO regress_tablespace_user1;
+
+CREATE TABLE testschema.lsu(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.usu(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lsu(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._usu(a int) TABLESPACE pg_default;
+SET ROLE regress_tablespace_user1;
+CREATE TABLE testschema.lu1(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.uu1(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lu1(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._uu1(a int) TABLESPACE pg_default;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace
+ OWNED BY regress_tablespace_user1 SET LOGGED;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+RESET ROLE;
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace SET UNLOGGED;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+-- Should succeed
+DROP SCHEMA testschema CASCADE;
+DROP TABLESPACE regress_tablespace;
+
DROP ROLE regress_tablespace_user1;
DROP ROLE regress_tablespace_user2;
--
2.31.1
On Wed, Sep 28, 2022 at 17:21, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
Just rebased.
Hi
cfbot reports the patch no longer applies. As CommitFest 2022-11 is
currently underway, this would be an excellent time to update the patch.
Thanks
Ian Barwick
At Fri, 4 Nov 2022 09:32:52 +0900, Ian Lawrence Barwick <barwick@gmail.com> wrote in
cfbot reports the patch no longer applies. As CommitFest 2022-11 is
currently underway, this would be an excellent time to update the patch.
Indeed, thanks! I'll do that in a few days.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Tue, 08 Nov 2022 11:33:53 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Indeed, thanks! I'll do that in a few days.
This is later than promised, but I have rebased the patches. The contents
of the two patches were somewhat mixed up in the last version; that is
fixed now.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v25-0001-In-place-table-persistence-change.patch (text/x-patch; charset=us-ascii)
From b8bcd1e9dc7c52c277de7f13bc21900efd2030dc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 19 Jul 2022 13:23:01 +0900
Subject: [PATCH v25 1/2] In-place table persistence change
Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, it currently runs a heap rewrite, which causes a large amount
of file I/O. This patch makes the command run without a heap rewrite.
In addition, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI records instead of a massive number of HEAP_INSERT records,
which should be smaller.
This also allows cleaning up storage files left behind when the server
crashes during the transaction that created them.
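As a rough illustration of the cleanup idea in the last paragraph, the following is a toy in-memory model, not code from the patch. It assumes that a storage file created inside a transaction gets a companion mark, that commit removes only the mark, and that crash recovery removes every file still carrying one. All names (ToyFile, ToyDisk, toy_create_storage, toy_commit, toy_crash_recovery) are hypothetical and exist only for this sketch.

```c
/*
 * Toy model of the "mark file" idea; NOT code from the patch.
 * All types and functions here are hypothetical stand-ins.
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_FILES 8

typedef struct ToyFile
{
    int  relnumber;   /* stand-in for a relation file number */
    bool exists;      /* storage file is present on "disk" */
    bool marked;      /* uncommitted mark is present on "disk" */
} ToyFile;

typedef struct ToyDisk
{
    ToyFile files[MAX_FILES];
    size_t  nfiles;
} ToyDisk;

/* Create the mark first, then the storage file. */
static ToyFile *
toy_create_storage(ToyDisk *disk, int relnumber)
{
    ToyFile *f;

    assert(disk->nfiles < MAX_FILES);
    f = &disk->files[disk->nfiles++];
    f->relnumber = relnumber;
    f->marked = true;
    f->exists = true;
    return f;
}

/* At commit, only the mark goes away; the storage file stays. */
static void
toy_commit(ToyFile *f)
{
    f->marked = false;
}

/* After a crash, drop every storage file that still carries a mark. */
static int
toy_crash_recovery(ToyDisk *disk)
{
    int removed = 0;

    for (size_t i = 0; i < disk->nfiles; i++)
    {
        ToyFile *f = &disk->files[i];

        if (f->exists && f->marked)
        {
            f->exists = false;
            f->marked = false;
            removed++;
        }
    }
    return removed;
}
```

In this model, a file whose creating transaction committed survives recovery, while a file whose transaction never reached commit is removed.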
---
src/backend/access/rmgrdesc/smgrdesc.c | 49 ++
src/backend/access/transam/README | 10 +
src/backend/access/transam/xact.c | 7 +
src/backend/access/transam/xlogrecovery.c | 18 +
src/backend/backup/basebackup.c | 9 +-
src/backend/catalog/storage.c | 559 +++++++++++++++++++++-
src/backend/commands/tablecmds.c | 267 +++++++++--
src/backend/storage/buffer/bufmgr.c | 87 ++++
src/backend/storage/file/fd.c | 4 +-
src/backend/storage/file/reinit.c | 340 +++++++++----
src/backend/storage/smgr/md.c | 95 +++-
src/backend/storage/smgr/smgr.c | 32 ++
src/backend/storage/sync/sync.c | 21 +-
src/bin/pg_rewind/parsexlog.c | 22 +
src/bin/pg_rewind/pg_rewind.c | 1 -
src/common/relpath.c | 47 +-
src/include/catalog/storage.h | 3 +
src/include/catalog/storage_xlog.h | 43 +-
src/include/common/relpath.h | 9 +-
src/include/storage/bufmgr.h | 2 +
src/include/storage/fd.h | 1 +
src/include/storage/md.h | 8 +-
src/include/storage/reinit.h | 8 +-
src/include/storage/smgr.h | 17 +
src/tools/pgindent/typedefs.list | 7 +
25 files changed, 1483 insertions(+), 183 deletions(-)
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index e0ee8a078a..2f92c06f70 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,46 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+ char *path = relpathperm(xlrec->rlocator, xlrec->forkNum);
+
+ appendStringInfoString(buf, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+ char *path = GetRelationPath(xlrec->rlocator.dbOid,
+ xlrec->rlocator.spcOid,
+ xlrec->rlocator.relNumber,
+ InvalidBackendId,
+ xlrec->forkNum, xlrec->mark);
+ char *action = "<none>";
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ action = "CREATE";
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ action = "DELETE";
+ break;
+ }
+
+ appendStringInfo(buf, "%s %s", action, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+ char *path = relpathperm(xlrec->rlocator, MAIN_FORKNUM);
+
+ appendStringInfoString(buf, path);
+ appendStringInfo(buf, " persistence %d", xlrec->persistence);
+ pfree(path);
+ }
}
const char *
@@ -55,6 +95,15 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_UNLINK:
+ id = "UNLINK";
+ break;
+ case XLOG_SMGR_MARK:
+ id = "MARK";
+ break;
+ case XLOG_SMGR_BUFPERSISTENCE:
+ id = "BUFPERSISTENCE";
+ break;
}
return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 22c8ae9755..617a63e2c5 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -741,6 +741,16 @@ we must panic and abort recovery. The DBA will have to manually clean up
then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.
+================================
+The Smgr MARK files
+--------------------------------
+
+An smgr mark file is an empty file that is created when a new relation
+storage file is created to signal that the storage file needs to be
+cleaned up at recovery time. In contrast to the four actions above,
+failure to remove smgr mark files will lead to data loss, in which
+case the server will shut down.
+
Skipping WAL for New RelFileLocator
--------------------------------
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 8086b857b9..17d631a5db 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2213,6 +2213,9 @@ CommitTransaction(void)
*/
smgrDoPendingSyncs(true, is_parallel_worker);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2463,6 +2466,9 @@ PrepareTransaction(void)
*/
smgrDoPendingSyncs(true, false);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2788,6 +2794,7 @@ AbortTransaction(void)
AfterTriggerEndXact(false); /* 'false' means it's abort */
AtAbort_Portals();
smgrDoPendingSyncs(false, is_parallel_worker);
+ smgrDoPendingCleanups(false);
AtEOXact_LargeObject(false);
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index cb07694aea..af49fb5bf5 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -42,6 +42,7 @@
#include "access/xlogutils.h"
#include "backup/basebackup.h"
#include "catalog/pg_control.h"
+#include "catalog/storage.h"
#include "commands/tablespace.h"
#include "common/file_utils.h"
#include "miscadmin.h"
@@ -56,6 +57,7 @@
#include "storage/pmsignal.h"
#include "storage/proc.h"
#include "storage/procarray.h"
+#include "storage/reinit.h"
#include "storage/spin.h"
#include "utils/builtins.h"
#include "utils/datetime.h"
@@ -1777,6 +1779,14 @@ PerformWalRecovery(void)
RmgrCleanup();
+ /* cleanup garbage files left during crash recovery */
+ if (!InArchiveRecovery)
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
ereport(LOG,
(errmsg("redo done at %X/%X system usage: %s",
LSN_FORMAT_ARGS(xlogreader->ReadRecPtr),
@@ -3113,6 +3123,14 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
{
ereport(DEBUG1,
(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+
+ /* cleanup garbage files left during crash recovery */
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
InArchiveRecovery = true;
if (StandbyModeRequested)
StandbyMode = true;
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 74fb529380..1736df3d24 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -1190,6 +1190,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
ForkNumber relForkNum; /* Type of fork if file is a relation */
int relnumchars; /* Chars in filename that are the
* relnumber */
+ StorageMarks mark; /* marker file sign */
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1240,7 +1241,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
/* Exclude all forks for unlogged tables except the init fork */
if (isDbDir &&
parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &relForkNum))
+ &relForkNum, &mark))
{
/* Never exclude init forks */
if (relForkNum != INIT_FORKNUM)
@@ -1447,6 +1448,7 @@ is_checksummed_file(const char *fullpath, const char *filename)
strncmp(fullpath, "/", 1) == 0)
{
int excludeIdx;
+ char *p;
/* Compare file against noChecksumFiles skip list */
for (excludeIdx = 0; noChecksumFiles[excludeIdx].name != NULL; excludeIdx++)
@@ -1460,6 +1462,11 @@ is_checksummed_file(const char *fullpath, const char *filename)
return false;
}
+ /* exclude mark files */
+ p = strchr(filename, '.');
+ if (p && isalpha(p[1]) && p[2] == 0)
+ return false;
+
return true;
}
else
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index d708af19ed..595d6c2bb3 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,6 +19,7 @@
#include "postgres.h"
+#include "access/amapi.h"
#include "access/parallel.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -66,6 +67,23 @@ typedef struct PendingRelDelete
struct PendingRelDelete *next; /* linked-list link */
} PendingRelDelete;
+#define PCOP_UNLINK_FORK (1 << 0)
+#define PCOP_UNLINK_MARK (1 << 1)
+#define PCOP_SET_PERSISTENCE (1 << 2)
+
+typedef struct PendingCleanup
+{
+ RelFileLocator rlocator; /* relation that needs a cleanup */
+ int op; /* operation mask */
+ bool bufpersistence; /* buffer persistence to set */
+ ForkNumber unlink_forknum; /* forknum to unlink */
+ StorageMarks unlink_mark; /* mark to unlink */
+ BackendId backend; /* InvalidBackendId if not a temp rel */
+ bool atCommit; /* T=delete at commit; F=delete at abort */
+ int nestLevel; /* xact nesting level of request */
+ struct PendingCleanup *next; /* linked-list link */
+} PendingCleanup;
+
typedef struct PendingRelSync
{
RelFileLocator rlocator;
@@ -73,6 +91,7 @@ typedef struct PendingRelSync
} PendingRelSync;
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingCleanup * pendingCleanups = NULL; /* head of linked list */
static HTAB *pendingSyncHash = NULL;
@@ -123,6 +142,7 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
SMgrRelation srel;
BackendId backend;
bool needs_wal;
+ PendingCleanup *pendingclean;
Assert(!IsInParallelMode()); /* couldn't update pendingSyncHash */
@@ -145,9 +165,23 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
return NULL; /* placate compiler */
}
+ /*
+ * We are going to create a new storage file. If the server crashes before
+ * the current transaction ends, the file needs to be cleaned up. The
+ * SMGR_MARK_UNCOMMITTED mark file prompts that work at the next startup.
+ * We don't need this during WAL-logged CREATE DATABASE. See
+ * CreateAndCopyRelationData for details.
+ */
srel = smgropen(rlocator, backend);
+
+ if (register_delete)
+ {
+ log_smgrcreatemark(&rlocator, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+ }
+
smgrcreate(srel, MAIN_FORKNUM, false);
-
+
if (needs_wal)
log_smgrcreate(&srel->smgr_rlocator.locator, MAIN_FORKNUM);
@@ -157,16 +191,29 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
*/
if (register_delete)
{
- PendingRelDelete *pending;
+ PendingRelDelete *pendingdel;
- pending = (PendingRelDelete *)
+ pendingdel = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
- pending->rlocator = rlocator;
- pending->backend = backend;
- pending->atCommit = false; /* delete if abort */
- pending->nestLevel = GetCurrentTransactionNestLevel();
- pending->next = pendingDeletes;
- pendingDeletes = pending;
+ pendingdel->rlocator = rlocator;
+ pendingdel->backend = backend;
+ pendingdel->atCommit = false; /* delete if abort */
+ pendingdel->nestLevel = GetCurrentTransactionNestLevel();
+ pendingdel->next = pendingDeletes;
+ pendingDeletes = pendingdel;
+
+ /* drop mark files at commit */
+ pendingclean = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pendingclean->rlocator = rlocator;
+ pendingclean->op = PCOP_UNLINK_MARK;
+ pendingclean->unlink_forknum = MAIN_FORKNUM;
+ pendingclean->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pendingclean->backend = backend;
+ pendingclean->atCommit = true;
+ pendingclean->nestLevel = GetCurrentTransactionNestLevel();
+ pendingclean->next = pendingCleanups;
+ pendingCleanups = pendingclean;
}
if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
@@ -178,6 +225,204 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
return srel;
}
+/*
+ * RelationCreateInitFork
+ * Create physical storage for the init fork of a relation.
+ *
+ * Create the init fork for the relation.
+ *
+ * This function is transactional. The creation is WAL-logged, and if the
+ * transaction aborts later on, the init fork will be removed.
+ */
+void
+RelationCreateInitFork(Relation rel)
+{
+ RelFileLocator rlocator = rel->rd_locator;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ SMgrRelation srel;
+ bool create = true;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), false, false);
+
+ /*
+ * If we have a pending-unlink for the init fork of this relation, that
+ * means the init fork has existed since before the current transaction
+ * started. This function reverts that change just by removing the entry.
+ * See RelationDropInitFork.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileLocatorEquals(rlocator, pending->rlocator) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ create = false;
+ }
+ else
+ prev = pending;
+ }
+
+ if (!create)
+ return;
+
+ /* create the init fork, along with the commit-sentinel file */
+ srel = smgropen(rlocator, InvalidBackendId);
+ log_smgrcreatemark(&rlocator, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+
+ /* We don't have existing init fork, create it. */
+ smgrcreate(srel, INIT_FORKNUM, false);
+
+ /*
+ * The init fork of an index needs further initialization. ambuildempty is
+ * expected to WAL-log and sync the file by itself; for other relations we
+ * do that ourselves.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rlocator, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /* drop the init fork, mark file and revert persistence at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = rlocator;
+ pending->op = PCOP_UNLINK_FORK | PCOP_UNLINK_MARK | PCOP_SET_PERSISTENCE;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->bufpersistence = true;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* drop mark file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = rlocator;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
+/*
+ * RelationDropInitFork
+ * Delete physical storage for the init fork of a relation.
+ *
+ * Register a pending unlink of the init fork. The real deletion is performed
+ * by smgrDoPendingCleanups at commit.
+ *
+ * This function is transactional. If the transaction aborts later on, the
+ * deletion is canceled.
+ */
+void
+RelationDropInitFork(Relation rel)
+{
+ RelFileLocator rlocator = rel->rd_locator;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ bool inxact_created = false;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), true, false);
+
+ /*
+ * If we have a pending-unlink for the init-fork of this relation, that
+ * means the init fork is created in the current transaction. We remove
+ * both the init fork and mark file immediately in that case. Otherwise
+ * register an at-commit pending-unlink for the existing init fork. See
+ * RelationCreateInitFork.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileLocatorEquals(rlocator, pending->rlocator) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ /* unlink list entry */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ inxact_created = true;
+ }
+ else
+ prev = pending;
+ }
+
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rlocator, InvalidBackendId);
+ ForkNumber forknum = INIT_FORKNUM;
+ BlockNumber firstblock = 0;
+
+ /*
+ * Some AMs initialize the INIT fork via the buffer manager. Drop all
+ * buffers for the INIT fork, then unlink the INIT fork along with the
+ * mark file.
+ */
+ DropRelationBuffers(srel, &forknum, 1, &firstblock);
+ log_smgrunlinkmark(&rlocator, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrunlinkmark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+ log_smgrunlink(&rlocator, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ return;
+ }
+
+ /* register drop of this init fork file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = rlocator;
+ pending->op = PCOP_UNLINK_FORK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = rlocator;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = false;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -197,6 +442,88 @@ log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileLocator *rlocator, ForkNumber forkNum)
+{
+ xl_smgr_unlink xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file unlink.
+ */
+ xlrec.rlocator = *rlocator;
+ xlrec.forkNum = forkNum;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK record (XLOG_SMGR_MARK_CREATE) to WAL.
+ */
+void
+log_smgrcreatemark(const RelFileLocator *rlocator, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file creation.
+ */
+ xlrec.rlocator = *rlocator;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_CREATE;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK record (XLOG_SMGR_MARK_UNLINK) to WAL.
+ */
+void
+log_smgrunlinkmark(const RelFileLocator *rlocator, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the mark file removal.
+ */
+ xlrec.rlocator = *rlocator;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_UNLINK;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileLocator *rlocator, bool persistence)
+{
+ xl_smgr_bufpersistence xlrec;
+
+ /*
+ * Make an XLOG entry reporting the change of buffer persistence.
+ */
+ xlrec.rlocator = *rlocator;
+ xlrec.persistence = persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -711,6 +1038,93 @@ smgrDoPendingDeletes(bool isCommit)
}
}
+/*
+ * smgrDoPendingCleanups() -- Clean up work that emits WAL records
+ *
+ * The operations handled in this function emit WAL records, which must be
+ * part of the current transaction.
+ */
+void
+smgrDoPendingCleanups(bool isCommit)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (pending->nestLevel < nestLevel)
+ {
+ /* outer-level entries should not be processed yet */
+ prev = pending;
+ }
+ else
+ {
+ /* unlink list entry first, so we don't retry on failure */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ /* do cleanup if called for */
+ if (pending->atCommit == isCommit)
+ {
+ SMgrRelation srel;
+
+ srel = smgropen(pending->rlocator, pending->backend);
+
+ Assert((pending->op &
+ ~(PCOP_UNLINK_FORK | PCOP_UNLINK_MARK |
+ PCOP_SET_PERSISTENCE)) == 0);
+
+ if (pending->op & PCOP_SET_PERSISTENCE)
+ {
+ SetRelationBuffersPersistence(srel, pending->bufpersistence,
+ InRecovery);
+ }
+
+ if (pending->op & PCOP_UNLINK_FORK)
+ {
+ /*
+ * Unlink the fork file. Currently we use this only for
+ * init forks, and we're sure that the init fork is not
+ * loaded in shared buffers. In the RelationDropInitFork
+ * case, that function dropped those buffers. In the
+ * RelationCreateInitFork case, PCOP_SET_PERSISTENCE(true)
+ * is set and the buffers have been dropped just before.
+ */
+ Assert(pending->unlink_forknum == INIT_FORKNUM);
+
+ /* Don't emit WAL during recovery. */
+ if (!InRecovery)
+ log_smgrunlink(&pending->rlocator,
+ pending->unlink_forknum);
+ smgrunlink(srel, pending->unlink_forknum, false);
+ }
+
+ if (pending->op & PCOP_UNLINK_MARK)
+ {
+ if (!InRecovery)
+ log_smgrunlinkmark(&pending->rlocator,
+ pending->unlink_forknum,
+ pending->unlink_mark);
+
+ smgrunlinkmark(srel, pending->unlink_forknum,
+ pending->unlink_mark, InRecovery);
+ smgrclose(srel);
+ }
+ }
+
+ /* must explicitly free the list entry */
+ pfree(pending);
+ /* prev does not change */
+ }
+ }
+}
+
/*
* smgrDoPendingSyncs() -- Take care of relation syncs at end of xact.
*/
@@ -971,6 +1385,15 @@ smgr_redo(XLogReaderState *record)
reln = smgropen(xlrec->rlocator, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rlocator, InvalidBackendId);
+ smgrunlink(reln, xlrec->forkNum, true);
+ smgrclose(reln);
+ }
else if (info == XLOG_SMGR_TRUNCATE)
{
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
@@ -1059,6 +1482,124 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ bool created = false;
+
+ reln = smgropen(xlrec->rlocator, InvalidBackendId);
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ smgrcreatemark(reln, xlrec->forkNum, xlrec->mark, true);
+ created = true;
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ smgrunlinkmark(reln, xlrec->forkNum, xlrec->mark, true);
+ break;
+ default:
+ elog(ERROR, "unknown smgr_mark action \"%c\"", xlrec->action);
+ }
+
+ if (created)
+ {
+ /* revert mark file operation at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = xlrec->rlocator;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = xlrec->forkNum;
+ pending->unlink_mark = xlrec->mark;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ else
+ {
+ /*
+ * Delete pending action for this mark file if any. We should have
+ * at most one entry for this action.
+ */
+ PendingCleanup *prev = NULL;
+
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileLocatorEquals(xlrec->rlocator, pending->rlocator) &&
+ pending->unlink_forknum == xlrec->forkNum &&
+ (pending->op & PCOP_UNLINK_MARK) != 0)
+ {
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+ }
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ PendingCleanup *prev = NULL;
+
+ reln = smgropen(xlrec->rlocator, InvalidBackendId);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+
+ /*
+ * Delete pending action for persistence change if any. We should have
+ * at most one entry for this action.
+ */
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileLocatorEquals(xlrec->rlocator, pending->rlocator) &&
+ (pending->op & PCOP_SET_PERSISTENCE) != 0)
+ {
+ Assert(pending->bufpersistence == xlrec->persistence);
+
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+
+ /*
+ * Revert buffer-persistence changes at abort if the relation is changing
+ * to a persistence different from the one it had before this transaction.
+ */
+ if (!pending)
+ {
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = xlrec->rlocator;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = !xlrec->persistence;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index f006807852..b01bdaa51f 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -55,6 +55,7 @@
#include "commands/defrem.h"
#include "commands/event_trigger.h"
#include "commands/policy.h"
+#include "commands/progress.h"
#include "commands/sequence.h"
#include "commands/tablecmds.h"
#include "commands/tablespace.h"
@@ -5407,6 +5408,187 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
return newcmd;
}
+/*
+ * RelationChangePersistence: do in-place persistence change of a relation
+ */
+static void
+RelationChangePersistence(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ int i;
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * If any of the following were set, we would need to call ATRewriteTable;
+ * that cannot happen in the AT_REWRITE_ALTER_PERSISTENCE case.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ /*
+ * First we collect all relations whose persistence we need to change.
+ */
+
+ /* Collect OIDs of indexes and toast relations */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ /* Add toast relation if any */
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ List *toastidx;
+ Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
+
+ relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
+ toastidx = RelationGetIndexList(toastrel);
+ relids = list_concat(relids, toastidx);
+ list_free(toastidx);
+ table_close(toastrel, NoLock);
+ }
+
+ table_close(rel, NoLock);
+
+ /* Make changes in storage */
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach(lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+
+ /*
+ * XXX: Some access methods cannot tolerate an in-place persistence
+ * change. Specifically, GiST uses page LSNs to figure out whether a
+ * block has changed, and UNLOGGED GiST indexes use fake LSNs that
+ * are incompatible with the real LSNs used for LOGGED ones.
+ *
+ * Maybe if gistGetFakeLSN behaved the same way for permanent and
+ * unlogged indexes, we could skip the index rebuild in exchange for
+ * some extra WAL records emitted while the index is unlogged.
+ *
+ * Check relam against a positive list so that we take the hard way
+ * (a rebuild) for unknown AMs.
+ */
+ if (r->rd_rel->relkind == RELKIND_INDEX &&
+ /* GiST is excluded */
+ r->rd_rel->relam != BTREE_AM_OID &&
+ r->rd_rel->relam != HASH_AM_OID &&
+ r->rd_rel->relam != GIN_AM_OID &&
+ r->rd_rel->relam != SPGIST_AM_OID &&
+ r->rd_rel->relam != BRIN_AM_OID)
+ {
+ int reindex_flags;
+ ReindexParams params = {0};
+
+ /* reindex doesn't allow concurrent use of the index */
+ table_close(r, NoLock);
+
+ reindex_flags =
+ REINDEX_REL_SUPPRESS_INDEX_USE |
+ REINDEX_REL_CHECK_CONSTRAINTS;
+
+ /* Set the same persistence as the parent relation. */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+
+ reindex_index(reloid, reindex_flags, persistence, &params);
+
+ continue;
+ }
+
+ /* Create or drop init fork */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ RelationCreateInitFork(r);
+ else
+ RelationDropInitFork(r);
+
+ /*
+ * When this relation becomes WAL-logged, immediately sync all files
+ * except the init fork to establish the initial state on storage.
+ * Buffers have already been flushed out by
+ * RelationCreate(Drop)InitFork called just above. The init fork should
+ * have been synced as needed.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT)
+ {
+ for (i = 0; i < INIT_FORKNUM; i++)
+ {
+ if (smgrexists(RelationGetSmgr(r), i))
+ smgrimmedsync(RelationGetSmgr(r), i);
+ }
+ }
+
+ /* Update catalog */
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+
+ /*
+ * When wal_level >= replica, switching to LOGGED requires the
+ * relation content to be WAL-logged so that the table can be
+ * recovered. We don't emit this if wal_level = minimal.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded())
+ {
+ ForkNumber fork;
+ xl_smgr_truncate xlrec;
+
+ xlrec.blkno = 0;
+ xlrec.rlocator = r->rd_locator;
+ xlrec.flags = SMGR_TRUNCATE_ALL;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ for (fork = 0; fork < INIT_FORKNUM; fork++)
+ {
+ if (smgrexists(RelationGetSmgr(r), fork))
+ log_newpage_range(r, fork, 0,
+ smgrnblocks(RelationGetSmgr(r), fork),
+ false);
+ }
+ }
+
+ table_close(r, NoLock);
+ }
+
+ table_close(classRel, NoLock);
+}
+
/*
* ATRewriteTables: ALTER TABLE phase 3
*/
@@ -5537,48 +5719,55 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- /*
- * Create transient table that will receive the modified data.
- *
- * Ensure it is marked correctly as logged or unlogged. We have
- * to do this here so that buffers for the new relfilenumber will
- * have the right persistence set, and at the same time ensure
- * that the original filenumbers's buffers will get read in with
- * the correct setting (i.e. the original one). Otherwise a
- * rollback after the rewrite would possibly result with buffers
- * for the original filenumbers having the wrong persistence
- * setting.
- *
- * NB: This relies on swap_relation_files() also swapping the
- * persistence. That wouldn't work for pg_class, but that can't be
- * unlogged anyway.
- */
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
- persistence, lockmode);
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
+ RelationChangePersistence(tab, persistence, lockmode);
+ else
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new
+ * relfilenumber will have the right persistence set, and at
+ * the same time ensure that the original filenumbers's buffers
+ * will get read in with the correct setting (i.e. the original
+ * one). Otherwise a rollback after the rewrite would possibly
+ * result with buffers for the original filenumbers having the
+ * wrong persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that
+ * can't be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace,
+ NewAccessMethod,
+ persistence, lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
+ /*
+ * Swap the physical files of the old and new heaps, then
+ * rebuild indexes and discard the old heap. We can use
+ * RecentXmin for the table's new relfrozenxid because we
+ * rewrote all the tuples in ATRewriteTable, so no older Xid
+ * remains in the table. Also, we never try to swap toast
+ * tables by content, since we have no interest in letting
+ * this code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
- InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ }
}
else if (tab->rewrite > 0 && tab->relkind == RELKIND_SEQUENCE)
{
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 73d30bf619..6c6590005e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3169,6 +3169,93 @@ DropRelationBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
}
}
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a
+ * relation. When switching to PERMANENT, it also writes all dirty
+ * pages of the relation out to disk (or more accurately, out to kernel
+ * disk buffers), ensuring that the kernel has an up-to-date view of
+ * the relation.
+ *
+ * Generally, the caller should be holding AccessExclusiveLock on the
+ * target relation to ensure that no other backend is busy dirtying
+ * more blocks of the relation; the effects can't be expected to last
+ * after the lock is released.
+ *
+ * XXX currently it sequentially searches the buffer pool, should be
+ * changed to more clever ways of searching. This routine is not
+ * used in any performance-critical code paths, so it's not worth
+ * adding additional overhead to normal paths to make it go faster;
+ * but see also DropRelFileLocatorBuffers.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
+{
+ int i;
+ RelFileLocatorBackend rlocator = srel->smgr_rlocator;
+
+ Assert(!RelFileLocatorBackendIsTemp(rlocator));
+
+ if (!isRedo)
+ log_smgrbufpersistence(&srel->smgr_rlocator.locator, permanent);
+
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ if (!RelFileLocatorEquals(BufTagGetRelFileLocator(&bufHdr->tag),
+ rlocator.locator))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (!RelFileLocatorEquals(BufTagGetRelFileLocator(&bufHdr->tag),
+ rlocator.locator))
+ {
+ UnlockBufHdr(bufHdr, buf_state);
+ continue;
+ }
+
+ if (permanent)
+ {
+ /* Init fork is being dropped, drop buffers for it. */
+ if (BufTagGetForkNum(&bufHdr->tag) == INIT_FORKNUM)
+ {
+ InvalidateBuffer(bufHdr);
+ continue;
+ }
+
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /* we flush this buffer when switching to PERMANENT */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, srel);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ /* There shouldn't be an init fork */
+ Assert(BufTagGetForkNum(&bufHdr->tag) != INIT_FORKNUM);
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ }
+}
+
/* ---------------------------------------------------------------------
* DropRelationsAllBuffers
*
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 4151cafec5..f3e087a006 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -346,8 +346,6 @@ static void pre_sync_fname(const char *fname, bool isdir, int elevel);
static void datadir_fsync_fname(const char *fname, bool isdir, int elevel);
static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel);
-static int fsync_parent_path(const char *fname, int elevel);
-
/*
* pg_fsync --- do fsync with or without writethrough
@@ -3669,7 +3667,7 @@ fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel)
* This is aimed at making file operations persistent on disk in case of
* an OS crash or power failure.
*/
-static int
+int
fsync_parent_path(const char *fname, int elevel)
{
char parentpath[MAXPGPATH];
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 647c458b52..4d51cdaf34 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -16,29 +16,49 @@
#include <unistd.h>
+#include "access/xlogrecovery.h"
+#include "catalog/pg_tablespace_d.h"
#include "common/relpath.h"
#include "postmaster/startup.h"
+#include "storage/bufmgr.h"
#include "storage/copydir.h"
#include "storage/fd.h"
+#include "storage/md.h"
#include "storage/reinit.h"
+#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
- int op);
+ Oid tspid, int op);
static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
- int op);
+ Oid tspid, Oid dbid, int op);
typedef struct
{
- Oid reloid; /* hash key */
-} unlogged_relation_entry;
+ RelFileNumber relNumber; /* hash key */
+ bool has_init; /* has INIT fork */
+ bool dirty_init; /* needs to remove INIT fork */
+ bool dirty_all; /* needs to remove all forks */
+} relfile_entry;
/*
- * Reset unlogged relations from before the last restart.
+ * Clean up and reset relation files from before the last restart.
*
- * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any
- * relation with an "init" fork, except for the "init" fork itself.
+ * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations
+ * depending on the existence of the "cleanup" forks.
+ *
+ * If the SMGR_MARK_UNCOMMITTED mark file for the init fork is present, we
+ * remove the init fork along with the mark file.
+ *
+ * If the SMGR_MARK_UNCOMMITTED mark file for the main fork is present, we
+ * remove the whole relation along with the mark file.
+ *
+ * Otherwise, if the "init" fork is found, we remove all forks of any relation
+ * with the "init" fork, except for the "init" fork itself.
+ *
+ * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all
+ * relations that have the "cleanup" and/or the "init" forks.
*
* If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main
* fork.
@@ -72,7 +92,7 @@ ResetUnloggedRelations(int op)
/*
* First process unlogged files in pg_default ($PGDATA/base)
*/
- ResetUnloggedRelationsInTablespaceDir("base", op);
+ ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op);
/*
* Cycle through directories for all non-default tablespaces.
@@ -81,13 +101,19 @@ ResetUnloggedRelations(int op)
while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
{
+ Oid tspid;
+
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
continue;
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
- ResetUnloggedRelationsInTablespaceDir(temp_path, op);
+
+ tspid = atooid(spc_de->d_name);
+
+ Assert(tspid != 0);
+ ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op);
}
FreeDir(spc_dir);
@@ -103,7 +129,8 @@ ResetUnloggedRelations(int op)
* Process one tablespace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
+ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
+ Oid tspid, int op)
{
DIR *ts_dir;
struct dirent *de;
@@ -130,6 +157,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
while ((de = ReadDir(ts_dir, tsdirname)) != NULL)
{
+ Oid dbid;
+
/*
* We're only interested in the per-database directories, which have
* numeric names. Note that this code will also (properly) ignore "."
@@ -148,7 +177,10 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
ereport_startup_progress("resetting unlogged relations (cleanup), elapsed time: %ld.%02d s, current path: %s",
dbspace_path);
- ResetUnloggedRelationsInDbspaceDir(dbspace_path, op);
+ dbid = atooid(de->d_name);
+ Assert(dbid != 0);
+
+ ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op);
}
FreeDir(ts_dir);
@@ -158,125 +190,225 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
* Process one per-dbspace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
+ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
+ Oid tspid, Oid dbid, int op)
{
DIR *dbspace_dir;
struct dirent *de;
char rm_path[MAXPGPATH * 2];
+ HTAB *hash;
+ HASHCTL ctl;
/* Caller must specify at least one operation. */
- Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0);
+ Assert((op & (UNLOGGED_RELATION_CLEANUP |
+ UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_INIT)) != 0);
/*
* Cleanup is a two-pass operation. First, we go through and identify all
* the files with init forks. Then, we go through again and nuke
* everything with the same OID except the init fork.
*/
+
+ /*
+ * It's possible that someone could create tons of unlogged relations in
+ * the same database & tablespace, so we'd better use a hash table rather
+ * than an array or linked list to keep track of which files need to be
+ * reset. Otherwise, this cleanup operation would be O(n^2).
+ */
+ memset(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(RelFileNumber);
+ ctl.entrysize = sizeof(relfile_entry);
+ hash = hash_create("unlogged relation RelFileNumbers",
+ 32, &ctl, HASH_ELEM | HASH_BLOBS);
+
+ /* Collect INIT fork and mark files in the directory. */
+ dbspace_dir = AllocateDir(dbspacedirname);
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
+ {
+ ForkNumber forkNum;
+ int relnumchars;
+ StorageMarks mark;
+
+ /* Skip anything that doesn't look like a relation data file. */
+ if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
+ &forkNum, &mark))
+ continue;
+
+ if (forkNum == INIT_FORKNUM || mark == SMGR_MARK_UNCOMMITTED)
+ {
+ RelFileNumber key;
+ relfile_entry *ent;
+ bool found;
+
+ /*
+ * Put the OID portion of the name into the hash table, if it
+ * isn't already there. If the relation has SMGR_MARK_UNCOMMITTED
+ * mark files, its storage is in a dirty state and needs
+ * cleanup.
+ */
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_ENTER, &found);
+
+ if (!found)
+ {
+ ent->has_init = false;
+ ent->dirty_init = false;
+ ent->dirty_all = false;
+ }
+
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_init = true;
+ else if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_all = true;
+ else
+ {
+ Assert(forkNum == INIT_FORKNUM);
+ ent->has_init = true;
+ }
+ }
+ }
+
+ /* Done with the first pass. */
+ FreeDir(dbspace_dir);
+
+ /* nothing to do if we have neither init forks nor mark files */
+ if (hash_get_num_entries(hash) < 1)
+ {
+ hash_destroy(hash);
+ return;
+ }
+
+ if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0)
+ {
+ /*
+ * When we come here after recovery, an smgr object for this file might
+ * have been created. In that case we need to drop all buffers and then
+ * the smgr object before initializing the unlogged relation. This is
+ * safe as long as no other backends have accessed the relation before
+ * starting archive recovery.
+ */
+ HASH_SEQ_STATUS status;
+ relfile_entry *ent;
+ SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8);
+ int maxrels = 8;
+ int nrels = 0;
+ int i;
+
+ Assert(!HotStandbyActive());
+
+ hash_seq_init(&status, hash);
+ while ((ent = (relfile_entry *) hash_seq_search(&status)) != NULL)
+ {
+ RelFileLocatorBackend rel;
+
+ /*
+ * The relation is persistent and stays persistent. Don't drop the
+ * buffers for this relation.
+ */
+ if (ent->has_init && ent->dirty_init)
+ continue;
+
+ if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ rel.backend = InvalidBackendId;
+ rel.locator.spcOid = tspid;
+ rel.locator.dbOid = dbid;
+ rel.locator.relNumber = ent->relNumber;
+
+ srels[nrels++] = smgropen(rel.locator, InvalidBackendId);
+ }
+
+ DropRelationsAllBuffers(srels, nrels);
+
+ for (i = 0; i < nrels; i++)
+ smgrclose(srels[i]);
+ }
+
+ /*
+ * Now, make a second pass and remove anything that matches.
+ */
if ((op & UNLOGGED_RELATION_CLEANUP) != 0)
{
- HTAB *hash;
- HASHCTL ctl;
-
- /*
- * It's possible that someone could create a ton of unlogged relations
- * in the same database & tablespace, so we'd better use a hash table
- * rather than an array or linked list to keep track of which files
- * need to be reset. Otherwise, this cleanup operation would be
- * O(n^2).
- */
- ctl.keysize = sizeof(Oid);
- ctl.entrysize = sizeof(unlogged_relation_entry);
- ctl.hcxt = CurrentMemoryContext;
- hash = hash_create("unlogged relation OIDs", 32, &ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- /* Scan the directory. */
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int relnumchars;
- unlogged_relation_entry ent;
+ RelFileNumber key;
+ relfile_entry *ent;
+ RelFileLocatorBackend rel;
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
- continue;
-
- /* Also skip it unless this is the init fork. */
- if (forkNum != INIT_FORKNUM)
- continue;
-
- /*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
- */
- ent.reloid = atooid(de->d_name);
- (void) hash_search(hash, &ent, HASH_ENTER, NULL);
- }
-
- /* Done with the first pass. */
- FreeDir(dbspace_dir);
-
- /*
- * If we didn't find any init forks, there's no point in continuing;
- * we can bail out now.
- */
- if (hash_get_num_entries(hash) == 0)
- {
- hash_destroy(hash);
- return;
- }
-
- /*
- * Now, make a second pass and remove anything that matches.
- */
- dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
- {
- ForkNumber forkNum;
- int relnumchars;
- unlogged_relation_entry ent;
-
- /* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
- continue;
-
- /* We never remove the init fork. */
- if (forkNum == INIT_FORKNUM)
+ &forkNum, &mark))
continue;
/*
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
- if (hash_search(hash, &ent, HASH_FIND, NULL))
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);
+
+ if (!ent)
+ continue;
+
+ if (!ent->dirty_all)
{
- snprintf(rm_path, sizeof(rm_path), "%s/%s",
- dbspacedirname, de->d_name);
- if (unlink(rm_path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m",
- rm_path)));
+ /* clean permanent relations don't need cleanup */
+ if (!ent->has_init)
+ continue;
+
+ if (ent->dirty_init)
+ {
+ /*
+ * The crashed transaction did SET UNLOGGED. This relation
+ * is restored to a LOGGED relation.
+ */
+ if (forkNum != INIT_FORKNUM)
+ continue;
+ }
else
- elog(DEBUG2, "unlinked file \"%s\"", rm_path);
+ {
+ /*
+ * We don't remove the INIT fork of a non-dirty
+ * relation.
+ */
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE)
+ continue;
+ }
}
+
+ /* so, nuke it! */
+ snprintf(rm_path, sizeof(rm_path), "%s/%s",
+ dbspacedirname, de->d_name);
+ if (unlink(rm_path) < 0)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ rm_path));
+
+ rel.backend = InvalidBackendId;
+ rel.locator.spcOid = tspid;
+ rel.locator.dbOid = dbid;
+ rel.locator.relNumber = atooid(de->d_name);
+
+ ForgetRelationForkSyncRequests(rel, forkNum);
}
/* Cleanup is complete. */
FreeDir(dbspace_dir);
- hash_destroy(hash);
}
/*
* Initialization happens after cleanup is complete: we copy each init
- * fork file to the corresponding main fork file. Note that if we are
- * asked to do both cleanup and init, we may never get here: if the
- * cleanup code determines that there are no init forks in this dbspace,
- * it will return before we get to this point.
+ * fork file to the corresponding main fork file.
*/
if ((op & UNLOGGED_RELATION_INIT) != 0)
{
@@ -285,6 +417,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int relnumchars;
char relnumbuf[OIDCHARS + 1];
char srcpath[MAXPGPATH * 2];
@@ -292,9 +425,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -328,15 +463,18 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int relnumchars;
char relnumbuf[OIDCHARS + 1];
char mainpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -379,7 +517,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
*/
bool
parse_filename_for_nontemp_relation(const char *name, int *relnumchars,
- ForkNumber *fork)
+ ForkNumber *fork, StorageMarks *mark)
{
int pos;
@@ -410,11 +548,19 @@ parse_filename_for_nontemp_relation(const char *name, int *relnumchars,
for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar)
;
- if (segchar <= 1)
- return false;
- pos += segchar;
+ if (segchar > 1)
+ pos += segchar;
}
+ /* mark file? */
+ if (name[pos] == '.' && name[pos + 1] != 0)
+ {
+ *mark = name[pos + 1];
+ pos += 2;
+ }
+ else
+ *mark = SMGR_MARK_NONE;
+
/* Now we should be at the end. */
if (name[pos] != '\0')
return false;
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 14b6fa0fd9..094b4191c4 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -141,7 +141,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forknum,
BlockNumber blkno, bool skipFsync, int behavior);
static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
MdfdVec *seg);
-
+static bool mdmarkexists(SMgrRelation reln, ForkNumber forkNum,
+ StorageMarks mark);
/*
* mdinit() -- Initialize private state for magnetic disk storage manager.
@@ -173,6 +174,82 @@ mdexists(SMgrRelation reln, ForkNumber forknum)
return (mdopenfork(reln, forknum, EXTENSION_RETURN_NULL) != NULL);
}
+/*
+ * mdcreatemark() -- Create a mark file.
+ *
+ * If isRedo is true, it's okay for the file to exist already.
+ */
+void
+mdcreatemark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rlocator, forkNum, mark);
+ int fd;
+
+ /* See mdcreate for details.. */
+ TablespaceCreateDbspace(reln->smgr_rlocator.locator.spcOid,
+ reln->smgr_rlocator.locator.dbOid,
+ isRedo);
+
+ fd = BasicOpenFile(path, O_WRONLY | O_CREAT | O_EXCL);
+ if (fd < 0 && (!isRedo || errno != EEXIST))
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not create mark file \"%s\": %m", path));
+
+ pg_fsync(fd);
+ close(fd);
+
+ /*
+ * To guarantee that the creation of the file is persistent, fsync its
+ * parent directory.
+ */
+ fsync_parent_path(path, ERROR);
+
+ pfree(path);
+}
+
+
+/*
+ * mdunlinkmark() -- Delete the mark file
+ *
+ * If isRedo is true, it's okay if the file is not found.
+ */
+void
+mdunlinkmark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rlocator, forkNum, mark);
+
+ if (!isRedo || mdmarkexists(reln, forkNum, mark))
+ durable_unlink(path, ERROR);
+
+ pfree(path);
+}
+
+/*
+ * mdmarkexists() -- Check if the file exists.
+ */
+static bool
+mdmarkexists(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark)
+{
+ char *path = markpath(reln->smgr_rlocator, forkNum, mark);
+ int fd;
+
+ fd = BasicOpenFile(path, O_RDONLY);
+ if (fd < 0 && errno != ENOENT)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not access mark file \"%s\": %m", path));
+ pfree(path);
+
+ if (fd < 0)
+ return false;
+
+ close(fd);
+ return true;
+}
+
/*
* mdcreate() -- Create a new relation on magnetic disk.
*
@@ -1063,6 +1140,16 @@ register_forget_request(RelFileLocatorBackend rlocator, ForkNumber forknum,
RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ );
}
+/*
+ * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork
+ */
+void
+ForgetRelationForkSyncRequests(RelFileLocatorBackend rlocator,
+ ForkNumber forknum)
+{
+ register_forget_request(rlocator, forknum, 0);
+}
+
/*
* ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB
*/
@@ -1421,12 +1508,14 @@ mdsyncfiletag(const FileTag *ftag, char *path)
* Return 0 on success, -1 on failure, with errno set.
*/
int
-mdunlinkfiletag(const FileTag *ftag, char *path)
+mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark)
{
char *p;
/* Compute the path. */
- p = relpathperm(ftag->rlocator, MAIN_FORKNUM);
+ p = GetRelationPath(ftag->rlocator.dbOid, ftag->rlocator.spcOid,
+ ftag->rlocator.relNumber, InvalidBackendId,
+ MAIN_FORKNUM, mark);
strlcpy(path, p, MAXPGPATH);
pfree(p);
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index c1a5febcbf..5044dc21e4 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -62,6 +62,10 @@ typedef struct f_smgr
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+ void (*smgr_createmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+ void (*smgr_unlinkmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
} f_smgr;
static const f_smgr smgrsw[] = {
@@ -82,6 +86,8 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
+ .smgr_createmark = mdcreatemark,
+ .smgr_unlinkmark = mdunlinkmark,
}
};
@@ -371,6 +377,26 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo);
}
+/*
+ * smgrcreatemark() -- Create a mark file
+ */
+void
+smgrcreatemark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_createmark(reln, forknum, mark, isRedo);
+}
+
+/*
+ * smgrunlinkmark() -- Delete a mark file
+ */
+void
+smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlinkmark(reln, forknum, mark, isRedo);
+}
+
/*
* smgrdosyncall() -- Immediately sync all forks of all given relations
*
@@ -693,6 +719,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rlocator, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 9d6a9e9109..74357efb1c 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -91,7 +91,8 @@ static CycleCtr checkpoint_cycle_ctr = 0;
typedef struct SyncOps
{
int (*sync_syncfiletag) (const FileTag *ftag, char *path);
- int (*sync_unlinkfiletag) (const FileTag *ftag, char *path);
+ int (*sync_unlinkfiletag) (const FileTag *ftag, char *path,
+ StorageMarks mark);
bool (*sync_filetagmatches) (const FileTag *ftag,
const FileTag *candidate);
} SyncOps;
@@ -235,7 +236,8 @@ SyncPostCheckpoint(void)
/* Unlink the file */
if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
- path) < 0)
+ path,
+ SMGR_MARK_NONE) < 0)
{
/*
* There's a race condition, when the database is dropped at the
@@ -244,6 +246,21 @@ SyncPostCheckpoint(void)
* here. rmtree() also has to ignore ENOENT errors, to deal with
* the possibility that we delete the file first.
*/
+ if (errno != ENOENT)
+ ereport(WARNING,
+ errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path));
+ }
+ else if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
+ path,
+ SMGR_MARK_UNCOMMITTED)
+ < 0)
+ {
+ /*
+ * We may also have an SMGR_MARK_UNCOMMITTED file. Remove it if the
+ * fork file has been successfully removed. It's ok if the file
+ * does not exist.
+ */
if (errno != ENOENT)
ereport(WARNING,
(errcode_for_file_access(),
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 53f011a2fe..4da62f71ce 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -418,6 +418,28 @@ extractPageInfo(XLogReaderState *record)
* source system.
*/
}
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_UNLINK)
+ {
+ /*
+ * We can safely ignore these. We'll see that the files don't exist in
+ * the target data dir, and copy them in from the source system. No
+ * need to do anything special here.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_MARK)
+ {
+ /*
+ * We can safely ignore these. The file will be removed from the
+ * target if it doesn't exist in the source system. The files are
+ * empty, so we don't need to bother with their content.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ /*
+ * We can safely ignore these. These don't make any on-disk changes.
+ */
+ }
else if (rmid == RM_XACT_ID &&
((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT ||
(rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED ||
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 3cd77c09b1..f87dfda8fe 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -419,7 +419,6 @@ main(int argc, char **argv)
if (showprogress)
pg_log_info("reading source file list");
source->traverse_files(source, &process_source_file);
-
if (showprogress)
pg_log_info("reading target file list");
traverse_datadir(datadir_target, &process_target_file);
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 1b6b620ce8..46cfe38fd5 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -139,9 +139,15 @@ GetDatabasePath(Oid dbOid, Oid spcOid)
*/
char *
GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
- int backendId, ForkNumber forkNumber)
+ int backendId, ForkNumber forkNumber, char mark)
{
char *path;
+ char markstr[4];
+
+ if (mark == 0)
+ markstr[0] = 0;
+ else
+ snprintf(markstr, sizeof(markstr), ".%c", mark);
if (spcOid == GLOBALTABLESPACE_OID)
{
@@ -149,10 +155,10 @@ GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
Assert(dbOid == 0);
Assert(backendId == InvalidBackendId);
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("global/%u_%s",
- relNumber, forkNames[forkNumber]);
+ path = psprintf("global/%u_%s%s",
+ relNumber, forkNames[forkNumber], markstr);
else
- path = psprintf("global/%u", relNumber);
+ path = psprintf("global/%u%s", relNumber, markstr);
}
else if (spcOid == DEFAULTTABLESPACE_OID)
{
@@ -160,22 +166,22 @@ GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/%u_%s",
+ path = psprintf("base/%u/%u_%s%s",
dbOid, relNumber,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/%u",
- dbOid, relNumber);
+ path = psprintf("base/%u/%u%s",
+ dbOid, relNumber, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/t%d_%u_%s",
+ path = psprintf("base/%u/t%d_%u_%s%s",
dbOid, backendId, relNumber,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/t%d_%u",
- dbOid, backendId, relNumber);
+ path = psprintf("base/%u/t%d_%u%s",
+ dbOid, backendId, relNumber, markstr);
}
}
else
@@ -184,27 +190,28 @@ GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u_%s%s",
spcOid, TABLESPACE_VERSION_DIRECTORY,
dbOid, relNumber,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u%s",
spcOid, TABLESPACE_VERSION_DIRECTORY,
- dbOid, relNumber);
+ dbOid, relNumber, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s%s",
spcOid, TABLESPACE_VERSION_DIRECTORY,
dbOid, backendId, relNumber,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u%s",
spcOid, TABLESPACE_VERSION_DIRECTORY,
- dbOid, backendId, relNumber);
+ dbOid, backendId, relNumber, markstr);
}
}
+
return path;
}
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 9964c312aa..ee4179699a 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -25,6 +25,8 @@ extern PGDLLIMPORT int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileLocator rlocator,
char relpersistence,
bool register_delete);
+extern void RelationCreateInitFork(Relation rel);
+extern void RelationDropInitFork(Relation rel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileLocator rlocator, bool atCommit);
extern void RelationPreTruncate(Relation rel);
@@ -43,6 +45,7 @@ extern void RestorePendingSyncs(char *startAddress);
extern void smgrDoPendingDeletes(bool isCommit);
extern void smgrDoPendingSyncs(bool isCommit, bool isParallelWorker);
extern int smgrGetPendingDeletes(bool forCommit, RelFileLocator **ptr);
+extern void smgrDoPendingCleanups(bool isCommit);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index 44a5e2043b..9f48fb5e6f 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -18,17 +18,23 @@
#include "lib/stringinfo.h"
#include "storage/block.h"
#include "storage/relfilelocator.h"
+#include "storage/smgr.h"
/*
* Declarations for smgr-related XLOG records
*
- * Note: we log file creation and truncation here, but logging of deletion
- * actions is handled by xact.c, because it is part of transaction commit.
+ * Note: we log file creation, truncation and buffer persistence change here,
+ * but logging of deletion actions is handled mainly by xact.c, because it is
+ * part of transaction commit in most cases. However, there's a case where
+ * init forks are deleted outside the control of a transaction.
*/
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_UNLINK 0x30
+#define XLOG_SMGR_MARK 0x40
+#define XLOG_SMGR_BUFPERSISTENCE 0x50
typedef struct xl_smgr_create
{
@@ -36,6 +42,32 @@ typedef struct xl_smgr_create
ForkNumber forkNum;
} xl_smgr_create;
+typedef struct xl_smgr_unlink
+{
+ RelFileLocator rlocator;
+ ForkNumber forkNum;
+} xl_smgr_unlink;
+
+typedef enum smgr_mark_action
+{
+ XLOG_SMGR_MARK_CREATE = 'c',
+ XLOG_SMGR_MARK_UNLINK = 'u'
+} smgr_mark_action;
+
+typedef struct xl_smgr_mark
+{
+ RelFileLocator rlocator;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ smgr_mark_action action;
+} xl_smgr_mark;
+
+typedef struct xl_smgr_bufpersistence
+{
+ RelFileLocator rlocator;
+ bool persistence;
+} xl_smgr_bufpersistence;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +83,13 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileLocator *rlocator, ForkNumber forkNum);
+extern void log_smgrcreatemark(const RelFileLocator *rlocator,
+ ForkNumber forkNum, StorageMarks mark);
+extern void log_smgrunlinkmark(const RelFileLocator *rlocator,
+ ForkNumber forkNum, StorageMarks mark);
+extern void log_smgrbufpersistence(const RelFileLocator *rlocator,
+ bool persistence);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index 4bbd94393c..4281d35d1a 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -74,7 +74,7 @@ extern int forkname_chars(const char *str, ForkNumber *fork);
extern char *GetDatabasePath(Oid dbOid, Oid spcOid);
extern char *GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
- int backendId, ForkNumber forkNumber);
+ int backendId, ForkNumber forkNumber, char mark);
/*
* Wrapper macros for GetRelationPath. Beware of multiple
@@ -84,7 +84,7 @@ extern char *GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
/* First argument is a RelFileLocator */
#define relpathbackend(rlocator, backend, forknum) \
GetRelationPath((rlocator).dbOid, (rlocator).spcOid, (rlocator).relNumber, \
- backend, forknum)
+ backend, forknum, 0)
/* First argument is a RelFileLocator */
#define relpathperm(rlocator, forknum) \
@@ -94,4 +94,9 @@ extern char *GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
#define relpath(rlocator, forknum) \
relpathbackend((rlocator).locator, (rlocator).backend, forknum)
+/* First argument is a RelFileLocatorBackend */
+#define markpath(rlocator, forknum, mark) \
+ GetRelationPath((rlocator).locator.dbOid, (rlocator).locator.spcOid, \
+ (rlocator).locator.relNumber, \
+ (rlocator).backend, forknum, mark)
#endif /* RELPATH_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index e1bd22441b..036fdc35b3 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -151,6 +151,8 @@ extern void DropRelationBuffers(struct SMgrRelationData *smgr_reln,
int nforks, BlockNumber *firstDelBlock);
extern void DropRelationsAllBuffers(struct SMgrRelationData **smgr_reln,
int nlocators);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+ bool permanent, bool isRedo);
extern void DropDatabaseBuffers(Oid dbid);
#define RelationGetNumberOfBlocks(reln) \
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index c0a212487d..7c0fa86b15 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -179,6 +179,7 @@ extern void pg_flush_data(int fd, off_t offset, off_t nbytes);
extern int pg_truncate(const char *path, off_t length);
extern void fsync_fname(const char *fname, bool isdir);
extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
+extern int fsync_parent_path(const char *fname, int elevel);
extern int durable_rename(const char *oldfile, const char *newfile, int elevel);
extern int durable_unlink(const char *fname, int elevel);
extern void SyncDataDirectory(void);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 10aa1b0109..294a09444c 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -23,6 +23,10 @@
extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
extern void mdclose(SMgrRelation reln, ForkNumber forknum);
+extern void mdcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void mdunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
extern void mdunlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo);
@@ -41,12 +45,14 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void ForgetRelationForkSyncRequests(RelFileLocatorBackend rlocator,
+ ForkNumber forknum);
extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo);
/* md sync callbacks */
extern int mdsyncfiletag(const FileTag *ftag, char *path);
-extern int mdunlinkfiletag(const FileTag *ftag, char *path);
+extern int mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark);
extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *candidate);
#endif /* MD_H */
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index b990d28d38..cacd5e7d2c 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -16,14 +16,16 @@
#define REINIT_H
#include "common/relpath.h"
-
+#include "storage/smgr.h"
extern void ResetUnloggedRelations(int op);
extern bool parse_filename_for_nontemp_relation(const char *name,
int *relnumchars,
- ForkNumber *fork);
+ ForkNumber *fork,
+ StorageMarks *mark);
#define UNLOGGED_RELATION_CLEANUP 0x0001
-#define UNLOGGED_RELATION_INIT 0x0002
+#define UNLOGGED_RELATION_DROP_BUFFER 0x0002
+#define UNLOGGED_RELATION_INIT 0x0004
#endif /* REINIT_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a07715356b..75b9d41e4b 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,18 @@
#include "storage/block.h"
#include "storage/relfilelocator.h"
+/*
+ * A storage mark is a file whose existence indicates something about another
+ * file. Such files are named "<filename>.<mark>", where the mark is one of
+ * the values of StorageMarks. Since ".<digit>" denotes a segment file, don't
+ * use digits for the mark character.
+ */
+typedef enum StorageMarks
+{
+ SMGR_MARK_NONE = 0,
+ SMGR_MARK_UNCOMMITTED = 'u' /* the file is not committed yet */
+} StorageMarks;
+
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
@@ -87,7 +99,12 @@ extern void smgrcloseall(void);
extern void smgrcloserellocator(RelFileLocatorBackend rlocator);
extern void smgrrelease(SMgrRelation reln);
extern void smgrreleaseall(void);
+extern void smgrcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void smgrunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index f8302f1ed1..83b455752e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1970,6 +1970,7 @@ PatternInfoArray
Pattern_Prefix_Status
Pattern_Type
PendingFsyncEntry
+PendingMarkCleanup
PendingRelDelete
PendingRelSync
PendingUnlinkEntry
@@ -2600,6 +2601,7 @@ StdRdOptIndexCleanup
StdRdOptions
Step
StopList
+StorageMarks
StrategyNumber
StreamCtl
String
@@ -3609,6 +3611,7 @@ registered_buffer
regmatch_t
regoff_t
regproc
+relfile_entry
relopt_bool
relopt_enum
relopt_enum_elt_def
@@ -3662,6 +3665,7 @@ slist_iter
slist_mutable_iter
slist_node
slock_t
+smgr_mark_action
socket_set
socklen_t
spgBulkDeleteState
@@ -3861,8 +3865,11 @@ xl_replorigin_set
xl_restore_point
xl_running_xacts
xl_seq_rec
+xl_smgr_bufpersistence
xl_smgr_create
+xl_smgr_mark
xl_smgr_truncate
+xl_smgr_unlink
xl_standby_lock
xl_standby_locks
xl_tblspc_create_rec
--
2.31.1
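
For readers following the mark-file naming used in the patch above, here is a minimal standalone sketch, not taken from the patch. It only assumes the "<filename>.<mark>" convention described in smgr.h and the ".%c" suffix added to GetRelationPath(). The names build_name, parse_name, MARK_NONE and MARK_UNCOMMITTED are hypothetical stand-ins, and fork names are not validated against forkNames[] the way the real parser does it.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for SMGR_MARK_NONE / SMGR_MARK_UNCOMMITTED. */
#define MARK_NONE 0
#define MARK_UNCOMMITTED 'u'

/*
 * Build "<relnumber>[_<fork>][.<mark>]". This mirrors the idea of the
 * patched GetRelationPath(), which appends ".%c" only when a mark is given.
 */
static void
build_name(char *buf, size_t len, unsigned relnumber, const char *fork,
		   char mark)
{
	char		markstr[3] = "";

	if (mark != MARK_NONE)
		snprintf(markstr, sizeof(markstr), ".%c", mark);

	if (fork != NULL)
		snprintf(buf, len, "%u_%s%s", relnumber, fork, markstr);
	else
		snprintf(buf, len, "%u%s", relnumber, markstr);
}

/*
 * Parse "<digits>[_<fork>][.<segment digits>][.<mark char>]".
 * Returns false if anything is left over; on success *mark holds the
 * mark character, or MARK_NONE when there is no mark suffix.
 */
static bool
parse_name(const char *name, char *mark)
{
	size_t		pos = 0;

	/* leading relation number */
	while (isdigit((unsigned char) name[pos]))
		pos++;
	if (pos == 0)
		return false;

	/* optional fork suffix such as "_init" (assumption: lowercase only) */
	if (name[pos] == '_')
	{
		size_t		start = ++pos;

		while (islower((unsigned char) name[pos]))
			pos++;
		if (pos == start)
			return false;
	}

	/* optional segment number: a dot followed by digits */
	if (name[pos] == '.' && isdigit((unsigned char) name[pos + 1]))
	{
		pos++;
		while (isdigit((unsigned char) name[pos]))
			pos++;
	}

	/* optional mark: a dot followed by exactly one character */
	if (name[pos] == '.' && name[pos + 1] != '\0')
	{
		*mark = name[pos + 1];
		pos += 2;
	}
	else
		*mark = MARK_NONE;

	/* now we should be at the end */
	return name[pos] == '\0';
}
```

Because a digit after the dot is always taken as a segment number, a digit can never be read back as a mark, which is why the smgr.h comment asks that digits not be used as mark characters.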
Attachment: v25-0002-New-command-ALTER-TABLE-ALL-IN-TABLESPACE-SET.patch (text/x-patch; charset=us-ascii)
From 3e1c4a6aa94e6ab0ee2c3f762dde6ed1a54aa806 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 18 Nov 2022 13:29:59 +0900
Subject: [PATCH v25 2/2] New command ALTER TABLE ALL IN TABLESPACE SET
LOGGED/UNLOGGED
To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes
the persistence of all tables in the specified tablespace.
---
doc/src/sgml/ref/alter_table.sgml | 15 +++
src/backend/catalog/storage.c | 4 +-
src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++
src/backend/parser/gram.y | 42 +++++++
src/backend/tcop/utility.c | 11 ++
src/include/catalog/storage_xlog.h | 2 +-
src/include/commands/tablecmds.h | 2 +
src/include/nodes/parsenodes.h | 10 ++
src/test/regress/expected/tablespace.out | 76 ++++++++++++
src/test/regress/sql/tablespace.sql | 41 +++++++
10 files changed, 340 insertions(+), 3 deletions(-)
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 43d782fea9..d7a3aa3434 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -33,6 +33,8 @@ ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
SET SCHEMA <replaceable class="parameter">new_schema</replaceable>
ALTER TABLE ALL IN TABLESPACE <replaceable class="parameter">name</replaceable> [ OWNED BY <replaceable class="parameter">role_name</replaceable> [, ... ] ]
SET TABLESPACE <replaceable class="parameter">new_tablespace</replaceable> [ NOWAIT ]
+ALTER TABLE ALL IN TABLESPACE <replaceable class="parameter">name</replaceable> [ OWNED BY <replaceable class="parameter">role_name</replaceable> [, ... ] ]
+ SET { LOGGED | UNLOGGED } [ NOWAIT ]
ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
ATTACH PARTITION <replaceable class="parameter">partition_name</replaceable> { FOR VALUES <replaceable class="parameter">partition_bound_spec</replaceable> | DEFAULT }
ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
@@ -763,6 +765,19 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
(for identity or serial columns). However, it is also possible to
change the persistence of such sequences separately.
</para>
+ <para>
+ The persistence of all tables in the current database in a tablespace
+ can be changed by using the <literal>ALL IN TABLESPACE</literal> form,
+ which will first lock all tables to be changed and then change each
+ one. This form also supports
+ <literal>OWNED BY</literal>, which will only change tables owned by the
+ specified roles. If the <literal>NOWAIT</literal> option is specified,
+ then the command will fail if it is unable to immediately acquire all of
+ the locks required. The <literal>information_schema</literal> relations
+ are not considered part of the system catalogs and will be changed. See
+ also
+ <link linkend="sql-createtablespace"><command>CREATE TABLESPACE</command></link>.
+ </para>
</listitem>
</varlistentry>
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 595d6c2bb3..0f6d3e2ccf 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -509,14 +509,14 @@ log_smgrunlinkmark(const RelFileLocator *rlocator, ForkNumber forkNum,
* Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
*/
void
-log_smgrbufpersistence(const RelFileLocator *rlocator, bool persistence)
+log_smgrbufpersistence(const RelFileLocator rlocator, bool persistence)
{
xl_smgr_bufpersistence xlrec;
/*
* Make an XLOG entry reporting the change of buffer persistence.
*/
- xlrec.rlocator = *rlocator;
+ xlrec.rlocator = rlocator;
xlrec.persistence = persistence;
XLogBeginInsert();
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index b01bdaa51f..1d7b9c33eb 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14855,6 +14855,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
return new_tablespaceoid;
}
+/*
+ * Alter Table ALL ... SET LOGGED/UNLOGGED
+ *
+ * Allows a user to change persistence of all objects in a given tablespace in
+ * the current database. Objects can also be chosen based on their owner,
+ * to allow users to change the persistence of only their own objects. The
+ * main permissions handling is done by the lower-level change-persistence
+ * function.
+ *
+ * All to-be-modified objects are locked first. If NOWAIT is specified and the
+ * lock can't be acquired then we ereport(ERROR).
+ */
+void
+AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt)
+{
+ List *relations = NIL;
+ ListCell *l;
+ ScanKeyData key[1];
+ Relation rel;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ Oid tablespaceoid;
+ List *role_oids = roleSpecsToIds(stmt->roles);
+
+ /* Ensure we were not asked to change something we can't */
+ if (stmt->objtype != OBJECT_TABLE)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("only tables can be specified"));
+
+ /* Get the tablespace OID */
+ tablespaceoid = get_tablespace_oid(stmt->tablespacename, false);
+
+ /*
+ * Now that the checks are done, check if we should set it to
+ * InvalidOid because it is our database's default tablespace.
+ */
+ if (tablespaceoid == MyDatabaseTableSpace)
+ tablespaceoid = InvalidOid;
+
+ /*
+ * Walk the list of objects in the tablespace and pick them up. This will
+ * only find objects in our database, of course.
+ */
+ ScanKeyInit(&key[0],
+ Anum_pg_class_reltablespace,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(tablespaceoid));
+
+ rel = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(rel, 1, key);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relOid = relForm->oid;
+
+ /*
+ * Do not pick up objects in pg_catalog as part of this; if an admin
+ * really wishes to do so, they can issue the individual ALTER
+ * commands directly.
+ *
+ * Also, explicitly avoid any shared tables, temp tables, or TOAST
+ * (TOAST will be changed with the main table).
+ */
+ if (IsCatalogNamespace(relForm->relnamespace) ||
+ relForm->relisshared ||
+ isAnyTempNamespace(relForm->relnamespace) ||
+ IsToastNamespace(relForm->relnamespace))
+ continue;
+
+ /* Only pick up the object type requested */
+ if (relForm->relkind != RELKIND_RELATION)
+ continue;
+
+ /* Check if we are only picking-up objects owned by certain roles */
+ if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner))
+ continue;
+
+ /*
+ * Handle permissions-checking here since we are locking the tables
+ * and also to avoid doing a bunch of work only to fail part-way. Note
+ * that permissions will also be checked by AlterTableInternal().
+ *
+ * Caller must be considered an owner of the table whose persistence
+ * we're going to change.
+ */
+ if (!object_ownercheck(RelationRelationId, relOid, GetUserId()))
+ aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)),
+ NameStr(relForm->relname));
+
+ if (stmt->nowait &&
+ !ConditionalLockRelationOid(relOid, AccessExclusiveLock))
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("aborting because lock on relation \"%s.%s\" is not available",
+ get_namespace_name(relForm->relnamespace),
+ NameStr(relForm->relname)));
+ else
+ LockRelationOid(relOid, AccessExclusiveLock);
+
+ /*
+ * Add to our list of objects whose persistence we're going to
+ * change.
+ */
+ relations = lappend_oid(relations, relOid);
+ }
+
+ table_endscan(scan);
+ table_close(rel, AccessShareLock);
+
+ if (relations == NIL)
+ ereport(NOTICE,
+ errcode(ERRCODE_NO_DATA_FOUND),
+ errmsg("no matching relations in tablespace \"%s\" found",
+ tablespaceoid == InvalidOid ? "(database default)" :
+ get_tablespace_name(tablespaceoid)));
+
+ /*
+ * Everything is locked, loop through and change persistence of all of the
+ * relations.
+ */
+ foreach(l, relations)
+ {
+ List *cmds = NIL;
+ AlterTableCmd *cmd = makeNode(AlterTableCmd);
+
+ if (stmt->logged)
+ cmd->subtype = AT_SetLogged;
+ else
+ cmd->subtype = AT_SetUnLogged;
+
+ cmds = lappend(cmds, cmd);
+
+ EventTriggerAlterTableStart((Node *) stmt);
+ /* OID is set by AlterTableInternal */
+ AlterTableInternal(lfirst_oid(l), cmds, false);
+ EventTriggerAlterTableEnd();
+ }
+}
+
static void
index_copy_data(Relation rel, RelFileLocator newrlocator)
{
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 2a910ded15..a1bf2afacf 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -2087,6 +2087,48 @@ AlterTableStmt:
n->nowait = $13;
$$ = (Node *) n;
}
+ | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = true;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name OWNED BY role_list SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->roles = $9;
+ n->logged = true;
+ n->nowait = $12;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = false;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name OWNED BY role_list SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->roles = $9;
+ n->logged = false;
+ n->nowait = $12;
+ $$ = (Node *)n;
+ }
| ALTER INDEX qualified_name alter_table_cmds
{
AlterTableStmt *n = makeNode(AlterTableStmt);
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 247d0816ad..a784342058 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -164,6 +164,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_AlterTSConfigurationStmt:
case T_AlterTSDictionaryStmt:
case T_AlterTableMoveAllStmt:
+ case T_AlterTableSetLoggedAllStmt:
case T_AlterTableSpaceOptionsStmt:
case T_AlterTableStmt:
case T_AlterTypeStmt:
@@ -1760,6 +1761,12 @@ ProcessUtilitySlow(ParseState *pstate,
commandCollected = true;
break;
+ case T_AlterTableSetLoggedAllStmt:
+ AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree);
+ /* commands are stashed in AlterTableSetLoggedAll */
+ commandCollected = true;
+ break;
+
case T_DropStmt:
ExecDropStmt((DropStmt *) parsetree, isTopLevel);
/* no commands stashed for DROP */
@@ -2688,6 +2695,10 @@ CreateCommandTag(Node *parsetree)
tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype);
+ break;
+
case T_AlterTableStmt:
tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype);
break;
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index 9f48fb5e6f..0a3726f3db 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -88,7 +88,7 @@ extern void log_smgrcreatemark(const RelFileLocator *rlocator,
ForkNumber forkNum, StorageMarks mark);
extern void log_smgrunlinkmark(const RelFileLocator *rlocator,
ForkNumber forkNum, StorageMarks mark);
-extern void log_smgrbufpersistence(const RelFileLocator *rlocator,
+extern void log_smgrbufpersistence(const RelFileLocator rlocator,
bool persistence);
extern void smgr_redo(XLogReaderState *record);
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index 03f14d6be1..27286fff33 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
+extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt * stmt);
+
extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
Oid *oldschema);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 7caff62af7..738f3e20be 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2427,6 +2427,16 @@ typedef struct AlterTableMoveAllStmt
bool nowait;
} AlterTableMoveAllStmt;
+typedef struct AlterTableSetLoggedAllStmt
+{
+ NodeTag type;
+ char *tablespacename;
+ ObjectType objtype; /* Object type to change persistence of */
+ List *roles; /* List of roles to change objects of */
+ bool logged;
+ bool nowait;
+} AlterTableSetLoggedAllStmt;
+
/* ----------------------
* Create/Alter Extension Statements
* ----------------------
diff --git a/src/test/regress/expected/tablespace.out b/src/test/regress/expected/tablespace.out
index c52cf1cfcf..a679d58553 100644
--- a/src/test/regress/expected/tablespace.out
+++ b/src/test/regress/expected/tablespace.out
@@ -966,5 +966,81 @@ drop cascades to table testschema.part
drop cascades to table testschema.atable
drop cascades to materialized view testschema.amv
drop cascades to table testschema.tablespace_acl
+--
+-- Check persistence change in a tablespace
+CREATE SCHEMA testschema;
+GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1;
+CREATE TABLESPACE regress_tablespace LOCATION '';
+GRANT CREATE ON TABLESPACE regress_tablespace TO regress_tablespace_user1;
+CREATE TABLE testschema.lsu(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.usu(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lsu(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._usu(a int) TABLESPACE pg_default;
+SET ROLE regress_tablespace_user1;
+CREATE TABLE testschema.lu1(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.uu1(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lu1(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._uu1(a int) TABLESPACE pg_default;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | p
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | p
+ uu1 | regress_tablespace | u
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace
+ OWNED BY regress_tablespace_user1 SET LOGGED;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | p
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | p
+ uu1 | regress_tablespace | p
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+RESET ROLE;
+ALTER TABLE ALL IN TABLESPACE regress_tablespace SET UNLOGGED;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | u
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | u
+ uu1 | regress_tablespace | u
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+-- Should succeed
+DROP SCHEMA testschema CASCADE;
+NOTICE: drop cascades to 8 other objects
+DETAIL: drop cascades to table testschema.lsu
+drop cascades to table testschema.usu
+drop cascades to table testschema._lsu
+drop cascades to table testschema._usu
+drop cascades to table testschema.lu1
+drop cascades to table testschema.uu1
+drop cascades to table testschema._lu1
+drop cascades to table testschema._uu1
+DROP TABLESPACE regress_tablespace;
DROP ROLE regress_tablespace_user1;
DROP ROLE regress_tablespace_user2;
diff --git a/src/test/regress/sql/tablespace.sql b/src/test/regress/sql/tablespace.sql
index 21db433f2a..a4b664f4e0 100644
--- a/src/test/regress/sql/tablespace.sql
+++ b/src/test/regress/sql/tablespace.sql
@@ -431,5 +431,46 @@ DROP TABLESPACE regress_tblspace_renamed;
DROP SCHEMA testschema CASCADE;
+
+--
+-- Check persistence change in a tablespace
+CREATE SCHEMA testschema;
+GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1;
+CREATE TABLESPACE regress_tablespace LOCATION '';
+GRANT CREATE ON TABLESPACE regress_tablespace TO regress_tablespace_user1;
+
+CREATE TABLE testschema.lsu(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.usu(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lsu(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._usu(a int) TABLESPACE pg_default;
+SET ROLE regress_tablespace_user1;
+CREATE TABLE testschema.lu1(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.uu1(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lu1(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._uu1(a int) TABLESPACE pg_default;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace
+ OWNED BY regress_tablespace_user1 SET LOGGED;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+RESET ROLE;
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace SET UNLOGGED;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+-- Should succeed
+DROP SCHEMA testschema CASCADE;
+DROP TABLESPACE regress_tablespace;
+
DROP ROLE regress_tablespace_user1;
DROP ROLE regress_tablespace_user2;
--
2.31.1
I want to call out this part of this patch:
Also this allows for the cleanup of files left behind in the crash of
the transaction that created it.
This is interesting to a lot wider audience than ALTER TABLE SET
LOGGED/UNLOGGED. It also adds most of the complexity, with the new
marker files. Can you please split the first patch into two:
1. Cleanup of newly created relations on crash
2. ALTER TABLE SET LOGGED/UNLOGGED changes
Then we can review the first part independently.
Regarding the first part, I'm not sure the marker files are the best
approach to implement it. You need to create an extra file for every
relation, just to delete it at commit. It feels a bit silly, but maybe
it's OK in practice. The undo log patch set solved this problem with the
undo log, but it looks like that patch set isn't going anywhere. Maybe
invent a very lightweight version of the undo log for this?
- Heikki
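For readers following the thread, the marker-file lifecycle discussed above can be sketched roughly as below. This is a standalone POSIX illustration, not code from the patch: the ".uncommitted" suffix and the function names create_storage_with_mark, commit_storage and storage_is_uncommitted are made up for the example, and the fsync calls and WAL logging that a real implementation needs are left out.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical suffix; the patch names its mark files differently. */
#define MARK_SUFFIX ".uncommitted"

/* Build "<path><MARK_SUFFIX>" into buf; returns 0 on success, -1 if too long. */
static int
mark_path(const char *path, char *buf, size_t buflen)
{
	int			n;

	assert(path != NULL && buf != NULL);
	n = snprintf(buf, buflen, "%s%s", path, MARK_SUFFIX);
	return (n < 0 || (size_t) n >= buflen) ? -1 : 0;
}

/* Create an empty file; fails if the file already exists. */
static int
create_empty(const char *path)
{
	int			fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);

	if (fd < 0)
		return -1;
	return close(fd);
}

/*
 * Create the mark first and the storage file second, so that a storage
 * file never exists without its mark while the transaction is in progress.
 */
int
create_storage_with_mark(const char *path)
{
	char		mark[1024];

	if (mark_path(path, mark, sizeof(mark)) != 0)
		return -1;
	if (create_empty(mark) != 0)
		return -1;
	return create_empty(path);
}

/* At commit, only the mark goes away; the storage file stays. */
int
commit_storage(const char *path)
{
	char		mark[1024];

	if (mark_path(path, mark, sizeof(mark)) != 0)
		return -1;
	return unlink(mark);
}

/* Returns 1 if a mark is still present (creator never committed), else 0. */
int
storage_is_uncommitted(const char *path)
{
	char		mark[1024];

	if (mark_path(path, mark, sizeof(mark)) != 0)
		return 0;
	return access(mark, F_OK) == 0;
}
```

The cost Heikki points out is visible here: every new storage file needs one extra create and one extra unlink, even when the transaction commits normally.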
Thank you for the comment!
At Fri, 3 Feb 2023 08:42:52 +0100, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
I want to call out this part of this patch:
Also this allows for the cleanup of files left behind in the crash of
the transaction that created it.
This is interesting to a lot wider audience than ALTER TABLE SET
LOGGED/UNLOGGED. It also adds most of the complexity, with the new
marker files. Can you please split the first patch into two:
1. Cleanup of newly created relations on crash
2. ALTER TABLE SET LOGGED/UNLOGGED changes
Then we can review the first part independently.
Ah, indeed. I'll do that.
Regarding the first part, I'm not sure the marker files are the best
approach to implement it. You need to create an extra file for every
relation, just to delete it at commit. It feels a bit silly, but maybe
Agreed. (But I couldn't come up with a better idea.)
it's OK in practice. The undo log patch set solved this problem with
the undo log, but it looks like that patch set isn't going
anywhere. Maybe invent a very lightweight version of the undo log for
this?
I hadn't thought along those lines. Yes, the marker files are indeed a
kind of undo log.
Anyway, I'll split the current patch into two parts as suggested.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, 6 Feb 2023 at 23:48, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
Thank you for the comment!
At Fri, 3 Feb 2023 08:42:52 +0100, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
I want to call out this part of this patch:
Looks like this patch has received some solid feedback from Heikki and
you have a path forward. It's not currently building in the build farm
either.
I'll set the patch to Waiting on Author for now.
--
Gregory Stark
As Commitfest Manager
At Wed, 1 Mar 2023 14:56:25 -0500, "Gregory Stark (as CFM)" <stark.cfm@gmail.com> wrote in
On Mon, 6 Feb 2023 at 23:48, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
Thank you for the comment!
At Fri, 3 Feb 2023 08:42:52 +0100, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
I want to call out this part of this patch:
Looks like this patch has received some solid feedback from Heikki and
you have a path forward. It's not currently building in the build farm
either.
I'll set the patch to Waiting on Author for now.
To be precise, there are three parts. The attached patch is the first part -
the storage mark files, which are used to identify storage files that
have not been committed and should be removed during the next
startup. This feature resolves the issue of orphaned storage files
that may result from a crash occurring during the execution of a
transaction involving the creation of a new table.
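The startup-time cleanup described above could look roughly like the following sketch. It is not taken from the patch: the ".uncommitted" suffix and the function name remove_uncommitted_files are assumptions for illustration, it scans a single directory only, and it ignores buffers, forks, segments and fsync, all of which the real code has to handle.

```c
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical suffix; the patch names its mark files differently. */
#define MARK_SUFFIX ".uncommitted"

/*
 * Scan one directory and remove every storage file that still has a mark
 * file next to it. Returns the number of storage files removed, or -1 on
 * error.
 */
int
remove_uncommitted_files(const char *dir)
{
	DIR		   *d;
	struct dirent *de;
	int			removed = 0;
	size_t		suffixlen = strlen(MARK_SUFFIX);

	assert(dir != NULL);
	d = opendir(dir);
	if (d == NULL)
		return -1;

	while ((de = readdir(d)) != NULL)
	{
		size_t		len = strlen(de->d_name);
		char		markpath[1024];
		char		filepath[1024];
		struct stat st;
		size_t		baselen;
		int			n;

		/* Only names ending in the mark suffix are of interest. */
		if (len <= suffixlen ||
			strcmp(de->d_name + len - suffixlen, MARK_SUFFIX) != 0)
			continue;

		n = snprintf(markpath, sizeof(markpath), "%s/%s", dir, de->d_name);
		if (n < 0 || (size_t) n >= sizeof(markpath))
		{
			closedir(d);
			return -1;
		}

		/* Skip anything that is not a regular file. */
		if (stat(markpath, &st) != 0 || !S_ISREG(st.st_mode))
			continue;

		/* The storage file path is the mark path without the suffix. */
		baselen = (size_t) n - suffixlen;
		memcpy(filepath, markpath, baselen);
		filepath[baselen] = '\0';

		/*
		 * Remove the storage file first and the mark last, so that a crash
		 * in between leaves the mark behind and the cleanup is retried.
		 */
		if (unlink(filepath) == 0)
			removed++;
		else if (errno != ENOENT)
		{
			closedir(d);
			return -1;
		}
		if (unlink(markpath) != 0)
		{
			closedir(d);
			return -1;
		}
	}

	closedir(d);
	return removed;
}
```

Files without a mark are left alone, so committed relations in the same directory are not touched.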
I'll post all three parts shortly.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v26-0001-Storage-mark-files.patch (text/x-patch; charset=us-ascii)
From 1665e3428b9d777989864ea302eef8368a739e7e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 2 Mar 2023 17:25:12 +0900
Subject: [PATCH v26] Storage mark files
In certain situations, specific operations followed by a crash-restart
can result in orphaned storage files. These files cannot be removed
through standard methods. To address this issue, this commit
implements 'mark files' that convey information about a storage
file. Specifically, the "UNCOMMITTED" mark file is introduced to denote
files that have not been committed and should be removed during the
next startup.
---
src/backend/access/rmgrdesc/smgrdesc.c | 37 +++
src/backend/access/transam/README | 10 +
src/backend/access/transam/xact.c | 7 +
src/backend/access/transam/xlogrecovery.c | 18 ++
src/backend/backup/basebackup.c | 9 +-
src/backend/catalog/storage.c | 270 ++++++++++++++++++-
src/backend/storage/file/fd.c | 4 +-
src/backend/storage/file/reinit.c | 313 +++++++++++++++-------
src/backend/storage/smgr/md.c | 95 ++++++-
src/backend/storage/smgr/smgr.c | 32 +++
src/backend/storage/sync/sync.c | 21 +-
src/bin/pg_rewind/parsexlog.c | 16 ++
src/common/relpath.c | 47 ++--
src/include/catalog/storage.h | 3 +
src/include/catalog/storage_xlog.h | 35 ++-
src/include/common/relpath.h | 9 +-
src/include/storage/fd.h | 1 +
src/include/storage/md.h | 8 +-
src/include/storage/reinit.h | 8 +-
src/include/storage/smgr.h | 17 ++
src/test/recovery/t/013_crash_restart.pl | 21 ++
src/tools/pgindent/typedefs.list | 6 +
22 files changed, 843 insertions(+), 144 deletions(-)
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index bd841b96e8..f8187385c4 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,37 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+ char *path = relpathperm(xlrec->rlocator, xlrec->forkNum);
+
+ appendStringInfoString(buf, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+ char *path = GetRelationPath(xlrec->rlocator.dbOid,
+ xlrec->rlocator.spcOid,
+ xlrec->rlocator.relNumber,
+ InvalidBackendId,
+ xlrec->forkNum, xlrec->mark);
+ char *action = "<none>";
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ action = "CREATE";
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ action = "DELETE";
+ break;
+ }
+
+ appendStringInfo(buf, "%s %s", action, path);
+ pfree(path);
+ }
}
const char *
@@ -55,6 +86,12 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_UNLINK:
+ id = "UNLINK";
+ break;
+ case XLOG_SMGR_MARK:
+ id = "MARK";
+ break;
}
return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 22c8ae9755..bf83d19abd 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -741,6 +741,16 @@ we must panic and abort recovery. The DBA will have to manually clean up
then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.
+================================
+Smgr MARK files
+--------------------------------
+
+An smgr mark file is an empty file that is created alongside a new
+relation storage file to signal that the storage file must be cleaned
+up during recovery. In contrast to the four actions above, failing to
+remove these files would result in data loss, so the server shuts
+down in that case.
+
Skipping WAL for New RelFileLocator
--------------------------------
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b876401260..acbf8f1b12 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2227,6 +2227,9 @@ CommitTransaction(void)
*/
smgrDoPendingSyncs(true, is_parallel_worker);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2478,6 +2481,9 @@ PrepareTransaction(void)
*/
smgrDoPendingSyncs(true, false);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2806,6 +2812,7 @@ AbortTransaction(void)
AfterTriggerEndXact(false); /* 'false' means it's abort */
AtAbort_Portals();
smgrDoPendingSyncs(false, is_parallel_worker);
+ smgrDoPendingCleanups(false);
AtEOXact_LargeObject(false);
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index dbe9394762..4d28635f64 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -42,6 +42,7 @@
#include "access/xlogutils.h"
#include "backup/basebackup.h"
#include "catalog/pg_control.h"
+#include "catalog/storage.h"
#include "commands/tablespace.h"
#include "common/file_utils.h"
#include "miscadmin.h"
@@ -56,6 +57,7 @@
#include "storage/pmsignal.h"
#include "storage/proc.h"
#include "storage/procarray.h"
+#include "storage/reinit.h"
#include "storage/spin.h"
#include "utils/builtins.h"
#include "utils/datetime.h"
@@ -1795,6 +1797,14 @@ PerformWalRecovery(void)
RmgrCleanup();
+ /* clean up garbage files left during crash recovery */
+ if (!InArchiveRecovery)
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
ereport(LOG,
(errmsg("redo done at %X/%X system usage: %s",
LSN_FORMAT_ARGS(xlogreader->ReadRecPtr),
@@ -3134,6 +3144,14 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
{
ereport(DEBUG1,
(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+
+ /* clean up garbage files left during crash recovery */
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
InArchiveRecovery = true;
if (StandbyModeRequested)
EnableStandbyMode();
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 3fb9451643..1b9f909dbc 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -1191,6 +1191,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
ForkNumber relForkNum; /* Type of fork if file is a relation */
int relnumchars; /* Chars in filename that are the
* relnumber */
+ StorageMarks mark; /* marker file sign */
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1241,7 +1242,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
/* Exclude all forks for unlogged tables except the init fork */
if (isDbDir &&
parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &relForkNum))
+ &relForkNum, &mark))
{
/* Never exclude init forks */
if (relForkNum != INIT_FORKNUM)
@@ -1448,6 +1449,7 @@ is_checksummed_file(const char *fullpath, const char *filename)
strncmp(fullpath, "/", 1) == 0)
{
int excludeIdx;
+ char *p;
/* Compare file against noChecksumFiles skip list */
for (excludeIdx = 0; noChecksumFiles[excludeIdx].name != NULL; excludeIdx++)
@@ -1461,6 +1463,11 @@ is_checksummed_file(const char *fullpath, const char *filename)
return false;
}
+ /* exclude mark files */
+ p = strchr(filename, '.');
+ if (p && isalpha(p[1]) && p[2] == 0)
+ return false;
+
return true;
}
else
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index af1491aa1d..03e06246be 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,6 +19,7 @@
#include "postgres.h"
+#include "access/amapi.h"
#include "access/parallel.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -66,6 +67,21 @@ typedef struct PendingRelDelete
struct PendingRelDelete *next; /* linked-list link */
} PendingRelDelete;
+#define PCOP_UNLINK_FORK (1 << 0)
+#define PCOP_UNLINK_MARK (1 << 1)
+
+typedef struct PendingCleanup
+{
+ RelFileLocator rlocator; /* relation that needs a cleanup */
+ int op; /* operation mask */
+ ForkNumber unlink_forknum; /* forknum to unlink */
+ StorageMarks unlink_mark; /* mark to unlink */
+ BackendId backend; /* InvalidBackendId if not a temp rel */
+ bool atCommit; /* T=delete at commit; F=delete at abort */
+ int nestLevel; /* xact nesting level of request */
+ struct PendingCleanup *next; /* linked-list link */
+} PendingCleanup;
+
typedef struct PendingRelSync
{
RelFileLocator rlocator;
@@ -73,6 +89,7 @@ typedef struct PendingRelSync
} PendingRelSync;
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingCleanup * pendingCleanups = NULL; /* head of linked list */
static HTAB *pendingSyncHash = NULL;
@@ -123,6 +140,7 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
SMgrRelation srel;
BackendId backend;
bool needs_wal;
+ PendingCleanup *pendingclean;
Assert(!IsInParallelMode()); /* couldn't update pendingSyncHash */
@@ -145,9 +163,23 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
return NULL; /* placate compiler */
}
+ /*
+ * We are going to create a new storage file. If the server crashes before
+ * the current transaction ends, the file needs to be cleaned up. The
+ * SMGR_MARK_UNCOMMITTED mark file prompts that work at the next startup.
+ * We don't need this during WAL-logged CREATE DATABASE. See
+ * CreateAndCopyRelationData for details.
+ */
srel = smgropen(rlocator, backend);
+
+ if (register_delete)
+ {
+ log_smgrcreatemark(&rlocator, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+ }
+
smgrcreate(srel, MAIN_FORKNUM, false);
-
+
if (needs_wal)
log_smgrcreate(&srel->smgr_rlocator.locator, MAIN_FORKNUM);
@@ -157,16 +189,29 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
*/
if (register_delete)
{
- PendingRelDelete *pending;
+ PendingRelDelete *pendingdel;
- pending = (PendingRelDelete *)
+ pendingdel = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
- pending->rlocator = rlocator;
- pending->backend = backend;
- pending->atCommit = false; /* delete if abort */
- pending->nestLevel = GetCurrentTransactionNestLevel();
- pending->next = pendingDeletes;
- pendingDeletes = pending;
+ pendingdel->rlocator = rlocator;
+ pendingdel->backend = backend;
+ pendingdel->atCommit = false; /* delete if abort */
+ pendingdel->nestLevel = GetCurrentTransactionNestLevel();
+ pendingdel->next = pendingDeletes;
+ pendingDeletes = pendingdel;
+
+ /* drop mark files at commit */
+ pendingclean = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pendingclean->rlocator = rlocator;
+ pendingclean->op = PCOP_UNLINK_MARK;
+ pendingclean->unlink_forknum = MAIN_FORKNUM;
+ pendingclean->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pendingclean->backend = backend;
+ pendingclean->atCommit = true;
+ pendingclean->nestLevel = GetCurrentTransactionNestLevel();
+ pendingclean->next = pendingCleanups;
+ pendingCleanups = pendingclean;
}
if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
@@ -197,6 +242,69 @@ log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileLocator *rlocator, ForkNumber forkNum)
+{
+ xl_smgr_unlink xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file unlink.
+ */
+ xlrec.rlocator = *rlocator;
+ xlrec.forkNum = forkNum;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK record (mark creation) to WAL.
+ */
+void
+log_smgrcreatemark(const RelFileLocator *rlocator, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the mark file creation.
+ */
+ xlrec.rlocator = *rlocator;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_CREATE;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK record (mark removal) to WAL.
+ */
+void
+log_smgrunlinkmark(const RelFileLocator *rlocator, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the mark file removal.
+ */
+ xlrec.rlocator = *rlocator;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_UNLINK;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -711,6 +819,76 @@ smgrDoPendingDeletes(bool isCommit)
}
}
+/*
+ * smgrDoPendingCleanups() -- Clean up work that emits WAL records
+ *
+ * The operations handled in this function emit WAL records, which must be
+ * part of the current transaction.
+ */
+void
+smgrDoPendingCleanups(bool isCommit)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (pending->nestLevel < nestLevel)
+ {
+ /* outer-level entries should not be processed yet */
+ prev = pending;
+ }
+ else
+ {
+ /* unlink list entry first, so we don't retry on failure */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ /* do cleanup if called for */
+ if (pending->atCommit == isCommit)
+ {
+ SMgrRelation srel;
+
+ srel = smgropen(pending->rlocator, pending->backend);
+
+ Assert((pending->op &
+ ~(PCOP_UNLINK_FORK | PCOP_UNLINK_MARK)) == 0);
+
+ if (pending->op & PCOP_UNLINK_FORK)
+ {
+ /* Don't emit WAL during recovery. */
+ if (!InRecovery)
+ log_smgrunlink(&pending->rlocator,
+ pending->unlink_forknum);
+ smgrunlink(srel, pending->unlink_forknum, false);
+ }
+
+ if (pending->op & PCOP_UNLINK_MARK)
+ {
+ if (!InRecovery)
+ log_smgrunlinkmark(&pending->rlocator,
+ pending->unlink_forknum,
+ pending->unlink_mark);
+
+ smgrunlinkmark(srel, pending->unlink_forknum,
+ pending->unlink_mark, InRecovery);
+ smgrclose(srel);
+ }
+ }
+
+ /* must explicitly free the list entry */
+ pfree(pending);
+ /* prev does not change */
+ }
+ }
+}
+
/*
* smgrDoPendingSyncs() -- Take care of relation syncs at end of xact.
*/
@@ -971,6 +1149,15 @@ smgr_redo(XLogReaderState *record)
reln = smgropen(xlrec->rlocator, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rlocator, InvalidBackendId);
+ smgrunlink(reln, xlrec->forkNum, true);
+ smgrclose(reln);
+ }
else if (info == XLOG_SMGR_TRUNCATE)
{
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
@@ -1059,6 +1246,71 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ bool created = false;
+
+ reln = smgropen(xlrec->rlocator, InvalidBackendId);
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ smgrcreatemark(reln, xlrec->forkNum, xlrec->mark, true);
+ created = true;
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ smgrunlinkmark(reln, xlrec->forkNum, xlrec->mark, true);
+ break;
+ default:
+ elog(ERROR, "unknown smgr_mark action \"%c\"", xlrec->action);
+ }
+
+ if (created)
+ {
+ /* revert mark file operation at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = xlrec->rlocator;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = xlrec->forkNum;
+ pending->unlink_mark = xlrec->mark;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ else
+ {
+ /*
+ * Delete pending action for this mark file if any. We should have
+ * at most one entry for this action.
+ */
+ PendingCleanup *prev = NULL;
+
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileLocatorEquals(xlrec->rlocator, pending->rlocator) &&
+ pending->unlink_forknum == xlrec->forkNum &&
+ (pending->op & PCOP_UNLINK_MARK) != 0)
+ {
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+ }
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 9fd8444ed4..1b77347978 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -346,8 +346,6 @@ static void pre_sync_fname(const char *fname, bool isdir, int elevel);
static void datadir_fsync_fname(const char *fname, bool isdir, int elevel);
static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel);
-static int fsync_parent_path(const char *fname, int elevel);
-
/*
* pg_fsync --- do fsync with or without writethrough
@@ -3670,7 +3668,7 @@ fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel)
* This is aimed at making file operations persistent on disk in case of
* an OS crash or power failure.
*/
-static int
+int
fsync_parent_path(const char *fname, int elevel)
{
char parentpath[MAXPGPATH];
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index fb55371b1b..250cfe9e44 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -16,29 +16,45 @@
#include <unistd.h>
+#include "access/xlogrecovery.h"
+#include "catalog/pg_tablespace_d.h"
#include "common/relpath.h"
#include "postmaster/startup.h"
+#include "storage/bufmgr.h"
#include "storage/copydir.h"
#include "storage/fd.h"
+#include "storage/md.h"
#include "storage/reinit.h"
+#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
- int op);
+ Oid tspid, int op);
static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
- int op);
+ Oid tspid, Oid dbid, int op);
typedef struct
{
- Oid reloid; /* hash key */
-} unlogged_relation_entry;
+ RelFileNumber relNumber; /* hash key */
+ bool has_init; /* has INIT fork */
+ bool dirty_all; /* needs to remove all forks */
+} relfile_entry;
/*
- * Reset unlogged relations from before the last restart.
+ * Clean up and reset relation files from before the last restart.
*
- * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any
- * relation with an "init" fork, except for the "init" fork itself.
+ * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations
+ * depending on the existence of mark files.
+ *
+ * If the SMGR_MARK_UNCOMMITTED mark file for the main fork is present, we
+ * remove the whole relation along with the mark file.
+ *
+ * Otherwise, if the "init" fork is found, we remove all forks of the relation
+ * except for the "init" fork itself.
+ *
+ * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all
+ * relations that are to be cleaned up.
*
* If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main
* fork.
@@ -72,7 +88,7 @@ ResetUnloggedRelations(int op)
/*
* First process unlogged files in pg_default ($PGDATA/base)
*/
- ResetUnloggedRelationsInTablespaceDir("base", op);
+ ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op);
/*
* Cycle through directories for all non-default tablespaces.
@@ -81,13 +97,19 @@ ResetUnloggedRelations(int op)
while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
{
+ Oid tspid;
+
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
continue;
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
- ResetUnloggedRelationsInTablespaceDir(temp_path, op);
+
+ tspid = atooid(spc_de->d_name);
+
+ Assert(tspid != 0);
+ ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op);
}
FreeDir(spc_dir);
@@ -103,7 +125,8 @@ ResetUnloggedRelations(int op)
* Process one tablespace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
+ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
+ Oid tspid, int op)
{
DIR *ts_dir;
struct dirent *de;
@@ -130,6 +153,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
while ((de = ReadDir(ts_dir, tsdirname)) != NULL)
{
+ Oid dbid;
+
/*
* We're only interested in the per-database directories, which have
* numeric names. Note that this code will also (properly) ignore "."
@@ -148,7 +173,10 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
ereport_startup_progress("resetting unlogged relations (cleanup), elapsed time: %ld.%02d s, current path: %s",
dbspace_path);
- ResetUnloggedRelationsInDbspaceDir(dbspace_path, op);
+ dbid = atooid(de->d_name);
+ Assert(dbid != 0);
+
+ ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op);
}
FreeDir(ts_dir);
@@ -158,125 +186,200 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
* Process one per-dbspace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
+ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
+ Oid tspid, Oid dbid, int op)
{
DIR *dbspace_dir;
struct dirent *de;
char rm_path[MAXPGPATH * 2];
+ HTAB *hash;
+ HASHCTL ctl;
/* Caller must specify at least one operation. */
- Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0);
+ Assert((op & (UNLOGGED_RELATION_CLEANUP |
+ UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_INIT)) != 0);
/*
* Cleanup is a two-pass operation. First, we go through and identify all
* the files with init forks. Then, we go through again and nuke
* everything with the same OID except the init fork.
*/
+
+ /*
+ * It's possible that someone could create tons of unlogged relations in
+ * the same database & tablespace, so we'd better use a hash table rather
+ * than an array or linked list to keep track of which files need to be
+ * reset. Otherwise, this cleanup operation would be O(n^2).
+ */
+ memset(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(RelFileNumber);
+ ctl.entrysize = sizeof(relfile_entry);
+ hash = hash_create("unlogged relation RelFileNumbers",
+ 32, &ctl, HASH_ELEM | HASH_BLOBS);
+
+ /* Collect INIT fork and mark files in the directory. */
+ dbspace_dir = AllocateDir(dbspacedirname);
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
+ {
+ ForkNumber forkNum;
+ int relnumchars;
+ StorageMarks mark;
+
+ /* Skip anything that doesn't look like a relation data file. */
+ if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
+ &forkNum, &mark))
+ continue;
+
+ if (forkNum == INIT_FORKNUM || mark == SMGR_MARK_UNCOMMITTED)
+ {
+ RelFileNumber key;
+ relfile_entry *ent;
+ bool found;
+
+ /*
+ * Put the relfilenumber portion of the name into the hash table,
+ * if it isn't already there. If an SMGR_MARK_UNCOMMITTED mark
+ * file exists, the storage file was left in a dirty state and
+ * needs to be cleaned up.
+ */
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_ENTER, &found);
+
+ if (!found)
+ {
+ ent->has_init = false;
+ ent->dirty_all = false;
+ }
+
+ if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_all = true;
+ else
+ {
+ Assert(forkNum == INIT_FORKNUM);
+ ent->has_init = true;
+ }
+ }
+ }
+
+ /* Done with the first pass. */
+ FreeDir(dbspace_dir);
+
+ /* Nothing to do if we found neither init forks nor mark files. */
+ if (hash_get_num_entries(hash) < 1)
+ {
+ hash_destroy(hash);
+ return;
+ }
+
+ if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0)
+ {
+ /*
+ * When we get here after recovery, an smgr object for this file may
+ * already have been created. In that case we need to drop all of its
+ * buffers and then close the smgr object; otherwise the checkpointer
+ * would wrongly try to flush buffers for nonexistent relation storage.
+ * This is safe as long as no other backends have accessed the relation
+ * before archive recovery starts.
+ */
+ HASH_SEQ_STATUS status;
+ relfile_entry *ent;
+ SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8);
+ int maxrels = 8;
+ int nrels = 0;
+ int i;
+
+ Assert(!HotStandbyActive());
+
+ hash_seq_init(&status, hash);
+ while ((ent = (relfile_entry *) hash_seq_search(&status)) != NULL)
+ {
+ RelFileLocatorBackend rel;
+
+ if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ rel.backend = InvalidBackendId;
+ rel.locator.spcOid = tspid;
+ rel.locator.dbOid = dbid;
+ rel.locator.relNumber = ent->relNumber;
+
+ srels[nrels++] = smgropen(rel.locator, InvalidBackendId);
+ }
+
+ DropRelationsAllBuffers(srels, nrels);
+
+ for (i = 0; i < nrels; i++)
+ smgrclose(srels[i]);
+
+ pfree(srels);
+ }
+
+ /*
+ * Now, make a second pass and remove anything that matches.
+ */
if ((op & UNLOGGED_RELATION_CLEANUP) != 0)
{
- HTAB *hash;
- HASHCTL ctl;
-
- /*
- * It's possible that someone could create a ton of unlogged relations
- * in the same database & tablespace, so we'd better use a hash table
- * rather than an array or linked list to keep track of which files
- * need to be reset. Otherwise, this cleanup operation would be
- * O(n^2).
- */
- ctl.keysize = sizeof(Oid);
- ctl.entrysize = sizeof(unlogged_relation_entry);
- ctl.hcxt = CurrentMemoryContext;
- hash = hash_create("unlogged relation OIDs", 32, &ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- /* Scan the directory. */
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int relnumchars;
- unlogged_relation_entry ent;
+ RelFileNumber key;
+ relfile_entry *ent;
+ RelFileLocatorBackend rel;
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
- continue;
-
- /* Also skip it unless this is the init fork. */
- if (forkNum != INIT_FORKNUM)
- continue;
-
- /*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
- */
- ent.reloid = atooid(de->d_name);
- (void) hash_search(hash, &ent, HASH_ENTER, NULL);
- }
-
- /* Done with the first pass. */
- FreeDir(dbspace_dir);
-
- /*
- * If we didn't find any init forks, there's no point in continuing;
- * we can bail out now.
- */
- if (hash_get_num_entries(hash) == 0)
- {
- hash_destroy(hash);
- return;
- }
-
- /*
- * Now, make a second pass and remove anything that matches.
- */
- dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
- {
- ForkNumber forkNum;
- int relnumchars;
- unlogged_relation_entry ent;
-
- /* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
- continue;
-
- /* We never remove the init fork. */
- if (forkNum == INIT_FORKNUM)
+ &forkNum, &mark))
continue;
/*
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
- if (hash_search(hash, &ent, HASH_FIND, NULL))
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);
+
+ if (!ent)
+ continue;
+
+ if (!ent->dirty_all)
{
- snprintf(rm_path, sizeof(rm_path), "%s/%s",
- dbspacedirname, de->d_name);
- if (unlink(rm_path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m",
- rm_path)));
- else
- elog(DEBUG2, "unlinked file \"%s\"", rm_path);
+ /* clean permanent relations don't need cleanup */
+ if (!ent->has_init)
+ continue;
+
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE)
+ continue;
}
+
+ /* so, nuke it! */
+ snprintf(rm_path, sizeof(rm_path), "%s/%s",
+ dbspacedirname, de->d_name);
+ if (unlink(rm_path) < 0)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ rm_path));
+
+ rel.backend = InvalidBackendId;
+ rel.locator.spcOid = tspid;
+ rel.locator.dbOid = dbid;
+ rel.locator.relNumber = atooid(de->d_name);
+
+ ForgetRelationForkSyncRequests(rel, forkNum);
}
/* Cleanup is complete. */
FreeDir(dbspace_dir);
- hash_destroy(hash);
}
/*
* Initialization happens after cleanup is complete: we copy each init
- * fork file to the corresponding main fork file. Note that if we are
- * asked to do both cleanup and init, we may never get here: if the
- * cleanup code determines that there are no init forks in this dbspace,
- * it will return before we get to this point.
+ * fork file to the corresponding main fork file.
*/
if ((op & UNLOGGED_RELATION_INIT) != 0)
{
@@ -285,6 +388,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int relnumchars;
char relnumbuf[OIDCHARS + 1];
char srcpath[MAXPGPATH * 2];
@@ -292,9 +396,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -328,15 +434,18 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int relnumchars;
char relnumbuf[OIDCHARS + 1];
char mainpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -379,7 +488,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
*/
bool
parse_filename_for_nontemp_relation(const char *name, int *relnumchars,
- ForkNumber *fork)
+ ForkNumber *fork, StorageMarks *mark)
{
int pos;
@@ -410,11 +519,19 @@ parse_filename_for_nontemp_relation(const char *name, int *relnumchars,
for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar)
;
- if (segchar <= 1)
- return false;
- pos += segchar;
+ if (segchar > 1)
+ pos += segchar;
}
+ /* mark file? */
+ if (name[pos] == '.' && name[pos + 1] != 0)
+ {
+ *mark = name[pos + 1];
+ pos += 2;
+ }
+ else
+ *mark = SMGR_MARK_NONE;
+
/* Now we should be at the end. */
if (name[pos] != '\0')
return false;
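For anyone trying to follow the new file naming, here is a small standalone sketch of how a name like "16384_init.u" decomposes into relfilenumber, fork, optional segment number and optional mark. This is only an illustration under my own assumptions, not the patch's parse_filename_for_nontemp_relation(): the names parse_relfile_name, ParsedRelName and fork_names are made up for the example, and the fork numbering is hard-coded.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical result type for this sketch; not part of the patch. */
typedef struct ParsedRelName
{
	int		relnumchars;	/* length of the leading digit string */
	int		fork;			/* 0 = main, 1 = fsm, 2 = vm, 3 = init */
	char	mark;			/* 0 if no mark, else the mark character */
} ParsedRelName;

static const char *const fork_names[] = {"main", "fsm", "vm", "init"};

/* Returns true if "name" looks like <digits>[_<fork>][.<segno>][.<mark>]. */
bool
parse_relfile_name(const char *name, ParsedRelName *out)
{
	int		pos = 0;

	assert(name != NULL && out != NULL);
	out->fork = 0;
	out->mark = 0;

	/* Leading digits: the relfilenumber. */
	while (isdigit((unsigned char) name[pos]))
		pos++;
	if (pos == 0)
		return false;
	out->relnumchars = pos;

	/* Optional "_<forkname>"; the main fork has no suffix. */
	if (name[pos] == '_')
	{
		bool	matched = false;

		for (int f = 1; f < 4; f++)
		{
			size_t	len = strlen(fork_names[f]);

			if (strncmp(name + pos + 1, fork_names[f], len) == 0)
			{
				out->fork = f;
				pos += 1 + (int) len;
				matched = true;
				break;
			}
		}
		if (!matched)
			return false;
	}

	/* Optional ".<digits>" segment number. */
	if (name[pos] == '.' && isdigit((unsigned char) name[pos + 1]))
	{
		pos++;
		while (isdigit((unsigned char) name[pos]))
			pos++;
	}

	/* Optional ".<mark>": a single non-digit character. */
	if (name[pos] == '.' && name[pos + 1] != '\0')
	{
		out->mark = name[pos + 1];
		pos += 2;
	}

	/* Anything left over means this is not a relation file name. */
	return name[pos] == '\0';
}
```

Because a segment suffix is ".<digits>", the mark character has to be a non-digit for the two suffixes to be distinguishable, which is the constraint stated in the StorageMarks comment in smgr.h.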
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 352958e1fe..0b64635fb8 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -141,7 +141,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forknum,
BlockNumber blkno, bool skipFsync, int behavior);
static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
MdfdVec *seg);
-
+static bool mdmarkexists(SMgrRelation reln, ForkNumber forkNum,
+ StorageMarks mark);
/*
* mdinit() -- Initialize private state for magnetic disk storage manager.
@@ -173,6 +174,82 @@ mdexists(SMgrRelation reln, ForkNumber forknum)
return (mdopenfork(reln, forknum, EXTENSION_RETURN_NULL) != NULL);
}
+/*
+ * mdcreatemark() -- Create a mark file.
+ *
+ * If isRedo is true, it's okay for the file to exist already.
+ */
+void
+mdcreatemark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rlocator, forkNum, mark);
+ int fd;
+
+ /* See mdcreate() for details. */
+ TablespaceCreateDbspace(reln->smgr_rlocator.locator.spcOid,
+ reln->smgr_rlocator.locator.dbOid,
+ isRedo);
+
+ fd = BasicOpenFile(path, O_WRONLY | O_CREAT | O_EXCL);
+ if (fd < 0 && (!isRedo || errno != EEXIST))
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not create mark file \"%s\": %m", path));
+
+ /* fd is invalid here only if the file already existed during redo */
+ if (fd >= 0)
+ {
+ pg_fsync(fd);
+ close(fd);
+ }
+
+ /*
+ * To guarantee that the creation of the file is persistent, fsync its
+ * parent directory.
+ */
+ fsync_parent_path(path, ERROR);
+
+ pfree(path);
+}
+
+
+/*
+ * mdunlinkmark() -- Delete the mark file
+ *
+ * If isRedo is true, it's okay if the file does not exist.
+ */
+void
+mdunlinkmark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rlocator, forkNum, mark);
+
+ if (!isRedo || mdmarkexists(reln, forkNum, mark))
+ durable_unlink(path, ERROR);
+
+ pfree(path);
+}
+
+/*
+ * mdmarkexists() -- Check whether the mark file exists.
+ */
+static bool
+mdmarkexists(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark)
+{
+ char *path = markpath(reln->smgr_rlocator, forkNum, mark);
+ int fd;
+
+ fd = BasicOpenFile(path, O_RDONLY);
+ if (fd < 0 && errno != ENOENT)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not access mark file \"%s\": %m", path));
+ pfree(path);
+
+ if (fd < 0)
+ return false;
+
+ close(fd);
+ return true;
+}
+
/*
* mdcreate() -- Create a new relation on magnetic disk.
*
@@ -1085,6 +1162,16 @@ register_forget_request(RelFileLocatorBackend rlocator, ForkNumber forknum,
RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ );
}
+/*
+ * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork
+ */
+void
+ForgetRelationForkSyncRequests(RelFileLocatorBackend rlocator,
+ ForkNumber forknum)
+{
+ register_forget_request(rlocator, forknum, 0);
+}
+
/*
* ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB
*/
@@ -1445,12 +1532,14 @@ mdsyncfiletag(const FileTag *ftag, char *path)
* Return 0 on success, -1 on failure, with errno set.
*/
int
-mdunlinkfiletag(const FileTag *ftag, char *path)
+mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark)
{
char *p;
/* Compute the path. */
- p = relpathperm(ftag->rlocator, MAIN_FORKNUM);
+ p = GetRelationPath(ftag->rlocator.dbOid, ftag->rlocator.spcOid,
+ ftag->rlocator.relNumber, InvalidBackendId,
+ MAIN_FORKNUM, mark);
strlcpy(path, p, MAXPGPATH);
pfree(p);
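To make the durability argument behind mdcreatemark() easier to review, here is a plain POSIX sketch of the same sequence: create the empty file exclusively, fsync it, then fsync the parent directory so that the directory entry itself survives a crash. This is a sketch under my own assumptions rather than the patch code: create_mark_file and sync_parent_dir are hypothetical names, the allow_existing flag stands in for isRedo, and errors are reported through return values instead of ereport().

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: fsync the directory that contains "path". */
int
sync_parent_dir(const char *path)
{
	char	dir[1024];
	char   *slash;
	int		fd;
	int		rc;
	int		save_errno;

	if (strlen(path) >= sizeof(dir))
	{
		errno = ENAMETOOLONG;
		return -1;
	}
	strcpy(dir, path);

	/* Strip the last path component to get the parent directory. */
	slash = strrchr(dir, '/');
	if (slash == NULL)
		strcpy(dir, ".");
	else if (slash == dir)
		dir[1] = '\0';
	else
		*slash = '\0';

	fd = open(dir, O_RDONLY);
	if (fd < 0)
		return -1;
	rc = fsync(fd);
	save_errno = errno;
	close(fd);
	errno = save_errno;
	return rc;
}

/*
 * Hypothetical helper: durably create an empty mark file.
 * With allow_existing (think "isRedo"), an existing file is not an error.
 * Returns 0 on success, -1 on failure with errno set.
 */
int
create_mark_file(const char *path, bool allow_existing)
{
	int		fd;

	assert(path != NULL);

	fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (fd < 0)
	{
		if (!(allow_existing && errno == EEXIST))
			return -1;
		/* Already there: no descriptor to sync, but still sync the dir. */
	}
	else
	{
		if (fsync(fd) != 0)
		{
			int		save_errno = errno;

			close(fd);
			errno = save_errno;
			return -1;
		}
		if (close(fd) != 0)
			return -1;
	}

	return sync_parent_dir(path);
}
```

The point to notice is the "already exists" branch: there is no valid descriptor in that case, so the fsync/close pair has to be skipped while the directory sync is still performed.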
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index dc466e5414..9969d84209 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -62,6 +62,10 @@ typedef struct f_smgr
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+ void (*smgr_createmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+ void (*smgr_unlinkmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
} f_smgr;
static const f_smgr smgrsw[] = {
@@ -82,6 +86,8 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
+ .smgr_createmark = mdcreatemark,
+ .smgr_unlinkmark = mdunlinkmark,
}
};
@@ -371,6 +377,26 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo);
}
+/*
+ * smgrcreatemark() -- Create a mark file
+ */
+void
+smgrcreatemark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_createmark(reln, forknum, mark, isRedo);
+}
+
+/*
+ * smgrunlinkmark() -- Delete a mark file
+ */
+void
+smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlinkmark(reln, forknum, mark, isRedo);
+}
+
/*
* smgrdosyncall() -- Immediately sync all forks of all given relations
*
@@ -693,6 +719,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+/*
+ * smgrunlink() -- Unlink the storage file of a relation fork
+ */
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rlocator, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 768d1dbfc4..9d99cb8fef 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -91,7 +91,8 @@ static CycleCtr checkpoint_cycle_ctr = 0;
typedef struct SyncOps
{
int (*sync_syncfiletag) (const FileTag *ftag, char *path);
- int (*sync_unlinkfiletag) (const FileTag *ftag, char *path);
+ int (*sync_unlinkfiletag) (const FileTag *ftag, char *path,
+ StorageMarks mark);
bool (*sync_filetagmatches) (const FileTag *ftag,
const FileTag *candidate);
} SyncOps;
@@ -235,7 +236,8 @@ SyncPostCheckpoint(void)
/* Unlink the file */
if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
- path) < 0)
+ path,
+ SMGR_MARK_NONE) < 0)
{
/*
* There's a race condition, when the database is dropped at the
@@ -244,6 +246,21 @@ SyncPostCheckpoint(void)
* here. rmtree() also has to ignore ENOENT errors, to deal with
* the possibility that we delete the file first.
*/
+ if (errno != ENOENT)
+ ereport(WARNING,
+ errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path));
+ }
+ else if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
+ path,
+ SMGR_MARK_UNCOMMITTED)
+ < 0)
+ {
+ /*
+ * There may also be an SMGR_MARK_UNCOMMITTED mark file. We try to
+ * remove it only after the fork file has been removed successfully.
+ * It's fine if the mark file does not exist.
+ */
if (errno != ENOENT)
ereport(WARNING,
(errcode_for_file_access(),
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 27782237d0..e9e4bafb01 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -418,6 +418,22 @@ extractPageInfo(XLogReaderState *record)
* source system.
*/
}
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_UNLINK)
+ {
+ /*
+ * We can safely ignore these. We'll see that the files don't exist in
+ * the target data dir, and copy them in from the source system. No
+ * need to do anything special here.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_MARK)
+ {
+ /*
+ * We can safely ignore these. The file will be removed from the
+ * target if it doesn't exist in the source system. Mark files are
+ * empty, so we don't need to care about their contents.
+ */
+ }
else if (rmid == RM_XACT_ID &&
((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT ||
(rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED ||
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 87de5f6c96..b1f6832cfa 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -139,9 +139,15 @@ GetDatabasePath(Oid dbOid, Oid spcOid)
*/
char *
GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
- int backendId, ForkNumber forkNumber)
+ int backendId, ForkNumber forkNumber, char mark)
{
char *path;
+ char markstr[4];
+
+ if (mark == 0)
+ markstr[0] = 0;
+ else
+ snprintf(markstr, sizeof(markstr), ".%c", mark);
if (spcOid == GLOBALTABLESPACE_OID)
{
@@ -149,10 +155,10 @@ GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
Assert(dbOid == 0);
Assert(backendId == InvalidBackendId);
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("global/%u_%s",
- relNumber, forkNames[forkNumber]);
+ path = psprintf("global/%u_%s%s",
+ relNumber, forkNames[forkNumber], markstr);
else
- path = psprintf("global/%u", relNumber);
+ path = psprintf("global/%u%s", relNumber, markstr);
}
else if (spcOid == DEFAULTTABLESPACE_OID)
{
@@ -160,22 +166,22 @@ GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/%u_%s",
+ path = psprintf("base/%u/%u_%s%s",
dbOid, relNumber,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/%u",
- dbOid, relNumber);
+ path = psprintf("base/%u/%u%s",
+ dbOid, relNumber, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/t%d_%u_%s",
+ path = psprintf("base/%u/t%d_%u_%s%s",
dbOid, backendId, relNumber,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/t%d_%u",
- dbOid, backendId, relNumber);
+ path = psprintf("base/%u/t%d_%u%s",
+ dbOid, backendId, relNumber, markstr);
}
}
else
@@ -184,27 +190,28 @@ GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u_%s%s",
spcOid, TABLESPACE_VERSION_DIRECTORY,
dbOid, relNumber,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u%s",
spcOid, TABLESPACE_VERSION_DIRECTORY,
- dbOid, relNumber);
+ dbOid, relNumber, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s%s",
spcOid, TABLESPACE_VERSION_DIRECTORY,
dbOid, backendId, relNumber,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u%s",
spcOid, TABLESPACE_VERSION_DIRECTORY,
- dbOid, backendId, relNumber);
+ dbOid, backendId, relNumber, markstr);
}
}
+
return path;
}
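As a quick illustration of what the extra "mark" argument does to the generated paths, here is a cut-down standalone sketch covering only the default tablespace and non-temporary relations. The function build_rel_path and its signature are hypothetical and exist only for this example; the real GetRelationPath() also handles the global tablespace, other tablespaces and backend-local files, and returns a palloc'd string.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical, simplified path builder: default tablespace only.
 * forkname is NULL for the main fork; mark is 0 for "no mark".
 * Returns the snprintf() result.
 */
int
build_rel_path(char *buf, size_t buflen, unsigned dbOid, unsigned relNumber,
			   const char *forkname, char mark)
{
	char	markstr[3] = "";

	assert(buf != NULL && buflen > 0);

	/* A mark is appended as ".<char>" after the fork suffix, if any. */
	if (mark != 0)
	{
		markstr[0] = '.';
		markstr[1] = mark;
		markstr[2] = '\0';
	}

	if (forkname != NULL)
		return snprintf(buf, buflen, "base/%u/%u_%s%s",
						dbOid, relNumber, forkname, markstr);

	return snprintf(buf, buflen, "base/%u/%u%s", dbOid, relNumber, markstr);
}
```

With mark = 0 the output is identical to the pre-patch path, which is why the existing relpathbackend() macro can simply pass 0.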
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 45a3c7835c..0b39c6ef56 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -25,6 +25,8 @@ extern PGDLLIMPORT int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileLocator rlocator,
char relpersistence,
bool register_delete);
+extern void RelationCreateInitFork(Relation rel);
+extern void RelationDropInitFork(Relation rel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileLocator rlocator, bool atCommit);
extern void RelationPreTruncate(Relation rel);
@@ -43,6 +45,7 @@ extern void RestorePendingSyncs(char *startAddress);
extern void smgrDoPendingDeletes(bool isCommit);
extern void smgrDoPendingSyncs(bool isCommit, bool isParallelWorker);
extern int smgrGetPendingDeletes(bool forCommit, RelFileLocator **ptr);
+extern void smgrDoPendingCleanups(bool isCommit);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index 6b0a7aa3df..a36646c6ee 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -18,17 +18,23 @@
#include "lib/stringinfo.h"
#include "storage/block.h"
#include "storage/relfilelocator.h"
+#include "storage/smgr.h"
/*
* Declarations for smgr-related XLOG records
*
- * Note: we log file creation and truncation here, but logging of deletion
- * actions is handled by xact.c, because it is part of transaction commit.
+ * Note: we log file creation, truncation and buffer persistence changes here,
+ * but logging of deletion actions is handled mainly by xact.c, because it is
+ * part of transaction commit in most cases. The exception is init forks,
+ * which can be deleted outside of transaction control.
*/
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_UNLINK 0x30
+#define XLOG_SMGR_MARK 0x40
+#define XLOG_SMGR_BUFPERSISTENCE 0x50
typedef struct xl_smgr_create
{
@@ -36,6 +42,26 @@ typedef struct xl_smgr_create
ForkNumber forkNum;
} xl_smgr_create;
+typedef struct xl_smgr_unlink
+{
+ RelFileLocator rlocator;
+ ForkNumber forkNum;
+} xl_smgr_unlink;
+
+typedef enum smgr_mark_action
+{
+ XLOG_SMGR_MARK_CREATE = 'c',
+ XLOG_SMGR_MARK_UNLINK = 'u'
+} smgr_mark_action;
+
+typedef struct xl_smgr_mark
+{
+ RelFileLocator rlocator;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ smgr_mark_action action;
+} xl_smgr_mark;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +77,11 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileLocator *rlocator, ForkNumber forkNum);
+extern void log_smgrcreatemark(const RelFileLocator *rlocator,
+ ForkNumber forkNum, StorageMarks mark);
+extern void log_smgrunlinkmark(const RelFileLocator *rlocator,
+ ForkNumber forkNum, StorageMarks mark);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index 511c21682e..28c9dbcd13 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -74,7 +74,7 @@ extern int forkname_chars(const char *str, ForkNumber *fork);
extern char *GetDatabasePath(Oid dbOid, Oid spcOid);
extern char *GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
- int backendId, ForkNumber forkNumber);
+ int backendId, ForkNumber forkNumber, char mark);
/*
* Wrapper macros for GetRelationPath. Beware of multiple
@@ -84,7 +84,7 @@ extern char *GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
/* First argument is a RelFileLocator */
#define relpathbackend(rlocator, backend, forknum) \
GetRelationPath((rlocator).dbOid, (rlocator).spcOid, (rlocator).relNumber, \
- backend, forknum)
+ backend, forknum, 0)
/* First argument is a RelFileLocator */
#define relpathperm(rlocator, forknum) \
@@ -94,4 +94,9 @@ extern char *GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
#define relpath(rlocator, forknum) \
relpathbackend((rlocator).locator, (rlocator).backend, forknum)
+/* First argument is a RelFileLocatorBackend */
+#define markpath(rlocator, forknum, mark) \
+ GetRelationPath((rlocator).locator.dbOid, (rlocator).locator.spcOid, \
+ (rlocator).locator.relNumber, \
+ (rlocator).backend, forknum, mark)
#endif /* RELPATH_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index f85de97d08..91612f2e42 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -179,6 +179,7 @@ extern void pg_flush_data(int fd, off_t offset, off_t nbytes);
extern int pg_truncate(const char *path, off_t length);
extern void fsync_fname(const char *fname, bool isdir);
extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
+extern int fsync_parent_path(const char *fname, int elevel);
extern int durable_rename(const char *oldfile, const char *newfile, int elevel);
extern int durable_unlink(const char *fname, int elevel);
extern void SyncDataDirectory(void);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 8f32af9ef3..37de1a0d7b 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -23,6 +23,10 @@
extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
extern void mdclose(SMgrRelation reln, ForkNumber forknum);
+extern void mdcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void mdunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
extern void mdunlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo);
@@ -41,12 +45,14 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void ForgetRelationForkSyncRequests(RelFileLocatorBackend rlocator,
+ ForkNumber forknum);
extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo);
/* md sync callbacks */
extern int mdsyncfiletag(const FileTag *ftag, char *path);
-extern int mdunlinkfiletag(const FileTag *ftag, char *path);
+extern int mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark);
extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *candidate);
#endif /* MD_H */
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index e2bbb5abe9..119dac1505 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -16,14 +16,16 @@
#define REINIT_H
#include "common/relpath.h"
-
+#include "storage/smgr.h"
extern void ResetUnloggedRelations(int op);
extern bool parse_filename_for_nontemp_relation(const char *name,
int *relnumchars,
- ForkNumber *fork);
+ ForkNumber *fork,
+ StorageMarks *mark);
#define UNLOGGED_RELATION_CLEANUP 0x0001
-#define UNLOGGED_RELATION_INIT 0x0002
+#define UNLOGGED_RELATION_DROP_BUFFER 0x0002
+#define UNLOGGED_RELATION_INIT 0x0004
#endif /* REINIT_H */
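Since the UNLOGGED_RELATION_CLEANUP pass now has two reasons to remove a file, it may help to see the per-file decision from the second pass in ResetUnloggedRelationsInDbspaceDir() written as a single predicate. This is my reading of the loop in reinit.c, restated as a standalone sketch; should_remove_file and the SketchFork enum are hypothetical names introduced for the example, and the function assumes the relation already has an entry in the hash table.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical local copy of the fork numbering, for the sketch only. */
enum SketchFork
{
	SK_MAIN_FORK = 0,
	SK_FSM_FORK = 1,
	SK_VM_FORK = 2,
	SK_INIT_FORK = 3
};

/*
 * Decide whether one directory entry should be unlinked during cleanup.
 *
 * has_init:  the relation has an init fork (it is unlogged)
 * dirty_all: the main fork carries an "uncommitted" mark file
 * fork:      fork of the file being looked at
 * mark:      0 for a plain fork file, else the mark character
 */
bool
should_remove_file(bool has_init, bool dirty_all, int fork, char mark)
{
	assert(fork >= SK_MAIN_FORK && fork <= SK_INIT_FORK);

	/* An uncommitted relation is removed entirely, marks included. */
	if (dirty_all)
		return true;

	/* A clean permanent relation needs no cleanup. */
	if (!has_init)
		return false;

	/* For an unlogged relation, keep only the init fork itself. */
	if (fork == SK_INIT_FORK && mark == 0)
		return false;

	return true;
}
```

One consequence worth noting: a mark file attached to an init fork is removed for an unlogged relation, while the init fork itself is kept.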
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 0935144f42..da6e0f3d64 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,18 @@
#include "storage/block.h"
#include "storage/relfilelocator.h"
+/*
+ * A storage mark is a file whose existence indicates something about the
+ * state of another file. Mark files are named "<filename>.<mark>", where
+ * <mark> is one of the values of StorageMarks. Since ".<digit>" denotes a
+ * segment file, digits must not be used as mark characters.
+ */
+typedef enum StorageMarks
+{
+ SMGR_MARK_NONE = 0,
+ SMGR_MARK_UNCOMMITTED = 'u' /* the file is not committed yet */
+} StorageMarks;
+
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
@@ -87,7 +99,12 @@ extern void smgrcloseall(void);
extern void smgrcloserellocator(RelFileLocatorBackend rlocator);
extern void smgrrelease(SMgrRelation reln);
extern void smgrreleaseall(void);
+extern void smgrcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void smgrunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/test/recovery/t/013_crash_restart.pl b/src/test/recovery/t/013_crash_restart.pl
index 92e7b367df..9def8d2062 100644
--- a/src/test/recovery/t/013_crash_restart.pl
+++ b/src/test/recovery/t/013_crash_restart.pl
@@ -86,6 +86,24 @@ ok( pump_until(
$killme_stdout = '';
$killme_stderr = '';
+# Create a table that should *not* survive, but has rows.
+# The table's contents are required to cause access to the storage file
+# after a restart.
+$killme_stdin .= q[
+CREATE TABLE not_alive AS SELECT 1 as a;
+SELECT pg_relation_filepath('not_alive');
+];
+ok( pump_until(
+ $killme, $psql_timeout,
+ \$killme_stdout, qr/[[:alnum:]\/]+[\r\n]$/m),
+ 'added in-creation table');
+my $not_alive_relfile = $node->data_dir . "/" . $killme_stdout;
+chomp($not_alive_relfile);
+$killme_stdout = '';
+$killme_stderr = '';
+
+# The relfile must exist now
+ok ( -e $not_alive_relfile, 'relfile for in-creation table');
# Start longrunning query in second session; its failure will signal that
# crash-restart has occurred. The initial wait for the trivial select is to
@@ -144,6 +162,9 @@ $killme->run();
($monitor_stdin, $monitor_stdout, $monitor_stderr) = ('', '', '');
$monitor->run();
+# The relfile must have been removed due to the recent restart.
+ok ( ! -e $not_alive_relfile,
+ 'relfile for the in-creation table should be removed after restart');
# Acquire pid of new backend
$killme_stdin .= q[
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 86a9303bf5..e7ba5d2dc8 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1986,6 +1986,7 @@ PatternInfoArray
Pattern_Prefix_Status
Pattern_Type
PendingFsyncEntry
+PendingMarkCleanup
PendingRelDelete
PendingRelSync
PendingUnlinkEntry
@@ -2618,6 +2619,7 @@ StdRdOptIndexCleanup
StdRdOptions
Step
StopList
+StorageMarks
StrategyNumber
StreamCtl
String
@@ -3630,6 +3632,7 @@ registered_buffer
regmatch_t
regoff_t
regproc
+relfile_entry
relopt_bool
relopt_enum
relopt_enum_elt_def
@@ -3683,6 +3686,7 @@ slist_iter
slist_mutable_iter
slist_node
slock_t
+smgr_mark_action
socket_set
socklen_t
spgBulkDeleteState
@@ -3884,7 +3888,9 @@ xl_restore_point
xl_running_xacts
xl_seq_rec
xl_smgr_create
+xl_smgr_mark
xl_smgr_truncate
+xl_smgr_unlink
xl_standby_lock
xl_standby_locks
xl_tblspc_create_rec
--
2.31.1
At Fri, 03 Mar 2023 18:03:53 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
To be precise, there are three parts. The attached patch is the first part -
the storage mark files, which are used to identify storage files that
have not been committed and should be removed during the next
startup. This feature resolves the issue of orphaned storage files
that may result from a crash occurring during the execution of a
transaction involving the creation of a new table.

I'll post all of the three parts shortly.
Mmm. It took longer than I said, but this is the patch set that
includes all three parts.
1. "Mark files" to prevent orphan storage files for in-transaction
created relations after a crash.
2. In-place persistence change: For ALTER TABLE SET LOGGED/UNLOGGED
with wal_level minimal, and ALTER TABLE SET UNLOGGED with other
wal_levels, the commands don't require a file copy for the relation
storage. ALTER TABLE SET LOGGED with non-minimal wal_level emits
bulk FPIs instead of a bunch of individual INSERTs.
3. An extension to ALTER TABLE SET (UN)LOGGED that can handle all
tables in a tablespace at once.
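As a rough illustration of item 2, the rule can be written as a small decision function. This is only a sketch of the behavior as described in the list above, not code from the patch set; every name in it (SketchWalLevel, SketchTarget, SketchStrategy, sketch_persistence_strategy) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the three wal_level settings. */
typedef enum { SK_WAL_MINIMAL, SK_WAL_REPLICA, SK_WAL_LOGICAL } SketchWalLevel;

/* Which ALTER TABLE variant is being run. */
typedef enum { SK_SET_LOGGED, SK_SET_UNLOGGED } SketchTarget;

/* Outcome as described in item 2 (names are illustrative only). */
typedef enum
{
    SK_NO_FILE_COPY,    /* persistence flipped without copying the storage */
    SK_BULK_FPI         /* bulk full-page images emitted instead of INSERTs */
} SketchStrategy;

static SketchStrategy
sketch_persistence_strategy(SketchTarget target, SketchWalLevel wal_level)
{
    assert(wal_level <= SK_WAL_LOGICAL);

    /* SET UNLOGGED never needs a copy; SET LOGGED only with minimal WAL. */
    if (target == SK_SET_UNLOGGED || wal_level == SK_WAL_MINIMAL)
        return SK_NO_FILE_COPY;

    /* SET LOGGED with wal_level above minimal. */
    return SK_BULK_FPI;
}
```

For example, sketch_persistence_strategy(SK_SET_LOGGED, SK_WAL_REPLICA) yields SK_BULK_FPI, while any SET UNLOGGED call yields SK_NO_FILE_COPY.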
As a side note, let me quickly go over the behavior of the mark files
introduced by the first patch, particularly what happens when deletion
fails.
(1) The mark file for MAIN fork ("<oid>.u") corresponds to all forks,
while the mark file for INIT fork ("<oid>_init.u") corresponds to
INIT fork alone.
(2) The mark file is created just before the corresponding storage
file is made. This is always logged in the WAL.
(3) The mark file is deleted after removing the corresponding storage
file during the commit and rollback. This action is logged in the
WAL, too. If the deletion fails, an ERROR is output and the
transaction aborts.
(4) If a crash leaves a mark file behind, the server will try to delete
it after successfully removing the corresponding storage file
during the subsequent startup that runs recovery. If the deletion
fails, the server leaves the mark file in place and emits a WARNING.
(This is the same behavior as for non-mark files.)
(5) If the deletion of the mark file fails, the leftover mark file
prevents the creation of the corresponding storage file (causing
an ERROR). Because of that, a leftover mark file cannot lead to
the removal of the wrong files.
(6) The mark file for an INIT fork is created only when ALTER TABLE
SET UNLOGGED is executed (not for CREATE UNLOGGED TABLE) to signal
the crash-cleanup code to remove the INIT fork. (Otherwise the
cleanup code removes the main fork instead. This is the main
objective of introducing the mark files.)
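Items (1), (4) and (6) can be condensed into a small decision function for the startup cleanup. The following is only a sketch of the rules as listed above, under the assumption that a mark file is treated as belonging to the fork it is named after; it does not mirror the code in reinit.c, and all names (SketchFork, SketchRelState, sketch_mark_name, sketch_remove_at_startup) are hypothetical. The last case relies on the existing unlogged-relation reset, which removes every fork except the init fork.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified fork identifiers; not PostgreSQL's ForkNumber. */
typedef enum { SK_FORK_MAIN, SK_FORK_INIT, SK_FORK_OTHER } SketchFork;

/* What a directory scan found for one relation file number. */
typedef struct
{
    bool main_mark;     /* "<relnumber>.u" is present */
    bool init_mark;     /* "<relnumber>_init.u" is present */
    bool has_init;      /* "<relnumber>_init" is present */
} SketchRelState;

/*
 * Build a mark file name, "<relnumber>.u" or "<relnumber>_init.u".
 * Returns the length of the name, as snprintf does.
 */
static int
sketch_mark_name(char *buf, size_t len, unsigned relnumber, SketchFork fork)
{
    assert(buf != NULL && fork != SK_FORK_OTHER);
    return snprintf(buf, len, "%u%s.u", relnumber,
                    fork == SK_FORK_INIT ? "_init" : "");
}

/* Should a file (storage or mark) of the given fork be removed at startup? */
static bool
sketch_remove_at_startup(const SketchRelState *st, SketchFork fork)
{
    if (st->main_mark)
        return true;                    /* (1): the main mark covers all forks */
    if (st->init_mark)
        return fork == SK_FORK_INIT;    /* (6): only the init fork and its mark */
    if (st->has_init)
        return fork != SK_FORK_INIT;    /* ordinary unlogged-relation reset */
    return false;                       /* committed permanent relation */
}
```

With this sketch, a relation whose main mark survived a crash loses all of its files, whereas a relation that only has an init mark keeps its main fork and loses just the init fork and its mark.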
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v27-0001-Storage-mark-files.patch (text/x-patch; charset=us-ascii)
From ba4b8140fe582ceec4ea810621e17d6a1fe9c408 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 2 Mar 2023 17:25:12 +0900
Subject: [PATCH v27 1/3] Storage mark files
In certain situations, specific operations followed by a crash-restart
can result in orphaned storage files. These files cannot be removed
through standard methods. To address this issue, this commit
implements 'mark files' that convey information about the storage
file. Specifically, the "UNCOMMITTED" mark file is introduced to denote
files that have not been committed and should be removed during the
next startup.
---
src/backend/access/rmgrdesc/smgrdesc.c | 37 +++
src/backend/access/transam/README | 10 +
src/backend/access/transam/xact.c | 7 +
src/backend/access/transam/xlogrecovery.c | 18 ++
src/backend/backup/basebackup.c | 9 +-
src/backend/catalog/storage.c | 270 ++++++++++++++++++-
src/backend/storage/file/fd.c | 4 +-
src/backend/storage/file/reinit.c | 313 +++++++++++++++-------
src/backend/storage/smgr/md.c | 95 ++++++-
src/backend/storage/smgr/smgr.c | 32 +++
src/backend/storage/sync/sync.c | 26 +-
src/bin/pg_rewind/parsexlog.c | 16 ++
src/common/relpath.c | 47 ++--
src/include/catalog/storage.h | 3 +
src/include/catalog/storage_xlog.h | 35 ++-
src/include/common/relpath.h | 9 +-
src/include/storage/fd.h | 1 +
src/include/storage/md.h | 8 +-
src/include/storage/reinit.h | 8 +-
src/include/storage/smgr.h | 17 ++
src/test/recovery/t/013_crash_restart.pl | 21 ++
src/tools/pgindent/typedefs.list | 6 +
22 files changed, 848 insertions(+), 144 deletions(-)
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index bd841b96e8..f8187385c4 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,37 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+ char *path = relpathperm(xlrec->rlocator, xlrec->forkNum);
+
+ appendStringInfoString(buf, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+ char *path = GetRelationPath(xlrec->rlocator.dbOid,
+ xlrec->rlocator.spcOid,
+ xlrec->rlocator.relNumber,
+ InvalidBackendId,
+ xlrec->forkNum, xlrec->mark);
+ char *action = "<none>";
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ action = "CREATE";
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ action = "DELETE";
+ break;
+ }
+
+ appendStringInfo(buf, "%s %s", action, path);
+ pfree(path);
+ }
}
const char *
@@ -55,6 +86,12 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_UNLINK:
+ id = "UNLINK";
+ break;
+ case XLOG_SMGR_MARK:
+ id = "MARK";
+ break;
}
return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 22c8ae9755..bf83d19abd 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -741,6 +741,16 @@ we must panic and abort recovery. The DBA will have to manually clean up
then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.
+================================
+Smgr MARK files
+--------------------------------
+
+An smgr mark file is an empty file that is created alongside a new
+relation storage file to signal that the storage file must be cleaned
+up during recovery. In contrast to the four actions above, failing to
+remove these files will result in data loss, in which case the
+server will shut down.
+
Skipping WAL for New RelFileLocator
--------------------------------
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b876401260..acbf8f1b12 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2227,6 +2227,9 @@ CommitTransaction(void)
*/
smgrDoPendingSyncs(true, is_parallel_worker);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2478,6 +2481,9 @@ PrepareTransaction(void)
*/
smgrDoPendingSyncs(true, false);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2806,6 +2812,7 @@ AbortTransaction(void)
AfterTriggerEndXact(false); /* 'false' means it's abort */
AtAbort_Portals();
smgrDoPendingSyncs(false, is_parallel_worker);
+ smgrDoPendingCleanups(false);
AtEOXact_LargeObject(false);
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index dbe9394762..4d28635f64 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -42,6 +42,7 @@
#include "access/xlogutils.h"
#include "backup/basebackup.h"
#include "catalog/pg_control.h"
+#include "catalog/storage.h"
#include "commands/tablespace.h"
#include "common/file_utils.h"
#include "miscadmin.h"
@@ -56,6 +57,7 @@
#include "storage/pmsignal.h"
#include "storage/proc.h"
#include "storage/procarray.h"
+#include "storage/reinit.h"
#include "storage/spin.h"
#include "utils/builtins.h"
#include "utils/datetime.h"
@@ -1795,6 +1797,14 @@ PerformWalRecovery(void)
RmgrCleanup();
+ /* cleanup garbage files left during crash recovery */
+ if (!InArchiveRecovery)
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
ereport(LOG,
(errmsg("redo done at %X/%X system usage: %s",
LSN_FORMAT_ARGS(xlogreader->ReadRecPtr),
@@ -3134,6 +3144,14 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
{
ereport(DEBUG1,
(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+
+ /* cleanup garbage files left during crash recovery */
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
InArchiveRecovery = true;
if (StandbyModeRequested)
EnableStandbyMode();
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 6efdefb591..3098977626 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -1191,6 +1191,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
ForkNumber relForkNum; /* Type of fork if file is a relation */
int relnumchars; /* Chars in filename that are the
* relnumber */
+ StorageMarks mark; /* marker file sign */
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1241,7 +1242,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
/* Exclude all forks for unlogged tables except the init fork */
if (isDbDir &&
parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &relForkNum))
+ &relForkNum, &mark))
{
/* Never exclude init forks */
if (relForkNum != INIT_FORKNUM)
@@ -1448,6 +1449,7 @@ is_checksummed_file(const char *fullpath, const char *filename)
strncmp(fullpath, "/", 1) == 0)
{
int excludeIdx;
+ char *p;
/* Compare file against noChecksumFiles skip list */
for (excludeIdx = 0; noChecksumFiles[excludeIdx].name != NULL; excludeIdx++)
@@ -1461,6 +1463,11 @@ is_checksummed_file(const char *fullpath, const char *filename)
return false;
}
+ /* exclude mark files */
+ p = strchr(filename, '.');
+ if (p && isalpha(p[1]) && p[2] == 0)
+ return false;
+
return true;
}
else
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index af1491aa1d..03e06246be 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,6 +19,7 @@
#include "postgres.h"
+#include "access/amapi.h"
#include "access/parallel.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -66,6 +67,21 @@ typedef struct PendingRelDelete
struct PendingRelDelete *next; /* linked-list link */
} PendingRelDelete;
+#define PCOP_UNLINK_FORK (1 << 0)
+#define PCOP_UNLINK_MARK (1 << 1)
+
+typedef struct PendingCleanup
+{
+ RelFileLocator rlocator; /* relation that need a cleanup */
+ int op; /* operation mask */
+ ForkNumber unlink_forknum; /* forknum to unlink */
+ StorageMarks unlink_mark; /* mark to unlink */
+ BackendId backend; /* InvalidBackendId if not a temp rel */
+ bool atCommit; /* T=delete at commit; F=delete at abort */
+ int nestLevel; /* xact nesting level of request */
+ struct PendingCleanup *next; /* linked-list link */
+} PendingCleanup;
+
typedef struct PendingRelSync
{
RelFileLocator rlocator;
@@ -73,6 +89,7 @@ typedef struct PendingRelSync
} PendingRelSync;
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingCleanup * pendingCleanups = NULL; /* head of linked list */
static HTAB *pendingSyncHash = NULL;
@@ -123,6 +140,7 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
SMgrRelation srel;
BackendId backend;
bool needs_wal;
+ PendingCleanup *pendingclean;
Assert(!IsInParallelMode()); /* couldn't update pendingSyncHash */
@@ -145,9 +163,23 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
return NULL; /* placate compiler */
}
+ /*
+ * We are going to create a new storage file. If the server crashes before
+ * the current transaction ends, the file needs to be cleaned up. The
+ * SMGR_MARK_UNCOMMITTED mark file prompts that work at the next startup.
+ * We don't need this during WAL-logged CREATE DATABASE. See
+ * CreateAndCopyRelationData for details.
+ */
srel = smgropen(rlocator, backend);
+
+ if (register_delete)
+ {
+ log_smgrcreatemark(&rlocator, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+ }
+
smgrcreate(srel, MAIN_FORKNUM, false);
-
+
if (needs_wal)
log_smgrcreate(&srel->smgr_rlocator.locator, MAIN_FORKNUM);
@@ -157,16 +189,29 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
*/
if (register_delete)
{
- PendingRelDelete *pending;
+ PendingRelDelete *pendingdel;
- pending = (PendingRelDelete *)
+ pendingdel = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
- pending->rlocator = rlocator;
- pending->backend = backend;
- pending->atCommit = false; /* delete if abort */
- pending->nestLevel = GetCurrentTransactionNestLevel();
- pending->next = pendingDeletes;
- pendingDeletes = pending;
+ pendingdel->rlocator = rlocator;
+ pendingdel->backend = backend;
+ pendingdel->atCommit = false; /* delete if abort */
+ pendingdel->nestLevel = GetCurrentTransactionNestLevel();
+ pendingdel->next = pendingDeletes;
+ pendingDeletes = pendingdel;
+
+ /* drop mark files at commit */
+ pendingclean = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pendingclean->rlocator = rlocator;
+ pendingclean->op = PCOP_UNLINK_MARK;
+ pendingclean->unlink_forknum = MAIN_FORKNUM;
+ pendingclean->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pendingclean->backend = backend;
+ pendingclean->atCommit = true;
+ pendingclean->nestLevel = GetCurrentTransactionNestLevel();
+ pendingclean->next = pendingCleanups;
+ pendingCleanups = pendingclean;
}
if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
@@ -197,6 +242,69 @@ log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileLocator *rlocator, ForkNumber forkNum)
+{
+ xl_smgr_unlink xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file unlink.
+ */
+ xlrec.rlocator = *rlocator;
+ xlrec.forkNum = forkNum;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK (create) record to WAL.
+ */
+void
+log_smgrcreatemark(const RelFileLocator *rlocator, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file creation.
+ */
+ xlrec.rlocator = *rlocator;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_CREATE;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK (unlink) record to WAL.
+ */
+void
+log_smgrunlinkmark(const RelFileLocator *rlocator, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the mark file removal.
+ */
+ xlrec.rlocator = *rlocator;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_UNLINK;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -711,6 +819,76 @@ smgrDoPendingDeletes(bool isCommit)
}
}
+/*
+ * smgrDoPendingCleanups() -- Clean up work that emits WAL records
+ *
+ * The operations handled in this function emit WAL records, which must be
+ * part of the current transaction.
+ */
+void
+smgrDoPendingCleanups(bool isCommit)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (pending->nestLevel < nestLevel)
+ {
+ /* outer-level entries should not be processed yet */
+ prev = pending;
+ }
+ else
+ {
+ /* unlink list entry first, so we don't retry on failure */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ /* do cleanup if called for */
+ if (pending->atCommit == isCommit)
+ {
+ SMgrRelation srel;
+
+ srel = smgropen(pending->rlocator, pending->backend);
+
+ Assert((pending->op &
+ ~(PCOP_UNLINK_FORK | PCOP_UNLINK_MARK)) == 0);
+
+ if (pending->op & PCOP_UNLINK_FORK)
+ {
+ /* Don't emit WAL during recovery. */
+ if (!InRecovery)
+ log_smgrunlink(&pending->rlocator,
+ pending->unlink_forknum);
+ smgrunlink(srel, pending->unlink_forknum, false);
+ }
+
+ if (pending->op & PCOP_UNLINK_MARK)
+ {
+ if (!InRecovery)
+ log_smgrunlinkmark(&pending->rlocator,
+ pending->unlink_forknum,
+ pending->unlink_mark);
+
+ smgrunlinkmark(srel, pending->unlink_forknum,
+ pending->unlink_mark, InRecovery);
+ smgrclose(srel);
+ }
+ }
+
+ /* must explicitly free the list entry */
+ pfree(pending);
+ /* prev does not change */
+ }
+ }
+}
+
/*
* smgrDoPendingSyncs() -- Take care of relation syncs at end of xact.
*/
@@ -971,6 +1149,15 @@ smgr_redo(XLogReaderState *record)
reln = smgropen(xlrec->rlocator, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rlocator, InvalidBackendId);
+ smgrunlink(reln, xlrec->forkNum, true);
+ smgrclose(reln);
+ }
else if (info == XLOG_SMGR_TRUNCATE)
{
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
@@ -1059,6 +1246,71 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ bool created = false;
+
+ reln = smgropen(xlrec->rlocator, InvalidBackendId);
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ smgrcreatemark(reln, xlrec->forkNum, xlrec->mark, true);
+ created = true;
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ smgrunlinkmark(reln, xlrec->forkNum, xlrec->mark, true);
+ break;
+ default:
+ elog(ERROR, "unknown smgr_mark action \"%c\"", xlrec->mark);
+ }
+
+ if (created)
+ {
+ /* revert mark file operation at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = xlrec->rlocator;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = xlrec->forkNum;
+ pending->unlink_mark = xlrec->mark;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ else
+ {
+ /*
+ * Delete pending action for this mark file if any. We should have
+ * at most one entry for this action.
+ */
+ PendingCleanup *prev = NULL;
+
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileLocatorEquals(xlrec->rlocator, pending->rlocator) &&
+ pending->unlink_forknum == xlrec->forkNum &&
+ (pending->op & PCOP_UNLINK_MARK) != 0)
+ {
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+ }
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 9fd8444ed4..1b77347978 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -346,8 +346,6 @@ static void pre_sync_fname(const char *fname, bool isdir, int elevel);
static void datadir_fsync_fname(const char *fname, bool isdir, int elevel);
static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel);
-static int fsync_parent_path(const char *fname, int elevel);
-
/*
* pg_fsync --- do fsync with or without writethrough
@@ -3670,7 +3668,7 @@ fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel)
* This is aimed at making file operations persistent on disk in case of
* an OS crash or power failure.
*/
-static int
+int
fsync_parent_path(const char *fname, int elevel)
{
char parentpath[MAXPGPATH];
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index fb55371b1b..250cfe9e44 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -16,29 +16,45 @@
#include <unistd.h>
+#include "access/xlogrecovery.h"
+#include "catalog/pg_tablespace_d.h"
#include "common/relpath.h"
#include "postmaster/startup.h"
+#include "storage/bufmgr.h"
#include "storage/copydir.h"
#include "storage/fd.h"
+#include "storage/md.h"
#include "storage/reinit.h"
+#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
- int op);
+ Oid tspid, int op);
static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
- int op);
+ Oid tspid, Oid dbid, int op);
typedef struct
{
- Oid reloid; /* hash key */
-} unlogged_relation_entry;
+ RelFileNumber relNumber; /* hash key */
+ bool has_init; /* has INIT fork */
+ bool dirty_all; /* needs to remove all forks */
+} relfile_entry;
/*
- * Reset unlogged relations from before the last restart.
+ * Clean up and reset relation files from before the last restart.
*
- * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any
- * relation with an "init" fork, except for the "init" fork itself.
+ * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations
+ * depending on the existence of mark files.
+ *
+ * If SMGR_MARK_UNCOMMITTED mark file for main fork is present we remove the
+ * whole relation along with the mark file.
+ *
+ * Otherwise, if the "init" fork is found, we remove all forks of any relation
+ * with the "init" fork, except for the "init" fork itself.
+ *
+ * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all
+ * relations that are to be cleaned up.
*
* If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main
* fork.
@@ -72,7 +88,7 @@ ResetUnloggedRelations(int op)
/*
* First process unlogged files in pg_default ($PGDATA/base)
*/
- ResetUnloggedRelationsInTablespaceDir("base", op);
+ ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op);
/*
* Cycle through directories for all non-default tablespaces.
@@ -81,13 +97,19 @@ ResetUnloggedRelations(int op)
while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
{
+ Oid tspid;
+
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
continue;
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
- ResetUnloggedRelationsInTablespaceDir(temp_path, op);
+
+ tspid = atooid(spc_de->d_name);
+
+ Assert(tspid != 0);
+ ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op);
}
FreeDir(spc_dir);
@@ -103,7 +125,8 @@ ResetUnloggedRelations(int op)
* Process one tablespace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
+ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
+ Oid tspid, int op)
{
DIR *ts_dir;
struct dirent *de;
@@ -130,6 +153,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
while ((de = ReadDir(ts_dir, tsdirname)) != NULL)
{
+ Oid dbid;
+
/*
* We're only interested in the per-database directories, which have
* numeric names. Note that this code will also (properly) ignore "."
@@ -148,7 +173,10 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
ereport_startup_progress("resetting unlogged relations (cleanup), elapsed time: %ld.%02d s, current path: %s",
dbspace_path);
- ResetUnloggedRelationsInDbspaceDir(dbspace_path, op);
+ dbid = atooid(de->d_name);
+ Assert(dbid != 0);
+
+ ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op);
}
FreeDir(ts_dir);
@@ -158,125 +186,200 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
* Process one per-dbspace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
+ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
+ Oid tspid, Oid dbid, int op)
{
DIR *dbspace_dir;
struct dirent *de;
char rm_path[MAXPGPATH * 2];
+ HTAB *hash;
+ HASHCTL ctl;
/* Caller must specify at least one operation. */
- Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0);
+ Assert((op & (UNLOGGED_RELATION_CLEANUP |
+ UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_INIT)) != 0);
/*
* Cleanup is a two-pass operation. First, we go through and identify all
* the files with init forks. Then, we go through again and nuke
* everything with the same OID except the init fork.
*/
+
+ /*
+ * It's possible that someone could create tons of unlogged relations in
+ * the same database & tablespace, so we'd better use a hash table rather
+ * than an array or linked list to keep track of which files need to be
+ * reset. Otherwise, this cleanup operation would be O(n^2).
+ */
+ memset(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(RelFileNumber);
+ ctl.entrysize = sizeof(relfile_entry);
+ hash = hash_create("unlogged relation RelFileNumbers",
+ 32, &ctl, HASH_ELEM | HASH_BLOBS);
+
+ /* Collect INIT fork and mark files in the directory. */
+ dbspace_dir = AllocateDir(dbspacedirname);
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
+ {
+ ForkNumber forkNum;
+ int relnumchars;
+ StorageMarks mark;
+
+ /* Skip anything that doesn't look like a relation data file. */
+ if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
+ &forkNum, &mark))
+ continue;
+
+ if (forkNum == INIT_FORKNUM || mark == SMGR_MARK_UNCOMMITTED)
+ {
+ RelFileNumber key;
+ relfile_entry *ent;
+ bool found;
+
+ /*
+ * Put the OID portion of the name into the hash table,
+ * if it isn't already. If it has SMGR_MARK_UNCOMMITTED mark
+ * files, the storage file is in dirty state, where clean up is
+ * needed.
+ */
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_ENTER, &found);
+
+ if (!found)
+ {
+ ent->has_init = false;
+ ent->dirty_all = false;
+ }
+
+ if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_all = true;
+ else
+ {
+ Assert(forkNum == INIT_FORKNUM);
+ ent->has_init = true;
+ }
+ }
+ }
+
+ /* Done with the first pass. */
+ FreeDir(dbspace_dir);
+
+ /* nothing to do if we found neither init forks nor mark files */
+ if (hash_get_num_entries(hash) < 1)
+ {
+ hash_destroy(hash);
+ return;
+ }
+
+ if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0)
+ {
+ /*
+ * When we come here after recovery, an smgr object for this file might
+ * have been created. In that case we need to drop all buffers and then
+ * the smgr object. Otherwise the checkpointer wrongly tries to flush
+ * buffers for nonexistent relation storage. This is safe as long as no
+ * other backends have accessed the relation before starting archive
+ * recovery.
+ */
+ HASH_SEQ_STATUS status;
+ relfile_entry *ent;
+ SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8);
+ int maxrels = 8;
+ int nrels = 0;
+ int i;
+
+ Assert(!HotStandbyActive());
+
+ hash_seq_init(&status, hash);
+ while ((ent = (relfile_entry *) hash_seq_search(&status)) != NULL)
+ {
+ RelFileLocatorBackend rel;
+
+ if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ rel.backend = InvalidBackendId;
+ rel.locator.spcOid = tspid;
+ rel.locator.dbOid = dbid;
+ rel.locator.relNumber = ent->relNumber;
+
+ srels[nrels++] = smgropen(rel.locator, InvalidBackendId);
+ }
+
+ DropRelationsAllBuffers(srels, nrels);
+
+ for (i = 0; i < nrels; i++)
+ smgrclose(srels[i]);
+ }
+
+ /*
+ * Now, make a second pass and remove anything that matches.
+ */
if ((op & UNLOGGED_RELATION_CLEANUP) != 0)
{
- HTAB *hash;
- HASHCTL ctl;
-
- /*
- * It's possible that someone could create a ton of unlogged relations
- * in the same database & tablespace, so we'd better use a hash table
- * rather than an array or linked list to keep track of which files
- * need to be reset. Otherwise, this cleanup operation would be
- * O(n^2).
- */
- ctl.keysize = sizeof(Oid);
- ctl.entrysize = sizeof(unlogged_relation_entry);
- ctl.hcxt = CurrentMemoryContext;
- hash = hash_create("unlogged relation OIDs", 32, &ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- /* Scan the directory. */
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int relnumchars;
- unlogged_relation_entry ent;
+ RelFileNumber key;
+ relfile_entry *ent;
+ RelFileLocatorBackend rel;
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
- continue;
-
- /* Also skip it unless this is the init fork. */
- if (forkNum != INIT_FORKNUM)
- continue;
-
- /*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
- */
- ent.reloid = atooid(de->d_name);
- (void) hash_search(hash, &ent, HASH_ENTER, NULL);
- }
-
- /* Done with the first pass. */
- FreeDir(dbspace_dir);
-
- /*
- * If we didn't find any init forks, there's no point in continuing;
- * we can bail out now.
- */
- if (hash_get_num_entries(hash) == 0)
- {
- hash_destroy(hash);
- return;
- }
-
- /*
- * Now, make a second pass and remove anything that matches.
- */
- dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
- {
- ForkNumber forkNum;
- int relnumchars;
- unlogged_relation_entry ent;
-
- /* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
- continue;
-
- /* We never remove the init fork. */
- if (forkNum == INIT_FORKNUM)
+ &forkNum, &mark))
continue;
/*
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
- if (hash_search(hash, &ent, HASH_FIND, NULL))
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);
+
+ if (!ent)
+ continue;
+
+ if (!ent->dirty_all)
{
- snprintf(rm_path, sizeof(rm_path), "%s/%s",
- dbspacedirname, de->d_name);
- if (unlink(rm_path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m",
- rm_path)));
- else
- elog(DEBUG2, "unlinked file \"%s\"", rm_path);
+ /* clean permanent relations don't need cleanup */
+ if (!ent->has_init)
+ continue;
+
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE)
+ continue;
}
+
+ /* so, nuke it! */
+ snprintf(rm_path, sizeof(rm_path), "%s/%s",
+ dbspacedirname, de->d_name);
+ if (unlink(rm_path) < 0)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ rm_path));
+
+ rel.backend = InvalidBackendId;
+ rel.locator.spcOid = tspid;
+ rel.locator.dbOid = dbid;
+ rel.locator.relNumber = atooid(de->d_name);
+
+ ForgetRelationForkSyncRequests(rel, forkNum);
}
/* Cleanup is complete. */
FreeDir(dbspace_dir);
- hash_destroy(hash);
}
/*
* Initialization happens after cleanup is complete: we copy each init
- * fork file to the corresponding main fork file. Note that if we are
- * asked to do both cleanup and init, we may never get here: if the
- * cleanup code determines that there are no init forks in this dbspace,
- * it will return before we get to this point.
+ * fork file to the corresponding main fork file.
*/
if ((op & UNLOGGED_RELATION_INIT) != 0)
{
@@ -285,6 +388,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int relnumchars;
char relnumbuf[OIDCHARS + 1];
char srcpath[MAXPGPATH * 2];
@@ -292,9 +396,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -328,15 +434,18 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int relnumchars;
char relnumbuf[OIDCHARS + 1];
char mainpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -379,7 +488,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
*/
bool
parse_filename_for_nontemp_relation(const char *name, int *relnumchars,
- ForkNumber *fork)
+ ForkNumber *fork, StorageMarks *mark)
{
int pos;
@@ -410,11 +519,19 @@ parse_filename_for_nontemp_relation(const char *name, int *relnumchars,
for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar)
;
- if (segchar <= 1)
- return false;
- pos += segchar;
+ if (segchar > 1)
+ pos += segchar;
}
+ /* mark file? */
+ if (name[pos] == '.' && name[pos + 1] != 0)
+ {
+ *mark = name[pos + 1];
+ pos += 2;
+ }
+ else
+ *mark = SMGR_MARK_NONE;
+
/* Now we should be at the end. */
if (name[pos] != '\0')
return false;
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 352958e1fe..0b64635fb8 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -141,7 +141,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forknum,
BlockNumber blkno, bool skipFsync, int behavior);
static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
MdfdVec *seg);
-
+static bool mdmarkexists(SMgrRelation reln, ForkNumber forkNum,
+ StorageMarks mark);
/*
* mdinit() -- Initialize private state for magnetic disk storage manager.
@@ -173,6 +174,82 @@ mdexists(SMgrRelation reln, ForkNumber forknum)
return (mdopenfork(reln, forknum, EXTENSION_RETURN_NULL) != NULL);
}
+/*
+ * mdcreatemark() -- Create a mark file.
+ *
+ * If isRedo is true, it's okay for the file to exist already.
+ */
+void
+mdcreatemark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rlocator, forkNum, mark);
+ int fd;
+
+ /* See mdcreate for details. */
+ TablespaceCreateDbspace(reln->smgr_rlocator.locator.spcOid,
+ reln->smgr_rlocator.locator.dbOid,
+ isRedo);
+
+ fd = BasicOpenFile(path, O_WRONLY | O_CREAT | O_EXCL);
+ if (fd < 0 && (!isRedo || errno != EEXIST))
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not create mark file \"%s\": %m", path));
+
+ pg_fsync(fd);
+ close(fd);
+
+ /*
+ * To guarantee that the creation of the file is persistent, fsync its
+ * parent directory.
+ */
+ fsync_parent_path(path, ERROR);
+
+ pfree(path);
+}
+
+
+/*
+ * mdunlinkmark() -- Delete the mark file
+ *
+ * If isRedo is true, it's okay for the file not to exist.
+ */
+void
+mdunlinkmark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rlocator, forkNum, mark);
+
+ if (!isRedo || mdmarkexists(reln, forkNum, mark))
+ durable_unlink(path, ERROR);
+
+ pfree(path);
+}
+
+/*
+ * mdmarkexists() -- Check if the mark file exists.
+ */
+static bool
+mdmarkexists(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark)
+{
+ char *path = markpath(reln->smgr_rlocator, forkNum, mark);
+ int fd;
+
+ fd = BasicOpenFile(path, O_RDONLY);
+ if (fd < 0 && errno != ENOENT)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not access mark file \"%s\": %m", path));
+ pfree(path);
+
+ if (fd < 0)
+ return false;
+
+ close(fd);
+ return true;
+}
+
/*
* mdcreate() -- Create a new relation on magnetic disk.
*
@@ -1085,6 +1162,16 @@ register_forget_request(RelFileLocatorBackend rlocator, ForkNumber forknum,
RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ );
}
+/*
+ * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork
+ */
+void
+ForgetRelationForkSyncRequests(RelFileLocatorBackend rlocator,
+ ForkNumber forknum)
+{
+ register_forget_request(rlocator, forknum, 0);
+}
+
/*
* ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB
*/
@@ -1445,12 +1532,14 @@ mdsyncfiletag(const FileTag *ftag, char *path)
* Return 0 on success, -1 on failure, with errno set.
*/
int
-mdunlinkfiletag(const FileTag *ftag, char *path)
+mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark)
{
char *p;
/* Compute the path. */
- p = relpathperm(ftag->rlocator, MAIN_FORKNUM);
+ p = GetRelationPath(ftag->rlocator.dbOid, ftag->rlocator.spcOid,
+ ftag->rlocator.relNumber, InvalidBackendId,
+ MAIN_FORKNUM, mark);
strlcpy(path, p, MAXPGPATH);
pfree(p);
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index dc466e5414..9969d84209 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -62,6 +62,10 @@ typedef struct f_smgr
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+ void (*smgr_createmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+ void (*smgr_unlinkmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
} f_smgr;
static const f_smgr smgrsw[] = {
@@ -82,6 +86,8 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
+ .smgr_createmark = mdcreatemark,
+ .smgr_unlinkmark = mdunlinkmark,
}
};
@@ -371,6 +377,26 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo);
}
+/*
+ * smgrcreatemark() -- Create a mark file
+ */
+void
+smgrcreatemark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_createmark(reln, forknum, mark, isRedo);
+}
+
+/*
+ * smgrunlinkmark() -- Delete a mark file
+ */
+void
+smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlinkmark(reln, forknum, mark, isRedo);
+}
+
/*
* smgrdosyncall() -- Immediately sync all forks of all given relations
*
@@ -693,6 +719,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rlocator, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 768d1dbfc4..16cf74702e 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -91,7 +91,8 @@ static CycleCtr checkpoint_cycle_ctr = 0;
typedef struct SyncOps
{
int (*sync_syncfiletag) (const FileTag *ftag, char *path);
- int (*sync_unlinkfiletag) (const FileTag *ftag, char *path);
+ int (*sync_unlinkfiletag) (const FileTag *ftag, char *path,
+ StorageMarks mark);
bool (*sync_filetagmatches) (const FileTag *ftag,
const FileTag *candidate);
} SyncOps;
@@ -235,7 +236,8 @@ SyncPostCheckpoint(void)
/* Unlink the file */
if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
- path) < 0)
+ path,
+ SMGR_MARK_NONE) < 0)
{
/*
* There's a race condition, when the database is dropped at the
@@ -244,6 +246,26 @@ SyncPostCheckpoint(void)
* here. rmtree() also has to ignore ENOENT errors, to deal with
* the possibility that we delete the file first.
*/
+ if (errno != ENOENT)
+ ereport(WARNING,
+ errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path));
+ }
+ else if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
+ path,
+ SMGR_MARK_UNCOMMITTED)
+ < 0)
+ {
+ /*
+ * We might also have an SMGR_MARK_UNCOMMITTED file. Remove it if the
+ * fork file has been successfully removed. It's fine if the file
+ * does not exist. Since we have successfully removed the storage
+ * file, it's no big deal if the mark file can't be removed. It
+ * will be eventually removed during a future startup. If that
+ * removal fails, the leftover mark file prevents the creation of
+ * the corresponding storage file so that mark files won't result
+ * in unexpected removal of the correct storage files.
+ */
if (errno != ENOENT)
ereport(WARNING,
(errcode_for_file_access(),
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 27782237d0..e9e4bafb01 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -418,6 +418,22 @@ extractPageInfo(XLogReaderState *record)
* source system.
*/
}
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_UNLINK)
+ {
+ /*
+ * We can safely ignore these. We'll see that the files don't exist in
+ * the target data dir, and copy them in from the source system. No
+ * need to do anything special here.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_MARK)
+ {
+ /*
+ * We can safely ignore these. The file will be removed from the
+ * target if it doesn't exist in the source system. The files are
+ * empty, so we don't need to care about their content.
+ */
+ }
else if (rmid == RM_XACT_ID &&
((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT ||
(rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED ||
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 87de5f6c96..b1f6832cfa 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -139,9 +139,15 @@ GetDatabasePath(Oid dbOid, Oid spcOid)
*/
char *
GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
- int backendId, ForkNumber forkNumber)
+ int backendId, ForkNumber forkNumber, char mark)
{
char *path;
+ char markstr[4];
+
+ if (mark == 0)
+ markstr[0] = 0;
+ else
+ snprintf(markstr, sizeof(markstr), ".%c", mark);
if (spcOid == GLOBALTABLESPACE_OID)
{
@@ -149,10 +155,10 @@ GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
Assert(dbOid == 0);
Assert(backendId == InvalidBackendId);
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("global/%u_%s",
- relNumber, forkNames[forkNumber]);
+ path = psprintf("global/%u_%s%s",
+ relNumber, forkNames[forkNumber], markstr);
else
- path = psprintf("global/%u", relNumber);
+ path = psprintf("global/%u%s", relNumber, markstr);
}
else if (spcOid == DEFAULTTABLESPACE_OID)
{
@@ -160,22 +166,22 @@ GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/%u_%s",
+ path = psprintf("base/%u/%u_%s%s",
dbOid, relNumber,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/%u",
- dbOid, relNumber);
+ path = psprintf("base/%u/%u%s",
+ dbOid, relNumber, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/t%d_%u_%s",
+ path = psprintf("base/%u/t%d_%u_%s%s",
dbOid, backendId, relNumber,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/t%d_%u",
- dbOid, backendId, relNumber);
+ path = psprintf("base/%u/t%d_%u%s",
+ dbOid, backendId, relNumber, markstr);
}
}
else
@@ -184,27 +190,28 @@ GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u_%s%s",
spcOid, TABLESPACE_VERSION_DIRECTORY,
dbOid, relNumber,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u%s",
spcOid, TABLESPACE_VERSION_DIRECTORY,
- dbOid, relNumber);
+ dbOid, relNumber, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s%s",
spcOid, TABLESPACE_VERSION_DIRECTORY,
dbOid, backendId, relNumber,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u%s",
spcOid, TABLESPACE_VERSION_DIRECTORY,
- dbOid, backendId, relNumber);
+ dbOid, backendId, relNumber, markstr);
}
}
+
return path;
}
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 45a3c7835c..0b39c6ef56 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -25,6 +25,8 @@ extern PGDLLIMPORT int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileLocator rlocator,
char relpersistence,
bool register_delete);
+extern void RelationCreateInitFork(Relation rel);
+extern void RelationDropInitFork(Relation rel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileLocator rlocator, bool atCommit);
extern void RelationPreTruncate(Relation rel);
@@ -43,6 +45,7 @@ extern void RestorePendingSyncs(char *startAddress);
extern void smgrDoPendingDeletes(bool isCommit);
extern void smgrDoPendingSyncs(bool isCommit, bool isParallelWorker);
extern int smgrGetPendingDeletes(bool forCommit, RelFileLocator **ptr);
+extern void smgrDoPendingCleanups(bool isCommit);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index 6b0a7aa3df..a36646c6ee 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -18,17 +18,23 @@
#include "lib/stringinfo.h"
#include "storage/block.h"
#include "storage/relfilelocator.h"
+#include "storage/smgr.h"
/*
* Declarations for smgr-related XLOG records
*
- * Note: we log file creation and truncation here, but logging of deletion
- * actions is handled by xact.c, because it is part of transaction commit.
+ * Note: we log file creation, truncation and buffer persistence change here,
+ * but logging of deletion actions is handled mainly by xact.c, because it is
+ * part of transaction commit in most cases. However, there's a case where
+ * init forks are deleted outside the control of a transaction.
*/
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_UNLINK 0x30
+#define XLOG_SMGR_MARK 0x40
+#define XLOG_SMGR_BUFPERSISTENCE 0x50
typedef struct xl_smgr_create
{
@@ -36,6 +42,26 @@ typedef struct xl_smgr_create
ForkNumber forkNum;
} xl_smgr_create;
+typedef struct xl_smgr_unlink
+{
+ RelFileLocator rlocator;
+ ForkNumber forkNum;
+} xl_smgr_unlink;
+
+typedef enum smgr_mark_action
+{
+ XLOG_SMGR_MARK_CREATE = 'c',
+ XLOG_SMGR_MARK_UNLINK = 'u'
+} smgr_mark_action;
+
+typedef struct xl_smgr_mark
+{
+ RelFileLocator rlocator;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ smgr_mark_action action;
+} xl_smgr_mark;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +77,11 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileLocator *rlocator, ForkNumber forkNum);
+extern void log_smgrcreatemark(const RelFileLocator *rlocator,
+ ForkNumber forkNum, StorageMarks mark);
+extern void log_smgrunlinkmark(const RelFileLocator *rlocator,
+ ForkNumber forkNum, StorageMarks mark);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index 511c21682e..28c9dbcd13 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -74,7 +74,7 @@ extern int forkname_chars(const char *str, ForkNumber *fork);
extern char *GetDatabasePath(Oid dbOid, Oid spcOid);
extern char *GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
- int backendId, ForkNumber forkNumber);
+ int backendId, ForkNumber forkNumber, char mark);
/*
* Wrapper macros for GetRelationPath. Beware of multiple
@@ -84,7 +84,7 @@ extern char *GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
/* First argument is a RelFileLocator */
#define relpathbackend(rlocator, backend, forknum) \
GetRelationPath((rlocator).dbOid, (rlocator).spcOid, (rlocator).relNumber, \
- backend, forknum)
+ backend, forknum, 0)
/* First argument is a RelFileLocator */
#define relpathperm(rlocator, forknum) \
@@ -94,4 +94,9 @@ extern char *GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
#define relpath(rlocator, forknum) \
relpathbackend((rlocator).locator, (rlocator).backend, forknum)
+/* First argument is a RelFileLocatorBackend */
+#define markpath(rlocator, forknum, mark) \
+ GetRelationPath((rlocator).locator.dbOid, (rlocator).locator.spcOid, \
+ (rlocator).locator.relNumber, \
+ (rlocator).backend, forknum, mark)
#endif /* RELPATH_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index f85de97d08..91612f2e42 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -179,6 +179,7 @@ extern void pg_flush_data(int fd, off_t offset, off_t nbytes);
extern int pg_truncate(const char *path, off_t length);
extern void fsync_fname(const char *fname, bool isdir);
extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
+extern int fsync_parent_path(const char *fname, int elevel);
extern int durable_rename(const char *oldfile, const char *newfile, int elevel);
extern int durable_unlink(const char *fname, int elevel);
extern void SyncDataDirectory(void);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 8f32af9ef3..37de1a0d7b 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -23,6 +23,10 @@
extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
extern void mdclose(SMgrRelation reln, ForkNumber forknum);
+extern void mdcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void mdunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
extern void mdunlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo);
@@ -41,12 +45,14 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void ForgetRelationForkSyncRequests(RelFileLocatorBackend rlocator,
+ ForkNumber forknum);
extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo);
/* md sync callbacks */
extern int mdsyncfiletag(const FileTag *ftag, char *path);
-extern int mdunlinkfiletag(const FileTag *ftag, char *path);
+extern int mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark);
extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *candidate);
#endif /* MD_H */
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index e2bbb5abe9..119dac1505 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -16,14 +16,16 @@
#define REINIT_H
#include "common/relpath.h"
-
+#include "storage/smgr.h"
extern void ResetUnloggedRelations(int op);
extern bool parse_filename_for_nontemp_relation(const char *name,
int *relnumchars,
- ForkNumber *fork);
+ ForkNumber *fork,
+ StorageMarks *mark);
#define UNLOGGED_RELATION_CLEANUP 0x0001
-#define UNLOGGED_RELATION_INIT 0x0002
+#define UNLOGGED_RELATION_DROP_BUFFER 0x0002
+#define UNLOGGED_RELATION_INIT 0x0004
#endif /* REINIT_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 0935144f42..da6e0f3d64 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,18 @@
#include "storage/block.h"
#include "storage/relfilelocator.h"
+/*
+ * A storage mark is a file whose existence indicates something about another
+ * file. The name of such a file is "<filename>.<mark>", where the mark is one
+ * of the values of StorageMarks. Since ".<digit>" denotes a segment file,
+ * don't use digits for the mark character.
+ */
+typedef enum StorageMarks
+{
+ SMGR_MARK_NONE = 0,
+ SMGR_MARK_UNCOMMITTED = 'u' /* the file is not committed yet */
+} StorageMarks;
+
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
@@ -87,7 +99,12 @@ extern void smgrcloseall(void);
extern void smgrcloserellocator(RelFileLocatorBackend rlocator);
extern void smgrrelease(SMgrRelation reln);
extern void smgrreleaseall(void);
+extern void smgrcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void smgrunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/test/recovery/t/013_crash_restart.pl b/src/test/recovery/t/013_crash_restart.pl
index 92e7b367df..9def8d2062 100644
--- a/src/test/recovery/t/013_crash_restart.pl
+++ b/src/test/recovery/t/013_crash_restart.pl
@@ -86,6 +86,24 @@ ok( pump_until(
$killme_stdout = '';
$killme_stderr = '';
+# Create a table that should *not* survive, but has rows.
+# The table's contents are required to cause access to the storage file
+# after a restart.
+$killme_stdin .= q[
+CREATE TABLE not_alive AS SELECT 1 as a;
+SELECT pg_relation_filepath('not_alive');
+];
+ok( pump_until(
+ $killme, $psql_timeout,
+ \$killme_stdout, qr/[[:alnum:]\/]+[\r\n]$/m),
+ 'added in-creation table');
+my $not_alive_relfile = $node->data_dir . "/" . $killme_stdout;
+chomp($not_alive_relfile);
+$killme_stdout = '';
+$killme_stderr = '';
+
+# The relfile must exist now
+ok ( -e $not_alive_relfile, 'relfile for in-creation table');
# Start longrunning query in second session; its failure will signal that
# crash-restart has occurred. The initial wait for the trivial select is to
@@ -144,6 +162,9 @@ $killme->run();
($monitor_stdin, $monitor_stdout, $monitor_stderr) = ('', '', '');
$monitor->run();
+# The relfile must have been removed due to the recent restart.
+ok ( ! -e $not_alive_relfile,
+ 'relfile for the in-creation table should be removed after restart');
# Acquire pid of new backend
$killme_stdin .= q[
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 097f42e1b3..747b7557dc 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1986,6 +1986,7 @@ PatternInfoArray
Pattern_Prefix_Status
Pattern_Type
PendingFsyncEntry
+PendingMarkCleanup
PendingRelDelete
PendingRelSync
PendingUnlinkEntry
@@ -2617,6 +2618,7 @@ StdRdOptIndexCleanup
StdRdOptions
Step
StopList
+StorageMarks
StrategyNumber
StreamCtl
String
@@ -3629,6 +3631,7 @@ registered_buffer
regmatch_t
regoff_t
regproc
+relfile_entry
relopt_bool
relopt_enum
relopt_enum_elt_def
@@ -3682,6 +3685,7 @@ slist_iter
slist_mutable_iter
slist_node
slock_t
+smgr_mark_action
socket_set
socklen_t
spgBulkDeleteState
@@ -3883,7 +3887,9 @@ xl_restore_point
xl_running_xacts
xl_seq_rec
xl_smgr_create
+xl_smgr_mark
xl_smgr_truncate
+xl_smgr_unlink
xl_standby_lock
xl_standby_locks
xl_tblspc_create_rec
--
2.31.1
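For readers following the naming scheme used by the patch above ("<filename>.<mark>", with the mark appended by GetRelationPath() and recognized by parse_filename_for_nontemp_relation()), here is a minimal standalone sketch of the idea. It is not the patch's code: demo_mark_name() and demo_parse_name() are hypothetical names, the fork suffix is accepted as any run of lowercase letters rather than being checked against the real fork names, and relation-number overflow is not handled.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the patch's StorageMarks values. */
enum { MARK_NONE = 0, MARK_UNCOMMITTED = 'u' };

/* Build "<relnumber>[_<fork>][.<mark>]" into buf. */
static void
demo_mark_name(char *buf, size_t buflen, unsigned relnumber,
			   const char *forkname, char mark)
{
	char		markstr[3] = "";

	/* ".<digit>" is reserved for segment files, so a mark is never a digit */
	assert(!isdigit((unsigned char) mark));

	if (mark != MARK_NONE)
		snprintf(markstr, sizeof(markstr), ".%c", mark);

	if (forkname != NULL)
		snprintf(buf, buflen, "%u_%s%s", relnumber, forkname, markstr);
	else
		snprintf(buf, buflen, "%u%s", relnumber, markstr);
}

/*
 * Parse "<digits>[_<fork>][.<segno>][.<mark>]".  Returns false if the name
 * does not have that shape.  The fork part is simplified to "lowercase
 * letters" for this sketch.
 */
static bool
demo_parse_name(const char *name, unsigned *relnumber, char *mark)
{
	size_t		pos = 0;

	if (!isdigit((unsigned char) name[0]))
		return false;

	*relnumber = 0;
	while (isdigit((unsigned char) name[pos]))
		*relnumber = *relnumber * 10 + (unsigned) (name[pos++] - '0');

	/* optional fork suffix */
	if (name[pos] == '_')
	{
		size_t		n = 1;

		while (islower((unsigned char) name[pos + n]))
			n++;
		if (n == 1)
			return false;
		pos += n;
	}

	/* optional segment number: '.' followed by digits */
	if (name[pos] == '.' && isdigit((unsigned char) name[pos + 1]))
	{
		pos++;
		while (isdigit((unsigned char) name[pos]))
			pos++;
	}

	/* optional mark: '.' followed by exactly one character */
	if (name[pos] == '.' && name[pos + 1] != '\0')
	{
		*mark = name[pos + 1];
		pos += 2;
	}
	else
		*mark = MARK_NONE;

	/* now we should be at the end */
	return name[pos] == '\0';
}
```

With these helpers, an uncommitted init fork of relation 16384 would be named "16384_init.u", and a name such as "16384.1" parses as segment 1 with no mark. Keeping digits out of the mark character is what lets the parser tell a segment suffix from a mark suffix.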
v27-0002-In-place-table-persistence-change.patch (text/x-patch; charset=us-ascii)
From 3531ead1788045b602f43af06fc1ba3ddf74c46b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Wed, 15 Mar 2023 15:42:09 +0900
Subject: [PATCH v27 2/3] In-place table persistence change
Currently, ALTER TABLE SET UNLOGGED causes a large amount of file I/O
due to a heap rewrite, even though the command does not require any
data rewrites. In addition, this patch changes ALTER TABLE SET LOGGED
to emit XLOG_FPI records instead of a large number of HEAP_INSERT's
when wal_level > minimal, as this option is likely to be less resource
intensive.
---
src/backend/access/rmgrdesc/smgrdesc.c | 12 +
src/backend/catalog/storage.c | 290 ++++++++++++++++++++++-
src/backend/commands/tablecmds.c | 269 ++++++++++++++++++---
src/backend/storage/buffer/bufmgr.c | 85 +++++++
src/backend/storage/file/reinit.c | 51 +++-
src/bin/pg_rewind/parsexlog.c | 6 +
src/bin/pg_rewind/pg_rewind.c | 1 -
src/include/catalog/storage_xlog.h | 8 +
src/include/storage/bufmgr.h | 2 +
src/test/recovery/t/013_crash_restart.pl | 21 --
src/tools/pgindent/typedefs.list | 1 +
11 files changed, 673 insertions(+), 73 deletions(-)
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index f8187385c4..e2998a3ee4 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -71,6 +71,15 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
appendStringInfo(buf, "%s %s", action, path);
pfree(path);
}
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+ char *path = relpathperm(xlrec->rlocator, MAIN_FORKNUM);
+
+ appendStringInfoString(buf, path);
+ appendStringInfo(buf, " persistence %d", xlrec->persistence);
+ pfree(path);
+ }
}
const char *
@@ -92,6 +101,9 @@ smgr_identify(uint8 info)
case XLOG_SMGR_MARK:
id = "MARK";
break;
+ case XLOG_SMGR_BUFPERSISTENCE:
+ id = "BUFPERSISTENCE";
+ break;
}
return id;
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 03e06246be..97d1230ee8 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -69,11 +69,13 @@ typedef struct PendingRelDelete
#define PCOP_UNLINK_FORK (1 << 0)
#define PCOP_UNLINK_MARK (1 << 1)
+#define PCOP_SET_PERSISTENCE (1 << 2)
typedef struct PendingCleanup
{
RelFileLocator rlocator; /* relation that need a cleanup */
int op; /* operation mask */
+ bool bufpersistence; /* buffer persistence to set */
ForkNumber unlink_forknum; /* forknum to unlink */
StorageMarks unlink_mark; /* mark to unlink */
BackendId backend; /* InvalidBackendId if not a temp rel */
@@ -223,6 +225,202 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
return srel;
}
+/*
+ * RelationCreateInitFork
+ * Create physical storage for the init fork of a relation.
+ *
+ * Create the init fork for the relation.
+ *
+ * This function is transactional. The creation is WAL-logged, and if the
+ * transaction aborts later on, the init fork will be removed.
+ */
+void
+RelationCreateInitFork(Relation rel)
+{
+ RelFileLocator rlocator = rel->rd_locator;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ SMgrRelation srel;
+ bool create = true;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), false, false);
+
+ /*
+ * If we have a pending-unlink for the init-fork of this relation, that
+ * means the init-fork has existed since before the current transaction
+ * started. This function reverts that change just by removing the entry.
+ * See RelationDropInitFork.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileLocatorEquals(rlocator, pending->rlocator) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ create = false;
+ }
+ else
+ prev = pending;
+ }
+
+ if (!create)
+ return;
+
+ /* create the init fork, along with the mark file */
+ srel = smgropen(rlocator, InvalidBackendId);
+ log_smgrcreatemark(&rlocator, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+
+ /* We don't have an existing init fork; create it. */
+ smgrcreate(srel, INIT_FORKNUM, false);
+
+ /*
+ * For index relations, WAL-logging and file sync are performed by
+ * ambuildempty. On the other hand, we manually perform these tasks here
+ * for heap relations.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rlocator, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /* drop the init fork, mark file then revert persistence at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = rlocator;
+ pending->op = PCOP_UNLINK_FORK | PCOP_UNLINK_MARK | PCOP_SET_PERSISTENCE;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->bufpersistence = true;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* drop mark file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = rlocator;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
+/*
+ * RelationDropInitFork
+ * Delete physical storage for the init fork of a relation.
+ */
+void
+RelationDropInitFork(Relation rel)
+{
+ RelFileLocator rlocator = rel->rd_locator;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ bool inxact_created = false;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), true, false);
+
+ /*
+ * Search for pending-unlink associated with the init-fork of the
+ * relation. The presence of one indicates that the init fork was created
+ * within the current transaction.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileLocatorEquals(rlocator, pending->rlocator) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ /* unlink list entry */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ inxact_created = true;
+ }
+ else
+ prev = pending;
+ }
+
+ /*
+ * If the init-fork was created in this transaction, we immediately remove
+ * both the init fork and mark file. Otherwise, we register an at-commit
+ * pending-unlink for the existing init fork. See
+ * RelationCreateInitFork.
+ */
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rlocator, InvalidBackendId);
+ ForkNumber forknum = INIT_FORKNUM;
+ BlockNumber firstblock = 0;
+
+ /*
+ * Some AMs initialize INIT fork via buffer manager. To properly drop
+ * the init fork, we need to drop all buffers for the INIT fork first,
+ * then unlink the INIT fork along with the mark file.
+ */
+ DropRelationBuffers(srel, &forknum, 1, &firstblock);
+ log_smgrunlinkmark(&rlocator, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrunlinkmark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+ log_smgrunlink(&rlocator, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ return;
+ }
+
+ /* register drop of this init fork file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = rlocator;
+ pending->op = PCOP_UNLINK_FORK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = rlocator;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = false;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -305,6 +503,25 @@ log_smgrunlinkmark(const RelFileLocator *rlocator, ForkNumber forkNum,
XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileLocator *rlocator, bool persistence)
+{
+ xl_smgr_bufpersistence xlrec;
+
+ /*
+ * Make an XLOG entry reporting the change of buffer persistence.
+ */
+ xlrec.rlocator = *rlocator;
+ xlrec.persistence = persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -858,10 +1075,28 @@ smgrDoPendingCleanups(bool isCommit)
srel = smgropen(pending->rlocator, pending->backend);
Assert((pending->op &
- ~(PCOP_UNLINK_FORK | PCOP_UNLINK_MARK)) == 0);
+ ~(PCOP_UNLINK_FORK | PCOP_UNLINK_MARK |
+ PCOP_SET_PERSISTENCE)) == 0);
+
+ if (pending->op & PCOP_SET_PERSISTENCE)
+ {
+ SetRelationBuffersPersistence(srel, pending->bufpersistence,
+ InRecovery);
+ }
if (pending->op & PCOP_UNLINK_FORK)
{
+ /*
+ * Unlink the fork file. Currently we only apply this
+ * operation for init forks and it is certain that the init
+ * fork is not loaded on shared buffers at this point. In
+ * the case of RelationDropInitFork, the function should
+ * have dropped buffers. In the case of
+ * RelationCreateInitFork, PCOP_SET_PERSISTENCE is set and
+ * the buffers were dropped just before.
+ */
+ Assert(pending->unlink_forknum == INIT_FORKNUM);
+
/* Don't emit wal while recovery. */
if (!InRecovery)
log_smgrunlink(&pending->rlocator,
@@ -1311,6 +1546,59 @@ smgr_redo(XLogReaderState *record)
}
}
}
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ PendingCleanup *prev = NULL;
+
+ reln = smgropen(xlrec->rlocator, InvalidBackendId);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+
+ /*
+ * Delete pending action for persistence change if any. We should have
+ * at most one entry for this action.
+ */
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileLocatorEquals(xlrec->rlocator, pending->rlocator) &&
+ (pending->op & PCOP_SET_PERSISTENCE) != 0)
+ {
+ Assert(pending->bufpersistence == xlrec->persistence);
+
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+
+ /*
+ * At abort time, revert any changes to buffer-persistence that were
+ * made in this transaction.
+ */
+ if (!pending)
+ {
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = xlrec->rlocator;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = !xlrec->persistence;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 3e2c5f797c..becef96927 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -55,6 +55,7 @@
#include "commands/defrem.h"
#include "commands/event_trigger.h"
#include "commands/policy.h"
+#include "commands/progress.h"
#include "commands/sequence.h"
#include "commands/tablecmds.h"
#include "commands/tablespace.h"
@@ -5439,6 +5440,189 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
return newcmd;
}
+/*
+ * RelationChangePersistence: perform in-place persistence change of a relation
+ */
+static void
+RelationChangePersistence(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ int i;
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * ATRewriteTable should be used instead of this function under the
+ * following condition.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ /*
+ * First, gather all relations that require a persistence change.
+ */
+
+ /* Collect OIDs of indexes and toast relations */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ /* Add toast relation if any */
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ List *toastidx;
+ Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
+
+ relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
+ toastidx = RelationGetIndexList(toastrel);
+ relids = list_concat(relids, toastidx);
+ pfree(toastidx);
+ table_close(toastrel, NoLock);
+ }
+
+ table_close(rel, NoLock);
+
+ /* Make changes in storage */
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach(lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+
+ /*
+ * XXXX: Some access methods do not support in-place persistence
+ * changes. GiST uses page LSNs to figure out whether a block has been
+ * modified. However UNLOGGED GiST indexes use fake LSNs that are
+ * incompatible with the real LSNs used for LOGGED indexes.
+ *
+ * Maybe if gistGetFakeLSN behaved the same way for permanent and
+ * unlogged indexes, we could potentially avoid index rebuilds in
+ * exchange for emitting some extra WAL records while the index is
+ * unlogged.
+ *
+ * Check relam against a positive list so that we take the hard way for
+ * unknown AMs.
+ */
+ if (r->rd_rel->relkind == RELKIND_INDEX &&
+ /* GiST is excluded */
+ r->rd_rel->relam != BTREE_AM_OID &&
+ r->rd_rel->relam != HASH_AM_OID &&
+ r->rd_rel->relam != GIN_AM_OID &&
+ r->rd_rel->relam != SPGIST_AM_OID &&
+ r->rd_rel->relam != BRIN_AM_OID)
+ {
+ int reindex_flags;
+ ReindexParams params = {0};
+
+ /* reindex doesn't allow concurrent use of the index */
+ table_close(r, NoLock);
+
+ reindex_flags =
+ REINDEX_REL_SUPPRESS_INDEX_USE |
+ REINDEX_REL_CHECK_CONSTRAINTS;
+
+ /* Set the same persistence as the parent relation. */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+
+ reindex_index(reloid, reindex_flags, persistence, &params);
+
+ continue;
+ }
+
+ /* Create or drop init fork */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ RelationCreateInitFork(r);
+ else
+ RelationDropInitFork(r);
+
+ /*
+ * If this relation is changed to WAL-logged, immediately sync all
+ * files except for init fork to establish the initial state on
+ * storage. The buffers should have already been flushed out by
+ * RelationCreate(Drop)InitFork called immediately above. The init fork
+ * should have already been synchronized as needed.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT)
+ {
+ for (i = 0; i < INIT_FORKNUM; i++)
+ {
+ if (smgrexists(RelationGetSmgr(r), i))
+ smgrimmedsync(RelationGetSmgr(r), i);
+ }
+ }
+
+ /* Update catalog */
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+
+ /*
+ * If wal_level >= replica, switching to LOGGED requires the relation
+ * content to be WAL-logged for later recovery. We don't emit this if
+ * wal_level = minimal.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded())
+ {
+ ForkNumber fork;
+ xl_smgr_truncate xlrec;
+
+ xlrec.blkno = 0;
+ xlrec.rlocator = r->rd_locator;
+ xlrec.flags = SMGR_TRUNCATE_ALL;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ for (fork = 0; fork < INIT_FORKNUM; fork++)
+ {
+ if (smgrexists(RelationGetSmgr(r), fork))
+ log_newpage_range(r, fork, 0,
+ smgrnblocks(RelationGetSmgr(r), fork),
+ false);
+ }
+ }
+
+ table_close(r, NoLock);
+ }
+
+ table_close(classRel, NoLock);
+}
+
/*
* ATRewriteTables: ALTER TABLE phase 3
*/
@@ -5569,48 +5753,55 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- /*
- * Create transient table that will receive the modified data.
- *
- * Ensure it is marked correctly as logged or unlogged. We have
- * to do this here so that buffers for the new relfilenumber will
- * have the right persistence set, and at the same time ensure
- * that the original filenumbers's buffers will get read in with
- * the correct setting (i.e. the original one). Otherwise a
- * rollback after the rewrite would possibly result with buffers
- * for the original filenumbers having the wrong persistence
- * setting.
- *
- * NB: This relies on swap_relation_files() also swapping the
- * persistence. That wouldn't work for pg_class, but that can't be
- * unlogged anyway.
- */
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
- persistence, lockmode);
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
+ RelationChangePersistence(tab, persistence, lockmode);
+ else
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new
+ * relfilenumber will have the right persistence set, and at
+ * the same time ensure that the original filenumbers's buffers
+ * will get read in with the correct setting (i.e. the original
+ * one). Otherwise a rollback after the rewrite would possibly
+ * result with buffers for the original filenumbers having the
+ * wrong persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that
+ * can't be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace,
+ NewAccessMethod,
+ persistence, lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
+ /*
+ * Swap the physical files of the old and new heaps, then
+ * rebuild indexes and discard the old heap. We can use
+ * RecentXmin for the table's new relfrozenxid because we
+ * rewrote all the tuples in ATRewriteTable, so no older Xid
+ * remains in the table. Also, we never try to swap toast
+ * tables by content, since we have no interest in letting
+ * this code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
- InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ }
}
else if (tab->rewrite > 0 && tab->relkind == RELKIND_SEQUENCE)
{
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 0a05577b68..2b00ec3eed 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3240,6 +3240,91 @@ DropRelationBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
}
}
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a relation.
+ * When switching to PERMANENT, it also writes all dirty pages of the
+ * relation out to disk (or more precisely, to kernel disk buffers),
+ * ensuring that the kernel has an up-to-date view of the relation.
+ *
+ * The caller must be holding AccessExclusiveLock on the target relation
+ * to ensure that no other backend is busy dirtying more blocks of the
+ * relation.
+ *
+ * XXX currently it sequentially searches the buffer pool, should be
+ * changed to more clever ways of searching. This routine is not used in
+ * any performance-critical code paths, so it's not worth additional
+ * overhead to make it go faster; but see also DropRelationBuffers.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
+{
+ int i;
+ RelFileLocatorBackend rlocator = srel->smgr_rlocator;
+
+ Assert(!RelFileLocatorBackendIsTemp(rlocator));
+
+ if (!isRedo)
+ log_smgrbufpersistence(srel->smgr_rlocator.locator, permanent);
+
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ if (!RelFileLocatorEquals(BufTagGetRelFileLocator(&bufHdr->tag),
+ rlocator.locator))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (!RelFileLocatorEquals(BufTagGetRelFileLocator(&bufHdr->tag),
+ rlocator.locator))
+ {
+ UnlockBufHdr(bufHdr, buf_state);
+ continue;
+ }
+
+ if (permanent)
+ {
+ /* The init fork is being dropped, drop buffers for it. */
+ if (BufTagGetForkNum(&bufHdr->tag) == INIT_FORKNUM)
+ {
+ InvalidateBuffer(bufHdr);
+ continue;
+ }
+
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /* flush this buffer when switching to PERMANENT */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, srel);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ /* There shouldn't be an init fork */
+ Assert(BufTagGetForkNum(&bufHdr->tag) != INIT_FORKNUM);
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ }
+}
+
/* ---------------------------------------------------------------------
* DropRelationsAllBuffers
*
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 250cfe9e44..bdd1200132 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -38,6 +38,7 @@ typedef struct
{
RelFileNumber relNumber; /* hash key */
bool has_init; /* has INIT fork */
+ bool dirty_init; /* needs to remove INIT fork */
bool dirty_all; /* needs to remove all forks */
} relfile_entry;
@@ -45,7 +46,10 @@ typedef struct
* Clean up and reset relation files from before the last restart.
*
* If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations
- * depending on the existence of mark files.
+ * depending on the existence of the "cleanup" forks.
+ *
+ * If SMGR_MARK_UNCOMMITTED mark file for init fork is present, we remove the
+ * init fork along with the mark file.
*
* If SMGR_MARK_UNCOMMITTED mark file for main fork is present we remove the
* whole relation along with the mark file.
@@ -54,7 +58,7 @@ typedef struct
* with the "init" fork, except for the "init" fork itself.
*
* If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all
- * relations that are to be cleaned up.
+ * relations that have the "cleanup" and/or the "init" forks.
*
* If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main
* fork.
@@ -241,7 +245,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
* Put the OID portion of the name into the hash table,
* if it isn't already. If it has SMGR_MARK_UNCOMMITTED mark
* files, the storage file is in dirty state, where clean up is
- * needed.
+ * needed.
*/
key = atooid(de->d_name);
ent = hash_search(hash, &key, HASH_ENTER, &found);
@@ -249,10 +253,13 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
if (!found)
{
ent->has_init = false;
+ ent->dirty_init = false;
ent->dirty_all = false;
}
- if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_init = true;
+ else if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
ent->dirty_all = true;
else
{
@@ -276,11 +283,10 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
{
/*
* When we come here after recovery, smgr object for this file might
- * have been created. In that case we need to drop all buffers then the
- * smgr object. Otherwise checkpointer wrongly tries to flush buffers
- * for nonexistent relation storage. This is safe as far as no other
- * backends have accessed the relation before starting archive
- * recovery.
+ * have been created. In that case we need to drop all buffers then
+ * the smgr object before initializing the unlogged relation. This is
+ * safe as long as no other backends have accessed the relation before
+ * starting archive recovery.
*/
HASH_SEQ_STATUS status;
relfile_entry *ent;
@@ -296,6 +302,13 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
{
RelFileLocatorBackend rel;
+ /*
+ * The relation is persistent and stays persistent. Don't drop the
+ * buffers for this relation.
+ */
+ if (ent->has_init && ent->dirty_init)
+ continue;
+
if (maxrels <= nrels)
{
maxrels *= 2;
@@ -352,8 +365,24 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
if (!ent->has_init)
continue;
- if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE)
- continue;
+ if (ent->dirty_init)
+ {
+ /*
+ * The crashed transaction did SET UNLOGGED. This relation
+ * is restored to a LOGGED relation.
+ */
+ if (forkNum != INIT_FORKNUM)
+ continue;
+ }
+ else
+ {
+ /*
+ * We don't remove the INIT fork of a non-dirty
+ * relation file.
+ */
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE)
+ continue;
+ }
}
/* so, nuke it! */
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index e9e4bafb01..ddc8014e55 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -434,6 +434,12 @@ extractPageInfo(XLogReaderState *record)
* empty so we don't need to bother the content.
*/
}
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ /*
+ * We can safely ignore these. These don't make any on-disk changes.
+ */
+ }
else if (rmid == RM_XACT_ID &&
((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT ||
(rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED ||
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index f7f3b8227f..b3a1f255d7 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -460,7 +460,6 @@ main(int argc, char **argv)
if (showprogress)
pg_log_info("reading source file list");
source->traverse_files(source, &process_source_file);
-
if (showprogress)
pg_log_info("reading target file list");
traverse_datadir(datadir_target, &process_target_file);
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index a36646c6ee..6e79c68f5b 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -62,6 +62,12 @@ typedef struct xl_smgr_mark
smgr_mark_action action;
} xl_smgr_mark;
+typedef struct xl_smgr_bufpersistence
+{
+ RelFileLocator rlocator;
+ bool persistence;
+} xl_smgr_bufpersistence;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -82,6 +88,8 @@ extern void log_smgrcreatemark(const RelFileLocator *rlocator,
ForkNumber forkNum, StorageMarks mark);
extern void log_smgrunlinkmark(const RelFileLocator *rlocator,
ForkNumber forkNum, StorageMarks mark);
+extern void log_smgrbufpersistence(const RelFileLocator *rlocator,
+ bool persistence);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index b8a18b8081..fd34810dc2 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -156,6 +156,8 @@ extern void DropRelationBuffers(struct SMgrRelationData *smgr_reln,
int nforks, BlockNumber *firstDelBlock);
extern void DropRelationsAllBuffers(struct SMgrRelationData **smgr_reln,
int nlocators);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+ bool permanent, bool isRedo);
extern void DropDatabaseBuffers(Oid dbid);
#define RelationGetNumberOfBlocks(reln) \
diff --git a/src/test/recovery/t/013_crash_restart.pl b/src/test/recovery/t/013_crash_restart.pl
index 9def8d2062..92e7b367df 100644
--- a/src/test/recovery/t/013_crash_restart.pl
+++ b/src/test/recovery/t/013_crash_restart.pl
@@ -86,24 +86,6 @@ ok( pump_until(
$killme_stdout = '';
$killme_stderr = '';
-#create a table that should *not* survive, but has rows.
-#the table's contents is requried to cause access to the storage file
-#after a restart.
-$killme_stdin .= q[
-CREATE TABLE not_alive AS SELECT 1 as a;
-SELECT pg_relation_filepath('not_alive');
-];
-ok( pump_until(
- $killme, $psql_timeout,
- \$killme_stdout, qr/[[:alnum:]\/]+[\r\n]$/m),
- 'added in-creation table');
-my $not_alive_relfile = $node->data_dir . "/" . $killme_stdout;
-chomp($not_alive_relfile);
-$killme_stdout = '';
-$killme_stderr = '';
-
-# The relfile must be exists now
-ok ( -e $not_alive_relfile, 'relfile for in-creation table');
# Start longrunning query in second session; its failure will signal that
# crash-restart has occurred. The initial wait for the trivial select is to
@@ -162,9 +144,6 @@ $killme->run();
($monitor_stdin, $monitor_stdout, $monitor_stderr) = ('', '', '');
$monitor->run();
-# The relfile must have been removed due to the recent restart.
-ok ( ! -e $not_alive_relfile,
- 'relfile for the in-creation table should be removed after restart');
# Acquire pid of new backend
$killme_stdin .= q[
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 747b7557dc..8dbbb09e8c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3886,6 +3886,7 @@ xl_replorigin_set
xl_restore_point
xl_running_xacts
xl_seq_rec
+xl_smgr_bufpersistence
xl_smgr_create
xl_smgr_mark
xl_smgr_truncate
--
2.31.1
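
The pending-cleanup bookkeeping in RelationCreateInitFork and
RelationDropInitFork above may be easier to follow as a standalone model.
The following is only a minimal sketch under stated assumptions, not the
patch's code: the Pending struct, INIT_FORK, and all function names are
hypothetical stand-ins, mark files and buffer persistence are left out, and
the list walk uses a pointer-to-pointer idiom instead of the prev/next
variables used in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define INIT_FORK 3				/* illustrative value, not PostgreSQL's enum */

/* Hypothetical, simplified stand-in for the patch's PendingCleanup entry. */
typedef struct Pending
{
	unsigned	relid;			/* stands in for RelFileLocator */
	int			forknum;		/* fork the pending unlink applies to */
	bool		at_commit;		/* run at commit (true) or at abort (false) */
	struct Pending *next;
} Pending;

/* Remove all entries for (relid, forknum); report whether any was removed. */
static bool
cancel_pending(Pending **head, unsigned relid, int forknum)
{
	bool		removed = false;
	Pending   **link = head;

	while (*link != NULL)
	{
		Pending    *cur = *link;

		if (cur->relid == relid && cur->forknum == forknum)
		{
			*link = cur->next;	/* unlink; "link" itself does not advance */
			free(cur);
			removed = true;
		}
		else
			link = &cur->next;
	}
	return removed;
}

/* Push a new pending unlink at the head of the list. */
static void
push_pending(Pending **head, unsigned relid, int forknum, bool at_commit)
{
	Pending    *e = malloc(sizeof(Pending));

	assert(e != NULL);
	e->relid = relid;
	e->forknum = forknum;
	e->at_commit = at_commit;
	e->next = *head;
	*head = e;
}

/* Model of SET UNLOGGED: returns true if a new init fork must be created. */
static bool
model_create_init_fork(Pending **head, unsigned relid)
{
	/* A pending drop means the fork already exists; just cancel the drop. */
	if (cancel_pending(head, relid, INIT_FORK))
		return false;
	push_pending(head, relid, INIT_FORK, false);	/* drop it at abort */
	return true;
}

/* Model of SET LOGGED: returns true if the fork is removed immediately. */
static bool
model_drop_init_fork(Pending **head, unsigned relid)
{
	/* A pending entry means the fork was created in this transaction. */
	if (cancel_pending(head, relid, INIT_FORK))
		return true;
	push_pending(head, relid, INIT_FORK, true);		/* drop it at commit */
	return false;
}

/* Count the entries currently registered. */
static int
count_pending(const Pending *head)
{
	int			n = 0;

	for (; head != NULL; head = head->next)
		n++;
	return n;
}
```

In this model, a SET UNLOGGED followed by SET LOGGED in the same
transaction leaves the list empty again, which mirrors the intent of the two
functions cancelling each other's entries instead of stacking new ones.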
v27-0003-New-command-ALTER-TABLE-ALL-IN-TABLESPACE-SET-LO.patch (text/x-patch; charset=us-ascii)
From 6d8b4d8d1a34e4093f6c16d288aad80482d9122d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Wed, 15 Mar 2023 16:39:23 +0900
Subject: [PATCH v27 3/3] New command ALTER TABLE ALL IN TABLESPACE SET
LOGGED/UNLOGGED
To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes
relation persistence of all tables in the specified tablespace.
---
doc/src/sgml/ref/alter_table.sgml | 15 +++
src/backend/catalog/storage.c | 4 +-
src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++
src/backend/parser/gram.y | 42 +++++++
src/backend/storage/buffer/bufmgr.c | 2 +-
src/backend/tcop/utility.c | 11 ++
src/include/catalog/storage_xlog.h | 2 +-
src/include/commands/tablecmds.h | 2 +
src/include/nodes/parsenodes.h | 10 ++
src/test/regress/expected/tablespace.out | 76 ++++++++++++
src/test/regress/sql/tablespace.sql | 41 +++++++
11 files changed, 341 insertions(+), 4 deletions(-)
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index d4d93eeb7c..7ee09ca9cf 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -33,6 +33,8 @@ ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
SET SCHEMA <replaceable class="parameter">new_schema</replaceable>
ALTER TABLE ALL IN TABLESPACE <replaceable class="parameter">name</replaceable> [ OWNED BY <replaceable class="parameter">role_name</replaceable> [, ... ] ]
SET TABLESPACE <replaceable class="parameter">new_tablespace</replaceable> [ NOWAIT ]
+ALTER TABLE ALL IN TABLESPACE <replaceable class="parameter">name</replaceable> [ OWNED BY <replaceable class="parameter">role_name</replaceable> [, ... ] ]
+ SET { LOGGED | UNLOGGED } [ NOWAIT ]
ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
ATTACH PARTITION <replaceable class="parameter">partition_name</replaceable> { FOR VALUES <replaceable class="parameter">partition_bound_spec</replaceable> | DEFAULT }
ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
@@ -769,6 +771,19 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
(for identity or serial columns). However, it is also possible to
change the persistence of such sequences separately.
</para>
+ <para>
+ The persistence of all tables of the current database in a tablespace
+ can be changed by using the <literal>ALL IN TABLESPACE</literal> form,
+ which first locks all tables to be changed and then changes each one.
+ This form also supports
+ <literal>OWNED BY</literal>, which will only change tables owned by the
+ specified roles. If the <literal>NOWAIT</literal> option is specified,
+ then the command will fail if it is unable to immediately acquire all of
+ the locks required. The <literal>information_schema</literal> relations
+ are not considered part of the system catalogs and will be changed. See
+ also
+ <link linkend="sql-createtablespace"><command>CREATE TABLESPACE</command></link>.
+ </para>
</listitem>
</varlistentry>
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 97d1230ee8..38a88b1ccf 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -507,14 +507,14 @@ log_smgrunlinkmark(const RelFileLocator *rlocator, ForkNumber forkNum,
* Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
*/
void
-log_smgrbufpersistence(const RelFileLocator *rlocator, bool persistence)
+log_smgrbufpersistence(const RelFileLocator rlocator, bool persistence)
{
xl_smgr_bufpersistence xlrec;
/*
* Make an XLOG entry reporting the change of buffer persistence.
*/
- xlrec.rlocator = *rlocator;
+ xlrec.rlocator = rlocator;
xlrec.persistence = persistence;
XLogBeginInsert();
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index becef96927..ab6ba6192d 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14889,6 +14889,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
return new_tablespaceoid;
}
+/*
+ * Alter Table ALL ... SET LOGGED/UNLOGGED
+ *
+ * Allows a user to change the persistence of all objects in a given
+ * tablespace in the current database. Objects can also be chosen based on
+ * their owner, to allow users to change the persistence of only their own
+ * objects. The main permissions handling is done by the lower-level change
+ * persistence function.
+ *
+ * All to-be-modified objects are locked first. If NOWAIT is specified and the
+ * lock can't be acquired then we ereport(ERROR).
+ */
+void
+AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt)
+{
+ List *relations = NIL;
+ ListCell *l;
+ ScanKeyData key[1];
+ Relation rel;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ Oid tablespaceoid;
+ List *role_oids = roleSpecsToIds(stmt->roles);
+
+ /* Ensure we were not asked to change something we can't */
+ if (stmt->objtype != OBJECT_TABLE)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("only tables can be specified"));
+
+ /* Get the tablespace OID */
+ tablespaceoid = get_tablespace_oid(stmt->tablespacename, false);
+
+ /*
+ * Now that the checks are done, check if we should set it to
+ * InvalidOid because it is our database's default tablespace.
+ */
+ if (tablespaceoid == MyDatabaseTableSpace)
+ tablespaceoid = InvalidOid;
+
+ /*
+ * Walk the list of objects in the tablespace to pick them up. This will
+ * only find objects in our database, of course.
+ */
+ ScanKeyInit(&key[0],
+ Anum_pg_class_reltablespace,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(tablespaceoid));
+
+ rel = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(rel, 1, key);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relOid = relForm->oid;
+
+ /*
+ * Do not pick-up objects in pg_catalog as part of this, if an admin
+ * really wishes to do so, they can issue the individual ALTER
+ * commands directly.
+ *
+ * Also, explicitly avoid any shared tables, temp tables, or TOAST
+ * (TOAST will be changed with the main table).
+ */
+ if (IsCatalogNamespace(relForm->relnamespace) ||
+ relForm->relisshared ||
+ isAnyTempNamespace(relForm->relnamespace) ||
+ IsToastNamespace(relForm->relnamespace))
+ continue;
+
+ /* Only pick up the object type requested */
+ if (relForm->relkind != RELKIND_RELATION)
+ continue;
+
+ /* Check if we are only picking-up objects owned by certain roles */
+ if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner))
+ continue;
+
+ /*
+ * Handle permissions-checking here since we are locking the tables
+ * and also to avoid doing a bunch of work only to fail part-way. Note
+ * that permissions will also be checked by AlterTableInternal().
+ *
+ * Caller must be considered an owner on the table of which we're
+ * going to change persistence.
+ */
+ if (!object_ownercheck(RelationRelationId, relOid, GetUserId()))
+ aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)),
+ NameStr(relForm->relname));
+
+ if (stmt->nowait &&
+ !ConditionalLockRelationOid(relOid, AccessExclusiveLock))
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("aborting because lock on relation \"%s.%s\" is not available",
+ get_namespace_name(relForm->relnamespace),
+ NameStr(relForm->relname)));
+ else
+ LockRelationOid(relOid, AccessExclusiveLock);
+
+ /*
+ * Add to our list of objects of which we're going to change
+ * persistence.
+ */
+ relations = lappend_oid(relations, relOid);
+ }
+
+ table_endscan(scan);
+ table_close(rel, AccessShareLock);
+
+ if (relations == NIL)
+ ereport(NOTICE,
+ errcode(ERRCODE_NO_DATA_FOUND),
+ errmsg("no matching relations in tablespace \"%s\" found",
+ tablespaceoid == InvalidOid ? "(database default)" :
+ get_tablespace_name(tablespaceoid)));
+
+ /*
+ * Everything is locked, loop through and change persistence of all of the
+ * relations.
+ */
+ foreach(l, relations)
+ {
+ List *cmds = NIL;
+ AlterTableCmd *cmd = makeNode(AlterTableCmd);
+
+ if (stmt->logged)
+ cmd->subtype = AT_SetLogged;
+ else
+ cmd->subtype = AT_SetUnLogged;
+
+ cmds = lappend(cmds, cmd);
+
+ EventTriggerAlterTableStart((Node *) stmt);
+ /* OID is set by AlterTableInternal */
+ AlterTableInternal(lfirst_oid(l), cmds, false);
+ EventTriggerAlterTableEnd();
+ }
+}
+
static void
index_copy_data(Relation rel, RelFileLocator newrlocator)
{
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index efe88ccf9d..1616130e01 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -2105,6 +2105,48 @@ AlterTableStmt:
n->nowait = $13;
$$ = (Node *) n;
}
+ | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = true;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name OWNED BY role_list SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->roles = $9;
+ n->logged = true;
+ n->nowait = $12;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = false;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name OWNED BY role_list SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->roles = $9;
+ n->logged = false;
+ n->nowait = $12;
+ $$ = (Node *)n;
+ }
| ALTER INDEX qualified_name alter_table_cmds
{
AlterTableStmt *n = makeNode(AlterTableStmt);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 2b00ec3eed..75d74caaba 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3309,7 +3309,7 @@ SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
PinBuffer_Locked(bufHdr);
LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
LW_SHARED);
- FlushBuffer(bufHdr, srel);
+ FlushBuffer(bufHdr, srel, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr);
}
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index c7d9d96b45..1cbd86e3c1 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -164,6 +164,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_AlterTSConfigurationStmt:
case T_AlterTSDictionaryStmt:
case T_AlterTableMoveAllStmt:
+ case T_AlterTableSetLoggedAllStmt:
case T_AlterTableSpaceOptionsStmt:
case T_AlterTableStmt:
case T_AlterTypeStmt:
@@ -1760,6 +1761,12 @@ ProcessUtilitySlow(ParseState *pstate,
commandCollected = true;
break;
+ case T_AlterTableSetLoggedAllStmt:
+ AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree);
+ /* commands are stashed in AlterTableSetLoggedAll */
+ commandCollected = true;
+ break;
+
case T_DropStmt:
ExecDropStmt((DropStmt *) parsetree, isTopLevel);
/* no commands stashed for DROP */
@@ -2688,6 +2695,10 @@ CreateCommandTag(Node *parsetree)
tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype);
+ break;
+
case T_AlterTableStmt:
tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype);
break;
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index 6e79c68f5b..847660b6af 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -88,7 +88,7 @@ extern void log_smgrcreatemark(const RelFileLocator *rlocator,
ForkNumber forkNum, StorageMarks mark);
extern void log_smgrunlinkmark(const RelFileLocator *rlocator,
ForkNumber forkNum, StorageMarks mark);
-extern void log_smgrbufpersistence(const RelFileLocator *rlocator,
+extern void log_smgrbufpersistence(const RelFileLocator rlocator,
bool persistence);
extern void smgr_redo(XLogReaderState *record);
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index e7c2b91a58..6c0b60475a 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
+extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt * stmt);
+
extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
Oid *oldschema);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 028588fb33..217b26aeec 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2537,6 +2537,16 @@ typedef struct AlterTableMoveAllStmt
bool nowait;
} AlterTableMoveAllStmt;
+typedef struct AlterTableSetLoggedAllStmt
+{
+ NodeTag type;
+ char *tablespacename;
+ ObjectType objtype; /* Object type to change */
+ List *roles; /* List of roles to change objects of */
+ bool logged;
+ bool nowait;
+} AlterTableSetLoggedAllStmt;
+
/* ----------------------
* Create/Alter Extension Statements
* ----------------------
diff --git a/src/test/regress/expected/tablespace.out b/src/test/regress/expected/tablespace.out
index 9aabb85349..35b150b297 100644
--- a/src/test/regress/expected/tablespace.out
+++ b/src/test/regress/expected/tablespace.out
@@ -964,5 +964,81 @@ drop cascades to table testschema.part
drop cascades to table testschema.atable
drop cascades to materialized view testschema.amv
drop cascades to table testschema.tablespace_acl
+--
+-- Check persistence change in a tablespace
+CREATE SCHEMA testschema;
+GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1;
+CREATE TABLESPACE regress_tablespace LOCATION '';
+GRANT CREATE ON TABLESPACE regress_tablespace TO regress_tablespace_user1;
+CREATE TABLE testschema.lsu(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.usu(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lsu(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._usu(a int) TABLESPACE pg_default;
+SET ROLE regress_tablespace_user1;
+CREATE TABLE testschema.lu1(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.uu1(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lu1(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._uu1(a int) TABLESPACE pg_default;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | p
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | p
+ uu1 | regress_tablespace | u
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace
+ OWNED BY regress_tablespace_user1 SET LOGGED;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | p
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | p
+ uu1 | regress_tablespace | p
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+RESET ROLE;
+ALTER TABLE ALL IN TABLESPACE regress_tablespace SET UNLOGGED;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | u
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | u
+ uu1 | regress_tablespace | u
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+-- Should succeed
+DROP SCHEMA testschema CASCADE;
+NOTICE: drop cascades to 8 other objects
+DETAIL: drop cascades to table testschema.lsu
+drop cascades to table testschema.usu
+drop cascades to table testschema._lsu
+drop cascades to table testschema._usu
+drop cascades to table testschema.lu1
+drop cascades to table testschema.uu1
+drop cascades to table testschema._lu1
+drop cascades to table testschema._uu1
+DROP TABLESPACE regress_tablespace;
DROP ROLE regress_tablespace_user1;
DROP ROLE regress_tablespace_user2;
diff --git a/src/test/regress/sql/tablespace.sql b/src/test/regress/sql/tablespace.sql
index d274d9615e..eb8e247a1d 100644
--- a/src/test/regress/sql/tablespace.sql
+++ b/src/test/regress/sql/tablespace.sql
@@ -429,5 +429,46 @@ DROP TABLESPACE regress_tblspace_renamed;
DROP SCHEMA testschema CASCADE;
+
+--
+-- Check persistence change in a tablespace
+CREATE SCHEMA testschema;
+GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1;
+CREATE TABLESPACE regress_tablespace LOCATION '';
+GRANT CREATE ON TABLESPACE regress_tablespace TO regress_tablespace_user1;
+
+CREATE TABLE testschema.lsu(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.usu(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lsu(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._usu(a int) TABLESPACE pg_default;
+SET ROLE regress_tablespace_user1;
+CREATE TABLE testschema.lu1(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.uu1(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lu1(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._uu1(a int) TABLESPACE pg_default;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace
+ OWNED BY regress_tablespace_user1 SET LOGGED;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+RESET ROLE;
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace SET UNLOGGED;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+-- Should succeed
+DROP SCHEMA testschema CASCADE;
+DROP TABLESPACE regress_tablespace;
+
DROP ROLE regress_tablespace_user1;
DROP ROLE regress_tablespace_user2;
--
2.31.1
At Fri, 17 Mar 2023 15:16:34 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Mmm. It took longer than I said, but this is the patch set that
includes all three parts.

1. "Mark files" to prevent orphan storage files for in-transaction
   created relations after a crash.

2. In-place persistence change: For ALTER TABLE SET LOGGED/UNLOGGED
   with wal_level minimal, and ALTER TABLE SET UNLOGGED with other
   wal_levels, the commands don't require a file copy for the relation
   storage. ALTER TABLE SET LOGGED with non-minimal wal_level emits
   bulk FPIs instead of a bunch of individual INSERTs.

3. An extension to ALTER TABLE SET (UN)LOGGED that can handle all
   tables in a tablespace at once.

As a side note, let me quickly go over the behavior of the mark files
introduced by the first patch, particularly what happens when deletion
fails.

(1) The mark file for MAIN fork ("<oid>.u") corresponds to all forks,
    while the mark file for INIT fork ("<oid>_init.u") corresponds to
    INIT fork alone.

(2) The mark file is created just before the corresponding storage
    file is made. This is always logged in the WAL.

(3) The mark file is deleted after removing the corresponding storage
    file during the commit and rollback. This action is logged in the
    WAL, too. If the deletion fails, an ERROR is output and the
    transaction aborts.

(4) If a crash leaves a mark file behind, the server will try to delete
    it after successfully removing the corresponding storage file
    during the subsequent startup that runs a recovery. If deletion
    fails, the server leaves the mark file in place and emits a
    WARNING. (The behavior is the same for non-mark files.)

(5) If the deletion of the mark file fails, the leftover mark file
    prevents the creation of the corresponding storage file (causing
    an ERROR). Because of that behavior, leftover mark files don't
    result in the removal of the wrong files.

(6) The mark file for an INIT fork is created only when ALTER TABLE
    SET UNLOGGED is executed (not for CREATE UNLOGGED TABLE) to signal
    the crash-cleanup code to remove the INIT fork. (Otherwise the
    cleanup code removes the main fork instead. This is the main
    objective of introducing the mark files.)
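Points (1) and (6) above can be illustrated with a small standalone sketch. This is not code from the patch set: the names SketchFork, SketchCleanup, sketch_mark_name and sketch_cleanup_action are hypothetical, and the rule that a MAIN-fork mark takes precedence over an INIT-fork mark is an assumption drawn from the statement that the MAIN mark "corresponds to all forks". Only the "<oid>.u" and "<oid>_init.u" file name shapes come from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical fork identifiers for this sketch (not PostgreSQL's ForkNumber). */
typedef enum
{
	SKETCH_MAIN_FORK,
	SKETCH_INIT_FORK
} SketchFork;

/* What crash cleanup would do for one relation in this sketch. */
typedef enum
{
	CLEANUP_NOTHING,			/* no mark file was left behind */
	CLEANUP_INIT_FORK_ONLY,		/* "<oid>_init.u" found: drop only the INIT fork */
	CLEANUP_ALL_FORKS			/* "<oid>.u" found: drop every fork */
} SketchCleanup;

/*
 * Build a mark file name: "<oid>.u" for the MAIN fork and "<oid>_init.u"
 * for the INIT fork. Returns the value of snprintf().
 */
int
sketch_mark_name(char *buf, size_t buflen, unsigned int relnumber,
				 SketchFork fork)
{
	assert(buf != NULL && buflen > 0);

	if (fork == SKETCH_INIT_FORK)
		return snprintf(buf, buflen, "%u_init.u", relnumber);
	return snprintf(buf, buflen, "%u.u", relnumber);
}

/*
 * Decide the cleanup action from which mark files exist. Assumption: a
 * MAIN-fork mark covers all forks, so it wins over an INIT-fork mark.
 */
SketchCleanup
sketch_cleanup_action(bool main_mark_exists, bool init_mark_exists)
{
	if (main_mark_exists)
		return CLEANUP_ALL_FORKS;
	if (init_mark_exists)
		return CLEANUP_INIT_FORK_ONLY;
	return CLEANUP_NOTHING;
}
```

For example, a relation numbered 16384 that was being switched to UNLOGGED when the server crashed would, under these assumptions, leave "16384_init.u" behind, and the cleanup decision would be CLEANUP_INIT_FORK_ONLY, so the main fork survives.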
Rebased.
I fixed some code comments and commit messages. I fixed the wrong
arrangement of some changes among patches. Most importantly, I fixed
a bug based on the wrong assumption that an init fork never resides in
shared buffers. Now smgrDoPendingCleanups drops the buffers for an
init fork that is to be removed.
The new fourth patch is a temporary fix for recently added code, which
will soon be no longer needed.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v28-0001-Introduce-storage-mark-files.patch (text/x-patch; charset=us-ascii)
From e645e4782c4a1562aa932f87f3932b4e54beac11 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 2 Mar 2023 17:25:12 +0900
Subject: [PATCH v28 1/4] Introduce storage mark files
In specific scenarios, certain operations followed by a crash-restart
may generate orphaned storage files that cannot be removed through
standard procedures or cause the server to fail during restart. This
commit introduces 'mark files' to convey information about the storage
file. In particular, an "UNCOMMITTED" mark file is implemented to
identify uncommitted files for removal during the subsequent startup.
---
src/backend/access/rmgrdesc/smgrdesc.c | 37 +++
src/backend/access/transam/README | 10 +
src/backend/access/transam/xact.c | 7 +
src/backend/access/transam/xlogrecovery.c | 18 ++
src/backend/backup/basebackup.c | 9 +-
src/backend/catalog/storage.c | 268 +++++++++++++++++-
src/backend/storage/file/fd.c | 4 +-
src/backend/storage/file/reinit.c | 313 +++++++++++++++-------
src/backend/storage/smgr/md.c | 94 ++++++-
src/backend/storage/smgr/smgr.c | 32 +++
src/backend/storage/sync/sync.c | 26 +-
src/bin/pg_rewind/parsexlog.c | 16 ++
src/common/relpath.c | 47 ++--
src/include/catalog/storage.h | 3 +
src/include/catalog/storage_xlog.h | 35 ++-
src/include/common/relpath.h | 9 +-
src/include/storage/fd.h | 1 +
src/include/storage/md.h | 8 +-
src/include/storage/reinit.h | 8 +-
src/include/storage/smgr.h | 17 ++
src/test/recovery/t/013_crash_restart.pl | 21 ++
src/tools/pgindent/typedefs.list | 6 +
22 files changed, 847 insertions(+), 142 deletions(-)
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index bd841b96e8..f8187385c4 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,37 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+ char *path = relpathperm(xlrec->rlocator, xlrec->forkNum);
+
+ appendStringInfoString(buf, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+ char *path = GetRelationPath(xlrec->rlocator.dbOid,
+ xlrec->rlocator.spcOid,
+ xlrec->rlocator.relNumber,
+ InvalidBackendId,
+ xlrec->forkNum, xlrec->mark);
+ char *action = "<none>";
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ action = "CREATE";
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ action = "DELETE";
+ break;
+ }
+
+ appendStringInfo(buf, "%s %s", action, path);
+ pfree(path);
+ }
}
const char *
@@ -55,6 +86,12 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_UNLINK:
+ id = "UNLINK";
+ break;
+ case XLOG_SMGR_MARK:
+ id = "MARK";
+ break;
}
return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 22c8ae9755..e10f6af0e3 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -741,6 +741,16 @@ we must panic and abort recovery. The DBA will have to manually clean up
then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.
+================================
+Smgr MARK files
+--------------------------------
+
+A storage manager (smgr) mark file is an empty file created alongside a
+new relation storage file, indicating that the storage file requires
+cleanup during the recovery process. Unlike the previous four actions
+mentioned, failure to remove these marker files may lead to data loss,
+causing the server to shut down.
+
Skipping WAL for New RelFileLocator
--------------------------------
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 6a837e1539..4334ee198f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2224,6 +2224,9 @@ CommitTransaction(void)
*/
smgrDoPendingSyncs(true, is_parallel_worker);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2475,6 +2478,9 @@ PrepareTransaction(void)
*/
smgrDoPendingSyncs(true, false);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2799,6 +2805,7 @@ AbortTransaction(void)
AfterTriggerEndXact(false); /* 'false' means it's abort */
AtAbort_Portals();
smgrDoPendingSyncs(false, is_parallel_worker);
+ smgrDoPendingCleanups(false);
AtEOXact_LargeObject(false);
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 188f6d6f85..84537842af 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -42,6 +42,7 @@
#include "access/xlogutils.h"
#include "backup/basebackup.h"
#include "catalog/pg_control.h"
+#include "catalog/storage.h"
#include "commands/tablespace.h"
#include "common/file_utils.h"
#include "miscadmin.h"
@@ -56,6 +57,7 @@
#include "storage/pmsignal.h"
#include "storage/proc.h"
#include "storage/procarray.h"
+#include "storage/reinit.h"
#include "storage/spin.h"
#include "utils/builtins.h"
#include "utils/datetime.h"
@@ -1795,6 +1797,14 @@ PerformWalRecovery(void)
RmgrCleanup();
+ /* cleanup garbage files left during crash recovery */
+ if (!InArchiveRecovery)
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
ereport(LOG,
(errmsg("redo done at %X/%X system usage: %s",
LSN_FORMAT_ARGS(xlogreader->ReadRecPtr),
@@ -3153,6 +3163,14 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
{
ereport(DEBUG1,
(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+
+ /* cleanup garbage files left during crash recovery */
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
InArchiveRecovery = true;
if (StandbyModeRequested)
EnableStandbyMode();
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 5baea7535b..a235ab6502 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -1191,6 +1191,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
ForkNumber relForkNum; /* Type of fork if file is a relation */
int relnumchars; /* Chars in filename that are the
* relnumber */
+ StorageMarks mark; /* marker file sign */
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1241,7 +1242,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
/* Exclude all forks for unlogged tables except the init fork */
if (isDbDir &&
parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &relForkNum))
+ &relForkNum, &mark))
{
/* Never exclude init forks */
if (relForkNum != INIT_FORKNUM)
@@ -1448,6 +1449,7 @@ is_checksummed_file(const char *fullpath, const char *filename)
strncmp(fullpath, "/", 1) == 0)
{
int excludeIdx;
+ char *p;
/* Compare file against noChecksumFiles skip list */
for (excludeIdx = 0; noChecksumFiles[excludeIdx].name != NULL; excludeIdx++)
@@ -1461,6 +1463,11 @@ is_checksummed_file(const char *fullpath, const char *filename)
return false;
}
+ /* exclude mark files */
+ p = strchr(filename, '.');
+ if (p && isalpha(p[1]) && p[2] == 0)
+ return false;
+
return true;
}
else
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 2add053489..fe06c3c31d 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,6 +19,7 @@
#include "postgres.h"
+#include "access/amapi.h"
#include "access/parallel.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -66,6 +67,21 @@ typedef struct PendingRelDelete
struct PendingRelDelete *next; /* linked-list link */
} PendingRelDelete;
+#define PCOP_UNLINK_FORK (1 << 0)
+#define PCOP_UNLINK_MARK (1 << 1)
+
+typedef struct PendingCleanup
+{
+ RelFileLocator rlocator; /* relation that need a cleanup */
+ int op; /* operation mask */
+ ForkNumber unlink_forknum; /* forknum to unlink */
+ StorageMarks unlink_mark; /* mark to unlink */
+ BackendId backend; /* InvalidBackendId if not a temp rel */
+ bool atCommit; /* T=delete at commit; F=delete at abort */
+ int nestLevel; /* xact nesting level of request */
+ struct PendingCleanup *next; /* linked-list link */
+} PendingCleanup;
+
typedef struct PendingRelSync
{
RelFileLocator rlocator;
@@ -73,6 +89,7 @@ typedef struct PendingRelSync
} PendingRelSync;
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingCleanup * pendingCleanups = NULL; /* head of linked list */
static HTAB *pendingSyncHash = NULL;
@@ -123,6 +140,7 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
SMgrRelation srel;
BackendId backend;
bool needs_wal;
+ PendingCleanup *pendingclean;
Assert(!IsInParallelMode()); /* couldn't update pendingSyncHash */
@@ -145,7 +163,21 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
return NULL; /* placate compiler */
}
+ /*
+ * We are going to create a new storage file. If the server crashes before
+ * the current transaction ends, the file needs to be cleaned up. The
+ * SMGR_MARK_UNCOMMITTED mark file prompts that work at the next startup.
+ * We don't need this during WAL-logged CREATE DATABASE. See
+ * CreateAndCopyRelationData for detail.
+ */
srel = smgropen(rlocator, backend);
+
+ if (register_delete)
+ {
+ log_smgrcreatemark(&rlocator, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+ }
+
smgrcreate(srel, MAIN_FORKNUM, false);
if (needs_wal)
@@ -157,16 +189,29 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
*/
if (register_delete)
{
- PendingRelDelete *pending;
+ PendingRelDelete *pendingdel;
- pending = (PendingRelDelete *)
+ pendingdel = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
- pending->rlocator = rlocator;
- pending->backend = backend;
- pending->atCommit = false; /* delete if abort */
- pending->nestLevel = GetCurrentTransactionNestLevel();
- pending->next = pendingDeletes;
- pendingDeletes = pending;
+ pendingdel->rlocator = rlocator;
+ pendingdel->backend = backend;
+ pendingdel->atCommit = false; /* delete if abort */
+ pendingdel->nestLevel = GetCurrentTransactionNestLevel();
+ pendingdel->next = pendingDeletes;
+ pendingDeletes = pendingdel;
+
+ /* drop mark files at commit */
+ pendingclean = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pendingclean->rlocator = rlocator;
+ pendingclean->op = PCOP_UNLINK_MARK;
+ pendingclean->unlink_forknum = MAIN_FORKNUM;
+ pendingclean->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pendingclean->backend = backend;
+ pendingclean->atCommit = true;
+ pendingclean->nestLevel = GetCurrentTransactionNestLevel();
+ pendingclean->next = pendingCleanups;
+ pendingCleanups = pendingclean;
}
if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
@@ -197,6 +242,69 @@ log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileLocator *rlocator, ForkNumber forkNum)
+{
+ xl_smgr_unlink xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file unlink.
+ */
+ xlrec.rlocator = *rlocator;
+ xlrec.forkNum = forkNum;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_CREATEMARK record to WAL.
+ */
+void
+log_smgrcreatemark(const RelFileLocator *rlocator, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file creation.
+ */
+ xlrec.rlocator = *rlocator;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_CREATE;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINKMARK record to WAL.
+ */
+void
+log_smgrunlinkmark(const RelFileLocator *rlocator, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the mark file removal.
+ */
+ xlrec.rlocator = *rlocator;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_UNLINK;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -711,6 +819,76 @@ smgrDoPendingDeletes(bool isCommit)
}
}
+/*
+ * smgrDoPendingCleanups() -- Clean up work that emits WAL records
+ *
+ * The operations handled in this function emit WAL records, which must be
+ * part of the current transaction.
+ */
+void
+smgrDoPendingCleanups(bool isCommit)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (pending->nestLevel < nestLevel)
+ {
+ /* outer-level entries should not be processed yet */
+ prev = pending;
+ }
+ else
+ {
+ /* unlink list entry first, so we don't retry on failure */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ /* do cleanup if called for */
+ if (pending->atCommit == isCommit)
+ {
+ SMgrRelation srel;
+
+ srel = smgropen(pending->rlocator, pending->backend);
+
+ Assert((pending->op &
+ ~(PCOP_UNLINK_FORK | PCOP_UNLINK_MARK)) == 0);
+
+ if (pending->op & PCOP_UNLINK_FORK)
+ {
+ /* Don't emit WAL during recovery. */
+ if (!InRecovery)
+ log_smgrunlink(&pending->rlocator,
+ pending->unlink_forknum);
+ smgrunlink(srel, pending->unlink_forknum, false);
+ }
+
+ if (pending->op & PCOP_UNLINK_MARK)
+ {
+ if (!InRecovery)
+ log_smgrunlinkmark(&pending->rlocator,
+ pending->unlink_forknum,
+ pending->unlink_mark);
+
+ smgrunlinkmark(srel, pending->unlink_forknum,
+ pending->unlink_mark, InRecovery);
+ smgrclose(srel);
+ }
+ }
+
+ /* must explicitly free the list entry */
+ pfree(pending);
+ /* prev does not change */
+ }
+ }
+}
+
/*
* smgrDoPendingSyncs() -- Take care of relation syncs at end of xact.
*/
@@ -971,6 +1149,15 @@ smgr_redo(XLogReaderState *record)
reln = smgropen(xlrec->rlocator, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rlocator, InvalidBackendId);
+ smgrunlink(reln, xlrec->forkNum, true);
+ smgrclose(reln);
+ }
else if (info == XLOG_SMGR_TRUNCATE)
{
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
@@ -1059,6 +1246,71 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ bool created = false;
+
+ reln = smgropen(xlrec->rlocator, InvalidBackendId);
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ smgrcreatemark(reln, xlrec->forkNum, xlrec->mark, true);
+ created = true;
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ smgrunlinkmark(reln, xlrec->forkNum, xlrec->mark, true);
+ break;
+ default:
+ elog(ERROR, "unknown smgr_mark action \"%c\"", xlrec->action);
+ }
+
+ if (created)
+ {
+ /* revert mark file operation at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = xlrec->rlocator;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = xlrec->forkNum;
+ pending->unlink_mark = xlrec->mark;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ else
+ {
+ /*
+ * Delete pending action for this mark file if any. We should have
+ * at most one entry for this action.
+ */
+ PendingCleanup *prev = NULL;
+
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileLocatorEquals(xlrec->rlocator, pending->rlocator) &&
+ pending->unlink_forknum == xlrec->forkNum &&
+ (pending->op & PCOP_UNLINK_MARK) != 0)
+ {
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+ }
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 277a28fc13..c3d79fe343 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -351,8 +351,6 @@ static void pre_sync_fname(const char *fname, bool isdir, int elevel);
static void datadir_fsync_fname(const char *fname, bool isdir, int elevel);
static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel);
-static int fsync_parent_path(const char *fname, int elevel);
-
/*
* pg_fsync --- do fsync with or without writethrough
@@ -3766,7 +3764,7 @@ fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel)
* This is aimed at making file operations persistent on disk in case of
* an OS crash or power failure.
*/
-static int
+int
fsync_parent_path(const char *fname, int elevel)
{
char parentpath[MAXPGPATH];
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index fb55371b1b..e84fcbf884 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -16,29 +16,45 @@
#include <unistd.h>
+#include "access/xlogrecovery.h"
+#include "catalog/pg_tablespace_d.h"
#include "common/relpath.h"
#include "postmaster/startup.h"
+#include "storage/bufmgr.h"
#include "storage/copydir.h"
#include "storage/fd.h"
+#include "storage/md.h"
#include "storage/reinit.h"
+#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
- int op);
+ Oid tspid, int op);
static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
- int op);
+ Oid tspid, Oid dbid, int op);
typedef struct
{
- Oid reloid; /* hash key */
-} unlogged_relation_entry;
+ RelFileNumber relNumber; /* hash key */
+ bool has_init; /* has INIT fork */
+ bool dirty_all; /* needs to remove all forks */
+} relfile_entry;
/*
- * Reset unlogged relations from before the last restart.
+ * Clean up and reset relation files from before the last restart.
*
- * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any
- * relation with an "init" fork, except for the "init" fork itself.
+ * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations
+ * depending on the existence of mark files.
+ *
+ * If an SMGR_MARK_UNCOMMITTED mark file for the main fork is present, we
+ * remove the whole relation along with the mark file.
+ *
+ * Otherwise, if the "init" fork is found, we remove all forks of any relation
+ * with the "init" fork, except for the "init" fork itself.
+ *
+ * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all
+ * relations that are to be cleaned up.
*
* If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main
* fork.
@@ -72,7 +88,7 @@ ResetUnloggedRelations(int op)
/*
* First process unlogged files in pg_default ($PGDATA/base)
*/
- ResetUnloggedRelationsInTablespaceDir("base", op);
+ ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op);
/*
* Cycle through directories for all non-default tablespaces.
@@ -81,13 +97,19 @@ ResetUnloggedRelations(int op)
while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
{
+ Oid tspid;
+
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
continue;
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
- ResetUnloggedRelationsInTablespaceDir(temp_path, op);
+
+ tspid = atooid(spc_de->d_name);
+
+ Assert(tspid != 0);
+ ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op);
}
FreeDir(spc_dir);
@@ -103,7 +125,8 @@ ResetUnloggedRelations(int op)
* Process one tablespace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
+ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
+ Oid tspid, int op)
{
DIR *ts_dir;
struct dirent *de;
@@ -130,6 +153,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
while ((de = ReadDir(ts_dir, tsdirname)) != NULL)
{
+ Oid dbid;
+
/*
* We're only interested in the per-database directories, which have
* numeric names. Note that this code will also (properly) ignore "."
@@ -148,7 +173,10 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
ereport_startup_progress("resetting unlogged relations (cleanup), elapsed time: %ld.%02d s, current path: %s",
dbspace_path);
- ResetUnloggedRelationsInDbspaceDir(dbspace_path, op);
+ dbid = atooid(de->d_name);
+ Assert(dbid != 0);
+
+ ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op);
}
FreeDir(ts_dir);
@@ -158,125 +186,200 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
* Process one per-dbspace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
+ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
+ Oid tspid, Oid dbid, int op)
{
DIR *dbspace_dir;
struct dirent *de;
char rm_path[MAXPGPATH * 2];
+ HTAB *hash;
+ HASHCTL ctl;
/* Caller must specify at least one operation. */
- Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0);
+ Assert((op & (UNLOGGED_RELATION_CLEANUP |
+ UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_INIT)) != 0);
/*
* Cleanup is a two-pass operation. First, we go through and identify all
* the files with init forks. Then, we go through again and nuke
* everything with the same OID except the init fork.
*/
+
+ /*
+ * It's possible that someone could create tons of unlogged relations in
+ * the same database & tablespace, so we'd better use a hash table rather
+ * than an array or linked list to keep track of which files need to be
+ * reset. Otherwise, this cleanup operation would be O(n^2).
+ */
+ memset(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(RelFileNumber);
+ ctl.entrysize = sizeof(relfile_entry);
+ hash = hash_create("unlogged relation RelFileNumbers",
+ 32, &ctl, HASH_ELEM | HASH_BLOBS);
+
+ /* Collect INIT fork and mark files in the directory. */
+ dbspace_dir = AllocateDir(dbspacedirname);
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
+ {
+ ForkNumber forkNum;
+ int relnumchars;
+ StorageMarks mark;
+
+ /* Skip anything that doesn't look like a relation data file. */
+ if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
+ &forkNum, &mark))
+ continue;
+
+ if (forkNum == INIT_FORKNUM || mark == SMGR_MARK_UNCOMMITTED)
+ {
+ RelFileNumber key;
+ relfile_entry *ent;
+ bool found;
+
+ /*
+ * Put the OID portion of the name into the hash table, if it
+ * isn't already there. If the relation has an SMGR_MARK_UNCOMMITTED
+ * mark file, its storage is in a dirty state and needs
+ * cleanup.
+ */
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_ENTER, &found);
+
+ if (!found)
+ {
+ ent->has_init = false;
+ ent->dirty_all = false;
+ }
+
+ if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_all = true;
+ else
+ {
+ Assert(forkNum == INIT_FORKNUM);
+ ent->has_init = true;
+ }
+ }
+ }
+
+ /* Done with the first pass. */
+ FreeDir(dbspace_dir);
+
+ /* nothing to do if we found neither init forks nor mark files */
+ if (hash_get_num_entries(hash) < 1)
+ {
+ hash_destroy(hash);
+ return;
+ }
+
+ if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0)
+ {
+ /*
+ * When we come here after recovery, an smgr object for this file might
+ * have been created. In that case we need to drop all buffers and then
+ * the smgr object. Otherwise the checkpointer would wrongly try to flush
+ * buffers for nonexistent relation storage. This is safe as long as no
+ * other backends have accessed the relation before starting archive
+ * recovery.
+ */
+ HASH_SEQ_STATUS status;
+ relfile_entry *ent;
+ SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8);
+ int maxrels = 8;
+ int nrels = 0;
+ int i;
+
+ Assert(!HotStandbyActive());
+
+ hash_seq_init(&status, hash);
+ while ((ent = (relfile_entry *) hash_seq_search(&status)) != NULL)
+ {
+ RelFileLocatorBackend rel;
+
+ if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ rel.backend = InvalidBackendId;
+ rel.locator.spcOid = tspid;
+ rel.locator.dbOid = dbid;
+ rel.locator.relNumber = ent->relNumber;
+
+ srels[nrels++] = smgropen(rel.locator, InvalidBackendId);
+ }
+
+ DropRelationsAllBuffers(srels, nrels);
+
+ for (i = 0; i < nrels; i++)
+ smgrclose(srels[i]);
+ }
+
+ /*
+ * Now, make a second pass and remove anything that matches.
+ */
if ((op & UNLOGGED_RELATION_CLEANUP) != 0)
{
- HTAB *hash;
- HASHCTL ctl;
-
- /*
- * It's possible that someone could create a ton of unlogged relations
- * in the same database & tablespace, so we'd better use a hash table
- * rather than an array or linked list to keep track of which files
- * need to be reset. Otherwise, this cleanup operation would be
- * O(n^2).
- */
- ctl.keysize = sizeof(Oid);
- ctl.entrysize = sizeof(unlogged_relation_entry);
- ctl.hcxt = CurrentMemoryContext;
- hash = hash_create("unlogged relation OIDs", 32, &ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- /* Scan the directory. */
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int relnumchars;
- unlogged_relation_entry ent;
+ RelFileNumber key;
+ relfile_entry *ent;
+ RelFileLocatorBackend rel;
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
- continue;
-
- /* Also skip it unless this is the init fork. */
- if (forkNum != INIT_FORKNUM)
- continue;
-
- /*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
- */
- ent.reloid = atooid(de->d_name);
- (void) hash_search(hash, &ent, HASH_ENTER, NULL);
- }
-
- /* Done with the first pass. */
- FreeDir(dbspace_dir);
-
- /*
- * If we didn't find any init forks, there's no point in continuing;
- * we can bail out now.
- */
- if (hash_get_num_entries(hash) == 0)
- {
- hash_destroy(hash);
- return;
- }
-
- /*
- * Now, make a second pass and remove anything that matches.
- */
- dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
- {
- ForkNumber forkNum;
- int relnumchars;
- unlogged_relation_entry ent;
-
- /* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
- continue;
-
- /* We never remove the init fork. */
- if (forkNum == INIT_FORKNUM)
+ &forkNum, &mark))
continue;
/*
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
- if (hash_search(hash, &ent, HASH_FIND, NULL))
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);
+
+ if (!ent)
+ continue;
+
+ if (!ent->dirty_all)
{
- snprintf(rm_path, sizeof(rm_path), "%s/%s",
- dbspacedirname, de->d_name);
- if (unlink(rm_path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m",
- rm_path)));
- else
- elog(DEBUG2, "unlinked file \"%s\"", rm_path);
+ /* clean permanent relations don't need cleanup */
+ if (!ent->has_init)
+ continue;
+
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE)
+ continue;
}
+
+ /* so, nuke it! */
+ snprintf(rm_path, sizeof(rm_path), "%s/%s",
+ dbspacedirname, de->d_name);
+ if (unlink(rm_path) < 0)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ rm_path));
+
+ rel.backend = InvalidBackendId;
+ rel.locator.spcOid = tspid;
+ rel.locator.dbOid = dbid;
+ rel.locator.relNumber = atooid(de->d_name);
+
+ ForgetRelationForkSyncRequests(rel, forkNum);
}
/* Cleanup is complete. */
FreeDir(dbspace_dir);
- hash_destroy(hash);
}
/*
* Initialization happens after cleanup is complete: we copy each init
- * fork file to the corresponding main fork file. Note that if we are
- * asked to do both cleanup and init, we may never get here: if the
- * cleanup code determines that there are no init forks in this dbspace,
- * it will return before we get to this point.
+ * fork file to the corresponding main fork file.
*/
if ((op & UNLOGGED_RELATION_INIT) != 0)
{
@@ -285,6 +388,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int relnumchars;
char relnumbuf[OIDCHARS + 1];
char srcpath[MAXPGPATH * 2];
@@ -292,9 +396,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -328,15 +434,18 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int relnumchars;
char relnumbuf[OIDCHARS + 1];
char mainpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -379,7 +488,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
*/
bool
parse_filename_for_nontemp_relation(const char *name, int *relnumchars,
- ForkNumber *fork)
+ ForkNumber *fork, StorageMarks *mark)
{
int pos;
@@ -410,11 +519,19 @@ parse_filename_for_nontemp_relation(const char *name, int *relnumchars,
for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar)
;
- if (segchar <= 1)
- return false;
- pos += segchar;
+ if (segchar > 1)
+ pos += segchar;
}
+ /* mark file? */
+ if (name[pos] == '.' && name[pos + 1] != 0)
+ {
+ *mark = name[pos + 1];
+ pos += 2;
+ }
+ else
+ *mark = SMGR_MARK_NONE;
+
/* Now we should be at the end. */
if (name[pos] != '\0')
return false;
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index e982a8dd7f..6d33b4aef8 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -141,6 +141,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forknum,
BlockNumber blkno, bool skipFsync, int behavior);
static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
MdfdVec *seg);
+static bool mdmarkexists(SMgrRelation reln, ForkNumber forkNum,
+ StorageMarks mark);
static inline int
_mdfd_open_flags(void)
@@ -183,6 +185,82 @@ mdexists(SMgrRelation reln, ForkNumber forknum)
return (mdopenfork(reln, forknum, EXTENSION_RETURN_NULL) != NULL);
}
+/*
+ * mdcreatemark() -- Create a mark file.
+ *
+ * If isRedo is true, it's okay for the file to exist already.
+ */
+void
+mdcreatemark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rlocator, forkNum, mark);
+ int fd;
+
+ /* See mdcreate for details. */
+ TablespaceCreateDbspace(reln->smgr_rlocator.locator.spcOid,
+ reln->smgr_rlocator.locator.dbOid,
+ isRedo);
+
+ fd = BasicOpenFile(path, O_WRONLY | O_CREAT | O_EXCL);
+ if (fd < 0 && (!isRedo || errno != EEXIST))
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not create mark file \"%s\": %m", path));
+
+ pg_fsync(fd);
+ close(fd);
+
+ /*
+ * To guarantee that the creation of the file is persistent, fsync its
+ * parent directory.
+ */
+ fsync_parent_path(path, ERROR);
+
+ pfree(path);
+}
+
+
+/*
+ * mdunlinkmark() -- Delete the mark file
+ *
+ * If isRedo is true, it's okay if the file is not found.
+ */
+void
+mdunlinkmark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rlocator, forkNum, mark);
+
+ if (!isRedo || mdmarkexists(reln, forkNum, mark))
+ durable_unlink(path, ERROR);
+
+ pfree(path);
+}
+
+/*
+ * mdmarkexists() -- Check if the file exists.
+ */
+static bool
+mdmarkexists(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark)
+{
+ char *path = markpath(reln->smgr_rlocator, forkNum, mark);
+ int fd;
+
+ fd = BasicOpenFile(path, O_RDONLY);
+ if (fd < 0 && errno != ENOENT)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not access mark file \"%s\": %m", path));
+ pfree(path);
+
+ if (fd < 0)
+ return false;
+
+ close(fd);
+ return true;
+}
+
/*
* mdcreate() -- Create a new relation on magnetic disk.
*
@@ -1224,6 +1302,16 @@ register_forget_request(RelFileLocatorBackend rlocator, ForkNumber forknum,
RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ );
}
+/*
+ * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork
+ */
+void
+ForgetRelationForkSyncRequests(RelFileLocatorBackend rlocator,
+ ForkNumber forknum)
+{
+ register_forget_request(rlocator, forknum, 0);
+}
+
/*
* ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB
*/
@@ -1589,12 +1677,14 @@ mdsyncfiletag(const FileTag *ftag, char *path)
* Return 0 on success, -1 on failure, with errno set.
*/
int
-mdunlinkfiletag(const FileTag *ftag, char *path)
+mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark)
{
char *p;
/* Compute the path. */
- p = relpathperm(ftag->rlocator, MAIN_FORKNUM);
+ p = GetRelationPath(ftag->rlocator.dbOid, ftag->rlocator.spcOid,
+ ftag->rlocator.relNumber, InvalidBackendId,
+ MAIN_FORKNUM, mark);
strlcpy(path, p, MAXPGPATH);
pfree(p);
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 70d0d570b1..4d2553844b 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -65,6 +65,10 @@ typedef struct f_smgr
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+ void (*smgr_createmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+ void (*smgr_unlinkmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
} f_smgr;
static const f_smgr smgrsw[] = {
@@ -86,6 +90,8 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
+ .smgr_createmark = mdcreatemark,
+ .smgr_unlinkmark = mdunlinkmark,
}
};
@@ -375,6 +381,26 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo);
}
+/*
+ * smgrcreatemark() -- Create a mark file
+ */
+void
+smgrcreatemark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_createmark(reln, forknum, mark, isRedo);
+}
+
+/*
+ * smgrunlinkmark() -- Delete a mark file
+ */
+void
+smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlinkmark(reln, forknum, mark, isRedo);
+}
+
/*
* smgrdosyncall() -- Immediately sync all forks of all given relations
*
@@ -722,6 +748,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rlocator, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 768d1dbfc4..16cf74702e 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -91,7 +91,8 @@ static CycleCtr checkpoint_cycle_ctr = 0;
typedef struct SyncOps
{
int (*sync_syncfiletag) (const FileTag *ftag, char *path);
- int (*sync_unlinkfiletag) (const FileTag *ftag, char *path);
+ int (*sync_unlinkfiletag) (const FileTag *ftag, char *path,
+ StorageMarks mark);
bool (*sync_filetagmatches) (const FileTag *ftag,
const FileTag *candidate);
} SyncOps;
@@ -235,7 +236,8 @@ SyncPostCheckpoint(void)
/* Unlink the file */
if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
- path) < 0)
+ path,
+ SMGR_MARK_NONE) < 0)
{
/*
* There's a race condition, when the database is dropped at the
@@ -244,6 +246,26 @@ SyncPostCheckpoint(void)
* here. rmtree() also has to ignore ENOENT errors, to deal with
* the possibility that we delete the file first.
*/
+ if (errno != ENOENT)
+ ereport(WARNING,
+ errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path));
+ }
+ else if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
+ path,
+ SMGR_MARK_UNCOMMITTED)
+ < 0)
+ {
+ /*
+ * We might also have an SMGR_MARK_UNCOMMITTED file. Remove it if the
+ * fork file has been successfully removed. It's fine if the file
+ * does not exist. Since we have successfully removed the storage
+ * file, it's no big deal if the mark file can't be removed. It
+ * will be eventually removed during a future startup. If that
+ * removal fails, the leftover mark file prevents the creation of
+ * the corresponding storage file so that mark files won't result
+ * in unexpected removal of the correct storage files.
+ */
if (errno != ENOENT)
ereport(WARNING,
(errcode_for_file_access(),
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 27782237d0..e9e4bafb01 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -418,6 +418,22 @@ extractPageInfo(XLogReaderState *record)
* source system.
*/
}
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_UNLINK)
+ {
+ /*
+ * We can safely ignore these. We'll see that the files don't exist in
+ * the target data dir, and copy them in from the source system. No
+ * need to do anything special here.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_MARK)
+ {
+ /*
+ * We can safely ignore these. The file will be removed from the
+ * target if it doesn't exist in the source system. The files are
+ * empty, so we don't need to bother with their content.
+ */
+ }
else if (rmid == RM_XACT_ID &&
((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT ||
(rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED ||
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 87de5f6c96..b1f6832cfa 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -139,9 +139,15 @@ GetDatabasePath(Oid dbOid, Oid spcOid)
*/
char *
GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
- int backendId, ForkNumber forkNumber)
+ int backendId, ForkNumber forkNumber, char mark)
{
char *path;
+ char markstr[4];
+
+ if (mark == 0)
+ markstr[0] = 0;
+ else
+ snprintf(markstr, sizeof(markstr), ".%c", mark);
if (spcOid == GLOBALTABLESPACE_OID)
{
@@ -149,10 +155,10 @@ GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
Assert(dbOid == 0);
Assert(backendId == InvalidBackendId);
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("global/%u_%s",
- relNumber, forkNames[forkNumber]);
+ path = psprintf("global/%u_%s%s",
+ relNumber, forkNames[forkNumber], markstr);
else
- path = psprintf("global/%u", relNumber);
+ path = psprintf("global/%u%s", relNumber, markstr);
}
else if (spcOid == DEFAULTTABLESPACE_OID)
{
@@ -160,22 +166,22 @@ GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/%u_%s",
+ path = psprintf("base/%u/%u_%s%s",
dbOid, relNumber,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/%u",
- dbOid, relNumber);
+ path = psprintf("base/%u/%u%s",
+ dbOid, relNumber, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/t%d_%u_%s",
+ path = psprintf("base/%u/t%d_%u_%s%s",
dbOid, backendId, relNumber,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/t%d_%u",
- dbOid, backendId, relNumber);
+ path = psprintf("base/%u/t%d_%u%s",
+ dbOid, backendId, relNumber, markstr);
}
}
else
@@ -184,27 +190,28 @@ GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u_%s%s",
spcOid, TABLESPACE_VERSION_DIRECTORY,
dbOid, relNumber,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u%s",
spcOid, TABLESPACE_VERSION_DIRECTORY,
- dbOid, relNumber);
+ dbOid, relNumber, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s%s",
spcOid, TABLESPACE_VERSION_DIRECTORY,
dbOid, backendId, relNumber,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u%s",
spcOid, TABLESPACE_VERSION_DIRECTORY,
- dbOid, backendId, relNumber);
+ dbOid, backendId, relNumber, markstr);
}
}
+
return path;
}
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 45a3c7835c..0b39c6ef56 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -25,6 +25,8 @@ extern PGDLLIMPORT int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileLocator rlocator,
char relpersistence,
bool register_delete);
+extern void RelationCreateInitFork(Relation rel);
+extern void RelationDropInitFork(Relation rel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileLocator rlocator, bool atCommit);
extern void RelationPreTruncate(Relation rel);
@@ -43,6 +45,7 @@ extern void RestorePendingSyncs(char *startAddress);
extern void smgrDoPendingDeletes(bool isCommit);
extern void smgrDoPendingSyncs(bool isCommit, bool isParallelWorker);
extern int smgrGetPendingDeletes(bool forCommit, RelFileLocator **ptr);
+extern void smgrDoPendingCleanups(bool isCommit);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index 6b0a7aa3df..a36646c6ee 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -18,17 +18,23 @@
#include "lib/stringinfo.h"
#include "storage/block.h"
#include "storage/relfilelocator.h"
+#include "storage/smgr.h"
/*
* Declarations for smgr-related XLOG records
*
- * Note: we log file creation and truncation here, but logging of deletion
- * actions is handled by xact.c, because it is part of transaction commit.
+ * Note: we log file creation, truncation and buffer persistence change here,
+ * but logging of deletion actions is handled mainly by xact.c, because it is
+ * part of transaction commit in most cases. However, there's a case where
+ * init forks are deleted outside the control of a transaction.
*/
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_UNLINK 0x30
+#define XLOG_SMGR_MARK 0x40
+#define XLOG_SMGR_BUFPERSISTENCE 0x50
typedef struct xl_smgr_create
{
@@ -36,6 +42,26 @@ typedef struct xl_smgr_create
ForkNumber forkNum;
} xl_smgr_create;
+typedef struct xl_smgr_unlink
+{
+ RelFileLocator rlocator;
+ ForkNumber forkNum;
+} xl_smgr_unlink;
+
+typedef enum smgr_mark_action
+{
+ XLOG_SMGR_MARK_CREATE = 'c',
+ XLOG_SMGR_MARK_UNLINK = 'u'
+} smgr_mark_action;
+
+typedef struct xl_smgr_mark
+{
+ RelFileLocator rlocator;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ smgr_mark_action action;
+} xl_smgr_mark;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +77,11 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileLocator *rlocator, ForkNumber forkNum);
+extern void log_smgrcreatemark(const RelFileLocator *rlocator,
+ ForkNumber forkNum, StorageMarks mark);
+extern void log_smgrunlinkmark(const RelFileLocator *rlocator,
+ ForkNumber forkNum, StorageMarks mark);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index 511c21682e..28c9dbcd13 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -74,7 +74,7 @@ extern int forkname_chars(const char *str, ForkNumber *fork);
extern char *GetDatabasePath(Oid dbOid, Oid spcOid);
extern char *GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
- int backendId, ForkNumber forkNumber);
+ int backendId, ForkNumber forkNumber, char mark);
/*
* Wrapper macros for GetRelationPath. Beware of multiple
@@ -84,7 +84,7 @@ extern char *GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
/* First argument is a RelFileLocator */
#define relpathbackend(rlocator, backend, forknum) \
GetRelationPath((rlocator).dbOid, (rlocator).spcOid, (rlocator).relNumber, \
- backend, forknum)
+ backend, forknum, 0)
/* First argument is a RelFileLocator */
#define relpathperm(rlocator, forknum) \
@@ -94,4 +94,9 @@ extern char *GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
#define relpath(rlocator, forknum) \
relpathbackend((rlocator).locator, (rlocator).backend, forknum)
+/* First argument is a RelFileLocatorBackend */
+#define markpath(rlocator, forknum, mark) \
+ GetRelationPath((rlocator).locator.dbOid, (rlocator).locator.spcOid, \
+ (rlocator).locator.relNumber, \
+ (rlocator).backend, forknum, mark)
#endif /* RELPATH_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 6791a406fc..35d022d8e1 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -190,6 +190,7 @@ extern void pg_flush_data(int fd, off_t offset, off_t nbytes);
extern int pg_truncate(const char *path, off_t length);
extern void fsync_fname(const char *fname, bool isdir);
extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
+extern int fsync_parent_path(const char *fname, int elevel);
extern int durable_rename(const char *oldfile, const char *newfile, int elevel);
extern int durable_unlink(const char *fname, int elevel);
extern void SyncDataDirectory(void);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 941879ee6a..de49863245 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -23,6 +23,10 @@
extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
extern void mdclose(SMgrRelation reln, ForkNumber forknum);
+extern void mdcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void mdunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
extern void mdunlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo);
@@ -43,12 +47,14 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void ForgetRelationForkSyncRequests(RelFileLocatorBackend rlocator,
+ ForkNumber forknum);
extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo);
/* md sync callbacks */
extern int mdsyncfiletag(const FileTag *ftag, char *path);
-extern int mdunlinkfiletag(const FileTag *ftag, char *path);
+extern int mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark);
extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *candidate);
#endif /* MD_H */
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index e2bbb5abe9..119dac1505 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -16,14 +16,16 @@
#define REINIT_H
#include "common/relpath.h"
-
+#include "storage/smgr.h"
extern void ResetUnloggedRelations(int op);
extern bool parse_filename_for_nontemp_relation(const char *name,
int *relnumchars,
- ForkNumber *fork);
+ ForkNumber *fork,
+ StorageMarks *mark);
#define UNLOGGED_RELATION_CLEANUP 0x0001
-#define UNLOGGED_RELATION_INIT 0x0002
+#define UNLOGGED_RELATION_DROP_BUFFER 0x0002
+#define UNLOGGED_RELATION_INIT 0x0004
#endif /* REINIT_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 17fba6f91a..337bc8dd1d 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -19,6 +19,18 @@
#include "storage/relfilelocator.h"
#include "utils/guc.h"
+/*
+ * A storage mark is a file whose existence indicates something about another
+ * file. Such a file is named "<filename>.<mark>", where the mark is one of
+ * the values of StorageMarks. Since ".<digit>" denotes a segment file, don't
+ * use digits for the mark character.
+ */
+typedef enum StorageMarks
+{
+ SMGR_MARK_NONE = 0,
+ SMGR_MARK_UNCOMMITTED = 'u' /* the file is not committed yet */
+} StorageMarks;
+
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
@@ -88,7 +100,12 @@ extern void smgrcloseall(void);
extern void smgrcloserellocator(RelFileLocatorBackend rlocator);
extern void smgrrelease(SMgrRelation reln);
extern void smgrreleaseall(void);
+extern void smgrcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void smgrunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/test/recovery/t/013_crash_restart.pl b/src/test/recovery/t/013_crash_restart.pl
index 92e7b367df..9def8d2062 100644
--- a/src/test/recovery/t/013_crash_restart.pl
+++ b/src/test/recovery/t/013_crash_restart.pl
@@ -86,6 +86,24 @@ ok( pump_until(
$killme_stdout = '';
$killme_stderr = '';
+# Create a table that should *not* survive, but has rows.
+# The table's contents are required to cause access to the storage file
+# after a restart.
+$killme_stdin .= q[
+CREATE TABLE not_alive AS SELECT 1 as a;
+SELECT pg_relation_filepath('not_alive');
+];
+ok( pump_until(
+ $killme, $psql_timeout,
+ \$killme_stdout, qr/[[:alnum:]\/]+[\r\n]$/m),
+ 'added in-creation table');
+my $not_alive_relfile = $node->data_dir . "/" . $killme_stdout;
+chomp($not_alive_relfile);
+$killme_stdout = '';
+$killme_stderr = '';
+
+# The relfile must exist now
+ok ( -e $not_alive_relfile, 'relfile for in-creation table');
# Start longrunning query in second session; its failure will signal that
# crash-restart has occurred. The initial wait for the trivial select is to
@@ -144,6 +162,9 @@ $killme->run();
($monitor_stdin, $monitor_stdout, $monitor_stderr) = ('', '', '');
$monitor->run();
+# The relfile must have been removed due to the recent restart.
+ok ( ! -e $not_alive_relfile,
+ 'relfile for the in-creation table should be removed after restart');
# Acquire pid of new backend
$killme_stdin .= q[
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b4058b88c3..72e81c084c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1992,6 +1992,7 @@ PatternInfoArray
Pattern_Prefix_Status
Pattern_Type
PendingFsyncEntry
+PendingMarkCleanup
PendingRelDelete
PendingRelSync
PendingUnlinkEntry
@@ -2626,6 +2627,7 @@ StdRdOptIndexCleanup
StdRdOptions
Step
StopList
+StorageMarks
StrategyNumber
StreamCtl
String
@@ -3639,6 +3641,7 @@ registered_buffer
regmatch_t
regoff_t
regproc
+relfile_entry
relopt_bool
relopt_enum
relopt_enum_elt_def
@@ -3692,6 +3695,7 @@ slist_iter
slist_mutable_iter
slist_node
slock_t
+smgr_mark_action
socket_set
socklen_t
spgBulkDeleteState
@@ -3893,7 +3897,9 @@ xl_restore_point
xl_running_xacts
xl_seq_rec
xl_smgr_create
+xl_smgr_mark
xl_smgr_truncate
+xl_smgr_unlink
xl_standby_lock
xl_standby_locks
xl_tblspc_create_rec
--
2.31.1
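
The storage-mark convention added to smgr.h in the patch above (a mark file named "<filename>.<mark>", with digits reserved for segment suffixes) can be sketched roughly as follows. This is a standalone illustration and not code from the patch: mark_path() and is_valid_mark() are hypothetical helpers, and the example paths are made up. In the patch itself the path is built by GetRelationPath() via the markpath() macro.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Mirrors the enum the patch adds to smgr.h. */
typedef enum StorageMarks
{
	SMGR_MARK_NONE = 0,
	SMGR_MARK_UNCOMMITTED = 'u'	/* the file is not committed yet */
} StorageMarks;

/*
 * Hypothetical helper: a mark character must not be a digit, because
 * ".<digit>" already denotes a segment file.
 */
static int
is_valid_mark(char mark)
{
	return mark != 0 && !isdigit((unsigned char) mark);
}

/*
 * Hypothetical helper: write "<base>.<mark>" into dst, or just "<base>"
 * when no mark is requested. Returns 0 on success, -1 on an invalid mark
 * or when dst is too small.
 */
static int
mark_path(char *dst, size_t dstlen, const char *base, StorageMarks mark)
{
	int			n;

	if (mark == SMGR_MARK_NONE)
		n = snprintf(dst, dstlen, "%s", base);
	else if (is_valid_mark((char) mark))
		n = snprintf(dst, dstlen, "%s.%c", base, (char) mark);
	else
		return -1;

	return (n >= 0 && (size_t) n < dstlen) ? 0 : -1;
}
```

Under these assumptions, calling mark_path() with SMGR_MARK_UNCOMMITTED on an init-fork file name appends ".u", while a digit used as a mark is rejected so that it cannot be confused with a segment number.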
v28-0002-In-place-table-persistence-change.patch
From ff532016d5167c2d6c3e3853962292272d47e69f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 25 Apr 2023 15:49:10 +0900
Subject: [PATCH v28 2/4] In-place table persistence change
Previously, ALTER TABLE SET LOGGED/UNLOGGED caused a large amount of
file I/O due to heap rewrites, even though SET UNLOGGED does not
require any data rewrite. This patch eliminates the need for such
rewrites. Additionally, ALTER TABLE SET LOGGED is updated to emit
XLOG_FPI records instead of numerous HEAP_INSERTs when wal_level >
minimal, reducing resource consumption.
---
src/backend/access/rmgrdesc/smgrdesc.c | 12 +
src/backend/catalog/storage.c | 295 ++++++++++++++++++++++++-
src/backend/commands/tablecmds.c | 268 ++++++++++++++++++----
src/backend/storage/buffer/bufmgr.c | 84 +++++++
src/backend/storage/file/reinit.c | 51 ++++-
src/bin/pg_rewind/parsexlog.c | 6 +
src/include/catalog/storage_xlog.h | 8 +
src/include/storage/bufmgr.h | 2 +
src/tools/pgindent/typedefs.list | 1 +
9 files changed, 674 insertions(+), 53 deletions(-)
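
RelationCreateInitFork() and RelationDropInitFork() in the storage.c changes both walk the pendingCleanups list and unlink every entry that refers to the relation's init fork, keeping a trailing prev pointer so removal is safe during iteration. A minimal standalone sketch of that pattern follows. The Pending struct, push_pending() and remove_init_fork_entries() are hypothetical simplifications for illustration, with malloc/free standing in for MemoryContextAlloc/pfree; they are not the patch's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define INIT_FORKNUM 3			/* same value as in PostgreSQL's ForkNumber */

/* Simplified stand-in for PendingCleanup; only the fields used here. */
typedef struct Pending
{
	unsigned	relnumber;		/* stands in for RelFileLocator */
	int			forknum;		/* stands in for unlink_forknum */
	struct Pending *next;
} Pending;

/* Push a new entry at the head of the list and return the new head. */
static Pending *
push_pending(Pending *head, unsigned relnumber, int forknum)
{
	Pending    *p = malloc(sizeof(Pending));

	if (p == NULL)
		abort();
	p->relnumber = relnumber;
	p->forknum = forknum;
	p->next = head;
	return p;
}

/*
 * Remove every entry for the given relation's init fork. Returns true if
 * at least one entry was removed. "next" is saved before freeing, and
 * "prev" only advances past entries that are kept.
 */
static bool
remove_init_fork_entries(Pending **head, unsigned relnumber)
{
	Pending    *prev = NULL;
	Pending    *cur;
	Pending    *next;
	bool		removed = false;

	for (cur = *head; cur != NULL; cur = next)
	{
		next = cur->next;

		if (cur->relnumber == relnumber && cur->forknum == INIT_FORKNUM)
		{
			if (prev)
				prev->next = next;
			else
				*head = next;
			free(cur);
			removed = true;
			/* prev does not change */
		}
		else
			prev = cur;
	}

	return removed;
}
```

The return value plays the role of the create and inxact_created flags: it tells the caller whether an earlier action in the same transaction was found and cancelled, which decides whether new storage work is needed.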
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index f8187385c4..e2998a3ee4 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -71,6 +71,15 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
appendStringInfo(buf, "%s %s", action, path);
pfree(path);
}
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+ char *path = relpathperm(xlrec->rlocator, MAIN_FORKNUM);
+
+ appendStringInfoString(buf, path);
+ appendStringInfo(buf, " persistence %d", xlrec->persistence);
+ pfree(path);
+ }
}
const char *
@@ -92,6 +101,9 @@ smgr_identify(uint8 info)
case XLOG_SMGR_MARK:
id = "MARK";
break;
+ case XLOG_SMGR_BUFPERSISTENCE:
+ id = "BUFPERSISTENCE";
+ break;
}
return id;
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index fe06c3c31d..6106376525 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -69,11 +69,13 @@ typedef struct PendingRelDelete
#define PCOP_UNLINK_FORK (1 << 0)
#define PCOP_UNLINK_MARK (1 << 1)
+#define PCOP_SET_PERSISTENCE (1 << 2)
typedef struct PendingCleanup
{
RelFileLocator rlocator; /* relation that need a cleanup */
int op; /* operation mask */
+ bool bufpersistence; /* buffer persistence to set */
ForkNumber unlink_forknum; /* forknum to unlink */
StorageMarks unlink_mark; /* mark to unlink */
BackendId backend; /* InvalidBackendId if not a temp rel */
@@ -223,6 +225,202 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
return srel;
}
+/*
+ * RelationCreateInitFork
+ * Create physical storage for the init fork of a relation.
+ *
+ * Create the init fork for the relation.
+ *
+ * This function is transactional. The creation is WAL-logged, and if the
+ * transaction aborts later on, the init fork will be removed.
+ */
+void
+RelationCreateInitFork(Relation rel)
+{
+ RelFileLocator rlocator = rel->rd_locator;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ SMgrRelation srel;
+ bool create = true;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), false, false);
+
+ /*
+ * If a pending-unlink exists for this relation's init-fork, it indicates
+ * that the init-fork existed before the current transaction; this function
+ * reverts the pending-unlink by removing the entry. See
+ * RelationDropInitFork.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileLocatorEquals(rlocator, pending->rlocator) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ create = false;
+ }
+ else
+ prev = pending;
+ }
+
+ if (!create)
+ return;
+
+ /* create the init fork, along with the mark file */
+ srel = smgropen(rlocator, InvalidBackendId);
+ log_smgrcreatemark(&rlocator, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+
+ /* There is no existing init fork; create it. */
+ smgrcreate(srel, INIT_FORKNUM, false);
+
+ /*
+ * For index relations, WAL-logging and file sync are handled by
+ * ambuildempty. In contrast, for heap relations, these tasks are performed
+ * directly.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rlocator, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /* drop the init fork, mark file then revert persistence at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = rlocator;
+ pending->op = PCOP_UNLINK_FORK | PCOP_UNLINK_MARK | PCOP_SET_PERSISTENCE;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->bufpersistence = true;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* drop mark file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = rlocator;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
+/*
+ * RelationDropInitFork
+ * Delete physical storage for the init fork of a relation.
+ */
+void
+RelationDropInitFork(Relation rel)
+{
+ RelFileLocator rlocator = rel->rd_locator;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ bool inxact_created = false;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), true, false);
+
+ /*
+ * Search for a pending-unlink associated with the init-fork of the
+ * relation. Its presence indicates that the init-fork was created within
+ * the current transaction.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileLocatorEquals(rlocator, pending->rlocator) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ /* unlink list entry */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+
+ /* prev does not change */
+
+ inxact_created = true;
+ }
+ else
+ prev = pending;
+ }
+
+ /*
+ * If the init-fork was created in this transaction, remove both the
+ * init-fork and mark file. Otherwise, register an at-commit pending-unlink
+ * for the existing init-fork. See RelationCreateInitFork.
+ */
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rlocator, InvalidBackendId);
+ ForkNumber forknum = INIT_FORKNUM;
+ BlockNumber firstblock = 0;
+
+ /*
+ * Some AMs initialize init-fork via the buffer manager. To properly
+ * drop the init-fork, first drop all buffers for the init-fork, then
+ * unlink the init-fork and the mark file.
+ */
+ DropRelationBuffers(srel, &forknum, 1, &firstblock);
+ log_smgrunlinkmark(&rlocator, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrunlinkmark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+ log_smgrunlink(&rlocator, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ return;
+ }
+
+ /* register drop of this init fork file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = rlocator;
+ pending->op = PCOP_UNLINK_FORK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = rlocator;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = false;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -305,6 +503,25 @@ log_smgrunlinkmark(const RelFileLocator *rlocator, ForkNumber forkNum,
XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileLocator rlocator, bool persistence)
+{
+ xl_smgr_bufpersistence xlrec;
+
+ /*
+ * Make an XLOG entry reporting the change of buffer persistence.
+ */
+ xlrec.rlocator = rlocator;
+ xlrec.persistence = persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -858,10 +1075,29 @@ smgrDoPendingCleanups(bool isCommit)
srel = smgropen(pending->rlocator, pending->backend);
Assert((pending->op &
- ~(PCOP_UNLINK_FORK | PCOP_UNLINK_MARK)) == 0);
+ ~(PCOP_UNLINK_FORK | PCOP_UNLINK_MARK |
+ PCOP_SET_PERSISTENCE)) == 0);
+
+ if (pending->op & PCOP_SET_PERSISTENCE)
+ {
+ SetRelationBuffersPersistence(srel, pending->bufpersistence,
+ InRecovery);
+ }
if (pending->op & PCOP_UNLINK_FORK)
{
+ BlockNumber firstblock = 0;
+
+ /*
+ * Unlink the fork file. Currently this operation is
+ * applied only to init-forks. As it is not certain that
+ * the init-fork is not loaded on shared buffers, drop all
+ * buffers for it.
+ */
+ Assert(pending->unlink_forknum == INIT_FORKNUM);
+ DropRelationBuffers(srel, &pending->unlink_forknum, 1,
+ &firstblock);
+
/* Don't emit wal while recovery. */
if (!InRecovery)
log_smgrunlink(&pending->rlocator,
@@ -1286,8 +1522,8 @@ smgr_redo(XLogReaderState *record)
else
{
/*
- * Delete pending action for this mark file if any. We should have
- * at most one entry for this action.
+ * Delete any pending action for this mark file, if present. There
+ * should be at most one entry for this action.
*/
PendingCleanup *prev = NULL;
@@ -1311,6 +1547,59 @@ smgr_redo(XLogReaderState *record)
}
}
}
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ PendingCleanup *prev = NULL;
+
+ reln = smgropen(xlrec->rlocator, InvalidBackendId);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+
+ /*
+ * Delete any pending action for persistence change, if present. There
+ * should be at most one entry for this action.
+ */
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileLocatorEquals(xlrec->rlocator, pending->rlocator) &&
+ (pending->op & PCOP_SET_PERSISTENCE) != 0)
+ {
+ Assert(pending->bufpersistence == xlrec->persistence);
+
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+
+ /*
+ * During abort, revert any changes to buffer persistence made in
+ * this transaction.
+ */
+ if (!pending)
+ {
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = xlrec->rlocator;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = !xlrec->persistence;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 343fe61115..26446db085 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -55,6 +55,7 @@
#include "commands/defrem.h"
#include "commands/event_trigger.h"
#include "commands/policy.h"
+#include "commands/progress.h"
#include "commands/sequence.h"
#include "commands/tablecmds.h"
#include "commands/tablespace.h"
@@ -5464,6 +5465,188 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
return newcmd;
}
+/*
+ * RelationChangePersistence: perform in-place persistence change of a relation
+ */
+static void
+RelationChangePersistence(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ int i;
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * The caller must use ATRewriteTable instead of this function when any of
+ * the following is set.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ /*
+ * Initially, gather all relations that require a persistence change.
+ */
+
+ /* Collect OIDs of indexes and toast relations */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ /* Add toast relation if any */
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ List *toastidx;
+ Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
+
+ relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
+ toastidx = RelationGetIndexList(toastrel);
+ relids = list_concat(relids, toastidx);
+ pfree(toastidx);
+ table_close(toastrel, NoLock);
+ }
+
+ table_close(rel, NoLock);
+
+ /* Make changes in storage */
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach(lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+
+ /*
+ * XXXX: Some access methods don't support in-place persistence
+ * changes. GiST uses page LSNs to figure out whether a block has been
+ * modified. However, UNLOGGED GiST indexes use fake LSNs, which are
+ * incompatible with the real LSNs used for LOGGED indexes.
+ *
+ * Potentially, if gistGetFakeLSN behaved similarly for both permanent
+ * and unlogged indexes, we could avoid index rebuilds by emitting
+ * extra WAL records while the index is unlogged.
+ *
+ * Compare relam against a positive list to ensure the hard way is
+ * taken for unknown AMs.
+ */
+ if (r->rd_rel->relkind == RELKIND_INDEX &&
+ /* GiST is excluded */
+ r->rd_rel->relam != BTREE_AM_OID &&
+ r->rd_rel->relam != HASH_AM_OID &&
+ r->rd_rel->relam != GIN_AM_OID &&
+ r->rd_rel->relam != SPGIST_AM_OID &&
+ r->rd_rel->relam != BRIN_AM_OID)
+ {
+ int reindex_flags;
+ ReindexParams params = {0};
+
+ /* reindex doesn't allow concurrent use of the index */
+ table_close(r, NoLock);
+
+ reindex_flags =
+ REINDEX_REL_SUPPRESS_INDEX_USE |
+ REINDEX_REL_CHECK_CONSTRAINTS;
+
+ /* Set the same persistence as the parent relation. */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+
+ reindex_index(reloid, reindex_flags, persistence, &params);
+
+ continue;
+ }
+
+ /* Create or drop init fork */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ RelationCreateInitFork(r);
+ else
+ RelationDropInitFork(r);
+
+ /*
+ * If this relation becomes WAL-logged, immediately sync all files
+ * except the init-fork to establish the initial state on storage. The
+ * buffers should have already been flushed out by
+ * RelationCreate(Drop)InitFork called just above. The init-fork should
+ * already be synchronized as required.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT)
+ {
+ for (i = 0; i < INIT_FORKNUM; i++)
+ {
+ if (smgrexists(RelationGetSmgr(r), i))
+ smgrimmedsync(RelationGetSmgr(r), i);
+ }
+ }
+
+ /* Update catalog */
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+
+ /*
+ * If wal_level >= replica, switching to LOGGED necessitates WAL-logging
+ * the relation content for later recovery. This is not emitted when
+ * wal_level = minimal.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded())
+ {
+ ForkNumber fork;
+ xl_smgr_truncate xlrec;
+
+ xlrec.blkno = 0;
+ xlrec.rlocator = r->rd_locator;
+ xlrec.flags = SMGR_TRUNCATE_ALL;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ for (fork = 0; fork < INIT_FORKNUM; fork++)
+ {
+ if (smgrexists(RelationGetSmgr(r), fork))
+ log_newpage_range(r, fork, 0,
+ smgrnblocks(RelationGetSmgr(r), fork),
+ false);
+ }
+ }
+
+ table_close(r, NoLock);
+ }
+
+ table_close(classRel, NoLock);
+}
+
/*
* ATRewriteTables: ALTER TABLE phase 3
*/
@@ -5594,48 +5777,55 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- /*
- * Create transient table that will receive the modified data.
- *
- * Ensure it is marked correctly as logged or unlogged. We have
- * to do this here so that buffers for the new relfilenumber will
- * have the right persistence set, and at the same time ensure
- * that the original filenumbers's buffers will get read in with
- * the correct setting (i.e. the original one). Otherwise a
- * rollback after the rewrite would possibly result with buffers
- * for the original filenumbers having the wrong persistence
- * setting.
- *
- * NB: This relies on swap_relation_files() also swapping the
- * persistence. That wouldn't work for pg_class, but that can't be
- * unlogged anyway.
- */
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
- persistence, lockmode);
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
+ RelationChangePersistence(tab, persistence, lockmode);
+ else
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new
+ * relfilenumber will have the right persistence set, and at
+ * the same time ensure that the original filenumbers's buffers
+ * will get read in with the correct setting (i.e. the original
+ * one). Otherwise a rollback after the rewrite would possibly
+ * result with buffers for the original filenumbers having the
+ * wrong persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that
+ * can't be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace,
+ NewAccessMethod,
+ persistence, lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
+ /*
+ * Swap the physical files of the old and new heaps, then
+ * rebuild indexes and discard the old heap. We can use
+ * RecentXmin for the table's new relfrozenxid because we
+ * rewrote all the tuples in ATRewriteTable, so no older Xid
+ * remains in the table. Also, we never try to swap toast
+ * tables by content, since we have no interest in letting
+ * this code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
- InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ }
}
else if (tab->rewrite > 0 && tab->relkind == RELKIND_SEQUENCE)
{
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 1fa689052e..14f42c283f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3702,6 +3702,90 @@ DropRelationBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
}
}
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a relation
+ * then writes all dirty pages to disk (or kernel disk buffers) when
+ * switching to PERMANENT, ensuring the kernel has an up-to-date view of
+ * the relation.
+ *
+ * The caller must be holding AccessExclusiveLock on the target relation
+ * to ensure no other backend is busy dirtying more blocks.
+ *
+ * XXX currently it sequentially searches the buffer pool; consider
+ * implementing more efficient search methods. This routine isn't used in
+ * performance-critical code paths, so it's not worth additional overhead
+ * to make it go faster; see also DropRelationBuffers.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
+{
+ int i;
+ RelFileLocatorBackend rlocator = srel->smgr_rlocator;
+
+ Assert(!RelFileLocatorBackendIsTemp(rlocator));
+
+ if (!isRedo)
+ log_smgrbufpersistence(srel->smgr_rlocator.locator, permanent);
+
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ if (!RelFileLocatorEquals(BufTagGetRelFileLocator(&bufHdr->tag),
+ rlocator.locator))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (!RelFileLocatorEquals(BufTagGetRelFileLocator(&bufHdr->tag),
+ rlocator.locator))
+ {
+ UnlockBufHdr(bufHdr, buf_state);
+ continue;
+ }
+
+ if (permanent)
+ {
+ /* The init fork is being dropped, drop buffers for it. */
+ if (BufTagGetForkNum(&bufHdr->tag) == INIT_FORKNUM)
+ {
+ InvalidateBuffer(bufHdr);
+ continue;
+ }
+
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /* flush this buffer when switching to PERMANENT */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, srel, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ /* There shouldn't be an init fork */
+ Assert(BufTagGetForkNum(&bufHdr->tag) != INIT_FORKNUM);
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ }
+}
+
/* ---------------------------------------------------------------------
* DropRelationsAllBuffers
*
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index e84fcbf884..a5d8763e15 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -38,6 +38,7 @@ typedef struct
{
RelFileNumber relNumber; /* hash key */
bool has_init; /* has INIT fork */
+ bool dirty_init; /* needs to remove INIT fork */
bool dirty_all; /* needs to remove all forks */
} relfile_entry;
@@ -45,7 +46,10 @@ typedef struct
* Clean up and reset relation files from before the last restart.
*
* If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations
- * depending on the existence of mark files.
+ * depending on the existence of the "cleanup" forks.
+ *
+ * If SMGR_MARK_UNCOMMITTED mark file for init fork is present, we remove the
+ * init fork along with the mark file.
*
* If SMGR_MARK_UNCOMMITTED mark file for main fork is present we remove the
* whole relation along with the mark file.
@@ -54,7 +58,7 @@ typedef struct
* with the "init" fork, except for the "init" fork itself.
*
* If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all
- * relations that are to be cleaned up.
+ * relations that have the "cleanup" and/or the "init" forks.
*
* If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main
* fork.
@@ -241,7 +245,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
* Put the OID portion of the name into the hash table,
* if it isn't already. If it has SMGR_MARK_UNCOMMITTED mark
* files, the storage file is in dirty state, where clean up is
- * needed.
+ * needed.
*/
key = atooid(de->d_name);
ent = hash_search(hash, &key, HASH_ENTER, &found);
@@ -249,10 +253,13 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
if (!found)
{
ent->has_init = false;
+ ent->dirty_init = false;
ent->dirty_all = false;
}
- if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_init = true;
+ else if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
ent->dirty_all = true;
else
{
@@ -276,11 +283,10 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
{
/*
* When we come here after recovery, smgr object for this file might
- * have been created. In that case we need to drop all buffers then the
- * smgr object. Otherwise checkpointer wrongly tries to flush buffers
- * for nonexistent relation storage. This is safe as far as no other
- * backends have accessed the relation before starting archive
- * recovery.
+ * have been created. In that case we need to drop all buffers then
+ * the smgr object before initializing the unlogged relation. This is
+ * safe as far as no other backends have accessed the relation before
+ * starting archive recovery.
*/
HASH_SEQ_STATUS status;
relfile_entry *ent;
@@ -296,6 +302,13 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
{
RelFileLocatorBackend rel;
+ /*
+ * The relation is persistent and stays persistent. Don't drop the
+ * buffers for this relation.
+ */
+ if (ent->has_init && ent->dirty_init)
+ continue;
+
if (maxrels <= nrels)
{
maxrels *= 2;
@@ -352,8 +365,24 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
if (!ent->has_init)
continue;
- if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE)
- continue;
+ if (ent->dirty_init)
+ {
+ /*
+ * The crashed transaction did SET UNLOGGED. This relation
+ * is restored to a LOGGED relation.
+ */
+ if (forkNum != INIT_FORKNUM)
+ continue;
+ }
+ else
+ {
+ /*
+ * We don't remove the INIT fork of a non-dirty
+ * relation.
+ */
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE)
+ continue;
+ }
}
/* so, nuke it! */
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index e9e4bafb01..ddc8014e55 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -434,6 +434,12 @@ extractPageInfo(XLogReaderState *record)
* empty so we don't need to bother the content.
*/
}
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ /*
+ * We can safely ignore these. These don't make any on-disk changes.
+ */
+ }
else if (rmid == RM_XACT_ID &&
((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT ||
(rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED ||
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index a36646c6ee..847660b6af 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -62,6 +62,12 @@ typedef struct xl_smgr_mark
smgr_mark_action action;
} xl_smgr_mark;
+typedef struct xl_smgr_bufpersistence
+{
+ RelFileLocator rlocator;
+ bool persistence;
+} xl_smgr_bufpersistence;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -82,6 +88,8 @@ extern void log_smgrcreatemark(const RelFileLocator *rlocator,
ForkNumber forkNum, StorageMarks mark);
extern void log_smgrunlinkmark(const RelFileLocator *rlocator,
ForkNumber forkNum, StorageMarks mark);
+extern void log_smgrbufpersistence(const RelFileLocator rlocator,
+ bool persistence);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 6ab00daa2e..2440803a6e 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -222,6 +222,8 @@ extern void DropRelationBuffers(struct SMgrRelationData *smgr_reln,
int nforks, BlockNumber *firstDelBlock);
extern void DropRelationsAllBuffers(struct SMgrRelationData **smgr_reln,
int nlocators);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+ bool permanent, bool isRedo);
extern void DropDatabaseBuffers(Oid dbid);
#define RelationGetNumberOfBlocks(reln) \
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 72e81c084c..3bdbb189a3 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3896,6 +3896,7 @@ xl_replorigin_set
xl_restore_point
xl_running_xacts
xl_seq_rec
+xl_smgr_bufpersistence
xl_smgr_create
xl_smgr_mark
xl_smgr_truncate
--
2.31.1
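The per-file cleanup decision that the reinit.c hunks above add can be pictured with a small standalone model. This is only a sketch under assumptions, not the patch's code: the `sk_` types and `sk_should_remove()` are hypothetical stand-ins for ForkNumber, StorageMarks and relfile_entry, and the rule that dirty_all takes precedence is inferred from the function's header comment, because the enclosing condition is not visible in the hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for ForkNumber and StorageMarks. */
typedef enum { SK_FORK_MAIN, SK_FORK_FSM, SK_FORK_VM, SK_FORK_INIT } sk_fork;
typedef enum { SK_MARK_NONE, SK_MARK_UNCOMMITTED } sk_mark;

/* Mirrors the three flags of relfile_entry in the patch. */
typedef struct
{
	bool		has_init;	/* an init fork file was seen */
	bool		dirty_init;	/* UNCOMMITTED mark seen for the init fork */
	bool		dirty_all;	/* UNCOMMITTED mark seen for the main fork */
} sk_relfile_state;

/*
 * Returns true if a file with the given fork and mark, belonging to a
 * relation in state *st, would be removed by the restart-time cleanup.
 */
static bool
sk_should_remove(const sk_relfile_state *st, sk_fork fork, sk_mark mark)
{
	assert(st != NULL);

	/* Assumption: an uncommitted main fork means the whole relation goes. */
	if (st->dirty_all)
		return true;

	/* No init fork: a permanent relation, nothing to reset. */
	if (!st->has_init)
		return false;

	/*
	 * The crashed transaction did SET UNLOGGED: only the init fork and its
	 * mark file are removed, so the relation stays LOGGED.
	 */
	if (st->dirty_init)
		return fork == SK_FORK_INIT;

	/* Ordinary unlogged relation: remove everything except the init fork. */
	return !(fork == SK_FORK_INIT && mark == SK_MARK_NONE);
}
```

In this model a relation whose SET UNLOGGED was interrupted keeps its main fork and loses only the init fork, which matches the "restored to a LOGGED relation" comment in the hunk.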
v28-0003-New-command-ALTER-TABLE-ALL-IN-TABLESPACE-SET-LO.patch (text/x-patch; charset=us-ascii)
From d5558924963528b99cd15dfb55a956871583a2fe Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 25 Apr 2023 16:12:23 +0900
Subject: [PATCH v28 3/4] New command ALTER TABLE ALL IN TABLESPACE SET
LOGGED/UNLOGGED
Simplifies ALTER TABLE SET LOGGED/UNLOGGED invocation by allowing
users to specify relations based on tablespace or owner.
---
doc/src/sgml/ref/alter_table.sgml | 15 +++
src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++
src/backend/parser/gram.y | 42 +++++++
src/backend/tcop/utility.c | 11 ++
src/include/commands/tablecmds.h | 2 +
src/include/nodes/parsenodes.h | 10 ++
src/test/regress/expected/tablespace.out | 76 ++++++++++++
src/test/regress/sql/tablespace.sql | 41 +++++++
src/tools/pgindent/typedefs.list | 1 +
9 files changed, 338 insertions(+)
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index d4d93eeb7c..7ee09ca9cf 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -33,6 +33,8 @@ ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
SET SCHEMA <replaceable class="parameter">new_schema</replaceable>
ALTER TABLE ALL IN TABLESPACE <replaceable class="parameter">name</replaceable> [ OWNED BY <replaceable class="parameter">role_name</replaceable> [, ... ] ]
SET TABLESPACE <replaceable class="parameter">new_tablespace</replaceable> [ NOWAIT ]
+ALTER TABLE ALL IN TABLESPACE <replaceable class="parameter">name</replaceable> [ OWNED BY <replaceable class="parameter">role_name</replaceable> [, ... ] ]
+ SET { LOGGED | UNLOGGED } [ NOWAIT ]
ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
ATTACH PARTITION <replaceable class="parameter">partition_name</replaceable> { FOR VALUES <replaceable class="parameter">partition_bound_spec</replaceable> | DEFAULT }
ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
@@ -769,6 +771,19 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
(for identity or serial columns). However, it is also possible to
change the persistence of such sequences separately.
</para>
+ <para>
+ All tables in the current database in a tablespace can be changed by
+ using the <literal>ALL IN TABLESPACE</literal> form, which will first
+ lock all tables to be changed and then change each one. This form also
+ supports
+ <literal>OWNED BY</literal>, which will only change tables owned by the
+ specified roles. If the <literal>NOWAIT</literal> option is specified,
+ then the command will fail if it is unable to immediately acquire all of
+ the locks required. The <literal>information_schema</literal> relations
+ are not considered part of the system catalogs and will be changed. See
+ also
+ <link linkend="sql-createtablespace"><command>CREATE TABLESPACE</command></link>.
+ </para>
</listitem>
</varlistentry>
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 26446db085..d4d045f560 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14966,6 +14966,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
return new_tablespaceoid;
}
+/*
+ * Alter Table ALL ... SET LOGGED/UNLOGGED
+ *
+ * Allows a user to modify the persistence of all objects in a specific
+ * tablespace in the current database. Objects can be filtered by owner,
+ * enabling users to update the persistence of only their objects. The primary
+ * permission handling is managed by the lower-level change persistence
+ * function.
+ *
+ * All objects to be modified are locked first. If NOWAIT is specified and the
+ * lock can't be acquired, an ERROR is thrown.
+ */
+void
+AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt * stmt)
+{
+ List *relations = NIL;
+ ListCell *l;
+ ScanKeyData key[1];
+ Relation rel;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ Oid tablespaceoid;
+ List *role_oids = roleSpecsToIds(stmt->roles);
+
+ /* Ensure we were not asked to change something we can't */
+ if (stmt->objtype != OBJECT_TABLE)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("only tables can be specified"));
+
+ /* Get the tablespace OID */
+ tablespaceoid = get_tablespace_oid(stmt->tablespacename, false);
+
+ /*
+ * Now that the checks are done, check if we should set it to
+ * InvalidOid because it is our database's default tablespace.
+ */
+ if (tablespaceoid == MyDatabaseTableSpace)
+ tablespaceoid = InvalidOid;
+
+ /*
+ * Walk the list of objects in the tablespace to pick them up. This will
+ * only find objects in our database, of course.
+ */
+ ScanKeyInit(&key[0],
+ Anum_pg_class_reltablespace,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(tablespaceoid));
+
+ rel = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(rel, 1, key);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relOid = relForm->oid;
+
+ /*
+ * Do not pick-up objects in pg_catalog as part of this, if an admin
+ * really wishes to do so, they can issue the individual ALTER
+ * commands directly.
+ *
+ * Also, explicitly avoid any shared tables, temp tables, or TOAST
+ * (TOAST will be changed with the main table).
+ */
+ if (IsCatalogNamespace(relForm->relnamespace) ||
+ relForm->relisshared ||
+ isAnyTempNamespace(relForm->relnamespace) ||
+ IsToastNamespace(relForm->relnamespace))
+ continue;
+
+ /* Only pick up the object type requested */
+ if (relForm->relkind != RELKIND_RELATION)
+ continue;
+
+ /* Check if we are only picking-up objects owned by certain roles */
+ if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner))
+ continue;
+
+ /*
+ * Handle permissions-checking here since we are locking the tables
+ * and also to avoid doing a bunch of work only to fail part-way. Note
+ * that permissions will also be checked by AlterTableInternal().
+ *
+ * Caller must be considered an owner on the table of which we're
+ * going to change persistence.
+ */
+ if (!object_ownercheck(RelationRelationId, relOid, GetUserId()))
+ aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)),
+ NameStr(relForm->relname));
+
+ if (stmt->nowait &&
+ !ConditionalLockRelationOid(relOid, AccessExclusiveLock))
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("aborting because lock on relation \"%s.%s\" is not available",
+ get_namespace_name(relForm->relnamespace),
+ NameStr(relForm->relname)));
+ else
+ LockRelationOid(relOid, AccessExclusiveLock);
+
+ /*
+ * Add to our list of objects of which we're going to change
+ * persistence.
+ */
+ relations = lappend_oid(relations, relOid);
+ }
+
+ table_endscan(scan);
+ table_close(rel, AccessShareLock);
+
+ if (relations == NIL)
+ ereport(NOTICE,
+ errcode(ERRCODE_NO_DATA_FOUND),
+ errmsg("no matching relations in tablespace \"%s\" found",
+ tablespaceoid == InvalidOid ? "(database default)" :
+ get_tablespace_name(tablespaceoid)));
+
+ /*
+ * Everything is locked, loop through and change persistence of all of the
+ * relations.
+ */
+ foreach(l, relations)
+ {
+ List *cmds = NIL;
+ AlterTableCmd *cmd = makeNode(AlterTableCmd);
+
+ if (stmt->logged)
+ cmd->subtype = AT_SetLogged;
+ else
+ cmd->subtype = AT_SetUnLogged;
+
+ cmds = lappend(cmds, cmd);
+
+ EventTriggerAlterTableStart((Node *) stmt);
+ /* OID is set by AlterTableInternal */
+ AlterTableInternal(lfirst_oid(l), cmds, false);
+ EventTriggerAlterTableEnd();
+ }
+}
+
static void
index_copy_data(Relation rel, RelFileLocator newrlocator)
{
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index acf6cf4866..93f542c1c4 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -2130,6 +2130,48 @@ AlterTableStmt:
n->nowait = $13;
$$ = (Node *) n;
}
+ | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = true;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name OWNED BY role_list SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->roles = $9;
+ n->logged = true;
+ n->nowait = $12;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = false;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name OWNED BY role_list SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->roles = $9;
+ n->logged = false;
+ n->nowait = $12;
+ $$ = (Node *)n;
+ }
| ALTER INDEX qualified_name alter_table_cmds
{
AlterTableStmt *n = makeNode(AlterTableStmt);
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 30b51bf4d3..a10b6220e9 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -164,6 +164,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_AlterTSConfigurationStmt:
case T_AlterTSDictionaryStmt:
case T_AlterTableMoveAllStmt:
+ case T_AlterTableSetLoggedAllStmt:
case T_AlterTableSpaceOptionsStmt:
case T_AlterTableStmt:
case T_AlterTypeStmt:
@@ -1770,6 +1771,12 @@ ProcessUtilitySlow(ParseState *pstate,
commandCollected = true;
break;
+ case T_AlterTableSetLoggedAllStmt:
+ AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree);
+ /* commands are stashed in AlterTableSetLoggedAll */
+ commandCollected = true;
+ break;
+
case T_DropStmt:
ExecDropStmt((DropStmt *) parsetree, isTopLevel);
/* no commands stashed for DROP */
@@ -2698,6 +2705,10 @@ CreateCommandTag(Node *parsetree)
tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype);
+ break;
+
case T_AlterTableStmt:
tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype);
break;
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index 17b9404937..a92835bc62 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
+extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt * stmt);
+
extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
Oid *oldschema);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index cc7b32b279..97cc736811 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2644,6 +2644,16 @@ typedef struct AlterTableMoveAllStmt
bool nowait;
} AlterTableMoveAllStmt;
+typedef struct AlterTableSetLoggedAllStmt
+{
+ NodeTag type;
+ char *tablespacename;
+ ObjectType objtype; /* Object type to change */
+ List *roles; /* List of roles to change objects of */
+ bool logged;
+ bool nowait;
+} AlterTableSetLoggedAllStmt;
+
/* ----------------------
* Create/Alter Extension Statements
* ----------------------
diff --git a/src/test/regress/expected/tablespace.out b/src/test/regress/expected/tablespace.out
index 9aabb85349..35b150b297 100644
--- a/src/test/regress/expected/tablespace.out
+++ b/src/test/regress/expected/tablespace.out
@@ -964,5 +964,81 @@ drop cascades to table testschema.part
drop cascades to table testschema.atable
drop cascades to materialized view testschema.amv
drop cascades to table testschema.tablespace_acl
+--
+-- Check persistence change in a tablespace
+CREATE SCHEMA testschema;
+GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1;
+CREATE TABLESPACE regress_tablespace LOCATION '';
+GRANT CREATE ON TABLESPACE regress_tablespace TO regress_tablespace_user1;
+CREATE TABLE testschema.lsu(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.usu(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lsu(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._usu(a int) TABLESPACE pg_default;
+SET ROLE regress_tablespace_user1;
+CREATE TABLE testschema.lu1(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.uu1(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lu1(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._uu1(a int) TABLESPACE pg_default;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | p
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | p
+ uu1 | regress_tablespace | u
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace
+ OWNED BY regress_tablespace_user1 SET LOGGED;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | p
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | p
+ uu1 | regress_tablespace | p
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+RESET ROLE;
+ALTER TABLE ALL IN TABLESPACE regress_tablespace SET UNLOGGED;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | u
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | u
+ uu1 | regress_tablespace | u
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+-- Should succeed
+DROP SCHEMA testschema CASCADE;
+NOTICE: drop cascades to 8 other objects
+DETAIL: drop cascades to table testschema.lsu
+drop cascades to table testschema.usu
+drop cascades to table testschema._lsu
+drop cascades to table testschema._usu
+drop cascades to table testschema.lu1
+drop cascades to table testschema.uu1
+drop cascades to table testschema._lu1
+drop cascades to table testschema._uu1
+DROP TABLESPACE regress_tablespace;
DROP ROLE regress_tablespace_user1;
DROP ROLE regress_tablespace_user2;
diff --git a/src/test/regress/sql/tablespace.sql b/src/test/regress/sql/tablespace.sql
index d274d9615e..eb8e247a1d 100644
--- a/src/test/regress/sql/tablespace.sql
+++ b/src/test/regress/sql/tablespace.sql
@@ -429,5 +429,46 @@ DROP TABLESPACE regress_tblspace_renamed;
DROP SCHEMA testschema CASCADE;
+
+--
+-- Check persistence change in a tablespace
+CREATE SCHEMA testschema;
+GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1;
+CREATE TABLESPACE regress_tablespace LOCATION '';
+GRANT CREATE ON TABLESPACE regress_tablespace TO regress_tablespace_user1;
+
+CREATE TABLE testschema.lsu(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.usu(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lsu(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._usu(a int) TABLESPACE pg_default;
+SET ROLE regress_tablespace_user1;
+CREATE TABLE testschema.lu1(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.uu1(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lu1(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._uu1(a int) TABLESPACE pg_default;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace
+ OWNED BY regress_tablespace_user1 SET LOGGED;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+RESET ROLE;
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace SET UNLOGGED;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+-- Should succeed
+DROP SCHEMA testschema CASCADE;
+DROP TABLESPACE regress_tablespace;
+
DROP ROLE regress_tablespace_user1;
DROP ROLE regress_tablespace_user2;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3bdbb189a3..b2686ec473 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -97,6 +97,7 @@ AlterTSConfigurationStmt
AlterTSDictionaryStmt
AlterTableCmd
AlterTableMoveAllStmt
+AlterTableSetLoggedAllStmt
AlterTableSpaceOptionsStmt
AlterTableStmt
AlterTableType
--
2.31.1
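The relation filter that AlterTableSetLoggedAll() applies while scanning pg_class can be summarized as a single predicate. The following is a simplified, standalone sketch: `sk_rel`, `sk_owner_listed()` and `sk_is_candidate()` are hypothetical names, the struct fields stand in for the results of IsCatalogNamespace(), relisshared, isAnyTempNamespace() and IsToastNamespace(), and the ownership check and lock acquisition that follow in the patch are deliberately not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, flattened view of the pg_class fields the filter looks at. */
typedef struct
{
	bool		in_catalog_namespace;
	bool		is_shared;
	bool		in_temp_namespace;
	bool		in_toast_namespace;
	char		relkind;	/* 'r' is an ordinary table (RELKIND_RELATION) */
	unsigned int owner;		/* stand-in for relowner */
} sk_rel;

/* True if owner appears in the owners array. */
static bool
sk_owner_listed(unsigned int owner, const unsigned int *owners, size_t nowners)
{
	for (size_t i = 0; i < nowners; i++)
	{
		if (owners[i] == owner)
			return true;
	}
	return false;
}

/*
 * True if the relation would be picked up for the persistence change.
 * An empty owners list (nowners == 0) means "no OWNED BY filter".
 */
static bool
sk_is_candidate(const sk_rel *rel, const unsigned int *owners, size_t nowners)
{
	assert(rel != NULL);

	/* Skip catalogs, shared relations, temp tables and TOAST tables. */
	if (rel->in_catalog_namespace || rel->is_shared ||
		rel->in_temp_namespace || rel->in_toast_namespace)
		return false;

	/* Only ordinary tables are handled by this command. */
	if (rel->relkind != 'r')
		return false;

	/* Apply the OWNED BY filter, if one was given. */
	if (nowners > 0 && !sk_owner_listed(rel->owner, owners, nowners))
		return false;

	return true;
}
```

With an OWNED BY list containing a single role, this predicate selects only that role's ordinary tables, which is the behavior the regression test exercises with regress_tablespace_user1.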
v28-0004-Tentative-deletion-of-include.patch (text/x-patch; charset=us-ascii)
From f8277928e13564d0095d5fdc1d5961e4311a0e52 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 25 Apr 2023 12:08:46 +0900
Subject: [PATCH v28 4/4] Tentative deletion of #include
I expect this will be deleted soon in the master branch.
---
src/include/storage/smgr.h | 1 -
1 file changed, 1 deletion(-)
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 337bc8dd1d..4964146106 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -17,7 +17,6 @@
#include "lib/ilist.h"
#include "storage/block.h"
#include "storage/relfilelocator.h"
-#include "utils/guc.h"
/*
* Storage marks is a file of which existence suggests something about a
--
2.31.1
On Tue, Apr 25, 2023 at 9:55 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
Rebased.
I fixed some code comments and commit messages. I fixed the wrong
arrangement of some changes among patches. Most importantly, I fixed
a bug based on the wrong assumption that an init fork never resides in
shared buffers. Now smgrDoPendingCleanups drops the buffers of an init
fork that is to be removed.
The new fourth patch is a temporary fix for recently added code, which
will soon be no longer needed.
Hi Kyotaro,
I've retested v28 of the patch with everything that came to my mind
(basic tests, --enable-tap-tests, restarts/crashes along adding the
data, checking if there were any files left over and I've checked for
stuff that earlier was causing problems: GiST on geometry[PostGIS]).
The only thing I've not tested this time were the performance runs
done earlier. The patch passed all my very limited tests along with
make check-world. Patch looks good to me on the surface from a
usability point of view. I haven't looked at the code, so the patch
might still need an in-depth review.
Regards,
-Jakub Wartak.
(I find the misspelled subject makes it difficult to find the thread..)
At Thu, 27 Apr 2023 14:47:41 +0200, Jakub Wartak <jakub.wartak@enterprisedb.com> wrote in
On Tue, Apr 25, 2023 at 9:55 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
Rebased.
I fixed some code comments and commit messages. I fixed the wrong
arrangement of some changes among patches. Most importantly, I fixed
a bug based on the wrong assumption that an init fork never resides in
shared buffers. Now smgrDoPendingCleanups drops the buffers of an init
fork that is to be removed.
The new fourth patch is a temporary fix for recently added code, which
will soon be no longer needed.
This is no longer needed. Thank you, Thomas!
Hi Kyotaro,
I've retested v28 of the patch with everything that came to my mind
(basic tests, --enable-tap-tests, restarts/crashes along adding the
data, checking if there were any files left over and I've checked for
stuff that earlier was causing problems: GiST on geometry[PostGIS]).
Maybe it's fixed by dropping buffers.
The only thing I've not tested this time were the performance runs
done earlier. The patch passed all my very limited tests along with
make check-world. Patch looks good to me on the surface from a
usability point of view. I haven't looked at the code, so the patch
might still need an in-depth review.
Thank you for conducting a thorough test. In this patchset, the first
one might be useful on its own and it is the most complex part. I'll
recheck it.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
I think there are some good ideas here. I started to take a look at the
patches, and I've attached a rebased version of the patch set. Apologies
if I am repeating any discussions from upthread.
First, I tested the time difference in ALTER TABLE SET UNLOGGED/LOGGED with
the patch applied, and the results looked pretty impressive.
before:
postgres=# alter table test set unlogged;
ALTER TABLE
Time: 5108.071 ms (00:05.108)
postgres=# alter table test set logged;
ALTER TABLE
Time: 6747.648 ms (00:06.748)
after:
postgres=# alter table test set unlogged;
ALTER TABLE
Time: 25.609 ms
postgres=# alter table test set logged;
ALTER TABLE
Time: 1241.800 ms (00:01.242)
My first question is whether 0001 is a prerequisite to 0002. I'm assuming
it is, but the reason wasn't immediately obvious to me. If it's just
nice-to-have, perhaps we could simplify the patch set a bit. I see that
Heikki had some general concerns with the marker file approach [0], so
perhaps it is at least worth brainstorming some alternatives if we _do_
need it.
[0]: /messages/by-id/9827ebd3-de2e-fd52-4091-a568387b1fc2@iki.fi
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v29-0001-Introduce-storage-mark-files.patch (text/x-diff; charset=us-ascii)
From 5a4fb063a8b5e8a731373c4d06e51ad7fbeebebd Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 2 Mar 2023 17:25:12 +0900
Subject: [PATCH v29 1/3] Introduce storage mark files
In specific scenarios, certain operations followed by a crash-restart
may generate orphaned storage files that cannot be removed through
standard procedures or cause the server to fail during restart. This
commit introduces 'mark files' to convey information about the storage
file. In particular, an "UNCOMMITTED" mark file is implemented to
identify uncommitted files for removal during the subsequent startup.
---
src/backend/access/rmgrdesc/smgrdesc.c | 37 +++
src/backend/access/transam/README | 10 +
src/backend/access/transam/xact.c | 7 +
src/backend/access/transam/xlogrecovery.c | 18 ++
src/backend/backup/basebackup.c | 9 +-
src/backend/catalog/storage.c | 268 +++++++++++++++++++-
src/backend/storage/file/fd.c | 4 +-
src/backend/storage/file/reinit.c | 287 +++++++++++++++-------
src/backend/storage/smgr/md.c | 94 ++++++-
src/backend/storage/smgr/smgr.c | 32 +++
src/backend/storage/sync/sync.c | 26 +-
src/bin/pg_rewind/parsexlog.c | 16 ++
src/common/relpath.c | 47 ++--
src/include/catalog/storage.h | 3 +
src/include/catalog/storage_xlog.h | 35 ++-
src/include/common/relpath.h | 9 +-
src/include/storage/fd.h | 1 +
src/include/storage/md.h | 8 +-
src/include/storage/reinit.h | 8 +-
src/include/storage/smgr.h | 17 ++
src/test/recovery/t/013_crash_restart.pl | 21 ++
src/tools/pgindent/typedefs.list | 6 +
22 files changed, 834 insertions(+), 129 deletions(-)
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index bd841b96e8..f8187385c4 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,37 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+ char *path = relpathperm(xlrec->rlocator, xlrec->forkNum);
+
+ appendStringInfoString(buf, path);
+ pfree(path);
+ }
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+ char *path = GetRelationPath(xlrec->rlocator.dbOid,
+ xlrec->rlocator.spcOid,
+ xlrec->rlocator.relNumber,
+ InvalidBackendId,
+ xlrec->forkNum, xlrec->mark);
+ char *action = "<none>";
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ action = "CREATE";
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ action = "DELETE";
+ break;
+ }
+
+ appendStringInfo(buf, "%s %s", action, path);
+ pfree(path);
+ }
}
const char *
@@ -55,6 +86,12 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_UNLINK:
+ id = "UNLINK";
+ break;
+ case XLOG_SMGR_MARK:
+ id = "MARK";
+ break;
}
return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 22c8ae9755..e10f6af0e3 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -741,6 +741,16 @@ we must panic and abort recovery. The DBA will have to manually clean up
then restart recovery. This is part of the reason for not writing a WAL
entry until we've successfully done the original action.
+================================
+Smgr MARK files
+--------------------------------
+
+A storage manager (smgr) mark file is an empty file created alongside a
+new relation storage file, indicating that the storage file requires
+cleanup during the recovery process. Unlike the previous four actions
+mentioned, failure to remove these marker files may lead to data loss,
+causing the server to shut down.
+
Skipping WAL for New RelFileLocator
--------------------------------
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 8daaa535ed..2b0af2a938 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2224,6 +2224,9 @@ CommitTransaction(void)
*/
smgrDoPendingSyncs(true, is_parallel_worker);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2475,6 +2478,9 @@ PrepareTransaction(void)
*/
smgrDoPendingSyncs(true, false);
+ /* Likewise delete mark files for files created during this transaction. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2799,6 +2805,7 @@ AbortTransaction(void)
AfterTriggerEndXact(false); /* 'false' means it's abort */
AtAbort_Portals();
smgrDoPendingSyncs(false, is_parallel_worker);
+ smgrDoPendingCleanups(false);
AtEOXact_LargeObject(false);
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index becc2bda62..8e09936993 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -42,6 +42,7 @@
#include "access/xlogutils.h"
#include "backup/basebackup.h"
#include "catalog/pg_control.h"
+#include "catalog/storage.h"
#include "commands/tablespace.h"
#include "common/file_utils.h"
#include "miscadmin.h"
@@ -56,6 +57,7 @@
#include "storage/pmsignal.h"
#include "storage/proc.h"
#include "storage/procarray.h"
+#include "storage/reinit.h"
#include "storage/spin.h"
#include "utils/builtins.h"
#include "utils/datetime.h"
@@ -1795,6 +1797,14 @@ PerformWalRecovery(void)
RmgrCleanup();
+ /* clean up garbage files left during crash recovery */
+ if (!InArchiveRecovery)
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
ereport(LOG,
(errmsg("redo done at %X/%X system usage: %s",
LSN_FORMAT_ARGS(xlogreader->ReadRecPtr),
@@ -3153,6 +3163,14 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
{
ereport(DEBUG1,
(errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+
+ /* clean up garbage files left during crash recovery */
+ ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_CLEANUP);
+
+ /* run rollback cleanup if any */
+ smgrDoPendingDeletes(false);
+
InArchiveRecovery = true;
if (StandbyModeRequested)
EnableStandbyMode();
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 45be21131c..9af0982fe1 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -1191,6 +1191,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
ForkNumber relForkNum; /* Type of fork if file is a relation */
int relnumchars; /* Chars in filename that are the
* relnumber */
+ StorageMarks mark; /* marker file sign */
/* Skip special stuff */
if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
@@ -1241,7 +1242,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly,
/* Exclude all forks for unlogged tables except the init fork */
if (isDbDir &&
parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &relForkNum))
+ &relForkNum, &mark))
{
/* Never exclude init forks */
if (relForkNum != INIT_FORKNUM)
@@ -1448,6 +1449,7 @@ is_checksummed_file(const char *fullpath, const char *filename)
strncmp(fullpath, "/", 1) == 0)
{
int excludeIdx;
+ char *p;
/* Compare file against noChecksumFiles skip list */
for (excludeIdx = 0; noChecksumFiles[excludeIdx].name != NULL; excludeIdx++)
@@ -1461,6 +1463,11 @@ is_checksummed_file(const char *fullpath, const char *filename)
return false;
}
+ /* exclude mark files */
+ p = strchr(filename, '.');
+ if (p && isalpha((unsigned char) p[1]) && p[2] == 0)
+ return false;
+
return true;
}
else
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 2add053489..fe06c3c31d 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,6 +19,7 @@
#include "postgres.h"
+#include "access/amapi.h"
#include "access/parallel.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -66,6 +67,21 @@ typedef struct PendingRelDelete
struct PendingRelDelete *next; /* linked-list link */
} PendingRelDelete;
+#define PCOP_UNLINK_FORK (1 << 0)
+#define PCOP_UNLINK_MARK (1 << 1)
+
+typedef struct PendingCleanup
+{
+ RelFileLocator rlocator; /* relation that needs cleanup */
+ int op; /* operation mask */
+ ForkNumber unlink_forknum; /* forknum to unlink */
+ StorageMarks unlink_mark; /* mark to unlink */
+ BackendId backend; /* InvalidBackendId if not a temp rel */
+ bool atCommit; /* T=delete at commit; F=delete at abort */
+ int nestLevel; /* xact nesting level of request */
+ struct PendingCleanup *next; /* linked-list link */
+} PendingCleanup;
+
typedef struct PendingRelSync
{
RelFileLocator rlocator;
@@ -73,6 +89,7 @@ typedef struct PendingRelSync
} PendingRelSync;
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingCleanup *pendingCleanups = NULL; /* head of linked list */
static HTAB *pendingSyncHash = NULL;
@@ -123,6 +140,7 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
SMgrRelation srel;
BackendId backend;
bool needs_wal;
+ PendingCleanup *pendingclean;
Assert(!IsInParallelMode()); /* couldn't update pendingSyncHash */
@@ -145,7 +163,21 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
return NULL; /* placate compiler */
}
+ /*
+ * We are going to create a new storage file. If the server crashes before
+ * the current transaction ends, the file needs to be cleaned up. The
+ * SMGR_MARK_UNCOMMITTED mark file prompts that work at the next startup.
+ * We don't need this during WAL-logged CREATE DATABASE. See
+ * CreateAndCopyRelationData for details.
+ */
srel = smgropen(rlocator, backend);
+
+ if (register_delete)
+ {
+ log_smgrcreatemark(&rlocator, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+ }
+
smgrcreate(srel, MAIN_FORKNUM, false);
if (needs_wal)
@@ -157,16 +189,29 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
*/
if (register_delete)
{
- PendingRelDelete *pending;
+ PendingRelDelete *pendingdel;
- pending = (PendingRelDelete *)
+ pendingdel = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
- pending->rlocator = rlocator;
- pending->backend = backend;
- pending->atCommit = false; /* delete if abort */
- pending->nestLevel = GetCurrentTransactionNestLevel();
- pending->next = pendingDeletes;
- pendingDeletes = pending;
+ pendingdel->rlocator = rlocator;
+ pendingdel->backend = backend;
+ pendingdel->atCommit = false; /* delete if abort */
+ pendingdel->nestLevel = GetCurrentTransactionNestLevel();
+ pendingdel->next = pendingDeletes;
+ pendingDeletes = pendingdel;
+
+ /* drop mark files at commit */
+ pendingclean = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pendingclean->rlocator = rlocator;
+ pendingclean->op = PCOP_UNLINK_MARK;
+ pendingclean->unlink_forknum = MAIN_FORKNUM;
+ pendingclean->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pendingclean->backend = backend;
+ pendingclean->atCommit = true;
+ pendingclean->nestLevel = GetCurrentTransactionNestLevel();
+ pendingclean->next = pendingCleanups;
+ pendingCleanups = pendingclean;
}
if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
@@ -197,6 +242,69 @@ log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileLocator *rlocator, ForkNumber forkNum)
+{
+ xl_smgr_unlink xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file unlink.
+ */
+ xlrec.rlocator = *rlocator;
+ xlrec.forkNum = forkNum;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK (create) record to WAL.
+ */
+void
+log_smgrcreatemark(const RelFileLocator *rlocator, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the mark file creation.
+ */
+ xlrec.rlocator = *rlocator;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_CREATE;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
+/*
+ * Perform XLogInsert of an XLOG_SMGR_MARK (unlink) record to WAL.
+ */
+void
+log_smgrunlinkmark(const RelFileLocator *rlocator, ForkNumber forkNum,
+ StorageMarks mark)
+{
+ xl_smgr_mark xlrec;
+
+ /*
+ * Make an XLOG entry reporting the mark file removal.
+ */
+ xlrec.rlocator = *rlocator;
+ xlrec.forkNum = forkNum;
+ xlrec.mark = mark;
+ xlrec.action = XLOG_SMGR_MARK_UNLINK;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -711,6 +819,76 @@ smgrDoPendingDeletes(bool isCommit)
}
}
+/*
+ * smgrDoPendingCleanups() -- Clean up work that emits WAL records
+ *
+ * The operations handled in this function emit WAL records, which must be
+ * part of the current transaction.
+ */
+void
+smgrDoPendingCleanups(bool isCommit)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (pending->nestLevel < nestLevel)
+ {
+ /* outer-level entries should not be processed yet */
+ prev = pending;
+ }
+ else
+ {
+ /* unlink list entry first, so we don't retry on failure */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ /* do cleanup if called for */
+ if (pending->atCommit == isCommit)
+ {
+ SMgrRelation srel;
+
+ srel = smgropen(pending->rlocator, pending->backend);
+
+ Assert((pending->op &
+ ~(PCOP_UNLINK_FORK | PCOP_UNLINK_MARK)) == 0);
+
+ if (pending->op & PCOP_UNLINK_FORK)
+ {
+ /* Don't emit WAL during recovery. */
+ if (!InRecovery)
+ log_smgrunlink(&pending->rlocator,
+ pending->unlink_forknum);
+ smgrunlink(srel, pending->unlink_forknum, false);
+ }
+
+ if (pending->op & PCOP_UNLINK_MARK)
+ {
+ if (!InRecovery)
+ log_smgrunlinkmark(&pending->rlocator,
+ pending->unlink_forknum,
+ pending->unlink_mark);
+
+ smgrunlinkmark(srel, pending->unlink_forknum,
+ pending->unlink_mark, InRecovery);
+ smgrclose(srel);
+ }
+ }
+
+ /* must explicitly free the list entry */
+ pfree(pending);
+ /* prev does not change */
+ }
+ }
+}
+
/*
* smgrDoPendingSyncs() -- Take care of relation syncs at end of xact.
*/
@@ -971,6 +1149,15 @@ smgr_redo(XLogReaderState *record)
reln = smgropen(xlrec->rlocator, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rlocator, InvalidBackendId);
+ smgrunlink(reln, xlrec->forkNum, true);
+ smgrclose(reln);
+ }
else if (info == XLOG_SMGR_TRUNCATE)
{
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
@@ -1059,6 +1246,71 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_MARK)
+ {
+ xl_smgr_mark *xlrec = (xl_smgr_mark *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ bool created = false;
+
+ reln = smgropen(xlrec->rlocator, InvalidBackendId);
+
+ switch (xlrec->action)
+ {
+ case XLOG_SMGR_MARK_CREATE:
+ smgrcreatemark(reln, xlrec->forkNum, xlrec->mark, true);
+ created = true;
+ break;
+ case XLOG_SMGR_MARK_UNLINK:
+ smgrunlinkmark(reln, xlrec->forkNum, xlrec->mark, true);
+ break;
+ default:
+ elog(ERROR, "unknown smgr_mark action \"%c\"", xlrec->action);
+ }
+
+ if (created)
+ {
+ /* revert mark file operation at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = xlrec->rlocator;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = xlrec->forkNum;
+ pending->unlink_mark = xlrec->mark;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ else
+ {
+ /*
+ * Delete pending action for this mark file if any. We should have
+ * at most one entry for this action.
+ */
+ PendingCleanup *prev = NULL;
+
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileLocatorEquals(xlrec->rlocator, pending->rlocator) &&
+ pending->unlink_forknum == xlrec->forkNum &&
+ (pending->op & PCOP_UNLINK_MARK) != 0)
+ {
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+ }
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index a027a8aabc..a841c2eab8 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -351,8 +351,6 @@ static void pre_sync_fname(const char *fname, bool isdir, int elevel);
static void datadir_fsync_fname(const char *fname, bool isdir, int elevel);
static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel);
-static int fsync_parent_path(const char *fname, int elevel);
-
/*
* pg_fsync --- do fsync with or without writethrough
@@ -3814,7 +3812,7 @@ fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel)
* This is aimed at making file operations persistent on disk in case of
* an OS crash or power failure.
*/
-static int
+int
fsync_parent_path(const char *fname, int elevel)
{
char parentpath[MAXPGPATH];
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index fb55371b1b..e84fcbf884 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -16,29 +16,45 @@
#include <unistd.h>
+#include "access/xlogrecovery.h"
+#include "catalog/pg_tablespace_d.h"
#include "common/relpath.h"
#include "postmaster/startup.h"
+#include "storage/bufmgr.h"
#include "storage/copydir.h"
#include "storage/fd.h"
+#include "storage/md.h"
#include "storage/reinit.h"
+#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
- int op);
+ Oid tspid, int op);
static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
- int op);
+ Oid tspid, Oid dbid, int op);
typedef struct
{
- Oid reloid; /* hash key */
-} unlogged_relation_entry;
+ RelFileNumber relNumber; /* hash key */
+ bool has_init; /* has INIT fork */
+ bool dirty_all; /* needs to remove all forks */
+} relfile_entry;
/*
- * Reset unlogged relations from before the last restart.
+ * Clean up and reset relation files from before the last restart.
*
- * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any
- * relation with an "init" fork, except for the "init" fork itself.
+ * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations
+ * depending on the existence of mark files.
+ *
+ * If an SMGR_MARK_UNCOMMITTED mark file for the main fork is present, we
+ * remove the whole relation along with the mark file.
+ *
+ * Otherwise, if the "init" fork is found, we remove all forks of any relation
+ * with the "init" fork, except for the "init" fork itself.
+ *
+ * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all
+ * relations that are to be cleaned up.
*
* If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main
* fork.
@@ -72,7 +88,7 @@ ResetUnloggedRelations(int op)
/*
* First process unlogged files in pg_default ($PGDATA/base)
*/
- ResetUnloggedRelationsInTablespaceDir("base", op);
+ ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op);
/*
* Cycle through directories for all non-default tablespaces.
@@ -81,13 +97,19 @@ ResetUnloggedRelations(int op)
while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
{
+ Oid tspid;
+
if (strcmp(spc_de->d_name, ".") == 0 ||
strcmp(spc_de->d_name, "..") == 0)
continue;
snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
- ResetUnloggedRelationsInTablespaceDir(temp_path, op);
+
+ tspid = atooid(spc_de->d_name);
+
+ Assert(tspid != 0);
+ ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op);
}
FreeDir(spc_dir);
@@ -103,7 +125,8 @@ ResetUnloggedRelations(int op)
* Process one tablespace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
+ResetUnloggedRelationsInTablespaceDir(const char *tsdirname,
+ Oid tspid, int op)
{
DIR *ts_dir;
struct dirent *de;
@@ -130,6 +153,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
while ((de = ReadDir(ts_dir, tsdirname)) != NULL)
{
+ Oid dbid;
+
/*
* We're only interested in the per-database directories, which have
* numeric names. Note that this code will also (properly) ignore "."
@@ -148,7 +173,10 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
ereport_startup_progress("resetting unlogged relations (cleanup), elapsed time: %ld.%02d s, current path: %s",
dbspace_path);
- ResetUnloggedRelationsInDbspaceDir(dbspace_path, op);
+ dbid = atooid(de->d_name);
+ Assert(dbid != 0);
+
+ ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op);
}
FreeDir(ts_dir);
@@ -158,125 +186,200 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op)
* Process one per-dbspace directory for ResetUnloggedRelations
*/
static void
-ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
+ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
+ Oid tspid, Oid dbid, int op)
{
DIR *dbspace_dir;
struct dirent *de;
char rm_path[MAXPGPATH * 2];
+ HTAB *hash;
+ HASHCTL ctl;
/* Caller must specify at least one operation. */
- Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0);
+ Assert((op & (UNLOGGED_RELATION_CLEANUP |
+ UNLOGGED_RELATION_DROP_BUFFER |
+ UNLOGGED_RELATION_INIT)) != 0);
/*
* Cleanup is a two-pass operation. First, we go through and identify all
* the files with init forks. Then, we go through again and nuke
* everything with the same OID except the init fork.
*/
- if ((op & UNLOGGED_RELATION_CLEANUP) != 0)
+
+ /*
+ * It's possible that someone could create tons of unlogged relations in
+ * the same database & tablespace, so we'd better use a hash table rather
+ * than an array or linked list to keep track of which files need to be
+ * reset. Otherwise, this cleanup operation would be O(n^2).
+ */
+ memset(&ctl, 0, sizeof(ctl));
+ ctl.keysize = sizeof(RelFileNumber);
+ ctl.entrysize = sizeof(relfile_entry);
+ hash = hash_create("unlogged relation RelFileNumbers",
+ 32, &ctl, HASH_ELEM | HASH_BLOBS);
+
+ /* Collect INIT fork and mark files in the directory. */
+ dbspace_dir = AllocateDir(dbspacedirname);
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
- HTAB *hash;
- HASHCTL ctl;
+ ForkNumber forkNum;
+ int relnumchars;
+ StorageMarks mark;
- /*
- * It's possible that someone could create a ton of unlogged relations
- * in the same database & tablespace, so we'd better use a hash table
- * rather than an array or linked list to keep track of which files
- * need to be reset. Otherwise, this cleanup operation would be
- * O(n^2).
- */
- ctl.keysize = sizeof(Oid);
- ctl.entrysize = sizeof(unlogged_relation_entry);
- ctl.hcxt = CurrentMemoryContext;
- hash = hash_create("unlogged relation OIDs", 32, &ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ /* Skip anything that doesn't look like a relation data file. */
+ if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
+ &forkNum, &mark))
+ continue;
- /* Scan the directory. */
- dbspace_dir = AllocateDir(dbspacedirname);
- while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
+ if (forkNum == INIT_FORKNUM || mark == SMGR_MARK_UNCOMMITTED)
{
- ForkNumber forkNum;
- int relnumchars;
- unlogged_relation_entry ent;
-
- /* Skip anything that doesn't look like a relation data file. */
- if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
- continue;
-
- /* Also skip it unless this is the init fork. */
- if (forkNum != INIT_FORKNUM)
- continue;
+ RelFileNumber key;
+ relfile_entry *ent;
+ bool found;
/*
- * Put the OID portion of the name into the hash table, if it
- * isn't already.
+ * Put the OID portion of the name into the hash table,
+ * if it isn't already. If it has an SMGR_MARK_UNCOMMITTED mark
+ * file, the storage file is in a dirty state and needs to be
+ * cleaned up.
*/
- ent.reloid = atooid(de->d_name);
- (void) hash_search(hash, &ent, HASH_ENTER, NULL);
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_ENTER, &found);
+
+ if (!found)
+ {
+ ent->has_init = false;
+ ent->dirty_all = false;
+ }
+
+ if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_all = true;
+ else
+ {
+ Assert(forkNum == INIT_FORKNUM);
+ ent->has_init = true;
+ }
}
+ }
- /* Done with the first pass. */
- FreeDir(dbspace_dir);
+ /* Done with the first pass. */
+ FreeDir(dbspace_dir);
+
+ /* nothing to do if we have neither init forks nor mark files */
+ if (hash_get_num_entries(hash) < 1)
+ {
+ hash_destroy(hash);
+ return;
+ }
+ if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0)
+ {
/*
- * If we didn't find any init forks, there's no point in continuing;
- * we can bail out now.
+ * When we come here after recovery, an smgr object for this file might
+ * have been created. In that case we need to drop all buffers and then
+ * the smgr object; otherwise the checkpointer would wrongly try to flush
+ * buffers for nonexistent relation storage. This is safe as long as no
+ * other backend has accessed the relation before archive recovery
+ * starts.
*/
- if (hash_get_num_entries(hash) == 0)
+ HASH_SEQ_STATUS status;
+ relfile_entry *ent;
+ SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8);
+ int maxrels = 8;
+ int nrels = 0;
+ int i;
+
+ Assert(!HotStandbyActive());
+
+ hash_seq_init(&status, hash);
+ while ((ent = (relfile_entry *) hash_seq_search(&status)) != NULL)
{
- hash_destroy(hash);
- return;
+ RelFileLocatorBackend rel;
+
+ if (maxrels <= nrels)
+ {
+ maxrels *= 2;
+ srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+ }
+
+ rel.backend = InvalidBackendId;
+ rel.locator.spcOid = tspid;
+ rel.locator.dbOid = dbid;
+ rel.locator.relNumber = ent->relNumber;
+
+ srels[nrels++] = smgropen(rel.locator, InvalidBackendId);
}
- /*
- * Now, make a second pass and remove anything that matches.
- */
+ DropRelationsAllBuffers(srels, nrels);
+
+ for (i = 0; i < nrels; i++)
+ smgrclose(srels[i]);
+ }
+
+ /*
+ * Now, make a second pass and remove anything that matches.
+ */
+ if ((op & UNLOGGED_RELATION_CLEANUP) != 0)
+ {
dbspace_dir = AllocateDir(dbspacedirname);
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int relnumchars;
- unlogged_relation_entry ent;
+ RelFileNumber key;
+ relfile_entry *ent;
+ RelFileLocatorBackend rel;
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
- continue;
-
- /* We never remove the init fork. */
- if (forkNum == INIT_FORKNUM)
+ &forkNum, &mark))
continue;
/*
* See whether the OID portion of the name shows up in the hash
* table. If so, nuke it!
*/
- ent.reloid = atooid(de->d_name);
- if (hash_search(hash, &ent, HASH_FIND, NULL))
+ key = atooid(de->d_name);
+ ent = hash_search(hash, &key, HASH_FIND, NULL);
+
+ if (!ent)
+ continue;
+
+ if (!ent->dirty_all)
{
- snprintf(rm_path, sizeof(rm_path), "%s/%s",
- dbspacedirname, de->d_name);
- if (unlink(rm_path) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m",
- rm_path)));
- else
- elog(DEBUG2, "unlinked file \"%s\"", rm_path);
+ /* clean permanent relations don't need cleanup */
+ if (!ent->has_init)
+ continue;
+
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE)
+ continue;
}
+
+ /* so, nuke it! */
+ snprintf(rm_path, sizeof(rm_path), "%s/%s",
+ dbspacedirname, de->d_name);
+ if (unlink(rm_path) < 0)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m",
+ rm_path));
+
+ rel.backend = InvalidBackendId;
+ rel.locator.spcOid = tspid;
+ rel.locator.dbOid = dbid;
+ rel.locator.relNumber = atooid(de->d_name);
+
+ ForgetRelationForkSyncRequests(rel, forkNum);
}
/* Cleanup is complete. */
FreeDir(dbspace_dir);
- hash_destroy(hash);
}
/*
* Initialization happens after cleanup is complete: we copy each init
- * fork file to the corresponding main fork file. Note that if we are
- * asked to do both cleanup and init, we may never get here: if the
- * cleanup code determines that there are no init forks in this dbspace,
- * it will return before we get to this point.
+ * fork file to the corresponding main fork file.
*/
if ((op & UNLOGGED_RELATION_INIT) != 0)
{
@@ -285,6 +388,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int relnumchars;
char relnumbuf[OIDCHARS + 1];
char srcpath[MAXPGPATH * 2];
@@ -292,9 +396,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -328,15 +434,18 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
{
ForkNumber forkNum;
+ StorageMarks mark;
int relnumchars;
char relnumbuf[OIDCHARS + 1];
char mainpath[MAXPGPATH];
/* Skip anything that doesn't look like a relation data file. */
if (!parse_filename_for_nontemp_relation(de->d_name, &relnumchars,
- &forkNum))
+ &forkNum, &mark))
continue;
+ Assert(mark == SMGR_MARK_NONE);
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -379,7 +488,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
*/
bool
parse_filename_for_nontemp_relation(const char *name, int *relnumchars,
- ForkNumber *fork)
+ ForkNumber *fork, StorageMarks *mark)
{
int pos;
@@ -410,11 +519,19 @@ parse_filename_for_nontemp_relation(const char *name, int *relnumchars,
for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar)
;
- if (segchar <= 1)
- return false;
- pos += segchar;
+ if (segchar > 1)
+ pos += segchar;
}
+ /* mark file? */
+ if (name[pos] == '.' && name[pos + 1] != 0)
+ {
+ *mark = name[pos + 1];
+ pos += 2;
+ }
+ else
+ *mark = SMGR_MARK_NONE;
+
/* Now we should be at the end. */
if (name[pos] != '\0')
return false;
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index fdecbad170..6eb50d40a0 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -141,6 +141,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forknum,
BlockNumber blkno, bool skipFsync, int behavior);
static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
MdfdVec *seg);
+static bool mdmarkexists(SMgrRelation reln, ForkNumber forkNum,
+ StorageMarks mark);
static inline int
_mdfd_open_flags(void)
@@ -183,6 +185,82 @@ mdexists(SMgrRelation reln, ForkNumber forknum)
return (mdopenfork(reln, forknum, EXTENSION_RETURN_NULL) != NULL);
}
+/*
+ * mdcreatemark() -- Create a mark file.
+ *
+ * If isRedo is true, it's okay for the file to exist already.
+ */
+void
+mdcreatemark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rlocator, forkNum, mark);
+ int fd;
+
+ /* See mdcreate for details. */
+ TablespaceCreateDbspace(reln->smgr_rlocator.locator.spcOid,
+ reln->smgr_rlocator.locator.dbOid,
+ isRedo);
+
+ fd = BasicOpenFile(path, O_WRONLY | O_CREAT | O_EXCL);
+ if (fd < 0 && (!isRedo || errno != EEXIST))
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not create mark file \"%s\": %m", path));
+
+ /* fd is invalid if the file already existed during redo */
+ if (fd >= 0)
+ {
+ pg_fsync(fd);
+ close(fd);
+ }
+
+ /*
+ * To guarantee that the creation of the file is persistent, fsync its
+ * parent directory.
+ */
+ fsync_parent_path(path, ERROR);
+
+ pfree(path);
+}
+
+
+/*
+ * mdunlinkmark() -- Delete the mark file
+ *
+ * If isRedo is true, it's okay for the file to be missing.
+ */
+void
+mdunlinkmark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark,
+ bool isRedo)
+{
+ char *path = markpath(reln->smgr_rlocator, forkNum, mark);
+
+ if (!isRedo || mdmarkexists(reln, forkNum, mark))
+ durable_unlink(path, ERROR);
+
+ pfree(path);
+}
+
+/*
+ * mdmarkexists() -- Check if the file exists.
+ */
+static bool
+mdmarkexists(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark)
+{
+ char *path = markpath(reln->smgr_rlocator, forkNum, mark);
+ int fd;
+
+ fd = BasicOpenFile(path, O_RDONLY);
+ if (fd < 0 && errno != ENOENT)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not access mark file \"%s\": %m", path));
+ pfree(path);
+
+ if (fd < 0)
+ return false;
+
+ close(fd);
+ return true;
+}
+
/*
* mdcreate() -- Create a new relation on magnetic disk.
*
@@ -1227,6 +1305,16 @@ register_forget_request(RelFileLocatorBackend rlocator, ForkNumber forknum,
RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ );
}
+/*
+ * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork
+ */
+void
+ForgetRelationForkSyncRequests(RelFileLocatorBackend rlocator,
+ ForkNumber forknum)
+{
+ register_forget_request(rlocator, forknum, 0);
+}
+
/*
* ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB
*/
@@ -1592,12 +1680,14 @@ mdsyncfiletag(const FileTag *ftag, char *path)
* Return 0 on success, -1 on failure, with errno set.
*/
int
-mdunlinkfiletag(const FileTag *ftag, char *path)
+mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark)
{
char *p;
/* Compute the path. */
- p = relpathperm(ftag->rlocator, MAIN_FORKNUM);
+ p = GetRelationPath(ftag->rlocator.dbOid, ftag->rlocator.spcOid,
+ ftag->rlocator.relNumber, InvalidBackendId,
+ MAIN_FORKNUM, mark);
strlcpy(path, p, MAXPGPATH);
pfree(p);
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index f76c4605db..eafe14bb5e 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -65,6 +65,10 @@ typedef struct f_smgr
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
+ void (*smgr_createmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+ void (*smgr_unlinkmark) (SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
} f_smgr;
static const f_smgr smgrsw[] = {
@@ -86,6 +90,8 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
+ .smgr_createmark = mdcreatemark,
+ .smgr_unlinkmark = mdunlinkmark,
}
};
@@ -375,6 +381,26 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo);
}
+/*
+ * smgrcreatemark() -- Create a mark file
+ */
+void
+smgrcreatemark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_createmark(reln, forknum, mark, isRedo);
+}
+
+/*
+ * smgrunlinkmark() -- Delete a mark file
+ */
+void
+smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark,
+ bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlinkmark(reln, forknum, mark, isRedo);
+}
+
/*
* smgrdosyncall() -- Immediately sync all forks of all given relations
*
@@ -722,6 +748,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+/*
+ * smgrunlink() -- Unlink the storage file of the given fork
+ */
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rlocator, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 04fcb06056..6b072bcdd0 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -89,7 +89,8 @@ static CycleCtr checkpoint_cycle_ctr = 0;
typedef struct SyncOps
{
int (*sync_syncfiletag) (const FileTag *ftag, char *path);
- int (*sync_unlinkfiletag) (const FileTag *ftag, char *path);
+ int (*sync_unlinkfiletag) (const FileTag *ftag, char *path,
+ StorageMarks mark);
bool (*sync_filetagmatches) (const FileTag *ftag,
const FileTag *candidate);
} SyncOps;
@@ -233,7 +234,8 @@ SyncPostCheckpoint(void)
/* Unlink the file */
if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
- path) < 0)
+ path,
+ SMGR_MARK_NONE) < 0)
{
/*
* There's a race condition, when the database is dropped at the
@@ -242,6 +244,26 @@ SyncPostCheckpoint(void)
* here. rmtree() also has to ignore ENOENT errors, to deal with
* the possibility that we delete the file first.
*/
+ if (errno != ENOENT)
+ ereport(WARNING,
+ errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path));
+ }
+ else if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
+ path,
+ SMGR_MARK_UNCOMMITTED)
+ < 0)
+ {
+ /*
+ * We might also have an SMGR_MARK_UNCOMMITTED file. Remove it if the
+ * fork file has been successfully removed. It's fine if the file
+ * does not exist. Since we have successfully removed the storage
+ * file, it's no big deal if the mark file can't be removed. It
+ * will be eventually removed during a future startup. If that
+ * removal fails, the leftover mark file prevents the creation of
+ * the corresponding storage file so that mark files won't result
+ * in unexpected removal of the correct storage files.
+ */
if (errno != ENOENT)
ereport(WARNING,
(errcode_for_file_access(),
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 27782237d0..e9e4bafb01 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -418,6 +418,22 @@ extractPageInfo(XLogReaderState *record)
* source system.
*/
}
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_UNLINK)
+ {
+ /*
+ * We can safely ignore these. We'll see that the files don't exist in
+ * the target data dir, and copy them in from the source system. No
+ * need to do anything special here.
+ */
+ }
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_MARK)
+ {
+ /*
+ * We can safely ignore these. The file will be removed from the
+ * target if it doesn't exist in the source system. The files are
+ * empty, so we don't need to bother with their content.
+ */
+ }
else if (rmid == RM_XACT_ID &&
((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT ||
(rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED ||
diff --git a/src/common/relpath.c b/src/common/relpath.c
index 87de5f6c96..b1f6832cfa 100644
--- a/src/common/relpath.c
+++ b/src/common/relpath.c
@@ -139,9 +139,15 @@ GetDatabasePath(Oid dbOid, Oid spcOid)
*/
char *
GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
- int backendId, ForkNumber forkNumber)
+ int backendId, ForkNumber forkNumber, char mark)
{
char *path;
+ char markstr[4];
+
+ if (mark == 0)
+ markstr[0] = 0;
+ else
+ snprintf(markstr, sizeof(markstr), ".%c", mark);
if (spcOid == GLOBALTABLESPACE_OID)
{
@@ -149,10 +155,10 @@ GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
Assert(dbOid == 0);
Assert(backendId == InvalidBackendId);
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("global/%u_%s",
- relNumber, forkNames[forkNumber]);
+ path = psprintf("global/%u_%s%s",
+ relNumber, forkNames[forkNumber], markstr);
else
- path = psprintf("global/%u", relNumber);
+ path = psprintf("global/%u%s", relNumber, markstr);
}
else if (spcOid == DEFAULTTABLESPACE_OID)
{
@@ -160,22 +166,22 @@ GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/%u_%s",
+ path = psprintf("base/%u/%u_%s%s",
dbOid, relNumber,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/%u",
- dbOid, relNumber);
+ path = psprintf("base/%u/%u%s",
+ dbOid, relNumber, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("base/%u/t%d_%u_%s",
+ path = psprintf("base/%u/t%d_%u_%s%s",
dbOid, backendId, relNumber,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("base/%u/t%d_%u",
- dbOid, backendId, relNumber);
+ path = psprintf("base/%u/t%d_%u%s",
+ dbOid, backendId, relNumber, markstr);
}
}
else
@@ -184,27 +190,28 @@ GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
if (backendId == InvalidBackendId)
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u_%s%s",
spcOid, TABLESPACE_VERSION_DIRECTORY,
dbOid, relNumber,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/%u%s",
spcOid, TABLESPACE_VERSION_DIRECTORY,
- dbOid, relNumber);
+ dbOid, relNumber, markstr);
}
else
{
if (forkNumber != MAIN_FORKNUM)
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s%s",
spcOid, TABLESPACE_VERSION_DIRECTORY,
dbOid, backendId, relNumber,
- forkNames[forkNumber]);
+ forkNames[forkNumber], markstr);
else
- path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u",
+ path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u%s",
spcOid, TABLESPACE_VERSION_DIRECTORY,
- dbOid, backendId, relNumber);
+ dbOid, backendId, relNumber, markstr);
}
}
+
return path;
}
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 45a3c7835c..0b39c6ef56 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -25,6 +25,8 @@ extern PGDLLIMPORT int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileLocator rlocator,
char relpersistence,
bool register_delete);
+extern void RelationCreateInitFork(Relation rel);
+extern void RelationDropInitFork(Relation rel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileLocator rlocator, bool atCommit);
extern void RelationPreTruncate(Relation rel);
@@ -43,6 +45,7 @@ extern void RestorePendingSyncs(char *startAddress);
extern void smgrDoPendingDeletes(bool isCommit);
extern void smgrDoPendingSyncs(bool isCommit, bool isParallelWorker);
extern int smgrGetPendingDeletes(bool forCommit, RelFileLocator **ptr);
+extern void smgrDoPendingCleanups(bool isCommit);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index 6b0a7aa3df..a36646c6ee 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -18,17 +18,23 @@
#include "lib/stringinfo.h"
#include "storage/block.h"
#include "storage/relfilelocator.h"
+#include "storage/smgr.h"
/*
* Declarations for smgr-related XLOG records
*
- * Note: we log file creation and truncation here, but logging of deletion
- * actions is handled by xact.c, because it is part of transaction commit.
+ * Note: we log file creation, truncation and buffer persistence change here,
+ * but logging of deletion actions is handled mainly by xact.c, because it is
+ * part of transaction commit in most cases. However, there's a case where
+ * init forks are deleted outside the control of a transaction.
*/
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_UNLINK 0x30
+#define XLOG_SMGR_MARK 0x40
+#define XLOG_SMGR_BUFPERSISTENCE 0x50
typedef struct xl_smgr_create
{
@@ -36,6 +42,26 @@ typedef struct xl_smgr_create
ForkNumber forkNum;
} xl_smgr_create;
+typedef struct xl_smgr_unlink
+{
+ RelFileLocator rlocator;
+ ForkNumber forkNum;
+} xl_smgr_unlink;
+
+typedef enum smgr_mark_action
+{
+ XLOG_SMGR_MARK_CREATE = 'c',
+ XLOG_SMGR_MARK_UNLINK = 'u'
+} smgr_mark_action;
+
+typedef struct xl_smgr_mark
+{
+ RelFileLocator rlocator;
+ ForkNumber forkNum;
+ StorageMarks mark;
+ smgr_mark_action action;
+} xl_smgr_mark;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +77,11 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileLocator *rlocator, ForkNumber forkNum);
+extern void log_smgrcreatemark(const RelFileLocator *rlocator,
+ ForkNumber forkNum, StorageMarks mark);
+extern void log_smgrunlinkmark(const RelFileLocator *rlocator,
+ ForkNumber forkNum, StorageMarks mark);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index 511c21682e..28c9dbcd13 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -74,7 +74,7 @@ extern int forkname_chars(const char *str, ForkNumber *fork);
extern char *GetDatabasePath(Oid dbOid, Oid spcOid);
extern char *GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
- int backendId, ForkNumber forkNumber);
+ int backendId, ForkNumber forkNumber, char mark);
/*
* Wrapper macros for GetRelationPath. Beware of multiple
@@ -84,7 +84,7 @@ extern char *GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
/* First argument is a RelFileLocator */
#define relpathbackend(rlocator, backend, forknum) \
GetRelationPath((rlocator).dbOid, (rlocator).spcOid, (rlocator).relNumber, \
- backend, forknum)
+ backend, forknum, 0)
/* First argument is a RelFileLocator */
#define relpathperm(rlocator, forknum) \
@@ -94,4 +94,9 @@ extern char *GetRelationPath(Oid dbOid, Oid spcOid, RelFileNumber relNumber,
#define relpath(rlocator, forknum) \
relpathbackend((rlocator).locator, (rlocator).backend, forknum)
+/* First argument is a RelFileLocatorBackend */
+#define markpath(rlocator, forknum, mark) \
+ GetRelationPath((rlocator).locator.dbOid, (rlocator).locator.spcOid, \
+ (rlocator).locator.relNumber, \
+ (rlocator).backend, forknum, mark)
#endif /* RELPATH_H */
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 6791a406fc..35d022d8e1 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -190,6 +190,7 @@ extern void pg_flush_data(int fd, off_t offset, off_t nbytes);
extern int pg_truncate(const char *path, off_t length);
extern void fsync_fname(const char *fname, bool isdir);
extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
+extern int fsync_parent_path(const char *fname, int elevel);
extern int durable_rename(const char *oldfile, const char *newfile, int elevel);
extern int durable_unlink(const char *fname, int elevel);
extern void SyncDataDirectory(void);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 941879ee6a..de49863245 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -23,6 +23,10 @@
extern void mdinit(void);
extern void mdopen(SMgrRelation reln);
extern void mdclose(SMgrRelation reln, ForkNumber forknum);
+extern void mdcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void mdunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
extern void mdunlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo);
@@ -43,12 +47,14 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+extern void ForgetRelationForkSyncRequests(RelFileLocatorBackend rlocator,
+ ForkNumber forknum);
extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo);
/* md sync callbacks */
extern int mdsyncfiletag(const FileTag *ftag, char *path);
-extern int mdunlinkfiletag(const FileTag *ftag, char *path);
+extern int mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark);
extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *candidate);
#endif /* MD_H */
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index e2bbb5abe9..119dac1505 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -16,14 +16,16 @@
#define REINIT_H
#include "common/relpath.h"
-
+#include "storage/smgr.h"
extern void ResetUnloggedRelations(int op);
extern bool parse_filename_for_nontemp_relation(const char *name,
int *relnumchars,
- ForkNumber *fork);
+ ForkNumber *fork,
+ StorageMarks *mark);
#define UNLOGGED_RELATION_CLEANUP 0x0001
-#define UNLOGGED_RELATION_INIT 0x0002
+#define UNLOGGED_RELATION_DROP_BUFFER 0x0002
+#define UNLOGGED_RELATION_INIT 0x0004
#endif /* REINIT_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a9a179aaba..4964146106 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,18 @@
#include "storage/block.h"
#include "storage/relfilelocator.h"
+/*
+ * A storage mark is a file whose existence indicates something about
+ * another file. Such a file is named "<filename>.<mark>", where the mark is
+ * one of the values of StorageMarks. Since ".<digit>" denotes a segment
+ * file, don't use digits for the mark character.
+ */
+typedef enum StorageMarks
+{
+ SMGR_MARK_NONE = 0,
+ SMGR_MARK_UNCOMMITTED = 'u' /* the file is not committed yet */
+} StorageMarks;
+
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
@@ -87,7 +99,12 @@ extern void smgrcloseall(void);
extern void smgrcloserellocator(RelFileLocatorBackend rlocator);
extern void smgrrelease(SMgrRelation reln);
extern void smgrreleaseall(void);
+extern void smgrcreatemark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
+extern void smgrunlinkmark(SMgrRelation reln, ForkNumber forknum,
+ StorageMarks mark, bool isRedo);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/test/recovery/t/013_crash_restart.pl b/src/test/recovery/t/013_crash_restart.pl
index ce57792f31..d32a1b7aa8 100644
--- a/src/test/recovery/t/013_crash_restart.pl
+++ b/src/test/recovery/t/013_crash_restart.pl
@@ -86,6 +86,24 @@ ok( pump_until(
$killme_stdout = '';
$killme_stderr = '';
+# Create a table that should *not* survive, but has rows.
+# The table's contents are required to cause access to the storage file
+# after a restart.
+$killme_stdin .= q[
+CREATE TABLE not_alive AS SELECT 1 as a;
+SELECT pg_relation_filepath('not_alive');
+];
+ok( pump_until(
+ $killme, $psql_timeout,
+ \$killme_stdout, qr/[[:alnum:]\/]+[\r\n]$/m),
+ 'added in-creation table');
+my $not_alive_relfile = $node->data_dir . "/" . $killme_stdout;
+chomp($not_alive_relfile);
+$killme_stdout = '';
+$killme_stderr = '';
+
+# The relfile must exist now.
+ok ( -e $not_alive_relfile, 'relfile for in-creation table');
# Start longrunning query in second session; its failure will signal that
# crash-restart has occurred. The initial wait for the trivial select is to
@@ -144,6 +162,9 @@ $killme->run();
($monitor_stdin, $monitor_stdout, $monitor_stderr) = ('', '', '');
$monitor->run();
+# The relfile must have been removed due to the recent restart.
+ok ( ! -e $not_alive_relfile,
+ 'relfile for the in-creation table should be removed after restart');
# Acquire pid of new backend
$killme_stdin .= q[
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 51b7951ad8..057e3d3104 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1997,6 +1997,7 @@ PatternInfoArray
Pattern_Prefix_Status
Pattern_Type
PendingFsyncEntry
+PendingMarkCleanup
PendingRelDelete
PendingRelSync
PendingUnlinkEntry
@@ -2639,6 +2640,7 @@ StdRdOptIndexCleanup
StdRdOptions
Step
StopList
+StorageMarks
StrategyNumber
StreamCtl
StreamStopReason
@@ -3674,6 +3676,7 @@ registered_buffer
regmatch_t
regoff_t
regproc
+relfile_entry
relopt_bool
relopt_enum
relopt_enum_elt_def
@@ -3729,6 +3732,7 @@ slist_iter
slist_mutable_iter
slist_node
slock_t
+smgr_mark_action
socket_set
socklen_t
spgBulkDeleteState
@@ -3936,7 +3940,9 @@ xl_restore_point
xl_running_xacts
xl_seq_rec
xl_smgr_create
+xl_smgr_mark
xl_smgr_truncate
+xl_smgr_unlink
xl_standby_lock
xl_standby_locks
xl_tblspc_create_rec
--
2.25.1
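
For readers following the relpath.c hunk in the patch above: the mark file name is the fork file name with ".<mark>" appended. Below is a minimal standalone sketch of that naming rule, covering only the default-tablespace, non-temporary case. It does not use the PostgreSQL headers. demo_mark_path, DemoMark and demo_fork_names are hypothetical names introduced for illustration; the real logic lives in GetRelationPath.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for StorageMarks; not the real definition. */
typedef enum
{
    DEMO_MARK_NONE = 0,
    DEMO_MARK_UNCOMMITTED = 'u'
} DemoMark;

/* Fork suffixes in fork-number order (main has no suffix in a path). */
static const char *const demo_fork_names[] = {"main", "fsm", "vm", "init"};

/*
 * Build "base/<db>/<rel>[_<fork>][.<mark>]" into buf.
 * Returns the value returned by snprintf.
 */
static int
demo_mark_path(char *buf, size_t buflen, unsigned db, unsigned rel,
               int fork, DemoMark mark)
{
    char markstr[4];

    assert(fork >= 0 && fork < 4);

    /* An empty suffix means "the storage file itself". */
    if (mark == DEMO_MARK_NONE)
        markstr[0] = '\0';
    else
        snprintf(markstr, sizeof(markstr), ".%c", (char) mark);

    if (fork != 0)
        return snprintf(buf, buflen, "base/%u/%u_%s%s",
                        db, rel, demo_fork_names[fork], markstr);

    return snprintf(buf, buflen, "base/%u/%u%s", db, rel, markstr);
}
```

Under these assumptions, the init fork (fork number 3) of relation 16384 in database 5 with the 'u' mark comes out as base/5/16384_init.u, and with no mark the main fork is just base/5/16384. A letter is used for the mark because a ".<digit>" suffix already denotes a segment file.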
Attachment: v29-0002-In-place-table-persistence-change.patch (text/x-diff; charset=us-ascii)
From 1ba019a026c66c71bafdc9c71393bef4251917b2 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 25 Apr 2023 15:49:10 +0900
Subject: [PATCH v29 2/3] In-place table persistence change
Previously, ALTER TABLE SET LOGGED/UNLOGGED caused a large amount of
file I/O due to heap rewrites, even though ALTER TABLE SET UNLOGGED
does not require any data rewrites. This patch eliminates the need for
those rewrites. Additionally, ALTER TABLE SET LOGGED now emits
XLOG_FPI records instead of numerous HEAP_INSERT records when
wal_level > minimal, reducing resource consumption.
---
src/backend/access/rmgrdesc/smgrdesc.c | 12 +
src/backend/catalog/storage.c | 295 ++++++++++++++++++++++++-
src/backend/commands/tablecmds.c | 270 ++++++++++++++++++----
src/backend/storage/buffer/bufmgr.c | 84 +++++++
src/backend/storage/file/reinit.c | 51 ++++-
src/bin/pg_rewind/parsexlog.c | 6 +
src/include/catalog/storage_xlog.h | 8 +
src/include/storage/bufmgr.h | 2 +
src/tools/pgindent/typedefs.list | 1 +
9 files changed, 675 insertions(+), 54 deletions(-)
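
Note for reviewers: RelationCreateInitFork and RelationDropInitFork in this patch both scan the pendingCleanups list and unlink matching entries while walking it. Below is a minimal standalone sketch of that remove-while-iterating pattern on a singly linked list. DemoPending, demo_push, demo_cancel and demo_length are hypothetical names used only for illustration; the real code uses PendingCleanup entries allocated in TopMemoryContext, not malloc.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for a pending-cleanup entry. */
typedef struct DemoPending
{
    unsigned    relid;          /* which relation the entry is for */
    int         forknum;        /* which fork the entry would unlink */
    struct DemoPending *next;
} DemoPending;

/* Push a new entry on the front of the list and return the new head. */
static DemoPending *
demo_push(DemoPending *head, unsigned relid, int forknum)
{
    DemoPending *p = malloc(sizeof(DemoPending));

    assert(p != NULL);
    p->relid = relid;
    p->forknum = forknum;
    p->next = head;
    return p;
}

/*
 * Remove every entry matching (relid, forknum).  Returns true if at least
 * one entry was removed.  "prev" only advances past entries that are kept.
 */
static bool
demo_cancel(DemoPending **head, unsigned relid, int forknum)
{
    DemoPending *prev = NULL;
    DemoPending *p;
    DemoPending *next;
    bool        found = false;

    for (p = *head; p != NULL; p = next)
    {
        next = p->next;         /* save before p may be freed */

        if (p->relid == relid && p->forknum == forknum)
        {
            if (prev)
                prev->next = next;
            else
                *head = next;
            free(p);
            found = true;
        }
        else
            prev = p;
    }
    return found;
}

/* Count the entries in the list. */
static int
demo_length(const DemoPending *head)
{
    int n = 0;

    for (; head != NULL; head = head->next)
        n++;
    return n;
}
```

The return value plays the role of the "create" and "inxact_created" flags in the patch: a successful cancel means an earlier step in the same transaction had registered work for that fork, so the caller undoes it instead of registering a new entry.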
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index f8187385c4..e2998a3ee4 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -71,6 +71,15 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
appendStringInfo(buf, "%s %s", action, path);
pfree(path);
}
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+ char *path = relpathperm(xlrec->rlocator, MAIN_FORKNUM);
+
+ appendStringInfoString(buf, path);
+ appendStringInfo(buf, " persistence %d", xlrec->persistence);
+ pfree(path);
+ }
}
const char *
@@ -92,6 +101,9 @@ smgr_identify(uint8 info)
case XLOG_SMGR_MARK:
id = "MARK";
break;
+ case XLOG_SMGR_BUFPERSISTENCE:
+ id = "BUFPERSISTENCE";
+ break;
}
return id;
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index fe06c3c31d..6106376525 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -69,11 +69,13 @@ typedef struct PendingRelDelete
#define PCOP_UNLINK_FORK (1 << 0)
#define PCOP_UNLINK_MARK (1 << 1)
+#define PCOP_SET_PERSISTENCE (1 << 2)
typedef struct PendingCleanup
{
RelFileLocator rlocator; /* relation that need a cleanup */
int op; /* operation mask */
+ bool bufpersistence; /* buffer persistence to set */
ForkNumber unlink_forknum; /* forknum to unlink */
StorageMarks unlink_mark; /* mark to unlink */
BackendId backend; /* InvalidBackendId if not a temp rel */
@@ -223,6 +225,202 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
return srel;
}
+/*
+ * RelationCreateInitFork
+ * Create physical storage for the init fork of a relation.
+ *
+ * Create the init fork for the relation.
+ *
+ * This function is transactional. The creation is WAL-logged, and if the
+ * transaction aborts later on, the init fork will be removed.
+ */
+void
+RelationCreateInitFork(Relation rel)
+{
+ RelFileLocator rlocator = rel->rd_locator;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ SMgrRelation srel;
+ bool create = true;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), false, false);
+
+ /*
+ * If a pending-unlink exists for this relation's init-fork, it indicates
+ * that the init-fork existed before the current transaction; this function
+ * reverts the pending-unlink by removing the entry. See
+ * RelationDropInitFork.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileLocatorEquals(rlocator, pending->rlocator) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ create = false;
+ }
+ else
+ prev = pending;
+ }
+
+ if (!create)
+ return;
+
+ /* create the init fork, along with the mark file */
+ srel = smgropen(rlocator, InvalidBackendId);
+ log_smgrcreatemark(&rlocator, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrcreatemark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+
+ /* No init fork exists yet; create it. */
+ smgrcreate(srel, INIT_FORKNUM, false);
+
+ /*
+ * For index relations, WAL-logging and file sync are handled by
+ * ambuildempty. In contrast, for heap relations, these tasks are performed
+ * directly.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rlocator, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /* drop the init fork, mark file then revert persistence at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = rlocator;
+ pending->op = PCOP_UNLINK_FORK | PCOP_UNLINK_MARK | PCOP_SET_PERSISTENCE;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->bufpersistence = true;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* drop mark file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = rlocator;
+ pending->op = PCOP_UNLINK_MARK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->unlink_mark = SMGR_MARK_UNCOMMITTED;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
+/*
+ * RelationDropInitFork
+ * Delete physical storage for the init fork of a relation.
+ */
+void
+RelationDropInitFork(Relation rel)
+{
+ RelFileLocator rlocator = rel->rd_locator;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ bool inxact_created = false;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), true, false);
+
+ /*
+ * Search for a pending-unlink associated with the init-fork of the
+ * relation. Its presence indicates that the init-fork was created within
+ * the current transaction.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileLocatorEquals(rlocator, pending->rlocator) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ /* unlink list entry */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+
+ /* prev does not change */
+
+ inxact_created = true;
+ }
+ else
+ prev = pending;
+ }
+
+ /*
+ * If the init-fork was created in this transaction, remove both the
+ * init-fork and mark file. Otherwise, register an at-commit pending-unlink
+ * for the existing init-fork. See RelationCreateInitFork.
+ */
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rlocator, InvalidBackendId);
+ ForkNumber forknum = INIT_FORKNUM;
+ BlockNumber firstblock = 0;
+
+ /*
+ * Some AMs initialize init-fork via the buffer manager. To properly
+ * drop the init-fork, first drop all buffers for the init-fork, then
+ * unlink the init-fork and the mark file.
+ */
+ DropRelationBuffers(srel, &forknum, 1, &firstblock);
+ log_smgrunlinkmark(&rlocator, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED);
+ smgrunlinkmark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false);
+ log_smgrunlink(&rlocator, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ return;
+ }
+
+ /* register drop of this init fork file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = rlocator;
+ pending->op = PCOP_UNLINK_FORK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+
+ /* revert buffer-persistence changes at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = rlocator;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = false;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -305,6 +503,25 @@ log_smgrunlinkmark(const RelFileLocator *rlocator, ForkNumber forkNum,
XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileLocator rlocator, bool persistence)
+{
+ xl_smgr_bufpersistence xlrec;
+
+ /*
+ * Make an XLOG entry reporting the change of buffer persistence.
+ */
+ xlrec.rlocator = rlocator;
+ xlrec.persistence = persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -858,10 +1075,29 @@ smgrDoPendingCleanups(bool isCommit)
srel = smgropen(pending->rlocator, pending->backend);
Assert((pending->op &
- ~(PCOP_UNLINK_FORK | PCOP_UNLINK_MARK)) == 0);
+ ~(PCOP_UNLINK_FORK | PCOP_UNLINK_MARK |
+ PCOP_SET_PERSISTENCE)) == 0);
+
+ if (pending->op & PCOP_SET_PERSISTENCE)
+ {
+ SetRelationBuffersPersistence(srel, pending->bufpersistence,
+ InRecovery);
+ }
if (pending->op & PCOP_UNLINK_FORK)
{
+ BlockNumber firstblock = 0;
+
+ /*
+ * Unlink the fork file. Currently this operation is
+ * applied only to init-forks. As it is not certain that
+ * the init-fork is not loaded on shared buffers, drop all
+ * buffers for it.
+ */
+ Assert(pending->unlink_forknum == INIT_FORKNUM);
+ DropRelationBuffers(srel, &pending->unlink_forknum, 1,
+ &firstblock);
+
/* Don't emit wal while recovery. */
if (!InRecovery)
log_smgrunlink(&pending->rlocator,
@@ -1286,8 +1522,8 @@ smgr_redo(XLogReaderState *record)
else
{
/*
- * Delete pending action for this mark file if any. We should have
- * at most one entry for this action.
+ * Delete any pending action for this mark file, if present. There
+ * should be at most one entry for this action.
*/
PendingCleanup *prev = NULL;
@@ -1311,6 +1547,59 @@ smgr_redo(XLogReaderState *record)
}
}
}
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ PendingCleanup *prev = NULL;
+
+ reln = smgropen(xlrec->rlocator, InvalidBackendId);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+
+ /*
+ * Delete any pending action for persistence change, if present. There
+ * should be at most one entry for this action.
+ */
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileLocatorEquals(xlrec->rlocator, pending->rlocator) &&
+ (pending->op & PCOP_SET_PERSISTENCE) != 0)
+ {
+ Assert(pending->bufpersistence == xlrec->persistence);
+
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+
+ /*
+ * During abort, revert any changes to buffer persistence made in
+ * this transaction.
+ */
+ if (!pending)
+ {
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = xlrec->rlocator;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = !xlrec->persistence;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 727f151750..3192a97b0e 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -55,6 +55,7 @@
#include "commands/defrem.h"
#include "commands/event_trigger.h"
#include "commands/policy.h"
+#include "commands/progress.h"
#include "commands/sequence.h"
#include "commands/tablecmds.h"
#include "commands/tablespace.h"
@@ -5464,6 +5465,188 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
return newcmd;
}
+/*
+ * RelationChangePersistence: perform in-place persistence change of a relation
+ */
+static void
+RelationChangePersistence(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ int i;
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * Use ATRewriteTable instead of this function unless all of the
+ * following conditions hold.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ /*
+ * Initially, gather all relations that require a persistence change.
+ */
+
+ /* Collect OIDs of indexes and toast relations */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ /* Add toast relation if any */
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ List *toastidx;
+ Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
+
+ relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
+ toastidx = RelationGetIndexList(toastrel);
+ relids = list_concat(relids, toastidx);
+ pfree(toastidx);
+ table_close(toastrel, NoLock);
+ }
+
+ table_close(rel, NoLock);
+
+ /* Make changes in storage */
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach(lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+
+ /*
+ * XXXX: Some access methods don't support in-place persistence
+ * changes. GiST uses page LSNs to figure out whether a block has been
+ * modified. However, UNLOGGED GiST indexes use fake LSNs, which are
+ * incompatible with the real LSNs used for LOGGED indexes.
+ *
+ * Potentially, if gistGetFakeLSN behaved similarly for both permanent
+ * and unlogged indexes, we could avoid index rebuilds by emitting
+ * extra WAL records while the index is unlogged.
+ *
+ * Compare relam against a positive list to ensure the hard way is
+ * taken for unknown AMs.
+ */
+ if (r->rd_rel->relkind == RELKIND_INDEX &&
+ /* GiST is excluded */
+ r->rd_rel->relam != BTREE_AM_OID &&
+ r->rd_rel->relam != HASH_AM_OID &&
+ r->rd_rel->relam != GIN_AM_OID &&
+ r->rd_rel->relam != SPGIST_AM_OID &&
+ r->rd_rel->relam != BRIN_AM_OID)
+ {
+ int reindex_flags;
+ ReindexParams params = {0};
+
+ /* reindex doesn't allow concurrent use of the index */
+ table_close(r, NoLock);
+
+ reindex_flags =
+ REINDEX_REL_SUPPRESS_INDEX_USE |
+ REINDEX_REL_CHECK_CONSTRAINTS;
+
+ /* Set the same persistence as the parent relation. */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+
+ reindex_index(reloid, reindex_flags, persistence, &params);
+
+ continue;
+ }
+
+ /* Create or drop init fork */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ RelationCreateInitFork(r);
+ else
+ RelationDropInitFork(r);
+
+ /*
+ * If this relation becomes WAL-logged, immediately sync all files
+ * except the init-fork to establish the initial state on storage. The
+ * buffers should have already been flushed out by
+ * RelationCreate(Drop)InitFork called just above. The init-fork should
+ * already be synchronized as required.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT)
+ {
+ for (i = 0; i < INIT_FORKNUM; i++)
+ {
+ if (smgrexists(RelationGetSmgr(r), i))
+ smgrimmedsync(RelationGetSmgr(r), i);
+ }
+ }
+
+ /* Update catalog */
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+
+ /*
+ * If wal_level >= replica, switching to LOGGED necessitates WAL-logging
+ * the relation content for later recovery. This is not emitted when
+ * wal_level = minimal.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded())
+ {
+ ForkNumber fork;
+ xl_smgr_truncate xlrec;
+
+ xlrec.blkno = 0;
+ xlrec.rlocator = r->rd_locator;
+ xlrec.flags = SMGR_TRUNCATE_ALL;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ for (fork = 0; fork < INIT_FORKNUM; fork++)
+ {
+ if (smgrexists(RelationGetSmgr(r), fork))
+ log_newpage_range(r, fork, 0,
+ smgrnblocks(RelationGetSmgr(r), fork),
+ false);
+ }
+ }
+
+ table_close(r, NoLock);
+ }
+
+ table_close(classRel, NoLock);
+}
+
/*
* ATRewriteTables: ALTER TABLE phase 3
*/
@@ -5594,48 +5777,55 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- /*
- * Create transient table that will receive the modified data.
- *
- * Ensure it is marked correctly as logged or unlogged. We have
- * to do this here so that buffers for the new relfilenumber will
- * have the right persistence set, and at the same time ensure
- * that the original filenumbers's buffers will get read in with
- * the correct setting (i.e. the original one). Otherwise a
- * rollback after the rewrite would possibly result with buffers
- * for the original filenumbers having the wrong persistence
- * setting.
- *
- * NB: This relies on swap_relation_files() also swapping the
- * persistence. That wouldn't work for pg_class, but that can't be
- * unlogged anyway.
- */
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
- persistence, lockmode);
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
+ RelationChangePersistence(tab, persistence, lockmode);
+ else
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new
+ * relfilenumber will have the right persistence set, and at
+ * the same time ensure that the original filenumbers's buffers
+ * will get read in with the correct setting (i.e. the original
+ * one). Otherwise a rollback after the rewrite would possibly
+ * result with buffers for the original filenumbers having the
+ * wrong persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that
+ * can't be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace,
+ NewAccessMethod,
+ persistence, lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
-
- InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ /*
+ * Swap the physical files of the old and new heaps, then
+ * rebuild indexes and discard the old heap. We can use
+ * RecentXmin for the table's new relfrozenxid because we
+ * rewrote all the tuples in ATRewriteTable, so no older Xid
+ * remains in the table. Also, we never try to swap toast
+ * tables by content, since we have no interest in letting
+ * this code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
+
+ InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ }
}
else if (tab->rewrite > 0 && tab->relkind == RELKIND_SEQUENCE)
{
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index df22aaa1c5..7d21fd5ac5 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3708,6 +3708,90 @@ DropRelationBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
}
}
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a relation
+ * then writes all dirty pages to disk (or kernel disk buffers) when
+ * switching to PERMANENT, ensuring the kernel has an up-to-date view of
+ * the relation.
+ *
+ * The caller must be holding AccessExclusiveLock on the target relation
+ * to ensure no other backend is busy dirtying more blocks.
+ *
+ * XXX currently it sequentially searches the buffer pool; consider
+ * implementing more efficient search methods. This routine isn't used in
+ * performance-critical code paths, so it's not worth additional overhead
+ * to make it go faster; see also DropRelationBuffers.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
+{
+ int i;
+ RelFileLocatorBackend rlocator = srel->smgr_rlocator;
+
+ Assert(!RelFileLocatorBackendIsTemp(rlocator));
+
+ if (!isRedo)
+ log_smgrbufpersistence(srel->smgr_rlocator.locator, permanent);
+
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ if (!RelFileLocatorEquals(BufTagGetRelFileLocator(&bufHdr->tag),
+ rlocator.locator))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (!RelFileLocatorEquals(BufTagGetRelFileLocator(&bufHdr->tag),
+ rlocator.locator))
+ {
+ UnlockBufHdr(bufHdr, buf_state);
+ continue;
+ }
+
+ if (permanent)
+ {
+ /* The init fork is being dropped, drop buffers for it. */
+ if (BufTagGetForkNum(&bufHdr->tag) == INIT_FORKNUM)
+ {
+ InvalidateBuffer(bufHdr);
+ continue;
+ }
+
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /* flush this buffer when switching to PERMANENT */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, srel, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ /* There shouldn't be an init fork */
+ Assert(BufTagGetForkNum(&bufHdr->tag) != INIT_FORKNUM);
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ }
+}
+
/* ---------------------------------------------------------------------
* DropRelationsAllBuffers
*
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index e84fcbf884..a5d8763e15 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -38,6 +38,7 @@ typedef struct
{
RelFileNumber relNumber; /* hash key */
bool has_init; /* has INIT fork */
+ bool dirty_init; /* needs to remove INIT fork */
bool dirty_all; /* needs to remove all forks */
} relfile_entry;
@@ -45,7 +46,10 @@ typedef struct
* Clean up and reset relation files from before the last restart.
*
* If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations
- * depending on the existence of mark files.
+ * depending on the existence of the "cleanup" forks.
+ *
+ * If SMGR_MARK_UNCOMMITTED mark file for init fork is present, we remove the
+ * init fork along with the mark file.
*
* If SMGR_MARK_UNCOMMITTED mark file for main fork is present we remove the
* whole relation along with the mark file.
@@ -54,7 +58,7 @@ typedef struct
* with the "init" fork, except for the "init" fork itself.
*
* If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all
- * relations that are to be cleaned up.
+ * relations that have the "cleanup" and/or the "init" forks.
*
* If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main
* fork.
@@ -241,7 +245,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
* Put the OID portion of the name into the hash table,
* if it isn't already. If it has SMGR_MARK_UNCOMMITTED mark
* files, the storage file is in dirty state, where clean up is
- * needed.
+ * needed. isn't already.
*/
key = atooid(de->d_name);
ent = hash_search(hash, &key, HASH_ENTER, &found);
@@ -249,10 +253,13 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
if (!found)
{
ent->has_init = false;
+ ent->dirty_init = false;
ent->dirty_all = false;
}
- if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
+ ent->dirty_init = true;
+ else if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED)
ent->dirty_all = true;
else
{
@@ -276,11 +283,10 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
{
/*
* When we come here after recovery, smgr object for this file might
- * have been created. In that case we need to drop all buffers then the
- * smgr object. Otherwise checkpointer wrongly tries to flush buffers
- * for nonexistent relation storage. This is safe as far as no other
- * backends have accessed the relation before starting archive
- * recovery.
+ * have been created. In that case we need to drop all buffers then
+ * the smgr object before initializing the unlogged relation. This is
+ * safe as far as no other backends have accessed the relation before
+ * starting archive recovery.
*/
HASH_SEQ_STATUS status;
relfile_entry *ent;
@@ -296,6 +302,13 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
{
RelFileLocatorBackend rel;
+ /*
+ * The relation is persistent and stays persistent. Don't drop the
+ * buffers for this relation.
+ */
+ if (ent->has_init && ent->dirty_init)
+ continue;
+
if (maxrels <= nrels)
{
maxrels *= 2;
@@ -352,8 +365,24 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname,
if (!ent->has_init)
continue;
- if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE)
- continue;
+ if (ent->dirty_init)
+ {
+ /*
+ * The crashed transaction did SET UNLOGGED. This relation
+ * is restored to a LOGGED relation.
+ */
+ if (forkNum != INIT_FORKNUM)
+ continue;
+ }
+ else
+ {
+ /*
+ * We don't remove the INIT fork of a non-dirty
+ * relation.
+ */
+ if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE)
+ continue;
+ }
}
/* so, nuke it! */
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index e9e4bafb01..ddc8014e55 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -434,6 +434,12 @@ extractPageInfo(XLogReaderState *record)
* empty so we don't need to bother the content.
*/
}
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ /*
+ * We can safely ignore these. These don't make any on-disk changes.
+ */
+ }
else if (rmid == RM_XACT_ID &&
((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT ||
(rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED ||
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index a36646c6ee..847660b6af 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -62,6 +62,12 @@ typedef struct xl_smgr_mark
smgr_mark_action action;
} xl_smgr_mark;
+typedef struct xl_smgr_bufpersistence
+{
+ RelFileLocator rlocator;
+ bool persistence;
+} xl_smgr_bufpersistence;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -82,6 +88,8 @@ extern void log_smgrcreatemark(const RelFileLocator *rlocator,
ForkNumber forkNum, StorageMarks mark);
extern void log_smgrunlinkmark(const RelFileLocator *rlocator,
ForkNumber forkNum, StorageMarks mark);
+extern void log_smgrbufpersistence(const RelFileLocator rlocator,
+ bool persistence);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 0f5fb6be00..e3d9273710 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -222,6 +222,8 @@ extern void DropRelationBuffers(struct SMgrRelationData *smgr_reln,
int nforks, BlockNumber *firstDelBlock);
extern void DropRelationsAllBuffers(struct SMgrRelationData **smgr_reln,
int nlocators);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+ bool permanent, bool isRedo);
extern void DropDatabaseBuffers(Oid dbid);
#define RelationGetNumberOfBlocks(reln) \
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 057e3d3104..f72c5c8a9e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3939,6 +3939,7 @@ xl_replorigin_set
xl_restore_point
xl_running_xacts
xl_seq_rec
+xl_smgr_bufpersistence
xl_smgr_create
xl_smgr_mark
xl_smgr_truncate
--
2.25.1
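For readers skimming 0002, the switch-to-permanent branch of SetRelationBuffersPersistence() can be modelled outside the server roughly as below. This is only a sketch under stated assumptions: the SketchBuffer type, the SK_* flag values and sketch_set_permanent() are made up for illustration, and clearing the dirty bit merely stands in for FlushBuffer(). The real code works on BufferDesc entries under the buffer header lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits; the real BM_* values are defined by the server. */
#define SK_VALID     0x1u
#define SK_DIRTY     0x2u
#define SK_PERMANENT 0x4u

/* Hypothetical, heavily simplified stand-in for a buffer descriptor. */
typedef struct
{
    uint32_t relnumber;     /* relation the cached page belongs to */
    bool     is_init_fork;  /* page belongs to the init fork */
    uint32_t flags;         /* combination of SK_* bits */
    bool     in_use;        /* false once the buffer is invalidated */
} SketchBuffer;

/*
 * Mark every cached page of relation "rel" as permanent.
 * Init-fork pages are invalidated because that fork is being dropped.
 * Valid dirty pages are "flushed", modelled here by clearing SK_DIRTY.
 * Returns the number of pages that had to be flushed.
 */
static int
sketch_set_permanent(SketchBuffer *pool, size_t n, uint32_t rel)
{
    int flushed = 0;

    assert(pool != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
    {
        SketchBuffer *b = &pool[i];

        if (!b->in_use || b->relnumber != rel)
            continue;           /* not a page of the target relation */

        if (b->is_init_fork)
        {
            b->in_use = false;  /* stands in for InvalidateBuffer() */
            continue;
        }

        b->flags |= SK_PERMANENT;

        if ((b->flags & (SK_VALID | SK_DIRTY)) == (SK_VALID | SK_DIRTY))
        {
            b->flags &= ~SK_DIRTY;  /* stands in for FlushBuffer() */
            flushed++;
        }
    }
    return flushed;
}
```

The single pass is the point of the design: the flag flip and the flush share one scan of the pool instead of a separate FlushRelationBuffers() call.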
v29-0003-New-command-ALTER-TABLE-ALL-IN-TABLESPACE-SET-LO.patch (text/x-diff; charset=us-ascii)
From 429251a8d0db5cef2a21226a969ea6cc482b811d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 25 Apr 2023 16:12:23 +0900
Subject: [PATCH v29 3/3] New command ALTER TABLE ALL IN TABLESPACE SET
LOGGED/UNLOGGED
Simplifies ALTER TABLE SET LOGGED/UNLOGGED invocation by allowing
users to specify relations based on tablespace or owner.
---
doc/src/sgml/ref/alter_table.sgml | 15 +++
src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++
src/backend/parser/gram.y | 42 +++++++
src/backend/tcop/utility.c | 11 ++
src/include/commands/tablecmds.h | 2 +
src/include/nodes/parsenodes.h | 10 ++
src/test/regress/expected/tablespace.out | 76 ++++++++++++
src/test/regress/sql/tablespace.sql | 41 +++++++
src/tools/pgindent/typedefs.list | 1 +
9 files changed, 338 insertions(+)
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index d4d93eeb7c..7ee09ca9cf 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -33,6 +33,8 @@ ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
SET SCHEMA <replaceable class="parameter">new_schema</replaceable>
ALTER TABLE ALL IN TABLESPACE <replaceable class="parameter">name</replaceable> [ OWNED BY <replaceable class="parameter">role_name</replaceable> [, ... ] ]
SET TABLESPACE <replaceable class="parameter">new_tablespace</replaceable> [ NOWAIT ]
+ALTER TABLE ALL IN TABLESPACE <replaceable class="parameter">name</replaceable> [ OWNED BY <replaceable class="parameter">role_name</replaceable> [, ... ] ]
+ SET { LOGGED | UNLOGGED } [ NOWAIT ]
ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
ATTACH PARTITION <replaceable class="parameter">partition_name</replaceable> { FOR VALUES <replaceable class="parameter">partition_bound_spec</replaceable> | DEFAULT }
ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable>
@@ -769,6 +771,19 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM
(for identity or serial columns). However, it is also possible to
change the persistence of such sequences separately.
</para>
+ <para>
+ All tables in the current database in a tablespace can be changed by
+ using the <literal>ALL IN TABLESPACE</literal> form, which will first
+ lock all tables to be changed and then change each one. This form also
+ supports
+ <literal>OWNED BY</literal>, which will only change tables owned by the
+ specified roles. If the <literal>NOWAIT</literal> option is specified,
+ then the command will fail if it is unable to immediately acquire all of
+ the locks required. The <literal>information_schema</literal> relations
+ are not considered part of the system catalogs and will be changed. See
+ also
+ <link linkend="sql-createtablespace"><command>CREATE TABLESPACE</command></link>.
+ </para>
</listitem>
</varlistentry>
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 3192a97b0e..24c6dd2aeb 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14966,6 +14966,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
return new_tablespaceoid;
}
+/*
+ * Alter Table ALL ... SET LOGGED/UNLOGGED
+ *
+ * Allows a user to modify the persistence of all objects in a specific
+ * tablespace in the current database. Objects can be filtered by owner,
+ * enabling users to update the persistence of only their objects. The primary
+ * permission handling is managed by the lower-level change persistence
+ * function.
+ *
+ * All objects to be modified are locked first. If NOWAIT is specified and the
+ * lock can't be acquired, an ERROR is thrown.
+ */
+void
+AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt * stmt)
+{
+ List *relations = NIL;
+ ListCell *l;
+ ScanKeyData key[1];
+ Relation rel;
+ TableScanDesc scan;
+ HeapTuple tuple;
+ Oid tablespaceoid;
+ List *role_oids = roleSpecsToIds(stmt->roles);
+
+ /* Ensure we were not asked to change something we can't */
+ if (stmt->objtype != OBJECT_TABLE)
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("only tables can be specified"));
+
+ /* Get the tablespace OID */
+ tablespaceoid = get_tablespace_oid(stmt->tablespacename, false);
+
+ /*
+ * Now that the checks are done, check if we should set either to
+ * InvalidOid because it is our database's default tablespace.
+ */
+ if (tablespaceoid == MyDatabaseTableSpace)
+ tablespaceoid = InvalidOid;
+
+ /*
+ * Walk the list of objects in the tablespace to pick them up. This will
+ * only find objects in our database, of course.
+ */
+ ScanKeyInit(&key[0],
+ Anum_pg_class_reltablespace,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(tablespaceoid));
+
+ rel = table_open(RelationRelationId, AccessShareLock);
+ scan = table_beginscan_catalog(rel, 1, key);
+ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+ {
+ Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple);
+ Oid relOid = relForm->oid;
+
+ /*
+ * Do not pick-up objects in pg_catalog as part of this, if an admin
+ * really wishes to do so, they can issue the individual ALTER
+ * commands directly.
+ *
+ * Also, explicitly avoid any shared tables, temp tables, or TOAST
+ * (TOAST will be changed with the main table).
+ */
+ if (IsCatalogNamespace(relForm->relnamespace) ||
+ relForm->relisshared ||
+ isAnyTempNamespace(relForm->relnamespace) ||
+ IsToastNamespace(relForm->relnamespace))
+ continue;
+
+ /* Only pick up the object type requested */
+ if (relForm->relkind != RELKIND_RELATION)
+ continue;
+
+ /* Check if we are only picking-up objects owned by certain roles */
+ if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner))
+ continue;
+
+ /*
+ * Handle permissions-checking here since we are locking the tables
+ * and also to avoid doing a bunch of work only to fail part-way. Note
+ * that permissions will also be checked by AlterTableInternal().
+ *
+ * Caller must be considered an owner on the table of which we're
+ * going to change persistence.
+ */
+ if (!object_ownercheck(RelationRelationId, relOid, GetUserId()))
+ aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)),
+ NameStr(relForm->relname));
+
+ if (stmt->nowait &&
+ !ConditionalLockRelationOid(relOid, AccessExclusiveLock))
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_IN_USE),
+ errmsg("aborting because lock on relation \"%s.%s\" is not available",
+ get_namespace_name(relForm->relnamespace),
+ NameStr(relForm->relname)));
+ else
+ LockRelationOid(relOid, AccessExclusiveLock);
+
+ /*
+ * Add to our list of objects of which we're going to change
+ * persistence.
+ */
+ relations = lappend_oid(relations, relOid);
+ }
+
+ table_endscan(scan);
+ table_close(rel, AccessShareLock);
+
+ if (relations == NIL)
+ ereport(NOTICE,
+ errcode(ERRCODE_NO_DATA_FOUND),
+ errmsg("no matching relations in tablespace \"%s\" found",
+ tablespaceoid == InvalidOid ? "(database default)" :
+ get_tablespace_name(tablespaceoid)));
+
+ /*
+ * Everything is locked, loop through and change persistence of all of the
+ * relations.
+ */
+ foreach(l, relations)
+ {
+ List *cmds = NIL;
+ AlterTableCmd *cmd = makeNode(AlterTableCmd);
+
+ if (stmt->logged)
+ cmd->subtype = AT_SetLogged;
+ else
+ cmd->subtype = AT_SetUnLogged;
+
+ cmds = lappend(cmds, cmd);
+
+ EventTriggerAlterTableStart((Node *) stmt);
+ /* OID is set by AlterTableInternal */
+ AlterTableInternal(lfirst_oid(l), cmds, false);
+ EventTriggerAlterTableEnd();
+ }
+}
+
static void
index_copy_data(Relation rel, RelFileLocator newrlocator)
{
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index b3bdf947b6..5b2ae92608 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -2110,6 +2110,48 @@ AlterTableStmt:
n->nowait = $13;
$$ = (Node *) n;
}
+ | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = true;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name OWNED BY role_list SET LOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->roles = $9;
+ n->logged = true;
+ n->nowait = $12;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->logged = false;
+ n->nowait = $9;
+ $$ = (Node *)n;
+ }
+ | ALTER TABLE ALL IN_P TABLESPACE name OWNED BY role_list SET UNLOGGED opt_nowait
+ {
+ AlterTableSetLoggedAllStmt *n =
+ makeNode(AlterTableSetLoggedAllStmt);
+ n->tablespacename = $6;
+ n->objtype = OBJECT_TABLE;
+ n->roles = $9;
+ n->logged = false;
+ n->nowait = $12;
+ $$ = (Node *)n;
+ }
| ALTER INDEX qualified_name alter_table_cmds
{
AlterTableStmt *n = makeNode(AlterTableStmt);
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index e3ccf6c7f7..fcf550c839 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -164,6 +164,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
case T_AlterTSConfigurationStmt:
case T_AlterTSDictionaryStmt:
case T_AlterTableMoveAllStmt:
+ case T_AlterTableSetLoggedAllStmt:
case T_AlterTableSpaceOptionsStmt:
case T_AlterTableStmt:
case T_AlterTypeStmt:
@@ -1771,6 +1772,12 @@ ProcessUtilitySlow(ParseState *pstate,
commandCollected = true;
break;
+ case T_AlterTableSetLoggedAllStmt:
+ AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree);
+ /* commands are stashed in AlterTableSetLoggedAll */
+ commandCollected = true;
+ break;
+
case T_DropStmt:
ExecDropStmt((DropStmt *) parsetree, isTopLevel);
/* no commands stashed for DROP */
@@ -2699,6 +2706,10 @@ CreateCommandTag(Node *parsetree)
tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype);
break;
+ case T_AlterTableSetLoggedAllStmt:
+ tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype);
+ break;
+
case T_AlterTableStmt:
tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype);
break;
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index 16b6126669..28ef0dc8c0 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
+extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt * stmt);
+
extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
Oid *oldschema);
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 2565348303..373c8c5650 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2681,6 +2681,16 @@ typedef struct AlterTableMoveAllStmt
bool nowait;
} AlterTableMoveAllStmt;
+typedef struct AlterTableSetLoggedAllStmt
+{
+ NodeTag type;
+ char *tablespacename;
+ ObjectType objtype; /* Object type to move */
+ List *roles; /* List of roles to change objects of */
+ bool logged;
+ bool nowait;
+} AlterTableSetLoggedAllStmt;
+
/* ----------------------
* Create/Alter Extension Statements
* ----------------------
diff --git a/src/test/regress/expected/tablespace.out b/src/test/regress/expected/tablespace.out
index 9aabb85349..35b150b297 100644
--- a/src/test/regress/expected/tablespace.out
+++ b/src/test/regress/expected/tablespace.out
@@ -964,5 +964,81 @@ drop cascades to table testschema.part
drop cascades to table testschema.atable
drop cascades to materialized view testschema.amv
drop cascades to table testschema.tablespace_acl
+--
+-- Check persistence change in a tablespace
+CREATE SCHEMA testschema;
+GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1;
+CREATE TABLESPACE regress_tablespace LOCATION '';
+GRANT CREATE ON TABLESPACE regress_tablespace TO regress_tablespace_user1;
+CREATE TABLE testschema.lsu(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.usu(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lsu(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._usu(a int) TABLESPACE pg_default;
+SET ROLE regress_tablespace_user1;
+CREATE TABLE testschema.lu1(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.uu1(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lu1(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._uu1(a int) TABLESPACE pg_default;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | p
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | p
+ uu1 | regress_tablespace | u
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace
+ OWNED BY regress_tablespace_user1 SET LOGGED;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | p
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | p
+ uu1 | regress_tablespace | p
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+RESET ROLE;
+ALTER TABLE ALL IN TABLESPACE regress_tablespace SET UNLOGGED;
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+ relname | spcname | relpersistence
+---------+--------------------+----------------
+ lsu | regress_tablespace | u
+ usu | regress_tablespace | u
+ lu1 | regress_tablespace | u
+ uu1 | regress_tablespace | u
+ _lsu | | p
+ _usu | | u
+ _lu1 | | p
+ _uu1 | | u
+(8 rows)
+
+-- Should succeed
+DROP SCHEMA testschema CASCADE;
+NOTICE: drop cascades to 8 other objects
+DETAIL: drop cascades to table testschema.lsu
+drop cascades to table testschema.usu
+drop cascades to table testschema._lsu
+drop cascades to table testschema._usu
+drop cascades to table testschema.lu1
+drop cascades to table testschema.uu1
+drop cascades to table testschema._lu1
+drop cascades to table testschema._uu1
+DROP TABLESPACE regress_tablespace;
DROP ROLE regress_tablespace_user1;
DROP ROLE regress_tablespace_user2;
diff --git a/src/test/regress/sql/tablespace.sql b/src/test/regress/sql/tablespace.sql
index d274d9615e..eb8e247a1d 100644
--- a/src/test/regress/sql/tablespace.sql
+++ b/src/test/regress/sql/tablespace.sql
@@ -429,5 +429,46 @@ DROP TABLESPACE regress_tblspace_renamed;
DROP SCHEMA testschema CASCADE;
+
+--
+-- Check persistence change in a tablespace
+CREATE SCHEMA testschema;
+GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1;
+CREATE TABLESPACE regress_tablespace LOCATION '';
+GRANT CREATE ON TABLESPACE regress_tablespace TO regress_tablespace_user1;
+
+CREATE TABLE testschema.lsu(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.usu(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lsu(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._usu(a int) TABLESPACE pg_default;
+SET ROLE regress_tablespace_user1;
+CREATE TABLE testschema.lu1(a int) TABLESPACE regress_tablespace;
+CREATE UNLOGGED TABLE testschema.uu1(a int) TABLESPACE regress_tablespace;
+CREATE TABLE testschema._lu1(a int) TABLESPACE pg_default;
+CREATE UNLOGGED TABLE testschema._uu1(a int) TABLESPACE pg_default;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace
+ OWNED BY regress_tablespace_user1 SET LOGGED;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+RESET ROLE;
+
+ALTER TABLE ALL IN TABLESPACE regress_tablespace SET UNLOGGED;
+
+SELECT relname, t.spcname, relpersistence
+ FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid)
+ WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid;
+
+-- Should succeed
+DROP SCHEMA testschema CASCADE;
+DROP TABLESPACE regress_tablespace;
+
DROP ROLE regress_tablespace_user1;
DROP ROLE regress_tablespace_user2;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index f72c5c8a9e..8a6c101bdf 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -97,6 +97,7 @@ AlterTSConfigurationStmt
AlterTSDictionaryStmt
AlterTableCmd
AlterTableMoveAllStmt
+AlterTableSetLoggedAllStmt
AlterTableSpaceOptionsStmt
AlterTableStmt
AlterTableType
--
2.25.1
Thank you for looking at this!
At Mon, 14 Aug 2023 12:38:48 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in
I think there are some good ideas here. I started to take a look at the
patches, and I've attached a rebased version of the patch set. Apologies
if I am repeating any discussions from upthread.
First, I tested the time difference in ALTER TABLE SET UNLOGGED/LOGGED with
the patch applied, and the results looked pretty impressive.
before:
postgres=# alter table test set unlogged;
ALTER TABLE
Time: 5108.071 ms (00:05.108)
postgres=# alter table test set logged;
ALTER TABLE
Time: 6747.648 ms (00:06.748)
after:
postgres=# alter table test set unlogged;
ALTER TABLE
Time: 25.609 ms
postgres=# alter table test set logged;
ALTER TABLE
Time: 1241.800 ms (00:01.242)
Thanks for the confirmation. The two directions differ because
making a table logged requires emitting WAL records for its entire
content.
My first question is whether 0001 is a prerequisite to 0002. I'm assuming
it is, but the reason wasn't immediately obvious to me. If it's just
In 0002, if a backend crashes after creating an init fork file but
before the associated commit, a lingering fork file could result in
data loss on the next startup. Thus, an utterly reliable file cleanup
mechanism is essential. 0001 also addresses the orphan storage files
issue arising from ALTER TABLE and similar commands.
nice-to-have, perhaps we could simplify the patch set a bit. I see that
Heikki had some general concerns with the marker file approach [0], so
perhaps it is at least worth brainstorming some alternatives if we _do_
need it.
The rationale behind the file-based implementation is that any
leftover init fork file from a crash needs to be deleted before the
reinit(INIT) process kicks in, which happens independently of WAL,
before the start of crash recovery. I could implement it separately
from the reinit module, but I didn't, since that would result in
nearly duplicate code.
As commented in xlog.c, the purpose of the pre-recovery reinit CLEANUP
phase is to ensure hot standbys don't encounter erroneous unlogged
relations. Based on that requirement, we need a mechanism to
guarantee that additional crucial operations are executed reliably at
the next startup post-crash, right before recovery kicks in (or reinit
CLEANUP). 0001 persists this data on a per-operation basis tightly
bonded to their target objects.
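To illustrate why a leftover init fork is dangerous, here is a minimal sketch of the CLEANUP rule as described above: any file belonging to a relation that has an init fork is removed, except the init fork itself. This is a simplified model for discussion, not the code in reinit.c. The function names are hypothetical, and it assumes file names of the form "16384", "16384_fsm", "16384_init" or "16384.1".

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Length of the leading digit run, i.e. the relfilenumber part of a name. */
static size_t
relnumber_len(const char *name)
{
    size_t      n = 0;

    while (isdigit((unsigned char) name[n]))
        n++;
    return n;
}

/* True if "name" looks like an init fork, e.g. "16384_init". */
static bool
is_init_fork(const char *name)
{
    size_t      n = relnumber_len(name);

    return n > 0 && strcmp(name + n, "_init") == 0;
}

/* True if both names start with the same relfilenumber. */
static bool
same_relation(const char *a, const char *b)
{
    size_t      na = relnumber_len(a);

    return na > 0 && na == relnumber_len(b) && strncmp(a, b, na) == 0;
}

/*
 * Simplified model of the pre-recovery CLEANUP pass (hypothetical helper):
 * a file is removed when the directory holds an init fork for the same
 * relation and the file itself is not that init fork.
 */
bool
cleanup_would_remove(const char *name, const char *const *files, int nfiles)
{
    if (is_init_fork(name))
        return false;

    for (int i = 0; i < nfiles; i++)
    {
        if (is_init_fork(files[i]) && same_relation(name, files[i]))
            return true;
    }
    return false;
}
```

With a directory containing "16384", "16384_fsm" and a stray "16384_init" left by a crashed SET UNLOGGED, the main fork of a table that is still logged would be selected for removal. Once the stray init fork has been deleted, nothing is selected, which is why the deletion has to run before reinit.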
I could turn this into something like undo logs in a simple form, but
I'd rather not craft a general-purpose undo log system for this unless
it's absolutely necessary.
[0] /messages/by-id/9827ebd3-de2e-fd52-4091-a568387b1fc2@iki.fi
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Thu, 24 Aug 2023 11:22:32 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
I could turn this into something like undo logs in a simple form, but
I'd rather not craft a general-purpose undo log system for this unless
it's absolutely necessary.
This is a patch for a basic undo log implementation. It looks like it
works well for some orphan-files-after-a-crash and data-loss-on-reinit
cases. However, it is far from complete and likely has issues with
crash-safety and the durability of undo log files (and memory leaks
and performance and..).
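The general shape of such a log (a file header followed by appended, length-prefixed records that are read back in order at abort or restart) can be sketched as follows. This is only an illustration under loud assumptions: every name here is hypothetical, it uses stdio instead of the backend's file APIs, and it has no CRC and no fsync, so it says nothing about the durability questions mentioned above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_ULOG_MAGIC 0x554C4F47u   /* arbitrary value for this sketch */

typedef struct SketchUndoRecord
{
    uint32_t    tot_len;        /* header plus payload, in bytes */
    uint8_t     kind;           /* tells the callback what to undo */
} SketchUndoRecord;

typedef void (*sketch_undo_cb) (uint8_t kind, const void *payload,
                                uint32_t len, void *arg);

/* Start a new log: the file header is just a magic number here. */
int
sketch_ulog_init(FILE *fp)
{
    uint32_t    magic = SKETCH_ULOG_MAGIC;

    return fwrite(&magic, sizeof(magic), 1, fp) == 1 ? 0 : -1;
}

/* Append one record at the end of the log and flush the stream. */
int
sketch_ulog_append(FILE *fp, uint8_t kind, const void *payload, uint32_t len)
{
    SketchUndoRecord hdr;

    memset(&hdr, 0, sizeof(hdr));       /* also zeroes padding bytes */
    hdr.tot_len = (uint32_t) sizeof(hdr) + len;
    hdr.kind = kind;

    if (fseek(fp, 0, SEEK_END) != 0 ||
        fwrite(&hdr, sizeof(hdr), 1, fp) != 1 ||
        (len > 0 && fwrite(payload, len, 1, fp) != 1))
        return -1;
    return fflush(fp) == 0 ? 0 : -1;
}

/*
 * Replay all records in write order. Returns the number of records, or -1
 * on a bad header or a torn payload. A truncated record header is treated
 * as end of log.
 */
int
sketch_ulog_replay(FILE *fp, sketch_undo_cb cb, void *arg)
{
    uint32_t    magic;
    SketchUndoRecord hdr;
    int         nrecords = 0;

    if (fseek(fp, 0, SEEK_SET) != 0 ||
        fread(&magic, sizeof(magic), 1, fp) != 1 ||
        magic != SKETCH_ULOG_MAGIC)
        return -1;

    while (fread(&hdr, sizeof(hdr), 1, fp) == 1)
    {
        uint32_t    len;
        char       *payload;

        if (hdr.tot_len < sizeof(hdr))
            return -1;
        len = hdr.tot_len - (uint32_t) sizeof(hdr);

        payload = malloc(len > 0 ? len : 1);
        if (payload == NULL)
            return -1;
        if (len > 0 && fread(payload, len, 1, fp) != 1)
        {
            free(payload);
            return -1;
        }
        cb(hdr.kind, payload, len, arg);
        free(payload);
        nrecords++;
    }
    return nrecords;
}

/* Example callback: add each payload size to the uint32_t behind arg. */
void
sketch_sum_payload(uint8_t kind, const void *payload, uint32_t len, void *arg)
{
    (void) kind;
    (void) payload;
    *(uint32_t *) arg += len;
}
```

A caller would create the file with sketch_ulog_init(), call sketch_ulog_append() once per action that might need undoing, and call sketch_ulog_replay() with a callback that performs the undo. Allocating the payload buffer per record sidesteps any need to grow a shared buffer while a pointer into it is still in use.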
I'm posting this to move the discussion forward.
(This doesn't contain the third file "ALTER TABLE ..ALL IN TABLESPACE" part.)
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v29-0001-Introduce-undo-log-implementation.patch (text/x-patch; charset=us-ascii)
From da5696b9026fa916ae991f7da616062c5b19e705 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 31 Aug 2023 11:49:10 +0900
Subject: [PATCH v29 1/2] Introduce undo log implementation
This patch adds a simple implementation of UNDO log feature.
---
src/backend/access/transam/Makefile | 1 +
src/backend/access/transam/rmgr.c | 4 +-
src/backend/access/transam/simpleundolog.c | 343 +++++++++++++++++++++
src/backend/access/transam/twophase.c | 3 +
src/backend/access/transam/xact.c | 24 ++
src/backend/access/transam/xlog.c | 20 +-
src/backend/catalog/storage.c | 171 ++++++++++
src/backend/storage/file/reinit.c | 78 +++++
src/backend/storage/smgr/smgr.c | 9 +
src/bin/initdb/initdb.c | 17 +
src/bin/pg_rewind/parsexlog.c | 2 +-
src/bin/pg_waldump/rmgrdesc.c | 2 +-
src/include/access/rmgr.h | 2 +-
src/include/access/rmgrlist.h | 44 +--
src/include/access/simpleundolog.h | 36 +++
src/include/catalog/storage.h | 3 +
src/include/catalog/storage_ulog.h | 35 +++
src/include/catalog/storage_xlog.h | 9 +
src/include/storage/reinit.h | 2 +
src/include/storage/smgr.h | 1 +
src/tools/pgindent/typedefs.list | 6 +
21 files changed, 780 insertions(+), 32 deletions(-)
create mode 100644 src/backend/access/transam/simpleundolog.c
create mode 100644 src/include/access/simpleundolog.h
create mode 100644 src/include/catalog/storage_ulog.h
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db..531505cbbd 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -21,6 +21,7 @@ OBJS = \
rmgr.o \
slru.o \
subtrans.o \
+ simpleundolog.o \
timeline.o \
transam.o \
twophase.o \
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 7d67eda5f7..840cbdecd3 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -35,8 +35,8 @@
#include "utils/relmapper.h"
/* must be kept in sync with RmgrData definition in xlog_internal.h */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
- { name, redo, desc, identify, startup, cleanup, mask, decode },
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo) \
+ { name, redo, desc, identify, startup, cleanup, mask, decode},
RmgrData RmgrTable[RM_MAX_ID + 1] = {
#include "access/rmgrlist.h"
diff --git a/src/backend/access/transam/simpleundolog.c b/src/backend/access/transam/simpleundolog.c
new file mode 100644
index 0000000000..ebbacce298
--- /dev/null
+++ b/src/backend/access/transam/simpleundolog.c
@@ -0,0 +1,343 @@
+/*-------------------------------------------------------------------------
+ *
+ * simpleundolog.c
+ * Simple implementation of PostgreSQL transaction-undo-log manager
+ *
+ * In this module, procedures required during a transaction abort are
+ * logged. Persisting this information becomes crucial, particularly for
+ * ensuring reliable post-processing during the restart following a transaction
+ * crash. At present, in this module, logging of information is performed by
+ * simply appending data to a created file.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/simpleundolog.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/simpleundolog.h"
+#include "access/twophase_rmgr.h"
+#include "access/parallel.h"
+#include "access/xact.h"
+#include "catalog/storage_ulog.h"
+#include "storage/fd.h"
+
+#define ULOG_FILE_MAGIC 0x12345678
+
+typedef struct UndoLogFileHeader
+{
+ int32 magic;
+ bool prepared;
+} UndoLogFileHeader;
+
+typedef struct UndoDescData
+{
+ const char *name;
+ void (*rm_undo) (SimpleUndoLogRecord *record, bool prepared);
+} UndoDescData;
+
+/* must be kept in sync with RmgrData definition in xlog_internal.h */
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo) \
+ { name, undo },
+
+UndoDescData UndoRoutines[RM_MAX_ID + 1] = {
+#include "access/rmgrlist.h"
+};
+#undef PG_RMGR
+
+#if defined(O_DSYNC)
+static int undo_sync_mode = O_DSYNC;
+#elif defined(O_SYNC)
+static int undo_sync_mode = O_SYNC;
+#else
+static int undo_sync_mode = 0;
+#endif
+
+static char current_ulogfile_name[MAXPGPATH];
+static int current_ulogfile_fd = -1;
+static int current_xid = InvalidTransactionId;
+static UndoLogFileHeader current_fhdr;
+
+static void
+undolog_check_file_header(void)
+{
+ if (read(current_ulogfile_fd, &current_fhdr, sizeof(current_fhdr)) < 0)
+ ereport(PANIC,
+ errcode_for_file_access(),
+ errmsg("could not read undolog file \"%s\": %m",
+ current_ulogfile_name));
+ if (current_fhdr.magic != ULOG_FILE_MAGIC)
+ ereport(PANIC,
+ errcode_for_file_access(),
+ errmsg("invalid undolog file \"%s\": magic don't match",
+ current_ulogfile_name));
+}
+
+static bool
+undolog_open_current_file(TransactionId xid, bool forread, bool append)
+{
+ int omode;
+
+ if (current_ulogfile_fd >= 0)
+ {
+ /* use existing open file */
+ if (current_xid == xid)
+ {
+ if (append)
+ return true;
+
+ if (lseek(current_ulogfile_fd,
+ sizeof(UndoLogFileHeader), SEEK_SET) < 0)
+ ereport(PANIC,
+ errcode_for_file_access(),
+ errmsg("could not seek undolog file \"%s\": %m",
+ current_ulogfile_name));
+ }
+
+ close(current_ulogfile_fd);
+ current_ulogfile_fd = -1;
+ ReleaseExternalFD();
+ }
+
+ current_xid = xid;
+ if (!TransactionIdIsValid(xid))
+ return false;
+
+ omode = PG_BINARY | undo_sync_mode;
+
+ if (forread)
+ omode |= O_RDONLY;
+ else
+ {
+ omode |= O_RDWR;
+
+ if (!append)
+ omode |= O_TRUNC;
+ }
+
+ snprintf(current_ulogfile_name, MAXPGPATH, "%s/%08x",
+ SIMPLE_UNDOLOG_DIR, xid);
+ current_ulogfile_fd = BasicOpenFile(current_ulogfile_name, omode);
+ if (current_ulogfile_fd >= 0)
+ undolog_check_file_header();
+ else
+ {
+ if (forread)
+ return false;
+
+ current_fhdr.magic = ULOG_FILE_MAGIC;
+ current_fhdr.prepared = false;
+
+ omode |= O_CREAT;
+ current_ulogfile_fd = BasicOpenFile(current_ulogfile_name, omode);
+ if (current_ulogfile_fd < 0)
+ ereport(PANIC,
+ errcode_for_file_access(),
+ errmsg("could not create undolog file \"%s\": %m",
+ current_ulogfile_name));
+
+ if (write(current_ulogfile_fd, &current_fhdr, sizeof(current_fhdr)) < 0)
+ ereport(PANIC,
+ errcode_for_file_access(),
+ errmsg("could not write undolog file \"%s\": %m",
+ current_ulogfile_name));
+ }
+
+ /*
+ * move file pointer to the end of the file. we do this not using O_APPEND,
+ * to allow us to modify data at any location in the file. We already moved
+ * to the first record in the case of !append.
+ */
+ if (append)
+ {
+ if (lseek(current_ulogfile_fd, 0, SEEK_END) < 0)
+ ereport(PANIC,
+ errcode_for_file_access(),
+ errmsg("could not seek undolog file \"%s\": %m",
+ current_ulogfile_name));
+ }
+ ReserveExternalFD();
+
+ return true;
+}
+
+/*
+ * Write ulog record
+ */
+void
+SimpleUndoLogWrite(RmgrId rmgr, uint8 info,
+ TransactionId xid, void *data, int len)
+{
+ int reclen = sizeof(SimpleUndoLogRecord) + len;
+ SimpleUndoLogRecord *rec = palloc(reclen);
+ pg_crc32c undodata_crc;
+
+ Assert(!IsParallelWorker());
+ Assert(xid != InvalidTransactionId);
+
+ undolog_open_current_file(xid, false, true);
+
+ rec->ul_tot_len = reclen;
+ rec->ul_rmid = rmgr;
+ rec->ul_info = info;
+ rec->ul_xid = current_xid;
+
+ memcpy((char *)rec + sizeof(SimpleUndoLogRecord), data, len);
+
+ /* Calculate CRC of the data */
+ INIT_CRC32C(undodata_crc);
+ COMP_CRC32C(undodata_crc, rec,
+ reclen - offsetof(SimpleUndoLogRecord, ul_rmid));
+ rec->ul_crc = undodata_crc;
+
+
+ if (write(current_ulogfile_fd, rec, reclen) < 0)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not write to undolog file \"%s\": %m",
+ current_ulogfile_name));
+}
+
+static void
+SimpleUndoLogUndo(bool cleanup)
+{
+ int bufsize;
+ char *buf;
+
+ bufsize = 1024;
+ buf = palloc(bufsize);
+
+ Assert(current_ulogfile_fd >= 0);
+
+ while (read(current_ulogfile_fd, buf, sizeof(SimpleUndoLogRecord)) ==
+ sizeof(SimpleUndoLogRecord))
+ {
+ SimpleUndoLogRecord *rec = (SimpleUndoLogRecord *) buf;
+ int readlen = rec->ul_tot_len - sizeof(SimpleUndoLogRecord);
+ int ret;
+
+ if (rec->ul_tot_len > bufsize)
+ {
+ bufsize *= 2;
+ buf = repalloc(buf, bufsize);
+ }
+
+ ret = read(current_ulogfile_fd,
+ buf + sizeof(SimpleUndoLogRecord), readlen);
+ if (ret != readlen)
+ {
+ if (ret < 0)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not read undo log file \"%s\": %m",
+ current_ulogfile_name));
+
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("reading undo log expected %d bytes, but actually %d: %s",
+ readlen, ret, current_ulogfile_name));
+
+ }
+
+ UndoRoutines[rec->ul_rmid].rm_undo(rec,
+ current_fhdr.prepared && cleanup);
+ }
+}
+
+void
+AtEOXact_SimpleUndoLog(bool isCommit, TransactionId xid)
+{
+ if (IsParallelWorker())
+ return;
+
+ if (!undolog_open_current_file(xid, true, false))
+ return;
+
+ if (!isCommit)
+ SimpleUndoLogUndo(false);
+
+ if (current_ulogfile_fd > 0)
+ {
+ if (close(current_ulogfile_fd) != 0)
+ ereport(PANIC, errcode_for_file_access(),
+ errmsg("could not close file \"%s\": %m",
+ current_ulogfile_name));
+
+ current_ulogfile_fd = -1;
+ ReleaseExternalFD();
+ durable_unlink(current_ulogfile_name, FATAL);
+ }
+
+ return;
+}
+
+void
+UndoLogCleanup(void)
+{
+ DIR *dirdesc;
+ struct dirent *de;
+ char **loglist;
+ int loglistspace = 128;
+ int loglistlen = 0;
+ int i;
+
+ loglist = palloc(sizeof(char*) * loglistspace);
+
+ dirdesc = AllocateDir(SIMPLE_UNDOLOG_DIR);
+ while ((de = ReadDir(dirdesc, SIMPLE_UNDOLOG_DIR)) != NULL)
+ {
+ if (strspn(de->d_name, "01234567890abcdef") < strlen(de->d_name))
+ continue;
+
+ if (loglistlen >= loglistspace)
+ {
+ loglistspace *= 2;
+ loglist = repalloc(loglist, sizeof(char*) * loglistspace);
+ }
+ loglist[loglistlen++] = pstrdup(de->d_name);
+ }
+
+ for (i = 0 ; i < loglistlen ; i++)
+ {
+ snprintf(current_ulogfile_name, MAXPGPATH, "%s/%s",
+ SIMPLE_UNDOLOG_DIR, loglist[i]);
+ current_ulogfile_fd = BasicOpenFile(current_ulogfile_name,
+ O_RDWR | PG_BINARY |
+ undo_sync_mode);
+ undolog_check_file_header();
+ SimpleUndoLogUndo(true);
+ if (close(current_ulogfile_fd) != 0)
+ ereport(PANIC, errcode_for_file_access(),
+ errmsg("could not close file \"%s\": %m",
+ current_ulogfile_name));
+ current_ulogfile_fd = -1;
+
+ /* do not remove ulog files for prepared transactions */
+ if (!current_fhdr.prepared)
+ durable_unlink(current_ulogfile_name, FATAL);
+ }
+}
+
+void
+SimpleUndoLogSetPrpared(TransactionId xid, bool prepared)
+{
+ Assert(xid != InvalidTransactionId);
+
+ undolog_open_current_file(xid, false, true);
+ current_fhdr.prepared = prepared;
+ if (lseek(current_ulogfile_fd, 0, SEEK_SET) < 0)
+ ereport(PANIC,
+ errcode_for_file_access(),
+ errmsg("could not seek undolog file \"%s\": %m",
+ current_ulogfile_name));
+
+ if (write(current_ulogfile_fd, &current_fhdr, sizeof(current_fhdr)) < 0)
+ ereport(PANIC,
+ errcode_for_file_access(),
+ errmsg("could not write undolog file \"%s\": %m",
+ current_ulogfile_name));
+}
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index c6af8cfd7e..a32ec28eb0 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -78,6 +78,7 @@
#include "access/commit_ts.h"
#include "access/htup_details.h"
+#include "access/simpleundolog.h"
#include "access/subtrans.h"
#include "access/transam.h"
#include "access/twophase.h"
@@ -1565,6 +1566,8 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
abortstats,
gid);
+ AtEOXact_SimpleUndoLog(isCommit, xid);
+
ProcArrayRemove(proc, latestXid);
/*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 8daaa535ed..8bbe8fdb08 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -24,6 +24,7 @@
#include "access/multixact.h"
#include "access/parallel.h"
#include "access/subtrans.h"
+#include "access/simpleundolog.h"
#include "access/transam.h"
#include "access/twophase.h"
#include "access/xact.h"
@@ -2224,6 +2225,9 @@ CommitTransaction(void)
*/
smgrDoPendingSyncs(true, is_parallel_worker);
+ /* Likewise perform uncommitted storage file deletion. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2365,6 +2369,7 @@ CommitTransaction(void)
AtEOXact_on_commit_actions(true);
AtEOXact_Namespace(true, is_parallel_worker);
AtEOXact_SMgr();
+ AtEOXact_SimpleUndoLog(true, GetCurrentTransactionIdIfAny());
AtEOXact_Files(true);
AtEOXact_ComboCid();
AtEOXact_HashTables(true);
@@ -2475,6 +2480,9 @@ PrepareTransaction(void)
*/
smgrDoPendingSyncs(true, false);
+ /* Likewise perform uncommitted storage file deletion. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2799,6 +2807,7 @@ AbortTransaction(void)
AfterTriggerEndXact(false); /* 'false' means it's abort */
AtAbort_Portals();
smgrDoPendingSyncs(false, is_parallel_worker);
+ smgrDoPendingCleanups(false);
AtEOXact_LargeObject(false);
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
@@ -2866,6 +2875,7 @@ AbortTransaction(void)
AtEOXact_on_commit_actions(false);
AtEOXact_Namespace(false, is_parallel_worker);
AtEOXact_SMgr();
+ AtEOXact_SimpleUndoLog(false, GetCurrentTransactionIdIfAny());
AtEOXact_Files(false);
AtEOXact_ComboCid();
AtEOXact_HashTables(false);
@@ -5002,6 +5012,8 @@ CommitSubTransaction(void)
AtEOSubXact_Inval(true);
AtSubCommit_smgr();
+ AtEOXact_SimpleUndoLog(true, GetCurrentTransactionIdIfAny());
+
/*
* The only lock we actually release here is the subtransaction XID lock.
*/
@@ -5181,6 +5193,7 @@ AbortSubTransaction(void)
RESOURCE_RELEASE_AFTER_LOCKS,
false, false);
AtSubAbort_smgr();
+ AtEOXact_SimpleUndoLog(false, GetCurrentTransactionIdIfAny());
AtEOXact_GUC(false, s->gucNestLevel);
AtEOSubXact_SPI(false, s->subTransactionId);
@@ -5660,7 +5673,10 @@ XactLogCommitRecord(TimestampTz commit_time,
if (!TransactionIdIsValid(twophase_xid))
info = XLOG_XACT_COMMIT;
else
+ {
+ elog(LOG, "COMMIT PREPARED: %d", twophase_xid);
info = XLOG_XACT_COMMIT_PREPARED;
+ }
/* First figure out and collect all the information needed */
@@ -6060,6 +6076,8 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
DropRelationFiles(parsed->xlocators, parsed->nrels, true);
}
+ AtEOXact_SimpleUndoLog(true, xid);
+
if (parsed->nstats > 0)
{
/* see equivalent call for relations above */
@@ -6171,6 +6189,8 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
DropRelationFiles(parsed->xlocators, parsed->nrels, true);
}
+ AtEOXact_SimpleUndoLog(false, xid);
+
if (parsed->nstats > 0)
{
/* see equivalent call for relations above */
@@ -6236,6 +6256,10 @@ xact_redo(XLogReaderState *record)
}
else if (info == XLOG_XACT_PREPARE)
{
+ xl_xact_prepare *xlrec = (xl_xact_prepare *) XLogRecGetData(record);
+
+ AtEOXact_SimpleUndoLog(true, xlrec->xid);
+
/*
* Store xid and start/end pointers of the WAL record in TwoPhaseState
* gxact entry.
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f6f8adc72a..d6cb9aceec 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -51,6 +51,7 @@
#include "access/heaptoast.h"
#include "access/multixact.h"
#include "access/rewriteheap.h"
+#include "access/simpleundolog.h"
#include "access/subtrans.h"
#include "access/timeline.h"
#include "access/transam.h"
@@ -5385,6 +5386,12 @@ StartupXLOG(void)
/* Check that the GUCs used to generate the WAL allow recovery */
CheckRequiredParameterValues();
+ /*
+ * Perform undo processing. This must be done before resetting unlogged
+ * relations.
+ */
+ UndoLogCleanup();
+
/*
* We're in recovery, so unlogged relations may be trashed and must be
* reset. This should be done BEFORE allowing Hot Standby
@@ -5530,14 +5537,17 @@ StartupXLOG(void)
}
/*
- * Reset unlogged relations to the contents of their INIT fork. This is
- * done AFTER recovery is complete so as to include any unlogged relations
- * created during recovery, but BEFORE recovery is marked as having
- * completed successfully. Otherwise we'd not retry if any of the post
- * end-of-recovery steps fail.
+ * Process undo logs left after recovery, then reset unlogged relations to
+ * the contents of their INIT fork. This is done AFTER recovery is complete
+ * so as to include any file creations during recovery, but BEFORE recovery
+ * is marked as having completed successfully. Otherwise we'd not retry if
+ * any of the post end-of-recovery steps fail.
*/
if (InRecovery)
+ {
+ UndoLogCleanup();
ResetUnloggedRelations(UNLOGGED_RELATION_INIT);
+ }
/*
* Pre-scan prepared transactions to find out the range of XIDs present.
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 2add053489..1778801bbd 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,16 +19,20 @@
#include "postgres.h"
+#include "access/amapi.h"
#include "access/parallel.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
#include "access/xlogutils.h"
+#include "access/simpleundolog.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+#include "catalog/storage_ulog.h"
#include "miscadmin.h"
#include "storage/freespace.h"
+#include "storage/reinit.h"
#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
@@ -66,6 +70,19 @@ typedef struct PendingRelDelete
struct PendingRelDelete *next; /* linked-list link */
} PendingRelDelete;
+#define PCOP_UNLINK_FORK (1 << 0)
+
+typedef struct PendingCleanup
+{
+ RelFileLocator rlocator; /* relation that need a cleanup */
+ int op; /* operation mask */
+ ForkNumber unlink_forknum; /* forknum to unlink */
+ BackendId backend; /* InvalidBackendId if not a temp rel */
+ bool atCommit; /* T=delete at commit; F=delete at abort */
+ int nestLevel; /* xact nesting level of request */
+ struct PendingCleanup *next; /* linked-list link */
+} PendingCleanup;
+
typedef struct PendingRelSync
{
RelFileLocator rlocator;
@@ -73,6 +90,7 @@ typedef struct PendingRelSync
} PendingRelSync;
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingCleanup * pendingCleanups = NULL; /* head of linked list */
static HTAB *pendingSyncHash = NULL;
@@ -148,6 +166,19 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
srel = smgropen(rlocator, backend);
smgrcreate(srel, MAIN_FORKNUM, false);
+ /* Write undo log; this is required regardless of needs_wal */
+ if (register_delete)
+ {
+ ul_uncommitted_storage ul_storage;
+
+ ul_storage.rlocator = rlocator;
+ ul_storage.forknum = MAIN_FORKNUM;
+ ul_storage.remove = true;
+ SimpleUndoLogWrite(RM_SMGR_ID, ULOG_SMGR_UNCOMMITED_STORAGE,
+ GetCurrentTransactionId(),
+ &ul_storage, sizeof(ul_storage));
+ }
+
if (needs_wal)
log_smgrcreate(&srel->smgr_rlocator.locator, MAIN_FORKNUM);
@@ -191,12 +222,32 @@ log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum)
*/
xlrec.rlocator = *rlocator;
xlrec.forkNum = forkNum;
+ xlrec.xid = GetTopTransactionId();
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, sizeof(xlrec));
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileLocator *rlocator, ForkNumber forkNum)
+{
+ xl_smgr_unlink xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file unlink.
+ */
+ xlrec.rlocator = *rlocator;
+ xlrec.forkNum = forkNum;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -711,6 +762,75 @@ smgrDoPendingDeletes(bool isCommit)
}
}
+/*
+ * smgrDoPendingCleanups() -- Clean up work that emits WAL records
+ *
+ * The operations handled in this function emit WAL records, which must be
+ * part of the current transaction.
+ */
+void
+smgrDoPendingCleanups(bool isCommit)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (pending->nestLevel < nestLevel)
+ {
+ /* outer-level entries should not be processed yet */
+ prev = pending;
+ }
+ else
+ {
+ /* unlink list entry first, so we don't retry on failure */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ /* do cleanup if called for */
+ if (pending->atCommit == isCommit)
+ {
+ SMgrRelation srel;
+
+ srel = smgropen(pending->rlocator, pending->backend);
+
+ Assert((pending->op & ~(PCOP_UNLINK_FORK)) == 0);
+
+ if (pending->op & PCOP_UNLINK_FORK)
+ {
+ BlockNumber firstblock = 0;
+
+ /*
+ * Unlink the fork file. Currently this operation is
+ * applied only to init-forks. As it is not certain that
+ * the init-fork is not loaded on shared buffers, drop all
+ * buffers for it.
+ */
+ Assert(pending->unlink_forknum == INIT_FORKNUM);
+ DropRelationBuffers(srel, &pending->unlink_forknum, 1,
+ &firstblock);
+
+ /* Don't emit WAL during recovery. */
+ if (!InRecovery)
+ log_smgrunlink(&pending->rlocator,
+ pending->unlink_forknum);
+ smgrunlink(srel, pending->unlink_forknum, false);
+ }
+ }
+
+ /* must explicitly free the list entry */
+ pfree(pending);
+ /* prev does not change */
+ }
+ }
+}
+
/*
* smgrDoPendingSyncs() -- Take care of relation syncs at end of xact.
*/
@@ -920,6 +1040,9 @@ PostPrepare_smgr(void)
/* must explicitly free the list entry */
pfree(pending);
}
+
+ /* Mark undolog as prepared */
+ SimpleUndoLogSetPrpared(GetCurrentTransactionId(), true);
}
@@ -967,10 +1090,28 @@ smgr_redo(XLogReaderState *record)
{
xl_smgr_create *xlrec = (xl_smgr_create *) XLogRecGetData(record);
SMgrRelation reln;
+ ul_uncommitted_storage ul_storage;
+
+ /* write undo log */
+ ul_storage.rlocator = xlrec->rlocator;
+ ul_storage.forknum = xlrec->forkNum;
+ ul_storage.remove = true;
+ SimpleUndoLogWrite(RM_SMGR_ID, ULOG_SMGR_UNCOMMITED_STORAGE,
+ xlrec->xid,
+ &ul_storage, sizeof(ul_storage));
reln = smgropen(xlrec->rlocator, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rlocator, InvalidBackendId);
+ smgrunlink(reln, xlrec->forkNum, true);
+ smgrclose(reln);
+ }
else if (info == XLOG_SMGR_TRUNCATE)
{
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
@@ -1062,3 +1203,33 @@ smgr_redo(XLogReaderState *record)
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
+
+void
+smgr_undo(SimpleUndoLogRecord *record, bool crash_prepared)
+{
+ uint8 info = record->ul_info;
+
+
+ if (info == ULOG_SMGR_UNCOMMITED_STORAGE)
+ {
+ ul_uncommitted_storage *ul_storage =
+ (ul_uncommitted_storage *) ULogRecGetData(record);
+
+ if (!crash_prepared)
+ {
+ SMgrRelation reln;
+
+ reln = smgropen(ul_storage->rlocator, InvalidBackendId);
+ smgrunlink(reln, ul_storage->forknum, true);
+ smgrclose(reln);
+ }
+ else
+ {
+ /* Inform reinit to ignore this file during cleanup */
+ ResetUnloggedRelationIgnore(ul_storage->rlocator);
+ }
+
+ }
+ else
+ elog(PANIC, "smgr_undo: unknown op code %u", info);
+}
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index fb55371b1b..d302feadb1 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -34,6 +34,39 @@ typedef struct
Oid reloid; /* hash key */
} unlogged_relation_entry;
+static char **ignore_files = NULL;
+static int nignore_elems = 0;
+static int nignore_files = 0;
+
+/*
+ * Identify files that should be ignored while resetting unlogged relations.
+ */
+static bool
+reinit_ignore_file(const char *dirname, const char *name)
+{
+ char fnamebuf[MAXPGPATH];
+ int len;
+
+ if (nignore_files == 0)
+ return false;
+
+ strncpy(fnamebuf, dirname, MAXPGPATH - 1);
+ strncat(fnamebuf, "/", MAXPGPATH - 1);
+ strncat(fnamebuf, name, MAXPGPATH - 1);
+ fnamebuf[MAXPGPATH - 1] = 0;
+
+ for (int i = 0 ; i < nignore_files ; i++)
+ {
+ /* match ignoring fork part */
+ len = strlen(ignore_files[i]);
+ if (strncmp(fnamebuf, ignore_files[i], len) == 0 &&
+ (fnamebuf[len] == 0 || fnamebuf[len] == '_'))
+ return true;
+ }
+
+ return false;
+}
+
/*
* Reset unlogged relations from before the last restart.
*
@@ -203,6 +236,10 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
&forkNum))
continue;
+ /* Skip anything that undo log suggested to ignore */
+ if (reinit_ignore_file(dbspacedirname, de->d_name))
+ continue;
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -243,6 +280,10 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
&forkNum))
continue;
+ /* Skip anything that undo log suggested to ignore */
+ if (reinit_ignore_file(dbspacedirname, de->d_name))
+ continue;
+
/* We never remove the init fork. */
if (forkNum == INIT_FORKNUM)
continue;
@@ -295,6 +336,10 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
&forkNum))
continue;
+ /* Skip anything that undo log suggested to ignore */
+ if (reinit_ignore_file(dbspacedirname, de->d_name))
+ continue;
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -337,6 +382,10 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
&forkNum))
continue;
+ /* Skip anything that undo log suggested to ignore */
+ if (reinit_ignore_file(dbspacedirname, de->d_name))
+ continue;
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -365,6 +414,35 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
}
}
+/*
+ * Record relfilenodes that should be left alone during reinitializing unlogged
+ * relations.
+ */
+void
+ResetUnloggedRelationIgnore(RelFileLocator rloc)
+{
+ RelFileLocatorBackend rbloc;
+
+ if (nignore_files >= nignore_elems)
+ {
+ if (ignore_files == NULL)
+ {
+ nignore_elems = 16;
+ ignore_files = palloc(sizeof(char *) * nignore_elems);
+ }
+ else
+ {
+ nignore_elems *= 2;
+ ignore_files = repalloc(ignore_files,
+ sizeof(char *) * nignore_elems);
+ }
+ }
+
+ rbloc.backend = InvalidBackendId;
+ rbloc.locator = rloc;
+ ignore_files[nignore_files++] = relpath(rbloc, MAIN_FORKNUM);
+}
+
/*
* Basic parsing of putative relation filenames.
*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 5d0f3d515c..92945c32c3 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -723,6 +723,15 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+/*
+ * smgrunlink() -- unlink the storage file
+ */
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rlocator, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 905b979947..c0938bdf3a 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -303,6 +303,7 @@ void setup_signals(void);
void setup_text_search(void);
void create_data_directory(void);
void create_xlog_or_symlink(void);
+void create_ulog(void);
void warn_on_mount_point(int error);
void initialize_data_directory(void);
@@ -2938,6 +2939,21 @@ create_xlog_or_symlink(void)
free(subdirloc);
}
+/* Create undo log directory */
+void
+create_ulog(void)
+{
+ char *subdirloc;
+
+ /* form name of the place for the subdirectory */
+ subdirloc = psprintf("%s/pg_ulog", pg_data);
+
+ if (mkdir(subdirloc, pg_dir_create_mode) < 0)
+ pg_fatal("could not create directory \"%s\": %m",
+ subdirloc);
+
+ free(subdirloc);
+}
void
warn_on_mount_point(int error)
@@ -2972,6 +2988,7 @@ initialize_data_directory(void)
create_data_directory();
create_xlog_or_symlink();
+ create_ulog();
/* Create required subdirectories (other than pg_wal) */
printf(_("creating subdirectories ... "));
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 27782237d0..87b4659e27 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -28,7 +28,7 @@
* RmgrNames is an array of the built-in resource manager names, to make error
* messages a bit nicer.
*/
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo) \
name,
static const char *RmgrNames[RM_MAX_ID + 1] = {
diff --git a/src/bin/pg_waldump/rmgrdesc.c b/src/bin/pg_waldump/rmgrdesc.c
index 6b8c17bb4c..a21009c5b8 100644
--- a/src/bin/pg_waldump/rmgrdesc.c
+++ b/src/bin/pg_waldump/rmgrdesc.c
@@ -32,7 +32,7 @@
#include "storage/standbydefs.h"
#include "utils/relmapper.h"
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo) \
{ name, desc, identify},
static const RmgrDescData RmgrDescTable[RM_N_BUILTIN_IDS] = {
diff --git a/src/include/access/rmgr.h b/src/include/access/rmgr.h
index 3b6a497e1b..d705de9256 100644
--- a/src/include/access/rmgr.h
+++ b/src/include/access/rmgr.h
@@ -19,7 +19,7 @@ typedef uint8 RmgrId;
* Note: RM_MAX_ID must fit in RmgrId; widening that type will affect the XLOG
* file format.
*/
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo) \
symname,
typedef enum RmgrIds
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 463bcb67c5..e15d951000 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -25,25 +25,25 @@
*/
/* symbol name, textual name, redo, desc, identify, startup, cleanup, mask, decode */
-PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL, NULL, xlog_decode)
-PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL, NULL, xact_decode)
-PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL, standby_decode)
-PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask, heap2_decode)
-PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask, heap_decode)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask, NULL)
-PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask, NULL)
-PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask, NULL)
-PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask, NULL)
-PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL, seq_mask, NULL)
-PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup, spg_mask, NULL)
-PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL, brin_mask, NULL)
-PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask, NULL)
-PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL, logicalmsg_decode)
+PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL, NULL, xlog_decode, NULL)
+PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL, NULL, xact_decode, NULL)
+PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL, NULL, NULL, smgr_undo)
+PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL, standby_decode, NULL)
+PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask, heap2_decode, NULL)
+PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask, heap_decode, NULL)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask, NULL, NULL)
+PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask, NULL, NULL)
+PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask, NULL, NULL)
+PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask, NULL, NULL)
+PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL, seq_mask, NULL, NULL)
+PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup, spg_mask, NULL, NULL)
+PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL, brin_mask, NULL, NULL)
+PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask, NULL, NULL)
+PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL, logicalmsg_decode, NULL)
diff --git a/src/include/access/simpleundolog.h b/src/include/access/simpleundolog.h
new file mode 100644
index 0000000000..3d3bd2f7e2
--- /dev/null
+++ b/src/include/access/simpleundolog.h
@@ -0,0 +1,36 @@
+#ifndef SIMPLE_UNDOLOG_H
+#define SIMPLE_UNDOLOG_H
+
+#include "access/rmgr.h"
+#include "port/pg_crc32c.h"
+
+#define SIMPLE_UNDOLOG_DIR "pg_ulog"
+
+typedef struct SimpleUndoLogRecord
+{
+ uint32 ul_tot_len; /* total length of entire record */
+ pg_crc32c ul_crc; /* CRC for this record */
+ RmgrId ul_rmid; /* resource manager for this record */
+ uint8 ul_info; /* record info */
+ TransactionId ul_xid; /* transaction id */
+ /* rmgr-specific data follow, no padding */
+} SimpleUndoLogRecord;
+
+extern void SimpleUndoLogWrite(RmgrId rmgr, uint8 info,
+ TransactionId xid, void *data, int len);
+extern void SimpleUndoLogSetPrpared(TransactionId xid, bool prepared);
+extern void AtEOXact_SimpleUndoLog(bool isCommit, TransactionId xid);
+extern void UndoLogCleanup(void);
+
+extern void AtPrepare_UndoLog(TransactionId xid);
+extern void PostPrepare_UndoLog(void);
+extern void undolog_twophase_recover(TransactionId xid, uint16 info,
+ void *recdata, uint32 len);
+extern void undolog_twophase_postcommit(TransactionId xid, uint16 info,
+ void *recdata, uint32 len);
+extern void undolog_twophase_postabort(TransactionId xid, uint16 info,
+ void *recdata, uint32 len);
+extern void undolog_twophase_standby_recover(TransactionId xid, uint16 info,
+ void *recdata, uint32 len);
+
+#endif /* SIMPLE_UNDOLOG_H */
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 45a3c7835c..0b39c6ef56 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -25,6 +25,8 @@ extern PGDLLIMPORT int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileLocator rlocator,
char relpersistence,
bool register_delete);
+extern void RelationCreateInitFork(Relation rel);
+extern void RelationDropInitFork(Relation rel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileLocator rlocator, bool atCommit);
extern void RelationPreTruncate(Relation rel);
@@ -43,6 +45,7 @@ extern void RestorePendingSyncs(char *startAddress);
extern void smgrDoPendingDeletes(bool isCommit);
extern void smgrDoPendingSyncs(bool isCommit, bool isParallelWorker);
extern int smgrGetPendingDeletes(bool forCommit, RelFileLocator **ptr);
+extern void smgrDoPendingCleanups(bool isCommit);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/catalog/storage_ulog.h b/src/include/catalog/storage_ulog.h
new file mode 100644
index 0000000000..8e47428e66
--- /dev/null
+++ b/src/include/catalog/storage_ulog.h
@@ -0,0 +1,35 @@
+/*-------------------------------------------------------------------------
+ *
+ * storage_ulog.h
+ * prototypes for Undo Log support for backend/catalog/storage.c
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/catalog/storage_ulog.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef STORAGE_ULOG_H
+#define STORAGE_ULOG_H
+
+/* ULOG gives us high 4 bits (just following xlog) */
+#define ULOG_SMGR_UNCOMMITED_STORAGE 0x10
+
+/* undo log entry for uncommitted storage files */
+typedef struct ul_uncommitted_storage
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ bool remove;
+} ul_uncommitted_storage;
+
+/* flags for xl_smgr_truncate */
+#define SMGR_TRUNCATE_HEAP 0x0001
+
+void smgr_undo(SimpleUndoLogRecord *record, bool crash_prepared);
+
+#define ULogRecGetData(record) ((char *)record + sizeof(SimpleUndoLogRecord))
+
+#endif /* STORAGE_ULOG_H */
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index 6b0a7aa3df..5122f5b61d 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -29,13 +29,21 @@
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_UNLINK 0x30
typedef struct xl_smgr_create
{
RelFileLocator rlocator;
ForkNumber forkNum;
+ TransactionId xid;
} xl_smgr_create;
+typedef struct xl_smgr_unlink
+{
+ RelFileLocator rlocator;
+ ForkNumber forkNum;
+} xl_smgr_unlink;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +59,7 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileLocator *rlocator, ForkNumber forkNum);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index e2bbb5abe9..ccd182531d 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -16,9 +16,11 @@
#define REINIT_H
#include "common/relpath.h"
+#include "storage/relfilelocator.h"
extern void ResetUnloggedRelations(int op);
+extern void ResetUnloggedRelationIgnore(RelFileLocator rloc);
extern bool parse_filename_for_nontemp_relation(const char *name,
int *relnumchars,
ForkNumber *fork);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a9a179aaba..74194cf1e4 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -88,6 +88,7 @@ extern void smgrcloserellocator(RelFileLocatorBackend rlocator);
extern void smgrrelease(SMgrRelation reln);
extern void smgrreleaseall(void);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 49a33c0387..b9255e5e25 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1996,6 +1996,7 @@ PatternInfo
PatternInfoArray
Pattern_Prefix_Status
Pattern_Type
+PendingCleanup
PendingFsyncEntry
PendingRelDelete
PendingRelSync
@@ -2553,6 +2554,7 @@ SimplePtrListCell
SimpleStats
SimpleStringList
SimpleStringListCell
+SimpleUndoLogRecord
SingleBoundSortItem
Size
SkipPages
@@ -2909,6 +2911,8 @@ ULONG
ULONG_PTR
UV
UVersionInfo
+UndoDescData
+UndoLogFileHeader
UnicodeNormalizationForm
UnicodeNormalizationQC
Unique
@@ -3826,6 +3830,7 @@ uint8
uint8_t
uint8x16_t
uintptr_t
+ul_uncommitted_storage
unicodeStyleBorderFormat
unicodeStyleColumnFormat
unicodeStyleFormat
@@ -3938,6 +3943,7 @@ xl_running_xacts
xl_seq_rec
xl_smgr_create
xl_smgr_truncate
+xl_smgr_unlink
xl_standby_lock
xl_standby_locks
xl_tblspc_create_rec
--
2.39.3
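A note for readers following the rmgrlist.h change in the patch above: PG_RMGR is an X-macro, so adding the tenth "undo" column means every place that defines PG_RMGR has to accept the extra argument, even where it ignores it. That is why rmgr.h, pg_rewind and pg_waldump are all touched. What follows is only a minimal standalone sketch of that pattern, not code from the patch. All DEMO_/demo_ names and the three-column list are hypothetical, and the record struct is a simplified stand-in for SimpleUndoLogRecord.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for SimpleUndoLogRecord. */
typedef struct DemoUndoRecord
{
	uint8_t		rmid;			/* owning resource manager */
	uint8_t		info;			/* rmgr-specific record type */
	uint32_t	xid;			/* transaction that wrote the record */
} DemoUndoRecord;

typedef void (*demo_undo_fn) (const DemoUndoRecord *rec);

/* Counts invocations so that callers can observe the dispatch. */
int			demo_smgr_undo_calls = 0;

void
demo_smgr_undo(const DemoUndoRecord *rec)
{
	(void) rec;
	demo_smgr_undo_calls++;
}

/* One row per resource manager: symbol, textual name, undo callback. */
#define DEMO_RMGR_LIST \
	DEMO_RMGR(DEMO_RM_XLOG_ID, "XLOG", NULL) \
	DEMO_RMGR(DEMO_RM_SMGR_ID, "Storage", demo_smgr_undo) \
	DEMO_RMGR(DEMO_RM_HEAP_ID, "Heap", NULL)

/* Expansion 1: the ID enum uses only the first column. */
#define DEMO_RMGR(sym, name, undo) sym,
typedef enum DemoRmgrId
{
	DEMO_RMGR_LIST
	DEMO_RM_COUNT
} DemoRmgrId;
#undef DEMO_RMGR

/* Expansion 2: the name table uses only the second column. */
#define DEMO_RMGR(sym, name, undo) name,
static const char *const demo_rmgr_names[] = {DEMO_RMGR_LIST};
#undef DEMO_RMGR

/* Expansion 3: the undo table uses only the third column. */
#define DEMO_RMGR(sym, name, undo) undo,
static const demo_undo_fn demo_rmgr_undo[] = {DEMO_RMGR_LIST};
#undef DEMO_RMGR

const char *
demo_rmgr_name(DemoRmgrId id)
{
	return demo_rmgr_names[id];
}

/* Returns 1 if an undo callback ran, 0 if the rmgr has none. */
int
demo_dispatch_undo(const DemoUndoRecord *rec)
{
	if (rec->rmid >= DEMO_RM_COUNT || demo_rmgr_undo[rec->rmid] == NULL)
		return 0;
	demo_rmgr_undo[rec->rmid] (rec);
	return 1;
}
```

In this sketch only the "Storage" row supplies a callback, mirroring the patch where smgr_undo is the only non-NULL entry in the new column. Dispatching a record whose rmid is DEMO_RM_SMGR_ID runs the callback, and any other rmid is skipped.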
Attachment: v29-0002-In-place-table-persistence-change.patch (text/x-patch; charset=us-ascii)
From 8cfe03157c412f6936e5c1b156d1ce28ac922763 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Mon, 4 Sep 2023 17:23:05 +0900
Subject: [PATCH v29 2/2] In-place table persistence change
Previously, ALTER TABLE SET LOGGED/UNLOGGED caused a large amount of
file I/O due to heap rewrites, even though SET UNLOGGED does not
require any data rewrite. This patch eliminates the need for those
rewrites. Additionally, ALTER TABLE SET LOGGED now emits XLOG_FPI
records instead of numerous HEAP_INSERTs when wal_level > minimal,
reducing resource consumption.
---
src/backend/access/rmgrdesc/smgrdesc.c | 12 +
src/backend/catalog/storage.c | 338 ++++++++++++++++++++++++-
src/backend/commands/tablecmds.c | 268 +++++++++++++++++---
src/backend/storage/buffer/bufmgr.c | 84 ++++++
src/bin/pg_rewind/parsexlog.c | 6 +
src/include/catalog/storage_xlog.h | 10 +
src/include/storage/bufmgr.h | 3 +
src/include/storage/reinit.h | 2 +-
src/tools/pgindent/typedefs.list | 1 +
9 files changed, 683 insertions(+), 41 deletions(-)
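The interplay between RelationCreateInitFork() and RelationDropInitFork() in the storage.c part below can be hard to follow from the diff alone, so here is a rough standalone model of how I read it. It is a sketch under simplifying assumptions, not the patch's code: it ignores undo log records, WAL, buffer persistence and subtransaction nest levels, and every Demo/demo_ name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model of a relation: only tracks whether an init fork exists. */
typedef struct DemoRel
{
	bool		has_init_fork;
} DemoRel;

/* Hypothetical model of a pending init-fork unlink. */
typedef struct DemoPending
{
	DemoRel    *rel;
	bool		at_commit;		/* true = unlink at commit, false = at abort */
	struct DemoPending *next;
} DemoPending;

DemoPending *demo_pendings = NULL;

/* Remove every pending entry for rel; report whether one was found. */
bool
demo_cancel_pending(DemoRel *rel)
{
	DemoPending **link = &demo_pendings;
	bool		found = false;

	while (*link != NULL)
	{
		DemoPending *cur = *link;

		if (cur->rel == rel)
		{
			*link = cur->next;	/* unlink; the link pointer stays put */
			free(cur);
			found = true;
		}
		else
			link = &cur->next;
	}
	return found;
}

void
demo_register_unlink(DemoRel *rel, bool at_commit)
{
	DemoPending *p = malloc(sizeof(DemoPending));

	if (p == NULL)
		abort();
	p->rel = rel;
	p->at_commit = at_commit;
	p->next = demo_pendings;
	demo_pendings = p;
}

/* SET UNLOGGED side */
void
demo_create_init_fork(DemoRel *rel)
{
	/* A pending drop means the fork predates this transaction: keep it. */
	if (demo_cancel_pending(rel))
		return;
	rel->has_init_fork = true;
	demo_register_unlink(rel, false);	/* remove again if we abort */
}

/* SET LOGGED side */
void
demo_drop_init_fork(DemoRel *rel)
{
	/* A pending abort-unlink means we created the fork: drop it right away. */
	if (demo_cancel_pending(rel))
	{
		rel->has_init_fork = false;
		return;
	}
	demo_register_unlink(rel, true);	/* existing fork goes away at commit */
}

/* Apply the entries matching the transaction outcome, then clear the list. */
void
demo_end_xact(bool is_commit)
{
	while (demo_pendings != NULL)
	{
		DemoPending *cur = demo_pendings;

		demo_pendings = cur->next;
		if (cur->at_commit == is_commit)
			cur->rel->has_init_fork = false;
		free(cur);
	}
}
```

The point of the model is that each function first looks for the opposite pending action and cancels it, so SET UNLOGGED followed by SET LOGGED in one transaction (or the reverse) leaves the on-disk state as it was before, whether the transaction commits or aborts.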
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index bd841b96e8..620e02bc26 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,15 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+ char *path = relpathperm(xlrec->rlocator, MAIN_FORKNUM);
+
+ appendStringInfoString(buf, path);
+ appendStringInfo(buf, " persistence %d", xlrec->persistence);
+ pfree(path);
+ }
}
const char *
@@ -55,6 +64,9 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_BUFPERSISTENCE:
+ id = "BUFPERSISTENCE";
+ break;
}
return id;
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 1778801bbd..e7c917c50f 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -71,11 +71,13 @@ typedef struct PendingRelDelete
} PendingRelDelete;
#define PCOP_UNLINK_FORK (1 << 0)
+#define PCOP_SET_PERSISTENCE (1 << 1)
typedef struct PendingCleanup
{
RelFileLocator rlocator; /* relation that need a cleanup */
int op; /* operation mask */
+ bool bufpersistence; /* buffer persistence to set */
ForkNumber unlink_forknum; /* forknum to unlink */
BackendId backend; /* InvalidBackendId if not a temp rel */
bool atCommit; /* T=delete at commit; F=delete at abort */
@@ -209,6 +211,208 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
return srel;
}
+/*
+ * RelationCreateInitFork
+ * Create physical storage for the init fork of a relation.
+ *
+ * This also switches the persistence of the relation's buffers.
+ *
+ * This function is transactional. The creation is WAL-logged, and if the
+ * transaction aborts later on, the init fork will be removed.
+ */
+void
+RelationCreateInitFork(Relation rel)
+{
+ RelFileLocator rlocator = rel->rd_locator;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ SMgrRelation srel;
+ ul_uncommitted_storage ul_storage;
+ bool create = true;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), false, false);
+
+ /*
+ * If a pending-unlink exists for this relation's init-fork, it indicates
+ * that the init-fork existed before the current transaction; this function
+ * reverts the pending-unlink by removing the entry. See
+ * RelationDropInitFork.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileLocatorEquals(rlocator, pending->rlocator) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ /* write cancel log for preceding undo log entry */
+ ul_storage.rlocator = rlocator;
+ ul_storage.forknum = INIT_FORKNUM;
+ ul_storage.remove = false;
+ SimpleUndoLogWrite(RM_SMGR_ID, ULOG_SMGR_UNCOMMITED_STORAGE,
+ GetCurrentTransactionId(),
+ &ul_storage, sizeof(ul_storage));
+
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ create = false;
+ }
+ else
+ prev = pending;
+ }
+
+ if (!create)
+ return;
+
+ /* create undo log entry, then the init fork */
+ srel = smgropen(rlocator, InvalidBackendId);
+
+ /* write undo log */
+ ul_storage.rlocator = rlocator;
+ ul_storage.forknum = INIT_FORKNUM;
+ ul_storage.remove = true;
+ SimpleUndoLogWrite(RM_SMGR_ID, ULOG_SMGR_UNCOMMITED_STORAGE,
+ GetCurrentTransactionId(),
+ &ul_storage, sizeof(ul_storage));
+
+ /* No init fork exists yet; create it. */
+ smgrcreate(srel, INIT_FORKNUM, false);
+
+ /*
+ * For index relations, WAL-logging and file sync are handled by
+ * ambuildempty. In contrast, for heap relations, these tasks are performed
+ * directly.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rlocator, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /* at abort, drop the init fork and revert buffer persistence */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = rlocator;
+ pending->op = PCOP_UNLINK_FORK | PCOP_SET_PERSISTENCE;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->bufpersistence = true;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
+/*
+ * RelationDropInitFork
+ * Delete physical storage for the init fork of a relation.
+ */
+void
+RelationDropInitFork(Relation rel)
+{
+ RelFileLocator rlocator = rel->rd_locator;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ bool inxact_created = false;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), true, false);
+
+ /*
+ * Search for a pending-unlink associated with the init-fork of the
+ * relation. Its presence indicates that the init-fork was created within
+ * the current transaction.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileLocatorEquals(rlocator, pending->rlocator) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ ul_uncommitted_storage ul_storage;
+
+ /* write cancel log for preceding undo log entry */
+ ul_storage.rlocator = rlocator;
+ ul_storage.forknum = INIT_FORKNUM;
+ ul_storage.remove = false;
+ SimpleUndoLogWrite(RM_SMGR_ID, ULOG_SMGR_UNCOMMITED_STORAGE,
+ GetCurrentTransactionId(),
+ &ul_storage, sizeof(ul_storage));
+
+ /* unlink list entry */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+
+ /* prev does not change */
+
+ inxact_created = true;
+ }
+ else
+ prev = pending;
+ }
+
+ /*
+ * If the init-fork was created in this transaction, remove the init-fork
+ * and cancel preceding undo log. Otherwise, register an at-commit
+ * pending-unlink for the existing init-fork. See RelationCreateInitFork.
+ */
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rlocator, InvalidBackendId);
+ ForkNumber forknum = INIT_FORKNUM;
+ BlockNumber firstblock = 0;
+ ul_uncommitted_storage ul_storage;
+
+ /*
+ * Some AMs initialize init-fork via the buffer manager. To properly
+ * drop the init-fork, first drop all buffers for the init-fork, then
+ * unlink the init-fork and cancel preceding undo log.
+ */
+ DropRelationBuffers(srel, &forknum, 1, &firstblock);
+
+ /* cancel existing undo log */
+ ul_storage.rlocator = rlocator;
+ ul_storage.forknum = INIT_FORKNUM;
+ ul_storage.remove = false;
+ SimpleUndoLogWrite(RM_SMGR_ID, ULOG_SMGR_UNCOMMITED_STORAGE,
+ GetCurrentTransactionId(),
+ &ul_storage, sizeof(ul_storage));
+ log_smgrunlink(&rlocator, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ return;
+ }
+
+ /* register drop of this init fork file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = rlocator;
+ pending->op = PCOP_UNLINK_FORK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -248,6 +452,25 @@ log_smgrunlink(const RelFileLocator *rlocator, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileLocator rlocator, bool persistence)
+{
+ xl_smgr_bufpersistence xlrec;
+
+ /*
+ * Make an XLOG entry reporting the change of buffer persistence.
+ */
+ xlrec.rlocator = rlocator;
+ xlrec.persistence = persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -800,7 +1023,14 @@ smgrDoPendingCleanups(bool isCommit)
srel = smgropen(pending->rlocator, pending->backend);
- Assert((pending->op & ~(PCOP_UNLINK_FORK)) == 0);
+ Assert((pending->op &
+ ~(PCOP_UNLINK_FORK | PCOP_SET_PERSISTENCE)) == 0);
+
+ if (pending->op & PCOP_SET_PERSISTENCE)
+ {
+ SetRelationBuffersPersistence(srel, pending->bufpersistence,
+ InRecovery);
+ }
if (pending->op & PCOP_UNLINK_FORK)
{
@@ -1200,6 +1430,112 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ PendingCleanup *prev = NULL;
+
+ reln = smgropen(xlrec->rlocator, InvalidBackendId);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+
+ /*
+ * Delete any pending action for persistence change, if present. There
+ * should be at most one entry for this action.
+ */
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileLocatorEquals(xlrec->rlocator, pending->rlocator) &&
+ (pending->op & PCOP_SET_PERSISTENCE) != 0)
+ {
+ Assert(pending->bufpersistence == xlrec->persistence);
+
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+
+ /*
+ * During abort, revert any changes to buffer persistence made in
+ * this transaction.
+ */
+ if (!pending)
+ {
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = xlrec->rlocator;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = !xlrec->persistence;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ PendingCleanup *prev = NULL;
+
+ reln = smgropen(xlrec->rlocator, InvalidBackendId);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+
+ /*
+ * Delete any pending action for persistence change, if present. There
+ * should be at most one entry for this action.
+ */
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileLocatorEquals(xlrec->rlocator, pending->rlocator) &&
+ (pending->op & PCOP_SET_PERSISTENCE) != 0)
+ {
+ Assert(pending->bufpersistence == xlrec->persistence);
+
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+
+ /*
+ * During abort, revert any changes to buffer persistence made in
+ * this transaction.
+ */
+ if (!pending)
+ {
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = xlrec->rlocator;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = !xlrec->persistence;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 47c556669f..6c4cfbfa78 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -55,6 +55,7 @@
#include "commands/defrem.h"
#include "commands/event_trigger.h"
#include "commands/policy.h"
+#include "commands/progress.h"
#include "commands/sequence.h"
#include "commands/tablecmds.h"
#include "commands/tablespace.h"
@@ -5571,6 +5572,188 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
return newcmd;
}
+/*
+ * RelationChangePersistence: perform in-place persistence change of a relation
+ */
+static void
+RelationChangePersistence(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ int i;
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * The caller must use ATRewriteTable instead of this function if any of
+ * the following is set.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ /*
+ * Initially, gather all relations that require a persistence change.
+ */
+
+ /* Collect OIDs of indexes and toast relations */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ /* Add toast relation if any */
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ List *toastidx;
+ Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
+
+ relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
+ toastidx = RelationGetIndexList(toastrel);
+ relids = list_concat(relids, toastidx);
+ pfree(toastidx);
+ table_close(toastrel, NoLock);
+ }
+
+ table_close(rel, NoLock);
+
+ /* Make changes in storage */
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach(lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+
+ /*
+ * XXXX: Some access methods don't support in-place persistence
+ * changes. GiST uses page LSNs to figure out whether a block has been
+ * modified. However, UNLOGGED GiST indexes use fake LSNs, which are
+ * incompatible with the real LSNs used for LOGGED indexes.
+ *
+ * Potentially, if gistGetFakeLSN behaved similarly for both permanent
+ * and unlogged indexes, we could avoid index rebuilds by emitting
+ * extra WAL records while the index is unlogged.
+ *
+ * Compare relam against a positive list to ensure the hard way is
+ * taken for unknown AMs.
+ */
+ if (r->rd_rel->relkind == RELKIND_INDEX &&
+ /* GiST is excluded */
+ r->rd_rel->relam != BTREE_AM_OID &&
+ r->rd_rel->relam != HASH_AM_OID &&
+ r->rd_rel->relam != GIN_AM_OID &&
+ r->rd_rel->relam != SPGIST_AM_OID &&
+ r->rd_rel->relam != BRIN_AM_OID)
+ {
+ int reindex_flags;
+ ReindexParams params = {0};
+
+ /* reindex doesn't allow concurrent use of the index */
+ table_close(r, NoLock);
+
+ reindex_flags =
+ REINDEX_REL_SUPPRESS_INDEX_USE |
+ REINDEX_REL_CHECK_CONSTRAINTS;
+
+ /* Set the same persistence as the parent relation. */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+
+ reindex_index(reloid, reindex_flags, persistence, &params);
+
+ continue;
+ }
+
+ /* Create or drop init fork */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ RelationCreateInitFork(r);
+ else
+ RelationDropInitFork(r);
+
+ /*
+ * If this relation becomes WAL-logged, immediately sync all files
+ * except the init-fork to establish the initial state on storage. The
+ * buffers should have already been flushed out by
+ * RelationCreate(Drop)InitFork called just above. The init-fork should
+ * already be synchronized as required.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT)
+ {
+ for (i = 0; i < INIT_FORKNUM; i++)
+ {
+ if (smgrexists(RelationGetSmgr(r), i))
+ smgrimmedsync(RelationGetSmgr(r), i);
+ }
+ }
+
+ /* Update catalog */
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+
+ /*
+ * If wal_level >= replica, switching to LOGGED necessitates WAL-logging
+ * the relation content for later recovery. This is not emitted when
+ * wal_level = minimal.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded())
+ {
+ ForkNumber fork;
+ xl_smgr_truncate xlrec;
+
+ xlrec.blkno = 0;
+ xlrec.rlocator = r->rd_locator;
+ xlrec.flags = SMGR_TRUNCATE_ALL;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ for (fork = 0; fork < INIT_FORKNUM; fork++)
+ {
+ if (smgrexists(RelationGetSmgr(r), fork))
+ log_newpage_range(r, fork, 0,
+ smgrnblocks(RelationGetSmgr(r), fork),
+ false);
+ }
+ }
+
+ table_close(r, NoLock);
+ }
+
+ table_close(classRel, NoLock);
+}
+
/*
* ATRewriteTables: ALTER TABLE phase 3
*/
@@ -5701,48 +5884,55 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- /*
- * Create transient table that will receive the modified data.
- *
- * Ensure it is marked correctly as logged or unlogged. We have
- * to do this here so that buffers for the new relfilenumber will
- * have the right persistence set, and at the same time ensure
- * that the original filenumbers's buffers will get read in with
- * the correct setting (i.e. the original one). Otherwise a
- * rollback after the rewrite would possibly result with buffers
- * for the original filenumbers having the wrong persistence
- * setting.
- *
- * NB: This relies on swap_relation_files() also swapping the
- * persistence. That wouldn't work for pg_class, but that can't be
- * unlogged anyway.
- */
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
- persistence, lockmode);
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
+ RelationChangePersistence(tab, persistence, lockmode);
+ else
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new
+ * relfilenumber will have the right persistence set, and at
+ * the same time ensure that the original filenumbers's buffers
+ * will get read in with the correct setting (i.e. the original
+ * one). Otherwise a rollback after the rewrite would possibly
+ * result with buffers for the original filenumbers having the
+ * wrong persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that
+ * can't be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace,
+ NewAccessMethod,
+ persistence, lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
+ /*
+ * Swap the physical files of the old and new heaps, then
+ * rebuild indexes and discard the old heap. We can use
+ * RecentXmin for the table's new relfrozenxid because we
+ * rewrote all the tuples in ATRewriteTable, so no older Xid
+ * remains in the table. Also, we never try to swap toast
+ * tables by content, since we have no interest in letting
+ * this code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
- InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ }
}
else if (tab->rewrite > 0 && tab->relkind == RELKIND_SEQUENCE)
{
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3bd82dbfca..04ab6ec8a7 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3708,6 +3708,90 @@ DropRelationBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
}
}
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a relation
+ * then writes all dirty pages to disk (or kernel disk buffers) when
+ * switching to PERMANENT, ensuring the kernel has an up-to-date view of
+ * the relation.
+ *
+ * The caller must be holding AccessExclusiveLock on the target relation
+ * to ensure no other backend is busy dirtying more blocks.
+ *
+ * XXX currently it sequentially searches the buffer pool; consider
+ * implementing more efficient search methods. This routine isn't used in
+ * performance-critical code paths, so it's not worth additional overhead
+ * to make it go faster; see also DropRelationBuffers.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
+{
+ int i;
+ RelFileLocatorBackend rlocator = srel->smgr_rlocator;
+
+ Assert(!RelFileLocatorBackendIsTemp(rlocator));
+
+ if (!isRedo)
+ log_smgrbufpersistence(srel->smgr_rlocator.locator, permanent);
+
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ if (!RelFileLocatorEquals(BufTagGetRelFileLocator(&bufHdr->tag),
+ rlocator.locator))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (!RelFileLocatorEquals(BufTagGetRelFileLocator(&bufHdr->tag),
+ rlocator.locator))
+ {
+ UnlockBufHdr(bufHdr, buf_state);
+ continue;
+ }
+
+ if (permanent)
+ {
+ /* The init fork is being dropped, drop buffers for it. */
+ if (BufTagGetForkNum(&bufHdr->tag) == INIT_FORKNUM)
+ {
+ InvalidateBuffer(bufHdr);
+ continue;
+ }
+
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /* flush this buffer when switching to PERMANENT */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, srel, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ /* There shouldn't be an init fork */
+ Assert(BufTagGetForkNum(&bufHdr->tag) != INIT_FORKNUM);
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ }
+}
+
/* ---------------------------------------------------------------------
* DropRelationsAllBuffers
*
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 87b4659e27..db12f4f397 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -418,6 +418,12 @@ extractPageInfo(XLogReaderState *record)
* source system.
*/
}
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ /*
+ * We can safely ignore these. These don't make any on-disk changes.
+ */
+ }
else if (rmid == RM_XACT_ID &&
((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT ||
(rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED ||
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index 5122f5b61d..eaa162f0c7 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -14,6 +14,7 @@
#ifndef STORAGE_XLOG_H
#define STORAGE_XLOG_H
+#include "access/simpleundolog.h"
#include "access/xlogreader.h"
#include "lib/stringinfo.h"
#include "storage/block.h"
@@ -30,6 +31,7 @@
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
#define XLOG_SMGR_UNLINK 0x30
+#define XLOG_SMGR_BUFPERSISTENCE 0x40
typedef struct xl_smgr_create
{
@@ -44,6 +46,12 @@ typedef struct xl_smgr_unlink
ForkNumber forkNum;
} xl_smgr_unlink;
+typedef struct xl_smgr_bufpersistence
+{
+ RelFileLocator rlocator;
+ bool persistence;
+} xl_smgr_bufpersistence;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -60,6 +68,8 @@ typedef struct xl_smgr_truncate
extern void log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum);
extern void log_smgrunlink(const RelFileLocator *rlocator, ForkNumber forkNum);
+extern void log_smgrbufpersistence(const RelFileLocator rlocator,
+ bool persistence);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index b379c76e27..0e4e290392 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -222,6 +222,9 @@ extern void DropRelationBuffers(struct SMgrRelationData *smgr_reln,
int nforks, BlockNumber *firstDelBlock);
extern void DropRelationsAllBuffers(struct SMgrRelationData **smgr_reln,
int nlocators);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+ bool permanent, bool isRedo);
+
extern void DropDatabaseBuffers(Oid dbid);
#define RelationGetNumberOfBlocks(reln) \
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index ccd182531d..e59fb7892e 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -20,10 +20,10 @@
extern void ResetUnloggedRelations(int op);
-extern void ResetUnloggedRelationIgnore(RelFileLocator rloc);
extern bool parse_filename_for_nontemp_relation(const char *name,
int *relnumchars,
ForkNumber *fork);
+extern void ResetUnloggedRelationIgnore(RelFileLocator rloc);
#define UNLOGGED_RELATION_CLEANUP 0x0001
#define UNLOGGED_RELATION_INIT 0x0002
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b9255e5e25..2c34434555 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3941,6 +3941,7 @@ xl_replorigin_set
xl_restore_point
xl_running_xacts
xl_seq_rec
+xl_smgr_bufpersistence
xl_smgr_create
xl_smgr_truncate
xl_smgr_unlink
--
2.39.3
On Mon, 4 Sept 2023 at 16:59, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
At Thu, 24 Aug 2023 11:22:32 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
I could turn this into something like undo logs in a simple form, but
I'd rather not craft a general-purpose undo log system for this unless
it's absolutely necessary.

This is a patch for a basic undo log implementation. It looks like it
works well for some orphan-files-after-a-crash and data-loss-on-reinit
cases. However, it is far from complete and likely has issues with
crash-safety and the durability of undo log files (and memory leaks
and performance and..).

I'm posting this to move the discussion forward.
(This doesn't contain the third file "ALTER TABLE ..ALL IN TABLESPACE" part.)
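
For readers unfamiliar with the idea, the following is a minimal sketch of an append-only undo log, assuming a simplified in-memory layout. It is not the patch's code: the names DemoUndoRecord, DemoUndoLog, demo_undo_append, demo_undo_replay and demo_sum_payload are hypothetical, and the real SimpleUndoLogRecord carries more fields (xid, CRC) and lives in a per-transaction file. The sketch only shows the shape of the technique: each record is a length-prefixed header plus payload, and abort processing walks the records front to back, handing each one to a handler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record header; the patch's SimpleUndoLogRecord differs. */
typedef struct DemoUndoRecord
{
    uint32_t tot_len;           /* header plus payload, in bytes */
    uint8_t  kind;              /* selects the undo handler */
} DemoUndoRecord;

/* Hypothetical in-memory stand-in for a per-transaction undo file. */
typedef struct DemoUndoLog
{
    unsigned char buf[1024];
    size_t        used;
} DemoUndoLog;

typedef void (*demo_undo_fn) (uint8_t kind, const void *payload,
                              size_t len, void *arg);

/* Append one record. Returns 0 on success, -1 if it does not fit. */
static int
demo_undo_append(DemoUndoLog *log, uint8_t kind,
                 const void *payload, size_t len)
{
    DemoUndoRecord hdr;

    if (len > sizeof(log->buf) - sizeof(hdr) ||
        log->used > sizeof(log->buf) - sizeof(hdr) - len)
        return -1;

    memset(&hdr, 0, sizeof(hdr));
    hdr.tot_len = (uint32_t) (sizeof(hdr) + len);
    hdr.kind = kind;

    memcpy(log->buf + log->used, &hdr, sizeof(hdr));
    if (len > 0)
        memcpy(log->buf + log->used + sizeof(hdr), payload, len);
    log->used += hdr.tot_len;
    return 0;
}

/* Walk the log from the start, calling fn once per record. */
static int
demo_undo_replay(const DemoUndoLog *log, demo_undo_fn fn, void *arg)
{
    size_t off = 0;
    int    nrecords = 0;

    while (off + sizeof(DemoUndoRecord) <= log->used)
    {
        DemoUndoRecord hdr;

        memcpy(&hdr, log->buf + off, sizeof(hdr));
        assert(hdr.tot_len >= sizeof(hdr) &&
               off + hdr.tot_len <= log->used);
        fn(hdr.kind, log->buf + off + sizeof(hdr),
           hdr.tot_len - sizeof(hdr), arg);
        off += hdr.tot_len;
        nrecords++;
    }
    return nrecords;
}

/* Illustrative handler: adds up payload sizes into *arg (a size_t). */
static void
demo_sum_payload(uint8_t kind, const void *payload, size_t len, void *arg)
{
    (void) kind;
    (void) payload;
    *(size_t *) arg += len;
}
```

A caller would append one record per action that needs cleanup on abort, then call demo_undo_replay() with a real handler at abort or during crash recovery. Durability (fsync after each append) and torn-record detection are deliberately left out of this sketch.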
CFBot shows compilation issues at [1] with:
[09:34:44.987] /usr/bin/ld:
src/backend/postgres_lib.a.p/access_transam_twophase.c.o: in function
`FinishPreparedTransaction':
[09:34:44.987] /tmp/cirrus-ci-build/build/../src/backend/access/transam/twophase.c:1569:
undefined reference to `AtEOXact_SimpleUndoLog'
[09:34:44.987] /usr/bin/ld:
src/backend/postgres_lib.a.p/access_transam_xact.c.o: in function
`CommitTransaction':
[09:34:44.987] /tmp/cirrus-ci-build/build/../src/backend/access/transam/xact.c:2372:
undefined reference to `AtEOXact_SimpleUndoLog'
[09:34:44.987] /usr/bin/ld:
src/backend/postgres_lib.a.p/access_transam_xact.c.o: in function
`AbortTransaction':
[09:34:44.987] /tmp/cirrus-ci-build/build/../src/backend/access/transam/xact.c:2878:
undefined reference to `AtEOXact_SimpleUndoLog'
[09:34:44.987] /usr/bin/ld:
src/backend/postgres_lib.a.p/access_transam_xact.c.o: in function
`CommitSubTransaction':
[09:34:44.987] /tmp/cirrus-ci-build/build/../src/backend/access/transam/xact.c:5016:
undefined reference to `AtEOXact_SimpleUndoLog'
[09:34:44.987] /usr/bin/ld:
src/backend/postgres_lib.a.p/access_transam_xact.c.o: in function
`AbortSubTransaction':
[09:34:44.987] /tmp/cirrus-ci-build/build/../src/backend/access/transam/xact.c:5197:
undefined reference to `AtEOXact_SimpleUndoLog'
[09:34:44.987] /usr/bin/ld:
src/backend/postgres_lib.a.p/access_transam_xact.c.o:/tmp/cirrus-ci-build/build/../src/backend/access/transam/xact.c:6080:
more undefined references to `AtEOXact_SimpleUndoLog' follow
[1]: https://cirrus-ci.com/task/5916232528953344
Regards,
Vignesh
At Tue, 9 Jan 2024 15:07:20 +0530, vignesh C <vignesh21@gmail.com> wrote in
CFBot shows compilation issues at [1] with:
Thanks!
The reason for those errors was that I didn't consider Meson at the
time. Additionally, the signature change of reindex_index() caused the
build failure. I fixed both issues. While addressing these issues, I
modified the simpleundolog module to honor
wal_sync_method. Previously, the sync method for undo logs was
determined independently, separate from xlog.c. However, I'm still not
satisfied with the method for handling PG_O_DIRECT.
In this version, I have added the changes to enable the use of
wal_sync_method outside of xlog.c as the first part of the patchset.
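
The two ideas in that first part can be sketched roughly as follows, under stated assumptions. The flag values and every demo_* name below are hypothetical and exist only for illustration; the actual patch uses get_sync_bit(), PG_O_DIRECT and XLogFsyncFile(). The sketch shows (a) masking the direct-I/O bit out of the open flags handed to callers outside the WAL code, and (b) returning an error message format string from the sync helper so that each caller decides how severe the failure is.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values; real O_DSYNC/O_DIRECT are platform-specific. */
#define DEMO_O_DSYNC   0x01
#define DEMO_O_DIRECT  0x02

/* Open flags a WAL writer might derive from its sync settings (assumed). */
static int
demo_wal_open_flags(int use_dsync, int use_direct)
{
    int flags = 0;

    if (use_dsync)
        flags |= DEMO_O_DSYNC;
    if (use_direct)
        flags |= DEMO_O_DIRECT;
    return flags;
}

/* Same flags for non-WAL callers, with the direct-I/O bit masked out. */
static int
demo_exported_sync_bit(int use_dsync, int use_direct)
{
    return demo_wal_open_flags(use_dsync, use_direct) & ~DEMO_O_DIRECT;
}

/*
 * Turn a sync return code into a message format string, or NULL on
 * success. The caller chooses the error level and supplies the file name.
 */
static const char *
demo_fsync_result(int rc)
{
    return (rc != 0) ? "could not fsync file \"%s\": %m" : NULL;
}
```

With this split, a WAL caller can keep treating a non-NULL result as fatal, while another module formats the same message with its own file name and error level.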
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v31-0001-Export-wal_sync_method-related-functions.patch (text/x-patch; charset=us-ascii)
From 40749357f24adf89dc79db9b34f5c053288489bb Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Mon, 15 Jan 2024 15:57:53 +0900
Subject: [PATCH v31 1/3] Export wal_sync_method related functions
Export several functions related to wal_sync_method for use in
subsequent commits. Since PG_O_DIRECT cannot be used in those commits,
the new function XLogGetSyncBit() will mask PG_O_DIRECT.
---
src/backend/access/transam/xlog.c | 73 +++++++++++++++++++++----------
src/include/access/xlog.h | 2 +
2 files changed, 52 insertions(+), 23 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 478377c4a2..c5f51849ee 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8403,21 +8403,29 @@ assign_wal_sync_method(int new_wal_sync_method, void *extra)
}
}
+/*
+ * Exported version of get_sync_bit()
+ *
+ * Do not expose PG_O_DIRECT for uses outside xlog.c.
+ */
+int
+XLogGetSyncBit(void)
+{
+ return get_sync_bit(wal_sync_method) & ~PG_O_DIRECT;
+}
+
/*
- * Issue appropriate kind of fsync (if any) for an XLOG output file.
+ * Issue appropriate kind of fsync (if any) according to wal_sync_method.
*
- * 'fd' is a file descriptor for the XLOG file to be fsync'd.
- * 'segno' is for error reporting purposes.
+ * 'fd' is a file descriptor for the file to be fsync'd.
*/
-void
-issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
+const char *
+XLogFsyncFile(int fd)
{
- char *msg = NULL;
+ const char *msg = NULL;
instr_time start;
- Assert(tli != 0);
-
/*
* Quick exit if fsync is disabled or write() has already synced the WAL
* file.
@@ -8425,7 +8433,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
if (!enableFsync ||
wal_sync_method == WAL_SYNC_METHOD_OPEN ||
wal_sync_method == WAL_SYNC_METHOD_OPEN_DSYNC)
- return;
+ return NULL;
/* Measure I/O timing to sync the WAL file */
if (track_wal_io_timing)
@@ -8460,19 +8468,6 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
break;
}
- /* PANIC if failed to fsync */
- if (msg)
- {
- char xlogfname[MAXFNAMELEN];
- int save_errno = errno;
-
- XLogFileName(xlogfname, tli, segno, wal_segment_size);
- errno = save_errno;
- ereport(PANIC,
- (errcode_for_file_access(),
- errmsg(msg, xlogfname)));
- }
-
pgstat_report_wait_end();
/*
@@ -8486,7 +8481,39 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, start);
}
- PendingWalStats.wal_sync++;
+ if (msg != NULL)
+ PendingWalStats.wal_sync++;
+
+ return msg;
+}
+
+/*
+ * Issue appropriate kind of fsync (if any) for an XLOG output file.
+ *
+ * 'fd' is a file descriptor for the XLOG file to be fsync'd.
+ * 'segno' is for error reporting purposes.
+ */
+void
+issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
+{
+ const char *msg;
+
+ Assert(tli != 0);
+
+ msg = XLogFsyncFile(fd);
+
+ /* PANIC if failed to fsync */
+ if (msg)
+ {
+ char xlogfname[MAXFNAMELEN];
+ int save_errno = errno;
+
+ XLogFileName(xlogfname, tli, segno, wal_segment_size);
+ errno = save_errno;
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg(msg, xlogfname)));
+ }
}
/*
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 301c5fa11f..2a0d65b537 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -217,6 +217,8 @@ extern void xlog_redo(struct XLogReaderState *record);
extern void xlog_desc(StringInfo buf, struct XLogReaderState *record);
extern const char *xlog_identify(uint8 info);
+extern int XLogGetSyncBit(void);
+extern const char *XLogFsyncFile(int fd);
extern void issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli);
extern bool RecoveryInProgress(void);
--
2.39.3
v31-0002-Introduce-undo-log-implementation.patch (text/x-patch; charset=us-ascii)
From 5c120b94c407b971485ab52133399305e5e81a88 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 31 Aug 2023 11:49:10 +0900
Subject: [PATCH v31 2/3] Introduce undo log implementation
This patch adds a simple implementation of UNDO log feature.
---
src/backend/access/transam/Makefile | 1 +
src/backend/access/transam/meson.build | 1 +
src/backend/access/transam/rmgr.c | 4 +-
src/backend/access/transam/simpleundolog.c | 362 +++++++++++++++++++++
src/backend/access/transam/twophase.c | 3 +
src/backend/access/transam/xact.c | 24 ++
src/backend/access/transam/xlog.c | 20 +-
src/backend/catalog/storage.c | 171 ++++++++++
src/backend/storage/file/reinit.c | 78 +++++
src/backend/storage/smgr/smgr.c | 9 +
src/bin/initdb/initdb.c | 17 +
src/bin/pg_rewind/parsexlog.c | 2 +-
src/bin/pg_waldump/rmgrdesc.c | 2 +-
src/include/access/rmgr.h | 2 +-
src/include/access/rmgrlist.h | 44 +--
src/include/access/simpleundolog.h | 36 ++
src/include/catalog/storage.h | 3 +
src/include/catalog/storage_ulog.h | 35 ++
src/include/catalog/storage_xlog.h | 9 +
src/include/storage/reinit.h | 2 +
src/include/storage/smgr.h | 1 +
src/tools/pgindent/typedefs.list | 6 +
22 files changed, 800 insertions(+), 32 deletions(-)
create mode 100644 src/backend/access/transam/simpleundolog.c
create mode 100644 src/include/access/simpleundolog.h
create mode 100644 src/include/catalog/storage_ulog.h
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db..531505cbbd 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -21,6 +21,7 @@ OBJS = \
rmgr.o \
slru.o \
subtrans.o \
+ simpleundolog.o \
timeline.o \
transam.o \
twophase.o \
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index 8a3522557c..c1225636b5 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -9,6 +9,7 @@ backend_sources += files(
'rmgr.c',
'slru.c',
'subtrans.c',
+ 'simpleundolog.c',
'timeline.c',
'transam.c',
'twophase.c',
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 7d67eda5f7..840cbdecd3 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -35,8 +35,8 @@
#include "utils/relmapper.h"
/* must be kept in sync with RmgrData definition in xlog_internal.h */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
- { name, redo, desc, identify, startup, cleanup, mask, decode },
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo) \
+ { name, redo, desc, identify, startup, cleanup, mask, decode},
RmgrData RmgrTable[RM_MAX_ID + 1] = {
#include "access/rmgrlist.h"
diff --git a/src/backend/access/transam/simpleundolog.c b/src/backend/access/transam/simpleundolog.c
new file mode 100644
index 0000000000..e22ed67bae
--- /dev/null
+++ b/src/backend/access/transam/simpleundolog.c
@@ -0,0 +1,362 @@
+/*-------------------------------------------------------------------------
+ *
+ * simpleundolog.c
+ * Simple implementation of PostgreSQL transaction-undo-log manager
+ *
+ * In this module, procedures required during a transaction abort are
+ * logged. Persisting this information becomes crucial, particularly for
+ * ensuring reliable post-processing during the restart following a transaction
+ * crash. At present, in this module, logging of information is performed by
+ * simply appending data to a created file.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/simpleundolog.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/simpleundolog.h"
+#include "access/twophase_rmgr.h"
+#include "access/parallel.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "catalog/storage_ulog.h"
+#include "storage/fd.h"
+
+#define ULOG_FILE_MAGIC 0x12345678
+
+typedef struct UndoLogFileHeader
+{
+ int32 magic;
+ bool prepared;
+} UndoLogFileHeader;
+
+typedef struct UndoDescData
+{
+ const char *name;
+ void (*rm_undo) (SimpleUndoLogRecord *record, bool prepared);
+} UndoDescData;
+
+/* must be kept in sync with RmgrData definition in xlog_internal.h */
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo) \
+ { name, undo },
+
+UndoDescData UndoRoutines[RM_MAX_ID + 1] = {
+#include "access/rmgrlist.h"
+};
+#undef PG_RMGR
+
+static char current_ulogfile_name[MAXPGPATH];
+static int current_ulogfile_fd = -1;
+static int current_xid = InvalidTransactionId;
+static UndoLogFileHeader current_fhdr;
+
+static void
+undolog_check_file_header(void)
+{
+ if (read(current_ulogfile_fd, &current_fhdr, sizeof(current_fhdr)) < 0)
+ ereport(PANIC,
+ errcode_for_file_access(),
+ errmsg("could not read undolog file \"%s\": %m",
+ current_ulogfile_name));
+ if (current_fhdr.magic != ULOG_FILE_MAGIC)
+ ereport(PANIC,
+ errcode_for_file_access(),
+ errmsg("invalid undolog file \"%s\": magic don't match",
+ current_ulogfile_name));
+}
+
+static void
+undolog_sync_current_file(void)
+{
+ const char *msg;
+
+ msg = XLogFsyncFile(current_ulogfile_fd);
+
+ /* PANIC if failed to fsync */
+ if (msg)
+ {
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg(msg, current_ulogfile_name)));
+ }
+}
+
+static bool
+undolog_open_current_file(TransactionId xid, bool forread, bool append)
+{
+ int omode;
+
+ if (current_ulogfile_fd >= 0)
+ {
+ /* use existing open file */
+ if (current_xid == xid)
+ {
+ if (append)
+ return true;
+
+ if (lseek(current_ulogfile_fd,
+ sizeof(UndoLogFileHeader), SEEK_SET) < 0)
+ ereport(PANIC,
+ errcode_for_file_access(),
+ errmsg("could not seek undolog file \"%s\": %m",
+ current_ulogfile_name));
+ }
+
+ close(current_ulogfile_fd);
+ current_ulogfile_fd = -1;
+ ReleaseExternalFD();
+ }
+
+ current_xid = xid;
+ if (!TransactionIdIsValid(xid))
+ return false;
+
+ omode = PG_BINARY | XLogGetSyncBit();
+
+ if (forread)
+ omode |= O_RDONLY;
+ else
+ {
+ omode |= O_RDWR;
+
+ if (!append)
+ omode |= O_TRUNC;
+ }
+
+ snprintf(current_ulogfile_name, MAXPGPATH, "%s/%08x",
+ SIMPLE_UNDOLOG_DIR, xid);
+ current_ulogfile_fd = BasicOpenFile(current_ulogfile_name, omode);
+ if (current_ulogfile_fd >= 0)
+ undolog_check_file_header();
+ else
+ {
+ if (forread)
+ return false;
+
+ current_fhdr.magic = ULOG_FILE_MAGIC;
+ current_fhdr.prepared = false;
+
+ omode |= O_CREAT;
+ current_ulogfile_fd = BasicOpenFile(current_ulogfile_name, omode);
+ if (current_ulogfile_fd < 0)
+ ereport(PANIC,
+ errcode_for_file_access(),
+ errmsg("could not create undolog file \"%s\": %m",
+ current_ulogfile_name));
+
+ if (write(current_ulogfile_fd, &current_fhdr, sizeof(current_fhdr)) < 0)
+ ereport(PANIC,
+ errcode_for_file_access(),
+ errmsg("could not write undolog file \"%s\": %m",
+ current_ulogfile_name));
+ }
+
+ /*
+ * move file pointer to the end of the file. we do this not using O_APPEND,
+ * to allow us to modify data at any location in the file. We already moved
+ * to the first record in the case of !append.
+ */
+ if (append)
+ {
+ if (lseek(current_ulogfile_fd, 0, SEEK_END) < 0)
+ ereport(PANIC,
+ errcode_for_file_access(),
+ errmsg("could not seek undolog file \"%s\": %m",
+ current_ulogfile_name));
+ }
+ ReserveExternalFD();
+
+ /* sync the file according to wal_sync_method */
+ undolog_sync_current_file();
+
+ return true;
+}
+
+/*
+ * Write an undolog record
+ */
+void
+SimpleUndoLogWrite(RmgrId rmgr, uint8 info,
+ TransactionId xid, void *data, int len)
+{
+ int reclen = sizeof(SimpleUndoLogRecord) + len;
+ SimpleUndoLogRecord *rec = palloc(reclen);
+ pg_crc32c undodata_crc;
+
+ Assert(!IsParallelWorker());
+ Assert(xid != InvalidTransactionId);
+
+ undolog_open_current_file(xid, false, true);
+
+ rec->ul_tot_len = reclen;
+ rec->ul_rmid = rmgr;
+ rec->ul_info = info;
+ rec->ul_xid = current_xid;
+
+ memcpy((char *)rec + sizeof(SimpleUndoLogRecord), data, len);
+
+ /* Calculate CRC of the data */
+ INIT_CRC32C(undodata_crc);
+ COMP_CRC32C(undodata_crc, rec,
+ reclen - offsetof(SimpleUndoLogRecord, ul_rmid));
+ rec->ul_crc = undodata_crc;
+
+
+ if (write(current_ulogfile_fd, rec, reclen) < 0)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not write to undolog file \"%s\": %m",
+ current_ulogfile_name));
+
+ undolog_sync_current_file();
+}
+
+static void
+SimpleUndoLogUndo(bool cleanup)
+{
+ int bufsize;
+ char *buf;
+
+ bufsize = 1024;
+ buf = palloc(bufsize);
+
+ Assert(current_ulogfile_fd >= 0);
+
+ while (read(current_ulogfile_fd, buf, sizeof(SimpleUndoLogRecord)) ==
+ sizeof(SimpleUndoLogRecord))
+ {
+ SimpleUndoLogRecord *rec = (SimpleUndoLogRecord *) buf;
+ int readlen = rec->ul_tot_len - sizeof(SimpleUndoLogRecord);
+ int ret;
+
+ if (rec->ul_tot_len > bufsize)
+ {
+ bufsize *= 2;
+ buf = repalloc(buf, bufsize);
+ }
+
+ ret = read(current_ulogfile_fd,
+ buf + sizeof(SimpleUndoLogRecord), readlen);
+ if (ret != readlen)
+ {
+ if (ret < 0)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not read undo log file \"%s\": %m",
+ current_ulogfile_name));
+
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("reading undo log expected %d bytes, but actually %d: %s",
+ readlen, ret, current_ulogfile_name));
+
+ }
+
+ UndoRoutines[rec->ul_rmid].rm_undo(rec,
+ current_fhdr.prepared && cleanup);
+ }
+}
+
+void
+AtEOXact_SimpleUndoLog(bool isCommit, TransactionId xid)
+{
+ if (IsParallelWorker())
+ return;
+
+ if (!undolog_open_current_file(xid, true, false))
+ return;
+
+ if (!isCommit)
+ SimpleUndoLogUndo(false);
+
+ if (current_ulogfile_fd > 0)
+ {
+ if (close(current_ulogfile_fd) != 0)
+ ereport(PANIC, errcode_for_file_access(),
+ errmsg("could not close file \"%s\": %m",
+ current_ulogfile_name));
+
+ current_ulogfile_fd = -1;
+ ReleaseExternalFD();
+ durable_unlink(current_ulogfile_name, FATAL);
+ }
+
+ return;
+}
+
+void
+UndoLogCleanup(void)
+{
+ DIR *dirdesc;
+ struct dirent *de;
+ char **loglist;
+ int loglistspace = 128;
+ int loglistlen = 0;
+ int i;
+
+ loglist = palloc(sizeof(char*) * loglistspace);
+
+ dirdesc = AllocateDir(SIMPLE_UNDOLOG_DIR);
+ while ((de = ReadDir(dirdesc, SIMPLE_UNDOLOG_DIR)) != NULL)
+ {
+ if (strspn(de->d_name, "01234567890abcdef") < strlen(de->d_name))
+ continue;
+
+ if (loglistlen >= loglistspace)
+ {
+ loglistspace *= 2;
+ loglist = repalloc(loglist, sizeof(char*) * loglistspace);
+ }
+ loglist[loglistlen++] = pstrdup(de->d_name);
+ }
+
+ for (i = 0 ; i < loglistlen ; i++)
+ {
+ snprintf(current_ulogfile_name, MAXPGPATH, "%s/%s",
+ SIMPLE_UNDOLOG_DIR, loglist[i]);
+ current_ulogfile_fd = BasicOpenFile(current_ulogfile_name,
+ O_RDWR | PG_BINARY |
+ XLogGetSyncBit());
+ undolog_check_file_header();
+ SimpleUndoLogUndo(true);
+ if (close(current_ulogfile_fd) != 0)
+ ereport(PANIC, errcode_for_file_access(),
+ errmsg("could not close file \"%s\": %m",
+ current_ulogfile_name));
+ current_ulogfile_fd = -1;
+
+ /* do not remove ulog files for prepared transactions */
+ if (!current_fhdr.prepared)
+ durable_unlink(current_ulogfile_name, FATAL);
+ }
+}
+
+/*
+ * Mark this xid as prepared
+ */
+void
+SimpleUndoLogSetPrpared(TransactionId xid, bool prepared)
+{
+ Assert(xid != InvalidTransactionId);
+
+ undolog_open_current_file(xid, false, true);
+ current_fhdr.prepared = prepared;
+ if (lseek(current_ulogfile_fd, 0, SEEK_SET) < 0)
+ ereport(PANIC,
+ errcode_for_file_access(),
+ errmsg("could not seek undolog file \"%s\": %m",
+ current_ulogfile_name));
+
+ if (write(current_ulogfile_fd, &current_fhdr, sizeof(current_fhdr)) < 0)
+ ereport(PANIC,
+ errcode_for_file_access(),
+ errmsg("could not write undolog file \"%s\": %m",
+ current_ulogfile_name));
+
+ undolog_sync_current_file();
+}
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 8426458f7f..bc9cdfc41a 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -78,6 +78,7 @@
#include "access/commit_ts.h"
#include "access/htup_details.h"
+#include "access/simpleundolog.h"
#include "access/subtrans.h"
#include "access/transam.h"
#include "access/twophase.h"
@@ -1603,6 +1604,8 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
abortstats,
gid);
+ AtEOXact_SimpleUndoLog(isCommit, xid);
+
ProcArrayRemove(proc, latestXid);
/*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 464858117e..1371df44b2 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -24,6 +24,7 @@
#include "access/multixact.h"
#include "access/parallel.h"
#include "access/subtrans.h"
+#include "access/simpleundolog.h"
#include "access/transam.h"
#include "access/twophase.h"
#include "access/xact.h"
@@ -2224,6 +2225,9 @@ CommitTransaction(void)
*/
smgrDoPendingSyncs(true, is_parallel_worker);
+ /* Likewise perform uncommitted storage file deletion. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2366,6 +2370,7 @@ CommitTransaction(void)
AtEOXact_on_commit_actions(true);
AtEOXact_Namespace(true, is_parallel_worker);
AtEOXact_SMgr();
+ AtEOXact_SimpleUndoLog(true, GetCurrentTransactionIdIfAny());
AtEOXact_Files(true);
AtEOXact_ComboCid();
AtEOXact_HashTables(true);
@@ -2475,6 +2480,9 @@ PrepareTransaction(void)
*/
smgrDoPendingSyncs(true, false);
+ /* Likewise perform uncommitted storage file deletion. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2799,6 +2807,7 @@ AbortTransaction(void)
AfterTriggerEndXact(false); /* 'false' means it's abort */
AtAbort_Portals();
smgrDoPendingSyncs(false, is_parallel_worker);
+ smgrDoPendingCleanups(false);
AtEOXact_LargeObject(false);
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
@@ -2866,6 +2875,7 @@ AbortTransaction(void)
AtEOXact_on_commit_actions(false);
AtEOXact_Namespace(false, is_parallel_worker);
AtEOXact_SMgr();
+ AtEOXact_SimpleUndoLog(false, GetCurrentTransactionIdIfAny());
AtEOXact_Files(false);
AtEOXact_ComboCid();
AtEOXact_HashTables(false);
@@ -5003,6 +5013,8 @@ CommitSubTransaction(void)
AtEOSubXact_Inval(true);
AtSubCommit_smgr();
+ AtEOXact_SimpleUndoLog(true, GetCurrentTransactionIdIfAny());
+
/*
* The only lock we actually release here is the subtransaction XID lock.
*/
@@ -5196,6 +5208,7 @@ AbortSubTransaction(void)
RESOURCE_RELEASE_AFTER_LOCKS,
false, false);
AtSubAbort_smgr();
+ AtEOXact_SimpleUndoLog(false, GetCurrentTransactionIdIfAny());
AtEOXact_GUC(false, s->gucNestLevel);
AtEOSubXact_SPI(false, s->subTransactionId);
@@ -5676,7 +5689,10 @@ XactLogCommitRecord(TimestampTz commit_time,
if (!TransactionIdIsValid(twophase_xid))
info = XLOG_XACT_COMMIT;
else
+ {
+ elog(LOG, "COMMIT PREPARED: %d", twophase_xid);
info = XLOG_XACT_COMMIT_PREPARED;
+ }
/* First figure out and collect all the information needed */
@@ -6076,6 +6092,8 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
DropRelationFiles(parsed->xlocators, parsed->nrels, true);
}
+ AtEOXact_SimpleUndoLog(true, xid);
+
if (parsed->nstats > 0)
{
/* see equivalent call for relations above */
@@ -6187,6 +6205,8 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
DropRelationFiles(parsed->xlocators, parsed->nrels, true);
}
+ AtEOXact_SimpleUndoLog(false, xid);
+
if (parsed->nstats > 0)
{
/* see equivalent call for relations above */
@@ -6252,6 +6272,10 @@ xact_redo(XLogReaderState *record)
}
else if (info == XLOG_XACT_PREPARE)
{
+ xl_xact_prepare *xlrec = (xl_xact_prepare *) XLogRecGetData(record);
+
+ AtEOXact_SimpleUndoLog(true, xlrec->xid);
+
/*
* Store xid and start/end pointers of the WAL record in TwoPhaseState
* gxact entry.
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c5f51849ee..7bb712c0ae 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -51,6 +51,7 @@
#include "access/heaptoast.h"
#include "access/multixact.h"
#include "access/rewriteheap.h"
+#include "access/simpleundolog.h"
#include "access/subtrans.h"
#include "access/timeline.h"
#include "access/transam.h"
@@ -5539,6 +5540,12 @@ StartupXLOG(void)
/* Check that the GUCs used to generate the WAL allow recovery */
CheckRequiredParameterValues();
+ /*
+ * Perform undo processing. This must be done before resetting unlogged
+ * relations.
+ */
+ UndoLogCleanup();
+
/*
* We're in recovery, so unlogged relations may be trashed and must be
* reset. This should be done BEFORE allowing Hot Standby
@@ -5684,14 +5691,17 @@ StartupXLOG(void)
}
/*
- * Reset unlogged relations to the contents of their INIT fork. This is
- * done AFTER recovery is complete so as to include any unlogged relations
- * created during recovery, but BEFORE recovery is marked as having
- * completed successfully. Otherwise we'd not retry if any of the post
- * end-of-recovery steps fail.
+ * Process undo logs left after recovery, then reset unlogged relations to
+ * the contents of their INIT fork. This is done AFTER recovery is complete
+ * so as to include any file creations during recovery, but BEFORE recovery
+ * is marked as having completed successfully. Otherwise we'd not retry if
+ * any of the post end-of-recovery steps fail.
*/
if (InRecovery)
+ {
+ UndoLogCleanup();
ResetUnloggedRelations(UNLOGGED_RELATION_INIT);
+ }
/*
* Pre-scan prepared transactions to find out the range of XIDs present.
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index b155c03386..03553c4980 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,16 +19,20 @@
#include "postgres.h"
+#include "access/amapi.h"
#include "access/parallel.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
#include "access/xlogutils.h"
+#include "access/simpleundolog.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+#include "catalog/storage_ulog.h"
#include "miscadmin.h"
#include "storage/freespace.h"
+#include "storage/reinit.h"
#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
@@ -66,6 +70,19 @@ typedef struct PendingRelDelete
struct PendingRelDelete *next; /* linked-list link */
} PendingRelDelete;
+#define PCOP_UNLINK_FORK (1 << 0)
+
+typedef struct PendingCleanup
+{
+ RelFileLocator rlocator; /* relation that needs a cleanup */
+ int op; /* operation mask */
+ ForkNumber unlink_forknum; /* forknum to unlink */
+ BackendId backend; /* InvalidBackendId if not a temp rel */
+ bool atCommit; /* T=delete at commit; F=delete at abort */
+ int nestLevel; /* xact nesting level of request */
+ struct PendingCleanup *next; /* linked-list link */
+} PendingCleanup;
+
typedef struct PendingRelSync
{
RelFileLocator rlocator;
@@ -73,6 +90,7 @@ typedef struct PendingRelSync
} PendingRelSync;
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingCleanup * pendingCleanups = NULL; /* head of linked list */
static HTAB *pendingSyncHash = NULL;
@@ -148,6 +166,19 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
srel = smgropen(rlocator, backend);
smgrcreate(srel, MAIN_FORKNUM, false);
+ /* Write undo log; this is required regardless of needs_wal */
+ if (register_delete)
+ {
+ ul_uncommitted_storage ul_storage;
+
+ ul_storage.rlocator = rlocator;
+ ul_storage.forknum = MAIN_FORKNUM;
+ ul_storage.remove = true;
+ SimpleUndoLogWrite(RM_SMGR_ID, ULOG_SMGR_UNCOMMITED_STORAGE,
+ GetCurrentTransactionId(),
+ &ul_storage, sizeof(ul_storage));
+ }
+
if (needs_wal)
log_smgrcreate(&srel->smgr_rlocator.locator, MAIN_FORKNUM);
@@ -191,12 +222,32 @@ log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum)
*/
xlrec.rlocator = *rlocator;
xlrec.forkNum = forkNum;
+ xlrec.xid = GetTopTransactionId();
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, sizeof(xlrec));
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileLocator *rlocator, ForkNumber forkNum)
+{
+ xl_smgr_unlink xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file unlink.
+ */
+ xlrec.rlocator = *rlocator;
+ xlrec.forkNum = forkNum;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -711,6 +762,75 @@ smgrDoPendingDeletes(bool isCommit)
}
}
+/*
+ * smgrDoPendingCleanups() -- Clean up work that emits WAL records
+ *
+ * The operations handled in this function emit WAL records, which must be
+ * part of the current transaction.
+ */
+void
+smgrDoPendingCleanups(bool isCommit)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (pending->nestLevel < nestLevel)
+ {
+ /* outer-level entries should not be processed yet */
+ prev = pending;
+ }
+ else
+ {
+ /* unlink list entry first, so we don't retry on failure */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ /* do cleanup if called for */
+ if (pending->atCommit == isCommit)
+ {
+ SMgrRelation srel;
+
+ srel = smgropen(pending->rlocator, pending->backend);
+
+ Assert((pending->op & ~(PCOP_UNLINK_FORK)) == 0);
+
+ if (pending->op & PCOP_UNLINK_FORK)
+ {
+ BlockNumber firstblock = 0;
+
+ /*
+ * Unlink the fork file. Currently this operation is
+ * applied only to init-forks. As it is not certain that
+ * the init-fork is not loaded on shared buffers, drop all
+ * buffers for it.
+ */
+ Assert(pending->unlink_forknum == INIT_FORKNUM);
+ DropRelationBuffers(srel, &pending->unlink_forknum, 1,
+ &firstblock);
+
+ /* Don't emit WAL during recovery. */
+ if (!InRecovery)
+ log_smgrunlink(&pending->rlocator,
+ pending->unlink_forknum);
+ smgrunlink(srel, pending->unlink_forknum, false);
+ }
+ }
+
+ /* must explicitly free the list entry */
+ pfree(pending);
+ /* prev does not change */
+ }
+ }
+}
+
/*
* smgrDoPendingSyncs() -- Take care of relation syncs at end of xact.
*/
@@ -920,6 +1040,9 @@ PostPrepare_smgr(void)
/* must explicitly free the list entry */
pfree(pending);
}
+
+ /* Mark undolog as prepared */
+ SimpleUndoLogSetPrpared(GetCurrentTransactionId(), true);
}
@@ -967,10 +1090,28 @@ smgr_redo(XLogReaderState *record)
{
xl_smgr_create *xlrec = (xl_smgr_create *) XLogRecGetData(record);
SMgrRelation reln;
+ ul_uncommitted_storage ul_storage;
+
+ /* write undo log */
+ ul_storage.rlocator = xlrec->rlocator;
+ ul_storage.forknum = xlrec->forkNum;
+ ul_storage.remove = true;
+ SimpleUndoLogWrite(RM_SMGR_ID, ULOG_SMGR_UNCOMMITED_STORAGE,
+ xlrec->xid,
+ &ul_storage, sizeof(ul_storage));
reln = smgropen(xlrec->rlocator, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rlocator, InvalidBackendId);
+ smgrunlink(reln, xlrec->forkNum, true);
+ smgrclose(reln);
+ }
else if (info == XLOG_SMGR_TRUNCATE)
{
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
@@ -1062,3 +1203,33 @@ smgr_redo(XLogReaderState *record)
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
+
+void
+smgr_undo(SimpleUndoLogRecord *record, bool crash_prepared)
+{
+ uint8 info = record->ul_info;
+
+
+ if (info == ULOG_SMGR_UNCOMMITED_STORAGE)
+ {
+ ul_uncommitted_storage *ul_storage =
+ (ul_uncommitted_storage *) ULogRecGetData(record);
+
+ if (!crash_prepared)
+ {
+ SMgrRelation reln;
+
+ reln = smgropen(ul_storage->rlocator, InvalidBackendId);
+ smgrunlink(reln, ul_storage->forknum, true);
+ smgrclose(reln);
+ }
+ else
+ {
+ /* Inform reinit to ignore this file during cleanup */
+ ResetUnloggedRelationIgnore(ul_storage->rlocator);
+ }
+
+ }
+ else
+ elog(PANIC, "smgr_undo: unknown op code %u", info);
+}
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index f1cd1a38d9..5fb35bad77 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -34,6 +34,39 @@ typedef struct
RelFileNumber relnumber; /* hash key */
} unlogged_relation_entry;
+static char **ignore_files = NULL;
+static int nignore_elems = 0;
+static int nignore_files = 0;
+
+/*
+ * Identify whether the file should be ignored while resetting unlogged relations.
+ */
+static bool
+reinit_ignore_file(const char *dirname, const char *name)
+{
+ char fnamebuf[MAXPGPATH];
+ int len;
+
+ if (nignore_files == 0)
+ return false;
+
+ /*
+ * Build "dirname/name"; snprintf bounds the write and NUL-terminates.
+ */
+ snprintf(fnamebuf, sizeof(fnamebuf), "%s/%s", dirname, name);
+
+ for (int i = 0 ; i < nignore_files ; i++)
+ {
+ /* match ignoring fork part */
+ len = strlen(ignore_files[i]);
+ if (strncmp(fnamebuf, ignore_files[i], len) == 0 &&
+ (fnamebuf[len] == 0 || fnamebuf[len] == '_'))
+ return true;
+ }
+
+ return false;
+}
+
/*
* Reset unlogged relations from before the last restart.
*
@@ -204,6 +237,10 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
&forkNum, &segno))
continue;
+ /* Skip anything that undo log suggested to ignore */
+ if (reinit_ignore_file(dbspacedirname, de->d_name))
+ continue;
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -243,6 +280,10 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
&forkNum, &segno))
continue;
+ /* Skip anything that undo log suggested to ignore */
+ if (reinit_ignore_file(dbspacedirname, de->d_name))
+ continue;
+
/* We never remove the init fork. */
if (forkNum == INIT_FORKNUM)
continue;
@@ -294,6 +335,10 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
&forkNum, &segno))
continue;
+ /* Skip anything that undo log suggested to ignore */
+ if (reinit_ignore_file(dbspacedirname, de->d_name))
+ continue;
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -337,6 +382,10 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
&forkNum, &segno))
continue;
+ /* Skip anything that undo log suggested to ignore */
+ if (reinit_ignore_file(dbspacedirname, de->d_name))
+ continue;
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -366,6 +415,35 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
}
}
+/*
+ * Record relfilenodes that should be left alone during reinitializing unlogged
+ * relations.
+ */
+void
+ResetUnloggedRelationIgnore(RelFileLocator rloc)
+{
+ RelFileLocatorBackend rbloc;
+
+ if (nignore_files >= nignore_elems)
+ {
+ if (ignore_files == NULL)
+ {
+ nignore_elems = 16;
+ ignore_files = palloc(sizeof(char *) * nignore_elems);
+ }
+ else
+ {
+ nignore_elems *= 2;
+ ignore_files = repalloc(ignore_files,
+ sizeof(char *) * nignore_elems);
+ }
+ }
+
+ rbloc.backend = InvalidBackendId;
+ rbloc.locator = rloc;
+ ignore_files[nignore_files++] = relpath(rbloc, MAIN_FORKNUM);
+}
+
/*
* Basic parsing of putative relation filenames.
*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 563a0be5c7..52da360d32 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -729,6 +729,15 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+/*
+ * smgrunlink() -- unlink the storage file
+ */
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rlocator, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index ac409b0006..31747b5db8 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -305,6 +305,7 @@ void setup_signals(void);
void setup_text_search(void);
void create_data_directory(void);
void create_xlog_or_symlink(void);
+void create_ulog(void);
void warn_on_mount_point(int error);
void initialize_data_directory(void);
@@ -2933,6 +2934,21 @@ create_xlog_or_symlink(void)
free(subdirloc);
}
+/* Create undo log directory */
+void
+create_ulog(void)
+{
+ char *subdirloc;
+
+ /* form name of the place for the subdirectory */
+ subdirloc = psprintf("%s/pg_ulog", pg_data);
+
+ if (mkdir(subdirloc, pg_dir_create_mode) < 0)
+ pg_fatal("could not create directory \"%s\": %m",
+ subdirloc);
+
+ free(subdirloc);
+}
void
warn_on_mount_point(int error)
@@ -2967,6 +2983,7 @@ initialize_data_directory(void)
create_data_directory();
create_xlog_or_symlink();
+ create_ulog();
/* Create required subdirectories (other than pg_wal) */
printf(_("creating subdirectories ... "));
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 22f7351fdc..525b98899f 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -28,7 +28,7 @@
* RmgrNames is an array of the built-in resource manager names, to make error
* messages a bit nicer.
*/
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo) \
name,
static const char *const RmgrNames[RM_MAX_ID + 1] = {
diff --git a/src/bin/pg_waldump/rmgrdesc.c b/src/bin/pg_waldump/rmgrdesc.c
index 6b8c17bb4c..a21009c5b8 100644
--- a/src/bin/pg_waldump/rmgrdesc.c
+++ b/src/bin/pg_waldump/rmgrdesc.c
@@ -32,7 +32,7 @@
#include "storage/standbydefs.h"
#include "utils/relmapper.h"
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo) \
{ name, desc, identify},
static const RmgrDescData RmgrDescTable[RM_N_BUILTIN_IDS] = {
diff --git a/src/include/access/rmgr.h b/src/include/access/rmgr.h
index 3b6a497e1b..d705de9256 100644
--- a/src/include/access/rmgr.h
+++ b/src/include/access/rmgr.h
@@ -19,7 +19,7 @@ typedef uint8 RmgrId;
* Note: RM_MAX_ID must fit in RmgrId; widening that type will affect the XLOG
* file format.
*/
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo) \
symname,
typedef enum RmgrIds
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 78e6b908c6..7f0abded93 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -25,25 +25,25 @@
*/
/* symbol name, textual name, redo, desc, identify, startup, cleanup, mask, decode */
-PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL, NULL, xlog_decode)
-PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL, NULL, xact_decode)
-PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL, standby_decode)
-PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask, heap2_decode)
-PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask, heap_decode)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask, NULL)
-PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask, NULL)
-PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask, NULL)
-PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask, NULL)
-PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL, seq_mask, NULL)
-PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup, spg_mask, NULL)
-PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL, brin_mask, NULL)
-PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask, NULL)
-PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL, logicalmsg_decode)
+PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL, NULL, xlog_decode, NULL)
+PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL, NULL, xact_decode, NULL)
+PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL, NULL, NULL, smgr_undo)
+PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL, standby_decode, NULL)
+PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask, heap2_decode, NULL)
+PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask, heap_decode, NULL)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask, NULL, NULL)
+PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask, NULL, NULL)
+PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask, NULL, NULL)
+PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask, NULL, NULL)
+PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL, seq_mask, NULL, NULL)
+PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup, spg_mask, NULL, NULL)
+PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL, brin_mask, NULL, NULL)
+PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask, NULL, NULL)
+PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL, logicalmsg_decode, NULL)
diff --git a/src/include/access/simpleundolog.h b/src/include/access/simpleundolog.h
new file mode 100644
index 0000000000..3d3bd2f7e2
--- /dev/null
+++ b/src/include/access/simpleundolog.h
@@ -0,0 +1,36 @@
+#ifndef SIMPLE_UNDOLOG_H
+#define SIMPLE_UNDOLOG_H
+
+#include "access/rmgr.h"
+#include "port/pg_crc32c.h"
+
+#define SIMPLE_UNDOLOG_DIR "pg_ulog"
+
+typedef struct SimpleUndoLogRecord
+{
+ uint32 ul_tot_len; /* total length of entire record */
+ pg_crc32c ul_crc; /* CRC for this record */
+ RmgrId ul_rmid; /* resource manager for this record */
+ uint8 ul_info; /* record info */
+ TransactionId ul_xid; /* transaction id */
+ /* rmgr-specific data follow, no padding */
+} SimpleUndoLogRecord;
+
+extern void SimpleUndoLogWrite(RmgrId rmgr, uint8 info,
+ TransactionId xid, void *data, int len);
+extern void SimpleUndoLogSetPrpared(TransactionId xid, bool prepared);
+extern void AtEOXact_SimpleUndoLog(bool isCommit, TransactionId xid);
+extern void UndoLogCleanup(void);
+
+extern void AtPrepare_UndoLog(TransactionId xid);
+extern void PostPrepare_UndoLog(void);
+extern void undolog_twophase_recover(TransactionId xid, uint16 info,
+ void *recdata, uint32 len);
+extern void undolog_twophase_postcommit(TransactionId xid, uint16 info,
+ void *recdata, uint32 len);
+extern void undolog_twophase_postabort(TransactionId xid, uint16 info,
+ void *recdata, uint32 len);
+extern void undolog_twophase_standby_recover(TransactionId xid, uint16 info,
+ void *recdata, uint32 len);
+
+#endif /* SIMPLE_UNDOLOG_H */
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 72ef3ee92c..2a63eabcbd 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -25,6 +25,8 @@ extern PGDLLIMPORT int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileLocator rlocator,
char relpersistence,
bool register_delete);
+extern void RelationCreateInitFork(Relation rel);
+extern void RelationDropInitFork(Relation rel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileLocator rlocator, bool atCommit);
extern void RelationPreTruncate(Relation rel);
@@ -43,6 +45,7 @@ extern void RestorePendingSyncs(char *startAddress);
extern void smgrDoPendingDeletes(bool isCommit);
extern void smgrDoPendingSyncs(bool isCommit, bool isParallelWorker);
extern int smgrGetPendingDeletes(bool forCommit, RelFileLocator **ptr);
+extern void smgrDoPendingCleanups(bool isCommit);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/catalog/storage_ulog.h b/src/include/catalog/storage_ulog.h
new file mode 100644
index 0000000000..8e47428e66
--- /dev/null
+++ b/src/include/catalog/storage_ulog.h
@@ -0,0 +1,35 @@
+/*-------------------------------------------------------------------------
+ *
+ * storage_ulog.h
+ * prototypes for Undo Log support for backend/catalog/storage.c
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/catalog/storage_ulog.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef STORAGE_ULOG_H
+#define STORAGE_ULOG_H
+
+/* ULOG gives us high 4 bits (just following xlog) */
+#define ULOG_SMGR_UNCOMMITED_STORAGE 0x10
+
+/* undo log entry for uncommitted storage files */
+typedef struct ul_uncommitted_storage
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ bool remove;
+} ul_uncommitted_storage;
+
+/* flags for xl_smgr_truncate */
+#define SMGR_TRUNCATE_HEAP 0x0001
+
+void smgr_undo(SimpleUndoLogRecord *record, bool crash_prepared);
+
+#define ULogRecGetData(record) ((char *)record + sizeof(SimpleUndoLogRecord))
+
+#endif /* STORAGE_ULOG_H */
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index a490e05f88..807c0f8235 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -29,13 +29,21 @@
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_UNLINK 0x30
typedef struct xl_smgr_create
{
RelFileLocator rlocator;
ForkNumber forkNum;
+ TransactionId xid;
} xl_smgr_create;
+typedef struct xl_smgr_unlink
+{
+ RelFileLocator rlocator;
+ ForkNumber forkNum;
+} xl_smgr_unlink;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +59,7 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileLocator *rlocator, ForkNumber forkNum);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index 1373d509df..c57ae26b4c 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -16,9 +16,11 @@
#define REINIT_H
#include "common/relpath.h"
+#include "storage/relfilelocator.h"
extern void ResetUnloggedRelations(int op);
+extern void ResetUnloggedRelationIgnore(RelFileLocator rloc);
extern bool parse_filename_for_nontemp_relation(const char *name,
RelFileNumber *relnumber,
ForkNumber *fork,
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 527cd2a056..2eb1e3ed5e 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -88,6 +88,7 @@ extern void smgrcloserellocator(RelFileLocatorBackend rlocator);
extern void smgrrelease(SMgrRelation reln);
extern void smgrreleaseall(void);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index f582eb59e7..cf11358d8d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2009,6 +2009,7 @@ PatternInfo
PatternInfoArray
Pattern_Prefix_Status
Pattern_Type
+PendingCleanup
PendingFsyncEntry
PendingRelDelete
PendingRelSync
@@ -2573,6 +2574,7 @@ SimplePtrListCell
SimpleStats
SimpleStringList
SimpleStringListCell
+SimpleUndoLogRecord
SingleBoundSortItem
Size
SkipPages
@@ -2932,6 +2934,8 @@ ULONG
ULONG_PTR
UV
UVersionInfo
+UndoDescData
+UndoLogFileHeader
UnicodeNormalizationForm
UnicodeNormalizationQC
Unique
@@ -3852,6 +3856,7 @@ uint8
uint8_t
uint8x16_t
uintptr_t
+ul_uncommitted_storage
unicodeStyleBorderFormat
unicodeStyleColumnFormat
unicodeStyleFormat
@@ -3965,6 +3970,7 @@ xl_running_xacts
xl_seq_rec
xl_smgr_create
xl_smgr_truncate
+xl_smgr_unlink
xl_standby_lock
xl_standby_locks
xl_tblspc_create_rec
--
2.39.3
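
For readers following the reinit.c part of the patch above, the matching
rule used by reinit_ignore_file() can be tried outside the backend. The
following is only a minimal standalone sketch, not the patch's code: the
names ignore_paths, n_ignore_paths and path_is_ignored are hypothetical,
and the two sample paths are made up for illustration. It assumes the
ignore list holds main-fork relation paths, as recorded by
ResetUnloggedRelationIgnore(), and treats a file as ignored when its path
is exactly such an entry or that entry followed by a fork suffix.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the ignore_files array built in reinit.c. */
static const char *const ignore_paths[] = {"base/5/16384", "base/5/16390"};
static const int n_ignore_paths = 2;

/*
 * Return true if "path" is the main fork of an ignored relation, or one of
 * its other forks (the entry followed by '_', e.g. "_init" or "_fsm").
 */
static bool
path_is_ignored(const char *path)
{
	for (int i = 0; i < n_ignore_paths; i++)
	{
		size_t		len = strlen(ignore_paths[i]);

		/* prefix must match, and must end exactly there or at a fork suffix */
		if (strncmp(path, ignore_paths[i], len) == 0 &&
			(path[len] == '\0' || path[len] == '_'))
			return true;
	}

	return false;
}
```

With the sample list, "base/5/16384" and "base/5/16384_init" are treated as
ignored, while "base/5/163840" is not, because the character following the
matched prefix is neither the end of the string nor '_'. Under the same
rule a segment file name such as "base/5/16384.1" does not match either;
the sketch only mirrors the rule and does not settle whether that case
needs handling.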
Attachment: v31-0003-In-place-table-persistence-change.patch (text/x-patch; charset=us-ascii)
From 7d0ab70fff64fa38209932a05d8d4e2e2193d8ec Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Mon, 4 Sep 2023 17:23:05 +0900
Subject: [PATCH v31 3/3] In-place table persistence change
Previously, ALTER TABLE SET LOGGED/UNLOGGED caused a large amount of
file I/O due to heap rewrites, even though SET UNLOGGED does not
require any data rewrite. This patch eliminates the need for such
rewrites. Additionally, ALTER TABLE SET LOGGED now emits XLOG_FPI
records instead of numerous HEAP_INSERT records when wal_level >
minimal, reducing resource consumption.
---
src/backend/access/rmgrdesc/smgrdesc.c | 12 +
src/backend/catalog/storage.c | 338 ++++++++++++++++++++++++-
src/backend/commands/tablecmds.c | 269 +++++++++++++++++---
src/backend/storage/buffer/bufmgr.c | 84 ++++++
src/bin/pg_rewind/parsexlog.c | 6 +
src/include/catalog/storage_xlog.h | 10 +
src/include/storage/bufmgr.h | 3 +
src/include/storage/reinit.h | 2 +-
src/tools/pgindent/typedefs.list | 1 +
9 files changed, 684 insertions(+), 41 deletions(-)
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 71410e0a2d..77a8fdb045 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,15 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+ char *path = relpathperm(xlrec->rlocator, MAIN_FORKNUM);
+
+ appendStringInfoString(buf, path);
+ appendStringInfo(buf, " persistence %d", xlrec->persistence);
+ pfree(path);
+ }
}
const char *
@@ -55,6 +64,9 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_BUFPERSISTENCE:
+ id = "BUFPERSISTENCE";
+ break;
}
return id;
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 03553c4980..6616466f61 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -71,11 +71,13 @@ typedef struct PendingRelDelete
} PendingRelDelete;
#define PCOP_UNLINK_FORK (1 << 0)
+#define PCOP_SET_PERSISTENCE (1 << 1)
typedef struct PendingCleanup
{
RelFileLocator rlocator; /* relation that need a cleanup */
int op; /* operation mask */
+ bool bufpersistence; /* buffer persistence to set */
ForkNumber unlink_forknum; /* forknum to unlink */
BackendId backend; /* InvalidBackendId if not a temp rel */
bool atCommit; /* T=delete at commit; F=delete at abort */
@@ -209,6 +211,208 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
return srel;
}
+/*
+ * RelationCreateInitFork
+ * Create physical storage for the init fork of a relation.
+ *
+ * Create the init fork for the relation.
+ *
+ * This function is transactional. The creation is WAL-logged, and if the
+ * transaction aborts later on, the init fork will be removed.
+ */
+void
+RelationCreateInitFork(Relation rel)
+{
+ RelFileLocator rlocator = rel->rd_locator;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ SMgrRelation srel;
+ ul_uncommitted_storage ul_storage;
+ bool create = true;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), false, false);
+
+ /*
+ * If a pending-unlink exists for this relation's init-fork, it indicates
+ * the init-fork existed before the current transaction; this function
+ * reverts the pending-unlink by removing the entry. See
+ * RelationDropInitFork.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileLocatorEquals(rlocator, pending->rlocator) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ /* write cancel log for preceding undo log entry */
+ ul_storage.rlocator = rlocator;
+ ul_storage.forknum = INIT_FORKNUM;
+ ul_storage.remove = false;
+ SimpleUndoLogWrite(RM_SMGR_ID, ULOG_SMGR_UNCOMMITED_STORAGE,
+ GetCurrentTransactionId(),
+ &ul_storage, sizeof(ul_storage));
+
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ create = false;
+ }
+ else
+ prev = pending;
+ }
+
+ if (!create)
+ return;
+
+ /* create undo log entry, then the init fork */
+ srel = smgropen(rlocator, InvalidBackendId);
+
+ /* write undo log */
+ ul_storage.rlocator = rlocator;
+ ul_storage.forknum = INIT_FORKNUM;
+ ul_storage.remove = true;
+ SimpleUndoLogWrite(RM_SMGR_ID, ULOG_SMGR_UNCOMMITED_STORAGE,
+ GetCurrentTransactionId(),
+ &ul_storage, sizeof(ul_storage));
+
+ /* We don't have existing init fork, create it. */
+ smgrcreate(srel, INIT_FORKNUM, false);
+
+ /*
+ * For index relations, WAL-logging and file sync are handled by
+ * ambuildempty. In contrast, for heap relations, these tasks are performed
+ * directly.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rlocator, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /* drop the init fork, mark file then revert persistence at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = rlocator;
+ pending->op = PCOP_UNLINK_FORK | PCOP_SET_PERSISTENCE;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->bufpersistence = true;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
+/*
+ * RelationDropInitFork
+ * Delete physical storage for the init fork of a relation.
+ */
+void
+RelationDropInitFork(Relation rel)
+{
+ RelFileLocator rlocator = rel->rd_locator;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ bool inxact_created = false;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), true, false);
+
+ /*
+ * Search for a pending-unlink associated with the init-fork of the
+ * relation. Its presence indicates that the init-fork was created within
+ * the current transaction.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileLocatorEquals(rlocator, pending->rlocator) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ ul_uncommitted_storage ul_storage;
+
+ /* write cancel log for preceding undo log entry */
+ ul_storage.rlocator = rlocator;
+ ul_storage.forknum = INIT_FORKNUM;
+ ul_storage.remove = false;
+ SimpleUndoLogWrite(RM_SMGR_ID, ULOG_SMGR_UNCOMMITED_STORAGE,
+ GetCurrentTransactionId(),
+ &ul_storage, sizeof(ul_storage));
+
+ /* unlink list entry */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+
+ /* prev does not change */
+
+ inxact_created = true;
+ }
+ else
+ prev = pending;
+ }
+
+ /*
+ * If the init-fork was created in this transaction, remove the init-fork
+ * and cancel preceding undo log. Otherwise, register an at-commit
+ * pending-unlink for the existing init-fork. See RelationCreateInitFork.
+ */
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rlocator, InvalidBackendId);
+ ForkNumber forknum = INIT_FORKNUM;
+ BlockNumber firstblock = 0;
+ ul_uncommitted_storage ul_storage;
+
+ /*
+ * Some AMs initialize init-fork via the buffer manager. To properly
+ * drop the init-fork, first drop all buffers for the init-fork, then
+ * unlink the init-fork and cancel preceding undo log.
+ */
+ DropRelationBuffers(srel, &forknum, 1, &firstblock);
+
+ /* cancel existing undo log */
+ ul_storage.rlocator = rlocator;
+ ul_storage.forknum = INIT_FORKNUM;
+ ul_storage.remove = false;
+ SimpleUndoLogWrite(RM_SMGR_ID, ULOG_SMGR_UNCOMMITED_STORAGE,
+ GetCurrentTransactionId(),
+ &ul_storage, sizeof(ul_storage));
+ log_smgrunlink(&rlocator, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ return;
+ }
+
+ /* register drop of this init fork file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = rlocator;
+ pending->op = PCOP_UNLINK_FORK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -248,6 +452,25 @@ log_smgrunlink(const RelFileLocator *rlocator, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileLocator rlocator, bool persistence)
+{
+ xl_smgr_bufpersistence xlrec;
+
+ /*
+ * Make an XLOG entry reporting the change of buffer persistence.
+ */
+ xlrec.rlocator = rlocator;
+ xlrec.persistence = persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -800,7 +1023,14 @@ smgrDoPendingCleanups(bool isCommit)
srel = smgropen(pending->rlocator, pending->backend);
- Assert((pending->op & ~(PCOP_UNLINK_FORK)) == 0);
+ Assert((pending->op &
+ ~(PCOP_UNLINK_FORK | PCOP_SET_PERSISTENCE)) == 0);
+
+ if (pending->op & PCOP_SET_PERSISTENCE)
+ {
+ SetRelationBuffersPersistence(srel, pending->bufpersistence,
+ InRecovery);
+ }
if (pending->op & PCOP_UNLINK_FORK)
{
@@ -1200,6 +1430,112 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ PendingCleanup *prev = NULL;
+
+ reln = smgropen(xlrec->rlocator, InvalidBackendId);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+
+ /*
+ * Delete any pending action for persistence change, if present. There
+ * should be at most one entry for this action.
+ */
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileLocatorEquals(xlrec->rlocator, pending->rlocator) &&
+ (pending->op & PCOP_SET_PERSISTENCE) != 0)
+ {
+ Assert(pending->bufpersistence == xlrec->persistence);
+
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+
+ /*
+ * During abort, revert any changes to buffer persistence made in
+ * this transaction.
+ */
+ if (!pending)
+ {
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = xlrec->rlocator;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = !xlrec->persistence;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ PendingCleanup *prev = NULL;
+
+ reln = smgropen(xlrec->rlocator, InvalidBackendId);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+
+ /*
+ * Delete any pending action for persistence change, if present. There
+ * should be at most one entry for this action.
+ */
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileLocatorEquals(xlrec->rlocator, pending->rlocator) &&
+ (pending->op & PCOP_SET_PERSISTENCE) != 0)
+ {
+ Assert(pending->bufpersistence == xlrec->persistence);
+
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+
+ /*
+ * During abort, revert any changes to buffer persistence made in
+ * this transaction.
+ */
+ if (!pending)
+ {
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = xlrec->rlocator;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = !xlrec->persistence;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index fdcd09bc5e..ea750812cc 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -55,6 +55,7 @@
#include "commands/defrem.h"
#include "commands/event_trigger.h"
#include "commands/policy.h"
+#include "commands/progress.h"
#include "commands/sequence.h"
#include "commands/tablecmds.h"
#include "commands/tablespace.h"
@@ -5661,6 +5662,189 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
return newcmd;
}
+/*
+ * RelationChangePersistence: perform in-place persistence change of a relation
+ */
+static void
+RelationChangePersistence(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ int i;
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * The caller must use ATRewriteTable instead of this function if any of
+ * the following conditions holds.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ /*
+ * Initially, gather all relations that require a persistence change.
+ */
+
+ /* Collect OIDs of indexes and toast relations */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ /* Add toast relation if any */
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ List *toastidx;
+ Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
+
+ relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
+ toastidx = RelationGetIndexList(toastrel);
+ relids = list_concat(relids, toastidx);
+ pfree(toastidx);
+ table_close(toastrel, NoLock);
+ }
+
+ table_close(rel, NoLock);
+
+ /* Make changes in storage */
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach(lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+
+ /*
+ * XXXX: Some access methods don't support in-place persistence
+ * changes. GiST uses page LSNs to figure out whether a block has been
+ * modified. However, UNLOGGED GiST indexes use fake LSNs, which are
+ * incompatible with the real LSNs used for LOGGED indexes.
+ *
+ * Potentially, if gistGetFakeLSN behaved similarly for both permanent
+ * and unlogged indexes, we could avoid index rebuilds by emitting
+ * extra WAL records while the index is unlogged.
+ *
+ * Compare relam against a positive list to ensure the hard way is
+ * taken for unknown AMs.
+ */
+ if (r->rd_rel->relkind == RELKIND_INDEX &&
+ /* GiST is excluded */
+ r->rd_rel->relam != BTREE_AM_OID &&
+ r->rd_rel->relam != HASH_AM_OID &&
+ r->rd_rel->relam != GIN_AM_OID &&
+ r->rd_rel->relam != SPGIST_AM_OID &&
+ r->rd_rel->relam != BRIN_AM_OID)
+ {
+ int reindex_flags;
+ ReindexParams params = {0};
+
+ /* reindex doesn't allow concurrent use of the index */
+ table_close(r, NoLock);
+
+ reindex_flags =
+ REINDEX_REL_SUPPRESS_INDEX_USE |
+ REINDEX_REL_CHECK_CONSTRAINTS;
+
+ /* Set the same persistence with the parent relation. */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+
+ /* this doesn't fire REINDEX event trigger */
+ reindex_index(NULL, reloid, reindex_flags, persistence, &params);
+
+ continue;
+ }
+
+ /* Create or drop init fork */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ RelationCreateInitFork(r);
+ else
+ RelationDropInitFork(r);
+
+ /*
+ * If this relation becomes WAL-logged, immediately sync all files
+ * except the init-fork to establish the initial state on storage. The
+ * buffers should have already been flushed out by
+ * RelationCreate(Drop)InitFork called just above. The init-fork should
+ * already be synchronized as required.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT)
+ {
+ for (i = 0; i < INIT_FORKNUM; i++)
+ {
+ if (smgrexists(RelationGetSmgr(r), i))
+ smgrimmedsync(RelationGetSmgr(r), i);
+ }
+ }
+
+ /* Update catalog */
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+
+ /*
+ * If wal_level >= replica, switching to LOGGED necessitates WAL-logging
+ * the relation content for later recovery. This is not emitted when
+ * wal_level = minimal.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded())
+ {
+ ForkNumber fork;
+ xl_smgr_truncate xlrec;
+
+ xlrec.blkno = 0;
+ xlrec.rlocator = r->rd_locator;
+ xlrec.flags = SMGR_TRUNCATE_ALL;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ for (fork = 0; fork < INIT_FORKNUM; fork++)
+ {
+ if (smgrexists(RelationGetSmgr(r), fork))
+ log_newpage_range(r, fork, 0,
+ smgrnblocks(RelationGetSmgr(r), fork),
+ false);
+ }
+ }
+
+ table_close(r, NoLock);
+ }
+
+ table_close(classRel, NoLock);
+}
+
/*
* ATRewriteTables: ALTER TABLE phase 3
*/
@@ -5791,48 +5975,55 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- /*
- * Create transient table that will receive the modified data.
- *
- * Ensure it is marked correctly as logged or unlogged. We have
- * to do this here so that buffers for the new relfilenumber will
- * have the right persistence set, and at the same time ensure
- * that the original filenumbers's buffers will get read in with
- * the correct setting (i.e. the original one). Otherwise a
- * rollback after the rewrite would possibly result with buffers
- * for the original filenumbers having the wrong persistence
- * setting.
- *
- * NB: This relies on swap_relation_files() also swapping the
- * persistence. That wouldn't work for pg_class, but that can't be
- * unlogged anyway.
- */
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
- persistence, lockmode);
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
+ RelationChangePersistence(tab, persistence, lockmode);
+ else
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new
+ * relfilenumber will have the right persistence set, and at
+ * the same time ensure that the original filenumbers's buffers
+ * will get read in with the correct setting (i.e. the original
+ * one). Otherwise a rollback after the rewrite would possibly
+ * result with buffers for the original filenumbers having the
+ * wrong persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that
+ * can't be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace,
+ NewAccessMethod,
+ persistence, lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
+ /*
+ * Swap the physical files of the old and new heaps, then
+ * rebuild indexes and discard the old heap. We can use
+ * RecentXmin for the table's new relfrozenxid because we
+ * rewrote all the tuples in ATRewriteTable, so no older Xid
+ * remains in the table. Also, we never try to swap toast
+ * tables by content, since we have no interest in letting
+ * this code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
- InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ }
}
else if (tab->rewrite > 0 && tab->relkind == RELKIND_SEQUENCE)
{
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7d601bef6d..4de1db412c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3804,6 +3804,90 @@ DropRelationBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
}
}
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a relation
+ * then writes all dirty pages to disk (or kernel disk buffers) when
+ * switching to PERMANENT, ensuring the kernel has an up-to-date view of
+ * the relation.
+ *
+ * The caller must be holding AccessExclusiveLock on the target relation
+ * to ensure no other backend is busy dirtying more blocks.
+ *
+ * XXX currently it sequentially searches the buffer pool; consider
+ * implementing more efficient search methods. This routine isn't used in
+ * performance-critical code paths, so it's not worth additional overhead
+ * to make it go faster; see also DropRelationBuffers.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
+{
+ int i;
+ RelFileLocatorBackend rlocator = srel->smgr_rlocator;
+
+ Assert(!RelFileLocatorBackendIsTemp(rlocator));
+
+ if (!isRedo)
+ log_smgrbufpersistence(srel->smgr_rlocator.locator, permanent);
+
+ ResourceOwnerEnlarge(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ if (!RelFileLocatorEquals(BufTagGetRelFileLocator(&bufHdr->tag),
+ rlocator.locator))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (!RelFileLocatorEquals(BufTagGetRelFileLocator(&bufHdr->tag),
+ rlocator.locator))
+ {
+ UnlockBufHdr(bufHdr, buf_state);
+ continue;
+ }
+
+ if (permanent)
+ {
+ /* The init fork is being dropped, drop buffers for it. */
+ if (BufTagGetForkNum(&bufHdr->tag) == INIT_FORKNUM)
+ {
+ InvalidateBuffer(bufHdr);
+ continue;
+ }
+
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /* flush this buffer when switching to PERMANENT */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, srel, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ /* There shouldn't be an init fork */
+ Assert(BufTagGetForkNum(&bufHdr->tag) != INIT_FORKNUM);
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ }
+}
+
/* ---------------------------------------------------------------------
* DropRelationsAllBuffers
*
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 525b98899f..c8c9cc361f 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -418,6 +418,12 @@ extractPageInfo(XLogReaderState *record)
* source system.
*/
}
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ /*
+ * We can safely ignore these. These don't make any on-disk changes.
+ */
+ }
else if (rmid == RM_XACT_ID &&
((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT ||
(rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED ||
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index 807c0f8235..b38909ceb3 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -14,6 +14,7 @@
#ifndef STORAGE_XLOG_H
#define STORAGE_XLOG_H
+#include "access/simpleundolog.h"
#include "access/xlogreader.h"
#include "lib/stringinfo.h"
#include "storage/block.h"
@@ -30,6 +31,7 @@
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
#define XLOG_SMGR_UNLINK 0x30
+#define XLOG_SMGR_BUFPERSISTENCE 0x40
typedef struct xl_smgr_create
{
@@ -44,6 +46,12 @@ typedef struct xl_smgr_unlink
ForkNumber forkNum;
} xl_smgr_unlink;
+typedef struct xl_smgr_bufpersistence
+{
+ RelFileLocator rlocator;
+ bool persistence;
+} xl_smgr_bufpersistence;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -60,6 +68,8 @@ typedef struct xl_smgr_truncate
extern void log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum);
extern void log_smgrunlink(const RelFileLocator *rlocator, ForkNumber forkNum);
+extern void log_smgrbufpersistence(const RelFileLocator rlocator,
+ bool persistence);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d335..62f4fe430b 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -224,6 +224,9 @@ extern void DropRelationBuffers(struct SMgrRelationData *smgr_reln,
int nforks, BlockNumber *firstDelBlock);
extern void DropRelationsAllBuffers(struct SMgrRelationData **smgr_reln,
int nlocators);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+ bool permanent, bool isRedo);
+
extern void DropDatabaseBuffers(Oid dbid);
#define RelationGetNumberOfBlocks(reln) \
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index c57ae26b4c..746d3a910a 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -20,11 +20,11 @@
extern void ResetUnloggedRelations(int op);
-extern void ResetUnloggedRelationIgnore(RelFileLocator rloc);
extern bool parse_filename_for_nontemp_relation(const char *name,
RelFileNumber *relnumber,
ForkNumber *fork,
unsigned *segno);
+extern void ResetUnloggedRelationIgnore(RelFileLocator rloc);
#define UNLOGGED_RELATION_CLEANUP 0x0001
#define UNLOGGED_RELATION_INIT 0x0002
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index cf11358d8d..cf0b0dd51b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3968,6 +3968,7 @@ xl_replorigin_set
xl_restore_point
xl_running_xacts
xl_seq_rec
+xl_smgr_bufpersistence
xl_smgr_create
xl_smgr_truncate
xl_smgr_unlink
--
2.39.3
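For readers following the storage.c part of the patch above, the cancel-or-register handling that RelationCreateInitFork and RelationDropInitFork apply to the pending-cleanup list can be looked at in isolation. The following is only a minimal standalone sketch, not code from the patch: PendingOp, cancel_pending, register_pending and request_init_fork are hypothetical names, malloc/free stand in for the backend memory context, the relation is reduced to a plain integer, and the undo-log and WAL writes are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for one pending-cleanup list entry. */
typedef struct PendingOp
{
    unsigned    relid;          /* simplified relation identifier */
    int         forknum;        /* fork to unlink */
    bool        at_commit;      /* true = act at commit, false = at abort */
    struct PendingOp *next;
} PendingOp;

/*
 * Remove every entry matching (relid, forknum) from the list.
 * Returns true if at least one entry was removed.
 */
static bool
cancel_pending(PendingOp **head, unsigned relid, int forknum)
{
    PendingOp  *prev = NULL;
    PendingOp  *cur;
    PendingOp  *next;
    bool        found = false;

    assert(head != NULL);

    for (cur = *head; cur != NULL; cur = next)
    {
        next = cur->next;       /* save the link before cur may be freed */

        if (cur->relid == relid && cur->forknum == forknum)
        {
            /* unlink the entry; prev intentionally does not change */
            if (prev)
                prev->next = next;
            else
                *head = next;
            free(cur);
            found = true;
        }
        else
            prev = cur;
    }

    return found;
}

/* Push a new entry at the head of the list. */
static void
register_pending(PendingOp **head, unsigned relid, int forknum, bool at_commit)
{
    PendingOp  *p = malloc(sizeof(PendingOp));

    assert(head != NULL);
    if (p == NULL)
        abort();

    p->relid = relid;
    p->forknum = forknum;
    p->at_commit = at_commit;
    p->next = *head;
    *head = p;
}

/*
 * Ask for an init fork.  If a pending unlink for it is already queued, the
 * fork exists on disk, so cancelling the unlink is enough.  Otherwise the
 * caller has to create the fork, and an abort-time unlink is queued for it.
 * Returns true if the caller must physically create the fork.
 */
static bool
request_init_fork(PendingOp **head, unsigned relid, int init_forknum)
{
    if (cancel_pending(head, relid, init_forknum))
        return false;

    register_pending(head, relid, init_forknum, false);
    return true;
}
```

The point of the sketch is the traversal shape: the next pointer is saved before an entry can be freed, and prev only advances over entries that stay in the list, so several matching entries can be removed in one pass.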
2024-01 Commitfest.
Hi, This patch has a CF status of "Needs Review" [1], but it seems
there was a CFbot test failure last time it was run [2]. Please have a
look and post an updated version if necessary.
======
[1]: https://commitfest.postgresql.org/46/3461/
[2]: https://cirrus-ci.com/task/6050020441456640
Kind Regards,
Peter Smith.
At Mon, 22 Jan 2024 15:36:31 +1100, Peter Smith <smithpb2250@gmail.com> wrote in
2024-01 Commitfest.
Hi, This patch has a CF status of "Needs Review" [1], but it seems
there was a CFbot test failure last time it was run [2]. Please have a
look and post an updated version if necessary.
Thanks! I have added the necessary includes to the header file this
patch adds. With this change, "make headerscheck" now passes. However,
when I run "make cpluspluscheck" in my environment, it generates a
large number of errors in other areas, none of which appear to be
related to this patch.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v32-0001-Export-wal_sync_method-related-functions.patch (text/x-patch; charset=us-ascii)
From 9a2b6fbda882587c127d3e50bccf89508837d1a5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Mon, 15 Jan 2024 15:57:53 +0900
Subject: [PATCH v32 1/3] Export wal_sync_method related functions
Export several functions related to wal_sync_method for use in
subsequent commits. Since PG_O_DIRECT cannot be used in those commits,
the new function XLogGetSyncBit() will mask PG_O_DIRECT.
---
src/backend/access/transam/xlog.c | 73 +++++++++++++++++++++----------
src/include/access/xlog.h | 2 +
2 files changed, 52 insertions(+), 23 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 478377c4a2..c5f51849ee 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8403,21 +8403,29 @@ assign_wal_sync_method(int new_wal_sync_method, void *extra)
}
}
+/*
+ * Exported version of get_sync_bit()
+ *
+ * Do not expose PG_O_DIRECT for uses outside xlog.c.
+ */
+int
+XLogGetSyncBit(void)
+{
+ return get_sync_bit(wal_sync_method) & ~PG_O_DIRECT;
+}
+
/*
- * Issue appropriate kind of fsync (if any) for an XLOG output file.
+ * Issue appropriate kind of fsync (if any) according to wal_sync_method.
*
- * 'fd' is a file descriptor for the XLOG file to be fsync'd.
- * 'segno' is for error reporting purposes.
+ * 'fd' is a file descriptor for the file to be fsync'd.
*/
-void
-issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
+const char *
+XLogFsyncFile(int fd)
{
- char *msg = NULL;
+ const char *msg = NULL;
instr_time start;
- Assert(tli != 0);
-
/*
* Quick exit if fsync is disabled or write() has already synced the WAL
* file.
@@ -8425,7 +8433,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
if (!enableFsync ||
wal_sync_method == WAL_SYNC_METHOD_OPEN ||
wal_sync_method == WAL_SYNC_METHOD_OPEN_DSYNC)
- return;
+ return NULL;
/* Measure I/O timing to sync the WAL file */
if (track_wal_io_timing)
@@ -8460,19 +8468,6 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
break;
}
- /* PANIC if failed to fsync */
- if (msg)
- {
- char xlogfname[MAXFNAMELEN];
- int save_errno = errno;
-
- XLogFileName(xlogfname, tli, segno, wal_segment_size);
- errno = save_errno;
- ereport(PANIC,
- (errcode_for_file_access(),
- errmsg(msg, xlogfname)));
- }
-
pgstat_report_wait_end();
/*
@@ -8486,7 +8481,39 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, start);
}
- PendingWalStats.wal_sync++;
+ if (msg == NULL)
+ PendingWalStats.wal_sync++;
+
+ return msg;
+}
+
+/*
+ * Issue appropriate kind of fsync (if any) for an XLOG output file.
+ *
+ * 'fd' is a file descriptor for the XLOG file to be fsync'd.
+ * 'segno' is for error reporting purposes.
+ */
+void
+issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
+{
+ const char *msg;
+
+ Assert(tli != 0);
+
+ msg = XLogFsyncFile(fd);
+
+ /* PANIC if failed to fsync */
+ if (msg)
+ {
+ char xlogfname[MAXFNAMELEN];
+ int save_errno = errno;
+
+ XLogFileName(xlogfname, tli, segno, wal_segment_size);
+ errno = save_errno;
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg(msg, xlogfname)));
+ }
}
/*
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 301c5fa11f..2a0d65b537 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -217,6 +217,8 @@ extern void xlog_redo(struct XLogReaderState *record);
extern void xlog_desc(StringInfo buf, struct XLogReaderState *record);
extern const char *xlog_identify(uint8 info);
+extern int XLogGetSyncBit(void);
+extern const char *XLogFsyncFile(int fd);
extern void issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli);
extern bool RecoveryInProgress(void);
--
2.39.3
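The refactoring in the 0001 patch above follows a common pattern: the low-level helper reports failure by returning a message (NULL on success) and leaves the error policy to its caller, while an accessor masks out a flag bit that callers outside the module must not see. Below is a minimal standalone sketch of that pattern under stated assumptions. The names SYNC_FLAG_DSYNC, SYNC_FLAG_DIRECT, get_public_sync_bits, sync_file and sync_or_warn are hypothetical, a plain POSIX fsync() stands in for the wal_sync_method dispatch, and a warning on stderr stands in for ereport().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical open-flag bits; DIRECT must stay private to this module. */
#define SYNC_FLAG_DSYNC   0x01
#define SYNC_FLAG_DIRECT  0x02

static int  configured_sync_bits = SYNC_FLAG_DSYNC | SYNC_FLAG_DIRECT;

/* Exported accessor: same bits as configured, with DIRECT masked out. */
static int
get_public_sync_bits(void)
{
    return configured_sync_bits & ~SYNC_FLAG_DIRECT;
}

/*
 * Sync the file and return NULL on success, or a message describing the
 * failure.  errno is left as set by fsync() so the caller can report it.
 */
static const char *
sync_file(int fd)
{
    if (fsync(fd) != 0)
        return "could not fsync file";

    return NULL;
}

/*
 * One possible caller policy: warn and carry on.  A WAL writer would
 * instead escalate the same message to a fatal error.
 */
static bool
sync_or_warn(int fd, const char *path)
{
    const char *msg;

    assert(path != NULL);

    msg = sync_file(fd);
    if (msg != NULL)
    {
        int         save_errno = errno; /* keep errno across later calls */

        fprintf(stderr, "%s \"%s\": %s\n", msg, path, strerror(save_errno));
        return false;
    }

    return true;
}
```

Keeping the error policy in the caller is what lets a single sync routine serve both a caller that must stop on failure and one that can recover.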
v32-0002-Introduce-undo-log-implementation.patch (text/x-patch; charset=us-ascii)
From c464013071dedc15b838e573ae828f150b3b60f7 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 31 Aug 2023 11:49:10 +0900
Subject: [PATCH v32 2/3] Introduce undo log implementation
This patch adds a simple implementation of an UNDO log feature.
---
src/backend/access/transam/Makefile | 1 +
src/backend/access/transam/meson.build | 1 +
src/backend/access/transam/rmgr.c | 4 +-
src/backend/access/transam/simpleundolog.c | 362 +++++++++++++++++++++
src/backend/access/transam/twophase.c | 3 +
src/backend/access/transam/xact.c | 24 ++
src/backend/access/transam/xlog.c | 20 +-
src/backend/catalog/storage.c | 171 ++++++++++
src/backend/storage/file/reinit.c | 78 +++++
src/backend/storage/smgr/smgr.c | 9 +
src/bin/initdb/initdb.c | 17 +
src/bin/pg_rewind/parsexlog.c | 2 +-
src/bin/pg_waldump/rmgrdesc.c | 2 +-
src/include/access/rmgr.h | 2 +-
src/include/access/rmgrlist.h | 44 +--
src/include/access/simpleundolog.h | 36 ++
src/include/catalog/storage.h | 3 +
src/include/catalog/storage_ulog.h | 38 +++
src/include/catalog/storage_xlog.h | 9 +
src/include/storage/reinit.h | 2 +
src/include/storage/smgr.h | 1 +
src/tools/pgindent/typedefs.list | 6 +
22 files changed, 803 insertions(+), 32 deletions(-)
create mode 100644 src/backend/access/transam/simpleundolog.c
create mode 100644 src/include/access/simpleundolog.h
create mode 100644 src/include/catalog/storage_ulog.h
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db..531505cbbd 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -21,6 +21,7 @@ OBJS = \
rmgr.o \
slru.o \
subtrans.o \
+ simpleundolog.o \
timeline.o \
transam.o \
twophase.o \
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index 8a3522557c..c1225636b5 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -9,6 +9,7 @@ backend_sources += files(
'rmgr.c',
'slru.c',
'subtrans.c',
+ 'simpleundolog.c',
'timeline.c',
'transam.c',
'twophase.c',
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 7d67eda5f7..840cbdecd3 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -35,8 +35,8 @@
#include "utils/relmapper.h"
/* must be kept in sync with RmgrData definition in xlog_internal.h */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
- { name, redo, desc, identify, startup, cleanup, mask, decode },
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo) \
+ { name, redo, desc, identify, startup, cleanup, mask, decode},
RmgrData RmgrTable[RM_MAX_ID + 1] = {
#include "access/rmgrlist.h"
diff --git a/src/backend/access/transam/simpleundolog.c b/src/backend/access/transam/simpleundolog.c
new file mode 100644
index 0000000000..e22ed67bae
--- /dev/null
+++ b/src/backend/access/transam/simpleundolog.c
@@ -0,0 +1,362 @@
+/*-------------------------------------------------------------------------
+ *
+ * simpleundolog.c
+ * Simple implementation of PostgreSQL transaction-undo-log manager
+ *
+ * This module logs the actions that must be performed when a transaction
+ * aborts. Persisting this information is crucial for reliable cleanup
+ * during the restart that follows a crash in the middle of a transaction.
+ * At present, the log is written by simply appending records to a
+ * per-transaction file.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/simpleundolog.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/simpleundolog.h"
+#include "access/twophase_rmgr.h"
+#include "access/parallel.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "catalog/storage_ulog.h"
+#include "storage/fd.h"
+
+#define ULOG_FILE_MAGIC 0x12345678
+
+typedef struct UndoLogFileHeader
+{
+ int32 magic;
+ bool prepared;
+} UndoLogFileHeader;
+
+typedef struct UndoDescData
+{
+ const char *name;
+ void (*rm_undo) (SimpleUndoLogRecord *record, bool prepared);
+} UndoDescData;
+
+/* must be kept in sync with the UndoDescData definition above */
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo) \
+ { name, undo },
+
+UndoDescData UndoRoutines[RM_MAX_ID + 1] = {
+#include "access/rmgrlist.h"
+};
+#undef PG_RMGR
+
+static char current_ulogfile_name[MAXPGPATH];
+static int current_ulogfile_fd = -1;
+static TransactionId current_xid = InvalidTransactionId;
+static UndoLogFileHeader current_fhdr;
+
+static void
+undolog_check_file_header(void)
+{
+ if (read(current_ulogfile_fd, &current_fhdr, sizeof(current_fhdr)) < 0)
+ ereport(PANIC,
+ errcode_for_file_access(),
+ errmsg("could not read undolog file \"%s\": %m",
+ current_ulogfile_name));
+ if (current_fhdr.magic != ULOG_FILE_MAGIC)
+ ereport(PANIC,
+ errcode_for_file_access(),
+ errmsg("invalid undolog file \"%s\": magic number does not match",
+ current_ulogfile_name));
+}
+
+static void
+undolog_sync_current_file(void)
+{
+ const char *msg;
+
+ msg = XLogFsyncFile(current_ulogfile_fd);
+
+ /* PANIC if failed to fsync */
+ if (msg)
+ {
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg(msg, current_ulogfile_name)));
+ }
+}
+
+static bool
+undolog_open_current_file(TransactionId xid, bool forread, bool append)
+{
+ int omode;
+
+ if (current_ulogfile_fd >= 0)
+ {
+ /* use existing open file */
+ if (current_xid == xid)
+ {
+ if (append)
+ return true;
+
+ if (lseek(current_ulogfile_fd,
+ sizeof(UndoLogFileHeader), SEEK_SET) < 0)
+ ereport(PANIC,
+ errcode_for_file_access(),
+ errmsg("could not seek undolog file \"%s\": %m",
+ current_ulogfile_name));
+ }
+
+ close(current_ulogfile_fd);
+ current_ulogfile_fd = -1;
+ ReleaseExternalFD();
+ }
+
+ current_xid = xid;
+ if (!TransactionIdIsValid(xid))
+ return false;
+
+ omode = PG_BINARY | XLogGetSyncBit();
+
+ if (forread)
+ omode |= O_RDONLY;
+ else
+ {
+ omode |= O_RDWR;
+
+ if (!append)
+ omode |= O_TRUNC;
+ }
+
+ snprintf(current_ulogfile_name, MAXPGPATH, "%s/%08x",
+ SIMPLE_UNDOLOG_DIR, xid);
+ current_ulogfile_fd = BasicOpenFile(current_ulogfile_name, omode);
+ if (current_ulogfile_fd >= 0)
+ undolog_check_file_header();
+ else
+ {
+ if (forread)
+ return false;
+
+ current_fhdr.magic = ULOG_FILE_MAGIC;
+ current_fhdr.prepared = false;
+
+ omode |= O_CREAT;
+ current_ulogfile_fd = BasicOpenFile(current_ulogfile_name, omode);
+ if (current_ulogfile_fd < 0)
+ ereport(PANIC,
+ errcode_for_file_access(),
+ errmsg("could not create undolog file \"%s\": %m",
+ current_ulogfile_name));
+
+ if (write(current_ulogfile_fd, &current_fhdr, sizeof(current_fhdr)) < 0)
+ ereport(PANIC,
+ errcode_for_file_access(),
+ errmsg("could not write undolog file \"%s\": %m",
+ current_ulogfile_name));
+ }
+
+ /*
+ * Move the file pointer to the end of the file. We avoid O_APPEND so that
+ * data can be modified at any location in the file. In the !append case
+ * the pointer is already at the first record.
+ */
+ if (append)
+ {
+ if (lseek(current_ulogfile_fd, 0, SEEK_END) < 0)
+ ereport(PANIC,
+ errcode_for_file_access(),
+ errmsg("could not seek undolog file \"%s\": %m",
+ current_ulogfile_name));
+ }
+ ReserveExternalFD();
+
+ /* sync the file according to wal_sync_method */
+ undolog_sync_current_file();
+
+ return true;
+}
+
+/*
+ * Write an undolog record
+ */
+void
+SimpleUndoLogWrite(RmgrId rmgr, uint8 info,
+ TransactionId xid, void *data, int len)
+{
+ int reclen = sizeof(SimpleUndoLogRecord) + len;
+ SimpleUndoLogRecord *rec = palloc(reclen);
+ pg_crc32c undodata_crc;
+
+ Assert(!IsParallelWorker());
+ Assert(xid != InvalidTransactionId);
+
+ undolog_open_current_file(xid, false, true);
+
+ rec->ul_tot_len = reclen;
+ rec->ul_rmid = rmgr;
+ rec->ul_info = info;
+ rec->ul_xid = current_xid;
+
+ memcpy((char *)rec + sizeof(SimpleUndoLogRecord), data, len);
+
+ /* Calculate CRC of the data */
+ INIT_CRC32C(undodata_crc);
+ COMP_CRC32C(undodata_crc, (char *) rec + offsetof(SimpleUndoLogRecord, ul_rmid),
+ reclen - offsetof(SimpleUndoLogRecord, ul_rmid));
+ FIN_CRC32C(undodata_crc);
+ rec->ul_crc = undodata_crc;
+
+ if (write(current_ulogfile_fd, rec, reclen) < 0)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not write to undolog file \"%s\": %m",
+ current_ulogfile_name));
+
+ undolog_sync_current_file();
+}
+
+static void
+SimpleUndoLogUndo(bool cleanup)
+{
+ int bufsize;
+ char *buf;
+
+ bufsize = 1024;
+ buf = palloc(bufsize);
+
+ Assert(current_ulogfile_fd >= 0);
+
+ while (read(current_ulogfile_fd, buf, sizeof(SimpleUndoLogRecord)) ==
+ sizeof(SimpleUndoLogRecord))
+ {
+ SimpleUndoLogRecord *rec = (SimpleUndoLogRecord *) buf;
+ int readlen = rec->ul_tot_len - sizeof(SimpleUndoLogRecord);
+ int ret;
+
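+ /* Grow the buffer until the record fits; repalloc may move it. */
+ while (rec->ul_tot_len > bufsize)
+ bufsize *= 2;
+ buf = repalloc(buf, bufsize);
+ rec = (SimpleUndoLogRecord *) buf;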
+
+ ret = read(current_ulogfile_fd,
+ buf + sizeof(SimpleUndoLogRecord), readlen);
+ if (ret != readlen)
+ {
+ if (ret < 0)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not read undo log file \"%s\": %m",
+ current_ulogfile_name));
+
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not read undo log file \"%s\": read %d of %d bytes",
+ current_ulogfile_name, ret, readlen));
+
+ }
+
+ UndoRoutines[rec->ul_rmid].rm_undo(rec,
+ current_fhdr.prepared && cleanup);
+ }
+}
+
+void
+AtEOXact_SimpleUndoLog(bool isCommit, TransactionId xid)
+{
+ if (IsParallelWorker())
+ return;
+
+ if (!undolog_open_current_file(xid, true, false))
+ return;
+
+ if (!isCommit)
+ SimpleUndoLogUndo(false);
+
+ if (current_ulogfile_fd >= 0)
+ {
+ if (close(current_ulogfile_fd) != 0)
+ ereport(PANIC, errcode_for_file_access(),
+ errmsg("could not close file \"%s\": %m",
+ current_ulogfile_name));
+
+ current_ulogfile_fd = -1;
+ ReleaseExternalFD();
+ durable_unlink(current_ulogfile_name, FATAL);
+ }
+
+ return;
+}
+
+void
+UndoLogCleanup(void)
+{
+ DIR *dirdesc;
+ struct dirent *de;
+ char **loglist;
+ int loglistspace = 128;
+ int loglistlen = 0;
+ int i;
+
+ loglist = palloc(sizeof(char*) * loglistspace);
+
+ dirdesc = AllocateDir(SIMPLE_UNDOLOG_DIR);
+ while ((de = ReadDir(dirdesc, SIMPLE_UNDOLOG_DIR)) != NULL)
+ {
+ if (strspn(de->d_name, "0123456789abcdef") < strlen(de->d_name))
+ continue;
+
+ if (loglistlen >= loglistspace)
+ {
+ loglistspace *= 2;
+ loglist = repalloc(loglist, sizeof(char*) * loglistspace);
+ }
+ loglist[loglistlen++] = pstrdup(de->d_name);
+ }
+
+ for (i = 0 ; i < loglistlen ; i++)
+ {
+ snprintf(current_ulogfile_name, MAXPGPATH, "%s/%s",
+ SIMPLE_UNDOLOG_DIR, loglist[i]);
+ current_ulogfile_fd = BasicOpenFile(current_ulogfile_name,
+ O_RDWR | PG_BINARY |
+ XLogGetSyncBit());
+ undolog_check_file_header();
+ SimpleUndoLogUndo(true);
+ if (close(current_ulogfile_fd) != 0)
+ ereport(PANIC, errcode_for_file_access(),
+ errmsg("could not close file \"%s\": %m",
+ current_ulogfile_name));
+ current_ulogfile_fd = -1;
+
+ /* do not remove ulog files for prepared transactions */
+ if (!current_fhdr.prepared)
+ durable_unlink(current_ulogfile_name, FATAL);
+ }
+}
+
+/*
+ * Mark this xid as prepared
+ */
+void
+SimpleUndoLogSetPrpared(TransactionId xid, bool prepared)
+{
+ Assert(xid != InvalidTransactionId);
+
+ undolog_open_current_file(xid, false, true);
+ current_fhdr.prepared = prepared;
+ if (lseek(current_ulogfile_fd, 0, SEEK_SET) < 0)
+ ereport(PANIC,
+ errcode_for_file_access(),
+ errmsg("could not seek undolog file \"%s\": %m",
+ current_ulogfile_name));
+
+ if (write(current_ulogfile_fd, &current_fhdr, sizeof(current_fhdr)) < 0)
+ ereport(PANIC,
+ errcode_for_file_access(),
+ errmsg("could not write undolog file \"%s\": %m",
+ current_ulogfile_name));
+
+ undolog_sync_current_file();
+}
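The file format used by simpleundolog.c above (a small header holding a magic number and a prepared flag, followed by length-prefixed records that are appended and later read back in order) can be sketched outside the backend as follows. This is only an illustration under stated assumptions: every demo_* and DEMO_* name is hypothetical, plain POSIX calls replace BasicOpenFile() and ereport(), the record header is reduced to two fields, and no CRC is computed.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define DEMO_ULOG_MAGIC 0x12345678

typedef struct DemoUlogHeader { int32_t magic; int32_t prepared; } DemoUlogHeader;

typedef struct DemoUlogRecord
{
	uint32_t	tot_len;		/* header plus payload, in bytes */
	uint8_t		info;			/* record type */
	/* payload follows the (padded) header */
} DemoUlogRecord;

/* Create the log file and write its header; returns the fd or -1. */
static int
demo_ulog_create(const char *path)
{
	DemoUlogHeader hdr = {DEMO_ULOG_MAGIC, 0};
	int			fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

	if (fd < 0)
		return -1;
	if (write(fd, &hdr, sizeof(hdr)) != (ssize_t) sizeof(hdr))
	{
		close(fd);
		return -1;
	}
	return fd;
}

/* Append one record at end of file and fsync it; returns 0 or -1. */
static int
demo_ulog_append(int fd, uint8_t info, const void *data, uint32_t len)
{
	uint32_t	reclen = (uint32_t) sizeof(DemoUlogRecord) + len;
	DemoUlogRecord *rec = calloc(1, reclen);
	int			ok;

	assert(fd >= 0);
	if (rec == NULL)
		return -1;
	rec->tot_len = reclen;
	rec->info = info;
	memcpy((char *) rec + sizeof(DemoUlogRecord), data, len);
	ok = lseek(fd, 0, SEEK_END) >= 0 &&
		write(fd, rec, reclen) == (ssize_t) reclen &&
		fsync(fd) == 0;
	free(rec);
	return ok ? 0 : -1;
}

/* Example callback: accumulate payload sizes into a size_t. */
static void
demo_sum_payload(const DemoUlogRecord *rec, void *arg)
{
	*(size_t *) arg += rec->tot_len - sizeof(DemoUlogRecord);
}

/* Check the header, then call cb for each record; returns the count or -1. */
static int
demo_ulog_replay(int fd, void (*cb) (const DemoUlogRecord *, void *), void *arg)
{
	DemoUlogHeader hdr;
	size_t		bufsize = 16;
	char	   *buf;
	int			nrecords = 0;

	if (lseek(fd, 0, SEEK_SET) < 0 ||
		read(fd, &hdr, sizeof(hdr)) != (ssize_t) sizeof(hdr) ||
		hdr.magic != DEMO_ULOG_MAGIC)
		return -1;
	if ((buf = malloc(bufsize)) == NULL)
		return -1;
	while (read(fd, buf, sizeof(DemoUlogRecord)) == (ssize_t) sizeof(DemoUlogRecord))
	{
		uint32_t	tot_len = ((DemoUlogRecord *) buf)->tot_len;
		size_t		paylen = tot_len - sizeof(DemoUlogRecord);
		char	   *newbuf = buf;

		if (tot_len < sizeof(DemoUlogRecord))
			break;				/* corrupt length: stop reading */
		while (bufsize < tot_len)
			bufsize *= 2;		/* grow until the whole record fits */
		if ((newbuf = realloc(buf, bufsize)) == NULL)
			break;
		buf = newbuf;			/* realloc may have moved the buffer */
		if (read(fd, buf + sizeof(DemoUlogRecord), paylen) != (ssize_t) paylen)
			break;
		cb((const DemoUlogRecord *) buf, arg);	/* re-derived after realloc */
		nrecords++;
	}
	free(buf);
	return nrecords;
}
```

The replay loop re-derives the record pointer from the buffer after every realloc and grows the buffer until the whole record fits, so one oversized record cannot overrun it. A stricter reader would also report a truncated trailing record as an error instead of silently stopping.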
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 8426458f7f..bc9cdfc41a 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -78,6 +78,7 @@
#include "access/commit_ts.h"
#include "access/htup_details.h"
+#include "access/simpleundolog.h"
#include "access/subtrans.h"
#include "access/transam.h"
#include "access/twophase.h"
@@ -1603,6 +1604,8 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
abortstats,
gid);
+ AtEOXact_SimpleUndoLog(isCommit, xid);
+
ProcArrayRemove(proc, latestXid);
/*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 464858117e..1371df44b2 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -24,6 +24,7 @@
#include "access/multixact.h"
#include "access/parallel.h"
#include "access/subtrans.h"
+#include "access/simpleundolog.h"
#include "access/transam.h"
#include "access/twophase.h"
#include "access/xact.h"
@@ -2224,6 +2225,9 @@ CommitTransaction(void)
*/
smgrDoPendingSyncs(true, is_parallel_worker);
+ /* Likewise perform uncommitted storage file deletion. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2366,6 +2370,7 @@ CommitTransaction(void)
AtEOXact_on_commit_actions(true);
AtEOXact_Namespace(true, is_parallel_worker);
AtEOXact_SMgr();
+ AtEOXact_SimpleUndoLog(true, GetCurrentTransactionIdIfAny());
AtEOXact_Files(true);
AtEOXact_ComboCid();
AtEOXact_HashTables(true);
@@ -2475,6 +2480,9 @@ PrepareTransaction(void)
*/
smgrDoPendingSyncs(true, false);
+ /* Likewise perform uncommitted storage file deletion. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2799,6 +2807,7 @@ AbortTransaction(void)
AfterTriggerEndXact(false); /* 'false' means it's abort */
AtAbort_Portals();
smgrDoPendingSyncs(false, is_parallel_worker);
+ smgrDoPendingCleanups(false);
AtEOXact_LargeObject(false);
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
@@ -2866,6 +2875,7 @@ AbortTransaction(void)
AtEOXact_on_commit_actions(false);
AtEOXact_Namespace(false, is_parallel_worker);
AtEOXact_SMgr();
+ AtEOXact_SimpleUndoLog(false, GetCurrentTransactionIdIfAny());
AtEOXact_Files(false);
AtEOXact_ComboCid();
AtEOXact_HashTables(false);
@@ -5003,6 +5013,8 @@ CommitSubTransaction(void)
AtEOSubXact_Inval(true);
AtSubCommit_smgr();
+ AtEOXact_SimpleUndoLog(true, GetCurrentTransactionIdIfAny());
+
/*
* The only lock we actually release here is the subtransaction XID lock.
*/
@@ -5196,6 +5208,7 @@ AbortSubTransaction(void)
RESOURCE_RELEASE_AFTER_LOCKS,
false, false);
AtSubAbort_smgr();
+ AtEOXact_SimpleUndoLog(false, GetCurrentTransactionIdIfAny());
AtEOXact_GUC(false, s->gucNestLevel);
AtEOSubXact_SPI(false, s->subTransactionId);
@@ -5676,7 +5689,10 @@ XactLogCommitRecord(TimestampTz commit_time,
if (!TransactionIdIsValid(twophase_xid))
info = XLOG_XACT_COMMIT;
else
+ {
+ elog(LOG, "COMMIT PREPARED: %u", twophase_xid);
info = XLOG_XACT_COMMIT_PREPARED;
+ }
/* First figure out and collect all the information needed */
@@ -6076,6 +6092,8 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
DropRelationFiles(parsed->xlocators, parsed->nrels, true);
}
+ AtEOXact_SimpleUndoLog(true, xid);
+
if (parsed->nstats > 0)
{
/* see equivalent call for relations above */
@@ -6187,6 +6205,8 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
DropRelationFiles(parsed->xlocators, parsed->nrels, true);
}
+ AtEOXact_SimpleUndoLog(false, xid);
+
if (parsed->nstats > 0)
{
/* see equivalent call for relations above */
@@ -6252,6 +6272,10 @@ xact_redo(XLogReaderState *record)
}
else if (info == XLOG_XACT_PREPARE)
{
+ xl_xact_prepare *xlrec = (xl_xact_prepare *) XLogRecGetData(record);
+
+ AtEOXact_SimpleUndoLog(true, xlrec->xid);
+
/*
* Store xid and start/end pointers of the WAL record in TwoPhaseState
* gxact entry.
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c5f51849ee..7bb712c0ae 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -51,6 +51,7 @@
#include "access/heaptoast.h"
#include "access/multixact.h"
#include "access/rewriteheap.h"
+#include "access/simpleundolog.h"
#include "access/subtrans.h"
#include "access/timeline.h"
#include "access/transam.h"
@@ -5539,6 +5540,12 @@ StartupXLOG(void)
/* Check that the GUCs used to generate the WAL allow recovery */
CheckRequiredParameterValues();
+ /*
+ * Perform undo processing. This must be done before resetting unlogged
+ * relations.
+ */
+ UndoLogCleanup();
+
/*
* We're in recovery, so unlogged relations may be trashed and must be
* reset. This should be done BEFORE allowing Hot Standby
@@ -5684,14 +5691,17 @@ StartupXLOG(void)
}
/*
- * Reset unlogged relations to the contents of their INIT fork. This is
- * done AFTER recovery is complete so as to include any unlogged relations
- * created during recovery, but BEFORE recovery is marked as having
- * completed successfully. Otherwise we'd not retry if any of the post
- * end-of-recovery steps fail.
+ * Process undo logs left after recovery, then reset unlogged relations to
+ * the contents of their INIT fork. This is done AFTER recovery is complete
+ * so as to include any file creations during recovery, but BEFORE recovery
+ * is marked as having completed successfully. Otherwise we'd not retry if
+ * any of the post end-of-recovery steps fail.
*/
if (InRecovery)
+ {
+ UndoLogCleanup();
ResetUnloggedRelations(UNLOGGED_RELATION_INIT);
+ }
/*
* Pre-scan prepared transactions to find out the range of XIDs present.
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index b155c03386..03553c4980 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,16 +19,20 @@
#include "postgres.h"
+#include "access/amapi.h"
#include "access/parallel.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
#include "access/xlogutils.h"
+#include "access/simpleundolog.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+#include "catalog/storage_ulog.h"
#include "miscadmin.h"
#include "storage/freespace.h"
+#include "storage/reinit.h"
#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
@@ -66,6 +70,19 @@ typedef struct PendingRelDelete
struct PendingRelDelete *next; /* linked-list link */
} PendingRelDelete;
+#define PCOP_UNLINK_FORK (1 << 0)
+
+typedef struct PendingCleanup
+{
+ RelFileLocator rlocator; /* relation that needs cleanup */
+ int op; /* operation mask */
+ ForkNumber unlink_forknum; /* forknum to unlink */
+ BackendId backend; /* InvalidBackendId if not a temp rel */
+ bool atCommit; /* T=delete at commit; F=delete at abort */
+ int nestLevel; /* xact nesting level of request */
+ struct PendingCleanup *next; /* linked-list link */
+} PendingCleanup;
+
typedef struct PendingRelSync
{
RelFileLocator rlocator;
@@ -73,6 +90,7 @@ typedef struct PendingRelSync
} PendingRelSync;
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingCleanup *pendingCleanups = NULL; /* head of linked list */
static HTAB *pendingSyncHash = NULL;
@@ -148,6 +166,19 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
srel = smgropen(rlocator, backend);
smgrcreate(srel, MAIN_FORKNUM, false);
+ /* Write an undo log record; this is required regardless of needs_wal */
+ if (register_delete)
+ {
+ ul_uncommitted_storage ul_storage;
+
+ ul_storage.rlocator = rlocator;
+ ul_storage.forknum = MAIN_FORKNUM;
+ ul_storage.remove = true;
+ SimpleUndoLogWrite(RM_SMGR_ID, ULOG_SMGR_UNCOMMITED_STORAGE,
+ GetCurrentTransactionId(),
+ &ul_storage, sizeof(ul_storage));
+ }
+
if (needs_wal)
log_smgrcreate(&srel->smgr_rlocator.locator, MAIN_FORKNUM);
@@ -191,12 +222,32 @@ log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum)
*/
xlrec.rlocator = *rlocator;
xlrec.forkNum = forkNum;
+ xlrec.xid = GetTopTransactionId();
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, sizeof(xlrec));
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileLocator *rlocator, ForkNumber forkNum)
+{
+ xl_smgr_unlink xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file unlink.
+ */
+ xlrec.rlocator = *rlocator;
+ xlrec.forkNum = forkNum;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -711,6 +762,75 @@ smgrDoPendingDeletes(bool isCommit)
}
}
+/*
+ * smgrDoPendingCleanups() -- Clean up work that emits WAL records
+ *
+ * The operations handled in this function emit WAL records, which must be
+ * part of the current transaction.
+ */
+void
+smgrDoPendingCleanups(bool isCommit)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (pending->nestLevel < nestLevel)
+ {
+ /* outer-level entries should not be processed yet */
+ prev = pending;
+ }
+ else
+ {
+ /* unlink list entry first, so we don't retry on failure */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ /* do cleanup if called for */
+ if (pending->atCommit == isCommit)
+ {
+ SMgrRelation srel;
+
+ srel = smgropen(pending->rlocator, pending->backend);
+
+ Assert((pending->op & ~(PCOP_UNLINK_FORK)) == 0);
+
+ if (pending->op & PCOP_UNLINK_FORK)
+ {
+ BlockNumber firstblock = 0;
+
+ /*
+ * Unlink the fork file. Currently this operation is
+ * applied only to init forks. Since the init fork may
+ * still be loaded in shared buffers, drop all buffers
+ * for it.
+ */
+ Assert(pending->unlink_forknum == INIT_FORKNUM);
+ DropRelationBuffers(srel, &pending->unlink_forknum, 1,
+ &firstblock);
+
+ /* Don't emit WAL during recovery. */
+ if (!InRecovery)
+ log_smgrunlink(&pending->rlocator,
+ pending->unlink_forknum);
+ smgrunlink(srel, pending->unlink_forknum, false);
+ }
+ }
+
+ /* must explicitly free the list entry */
+ pfree(pending);
+ /* prev does not change */
+ }
+ }
+}
+
/*
* smgrDoPendingSyncs() -- Take care of relation syncs at end of xact.
*/
@@ -920,6 +1040,9 @@ PostPrepare_smgr(void)
/* must explicitly free the list entry */
pfree(pending);
}
+
+ /* Mark undolog as prepared */
+ SimpleUndoLogSetPrpared(GetCurrentTransactionId(), true);
}
@@ -967,10 +1090,28 @@ smgr_redo(XLogReaderState *record)
{
xl_smgr_create *xlrec = (xl_smgr_create *) XLogRecGetData(record);
SMgrRelation reln;
+ ul_uncommitted_storage ul_storage;
+
+ /* write undo log */
+ ul_storage.rlocator = xlrec->rlocator;
+ ul_storage.forknum = xlrec->forkNum;
+ ul_storage.remove = true;
+ SimpleUndoLogWrite(RM_SMGR_ID, ULOG_SMGR_UNCOMMITED_STORAGE,
+ xlrec->xid,
+ &ul_storage, sizeof(ul_storage));
reln = smgropen(xlrec->rlocator, InvalidBackendId);
smgrcreate(reln, xlrec->forkNum, true);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rlocator, InvalidBackendId);
+ smgrunlink(reln, xlrec->forkNum, true);
+ smgrclose(reln);
+ }
else if (info == XLOG_SMGR_TRUNCATE)
{
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
@@ -1062,3 +1203,33 @@ smgr_redo(XLogReaderState *record)
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
+
+void
+smgr_undo(SimpleUndoLogRecord *record, bool crash_prepared)
+{
+ uint8 info = record->ul_info;
+
+
+ if (info == ULOG_SMGR_UNCOMMITED_STORAGE)
+ {
+ ul_uncommitted_storage *ul_storage =
+ (ul_uncommitted_storage *) ULogRecGetData(record);
+
+ if (!crash_prepared)
+ {
+ SMgrRelation reln;
+
+ reln = smgropen(ul_storage->rlocator, InvalidBackendId);
+ smgrunlink(reln, ul_storage->forknum, true);
+ smgrclose(reln);
+ }
+ else
+ {
+ /* Inform reinit to ignore this file during cleanup */
+ ResetUnloggedRelationIgnore(ul_storage->rlocator);
+ }
+
+ }
+ else
+ elog(PANIC, "smgr_undo: unknown op code %u", info);
+}
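The list walk in smgrDoPendingCleanups() above follows the same shape as the existing pending-delete processing: entries registered at an outer nesting level are kept, entries at the current level or deeper are unlinked from the list first, and the action runs only when the entry's atCommit flag matches the outcome. A reduced sketch of that traversal follows, assuming a hypothetical DemoPending struct in which an integer id replaces the relation locator and a counter replaces the actual fork unlink.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct DemoPending
{
	int			id;				/* stands in for the relation locator */
	bool		atCommit;		/* true: act at commit; false: act at abort */
	int			nestLevel;		/* transaction nesting level of the request */
	struct DemoPending *next;	/* linked-list link */
} DemoPending;

static DemoPending *demo_pending = NULL;	/* head of linked list */

/* Push a new request onto the head of the list. */
static void
demo_register(int id, bool atCommit, int nestLevel)
{
	DemoPending *p = malloc(sizeof(DemoPending));

	assert(p != NULL);
	p->id = id;
	p->atCommit = atCommit;
	p->nestLevel = nestLevel;
	p->next = demo_pending;
	demo_pending = p;
}

/* Count the entries still waiting in the list. */
static int
demo_pending_count(void)
{
	int			n = 0;

	for (DemoPending *p = demo_pending; p != NULL; p = p->next)
		n++;
	return n;
}

/*
 * Process every entry registered at nestLevel or deeper.  Returns the
 * number of entries whose action was carried out.
 */
static int
demo_do_pending(bool isCommit, int nestLevel)
{
	DemoPending *prev = NULL;
	DemoPending *next;
	int			nacted = 0;

	for (DemoPending *p = demo_pending; p != NULL; p = next)
	{
		next = p->next;
		if (p->nestLevel < nestLevel)
		{
			prev = p;			/* outer-level entry: keep it for later */
			continue;
		}

		/* unlink the entry first, so a failure below cannot cause a retry */
		if (prev)
			prev->next = next;
		else
			demo_pending = next;

		if (p->atCommit == isCommit)
			nacted++;			/* the patch unlinks the init fork here */

		free(p);				/* prev does not change */
	}
	return nacted;
}
```

Aborting at level 2 in this sketch acts on the abort-time entries of that level, discards its commit-time entries, and leaves level-1 entries for the outer transaction to handle.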
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index f1cd1a38d9..5fb35bad77 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -34,6 +34,39 @@ typedef struct
RelFileNumber relnumber; /* hash key */
} unlogged_relation_entry;
+static char **ignore_files = NULL;
+static int nignore_elems = 0;
+static int nignore_files = 0;
+
+/*
+ * Identify files that should be ignored while resetting unlogged relations.
+ */
+static bool
+reinit_ignore_file(const char *dirname, const char *name)
+{
+ char fnamebuf[MAXPGPATH];
+ int len;
+
+ if (nignore_files == 0)
+ return false;
+
+ /*
+ * Build "dirname/name"; snprintf bounds the write and NUL-terminates.
+ */
+ snprintf(fnamebuf, sizeof(fnamebuf), "%s/%s", dirname, name);
+
+ for (int i = 0 ; i < nignore_files ; i++)
+ {
+ /* match ignoring fork part */
+ len = strlen(ignore_files[i]);
+ if (strncmp(fnamebuf, ignore_files[i], len) == 0 &&
+ (fnamebuf[len] == 0 || fnamebuf[len] == '_'))
+ return true;
+ }
+
+ return false;
+}
+
/*
* Reset unlogged relations from before the last restart.
*
@@ -204,6 +237,10 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
&forkNum, &segno))
continue;
+ /* Skip anything that undo log suggested to ignore */
+ if (reinit_ignore_file(dbspacedirname, de->d_name))
+ continue;
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -243,6 +280,10 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
&forkNum, &segno))
continue;
+ /* Skip anything that undo log suggested to ignore */
+ if (reinit_ignore_file(dbspacedirname, de->d_name))
+ continue;
+
/* We never remove the init fork. */
if (forkNum == INIT_FORKNUM)
continue;
@@ -294,6 +335,10 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
&forkNum, &segno))
continue;
+ /* Skip anything that undo log suggested to ignore */
+ if (reinit_ignore_file(dbspacedirname, de->d_name))
+ continue;
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -337,6 +382,10 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
&forkNum, &segno))
continue;
+ /* Skip anything that undo log suggested to ignore */
+ if (reinit_ignore_file(dbspacedirname, de->d_name))
+ continue;
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -366,6 +415,35 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
}
}
+/*
+ * Record relfilenodes that should be left alone while reinitializing unlogged
+ * relations.
+ */
+void
+ResetUnloggedRelationIgnore(RelFileLocator rloc)
+{
+ RelFileLocatorBackend rbloc;
+
+ if (nignore_files >= nignore_elems)
+ {
+ if (ignore_files == NULL)
+ {
+ nignore_elems = 16;
+ ignore_files = palloc(sizeof(char *) * nignore_elems);
+ }
+ else
+ {
+ nignore_elems *= 2;
+ ignore_files = repalloc(ignore_files,
+ sizeof(char *) * nignore_elems);
+ }
+ }
+
+ rbloc.backend = InvalidBackendId;
+ rbloc.locator = rloc;
+ ignore_files[nignore_files++] = relpath(rbloc, MAIN_FORKNUM);
+}
+
/*
* Basic parsing of putative relation filenames.
*
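The matching rule in reinit_ignore_file() above compares a file path against the recorded main-fork path and accepts it when the next character is either the end of the string or '_', so that every fork of an ignored relation is skipped. A standalone sketch of just that comparison follows. The function name is hypothetical and the paths in the usage note are made-up examples.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true if "path" names the main fork stored at "relpath" or one of
 * its other forks ("<relpath>_fsm", "<relpath>_init", ...).
 */
static bool
demo_path_belongs_to_relation(const char *path, const char *relpath)
{
	size_t		len = strlen(relpath);

	assert(len > 0);

	/* strncmp stops at the NUL of a shorter path, so this cannot overread */
	if (strncmp(path, relpath, len) != 0)
		return false;

	/* accept an exact match or a fork suffix, reject longer file numbers */
	return path[len] == '\0' || path[len] == '_';
}
```

With this rule, "base/5/16384" and "base/5/16384_init" match the relation path "base/5/16384", while "base/5/163840" does not. A segment file such as "base/5/16384.1" is also rejected, because '.' is not accepted after the prefix. Whether the patch intends to leave later segments out is not stated in the thread.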
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 563a0be5c7..52da360d32 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -729,6 +729,15 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+/*
+ * smgrunlink() -- unlink the storage file
+ */
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rlocator, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index ac409b0006..31747b5db8 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -305,6 +305,7 @@ void setup_signals(void);
void setup_text_search(void);
void create_data_directory(void);
void create_xlog_or_symlink(void);
+void create_ulog(void);
void warn_on_mount_point(int error);
void initialize_data_directory(void);
@@ -2933,6 +2934,21 @@ create_xlog_or_symlink(void)
free(subdirloc);
}
+/* Create undo log directory */
+void
+create_ulog(void)
+{
+ char *subdirloc;
+
+ /* form name of the place for the subdirectory */
+ subdirloc = psprintf("%s/pg_ulog", pg_data);
+
+ if (mkdir(subdirloc, pg_dir_create_mode) < 0)
+ pg_fatal("could not create directory \"%s\": %m",
+ subdirloc);
+
+ free(subdirloc);
+}
void
warn_on_mount_point(int error)
@@ -2967,6 +2983,7 @@ initialize_data_directory(void)
create_data_directory();
create_xlog_or_symlink();
+ create_ulog();
/* Create required subdirectories (other than pg_wal) */
printf(_("creating subdirectories ... "));
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 22f7351fdc..525b98899f 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -28,7 +28,7 @@
* RmgrNames is an array of the built-in resource manager names, to make error
* messages a bit nicer.
*/
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo) \
name,
static const char *const RmgrNames[RM_MAX_ID + 1] = {
diff --git a/src/bin/pg_waldump/rmgrdesc.c b/src/bin/pg_waldump/rmgrdesc.c
index 6b8c17bb4c..a21009c5b8 100644
--- a/src/bin/pg_waldump/rmgrdesc.c
+++ b/src/bin/pg_waldump/rmgrdesc.c
@@ -32,7 +32,7 @@
#include "storage/standbydefs.h"
#include "utils/relmapper.h"
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo) \
{ name, desc, identify},
static const RmgrDescData RmgrDescTable[RM_N_BUILTIN_IDS] = {
diff --git a/src/include/access/rmgr.h b/src/include/access/rmgr.h
index 3b6a497e1b..d705de9256 100644
--- a/src/include/access/rmgr.h
+++ b/src/include/access/rmgr.h
@@ -19,7 +19,7 @@ typedef uint8 RmgrId;
* Note: RM_MAX_ID must fit in RmgrId; widening that type will affect the XLOG
* file format.
*/
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo) \
symname,
typedef enum RmgrIds
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 78e6b908c6..7f0abded93 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -25,25 +25,25 @@
*/
/* symbol name, textual name, redo, desc, identify, startup, cleanup, mask, decode */
-PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL, NULL, xlog_decode)
-PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL, NULL, xact_decode)
-PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL, standby_decode)
-PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask, heap2_decode)
-PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask, heap_decode)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask, NULL)
-PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask, NULL)
-PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask, NULL)
-PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask, NULL)
-PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL, seq_mask, NULL)
-PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup, spg_mask, NULL)
-PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL, brin_mask, NULL)
-PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask, NULL)
-PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL, logicalmsg_decode)
+PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL, NULL, xlog_decode, NULL)
+PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL, NULL, xact_decode, NULL)
+PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL, NULL, NULL, smgr_undo)
+PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL, standby_decode, NULL)
+PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask, heap2_decode, NULL)
+PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask, heap_decode, NULL)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask, NULL, NULL)
+PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask, NULL, NULL)
+PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask, NULL, NULL)
+PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask, NULL, NULL)
+PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL, seq_mask, NULL, NULL)
+PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup, spg_mask, NULL, NULL)
+PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL, brin_mask, NULL, NULL)
+PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask, NULL, NULL)
+PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL, logicalmsg_decode, NULL)
diff --git a/src/include/access/simpleundolog.h b/src/include/access/simpleundolog.h
new file mode 100644
index 0000000000..3d3bd2f7e2
--- /dev/null
+++ b/src/include/access/simpleundolog.h
@@ -0,0 +1,36 @@
+#ifndef SIMPLE_UNDOLOG_H
+#define SIMPLE_UNDOLOG_H
+
+#include "access/rmgr.h"
+#include "port/pg_crc32c.h"
+
+#define SIMPLE_UNDOLOG_DIR "pg_ulog"
+
+typedef struct SimpleUndoLogRecord
+{
+ uint32 ul_tot_len; /* total length of entire record */
+ pg_crc32c ul_crc; /* CRC for this record */
+ RmgrId ul_rmid; /* resource manager for this record */
+ uint8 ul_info; /* record info */
+ TransactionId ul_xid; /* transaction id */
+ /* rmgr-specific data follow, no padding */
+} SimpleUndoLogRecord;
+
+extern void SimpleUndoLogWrite(RmgrId rmgr, uint8 info,
+ TransactionId xid, void *data, int len);
+extern void SimpleUndoLogSetPrpared(TransactionId xid, bool prepared);
+extern void AtEOXact_SimpleUndoLog(bool isCommit, TransactionId xid);
+extern void UndoLogCleanup(void);
+
+extern void AtPrepare_UndoLog(TransactionId xid);
+extern void PostPrepare_UndoLog(void);
+extern void undolog_twophase_recover(TransactionId xid, uint16 info,
+ void *recdata, uint32 len);
+extern void undolog_twophase_postcommit(TransactionId xid, uint16 info,
+ void *recdata, uint32 len);
+extern void undolog_twophase_postabort(TransactionId xid, uint16 info,
+ void *recdata, uint32 len);
+extern void undolog_twophase_standby_recover(TransactionId xid, uint16 info,
+ void *recdata, uint32 len);
+
+#endif /* SIMPLE_UNDOLOG_H */
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 72ef3ee92c..2a63eabcbd 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -25,6 +25,8 @@ extern PGDLLIMPORT int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileLocator rlocator,
char relpersistence,
bool register_delete);
+extern void RelationCreateInitFork(Relation rel);
+extern void RelationDropInitFork(Relation rel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileLocator rlocator, bool atCommit);
extern void RelationPreTruncate(Relation rel);
@@ -43,6 +45,7 @@ extern void RestorePendingSyncs(char *startAddress);
extern void smgrDoPendingDeletes(bool isCommit);
extern void smgrDoPendingSyncs(bool isCommit, bool isParallelWorker);
extern int smgrGetPendingDeletes(bool forCommit, RelFileLocator **ptr);
+extern void smgrDoPendingCleanups(bool isCommit);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/catalog/storage_ulog.h b/src/include/catalog/storage_ulog.h
new file mode 100644
index 0000000000..847f0403e2
--- /dev/null
+++ b/src/include/catalog/storage_ulog.h
@@ -0,0 +1,38 @@
+/*-------------------------------------------------------------------------
+ *
+ * storage_ulog.h
+ * prototypes for Undo Log support for backend/catalog/storage.c
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/catalog/storage_ulog.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef STORAGE_ULOG_H
+#define STORAGE_ULOG_H
+
+#include "access/simpleundolog.h"
+#include "storage/relfilelocator.h"
+
+/* ULOG gives us high 4 bits (just following xlog) */
+#define ULOG_SMGR_UNCOMMITED_STORAGE 0x10
+
+/* undo log entry for uncommitted storage files */
+typedef struct ul_uncommitted_storage
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ bool remove;
+} ul_uncommitted_storage;
+
+/* flags for xl_smgr_truncate */
+#define SMGR_TRUNCATE_HEAP 0x0001
+
+void smgr_undo(SimpleUndoLogRecord *record, bool crash_prepared);
+
+#define ULogRecGetData(record) ((char *)record + sizeof(SimpleUndoLogRecord))
+
+#endif /* STORAGE_ULOG_H */
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index a490e05f88..807c0f8235 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -29,13 +29,21 @@
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_UNLINK 0x30
typedef struct xl_smgr_create
{
RelFileLocator rlocator;
ForkNumber forkNum;
+ TransactionId xid;
} xl_smgr_create;
+typedef struct xl_smgr_unlink
+{
+ RelFileLocator rlocator;
+ ForkNumber forkNum;
+} xl_smgr_unlink;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +59,7 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileLocator *rlocator, ForkNumber forkNum);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index 1373d509df..c57ae26b4c 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -16,9 +16,11 @@
#define REINIT_H
#include "common/relpath.h"
+#include "storage/relfilelocator.h"
extern void ResetUnloggedRelations(int op);
+extern void ResetUnloggedRelationIgnore(RelFileLocator rloc);
extern bool parse_filename_for_nontemp_relation(const char *name,
RelFileNumber *relnumber,
ForkNumber *fork,
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 527cd2a056..2eb1e3ed5e 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -88,6 +88,7 @@ extern void smgrcloserellocator(RelFileLocatorBackend rlocator);
extern void smgrrelease(SMgrRelation reln);
extern void smgrreleaseall(void);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 7e866e3c3d..7bfa98d5aa 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2013,6 +2013,7 @@ PatternInfo
PatternInfoArray
Pattern_Prefix_Status
Pattern_Type
+PendingCleanup
PendingFsyncEntry
PendingRelDelete
PendingRelSync
@@ -2577,6 +2578,7 @@ SimplePtrListCell
SimpleStats
SimpleStringList
SimpleStringListCell
+SimpleUndoLogRecord
SingleBoundSortItem
Size
SkipPages
@@ -2937,6 +2939,8 @@ ULONG
ULONG_PTR
UV
UVersionInfo
+UndoDescData
+UndoLogFileHeader
UnicodeNormalizationForm
UnicodeNormalizationQC
Unique
@@ -3858,6 +3862,7 @@ uint8
uint8_t
uint8x16_t
uintptr_t
+ul_uncommitted_storage
unicodeStyleBorderFormat
unicodeStyleColumnFormat
unicodeStyleFormat
@@ -3971,6 +3976,7 @@ xl_running_xacts
xl_seq_rec
xl_smgr_create
xl_smgr_truncate
+xl_smgr_unlink
xl_standby_lock
xl_standby_locks
xl_tblspc_create_rec
--
2.39.3
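For readers following the undo-log part of the series: the header above says that rmgr-specific data follows SimpleUndoLogRecord with no padding, and ULogRecGetData() simply skips the fixed header. Below is a minimal standalone sketch of that layout, not code from the patch. The typedefs are local stand-ins for the PostgreSQL types, and demo_payload, demo_build_record() and demo_payload_len() are hypothetical names used only for illustration. The CRC is left at zero because the checksum computation is not shown in this part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Local stand-ins for PostgreSQL types (assumed widths, for this sketch only). */
typedef uint8_t RmgrId;
typedef uint32_t TransactionId;
typedef uint32_t pg_crc32c;

/* Same field order as the header in the patch. */
typedef struct SimpleUndoLogRecord
{
	uint32_t	ul_tot_len;		/* total length of entire record */
	pg_crc32c	ul_crc;			/* CRC for this record (unused here) */
	RmgrId		ul_rmid;		/* resource manager for this record */
	uint8_t		ul_info;		/* record info */
	TransactionId ul_xid;		/* transaction id */
	/* rmgr-specific data follows the struct */
} SimpleUndoLogRecord;

/* As in storage_ulog.h, with the macro argument parenthesized. */
#define ULogRecGetData(record) \
	((char *) (record) + sizeof(SimpleUndoLogRecord))

/* Hypothetical payload, loosely modeled on ul_uncommitted_storage. */
typedef struct demo_payload
{
	uint32_t	relnumber;
	int32_t		forknum;
	uint8_t		remove;
} demo_payload;

/* Hypothetical helper: header and payload in one allocation. */
static SimpleUndoLogRecord *
demo_build_record(RmgrId rmid, uint8_t info, TransactionId xid,
				  const void *data, uint32_t len)
{
	uint32_t	total = (uint32_t) sizeof(SimpleUndoLogRecord) + len;
	SimpleUndoLogRecord *rec = calloc(1, total);

	if (rec == NULL)
		return NULL;

	rec->ul_tot_len = total;
	rec->ul_rmid = rmid;
	rec->ul_info = info;
	rec->ul_xid = xid;
	/* ul_crc stays zero: checksum calculation is omitted in this sketch */
	memcpy(ULogRecGetData(rec), data, len);
	return rec;
}

/* Hypothetical helper: payload length is the total minus the fixed header. */
static uint32_t
demo_payload_len(const SimpleUndoLogRecord *rec)
{
	assert(rec->ul_tot_len >= sizeof(SimpleUndoLogRecord));
	return rec->ul_tot_len - (uint32_t) sizeof(SimpleUndoLogRecord);
}
```

A caller would build a record with demo_build_record(), copy the payload back out from ULogRecGetData(rec) using demo_payload_len(rec) bytes, and free() the record afterwards. Copying with memcpy rather than casting the payload pointer avoids relying on the payload's alignment.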
Attachment: v32-0003-In-place-table-persistence-change.patch (text/x-patch; charset=us-ascii)
From 2970b1e7fe6cbeae9a04d0b4644b2f7c1bac08b8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Mon, 4 Sep 2023 17:23:05 +0900
Subject: [PATCH v32 3/3] In-place table persistence change
Previously, ALTER TABLE SET LOGGED/UNLOGGED caused a large amount of file
I/O due to heap rewrites, even though SET UNLOGGED does not require any
data rewrite. This patch eliminates the need for those rewrites.
Additionally, ALTER TABLE SET LOGGED now emits XLOG_FPI records instead
of numerous HEAP_INSERTs when wal_level > minimal, reducing resource
consumption.
---
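Note for reviewers (not part of the commit message): the init-fork handling in this patch relies on a cancel-or-register rule over the pending-cleanup list. SET UNLOGGED cancels a pending unlink if one exists, otherwise it creates the fork and registers an at-abort unlink. SET LOGGED cancels a pending entry and unlinks immediately if the fork was created in this transaction, otherwise it registers an at-commit unlink. The following standalone sketch models only that rule. All demo_* names are hypothetical, relations are reduced to an integer id, and the list walk uses a pointer-to-pointer instead of the prev/next variables in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Simplified stand-in for a PendingCleanup entry that unlinks an init fork. */
typedef struct DemoPending
{
	unsigned int rel;			/* hypothetical relation identifier */
	bool		atCommit;		/* true: unlink at commit; false: at abort */
	struct DemoPending *next;
} DemoPending;

static DemoPending *demo_pending = NULL;

/* Remove every entry for rel; return true if at least one was removed. */
static bool
demo_cancel_pending(unsigned int rel)
{
	DemoPending **link = &demo_pending;
	bool		found = false;

	while (*link != NULL)
	{
		DemoPending *cur = *link;

		if (cur->rel == rel)
		{
			*link = cur->next;	/* unlink the entry; link stays in place */
			free(cur);
			found = true;
		}
		else
			link = &cur->next;
	}
	return found;
}

/* Push a new pending unlink onto the head of the list. */
static void
demo_register(unsigned int rel, bool atCommit)
{
	DemoPending *p = malloc(sizeof(DemoPending));

	assert(p != NULL);
	p->rel = rel;
	p->atCommit = atCommit;
	p->next = demo_pending;
	demo_pending = p;
}

/* SET UNLOGGED: returns true if the init fork has to be created now. */
static bool
demo_set_unlogged(unsigned int rel)
{
	if (demo_cancel_pending(rel))
		return false;			/* fork predates this transaction; keep it */
	demo_register(rel, false);	/* new fork: remove it if we abort */
	return true;
}

/* SET LOGGED: returns true if the init fork has to be unlinked now. */
static bool
demo_set_logged(unsigned int rel)
{
	if (demo_cancel_pending(rel))
		return true;			/* fork was created in this transaction */
	demo_register(rel, true);	/* existing fork: remove it at commit */
	return false;
}

/* Number of entries currently pending. */
static int
demo_pending_count(void)
{
	int			n = 0;

	for (DemoPending *p = demo_pending; p != NULL; p = p->next)
		n++;
	return n;
}
```

Flipping a relation back and forth within one transaction leaves the list empty again, which is the property the real functions appear to aim for. The sketch leaves out WAL, undo-log records and buffer handling.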
src/backend/access/rmgrdesc/smgrdesc.c | 12 +
src/backend/catalog/storage.c | 338 ++++++++++++++++++++++++-
src/backend/commands/tablecmds.c | 269 +++++++++++++++++---
src/backend/storage/buffer/bufmgr.c | 84 ++++++
src/bin/pg_rewind/parsexlog.c | 6 +
src/include/catalog/storage_xlog.h | 10 +
src/include/storage/bufmgr.h | 3 +
src/include/storage/reinit.h | 2 +-
src/tools/pgindent/typedefs.list | 1 +
9 files changed, 684 insertions(+), 41 deletions(-)
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 71410e0a2d..77a8fdb045 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,15 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+ char *path = relpathperm(xlrec->rlocator, MAIN_FORKNUM);
+
+ appendStringInfoString(buf, path);
+ appendStringInfo(buf, " persistence %d", xlrec->persistence);
+ pfree(path);
+ }
}
const char *
@@ -55,6 +64,9 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_BUFPERSISTENCE:
+ id = "BUFPERSISTENCE";
+ break;
}
return id;
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 03553c4980..6616466f61 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -71,11 +71,13 @@ typedef struct PendingRelDelete
} PendingRelDelete;
#define PCOP_UNLINK_FORK (1 << 0)
+#define PCOP_SET_PERSISTENCE (1 << 1)
typedef struct PendingCleanup
{
RelFileLocator rlocator; /* relation that need a cleanup */
int op; /* operation mask */
+ bool bufpersistence; /* buffer persistence to set */
ForkNumber unlink_forknum; /* forknum to unlink */
BackendId backend; /* InvalidBackendId if not a temp rel */
bool atCommit; /* T=delete at commit; F=delete at abort */
@@ -209,6 +211,208 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
return srel;
}
+/*
+ * RelationCreateInitFork
+ * Create physical storage for the init fork of a relation.
+ *
+ * Create the init fork for the relation.
+ *
+ * This function is transactional. The creation is WAL-logged, and if the
+ * transaction aborts later on, the init fork will be removed.
+ */
+void
+RelationCreateInitFork(Relation rel)
+{
+ RelFileLocator rlocator = rel->rd_locator;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ SMgrRelation srel;
+ ul_uncommitted_storage ul_storage;
+ bool create = true;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), false, false);
+
+ /*
+ * If a pending-unlink exists for this relation's init-fork, it indicates
+ * that the init-fork existed before the current transaction; this function
+ * reverts the pending-unlink by removing the entry. See
+ * RelationDropInitFork.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileLocatorEquals(rlocator, pending->rlocator) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ /* write cancel log for preceding undo log entry */
+ ul_storage.rlocator = rlocator;
+ ul_storage.forknum = INIT_FORKNUM;
+ ul_storage.remove = false;
+ SimpleUndoLogWrite(RM_SMGR_ID, ULOG_SMGR_UNCOMMITED_STORAGE,
+ GetCurrentTransactionId(),
+ &ul_storage, sizeof(ul_storage));
+
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ create = false;
+ }
+ else
+ prev = pending;
+ }
+
+ if (!create)
+ return;
+
+ /* create undo log entry, then the init fork */
+ srel = smgropen(rlocator, InvalidBackendId);
+
+ /* write undo log */
+ ul_storage.rlocator = rlocator;
+ ul_storage.forknum = INIT_FORKNUM;
+ ul_storage.remove = true;
+ SimpleUndoLogWrite(RM_SMGR_ID, ULOG_SMGR_UNCOMMITED_STORAGE,
+ GetCurrentTransactionId(),
+ &ul_storage, sizeof(ul_storage));
+
+ /* We don't have an existing init fork; create it. */
+ smgrcreate(srel, INIT_FORKNUM, false);
+
+ /*
+ * For index relations, WAL-logging and file sync are handled by
+ * ambuildempty. In contrast, for heap relations, these tasks are performed
+ * directly.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rlocator, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /* drop the init fork, mark file then revert persistence at abort */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = rlocator;
+ pending->op = PCOP_UNLINK_FORK | PCOP_SET_PERSISTENCE;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->bufpersistence = true;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
+/*
+ * RelationDropInitFork
+ * Delete physical storage for the init fork of a relation.
+ */
+void
+RelationDropInitFork(Relation rel)
+{
+ RelFileLocator rlocator = rel->rd_locator;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ bool inxact_created = false;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), true, false);
+
+ /*
+ * Search for a pending-unlink associated with the init-fork of the
+ * relation. Its presence indicates that the init-fork was created within
+ * the current transaction.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileLocatorEquals(rlocator, pending->rlocator) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ ul_uncommitted_storage ul_storage;
+
+ /* write cancel log for preceding undo log entry */
+ ul_storage.rlocator = rlocator;
+ ul_storage.forknum = INIT_FORKNUM;
+ ul_storage.remove = false;
+ SimpleUndoLogWrite(RM_SMGR_ID, ULOG_SMGR_UNCOMMITED_STORAGE,
+ GetCurrentTransactionId(),
+ &ul_storage, sizeof(ul_storage));
+
+ /* unlink list entry */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+
+ /* prev does not change */
+
+ inxact_created = true;
+ }
+ else
+ prev = pending;
+ }
+
+ /*
+ * If the init-fork was created in this transaction, remove the init-fork
+ * and cancel preceding undo log. Otherwise, register an at-commit
+ * pending-unlink for the existing init-fork. See RelationCreateInitFork.
+ */
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rlocator, InvalidBackendId);
+ ForkNumber forknum = INIT_FORKNUM;
+ BlockNumber firstblock = 0;
+ ul_uncommitted_storage ul_storage;
+
+ /*
+ * Some AMs initialize init-fork via the buffer manager. To properly
+ * drop the init-fork, first drop all buffers for the init-fork, then
+ * unlink the init-fork and cancel preceding undo log.
+ */
+ DropRelationBuffers(srel, &forknum, 1, &firstblock);
+
+ /* cancel existing undo log */
+ ul_storage.rlocator = rlocator;
+ ul_storage.forknum = INIT_FORKNUM;
+ ul_storage.remove = false;
+ SimpleUndoLogWrite(RM_SMGR_ID, ULOG_SMGR_UNCOMMITED_STORAGE,
+ GetCurrentTransactionId(),
+ &ul_storage, sizeof(ul_storage));
+ log_smgrunlink(&rlocator, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ return;
+ }
+
+ /* register drop of this init fork file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = rlocator;
+ pending->op = PCOP_UNLINK_FORK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -248,6 +452,25 @@ log_smgrunlink(const RelFileLocator *rlocator, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileLocator rlocator, bool persistence)
+{
+ xl_smgr_bufpersistence xlrec;
+
+ /*
+ * Make an XLOG entry reporting the change of buffer persistence.
+ */
+ xlrec.rlocator = rlocator;
+ xlrec.persistence = persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -800,7 +1023,14 @@ smgrDoPendingCleanups(bool isCommit)
srel = smgropen(pending->rlocator, pending->backend);
- Assert((pending->op & ~(PCOP_UNLINK_FORK)) == 0);
+ Assert((pending->op &
+ ~(PCOP_UNLINK_FORK | PCOP_SET_PERSISTENCE)) == 0);
+
+ if (pending->op & PCOP_SET_PERSISTENCE)
+ {
+ SetRelationBuffersPersistence(srel, pending->bufpersistence,
+ InRecovery);
+ }
if (pending->op & PCOP_UNLINK_FORK)
{
@@ -1200,6 +1430,112 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ PendingCleanup *prev = NULL;
+
+ reln = smgropen(xlrec->rlocator, InvalidBackendId);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+
+ /*
+ * Delete any pending action for persistence change, if present. There
+ * should be at most one entry for this action.
+ */
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileLocatorEquals(xlrec->rlocator, pending->rlocator) &&
+ (pending->op & PCOP_SET_PERSISTENCE) != 0)
+ {
+ Assert(pending->bufpersistence == xlrec->persistence);
+
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+
+ /*
+ * During abort, revert any changes to buffer persistence made in
+ * this transaction.
+ */
+ if (!pending)
+ {
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = xlrec->rlocator;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = !xlrec->persistence;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ PendingCleanup *prev = NULL;
+
+ reln = smgropen(xlrec->rlocator, InvalidBackendId);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+
+ /*
+ * Delete any pending action for persistence change, if present. There
+ * should be at most one entry for this action.
+ */
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileLocatorEquals(xlrec->rlocator, pending->rlocator) &&
+ (pending->op & PCOP_SET_PERSISTENCE) != 0)
+ {
+ Assert(pending->bufpersistence == xlrec->persistence);
+
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+
+ /*
+ * During abort, revert any changes to buffer persistence made in
+ * this transaction.
+ */
+ if (!pending)
+ {
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = xlrec->rlocator;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = !xlrec->persistence;
+ pending->backend = InvalidBackendId;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 2a56a4357c..aab0ddebd4 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -55,6 +55,7 @@
#include "commands/defrem.h"
#include "commands/event_trigger.h"
#include "commands/policy.h"
+#include "commands/progress.h"
#include "commands/sequence.h"
#include "commands/tablecmds.h"
#include "commands/tablespace.h"
@@ -5680,6 +5681,189 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
return newcmd;
}
+/*
+ * RelationChangePersistence: perform in-place persistence change of a relation
+ */
+static void
+RelationChangePersistence(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ int i;
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * This function handles only pure persistence changes; ATRewriteTable
+ * must be used instead if any of the following is set.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ /*
+ * Initially, gather all relations that require a persistence change.
+ */
+
+ /* Collect OIDs of indexes and toast relations */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ /* Add toast relation if any */
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ List *toastidx;
+ Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
+
+ relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
+ toastidx = RelationGetIndexList(toastrel);
+ relids = list_concat(relids, toastidx);
+ pfree(toastidx);
+ table_close(toastrel, NoLock);
+ }
+
+ table_close(rel, NoLock);
+
+ /* Make changes in storage */
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach(lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+
+ /*
+ * XXXX: Some access methods don't support in-place persistence
+ * changes. GiST uses page LSNs to figure out whether a block has been
+ * modified. However, UNLOGGED GiST indexes use fake LSNs, which are
+ * incompatible with the real LSNs used for LOGGED indexes.
+ *
+ * Potentially, if gistGetFakeLSN behaved similarly for both permanent
+ * and unlogged indexes, we could avoid index rebuilds by emitting
+ * extra WAL records while the index is unlogged.
+ *
+ * Compare relam against a positive list to ensure the hard way is
+ * taken for unknown AMs.
+ */
+ if (r->rd_rel->relkind == RELKIND_INDEX &&
+ /* GiST is excluded */
+ r->rd_rel->relam != BTREE_AM_OID &&
+ r->rd_rel->relam != HASH_AM_OID &&
+ r->rd_rel->relam != GIN_AM_OID &&
+ r->rd_rel->relam != SPGIST_AM_OID &&
+ r->rd_rel->relam != BRIN_AM_OID)
+ {
+ int reindex_flags;
+ ReindexParams params = {0};
+
+ /* reindex doesn't allow concurrent use of the index */
+ table_close(r, NoLock);
+
+ reindex_flags =
+ REINDEX_REL_SUPPRESS_INDEX_USE |
+ REINDEX_REL_CHECK_CONSTRAINTS;
+
+ /* Set the same persistence as the parent relation. */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+
+ /* this doesn't fire REINDEX event trigger */
+ reindex_index(NULL, reloid, reindex_flags, persistence, &params);
+
+ continue;
+ }
+
+ /* Create or drop init fork */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ RelationCreateInitFork(r);
+ else
+ RelationDropInitFork(r);
+
+ /*
+ * If this relation becomes WAL-logged, immediately sync all files
+ * except the init-fork to establish the initial state on storage. The
+ * buffers should have already been flushed out by
+ * RelationCreate(Drop)InitFork called just above. The init-fork should
+ * already be synchronized as required.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT)
+ {
+ for (i = 0; i < INIT_FORKNUM; i++)
+ {
+ if (smgrexists(RelationGetSmgr(r), i))
+ smgrimmedsync(RelationGetSmgr(r), i);
+ }
+ }
+
+ /* Update catalog */
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+
+ /*
+ * If wal_level >= replica, switching to LOGGED necessitates WAL-logging
+ * the relation content for later recovery. This is not emitted when
+ * wal_level = minimal.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded())
+ {
+ ForkNumber fork;
+ xl_smgr_truncate xlrec;
+
+ xlrec.blkno = 0;
+ xlrec.rlocator = r->rd_locator;
+ xlrec.flags = SMGR_TRUNCATE_ALL;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ for (fork = 0; fork < INIT_FORKNUM; fork++)
+ {
+ if (smgrexists(RelationGetSmgr(r), fork))
+ log_newpage_range(r, fork, 0,
+ smgrnblocks(RelationGetSmgr(r), fork),
+ false);
+ }
+ }
+
+ table_close(r, NoLock);
+ }
+
+ table_close(classRel, NoLock);
+}
+
/*
* ATRewriteTables: ALTER TABLE phase 3
*/
@@ -5810,48 +5994,55 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- /*
- * Create transient table that will receive the modified data.
- *
- * Ensure it is marked correctly as logged or unlogged. We have
- * to do this here so that buffers for the new relfilenumber will
- * have the right persistence set, and at the same time ensure
- * that the original filenumbers's buffers will get read in with
- * the correct setting (i.e. the original one). Otherwise a
- * rollback after the rewrite would possibly result with buffers
- * for the original filenumbers having the wrong persistence
- * setting.
- *
- * NB: This relies on swap_relation_files() also swapping the
- * persistence. That wouldn't work for pg_class, but that can't be
- * unlogged anyway.
- */
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
- persistence, lockmode);
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
+ RelationChangePersistence(tab, persistence, lockmode);
+ else
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new
+ * relfilenumber will have the right persistence set, and at
+ * the same time ensure that the original filenumbers's buffers
+ * will get read in with the correct setting (i.e. the original
+ * one). Otherwise a rollback after the rewrite would possibly
+ * result with buffers for the original filenumbers having the
+ * wrong persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that
+ * can't be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace,
+ NewAccessMethod,
+ persistence, lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
+ /*
+ * Swap the physical files of the old and new heaps, then
+ * rebuild indexes and discard the old heap. We can use
+ * RecentXmin for the table's new relfrozenxid because we
+ * rewrote all the tuples in ATRewriteTable, so no older Xid
+ * remains in the table. Also, we never try to swap toast
+ * tables by content, since we have no interest in letting
+ * this code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
- InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ }
}
else if (tab->rewrite > 0 && tab->relkind == RELKIND_SEQUENCE)
{
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7d601bef6d..4de1db412c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3804,6 +3804,90 @@ DropRelationBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
}
}
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a relation
+ * then writes all dirty pages to disk (or kernel disk buffers) when
+ * switching to PERMANENT, ensuring the kernel has an up-to-date view of
+ * the relation.
+ *
+ * The caller must be holding AccessExclusiveLock on the target relation
+ * to ensure no other backend is busy dirtying more blocks.
+ *
+ * XXX currently it sequentially searches the buffer pool; consider
+ * implementing more efficient search methods. This routine isn't used in
+ * performance-critical code paths, so it's not worth additional overhead
+ * to make it go faster; see also DropRelationBuffers.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
+{
+ int i;
+ RelFileLocatorBackend rlocator = srel->smgr_rlocator;
+
+ Assert(!RelFileLocatorBackendIsTemp(rlocator));
+
+ if (!isRedo)
+ log_smgrbufpersistence(srel->smgr_rlocator.locator, permanent);
+
+ ResourceOwnerEnlarge(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ if (!RelFileLocatorEquals(BufTagGetRelFileLocator(&bufHdr->tag),
+ rlocator.locator))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (!RelFileLocatorEquals(BufTagGetRelFileLocator(&bufHdr->tag),
+ rlocator.locator))
+ {
+ UnlockBufHdr(bufHdr, buf_state);
+ continue;
+ }
+
+ if (permanent)
+ {
+ /* The init fork is being dropped, drop buffers for it. */
+ if (BufTagGetForkNum(&bufHdr->tag) == INIT_FORKNUM)
+ {
+ InvalidateBuffer(bufHdr);
+ continue;
+ }
+
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /* flush this buffer when switching to PERMANENT */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, srel, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ /* There shouldn't be an init fork */
+ Assert(BufTagGetForkNum(&bufHdr->tag) != INIT_FORKNUM);
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ }
+}
+
/* ---------------------------------------------------------------------
* DropRelationsAllBuffers
*
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 525b98899f..c8c9cc361f 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -418,6 +418,12 @@ extractPageInfo(XLogReaderState *record)
* source system.
*/
}
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ /*
+ * We can safely ignore these. These don't make any on-disk changes.
+ */
+ }
else if (rmid == RM_XACT_ID &&
((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT ||
(rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED ||
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index 807c0f8235..b38909ceb3 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -14,6 +14,7 @@
#ifndef STORAGE_XLOG_H
#define STORAGE_XLOG_H
+#include "access/simpleundolog.h"
#include "access/xlogreader.h"
#include "lib/stringinfo.h"
#include "storage/block.h"
@@ -30,6 +31,7 @@
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
#define XLOG_SMGR_UNLINK 0x30
+#define XLOG_SMGR_BUFPERSISTENCE 0x40
typedef struct xl_smgr_create
{
@@ -44,6 +46,12 @@ typedef struct xl_smgr_unlink
ForkNumber forkNum;
} xl_smgr_unlink;
+typedef struct xl_smgr_bufpersistence
+{
+ RelFileLocator rlocator;
+ bool persistence;
+} xl_smgr_bufpersistence;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -60,6 +68,8 @@ typedef struct xl_smgr_truncate
extern void log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum);
extern void log_smgrunlink(const RelFileLocator *rlocator, ForkNumber forkNum);
+extern void log_smgrbufpersistence(const RelFileLocator rlocator,
+ bool persistence);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d335..62f4fe430b 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -224,6 +224,9 @@ extern void DropRelationBuffers(struct SMgrRelationData *smgr_reln,
int nforks, BlockNumber *firstDelBlock);
extern void DropRelationsAllBuffers(struct SMgrRelationData **smgr_reln,
int nlocators);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+ bool permanent, bool isRedo);
+
extern void DropDatabaseBuffers(Oid dbid);
#define RelationGetNumberOfBlocks(reln) \
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index c57ae26b4c..746d3a910a 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -20,11 +20,11 @@
extern void ResetUnloggedRelations(int op);
-extern void ResetUnloggedRelationIgnore(RelFileLocator rloc);
extern bool parse_filename_for_nontemp_relation(const char *name,
RelFileNumber *relnumber,
ForkNumber *fork,
unsigned *segno);
+extern void ResetUnloggedRelationIgnore(RelFileLocator rloc);
#define UNLOGGED_RELATION_CLEANUP 0x0001
#define UNLOGGED_RELATION_INIT 0x0002
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 7bfa98d5aa..fff2b34ff5 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3974,6 +3974,7 @@ xl_replorigin_set
xl_restore_point
xl_running_xacts
xl_seq_rec
+xl_smgr_bufpersistence
xl_smgr_create
xl_smgr_truncate
xl_smgr_unlink
--
2.39.3
Rebased.
Along with rebasing, I changed the interface of XLogFsyncFile() to
return a boolean instead of an error message.
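For readers skimming the patches, the calling convention after this change can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in the patch: sketch_fsync_file() is a hypothetical stand-in for XLogFsyncFile(fd, &msg), it calls plain fsync() instead of honoring wal_sync_method, and the message text is made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for the reworked interface.
 *
 * Returns true if the file was synced.  On failure returns false and, if
 * errmsg is not NULL, stores a message format string that the caller can
 * report together with the file name.  (Assumption: the real function
 * selects the sync call according to wal_sync_method; this sketch always
 * uses fsync().)
 */
bool
sketch_fsync_file(int fd, const char **errmsg)
{
	if (fsync(fd) != 0)
	{
		if (errmsg != NULL)
			*errmsg = "could not fsync file \"%s\"";
		return false;
	}

	return true;
}
```

A caller that only cares about success can pass NULL for errmsg, while a caller such as issue_xlog_fsync() passes a pointer and reports the returned format string with the file name of its choice.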
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v33-0001-Export-wal_sync_method-related-functions.patch (text/x-patch; charset=us-ascii)
From bed74e638643d7491bbd86fe640c33db1e16f0e5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Mon, 15 Jan 2024 15:57:53 +0900
Subject: [PATCH v33 1/3] Export wal_sync_method related functions
Export several functions related to wal_sync_method for use in
subsequent commits. Since PG_O_DIRECT cannot be used in those commits,
the new function XLogGetSyncBit() will mask PG_O_DIRECT.
---
src/backend/access/transam/xlog.c | 73 +++++++++++++++++++++----------
src/include/access/xlog.h | 2 +
2 files changed, 52 insertions(+), 23 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 330e058c5f..492ababd9c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8592,21 +8592,29 @@ assign_wal_sync_method(int new_wal_sync_method, void *extra)
}
}
+/*
+ * Exported version of get_sync_bit()
+ *
+ * Do not expose PG_O_DIRECT for uses outside xlog.c.
+ */
+int
+XLogGetSyncBit(void)
+{
+ return get_sync_bit(wal_sync_method) & ~PG_O_DIRECT;
+}
+
/*
- * Issue appropriate kind of fsync (if any) for an XLOG output file.
+ * Issue appropriate kind of fsync (if any) according to wal_sync_method.
*
- * 'fd' is a file descriptor for the XLOG file to be fsync'd.
- * 'segno' is for error reporting purposes.
+ * 'fd' is a file descriptor for the file to be fsync'd.
*/
-void
-issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
+const char *
+XLogFsyncFile(int fd)
{
- char *msg = NULL;
+ const char *msg = NULL;
instr_time start;
- Assert(tli != 0);
-
/*
* Quick exit if fsync is disabled or write() has already synced the WAL
* file.
@@ -8614,7 +8622,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
if (!enableFsync ||
wal_sync_method == WAL_SYNC_METHOD_OPEN ||
wal_sync_method == WAL_SYNC_METHOD_OPEN_DSYNC)
- return;
+ return NULL;
/* Measure I/O timing to sync the WAL file */
if (track_wal_io_timing)
@@ -8651,19 +8659,6 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
break;
}
- /* PANIC if failed to fsync */
- if (msg)
- {
- char xlogfname[MAXFNAMELEN];
- int save_errno = errno;
-
- XLogFileName(xlogfname, tli, segno, wal_segment_size);
- errno = save_errno;
- ereport(PANIC,
- (errcode_for_file_access(),
- errmsg(msg, xlogfname)));
- }
-
pgstat_report_wait_end();
/*
@@ -8677,7 +8672,39 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, start);
}
- PendingWalStats.wal_sync++;
+ if (msg != NULL)
+ PendingWalStats.wal_sync++;
+
+ return msg;
+}
+
+/*
+ * Issue appropriate kind of fsync (if any) for an XLOG output file.
+ *
+ * 'fd' is a file descriptor for the XLOG file to be fsync'd.
+ * 'segno' is for error reporting purposes.
+ */
+void
+issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
+{
+ const char *msg;
+
+ Assert(tli != 0);
+
+ msg = XLogFsyncFile(fd);
+
+ /* PANIC if failed to fsync */
+ if (msg)
+ {
+ char xlogfname[MAXFNAMELEN];
+ int save_errno = errno;
+
+ XLogFileName(xlogfname, tli, segno, wal_segment_size);
+ errno = save_errno;
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg(msg, xlogfname)));
+ }
}
/*
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 1a1f11a943..badfe4abd6 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -217,6 +217,8 @@ extern void xlog_redo(struct XLogReaderState *record);
extern void xlog_desc(StringInfo buf, struct XLogReaderState *record);
extern const char *xlog_identify(uint8 info);
+extern int XLogGetSyncBit(void);
+extern const char *XLogFsyncFile(int fd);
extern void issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli);
extern bool RecoveryInProgress(void);
--
2.43.0
v33-0002-Introduce-undo-log-implementation.patch (text/x-patch; charset=us-ascii)
From c200b85c1311f97bdae2ed20e2746c44d5c4aadb Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 31 Aug 2023 11:49:10 +0900
Subject: [PATCH v33 2/3] Introduce undo log implementation
This patch adds a simple implementation of the UNDO log feature.
---
src/backend/access/transam/Makefile | 1 +
src/backend/access/transam/meson.build | 1 +
src/backend/access/transam/rmgr.c | 4 +-
src/backend/access/transam/simpleundolog.c | 362 +++++++++++++++++++++
src/backend/access/transam/twophase.c | 3 +
src/backend/access/transam/xact.c | 24 ++
src/backend/access/transam/xlog.c | 42 ++-
src/backend/catalog/storage.c | 171 ++++++++++
src/backend/storage/file/reinit.c | 78 +++++
src/backend/storage/smgr/smgr.c | 9 +
src/bin/initdb/initdb.c | 17 +
src/bin/pg_rewind/parsexlog.c | 2 +-
src/bin/pg_waldump/rmgrdesc.c | 2 +-
src/include/access/rmgr.h | 2 +-
src/include/access/rmgrlist.h | 44 +--
src/include/access/simpleundolog.h | 36 ++
src/include/access/xlog.h | 2 +-
src/include/catalog/storage.h | 3 +
src/include/catalog/storage_ulog.h | 38 +++
src/include/catalog/storage_xlog.h | 9 +
src/include/storage/reinit.h | 2 +
src/include/storage/smgr.h | 1 +
src/tools/pgindent/typedefs.list | 6 +
23 files changed, 818 insertions(+), 41 deletions(-)
create mode 100644 src/backend/access/transam/simpleundolog.c
create mode 100644 src/include/access/simpleundolog.h
create mode 100644 src/include/catalog/storage_ulog.h
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db..531505cbbd 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -21,6 +21,7 @@ OBJS = \
rmgr.o \
slru.o \
subtrans.o \
+ simpleundolog.o \
timeline.o \
transam.o \
twophase.o \
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index 8a3522557c..c1225636b5 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -9,6 +9,7 @@ backend_sources += files(
'rmgr.c',
'slru.c',
'subtrans.c',
+ 'simpleundolog.c',
'timeline.c',
'transam.c',
'twophase.c',
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 1b7499726e..8fe3e71a0c 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -44,8 +44,8 @@
/* must be kept in sync with RmgrData definition in xlog_internal.h */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
- { name, redo, desc, identify, startup, cleanup, mask, decode },
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo) \
+ { name, redo, desc, identify, startup, cleanup, mask, decode},
RmgrData RmgrTable[RM_MAX_ID + 1] = {
#include "access/rmgrlist.h"
diff --git a/src/backend/access/transam/simpleundolog.c b/src/backend/access/transam/simpleundolog.c
new file mode 100644
index 0000000000..e22ed67bae
--- /dev/null
+++ b/src/backend/access/transam/simpleundolog.c
@@ -0,0 +1,362 @@
+/*-------------------------------------------------------------------------
+ *
+ * simpleundolog.c
+ * Simple implementation of PostgreSQL transaction-undo-log manager
+ *
+ * In this module, procedures required during a transaction abort are
+ * logged. Persisting this information becomes crucial, particularly for
+ * ensuring reliable post-processing during the restart following a transaction
+ * crash. At present, in this module, logging of information is performed by
+ * simply appending data to a created file.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/simpleundolog.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/simpleundolog.h"
+#include "access/twophase_rmgr.h"
+#include "access/parallel.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "catalog/storage_ulog.h"
+#include "storage/fd.h"
+
+#define ULOG_FILE_MAGIC 0x12345678
+
+typedef struct UndoLogFileHeader
+{
+ int32 magic;
+ bool prepared;
+} UndoLogFileHeader;
+
+typedef struct UndoDescData
+{
+ const char *name;
+ void (*rm_undo) (SimpleUndoLogRecord *record, bool prepared);
+} UndoDescData;
+
+/* must be kept in sync with RmgrData definition in xlog_internal.h */
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo) \
+ { name, undo },
+
+UndoDescData UndoRoutines[RM_MAX_ID + 1] = {
+#include "access/rmgrlist.h"
+};
+#undef PG_RMGR
+
+static char current_ulogfile_name[MAXPGPATH];
+static int current_ulogfile_fd = -1;
+static int current_xid = InvalidTransactionId;
+static UndoLogFileHeader current_fhdr;
+
+static void
+undolog_check_file_header(void)
+{
+ if (read(current_ulogfile_fd, &current_fhdr, sizeof(current_fhdr)) < 0)
+ ereport(PANIC,
+ errcode_for_file_access(),
+ errmsg("could not read undolog file \"%s\": %m",
+ current_ulogfile_name));
+ if (current_fhdr.magic != ULOG_FILE_MAGIC)
+ ereport(PANIC,
+ errcode_for_file_access(),
+ errmsg("invalid undolog file \"%s\": magic number does not match",
+ current_ulogfile_name));
+}
+
+static void
+undolog_sync_current_file(void)
+{
+ const char *msg;
+
+ msg = XLogFsyncFile(current_ulogfile_fd);
+
+ /* PANIC if failed to fsync */
+ if (msg)
+ {
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg(msg, current_ulogfile_name)));
+ }
+}
+
+static bool
+undolog_open_current_file(TransactionId xid, bool forread, bool append)
+{
+ int omode;
+
+ if (current_ulogfile_fd >= 0)
+ {
+ /* use existing open file */
+ if (current_xid == xid)
+ {
+ if (append)
+ return true;
+
+ if (lseek(current_ulogfile_fd,
+ sizeof(UndoLogFileHeader), SEEK_SET) < 0)
+ ereport(PANIC,
+ errcode_for_file_access(),
+ errmsg("could not seek undolog file \"%s\": %m",
+ current_ulogfile_name));
+ }
+
+ close(current_ulogfile_fd);
+ current_ulogfile_fd = -1;
+ ReleaseExternalFD();
+ }
+
+ current_xid = xid;
+ if (!TransactionIdIsValid(xid))
+ return false;
+
+ omode = PG_BINARY | XLogGetSyncBit();
+
+ if (forread)
+ omode |= O_RDONLY;
+ else
+ {
+ omode |= O_RDWR;
+
+ if (!append)
+ omode |= O_TRUNC;
+ }
+
+ snprintf(current_ulogfile_name, MAXPGPATH, "%s/%08x",
+ SIMPLE_UNDOLOG_DIR, xid);
+ current_ulogfile_fd = BasicOpenFile(current_ulogfile_name, omode);
+ if (current_ulogfile_fd >= 0)
+ undolog_check_file_header();
+ else
+ {
+ if (forread)
+ return false;
+
+ current_fhdr.magic = ULOG_FILE_MAGIC;
+ current_fhdr.prepared = false;
+
+ omode |= O_CREAT;
+ current_ulogfile_fd = BasicOpenFile(current_ulogfile_name, omode);
+ if (current_ulogfile_fd < 0)
+ ereport(PANIC,
+ errcode_for_file_access(),
+ errmsg("could not create undolog file \"%s\": %m",
+ current_ulogfile_name));
+
+ if (write(current_ulogfile_fd, &current_fhdr, sizeof(current_fhdr)) < 0)
+ ereport(PANIC,
+ errcode_for_file_access(),
+ errmsg("could not write undolog file \"%s\": %m",
+ current_ulogfile_name));
+ }
+
+ /*
+ * Move the file pointer to the end of the file. We do this without
+ * O_APPEND so that we can modify data at any location in the file. In the
+ * !append case we are already positioned at the first record.
+ */
+ if (append)
+ {
+ if (lseek(current_ulogfile_fd, 0, SEEK_END) < 0)
+ ereport(PANIC,
+ errcode_for_file_access(),
+ errmsg("could not seek undolog file \"%s\": %m",
+ current_ulogfile_name));
+ }
+ ReserveExternalFD();
+
+ /* sync the file according to wal_sync_method */
+ undolog_sync_current_file();
+
+ return true;
+}
+
+/*
+ * Write an undolog record
+ */
+void
+SimpleUndoLogWrite(RmgrId rmgr, uint8 info,
+ TransactionId xid, void *data, int len)
+{
+ int reclen = sizeof(SimpleUndoLogRecord) + len;
+ SimpleUndoLogRecord *rec = palloc(reclen);
+ pg_crc32c undodata_crc;
+
+ Assert(!IsParallelWorker());
+ Assert(xid != InvalidTransactionId);
+
+ undolog_open_current_file(xid, false, true);
+
+ rec->ul_tot_len = reclen;
+ rec->ul_rmid = rmgr;
+ rec->ul_info = info;
+ rec->ul_xid = current_xid;
+
+ memcpy((char *)rec + sizeof(SimpleUndoLogRecord), data, len);
+
+ /* Calculate CRC of the data */
+ INIT_CRC32C(undodata_crc);
+ COMP_CRC32C(undodata_crc, rec,
+ reclen - offsetof(SimpleUndoLogRecord, ul_rmid));
+ rec->ul_crc = undodata_crc;
+
+
+ if (write(current_ulogfile_fd, rec, reclen) < 0)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not write to undolog file \"%s\": %m",
+ current_ulogfile_name));
+
+ undolog_sync_current_file();
+}
+
+static void
+SimpleUndoLogUndo(bool cleanup)
+{
+ int bufsize;
+ char *buf;
+
+ bufsize = 1024;
+ buf = palloc(bufsize);
+
+ Assert(current_ulogfile_fd >= 0);
+
+ while (read(current_ulogfile_fd, buf, sizeof(SimpleUndoLogRecord)) ==
+ sizeof(SimpleUndoLogRecord))
+ {
+ SimpleUndoLogRecord *rec = (SimpleUndoLogRecord *) buf;
+ int readlen = rec->ul_tot_len - sizeof(SimpleUndoLogRecord);
+ int ret;
+
+ if (rec->ul_tot_len > bufsize)
+ {
+ bufsize *= 2;
+ buf = repalloc(buf, bufsize);
+ }
+
+ ret = read(current_ulogfile_fd,
+ buf + sizeof(SimpleUndoLogRecord), readlen);
+ if (ret != readlen)
+ {
+ if (ret < 0)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not read undo log file \"%s\": %m",
+ current_ulogfile_name));
+
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("reading undo log expected %d bytes, but actually %d: %s",
+ readlen, ret, current_ulogfile_name));
+
+ }
+
+ UndoRoutines[rec->ul_rmid].rm_undo(rec,
+ current_fhdr.prepared && cleanup);
+ }
+}
+
+void
+AtEOXact_SimpleUndoLog(bool isCommit, TransactionId xid)
+{
+ if (IsParallelWorker())
+ return;
+
+ if (!undolog_open_current_file(xid, true, false))
+ return;
+
+ if (!isCommit)
+ SimpleUndoLogUndo(false);
+
+ if (current_ulogfile_fd > 0)
+ {
+ if (close(current_ulogfile_fd) != 0)
+ ereport(PANIC, errcode_for_file_access(),
+ errmsg("could not close file \"%s\": %m",
+ current_ulogfile_name));
+
+ current_ulogfile_fd = -1;
+ ReleaseExternalFD();
+ durable_unlink(current_ulogfile_name, FATAL);
+ }
+
+ return;
+}
+
+void
+UndoLogCleanup(void)
+{
+ DIR *dirdesc;
+ struct dirent *de;
+ char **loglist;
+ int loglistspace = 128;
+ int loglistlen = 0;
+ int i;
+
+ loglist = palloc(sizeof(char*) * loglistspace);
+
+ dirdesc = AllocateDir(SIMPLE_UNDOLOG_DIR);
+ while ((de = ReadDir(dirdesc, SIMPLE_UNDOLOG_DIR)) != NULL)
+ {
+ if (strspn(de->d_name, "01234567890abcdef") < strlen(de->d_name))
+ continue;
+
+ if (loglistlen >= loglistspace)
+ {
+ loglistspace *= 2;
+ loglist = repalloc(loglist, sizeof(char*) * loglistspace);
+ }
+ loglist[loglistlen++] = pstrdup(de->d_name);
+ }
+
+ for (i = 0 ; i < loglistlen ; i++)
+ {
+ snprintf(current_ulogfile_name, MAXPGPATH, "%s/%s",
+ SIMPLE_UNDOLOG_DIR, loglist[i]);
+ current_ulogfile_fd = BasicOpenFile(current_ulogfile_name,
+ O_RDWR | PG_BINARY |
+ XLogGetSyncBit());
+ undolog_check_file_header();
+ SimpleUndoLogUndo(true);
+ if (close(current_ulogfile_fd) != 0)
+ ereport(PANIC, errcode_for_file_access(),
+ errmsg("could not close file \"%s\": %m",
+ current_ulogfile_name));
+ current_ulogfile_fd = -1;
+
+ /* do not remove ulog files for prepared transactions */
+ if (!current_fhdr.prepared)
+ durable_unlink(current_ulogfile_name, FATAL);
+ }
+}
+
+/*
+ * Mark this xid as prepared
+ */
+void
+SimpleUndoLogSetPrpared(TransactionId xid, bool prepared)
+{
+ Assert(xid != InvalidTransactionId);
+
+ undolog_open_current_file(xid, false, true);
+ current_fhdr.prepared = prepared;
+ if (lseek(current_ulogfile_fd, 0, SEEK_SET) < 0)
+ ereport(PANIC,
+ errcode_for_file_access(),
+ errmsg("could not seek undolog file \"%s\": %m",
+ current_ulogfile_name));
+
+ if (write(current_ulogfile_fd, &current_fhdr, sizeof(current_fhdr)) < 0)
+ ereport(PANIC,
+ errcode_for_file_access(),
+ errmsg("could not write undolog file \"%s\": %m",
+ current_ulogfile_name));
+
+ undolog_sync_current_file();
+}
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index bf451d42ff..db3b227111 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -78,6 +78,7 @@
#include "access/commit_ts.h"
#include "access/htup_details.h"
+#include "access/simpleundolog.h"
#include "access/subtrans.h"
#include "access/transam.h"
#include "access/twophase.h"
@@ -1587,6 +1588,8 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
abortstats,
gid);
+ AtEOXact_SimpleUndoLog(isCommit, xid);
+
ProcArrayRemove(proc, latestXid);
/*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 4f4ce75762..d81e51746b 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -24,6 +24,7 @@
#include "access/multixact.h"
#include "access/parallel.h"
#include "access/subtrans.h"
+#include "access/simpleundolog.h"
#include "access/transam.h"
#include "access/twophase.h"
#include "access/xact.h"
@@ -2267,6 +2268,9 @@ CommitTransaction(void)
*/
smgrDoPendingSyncs(true, is_parallel_worker);
+ /* Likewise perform uncommitted storage file deletion. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2414,6 +2418,7 @@ CommitTransaction(void)
AtEOXact_on_commit_actions(true);
AtEOXact_Namespace(true, is_parallel_worker);
AtEOXact_SMgr();
+ AtEOXact_SimpleUndoLog(true, GetCurrentTransactionIdIfAny());
AtEOXact_Files(true);
AtEOXact_ComboCid();
AtEOXact_HashTables(true);
@@ -2523,6 +2528,9 @@ PrepareTransaction(void)
*/
smgrDoPendingSyncs(true, false);
+ /* Likewise perform uncommitted storage file deletion. */
+ smgrDoPendingCleanups(true);
+
/* close large objects before lower-level cleanup */
AtEOXact_LargeObject(true);
@@ -2856,6 +2864,7 @@ AbortTransaction(void)
AfterTriggerEndXact(false); /* 'false' means it's abort */
AtAbort_Portals();
smgrDoPendingSyncs(false, is_parallel_worker);
+ smgrDoPendingCleanups(false);
AtEOXact_LargeObject(false);
AtAbort_Notify();
AtEOXact_RelationMap(false, is_parallel_worker);
@@ -2923,6 +2932,7 @@ AbortTransaction(void)
AtEOXact_on_commit_actions(false);
AtEOXact_Namespace(false, is_parallel_worker);
AtEOXact_SMgr();
+ AtEOXact_SimpleUndoLog(false, GetCurrentTransactionIdIfAny());
AtEOXact_Files(false);
AtEOXact_ComboCid();
AtEOXact_HashTables(false);
@@ -5107,6 +5117,8 @@ CommitSubTransaction(void)
AtEOSubXact_Inval(true);
AtSubCommit_smgr();
+ AtEOXact_SimpleUndoLog(true, GetCurrentTransactionIdIfAny());
+
/*
* The only lock we actually release here is the subtransaction XID lock.
*/
@@ -5300,6 +5312,7 @@ AbortSubTransaction(void)
RESOURCE_RELEASE_AFTER_LOCKS,
false, false);
AtSubAbort_smgr();
+ AtEOXact_SimpleUndoLog(false, GetCurrentTransactionIdIfAny());
AtEOXact_GUC(false, s->gucNestLevel);
AtEOSubXact_SPI(false, s->subTransactionId);
@@ -5790,7 +5803,10 @@ XactLogCommitRecord(TimestampTz commit_time,
if (!TransactionIdIsValid(twophase_xid))
info = XLOG_XACT_COMMIT;
else
+ {
+ elog(LOG, "COMMIT PREPARED: %d", twophase_xid);
info = XLOG_XACT_COMMIT_PREPARED;
+ }
/* First figure out and collect all the information needed */
@@ -6190,6 +6206,8 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
DropRelationFiles(parsed->xlocators, parsed->nrels, true);
}
+ AtEOXact_SimpleUndoLog(true, xid);
+
if (parsed->nstats > 0)
{
/* see equivalent call for relations above */
@@ -6301,6 +6319,8 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
DropRelationFiles(parsed->xlocators, parsed->nrels, true);
}
+ AtEOXact_SimpleUndoLog(false, xid);
+
if (parsed->nstats > 0)
{
/* see equivalent call for relations above */
@@ -6366,6 +6386,10 @@ xact_redo(XLogReaderState *record)
}
else if (info == XLOG_XACT_PREPARE)
{
+ xl_xact_prepare *xlrec = (xl_xact_prepare *) XLogRecGetData(record);
+
+ AtEOXact_SimpleUndoLog(true, xlrec->xid);
+
/*
* Store xid and start/end pointers of the WAL record in TwoPhaseState
* gxact entry.
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 492ababd9c..ce5e299b36 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -51,6 +51,7 @@
#include "access/heaptoast.h"
#include "access/multixact.h"
#include "access/rewriteheap.h"
+#include "access/simpleundolog.h"
#include "access/subtrans.h"
#include "access/timeline.h"
#include "access/transam.h"
@@ -5720,6 +5721,12 @@ StartupXLOG(void)
/* Check that the GUCs used to generate the WAL allow recovery */
CheckRequiredParameterValues();
+ /*
+ * Perform undo processing. This must be done before resetting unlogged
+ * relations.
+ */
+ UndoLogCleanup();
+
/*
* We're in recovery, so unlogged relations may be trashed and must be
* reset. This should be done BEFORE allowing Hot Standby
@@ -5867,14 +5874,17 @@ StartupXLOG(void)
}
/*
- * Reset unlogged relations to the contents of their INIT fork. This is
- * done AFTER recovery is complete so as to include any unlogged relations
- * created during recovery, but BEFORE recovery is marked as having
- * completed successfully. Otherwise we'd not retry if any of the post
- * end-of-recovery steps fail.
+ * Process undo logs left after recovery, then reset unlogged relations to
+ * the contents of their INIT fork. This is done AFTER recovery is complete
+ * so as to include any file creations during recovery, but BEFORE recovery
+ * is marked as having completed successfully. Otherwise we'd not retry if
+ * any of the post end-of-recovery steps fail.
*/
if (InRecovery)
+ {
+ UndoLogCleanup();
ResetUnloggedRelations(UNLOGGED_RELATION_INIT);
+ }
/*
* Pre-scan prepared transactions to find out the range of XIDs present.
@@ -8607,10 +8617,12 @@ XLogGetSyncBit(void)
/*
* Issue appropriate kind of fsync (if any) according to wal_sync_method.
*
+ * Returns true if successfully fsync'ed, otherwise returns false and sets
+ * errmsg if it is not NULL.
* 'fd' is a file descriptor for the file to be fsync'd.
*/
-const char *
-XLogFsyncFile(int fd)
+bool
+XLogFsyncFile(int fd, const char **errmsg)
{
const char *msg = NULL;
instr_time start;
@@ -8622,7 +8634,7 @@ XLogFsyncFile(int fd)
if (!enableFsync ||
wal_sync_method == WAL_SYNC_METHOD_OPEN ||
wal_sync_method == WAL_SYNC_METHOD_OPEN_DSYNC)
- return NULL;
+ return true;
/* Measure I/O timing to sync the WAL file */
if (track_wal_io_timing)
@@ -8672,10 +8684,16 @@ XLogFsyncFile(int fd)
INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, start);
}
- if (msg != NULL)
+ if (msg == NULL)
+ {
PendingWalStats.wal_sync++;
+ if (errmsg)
+ *errmsg = msg;
- return msg;
+ return false;
+ }
+
+ return true;
}
/*
@@ -8691,10 +8709,8 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
Assert(tli != 0);
- msg = XLogFsyncFile(fd);
-
/* PANIC if failed to fsync */
- if (msg)
+ if (!XLogFsyncFile(fd, &msg))
{
char xlogfname[MAXFNAMELEN];
int save_errno = errno;
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index f56b3cc0f2..ae1bf597bd 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,17 +19,21 @@
#include "postgres.h"
+#include "access/amapi.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
#include "access/xlogutils.h"
+#include "access/simpleundolog.h"
#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
+#include "catalog/storage_ulog.h"
#include "miscadmin.h"
#include "storage/bulk_write.h"
#include "storage/freespace.h"
#include "storage/proc.h"
+#include "storage/reinit.h"
#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
@@ -67,6 +71,19 @@ typedef struct PendingRelDelete
struct PendingRelDelete *next; /* linked-list link */
} PendingRelDelete;
+#define PCOP_UNLINK_FORK (1 << 0)
+
+typedef struct PendingCleanup
+{
+ RelFileLocator rlocator; /* relation that needs a cleanup */
+ int op; /* operation mask */
+ ForkNumber unlink_forknum; /* forknum to unlink */
+ ProcNumber procNumber; /* INVALID_PROC_NUMBER if not a temp rel */
+ bool atCommit; /* T=delete at commit; F=delete at abort */
+ int nestLevel; /* xact nesting level of request */
+ struct PendingCleanup *next; /* linked-list link */
+} PendingCleanup;
+
typedef struct PendingRelSync
{
RelFileLocator rlocator;
@@ -74,6 +91,7 @@ typedef struct PendingRelSync
} PendingRelSync;
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
+static PendingCleanup * pendingCleanups = NULL; /* head of linked list */
static HTAB *pendingSyncHash = NULL;
@@ -149,6 +167,19 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
srel = smgropen(rlocator, procNumber);
smgrcreate(srel, MAIN_FORKNUM, false);
+ /* Write undo log; this is required regardless of needs_wal */
+ if (register_delete)
+ {
+ ul_uncommitted_storage ul_storage;
+
+ ul_storage.rlocator = rlocator;
+ ul_storage.forknum = MAIN_FORKNUM;
+ ul_storage.remove = true;
+ SimpleUndoLogWrite(RM_SMGR_ID, ULOG_SMGR_UNCOMMITED_STORAGE,
+ GetCurrentTransactionId(),
+ &ul_storage, sizeof(ul_storage));
+ }
+
if (needs_wal)
log_smgrcreate(&srel->smgr_rlocator.locator, MAIN_FORKNUM);
@@ -192,12 +223,32 @@ log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum)
*/
xlrec.rlocator = *rlocator;
xlrec.forkNum = forkNum;
+ xlrec.xid = GetTopTransactionId();
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, sizeof(xlrec));
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileLocator *rlocator, ForkNumber forkNum)
+{
+ xl_smgr_unlink xlrec;
+
+ /*
+ * Make an XLOG entry reporting the file unlink.
+ */
+ xlrec.rlocator = *rlocator;
+ xlrec.forkNum = forkNum;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -693,6 +744,75 @@ smgrDoPendingDeletes(bool isCommit)
}
}
+/*
+ * smgrDoPendingCleanups() -- Clean up work that emits WAL records
+ *
+ * The operations handled in this function emit WAL records, which must be
+ * part of the current transaction.
+ */
+void
+smgrDoPendingCleanups(bool isCommit)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+ if (pending->nestLevel < nestLevel)
+ {
+ /* outer-level entries should not be processed yet */
+ prev = pending;
+ }
+ else
+ {
+ /* unlink list entry first, so we don't retry on failure */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ /* do cleanup if called for */
+ if (pending->atCommit == isCommit)
+ {
+ SMgrRelation srel;
+
+ srel = smgropen(pending->rlocator, pending->procNumber);
+
+ Assert((pending->op & ~(PCOP_UNLINK_FORK)) == 0);
+
+ if (pending->op & PCOP_UNLINK_FORK)
+ {
+ BlockNumber firstblock = 0;
+
+ /*
+ * Unlink the fork file. Currently this operation is
+ * applied only to init-forks. As it is not certain that
+ * the init-fork is not loaded on shared buffers, drop all
+ * buffers for it.
+ */
+ Assert(pending->unlink_forknum == INIT_FORKNUM);
+ DropRelationBuffers(srel, &pending->unlink_forknum, 1,
+ &firstblock);
+
+ /* Don't emit WAL during recovery. */
+ if (!InRecovery)
+ log_smgrunlink(&pending->rlocator,
+ pending->unlink_forknum);
+ smgrunlink(srel, pending->unlink_forknum, false);
+ }
+ }
+
+ /* must explicitly free the list entry */
+ pfree(pending);
+ /* prev does not change */
+ }
+ }
+}
+
/*
* smgrDoPendingSyncs() -- Take care of relation syncs at end of xact.
*/
@@ -902,6 +1022,9 @@ PostPrepare_smgr(void)
/* must explicitly free the list entry */
pfree(pending);
}
+
+ /* Mark undolog as prepared */
+ SimpleUndoLogSetPrpared(GetCurrentTransactionId(), true);
}
@@ -949,10 +1072,28 @@ smgr_redo(XLogReaderState *record)
{
xl_smgr_create *xlrec = (xl_smgr_create *) XLogRecGetData(record);
SMgrRelation reln;
+ ul_uncommitted_storage ul_storage;
+
+ /* write undo log */
+ ul_storage.rlocator = xlrec->rlocator;
+ ul_storage.forknum = xlrec->forkNum;
+ ul_storage.remove = true;
+ SimpleUndoLogWrite(RM_SMGR_ID, ULOG_SMGR_UNCOMMITED_STORAGE,
+ xlrec->xid,
+ &ul_storage, sizeof(ul_storage));
reln = smgropen(xlrec->rlocator, INVALID_PROC_NUMBER);
smgrcreate(reln, xlrec->forkNum, true);
}
+ else if (info == XLOG_SMGR_UNLINK)
+ {
+ xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rlocator, INVALID_PROC_NUMBER);
+ smgrunlink(reln, xlrec->forkNum, true);
+ smgrclose(reln);
+ }
else if (info == XLOG_SMGR_TRUNCATE)
{
xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
@@ -1044,3 +1185,33 @@ smgr_redo(XLogReaderState *record)
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
+
+void
+smgr_undo(SimpleUndoLogRecord *record, bool crash_prepared)
+{
+ uint8 info = record->ul_info;
+
+
+ if (info == ULOG_SMGR_UNCOMMITED_STORAGE)
+ {
+ ul_uncommitted_storage *ul_storage =
+ (ul_uncommitted_storage *) ULogRecGetData(record);
+
+ if (!crash_prepared)
+ {
+ SMgrRelation reln;
+
+ reln = smgropen(ul_storage->rlocator, INVALID_PROC_NUMBER);
+ smgrunlink(reln, ul_storage->forknum, true);
+ smgrclose(reln);
+ }
+ else
+ {
+ /* Inform reinit to ignore this file during cleanup */
+ ResetUnloggedRelationIgnore(ul_storage->rlocator);
+ }
+
+ }
+ else
+ elog(PANIC, "smgr_undo: unknown op code %u", info);
+}
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index f1cd1a38d9..58ad350ec2 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -34,6 +34,39 @@ typedef struct
RelFileNumber relnumber; /* hash key */
} unlogged_relation_entry;
+static char **ignore_files = NULL;
+static int nignore_elems = 0;
+static int nignore_files = 0;
+
+/*
+ * Identify files that should be ignored while resetting unlogged relations.
+ */
+static bool
+reinit_ignore_file(const char *dirname, const char *name)
+{
+ char fnamebuf[MAXPGPATH];
+ int len;
+
+ if (nignore_files == 0)
+ return false;
+
+ strncpy(fnamebuf, dirname, MAXPGPATH - 1);
+ strncat(fnamebuf, "/", MAXPGPATH - 1);
+ strncat(fnamebuf, name, MAXPGPATH - 1);
+ fnamebuf[MAXPGPATH - 1] = 0;
+
+ for (int i = 0 ; i < nignore_files ; i++)
+ {
+ /* match ignoring fork part */
+ len = strlen(ignore_files[i]);
+ if (strncmp(fnamebuf, ignore_files[i], len) == 0 &&
+ (fnamebuf[len] == 0 || fnamebuf[len] == '_'))
+ return true;
+ }
+
+ return false;
+}
+
/*
* Reset unlogged relations from before the last restart.
*
@@ -204,6 +237,10 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
&forkNum, &segno))
continue;
+ /* Skip anything that undo log suggested to ignore */
+ if (reinit_ignore_file(dbspacedirname, de->d_name))
+ continue;
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -243,6 +280,10 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
&forkNum, &segno))
continue;
+ /* Skip anything that undo log suggested to ignore */
+ if (reinit_ignore_file(dbspacedirname, de->d_name))
+ continue;
+
/* We never remove the init fork. */
if (forkNum == INIT_FORKNUM)
continue;
@@ -294,6 +335,10 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
&forkNum, &segno))
continue;
+ /* Skip anything that undo log suggested to ignore */
+ if (reinit_ignore_file(dbspacedirname, de->d_name))
+ continue;
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -337,6 +382,10 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
&forkNum, &segno))
continue;
+ /* Skip anything that undo log suggested to ignore */
+ if (reinit_ignore_file(dbspacedirname, de->d_name))
+ continue;
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -366,6 +415,35 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
}
}
+/*
+ * Record relfilenodes that should be left alone while reinitializing unlogged
+ * relations.
+ */
+void
+ResetUnloggedRelationIgnore(RelFileLocator rloc)
+{
+ RelFileLocatorBackend rbloc;
+
+ if (nignore_files >= nignore_elems)
+ {
+ if (ignore_files == NULL)
+ {
+ nignore_elems = 16;
+ ignore_files = palloc(sizeof(char *) * nignore_elems);
+ }
+ else
+ {
+ nignore_elems *= 2;
+ ignore_files = repalloc(ignore_files,
+ sizeof(char *) * nignore_elems);
+ }
+ }
+
+ rbloc.backend = INVALID_PROC_NUMBER;
+ rbloc.locator = rloc;
+ ignore_files[nignore_files++] = relpath(rbloc, MAIN_FORKNUM);
+}
+
/*
* Basic parsing of putative relation filenames.
*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index a691aed1f4..d3773a2b8a 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -791,6 +791,15 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+/*
+ * smgrunlink() -- unlink the storage file
+ */
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rlocator, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 12ae194067..9f9d6511e6 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -307,6 +307,7 @@ void setup_signals(void);
void setup_text_search(void);
void create_data_directory(void);
void create_xlog_or_symlink(void);
+void create_ulog(void);
void warn_on_mount_point(int error);
void initialize_data_directory(void);
@@ -2958,6 +2959,21 @@ create_xlog_or_symlink(void)
free(subdirloc);
}
+/* Create undo log directory */
+void
+create_ulog(void)
+{
+ char *subdirloc;
+
+ /* form name of the place for the subdirectory */
+ subdirloc = psprintf("%s/pg_ulog", pg_data);
+
+ if (mkdir(subdirloc, pg_dir_create_mode) < 0)
+ pg_fatal("could not create directory \"%s\": %m",
+ subdirloc);
+
+ free(subdirloc);
+}
void
warn_on_mount_point(int error)
@@ -2992,6 +3008,7 @@ initialize_data_directory(void)
create_data_directory();
create_xlog_or_symlink();
+ create_ulog();
/* Create required subdirectories (other than pg_wal) */
printf(_("creating subdirectories ... "));
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 22f7351fdc..525b98899f 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -28,7 +28,7 @@
* RmgrNames is an array of the built-in resource manager names, to make error
* messages a bit nicer.
*/
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo) \
name,
static const char *const RmgrNames[RM_MAX_ID + 1] = {
diff --git a/src/bin/pg_waldump/rmgrdesc.c b/src/bin/pg_waldump/rmgrdesc.c
index 6b8c17bb4c..a21009c5b8 100644
--- a/src/bin/pg_waldump/rmgrdesc.c
+++ b/src/bin/pg_waldump/rmgrdesc.c
@@ -32,7 +32,7 @@
#include "storage/standbydefs.h"
#include "utils/relmapper.h"
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo) \
{ name, desc, identify},
static const RmgrDescData RmgrDescTable[RM_N_BUILTIN_IDS] = {
diff --git a/src/include/access/rmgr.h b/src/include/access/rmgr.h
index 3b6a497e1b..d705de9256 100644
--- a/src/include/access/rmgr.h
+++ b/src/include/access/rmgr.h
@@ -19,7 +19,7 @@ typedef uint8 RmgrId;
* Note: RM_MAX_ID must fit in RmgrId; widening that type will affect the XLOG
* file format.
*/
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo) \
symname,
typedef enum RmgrIds
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 78e6b908c6..7f0abded93 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -25,25 +25,25 @@
*/
/* symbol name, textual name, redo, desc, identify, startup, cleanup, mask, decode */
-PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL, NULL, xlog_decode)
-PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL, NULL, xact_decode)
-PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL, standby_decode)
-PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask, heap2_decode)
-PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask, heap_decode)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask, NULL)
-PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask, NULL)
-PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask, NULL)
-PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask, NULL)
-PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL, seq_mask, NULL)
-PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup, spg_mask, NULL)
-PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL, brin_mask, NULL)
-PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask, NULL)
-PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL, logicalmsg_decode)
+PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL, NULL, xlog_decode, NULL)
+PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL, NULL, xact_decode, NULL)
+PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL, NULL, NULL, smgr_undo)
+PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL, standby_decode, NULL)
+PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask, heap2_decode, NULL)
+PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask, heap_decode, NULL)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask, NULL, NULL)
+PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask, NULL, NULL)
+PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask, NULL, NULL)
+PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask, NULL, NULL)
+PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL, seq_mask, NULL, NULL)
+PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup, spg_mask, NULL, NULL)
+PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL, brin_mask, NULL, NULL)
+PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask, NULL, NULL)
+PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL, logicalmsg_decode, NULL)
diff --git a/src/include/access/simpleundolog.h b/src/include/access/simpleundolog.h
new file mode 100644
index 0000000000..3d3bd2f7e2
--- /dev/null
+++ b/src/include/access/simpleundolog.h
@@ -0,0 +1,36 @@
+#ifndef SIMPLE_UNDOLOG_H
+#define SIMPLE_UNDOLOG_H
+
+#include "access/rmgr.h"
+#include "port/pg_crc32c.h"
+
+#define SIMPLE_UNDOLOG_DIR "pg_ulog"
+
+typedef struct SimpleUndoLogRecord
+{
+ uint32 ul_tot_len; /* total length of entire record */
+ pg_crc32c ul_crc; /* CRC for this record */
+ RmgrId ul_rmid; /* resource manager for this record */
+ uint8 ul_info; /* record info */
+ TransactionId ul_xid; /* transaction id */
+ /* rmgr-specific data follow, no padding */
+} SimpleUndoLogRecord;
+
+extern void SimpleUndoLogWrite(RmgrId rmgr, uint8 info,
+ TransactionId xid, void *data, int len);
+extern void SimpleUndoLogSetPrpared(TransactionId xid, bool prepared);
+extern void AtEOXact_SimpleUndoLog(bool isCommit, TransactionId xid);
+extern void UndoLogCleanup(void);
+
+extern void AtPrepare_UndoLog(TransactionId xid);
+extern void PostPrepare_UndoLog(void);
+extern void undolog_twophase_recover(TransactionId xid, uint16 info,
+ void *recdata, uint32 len);
+extern void undolog_twophase_postcommit(TransactionId xid, uint16 info,
+ void *recdata, uint32 len);
+extern void undolog_twophase_postabort(TransactionId xid, uint16 info,
+ void *recdata, uint32 len);
+extern void undolog_twophase_standby_recover(TransactionId xid, uint16 info,
+ void *recdata, uint32 len);
+
+#endif /* SIMPLE_UNDOLOG_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index badfe4abd6..00ee01af68 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -218,7 +218,7 @@ extern void xlog_desc(StringInfo buf, struct XLogReaderState *record);
extern const char *xlog_identify(uint8 info);
extern int XLogGetSyncBit(void);
-extern const char *XLogFsyncFile(int fd);
+extern bool XLogFsyncFile(int fd, const char **errmsg);
extern void issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli);
extern bool RecoveryInProgress(void);
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 72ef3ee92c..2a63eabcbd 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -25,6 +25,8 @@ extern PGDLLIMPORT int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileLocator rlocator,
char relpersistence,
bool register_delete);
+extern void RelationCreateInitFork(Relation rel);
+extern void RelationDropInitFork(Relation rel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileLocator rlocator, bool atCommit);
extern void RelationPreTruncate(Relation rel);
@@ -43,6 +45,7 @@ extern void RestorePendingSyncs(char *startAddress);
extern void smgrDoPendingDeletes(bool isCommit);
extern void smgrDoPendingSyncs(bool isCommit, bool isParallelWorker);
extern int smgrGetPendingDeletes(bool forCommit, RelFileLocator **ptr);
+extern void smgrDoPendingCleanups(bool isCommit);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
diff --git a/src/include/catalog/storage_ulog.h b/src/include/catalog/storage_ulog.h
new file mode 100644
index 0000000000..847f0403e2
--- /dev/null
+++ b/src/include/catalog/storage_ulog.h
@@ -0,0 +1,38 @@
+/*-------------------------------------------------------------------------
+ *
+ * storage_ulog.h
+ * prototypes for Undo Log support for backend/catalog/storage.c
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/catalog/storage_ulog.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef STORAGE_ULOG_H
+#define STORAGE_ULOG_H
+
+#include "access/simpleundolog.h"
+#include "storage/relfilelocator.h"
+
+/* ULOG gives us high 4 bits (just following xlog) */
+#define ULOG_SMGR_UNCOMMITED_STORAGE 0x10
+
+/* undo log entry for uncommitted storage files */
+typedef struct ul_uncommitted_storage
+{
+ RelFileLocator rlocator;
+ ForkNumber forknum;
+ bool remove;
+} ul_uncommitted_storage;
+
+/* flags for xl_smgr_truncate */
+#define SMGR_TRUNCATE_HEAP 0x0001
+
+void smgr_undo(SimpleUndoLogRecord *record, bool crash_prepared);
+
+#define ULogRecGetData(record) ((char *)record + sizeof(SimpleUndoLogRecord))
+
+#endif /* STORAGE_ULOG_H */
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index a490e05f88..807c0f8235 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -29,13 +29,21 @@
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_UNLINK 0x30
typedef struct xl_smgr_create
{
RelFileLocator rlocator;
ForkNumber forkNum;
+ TransactionId xid;
} xl_smgr_create;
+typedef struct xl_smgr_unlink
+{
+ RelFileLocator rlocator;
+ ForkNumber forkNum;
+} xl_smgr_unlink;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +59,7 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileLocator *rlocator, ForkNumber forkNum);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index 1373d509df..c57ae26b4c 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -16,9 +16,11 @@
#define REINIT_H
#include "common/relpath.h"
+#include "storage/relfilelocator.h"
extern void ResetUnloggedRelations(int op);
+extern void ResetUnloggedRelationIgnore(RelFileLocator rloc);
extern bool parse_filename_for_nontemp_relation(const char *name,
RelFileNumber *relnumber,
ForkNumber *fork,
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index fc5f883ce1..3428e5233b 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -84,6 +84,7 @@ extern void smgrrelease(SMgrRelation reln);
extern void smgrreleaseall(void);
extern void smgrreleaserellocator(RelFileLocatorBackend rlocator);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 46a84c5714..29a3e52dbf 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2073,6 +2073,7 @@ PatternInfo
PatternInfoArray
Pattern_Prefix_Status
Pattern_Type
+PendingCleanup
PendingFsyncEntry
PendingRelDelete
PendingRelSync
@@ -2645,6 +2646,7 @@ SimplePtrListCell
SimpleStats
SimpleStringList
SimpleStringListCell
+SimpleUndoLogRecord
SingleBoundSortItem
SinglePartitionSpec
Size
@@ -3017,6 +3019,8 @@ ULONG
ULONG_PTR
UV
UVersionInfo
+UndoDescData
+UndoLogFileHeader
UnicodeNormalizationForm
UnicodeNormalizationQC
Unique
@@ -3983,6 +3987,7 @@ uint8
uint8_t
uint8x16_t
uintptr_t
+ul_uncommitted_storage
unicodeStyleBorderFormat
unicodeStyleColumnFormat
unicodeStyleFormat
@@ -4095,6 +4100,7 @@ xl_running_xacts
xl_seq_rec
xl_smgr_create
xl_smgr_truncate
+xl_smgr_unlink
xl_standby_lock
xl_standby_locks
xl_tblspc_create_rec
--
2.43.0
Attachment: v33-0003-In-place-table-persistence-change.patch (text/x-patch; charset=us-ascii)
From cef243b95fe49cbe753731eab55508e32800ac8f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Mon, 4 Sep 2023 17:23:05 +0900
Subject: [PATCH v33 3/3] In-place table persistence change
Previously, ALTER TABLE SET LOGGED/UNLOGGED caused a large amount of file
I/O due to heap rewrites, even though ALTER TABLE SET UNLOGGED does not
require any data rewrites. This patch eliminates the need for those
rewrites. Additionally, ALTER TABLE SET LOGGED now emits XLOG_FPI records
instead of numerous HEAP_INSERTs when wal_level > minimal, reducing
resource consumption.
---
src/backend/access/rmgrdesc/smgrdesc.c | 12 +
src/backend/access/transam/simpleundolog.c | 4 +-
src/backend/catalog/storage.c | 338 ++++++++++++++++++++-
src/backend/commands/tablecmds.c | 268 +++++++++++++---
src/backend/storage/buffer/bufmgr.c | 84 +++++
src/bin/pg_rewind/parsexlog.c | 6 +
src/include/catalog/storage_xlog.h | 10 +
src/include/storage/bufmgr.h | 3 +
src/include/storage/reinit.h | 2 +-
src/tools/pgindent/typedefs.list | 1 +
10 files changed, 684 insertions(+), 44 deletions(-)
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 71410e0a2d..77a8fdb045 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,15 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+ char *path = relpathperm(xlrec->rlocator, MAIN_FORKNUM);
+
+ appendStringInfoString(buf, path);
+ appendStringInfo(buf, " persistence %d", xlrec->persistence);
+ pfree(path);
+ }
}
const char *
@@ -55,6 +64,9 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_BUFPERSISTENCE:
+ id = "BUFPERSISTENCE";
+ break;
}
return id;
diff --git a/src/backend/access/transam/simpleundolog.c b/src/backend/access/transam/simpleundolog.c
index e22ed67bae..ec26c95b32 100644
--- a/src/backend/access/transam/simpleundolog.c
+++ b/src/backend/access/transam/simpleundolog.c
@@ -75,10 +75,8 @@ undolog_sync_current_file(void)
{
const char *msg;
- msg = XLogFsyncFile(current_ulogfile_fd);
-
/* PANIC if failed to fsync */
- if (msg)
+ if (!XLogFsyncFile(current_ulogfile_fd, &msg))
{
ereport(PANIC,
(errcode_for_file_access(),
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index ae1bf597bd..3de229b53d 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -72,11 +72,13 @@ typedef struct PendingRelDelete
} PendingRelDelete;
#define PCOP_UNLINK_FORK (1 << 0)
+#define PCOP_SET_PERSISTENCE (1 << 1)
typedef struct PendingCleanup
{
RelFileLocator rlocator; /* relation that need a cleanup */
int op; /* operation mask */
+ bool bufpersistence; /* buffer persistence to set */
ForkNumber unlink_forknum; /* forknum to unlink */
ProcNumber procNumber; /* INVALID_PROC_NUMBER if not a temp rel */
bool atCommit; /* T=delete at commit; F=delete at abort */
@@ -210,6 +212,208 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
return srel;
}
+/*
+ * RelationCreateInitFork
+ * Create physical storage for the init fork of a relation.
+ *
+ * Create the init fork for the relation.
+ *
+ * This function is transactional. The creation is WAL-logged, and if the
+ * transaction aborts later on, the init fork will be removed.
+ */
+void
+RelationCreateInitFork(Relation rel)
+{
+ RelFileLocator rlocator = rel->rd_locator;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ SMgrRelation srel;
+ ul_uncommitted_storage ul_storage;
+ bool create = true;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), false, false);
+
+ /*
+ * If a pending-unlink exists for this relation's init-fork, it indicates
+ * that the init-fork existed before the current transaction; this function
+ * reverts the pending-unlink by removing the entry. See
+ * RelationDropInitFork.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileLocatorEquals(rlocator, pending->rlocator) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ /* write cancel log for preceding undo log entry */
+ ul_storage.rlocator = rlocator;
+ ul_storage.forknum = INIT_FORKNUM;
+ ul_storage.remove = false;
+ SimpleUndoLogWrite(RM_SMGR_ID, ULOG_SMGR_UNCOMMITED_STORAGE,
+ GetCurrentTransactionId(),
+ &ul_storage, sizeof(ul_storage));
+
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+ /* prev does not change */
+
+ create = false;
+ }
+ else
+ prev = pending;
+ }
+
+ if (!create)
+ return;
+
+ /* create undo log entry, then the init fork */
+ srel = smgropen(rlocator, INVALID_PROC_NUMBER);
+
+ /* write undo log */
+ ul_storage.rlocator = rlocator;
+ ul_storage.forknum = INIT_FORKNUM;
+ ul_storage.remove = true;
+ SimpleUndoLogWrite(RM_SMGR_ID, ULOG_SMGR_UNCOMMITED_STORAGE,
+ GetCurrentTransactionId(),
+ &ul_storage, sizeof(ul_storage));
+
+ /* No init fork exists yet, so create it. */
+ smgrcreate(srel, INIT_FORKNUM, false);
+
+ /*
+ * For index relations, WAL-logging and file sync are handled by
+ * ambuildempty. In contrast, for heap relations, these tasks are performed
+ * directly.
+ */
+ if (rel->rd_rel->relkind == RELKIND_INDEX)
+ rel->rd_indam->ambuildempty(rel);
+ else
+ {
+ log_smgrcreate(&rlocator, INIT_FORKNUM);
+ smgrimmedsync(srel, INIT_FORKNUM);
+ }
+
+ /* at abort, drop the init fork and revert buffer persistence */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = rlocator;
+ pending->op = PCOP_UNLINK_FORK | PCOP_SET_PERSISTENCE;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->bufpersistence = true;
+ pending->procNumber = INVALID_PROC_NUMBER;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
+/*
+ * RelationDropInitFork
+ * Delete physical storage for the init fork of a relation.
+ */
+void
+RelationDropInitFork(Relation rel)
+{
+ RelFileLocator rlocator = rel->rd_locator;
+ PendingCleanup *pending;
+ PendingCleanup *prev;
+ PendingCleanup *next;
+ bool inxact_created = false;
+
+ /* switch buffer persistence */
+ SetRelationBuffersPersistence(RelationGetSmgr(rel), true, false);
+
+ /*
+ * Search for a pending-unlink associated with the init-fork of the
+ * relation. Its presence indicates that the init-fork was created within
+ * the current transaction.
+ */
+ prev = NULL;
+ for (pending = pendingCleanups; pending != NULL; pending = next)
+ {
+ next = pending->next;
+
+ if (RelFileLocatorEquals(rlocator, pending->rlocator) &&
+ pending->unlink_forknum == INIT_FORKNUM)
+ {
+ ul_uncommitted_storage ul_storage;
+
+ /* write cancel log for preceding undo log entry */
+ ul_storage.rlocator = rlocator;
+ ul_storage.forknum = INIT_FORKNUM;
+ ul_storage.remove = false;
+ SimpleUndoLogWrite(RM_SMGR_ID, ULOG_SMGR_UNCOMMITED_STORAGE,
+ GetCurrentTransactionId(),
+ &ul_storage, sizeof(ul_storage));
+
+ /* unlink list entry */
+ if (prev)
+ prev->next = next;
+ else
+ pendingCleanups = next;
+
+ pfree(pending);
+
+ /* prev does not change */
+
+ inxact_created = true;
+ }
+ else
+ prev = pending;
+ }
+
+ /*
+ * If the init-fork was created in this transaction, remove the init-fork
+ * and cancel preceding undo log. Otherwise, register an at-commit
+ * pending-unlink for the existing init-fork. See RelationCreateInitFork.
+ */
+ if (inxact_created)
+ {
+ SMgrRelation srel = smgropen(rlocator, INVALID_PROC_NUMBER);
+ ForkNumber forknum = INIT_FORKNUM;
+ BlockNumber firstblock = 0;
+ ul_uncommitted_storage ul_storage;
+
+ /*
+ * Some AMs initialize init-fork via the buffer manager. To properly
+ * drop the init-fork, first drop all buffers for the init-fork, then
+ * unlink the init-fork and cancel preceding undo log.
+ */
+ DropRelationBuffers(srel, &forknum, 1, &firstblock);
+
+ /* cancel existing undo log */
+ ul_storage.rlocator = rlocator;
+ ul_storage.forknum = INIT_FORKNUM;
+ ul_storage.remove = false;
+ SimpleUndoLogWrite(RM_SMGR_ID, ULOG_SMGR_UNCOMMITED_STORAGE,
+ GetCurrentTransactionId(),
+ &ul_storage, sizeof(ul_storage));
+ log_smgrunlink(&rlocator, INIT_FORKNUM);
+ smgrunlink(srel, INIT_FORKNUM, false);
+ return;
+ }
+
+ /* register drop of this init fork file at commit */
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = rlocator;
+ pending->op = PCOP_UNLINK_FORK;
+ pending->unlink_forknum = INIT_FORKNUM;
+ pending->procNumber = INVALID_PROC_NUMBER;
+ pending->atCommit = true;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+}
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -249,6 +453,25 @@ log_smgrunlink(const RelFileLocator *rlocator, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
+ */
+void
+log_smgrbufpersistence(const RelFileLocator rlocator, bool persistence)
+{
+ xl_smgr_bufpersistence xlrec;
+
+ /*
+ * Make an XLOG entry reporting the change of buffer persistence.
+ */
+ xlrec.rlocator = rlocator;
+ xlrec.persistence = persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -782,7 +1005,14 @@ smgrDoPendingCleanups(bool isCommit)
srel = smgropen(pending->rlocator, pending->procNumber);
- Assert((pending->op & ~(PCOP_UNLINK_FORK)) == 0);
+ Assert((pending->op &
+ ~(PCOP_UNLINK_FORK | PCOP_SET_PERSISTENCE)) == 0);
+
+ if (pending->op & PCOP_SET_PERSISTENCE)
+ {
+ SetRelationBuffersPersistence(srel, pending->bufpersistence,
+ InRecovery);
+ }
if (pending->op & PCOP_UNLINK_FORK)
{
@@ -1182,6 +1412,112 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ PendingCleanup *prev = NULL;
+
+ reln = smgropen(xlrec->rlocator, INVALID_PROC_NUMBER);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+
+ /*
+ * Delete any pending action for persistence change, if present. There
+ * should be at most one entry for this action.
+ */
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileLocatorEquals(xlrec->rlocator, pending->rlocator) &&
+ (pending->op & PCOP_SET_PERSISTENCE) != 0)
+ {
+ Assert(pending->bufpersistence == xlrec->persistence);
+
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+
+ /*
+ * During abort, revert any changes to buffer persistence made in
+ * this transaction.
+ */
+ if (!pending)
+ {
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = xlrec->rlocator;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = !xlrec->persistence;
+ pending->procNumber = INVALID_PROC_NUMBER;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ }
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+ PendingCleanup *pending;
+ PendingCleanup *prev = NULL;
+
+ reln = smgropen(xlrec->rlocator, INVALID_PROC_NUMBER);
+ SetRelationBuffersPersistence(reln, xlrec->persistence, true);
+
+ /*
+ * Delete any pending action for persistence change, if present. There
+ * should be at most one entry for this action.
+ */
+ for (pending = pendingCleanups; pending != NULL;
+ pending = pending->next)
+ {
+ if (RelFileLocatorEquals(xlrec->rlocator, pending->rlocator) &&
+ (pending->op & PCOP_SET_PERSISTENCE) != 0)
+ {
+ Assert(pending->bufpersistence == xlrec->persistence);
+
+ if (prev)
+ prev->next = pending->next;
+ else
+ pendingCleanups = pending->next;
+
+ pfree(pending);
+ break;
+ }
+
+ prev = pending;
+ }
+
+ /*
+ * During abort, revert any changes to buffer persistence made in
+ * this transaction.
+ */
+ if (!pending)
+ {
+ pending = (PendingCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup));
+ pending->rlocator = xlrec->rlocator;
+ pending->op = PCOP_SET_PERSISTENCE;
+ pending->bufpersistence = !xlrec->persistence;
+ pending->procNumber = INVALID_PROC_NUMBER;
+ pending->atCommit = false;
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingCleanups;
+ pendingCleanups = pending;
+ }
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 313c782cae..47c46b13cd 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -5667,6 +5667,189 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
return newcmd;
}
+/*
+ * RelationChangePersistence: perform in-place persistence change of a relation
+ */
+static void
+RelationChangePersistence(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ int i;
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * Use ATRewriteTable instead of this function under the following
+ * condition.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ /*
+ * Initially, gather all relations that require a persistence change.
+ */
+
+ /* Collect OIDs of indexes and toast relations */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ /* Add toast relation if any */
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ List *toastidx;
+ Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
+
+ relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
+ toastidx = RelationGetIndexList(toastrel);
+ relids = list_concat(relids, toastidx);
+ pfree(toastidx);
+ table_close(toastrel, NoLock);
+ }
+
+ table_close(rel, NoLock);
+
+ /* Make changes in storage */
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach(lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+
+ /*
+ * XXXX: Some access methods don't support in-place persistence
+ * changes. GiST uses page LSNs to figure out whether a block has been
+ * modified. However, UNLOGGED GiST indexes use fake LSNs, which are
+ * incompatible with the real LSNs used for LOGGED indexes.
+ *
+ * Potentially, if gistGetFakeLSN behaved similarly for both permanent
+ * and unlogged indexes, we could avoid index rebuilds by emitting
+ * extra WAL records while the index is unlogged.
+ *
+ * Compare relam against a positive list to ensure the hard way is
+ * taken for unknown AMs.
+ */
+ if (r->rd_rel->relkind == RELKIND_INDEX &&
+ /* GiST is excluded */
+ r->rd_rel->relam != BTREE_AM_OID &&
+ r->rd_rel->relam != HASH_AM_OID &&
+ r->rd_rel->relam != GIN_AM_OID &&
+ r->rd_rel->relam != SPGIST_AM_OID &&
+ r->rd_rel->relam != BRIN_AM_OID)
+ {
+ int reindex_flags;
+ ReindexParams params = {0};
+
+ /* reindex doesn't allow concurrent use of the index */
+ table_close(r, NoLock);
+
+ reindex_flags =
+ REINDEX_REL_SUPPRESS_INDEX_USE |
+ REINDEX_REL_CHECK_CONSTRAINTS;
+
+ /* Set the same persistence with the parent relation. */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+ else
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+
+ /* this doesn't fire REINDEX event trigger */
+ reindex_index(NULL, reloid, reindex_flags, persistence, &params);
+
+ continue;
+ }
+
+ /* Create or drop init fork */
+ if (persistence == RELPERSISTENCE_UNLOGGED)
+ RelationCreateInitFork(r);
+ else
+ RelationDropInitFork(r);
+
+ /*
+ * If this relation becomes WAL-logged, immediately sync all files
+ * except the init-fork to establish the initial state on storage. The
+ * buffers should have already been flushed out by
+ * RelationCreate(Drop)InitFork called just above. The init-fork should
+ * already be synchronized as required.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT)
+ {
+ for (i = 0; i < INIT_FORKNUM; i++)
+ {
+ if (smgrexists(RelationGetSmgr(r), i))
+ smgrimmedsync(RelationGetSmgr(r), i);
+ }
+ }
+
+ /* Update catalog */
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+
+ /*
+ * If wal_level >= replica, switching to LOGGED necessitates WAL-logging
+ * the relation content for later recovery. This is not emitted when
+ * wal_level = minimal.
+ */
+ if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded())
+ {
+ ForkNumber fork;
+ xl_smgr_truncate xlrec;
+
+ xlrec.blkno = 0;
+ xlrec.rlocator = r->rd_locator;
+ xlrec.flags = SMGR_TRUNCATE_ALL;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+ for (fork = 0; fork < INIT_FORKNUM; fork++)
+ {
+ if (smgrexists(RelationGetSmgr(r), fork))
+ log_newpage_range(r, fork, 0,
+ smgrnblocks(RelationGetSmgr(r), fork),
+ false);
+ }
+ }
+
+ table_close(r, NoLock);
+ }
+
+ table_close(classRel, NoLock);
+}
+
/*
* ATRewriteTables: ALTER TABLE phase 3
*/
@@ -5797,48 +5980,55 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- /*
- * Create transient table that will receive the modified data.
- *
- * Ensure it is marked correctly as logged or unlogged. We have
- * to do this here so that buffers for the new relfilenumber will
- * have the right persistence set, and at the same time ensure
- * that the original filenumbers's buffers will get read in with
- * the correct setting (i.e. the original one). Otherwise a
- * rollback after the rewrite would possibly result with buffers
- * for the original filenumbers having the wrong persistence
- * setting.
- *
- * NB: This relies on swap_relation_files() also swapping the
- * persistence. That wouldn't work for pg_class, but that can't be
- * unlogged anyway.
- */
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
- persistence, lockmode);
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
+ RelationChangePersistence(tab, persistence, lockmode);
+ else
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new
+ * relfilenumber will have the right persistence set, and at
+ * the same time ensure that the original filenumbers's buffers
+ * will get read in with the correct setting (i.e. the original
+ * one). Otherwise a rollback after the rewrite would possibly
+ * result with buffers for the original filenumbers having the
+ * wrong persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that
+ * can't be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace,
+ NewAccessMethod,
+ persistence, lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
+ /*
+ * Swap the physical files of the old and new heaps, then
+ * rebuild indexes and discard the old heap. We can use
+ * RecentXmin for the table's new relfrozenxid because we
+ * rewrote all the tuples in ATRewriteTable, so no older Xid
+ * remains in the table. Also, we never try to swap toast
+ * tables by content, since we have no interest in letting
+ * this code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
- InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ }
}
else if (tab->rewrite > 0 && tab->relkind == RELKIND_SEQUENCE)
{
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 49637284f9..ed933e7b9e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -4085,6 +4085,90 @@ DropRelationBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
}
}
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a relation
+ * then writes all dirty pages to disk (or kernel disk buffers) when
+ * switching to PERMANENT, ensuring the kernel has an up-to-date view of
+ * the relation.
+ *
+ * The caller must be holding AccessExclusiveLock on the target relation
+ * to ensure no other backend is busy dirtying more blocks.
+ *
+ * XXX currently it sequentially searches the buffer pool; consider
+ * implementing more efficient search methods. This routine isn't used in
+ * performance-critical code paths, so it's not worth additional overhead
+ * to make it go faster; see also DropRelationBuffers.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo)
+{
+ int i;
+ RelFileLocatorBackend rlocator = srel->smgr_rlocator;
+
+ Assert(!RelFileLocatorBackendIsTemp(rlocator));
+
+ if (!isRedo)
+ log_smgrbufpersistence(srel->smgr_rlocator.locator, permanent);
+
+ ResourceOwnerEnlarge(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ if (!RelFileLocatorEquals(BufTagGetRelFileLocator(&bufHdr->tag),
+ rlocator.locator))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (!RelFileLocatorEquals(BufTagGetRelFileLocator(&bufHdr->tag),
+ rlocator.locator))
+ {
+ UnlockBufHdr(bufHdr, buf_state);
+ continue;
+ }
+
+ if (permanent)
+ {
+ /* The init fork is being dropped, drop buffers for it. */
+ if (BufTagGetForkNum(&bufHdr->tag) == INIT_FORKNUM)
+ {
+ InvalidateBuffer(bufHdr);
+ continue;
+ }
+
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /* flush this buffer when switching to PERMANENT */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, srel, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ /* There shouldn't be an init fork */
+ Assert(BufTagGetForkNum(&bufHdr->tag) != INIT_FORKNUM);
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ }
+}
+
/* ---------------------------------------------------------------------
* DropRelationsAllBuffers
*
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 525b98899f..c8c9cc361f 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -418,6 +418,12 @@ extractPageInfo(XLogReaderState *record)
* source system.
*/
}
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ /*
+ * We can safely ignore these. These don't make any on-disk changes.
+ */
+ }
else if (rmid == RM_XACT_ID &&
((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT ||
(rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED ||
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index 807c0f8235..b38909ceb3 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -14,6 +14,7 @@
#ifndef STORAGE_XLOG_H
#define STORAGE_XLOG_H
+#include "access/simpleundolog.h"
#include "access/xlogreader.h"
#include "lib/stringinfo.h"
#include "storage/block.h"
@@ -30,6 +31,7 @@
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
#define XLOG_SMGR_UNLINK 0x30
+#define XLOG_SMGR_BUFPERSISTENCE 0x40
typedef struct xl_smgr_create
{
@@ -44,6 +46,12 @@ typedef struct xl_smgr_unlink
ForkNumber forkNum;
} xl_smgr_unlink;
+typedef struct xl_smgr_bufpersistence
+{
+ RelFileLocator rlocator;
+ bool persistence;
+} xl_smgr_bufpersistence;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -60,6 +68,8 @@ typedef struct xl_smgr_truncate
extern void log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum);
extern void log_smgrunlink(const RelFileLocator *rlocator, ForkNumber forkNum);
+extern void log_smgrbufpersistence(const RelFileLocator rlocator,
+ bool persistence);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 08364447c7..552fa609c2 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -272,6 +272,9 @@ extern void DropRelationBuffers(struct SMgrRelationData *smgr_reln,
int nforks, BlockNumber *firstDelBlock);
extern void DropRelationsAllBuffers(struct SMgrRelationData **smgr_reln,
int nlocators);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+ bool permanent, bool isRedo);
+
extern void DropDatabaseBuffers(Oid dbid);
#define RelationGetNumberOfBlocks(reln) \
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index c57ae26b4c..746d3a910a 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -20,11 +20,11 @@
extern void ResetUnloggedRelations(int op);
-extern void ResetUnloggedRelationIgnore(RelFileLocator rloc);
extern bool parse_filename_for_nontemp_relation(const char *name,
RelFileNumber *relnumber,
ForkNumber *fork,
unsigned *segno);
+extern void ResetUnloggedRelationIgnore(RelFileLocator rloc);
#define UNLOGGED_RELATION_CLEANUP 0x0001
#define UNLOGGED_RELATION_INIT 0x0002
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 29a3e52dbf..c477f5fac6 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4098,6 +4098,7 @@ xl_replorigin_set
xl_restore_point
xl_running_xacts
xl_seq_rec
+xl_smgr_bufpersistence
xl_smgr_create
xl_smgr_truncate
xl_smgr_unlink
--
2.43.0
On Fri, 24 May 2024 at 00:09, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
Along with rebasing, I changed the interface of XLogFsyncFile() to
return a boolean instead of an error message.
Two notes after looking at this quickly during the advanced patch
feedback session:
1. I would maybe split 0003 into two separate patches. One to make SET
UNLOGGED fast, which seems quite easy to do because no WAL is needed.
And then a follow up to make SET LOGGED fast, which does all the
XLOG_FPI stuff.
2. When wal_level = minimal, still some WAL logging is needed. The
pages that were changed since the last still need to be made available
for crash recovery.
On Tue, May 28, 2024 at 04:49:45PM -0700, Jelte Fennema-Nio wrote:
Two notes after looking at this quickly during the advanced patch
feedback session:
1. I would maybe split 0003 into two separate patches. One to make SET
UNLOGGED fast, which seems quite easy to do because no WAL is needed.
And then a follow up to make SET LOGGED fast, which does all the
XLOG_FPI stuff.
Yeah, that would make sense. The LOGGED->UNLOGGED part is
straight-forward because we only care about the init fork. The
UNLOGGED->LOGGED case bugs me, though, a lot.
2. When wal_level = minimal, still some WAL logging is needed. The
pages that were changed since the last still need to be made available
for crash recovery.
More notes from me, as I was part of this session.
+ * XXXX: Some access methods don't support in-place persistence
+ * changes. GiST uses page LSNs to figure out whether a block has been
[...]
+ if (r->rd_rel->relkind == RELKIND_INDEX &&
+ /* GiST is excluded */
+ r->rd_rel->relam != BTREE_AM_OID &&
+ r->rd_rel->relam != HASH_AM_OID &&
+ r->rd_rel->relam != GIN_AM_OID &&
+ r->rd_rel->relam != SPGIST_AM_OID &&
+ r->rd_rel->relam != BRIN_AM_OID)
This knowledge should not be encapsulated in the backend code. The
index AMs should be able to tell, instead, if they are able to support
this code path so as any out-of-core index AM can decide things on its
own. This ought to be split in its own patch, simple enough as of a
boolean or a routine telling how this backend path should behave.
+ for (fork = 0; fork < INIT_FORKNUM; fork++)
+ {
+ if (smgrexists(RelationGetSmgr(r), fork))
+ log_newpage_range(r, fork, 0,
+ smgrnblocks(RelationGetSmgr(r), fork),
+ false);
+ }
A simple copy of the blocks means that we keep anything bloated in
them, while a rewrite in ALTER TABLE means that we would start afresh
by deforming the tuples from the origin before giving them to the
target, without any bloat. The compression of the FPWs and the
removal of the holes in the pages would surely limit the impact, but
this has not been discussed on this thread, and this is a nice
property of the existing implementation that would get silently
removed by this patch set.
Another point that Nathan has made is that it may be more appealing
to study how this is better than an integration with the multi-INSERT
APIs into AMs, so as it is possible to group the inserts in batches
rather than process them one-at-a-time, see [1]. I am ready to accept
that what this patch does is more efficient as long as everything is
block-based in some cases. Still there is a risk-vs-gain argument
here, and I am not sure whether what we have here is a good tradeoff
compared to the potential risk of breaking things. The amount of new
infrastructure is large for this code path. Grouping the inserts in
large batches may finish by being more efficient than a WAL stream
full of FPWs, as well, even if toast values are deformed? So perhaps
there is an argument for making that optional at query level, instead.
As a whole, I can say that grouping the INSERTs will always be more
efficient, while what we have here can be less efficient in some
cases. I'm OK to be outvoted, but the level of complications created
by this block-based copy and WAL-logging concerns me when it comes to
tweaking the relpersistence like that.
[1]: https://commitfest.postgresql.org/48/4777/
--
Michael
Thank you for the comments.
# The most significant feedback I received was that this approach is
# not misdirected.
At Tue, 4 Jun 2024 09:09:12 +0900, Michael Paquier <michael@paquier.xyz> wrote in
On Tue, May 28, 2024 at 04:49:45PM -0700, Jelte Fennema-Nio wrote:
Two notes after looking at this quickly during the advanced patch
feedback session:
1. I would maybe split 0003 into two separate patches. One to make SET
UNLOGGED fast, which seems quite easy to do because no WAL is needed.
And then a follow up to make SET LOGGED fast, which does all the
XLOG_FPI stuff.
Yeah, that would make sense. The LOGGED->UNLOGGED part is
straight-forward because we only care about the init fork. The
UNLOGGED->LOGGED case bugs me, though, a lot.
I indeed agree with that. Will do that in the next version.
2. When wal_level = minimal, still some WAL logging is needed. The
pages that were changed since the last still need to be made available
for crash recovery.
I don't quite understand this. It seems that you are referring to the
LOGGED to UNLOGGED case. UNLOGGED tables are emptied after a crash,
and the newly created INIT fork does that trick. Maybe I'm
misunderstanding something, though.
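To make the init-fork argument concrete, here is a toy model of it. The struct and function names (RelModel, model_set_unlogged, model_reset_after_crash) are invented for illustration and are not PostgreSQL code; the sketch only assumes the documented behaviour that, after a crash, a relation with an init fork is reset to the contents of that fork.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one relation's forks; not PostgreSQL's real data structures. */
typedef struct RelModel
{
    bool has_init_fork;  /* present only while the relation is unlogged */
    int  main_nblocks;   /* blocks in the main fork */
    int  init_nblocks;   /* blocks in the init fork (0 for a heap) */
} RelModel;

/* In-place SET UNLOGGED: keep the data, just add an empty init fork. */
static void
model_set_unlogged(RelModel *rel)
{
    rel->has_init_fork = true;
    rel->init_nblocks = 0;
}

/* Crash recovery: a relation with an init fork is reset to that fork. */
static void
model_reset_after_crash(RelModel *rel)
{
    if (rel->has_init_fork)
        rel->main_nblocks = rel->init_nblocks;
}
```

In this model a table switched to UNLOGGED keeps all of its blocks until a crash happens, and only then is it emptied, which is why no page contents need to be WAL-logged for that direction.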
More notes from me, as I was part of this session.
+ * XXXX: Some access methods don't support in-place persistence
+ * changes. GiST uses page LSNs to figure out whether a block has been
[...]
+ if (r->rd_rel->relkind == RELKIND_INDEX &&
+ /* GiST is excluded */
+ r->rd_rel->relam != BTREE_AM_OID &&
+ r->rd_rel->relam != HASH_AM_OID &&
+ r->rd_rel->relam != GIN_AM_OID &&
+ r->rd_rel->relam != SPGIST_AM_OID &&
+ r->rd_rel->relam != BRIN_AM_OID)
This knowledge should not be encapsulated in the backend code. The
index AMs should be able to tell, instead, if they are able to support
this code path so as any out-of-core index AM can decide things on its
own. This ought to be split in its own patch, simple enough as of a
boolean or a routine telling how this backend path should behave.
Right. I was hesitant to expand the scope before being certain that I
can proceed in this direction without significant objections. Now I
can include that in the next version.
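As a rough illustration of the direction suggested above, the sketch below replaces the hard-coded list of AM OIDs with a per-AM capability flag. IndexAmCaps and supports_inplace_persistence are hypothetical names, not fields of the existing index AM interface; the only point is that an unknown or non-supporting AM still takes the rebuild path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for an index AM's routine table. */
typedef struct IndexAmCaps
{
    const char *amname;
    /* Hypothetical flag: can persistence be changed without a rebuild? */
    bool        supports_inplace_persistence;
} IndexAmCaps;

typedef enum PersistenceChangePath
{
    PATH_IN_PLACE,              /* flip buffers, create or drop init fork */
    PATH_REINDEX                /* rebuild the index with new persistence */
} PersistenceChangePath;

/* AMs that say nothing (NULL caps) or say "no" go the hard way. */
static PersistenceChangePath
choose_persistence_path(const IndexAmCaps *caps)
{
    if (caps != NULL && caps->supports_inplace_persistence)
        return PATH_IN_PLACE;
    return PATH_REINDEX;
}
```

With something like this, GiST would simply report false, and an out-of-core AM could opt in on its own without any change to tablecmds.c.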
+ for (fork = 0; fork < INIT_FORKNUM; fork++)
+ {
+ if (smgrexists(RelationGetSmgr(r), fork))
+ log_newpage_range(r, fork, 0,
+ smgrnblocks(RelationGetSmgr(r), fork),
+ false);
+ }
A simple copy of the blocks means that we keep anything bloated in
them, while a rewrite in ALTER TABLE means that we would start afresh
by deforming the tuples from the origin before giving them to the
target, without any bloat. The compression of the FPWs and the
removal of the holes in the pages would surely limit the impact, but
this has not been discussed on this thread, and this is a nice
property of the existing implementation that would get silently
removed by this patch set.
Sure. That bloat can be removed beforehand by explicitly running
VACUUM on the table if needed, but it would be ideal if the same
compression occurred automatically. Alternatively, it might be an
option to fall back to the existing path when the target table is
found to have excessive bloat (though I'm not sure how much should be
considered excessive). We could also allow users to decide by adding a
command option.
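The fallback idea could be sketched as below. This is only an assumption about how such a check might look: prefer_rewrite_path and max_dead_fraction are hypothetical, and the threshold is left as a parameter precisely because it is unclear what should count as excessive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical policy: return true when the share of dead tuples exceeds
 * max_dead_fraction, meaning the existing rewrite path (which drops the
 * bloat) looks preferable to the in-place block-level path.
 */
static bool
prefer_rewrite_path(uint64_t live_tuples, uint64_t dead_tuples,
                    double max_dead_fraction)
{
    uint64_t total = live_tuples + dead_tuples;

    /* An empty table has no bloat to remove. */
    if (total == 0)
        return false;

    return (double) dead_tuples / (double) total > max_dead_fraction;
}
```

A command option would then amount to letting the user override the result of this check in either direction.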
Another point that Nathan has made is that it may be more appealing
to study how this is better than an integration with the multi-INSERT
APIs into AMs, so as it is possible to group the inserts in batches
rather than process them one-at-a-time, see [1]. I am ready to accept
that what this patch does is more efficient as long as everything is
block-based in some cases. Still there is a risk-vs-gain argument
here, and I am not sure whether what we have here is a good tradeoff
compared to the potential risk of breaking things. The amount of new
infrastructure is large for this code path. Grouping the inserts in
large batches may finish by being more efficient than a WAL stream
full of FPWs, as well, even if toast values are deformed? So perhaps
there is an argument for making that optional at query level, instead.
I agree about the uncertainties. With the switching feature mentioned
above, it might be sufficient to use the multi-insert stuff in the
existing path. However, the uncertainties regarding performance would
still remain.
As a whole, I can say that grouping the INSERTs will always be more
efficient, while what we have here can be less efficient in some
cases. I'm OK to be outvoted, but the level of complications created
by this block-based copy and WAL-logging concerns me when it comes to
tweaking the relpersistence like that.
Of course, it is a promising option to move away from the
block-logging and fall back to the existing path using the
multi-insert stuff in the UNLOGGED to LOGGED case. Let me consider
that point.
Besides the above, even though this discussion might become
unnecessary, there was a concern that the blockwise logging might
result in unexpected outcomes due to unflushed buffer data (although
I could be mistaken). I believe that is not the case because all
buffer blocks are flushed out beforehand.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
+Bharath
On Tue, Jun 04, 2024 at 04:00:32PM +0900, Kyotaro Horiguchi wrote:
At Tue, 4 Jun 2024 09:09:12 +0900, Michael Paquier <michael@paquier.xyz> wrote in
Another point that Nathan has made is that it may be more appealing
to study how this is better than an integration with the multi-INSERT
APIs into AMs, so as it is possible to group the inserts in batches
rather than process them one-at-a-time, see [1]. I am ready to accept
that what this patch does is more efficient as long as everything is
block-based in some cases. Still there is a risk-vs-gain argument
here, and I am not sure whether what we have here is a good tradeoff
compared to the potential risk of breaking things. The amount of new
infrastructure is large for this code path. Grouping the inserts in
large batches may finish by being more efficient than a WAL stream
full of FPWs, as well, even if toast values are deformed? So perhaps
there is an argument for making that optional at query level, instead.
I agree about the uncertainties. With the switching feature mentioned
above, it might be sufficient to use the multi-insert stuff in the
existing path. However, the uncertainties regarding performance would
still remain.
Bharath, does the multi-INSERT stuff apply when changing a table to be
LOGGED? If so, I think it would be interesting to compare it with the FPI
approach being discussed here.
--
nathan
On Tue, Jun 04, 2024 at 03:50:51PM -0500, Nathan Bossart wrote:
Bharath, does the multi-INSERT stuff apply when changing a table to be
LOGGED? If so, I think it would be interesting to compare it with the FPI
approach being discussed here.
The answer to this question is yes AFAIK. Look at patch 0002 in the
latest series posted here, that touches ATRewriteTable() in
tablecmds.c where the rewrite happens should a relation's
relpersistence, AM, column or default require a switch (particularly
if more than one property is changed in a single command, grep for
AT_REWRITE_*):
/messages/by-id/CALj2ACUz5+_YNEa4ZY-XG960_oXefM50MjD71VgSCAVDkF3bzQ@mail.gmail.com
I've just read through the patch set, and they are rather pleasant to
the eye. I have comments about them, actually, but that's a topic for
the other thread.
--
Michael
Hello.
It's been a while. Based on our previous face-to-face discussions, I
have been restructuring the patch set. During this process, I found
several missing parts and issues, which led to almost everything being
rewritten. However, I believe the updates are now better organized and
more understandable.
The current patch set broadly consists of the following elements:
- Core feature: Switching buffer persistence (0007) remains mostly the
same as before, but the creation and deletion of INIT fork files
have undergone significant modifications. Part of this functionality
has been moved to commit records.
- UNDO log (0002): This handles file deletion during transaction aborts,
which was previously managed, in part, by the commit XLOG record at
the end of a transaction.
- Prevent orphan files after a crash (0005): This is another use-case
of the UNDO log system.
- Extension of smgr (0012), pendingDeletes (0014), and commit XLOG
records (0013): These have been extended to handle file deletion at
the fork level instead of the relfilenumber level. While this
extension applies to both commit and abort operations, only the file
deletion process for aborts has been moved to the UNDO log. As a
result, file deletions during commits continue to be managed by
commit records.
Here are some issues. Depending on how these points are addressed,
this patch set might be dropped. (Or, this patch might already be too
large for its intended effect.)
- Consecutive changes to the persistence of the same table within a
single transaction are prohibited (0007). Allowing this would
complicate pendingDeletes and a similar mechanism added to
bufmgr. Also, due to the append-only nature of the UNDO log, the
entire process, including subtransaction handling, could not be made
consistent easily.
- PREPARE is prohibited for transactions that have altered table
persistence (0009). This is because I haven't found a simple way to
ensure consistent switching of buffer persistence if the server
crashes after PREPARE and then commits the transaction after
recovery.
- Data updates within a single transaction after changing the table's
persistence are also prohibited (0008). This restriction is necessary
because if an index update triggers page splits after changing the
persistence to UNLOGGED, WAL might become inapplicable.
The last point, in particular, has a significant impact on usability,
but it seems to be fundamentally unavoidable. Since heap updates
appear to be fine, one possible approach could be to give up on
in-place persistence changes for indexes.
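The three restrictions above can be summarized with a small model of per-transaction state. Everything here (XactModel and the helper names) is invented for illustration and does not mirror the patch's actual bookkeeping in pendingDeletes or bufmgr; it only encodes the rules as stated.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_CHANGED 8

/* Toy record of relations whose persistence changed in this transaction. */
typedef struct XactModel
{
    unsigned changed[MAX_CHANGED];
    int      nchanged;
} XactModel;

static bool
persistence_changed(const XactModel *x, unsigned relid)
{
    for (int i = 0; i < x->nchanged; i++)
    {
        if (x->changed[i] == relid)
            return true;
    }
    return false;
}

/* A second change to the same relation in one transaction is refused. */
static bool
try_change_persistence(XactModel *x, unsigned relid)
{
    if (persistence_changed(x, relid) || x->nchanged == MAX_CHANGED)
        return false;
    x->changed[x->nchanged++] = relid;
    return true;
}

/* Data updates are refused once the relation's persistence has changed. */
static bool
can_modify_data(const XactModel *x, unsigned relid)
{
    return !persistence_changed(x, relid);
}

/* PREPARE is refused if any relation's persistence was changed. */
static bool
can_prepare(const XactModel *x)
{
    return x->nchanged == 0;
}
```

In this model other relations remain fully usable in the same transaction; only the relation whose persistence was flipped is locked down until commit or abort.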
Regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v34-0001-Export-wal_sync_method-related-functions.patch (text/x-patch; charset=us-ascii)
From aa000decc71a87552e81850182a3c2854fb4851a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 4 Jul 2024 17:24:16 +0900
Subject: [PATCH v34 01/16] Export wal_sync_method related functions
Export XLogGetSyncBit() and XLogFsyncFile() for use in subsequent
commits.
---
src/backend/access/transam/xlog.c | 81 ++++++++++++++++++++++---------
src/include/access/xlog.h | 2 +
2 files changed, 61 insertions(+), 22 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ee0fb0e28f..b3d38cfaf8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8642,21 +8642,34 @@ assign_wal_sync_method(int new_wal_sync_method, void *extra)
}
}
+/*
+ * Exported version of get_sync_bit()
+ *
+ * Note: The returned value may have the PG_O_DIRECT bit set.
+ */
+int
+XLogGetSyncBit(void)
+{
+ return get_sync_bit(wal_sync_method);
+}
+
/*
- * Issue appropriate kind of fsync (if any) for an XLOG output file.
+ * Issue appropriate kind of fsync (if any) according to wal_sync_method.
+ *
+ * Returns true if successfully fsync'ed, otherwise returns false and sets
+ * errmsg if it is not NULL.
+ * 'fd' is a file descriptor for the file to be fsync'd.
*
- * 'fd' is a file descriptor for the XLOG file to be fsync'd.
- * 'segno' is for error reporting purposes.
+ * Returns true if successfully synced. Returns false if failed and sets the
+ * error message to *errmsg.
*/
-void
-issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
+bool
+XLogFsyncFile(int fd, const char **errmsg)
{
- char *msg = NULL;
+ const char *msg = NULL;
instr_time start;
- Assert(tli != 0);
-
/*
* Quick exit if fsync is disabled or write() has already synced the WAL
* file.
@@ -8664,7 +8677,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
if (!enableFsync ||
wal_sync_method == WAL_SYNC_METHOD_OPEN ||
wal_sync_method == WAL_SYNC_METHOD_OPEN_DSYNC)
- return;
+ return true;
/* Measure I/O timing to sync the WAL file */
if (track_wal_io_timing)
@@ -8701,19 +8714,6 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
break;
}
- /* PANIC if failed to fsync */
- if (msg)
- {
- char xlogfname[MAXFNAMELEN];
- int save_errno = errno;
-
- XLogFileName(xlogfname, tli, segno, wal_segment_size);
- errno = save_errno;
- ereport(PANIC,
- (errcode_for_file_access(),
- errmsg(msg, xlogfname)));
- }
-
pgstat_report_wait_end();
/*
@@ -8727,7 +8727,44 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, start);
}
+ if (msg)
+ {
+ Assert (errmsg);
+
+ *errmsg = msg;
+ return false;
+ }
+
PendingWalStats.wal_sync++;
+
+ return true;
+}
+
+/*
+ * Issue appropriate kind of fsync (if any) for an XLOG output file.
+ *
+ * 'fd' is a file descriptor for the XLOG file to be fsync'd.
+ * 'segno' is for error reporting purposes.
+ */
+void
+issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
+{
+ const char *msg;
+
+ Assert(tli != 0);
+
+ /* PANIC if failed to fsync */
+ if (!XLogFsyncFile(fd, &msg))
+ {
+ char xlogfname[MAXFNAMELEN];
+ int save_errno = errno;
+
+ XLogFileName(xlogfname, tli, segno, wal_segment_size);
+ errno = save_errno;
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg(msg, xlogfname)));
+ }
}
/*
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 083810f5b4..095eb26a61 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -219,6 +219,8 @@ extern void xlog_redo(struct XLogReaderState *record);
extern void xlog_desc(StringInfo buf, struct XLogReaderState *record);
extern const char *xlog_identify(uint8 info);
+extern int XLogGetSyncBit(void);
+extern bool XLogFsyncFile(int fd, const char **errmsg);
extern void issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli);
extern bool RecoveryInProgress(void);
--
2.43.5
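For readers skimming 0001: the point of the split is that the low-level sync routine no longer decides how severe a failure is. It returns false plus a message, and each caller picks its own reaction (PANIC for WAL, ERROR for the undo log in 0002). The following is a minimal standalone sketch of that pattern, not the patch itself. The sketch_* functions are hypothetical stand-ins, plain fsync() replaces the wal_sync_method switch, and the message is a fixed string rather than a format template as in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for XLogFsyncFile(): sync the file and, on failure,
 * hand a message back to the caller instead of reporting it here.
 */
static bool
sketch_fsync_file(int fd, const char **errmsg)
{
	assert(errmsg != NULL);

	if (fsync(fd) != 0)
	{
		*errmsg = "could not fsync file";
		return false;
	}
	return true;
}

/*
 * Hypothetical caller in the style of undolog_sync_file(): the caller, not
 * the sync routine, decides what to do on failure.  Here it formats the
 * message into buf and returns -1; a WAL caller would PANIC instead.
 */
static int
sketch_sync_or_report(int fd, const char *fname, char *buf, size_t buflen)
{
	const char *msg = NULL;

	if (!sketch_fsync_file(fd, &msg))
	{
		int			save_errno = errno;

		/* formatting may clobber errno, so preserve it for the caller */
		snprintf(buf, buflen, "%s \"%s\"", msg, fname);
		errno = save_errno;
		return -1;
	}
	return 0;
}
```

Calling sketch_sync_or_report() with an invalid descriptor returns -1 and leaves the formatted message in buf, with errno still describing the fsync() failure.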
v34-0002-Introduce-undo-log-implementation.patch (text/x-patch; charset=us-ascii)
From 5655e65ae7eac64e7342b841df2fe82a52cd2594 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 5 Jul 2024 14:32:46 +0900
Subject: [PATCH v34 02/16] Introduce undo log implementation
This implementation creates UNDO log files, named with XID values and
stored in the pg_ulog directory. These files have a record format
similar to XLOG files, with one file for each sub-transaction if
needed. Each file consists of multiple UNDO records that capture
in-transaction changes requiring cleanup at abort time. If a server
crash occurs while UNDO log files exist, they trigger post-crash
cleanup before entering the REDO loop of crash recovery. The REDO loop
may then generate new UNDO logs. Most of these are removed during the
processing of commit/abort records, but some files may remain after
the REDO loop finishes. These remaining files also trigger abort-time
cleanups. The UNDO log files associated with prepared transactions are
preserved and not processed during recovery. They are processed when
the prepared transactions are finalized.
The creation of UNDO files is tracked in-memory using a linked list of
ActiveULog in a process for quick lookups. For a normal session, all
elements in the list pertain to one top-level transaction and are
removed at commit or abort time. In the startup process, the list may
contain elements for multiple top-level transactions.
---
src/backend/access/transam/Makefile | 1 +
src/backend/access/transam/meson.build | 1 +
src/backend/access/transam/rmgr.c | 2 +-
src/backend/access/transam/simpleundolog.c | 991 +++++++++++++++++++++
src/backend/access/transam/twophase.c | 3 +
src/backend/access/transam/xact.c | 23 +
src/backend/access/transam/xlog.c | 22 +-
src/bin/initdb/initdb.c | 17 +
src/bin/pg_rewind/parsexlog.c | 2 +-
src/bin/pg_waldump/rmgrdesc.c | 2 +-
src/include/access/rmgr.h | 2 +-
src/include/access/rmgrlist.h | 46 +-
src/include/access/simpleundolog.h | 44 +
src/tools/pgindent/typedefs.list | 5 +
14 files changed, 1129 insertions(+), 32 deletions(-)
create mode 100644 src/backend/access/transam/simpleundolog.c
create mode 100644 src/include/access/simpleundolog.h
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db..531505cbbd 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -21,6 +21,7 @@ OBJS = \
rmgr.o \
slru.o \
subtrans.o \
+ simpleundolog.o \
timeline.o \
transam.o \
twophase.o \
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index 8a3522557c..c1225636b5 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -9,6 +9,7 @@ backend_sources += files(
'rmgr.c',
'slru.c',
'subtrans.c',
+ 'simpleundolog.c',
'timeline.c',
'transam.c',
'twophase.c',
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 1b7499726e..81206a64f8 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -44,7 +44,7 @@
/* must be kept in sync with RmgrData definition in xlog_internal.h */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo) \
{ name, redo, desc, identify, startup, cleanup, mask, decode },
RmgrData RmgrTable[RM_MAX_ID + 1] = {
diff --git a/src/backend/access/transam/simpleundolog.c b/src/backend/access/transam/simpleundolog.c
new file mode 100644
index 0000000000..4bba27e340
--- /dev/null
+++ b/src/backend/access/transam/simpleundolog.c
@@ -0,0 +1,991 @@
+/*-------------------------------------------------------------------------
+ *
+ * simpleundolog.c
+ * Simple implementation of PostgreSQL transaction-undo-log manager
+ *
+ * This module logs the cleanup procedures required during a transaction abort.
+ * The information is recorded in files to ensure post-crash recovery runs the
+ * necessary cleanup procedures.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/simpleundolog.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <sys/stat.h>
+
+#include "lib/stringinfo.h"
+#include "access/parallel.h"
+#include "access/simpleundolog.h"
+#include "access/twophase_rmgr.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "storage/fd.h"
+#include "utils/memutils.h"
+
+
+#define ULOG_FILE_MAGIC 0x474f4c55 /* 'ULOG' in little-endian byte order */
+
+/*
+ * Struct for undo-log disk data
+ *
+ * Each undo-log file is named with the eight-digit hex XID and contains
+ * undo data for a subtransaction. The file begins with a header followed by
+ * undo log records. An undo file is created when the first undo log is issued
+ * during a transaction and is removed upon the top transaction's commit or
+ * processed then removed during each subtransaction's rollback. When a
+ * transaction is prepared, this state is marked in the file header. The
+ * prepared undo files are processed by subsequent COMMIT/ROLLBACK PREPARED
+ * commands in the same manner as non-prepared files. In the event of a server
+ * crash during a transaction, non-prepared undo files left behind are handled
+ * before recovery starts. The recovery process may create new undo files,
+ * which are processed at the end of recovery. Prepared undo files are
+ * preserved throughout the recovery process.
+ */
+typedef struct UndoLogFileHeader
+{
+ int32 magic; /* fixed ULOG file magic number */
+ UndoLogFileState state; /* state of this file */
+ /* SimpleUndoLogRecord follows */
+} UndoLogFileHeader;
+
+typedef struct UndoDescData
+{
+ const char *rm_name;
+ void (*rm_undo) (SimpleUndoLogRecord *record,
+ UndoLogFileState state, bool isCommit, bool cleanup);
+} UndoDescData;
+
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo) \
+ { name, undo },
+
+UndoDescData UndoRoutines[RM_MAX_ID + 1] = {
+#include "access/rmgrlist.h"
+};
+#undef PG_RMGR
+
+/*
+ * During a transaction, all undo logs are managed by a linked list. This
+ * linked list is used for quick lookup of existing undo log files. We expect
+ * the number of files to be relatively small, so more efficient algorithms are
+ * not used here. During recovery, this list may contain undo log files for
+ * multiple top transactions.
+ */
+/* Undo log working state */
+typedef struct ActiveULog
+{
+ TransactionId xid;
+ struct ActiveULog *next;
+} ActiveULog;
+
+/*
+ * Struct for top-level management variables.
+ *
+ * active_ulogs holds the xid-subxid pairs for all subtransactions that issued
+ * undo logs during a top-level transaction. Once an undo log file is opened,
+ * topxid, subxid, filename, and fd are set according to the currently open
+ * file.
+ */
+typedef struct ULogStateData
+{
+ ActiveULog *active_ulogs; /* list of subxacts with ulogs */
+ ActiveULog *current_ulog; /* current open entry */
+ ActiveULog *prev_ulog; /* previous entry for removal */
+ TransactionId xid; /* xid of the current ulog */
+ char file_name[MAXPGPATH]; /* current ulog file name */
+ int fd; /* file descriptor */
+ UndoLogFileHeader file_header; /* current ulog file header */
+} ULogStateData;
+
+static ULogStateData ULogState =
+{NULL, NULL, NULL, InvalidTransactionId, "", -1, {0}};
+
+/* ULOG uses the same sync mode as XLOG, except for the PG_O_DIRECT bit. */
+static int
+ULogGetSyncBit(void)
+{
+ return XLogGetSyncBit() & ~PG_O_DIRECT;
+}
+
+/*
+ * undolog_set_filename()
+ *
+ * Generates undo log file name for the xid pair.
+ */
+static void
+undolog_set_filename(char *buf, TransactionId xid)
+{
+ snprintf(buf, MAXPGPATH, "%s/%08x", SIMPLE_UNDOLOG_DIR, xid);
+}
+
+
+/*
+ * undolog_load_file_header()
+ *
+ * Loads the header of the currently open file into the global buffer. The file
+ * pointer will point to the beginning of the first record after this function
+ * returns.
+ */
+static void
+undolog_load_file_header(void)
+{
+ if (lseek(ULogState.fd, 0, SEEK_SET) < 0)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not seek undolog file \"%s\": %m",
+ ULogState.file_name));
+
+ if (read(ULogState.fd,
+ &ULogState.file_header, sizeof(ULogState.file_header)) < 0)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not read undolog file \"%s\": %m",
+ ULogState.file_name));
+ if (ULogState.file_header.magic != ULOG_FILE_MAGIC)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("invalid undolog file \"%s\": magic number does not match",
+ ULogState.file_name));
+}
+
+/*
+ * undolog_write_file_header()
+ *
+ * Writes the header of the currently open file from the global buffer. The file
+ * pointer will point to the beginning of the first record after this function
+ * returns.
+ */
+static void
+undolog_write_file_header(void)
+{
+ if (lseek(ULogState.fd, 0, SEEK_SET) < 0)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not seek undolog file \"%s\": %m",
+ ULogState.file_name));
+
+ if (write(ULogState.fd,
+ &ULogState.file_header, sizeof(ULogState.file_header)) < 0)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not write undolog file \"%s\": %m",
+ ULogState.file_name));
+}
+
+/*
+ * undolog_sync_file()
+ *
+ * Sync the currently open ULOG file.
+ */
+static void
+undolog_sync_file(void)
+{
+ const char *msg;
+
+ /*
+ * This function counts ULOG sync operation stats as part of WAL
+ * operations. In the future, we may want to separate ULOG stats from WAL
+ * stats.
+ */
+ if (!XLogFsyncFile(ULogState.fd, &msg))
+ {
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg(msg, ULogState.file_name)));
+ }
+}
+
+/*
+ * undolog_select_entry()
+ *
+ * Finds and selects the in-memory entry for the given xid pair.
+ *
+ * If create is false, returns false if not found, and no entry is selected.
+ * If create is true, returns true if found; otherwise, creates one and returns
+ * false.
+ */
+static bool
+undolog_select_entry(TransactionId xid, bool create)
+{
+ ActiveULog *prev;
+
+ Assert (TransactionIdIsValid(xid));
+
+ /* short cut when the entry is already selected */
+ if (ULogState.current_ulog &&
+ ULogState.current_ulog->xid == xid)
+ return true;
+
+ ULogState.current_ulog = ULogState.prev_ulog = NULL;
+
+ /* we no longer use this file, close it */
+ if (ULogState.fd >= 0)
+ {
+ /* Switched between subtransactions, close the current file */
+ if (close(ULogState.fd) != 0)
+ ereport(ERROR, errcode_for_file_access(),
+ errmsg("could not close file \"%s\": %m",
+ ULogState.file_name));
+
+ ULogState.xid = InvalidTransactionId;
+ ULogState.fd = -1;
+ ReleaseExternalFD();
+ }
+
+ /* search for the existing entry */
+ prev = NULL;
+ for (ActiveULog *p = ULogState.active_ulogs ; p ; p = p->next)
+ {
+ if (p->xid == xid)
+ {
+ ULogState.current_ulog = p;
+ ULogState.prev_ulog = prev;
+ return true;
+ }
+
+ prev = p;
+ }
+
+ /* no existing entry found; create a new one */
+ if (create)
+ {
+ ActiveULog *newlog = (ActiveULog *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(ActiveULog));
+
+ newlog->xid = xid;
+ newlog->next = ULogState.active_ulogs;
+ ULogState.active_ulogs = newlog;
+ ULogState.current_ulog = newlog;
+ ULogState.prev_ulog = NULL;
+ }
+
+ return false;
+}
+
+/*
+ * undolog_remove_entry()
+ *
+ * Removes the currently selected entry.
+ *
+ * The entry to be deleted must have been previously selected using
+ * undolog_select_entry(), and the corresponding log file is expected to have
+ * already been deleted if it exists.
+ */
+static bool
+undolog_remove_entry()
+{
+ ActiveULog *p = ULogState.current_ulog;
+
+ if (ULogState.prev_ulog)
+ ULogState.prev_ulog->next = p->next;
+ else
+ ULogState.active_ulogs = p->next;
+
+ ULogState.current_ulog = ULogState.prev_ulog = NULL;
+
+ pfree(p);
+
+ return true;
+}
+
+/*
+ * undolog_init_file()
+ *
+ * Sets the initial header of an already-opened ulog file.
+ *
+ * The file pointer will point to just after the header after this function
+ * returns.
+ */
+static void
+undolog_init_file(void)
+{
+ ULogState.file_header.magic = ULOG_FILE_MAGIC;
+ ULogState.file_header.state = ULOG_FILE_DEFAULT;
+
+ if (write(ULogState.fd, &ULogState.file_header,
+ sizeof(ULogState.file_header)) < 0)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not write undolog file \"%s\": %m",
+ ULogState.file_name));
+}
+
+/*
+ * undolog_open_file_by_name()
+ *
+ * Opens a ULOG file specified by ULogState.file_name.
+ *
+ * Returns true if the file was found, and false if not found.
+ *
+ * If create is true, this function errors out if the file already
+ * exists. During recovery, this function may attempt to create
+ * already-existing ULOG file for an uncommitted prepared transaction. In this
+ * case, the existing file is opened instead of causing an error.
+ *
+ * Note that xid values in ULogState are set to invalid even after a successful
+ * return. They will be set by undolog_open_file().
+ */
+static bool
+undolog_open_file_by_name(bool create)
+{
+ int omode = 0;
+ int cmode = 0;
+
+ Assert (ULogState.fd < 0);
+
+ omode = PG_BINARY | O_RDWR | ULogGetSyncBit();
+
+ if (create)
+ cmode = O_CREAT | O_EXCL;
+
+ ULogState.fd = BasicOpenFile(ULogState.file_name, omode | cmode);
+
+ if (ULogState.fd < 0)
+ {
+ if (!create)
+ {
+ if (errno == ENOENT)
+ return false;
+
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not open undolog file \"%s\": %m",
+ ULogState.file_name));
+ }
+
+ /*
+ * ULOG files for prepared transactions are preserved throughout the
+ * recovery process. Therefore, recovery may attempt to create an
+ * already existing file. If the file is confirmed to be prepared, we
+ * should continue the recovery and will ignore all ULOG writes to this
+ * file. See UndoLogCleanup() for details.
+
+ */
+ if (errno == EEXIST && RecoveryInProgress())
+ {
+ ULogState.fd = BasicOpenFile(ULogState.file_name, omode);
+ if (ULogState.fd >= 0)
+ {
+ undolog_load_file_header();
+ if (ULogState.file_header.state == ULOG_FILE_DEFAULT)
+ elog(PANIC, "non prepared file found: %s",
+ ULogState.file_name);
+
+ elog(LOG, "ulog file for prepared transaction found: %s",
+ ULogState.file_name);
+ }
+
+ /* restore the original error number */
+ errno = EEXIST;
+ }
+
+ if (ULogState.fd < 0)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not create undolog file \"%s\": %m",
+ ULogState.file_name));
+ }
+ else if (create)
+ {
+ undolog_init_file();
+ undolog_sync_file();
+ }
+ else
+ undolog_load_file_header();
+
+ ReserveExternalFD();
+ ULogState.xid = InvalidTransactionId;
+
+ /* in create mode, return false since the file was not found */
+ if (create)
+ return false;
+
+ return true;
+}
+
+/*
+ * undolog_open_file() - Opens a ulog file for the specified xid pair.
+ *
+ * See undolog_open_file_by_name() for more details.
+ *
+ * XID values and file_name in ULogState are set after a successful
+ * return. Otherwise, they are set to invalid values.
+ */
+static bool
+undolog_open_file(TransactionId xid, bool create)
+{
+ bool ret;
+
+ /* shortcut for repeated usage of the same file */
+ if (ULogState.xid == xid)
+ {
+ Assert(ULogState.fd >= 0);
+ return true;
+ }
+
+ /* Switched between subtransactions, close the current file if any */
+ if (ULogState.fd >= 0)
+ {
+ if (close(ULogState.fd) != 0)
+ ereport(ERROR, errcode_for_file_access(),
+ errmsg("could not close file \"%s\": %m",
+ ULogState.file_name));
+
+ ULogState.xid = InvalidTransactionId;
+ ULogState.fd = -1;
+ ReleaseExternalFD();
+ }
+
+ Assert (ULogState.xid == InvalidTransactionId &&
+ ULogState.fd == -1);
+
+ /* Set the file name */
+ undolog_set_filename(ULogState.file_name, xid);
+
+ /* Do the task */
+ ret = undolog_open_file_by_name(create);
+
+ /* Set the xid pair for this file if opened */
+ if (ret || create)
+ ULogState.xid = xid;
+
+ return ret;
+}
+
+/*
+ * undolog_close_file() - Closes the currently opened ulog file, if any.
+ */
+static void
+undolog_close_file(void)
+{
+ if (ULogState.fd < 0)
+ return;
+
+ if (close(ULogState.fd) != 0)
+ ereport(ERROR, errcode_for_file_access(),
+ errmsg("could not close file \"%s\": %m",
+ ULogState.file_name));
+
+ ULogState.xid = InvalidTransactionId;
+ ULogState.fd = -1;
+ ReleaseExternalFD();
+}
+
+/*
+ * undolog_remove_file() - Removes a file specified by ULogState.file_name.
+ *
+ * The file must already be closed.
+ */
+static void
+undolog_remove_file(void)
+{
+ durable_unlink(ULogState.file_name, FATAL);
+ ULogState.file_name[0] = 0;
+}
+
+/*
+ * undolog_remove_file_by_xid()
+ *
+ * Removes a file specified by an xid pair. Closes the file if it is open.
+ */
+static void
+undolog_remove_file_by_xid(TransactionId xid)
+{
+ char file_name[MAXPGPATH];
+
+ if (ULogState.xid == xid)
+ undolog_close_file();
+
+ undolog_set_filename(file_name, xid);
+ durable_unlink(file_name, FATAL);
+}
+
+/*
+ * SimpleUndoLogWrite() - Write an undolog record using current xid
+ */
+void
+SimpleUndoLogWrite(RmgrId rmgr, uint8 info, void *data, int len)
+{
+ /*
+ * The following lines may assign a new transaction ID. This is somewhat
+ * clumsy, but the caller needs to assign it soon.
+ */
+ TransactionId xid = GetCurrentTransactionId();
+
+ SimpleUndoLogWriteRedo(rmgr, info, data, len, xid);
+}
+
+/*
+ * SimpleUndoLogWriteRedo() - Writes an undolog record
+ *
+ * topxid is the XID of the top-level transaction. subxid is the assigned
+ * TransactionId of the current transaction, not a SubTransactionId.
+ * InvalidTransactionId indicates that the current transaction is the top-level
+ * transaction. See SimpleUndoLogWrite().
+ *
+ * This function is exposed for use during recovery.
+ */
+void
+SimpleUndoLogWriteRedo(RmgrId rmgr, uint8 info, void *data, int len,
+ TransactionId xid)
+{
+ int reclen = sizeof(SimpleUndoLogRecord) + len;
+ SimpleUndoLogRecord *rec;
+ pg_crc32c undodata_crc;
+
+ Assert(!IsParallelWorker());
+
+ /* Inactivate the undo system during bootstrap processing mode */
+ if (IsBootstrapProcessingMode())
+ return;
+
+ /* We must be in xid-assigned transactions */
+ Assert(TransactionIdIsValid(xid));
+
+ /* The caller can set rmgr bits only. */
+ if ((info & ~ULR_RMGR_INFO_MASK) != 0)
+ elog(PANIC, "invalid ulog info mask %02X", info);
+
+ if (!undolog_select_entry(xid, true))
+ {
+ /* new entry created, create the corresponding file */
+ undolog_open_file(xid, true);
+ }
+ else
+ {
+ /* entry exists, open existing file */
+ if (!undolog_open_file(xid, false))
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not open undolog file \"%s\": %m",
+ ULogState.file_name));
+ }
+
+ /*
+ * Writes to files for prepared transactions are ignored during
+ * recovery. See undolog_open_file_by_name() for more details.
+ */
+ if (ULogState.file_header.state != ULOG_FILE_DEFAULT)
+ {
+ Assert (RecoveryInProgress());
+ return;
+ }
+
+ rec = palloc(reclen);
+ rec->ul_tot_len = reclen;
+ rec->ul_rmid = rmgr;
+ rec->ul_info = info;
+
+ memcpy((char *)rec + sizeof(SimpleUndoLogRecord), data, len);
+
+ /* Calculate CRC of the data */
+ INIT_CRC32C(undodata_crc);
+ COMP_CRC32C(undodata_crc, &rec->ul_rmid,
+ reclen - offsetof(SimpleUndoLogRecord, ul_rmid));
+ rec->ul_crc = undodata_crc;
+
+
+ if (write(ULogState.fd, rec, reclen) < 0)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not write to undolog file \"%s\": %m",
+ ULogState.file_name));
+
+ pfree(rec);
+ undolog_sync_file();
+}
+
+/*
+ * undolog_process_ulogfile() - Processes the currently open undo log.
+ *
+ * The file must be opened beforehand.
+ * If cleanup is true, this function informs the undo callback functions that
+ * it is called during recovery cleanup and the transaction is prepared. In
+ * this case, the undo callback may need to behave differently.
+ */
+#define ULOG_READBUF_INIT_SIZE 32
+static void
+undolog_process_ulogfile(bool isCommit, bool cleanup, MemoryContext outercxt)
+{
+ static int bufsize = 0;
+ static char *buf = NULL;
+ int ret;
+
+ StaticAssertDecl(sizeof(SimpleUndoLogRecord) <= ULOG_READBUF_INIT_SIZE,
+ "initial buffer size too small");
+
+ Assert (ULogState.fd >= 0);
+ Assert (outercxt);
+
+ undolog_load_file_header();
+
+ bufsize = ULOG_READBUF_INIT_SIZE;
+ buf = palloc(bufsize);
+
+ while ((ret = read(ULogState.fd, buf, sizeof(SimpleUndoLogRecord))) ==
+ sizeof(SimpleUndoLogRecord))
+ {
+ SimpleUndoLogRecord *rec = (SimpleUndoLogRecord *) buf;
+ int readlen = rec->ul_tot_len - sizeof(SimpleUndoLogRecord);
+ MemoryContext oldcxt;
+ pg_crc32c undodata_crc;
+
+ while (rec->ul_tot_len > bufsize)
+ {
+ bufsize *= 2;
+ buf = repalloc(buf, bufsize);
+ rec = (SimpleUndoLogRecord *) buf;
+ }
+
+ ret = read(ULogState.fd,
+ buf + sizeof(SimpleUndoLogRecord), readlen);
+ if (ret != readlen)
+ {
+ if (ret < 0)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not read undo log file \"%s\": %m",
+ ULogState.file_name));
+
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("reading undo log expected %d bytes, but actually %d: %s",
+ readlen, ret, ULogState.file_name));
+
+ }
+
+ /* CRC check */
+ INIT_CRC32C(undodata_crc);
+ COMP_CRC32C(undodata_crc, &rec->ul_rmid,
+ rec->ul_tot_len - offsetof(SimpleUndoLogRecord, ul_rmid));
+ if (!EQ_CRC32C(rec->ul_crc, undodata_crc))
+ {
+ /*
+ * The location is the byte immediately following the just-read
+ * record. We cannot issue ERROR because this function is called
+ * during abort processing.
+ */
+ off_t off = lseek(ULogState.fd, 0, SEEK_CUR);
+ ereport(WARNING,
+ errmsg("incorrect undolog record checksum at %lld in %s",
+ (long long int) off, ULogState.file_name),
+ errdetail("Aborted undo processing of the corresponding transaction."));
+ }
+
+ /* The undo routines may want to allocate memory in the outer context */
+ oldcxt = MemoryContextSwitchTo(outercxt);
+ UndoRoutines[rec->ul_rmid].rm_undo(rec,
+ ULogState.file_header.state,
+ isCommit, cleanup);
+ MemoryContextSwitchTo(oldcxt);
+ }
+
+ if (ret != 0)
+ {
+ if (ret < 0)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not read undo log file \"%s\": %m",
+ ULogState.file_name));
+ if (ret != sizeof(SimpleUndoLogRecord))
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("reading undo log expected %d bytes, but actually %d: %s",
+ (int) sizeof(SimpleUndoLogRecord), ret, ULogState.file_name));
+ }
+}
+
+/*
+ * undolog_undo() - Processes undo log for the specified xid pair
+ *
+ * The undo log file for the xid pair is removed before this function returns,
+ * regardless of whether it is processed or not. Therefore, the in-memory entry
+ * for this xid pair must be removed afterwards, if any.
+ */
+static void
+undolog_undo(bool isCommit, TransactionId xid)
+{
+ Assert(!IsParallelWorker());
+
+ /* Return if no undo log exists for the xid pair */
+ if (!undolog_open_file(xid, false))
+ return;
+
+ undolog_process_ulogfile(isCommit, false, CurrentMemoryContext);
+
+ undolog_remove_file_by_xid(xid);
+}
+
+/*
+ * undolog_exists() - Return true if the log file for the xid pair exists.
+ *
+ * This function has no side effects.
+ */
+static bool
+undolog_exists(TransactionId xid)
+{
+ char fname[MAXPGPATH];
+ struct stat statbuf;
+
+ undolog_set_filename(fname, xid);
+
+ if (stat(fname, &statbuf) < 0)
+ {
+ if (errno == ENOENT)
+ return false;
+ ereport(ERROR, errmsg("stat failed for undo file \"%s\": %m", fname));
+ }
+
+ return true;
+}
+
+/*
+ * SimpleUndoLog_UndoByXid()
+ *
+ * Processes undo logs for the specified transaction, intended for use in
+ * finishing prepared transactions or recovery.
+ *
+ * children is the list of subtransaction IDs of the topxid, with a length of
+ * nchildren.
+ */
+void
+SimpleUndoLog_UndoByXid(bool isCommit, TransactionId xid,
+ int nchildren, TransactionId *children)
+{
+ ActiveULog *p;
+ ActiveULog *prev;
+
+ Assert (RecoveryInProgress() || ULogState.active_ulogs == NULL);
+
+ /* process undo logs */
+ if (undolog_exists(xid))
+ undolog_undo(isCommit, xid);
+
+ for (int i = 0 ; i < nchildren ; i++)
+ {
+ if (undolog_exists(children[i]))
+ undolog_undo(isCommit, children[i]);
+ }
+
+ /*
+ * Remove in-memory entries for this transaction tree if any.
+ * We have these entries only during recovery.
+ */
+ Assert (RecoveryInProgress() || ULogState.active_ulogs == NULL);
+ prev = NULL;
+ for (p = ULogState.active_ulogs ; p ; p = p->next)
+ {
+ bool match = false;
+
+ if (p->xid == xid)
+ match = true;
+ else
+ {
+ for (int i = 0 ; i < nchildren ; i++)
+ {
+ if (p->xid == children[i])
+ {
+ match = true;
+ break;
+ }
+ }
+ }
+
+ if (!match)
+ {
+ prev = p;
+ continue;
+ }
+
+ /* remove this entry */
+ if (prev)
+ prev->next = p->next;
+ else
+ ULogState.active_ulogs = p->next;
+ pfree(p);
+
+ break;
+ }
+}
+
+/*
+ * AtEOXact_SimpleUndoLog() - At end-of-xact processing of undo logs.
+ *
+ * Processes all existing undo log files, leaving none remaining after this
+ * function returns.
+ */
+void
+AtEOXact_SimpleUndoLog(bool isCommit)
+{
+ ActiveULog *p = ULogState.active_ulogs;
+
+ if (!p)
+ return;
+
+ while(p)
+ {
+ ActiveULog *prev = p;
+
+ undolog_undo(isCommit, p->xid);
+
+ ULogState.active_ulogs = p = p->next;
+ pfree(prev);
+ }
+
+ ULogState.active_ulogs = ULogState.current_ulog = NULL;
+}
+
+/*
+ * AtEOSubXact_SimpleUndoLog() - At end-of-subxact processing of undo logs.
+ *
+ * The undo log for the subtransaction will be removed on abort. It will remain
+ * on commit and be processed at the end of the top-level transaction.
+ */
+void
+AtEOSubXact_SimpleUndoLog(bool isCommit)
+{
+ ActiveULog *p = ULogState.current_ulog;
+ TransactionId xid;
+
+ /*
+ * Undo logs of committed subtransactions are processed at the end of the
+ * top-level transaction.
+ */
+ if (isCommit || !p)
+ return;
+
+ xid = GetCurrentTransactionIdIfAny();
+
+ /* Return if the innermost subxid is not assigned. */
+ if (!TransactionIdIsValid(xid))
+ return;
+
+ if (!undolog_select_entry(xid, false))
+ return;
+
+ undolog_undo(isCommit, xid);
+ undolog_remove_entry();
+}
+
+/*
+ * UndoLogCleanup() - On-recovery cleanup of undo log
+ *
+ * This function is called once before the redo process of recovery to remove
+ * files created by uncommitted transactions before the server crash. It is
+ * then called again after the redo process to clean up any leftover garbage
+ * after the redo process.
+ */
+void
+UndoLogCleanup(void)
+{
+ DIR *dirdesc;
+ struct dirent *de;
+ MemoryContext mcxt, outercxt;
+ ActiveULog *next;
+
+ /*
+ * Some memory allocation occurs during this process. Use a separate memory
+ * context to avoid memory leaks.
+ */
+ mcxt = AllocSetContextCreate(CurrentMemoryContext,
+ "UndologContext",
+ ALLOCSET_DEFAULT_SIZES);
+ outercxt = MemoryContextSwitchTo(mcxt);
+
+ undolog_close_file();
+
+ /* scan through all undo log files */
+ dirdesc = AllocateDir(SIMPLE_UNDOLOG_DIR);
+ while ((de = ReadDir(dirdesc, SIMPLE_UNDOLOG_DIR)) != NULL)
+ {
+ if (strspn(de->d_name, "0123456789abcdef") < strlen(de->d_name))
+ continue;
+
+ snprintf(ULogState.file_name, MAXPGPATH, "%s/%s",
+ SIMPLE_UNDOLOG_DIR, de->d_name);
+
+ undolog_open_file_by_name(false);
+ undolog_process_ulogfile(false, true, outercxt);
+
+ /* Mark this log as crashed after prepared if not yet done */
+ if (ULogState.file_header.state == ULOG_FILE_PREPARED)
+ {
+ ULogState.file_header.state = ULOG_FILE_CRASH_AFTER_PREPARED;
+ undolog_write_file_header();
+ undolog_sync_file();
+ }
+ undolog_close_file();
+
+ /*
+ * Do not remove ULOG files for prepared transactions. We cannot
+ * remove them and let recovery recreate them, because an existing file
+ * for a prepared transaction may contain logs from before the latest
+ * checkpoint, which would be lost in the newly created ulog files.
+ */
+ if (ULogState.file_header.state == ULOG_FILE_DEFAULT)
+ undolog_remove_file();
+ }
+
+ MemoryContextSwitchTo(outercxt);
+ MemoryContextDelete(mcxt);
+
+ /* Clean up in-memory data */
+ for (ActiveULog *p = ULogState.active_ulogs ; p ; p = next)
+ {
+ next = p->next;
+ pfree(p);
+ }
+
+ ULogState.active_ulogs = ULogState.current_ulog = NULL;
+ ULogState.xid = InvalidTransactionId;
+ ULogState.file_name[0] = 0;
+}
+
+/*
+ * AtPrepare_SimpleUndoLog()
+ *
+ * Mark all undo logs as prepared.
+ *
+ * This mark is referenced by crash recovery to determine that each UNDO log
+ * file needs to be preserved.
+ */
+void
+AtPrepare_SimpleUndoLog(void)
+{
+ ActiveULog *p = ULogState.active_ulogs;
+ ActiveULog *prev = NULL;
+ ActiveULog *tmp;
+
+ Assert (!RecoveryInProgress());
+
+ while(p)
+ {
+ /* error out if no undo log exists for the transaction */
+ if (!undolog_open_file(p->xid, false))
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("failed to open undolog file \"%s\": %m",
+ ULogState.file_name));
+
+ undolog_load_file_header();
+ ULogState.file_header.state = ULOG_FILE_PREPARED;
+ undolog_write_file_header();
+ undolog_sync_file();
+ undolog_close_file();
+
+ if (prev)
+ prev->next = p->next;
+ else
+ ULogState.active_ulogs = p->next;
+
+ tmp = p->next;
+ pfree(p);
+ p = tmp;
+ }
+}
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index e98286d768..bd31b77906 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -78,6 +78,7 @@
#include "access/commit_ts.h"
#include "access/htup_details.h"
+#include "access/simpleundolog.h"
#include "access/subtrans.h"
#include "access/transam.h"
#include "access/twophase.h"
@@ -1587,6 +1588,8 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
abortstats,
gid);
+ SimpleUndoLog_UndoByXid(isCommit, xid, hdr->nsubxacts, children);
+
ProcArrayRemove(proc, latestXid);
/*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 0fe1630fca..19223d35de 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -24,6 +24,7 @@
#include "access/multixact.h"
#include "access/parallel.h"
#include "access/subtrans.h"
+#include "access/simpleundolog.h"
#include "access/transam.h"
#include "access/twophase.h"
#include "access/xact.h"
@@ -2418,6 +2419,14 @@ CommitTransaction(void)
AtEOXact_MultiXact();
+ /*
+ * Drop storage files. This has to happen after buffer pins are dropped,
+ * as required by DropRelationBuffers(). Only abort-time cleanup strictly
+ * needs this ordering, but commit does it at the same point for
+ * consistency.
+ */
+ AtEOXact_SimpleUndoLog(true);
+
ResourceOwnerRelease(TopTransactionResourceOwner,
RESOURCE_RELEASE_LOCKS,
true, true);
@@ -2656,6 +2665,7 @@ PrepareTransaction(void)
AtPrepare_PgStat();
AtPrepare_MultiXact();
AtPrepare_RelationMap();
+ AtPrepare_SimpleUndoLog();
/*
* Here is where we really truly prepare.
@@ -2953,6 +2963,13 @@ AbortTransaction(void)
AtEOXact_RelationCache(false);
AtEOXact_Inval(false);
AtEOXact_MultiXact();
+
+ /*
+ * Drop storage files. This has to happen after buffer pins are
+ * dropped, required by DropRelationBuffers().
+ */
+ AtEOXact_SimpleUndoLog(false);
+
ResourceOwnerRelease(TopTransactionResourceOwner,
RESOURCE_RELEASE_LOCKS,
false, true);
@@ -5155,6 +5172,7 @@ CommitSubTransaction(void)
s->parent->subTransactionId);
AtEOSubXact_Inval(true);
AtSubCommit_smgr();
+ AtEOSubXact_SimpleUndoLog(true);
/*
* The only lock we actually release here is the subtransaction XID lock.
@@ -5336,6 +5354,7 @@ AbortSubTransaction(void)
RESOURCE_RELEASE_AFTER_LOCKS,
false, false);
AtSubAbort_smgr();
+ AtEOSubXact_SimpleUndoLog(false);
AtEOXact_GUC(false, s->gucNestLevel);
AtEOSubXact_SPI(false, s->subTransactionId);
@@ -6226,6 +6245,8 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
DropRelationFiles(parsed->xlocators, parsed->nrels, true);
}
+ SimpleUndoLog_UndoByXid(true, xid, parsed->nsubxacts, parsed->subxacts);
+
if (parsed->nstats > 0)
{
/* see equivalent call for relations above */
@@ -6337,6 +6358,8 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
DropRelationFiles(parsed->xlocators, parsed->nrels, true);
}
+ SimpleUndoLog_UndoByXid(false, xid, parsed->nsubxacts, parsed->subxacts);
+
if (parsed->nstats > 0)
{
/* see equivalent call for relations above */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b3d38cfaf8..2809ce5014 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -51,6 +51,7 @@
#include "access/heaptoast.h"
#include "access/multixact.h"
#include "access/rewriteheap.h"
+#include "access/simpleundolog.h"
#include "access/subtrans.h"
#include "access/timeline.h"
#include "access/transam.h"
@@ -5732,6 +5733,13 @@ StartupXLOG(void)
/* Check that the GUCs used to generate the WAL allow recovery */
CheckRequiredParameterValues();
+ /*
+ * Perform undo processing. INIT forks left behind by uncommitted
+ * transactions could otherwise cause data to be deleted by mistake, so
+ * this must be done before resetting UNLOGGED relations.
+ */
+ UndoLogCleanup();
+
/*
* We're in recovery, so unlogged relations may be trashed and must be
* reset. This should be done BEFORE allowing Hot Standby
@@ -5880,14 +5888,18 @@ StartupXLOG(void)
}
/*
- * Reset unlogged relations to the contents of their INIT fork. This is
- * done AFTER recovery is complete so as to include any unlogged relations
- * created during recovery, but BEFORE recovery is marked as having
- * completed successfully. Otherwise we'd not retry if any of the post
- * end-of-recovery steps fail.
+ * Process undo logs left after recovery to clean up uncommitted storage
+ * files, including INIT forks, then reset unlogged relations to the
+ * contents of their INIT fork. This is done AFTER recovery is complete so
+ * as to include any file creations during recovery, but BEFORE recovery is
+ * marked as having completed successfully. Otherwise we'd not retry if any
+ * of the post end-of-recovery steps fail.
*/
if (InRecovery)
+ {
+ UndoLogCleanup();
ResetUnloggedRelations(UNLOGGED_RELATION_INIT);
+ }
/*
* Pre-scan prepared transactions to find out the range of XIDs present.
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index f00718a015..1881ef06ff 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -307,6 +307,7 @@ void setup_signals(void);
void setup_text_search(void);
void create_data_directory(void);
void create_xlog_or_symlink(void);
+void create_ulog(void);
void warn_on_mount_point(int error);
void initialize_data_directory(void);
@@ -2958,6 +2959,21 @@ create_xlog_or_symlink(void)
free(subdirloc);
}
+/* Create undo log directory */
+void
+create_ulog(void)
+{
+ char *subdirloc;
+
+ /* form name of the place for the subdirectory */
+ subdirloc = psprintf("%s/pg_ulog", pg_data);
+
+ if (mkdir(subdirloc, pg_dir_create_mode) < 0)
+ pg_fatal("could not create directory \"%s\": %m",
+ subdirloc);
+
+ free(subdirloc);
+}
void
warn_on_mount_point(int error)
@@ -2992,6 +3008,7 @@ initialize_data_directory(void)
create_data_directory();
create_xlog_or_symlink();
+ create_ulog();
/* Create required subdirectories (other than pg_wal) */
printf(_("creating subdirectories ... "));
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 22f7351fdc..525b98899f 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -28,7 +28,7 @@
* RmgrNames is an array of the built-in resource manager names, to make error
* messages a bit nicer.
*/
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo) \
name,
static const char *const RmgrNames[RM_MAX_ID + 1] = {
diff --git a/src/bin/pg_waldump/rmgrdesc.c b/src/bin/pg_waldump/rmgrdesc.c
index 6b8c17bb4c..a21009c5b8 100644
--- a/src/bin/pg_waldump/rmgrdesc.c
+++ b/src/bin/pg_waldump/rmgrdesc.c
@@ -32,7 +32,7 @@
#include "storage/standbydefs.h"
#include "utils/relmapper.h"
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo) \
{ name, desc, identify},
static const RmgrDescData RmgrDescTable[RM_N_BUILTIN_IDS] = {
diff --git a/src/include/access/rmgr.h b/src/include/access/rmgr.h
index 3b6a497e1b..d705de9256 100644
--- a/src/include/access/rmgr.h
+++ b/src/include/access/rmgr.h
@@ -19,7 +19,7 @@ typedef uint8 RmgrId;
* Note: RM_MAX_ID must fit in RmgrId; widening that type will affect the XLOG
* file format.
*/
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo) \
symname,
typedef enum RmgrIds
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 78e6b908c6..9afb9bafcc 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -24,26 +24,26 @@
* Changes to this list possibly need an XLOG_PAGE_MAGIC bump.
*/
-/* symbol name, textual name, redo, desc, identify, startup, cleanup, mask, decode */
-PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL, NULL, xlog_decode)
-PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL, NULL, xact_decode)
-PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL, standby_decode)
-PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask, heap2_decode)
-PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask, heap_decode)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask, NULL)
-PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask, NULL)
-PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask, NULL)
-PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask, NULL)
-PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL, seq_mask, NULL)
-PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup, spg_mask, NULL)
-PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL, brin_mask, NULL)
-PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask, NULL)
-PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL, logicalmsg_decode)
+/* symbol name, textual name, redo, desc, identify, startup, cleanup, mask, decode, undo */
+PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL, NULL, xlog_decode, NULL)
+PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL, NULL, xact_decode, NULL)
+PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL, standby_decode, NULL)
+PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask, heap2_decode, NULL)
+PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask, heap_decode, NULL)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask, NULL, NULL)
+PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask, NULL, NULL)
+PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask, NULL, NULL)
+PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask, NULL, NULL)
+PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL, seq_mask, NULL, NULL)
+PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup, spg_mask, NULL, NULL)
+PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL, brin_mask, NULL, NULL)
+PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask, NULL, NULL)
+PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL, logicalmsg_decode, NULL)
diff --git a/src/include/access/simpleundolog.h b/src/include/access/simpleundolog.h
new file mode 100644
index 0000000000..93aacfb73c
--- /dev/null
+++ b/src/include/access/simpleundolog.h
@@ -0,0 +1,44 @@
+#ifndef SIMPLE_UNDOLOG_H
+#define SIMPLE_UNDOLOG_H
+
+#include "access/rmgr.h"
+#include "port/pg_crc32c.h"
+
+#define SIMPLE_UNDOLOG_DIR "pg_ulog"
+
+typedef struct SimpleUndoLogRecord
+{
+ uint32 ul_tot_len; /* total length of entire record */
+ pg_crc32c ul_crc; /* CRC for this record */
+ RmgrId ul_rmid; /* resource manager for this record */
+ uint8 ul_info; /* record info */
+ /* rmgr-specific data follow, no padding */
+} SimpleUndoLogRecord;
+
+/* State of the undo log file */
+typedef enum UndoLogFileState
+{
+ ULOG_FILE_DEFAULT, /* normal state */
+ ULOG_FILE_PREPARED, /* ulog file is prepared */
+ ULOG_FILE_CRASH_AFTER_PREPARED /* experienced a crash after prepared */
+} UndoLogFileState;
+
+/*
+ * The high 4 bits in ul_info may be used freely by rmgr. The lower 4 bits are
+ * not used for now.
+ */
+#define ULR_INFO_MASK 0x0F
+#define ULR_RMGR_INFO_MASK 0xF0
+
+extern void SimpleUndoLogWrite(RmgrId rmgr, uint8 info, void *data, int len);
+extern void SimpleUndoLogWriteRedo(RmgrId rmgr, uint8 info, void *data, int len,
+ TransactionId xid);
+extern void SimpleUndoLogSetPreparedRedo(void);
+extern void AtEOXact_SimpleUndoLog(bool isCommit);
+extern void AtEOSubXact_SimpleUndoLog(bool isCommit);
+extern void AtPrepare_SimpleUndoLog(void);
+extern void SimpleUndoLog_UndoByXid(bool isCommit, TransactionId xid,
+ int nchildren, TransactionId *children);
+extern void UndoLogCleanup(void);
+
+#endif /* SIMPLE_UNDOLOG_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9e951a9e6f..04f3eca550 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -25,6 +25,7 @@ AclResult
AcquireSampleRowsFunc
ActionList
ActiveSnapshotElt
+ActiveULog
AddForeignUpdateTargets_function
AddrInfo
AffixNode
@@ -2658,6 +2659,7 @@ SimplePtrListCell
SimpleStats
SimpleStringList
SimpleStringListCell
+SimpleUndoLogRecord
SingleBoundSortItem
Size
SkipPages
@@ -3026,8 +3028,11 @@ UINT
ULARGE_INTEGER
ULONG
ULONG_PTR
+ULogStateData
UV
UVersionInfo
+UndoDescData
+UndoLogFileHeader
UnicodeNormalizationForm
UnicodeNormalizationQC
Unique
--
2.43.5
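
The per-file decision that the cleanup loop in this patch makes for each entry in pg_ulog can be summarized in a small standalone sketch. This is only an illustration under stated assumptions: the helper names is_ulog_file_name() and ulog_cleanup_decision() are hypothetical and do not appear in the patch, and the real code works on ULogState and on-disk file headers rather than on a bare enum. Unlike the patch, the name check here also rejects an empty name.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Mirrors UndoLogFileState from the patch's simpleundolog.h. */
typedef enum UndoLogFileState
{
	ULOG_FILE_DEFAULT,				/* normal state */
	ULOG_FILE_PREPARED,				/* ulog file is prepared */
	ULOG_FILE_CRASH_AFTER_PREPARED	/* experienced a crash after prepared */
} UndoLogFileState;

/*
 * Hypothetical helper: true if a directory entry name consists only of
 * lowercase hexadecimal digits, which is how the loop skips "." and "..".
 */
bool
is_ulog_file_name(const char *name)
{
	assert(name != NULL);
	return name[0] != '\0' &&
		strspn(name, "0123456789abcdef") == strlen(name);
}

/*
 * Hypothetical helper: apply the crash-cleanup decision to one file.
 * A PREPARED file is re-marked as CRASH_AFTER_PREPARED and kept.
 * Returns true only when the file may be removed (DEFAULT state).
 */
bool
ulog_cleanup_decision(UndoLogFileState *state)
{
	if (*state == ULOG_FILE_PREPARED)
		*state = ULOG_FILE_CRASH_AFTER_PREPARED;

	return *state == ULOG_FILE_DEFAULT;
}
```

In this sketch, files that belong to prepared transactions are never reported as removable, which matches the comment in the patch about not letting recovery recreate them.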
v34-0003-Remove-function-for-retaining-files-on-outer-tra.patch (text/x-patch; charset=us-ascii)
From e46c105da5e21773ecd743472d9b202dcefc3655 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 26 Jul 2024 09:40:17 +0900
Subject: [PATCH v34 03/16] Remove function for retaining files on outer
transaction aborts
The function RelationPreserveStorage() was initially created to keep
storage files committed in a subtransaction on the abort of outer
transactions. It was introduced by commit b9b8831ad6 in 2010, but no
use case for this behavior has emerged since then. If we move the
at-commit removal feature of storage files from pendingDeletes to the
UNDO log system, the UNDO system would need to accept the cancellation
of already logged entries, which makes the system overly complex with
no benefit. Therefore, remove the feature.
---
src/backend/catalog/storage.c | 16 +++++++++++++++
src/backend/utils/cache/relmapper.c | 30 +++++++++--------------------
2 files changed, 25 insertions(+), 21 deletions(-)
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index f56b3cc0f2..bdbed9fba3 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -254,6 +254,22 @@ RelationPreserveStorage(RelFileLocator rlocator, bool atCommit)
PendingRelDelete *prev;
PendingRelDelete *next;
+ /*
+ * There is no caller that passes false for atCommit.
+ *
+ * The only caller that used to pass false for atCommit was
+ * write_relmap_file(), which intended to preserve committed storage
+ * files for mapped relations if outer transactions aborted. However, this
+ * has not occurred for more than ten years, and it is unlikely to be
+ * needed in the future. The code to let storage files committed in
+ * subtransactions survive after the top transaction aborts makes the UNDO
+ * log system overly complex and inefficient. Therefore, this feature has
+ * been removed. The function signature is left unchanged to make this
+ * change less invasive and to prevent the function from being mistakenly
+ * called during transaction aborts.
+ */
+ Assert (atCommit);
+
prev = NULL;
for (pending = pendingDeletes; pending != NULL; pending = next)
{
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 48d344ae3f..8907262712 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -1001,29 +1001,17 @@ write_relmap_file(RelMapFile *newmap, bool write_wal, bool send_sinval,
CacheInvalidateRelmap(dbid);
/*
- * Make sure that the files listed in the map are not deleted if the outer
- * transaction aborts. This had better be within the critical section
- * too: it's not likely to fail, but if it did, we'd arrive at transaction
- * abort with the files still vulnerable. PANICing will leave things in a
- * good state on-disk.
+ * There was a call to RelationPreserveStorage(). It was originally
+ * intended to ensure that storage files committed in subtransactions would
+ * survive an outer transaction's abort. This was introduced by commit
+ * b9b8831ad6 in 2010, but no use case has emerged since then. To simplify
+ * the UNDO log system, this code has been removed. See
+ * RelationMapUpdateMap() for more details. Now, we only check that this
+ * function is called in a top transaction.
*
- * Note: we're cheating a little bit here by assuming that mapped files
- * are either in pg_global or the database's default tablespace.
+ * During boot processing or recovery, the nest level will be zero.
*/
- if (preserve_files)
- {
- int32 i;
-
- for (i = 0; i < newmap->num_mappings; i++)
- {
- RelFileLocator rlocator;
-
- rlocator.spcOid = tsid;
- rlocator.dbOid = dbid;
- rlocator.relNumber = newmap->mappings[i].mapfilenumber;
- RelationPreserveStorage(rlocator, false);
- }
- }
+ Assert(!preserve_files || GetCurrentTransactionNestLevel() <= 1);
/* Critical section done */
if (write_wal)
--
2.43.5
v34-0004-Remove-function-for-retaining-files-on-outer-tra.patch (text/x-patch; charset=us-ascii)
From 92de2b22eb87db317f34372724d6a6fb9f247b83 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Wed, 31 Jul 2024 17:49:24 +0900
Subject: [PATCH v34 04/16] Remove function for retaining files on outer
transaction aborts - phase 2
Remove function parameters made unnecessary by the previous commit.
---
src/backend/utils/cache/relmapper.c | 39 ++++++++++++++---------------
1 file changed, 19 insertions(+), 20 deletions(-)
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 8907262712..75a1f55050 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -143,7 +143,7 @@ static void load_relmap_file(bool shared, bool lock_held);
static void read_relmap_file(RelMapFile *map, char *dbpath, bool lock_held,
int elevel);
static void write_relmap_file(RelMapFile *newmap, bool write_wal,
- bool send_sinval, bool preserve_files,
+ bool send_sinval,
Oid dbid, Oid tsid, const char *dbpath);
static void perform_relmap_update(bool shared, const RelMapFile *updates);
@@ -309,7 +309,7 @@ RelationMapCopy(Oid dbid, Oid tsid, char *srcdbpath, char *dstdbpath)
* file.
*/
LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
- write_relmap_file(&map, true, false, false, dbid, tsid, dstdbpath);
+ write_relmap_file(&map, true, false, dbid, tsid, dstdbpath);
LWLockRelease(RelationMappingLock);
}
@@ -634,9 +634,9 @@ RelationMapFinishBootstrap(void)
/* Write the files; no WAL or sinval needed */
LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
- write_relmap_file(&shared_map, false, false, false,
+ write_relmap_file(&shared_map, false, false,
InvalidOid, GLOBALTABLESPACE_OID, "global");
- write_relmap_file(&local_map, false, false, false,
+ write_relmap_file(&local_map, false, false,
MyDatabaseId, MyDatabaseTableSpace, DatabasePath);
LWLockRelease(RelationMappingLock);
}
@@ -887,7 +887,7 @@ read_relmap_file(RelMapFile *map, char *dbpath, bool lock_held, int elevel)
*/
static void
write_relmap_file(RelMapFile *newmap, bool write_wal, bool send_sinval,
- bool preserve_files, Oid dbid, Oid tsid, const char *dbpath)
+ Oid dbid, Oid tsid, const char *dbpath)
{
int fd;
char mapfilename[MAXPGPATH];
@@ -1000,19 +1000,6 @@ write_relmap_file(RelMapFile *newmap, bool write_wal, bool send_sinval,
if (send_sinval)
CacheInvalidateRelmap(dbid);
- /*
- * There was a call to RelationPreserveStorage(). It was originally
- * intended to ensure that storage files committed in subtransactions would
- * survive an outer transaction's abort. This was introduced by commit
- * b9b8831ad6 in 2010, but no use case has emerged since then. To simplify
- * the UNDO log system, this code has been removed. See
- * RelationMapUpdateMap() for more details. Now, we only check that this
- * function is called in a top transaction.
- *
- * During boot processing or recovery, the nest level will be zero.
- */
- Assert(!preserve_files || GetCurrentTransactionNestLevel() <= 1);
-
/* Critical section done */
if (write_wal)
END_CRIT_SECTION();
@@ -1058,8 +1045,20 @@ perform_relmap_update(bool shared, const RelMapFile *updates)
*/
merge_map_updates(&newmap, updates, allowSystemTableMods);
+ /*
+ * write_relmap_file() had a feature to allow storage files committed in
+ * subtransactions to survive the aborts of outer transactions. This was
+ * introduced by commit b9b8831ad6 in 2010, but no use case has emerged
+ * since then. To keep the UNDO log system straightforward, this code has
+ * been removed. See RelationMapUpdateMap() for more details. Now, we
+ * only check that this function is called in a top-level transaction.
+ *
+ * During boot processing or recovery, the nest level will be zero.
+ */
+ Assert (GetCurrentTransactionNestLevel() <= 1);
+
/* Write out the updated map and do other necessary tasks */
- write_relmap_file(&newmap, true, true, true,
+ write_relmap_file(&newmap, true, true,
(shared ? InvalidOid : MyDatabaseId),
(shared ? GLOBALTABLESPACE_OID : MyDatabaseTableSpace),
(shared ? "global" : DatabasePath));
@@ -1118,7 +1117,7 @@ relmap_redo(XLogReaderState *record)
* performed.
*/
LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
- write_relmap_file(&newmap, false, true, false,
+ write_relmap_file(&newmap, false, true,
xlrec->dbid, xlrec->tsid, dbpath);
LWLockRelease(RelationMappingLock);
--
2.43.5
v34-0005-Prevent-orphan-storage-files-after-server-crash.patch (text/x-patch; charset=us-ascii)
From e3ff0c077cb669f55cf77ab5c7ab86bf201e68a0 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 1 Aug 2024 11:44:12 +0900
Subject: [PATCH v34 05/16] Prevent orphan storage files after server crash
When a server crashes during a transaction that created tables, newly
created but unused storage files are not removed. This patch prevents
such orphan files by utilizing the UNDO log system for storage files.
The behavior of this feature overlaps with the existing functionality
that removes unnecessary files during rollback using pendingDeletes,
so that part is removed. Commit-time file deletions, on the other
hand, are outside the scope of the UNDO log, so pendingDeletes
continues to handle them. As a result, the atCommit flag of the
entries in the pendingDeletes list is now always true. However, to
avoid non-essential changes to the code, the flag is retained.
---
src/backend/access/heap/heapam_handler.c | 22 +--
src/backend/access/transam/simpleundolog.c | 1 +
src/backend/catalog/index.c | 4 +-
src/backend/catalog/storage.c | 179 +++++++++++++++++----
src/backend/commands/sequence.c | 4 +-
src/backend/commands/tablecmds.c | 19 +--
src/backend/storage/buffer/bufmgr.c | 4 +-
src/backend/storage/file/reinit.c | 78 +++++++++
src/backend/storage/smgr/smgr.c | 9 ++
src/include/access/rmgrlist.h | 2 +-
src/include/catalog/storage.h | 2 +
src/include/catalog/storage_ulog.h | 24 +++
src/include/storage/reinit.h | 3 +
src/include/storage/smgr.h | 1 +
src/tools/pgindent/typedefs.list | 1 +
15 files changed, 298 insertions(+), 55 deletions(-)
create mode 100644 src/include/catalog/storage_ulog.h
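
RelationCreateFork() in this patch writes the UNDO record before it creates the file, so a crash between the two steps leaves a record without a file, never a file without a record. The toy model below sketches that ordering argument. It is an illustration only: ToyStorage, toy_create_fork() and toy_crash_cleanup() are hypothetical names, the arrays stand in for durable state, and every fork in the model belongs to an uncommitted transaction.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_FORKS 8

/* Toy stand-ins for durable state; these are not PostgreSQL structures. */
typedef struct ToyStorage
{
	bool		undo_logged[MAX_FORKS];	/* an undo record exists for fork i */
	bool		file_exists[MAX_FORKS];	/* the file for fork i exists */
} ToyStorage;

/*
 * Create a fork, writing the undo record first. When crash_before_create
 * is true, simulate a crash after the undo record is written but before
 * the file is created.
 */
void
toy_create_fork(ToyStorage *st, int forknum, bool crash_before_create)
{
	assert(forknum >= 0 && forknum < MAX_FORKS);

	st->undo_logged[forknum] = true;	/* step 1: undo record */
	if (crash_before_create)
		return;
	st->file_exists[forknum] = true;	/* step 2: the file itself */
}

/*
 * Crash cleanup: remove every file that has an undo record and drop the
 * record. Returns the number of orphans, that is, files left without a
 * record that nothing would ever remove.
 */
int
toy_crash_cleanup(ToyStorage *st)
{
	int			orphans = 0;

	for (int i = 0; i < MAX_FORKS; i++)
	{
		if (st->undo_logged[i])
		{
			st->file_exists[i] = false;
			st->undo_logged[i] = false;
		}
		else if (st->file_exists[i])
			orphans++;
	}
	return orphans;
}
```

With the steps in the opposite order, a crash between them would leave file_exists set with no undo record, which is exactly the orphan case this patch is meant to prevent.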
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 1c6da286d4..e9daa1d59e 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -612,8 +612,7 @@ heapam_relation_set_new_filelocator(Relation rel,
Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
rel->rd_rel->relkind == RELKIND_MATVIEW ||
rel->rd_rel->relkind == RELKIND_TOASTVALUE);
- smgrcreate(srel, INIT_FORKNUM, false);
- log_smgrcreate(newrlocator, INIT_FORKNUM);
+ RelationCreateFork(srel, INIT_FORKNUM, true, true);
}
smgrclose(srel);
@@ -657,16 +656,17 @@ heapam_relation_copy_data(Relation rel, const RelFileLocator *newrlocator)
{
if (smgrexists(RelationGetSmgr(rel), forkNum))
{
- smgrcreate(dstrel, forkNum, false);
-
- /*
- * WAL log creation if the relation is persistent, or this is the
- * init fork of an unlogged relation.
- */
- if (RelationIsPermanent(rel) ||
+ bool wal_log = RelationIsPermanent(rel) ||
(rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
- forkNum == INIT_FORKNUM))
- log_smgrcreate(newrlocator, forkNum);
+ forkNum == INIT_FORKNUM);
+
+ /*
+ * Usually, we don't use UNDO log for FSM or VM forks, as their
+ * creation is not transactional. However, we're currently copying
+ * the entire relation in a transactional manner, which requires
+ * after-crash cleanup.
+ */
+ RelationCreateFork(dstrel, forkNum, wal_log, true);
RelationCopyStorage(RelationGetSmgr(rel), dstrel, forkNum,
rel->rd_rel->relpersistence);
}
diff --git a/src/backend/access/transam/simpleundolog.c b/src/backend/access/transam/simpleundolog.c
index 4bba27e340..f8a7fd367f 100644
--- a/src/backend/access/transam/simpleundolog.c
+++ b/src/backend/access/transam/simpleundolog.c
@@ -25,6 +25,7 @@
#include "access/twophase_rmgr.h"
#include "access/xact.h"
#include "access/xlog.h"
+#include "catalog/storage_ulog.h"
#include "miscadmin.h"
#include "storage/fd.h"
#include "utils/memutils.h"
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 33759056e3..a573f1a702 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3030,8 +3030,8 @@ index_build(Relation heapRelation,
if (indexRelation->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
!smgrexists(RelationGetSmgr(indexRelation), INIT_FORKNUM))
{
- smgrcreate(RelationGetSmgr(indexRelation), INIT_FORKNUM, false);
- log_smgrcreate(&indexRelation->rd_locator, INIT_FORKNUM);
+ RelationCreateFork(RelationGetSmgr(indexRelation),
+ INIT_FORKNUM, true, true);
indexRelation->rd_indam->ambuildempty(indexRelation);
}
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index bdbed9fba3..31400c514f 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,22 +19,41 @@
#include "postgres.h"
+#include "access/amapi.h"
+#include "access/simpleundolog.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
#include "access/xlogutils.h"
#include "catalog/storage.h"
+#include "catalog/storage_ulog.h"
#include "catalog/storage_xlog.h"
+#include "common/file_utils.h"
#include "miscadmin.h"
#include "storage/bulk_write.h"
+#include "storage/copydir.h"
+#include "storage/fd.h"
#include "storage/freespace.h"
#include "storage/proc.h"
+#include "storage/reinit.h"
#include "storage/smgr.h"
#include "utils/hsearch.h"
+#include "utils/inval.h"
#include "utils/memutils.h"
#include "utils/rel.h"
+/* ULOG gives us high 4 bits (just following xlog) */
+#define ULOG_SMGR_CREATE 0x10
+
+/* undo log entry for storage file creation */
+typedef struct ul_smgr_create
+{
+ RelFileLocator rlocator;
+ ProcNumber backend;
+ ForkNumber forknum;
+} ul_smgr_create;
+
/* GUC variables */
int wal_skip_threshold = 2048; /* in kilobytes */
@@ -76,6 +95,10 @@ typedef struct PendingRelSync
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
static HTAB *pendingSyncHash = NULL;
+/* local functions */
+static void ulog_smgrcreate(SMgrRelation srel, ForkNumber forkNum);
+static void ulog_smgrcreate_redo(SMgrRelation srel, ForkNumber forkNum,
+ TransactionId xid);
/*
* AddPendingSync
@@ -147,28 +170,8 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
}
srel = smgropen(rlocator, procNumber);
- smgrcreate(srel, MAIN_FORKNUM, false);
- if (needs_wal)
- log_smgrcreate(&srel->smgr_rlocator.locator, MAIN_FORKNUM);
-
- /*
- * Add the relation to the list of stuff to delete at abort, if we are
- * asked to do so.
- */
- if (register_delete)
- {
- PendingRelDelete *pending;
-
- pending = (PendingRelDelete *)
- MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
- pending->rlocator = rlocator;
- pending->procNumber = procNumber;
- pending->atCommit = false; /* delete if abort */
- pending->nestLevel = GetCurrentTransactionNestLevel();
- pending->next = pendingDeletes;
- pendingDeletes = pending;
- }
+ RelationCreateFork(srel, MAIN_FORKNUM, needs_wal, register_delete);
if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
{
@@ -179,6 +182,32 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
return srel;
}
+/*
+ * RelationCreateFork
+ * Create physical storage for a fork of a relation.
+ *
+ * This function creates a relation fork in a transactional manner. When
+ * undo_log is true, the creation is UNDO-logged so that in case of transaction
+ * aborts or server crashes later on, the fork will be removed. If the caller
+ * plans to remove the fork in another way, it should pass false. Additionally,
+ * it is WAL-logged if wal_log is true.
+ */
+void
+RelationCreateFork(SMgrRelation srel, ForkNumber forkNum,
+ bool wal_log, bool undo_log)
+{
+ /* Schedule the removal of this fork at abort if requested. */
+ if (undo_log)
+ ulog_smgrcreate(srel, forkNum);
+
+ /* WAL-log this creation if requested. */
+ if (wal_log)
+ log_smgrcreate(&srel->smgr_rlocator.locator, forkNum);
+
+ smgrcreate(srel, forkNum, false);
+
+}
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -198,6 +227,38 @@ log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform SimpleUndoLogWrite of a ULOG_SMGR_CREATE record to UNDO log.
+ */
+static void
+ulog_smgrcreate(SMgrRelation srel, ForkNumber forkNum)
+{
+ ul_smgr_create ulrec;
+
+ ulrec.rlocator = srel->smgr_rlocator.locator;
+ ulrec.backend = srel->smgr_rlocator.backend;
+ ulrec.forknum = forkNum;
+ SimpleUndoLogWrite(RM_SMGR_ID, ULOG_SMGR_CREATE,
+ &ulrec, sizeof(ulrec));
+}
+
+/*
+ * Perform SimpleUndoLogWriteRedo of a ULOG_SMGR_CREATE record to UNDO log during
+ * recovery.
+ */
+void
+ulog_smgrcreate_redo(SMgrRelation srel, ForkNumber forkNum,
+ TransactionId xid)
+{
+ ul_smgr_create ulrec;
+
+ ulrec.rlocator = srel->smgr_rlocator.locator;
+ ulrec.backend = srel->smgr_rlocator.backend;
+ ulrec.forknum = forkNum;
+ SimpleUndoLogWriteRedo(RM_SMGR_ID, ULOG_SMGR_CREATE,
+ &ulrec, sizeof(ulrec), xid);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -218,13 +279,12 @@ RelationDropStorage(Relation rel)
pendingDeletes = pending;
/*
- * NOTE: if the relation was created in this transaction, it will now be
- * present in the pending-delete list twice, once with atCommit true and
- * once with atCommit false. Hence, it will be physically deleted at end
- * of xact in either case (and the other entry will be ignored by
- * smgrDoPendingDeletes, so no error will occur). We could instead remove
- * the existing list entry and delete the physical file immediately, but
- * for now I'll keep the logic simple.
+ * NOTE: If the relation was created in this transaction, it will now be
+ * present both in the pending-delete list for commit time and in an UNDO
+ * log file for abort time. Hence, it will be physically deleted at the end
+ * of the xact in either case. Although we could remove the existing UNDO
+ * log record, the current UNDO log file format makes it difficult to
+ * delete individual records for now and maybe in the future.
*/
RelationCloseSmgr(rel);
@@ -967,6 +1027,7 @@ smgr_redo(XLogReaderState *record)
SMgrRelation reln;
reln = smgropen(xlrec->rlocator, INVALID_PROC_NUMBER);
+ ulog_smgrcreate_redo(reln, xlrec->forkNum, XLogRecGetXid(record));
smgrcreate(reln, xlrec->forkNum, true);
}
else if (info == XLOG_SMGR_TRUNCATE)
@@ -1060,3 +1121,65 @@ smgr_redo(XLogReaderState *record)
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
+
+void
+smgr_undo(SimpleUndoLogRecord *record, UndoLogFileState state,
+ bool isCommit, bool cleanup)
+{
+ uint8 info = record->ul_info & ~ULR_INFO_MASK;
+
+ if (info == ULOG_SMGR_CREATE)
+ {
+ ul_smgr_create *ulrec = (ul_smgr_create *) ULogRecGetData(record);
+
+ if (state != ULOG_FILE_DEFAULT && cleanup)
+ {
+ Assert(!isCommit);
+ /*
+ * During post-crash cleanup, if the transaction that created the
+ * fork was already prepared before the crash, the fate of the file
+ * should be determined by whether the prepared transaction will be
+ * committed or not. Tell reinit not to reset this relation.
+ */
+ ResetUnloggedRelationIgnore(ulrec->rlocator,
+ ulrec->backend);
+ }
+ else if (!isCommit)
+ {
+ /* Otherwise, remove the file immediately. */
+ SMgrRelation reln;
+ ForkNumber forks[3];
+ BlockNumber firstblocks[3] = {0};
+ int nforks = 0;
+
+ forks[nforks++] = ulrec->forknum;
+
+ /*
+ * If the MAIN fork was created in the transaction, the rollback
+ * should remove all forks of this relation. Although we could
+ * register an undo record individually for each fork, this may be
+ * more complex because VM and FSM can be created
+ * non-transactionally outside the transaction that created the
+ * MAIN fork.
+ */
+ if (ulrec->forknum == MAIN_FORKNUM)
+ {
+ forks[nforks++] = VISIBILITYMAP_FORKNUM;
+ forks[nforks++] = FSM_FORKNUM;
+ }
+
+ /*
+ * Drop buffers, then the files. This can be improved by using
+ * smgrdounlinkall(), but currently I take the simpler way.
+ */
+ reln = smgropen(ulrec->rlocator, ulrec->backend);
+ DropRelationBuffers(reln, forks, nforks, firstblocks);
+ for (int i = 0 ; i < nforks ; i++)
+ smgrunlink(reln, forks[i], true);
+
+ smgrclose(reln);
+ }
+ }
+ else
+ elog(PANIC, "smgr_undo: unknown op code %u", info);
+}
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index b37fd688d3..065bfbc1c9 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -31,6 +31,7 @@
#include "catalog/objectaccess.h"
#include "catalog/pg_sequence.h"
#include "catalog/pg_type.h"
+#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
#include "commands/defrem.h"
#include "commands/sequence.h"
@@ -344,8 +345,7 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
SMgrRelation srel;
srel = smgropen(rel->rd_locator, INVALID_PROC_NUMBER);
- smgrcreate(srel, INIT_FORKNUM, false);
- log_smgrcreate(&rel->rd_locator, INIT_FORKNUM);
+ RelationCreateFork(srel, INIT_FORKNUM, true, true);
fill_seq_fork_with_data(rel, tuple, INIT_FORKNUM);
FlushRelationBuffers(rel);
smgrclose(srel);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index b3cc6f8f69..e9bba3aceb 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -15381,16 +15381,17 @@ index_copy_data(Relation rel, RelFileLocator newrlocator)
{
if (smgrexists(RelationGetSmgr(rel), forkNum))
{
- smgrcreate(dstrel, forkNum, false);
-
- /*
- * WAL log creation if the relation is persistent, or this is the
- * init fork of an unlogged relation.
- */
- if (RelationIsPermanent(rel) ||
+ bool wal_log = RelationIsPermanent(rel) ||
(rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
- forkNum == INIT_FORKNUM))
- log_smgrcreate(&newrlocator, forkNum);
+ forkNum == INIT_FORKNUM);
+
+ /*
+ * Usually, we don't use UNDO log for FSM or VM forks, as their
+ * creation is not transactional. However, we're currently copying
+ * the entire relation in a transactional manner, which requires
+ * after-crash cleanup.
+ */
+ RelationCreateFork(dstrel, forkNum, wal_log, true);
RelationCopyStorage(RelationGetSmgr(rel), dstrel, forkNum,
rel->rd_rel->relpersistence);
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 5cdd2f10fc..2bc60f7295 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -4821,8 +4821,7 @@ CreateAndCopyRelationData(RelFileLocator src_rlocator,
/*
* Create and copy all forks of the relation. During create database we
* have a separate cleanup mechanism which deletes complete database
- * directory. Therefore, each individual relation doesn't need to be
- * registered for cleanup.
+ * directory. Therefore, do not issue an UNDO log for this relation.
*/
RelationCreateStorage(dst_rlocator, relpersistence, false);
@@ -4836,6 +4835,7 @@ CreateAndCopyRelationData(RelFileLocator src_rlocator,
{
if (smgrexists(src_rel, forkNum))
{
+ /* Use smgrcreate() directly as no UNDO log is required. */
smgrcreate(dst_rel, forkNum, false);
/*
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index f1cd1a38d9..c00f1aaa8b 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -34,6 +34,39 @@ typedef struct
RelFileNumber relnumber; /* hash key */
} unlogged_relation_entry;
+static char **ignore_files = NULL;
+static int nignore_elems = 0;
+static int nignore_files = 0;
+
+/*
+ * Check whether a file should be ignored while resetting unlogged relations.
+ */
+static bool
+reinit_ignore_file(const char *dirname, const char *name)
+{
+ char fnamebuf[MAXPGPATH];
+ int len;
+
+ if (nignore_files == 0)
+ return false;
+
+ /*
+ * Build "dirname/name"; snprintf() always NUL-terminates the result.
+ */
+ snprintf(fnamebuf, sizeof(fnamebuf), "%s/%s", dirname, name);
+
+ for (int i = 0 ; i < nignore_files ; i++)
+ {
+ /* match ignoring fork part */
+ len = strlen(ignore_files[i]);
+ if (strncmp(fnamebuf, ignore_files[i], len) == 0 &&
+ (fnamebuf[len] == 0 || fnamebuf[len] == '_'))
+ return true;
+ }
+
+ return false;
+}
+
/*
* Reset unlogged relations from before the last restart.
*
@@ -204,6 +237,10 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
&forkNum, &segno))
continue;
+ /* Skip anything that undo log suggested to ignore */
+ if (reinit_ignore_file(dbspacedirname, de->d_name))
+ continue;
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -243,6 +280,10 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
&forkNum, &segno))
continue;
+ /* Skip anything that undo log suggested to ignore */
+ if (reinit_ignore_file(dbspacedirname, de->d_name))
+ continue;
+
/* We never remove the init fork. */
if (forkNum == INIT_FORKNUM)
continue;
@@ -294,6 +335,10 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
&forkNum, &segno))
continue;
+ /* Skip anything that undo log suggested to ignore */
+ if (reinit_ignore_file(dbspacedirname, de->d_name))
+ continue;
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -337,6 +382,10 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
&forkNum, &segno))
continue;
+ /* Skip anything that undo log suggested to ignore */
+ if (reinit_ignore_file(dbspacedirname, de->d_name))
+ continue;
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -366,6 +415,35 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
}
}
+/*
+ * Record a relation that should be left alone while reinitializing unlogged
+ * relations.
+ */
+void
+ResetUnloggedRelationIgnore(RelFileLocator rloc, ProcNumber backend)
+{
+ RelFileLocatorBackend rbloc;
+
+ if (nignore_files >= nignore_elems)
+ {
+ if (ignore_files == NULL)
+ {
+ nignore_elems = 16;
+ ignore_files = palloc(sizeof(char *) * nignore_elems);
+ }
+ else
+ {
+ nignore_elems *= 2;
+ ignore_files = repalloc(ignore_files,
+ sizeof(char *) * nignore_elems);
+ }
+ }
+
+ rbloc.backend = backend;
+ rbloc.locator = rloc;
+ ignore_files[nignore_files++] = relpath(rbloc, MAIN_FORKNUM);
+}
+
/*
* Basic parsing of putative relation filenames.
*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 7b9fa103ef..eb01040772 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -791,6 +791,15 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+/*
+ * smgrunlink() -- unlink the storage file
+ */
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rlocator, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 9afb9bafcc..b856eac024 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -27,7 +27,7 @@
/* symbol name, textual name, redo, desc, identify, startup, cleanup, mask, decode, undo */
PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL, NULL, xlog_decode, NULL)
PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL, NULL, xact_decode, NULL)
-PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL, NULL, NULL, smgr_undo)
PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL, NULL, NULL, NULL)
PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL, NULL, NULL, NULL)
PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL, NULL, NULL, NULL)
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 72ef3ee92c..3451d6ac80 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -25,6 +25,8 @@ extern PGDLLIMPORT int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileLocator rlocator,
char relpersistence,
bool register_delete);
+extern void RelationCreateFork(SMgrRelation srel, ForkNumber forkNum,
+ bool wal_log, bool undo_log);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileLocator rlocator, bool atCommit);
extern void RelationPreTruncate(Relation rel);
diff --git a/src/include/catalog/storage_ulog.h b/src/include/catalog/storage_ulog.h
new file mode 100644
index 0000000000..cc3d623afd
--- /dev/null
+++ b/src/include/catalog/storage_ulog.h
@@ -0,0 +1,24 @@
+/*-------------------------------------------------------------------------
+ *
+ * storage_ulog.h
+ * prototypes for Undo Log support for backend/catalog/storage.c
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/catalog/storage_ulog.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef STORAGE_ULOG_H
+#define STORAGE_ULOG_H
+
+#include "access/simpleundolog.h"
+#include "storage/smgr.h"
+
+extern void smgr_undo(SimpleUndoLogRecord *record, UndoLogFileState state,
+ bool isCommit, bool cleanup);
+#define ULogRecGetData(record) ((char *) (record) + sizeof(SimpleUndoLogRecord))
+
+#endif /* STORAGE_ULOG_H */
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index 1373d509df..108cee160e 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -16,9 +16,12 @@
#define REINIT_H
#include "common/relpath.h"
+#include "storage/relfilelocator.h"
extern void ResetUnloggedRelations(int op);
+extern void ResetUnloggedRelationIgnore(RelFileLocator rloc,
+ ProcNumber backend);
extern bool parse_filename_for_nontemp_relation(const char *name,
RelFileNumber *relnumber,
ForkNumber *fork,
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index e15b20a566..e867ff92ab 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -107,6 +107,7 @@ extern void smgrtruncate(SMgrRelation reln, ForkNumber *forknum,
int nforks, BlockNumber *nblocks);
extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void smgrregistersync(SMgrRelation reln, ForkNumber forknum);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void AtEOXact_SMgr(void);
extern bool ProcessBarrierSmgrRelease(void);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 04f3eca550..c182a65d4d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4007,6 +4007,7 @@ uint8
uint8_t
uint8x16_t
uintptr_t
+ul_smgr_create
unicodeStyleBorderFormat
unicodeStyleColumnFormat
unicodeStyleFormat
--
2.43.5
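
A note on the reinit.c part of the patch above: the "match ignoring fork part" test in reinit_ignore_file() can be sketched as a standalone function. This is only an illustration, not code from the patch. The names sketch_ignore_match and SKETCH_MAXPATH are hypothetical, and the sketch assumes the recorded entry is the main-fork path (as produced by relpath(rbloc, MAIN_FORKNUM)).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_MAXPATH 1024		/* hypothetical stand-in for MAXPGPATH */

/*
 * Hypothetical standalone version of the prefix test.  Returns true when
 * "dirname/name" is the ignored main-fork path itself, or that path followed
 * by a fork suffix such as "_init" or "_fsm".
 */
static bool
sketch_ignore_match(const char *ignored, const char *dirname, const char *name)
{
	char		path[SKETCH_MAXPATH];
	size_t		len;
	int			n;

	assert(ignored != NULL && dirname != NULL && name != NULL);

	/* snprintf() never overruns the buffer and always NUL-terminates. */
	n = snprintf(path, sizeof(path), "%s/%s", dirname, name);
	if (n < 0 || (size_t) n >= sizeof(path))
		return false;			/* truncated path, treat as no match */

	/* Match the whole ignored path, then require end-of-string or '_'. */
	len = strlen(ignored);
	return strncmp(path, ignored, len) == 0 &&
		(path[len] == '\0' || path[len] == '_');
}
```

With an ignored entry of "base/5/16384", the names "16384" and "16384_init" in "base/5" match, while "163845" does not. A segment file such as "16384.1" does not match this test either; whether that is intended is not clear from the patch.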
v34-0006-new-indexam-bit-for-unlogged-storage-compatibili.patch (text/x-patch; charset=us-ascii)
From f790bf1d02cd1ea76357f7e3a5b938820d5a8f3f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Wed, 24 Jul 2024 19:31:39 +0900
Subject: [PATCH v34 06/16] new indexam bit for unlogged storage compatibility
To enable the core to identify whether storage files created by an
index access method for WAL-logged and unlogged relations are
binary-compatible, add a boolean property to the index AM interface.
---
src/backend/access/brin/brin.c | 1 +
src/backend/access/gin/ginutil.c | 1 +
src/backend/access/gist/gist.c | 8 ++++++++
src/backend/access/hash/hash.c | 1 +
src/backend/access/nbtree/nbtree.c | 1 +
src/backend/access/spgist/spgutils.c | 1 +
src/include/access/amapi.h | 2 ++
src/test/modules/dummy_index_am/dummy_index_am.c | 1 +
8 files changed, 16 insertions(+)
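
To illustrate how the new flag might be consumed: this patch only adds amunloggedstoragecompatible and sets it per AM, so the check below is an assumption about the caller side, not code from the series. SketchIndexAm and sketch_all_indexes_inplace are hypothetical names; the struct is a cut-down stand-in for IndexAmRoutine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for IndexAmRoutine. */
typedef struct SketchIndexAm
{
	const char *name;
	bool		amunloggedstoragecompatible;
} SketchIndexAm;

/*
 * Returns true if every index AM in the array declares its storage format
 * compatible between the LOGGED and UNLOGGED states, i.e. none of the
 * indexes would force a rebuild under this (assumed) policy.
 */
static bool
sketch_all_indexes_inplace(const SketchIndexAm *ams, size_t n)
{
	assert(ams != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
	{
		if (!ams[i].amunloggedstoragecompatible)
			return false;
	}
	return true;
}
```

Under this sketch, a table with only btree indexes qualifies, while adding a GiST index (flag set to false because of fake LSNs) makes the check fail.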
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 6467bed604..f2eb10edee 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -269,6 +269,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->amsummarizing = true;
amroutine->amparallelvacuumoptions =
VACUUM_OPTION_PARALLEL_CLEANUP;
+ amroutine->amunloggedstoragecompatible = true;
amroutine->amkeytype = InvalidOid;
amroutine->ambuild = brinbuild;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 5747ae6a4c..9d948b441c 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -59,6 +59,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->amsummarizing = false;
amroutine->amparallelvacuumoptions =
VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+ amroutine->amunloggedstoragecompatible = true;
amroutine->amkeytype = InvalidOid;
amroutine->ambuild = ginbuild;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index ed4ffa63a7..77230b3f1c 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -81,6 +81,14 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->amsummarizing = false;
amroutine->amparallelvacuumoptions =
VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_COND_CLEANUP;
+
+ /*
+ * GiST uses page LSNs to figure out whether a block has been
+ * modified. UNLOGGED GiST indexes use fake LSNs, which are incompatible
+ * with the real LSNs used for LOGGED indexes.
+ */
+ amroutine->amunloggedstoragecompatible = false;
+
amroutine->amkeytype = InvalidOid;
amroutine->ambuild = gistbuild;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 01d06b7c32..f141f2d45e 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -79,6 +79,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->amsummarizing = false;
amroutine->amparallelvacuumoptions =
VACUUM_OPTION_PARALLEL_BULKDEL;
+ amroutine->amunloggedstoragecompatible = true;
amroutine->amkeytype = INT4OID;
amroutine->ambuild = hashbuild;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 686a3206f7..a9ed5fc134 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -123,6 +123,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->amsummarizing = false;
amroutine->amparallelvacuumoptions =
VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_COND_CLEANUP;
+ amroutine->amunloggedstoragecompatible = true;
amroutine->amkeytype = InvalidOid;
amroutine->ambuild = btbuild;
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 76b80146ff..5d9c759a64 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -66,6 +66,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->amsummarizing = false;
amroutine->amparallelvacuumoptions =
VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_COND_CLEANUP;
+ amroutine->amunloggedstoragecompatible = true;
amroutine->amkeytype = InvalidOid;
amroutine->ambuild = spgbuild;
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index f25c9d58a7..c6cdf805d5 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -254,6 +254,8 @@ typedef struct IndexAmRoutine
bool amsummarizing;
/* OR of parallel vacuum flags. See vacuum.h for flags. */
uint8 amparallelvacuumoptions;
+ /* is AM storage data compatible between LOGGED and UNLOGGED states? */
+ bool amunloggedstoragecompatible;
/* type of data stored in index, or InvalidOid if variable */
Oid amkeytype;
diff --git a/src/test/modules/dummy_index_am/dummy_index_am.c b/src/test/modules/dummy_index_am/dummy_index_am.c
index 0b47711606..3c5cd43401 100644
--- a/src/test/modules/dummy_index_am/dummy_index_am.c
+++ b/src/test/modules/dummy_index_am/dummy_index_am.c
@@ -299,6 +299,7 @@ dihandler(PG_FUNCTION_ARGS)
amroutine->amusemaintenanceworkmem = false;
amroutine->amsummarizing = false;
amroutine->amparallelvacuumoptions = VACUUM_OPTION_NO_PARALLEL;
+ amroutine->amunloggedstoragecompatible = false;
amroutine->amkeytype = InvalidOid;
amroutine->ambuild = dibuild;
--
2.43.5
v34-0007-Transactional-buffer-persistence-switching.patch (text/x-patch; charset=us-ascii)
From 52209e4e5969129ce3a1335559cafbb74dae2672 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 16 Aug 2024 17:59:38 +0900
Subject: [PATCH v34 07/16] Transactional buffer persistence switching
This commit introduces functionality for transactional buffer
persistence switching with no user-side code. The switching is
reverted if the transaction aborts, and both the switching and
reverting are WAL-logged. Repeated back-and-forth switching within and
across subtransactions is prohibited for simplicity.
---
src/backend/access/rmgrdesc/smgrdesc.c | 13 +
src/backend/access/transam/twophase.c | 2 +
src/backend/access/transam/xact.c | 16 +-
src/backend/access/transam/xlog.c | 1 +
src/backend/catalog/storage.c | 32 +++
src/backend/storage/buffer/bufmgr.c | 328 +++++++++++++++++++++++++
src/bin/pg_rewind/parsexlog.c | 6 +
src/include/catalog/storage_xlog.h | 11 +
src/include/storage/bufmgr.h | 10 +
src/tools/pgindent/typedefs.list | 2 +
10 files changed, 420 insertions(+), 1 deletion(-)
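
The nest-level bookkeeping behind the revert-on-abort behavior described in the commit message can be sketched in isolation. This is a simplified model, not the patch's code: the Sketch* names are hypothetical, an int stands in for RelFileLocator, and a counter stands in for the actual buffer flag flip. It assumes that entries registered at the ending nest level or deeper are processed and shallower ones are left for the enclosing transaction, which is what the assertion at the end of bufmgrDoCleanup() implies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct SketchCleanup
{
	int			relid;			/* hypothetical stand-in for RelFileLocator */
	bool		old_persistence;	/* value to restore on abort */
	int			nestLevel;		/* xact nesting level of the request */
	struct SketchCleanup *next;
} SketchCleanup;

static SketchCleanup *sketch_cleanups = NULL;
static int	sketch_reverted = 0;	/* counts reverts, for illustration only */

/* Push a new entry at the head, so deeper levels come first in the list. */
static void
sketch_register(int relid, bool old_persistence, int nestLevel)
{
	SketchCleanup *cu = malloc(sizeof(SketchCleanup));

	assert(cu != NULL);
	cu->relid = relid;
	cu->old_persistence = old_persistence;
	cu->nestLevel = nestLevel;
	cu->next = sketch_cleanups;
	sketch_cleanups = cu;
}

/* End of a (sub)transaction at nestLevel: drop entries, reverting on abort. */
static void
sketch_end_xact(bool isCommit, int nestLevel)
{
	while (sketch_cleanups && sketch_cleanups->nestLevel >= nestLevel)
	{
		SketchCleanup *cu = sketch_cleanups;

		sketch_cleanups = cu->next;
		if (!isCommit)
			sketch_reverted++;	/* real code would flip buffer flags back */
		free(cu);
	}
}

/* Subtransaction commit: hand the entries over to the parent level. */
static void
sketch_subcommit(int nestLevel)
{
	for (SketchCleanup *cu = sketch_cleanups;
		 cu && cu->nestLevel >= nestLevel;
		 cu = cu->next)
		cu->nestLevel = nestLevel - 1;
}
```

In this model, aborting a subtransaction reverts only the entries it registered, while a committed subtransaction's entries are still reverted if the parent later aborts.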
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 71410e0a2d..d7b763f529 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,16 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+ char *path = relpathperm(xlrec->rlocator, MAIN_FORKNUM);
+
+ appendStringInfoString(buf, path);
+ appendStringInfo(buf, " persistence \"%c\"",
+ xlrec->persistence ? 'p' : 'u');
+ pfree(path);
+ }
}
const char *
@@ -55,6 +65,9 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_BUFPERSISTENCE:
+ id = "BUFPERSISTENCE";
+ break;
}
return id;
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index bd31b77906..24285c7d20 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1588,6 +1588,8 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
abortstats,
gid);
+ /* Clean up buffer persistence changes and unnecessary files. */
+ PreCommit_Buffers(isCommit);
SimpleUndoLog_UndoByXid(isCommit, xid, hdr->nsubxacts, children);
ProcArrayRemove(proc, latestXid);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 19223d35de..78ac4c7d5e 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2270,6 +2270,9 @@ CommitTransaction(void)
CallXactCallbacks(is_parallel_worker ? XACT_EVENT_PARALLEL_PRE_COMMIT
: XACT_EVENT_PRE_COMMIT);
+ /* Clean up buffer persistence changes */
+ PreCommit_Buffers(true);
+
/*
* If this xact has started any unfinished parallel operation, clean up
* its workers, warning about leaked resources. (But we don't actually
@@ -2860,6 +2863,9 @@ AbortTransaction(void)
*/
sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+ /* Clean up buffer persistence changes */
+ PreCommit_Buffers(false);
+
/*
* check the current transaction state
*/
@@ -5126,6 +5132,9 @@ CommitSubTransaction(void)
CallSubXactCallbacks(SUBXACT_EVENT_PRE_COMMIT_SUB, s->subTransactionId,
s->parent->subTransactionId);
+ /* Clean up buffer persistence changes. */
+ PreSubCommit_Buffers(true);
+
/*
* If this subxact has started any unfinished parallel operation, clean up
* its workers and exit parallel mode. Warn about leaked resources.
@@ -5273,6 +5282,9 @@ AbortSubTransaction(void)
*/
reschedule_timeouts();
+ /* Clean up buffer persistence changes */
+ PreSubCommit_Buffers(false);
+
/*
* Re-enable signals, in case we got here by longjmp'ing out of a signal
* handler. We do this fairly early in the sequence so that the timeout
@@ -6246,7 +6258,8 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
}
SimpleUndoLog_UndoByXid(true, xid, parsed->nsubxacts, parsed->subxacts);
-
+ AtEOXact_Buffers_Redo(true, xid, parsed->nsubxacts, parsed->subxacts);
+
if (parsed->nstats > 0)
{
/* see equivalent call for relations above */
@@ -6359,6 +6372,7 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
}
SimpleUndoLog_UndoByXid(false, xid, parsed->nsubxacts, parsed->subxacts);
+ AtEOXact_Buffers_Redo(false, xid, parsed->nsubxacts, parsed->subxacts);
if (parsed->nstats > 0)
{
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2809ce5014..5d234db8f1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5897,6 +5897,7 @@ StartupXLOG(void)
*/
if (InRecovery)
{
+ BufmgrDoCleanupRedo();
UndoLogCleanup();
ResetUnloggedRelations(UNLOGGED_RELATION_INIT);
}
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 31400c514f..a00c59a274 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -259,6 +259,29 @@ ulog_smgrcreate_redo(SMgrRelation srel, ForkNumber forkNum,
&ulrec, sizeof(ulrec), xid);
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
+ *
+ * XXX: This function essentially belongs in bufmgr.c, but is placed here to
+ * avoid adding a new rmgr type solely for this record type.
+ */
+void
+log_smgrbufpersistence(const RelFileLocator rlocator, bool persistence)
+{
+ xl_smgr_bufpersistence xlrec;
+
+ /*
+ * Make an XLOG entry reporting the change of buffer persistence.
+ */
+ xlrec.rlocator = rlocator;
+ xlrec.persistence = persistence;
+ xlrec.topxid = GetTopTransactionId();
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -1118,6 +1141,15 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rlocator, INVALID_PROC_NUMBER);
+ SetRelationBuffersPersistence(reln, xlrec->persistence);
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 2bc60f7295..14360f69b7 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -58,6 +58,7 @@
#include "storage/smgr.h"
#include "storage/standby.h"
#include "utils/memdebug.h"
+#include "utils/memutils.h"
#include "utils/ps_status.h"
#include "utils/rel.h"
#include "utils/resowner.h"
@@ -136,6 +137,23 @@ typedef struct SMgrSortArray
SMgrRelation srel;
} SMgrSortArray;
+
+/*
+ * We keep a list of all relations whose buffer persistence has been switched
+ * in the current transaction. This allows us to properly revert the
+ * persistence if the transaction is aborted.
+ */
+typedef struct BufMgrCleanup
+{
+ RelFileLocator rlocator; /* relation whose buffers were switched */
+ bool bufpersistence; /* buffer persistence to restore on abort */
+ int nestLevel; /* xact nesting level of request */
+ TransactionId xid; /* used during recovery */
+ struct BufMgrCleanup *next; /* linked-list link */
+} BufMgrCleanup;
+
+static BufMgrCleanup *cleanups = NULL; /* head of linked list */
+
/*
* Helper struct for read stream object used in
* RelationCopyStorageUsingBuffer() function.
@@ -250,6 +268,8 @@ static char *ResOwnerPrintBufferIO(Datum res);
static void ResOwnerReleaseBufferPin(Datum res);
static char *ResOwnerPrintBufferPin(Datum res);
+static void set_relation_buffers_persistence(SMgrRelation srel, bool permanent);
+
const ResourceOwnerDesc buffer_io_resowner_desc =
{
.name = "buffer io",
@@ -3557,6 +3577,153 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
return result | BUF_WRITTEN;
}
+/*
+ * bufmgrDoCleanup() -- Take care of buffer persistence changes at end of xact
+ *
+ * This function is called at the end of both transactions and subtransactions,
+ * aiming to immediately clean up failed transactions.
+ */
+static void
+bufmgrDoCleanup(bool isCommit)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+ BufMgrCleanup *cu;
+ BufMgrCleanup *next;
+
+ for (cu = cleanups ; cu && cu->nestLevel >= nestLevel ; cu = next)
+ {
+ next = cu->next;
+ cleanups = next;
+
+ if (!isCommit)
+ {
+ SMgrRelation srel = smgropen(cu->rlocator, INVALID_PROC_NUMBER);
+ set_relation_buffers_persistence(srel, cu->bufpersistence);
+ }
+ pfree(cu);
+ }
+
+#ifdef USE_ASSERT_CHECKING
+ /* All remaining entries pertain to upper levels. */
+ for (cu = cleanups ; cu ; cu = cu->next)
+ Assert(cu->nestLevel < nestLevel);
+#endif
+}
+
+/*
+ * AtEOXact_Buffers_Redo() -- End-of-transaction cleanup of buffer persistence
+ * changes during recovery.
+ *
+ * Unlike normal operation, the cleanup entries are keyed by xid rather than by
+ * nestLevel. See SetRelationBuffersPersistenceRedo() for details on the
+ * registration of those entries.
+ */
+void
+AtEOXact_Buffers_Redo(bool isCommit, TransactionId xid,
+ int nchildren, TransactionId *children)
+{
+ BufMgrCleanup *cu;
+ BufMgrCleanup *prev;
+ BufMgrCleanup *next;
+
+ prev = NULL;
+ for (cu = cleanups ; cu ; cu = next)
+ {
+ next = cu->next;
+
+ if (cu->xid != xid)
+ {
+ int i;
+
+ for (i = 0 ; i < nchildren && cu->xid != children[i] ; i++);
+
+ if (i == nchildren)
+ {
+ /* did not match, go to next */
+ prev = cu;
+ continue;
+ }
+ }
+
+ if (!isCommit)
+ {
+ /*
+ * Revert the change by calling the core function directly, which
+ * neither writes WAL nor registers a new BufMgrCleanup entry.
+ */
+ SMgrRelation srel = smgropen(cu->rlocator, INVALID_PROC_NUMBER);
+ set_relation_buffers_persistence(srel, cu->bufpersistence);
+ }
+ if (prev)
+ prev->next = next;
+ else
+ cleanups = next;
+ pfree(cu);
+ }
+}
+
+/*
+ * BufmgrDoCleanupRedo() -- End-of-recovery cleanup of buffer persistence
+ * changes.
+ *
+ * Revert buffer persistence changes made in transactions that are not
+ * committed at the end of recovery.
+ */
+void
+BufmgrDoCleanupRedo(void)
+{
+ BufMgrCleanup *cu;
+ BufMgrCleanup *next;
+
+ for (cu = cleanups ; cu ; cu = next)
+ {
+ SMgrRelation srel = smgropen(cu->rlocator, INVALID_PROC_NUMBER);
+ set_relation_buffers_persistence(srel, cu->bufpersistence);
+
+ next = cu->next;
+ pfree(cu);
+ }
+
+ cleanups = NULL;
+}
+
+/*
+ * PreSubCommit_Buffers() -- Take care of buffer persistence changes at subxact
+ * end
+ */
+void
+PreSubCommit_Buffers(bool isCommit)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+
+ if (!isCommit)
+ {
+ bufmgrDoCleanup(isCommit);
+ return;
+ }
+
+ /*
+ * Reassign all cleanup items at the current nestlevel to the parent
+ * transaction.
+ */
+
+ for (BufMgrCleanup *cu = cleanups ;
+ cu && cu->nestLevel >= nestLevel ;
+ cu = cu->next)
+ {
+ /* no lower-level entry is expected */
+ Assert(cu->nestLevel == nestLevel);
+
+ cu->nestLevel = nestLevel - 1;
+ }
+}
+
+void
+PreCommit_Buffers(bool isCommit)
+{
+ bufmgrDoCleanup(isCommit);
+}
+
/*
* AtEOXact_Buffers - clean up at end of transaction.
*
@@ -4151,6 +4318,167 @@ DropRelationBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
}
}
+/*
+ * set_relation_buffers_persistence()
+ *
+ * The core function to change the persistence of all buffer pages of a
+ * relation. When switching to PERMANENT, it also writes all dirty pages to
+ * disk (or kernel disk buffers), ensuring the kernel has an up-to-date view
+ * of the relation.
+ *
+ * The caller must be holding AccessExclusiveLock on the target relation to
+ * ensure no other backend is busy loading more blocks.
+ *
+ * XXX currently it sequentially searches the buffer pool; consider
+ * implementing more efficient search methods. This routine isn't used in
+ * performance-critical code paths, so it's not worth additional overhead to
+ * make it go faster; see also DropRelationBuffers.
+ */
+static void
+set_relation_buffers_persistence(SMgrRelation srel, bool permanent)
+{
+ int i;
+ RelFileLocator rlocator = srel->smgr_rlocator.locator;
+
+ Assert(!RelFileLocatorBackendIsTemp(srel->smgr_rlocator));
+
+ ResourceOwnerEnlarge(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ /* try unlocked check to avoid locking irrelevant buffers */
+ if (!RelFileLocatorEquals(BufTagGetRelFileLocator(&bufHdr->tag),
+ rlocator))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (!RelFileLocatorEquals(BufTagGetRelFileLocator(&bufHdr->tag),
+ rlocator))
+ {
+ UnlockBufHdr(bufHdr, buf_state);
+ continue;
+ }
+
+ if (permanent)
+ {
+ /* The init fork is being dropped, drop buffers for it. */
+ if (BufTagGetForkNum(&bufHdr->tag) == INIT_FORKNUM)
+ {
+ InvalidateBuffer(bufHdr);
+ continue;
+ }
+
+ /* Switch the buffer state to BM_PERMANENT before flushing it. */
+ Assert((buf_state & BM_PERMANENT) == 0);
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /*
+ * We haven't written WALs for this buffer. Flush this buffer to
+ * establish the epoch for subsequent WAL records.
+ */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, srel, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ /* There shouldn't be an init fork for this relation */
+ Assert(BufTagGetForkNum(&bufHdr->tag) != INIT_FORKNUM);
+ Assert(buf_state & BM_PERMANENT);
+
+ /* Just switch the buffer state to non-permanent. */
+ buf_state &= ~BM_PERMANENT;
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ }
+}
+
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a
+ * relation. See set_relation_buffers_persistence() for functionality
+ * details.
+ *
+ * This function's behavior is transactional, meaning that the changes it
+ * makes will be reverted if this or any higher-level transaction is
+ * aborted.
+ *
+ * The caller must be holding AccessExclusiveLock on the target relation
+ * to ensure no other backend is busy loading more blocks.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(SMgrRelation srel, bool permanent)
+{
+ BufMgrCleanup *cu;
+ RelFileLocator rlocator = srel->smgr_rlocator.locator;
+
+ /*
+ * Double-flipping relation persistence within the same transaction
+ * significantly increases complexity relative to its benefits. Therefore,
+ * error out if persistence has already flipped for this relation in the
+ * current transaction.
+ */
+ for (cu = cleanups ; cu ; cu = cu->next)
+ {
+ if (RelFileLocatorEquals(rlocator, cu->rlocator))
+ ereport(ERROR,
+ errmsg("persistence of this relation has been already changed in the current transaction"));
+ }
+
+ set_relation_buffers_persistence(srel, permanent);
+
+ /* Schedule reverting this change at abort */
+ cu = (BufMgrCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(BufMgrCleanup));
+ cu->rlocator = rlocator;
+ cu->bufpersistence = !permanent;
+ cu->nestLevel = GetCurrentTransactionNestLevel();
+ cu->next = cleanups;
+ cleanups = cu;
+}
+
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistenceRedo
+ *
+ * This function changes the persistence of all buffer pages of a relation
+ * during recovery. The cleanup entry is keyed by xid, not by nestLevel.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistenceRedo(SMgrRelation srel, bool permanent,
+ TransactionId xid)
+{
+ BufMgrCleanup *cu;
+
+ set_relation_buffers_persistence(srel, permanent);
+
+ /* Schedule reverting this change at abort */
+ cu = (BufMgrCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(BufMgrCleanup));
+ cu->rlocator = srel->smgr_rlocator.locator;
+ cu->bufpersistence = !permanent;
+ cu->xid = xid;
+ cu->next = cleanups;
+ cleanups = cu;
+}
+
/* ---------------------------------------------------------------------
* DropRelationsAllBuffers
*
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 525b98899f..c8c9cc361f 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -418,6 +418,12 @@ extractPageInfo(XLogReaderState *record)
* source system.
*/
}
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ /*
+ * We can safely ignore these. These don't make any on-disk changes.
+ */
+ }
else if (rmid == RM_XACT_ID &&
((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT ||
(rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED ||
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index a490e05f88..085b1bc1df 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -29,6 +29,7 @@
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_BUFPERSISTENCE 0x30
typedef struct xl_smgr_create
{
@@ -36,6 +37,14 @@ typedef struct xl_smgr_create
ForkNumber forkNum;
} xl_smgr_create;
+typedef struct xl_smgr_bufpersistence
+{
+ RelFileLocator rlocator;
+ bool persistence;
+ TransactionId topxid;
+ /* subxid is in the record header */
+} xl_smgr_bufpersistence;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +60,8 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum);
+extern void log_smgrbufpersistence(const RelFileLocator rlocator,
+ bool persistence);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index eb0fba4230..4267098080 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -19,6 +19,7 @@
#include "storage/buf.h"
#include "storage/bufpage.h"
#include "storage/relfilelocator.h"
+#include "storage/smgr.h"
#include "utils/relcache.h"
#include "utils/snapmgr.h"
@@ -250,7 +251,14 @@ extern Buffer ExtendBufferedRelTo(BufferManagerRelation bmr,
ReadBufferMode mode);
extern void InitBufferManagerAccess(void);
+extern void PreSubCommit_Buffers(bool isCommit);
+extern void PreCommit_Buffers(bool isCommit);
extern void AtEOXact_Buffers(bool isCommit);
+extern void SetRelationBuffersPersistenceRedo(SMgrRelation srel, bool permanent,
+ TransactionId xid);
+extern void AtEOXact_Buffers_Redo(bool isCommit, TransactionId xid,
+ int nchildren, TransactionId *children);
+extern void BufmgrDoCleanupRedo(void);
extern char *DebugPrintBufferRefcount(Buffer buffer);
extern void CheckPointBuffers(int flags);
extern BlockNumber BufferGetBlockNumber(Buffer buffer);
@@ -269,6 +277,8 @@ extern void DropRelationBuffers(struct SMgrRelationData *smgr_reln,
extern void DropRelationsAllBuffers(struct SMgrRelationData **smgr_reln,
int nlocators);
extern void DropDatabaseBuffers(Oid dbid);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+ bool permanent);
#define RelationGetNumberOfBlocks(reln) \
RelationGetNumberOfBlocksInFork(reln, MAIN_FORKNUM)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c182a65d4d..3975dd50ec 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -329,6 +329,7 @@ BtreeLastVisibleEntry
BtreeLevel
Bucket
BufFile
+BufMgrCleanup
Buffer
BufferAccessStrategy
BufferAccessStrategyType
@@ -4118,6 +4119,7 @@ xl_replorigin_set
xl_restore_point
xl_running_xacts
xl_seq_rec
+xl_smgr_bufpersistence
xl_smgr_create
xl_smgr_truncate
xl_standby_lock
--
2.43.5
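For anyone following the subtransaction handling in 0007, here is a toy model of the cleanup list, written outside the patch. It is only a sketch: ToyCleanup and the toy_* functions are hypothetical stand-ins for BufMgrCleanup and the bufmgr routines, relations are reduced to an int, and the abort path assumes that entries at or above the aborting nest level are the ones reverted. It shows why reassigning nestLevel at subcommit is enough to hand pending reverts to the parent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the patch's BufMgrCleanup entry. */
typedef struct ToyCleanup
{
	int			relid;		/* stands in for RelFileLocator */
	bool		revert_to;	/* persistence to restore on abort */
	int			nestLevel;	/* nest level that currently owns the entry */
	struct ToyCleanup *next;
} ToyCleanup;

static ToyCleanup *toy_cleanups = NULL;

/* Push an entry at the list head, so the head always has the deepest level. */
void
toy_register(int relid, bool revert_to, int nestLevel)
{
	ToyCleanup *cu = malloc(sizeof(ToyCleanup));

	if (cu == NULL)
		abort();
	cu->relid = relid;
	cu->revert_to = revert_to;
	cu->nestLevel = nestLevel;
	cu->next = toy_cleanups;
	toy_cleanups = cu;
}

/* Subtransaction commit: hand this level's entries over to the parent. */
void
toy_subcommit(int nestLevel)
{
	for (ToyCleanup *cu = toy_cleanups;
		 cu != NULL && cu->nestLevel >= nestLevel;
		 cu = cu->next)
	{
		/* deeper levels must already have been committed or aborted */
		assert(cu->nestLevel == nestLevel);
		cu->nestLevel = nestLevel - 1;
	}
}

/*
 * Subtransaction abort (assumed behavior): discard entries owned by this
 * level or deeper.  Real code would flip buffer persistence back to
 * revert_to here.  Returns the number of entries reverted.
 */
int
toy_subabort(int nestLevel)
{
	int			nreverted = 0;

	while (toy_cleanups != NULL && toy_cleanups->nestLevel >= nestLevel)
	{
		ToyCleanup *next = toy_cleanups->next;

		free(toy_cleanups);
		toy_cleanups = next;
		nreverted++;
	}
	return nreverted;
}

/* Count entries still waiting for the end of the transaction. */
int
toy_pending(void)
{
	int			n = 0;

	for (ToyCleanup *cu = toy_cleanups; cu != NULL; cu = cu->next)
		n++;
	return n;
}
```

In this model, an entry registered at level 2 and then sub-committed is no longer touched by a later abort at level 2, but is still reverted if level 1 aborts.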
v34-0008-Prepare-for-preventing-DML-operations-on-relatio.patch (text/x-patch; charset=us-ascii)
From 896f974d64150d6b12c388d82a4fa7e8b4378411 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 15 Aug 2024 11:26:01 +0900
Subject: [PATCH v34 08/16] Prepare for preventing DML operations on relations.
Performing data manipulation on relations with in-place persistence
changes can lead to unrecoverable issues, particularly with
indexes. To prevent potential data corruption, this update sets up
mechanisms to inhibit DML operations in these cases rather than
attempting to accommodate them. No user-side code included.
---
src/backend/access/transam/xact.c | 7 +++++++
src/backend/executor/execMain.c | 5 ++++-
src/backend/tcop/utility.c | 18 ++++++++++++++++
src/backend/utils/cache/relcache.c | 33 +++++++++++++++++++++++++++---
src/include/access/xact.h | 2 ++
src/include/miscadmin.h | 1 +
src/include/utils/rel.h | 7 +++++++
src/include/utils/relcache.h | 1 +
8 files changed, 70 insertions(+), 4 deletions(-)
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 78ac4c7d5e..17a8fd0e20 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -86,6 +86,12 @@ bool XactDeferrable;
int synchronous_commit = SYNCHRONOUS_COMMIT_ON;
+/*
+ * Indicate whether relation persistence flipping was performed in the current
+ * transaction.
+ */
+bool XactPersistenceChanged;
+
/*
* CheckXidAlive is a xid value pointing to a possibly ongoing (sub)
* transaction. Currently, it is used in logical decoding. It's possible
@@ -2119,6 +2125,7 @@ StartTransaction(void)
s->startedInRecovery = false;
XactReadOnly = DefaultXactReadOnly;
}
+ XactPersistenceChanged = false;
XactDeferrable = DefaultXactDeferrable;
XactIsoLevel = DefaultXactIsoLevel;
forceSyncCommit = false;
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 29e186fa73..cef6e1a4b5 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -160,7 +160,7 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
* against performing unsafe operations in parallel mode, but this gives a
* more user-friendly error message.
*/
- if ((XactReadOnly || IsInParallelMode()) &&
+ if ((XactReadOnly || XactPersistenceChanged || IsInParallelMode()) &&
!(eflags & EXEC_FLAG_EXPLAIN_ONLY))
ExecCheckXactReadOnly(queryDesc->plannedstmt);
@@ -810,6 +810,9 @@ ExecCheckXactReadOnly(PlannedStmt *plannedstmt)
continue;
PreventCommandIfReadOnly(CreateCommandName((Node *) plannedstmt));
+
+ PreventCommandIfPersistenceChanged(
+ CreateCommandName((Node *) plannedstmt), perminfo->relid);
}
if (plannedstmt->commandType != CMD_SELECT || plannedstmt->hasModifyingCTE)
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index b2ea8125c9..94953e367a 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -411,6 +411,24 @@ PreventCommandIfReadOnly(const char *cmdname)
cmdname)));
}
+/*
+ * PreventCommandIfPersistenceChanged: throw error if a persistence change was
+ * performed on the relation in the current transaction
+ */
+void
+PreventCommandIfPersistenceChanged(const char *cmdname, Oid relid)
+{
+ Relation rel;
+
+ rel = RelationIdGetRelation(relid);
+ if (rel->rd_firstPersistenceChangeSubid != InvalidSubTransactionId)
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot execute %s on relation \"%s\" because of its persistence change in the current transaction",
+ cmdname, get_rel_name(relid)));
+ RelationClose(rel);
+}
+
/*
* PreventCommandIfParallelMode: throw error if current (sub)transaction is
* in parallel mode.
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 66ed24e401..8040250a0c 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1136,6 +1136,7 @@ retry:
relation->rd_createSubid = InvalidSubTransactionId;
relation->rd_newRelfilelocatorSubid = InvalidSubTransactionId;
relation->rd_firstRelfilelocatorSubid = InvalidSubTransactionId;
+ relation->rd_firstPersistenceChangeSubid = InvalidSubTransactionId;
relation->rd_droppedSubid = InvalidSubTransactionId;
switch (relation->rd_rel->relpersistence)
{
@@ -1899,6 +1900,7 @@ formrdesc(const char *relationName, Oid relationReltype,
relation->rd_createSubid = InvalidSubTransactionId;
relation->rd_newRelfilelocatorSubid = InvalidSubTransactionId;
relation->rd_firstRelfilelocatorSubid = InvalidSubTransactionId;
+ relation->rd_firstPersistenceChangeSubid = InvalidSubTransactionId;
relation->rd_droppedSubid = InvalidSubTransactionId;
relation->rd_backend = INVALID_PROC_NUMBER;
relation->rd_islocaltemp = false;
@@ -2775,6 +2777,7 @@ RelationClearRelation(Relation relation, bool rebuild)
SWAPFIELD(SubTransactionId, rd_createSubid);
SWAPFIELD(SubTransactionId, rd_newRelfilelocatorSubid);
SWAPFIELD(SubTransactionId, rd_firstRelfilelocatorSubid);
+ SWAPFIELD(SubTransactionId, rd_firstPersistenceChangeSubid);
SWAPFIELD(SubTransactionId, rd_droppedSubid);
/* un-swap rd_rel pointers, swap contents instead */
SWAPFIELD(Form_pg_class, rd_rel);
@@ -2864,7 +2867,8 @@ static void
RelationFlushRelation(Relation relation)
{
if (relation->rd_createSubid != InvalidSubTransactionId ||
- relation->rd_firstRelfilelocatorSubid != InvalidSubTransactionId)
+ relation->rd_firstRelfilelocatorSubid != InvalidSubTransactionId ||
+ relation->rd_firstPersistenceChangeSubid != InvalidSubTransactionId)
{
/*
* New relcache entries are always rebuilt, not flushed; else we'd
@@ -2922,7 +2926,8 @@ RelationForgetRelation(Oid rid)
Assert(relation->rd_droppedSubid == InvalidSubTransactionId);
if (relation->rd_createSubid != InvalidSubTransactionId ||
- relation->rd_firstRelfilelocatorSubid != InvalidSubTransactionId)
+ relation->rd_firstRelfilelocatorSubid != InvalidSubTransactionId ||
+ relation->rd_firstPersistenceChangeSubid != InvalidSubTransactionId)
{
/*
* In the event of subtransaction rollback, we must not forget
@@ -3037,7 +3042,8 @@ RelationCacheInvalidate(bool debug_discard)
* applicable pending invalidations.
*/
if (relation->rd_createSubid != InvalidSubTransactionId ||
- relation->rd_firstRelfilelocatorSubid != InvalidSubTransactionId)
+ relation->rd_firstRelfilelocatorSubid != InvalidSubTransactionId ||
+ relation->rd_firstPersistenceChangeSubid != InvalidSubTransactionId)
continue;
relcacheInvalsReceived++;
@@ -3351,6 +3357,7 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
relation->rd_createSubid = InvalidSubTransactionId;
relation->rd_newRelfilelocatorSubid = InvalidSubTransactionId;
relation->rd_firstRelfilelocatorSubid = InvalidSubTransactionId;
+ relation->rd_firstPersistenceChangeSubid = InvalidSubTransactionId;
relation->rd_droppedSubid = InvalidSubTransactionId;
if (clear_relcache)
@@ -3466,6 +3473,7 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
relation->rd_createSubid = InvalidSubTransactionId;
relation->rd_newRelfilelocatorSubid = InvalidSubTransactionId;
relation->rd_firstRelfilelocatorSubid = InvalidSubTransactionId;
+ relation->rd_firstPersistenceChangeSubid = InvalidSubTransactionId;
relation->rd_droppedSubid = InvalidSubTransactionId;
RelationClearRelation(relation, false);
return;
@@ -3512,6 +3520,14 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
else
relation->rd_droppedSubid = InvalidSubTransactionId;
}
+
+ if (relation->rd_firstPersistenceChangeSubid == mySubid)
+ {
+ if (isCommit)
+ relation->rd_firstPersistenceChangeSubid = parentSubid;
+ else
+ relation->rd_firstPersistenceChangeSubid = InvalidSubTransactionId;
+ }
}
@@ -3602,6 +3618,7 @@ RelationBuildLocalRelation(const char *relname,
rel->rd_createSubid = GetCurrentSubTransactionId();
rel->rd_newRelfilelocatorSubid = InvalidSubTransactionId;
rel->rd_firstRelfilelocatorSubid = InvalidSubTransactionId;
+ rel->rd_firstPersistenceChangeSubid = InvalidSubTransactionId;
rel->rd_droppedSubid = InvalidSubTransactionId;
/*
@@ -3973,6 +3990,15 @@ RelationAssumeNewRelfilelocator(Relation relation)
EOXactListAdd(relation);
}
+void
+RelationAssumePersistenceChange(Relation relation)
+{
+ XactPersistenceChanged = true;
+ relation->rd_firstPersistenceChangeSubid = GetCurrentSubTransactionId();
+
+ /* Flag relation as needing eoxact cleanup (to clear this field) */
+ EOXactListAdd(relation);
+}
/*
* RelationCacheInitialize
@@ -6395,6 +6421,7 @@ load_relcache_init_file(bool shared)
rel->rd_createSubid = InvalidSubTransactionId;
rel->rd_newRelfilelocatorSubid = InvalidSubTransactionId;
rel->rd_firstRelfilelocatorSubid = InvalidSubTransactionId;
+ rel->rd_firstPersistenceChangeSubid = InvalidSubTransactionId;
rel->rd_droppedSubid = InvalidSubTransactionId;
rel->rd_amcache = NULL;
rel->pgstat_info = NULL;
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 6d4439f052..8f23611dcc 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -55,6 +55,8 @@ extern PGDLLIMPORT int XactIsoLevel;
extern PGDLLIMPORT bool DefaultXactReadOnly;
extern PGDLLIMPORT bool XactReadOnly;
+extern PGDLLIMPORT bool XactPersistenceChanged;
+
/* flag for logging statements in this transaction */
extern PGDLLIMPORT bool xact_is_sampled;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 25348e71eb..cddccd8654 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -301,6 +301,7 @@ extern bool stack_is_too_deep(void);
extern void PreventCommandIfReadOnly(const char *cmdname);
extern void PreventCommandIfParallelMode(const char *cmdname);
extern void PreventCommandDuringRecovery(const char *cmdname);
+extern void PreventCommandIfPersistenceChanged(const char *cmdname, Oid relid);
/*****************************************************************************
* pdir.h -- *
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 8700204953..a361e91050 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -108,6 +108,13 @@ typedef struct RelationData
* any value */
SubTransactionId rd_droppedSubid; /* dropped with another Subid set */
+ /*
+ * rd_firstPersistenceChangeSubid is the ID of the highest subtransaction
+ * the rel's persistence change has survived into.
+ */
+ SubTransactionId rd_firstPersistenceChangeSubid; /* highest subxact changing
+ * persistence */
+
Form_pg_class rd_rel; /* RELATION tuple */
TupleDesc rd_att; /* tuple descriptor */
Oid rd_id; /* relation's object id */
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index 18c32ea700..f2f26433cd 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -119,6 +119,7 @@ extern Relation RelationBuildLocalRelation(const char *relname,
*/
extern void RelationSetNewRelfilenumber(Relation relation, char persistence);
extern void RelationAssumeNewRelfilelocator(Relation relation);
+extern void RelationAssumePersistenceChange(Relation relation);
/*
* Routines for flushing/rebuilding relcache entries in various scenarios
--
2.43.5
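The relcache side of 0008 can be pictured with a small model as well. This is a sketch under assumptions, not the patch's code: ToyRel and the toy_* functions are hypothetical, and only the single field rd_firstPersistenceChangeSubid is modelled. It mirrors the AtEOSubXact_cleanup hunk (on commit the mark moves to the parent subtransaction, on abort it is cleared) and the check made by PreventCommandIfPersistenceChanged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int ToySubXactId;

/* PostgreSQL uses 0 for InvalidSubTransactionId; the toy keeps that value. */
#define TOY_INVALID_SUBID ((ToySubXactId) 0)

/* Hypothetical stand-in for the one RelationData field that matters here. */
typedef struct ToyRel
{
	ToySubXactId firstPersistenceChangeSubid;
} ToyRel;

/* Counterpart of RelationAssumePersistenceChange(): remember who flipped. */
void
toy_mark_changed(ToyRel *rel, ToySubXactId currentSubid)
{
	assert(rel != NULL);
	rel->firstPersistenceChangeSubid = currentSubid;
}

/* Counterpart of the new block in AtEOSubXact_cleanup(). */
void
toy_at_subxact_end(ToyRel *rel, bool isCommit,
				   ToySubXactId mySubid, ToySubXactId parentSubid)
{
	assert(rel != NULL);
	if (rel->firstPersistenceChangeSubid != mySubid)
		return;					/* not ours, nothing to do */

	if (isCommit)
		rel->firstPersistenceChangeSubid = parentSubid;
	else
		rel->firstPersistenceChangeSubid = TOY_INVALID_SUBID;
}

/* DML is refused while a persistence change is pending in this xact. */
bool
toy_dml_allowed(const ToyRel *rel)
{
	assert(rel != NULL);
	return rel->firstPersistenceChangeSubid == TOY_INVALID_SUBID;
}
```

The point of the model is that a committed subtransaction leaves the relation blocked for DML until the top-level transaction ends, while an aborted one unblocks it immediately.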
v34-0009-Prevent-PREPARE-for-transactions-with-in-place-r.patch (text/x-patch; charset=us-ascii)
From ea1497b5033f1c229dd6156251cf8adc1bb1f91a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 27 Aug 2024 16:09:29 +0900
Subject: [PATCH v34 09/16] Prevent PREPARE for transactions with in-place
relation persistence changes
Allowing a transaction to be prepared when an in-place relation
persistence change has occurred within the transaction can
significantly complicate crash recovery behavior. Since the benefits
of allowing this are minimal, prohibit such behavior.
---
src/backend/access/transam/xact.c | 6 ++++++
src/backend/storage/buffer/bufmgr.c | 13 +++++++++++++
src/include/storage/bufmgr.h | 1 +
3 files changed, 20 insertions(+)
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 17a8fd0e20..1e1f86e578 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2534,6 +2534,12 @@ PrepareTransaction(void)
TransStateAsString(s->state));
Assert(s->parent == NULL);
+
+ /* Check if any relation persistence flips have been performed. */
+ if (CheckIfPersistenceChanged())
+ ereport(ERROR,
+ errmsg("cannot prepare transaction if persistence change has been made to any relation"));
+
/*
* Do pre-commit processing that involves calling user-defined code, such
* as triggers. Since closing cursors could queue trigger actions,
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 14360f69b7..6de8e00bdc 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -4479,6 +4479,19 @@ SetRelationBuffersPersistenceRedo(SMgrRelation srel, bool permanent,
cleanups = cu;
}
+/* ---------------------------------------------------------------------
+ * CheckIfPersistenceChanged
+ *
+ * Returns true if any relation's persistence change has occurred in the
+ * current transaction.
+ * --------------------------------------------------------------------
+ */
+bool
+CheckIfPersistenceChanged(void)
+{
+ return cleanups != NULL;
+}
+
/* ---------------------------------------------------------------------
* DropRelationsAllBuffers
*
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 4267098080..df881ea86c 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -279,6 +279,7 @@ extern void DropRelationsAllBuffers(struct SMgrRelationData **smgr_reln,
extern void DropDatabaseBuffers(Oid dbid);
extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
bool permanent);
+extern bool CheckIfPersistenceChanged(void);
#define RelationGetNumberOfBlocks(reln) \
RelationGetNumberOfBlocksInFork(reln, MAIN_FORKNUM)
--
2.43.5
v34-0010-In-place-persistance-change-to-UNLOGGED.patch (text/x-patch; charset=us-ascii)
From 468c4f5538a7777c016f63a2b6fc6aec345f52a2 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 27 Aug 2024 11:19:53 +0900
Subject: [PATCH v34 10/16] In-place persistence change to UNLOGGED
This commit enables changing the persistence of relations to UNLOGGED
without creating a new storage file. ALTER TABLE SET LOGGED will continue
to create new storage as before.
---
src/backend/commands/tablecmds.c | 226 +++++++++++++++++++++++++------
1 file changed, 187 insertions(+), 39 deletions(-)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index e9bba3aceb..05364365a7 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -5634,6 +5634,143 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
return newcmd;
}
+/*
+ * RelationChangePersistence: perform in-place persistence change of a relation
+ */
+static void
+RelationChangePersistence(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * Use ATRewriteTable instead of this function if the following condition
+ * is not satisfied.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ /*
+ * Initially, gather all relations that require a persistence change.
+ */
+
+ /* Collect OIDs of indexes and toast relations */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ /* Add toast relation if any */
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ List *toastidx;
+ Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
+
+ relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
+ toastidx = RelationGetIndexList(toastrel);
+ relids = list_concat(relids, toastidx);
+ pfree(toastidx);
+ table_close(toastrel, NoLock);
+ }
+
+ table_close(rel, NoLock);
+
+ /* Make changes in storage */
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach(lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+ SMgrRelation srel;
+ bool persistent = (persistence == RELPERSISTENCE_PERMANENT);
+ bool is_index;
+
+ /*
+ * Reconstruct the storage when permanent and unlogged storage types
+ * are incompatible.
+ */
+ if (r->rd_rel->relkind == RELKIND_INDEX &&
+ !r->rd_indam->amunloggedstoragecompatible)
+ {
+ int reindex_flags;
+ ReindexParams params = {0};
+
+ /* reindex doesn't allow concurrent use of the index */
+ table_close(r, NoLock);
+
+ reindex_flags =
+ REINDEX_REL_SUPPRESS_INDEX_USE |
+ REINDEX_REL_CHECK_CONSTRAINTS;
+
+ /* Set the same persistence as the parent relation. */
+ if (persistent)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+ else
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+
+ /* this doesn't fire the REINDEX event trigger */
+ reindex_index(NULL, reloid, reindex_flags, persistence, &params);
+
+ continue;
+ }
+
+ /* Currently, only allowing changes to UNLOGGED. */
+ Assert(!persistent);
+
+ RelationAssumePersistenceChange(r);
+
+ /* switch buffer persistence */
+ srel = RelationGetSmgr(r);
+ log_smgrbufpersistence(srel->smgr_rlocator.locator, persistent);
+ SetRelationBuffersPersistence(srel, persistent);
+
+ /* then create the init fork */
+ is_index = (r->rd_rel->relkind == RELKIND_INDEX);
+ RelationCreateFork(srel, INIT_FORKNUM, !is_index, true);
+ if (is_index)
+ r->rd_indam->ambuildempty(r);
+
+ /* Update catalog */
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+
+ table_close(r, NoLock);
+ }
+
+ table_close(classRel, NoLock);
+}
+
/*
* ATRewriteTables: ALTER TABLE phase 3
*/
@@ -5766,48 +5903,59 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- /*
- * Create transient table that will receive the modified data.
- *
- * Ensure it is marked correctly as logged or unlogged. We have
- * to do this here so that buffers for the new relfilenumber will
- * have the right persistence set, and at the same time ensure
- * that the original filenumbers's buffers will get read in with
- * the correct setting (i.e. the original one). Otherwise a
- * rollback after the rewrite would possibly result with buffers
- * for the original filenumbers having the wrong persistence
- * setting.
- *
- * NB: This relies on swap_relation_files() also swapping the
- * persistence. That wouldn't work for pg_class, but that can't be
- * unlogged anyway.
- */
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
- persistence, lockmode);
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE &&
+ persistence == RELPERSISTENCE_UNLOGGED)
+ {
+ /* Make in-place persistence change. */
+ RelationChangePersistence(tab, persistence, lockmode);
+ }
+ else
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new
+ * relfilenumber will have the right persistence set, and at
+ * the same time ensure that the original filenumbers's buffers
+ * will get read in with the correct setting (i.e. the original
+ * one). Otherwise a rollback after the rewrite would possibly
+ * result with buffers for the original filenumbers having the
+ * wrong persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that can't
+ * be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace,
+ NewAccessMethod,
+ persistence, lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
+ /*
+ * Swap the physical files of the old and new heaps, then
+ * rebuild indexes and discard the old heap. We can use
+ * RecentXmin for the table's new relfrozenxid because we
+ * rewrote all the tuples in ATRewriteTable, so no older Xid
+ * remains in the table. Also, we never try to swap toast
+ * tables by content, since we have no interest in letting this
+ * code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
- InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ }
}
else if (tab->rewrite > 0 && tab->relkind == RELKIND_SEQUENCE)
{
--
2.43.5
v34-0011-Add-test-for-ALTER-TABLE-UNLOGGED.patch (text/x-patch; charset=us-ascii)
From f36dd260faae2e8da7ecfe14b9baf8d31bdc09bd Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 15 Aug 2024 16:06:34 +0900
Subject: [PATCH v34 11/16] Add test for ALTER TABLE UNLOGGED
---
src/test/recovery/t/044_persistence_change.pl | 512 ++++++++++++++++++
1 file changed, 512 insertions(+)
create mode 100644 src/test/recovery/t/044_persistence_change.pl
diff --git a/src/test/recovery/t/044_persistence_change.pl b/src/test/recovery/t/044_persistence_change.pl
new file mode 100644
index 0000000000..7f5ef84667
--- /dev/null
+++ b/src/test/recovery/t/044_persistence_change.pl
@@ -0,0 +1,512 @@
+# Copyright (c) 2023-2024, PostgreSQL Global Development Group
+#
+# Test in-place relation persistence changes
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+my @relnames = ('t', 'i_bt', 'i_gin', 'i_gist', 'i_hash', 'i_brin', 'i_spgist');
+my @noninplace_names = ('i_gist');
+
+# This feature behaves differently depending on wal_level.
+run_test('minimal');
+run_test('replica');
+done_testing();
+
+sub run_test
+{
+ my ($wal_level) = @_;
+
+ note "## run with wal_level = $wal_level";
+
+ # Initialize primary node.
+ my $node = PostgreSQL::Test::Cluster->new("node_$wal_level");
+ $node->init;
+ # Prevent checkpoints from running
+ $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+checkpoint_timeout = '24h'
+max_prepared_transactions = 2
+ ));
+ $node->start;
+
+ my $datadir = $node->data_dir;
+ my $datoid = $node->safe_psql('postgres',
+ q/SELECT oid FROM pg_database WHERE datname = current_database()/);
+ my $dbdir = $node->data_dir . "/base/$datoid";
+
+ # Create a table and indexes of built-in kinds
+ $node->psql('postgres', qq(
+ CREATE TABLE t (bt int, gin int[], gist point, hash int,
+ brin int, spgist point);
+ CREATE INDEX i_bt ON t USING btree (bt);
+ CREATE INDEX i_gin ON t USING gin (gin);
+ CREATE INDEX i_gist ON t USING gist (gist);
+ CREATE INDEX i_hash ON t USING hash (hash);
+ CREATE INDEX i_brin ON t USING brin (brin);
+ CREATE INDEX i_spgist ON t USING spgist (spgist);));
+
+ my $relfilenodes1 = getrelfilenodes($node, \@relnames);
+
+ # the number must match the number of entries in @relnames above
+ is (scalar(keys %{$relfilenodes1}), 7, "number of relations is correct");
+
+ # check initial state
+ ok (check_storage_state(\&is_logged_state, $node, \@relnames),
+ "storages are in logged state");
+
+ # Normal crash-recovery of LOGGED tables
+ $node->stop('immediate');
+ $node->start;
+
+ # Insert data 0 to 1999
+ $node->psql('postgres', insert_data_query(0, 2000));
+
+ # Check if the data survives a crash
+ $node->stop('immediate');
+ $node->start;
+ is ($node->safe_psql('postgres', "SELECT count(*) FROM t;"), 2000,
+ "data loss check: crash with LOGGED table");
+
+ # Change the table to UNLOGGED then commit.
+ $node->psql('postgres', 'ALTER TABLE t SET UNLOGGED');
+
+ # Check if SET UNLOGGED above didn't change relfilenumbers.
+ my $relfilenodes2 = getrelfilenodes($node, \@relnames);
+ ok (checkrelfilenodes($relfilenodes1, $relfilenodes2),
+ "relfilenumber transition is as expected after SET UNLOGGED");
+
+ # check init-file state
+ ok (check_storage_state(\&is_unlogged_state, $node, \@relnames),
+ "storages are in unlogged state");
+
+ # Check if the table is reset through recovery.
+ $node->stop('immediate');
+ $node->start;
+ is ($node->safe_psql('postgres', "SELECT count(*) FROM t;"), 0,
+ "table data is reset through recovery");
+
+ # check reset state
+ ok (check_storage_state(\&is_reset_state, $node, \@relnames),
+ "storages are in reset state");
+
+ # Insert data 0 to 1999, then set persistence to LOGGED then crash.
+ $node->psql('postgres', insert_data_query(0, 2000));
+ $node->psql('postgres', qq(ALTER TABLE t SET LOGGED));
+ $node->stop('immediate');
+ $node->start;
+
+ # Check if SET LOGGED didn't change relfilenumbers and data survive a crash
+ my $relfilenodes3 = getrelfilenodes($node, \@relnames);
+ ok (!checkrelfilenodes($relfilenodes2, $relfilenodes3),
+ "crashed SET-LOGGED relations have sane relfilenodes transition");
+
+ is ($node->safe_psql('postgres', "SELECT count(*) FROM t;"), 2000,
+ "crashed SET-LOGGED table does not lose data");
+
+ # Change to UNLOGGED then insert data, then shutdown normally.
+ $node->psql('postgres', 'ALTER TABLE t SET UNLOGGED');
+ $node->psql('postgres', insert_data_query(2000, 2000)); # 2000 - 3999
+ $node->stop;
+ $node->start;
+ is ($node->safe_psql('postgres', "SELECT count(*) FROM t;"), 4000,
+ "UNLOGGED table does not lose data after graceful restart");
+
+ # Test for mid-transaction change to LOGGED and crash.
+ # Now, the table has data 0-3999
+ $node->psql('postgres', insert_data_query(4000, 2000)); # 4000 - 5999
+
+ my $sess = $node->interactive_psql('postgres');
+ $sess->set_query_timer_restart();
+ $sess->query('BEGIN; ALTER TABLE t SET LOGGED');
+ $sess->query(insert_data_query(6000, 2000)); # 6000-7999, no commit
+ $node->stop('immediate');
+ $sess->quit;
+ $node->start;
+ is ($node->safe_psql('postgres', "SELECT count(*) FROM t;"), 0,
+ "table is reset after in-transaction SET-LOGGED then insert");
+ ok (check_storage_state(\&is_unlogged_state, $node, \@relnames),
+ "storages are reverted to unlogged state");
+
+ # Test for mid-transaction change to UNLOGGED and crash.
+ # Now, the table has no data
+ $node->psql('postgres', 'ALTER TABLE t SET LOGGED');
+ $node->psql('postgres', insert_data_query(0, 2000)); # 0 - 1999
+ $sess = $node->interactive_psql('postgres');
+ $sess->set_query_timer_restart();
+ $sess->query('BEGIN; ALTER TABLE t SET UNLOGGED');
+ $sess->query(insert_data_query(2000, 2000)); # 2000-3999, no commit
+ $node->stop('immediate');
+ $sess->quit;
+ $node->start;
+ is ($node->safe_psql('postgres', "SELECT count(*) FROM t;"), 2000,
+ "table is reset after in-transaction SET-UNLOGGED then insert");
+ ok (check_storage_state(\&is_logged_state, $node, \@relnames),
+ "storages are reverted to logged state");
+
+ ### Subtransactions
+ ok ($node->psql('postgres',
+ qq(
+ BEGIN;
+ ALTER TABLE t SET UNLOGGED; -- committed
+ SAVEPOINT a;
+ ALTER TABLE t SET LOGGED; -- aborted
+ SAVEPOINT b;
+ ROLLBACK TO a;
+ COMMIT;
+ )) != 3,
+ "command succeeds 1");
+
+ is ($node->safe_psql('postgres', "SELECT count(*) FROM t;"), 2000,
+ "table data is not changed 1");
+ ok (check_storage_state(\&is_unlogged_state, $node, \@relnames),
+ "storages are changed to unlogged state");
+
+ ok ($node->psql('postgres',
+ qq(
+ BEGIN;
+ ALTER TABLE t SET LOGGED; -- aborted
+ SAVEPOINT a;
+ ALTER TABLE t SET UNLOGGED; -- aborted
+ SAVEPOINT b;
+ RELEASE a;
+ ROLLBACK;
+ )) != 3,
+ "command succeeds 2");
+
+ is ($node->safe_psql('postgres', "SELECT count(*) FROM t;"), 2000,
+ "table data is not changed 2");
+ ok (check_storage_state(\&is_unlogged_state, $node, \@relnames),
+ "storages stay in unlogged state");
+
+ ### Prepared transactions
+ my ($ret, $stdout, $stderr) =
+ $node->psql('postgres',
+ qq(
+ ALTER TABLE t SET LOGGED;
+ BEGIN;
+ ALTER TABLE t SET UNLOGGED;
+ PREPARE TRANSACTION 'a';
+ COMMIT PREPARED 'a';
+ ));
+ ok ($stderr =~ m/cannot prepare transaction if persistence change/,
+ "errors out when persistence-flipped xact is prepared");
+ ok (check_storage_state(\&is_logged_state, $node, \@relnames),
+ "storages are in logged state");
+
+ ($ret, $stdout, $stderr) =
+ $node->psql('postgres',
+ qq(
+ BEGIN;
+ SAVEPOINT a;
+ ALTER TABLE t SET UNLOGGED;
+ PREPARE TRANSACTION 'a';
+ ROLLBACK PREPARED 'a';
+ ));
+ ok ($stderr =~ m/cannot prepare transaction if persistence change/,
+ "errors out when persistence-flipped xact is prepared 2");
+ ok (check_storage_state(\&is_logged_state, $node, \@relnames),
+ "storages stay in logged state");
+
+ ### Error out DML
+ $node->psql('postgres',
+ qq(
+ BEGIN;
+ ALTER TABLE t SET LOGGED;
+ INSERT INTO t VALUES(1); -- Succeeds
+ COMMIT;
+ ));
+
+ ($ret, $stdout, $stderr) =
+ $node->psql('postgres',
+ qq(
+ BEGIN;
+ ALTER TABLE t SET UNLOGGED;
+ INSERT INTO t VALUES(2); -- ERROR
+ ));
+ ok ($stderr =~ m/cannot execute INSERT on relation/,
+ "errors out when DML is issued after persistence toggling");
+
+ ok ($node->psql('postgres',
+ qq(
+ BEGIN;
+ SAVEPOINT a;
+ ALTER TABLE t SET UNLOGGED;
+ ROLLBACK TO a;
+ INSERT INTO t VALUES(3); -- Succeeds
+ COMMIT;
+ )) != 3,
+ "insert after rolled-back persistence change succeeds");
+
+ ($ret, $stdout, $stderr) =
+ $node->psql('postgres',
+ qq(
+ BEGIN;
+ SAVEPOINT a;
+ ALTER TABLE t SET UNLOGGED;
+ RELEASE a;
+ UPDATE t SET bt = bt + 1; -- ERROR
+ ));
+ ok ($stderr =~ m/cannot execute UPDATE on relation/,
+ "errors out when DML is issued after persistence toggling in subxact");
+
+ $node->stop;
+ $node->teardown_node;
+}
+
+#==== helper routines
+
+# Generates a query to insert data from $st to $st + $num - 1
+sub insert_data_query
+{
+ my ($st, $num) = @_;
+ my $ed = $st + $num - 1;
+ my $query = qq(
+INSERT INTO t
+ (SELECT i, ARRAY[i, i * 2], point(i, i * 2), i, i, point(i, i)
+ FROM generate_series($st, $ed) i);
+);
+ return $query;
+}
+
+sub check_indexes
+{
+ my ($node, $st, $ed) = @_;
+ my $num_data = $ed - $st;
+
+ is ($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO true;
+ SET enable_indexscan TO false;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE bt = i)),
+ $num_data, "heap is not broken");
+ is ($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE bt = i)),
+ $num_data, "btree is not broken");
+ is ($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE gin = ARRAY[i, i * 2];)),
+ $num_data, "gin is not broken");
+ is ($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE gist <@ box(point(i-0.5, i*2-0.5),point(i+0.5, i*2+0.5));)),
+ $num_data, "gist is not broken");
+ is ($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE hash = i;)),
+ $num_data, "hash is not broken");
+ is ($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE brin = i;)),
+ $num_data, "brin is not broken");
+ is ($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE spgist <@ box(point(i-0.5,i-0.5),point(i+0.5,i+0.5));)),
+ $num_data, "spgist is not broken");
+}
+
+sub getrelfilenodes
+{
+ my ($node, $relnames) = @_;
+
+ my $result = $node->safe_psql('postgres',
+ 'SELECT relname, relfilenode FROM pg_class
+ WHERE relname
+ IN (\'' .
+ join("','", @{$relnames}).
+ '\') ORDER BY oid');
+
+ my %relfilenodes;
+
+ foreach my $l (split(/\n/, $result))
+ {
+ die "unexpected format: $l" if ($l !~ /^([^|]+)\|([0-9]+)$/);
+ $relfilenodes{$1} = $2;
+ }
+
+ return \%relfilenodes;
+}
+
+sub checkrelfilenodes
+{
+ my ($rnodes1, $rnodes2) = @_;
+ my $result = 1;
+
+ foreach my $n (keys %{$rnodes1})
+ {
+ if (grep { $n eq $_ } @noninplace_names)
+ {
+ if ($rnodes1->{$n} == $rnodes2->{$n})
+ {
+ $result = 0;
+ note sprintf("$n: relfilenode is not changed: %d",
+ $rnodes1->{$n});
+ }
+ }
+ else
+ {
+ if ($rnodes1->{$n} != $rnodes2->{$n})
+ {
+ $result = 0;
+ note sprintf("$n: relfilenode is changed: %d => %d",
+ $rnodes1->{$n}, $rnodes2->{$n});
+ }
+ }
+ }
+ return $result;
+}
+
+sub getfilenames
+{
+ my ($dirname) = @_;
+
+ my $dir = opendir(my $dh, $dirname) or die "could not open $dirname: $!";
+ my @f = readdir($dh);
+ closedir($dh);
+
+ my @result = grep {$_ !~ /^\.\.?$/} @f;
+
+ return \@result;
+}
+
+sub init_fork_exists
+{
+ my ($relfilenodes, $datafiles, $relname) = @_;
+
+ my $relfnumber = ${$relfilenodes}{$relname};
+ my $init_exists = grep {/^${relfnumber}_init$/} @{$datafiles};
+
+ return $init_exists;
+}
+
+sub noninit_forks_exist
+{
+ my ($relfilenodes, $datafiles, $relname) = @_;
+
+ my $relfnumber = ${$relfilenodes}{$relname};
+ my $noninit_exists = grep {/^${relfnumber}(_(?!init).*)?$/} @{$datafiles};
+
+ return $noninit_exists;
+}
+
+sub is_logged_state
+{
+ my ($node, $relfilenodes, $datafiles, $relname) = @_;
+
+ my $relfnumber = ${$relfilenodes}{$relname};
+ my $init_exists = grep {/^${relfnumber}_init$/} @{$datafiles};
+ my $main_exists = grep {/^${relfnumber}$/} @{$datafiles};
+ my $persistence = $node->safe_psql('postgres',
+ qq(
+ SELECT relpersistence FROM pg_class WHERE relname = '$relname'
+ ));
+
+ if ($init_exists || !$main_exists || $persistence ne 'p')
+ {
+ # note the state if this test failed
+ note "## is_logged_state:($relname): \$init_exists=$init_exists, \$main_exists=$main_exists, \$persistence='$persistence'\n";
+ return 0 ;
+ }
+
+ return 1;
+}
+
+sub is_unlogged_state
+{
+ my ($node, $relfilenodes, $datafiles, $relname) = @_;
+
+ my $relfnumber = ${$relfilenodes}{$relname};
+ my $init_exists = grep {/^${relfnumber}_init$/} @{$datafiles};
+ my $main_exists = grep {/^${relfnumber}$/} @{$datafiles};
+ my $persistence = $node->safe_psql('postgres',
+ qq(
+ SELECT relpersistence FROM pg_class WHERE relname = '$relname'
+ ));
+
+ if (!$init_exists || !$main_exists || $persistence ne 'u')
+ {
+ # note the state if this test failed
+ note "## is_unlogged_state:($relname): \$init_exists=$init_exists, \$main_exists=$main_exists, \$persistence='$persistence'\n";
+ return 0 ;
+ }
+
+ return 1;
+}
+
+sub is_reset_state
+{
+ my ($node, $relfilenodes, $datafiles, $relname) = @_;
+
+ my $datoid = $node->safe_psql('postgres',
+ q/SELECT oid FROM pg_database WHERE datname = current_database()/);
+ my $dbdir = $node->data_dir . "/base/$datoid";
+ my $relfnumber = ${$relfilenodes}{$relname};
+ my $init_exists = grep {/^${relfnumber}_init$/} @{$datafiles};
+ my $main_exists = grep {/^${relfnumber}$/} @{$datafiles};
+ my $others_not_exist = !grep {/^${relfnumber}_(?!init).*$/} @{$datafiles};
+ my $persistence = $node->safe_psql('postgres',
+ qq(
+ SELECT relpersistence FROM pg_class WHERE relname = '$relname'
+ ));
+
+ if (!$init_exists || !$main_exists || !$others_not_exist ||
+ $persistence ne 'u')
+ {
+ # note the state if this test failed
+ note "## is_reset_state:($relname): \$init_exists=$init_exists, \$main_exists=$main_exists, \$others_not_exist=$others_not_exist, \$persistence='$persistence'\n";
+ return 0 ;
+ }
+
+ my $main_file = "$dbdir/${relfnumber}";
+ my $init_file = "$dbdir/${relfnumber}_init";
+ my $main_file_size = -s $main_file;
+ my $init_file_size = -s $init_file;
+
+ if ($main_file_size != $init_file_size)
+ {
+ note "## is_reset_state:($relname): \$main_file='$main_file', size=$main_file_size, \$init_file='$init_file', size=$init_file_size\n";
+ return 0;
+ }
+
+ return 1;
+}
+
+sub check_storage_state
+{
+ my ($func, $node, $relnames) = @_;
+ my $relfilenodes = getrelfilenodes($node, $relnames);
+ my $datoid = $node->safe_psql('postgres',
+ q/SELECT oid FROM pg_database WHERE datname = current_database()/);
+ my $dbdir = $node->data_dir . "/base/$datoid";
+ my $datafiles = getfilenames($dbdir);
+ my $result = 1;
+
+ foreach my $relname (@{$relnames})
+ {
+ if (!$func->($node, $relfilenodes, $datafiles, $relname))
+ {
+ $result = 0;
+
+ ## do not return immediately, run this test for all
+ ## relations to leave diagnosis information in the log
+ ## file.
+ }
+ }
+
+ return $result;
+}
--
2.43.5
v34-0012-Make-smgrdounlinkall-accept-forknumbers.patch (text/x-patch; charset=us-ascii)
From 12cbd8fab5889ee909edac278e9866f20cfb1f9f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 2 Aug 2024 19:34:06 +0900
Subject: [PATCH v34 12/16] Make smgrdounlinkall accept forknumbers
In a subsequent patch, crash-safe file deletion on a per-fork basis
will be required. To facilitate this, modify smgrdounlinkall(), which
efficiently removes multiple files, to accept fork numbers.
---
src/backend/catalog/storage.c | 2 +-
src/backend/storage/buffer/bufmgr.c | 92 ++++++++++++++++++++++++-----
src/backend/storage/smgr/md.c | 2 +-
src/backend/storage/smgr/smgr.c | 28 ++++++---
src/backend/utils/cache/relcache.c | 2 +-
src/include/common/relpath.h | 11 ++++
src/include/storage/bufmgr.h | 2 +-
src/include/storage/smgr.h | 3 +-
8 files changed, 114 insertions(+), 28 deletions(-)
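
For readers skimming the diff, a minimal standalone sketch of how a per-relation fork bitmap selects forks. The macro names mirror the ones this patch adds to relpath.h. MAX_FORKNUM_SKETCH and fork_selected() are hypothetical, and the sketch assumes forks are numbered 0..3.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption for this sketch: four forks numbered 0..3. */
#define MAX_FORKNUM_SKETCH 3

typedef uint8_t ForkBitmap;

#define FORKBITMAP_BIT(f)		(1 << (f))
#define FORKBITMAP_SET(m, f)	((m) |= FORKBITMAP_BIT(f))
#define FORKBITMAP_ISSET(m, f)	((m) & FORKBITMAP_BIT(f))
#define FORKBITMAP_ALLFORKS()	((1 << (MAX_FORKNUM_SKETCH + 1)) - 1)

/*
 * Hypothetical helper mirroring the "!forks || FORKBITMAP_ISSET(forks[i], forknum)"
 * condition used in smgrdounlinkall(): a NULL array selects every fork,
 * otherwise only the forks whose bits are set in forks[i] are selected.
 */
static bool
fork_selected(const ForkBitmap *forks, size_t i, int forknum)
{
	assert(forknum >= 0 && forknum <= MAX_FORKNUM_SKETCH);

	return forks == NULL || FORKBITMAP_ISSET(forks[i], forknum) != 0;
}
```

Passing NULL keeps the existing callers unchanged, which is why the call sites touched by this patch only gain a NULL argument.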
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index a00c59a274..71ae5f9b08 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -783,7 +783,7 @@ smgrDoPendingDeletes(bool isCommit)
if (nrels > 0)
{
- smgrdounlinkall(srels, nrels, false);
+ smgrdounlinkall(srels, NULL, nrels, false);
for (int i = 0; i < nrels; i++)
smgrclose(srels[i]);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6de8e00bdc..3eb4a0c792 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -164,6 +164,16 @@ struct copy_storage_using_buffer_read_stream_private
BlockNumber nblocks;
};
+/*
+ * Helper struct for handling a RelFileLocator and its fork bitmap together
+ * in DropRelationsAllBuffers.
+ */
+typedef struct RelFileForks
+{
+ RelFileLocator rloc; /* key member for qsort */
+ ForkBitmap forks; /* set of forks, as a bitmap */
+} RelFileForks;
+
/*
* Callback function to get next block for read stream object used in
* RelationCopyStorageUsingBuffer() function.
@@ -4498,24 +4508,32 @@ CheckIfPersistenceChanged(void)
* This function removes from the buffer pool all the pages of all
* forks of the specified relations. It's equivalent to calling
* DropRelationBuffers once per fork per relation with firstDelBlock = 0.
+ * If pforks is not NULL, pforks[i] is a bitmap restricting which forks of
+ * smgr_reln[i] are dropped; otherwise all forks are dropped.
* --------------------------------------------------------------------
*/
void
-DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators)
+DropRelationsAllBuffers(SMgrRelation *smgr_reln, ForkBitmap *pforks,
+ int nlocators)
{
int i;
int n = 0;
SMgrRelation *rels;
BlockNumber (*block)[MAX_FORKNUM + 1];
uint64 nBlocksToInvalidate = 0;
- RelFileLocator *locators;
+ ForkBitmap *forks = NULL;
+ RelFileForks *locators;
bool cached = true;
bool use_bsearch;
if (nlocators == 0)
return;
- rels = palloc(sizeof(SMgrRelation) * nlocators); /* non-local relations */
+ /* storages for non-local relations */
+ rels = palloc(sizeof(SMgrRelation) * nlocators);
+
+ if (pforks)
+ forks = palloc(sizeof(ForkBitmap) * nlocators);
/* If it's a local relation, it's localbuf.c's problem. */
for (i = 0; i < nlocators; i++)
@@ -4526,7 +4544,12 @@ DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators)
DropRelationAllLocalBuffers(smgr_reln[i]->smgr_rlocator.locator);
}
else
- rels[n++] = smgr_reln[i];
+ {
+ rels[n] = smgr_reln[i];
+ if (forks)
+ forks[n] = pforks[i];
+ n++;
+ }
}
/*
@@ -4536,6 +4559,10 @@ DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators)
if (n == 0)
{
pfree(rels);
+
+ if (forks)
+ pfree(forks);
+
return;
}
@@ -4554,6 +4581,13 @@ DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators)
{
for (int j = 0; j <= MAX_FORKNUM; j++)
{
+ /* Consider only the specified fork, if provided. */
+ if (forks && !FORKBITMAP_ISSET(forks[i], j))
+ {
+ block[i][j] = InvalidBlockNumber;
+ continue;
+ }
+
/* Get the number of blocks for a relation's fork. */
block[i][j] = smgrnblocks_cached(rels[i], j);
@@ -4581,7 +4615,7 @@ DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators)
{
for (int j = 0; j <= MAX_FORKNUM; j++)
{
- /* ignore relation forks that doesn't exist */
+ /* ignore relation forks that don't exist or were not requested */
if (!BlockNumberIsValid(block[i][j]))
continue;
@@ -4597,9 +4631,13 @@ DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators)
}
pfree(block);
- locators = palloc(sizeof(RelFileLocator) * n); /* non-local relations */
+ locators = palloc(sizeof(RelFileForks) * n); /* non-local relations */
+
for (i = 0; i < n; i++)
- locators[i] = rels[i]->smgr_rlocator.locator;
+ {
+ locators[i].rloc = rels[i]->smgr_rlocator.locator;
+ locators[i].forks = (forks ? forks[i] : FORKBITMAP_ALLFORKS());
+ }
/*
* For low number of relations to drop just use a simple walk through, to
@@ -4609,13 +4647,34 @@ DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators)
*/
use_bsearch = n > RELS_BSEARCH_THRESHOLD;
- /* sort the list of rlocators if necessary */
- if (use_bsearch)
- qsort(locators, n, sizeof(RelFileLocator), rlocator_comparator);
+ /*
+ * Sort and compress the list of RelFileForks if necessary. We assume the
+ * caller passed unique rlocators if forks are not specified.
+ */
+ if (use_bsearch || forks)
+ {
+ int j = 0;
+
+ qsort(locators, n, sizeof(RelFileForks), rlocator_comparator);
+
+ /*
+ * Now the list is in rlocator increasing order, compress the list by
+ * merging fork bitmaps so that all elements have unique rlocators.
+ */
+ for (i = 1 ; i < n ; i++)
+ {
+ if (RelFileLocatorEquals(locators[j].rloc, locators[i].rloc))
+ locators[j].forks |= locators[i].forks;
+ else
+ locators[++j] = locators[i];
+ }
+
+ n = j + 1;
+ }
for (i = 0; i < NBuffers; i++)
{
- RelFileLocator *rlocator = NULL;
+ RelFileForks *rlocator = NULL;
BufferDesc *bufHdr = GetBufferDescriptor(i);
uint32 buf_state;
@@ -4630,7 +4689,8 @@ DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators)
for (j = 0; j < n; j++)
{
- if (BufTagMatchesRelFileLocator(&bufHdr->tag, &locators[j]))
+ if (BufTagMatchesRelFileLocator(&bufHdr->tag,
+ &locators[j].rloc))
{
rlocator = &locators[j];
break;
@@ -4643,16 +4703,18 @@ DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators)
locator = BufTagGetRelFileLocator(&bufHdr->tag);
rlocator = bsearch((const void *) &(locator),
- locators, n, sizeof(RelFileLocator),
+ locators, n, sizeof(RelFileForks),
rlocator_comparator);
}
/* buffer doesn't belong to any of the given relfilelocators; skip it */
- if (rlocator == NULL)
+ if (rlocator == NULL ||
+ !FORKBITMAP_ISSET(rlocator->forks, BufTagGetForkNum(&bufHdr->tag)))
continue;
buf_state = LockBufHdr(bufHdr);
- if (BufTagMatchesRelFileLocator(&bufHdr->tag, rlocator))
+ if (BufTagMatchesRelFileLocator(&bufHdr->tag, &rlocator->rloc) &&
+ FORKBITMAP_ISSET(rlocator->forks, BufTagGetForkNum(&bufHdr->tag)))
InvalidateBuffer(bufHdr); /* releases spinlock */
else
UnlockBufHdr(bufHdr, buf_state);
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 6796756358..5125cfe640 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1463,7 +1463,7 @@ DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo)
srels[i] = srel;
}
- smgrdounlinkall(srels, ndelrels, isRedo);
+ smgrdounlinkall(srels, NULL, ndelrels, isRedo);
for (i = 0; i < ndelrels; i++)
smgrclose(srels[i]);
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index eb01040772..68139e3943 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -451,15 +451,19 @@ smgrdosyncall(SMgrRelation *rels, int nrels)
/*
* smgrdounlinkall() -- Immediately unlink all forks of all given relations
*
- * All forks of all given relations are removed from the store. This
- * should not be used during transactional operations, since it can't be
- * undone.
+ * Forks of all given relations are removed from the store. This should not be
+ * used during transactional operations, since it can't be undone.
+ *
+ * If forks is NULL, all forks are removed for all relations. Otherwise,
+ * forks[i] is a bitmap of the forks to remove for rels[i]; only the forks
+ * whose bits are set are removed. A bitmap with every fork bit set
+ * (FORKBITMAP_ALLFORKS()) removes all forks of that relation.
*
* If isRedo is true, it is okay for the underlying file(s) to be gone
* already.
*/
void
-smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
+smgrdounlinkall(SMgrRelation *rels, ForkBitmap *forks, int nrels, bool isRedo)
{
int i = 0;
RelFileLocatorBackend *rlocators;
@@ -472,7 +476,7 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
* Get rid of any remaining buffers for the relations. bufmgr will just
* drop them without bothering to write the contents.
*/
- DropRelationsAllBuffers(rels, nrels);
+ DropRelationsAllBuffers(rels, forks, nrels);
/*
* create an array which contains all relations to be dropped, and close
@@ -486,9 +490,13 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
rlocators[i] = rlocator;
- /* Close the forks at smgr level */
+ /* Close the specified forks at smgr level. */
for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- smgrsw[which].smgr_close(rels[i], forknum);
+ {
+ if (!forks || FORKBITMAP_ISSET(forks[i], forknum))
+ smgrsw[which].smgr_close(rels[i], forknum);
+ continue;
+ }
}
/*
@@ -515,7 +523,11 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
int which = rels[i]->smgr_which;
for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- smgrsw[which].smgr_unlink(rlocators[i], forknum, isRedo);
+ {
+ if (!forks || FORKBITMAP_ISSET(forks[i], forknum))
+ smgrsw[which].smgr_unlink(rlocators[i], forknum, isRedo);
+ continue;
+ }
}
pfree(rlocators);
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 8040250a0c..5c6d280499 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -3858,7 +3858,7 @@ RelationSetNewRelfilenumber(Relation relation, char persistence)
* anyway.
*/
srel = smgropen(relation->rd_locator, relation->rd_backend);
- smgrdounlinkall(&srel, 1, false);
+ smgrdounlinkall(&srel, NULL, 1, false);
smgrclose(srel);
}
else
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index 6f006d5a93..d0088903c5 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -61,6 +61,17 @@ typedef enum ForkNumber
#define MAX_FORKNUM INIT_FORKNUM
+/* ForkBitmap holds multiple forks as a bitmap */
+StaticAssertDecl(MAX_FORKNUM < 8, "MAX_FORKNUM too large for ForkBitmap");
+
+typedef uint8 ForkBitmap;
+#define FORKBITMAP_BIT(f) (1 << (f))
+#define FORKBITMAP_INIT(m, f) ((m) = FORKBITMAP_BIT((f)))
+#define FORKBITMAP_SET(m, f) ((m) |= FORKBITMAP_BIT((f)))
+#define FORKBITMAP_RESET(m, f) ((m) &= ~(FORKBITMAP_BIT(f)))
+#define FORKBITMAP_ISSET(m, f) ((m) & FORKBITMAP_BIT(f))
+#define FORKBITMAP_ALLFORKS() ((1 << (MAX_FORKNUM + 1)) - 1)
+
#define FORKNAMECHARS 4 /* max chars for a fork name */
extern PGDLLIMPORT const char *const forkNames[];
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index df881ea86c..b45c330e01 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -275,7 +275,7 @@ extern void DropRelationBuffers(struct SMgrRelationData *smgr_reln,
ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
extern void DropRelationsAllBuffers(struct SMgrRelationData **smgr_reln,
- int nlocators);
+ ForkBitmap *forks, int nlocators);
extern void DropDatabaseBuffers(Oid dbid);
extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
bool permanent);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index e867ff92ab..222d631bf0 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -85,7 +85,8 @@ extern void smgrreleaseall(void);
extern void smgrreleaserellocator(RelFileLocatorBackend rlocator);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
-extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
+extern void smgrdounlinkall(SMgrRelation *rels, ForkBitmap *forks, int nrels,
+ bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, const void *buffer, bool skipFsync);
extern void smgrzeroextend(SMgrRelation reln, ForkNumber forknum,
--
2.43.5
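
The sort-and-compress step that this patch adds to DropRelationsAllBuffers() can be sketched in isolation as follows. This is a simplified illustration, not the patch's code: an integer key stands in for RelFileLocator, and KeyForks, keyforks_cmp() and sort_and_merge() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint8_t ForkBitmap;

/* Hypothetical stand-in for RelFileForks; an integer replaces RelFileLocator. */
typedef struct KeyForks
{
	uint32_t	key;		/* sort key, kept as the first member */
	ForkBitmap	forks;		/* forks to drop, one bit per fork */
} KeyForks;

/* qsort comparator ordering entries by key. */
static int
keyforks_cmp(const void *a, const void *b)
{
	uint32_t	ka = ((const KeyForks *) a)->key;
	uint32_t	kb = ((const KeyForks *) b)->key;

	return (ka > kb) - (ka < kb);
}

/*
 * Sort the entries by key, then merge runs of equal keys by OR-ing their
 * fork bitmaps, so that each key appears exactly once. Returns the new
 * number of entries.
 */
static int
sort_and_merge(KeyForks *items, int n)
{
	int			j = 0;

	assert(items != NULL && n > 0);

	qsort(items, (size_t) n, sizeof(KeyForks), keyforks_cmp);

	for (int i = 1; i < n; i++)
	{
		if (items[j].key == items[i].key)
			items[j].forks |= items[i].forks;	/* same relation: merge */
		else
			items[++j] = items[i];				/* new relation: keep */
	}

	return j + 1;
}
```

After this step a later lookup (linear scan or bsearch) finds at most one entry per relation, so a single bitmap test decides whether a given buffer's fork is to be dropped.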
v34-0013-Enable-commit-records-to-handle-fork-removals.patch (text/x-patch; charset=us-ascii)
From cc35312154846bfbc5d2924bdc60b5ac67da093e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 2 Aug 2024 20:51:31 +0900
Subject: [PATCH v34 13/16] Enable commit records to handle fork removals
Currently, COMMIT/ABORT WAL records store the locators of relations that
need to be removed at commit or abort. This patch adds support for
handling these removals on a per-fork basis. The PREPARE record can store
the same information, but it is not used yet.
---
src/backend/access/rmgrdesc/xactdesc.c | 50 ++++++++++++++++++++---
src/backend/access/transam/twophase.c | 56 ++++++++++++++++++++++----
src/backend/access/transam/xact.c | 30 +++++++++++---
src/backend/storage/buffer/bufmgr.c | 2 +-
src/backend/storage/smgr/md.c | 11 +++--
src/include/access/xact.h | 8 ++++
src/include/storage/md.h | 3 +-
7 files changed, 137 insertions(+), 23 deletions(-)
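
The record layout change can be pictured as an optional array of fork bitmaps that follows the locator array when a flag bit is set. A minimal sketch of parsing such a layout is below. The flag value, the locator struct, and all Sketch* names are assumptions made for illustration; the real definitions live in xact.h and are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint8_t ForkBitmap;

/* Hypothetical flag bit and locator layout, for illustration only. */
#define SKETCH_HAS_RELFILEFORKS 0x01

typedef struct SketchLocator
{
	uint32_t	spc;
	uint32_t	db;
	uint32_t	rel;
} SketchLocator;

typedef struct SketchParsed
{
	uint32_t	nrels;
	const uint8_t *locators;	/* nrels * sizeof(SketchLocator) bytes */
	const ForkBitmap *xforks;	/* NULL when the flag is not set */
} SketchParsed;

/*
 * Parse "nrels, locators[nrels], optional forks[nrels]" from data.
 * Returns the number of bytes consumed.
 */
static size_t
sketch_parse_rels(uint32_t xinfo, const uint8_t *data, SketchParsed *parsed)
{
	const uint8_t *p = data;

	assert(data != NULL && parsed != NULL);

	/* Fixed part: the relation count. */
	memcpy(&parsed->nrels, p, sizeof(uint32_t));
	p += sizeof(uint32_t);

	/* The locator array is always present. */
	parsed->locators = p;
	p += parsed->nrels * sizeof(SketchLocator);

	/* The bitmap array is present only when the flag says so. */
	parsed->xforks = NULL;
	if (xinfo & SKETCH_HAS_RELFILEFORKS)
	{
		parsed->xforks = p;
		p += parsed->nrels * sizeof(ForkBitmap);
	}

	return (size_t) (p - data);
}
```

Keeping the bitmaps in a separate, flag-gated array means records written without per-fork removals keep their existing size and layout.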
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index dccca201e0..ac3815fc57 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -82,6 +82,12 @@ ParseCommitRecord(uint8 info, xl_xact_commit *xlrec, xl_xact_parsed_commit *pars
data += MinSizeOfXactRelfileLocators;
data += xl_rellocators->nrels * sizeof(RelFileLocator);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_RELFILEFORKS)
+ {
+ parsed->xforks = (ForkBitmap *)data;
+ data += xl_rellocators->nrels * sizeof(ForkBitmap);
+ }
}
if (parsed->xinfo & XACT_XINFO_HAS_DROPPED_STATS)
@@ -188,6 +194,12 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
data += MinSizeOfXactRelfileLocators;
data += xl_rellocator->nrels * sizeof(RelFileLocator);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_RELFILEFORKS)
+ {
+ parsed->xforks = (ForkBitmap *)data;
+ data += xl_rellocator->nrels * sizeof(ForkBitmap);
+ }
}
if (parsed->xinfo & XACT_XINFO_HAS_DROPPED_STATS)
@@ -263,9 +275,21 @@ ParsePrepareRecord(uint8 info, xl_xact_prepare *xlrec, xl_xact_parsed_prepare *p
parsed->xlocators = (RelFileLocator *) bufptr;
bufptr += MAXALIGN(xlrec->ncommitrels * sizeof(RelFileLocator));
+ if (xlrec->comhasforks)
+ {
+ parsed->xforks = (ForkBitmap *) bufptr;
+ bufptr += MAXALIGN(xlrec->ncommitrels * sizeof(ForkBitmap));
+ }
+
parsed->abortlocators = (RelFileLocator *) bufptr;
bufptr += MAXALIGN(xlrec->nabortrels * sizeof(RelFileLocator));
+ if (xlrec->abohasforks)
+ {
+ parsed->abortforks = (ForkBitmap *) bufptr;
+ bufptr += MAXALIGN(xlrec->nabortrels * sizeof(ForkBitmap));
+ }
+
parsed->stats = (xl_xact_stats_item *) bufptr;
bufptr += MAXALIGN(xlrec->ncommitstats * sizeof(xl_xact_stats_item));
@@ -278,7 +302,7 @@ ParsePrepareRecord(uint8 info, xl_xact_prepare *xlrec, xl_xact_parsed_prepare *p
static void
xact_desc_relations(StringInfo buf, char *label, int nrels,
- RelFileLocator *xlocators)
+ RelFileLocator *xlocators, ForkBitmap *xforks)
{
int i;
@@ -291,6 +315,19 @@ xact_desc_relations(StringInfo buf, char *label, int nrels,
appendStringInfo(buf, " %s", path);
pfree(path);
+
+ if (xforks)
+ {
+ char delim = ':';
+ for (int j = 0 ; j <= MAX_FORKNUM ; j++)
+ {
+ if (FORKBITMAP_ISSET(xforks[i], j))
+ {
+ appendStringInfo(buf, "%c%d", delim, j);
+ delim = ',';
+ }
+ }
+ }
}
}
}
@@ -340,7 +377,8 @@ xact_desc_commit(StringInfo buf, uint8 info, xl_xact_commit *xlrec, RepOriginId
appendStringInfoString(buf, timestamptz_to_str(xlrec->xact_time));
- xact_desc_relations(buf, "rels", parsed.nrels, parsed.xlocators);
+ xact_desc_relations(buf, "rels",
+ parsed.nrels, parsed.xlocators, parsed.xforks);
xact_desc_subxacts(buf, parsed.nsubxacts, parsed.subxacts);
xact_desc_stats(buf, "", parsed.nstats, parsed.stats);
@@ -376,7 +414,8 @@ xact_desc_abort(StringInfo buf, uint8 info, xl_xact_abort *xlrec, RepOriginId or
appendStringInfoString(buf, timestamptz_to_str(xlrec->xact_time));
- xact_desc_relations(buf, "rels", parsed.nrels, parsed.xlocators);
+ xact_desc_relations(buf, "rels",
+ parsed.nrels, parsed.xlocators, parsed.xforks);
xact_desc_subxacts(buf, parsed.nsubxacts, parsed.subxacts);
if (parsed.xinfo & XACT_XINFO_HAS_ORIGIN)
@@ -400,9 +439,10 @@ xact_desc_prepare(StringInfo buf, uint8 info, xl_xact_prepare *xlrec, RepOriginI
appendStringInfo(buf, "gid %s: ", parsed.twophase_gid);
appendStringInfoString(buf, timestamptz_to_str(parsed.xact_time));
- xact_desc_relations(buf, "rels(commit)", parsed.nrels, parsed.xlocators);
+ xact_desc_relations(buf, "rels(commit)", parsed.nrels,
+ parsed.xlocators, parsed.xforks);
xact_desc_relations(buf, "rels(abort)", parsed.nabortrels,
- parsed.abortlocators);
+ parsed.abortlocators, parsed.abortforks);
xact_desc_stats(buf, "commit ", parsed.nstats, parsed.stats);
xact_desc_stats(buf, "abort ", parsed.nabortstats, parsed.abortstats);
xact_desc_subxacts(buf, parsed.nsubxacts, parsed.subxacts);
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 24285c7d20..6ac1fbfa24 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -203,6 +203,7 @@ static void RecordTransactionCommitPrepared(TransactionId xid,
TransactionId *children,
int nrels,
RelFileLocator *rels,
+ ForkBitmap *forks,
int nstats,
xl_xact_stats_item *stats,
int ninvalmsgs,
@@ -214,6 +215,7 @@ static void RecordTransactionAbortPrepared(TransactionId xid,
TransactionId *children,
int nrels,
RelFileLocator *rels,
+ ForkBitmap *forks,
int nstats,
xl_xact_stats_item *stats,
const char *gid);
@@ -1070,7 +1072,9 @@ StartPrepare(GlobalTransaction gxact)
TwoPhaseFileHeader hdr;
TransactionId *children;
RelFileLocator *commitrels;
+ ForkBitmap *commitforks = NULL;
RelFileLocator *abortrels;
+ ForkBitmap *abortforks = NULL;
xl_xact_stats_item *abortstats = NULL;
xl_xact_stats_item *commitstats = NULL;
SharedInvalidationMessage *invalmsgs;
@@ -1097,7 +1101,9 @@ StartPrepare(GlobalTransaction gxact)
hdr.owner = gxact->owner;
hdr.nsubxacts = xactGetCommittedChildren(&children);
hdr.ncommitrels = smgrGetPendingDeletes(true, &commitrels);
+ hdr.comhasforks = (commitforks != NULL);
hdr.nabortrels = smgrGetPendingDeletes(false, &abortrels);
+ hdr.abohasforks = (abortforks != NULL);
hdr.ncommitstats =
pgstat_get_transactional_drops(true, &commitstats);
hdr.nabortstats =
@@ -1126,11 +1132,23 @@ StartPrepare(GlobalTransaction gxact)
{
save_state_data(commitrels, hdr.ncommitrels * sizeof(RelFileLocator));
pfree(commitrels);
+
+ if (hdr.comhasforks)
+ {
+ save_state_data(commitforks, hdr.ncommitrels * sizeof(ForkBitmap));
+ pfree(commitforks);
+ }
}
if (hdr.nabortrels > 0)
{
save_state_data(abortrels, hdr.nabortrels * sizeof(RelFileLocator));
pfree(abortrels);
+
+ if (hdr.abohasforks)
+ {
+ save_state_data(abortforks, hdr.nabortrels * sizeof(ForkBitmap));
+ pfree(abortforks);
+ }
}
if (hdr.ncommitstats > 0)
{
@@ -1512,8 +1530,11 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
TransactionId latestXid;
TransactionId *children;
RelFileLocator *commitrels;
+ ForkBitmap *commitforks = NULL;
RelFileLocator *abortrels;
+ ForkBitmap *abortforks = NULL;
RelFileLocator *delrels;
+ ForkBitmap *delforks;
int ndelrels;
xl_xact_stats_item *commitstats;
xl_xact_stats_item *abortstats;
@@ -1549,8 +1570,18 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId));
commitrels = (RelFileLocator *) bufptr;
bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileLocator));
+ if (hdr->comhasforks)
+ {
+ commitforks = (ForkBitmap *) bufptr;
+ bufptr += MAXALIGN(hdr->ncommitrels * sizeof(ForkBitmap));
+ }
abortrels = (RelFileLocator *) bufptr;
bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileLocator));
+ if (hdr->abohasforks)
+ {
+ abortforks = (ForkBitmap *) bufptr;
+ bufptr += MAXALIGN(hdr->nabortrels * sizeof(ForkBitmap));
+ }
commitstats = (xl_xact_stats_item *) bufptr;
bufptr += MAXALIGN(hdr->ncommitstats * sizeof(xl_xact_stats_item));
abortstats = (xl_xact_stats_item *) bufptr;
@@ -1575,7 +1606,8 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
if (isCommit)
RecordTransactionCommitPrepared(xid,
hdr->nsubxacts, children,
- hdr->ncommitrels, commitrels,
+ hdr->ncommitrels,
+ commitrels, commitforks,
hdr->ncommitstats,
commitstats,
hdr->ninvalmsgs, invalmsgs,
@@ -1583,7 +1615,8 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
else
RecordTransactionAbortPrepared(xid,
hdr->nsubxacts, children,
- hdr->nabortrels, abortrels,
+ hdr->nabortrels,
+ abortrels, abortforks,
hdr->nabortstats,
abortstats,
gid);
@@ -1604,6 +1637,9 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
*/
gxact->valid = false;
+ /* Currently, prepare info should not have per-fork storage information. */
+ Assert(!commitforks && !abortforks);
+
/*
* We have to remove any files that were supposed to be dropped. For
* consistency with the regular xact.c code paths, must do this before
@@ -1614,16 +1650,18 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
if (isCommit)
{
delrels = commitrels;
+ delforks = commitforks;
ndelrels = hdr->ncommitrels;
}
else
{
delrels = abortrels;
+ delforks = abortforks;
ndelrels = hdr->nabortrels;
}
/* Make sure files supposed to be dropped are dropped */
- DropRelationFiles(delrels, ndelrels, false);
+ DropRelationFiles(delrels, delforks, ndelrels, false);
if (isCommit)
pgstat_execute_transactional_drops(hdr->ncommitstats, commitstats, false);
@@ -2128,7 +2166,11 @@ RecoverPreparedTransactions(void)
subxids = (TransactionId *) bufptr;
bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId));
bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileLocator));
+ if (hdr->comhasforks)
+ bufptr += MAXALIGN(hdr->ncommitrels * sizeof(ForkBitmap));
bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileLocator));
+ if (hdr->abohasforks)
+ bufptr += MAXALIGN(hdr->nabortrels * sizeof(ForkBitmap));
bufptr += MAXALIGN(hdr->ncommitstats * sizeof(xl_xact_stats_item));
bufptr += MAXALIGN(hdr->nabortstats * sizeof(xl_xact_stats_item));
bufptr += MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));
@@ -2312,7 +2354,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileLocator *rels,
+ RelFileLocator *rels, ForkBitmap *forks,
int nstats,
xl_xact_stats_item *stats,
int ninvalmsgs,
@@ -2343,7 +2385,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
* not they do.
*/
recptr = XactLogCommitRecord(committs,
- nchildren, children, nrels, rels,
+ nchildren, children, nrels, rels, forks,
nstats, stats,
ninvalmsgs, invalmsgs,
initfileinval,
@@ -2410,7 +2452,7 @@ RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileLocator *rels,
+ RelFileLocator *rels, ForkBitmap *forks,
int nstats,
xl_xact_stats_item *stats,
const char *gid)
@@ -2442,7 +2484,7 @@ RecordTransactionAbortPrepared(TransactionId xid,
*/
recptr = XactLogAbortRecord(GetCurrentTimestamp(),
nchildren, children,
- nrels, rels,
+ nrels, rels, forks,
nstats, stats,
MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
xid, gid);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 1e1f86e578..3a594325d2 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1325,6 +1325,7 @@ RecordTransactionCommit(void)
TransactionId latestXid = InvalidTransactionId;
int nrels;
RelFileLocator *rels;
+ ForkBitmap *forks = NULL;
int nchildren;
TransactionId *children;
int ndroppedstats = 0;
@@ -1436,7 +1437,7 @@ RecordTransactionCommit(void)
* Insert the commit XLOG record.
*/
XactLogCommitRecord(GetCurrentTransactionStopTimestamp(),
- nchildren, children, nrels, rels,
+ nchildren, children, nrels, rels, forks,
ndroppedstats, droppedstats,
nmsgs, invalMessages,
RelcacheInitFileInval,
@@ -1753,6 +1754,7 @@ RecordTransactionAbort(bool isSubXact)
TransactionId latestXid;
int nrels;
RelFileLocator *rels;
+ ForkBitmap *forks = NULL;
int ndroppedstats = 0;
xl_xact_stats_item *droppedstats = NULL;
int nchildren;
@@ -1814,7 +1816,7 @@ RecordTransactionAbort(bool isSubXact)
XactLogAbortRecord(xact_time,
nchildren, children,
- nrels, rels,
+ nrels, rels, forks,
ndroppedstats, droppedstats,
MyXactFlags, InvalidTransactionId,
NULL);
@@ -5844,7 +5846,7 @@ xactGetCommittedChildren(TransactionId **ptr)
XLogRecPtr
XactLogCommitRecord(TimestampTz commit_time,
int nsubxacts, TransactionId *subxacts,
- int nrels, RelFileLocator *rels,
+ int nrels, RelFileLocator *rels, ForkBitmap *forks,
int ndroppedstats, xl_xact_stats_item *droppedstats,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval,
@@ -5912,6 +5914,9 @@ XactLogCommitRecord(TimestampTz commit_time,
xl_xinfo.xinfo |= XACT_XINFO_HAS_RELFILELOCATORS;
xl_relfilelocators.nrels = nrels;
info |= XLR_SPECIAL_REL_UPDATE;
+
+ if (forks)
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_RELFILEFORKS;
}
if (ndroppedstats > 0)
@@ -5974,6 +5979,10 @@ XactLogCommitRecord(TimestampTz commit_time,
MinSizeOfXactRelfileLocators);
XLogRegisterData((char *) rels,
nrels * sizeof(RelFileLocator));
+
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_RELFILEFORKS)
+ XLogRegisterData((char *) forks,
+ nrels * sizeof(ForkBitmap));
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_DROPPED_STATS)
@@ -6016,7 +6025,7 @@ XactLogCommitRecord(TimestampTz commit_time,
XLogRecPtr
XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
- int nrels, RelFileLocator *rels,
+ int nrels, RelFileLocator *rels, ForkBitmap *forks,
int ndroppedstats, xl_xact_stats_item *droppedstats,
int xactflags, TransactionId twophase_xid,
const char *twophase_gid)
@@ -6061,6 +6070,9 @@ XactLogAbortRecord(TimestampTz abort_time,
xl_xinfo.xinfo |= XACT_XINFO_HAS_RELFILELOCATORS;
xl_relfilelocators.nrels = nrels;
info |= XLR_SPECIAL_REL_UPDATE;
+
+ if (forks)
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_RELFILEFORKS;
}
if (ndroppedstats > 0)
@@ -6127,6 +6139,10 @@ XactLogAbortRecord(TimestampTz abort_time,
MinSizeOfXactRelfileLocators);
XLogRegisterData((char *) rels,
nrels * sizeof(RelFileLocator));
+
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_RELFILEFORKS)
+ XLogRegisterData((char *) forks,
+ nrels * sizeof(ForkBitmap));
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_DROPPED_STATS)
@@ -6267,7 +6283,8 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
XLogFlush(lsn);
/* Make sure files supposed to be dropped are dropped */
- DropRelationFiles(parsed->xlocators, parsed->nrels, true);
+ DropRelationFiles(parsed->xlocators, parsed->xforks, parsed->nrels,
+ true);
}
SimpleUndoLog_UndoByXid(true, xid, parsed->nsubxacts, parsed->subxacts);
@@ -6381,7 +6398,8 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
*/
XLogFlush(lsn);
- DropRelationFiles(parsed->xlocators, parsed->nrels, true);
+ DropRelationFiles(parsed->xlocators, parsed->xforks, parsed->nrels,
+ true);
}
SimpleUndoLog_UndoByXid(false, xid, parsed->nsubxacts, parsed->subxacts);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3eb4a0c792..87b56403c1 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -171,7 +171,7 @@ struct copy_storage_using_buffer_read_stream_private
typedef struct RelFileForks
{
RelFileLocator rloc; /* key member for qsort */
- ForkBitmap forks; /* fork number in bitmap */
+ ForkBitmap forks; /* fork numbers in bitmap */
} RelFileForks;
/*
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 5125cfe640..95bb91df50 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1443,7 +1443,8 @@ ForgetDatabaseSyncRequests(Oid dbid)
* DropRelationFiles -- drop files of all given relations
*/
void
-DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo)
+DropRelationFiles(RelFileLocator *delrels, ForkBitmap *delforks, int ndelrels,
+ bool isRedo)
{
SMgrRelation *srels;
int i;
@@ -1457,13 +1458,17 @@ DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo)
{
ForkNumber fork;
+ /* Close the specified forks at smgr level. */
for (fork = 0; fork <= MAX_FORKNUM; fork++)
- XLogDropRelation(delrels[i], fork);
+ {
+ if (!delforks || FORKBITMAP_ISSET(delforks[i], fork))
+ XLogDropRelation(delrels[i], fork);
+ }
}
srels[i] = srel;
}
- smgrdounlinkall(srels, NULL, ndelrels, isRedo);
+ smgrdounlinkall(srels, delforks, ndelrels, isRedo);
for (i = 0; i < ndelrels; i++)
smgrclose(srels[i]);
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 8f23611dcc..73c6dd3ba1 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -196,6 +196,7 @@ typedef struct SavedTransactionCharacteristics
#define XACT_XINFO_HAS_AE_LOCKS (1U << 6)
#define XACT_XINFO_HAS_GID (1U << 7)
#define XACT_XINFO_HAS_DROPPED_STATS (1U << 8)
+#define XACT_XINFO_HAS_RELFILEFORKS (1U << 9)
/*
* Also stored in xinfo, these indicating a variety of additional actions that
@@ -355,7 +356,9 @@ typedef struct xl_xact_prepare
Oid owner; /* user running the transaction */
int32 nsubxacts; /* number of following subxact XIDs */
int32 ncommitrels; /* number of delete-on-commit rels */
+ bool comhasforks; /* commitrels is accompanied by forknums */
int32 nabortrels; /* number of delete-on-abort rels */
+ bool abohasforks; /* abortrels is accompanied by forknums */
int32 ncommitstats; /* number of stats to drop on commit */
int32 nabortstats; /* number of stats to drop on abort */
int32 ninvalmsgs; /* number of cache invalidation messages */
@@ -383,6 +386,7 @@ typedef struct xl_xact_parsed_commit
int nrels;
RelFileLocator *xlocators;
+ ForkBitmap *xforks;
int nstats;
xl_xact_stats_item *stats;
@@ -394,6 +398,7 @@ typedef struct xl_xact_parsed_commit
char twophase_gid[GIDSIZE]; /* only for 2PC */
int nabortrels; /* only for 2PC */
RelFileLocator *abortlocators; /* only for 2PC */
+ ForkBitmap *abortforks; /* only for 2PC */
int nabortstats; /* only for 2PC */
xl_xact_stats_item *abortstats; /* only for 2PC */
@@ -416,6 +421,7 @@ typedef struct xl_xact_parsed_abort
int nrels;
RelFileLocator *xlocators;
+ ForkBitmap *xforks;
int nstats;
xl_xact_stats_item *stats;
@@ -499,6 +505,7 @@ extern int xactGetCommittedChildren(TransactionId **ptr);
extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileLocator *rels,
+ ForkBitmap *forks,
int ndroppedstats,
xl_xact_stats_item *droppedstats,
int nmsgs, SharedInvalidationMessage *msgs,
@@ -510,6 +517,7 @@ extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
extern XLogRecPtr XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileLocator *rels,
+ ForkBitmap *forks,
int ndroppedstats,
xl_xact_stats_item *droppedstats,
int xactflags, TransactionId twophase_xid,
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 620f10abde..c9166c7ac1 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -46,7 +46,8 @@ extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void mdregistersync(SMgrRelation reln, ForkNumber forknum);
extern void ForgetDatabaseSyncRequests(Oid dbid);
-extern void DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo);
+extern void DropRelationFiles(RelFileLocator *delrels, ForkBitmap *delforks,
+ int ndelrels, bool isRedo);
/* md sync callbacks */
extern int mdsyncfiletag(const FileTag *ftag, char *path);
--
2.43.5
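The fork filtering in the DropRelationFiles() hunk above can be illustrated with a small standalone sketch. The ForkBitmap type and the FORKBITMAP_* macros below are assumed stand-ins written for this example (their real definitions are not shown in this part of the thread); only the fork numbering follows PostgreSQL's existing ForkNumber enum. The point is the convention that a NULL bitmap array means "all forks of every relation".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Fork numbering as in PostgreSQL's ForkNumber enum. */
typedef enum
{
    MAIN_FORKNUM = 0,
    FSM_FORKNUM,
    VISIBILITYMAP_FORKNUM,
    INIT_FORKNUM
} ForkNumber;
#define MAX_FORKNUM INIT_FORKNUM

/* Assumed stand-ins for the patch's bitmap helpers (hypothetical). */
typedef uint8_t ForkBitmap;
#define FORKBITMAP_BIT(f)       ((ForkBitmap) (1U << (f)))
#define FORKBITMAP_ALLFORKS()   ((ForkBitmap) ((1U << (MAX_FORKNUM + 1)) - 1))
#define FORKBITMAP_ISSET(b, f)  (((b) & FORKBITMAP_BIT(f)) != 0)

/*
 * Collect the forks of relation i that would be dropped.  A NULL delforks
 * array selects every fork, mirroring the "!delforks || ..." test in the
 * patch.  Returns the number of forks stored in out[].
 */
static int
forks_to_drop(const ForkBitmap *delforks, int i, ForkNumber *out)
{
    int n = 0;

    for (int fork = 0; fork <= MAX_FORKNUM; fork++)
    {
        if (!delforks || FORKBITMAP_ISSET(delforks[i], fork))
            out[n++] = (ForkNumber) fork;
    }
    return n;
}
```

With this convention, existing callers that never deal with partial drops can keep passing NULL and pay no extra allocation.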
v34-0014-Add-per-fork-deletion-support-to-pendingDeletes.patch
From 583f70a5da2e74a9485ce1a05f7d10f465bda0e5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 2 Aug 2024 21:39:11 +0900
Subject: [PATCH v34 14/16] Add per-fork deletion support to pendingDeletes
This patch introduces the ability to handle commit-time pending
deletes on a per-fork basis.
---
src/backend/access/transam/twophase.c | 4 +-
src/backend/access/transam/xact.c | 4 +-
src/backend/catalog/storage.c | 61 ++++++++++++++++++++++++---
src/include/catalog/storage.h | 3 +-
4 files changed, 60 insertions(+), 12 deletions(-)
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6ac1fbfa24..22e72fa1e4 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1100,9 +1100,9 @@ StartPrepare(GlobalTransaction gxact)
hdr.prepared_at = gxact->prepared_at;
hdr.owner = gxact->owner;
hdr.nsubxacts = xactGetCommittedChildren(&children);
- hdr.ncommitrels = smgrGetPendingDeletes(true, &commitrels);
+ hdr.ncommitrels = smgrGetPendingDeletes(true, &commitrels, &commitforks);
hdr.comhasforks = (commitforks != NULL);
- hdr.nabortrels = smgrGetPendingDeletes(false, &abortrels);
+ hdr.nabortrels = smgrGetPendingDeletes(false, &abortrels, &abortforks);
hdr.abohasforks = (abortforks != NULL);
hdr.ncommitstats =
pgstat_get_transactional_drops(true, &commitstats);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 3a594325d2..d77fa33fe1 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1346,7 +1346,7 @@ RecordTransactionCommit(void)
LogLogicalInvalidations();
/* Get data needed for commit record */
- nrels = smgrGetPendingDeletes(true, &rels);
+ nrels = smgrGetPendingDeletes(true, &rels, &forks);
nchildren = xactGetCommittedChildren(&children);
ndroppedstats = pgstat_get_transactional_drops(true, &droppedstats);
if (XLogStandbyInfoActive())
@@ -1799,7 +1799,7 @@ RecordTransactionAbort(bool isSubXact)
replorigin_session_origin != DoNotReplicateId);
/* Fetch the data we need for the abort record */
- nrels = smgrGetPendingDeletes(false, &rels);
+ nrels = smgrGetPendingDeletes(false, &rels, &forks);
nchildren = xactGetCommittedChildren(&children);
ndroppedstats = pgstat_get_transactional_drops(false, &droppedstats);
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 71ae5f9b08..dc76cd6bc6 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -80,6 +80,7 @@ int wal_skip_threshold = 2048; /* in kilobytes */
typedef struct PendingRelDelete
{
RelFileLocator rlocator; /* relation that may need to be deleted */
+ ForkBitmap forks; /* fork bitmap */
ProcNumber procNumber; /* INVALID_PROC_NUMBER if not a temp rel */
bool atCommit; /* T=delete at commit; F=delete at abort */
int nestLevel; /* xact nesting level of request */
@@ -295,6 +296,7 @@ RelationDropStorage(Relation rel)
pending = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
pending->rlocator = rel->rd_locator;
+ pending->forks = FORKBITMAP_ALLFORKS();
pending->procNumber = rel->rd_backend;
pending->atCommit = true; /* delete if commit */
pending->nestLevel = GetCurrentTransactionNestLevel();
@@ -360,6 +362,8 @@ RelationPreserveStorage(RelFileLocator rlocator, bool atCommit)
if (RelFileLocatorEquals(rlocator, pending->rlocator)
&& pending->atCommit == atCommit)
{
+ Assert(pending->forks == FORKBITMAP_ALLFORKS());
+
/* unlink and delete list entry */
if (prev)
prev->next = next;
@@ -683,7 +687,7 @@ SerializePendingSyncs(Size maxSize, char *startAddress)
/* remove deleted rnodes */
for (delete = pendingDeletes; delete != NULL; delete = delete->next)
- if (delete->atCommit)
+ if (delete->atCommit && delete->forks == FORKBITMAP_ALLFORKS())
(void) hash_search(tmphash, &delete->rlocator,
HASH_REMOVE, NULL);
@@ -737,6 +741,7 @@ smgrDoPendingDeletes(bool isCommit)
int nrels = 0,
maxrels = 0;
SMgrRelation *srels = NULL;
+ ForkBitmap *forks = NULL;
prev = NULL;
for (pending = pendingDeletes; pending != NULL; pending = next)
@@ -759,6 +764,8 @@ smgrDoPendingDeletes(bool isCommit)
{
SMgrRelation srel;
+ Assert(pending->forks == FORKBITMAP_ALLFORKS());
+
srel = smgropen(pending->rlocator, pending->procNumber);
/* allocate the initial array, or extend it, if needed */
@@ -771,8 +778,26 @@ smgrDoPendingDeletes(bool isCommit)
{
maxrels *= 2;
srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+
+ /* expand forks array if any */
+ if (forks)
+ forks = repalloc(forks, sizeof(ForkBitmap) * maxrels);
}
+ /* Create forks array on encountering partial forks. */
+ Assert((pending->forks & ~FORKBITMAP_ALLFORKS()) == 0);
+ if (!forks && pending->forks != FORKBITMAP_ALLFORKS())
+ {
+ forks = palloc(sizeof(ForkBitmap) * maxrels);
+
+ /* fill in the past elements */
+ for (int i = 0 ; i < nrels ; i++)
+ forks[i] = FORKBITMAP_ALLFORKS();
+ }
+
+ if (forks)
+ forks[nrels] = pending->forks;
+
srels[nrels++] = srel;
}
/* must explicitly free the list entry */
@@ -783,12 +808,15 @@ smgrDoPendingDeletes(bool isCommit)
if (nrels > 0)
{
- smgrdounlinkall(srels, NULL, nrels, false);
+ smgrdounlinkall(srels, forks, nrels, false);
for (int i = 0; i < nrels; i++)
smgrclose(srels[i]);
pfree(srels);
+
+ if (forks)
+ pfree(forks);
}
}
@@ -948,34 +976,53 @@ smgrDoPendingSyncs(bool isCommit, bool isParallelWorker)
* by upper-level transactions.
*/
int
-smgrGetPendingDeletes(bool forCommit, RelFileLocator **ptr)
+smgrGetPendingDeletes(bool forCommit, RelFileLocator **ptr, ForkBitmap **fptr)
{
int nestLevel = GetCurrentTransactionNestLevel();
int nrels;
+ bool hasforks = false;
RelFileLocator *rptr;
+ ForkBitmap *rfptr = NULL;
PendingRelDelete *pending;
nrels = 0;
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
- if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
- && pending->procNumber == INVALID_PROC_NUMBER)
+ Assert((pending->forks & ~FORKBITMAP_ALLFORKS()) == 0);
+
+ if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit)
+ {
nrels++;
+
+ if (pending->forks != FORKBITMAP_ALLFORKS())
+ hasforks = true;
+ }
}
if (nrels == 0)
{
*ptr = NULL;
+ *fptr = NULL;
return 0;
}
rptr = (RelFileLocator *) palloc(nrels * sizeof(RelFileLocator));
*ptr = rptr;
+
+ if (hasforks)
+ rfptr = (ForkBitmap *) palloc(nrels * sizeof(ForkBitmap));
+ *fptr = rfptr;
+
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
- if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
- && pending->procNumber == INVALID_PROC_NUMBER)
+ if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit)
{
*rptr = pending->rlocator;
rptr++;
+
+ if (rfptr)
+ {
+ *rfptr = pending->forks;
+ rfptr++;
+ }
}
}
return nrels;
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 3451d6ac80..15e35831e9 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -44,7 +44,8 @@ extern void RestorePendingSyncs(char *startAddress);
*/
extern void smgrDoPendingDeletes(bool isCommit);
extern void smgrDoPendingSyncs(bool isCommit, bool isParallelWorker);
-extern int smgrGetPendingDeletes(bool forCommit, RelFileLocator **ptr);
+extern int smgrGetPendingDeletes(bool forCommit,
+ RelFileLocator **ptr, ForkBitmap **fptr);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
--
2.43.5
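The lazy allocation used by smgrDoPendingDeletes() in the patch above (create the forks array only when the first partial-fork entry is seen, then backfill the earlier slots) can be sketched in isolation as follows. This is a simplified illustration, not the patch's code: it uses malloc() instead of palloc(), a plain array instead of the pendingDeletes list, and a hypothetical ALLFORKS constant in place of FORKBITMAP_ALLFORKS().

```c
#include <stdint.h>
#include <stdlib.h>

typedef uint8_t ForkBitmap;             /* assumed stand-in type */
#define ALLFORKS ((ForkBitmap) 0x0F)    /* hypothetical: bits for forks 0..3 */

/*
 * Build the per-relation fork array for n pending entries.  Returns NULL
 * when every entry covers all forks, so the common case allocates nothing.
 * Otherwise returns a malloc'ed array of n bitmaps that the caller frees.
 */
static ForkBitmap *
collect_fork_bitmaps(const ForkBitmap *pending, int n)
{
    ForkBitmap *forks = NULL;

    for (int i = 0; i < n; i++)
    {
        /* Create the array on encountering the first partial entry. */
        if (!forks && pending[i] != ALLFORKS)
        {
            forks = malloc(sizeof(ForkBitmap) * (size_t) n);
            if (!forks)
                abort();        /* NULL must keep meaning "all forks" */

            /* Fill in the entries already passed over. */
            for (int j = 0; j < i; j++)
                forks[j] = ALLFORKS;
        }

        if (forks)
            forks[i] = pending[i];
    }
    return forks;
}
```

The sketch aborts on allocation failure because returning NULL there would silently turn a partial drop into a drop of all forks.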
v34-0015-Allow-init-fork-to-be-dropped.patch
From 34112ffc6df61550564c46dd88dfdf4ce3d4ac59 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 16 Aug 2024 23:35:43 +0900
Subject: [PATCH v34 15/16] Allow init fork to be dropped
---
src/backend/catalog/storage.c | 58 +++++++++++++++++++++++++++++++----
src/include/catalog/storage.h | 1 +
2 files changed, 53 insertions(+), 6 deletions(-)
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index dc76cd6bc6..9914717935 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -192,11 +192,25 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
* aborts or server crashes later on, the fork will be removed. If the caller
* plans to remove the fork in another way, it should pass false. Additionally,
* it is WAL-logged if wal_log is true.
+ *
+ * Returns true if the storage file was actually created. False means the file
+ * already existed.
*/
void
RelationCreateFork(SMgrRelation srel, ForkNumber forkNum,
bool wal_log, bool undo_log)
{
+#ifdef USE_ASSERT_CHECKING
+ /* we must not have pending delete for the init fork. */
+ if (forkNum == INIT_FORKNUM)
+ {
+ for (PendingRelDelete *p = pendingDeletes ; p != NULL ; p = p->next)
+ Assert(!FORKBITMAP_ISSET(p->forks, INIT_FORKNUM) ||
+ !RelFileLocatorEquals(srel->smgr_rlocator.locator,
+ p->rlocator));
+ }
+#endif
+
/* Schedule the removal of this init fork at abort if requested. */
if (undo_log)
ulog_smgrcreate(srel, forkNum);
@@ -206,9 +220,32 @@ RelationCreateFork(SMgrRelation srel, ForkNumber forkNum,
log_smgrcreate(&srel->smgr_rlocator.locator, forkNum);
smgrcreate(srel, forkNum, false);
-
}
+/*
+ * RelationDropInitFork
+ * Delete physical storage for the init fork of a relation.
+ */
+void
+RelationDropInitFork(SMgrRelation srel)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+ RelFileLocator rlocator = srel->smgr_rlocator.locator;
+ ProcNumber procNumber = srel->smgr_rlocator.backend;
+ PendingRelDelete *pending;
+
+ /* Schedule the removal of this init fork at commit. */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->rlocator = rlocator;
+ pending->procNumber = procNumber;
+ pending->forks = FORKBITMAP_BIT(INIT_FORKNUM);
+ pending->atCommit = true;
+ pending->nestLevel = nestLevel;
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+ }
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -764,8 +801,6 @@ smgrDoPendingDeletes(bool isCommit)
{
SMgrRelation srel;
- Assert(pending->forks == FORKBITMAP_ALLFORKS());
-
srel = smgropen(pending->rlocator, pending->procNumber);
/* allocate the initial array, or extend it, if needed */
@@ -1064,8 +1099,18 @@ AtSubCommit_smgr(void)
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
- if (pending->nestLevel >= nestLevel)
- pending->nestLevel = nestLevel - 1;
+ if (pending->nestLevel < nestLevel)
+ {
+#ifdef USE_ASSERT_CHECKING
+ /* all the remaining entries must be of upper subtransactions */
+ for (; pending ; pending = pending->next)
+ Assert(pending->nestLevel < nestLevel);
+#endif
+ break;
+ }
+
+ /* move this entry to the immediately upper subtransaction */
+ pending->nestLevel = nestLevel - 1;
}
}
@@ -1195,7 +1240,8 @@ smgr_redo(XLogReaderState *record)
SMgrRelation reln;
reln = smgropen(xlrec->rlocator, INVALID_PROC_NUMBER);
- SetRelationBuffersPersistence(reln, xlrec->persistence);
+ SetRelationBuffersPersistenceRedo(reln, xlrec->persistence,
+ XLogRecGetXid(record));
}
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 15e35831e9..8629a13706 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -27,6 +27,7 @@ extern SMgrRelation RelationCreateStorage(RelFileLocator rlocator,
bool register_delete);
extern void RelationCreateFork(SMgrRelation srel, ForkNumber forkNum,
bool wal_log, bool undo_log);
+extern void RelationDropInitFork(SMgrRelation srel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileLocator rlocator, bool atCommit);
extern void RelationPreTruncate(Relation rel);
--
2.43.5
v34-0016-In-place-persistence-change-to-LOGGED.patch
From a15c6026b06d4e2d7b7c1629da92bc2b78a64809 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 27 Aug 2024 10:44:46 +0900
Subject: [PATCH v34 16/16] In-place persistence change to LOGGED
---
src/backend/commands/tablecmds.c | 27 +++++++-----
src/test/recovery/t/044_persistence_change.pl | 43 ++++++++++---------
2 files changed, 40 insertions(+), 30 deletions(-)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 05364365a7..2f443fd4a6 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -5730,9 +5730,6 @@ RelationChangePersistence(AlteredTableInfo *tab, char persistence,
continue;
}
- /* Currently, only allowing changes to UNLOGGED. */
- Assert(!persistent);
-
RelationAssumePersistenceChange(r);
/* switch buffer persistence */
@@ -5740,11 +5737,22 @@ RelationChangePersistence(AlteredTableInfo *tab, char persistence,
log_smgrbufpersistence(srel->smgr_rlocator.locator, persistent);
SetRelationBuffersPersistence(srel, persistent);
- /* then create the init fork */
- is_index = (r->rd_rel->relkind == RELKIND_INDEX);
- RelationCreateFork(srel, INIT_FORKNUM, !is_index, true);
- if (is_index)
- r->rd_indam->ambuildempty(r);
+ /* then create or drop the init fork */
+ if (persistent)
+ RelationDropInitFork(srel);
+ else
+ {
+ is_index = (r->rd_rel->relkind == RELKIND_INDEX);
+
+ /*
+ * If it is an index, have access methods initialize the file. In
+ * that case, WAL-logging is expected to be performed by the
+ * ambuildempty() method.
+ */
+ RelationCreateFork(srel, INIT_FORKNUM, !is_index, true);
+ if (is_index)
+ r->rd_indam->ambuildempty(r);
+ }
/* Update catalog */
tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
@@ -5903,8 +5911,7 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE &&
- persistence == RELPERSISTENCE_UNLOGGED)
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
{
/* Make in-place persistence change. */
RelationChangePersistence(tab, persistence, lockmode);
diff --git a/src/test/recovery/t/044_persistence_change.pl b/src/test/recovery/t/044_persistence_change.pl
index 7f5ef84667..2356d61251 100644
--- a/src/test/recovery/t/044_persistence_change.pl
+++ b/src/test/recovery/t/044_persistence_change.pl
@@ -100,8 +100,8 @@ max_prepared_transactions = 2
# Check if SET LOGGED didn't change relfilenumbers and data survive a crash
my $relfilenodes3 = getrelfilenodes($node, \@relnames);
- ok (!checkrelfilenodes($relfilenodes2, $relfilenodes3),
- "crashed SET-LOGGED relations have sane relfilenodes transition");
+ ok (checkrelfilenodes($relfilenodes2, $relfilenodes3),
+ "crashed SET-LOGGED relations have sane relfilenodes transition");
is ($node->safe_psql('postgres', "SELECT count(*) FROM t;"), 2000,
"crashed SET-LOGGED table does not lose data");
@@ -147,34 +147,35 @@ max_prepared_transactions = 2
"storages are reverted to logged state");
### Subtransactions
- ok ($node->psql('postgres',
+ my ($ret, $stdout, $stderr) =
+ $node->psql('postgres',
qq(
BEGIN;
ALTER TABLE t SET UNLOGGED; -- committed
SAVEPOINT a;
- ALTER TABLE t SET LOGGED; -- aborted
+ ALTER TABLE t SET LOGGED; -- ERROR
SAVEPOINT b;
ROLLBACK TO a;
COMMIT;
- )) != 3,
- "command succeeds 1");
-
- is ($node->safe_psql('postgres', "SELECT count(*) FROM t;"), 2000,
- "table data is not changed 1");
- ok (check_storage_state(\&is_unlogged_state, $node, \@relnames),
- "storages are changed to unlogged state");
+ ));
+ ok ($stderr =~ m/persistence of this relation has been already changed/,
+ "errors out when double flip occurred in a single transaction");
+ ok (check_storage_state(\&is_logged_state, $node, \@relnames),
+ "storages stay in logged state");
ok ($node->psql('postgres',
qq(
+ ALTER TABLE t SET UNLOGGED;
BEGIN;
+ SAVEPOINT a;
ALTER TABLE t SET LOGGED; -- aborted
+ ROLLBACK TO a;
SAVEPOINT a;
- ALTER TABLE t SET UNLOGGED; -- aborted
- SAVEPOINT b;
+ ALTER TABLE t SET LOGGED; -- no error
RELEASE a;
ROLLBACK;
)) != 3,
- "command succeeds 2");
+ "rolled-back persistence flip doesn't prevent subsequent flips");
is ($node->safe_psql('postgres', "SELECT count(*) FROM t;"), 2000,
"table data is not changed 2");
@@ -182,7 +183,7 @@ max_prepared_transactions = 2
"storages stay in unlogged state");
### Prepared transactions
- my ($ret, $stdout, $stderr) =
+ ($ret, $stdout, $stderr) =
$node->psql('postgres',
qq(
ALTER TABLE t SET LOGGED;
@@ -208,16 +209,17 @@ max_prepared_transactions = 2
ok ($stderr =~ m/cannot prepare transaction if persistence change/,
"errors out when persistence-flipped xact is prepared 2");
ok (check_storage_state(\&is_logged_state, $node, \@relnames),
- "storages stay in logged state");
+ "storages stay in logged state 2");
### Error out DML
- $node->psql('postgres',
+ ok($node->psql('postgres',
qq(
BEGIN;
- ALTER TABLE t SET LOGGED;
+ ALTER TABLE t SET LOGGED; -- no effect
INSERT INTO t VALUES(1); -- Succeeds
COMMIT;
- ));
+ )) != 3,
+ "ineffective persistence change doesn't prevent DML");
($ret, $stdout, $stderr) =
$node->psql('postgres',
@@ -233,7 +235,7 @@ max_prepared_transactions = 2
qq(
BEGIN;
SAVEPOINT a;
- ALTER TABLE t SET UNLOGGED;
+ ALTER TABLE t SET LOGGED;
ROLLBACK TO a;
INSERT INTO t VALUES(3); -- Succeeds
COMMIT;
@@ -243,6 +245,7 @@ max_prepared_transactions = 2
($ret, $stdout, $stderr) =
$node->psql('postgres',
qq(
+ ALTER TABLE t SET LOGGED;
BEGIN;
SAVEPOINT a;
ALTER TABLE t SET UNLOGGED;
--
2.43.5
On 31/08/2024 19:09, Kyotaro Horiguchi wrote:
- UNDO log(0002): This handles file deletion during transaction aborts,
which was previously managed, in part, by the commit XLOG record at
the end of a transaction.
- Prevent orphan files after a crash (0005): This is another use-case
of the UNDO log system.
Nice, I'm very excited if we can fix that long-standing issue! I'll try
to review this properly later, but at a quick 5 minute glance, one thing
caught my eye:
This requires fsync()ing the per-xid undo log file every time a relation
is created. I fear that can be a pretty big performance hit for
workloads that repeatedly create and drop small tables. Especially if
they're otherwise running with synchronous_commit=off. Instead of
flushing the undo log file after every write, I'd suggest WAL-logging
the undo log like regular relations and SLRUs. So before writing the
entry to the undo log, WAL-log it. And with a little more effort, you
could postpone creating the files altogether until a checkpoint happens,
similar to how twophase state files are checkpointed nowadays.
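As a rough illustration of the suggestion above, here is a toy in-memory model. Everything in it is hypothetical (the UndoEntry struct, the function names, the LSN counter); it is not code from the patch set and not how twophase.c is actually written. It only shows the ordering idea: each undo entry gets a WAL position when it is recorded, nothing is written or fsync'ed at that point, and a checkpoint later persists only entries that belong to still-running transactions and whose WAL position is at or before the checkpoint's redo point.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_ENTRIES 16

/* Hypothetical in-memory undo entry. */
typedef struct UndoEntry
{
    uint32_t xid;       /* owning transaction */
    uint64_t lsn;       /* position of the WAL record describing the entry */
    bool     finished;  /* owning transaction has committed or aborted */
    bool     on_disk;   /* entry has been written to an undo file */
} UndoEntry;

static UndoEntry entries[MAX_ENTRIES];
static int       nentries = 0;
static uint64_t  next_lsn = 1;   /* stand-in for the WAL insert position */

/* Record an undo entry: "WAL-log" it and keep it in memory only. */
static uint64_t
undo_record(uint32_t xid)
{
    UndoEntry *e;

    if (nentries >= MAX_ENTRIES)
        return 0;               /* toy model: table full */
    e = &entries[nentries++];
    e->xid = xid;
    e->lsn = next_lsn++;
    e->finished = false;
    e->on_disk = false;
    return e->lsn;
}

/* Transaction end: its entries never need to reach an undo file. */
static void
undo_finish(uint32_t xid)
{
    for (int i = 0; i < nentries; i++)
        if (entries[i].xid == xid)
            entries[i].finished = true;
}

/*
 * Checkpoint: persist entries that are still needed and that WAL replay
 * starting at redo_lsn could no longer reconstruct.  Returns how many
 * entries were written.
 */
static int
undo_checkpoint(uint64_t redo_lsn)
{
    int written = 0;

    for (int i = 0; i < nentries; i++)
    {
        UndoEntry *e = &entries[i];

        if (!e->finished && !e->on_disk && e->lsn <= redo_lsn)
        {
            e->on_disk = true;  /* a real system would write and fsync here */
            written++;
        }
    }
    return written;
}
```

In this model a transaction that creates and drops a table between two checkpoints never causes an undo file write, which is the property the suggestion is after for short-lived tables.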
I wonder if the twophase state files and undo log files should be merged
into one file. They're similar in many ways: there's one file per
transaction, named using the XID. I haven't thought this fully through,
just a thought..
+static void
+undolog_set_filename(char *buf, TransactionId xid)
+{
+    snprintf(buf, MAXPGPATH, "%s/%08x", SIMPLE_UNDOLOG_DIR, xid);
+}
I'd suggest using FullTransactionId. Doesn't matter much, but seems like
a good future-proofing.
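To illustrate the naming point, here is a minimal standalone sketch, not code from the patch. All `sketch_*` names and the directory string are placeholders, and the 64-bit value is simply assumed to be the epoch in the high 32 bits and the xid in the low 32 bits. It only shows that a 32-bit name can repeat after wraparound while a 64-bit name stays distinct.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Placeholder directory name, for illustration only. */
#define SKETCH_UNDOLOG_DIR "pg_ulog"

/* Assumed layout: epoch in the high 32 bits, xid in the low 32 bits. */
static uint64_t
sketch_full_xid(uint32_t epoch, uint32_t xid)
{
    return ((uint64_t) epoch << 32) | xid;
}

/* 32-bit naming: the same name comes back once the xid counter wraps. */
static void
sketch_name_from_xid(char *buf, size_t len, uint32_t xid)
{
    assert(len > strlen(SKETCH_UNDOLOG_DIR) + 1 + 8);
    snprintf(buf, len, "%s/%08" PRIx32, SKETCH_UNDOLOG_DIR, xid);
}

/* 64-bit naming: the epoch becomes part of the file name. */
static void
sketch_name_from_full_xid(char *buf, size_t len, uint64_t fxid)
{
    assert(len > strlen(SKETCH_UNDOLOG_DIR) + 1 + 16);
    snprintf(buf, len, "%s/%016" PRIx64, SKETCH_UNDOLOG_DIR, fxid);
}
```

With the 32-bit form, xid 0x10 in epoch 0 and xid 0x10 in epoch 1 map to the same file name. With the 64-bit form they differ in the leading eight hex digits.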
--
Heikki Linnakangas
Neon (https://neon.tech)
On Sun, Sep 01, 2024 at 10:15:00PM +0300, Heikki Linnakangas wrote:
I wonder if the twophase state files and undo log files should be merged
into one file. They're similar in many ways: there's one file per
transaction, named using the XID. I haven't thought this fully through, just
a thought..
Hmm. It could be possible to extract some of this knowledge out of
twophase.c and design some APIs that could be used for both, but would
that be really necessary? The 2PC data and the LSNs used by the files
to check if things are replayed or on disk rely on
GlobalTransactionData that has its own idea of things and timings at
recovery.
Or perhaps your point is actually to do that and add one layer for the
file handlings and their flush timings? I am not sure, TBH, what this
thread is trying to fix is complicated enough that it may be better to
live with two different code paths. But perhaps my gut feeling is
just wrong reading your paragraph.
--
Michael
Hello.
Thank you for the response.
At Sun, 1 Sep 2024 22:15:00 +0300, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
On 31/08/2024 19:09, Kyotaro Horiguchi wrote:
- UNDO log(0002): This handles file deletion during transaction aborts,
which was previously managed, in part, by the commit XLOG record at
the end of a transaction.
- Prevent orphan files after a crash (0005): This is another use-case
of the UNDO log system.
Nice, I'm very excited if we can fix that long-standing issue! I'll
try to review this properly later, but at a quick 5 minute glance, one
thing caught my eye:
This requires fsync()ing the per-xid undo log file every time a
relation is created. I fear that can be a pretty big performance hit
for workloads that repeatedly create and drop small tables. Especially
I initially thought that one additional file manipulation during file
creation wouldn't be an issue. However, the created storage file isn't
being synced, so your concern seems valid.
if they're otherwise running with synchronous_commit=off. Instead of
flushing the undo log file after every write, I'd suggest WAL-logging
the undo log like regular relations and SLRUs. So before writing the
entry to the undo log, WAL-log it. And with a little more effort, you
could postpone creating the files altogether until a checkpoint
happens, similar to how twophase state files are checkpointed
nowadays.
I thought that an UNDO log file not flushed before the last checkpoint
might not survive a system crash. However, including UNDO files in the
checkpointing process resolves that concern. Thank you for the
suggestion.
I wonder if the twophase state files and undo log files should be
merged into one file. They're similar in many ways: there's one file
per transaction, named using the XID. I haven't thought this fully
through, just a thought..
To be precise, UNDO log files are created per subtransaction, unlike
twophase files. It might be possible if we allow the twophase files
(as they are currently named) to be overwritten or modified at every
subcommit. If ULOG contents are WAL-logged, these two things will
become even more similar. However, I'm not planning to include that in
the next version for now.
+static void
+undolog_set_filename(char *buf, TransactionId xid)
+{
+    snprintf(buf, MAXPGPATH, "%s/%08x", SIMPLE_UNDOLOG_DIR, xid);
+}
I'd suggest using FullTransactionId. Doesn't matter much, but seems
like a good future-proofing.
Agreed. Will fix it in the next version.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Mon, 2 Sep 2024 09:30:20 +0900, Michael Paquier <michael@paquier.xyz> wrote in
On Sun, Sep 01, 2024 at 10:15:00PM +0300, Heikki Linnakangas wrote:
I wonder if the twophase state files and undo log files should be merged
into one file. They're similar in many ways: there's one file per
transaction, named using the XID. I haven't thought this fully through, just
a thought..
Hmm. It could be possible to extract some of this knowledge out of
twophase.c and design some APIs that could be used for both, but would
that be really necessary? The 2PC data and the LSNs used by the files
to check if things are replayed or on disk rely on
GlobalTransactionData that has its own idea of things and timings at
recovery.
I'm not sure, but I think Heikki was only suggesting that the two
share a file format and in/out functions, provided the formats
overlap sufficiently.
Or perhaps your point is actually to do that and add one layer for the
file handlings and their flush timings? I am not sure, TBH, what this
thread is trying to fix is complicated enough that it may be better to
live with two different code paths. But perhaps my gut feeling is
just wrong reading your paragraph.
I believe this statement is valid, so I’m not in a hurry to do this.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 31/08/2024 19:09, Kyotaro Horiguchi wrote:
Subject: [PATCH v34 03/16] Remove function for retaining files on outer
transaction aborts
The function RelationPreserveStorage() was initially created to keep
storage files committed in a subtransaction on the abort of outer
transactions. It was introduced by commit b9b8831ad6 in 2010, but no
use case for this behavior has emerged since then. If we move the
at-commit removal feature of storage files from pendingDeletes to the
UNDO log system, the UNDO system would need to accept the cancellation
of already logged entries, which makes the system overly complex with
no benefit. Therefore, remove the feature.
I don't think that's quite right. I don't think this was meant for
subtransaction aborts, but to make sure that if the top-transaction
aborts after AtEOXact_RelationMap() has already been called, we don't
remove the new relation. AtEOXact_RelationMap() is called very late in
the commit process to keep the window as small as possible, but if it
nevertheless happens, the consequences are pretty bad if you remove a
relation file that is in fact needed.
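To make the concern concrete, here is a simplified sketch of a pending-delete list with a "preserve" step. It is a hand-written model, not the storage.c code: the `sketch_*` names, the fixed-size array, and the counting of unlinks instead of real file removal are all assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_PENDING 8

typedef struct SketchPendingDelete
{
    unsigned    relnumber;  /* which relation file this entry refers to */
    bool        at_commit;  /* true: unlink at commit, false: at abort */
    bool        valid;      /* slot in use? */
} SketchPendingDelete;

static SketchPendingDelete sketch_pending[SKETCH_MAX_PENDING];

/* A newly created file is scheduled for removal if the transaction aborts. */
static void
sketch_schedule_unlink_on_abort(unsigned relnumber)
{
    for (size_t i = 0; i < SKETCH_MAX_PENDING; i++)
    {
        if (!sketch_pending[i].valid)
        {
            sketch_pending[i].relnumber = relnumber;
            sketch_pending[i].at_commit = false;
            sketch_pending[i].valid = true;
            return;
        }
    }
    assert(0 && "pending list full");
}

/* "Preserve": forget the abort-time removal of one relation file. */
static void
sketch_preserve_storage(unsigned relnumber)
{
    for (size_t i = 0; i < SKETCH_MAX_PENDING; i++)
    {
        if (sketch_pending[i].valid &&
            sketch_pending[i].relnumber == relnumber &&
            !sketch_pending[i].at_commit)
            sketch_pending[i].valid = false;
    }
}

/* End of transaction: count the files that would be unlinked, then reset. */
static int
sketch_do_pending_deletes(bool is_commit)
{
    int         unlinked = 0;

    for (size_t i = 0; i < SKETCH_MAX_PENDING; i++)
    {
        if (!sketch_pending[i].valid)
            continue;
        if (sketch_pending[i].at_commit == is_commit)
            unlinked++;
        sketch_pending[i].valid = false;
    }
    return unlinked;
}
```

In this model, a file that has been "preserved" survives a late abort. That is the property at stake when the relation map has already been updated to point at the new file.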
--
Heikki Linnakangas
Neon (https://neon.tech)
At Sun, 1 Sep 2024 22:15:00 +0300, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
On 31/08/2024 19:09, Kyotaro Horiguchi wrote:
This requires fsync()ing the per-xid undo log file every time a
relation is created. I fear that can be a pretty big performance hit
for workloads that repeatedly create and drop small tables. Especially
if they're otherwise running with synchronous_commit=off. Instead of
flushing the undo log file after every write, I'd suggest WAL-logging
the undo log like regular relations and SLRUs. So before writing the
entry to the undo log, WAL-log it. And with a little more effort, you
could postpone creating the files altogether until a checkpoint
happens, similar to how twophase state files are checkpointed
nowadays.
After some delays, here’s the new version. In this update, UNDO logs
are WAL-logged and processed in memory under most conditions. During
checkpoints, they’re flushed to files, which are then read when a
specific XID’s UNDO log is accessed for the first time during
recovery.
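As a rough illustration of the cost model described above (not the patch's code), the following sketch counts WAL inserts and file syncs. The `sketch_*` names, the integer payloads, and the counters are assumptions made for the example; reading the file back during recovery is not modeled.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_RECS 16

typedef struct SketchUndoLog
{
    uint64_t    fxid;                   /* owning transaction (illustrative) */
    int         nrecs;                  /* records held in memory */
    int         nflushed;               /* records already written to the file */
    int         recs[SKETCH_MAX_RECS];  /* stand-in for record payloads */
} SketchUndoLog;

static int  sketch_wal_records;     /* WAL records emitted so far */
static int  sketch_file_syncs;      /* fsync-like operations so far */

/* Add an undo record: WAL-log it first, then keep it in memory. No fsync. */
static void
sketch_undo_append(SketchUndoLog *log, int payload)
{
    assert(log->nrecs < SKETCH_MAX_RECS);
    sketch_wal_records++;           /* stands in for the WAL insert */
    log->recs[log->nrecs++] = payload;
}

/* Checkpoint: write out what is not yet on disk, at most one sync per log. */
static void
sketch_checkpoint(SketchUndoLog *log)
{
    if (log->nflushed < log->nrecs)
    {
        log->nflushed = log->nrecs;
        sketch_file_syncs++;
    }
}

/* Commit: the undo information is no longer needed and is dropped. */
static void
sketch_commit(SketchUndoLog *log)
{
    log->nrecs = 0;
    log->nflushed = 0;
}
```

A transaction that starts and commits between two checkpoints never causes a file sync in this model, which is the case Heikki's performance concern was about.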
The biggest changes are in patches 0001 through 0004 (equivalent to
the previous 0001-0002). After that, there aren’t any major
changes. Since this update involves removing some existing features,
I’ve split these parts into multiple smaller identity transformations
to make them clearer.
As for changes beyond that, the main one is lifting the previous
restriction on PREPARE for transactions after a persistence
change. This was made possible because, with the shift to in-memory
processing of UNDO logs, commit-time crash recovery detection is now
simpler. Additional changes include completely removing the
abort-handling portion from the pendingDeletes mechanism (0008-0010).
I'd suggest using FullTransactionId. Doesn't matter much, but seems
like a good future-proofing.
And, the patchset now uses full transaction ids.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v35-0001-Add-XLOG-resource-for-the-undo-log-system.patch (text/x-patch; charset=us-ascii)
From 5889454424d6047a8d7c3b9f96f6dd23a54f13a8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Mon, 30 Sep 2024 16:31:02 +0900
Subject: [PATCH v35 01/21] Add XLOG resource for the undo log system
In the upcoming UNDO log system, XLOG will be used to persist UNDO log
information. This commit adds the necessary XLOG components, leaving
out the main part of the UNDO log, to provide a minimal implementation
for easier review.
---
src/backend/access/rmgrdesc/Makefile | 1 +
src/backend/access/rmgrdesc/meson.build | 1 +
src/backend/access/rmgrdesc/undologdesc.c | 99 +++++++++++++++++++++++
src/backend/access/transam/Makefile | 1 +
src/backend/access/transam/meson.build | 1 +
src/backend/access/transam/rmgr.c | 3 +-
src/backend/access/transam/undolog.c | 38 +++++++++
src/bin/pg_rewind/parsexlog.c | 2 +-
src/bin/pg_waldump/rmgrdesc.c | 3 +-
src/include/access/rmgr.h | 2 +-
src/include/access/rmgrlist.h | 47 +++++------
src/include/access/undolog.h | 82 +++++++++++++++++++
src/tools/pgindent/typedefs.list | 5 ++
13 files changed, 258 insertions(+), 27 deletions(-)
create mode 100644 src/backend/access/rmgrdesc/undologdesc.c
create mode 100644 src/backend/access/transam/undolog.c
create mode 100644 src/include/access/undolog.h
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index cd95eec37f1..542fd3d6a8e 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -29,6 +29,7 @@ OBJS = \
spgdesc.o \
standbydesc.o \
tblspcdesc.o \
+ undologdesc.o \
xactdesc.o \
xlogdesc.o
diff --git a/src/backend/access/rmgrdesc/meson.build b/src/backend/access/rmgrdesc/meson.build
index e8b7a65fc76..d19c2c3b7ca 100644
--- a/src/backend/access/rmgrdesc/meson.build
+++ b/src/backend/access/rmgrdesc/meson.build
@@ -22,6 +22,7 @@ rmgr_desc_sources = files(
'spgdesc.c',
'standbydesc.c',
'tblspcdesc.c',
+ 'undologdesc.c',
'xactdesc.c',
'xlogdesc.c',
)
diff --git a/src/backend/access/rmgrdesc/undologdesc.c b/src/backend/access/rmgrdesc/undologdesc.c
new file mode 100644
index 00000000000..d717646d2e0
--- /dev/null
+++ b/src/backend/access/rmgrdesc/undologdesc.c
@@ -0,0 +1,99 @@
+/*-------------------------------------------------------------------------
+ *
+ * undologdesc.c
+ * rmgr descriptor routines for access/transam/undolog.c
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/rmgrdesc/undologdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/undolog.h"
+
+typedef struct UndoDescData
+{
+ const char *rm_name;
+ void (*rm_undodesc) (StringInfo buf, UndoLogRecord *record);
+ const char *(*rm_undoidentify) (uint8 info);
+} UndoDescData;
+
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo,undo_desc,undo_identify,undo_cleanup_init,undo_recoveryend) \
+ { name, undo_desc, undo_identify },
+
+static UndoDescData UndoRoutines[RM_MAX_ID + 1] = {
+#include "access/rmgrlist.h"
+};
+#undef PG_RMGR
+
+void
+undolog_desc(StringInfo buf, XLogReaderState *record)
+{
+ char *rec = XLogRecGetData(record);
+ uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_ULOG_CREATE)
+ {
+ xl_ulog_create *crec = (xl_ulog_create *) rec;
+ char fname[MAXPGPATH];
+
+ UndoLogSetFilename(fname, crec->xid);
+ appendStringInfo(buf, "\"%s\"", fname);
+ }
+ else if (info == XLOG_ULOG_WRITE)
+ {
+ xl_ulog_write *wrec = (xl_ulog_write *) rec;
+ UndoLogRecord *urec = (UndoLogRecord *) wrec->bytes;
+
+ /*
+ * The file header and records are recovered in the same way without
+ * using resource manager routines. However, while description routines
+ * are typically provided as resource routines, the file header does
+ * not have one. Therefore, it requires explicit handling here.
+ */
+ if (wrec->off == 0)
+ {
+ /* This is the file header. No extra data is currently stored. */
+ appendStringInfo(buf, "HEADER");
+ }
+ else
+ {
+ /* This is a ulog record. Let rmgr routines handle it. */
+ UndoDescData rmgr = UndoRoutines[urec->ul_rmid];
+ const char *id = rmgr.rm_undoidentify(ULogRecGetInfo(urec));
+
+ Assert(UndoRoutines[urec->ul_rmid].rm_undoidentify);
+
+ if (id == NULL)
+ appendStringInfo(buf, "UNKNOWN (%X): ",
+ ULogRecGetInfo(urec));
+ else
+ appendStringInfo(buf, "%s: ", id);
+
+ if (UndoRoutines[urec->ul_rmid].rm_undodesc)
+ UndoRoutines[urec->ul_rmid].rm_undodesc(buf, urec);
+ }
+ }
+}
+
+const char *
+undolog_identify(uint8 info)
+{
+ const char *id = NULL;
+
+ switch (info & ~XLR_INFO_MASK)
+ {
+ case XLOG_ULOG_CREATE:
+ id = "CREATE";
+ break;
+ case XLOG_ULOG_WRITE:
+ id = "WRITE";
+ break;
+ }
+
+ return id;
+}
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index a32f473e0a2..58c029741e2 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -25,6 +25,7 @@ OBJS = \
transam.o \
twophase.o \
twophase_rmgr.o \
+ undolog.o \
varsup.o \
xact.o \
xlog.o \
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index 91d258f9df1..d7ac4e1aead 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -13,6 +13,7 @@ backend_sources += files(
'transam.c',
'twophase.c',
'twophase_rmgr.c',
+ 'undolog.c',
'varsup.c',
'xact.c',
'xlog.c',
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 1b7499726eb..2fd27d1801d 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -30,6 +30,7 @@
#include "access/multixact.h"
#include "access/nbtxlog.h"
#include "access/spgxlog.h"
+#include "access/undolog.h"
#include "access/xact.h"
#include "catalog/storage_xlog.h"
#include "commands/dbcommands_xlog.h"
@@ -44,7 +45,7 @@
/* must be kept in sync with RmgrData definition in xlog_internal.h */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo,undo_desc,undo_identify,undo_cleanup_init,undo_recoveryend) \
{ name, redo, desc, identify, startup, cleanup, mask, decode },
RmgrData RmgrTable[RM_MAX_ID + 1] = {
diff --git a/src/backend/access/transam/undolog.c b/src/backend/access/transam/undolog.c
new file mode 100644
index 00000000000..c32f5cd0b6f
--- /dev/null
+++ b/src/backend/access/transam/undolog.c
@@ -0,0 +1,38 @@
+/*-------------------------------------------------------------------------
+ *
+ * undolog.c
+ * Undo log manager for PostgreSQL
+ *
+ * This module logs the cleanup procedures required during a transaction abort.
+ * The information is recorded in WAL-logged files to ensure post-crash
+ * recovery runs the necessary cleanup procedures.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/undolog.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/undolog.h"
+
+/*
+ * undolog_redo()
+ *
+ * Recovery routine for undo logs.
+ */
+void
+undolog_redo(XLogReaderState *record)
+{
+ uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_ULOG_CREATE)
+ {
+ }
+ else if (info == XLOG_ULOG_WRITE)
+ {
+ }
+}
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 22f7351fdcd..7b541137dd4 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -28,7 +28,7 @@
* RmgrNames is an array of the built-in resource manager names, to make error
* messages a bit nicer.
*/
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo,undo_desc,undo_identify,undo_cleanup_init,undo_recoveryend) \
name,
static const char *const RmgrNames[RM_MAX_ID + 1] = {
diff --git a/src/bin/pg_waldump/rmgrdesc.c b/src/bin/pg_waldump/rmgrdesc.c
index 6b8c17bb4c4..d17634c9bd9 100644
--- a/src/bin/pg_waldump/rmgrdesc.c
+++ b/src/bin/pg_waldump/rmgrdesc.c
@@ -20,6 +20,7 @@
#include "access/nbtxlog.h"
#include "access/rmgr.h"
#include "access/spgxlog.h"
+#include "access/undolog.h"
#include "access/xact.h"
#include "access/xlog_internal.h"
#include "catalog/storage_xlog.h"
@@ -32,7 +33,7 @@
#include "storage/standbydefs.h"
#include "utils/relmapper.h"
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo,undo_desc,undo_identify,undo_cleanup_init,undo_recoveryend) \
{ name, desc, identify},
static const RmgrDescData RmgrDescTable[RM_N_BUILTIN_IDS] = {
diff --git a/src/include/access/rmgr.h b/src/include/access/rmgr.h
index 3b6a497e1b4..eed615e9196 100644
--- a/src/include/access/rmgr.h
+++ b/src/include/access/rmgr.h
@@ -19,7 +19,7 @@ typedef uint8 RmgrId;
* Note: RM_MAX_ID must fit in RmgrId; widening that type will affect the XLOG
* file format.
*/
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo,undo_desc,undo_identify,undo_cleanup_init,undo_recoveryend) \
symname,
typedef enum RmgrIds
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 78e6b908c6e..02755b04bb9 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -24,26 +24,27 @@
* Changes to this list possibly need an XLOG_PAGE_MAGIC bump.
*/
-/* symbol name, textual name, redo, desc, identify, startup, cleanup, mask, decode */
-PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL, NULL, xlog_decode)
-PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL, NULL, xact_decode)
-PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL, standby_decode)
-PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask, heap2_decode)
-PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask, heap_decode)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask, NULL)
-PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask, NULL)
-PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask, NULL)
-PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask, NULL)
-PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL, seq_mask, NULL)
-PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup, spg_mask, NULL)
-PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL, brin_mask, NULL)
-PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask, NULL)
-PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL, logicalmsg_decode)
+/* symbol name, textual name, redo, desc, identify, startup, cleanup, mask, decode, undo, undo_desc, undo_identify, undo_cleanup_init, undo_recoveryend */
+PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL, NULL, xlog_decode, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL, NULL, xact_decode, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL, standby_decode, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask, heap2_decode, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask, heap_decode, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask, NULL, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask, NULL, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask, NULL, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask, NULL, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL, seq_mask, NULL, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup, spg_mask, NULL, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL, brin_mask, NULL, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask, NULL, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL, logicalmsg_decode, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_ULOG_ID, "UndoLog", undolog_redo, undolog_desc, undolog_identify, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
diff --git a/src/include/access/undolog.h b/src/include/access/undolog.h
new file mode 100644
index 00000000000..8955197398d
--- /dev/null
+++ b/src/include/access/undolog.h
@@ -0,0 +1,82 @@
+/*-------------------------------------------------------------------------
+ *
+ * undolog.h
+ * Definitions for undolog module of PostgreSQL
+ *
+ * Copyright (c) 2000-2024, PostgreSQL Global Development Group
+ *
+ * src/include/access/undolog.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDOLOG_H
+#define UNDOLOG_H
+
+#include "access/transam.h"
+#include "access/xlogreader.h"
+
+/* Directory for storing undo logs */
+#define UNDOLOG_DIR "pg_ulog"
+
+typedef struct UndoLogFileHeader
+{
+ int32 magic; /* fixed ULOG file magic number */
+ /* UndoLogRecord follows */
+} UndoLogFileHeader;
+
+typedef struct UndoLogRecord
+{
+ uint32 ul_tot_len; /* total length of entire record */
+ pg_crc32c ul_crc; /* CRC for this record */
+ RmgrId ul_rmid; /* resource manager for this record */
+ uint8 ul_info; /* record info */
+ /* rmgr-specific data follow, no padding */
+} UndoLogRecord;
+
+/*
+ * The high 4 bits in ul_info may be used freely by rmgr. The lower 4 bits are
+ * not used for now.
+ */
+#define ULR_INFO_MASK 0x0F
+#define ULR_RMGR_INFO_MASK 0xF0
+
+/* XLOG stuff */
+#define XLOG_ULOG_CREATE 0x00
+#define XLOG_ULOG_WRITE 0x10
+
+typedef struct xl_ulog_create
+{
+ FullTransactionId xid;
+} xl_ulog_create;
+
+typedef struct xl_ulog_write
+{
+ FullTransactionId xid;
+ off_t off;
+ Size len;
+ unsigned char bytes[FLEXIBLE_ARRAY_MEMBER];
+} xl_ulog_write;
+
+extern void undolog_redo(XLogReaderState *record);
+extern void undolog_desc(StringInfo buf, XLogReaderState *record);
+extern const char *undolog_identify(uint8 info);
+
+#define ULogRecGetData(record) ((char *)record + sizeof(UndoLogRecord))
+#define ULogRecGetInfo(record) ((record)->ul_info)
+
+/*
+ * UndoLogSetFilename()
+ *
+ * Generates undo log file name for the xid. Used in undolog.c and
+ * undologdesc.c.
+ */
+static inline void
+UndoLogSetFilename(char *buf, FullTransactionId xid)
+{
+ StaticAssertDecl(sizeof(FullTransactionId) == 8,
+ "width of FullTransactionId is not 8");
+ snprintf(buf, MAXPGPATH, "%s/%016" INT64_MODIFIER "x",
+ UNDOLOG_DIR, U64FromFullTransactionId(xid));
+}
+
+#endif /* UNDOLOG_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 171a7dd5d2b..27e7bfde48f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3032,6 +3032,9 @@ ULONG
ULONG_PTR
UV
UVersionInfo
+UndoDescData
+UndoLogFileHeader
+UndoLogRecord
UnicodeNormalizationForm
UnicodeNormalizationQC
Unique
@@ -4135,6 +4138,8 @@ xl_standby_locks
xl_tblspc_create_rec
xl_tblspc_drop_rec
xl_testcustomrmgrs_message
+xl_ulog_create
+xl_ulog_write
xl_xact_abort
xl_xact_assignment
xl_xact_commit
--
2.43.5
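For readers unfamiliar with the list-macro technique that the PG_RMGR changes in the patch above rely on, here is a small standalone sketch. It is not PostgreSQL code: the real list lives in a header that is included repeatedly, while this sketch uses a single macro, and all `SKETCH_*`/`sketch_*` names and the three-column layout are made up for illustration. It shows why adding a column means touching every place that defines the macro, even those that ignore the new column.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback; treats the high 4 bits of info as rmgr-owned. */
static const char *
sketch_undo_identify(unsigned char info)
{
    return (info & 0xF0) ? "RMGR_OP" : NULL;
}

/* One row per resource manager: symbol, name, undo-identify callback. */
#define SKETCH_RMGR_LIST \
    SKETCH_RMGR(SK_XLOG_ID, "XLOG", NULL) \
    SKETCH_RMGR(SK_HEAP_ID, "Heap", NULL) \
    SKETCH_RMGR(SK_ULOG_ID, "UndoLog", sketch_undo_identify)

/* Expansion 1: keep only the symbol, to build the id enum. */
#define SKETCH_RMGR(sym, name, undo_identify) sym,
typedef enum SketchRmgrId
{
    SKETCH_RMGR_LIST
    SK_NUM_IDS
} SketchRmgrId;
#undef SKETCH_RMGR

/* Expansion 2: keep only the name; the new column is accepted but unused. */
#define SKETCH_RMGR(sym, name, undo_identify) name,
static const char *const sketch_names[] = {
    SKETCH_RMGR_LIST
};
#undef SKETCH_RMGR

/* Expansion 3: a table that actually uses the new column. */
typedef struct SketchUndoDesc
{
    const char *rm_name;
    const char *(*rm_undoidentify) (unsigned char info);
} SketchUndoDesc;

#define SKETCH_RMGR(sym, name, undo_identify) { name, undo_identify },
static const SketchUndoDesc sketch_undo_table[] = {
    SKETCH_RMGR_LIST
};
#undef SKETCH_RMGR
```

Each expansion must accept the full parameter list, which is why the patch edits the macro definition in rmgr.c, rmgr.h, pg_rewind and pg_waldump even though those tables do not use the undo callbacks.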
v35-0002-Delay-the-reset-of-UNLOGGED-relations.patch (text/x-patch; charset=us-ascii)
From 3021eea3905e6979b1eade9669ffba85a41246e7 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Mon, 30 Sep 2024 17:56:46 +0900
Subject: [PATCH v35 02/21] Delay the reset of UNLOGGED relations
This patch set enables the creation of INIT forks within transactions,
and abort-time cleanup of such forks will be handled by the UNDO log
system introduced in a subsequent commit. Since UNDO logs depend on
WAL, to ensure correct UNDO processing, any operations involving INIT
forks, specifically reinit, must take place after recovery reaches
consistency. To prepare for the introduction of the UNDO log system,
this commit moves the UNLOGGED relation cleanup from before recovery
begins to when the consistency point is reached, or to the end of
recovery if hot standby is disabled.
---
src/backend/access/transam/xlog.c | 17 +++++++++--------
src/backend/access/transam/xlogrecovery.c | 9 +++++++++
2 files changed, 18 insertions(+), 8 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3ecaf181392..e9d029ebfac 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5756,14 +5756,6 @@ StartupXLOG(void)
/* Check that the GUCs used to generate the WAL allow recovery */
CheckRequiredParameterValues();
- /*
- * We're in recovery, so unlogged relations may be trashed and must be
- * reset. This should be done BEFORE allowing Hot Standby
- * connections, so that read-only backends don't try to read whatever
- * garbage is left over from before.
- */
- ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP);
-
/*
* Likewise, delete any saved transaction snapshot files that got left
* behind by crashed backends.
@@ -5911,7 +5903,16 @@ StartupXLOG(void)
* end-of-recovery steps fail.
*/
if (InRecovery)
+ {
+ /*
+ * Clean up unlogged relations if not already done. If consistency has
+ * been established, this cleanup would have occurred when entering hot
+ * standby mode (see CheckRecoveryConsistency for details).
+ */
+ if (!reachedConsistency)
+ ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP);
ResetUnloggedRelations(UNLOGGED_RELATION_INIT);
+ }
/*
* Pre-scan prepared transactions to find out the range of XIDs present.
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 31caa49d6c3..278154ad9a0 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -57,6 +57,7 @@
#include "storage/latch.h"
#include "storage/pmsignal.h"
#include "storage/procarray.h"
+#include "storage/reinit.h"
#include "storage/spin.h"
#include "utils/datetime.h"
#include "utils/fmgrprotos.h"
@@ -2271,6 +2272,14 @@ CheckRecoveryConsistency(void)
reachedConsistency &&
IsUnderPostmaster)
{
+ /*
+ * Unlogged relations may be trashed and must be reset. This should be
+ * done BEFORE allowing Hot Standby connections, so that read-only
+ * backends don't try to read whatever garbage is left over from
+ * before.
+ */
+ ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP);
+
SpinLockAcquire(&XLogRecoveryCtl->info_lck);
XLogRecoveryCtl->SharedHotStandbyActive = true;
SpinLockRelease(&XLogRecoveryCtl->info_lck);
--
2.43.5
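The ordering that the commit message of the patch above describes can be modeled roughly as follows. This is only a sketch under assumptions: the `sketch_*` names are invented, and it tracks an explicit "cleanup done" flag, whereas the patch itself tests reachedConsistency at the end of recovery. The two are not claimed to be equivalent.

```c
#include <assert.h>
#include <stdbool.h>

static int  sketch_cleanup_calls;   /* times the CLEANUP pass ran */
static int  sketch_init_calls;      /* times the INIT pass ran */
static bool sketch_cleanup_done;    /* explicit flag (assumption of this sketch) */

/* Start a fresh simulated recovery. */
static void
sketch_reset_state(void)
{
    sketch_cleanup_calls = 0;
    sketch_init_calls = 0;
    sketch_cleanup_done = false;
}

/* Stand-in for the cleanup pass over unlogged relations. */
static void
sketch_cleanup_unlogged(void)
{
    sketch_cleanup_calls++;
    sketch_cleanup_done = true;
}

/* Consistency reached and hot standby is about to accept connections. */
static void
sketch_enter_hot_standby(void)
{
    if (!sketch_cleanup_done)
        sketch_cleanup_unlogged();
}

/* End of recovery: run cleanup if it has not happened yet, then init. */
static void
sketch_end_of_recovery(void)
{
    if (!sketch_cleanup_done)
        sketch_cleanup_unlogged();
    sketch_init_calls++;
}
```

Either way the cleanup runs exactly once per recovery, and always before read-only backends or the init pass can see leftover files.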
v35-0003-Add-new-function-TwoPhaseXidExists.patch (text/x-patch; charset=us-ascii)
From 199d160b42cb29ea5044a90008a05e1fc380934c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 3 Oct 2024 17:46:06 +0900
Subject: [PATCH v35 03/21] Add new function TwoPhaseXidExists
The undo log system needs to know whether a transaction is in the
prepared state or not. Add a new function TwoPhaseXidExists to
accommodate this requirement.
---
src/backend/access/transam/twophase.c | 31 +++++++++++++++++++++------
src/include/access/twophase.h | 1 +
2 files changed, 26 insertions(+), 6 deletions(-)
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 23dd0c6ef6e..0a62c237b9d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -794,10 +794,11 @@ pg_prepared_xact(PG_FUNCTION_ARGS)
* specified by XID
*
* If lock_held is set to true, TwoPhaseStateLock will not be taken, so the
- * caller had better hold it.
+ * caller had better hold it. If noerror is true, returns NULL if the global
+ * transaction does not exist.
*/
static GlobalTransaction
-TwoPhaseGetGXact(TransactionId xid, bool lock_held)
+TwoPhaseGetGXact(TransactionId xid, bool lock_held, bool noerror)
{
GlobalTransaction result = NULL;
int i;
@@ -831,8 +832,13 @@ TwoPhaseGetGXact(TransactionId xid, bool lock_held)
if (!lock_held)
LWLockRelease(TwoPhaseStateLock);
- if (result == NULL) /* should not happen */
- elog(ERROR, "failed to find GlobalTransaction for xid %u", xid);
+ if (result == NULL)
+ {
+ if (!noerror)
+ elog(ERROR, "failed to find GlobalTransaction for xid %u", xid);
+
+ return NULL;
+ }
cached_xid = xid;
cached_gxact = result;
@@ -902,7 +908,7 @@ TwoPhaseGetXidByVirtualXID(VirtualTransactionId vxid,
ProcNumber
TwoPhaseGetDummyProcNumber(TransactionId xid, bool lock_held)
{
- GlobalTransaction gxact = TwoPhaseGetGXact(xid, lock_held);
+ GlobalTransaction gxact = TwoPhaseGetGXact(xid, lock_held, false);
return gxact->pgprocno;
}
@@ -917,11 +923,24 @@ TwoPhaseGetDummyProcNumber(TransactionId xid, bool lock_held)
PGPROC *
TwoPhaseGetDummyProc(TransactionId xid, bool lock_held)
{
- GlobalTransaction gxact = TwoPhaseGetGXact(xid, lock_held);
+ GlobalTransaction gxact = TwoPhaseGetGXact(xid, lock_held, false);
return GetPGProcByNumber(gxact->pgprocno);
}
+/*
+ * TwoPhaseXidExists
+ * Returns whether the prepared transaction specified by XID exists
+ *
+ * If lock_held is set to true, TwoPhaseStateLock will not be taken, so the
+ * caller had better hold it.
+ */
+bool
+TwoPhaseXidExists(TransactionId xid, bool lock_held)
+{
+ return TwoPhaseGetGXact(xid, lock_held, true) != NULL;
+}
+
/************************************************************************/
/* State file support */
/************************************************************************/
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index b85b65c604e..c6298332d36 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -38,6 +38,7 @@ extern TransactionId TwoPhaseGetXidByVirtualXID(VirtualTransactionId vxid,
bool *have_more);
extern PGPROC *TwoPhaseGetDummyProc(TransactionId xid, bool lock_held);
extern int TwoPhaseGetDummyProcNumber(TransactionId xid, bool lock_held);
+extern bool TwoPhaseXidExists(TransactionId xid, bool lock_held);
extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
TimestampTz prepared_at,
--
2.43.5
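For readers following the 0003 patch above: the change boils down to a lookup that either raises an error or returns NULL on a miss, plus a thin boolean wrapper around it. A minimal standalone sketch of that pattern in plain C, outside PostgreSQL, might look like the following. The names xid_t, gxact_t, lookup_gxact and xid_exists are hypothetical stand-ins, not the patch's identifiers, the table is made-up data, and the error path simply calls exit() where the real code uses elog(ERROR).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins for TransactionId and the gxact struct. */
typedef unsigned int xid_t;

typedef struct
{
	xid_t		xid;
} gxact_t;

/* Made-up table of "prepared" transactions, for illustration only. */
static gxact_t prepared[] = {{100}, {205}, {731}};

/*
 * Look up an entry by xid.  On a miss, either report an error and exit
 * (noerror == false) or return NULL (noerror == true).
 */
static gxact_t *
lookup_gxact(xid_t xid, bool noerror)
{
	/* 0 plays the role of an invalid xid in this sketch */
	assert(xid != 0);

	for (size_t i = 0; i < sizeof(prepared) / sizeof(prepared[0]); i++)
	{
		if (prepared[i].xid == xid)
			return &prepared[i];
	}

	if (!noerror)
	{
		fprintf(stderr, "failed to find entry for xid %u\n", xid);
		exit(EXIT_FAILURE);
	}

	return NULL;
}

/* Existence check built on the non-erroring lookup. */
static bool
xid_exists(xid_t xid)
{
	return lookup_gxact(xid, true) != NULL;
}
```

Existing callers that dereference the result keep passing noerror = false, so a miss still fails loudly there; only the new existence check takes the NULL path.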
v35-0004-Introduce-undo-log-implementation.patch (text/x-patch; charset=iso-8859-7)
From ded270c99b3167269de1dc7dd290bfebf6abeff9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 3 Oct 2024 18:24:54 +0900
Subject: [PATCH v35 04/21] Introduce undo log implementation
This commit introduces the UNDO log system. In this implementation,
undo information is primarily stored in dynamic shared memory (DSM),
processed and dropped at transaction commit. Undo information is also
WAL-logged and restored to the same in-memory structure during
recovery. To ensure correct recovery beyond checkpoints, undo logs in
DSM are copied at each checkpoint to files in the `pg_ulog` directory,
with filenames based on full transaction IDs. During (sub)transaction
commits, aborts, and cleanups at recovery end and consistency point,
the UNDO log system calls the undo routines (not yet implemented) with
the appropriate transaction states, allowing these routines to perform
necessary operations.
---
src/backend/access/transam/twophase.c | 3 +
src/backend/access/transam/undolog.c | 1265 +++++++++++++++++
src/backend/access/transam/xact.c | 21 +
src/backend/access/transam/xlog.c | 14 +-
src/backend/access/transam/xlogrecovery.c | 2 +
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/lwlock.c | 3 +
.../utils/activity/wait_event_names.txt | 1 +
src/backend/utils/init/postinit.c | 4 +
src/bin/initdb/initdb.c | 17 +
src/bin/pg_waldump/t/001_basic.pl | 3 +-
src/include/access/undolog.h | 22 +
src/include/storage/lwlock.h | 3 +
src/include/storage/lwlocklist.h | 1 +
src/tools/pgindent/typedefs.list | 6 +
15 files changed, 1364 insertions(+), 4 deletions(-)
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 0a62c237b9d..8455ceb057a 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
#include "access/transam.h"
#include "access/twophase.h"
#include "access/twophase_rmgr.h"
+#include "access/undolog.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
@@ -1607,6 +1608,8 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
abortstats,
gid);
+ UndoLog_UndoByXid(isCommit, xid, hdr->nsubxacts, children, false);
+
ProcArrayRemove(proc, latestXid);
/*
diff --git a/src/backend/access/transam/undolog.c b/src/backend/access/transam/undolog.c
index c32f5cd0b6f..d19caac5946 100644
--- a/src/backend/access/transam/undolog.c
+++ b/src/backend/access/transam/undolog.c
@@ -12,12 +12,1093 @@
*
* src/backend/access/transam/undolog.c
*
+ * Each undo log record is stored in a dynamic shared area block, mapped by a
+ * dynamic shared hash. These records are WAL-logged but not immediately
+ * written to files; instead, they are flushed to multiple files at every
+ * checkpoint. Otherwise, no files are written. Except during recovery, the
+ * xids for which a backend created undo logs are recorded in a local array for
+ * fast lookup of undo logs to process at commit.
+ *
*-------------------------------------------------------------------------
*/
#include "postgres.h"
+#include <sys/stat.h>
+
+#include "lib/stringinfo.h"
+#include "access/parallel.h"
#include "access/undolog.h"
+#include "access/twophase_rmgr.h"
+#include "access/twophase.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xloginsert.h"
+#include "lib/dshash.h"
+#include "miscadmin.h"
+#include "storage/fd.h"
+#include "storage/procarray.h"
+#include "utils/memutils.h"
+
+
+#define ULOG_FILE_MAGIC 0x474f4c55 /* 'ULOG' in little-endian */
+
+/* Resource manager definition */
+typedef struct RmgrUndoData
+{
+ const char *rm_name;
+ void (*rm_undo) (UndoLogRecord *record, ULogOp op,
+ bool recovered, bool redo);
+ void (*rm_undocleanupinit) (void);
+ void (*rm_undorecoveryend) (void);
+} RmgrUndoData;
+
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo,undo_desc,undo_identify,undo_cleanupinit,undo_recoveryend) \
+ { name, undo, undo_cleanupinit, undo_recoveryend },
+
+static RmgrUndoData RmgrUndo[RM_MAX_ID + 1] = {
+#include "access/rmgrlist.h"
+};
+#undef PG_RMGR
+
+/*
+ * Undo log DSA block struct, each block is pointed from a dshash entry.
+ *
+ * refcount does not directly represent the number of referencers. It is set to
+ * 2 when allocated, then decremented by 2 when dropped. It is decremented by 1
+ * if the contents of this entry have been moved to another DSA entry for the
+ * same hash entry due to space expansion. Checkpointer increments the refcount
+ * while writing the file for this entry to prevent the entry from being freed.
+ *
+ * image is written by backends (or by startup during recovery), and then read
+ * by the writers and the checkpointer. written_upto primarily stores the
+ * working state of the checkpointer, and it is not crucial if the value is
+ * slightly outdated. Therefore, we do not use an LWLock and instead rely on a
+ * memory barrier to ensure the integrity of this entry.
+ */
+typedef struct UndoLogEntry
+{
+ FullTransactionId xid; /* target full xid */
+ pg_atomic_uint32 refcount; /* reference count */
+ off_t written_upto; /* how far this block is flushed */
+ bool recovered; /* restored in recovery ? */
+ Size image_buf_len; /* record image buffer length */
+ Size image_len; /* record image length.*/
+ unsigned char image[FLEXIBLE_ARRAY_MEMBER]; /* record image buffer */
+} UndoLogEntry;
+
+/* Allocation size for UndoLogEntry */
+#define UndoLogEntrySize(body_len) \
+ (offsetof(UndoLogEntry, image) + body_len)
+
+/* Initial undolog block length, arbitrary number. */
+#define InitialUndoLogLen 256
+
+/* round the size up to the nearest power of two */
+static inline Size
+undolog_adjust_bufsize(Size target)
+{
+ Size s = InitialUndoLogLen;
+
+ while (s < target)
+ s *= 2;
+
+ return s;
+}
+
+/*
+ * Initial space capacity for the record image of a newly created undo log
+ * entry
+ */
+#define UNDOLOG_INITIAL_LOG_CAPACITY 32
+
+/* Undo log dshash entry, keyed by xid */
+typedef struct UndoLogHashEntry
+{
+ FullTransactionId xid; /* xid */
+ dsa_pointer body; /* UndoLogEntry DSA pointer */
+} UndoLogHashEntry;
+
+/* Initial length of the xid array for the current session */
+#define ACTIVE_ULOG_XIDLIST_INITIAL_LEN 32
+
+/*
+ * Struct for top-level management variables.
+ *
+ * Stored in local memory. current_ulog points to the currently active undo log
+ * dsa block, for the transaction xid. buf holds a memory area of length buflen
+ * for various uses in this module, helping to avoid frequent palloc/pfree
+ * cycles.
+ */
+typedef struct ULogStateData
+{
+ MemoryContext cxt; /* working memory context */
+ UndoLogEntry *current_ulog; /* current open entry */
+ dsa_area *dsa; /* UndoLogEntry dsa */
+ dshash_table *hash; /* UndoLogHashEntry hash */
+ FullTransactionId *ulog_xids; /* xid list, length ulog_xids_cap */
+ int ulog_xids_cap; /* capacity of the above list */
+ int ulog_xids_num; /* number of elements */
+ void *buf; /* working buffer */
+ int buflen; /* length of the buffer */
+} ULogStateData;
+
+static ULogStateData ULogState = {NULL, NULL, NULL, NULL, NULL, 0, 0, NULL, 0};
+
+#define UndoLogContext (ULogState.cxt)
+
+/*
+ * Struct for bootstrap info about dynamic shared memory, stored in static
+ * shared memory.
+ */
+typedef struct UndoLogCtrlStruct
+{
+ dsa_handle dsah;
+ dshash_table_handle dshh;
+ pg_atomic_uint32 nhashent; /* # of hash entries in dshh */
+} UndoLogCtrlStruct;
+
+static UndoLogCtrlStruct *UndoLogCtrl;
+
+/* dshash parameter */
+static const dshash_parameters dsh_params = {
+ sizeof(FullTransactionId),
+ sizeof(UndoLogHashEntry),
+ dshash_memcmp,
+ dshash_memhash,
+ dshash_memcpy,
+ LWTRANCHE_UNDOLOG_HASH
+};
+
+/*
+ * Shared memory initializer functions
+ */
+Size
+UndoLogShmemSize(void)
+{
+ return MAXALIGN(sizeof(UndoLogCtrlStruct));
+}
+
+void
+UndoLogShmemInit(void)
+{
+ bool found;
+
+ UndoLogCtrl = (UndoLogCtrlStruct *) ShmemInitStruct("UNDO Log Data",
+ UndoLogShmemSize(),
+ &found);
+ if (!found)
+ {
+ UndoLogCtrl->dsah = DSA_HANDLE_INVALID;
+ UndoLogCtrl->dshh = DSHASH_HANDLE_INVALID;
+ }
+}
+
+/*
+ * InitUndoLog() - initialize undo log system
+ */
+void
+InitUndoLog(void)
+{
+ /* shouldn't be called from postmaster */
+ Assert(IsUnderPostmaster || !IsPostmasterEnvironment);
+
+ UndoLogContext = AllocSetContextCreate(TopMemoryContext,
+ "Undo log system",
+ ALLOCSET_DEFAULT_SIZES);
+
+ LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+
+ if (UndoLogCtrl->dshh == DSHASH_HANDLE_INVALID)
+ {
+ /* We're the first process, allocate them. */
+ ULogState.dsa = dsa_create(LWTRANCHE_UNDOLOG_DSA);
+ dsa_pin(ULogState.dsa);
+ dsa_pin_mapping(ULogState.dsa);
+
+ ULogState.hash = dshash_create(ULogState.dsa, &dsh_params, NULL);
+
+ /* Share handles with succeeding processes */
+ UndoLogCtrl->dsah = dsa_get_handle(ULogState.dsa);
+ UndoLogCtrl->dshh = dshash_get_hash_table_handle(ULogState.hash);
+ pg_atomic_init_u32(&UndoLogCtrl->nhashent, 0);
+ }
+ else
+ {
+ /* Attach to existing dsm and dshash table */
+ ULogState.dsa = dsa_attach(UndoLogCtrl->dsah);
+ dsa_pin_mapping(ULogState.dsa);
+ ULogState.hash = dshash_attach(ULogState.dsa, &dsh_params,
+ UndoLogCtrl->dshh, NULL);
+ }
+
+ LWLockRelease(UndoLogLock);
+
+ /* xids list for the current session */
+ if (ULogState.ulog_xids_cap == 0)
+ {
+ Size alloc_size;
+
+ ULogState.ulog_xids_cap = UNDOLOG_INITIAL_LOG_CAPACITY;
+
+ alloc_size = ULogState.ulog_xids_cap * sizeof(FullTransactionId);
+ ULogState.ulog_xids = MemoryContextAlloc(UndoLogContext, alloc_size);
+ }
+}
+
+/*
+ * undolog_ensure_buffer()
+ *
+ * Ensures that the data buffer in ULogState is at least the specified size.
+ */
+static void *
+undolog_ensure_buffer(Size size)
+{
+ if (size > ULogState.buflen)
+ {
+ if (likely(ULogState.buf))
+ ULogState.buf = repalloc(ULogState.buf, size);
+ else
+ ULogState.buf = MemoryContextAlloc(UndoLogContext, size);
+ ULogState.buflen = size;
+ }
+
+ return ULogState.buf;
+}
+
+/*
+ * undolog_file_exists() - Checks for file corresponding to specified xid.
+ */
+static bool
+undolog_file_exists(FullTransactionId xid)
+{
+ char fname[MAXPGPATH];
+ struct stat statbuf;
+
+ UndoLogSetFilename(fname, xid);
+
+ if (stat(fname, &statbuf) < 0)
+ {
+ if (errno == ENOENT)
+ return false;
+
+ ereport(FATAL,
+ errcode_for_file_access(),
+ errmsg("stat failed for undo file \"%s\": %m", fname));
+ }
+
+ return true;
+}
+
+/*
+ * undolog_create_entry() - Creates a new undo log memory entry
+ *
+ * Creates a new undo log entry and hash entry for the xid, with an initial
+ * payload of bodylen. Assumes that the hash entry for the xid does not
+ * exist. Then stores the xid in the local xid list, except during recovery.
+ */
+static UndoLogEntry *
+undolog_create_entry(FullTransactionId xid, Size bodylen, bool redo)
+{
+ dsa_pointer chunk;
+ UndoLogEntry *newlog;
+ UndoLogHashEntry *shhashent;
+ bool found;
+
+ if (!redo)
+ {
+ Assert(ULogState.ulog_xids_num <= ULogState.ulog_xids_cap);
+
+ /* Expand xids array if needed */
+ if (ULogState.ulog_xids_num == ULogState.ulog_xids_cap)
+ {
+ ULogState.ulog_xids_cap *= 2;
+ ULogState.ulog_xids =
+ repalloc(ULogState.ulog_xids,
+ ULogState.ulog_xids_cap * sizeof(FullTransactionId));
+ }
+ ULogState.ulog_xids[ULogState.ulog_xids_num++] = xid;
+ }
+
+
+ /* Adjust allocation size */
+ bodylen = undolog_adjust_bufsize(bodylen);
+
+ /* Allocate undo log memory entry with the initial size */
+ chunk = dsa_allocate0(ULogState.dsa, UndoLogEntrySize(bodylen));
+ newlog = dsa_get_address(ULogState.dsa, chunk);
+
+ newlog->xid = xid;
+ pg_atomic_init_u32(&newlog->refcount, 2);
+ newlog->written_upto = 0;
+ newlog->recovered = false;
+ newlog->image_buf_len = bodylen;
+ newlog->image_len = 0;
+
+ /* the following dshash access is expected to act as a write barrier */
+
+ /* Register the memory entry into the hash. */
+ shhashent = dshash_find_or_insert(ULogState.hash, &xid, &found);
+ Assert(!found);
+ shhashent->body = chunk;
+ dshash_release_lock(ULogState.hash, shhashent);
+ pg_atomic_add_fetch_u32(&UndoLogCtrl->nhashent, 1);
+ return newlog;
+}
+
+/*
+ * undolog_find_ulog() - Finds the undo log entry for the xid
+ *
+ * If the undo log for the xid is already in shared memory, return the
+ * entry. Otherwise, load it from a file if one exists. If create is true, a
+ * new log is created if no previous one is found.
+ */
+static UndoLogEntry *
+undolog_find_ulog(FullTransactionId xid, bool create, bool redo, bool *found)
+{
+ UndoLogHashEntry *shhashent;
+
+ *found = false;
+
+ /* fastpath for the currently active undo log */
+ if (ULogState.current_ulog &&
+ FullTransactionIdEquals(xid, ULogState.current_ulog->xid))
+ {
+ *found = true;
+ return ULogState.current_ulog;
+ }
+
+ /* search for the in-memory entry */
+ shhashent = dshash_find(ULogState.hash, &xid, false);
+ if (shhashent)
+ {
+ ULogState.current_ulog =
+ dsa_get_address(ULogState.dsa, shhashent->body);
+ dshash_release_lock(ULogState.hash, shhashent);
+
+ Assert(FullTransactionIdEquals(ULogState.current_ulog->xid, xid));
+
+ *found = true;
+ return ULogState.current_ulog;
+ }
+
+ /* Finally, check for the file. */
+ if (undolog_file_exists(xid))
+ {
+ char fname[MAXPGPATH];
+ struct stat sbuf;
+ int fd;
+ int ret;
+ UndoLogEntry *ulog;
+
+ UndoLogSetFilename(fname, xid);
+ fd = BasicOpenFile(fname, PG_BINARY | O_RDONLY);
+
+ if (fstat(fd, &sbuf) < 0)
+ ereport(FATAL,
+ errcode_for_file_access(),
+ errmsg("failed to stat ulog file \"%s\": %m", fname));
+
+ ulog = undolog_create_entry(xid, sbuf.st_size, redo);
+ Assert(ulog->image_buf_len >= sbuf.st_size);
+
+ ret = read(fd, ulog->image, sbuf.st_size);
+ if (ret < 0)
+ ereport(FATAL,
+ errcode_for_file_access(),
+ errmsg("failed to read ulog file \"%s\": %m", fname));
+ close(fd);
+
+ ulog->image_len = sbuf.st_size;
+
+ /* ensure all previous writes are visible before follower continues. */
+ pg_write_barrier();
+
+ ULogState.current_ulog = ulog;
+
+ *found = true;
+ return ULogState.current_ulog;
+ }
+
+ if (create)
+ {
+ xl_ulog_write *wrec;
+ UndoLogFileHeader *fheader;
+ UndoLogEntry *ulog;
+ Size bodylen;
+ Size wreclen;
+ XLogRecPtr recptr;
+
+ /*
+ * WAL-log the creation of this undo log file. However, we don't
+ * actually create the file since the undo log is usually processed
+ * in-memory.
+ */
+ Assert(FullTransactionIdIsValid(xid));
+
+ if (!redo)
+ {
+ xl_ulog_create crec;
+
+ crec.xid = xid;
+ XLogBeginInsert();
+ XLogRegisterData((char *) &crec, sizeof(crec));
+ (void) XLogInsert(RM_ULOG_ID, XLOG_ULOG_CREATE);
+ }
+
+ ulog = undolog_create_entry(xid, InitialUndoLogLen, redo);
+
+ bodylen = sizeof(UndoLogFileHeader);
+ wreclen = sizeof(xl_ulog_write) + bodylen;
+ wrec = undolog_ensure_buffer(wreclen);
+
+ wrec->xid = xid;
+ wrec->off = 0;
+ wrec->len = bodylen;
+ fheader = (UndoLogFileHeader *) &wrec->bytes;
+ fheader->magic = ULOG_FILE_MAGIC;
+
+ if (!redo)
+ {
+ XLogBeginInsert();
+ XLogRegisterData((char *) wrec, wreclen);
+ recptr = XLogInsert(RM_ULOG_ID, XLOG_ULOG_WRITE);
+ XLogFlush(recptr);
+ }
+
+ Assert(ulog->image_buf_len >= bodylen);
+ memcpy(ulog->image, fheader, bodylen);
+ ulog->image_len = bodylen;
+
+ /* ensure all previous writes are visible before follower continues. */
+ pg_write_barrier();
+
+ ULogState.current_ulog = ulog;
+ }
+ else
+ ULogState.current_ulog = NULL;
+
+ return ULogState.current_ulog;
+}
+
+/*
+ * undolog_remove_file() - Removes the undo log file for the specified xid.
+ *
+ * The file must already have been closed.
+ */
+static void
+undolog_remove_file(FullTransactionId xid)
+{
+ char fname[MAXPGPATH];
+
+ UndoLogSetFilename(fname, xid);
+
+ durable_unlink(fname, FATAL);
+}
+
+static void
+undolog_drop_ulog(FullTransactionId xid, bool redo)
+{
+ UndoLogHashEntry *shhashent;
+ dsa_pointer ulog_p;
+ UndoLogEntry *ulog;
+
+ Assert(FullTransactionIdIsValid(xid));
+
+ /* dereference current_ulog if it is about to be dropped */
+ if (ULogState.current_ulog &&
+ FullTransactionIdEquals(xid, ULogState.current_ulog->xid))
+ ULogState.current_ulog = NULL;
+
+ /*
+ * Search for the hash entry. Minimize the time before delete to release
+ * the lock quickly. We should find the entry here.
+ */
+ shhashent = dshash_find(ULogState.hash, &xid, true);
+ Assert(shhashent);
+ ulog_p = shhashent->body;
+ dshash_delete_entry(ULogState.hash, shhashent);
+
+ pg_atomic_sub_fetch_u32(&UndoLogCtrl->nhashent, 1);
+
+ /*
+ * Remove the dsa entry, then the file if any.
+ */
+ ulog = dsa_get_address(ULogState.dsa, ulog_p);
+
+ /*
+ * Decrement the refcount by 2. If it reaches 0, no other process is
+ * referencing it, so we can and should remove the DSA entry. Undo log file
+ * can be removed in either case, as no one will read it.
+ */
+ if (pg_atomic_sub_fetch_u32(&ulog->refcount, 2) == 0)
+ dsa_free(ULogState.dsa, ulog_p);
+
+ if (undolog_file_exists(xid))
+ undolog_remove_file(xid);
+
+ /*
+ * Remove this xid from the local xid list if not in recovery.
+ *
+ * This is kind of bogus during the two-phase commit phase, as the xid list
+ * in ULogState is empty in this case, but this function still tries to
+ * remove the xid from the list. Since the list is empty, there's no issue
+ * here.
+ */
+ if (!redo)
+ {
+ for (int i = 0 ; i < ULogState.ulog_xids_num ; i++)
+ {
+ if (FullTransactionIdEquals(ULogState.ulog_xids[i], xid))
+ {
+ for (; i < ULogState.ulog_xids_num - 1; i++)
+ ULogState.ulog_xids[i] = ULogState.ulog_xids[i + 1];
+
+ ULogState.ulog_xids_num--;
+ }
+ }
+ }
+}
+
+
+/*
+ * undolog_write_to_file - write the undo log into file
+ *
+ * Writes the undo log image from start_off to end_off.
+ */
+static void
+undolog_write_to_file(UndoLogEntry *ulog, off_t start_off, off_t end_off)
+{
+ char fname[MAXPGPATH];
+ int fd;
+ int ret;
+ Size write_len;
+
+ Assert(end_off > start_off);
+
+ write_len = end_off - start_off;
+
+ if (write_len == 0)
+ return;
+
+ UndoLogSetFilename(fname, ulog->xid);
+ fd = BasicOpenFile(fname, PG_BINARY | O_WRONLY | O_CREAT);
+ if (fd < 0)
+ ereport(FATAL,
+ errcode_for_file_access(),
+ errmsg("failed to open or create undo file \"%s\": %m", fname));
+
+ ret = pg_pwrite(fd,
+ ulog->image + start_off, /* buf address */
+ write_len, /* write_size */
+ start_off); /* file offset */
+
+ if (ret != write_len)
+ ereport(FATAL,
+ errcode_for_file_access(),
+ errmsg("failed to write to undo file \"%s\": %m", fname));
+
+ close(fd);
+
+ return;
+}
+
+static UndoLogEntry *
+undolog_realloc_ulog(UndoLogEntry *ulog, Size target_size)
+{
+ UndoLogHashEntry *shhashent;
+ dsa_pointer chunk;
+ dsa_pointer oldchunk;
+ Size oldsize;
+ Size newsize;
+ UndoLogEntry *newlog;
+ UndoLogEntry *oldlog = ulog;
+
+ if (ulog->image_buf_len >= target_size)
+ return oldlog;
+
+ /* adjust the target size */
+ target_size = undolog_adjust_bufsize(target_size);
+
+ /*
+ * Allocate a new undo log entry with double the size, then copy its
+ * contents. No other process will write to the old entry, so there’s no
+ * need to lock it.
+ */
+ oldsize = UndoLogEntrySize(oldlog->image_buf_len);
+ newsize = UndoLogEntrySize(target_size);
+ chunk = dsa_allocate0(ULogState.dsa, newsize);
+ newlog = dsa_get_address(ULogState.dsa, chunk);
+
+ /* don't bother clearing expanded area */
+ memcpy(newlog, oldlog, oldsize);
+
+ /* adjust and initialize some attributes of the new log */
+ newlog->image_buf_len = target_size;
+ pg_atomic_init_u32(&newlog->refcount, 2);
+
+ ULogState.current_ulog = ulog = newlog;
+
+ /*
+ * Remap the hash entry to point to the new body. This shared hash is
+ * likely to become a hotspot, so be careful not to hold the lock for
+ * too long.
+ */
+ shhashent = dshash_find(ULogState.hash, &ulog->xid, true);
+ oldchunk = shhashent->body;
+ shhashent->body = chunk;
+ dshash_release_lock(ULogState.hash, shhashent);
+
+ /*
+ * Mark the entry body as moved, and free it if no other process is
+ * working on it. Otherwise, let that process handle the task.
+ */
+ Assert(oldlog == dsa_get_address(ULogState.dsa, oldchunk));
+ if (pg_atomic_sub_fetch_u32(&oldlog->refcount, 1) == 1)
+ {
+ /*
+ * No other process is referencing this entry. Decrement again.
+ */
+ if (pg_atomic_sub_fetch_u32(&oldlog->refcount, 1) == 0)
+ dsa_free(ULogState.dsa, oldchunk);
+ }
+
+ return newlog;
+}
+
+/*
+ * UndoLogWrite() - Writes an undolog record using current xid
+ *
+ * This write is WAL-logged.
+ */
+void
+UndoLogWrite(RmgrId rmgr, uint8 info, void *data, int len)
+{
+ FullTransactionId xid;
+ int reclen = sizeof(UndoLogRecord) + len;
+ int wreclen = sizeof(xl_ulog_write) + reclen;
+ xl_ulog_write *wrec;
+ UndoLogRecord *rec;
+ pg_crc32c undodata_crc;
+ UndoLogEntry *ulog;
+ XLogRecPtr recptr;
+ bool found;
+
+ Assert(!RecoveryInProgress());
+ Assert(!IsParallelWorker());
+
+ /*
+ * The following line may assign a new transaction ID. This is somewhat
+ * clumsy, but the caller needs to assign it soon.
+ */
+ xid = GetCurrentFullTransactionId();
+
+ if (!IsUnderPostmaster)
+ return;
+
+ /* the caller can set rmgr bits only */
+ Assert((info & ~ULR_RMGR_INFO_MASK) == 0);
+
+ ulog = undolog_find_ulog(xid, true, false, &found);
+ Assert(ulog);
+
+ /* create undo record as a part of WAL record to avoid copying */
+ wrec = undolog_ensure_buffer(wreclen);
+ rec = (UndoLogRecord *) &wrec->bytes;
+ rec->ul_tot_len = reclen;
+ rec->ul_rmid = rmgr;
+ rec->ul_info = info;
+
+ memcpy((char *)rec + sizeof(UndoLogRecord), data, len);
+
+ /* calculate CRC of the data */
+ INIT_CRC32C(undodata_crc);
+ COMP_CRC32C(undodata_crc, &rec->ul_rmid,
+ reclen - offsetof(UndoLogRecord, ul_rmid));
+ rec->ul_crc = undodata_crc;
+
+ /*
+ * Write an XLOG record for this undo log record. It is crucial to flush
+ * immediately, as this record will cancel the action taken immediately
+ * after.
+ */
+ wrec->xid = ulog->xid;
+ wrec->off = ulog->image_len;
+ wrec->len = reclen;
+ XLogBeginInsert();
+ XLogRegisterData((char *) wrec, wreclen);
+ recptr = XLogInsert(RM_ULOG_ID, XLOG_ULOG_WRITE);
+ XLogFlush(recptr);
+
+ /* Write the undo log record. */
+
+ /* expand entry body if it is too short */
+ if (ulog->image_buf_len < ulog->image_len + reclen)
+ ulog = undolog_realloc_ulog(ulog, ulog->image_len + reclen);
+
+ /* Append the record. */
+ memcpy(ulog->image + ulog->image_len, rec, reclen);
+ ulog->image_len += reclen;
+
+ /* ensure all previous writes are visible before follower continues. */
+ pg_write_barrier();
+}
+
+/*
+ * undolog_process_ulog() - Processes the undo log.
+ *
+ * 'op' specifies the operation mode (commit, abort, or prepare) passed to the
+ * rmgr routines. 'outercxt' is the memory context used while the rmgr routines
+ * are running.
+ */
+static void
+undolog_process_ulog(ULogOp op, UndoLogEntry *ulog,
+ bool redo, MemoryContext outercxt)
+{
+ unsigned char *p = ulog->image;
+ unsigned char *end = p + ulog->image_len;
+
+ Assert (ulog);
+ Assert (outercxt);
+
+ p += sizeof(UndoLogFileHeader);
+
+
+ while (p < end)
+ {
+ UndoLogRecord *rec = (UndoLogRecord *)p;
+ MemoryContext oldcxt;
+ pg_crc32c undodata_crc;
+
+ /* CRC check */
+ INIT_CRC32C(undodata_crc);
+ COMP_CRC32C(undodata_crc, &rec->ul_rmid,
+ rec->ul_tot_len - offsetof(UndoLogRecord, ul_rmid));
+ if (!EQ_CRC32C(rec->ul_crc, undodata_crc))
+ ereport(FATAL,
+ errcode_for_file_access(),
+ errmsg("incorrect undolog record checksum at %lld for xid %llu, abort undo",
+ (long long int) (p - (unsigned char *)ulog),
+ (long long unsigned int) U64FromFullTransactionId(ulog->xid)));
+
+ /* The undo routines may want to allocate memory in the outer context */
+ oldcxt = MemoryContextSwitchTo(outercxt);
+ RmgrUndo[rec->ul_rmid].rm_undo(rec, op, ulog->recovered, redo);
+ MemoryContextSwitchTo(oldcxt);
+
+ p += rec->ul_tot_len;
+ }
+}
+
+/*
+ * undolog_undo() - Processes undo log for the specified xid.
+ *
+ * The undo log file for the xid is removed before this function returns,
+ * regardless of whether it is processed or not. Therefore, the in-memory entry
+ * for this xid must be removed afterwards, if any.
+ */
+static void
+undolog_undo(bool isCommit, UndoLogEntry *ulog, FullTransactionId xid,
+ bool redo)
+{
+ ULogOp op;
+
+ Assert(!IsParallelWorker());
+
+ if (!ulog)
+ return;
+
+ if (isCommit)
+ op = ULOGOP_COMMIT;
+ else
+ op = ULOGOP_ABORT;
+
+ undolog_process_ulog(op, ulog, redo, CurrentMemoryContext);
+
+ undolog_drop_ulog(xid, redo);
+}
+
+/*
+ * UndoLog_UndoByXid()
+ *
+ * Processes undo logs for the specified transactions, intended for use in
+ * finishing prepared transactions or during recovery.
+ *
+ * children is the list of subtransaction IDs of the xid, with a length of
+ * nchildren.
+ */
+void
+UndoLog_UndoByXid(bool isCommit, TransactionId xid,
+ int nchildren, TransactionId *children, bool redo)
+{
+ uint32 nextepoch;
+ TransactionId nextxid;
+ uint32 epoch;
+ FullTransactionId fxid;
+ UndoLogEntry *ulog;
+ bool found;
+
+ nextepoch = EpochFromFullTransactionId(TransamVariables->nextXid);
+ nextxid = XidFromFullTransactionId(TransamVariables->nextXid);
+
+ /* Adjust epoch, if needed. */
+ if (xid <= nextxid)
+ epoch = nextepoch;
+ else
+ epoch = nextepoch - 1;
+
+ /* process undo logs */
+ fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
+
+ ulog = undolog_find_ulog(fxid, false, redo, &found);
+
+ if (ulog)
+ undolog_undo(isCommit, ulog, fxid, redo);
+
+ for (int i = 0 ; i < nchildren ; i++)
+ {
+ if (children[i] <= nextxid)
+ epoch = nextepoch;
+ else
+ epoch = nextepoch - 1;
+
+ fxid = FullTransactionIdFromEpochAndXid(epoch, children[i]);
+
+ ulog = undolog_find_ulog(fxid, false, redo, &found);
+
+ if (ulog)
+ undolog_undo(isCommit, ulog, fxid, redo);
+ }
+}
+
+/*
+ * AtEOXact_UndoLog() - At end-of-xact processing of undo logs.
+ *
+ * Processes all existing undo log files, leaving none remaining after this
+ * function returns.
+ */
+void
+AtEOXact_UndoLog(bool isCommit)
+{
+ if (ULogState.ulog_xids_num == 0)
+ return;
+
+ for (int i = ULogState.ulog_xids_num - 1 ; i >= 0 ; i--)
+ {
+ UndoLogEntry *ulog;
+ bool found;
+
+ ulog = undolog_find_ulog(ULogState.ulog_xids[i], false, false, &found);
+ Assert(ulog);
+
+ undolog_undo(isCommit, ulog, ULogState.ulog_xids[i], false);
+ }
+
+ ULogState.current_ulog = NULL;
+}
+
+/*
+ * AtEOSubXact_UndoLog() - At end-of-subxact processing of undo logs.
+ *
+ * The undo log for the subtransaction will be removed on abort. It will remain
+ * on commit and be processed at the end of the top-level transaction.
+ */
+void
+AtEOSubXact_UndoLog(bool isCommit)
+{
+ FullTransactionId xid;
+ UndoLogEntry *ulog;
+ bool found;
+
+ /*
+ * Undo logs of committed subtransactions are processed at the end of the
+ * top-level transaction.
+ */
+ if (isCommit)
+ return;
+
+ xid = GetCurrentFullTransactionIdIfAny();
+
+ /* Return if the innermost subxid is not assigned. */
+ if (!FullTransactionIdIsValid(xid))
+ return;
+
+ ulog = undolog_find_ulog(xid, false, false, &found);
+
+ if (ulog)
+ undolog_undo(isCommit, ulog, xid, false);
+}
+
+/*
+ * AtPrepare_UndoLog()
+ *
+ * Blow away the xid list for the current transaction.
+ *
+ * Undo log entries in shared memory are left as-is, waiting for
+ * finish-prepared processing.
+ */
+void
+AtPrepare_UndoLog(void)
+{
+ ULogState.ulog_xids_num = 0;
+}
+
+static void
+undolog_cleanup_init(void)
+{
+ for (int rmid = 0; rmid <= RM_MAX_ID; rmid++)
+ {
+ if (RmgrUndo[rmid].rm_name == NULL)
+ continue;
+
+ if (RmgrUndo[rmid].rm_undocleanupinit != NULL)
+ RmgrUndo[rmid].rm_undocleanupinit();
+ }
+}
+
+void
+UndoLogRecoveryEnd(void)
+{
+ for (int rmid = 0; rmid <= RM_MAX_ID; rmid++)
+ {
+ if (RmgrUndo[rmid].rm_name == NULL)
+ continue;
+
+ if (RmgrUndo[rmid].rm_undorecoveryend != NULL)
+ RmgrUndo[rmid].rm_undorecoveryend();
+ }
+}
+
+/*
+ * UndoLogCleanup() - On-recovery cleanup of undo log
+ *
+ * This function is called after ULOG file consistency is established, either
+ * when recovery reaches consistency or after recovery finishes if hot standby
+ * is not active.
+ */
+void
+UndoLogCleanup(bool end_of_recovery)
+{
+ dshash_seq_status hstat;
+ UndoLogHashEntry *p;
+ UndoLogEntry *ulog;
+ dsa_pointer ulog_dsap;
+ MemoryContext outercxt;
+
+
+ /*
+ * Some memory allocation occurs during this process. Use a separate memory
+ * context to avoid memory leaks.
+ */
+ outercxt = MemoryContextSwitchTo(UndoLogContext);
+
+ undolog_cleanup_init();
+
+ /*
+ * Scan through all undo log files.
+ *
+ * Since we're in recovery and this is called only once, take a simpler way
+ * in exchange for allowing a possible brief pause of checkpointer. Because
+ * we may drop some entries, we run the dshash loop in exclusive lock mode.
+ */
+ dshash_seq_init(&hstat, ULogState.hash, true);
+ while ((p = dshash_seq_next(&hstat)) != NULL)
+ {
+ FullTransactionId log_fxid;
+ TransactionId log_xid;
+ FullTransactionId next_fxid;
+ TransactionId next_xid;
+ uint32 oldest_epoch;
+ TransactionId oldest_xid;
+ FullTransactionId oldest_fxid;
+ ULogOp op;
+ bool xact_prepared;
+ uint32 refcount PG_USED_FOR_ASSERTS_ONLY;
+
+ ulog_dsap = p->body;
+ ulog = dsa_get_address(ULogState.dsa, ulog_dsap);
+
+ /* increment refcount to avoid deletion of this block */
+ refcount = pg_atomic_add_fetch_u32(&ulog->refcount, 1);
+
+ /* refcount should not be 1, see CheckPointUndoLog for details. */
+ Assert(refcount > 1);
+
+ /*
+ * Make sure the log's xid is valid.
+ */
+ log_fxid = ulog->xid;
+ log_xid = XidFromFullTransactionId(log_fxid);
+
+ LWLockAcquire(XactTruncationLock, LW_SHARED);
+ next_fxid = TransamVariables->nextXid;
+ oldest_xid = TransamVariables->oldestClogXid;
+ LWLockRelease(XactTruncationLock);
+
+ /* Generate full xid for oldest_xid based on next_fxid */
+ next_xid = XidFromFullTransactionId(next_fxid);
+ oldest_epoch = EpochFromFullTransactionId(next_fxid);
+
+ /* adjust epoch for oldest xid */
+ if (oldest_xid > next_xid)
+ oldest_epoch--;
+
+ oldest_fxid =
+ FullTransactionIdFromEpochAndXid(oldest_epoch, oldest_xid);
+
+ /* check the ulog xid */
+ if (FullTransactionIdPrecedes(log_fxid, oldest_fxid))
+ ereport(FATAL,
+ errcode_for_file_access(),
+ errmsg("undolog found for too-old transaction %llu",
+ (long long unsigned int) U64FromFullTransactionId(log_fxid)));
+
+ /* All transactions with undo log must be in-progress. */
+ if (!TransactionIdIsInProgress(log_xid))
+ ereport(FATAL,
+ errcode_for_file_access(),
+ errmsg("undolog found for non-active transaction: %llu",
+ (long long unsigned int) U64FromFullTransactionId(log_fxid)));
+
+ /*
+ * Let undo routines perform cleanup tasks with appropriate
+ * assumptions. If the transaction is prepared or when recovery is
+ * reaching consistency, assume it is active; otherwise, perform abort
+ * cleanups.
+ */
+ xact_prepared = TwoPhaseXidExists(log_xid, false);
+
+ if (!end_of_recovery || xact_prepared)
+ op = ULOGOP_ACTIVE;
+ else
+ op = ULOGOP_ABORT;
+
+ undolog_process_ulog(op, ulog, true, outercxt);
+
+ if (!xact_prepared)
+ {
+ /*
+ * Since undolog_drop_ulog() is not supposed to be called inside
+ * dshash loops, we deliberately replicate part of the function
+ * here.
+ */
+ dshash_delete_current(&hstat);
+ pg_atomic_sub_fetch_u32(&UndoLogCtrl->nhashent, 1);
+
+ if (pg_atomic_sub_fetch_u32(&ulog->refcount, 2) == 0)
+ dsa_free(ULogState.dsa, ulog_dsap);
+
+ if (undolog_file_exists(log_fxid))
+ undolog_remove_file(log_fxid);
+ }
+ }
+
+ dshash_seq_term(&hstat);
+
+ MemoryContextSwitchTo(outercxt);
+
+ ULogState.current_ulog = NULL;
+}
/*
* undollg_redo()
@@ -31,8 +1112,192 @@ undolog_redo(XLogReaderState *record)
if (info == XLOG_ULOG_CREATE)
{
+ xl_ulog_create *rec = (xl_ulog_create *) XLogRecGetData(record);
+ bool found;
+
+ /*
+ * We don't check for the existence of the log. Although the log should
+ * not be found in a consistent state, it may appear during the
+ * inconsistent period in recovery.
+ */
+ (void) undolog_find_ulog(rec->xid, true, true, &found);
}
else if (info == XLOG_ULOG_WRITE)
{
+ UndoLogEntry *ulog;
+ Size target_size;
+ xl_ulog_write *rec = (xl_ulog_write *) XLogRecGetData(record);
+ bool found;
+
+ ulog = undolog_find_ulog(rec->xid, false, false, &found);
+ Assert(ulog);
+
+ /* mark this log as recovered */
+ ulog->recovered = true;
+
+ target_size = rec->off + rec->len;
+
+ if (ulog->image_buf_len < target_size)
+ ulog = undolog_realloc_ulog(ulog, target_size);
+
+ memcpy(ulog->image + rec->off, rec->bytes, rec->len);
+
+ /*
+ * The log image can extend beyond the end of the write during crash
+ * recovery. Even in that case, the undo log will have grown to its final
+ * length by the time consistency is reached. Therefore, we don't perform
+ * sanity checks on the image length. Additionally, XLOG_ULOG_CREATE writes an
+ * extra header part, which reappears as XLOG_ULOG_WRITE. However, this
+ * is not an issue from an integrity perspective. Although we could
+ * omit issuing the XLOG_ULOG_WRITE record for the header part, it is
+ * currently still emitted to maintain the apparent integrity among
+ * XLOG records.
+ */
+ ulog->image_len = rec->off + rec->len;
+
+ /* ensure all previous writes are visible before follower continues. */
+ pg_write_barrier();
}
}
+
+/*
+ * CheckPointUndoLog
+ *
+ * This is called during a checkpoint. It must ensure that any undo log writes
+ * that were WAL-logged before the start of the checkpoint are securely flushed
+ * to disk, so that neither the existence nor the content of logs written
+ * before the checkpoint start is lost.
+ */
+void
+CheckPointUndoLog(void)
+{
+ dshash_seq_status hstat;
+ UndoLogHashEntry *p;
+ dsa_pointer *ulogs;
+ UndoLogEntry *ulog;
+ dsa_pointer ulog_dsap;
+ int ulogs_len;
+ int n_initial_ulogs;
+ int nulogs;
+ bool written = false;
+
+ if (!IsUnderPostmaster)
+ return;
+
+
+ /*
+ * Allocate working area with the approximate size of the current hash
+ * entries. This is not accurate but enough to avoid frequent repallocs.
+ * Add 16 as an arbitrary number to avoid repalloc as much as possible.
+ */
+ n_initial_ulogs = pg_atomic_read_u32(&UndoLogCtrl->nhashent) + 16;
+
+ ulogs_len = 64;
+
+ while (ulogs_len < n_initial_ulogs)
+ ulogs_len *= 2;
+
+ ulogs = MemoryContextAlloc(UndoLogContext,
+ ulogs_len * sizeof(dsa_pointer));
+ nulogs = 0;
+
+ /*
+ * This loop acquires a lock on the dshash. To minimize lock time, we only
+ * collect candidate UNDO logs here, and the actual processing is done
+ * afterward.
+
+ */
+ dshash_seq_init(&hstat, ULogState.hash, false);
+ while ((p = dshash_seq_next(&hstat)) != NULL)
+ {
+ uint32 refcount PG_USED_FOR_ASSERTS_ONLY;
+
+ ulog_dsap = p->body;
+
+ ulog = dsa_get_address(ULogState.dsa, ulog_dsap);
+
+ /* increment refcount to avoid deletion of this block */
+ refcount = pg_atomic_add_fetch_u32(&ulog->refcount, 1);
+
+ /*
+ * Since undolog_drop_ulog() drops the hash entry before the DSA entry,
+ * we should not expect to fetch the dropped DSA entries at this point.
+ * (A failure here would suggest a read-after-write ordering problem.)
+ */
+ Assert(refcount > 1);
+ Assert(nulogs <= ulogs_len);
+
+ if (unlikely(nulogs == ulogs_len))
+ {
+ ulogs_len *= 2;
+ ulogs = repalloc(ulogs, ulogs_len * sizeof(dsa_pointer));
+ }
+
+ ulogs[nulogs] = ulog_dsap;
+ nulogs++;
+ }
+ dshash_seq_term(&hstat);
+
+ for (int i = 0; i < nulogs; i++)
+ {
+ uint32 refcount;
+
+ ulog = dsa_get_address(ULogState.dsa, ulogs[i]);
+
+ /*
+ * Check the refcount again to see whether the undo log has already been
+ * dropped before actually writing the file. Since we have incremented it,
+ * a value of 1 indicates that the entry is being dropped by another
+ * process. Otherwise we can safely write the contents to file. See
+ * UndoLogWrite() and undolog_drop_ulog() for the corresponding code.
+ */
+ if (pg_atomic_read_u32(&ulog->refcount) > 1 &&
+ ulog->image_len > ulog->written_upto)
+ {
+ off_t start_off;
+ off_t end_off;
+
+ ulog = dsa_get_address(ULogState.dsa, ulogs[i]);
+
+ /*
+ * Ensure visibility of previously written data before proceeding
+ * with reads.
+ */
+ pg_read_barrier();
+
+ /*
+ * Copy values from shared memory to local memory to finalize them.
+ */
+ start_off = ulog->written_upto;
+ end_off = ulog->image_len;
+
+ if (start_off < end_off)
+ {
+ undolog_write_to_file(ulog, start_off, end_off);
+
+ if (ulog->written_upto < end_off)
+ written = true;
+
+ /* No fencing needed, as only checkpointer accesses it. */
+ ulog->written_upto = end_off;
+ }
+ }
+
+ /*
+ * Decrement the 1 we added ourselves. If this results in 0, it means
+ * the DSA area is no longer needed, and we ourselves need to delete
+ * it.
+ */
+ refcount = pg_atomic_sub_fetch_u32(&ulog->refcount, 1);
+
+ /* Free the entry if it is no longer used */
+ if (refcount == 0)
+ dsa_free(ULogState.dsa, ulogs[i]);
+ }
+
+ pfree(ulogs);
+
+ /* Sync the directory if any files have been written to it. */
+ if (written)
+ fsync_fname(UNDOLOG_DIR, true);
+}
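The reference-count hand-off that the comments in CheckPointUndoLog() describe can be modeled in isolation. The following is only a sketch under stated assumptions: it uses C11 <stdatomic.h> in place of the pg_atomic_* API, and DemoEntry, demo_init(), demo_pin() and demo_unpin() are hypothetical names that do not appear in the patch. It assumes the hash table owns one reference, a visitor takes a temporary one, and whoever brings the count to zero frees the entry.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a shared undo log entry (not from the patch). */
typedef struct DemoEntry
{
	atomic_uint refcount;		/* 1 while the hash table references it */
	bool		freed;			/* set instead of really freeing memory */
} DemoEntry;

/* A freshly registered entry is owned by the hash table only. */
void
demo_init(DemoEntry *e)
{
	atomic_init(&e->refcount, 1);
	e->freed = false;
}

/* Take a temporary reference and return the resulting count. */
uint32_t
demo_pin(DemoEntry *e)
{
	uint32_t	now = atomic_fetch_add(&e->refcount, 1) + 1;

	/* The owner's reference must still have been there when we pinned. */
	assert(now > 1);
	return now;
}

/*
 * Release one reference.  Returns true if this call released the last
 * reference and therefore "freed" the entry.
 */
bool
demo_unpin(DemoEntry *e)
{
	uint32_t	remaining = atomic_fetch_sub(&e->refcount, 1) - 1;

	if (remaining == 0)
	{
		e->freed = true;
		return true;
	}
	return false;
}
```

In this model, if the owner drops its reference while a visitor still holds a pin, the owner's demo_unpin() returns false and the visitor's later demo_unpin() performs the free, which is the situation the "Decrement the 1 we added ourselves" comment describes.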
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 004f7e10e55..5739ef3b7f5 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
#include "access/subtrans.h"
#include "access/transam.h"
#include "access/twophase.h"
+#include "access/undolog.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
@@ -2432,6 +2433,12 @@ CommitTransaction(void)
AtEOXact_MultiXact();
+ /*
+ * Discard UNDO log. This does not necessarily need to be done here, but it
+ * is performed in the same place as abort.
+ */
+ AtEOXact_UndoLog(true);
+
ResourceOwnerRelease(TopTransactionResourceOwner,
RESOURCE_RELEASE_LOCKS,
true, true);
@@ -2670,6 +2677,7 @@ PrepareTransaction(void)
AtPrepare_PgStat();
AtPrepare_MultiXact();
AtPrepare_RelationMap();
+ AtPrepare_UndoLog();
/*
* Here is where we really truly prepare.
@@ -2971,6 +2979,13 @@ AbortTransaction(void)
AtEOXact_TypeCache();
AtEOXact_Inval(false);
AtEOXact_MultiXact();
+
+ /*
+ * Drop storage files. This has to happen after buffer pins are
+ * dropped, as required by DropRelationBuffers().
+ */
+ AtEOXact_UndoLog(false);
+
ResourceOwnerRelease(TopTransactionResourceOwner,
RESOURCE_RELEASE_LOCKS,
false, true);
@@ -5174,6 +5189,7 @@ CommitSubTransaction(void)
AtEOSubXact_TypeCache();
AtEOSubXact_Inval(true);
AtSubCommit_smgr();
+ AtEOSubXact_UndoLog(true);
/*
* The only lock we actually release here is the subtransaction XID lock.
@@ -5356,6 +5372,7 @@ AbortSubTransaction(void)
RESOURCE_RELEASE_AFTER_LOCKS,
false, false);
AtSubAbort_smgr();
+ AtEOSubXact_UndoLog(false);
AtEOXact_GUC(false, s->gucNestLevel);
AtEOSubXact_SPI(false, s->subTransactionId);
@@ -6246,6 +6263,8 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
DropRelationFiles(parsed->xlocators, parsed->nrels, true);
}
+ UndoLog_UndoByXid(true, xid, parsed->nsubxacts, parsed->subxacts, true);
+
if (parsed->nstats > 0)
{
/* see equivalent call for relations above */
@@ -6357,6 +6376,8 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
DropRelationFiles(parsed->xlocators, parsed->nrels, true);
}
+ UndoLog_UndoByXid(false, xid, parsed->nsubxacts, parsed->subxacts, true);
+
if (parsed->nstats > 0)
{
/* see equivalent call for relations above */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e9d029ebfac..5ba4468f52d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -55,6 +55,7 @@
#include "access/timeline.h"
#include "access/transam.h"
#include "access/twophase.h"
+#include "access/undolog.h"
#include "access/xact.h"
#include "access/xlog_internal.h"
#include "access/xlogarchive.h"
@@ -5905,13 +5906,19 @@ StartupXLOG(void)
if (InRecovery)
{
/*
- * Clean up unlogged relations if not already done. If consistency has
- * been established, this cleanup would have occurred when entering hot
- * standby mode (see CheckRecoveryConsistency for details).
+ * If consistency has not been established, process undo log files to
+ * clean up storage files from unfinished transactions and clean up
+ * unlogged relations. (See CheckRecoveryConsistency for details.)
*/
if (!reachedConsistency)
+ {
+ UndoLogCleanup(true);
ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP);
+ }
+
ResetUnloggedRelations(UNLOGGED_RELATION_INIT);
+
+ UndoLogRecoveryEnd();
}
/*
@@ -7516,6 +7523,7 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
CheckPointSnapBuild();
CheckPointLogicalRewriteHeap();
CheckPointReplicationOrigin();
+ CheckPointUndoLog();
/* Write out all dirty data in SLRUs and the main buffer pool */
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 278154ad9a0..93d56dee75e 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -33,6 +33,7 @@
#include "access/timeline.h"
#include "access/transam.h"
+#include "access/undolog.h"
#include "access/xact.h"
#include "access/xlog_internal.h"
#include "access/xlogarchive.h"
@@ -2278,6 +2279,7 @@ CheckRecoveryConsistency(void)
* backends don't try to read whatever garbage is left over from
* before.
*/
+ UndoLogCleanup(false);
ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP);
SpinLockAcquire(&XLogRecoveryCtl->info_lck);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index d68aa29d93e..4982861b624 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
#include "access/syncscan.h"
#include "access/transam.h"
#include "access/twophase.h"
+#include "access/undolog.h"
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
#include "access/xlogwait.h"
@@ -150,6 +151,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
size = add_size(size, WaitLSNShmemSize());
+ size = add_size(size, UndoLogShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -287,6 +289,7 @@ CreateOrAttachShmemStructs(void)
XLogPrefetchShmemInit();
XLogRecoveryShmemInit();
CLOGShmemInit();
+ UndoLogShmemInit();
CommitTsShmemInit();
SUBTRANSShmemInit();
MultiXactShmemInit();
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index db6ed784ab3..8a12a977c4e 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -166,6 +166,9 @@ static const char *const BuiltinTrancheNames[] = {
[LWTRANCHE_SUBTRANS_SLRU] = "SubtransSLRU",
[LWTRANCHE_XACT_SLRU] = "XactSLRU",
[LWTRANCHE_PARALLEL_VACUUM_DSA] = "ParallelVacuumDSA",
+ [LWTRANCHE_UNDOLOG_DSA] = "UndoLogDSA",
+ [LWTRANCHE_UNDOLOG_HASH] = "UndoLogHash",
+ [LWTRANCHE_UNDOLOG_DATA] = "UndoLogData",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 8efb4044d6f..b8f31db46aa 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -347,6 +347,7 @@ DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
WaitLSN "Waiting to read or update shared Wait-for-LSN state."
+UndoLog "Waiting to read or update shared UNDO log state."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index a024b1151d0..fda74d7af00 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -24,6 +24,7 @@
#include "access/htup_details.h"
#include "access/session.h"
#include "access/tableam.h"
+#include "access/undolog.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
@@ -640,6 +641,9 @@ BaseInit(void)
*/
InitXLogInsert();
+ /* Initialize undo log system */
+ InitUndoLog();
+
/* Initialize lock manager's local structs */
InitLockManagerAccess();
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 9a91830783e..7c1f4e1f53a 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -307,6 +307,7 @@ void setup_signals(void);
void setup_text_search(void);
void create_data_directory(void);
void create_xlog_or_symlink(void);
+void create_ulog(void);
void warn_on_mount_point(int error);
void initialize_data_directory(void);
@@ -3012,6 +3013,21 @@ create_xlog_or_symlink(void)
free(subdirloc);
}
+/* Create undo log directory */
+void
+create_ulog(void)
+{
+ char *subdirloc;
+
+ /* form name of the place for the subdirectory */
+ subdirloc = psprintf("%s/pg_ulog", pg_data);
+
+ if (mkdir(subdirloc, pg_dir_create_mode) < 0)
+ pg_fatal("could not create directory \"%s\": %m",
+ subdirloc);
+
+ free(subdirloc);
+}
void
warn_on_mount_point(int error)
@@ -3046,6 +3062,7 @@ initialize_data_directory(void)
create_data_directory();
create_xlog_or_symlink();
+ create_ulog();
/* Create required subdirectories (other than pg_wal) */
printf(_("creating subdirectories ... "));
diff --git a/src/bin/pg_waldump/t/001_basic.pl b/src/bin/pg_waldump/t/001_basic.pl
index 578e4731394..09396f065fa 100644
--- a/src/bin/pg_waldump/t/001_basic.pl
+++ b/src/bin/pg_waldump/t/001_basic.pl
@@ -73,7 +73,8 @@ BRIN
CommitTs
ReplicationOrigin
Generic
-LogicalMessage$/,
+LogicalMessage
+UndoLog$/,
'rmgr list');
diff --git a/src/include/access/undolog.h b/src/include/access/undolog.h
index 8955197398d..827d3f9fa90 100644
--- a/src/include/access/undolog.h
+++ b/src/include/access/undolog.h
@@ -33,6 +33,14 @@ typedef struct UndoLogRecord
/* rmgr-specific data follow, no padding */
} UndoLogRecord;
+/* Operation mode for rm_undo() resource manager routine */
+typedef enum ULogOp
+{
+ ULOGOP_COMMIT, /* tell to perform commit action */
+ ULOGOP_ABORT, /* tell to perform abort action */
+ ULOGOP_ACTIVE /* tell xact is active */
+} ULogOp;
+
/*
* The high 4 bits in ul_info may be used freely by rmgr. The lower 4 bits are
* not used for now.
@@ -57,6 +65,20 @@ typedef struct xl_ulog_write
unsigned char bytes[FLEXIBLE_ARRAY_MEMBER];
} xl_ulog_write;
+extern Size UndoLogShmemSize(void);
+extern void UndoLogShmemInit(void);
+extern void InitUndoLog(void);
+extern void UndoLogWrite(RmgrId rmgr, uint8 info, void *data, int len);
+extern void AtEOXact_UndoLog(bool isCommit);
+extern void AtEOSubXact_UndoLog(bool isCommit);
+extern void AtPrepare_UndoLog(void);
+extern void UndoLog_UndoByXid(bool isCommit, TransactionId xid,
+ int nchildren, TransactionId *children,
+ bool redo);
+extern void UndoLogCleanup(bool recovery_end);
+extern void UndoLogRecoveryEnd(void);
+extern void CheckPointUndoLog(void);
+
extern void undolog_redo(XLogReaderState *record);
extern void undolog_desc(StringInfo buf, XLogReaderState *record);
extern const char *undolog_identify(uint8 info);
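How the ULogOp value is chosen during cleanup can be summarized as a small decision function. This is a minimal sketch that mirrors the condition visible in the cleanup loop of undolog.c in this patch; the Demo* names are hypothetical, and the real code passes the result to undolog_process_ulog() rather than returning it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of ULogOp, kept separate from the patch's enum. */
typedef enum DemoULogOp
{
	DEMO_ULOGOP_COMMIT,
	DEMO_ULOGOP_ABORT,
	DEMO_ULOGOP_ACTIVE
} DemoULogOp;

/*
 * Decide what the undo routines should assume about a leftover log.
 * Prepared transactions, and any transaction seen before recovery has
 * ended, are treated as still active; everything else is aborted.
 */
DemoULogOp
demo_choose_cleanup_op(bool end_of_recovery, bool xact_prepared)
{
	if (!end_of_recovery || xact_prepared)
		return DEMO_ULOGOP_ACTIVE;
	return DEMO_ULOGOP_ABORT;
}

/* The log itself is dropped afterwards unless the xact is prepared. */
bool
demo_drop_log_after_cleanup(bool xact_prepared)
{
	return !xact_prepared;
}
```

Note that the two decisions are independent: a log processed as "active" at consistency is still dropped unless its transaction is prepared.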
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d70e6d37e09..d3cbfefda60 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -215,6 +215,9 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_SUBTRANS_SLRU,
LWTRANCHE_XACT_SLRU,
LWTRANCHE_PARALLEL_VACUUM_DSA,
+ LWTRANCHE_UNDOLOG_DSA,
+ LWTRANCHE_UNDOLOG_HASH,
+ LWTRANCHE_UNDOLOG_DATA,
LWTRANCHE_FIRST_USER_DEFINED,
} BuiltinTrancheIds;
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 88dc79b2bd6..cdee38e791a 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -84,3 +84,4 @@ PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
PG_LWLOCK(53, WaitLSN)
+PG_LWLOCK(54, UndoLog)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 27e7bfde48f..edca81d88b0 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2477,6 +2477,7 @@ RewriteState
RmgrData
RmgrDescData
RmgrId
+RmgrUndoData
RoleNameEntry
RoleNameItem
RoleSpec
@@ -3030,10 +3031,15 @@ UINT
ULARGE_INTEGER
ULONG
ULONG_PTR
+ULogOp
+ULogStateData
UV
UVersionInfo
UndoDescData
+UndoLogCtrlStruct
+UndoLogEntry
UndoLogFileHeader
+UndoLogHashEntry
UndoLogRecord
UnicodeNormalizationForm
UnicodeNormalizationQC
--
2.43.5
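For readers following the epoch adjustment in the cleanup loop of undolog.c above (the "adjust epoch for oldest xid" step), here is a minimal standalone sketch of the arithmetic. It assumes a full transaction ID is a 64-bit value with the epoch in the high 32 bits, and it uses plain integers instead of the FullTransactionId macros; demo_widen_oldest_xid() is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Widen a 32-bit oldest_xid to 64 bits, using the 64-bit next XID as the
 * reference point.  oldest_xid can never be ahead of next_xid, so if its
 * 32-bit value is numerically larger, it must belong to the previous epoch.
 */
uint64_t
demo_widen_oldest_xid(uint64_t next_fxid, uint32_t oldest_xid)
{
	uint32_t	next_xid = (uint32_t) next_fxid;
	uint32_t	epoch = (uint32_t) (next_fxid >> 32);

	if (oldest_xid > next_xid)
	{
		/* Sketch-only guard: a wrap-around implies a previous epoch. */
		assert(epoch > 0);
		epoch--;
	}

	return ((uint64_t) epoch << 32) | oldest_xid;
}
```

With next_fxid in epoch 5 at XID 100, an oldest_xid of 50 stays in epoch 5, while an oldest_xid of 4000000000 is placed in epoch 4.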
Attachment: v35-0005-Remove-function-for-retaining-files-on-outer-tra.patch (text/x-patch; charset=us-ascii)
From 754019d478e59220054bf52d9bd2ce216939e8ea Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 26 Jul 2024 09:40:17 +0900
Subject: [PATCH v35 05/21] Remove function for retaining files on outer
transaction aborts - step 1/2
The function RelationPreserveStorage() was introduced by commit
b9b8831ad6 in 2010 to retain storage files committed in a
subtransaction at the abort of outer transactions. However, no use
case for this behavior has emerged since then. Moving the at-commit
removal of storage files from pendingDeletes to the UNDO log system
would require the UNDO system to handle cancellation of already logged
entries, adding unnecessary complexity with no benefit. Therefore,
remove this feature.
---
src/backend/catalog/storage.c | 16 +++++++++++++++
src/backend/utils/cache/relmapper.c | 30 +++++++++--------------------
2 files changed, 25 insertions(+), 21 deletions(-)
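To make the removed behavior concrete, the following is a simplified model of a pending-deletes list and of what "preserving" a file means. It is a sketch, not the code in storage.c: the real list is a linked list of PendingRelDelete entries keyed by RelFileLocator, whereas this model uses a fixed array and a bare relation number, and all Demo* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_PENDING 8

/* One scheduled deletion: remove the file at commit or at abort. */
typedef struct DemoPendingDelete
{
	unsigned	relnumber;
	bool		at_commit;		/* true: delete at commit, false: at abort */
	bool		valid;			/* cleared when the entry is cancelled */
} DemoPendingDelete;

typedef struct DemoPendingList
{
	DemoPendingDelete items[DEMO_MAX_PENDING];
	size_t		n;
} DemoPendingList;

/* Schedule a file for deletion at commit or at abort. */
void
demo_schedule(DemoPendingList *list, unsigned relnumber, bool at_commit)
{
	assert(list->n < DEMO_MAX_PENDING);
	list->items[list->n].relnumber = relnumber;
	list->items[list->n].at_commit = at_commit;
	list->items[list->n].valid = true;
	list->n++;
}

/* Cancel matching entries, so the file survives that outcome after all. */
void
demo_preserve(DemoPendingList *list, unsigned relnumber, bool at_commit)
{
	for (size_t i = 0; i < list->n; i++)
	{
		if (list->items[i].relnumber == relnumber &&
			list->items[i].at_commit == at_commit)
			list->items[i].valid = false;
	}
}

/* Count the files that would be removed for the given outcome. */
size_t
demo_count_deletes(const DemoPendingList *list, bool is_commit)
{
	size_t		count = 0;

	for (size_t i = 0; i < list->n; i++)
	{
		if (list->items[i].valid && list->items[i].at_commit == is_commit)
			count++;
	}
	return count;
}
```

Calling demo_preserve() with at_commit set to false corresponds to the case this patch retires: cancelling an at-abort deletion that has already been recorded, which an append-only undo log cannot do without extra machinery.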
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index f56b3cc0f23..bdbed9fba3c 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -254,6 +254,22 @@ RelationPreserveStorage(RelFileLocator rlocator, bool atCommit)
PendingRelDelete *prev;
PendingRelDelete *next;
+ /*
+ * There is no caller that passes false for atCommit.
+ *
+ * The only caller that used to pass false for atCommit was
+ * write_relmapper_file(), which intended to preserve committed storage
+ * files for mapped relations if outer transactions aborted. However, this
+ * has not occurred for more than ten years, and it is unlikely to be
+ * needed in the future. The code to let storage files committed in
+ * subtransactions survive after the top transaction aborts makes the UNDO
+ * log system overly complex and inefficient. Therefore, this feature has
+ * been removed. The function signature is left unchanged to make this
+ * change less invasive and to prevent the function from being mistakenly
+ * called during transaction aborts.
+ */
+ Assert(atCommit);
+
prev = NULL;
for (pending = pendingDeletes; pending != NULL; pending = next)
{
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 48d344ae3ff..89072627120 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -1001,29 +1001,17 @@ write_relmap_file(RelMapFile *newmap, bool write_wal, bool send_sinval,
CacheInvalidateRelmap(dbid);
/*
- * Make sure that the files listed in the map are not deleted if the outer
- * transaction aborts. This had better be within the critical section
- * too: it's not likely to fail, but if it did, we'd arrive at transaction
- * abort with the files still vulnerable. PANICing will leave things in a
- * good state on-disk.
+ * There was a call to RelationPreserveStorage(). It was originally
+ * intended to ensure that storage files committed in subtransactions would
+ * survive an outer transaction's abort. This was introduced by commit
+ * b9b8831ad6 in 2010, but no use case has emerged since then. To simplify
+ * the UNDO log system, this code has been removed. See
+ * RelationMapUpdateMap() for more details. Now, we only check that this
+ * function is called in a top transaction.
*
- * Note: we're cheating a little bit here by assuming that mapped files
- * are either in pg_global or the database's default tablespace.
+ * During boot processing or recovery, the nest level will be zero.
*/
- if (preserve_files)
- {
- int32 i;
-
- for (i = 0; i < newmap->num_mappings; i++)
- {
- RelFileLocator rlocator;
-
- rlocator.spcOid = tsid;
- rlocator.dbOid = dbid;
- rlocator.relNumber = newmap->mappings[i].mapfilenumber;
- RelationPreserveStorage(rlocator, false);
- }
- }
+ Assert(!preserve_files || GetCurrentTransactionNestLevel() <= 1);
/* Critical section done */
if (write_wal)
--
2.43.5
Attachment: v35-0006-Remove-function-for-retaining-files-on-outer-tra.patch (text/x-patch; charset=us-ascii)
From af316f18cefcec8a40f25591c60b0c7bd21b9ea9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Wed, 31 Jul 2024 17:49:24 +0900
Subject: [PATCH v35 06/21] Remove function for retaining files on outer
transaction aborts - step 2/2
Remove function parameters that became unnecessary due to the previous
commit.
---
src/backend/utils/cache/relmapper.c | 39 ++++++++++++++---------------
1 file changed, 19 insertions(+), 20 deletions(-)
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 89072627120..75a1f550509 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -143,7 +143,7 @@ static void load_relmap_file(bool shared, bool lock_held);
static void read_relmap_file(RelMapFile *map, char *dbpath, bool lock_held,
int elevel);
static void write_relmap_file(RelMapFile *newmap, bool write_wal,
- bool send_sinval, bool preserve_files,
+ bool send_sinval,
Oid dbid, Oid tsid, const char *dbpath);
static void perform_relmap_update(bool shared, const RelMapFile *updates);
@@ -309,7 +309,7 @@ RelationMapCopy(Oid dbid, Oid tsid, char *srcdbpath, char *dstdbpath)
* file.
*/
LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
- write_relmap_file(&map, true, false, false, dbid, tsid, dstdbpath);
+ write_relmap_file(&map, true, false, dbid, tsid, dstdbpath);
LWLockRelease(RelationMappingLock);
}
@@ -634,9 +634,9 @@ RelationMapFinishBootstrap(void)
/* Write the files; no WAL or sinval needed */
LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
- write_relmap_file(&shared_map, false, false, false,
+ write_relmap_file(&shared_map, false, false,
InvalidOid, GLOBALTABLESPACE_OID, "global");
- write_relmap_file(&local_map, false, false, false,
+ write_relmap_file(&local_map, false, false,
MyDatabaseId, MyDatabaseTableSpace, DatabasePath);
LWLockRelease(RelationMappingLock);
}
@@ -887,7 +887,7 @@ read_relmap_file(RelMapFile *map, char *dbpath, bool lock_held, int elevel)
*/
static void
write_relmap_file(RelMapFile *newmap, bool write_wal, bool send_sinval,
- bool preserve_files, Oid dbid, Oid tsid, const char *dbpath)
+ Oid dbid, Oid tsid, const char *dbpath)
{
int fd;
char mapfilename[MAXPGPATH];
@@ -1000,19 +1000,6 @@ write_relmap_file(RelMapFile *newmap, bool write_wal, bool send_sinval,
if (send_sinval)
CacheInvalidateRelmap(dbid);
- /*
- * There was a call to RelationPreserveStorage(). It was originally
- * intended to ensure that storage files committed in subtransactions would
- * survive an outer transaction's abort. This was introduced by commit
- * b9b8831ad6 in 2010, but no use case has emerged since then. To simplify
- * the UNDO log system, this code has been removed. See
- * RelationMapUpdateMap() for more details. Now, we only check that this
- * function is called in a top transaction.
- *
- * During boot processing or recovery, the nest level will be zero.
- */
- Assert(!preserve_files || GetCurrentTransactionNestLevel() <= 1);
-
/* Critical section done */
if (write_wal)
END_CRIT_SECTION();
@@ -1058,8 +1045,20 @@ perform_relmap_update(bool shared, const RelMapFile *updates)
*/
merge_map_updates(&newmap, updates, allowSystemTableMods);
+ /*
+ * write_relmap_file() had a feature to allow storage files committed in
+ * subtransactions to survive the aborts of outer transactions. This was
+ * introduced by commit b9b8831ad6 in 2010, but no use case has emerged
+ * since then. To keep the UNDO log system straightforward, this code has
+ * been removed. See RelationMapUpdateMap() for more details. Now, we
+ * only check that this function is called in a top-level transaction.
+ *
+ * During boot processing or recovery, the nest level will be zero.
+ */
+ Assert(GetCurrentTransactionNestLevel() <= 1);
+
/* Write out the updated map and do other necessary tasks */
- write_relmap_file(&newmap, true, true, true,
+ write_relmap_file(&newmap, true, true,
(shared ? InvalidOid : MyDatabaseId),
(shared ? GLOBALTABLESPACE_OID : MyDatabaseTableSpace),
(shared ? "global" : DatabasePath));
@@ -1118,7 +1117,7 @@ relmap_redo(XLogReaderState *record)
* performed.
*/
LWLockAcquire(RelationMappingLock, LW_EXCLUSIVE);
- write_relmap_file(&newmap, false, true, false,
+ write_relmap_file(&newmap, false, true,
xlrec->dbid, xlrec->tsid, dbpath);
LWLockRelease(RelationMappingLock);
--
2.43.5
Attachment: v35-0007-Prevent-orphan-storage-files-after-server-crash.patch (text/x-patch; charset=us-ascii)
From d93c4fc771ff984368c99cb6f398aa5f3e17ae36 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 3 Oct 2024 19:30:23 +0900
Subject: [PATCH v35 07/21] Prevent orphan storage files after server crash
When a server crashes during a transaction that creates tables, newly
created but unused storage files are not removed. This patch prevents
such orphan files by utilizing the UNDO log system for storage files.
The behavior of this feature overlaps with the existing functionality
that handles the removal of unnecessary files during rollback via
pendingDeletes. As a result, that functionality has been removed in
this commit. However, commit-time file deletions are not covered by
the UNDO log system, so that part remains in use. Consequently, the
isCommit flag for entries in the pendingDeletes list is now always
true. To avoid unnecessary changes to the code, the flag has been
retained.
---
src/backend/access/heap/heapam_handler.c | 22 +--
src/backend/access/rmgrdesc/Makefile | 1 +
src/backend/access/rmgrdesc/smgrundodesc.c | 49 ++++++
src/backend/access/rmgrdesc/undologdesc.c | 2 +
src/backend/access/transam/undolog.c | 1 +
src/backend/catalog/index.c | 4 +-
src/backend/catalog/storage.c | 189 ++++++++++++++++++---
src/backend/commands/sequence.c | 4 +-
src/backend/commands/tablecmds.c | 19 ++-
src/backend/storage/buffer/bufmgr.c | 4 +-
src/backend/storage/file/reinit.c | 92 ++++++++++
src/backend/storage/smgr/smgr.c | 9 +
src/include/access/rmgrlist.h | 2 +-
src/include/catalog/storage.h | 2 +
src/include/catalog/storage_ulog.h | 41 +++++
src/include/storage/reinit.h | 4 +
src/include/storage/smgr.h | 1 +
src/test/recovery/t/013_crash_restart.pl | 19 +++
18 files changed, 410 insertions(+), 55 deletions(-)
create mode 100644 src/backend/access/rmgrdesc/smgrundodesc.c
create mode 100644 src/include/catalog/storage_ulog.h
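The idea in the commit message above, that files created by an unfinished transaction are removed from an undo record rather than left as orphans, can be illustrated with a toy in-memory model. This is only a sketch: the "files" are array slots, the undo log is a plain array, and recording the undo entry before creating the file is an assumption of the sketch rather than something the diff shown here confirms. All Demo* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_FILES 16

typedef struct DemoStorage
{
	bool		exists[DEMO_MAX_FILES];	/* which "files" are on disk */
	unsigned	undo[DEMO_MAX_FILES];	/* files created by the open xact */
	size_t		nundo;
} DemoStorage;

/* Create a file, remembering it so that an abort or crash can remove it. */
void
demo_create_file(DemoStorage *st, unsigned id)
{
	assert(id < DEMO_MAX_FILES && st->nundo < DEMO_MAX_FILES);
	st->undo[st->nundo++] = id;	/* record first (assumed ordering) */
	st->exists[id] = true;
}

/* On commit the created files are kept; only the undo records go away. */
void
demo_commit(DemoStorage *st)
{
	st->nundo = 0;
}

/* On abort, or when recovery finds a leftover log, remove recorded files. */
void
demo_abort_or_recover(DemoStorage *st)
{
	for (size_t i = 0; i < st->nundo; i++)
		st->exists[st->undo[i]] = false;
	st->nundo = 0;
}
```

Because the record survives independently of the backend's local memory, the same routine can serve both ordinary rollback and post-crash cleanup, which is why the abort half of pendingDeletes becomes redundant.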
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a8d95e0f1c1..00136b7e39b 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -611,8 +611,7 @@ heapam_relation_set_new_filelocator(Relation rel,
{
Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
rel->rd_rel->relkind == RELKIND_TOASTVALUE);
- smgrcreate(srel, INIT_FORKNUM, false);
- log_smgrcreate(newrlocator, INIT_FORKNUM);
+ RelationCreateFork(srel, INIT_FORKNUM, true, true);
}
smgrclose(srel);
@@ -656,16 +655,17 @@ heapam_relation_copy_data(Relation rel, const RelFileLocator *newrlocator)
{
if (smgrexists(RelationGetSmgr(rel), forkNum))
{
- smgrcreate(dstrel, forkNum, false);
-
- /*
- * WAL log creation if the relation is persistent, or this is the
- * init fork of an unlogged relation.
- */
- if (RelationIsPermanent(rel) ||
+ bool wal_log = RelationIsPermanent(rel) ||
(rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
- forkNum == INIT_FORKNUM))
- log_smgrcreate(newrlocator, forkNum);
+ forkNum == INIT_FORKNUM);
+
+ /*
+ * Usually, we don't use UNDO log for FSM or VM forks, as their
+ * creation is not transactional. However, we're currently copying
+ * the entire relation in a transactional manner, which requires
+ * after-crash cleanup.
+ */
+ RelationCreateFork(dstrel, forkNum, wal_log, true);
RelationCopyStorage(RelationGetSmgr(rel), dstrel, forkNum,
rel->rd_rel->relpersistence);
}
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index 542fd3d6a8e..fc4605bd30b 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -26,6 +26,7 @@ OBJS = \
rmgrdesc_utils.o \
seqdesc.o \
smgrdesc.o \
+ smgrundodesc.o \
spgdesc.o \
standbydesc.o \
tblspcdesc.o \
diff --git a/src/backend/access/rmgrdesc/smgrundodesc.c b/src/backend/access/rmgrdesc/smgrundodesc.c
new file mode 100644
index 00000000000..d9082ba079b
--- /dev/null
+++ b/src/backend/access/rmgrdesc/smgrundodesc.c
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * smgrundodesc.c
+ * rmgr undolog descriptor routines for catalog/storage.c
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/rmgrdesc/smgrundodesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+#include "catalog/storage_ulog.h"
+#include "lib/stringinfo.h"
+
+void
+smgr_undodesc(StringInfo buf, UndoLogRecord *record)
+{
+ uint8 info = ULogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+ if (info == ULOG_SMGR_CREATE)
+ {
+ ul_smgr_create *urec = (ul_smgr_create *) ULogRecGetData(record);
+
+ appendStringInfo(buf, ": %u/%u/%u, fork %d, backend %d",
+ urec->rlocator.spcOid,
+ urec->rlocator.dbOid,
+ urec->rlocator.relNumber,
+ urec->forknum, urec->backend);
+ }
+}
+
+const char *
+smgr_undoidentify(uint8 info)
+{
+ const char *id = NULL;
+
+ switch (info & ~XLR_INFO_MASK)
+ {
+ case ULOG_SMGR_CREATE:
+ id = "SMGRCREATE";
+ break;
+ }
+
+ return id;
+}
diff --git a/src/backend/access/rmgrdesc/undologdesc.c b/src/backend/access/rmgrdesc/undologdesc.c
index d717646d2e0..1810f2693ff 100644
--- a/src/backend/access/rmgrdesc/undologdesc.c
+++ b/src/backend/access/rmgrdesc/undologdesc.c
@@ -14,6 +14,8 @@
#include "postgres.h"
#include "access/undolog.h"
+#include "catalog/storage.h"
+#include "catalog/storage_ulog.h"
typedef struct UndoDescData
{
diff --git a/src/backend/access/transam/undolog.c b/src/backend/access/transam/undolog.c
index d19caac5946..e7abca07e03 100644
--- a/src/backend/access/transam/undolog.c
+++ b/src/backend/access/transam/undolog.c
@@ -35,6 +35,7 @@
#include "access/xlog.h"
#include "access/xloginsert.h"
#include "lib/dshash.h"
+#include "catalog/storage_ulog.h"
#include "miscadmin.h"
#include "storage/fd.h"
#include "storage/procarray.h"
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 74d0f3097eb..7e283fdbe5b 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3042,8 +3042,8 @@ index_build(Relation heapRelation,
if (indexRelation->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
!smgrexists(RelationGetSmgr(indexRelation), INIT_FORKNUM))
{
- smgrcreate(RelationGetSmgr(indexRelation), INIT_FORKNUM, false);
- log_smgrcreate(&indexRelation->rd_locator, INIT_FORKNUM);
+ RelationCreateFork(RelationGetSmgr(indexRelation),
+ INIT_FORKNUM, true, true);
indexRelation->rd_indam->ambuildempty(indexRelation);
}
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index bdbed9fba3c..0b6748d803f 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,19 +19,27 @@
#include "postgres.h"
+#include "access/amapi.h"
+#include "access/undolog.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
#include "access/xlogutils.h"
#include "catalog/storage.h"
+#include "catalog/storage_ulog.h"
#include "catalog/storage_xlog.h"
+#include "common/file_utils.h"
#include "miscadmin.h"
#include "storage/bulk_write.h"
+#include "storage/copydir.h"
+#include "storage/fd.h"
#include "storage/freespace.h"
#include "storage/proc.h"
+#include "storage/reinit.h"
#include "storage/smgr.h"
#include "utils/hsearch.h"
+#include "utils/inval.h"
#include "utils/memutils.h"
#include "utils/rel.h"
@@ -76,6 +84,8 @@ typedef struct PendingRelSync
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
static HTAB *pendingSyncHash = NULL;
+/* local functions */
+static void ulog_smgrcreate(SMgrRelation srel, ForkNumber forkNum);
/*
* AddPendingSync
@@ -147,28 +157,8 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
}
srel = smgropen(rlocator, procNumber);
- smgrcreate(srel, MAIN_FORKNUM, false);
- if (needs_wal)
- log_smgrcreate(&srel->smgr_rlocator.locator, MAIN_FORKNUM);
-
- /*
- * Add the relation to the list of stuff to delete at abort, if we are
- * asked to do so.
- */
- if (register_delete)
- {
- PendingRelDelete *pending;
-
- pending = (PendingRelDelete *)
- MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
- pending->rlocator = rlocator;
- pending->procNumber = procNumber;
- pending->atCommit = false; /* delete if abort */
- pending->nestLevel = GetCurrentTransactionNestLevel();
- pending->next = pendingDeletes;
- pendingDeletes = pending;
- }
+ RelationCreateFork(srel, MAIN_FORKNUM, needs_wal, register_delete);
if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
{
@@ -179,6 +169,31 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
return srel;
}
+/*
+ * RelationCreateFork
+ * Create physical storage for a fork of a relation.
+ *
+ * This function creates a relation fork in a transactional manner. When
+ * undo_log is true, the creation is UNDO-logged so that in case of transaction
+ * aborts or server crashes later on, the fork will be removed. If the caller
+ * plans to remove the fork in another way, it should pass false. Additionally,
+ * it is WAL-logged if wal_log is true.
+ */
+void
+RelationCreateFork(SMgrRelation srel, ForkNumber forkNum,
+ bool wal_log, bool undo_log)
+{
+ /* Schedule the removal of this fork at abort if requested. */
+ if (undo_log)
+ ulog_smgrcreate(srel, forkNum);
+
+ /* WAL-log this creation if requested. */
+ if (wal_log)
+ log_smgrcreate(&srel->smgr_rlocator.locator, forkNum);
+
+ smgrcreate(srel, forkNum, false);
+}
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -198,6 +213,20 @@ log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Perform UndoLogWrite of a ULOG_SMGR_CREATE record to the UNDO log.
+ */
+static void
+ulog_smgrcreate(SMgrRelation srel, ForkNumber forkNum)
+{
+ ul_smgr_create ulrec;
+
+ ulrec.rlocator = srel->smgr_rlocator.locator;
+ ulrec.backend = srel->smgr_rlocator.backend;
+ ulrec.forknum = forkNum;
+ UndoLogWrite(RM_SMGR_ID, ULOG_SMGR_CREATE, &ulrec, sizeof(ulrec));
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -218,13 +247,12 @@ RelationDropStorage(Relation rel)
pendingDeletes = pending;
/*
- * NOTE: if the relation was created in this transaction, it will now be
- * present in the pending-delete list twice, once with atCommit true and
- * once with atCommit false. Hence, it will be physically deleted at end
- * of xact in either case (and the other entry will be ignored by
- * smgrDoPendingDeletes, so no error will occur). We could instead remove
- * the existing list entry and delete the physical file immediately, but
- * for now I'll keep the logic simple.
+ * NOTE: If the relation was created in this transaction, it will now be
+ * present both in the pending-delete list for commit and in the UNDO log
+ * for abort. Thus, it will be physically deleted at the end of the
+ * transaction in either case. While we could remove existing UNDO log
+ * records, allowing this would add complexity and inefficiency to the UNDO
+ * log system. Therefore, keep the logic simple here.
*/
RelationCloseSmgr(rel);
@@ -1060,3 +1088,108 @@ smgr_redo(XLogReaderState *record)
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
+
+void
+smgr_undo(UndoLogRecord *record, ULogOp op, bool recovered, bool redo)
+{
+ uint8 info = record->ul_info & ~ULR_INFO_MASK;
+
+ if (info == ULOG_SMGR_CREATE)
+ {
+ ul_smgr_create *ulrec = (ul_smgr_create *) ULogRecGetData(record);
+
+ if (op == ULOGOP_ACTIVE)
+ {
+ /*
+ * This operation is specified only during recovery cleanups. If
+ * the transaction is prepared or still active, tell reinit not to
+ * reset this relation.
+ */
+ ResetUnloggedRelationIgnore(ulrec->rlocator, ulrec->backend);
+ }
+ else if (op == ULOGOP_ABORT)
+ {
+ /* The transaction aborted; remove the files immediately. */
+ SMgrRelation reln;
+ ForkNumber forks[3];
+ BlockNumber firstblocks[3] = {0};
+ int nforks = 0;
+
+ forks[nforks++] = ulrec->forknum;
+
+ /*
+ * If the MAIN fork was created in the transaction, the rollback
+ * should remove all forks of this relation. Although we could
+ * register an undo record individually for each fork, this may be
+ * more complex because VM and FSM can be created
+ * non-transactionally outside the transaction that created the
+ * MAIN fork.
+ */
+ if (ulrec->forknum == MAIN_FORKNUM)
+ {
+ forks[nforks++] = VISIBILITYMAP_FORKNUM;
+ forks[nforks++] = FSM_FORKNUM;
+ }
+
+ /*
+ * Drop buffers, then the files. This can be improved by using
+ * smgrdounlinkall(), but currently I take the simpler way.
+ */
+ reln = smgropen(ulrec->rlocator, ulrec->backend);
+ DropRelationBuffers(reln, forks, nforks, firstblocks);
+ for (int i = 0 ; i < nforks ; i++)
+ smgrunlink(reln, forks[i], true);
+
+ smgrclose(reln);
+ }
+ else if (op == ULOGOP_COMMIT)
+ {
+ /*
+ * If an init fork was created during recovery, the entire relation
+ * is set to be reset at recovery-end or the consistency point.
+ * Therefore, we need to drop the relation's buffers to prevent the
+ * end-of-recovery checkpoint from flushing storage files for these
+ * relations once they have been reset.
+ */
+ if (redo && ulrec->forknum == INIT_FORKNUM)
+ {
+ SMgrRelation reln;
+ int nforks;
+ ForkNumber forks[MAX_FORKNUM + 1];
+ BlockNumber firstblocks[MAX_FORKNUM + 1] = {0};
+
+ Assert(ulrec->backend == INVALID_PROC_NUMBER);
+
+ reln = smgropen(ulrec->rlocator, ulrec->backend);
+
+ nforks = 0;
+ for (int i = 0 ; i <= MAX_FORKNUM ; i++)
+ {
+ if (smgrexists(reln, i))
+ forks[nforks++] = i;
+ }
+
+ if (nforks > 0)
+ DropRelationBuffers(reln, forks, nforks, firstblocks);
+
+ smgrclose(reln);
+ }
+ }
+ else
+ elog(PANIC, "smgr_undo: unknown ulogop code %d", op);
+ }
+ else
+ elog(PANIC, "smgr_undo: unknown op code %u", info);
+}
+
+void
+smgr_undocleanupinit(void)
+{
+ ResetUnloggedRelationIgnoreClear();
+}
+
+void
+smgr_undoshutdown(void)
+{
+ ResetUnloggedRelationIgnoreClear();
+}
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 0188e8bbd5b..be6afc7df52 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -31,6 +31,7 @@
#include "catalog/objectaccess.h"
#include "catalog/pg_sequence.h"
#include "catalog/pg_type.h"
+#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
#include "commands/defrem.h"
#include "commands/sequence.h"
@@ -344,8 +345,7 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
SMgrRelation srel;
srel = smgropen(rel->rd_locator, INVALID_PROC_NUMBER);
- smgrcreate(srel, INIT_FORKNUM, false);
- log_smgrcreate(&rel->rd_locator, INIT_FORKNUM);
+ RelationCreateFork(srel, INIT_FORKNUM, true, true);
fill_seq_fork_with_data(rel, tuple, INIT_FORKNUM);
FlushRelationBuffers(rel);
smgrclose(srel);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 4345b96de5e..56c9d61aa21 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -15655,16 +15655,17 @@ index_copy_data(Relation rel, RelFileLocator newrlocator)
{
if (smgrexists(RelationGetSmgr(rel), forkNum))
{
- smgrcreate(dstrel, forkNum, false);
-
- /*
- * WAL log creation if the relation is persistent, or this is the
- * init fork of an unlogged relation.
- */
- if (RelationIsPermanent(rel) ||
+ bool wal_log = RelationIsPermanent(rel) ||
(rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
- forkNum == INIT_FORKNUM))
- log_smgrcreate(&newrlocator, forkNum);
+ forkNum == INIT_FORKNUM);
+
+ /*
+ * Usually, we don't use UNDO log for FSM or VM forks, as their
+ * creation is not transactional. However, we're currently copying
+ * the entire relation in a transactional manner, which requires
+ * after-crash cleanup.
+ */
+ RelationCreateFork(dstrel, forkNum, wal_log, true);
RelationCopyStorage(RelationGetSmgr(rel), dstrel, forkNum,
rel->rd_rel->relpersistence);
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 0f02bf62fa3..cb524cfa42c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -4812,8 +4812,7 @@ CreateAndCopyRelationData(RelFileLocator src_rlocator,
/*
* Create and copy all forks of the relation. During create database we
* have a separate cleanup mechanism which deletes complete database
- * directory. Therefore, each individual relation doesn't need to be
- * registered for cleanup.
+ * directory. Therefore, do not issue an UNDO log for this relation.
*/
RelationCreateStorage(dst_rlocator, relpersistence, false);
@@ -4827,6 +4826,7 @@ CreateAndCopyRelationData(RelFileLocator src_rlocator,
{
if (smgrexists(src_rel, forkNum))
{
+ /* Use smgrcreate() directly as no UNDO log is required. */
smgrcreate(dst_rel, forkNum, false);
/*
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 01e267abf9b..d3a42d3f566 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -34,6 +34,39 @@ typedef struct
RelFileNumber relnumber; /* hash key */
} unlogged_relation_entry;
+static char **ignore_files = NULL;
+static int nignore_elems = 0;
+static int nignore_files = 0;
+
+/*
+ * determine if the file should be ignored when resetting unlogged relations
+ */
+static bool
+reinit_ignore_file(const char *dirname, const char *name)
+{
+ char fnamebuf[MAXPGPATH];
+ int len;
+
+ if (nignore_files == 0)
+ return false;
+
+ /* snprintf bounds the whole result, unlike chained strncat calls */
+ snprintf(fnamebuf, sizeof(fnamebuf), "%s/%s", dirname, name);
+
+ for (int i = 0 ; i < nignore_files ; i++)
+ {
+ /* match ignoring fork part */
+ len = strlen(ignore_files[i]);
+ if (strncmp(fnamebuf, ignore_files[i], len) == 0 &&
+ (fnamebuf[len] == 0 || fnamebuf[len] == '_'))
+ return true;
+ }
+
+ return false;
+}
+
/*
* Reset unlogged relations from before the last restart.
*
@@ -204,6 +237,10 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
&forkNum, &segno))
continue;
+ /* Skip anything that undo log suggested to ignore */
+ if (reinit_ignore_file(dbspacedirname, de->d_name))
+ continue;
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -243,6 +280,10 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
&forkNum, &segno))
continue;
+ /* Skip anything that undo log suggested to ignore */
+ if (reinit_ignore_file(dbspacedirname, de->d_name))
+ continue;
+
/* We never remove the init fork. */
if (forkNum == INIT_FORKNUM)
continue;
@@ -294,6 +335,10 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
&forkNum, &segno))
continue;
+ /* Skip anything that undo log suggested to ignore */
+ if (reinit_ignore_file(dbspacedirname, de->d_name))
+ continue;
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -337,6 +382,10 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
&forkNum, &segno))
continue;
+ /* Skip anything that undo log suggested to ignore */
+ if (reinit_ignore_file(dbspacedirname, de->d_name))
+ continue;
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -366,6 +415,49 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
}
}
+/*
+ * Record relation files that should be left alone while reinitializing
+ * unlogged relations.
+ */
+void
+ResetUnloggedRelationIgnore(RelFileLocator rloc, ProcNumber backend)
+{
+ RelFileLocatorBackend rbloc;
+
+ if (nignore_files >= nignore_elems)
+ {
+ if (ignore_files == NULL)
+ {
+ nignore_elems = 16;
+ ignore_files = palloc(sizeof(char *) * nignore_elems);
+ }
+ else
+ {
+ nignore_elems *= 2;
+ ignore_files = repalloc(ignore_files,
+ sizeof(char *) * nignore_elems);
+ }
+ }
+
+ rbloc.backend = backend;
+ rbloc.locator = rloc;
+ ignore_files[nignore_files++] = relpath(rbloc, MAIN_FORKNUM);
+}
+
+/*
+ * Clear the ignore list
+ */
+void
+ResetUnloggedRelationIgnoreClear(void)
+{
+ if (nignore_elems == 0)
+ return;
+
+ pfree(ignore_files);
+ ignore_files = NULL;
+ nignore_elems = 0;
+ nignore_files = 0;
+}
+
/*
* Basic parsing of putative relation filenames.
*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 925728eb6c1..5a403ffb04b 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -813,6 +813,15 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+/*
+ * smgrunlink() -- unlink the storage file
+ */
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rlocator, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 02755b04bb9..66f564d9646 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -27,7 +27,7 @@
/* symbol name, textual name, redo, desc, identify, startup, cleanup, mask, decode, undo, undo_desc, undo_identify, undo_cleanup_init, undo_recoveryend */
PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL, NULL, xlog_decode, NULL, NULL, NULL, NULL, NULL)
PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL, NULL, xact_decode, NULL, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL, NULL, NULL, smgr_undo, smgr_undodesc, smgr_undoidentify, smgr_undocleanupinit, smgr_undoshutdown)
PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 72ef3ee92c0..3451d6ac80c 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -25,6 +25,8 @@ extern PGDLLIMPORT int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileLocator rlocator,
char relpersistence,
bool register_delete);
+extern void RelationCreateFork(SMgrRelation srel, ForkNumber forkNum,
+ bool wal_log, bool undo_log);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileLocator rlocator, bool atCommit);
extern void RelationPreTruncate(Relation rel);
diff --git a/src/include/catalog/storage_ulog.h b/src/include/catalog/storage_ulog.h
new file mode 100644
index 00000000000..41c181ba2af
--- /dev/null
+++ b/src/include/catalog/storage_ulog.h
@@ -0,0 +1,41 @@
+/*-------------------------------------------------------------------------
+ *
+ * storage_ulog.h
+ * prototypes for Undo Log support for backend/catalog/storage.c
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/catalog/storage_ulog.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef STORAGE_ULOG_H
+#define STORAGE_ULOG_H
+
+#include "access/undolog.h"
+#include "storage/smgr.h"
+
+/* ULOG gives us high 4 bits (just following xlog) */
+#define ULOG_SMGR_CREATE 0x10
+
+/* undo log entry for storage file creation */
+typedef struct ul_smgr_create
+{
+ RelFileLocator rlocator;
+ ProcNumber backend;
+ ForkNumber forknum;
+} ul_smgr_create;
+
+extern void smgr_undo(UndoLogRecord *record, ULogOp op,
+ bool recovered, bool redo);
+extern void smgr_undodesc(StringInfo buf, UndoLogRecord *record);
+extern const char *smgr_undoidentify(uint8 info);
+extern void smgr_undocleanupinit(void);
+extern void smgr_undoshutdown(void);
+
+#define ULogRecGetData(record) ((char *) (record) + sizeof(UndoLogRecord))
+#define ULogRecGetInfo(record) ((record)->ul_info)
+
+#endif /* STORAGE_ULOG_H */
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index 1373d509df2..02bf55d3a6b 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -16,9 +16,13 @@
#define REINIT_H
#include "common/relpath.h"
+#include "storage/relfilelocator.h"
extern void ResetUnloggedRelations(int op);
+extern void ResetUnloggedRelationIgnore(RelFileLocator rloc,
+ ProcNumber backend);
+extern void ResetUnloggedRelationIgnoreClear(void);
extern bool parse_filename_for_nontemp_relation(const char *name,
RelFileNumber *relnumber,
ForkNumber *fork,
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 899d0d681c5..a05436a8a7c 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -109,6 +109,7 @@ extern void smgrtruncate(SMgrRelation reln, ForkNumber *forknum,
int nforks, BlockNumber *nblocks);
extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void smgrregistersync(SMgrRelation reln, ForkNumber forknum);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void AtEOXact_SMgr(void);
extern bool ProcessBarrierSmgrRelease(void);
diff --git a/src/test/recovery/t/013_crash_restart.pl b/src/test/recovery/t/013_crash_restart.pl
index d5d24e31d90..4df88efeb3d 100644
--- a/src/test/recovery/t/013_crash_restart.pl
+++ b/src/test/recovery/t/013_crash_restart.pl
@@ -86,6 +86,23 @@ ok( pump_until(
$killme_stdout = '';
$killme_stderr = '';
+# Also, create a table whose storage should *not* survive.
+$killme_stdin .= q[
+CREATE TABLE should_not_survive (a int);
+SELECT pg_relation_filepath('should_not_survive');
+];
+ok( pump_until(
+ $killme, $psql_timeout, \$killme_stdout,
+ qr/base\/[[:digit:]\/]+[\r\n]$/m),
+ 'created a table');
+my $relfilerelpath = $killme_stdout;
+chomp($relfilerelpath);
+$killme_stdout = '';
+$killme_stderr = '';
+
+my $relfilepath = $node->data_dir . "/" . $relfilerelpath;
+ok( -e $relfilepath,
+ "storage file is created in xact that is going to crash");
# Start longrunning query in second session; its failure will signal that
# crash-restart has occurred. The initial wait for the trivial select is to
@@ -144,6 +161,8 @@ $killme->run();
($monitor_stdin, $monitor_stdout, $monitor_stderr) = ('', '', '');
$monitor->run();
+ok( ! -e $relfilepath,
+ "orphaned storage file is correctly removed");
# Acquire pid of new backend
$killme_stdin .= q[
--
2.43.5
v35-0008-Remove-isCommit-flag-from-PendingRelDelete.patch (text/x-patch; charset=us-ascii)
From b6a1b652686f0784569a808bc15b656de8178197 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 24 Oct 2024 20:19:09 +0900
Subject: [PATCH v35 08/21] Remove isCommit flag from PendingRelDelete
This is the first step in a series of three commits to modify
pendingDeletes.
The storage UNDO log now manages abort-time deletions, eliminating the
need for the pending delete mechanism in these cases. Therefore,
remove the atCommit flag from PendingRelDelete and adjust the related
code. In this initial step, some calls to smgrGetPendingDeletes() are
retained and will be removed in the next patch.
---
src/backend/catalog/storage.c | 27 ++++++++++++++-------------
1 file changed, 14 insertions(+), 13 deletions(-)
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 0b6748d803f..e3b0aa8983c 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -70,7 +70,6 @@ typedef struct PendingRelDelete
{
RelFileLocator rlocator; /* relation that may need to be deleted */
ProcNumber procNumber; /* INVALID_PROC_NUMBER if not a temp rel */
- bool atCommit; /* T=delete at commit; F=delete at abort */
int nestLevel; /* xact nesting level of request */
struct PendingRelDelete *next; /* linked-list link */
} PendingRelDelete;
@@ -241,7 +240,6 @@ RelationDropStorage(Relation rel)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
pending->rlocator = rel->rd_locator;
pending->procNumber = rel->rd_backend;
- pending->atCommit = true; /* delete if commit */
pending->nestLevel = GetCurrentTransactionNestLevel();
pending->next = pendingDeletes;
pendingDeletes = pending;
@@ -302,8 +300,7 @@ RelationPreserveStorage(RelFileLocator rlocator, bool atCommit)
for (pending = pendingDeletes; pending != NULL; pending = next)
{
next = pending->next;
- if (RelFileLocatorEquals(rlocator, pending->rlocator)
- && pending->atCommit == atCommit)
+ if (RelFileLocatorEquals(rlocator, pending->rlocator))
{
/* unlink and delete list entry */
if (prev)
@@ -628,9 +625,8 @@ SerializePendingSyncs(Size maxSize, char *startAddress)
/* remove deleted rnodes */
for (delete = pendingDeletes; delete != NULL; delete = delete->next)
- if (delete->atCommit)
- (void) hash_search(tmphash, &delete->rlocator,
- HASH_REMOVE, NULL);
+ (void) hash_search(tmphash, &delete->rlocator,
+ HASH_REMOVE, NULL);
hash_seq_init(&scan, tmphash);
while ((src = (RelFileLocator *) hash_seq_search(&scan)))
@@ -700,7 +696,7 @@ smgrDoPendingDeletes(bool isCommit)
else
pendingDeletes = next;
/* do deletion if called for */
- if (pending->atCommit == isCommit)
+ if (isCommit)
{
SMgrRelation srel;
@@ -773,9 +769,8 @@ smgrDoPendingSyncs(bool isCommit, bool isParallelWorker)
/* Skip syncing nodes that smgrDoPendingDeletes() will delete. */
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
- if (pending->atCommit)
- (void) hash_search(pendingSyncHash, &pending->rlocator,
- HASH_REMOVE, NULL);
+ (void) hash_search(pendingSyncHash, &pending->rlocator,
+ HASH_REMOVE, NULL);
hash_seq_init(&scan, pendingSyncHash);
while ((pendingsync = (PendingRelSync *) hash_seq_search(&scan)))
@@ -900,10 +895,16 @@ smgrGetPendingDeletes(bool forCommit, RelFileLocator **ptr)
RelFileLocator *rptr;
PendingRelDelete *pending;
+ if (!forCommit)
+ {
+ *ptr = NULL;
+ return 0;
+ }
+
nrels = 0;
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
- if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
+ if (pending->nestLevel >= nestLevel
&& pending->procNumber == INVALID_PROC_NUMBER)
nrels++;
}
@@ -916,7 +917,7 @@ smgrGetPendingDeletes(bool forCommit, RelFileLocator **ptr)
*ptr = rptr;
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
- if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
+ if (pending->nestLevel >= nestLevel
&& pending->procNumber == INVALID_PROC_NUMBER)
{
*rptr = pending->rlocator;
--
2.43.5
v35-0009-Remove-code-related-to-at-abort-pending-deletes.patch (text/x-patch; charset=us-ascii)
From c7b67b89a41053f856c2f6e19179653307d7149d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 25 Oct 2024 11:52:39 +0900
Subject: [PATCH v35 09/21] Remove code related to at-abort pending deletes
This is the second step in a series of three commits to modify
pendingDeletes.
With abort-time processing now managed by the storage UNDO log, the
pendingDeletes system no longer handles these deletions. Consequently,
the abort and prepare code paths no longer explicitly handle file
deletions. Remove the outdated code from these paths.
---
src/backend/access/rmgrdesc/xactdesc.c | 4 ---
src/backend/access/transam/twophase.c | 34 ++++----------------------
src/backend/access/transam/xact.c | 23 -----------------
src/backend/catalog/storage.c | 27 ++++++--------------
src/include/access/xact.h | 2 --
5 files changed, 13 insertions(+), 77 deletions(-)
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 889cb955c18..08172df83fd 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -251,7 +251,6 @@ ParsePrepareRecord(uint8 info, xl_xact_prepare *xlrec, xl_xact_parsed_prepare *p
parsed->dbId = xlrec->database;
parsed->nsubxacts = xlrec->nsubxacts;
parsed->nrels = xlrec->ncommitrels;
- parsed->nabortrels = xlrec->nabortrels;
parsed->nmsgs = xlrec->ninvalmsgs;
strncpy(parsed->twophase_gid, bufptr, xlrec->gidlen);
@@ -263,9 +262,6 @@ ParsePrepareRecord(uint8 info, xl_xact_prepare *xlrec, xl_xact_parsed_prepare *p
parsed->xlocators = (RelFileLocator *) bufptr;
bufptr += MAXALIGN(xlrec->ncommitrels * sizeof(RelFileLocator));
- parsed->abortlocators = (RelFileLocator *) bufptr;
- bufptr += MAXALIGN(xlrec->nabortrels * sizeof(RelFileLocator));
-
parsed->stats = (xl_xact_stats_item *) bufptr;
bufptr += MAXALIGN(xlrec->ncommitstats * sizeof(xl_xact_stats_item));
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 8455ceb057a..d3dac67d8cd 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -212,8 +212,6 @@ static void RecordTransactionCommitPrepared(TransactionId xid,
static void RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
- int nrels,
- RelFileLocator *rels,
int nstats,
xl_xact_stats_item *stats,
const char *gid);
@@ -1089,7 +1087,7 @@ StartPrepare(GlobalTransaction gxact)
TwoPhaseFileHeader hdr;
TransactionId *children;
RelFileLocator *commitrels;
- RelFileLocator *abortrels;
+
xl_xact_stats_item *abortstats = NULL;
xl_xact_stats_item *commitstats = NULL;
SharedInvalidationMessage *invalmsgs;
@@ -1116,7 +1114,6 @@ StartPrepare(GlobalTransaction gxact)
hdr.owner = gxact->owner;
hdr.nsubxacts = xactGetCommittedChildren(&children);
hdr.ncommitrels = smgrGetPendingDeletes(true, &commitrels);
- hdr.nabortrels = smgrGetPendingDeletes(false, &abortrels);
hdr.ncommitstats =
pgstat_get_transactional_drops(true, &commitstats);
hdr.nabortstats =
@@ -1146,11 +1143,6 @@ StartPrepare(GlobalTransaction gxact)
save_state_data(commitrels, hdr.ncommitrels * sizeof(RelFileLocator));
pfree(commitrels);
}
- if (hdr.nabortrels > 0)
- {
- save_state_data(abortrels, hdr.nabortrels * sizeof(RelFileLocator));
- pfree(abortrels);
- }
if (hdr.ncommitstats > 0)
{
save_state_data(commitstats,
@@ -1532,9 +1524,6 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
TransactionId latestXid;
TransactionId *children;
RelFileLocator *commitrels;
- RelFileLocator *abortrels;
- RelFileLocator *delrels;
- int ndelrels;
xl_xact_stats_item *commitstats;
xl_xact_stats_item *abortstats;
SharedInvalidationMessage *invalmsgs;
@@ -1569,8 +1558,6 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId));
commitrels = (RelFileLocator *) bufptr;
bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileLocator));
- abortrels = (RelFileLocator *) bufptr;
- bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileLocator));
commitstats = (xl_xact_stats_item *) bufptr;
bufptr += MAXALIGN(hdr->ncommitstats * sizeof(xl_xact_stats_item));
abortstats = (xl_xact_stats_item *) bufptr;
@@ -1603,7 +1590,6 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
else
RecordTransactionAbortPrepared(xid,
hdr->nsubxacts, children,
- hdr->nabortrels, abortrels,
hdr->nabortstats,
abortstats,
gid);
@@ -1627,21 +1613,15 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
* consistency with the regular xact.c code paths, must do this before
* releasing locks, so do it before running the callbacks.
*
+ * Deletion at abort is handled by undo logs.
+ *
* NB: this code knows that we couldn't be dropping any temp rels ...
*/
if (isCommit)
{
- delrels = commitrels;
- ndelrels = hdr->ncommitrels;
+ /* Make sure files supposed to be dropped are dropped */
+ DropRelationFiles(commitrels, hdr->ncommitrels, false);
}
- else
- {
- delrels = abortrels;
- ndelrels = hdr->nabortrels;
- }
-
- /* Make sure files supposed to be dropped are dropped */
- DropRelationFiles(delrels, ndelrels, false);
if (isCommit)
pgstat_execute_transactional_drops(hdr->ncommitstats, commitstats, false);
@@ -2152,7 +2132,6 @@ RecoverPreparedTransactions(void)
subxids = (TransactionId *) bufptr;
bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId));
bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileLocator));
- bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileLocator));
bufptr += MAXALIGN(hdr->ncommitstats * sizeof(xl_xact_stats_item));
bufptr += MAXALIGN(hdr->nabortstats * sizeof(xl_xact_stats_item));
bufptr += MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));
@@ -2433,8 +2412,6 @@ static void
RecordTransactionAbortPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
- int nrels,
- RelFileLocator *rels,
int nstats,
xl_xact_stats_item *stats,
const char *gid)
@@ -2466,7 +2443,6 @@ RecordTransactionAbortPrepared(TransactionId xid,
*/
recptr = XactLogAbortRecord(GetCurrentTimestamp(),
nchildren, children,
- nrels, rels,
nstats, stats,
MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
xid, gid);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 5739ef3b7f5..1a691c7a30e 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1756,8 +1756,6 @@ RecordTransactionAbort(bool isSubXact)
{
TransactionId xid = GetCurrentTransactionIdIfAny();
TransactionId latestXid;
- int nrels;
- RelFileLocator *rels;
int ndroppedstats = 0;
xl_xact_stats_item *droppedstats = NULL;
int nchildren;
@@ -1802,7 +1800,6 @@ RecordTransactionAbort(bool isSubXact)
replorigin_session_origin != DoNotReplicateId);
/* Fetch the data we need for the abort record */
- nrels = smgrGetPendingDeletes(false, &rels);
nchildren = xactGetCommittedChildren(&children);
ndroppedstats = pgstat_get_transactional_drops(false, &droppedstats);
@@ -1819,7 +1816,6 @@ RecordTransactionAbort(bool isSubXact)
XactLogAbortRecord(xact_time,
nchildren, children,
- nrels, rels,
ndroppedstats, droppedstats,
MyXactFlags, InvalidTransactionId,
NULL);
@@ -1870,8 +1866,6 @@ RecordTransactionAbort(bool isSubXact)
XactLastRecEnd = 0;
/* And clean up local data */
- if (rels)
- pfree(rels);
if (ndroppedstats)
pfree(droppedstats);
@@ -6009,7 +6003,6 @@ XactLogCommitRecord(TimestampTz commit_time,
XLogRecPtr
XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
- int nrels, RelFileLocator *rels,
int ndroppedstats, xl_xact_stats_item *droppedstats,
int xactflags, TransactionId twophase_xid,
const char *twophase_gid)
@@ -6017,7 +6010,6 @@ XactLogAbortRecord(TimestampTz abort_time,
xl_xact_abort xlrec;
xl_xact_xinfo xl_xinfo;
xl_xact_subxacts xl_subxacts;
- xl_xact_relfilelocators xl_relfilelocators;
xl_xact_stats_items xl_dropped_stats;
xl_xact_twophase xl_twophase;
xl_xact_dbinfo xl_dbinfo;
@@ -6049,13 +6041,6 @@ XactLogAbortRecord(TimestampTz abort_time,
xl_subxacts.nsubxacts = nsubxacts;
}
- if (nrels > 0)
- {
- xl_xinfo.xinfo |= XACT_XINFO_HAS_RELFILELOCATORS;
- xl_relfilelocators.nrels = nrels;
- info |= XLR_SPECIAL_REL_UPDATE;
- }
-
if (ndroppedstats > 0)
{
xl_xinfo.xinfo |= XACT_XINFO_HAS_DROPPED_STATS;
@@ -6114,14 +6099,6 @@ XactLogAbortRecord(TimestampTz abort_time,
nsubxacts * sizeof(TransactionId));
}
- if (xl_xinfo.xinfo & XACT_XINFO_HAS_RELFILELOCATORS)
- {
- XLogRegisterData((char *) (&xl_relfilelocators),
- MinSizeOfXactRelfileLocators);
- XLogRegisterData((char *) rels,
- nrels * sizeof(RelFileLocator));
- }
-
if (xl_xinfo.xinfo & XACT_XINFO_HAS_DROPPED_STATS)
{
XLogRegisterData((char *) (&xl_dropped_stats),
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index e3b0aa8983c..db9cdad25f8 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -47,19 +47,14 @@
int wal_skip_threshold = 2048; /* in kilobytes */
/*
- * We keep a list of all relations (represented as RelFileLocator values)
- * that have been created or deleted in the current transaction. When
- * a relation is created, we create the physical file immediately, but
- * remember it so that we can delete the file again if the current
- * transaction is aborted. Conversely, a deletion request is NOT
- * executed immediately, but is just entered in the list. When and if
- * the transaction commits, we can delete the physical file.
+ * We keep a list of all deletion requests (represented as RelFileLocator
+ * values) that are NOT executed immediately. When and if the transaction
+ * commits, we can delete the physical file.
*
* To handle subtransactions, every entry is marked with its transaction
- * nesting level. At subtransaction commit, we reassign the subtransaction's
- * entries to the parent nesting level. At subtransaction abort, we can
- * immediately execute the abort-time actions for all entries of the current
- * nesting level.
+ * nesting level. At subtransaction commit, we reassign the subtransaction's
+ * entries to the parent nesting level. At subtransaction abort, we discard the
+ * commit-time actions for all entries of the current nesting level.
*
* NOTE: the list is kept in TopMemoryContext to be sure it won't disappear
* unbetimes. It'd probably be OK to keep it in TopTransactionContext,
@@ -895,11 +890,7 @@ smgrGetPendingDeletes(bool forCommit, RelFileLocator **ptr)
RelFileLocator *rptr;
PendingRelDelete *pending;
- if (!forCommit)
- {
- *ptr = NULL;
- return 0;
- }
+ Assert(forCommit);
nrels = 0;
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
@@ -971,9 +962,7 @@ AtSubCommit_smgr(void)
/*
* AtSubAbort_smgr() --- Take care of subtransaction abort.
*
- * Delete created relations and forget about deleted relations.
- * We can execute these operations immediately because we know this
- * subtransaction will not commit.
+ * Drop pending deletes registered during the subtransaction.
*/
void
AtSubAbort_smgr(void)
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index fb64d7413a2..a1ceb846ac3 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -359,7 +359,6 @@ typedef struct xl_xact_prepare
Oid owner; /* user running the transaction */
int32 nsubxacts; /* number of following subxact XIDs */
int32 ncommitrels; /* number of delete-on-commit rels */
- int32 nabortrels; /* number of delete-on-abort rels */
int32 ncommitstats; /* number of stats to drop on commit */
int32 nabortstats; /* number of stats to drop on abort */
int32 ninvalmsgs; /* number of cache invalidation messages */
@@ -513,7 +512,6 @@ extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
extern XLogRecPtr XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
- int nrels, RelFileLocator *rels,
int ndroppedstats,
xl_xact_stats_item *droppedstats,
int xactflags, TransactionId twophase_xid,
--
2.43.5
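For readers following the storage.c comment change above, here is a minimal standalone sketch of a commit-only pending-deletes list, assuming a simplified model: entries carry a nesting level, subtransaction commit reassigns them to the parent, and subtransaction abort simply discards them. The names (DemoPendingDelete, demo_pending_add and so on) are hypothetical and are not the PostgreSQL structures or functions.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for PendingRelDelete; not the PostgreSQL struct. */
typedef struct DemoPendingDelete
{
	unsigned int relnumber;		/* stand-in for a RelFileLocator */
	int			nestLevel;		/* xact nesting level of the request */
	struct DemoPendingDelete *next;
} DemoPendingDelete;

static DemoPendingDelete *demo_pending_head = NULL;

/* Register a delete-on-commit request at the given nesting level. */
static void
demo_pending_add(unsigned int relnumber, int nestLevel)
{
	DemoPendingDelete *p = malloc(sizeof(DemoPendingDelete));

	assert(p != NULL);			/* demo only: no real error handling */
	assert(nestLevel >= 1);
	p->relnumber = relnumber;
	p->nestLevel = nestLevel;
	p->next = demo_pending_head;
	demo_pending_head = p;
}

/* Subtransaction commit: hand this level's entries to the parent level. */
static void
demo_pending_subcommit(int nestLevel)
{
	for (DemoPendingDelete *p = demo_pending_head; p != NULL; p = p->next)
		if (p->nestLevel >= nestLevel)
			p->nestLevel = nestLevel - 1;
}

/* Subtransaction abort: forget this level's requests; nothing is unlinked. */
static void
demo_pending_subabort(int nestLevel)
{
	DemoPendingDelete **link = &demo_pending_head;

	while (*link != NULL)
	{
		DemoPendingDelete *p = *link;

		if (p->nestLevel >= nestLevel)
		{
			*link = p->next;
			free(p);
		}
		else
			link = &p->next;
	}
}

/* Number of requests that would be executed if we committed now. */
static int
demo_pending_count(void)
{
	int			n = 0;

	for (DemoPendingDelete *p = demo_pending_head; p != NULL; p = p->next)
		n++;
	return n;
}
```

In this model an aborted subtransaction leaves no abort-time work behind in the list, which is the property the comment change describes.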
v35-0010-Rename-confusing-function-names.patch (text/x-patch; charset=us-ascii)
From 21315a127247a5e968810d0f651094fa7f36e9a4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 25 Oct 2024 13:02:22 +0900
Subject: [PATCH v35 10/21] Rename confusing function names
This is the final step in a series of three commits to modify
pendingDeletes.
The pendingDeletes system now handles commit-time processing
only. Previously, its structures and functions managed both commit and
abort cases, so removing abort-time processing alone could cause
confusion. Therefore, update the function names to clarify their
commit-time purpose and remove unnecessary safeguards.
---
src/backend/access/transam/twophase.c | 2 +-
src/backend/access/transam/xact.c | 2 +-
src/backend/catalog/storage.c | 44 +++++++++------------------
src/backend/commands/tablecmds.c | 2 +-
src/include/catalog/storage.h | 4 +--
5 files changed, 19 insertions(+), 35 deletions(-)
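As a rough illustration of what the renamed RelationPreserveStorageOnCommit() does, the sketch below cancels a delete-on-commit request by unlinking the matching entry from a singly linked list, and is a no-op when no entry matches. It assumes a simplified locator type; DemoLocator, demo_schedule_delete() and demo_preserve_on_commit() are hypothetical names, and the return value is added here only so the behaviour can be checked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for RelFileLocator. */
typedef struct DemoLocator
{
	unsigned int spc;
	unsigned int db;
	unsigned int rel;
} DemoLocator;

typedef struct DemoDeleteRequest
{
	DemoLocator rlocator;
	struct DemoDeleteRequest *next;
} DemoDeleteRequest;

static DemoDeleteRequest *demo_requests = NULL;

static bool
demo_locator_equals(DemoLocator a, DemoLocator b)
{
	return a.spc == b.spc && a.db == b.db && a.rel == b.rel;
}

/* Schedule unlinking of the storage at commit. */
static void
demo_schedule_delete(DemoLocator rlocator)
{
	DemoDeleteRequest *req = malloc(sizeof(DemoDeleteRequest));

	assert(req != NULL);		/* demo only: no real error handling */
	req->rlocator = rlocator;
	req->next = demo_requests;
	demo_requests = req;
}

/*
 * Cancel any delete-on-commit request for rlocator.  Returns the number of
 * entries removed; zero means the call was a no-op.
 */
static int
demo_preserve_on_commit(DemoLocator rlocator)
{
	DemoDeleteRequest *prev = NULL;
	DemoDeleteRequest *next;
	int			removed = 0;

	for (DemoDeleteRequest *req = demo_requests; req != NULL; req = next)
	{
		next = req->next;
		if (demo_locator_equals(req->rlocator, rlocator))
		{
			/* unlink and free the entry */
			if (prev)
				prev->next = next;
			else
				demo_requests = next;
			free(req);
			removed++;
		}
		else
			prev = req;
	}
	return removed;
}
```

With the abort-time case gone, there is no second kind of entry to filter on, which is why the atCommit argument can be dropped from the signature.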
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index d3dac67d8cd..ef94aea4ee9 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1113,7 +1113,7 @@ StartPrepare(GlobalTransaction gxact)
hdr.prepared_at = gxact->prepared_at;
hdr.owner = gxact->owner;
hdr.nsubxacts = xactGetCommittedChildren(&children);
- hdr.ncommitrels = smgrGetPendingDeletes(true, &commitrels);
+ hdr.ncommitrels = smgrGetCommitPendingDeletes(&commitrels);
hdr.ncommitstats =
pgstat_get_transactional_drops(true, &commitstats);
hdr.nabortstats =
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 1a691c7a30e..33e99bd31ae 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1340,7 +1340,7 @@ RecordTransactionCommit(void)
LogLogicalInvalidations();
/* Get data needed for commit record */
- nrels = smgrGetPendingDeletes(true, &rels);
+ nrels = smgrGetCommitPendingDeletes(&rels);
nchildren = xactGetCommittedChildren(&children);
ndroppedstats = pgstat_get_transactional_drops(true, &droppedstats);
if (XLogStandbyInfoActive())
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index db9cdad25f8..043afa47ced 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -252,45 +252,31 @@ RelationDropStorage(Relation rel)
}
/*
- * RelationPreserveStorage
+ * RelationPreserveStorageOnCommit
* Mark a relation as not to be deleted after all.
*
- * We need this function because relation mapping changes are committed
- * separately from commit of the whole transaction, so it's still possible
- * for the transaction to abort after the mapping update is done.
- * When a new physical relation is installed in the map, it would be
- * scheduled for delete-on-abort, so we'd delete it, and be in trouble.
- * The relation mapper fixes this by telling us to not delete such relations
- * after all as part of its commit.
+ * This function cancels registered delete-on-commit actions for storage files,
+ * allowing reuse of an existing index build during ALTER TABLE.
*
- * We also use this to reuse an old build of an index during ALTER TABLE, this
- * time removing the delete-at-commit entry.
+ * Historical note:
+ * The only caller in abort paths was write_relmapper_file(), intended to
+ * preserve committed storage files for mapped relations if outer
+ * transactions aborted. However, this case has not occurred in over a decade
+ * and is unlikely to be needed in the future. Maintaining the ability for
+ * subtransaction-committed storage files to persist after a top-level
+ * transaction aborts has added unnecessary complexity and inefficiency to
+ * the UNDO log system. Therefore, this feature has been removed, and the
+ * function has been renamed.
*
* No-op if the relation is not among those scheduled for deletion.
*/
void
-RelationPreserveStorage(RelFileLocator rlocator, bool atCommit)
+RelationPreserveStorageOnCommit(RelFileLocator rlocator)
{
PendingRelDelete *pending;
PendingRelDelete *prev;
PendingRelDelete *next;
- /*
- * There is no caller that passes false for atCommit.
- *
- * The only caller that used to pass false for atCommit was
- * write_relmapper_file(), which intended to preserve committed storage
- * files for mapped relations if outer transactions aborted. However, this
- * has not occurred for more than ten years, and it is unlikely to be
- * needed in the future. The code to let storage files committed in
- * subtransactions survive after the top transaction aborts makes the UNDO
- * log system overly complex and inefficient. Therefore, this feature has
- * been removed. The function signature is left unchanged to make this
- * change less invasive and to prevent the function from being mistakenly
- * called during transaction aborts.
- */
- Assert (atCommit);
-
prev = NULL;
for (pending = pendingDeletes; pending != NULL; pending = next)
{
@@ -883,15 +869,13 @@ smgrDoPendingSyncs(bool isCommit, bool isParallelWorker)
* by upper-level transactions.
*/
int
-smgrGetPendingDeletes(bool forCommit, RelFileLocator **ptr)
+smgrGetCommitPendingDeletes(RelFileLocator **ptr)
{
int nestLevel = GetCurrentTransactionNestLevel();
int nrels;
RelFileLocator *rptr;
PendingRelDelete *pending;
- Assert(forCommit);
-
nrels = 0;
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 56c9d61aa21..251aea55d24 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -9197,7 +9197,7 @@ ATExecAddIndex(AlteredTableInfo *tab, Relation rel,
irel->rd_createSubid = stmt->oldCreateSubid;
irel->rd_firstRelfilelocatorSubid = stmt->oldFirstRelfilelocatorSubid;
- RelationPreserveStorage(irel->rd_locator, true);
+ RelationPreserveStorageOnCommit(irel->rd_locator);
index_close(irel, NoLock);
}
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 3451d6ac80c..19b02d84a5f 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -28,7 +28,7 @@ extern SMgrRelation RelationCreateStorage(RelFileLocator rlocator,
extern void RelationCreateFork(SMgrRelation srel, ForkNumber forkNum,
bool wal_log, bool undo_log);
extern void RelationDropStorage(Relation rel);
-extern void RelationPreserveStorage(RelFileLocator rlocator, bool atCommit);
+extern void RelationPreserveStorageOnCommit(RelFileLocator rlocator);
extern void RelationPreTruncate(Relation rel);
extern void RelationTruncate(Relation rel, BlockNumber nblocks);
extern void RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
@@ -44,7 +44,7 @@ extern void RestorePendingSyncs(char *startAddress);
*/
extern void smgrDoPendingDeletes(bool isCommit);
extern void smgrDoPendingSyncs(bool isCommit, bool isParallelWorker);
-extern int smgrGetPendingDeletes(bool forCommit, RelFileLocator **ptr);
+extern int smgrGetCommitPendingDeletes(RelFileLocator **ptr);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
--
2.43.5
v35-0011-new-indexam-bit-for-unlogged-storage-compatibili.patch (text/x-patch; charset=us-ascii)
From 31ba5c92515c992322dff4a20f77cf88a0ca8522 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Wed, 24 Jul 2024 19:31:39 +0900
Subject: [PATCH v35 11/21] new indexam bit for unlogged storage compatibility
To enable the core to identify whether storage files created by an
index access method for WAL-logged and unlogged relations are
binary-compatible, add a boolean property to the index AM interface.
---
src/backend/access/brin/brin.c | 1 +
src/backend/access/gin/ginutil.c | 1 +
src/backend/access/gist/gist.c | 8 ++++++++
src/backend/access/hash/hash.c | 1 +
src/backend/access/nbtree/nbtree.c | 1 +
src/backend/access/spgist/spgutils.c | 1 +
src/include/access/amapi.h | 2 ++
src/test/modules/dummy_index_am/dummy_index_am.c | 1 +
8 files changed, 16 insertions(+)
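A minimal sketch of how a caller might consult the new property, assuming a small lookup table in place of the real IndexAmRoutine. The flag values mirror what this patch sets in each handler; the table, demo_lookup_am() and demo_index_needs_rebuild() are hypothetical, and how the core actually uses the flag is not shown in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, reduced stand-in for IndexAmRoutine. */
typedef struct DemoIndexAm
{
	const char *name;
	bool		amunloggedstoragecompatible;
} DemoIndexAm;

/* Flag values as set by the handlers touched in this patch. */
static const DemoIndexAm demo_ams[] = {
	{"brin", true},
	{"gin", true},
	{"gist", false},			/* fake LSNs in UNLOGGED GiST indexes */
	{"hash", true},
	{"btree", true},
	{"spgist", true},
};

static const DemoIndexAm *
demo_lookup_am(const char *amname)
{
	for (size_t i = 0; i < sizeof(demo_ams) / sizeof(demo_ams[0]); i++)
		if (strcmp(demo_ams[i].name, amname) == 0)
			return &demo_ams[i];
	return NULL;
}

/*
 * Assumed policy: an index can change persistence in place only if its AM
 * declares the storage compatible.  Unknown AMs are treated conservatively.
 */
static bool
demo_index_needs_rebuild(const char *amname)
{
	const DemoIndexAm *am;

	assert(amname != NULL);
	am = demo_lookup_am(amname);
	return am == NULL || !am->amunloggedstoragecompatible;
}
```

Treating an unknown AM as incompatible matches the choice made for dummy_index_am in this patch, which sets the flag to false.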
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index c0b978119ac..1064ce8baca 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -272,6 +272,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->amsummarizing = true;
amroutine->amparallelvacuumoptions =
VACUUM_OPTION_PARALLEL_CLEANUP;
+ amroutine->amunloggedstoragecompatible = true;
amroutine->amkeytype = InvalidOid;
amroutine->ambuild = brinbuild;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 830d67fbc20..7072ff4537f 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -59,6 +59,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->amsummarizing = false;
amroutine->amparallelvacuumoptions =
VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+ amroutine->amunloggedstoragecompatible = true;
amroutine->amkeytype = InvalidOid;
amroutine->ambuild = ginbuild;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 2d7a0687d4a..4d7f36f396a 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -81,6 +81,14 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->amsummarizing = false;
amroutine->amparallelvacuumoptions =
VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_COND_CLEANUP;
+
+ /*
+ * GiST uses page LSNs to figure out whether a block has been
+ * modified. UNLOGGED GiST indexes use fake LSNs, which are incompatible
+ * with the real LSNs used for LOGGED indexes.
+ */
+ amroutine->amunloggedstoragecompatible = false;
+
amroutine->amkeytype = InvalidOid;
amroutine->ambuild = gistbuild;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 5ce36093943..7f236c47fd8 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -79,6 +79,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->amsummarizing = false;
amroutine->amparallelvacuumoptions =
VACUUM_OPTION_PARALLEL_BULKDEL;
+ amroutine->amunloggedstoragecompatible = true;
amroutine->amkeytype = INT4OID;
amroutine->ambuild = hashbuild;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 484ede8c2e1..6e95f5da365 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -122,6 +122,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->amsummarizing = false;
amroutine->amparallelvacuumoptions =
VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_COND_CLEANUP;
+ amroutine->amunloggedstoragecompatible = true;
amroutine->amkeytype = InvalidOid;
amroutine->ambuild = btbuild;
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 72b7661971f..1ced42342f2 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -66,6 +66,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->amsummarizing = false;
amroutine->amparallelvacuumoptions =
VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_COND_CLEANUP;
+ amroutine->amunloggedstoragecompatible = true;
amroutine->amkeytype = InvalidOid;
amroutine->ambuild = spgbuild;
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index c51de742ea0..225fc864d15 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -261,6 +261,8 @@ typedef struct IndexAmRoutine
bool amsummarizing;
/* OR of parallel vacuum flags. See vacuum.h for flags. */
uint8 amparallelvacuumoptions;
+ /* is AM storage data compatible between LOGGED and UNLOGGED states? */
+ bool amunloggedstoragecompatible;
/* type of data stored in index, or InvalidOid if variable */
Oid amkeytype;
diff --git a/src/test/modules/dummy_index_am/dummy_index_am.c b/src/test/modules/dummy_index_am/dummy_index_am.c
index beb2c1d2542..ca302490160 100644
--- a/src/test/modules/dummy_index_am/dummy_index_am.c
+++ b/src/test/modules/dummy_index_am/dummy_index_am.c
@@ -297,6 +297,7 @@ dihandler(PG_FUNCTION_ARGS)
amroutine->amusemaintenanceworkmem = false;
amroutine->amsummarizing = false;
amroutine->amparallelvacuumoptions = VACUUM_OPTION_NO_PARALLEL;
+ amroutine->amunloggedstoragecompatible = false;
amroutine->amkeytype = InvalidOid;
amroutine->ambuild = dibuild;
--
2.43.5
v35-0012-Transactional-buffer-persistence-switching.patch (text/x-patch; charset=us-ascii)
From dc15896c6406eab9626bf3691d9de963ecfafa96 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 16 Aug 2024 17:59:38 +0900
Subject: [PATCH v35 12/21] Transactional buffer persistence switching
This commit introduces functionality for transactional buffer
persistence switching with no user-side code. The switching is
reverted if the transaction aborts, and both the switching and
reverting are WAL-logged. Repeated back-and-forth switching within and
across subtransactions is prohibited for simplicity.
---
src/backend/access/rmgrdesc/smgrdesc.c | 13 +
src/backend/access/transam/twophase.c | 2 +
src/backend/access/transam/xact.c | 14 +
src/backend/access/transam/xlog.c | 1 +
src/backend/access/transam/xlogrecovery.c | 1 +
src/backend/catalog/storage.c | 33 +++
src/backend/storage/buffer/bufmgr.c | 328 ++++++++++++++++++++++
src/bin/pg_rewind/parsexlog.c | 6 +
src/include/catalog/storage_xlog.h | 11 +
src/include/storage/bufmgr.h | 10 +
src/tools/pgindent/typedefs.list | 2 +
11 files changed, 421 insertions(+)
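The transactional behaviour described in the commit message can be modelled in a few lines. This is a sketch under simplifying assumptions: one boolean per relation stands in for the BM_PERMANENT bit of all its buffers, nothing is WAL-logged, and a second switch of the same relation returns false where the patch raises an ERROR. All names (demo_set_persistence, demo_end_xact, demo_subcommit) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define DEMO_NRELS 4

/* One flag per relation; stands in for BM_PERMANENT on its buffers. */
static bool demo_permanent[DEMO_NRELS];

/* Hypothetical, reduced stand-in for BufMgrCleanup. */
typedef struct DemoCleanup
{
	int			rel;			/* relation whose persistence was switched */
	bool		revert_to;		/* persistence to restore on abort */
	int			nestLevel;		/* xact nesting level of the switch */
	struct DemoCleanup *next;
} DemoCleanup;

static DemoCleanup *demo_cleanups = NULL;

/* Switch persistence and remember how to revert it. */
static bool
demo_set_persistence(int rel, bool permanent, int nestLevel)
{
	DemoCleanup *cu;

	assert(rel >= 0 && rel < DEMO_NRELS);

	/* Refuse to flip the same relation twice in one transaction. */
	for (cu = demo_cleanups; cu != NULL; cu = cu->next)
		if (cu->rel == rel)
			return false;

	cu = malloc(sizeof(DemoCleanup));
	assert(cu != NULL);			/* demo only: no real error handling */

	demo_permanent[rel] = permanent;
	cu->rel = rel;
	cu->revert_to = !permanent;
	cu->nestLevel = nestLevel;
	cu->next = demo_cleanups;
	demo_cleanups = cu;
	return true;
}

/* Subtransaction commit: reassign this level's entries to the parent. */
static void
demo_subcommit(int nestLevel)
{
	for (DemoCleanup *cu = demo_cleanups;
		 cu != NULL && cu->nestLevel >= nestLevel;
		 cu = cu->next)
		cu->nestLevel = nestLevel - 1;
}

/* End of a (sub)transaction: on abort, revert entries of this level or deeper. */
static void
demo_end_xact(bool isCommit, int nestLevel)
{
	while (demo_cleanups != NULL && demo_cleanups->nestLevel >= nestLevel)
	{
		DemoCleanup *cu = demo_cleanups;

		demo_cleanups = cu->next;
		if (!isCommit)
			demo_permanent[cu->rel] = cu->revert_to;
		free(cu);
	}
}
```

Because new entries are pushed at the head, the entries of the innermost level always sit at the front of the list, so both loops can stop at the first entry that belongs to an outer level.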
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 71410e0a2d3..d7b763f5297 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,16 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+ char *path = relpathperm(xlrec->rlocator, MAIN_FORKNUM);
+
+ appendStringInfoString(buf, path);
+ appendStringInfo(buf, " persistence \"%c\"",
+ xlrec->persistence ? 'p' : 'u');
+ pfree(path);
+ }
}
const char *
@@ -55,6 +65,9 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_BUFPERSISTENCE:
+ id = "BUFPERSISTENCE";
+ break;
}
return id;
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index ef94aea4ee9..b4c423e449e 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1594,6 +1594,8 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
abortstats,
gid);
+ /* Clean up buffer persistence changes and unnecessary files. */
+ PreCommit_Buffers(isCommit);
UndoLog_UndoByXid(isCommit, xid, hdr->nsubxacts, children, false);
ProcArrayRemove(proc, latestXid);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 33e99bd31ae..12e3d1c762b 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2275,6 +2275,9 @@ CommitTransaction(void)
CallXactCallbacks(is_parallel_worker ? XACT_EVENT_PARALLEL_PRE_COMMIT
: XACT_EVENT_PRE_COMMIT);
+ /* Clean up buffer persistence changes */
+ PreCommit_Buffers(true);
+
/*
* If this xact has started any unfinished parallel operation, clean up
* its workers, warning about leaked resources. (But we don't actually
@@ -2869,6 +2872,9 @@ AbortTransaction(void)
*/
sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+ /* Clean up buffer persistence changes */
+ PreCommit_Buffers(false);
+
/*
* check the current transaction state
*/
@@ -5136,6 +5142,9 @@ CommitSubTransaction(void)
CallSubXactCallbacks(SUBXACT_EVENT_PRE_COMMIT_SUB, s->subTransactionId,
s->parent->subTransactionId);
+ /* Clean up buffer persistence changes. */
+ PreSubCommit_Buffers(true);
+
/*
* If this subxact has started any unfinished parallel operation, clean up
* its workers and exit parallel mode. Warn about leaked resources.
@@ -5284,6 +5293,9 @@ AbortSubTransaction(void)
*/
reschedule_timeouts();
+ /* Clean up buffer persistence changes */
+ PreSubCommit_Buffers(false);
+
/*
* Re-enable signals, in case we got here by longjmp'ing out of a signal
* handler. We do this fairly early in the sequence so that the timeout
@@ -6241,6 +6253,7 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
}
UndoLog_UndoByXid(true, xid, parsed->nsubxacts, parsed->subxacts, true);
+ AtEOXact_Buffers_Redo(true, xid, parsed->nsubxacts, parsed->subxacts);
if (parsed->nstats > 0)
{
@@ -6354,6 +6367,7 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
}
UndoLog_UndoByXid(false, xid, parsed->nsubxacts, parsed->subxacts, true);
+ AtEOXact_Buffers_Redo(false, xid, parsed->nsubxacts, parsed->subxacts);
if (parsed->nstats > 0)
{
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5ba4468f52d..dff6db895e0 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5912,6 +5912,7 @@ StartupXLOG(void)
*/
if (!reachedConsistency)
{
+ BufmgrDoCleanupRedo();
UndoLogCleanup(true);
ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP);
}
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 93d56dee75e..034c2e28153 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2279,6 +2279,7 @@ CheckRecoveryConsistency(void)
* backends don't try to read whatever garbage is left over from
* before.
*/
+ BufmgrDoCleanupRedo();
UndoLogCleanup(false);
ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP);
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 043afa47ced..51be987a5f8 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -221,6 +221,29 @@ ulog_smgrcreate(SMgrRelation srel, ForkNumber forkNum)
UndoLogWrite(RM_SMGR_ID, ULOG_SMGR_CREATE, &ulrec, sizeof(ulrec));
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
+ *
+ * XXX: This function essentially belongs in bufmgr.c, but is placed here to
+ * avoid adding a new rmgr type solely for this record type.
+ */
+void
+log_smgrbufpersistence(const RelFileLocator rlocator, bool persistence)
+{
+ xl_smgr_bufpersistence xlrec;
+
+ /*
+ * Make an XLOG entry reporting the change of buffer persistence.
+ */
+ xlrec.rlocator = rlocator;
+ xlrec.persistence = persistence;
+ xlrec.topxid = GetTopTransactionId();
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -1059,6 +1082,16 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rlocator, INVALID_PROC_NUMBER);
+ SetRelationBuffersPersistenceRedo(reln, xlrec->persistence,
+ XLogRecGetXid(record));
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index cb524cfa42c..3ff2dabdf01 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -58,6 +58,7 @@
#include "storage/smgr.h"
#include "storage/standby.h"
#include "utils/memdebug.h"
+#include "utils/memutils.h"
#include "utils/ps_status.h"
#include "utils/rel.h"
#include "utils/resowner.h"
@@ -136,6 +137,22 @@ typedef struct SMgrSortArray
SMgrRelation srel;
} SMgrSortArray;
+/*
+ * We keep a list of all relations whose buffer persistence has been switched
+ * in the current transaction. This allows us to properly revert the
+ * persistence if the transaction is aborted.
+ */
+typedef struct BufMgrCleanup
+{
+ RelFileLocator rlocator; /* relation whose buffer persistence changed */
+ bool bufpersistence; /* buffer persistence to set */
+ int nestLevel; /* xact nesting level of request */
+ TransactionId xid; /* used during recovery */
+ struct BufMgrCleanup *next; /* linked-list link */
+} BufMgrCleanup;
+
+static BufMgrCleanup * cleanups = NULL; /* head of linked list */
+
/* GUC variables */
bool zero_damaged_pages = false;
int bgwriter_lru_maxpages = 100;
@@ -223,6 +240,8 @@ static char *ResOwnerPrintBufferIO(Datum res);
static void ResOwnerReleaseBufferPin(Datum res);
static char *ResOwnerPrintBufferPin(Datum res);
+static void set_relation_buffers_persistence(SMgrRelation srel, bool permanent);
+
const ResourceOwnerDesc buffer_io_resowner_desc =
{
.name = "buffer io",
@@ -3548,6 +3567,153 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
return result | BUF_WRITTEN;
}
+/*
+ * bufmgrDoCleanup() -- Take care of buffer persistence changes at end of xact
+ *
+ * This function is called at the end of both transactions and subtransactions,
+ * aiming to immediately clean up failed transactions.
+ */
+static void
+bufmgrDoCleanup(bool isCommit)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+ BufMgrCleanup *cu;
+ BufMgrCleanup *next;
+
+ for (cu = cleanups ; cu && cu->nestLevel >= nestLevel ; cu = next)
+ {
+ next = cu->next;
+ cleanups = next;
+
+ if (!isCommit)
+ {
+ SMgrRelation srel = smgropen(cu->rlocator, INVALID_PROC_NUMBER);
+ set_relation_buffers_persistence(srel, cu->bufpersistence);
+ }
+ pfree(cu);
+ }
+
+#ifdef USE_ASSERT_CHECKING
+ /* All remaining entries pertain to upper levels. */
+ for (cu = cleanups ; cu ; cu = cu->next)
+ Assert(cu->nestLevel < nestLevel);
+#endif
+}
+
+/*
+ * AtEOXact_Buffers_Redo() -- End-of-transaction cleanup of buffer persistence
+ * changes during recovery.
+ *
+ * Unlike normal operation, the cleanup entries are keyed by xid rather than by
+ * nestLevel. See SetRelationBuffersPersistenceRedo() for details on the
+ * registration of those entries.
+ */
+void
+AtEOXact_Buffers_Redo(bool isCommit, TransactionId xid,
+ int nchildren, TransactionId *children)
+{
+ BufMgrCleanup *cu;
+ BufMgrCleanup *prev;
+ BufMgrCleanup *next;
+
+ prev = NULL;
+ for (cu = cleanups ; cu ; cu = next)
+ {
+ next = cu->next;
+
+ if (cu->xid != xid)
+ {
+ int i;
+
+ for (i = 0 ; i < nchildren && cu->xid != children[i] ; i++);
+
+ if (i == nchildren)
+ {
+ /* did not match, go to next */
+ prev = cu;
+ continue;
+ }
+ }
+
+ if (!isCommit)
+ {
+ /*
+ * Record this revert to WAL without re-registering a BufMgrCleanup
+ * entry.
+ */
+ SMgrRelation srel = smgropen(cu->rlocator, INVALID_PROC_NUMBER);
+ set_relation_buffers_persistence(srel, cu->bufpersistence);
+ }
+ if (prev)
+ prev->next = next;
+ else
+ cleanups = next;
+ pfree(cu);
+ }
+}
+
+/*
+ * BufmgrDoCleanupRedo() -- End-of-recovery cleanup of buffer persistence
+ * changes.
+ *
+ * Revert buffer persistence changes made in transactions that are not
+ * committed at the end of recovery.
+ */
+void
+BufmgrDoCleanupRedo(void)
+{
+ BufMgrCleanup *cu;
+ BufMgrCleanup *next;
+
+ for (cu = cleanups ; cu ; cu = next)
+ {
+ SMgrRelation srel = smgropen(cu->rlocator, INVALID_PROC_NUMBER);
+ set_relation_buffers_persistence(srel, cu->bufpersistence);
+
+ next = cu->next;
+ pfree(cu);
+ }
+
+ cleanups = NULL;
+}
+
+/*
+ * PreSubCommit_Buffers() -- Take care of buffer persistence changes at subxact
+ * end
+ */
+void
+PreSubCommit_Buffers(bool isCommit)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+
+ if (!isCommit)
+ {
+ bufmgrDoCleanup(isCommit);
+ return;
+ }
+
+ /*
+ * Reassign all cleanup items at the current nestlevel to the parent
+ * transaction.
+ */
+
+ for (BufMgrCleanup *cu = cleanups ;
+ cu && cu->nestLevel >= nestLevel ;
+ cu = cu->next)
+ {
+ /* no lower-level entry is expected */
+ Assert(cu->nestLevel == nestLevel);
+
+ cu->nestLevel = nestLevel - 1;
+ }
+}
+
+void
+PreCommit_Buffers(bool isCommit)
+{
+ bufmgrDoCleanup(isCommit);
+}
+
/*
* AtEOXact_Buffers - clean up at end of transaction.
*
@@ -4142,6 +4308,168 @@ DropRelationBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
}
}
+/*
+ * set_relation_buffers_persistence()
+ *
+ * When switching to PERMANENT, this function changes the persistence of all
+ * buffer pages for a relation, then writes all dirty pages to disk (or kernel
+ * buffers) to ensure the kernel has the latest view of the relation.
+ * Otherwise, it simply flips the persistence of every page.
+ *
+ * The caller must hold an AccessExclusiveLock on the target relation to
+ * prevent other backends from loading additional blocks.
+ *
+ * XXX: Currently, this function sequentially searches the buffer pool;
+ * consider implementing more efficient search methods. Since this routine is
+ * not used in performance-critical paths, additional optimization isn't
+ * warranted; see also DropRelationBuffers.
+ */
+static void
+set_relation_buffers_persistence(SMgrRelation srel, bool permanent)
+{
+ int i;
+ RelFileLocator rlocator = srel->smgr_rlocator.locator;
+
+ Assert(!RelFileLocatorBackendIsTemp(srel->smgr_rlocator));
+
+ ResourceOwnerEnlarge(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ /* try unlocked check to avoid locking irrelevant buffers */
+ if (!RelFileLocatorEquals(BufTagGetRelFileLocator(&bufHdr->tag),
+ rlocator))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (!RelFileLocatorEquals(BufTagGetRelFileLocator(&bufHdr->tag),
+ rlocator))
+ {
+ UnlockBufHdr(bufHdr, buf_state);
+ continue;
+ }
+
+ if (permanent)
+ {
+ /* The init fork is being dropped, drop buffers for it. */
+ if (BufTagGetForkNum(&bufHdr->tag) == INIT_FORKNUM)
+ {
+ InvalidateBuffer(bufHdr);
+ continue;
+ }
+
+ /* Switch the buffer state to BM_PERMANENT before flushing it. */
+ Assert((buf_state & BM_PERMANENT) == 0);
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /*
+ * We haven't written WALs for this buffer. Flush this buffer to
+ * establish the epoch for subsequent WAL records.
+ */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, srel, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ /* There shouldn't be an init fork for this relation */
+ Assert(BufTagGetForkNum(&bufHdr->tag) != INIT_FORKNUM);
+ Assert(buf_state & BM_PERMANENT);
+
+ /* Just switch the buffer state to non-permanent. */
+ buf_state &= ~BM_PERMANENT;
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ }
+}
+
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a
+ * relation. See set_relation_buffers_persistence() for functionality
+ * details.
+ *
+ * This function's behavior is transactional, meaning that the changes it
+ * makes will be reverted if this or any higher-level transaction is
+ * aborted.
+ *
+ * The caller must be holding AccessExclusiveLock on the target relation
+ * to ensure no other backend is busy loading more blocks.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(SMgrRelation srel, bool permanent)
+{
+ BufMgrCleanup *cu;
+ RelFileLocator rlocator = srel->smgr_rlocator.locator;
+
+ /*
+ * Prevent double-flipping of relation persistence within the same
+ * transaction. Performing double-flipping adds significant complexity
+ * with minimal benefit. Error out if persistence has already been flipped
+ * for this relation in the current transaction.
+ */
+ for (cu = cleanups ; cu ; cu = cu->next)
+ {
+ if (RelFileLocatorEquals(rlocator, cu->rlocator))
+ ereport(ERROR,
+ errmsg("persistence of this relation has already been changed in the current transaction"));
+ }
+
+ set_relation_buffers_persistence(srel, permanent);
+
+ /* Schedule reverting this change at abort, keying by nestLevel. */
+ cu = (BufMgrCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(BufMgrCleanup));
+ cu->rlocator = rlocator;
+ cu->bufpersistence = !permanent;
+ cu->nestLevel = GetCurrentTransactionNestLevel();
+ cu->next = cleanups;
+ cleanups = cu;
+}
+
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistenceRedo
+ *
+ * This function changes the persistence of all buffer pages for a
+ * relation during recovery. In recovery, cleanup entries are keyed by
+ * transaction ID, rather than by nestLevel.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistenceRedo(SMgrRelation srel, bool permanent,
+ TransactionId xid)
+{
+ BufMgrCleanup *cu;
+
+ set_relation_buffers_persistence(srel, permanent);
+
+ /* Schedule reverting this change at abort */
+ cu = (BufMgrCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(BufMgrCleanup));
+ cu->rlocator = srel->smgr_rlocator.locator;
+ cu->bufpersistence = !permanent;
+ cu->xid = xid;
+ cu->next = cleanups;
+ cleanups = cu;
+}
+
/* ---------------------------------------------------------------------
* DropRelationsAllBuffers
*
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 7b541137dd4..b3f728a85ac 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -418,6 +418,12 @@ extractPageInfo(XLogReaderState *record)
* source system.
*/
}
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ /*
+ * We can safely ignore these. These don't make any on-disk changes.
+ */
+ }
else if (rmid == RM_XACT_ID &&
((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT ||
(rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED ||
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index a490e05f884..085b1bc1dff 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -29,6 +29,7 @@
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_BUFPERSISTENCE 0x30
typedef struct xl_smgr_create
{
@@ -36,6 +37,14 @@ typedef struct xl_smgr_create
ForkNumber forkNum;
} xl_smgr_create;
+typedef struct xl_smgr_bufpersistence
+{
+ RelFileLocator rlocator;
+ bool persistence;
+ TransactionId topxid;
+ /* subxid is in the record header */
+} xl_smgr_bufpersistence;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +60,8 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum);
+extern void log_smgrbufpersistence(const RelFileLocator rlocator,
+ bool persistence);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index eb0fba4230b..4267098080f 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -19,6 +19,7 @@
#include "storage/buf.h"
#include "storage/bufpage.h"
#include "storage/relfilelocator.h"
+#include "storage/smgr.h"
#include "utils/relcache.h"
#include "utils/snapmgr.h"
@@ -250,7 +251,14 @@ extern Buffer ExtendBufferedRelTo(BufferManagerRelation bmr,
ReadBufferMode mode);
extern void InitBufferManagerAccess(void);
+extern void PreSubCommit_Buffers(bool isCommit);
+extern void PreCommit_Buffers(bool isCommit);
extern void AtEOXact_Buffers(bool isCommit);
+extern void SetRelationBuffersPersistenceRedo(SMgrRelation srel, bool permanent,
+ TransactionId xid);
+extern void AtEOXact_Buffers_Redo(bool isCommit, TransactionId xid,
+ int nchildren, TransactionId *children);
+extern void BufmgrDoCleanupRedo(void);
extern char *DebugPrintBufferRefcount(Buffer buffer);
extern void CheckPointBuffers(int flags);
extern BlockNumber BufferGetBlockNumber(Buffer buffer);
@@ -269,6 +277,8 @@ extern void DropRelationBuffers(struct SMgrRelationData *smgr_reln,
extern void DropRelationsAllBuffers(struct SMgrRelationData **smgr_reln,
int nlocators);
extern void DropDatabaseBuffers(Oid dbid);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+ bool permanent);
#define RelationGetNumberOfBlocks(reln) \
RelationGetNumberOfBlocksInFork(reln, MAIN_FORKNUM)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index edca81d88b0..82cbd451430 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -329,6 +329,7 @@ BtreeLastVisibleEntry
BtreeLevel
Bucket
BufFile
+BufMgrCleanup
Buffer
BufferAccessStrategy
BufferAccessStrategyType
@@ -4137,6 +4138,7 @@ xl_replorigin_set
xl_restore_point
xl_running_xacts
xl_seq_rec
+xl_smgr_bufpersistence
xl_smgr_create
xl_smgr_truncate
xl_standby_lock
--
2.43.5
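For readers following the bufmgr part of the patch above, the abort-time bookkeeping in SetRelationBuffersPersistence() can be sketched in isolation roughly as follows. This is only a toy model, not the patch's code: ToyLocator, ToyCleanup and toy_register_flip are hypothetical names, the locator is reduced to three integers, plain malloc() stands in for MemoryContextAlloc(), and the function returns false where the patch raises an ERROR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for RelFileLocator: just three numbers. */
typedef struct ToyLocator
{
    unsigned    spc;
    unsigned    db;
    unsigned    rel;
} ToyLocator;

/* One pending "revert at abort" entry; the list is newest-first. */
typedef struct ToyCleanup
{
    ToyLocator  loc;
    bool        revert_to;      /* persistence to restore on abort */
    int         nest_level;     /* nest level that made the change */
    struct ToyCleanup *next;
} ToyCleanup;

bool
toy_locator_equals(ToyLocator a, ToyLocator b)
{
    return a.spc == b.spc && a.db == b.db && a.rel == b.rel;
}

/*
 * Register a persistence flip. Returns false (the patch raises an error
 * instead) if the same relation was already flipped in this transaction.
 */
bool
toy_register_flip(ToyCleanup **head, ToyLocator loc, bool permanent,
                  int nest_level)
{
    ToyCleanup *cu;

    /* Refuse double-flipping of the same relation. */
    for (cu = *head; cu != NULL; cu = cu->next)
    {
        if (toy_locator_equals(loc, cu->loc))
            return false;
    }

    cu = malloc(sizeof(ToyCleanup));
    assert(cu != NULL);
    cu->loc = loc;
    cu->revert_to = !permanent; /* abort restores the opposite state */
    cu->nest_level = nest_level;
    cu->next = *head;           /* push onto the head of the list */
    *head = cu;
    return true;
}
```

Keying each entry by nest level is what lets a subtransaction abort revert only its own flips; the redo variant in the patch keys by xid instead, since there is no nest level during recovery.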
Attachment: v35-0013-Prepare-for-preventing-DML-operations-on-relatio.patch (text/x-patch; charset=us-ascii)
From c75ecd158de5fd8377c2ac9c5d43da51bfb94325 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 15 Aug 2024 11:26:01 +0900
Subject: [PATCH v35 13/21] Prepare for preventing DML operations on relations.
Performing data manipulation on relations with in-place persistence
changes can lead to unrecoverable issues, particularly with
indexes. To prevent potential data corruption, this update sets up
mechanisms to inhibit DML operations in these cases rather than
attempting to accommodate them. No user-side code included.
---
src/backend/access/transam/xact.c | 7 ++++++
src/backend/executor/execMain.c | 5 +++-
src/backend/tcop/utility.c | 18 ++++++++++++++
src/backend/utils/cache/relcache.c | 39 +++++++++++++++++++++++++++---
src/include/access/xact.h | 2 ++
src/include/miscadmin.h | 1 +
src/include/utils/rel.h | 7 ++++++
src/include/utils/relcache.h | 1 +
8 files changed, 76 insertions(+), 4 deletions(-)
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 12e3d1c762b..c3302b4df46 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -87,6 +87,12 @@ bool XactDeferrable;
int synchronous_commit = SYNCHRONOUS_COMMIT_ON;
+/*
+ * Indicates whether relation persistence flipping was performed in the
+ * current transaction.
+ */
+bool XactPersistenceChanged;
+
/*
* CheckXidAlive is a xid value pointing to a possibly ongoing (sub)
* transaction. Currently, it is used in logical decoding. It's possible
@@ -2124,6 +2130,7 @@ StartTransaction(void)
s->startedInRecovery = false;
XactReadOnly = DefaultXactReadOnly;
}
+ XactPersistenceChanged = false;
XactDeferrable = DefaultXactDeferrable;
XactIsoLevel = DefaultXactIsoLevel;
forceSyncCommit = false;
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index cc9a594cba5..cca40110b7e 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -163,7 +163,7 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
* against performing unsafe operations in parallel mode, but this gives a
* more user-friendly error message.
*/
- if ((XactReadOnly || IsInParallelMode()) &&
+ if ((XactReadOnly || XactPersistenceChanged || IsInParallelMode()) &&
!(eflags & EXEC_FLAG_EXPLAIN_ONLY))
ExecCheckXactReadOnly(queryDesc->plannedstmt);
@@ -813,6 +813,9 @@ ExecCheckXactReadOnly(PlannedStmt *plannedstmt)
continue;
PreventCommandIfReadOnly(CreateCommandName((Node *) plannedstmt));
+
+ PreventCommandIfPersistenceChanged(
+ CreateCommandName((Node *) plannedstmt), perminfo->relid);
}
if (plannedstmt->commandType != CMD_SELECT || plannedstmt->hasModifyingCTE)
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index b2ea8125c92..94953e367ae 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -411,6 +411,24 @@ PreventCommandIfReadOnly(const char *cmdname)
cmdname)));
}
+/*
+ * PreventCommandIfPersistenceChanged: throw error if a persistence change
+ * was performed on the relation in the current transaction
+ */
+void
+PreventCommandIfPersistenceChanged(const char *cmdname, Oid relid)
+{
+ Relation rel;
+
+ rel = RelationIdGetRelation(relid);
+ if (rel->rd_firstPersistenceChangeSubid != InvalidSubTransactionId)
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot execute %s on relation \"%s\" because of its persistence change in the current transaction",
+ cmdname, get_rel_name(relid)));
+ RelationClose(rel);
+}
+
/*
* PreventCommandIfParallelMode: throw error if current (sub)transaction is
* in parallel mode.
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index c326f687eb4..ec9acdf188e 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1136,6 +1136,7 @@ retry:
relation->rd_createSubid = InvalidSubTransactionId;
relation->rd_newRelfilelocatorSubid = InvalidSubTransactionId;
relation->rd_firstRelfilelocatorSubid = InvalidSubTransactionId;
+ relation->rd_firstPersistenceChangeSubid = InvalidSubTransactionId;
relation->rd_droppedSubid = InvalidSubTransactionId;
switch (relation->rd_rel->relpersistence)
{
@@ -1899,6 +1900,7 @@ formrdesc(const char *relationName, Oid relationReltype,
relation->rd_createSubid = InvalidSubTransactionId;
relation->rd_newRelfilelocatorSubid = InvalidSubTransactionId;
relation->rd_firstRelfilelocatorSubid = InvalidSubTransactionId;
+ relation->rd_firstPersistenceChangeSubid = InvalidSubTransactionId;
relation->rd_droppedSubid = InvalidSubTransactionId;
relation->rd_backend = INVALID_PROC_NUMBER;
relation->rd_islocaltemp = false;
@@ -2775,6 +2777,7 @@ RelationClearRelation(Relation relation, bool rebuild)
SWAPFIELD(SubTransactionId, rd_createSubid);
SWAPFIELD(SubTransactionId, rd_newRelfilelocatorSubid);
SWAPFIELD(SubTransactionId, rd_firstRelfilelocatorSubid);
+ SWAPFIELD(SubTransactionId, rd_firstPersistenceChangeSubid);
SWAPFIELD(SubTransactionId, rd_droppedSubid);
/* un-swap rd_rel pointers, swap contents instead */
SWAPFIELD(Form_pg_class, rd_rel);
@@ -2864,7 +2867,8 @@ static void
RelationFlushRelation(Relation relation)
{
if (relation->rd_createSubid != InvalidSubTransactionId ||
- relation->rd_firstRelfilelocatorSubid != InvalidSubTransactionId)
+ relation->rd_firstRelfilelocatorSubid != InvalidSubTransactionId ||
+ relation->rd_firstPersistenceChangeSubid != InvalidSubTransactionId)
{
/*
* New relcache entries are always rebuilt, not flushed; else we'd
@@ -2922,7 +2926,8 @@ RelationForgetRelation(Oid rid)
Assert(relation->rd_droppedSubid == InvalidSubTransactionId);
if (relation->rd_createSubid != InvalidSubTransactionId ||
- relation->rd_firstRelfilelocatorSubid != InvalidSubTransactionId)
+ relation->rd_firstRelfilelocatorSubid != InvalidSubTransactionId ||
+ relation->rd_firstPersistenceChangeSubid != InvalidSubTransactionId)
{
/*
* In the event of subtransaction rollback, we must not forget
@@ -3037,7 +3042,8 @@ RelationCacheInvalidate(bool debug_discard)
* applicable pending invalidations.
*/
if (relation->rd_createSubid != InvalidSubTransactionId ||
- relation->rd_firstRelfilelocatorSubid != InvalidSubTransactionId)
+ relation->rd_firstRelfilelocatorSubid != InvalidSubTransactionId ||
+ relation->rd_firstPersistenceChangeSubid != InvalidSubTransactionId)
continue;
relcacheInvalsReceived++;
@@ -3351,6 +3357,7 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
relation->rd_createSubid = InvalidSubTransactionId;
relation->rd_newRelfilelocatorSubid = InvalidSubTransactionId;
relation->rd_firstRelfilelocatorSubid = InvalidSubTransactionId;
+ relation->rd_firstPersistenceChangeSubid = InvalidSubTransactionId;
relation->rd_droppedSubid = InvalidSubTransactionId;
if (clear_relcache)
@@ -3466,6 +3473,7 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
relation->rd_createSubid = InvalidSubTransactionId;
relation->rd_newRelfilelocatorSubid = InvalidSubTransactionId;
relation->rd_firstRelfilelocatorSubid = InvalidSubTransactionId;
+ relation->rd_firstPersistenceChangeSubid = InvalidSubTransactionId;
relation->rd_droppedSubid = InvalidSubTransactionId;
RelationClearRelation(relation, false);
return;
@@ -3512,6 +3520,14 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
else
relation->rd_droppedSubid = InvalidSubTransactionId;
}
+
+ if (relation->rd_firstPersistenceChangeSubid == mySubid)
+ {
+ if (isCommit)
+ relation->rd_firstPersistenceChangeSubid = parentSubid;
+ else
+ relation->rd_firstPersistenceChangeSubid = InvalidSubTransactionId;
+ }
}
@@ -3602,6 +3618,7 @@ RelationBuildLocalRelation(const char *relname,
rel->rd_createSubid = GetCurrentSubTransactionId();
rel->rd_newRelfilelocatorSubid = InvalidSubTransactionId;
rel->rd_firstRelfilelocatorSubid = InvalidSubTransactionId;
+ rel->rd_firstPersistenceChangeSubid = InvalidSubTransactionId;
rel->rd_droppedSubid = InvalidSubTransactionId;
/*
@@ -3976,6 +3993,21 @@ RelationAssumeNewRelfilelocator(Relation relation)
EOXactListAdd(relation);
}
+/*
+ * RelationAssumePersistenceChange
+ *
+ * Code that changes relation persistence must call this. This call triggers
+ * abort-time cleanups and prevents further data manipulation on the relation.
+ */
+void
+RelationAssumePersistenceChange(Relation relation)
+{
+ XactPersistenceChanged = true;
+ relation->rd_firstPersistenceChangeSubid = GetCurrentSubTransactionId();
+
+ /* Flag relation as needing eoxact cleanup (to clear this field) */
+ EOXactListAdd(relation);
+}
/*
* RelationCacheInitialize
@@ -6404,6 +6436,7 @@ load_relcache_init_file(bool shared)
rel->rd_createSubid = InvalidSubTransactionId;
rel->rd_newRelfilelocatorSubid = InvalidSubTransactionId;
rel->rd_firstRelfilelocatorSubid = InvalidSubTransactionId;
+ rel->rd_firstPersistenceChangeSubid = InvalidSubTransactionId;
rel->rd_droppedSubid = InvalidSubTransactionId;
rel->rd_amcache = NULL;
rel->pgstat_info = NULL;
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index a1ceb846ac3..2e09566bdda 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -55,6 +55,8 @@ extern PGDLLIMPORT int XactIsoLevel;
extern PGDLLIMPORT bool DefaultXactReadOnly;
extern PGDLLIMPORT bool XactReadOnly;
+extern PGDLLIMPORT bool XactPersistenceChanged;
+
/* flag for logging statements in this transaction */
extern PGDLLIMPORT bool xact_is_sampled;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index e26d108a470..eaf79cb06ac 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -301,6 +301,7 @@ extern bool stack_is_too_deep(void);
extern void PreventCommandIfReadOnly(const char *cmdname);
extern void PreventCommandIfParallelMode(const char *cmdname);
extern void PreventCommandDuringRecovery(const char *cmdname);
+extern void PreventCommandIfPersistenceChanged(const char *cmdname, Oid relid);
/*****************************************************************************
* pdir.h -- *
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 87002049538..a361e910509 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -108,6 +108,13 @@ typedef struct RelationData
* any value */
SubTransactionId rd_droppedSubid; /* dropped with another Subid set */
+ /*
+ * rd_firstPersistenceChangeSubid is the ID of the highest subtransaction
+ * the rel's persistence change has survived into.
+ */
+ SubTransactionId rd_firstPersistenceChangeSubid; /* highest subxact changing
+ * persistence */
+
Form_pg_class rd_rel; /* RELATION tuple */
TupleDesc rd_att; /* tuple descriptor */
Oid rd_id; /* relation's object id */
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index 18c32ea7008..f2f26433cdd 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -119,6 +119,7 @@ extern Relation RelationBuildLocalRelation(const char *relname,
*/
extern void RelationSetNewRelfilenumber(Relation relation, char persistence);
extern void RelationAssumeNewRelfilelocator(Relation relation);
+extern void RelationAssumePersistenceChange(Relation relation);
/*
* Routines for flushing/rebuilding relcache entries in various scenarios
--
2.43.5
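The subtransaction handling of rd_firstPersistenceChangeSubid in the patch above (the AtEOSubXact_cleanup hunk) boils down to a small rule, which can be modelled on its own as below. This is a sketch under assumptions: ToySubXactId, TOY_INVALID_SUBXACT, toy_at_eosubxact and toy_dml_allowed are hypothetical names for illustration, and the relcache entry is reduced to the single marker value.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int ToySubXactId;

/* Plays the role of InvalidSubTransactionId in this toy model. */
#define TOY_INVALID_SUBXACT 0

/*
 * End-of-subtransaction rule for the "persistence changed" marker:
 * a marker owned by the ending subtransaction moves to the parent on
 * commit and is cleared on abort; markers owned by other levels are
 * left untouched.
 */
ToySubXactId
toy_at_eosubxact(ToySubXactId marker, bool is_commit,
                 ToySubXactId my_subid, ToySubXactId parent_subid)
{
    if (marker != my_subid)
        return marker;
    return is_commit ? parent_subid : TOY_INVALID_SUBXACT;
}

/* DML on the relation is refused for as long as the marker is set. */
bool
toy_dml_allowed(ToySubXactId marker)
{
    return marker == TOY_INVALID_SUBXACT;
}
```

So a flip made under a savepoint that is later rolled back stops blocking DML on that relation, while a released savepoint hands the restriction up to its parent until the top-level transaction ends.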
Attachment: v35-0014-Add-a-new-version-of-copy_file-to-allow-overwrit.patch (text/x-patch; charset=us-ascii)
From 13147e22a7f70448ff9abb5bc9c6a40aeeb05116 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Wed, 31 Jul 2024 18:04:48 +0900
Subject: [PATCH v35 14/21] Add a new version of copy_file to allow overwrites
In subsequent patches, it will be necessary to overwrite the existing
main fork with the init fork. To facilitate this, add a version of the
copy_file function that supports overwriting.
---
src/backend/storage/file/copydir.c | 16 +++++++++++++++-
src/include/storage/copydir.h | 2 ++
2 files changed, 17 insertions(+), 1 deletion(-)
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index d4fbe542077..30d0ae54ec4 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -115,6 +115,12 @@ copydir(const char *fromdir, const char *todir, bool recurse)
*/
void
copy_file(const char *fromfile, const char *tofile)
+{
+ copy_file_extended(fromfile, tofile, false);
+}
+
+void
+copy_file_extended(const char *fromfile, const char *tofile, bool overwrite)
{
char *buffer;
int srcfd;
@@ -122,6 +128,7 @@ copy_file(const char *fromfile, const char *tofile)
int nbytes;
off_t offset;
off_t flush_offset;
+ int dstflags;
/* Size of copy buffer (read and write requests) */
#define COPY_BUF_SIZE (8 * BLCKSZ)
@@ -150,7 +157,11 @@ copy_file(const char *fromfile, const char *tofile)
(errcode_for_file_access(),
errmsg("could not open file \"%s\": %m", fromfile)));
- dstfd = OpenTransientFile(tofile, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+ dstflags = O_RDWR | O_CREAT | PG_BINARY;
+ if (!overwrite)
+ dstflags |= O_EXCL;
+
+ dstfd = OpenTransientFile(tofile, dstflags);
if (dstfd < 0)
ereport(ERROR,
(errcode_for_file_access(),
@@ -159,6 +170,9 @@ copy_file(const char *fromfile, const char *tofile)
/*
* Do the data copying.
*/
+ if (overwrite)
+ pg_truncate(tofile, 0);
+
flush_offset = 0;
for (offset = 0;; offset += nbytes)
{
diff --git a/src/include/storage/copydir.h b/src/include/storage/copydir.h
index a25e258f479..1a430675428 100644
--- a/src/include/storage/copydir.h
+++ b/src/include/storage/copydir.h
@@ -15,5 +15,7 @@
extern void copydir(const char *fromdir, const char *todir, bool recurse);
extern void copy_file(const char *fromfile, const char *tofile);
+extern void copy_file_extended(const char *fromfile, const char *tofile,
+ bool overwrite);
#endif /* COPYDIR_H */
--
2.43.5
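The flag handling that copy_file_extended() adds in the patch above can be tried outside the tree with plain POSIX calls, roughly as follows. This is a sketch under assumptions rather than the patch's code: toy_open_copy_target is a hypothetical name, open() stands in for OpenTransientFile(), PG_BINARY is omitted, and ftruncate() on the descriptor replaces the patch's pg_truncate() on the path.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Open a copy destination. Without "overwrite", an existing file is an
 * error (O_EXCL); with it, the file is opened and emptied first.
 * Returns a file descriptor, or -1 with errno set.
 */
int
toy_open_copy_target(const char *path, bool overwrite)
{
    int         flags = O_RDWR | O_CREAT;
    int         fd;

    assert(path != NULL);

    if (!overwrite)
        flags |= O_EXCL;

    fd = open(path, flags, 0600);
    if (fd < 0)
        return -1;

    /* Discard old contents so a shorter source leaves no stale tail. */
    if (overwrite && ftruncate(fd, 0) != 0)
    {
        int         save_errno = errno;

        close(fd);
        errno = save_errno;
        return -1;
    }
    return fd;
}
```

The truncation matters for the later use of this function: an init fork is typically much smaller than the main fork it overwrites, so without it the tail of the old main fork would survive the copy.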
Attachment: v35-0015-In-place-persistance-change-to-UNLOGGED.patch (text/x-patch; charset=us-ascii)
From 49827b559597ca9adcb0b8377efedf2d400568f7 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 27 Aug 2024 11:19:53 +0900
Subject: [PATCH v35 15/21] In-place persistance change to UNLOGGED
This commit enables changing the persistence of relations to UNLOGGED
without creating a new storage file. ALTER TABLE SET LOGGED will continue
to create new storage as before.
---
src/backend/catalog/storage.c | 64 +++++++++
src/backend/commands/tablecmds.c | 226 +++++++++++++++++++++++++------
2 files changed, 251 insertions(+), 39 deletions(-)
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 51be987a5f8..4a2c620402b 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -458,6 +458,54 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
FreeSpaceMapVacuumRange(rel, nblocks, InvalidBlockNumber);
}
+/*
+ * Reset an unlogged relation using the INIT fork, intended for use during the
+ * commit of prepared transactions. The relation is assumed to be UNLOGGED, so
+ * no WAL-logging is required.
+ */
+static void
+ResetUnloggedRelation(RelFileLocator rloc, ProcNumber backend)
+{
+ char *srcpath;
+ char *dstpath;
+ SMgrRelation srel;
+ ForkNumber forks[MAX_FORKNUM];
+ BlockNumber blocks[MAX_FORKNUM];
+ int nforks = 0;
+
+ srel = smgropen(rloc, backend);
+
+ Assert(smgrexists(srel, INIT_FORKNUM));
+
+ for (int i = 0 ; i <= MAX_FORKNUM ; i++)
+ {
+ if (i == INIT_FORKNUM || !smgrexists(srel, i))
+ continue;
+
+ forks[nforks] = i;
+ blocks[nforks] = 0;
+ nforks++;
+ }
+
+ /*
+ * This relation is unlogged. Therefore, unlike RelationTruncate(), there
+ * is no need to call RelationPreTruncate().
+ */
+ smgrtruncate(srel, forks, nforks, blocks);
+
+ /* Note that this leaves the first segment of the main fork. */
+ for (int i = 0 ; i < nforks ; i++)
+ smgrunlink(srel, forks[i], false);
+
+ /* copy init fork to main fork */
+ srcpath = GetRelationPath(rloc.dbOid, rloc.spcOid, rloc.relNumber,
+ backend, INIT_FORKNUM);
+ dstpath = GetRelationPath(rloc.dbOid, rloc.spcOid, rloc.relNumber,
+ backend, MAIN_FORKNUM);
+ copy_file_extended(srcpath, dstpath, true);
+ fsync_fname(dstpath, false);
+}
+
/*
* RelationPreTruncate
* Perform AM-independent work before a physical truncation.
@@ -1181,6 +1229,22 @@ smgr_undo(UndoLogRecord *record, ULogOp op, bool recovered, bool redo)
smgrclose(reln);
}
+ else if (!redo && recovered && ulrec->forknum == INIT_FORKNUM)
+ {
+ /*
+ * The system crashed after the transaction was prepared. Now
+ * that the init fork persists, the relation needs to be
+ * reset.
+ */
+ ResetUnloggedRelation(ulrec->rlocator, ulrec->backend);
+ ereport(WARNING,
+ errmsg("unlogged relation %u/%u/%u was reset",
+ ulrec->rlocator.spcOid, ulrec->rlocator.dbOid,
+ ulrec->rlocator.relNumber),
+ errdetail("The server experienced a crash after the transaction that altered the relation was prepared."));
+
+ }
+
}
else
elog(PANIC, "smgr_undo: unknown ulogop code %d", op);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 251aea55d24..f8d240f374f 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -5708,6 +5708,143 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
return newcmd;
}
+/*
+ * RelationChangePersistence: perform in-place persistence change of a relation
+ */
+static void
+RelationChangePersistence(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * Use ATRewriteTable instead of this function if the following condition
+ * is not satisfied.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ /*
+ * Initially, gather all relations that require a persistence change.
+ */
+
+ /* Collect OIDs of indexes and toast relations */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ /* Add toast relation if any */
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ List *toastidx;
+ Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
+
+ relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
+ toastidx = RelationGetIndexList(toastrel);
+ relids = list_concat(relids, toastidx);
+ pfree(toastidx);
+ table_close(toastrel, NoLock);
+ }
+
+ table_close(rel, NoLock);
+
+ /* Make changes in storage */
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach(lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+ SMgrRelation srel;
+ bool persistent = (persistence == RELPERSISTENCE_PERMANENT);
+ bool is_index;
+
+ /*
+ * Reconstruct the storage when permanent and unlogged storage types
+ * are incompatible.
+ */
+ if (r->rd_rel->relkind == RELKIND_INDEX &&
+ !r->rd_indam->amunloggedstoragecompatible)
+ {
+ int reindex_flags;
+ ReindexParams params = {0};
+
+ /* reindex doesn't allow concurrent use of the index */
+ table_close(r, NoLock);
+
+ reindex_flags =
+ REINDEX_REL_SUPPRESS_INDEX_USE |
+ REINDEX_REL_CHECK_CONSTRAINTS;
+
+ /* Set the same persistence as the parent relation. */
+ if (persistent)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+ else
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+
+ /* this doesn't fire the REINDEX event trigger */
+ reindex_index(NULL, reloid, reindex_flags, persistence, &params);
+
+ continue;
+ }
+
+ /* Currently, only allowing changes to UNLOGGED. */
+ Assert(!persistent);
+
+ RelationAssumePersistenceChange(r);
+
+ /* switch buffer persistence */
+ srel = RelationGetSmgr(r);
+ log_smgrbufpersistence(srel->smgr_rlocator.locator, persistent);
+ SetRelationBuffersPersistence(srel, persistent);
+
+ /* then create the init fork */
+ is_index = (r->rd_rel->relkind == RELKIND_INDEX);
+ RelationCreateFork(srel, INIT_FORKNUM, !is_index, true);
+ if (is_index)
+ r->rd_indam->ambuildempty(r);
+
+ /* Update catalog */
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+
+ table_close(r, NoLock);
+ }
+
+ table_close(classRel, NoLock);
+}
+
/*
* ATRewriteTables: ALTER TABLE phase 3
*/
@@ -5840,48 +5977,59 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- /*
- * Create transient table that will receive the modified data.
- *
- * Ensure it is marked correctly as logged or unlogged. We have
- * to do this here so that buffers for the new relfilenumber will
- * have the right persistence set, and at the same time ensure
- * that the original filenumbers's buffers will get read in with
- * the correct setting (i.e. the original one). Otherwise a
- * rollback after the rewrite would possibly result with buffers
- * for the original filenumbers having the wrong persistence
- * setting.
- *
- * NB: This relies on swap_relation_files() also swapping the
- * persistence. That wouldn't work for pg_class, but that can't be
- * unlogged anyway.
- */
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
- persistence, lockmode);
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE &&
+ persistence == RELPERSISTENCE_UNLOGGED)
+ {
+ /* Make in-place persistence change. */
+ RelationChangePersistence(tab, persistence, lockmode);
+ }
+ else
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new
+ * relfilenumber will have the right persistence set, and at
+ * the same time ensure that the original filenumbers's buffers
+ * will get read in with the correct setting (i.e. the original
+ * one). Otherwise a rollback after the rewrite would possibly
+ * result with buffers for the original filenumbers having the
+ * wrong persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that can't
+ * be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace,
+ NewAccessMethod,
+ persistence, lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
+ /*
+ * Swap the physical files of the old and new heaps, then
+ * rebuild indexes and discard the old heap. We can use
+ * RecentXmin for the table's new relfrozenxid because we
+ * rewrote all the tuples in ATRewriteTable, so no older Xid
+ * remains in the table. Also, we never try to swap toast
+ * tables by content, since we have no interest in letting this
+ * code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
- InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ }
}
else if (tab->rewrite > 0 && tab->relkind == RELKIND_SEQUENCE)
{
--
2.43.5
Attachment: v35-0016-Add-test-for-ALTER-TABLE-UNLOGGED.patch (text/x-patch; charset=us-ascii)
From 753b89eb173bb9db179f63c3535a26869511c78e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 15 Aug 2024 16:06:34 +0900
Subject: [PATCH v35 16/21] Add test for ALTER TABLE UNLOGGED
---
src/test/recovery/t/044_persistence_change.pl | 511 ++++++++++++++++++
1 file changed, 511 insertions(+)
create mode 100644 src/test/recovery/t/044_persistence_change.pl
diff --git a/src/test/recovery/t/044_persistence_change.pl b/src/test/recovery/t/044_persistence_change.pl
new file mode 100644
index 00000000000..ad1b444cb46
--- /dev/null
+++ b/src/test/recovery/t/044_persistence_change.pl
@@ -0,0 +1,511 @@
+# Copyright (c) 2023-2024, PostgreSQL Global Development Group
+#
+# Test in-place relation persistence changes
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+my @relnames = ('t', 'i_bt', 'i_gin', 'i_gist', 'i_hash', 'i_brin', 'i_spgist');
+my @noninplace_names = ('i_gist');
+
+# This feature behaves differently depending on wal_level.
+run_test('minimal');
+run_test('replica');
+done_testing();
+
+sub run_test
+{
+ my ($wal_level) = @_;
+
+ note "## run with wal_level = $wal_level";
+
+ # Initialize primary node.
+ my $node = PostgreSQL::Test::Cluster->new("node_$wal_level");
+ $node->init;
+ # Prevent checkpoints from running during the test
+ $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+checkpoint_timeout = '24h'
+max_prepared_transactions = 2
+ ));
+ $node->start;
+
+ my $datadir = $node->data_dir;
+ my $datoid = $node->safe_psql('postgres',
+ q/SELECT oid FROM pg_database WHERE datname = current_database()/);
+ my $dbdir = $node->data_dir . "/base/$datoid";
+
+ # Create a table and indexes of built-in kinds
+ $node->psql('postgres', qq(
+ CREATE TABLE t (bt int, gin int[], gist point, hash int,
+ brin int, spgist point);
+ CREATE INDEX i_bt ON t USING btree (bt);
+ CREATE INDEX i_gin ON t USING gin (gin);
+ CREATE INDEX i_gist ON t USING gist (gist);
+ CREATE INDEX i_hash ON t USING hash (hash);
+ CREATE INDEX i_brin ON t USING brin (brin);
+ CREATE INDEX i_spgist ON t USING spgist (spgist);));
+
+ my $relfilenodes1 = getrelfilenodes($node, \@relnames);
+
+ # the expected count must match the number of entries in @relnames
+ is (scalar(keys %{$relfilenodes1}), 7, "number of relations is correct");
+
+ # check initial state
+ ok (check_storage_state(\&is_logged_state, $node, \@relnames),
+ "storages are in logged state");
+
+ # Normal crash-recovery of LOGGED tables
+ $node->stop('immediate');
+ $node->start;
+
+ # Insert data 0 to 1999
+ $node->psql('postgres', insert_data_query(0, 2000));
+
+ # Check if the data survives a crash
+ $node->stop('immediate');
+ $node->start;
+ is ($node->safe_psql('postgres', "SELECT count(*) FROM t;"), 2000,
+ "data loss check: crash with LOGGED table");
+
+ # Change the table to UNLOGGED then commit.
+ $node->psql('postgres', 'ALTER TABLE t SET UNLOGGED');
+
+ # Check if SET UNLOGGED above didn't change relfilenumbers.
+ my $relfilenodes2 = getrelfilenodes($node, \@relnames);
+ ok (checkrelfilenodes($relfilenodes1, $relfilenodes2),
+ "relfilenumber transition is as expected after SET UNLOGGED");
+
+ # check init-file state
+ ok (check_storage_state(\&is_unlogged_state, $node, \@relnames),
+ "storages are in unlogged state");
+
+ # Check if the table is reset through recovery.
+ $node->stop('immediate');
+ $node->start;
+ is ($node->safe_psql('postgres', "SELECT count(*) FROM t;"), 0,
+ "table data is reset through recovery");
+
+ # check reset state
+ ok (check_storage_state(\&is_reset_state, $node, \@relnames),
+ "storages are in reset state");
+
+ # Insert data 0 to 1999, then set persistence to LOGGED then crash.
+ $node->psql('postgres', insert_data_query(0, 2000));
+ $node->psql('postgres', qq(ALTER TABLE t SET LOGGED));
+ $node->stop('immediate');
+ $node->start;
+
+ # Check if SET LOGGED didn't change relfilenumbers and data survive a crash
+ my $relfilenodes3 = getrelfilenodes($node, \@relnames);
+ ok (!checkrelfilenodes($relfilenodes2, $relfilenodes3),
+ "crashed SET-LOGGED relations have sane relfilenodes transition");
+
+ is ($node->safe_psql('postgres', "SELECT count(*) FROM t;"), 2000,
+ "crashed SET-LOGGED table does not lose data");
+
+ # Change to UNLOGGED then insert data, then shutdown normally.
+ $node->psql('postgres', 'ALTER TABLE t SET UNLOGGED');
+ $node->psql('postgres', insert_data_query(2000, 2000)); # 2000 - 3999
+ $node->stop;
+ $node->start;
+ is ($node->safe_psql('postgres', "SELECT count(*) FROM t;"), 4000,
+ "UNLOGGED table does not lose data after graceful restart");
+
+ # Test for mid-transaction change to LOGGED and crash.
+ # Now, the table has data 0-3999
+ $node->psql('postgres', insert_data_query(4000, 2000)); # 4000 - 5999
+
+ my $sess = $node->interactive_psql('postgres');
+ $sess->set_query_timer_restart();
+ $sess->query('BEGIN; ALTER TABLE t SET LOGGED');
+ $sess->query(insert_data_query(6000, 2000)); # 6000-7999, no commit
+ $node->stop('immediate');
+ $sess->quit;
+ $node->start;
+ is ($node->safe_psql('postgres', "SELECT count(*) FROM t;"), 0,
+ "table is reset after in-transaction SET-LOGGED then insert");
+ ok (check_storage_state(\&is_unlogged_state, $node, \@relnames),
+ "storages are reverted to unlogged state");
+
+ # Test for mid-transaction change to UNLOGGED and crash.
+ # Now, the table has no data
+ $node->psql('postgres', 'ALTER TABLE t SET LOGGED');
+ $node->psql('postgres', insert_data_query(0, 2000)); # 0 - 1999
+ $sess = $node->interactive_psql('postgres');
+ $sess->set_query_timer_restart();
+ $sess->query('BEGIN; ALTER TABLE t SET UNLOGGED');
+ $sess->query(insert_data_query(2000, 2000)); # 2000-3999, no commit
+ $node->stop('immediate');
+ $sess->quit;
+ $node->start;
+ is ($node->safe_psql('postgres', "SELECT count(*) FROM t;"), 2000,
+ "table data survives after in-transaction SET-UNLOGGED then insert");
+ ok (check_storage_state(\&is_logged_state, $node, \@relnames),
+ "storages are reverted to logged state");
+
+ ### Subtransactions
+ ok ($node->psql('postgres',
+ qq(
+ BEGIN;
+ ALTER TABLE t SET UNLOGGED; -- committed
+ SAVEPOINT a;
+ ALTER TABLE t SET LOGGED; -- aborted
+ SAVEPOINT b;
+ ROLLBACK TO a;
+ COMMIT;
+ )) != 3,
+ "command succeeds 1");
+
+ is ($node->safe_psql('postgres', "SELECT count(*) FROM t;"), 2000,
+ "table data is not changed 1");
+ ok (check_storage_state(\&is_unlogged_state, $node, \@relnames),
+ "storages are changed to unlogged state");
+
+ ok ($node->psql('postgres',
+ qq(
+ BEGIN;
+ ALTER TABLE t SET LOGGED; -- aborted
+ SAVEPOINT a;
+ ALTER TABLE t SET UNLOGGED; -- aborted
+ SAVEPOINT b;
+ RELEASE a;
+ ROLLBACK;
+ )) != 3,
+ "command succeeds 2");
+
+ is ($node->safe_psql('postgres', "SELECT count(*) FROM t;"), 2000,
+ "table data is not changed 2");
+ ok (check_storage_state(\&is_unlogged_state, $node, \@relnames),
+ "storages stay in unlogged state");
+
+ ### Prepared transactions
+ my ($ret, $stdout, $stderr) =
+ $node->psql('postgres',
+ qq(
+ ALTER TABLE t SET LOGGED;
+ BEGIN;
+ ALTER TABLE t SET UNLOGGED;
+ PREPARE TRANSACTION 'a';
+ COMMIT PREPARED 'a';
+ ));
+ ok ($ret == 0, "prepare persistence-flipped xact");
+ ok (check_storage_state(\&is_unlogged_state, $node, \@relnames),
+ "storages are in unlogged state");
+
+ ($ret, $stdout, $stderr) =
+ $node->psql('postgres',
+ qq(
+ ALTER TABLE t SET LOGGED;
+ BEGIN;
+ SAVEPOINT a;
+ ALTER TABLE t SET UNLOGGED;
+ PREPARE TRANSACTION 'a';
+ ROLLBACK PREPARED 'a';
+ ));
+ ok ($ret == 0, "prepare persistence-flipped xact 2");
+ ok (check_storage_state(\&is_logged_state, $node, \@relnames),
+ "storages stay in logged state");
+
+ ### Error out DML
+ $node->psql('postgres',
+ qq(
+ BEGIN;
+ ALTER TABLE t SET LOGGED;
+ INSERT INTO t VALUES(1); -- Succeeds
+ COMMIT;
+ ));
+
+ ($ret, $stdout, $stderr) =
+ $node->psql('postgres',
+ qq(
+ BEGIN;
+ ALTER TABLE t SET UNLOGGED;
+ INSERT INTO t VALUES(2); -- ERROR
+ ));
+ ok ($stderr =~ m/cannot execute INSERT on relation/,
+ "errors out when DML is issued after persistence toggling");
+
+ ok ($node->psql('postgres',
+ qq(
+ BEGIN;
+ SAVEPOINT a;
+ ALTER TABLE t SET UNLOGGED;
+ ROLLBACK TO a;
+ INSERT INTO t VALUES(3); -- Succeeds
+ COMMIT;
+ )) != 3,
+ "insert after rolled-back persistence change succeeds");
+
+ ($ret, $stdout, $stderr) =
+ $node->psql('postgres',
+ qq(
+ BEGIN;
+ SAVEPOINT a;
+ ALTER TABLE t SET UNLOGGED;
+ RELEASE a;
+ UPDATE t SET bt = bt + 1; -- ERROR
+ ));
+ ok ($stderr =~ m/cannot execute UPDATE on relation/,
+ "errors out when DML is issued after persistence toggling in subxact");
+
+ $node->stop;
+ $node->teardown_node;
+}
+
+#==== helper routines
+
+# Generates a query to insert data from $st to $st + $num - 1
+sub insert_data_query
+{
+ my ($st, $num) = @_;
+ my $ed = $st + $num - 1;
+ my $query = qq(
+INSERT INTO t
+ (SELECT i, ARRAY[i, i * 2], point(i, i * 2), i, i, point(i, i)
+ FROM generate_series($st, $ed) i);
+);
+ return $query;
+}
+
+sub check_indexes
+{
+ my ($node, $st, $ed) = @_;
+ my $num_data = $ed - $st;
+
+ is ($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO true;
+ SET enable_indexscan TO false;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE bt = i)),
+ $num_data, "heap is not broken");
+ is ($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE bt = i)),
+ $num_data, "btree is not broken");
+ is ($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE gin = ARRAY[i, i * 2];)),
+ $num_data, "gin is not broken");
+ is ($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE gist <@ box(point(i-0.5, i*2-0.5),point(i+0.5, i*2+0.5));)),
+ $num_data, "gist is not broken");
+ is ($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE hash = i;)),
+ $num_data, "hash is not broken");
+ is ($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE brin = i;)),
+ $num_data, "brin is not broken");
+ is ($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE spgist <@ box(point(i-0.5,i-0.5),point(i+0.5,i+0.5));)),
+ $num_data, "spgist is not broken");
+}
+
+sub getrelfilenodes
+{
+ my ($node, $relnames) = @_;
+
+ my $result = $node->safe_psql('postgres',
+ 'SELECT relname, relfilenode FROM pg_class
+ WHERE relname
+ IN (\'' .
+ join("','", @{$relnames}).
+ '\') ORDER BY oid');
+
+ my %relfilenodes;
+
+ foreach my $l (split(/\n/, $result))
+ {
+ die "unexpected format: $l" if ($l !~ /^([^|]+)\|([0-9]+)$/);
+ $relfilenodes{$1} = $2;
+ }
+
+ return \%relfilenodes;
+}
+
+sub checkrelfilenodes
+{
+ my ($rnodes1, $rnodes2) = @_;
+ my $result = 1;
+
+ foreach my $n (keys %{$rnodes1})
+ {
+ if (grep { $n eq $_ } @noninplace_names)
+ {
+ if ($rnodes1->{$n} == $rnodes2->{$n})
+ {
+ $result = 0;
+ note sprintf("$n: relfilenode is not changed: %d",
+ $rnodes1->{$n});
+ }
+ }
+ else
+ {
+ if ($rnodes1->{$n} != $rnodes2->{$n})
+ {
+ $result = 0;
+ note sprintf("$n: relfilenode is changed: %d => %d",
+ $rnodes1->{$n}, $rnodes2->{$n});
+ }
+ }
+ }
+ return $result;
+}
+
+sub getfilenames
+{
+ my ($dirname) = @_;
+
+ my $dir = opendir(my $dh, $dirname) or die "could not open $dirname: $!";
+ my @f = readdir($dh);
+ closedir($dh);
+
+ my @result = grep {$_ !~ /^\.\.?$/} @f;
+
+ return \@result;
+}
+
+sub init_fork_exists
+{
+ my ($relfilenodes, $datafiles, $relname) = @_;
+
+ my $relfnumber = ${$relfilenodes}{$relname};
+ my $init_exists = grep {/^${relfnumber}_init$/} @{$datafiles};
+
+ return $init_exists;
+}
+
+sub noninit_forks_exist
+{
+ my ($relfilenodes, $datafiles, $relname) = @_;
+
+ my $relfnumber = ${$relfilenodes}{$relname};
+ my $noninit_exists = grep {/^${relfnumber}(_(?!init).*)?$/} @{$datafiles};
+
+ return $noninit_exists;
+}
+
+sub is_logged_state
+{
+ my ($node, $relfilenodes, $datafiles, $relname) = @_;
+
+ my $relfnumber = ${$relfilenodes}{$relname};
+ my $init_exists = grep {/^${relfnumber}_init$/} @{$datafiles};
+ my $main_exists = grep {/^${relfnumber}$/} @{$datafiles};
+ my $persistence = $node->safe_psql('postgres',
+ qq(
+ SELECT relpersistence FROM pg_class WHERE relname = '$relname'
+ ));
+
+ if ($init_exists || !$main_exists || $persistence ne 'p')
+ {
+ # note the state if this test failed
+ note "## is_logged_state:($relname): \$init_exists=$init_exists, \$main_exists=$main_exists, \$persistence='$persistence'\n";
+ return 0 ;
+ }
+
+ return 1;
+}
+
+sub is_unlogged_state
+{
+ my ($node, $relfilenodes, $datafiles, $relname) = @_;
+
+ my $relfnumber = ${$relfilenodes}{$relname};
+ my $init_exists = grep {/^${relfnumber}_init$/} @{$datafiles};
+ my $main_exists = grep {/^${relfnumber}$/} @{$datafiles};
+ my $persistence = $node->safe_psql('postgres',
+ qq(
+ SELECT relpersistence FROM pg_class WHERE relname = '$relname'
+ ));
+
+ if (!$init_exists || !$main_exists || $persistence ne 'u')
+ {
+ # note the state if this test failed
+ note "## is_unlogged_state:($relname): \$init_exists=$init_exists, \$main_exists=$main_exists, \$persistence='$persistence'\n";
+ return 0 ;
+ }
+
+ return 1;
+}
+
+sub is_reset_state
+{
+ my ($node, $relfilenodes, $datafiles, $relname) = @_;
+
+ my $datoid = $node->safe_psql('postgres',
+ q/SELECT oid FROM pg_database WHERE datname = current_database()/);
+ my $dbdir = $node->data_dir . "/base/$datoid";
+ my $relfnumber = ${$relfilenodes}{$relname};
+ my $init_exists = grep {/^${relfnumber}_init$/} @{$datafiles};
+ my $main_exists = grep {/^${relfnumber}$/} @{$datafiles};
+ my $others_not_exist = !grep {/^${relfnumber}_(?!init).*$/} @{$datafiles};
+ my $persistence = $node->safe_psql('postgres',
+ qq(
+ SELECT relpersistence FROM pg_class WHERE relname = '$relname'
+ ));
+
+ if (!$init_exists || !$main_exists || !$others_not_exist ||
+ $persistence ne 'u')
+ {
+ # note the state if this test failed
+ note "## is_reset_state:($relname): \$init_exists=$init_exists, \$main_exists=$main_exists, \$others_not_exist=$others_not_exist, \$persistence='$persistence'\n";
+ return 0 ;
+ }
+
+ my $main_file = "$dbdir/${relfnumber}";
+ my $init_file = "$dbdir/${relfnumber}_init";
+ my $main_file_size = -s $main_file;
+ my $init_file_size = -s $init_file;
+
+ if ($main_file_size != $init_file_size)
+ {
+ note "## is_reset_state:($relname): \$main_file='$main_file', size=$main_file_size, \$init_file='$init_file', size=$init_file_size\n";
+ return 0;
+ }
+
+ return 1;
+}
+
+sub check_storage_state
+{
+ my ($func, $node, $relnames) = @_;
+ my $relfilenodes = getrelfilenodes($node, $relnames);
+ my $datoid = $node->safe_psql('postgres',
+ q/SELECT oid FROM pg_database WHERE datname = current_database()/);
+ my $dbdir = $node->data_dir . "/base/$datoid";
+ my $datafiles = getfilenames($dbdir);
+ my $result = 1;
+
+ foreach my $relname (@{$relnames})
+ {
+ if (!$func->($node, $relfilenodes, $datafiles, $relname))
+ {
+ $result = 0;
+
+ ## do not return immediately, run this test for all
+ ## relations to leave diagnosis information in the log
+ ## file.
+ }
+ }
+
+ return $result;
+}
--
2.43.5
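
The storage-state predicates used by the test above (is_logged_state and is_unlogged_state) can be restated as a rough C sketch. This is only an illustration of the rule the test checks, not code from the patch set: the names StorageState, file_listed and classify_storage are hypothetical, and the file list stands in for the contents of the database directory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical result type; not part of PostgreSQL. */
typedef enum
{
	STATE_UNKNOWN,
	STATE_LOGGED,
	STATE_UNLOGGED
} StorageState;

/* Return true if "name" appears in the directory listing. */
static bool
file_listed(const char *const *files, int nfiles, const char *name)
{
	for (int i = 0; i < nfiles; i++)
	{
		if (strcmp(files[i], name) == 0)
			return true;
	}
	return false;
}

/*
 * Classify a relation the way the test does:
 *   logged   = main fork exists, no init fork, relpersistence 'p'
 *   unlogged = main fork and init fork exist, relpersistence 'u'
 * Anything else is reported as unknown (the test treats it as a failure).
 */
static StorageState
classify_storage(const char *const *files, int nfiles,
				 unsigned int relfilenumber, char relpersistence)
{
	char		main_name[32];
	char		init_name[32];
	bool		main_exists;
	bool		init_exists;

	snprintf(main_name, sizeof(main_name), "%u", relfilenumber);
	snprintf(init_name, sizeof(init_name), "%u_init", relfilenumber);

	main_exists = file_listed(files, nfiles, main_name);
	init_exists = file_listed(files, nfiles, init_name);

	if (main_exists && !init_exists && relpersistence == 'p')
		return STATE_LOGGED;
	if (main_exists && init_exists && relpersistence == 'u')
		return STATE_UNLOGGED;
	return STATE_UNKNOWN;
}
```

The reset state checked by is_reset_state additionally requires that no forks other than main and init exist and that the main file has the same size as the init file; that part is left out of the sketch.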
v35-0017-Make-smgrdounlinkall-accept-fork-numbers.patch
From 0da12993151c8aa0dc37180986f782a82382f202 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 2 Aug 2024 19:34:06 +0900
Subject: [PATCH v35 17/21] Make smgrdounlinkall accept fork numbers
An upcoming patch will require crash-safe file deletion on a per-fork
basis. To support this, modify smgrdounlinkall(), which efficiently
removes multiple files, to accept fork numbers. This commit also
introduces a new type, ForkBitmap, to represent multiple fork numbers
as a single integer.
---
src/backend/catalog/storage.c | 2 +-
src/backend/storage/buffer/bufmgr.c | 92 ++++++++++++++++++++++++-----
src/backend/storage/smgr/md.c | 2 +-
src/backend/storage/smgr/smgr.c | 28 ++++++---
src/backend/utils/cache/relcache.c | 2 +-
src/include/common/relpath.h | 11 ++++
src/include/storage/bufmgr.h | 2 +-
src/include/storage/smgr.h | 3 +-
src/tools/pgindent/typedefs.list | 1 +
9 files changed, 115 insertions(+), 28 deletions(-)
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 4a2c620402b..279b1f7917f 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -776,7 +776,7 @@ smgrDoPendingDeletes(bool isCommit)
if (nrels > 0)
{
- smgrdounlinkall(srels, nrels, false);
+ smgrdounlinkall(srels, NULL, nrels, false);
for (int i = 0; i < nrels; i++)
smgrclose(srels[i]);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3ff2dabdf01..3e5ffd70c42 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -153,6 +153,16 @@ typedef struct BufMgrCleanup
static BufMgrCleanup * cleanups = NULL; /* head of linked list */
+/*
+ * Helper struct for handling a RelFileLocator and its ForkBitmap together in
+ * DropRelationsAllBuffers.
+ */
+typedef struct RelFileForks
+{
+ RelFileLocator rloc; /* key member for qsort */
+ ForkBitmap forks; /* fork numbers as a bitmap */
+} RelFileForks;
+
/* GUC variables */
bool zero_damaged_pages = false;
int bgwriter_lru_maxpages = 100;
@@ -4476,24 +4486,32 @@ SetRelationBuffersPersistenceRedo(SMgrRelation srel, bool permanent,
* This function removes from the buffer pool all the pages of all
* forks of the specified relations. It's equivalent to calling
* DropRelationBuffers once per fork per relation with firstDelBlock = 0.
+ * If pforks is not NULL, only the forks set in pforks[i] are dropped for
+ * smgr_reln[i].
* --------------------------------------------------------------------
*/
void
-DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators)
+DropRelationsAllBuffers(SMgrRelation *smgr_reln, ForkBitmap *pforks,
+ int nlocators)
{
int i;
int n = 0;
SMgrRelation *rels;
BlockNumber (*block)[MAX_FORKNUM + 1];
uint64 nBlocksToInvalidate = 0;
- RelFileLocator *locators;
+ ForkBitmap *forks = NULL;
+ RelFileForks *locators;
bool cached = true;
bool use_bsearch;
if (nlocators == 0)
return;
- rels = palloc(sizeof(SMgrRelation) * nlocators); /* non-local relations */
+ /* storage for non-local relations */
+ rels = palloc(sizeof(SMgrRelation) * nlocators);
+
+ if (pforks)
+ forks = palloc(sizeof(ForkBitmap) * nlocators);
/* If it's a local relation, it's localbuf.c's problem. */
for (i = 0; i < nlocators; i++)
@@ -4504,7 +4522,12 @@ DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators)
DropRelationAllLocalBuffers(smgr_reln[i]->smgr_rlocator.locator);
}
else
- rels[n++] = smgr_reln[i];
+ {
+ rels[n] = smgr_reln[i];
+ if (forks)
+ forks[n] = pforks[i];
+ n++;
+ }
}
/*
@@ -4514,6 +4537,10 @@ DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators)
if (n == 0)
{
pfree(rels);
+
+ if (forks)
+ pfree(forks);
+
return;
}
@@ -4532,6 +4559,13 @@ DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators)
{
for (int j = 0; j <= MAX_FORKNUM; j++)
{
+ /* Consider only the specified forks, if provided. */
+ if (forks && !FORKBITMAP_ISSET(forks[i], j))
+ {
+ block[i][j] = InvalidBlockNumber;
+ continue;
+ }
+
/* Get the number of blocks for a relation's fork. */
block[i][j] = smgrnblocks_cached(rels[i], j);
@@ -4559,7 +4593,7 @@ DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators)
{
for (int j = 0; j <= MAX_FORKNUM; j++)
{
- /* ignore relation forks that doesn't exist */
+ /* ignore relation forks that don't exist or were not requested */
if (!BlockNumberIsValid(block[i][j]))
continue;
@@ -4575,9 +4609,13 @@ DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators)
}
pfree(block);
- locators = palloc(sizeof(RelFileLocator) * n); /* non-local relations */
+ locators = palloc(sizeof(RelFileForks) * n); /* non-local relations */
+
for (i = 0; i < n; i++)
- locators[i] = rels[i]->smgr_rlocator.locator;
+ {
+ locators[i].rloc = rels[i]->smgr_rlocator.locator;
+ locators[i].forks = (forks ? forks[i] : FORKBITMAP_ALLFORKS());
+ }
/*
* For low number of relations to drop just use a simple walk through, to
@@ -4587,13 +4625,34 @@ DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators)
*/
use_bsearch = n > RELS_BSEARCH_THRESHOLD;
- /* sort the list of rlocators if necessary */
- if (use_bsearch)
- qsort(locators, n, sizeof(RelFileLocator), rlocator_comparator);
+ /*
+ * Sort and compress the list of RelFileForks if necessary. We assume the
+ * caller passed unique rlocators if forks are not specified.
+ */
+ if (use_bsearch || forks)
+ {
+ int j = 0;
+
+ qsort(locators, n, sizeof(RelFileForks), rlocator_comparator);
+
+ /*
+ * Now the list is in rlocator increasing order, compress the list by
+ * merging fork bitmaps so that all elements have unique rlocators.
+ */
+ for (i = 1 ; i < n ; i++)
+ {
+ if (RelFileLocatorEquals(locators[j].rloc, locators[i].rloc))
+ locators[j].forks |= locators[i].forks;
+ else
+ locators[++j] = locators[i];
+ }
+
+ n = j + 1;
+ }
for (i = 0; i < NBuffers; i++)
{
- RelFileLocator *rlocator = NULL;
+ RelFileForks *rlocator = NULL;
BufferDesc *bufHdr = GetBufferDescriptor(i);
uint32 buf_state;
@@ -4608,7 +4667,8 @@ DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators)
for (j = 0; j < n; j++)
{
- if (BufTagMatchesRelFileLocator(&bufHdr->tag, &locators[j]))
+ if (BufTagMatchesRelFileLocator(&bufHdr->tag,
+ &locators[j].rloc))
{
rlocator = &locators[j];
break;
@@ -4621,16 +4681,18 @@ DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators)
locator = BufTagGetRelFileLocator(&bufHdr->tag);
rlocator = bsearch((const void *) &(locator),
- locators, n, sizeof(RelFileLocator),
+ locators, n, sizeof(RelFileForks),
rlocator_comparator);
}
/* buffer doesn't belong to any of the given relfilelocators; skip it */
- if (rlocator == NULL)
+ if (rlocator == NULL ||
+ !FORKBITMAP_ISSET(rlocator->forks, BufTagGetForkNum(&bufHdr->tag)))
continue;
buf_state = LockBufHdr(bufHdr);
- if (BufTagMatchesRelFileLocator(&bufHdr->tag, rlocator))
+ if (BufTagMatchesRelFileLocator(&bufHdr->tag, &rlocator->rloc) &&
+ FORKBITMAP_ISSET(rlocator->forks, BufTagGetForkNum(&bufHdr->tag)))
InvalidateBuffer(bufHdr); /* releases spinlock */
else
UnlockBufHdr(bufHdr, buf_state);
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index cc8a80ee961..5cc02fdeeed 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1484,7 +1484,7 @@ DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo)
srels[i] = srel;
}
- smgrdounlinkall(srels, ndelrels, isRedo);
+ smgrdounlinkall(srels, NULL, ndelrels, isRedo);
for (i = 0; i < ndelrels; i++)
smgrclose(srels[i]);
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 5a403ffb04b..a03e2055eab 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -454,15 +454,19 @@ smgrdosyncall(SMgrRelation *rels, int nrels)
/*
* smgrdounlinkall() -- Immediately unlink all forks of all given relations
*
- * All forks of all given relations are removed from the store. This
- * should not be used during transactional operations, since it can't be
- * undone.
+ * Forks of all given relations are removed from the store. This should not be
+ * used during transactional operations, since it can't be undone.
+ *
+ * If forks is NULL, all forks are removed for all relations. Otherwise, only
+ * the forks set in the bitmap forks[i] are removed for rels[i]. A bitmap
+ * with every fork bit set (FORKBITMAP_ALLFORKS()) removes all forks of the
+ * corresponding relation.
*
* If isRedo is true, it is okay for the underlying file(s) to be gone
* already.
*/
void
-smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
+smgrdounlinkall(SMgrRelation *rels, ForkBitmap *forks, int nrels, bool isRedo)
{
int i = 0;
RelFileLocatorBackend *rlocators;
@@ -475,7 +479,7 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
* Get rid of any remaining buffers for the relations. bufmgr will just
* drop them without bothering to write the contents.
*/
- DropRelationsAllBuffers(rels, nrels);
+ DropRelationsAllBuffers(rels, forks, nrels);
/*
* create an array which contains all relations to be dropped, and close
@@ -489,9 +493,13 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
rlocators[i] = rlocator;
- /* Close the forks at smgr level */
+ /* Close the specified forks at smgr level. */
for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- smgrsw[which].smgr_close(rels[i], forknum);
+ {
+ if (!forks || FORKBITMAP_ISSET(forks[i], forknum))
+ smgrsw[which].smgr_close(rels[i], forknum);
+ continue;
+ }
}
/*
@@ -518,7 +526,11 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
int which = rels[i]->smgr_which;
for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- smgrsw[which].smgr_unlink(rlocators[i], forknum, isRedo);
+ {
+ if (!forks || FORKBITMAP_ISSET(forks[i], forknum))
+ smgrsw[which].smgr_unlink(rlocators[i], forknum, isRedo);
+ continue;
+ }
}
pfree(rlocators);
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index ec9acdf188e..ebc910d542d 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -3860,7 +3860,7 @@ RelationSetNewRelfilenumber(Relation relation, char persistence)
* anyway.
*/
srel = smgropen(relation->rd_locator, relation->rd_backend);
- smgrdounlinkall(&srel, 1, false);
+ smgrdounlinkall(&srel, NULL, 1, false);
smgrclose(srel);
}
else
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index 2dabbe01ecd..e9cc7673a95 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -70,6 +70,17 @@ typedef enum ForkNumber
#define MAX_FORKNUM INIT_FORKNUM
+/* ForkBitmap holds multiple forks as a bitmap */
+StaticAssertDecl(MAX_FORKNUM < 8, "MAX_FORKNUM too large for ForkBitmap");
+
+typedef uint8 ForkBitmap;
+#define FORKBITMAP_BIT(f) (1 << (f))
+#define FORKBITMAP_INIT(m, f) ((m) = FORKBITMAP_BIT((f)))
+#define FORKBITMAP_SET(m, f) ((m) |= FORKBITMAP_BIT((f)))
+#define FORKBITMAP_RESET(m, f) ((m) &= ~(FORKBITMAP_BIT(f)))
+#define FORKBITMAP_ISSET(m, f) ((m) & FORKBITMAP_BIT(f))
+#define FORKBITMAP_ALLFORKS() ((1 << (MAX_FORKNUM + 1)) - 1)
+
#define FORKNAMECHARS 4 /* max chars for a fork name */
extern PGDLLIMPORT const char *const forkNames[];
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 4267098080f..5b614fb618e 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -275,7 +275,7 @@ extern void DropRelationBuffers(struct SMgrRelationData *smgr_reln,
ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
extern void DropRelationsAllBuffers(struct SMgrRelationData **smgr_reln,
- int nlocators);
+ ForkBitmap *forks, int nlocators);
extern void DropDatabaseBuffers(Oid dbid);
extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
bool permanent);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a05436a8a7c..f099298f1d0 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -85,7 +85,8 @@ extern void smgrreleaseall(void);
extern void smgrreleaserellocator(RelFileLocatorBackend rlocator);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
-extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
+extern void smgrdounlinkall(SMgrRelation *rels, ForkBitmap *forks, int nrels,
+ bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, const void *buffer, bool skipFsync);
extern void smgrzeroextend(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 82cbd451430..1efc48f14ea 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -816,6 +816,7 @@ ForeignServer
ForeignServerInfo
ForeignTable
ForeignTruncateInfo
+ForkBitmap
ForkNumber
FormData_pg_aggregate
FormData_pg_am
--
2.43.5
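
To make the ForkBitmap idea from the patch above concrete, here is a minimal, standalone sketch. The macro definitions are copied from the relpath.h hunk and the merge loop follows the one added to DropRelationsAllBuffers(), but everything else is an assumption made so the example compiles outside the backend: uint8_t replaces uint8, the fork enum is a local stand-in, and RelForks, relforks_cmp and merge_relforks are hypothetical names that identify a relation by a single integer instead of a RelFileLocator.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Local stand-ins; the numbering mirrors the ForkNumber enum in relpath.h. */
enum
{
	MAIN_FORKNUM = 0,
	FSM_FORKNUM,
	VISIBILITYMAP_FORKNUM,
	INIT_FORKNUM
};
#define MAX_FORKNUM INIT_FORKNUM

/* One bit per fork, as in the patch. */
typedef uint8_t ForkBitmap;
#define FORKBITMAP_BIT(f)		(1 << (f))
#define FORKBITMAP_SET(m, f)	((m) |= FORKBITMAP_BIT(f))
#define FORKBITMAP_ISSET(m, f)	((m) & FORKBITMAP_BIT(f))
#define FORKBITMAP_ALLFORKS()	((1 << (MAX_FORKNUM + 1)) - 1)

/* Hypothetical simplification: a relation is identified by one integer. */
typedef struct RelForks
{
	unsigned int rel;			/* sort key */
	ForkBitmap	forks;			/* forks requested for this relation */
} RelForks;

static int
relforks_cmp(const void *a, const void *b)
{
	unsigned int x = ((const RelForks *) a)->rel;
	unsigned int y = ((const RelForks *) b)->rel;

	return (x > y) - (x < y);
}

/*
 * Sort the entries by relation, then merge duplicates by OR-ing their
 * bitmaps so that each relation appears once.  Returns the new length.
 */
static int
merge_relforks(RelForks *items, int n)
{
	int			j = 0;

	if (n == 0)
		return 0;

	qsort(items, n, sizeof(RelForks), relforks_cmp);

	for (int i = 1; i < n; i++)
	{
		if (items[j].rel == items[i].rel)
			items[j].forks |= items[i].forks;
		else
			items[++j] = items[i];
	}

	return j + 1;
}
```

Merging first means the later scan over shared buffers needs only one lookup per buffer: find the relation, then test the buffer's fork number against the combined bitmap with FORKBITMAP_ISSET.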
v35-0018-Enable-commit-records-to-handle-fork-removals.patch
From 10089b4184ab4af9363864bda3f003bde3f89854 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 2 Aug 2024 20:51:31 +0900
Subject: [PATCH v35 18/21] Enable commit records to handle fork removals
Currently, COMMIT/ABORT WAL records store the relation locators whose
files must be removed when the transaction commits or aborts. This patch
adds support for handling these removals on a per-fork basis. Although the
PREPARE record can store the same information, it is currently not used.
---
src/backend/access/rmgrdesc/xactdesc.c | 44 ++++++++++++++++++++++----
src/backend/access/transam/twophase.c | 14 +++++---
src/backend/access/transam/xact.c | 18 ++++++++---
src/backend/storage/buffer/bufmgr.c | 2 +-
src/backend/storage/smgr/md.c | 11 +++++--
src/include/access/xact.h | 7 ++--
src/include/storage/md.h | 3 +-
7 files changed, 78 insertions(+), 21 deletions(-)
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 08172df83fd..766a0ebcee1 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -82,6 +82,12 @@ ParseCommitRecord(uint8 info, xl_xact_commit *xlrec, xl_xact_parsed_commit *pars
data += MinSizeOfXactRelfileLocators;
data += xl_rellocators->nrels * sizeof(RelFileLocator);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_RELFILEFORKS)
+ {
+ parsed->xforks = (ForkBitmap *)data;
+ data += xl_rellocators->nrels * sizeof(ForkBitmap);
+ }
}
if (parsed->xinfo & XACT_XINFO_HAS_DROPPED_STATS)
@@ -188,6 +194,12 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
data += MinSizeOfXactRelfileLocators;
data += xl_rellocator->nrels * sizeof(RelFileLocator);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_RELFILEFORKS)
+ {
+ parsed->xforks = (ForkBitmap *)data;
+ data += xl_rellocator->nrels * sizeof(ForkBitmap);
+ }
}
if (parsed->xinfo & XACT_XINFO_HAS_DROPPED_STATS)
@@ -262,6 +274,12 @@ ParsePrepareRecord(uint8 info, xl_xact_prepare *xlrec, xl_xact_parsed_prepare *p
parsed->xlocators = (RelFileLocator *) bufptr;
bufptr += MAXALIGN(xlrec->ncommitrels * sizeof(RelFileLocator));
+ if (xlrec->comhasforks)
+ {
+ parsed->xforks = (ForkBitmap *) bufptr;
+ bufptr += MAXALIGN(xlrec->ncommitrels * sizeof(ForkBitmap));
+ }
+
parsed->stats = (xl_xact_stats_item *) bufptr;
bufptr += MAXALIGN(xlrec->ncommitstats * sizeof(xl_xact_stats_item));
@@ -274,7 +292,7 @@ ParsePrepareRecord(uint8 info, xl_xact_prepare *xlrec, xl_xact_parsed_prepare *p
static void
xact_desc_relations(StringInfo buf, char *label, int nrels,
- RelFileLocator *xlocators)
+ RelFileLocator *xlocators, ForkBitmap *xforks)
{
int i;
@@ -287,6 +305,19 @@ xact_desc_relations(StringInfo buf, char *label, int nrels,
appendStringInfo(buf, " %s", path);
pfree(path);
+
+ if (xforks)
+ {
+ char delim = ':';
+ for (int j = 0 ; j <= MAX_FORKNUM ; j++)
+ {
+ if (FORKBITMAP_ISSET(xforks[i], j))
+ {
+ appendStringInfo(buf, "%c%d", delim, j);
+ delim = ',';
+ }
+ }
+ }
}
}
}
@@ -339,7 +370,8 @@ xact_desc_commit(StringInfo buf, uint8 info, xl_xact_commit *xlrec, RepOriginId
appendStringInfoString(buf, timestamptz_to_str(xlrec->xact_time));
- xact_desc_relations(buf, "rels", parsed.nrels, parsed.xlocators);
+ xact_desc_relations(buf, "rels",
+ parsed.nrels, parsed.xlocators, parsed.xforks);
xact_desc_subxacts(buf, parsed.nsubxacts, parsed.subxacts);
xact_desc_stats(buf, "", parsed.nstats, parsed.stats);
@@ -375,7 +407,8 @@ xact_desc_abort(StringInfo buf, uint8 info, xl_xact_abort *xlrec, RepOriginId or
appendStringInfoString(buf, timestamptz_to_str(xlrec->xact_time));
- xact_desc_relations(buf, "rels", parsed.nrels, parsed.xlocators);
+ xact_desc_relations(buf, "rels",
+ parsed.nrels, parsed.xlocators, parsed.xforks);
xact_desc_subxacts(buf, parsed.nsubxacts, parsed.subxacts);
if (parsed.xinfo & XACT_XINFO_HAS_ORIGIN)
@@ -399,9 +432,8 @@ xact_desc_prepare(StringInfo buf, uint8 info, xl_xact_prepare *xlrec, RepOriginI
appendStringInfo(buf, "gid %s: ", parsed.twophase_gid);
appendStringInfoString(buf, timestamptz_to_str(parsed.xact_time));
- xact_desc_relations(buf, "rels(commit)", parsed.nrels, parsed.xlocators);
- xact_desc_relations(buf, "rels(abort)", parsed.nabortrels,
- parsed.abortlocators);
+ xact_desc_relations(buf, "rels(commit)", parsed.nrels,
+ parsed.xlocators, parsed.xforks);
xact_desc_stats(buf, "commit ", parsed.nstats, parsed.stats);
xact_desc_stats(buf, "abort ", parsed.nabortstats, parsed.abortstats);
xact_desc_subxacts(buf, parsed.nsubxacts, parsed.subxacts);
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index b4c423e449e..4def078b652 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -203,6 +203,7 @@ static void RecordTransactionCommitPrepared(TransactionId xid,
TransactionId *children,
int nrels,
RelFileLocator *rels,
+ ForkBitmap *forks,
int nstats,
xl_xact_stats_item *stats,
int ninvalmsgs,
@@ -1524,6 +1525,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
TransactionId latestXid;
TransactionId *children;
RelFileLocator *commitrels;
+ ForkBitmap *commitforks = NULL;
xl_xact_stats_item *commitstats;
xl_xact_stats_item *abortstats;
SharedInvalidationMessage *invalmsgs;
@@ -1582,7 +1584,8 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
if (isCommit)
RecordTransactionCommitPrepared(xid,
hdr->nsubxacts, children,
- hdr->ncommitrels, commitrels,
+ hdr->ncommitrels,
+ commitrels, commitforks,
hdr->ncommitstats,
commitstats,
hdr->ninvalmsgs, invalmsgs,
@@ -1610,6 +1613,9 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
*/
gxact->valid = false;
+ /* Currently, prepare info should not have per-fork storage information. */
+ Assert(!commitforks);
+
/*
* We have to remove any files that were supposed to be dropped. For
* consistency with the regular xact.c code paths, must do this before
@@ -1622,7 +1628,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
if (isCommit)
{
/* Make sure files supposed to be dropped are dropped */
- DropRelationFiles(commitrels, hdr->ncommitrels, false);
+ DropRelationFiles(commitrels, commitforks, hdr->ncommitrels, false);
}
if (isCommit)
@@ -2317,7 +2323,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileLocator *rels,
+ RelFileLocator *rels, ForkBitmap *forks,
int nstats,
xl_xact_stats_item *stats,
int ninvalmsgs,
@@ -2348,7 +2354,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
* not they do.
*/
recptr = XactLogCommitRecord(committs,
- nchildren, children, nrels, rels,
+ nchildren, children, nrels, rels, forks,
nstats, stats,
ninvalmsgs, invalmsgs,
initfileinval,
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c3302b4df46..c1acb41d24a 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1326,6 +1326,7 @@ RecordTransactionCommit(void)
TransactionId latestXid = InvalidTransactionId;
int nrels;
RelFileLocator *rels;
+ ForkBitmap *forks = NULL;
int nchildren;
TransactionId *children;
int ndroppedstats = 0;
@@ -1447,7 +1448,7 @@ RecordTransactionCommit(void)
* Insert the commit XLOG record.
*/
XactLogCommitRecord(GetCurrentTransactionStopTimestamp(),
- nchildren, children, nrels, rels,
+ nchildren, children, nrels, rels, forks,
ndroppedstats, droppedstats,
nmsgs, invalMessages,
RelcacheInitFileInval,
@@ -5850,7 +5851,7 @@ xactGetCommittedChildren(TransactionId **ptr)
XLogRecPtr
XactLogCommitRecord(TimestampTz commit_time,
int nsubxacts, TransactionId *subxacts,
- int nrels, RelFileLocator *rels,
+ int nrels, RelFileLocator *rels, ForkBitmap *forks,
int ndroppedstats, xl_xact_stats_item *droppedstats,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval,
@@ -5918,6 +5919,9 @@ XactLogCommitRecord(TimestampTz commit_time,
xl_xinfo.xinfo |= XACT_XINFO_HAS_RELFILELOCATORS;
xl_relfilelocators.nrels = nrels;
info |= XLR_SPECIAL_REL_UPDATE;
+
+ if (forks)
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_RELFILEFORKS;
}
if (ndroppedstats > 0)
@@ -5980,6 +5984,10 @@ XactLogCommitRecord(TimestampTz commit_time,
MinSizeOfXactRelfileLocators);
XLogRegisterData((char *) rels,
nrels * sizeof(RelFileLocator));
+
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_RELFILEFORKS)
+ XLogRegisterData((char *) forks,
+ nrels * sizeof(ForkBitmap));
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_DROPPED_STATS)
@@ -6256,7 +6264,8 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
XLogFlush(lsn);
/* Make sure files supposed to be dropped are dropped */
- DropRelationFiles(parsed->xlocators, parsed->nrels, true);
+ DropRelationFiles(parsed->xlocators, parsed->xforks, parsed->nrels,
+ true);
}
UndoLog_UndoByXid(true, xid, parsed->nsubxacts, parsed->subxacts, true);
@@ -6370,7 +6379,8 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
*/
XLogFlush(lsn);
- DropRelationFiles(parsed->xlocators, parsed->nrels, true);
+ DropRelationFiles(parsed->xlocators, parsed->xforks, parsed->nrels,
+ true);
}
UndoLog_UndoByXid(false, xid, parsed->nsubxacts, parsed->subxacts, true);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3e5ffd70c42..803a8fd2e6d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -160,7 +160,7 @@ static BufMgrCleanup * cleanups = NULL; /* head of linked list */
typedef struct RelFileForks
{
RelFileLocator rloc; /* key member for qsort */
- ForkBitmap forks; /* fork number in bitmap */
+ ForkBitmap forks; /* fork numbers in bitmap */
} RelFileForks;
/* GUC variables */
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 5cc02fdeeed..297a68d2bde 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1464,7 +1464,8 @@ ForgetDatabaseSyncRequests(Oid dbid)
* DropRelationFiles -- drop files of all given relations
*/
void
-DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo)
+DropRelationFiles(RelFileLocator *delrels, ForkBitmap *delforks, int ndelrels,
+ bool isRedo)
{
SMgrRelation *srels;
int i;
@@ -1478,13 +1479,17 @@ DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo)
{
ForkNumber fork;
+ /* Close the specified forks at smgr level. */
for (fork = 0; fork <= MAX_FORKNUM; fork++)
- XLogDropRelation(delrels[i], fork);
+ {
+ if (!delforks || FORKBITMAP_ISSET(delforks[i], fork))
+ XLogDropRelation(delrels[i], fork);
+ }
}
srels[i] = srel;
}
- smgrdounlinkall(srels, NULL, ndelrels, isRedo);
+ smgrdounlinkall(srels, delforks, ndelrels, isRedo);
for (i = 0; i < ndelrels; i++)
smgrclose(srels[i]);
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 2e09566bdda..16d38c40052 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -196,6 +196,7 @@ typedef struct SavedTransactionCharacteristics
#define XACT_XINFO_HAS_AE_LOCKS (1U << 6)
#define XACT_XINFO_HAS_GID (1U << 7)
#define XACT_XINFO_HAS_DROPPED_STATS (1U << 8)
+#define XACT_XINFO_HAS_RELFILEFORKS (1U << 9)
/*
* Also stored in xinfo, these indicating a variety of additional actions that
@@ -361,6 +362,7 @@ typedef struct xl_xact_prepare
Oid owner; /* user running the transaction */
int32 nsubxacts; /* number of following subxact XIDs */
int32 ncommitrels; /* number of delete-on-commit rels */
+ bool comhasforks; /* commitrels is accompanied by forknums */
int32 ncommitstats; /* number of stats to drop on commit */
int32 nabortstats; /* number of stats to drop on abort */
int32 ninvalmsgs; /* number of cache invalidation messages */
@@ -388,6 +390,7 @@ typedef struct xl_xact_parsed_commit
int nrels;
RelFileLocator *xlocators;
+ ForkBitmap *xforks;
int nstats;
xl_xact_stats_item *stats;
@@ -397,8 +400,6 @@ typedef struct xl_xact_parsed_commit
TransactionId twophase_xid; /* only for 2PC */
char twophase_gid[GIDSIZE]; /* only for 2PC */
- int nabortrels; /* only for 2PC */
- RelFileLocator *abortlocators; /* only for 2PC */
int nabortstats; /* only for 2PC */
xl_xact_stats_item *abortstats; /* only for 2PC */
@@ -421,6 +422,7 @@ typedef struct xl_xact_parsed_abort
int nrels;
RelFileLocator *xlocators;
+ ForkBitmap *xforks;
int nstats;
xl_xact_stats_item *stats;
@@ -504,6 +506,7 @@ extern int xactGetCommittedChildren(TransactionId **ptr);
extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileLocator *rels,
+ ForkBitmap *forks,
int ndroppedstats,
xl_xact_stats_item *droppedstats,
int nmsgs, SharedInvalidationMessage *msgs,
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index b72293c79a5..18dbf8a1462 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -48,7 +48,8 @@ extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void mdregistersync(SMgrRelation reln, ForkNumber forknum);
extern void ForgetDatabaseSyncRequests(Oid dbid);
-extern void DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo);
+extern void DropRelationFiles(RelFileLocator *delrels, ForkBitmap *delforks,
+ int ndelrels, bool isRedo);
/* md sync callbacks */
extern int mdsyncfiletag(const FileTag *ftag, char *path);
--
2.43.5
v35-0019-Add-per-fork-deletion-support-to-pendingDeletes.patch (text/x-patch; charset=us-ascii)
From 5e48e6540a2541b01aae8514249a51729a13bf40 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 2 Aug 2024 21:39:11 +0900
Subject: [PATCH v35 19/21] Add per-fork deletion support to pendingDeletes
This patch introduces the ability to handle commit-time pending
deletes on a per-fork basis.
---
src/backend/access/transam/twophase.c | 10 ++++-
src/backend/access/transam/xact.c | 2 +-
src/backend/catalog/storage.c | 64 +++++++++++++++++++++++----
src/include/catalog/storage.h | 3 +-
4 files changed, 68 insertions(+), 11 deletions(-)
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 4def078b652..817d8cad587 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1088,6 +1088,7 @@ StartPrepare(GlobalTransaction gxact)
TwoPhaseFileHeader hdr;
TransactionId *children;
RelFileLocator *commitrels;
+ ForkBitmap *commitforks;
xl_xact_stats_item *abortstats = NULL;
xl_xact_stats_item *commitstats = NULL;
@@ -1114,7 +1115,8 @@ StartPrepare(GlobalTransaction gxact)
hdr.prepared_at = gxact->prepared_at;
hdr.owner = gxact->owner;
hdr.nsubxacts = xactGetCommittedChildren(&children);
- hdr.ncommitrels = smgrGetCommitPendingDeletes(&commitrels);
+ hdr.ncommitrels = smgrGetCommitPendingDeletes(&commitrels, &commitforks);
+ hdr.comhasforks = (commitforks != NULL);
hdr.ncommitstats =
pgstat_get_transactional_drops(true, &commitstats);
hdr.nabortstats =
@@ -1143,6 +1145,12 @@ StartPrepare(GlobalTransaction gxact)
{
save_state_data(commitrels, hdr.ncommitrels * sizeof(RelFileLocator));
pfree(commitrels);
+
+ if (hdr.comhasforks)
+ {
+ save_state_data(commitforks, hdr.ncommitrels * sizeof(ForkBitmap));
+ pfree(commitforks);
+ }
}
if (hdr.ncommitstats > 0)
{
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c1acb41d24a..6ba9dc97aeb 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1347,7 +1347,7 @@ RecordTransactionCommit(void)
LogLogicalInvalidations();
/* Get data needed for commit record */
- nrels = smgrGetCommitPendingDeletes(&rels);
+ nrels = smgrGetCommitPendingDeletes(&rels, &forks);
nchildren = xactGetCommittedChildren(&children);
ndroppedstats = pgstat_get_transactional_drops(true, &droppedstats);
if (XLogStandbyInfoActive())
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 279b1f7917f..10ab5baf97a 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -64,6 +64,7 @@ int wal_skip_threshold = 2048; /* in kilobytes */
typedef struct PendingRelDelete
{
RelFileLocator rlocator; /* relation that may need to be deleted */
+ ForkBitmap forks; /* fork bitmap */
ProcNumber procNumber; /* INVALID_PROC_NUMBER if not a temp rel */
int nestLevel; /* xact nesting level of request */
struct PendingRelDelete *next; /* linked-list link */
@@ -257,6 +258,7 @@ RelationDropStorage(Relation rel)
pending = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
pending->rlocator = rel->rd_locator;
+ pending->forks = FORKBITMAP_ALLFORKS();
pending->procNumber = rel->rd_backend;
pending->nestLevel = GetCurrentTransactionNestLevel();
pending->next = pendingDeletes;
@@ -306,6 +308,8 @@ RelationPreserveStorageOnCommit(RelFileLocator rlocator)
next = pending->next;
if (RelFileLocatorEquals(rlocator, pending->rlocator))
{
+ Assert(pending->forks == FORKBITMAP_ALLFORKS());
+
/* unlink and delete list entry */
if (prev)
prev->next = next;
@@ -677,8 +681,9 @@ SerializePendingSyncs(Size maxSize, char *startAddress)
/* remove deleted rnodes */
for (delete = pendingDeletes; delete != NULL; delete = delete->next)
- (void) hash_search(tmphash, &delete->rlocator,
- HASH_REMOVE, NULL);
+ if (delete->forks == FORKBITMAP_ALLFORKS())
+ (void) hash_search(tmphash, &delete->rlocator,
+ HASH_REMOVE, NULL);
hash_seq_init(&scan, tmphash);
while ((src = (RelFileLocator *) hash_seq_search(&scan)))
@@ -730,6 +735,7 @@ smgrDoPendingDeletes(bool isCommit)
int nrels = 0,
maxrels = 0;
SMgrRelation *srels = NULL;
+ ForkBitmap *forks = NULL;
prev = NULL;
for (pending = pendingDeletes; pending != NULL; pending = next)
@@ -752,6 +758,8 @@ smgrDoPendingDeletes(bool isCommit)
{
SMgrRelation srel;
+ Assert(pending->forks == FORKBITMAP_ALLFORKS());
+
srel = smgropen(pending->rlocator, pending->procNumber);
/* allocate the initial array, or extend it, if needed */
@@ -764,8 +772,26 @@ smgrDoPendingDeletes(bool isCommit)
{
maxrels *= 2;
srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+
+ /* expand forks array if any */
+ if (forks)
+ forks = repalloc(forks, sizeof(ForkBitmap) * maxrels);
}
+ /* Create forks array on encountering partial forks. */
+ Assert((pending->forks & ~FORKBITMAP_ALLFORKS()) == 0);
+ if (!forks && pending->forks != FORKBITMAP_ALLFORKS())
+ {
+ forks = palloc(sizeof(ForkBitmap) * maxrels);
+
+ /* fill in the past elements */
+ for (int i = 0 ; i < nrels ; i++)
+ forks[i] = FORKBITMAP_ALLFORKS();
+ }
+
+ if (forks)
+ forks[nrels] = pending->forks;
+
srels[nrels++] = srel;
}
/* must explicitly free the list entry */
@@ -776,12 +802,15 @@ smgrDoPendingDeletes(bool isCommit)
if (nrels > 0)
{
- smgrdounlinkall(srels, NULL, nrels, false);
+ smgrdounlinkall(srels, forks, nrels, false);
for (int i = 0; i < nrels; i++)
smgrclose(srels[i]);
pfree(srels);
+
+ if (forks)
+ pfree(forks);
}
}
@@ -940,34 +969,53 @@ smgrDoPendingSyncs(bool isCommit, bool isParallelWorker)
* by upper-level transactions.
*/
int
-smgrGetCommitPendingDeletes(RelFileLocator **ptr)
+smgrGetCommitPendingDeletes(RelFileLocator **ptr, ForkBitmap **fptr)
{
int nestLevel = GetCurrentTransactionNestLevel();
int nrels;
+ bool hasforks = false;
RelFileLocator *rptr;
+ ForkBitmap *rfptr = NULL;
PendingRelDelete *pending;
nrels = 0;
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
- if (pending->nestLevel >= nestLevel
- && pending->procNumber == INVALID_PROC_NUMBER)
+ Assert((pending->forks & ~FORKBITMAP_ALLFORKS()) == 0);
+
+ if (pending->nestLevel >= nestLevel)
+ {
nrels++;
+
+ if (pending->forks != FORKBITMAP_ALLFORKS())
+ hasforks = true;
+ }
}
if (nrels == 0)
{
*ptr = NULL;
+ *fptr = NULL;
return 0;
}
rptr = (RelFileLocator *) palloc(nrels * sizeof(RelFileLocator));
*ptr = rptr;
+
+ if (hasforks)
+ rfptr = (ForkBitmap *) palloc(nrels * sizeof(ForkBitmap));
+ *fptr = rfptr;
+
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
- if (pending->nestLevel >= nestLevel
- && pending->procNumber == INVALID_PROC_NUMBER)
+ if (pending->nestLevel >= nestLevel)
{
*rptr = pending->rlocator;
rptr++;
+
+ if (rfptr)
+ {
+ *rfptr = pending->forks;
+ rfptr++;
+ }
}
}
return nrels;
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 19b02d84a5f..a1abd614cdc 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -44,7 +44,8 @@ extern void RestorePendingSyncs(char *startAddress);
*/
extern void smgrDoPendingDeletes(bool isCommit);
extern void smgrDoPendingSyncs(bool isCommit, bool isParallelWorker);
-extern int smgrGetCommitPendingDeletes(RelFileLocator **ptr);
+extern int smgrGetCommitPendingDeletes(RelFileLocator **ptr,
+ ForkBitmap **fptr);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
--
2.43.5
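The lazy allocation of the fork array in smgrDoPendingDeletes() and smgrGetCommitPendingDeletes() above can be looked at in isolation. What follows is only a standalone sketch of that idea, not code from the patch: the ForkBitmap typedef, FORK_BIT(), ALL_FORKS and collect_forks() are hypothetical stand-ins, since the real ForkBitmap definitions are not part of this message. The array is created only when the first partial bitmap is seen, and the earlier elements are back-filled with "all forks", so the common case keeps passing NULL.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins; the real definitions live elsewhere in the patch set. */
typedef uint8_t ForkBitmap;
#define N_FORKS      4                          /* main, fsm, vm, init */
#define FORK_BIT(f)  ((ForkBitmap) (1U << (f)))
#define ALL_FORKS    ((ForkBitmap) ((1U << N_FORKS) - 1))

/*
 * Return NULL when every pending entry covers all forks.  Otherwise return a
 * malloc'ed array of n bitmaps, one per entry; the caller frees it.
 */
static ForkBitmap *
collect_forks(const ForkBitmap *pending, int n)
{
    ForkBitmap *forks = NULL;

    for (int i = 0; i < n; i++)
    {
        /* no bits outside the known forks may be set */
        assert((pending[i] & ~ALL_FORKS) == 0);

        /* create the array on encountering the first partial bitmap */
        if (forks == NULL && pending[i] != ALL_FORKS)
        {
            forks = malloc(sizeof(ForkBitmap) * n);
            if (forks == NULL)
                abort();

            /* fill in the past elements */
            for (int j = 0; j < i; j++)
                forks[j] = ALL_FORKS;
        }

        if (forks != NULL)
            forks[i] = pending[i];
    }

    return forks;
}
```

A NULL result corresponds to the case where the callers in the patch pass NULL down to smgrdounlinkall(), meaning that every fork of every relation is removed.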
v35-0020-Allow-init-fork-to-be-dropped.patch (text/x-patch; charset=us-ascii)
From 055a53481de4ae936c77baa84d8e21c50768458e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 16 Aug 2024 23:35:43 +0900
Subject: [PATCH v35 20/21] Allow init fork to be dropped
Building on features introduced in previous commits, this commit adds
the ability to drop init fork transactionally. Dropping an init fork
is deferred until transaction commit, using the pendingDeletes
mechanism. No user side code is provided.
---
src/backend/catalog/storage.c | 50 ++++++++++++++++++++++++++++++++---
src/include/catalog/storage.h | 1 +
2 files changed, 47 insertions(+), 4 deletions(-)
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 10ab5baf97a..f2dc413c738 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -178,6 +178,17 @@ void
RelationCreateFork(SMgrRelation srel, ForkNumber forkNum,
bool wal_log, bool undo_log)
{
+#ifdef USE_ASSERT_CHECKING
+ /* we must not have pending delete for the init fork. */
+ if (forkNum == INIT_FORKNUM)
+ {
+ for (PendingRelDelete *p = pendingDeletes ; p != NULL ; p = p->next)
+ Assert(!FORKBITMAP_ISSET(p->forks, INIT_FORKNUM) ||
+ !RelFileLocatorEquals(srel->smgr_rlocator.locator,
+ p->rlocator));
+ }
+#endif
+
/* Schedule the removal of this init fork at abort if requested. */
if (undo_log)
ulog_smgrcreate(srel, forkNum);
@@ -189,6 +200,29 @@ RelationCreateFork(SMgrRelation srel, ForkNumber forkNum,
smgrcreate(srel, forkNum, false);
}
+/*
+ * RelationDropInitFork
+ * Delete physical storage for the init fork of a relation.
+ */
+void
+RelationDropInitFork(SMgrRelation srel)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+ RelFileLocator rlocator = srel->smgr_rlocator.locator;
+ ProcNumber procNumber = srel->smgr_rlocator.backend;
+ PendingRelDelete *pending;
+
+ /* Schedule the removal of this init fork at commit. */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->rlocator = rlocator;
+ pending->procNumber = procNumber;
+ pending->forks = FORKBITMAP_BIT(INIT_FORKNUM);
+ pending->nestLevel = nestLevel;
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+ }
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
@@ -758,8 +792,6 @@ smgrDoPendingDeletes(bool isCommit)
{
SMgrRelation srel;
- Assert(pending->forks == FORKBITMAP_ALLFORKS());
-
srel = smgropen(pending->rlocator, pending->procNumber);
/* allocate the initial array, or extend it, if needed */
@@ -1057,8 +1089,18 @@ AtSubCommit_smgr(void)
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
- if (pending->nestLevel >= nestLevel)
- pending->nestLevel = nestLevel - 1;
+ if (pending->nestLevel < nestLevel)
+ {
+#ifdef USE_ASSERT_CHECKING
+ /* all the remaining entries must be of upper subtransactions */
+ for (; pending ; pending = pending->next)
+ Assert(pending->nestLevel < nestLevel);
+#endif
+ break;
+ }
+
+ /* move this entry to the immediately upper subtransaction */
+ pending->nestLevel = nestLevel - 1;
}
}
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index a1abd614cdc..06a32c56a88 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -27,6 +27,7 @@ extern SMgrRelation RelationCreateStorage(RelFileLocator rlocator,
bool register_delete);
extern void RelationCreateFork(SMgrRelation srel, ForkNumber forkNum,
bool wal_log, bool undo_log);
+extern void RelationDropInitFork(SMgrRelation srel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorageOnCommit(RelFileLocator rlocator);
extern void RelationPreTruncate(Relation rel);
--
2.43.5
v35-0021-In-place-persistence-change-to-LOGGED.patch (text/x-patch; charset=us-ascii)
From a646814f23c52a09367fd08661cf72e56cd0bc39 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 27 Aug 2024 10:44:46 +0900
Subject: [PATCH v35 21/21] In-place persistence change to LOGGED
---
src/backend/commands/tablecmds.c | 27 +++++++-----
src/test/recovery/t/044_persistence_change.pl | 43 ++++++++++---------
2 files changed, 40 insertions(+), 30 deletions(-)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index f8d240f374f..3908505ec71 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -5804,9 +5804,6 @@ RelationChangePersistence(AlteredTableInfo *tab, char persistence,
continue;
}
- /* Currently, only allowing changes to UNLOGGED. */
- Assert(!persistent);
-
RelationAssumePersistenceChange(r);
/* switch buffer persistence */
@@ -5814,11 +5811,22 @@ RelationChangePersistence(AlteredTableInfo *tab, char persistence,
log_smgrbufpersistence(srel->smgr_rlocator.locator, persistent);
SetRelationBuffersPersistence(srel, persistent);
- /* then create the init fork */
- is_index = (r->rd_rel->relkind == RELKIND_INDEX);
- RelationCreateFork(srel, INIT_FORKNUM, !is_index, true);
- if (is_index)
- r->rd_indam->ambuildempty(r);
+ /* then create or drop the init fork */
+ if (persistent)
+ RelationDropInitFork(srel);
+ else
+ {
+ is_index = (r->rd_rel->relkind == RELKIND_INDEX);
+
+ /*
+ * If it is an index, have access methods initialize the file. In
+ * that case, WAL-logging is expected to be performed by the
+ * ambuildempty() method.
+ */
+ RelationCreateFork(srel, INIT_FORKNUM, !is_index, true);
+ if (is_index)
+ r->rd_indam->ambuildempty(r);
+ }
/* Update catalog */
tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
@@ -5977,8 +5985,7 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE &&
- persistence == RELPERSISTENCE_UNLOGGED)
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
{
/* Make in-place persistence change. */
RelationChangePersistence(tab, persistence, lockmode);
diff --git a/src/test/recovery/t/044_persistence_change.pl b/src/test/recovery/t/044_persistence_change.pl
index ad1b444cb46..24da84d562f 100644
--- a/src/test/recovery/t/044_persistence_change.pl
+++ b/src/test/recovery/t/044_persistence_change.pl
@@ -100,8 +100,8 @@ max_prepared_transactions = 2
# Check if SET LOGGED didn't change relfilenumbers and data survive a crash
my $relfilenodes3 = getrelfilenodes($node, \@relnames);
- ok (!checkrelfilenodes($relfilenodes2, $relfilenodes3),
- "crashed SET-LOGGED relations have sane relfilenodes transition");
+ ok (checkrelfilenodes($relfilenodes2, $relfilenodes3),
+ "crashed SET-LOGGED relations have sane relfilenodes transition");
is ($node->safe_psql('postgres', "SELECT count(*) FROM t;"), 2000,
"crashed SET-LOGGED table does not lose data");
@@ -147,34 +147,35 @@ max_prepared_transactions = 2
"storages are reverted to logged state");
### Subtransactions
- ok ($node->psql('postgres',
+ my ($ret, $stdout, $stderr) =
+ $node->psql('postgres',
qq(
BEGIN;
ALTER TABLE t SET UNLOGGED; -- committed
SAVEPOINT a;
- ALTER TABLE t SET LOGGED; -- aborted
+ ALTER TABLE t SET LOGGED; -- ERROR
SAVEPOINT b;
ROLLBACK TO a;
COMMIT;
- )) != 3,
- "command succeeds 1");
-
- is ($node->safe_psql('postgres', "SELECT count(*) FROM t;"), 2000,
- "table data is not changed 1");
- ok (check_storage_state(\&is_unlogged_state, $node, \@relnames),
- "storages are changed to unlogged state");
+ ));
+ ok ($stderr =~ m/persistence of this relation has been already changed/,
+ "errors out when double flip occurred in a single transaction");
+ ok (check_storage_state(\&is_logged_state, $node, \@relnames),
+ "storages stay in logged state");
ok ($node->psql('postgres',
qq(
+ ALTER TABLE t SET UNLOGGED;
BEGIN;
+ SAVEPOINT a;
ALTER TABLE t SET LOGGED; -- aborted
+ ROLLBACK TO a;
SAVEPOINT a;
- ALTER TABLE t SET UNLOGGED; -- aborted
- SAVEPOINT b;
+ ALTER TABLE t SET LOGGED; -- no error
RELEASE a;
ROLLBACK;
)) != 3,
- "command succeeds 2");
+ "rolled-back persistence flip doesn't prevent subsequent flips");
is ($node->safe_psql('postgres', "SELECT count(*) FROM t;"), 2000,
"table data is not changed 2");
@@ -182,7 +183,7 @@ max_prepared_transactions = 2
"storages stay in unlogged state");
### Prepared transactions
- my ($ret, $stdout, $stderr) =
+ ($ret, $stdout, $stderr) =
$node->psql('postgres',
qq(
ALTER TABLE t SET LOGGED;
@@ -207,16 +208,17 @@ max_prepared_transactions = 2
));
ok ($ret == 0, "prepare persistence-flipped xact 2");
ok (check_storage_state(\&is_logged_state, $node, \@relnames),
- "storages stay in logged state");
+ "storages stay in logged state 2");
### Error out DML
- $node->psql('postgres',
+ ok($node->psql('postgres',
qq(
BEGIN;
- ALTER TABLE t SET LOGGED;
+ ALTER TABLE t SET LOGGED; -- no effect
INSERT INTO t VALUES(1); -- Succeeds
COMMIT;
- ));
+ )) != 3,
+ "ineffective persistence change doesn't prevent DML");
($ret, $stdout, $stderr) =
$node->psql('postgres',
@@ -232,7 +234,7 @@ max_prepared_transactions = 2
qq(
BEGIN;
SAVEPOINT a;
- ALTER TABLE t SET UNLOGGED;
+ ALTER TABLE t SET LOGGED;
ROLLBACK TO a;
INSERT INTO t VALUES(3); -- Succeeds
COMMIT;
@@ -242,6 +244,7 @@ max_prepared_transactions = 2
($ret, $stdout, $stderr) =
$node->psql('postgres',
qq(
+ ALTER TABLE t SET LOGGED;
BEGIN;
SAVEPOINT a;
ALTER TABLE t SET UNLOGGED;
--
2.43.5
On 31/10/2024 10:01, Kyotaro Horiguchi wrote:
After some delays, here’s the new version. In this update, UNDO logs
are WAL-logged and processed in memory under most conditions. During
checkpoints, they’re flushed to files, which are then read when a
specific XID’s UNDO log is accessed for the first time during
recovery.
The biggest changes are in patches 0001 through 0004 (equivalent to
the previous 0001-0002). After that, there aren’t any major
changes. Since this update involves removing some existing features,
I’ve split these parts into multiple smaller identity transformations
to make them clearer.
As for changes beyond that, the main one is lifting the previous
restriction on PREPARE for transactions after a persistence
change. This was made possible because, with the shift to in-memory
processing of UNDO logs, commit-time crash recovery detection is now
simpler. Additional changes include completely removing the
abort-handling portion from the pendingDeletes mechanism (0008-0010).
In this patch version, the undo log is kept in dynamic shared memory. It
can grow indefinitely. On a checkpoint, it's flushed to disk.
If I'm reading it correctly, the undo records are kept in the DSA area
even after it's flushed to disk. That's not necessary; system never
needs to read the undo log unless there's a crash, so there's no need to
keep it in memory after it's been flushed to disk. That's true today; we
could start relying on the undo log to clean up on abort even when
there's no crash, but I think it's a good design to not do that and rely
on backend-private state for non-crash transaction abort.
I'd suggest doing this the other way 'round. Let's treat the on-disk
representation as the primary representation, not the in-memory one.
Let's use a small fixed-size shared memory area just as a write buffer
to hold the dirty undo log entries that haven't been written to disk
yet. Most transactions are short, so most undo log entries never need to
be flushed to disk, but I think it'll be simpler to think of it that
way. On checkpoint, flush all the buffered dirty entries from memory to
disk and clear the buffer. Also do that if the buffer fills up.
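To make the write-buffer idea above concrete, here is a minimal sketch. It assumes a single fixed-size buffer and ignores locking, fsync and per-XID files entirely; all names (UndoWriteBuffer, undo_buffer_append(), undo_buffer_flush(), counting_sink()) are hypothetical and do not appear in the patch set. The sink callback stands in for the actual file write.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define UNDO_BUF_SIZE 64        /* deliberately tiny, for illustration only */

/* Stand-in for "write these bytes to the undo file". */
typedef int (*UndoSink) (const void *data, size_t len, void *arg);

typedef struct UndoWriteBuffer
{
    char        data[UNDO_BUF_SIZE];    /* dirty bytes not yet on disk */
    size_t      used;                   /* bytes of data[] in use */
    UndoSink    sink;                   /* performs the write */
    void       *sink_arg;
} UndoWriteBuffer;

/* Example sink that only counts the bytes it is handed. */
static int
counting_sink(const void *data, size_t len, void *arg)
{
    (void) data;
    *(size_t *) arg += len;
    return 0;
}

/* Write out everything buffered and clear the buffer (checkpoint path). */
static int
undo_buffer_flush(UndoWriteBuffer *buf)
{
    if (buf->used > 0)
    {
        if (buf->sink(buf->data, buf->used, buf->sink_arg) != 0)
            return -1;
        buf->used = 0;
    }
    return 0;
}

/* Append one entry, flushing first if it would not fit (buffer-full path). */
static int
undo_buffer_append(UndoWriteBuffer *buf, const void *entry, size_t len)
{
    assert(buf->used <= UNDO_BUF_SIZE);

    if (len > UNDO_BUF_SIZE)
        return -1;              /* oversized entries are out of scope here */
    if (buf->used + len > UNDO_BUF_SIZE && undo_buffer_flush(buf) != 0)
        return -1;

    memcpy(buf->data + buf->used, entry, len);
    buf->used += len;
    return 0;
}
```

With this shape, a short transaction whose entries are discarded before the next checkpoint never reaches the sink at all, which is the property the suggestion relies on.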
A high-level overview comment of the on-disk format would be nice. If I
understand correctly, there's a magic constant at the beginning of each
undo file, followed by UndoLogRecords. There are no other file headers
and no page structure within the file.
That format seems reasonable. For cross-checking, maybe add the XID to
the file header too. There is a separate CRC value on each record, which
is nice, but not strictly necessary since the writes to the UNDO log are
WAL-logged. The WAL needs CRCs on each record to detect the end of log,
but the UNDO log doesn't need that. Anyway, it's fine.
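As an illustration of the cross-check suggested above, the following sketch validates a file header that carries both the magic constant and the owning XID. The struct layout, the magic value and the function name are assumptions made for this example only; the patch's actual header contains just the magic constant.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define UNDO_FILE_MAGIC 0x554E444FU     /* hypothetical value */

/* Hypothetical header: magic constant plus the proposed XID field. */
typedef struct UndoFileHeader
{
    uint32_t    magic;
    uint32_t    xid;        /* XID the file is expected to belong to */
} UndoFileHeader;

/* Return true if the buffer starts with a valid header for expected_xid. */
static bool
undo_header_matches(const unsigned char *buf, size_t len,
                    uint32_t expected_xid)
{
    UndoFileHeader hdr;

    assert(buf != NULL);

    if (len < sizeof(hdr))
        return false;           /* truncated file */

    /* copy out rather than cast, to avoid alignment assumptions */
    memcpy(&hdr, buf, sizeof(hdr));

    return hdr.magic == UNDO_FILE_MAGIC && hdr.xid == expected_xid;
}
```

A reader that locates the file by name (derived from the XID) could call this before trusting any record, so that a misnamed or stale file is rejected early.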
I somehow dislike the file per subxid design. I'm sure it works, it's
just more of a feeling that it doesn't feel right. I'm somewhat worried
about ending up with lots of files, if you e.g. use temporary tables
with subtransactions heavily. Could we have just one file per top-level
XID? I guess that can become a problem too, if you have a lot of aborted
subtransactions. The UNDO records for the aborted subtransactions would
bloat the undo file. But maybe that's nevertheless better?
--
Heikki Linnakangas
Neon (https://neon.tech)
Thank you for the quick comments.
At Thu, 31 Oct 2024 23:24:36 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
On 31/10/2024 10:01, Kyotaro Horiguchi wrote:
After some delays, here’s the new version. In this update, UNDO logs
are WAL-logged and processed in memory under most conditions. During
checkpoints, they’re flushed to files, which are then read when a
specific XID’s UNDO log is accessed for the first time during
recovery.
The biggest changes are in patches 0001 through 0004 (equivalent to
the previous 0001-0002). After that, there aren’t any major
changes. Since this update involves removing some existing features,
I’ve split these parts into multiple smaller identity transformations
to make them clearer.
As for changes beyond that, the main one is lifting the previous
restriction on PREPARE for transactions after a persistence
change. This was made possible because, with the shift to in-memory
processing of UNDO logs, commit-time crash recovery detection is now
simpler. Additional changes include completely removing the
abort-handling portion from the pendingDeletes mechanism (0008-0010).
In this patch version, the undo log is kept in dynamic shared
memory. It can grow indefinitely. On a checkpoint, it's flushed to
disk.
If I'm reading it correctly, the undo records are kept in the DSA area
even after it's flushed to disk. That's not necessary; system never
needs to read the undo log unless there's a crash, so there's no need
The system also needs to read the undo log whenever additional undo
logs are added. In this version, I’ve moved all abort-time
pendingDeletes data entirely to the undo logs. In other words, the DSA
area is expanded in exchange for reducing the pendingDelete list. As a
result, there is minimal impact on overall memory usage. Additionally,
the current flushing code is straightforward because it relies on the
in-memory primary image. If we drop the in-memory image during flush,
we might need exclusive locking or possibly some ordering
techniques. Anyway, I’ll consider that approach.
to keep it in memory after it's been flushed to disk. That's true
today; we could start relying on the undo log to clean up on abort
even when there's no crash, but I think it's a good design to not do
that and rely on backend-private state for non-crash transaction
abort.
Hmm. Sounds reasonable. In the next version, I'll revert the changes
to pendingDeletes and adjust it to just discard the log on regular
aborts.
I'd suggest doing this the other way 'round. Let's treat the on-disk
representation as the primary representation, not the in-memory
one. Let's use a small fixed-size shared memory area just as a write
buffer to hold the dirty undo log entries that haven't been written to
disk yet. Most transactions are short, so most undo log entries never
need to be flushed to disk, but I think it'll be simpler to think of
it that way. On checkpoint, flush all the buffered dirty entries from
memory to disk and clear the buffer. Also do that if the buffer fills
up.
I'd like to clarify the specific concept of these fixed-length memory
slots. Is it something like this: each slot is keyed by an XID,
followed by an in-file offset and a series of, say, 1024-byte areas?
When writing a log for a new XID, if no slot is available, the backend
would immediately evict the slot with the smallest XID to disk to free
up space. If an existing slot runs out of space while writing new
logs, the backend would flush it immediately and continue using the
area. Is this correct? Additionally, if multiple processes try to
write to a single slot, stricter locking might be needed. For example,
if a slot is evicted by a backend other than its user, exclusive
control might be required during the file write. Is there any
effective way to avoid such locking? In the current patch set, I’m
avoiding any impact on the backend from checkpointer file writes by
treating the in-memory image as primary. And regarding the number of
these areas… although I’m not entirely sure, it seems unlikely we’d
have hundreds of sessions simultaneously creating tables, so would it
make sense to make this configurable, with a default of around 32
areas?
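To make the slot layout I'm asking about concrete, here is a minimal sketch. It is not taken from the patch set: every name in it (UndoSlot, slot_acquire, slot_append, slot_flush) is hypothetical, the slot count is kept small for illustration, the file write is only simulated by a counter, and locking is omitted entirely.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SLOT_NUM      4     /* hypothetical; the mail suggests around 32 */
#define SLOT_BUF_LEN  1024  /* bytes of buffered undo data per slot */

typedef struct UndoSlot
{
    uint64_t xid;           /* owning XID; 0 means the slot is free */
    uint64_t file_off;      /* file offset where the buffered bytes belong */
    uint32_t used;          /* bytes currently buffered */
    char     buf[SLOT_BUF_LEN];
} UndoSlot;

static UndoSlot slots[SLOT_NUM];
static int      flush_count;    /* number of simulated file writes */

/* Stand-in for writing a slot's buffer to its undo file. */
static void
slot_flush(UndoSlot *s)
{
    s->file_off += s->used;
    s->used = 0;
    flush_count++;
}

/* Find the slot for xid, or claim one, evicting the smallest XID if full. */
static UndoSlot *
slot_acquire(uint64_t xid)
{
    UndoSlot *victim = NULL;
    UndoSlot *free_slot = NULL;

    for (int i = 0; i < SLOT_NUM; i++)
    {
        if (slots[i].xid == xid)
            return &slots[i];
        if (slots[i].xid == 0)
        {
            if (free_slot == NULL)
                free_slot = &slots[i];
        }
        else if (victim == NULL || slots[i].xid < victim->xid)
            victim = &slots[i];
    }

    if (free_slot == NULL)
    {
        slot_flush(victim);     /* evict the smallest XID to disk */
        free_slot = victim;
    }

    /*
     * A new XID starts at offset 0.  An evicted XID that comes back would
     * need its offset restored from the file; that is omitted here.
     */
    free_slot->xid = xid;
    free_slot->file_off = 0;
    free_slot->used = 0;
    return free_slot;
}

/* Append len bytes for xid, flushing the slot first if it would overflow. */
static void
slot_append(uint64_t xid, const void *data, uint32_t len)
{
    UndoSlot *s = slot_acquire(xid);

    assert(len <= SLOT_BUF_LEN);
    if (s->used + len > SLOT_BUF_LEN)
        slot_flush(s);
    memcpy(s->buf + s->used, data, len);
    s->used += len;
}
```

The locking question above shows up in slot_acquire(): the eviction branch flushes a slot that belongs to some other transaction, which is exactly where exclusive control would be needed in a real implementation.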
A high-level overview comment of the on-disk format would be nice. If
I understand correctly, there's a magic constant at the beginning of
each undo file, followed by UndoLogRecords. There are no other file
headers and no page structure within the file.
Right.
That format seems reasonable. For cross-checking, maybe add the XID to
the file header too. There is a separate CRC value on each record,
which is nice, but not strictly necessary since the writes to the UNDO
log are WAL-logged. The WAL needs CRCs on each record to detect the
end of log, but the UNDO log doesn't need that. Anyway, it's fine.
For the first point, I considered it when designing the previous patch
set but chose not to implement it. As for the CRC, you're right - it’s
simply a leftover from the previous design. I have no issues with
following both points.
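As an illustration of the format discussed above (a magic constant, then back-to-back records with no page structure, plus the suggested XID in the header), here is a minimal sketch of a reader that walks such a file image. The struct layouts, the magic value, and all demo_* names are assumptions made for this example and do not match the patch's actual definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_ULOG_MAGIC 0x554C4F47u    /* hypothetical magic value */

typedef struct DemoFileHeader
{
    uint32_t magic;         /* fixed file magic */
    uint64_t xid;           /* the suggested cross-check field */
} DemoFileHeader;

typedef struct DemoRecord
{
    uint32_t tot_len;       /* total record length, including this header */
    uint8_t  rmid;          /* resource manager id */
    uint8_t  info;          /* record info bits */
    /* rmgr-specific payload follows */
} DemoRecord;

/* Returns the number of records, or -1 if the image is malformed. */
static int
demo_count_records(const char *img, size_t len, uint64_t expected_xid)
{
    DemoFileHeader hdr;
    size_t      off = sizeof(hdr);
    int         n = 0;

    assert(img != NULL);
    if (len < sizeof(hdr))
        return -1;
    memcpy(&hdr, img, sizeof(hdr));
    if (hdr.magic != DEMO_ULOG_MAGIC || hdr.xid != expected_xid)
        return -1;

    while (off < len)
    {
        DemoRecord  rec;

        if (len - off < sizeof(rec))
            return -1;          /* truncated record header */
        memcpy(&rec, img + off, sizeof(rec));
        if (rec.tot_len < sizeof(rec) || rec.tot_len > len - off)
            return -1;          /* impossible length or truncated payload */
        off += rec.tot_len;
        n++;
    }
    return n;
}

/* Test helper: append a record with a zeroed payload; returns new length. */
static size_t
demo_append(char *img, size_t off, uint8_t rmid, uint32_t payload_len)
{
    DemoRecord  rec = {0};

    rec.tot_len = (uint32_t) sizeof(rec) + payload_len;
    rec.rmid = rmid;
    memcpy(img + off, &rec, sizeof(rec));
    memset(img + off + sizeof(rec), 0, payload_len);
    return off + rec.tot_len;
}
```

Without per-record CRCs, the length checks and the header XID are the only sanity checks in this sketch; detecting torn contents is left to WAL replay, as noted above.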
I somehow dislike the file per subxid design. I'm sure it works, it's
just more of a feeling that it doesn't feel right. I'm somewhat
worried about ending up with lots of files, if you e.g. use temporary
tables with subtransactions heavily. Could we have just one file per
I first thought the same thing when working on the previous patch.
top-level XID? I guess that can become a problem too, if you have a
lot of aborted subtransactions. The UNDO records for the aborted
subtransactions would bloat the undo file. But maybe that's
nevertheless better?
In the current patch set, normal abort processing is handled by the
UNDO log, so maintaining the performance of the UNDO process is
essential. If we were to return this to pendingDeletes, it might also
be feasible to add an XID cancellation record to the UNDO log and scan
the entire file once before executing individual logs. I’ll give it
some thought.
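The "scan the entire file once, then execute" idea could look roughly like the sketch below. It reflects only my reading of that idea, not code from the patch set: the record layout, the DEMO_* kinds, and demo_replay() are hypothetical, and "executing" a record is reduced to counting it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum
{
    DEMO_ACTION,            /* an undo action logged by a (sub)transaction */
    DEMO_CANCEL             /* cancels all actions logged under subxid */
} DemoKind;

typedef struct DemoUndoRec
{
    DemoKind    kind;
    uint64_t    subxid;
} DemoUndoRec;

#define DEMO_MAX_CANCELLED 64   /* sketch limit; extra cancels are ignored */

/* Returns how many undo actions would be executed. */
static int
demo_replay(const DemoUndoRec *recs, size_t nrecs)
{
    uint64_t    cancelled[DEMO_MAX_CANCELLED];
    size_t      ncancelled = 0;
    int         executed = 0;

    assert(recs != NULL || nrecs == 0);

    /* Pass 1: scan the whole log once and collect the cancelled XIDs. */
    for (size_t i = 0; i < nrecs; i++)
    {
        if (recs[i].kind == DEMO_CANCEL && ncancelled < DEMO_MAX_CANCELLED)
            cancelled[ncancelled++] = recs[i].subxid;
    }

    /* Pass 2: execute every action whose XID was not cancelled. */
    for (size_t i = 0; i < nrecs; i++)
    {
        bool        skip = false;

        if (recs[i].kind != DEMO_ACTION)
            continue;
        for (size_t j = 0; j < ncancelled; j++)
        {
            if (cancelled[j] == recs[i].subxid)
            {
                skip = true;
                break;
            }
        }
        if (!skip)
            executed++;     /* a real system would invoke the rmgr callback */
    }
    return executed;
}
```

The extra pass is the cost of cancellation; it is only acceptable if this path is not taken during normal aborts.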
At Mon, 28 Oct 2024 15:33:41 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
On 31/08/2024 19:09, Kyotaro Horiguchi wrote:
Subject: [PATCH v34 03/16] Remove function for retaining files on outer transaction aborts
The function RelationPreserveStorage() was initially created to keep
storage files committed in a subtransaction on the abort of outer
transactions. It was introduced by commit b9b8831ad6 in 2010, but no
use case for this behavior has emerged since then. If we move the
at-commit removal feature of storage files from pendingDeletes to the
UNDO log system, the UNDO system would need to accept the cancellation
of already logged entries, which makes the system overly complex with
no benefit. Therefore, remove the feature.
I don't think that's quite right. I don't think this was meant for
subtransaction aborts, but to make sure that if the top-transaction
aborts after AtEOXact_RelationMap() has already been called, we don't
remove the new relation. AtEOXact_RelationMap() is called very late in
Hmm. I believe I wrote that. It prevents storage removal once it’s
committed in any subtransaction, even if that subtransaction is
finally aborted, including by the top transaction.
the commit process to keep the window as small as possible, but if it
nevertheless happens, the consequences are pretty bad if you remove a
relation file that is in fact needed.
However, on second thought, it does seem odd. I may have confused
something here. If pendingDeletes is restored and undo cancellation is
implemented, this change would be unnecessary.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
A bit out of the blue, but I remembered why I was able to make the
change that I previously agreed seemed off. Just thought I’d let you
know.
At Tue, 05 Nov 2024 13:25:26 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
me> > the commit process to keep the window as small as possible, but if it
me> > nevertheless happens, the consequences are pretty bad if you remove a
me> > relation file that is in fact needed.
me>
me> However, on second thought, it does seem odd. I may have confused
me> something here. If pendingDeletes is restored and undo cancellation is
me> implemented, this change would be unnecessary.
The change would indeed be incorrect if updates to mapped relations
could occur within subtransactions. However, in reality, trying to
perform such an operation raises an error (something like “cannot do
this in a subtransaction”) and is rejected. So, there’s actually no
path where the removed code would be used. That’s why I judged it was
safe to remove that part. However, from that perspective, I think the
explanations in the comments and commit messages were somewhat lacking
or missed the point.
Currently, I’m leaning toward implementing per-relation undo
cancellation. Previously, this path was active even during normal
aborts, so there were performance concerns, but now it only runs
during recovery cleanup, so there are no performance issues with
handling cancellation. In the current state, the code has been
simplified overall.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello. This is the updated version.
(Sorry for the delay; I've been a little swamped.)
- Undo logs are primarily stored in a fixed number of fixed-length
slots and are spilled into files under some conditions.
The number of slots is 32 (ULOG_SLOT_NUM), and the buffer length is
1024 (ULOG_SLOT_BUF_LEN). Both are currently non-configurable.
- Undo logs are now used only during recovery and are no longer
involved in transaction ends for normal backends. Pending deletes for
aborts have been restored.
- Undo logs are stored on a per-Top-XID basis.
- RelationPreserveStorage() is no longer modified.
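For reference, the per-top-XID file naming can be tried standalone. The sketch below mirrors the format string used by UndoLogSetFilename() in the attached 0001 patch (directory "pg_ulog", 16 hex digits of the full transaction ID), but takes a plain uint64_t instead of FullTransactionId; the demo_* names are made up for this example.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

#define DEMO_UNDOLOG_DIR "pg_ulog"

/* Build the undo log file name for a 64-bit (full) transaction ID. */
static void
demo_undo_filename(char *buf, size_t buflen, uint64_t full_xid)
{
    assert(buf != NULL && buflen > 0);
    snprintf(buf, buflen, "%s/%016" PRIx64, DEMO_UNDOLOG_DIR, full_xid);
}
```

With this scheme every top-level transaction that logs undo data maps to exactly one file name, so subtransactions share their parent's file.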
In this version, the patches that follow the introduction of orphan
storage prevention no longer prohibit prepared transactions from
persisting across server crashes. This is because handling for such
cases has been reverted to pendingDeletes.
Let me know if you have any questions or concerns.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v36-0001-Add-XLOG-resource-for-the-undo-log-system.patch (text/x-patch; charset=us-ascii)
From 22463d55a836958e6d17a85dad9945e93c256ef8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Mon, 30 Sep 2024 16:31:02 +0900
Subject: [PATCH v36 01/17] Add XLOG resource for the undo log system
In the upcoming UNDO log system, XLOG will be used to persist UNDO log
information. This commit adds the necessary XLOG components, leaving
out the main part of the UNDO log, to provide a minimal implementation
for easier review.
---
src/backend/access/rmgrdesc/Makefile | 1 +
src/backend/access/rmgrdesc/meson.build | 1 +
src/backend/access/rmgrdesc/undologdesc.c | 99 +++++++++++++++++++++++
src/backend/access/transam/Makefile | 1 +
src/backend/access/transam/meson.build | 1 +
src/backend/access/transam/rmgr.c | 3 +-
src/backend/access/transam/undolog.c | 38 +++++++++
src/bin/pg_rewind/parsexlog.c | 2 +-
src/bin/pg_waldump/rmgrdesc.c | 3 +-
src/include/access/rmgr.h | 2 +-
src/include/access/rmgrlist.h | 47 +++++------
src/include/access/undolog.h | 84 +++++++++++++++++++
src/tools/pgindent/typedefs.list | 5 ++
13 files changed, 260 insertions(+), 27 deletions(-)
create mode 100644 src/backend/access/rmgrdesc/undologdesc.c
create mode 100644 src/backend/access/transam/undolog.c
create mode 100644 src/include/access/undolog.h
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index cd95eec37f1..542fd3d6a8e 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -29,6 +29,7 @@ OBJS = \
spgdesc.o \
standbydesc.o \
tblspcdesc.o \
+ undologdesc.o \
xactdesc.o \
xlogdesc.o
diff --git a/src/backend/access/rmgrdesc/meson.build b/src/backend/access/rmgrdesc/meson.build
index e8b7a65fc76..d19c2c3b7ca 100644
--- a/src/backend/access/rmgrdesc/meson.build
+++ b/src/backend/access/rmgrdesc/meson.build
@@ -22,6 +22,7 @@ rmgr_desc_sources = files(
'spgdesc.c',
'standbydesc.c',
'tblspcdesc.c',
+ 'undologdesc.c',
'xactdesc.c',
'xlogdesc.c',
)
diff --git a/src/backend/access/rmgrdesc/undologdesc.c b/src/backend/access/rmgrdesc/undologdesc.c
new file mode 100644
index 00000000000..e7559cdd33c
--- /dev/null
+++ b/src/backend/access/rmgrdesc/undologdesc.c
@@ -0,0 +1,99 @@
+/*-------------------------------------------------------------------------
+ *
+ * undologdesc.c
+ * rmgr descriptor routines for access/transam/undolog.c
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/rmgrdesc/undologdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/undolog.h"
+
+typedef struct UndoDescData
+{
+ const char *rm_name;
+ void (*rm_undodesc) (StringInfo buf, UndoLogRecord *record);
+ const char *(*rm_undoidentify) (uint8 info);
+} UndoDescData;
+
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo,undo_desc,undo_identify,undo_event) \
+ { name, undo_desc, undo_identify },
+
+static UndoDescData UndoRoutines[RM_MAX_ID + 1] = {
+#include "access/rmgrlist.h"
+};
+#undef PG_RMGR
+
+void
+undolog_desc(StringInfo buf, XLogReaderState *record)
+{
+ char *rec = XLogRecGetData(record);
+ uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_ULOG_CREATE)
+ {
+ xl_ulog_create *crec = (xl_ulog_create *) rec;
+ char fname[MAXPGPATH];
+
+ UndoLogSetFilename(fname, crec->xid);
+ appendStringInfo(buf, "\"%s\"", fname);
+ }
+ else if (info == XLOG_ULOG_WRITE)
+ {
+ xl_ulog_write *wrec = (xl_ulog_write *) rec;
+ UndoLogRecord *urec = (UndoLogRecord *) wrec->bytes;
+
+ /*
+ * The file header and records are recovered in the same way without
+ * using resource manager routines. However, while description routines
+ * are typically provided as resource routines, the file header does
+ * not have one. Therefore, it requires explicit handling here.
+ */
+ if (wrec->off == 0)
+ {
+ /* This is the file header. No extra data is currently stored. */
+ appendStringInfo(buf, "HEADER");
+ }
+ else
+ {
+ /* This is a ulog record. Let rmgr routines handle it. */
+ UndoDescData rmgr = UndoRoutines[urec->ul_rmid];
+ const char *id;
+
+ Assert(rmgr.rm_undoidentify);
+ id = rmgr.rm_undoidentify(ULogRecGetInfo(urec));
+ if (id == NULL)
+ appendStringInfo(buf, "UNKNOWN (%X): ",
+ ULogRecGetInfo(urec));
+ else
+ appendStringInfo(buf, "%s: ", id);
+
+ if (UndoRoutines[urec->ul_rmid].rm_undodesc)
+ UndoRoutines[urec->ul_rmid].rm_undodesc(buf, urec);
+ }
+ }
+}
+
+const char *
+undolog_identify(uint8 info)
+{
+ const char *id = NULL;
+
+ switch (info & ~XLR_INFO_MASK)
+ {
+ case XLOG_ULOG_CREATE:
+ id = "CREATE";
+ break;
+ case XLOG_ULOG_WRITE:
+ id = "WRITE";
+ break;
+ }
+
+ return id;
+}
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 661c55a9db7..57fca954ca6 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -25,6 +25,7 @@ OBJS = \
transam.o \
twophase.o \
twophase_rmgr.o \
+ undolog.o \
varsup.o \
xact.o \
xlog.o \
diff --git a/src/backend/access/transam/meson.build b/src/backend/access/transam/meson.build
index 8a3522557cd..2cdb7feeb1b 100644
--- a/src/backend/access/transam/meson.build
+++ b/src/backend/access/transam/meson.build
@@ -13,6 +13,7 @@ backend_sources += files(
'transam.c',
'twophase.c',
'twophase_rmgr.c',
+ 'undolog.c',
'varsup.c',
'xact.c',
'xlog.c',
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 1b7499726eb..1fc5a1e5059 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -30,6 +30,7 @@
#include "access/multixact.h"
#include "access/nbtxlog.h"
#include "access/spgxlog.h"
+#include "access/undolog.h"
#include "access/xact.h"
#include "catalog/storage_xlog.h"
#include "commands/dbcommands_xlog.h"
@@ -44,7 +45,7 @@
/* must be kept in sync with RmgrData definition in xlog_internal.h */
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo,undo_desc,undo_identify,undo_event) \
{ name, redo, desc, identify, startup, cleanup, mask, decode },
RmgrData RmgrTable[RM_MAX_ID + 1] = {
diff --git a/src/backend/access/transam/undolog.c b/src/backend/access/transam/undolog.c
new file mode 100644
index 00000000000..c32f5cd0b6f
--- /dev/null
+++ b/src/backend/access/transam/undolog.c
@@ -0,0 +1,38 @@
+/*-------------------------------------------------------------------------
+ *
+ * undolog.c
+ * Undo log manager for PostgreSQL
+ *
+ * This module logs the cleanup procedures required during a transaction abort.
+ * The information is recorded in WAL-logged files to ensure post-crash
+ * recovery runs the necessary cleanup procedures.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/undolog.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/undolog.h"
+
+/*
+ * undolog_redo()
+ *
+ * Recovery routine for undo logs.
+ */
+void
+undolog_redo(XLogReaderState *record)
+{
+ uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+ if (info == XLOG_ULOG_CREATE)
+ {
+ }
+ else if (info == XLOG_ULOG_WRITE)
+ {
+ }
+}
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 242326c97a7..64901967d2a 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -28,7 +28,7 @@
* RmgrNames is an array of the built-in resource manager names, to make error
* messages a bit nicer.
*/
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo,undo_desc,undo_identify,undo_event) \
name,
static const char *const RmgrNames[RM_MAX_ID + 1] = {
diff --git a/src/bin/pg_waldump/rmgrdesc.c b/src/bin/pg_waldump/rmgrdesc.c
index 6b8c17bb4c4..df0eedb2146 100644
--- a/src/bin/pg_waldump/rmgrdesc.c
+++ b/src/bin/pg_waldump/rmgrdesc.c
@@ -20,6 +20,7 @@
#include "access/nbtxlog.h"
#include "access/rmgr.h"
#include "access/spgxlog.h"
+#include "access/undolog.h"
#include "access/xact.h"
#include "access/xlog_internal.h"
#include "catalog/storage_xlog.h"
@@ -32,7 +33,7 @@
#include "storage/standbydefs.h"
#include "utils/relmapper.h"
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo,undo_desc,undo_identify,undo_event) \
{ name, desc, identify},
static const RmgrDescData RmgrDescTable[RM_N_BUILTIN_IDS] = {
diff --git a/src/include/access/rmgr.h b/src/include/access/rmgr.h
index 3b6a497e1b4..20919e834ab 100644
--- a/src/include/access/rmgr.h
+++ b/src/include/access/rmgr.h
@@ -19,7 +19,7 @@ typedef uint8 RmgrId;
* Note: RM_MAX_ID must fit in RmgrId; widening that type will affect the XLOG
* file format.
*/
-#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo,undo_desc,undo_identify,undo_event) \
symname,
typedef enum RmgrIds
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 78e6b908c6e..5909d87d599 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -24,26 +24,27 @@
* Changes to this list possibly need an XLOG_PAGE_MAGIC bump.
*/
-/* symbol name, textual name, redo, desc, identify, startup, cleanup, mask, decode */
-PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL, NULL, xlog_decode)
-PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL, NULL, xact_decode)
-PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL, standby_decode)
-PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask, heap2_decode)
-PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask, heap_decode)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask, NULL)
-PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask, NULL)
-PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask, NULL)
-PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask, NULL)
-PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL, seq_mask, NULL)
-PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup, spg_mask, NULL)
-PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL, brin_mask, NULL)
-PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask, NULL)
-PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL, logicalmsg_decode)
+/* symbol name, textual name, redo, desc, identify, startup, cleanup, mask, decode, undo, undo_desc, undo_identify, undo_event */
+PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL, NULL, xlog_decode, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL, NULL, xact_decode, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_MULTIXACT_ID, "MultiXact", multixact_redo, multixact_desc, multixact_identify, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL, standby_decode, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask, heap2_decode, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask, heap_decode, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL, seq_mask, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup, spg_mask, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL, brin_mask, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL, logicalmsg_decode, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_ULOG_ID, "UndoLog", undolog_redo, undolog_desc, undolog_identify, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
diff --git a/src/include/access/undolog.h b/src/include/access/undolog.h
new file mode 100644
index 00000000000..35f7619a121
--- /dev/null
+++ b/src/include/access/undolog.h
@@ -0,0 +1,84 @@
+/*-------------------------------------------------------------------------
+ *
+ * undolog.h
+ * Definitions for undolog module of PostgreSQL
+ *
+ * Copyright (c) 2000-2024, PostgreSQL Global Development Group
+ *
+ * src/include/access/undolog.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDOLOG_H
+#define UNDOLOG_H
+
+#include "access/transam.h"
+#include "access/xlogreader.h"
+
+/* Directory for storing undo logs */
+#define UNDOLOG_DIR "pg_ulog"
+
+typedef struct UndoLogFileHeader
+{
+ int32 magic; /* fixed ULOG file magic number */
+ bool crashed; /* this transaction experienced a crash */
+ /* UndoLogRecord follows */
+} UndoLogFileHeader;
+
+typedef struct UndoLogRecord
+{
+ uint32 ul_tot_len; /* total length of entire record */
+ RmgrId ul_rmid; /* resource manager for this record */
+ uint8 ul_info; /* record info */
+ FullTransactionId ul_xid; /* subtransaction id */
+ /* rmgr-specific data follow, no padding */
+} UndoLogRecord;
+
+/*
+ * The high 4 bits in ul_info may be used freely by rmgr. The lower 4 bits are
+ * not used for now.
+ */
+#define ULR_INFO_MASK 0x0F
+#define ULR_RMGR_INFO_MASK 0xF0
+
+/* XLOG stuff */
+#define XLOG_ULOG_CREATE 0x00
+#define XLOG_ULOG_WRITE 0x10
+
+typedef struct xl_ulog_create
+{
+ FullTransactionId xid;
+} xl_ulog_create;
+
+typedef struct xl_ulog_write
+{
+ FullTransactionId topxid;
+ FullTransactionId subxid;
+ off_t off;
+ Size len;
+ unsigned char bytes[FLEXIBLE_ARRAY_MEMBER];
+} xl_ulog_write;
+
+extern void undolog_redo(XLogReaderState *record);
+extern void undolog_desc(StringInfo buf, XLogReaderState *record);
+extern const char *undolog_identify(uint8 info);
+
+#define ULogRecGetData(record) ((char *)record + sizeof(UndoLogRecord))
+#define ULogRecGetInfo(record) ((record)->ul_info)
+
+/*
+ * UndoLogSetFilename()
+ *
+ * Generates undo log file name for the xid. Used in undolog.c and
+ * undologdesc.c.
+ */
+static inline void
+UndoLogSetFilename(char *buf, FullTransactionId xid)
+{
+ StaticAssertDecl(sizeof(FullTransactionId) == 8,
+ "width of FullTransactionId is not 8");
+ snprintf(buf, MAXPGPATH, "%s/%016" PRIx64,
+ UNDOLOG_DIR, U64FromFullTransactionId(xid));
+}
+
+#endif /* UNDOLOG_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e1c4f913f84..9c78c72841a 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3043,6 +3043,9 @@ ULONG
ULONG_PTR
UV
UVersionInfo
+UndoDescData
+UndoLogFileHeader
+UndoLogRecord
UnicodeNormalizationForm
UnicodeNormalizationQC
Unique
@@ -4147,6 +4150,8 @@ xl_standby_locks
xl_tblspc_create_rec
xl_tblspc_drop_rec
xl_testcustomrmgrs_message
+xl_ulog_create
+xl_ulog_write
xl_xact_abort
xl_xact_assignment
xl_xact_commit
--
2.43.5
v36-0002-Delay-the-reset-of-UNLOGGED-relations.patch (text/x-patch; charset=us-ascii)
From ef5c1a71cbf54de1abcd4984f6b0c7d39af51af2 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Mon, 30 Sep 2024 17:56:46 +0900
Subject: [PATCH v36 02/17] Delay the reset of UNLOGGED relations
This patch set enables the creation of INIT forks within
transactions. Cleanup of such forks after a crash will be managed by
the UNDO log system, which will be introduced in a subsequent
patch. Since the consistency of UNDO logs relies on WAL, operations
involving INIT forks - specifically reinit - must only occur after
recovery has reached consistency.
To prepare for the introduction of the UNDO log system, this commit
adjusts the timing of UNLOGGED relation cleanup. Instead of occurring
before recovery begins, it is now performed either when the
consistency point is reached or at the end of recovery if hot standby
is disabled.
---
src/backend/access/transam/xlog.c | 17 +++++++++--------
src/backend/access/transam/xlogrecovery.c | 9 +++++++++
2 files changed, 18 insertions(+), 8 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6f58412bcab..62767a4a2b9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5761,14 +5761,6 @@ StartupXLOG(void)
/* Check that the GUCs used to generate the WAL allow recovery */
CheckRequiredParameterValues();
- /*
- * We're in recovery, so unlogged relations may be trashed and must be
- * reset. This should be done BEFORE allowing Hot Standby
- * connections, so that read-only backends don't try to read whatever
- * garbage is left over from before.
- */
- ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP);
-
/*
* Likewise, delete any saved transaction snapshot files that got left
* behind by crashed backends.
@@ -5916,7 +5908,16 @@ StartupXLOG(void)
* end-of-recovery steps fail.
*/
if (InRecovery)
+ {
+ /*
+ * Clean up unlogged relations if not already done. If consistency has
+ * been established, this cleanup would have occurred when entering hot
+ * standby mode (see CheckRecoveryConsistency for details).
+ */
+ if (!reachedConsistency)
+ ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP);
ResetUnloggedRelations(UNLOGGED_RELATION_INIT);
+ }
/*
* Pre-scan prepared transactions to find out the range of XIDs present.
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c6994b78282..5ceebce5a19 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -56,6 +56,7 @@
#include "storage/latch.h"
#include "storage/pmsignal.h"
#include "storage/procarray.h"
+#include "storage/reinit.h"
#include "storage/spin.h"
#include "utils/datetime.h"
#include "utils/fmgrprotos.h"
@@ -2260,6 +2261,14 @@ CheckRecoveryConsistency(void)
reachedConsistency &&
IsUnderPostmaster)
{
+ /*
+ * Unlogged relations may be trashed and must be reset. This should be
+ * done BEFORE allowing Hot Standby connections, so that read-only
+ * backends don't try to read whatever garbage is left over from
+ * before.
+ */
+ ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP);
+
SpinLockAcquire(&XLogRecoveryCtl->info_lck);
XLogRecoveryCtl->SharedHotStandbyActive = true;
SpinLockRelease(&XLogRecoveryCtl->info_lck);
--
2.43.5
v36-0003-Add-new-function-TwoPhaseXidExists.patch (text/x-patch; charset=us-ascii)
From d5ba35a4e3183172443f015b590556686fdc3966 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 3 Oct 2024 17:46:06 +0900
Subject: [PATCH v36 03/17] Add new function TwoPhaseXidExists
The undo log system needs to know whether a transaction is in the
prepared state or not. Add a new function TwoPhaseXidExists to
accommodate this requirement.
---
src/backend/access/transam/twophase.c | 31 +++++++++++++++++++++------
src/include/access/twophase.h | 1 +
2 files changed, 26 insertions(+), 6 deletions(-)
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 49be1df91c1..af9d522f094 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -794,10 +794,11 @@ pg_prepared_xact(PG_FUNCTION_ARGS)
* specified by XID
*
* If lock_held is set to true, TwoPhaseStateLock will not be taken, so the
- * caller had better hold it.
+ * caller had better hold it. If noerror is true, returns NULL if the global
+ * transaction does not exist.
*/
static GlobalTransaction
-TwoPhaseGetGXact(TransactionId xid, bool lock_held)
+TwoPhaseGetGXact(TransactionId xid, bool lock_held, bool noerror)
{
GlobalTransaction result = NULL;
int i;
@@ -831,8 +832,13 @@ TwoPhaseGetGXact(TransactionId xid, bool lock_held)
if (!lock_held)
LWLockRelease(TwoPhaseStateLock);
- if (result == NULL) /* should not happen */
- elog(ERROR, "failed to find GlobalTransaction for xid %u", xid);
+ if (result == NULL)
+ {
+ if (!noerror)
+ elog(ERROR, "failed to find GlobalTransaction for xid %u", xid);
+
+ return NULL;
+ }
cached_xid = xid;
cached_gxact = result;
@@ -902,7 +908,7 @@ TwoPhaseGetXidByVirtualXID(VirtualTransactionId vxid,
ProcNumber
TwoPhaseGetDummyProcNumber(TransactionId xid, bool lock_held)
{
- GlobalTransaction gxact = TwoPhaseGetGXact(xid, lock_held);
+ GlobalTransaction gxact = TwoPhaseGetGXact(xid, lock_held, false);
return gxact->pgprocno;
}
@@ -917,11 +923,24 @@ TwoPhaseGetDummyProcNumber(TransactionId xid, bool lock_held)
PGPROC *
TwoPhaseGetDummyProc(TransactionId xid, bool lock_held)
{
- GlobalTransaction gxact = TwoPhaseGetGXact(xid, lock_held);
+ GlobalTransaction gxact = TwoPhaseGetGXact(xid, lock_held, false);
return GetPGProcByNumber(gxact->pgprocno);
}
+/*
+ * TwoPhaseXidExists
+ * Returns whether the prepared transaction specified by XID exists
+ *
+ * If lock_held is set to true, TwoPhaseStateLock will not be taken, so the
+ * caller had better hold it.
+ */
+bool
+TwoPhaseXidExists(TransactionId xid, bool lock_held)
+{
+ return TwoPhaseGetGXact(xid, lock_held, true) != NULL;
+}
+
/************************************************************************/
/* State file support */
/************************************************************************/
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index b85b65c604e..c6298332d36 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -38,6 +38,7 @@ extern TransactionId TwoPhaseGetXidByVirtualXID(VirtualTransactionId vxid,
bool *have_more);
extern PGPROC *TwoPhaseGetDummyProc(TransactionId xid, bool lock_held);
extern int TwoPhaseGetDummyProcNumber(TransactionId xid, bool lock_held);
+extern bool TwoPhaseXidExists(TransactionId xid, bool lock_held);
extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
TimestampTz prepared_at,
--
2.43.5
v36-0004-Introduce-undo-log-implementation.patch (text/x-patch; charset=iso-8859-7)
From 3e49ed3955de9bd3007997175d402c90ee89eb24 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 3 Oct 2024 18:24:54 +0900
Subject: [PATCH v36 04/17] Introduce undo log implementation
This commit implements the UNDO log system, where undo information is
primarily stored in a fixed number of slots allocated in static shared
memory, keyed by top transaction IDs. As long as the data fits within
the slots, it is not written to files; instead, every undo record
write is WAL-logged. Data in the slots is written to files during
checkpoints, ensuring persistence across checkpoints.
During recovery, undo log files are recovered from WAL and removed at
transaction ends. Any remaining undo logs at the end of recovery are
intended to be processed by future undo routines.
While recovery could be faster if undo-log recovery used the same
in-memory mechanism as normal operations, it currently does not for
the sake of simplicity.
---
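A note for readers of this patch: the slot mechanism described in the commit message can be sketched as a toy, single-process model. This is only an illustration under simplifying assumptions, not the code in undolog.c. The names Slot, find_slot, flush_slot, append_record, file_size and flush_count are hypothetical, the "file" is a per-xid size counter, and there is no WAL, no locking, and no file header. The real patch also looks up a backend's slot through its local current_slot shortcut instead of scanning every slot.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SLOT_NUM      4     /* the patch uses 32 */
#define SLOT_BUF_LEN  16    /* the patch uses 1024 */
#define MAX_XID       64    /* toy limit for the fake file table */

typedef struct Slot
{
    uint64_t      xid;      /* 0 means the slot is unused */
    long          off;      /* file offset where buf[0] belongs */
    int           len;      /* bytes currently buffered in buf */
    unsigned char buf[SLOT_BUF_LEN];
} Slot;

static Slot slots[SLOT_NUM];
static long file_size[MAX_XID];   /* stand-in for the per-xid undo files */
static int  flush_count;          /* number of simulated file writes */

/* "Write" the buffered bytes to the file; release the slot unless keep. */
static void
flush_slot(Slot *s, int keep)
{
    if (s->len > 0)
    {
        file_size[s->xid] = s->off + s->len;
        s->off += s->len;
        s->len = 0;
        flush_count++;
    }
    if (!keep)
        s->xid = 0;
}

/* Return the slot for xid, taking a free slot or evicting the oldest xid. */
static Slot *
find_slot(uint64_t xid)
{
    Slot *victim = NULL;

    assert(xid != 0 && xid < MAX_XID);

    for (int i = 0; i < SLOT_NUM; i++)
        if (slots[i].xid == xid)
            return &slots[i];

    for (int i = 0; i < SLOT_NUM; i++)
    {
        Slot *s = &slots[i];

        if (s->xid == 0)
        {
            victim = s;         /* an unused slot always wins */
            break;
        }
        if (victim == NULL || s->xid < victim->xid)
            victim = s;         /* otherwise remember the oldest xid */
    }

    if (victim->xid != 0)
        flush_slot(victim, 0);  /* evict: spill the old owner's data */

    victim->xid = xid;
    victim->off = file_size[xid];   /* resume after any spilled data */
    victim->len = 0;
    return victim;
}

/* Append one record, spilling the slot first if it would overflow. */
static void
append_record(uint64_t xid, const void *rec, int reclen)
{
    Slot *s;

    assert(reclen > 0 && reclen <= SLOT_BUF_LEN);

    s = find_slot(xid);
    if (s->len + reclen > SLOT_BUF_LEN)
        flush_slot(s, 1);
    memcpy(s->buf + s->len, rec, reclen);
    s->len += reclen;
}
```

In this model a transaction whose records fit within one slot never touches its file. A spill happens only on overflow or eviction. The checkpoint-time flush from the commit message would correspond to calling flush_slot() on every slot in use.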
src/backend/access/transam/twophase.c | 3 +
src/backend/access/transam/undolog.c | 989 ++++++++++++++++++
src/backend/access/transam/xact.c | 10 +
src/backend/access/transam/xlog.c | 14 +-
src/backend/access/transam/xlogrecovery.c | 2 +
src/backend/storage/ipc/ipci.c | 3 +
src/backend/storage/lmgr/lwlock.c | 3 +
.../utils/activity/wait_event_names.txt | 1 +
src/backend/utils/init/postinit.c | 4 +
src/bin/initdb/initdb.c | 17 +
src/bin/pg_waldump/t/001_basic.pl | 3 +-
src/include/access/undolog.h | 29 +
src/include/storage/lwlock.h | 3 +
src/include/storage/lwlocklist.h | 1 +
src/tools/pgindent/typedefs.list | 6 +
15 files changed, 1084 insertions(+), 4 deletions(-)
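Another reader's note: UndoLog_UndoByXid() in undolog.c receives a 32-bit TransactionId and has to widen it to a FullTransactionId before it can look up the slot or the file. It borrows the epoch from nextXid and steps back one epoch when the xid is numerically greater than the low half of nextXid. The standalone sketch below shows only that arithmetic, with plain integers in place of the PostgreSQL types. The function name widen_xid is hypothetical, and it assumes, as the patch does, that the xid is less than one full wraparound older than nextXid.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Widen a 32-bit xid to 64 bits using the next-xid counter as reference.
 * The upper 32 bits of next_full_xid are the epoch, the lower 32 bits are
 * the next xid to be assigned.
 */
static uint64_t
widen_xid(uint64_t next_full_xid, uint32_t xid)
{
    uint32_t next_epoch = (uint32_t) (next_full_xid >> 32);
    uint32_t next_xid = (uint32_t) next_full_xid;
    uint32_t epoch;

    /* An xid above next_xid must have been assigned in the previous epoch. */
    if (xid <= next_xid)
        epoch = next_epoch;
    else
        epoch = next_epoch - 1;

    return ((uint64_t) epoch << 32) | xid;
}
```

For example, with the counter at epoch 2 and xid 100, xid 50 stays in epoch 2, while xid 4000000000 is placed in epoch 1.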
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index af9d522f094..ba1a8bd875c 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -82,6 +82,7 @@
#include "access/transam.h"
#include "access/twophase.h"
#include "access/twophase_rmgr.h"
+#include "access/undolog.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
@@ -1607,6 +1608,8 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
abortstats,
gid);
+ UndoLog_UndoByXid(isCommit, xid, hdr->nsubxacts, children);
+
ProcArrayRemove(proc, latestXid);
/*
diff --git a/src/backend/access/transam/undolog.c b/src/backend/access/transam/undolog.c
index c32f5cd0b6f..196e02e652f 100644
--- a/src/backend/access/transam/undolog.c
+++ b/src/backend/access/transam/undolog.c
@@ -17,7 +17,923 @@
#include "postgres.h"
+#include <sys/stat.h>
+
+#include "lib/stringinfo.h"
+#include "access/parallel.h"
#include "access/undolog.h"
+#include "access/twophase_rmgr.h"
+#include "access/twophase.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xloginsert.h"
+#include "lib/dshash.h"
+#include "miscadmin.h"
+#include "storage/fd.h"
+#include "storage/procarray.h"
+#include "utils/memutils.h"
+
+
+#define ULOG_FILE_MAGIC 0x474f4c55 /* 'ULOG' as little-endian bytes */
+
+/* Resource manager definition */
+typedef struct RmgrUndoData
+{
+ const char *rm_name;
+ void (*rm_undo) (UndoLogRecord *record, ULogContext cxt, bool redo,
+ bool crashed);
+ void (*rm_undo_event) (ULogEvent event);
+} RmgrUndoData;
+
+#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode,undo,undo_desc,undo_identify,undo_event) \
+ { name, undo, undo_event },
+
+static RmgrUndoData RmgrUndo[RM_MAX_ID + 1] = {
+#include "access/rmgrlist.h"
+};
+#undef PG_RMGR
+
+/*
+ * Undo log shared data.
+ *
+ * Undo log records are first stored in a fixed number of fixed-length slots,
+ * each placed in shared memory and keyed by top transaction IDs. When a slot
+ * becomes full, its data is flushed to a file named after the full transaction
+ * ID, and the slot then stores the subsequent part. During each checkpoint,
+ * all slots are flushed to disk then emptied. If a new transaction arrives and
+ * no slot is available, the slot with the oldest transaction ID is
+ * evicted. Although the number of slots and the buffer length could be made
+ * configurable, they are currently fixed.
+ */
+#define ULOG_SLOT_NUM 32
+#define ULOG_SLOT_BUF_LEN 1024
+
+typedef struct UndoLogSlot
+{
+ LWLock lock;
+ FullTransactionId xid;
+ off_t off;
+ int len;
+ uint8 buf[ULOG_SLOT_BUF_LEN];
+} UndoLogSlot;
+
+/* struct for data stored in shared memory */
+typedef struct ULogSharedData
+{
+ UndoLogSlot slots[ULOG_SLOT_NUM];
+} ULogSharedData;
+
+/*
+ * Struct for top-level management local variables.
+ *
+ * Stored in local memory. current_slot is the ID of the slot (described later)
+ * that this backend currently considers itself to be using. current_xid is the
+ * transaction ID that the slot should hold. Since another process can steal a
+ * slot in use, current_xid is used to verify that the slot pointed to by
+ * current_slot is indeed the one this process is using. buf holds a memory
+ * area of length buflen for various purposes in this module.
+ */
+typedef struct ULogLocalData
+{
+ MemoryContext cxt; /* working memory context */
+ int current_slot; /* slot id currently used by me */
+ FullTransactionId current_xid; /* the current xid */
+ ULogSharedData *shared_area; /* shared memory area */
+ void *buf; /* working buffer */
+ int buflen; /* length of the buffer */
+} ULogLocalData;
+
+static ULogLocalData ULogLocal;
+
+/* short cut macros */
+#define UndoLogContext (ULogLocal.cxt)
+#define ULogShared (ULogLocal.shared_area)
+
+/*
+ * Shared memory initializer functions
+ */
+Size
+UndoLogShmemSize(void)
+{
+ return MAXALIGN(sizeof(ULogSharedData));
+}
+
+void
+UndoLogShmemInit(void)
+{
+ bool found;
+
+ ULogShared = (ULogSharedData *) ShmemInitStruct("UNDO Log Data",
+ UndoLogShmemSize(),
+ &found);
+ if (!found)
+ {
+ /* initialize all slots */
+ for (int i = 0 ; i < ULOG_SLOT_NUM ; i++)
+ {
+ UndoLogSlot *slot = &ULogShared->slots[i];
+ slot->xid = InvalidFullTransactionId;
+ LWLockInitialize(&slot->lock, LWTRANCHE_UNDOLOG_DATA);
+ }
+ }
+}
+
+/*
+ * InitUndoLog() - initialize undo log system
+ */
+void
+InitUndoLog(void)
+{
+ /* shouldn't be called from postmaster */
+ Assert(IsUnderPostmaster || !IsPostmasterEnvironment);
+
+ ULogLocal.cxt = AllocSetContextCreate(TopMemoryContext,
+ "Undo log system",
+ ALLOCSET_DEFAULT_SIZES);
+ ULogLocal.current_slot = -1;
+ ULogLocal.current_xid = InvalidFullTransactionId;
+
+ /*
+ * While this buffer could be made flexible in size, a fixed-size buffer is
+ * allocated to avoid pallocs within critical sections.
+ */
+ ULogLocal.buflen = 1024;
+ ULogLocal.buf = MemoryContextAlloc(UndoLogContext, ULogLocal.buflen);
+}
+
+/*
+ * undolog_ensure_buffer()
+ *
+ * Asserts that the data buffer in ULogLocal can hold the specified size.
+ */
+static void *
+undolog_ensure_buffer(Size size)
+{
+ /*
+ * The buffer is not reallocated for the reasons mentioned above. It is
+ * unlikely that the current buffer size will be insufficient, but a
+ * mechanism to determine the maximum required buffer size for the entire
+ * system in advance may be necessary.
+ */
+ Assert (size <= ULogLocal.buflen);
+
+ return ULogLocal.buf;
+}
+
+/*
+ * undolog_file_exists()
+ *
+ * Checks for the existence of the file corresponding to the specified xid.
+ */
+static bool
+undolog_file_exists(FullTransactionId xid)
+{
+ char fname[MAXPGPATH];
+ struct stat statbuf;
+
+ UndoLogSetFilename(fname, xid);
+
+ if (stat(fname, &statbuf) < 0)
+ {
+ if (errno == ENOENT)
+ return false;
+
+ ereport(FATAL,
+ errcode_for_file_access(),
+ errmsg("stat failed for undo file \"%s\": %m", fname));
+ }
+
+ return true;
+}
+
+/*
+ * undolog_file_size()
+ *
+ * Returns the size of the file corresponding to the specified xid.
+ * If the file does not exist, returns 0.
+ */
+static Size
+undolog_file_size(FullTransactionId xid)
+{
+ char fname[MAXPGPATH];
+ struct stat sbuf;
+ int fd;
+
+ UndoLogSetFilename(fname, xid);
+ fd = BasicOpenFile(fname, PG_BINARY | O_RDONLY);
+
+ if (fd < 0)
+ {
+ if (errno == ENOENT)
+ return 0;
+
+ ereport(FATAL,
+ errcode_for_file_access(),
+ errmsg("failed to open ulog file \"%s\": %m", fname));
+ }
+
+ if (fstat(fd, &sbuf) < 0)
+ ereport(FATAL,
+ errcode_for_file_access(),
+ errmsg("failed to stat ulog file \"%s\": %m", fname));
+
+ close(fd);
+
+ return sbuf.st_size;
+}
+
+/*
+ * undolog_flush_slot - flush a slot into file
+ *
+ * Flushes the slot data to the undo log file, then clears the slot.
+ * The caller must ensure that the slot is not modified while this function is
+ * executing.
+ * Creates the file if it does not already exist.
+ * If keep is true, the emptied slot remains assigned to the previous xid.
+ */
+static void
+undolog_flush_slot(UndoLogSlot *slot, bool keep)
+{
+ char fname[MAXPGPATH];
+ int fd;
+ int ret;
+
+ /* return if no data */
+ if (slot->len == 0)
+ return;
+
+ UndoLogSetFilename(fname, slot->xid);
+
+ fd = BasicOpenFile(fname, PG_BINARY | O_WRONLY | O_CREAT);
+ if (fd < 0)
+ ereport(FATAL,
+ errcode_for_file_access(),
+ errmsg("failed to open or create undo file \"%s\": %m", fname));
+
+ /*
+ * Write the file header if this is the first time the undo log is
+ * being written. The slot buffer doesn't include the header part, so we
+ * write it manually. The header write has already been WAL-logged in
+ * undolog_find_slot().
+ */
+ if (slot->off == sizeof(UndoLogFileHeader))
+ {
+ UndoLogFileHeader fheader;
+ Size len = sizeof(fheader);
+
+ fheader.magic = ULOG_FILE_MAGIC;
+ fheader.crashed = false;
+ ret = pg_pwrite(fd, &fheader, len, 0);
+ if (ret != len)
+ ereport(FATAL,
+ errcode_for_file_access(),
+ errmsg("failed to write to undo file \"%s\": %m", fname));
+ }
+
+ ret = pg_pwrite(fd, slot->buf, slot->len, slot->off);
+
+ if (ret != slot->len)
+ ereport(FATAL,
+ errcode_for_file_access(),
+ errmsg("failed to write to undo file \"%s\": %m", fname));
+
+ close(fd);
+
+ /* clear the slot buffer */
+ slot->off += slot->len;
+ slot->len = 0;
+
+ /* release this slot */
+ if (!keep)
+ slot->xid = InvalidFullTransactionId;
+
+ return;
+}
+
+/*
+ * undolog_find_slot - returns a slot for the specified top-xid
+ *
+ * If acquire is true, acquires a slot and creates a new undo log if this is
+ * the first call for the xid. Otherwise, returns NULL if no slot is found for
+ * the xid.
+ *
+ * The returned slot is exclusively locked.
+ */
+static UndoLogSlot *
+undolog_find_slot(FullTransactionId xid, bool acquire)
+{
+ UndoLogSlot *slot;
+ int slot_to_use;
+
+ Assert(FullTransactionIdIsValid(xid));
+
+ /* fast path for currently active slot */
+ if (ULogLocal.current_slot > -1)
+ {
+ slot = &ULogShared->slots[ULogLocal.current_slot];
+
+ LWLockAcquire(&slot->lock, LW_EXCLUSIVE);
+
+ /* return it if the slot has not been stolen by another transaction */
+ if (FullTransactionIdEquals(slot->xid, xid))
+ return slot;
+
+ LWLockRelease(&slot->lock);
+
+ /*
+ * No other processes are expected to acquire a slot for this xid.
+ * Continue searching for an available slot.
+ */
+ ULogLocal.current_slot = -1;
+ ULogLocal.current_xid = InvalidFullTransactionId;
+ }
+
+ /* no active slot found; return NULL if not set to acquire a new one */
+ if (!acquire)
+ return NULL;
+
+ /* Search for an invalid slot or the slot with the oldest xid. */
+ slot_to_use = -1;
+ for (int i = 0 ; i < ULOG_SLOT_NUM ; i++)
+ {
+ slot = &ULogShared->slots[i];
+
+ LWLockAcquire(&slot->lock, LW_EXCLUSIVE);
+
+ /* slot for this xid should not exist */
+ Assert(!FullTransactionIdEquals(slot->xid, xid));
+
+ /* use invalid slot unconditionally */
+ if (!FullTransactionIdIsValid(slot->xid))
+ {
+ /* Replace the slot to use. Release the previous lock if any. */
+ if (slot_to_use >= 0)
+ LWLockRelease(&ULogShared->slots[slot_to_use].lock);
+
+ slot_to_use = i;
+ break;
+ }
+
+ /* determine the oldest slot */
+ if (slot_to_use < 0)
+ slot_to_use = i;
+ else if (FullTransactionIdPrecedes(slot->xid,
+ ULogShared->slots[slot_to_use].xid))
+ {
+ /* Replace the slot to use. Release the previous lock if any. */
+ if (slot_to_use >= 0)
+ LWLockRelease(&ULogShared->slots[slot_to_use].lock);
+
+ slot_to_use = i;
+ }
+ else
+ LWLockRelease(&slot->lock);
+ }
+
+ Assert(slot_to_use >= 0);
+ ULogLocal.current_slot = slot_to_use;
+ slot = &ULogShared->slots[slot_to_use];
+
+ /* flush the buffered data if any */
+ if (FullTransactionIdIsValid(slot->xid))
+ undolog_flush_slot(slot, false);
+
+ /*
+ * A partially written file may exist for this xid. In that case, set the
+ * offset based on the file size.
+ */
+ ULogLocal.current_xid = slot->xid = xid;
+ slot->off = undolog_file_size(xid);
+ slot->len = 0;
+
+ if (slot->off == 0)
+ {
+ /*
+ * This is the first time the undo log is being written. Emit WAL
+ * records for the creation of the file and a write for the header
+ * part. We don't waste slot space for the header part. It will be
+ * written by undolog_flush_slot().
+ */
+ xl_ulog_create crec;
+ xl_ulog_write *wrec;
+ UndoLogFileHeader *fheader;
+ Size bodylen;
+ Size wreclen;
+ XLogRecPtr recptr;
+
+ crec.xid = xid;
+ XLogBeginInsert();
+ XLogRegisterData((char *) &crec, sizeof(crec));
+ (void) XLogInsert(RM_ULOG_ID, XLOG_ULOG_CREATE);
+
+ bodylen = sizeof(UndoLogFileHeader);
+ wreclen = sizeof(xl_ulog_write) + bodylen;
+ wrec = undolog_ensure_buffer(wreclen);
+
+ wrec->topxid = xid;
+ wrec->subxid = InvalidFullTransactionId;
+ wrec->off = 0;
+ wrec->len = bodylen;
+ fheader = (UndoLogFileHeader *) &wrec->bytes;
+ fheader->magic = ULOG_FILE_MAGIC;
+ fheader->crashed = false;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) wrec, wreclen);
+ recptr = XLogInsert(RM_ULOG_ID, XLOG_ULOG_WRITE);
+ XLogFlush(recptr);
+
+ /* adjust file offset */
+ slot->off = bodylen;
+ }
+
+ return slot;
+}
+
+/*
+ * undolog_remove_file() - Removes a file specified by full transaction ID.
+ *
+ * The file must already have been closed.
+ */
+static void
+undolog_remove_file(FullTransactionId xid)
+{
+ char fname[MAXPGPATH];
+
+ UndoLogSetFilename(fname, xid);
+
+ durable_unlink(fname, FATAL);
+}
+
+/*
+ * undolog_mark_xid_as_crashed() - Mark an undo log file as "crashed"
+ *
+ * When executing an undo log file, this attribute will be passed to the "undo"
+ * rmgr callback functions.
+ */
+static void
+undolog_mark_xid_as_crashed(FullTransactionId xid)
+{
+ char fname[MAXPGPATH];
+ int fd;
+ UndoLogFileHeader fheader;
+
+ /* no slot should exist for this xid */
+ Assert(undolog_find_slot(xid, false) == NULL);
+
+ UndoLogSetFilename(fname, xid);
+ fd = BasicOpenFile(fname, PG_BINARY | O_RDWR);
+ if (fd < 0)
+ {
+ if (errno == ENOENT)
+ return;
+
+ ereport(FATAL,
+ errcode_for_file_access(),
+ errmsg("failed to open undo file \"%s\": %m", fname));
+ }
+
+ if (read(fd, &fheader, sizeof(fheader)) < sizeof(fheader))
+ ereport(FATAL,
+ errcode_for_file_access(),
+ errmsg("failed to read undo log file \"%s\": %m", fname));
+
+ if (fheader.magic != ULOG_FILE_MAGIC)
+ ereport(FATAL,
+ errcode_for_file_access(),
+ errmsg("magic does not match for undo log file \"%s\"", fname));
+
+ fheader.crashed = true;
+
+ if (pg_pwrite(fd, &fheader, sizeof(fheader), 0) != sizeof(fheader))
+ ereport(FATAL,
+ errcode_for_file_access(),
+ errmsg("failed to write to undo file \"%s\": %m", fname));
+ close(fd);
+}
+
+/*
+ * undolog_drop_ulog() - Release slot then remove file if any.
+ */
+static void
+undolog_drop_ulog(FullTransactionId xid)
+{
+ UndoLogSlot *slot;
+
+ Assert(FullTransactionIdIsValid(xid));
+
+ /*
+ * If the shortcut is dangling, it means our slot has been stolen, and no
+ * slot is currently associated with our XID. In this case, the contents
+ * have already been written to the corresponding file.
+ */
+ if (ULogLocal.current_slot >= 0)
+ {
+ slot = &ULogShared->slots[ULogLocal.current_slot];
+
+ LWLockAcquire(&slot->lock, LW_EXCLUSIVE);
+
+ if (FullTransactionIdEquals(slot->xid, xid))
+ slot->xid = InvalidFullTransactionId;
+
+ LWLockRelease(&slot->lock);
+
+ ULogLocal.current_slot = -1;
+ ULogLocal.current_xid = InvalidFullTransactionId;
+ }
+
+ Assert(!FullTransactionIdIsValid(ULogLocal.current_xid));
+
+ /* remove the file if any */
+ if (undolog_file_exists(xid))
+ undolog_remove_file(xid);
+}
+
+
+/*
+ * UndoLogWrite() - Writes an undolog record
+ *
+ * This write is WAL-logged.
+ */
+void
+UndoLogWrite(RmgrId rmgr, uint8 info, void *data, int len)
+{
+ FullTransactionId topxid;
+ FullTransactionId subxid;
+ int reclen = sizeof(UndoLogRecord) + len;
+ int wreclen = sizeof(xl_ulog_write) + reclen;
+ xl_ulog_write *wrec;
+ UndoLogRecord *rec;
+ XLogRecPtr recptr;
+ UndoLogSlot *slot;
+
+ Assert(!RecoveryInProgress());
+ Assert(!IsParallelWorker());
+
+ if (!IsUnderPostmaster)
+ return;
+
+ /*
+ * The following lines may assign new transaction IDs. While this is
+ * somewhat clumsy, the caller needs to assign them soon.
+ */
+ topxid = GetTopFullTransactionId();
+ subxid = GetCurrentFullTransactionId();
+
+ /* the caller may set rmgr bits only */
+ Assert((info & ~ULR_RMGR_INFO_MASK) == 0);
+
+ /*
+ * Since this call uses the common buffer returned by
+ * undolog_ensure_buffer(), which is also used immediately below, it must
+ * be placed before the buffer is used there.
+ */
+ slot = undolog_find_slot(topxid, true);
+
+ /* build undo record as a part of WAL record to avoid copying */
+ wrec = undolog_ensure_buffer(wreclen);
+ rec = (UndoLogRecord *) &wrec->bytes;
+ rec->ul_tot_len = reclen;
+ rec->ul_rmid = rmgr;
+ rec->ul_info = info;
+ rec->ul_xid = subxid;
+
+ if (len > 0)
+ memcpy((char *)rec + sizeof(UndoLogRecord), data, len);
+
+ /*
+ * Write an XLOG record for this undo log record. It is crucial to flush
+ * immediately, as this record is needed to cancel the action taken right
+ * after if this transaction crashes before the commit.
+ */
+ wrec->topxid = topxid;
+ wrec->subxid = subxid;
+ wrec->off = slot->off + slot->len;
+ wrec->len = reclen;
+ XLogBeginInsert();
+ XLogRegisterData((char *) wrec, wreclen);
+ recptr = XLogInsert(RM_ULOG_ID, XLOG_ULOG_WRITE);
+ XLogFlush(recptr);
+
+ /* flush if the slot is about to overflow */
+ if (slot->len + reclen > ULOG_SLOT_BUF_LEN)
+ undolog_flush_slot(slot, true);
+
+ /* append the record to the slot */
+ Assert (slot->len + reclen <= ULOG_SLOT_BUF_LEN);
+ memcpy(slot->buf + slot->len, rec, reclen);
+ slot->len += reclen;
+
+ LWLockRelease(&slot->lock);
+}
+
+/* Helper routine for calling event callbacks */
+static void
+undolog_event_call(ULogEvent event)
+{
+ for (int rmid = 0; rmid <= RM_MAX_ID; rmid++)
+ {
+ if (RmgrUndo[rmid].rm_name == NULL)
+ continue;
+
+ if (RmgrUndo[rmid].rm_undo_event != NULL)
+ RmgrUndo[rmid].rm_undo_event(event);
+ }
+}
+
+/*
+ * undolog_process_ulog() - Processes an undo log file.
+ *
+ * 'cxt' specifies the operation context. redo is true during recovery.
+ *
+ * XXX: While redo is determined solely by cxt, the two parameters are
+ * currently provided separately.
+ */
+static void
+undolog_process_ulog(char *fname, ULogContext cxt, bool redo)
+{
+ int fd;
+ struct stat sb;
+ char *buf;
+ char *p;
+ char *endptr;
+ UndoLogFileHeader *phead;
+
+ fd = BasicOpenFile(fname, PG_BINARY | O_RDONLY);
+ if (stat(fname, &sb) < 0)
+ ereport(FATAL,
+ errcode_for_file_access(),
+ errmsg("could not stat undo log file \"%s\": %m", fname));
+ buf = palloc(sb.st_size);
+ if (read(fd, buf, sb.st_size) < sb.st_size)
+ ereport(FATAL,
+ errcode_for_file_access(),
+ errmsg("failed to read undo log file \"%s\": %m", fname));
+ close(fd);
+
+ p = buf;
+ endptr = buf + sb.st_size;
+ phead = (UndoLogFileHeader *) p;
+ p += sizeof(*phead);
+ if (phead->magic != ULOG_FILE_MAGIC)
+ ereport(FATAL,
+ errcode_for_file_access(),
+ errmsg("magic does not match for undo log file \"%s\"", fname));
+
+ while (p < endptr)
+ {
+ UndoLogRecord *rec = (UndoLogRecord *)p;
+
+ RmgrUndo[rec->ul_rmid].rm_undo(rec, cxt, redo, phead->crashed);
+
+ p += rec->ul_tot_len;
+ }
+ pfree(buf);
+
+ /* invoke end-of-xact callbacks */
+ undolog_event_call(ULOGEVENT_XACTEND);
+}
+
+/*
+ * UndoLog_UndoByXid()
+ *
+ * Processes undo logs for the specified transactions, used when finishing
+ * prepared transactions or during commits and aborts in recovery.
+ *
+ * children is the list of subtransaction IDs of the xid, with a length of
+ * nchildren.
+ */
+void
+UndoLog_UndoByXid(bool isCommit, TransactionId xid,
+ int nchildren, TransactionId *children)
+{
+ uint32 nextepoch;
+ TransactionId nextxid;
+ uint32 epoch;
+ FullTransactionId fxid;
+ UndoLogSlot *slot;
+
+ nextepoch = EpochFromFullTransactionId(TransamVariables->nextXid);
+ nextxid = XidFromFullTransactionId(TransamVariables->nextXid);
+
+ /* Adjust epoch, if needed. */
+ if (xid <= nextxid)
+ epoch = nextepoch;
+ else
+ epoch = nextepoch - 1;
+
+ /* process undo logs */
+ fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
+
+ slot = undolog_find_slot(fxid, false);
+ if (slot)
+ {
+ undolog_flush_slot(slot, false);
+ LWLockRelease(&slot->lock);
+ }
+
+ if (undolog_file_exists(fxid))
+ {
+ char fname[MAXPGPATH];
+ ULogContext cxt;
+
+ if (isCommit)
+ cxt = ULOGCXT_COMMIT;
+ else
+ cxt = ULOGCXT_ABORT;
+
+ UndoLogSetFilename(fname, fxid);
+ undolog_process_ulog(fname, cxt, false);
+ undolog_drop_ulog(fxid);
+ }
+}
+
+/*
+ * AtEOXact_UndoLog() - At end-of-xact processing of undo logs.
+ *
+ * Cleans up any undo logs emitted by the transaction, if present.
+ * During normal operation, the caller should pass InvalidTransactionId as xid.
+ * During recovery, it should pass the target transaction ID.
+ */
+void
+AtEOXact_UndoLog(TransactionId xid)
+{
+ FullTransactionId fxid = ULogLocal.current_xid;
+
+ if (TransactionIdIsValid(xid))
+ {
+ FullTransactionId next_fxid;
+ TransactionId oldest_xid;
+ TransactionId next_xid;
+ uint32 oldest_epoch;
+
+ LWLockAcquire(XactTruncationLock, LW_SHARED);
+ next_fxid = TransamVariables->nextXid;
+ oldest_xid = TransamVariables->oldestClogXid;
+ LWLockRelease(XactTruncationLock);
+
+ /* Generate full xid for oldest_xid based on next_fxid */
+ next_xid = XidFromFullTransactionId(next_fxid);
+ oldest_epoch = EpochFromFullTransactionId(next_fxid);
+
+ /* adjust epoch for oldest xid */
+ if (oldest_xid > next_xid)
+ oldest_epoch--;
+
+ fxid = FullTransactionIdFromEpochAndXid(oldest_epoch, xid);
+ }
+
+ if (FullTransactionIdIsValid(fxid))
+ undolog_drop_ulog(fxid);
+}
+
+/*
+ * AtPrepare_UndoLog()
+ *
+ * Flushes the undo log slot of the current transaction, if one is in use, and
+ * releases it. The file is removed when the prepared transaction is finished.
+ */
+void
+AtPrepare_UndoLog(void)
+{
+ FullTransactionId xid = GetTopFullTransactionId();
+ UndoLogSlot *slot = undolog_find_slot(xid, false);
+
+ if (slot)
+ {
+ undolog_flush_slot(slot, false);
+ LWLockRelease(&slot->lock);
+ }
+}
+
+static void
+undolog_cleanup_init(void)
+{
+ undolog_event_call(ULOGEVENT_CLEANUP_INIT);
+}
+
+void
+UndoLogRecoveryEnd(void)
+{
+ undolog_event_call(ULOGEVENT_RECOVERY_END);
+}
+
+/*
+ * UndoLogCleanup() - On-recovery cleanup of undo log
+ *
+ * This function is called after ULOG file consistency is established, either
+ * when recovery reaches consistency or after recovery finishes if hot standby
+ * is not active.
+ */
+void
+UndoLogCleanup(bool end_of_recovery)
+{
+ DIR *dirdesc;
+ struct dirent *de;
+ char fname[MAXPGPATH];
+ char *p;
+
+ undolog_cleanup_init();
+
+ /*
+ * flush all in-memory undo logs, no need for locking since we're the only
+ * process working on this array.
+ */
+ for (int i = 0 ; i < ULOG_SLOT_NUM ; i++)
+ {
+ UndoLogSlot *slot = &ULogShared->slots[i];
+
+ if (FullTransactionIdIsValid(slot->xid))
+ undolog_flush_slot(slot, false);
+
+ slot->xid = InvalidFullTransactionId;
+ }
+
+ snprintf(fname, MAXPGPATH, "%s/", UNDOLOG_DIR);
+ p = fname + strlen(fname);
+
+ /* scan through all undo log files */
+ dirdesc = AllocateDir(UNDOLOG_DIR);
+ while ((de = ReadDir(dirdesc, UNDOLOG_DIR)) != NULL)
+ {
+ FullTransactionId log_fxid;
+ TransactionId log_xid;
+ FullTransactionId next_fxid;
+ FullTransactionId oldest_fxid;
+ TransactionId oldest_xid;
+ TransactionId next_xid;
+ uint32 oldest_epoch;
+ ULogContext cxt;
+ bool xact_prepared;
+
+ if (strlen(de->d_name) != 16 ||
+ strspn(de->d_name, "0123456789abcdef") < strlen(de->d_name))
+ continue;
+
+ strncpy(p, de->d_name, 32);
+
+ /*
+ * Make sure the log's xid is valid.
+ */
+ log_fxid = FullTransactionIdFromU64(strtou64(de->d_name, NULL, 16));
+ log_xid = XidFromFullTransactionId(log_fxid);
+
+ LWLockAcquire(XactTruncationLock, LW_SHARED);
+ next_fxid = TransamVariables->nextXid;
+ oldest_xid = TransamVariables->oldestClogXid;
+ LWLockRelease(XactTruncationLock);
+
+ /* Generate full xid for oldest_xid based on next_fxid */
+ next_xid = XidFromFullTransactionId(next_fxid);
+ oldest_epoch = EpochFromFullTransactionId(next_fxid);
+
+ /* adjust epoch for oldest xid */
+ if (oldest_xid > next_xid)
+ oldest_epoch--;
+
+ oldest_fxid =
+ FullTransactionIdFromEpochAndXid(oldest_epoch, oldest_xid);
+
+ /* check the ulog xid */
+ if (FullTransactionIdPrecedes(log_fxid, oldest_fxid))
+ ereport(FATAL,
+ errcode_for_file_access(),
+ errmsg("undolog found for too-old transaction %llu",
+ (long long unsigned int) U64FromFullTransactionId(log_fxid)));
+
+ /* All transactions with undo log must be in-progress. */
+ if (!TransactionIdIsInProgress(log_xid))
+ ereport(FATAL,
+ errcode_for_file_access(),
+ errmsg("undolog found for non-active transaction: %llu",
+ (long long unsigned int) U64FromFullTransactionId(log_fxid)));
+
+
+ /*
+ * Let undo routines perform cleanup tasks with appropriate
+ * assumptions. If the transaction is prepared or when recovery is
+ * reaching consistency, assume it is active; otherwise, perform abort
+ * cleanups.
+ */
+ xact_prepared = TwoPhaseXidExists(log_xid, false);
+
+ if (!end_of_recovery || xact_prepared)
+ cxt = ULOGCXT_PREPARED;
+ else
+ cxt = ULOGCXT_CLEANUP;
+
+ undolog_process_ulog(fname, cxt, true);
+
+ if (undolog_file_exists(log_fxid))
+ {
+ if (!xact_prepared)
+ undolog_remove_file(log_fxid);
+ else
+ undolog_mark_xid_as_crashed(log_fxid);
+ }
+ }
+ FreeDir(dirdesc);
+}
/*
* undollg_redo()
@@ -31,8 +947,81 @@ undolog_redo(XLogReaderState *record)
if (info == XLOG_ULOG_CREATE)
{
+ xl_ulog_create *rec = (xl_ulog_create *) XLogRecGetData(record);
+ char fname[MAXPGPATH];
+ int fd;
+
+ /*
+ * We don't check for the existence of the log. Although the log should
+ * not be found in a consistent state, it may appear during the
+ * inconsistent period in recovery.
+ */
+ UndoLogSetFilename(fname, rec->xid);
+ fd = BasicOpenFile(fname, PG_BINARY | O_WRONLY | O_CREAT);
+ if (fd < 0)
+ ereport(FATAL,
+ errcode_for_file_access(),
+ errmsg("failed to open or create undo file \"%s\": %m", fname));
+ close(fd);
}
else if (info == XLOG_ULOG_WRITE)
{
+ char fname[MAXPGPATH];
+ int fd;
+ xl_ulog_write *rec = (xl_ulog_write *) XLogRecGetData(record);
+ ssize_t ret;
+
+ UndoLogSetFilename(fname, rec->topxid);
+ fd = BasicOpenFile(fname, PG_BINARY | O_WRONLY);
+ if (fd < 0)
+ ereport(FATAL,
+ errcode_for_file_access(),
+ errmsg("failed to open undo file \"%s\": %m", fname));
+ ret = pg_pwrite(fd, rec->bytes, rec->len, rec->off);
+ if (ret != rec->len)
+ ereport(FATAL,
+ errcode_for_file_access(),
+ errmsg("failed to write to undo file \"%s\": %m", fname));
+
+ close(fd);
}
}
+
+/*
+ * CheckPointUndoLog
+ *
+ * This is called during a checkpoint. It must ensure that any undo log writes
+ * that were WAL-logged before the start of the checkpoint are safely flushed
+ * to disk, so that neither the files nor their contents depend on WAL that
+ * precedes the checkpoint start.
+ */
+void
+CheckPointUndoLog(void)
+{
+ bool written = false;
+
+ if (!IsUnderPostmaster)
+ return;
+
+ for (int i = 0 ; i < ULOG_SLOT_NUM ; i++)
+ {
+ UndoLogSlot *slot = &ULogShared->slots[i];
+
+ LWLockAcquire(&slot->lock, LW_EXCLUSIVE);
+
+ if (FullTransactionIdIsValid(slot->xid))
+ {
+ undolog_flush_slot(slot, true);
+ written = true;
+ }
+
+ slot->xid = InvalidFullTransactionId;
+
+ LWLockRelease(&slot->lock);
+ }
+
+
+ /* Sync the directory if any files have been written to it. */
+ if (written)
+ fsync_fname(UNDOLOG_DIR, true);
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 3ebd7c40418..8b383b424a1 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -26,6 +26,7 @@
#include "access/subtrans.h"
#include "access/transam.h"
#include "access/twophase.h"
+#include "access/undolog.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
@@ -2431,6 +2432,7 @@ CommitTransaction(void)
AtEOXact_MultiXact();
+
ResourceOwnerRelease(TopTransactionResourceOwner,
RESOURCE_RELEASE_LOCKS,
true, true);
@@ -2466,6 +2468,7 @@ CommitTransaction(void)
AtEOXact_on_commit_actions(true);
AtEOXact_Namespace(true, is_parallel_worker);
AtEOXact_SMgr();
+ AtEOXact_UndoLog(InvalidTransactionId);
AtEOXact_Files(true);
AtEOXact_ComboCid();
AtEOXact_HashTables(true);
@@ -2669,6 +2672,7 @@ PrepareTransaction(void)
AtPrepare_PgStat();
AtPrepare_MultiXact();
AtPrepare_RelationMap();
+ AtPrepare_UndoLog();
/*
* Here is where we really truly prepare.
@@ -2965,6 +2969,7 @@ AbortTransaction(void)
AtEOXact_TypeCache();
AtEOXact_Inval(false);
AtEOXact_MultiXact();
+
ResourceOwnerRelease(TopTransactionResourceOwner,
RESOURCE_RELEASE_LOCKS,
false, true);
@@ -2979,6 +2984,7 @@ AbortTransaction(void)
AtEOXact_on_commit_actions(false);
AtEOXact_Namespace(false, is_parallel_worker);
AtEOXact_SMgr();
+ AtEOXact_UndoLog(InvalidTransactionId);
AtEOXact_Files(false);
AtEOXact_ComboCid();
AtEOXact_HashTables(false);
@@ -6227,6 +6233,8 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
DropRelationFiles(parsed->xlocators, parsed->nrels, true);
}
+ AtEOXact_UndoLog(xid);
+
if (parsed->nstats > 0)
{
/* see equivalent call for relations above */
@@ -6338,6 +6346,8 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
DropRelationFiles(parsed->xlocators, parsed->nrels, true);
}
+ AtEOXact_UndoLog(xid);
+
if (parsed->nstats > 0)
{
/* see equivalent call for relations above */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 62767a4a2b9..81ec5510f09 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -55,6 +55,7 @@
#include "access/timeline.h"
#include "access/transam.h"
#include "access/twophase.h"
+#include "access/undolog.h"
#include "access/xact.h"
#include "access/xlog_internal.h"
#include "access/xlogarchive.h"
@@ -5910,13 +5911,19 @@ StartupXLOG(void)
if (InRecovery)
{
/*
- * Clean up unlogged relations if not already done. If consistency has
- * been established, this cleanup would have occurred when entering hot
- * standby mode (see CheckRecoveryConsistency for details).
+ * If consistency has not been established, process undo log files to
+ * clean up storage files from unfinished transactions and clean up
+ * unlogged relations. (See CheckRecoveryConsistency for details.)
*/
if (!reachedConsistency)
+ {
+ UndoLogCleanup(true);
ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP);
+ }
+
ResetUnloggedRelations(UNLOGGED_RELATION_INIT);
+
+ UndoLogRecoveryEnd();
}
/*
@@ -7515,6 +7522,7 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
CheckPointSnapBuild();
CheckPointLogicalRewriteHeap();
CheckPointReplicationOrigin();
+ CheckPointUndoLog();
/* Write out all dirty data in SLRUs and the main buffer pool */
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 5ceebce5a19..730804340b9 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -33,6 +33,7 @@
#include "access/timeline.h"
#include "access/transam.h"
+#include "access/undolog.h"
#include "access/xact.h"
#include "access/xlog_internal.h"
#include "access/xlogarchive.h"
@@ -2267,6 +2268,7 @@ CheckRecoveryConsistency(void)
* backends don't try to read whatever garbage is left over from
* before.
*/
+ UndoLogCleanup(false);
ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP);
SpinLockAcquire(&XLogRecoveryCtl->info_lck);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 7783ba854fc..2c48a36a25a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
#include "access/syncscan.h"
#include "access/transam.h"
#include "access/twophase.h"
+#include "access/undolog.h"
#include "access/xlogprefetcher.h"
#include "access/xlogrecovery.h"
#include "commands/async.h"
@@ -148,6 +149,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, WaitEventCustomShmemSize());
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
+ size = add_size(size, UndoLogShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -285,6 +287,7 @@ CreateOrAttachShmemStructs(void)
XLogPrefetchShmemInit();
XLogRecoveryShmemInit();
CLOGShmemInit();
+ UndoLogShmemInit();
CommitTsShmemInit();
SUBTRANSShmemInit();
MultiXactShmemInit();
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 9cf3e4f4f3a..e369309c5d7 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -166,6 +166,9 @@ static const char *const BuiltinTrancheNames[] = {
[LWTRANCHE_SUBTRANS_SLRU] = "SubtransSLRU",
[LWTRANCHE_XACT_SLRU] = "XactSLRU",
[LWTRANCHE_PARALLEL_VACUUM_DSA] = "ParallelVacuumDSA",
+ [LWTRANCHE_UNDOLOG_DSA] = "UndoLogDSA",
+ [LWTRANCHE_UNDOLOG_HASH] = "UndoLogHash",
+ [LWTRANCHE_UNDOLOG_DATA] = "UndoLogData",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 16144c2b72d..c069f5304b5 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -345,6 +345,7 @@ WALSummarizer "Waiting to read or update WAL summarization state."
DSMRegistry "Waiting to read or update the dynamic shared memory registry."
InjectionPoint "Waiting to read or update information related to injection points."
SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state."
+UndoLog "Waiting to read or update shared UNDO log state."
#
# END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 770ab6906e7..e033faf1da2 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -25,6 +25,7 @@
#include "access/parallel.h"
#include "access/session.h"
#include "access/tableam.h"
+#include "access/undolog.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
@@ -641,6 +642,9 @@ BaseInit(void)
*/
InitXLogInsert();
+ /* Initialize undo log system */
+ InitUndoLog();
+
/* Initialize lock manager's local structs */
InitLockManagerAccess();
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 1d012b255ac..104be43eb9b 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -307,6 +307,7 @@ void setup_signals(void);
void setup_text_search(void);
void create_data_directory(void);
void create_xlog_or_symlink(void);
+void create_ulog(void);
void warn_on_mount_point(int error);
void initialize_data_directory(void);
@@ -3012,6 +3013,21 @@ create_xlog_or_symlink(void)
free(subdirloc);
}
+/* Create undo log directory */
+void
+create_ulog(void)
+{
+ char *subdirloc;
+
+ /* form name of the place for the subdirectory */
+ subdirloc = psprintf("%s/pg_ulog", pg_data);
+
+ if (mkdir(subdirloc, pg_dir_create_mode) < 0)
+ pg_fatal("could not create directory \"%s\": %m",
+ subdirloc);
+
+ free(subdirloc);
+}
void
warn_on_mount_point(int error)
@@ -3046,6 +3062,7 @@ initialize_data_directory(void)
create_data_directory();
create_xlog_or_symlink();
+ create_ulog();
/* Create required subdirectories (other than pg_wal) */
printf(_("creating subdirectories ... "));
diff --git a/src/bin/pg_waldump/t/001_basic.pl b/src/bin/pg_waldump/t/001_basic.pl
index 578e4731394..09396f065fa 100644
--- a/src/bin/pg_waldump/t/001_basic.pl
+++ b/src/bin/pg_waldump/t/001_basic.pl
@@ -73,7 +73,8 @@ BRIN
CommitTs
ReplicationOrigin
Generic
-LogicalMessage$/,
+LogicalMessage
+UndoLog$/,
'rmgr list');
diff --git a/src/include/access/undolog.h b/src/include/access/undolog.h
index 35f7619a121..19badc852a0 100644
--- a/src/include/access/undolog.h
+++ b/src/include/access/undolog.h
@@ -34,6 +34,23 @@ typedef struct UndoLogRecord
/* rmgr-specific data follow, no padding */
} UndoLogRecord;
+/* Operation contexts for calling rm_undo() resource manager routines. */
+typedef enum ULogContext
+{
+ ULOGCXT_COMMIT, /* on-commit action */
+ ULOGCXT_ABORT, /* on-abort action */
+ ULOGCXT_PREPARED, /* action for prepared transactions */
+ ULOGCXT_CLEANUP /* post-recovery clean up */
+} ULogContext;
+
+/* Event types for calling rm_undo_event() resource manager routines. */
+typedef enum ULogEvent
+{
+ ULOGEVENT_XACTEND, /* transaction end */
+ ULOGEVENT_CLEANUP_INIT, /* before starting recovery */
+ ULOGEVENT_RECOVERY_END /* after finishing recovery */
+} ULogEvent;
+
/*
* The high 4 bits in ul_info may be used freely by rmgr. The lower 4 bits are
* not used for now.
@@ -59,6 +76,18 @@ typedef struct xl_ulog_write
unsigned char bytes[FLEXIBLE_ARRAY_MEMBER];
} xl_ulog_write;
+extern Size UndoLogShmemSize(void);
+extern void UndoLogShmemInit(void);
+extern void InitUndoLog(void);
+extern void UndoLogWrite(RmgrId rmgr, uint8 info, void *data, int len);
+extern void AtEOXact_UndoLog(TransactionId xid);
+extern void AtPrepare_UndoLog(void);
+extern void UndoLog_UndoByXid(bool isCommit, TransactionId xid,
+ int nchildren, TransactionId *children);
+extern void UndoLogCleanup(bool recovery_end);
+extern void UndoLogRecoveryEnd(void);
+extern void CheckPointUndoLog(void);
+
extern void undolog_redo(XLogReaderState *record);
extern void undolog_desc(StringInfo buf, XLogReaderState *record);
extern const char *undolog_identify(uint8 info);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d70e6d37e09..d3cbfefda60 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -215,6 +215,9 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_SUBTRANS_SLRU,
LWTRANCHE_XACT_SLRU,
LWTRANCHE_PARALLEL_VACUUM_DSA,
+ LWTRANCHE_UNDOLOG_DSA,
+ LWTRANCHE_UNDOLOG_HASH,
+ LWTRANCHE_UNDOLOG_DATA,
LWTRANCHE_FIRST_USER_DEFINED,
} BuiltinTrancheIds;
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 6a2f64c54fb..c36441e22c6 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -83,3 +83,4 @@ PG_LWLOCK(49, WALSummarizer)
PG_LWLOCK(50, DSMRegistry)
PG_LWLOCK(51, InjectionPoint)
PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, UndoLog)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9c78c72841a..4e635bce942 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2485,6 +2485,7 @@ RewriteState
RmgrData
RmgrDescData
RmgrId
+RmgrUndoData
RoleNameEntry
RoleNameItem
RoleSpec
@@ -3041,10 +3042,15 @@ UINT
ULARGE_INTEGER
ULONG
ULONG_PTR
+ULogOp
+ULogStateData
UV
UVersionInfo
UndoDescData
+UndoLogCtrlStruct
+UndoLogEntry
UndoLogFileHeader
+UndoLogHashEntry
UndoLogRecord
UnicodeNormalizationForm
UnicodeNormalizationQC
--
2.43.5
Attachment: v36-0005-Prevent-orphan-storage-files-after-server-crash.patch (text/x-patch; charset=us-ascii)
From 5322b9aa0d4601753fb783669dfe83fe1ae7526d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Wed, 6 Nov 2024 17:35:33 +0900
Subject: [PATCH v36 05/17] Prevent orphan storage files after server crash
When a server crashes during a transaction that creates tables, newly
created but unused storage files are not removed. This patch prevents
such orphan files by utilizing the UNDO log system for storage files.
---
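Note for readers: the crash cleanup in this patch pairs each ULOG_SMGR_CREATE
record with a later ULOG_SMGR_PRESERVE record for the same relation. Anything
created but never preserved is treated as an orphan and unlinked. Below is a
minimal standalone sketch of that bookkeeping, not the patch code itself.
DemoLocator, DemoRecord and demo_collect_orphans() are hypothetical stand-ins
for RelFileLocator, the undo records and the rlocs[] handling in smgr_undo().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for RelFileLocator; not the PostgreSQL struct. */
typedef struct DemoLocator
{
	unsigned	spc;
	unsigned	db;
	unsigned	rel;
} DemoLocator;

/* Hypothetical record kinds mirroring CREATE and PRESERVE. */
typedef enum DemoOp
{
	DEMO_CREATE,
	DEMO_PRESERVE
} DemoOp;

typedef struct DemoRecord
{
	DemoOp		op;
	DemoLocator loc;
} DemoRecord;

static bool
demo_equal(DemoLocator a, DemoLocator b)
{
	return a.spc == b.spc && a.db == b.db && a.rel == b.rel;
}

/*
 * Replay the records in order and collect the locators that were created
 * but never preserved.  "out" must have room for nrecs entries.  Returns
 * the number of orphans left in "out".
 */
static size_t
demo_collect_orphans(const DemoRecord *recs, size_t nrecs, DemoLocator *out)
{
	size_t		len = 0;

	for (size_t i = 0; i < nrecs; i++)
	{
		if (recs[i].op == DEMO_CREATE)
			out[len++] = recs[i].loc;
		else
		{
			/* compact the array in place, dropping matching entries */
			size_t		j = 0;

			for (size_t k = 0; k < len; k++)
			{
				if (demo_equal(recs[i].loc, out[k]))
					continue;
				out[j++] = out[k];
			}
			len = j;
		}
	}

	return len;
}
```

In the sketch, the entries that remain after the last record correspond to
the files that smgr_undoevent() would unlink at ULOGEVENT_XACTEND.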
src/backend/access/heap/heapam_handler.c | 22 +--
src/backend/access/rmgrdesc/Makefile | 1 +
src/backend/access/rmgrdesc/smgrundodesc.c | 62 ++++++
src/backend/access/rmgrdesc/undologdesc.c | 2 +
src/backend/access/transam/undolog.c | 1 +
src/backend/catalog/index.c | 4 +-
src/backend/catalog/storage.c | 212 +++++++++++++++++++--
src/backend/commands/sequence.c | 4 +-
src/backend/commands/tablecmds.c | 19 +-
src/backend/storage/buffer/bufmgr.c | 4 +-
src/backend/storage/file/reinit.c | 92 +++++++++
src/backend/storage/smgr/smgr.c | 9 +
src/include/access/rmgrlist.h | 2 +-
src/include/catalog/storage.h | 2 +
src/include/catalog/storage_ulog.h | 48 +++++
src/include/storage/reinit.h | 4 +
src/include/storage/smgr.h | 1 +
src/test/recovery/t/013_crash_restart.pl | 19 ++
18 files changed, 465 insertions(+), 43 deletions(-)
create mode 100644 src/backend/access/rmgrdesc/smgrundodesc.c
create mode 100644 src/include/catalog/storage_ulog.h
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 53f572f384b..239442f0cb2 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -611,8 +611,7 @@ heapam_relation_set_new_filelocator(Relation rel,
{
Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
rel->rd_rel->relkind == RELKIND_TOASTVALUE);
- smgrcreate(srel, INIT_FORKNUM, false);
- log_smgrcreate(newrlocator, INIT_FORKNUM);
+ RelationCreateFork(srel, INIT_FORKNUM, true, true);
}
smgrclose(srel);
@@ -656,16 +655,17 @@ heapam_relation_copy_data(Relation rel, const RelFileLocator *newrlocator)
{
if (smgrexists(RelationGetSmgr(rel), forkNum))
{
- smgrcreate(dstrel, forkNum, false);
-
- /*
- * WAL log creation if the relation is persistent, or this is the
- * init fork of an unlogged relation.
- */
- if (RelationIsPermanent(rel) ||
+ bool wal_log = RelationIsPermanent(rel) ||
(rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
- forkNum == INIT_FORKNUM))
- log_smgrcreate(newrlocator, forkNum);
+ forkNum == INIT_FORKNUM);
+
+ /*
+ * Usually, we don't use UNDO log for FSM or VM forks, as their
+ * creation is not transactional. However, we're currently copying
+ * the entire relation in a transactional manner, which requires
+ * after-crash cleanup.
+ */
+ RelationCreateFork(dstrel, forkNum, wal_log, true);
RelationCopyStorage(RelationGetSmgr(rel), dstrel, forkNum,
rel->rd_rel->relpersistence);
}
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index 542fd3d6a8e..fc4605bd30b 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -26,6 +26,7 @@ OBJS = \
rmgrdesc_utils.o \
seqdesc.o \
smgrdesc.o \
+ smgrundodesc.o \
spgdesc.o \
standbydesc.o \
tblspcdesc.o \
diff --git a/src/backend/access/rmgrdesc/smgrundodesc.c b/src/backend/access/rmgrdesc/smgrundodesc.c
new file mode 100644
index 00000000000..9939ef2b61d
--- /dev/null
+++ b/src/backend/access/rmgrdesc/smgrundodesc.c
@@ -0,0 +1,62 @@
+/*-------------------------------------------------------------------------
+ *
+ * smgrundodesc.c
+ * rmgr undolog descriptor routines for catalog/storage.c
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/rmgrdesc/smgrundodesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+#include "catalog/storage_ulog.h"
+#include "lib/stringinfo.h"
+
+void
+smgr_undodesc(StringInfo buf, UndoLogRecord *record)
+{
+ uint8 info = ULogRecGetInfo(record) & ~ULR_INFO_MASK;
+
+ if (info == ULOG_SMGR_CREATE)
+ {
+ ul_smgr_create *urec = (ul_smgr_create *) ULogRecGetData(record);
+
+ appendStringInfo(buf, ": %d/%d/%d, fork %d, backend %d",
+ urec->rlocator.spcOid,
+ urec->rlocator.dbOid,
+ urec->rlocator.relNumber,
+ urec->forknum, urec->backend);
+ }
+ else if (info == ULOG_SMGR_PRESERVE)
+ {
+ ul_smgr_preserve *urec = (ul_smgr_preserve *) ULogRecGetData(record);
+
+ appendStringInfo(buf, ": %d/%d/%d, fork %d, backend %d",
+ urec->rlocator.spcOid,
+ urec->rlocator.dbOid,
+ urec->rlocator.relNumber,
+ urec->forknum, urec->backend);
+ }
+}
+
+const char *
+smgr_undoidentify(uint8 info)
+{
+ const char *id = NULL;
+
+ switch (info & ~ULR_INFO_MASK)
+ {
+ case ULOG_SMGR_CREATE:
+ id = "SMGRCREATE";
+ break;
+ case ULOG_SMGR_PRESERVE:
+ id = "SMGRPRESERVE";
+ break;
+ }
+
+ return id;
+}
diff --git a/src/backend/access/rmgrdesc/undologdesc.c b/src/backend/access/rmgrdesc/undologdesc.c
index e7559cdd33c..fa88705f99e 100644
--- a/src/backend/access/rmgrdesc/undologdesc.c
+++ b/src/backend/access/rmgrdesc/undologdesc.c
@@ -14,6 +14,8 @@
#include "postgres.h"
#include "access/undolog.h"
+#include "catalog/storage.h"
+#include "catalog/storage_ulog.h"
typedef struct UndoDescData
{
diff --git a/src/backend/access/transam/undolog.c b/src/backend/access/transam/undolog.c
index 196e02e652f..b2fdbfcd0f9 100644
--- a/src/backend/access/transam/undolog.c
+++ b/src/backend/access/transam/undolog.c
@@ -28,6 +28,7 @@
#include "access/xlog.h"
#include "access/xloginsert.h"
#include "lib/dshash.h"
+#include "catalog/storage_ulog.h"
#include "miscadmin.h"
#include "storage/fd.h"
#include "storage/procarray.h"
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 6976249e9e9..7613192e343 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3059,8 +3059,8 @@ index_build(Relation heapRelation,
if (indexRelation->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
!smgrexists(RelationGetSmgr(indexRelation), INIT_FORKNUM))
{
- smgrcreate(RelationGetSmgr(indexRelation), INIT_FORKNUM, false);
- log_smgrcreate(&indexRelation->rd_locator, INIT_FORKNUM);
+ RelationCreateFork(RelationGetSmgr(indexRelation),
+ INIT_FORKNUM, true, true);
indexRelation->rd_indam->ambuildempty(indexRelation);
}
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 5b22cf10990..d546d169d34 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -19,13 +19,16 @@
#include "postgres.h"
+#include "access/undolog.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
#include "access/xlog.h"
#include "access/xloginsert.h"
#include "access/xlogutils.h"
#include "catalog/storage.h"
+#include "catalog/storage_ulog.h"
#include "catalog/storage_xlog.h"
+#include "common/hashfn_unstable.h"
#include "miscadmin.h"
#include "storage/bulk_write.h"
#include "storage/freespace.h"
@@ -76,6 +79,14 @@ typedef struct PendingRelSync
static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
static HTAB *pendingSyncHash = NULL;
+/* Storage for smgr_undo()/smgr_undoevent() */
+static RelFileLocator *rlocs = NULL;
+static int rlocs_cap = 0;
+static int rlocs_len = 0;
+
+/* local functions */
+static void ulog_smgrcreate(SMgrRelation srel, ForkNumber forkNum);
+static void ulog_smgrpreserve(RelFileLocator rloc, ForkNumber forkNum);
/*
* AddPendingSync
@@ -147,36 +158,54 @@ RelationCreateStorage(RelFileLocator rlocator, char relpersistence,
}
srel = smgropen(rlocator, procNumber);
- smgrcreate(srel, MAIN_FORKNUM, false);
- if (needs_wal)
- log_smgrcreate(&srel->smgr_rlocator.locator, MAIN_FORKNUM);
+ RelationCreateFork(srel, MAIN_FORKNUM, needs_wal, register_delete);
- /*
- * Add the relation to the list of stuff to delete at abort, if we are
- * asked to do so.
- */
- if (register_delete)
+ if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
+ {
+ Assert(procNumber == INVALID_PROC_NUMBER);
+ AddPendingSync(&rlocator);
+ }
+
+ return srel;
+}
+
+/*
+ * RelationCreateFork
+ * Create physical storage for a fork of a relation.
+ *
+ * This function creates a relation fork in a transactional manner. When
+ * undo_log is true, the creation is UNDO-logged so that in case of transaction
+ * aborts or server crashes later on, the fork will be removed. If the caller
+ * plans to remove the fork in another way, it should pass false. Additionally,
+ * it is WAL-logged if wal_log is true.
+ */
+void
+RelationCreateFork(SMgrRelation srel, ForkNumber forkNum,
+ bool wal_log, bool undo_log)
+{
+ /* Schedule the removal of this fork at abort if requested. */
+ if (undo_log)
{
PendingRelDelete *pending;
+ ulog_smgrcreate(srel, forkNum);
+
pending = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
- pending->rlocator = rlocator;
- pending->procNumber = procNumber;
+ pending->rlocator = srel->smgr_rlocator.locator;
+ pending->procNumber = srel->smgr_rlocator.backend;
pending->atCommit = false; /* delete if abort */
pending->nestLevel = GetCurrentTransactionNestLevel();
pending->next = pendingDeletes;
pendingDeletes = pending;
}
- if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded())
- {
- Assert(procNumber == INVALID_PROC_NUMBER);
- AddPendingSync(&rlocator);
- }
+ /* WAL-log this creation if requested. */
+ if (wal_log)
+ log_smgrcreate(&srel->smgr_rlocator.locator, forkNum);
- return srel;
+ smgrcreate(srel, forkNum, false);
}
/*
@@ -198,6 +227,35 @@ log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum)
XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
}
+/*
+ * Write a ULOG_SMGR_CREATE record to the UNDO log.
+ */
+static void
+ulog_smgrcreate(SMgrRelation srel, ForkNumber forkNum)
+{
+ ul_smgr_create ulrec;
+
+ ulrec.rlocator = srel->smgr_rlocator.locator;
+ ulrec.backend = srel->smgr_rlocator.backend;
+ ulrec.forknum = forkNum;
+ UndoLogWrite(RM_SMGR_ID, ULOG_SMGR_CREATE, &ulrec, sizeof(ulrec));
+}
+
+/*
+ * Write a ULOG_SMGR_PRESERVE record to the UNDO log.
+ */
+static void
+ulog_smgrpreserve(RelFileLocator rloc, ForkNumber forkNum)
+{
+ ul_smgr_preserve ulrec;
+
+ Assert(forkNum == MAIN_FORKNUM);
+ ulrec.rlocator = rloc;
+ ulrec.backend = INVALID_PROC_NUMBER;
+ ulrec.forknum = forkNum;
+ UndoLogWrite(RM_SMGR_ID, ULOG_SMGR_PRESERVE, &ulrec, sizeof(ulrec));
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -253,6 +311,7 @@ RelationPreserveStorage(RelFileLocator rlocator, bool atCommit)
PendingRelDelete *pending;
PendingRelDelete *prev;
PendingRelDelete *next;
+ bool found = false;
prev = NULL;
for (pending = pendingDeletes; pending != NULL; pending = next)
@@ -261,6 +320,8 @@ RelationPreserveStorage(RelFileLocator rlocator, bool atCommit)
if (RelFileLocatorEquals(rlocator, pending->rlocator)
&& pending->atCommit == atCommit)
{
+ found = true;
+
/* unlink and delete list entry */
if (prev)
prev->next = next;
@@ -275,6 +336,9 @@ RelationPreserveStorage(RelFileLocator rlocator, bool atCommit)
prev = pending;
}
}
+
+ if (found)
+ ulog_smgrpreserve(rlocator, MAIN_FORKNUM);
}
/*
@@ -1077,3 +1141,119 @@ smgr_redo(XLogReaderState *record)
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
+
+void
+smgr_undo(UndoLogRecord *record, ULogContext cxt, bool redo, bool crashed)
+{
+ uint8 info;
+
+ Assert(CritSectionCount == 0);
+
+ if (cxt == ULOGCXT_CLEANUP)
+ {
+ Assert(record);
+ info = record->ul_info & ~ULR_INFO_MASK;
+
+ if (info == ULOG_SMGR_CREATE)
+ {
+ ul_smgr_create *ulrec = (ul_smgr_create *) ULogRecGetData(record);
+
+ Assert(ulrec->forknum == MAIN_FORKNUM);
+ if (rlocs_cap < rlocs_len + 1)
+ {
+ if (rlocs_cap == 0)
+ {
+ rlocs_cap = 32;
+ rlocs = palloc(sizeof(RelFileLocator) * rlocs_cap);
+ }
+ else
+ {
+ rlocs_cap *= 2;
+ rlocs = repalloc(rlocs, sizeof(RelFileLocator) * rlocs_cap);
+ }
+ }
+ rlocs[rlocs_len++] = ulrec->rlocator;
+ }
+ else if (info == ULOG_SMGR_PRESERVE)
+ {
+ ul_smgr_preserve *ulrec =
+ (ul_smgr_preserve *) ULogRecGetData(record);
+ int j = 0;
+
+ for (int i = 0 ; i < rlocs_len ; i++)
+ {
+ if (RelFileLocatorEquals(ulrec->rlocator, rlocs[i]))
+ continue;
+
+ if (i != j)
+ rlocs[j] = rlocs[i];
+ j++;
+ }
+
+ rlocs_len = j;
+ }
+ else
+ elog(PANIC, "smgr_undo: unknown op code %d", info);
+ }
+ else if (cxt == ULOGCXT_COMMIT || cxt == ULOGCXT_ABORT ||
+ cxt == ULOGCXT_PREPARED)
+ {
+ /* nothing to do here */
+ }
+ else
+ elog(PANIC, "smgr_undo: unknown context code %u", cxt);
+}
+
+void
+smgr_undoevent(ULogEvent event)
+{
+ if (event == ULOGEVENT_XACTEND)
+ {
+ SMgrRelation reln;
+ ForkNumber forks[3];
+ BlockNumber firstblocks[3] = {0};
+
+ for (int i = 0 ; i < rlocs_len ; i++)
+ {
+ int nforks = 0; /* reset per relation; forks[] holds 3 entries */
+ forks[nforks++] = MAIN_FORKNUM;
+
+ /*
+ * Since the MAIN fork was created in this transaction, rollback
+ * should remove all forks of this relation. Although we could
+ * register an undo record individually for each fork, this may be
+ * more complex because VM and FSM can be created
+ * non-transactionally outside the transaction that created the
+ * MAIN fork.
+ */
+ forks[nforks++] = VISIBILITYMAP_FORKNUM;
+ forks[nforks++] = FSM_FORKNUM;
+
+ /*
+ * Drop buffers, then the files. This can be improved by using
+ * smgrdounlinkall(), but currently I take the simpler way.
+ */
+ reln = smgropen(rlocs[i], INVALID_PROC_NUMBER);
+ DropRelationBuffers(reln, forks, nforks, firstblocks);
+ for (int j = 0 ; j < nforks ; j++)
+ smgrunlink(reln, forks[j], true);
+
+ smgrclose(reln);
+ }
+
+ if (rlocs)
+ {
+ pfree(rlocs);
+ rlocs = NULL;
+ rlocs_cap = rlocs_len = 0;
+ }
+ }
+ else if (event == ULOGEVENT_CLEANUP_INIT ||
+ event == ULOGEVENT_RECOVERY_END)
+ {
+ /* Nothing to do */
+ }
+ else
+ elog(PANIC, "smgr_undoevent: unknown event code %u", event);
+
+}
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 0188e8bbd5b..be6afc7df52 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -31,6 +31,7 @@
#include "catalog/objectaccess.h"
#include "catalog/pg_sequence.h"
#include "catalog/pg_type.h"
+#include "catalog/storage.h"
#include "catalog/storage_xlog.h"
#include "commands/defrem.h"
#include "commands/sequence.h"
@@ -344,8 +345,7 @@ fill_seq_with_data(Relation rel, HeapTuple tuple)
SMgrRelation srel;
srel = smgropen(rel->rd_locator, INVALID_PROC_NUMBER);
- smgrcreate(srel, INIT_FORKNUM, false);
- log_smgrcreate(&rel->rd_locator, INIT_FORKNUM);
+ RelationCreateFork(srel, INIT_FORKNUM, true, true);
fill_seq_fork_with_data(rel, tuple, INIT_FORKNUM);
FlushRelationBuffers(rel);
smgrclose(srel);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 49374782625..b5766989d8e 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -15965,16 +15965,17 @@ index_copy_data(Relation rel, RelFileLocator newrlocator)
{
if (smgrexists(RelationGetSmgr(rel), forkNum))
{
- smgrcreate(dstrel, forkNum, false);
-
- /*
- * WAL log creation if the relation is persistent, or this is the
- * init fork of an unlogged relation.
- */
- if (RelationIsPermanent(rel) ||
+ bool wal_log = RelationIsPermanent(rel) ||
(rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
- forkNum == INIT_FORKNUM))
- log_smgrcreate(&newrlocator, forkNum);
+ forkNum == INIT_FORKNUM);
+
+ /*
+ * Usually, we don't use UNDO log for FSM or VM forks, as their
+ * creation is not transactional. However, we're currently copying
+ * the entire relation in a transactional manner, which requires
+ * after-crash cleanup.
+ */
+ RelationCreateFork(dstrel, forkNum, wal_log, true);
RelationCopyStorage(RelationGetSmgr(rel), dstrel, forkNum,
rel->rd_rel->relpersistence);
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 2622221809c..1a9c794374f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -4812,8 +4812,7 @@ CreateAndCopyRelationData(RelFileLocator src_rlocator,
/*
* Create and copy all forks of the relation. During create database we
* have a separate cleanup mechanism which deletes complete database
- * directory. Therefore, each individual relation doesn't need to be
- * registered for cleanup.
+ * directory. Therefore, do not issue an UNDO log for this relation.
*/
RelationCreateStorage(dst_rlocator, relpersistence, false);
@@ -4827,6 +4826,7 @@ CreateAndCopyRelationData(RelFileLocator src_rlocator,
{
if (smgrexists(src_rel, forkNum))
{
+ /* Use smgrcreate() directly as no UNDO log is required. */
smgrcreate(dst_rel, forkNum, false);
/*
diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 01e267abf9b..d3a42d3f566 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -34,6 +34,39 @@ typedef struct
RelFileNumber relnumber; /* hash key */
} unlogged_relation_entry;
+static char **ignore_files = NULL;
+static int nignore_elems = 0;
+static int nignore_files = 0;
+
+/*
+ * determine if the file should be ignored when resetting unlogged relations
+ */
+static bool
+reinit_ignore_file(const char *dirname, const char *name)
+{
+ char fnamebuf[MAXPGPATH];
+ int len;
+
+ if (nignore_files == 0)
+ return false;
+
+ /* snprintf always NUL-terminates and never overruns the buffer */
+ if (snprintf(fnamebuf, sizeof(fnamebuf), "%s/%s", dirname, name) >=
+ (int) sizeof(fnamebuf))
+ return false;
+
+ for (int i = 0 ; i < nignore_files ; i++)
+ {
+ /* match ignoring fork part */
+ len = strlen(ignore_files[i]);
+ if (strncmp(fnamebuf, ignore_files[i], len) == 0 &&
+ (fnamebuf[len] == 0 || fnamebuf[len] == '_'))
+ return true;
+ }
+
+ return false;
+}
+
/*
* Reset unlogged relations from before the last restart.
*
@@ -204,6 +237,10 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
&forkNum, &segno))
continue;
+ /* Skip anything that undo log suggested to ignore */
+ if (reinit_ignore_file(dbspacedirname, de->d_name))
+ continue;
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -243,6 +280,10 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
&forkNum, &segno))
continue;
+ /* Skip anything that undo log suggested to ignore */
+ if (reinit_ignore_file(dbspacedirname, de->d_name))
+ continue;
+
/* We never remove the init fork. */
if (forkNum == INIT_FORKNUM)
continue;
@@ -294,6 +335,10 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
&forkNum, &segno))
continue;
+ /* Skip anything that undo log suggested to ignore */
+ if (reinit_ignore_file(dbspacedirname, de->d_name))
+ continue;
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -337,6 +382,10 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
&forkNum, &segno))
continue;
+ /* Skip anything that undo log suggested to ignore */
+ if (reinit_ignore_file(dbspacedirname, de->d_name))
+ continue;
+
/* Also skip it unless this is the init fork. */
if (forkNum != INIT_FORKNUM)
continue;
@@ -366,6 +415,49 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
}
}
+/*
+ * Record relfilenodes that should be left alone during reinitializing unlogged
+ * relations.
+ */
+void
+ResetUnloggedRelationIgnore(RelFileLocator rloc, ProcNumber backend)
+{
+ RelFileLocatorBackend rbloc;
+
+ if (nignore_files >= nignore_elems)
+ {
+ if (ignore_files == NULL)
+ {
+ nignore_elems = 16;
+ ignore_files = palloc(sizeof(char *) * nignore_elems);
+ }
+ else
+ {
+ nignore_elems *= 2;
+ ignore_files = repalloc(ignore_files,
+ sizeof(char *) * nignore_elems);
+ }
+ }
+
+ rbloc.backend = backend;
+ rbloc.locator = rloc;
+ ignore_files[nignore_files++] = relpath(rbloc, MAIN_FORKNUM);
+}
+
+/*
+ * Clear the ignore list
+ */
+void
+ResetUnloggedRelationIgnoreClear(void)
+{
+ if (nignore_elems == 0)
+ return;
+
+ pfree(ignore_files);
+ ignore_files = NULL;
+ nignore_elems = nignore_files = 0;
+}
+
/*
* Basic parsing of putative relation filenames.
*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 36ad34aa6ac..8a7654118fe 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -819,6 +819,15 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
+/*
+ * smgrunlink() -- unlink the storage file
+ */
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+ smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rlocator, forknum, isRedo);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 5909d87d599..b0c4e689950 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -27,7 +27,7 @@
/* symbol name, textual name, redo, desc, identify, startup, cleanup, mask, decode, undo, undo_desc, undo_identify, undo_event */
PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL, NULL, xlog_decode, NULL, NULL, NULL, NULL)
PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL, NULL, xact_decode, NULL, NULL, NULL, NULL)
-PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
+PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL, NULL, NULL, smgr_undo, smgr_undodesc, smgr_undoidentify, smgr_undoevent)
PG_RMGR(RM_CLOG_ID, "CLOG", clog_redo, clog_desc, clog_identify, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
PG_RMGR(RM_DBASE_ID, "Database", dbase_redo, dbase_desc, dbase_identify, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
PG_RMGR(RM_TBLSPC_ID, "Tablespace", tblspc_redo, tblspc_desc, tblspc_identify, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 72ef3ee92c0..3451d6ac80c 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -25,6 +25,8 @@ extern PGDLLIMPORT int wal_skip_threshold;
extern SMgrRelation RelationCreateStorage(RelFileLocator rlocator,
char relpersistence,
bool register_delete);
+extern void RelationCreateFork(SMgrRelation srel, ForkNumber forkNum,
+ bool wal_log, bool undo_log);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileLocator rlocator, bool atCommit);
extern void RelationPreTruncate(Relation rel);
diff --git a/src/include/catalog/storage_ulog.h b/src/include/catalog/storage_ulog.h
new file mode 100644
index 00000000000..9568ab24cfb
--- /dev/null
+++ b/src/include/catalog/storage_ulog.h
@@ -0,0 +1,48 @@
+/*-------------------------------------------------------------------------
+ *
+ * storage_ulog.h
+ * prototypes for Undo Log support for backend/catalog/storage.c
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/catalog/storage_ulog.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef STORAGE_ULOG_H
+#define STORAGE_ULOG_H
+
+#include "access/undolog.h"
+#include "storage/smgr.h"
+
+/* ULOG gives us high 4 bits (just following xlog) */
+#define ULOG_SMGR_CREATE 0x10
+#define ULOG_SMGR_PRESERVE 0x20
+
+/* undo log entry for storage file creation */
+typedef struct ul_smgr_create
+{
+ RelFileLocator rlocator;
+ ProcNumber backend;
+ ForkNumber forknum;
+} ul_smgr_create;
+
+typedef struct ul_smgr_preserve
+{
+ RelFileLocator rlocator;
+ ProcNumber backend;
+ ForkNumber forknum;
+} ul_smgr_preserve;
+
+extern void smgr_undo(UndoLogRecord *record, ULogContext cxt, bool redo,
+ bool crashed);
+extern void smgr_undodesc(StringInfo buf, UndoLogRecord *record);
+extern const char *smgr_undoidentify(uint8 info);
+extern void smgr_undoevent(ULogEvent event);
+
+#define ULogRecGetData(record) ((char *) (record) + sizeof(UndoLogRecord))
+#define ULogRecGetInfo(record) ((record)->ul_info)
+
+#endif /* STORAGE_ULOG_H */
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index 1373d509df2..02bf55d3a6b 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -16,9 +16,13 @@
#define REINIT_H
#include "common/relpath.h"
+#include "storage/relfilelocator.h"
extern void ResetUnloggedRelations(int op);
+extern void ResetUnloggedRelationIgnore(RelFileLocator rloc,
+ ProcNumber backend);
+extern void ResetUnloggedRelationIgnoreClear(void);
extern bool parse_filename_for_nontemp_relation(const char *name,
RelFileNumber *relnumber,
ForkNumber *fork,
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 63a186bd346..a2c15d6af90 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -110,6 +110,7 @@ extern void smgrtruncate(SMgrRelation reln, ForkNumber *forknum, int nforks,
BlockNumber *nblocks);
extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void smgrregistersync(SMgrRelation reln, ForkNumber forknum);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void AtEOXact_SMgr(void);
extern bool ProcessBarrierSmgrRelease(void);
diff --git a/src/test/recovery/t/013_crash_restart.pl b/src/test/recovery/t/013_crash_restart.pl
index d5d24e31d90..4df88efeb3d 100644
--- a/src/test/recovery/t/013_crash_restart.pl
+++ b/src/test/recovery/t/013_crash_restart.pl
@@ -86,6 +86,23 @@ ok( pump_until(
$killme_stdout = '';
$killme_stderr = '';
+# Also, create a table whose storage should *not* survive.
+$killme_stdin .= q[
+CREATE TABLE should_not_survive (a int);
+SELECT pg_relation_filepath('should_not_survive');
+];
+ok( pump_until(
+ $killme, $psql_timeout, \$killme_stdout,
+ qr/base\/[[:digit:]\/]+[\r\n]$/m),
+ 'created a table');
+my $relfilerelpath = $killme_stdout;
+chomp($relfilerelpath);
+$killme_stdout = '';
+$killme_stderr = '';
+
+my $relfilepath = $node->data_dir . "/" . $relfilerelpath;
+ok( -e $relfilepath,
+ "storage file is created in xact that is going to crash");
# Start longrunning query in second session; its failure will signal that
# crash-restart has occurred. The initial wait for the trivial select is to
@@ -144,6 +161,8 @@ $killme->run();
($monitor_stdin, $monitor_stdout, $monitor_stderr) = ('', '', '');
$monitor->run();
+ok( ! -e $relfilepath,
+ "orphaned storage file is correctly removed");
# Acquire pid of new backend
$killme_stdin .= q[
--
2.43.5
v36-0006-new-indexam-bit-for-unlogged-storage-compatibili.patch (text/x-patch; charset=us-ascii)
From f2397682d52eb83fd9d3440d226f3262db0235f3 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Wed, 24 Jul 2024 19:31:39 +0900
Subject: [PATCH v36 06/17] new indexam bit for unlogged storage compatibility
To enable the core to identify whether storage files created by an
index access method for WAL-logged and unlogged relations are
binary-compatible, add a boolean property to the index AM interface.
---
src/backend/access/brin/brin.c | 1 +
src/backend/access/gin/ginutil.c | 1 +
src/backend/access/gist/gist.c | 8 ++++++++
src/backend/access/hash/hash.c | 1 +
src/backend/access/nbtree/nbtree.c | 1 +
src/backend/access/spgist/spgutils.c | 1 +
src/include/access/amapi.h | 2 ++
src/test/modules/dummy_index_am/dummy_index_am.c | 1 +
8 files changed, 16 insertions(+)
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 9af445cdcdd..e401ffa5201 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -272,6 +272,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->amsummarizing = true;
amroutine->amparallelvacuumoptions =
VACUUM_OPTION_PARALLEL_CLEANUP;
+ amroutine->amunloggedstoragecompatible = true;
amroutine->amkeytype = InvalidOid;
amroutine->ambuild = brinbuild;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 830d67fbc20..7072ff4537f 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -59,6 +59,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->amsummarizing = false;
amroutine->amparallelvacuumoptions =
VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+ amroutine->amunloggedstoragecompatible = true;
amroutine->amkeytype = InvalidOid;
amroutine->ambuild = ginbuild;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 272390ff67d..fc7b2a05fe5 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -81,6 +81,14 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->amsummarizing = false;
amroutine->amparallelvacuumoptions =
VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_COND_CLEANUP;
+
+ /*
+ * GiST uses page LSNs to figure out whether a block has been
+ * modified. UNLOGGED GiST indexes use fake LSNs, which are incompatible
+ * with the real LSNs used for LOGGED indexes.
+ */
+ amroutine->amunloggedstoragecompatible = false;
+
amroutine->amkeytype = InvalidOid;
amroutine->ambuild = gistbuild;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 42c73ea5eb9..a43abcf7368 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -79,6 +79,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->amsummarizing = false;
amroutine->amparallelvacuumoptions =
VACUUM_OPTION_PARALLEL_BULKDEL;
+ amroutine->amunloggedstoragecompatible = true;
amroutine->amkeytype = INT4OID;
amroutine->ambuild = hashbuild;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 77afa148942..a60239a0080 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -122,6 +122,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->amsummarizing = false;
amroutine->amparallelvacuumoptions =
VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_COND_CLEANUP;
+ amroutine->amunloggedstoragecompatible = true;
amroutine->amkeytype = InvalidOid;
amroutine->ambuild = btbuild;
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index da858182173..f76c7c9ff7f 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -66,6 +66,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->amsummarizing = false;
amroutine->amparallelvacuumoptions =
VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_COND_CLEANUP;
+ amroutine->amunloggedstoragecompatible = true;
amroutine->amkeytype = InvalidOid;
amroutine->ambuild = spgbuild;
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index a4c0b43aa92..e946d9d4363 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -265,6 +265,8 @@ typedef struct IndexAmRoutine
bool amsummarizing;
/* OR of parallel vacuum flags. See vacuum.h for flags. */
uint8 amparallelvacuumoptions;
+ /* is AM storage data compatible between LOGGED and UNLOGGED states? */
+ bool amunloggedstoragecompatible;
/* type of data stored in index, or InvalidOid if variable */
Oid amkeytype;
diff --git a/src/test/modules/dummy_index_am/dummy_index_am.c b/src/test/modules/dummy_index_am/dummy_index_am.c
index beb2c1d2542..ca302490160 100644
--- a/src/test/modules/dummy_index_am/dummy_index_am.c
+++ b/src/test/modules/dummy_index_am/dummy_index_am.c
@@ -297,6 +297,7 @@ dihandler(PG_FUNCTION_ARGS)
amroutine->amusemaintenanceworkmem = false;
amroutine->amsummarizing = false;
amroutine->amparallelvacuumoptions = VACUUM_OPTION_NO_PARALLEL;
+ amroutine->amunloggedstoragecompatible = false;
amroutine->amkeytype = InvalidOid;
amroutine->ambuild = dibuild;
--
2.43.5
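To illustrate how core code might consume the new flag, here is a minimal sketch. It is not taken from the patch set: `MockIndexAmRoutine`, `PersistenceAction`, and `choose_persistence_action()` are hypothetical names invented for this example, and only the field name `amunloggedstoragecompatible` comes from the patch above. The assumption is that an AM reporting `false` (GiST, because of its fake LSNs) forces a rebuild, while the others can keep their storage files.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant part of IndexAmRoutine. */
typedef struct MockIndexAmRoutine
{
    const char *amname;
    bool        amunloggedstoragecompatible;    /* field added by the patch */
} MockIndexAmRoutine;

/* Hypothetical outcome of a persistence change for one index. */
typedef enum PersistenceAction
{
    PERSISTENCE_FLIP_IN_PLACE,  /* keep the existing storage files */
    PERSISTENCE_REBUILD         /* storage formats differ, rewrite the index */
} PersistenceAction;

/*
 * Decide how to change an index's persistence.  If the AM says its LOGGED
 * and UNLOGGED storage is binary-compatible, the files can be reused;
 * otherwise the index has to be rebuilt.
 */
static PersistenceAction
choose_persistence_action(const MockIndexAmRoutine *am)
{
    if (am->amunloggedstoragecompatible)
        return PERSISTENCE_FLIP_IN_PLACE;
    return PERSISTENCE_REBUILD;
}
```

With values mirroring the handlers above, a btree-like AM would take the in-place path and a GiST-like AM the rebuild path.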
v36-0007-Transactional-buffer-persistence-switching.patch (text/x-patch; charset=us-ascii)
From 3276258bb2dbad1c27cbac74e296b79f8aa53fa8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 16 Aug 2024 17:59:38 +0900
Subject: [PATCH v36 07/17] Transactional buffer persistence switching
This commit introduces the infrastructure for transactional buffer
persistence switching; no caller uses it yet. The switch is reverted
if the transaction aborts, and both the switch and the revert are
WAL-logged. Repeated back-and-forth switching within and across
subtransactions is prohibited for simplicity.
---
src/backend/access/rmgrdesc/smgrdesc.c | 13 +
src/backend/access/transam/twophase.c | 2 +
src/backend/access/transam/xact.c | 14 +
src/backend/access/transam/xlog.c | 1 +
src/backend/access/transam/xlogrecovery.c | 1 +
src/backend/catalog/storage.c | 33 +++
src/backend/storage/buffer/bufmgr.c | 328 ++++++++++++++++++++++
src/bin/pg_rewind/parsexlog.c | 6 +
src/include/catalog/storage_xlog.h | 11 +
src/include/storage/bufmgr.h | 10 +
src/tools/pgindent/typedefs.list | 2 +
11 files changed, 421 insertions(+)
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 71410e0a2d3..d7b763f5297 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,16 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
xlrec->blkno, xlrec->flags);
pfree(path);
}
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+ char *path = relpathperm(xlrec->rlocator, MAIN_FORKNUM);
+
+ appendStringInfoString(buf, path);
+ appendStringInfo(buf, " persistence \"%c\"",
+ xlrec->persistence ? 'p' : 'u');
+ pfree(path);
+ }
}
const char *
@@ -55,6 +65,9 @@ smgr_identify(uint8 info)
case XLOG_SMGR_TRUNCATE:
id = "TRUNCATE";
break;
+ case XLOG_SMGR_BUFPERSISTENCE:
+ id = "BUFPERSISTENCE";
+ break;
}
return id;
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index ba1a8bd875c..7e18be8025e 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1608,6 +1608,8 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
abortstats,
gid);
+ /* Clean up buffer persistence changes and unnecessary files. */
+ PreCommit_Buffers(isCommit);
UndoLog_UndoByXid(isCommit, xid, hdr->nsubxacts, children);
ProcArrayRemove(proc, latestXid);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 8b383b424a1..fa9b1185f88 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2280,6 +2280,9 @@ CommitTransaction(void)
CallXactCallbacks(is_parallel_worker ? XACT_EVENT_PARALLEL_PRE_COMMIT
: XACT_EVENT_PRE_COMMIT);
+ /* Clean up buffer persistence changes */
+ PreCommit_Buffers(true);
+
/*
* If this xact has started any unfinished parallel operation, clean up
* its workers, warning about leaked resources. (But we don't actually
@@ -2865,6 +2868,9 @@ AbortTransaction(void)
*/
sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+ /* Clean up buffer persistence changes */
+ PreCommit_Buffers(false);
+
/*
* check the current transaction state
*/
@@ -5114,6 +5120,9 @@ CommitSubTransaction(void)
CallSubXactCallbacks(SUBXACT_EVENT_PRE_COMMIT_SUB, s->subTransactionId,
s->parent->subTransactionId);
+ /* Clean up buffer persistence changes. */
+ PreSubCommit_Buffers(true);
+
/*
* If this subxact has started any unfinished parallel operation, clean up
* its workers and exit parallel mode. Warn about leaked resources.
@@ -5261,6 +5270,9 @@ AbortSubTransaction(void)
*/
reschedule_timeouts();
+ /* Clean up buffer persistence changes */
+ PreSubCommit_Buffers(false);
+
/*
* Re-enable signals, in case we got here by longjmp'ing out of a signal
* handler. We do this fairly early in the sequence so that the timeout
@@ -6234,6 +6246,7 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
}
AtEOXact_UndoLog(xid);
+ AtEOXact_Buffers_Redo(true, xid, parsed->nsubxacts, parsed->subxacts);
if (parsed->nstats > 0)
{
@@ -6347,6 +6360,7 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
}
AtEOXact_UndoLog(xid);
+ AtEOXact_Buffers_Redo(false, xid, parsed->nsubxacts, parsed->subxacts);
if (parsed->nstats > 0)
{
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 81ec5510f09..a6c6ea29612 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5917,6 +5917,7 @@ StartupXLOG(void)
*/
if (!reachedConsistency)
{
+ BufmgrDoCleanupRedo();
UndoLogCleanup(true);
ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP);
}
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 730804340b9..2d0f5df0c77 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2268,6 +2268,7 @@ CheckRecoveryConsistency(void)
* backends don't try to read whatever garbage is left over from
* before.
*/
+ BufmgrDoCleanupRedo();
UndoLogCleanup(false);
ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP);
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index d546d169d34..a1ac06cc2bf 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -256,6 +256,29 @@ ulog_smgrpreserve(RelFileLocator rloc, ForkNumber forkNum)
UndoLogWrite(RM_SMGR_ID, ULOG_SMGR_PRESERVE, &ulrec, sizeof(ulrec));
}
+/*
+ * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL.
+ *
+ * XXX: This function essentially belongs in bufmgr.c, but is placed here to
+ * avoid adding a new rmgr type solely for this record type.
+ */
+void
+log_smgrbufpersistence(const RelFileLocator rlocator, bool persistence)
+{
+ xl_smgr_bufpersistence xlrec;
+
+ /*
+ * Make an XLOG entry reporting the change of buffer persistence.
+ */
+ xlrec.rlocator = rlocator;
+ xlrec.persistence = persistence;
+ xlrec.topxid = GetTopTransactionId();
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE);
+}
+
/*
* RelationDropStorage
* Schedule unlinking of physical storage at transaction commit.
@@ -1138,6 +1161,16 @@ smgr_redo(XLogReaderState *record)
FreeFakeRelcacheEntry(rel);
}
+ else if (info == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ xl_smgr_bufpersistence *xlrec =
+ (xl_smgr_bufpersistence *) XLogRecGetData(record);
+ SMgrRelation reln;
+
+ reln = smgropen(xlrec->rlocator, INVALID_PROC_NUMBER);
+ SetRelationBuffersPersistenceRedo(reln, xlrec->persistence,
+ XLogRecGetXid(record));
+ }
else
elog(PANIC, "smgr_redo: unknown op code %u", info);
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 1a9c794374f..af526280576 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -58,6 +58,7 @@
#include "storage/smgr.h"
#include "storage/standby.h"
#include "utils/memdebug.h"
+#include "utils/memutils.h"
#include "utils/ps_status.h"
#include "utils/rel.h"
#include "utils/resowner.h"
@@ -136,6 +137,22 @@ typedef struct SMgrSortArray
SMgrRelation srel;
} SMgrSortArray;
+/*
+ * We keep a list of all relations whose buffer persistence has been switched
+ * in the current transaction. This allows us to properly revert the
+ * persistence if the transaction is aborted.
+ */
+typedef struct BufMgrCleanup
+{
+ RelFileLocator rlocator; /* relation whose buffer persistence was switched */
+ bool bufpersistence; /* buffer persistence to restore on abort */
+ int nestLevel; /* xact nesting level of request */
+ TransactionId xid; /* used during recovery */
+ struct BufMgrCleanup *next; /* linked-list link */
+} BufMgrCleanup;
+
+static BufMgrCleanup * cleanups = NULL; /* head of linked list */
+
/* GUC variables */
bool zero_damaged_pages = false;
int bgwriter_lru_maxpages = 100;
@@ -223,6 +240,8 @@ static char *ResOwnerPrintBufferIO(Datum res);
static void ResOwnerReleaseBufferPin(Datum res);
static char *ResOwnerPrintBufferPin(Datum res);
+static void set_relation_buffers_persistence(SMgrRelation srel, bool permanent);
+
const ResourceOwnerDesc buffer_io_resowner_desc =
{
.name = "buffer io",
@@ -3548,6 +3567,153 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
return result | BUF_WRITTEN;
}
+/*
+ * bufmgrDoCleanup() -- Take care of buffer persistence changes at end of xact
+ *
+ * This function is called at the end of both transactions and subtransactions,
+ * aiming to immediately clean up failed transactions.
+ */
+static void
+bufmgrDoCleanup(bool isCommit)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+ BufMgrCleanup *cu;
+ BufMgrCleanup *next;
+
+ for (cu = cleanups ; cu && cu->nestLevel >= nestLevel ; cu = next)
+ {
+ next = cu->next;
+ cleanups = next;
+
+ if (!isCommit)
+ {
+ SMgrRelation srel = smgropen(cu->rlocator, INVALID_PROC_NUMBER);
+ set_relation_buffers_persistence(srel, cu->bufpersistence);
+ }
+ pfree(cu);
+ }
+
+#ifdef USE_ASSERT_CHECKING
+ /* All remaining entries pertain to upper levels. */
+ for (cu = cleanups ; cu ; cu = cu->next)
+ Assert(cu->nestLevel < nestLevel);
+#endif
+}
+
+/*
+ * AtEOXact_Buffers_Redo() -- End-of-transaction cleanup of buffer persistence
+ * changes during recovery.
+ *
+ * Unlike normal operation, the cleanup entries are keyed by xid rather than by
+ * nestLevel. See SetRelationBuffersPersistenceRedo() for details on the
+ * registration of those entries.
+ */
+void
+AtEOXact_Buffers_Redo(bool isCommit, TransactionId xid,
+ int nchildren, TransactionId *children)
+{
+ BufMgrCleanup *cu;
+ BufMgrCleanup *prev;
+ BufMgrCleanup *next;
+
+ prev = NULL;
+ for (cu = cleanups ; cu ; cu = next)
+ {
+ next = cu->next;
+
+ if (cu->xid != xid)
+ {
+ int i;
+
+ for (i = 0 ; i < nchildren && cu->xid != children[i] ; i++);
+
+ if (i == nchildren)
+ {
+ /* did not match, go to next */
+ prev = cu;
+ continue;
+ }
+ }
+
+ if (!isCommit)
+ {
+ /*
+ * Record this revert to WAL without re-registering a BufMgrCleanup
+ * entry.
+ */
+ SMgrRelation srel = smgropen(cu->rlocator, INVALID_PROC_NUMBER);
+ set_relation_buffers_persistence(srel, cu->bufpersistence);
+ }
+ if (prev)
+ prev->next = next;
+ else
+ cleanups = next;
+ pfree(cu);
+ }
+}
+
+/*
+ * BufmgrDoCleanupRedo() -- End-of-recovery cleanup of buffer persistence
+ * changes.
+ *
+ * Revert buffer persistence changes made in transactions that are not
+ * committed at the end of recovery.
+ */
+void
+BufmgrDoCleanupRedo(void)
+{
+ BufMgrCleanup *cu;
+ BufMgrCleanup *next;
+
+ for (cu = cleanups ; cu ; cu = next)
+ {
+ SMgrRelation srel = smgropen(cu->rlocator, INVALID_PROC_NUMBER);
+ set_relation_buffers_persistence(srel, cu->bufpersistence);
+
+ next = cu->next;
+ pfree(cu);
+ }
+
+ cleanups = NULL;
+}
+
+/*
+ * PreSubCommit_Buffers() -- Take care of buffer persistence changes at subxact
+ * end
+ */
+void
+PreSubCommit_Buffers(bool isCommit)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+
+ if (!isCommit)
+ {
+ bufmgrDoCleanup(isCommit);
+ return;
+ }
+
+ /*
+ * Reassign all cleanup items at the current nestlevel to the parent
+ * transaction.
+ */
+
+ for (BufMgrCleanup *cu = cleanups ;
+ cu && cu->nestLevel >= nestLevel ;
+ cu = cu->next)
+ {
+ /* no lower-level entry is expected */
+ Assert(cu->nestLevel == nestLevel);
+
+ cu->nestLevel = nestLevel - 1;
+ }
+}
+
+void
+PreCommit_Buffers(bool isCommit)
+{
+ bufmgrDoCleanup(isCommit);
+}
+
/*
* AtEOXact_Buffers - clean up at end of transaction.
*
@@ -4142,6 +4308,168 @@ DropRelationBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
}
}
+/*
+ * set_relation_buffers_persistence()
+ *
+ * When switching to PERMANENT, this function changes the persistence of all
+ * buffer pages for a relation, then writes all dirty pages to disk (or kernel
+ * buffers) to ensure the kernel has the latest view of the relation.
+ * Otherwise, it simply flips the persistence of every page.
+ *
+ * The caller must hold an AccessExclusiveLock on the target relation to
+ * prevent other backends from loading additional blocks.
+ *
+ * XXX: Currently, this function sequentially searches the buffer pool;
+ * consider implementing more efficient search methods. Since this routine is
+ * not used in performance-critical paths, additional optimization isn't
+ * warranted; see also DropRelationBuffers.
+ */
+static void
+set_relation_buffers_persistence(SMgrRelation srel, bool permanent)
+{
+ int i;
+ RelFileLocator rlocator = srel->smgr_rlocator.locator;
+
+ Assert(!RelFileLocatorBackendIsTemp(srel->smgr_rlocator));
+
+ ResourceOwnerEnlarge(CurrentResourceOwner);
+
+ for (i = 0; i < NBuffers; i++)
+ {
+ BufferDesc *bufHdr = GetBufferDescriptor(i);
+ uint32 buf_state;
+
+ /* try unlocked check to avoid locking irrelevant buffers */
+ if (!RelFileLocatorEquals(BufTagGetRelFileLocator(&bufHdr->tag),
+ rlocator))
+ continue;
+
+ ReservePrivateRefCountEntry();
+
+ buf_state = LockBufHdr(bufHdr);
+
+ if (!RelFileLocatorEquals(BufTagGetRelFileLocator(&bufHdr->tag),
+ rlocator))
+ {
+ UnlockBufHdr(bufHdr, buf_state);
+ continue;
+ }
+
+ if (permanent)
+ {
+ /* The init fork is being dropped, drop buffers for it. */
+ if (BufTagGetForkNum(&bufHdr->tag) == INIT_FORKNUM)
+ {
+ InvalidateBuffer(bufHdr);
+ continue;
+ }
+
+ /* Switch the buffer state to BM_PERMANENT before flushing it. */
+ Assert((buf_state & BM_PERMANENT) == 0);
+ buf_state |= BM_PERMANENT;
+ pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+ /*
+ * We haven't written WAL for this buffer. Flush this buffer to
+ * establish the epoch for subsequent WAL records.
+ */
+ if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
+ {
+ PinBuffer_Locked(bufHdr);
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_SHARED);
+ FlushBuffer(bufHdr, srel, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+ UnpinBuffer(bufHdr);
+ }
+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ else
+ {
+ /* There shouldn't be an init fork for this relation */
+ Assert(BufTagGetForkNum(&bufHdr->tag) != INIT_FORKNUM);
+ Assert(buf_state & BM_PERMANENT);
+
+ /* Just switch the buffer state to non-permanent. */
+ buf_state &= ~BM_PERMANENT;
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ }
+}
+
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistence
+ *
+ * This function changes the persistence of all buffer pages of a
+ * relation. See set_relation_buffers_persistence() for functionality
+ * details.
+ *
+ * This function's behavior is transactional, meaning that the changes it
+ * makes will be reverted if this or any higher-level transaction is
+ * aborted.
+ *
+ * The caller must be holding AccessExclusiveLock on the target relation
+ * to ensure no other backend is busy loading more blocks.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(SMgrRelation srel, bool permanent)
+{
+ BufMgrCleanup *cu;
+ RelFileLocator rlocator = srel->smgr_rlocator.locator;
+
+ /*
+ * Prevent double-flipping of relation persistence within the same
+ * transaction. Performing double-flipping adds significant complexity
+ * with minimal benefit. Error out if persistence has already been flipped
+ * for this relation in the current transaction.
+ */
+ for (cu = cleanups ; cu ; cu = cu->next)
+ {
+ if (RelFileLocatorEquals(rlocator, cu->rlocator))
+ ereport(ERROR,
+ errmsg("persistence of this relation has already been changed in the current transaction"));
+ }
+
+ set_relation_buffers_persistence(srel, permanent);
+
+ /* Schedule reverting this change at abort, keying by nestLevel. */
+ cu = (BufMgrCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(BufMgrCleanup));
+ cu->rlocator = rlocator;
+ cu->bufpersistence = !permanent;
+ cu->nestLevel = GetCurrentTransactionNestLevel();
+ cu->next = cleanups;
+ cleanups = cu;
+}
+
+/* ---------------------------------------------------------------------
+ * SetRelationBuffersPersistenceRedo
+ *
+ * This function changes the persistence of all buffer pages for a
+ * relation during recovery. In recovery, cleanup entries are keyed by
+ * transaction ID, rather than by nestLevel.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistenceRedo(SMgrRelation srel, bool permanent,
+ TransactionId xid)
+{
+ BufMgrCleanup *cu;
+
+ set_relation_buffers_persistence(srel, permanent);
+
+ /* Schedule reverting this change at abort */
+ cu = (BufMgrCleanup *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(BufMgrCleanup));
+ cu->rlocator = srel->smgr_rlocator.locator;
+ cu->bufpersistence = !permanent;
+ cu->xid = xid;
+ cu->next = cleanups;
+ cleanups = cu;
+}
+
/* ---------------------------------------------------------------------
* DropRelationsAllBuffers
*
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 64901967d2a..68ade5b6710 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -439,6 +439,12 @@ extractPageInfo(XLogReaderState *record)
* source system.
*/
}
+ else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE)
+ {
+ /*
+ * We can safely ignore these. These don't make any on-disk changes.
+ */
+ }
else if (rmid == RM_XACT_ID &&
((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT ||
(rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED ||
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index a490e05f884..085b1bc1dff 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -29,6 +29,7 @@
/* XLOG gives us high 4 bits */
#define XLOG_SMGR_CREATE 0x10
#define XLOG_SMGR_TRUNCATE 0x20
+#define XLOG_SMGR_BUFPERSISTENCE 0x30
typedef struct xl_smgr_create
{
@@ -36,6 +37,14 @@ typedef struct xl_smgr_create
ForkNumber forkNum;
} xl_smgr_create;
+typedef struct xl_smgr_bufpersistence
+{
+ RelFileLocator rlocator;
+ bool persistence;
+ TransactionId topxid;
+ /* subxid is in the record header */
+} xl_smgr_bufpersistence;
+
/* flags for xl_smgr_truncate */
#define SMGR_TRUNCATE_HEAP 0x0001
#define SMGR_TRUNCATE_VM 0x0002
@@ -51,6 +60,8 @@ typedef struct xl_smgr_truncate
} xl_smgr_truncate;
extern void log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum);
+extern void log_smgrbufpersistence(const RelFileLocator rlocator,
+ bool persistence);
extern void smgr_redo(XLogReaderState *record);
extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index eb0fba4230b..4267098080f 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -19,6 +19,7 @@
#include "storage/buf.h"
#include "storage/bufpage.h"
#include "storage/relfilelocator.h"
+#include "storage/smgr.h"
#include "utils/relcache.h"
#include "utils/snapmgr.h"
@@ -250,7 +251,14 @@ extern Buffer ExtendBufferedRelTo(BufferManagerRelation bmr,
ReadBufferMode mode);
extern void InitBufferManagerAccess(void);
+extern void PreSubCommit_Buffers(bool isCommit);
+extern void PreCommit_Buffers(bool isCommit);
extern void AtEOXact_Buffers(bool isCommit);
+extern void SetRelationBuffersPersistenceRedo(SMgrRelation srel, bool permanent,
+ TransactionId xid);
+extern void AtEOXact_Buffers_Redo(bool isCommit, TransactionId xid,
+ int nchildren, TransactionId *children);
+extern void BufmgrDoCleanupRedo(void);
extern char *DebugPrintBufferRefcount(Buffer buffer);
extern void CheckPointBuffers(int flags);
extern BlockNumber BufferGetBlockNumber(Buffer buffer);
@@ -269,6 +277,8 @@ extern void DropRelationBuffers(struct SMgrRelationData *smgr_reln,
extern void DropRelationsAllBuffers(struct SMgrRelationData **smgr_reln,
int nlocators);
extern void DropDatabaseBuffers(Oid dbid);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+ bool permanent);
#define RelationGetNumberOfBlocks(reln) \
RelationGetNumberOfBlocksInFork(reln, MAIN_FORKNUM)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 4e635bce942..795e5d4d018 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -329,6 +329,7 @@ BtreeLastVisibleEntry
BtreeLevel
Bucket
BufFile
+BufMgrCleanup
Buffer
BufferAccessStrategy
BufferAccessStrategyType
@@ -4149,6 +4150,7 @@ xl_replorigin_set
xl_restore_point
xl_running_xacts
xl_seq_rec
+xl_smgr_bufpersistence
xl_smgr_create
xl_smgr_truncate
xl_standby_lock
--
2.43.5
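The nest-level bookkeeping in this patch can be easier to follow in isolation. The following is a simplified, self-contained model, not the patch's code: `Cleanup`, `rel_permanent`, `switch_persistence()`, and `end_xact()` are hypothetical names, a small integer stands in for `RelFileLocator`, and a plain array stands in for the `BM_PERMANENT` flags in shared buffers. It models the behavior the comments describe: a subtransaction abort reverts only that level's switches, a subtransaction commit hands its entries to the parent, and a top-level abort reverts everything still registered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, simplified counterpart of BufMgrCleanup. */
typedef struct Cleanup
{
    int         relid;          /* stand-in for RelFileLocator */
    bool        revert_to;      /* persistence to restore on abort */
    int         nest_level;     /* xact nesting level of the request */
    struct Cleanup *next;
} Cleanup;

static Cleanup *cleanups = NULL;    /* head holds the deepest level */
static bool rel_permanent[8];       /* toy per-relation buffer persistence */

/* Flip persistence and remember how to undo it. */
static void
switch_persistence(int relid, bool permanent, int nest_level)
{
    Cleanup    *cu = malloc(sizeof(Cleanup));

    if (cu == NULL)
        abort();
    rel_permanent[relid] = permanent;
    cu->relid = relid;
    cu->revert_to = !permanent;
    cu->nest_level = nest_level;
    cu->next = cleanups;
    cleanups = cu;
}

/* End of a transaction (nest_level 1) or subtransaction (nest_level > 1). */
static void
end_xact(bool is_commit, int nest_level)
{
    if (is_commit && nest_level > 1)
    {
        /* Subcommit: reassign this level's entries to the parent. */
        for (Cleanup *cu = cleanups;
             cu && cu->nest_level >= nest_level;
             cu = cu->next)
            cu->nest_level = nest_level - 1;
        return;
    }

    /* Commit or abort: consume entries at this level, revert on abort. */
    while (cleanups && cleanups->nest_level >= nest_level)
    {
        Cleanup    *cu = cleanups;

        cleanups = cu->next;
        if (!is_commit)
            rel_permanent[cu->relid] = cu->revert_to;
        free(cu);
    }
}
```

Because new entries are pushed at the head, the list stays ordered from the deepest nesting level to the shallowest, which is what lets both loops stop at the first entry belonging to an outer level.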
v36-0008-Make-smgrdounlinkall-accept-fork-numbers.patch (text/x-patch; charset=us-ascii)
From 9cd7c6c35deafc8c1bf439128e9fdb72c5869687 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 2 Aug 2024 19:34:06 +0900
Subject: [PATCH v36 08/17] Make smgrdounlinkall accept fork numbers
An upcoming patch will require crash-safe file deletion on a per-fork
basis. To support this, modify smgrdounlinkall(), which efficiently
removes multiple files, to accept fork numbers. This commit also
introduces a new type, ForkBitmap, to represent multiple fork numbers
as a single integer.
---
src/backend/catalog/storage.c | 2 +-
src/backend/storage/buffer/bufmgr.c | 92 ++++++++++++++++++++++++-----
src/backend/storage/smgr/md.c | 2 +-
src/backend/storage/smgr/smgr.c | 28 ++++++---
src/backend/utils/cache/relcache.c | 2 +-
src/include/common/relpath.h | 11 ++++
src/include/storage/bufmgr.h | 2 +-
src/include/storage/smgr.h | 3 +-
src/tools/pgindent/typedefs.list | 1 +
9 files changed, 115 insertions(+), 28 deletions(-)
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index a1ac06cc2bf..5b20c583d16 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -796,7 +796,7 @@ smgrDoPendingDeletes(bool isCommit)
if (nrels > 0)
{
- smgrdounlinkall(srels, nrels, false);
+ smgrdounlinkall(srels, NULL, nrels, false);
for (int i = 0; i < nrels; i++)
smgrclose(srels[i]);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index af526280576..df114e8e0c0 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -153,6 +153,16 @@ typedef struct BufMgrCleanup
static BufMgrCleanup * cleanups = NULL; /* head of linked list */
+/*
+ * Helper struct for handling RelFileLocator and ForkNumber together in
+ * DropRelationsAllBuffers.
+ */
+typedef struct RelFileForks
+{
+ RelFileLocator rloc; /* key member for qsort */
+ ForkBitmap forks; /* fork numbers as a bitmap */
+} RelFileForks;
+
/* GUC variables */
bool zero_damaged_pages = false;
int bgwriter_lru_maxpages = 100;
@@ -4476,24 +4486,32 @@ SetRelationBuffersPersistenceRedo(SMgrRelation srel, bool permanent,
* This function removes from the buffer pool all the pages of all
* forks of the specified relations. It's equivalent to calling
* DropRelationBuffers once per fork per relation with firstDelBlock = 0.
+ * If the optional parameter pforks is provided, only the forks set in
+ * each relation's bitmap are dropped.
* --------------------------------------------------------------------
*/
void
-DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators)
+DropRelationsAllBuffers(SMgrRelation *smgr_reln, ForkBitmap *pforks,
+ int nlocators)
{
int i;
int n = 0;
SMgrRelation *rels;
BlockNumber (*block)[MAX_FORKNUM + 1];
uint64 nBlocksToInvalidate = 0;
- RelFileLocator *locators;
+ ForkBitmap *forks = NULL;
+ RelFileForks *locators;
bool cached = true;
bool use_bsearch;
if (nlocators == 0)
return;
- rels = palloc(sizeof(SMgrRelation) * nlocators); /* non-local relations */
+ /* storage for non-local relations */
+ rels = palloc(sizeof(SMgrRelation) * nlocators);
+
+ if (pforks)
+ forks = palloc(sizeof(ForkBitmap) * nlocators);
/* If it's a local relation, it's localbuf.c's problem. */
for (i = 0; i < nlocators; i++)
@@ -4504,7 +4522,12 @@ DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators)
DropRelationAllLocalBuffers(smgr_reln[i]->smgr_rlocator.locator);
}
else
- rels[n++] = smgr_reln[i];
+ {
+ rels[n] = smgr_reln[i];
+ if (forks)
+ forks[n] = pforks[i];
+ n++;
+ }
}
/*
@@ -4514,6 +4537,10 @@ DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators)
if (n == 0)
{
pfree(rels);
+
+ if (forks)
+ pfree(forks);
+
return;
}
@@ -4532,6 +4559,13 @@ DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators)
{
for (int j = 0; j <= MAX_FORKNUM; j++)
{
+ /* Consider only the specified fork, if provided. */
+ if (forks && !FORKBITMAP_ISSET(forks[i], j))
+ {
+ block[i][j] = InvalidBlockNumber;
+ continue;
+ }
+
/* Get the number of blocks for a relation's fork. */
block[i][j] = smgrnblocks_cached(rels[i], j);
@@ -4559,7 +4593,7 @@ DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators)
{
for (int j = 0; j <= MAX_FORKNUM; j++)
{
- /* ignore relation forks that doesn't exist */
+ /* ignore relation forks that don't exist or were not requested */
if (!BlockNumberIsValid(block[i][j]))
continue;
@@ -4575,9 +4609,13 @@ DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators)
}
pfree(block);
- locators = palloc(sizeof(RelFileLocator) * n); /* non-local relations */
+ locators = palloc(sizeof(RelFileForks) * n); /* non-local relations */
+
for (i = 0; i < n; i++)
- locators[i] = rels[i]->smgr_rlocator.locator;
+ {
+ locators[i].rloc = rels[i]->smgr_rlocator.locator;
+ locators[i].forks = (forks ? forks[i] : FORKBITMAP_ALLFORKS());
+ }
/*
* For low number of relations to drop just use a simple walk through, to
@@ -4587,13 +4625,34 @@ DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators)
*/
use_bsearch = n > RELS_BSEARCH_THRESHOLD;
- /* sort the list of rlocators if necessary */
- if (use_bsearch)
- qsort(locators, n, sizeof(RelFileLocator), rlocator_comparator);
+ /*
+ * Sort and compress the list of RelFileForks if necessary. We assume the
+ * caller passed unique rlocators if forks are not specified.
+ */
+ if (use_bsearch || forks)
+ {
+ int j = 0;
+
+ qsort(locators, n, sizeof(RelFileForks), rlocator_comparator);
+
+ /*
+ * Now the list is in rlocator increasing order, compress the list by
+ * merging fork bitmaps so that all elements have unique rlocators.
+ */
+ for (i = 1 ; i < n ; i++)
+ {
+ if (RelFileLocatorEquals(locators[j].rloc, locators[i].rloc))
+ locators[j].forks |= locators[i].forks;
+ else
+ locators[++j] = locators[i];
+ }
+
+ n = j + 1;
+ }
for (i = 0; i < NBuffers; i++)
{
- RelFileLocator *rlocator = NULL;
+ RelFileForks *rlocator = NULL;
BufferDesc *bufHdr = GetBufferDescriptor(i);
uint32 buf_state;
@@ -4608,7 +4667,8 @@ DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators)
for (j = 0; j < n; j++)
{
- if (BufTagMatchesRelFileLocator(&bufHdr->tag, &locators[j]))
+ if (BufTagMatchesRelFileLocator(&bufHdr->tag,
+ &locators[j].rloc))
{
rlocator = &locators[j];
break;
@@ -4621,16 +4681,18 @@ DropRelationsAllBuffers(SMgrRelation *smgr_reln, int nlocators)
locator = BufTagGetRelFileLocator(&bufHdr->tag);
rlocator = bsearch(&locator,
- locators, n, sizeof(RelFileLocator),
+ locators, n, sizeof(RelFileForks),
rlocator_comparator);
}
/* buffer doesn't belong to any of the given relfilelocators; skip it */
- if (rlocator == NULL)
+ if (rlocator == NULL ||
+ !FORKBITMAP_ISSET(rlocator->forks, BufTagGetForkNum(&bufHdr->tag)))
continue;
buf_state = LockBufHdr(bufHdr);
- if (BufTagMatchesRelFileLocator(&bufHdr->tag, rlocator))
+ if (BufTagMatchesRelFileLocator(&bufHdr->tag, &rlocator->rloc) &&
+ FORKBITMAP_ISSET(rlocator->forks, BufTagGetForkNum(&bufHdr->tag)))
InvalidateBuffer(bufHdr); /* releases spinlock */
else
UnlockBufHdr(bufHdr, buf_state);
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 11fccda475f..92d77dffc92 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1478,7 +1478,7 @@ DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo)
srels[i] = srel;
}
- smgrdounlinkall(srels, ndelrels, isRedo);
+ smgrdounlinkall(srels, NULL, ndelrels, isRedo);
for (i = 0; i < ndelrels; i++)
smgrclose(srels[i]);
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 8a7654118fe..d507101bb6c 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -454,15 +454,19 @@ smgrdosyncall(SMgrRelation *rels, int nrels)
/*
* smgrdounlinkall() -- Immediately unlink all forks of all given relations
*
- * All forks of all given relations are removed from the store. This
- * should not be used during transactional operations, since it can't be
- * undone.
+ * Forks of all given relations are removed from the store. This should not be
+ * used during transactional operations, since it can't be undone.
+ *
+ * If forks is NULL, all forks are removed for all relations. Otherwise, only
+ * the forks set in the bitmap at the corresponding position in the forks
+ * array are removed for each relation in the rels array.
+ * FORKBITMAP_ALLFORKS() means removing all forks for that relation.
*
* If isRedo is true, it is okay for the underlying file(s) to be gone
* already.
*/
void
-smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
+smgrdounlinkall(SMgrRelation *rels, ForkBitmap *forks, int nrels, bool isRedo)
{
int i = 0;
RelFileLocatorBackend *rlocators;
@@ -475,7 +479,7 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
* Get rid of any remaining buffers for the relations. bufmgr will just
* drop them without bothering to write the contents.
*/
- DropRelationsAllBuffers(rels, nrels);
+ DropRelationsAllBuffers(rels, forks, nrels);
/*
* create an array which contains all relations to be dropped, and close
@@ -489,9 +493,13 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
rlocators[i] = rlocator;
- /* Close the forks at smgr level */
+ /* Close the specified forks at smgr level. */
for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- smgrsw[which].smgr_close(rels[i], forknum);
+ {
+ if (!forks || FORKBITMAP_ISSET(forks[i], forknum))
+ smgrsw[which].smgr_close(rels[i], forknum);
+ continue;
+ }
}
/*
@@ -518,7 +526,11 @@ smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo)
int which = rels[i]->smgr_which;
for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- smgrsw[which].smgr_unlink(rlocators[i], forknum, isRedo);
+ {
+ if (!forks || FORKBITMAP_ISSET(forks[i], forknum))
+ smgrsw[which].smgr_unlink(rlocators[i], forknum, isRedo);
+ continue;
+ }
}
pfree(rlocators);
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 1ce7eb9da8f..46a5ddfb3ae 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -3794,7 +3794,7 @@ RelationSetNewRelfilenumber(Relation relation, char persistence)
* anyway.
*/
srel = smgropen(relation->rd_locator, relation->rd_backend);
- smgrdounlinkall(&srel, 1, false);
+ smgrdounlinkall(&srel, NULL, 1, false);
smgrclose(srel);
}
else
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index a267f67b446..1d91c41bc7b 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -70,6 +70,17 @@ typedef enum ForkNumber
#define MAX_FORKNUM INIT_FORKNUM
+/* ForkBitmap holds multiple forks as a bitmap */
+StaticAssertDecl(MAX_FORKNUM < 8, "MAX_FORKNUM too large for ForkBitmap");
+
+typedef uint8 ForkBitmap;
+#define FORKBITMAP_BIT(f) (1 << (f))
+#define FORKBITMAP_INIT(m, f) ((m) = FORKBITMAP_BIT((f)))
+#define FORKBITMAP_SET(m, f) ((m) |= FORKBITMAP_BIT((f)))
+#define FORKBITMAP_RESET(m, f) ((m) &= ~(FORKBITMAP_BIT(f)))
+#define FORKBITMAP_ISSET(m, f) ((m) & FORKBITMAP_BIT(f))
+#define FORKBITMAP_ALLFORKS() ((1 << (MAX_FORKNUM + 1)) - 1)
+
#define FORKNAMECHARS 4 /* max chars for a fork name */
extern PGDLLIMPORT const char *const forkNames[];
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 4267098080f..5b614fb618e 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -275,7 +275,7 @@ extern void DropRelationBuffers(struct SMgrRelationData *smgr_reln,
ForkNumber *forkNum,
int nforks, BlockNumber *firstDelBlock);
extern void DropRelationsAllBuffers(struct SMgrRelationData **smgr_reln,
- int nlocators);
+ ForkBitmap *forks, int nlocators);
extern void DropDatabaseBuffers(Oid dbid);
extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
bool permanent);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a2c15d6af90..1a210f6af08 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -85,7 +85,8 @@ extern void smgrreleaseall(void);
extern void smgrreleaserellocator(RelFileLocatorBackend rlocator);
extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
extern void smgrdosyncall(SMgrRelation *rels, int nrels);
-extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
+extern void smgrdounlinkall(SMgrRelation *rels, ForkBitmap *forks, int nrels,
+ bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, const void *buffer, bool skipFsync);
extern void smgrzeroextend(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 795e5d4d018..d21588f2d5d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -818,6 +818,7 @@ ForeignServer
ForeignServerInfo
ForeignTable
ForeignTruncateInfo
+ForkBitmap
ForkNumber
FormData_pg_aggregate
FormData_pg_am
--
2.43.5
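For readers following the bufmgr.c part of the patch above, here is a minimal standalone sketch of the sort-and-compress step on the RelFileForks list. It is not the patch code: DemoRelForks, demo_cmp and demo_sort_and_merge are hypothetical names, a single integer stands in for RelFileLocator, and MAX_FORKNUM is assumed to be 3. Only the ForkBitmap macros are copied from relpath.h as modified by the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption for this sketch: four forks, so MAX_FORKNUM is 3. */
#define MAX_FORKNUM 3

/* Macros as added to relpath.h by the patch (uint8_t replaces uint8). */
typedef uint8_t ForkBitmap;
#define FORKBITMAP_BIT(f)		(1 << (f))
#define FORKBITMAP_ISSET(m, f)	((m) & FORKBITMAP_BIT(f))
#define FORKBITMAP_ALLFORKS()	((1 << (MAX_FORKNUM + 1)) - 1)

/* Hypothetical stand-in for RelFileForks: an integer key replaces rloc. */
typedef struct DemoRelForks
{
	uint32_t	relnumber;		/* sort key */
	ForkBitmap	forks;			/* forks to drop for this relation */
} DemoRelForks;

/* qsort comparator on the key only */
int
demo_cmp(const void *a, const void *b)
{
	uint32_t	x = ((const DemoRelForks *) a)->relnumber;
	uint32_t	y = ((const DemoRelForks *) b)->relnumber;

	return (x > y) - (x < y);
}

/*
 * Sort the list by key, then merge the bitmaps of entries that share a key
 * so that every remaining element has a unique key.  Returns the new length.
 */
int
demo_sort_and_merge(DemoRelForks *items, int n)
{
	int			j = 0;

	assert(n >= 0);
	if (n == 0)
		return 0;

	qsort(items, (size_t) n, sizeof(DemoRelForks), demo_cmp);

	for (int i = 1; i < n; i++)
	{
		if (items[j].relnumber == items[i].relnumber)
			items[j].forks |= items[i].forks;	/* same relation: OR forks */
		else
			items[++j] = items[i];				/* new relation: keep it */
	}

	return j + 1;
}
```

Because the merge is a bitwise OR, the result does not depend on the order in which qsort leaves entries with equal keys. After compression a binary search on the key finds at most one entry per relation, which is what the per-buffer FORKBITMAP_ISSET check in the patch relies on.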
v36-0009-Enable-commit-records-to-handle-fork-removals.patch (text/x-patch; charset=us-ascii)
From c580dceaa20e7288940df7d7cb442bffe988908d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 2 Aug 2024 20:51:31 +0900
Subject: [PATCH v36 09/17] Enable commit records to handle fork removals
Currently, COMMIT/ABORT WAL records store relation locators that need
to be removed at commit. This patch adds support for handling these
removals on a per-fork basis. While the PREPARE record can store the
same information, it is not used.
---
src/backend/access/rmgrdesc/xactdesc.c | 48 ++++++++++++++++++++++----
src/backend/access/transam/twophase.c | 42 ++++++++++++++++++----
src/backend/access/transam/xact.c | 30 ++++++++++++----
src/backend/storage/buffer/bufmgr.c | 2 +-
src/backend/storage/smgr/md.c | 11 ++++--
src/include/access/xact.h | 8 +++++
src/include/storage/md.h | 3 +-
7 files changed, 120 insertions(+), 24 deletions(-)
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 889cb955c18..a086809dc75 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -82,6 +82,12 @@ ParseCommitRecord(uint8 info, xl_xact_commit *xlrec, xl_xact_parsed_commit *pars
data += MinSizeOfXactRelfileLocators;
data += xl_rellocators->nrels * sizeof(RelFileLocator);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_RELFILEFORKS)
+ {
+ parsed->xforks = (ForkBitmap *)data;
+ data += xl_rellocators->nrels * sizeof(ForkBitmap);
+ }
}
if (parsed->xinfo & XACT_XINFO_HAS_DROPPED_STATS)
@@ -188,6 +194,12 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
data += MinSizeOfXactRelfileLocators;
data += xl_rellocator->nrels * sizeof(RelFileLocator);
+
+ if (parsed->xinfo & XACT_XINFO_HAS_RELFILEFORKS)
+ {
+ parsed->xforks = (ForkBitmap *)data;
+ data += xl_rellocator->nrels * sizeof(ForkBitmap);
+ }
}
if (parsed->xinfo & XACT_XINFO_HAS_DROPPED_STATS)
@@ -262,9 +274,19 @@ ParsePrepareRecord(uint8 info, xl_xact_prepare *xlrec, xl_xact_parsed_prepare *p
parsed->xlocators = (RelFileLocator *) bufptr;
bufptr += MAXALIGN(xlrec->ncommitrels * sizeof(RelFileLocator));
+ if (xlrec->comhasforks)
+ {
+ parsed->xforks = (ForkBitmap *) bufptr;
+ bufptr += MAXALIGN(xlrec->ncommitrels * sizeof(ForkBitmap));
+ }
parsed->abortlocators = (RelFileLocator *) bufptr;
bufptr += MAXALIGN(xlrec->nabortrels * sizeof(RelFileLocator));
+ if (xlrec->abohasforks)
+ {
+ parsed->abortforks = (ForkBitmap *) bufptr;
+ bufptr += MAXALIGN(xlrec->nabortrels * sizeof(ForkBitmap));
+ }
parsed->stats = (xl_xact_stats_item *) bufptr;
bufptr += MAXALIGN(xlrec->ncommitstats * sizeof(xl_xact_stats_item));
@@ -278,7 +300,7 @@ ParsePrepareRecord(uint8 info, xl_xact_prepare *xlrec, xl_xact_parsed_prepare *p
static void
xact_desc_relations(StringInfo buf, char *label, int nrels,
- RelFileLocator *xlocators)
+ RelFileLocator *xlocators, ForkBitmap *xforks)
{
int i;
@@ -291,6 +313,19 @@ xact_desc_relations(StringInfo buf, char *label, int nrels,
appendStringInfo(buf, " %s", path);
pfree(path);
+
+ if (xforks)
+ {
+ char delim = ':';
+ for (int j = 0 ; j <= MAX_FORKNUM ; j++)
+ {
+ if (FORKBITMAP_ISSET(xforks[i], j))
+ {
+ appendStringInfo(buf, "%c%d", delim, j);
+ delim = ',';
+ }
+ }
+ }
}
}
}
@@ -343,7 +378,8 @@ xact_desc_commit(StringInfo buf, uint8 info, xl_xact_commit *xlrec, RepOriginId
appendStringInfoString(buf, timestamptz_to_str(xlrec->xact_time));
- xact_desc_relations(buf, "rels", parsed.nrels, parsed.xlocators);
+ xact_desc_relations(buf, "rels",
+ parsed.nrels, parsed.xlocators, parsed.xforks);
xact_desc_subxacts(buf, parsed.nsubxacts, parsed.subxacts);
xact_desc_stats(buf, "", parsed.nstats, parsed.stats);
@@ -379,7 +415,8 @@ xact_desc_abort(StringInfo buf, uint8 info, xl_xact_abort *xlrec, RepOriginId or
appendStringInfoString(buf, timestamptz_to_str(xlrec->xact_time));
- xact_desc_relations(buf, "rels", parsed.nrels, parsed.xlocators);
+ xact_desc_relations(buf, "rels",
+ parsed.nrels, parsed.xlocators, parsed.xforks);
xact_desc_subxacts(buf, parsed.nsubxacts, parsed.subxacts);
if (parsed.xinfo & XACT_XINFO_HAS_ORIGIN)
@@ -403,9 +440,8 @@ xact_desc_prepare(StringInfo buf, uint8 info, xl_xact_prepare *xlrec, RepOriginI
appendStringInfo(buf, "gid %s: ", parsed.twophase_gid);
appendStringInfoString(buf, timestamptz_to_str(parsed.xact_time));
- xact_desc_relations(buf, "rels(commit)", parsed.nrels, parsed.xlocators);
- xact_desc_relations(buf, "rels(abort)", parsed.nabortrels,
- parsed.abortlocators);
+ xact_desc_relations(buf, "rels(commit)", parsed.nrels,
+ parsed.xlocators, parsed.xforks);
xact_desc_stats(buf, "commit ", parsed.nstats, parsed.stats);
xact_desc_stats(buf, "abort ", parsed.nabortstats, parsed.abortstats);
xact_desc_subxacts(buf, parsed.nsubxacts, parsed.subxacts);
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 7e18be8025e..5c20065e408 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -203,6 +203,7 @@ static void RecordTransactionCommitPrepared(TransactionId xid,
TransactionId *children,
int nrels,
RelFileLocator *rels,
+ ForkBitmap *forks,
int nstats,
xl_xact_stats_item *stats,
int ninvalmsgs,
@@ -214,6 +215,7 @@ static void RecordTransactionAbortPrepared(TransactionId xid,
TransactionId *children,
int nrels,
RelFileLocator *rels,
+ ForkBitmap *forks,
int nstats,
xl_xact_stats_item *stats,
const char *gid);
@@ -1089,7 +1091,9 @@ StartPrepare(GlobalTransaction gxact)
TwoPhaseFileHeader hdr;
TransactionId *children;
RelFileLocator *commitrels;
+ ForkBitmap *commitforks = NULL;
RelFileLocator *abortrels;
+ ForkBitmap *abortforks = NULL;
xl_xact_stats_item *abortstats = NULL;
xl_xact_stats_item *commitstats = NULL;
SharedInvalidationMessage *invalmsgs;
@@ -1116,7 +1120,9 @@ StartPrepare(GlobalTransaction gxact)
hdr.owner = gxact->owner;
hdr.nsubxacts = xactGetCommittedChildren(&children);
hdr.ncommitrels = smgrGetPendingDeletes(true, &commitrels);
+ hdr.comhasforks = false;
hdr.nabortrels = smgrGetPendingDeletes(false, &abortrels);
+ hdr.abohasforks = false;
hdr.ncommitstats =
pgstat_get_transactional_drops(true, &commitstats);
hdr.nabortstats =
@@ -1145,11 +1151,23 @@ StartPrepare(GlobalTransaction gxact)
{
save_state_data(commitrels, hdr.ncommitrels * sizeof(RelFileLocator));
pfree(commitrels);
+
+ if (hdr.comhasforks)
+ {
+ save_state_data(commitforks, hdr.ncommitrels * sizeof(ForkBitmap));
+ pfree(commitforks);
+ }
}
if (hdr.nabortrels > 0)
{
save_state_data(abortrels, hdr.nabortrels * sizeof(RelFileLocator));
pfree(abortrels);
+
+ if (hdr.abohasforks)
+ {
+ save_state_data(abortforks, hdr.nabortrels * sizeof(ForkBitmap));
+ pfree(abortforks);
+ }
}
if (hdr.ncommitstats > 0)
{
@@ -1532,8 +1550,11 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
TransactionId latestXid;
TransactionId *children;
RelFileLocator *commitrels;
+ ForkBitmap *commitforks = NULL;
RelFileLocator *abortrels;
+ ForkBitmap *abortforks = NULL;
RelFileLocator *delrels;
+ ForkBitmap *delforks = NULL;
int ndelrels;
xl_xact_stats_item *commitstats;
xl_xact_stats_item *abortstats;
@@ -1595,7 +1616,8 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
if (isCommit)
RecordTransactionCommitPrepared(xid,
hdr->nsubxacts, children,
- hdr->ncommitrels, commitrels,
+ hdr->ncommitrels,
+ commitrels, commitforks,
hdr->ncommitstats,
commitstats,
hdr->ninvalmsgs, invalmsgs,
@@ -1603,7 +1625,8 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
else
RecordTransactionAbortPrepared(xid,
hdr->nsubxacts, children,
- hdr->nabortrels, abortrels,
+ hdr->nabortrels,
+ abortrels, abortforks,
hdr->nabortstats,
abortstats,
gid);
@@ -1624,6 +1647,9 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
*/
gxact->valid = false;
+ /* Currently, prepare info should not have per-fork storage information. */
+ Assert(!commitforks);
+
/*
* We have to remove any files that were supposed to be dropped. For
* consistency with the regular xact.c code paths, must do this before
@@ -1635,15 +1661,17 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
{
delrels = commitrels;
ndelrels = hdr->ncommitrels;
+ delforks = commitforks;
}
else
{
delrels = abortrels;
ndelrels = hdr->nabortrels;
+ delforks = abortforks;
}
/* Make sure files supposed to be dropped are dropped */
- DropRelationFiles(delrels, ndelrels, false);
+ DropRelationFiles(delrels, delforks, ndelrels, false);
if (isCommit)
pgstat_execute_transactional_drops(hdr->ncommitstats, commitstats, false);
@@ -2337,7 +2365,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
int nchildren,
TransactionId *children,
int nrels,
- RelFileLocator *rels,
+ RelFileLocator *rels, ForkBitmap *forks,
int nstats,
xl_xact_stats_item *stats,
int ninvalmsgs,
@@ -2368,7 +2396,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
* not they do.
*/
recptr = XactLogCommitRecord(committs,
- nchildren, children, nrels, rels,
+ nchildren, children, nrels, rels, forks,
nstats, stats,
ninvalmsgs, invalmsgs,
initfileinval,
@@ -2436,6 +2464,7 @@ RecordTransactionAbortPrepared(TransactionId xid,
TransactionId *children,
int nrels,
RelFileLocator *rels,
+ ForkBitmap *forks,
int nstats,
xl_xact_stats_item *stats,
const char *gid)
@@ -2466,8 +2495,7 @@ RecordTransactionAbortPrepared(TransactionId xid,
* not they do.
*/
recptr = XactLogAbortRecord(GetCurrentTimestamp(),
- nchildren, children,
- nrels, rels,
+ nchildren, children, nrels, rels, forks,
nstats, stats,
MyXactFlags | XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK,
xid, gid);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index fa9b1185f88..caf82312708 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1319,6 +1319,7 @@ RecordTransactionCommit(void)
TransactionId latestXid = InvalidTransactionId;
int nrels;
RelFileLocator *rels;
+ ForkBitmap *forks = NULL;
int nchildren;
TransactionId *children;
int ndroppedstats = 0;
@@ -1440,7 +1441,7 @@ RecordTransactionCommit(void)
* Insert the commit XLOG record.
*/
XactLogCommitRecord(GetCurrentTransactionStopTimestamp(),
- nchildren, children, nrels, rels,
+ nchildren, children, nrels, rels, forks,
ndroppedstats, droppedstats,
nmsgs, invalMessages,
RelcacheInitFileInval,
@@ -1757,6 +1758,7 @@ RecordTransactionAbort(bool isSubXact)
TransactionId latestXid;
int nrels;
RelFileLocator *rels;
+ ForkBitmap *forks = NULL;
int ndroppedstats = 0;
xl_xact_stats_item *droppedstats = NULL;
int nchildren;
@@ -1818,7 +1820,7 @@ RecordTransactionAbort(bool isSubXact)
XactLogAbortRecord(xact_time,
nchildren, children,
- nrels, rels,
+ nrels, rels, forks,
ndroppedstats, droppedstats,
MyXactFlags, InvalidTransactionId,
NULL);
@@ -5819,7 +5821,7 @@ xactGetCommittedChildren(TransactionId **ptr)
XLogRecPtr
XactLogCommitRecord(TimestampTz commit_time,
int nsubxacts, TransactionId *subxacts,
- int nrels, RelFileLocator *rels,
+ int nrels, RelFileLocator *rels, ForkBitmap *forks,
int ndroppedstats, xl_xact_stats_item *droppedstats,
int nmsgs, SharedInvalidationMessage *msgs,
bool relcacheInval,
@@ -5887,6 +5889,9 @@ XactLogCommitRecord(TimestampTz commit_time,
xl_xinfo.xinfo |= XACT_XINFO_HAS_RELFILELOCATORS;
xl_relfilelocators.nrels = nrels;
info |= XLR_SPECIAL_REL_UPDATE;
+
+ if (forks)
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_RELFILEFORKS;
}
if (ndroppedstats > 0)
@@ -5949,6 +5954,10 @@ XactLogCommitRecord(TimestampTz commit_time,
MinSizeOfXactRelfileLocators);
XLogRegisterData((char *) rels,
nrels * sizeof(RelFileLocator));
+
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_RELFILEFORKS)
+ XLogRegisterData((char *) forks,
+ nrels * sizeof(ForkBitmap));
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_DROPPED_STATS)
@@ -5991,7 +6000,7 @@ XactLogCommitRecord(TimestampTz commit_time,
XLogRecPtr
XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
- int nrels, RelFileLocator *rels,
+ int nrels, RelFileLocator *rels, ForkBitmap *forks,
int ndroppedstats, xl_xact_stats_item *droppedstats,
int xactflags, TransactionId twophase_xid,
const char *twophase_gid)
@@ -6036,6 +6045,9 @@ XactLogAbortRecord(TimestampTz abort_time,
xl_xinfo.xinfo |= XACT_XINFO_HAS_RELFILELOCATORS;
xl_relfilelocators.nrels = nrels;
info |= XLR_SPECIAL_REL_UPDATE;
+
+ if (forks)
+ xl_xinfo.xinfo |= XACT_XINFO_HAS_RELFILEFORKS;
}
if (ndroppedstats > 0)
@@ -6102,6 +6114,10 @@ XactLogAbortRecord(TimestampTz abort_time,
MinSizeOfXactRelfileLocators);
XLogRegisterData((char *) rels,
nrels * sizeof(RelFileLocator));
+
+ if (xl_xinfo.xinfo & XACT_XINFO_HAS_RELFILEFORKS)
+ XLogRegisterData((char *) forks,
+ nrels * sizeof(ForkBitmap));
}
if (xl_xinfo.xinfo & XACT_XINFO_HAS_DROPPED_STATS)
@@ -6242,7 +6258,8 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
XLogFlush(lsn);
/* Make sure files supposed to be dropped are dropped */
- DropRelationFiles(parsed->xlocators, parsed->nrels, true);
+ DropRelationFiles(parsed->xlocators, parsed->xforks, parsed->nrels,
+ true);
}
AtEOXact_UndoLog(xid);
@@ -6356,7 +6373,8 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
*/
XLogFlush(lsn);
- DropRelationFiles(parsed->xlocators, parsed->nrels, true);
+ DropRelationFiles(parsed->xlocators, parsed->xforks, parsed->nrels,
+ true);
}
AtEOXact_UndoLog(xid);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index df114e8e0c0..10d740cc688 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -160,7 +160,7 @@ static BufMgrCleanup * cleanups = NULL; /* head of linked list */
typedef struct RelFileForks
{
RelFileLocator rloc; /* key member for qsort */
- ForkBitmap forks; /* fork number in bitmap */
+ ForkBitmap forks; /* fork numbers in bitmap */
} RelFileForks;
/* GUC variables */
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 92d77dffc92..55cc7aad73a 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1458,7 +1458,8 @@ ForgetDatabaseSyncRequests(Oid dbid)
* DropRelationFiles -- drop files of all given relations
*/
void
-DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo)
+DropRelationFiles(RelFileLocator *delrels, ForkBitmap *delforks, int ndelrels,
+ bool isRedo)
{
SMgrRelation *srels;
int i;
@@ -1472,13 +1473,17 @@ DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo)
{
ForkNumber fork;
+ /* Handle only the specified forks. */
for (fork = 0; fork <= MAX_FORKNUM; fork++)
- XLogDropRelation(delrels[i], fork);
+ {
+ if (!delforks || FORKBITMAP_ISSET(delforks[i], fork))
+ XLogDropRelation(delrels[i], fork);
+ }
}
srels[i] = srel;
}
- smgrdounlinkall(srels, NULL, ndelrels, isRedo);
+ smgrdounlinkall(srels, delforks, ndelrels, isRedo);
for (i = 0; i < ndelrels; i++)
smgrclose(srels[i]);
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index fb64d7413a2..a5b61eec8f3 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -194,6 +194,7 @@ typedef struct SavedTransactionCharacteristics
#define XACT_XINFO_HAS_AE_LOCKS (1U << 6)
#define XACT_XINFO_HAS_GID (1U << 7)
#define XACT_XINFO_HAS_DROPPED_STATS (1U << 8)
+#define XACT_XINFO_HAS_RELFILEFORKS (1U << 9)
/*
* Also stored in xinfo, these indicating a variety of additional actions that
@@ -359,7 +360,9 @@ typedef struct xl_xact_prepare
Oid owner; /* user running the transaction */
int32 nsubxacts; /* number of following subxact XIDs */
int32 ncommitrels; /* number of delete-on-commit rels */
+ bool comhasforks; /* commitrels is accompanied by fork bitmaps */
int32 nabortrels; /* number of delete-on-abort rels */
+ bool abohasforks; /* abortrels is accompanied by fork bitmaps */
int32 ncommitstats; /* number of stats to drop on commit */
int32 nabortstats; /* number of stats to drop on abort */
int32 ninvalmsgs; /* number of cache invalidation messages */
@@ -387,6 +390,7 @@ typedef struct xl_xact_parsed_commit
int nrels;
RelFileLocator *xlocators;
+ ForkBitmap *xforks;
int nstats;
xl_xact_stats_item *stats;
@@ -398,6 +402,7 @@ typedef struct xl_xact_parsed_commit
char twophase_gid[GIDSIZE]; /* only for 2PC */
int nabortrels; /* only for 2PC */
RelFileLocator *abortlocators; /* only for 2PC */
+ ForkBitmap *abortforks;
int nabortstats; /* only for 2PC */
xl_xact_stats_item *abortstats; /* only for 2PC */
@@ -420,6 +425,7 @@ typedef struct xl_xact_parsed_abort
int nrels;
RelFileLocator *xlocators;
+ ForkBitmap *xforks;
int nstats;
xl_xact_stats_item *stats;
@@ -503,6 +509,7 @@ extern int xactGetCommittedChildren(TransactionId **ptr);
extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileLocator *rels,
+ ForkBitmap *forks,
int ndroppedstats,
xl_xact_stats_item *droppedstats,
int nmsgs, SharedInvalidationMessage *msgs,
@@ -514,6 +521,7 @@ extern XLogRecPtr XactLogCommitRecord(TimestampTz commit_time,
extern XLogRecPtr XactLogAbortRecord(TimestampTz abort_time,
int nsubxacts, TransactionId *subxacts,
int nrels, RelFileLocator *rels,
+ ForkBitmap *forks,
int ndroppedstats,
xl_xact_stats_item *droppedstats,
int xactflags, TransactionId twophase_xid,
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index e7671dd6c18..46f2e44cf99 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -48,7 +48,8 @@ extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void mdregistersync(SMgrRelation reln, ForkNumber forknum);
extern void ForgetDatabaseSyncRequests(Oid dbid);
-extern void DropRelationFiles(RelFileLocator *delrels, int ndelrels, bool isRedo);
+extern void DropRelationFiles(RelFileLocator *delrels, ForkBitmap *delforks,
+ int ndelrels, bool isRedo);
/* md sync callbacks */
extern int mdsyncfiletag(const FileTag *ftag, char *path);
--
2.43.5
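As an illustration of what the xactdesc.c change above prints, here is a rough standalone sketch of the ":fork,fork" suffix that xact_desc_relations() appends when a fork array is present. It is a simplification under stated assumptions: demo_desc_relations is a hypothetical name, relations are shown as plain numbers instead of relpath strings, output goes to a fixed buffer instead of a StringInfo, and MAX_FORKNUM is assumed to be 3.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumption for this sketch: four forks, so MAX_FORKNUM is 3. */
#define MAX_FORKNUM 3

typedef uint8_t ForkBitmap;
#define FORKBITMAP_BIT(f)		(1 << (f))
#define FORKBITMAP_ISSET(m, f)	((m) & FORKBITMAP_BIT(f))

/*
 * Hypothetical helper: write " rel[:f1,f2,...]" for each relation into buf.
 * forks may be NULL, meaning no per-fork information accompanies the
 * relations.  Returns the number of characters written, or -1 if buf is
 * too small.
 */
int
demo_desc_relations(char *buf, size_t buflen, int nrels,
					const uint32_t *rels, const ForkBitmap *forks)
{
	size_t		used = 0;

	assert(buflen > 0);
	buf[0] = '\0';

	for (int i = 0; i < nrels; i++)
	{
		int			n;

		n = snprintf(buf + used, buflen - used, " %u", (unsigned) rels[i]);
		if (n < 0 || (size_t) n >= buflen - used)
			return -1;
		used += (size_t) n;

		if (forks)
		{
			char		delim = ':';	/* first fork follows a colon */

			for (int j = 0; j <= MAX_FORKNUM; j++)
			{
				if (!FORKBITMAP_ISSET(forks[i], j))
					continue;

				n = snprintf(buf + used, buflen - used, "%c%d", delim, j);
				if (n < 0 || (size_t) n >= buflen - used)
					return -1;
				used += (size_t) n;
				delim = ',';			/* later forks follow a comma */
			}
		}
	}

	return (int) used;
}
```

With two relations 16384 and 16385 and bitmaps selecting forks {0, 3} and {1}, this yields " 16384:0,3 16385:1". With a NULL fork array it yields " 16384 16385", matching the output format before the patch.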
v36-0010-Add-per-fork-deletion-support-to-pendingDeletes.patch (text/x-patch; charset=us-ascii)
From 5247fd0fdd38975b9cb60d64e77dd10ffbfef414 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 2 Aug 2024 21:39:11 +0900
Subject: [PATCH v36 10/17] Add per-fork deletion support to pendingDeletes
This patch introduces the ability to handle commit-time pending
deletes on a per-fork basis.
---
src/backend/access/transam/twophase.c | 23 ++++--
src/backend/access/transam/xact.c | 4 +-
src/backend/catalog/storage.c | 103 +++++++++++++++++++++++---
src/include/catalog/storage.h | 3 +-
4 files changed, 112 insertions(+), 21 deletions(-)
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 5c20065e408..f8518c08768 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1091,9 +1091,10 @@ StartPrepare(GlobalTransaction gxact)
TwoPhaseFileHeader hdr;
TransactionId *children;
RelFileLocator *commitrels;
- ForkBitmap *commitforks = NULL;
+ ForkBitmap *commitforks;
RelFileLocator *abortrels;
- ForkBitmap *abortforks = NULL;
+ ForkBitmap *abortforks;
+
xl_xact_stats_item *abortstats = NULL;
xl_xact_stats_item *commitstats = NULL;
SharedInvalidationMessage *invalmsgs;
@@ -1119,10 +1120,10 @@ StartPrepare(GlobalTransaction gxact)
hdr.prepared_at = gxact->prepared_at;
hdr.owner = gxact->owner;
hdr.nsubxacts = xactGetCommittedChildren(&children);
- hdr.ncommitrels = smgrGetPendingDeletes(true, &commitrels);
- hdr.comhasforks = false;
- hdr.nabortrels = smgrGetPendingDeletes(false, &abortrels);
- hdr.abohasforks = false;
+ hdr.ncommitrels = smgrGetPendingDeletes(true, &commitrels, &commitforks);
+ hdr.comhasforks = (commitforks != NULL);
+ hdr.nabortrels = smgrGetPendingDeletes(false, &abortrels, &abortforks);
+ hdr.abohasforks = (abortforks != NULL);
hdr.ncommitstats =
pgstat_get_transactional_drops(true, &commitstats);
hdr.nabortstats =
@@ -1591,7 +1592,17 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
commitrels = (RelFileLocator *) bufptr;
bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileLocator));
+ if (hdr->comhasforks)
+ {
+ commitforks = (ForkBitmap *) bufptr;
+ bufptr += MAXALIGN(hdr->ncommitrels * sizeof(ForkBitmap));
+ }
abortrels = (RelFileLocator *) bufptr;
bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileLocator));
+ if (hdr->abohasforks)
+ {
+ abortforks = (ForkBitmap *) bufptr;
+ bufptr += MAXALIGN(hdr->nabortrels * sizeof(ForkBitmap));
+ }
commitstats = (xl_xact_stats_item *) bufptr;
bufptr += MAXALIGN(hdr->ncommitstats * sizeof(xl_xact_stats_item));
abortstats = (xl_xact_stats_item *) bufptr;
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index caf82312708..ed73cda9acf 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1340,7 +1340,7 @@ RecordTransactionCommit(void)
LogLogicalInvalidations();
/* Get data needed for commit record */
- nrels = smgrGetPendingDeletes(true, &rels);
+ nrels = smgrGetPendingDeletes(true, &rels, &forks);
nchildren = xactGetCommittedChildren(&children);
ndroppedstats = pgstat_get_transactional_drops(true, &droppedstats);
if (XLogStandbyInfoActive())
@@ -1803,7 +1803,7 @@ RecordTransactionAbort(bool isSubXact)
replorigin_session_origin != DoNotReplicateId);
/* Fetch the data we need for the abort record */
- nrels = smgrGetPendingDeletes(false, &rels);
+ nrels = smgrGetPendingDeletes(false, &rels, &forks);
nchildren = xactGetCommittedChildren(&children);
ndroppedstats = pgstat_get_transactional_drops(false, &droppedstats);
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 5b20c583d16..b4495cb1ab1 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -64,6 +64,7 @@ int wal_skip_threshold = 2048; /* in kilobytes */
typedef struct PendingRelDelete
{
RelFileLocator rlocator; /* relation that may need to be deleted */
+ ForkBitmap forks; /* fork bitmap */
ProcNumber procNumber; /* INVALID_PROC_NUMBER if not a temp rel */
bool atCommit; /* T=delete at commit; F=delete at abort */
int nestLevel; /* xact nesting level of request */
@@ -187,18 +188,48 @@ RelationCreateFork(SMgrRelation srel, ForkNumber forkNum,
/* Schedule the removal of this init fork at abort if requested. */
if (undo_log)
{
- PendingRelDelete *pending;
+ bool found = false;
ulog_smgrcreate(srel, forkNum);
- pending = (PendingRelDelete *)
- MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
- pending->rlocator = srel->smgr_rlocator.locator;
- pending->procNumber = INVALID_PROC_NUMBER;
- pending->atCommit = false; /* delete if abort */
- pending->nestLevel = GetCurrentTransactionNestLevel();
- pending->next = pendingDeletes;
- pendingDeletes = pending;
+ /* Update existing entry if any */
+ for (PendingRelDelete *p = pendingDeletes ; p != NULL ; p = p->next)
+ {
+ if (!p->atCommit &&
+ RelFileLocatorEquals(srel->smgr_rlocator.locator,
+ p->rlocator))
+ {
+ Assert(p->procNumber == INVALID_PROC_NUMBER);
+ /* we mustn't have an entry when creating a main fork */
+ Assert(forkNum != MAIN_FORKNUM);
+ found = true;
+ FORKBITMAP_SET(p->forks, forkNum);
+ break;
+ }
+ }
+
+ /* Otherwise, add a new entry. */
+ if (!found)
+ {
+ PendingRelDelete *pending;
+
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->rlocator = srel->smgr_rlocator.locator;
+ /*
+ * Creating a main fork means that pending deletes must remove all
+ * forks at abort.
+ */
+ if (forkNum == MAIN_FORKNUM)
+ pending->forks = FORKBITMAP_ALLFORKS();
+ else
+ pending->forks = FORKBITMAP_BIT(forkNum);
+ pending->procNumber = INVALID_PROC_NUMBER;
+ pending->atCommit = false; /* delete if abort */
+ pending->nestLevel = GetCurrentTransactionNestLevel();
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+ }
}
/* WAL-log this creation if requested. */
@@ -292,6 +323,7 @@ RelationDropStorage(Relation rel)
pending = (PendingRelDelete *)
MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
pending->rlocator = rel->rd_locator;
+ pending->forks = FORKBITMAP_ALLFORKS();
pending->procNumber = rel->rd_backend;
pending->atCommit = true; /* delete if commit */
pending->nestLevel = GetCurrentTransactionNestLevel();
@@ -343,6 +375,8 @@ RelationPreserveStorage(RelFileLocator rlocator, bool atCommit)
if (RelFileLocatorEquals(rlocator, pending->rlocator)
&& pending->atCommit == atCommit)
{
+ Assert(pending->forks == FORKBITMAP_ALLFORKS());
+
found = true;
/* unlink and delete list entry */
@@ -696,7 +730,7 @@ SerializePendingSyncs(Size maxSize, char *startAddress)
/* remove deleted rnodes */
for (delete = pendingDeletes; delete != NULL; delete = delete->next)
- if (delete->atCommit)
+ if (delete->atCommit && delete->forks == FORKBITMAP_ALLFORKS())
(void) hash_search(tmphash, &delete->rlocator,
HASH_REMOVE, NULL);
@@ -750,6 +784,7 @@ smgrDoPendingDeletes(bool isCommit)
int nrels = 0,
maxrels = 0;
SMgrRelation *srels = NULL;
+ ForkBitmap *forks = NULL;
prev = NULL;
for (pending = pendingDeletes; pending != NULL; pending = next)
@@ -772,6 +807,8 @@ smgrDoPendingDeletes(bool isCommit)
{
SMgrRelation srel;
+ Assert(pending->forks == FORKBITMAP_ALLFORKS());
+
srel = smgropen(pending->rlocator, pending->procNumber);
/* allocate the initial array, or extend it, if needed */
@@ -784,8 +821,26 @@ smgrDoPendingDeletes(bool isCommit)
{
maxrels *= 2;
srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+
+ /* expand forks array if any */
+ if (forks)
+ forks = repalloc(forks, sizeof(ForkBitmap) * maxrels);
}
+ /* Create forks array on encountering partial forks. */
+ Assert((pending->forks & ~FORKBITMAP_ALLFORKS()) == 0);
+ if (!forks && pending->forks != FORKBITMAP_ALLFORKS())
+ {
+ forks = palloc(sizeof(ForkBitmap) * maxrels);
+
+ /* fill in the past elements */
+ for (int i = 0 ; i < nrels ; i++)
+ forks[i] = FORKBITMAP_ALLFORKS();
+ }
+
+ if (forks)
+ forks[nrels] = pending->forks;
+
srels[nrels++] = srel;
}
/* must explicitly free the list entry */
@@ -796,12 +851,15 @@ smgrDoPendingDeletes(bool isCommit)
if (nrels > 0)
{
- smgrdounlinkall(srels, NULL, nrels, false);
+ smgrdounlinkall(srels, forks, nrels, false);
for (int i = 0; i < nrels; i++)
smgrclose(srels[i]);
pfree(srels);
+
+ if (forks)
+ pfree(forks);
}
}
@@ -961,27 +1019,42 @@ smgrDoPendingSyncs(bool isCommit, bool isParallelWorker)
* by upper-level transactions.
*/
int
-smgrGetPendingDeletes(bool forCommit, RelFileLocator **ptr)
+smgrGetPendingDeletes(bool forCommit, RelFileLocator **ptr, ForkBitmap **fptr)
{
int nestLevel = GetCurrentTransactionNestLevel();
int nrels;
+ bool hasforks = false;
RelFileLocator *rptr;
+ ForkBitmap *rfptr = NULL;
PendingRelDelete *pending;
nrels = 0;
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
+ Assert((pending->forks & ~FORKBITMAP_ALLFORKS()) == 0);
+
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
&& pending->procNumber == INVALID_PROC_NUMBER)
+ {
nrels++;
+
+ if (pending->forks != FORKBITMAP_ALLFORKS())
+ hasforks = true;
+ }
}
if (nrels == 0)
{
*ptr = NULL;
+ *fptr = NULL;
return 0;
}
rptr = (RelFileLocator *) palloc(nrels * sizeof(RelFileLocator));
*ptr = rptr;
+
+ if (hasforks)
+ rfptr = (ForkBitmap *) palloc(nrels * sizeof(ForkBitmap));
+ *fptr = rfptr;
+
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit
@@ -989,6 +1062,12 @@ smgrGetPendingDeletes(bool forCommit, RelFileLocator **ptr)
{
*rptr = pending->rlocator;
rptr++;
+
+ if (rfptr)
+ {
+ *rfptr = pending->forks;
+ rfptr++;
+ }
}
}
return nrels;
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 3451d6ac80c..cd5486896a6 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -44,7 +44,8 @@ extern void RestorePendingSyncs(char *startAddress);
*/
extern void smgrDoPendingDeletes(bool isCommit);
extern void smgrDoPendingSyncs(bool isCommit, bool isParallelWorker);
-extern int smgrGetPendingDeletes(bool forCommit, RelFileLocator **ptr);
+extern int smgrGetPendingDeletes(bool forCommit, RelFileLocator **ptr,
+ ForkBitmap **fptr);
extern void AtSubCommit_smgr(void);
extern void AtSubAbort_smgr(void);
extern void PostPrepare_smgr(void);
--
2.43.5
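
The per-fork bookkeeping in the patch above can be sketched in isolation. This is only an illustrative model, not the patch's code: the `FORKBITMAP_*` macros are defined in an earlier patch of the series that is not shown here, so the definitions below are assumptions, and `PendingDelete` and `register_fork_creation` are hypothetical simplifications (the real lookup also checks `atCommit`).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed stand-ins; the real definitions come from an earlier patch. */
typedef uint8_t ForkBitmap;
enum { MAIN_FORKNUM = 0, FSM_FORKNUM, VISIBILITYMAP_FORKNUM, INIT_FORKNUM };
#define MAX_FORKNUM INIT_FORKNUM

#define FORKBITMAP_BIT(f)      ((ForkBitmap) (1u << (f)))
#define FORKBITMAP_ALLFORKS()  ((ForkBitmap) ((1u << (MAX_FORKNUM + 1)) - 1))
#define FORKBITMAP_SET(b, f)   ((b) |= FORKBITMAP_BIT(f))
#define FORKBITMAP_ISSET(b, f) (((b) & FORKBITMAP_BIT(f)) != 0)

/* Hypothetical, simplified pending-delete entry (delete-on-abort only). */
typedef struct PendingDelete
{
    unsigned    relnumber;      /* stands in for RelFileLocator */
    ForkBitmap  forks;          /* forks to remove if the xact aborts */
    struct PendingDelete *next;
} PendingDelete;

static PendingDelete *pending_list = NULL;

/*
 * Record that a fork was created.  An existing entry for the relation gets
 * the fork's bit added; otherwise a new entry is pushed.  Creating the main
 * fork means every fork must go away on abort.
 */
static ForkBitmap
register_fork_creation(unsigned relnumber, int forknum)
{
    PendingDelete *p;

    for (p = pending_list; p != NULL; p = p->next)
    {
        if (p->relnumber == relnumber)
        {
            /* a main fork is never created for an already-tracked relation */
            assert(forknum != MAIN_FORKNUM);
            FORKBITMAP_SET(p->forks, forknum);
            return p->forks;
        }
    }

    p = malloc(sizeof(PendingDelete));
    if (p == NULL)
        abort();
    p->relnumber = relnumber;
    p->forks = (forknum == MAIN_FORKNUM) ?
        FORKBITMAP_ALLFORKS() : FORKBITMAP_BIT(forknum);
    p->next = pending_list;
    pending_list = p;
    return p->forks;
}
```

With this model, an entry whose bitmap differs from `FORKBITMAP_ALLFORKS()` describes a partial delete, which is the case that makes `smgrDoPendingDeletes()` allocate the parallel `forks` array lazily.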
v36-0011-Allow-init-fork-to-be-dropped.patch (text/x-patch; charset=us-ascii)
From 13cd85d073db707c8125302e3c3bbf9ca1f7a33e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 29 Nov 2024 16:21:43 +0900
Subject: [PATCH v36 11/17] Allow init fork to be dropped
Building on features introduced in previous commits, this commit adds
the ability to drop an init fork transactionally. Dropping an init fork
is deferred until transaction commit, using the pendingDeletes
mechanism. No user-facing code is provided.
---
src/backend/catalog/storage.c | 24 ++++++++++++++++--------
1 file changed, 16 insertions(+), 8 deletions(-)
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index b4495cb1ab1..67f2b3727a9 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -373,10 +373,9 @@ RelationPreserveStorage(RelFileLocator rlocator, bool atCommit)
{
next = pending->next;
if (RelFileLocatorEquals(rlocator, pending->rlocator)
- && pending->atCommit == atCommit)
+ && pending->atCommit == atCommit
+ && FORKBITMAP_ISSET(pending->forks, MAIN_FORKNUM))
{
- Assert(pending->forks == FORKBITMAP_ALLFORKS());
-
found = true;
/* unlink and delete list entry */
@@ -807,8 +806,6 @@ smgrDoPendingDeletes(bool isCommit)
{
SMgrRelation srel;
- Assert(pending->forks == FORKBITMAP_ALLFORKS());
-
srel = smgropen(pending->rlocator, pending->procNumber);
/* allocate the initial array, or extend it, if needed */
@@ -1109,8 +1106,18 @@ AtSubCommit_smgr(void)
for (pending = pendingDeletes; pending != NULL; pending = pending->next)
{
- if (pending->nestLevel >= nestLevel)
- pending->nestLevel = nestLevel - 1;
+ if (pending->nestLevel < nestLevel)
+ {
+#ifdef USE_ASSERT_CHECKING
+ /* all the remaining entries must be of upper subtransactions */
+ for (; pending ; pending = pending->next)
+ Assert(pending->nestLevel < nestLevel);
+#endif
+ break;
+ }
+
+ /* move this entry to the immediately upper subtransaction */
+ pending->nestLevel = nestLevel - 1;
}
}
@@ -1324,10 +1331,11 @@ smgr_undoevent(ULogEvent event)
SMgrRelation reln;
ForkNumber forks[3];
BlockNumber firstblocks[3] = {0};
- int nforks = 0;
+ int nforks;
for (int i = 0 ; i < rlocs_len ; i++)
{
+ nforks = 0;
forks[nforks++] = MAIN_FORKNUM;
/*
--
2.43.5
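
The early exit added to `AtSubCommit_smgr()` above appears to rely on the pending list being ordered by nest level, newest entries first. A minimal sketch of that loop, assuming that ordering holds; `Pending` and `reparent_on_subcommit` are hypothetical names used only for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down pending entry: only the nest level matters. */
typedef struct Pending
{
    int         nestLevel;
    struct Pending *next;
} Pending;

/*
 * Move entries of the committing subtransaction to its parent.  Assumes the
 * list is sorted by non-increasing nestLevel from the head, so the scan can
 * stop at the first entry that belongs to an upper level.  Returns the
 * number of entries that were moved.
 */
static int
reparent_on_subcommit(Pending *head, int nestLevel)
{
    int         moved = 0;

    for (Pending *p = head; p != NULL; p = p->next)
    {
        if (p->nestLevel < nestLevel)
            break;              /* the rest belong to upper subtransactions */

        p->nestLevel = nestLevel - 1;
        moved++;
    }
    return moved;
}
```

If the ordering assumption were ever violated, the early exit would skip entries silently, which is presumably why the patch re-checks the remainder of the list under `USE_ASSERT_CHECKING`.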
v36-0012-Prepare-for-preventing-DML-operations-on-relatio.patch (text/x-patch; charset=us-ascii)
From 104f7c924cf461c7e31fba122cb2219da32edf08 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 15 Aug 2024 11:26:01 +0900
Subject: [PATCH v36 12/17] Prepare for preventing DML operations on relations.
Performing data manipulation on relations with in-place persistence
changes can lead to unrecoverable issues, particularly with
indexes. To prevent potential data corruption, this update sets up
mechanisms to inhibit DML operations in these cases rather than
attempting to accommodate them. No user-side code included.
---
src/backend/access/transam/xact.c | 7 ++++++
src/backend/executor/execMain.c | 5 +++-
src/backend/tcop/utility.c | 18 ++++++++++++++
src/backend/utils/cache/relcache.c | 39 +++++++++++++++++++++++++++---
src/include/access/xact.h | 2 ++
src/include/miscadmin.h | 1 +
src/include/utils/rel.h | 7 ++++++
src/include/utils/relcache.h | 1 +
8 files changed, 76 insertions(+), 4 deletions(-)
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index ed73cda9acf..b609a783464 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -86,6 +86,12 @@ bool XactDeferrable;
int synchronous_commit = SYNCHRONOUS_COMMIT_ON;
+/*
+ * Indicates whether relation persistence flipping was performed in the
+ * current transaction.
+ */
+bool XactPersistenceChanged;
+
/*
* CheckXidAlive is a xid value pointing to a possibly ongoing (sub)
* transaction. Currently, it is used in logical decoding. It's possible
@@ -2131,6 +2137,7 @@ StartTransaction(void)
s->startedInRecovery = false;
XactReadOnly = DefaultXactReadOnly;
}
+ XactPersistenceChanged = false;
XactDeferrable = DefaultXactDeferrable;
XactIsoLevel = DefaultXactIsoLevel;
forceSyncCommit = false;
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 1c12d6ebff0..c84acc048ce 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -162,7 +162,7 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
* against performing unsafe operations in parallel mode, but this gives a
* more user-friendly error message.
*/
- if ((XactReadOnly || IsInParallelMode()) &&
+ if ((XactReadOnly || XactPersistenceChanged || IsInParallelMode()) &&
!(eflags & EXEC_FLAG_EXPLAIN_ONLY))
ExecCheckXactReadOnly(queryDesc->plannedstmt);
@@ -815,6 +815,9 @@ ExecCheckXactReadOnly(PlannedStmt *plannedstmt)
continue;
PreventCommandIfReadOnly(CreateCommandName((Node *) plannedstmt));
+
+ PreventCommandIfPersistenceChanged(
+ CreateCommandName((Node *) plannedstmt), perminfo->relid);
}
if (plannedstmt->commandType != CMD_SELECT || plannedstmt->hasModifyingCTE)
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index c2ed8214ef6..ce2f259fa0c 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -411,6 +411,24 @@ PreventCommandIfReadOnly(const char *cmdname)
cmdname)));
}
+/*
+ * PreventCommandIfPersistenceChanged: throw error if a persistence change
+ * was performed on the relation in the current transaction
+ */
+void
+PreventCommandIfPersistenceChanged(const char *cmdname, Oid relid)
+{
+ Relation rel;
+
+ rel = RelationIdGetRelation(relid);
+ if (rel->rd_firstPersistenceChangeSubid != InvalidSubTransactionId)
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot execute %s on relation \"%s\" because of its persistence change in the current transaction",
+ cmdname, get_rel_name(relid)));
+ RelationClose(rel);
+}
+
/*
* PreventCommandIfParallelMode: throw error if current (sub)transaction is
* in parallel mode.
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 46a5ddfb3ae..f3a0d0b13d2 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1125,6 +1125,7 @@ retry:
relation->rd_createSubid = InvalidSubTransactionId;
relation->rd_newRelfilelocatorSubid = InvalidSubTransactionId;
relation->rd_firstRelfilelocatorSubid = InvalidSubTransactionId;
+ relation->rd_firstPersistenceChangeSubid = InvalidSubTransactionId;
relation->rd_droppedSubid = InvalidSubTransactionId;
switch (relation->rd_rel->relpersistence)
{
@@ -1888,6 +1889,7 @@ formrdesc(const char *relationName, Oid relationReltype,
relation->rd_createSubid = InvalidSubTransactionId;
relation->rd_newRelfilelocatorSubid = InvalidSubTransactionId;
relation->rd_firstRelfilelocatorSubid = InvalidSubTransactionId;
+ relation->rd_firstPersistenceChangeSubid = InvalidSubTransactionId;
relation->rd_droppedSubid = InvalidSubTransactionId;
relation->rd_backend = INVALID_PROC_NUMBER;
relation->rd_islocaltemp = false;
@@ -2691,6 +2693,7 @@ RelationRebuildRelation(Relation relation)
SWAPFIELD(SubTransactionId, rd_createSubid);
SWAPFIELD(SubTransactionId, rd_newRelfilelocatorSubid);
SWAPFIELD(SubTransactionId, rd_firstRelfilelocatorSubid);
+ SWAPFIELD(SubTransactionId, rd_firstPersistenceChangeSubid);
SWAPFIELD(SubTransactionId, rd_droppedSubid);
/* un-swap rd_rel pointers, swap contents instead */
SWAPFIELD(Form_pg_class, rd_rel);
@@ -2780,7 +2783,8 @@ static void
RelationFlushRelation(Relation relation)
{
if (relation->rd_createSubid != InvalidSubTransactionId ||
- relation->rd_firstRelfilelocatorSubid != InvalidSubTransactionId)
+ relation->rd_firstRelfilelocatorSubid != InvalidSubTransactionId ||
+ relation->rd_firstPersistenceChangeSubid != InvalidSubTransactionId)
{
/*
* New relcache entries are always rebuilt, not flushed; else we'd
@@ -2857,7 +2861,8 @@ RelationForgetRelation(Oid rid)
Assert(relation->rd_droppedSubid == InvalidSubTransactionId);
if (relation->rd_createSubid != InvalidSubTransactionId ||
- relation->rd_firstRelfilelocatorSubid != InvalidSubTransactionId)
+ relation->rd_firstRelfilelocatorSubid != InvalidSubTransactionId ||
+ relation->rd_firstPersistenceChangeSubid != InvalidSubTransactionId)
{
/*
* In the event of subtransaction rollback, we must not forget
@@ -2973,7 +2978,8 @@ RelationCacheInvalidate(bool debug_discard)
* applicable pending invalidations.
*/
if (relation->rd_createSubid != InvalidSubTransactionId ||
- relation->rd_firstRelfilelocatorSubid != InvalidSubTransactionId)
+ relation->rd_firstRelfilelocatorSubid != InvalidSubTransactionId ||
+ relation->rd_firstPersistenceChangeSubid != InvalidSubTransactionId)
continue;
relcacheInvalsReceived++;
@@ -3295,6 +3301,7 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
relation->rd_createSubid = InvalidSubTransactionId;
relation->rd_newRelfilelocatorSubid = InvalidSubTransactionId;
relation->rd_firstRelfilelocatorSubid = InvalidSubTransactionId;
+ relation->rd_firstPersistenceChangeSubid = InvalidSubTransactionId;
relation->rd_droppedSubid = InvalidSubTransactionId;
if (clear_relcache)
@@ -3410,6 +3417,7 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
relation->rd_createSubid = InvalidSubTransactionId;
relation->rd_newRelfilelocatorSubid = InvalidSubTransactionId;
relation->rd_firstRelfilelocatorSubid = InvalidSubTransactionId;
+ relation->rd_firstPersistenceChangeSubid = InvalidSubTransactionId;
relation->rd_droppedSubid = InvalidSubTransactionId;
RelationClearRelation(relation);
return;
@@ -3456,6 +3464,14 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
else
relation->rd_droppedSubid = InvalidSubTransactionId;
}
+
+ if (relation->rd_firstPersistenceChangeSubid == mySubid)
+ {
+ if (isCommit)
+ relation->rd_firstPersistenceChangeSubid = parentSubid;
+ else
+ relation->rd_firstPersistenceChangeSubid = InvalidSubTransactionId;
+ }
}
@@ -3546,6 +3562,7 @@ RelationBuildLocalRelation(const char *relname,
rel->rd_createSubid = GetCurrentSubTransactionId();
rel->rd_newRelfilelocatorSubid = InvalidSubTransactionId;
rel->rd_firstRelfilelocatorSubid = InvalidSubTransactionId;
+ rel->rd_firstPersistenceChangeSubid = InvalidSubTransactionId;
rel->rd_droppedSubid = InvalidSubTransactionId;
/*
@@ -3927,6 +3944,21 @@ RelationAssumeNewRelfilelocator(Relation relation)
EOXactListAdd(relation);
}
+/*
+ * RelationAssumePersistenceChange
+ *
+ * Code that changes relation persistence must call this. This call triggers
+ * abort-time cleanups and prevents further data manipulation on the relation.
+ */
+void
+RelationAssumePersistenceChange(Relation relation)
+{
+ XactPersistenceChanged = true;
+ relation->rd_firstPersistenceChangeSubid = GetCurrentSubTransactionId();
+
+ /* Flag relation as needing eoxact cleanup (to clear this field) */
+ EOXactListAdd(relation);
+}
/*
* RelationCacheInitialize
@@ -6413,6 +6445,7 @@ load_relcache_init_file(bool shared)
rel->rd_createSubid = InvalidSubTransactionId;
rel->rd_newRelfilelocatorSubid = InvalidSubTransactionId;
rel->rd_firstRelfilelocatorSubid = InvalidSubTransactionId;
+ rel->rd_firstPersistenceChangeSubid = InvalidSubTransactionId;
rel->rd_droppedSubid = InvalidSubTransactionId;
rel->rd_amcache = NULL;
rel->pgstat_info = NULL;
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index a5b61eec8f3..a3c470d7e7a 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -55,6 +55,8 @@ extern PGDLLIMPORT int XactIsoLevel;
extern PGDLLIMPORT bool DefaultXactReadOnly;
extern PGDLLIMPORT bool XactReadOnly;
+extern PGDLLIMPORT bool XactPersistenceChanged;
+
/* flag for logging statements in this transaction */
extern PGDLLIMPORT bool xact_is_sampled;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 3f97fcef800..69cc5adfa6c 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -307,6 +307,7 @@ extern long get_stack_depth_rlimit(void);
extern void PreventCommandIfReadOnly(const char *cmdname);
extern void PreventCommandIfParallelMode(const char *cmdname);
extern void PreventCommandDuringRecovery(const char *cmdname);
+extern void PreventCommandIfPersistenceChanged(const char *cmdname, Oid relid);
/*****************************************************************************
* pdir.h -- *
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 87002049538..a361e910509 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -108,6 +108,13 @@ typedef struct RelationData
* any value */
SubTransactionId rd_droppedSubid; /* dropped with another Subid set */
+ /*
+ * rd_firstPersistenceChangeSubid is the ID of the highest subtransaction
+ * the rel's persistence change has survived into.
+ */
+ SubTransactionId rd_firstPersistenceChangeSubid; /* highest subxact changing
+ * persistence */
+
Form_pg_class rd_rel; /* RELATION tuple */
TupleDesc rd_att; /* tuple descriptor */
Oid rd_id; /* relation's object id */
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index 8d23959e95e..d1903e7a58c 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -119,6 +119,7 @@ extern Relation RelationBuildLocalRelation(const char *relname,
*/
extern void RelationSetNewRelfilenumber(Relation relation, char persistence);
extern void RelationAssumeNewRelfilelocator(Relation relation);
+extern void RelationAssumePersistenceChange(Relation relation);
/*
* Routines for flushing/rebuilding relcache entries in various scenarios
--
2.43.5
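
The subtransaction handling of `rd_firstPersistenceChangeSubid` in `AtEOSubXact_cleanup()` above boils down to a small rule, sketched here as a pure function. The helper name is hypothetical, and the typedef is a local stand-in for the PostgreSQL one:

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the PostgreSQL definitions. */
typedef uint32_t SubTransactionId;
#define InvalidSubTransactionId ((SubTransactionId) 0)

/*
 * Value the field should take when subtransaction mySubid ends.  Only a
 * change recorded by the ending subtransaction is affected: on commit it is
 * handed to the parent, on abort it is forgotten.
 */
static SubTransactionId
persistence_subid_at_subxact_end(SubTransactionId recorded,
                                 SubTransactionId mySubid,
                                 SubTransactionId parentSubid,
                                 int isCommit)
{
    if (recorded != mySubid)
        return recorded;        /* recorded by some other subtransaction */

    return isCommit ? parentSubid : InvalidSubTransactionId;
}
```

As long as the field is valid, `PreventCommandIfPersistenceChanged()` rejects DML on the relation, so a rolled-back subtransaction lifts the restriction while a committed one keeps it until the top-level transaction ends.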
v36-0013-Add-a-new-version-of-copy_file-to-allow-overwrit.patch (text/x-patch; charset=us-ascii)
From 31dd380d54e1e05808cc65aeea8707f93e19f01b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Wed, 31 Jul 2024 18:04:48 +0900
Subject: [PATCH v36 13/17] Add a new version of copy_file to allow overwrites
In subsequent patches, it will be necessary to overwrite the existing
main fork with the init fork. To facilitate this, add a version of the
copy_file function that supports overwriting.
---
src/backend/storage/file/copydir.c | 16 +++++++++++++++-
src/include/storage/copydir.h | 2 ++
2 files changed, 17 insertions(+), 1 deletion(-)
diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
index d4fbe542077..30d0ae54ec4 100644
--- a/src/backend/storage/file/copydir.c
+++ b/src/backend/storage/file/copydir.c
@@ -115,6 +115,12 @@ copydir(const char *fromdir, const char *todir, bool recurse)
*/
void
copy_file(const char *fromfile, const char *tofile)
+{
+ copy_file_extended(fromfile, tofile, false);
+}
+
+void
+copy_file_extended(const char *fromfile, const char *tofile, bool overwrite)
{
char *buffer;
int srcfd;
@@ -122,6 +128,7 @@ copy_file(const char *fromfile, const char *tofile)
int nbytes;
off_t offset;
off_t flush_offset;
+ int dstflags;
/* Size of copy buffer (read and write requests) */
#define COPY_BUF_SIZE (8 * BLCKSZ)
@@ -150,7 +157,11 @@ copy_file(const char *fromfile, const char *tofile)
(errcode_for_file_access(),
errmsg("could not open file \"%s\": %m", fromfile)));
- dstfd = OpenTransientFile(tofile, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+ dstflags = O_RDWR | O_CREAT | PG_BINARY;
+ if (!overwrite)
+ dstflags |= O_EXCL;
+
+ dstfd = OpenTransientFile(tofile, dstflags);
if (dstfd < 0)
ereport(ERROR,
(errcode_for_file_access(),
@@ -159,6 +170,9 @@ copy_file(const char *fromfile, const char *tofile)
/*
* Do the data copying.
*/
+ if (overwrite)
+ pg_truncate(tofile, 0);
+
flush_offset = 0;
for (offset = 0;; offset += nbytes)
{
diff --git a/src/include/storage/copydir.h b/src/include/storage/copydir.h
index a25e258f479..1a430675428 100644
--- a/src/include/storage/copydir.h
+++ b/src/include/storage/copydir.h
@@ -15,5 +15,7 @@
extern void copydir(const char *fromdir, const char *todir, bool recurse);
extern void copy_file(const char *fromfile, const char *tofile);
+extern void copy_file_extended(const char *fromfile, const char *tofile,
+ bool overwrite);
#endif /* COPYDIR_H */
--
2.43.5
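
The difference between the two `copy_file` variants above comes down to how the destination is opened. A minimal POSIX sketch of that choice follows. It is not the patch's code: the patch opens through `OpenTransientFile()` and truncates separately with `pg_truncate()`, whereas this sketch swaps in `O_TRUNC` to get the same effect in one call, and `open_copy_destination` is a hypothetical helper name.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Open the destination of a file copy.  Without overwrite, O_EXCL makes the
 * call fail if the file already exists; with overwrite, an existing file is
 * reused and emptied.  Returns a file descriptor, or -1 on failure.
 */
static int
open_copy_destination(const char *path, int overwrite)
{
    int         flags = O_RDWR | O_CREAT;

    if (overwrite)
        flags |= O_TRUNC;       /* stands in for the later pg_truncate() */
    else
        flags |= O_EXCL;        /* refuse to clobber an existing file */

    return open(path, flags, 0600);
}
```

Truncating at open time avoids a window in which the destination still holds stale trailing data if the new contents are shorter than the old ones.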
v36-0014-In-place-persistance-change-to-UNLOGGED.patch (text/x-patch; charset=us-ascii)
From a93130374f99cc5a0a28cad4b54e7a62c24a6b85 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 27 Aug 2024 11:19:53 +0900
Subject: [PATCH v36 14/17] In-place persistence change to UNLOGGED
This commit enables changing the persistence of relations to UNLOGGED
without creating a new storage file. ALTER TABLE SET LOGGED will
continue to create new storage as before.
---
src/backend/access/transam/undolog.c | 32 +++-
src/backend/access/transam/xact.c | 8 +-
src/backend/catalog/storage.c | 112 ++++++++++++-
src/backend/commands/tablecmds.c | 226 ++++++++++++++++++++++-----
src/include/access/undolog.h | 2 +-
5 files changed, 331 insertions(+), 49 deletions(-)
diff --git a/src/backend/access/transam/undolog.c b/src/backend/access/transam/undolog.c
index b2fdbfcd0f9..33d1b35ae8d 100644
--- a/src/backend/access/transam/undolog.c
+++ b/src/backend/access/transam/undolog.c
@@ -757,9 +757,10 @@ UndoLog_UndoByXid(bool isCommit, TransactionId xid,
* During recovery, it should pass the target transaction ID.
*/
void
-AtEOXact_UndoLog(TransactionId xid)
+AtEOXact_UndoLog(bool isCommit, TransactionId xid)
{
FullTransactionId fxid = ULogLocal.current_xid;
+ bool redo = false;
if (TransactionIdIsValid(xid))
{
@@ -767,7 +768,8 @@ AtEOXact_UndoLog(TransactionId xid)
TransactionId oldest_xid;
TransactionId next_xid;
uint32 oldest_epoch;
-
+
+ redo = true;
LWLockAcquire(XactTruncationLock, LW_SHARED);
next_fxid = TransamVariables->nextXid;
oldest_xid = TransamVariables->oldestClogXid;
@@ -785,7 +787,31 @@ AtEOXact_UndoLog(TransactionId xid)
}
if (FullTransactionIdIsValid(fxid))
- undolog_drop_ulog(fxid);
+ {
+ UndoLogSlot *slot;
+
+ slot = undolog_find_slot(fxid, false);
+ if (slot)
+ {
+ undolog_flush_slot(slot, false);
+ LWLockRelease(&slot->lock);
+ }
+
+ if (slot || undolog_file_exists(fxid))
+ {
+ char fname[MAXPGPATH];
+ ULogContext cxt;
+
+ if (isCommit)
+ cxt = ULOGCXT_COMMIT;
+ else
+ cxt = ULOGCXT_ABORT;
+
+ UndoLogSetFilename(fname, fxid);
+ undolog_process_ulog(fname, cxt, redo);
+ undolog_drop_ulog(fxid);
+ }
+ }
}
/*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b609a783464..197fde27edc 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2480,7 +2480,7 @@ CommitTransaction(void)
AtEOXact_on_commit_actions(true);
AtEOXact_Namespace(true, is_parallel_worker);
AtEOXact_SMgr();
- AtEOXact_UndoLog(InvalidTransactionId);
+ AtEOXact_UndoLog(true, InvalidTransactionId);
AtEOXact_Files(true);
AtEOXact_ComboCid();
AtEOXact_HashTables(true);
@@ -2999,7 +2999,7 @@ AbortTransaction(void)
AtEOXact_on_commit_actions(false);
AtEOXact_Namespace(false, is_parallel_worker);
AtEOXact_SMgr();
- AtEOXact_UndoLog(InvalidTransactionId);
+ AtEOXact_UndoLog(false, InvalidTransactionId);
AtEOXact_Files(false);
AtEOXact_ComboCid();
AtEOXact_HashTables(false);
@@ -6269,7 +6269,7 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
true);
}
- AtEOXact_UndoLog(xid);
+ AtEOXact_UndoLog(true, xid);
AtEOXact_Buffers_Redo(true, xid, parsed->nsubxacts, parsed->subxacts);
if (parsed->nstats > 0)
@@ -6384,7 +6384,7 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
true);
}
- AtEOXact_UndoLog(xid);
+ AtEOXact_UndoLog(false, xid);
AtEOXact_Buffers_Redo(false, xid, parsed->nsubxacts, parsed->subxacts);
if (parsed->nstats > 0)
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 67f2b3727a9..ad8855b69b4 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -32,6 +32,8 @@
#include "miscadmin.h"
#include "storage/bulk_write.h"
#include "storage/freespace.h"
+#include "storage/copydir.h"
+#include "storage/fd.h"
#include "storage/proc.h"
#include "storage/smgr.h"
#include "utils/hsearch.h"
@@ -558,6 +560,58 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
FreeSpaceMapVacuumRange(rel, nblocks, InvalidBlockNumber);
}
+/*
+ * Reset an unlogged relation using the INIT fork, intended for use during the
+ * commit of prepared transactions. The relation is assumed to be UNLOGGED, so
+ * no WAL-logging is required.
+ */
+static void
+ResetUnloggedRelation(RelFileLocator rloc, ProcNumber backend)
+{
+ char *srcpath;
+ char *dstpath;
+ SMgrRelation srel;
+ ForkNumber forks[MAX_FORKNUM];
+ BlockNumber blocks[MAX_FORKNUM];
+ BlockNumber old_blocks[MAX_FORKNUM];
+ int nforks = 0;
+
+ srel = smgropen(rloc, backend);
+
+ Assert(smgrexists(srel, INIT_FORKNUM));
+
+ for (int i = 0 ; i <= MAX_FORKNUM ; i++)
+ {
+ if (i == INIT_FORKNUM || !smgrexists(srel, i))
+ continue;
+
+ forks[nforks] = i;
+ old_blocks[nforks] = smgrnblocks(srel, i);
+ blocks[nforks] = 0;
+ nforks++;
+ }
+
+ /*
+ * This relation is unlogged. Therefore, unlike RelationTruncate(), there
+ * is no need to call RelationPreTruncate().
+ */
+ START_CRIT_SECTION();
+ smgrtruncate(srel, forks, nforks, old_blocks, blocks);
+ END_CRIT_SECTION();
+
+ /* Note that this leaves the first segment of the main fork. */
+ for (int i = 0 ; i < nforks ; i++)
+ smgrunlink(srel, forks[i], false);
+
+ /* copy init fork to main fork */
+ srcpath = GetRelationPath(rloc.dbOid, rloc.spcOid, rloc.relNumber,
+ backend, INIT_FORKNUM);
+ dstpath = GetRelationPath(rloc.dbOid, rloc.spcOid, rloc.relNumber,
+ backend, MAIN_FORKNUM);
+ copy_file_extended(srcpath, dstpath, true);
+ fsync_fname(dstpath, false);
+}
+
/*
* RelationPreTruncate
* Perform AM-independent work before a physical truncation.
@@ -1314,8 +1368,62 @@ smgr_undo(UndoLogRecord *record, ULogContext cxt, bool redo, bool crashed)
else
elog(PANIC, "smgr_undo: unknown op code %d", info);
}
- else if (cxt == ULOGCXT_COMMIT || cxt == ULOGCXT_ABORT ||
- cxt == ULOGCXT_PREPARED)
+ else if (cxt == ULOGCXT_COMMIT)
+ {
+ Assert(record);
+ info = record->ul_info & ~ULR_INFO_MASK;
+
+ if (info == ULOG_SMGR_CREATE)
+ {
+ ul_smgr_create *ulrec = (ul_smgr_create *) ULogRecGetData(record);
+ /*
+ * If an init fork was created during recovery, the entire relation
+ * is set to be reset at recovery-end or the consistency point.
+ * Therefore, we need to drop the relation's buffers to prevent the
+ * end-of-recovery checkpoint from flushing storage files for these
+ * relations once they have been reset.
+ */
+ if (redo && ulrec->forknum == INIT_FORKNUM)
+ {
+ SMgrRelation reln;
+ int nforks;
+ ForkNumber forks[MAX_FORKNUM + 1];
+ BlockNumber firstblocks[MAX_FORKNUM + 1] = {0};
+
+ Assert(ulrec->backend == INVALID_PROC_NUMBER);
+
+ reln = smgropen(ulrec->rlocator, ulrec->backend);
+
+ nforks = 0;
+ for (int i = 0 ; i <= MAX_FORKNUM ; i++)
+ {
+ if (smgrexists(reln, i))
+ forks[nforks++] = i;
+ }
+
+ if (nforks > 0)
+ DropRelationBuffers(reln, forks, nforks, firstblocks);
+
+ smgrclose(reln);
+ }
+ else if (!redo && crashed && ulrec->forknum == INIT_FORKNUM)
+ {
+ /*
+ * The system crashed after the transaction was prepared. Now
+ * that the init fork persists, the relation needs to be
+ * reset.
+ */
+ ResetUnloggedRelation(ulrec->rlocator, ulrec->backend);
+ ereport(WARNING,
+ errmsg("unlogged relation %u/%u/%u was reset",
+ ulrec->rlocator.spcOid, ulrec->rlocator.dbOid,
+ ulrec->rlocator.relNumber),
+ errdetail("Server experienced a crash after the transaction that altered the relation was prepared."));
+ }
+
+ }
+ }
+ else if (cxt == ULOGCXT_PREPARED || cxt == ULOGCXT_ABORT)
{
/* nothing to do here */
}
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index b5766989d8e..3c58d5be464 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -5744,6 +5744,143 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
return newcmd;
}
+/*
+ * RelationChangePersistence: perform in-place persistence change of a relation
+ */
+static void
+RelationChangePersistence(AlteredTableInfo *tab, char persistence,
+ LOCKMODE lockmode)
+{
+ Relation rel;
+ Relation classRel;
+ HeapTuple tuple,
+ newtuple;
+ Datum new_val[Natts_pg_class];
+ bool new_null[Natts_pg_class],
+ new_repl[Natts_pg_class];
+ List *relids;
+ ListCell *lc_oid;
+
+ Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+ Assert(lockmode == AccessExclusiveLock);
+
+ /*
+ * Use ATRewriteTable instead of this function if the following condition
+ * is not satisfied.
+ */
+ Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+ tab->newvals == NULL && !tab->verify_new_notnull);
+
+ rel = table_open(tab->relid, lockmode);
+
+ Assert(rel->rd_rel->relpersistence != persistence);
+
+ elog(DEBUG1, "perform in-place persistence change");
+
+ /*
+ * Initially, gather all relations that require a persistence change.
+ */
+
+ /* Collect OIDs of indexes and toast relations */
+ relids = RelationGetIndexList(rel);
+ relids = lcons_oid(rel->rd_id, relids);
+
+ /* Add toast relation if any */
+ if (OidIsValid(rel->rd_rel->reltoastrelid))
+ {
+ List *toastidx;
+ Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode);
+
+ relids = lappend_oid(relids, rel->rd_rel->reltoastrelid);
+ toastidx = RelationGetIndexList(toastrel);
+ relids = list_concat(relids, toastidx);
+ pfree(toastidx);
+ table_close(toastrel, NoLock);
+ }
+
+ table_close(rel, NoLock);
+
+ /* Make changes in storage */
+ classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+ foreach(lc_oid, relids)
+ {
+ Oid reloid = lfirst_oid(lc_oid);
+ Relation r = relation_open(reloid, lockmode);
+ SMgrRelation srel;
+ bool persistent = (persistence == RELPERSISTENCE_PERMANENT);
+ bool is_index;
+
+ /*
+ * Reconstruct the storage when permanent and unlogged storage types
+ * are incompatible.
+ */
+ if (r->rd_rel->relkind == RELKIND_INDEX &&
+ !r->rd_indam->amunloggedstoragecompatible)
+ {
+ int reindex_flags;
+ ReindexParams params = {0};
+
+ /* reindex doesn't allow concurrent use of the index */
+ table_close(r, NoLock);
+
+ reindex_flags =
+ REINDEX_REL_SUPPRESS_INDEX_USE |
+ REINDEX_REL_CHECK_CONSTRAINTS;
+
+ /* Set the same persistence as the parent relation. */
+ if (persistent)
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT;
+ else
+ reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED;
+
+ /* this doesn't fire the REINDEX event trigger */
+ reindex_index(NULL, reloid, reindex_flags, persistence, &params);
+
+ continue;
+ }
+
+ /* Currently, only allowing changes to UNLOGGED. */
+ Assert(!persistent);
+
+ RelationAssumePersistenceChange(r);
+
+ /* switch buffer persistence */
+ srel = RelationGetSmgr(r);
+ log_smgrbufpersistence(srel->smgr_rlocator.locator, persistent);
+ SetRelationBuffersPersistence(srel, persistent);
+
+ /* then create the init fork */
+ is_index = (r->rd_rel->relkind == RELKIND_INDEX);
+ RelationCreateFork(srel, INIT_FORKNUM, !is_index, true);
+ if (is_index)
+ r->rd_indam->ambuildempty(r);
+
+ /* Update catalog */
+ tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+ memset(new_val, 0, sizeof(new_val));
+ memset(new_null, false, sizeof(new_null));
+ memset(new_repl, false, sizeof(new_repl));
+
+ new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+ new_null[Anum_pg_class_relpersistence - 1] = false;
+ new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+ newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+ new_val, new_null, new_repl);
+
+ CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+ heap_freetuple(newtuple);
+
+ table_close(r, NoLock);
+ }
+
+ table_close(classRel, NoLock);
+}
+
/*
* ATRewriteTables: ALTER TABLE phase 3
*/
@@ -5876,48 +6013,59 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- /*
- * Create transient table that will receive the modified data.
- *
- * Ensure it is marked correctly as logged or unlogged. We have
- * to do this here so that buffers for the new relfilenumber will
- * have the right persistence set, and at the same time ensure
- * that the original filenumbers's buffers will get read in with
- * the correct setting (i.e. the original one). Otherwise a
- * rollback after the rewrite would possibly result with buffers
- * for the original filenumbers having the wrong persistence
- * setting.
- *
- * NB: This relies on swap_relation_files() also swapping the
- * persistence. That wouldn't work for pg_class, but that can't be
- * unlogged anyway.
- */
- OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
- persistence, lockmode);
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE &&
+ persistence == RELPERSISTENCE_UNLOGGED)
+ {
+ /* Make in-place persistence change. */
+ RelationChangePersistence(tab, persistence, lockmode);
+ }
+ else
+ {
+ /*
+ * Create transient table that will receive the modified data.
+ *
+ * Ensure it is marked correctly as logged or unlogged. We
+ * have to do this here so that buffers for the new
+ * relfilenumber will have the right persistence set, and at
+ * the same time ensure that the original filenumbers's buffers
+ * will get read in with the correct setting (i.e. the original
+ * one). Otherwise a rollback after the rewrite would possibly
+ * result with buffers for the original filenumbers having the
+ * wrong persistence setting.
+ *
+ * NB: This relies on swap_relation_files() also swapping the
+ * persistence. That wouldn't work for pg_class, but that can't
+ * be unlogged anyway.
+ */
+ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace,
+ NewAccessMethod,
+ persistence, lockmode);
- /*
- * Copy the heap data into the new table with the desired
- * modifications, and test the current data within the table
- * against new constraints generated by ALTER TABLE commands.
- */
- ATRewriteTable(tab, OIDNewHeap, lockmode);
+ /*
+ * Copy the heap data into the new table with the desired
+ * modifications, and test the current data within the table
+ * against new constraints generated by ALTER TABLE commands.
+ */
+ ATRewriteTable(tab, OIDNewHeap, lockmode);
- /*
- * Swap the physical files of the old and new heaps, then rebuild
- * indexes and discard the old heap. We can use RecentXmin for
- * the table's new relfrozenxid because we rewrote all the tuples
- * in ATRewriteTable, so no older Xid remains in the table. Also,
- * we never try to swap toast tables by content, since we have no
- * interest in letting this code work on system catalogs.
- */
- finish_heap_swap(tab->relid, OIDNewHeap,
- false, false, true,
- !OidIsValid(tab->newTableSpace),
- RecentXmin,
- ReadNextMultiXactId(),
- persistence);
+ /*
+ * Swap the physical files of the old and new heaps, then
+ * rebuild indexes and discard the old heap. We can use
+ * RecentXmin for the table's new relfrozenxid because we
+ * rewrote all the tuples in ATRewriteTable, so no older Xid
+ * remains in the table. Also, we never try to swap toast
+ * tables by content, since we have no interest in letting this
+ * code work on system catalogs.
+ */
+ finish_heap_swap(tab->relid, OIDNewHeap,
+ false, false, true,
+ !OidIsValid(tab->newTableSpace),
+ RecentXmin,
+ ReadNextMultiXactId(),
+ persistence);
- InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
+ }
}
else if (tab->rewrite > 0 && tab->relkind == RELKIND_SEQUENCE)
{
diff --git a/src/include/access/undolog.h b/src/include/access/undolog.h
index 19badc852a0..857a845eb9d 100644
--- a/src/include/access/undolog.h
+++ b/src/include/access/undolog.h
@@ -80,7 +80,7 @@ extern Size UndoLogShmemSize(void);
extern void UndoLogShmemInit(void);
extern void InitUndoLog(void);
extern void UndoLogWrite(RmgrId rmgr, uint8 info, void *data, int len);
-extern void AtEOXact_UndoLog(TransactionId xid);
+extern void AtEOXact_UndoLog(bool isCommit, TransactionId xid);
extern void AtPrepare_UndoLog(void);
extern void UndoLog_UndoByXid(bool isCommit, TransactionId xid,
int nchildren, TransactionId *children);
--
2.43.5
v36-0015-Add-test-for-ALTER-TABLE-UNLOGGED.patch (text/x-patch; charset=us-ascii)
From 254cc1dfa1321f9ea92c2530f28722f5df67c677 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 15 Aug 2024 16:06:34 +0900
Subject: [PATCH v36 15/17] Add test for ALTER TABLE UNLOGGED
---
src/test/recovery/t/044_persistence_change.pl | 511 ++++++++++++++++++
1 file changed, 511 insertions(+)
create mode 100644 src/test/recovery/t/044_persistence_change.pl
diff --git a/src/test/recovery/t/044_persistence_change.pl b/src/test/recovery/t/044_persistence_change.pl
new file mode 100644
index 00000000000..ad1b444cb46
--- /dev/null
+++ b/src/test/recovery/t/044_persistence_change.pl
@@ -0,0 +1,511 @@
+# Copyright (c) 2023-2024, PostgreSQL Global Development Group
+#
+# Test in-place relation persistence changes
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+my @relnames = ('t', 'i_bt', 'i_gin', 'i_gist', 'i_hash', 'i_brin', 'i_spgist');
+my @noninplace_names = ('i_gist');
+
+# This feature works differently by wal_level.
+run_test('minimal');
+run_test('replica');
+done_testing();
+
+sub run_test
+{
+ my ($wal_level) = @_;
+
+ note "## run with wal_level = $wal_level";
+
+ # Initialize primary node.
+ my $node = PostgreSQL::Test::Cluster->new("node_$wal_level");
+ $node->init;
+ # Prevent checkpoints from running during the test
+ $node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+checkpoint_timeout = '24h'
+max_prepared_transactions = 2
+ ));
+ $node->start;
+
+ my $datadir = $node->data_dir;
+ my $datoid = $node->safe_psql('postgres',
+ q/SELECT oid FROM pg_database WHERE datname = current_database()/);
+ my $dbdir = $node->data_dir . "/base/$datoid";
+
+ # Create a table and indexes of built-in kinds
+ $node->psql('postgres', qq(
+ CREATE TABLE t (bt int, gin int[], gist point, hash int,
+ brin int, spgist point);
+ CREATE INDEX i_bt ON t USING btree (bt);
+ CREATE INDEX i_gin ON t USING gin (gin);
+ CREATE INDEX i_gist ON t USING gist (gist);
+ CREATE INDEX i_hash ON t USING hash (hash);
+ CREATE INDEX i_brin ON t USING brin (brin);
+ CREATE INDEX i_spgist ON t USING spgist (spgist);));
+
+ my $relfilenodes1 = getrelfilenodes($node, \@relnames);
+
+ # the number must correspond to the size of @relnames above
+ is (scalar(keys %{$relfilenodes1}), 7, "number of relations is correct");
+
+ # check initial state
+ ok (check_storage_state(\&is_logged_state, $node, \@relnames),
+ "storages are in logged state");
+
+ # Normal crash-recovery of LOGGED tables
+ $node->stop('immediate');
+ $node->start;
+
+ # Insert data 0 to 1999
+ $node->psql('postgres', insert_data_query(0, 2000));
+
+ # Check if the data survives a crash
+ $node->stop('immediate');
+ $node->start;
+ is ($node->safe_psql('postgres', "SELECT count(*) FROM t;"), 2000,
+ "data loss check: crash with LOGGED table");
+
+ # Change the table to UNLOGGED then commit.
+ $node->psql('postgres', 'ALTER TABLE t SET UNLOGGED');
+
+ # Check if SET UNLOGGED above didn't change relfilenumbers.
+ my $relfilenodes2 = getrelfilenodes($node, \@relnames);
+ ok (checkrelfilenodes($relfilenodes1, $relfilenodes2),
+ "relfilenumber transition is as expected after SET UNLOGGED");
+
+ # check init-file state
+ ok (check_storage_state(\&is_unlogged_state, $node, \@relnames),
+ "storages are in unlogged state");
+
+ # Check if the table is reset through recovery.
+ $node->stop('immediate');
+ $node->start;
+ is ($node->safe_psql('postgres', "SELECT count(*) FROM t;"), 0,
+ "table data is reset through recovery");
+
+ # check reset state
+ ok (check_storage_state(\&is_reset_state, $node, \@relnames),
+ "storages are in reset state");
+
+ # Insert data 0 to 1999, then set persistence to LOGGED then crash.
+ $node->psql('postgres', insert_data_query(0, 2000));
+ $node->psql('postgres', qq(ALTER TABLE t SET LOGGED));
+ $node->stop('immediate');
+ $node->start;
+
+ # Check if SET LOGGED didn't change relfilenumbers and data survive a crash
+ my $relfilenodes3 = getrelfilenodes($node, \@relnames);
+ ok (!checkrelfilenodes($relfilenodes2, $relfilenodes3),
+ "crashed SET-LOGGED relations have sane relfilenodes transition");
+
+ is ($node->safe_psql('postgres', "SELECT count(*) FROM t;"), 2000,
+ "crashed SET-LOGGED table does not lose data");
+
+ # Change to UNLOGGED then insert data, then shutdown normally.
+ $node->psql('postgres', 'ALTER TABLE t SET UNLOGGED');
+ $node->psql('postgres', insert_data_query(2000, 2000)); # 2000 - 3999
+ $node->stop;
+ $node->start;
+ is ($node->safe_psql('postgres', "SELECT count(*) FROM t;"), 4000,
+ "UNLOGGED table does not lose data after graceful restart");
+
+ # Test for mid-transaction change to LOGGED and crash.
+ # Now, the table has data 0-3999
+ $node->psql('postgres', insert_data_query(4000, 2000)); # 4000 - 5999
+
+ my $sess = $node->interactive_psql('postgres');
+ $sess->set_query_timer_restart();
+ $sess->query('BEGIN; ALTER TABLE t SET LOGGED');
+ $sess->query(insert_data_query(6000, 2000)); # 6000-7999, no commit
+ $node->stop('immediate');
+ $sess->quit;
+ $node->start;
+ is ($node->safe_psql('postgres', "SELECT count(*) FROM t;"), 0,
+ "table is reset after in-transaction SET-LOGGED then insert");
+ ok (check_storage_state(\&is_unlogged_state, $node, \@relnames),
+ "storages are reverted to unlogged state");
+
+ # Test for mid-transaction change to UNLOGGED and crash.
+ # Now, the table has no data
+ $node->psql('postgres', 'ALTER TABLE t SET LOGGED');
+ $node->psql('postgres', insert_data_query(0, 2000)); # 0 - 1999
+ $sess = $node->interactive_psql('postgres');
+ $sess->set_query_timer_restart();
+ $sess->query('BEGIN; ALTER TABLE t SET UNLOGGED');
+ $sess->query(insert_data_query(2000, 2000)); # 2000-3999, no commit
+ $node->stop('immediate');
+ $sess->quit;
+ $node->start;
+ is ($node->safe_psql('postgres', "SELECT count(*) FROM t;"), 2000,
+ "table is reset after in-transaction SET-UNLOGGED then insert");
+ ok (check_storage_state(\&is_logged_state, $node, \@relnames),
+ "storages are reverted to logged state");
+
+ ### Subtransactions
+ ok ($node->psql('postgres',
+ qq(
+ BEGIN;
+ ALTER TABLE t SET UNLOGGED; -- committed
+ SAVEPOINT a;
+ ALTER TABLE t SET LOGGED; -- aborted
+ SAVEPOINT b;
+ ROLLBACK TO a;
+ COMMIT;
+ )) != 3,
+ "command succeeds 1");
+
+ is ($node->safe_psql('postgres', "SELECT count(*) FROM t;"), 2000,
+ "table data is not changed 1");
+ ok (check_storage_state(\&is_unlogged_state, $node, \@relnames),
+ "storages are changed to unlogged state");
+
+ ok ($node->psql('postgres',
+ qq(
+ BEGIN;
+ ALTER TABLE t SET LOGGED; -- aborted
+ SAVEPOINT a;
+ ALTER TABLE t SET UNLOGGED; -- aborted
+ SAVEPOINT b;
+ RELEASE a;
+ ROLLBACK;
+ )) != 3,
+ "command succeeds 2");
+
+ is ($node->safe_psql('postgres', "SELECT count(*) FROM t;"), 2000,
+ "table data is not changed 2");
+ ok (check_storage_state(\&is_unlogged_state, $node, \@relnames),
+ "storages stay in unlogged state");
+
+ ### Prepared transactions
+ my ($ret, $stdout, $stderr) =
+ $node->psql('postgres',
+ qq(
+ ALTER TABLE t SET LOGGED;
+ BEGIN;
+ ALTER TABLE t SET UNLOGGED;
+ PREPARE TRANSACTION 'a';
+ COMMIT PREPARED 'a';
+ ));
+ ok ($ret == 0, "prepare persistence-flipped xact");
+ ok (check_storage_state(\&is_unlogged_state, $node, \@relnames),
+ "storages are in unlogged state");
+
+ ($ret, $stdout, $stderr) =
+ $node->psql('postgres',
+ qq(
+ ALTER TABLE t SET LOGGED;
+ BEGIN;
+ SAVEPOINT a;
+ ALTER TABLE t SET UNLOGGED;
+ PREPARE TRANSACTION 'a';
+ ROLLBACK PREPARED 'a';
+ ));
+ ok ($ret == 0, "prepare persistence-flipped xact 2");
+ ok (check_storage_state(\&is_logged_state, $node, \@relnames),
+ "storages stay in logged state");
+
+ ### Error out DML
+ $node->psql('postgres',
+ qq(
+ BEGIN;
+ ALTER TABLE t SET LOGGED;
+ INSERT INTO t VALUES(1); -- Succeeds
+ COMMIT;
+ ));
+
+ ($ret, $stdout, $stderr) =
+ $node->psql('postgres',
+ qq(
+ BEGIN;
+ ALTER TABLE t SET UNLOGGED;
+ INSERT INTO t VALUES(2); -- ERROR
+ ));
+ ok ($stderr =~ m/cannot execute INSERT on relation/,
+ "errors out when DML is issued after persistence toggling");
+
+ ok ($node->psql('postgres',
+ qq(
+ BEGIN;
+ SAVEPOINT a;
+ ALTER TABLE t SET UNLOGGED;
+ ROLLBACK TO a;
+ INSERT INTO t VALUES(3); -- Succeeds
+ COMMIT;
+ )) != 3,
+ "insert after rolled-back persistence change succeeds");
+
+ ($ret, $stdout, $stderr) =
+ $node->psql('postgres',
+ qq(
+ BEGIN;
+ SAVEPOINT a;
+ ALTER TABLE t SET UNLOGGED;
+ RELEASE a;
+ UPDATE t SET bt = bt + 1; -- ERROR
+ ));
+ ok ($stderr =~ m/cannot execute UPDATE on relation/,
+ "errors out when DML is issued after persistence toggling in subxact");
+
+ $node->stop;
+ $node->teardown_node;
+}
+
+#==== helper routines
+
+# Generates a query to insert data from $st to $st + $num - 1
+sub insert_data_query
+{
+ my ($st, $num) = @_;
+ my $ed = $st + $num - 1;
+ my $query = qq(
+INSERT INTO t
+ (SELECT i, ARRAY[i, i * 2], point(i, i * 2), i, i, point(i, i)
+ FROM generate_series($st, $ed) i);
+);
+ return $query;
+}
+
+sub check_indexes
+{
+ my ($node, $st, $ed) = @_;
+ my $num_data = $ed - $st;
+
+ is ($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO true;
+ SET enable_indexscan TO false;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE bt = i)),
+ $num_data, "heap is not broken");
+ is ($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE bt = i)),
+ $num_data, "btree is not broken");
+ is ($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE gin = ARRAY[i, i * 2];)),
+ $num_data, "gin is not broken");
+ is ($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE gist <@ box(point(i-0.5, i*2-0.5),point(i+0.5, i*2+0.5));)),
+ $num_data, "gist is not broken");
+ is ($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE hash = i;)),
+ $num_data, "hash is not broken");
+ is ($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE brin = i;)),
+ $num_data, "brin is not broken");
+ is ($node->safe_psql('postgres', qq(
+ SET enable_seqscan TO false;
+ SET enable_indexscan TO true;
+ SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+ WHERE spgist <@ box(point(i-0.5,i-0.5),point(i+0.5,i+0.5));)),
+ $num_data, "spgist is not broken");
+}
+
+sub getrelfilenodes
+{
+ my ($node, $relnames) = @_;
+
+ my $result = $node->safe_psql('postgres',
+ 'SELECT relname, relfilenode FROM pg_class
+ WHERE relname
+ IN (\'' .
+ join("','", @{$relnames}).
+ '\') ORDER BY oid');
+
+ my %relfilenodes;
+
+ foreach my $l (split(/\n/, $result))
+ {
+ die "unexpected format: $l" if ($l !~ /^([^|]+)\|([0-9]+)$/);
+ $relfilenodes{$1} = $2;
+ }
+
+ return \%relfilenodes;
+}
+
+sub checkrelfilenodes
+{
+ my ($rnodes1, $rnodes2) = @_;
+ my $result = 1;
+
+ foreach my $n (keys %{$rnodes1})
+ {
+ if (grep { $n eq $_ } @noninplace_names)
+ {
+ if ($rnodes1->{$n} == $rnodes2->{$n})
+ {
+ $result = 0;
+ note sprintf("$n: relfilenode is not changed: %d",
+ $rnodes1->{$n});
+ }
+ }
+ else
+ {
+ if ($rnodes1->{$n} != $rnodes2->{$n})
+ {
+ $result = 0;
+ note sprintf("$n: relfilenode is changed: %d => %d",
+ $rnodes1->{$n}, $rnodes2->{$n});
+ }
+ }
+ }
+ return $result;
+}
+
+sub getfilenames
+{
+ my ($dirname) = @_;
+
+ my $dir = opendir(my $dh, $dirname) or die "could not open $dirname: $!";
+ my @f = readdir($dh);
+ closedir($dh);
+
+ my @result = grep {$_ !~ /^\.\.?$/} @f;
+
+ return \@result;
+}
+
+sub init_fork_exists
+{
+ my ($relfilenodes, $datafiles, $relname) = @_;
+
+ my $relfnumber = ${$relfilenodes}{$relname};
+ my $init_exists = grep {/^${relfnumber}_init$/} @{$datafiles};
+
+ return $init_exists;
+}
+
+sub noninit_forks_exist
+{
+ my ($relfilenodes, $datafiles, $relname) = @_;
+
+ my $relfnumber = ${$relfilenodes}{$relname};
+ my $noninit_exists = grep {/^${relfnumber}(_(?!init).*)?$/} @{$datafiles};
+
+ return $noninit_exists;
+}
+
+sub is_logged_state
+{
+ my ($node, $relfilenodes, $datafiles, $relname) = @_;
+
+ my $relfnumber = ${$relfilenodes}{$relname};
+ my $init_exists = grep {/^${relfnumber}_init$/} @{$datafiles};
+ my $main_exists = grep {/^${relfnumber}$/} @{$datafiles};
+ my $persistence = $node->safe_psql('postgres',
+ qq(
+ SELECT relpersistence FROM pg_class WHERE relname = '$relname'
+ ));
+
+ if ($init_exists || !$main_exists || $persistence ne 'p')
+ {
+ # note the state if this test failed
+ note "## is_logged_state:($relname): \$init_exists=$init_exists, \$main_exists=$main_exists, \$persistence='$persistence'\n";
+ return 0 ;
+ }
+
+ return 1;
+}
+
+sub is_unlogged_state
+{
+ my ($node, $relfilenodes, $datafiles, $relname) = @_;
+
+ my $relfnumber = ${$relfilenodes}{$relname};
+ my $init_exists = grep {/^${relfnumber}_init$/} @{$datafiles};
+ my $main_exists = grep {/^${relfnumber}$/} @{$datafiles};
+ my $persistence = $node->safe_psql('postgres',
+ qq(
+ SELECT relpersistence FROM pg_class WHERE relname = '$relname'
+ ));
+
+ if (!$init_exists || !$main_exists || $persistence ne 'u')
+ {
+ # note the state if this test failed
+ note "is_unlogged_state:($relname): \$init_exists=$init_exists, \$main_exists=$main_exists, \$persistence='$persistence'\n";
+ return 0 ;
+ }
+
+ return 1;
+}
+
+sub is_reset_state
+{
+ my ($node, $relfilenodes, $datafiles, $relname) = @_;
+
+ my $datoid = $node->safe_psql('postgres',
+ q/SELECT oid FROM pg_database WHERE datname = current_database()/);
+ my $dbdir = $node->data_dir . "/base/$datoid";
+ my $relfnumber = ${$relfilenodes}{$relname};
+ my $init_exists = grep {/^${relfnumber}_init$/} @{$datafiles};
+ my $main_exists = grep {/^${relfnumber}$/} @{$datafiles};
+ my $others_not_exist = !grep {/^${relfnumber}_(?!init).*$/} @{$datafiles};
+ my $persistence = $node->safe_psql('postgres',
+ qq(
+ SELECT relpersistence FROM pg_class WHERE relname = '$relname'
+ ));
+
+ if (!$init_exists || !$main_exists || !$others_not_exist ||
+ $persistence ne 'u')
+ {
+ # note the state if this test failed
+ note "## is_reset_state:($relname): \$init_exists=$init_exists, \$main_exists=$main_exists, \$others_not_exist=$others_not_exist, \$persistence='$persistence'\n";
+ return 0 ;
+ }
+
+ my $main_file = "$dbdir/${relfnumber}";
+ my $init_file = "$dbdir/${relfnumber}_init";
+ my $main_file_size = -s $main_file;
+ my $init_file_size = -s $init_file;
+
+ if ($main_file_size != $init_file_size)
+ {
+ note "## is_reset_state:($relname): \$main_file='$main_file', size=$main_file_size, \$init_file='$init_file', size=$init_file_size\n";
+ return 0;
+ }
+
+ return 1;
+}
+
+sub check_storage_state
+{
+ my ($func, $node, $relnames) = @_;
+ my $relfilenodes = getrelfilenodes($node, $relnames);
+ my $datoid = $node->safe_psql('postgres',
+ q/SELECT oid FROM pg_database WHERE datname = current_database()/);
+ my $dbdir = $node->data_dir . "/base/$datoid";
+ my $datafiles = getfilenames($dbdir);
+ my $result = 1;
+
+ foreach my $relname (@{$relnames})
+ {
+ if (!$func->($node, $relfilenodes, $datafiles, $relname))
+ {
+ $result = 0;
+
+ ## do not return immediately, run this test for all
+ ## relations to leave diagnosis information in the log
+ ## file.
+ }
+ }
+
+ return $result;
+}
--
2.43.5
v36-0016-Add-function-RelationDropInitFork.patch (text/x-patch; charset=us-ascii)
From 90114d840582ebcbb186effc8cf77d63f2f59384 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 29 Nov 2024 16:22:28 +0900
Subject: [PATCH v36 16/17] Add function RelationDropInitFork
This commit introduces a function to drop an init fork during command
execution. The function is prepared as a prerequisite for the
following commits.
---
src/backend/catalog/storage.c | 24 ++++++++++++++++++++++++
src/include/catalog/storage.h | 1 +
2 files changed, 25 insertions(+)
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index ad8855b69b4..82a0e50ec41 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -241,6 +241,30 @@ RelationCreateFork(SMgrRelation srel, ForkNumber forkNum,
smgrcreate(srel, forkNum, false);
}
+/*
+ * RelationDropInitFork
+ * Delete physical storage for the init fork of a relation.
+ */
+void
+RelationDropInitFork(SMgrRelation srel)
+{
+ int nestLevel = GetCurrentTransactionNestLevel();
+ RelFileLocator rlocator = srel->smgr_rlocator.locator;
+ ProcNumber procNumber = srel->smgr_rlocator.backend;
+ PendingRelDelete *pending;
+
+ /* Schedule the removal of this init fork at commit. */
+ pending = (PendingRelDelete *)
+ MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+ pending->rlocator = rlocator;
+ pending->forks = FORKBITMAP_BIT(INIT_FORKNUM);
+ pending->procNumber = procNumber;
+ pending->atCommit = true; /* delete if commit */
+ pending->nestLevel = nestLevel;
+ pending->next = pendingDeletes;
+ pendingDeletes = pending;
+}
+
/*
* Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index cd5486896a6..d69cd46551b 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -27,6 +27,7 @@ extern SMgrRelation RelationCreateStorage(RelFileLocator rlocator,
bool register_delete);
extern void RelationCreateFork(SMgrRelation srel, ForkNumber forkNum,
bool wal_log, bool undo_log);
+extern void RelationDropInitFork(SMgrRelation srel);
extern void RelationDropStorage(Relation rel);
extern void RelationPreserveStorage(RelFileLocator rlocator, bool atCommit);
extern void RelationPreTruncate(Relation rel);
--
2.43.5
v36-0017-In-place-persistence-change-to-LOGGED.patch (text/x-patch; charset=us-ascii)
From 0c3ce6aebbb22bb361fe029241475b08e870755b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 27 Aug 2024 10:44:46 +0900
Subject: [PATCH v36 17/17] In-place persistence change to LOGGED
---
src/backend/commands/tablecmds.c | 27 +++++++-----
src/test/recovery/t/044_persistence_change.pl | 43 ++++++++++---------
2 files changed, 40 insertions(+), 30 deletions(-)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 3c58d5be464..0eae3430fe5 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -5840,9 +5840,6 @@ RelationChangePersistence(AlteredTableInfo *tab, char persistence,
continue;
}
- /* Currently, only allowing changes to UNLOGGED. */
- Assert(!persistent);
-
RelationAssumePersistenceChange(r);
/* switch buffer persistence */
@@ -5850,11 +5847,22 @@ RelationChangePersistence(AlteredTableInfo *tab, char persistence,
log_smgrbufpersistence(srel->smgr_rlocator.locator, persistent);
SetRelationBuffersPersistence(srel, persistent);
- /* then create the init fork */
- is_index = (r->rd_rel->relkind == RELKIND_INDEX);
- RelationCreateFork(srel, INIT_FORKNUM, !is_index, true);
- if (is_index)
- r->rd_indam->ambuildempty(r);
+ /* then create or drop the init fork */
+ if (persistent)
+ RelationDropInitFork(srel);
+ else
+ {
+ is_index = (r->rd_rel->relkind == RELKIND_INDEX);
+
+ /*
+ * If it is an index, have the access method initialize the file. In
+ * that case, WAL-logging is expected to be performed by the
+ * ambuildempty() method.
+ */
+ RelationCreateFork(srel, INIT_FORKNUM, !is_index, true);
+ if (is_index)
+ r->rd_indam->ambuildempty(r);
+ }
/* Update catalog */
tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
@@ -6013,8 +6021,7 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
tab->relid,
tab->rewrite);
- if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE &&
- persistence == RELPERSISTENCE_UNLOGGED)
+ if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE)
{
/* Make in-place persistence change. */
RelationChangePersistence(tab, persistence, lockmode);
diff --git a/src/test/recovery/t/044_persistence_change.pl b/src/test/recovery/t/044_persistence_change.pl
index ad1b444cb46..24da84d562f 100644
--- a/src/test/recovery/t/044_persistence_change.pl
+++ b/src/test/recovery/t/044_persistence_change.pl
@@ -100,8 +100,8 @@ max_prepared_transactions = 2
# Check if SET LOGGED didn't change relfilenumbers and data survive a crash
my $relfilenodes3 = getrelfilenodes($node, \@relnames);
- ok (!checkrelfilenodes($relfilenodes2, $relfilenodes3),
- "crashed SET-LOGGED relations have sane relfilenodes transition");
+ ok (checkrelfilenodes($relfilenodes2, $relfilenodes3),
+ "crashed SET-LOGGED relations have sane relfilenodes transition");
is ($node->safe_psql('postgres', "SELECT count(*) FROM t;"), 2000,
"crashed SET-LOGGED table does not lose data");
@@ -147,34 +147,35 @@ max_prepared_transactions = 2
"storages are reverted to logged state");
### Subtransactions
- ok ($node->psql('postgres',
+ my ($ret, $stdout, $stderr) =
+ $node->psql('postgres',
qq(
BEGIN;
ALTER TABLE t SET UNLOGGED; -- committed
SAVEPOINT a;
- ALTER TABLE t SET LOGGED; -- aborted
+ ALTER TABLE t SET LOGGED; -- ERROR
SAVEPOINT b;
ROLLBACK TO a;
COMMIT;
- )) != 3,
- "command succeeds 1");
-
- is ($node->safe_psql('postgres', "SELECT count(*) FROM t;"), 2000,
- "table data is not changed 1");
- ok (check_storage_state(\&is_unlogged_state, $node, \@relnames),
- "storages are changed to unlogged state");
+ ));
+ ok ($stderr =~ m/persistence of this relation has been already changed/,
+ "errors out when double flip occurred in a single transaction");
+ ok (check_storage_state(\&is_logged_state, $node, \@relnames),
+ "storages stay in logged state");
ok ($node->psql('postgres',
qq(
+ ALTER TABLE t SET UNLOGGED;
BEGIN;
+ SAVEPOINT a;
ALTER TABLE t SET LOGGED; -- aborted
+ ROLLBACK TO a;
SAVEPOINT a;
- ALTER TABLE t SET UNLOGGED; -- aborted
- SAVEPOINT b;
+ ALTER TABLE t SET LOGGED; -- no error
RELEASE a;
ROLLBACK;
)) != 3,
- "command succeeds 2");
+ "rolled-back persistence flip doesn't prevent subsequent flips");
is ($node->safe_psql('postgres', "SELECT count(*) FROM t;"), 2000,
"table data is not changed 2");
@@ -182,7 +183,7 @@ max_prepared_transactions = 2
"storages stay in unlogged state");
### Prepared transactions
- my ($ret, $stdout, $stderr) =
+ ($ret, $stdout, $stderr) =
$node->psql('postgres',
qq(
ALTER TABLE t SET LOGGED;
@@ -207,16 +208,17 @@ max_prepared_transactions = 2
));
ok ($ret == 0, "prepare persistence-flipped xact 2");
ok (check_storage_state(\&is_logged_state, $node, \@relnames),
- "storages stay in logged state");
+ "storages stay in logged state 2");
### Error out DML
- $node->psql('postgres',
+ ok($node->psql('postgres',
qq(
BEGIN;
- ALTER TABLE t SET LOGGED;
+ ALTER TABLE t SET LOGGED; -- no effect
INSERT INTO t VALUES(1); -- Succeeds
COMMIT;
- ));
+ )) != 3,
+ "ineffective persistence change doesn't prevent DML");
($ret, $stdout, $stderr) =
$node->psql('postgres',
@@ -232,7 +234,7 @@ max_prepared_transactions = 2
qq(
BEGIN;
SAVEPOINT a;
- ALTER TABLE t SET UNLOGGED;
+ ALTER TABLE t SET LOGGED;
ROLLBACK TO a;
INSERT INTO t VALUES(3); -- Succeeds
COMMIT;
@@ -242,6 +244,7 @@ max_prepared_transactions = 2
($ret, $stdout, $stderr) =
$node->psql('postgres',
qq(
+ ALTER TABLE t SET LOGGED;
BEGIN;
SAVEPOINT a;
ALTER TABLE t SET UNLOGGED;
--
2.43.5
On Fri, 27 Dec 2024 at 08:26, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
Hello. This is the updated version.
(Sorry for the delay; I've been a little swamped.)
- Undo logs are primarily stored in a fixed number of fixed-length
slots and are spilled into files under some conditions.
The number of slots is 32 (ULOG_SLOT_NUM), and the buffer length is
1024 (ULOG_SLOT_BUF_LEN). Both are currently non-configurable.
- Undo logs are now used only during recovery and are no longer involved
in transaction ends for normal backends. Pending deletes for aborts
have been restored.
- Undo logs are stored on a per-Top-XID basis.
- RelationPreserveStorage() is no longer modified.
In this version, in the part following the introduction of orphan
storage prevention, the restriction on prepared transactions
persisting beyond server crashes (i.e., the prohibition) has been
removed. This is because handling for such cases has been reverted to
pendingDeletes.
Let me know if you have any questions or concerns.
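For readers following along, the slot arrangement described above can be pictured roughly as follows. This is only an illustrative sketch, not code from the patch set: ULOG_SLOT_NUM and ULOG_SLOT_BUF_LEN are the values quoted in the message, while the struct, the function names, and the "spill when the buffer would overflow" condition are assumptions made for the example (the message only says spilling happens "under some conditions").

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values quoted in the message; everything else below is hypothetical. */
#define ULOG_SLOT_NUM     32
#define ULOG_SLOT_BUF_LEN 1024

typedef enum { ULOG_BUFFERED, ULOG_SPILLED, ULOG_NO_SLOT } UndoAppendResult;

typedef struct UndoSlotSketch
{
    uint32_t top_xid;                 /* owning top-level XID, 0 means free */
    size_t   used;                    /* bytes currently held in buf */
    size_t   spilled;                 /* bytes that would have gone to a file */
    char     buf[ULOG_SLOT_BUF_LEN];
} UndoSlotSketch;

static UndoSlotSketch slots[ULOG_SLOT_NUM];

/* Return the slot owned by top_xid, or claim a free one; NULL if none is free. */
static UndoSlotSketch *
slot_for_xid(uint32_t top_xid)
{
    UndoSlotSketch *free_slot = NULL;

    for (int i = 0; i < ULOG_SLOT_NUM; i++)
    {
        if (slots[i].top_xid == top_xid)
            return &slots[i];
        if (slots[i].top_xid == 0 && free_slot == NULL)
            free_slot = &slots[i];
    }
    if (free_slot != NULL)
    {
        free_slot->top_xid = top_xid;
        free_slot->used = 0;
        free_slot->spilled = 0;
    }
    return free_slot;
}

/*
 * Append one undo record for a top-level transaction.  The record stays in
 * the slot buffer while it fits; otherwise the buffered bytes plus the new
 * record are counted as spilled (assumed trigger, see the lead-in).
 */
static UndoAppendResult
undo_append_sketch(uint32_t top_xid, const void *data, size_t len)
{
    UndoSlotSketch *slot;

    assert(top_xid != 0);             /* 0 is reserved for "free slot" here */

    slot = slot_for_xid(top_xid);
    if (slot == NULL)
        return ULOG_NO_SLOT;

    if (slot->used + len > ULOG_SLOT_BUF_LEN)
    {
        slot->spilled += slot->used + len;
        slot->used = 0;
        return ULOG_SPILLED;
    }

    memcpy(slot->buf + slot->used, data, len);
    slot->used += len;
    return ULOG_BUFFERED;
}
```

Keying the slots by top-level XID mirrors the "per-Top-XID basis" point above; how the actual patches choose a slot or decide when to spill may well differ.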
I just went to give this a test drive, but HEAD has drifted too far,
at least for 0017 to apply. Could you please rebase and make the
necessary modifications?
Thanks
Thom
On 05/04/2025 00:29, Thom Brown wrote:
I just went to give this a test drive, but HEAD has drifted too far,
at least for 0017 to apply. Could you please rebase and make the
necessary modifications?
I had a quick look at this latest version now, up to
"v36-0005-Prevent-orphan-storage-files-after-server-crash.patch"
(because I'm very interested in that, but not in the rest of the
patches). Sorry I haven't gotten around to it earlier.
Overall I'm pretty happy with the design. The main thing that's now
missing is documentation. The main SGML docs should surely have a
section on the UNDO log. A new README to describe how modules should use
the undo log etc. would probably also be in order.
Off the top of my head, some subtle high-level things that should be
explained somewhere:
- The UNDO log is only used to clean up after a crash during relation
creation. It is *not* used for aborting or crash recovery of data, like
on most systems. As a result, it's not as performance critical as you
might think.
- The UNDO log is not a single sequential log like on many other
systems. One way to think about it is that it's a per-transaction file,
with a cache in shared memory for performance.
- The UNDO log is not used to handle controlled aborts, only for cleanup
after a crash.
- What happens if you fail to process the UNDO log for some reason? Some
storage files are leaked. Is that still considered OK, i.e. is the UNDO
log a nice-to-have, or are there some more serious consequences?
- The interaction between REDO and UNDO. Every record inserted to the
UNDO log of a transaction is WAL-logged in the REDO log. The undo log is
like a data file in that sense. Writing to the undo log follows the usual
"WAL-before-write" rule: the WAL is flushed before the corresponding
undo log entry is written to disk. (Is that true? I'm not 100% sure)
- When a new relation is created, do you flush the WAL before creating
the file? Or is there still a small window where it can leak, if the
file creation makes it to disk before crash but the undo log (or the WAL
record of the undo log entry) does not?
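To make the "WAL-before-write" ordering in the list above concrete, here is a toy model of the rule as stated there. It is a sketch under assumptions, not a claim about what the patches do (the question of whether they follow this rule is left open above); all names and the LSN bookkeeping are hypothetical and stand in for the real WAL machinery.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t LsnSketch;           /* stand-in for a WAL position */

static LsnSketch wal_insert_lsn = 0;  /* end of WAL generated so far */
static LsnSketch wal_flushed_lsn = 0; /* end of WAL known to be durable */

/* Pretend to insert a WAL record of len bytes; returns its end position. */
static LsnSketch
wal_insert_sketch(uint64_t len)
{
    wal_insert_lsn += len;
    return wal_insert_lsn;
}

/* Pretend to flush WAL at least up to the given position. */
static void
wal_flush_sketch(LsnSketch upto)
{
    if (upto > wal_flushed_lsn)
        wal_flushed_lsn = upto;
}

typedef struct UndoEntrySketch
{
    LsnSketch lsn;                    /* end of the WAL record covering this entry */
    bool      on_disk;                /* has the undo entry itself been written? */
} UndoEntrySketch;

/* Create an undo entry: it is WAL-logged first and remembers that position. */
static UndoEntrySketch
undo_log_entry_sketch(uint64_t len)
{
    UndoEntrySketch e;

    e.lsn = wal_insert_sketch(len);
    e.on_disk = false;
    return e;
}

/* Write the undo entry out, flushing its WAL record first if necessary. */
static void
undo_write_entry_sketch(UndoEntrySketch *e)
{
    if (wal_flushed_lsn < e->lsn)
        wal_flush_sketch(e->lsn);

    assert(wal_flushed_lsn >= e->lsn);  /* the rule: WAL is durable first */
    e->on_disk = true;
}
```

The same ordering question applies to the relation file itself: if the file can reach disk before the WAL record of its undo entry does, a crash in that window would still leak it.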
Have you done any performance testing of this? By "this" I mean the
overhead of the undo-logging on create/drop table.
--
Heikki Linnakangas
Neon (https://neon.tech)