Reducing the WAL overhead of freezing in VACUUM by deduplicating per-tuple freeze plans
My ongoing project to make VACUUM more predictable over time by
proactive freezing [1] will increase the overall number of tuples
frozen by VACUUM significantly (at least in larger tables). It's
important that we avoid any new user-visible impact from extra
freezing, though. I recently spent a lot of time on adding high-level
techniques that aim to avoid extra freezing (e.g. by being lazy about
freezing) when that makes sense. Low level techniques aimed at making
the mechanical process of freezing cheaper might also help. (In any
case it's well worth optimizing.)
I'd like to talk about one such technique on this thread. The attached
WIP patch reduces the size of xl_heap_freeze_page records by applying
a simple deduplication process. This can be treated as independent
work (I think it can, at least). The patch doesn't change anything
about the conceptual model used by VACUUM's lazy_scan_prune function
to build xl_heap_freeze_page records for a page, and yet still manages
to make the WAL records for freezing over 5x smaller in many
important cases. They'll be ~4x-5x smaller with *most* workloads,
even.
Each individual tuple entry (each xl_heap_freeze_tuple) adds a full 12
bytes to the WAL record right now -- no matter what. So the existing
approach is rather space inefficient by any standard (perhaps because
it was developed under time pressure while fixing bugs in Postgres
9.3). More importantly, there is a lot of natural redundancy among
each xl_heap_freeze_tuple entry -- each tuple's xl_heap_freeze_tuple
details tend to match. We can usually get away with storing each
unique combination of values from xl_heap_freeze_tuple once per
xl_heap_freeze_page record, while storing associated page offset
numbers in a separate area, grouped by their canonical freeze plan
(which is a normalized version of the information currently stored in
xl_heap_freeze_tuple).
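To illustrate the representation (this just restates what the attached WIP
patch does), the record's data ends up laid out roughly as follows:

/*
 * Deduplicated xl_heap_freeze_page record (conceptual layout):
 *
 *   main data:    latestRemovedXid, nplans
 *   block 0 data: xl_heap_freeze_plan plans[nplans]
 *                     -- one entry per distinct freeze plan:
 *                        { xmax, ntuples, t_infomask2, t_infomask, frzflags }
 *                 OffsetNumber offsets[total ntuples]
 *                     -- grouped by freeze plan, ascending within each group
 */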
In practice most individual tuples that undergo any kind of freezing
only need to have their xmin field frozen. And when xmax is affected
at all, it'll usually just get set to InvalidTransactionId. And so the
actual low-level processing steps for xmax have a high chance of being
shared by other tuples on the page, even in ostensibly tricky cases.
While there are quite a few paths that lead to VACUUM setting a
tuple's xmax to InvalidTransactionId, they all end up with the same
instructional state in the final xl_heap_freeze_tuple entry.
Note that there is a small chance that the patch will be less space
efficient by up to 2 bytes per tuple frozen per page in cases where
we're allocating new Multis during VACUUM. I think that this should be
acceptable on its own -- even in rare bad cases we'll usually still
come out ahead -- what are the chances that we won't make up the
difference on the same page? Or at least within the same VACUUM? And
that's before we talk about a future world in which freezing will
batch tuples together at the page level (you don't have to bring the
other VACUUM work into this discussion, I think, but it's not
*completely* unrelated either).
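To put rough numbers on all of this (back-of-the-envelope only, using the 12
byte freeze plans -- same size as the old per-tuple entries -- plus 2 byte
offset numbers from the patch, and ignoring generic record header and
alignment overhead):

  30 tuples, current format:      30 * 12          = 360 bytes
  30 tuples, one shared plan:     12 + 30 * 2      =  72 bytes (~5x smaller)
  30 tuples, 30 distinct plans:   30 * 12 + 30 * 2 = 420 bytes (the 2
                                                      bytes/tuple worst case)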
[1]: /messages/by-id/CAH2-WzkFok_6EAHuK39GaW4FjEFQsY=3J0AAd6FXk93u-Xq3Fg@mail.gmail.com -- Peter Geoghegan
--
Peter Geoghegan
Attachments:
v1-0001-Shrink-freeze-WAL-records-via-deduplication.patch (application/octet-stream)
From df59510439f844f9a367f7fd701d613f60486f53 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 30 Jul 2022 10:47:59 -0700
Subject: [PATCH v1] Shrink freeze WAL records via deduplication.
---
src/include/access/heapam.h | 25 ++
src/include/access/heapam_xlog.h | 35 +--
src/backend/access/heap/heapam.c | 307 ++++++++++++++++++++-----
src/backend/access/heap/vacuumlazy.c | 39 +---
src/backend/access/rmgrdesc/heapdesc.c | 4 +-
5 files changed, 286 insertions(+), 124 deletions(-)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index abf62d9df..e37dab48d 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -99,6 +99,19 @@ typedef enum
HEAPTUPLE_DELETE_IN_PROGRESS /* deleting xact is still in progress */
} HTSV_Result;
+/* heap_prepare_freeze_tuple state describing how to freeze a tuple */
+typedef struct HeapFreezeTuple
+{
+ /* Fields describing how to process tuple */
+ uint16 t_infomask2;
+ uint16 t_infomask;
+ TransactionId xmax;
+ uint8 frzflags;
+
+ /* Page offset number for tuple */
+ OffsetNumber offset;
+} HeapFreezeTuple;
+
/* ----------------
* function prototypes for heap access method
*
@@ -164,6 +177,18 @@ extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
Buffer *buffer, struct TM_FailureData *tmfd);
extern void heap_inplace_update(Relation relation, HeapTuple tuple);
+extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
+ TransactionId relfrozenxid,
+ TransactionId relminmxid,
+ TransactionId cutoff_xid,
+ TransactionId cutoff_multi,
+ HeapFreezeTuple *frz,
+ bool *totally_frozen,
+ TransactionId *relfrozenxid_out,
+ MultiXactId *relminmxid_out);
+extern void heap_freeze_prepared_tuples(Relation rel, Buffer buffer,
+ TransactionId latestRemovedXid,
+ HeapFreezeTuple *tuples, int ntuples);
extern bool heap_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 1705e736b..fb6ff2045 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -315,34 +315,36 @@ typedef struct xl_heap_inplace
/*
* This struct represents a 'freeze plan', which is what we need to know about
- * a single tuple being frozen during vacuum.
+ * a group of tuples being frozen from the same page.
*/
/* 0x01 was XLH_FREEZE_XMIN */
#define XLH_FREEZE_XVAC 0x02
#define XLH_INVALID_XVAC 0x04
-typedef struct xl_heap_freeze_tuple
+typedef struct xl_heap_freeze_plan
{
TransactionId xmax;
- OffsetNumber offset;
+ uint16 ntuples;
uint16 t_infomask2;
uint16 t_infomask;
uint8 frzflags;
-} xl_heap_freeze_tuple;
+} xl_heap_freeze_plan;
/*
* This is what we need to know about a block being frozen during vacuum
*
- * Backup block 0's data contains an array of xl_heap_freeze_tuple structs,
- * one for each tuple.
+ * Backup block 0's data contains an array of xl_heap_freeze_plan structs.
*/
typedef struct xl_heap_freeze_page
{
- TransactionId cutoff_xid;
- uint16 ntuples;
+ TransactionId latestRemovedXid;
+ uint16 nplans;
+
+ /* FREEZE PLANS FOLLOW */
+ /* OFFSET NUMBERS FOLLOW */
} xl_heap_freeze_page;
-#define SizeOfHeapFreezePage (offsetof(xl_heap_freeze_page, ntuples) + sizeof(uint16))
+#define SizeOfHeapFreezePage (offsetof(xl_heap_freeze_page, nplans) + sizeof(uint16))
/*
* This is what we need to know about setting a visibility map bit
@@ -400,21 +402,6 @@ extern void heap2_redo(XLogReaderState *record);
extern void heap2_desc(StringInfo buf, XLogReaderState *record);
extern const char *heap2_identify(uint8 info);
extern void heap_xlog_logical_rewrite(XLogReaderState *r);
-
-extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
- TransactionId cutoff_xid, xl_heap_freeze_tuple *tuples,
- int ntuples);
-extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
- TransactionId relfrozenxid,
- TransactionId relminmxid,
- TransactionId cutoff_xid,
- TransactionId cutoff_multi,
- xl_heap_freeze_tuple *frz,
- bool *totally_frozen,
- TransactionId *relfrozenxid_out,
- MultiXactId *relminmxid_out);
-extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
- xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileLocator rlocator, Buffer heap_buffer,
Buffer vm_buffer, TransactionId cutoff_xid, uint8 flags);
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 588716606..918728f27 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -110,6 +110,9 @@ static int bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate);
static XLogRecPtr log_heap_new_cid(Relation relation, HeapTuple tup);
static HeapTuple ExtractReplicaIdentity(Relation rel, HeapTuple tup, bool key_required,
bool *copy);
+static int heap_xlog_freeze_plan(HeapFreezeTuple *tuples, int ntuples,
+ xl_heap_freeze_plan *plans_wal_out,
+ OffsetNumber *offsets_wal_out);
/*
@@ -6431,7 +6434,9 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* will be totally frozen after these operations are performed and false if
* more freezing will eventually be required.
*
- * Caller must set frz->offset itself, before heap_execute_freeze_tuple call.
+ * VACUUM caller must assemble HeapFreezeTuple entries for every tuple that we
+ * returned true for when called. A later heap_freeze_prepared_tuples call
+ * will execute freezing for caller's page as a whole.
*
* It is assumed that the caller has checked the tuple with
* HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
@@ -6455,15 +6460,12 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* It will be set as tuple's new xmax when our *frz output is processed within
* heap_execute_freeze_tuple later on. If the tuple is in a shared buffer
* then caller had better have an exclusive lock on it already.
- *
- * NB: It is not enough to set hint bits to indicate an XID committed/aborted.
- * The *frz WAL record we output completely removes all old XIDs during REDO.
*/
bool
heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi,
- xl_heap_freeze_tuple *frz, bool *totally_frozen,
+ HeapFreezeTuple *frz, bool *totally_frozen,
TransactionId *relfrozenxid_out,
MultiXactId *relminmxid_out)
{
@@ -6754,10 +6756,12 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
* tuple status. Also, getting exclusive lock makes it safe to adjust the
* infomask bits.
*
- * NB: All code in here must be safe to execute during crash recovery!
+ * NB: All code in here must be safe to execute during crash recovery! It is
+ * not enough to set hint bits to indicate an XID committed/aborted, either.
+ * Caller's REDO routine must reliably remove all old XIDs.
*/
-void
-heap_execute_freeze_tuple(HeapTupleHeader tuple, xl_heap_freeze_tuple *frz)
+static void
+heap_execute_freeze_tuple(HeapTupleHeader tuple, HeapFreezeTuple *frz)
{
HeapTupleHeaderSetXmax(tuple, frz->xmax);
@@ -6771,6 +6775,80 @@ heap_execute_freeze_tuple(HeapTupleHeader tuple, xl_heap_freeze_tuple *frz)
tuple->t_infomask2 = frz->t_infomask2;
}
+/*
+ * heap_freeze_prepared_tuples
+ * Execute freezing of prepared tuple plans.
+ *
+ * If we need to freeze any tuples we'll mark the buffer dirty, and write a
+ * WAL record recording the changes. We must log the changes to be crash-safe
+ * against future truncation of CLOG.
+ *
+ * Caller must always set 'offset' page offset number from each tuple plan.
+ */
+void
+heap_freeze_prepared_tuples(Relation rel, Buffer buffer,
+ TransactionId latestRemovedXid,
+ HeapFreezeTuple *tuples, int ntuples)
+{
+ Page page = BufferGetPage(buffer);
+
+ /* Caller must not call here when there are no tuples to freeze */
+ Assert(ntuples > 0);
+ Assert(TransactionIdIsValid(latestRemovedXid));
+
+ START_CRIT_SECTION();
+
+ MarkBufferDirty(buffer);
+
+ /* execute collected freezes */
+ for (int i = 0; i < ntuples; i++)
+ {
+ HeapTupleHeader htup;
+ ItemId itemid = PageGetItemId(page, tuples[i].offset);
+
+ htup = (HeapTupleHeader) PageGetItem(page, itemid);
+ heap_execute_freeze_tuple(htup, &tuples[i]);
+ }
+
+ /* Now WAL-log freezing if necessary */
+ if (RelationNeedsWAL(rel))
+ {
+ xl_heap_freeze_plan plans[MaxHeapTuplesPerPage];
+ OffsetNumber offsets[MaxHeapTuplesPerPage];
+ int nplans;
+ xl_heap_freeze_page xlrec;
+ XLogRecPtr recptr;
+
+ /* Use compact deduplicated representation in WAL record */
+ nplans = heap_xlog_freeze_plan(tuples, ntuples, plans, offsets);
+
+ Assert(nplans > 0 && nplans <= ntuples);
+
+ xlrec.latestRemovedXid = latestRemovedXid;
+ xlrec.nplans = nplans;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfHeapFreezePage);
+
+ /*
+ * The freeze plan array is not actually in the buffer, but pretend
+ * that it is. When XLogInsert stores the whole buffer, the freeze
+ * plan and offset numbers array need not be stored too.
+ */
+ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+ XLogRegisterBufData(0, (char *) plans,
+ nplans * sizeof(xl_heap_freeze_plan));
+ XLogRegisterBufData(0, (char *) offsets,
+ ntuples * sizeof(OffsetNumber));
+
+ recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_FREEZE_PAGE);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+}
+
/*
* heap_freeze_tuple
* Freeze tuple in place, without WAL logging.
@@ -6782,7 +6860,7 @@ heap_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi)
{
- xl_heap_freeze_tuple frz;
+ HeapFreezeTuple frz;
bool do_freeze;
bool tuple_totally_frozen;
TransactionId relfrozenxid_out = cutoff_xid;
@@ -8143,42 +8221,6 @@ bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate)
return nblocksfavorable;
}
-/*
- * Perform XLogInsert for a heap-freeze operation. Caller must have already
- * modified the buffer and marked it dirty.
- */
-XLogRecPtr
-log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
- xl_heap_freeze_tuple *tuples, int ntuples)
-{
- xl_heap_freeze_page xlrec;
- XLogRecPtr recptr;
-
- /* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
- /* nor when there are no tuples to freeze */
- Assert(ntuples > 0);
-
- xlrec.cutoff_xid = cutoff_xid;
- xlrec.ntuples = ntuples;
-
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec, SizeOfHeapFreezePage);
-
- /*
- * The freeze plan array is not actually in the buffer, but pretend that
- * it is. When XLogInsert stores the whole buffer, the freeze plan need
- * not be stored too.
- */
- XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
- XLogRegisterBufData(0, (char *) tuples,
- ntuples * sizeof(xl_heap_freeze_tuple));
-
- recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_FREEZE_PAGE);
-
- return recptr;
-}
-
/*
* Perform XLogInsert for a heap-visible operation. 'block' is the block
* being marked all-visible, and vm_buffer is the buffer containing the
@@ -8915,6 +8957,134 @@ heap_xlog_visible(XLogReaderState *record)
UnlockReleaseBuffer(vmbuffer);
}
+/*
+ * qsort comparator used to deduplicate tuple-based freeze plans for WAL
+ */
+static int
+heap_xlog_freeze_cmp(const void *arg1, const void *arg2)
+{
+ HeapFreezeTuple *frz1 = (HeapFreezeTuple *) arg1;
+ HeapFreezeTuple *frz2 = (HeapFreezeTuple *) arg2;
+
+ /* Compare fields that describe actions required to freeze this tuple */
+ if (frz1->t_infomask2 < frz2->t_infomask2)
+ return -1;
+ else if (frz1->t_infomask2 > frz2->t_infomask2)
+ return 1;
+
+ if (frz1->t_infomask < frz2->t_infomask)
+ return -1;
+ else if (frz1->t_infomask > frz2->t_infomask)
+ return 1;
+
+ if (frz1->xmax < frz2->xmax)
+ return -1;
+ else if (frz1->xmax > frz2->xmax)
+ return 1;
+
+ if (frz1->frzflags < frz2->frzflags)
+ return -1;
+ else if (frz1->frzflags > frz2->frzflags)
+ return 1;
+
+ /* Tiebreak on an individual tuple's page offset number */
+ if (frz1->offset < frz2->offset)
+ return -1;
+ else if (frz1->offset > frz2->offset)
+ return 1;
+
+ Assert(false);
+ return 0;
+}
+
+/*
+ * Compare fields that describe actions required to freeze tuple with caller's
+ * open plan. If everything matches then the frz tuple plan is equivalent to
+ * caller's plan.
+ */
+static bool
+heap_xlog_freeze_eq(xl_heap_freeze_plan *plan, HeapFreezeTuple *frz)
+{
+ if (plan->t_infomask2 == frz->t_infomask2 &&
+ plan->t_infomask == frz->t_infomask &&
+ plan->xmax == frz->xmax && plan->frzflags == frz->frzflags)
+ return true;
+
+ /* Caller must call heap_xlog_new_freeze_plan again for frz */
+ return false;
+}
+
+/*
+ * Start new plan initialized using tuple-level actions. At least one tuple
+ * will have steps required to freeze described by caller's plan during REDO.
+ */
+static void
+heap_xlog_new_freeze_plan(xl_heap_freeze_plan *plan, HeapFreezeTuple *frz)
+{
+ plan->xmax = frz->xmax;
+ plan->t_infomask = frz->t_infomask;
+ plan->t_infomask2 = frz->t_infomask2;
+ plan->frzflags = frz->frzflags;
+ plan->ntuples = 1;
+}
+
+/*
+ * Deduplicate tuple-based freeze plans so that each distinct set of
+ * processing steps is only stored once in final WAL record.
+ *
+ * Return value is number of plans set in *plans_out for caller. *offsets_out
+ * is set to offset numbers for use in WAL record by caller.
+ */
+static int
+heap_xlog_freeze_plan(HeapFreezeTuple *tuples, int ntuples,
+ xl_heap_freeze_plan *plans_out,
+ OffsetNumber *offsets_out)
+{
+ xl_heap_freeze_plan *curplan = plans_out;
+ int nplans = 0;
+
+ /* Sort tuple-based freeze plans in the order required to deduplicate */
+ qsort(tuples, ntuples, sizeof(HeapFreezeTuple), heap_xlog_freeze_cmp);
+
+ for (int i = 0; i < ntuples; i++)
+ {
+ HeapFreezeTuple *frz = tuples + i;
+
+ if (i == 0)
+ {
+ /* New canonical freeze plan starting with first tup */
+ heap_xlog_new_freeze_plan(curplan, frz);
+ nplans++;
+ }
+ else if (heap_xlog_freeze_eq(curplan, frz))
+ {
+ /* tup matches open canonical plan -- include tup in it */
+ Assert(offsets_out[i - 1] < frz->offset);
+ curplan->ntuples++;
+ }
+ else
+ {
+ /* Tup doesn't match current plan -- done with it now */
+ curplan++;
+
+ /* New canonical freeze plan starting with this tup */
+ heap_xlog_new_freeze_plan(curplan, frz);
+ nplans++;
+ }
+
+ /*
+ * Save page offset number in dedicated buffer in passing.
+ *
+ * REDO routine relies on the record's offset numbers array grouping
+ * offset numbers by freeze plan. The sort order within each grouping
+ * is ascending offset number order, just to keep things tidy.
+ */
+ offsets_out[i] = frz->offset;
+ }
+
+ return nplans;
+}
+
/*
* Replay XLOG_HEAP2_FREEZE_PAGE records
*/
@@ -8923,20 +9093,17 @@ heap_xlog_freeze_page(XLogReaderState *record)
{
XLogRecPtr lsn = record->EndRecPtr;
xl_heap_freeze_page *xlrec = (xl_heap_freeze_page *) XLogRecGetData(record);
- TransactionId cutoff_xid = xlrec->cutoff_xid;
+ TransactionId latestRemovedXid = xlrec->latestRemovedXid;
Buffer buffer;
- int ntup;
/*
* In Hot Standby mode, ensure that there's no queries running which still
* consider the frozen xids as running.
*/
+ Assert(TransactionIdIsValid(latestRemovedXid));
if (InHotStandby)
{
RelFileLocator rlocator;
- TransactionId latestRemovedXid = cutoff_xid;
-
- TransactionIdRetreat(latestRemovedXid);
XLogRecGetBlockTag(record, 0, &rlocator, NULL, NULL);
ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rlocator);
@@ -8945,22 +9112,38 @@ heap_xlog_freeze_page(XLogReaderState *record)
if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
{
Page page = BufferGetPage(buffer);
- xl_heap_freeze_tuple *tuples;
+ xl_heap_freeze_plan *plans;
+ OffsetNumber *offsets;
+ int curoff = 0;
- tuples = (xl_heap_freeze_tuple *) XLogRecGetBlockData(record, 0, NULL);
-
- /* now execute freeze plan for each frozen tuple */
- for (ntup = 0; ntup < xlrec->ntuples; ntup++)
+ /*
+ * Convert plan/tuple grouping representation from WAL record into
+ * per-tuple representation for heap_execute_freeze_tuple call
+ */
+ plans = (xl_heap_freeze_plan *) XLogRecGetBlockData(record, 0, NULL);
+ offsets = (OffsetNumber *) ((char *) plans +
+ (xlrec->nplans *
+ sizeof(xl_heap_freeze_plan)));
+ for (int plan = 0; plan < xlrec->nplans; plan++)
{
- xl_heap_freeze_tuple *xlrec_tp;
- ItemId lp;
- HeapTupleHeader tuple;
+ xl_heap_freeze_plan *tuple_grouping = &plans[plan];
- xlrec_tp = &tuples[ntup];
- lp = PageGetItemId(page, xlrec_tp->offset); /* offsets are one-based */
- tuple = (HeapTupleHeader) PageGetItem(page, lp);
+ for (int i = 0; i < tuple_grouping->ntuples; i++)
+ {
+ HeapFreezeTuple frz;
+ ItemId lp;
+ HeapTupleHeader tuple;
- heap_execute_freeze_tuple(tuple, xlrec_tp);
+ frz.t_infomask2 = tuple_grouping->t_infomask2;
+ frz.t_infomask = tuple_grouping->t_infomask;
+ frz.xmax = tuple_grouping->xmax;
+ frz.frzflags = tuple_grouping->frzflags;
+ frz.offset = offsets[curoff++];
+
+ lp = PageGetItemId(page, frz.offset);
+ tuple = (HeapTupleHeader) PageGetItem(page, lp);
+ heap_execute_freeze_tuple(tuple, &frz);
+ }
}
PageSetLSN(page, lsn);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index dfbe37472..0623923c4 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -1566,7 +1566,7 @@ lazy_scan_prune(LVRelState *vacrel,
TransactionId NewRelfrozenXid;
MultiXactId NewRelminMxid;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
- xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
+ HeapFreezeTuple frozen[MaxHeapTuplesPerPage];
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1824,41 +1824,8 @@ retry:
Assert(prunestate->hastup);
vacrel->frozen_pages++;
-
- /*
- * At least one tuple with storage needs to be frozen -- execute that
- * now.
- *
- * If we need to freeze any tuples we'll mark the buffer dirty, and
- * write a WAL record recording the changes. We must log the changes
- * to be crash-safe against future truncation of CLOG.
- */
- START_CRIT_SECTION();
-
- MarkBufferDirty(buf);
-
- /* execute collected freezes */
- for (int i = 0; i < tuples_frozen; i++)
- {
- HeapTupleHeader htup;
-
- itemid = PageGetItemId(page, frozen[i].offset);
- htup = (HeapTupleHeader) PageGetItem(page, itemid);
-
- heap_execute_freeze_tuple(htup, &frozen[i]);
- }
-
- /* Now WAL-log freezing if necessary */
- if (RelationNeedsWAL(vacrel->rel))
- {
- XLogRecPtr recptr;
-
- recptr = log_heap_freeze(vacrel->rel, buf, vacrel->FreezeLimit,
- frozen, tuples_frozen);
- PageSetLSN(page, recptr);
- }
-
- END_CRIT_SECTION();
+ heap_freeze_prepared_tuples(vacrel->rel, buf, vacrel->FreezeLimit,
+ frozen, tuples_frozen);
}
/*
diff --git a/src/backend/access/rmgrdesc/heapdesc.c b/src/backend/access/rmgrdesc/heapdesc.c
index 923d3bc43..3f8c5e63f 100644
--- a/src/backend/access/rmgrdesc/heapdesc.c
+++ b/src/backend/access/rmgrdesc/heapdesc.c
@@ -140,8 +140,8 @@ heap2_desc(StringInfo buf, XLogReaderState *record)
{
xl_heap_freeze_page *xlrec = (xl_heap_freeze_page *) rec;
- appendStringInfo(buf, "cutoff xid %u ntuples %u",
- xlrec->cutoff_xid, xlrec->ntuples);
+ appendStringInfo(buf, "latestRemovedXid %u nplans %u",
+ xlrec->latestRemovedXid, xlrec->nplans);
}
else if (info == XLOG_HEAP2_VISIBLE)
{
--
2.34.1
On Tue, Sep 13, 2022 at 6:02 AM Peter Geoghegan <pg@bowt.ie> wrote:
My ongoing project to make VACUUM more predictable over time by
proactive freezing [1] will increase the overall number of tuples
frozen by VACUUM significantly (at least in larger tables). It's
important that we avoid any new user-visible impact from extra
freezing, though. I recently spent a lot of time on adding high-level
techniques that aim to avoid extra freezing (e.g. by being lazy about
freezing) when that makes sense. Low level techniques aimed at making
the mechanical process of freezing cheaper might also help. (In any
case it's well worth optimizing.)
I'd like to talk about one such technique on this thread. The attached
WIP patch reduces the size of xl_heap_freeze_page records by applying
a simple deduplication process. This can be treated as independent
work (I think it can, at least).
+1
The patch doesn't change anything
about the conceptual model used by VACUUM's lazy_scan_prune function
to build xl_heap_freeze_page records for a page, and yet still manages
to make the WAL records for freezing over 5x smaller in many
important cases. They'll be ~4x-5x smaller with *most* workloads,
even.
After a quick benchmark, I've confirmed that the amount of WAL generated
for freezing 1 million tuples is reduced to about one-fifth (1.2GB vs
250MB). Great.
Each individual tuple entry (each xl_heap_freeze_tuple) adds a full 12
bytes to the WAL record right now -- no matter what. So the existing
approach is rather space inefficient by any standard (perhaps because
it was developed under time pressure while fixing bugs in Postgres
9.3). More importantly, there is a lot of natural redundancy among
each xl_heap_freeze_tuple entry -- each tuple's xl_heap_freeze_tuple
details tend to match. We can usually get away with storing each
unique combination of values from xl_heap_freeze_tuple once per
xl_heap_freeze_page record, while storing associated page offset
numbers in a separate area, grouped by their canonical freeze plan
(which is a normalized version of the information currently stored in
xl_heap_freeze_tuple).
True. I've not looked at the patch in depth yet but I think we need
regression tests for this.
Regards,
--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Fri, Sep 16, 2022 at 12:30 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
After a quick benchmark, I've confirmed that the amount of WAL generated
for freezing 1 million tuples is reduced to about one-fifth (1.2GB vs
250MB). Great.
I think that the really interesting thing about the patch is how this
changes the way we should think about freezing costs. It makes
page-level batching seem very natural.
The minimum possible size of a Heap2/FREEZE_PAGE record is 64 bytes,
once alignment and so on is taken into account (without the patch).
Once we already know that we have to freeze *some* tuples on a given
heap page, it becomes very reasonable to freeze as many as possible,
in batch, just because we know that it'll be much cheaper if we do it
now versus doing it later on instead. Even if this extra freezing ends
up "going to waste" due to updates against the same tuples that happen
a little later on, the *added* cost of freezing "extra" tuples will
have been so small that it's unlikely to matter. On the other hand, if
it's not wasted we'll be *much* better off.
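To put that in concrete terms (again just back-of-the-envelope, ignoring
header overhead): freezing 5 "extra" tuples on a page that needs a freeze
record anyway adds 5 * 12 = 60 bytes with the current per-tuple format, but
typically only 5 * 2 = 10 bytes with the deduplicated format, since the extra
tuples will almost always share a freeze plan that's being logged anyway.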
It's very hard to predict the future, which is more or less what the current
FreezeLimit-based approach to freezing tries to do. It's actually quite easy
to understand the cost of freezing now versus freezing later, though.
At a high level, it makes sense for VACUUM to focus on freezing costs
(including the risk that comes with *not* freezing with larger
tables), and not worry so much about making accurate predictions.
Making accurate predictions about freezing/workload characteristics is
overrated.
True. I've not looked at the patch in depth yet but I think we need
regression tests for this.
What did you have in mind?
I think that the best way to test something like this is with
wal_consistency_checking. That mostly works fine. However, note that
heap_mask() won't always be able to preserve the state of a tuple's
xmax when modified by freezing. We sometimes need "hint bits" to
actually reliably be set in REDO, when replaying the records for
freezing. At other times they really are just hints. We have to
conservatively assume that it's just a hint when masking. Not sure if
I can do much about that.
Note that this optimization is one level below lazy_scan_prune(), and
one level above heap_execute_freeze_tuple(). Neither function really
changes at all. This seems useful because there are rare
pg_upgrade-only paths where xvac fields need to be frozen. That's not
tested either.
--
Peter Geoghegan
On Mon, Sep 12, 2022 at 2:01 PM Peter Geoghegan <pg@bowt.ie> wrote:
I'd like to talk about one such technique on this thread. The attached
WIP patch reduces the size of xl_heap_freeze_page records by applying
a simple deduplication process.
Attached is v2, which I'm just posting to keep CFTester happy. No real
changes here.
--
Peter Geoghegan
Attachments:
v2-0001-Shrink-freeze-WAL-records-via-deduplication.patch (application/x-patch)
From c555f716a79c7de1f26a5e1e87265f2702b980ee Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 30 Jul 2022 10:47:59 -0700
Subject: [PATCH v2] Shrink freeze WAL records via deduplication.
---
src/include/access/heapam.h | 25 ++
src/include/access/heapam_xlog.h | 34 +--
src/backend/access/heap/heapam.c | 310 ++++++++++++++++++++-----
src/backend/access/heap/vacuumlazy.c | 39 +---
src/backend/access/rmgrdesc/heapdesc.c | 4 +-
5 files changed, 289 insertions(+), 123 deletions(-)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 9dab35551..579ef32b0 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -99,6 +99,19 @@ typedef enum
HEAPTUPLE_DELETE_IN_PROGRESS /* deleting xact is still in progress */
} HTSV_Result;
+/* heap_prepare_freeze_tuple state describing how to freeze a tuple */
+typedef struct HeapFreezeTuple
+{
+ /* Fields describing how to process tuple */
+ uint16 t_infomask2;
+ uint16 t_infomask;
+ TransactionId xmax;
+ uint8 frzflags;
+
+ /* Page offset number for tuple */
+ OffsetNumber offset;
+} HeapFreezeTuple;
+
/* ----------------
* function prototypes for heap access method
*
@@ -164,6 +177,18 @@ extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
Buffer *buffer, struct TM_FailureData *tmfd);
extern void heap_inplace_update(Relation relation, HeapTuple tuple);
+extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
+ TransactionId relfrozenxid,
+ TransactionId relminmxid,
+ TransactionId cutoff_xid,
+ TransactionId cutoff_multi,
+ HeapFreezeTuple *frz,
+ bool *totally_frozen,
+ TransactionId *relfrozenxid_out,
+ MultiXactId *relminmxid_out);
+extern void heap_freeze_prepared_tuples(Relation rel, Buffer buffer,
+ TransactionId latestRemovedXid,
+ HeapFreezeTuple *tuples, int ntuples);
extern bool heap_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 34220d93c..4e3a023eb 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -315,34 +315,36 @@ typedef struct xl_heap_inplace
/*
* This struct represents a 'freeze plan', which is what we need to know about
- * a single tuple being frozen during vacuum.
+ * a group of tuples being frozen from the same page.
*/
/* 0x01 was XLH_FREEZE_XMIN */
#define XLH_FREEZE_XVAC 0x02
#define XLH_INVALID_XVAC 0x04
-typedef struct xl_heap_freeze_tuple
+typedef struct xl_heap_freeze_plan
{
TransactionId xmax;
- OffsetNumber offset;
+ uint16 ntuples;
uint16 t_infomask2;
uint16 t_infomask;
uint8 frzflags;
-} xl_heap_freeze_tuple;
+} xl_heap_freeze_plan;
/*
* This is what we need to know about a block being frozen during vacuum
*
- * Backup block 0's data contains an array of xl_heap_freeze_tuple structs,
- * one for each tuple.
+ * Backup block 0's data contains an array of xl_heap_freeze_plan structs.
*/
typedef struct xl_heap_freeze_page
{
- TransactionId cutoff_xid;
- uint16 ntuples;
+ TransactionId latestRemovedXid;
+ uint16 nplans;
+
+ /* FREEZE PLANS FOLLOW */
+ /* OFFSET NUMBERS FOLLOW */
} xl_heap_freeze_page;
-#define SizeOfHeapFreezePage (offsetof(xl_heap_freeze_page, ntuples) + sizeof(uint16))
+#define SizeOfHeapFreezePage (offsetof(xl_heap_freeze_page, nplans) + sizeof(uint16))
/*
* This is what we need to know about setting a visibility map bit
@@ -401,20 +403,6 @@ extern void heap2_desc(StringInfo buf, XLogReaderState *record);
extern const char *heap2_identify(uint8 info);
extern void heap_xlog_logical_rewrite(XLogReaderState *r);
-extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
- TransactionId cutoff_xid, xl_heap_freeze_tuple *tuples,
- int ntuples);
-extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
- TransactionId relfrozenxid,
- TransactionId relminmxid,
- TransactionId cutoff_xid,
- TransactionId cutoff_multi,
- xl_heap_freeze_tuple *frz,
- bool *totally_frozen,
- TransactionId *relfrozenxid_out,
- MultiXactId *relminmxid_out);
-extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
- xl_heap_freeze_tuple *frz);
extern XLogRecPtr log_heap_visible(RelFileLocator rlocator, Buffer heap_buffer,
Buffer vm_buffer, TransactionId cutoff_xid, uint8 vmflags);
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index eb811d751..df2105e3a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -110,6 +110,9 @@ static int bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate);
static XLogRecPtr log_heap_new_cid(Relation relation, HeapTuple tup);
static HeapTuple ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_required,
bool *copy);
+static int heap_xlog_freeze_plan(HeapFreezeTuple *tuples, int ntuples,
+ xl_heap_freeze_plan *plans_out,
+ OffsetNumber *offsets_out);
/*
@@ -6431,7 +6434,9 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* will be totally frozen after these operations are performed and false if
* more freezing will eventually be required.
*
- * Caller must set frz->offset itself, before heap_execute_freeze_tuple call.
+ * VACUUM caller must assemble HeapFreezeTuple entries for every tuple that we
+ * returned true for when called. A later heap_freeze_prepared_tuples call
+ * will execute freezing for caller's page as a whole.
*
* It is assumed that the caller has checked the tuple with
* HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
@@ -6455,15 +6460,12 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* It will be set as tuple's new xmax when our *frz output is processed within
* heap_execute_freeze_tuple later on. If the tuple is in a shared buffer
* then caller had better have an exclusive lock on it already.
- *
- * NB: It is not enough to set hint bits to indicate an XID committed/aborted.
- * The *frz WAL record we output completely removes all old XIDs during REDO.
*/
bool
heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi,
- xl_heap_freeze_tuple *frz, bool *totally_frozen,
+ HeapFreezeTuple *frz, bool *totally_frozen,
TransactionId *relfrozenxid_out,
MultiXactId *relminmxid_out)
{
@@ -6754,10 +6756,12 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
* tuple status. Also, getting exclusive lock makes it safe to adjust the
* infomask bits.
*
- * NB: All code in here must be safe to execute during crash recovery!
+ * NB: All code in here must be safe to execute during crash recovery! It is
+ * not enough to set hint bits to indicate an XID committed/aborted, either.
+ * Caller's REDO routine must reliably remove all old XIDs.
*/
-void
-heap_execute_freeze_tuple(HeapTupleHeader tuple, xl_heap_freeze_tuple *frz)
+static void
+heap_execute_freeze_tuple(HeapTupleHeader tuple, HeapFreezeTuple *frz)
{
HeapTupleHeaderSetXmax(tuple, frz->xmax);
@@ -6771,6 +6775,82 @@ heap_execute_freeze_tuple(HeapTupleHeader tuple, xl_heap_freeze_tuple *frz)
tuple->t_infomask2 = frz->t_infomask2;
}
+/*
+ * heap_freeze_prepared_tuples
+ * Execute freezing of prepared tuple plans.
+ *
+ * If we need to freeze any tuples we'll mark the buffer dirty, and write a
+ * WAL record recording the changes. We must log the changes to be crash-safe
+ * against future truncation of CLOG.
+ *
+ * Caller must always set 'offset' page offset number from each tuple plan.
+ *
+ * NB: We sort the tuples array. Caller should be finished with it by now.
+ */
+void
+heap_freeze_prepared_tuples(Relation rel, Buffer buffer,
+ TransactionId latestRemovedXid,
+ HeapFreezeTuple *tuples, int ntuples)
+{
+ Page page = BufferGetPage(buffer);
+
+ /* Caller must not call here when there are no tuples to freeze */
+ Assert(ntuples > 0);
+ Assert(TransactionIdIsValid(latestRemovedXid));
+
+ START_CRIT_SECTION();
+
+ MarkBufferDirty(buffer);
+
+ /* execute collected freezes */
+ for (int i = 0; i < ntuples; i++)
+ {
+ HeapTupleHeader htup;
+ ItemId itemid = PageGetItemId(page, tuples[i].offset);
+
+ htup = (HeapTupleHeader) PageGetItem(page, itemid);
+ heap_execute_freeze_tuple(htup, &tuples[i]);
+ }
+
+ /* Now WAL-log freezing if necessary */
+ if (RelationNeedsWAL(rel))
+ {
+ xl_heap_freeze_plan plans[MaxHeapTuplesPerPage];
+ OffsetNumber offsets[MaxHeapTuplesPerPage];
+ int nplans;
+ xl_heap_freeze_page xlrec;
+ XLogRecPtr recptr;
+
+ /* Prepare deduplicated representation for use in WAL record */
+ nplans = heap_xlog_freeze_plan(tuples, ntuples, plans, offsets);
+
+ Assert(nplans > 0 && nplans <= ntuples);
+
+ xlrec.latestRemovedXid = latestRemovedXid;
+ xlrec.nplans = nplans;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfHeapFreezePage);
+
+ /*
+ * The freeze plan array and offset array are not actually in the
+ * buffer, but pretend that they are. When XLogInsert stores the
+ * whole buffer, the arrays need not be stored too.
+ */
+ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+ XLogRegisterBufData(0, (char *) plans,
+ nplans * sizeof(xl_heap_freeze_plan));
+ XLogRegisterBufData(0, (char *) offsets,
+ ntuples * sizeof(OffsetNumber));
+
+ recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_FREEZE_PAGE);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+}
+
/*
* heap_freeze_tuple
* Freeze tuple in place, without WAL logging.
@@ -6782,7 +6862,7 @@ heap_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi)
{
- xl_heap_freeze_tuple frz;
+ HeapFreezeTuple frz;
bool do_freeze;
bool tuple_totally_frozen;
TransactionId relfrozenxid_out = cutoff_xid;
@@ -8143,42 +8223,6 @@ bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate)
return nblocksfavorable;
}
-/*
- * Perform XLogInsert for a heap-freeze operation. Caller must have already
- * modified the buffer and marked it dirty.
- */
-XLogRecPtr
-log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
- xl_heap_freeze_tuple *tuples, int ntuples)
-{
- xl_heap_freeze_page xlrec;
- XLogRecPtr recptr;
-
- /* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
- /* nor when there are no tuples to freeze */
- Assert(ntuples > 0);
-
- xlrec.cutoff_xid = cutoff_xid;
- xlrec.ntuples = ntuples;
-
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec, SizeOfHeapFreezePage);
-
- /*
- * The freeze plan array is not actually in the buffer, but pretend that
- * it is. When XLogInsert stores the whole buffer, the freeze plan need
- * not be stored too.
- */
- XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
- XLogRegisterBufData(0, (char *) tuples,
- ntuples * sizeof(xl_heap_freeze_tuple));
-
- recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_FREEZE_PAGE);
-
- return recptr;
-}
-
/*
* Perform XLogInsert for a heap-visible operation. 'block' is the block
* being marked all-visible, and vm_buffer is the buffer containing the
@@ -8915,6 +8959,133 @@ heap_xlog_visible(XLogReaderState *record)
UnlockReleaseBuffer(vmbuffer);
}
+/*
+ * qsort comparator used to deduplicate tuple-based freeze plans for WAL
+ */
+static int
+heap_xlog_freeze_cmp(const void *arg1, const void *arg2)
+{
+ HeapFreezeTuple *frz1 = (HeapFreezeTuple *) arg1;
+ HeapFreezeTuple *frz2 = (HeapFreezeTuple *) arg2;
+
+ /* Compare fields that describe actions required to freeze this tuple */
+ if (frz1->t_infomask2 < frz2->t_infomask2)
+ return -1;
+ else if (frz1->t_infomask2 > frz2->t_infomask2)
+ return 1;
+
+ if (frz1->t_infomask < frz2->t_infomask)
+ return -1;
+ else if (frz1->t_infomask > frz2->t_infomask)
+ return 1;
+
+ if (frz1->xmax < frz2->xmax)
+ return -1;
+ else if (frz1->xmax > frz2->xmax)
+ return 1;
+
+ if (frz1->frzflags < frz2->frzflags)
+ return -1;
+ else if (frz1->frzflags > frz2->frzflags)
+ return 1;
+
+ /* Tiebreak on an individual tuple's page offset number */
+ if (frz1->offset < frz2->offset)
+ return -1;
+ else if (frz1->offset > frz2->offset)
+ return 1;
+
+ Assert(false);
+ return 0;
+}
+
+/*
+ * Compare fields that describe actions required to freeze tuple with caller's
+ * open plan. If everything matches then the frz tuple plan is equivalent to
+ * caller's plan.
+ */
+static bool
+heap_xlog_freeze_eq(xl_heap_freeze_plan *plan, HeapFreezeTuple *frz)
+{
+ if (plan->t_infomask2 == frz->t_infomask2 &&
+ plan->t_infomask == frz->t_infomask &&
+ plan->xmax == frz->xmax && plan->frzflags == frz->frzflags)
+ return true;
+
+ /* Caller must call heap_xlog_new_freeze_plan again for frz */
+ return false;
+}
+
+/*
+ * Start new plan initialized using tuple-level actions. At least one tuple
+ * will have steps required to freeze described by caller's plan during REDO.
+ */
+static void
+heap_xlog_new_freeze_plan(xl_heap_freeze_plan *plan, HeapFreezeTuple *frz)
+{
+ plan->xmax = frz->xmax;
+ plan->t_infomask = frz->t_infomask;
+ plan->t_infomask2 = frz->t_infomask2;
+ plan->frzflags = frz->frzflags;
+ plan->ntuples = 1;
+}
+
+/*
+ * Deduplicate tuple-based freeze plans so that each distinct set of
+ * processing steps is only stored once in final WAL record.
+ *
+ * Return value is number of plans set in *plans_out for caller. *offsets_out
+ * is set to offset numbers for use in WAL record by caller.
+ */
+static int
+heap_xlog_freeze_plan(HeapFreezeTuple *tuples, int ntuples,
+ xl_heap_freeze_plan *plans_out,
+ OffsetNumber *offsets_out)
+{
+ int nplans = 0;
+
+ /* Sort tuple-based freeze plans in the order required to deduplicate */
+ qsort(tuples, ntuples, sizeof(HeapFreezeTuple), heap_xlog_freeze_cmp);
+
+ for (int i = 0; i < ntuples; i++)
+ {
+ HeapFreezeTuple *frz = tuples + i;
+
+ if (i == 0)
+ {
+ /* New canonical freeze plan starting with first tup */
+ heap_xlog_new_freeze_plan(plans_out, frz);
+ nplans++;
+ }
+ else if (heap_xlog_freeze_eq(plans_out, frz))
+ {
+ /* tup matches open canonical plan -- include tup in it */
+ Assert(offsets_out[i - 1] < frz->offset);
+ plans_out->ntuples++;
+ }
+ else
+ {
+ /* Tup doesn't match current plan -- done with it now */
+ plans_out++;
+
+ /* New canonical freeze plan starting with this tup */
+ heap_xlog_new_freeze_plan(plans_out, frz);
+ nplans++;
+ }
+
+ /*
+ * Save page offset number in dedicated buffer in passing.
+ *
+ * REDO routine relies on the record's offset numbers array grouping
+ * offset numbers by freeze plan. The sort order within each grouping
+ * is ascending offset number order, just to keep things tidy.
+ */
+ offsets_out[i] = frz->offset;
+ }
+
+ return nplans;
+}
+
/*
* Replay XLOG_HEAP2_FREEZE_PAGE records
*/
@@ -8923,20 +9094,17 @@ heap_xlog_freeze_page(XLogReaderState *record)
{
XLogRecPtr lsn = record->EndRecPtr;
xl_heap_freeze_page *xlrec = (xl_heap_freeze_page *) XLogRecGetData(record);
- TransactionId cutoff_xid = xlrec->cutoff_xid;
+ TransactionId latestRemovedXid = xlrec->latestRemovedXid;
Buffer buffer;
- int ntup;
/*
* In Hot Standby mode, ensure that there's no queries running which still
* consider the frozen xids as running.
*/
+ Assert(TransactionIdIsValid(latestRemovedXid));
if (InHotStandby)
{
RelFileLocator rlocator;
- TransactionId latestRemovedXid = cutoff_xid;
-
- TransactionIdRetreat(latestRemovedXid);
XLogRecGetBlockTag(record, 0, &rlocator, NULL, NULL);
ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rlocator);
@@ -8945,22 +9113,40 @@ heap_xlog_freeze_page(XLogReaderState *record)
if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
{
Page page = BufferGetPage(buffer);
- xl_heap_freeze_tuple *tuples;
+ xl_heap_freeze_plan *plans;
+ OffsetNumber *offsets;
+ int curoff = 0;
- tuples = (xl_heap_freeze_tuple *) XLogRecGetBlockData(record, 0, NULL);
-
- /* now execute freeze plan for each frozen tuple */
- for (ntup = 0; ntup < xlrec->ntuples; ntup++)
+ /*
+ * Convert plan/tuple grouping representation from WAL record into
+ * per-tuple representation for heap_execute_freeze_tuple call
+ */
+ plans = (xl_heap_freeze_plan *) XLogRecGetBlockData(record, 0, NULL);
+ offsets = (OffsetNumber *) ((char *) plans +
+ (xlrec->nplans *
+ sizeof(xl_heap_freeze_plan)));
+ for (int plan = 0; plan < xlrec->nplans; plan++)
{
- xl_heap_freeze_tuple *xlrec_tp;
- ItemId lp;
- HeapTupleHeader tuple;
+ xl_heap_freeze_plan *tuple_grouping = &plans[plan];
+ HeapFreezeTuple frz;
- xlrec_tp = &tuples[ntup];
- lp = PageGetItemId(page, xlrec_tp->offset); /* offsets are one-based */
- tuple = (HeapTupleHeader) PageGetItem(page, lp);
+ frz.t_infomask2 = tuple_grouping->t_infomask2;
+ frz.t_infomask = tuple_grouping->t_infomask;
+ frz.xmax = tuple_grouping->xmax;
+ frz.frzflags = tuple_grouping->frzflags;
+ frz.offset = InvalidOffsetNumber;
- heap_execute_freeze_tuple(tuple, xlrec_tp);
+ /* Iterate through tuple grouping in offset number order */
+ for (int i = 0; i < tuple_grouping->ntuples; i++)
+ {
+ ItemId lp;
+ HeapTupleHeader tuple;
+ OffsetNumber offset = offsets[curoff++];
+
+ lp = PageGetItemId(page, offset);
+ tuple = (HeapTupleHeader) PageGetItem(page, lp);
+ heap_execute_freeze_tuple(tuple, &frz);
+ }
}
PageSetLSN(page, lsn);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index dfbe37472..0623923c4 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -1566,7 +1566,7 @@ lazy_scan_prune(LVRelState *vacrel,
TransactionId NewRelfrozenXid;
MultiXactId NewRelminMxid;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
- xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
+ HeapFreezeTuple frozen[MaxHeapTuplesPerPage];
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1824,41 +1824,8 @@ retry:
Assert(prunestate->hastup);
vacrel->frozen_pages++;
-
- /*
- * At least one tuple with storage needs to be frozen -- execute that
- * now.
- *
- * If we need to freeze any tuples we'll mark the buffer dirty, and
- * write a WAL record recording the changes. We must log the changes
- * to be crash-safe against future truncation of CLOG.
- */
- START_CRIT_SECTION();
-
- MarkBufferDirty(buf);
-
- /* execute collected freezes */
- for (int i = 0; i < tuples_frozen; i++)
- {
- HeapTupleHeader htup;
-
- itemid = PageGetItemId(page, frozen[i].offset);
- htup = (HeapTupleHeader) PageGetItem(page, itemid);
-
- heap_execute_freeze_tuple(htup, &frozen[i]);
- }
-
- /* Now WAL-log freezing if necessary */
- if (RelationNeedsWAL(vacrel->rel))
- {
- XLogRecPtr recptr;
-
- recptr = log_heap_freeze(vacrel->rel, buf, vacrel->FreezeLimit,
- frozen, tuples_frozen);
- PageSetLSN(page, recptr);
- }
-
- END_CRIT_SECTION();
+ heap_freeze_prepared_tuples(vacrel->rel, buf, vacrel->FreezeLimit,
+ frozen, tuples_frozen);
}
/*
diff --git a/src/backend/access/rmgrdesc/heapdesc.c b/src/backend/access/rmgrdesc/heapdesc.c
index 923d3bc43..3f8c5e63f 100644
--- a/src/backend/access/rmgrdesc/heapdesc.c
+++ b/src/backend/access/rmgrdesc/heapdesc.c
@@ -140,8 +140,8 @@ heap2_desc(StringInfo buf, XLogReaderState *record)
{
xl_heap_freeze_page *xlrec = (xl_heap_freeze_page *) rec;
- appendStringInfo(buf, "cutoff xid %u ntuples %u",
- xlrec->cutoff_xid, xlrec->ntuples);
+ appendStringInfo(buf, "latestRemovedXid %u nplans %u",
+ xlrec->latestRemovedXid, xlrec->nplans);
}
else if (info == XLOG_HEAP2_VISIBLE)
{
--
2.34.1
On Tue, Sep 20, 2022 at 03:12:00PM -0700, Peter Geoghegan wrote:
On Mon, Sep 12, 2022 at 2:01 PM Peter Geoghegan <pg@bowt.ie> wrote:
I'd like to talk about one such technique on this thread. The attached
WIP patch reduces the size of xl_heap_freeze_page records by applying
a simple deduplication process.
Attached is v2, which I'm just posting to keep CFTester happy. No real
changes here.
This idea seems promising. I see that you called this patch a
work-in-progress, so I'm curious what else you are planning to do with it.
As I'm reading this thread and the patch, I'm finding myself wondering if
it's worth exploring using wal_compression for these records instead. I
think you've essentially created an efficient compression mechanism for
this one type of record, but I'm assuming that lz4/zstd would also yield
some rather substantial improvements for this kind of data. Presumably a
generic WAL record compression mechanism could be reused for other large
records, too. That could be much easier than devising a deduplication
strategy for every record type.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Wed, Sep 21, 2022 at 1:14 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
This idea seems promising. I see that you called this patch a
work-in-progress, so I'm curious what else you are planning to do with it.
I really just meant that the patch wasn't completely finished at that
point. I hadn't yet convinced myself that I mostly had it right. I'm
more confident now.
As I'm reading this thread and the patch, I'm finding myself wondering if
it's worth exploring using wal_compression for these records instead.
The term deduplication works better than compression here because
we're not actually decompressing anything in the REDO routine. Rather,
the REDO routine processes each freeze plan by processing all affected
tuples in order. To me this seems like the natural way to structure
things -- the WAL records are much smaller, but in a way that's kind
of incidental. The approach taken by the patch just seems like the
natural approach, given the specifics of how freezing works at a high
level.
I think you've essentially created an efficient compression mechanism for
this one type of record, but I'm assuming that lz4/zstd would also yield
some rather substantial improvements for this kind of data.
I don't think of it that way. I've used the term "deduplication" to
advertise the patch, but that's mostly just a description of what
we're doing in the patch relative to what we do on HEAD today. There
is nothing truly clever in the patch. We see a huge amount of
redundancy among tuples from the same page in practically all cases,
for reasons that have everything to do with what freezing is, and how
it works at a high level. The thought process that led to my writing
this patch was more high level than appearances suggest. (I often
write patches that combine high level and low level insights in some
way or other, actually.)
Theoretically there might not be very much redundancy within each
xl_heap_freeze_page record, with the right workload, but in practice a
decrease of 4x or more is all but guaranteed once you have more than a
few tuples to freeze on each page. If there are other WAL records that
are as space inefficient as xl_heap_freeze_page is, then I'd be
surprised -- it is *unusually* space inefficient (like I said, I
suspect that this may have something to do with the fact that it was
originally designed under time pressure). So I don't expect that this
patch tells us much about what we should do for any other WAL record.
I certainly *hope* that it doesn't, at least.
Presumably a
generic WAL record compression mechanism could be reused for other large
records, too. That could be much easier than devising a deduplication
strategy for every record type.
It's quite possible that that's a good idea, but that should probably
work as an additive thing. That's something that I think of as a
"clever technique", whereas I'm focussed on just not being naive in
how we represent this one specific WAL record type.
--
Peter Geoghegan
On Wed, Sep 21, 2022 at 2:11 PM Peter Geoghegan <pg@bowt.ie> wrote:
Presumably a
generic WAL record compression mechanism could be reused for other large
records, too. That could be much easier than devising a deduplication
strategy for every record type.
It's quite possible that that's a good idea, but that should probably
work as an additive thing. That's something that I think of as a
"clever technique", whereas I'm focussed on just not being naive in
how we represent this one specific WAL record type.
BTW, if you wanted to pursue something like this (something that could work
with many different types of WAL record), ISTM that a "medium level" (not
low level) approach might be the best place to start. In particular,
the way that page offset numbers are represented in many WAL records
is quite space inefficient. A domain-specific approach built with
some understanding of how page offset numbers tend to look in practice
seems promising.
The representation of page offset numbers in PRUNE and VACUUM heapam
WAL records (and in index WAL records) always just stores an array of
2 byte OffsetNumber elements. It probably wouldn't be all that
difficult to come up with a simple scheme for compressing an array of
OffsetNumbers in WAL records. It certainly doesn't seem like it would
be all that difficult to get it down to 1 byte per offset number in
most cases (even greater improvements seem doable).
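Just to sketch the sort of scheme I have in mind (purely illustrative, not
from any patch on this thread, and encode_offsets is a made-up name): exploit
the fact that the array is sorted and unique, and store one-byte deltas, with
an escape byte for the rare large delta:

#include "postgres.h"
#include "storage/off.h"

/*
 * Illustrative sketch only: encode a sorted, unique OffsetNumber array as
 * deltas, using a single byte per entry in the common case.  Deltas are
 * always >= 1 for such an array, and are almost always small, so the 0xFF
 * escape (marker byte followed by the absolute offset as two bytes) should
 * hardly ever be needed.  Returns the number of bytes written to dst;
 * caller must size dst for the worst case (3 * noffsets bytes).
 */
static int
encode_offsets(const OffsetNumber *offsets, int noffsets, uint8 *dst)
{
    OffsetNumber prev = 0;
    int          len = 0;

    for (int i = 0; i < noffsets; i++)
    {
        uint16      delta = offsets[i] - prev;

        if (delta < 0xFF)
            dst[len++] = (uint8) delta;
        else
        {
            dst[len++] = 0xFF;
            dst[len++] = (uint8) (offsets[i] & 0xFF);
            dst[len++] = (uint8) (offsets[i] >> 8);
        }
        prev = offsets[i];
    }

    return len;
}

A REDO routine could walk the encoded bytes in the same way, so there may be
no need to ever materialize a plain OffsetNumber array on the decoding side.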
That could also be used for the xl_heap_freeze_page record type --
though only after this patch is committed. The patch makes the WAL
record use a simple array of page offset numbers, just like in
PRUNE/VACUUM records. That's another reason why the approach
implemented by the patch seems like "the natural approach" to me. It's
much closer to how heapam PRUNE records work (we have a variable
number of arrays of page offset numbers in both cases).
--
Peter Geoghegan
On Wed, Sep 21, 2022 at 02:41:28PM -0700, Peter Geoghegan wrote:
On Wed, Sep 21, 2022 at 2:11 PM Peter Geoghegan <pg@bowt.ie> wrote:
Presumably a
generic WAL record compression mechanism could be reused for other large
records, too. That could be much easier than devising a deduplication
strategy for every record type.
It's quite possible that that's a good idea, but that should probably
work as an additive thing. That's something that I think of as a
"clever technique", whereas I'm focussed on just not being naive in
how we represent this one specific WAL record type.
BTW, if you wanted to pursue something like this (something that could work
with many different types of WAL record), ISTM that a "medium level" (not
low level) approach might be the best place to start. In particular,
the way that page offset numbers are represented in many WAL records
is quite space inefficient. A domain-specific approach built with
some understanding of how page offset numbers tend to look in practice
seems promising.
I wouldn't mind giving this a try.
The representation of page offset numbers in PRUNE and VACUUM heapam
WAL records (and in index WAL records) always just stores an array of
2 byte OffsetNumber elements. It probably wouldn't be all that
difficult to come up with a simple scheme for compressing an array of
OffsetNumbers in WAL records. It certainly doesn't seem like it would
be all that difficult to get it down to 1 byte per offset number in
most cases (even greater improvements seem doable).
That could also be used for the xl_heap_freeze_page record type --
though only after this patch is committed. The patch makes the WAL
record use a simple array of page offset numbers, just like in
PRUNE/VACUUM records. That's another reason why the approach
implemented by the patch seems like "the natural approach" to me. It's
much closer to how heapam PRUNE records work (we have a variable
number of arrays of page offset numbers in both cases).
Yeah, it seems likely that we could pack offsets in single bytes in many
cases. A more sophisticated approach could even choose how many bits to
use per offset based on the maximum in the array. Furthermore, we might be
able to make use of SIMD instructions to mitigate any performance penalty.
I'm tempted to start by just using single-byte offsets when possible since
that should be relatively simple while still yielding a decent improvement
for many workloads. What do you think?
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Wed, Sep 21, 2022 at 02:11:36PM -0700, Peter Geoghegan wrote:
On Wed, Sep 21, 2022 at 1:14 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
Presumably a
generic WAL record compression mechanism could be reused for other large
records, too. That could be much easier than devising a deduplication
strategy for every record type.
It's quite possible that that's a good idea, but that should probably
work as an additive thing. That's something that I think of as a
"clever technique", whereas I'm focussed on just not being naive in
how we represent this one specific WAL record type.
Got it. I think that's a fair point.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Wed, Sep 21, 2022 at 9:21 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
I wouldn't mind giving this a try.
Definitely seems worth experimenting with. Many WAL records generated
during VACUUM (and during opportunistic pruning/index tuple deletion)
have offset number arrays that can be assumed to be both sorted and
unique. My guess is that these WAL record types are the most amenable
to compression at the WAL record level.
Yeah, it seems likely that we could pack offsets in single bytes in many
cases. A more sophisticated approach could even choose how many bits to
use per offset based on the maximum in the array. Furthermore, we might be
able to make use of SIMD instructions to mitigate any performance penalty.
I guess I'd start with some kind of differential compression that
relies on the arrays being both sorted and unique. While it might be
important to be able to compose together multiple different techniques
(something that is more than the sum of its parts can be very useful),
it seems most important to quickly validate the basic idea first.
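As a strawman for that kind of differential scheme: since the array is
sorted, unique, and 1-based, every gap is at least 1, so storing
(delta - 1) in a single byte, with an escape for the rare larger gap,
already covers the common case. Purely illustrative code, not tied to
any existing WAL infrastructure:

#include <stdint.h>
#include <string.h>

/*
 * Delta-encode a sorted, unique array of 1-based offsets.  Each gap is
 * at least 1, so store (delta - 1) in one byte; 0xFF escapes to a full
 * 2-byte absolute offset for larger gaps.  Returns bytes written.
 */
static size_t
delta_encode_offsets(const uint16_t *offsets, int n, uint8_t *dst)
{
	size_t		len = 0;
	uint16_t	prev = 0;

	for (int i = 0; i < n; i++)
	{
		uint16_t	delta = offsets[i] - prev;

		if (delta - 1 < 0xFF)
			dst[len++] = (uint8_t) (delta - 1);
		else
		{
			/* escape: emit the absolute offset verbatim */
			dst[len++] = 0xFF;
			memcpy(dst + len, &offsets[i], sizeof(uint16_t));
			len += sizeof(uint16_t);
		}
		prev = offsets[i];
	}
	return len;
}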
One obvious thing that still seems worth pointing out: you may not
need to decompress anything. All that you really need to do is to get
the logic from some routine like PageIndexMultiDelete() to be executed
by a REDO routine. Perhaps it'll make sense to come up with a
representation that can just be passed to the routine directly. (I
really don't know how likely this is, but it's something to consider.)
I'm tempted to start by just using single-byte offsets when possible since
that should be relatively simple while still yielding a decent improvement
for many workloads. What do you think?
The really big wins for WAL size will come from applying high level
insights about what is really needed, in all likelihood. The main
overhead is very often generic WAL record header overhead. When it is,
it'll be hard to really move the needle just by compressing the
payload. To do that you'd need to find a way to use fewer WAL records
to do the same amount of work.
The thing that I think will really make the biggest difference is
merging PRUNE records with FREEZE records (and maybe even make the
merged record do the work of a VISIBLE record when that's possible).
Just because now you have 1 WAL record instead of 2 (or even 3) WAL
records. Obviously that's a complicated project, but it's another case
where it feels like the more efficient approach might also be simpler.
We often write a PRUNE record with only one or two items in the array,
in which case it's practically free to do some freezing, at least in
terms of WAL space overhead (as long as you can do it with the same
WAL record). Plus freezing is already inescapably tied to pruning --
we always prune a page that we're going to try to freeze in VACUUM (we
can't safely freeze dead tuples, so there is more or less a dependency
already).
Not that you shouldn't pursue compressing the payload from WAL records
as a project -- maybe that'll work very well. I'm just pointing out
that there is a bigger picture, that may or may not end up mattering
here. For the patch on this thread there certainly is a bigger picture
about costs over time. Something like that could be true for this
other patch too.
It's definitely worth considering the size of the WAL records when
there are only one or two items, how common that may be in each
individual case, etc.
For example, FREEZE records have a minimum size of 64 bytes in
practice, due to WAL record alignment overhead (the payload itself
doesn't usually have to be aligned, but the WAL header still does). It
may be possible to avoid going over the critical threshold that makes
the WAL size one MAXALIGN() quantum larger in the event of having only
a few tuples to freeze, a scenario where negative compression is
likely.
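To make the threshold arithmetic concrete, here is the rounding effect
in isolation (assuming the common case where MAXIMUM_ALIGNOF is 8; the
exact header sizes don't matter for the point being made):

#include <stdio.h>
#include <stdint.h>

/* the usual definition on platforms where MAXIMUM_ALIGNOF is 8 */
#define MAXALIGN(LEN)	(((uintptr_t) (LEN) + 7) & ~((uintptr_t) 7))

int
main(void)
{
	/*
	 * Anything from 57 to 64 bytes of raw record occupies 64 bytes in
	 * the WAL stream; one more byte costs a whole extra 8-byte quantum.
	 */
	for (unsigned raw = 57; raw <= 66; raw++)
		printf("raw %u -> %u bytes after MAXALIGN\n",
			   raw, (unsigned) MAXALIGN(raw));
	return 0;
}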
Negative compression is always a potential problem, but maybe you can
deal with it very effectively by thinking about the problem
holistically. If you're "wasting" space that was just going to be
alignment padding anyway, does it really matter at all? (Maybe there
is some reason to care, but offhand I can't think of one.)
--
Peter Geoghegan
On Tue, Sep 20, 2022 at 3:12 PM Peter Geoghegan <pg@bowt.ie> wrote:
Attached is v2, which I'm just posting to keep CFTester happy. No real
changes here.
Attached is v3. I'd like to move forward with commit soon. I'll do so
in the next few days, barring objections.
v3 has vacuumlazy.c pass NewRelfrozenXid instead of FreezeLimit for
the purposes of generating recovery conflicts during subsequent REDO
of the resulting xl_heap_freeze_page WAL record. This more general
approach is preparation for my patch to add page-level freezing [1].
It might theoretically lead to more recovery conflicts, but in
practice the impact should be negligible. For one thing VACUUM must
freeze *something* before any recovery conflict can happen during
subsequent REDO on a replica in hot standby. It's far more likely that
any disruptive recovery conflicts come from pruning.
It also makes the cutoff_xid field from the xl_heap_freeze_page WAL
record into a "standard latestRemovedXid format" field. In other words
it backs up an XID passed by vacuumlazy.c caller during original
execution (not in the REDO routine, as on HEAD). To make things
clearer, the patch also renames the nearby xl_heap_visible.cutoff_xid
field to xl_heap_visible.latestRemovedXid. Now there are no WAL
records with a field called "cutoff_xid" (they're all called
"latestRemovedXid" now). This matches PRUNE records, and B-Tree DELETE
records.
The overall picture is that all REDO routines (for both heapam and
index AMs) now advertise that they have a field that they use to
generate recovery conflicts that follows a standard format. All
latestRemovedXid XIDs are applied in a standard way during REDO: by
passing them to ResolveRecoveryConflictWithSnapshot(). Users can grep
the output of tools like pg_waldump to find latestRemovedXid fields,
without necessarily needing to give any thought to which kind of WAL
records are involved, or even the rmgr. Presenting this information
precisely and uniformly seems useful to me. (Perhaps we should have a
truly generic name, which latestRemovedXid isn't, but that can be
handled separately.)
[1]: https://commitfest.postgresql.org/39/3843/ -- Peter Geoghegan
--
Peter Geoghegan
Attachments:
v3-0001-Shrink-freeze-WAL-records-via-deduplication.patch
From 6d828f2f1e7f6551e05e20f98a7b09087b00cdfe Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 30 Jul 2022 10:47:59 -0700
Subject: [PATCH v3] Shrink freeze WAL records via deduplication.
Make heapam WAL records that describe freezing performed by VACUUM more
space efficient by storing each distinct "freeze plan" once, alongside
an array of associated page offset numbers (one per freeze plan). The
freeze plans required for most heap pages tend to naturally have a great
deal of redundancy, so this technique is very effective. It routinely
results in a WAL record that is 5x smaller than an equivalent record
generated using the previous approach.
The freeze plan concept was introduced by bugfix commit 3b97e6823b,
which fixed shortcomings in how VACUUM processed MultiXacts. We retain
the general idea of freeze plans, but return to using space efficient
page offset number arrays. This makes the WAL records about as space
efficient as they were before the bugfix, without losing flexibility.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-By: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wz=XytErMnb8FAyFd+OQEbiipB0Q2FmFdXrggPL4VBnRYQ@mail.gmail.com
---
src/include/access/heapam.h | 25 ++
src/include/access/heapam_xlog.h | 42 ++-
src/backend/access/heap/heapam.c | 343 ++++++++++++++++++++-----
src/backend/access/heap/vacuumlazy.c | 38 +--
src/backend/access/rmgrdesc/heapdesc.c | 8 +-
5 files changed, 325 insertions(+), 131 deletions(-)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 9dab35551..76ed417bc 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -99,6 +99,19 @@ typedef enum
HEAPTUPLE_DELETE_IN_PROGRESS /* deleting xact is still in progress */
} HTSV_Result;
+/* heap_prepare_freeze_tuple state describing how to freeze a tuple */
+typedef struct HeapFreezeTuple
+{
+ /* Fields describing how to process tuple */
+ uint16 t_infomask2;
+ uint16 t_infomask;
+ TransactionId xmax;
+ uint8 frzflags;
+
+ /* Page offset number for tuple */
+ OffsetNumber offset;
+} HeapFreezeTuple;
+
/* ----------------
* function prototypes for heap access method
*
@@ -164,6 +177,18 @@ extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
Buffer *buffer, struct TM_FailureData *tmfd);
extern void heap_inplace_update(Relation relation, HeapTuple tuple);
+extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
+ TransactionId relfrozenxid,
+ TransactionId relminmxid,
+ TransactionId cutoff_xid,
+ TransactionId cutoff_multi,
+ HeapFreezeTuple *frz,
+ bool *totally_frozen,
+ TransactionId *relfrozenxid_out,
+ MultiXactId *relminmxid_out);
+extern void heap_freeze_prepared_tuples(Relation rel, Buffer buffer,
+ TransactionId NewRelfrozenXid,
+ HeapFreezeTuple *tuples, int ntuples);
extern bool heap_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 34220d93c..dee0a6864 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -315,34 +315,39 @@ typedef struct xl_heap_inplace
/*
* This struct represents a 'freeze plan', which is what we need to know about
- * a single tuple being frozen during vacuum.
+ * a group of tuples being frozen from the same page.
*/
/* 0x01 was XLH_FREEZE_XMIN */
#define XLH_FREEZE_XVAC 0x02
#define XLH_INVALID_XVAC 0x04
-typedef struct xl_heap_freeze_tuple
+typedef struct xl_heap_freeze_plan
{
TransactionId xmax;
- OffsetNumber offset;
+ uint16 ntuples;
uint16 t_infomask2;
uint16 t_infomask;
uint8 frzflags;
-} xl_heap_freeze_tuple;
+} xl_heap_freeze_plan;
/*
* This is what we need to know about a block being frozen during vacuum
*
- * Backup block 0's data contains an array of xl_heap_freeze_tuple structs,
- * one for each tuple.
+ * Backup block 0's data contains an array of xl_heap_freeze_plan structs
+ * (with nplans elements), followed by one or more page offset number arrays.
+ * Each such page offset number array corresponds to a single freeze plan
+ * (REDO routine freezes corresponding heap tuples using freeze plan).
*/
typedef struct xl_heap_freeze_page
{
- TransactionId cutoff_xid;
- uint16 ntuples;
+ TransactionId latestRemovedXid;
+ uint16 nplans;
+
+ /* FREEZE PLANS FOLLOW */
+ /* OFFSET NUMBER ARRAY FOLLOWS */
} xl_heap_freeze_page;
-#define SizeOfHeapFreezePage (offsetof(xl_heap_freeze_page, ntuples) + sizeof(uint16))
+#define SizeOfHeapFreezePage (offsetof(xl_heap_freeze_page, nplans) + sizeof(uint16))
/*
* This is what we need to know about setting a visibility map bit
@@ -352,7 +357,7 @@ typedef struct xl_heap_freeze_page
*/
typedef struct xl_heap_visible
{
- TransactionId cutoff_xid;
+ TransactionId latestRemovedXid;
uint8 flags;
} xl_heap_visible;
@@ -401,21 +406,8 @@ extern void heap2_desc(StringInfo buf, XLogReaderState *record);
extern const char *heap2_identify(uint8 info);
extern void heap_xlog_logical_rewrite(XLogReaderState *r);
-extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
- TransactionId cutoff_xid, xl_heap_freeze_tuple *tuples,
- int ntuples);
-extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
- TransactionId relfrozenxid,
- TransactionId relminmxid,
- TransactionId cutoff_xid,
- TransactionId cutoff_multi,
- xl_heap_freeze_tuple *frz,
- bool *totally_frozen,
- TransactionId *relfrozenxid_out,
- MultiXactId *relminmxid_out);
-extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
- xl_heap_freeze_tuple *frz);
extern XLogRecPtr log_heap_visible(RelFileLocator rlocator, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid, uint8 vmflags);
+ Buffer vm_buffer, TransactionId latestRemovedXid,
+ uint8 vmflags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 12be87efe..fc9082a6b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -110,6 +110,9 @@ static int bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate);
static XLogRecPtr log_heap_new_cid(Relation relation, HeapTuple tup);
static HeapTuple ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_required,
bool *copy);
+static int heap_xlog_freeze_plan(HeapFreezeTuple *tuples, int ntuples,
+ xl_heap_freeze_plan *plans_out,
+ OffsetNumber *offsets_out);
/*
@@ -6439,7 +6442,9 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* will be totally frozen after these operations are performed and false if
* more freezing will eventually be required.
*
- * Caller must set frz->offset itself, before heap_execute_freeze_tuple call.
+ * VACUUM caller must assemble HeapFreezeTuple entries for every tuple that we
+ * returned true for when called. A later heap_freeze_prepared_tuples call
+ * will execute freezing for caller's page as a whole.
*
* It is assumed that the caller has checked the tuple with
* HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
@@ -6463,15 +6468,12 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* It will be set as tuple's new xmax when our *frz output is processed within
* heap_execute_freeze_tuple later on. If the tuple is in a shared buffer
* then caller had better have an exclusive lock on it already.
- *
- * NB: It is not enough to set hint bits to indicate an XID committed/aborted.
- * The *frz WAL record we output completely removes all old XIDs during REDO.
*/
bool
heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi,
- xl_heap_freeze_tuple *frz, bool *totally_frozen,
+ HeapFreezeTuple *frz, bool *totally_frozen,
TransactionId *relfrozenxid_out,
MultiXactId *relminmxid_out)
{
@@ -6762,10 +6764,12 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
* tuple status. Also, getting exclusive lock makes it safe to adjust the
* infomask bits.
*
- * NB: All code in here must be safe to execute during crash recovery!
+ * NB: All code in here must be safe to execute during crash recovery! It is
+ * not enough to set hint bits to indicate an XID committed/aborted, either.
+ * Caller's REDO routine must reliably remove all old XIDs.
*/
-void
-heap_execute_freeze_tuple(HeapTupleHeader tuple, xl_heap_freeze_tuple *frz)
+static inline void
+heap_execute_freeze_tuple(HeapTupleHeader tuple, HeapFreezeTuple *frz)
{
HeapTupleHeaderSetXmax(tuple, frz->xmax);
@@ -6779,6 +6783,92 @@ heap_execute_freeze_tuple(HeapTupleHeader tuple, xl_heap_freeze_tuple *frz)
tuple->t_infomask2 = frz->t_infomask2;
}
+/*
+ * heap_freeze_prepared_tuples
+ * Execute freezing of prepared tuple plans.
+ *
+ * If we need to freeze any tuples we'll mark the buffer dirty, and write a
+ * WAL record recording the changes. We must log the changes to be crash-safe
+ * against future truncation of CLOG.
+ *
+ * Caller must always set 'offset' page offset number from each tuple plan.
+ *
+ * NB: We sort the tuples array in-place. Caller should be finished with it
+ * by now.
+ */
+void
+heap_freeze_prepared_tuples(Relation rel, Buffer buffer,
+ TransactionId NewRelfrozenXid,
+ HeapFreezeTuple *tuples, int ntuples)
+{
+ Page page = BufferGetPage(buffer);
+
+ /* nor when there are no tuples to freeze */
+ Assert(ntuples > 0);
+ Assert(TransactionIdIsValid(NewRelfrozenXid));
+
+ START_CRIT_SECTION();
+
+ MarkBufferDirty(buffer);
+
+ /* execute collected freezes */
+ for (int i = 0; i < ntuples; i++)
+ {
+ HeapTupleHeader htup;
+ ItemId itemid = PageGetItemId(page, tuples[i].offset);
+
+ htup = (HeapTupleHeader) PageGetItem(page, itemid);
+ heap_execute_freeze_tuple(htup, &tuples[i]);
+ }
+
+ /* Now WAL-log freezing if necessary */
+ if (RelationNeedsWAL(rel))
+ {
+ xl_heap_freeze_plan plans[MaxHeapTuplesPerPage];
+ OffsetNumber offsets[MaxHeapTuplesPerPage];
+ int nplans;
+ xl_heap_freeze_page xlrec;
+ XLogRecPtr recptr;
+ TransactionId latestRemovedXid;
+
+ /* Prepare deduplicated representation for use in WAL record */
+ nplans = heap_xlog_freeze_plan(tuples, ntuples, plans, offsets);
+
+ /*
+ * latestRemovedXid describes the latest processed XID, whereas
+ * NewRelfrozenXid describes the earliest unprocessed extant XID.
+ * Back up caller's NewRelfrozenXid to "convert". This isn't strictly
+ * necessary, but doing so can avoid false conflicts, especially when
+ * NewRelfrozenXid is precisely equal to VACUUM's OldestXmin cutoff.
+ */
+ latestRemovedXid = NewRelfrozenXid;
+ TransactionIdRetreat(latestRemovedXid);
+
+ xlrec.latestRemovedXid = latestRemovedXid;
+ xlrec.nplans = nplans;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfHeapFreezePage);
+
+ /*
+ * The freeze plan array and offset array are not actually in the
+ * buffer, but pretend that they are. When XLogInsert stores the
+ * whole buffer, the arrays need not be stored too.
+ */
+ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+ XLogRegisterBufData(0, (char *) plans,
+ nplans * sizeof(xl_heap_freeze_plan));
+ XLogRegisterBufData(0, (char *) offsets,
+ ntuples * sizeof(OffsetNumber));
+
+ recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_FREEZE_PAGE);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+}
+
/*
* heap_freeze_tuple
* Freeze tuple in place, without WAL logging.
@@ -6790,7 +6880,7 @@ heap_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi)
{
- xl_heap_freeze_tuple frz;
+ HeapFreezeTuple frz;
bool do_freeze;
bool tuple_totally_frozen;
TransactionId relfrozenxid_out = cutoff_xid;
@@ -8151,54 +8241,23 @@ bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate)
return nblocksfavorable;
}
-/*
- * Perform XLogInsert for a heap-freeze operation. Caller must have already
- * modified the buffer and marked it dirty.
- */
-XLogRecPtr
-log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
- xl_heap_freeze_tuple *tuples, int ntuples)
-{
- xl_heap_freeze_page xlrec;
- XLogRecPtr recptr;
-
- /* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
- /* nor when there are no tuples to freeze */
- Assert(ntuples > 0);
-
- xlrec.cutoff_xid = cutoff_xid;
- xlrec.ntuples = ntuples;
-
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec, SizeOfHeapFreezePage);
-
- /*
- * The freeze plan array is not actually in the buffer, but pretend that
- * it is. When XLogInsert stores the whole buffer, the freeze plan need
- * not be stored too.
- */
- XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
- XLogRegisterBufData(0, (char *) tuples,
- ntuples * sizeof(xl_heap_freeze_tuple));
-
- recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_FREEZE_PAGE);
-
- return recptr;
-}
-
/*
* Perform XLogInsert for a heap-visible operation. 'block' is the block
* being marked all-visible, and vm_buffer is the buffer containing the
* corresponding visibility map block. Both should have already been modified
* and dirtied.
*
+ * latestRemovedXid comes from the largest xmin on the page being marked
+ * all-visible. REDO routines uses it as a latestRemovedXid to generate
+ * recovery conflicts in the standard way (even though nothing has been
+ * removed). Passed to visibilitymap_set() as cutoff_xid argument by VACUUM.
+ *
* If checksums are enabled, we also generate a full-page image of
* heap_buffer, if necessary.
*/
XLogRecPtr
log_heap_visible(RelFileLocator rlocator, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid, uint8 vmflags)
+ TransactionId latestRemovedXid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -8207,7 +8266,7 @@ log_heap_visible(RelFileLocator rlocator, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(heap_buffer));
Assert(BufferIsValid(vm_buffer));
- xlrec.cutoff_xid = cutoff_xid;
+ xlrec.latestRemovedXid = latestRemovedXid;
xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -8810,7 +8869,7 @@ heap_xlog_visible(XLogReaderState *record)
* rather than killing the transaction outright.
*/
if (InHotStandby)
- ResolveRecoveryConflictWithSnapshot(xlrec->cutoff_xid, rlocator);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rlocator);
/*
* Read the heap page, if it still exists. If the heap file has dropped or
@@ -8914,7 +8973,7 @@ heap_xlog_visible(XLogReaderState *record)
*/
if (lsn > PageGetLSN(vmpage))
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid, xlrec->flags);
+ xlrec->latestRemovedXid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
@@ -8923,6 +8982,141 @@ heap_xlog_visible(XLogReaderState *record)
UnlockReleaseBuffer(vmbuffer);
}
+/*
+ * qsort comparator used to deduplicate tuple-based freeze plans for WAL
+ */
+static int
+heap_xlog_freeze_cmp(const void *arg1, const void *arg2)
+{
+ HeapFreezeTuple *frz1 = (HeapFreezeTuple *) arg1;
+ HeapFreezeTuple *frz2 = (HeapFreezeTuple *) arg2;
+
+ /* Compare fields that describe actions required to freeze this tuple */
+ if (frz1->t_infomask2 < frz2->t_infomask2)
+ return -1;
+ else if (frz1->t_infomask2 > frz2->t_infomask2)
+ return 1;
+
+ if (frz1->t_infomask < frz2->t_infomask)
+ return -1;
+ else if (frz1->t_infomask > frz2->t_infomask)
+ return 1;
+
+ if (frz1->xmax < frz2->xmax)
+ return -1;
+ else if (frz1->xmax > frz2->xmax)
+ return 1;
+
+ if (frz1->frzflags < frz2->frzflags)
+ return -1;
+ else if (frz1->frzflags > frz2->frzflags)
+ return 1;
+
+ /*
+ * heap_xlog_freeze_eq would report that this pair of tuple-wise freeze
+ * plans as equal. Their canonicalized freeze plan will only be stored
+ * once in final WAL record.
+ *
+ * Tiebreak on page offset number, just to keep things tidy.
+ */
+ if (frz1->offset < frz2->offset)
+ return -1;
+ else if (frz1->offset > frz2->offset)
+ return 1;
+
+ Assert(false);
+ return 0;
+}
+
+/*
+ * Compare fields that describe actions required to freeze tuple with caller's
+ * open plan. If everything matches then the frz tuple plan is equivalent to
+ * caller's plan.
+ */
+static bool
+heap_xlog_freeze_eq(xl_heap_freeze_plan *plan, HeapFreezeTuple *frz)
+{
+ if (plan->t_infomask2 == frz->t_infomask2 &&
+ plan->t_infomask == frz->t_infomask &&
+ plan->xmax == frz->xmax && plan->frzflags == frz->frzflags)
+ return true;
+
+ /* Caller must call heap_xlog_new_freeze_plan again for frz */
+ return false;
+}
+
+/*
+ * Start new plan initialized using tuple-level actions. At least one tuple
+ * will have steps required to freeze described by caller's plan during REDO.
+ */
+static void
+heap_xlog_new_freeze_plan(xl_heap_freeze_plan *plan, HeapFreezeTuple *frz)
+{
+ plan->xmax = frz->xmax;
+ plan->t_infomask = frz->t_infomask;
+ plan->t_infomask2 = frz->t_infomask2;
+ plan->frzflags = frz->frzflags;
+ plan->ntuples = 1;
+}
+
+/*
+ * Deduplicate tuple-based freeze plans so that each distinct set of
+ * processing steps is only stored once in final WAL record.
+ *
+ * Return value is number of plans set in *plans_out for caller. *offsets_out
+ * is set to offset numbers for use in WAL record by caller.
+ */
+static int
+heap_xlog_freeze_plan(HeapFreezeTuple *tuples, int ntuples,
+ xl_heap_freeze_plan *plans_out,
+ OffsetNumber *offsets_out)
+{
+ int nplans = 0;
+
+ /* Sort tuple-based freeze plans in the order required to deduplicate */
+ qsort(tuples, ntuples, sizeof(HeapFreezeTuple), heap_xlog_freeze_cmp);
+
+ for (int i = 0; i < ntuples; i++)
+ {
+ HeapFreezeTuple *frz = tuples + i;
+
+ if (i == 0)
+ {
+ /* New canonical freeze plan starting with first tup */
+ heap_xlog_new_freeze_plan(plans_out, frz);
+ nplans++;
+ }
+ else if (heap_xlog_freeze_eq(plans_out, frz))
+ {
+ /* tup matches open canonical plan -- include tup in it */
+ Assert(offsets_out[i - 1] < frz->offset);
+ plans_out->ntuples++;
+ }
+ else
+ {
+ /* Tup doesn't match current plan -- done with it now */
+ plans_out++;
+
+ /* New canonical freeze plan starting with this tup */
+ heap_xlog_new_freeze_plan(plans_out, frz);
+ nplans++;
+ }
+
+ /*
+ * Save page offset number in dedicated buffer in passing.
+ *
+ * REDO routine relies on the record's offset numbers array grouping
+ * offset numbers by freeze plan. The sort order within each grouping
+ * is ascending offset number order, just to keep things tidy.
+ */
+ offsets_out[i] = frz->offset;
+ }
+
+ Assert(nplans > 0 && nplans <= ntuples);
+
+ return nplans;
+}
+
/*
* Replay XLOG_HEAP2_FREEZE_PAGE records
*/
@@ -8931,9 +9125,7 @@ heap_xlog_freeze_page(XLogReaderState *record)
{
XLogRecPtr lsn = record->EndRecPtr;
xl_heap_freeze_page *xlrec = (xl_heap_freeze_page *) XLogRecGetData(record);
- TransactionId cutoff_xid = xlrec->cutoff_xid;
Buffer buffer;
- int ntup;
/*
* In Hot Standby mode, ensure that there's no queries running which still
@@ -8942,33 +9134,50 @@ heap_xlog_freeze_page(XLogReaderState *record)
if (InHotStandby)
{
RelFileLocator rlocator;
- TransactionId latestRemovedXid = cutoff_xid;
-
- TransactionIdRetreat(latestRemovedXid);
XLogRecGetBlockTag(record, 0, &rlocator, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rlocator);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rlocator);
}
if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
{
Page page = BufferGetPage(buffer);
- xl_heap_freeze_tuple *tuples;
+ xl_heap_freeze_plan *plans;
+ OffsetNumber *offsets;
+ int curoff = 0;
- tuples = (xl_heap_freeze_tuple *) XLogRecGetBlockData(record, 0, NULL);
-
- /* now execute freeze plan for each frozen tuple */
- for (ntup = 0; ntup < xlrec->ntuples; ntup++)
+ plans = (xl_heap_freeze_plan *) XLogRecGetBlockData(record, 0, NULL);
+ offsets = (OffsetNumber *) ((char *) plans +
+ (xlrec->nplans *
+ sizeof(xl_heap_freeze_plan)));
+ for (int p = 0; p < xlrec->nplans; p++)
{
- xl_heap_freeze_tuple *xlrec_tp;
- ItemId lp;
- HeapTupleHeader tuple;
+ xl_heap_freeze_plan plan;
+ HeapFreezeTuple frz;
- xlrec_tp = &tuples[ntup];
- lp = PageGetItemId(page, xlrec_tp->offset); /* offsets are one-based */
- tuple = (HeapTupleHeader) PageGetItem(page, lp);
+ memcpy(&plan, &plans[p], sizeof(xl_heap_freeze_plan));
- heap_execute_freeze_tuple(tuple, xlrec_tp);
+ /*
+ * Convert freeze plan representation into the per-tuple format
+ * used by heap_execute_freeze_tuple (matches original execution)
+ */
+ frz.t_infomask2 = plan.t_infomask2;
+ frz.t_infomask = plan.t_infomask;
+ frz.xmax = plan.xmax;
+ frz.frzflags = plan.frzflags;
+ frz.offset = InvalidOffsetNumber; /* unused */
+
+ /* Iterate through tuples for plan, then freeze each tuple */
+ for (int i = 0; i < plan.ntuples; i++)
+ {
+ ItemId lp;
+ HeapTupleHeader tuple;
+ OffsetNumber offset = offsets[curoff++];
+
+ lp = PageGetItemId(page, offset);
+ tuple = (HeapTupleHeader) PageGetItem(page, lp);
+ heap_execute_freeze_tuple(tuple, &frz);
+ }
}
PageSetLSN(page, lsn);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index dfbe37472..6c52926fa 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -1566,7 +1566,7 @@ lazy_scan_prune(LVRelState *vacrel,
TransactionId NewRelfrozenXid;
MultiXactId NewRelminMxid;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
- xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
+ HeapFreezeTuple frozen[MaxHeapTuplesPerPage];
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1825,40 +1825,8 @@ retry:
vacrel->frozen_pages++;
- /*
- * At least one tuple with storage needs to be frozen -- execute that
- * now.
- *
- * If we need to freeze any tuples we'll mark the buffer dirty, and
- * write a WAL record recording the changes. We must log the changes
- * to be crash-safe against future truncation of CLOG.
- */
- START_CRIT_SECTION();
-
- MarkBufferDirty(buf);
-
- /* execute collected freezes */
- for (int i = 0; i < tuples_frozen; i++)
- {
- HeapTupleHeader htup;
-
- itemid = PageGetItemId(page, frozen[i].offset);
- htup = (HeapTupleHeader) PageGetItem(page, itemid);
-
- heap_execute_freeze_tuple(htup, &frozen[i]);
- }
-
- /* Now WAL-log freezing if necessary */
- if (RelationNeedsWAL(vacrel->rel))
- {
- XLogRecPtr recptr;
-
- recptr = log_heap_freeze(vacrel->rel, buf, vacrel->FreezeLimit,
- frozen, tuples_frozen);
- PageSetLSN(page, recptr);
- }
-
- END_CRIT_SECTION();
+ heap_freeze_prepared_tuples(vacrel->rel, buf, NewRelfrozenXid,
+ frozen, tuples_frozen);
}
/*
diff --git a/src/backend/access/rmgrdesc/heapdesc.c b/src/backend/access/rmgrdesc/heapdesc.c
index 923d3bc43..bc86a62a9 100644
--- a/src/backend/access/rmgrdesc/heapdesc.c
+++ b/src/backend/access/rmgrdesc/heapdesc.c
@@ -140,15 +140,15 @@ heap2_desc(StringInfo buf, XLogReaderState *record)
{
xl_heap_freeze_page *xlrec = (xl_heap_freeze_page *) rec;
- appendStringInfo(buf, "cutoff xid %u ntuples %u",
- xlrec->cutoff_xid, xlrec->ntuples);
+ appendStringInfo(buf, "latestRemovedXid %u nplans %u",
+ xlrec->latestRemovedXid, xlrec->nplans);
}
else if (info == XLOG_HEAP2_VISIBLE)
{
xl_heap_visible *xlrec = (xl_heap_visible *) rec;
- appendStringInfo(buf, "cutoff xid %u flags 0x%02X",
- xlrec->cutoff_xid, xlrec->flags);
+ appendStringInfo(buf, "latestRemovedXid %u flags 0x%02X",
+ xlrec->latestRemovedXid, xlrec->flags);
}
else if (info == XLOG_HEAP2_MULTI_INSERT)
{
--
2.34.1
On Thu, Nov 10, 2022 at 04:48:17PM -0800, Peter Geoghegan wrote:
On Tue, Sep 20, 2022 at 3:12 PM Peter Geoghegan <pg@bowt.ie> wrote:
Attached is v2, which I'm just posting to keep CFTester happy. No real
changes here.

Attached is v3. I'd like to move forward with commit soon. I'll do so
in the next few days, barring objections.
Note that this comment is dangling in your patch:
+{
+ Page page = BufferGetPage(buffer);
+
+ /* nor when there are no tuples to freeze */
...
- /* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
- /* nor when there are no tuples to freeze */
- Assert(ntuples > 0);
On Thu, Nov 10, 2022 at 7:00 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
Note that this comment is dangling in your patch:
Attached is v4, which removes the old comments you pointed out were
now out of place (they weren't adding much anyway). Also fixed bitrot
against HEAD from today's visibility map commit from Jeff Davis.
There is a more substantive change here, too. Like v3, v4 refactors
the *mechanical* details of how the XID based cutoff is handed down.
However, unlike v3, v4 goes back to using vacuumlazy.c's FreezeLimit
as the starting point for generating a latestRemovedXid. It seemed
better to deal with the recovery conflict issues created by my big
page-level freezing/freezing strategy patch set in the patch set
itself.
Will commit this early next week barring any objections.
Thanks
--
Peter Geoghegan
Attachments:
v4-0001-Shrink-freeze-WAL-records-via-deduplication.patch
From 688498b003c8077bc516aaae06fdb8e8c89f5c53 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 30 Jul 2022 10:47:59 -0700
Subject: [PATCH v4] Shrink freeze WAL records via deduplication.
Make heapam WAL records that describe freezing performed by VACUUM more
space efficient by storing each distinct "freeze plan" once, alongside
an array of associated page offset numbers (one per freeze plan). The
freeze plans required for most heap pages tend to naturally have a great
deal of redundancy, so this technique is very effective. It routinely
results in a WAL record that is 5x smaller than an equivalent record
generated using the previous approach.
The freeze plan concept was introduced by bugfix commit 3b97e6823b,
which fixed shortcomings in how VACUUM processed MultiXacts. We retain
the general idea of freeze plans, but return to using space efficient
page offset number arrays. This makes the WAL records about as space
efficient as they were before the bugfix, without losing flexibility.
In passing refactor some of the details around how recovery conflicts
are handled during REDO of freeze WAL records, as well as visibility map
WAL records. This makes it clearer that latestRemovedXid format cutoffs
are used by both record types, just like in prune WAL records. The name
latestRemovedXid is slightly inappropriate here because nothing is
actually removed, but that is arguably a preexisting issue. It can be
dealt with separately by bulk renaming latestRemovedXid fields across
the entire tree.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-By: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-By: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/CAH2-Wz=XytErMnb8FAyFd+OQEbiipB0Q2FmFdXrggPL4VBnRYQ@mail.gmail.com
---
src/include/access/heapam.h | 25 ++
src/include/access/heapam_xlog.h | 42 ++-
src/backend/access/heap/heapam.c | 341 ++++++++++++++++++++-----
src/backend/access/heap/vacuumlazy.c | 38 +--
src/backend/access/rmgrdesc/heapdesc.c | 8 +-
5 files changed, 323 insertions(+), 131 deletions(-)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 5b0787574..9940663bc 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -99,6 +99,19 @@ typedef enum
HEAPTUPLE_DELETE_IN_PROGRESS /* deleting xact is still in progress */
} HTSV_Result;
+/* heap_prepare_freeze_tuple state describing how to freeze a tuple */
+typedef struct HeapFreezeTuple
+{
+ /* Fields describing how to process tuple */
+ uint16 t_infomask2;
+ uint16 t_infomask;
+ TransactionId xmax;
+ uint8 frzflags;
+
+ /* Page offset number for tuple */
+ OffsetNumber offset;
+} HeapFreezeTuple;
+
/* ----------------
* function prototypes for heap access method
*
@@ -164,6 +177,18 @@ extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
Buffer *buffer, struct TM_FailureData *tmfd);
extern void heap_inplace_update(Relation relation, HeapTuple tuple);
+extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
+ TransactionId relfrozenxid,
+ TransactionId relminmxid,
+ TransactionId cutoff_xid,
+ TransactionId cutoff_multi,
+ HeapFreezeTuple *frz,
+ bool *totally_frozen,
+ TransactionId *relfrozenxid_out,
+ MultiXactId *relminmxid_out);
+extern void heap_freeze_prepared_tuples(Relation rel, Buffer buffer,
+ TransactionId FreezeLimit,
+ HeapFreezeTuple *tuples, int ntuples);
extern bool heap_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 34220d93c..dee0a6864 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -315,34 +315,39 @@ typedef struct xl_heap_inplace
/*
* This struct represents a 'freeze plan', which is what we need to know about
- * a single tuple being frozen during vacuum.
+ * a group of tuples being frozen from the same page.
*/
/* 0x01 was XLH_FREEZE_XMIN */
#define XLH_FREEZE_XVAC 0x02
#define XLH_INVALID_XVAC 0x04
-typedef struct xl_heap_freeze_tuple
+typedef struct xl_heap_freeze_plan
{
TransactionId xmax;
- OffsetNumber offset;
+ uint16 ntuples;
uint16 t_infomask2;
uint16 t_infomask;
uint8 frzflags;
-} xl_heap_freeze_tuple;
+} xl_heap_freeze_plan;
/*
* This is what we need to know about a block being frozen during vacuum
*
- * Backup block 0's data contains an array of xl_heap_freeze_tuple structs,
- * one for each tuple.
+ * Backup block 0's data contains an array of xl_heap_freeze_plan structs
+ * (with nplans elements), followed by one or more page offset number arrays.
+ * Each such page offset number array corresponds to a single freeze plan
+ * (REDO routine freezes corresponding heap tuples using freeze plan).
*/
typedef struct xl_heap_freeze_page
{
- TransactionId cutoff_xid;
- uint16 ntuples;
+ TransactionId latestRemovedXid;
+ uint16 nplans;
+
+ /* FREEZE PLANS FOLLOW */
+ /* OFFSET NUMBER ARRAY FOLLOWS */
} xl_heap_freeze_page;
-#define SizeOfHeapFreezePage (offsetof(xl_heap_freeze_page, ntuples) + sizeof(uint16))
+#define SizeOfHeapFreezePage (offsetof(xl_heap_freeze_page, nplans) + sizeof(uint16))
/*
* This is what we need to know about setting a visibility map bit
@@ -352,7 +357,7 @@ typedef struct xl_heap_freeze_page
*/
typedef struct xl_heap_visible
{
- TransactionId cutoff_xid;
+ TransactionId latestRemovedXid;
uint8 flags;
} xl_heap_visible;
@@ -401,21 +406,8 @@ extern void heap2_desc(StringInfo buf, XLogReaderState *record);
extern const char *heap2_identify(uint8 info);
extern void heap_xlog_logical_rewrite(XLogReaderState *r);
-extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
- TransactionId cutoff_xid, xl_heap_freeze_tuple *tuples,
- int ntuples);
-extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
- TransactionId relfrozenxid,
- TransactionId relminmxid,
- TransactionId cutoff_xid,
- TransactionId cutoff_multi,
- xl_heap_freeze_tuple *frz,
- bool *totally_frozen,
- TransactionId *relfrozenxid_out,
- MultiXactId *relminmxid_out);
-extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
- xl_heap_freeze_tuple *frz);
extern XLogRecPtr log_heap_visible(RelFileLocator rlocator, Buffer heap_buffer,
- Buffer vm_buffer, TransactionId cutoff_xid, uint8 vmflags);
+ Buffer vm_buffer, TransactionId latestRemovedXid,
+ uint8 vmflags);
#endif /* HEAPAM_XLOG_H */
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 560f1c81a..b69bf28ba 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -110,6 +110,9 @@ static int bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate);
static XLogRecPtr log_heap_new_cid(Relation relation, HeapTuple tup);
static HeapTuple ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_required,
bool *copy);
+static int heap_xlog_freeze_plan(HeapFreezeTuple *tuples, int ntuples,
+ xl_heap_freeze_plan *plans_out,
+ OffsetNumber *offsets_out);
/*
@@ -6439,7 +6442,9 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* will be totally frozen after these operations are performed and false if
* more freezing will eventually be required.
*
- * Caller must set frz->offset itself, before heap_execute_freeze_tuple call.
+ * VACUUM caller must assemble HeapFreezeTuple entries for every tuple that we
+ * returned true for when called. A later heap_freeze_prepared_tuples call
+ * will execute freezing for caller's page as a whole.
*
* It is assumed that the caller has checked the tuple with
* HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
@@ -6463,15 +6468,12 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* It will be set as tuple's new xmax when our *frz output is processed within
* heap_execute_freeze_tuple later on. If the tuple is in a shared buffer
* then caller had better have an exclusive lock on it already.
- *
- * NB: It is not enough to set hint bits to indicate an XID committed/aborted.
- * The *frz WAL record we output completely removes all old XIDs during REDO.
*/
bool
heap_prepare_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi,
- xl_heap_freeze_tuple *frz, bool *totally_frozen,
+ HeapFreezeTuple *frz, bool *totally_frozen,
TransactionId *relfrozenxid_out,
MultiXactId *relminmxid_out)
{
@@ -6762,10 +6764,12 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
* tuple status. Also, getting exclusive lock makes it safe to adjust the
* infomask bits.
*
- * NB: All code in here must be safe to execute during crash recovery!
+ * NB: All code in here must be safe to execute during crash recovery! It is
+ * not enough to set hint bits to indicate an XID committed/aborted, either.
+ * Caller's REDO routine must reliably remove all old XIDs.
*/
-void
-heap_execute_freeze_tuple(HeapTupleHeader tuple, xl_heap_freeze_tuple *frz)
+static inline void
+heap_execute_freeze_tuple(HeapTupleHeader tuple, HeapFreezeTuple *frz)
{
HeapTupleHeaderSetXmax(tuple, frz->xmax);
@@ -6779,6 +6783,90 @@ heap_execute_freeze_tuple(HeapTupleHeader tuple, xl_heap_freeze_tuple *frz)
tuple->t_infomask2 = frz->t_infomask2;
}
+/*
+ * heap_freeze_prepared_tuples
+ * Execute freezing of prepared tuple plans.
+ *
+ * If we need to freeze any tuples we'll mark the buffer dirty, and write a
+ * WAL record recording the changes. We must log the changes to be crash-safe
+ * against future truncation of CLOG.
+ *
+ * Caller must always set 'offset' page offset number from each tuple plan.
+ *
+ * NB: We sort the tuples array in-place. Caller should be finished with it
+ * by now.
+ */
+void
+heap_freeze_prepared_tuples(Relation rel, Buffer buffer,
+ TransactionId FreezeLimit,
+ HeapFreezeTuple *tuples, int ntuples)
+{
+ Page page = BufferGetPage(buffer);
+
+ Assert(ntuples > 0);
+ Assert(TransactionIdIsValid(FreezeLimit));
+
+ START_CRIT_SECTION();
+
+ MarkBufferDirty(buffer);
+
+ /* execute collected freezes */
+ for (int i = 0; i < ntuples; i++)
+ {
+ HeapTupleHeader htup;
+ ItemId itemid = PageGetItemId(page, tuples[i].offset);
+
+ htup = (HeapTupleHeader) PageGetItem(page, itemid);
+ heap_execute_freeze_tuple(htup, &tuples[i]);
+ }
+
+ /* Now WAL-log freezing if necessary */
+ if (RelationNeedsWAL(rel))
+ {
+ xl_heap_freeze_plan plans[MaxHeapTuplesPerPage];
+ OffsetNumber offsets[MaxHeapTuplesPerPage];
+ int nplans;
+ xl_heap_freeze_page xlrec;
+ XLogRecPtr recptr;
+ TransactionId latestRemovedXid;
+
+ /* Prepare deduplicated representation for use in WAL record */
+ nplans = heap_xlog_freeze_plan(tuples, ntuples, plans, offsets);
+
+ /*
+ * latestRemovedXid describes the latest processed XID, whereas
+ * FreezeLimit is (approximately) the first XID not frozen by VACUUM.
+ * Back up caller's FreezeLimit to avoid false conflicts when
+ * FreezeLimit is precisely equal to VACUUM's OldestXmin cutoff.
+ */
+ latestRemovedXid = FreezeLimit;
+ TransactionIdRetreat(latestRemovedXid);
+
+ xlrec.latestRemovedXid = latestRemovedXid;
+ xlrec.nplans = nplans;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, SizeOfHeapFreezePage);
+
+ /*
+ * The freeze plan array and offset array are not actually in the
+ * buffer, but pretend that they are. When XLogInsert stores the
+ * whole buffer, the arrays need not be stored too.
+ */
+ XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+ XLogRegisterBufData(0, (char *) plans,
+ nplans * sizeof(xl_heap_freeze_plan));
+ XLogRegisterBufData(0, (char *) offsets,
+ ntuples * sizeof(OffsetNumber));
+
+ recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_FREEZE_PAGE);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+}
+
/*
* heap_freeze_tuple
* Freeze tuple in place, without WAL logging.
@@ -6790,7 +6878,7 @@ heap_freeze_tuple(HeapTupleHeader tuple,
TransactionId relfrozenxid, TransactionId relminmxid,
TransactionId cutoff_xid, TransactionId cutoff_multi)
{
- xl_heap_freeze_tuple frz;
+ HeapFreezeTuple frz;
bool do_freeze;
bool tuple_totally_frozen;
TransactionId relfrozenxid_out = cutoff_xid;
@@ -8151,54 +8239,23 @@ bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate)
return nblocksfavorable;
}
-/*
- * Perform XLogInsert for a heap-freeze operation. Caller must have already
- * modified the buffer and marked it dirty.
- */
-XLogRecPtr
-log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
- xl_heap_freeze_tuple *tuples, int ntuples)
-{
- xl_heap_freeze_page xlrec;
- XLogRecPtr recptr;
-
- /* Caller should not call me on a non-WAL-logged relation */
- Assert(RelationNeedsWAL(reln));
- /* nor when there are no tuples to freeze */
- Assert(ntuples > 0);
-
- xlrec.cutoff_xid = cutoff_xid;
- xlrec.ntuples = ntuples;
-
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec, SizeOfHeapFreezePage);
-
- /*
- * The freeze plan array is not actually in the buffer, but pretend that
- * it is. When XLogInsert stores the whole buffer, the freeze plan need
- * not be stored too.
- */
- XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
- XLogRegisterBufData(0, (char *) tuples,
- ntuples * sizeof(xl_heap_freeze_tuple));
-
- recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_FREEZE_PAGE);
-
- return recptr;
-}
-
/*
* Perform XLogInsert for a heap-visible operation. 'block' is the block
* being marked all-visible, and vm_buffer is the buffer containing the
* corresponding visibility map block. Both should have already been modified
* and dirtied.
*
+ * latestRemovedXid comes from the largest xmin on the page being marked
+ * all-visible. REDO routines uses it as a latestRemovedXid to generate
+ * recovery conflicts in the standard way (even though nothing has been
+ * removed). Passed to visibilitymap_set() as cutoff_xid argument by VACUUM.
+ *
* If checksums are enabled, we also generate a full-page image of
* heap_buffer, if necessary.
*/
XLogRecPtr
log_heap_visible(RelFileLocator rlocator, Buffer heap_buffer, Buffer vm_buffer,
- TransactionId cutoff_xid, uint8 vmflags)
+ TransactionId latestRemovedXid, uint8 vmflags)
{
xl_heap_visible xlrec;
XLogRecPtr recptr;
@@ -8207,7 +8264,7 @@ log_heap_visible(RelFileLocator rlocator, Buffer heap_buffer, Buffer vm_buffer,
Assert(BufferIsValid(heap_buffer));
Assert(BufferIsValid(vm_buffer));
- xlrec.cutoff_xid = cutoff_xid;
+ xlrec.latestRemovedXid = latestRemovedXid;
xlrec.flags = vmflags;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapVisible);
@@ -8810,7 +8867,7 @@ heap_xlog_visible(XLogReaderState *record)
* rather than killing the transaction outright.
*/
if (InHotStandby)
- ResolveRecoveryConflictWithSnapshot(xlrec->cutoff_xid, rlocator);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rlocator);
/*
* Read the heap page, if it still exists. If the heap file has dropped or
@@ -8896,7 +8953,7 @@ heap_xlog_visible(XLogReaderState *record)
visibilitymap_pin(reln, blkno, &vmbuffer);
visibilitymap_set(reln, blkno, InvalidBuffer, lsn, vmbuffer,
- xlrec->cutoff_xid, xlrec->flags);
+ xlrec->latestRemovedXid, xlrec->flags);
ReleaseBuffer(vmbuffer);
FreeFakeRelcacheEntry(reln);
@@ -8905,6 +8962,141 @@ heap_xlog_visible(XLogReaderState *record)
UnlockReleaseBuffer(vmbuffer);
}
+/*
+ * qsort comparator used to deduplicate tuple-based freeze plans for WAL
+ */
+static int
+heap_xlog_freeze_cmp(const void *arg1, const void *arg2)
+{
+ HeapFreezeTuple *frz1 = (HeapFreezeTuple *) arg1;
+ HeapFreezeTuple *frz2 = (HeapFreezeTuple *) arg2;
+
+ /* Compare fields that describe actions required to freeze this tuple */
+ if (frz1->t_infomask2 < frz2->t_infomask2)
+ return -1;
+ else if (frz1->t_infomask2 > frz2->t_infomask2)
+ return 1;
+
+ if (frz1->t_infomask < frz2->t_infomask)
+ return -1;
+ else if (frz1->t_infomask > frz2->t_infomask)
+ return 1;
+
+ if (frz1->xmax < frz2->xmax)
+ return -1;
+ else if (frz1->xmax > frz2->xmax)
+ return 1;
+
+ if (frz1->frzflags < frz2->frzflags)
+ return -1;
+ else if (frz1->frzflags > frz2->frzflags)
+ return 1;
+
+ /*
+ * heap_xlog_freeze_eq would report that this pair of tuple-wise freeze
+ * plans as equal. Their canonicalized freeze plan will only be stored
+ * once in final WAL record.
+ *
+ * Tiebreak on page offset number, just to keep things tidy.
+ */
+ if (frz1->offset < frz2->offset)
+ return -1;
+ else if (frz1->offset > frz2->offset)
+ return 1;
+
+ Assert(false);
+ return 0;
+}
+
+/*
+ * Compare fields that describe actions required to freeze tuple with caller's
+ * open plan. If everything matches then the frz tuple plan is equivalent to
+ * caller's plan.
+ */
+static bool
+heap_xlog_freeze_eq(xl_heap_freeze_plan *plan, HeapFreezeTuple *frz)
+{
+ if (plan->t_infomask2 == frz->t_infomask2 &&
+ plan->t_infomask == frz->t_infomask &&
+ plan->xmax == frz->xmax && plan->frzflags == frz->frzflags)
+ return true;
+
+ /* Caller must call heap_xlog_new_freeze_plan again for frz */
+ return false;
+}
+
+/*
+ * Start new plan initialized using tuple-level actions. At least one tuple
+ * will have steps required to freeze described by caller's plan during REDO.
+ */
+static void
+heap_xlog_new_freeze_plan(xl_heap_freeze_plan *plan, HeapFreezeTuple *frz)
+{
+ plan->xmax = frz->xmax;
+ plan->t_infomask = frz->t_infomask;
+ plan->t_infomask2 = frz->t_infomask2;
+ plan->frzflags = frz->frzflags;
+ plan->ntuples = 1;
+}
+
+/*
+ * Deduplicate tuple-based freeze plans so that each distinct set of
+ * processing steps is only stored once in final WAL record.
+ *
+ * Return value is number of plans set in *plans_out for caller. *offsets_out
+ * is set to offset numbers for use in WAL record by caller.
+ */
+static int
+heap_xlog_freeze_plan(HeapFreezeTuple *tuples, int ntuples,
+ xl_heap_freeze_plan *plans_out,
+ OffsetNumber *offsets_out)
+{
+ int nplans = 0;
+
+ /* Sort tuple-based freeze plans in the order required to deduplicate */
+ qsort(tuples, ntuples, sizeof(HeapFreezeTuple), heap_xlog_freeze_cmp);
+
+ for (int i = 0; i < ntuples; i++)
+ {
+ HeapFreezeTuple *frz = tuples + i;
+
+ if (i == 0)
+ {
+ /* New canonical freeze plan starting with first tup */
+ heap_xlog_new_freeze_plan(plans_out, frz);
+ nplans++;
+ }
+ else if (heap_xlog_freeze_eq(plans_out, frz))
+ {
+ /* tup matches open canonical plan -- include tup in it */
+ Assert(offsets_out[i - 1] < frz->offset);
+ plans_out->ntuples++;
+ }
+ else
+ {
+ /* Tup doesn't match current plan -- done with it now */
+ plans_out++;
+
+ /* New canonical freeze plan starting with this tup */
+ heap_xlog_new_freeze_plan(plans_out, frz);
+ nplans++;
+ }
+
+ /*
+ * Save page offset number in dedicated buffer in passing.
+ *
+ * REDO routine relies on the record's offset numbers array grouping
+ * offset numbers by freeze plan. The sort order within each grouping
+ * is ascending offset number order, just to keep things tidy.
+ */
+ offsets_out[i] = frz->offset;
+ }
+
+ Assert(nplans > 0 && nplans <= ntuples);
+
+ return nplans;
+}
+
/*
* Replay XLOG_HEAP2_FREEZE_PAGE records
*/
@@ -8913,9 +9105,7 @@ heap_xlog_freeze_page(XLogReaderState *record)
{
XLogRecPtr lsn = record->EndRecPtr;
xl_heap_freeze_page *xlrec = (xl_heap_freeze_page *) XLogRecGetData(record);
- TransactionId cutoff_xid = xlrec->cutoff_xid;
Buffer buffer;
- int ntup;
/*
* In Hot Standby mode, ensure that there's no queries running which still
@@ -8924,33 +9114,50 @@ heap_xlog_freeze_page(XLogReaderState *record)
if (InHotStandby)
{
RelFileLocator rlocator;
- TransactionId latestRemovedXid = cutoff_xid;
-
- TransactionIdRetreat(latestRemovedXid);
XLogRecGetBlockTag(record, 0, &rlocator, NULL, NULL);
- ResolveRecoveryConflictWithSnapshot(latestRemovedXid, rlocator);
+ ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rlocator);
}
if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
{
Page page = BufferGetPage(buffer);
- xl_heap_freeze_tuple *tuples;
+ xl_heap_freeze_plan *plans;
+ OffsetNumber *offsets;
+ int curoff = 0;
- tuples = (xl_heap_freeze_tuple *) XLogRecGetBlockData(record, 0, NULL);
-
- /* now execute freeze plan for each frozen tuple */
- for (ntup = 0; ntup < xlrec->ntuples; ntup++)
+ plans = (xl_heap_freeze_plan *) XLogRecGetBlockData(record, 0, NULL);
+ offsets = (OffsetNumber *) ((char *) plans +
+ (xlrec->nplans *
+ sizeof(xl_heap_freeze_plan)));
+ for (int p = 0; p < xlrec->nplans; p++)
{
- xl_heap_freeze_tuple *xlrec_tp;
- ItemId lp;
- HeapTupleHeader tuple;
+ xl_heap_freeze_plan plan;
+ HeapFreezeTuple frz;
- xlrec_tp = &tuples[ntup];
- lp = PageGetItemId(page, xlrec_tp->offset); /* offsets are one-based */
- tuple = (HeapTupleHeader) PageGetItem(page, lp);
+ memcpy(&plan, &plans[p], sizeof(xl_heap_freeze_plan));
- heap_execute_freeze_tuple(tuple, xlrec_tp);
+ /*
+ * Convert freeze plan representation into the per-tuple format
+ * used by heap_execute_freeze_tuple (matches original execution)
+ */
+ frz.t_infomask2 = plan.t_infomask2;
+ frz.t_infomask = plan.t_infomask;
+ frz.xmax = plan.xmax;
+ frz.frzflags = plan.frzflags;
+ frz.offset = InvalidOffsetNumber; /* unused */
+
+ /* Iterate through tuples for plan, then freeze each tuple */
+ for (int i = 0; i < plan.ntuples; i++)
+ {
+ ItemId lp;
+ HeapTupleHeader tuple;
+ OffsetNumber offset = offsets[curoff++];
+
+ lp = PageGetItemId(page, offset);
+ tuple = (HeapTupleHeader) PageGetItem(page, lp);
+ heap_execute_freeze_tuple(tuple, &frz);
+ }
}
PageSetLSN(page, lsn);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index dfbe37472..78192172f 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -1566,7 +1566,7 @@ lazy_scan_prune(LVRelState *vacrel,
TransactionId NewRelfrozenXid;
MultiXactId NewRelminMxid;
OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
- xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage];
+ HeapFreezeTuple frozen[MaxHeapTuplesPerPage];
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1825,40 +1825,8 @@ retry:
vacrel->frozen_pages++;
- /*
- * At least one tuple with storage needs to be frozen -- execute that
- * now.
- *
- * If we need to freeze any tuples we'll mark the buffer dirty, and
- * write a WAL record recording the changes. We must log the changes
- * to be crash-safe against future truncation of CLOG.
- */
- START_CRIT_SECTION();
-
- MarkBufferDirty(buf);
-
- /* execute collected freezes */
- for (int i = 0; i < tuples_frozen; i++)
- {
- HeapTupleHeader htup;
-
- itemid = PageGetItemId(page, frozen[i].offset);
- htup = (HeapTupleHeader) PageGetItem(page, itemid);
-
- heap_execute_freeze_tuple(htup, &frozen[i]);
- }
-
- /* Now WAL-log freezing if necessary */
- if (RelationNeedsWAL(vacrel->rel))
- {
- XLogRecPtr recptr;
-
- recptr = log_heap_freeze(vacrel->rel, buf, vacrel->FreezeLimit,
- frozen, tuples_frozen);
- PageSetLSN(page, recptr);
- }
-
- END_CRIT_SECTION();
+ heap_freeze_prepared_tuples(vacrel->rel, buf, vacrel->FreezeLimit,
+ frozen, tuples_frozen);
}
/*
diff --git a/src/backend/access/rmgrdesc/heapdesc.c b/src/backend/access/rmgrdesc/heapdesc.c
index 923d3bc43..bc86a62a9 100644
--- a/src/backend/access/rmgrdesc/heapdesc.c
+++ b/src/backend/access/rmgrdesc/heapdesc.c
@@ -140,15 +140,15 @@ heap2_desc(StringInfo buf, XLogReaderState *record)
{
xl_heap_freeze_page *xlrec = (xl_heap_freeze_page *) rec;
- appendStringInfo(buf, "cutoff xid %u ntuples %u",
- xlrec->cutoff_xid, xlrec->ntuples);
+ appendStringInfo(buf, "latestRemovedXid %u nplans %u",
+ xlrec->latestRemovedXid, xlrec->nplans);
}
else if (info == XLOG_HEAP2_VISIBLE)
{
xl_heap_visible *xlrec = (xl_heap_visible *) rec;
- appendStringInfo(buf, "cutoff xid %u flags 0x%02X",
- xlrec->cutoff_xid, xlrec->flags);
+ appendStringInfo(buf, "latestRemovedXid %u flags 0x%02X",
+ xlrec->latestRemovedXid, xlrec->flags);
}
else if (info == XLOG_HEAP2_MULTI_INSERT)
{
--
2.34.1
On Fri, Nov 11, 2022 at 10:38 AM Peter Geoghegan <pg@bowt.ie> wrote:
Attached is v4, which removes the old comments you pointed out were
now out of place (they weren't adding much anyway). Also fixed bitrot
against HEAD from today's visibility map commit from Jeff Davis.
Pushed something like this earlier today, though without any changes
to VISIBLE records.
I just started a new thread to discuss standardizing the symbol name
for recovery conflict XID cutoffs reported by tools like pg_waldump.
Seemed better to deal with VISIBLE records in the scope of that new
refactoring patch.
Thanks
--
Peter Geoghegan
Hi,
On 2022-11-15 10:26:05 -0800, Peter Geoghegan wrote:
Pushed something like this earlier today, though without any changes
to VISIBLE records.
While updating a patch to log various offsets in pg_waldump, I noticed a few
minor issues in this patch:
ISTM that some of the page level freezing functions are misnamed. In heapam.c
the heap_xlog* routines are for replay, afaict. However
heap_xlog_freeze_plan() is used to WAL log the freeze
plan. heap_xlog_freeze_page() is used to replay that WAL record. Probably your
brain is too used to nbtree/ :).
I think s/heap_xlog_freeze/heap_log_freeze/ would mostly do the trick, except
that heap_xlog_new_freeze_plan() doesn't quite fit in the scheme.
The routines then also should be moved a bit up, because right now they're
inbetween other routines doing WAL replay, adding to the confusion.
The memcpy in heap_xlog_freeze_page() seems a tad odd. I assume that the
alignment guarantees for xl_heap_freeze_plan are too weak? But imo it's
failure prone (and I'm not sure strictly speaking legal from an undefined
behaviour POV) to form a pointer to a misaligned array. Even if we then later
just memcpy() from those pointers.
Greetings,
Andres Freund
On Mon, Jan 9, 2023 at 1:43 PM Andres Freund <andres@anarazel.de> wrote:
ISTM that some of the page level freezing functions are misnamed. In heapam.c
the heap_xlog* routines are for replay, afaict. However
heap_xlog_freeze_plan() is used to WAL log the freeze
plan. heap_xlog_freeze_page() is used to replay that WAL record. Probably your
brain is too used to nbtree/ :).
Sometimes I wonder why other people stubbornly insist on not starting
every function name with an underscore. :-)
> I think s/heap_xlog_freeze/heap_log_freeze/ would mostly do the trick,
> except that heap_xlog_new_freeze_plan() doesn't quite fit in the scheme.
> The routines should then also be moved up a bit, because right now they're
> in between other routines doing WAL replay, adding to the confusion.
I believe that I used this scheme because the new functions were
conceptually related to REDO routines, even though they run during
original execution. I'm quite happy to revise the code based on your
suggestions, though.
> The memcpy in heap_xlog_freeze_page() seems a tad odd. I assume that the
> alignment guarantees for xl_heap_freeze_plan are too weak?
They're not too weak. I'm not sure why the memcpy() was used. I see your
point: the memcpy() makes it look as if it must be necessary, which in
turn calls into question why it's okay to access the main plans array as
an array at all. I can change this detail, too.
I'll try to get back to it this week.
--
Peter Geoghegan
On Mon, Jan 9, 2023 at 2:18 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I'll try to get back to it this week.
Attached patch fixes up these issues. It's almost totally mechanical.
(Ended up using "git diff --color-moved=dimmed-zebra
--color-moved-ws=ignore-all-space" with this, per your recent tip,
which did help.)
--
Peter Geoghegan
Attachments:
v1-0001-Rename-freeze-plan-dedup-routines.patch (application/octet-stream)
From 839218bd3cf8aa633b5c2c07a1746c879152625b Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Wed, 11 Jan 2023 15:44:06 -0800
Subject: [PATCH v1] Rename freeze plan dedup routines.
Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20230109214308.icz26oqvt3k2274c@awork3.anarazel.de
---
src/backend/access/heap/heapam.c | 296 +++++++++++++++----------------
1 file changed, 147 insertions(+), 149 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 63c4f01f0..388df94a4 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -95,6 +95,9 @@ static void compute_new_xmax_infomask(TransactionId xmax, uint16 old_infomask,
static TM_Result heap_lock_updated_tuple(Relation rel, HeapTuple tuple,
ItemPointer ctid, TransactionId xid,
LockTupleMode mode);
+static int heap_log_freeze_plan(HeapTupleFreeze *tuples, int ntuples,
+ xl_heap_freeze_plan *plans_out,
+ OffsetNumber *offsets_out);
static void GetMultiXactIdHintBits(MultiXactId multi, uint16 *new_infomask,
uint16 *new_infomask2);
static TransactionId MultiXactIdGetUpdateXid(TransactionId xmax,
@@ -111,9 +114,6 @@ static int bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate);
static XLogRecPtr log_heap_new_cid(Relation relation, HeapTuple tup);
static HeapTuple ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_required,
bool *copy);
-static int heap_xlog_freeze_plan(HeapTupleFreeze *tuples, int ntuples,
- xl_heap_freeze_plan *plans_out,
- OffsetNumber *offsets_out);
/*
@@ -6868,7 +6868,7 @@ heap_freeze_execute_prepared(Relation rel, Buffer buffer,
XLogRecPtr recptr;
/* Prepare deduplicated representation for use in WAL record */
- nplans = heap_xlog_freeze_plan(tuples, ntuples, plans, offsets);
+ nplans = heap_log_freeze_plan(tuples, ntuples, plans, offsets);
xlrec.snapshotConflictHorizon = snapshotConflictHorizon;
xlrec.nplans = nplans;
@@ -6895,6 +6895,144 @@ heap_freeze_execute_prepared(Relation rel, Buffer buffer,
END_CRIT_SECTION();
}
+/*
+ * Comparator used to deduplicate XLOG_HEAP2_FREEZE_PAGE freeze plans
+ */
+static int
+heap_log_freeze_cmp(const void *arg1, const void *arg2)
+{
+ HeapTupleFreeze *frz1 = (HeapTupleFreeze *) arg1;
+ HeapTupleFreeze *frz2 = (HeapTupleFreeze *) arg2;
+
+ if (frz1->xmax < frz2->xmax)
+ return -1;
+ else if (frz1->xmax > frz2->xmax)
+ return 1;
+
+ if (frz1->t_infomask2 < frz2->t_infomask2)
+ return -1;
+ else if (frz1->t_infomask2 > frz2->t_infomask2)
+ return 1;
+
+ if (frz1->t_infomask < frz2->t_infomask)
+ return -1;
+ else if (frz1->t_infomask > frz2->t_infomask)
+ return 1;
+
+ if (frz1->frzflags < frz2->frzflags)
+ return -1;
+ else if (frz1->frzflags > frz2->frzflags)
+ return 1;
+
+ /*
+ * heap_log_freeze_eq would consider these tuple-wise plans to be equal.
+ * (So the tuples will share a single canonical freeze plan.)
+ *
+ * We tiebreak on page offset number to keep each freeze plan's page
+ * offset number array individually sorted. (Unnecessary, but be tidy.)
+ */
+ if (frz1->offset < frz2->offset)
+ return -1;
+ else if (frz1->offset > frz2->offset)
+ return 1;
+
+ Assert(false);
+ return 0;
+}
+
+/*
+ * Compare fields that describe actions required to freeze tuple with caller's
+ * open plan. If everything matches then the frz tuple plan is equivalent to
+ * caller's plan.
+ */
+static inline bool
+heap_log_freeze_eq(xl_heap_freeze_plan *plan, HeapTupleFreeze *frz)
+{
+ if (plan->xmax == frz->xmax &&
+ plan->t_infomask2 == frz->t_infomask2 &&
+ plan->t_infomask == frz->t_infomask &&
+ plan->frzflags == frz->frzflags)
+ return true;
+
+ /* Caller must call heap_log_freeze_new_plan again for frz */
+ return false;
+}
+
+/*
+ * Start new plan initialized using tuple-level actions. At least one tuple
+ * will have steps required to freeze described by caller's plan during REDO.
+ */
+static inline void
+heap_log_freeze_new_plan(xl_heap_freeze_plan *plan, HeapTupleFreeze *frz)
+{
+ plan->xmax = frz->xmax;
+ plan->t_infomask2 = frz->t_infomask2;
+ plan->t_infomask = frz->t_infomask;
+ plan->frzflags = frz->frzflags;
+ plan->ntuples = 1; /* for now */
+}
+
+/*
+ * Deduplicate tuple-based freeze plans so that each distinct set of
+ * processing steps is only stored once in XLOG_HEAP2_FREEZE_PAGE records.
+ * Called during original execution of freezing (for logged relations).
+ *
+ * Return value is number of plans set in *plans_out for caller. Also writes
+ * an array of offset numbers into *offsets_out output argument for caller
+ * (actually there is one array per freeze plan, but that's not of immediate
+ * concern to our caller).
+ */
+static int
+heap_log_freeze_plan(HeapTupleFreeze *tuples, int ntuples,
+ xl_heap_freeze_plan *plans_out,
+ OffsetNumber *offsets_out)
+{
+ int nplans = 0;
+
+ /* Sort tuple-based freeze plans in the order required to deduplicate */
+ qsort(tuples, ntuples, sizeof(HeapTupleFreeze), heap_log_freeze_cmp);
+
+ for (int i = 0; i < ntuples; i++)
+ {
+ HeapTupleFreeze *frz = tuples + i;
+
+ if (i == 0)
+ {
+ /* New canonical freeze plan starting with first tup */
+ heap_log_freeze_new_plan(plans_out, frz);
+ nplans++;
+ }
+ else if (heap_log_freeze_eq(plans_out, frz))
+ {
+ /* tup matches open canonical plan -- include tup in it */
+ Assert(offsets_out[i - 1] < frz->offset);
+ plans_out->ntuples++;
+ }
+ else
+ {
+ /* Tup doesn't match current plan -- done with it now */
+ plans_out++;
+
+ /* New canonical freeze plan starting with this tup */
+ heap_log_freeze_new_plan(plans_out, frz);
+ nplans++;
+ }
+
+ /*
+ * Save page offset number in dedicated buffer in passing.
+ *
+ * REDO routine relies on the record's offset numbers array grouping
+ * offset numbers by freeze plan. The sort order within each grouping
+ * is ascending offset number order, just to keep things tidy.
+ */
+ offsets_out[i] = frz->offset;
+ }
+
+ Assert(nplans > 0 && nplans <= ntuples);
+
+ return nplans;
+}
+
/*
* heap_freeze_tuple
* Freeze tuple in place, without WAL logging.
@@ -9015,144 +9153,6 @@ heap_xlog_visible(XLogReaderState *record)
UnlockReleaseBuffer(vmbuffer);
}
-/*
- * Comparator used to deduplicate XLOG_HEAP2_FREEZE_PAGE freeze plans
- */
-static int
-heap_xlog_freeze_cmp(const void *arg1, const void *arg2)
-{
- HeapTupleFreeze *frz1 = (HeapTupleFreeze *) arg1;
- HeapTupleFreeze *frz2 = (HeapTupleFreeze *) arg2;
-
- if (frz1->xmax < frz2->xmax)
- return -1;
- else if (frz1->xmax > frz2->xmax)
- return 1;
-
- if (frz1->t_infomask2 < frz2->t_infomask2)
- return -1;
- else if (frz1->t_infomask2 > frz2->t_infomask2)
- return 1;
-
- if (frz1->t_infomask < frz2->t_infomask)
- return -1;
- else if (frz1->t_infomask > frz2->t_infomask)
- return 1;
-
- if (frz1->frzflags < frz2->frzflags)
- return -1;
- else if (frz1->frzflags > frz2->frzflags)
- return 1;
-
- /*
- * heap_xlog_freeze_eq would consider these tuple-wise plans to be equal.
- * (So the tuples will share a single canonical freeze plan.)
- *
- * We tiebreak on page offset number to keep each freeze plan's page
- * offset number array individually sorted. (Unnecessary, but be tidy.)
- */
- if (frz1->offset < frz2->offset)
- return -1;
- else if (frz1->offset > frz2->offset)
- return 1;
-
- Assert(false);
- return 0;
-}
-
-/*
- * Compare fields that describe actions required to freeze tuple with caller's
- * open plan. If everything matches then the frz tuple plan is equivalent to
- * caller's plan.
- */
-static inline bool
-heap_xlog_freeze_eq(xl_heap_freeze_plan *plan, HeapTupleFreeze *frz)
-{
- if (plan->xmax == frz->xmax &&
- plan->t_infomask2 == frz->t_infomask2 &&
- plan->t_infomask == frz->t_infomask &&
- plan->frzflags == frz->frzflags)
- return true;
-
- /* Caller must call heap_xlog_new_freeze_plan again for frz */
- return false;
-}
-
-/*
- * Start new plan initialized using tuple-level actions. At least one tuple
- * will have steps required to freeze described by caller's plan during REDO.
- */
-static inline void
-heap_xlog_new_freeze_plan(xl_heap_freeze_plan *plan, HeapTupleFreeze *frz)
-{
- plan->xmax = frz->xmax;
- plan->t_infomask2 = frz->t_infomask2;
- plan->t_infomask = frz->t_infomask;
- plan->frzflags = frz->frzflags;
- plan->ntuples = 1; /* for now */
-}
-
-/*
- * Deduplicate tuple-based freeze plans so that each distinct set of
- * processing steps is only stored once in XLOG_HEAP2_FREEZE_PAGE records.
- * Called during original execution of freezing (for logged relations).
- *
- * Return value is number of plans set in *plans_out for caller. Also writes
- * an array of offset numbers into *offsets_out output argument for caller
- * (actually there is one array per freeze plan, but that's not of immediate
- * concern to our caller).
- */
-static int
-heap_xlog_freeze_plan(HeapTupleFreeze *tuples, int ntuples,
- xl_heap_freeze_plan *plans_out,
- OffsetNumber *offsets_out)
-{
- int nplans = 0;
-
- /* Sort tuple-based freeze plans in the order required to deduplicate */
- qsort(tuples, ntuples, sizeof(HeapTupleFreeze), heap_xlog_freeze_cmp);
-
- for (int i = 0; i < ntuples; i++)
- {
- HeapTupleFreeze *frz = tuples + i;
-
- if (i == 0)
- {
- /* New canonical freeze plan starting with first tup */
- heap_xlog_new_freeze_plan(plans_out, frz);
- nplans++;
- }
- else if (heap_xlog_freeze_eq(plans_out, frz))
- {
- /* tup matches open canonical plan -- include tup in it */
- Assert(offsets_out[i - 1] < frz->offset);
- plans_out->ntuples++;
- }
- else
- {
- /* Tup doesn't match current plan -- done with it now */
- plans_out++;
-
- /* New canonical freeze plan starting with this tup */
- heap_xlog_new_freeze_plan(plans_out, frz);
- nplans++;
- }
-
- /*
- * Save page offset number in dedicated buffer in passing.
- *
- * REDO routine relies on the record's offset numbers array grouping
- * offset numbers by freeze plan. The sort order within each grouping
- * is ascending offset number order, just to keep things tidy.
- */
- offsets_out[i] = frz->offset;
- }
-
- Assert(nplans > 0 && nplans <= ntuples);
-
- return nplans;
-}
-
/*
* Replay XLOG_HEAP2_FREEZE_PAGE records
*/
@@ -9189,21 +9189,19 @@ heap_xlog_freeze_page(XLogReaderState *record)
sizeof(xl_heap_freeze_plan)));
for (int p = 0; p < xlrec->nplans; p++)
{
- xl_heap_freeze_plan plan;
HeapTupleFreeze frz;
/*
* Convert freeze plan representation from WAL record into
* per-tuple format used by heap_execute_freeze_tuple
*/
- memcpy(&plan, &plans[p], sizeof(xl_heap_freeze_plan));
- frz.xmax = plan.xmax;
- frz.t_infomask2 = plan.t_infomask2;
- frz.t_infomask = plan.t_infomask;
- frz.frzflags = plan.frzflags;
+ frz.xmax = plans[p].xmax;
+ frz.t_infomask2 = plans[p].t_infomask2;
+ frz.t_infomask = plans[p].t_infomask;
+ frz.frzflags = plans[p].frzflags;
frz.offset = InvalidOffsetNumber; /* unused, but be tidy */
- for (int i = 0; i < plan.ntuples; i++)
+ for (int i = 0; i < plans[p].ntuples; i++)
{
OffsetNumber offset = offsets[curoff++];
ItemId lp;
--
2.39.0
Hi,
On 2023-01-11 16:06:31 -0800, Peter Geoghegan wrote:
> On Mon, Jan 9, 2023 at 2:18 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > I'll try to get back to it this week.
> Attached patch fixes up these issues. It's almost totally mechanical.
Looks better, thanks!
> (Ended up using "git diff --color-moved=dimmed-zebra
> --color-moved-ws=ignore-all-space" with this, per your recent tip,
> which did help.)
It's a really useful feature. I configured git to always use
--color-moved=dimmed-zebra, but haven't quite dared to enable
--color-moved-ws=ignore-all-space by default.
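For what it's worth, the always-on part can presumably be set up with
something like the following (assuming a git recent enough to know the
diff.colorMoved and diff.colorMovedWS settings; the whitespace mode can
still be requested per invocation instead):

    git config --global diff.colorMoved dimmed-zebra
    git diff --color-moved-ws=ignore-all-space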
Greetings,
Andres Freund
On Wed, Jan 11, 2023 at 4:44 PM Andres Freund <andres@anarazel.de> wrote:
> > Attached patch fixes up these issues. It's almost totally mechanical.
> Looks better, thanks!
Pushed that just now.
Thanks
--
Peter Geoghegan