Rework the way multixact truncations work

Started by Andres Freund, over 10 years ago, 82 messages
#1 Andres Freund
andres@anarazel.de
3 attachment(s)

Hi,

As discussed on list, over IM, and in person at PGCon, I want to make
multixact truncations WAL-logged to address various bugs.

Since that's a comparatively large and invasive change, I thought it'd be
a good idea to start a new thread instead of burying it in an already
long thread.

Here's the commit message which hopefully explains what's being changed
and why:

Rework the way multixact truncations work.

The fact that multixact truncations are not WAL logged has caused a fair
share of problems. Among others, it requires doing computations during
recovery while the database is not in a consistent state, delaying
truncations till checkpoints, and handling members being truncated but
offsets not.

We tried to put bandaids on lots of these issues over the last years,
but it seems time to change course. Thus this patch introduces WAL
logging for truncation, even in the back branches.

This allows us to:
1) perform the truncation directly during VACUUM, instead of delaying it
to the checkpoint.
2) avoid looking at the offsets SLRU for truncation during recovery;
we can just use the master's values.
3) simplify a fair amount of the logic that keeps the in-memory limits
straight; this has gotten much easier.
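
To make the "in-memory limits" in point 3 concrete, here is a minimal
standalone sketch of the member stop-limit arithmetic that the patch folds
into SetOffsetVacuumLimit(): round the oldest member offset down to the
start of its segment, then keep one more segment of headroom. The toy_*
names are hypothetical, and the two constants are assumptions meant to
resemble a default 8kB-page build; this is not the patch's code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, chosen to resemble a default 8kB-page build. */
#define TOY_MEMBERS_PER_PAGE    1636u
#define TOY_PAGES_PER_SEGMENT   32u
#define TOY_MEMBERS_PER_SEGMENT (TOY_MEMBERS_PER_PAGE * TOY_PAGES_PER_SEGMENT)

/*
 * Hypothetical helper: compute the offset at which member allocation must
 * stop, given the offset of the oldest member that is still needed.
 * Offsets are 32 bits wide and wrap around, so unsigned arithmetic is used.
 */
uint32_t
toy_offset_stop_limit(uint32_t oldest_offset)
{
	/* move back to the start of the corresponding segment */
	uint32_t	limit = oldest_offset - (oldest_offset % TOY_MEMBERS_PER_SEGMENT);

	/* always leave one segment before the wraparound point */
	limit -= TOY_MEMBERS_PER_SEGMENT;

	return limit;
}
```

Since the subtraction is unsigned, an oldest offset inside the very first
segment yields a stop limit just below 2^32, i.e. the limit wraps around
along with the offset space.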

During the course of fixing this a bunch of bugs had to be fixed:
1) Data was not purged from the members SLRU's in-memory buffers before
deleting segments. This happened to be hard or impossible to hit due to
the interlock between checkpoints and truncation.
2) find_multixact_start() relied on SimpleLruDoesPhysicalPageExist(), but
that doesn't work for offsets that haven't yet been flushed to
disk. Flush them out before running it to fix that. Not pretty, but it
feels slightly safer to only make decisions based on on-disk state.
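
Both bugs can be illustrated with a toy model of an SLRU, where a page may
exist in memory, on disk, or both. This is only a sketch: ToySlru and the
toy_* functions are hypothetical and do not correspond to the real slru.c
API. It shows why an on-disk existence check misses pages that are still
dirty in memory, and why buffers must be dropped before segments are
deleted.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_MAX_PAGES 64

/* Hypothetical stand-in for an SLRU: tracks where each page lives. */
typedef struct ToySlru
{
	bool		in_memory[TOY_MAX_PAGES];
	bool		dirty[TOY_MAX_PAGES];
	bool		on_disk[TOY_MAX_PAGES];
} ToySlru;

void
toy_init(ToySlru *s)
{
	memset(s, 0, sizeof(*s));
}

/* Writing a page only dirties the in-memory copy. */
void
toy_write_page(ToySlru *s, int pageno)
{
	s->in_memory[pageno] = true;
	s->dirty[pageno] = true;
}

/* Write all dirty pages out to "disk". */
void
toy_flush(ToySlru *s)
{
	for (int i = 0; i < TOY_MAX_PAGES; i++)
	{
		if (s->dirty[i])
		{
			s->on_disk[i] = true;
			s->dirty[i] = false;
		}
	}
}

/* Looks at on-disk state only, so unflushed pages are invisible (bug 2). */
bool
toy_physical_page_exists(const ToySlru *s, int pageno)
{
	return s->on_disk[pageno];
}

/* The fix for bug 2: flush first, then decide based on on-disk state. */
bool
toy_page_exists_after_flush(ToySlru *s, int pageno)
{
	toy_flush(s);
	return toy_physical_page_exists(s, pageno);
}

/*
 * The fix for bug 1: drop in-memory copies of pages below the cutoff
 * before removing them from disk, so a later flush cannot recreate them.
 */
void
toy_truncate(ToySlru *s, int cutoff_page)
{
	for (int i = 0; i < cutoff_page && i < TOY_MAX_PAGES; i++)
	{
		s->in_memory[i] = false;
		s->dirty[i] = false;
		s->on_disk[i] = false;
	}
}
```

In this model, a page that has been written but not flushed is reported as
missing by toy_physical_page_exists() and as present by
toy_page_exists_after_flush(); a dirty page below the truncation cutoff
stays gone even if a flush runs afterwards.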

To handle the case of an updated standby replaying WAL from a not-yet
upgraded primary we have to recognize that situation and use "old style"
truncation (i.e. looking at the SLRUs) during WAL replay. In contrast to
before, this now happens in the startup process, when replaying a
checkpoint record, instead of in the checkpointer. Doing this in the
restartpoint was incorrect: restartpoints can happen much later than the
original checkpoint, thereby leading to wraparound. It's also more in line
with how the WAL logging now works.
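
The detection logic can be sketched as a small state machine, modelled on
what MultiXactAdvanceOldest() does in the attached patch. The ToyReplayState
struct and toy_* functions are hypothetical stand-ins, and the counter merely
records where the real code would call TruncateMultiXact(); setting the flag
while replaying a truncation record is an assumption based on the comment
the patch attaches to the flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t ToyMultiXactId;

/* Hypothetical replay-side state, as seen by the startup process. */
typedef struct ToyReplayState
{
	ToyMultiXactId oldest_multi;		/* oldest multixact known so far */
	bool		saw_truncation_record;	/* since the last checkpoint record */
	int			legacy_truncations;	/* how often we truncated ourselves */
} ToyReplayState;

/* Wraparound-aware "a precedes b", using modulo-2^32 arithmetic. */
bool
toy_multi_precedes(ToyMultiXactId a, ToyMultiXactId b)
{
	return (int32_t) (a - b) < 0;
}

/* Replaying a truncation record: the primary logs truncations itself. */
void
toy_replay_truncate_record(ToyReplayState *st)
{
	st->saw_truncation_record = true;
}

/* Replaying a checkpoint record that carries the primary's oldest multi. */
void
toy_replay_checkpoint(ToyReplayState *st, ToyMultiXactId ckpt_oldest)
{
	if (toy_multi_precedes(st->oldest_multi, ckpt_oldest))
	{
		/*
		 * The oldest multi advanced without a truncation record, so the
		 * primary must be an older version: truncate "old style".
		 */
		if (!st->saw_truncation_record)
			st->legacy_truncations++;

		st->oldest_multi = ckpt_oldest;
	}

	/* start a fresh checkpoint cycle */
	st->saw_truncation_record = false;
}
```

A checkpoint that advances the oldest multi with no preceding truncation
record triggers one legacy truncation; once truncation records show up in
the stream, the same advance no longer does.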

To avoid "multixact_redo: unknown op code 48" errors standbys should be
upgraded before primaries. This needs to be expressed clearly in the
release notes.

Backpatch to 9.3, where the use of multixacts was expanded. Arguably
this could be backpatched further, but there doesn't seem to be
sufficient benefit to outweigh the risk of applying a significantly
different patch there.

I've tested this a bunch, including using a newer standby against an
older master and such. What I have yet to test is that the concurrency
protections against multiple backends truncating at the same time are
correct.

It'd be very welcome to see some wider testing and review on this.

I've attached three commits:
0001: Add functions to burn through multixacts - that should get its own file.
0002: Lower the lower bound limits for *_freeze_max_age - I think we should
just do that. There really is no reason for the current limits
and they make testing hard and force space wastage.
0003: The actual truncation patch.

Greetings,

Andres Freund

Attachments:

0001-WIP-dontcommit-Add-functions-to-burn-multixacts.patch (text/x-patch; charset=us-ascii)
From cdc4f8f3341161b87a5d11171efb14c98c252ee6 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 4 Jun 2015 19:38:32 +0200
Subject: [PATCH 1/3] WIP-dontcommit: Add functions to burn multixacts

This should live in its own module, but we don't have that yet.
---
 contrib/pageinspect/heapfuncs.c          | 43 ++++++++++++++++++++++++++++++++
 contrib/pageinspect/pageinspect--1.3.sql |  6 +++++
 src/backend/access/heap/heapam.c         |  2 +-
 src/backend/access/transam/multixact.c   | 15 ++++++-----
 src/include/access/multixact.h           |  3 ++-
 5 files changed, 61 insertions(+), 8 deletions(-)

diff --git a/contrib/pageinspect/heapfuncs.c b/contrib/pageinspect/heapfuncs.c
index 8d1666c..7a3aa14 100644
--- a/contrib/pageinspect/heapfuncs.c
+++ b/contrib/pageinspect/heapfuncs.c
@@ -29,6 +29,8 @@
 #include "funcapi.h"
 #include "utils/builtins.h"
 #include "miscadmin.h"
+#include "access/multixact.h"
+#include "access/transam.h"
 
 
 /*
@@ -223,3 +225,44 @@ heap_page_items(PG_FUNCTION_ARGS)
 	else
 		SRF_RETURN_DONE(fctx);
 }
+
+extern Datum
+pg_burn_multixact(PG_FUNCTION_ARGS);
+PG_FUNCTION_INFO_V1(pg_burn_multixact);
+
+Datum
+pg_burn_multixact(PG_FUNCTION_ARGS)
+{
+	int		rep = PG_GETARG_INT32(0);
+	int		size = PG_GETARG_INT32(1);
+	MultiXactMember *members;
+	MultiXactId ret;
+	TransactionId id = ReadNewTransactionId() - size;
+	int		i;
+
+	if (rep < 1)
+		elog(ERROR, "need to burn, burn, burn");
+
+	members = palloc(size * sizeof(MultiXactMember));
+	for (i = 0; i < size; i++)
+	{
+		members[i].xid = id++;
+		members[i].status = MultiXactStatusForShare;
+
+		if (!TransactionIdIsNormal(members[i].xid))
+		{
+			id = FirstNormalTransactionId;
+			members[i].xid = id++;
+		}
+	}
+
+	MultiXactIdSetOldestMember();
+
+	for (i = 0; i < rep; i++)
+	{
+		CHECK_FOR_INTERRUPTS();
+		ret = MultiXactIdCreateFromMembers(size, members, true);
+	}
+
+	PG_RETURN_INT64((int64) ret);
+}
diff --git a/contrib/pageinspect/pageinspect--1.3.sql b/contrib/pageinspect/pageinspect--1.3.sql
index a99e058..22f51bc 100644
--- a/contrib/pageinspect/pageinspect--1.3.sql
+++ b/contrib/pageinspect/pageinspect--1.3.sql
@@ -187,3 +187,9 @@ CREATE FUNCTION gin_leafpage_items(IN page bytea,
 RETURNS SETOF record
 AS 'MODULE_PATHNAME', 'gin_leafpage_items'
 LANGUAGE C STRICT;
+
+
+CREATE FUNCTION pg_burn_multixact(num int4, size int4)
+RETURNS int8
+AS 'MODULE_PATHNAME', 'pg_burn_multixact'
+LANGUAGE C STRICT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index caacc10..c57f99d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6050,7 +6050,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		 * Create a new multixact with the surviving members of the previous
 		 * one, to set as new Xmax in the tuple.
 		 */
-		xid = MultiXactIdCreateFromMembers(nnewmembers, newmembers);
+		xid = MultiXactIdCreateFromMembers(nnewmembers, newmembers, false);
 		*flags |= FRM_RETURN_IS_MULTI;
 	}
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 377d084..cf43254 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -405,7 +405,7 @@ MultiXactIdCreate(TransactionId xid1, MultiXactStatus status1,
 	members[1].xid = xid2;
 	members[1].status = status2;
 
-	newMulti = MultiXactIdCreateFromMembers(2, members);
+	newMulti = MultiXactIdCreateFromMembers(2, members, false);
 
 	debug_elog3(DEBUG2, "Create: %s",
 				mxid_to_string(newMulti, 2, members));
@@ -471,7 +471,7 @@ MultiXactIdExpand(MultiXactId multi, TransactionId xid, MultiXactStatus status)
 		 */
 		member.xid = xid;
 		member.status = status;
-		newMulti = MultiXactIdCreateFromMembers(1, &member);
+		newMulti = MultiXactIdCreateFromMembers(1, &member, false);
 
 		debug_elog4(DEBUG2, "Expand: %u has no members, create singleton %u",
 					multi, newMulti);
@@ -523,7 +523,7 @@ MultiXactIdExpand(MultiXactId multi, TransactionId xid, MultiXactStatus status)
 
 	newMembers[j].xid = xid;
 	newMembers[j++].status = status;
-	newMulti = MultiXactIdCreateFromMembers(j, newMembers);
+	newMulti = MultiXactIdCreateFromMembers(j, newMembers, false);
 
 	pfree(members);
 	pfree(newMembers);
@@ -742,7 +742,7 @@ ReadNextMultiXactId(void)
  * NB: the passed members[] array will be sorted in-place.
  */
 MultiXactId
-MultiXactIdCreateFromMembers(int nmembers, MultiXactMember *members)
+MultiXactIdCreateFromMembers(int nmembers, MultiXactMember *members, bool nocache)
 {
 	MultiXactId multi;
 	MultiXactOffset offset;
@@ -761,7 +761,9 @@ MultiXactIdCreateFromMembers(int nmembers, MultiXactMember *members)
 	 * corner cases where someone else added us to a MultiXact without our
 	 * knowledge, but it's not worth checking for.)
 	 */
-	multi = mXactCacheGetBySet(nmembers, members);
+	multi = nocache ? InvalidMultiXactId :
+		mXactCacheGetBySet(nmembers, members);
+
 	if (MultiXactIdIsValid(multi))
 	{
 		debug_elog2(DEBUG2, "Create: in cache!");
@@ -834,7 +836,8 @@ MultiXactIdCreateFromMembers(int nmembers, MultiXactMember *members)
 	END_CRIT_SECTION();
 
 	/* Store the new MultiXactId in the local cache, too */
-	mXactCachePut(multi, nmembers, members);
+	if (!nocache)
+		mXactCachePut(multi, nmembers, members);
 
 	debug_elog2(DEBUG2, "Create: all done");
 
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index f1448fe..6213f8a 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -86,10 +86,11 @@ typedef struct xl_multixact_create
 extern MultiXactId MultiXactIdCreate(TransactionId xid1,
 				  MultiXactStatus status1, TransactionId xid2,
 				  MultiXactStatus status2);
+extern MultiXactId CreateMultiXactId(int nmembers, MultiXactMember *members, bool nocache);
 extern MultiXactId MultiXactIdExpand(MultiXactId multi, TransactionId xid,
 				  MultiXactStatus status);
 extern MultiXactId MultiXactIdCreateFromMembers(int nmembers,
-							 MultiXactMember *members);
+							 MultiXactMember *members, bool nocache);
 
 extern MultiXactId ReadNextMultiXactId(void);
 extern bool MultiXactIdIsRunning(MultiXactId multi, bool isLockOnly);
-- 
2.4.0.rc2.1.g3d6bc9a

0002-Lower-_freeze_max_age-minimum-values.patch (text/x-patch; charset=us-ascii)
From ceb5683e62bcae51689dc27e0764f822e26595f7 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 15 Jun 2015 19:12:52 +0200
Subject: [PATCH 2/3] Lower *_freeze_max_age minimum values.

The old minimum values are rather large, making it time consuming to
test related behaviour. Additionally the current limits, especially for
multixacts, can be problematic in space-constrained systems. 10000000
multixacts can contain a lot of members.

Since there's no good reason for the current limits, lower them a good
bit. Setting them to 0 would be a bad idea, triggering endless vacuums,
so still retain a limit.
---
 src/backend/utils/misc/guc.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 230c5cc..aec4adc 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2516,17 +2516,17 @@ static struct config_int ConfigureNamesInt[] =
 		},
 		&autovacuum_freeze_max_age,
 		/* see pg_resetxlog if you change the upper-limit value */
-		200000000, 100000000, 2000000000,
+		200000000, 100000, 2000000000,
 		NULL, NULL, NULL
 	},
 	{
-		/* see varsup.c for why this is PGC_POSTMASTER not PGC_SIGHUP */
+		/* see multixact.c for why this is PGC_POSTMASTER not PGC_SIGHUP */
 		{"autovacuum_multixact_freeze_max_age", PGC_POSTMASTER, AUTOVACUUM,
 			gettext_noop("Multixact age at which to autovacuum a table to prevent multixact wraparound."),
 			NULL
 		},
 		&autovacuum_multixact_freeze_max_age,
-		400000000, 10000000, 2000000000,
+		400000000, 10000, 2000000000,
 		NULL, NULL, NULL
 	},
 	{
-- 
2.4.0.rc2.1.g3d6bc9a

0003-Rework-the-way-multixact-truncations-work.patch (text/x-patch; charset=us-ascii)
From c74948481b68c6519c85b344275074ce51be1b0a Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 21 Jun 2015 21:02:45 +0200
Subject: [PATCH 3/3] Rework the way multixact truncations work.

The fact that multixact truncations are not WAL logged has caused a fair
share of problems. Among others, it requires doing computations during
recovery while the database is not in a consistent state, delaying
truncations till checkpoints, and handling members being truncated but
offsets not.

We tried to put bandaids on lots of these issues over the last years,
but it seems time to change course. Thus this patch introduces WAL
logging for truncation, even in the back branches.

This allows us to:
1) perform the truncation directly during VACUUM, instead of delaying it
   to the checkpoint.
2) avoid looking at the offsets SLRU for truncation during recovery;
   we can just use the master's values.
3) simplify a fair amount of the logic that keeps the in-memory limits
   straight; this has gotten much easier.

During the course of fixing this a bunch of bugs had to be fixed:
1) Data was not purged from the members SLRU's in-memory buffers before
   deleting segments. This happened to be hard or impossible to hit due to
   the interlock between checkpoints and truncation.
2) find_multixact_start() relied on SimpleLruDoesPhysicalPageExist(), but
   that doesn't work for offsets that haven't yet been flushed to
   disk. Flush them out before running it to fix that. Not pretty, but it
   feels slightly safer to only make decisions based on on-disk state.

To handle the case of an updated standby replaying WAL from a not-yet
upgraded primary we have to recognize that situation and use "old style"
truncation (i.e. looking at the SLRUs) during WAL replay. In contrast to
before, this now happens in the startup process, when replaying a
checkpoint record, instead of in the checkpointer. Doing this in the
restartpoint was incorrect: restartpoints can happen much later than the
original checkpoint, thereby leading to wraparound. It's also more in line
with how the WAL logging now works.

To avoid "multixact_redo: unknown op code 48" errors standbys should be
upgraded before primaries. This needs to be expressed clearly in the
release notes.

Backpatch to 9.3, where the use of multixacts was expanded. Arguably
this could be backpatched further, but there doesn't seem to be
sufficient benefit to outweigh the risk of applying a significantly
different patch there.
---
 src/backend/access/rmgrdesc/mxactdesc.c |  10 +
 src/backend/access/transam/multixact.c  | 550 +++++++++++++++++---------------
 src/backend/access/transam/slru.c       |  89 +++++-
 src/backend/access/transam/xlog.c       |  51 +--
 src/backend/commands/vacuum.c           |   4 +-
 src/include/access/multixact.h          |  14 +-
 src/include/access/slru.h               |   4 +-
 src/include/storage/lwlock.h            |   3 +-
 8 files changed, 423 insertions(+), 302 deletions(-)

diff --git a/src/backend/access/rmgrdesc/mxactdesc.c b/src/backend/access/rmgrdesc/mxactdesc.c
index 572951e..df55df1 100644
--- a/src/backend/access/rmgrdesc/mxactdesc.c
+++ b/src/backend/access/rmgrdesc/mxactdesc.c
@@ -70,6 +70,13 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 		for (i = 0; i < xlrec->nmembers; i++)
 			out_member(buf, &xlrec->members[i]);
 	}
+	else if (info == XLOG_MULTIXACT_TRUNCATE_ID)
+	{
+		xl_multixact_truncate *xlrec = (xl_multixact_truncate *) rec;
+		appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
+						 xlrec->startOff, xlrec->endOff,
+						 xlrec->startMemb, xlrec->endMemb);
+	}
 }
 
 const char *
@@ -88,6 +95,9 @@ multixact_identify(uint8 info)
 		case XLOG_MULTIXACT_CREATE_ID:
 			id = "CREATE_ID";
 			break;
+		case XLOG_MULTIXACT_TRUNCATE_ID:
+			id = "TRUNCATE";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index cf43254..764cc42 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -49,9 +49,7 @@
  * value is removed; the cutoff value is stored in pg_class.  The minimum value
  * across all tables in each database is stored in pg_database, and the global
  * minimum across all databases is part of pg_control and is kept in shared
- * memory.  At checkpoint time, after the value is known flushed in WAL, any
- * files that correspond to multixacts older than that value are removed.
- * (These files are also removed when a restartpoint is executed.)
+ * memory.  Whenever that minimum is advanced, the SLRUs are truncated.
  *
  * When new multixactid values are to be created, care is taken that the
  * counter does not fall within the wraparound horizon considering the global
@@ -83,6 +81,7 @@
 #include "postmaster/autovacuum.h"
 #include "storage/lmgr.h"
 #include "storage/pmsignal.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
@@ -153,6 +152,7 @@
 
 /* page in which a member is to be found */
 #define MXOffsetToMemberPage(xid) ((xid) / (TransactionId) MULTIXACT_MEMBERS_PER_PAGE)
+#define MXOffsetToMemberSegment(xid) (MXOffsetToMemberPage(xid) / SLRU_PAGES_PER_SEGMENT)
 
 /* Location (byte offset within page) of flag word for a given member */
 #define MXOffsetToFlagsOffset(xid) \
@@ -218,11 +218,12 @@ typedef struct MultiXactStateData
 	bool		oldestOffsetKnown;
 
 	/*
-	 * This is what the previous checkpoint stored as the truncate position.
-	 * This value is the oldestMultiXactId that was valid when a checkpoint
-	 * was last executed.
+	 * True if a multixact truncation WAL record was replayed since the last
+	 * checkpoint. This is used to trigger 'legacy truncations', i.e. truncate
+	 * by looking at the data directory during WAL replay, when the primary is
+	 * too old to generate truncation records.
 	 */
-	MultiXactId lastCheckpointedOldest;
+	bool		sawTruncationCkptCyle;
 
 	/* support for anti-wraparound measures */
 	MultiXactId multiVacLimit;
@@ -231,8 +232,7 @@ typedef struct MultiXactStateData
 	MultiXactId multiWrapLimit;
 
 	/* support for members anti-wraparound measures */
-	MultiXactOffset offsetStopLimit;
-	bool offsetStopLimitKnown;
+	MultiXactOffset offsetStopLimit; /* known if oldestOffsetKnown */
 
 	/*
 	 * Per-backend data starts here.  We have two arrays stored in the area
@@ -362,12 +362,13 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 						MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static void DetermineSafeOldestOffset(MultiXactId oldestMXact);
 static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
 						 MultiXactOffset start, uint32 distance);
-static bool SetOffsetVacuumLimit(bool finish_setup);
+static bool SetOffsetVacuumLimit(void);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMZeroPageXlogRec(int pageno, uint8 info);
+static void WriteMTruncateXlogRec(MultiXactOffset startOff, MultiXactOffset endOff,
+								  MultiXactOffset startMemb, MultiXactOffset endMemb);
 
 
 /*
@@ -1100,7 +1101,7 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	 *----------
 	 */
 #define OFFSET_WARN_SEGMENTS	20
-	if (MultiXactState->offsetStopLimitKnown &&
+	if (MultiXactState->oldestOffsetKnown &&
 		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
 								 nmembers))
 	{
@@ -1140,7 +1141,7 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
 	}
 
-	if (MultiXactState->offsetStopLimitKnown &&
+	if (MultiXactState->oldestOffsetKnown &&
 		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit,
 								 nextOffset,
 								 nmembers + MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT * OFFSET_WARN_SEGMENTS))
@@ -2018,13 +2019,21 @@ StartupMultiXact(void)
 void
 TrimMultiXact(void)
 {
-	MultiXactId multi = MultiXactState->nextMXact;
-	MultiXactOffset offset = MultiXactState->nextOffset;
+	MultiXactId nextMXact;
+	MultiXactOffset offset;
 	MultiXactId	oldestMXact;
+	MultiXactId	oldestMXactDB;
 	int			pageno;
 	int			entryno;
 	int			flagsoff;
 
+	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
+	nextMXact = MultiXactState->nextMXact;
+	offset = MultiXactState->nextOffset;
+	oldestMXact = MultiXactState->oldestMultiXactId;
+	oldestMXactDB = MultiXactState->oldestMultiXactDB;
+	MultiXactState->finishedStartup = true;
+	LWLockRelease(MultiXactGenLock);
 
 	/* Clean up offsets state */
 	LWLockAcquire(MultiXactOffsetControlLock, LW_EXCLUSIVE);
@@ -2032,20 +2041,20 @@ TrimMultiXact(void)
 	/*
 	 * (Re-)Initialize our idea of the latest page number for offsets.
 	 */
-	pageno = MultiXactIdToOffsetPage(multi);
+	pageno = MultiXactIdToOffsetPage(nextMXact);
 	MultiXactOffsetCtl->shared->latest_page_number = pageno;
 
 	/*
 	 * Zero out the remainder of the current offsets page.  See notes in
 	 * TrimCLOG() for motivation.
 	 */
-	entryno = MultiXactIdToOffsetEntry(multi);
+	entryno = MultiXactIdToOffsetEntry(nextMXact);
 	if (entryno != 0)
 	{
 		int			slotno;
 		MultiXactOffset *offptr;
 
-		slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
+		slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
 		offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 		offptr += entryno;
 
@@ -2094,12 +2103,11 @@ TrimMultiXact(void)
 
 	LWLockRelease(MultiXactMemberControlLock);
 
-	if (SetOffsetVacuumLimit(true) && IsUnderPostmaster)
-		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-	LWLockAcquire(MultiXactGenLock, LW_SHARED);
-	oldestMXact = MultiXactState->lastCheckpointedOldest;
-	LWLockRelease(MultiXactGenLock);
-	DetermineSafeOldestOffset(oldestMXact);
+	/*
+	 * Recompute limits once fully started; we can now compute how far away a
+	 * members wraparound is.
+	 */
+	SetMultiXactIdLimit(oldestMXact, oldestMXactDB);
 }
 
 /*
@@ -2268,8 +2276,19 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid)
 	 (errmsg("MultiXactId wrap limit is %u, limited by database with OID %u",
 			 multiWrapLimit, oldest_datoid)));
 
+	/*
+	 * Computing the actual limits is only possible once the data directory is
+	 * in a consistent state. There's no need to compute the limits while
+	 * still replaying WAL as no new multis can be computed anyway. So we'll
+	 * only do further checks once TrimMultiXact() has been called.
+	 */
+	if (!MultiXactState->finishedStartup)
+		return;
+
+	Assert(!InRecovery);
+
 	/* Set limits for offset vacuum. */
-	needs_offset_vacuum = SetOffsetVacuumLimit(false);
+	needs_offset_vacuum = SetOffsetVacuumLimit();
 
 	/*
 	 * If past the autovacuum force point, immediately signal an autovac
@@ -2279,11 +2298,11 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid)
 	 * another iteration immediately if there are still any old databases.
 	 */
 	if ((MultiXactIdPrecedes(multiVacLimit, curMulti) ||
-		 needs_offset_vacuum) && IsUnderPostmaster && !InRecovery)
+		 needs_offset_vacuum) && IsUnderPostmaster)
 		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
 
 	/* Give an immediate warning if past the wrap warn point */
-	if (MultiXactIdPrecedes(multiWarnLimit, curMulti) && !InRecovery)
+	if (MultiXactIdPrecedes(multiWarnLimit, curMulti))
 	{
 		char	   *oldest_datname;
 
@@ -2351,27 +2370,32 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti,
 }
 
 /*
- * Update our oldestMultiXactId value, but only if it's more recent than
- * what we had.  However, even if not, always update the oldest multixact
- * offset limit.
+ * During WAL replay update our oldestMultiXactId value, but only if it's more
+ * recent than what we had.
  */
 void
 MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB)
 {
 	if (MultiXactIdPrecedes(MultiXactState->oldestMultiXactId, oldestMulti))
+	{
+		/*
+		 * If there has been a truncation on the master, detected via a moving
+		 * oldestMulti, without a corresponding truncation record we know that
+		 * the primary is still running an older version of postgres that
+		 * doesn't yet log multixact truncations. So perform truncation
+		 * ourselves.
+		 */
+		if (!MultiXactState->sawTruncationCkptCyle)
+		{
+			ereport(LOG, (errmsg("performing legacy multixact truncation, upgrade master")));
+			TruncateMultiXact(oldestMulti, oldestMultiDB, true);
+		}
+
 		SetMultiXactIdLimit(oldestMulti, oldestMultiDB);
-}
+	}
 
-/*
- * Update the "safe truncation point".  This is the newest value of oldestMulti
- * that is known to be flushed as part of a checkpoint record.
- */
-void
-MultiXactSetSafeTruncate(MultiXactId safeTruncateMulti)
-{
-	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
-	MultiXactState->lastCheckpointedOldest = safeTruncateMulti;
-	LWLockRelease(MultiXactGenLock);
+	/* only looked at in the startup process, no lock necessary */
+	MultiXactState->sawTruncationCkptCyle = false;
 }
 
 /*
@@ -2527,126 +2551,44 @@ GetOldestMultiXactId(void)
 }
 
 /*
- * Based on the given oldest MultiXactId, determine what's the oldest member
- * offset and install the limit info in MultiXactState, where it can be used to
- * prevent overrun of old data in the members SLRU area.
- */
-static void
-DetermineSafeOldestOffset(MultiXactId oldestMXact)
-{
-	MultiXactOffset oldestOffset;
-	MultiXactOffset nextOffset;
-	MultiXactOffset offsetStopLimit;
-	MultiXactOffset prevOffsetStopLimit;
-	MultiXactId		nextMXact;
-	bool			finishedStartup;
-	bool			prevOffsetStopLimitKnown;
-
-	/* Fetch values from shared memory. */
-	LWLockAcquire(MultiXactGenLock, LW_SHARED);
-	finishedStartup = MultiXactState->finishedStartup;
-	nextMXact = MultiXactState->nextMXact;
-	nextOffset = MultiXactState->nextOffset;
-	prevOffsetStopLimit = MultiXactState->offsetStopLimit;
-	prevOffsetStopLimitKnown = MultiXactState->offsetStopLimitKnown;
-	LWLockRelease(MultiXactGenLock);
-
-	/* Don't worry about this until after we've started up. */
-	if (!finishedStartup)
-		return;
-
-	/*
-	 * Determine the offset of the oldest multixact.  Normally, we can read
-	 * the offset from the multixact itself, but there's an important special
-	 * case: if there are no multixacts in existence at all, oldestMXact
-	 * obviously can't point to one.  It will instead point to the multixact
-	 * ID that will be assigned the next time one is needed.
-	 *
-	 * NB: oldestMXact should be the oldest multixact that still exists in the
-	 * SLRU, unlike in SetOffsetVacuumLimit, where we do this same computation
-	 * based on the oldest value that might be referenced in a table.
-	 */
-	if (nextMXact == oldestMXact)
-		oldestOffset = nextOffset;
-	else
-	{
-		bool		oldestOffsetKnown;
-
-		oldestOffsetKnown = find_multixact_start(oldestMXact, &oldestOffset);
-		if (!oldestOffsetKnown)
-		{
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
-						oldestMXact)));
-			return;
-		}
-	}
-
-	/* move back to start of the corresponding segment */
-	offsetStopLimit = oldestOffset - (oldestOffset %
-		(MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
-	/* always leave one segment before the wraparound point */
-	offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
-
-	/* if nothing has changed, we're done */
-	if (prevOffsetStopLimitKnown && offsetStopLimit == prevOffsetStopLimit)
-		return;
-
-	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
-	MultiXactState->offsetStopLimit = offsetStopLimit;
-	MultiXactState->offsetStopLimitKnown = true;
-	LWLockRelease(MultiXactGenLock);
-
-	if (!prevOffsetStopLimitKnown && IsUnderPostmaster)
-		ereport(LOG,
-				(errmsg("MultiXact member wraparound protections are now enabled")));
-	ereport(DEBUG1,
-			(errmsg("MultiXact member stop limit is now %u based on MultiXact %u",
-				offsetStopLimit, oldestMXact)));
-}
-
-/*
  * Determine how aggressively we need to vacuum in order to prevent member
  * wraparound.
  *
- * To determine the oldest multixact ID, we look at oldestMultiXactId, not
- * lastCheckpointedOldest.  That's because vacuuming can't help with anything
- * older than oldestMultiXactId; anything older than that isn't referenced
- * by any table.  Offsets older than oldestMultiXactId but not as old as
- * lastCheckpointedOldest will go away after the next checkpoint.
+ * To do so determine what's the oldest member offset and install the limit
+ * info in MultiXactState, where it can be used to prevent overrun of old data
+ * in the members SLRU area.
  *
  * The return value is true if emergency autovacuum is required and false
  * otherwise.
  */
 static bool
-SetOffsetVacuumLimit(bool finish_setup)
+SetOffsetVacuumLimit(void)
 {
 	MultiXactId	oldestMultiXactId;
 	MultiXactId nextMXact;
-	bool		finishedStartup;
 	MultiXactOffset oldestOffset = 0;		/* placate compiler */
+	MultiXactOffset prevOldestOffset;
 	MultiXactOffset nextOffset;
 	bool		oldestOffsetKnown = false;
-	MultiXactOffset prevOldestOffset;
-	bool		prevOldestOffsetKnown;
+	bool			prevOldestOffsetKnown;
+	MultiXactOffset offsetStopLimit = 0;
 
 	/* Read relevant fields from shared memory. */
 	LWLockAcquire(MultiXactGenLock, LW_SHARED);
 	oldestMultiXactId = MultiXactState->oldestMultiXactId;
 	nextMXact = MultiXactState->nextMXact;
 	nextOffset = MultiXactState->nextOffset;
-	finishedStartup = MultiXactState->finishedStartup;
-	prevOldestOffset = MultiXactState->oldestOffset;
 	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
+	prevOldestOffset = MultiXactState->oldestOffset;
+	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
-	/* Don't do this until after any recovery is complete. */
-	if (!finishedStartup && !finish_setup)
-		return false;
-
 	/*
-	 * If no multixacts exist, then oldestMultiXactId will be the next
-	 * multixact that will be created, rather than an existing multixact.
+	 * Determine the offset of the oldest multixact.  Normally, we can read
+	 * the offset from the multixact itself, but there's an important special
+	 * case: if there are no multixacts in existence at all, oldestMXact
+	 * obviously can't point to one.  It will instead point to the multixact
+	 * ID that will be assigned the next time one is needed.
 	 */
 	if (oldestMultiXactId == nextMXact)
 	{
@@ -2667,34 +2609,46 @@ SetOffsetVacuumLimit(bool finish_setup)
 		 */
 		oldestOffsetKnown =
 			find_multixact_start(oldestMultiXactId, &oldestOffset);
+
+		if (oldestOffsetKnown)
+			ereport(DEBUG1,
+					(errmsg("oldest MultiXactId member is at offset %u",
+							oldestOffset)));
+		else
+			ereport(LOG,
+					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
+						oldestMultiXactId)));
 	}
 
 	/*
-	 * Except when initializing the system for the first time, there's no
-	 * need to update anything if we don't know the oldest offset or if it
-	 * hasn't changed.
+	 * If we can, compute limits (and install them in MultiXactState) to prevent
+	 * overrun of old data in the members SLRU area. We can only do so if the
+	 * oldest offset is known though.
 	 */
-	if (finish_setup ||
-		(oldestOffsetKnown && !prevOldestOffsetKnown) ||
-		(oldestOffsetKnown && prevOldestOffset != oldestOffset))
+	if (oldestOffsetKnown)
 	{
-		/* Install the new limits. */
-		LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
-		MultiXactState->oldestOffset = oldestOffset;
-		MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-		MultiXactState->finishedStartup = true;
-		LWLockRelease(MultiXactGenLock);
+		/* move back to start of the corresponding segment */
+		offsetStopLimit = oldestOffset - (oldestOffset %
+			(MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
 
-		/* Log the info */
-		if (oldestOffsetKnown)
-			ereport(DEBUG1,
-					(errmsg("oldest MultiXactId member is at offset %u",
-						oldestOffset)));
-		else
-			ereport(DEBUG1,
-					(errmsg("oldest MultiXactId member offset unknown")));
+		/* always leave one segment before the wraparound point */
+		offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
+
+		if (!prevOldestOffsetKnown && IsUnderPostmaster)
+			ereport(LOG,
+					(errmsg("MultiXact member wraparound protections are now enabled")));
+		ereport(DEBUG1,
+				(errmsg("MultiXact member stop limit is now %u based on MultiXact %u",
+						offsetStopLimit, oldestMultiXactId)));
 	}
 
+	/* Install the computed values */
+	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
+	MultiXactState->oldestOffset = oldestOffset;
+	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
+	MultiXactState->offsetStopLimit = offsetStopLimit;
+	LWLockRelease(MultiXactGenLock);
+
 	/*
 	 * If we failed to get the oldest offset this time, but we have a value
 	 * from a previous pass through this function, assess the need for
@@ -2777,9 +2731,18 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	int			slotno;
 	MultiXactOffset *offptr;
 
+	/* XXX: Remove || Startup after WAL page magic bump */
+	Assert(MultiXactState->finishedStartup || AmStartupProcess());
+
 	pageno = MultiXactIdToOffsetPage(multi);
 	entryno = MultiXactIdToOffsetEntry(multi);
 
+	/*
+	 * FIXME: We need to flush out dirty data, so PhysicalPageExists can work
+	 * correctly, but SimpleLruFlush() is a pretty big hammer for that.
+	 */
+	SimpleLruFlush(MultiXactOffsetCtl, true);
+
 	if (!SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno))
 		return false;
 
@@ -2885,65 +2848,6 @@ MultiXactMemberFreezeThreshold(void)
 	return multixacts - victim_multixacts;
 }
 
-/*
- * SlruScanDirectory callback.
- *		This callback deletes segments that are outside the range determined by
- *		the given page numbers.
- *
- * Both range endpoints are exclusive (that is, segments containing any of
- * those pages are kept.)
- */
-typedef struct MembersLiveRange
-{
-	int			rangeStart;
-	int			rangeEnd;
-} MembersLiveRange;
-
-static bool
-SlruScanDirCbRemoveMembers(SlruCtl ctl, char *filename, int segpage,
-						   void *data)
-{
-	MembersLiveRange *range = (MembersLiveRange *) data;
-	MultiXactOffset nextOffset;
-
-	if ((segpage == range->rangeStart) ||
-		(segpage == range->rangeEnd))
-		return false;			/* easy case out */
-
-	/*
-	 * To ensure that no segment is spuriously removed, we must keep track of
-	 * new segments added since the start of the directory scan; to do this,
-	 * we update our end-of-range point as we run.
-	 *
-	 * As an optimization, we can skip looking at shared memory if we know for
-	 * certain that the current segment must be kept.  This is so because
-	 * nextOffset never decreases, and we never increase rangeStart during any
-	 * one run.
-	 */
-	if (!((range->rangeStart > range->rangeEnd &&
-		   segpage > range->rangeEnd && segpage < range->rangeStart) ||
-		  (range->rangeStart < range->rangeEnd &&
-		   (segpage < range->rangeStart || segpage > range->rangeEnd))))
-		return false;
-
-	/*
-	 * Update our idea of the end of the live range.
-	 */
-	LWLockAcquire(MultiXactGenLock, LW_SHARED);
-	nextOffset = MultiXactState->nextOffset;
-	LWLockRelease(MultiXactGenLock);
-	range->rangeEnd = MXOffsetToMemberPage(nextOffset);
-
-	/* Recheck the deletion condition.  If it still holds, perform deletion */
-	if ((range->rangeStart > range->rangeEnd &&
-		 segpage > range->rangeEnd && segpage < range->rangeStart) ||
-		(range->rangeStart < range->rangeEnd &&
-		 (segpage < range->rangeStart || segpage > range->rangeEnd)))
-		SlruDeleteSegment(ctl, filename);
-
-	return false;				/* keep going */
-}
-
 typedef struct mxtruncinfo
 {
 	int			earliestExistingPage;
@@ -2967,6 +2871,32 @@ SlruScanDirCbFindEarliest(SlruCtl ctl, char *filename, int segpage, void *data)
 	return false;				/* keep going */
 }
 
+
+/*
+ * Delete any members segment that doesn't contain the start or end point.
+*/
+static void
+PerformMembersTruncation(MultiXactOffset oldestOffset, MultiXactOffset oldestAliveOffset)
+{
+	int startsegment = MXOffsetToMemberSegment(oldestOffset);
+	int endsegment = MXOffsetToMemberSegment(oldestAliveOffset);
+	int maxsegment = MXOffsetToMemberSegment(MaxMultiXactOffset);
+	int segment = startsegment;
+
+	while (segment != endsegment)
+	{
+		/* verify whether the current segment is to be deleted */
+		if (segment != startsegment && segment != endsegment)
+			SlruDeleteSegment(MultiXactMemberCtl, segment);
+
+		/* move to next segment, handle wraparound correctly */
+		if (segment == maxsegment)
+			segment = 0;
+		else
+			segment += 1;
+	}
+}
+
 /*
  * Remove all MultiXactOffset and MultiXactMember segments before the oldest
  * ones still of interest.
@@ -2979,32 +2909,60 @@ SlruScanDirCbFindEarliest(SlruCtl ctl, char *filename, int segpage, void *data)
  * and kept up to date as new pages are zeroed.
  */
 void
-TruncateMultiXact(void)
+TruncateMultiXact(MultiXactId frozenMulti, Oid minmulti_datoid, bool in_recovery)
 {
 	MultiXactId oldestMXact;
 	MultiXactOffset oldestOffset;
 	MultiXactId		nextMXact;
 	MultiXactOffset	nextOffset;
+	MultiXactOffset oldestAliveOffset;
 	mxtruncinfo trunc;
 	MultiXactId earliest;
-	MembersLiveRange range;
 
-	Assert(AmCheckpointerProcess() || AmStartupProcess() ||
-		   !IsPostmasterEnvironment);
+	/*
+	 * Need to allow being called in recovery for backward compatibility, when
+	 * an updated standby replays WAL generated by a non-updated primary.
+	 */
+	Assert(in_recovery || !RecoveryInProgress());
+	Assert(!in_recovery || AmStartupProcess());
+	Assert(in_recovery || MultiXactState->finishedStartup);
+
+	/*
+	 * We can only allow one truncation to happen at once. Otherwise parts of
+	 * members might vanish while we're doing lookups or similar. There's no
+	 * need to have an interlock with creating new multis or such, since those
+	 * are constrained by the limits (which only grow, never shrink).
+	 */
+	LWLockAcquire(MultiXactTruncationLock, LW_EXCLUSIVE);
 
 	LWLockAcquire(MultiXactGenLock, LW_SHARED);
-	oldestMXact = MultiXactState->lastCheckpointedOldest;
 	nextMXact = MultiXactState->nextMXact;
 	nextOffset = MultiXactState->nextOffset;
+	oldestMXact = MultiXactState->oldestMultiXactId;
 	LWLockRelease(MultiXactGenLock);
 	Assert(MultiXactIdIsValid(oldestMXact));
 
 	/*
+	 * Make sure to only attempt truncation if there's values to truncate
+	 * away. In normal processing values shouldn't go backwards, but there's
+	 * some corner cases (due to bugs) where that's possible.
+	 */
+	if (MultiXactIdPrecedesOrEquals(frozenMulti, oldestMXact))
+	{
+		LWLockRelease(MultiXactTruncationLock);
+		return;
+	}
+
+	/*
 	 * Note we can't just plow ahead with the truncation; it's possible that
 	 * there are no segments to truncate, which is a problem because we are
 	 * going to attempt to read the offsets page to determine where to
 	 * truncate the members SLRU.  So we first scan the directory to determine
 	 * the earliest offsets page number that we can read without error.
+	 *
+	 * XXX: It's also possible that the page that oldestMXact is on has
+	 * already been truncated away, and we crashed before updating
+	 * oldestMXact.
 	 */
 	trunc.earliestExistingPage = -1;
 	SlruScanDirectory(MultiXactOffsetCtl, SlruScanDirCbFindEarliest, &trunc);
@@ -3012,19 +2970,10 @@ TruncateMultiXact(void)
 	if (earliest < FirstMultiXactId)
 		earliest = FirstMultiXactId;
 
-	/*
-	 * If there's nothing to remove, we can bail out early.
-	 *
-	 * Due to bugs in early releases of PostgreSQL 9.3.X and 9.4.X,
-	 * oldestMXact might point to a multixact that does not exist.
-	 * Autovacuum will eventually advance it to a value that does exist,
-	 * and we want to set a proper offsetStopLimit when that happens,
-	 * so call DetermineSafeOldestOffset here even if we're not actually
-	 * truncating.
-	 */
+	/* If there's nothing to remove, we can bail out early. */
 	if (MultiXactIdPrecedes(oldestMXact, earliest))
 	{
-		DetermineSafeOldestOffset(oldestMXact);
+		LWLockRelease(MultiXactTruncationLock);
 		return;
 	}
 
@@ -3043,34 +2992,78 @@ TruncateMultiXact(void)
 		ereport(LOG,
 				(errmsg("oldest MultiXact %u not found, earliest MultiXact %u, skipping truncation",
 					oldestMXact, earliest)));
+		LWLockRelease(MultiXactTruncationLock);
 		return;
 	}
 
 	/*
-	 * To truncate MultiXactMembers, we need to figure out the active page
-	 * range and delete all files outside that range.  The start point is the
-	 * start of the segment containing the oldest offset; an end point of the
-	 * segment containing the next offset to use is enough.  The end point is
-	 * updated as MultiXactMember gets extended concurrently, elsewhere.
+	 * Secondly compute up to where to truncate. Lookup the corresponding
+	 * member offset for frozenMulti for that.
 	 */
-	range.rangeStart = MXOffsetToMemberPage(oldestOffset);
-	range.rangeStart -= range.rangeStart % SLRU_PAGES_PER_SEGMENT;
-
-	range.rangeEnd = MXOffsetToMemberPage(nextOffset);
+	if (frozenMulti == nextMXact)
+		oldestAliveOffset = nextOffset;		/* there are NO MultiXacts */
+	else if (!find_multixact_start(frozenMulti, &oldestAliveOffset))
+	{
+		ereport(LOG,
+				(errmsg("supposedly still alive MultiXact %u not found, skipping truncation",
+						frozenMulti)));
+		LWLockRelease(MultiXactTruncationLock);
+		return;
+	}
 
-	SlruScanDirectory(MultiXactMemberCtl, SlruScanDirCbRemoveMembers, &range);
+	elog(DEBUG1, "performing multixact truncation starting (%u, %u), segments (%x to %x)",
+		 oldestOffset,
+		 oldestAliveOffset,
+		 MXOffsetToMemberSegment(oldestOffset),
+		 MXOffsetToMemberSegment(oldestAliveOffset));
 
-	/* Now we can truncate MultiXactOffset */
-	SimpleLruTruncate(MultiXactOffsetCtl,
-					  MultiXactIdToOffsetPage(oldestMXact));
+	/*
+	 * Do truncation, and the WAL logging of the truncation, in a critical
+	 * section. That way offsets/members cannot get out of sync anymore,
+	 * i.e. once consistent the oldestMulti will always exist in members, even
+	 * if we crashed in the wrong moment.
+	 */
+	START_CRIT_SECTION();
 
+	/*
+	 * Prevent checkpoints from being scheduled concurrently. This is critical
+	 * because otherwise a truncation record might not be replayed after a
+	 * crash/basebackup, even though the state of the data directory would
+	 * require it.  It's not possible, and not needed, to do this during
+	 * recovery, when performing an old-style truncation, though, as the
+	 * startup process doesn't have a PGXACT entry.
+	 */
+	if (!in_recovery)
+	{
+		Assert(!MyPgXact->delayChkpt);
+		MyPgXact->delayChkpt = true;
+	}
 
 	/*
-	 * Now, and only now, we can advance the stop point for multixact members.
-	 * If we did it any sooner, the segments we deleted above might already
-	 * have been overwritten with new members.  That would be bad.
+	 * Wal log truncation - this has to be flushed before the truncation is
+	 * actually performed, for the reasons explained in TruncateCLOG().
 	 */
-	DetermineSafeOldestOffset(oldestMXact);
+	if (!in_recovery)
+		WriteMTruncateXlogRec(oldestMXact, frozenMulti,
+							  oldestOffset, oldestAliveOffset);
+
+	/* First truncate members */
+	PerformMembersTruncation(oldestOffset, oldestAliveOffset);
+
+	/* Then offsets */
+	SimpleLruTruncate(MultiXactOffsetCtl,
+					  MultiXactIdToOffsetPage(frozenMulti));
+
+	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
+	MultiXactState->oldestMultiXactId = frozenMulti;
+	MultiXactState->oldestMultiXactDB = minmulti_datoid;
+	LWLockRelease(MultiXactGenLock);
+
+	if (!in_recovery)
+		MyPgXact->delayChkpt = false;
+
+	END_CRIT_SECTION();
+	LWLockRelease(MultiXactTruncationLock);
 }
 
 /*
@@ -3167,6 +3160,30 @@ WriteMZeroPageXlogRec(int pageno, uint8 info)
 }
 
 /*
+ * Write a TRUNCATE xlog record
+ *
+ * We must flush the xlog record to disk before returning --- see notes
+ * in TruncateMultiXact().
+ */
+static void
+WriteMTruncateXlogRec(MultiXactOffset startOff, MultiXactOffset endOff,
+					  MultiXactOffset startMemb, MultiXactOffset endMemb)
+{
+	XLogRecPtr	recptr;
+	xl_multixact_truncate xlrec;
+
+	xlrec.startOff = startOff;
+	xlrec.endOff = endOff;
+	xlrec.startMemb = startMemb;
+	xlrec.endMemb = endMemb;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) (&xlrec), SizeOfMultiXactTruncate);
+	recptr = XLogInsert(RM_MULTIXACT_ID, XLOG_MULTIXACT_TRUNCATE_ID);
+	XLogFlush(recptr);
+}
+
+/*
  * MULTIXACT resource manager's routines
  */
 void
@@ -3248,6 +3265,41 @@ multixact_redo(XLogReaderState *record)
 			LWLockRelease(XidGenLock);
 		}
 	}
+	else if (info == XLOG_MULTIXACT_TRUNCATE_ID)
+	{
+		xl_multixact_truncate xlrec;
+		int pageno;
+
+		memcpy(&xlrec, XLogRecGetData(record),
+			   SizeOfMultiXactTruncate);
+
+		pageno = MultiXactIdToOffsetPage(xlrec.endOff);
+
+		elog(LOG, "replaying multixact truncation start: %u, %u, %x to %x",
+			 xlrec.startMemb,
+			 xlrec.endMemb,
+			 MXOffsetToMemberSegment(xlrec.startMemb),
+			 MXOffsetToMemberSegment(xlrec.endMemb));
+
+		/* should not be required, but more than cheap enough */
+		LWLockAcquire(MultiXactTruncationLock, LW_EXCLUSIVE);
+
+		PerformMembersTruncation(xlrec.startMemb, xlrec.endMemb);
+
+		/*
+		 * During XLOG replay, latest_page_number isn't necessarily set up
+		 * yet; insert a suitable value to bypass the sanity test in
+		 * SimpleLruTruncate.
+		 */
+		MultiXactOffsetCtl->shared->latest_page_number = pageno;
+		SimpleLruTruncate(MultiXactOffsetCtl,
+						  MultiXactIdToOffsetPage(xlrec.endOff));
+
+		/* only looked at in the startup process, no lock necessary */
+		MultiXactState->sawTruncationCkptCyle = true;
+
+		LWLockRelease(MultiXactTruncationLock);
+	}
 	else
 		elog(PANIC, "multixact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 5fcea11..e2b79ae 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -134,6 +134,7 @@ static int	SlruSelectLRUPage(SlruCtl ctl, int pageno);
 
 static bool SlruScanDirCbDeleteCutoff(SlruCtl ctl, char *filename,
 						  int segpage, void *data);
+static void SlruInternalDeleteSegment(SlruCtl ctl, char *filename);
 
 /*
  * Initialization of shared memory
@@ -1075,7 +1076,7 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
  * Flush dirty pages to disk during checkpoint or database shutdown
  */
 void
-SimpleLruFlush(SlruCtl ctl, bool checkpoint)
+SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
 {
 	SlruShared	shared = ctl->shared;
 	SlruFlushData fdata;
@@ -1096,11 +1097,11 @@ SimpleLruFlush(SlruCtl ctl, bool checkpoint)
 		SlruInternalWritePage(ctl, slotno, &fdata);
 
 		/*
-		 * When called during a checkpoint, we cannot assert that the slot is
-		 * clean now, since another process might have re-dirtied it already.
-		 * That's okay.
+		 * In some places (e.g. checkpoints), we cannot assert that the slot
+		 * is clean now, since another process might have re-dirtied it
+		 * already.  That's okay.
 		 */
-		Assert(checkpoint ||
+		Assert(allow_redirtied ||
 			   shared->page_status[slotno] == SLRU_PAGE_EMPTY ||
 			   (shared->page_status[slotno] == SLRU_PAGE_VALID &&
 				!shared->page_dirty[slotno]));
@@ -1210,8 +1211,14 @@ restart:;
 	(void) SlruScanDirectory(ctl, SlruScanDirCbDeleteCutoff, &cutoffPage);
 }
 
-void
-SlruDeleteSegment(SlruCtl ctl, char *filename)
+/*
+ * Delete an individual SLRU segment, identified by the filename.
+ *
+ * NB: This does not touch the SLRU buffers themselves, callers have to ensure
+ * they either can't yet contain anything, or have already been cleaned out.
+ */
+static void
+SlruInternalDeleteSegment(SlruCtl ctl, char *filename)
 {
 	char		path[MAXPGPATH];
 
@@ -1222,6 +1229,70 @@ SlruDeleteSegment(SlruCtl ctl, char *filename)
 }
 
 /*
+ * Delete an individual SLRU segment, identified by the segment number.
+ */
+void
+SlruDeleteSegment(SlruCtl ctl, int segno)
+{
+	SlruShared	shared = ctl->shared;
+	int			slotno;
+	char		path[MAXPGPATH];
+	bool		did_write;
+
+	/* Clean out any possibly existing references to the segment. */
+	LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+restart:
+	did_write = false;
+	for (slotno = 0; slotno < shared->num_slots; slotno++)
+	{
+		int pagesegno = shared->page_number[slotno] / SLRU_PAGES_PER_SEGMENT;
+
+		if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
+			continue;
+
+		/* not the segment we're looking for */
+		if (pagesegno != segno)
+			continue;
+
+		/* If page is clean, just change state to EMPTY (expected case). */
+		if (shared->page_status[slotno] == SLRU_PAGE_VALID &&
+			!shared->page_dirty[slotno])
+		{
+			shared->page_status[slotno] = SLRU_PAGE_EMPTY;
+			continue;
+		}
+
+		/*
+		 * Hmm, we have (or may have) I/O operations acting on the page, so
+		 * we've got to wait for them to finish and then start again. This is
+		 * the same logic as in SlruSelectLRUPage.  (XXX if page is dirty,
+		 * wouldn't it be OK to just discard it without writing it?  For now,
+		 * keep the logic the same as it was.)
+		 */
+		if (shared->page_status[slotno] == SLRU_PAGE_VALID)
+			SlruInternalWritePage(ctl, slotno, NULL);
+		else
+			SimpleLruWaitIO(ctl, slotno);
+
+		did_write = true;
+	}
+
+	/*
+	 * Be extra careful and re-check. The IO functions release the control
+	 * lock, so new pages could have been read in.
+	 */
+	if (did_write)
+		goto restart;
+
+	snprintf(path, MAXPGPATH, "%s/%04X", ctl->Dir, segno);
+	ereport(DEBUG2,
+			(errmsg("removing file \"%s\"", path)));
+	unlink(path);
+
+	LWLockRelease(shared->ControlLock);
+}
+
+/*
  * SlruScanDirectory callback
  *		This callback reports true if there's any segment prior to the one
  *		containing the page passed as "data".
@@ -1249,7 +1320,7 @@ SlruScanDirCbDeleteCutoff(SlruCtl ctl, char *filename, int segpage, void *data)
 	int			cutoffPage = *(int *) data;
 
 	if (ctl->PagePrecedes(segpage, cutoffPage))
-		SlruDeleteSegment(ctl, filename);
+		SlruInternalDeleteSegment(ctl, filename);
 
 	return false;				/* keep going */
 }
@@ -1261,7 +1332,7 @@ SlruScanDirCbDeleteCutoff(SlruCtl ctl, char *filename, int segpage, void *data)
 bool
 SlruScanDirCbDeleteAll(SlruCtl ctl, char *filename, int segpage, void *data)
 {
-	SlruDeleteSegment(ctl, filename);
+	SlruInternalDeleteSegment(ctl, filename);
 
 	return false;				/* keep going */
 }
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4e37ad3..4f7f74b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6287,7 +6287,6 @@ StartupXLOG(void)
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
 	SetCommitTsLimit(checkPoint.oldestCommitTs,
 					 checkPoint.newestCommitTs);
-	MultiXactSetSafeTruncate(checkPoint.oldestMulti);
 	XLogCtl->ckptXidEpoch = checkPoint.nextXidEpoch;
 	XLogCtl->ckptXid = checkPoint.nextXid;
 
@@ -6304,10 +6303,8 @@ StartupXLOG(void)
 	StartupReorderBuffer();
 
 	/*
-	 * Startup MultiXact.  We need to do this early for two reasons: one is
-	 * that we might try to access multixacts when we do tuple freezing, and
-	 * the other is we need its state initialized because we attempt
-	 * truncation during restartpoints.
+	 * Startup MultiXact, we need to do this early, to be able to replay
+	 * truncations.
 	 */
 	StartupMultiXact();
 
@@ -8465,12 +8462,6 @@ CreateCheckPoint(int flags)
 	END_CRIT_SECTION();
 
 	/*
-	 * Now that the checkpoint is safely on disk, we can update the point to
-	 * which multixact can be truncated.
-	 */
-	MultiXactSetSafeTruncate(checkPoint.oldestMulti);
-
-	/*
 	 * Let smgr do post-checkpoint cleanup (eg, deleting old files).
 	 */
 	smgrpostckpt();
@@ -8509,11 +8500,6 @@ CreateCheckPoint(int flags)
 	if (!RecoveryInProgress())
 		TruncateSUBTRANS(GetOldestXmin(NULL, false));
 
-	/*
-	 * Truncate pg_multixact too.
-	 */
-	TruncateMultiXact();
-
 	/* Real work is done, but log and update stats before releasing lock. */
 	LogCheckpointEnd(false);
 
@@ -8844,21 +8830,6 @@ CreateRestartPoint(int flags)
 	}
 
 	/*
-	 * Due to a historical accident multixact truncations are not WAL-logged,
-	 * but just performed everytime the mxact horizon is increased. So, unless
-	 * we explicitly execute truncations on a standby it will never clean out
-	 * /pg_multixact which obviously is bad, both because it uses space and
-	 * because we can wrap around into pre-existing data...
-	 *
-	 * We can only do the truncation here, after the UpdateControlFile()
-	 * above, because we've now safely established a restart point.  That
-	 * guarantees we will not need to access those multis.
-	 *
-	 * It's probably worth improving this.
-	 */
-	TruncateMultiXact();
-
-	/*
 	 * Truncate pg_subtrans if possible.  We can throw away all data before
 	 * the oldest XMIN of any running transaction.  No future transaction will
 	 * attempt to reference any pg_subtrans entry older than that (see Asserts
@@ -9218,9 +9189,13 @@ xlog_redo(XLogReaderState *record)
 		LWLockRelease(OidGenLock);
 		MultiXactSetNextMXact(checkPoint.nextMulti,
 							  checkPoint.nextMultiOffset);
+		/*
+		 * NB: This may perform multixact truncation when replaying WAL
+		 * generated by an older primary.
+		 */
+		MultiXactAdvanceOldest(checkPoint.oldestMulti,
+							   checkPoint.oldestMultiDB);
 		SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
-		SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
-		MultiXactSetSafeTruncate(checkPoint.oldestMulti);
 
 		/*
 		 * If we see a shutdown checkpoint while waiting for an end-of-backup
@@ -9310,14 +9285,16 @@ xlog_redo(XLogReaderState *record)
 		LWLockRelease(OidGenLock);
 		MultiXactAdvanceNextMXact(checkPoint.nextMulti,
 								  checkPoint.nextMultiOffset);
+		/*
+		 * NB: This may perform multixact truncation when replaying WAL
+		 * generated by an older primary.
+		 */
+		MultiXactAdvanceOldest(checkPoint.oldestMulti,
+							   checkPoint.oldestMultiDB);
 		if (TransactionIdPrecedes(ShmemVariableCache->oldestXid,
 								  checkPoint.oldestXid))
 			SetTransactionIdLimit(checkPoint.oldestXid,
 								  checkPoint.oldestXidDB);
-		MultiXactAdvanceOldest(checkPoint.oldestMulti,
-							   checkPoint.oldestMultiDB);
-		MultiXactSetSafeTruncate(checkPoint.oldestMulti);
-
 		/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
 		ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
 		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index baf66f1..c1433e9 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1134,11 +1134,11 @@ vac_truncate_clog(TransactionId frozenXID,
 		return;
 
 	/*
-	 * Truncate CLOG and CommitTs to the oldest computed value. Note we don't
-	 * truncate multixacts; that will be done by the next checkpoint.
+	 * Truncate CLOG, multixact and CommitTs to the oldest computed value.
 	 */
 	TruncateCLOG(frozenXID);
 	TruncateCommitTs(frozenXID, true);
+	TruncateMultiXact(minMulti, minmulti_datoid, false);
 
 	/*
 	 * Update the wrap limit for GetNewTransactionId and creation of new
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 6213f8a..bfcbbc4 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -71,6 +71,7 @@ typedef struct MultiXactMember
 #define XLOG_MULTIXACT_ZERO_OFF_PAGE	0x00
 #define XLOG_MULTIXACT_ZERO_MEM_PAGE	0x10
 #define XLOG_MULTIXACT_CREATE_ID		0x20
+#define XLOG_MULTIXACT_TRUNCATE_ID		0x30
 
 typedef struct xl_multixact_create
 {
@@ -82,6 +83,16 @@ typedef struct xl_multixact_create
 
 #define SizeOfMultiXactCreate (offsetof(xl_multixact_create, members))
 
+typedef struct xl_multixact_truncate
+{
+	MultiXactOffset startOff;
+	MultiXactOffset endOff;
+
+	MultiXactOffset startMemb;
+	MultiXactOffset endMemb;
+} xl_multixact_truncate;
+#define SizeOfMultiXactTruncate (sizeof(xl_multixact_truncate))
+
 
 extern MultiXactId MultiXactIdCreate(TransactionId xid1,
 				  MultiXactStatus status1, TransactionId xid2,
@@ -120,13 +131,12 @@ extern void MultiXactGetCheckptMulti(bool is_shutdown,
 						 Oid *oldestMultiDB);
 extern void CheckPointMultiXact(void);
 extern MultiXactId GetOldestMultiXactId(void);
-extern void TruncateMultiXact(void);
+extern void TruncateMultiXact(MultiXactId oldestMulti, Oid oldestMultiDB, bool inRecovery);
 extern void MultiXactSetNextMXact(MultiXactId nextMulti,
 					  MultiXactOffset nextMultiOffset);
 extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
 						  MultiXactOffset minMultiOffset);
 extern void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB);
-extern void MultiXactSetSafeTruncate(MultiXactId safeTruncateMulti);
 extern int	MultiXactMemberFreezeThreshold(void);
 
 extern void multixact_twophase_recover(TransactionId xid, uint16 info,
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index 9c7f019..f60e75b 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -143,14 +143,14 @@ extern int SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
 extern int SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno,
 						   TransactionId xid);
 extern void SimpleLruWritePage(SlruCtl ctl, int slotno);
-extern void SimpleLruFlush(SlruCtl ctl, bool checkpoint);
+extern void SimpleLruFlush(SlruCtl ctl, bool allow_redirtied);
 extern void SimpleLruTruncate(SlruCtl ctl, int cutoffPage);
 extern bool SimpleLruDoesPhysicalPageExist(SlruCtl ctl, int pageno);
 
 typedef bool (*SlruScanCallback) (SlruCtl ctl, char *filename, int segpage,
 											  void *data);
 extern bool SlruScanDirectory(SlruCtl ctl, SlruScanCallback callback, void *data);
-extern void SlruDeleteSegment(SlruCtl ctl, char *filename);
+extern void SlruDeleteSegment(SlruCtl ctl, int segno);
 
 /* SlruScanDirectory public callbacks */
 extern bool SlruScanDirCbReportPresence(SlruCtl ctl, char *filename,
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index cff3b99..6f0688c 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -135,8 +135,9 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
 #define CommitTsControlLock			(&MainLWLockArray[38].lock)
 #define CommitTsLock				(&MainLWLockArray[39].lock)
 #define ReplicationOriginLock		(&MainLWLockArray[40].lock)
+#define MultiXactTruncationLock		(&MainLWLockArray[41].lock)
 
-#define NUM_INDIVIDUAL_LWLOCKS		41
+#define NUM_INDIVIDUAL_LWLOCKS		42
 
 /*
  * It's a bit odd to declare NUM_BUFFER_PARTITIONS and NUM_LOCK_PARTITIONS
-- 
2.4.0.rc2.1.g3d6bc9a
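
The stop-limit arithmetic in the SetOffsetVacuumLimit() hunk above (round the oldest member offset down to the start of its segment, then back off one more segment) can be tried in isolation. The following is only a sketch: the function name and the members_per_segment parameter are hypothetical, and the unsigned 32-bit wraparound on subtraction is assumed to mirror how MultiXactOffset behaves.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch.  members_per_segment stands in
 * for MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT.
 */
static uint32_t
sketch_offset_stop_limit(uint32_t oldest_offset, uint32_t members_per_segment)
{
	uint32_t	stop;

	assert(members_per_segment > 0);

	/* move back to the start of the segment containing oldest_offset */
	stop = oldest_offset - (oldest_offset % members_per_segment);

	/* always leave one segment before the wraparound point */
	stop -= members_per_segment;	/* unsigned arithmetic wraps modulo 2^32 */

	return stop;
}
```

For example, assuming 1636 members per page and 32 pages per segment (52352 members per segment, which should match a default 8 kB build but is an assumption here), an oldest offset of 200000 gives a stop limit of 104704, while an oldest offset inside the first segment wraps around to the top of the offset space.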

#2Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#1)
Re: Rework the way multixact truncations work

Andres Freund wrote:

Rework the way multixact truncations work.

I spent some time this morning reviewing this patch and had some
comments that I relayed over IM to Andres. The vast majority is
cosmetic, but there are two larger things:

1. I think this part of PerformMembersTruncation() is very confusing:

	/* verify whether the current segment is to be deleted */
	if (segment != startsegment && segment != endsegment)
		SlruDeleteSegment(MultiXactMemberCtl, segment);

I think this works correctly in that it preserves both endpoint files,
but the files in between are removed ... which is a confusing interface,
IMO. I think this merits a longer explanation.

2. We set PGXACT->delayChkpt while the truncation is executed. This
seems reasonable, and there's a good reason for it, but all the other
users of this facility only do small operations with this thing grabbed,
while the multixact truncation could take a long time because a large
number of files might be deleted. Maybe it's not a problem to have
checkpoints be delayed by several seconds, or who knows maybe even a
minute in a busy system. (We will have checkpointer sleeping in 10ms
intervals until the truncation is complete).

Maybe this is fine, not sure.
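
To make point 1 concrete, here is a minimal standalone sketch of what the loop in PerformMembersTruncation() appears to do: walk from the start segment to the end segment, wrapping at the top of the segment space, and delete everything strictly in between. All names are hypothetical, the small SKETCH_MAX_SEGMENT replaces MXOffsetToMemberSegment(MaxMultiXactOffset), and deletions are only recorded in an array rather than performed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, deliberately tiny segment space for illustration. */
#define SKETCH_MAX_SEGMENT 7

/*
 * Mark the segments that would be deleted when truncating from startsegment
 * to endsegment.  Both endpoint segments are kept.  Returns the number of
 * segments marked.
 */
static int
sketch_members_truncation(int startsegment, int endsegment,
						  bool deleted[SKETCH_MAX_SEGMENT + 1])
{
	int			segment = startsegment;
	int			ndeleted = 0;

	assert(startsegment >= 0 && startsegment <= SKETCH_MAX_SEGMENT);
	assert(endsegment >= 0 && endsegment <= SKETCH_MAX_SEGMENT);

	for (int i = 0; i <= SKETCH_MAX_SEGMENT; i++)
		deleted[i] = false;

	while (segment != endsegment)
	{
		/* the start segment is kept; the loop exits before the end segment */
		if (segment != startsegment)
		{
			deleted[segment] = true;
			ndeleted++;
		}

		/* move to the next segment, wrapping around at the maximum */
		segment = (segment == SKETCH_MAX_SEGMENT) ? 0 : segment + 1;
	}

	return ndeleted;
}
```

In this sketch the `segment != endsegment` half of the quoted condition is not needed, since the loop condition already guarantees it; only the start segment requires the explicit check. Truncating from segment 6 to segment 1 marks segments 7 and 0 and keeps 6 and 1.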

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#2)
Re: Rework the way multixact truncations work

On 2015-06-26 14:48:35 -0300, Alvaro Herrera wrote:

Andres Freund wrote:

Rework the way multixact truncations work.

I spent some time this morning reviewing this patch and had some
comments that I relayed over IM to Andres.

Thanks for that!

2. We set PGXACT->delayChkpt while the truncation is executed. This
seems reasonable, and there's a good reason for it, but all the other
users of this facility only do small operations with this thing grabbed,
while the multixact truncation could take a long time because a large
number of files might be deleted. Maybe it's not a problem to have
checkpoints be delayed by several seconds, or who knows maybe even a
minute in a busy system. (We will have checkpointer sleeping in 10ms
intervals until the truncation is complete).

I don't think this is a problem. Consider that we're doing all this in
the checkpointer today, blocking much more than just the actual xlog
insertion. That's a bigger problem, as we'll not do the paced writing
during that and such. The worst that this can cause is a bunch of sleeps,
which seems fairly harmless.

Greetings,

Andres Freund


#4Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Andres Freund (#1)
Re: Rework the way multixact truncations work

On Mon, Jun 22, 2015 at 7:24 AM, Andres Freund <andres@anarazel.de> wrote:

It'd be very welcome to see some wider testing and review on this.

I started looking at this and doing some testing. Here is some
initial feedback:

Perhaps vac_truncate_clog needs a new name now that it does more,
maybe vac_truncate_transaction_logs?

MultiXactState->sawTruncationCkptCyle: is 'Cyle' supposed to be 'Cycle'?

In the struct xl_multixact_truncate, and also the function
WriteMTruncateXlogRec and other places, I think you have confused the
typedefs MultiXactOffset and MultiXactId. If I'm not mistaken,
startMemb and endMemb have the correct type, but startOff and endOff
should be of type MultiXactId despite their names (the *values* stored
inside pg_multixact/offsets are indeed offsets (into
pg_multixact/members), but their *location* is what a multixact ID
represents).
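
The distinction drawn here can be illustrated with a small sketch: a multixact ID selects a slot in pg_multixact/offsets, and the value stored in that slot is an offset into pg_multixact/members. The type and function names below are hypothetical, and the page size assumes a default 8 kB block with 4-byte offsets; the real code uses MultiXactIdToOffsetPage() and MultiXactIdToOffsetEntry().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t SketchMultiXactId;		/* names a multixact */
typedef uint32_t SketchMultiXactOffset;	/* position in the members area */

/* Assumed default block size; real builds take this from BLCKSZ. */
#define SKETCH_BLCKSZ 8192
#define SKETCH_OFFSETS_PER_PAGE \
	((uint32_t) (SKETCH_BLCKSZ / sizeof(SketchMultiXactOffset)))

/*
 * Compute where in the offsets SLRU the entry for "multi" lives.  The input
 * is an ID; only the value stored at that location is an offset.
 */
static void
sketch_offsets_location(SketchMultiXactId multi,
						uint32_t *page, uint32_t *entry)
{
	assert(page != NULL && entry != NULL);

	*page = multi / SKETCH_OFFSETS_PER_PAGE;
	*entry = multi % SKETCH_OFFSETS_PER_PAGE;
}
```

Under these assumptions a page holds 2048 entries, so multixact 2048 is the first entry on page 1. Typing startOff and endOff as MultiXactId would document that they are inputs to this kind of lookup rather than positions in the members area.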

I was trying to understand if there could be any problem caused by
setting latest_page_number to the pageno that holds (or will hold)
xlrec.endOff in multixact_redo. The only thing that jumps out at me
is that the next call to SlruSelectLRUPage will no longer be prevented
from evicting the *real* latest page (the most recently created page).

In SlruDeleteSegment, is it OK to call unlink() while holding the SLRU
control lock?

In find_multixact_start you call SimpleLruFlush before calling
SimpleLruDoesPhysicalPageExist. Should we do something like this
instead? https://gist.github.com/macdice/8e5b2f0fe3827fdf3d5a

I think I saw some extra autovacuum activity that I didn't expect in a
simple scenario, but I'm not sure and will continue looking tomorrow.

--
Thomas Munro
http://www.enterprisedb.com


#5Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#4)
Re: Rework the way multixact truncations work

On 2015-06-29 23:54:40 +1200, Thomas Munro wrote:

On Mon, Jun 22, 2015 at 7:24 AM, Andres Freund <andres@anarazel.de> wrote:

It'd be very welcome to see some wider testing and review on this.

I started looking at this and doing some testing. Here is some
initial feedback:

Perhaps vac_truncate_clog needs a new name now that it does more,
maybe vac_truncate_transaction_logs?

It has done more before, so I don't really see a connection to this
patch...

MultiXactState->sawTruncationCkptCyle: is 'Cyle' supposed to be 'Cycle'?

Oops.

In the struct xl_multixact_truncate, and also the function
WriteMTruncateXlogRec and other places, I think you have confused the
typedefs MultiXactOffset and MultiXactId. If I'm not mistaken,
startMemb and endMemb have the correct type, but startOff and endOff
should be of type MultiXactId despite their names (the *values* stored
inside pg_multixact/offsets are indeed offsets (into
pg_multixact/members), but their *location* is what a multixact ID
represents).

IIRC I did it that way to make clear this is just 'byte' type offsets,
without other meaning. Wasn't a good idea.

I was trying to understand if there could be any problem caused by
setting latest_page_number to the pageno that holds (or will hold)
xlrec.endOff in multixact_redo. The only thing that jumps out at me
is that the next call to SlruSelectLRUPage will no longer be prevented
from evicting the *real* latest page (the most recently created page).

That hasn't changed unless I miss something?

In SlruDeleteSegment, is it OK to call unlink() while holding the SLRU
control lock?

I think it's safer than not doing it, but don't particularly care.

In find_multixact_start you call SimpleLruFlush before calling
SimpleLruDoesPhysicalPageExist. Should we do something like this
instead? https://gist.github.com/macdice/8e5b2f0fe3827fdf3d5a

I'm currently slightly inclined to do it "my way". The way these
functions are used it doesn't seem like a bad property to ensure things
are on disk.
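
A toy model of why the flush matters, assuming nothing about the real slru.c beyond what is said above: a check that only looks at on-disk state cannot see a page that so far exists only as a dirty buffer. `ToyPageStore` and the `toy_*` functions are hypothetical and stand in for the SLRU buffers, SimpleLruFlush and SimpleLruDoesPhysicalPageExist respectively.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NUM_PAGES 8

/* Toy stand-in for an SLRU: a set of dirty buffers plus the on-disk pages. */
typedef struct ToyPageStore
{
    bool        dirty_in_memory[TOY_NUM_PAGES];
    bool        on_disk[TOY_NUM_PAGES];
} ToyPageStore;

/* Writing a page only touches the in-memory copy. */
static void
toy_write_page(ToyPageStore *store, int pageno)
{
    assert(pageno >= 0 && pageno < TOY_NUM_PAGES);
    store->dirty_in_memory[pageno] = true;
}

/* Flushing writes every dirty buffer out to disk. */
static void
toy_flush(ToyPageStore *store)
{
    for (int pageno = 0; pageno < TOY_NUM_PAGES; pageno++)
    {
        if (store->dirty_in_memory[pageno])
        {
            store->on_disk[pageno] = true;
            store->dirty_in_memory[pageno] = false;
        }
    }
}

/* The existence check deliberately consults on-disk state only. */
static bool
toy_physical_page_exists(const ToyPageStore *store, int pageno)
{
    assert(pageno >= 0 && pageno < TOY_NUM_PAGES);
    return store->on_disk[pageno];
}
```

Without the flush the check reports a freshly written page as missing; with it, the decision is made purely on on-disk state.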

I think I saw some extra autovacuum activity that I didn't expect in a
simple scenario, but I'm not sure and will continue looking tomorrow.

Cool, thanks!

Greetings,

Andres Freund

#6Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
3 attachment(s)
Re: Rework the way multixact truncations work

Hi,

I'm still working on this patch. I found a bunch of issues. None of them
super critical and some pre-existing; but nonetheless I don't feel like
it's ready to push yet. So we'll have the first alpha without those
fixes :(

The biggest issues so far are:

* There are, both in the posted patch and as-is in all the branches, a lot
of places that aren't actually safe against a concurrent truncation. A
bunch of places grab oldestMultiXactId with an lwlock held, release
it, and then make decisions based on that.

A bunch of places (including the find_multixact_start callsites) are
actually vulnerable to that; for a bunch of others it's less likely to
be a problem. All callers of GetMultiXactIdMembers() are vulnerable,
but with the exception of pg_get_multixact_members() they'll never
pass in a value that's older than the new oldest member value.

That's a problem for the current branches. Afaics that can lead to a
useless round of emergency autovacs via
SetMultiXactIdLimit()->SetOffsetVacuumLimit().

SetOffsetVacuumLimit() can easily protect against that by taking the
new MultiXactTruncationLock lock. We could do the same for
pg_get_multixact_members() - afaics the only caller that'll look up too
old values otherwise - but I don't think it matters, you'll get a
slightly obscure error if you access a too old xact that's just being
truncated away and that is that.

* There was no update of the in-memory oldest* values when replaying a
truncation. That's problematic if a standby is promoted after
replaying a truncation record, but before a following checkpoint
record. This would be fixed by an emergency autovacuum, but that's
obviously not nice. Trivial to fix.

* The in-memory oldest values were updated *after* the truncation
happened. It's unlikely to matter in reality, but it's safer to
update them before, so a concurrent GetMultiXactIdMembers() of stuff
from before the truncation will get the proper error.

* PerformMembersTruncation() probably confused Alvaro because it wasn't
actually correct - there's no need to have the segment containing old
oldestOffset (in contrast to oldestOffsetAlive) survive. Other than
leaking a segment, that's harmless, but obviously not desirable.
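
For the last point, the segment arithmetic can be sketched as follows. This is one reading of that bullet, not the patch's PerformMembersTruncation: `removable_member_segments` is a hypothetical helper, wraparound of the members space is ignored, and the constant 1636 assumes the default 8 kB block size. The two macros have the same shape as the ones in the attached patch.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t MultiXactOffset;

/* Assumption: default 8 kB pages, which gives 1636 members per page. */
#define MULTIXACT_MEMBERS_PER_PAGE 1636
#define SLRU_PAGES_PER_SEGMENT 32

#define MXOffsetToMemberPage(xid) \
    ((xid) / (MultiXactOffset) MULTIXACT_MEMBERS_PER_PAGE)
#define MXOffsetToMemberSegment(xid) \
    (MXOffsetToMemberPage(xid) / SLRU_PAGES_PER_SEGMENT)

/*
 * Hypothetical helper: number of whole member segments that can be removed
 * when the oldest member offset advances from old_oldest to new_oldest.
 * The segment holding old_oldest is removed as well, unless it is also the
 * segment holding new_oldest. Wraparound is deliberately not handled here.
 */
static uint32_t
removable_member_segments(MultiXactOffset old_oldest,
                          MultiXactOffset new_oldest)
{
    assert(old_oldest <= new_oldest);
    return MXOffsetToMemberSegment(new_oldest) -
        MXOffsetToMemberSegment(old_oldest);
}
```

With these constants a segment holds 1636 * 32 = 52352 members, so advancing the oldest offset from 10 to 157061 (segment 3) frees segments 0 through 2.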

Additionally I'm changing some stuff, some requested by review:
* xl_multixact_truncate's members are now called
(start|end)Trunc(Off|Memb)
* (start|end)TruncOff have the appropriate type now
* typo fixes
* comment improvements
* pgindent
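
With those renames the truncation record would look roughly like the sketch below. The field names and the format string are taken from the multixact_desc() hunk in the attached patch and the types from the WriteMTruncateXlogRec() prototype; the field order, the local typedefs and the `describe_truncate` helper are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Local stand-ins for the PostgreSQL typedefs. */
typedef uint32_t Oid;
typedef uint32_t MultiXactId;
typedef uint32_t MultiXactOffset;

/* Assumed layout; only the field names are confirmed by the patch. */
typedef struct xl_multixact_truncate
{
    Oid         oldestMultiDB;
    MultiXactId startTruncOff;      /* range in pg_multixact/offsets */
    MultiXactId endTruncOff;
    MultiXactOffset startTruncMemb; /* range in pg_multixact/members */
    MultiXactOffset endTruncMemb;
} xl_multixact_truncate;

/* Hypothetical helper: format a record the way multixact_desc() does. */
static int
describe_truncate(char *buf, size_t buflen, const xl_multixact_truncate *xlrec)
{
    assert(buf != NULL && xlrec != NULL);
    return snprintf(buf, buflen, "offsets [%u, %u), members [%u, %u)",
                    (unsigned) xlrec->startTruncOff,
                    (unsigned) xlrec->endTruncOff,
                    (unsigned) xlrec->startTruncMemb,
                    (unsigned) xlrec->endTruncMemb);
}
```

Both ranges are half-open: the start value is truncated away, the end value is the first one that survives.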

New version attached.

Greetings,

Andres Freund

Attachments:

0001-WIP-dontcommit-Add-functions-to-burn-multixacts.patch (text/x-patch; charset=us-ascii)
From cdc4f8f3341161b87a5d11171efb14c98c252ee6 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 4 Jun 2015 19:38:32 +0200
Subject: [PATCH 1/3] WIP-dontcommit: Add functions to burn multixacts

This should live in its own module, but we don't have that yet.
---
 contrib/pageinspect/heapfuncs.c          | 43 ++++++++++++++++++++++++++++++++
 contrib/pageinspect/pageinspect--1.3.sql |  6 +++++
 src/backend/access/heap/heapam.c         |  2 +-
 src/backend/access/transam/multixact.c   | 15 ++++++-----
 src/include/access/multixact.h           |  3 ++-
 5 files changed, 61 insertions(+), 8 deletions(-)

diff --git a/contrib/pageinspect/heapfuncs.c b/contrib/pageinspect/heapfuncs.c
index 8d1666c..7a3aa14 100644
--- a/contrib/pageinspect/heapfuncs.c
+++ b/contrib/pageinspect/heapfuncs.c
@@ -29,6 +29,8 @@
 #include "funcapi.h"
 #include "utils/builtins.h"
 #include "miscadmin.h"
+#include "access/multixact.h"
+#include "access/transam.h"
 
 
 /*
@@ -223,3 +225,44 @@ heap_page_items(PG_FUNCTION_ARGS)
 	else
 		SRF_RETURN_DONE(fctx);
 }
+
+extern Datum
+pg_burn_multixact(PG_FUNCTION_ARGS);
+PG_FUNCTION_INFO_V1(pg_burn_multixact);
+
+Datum
+pg_burn_multixact(PG_FUNCTION_ARGS)
+{
+	int		rep = PG_GETARG_INT32(0);
+	int		size = PG_GETARG_INT32(1);
+	MultiXactMember *members;
+	MultiXactId ret;
+	TransactionId id = ReadNewTransactionId() - size;
+	int		i;
+
+	if (rep < 1)
+		elog(ERROR, "need to burn, burn, burn");
+
+	members = palloc(size * sizeof(MultiXactMember));
+	for (i = 0; i < size; i++)
+	{
+		members[i].xid = id++;
+		members[i].status = MultiXactStatusForShare;
+
+		if (!TransactionIdIsNormal(members[i].xid))
+		{
+			id = FirstNormalTransactionId;
+			members[i].xid = id++;
+		}
+	}
+
+	MultiXactIdSetOldestMember();
+
+	for (i = 0; i < rep; i++)
+	{
+		CHECK_FOR_INTERRUPTS();
+		ret = MultiXactIdCreateFromMembers(size, members, true);
+	}
+
+	PG_RETURN_INT64((int64) ret);
+}
diff --git a/contrib/pageinspect/pageinspect--1.3.sql b/contrib/pageinspect/pageinspect--1.3.sql
index a99e058..22f51bc 100644
--- a/contrib/pageinspect/pageinspect--1.3.sql
+++ b/contrib/pageinspect/pageinspect--1.3.sql
@@ -187,3 +187,9 @@ CREATE FUNCTION gin_leafpage_items(IN page bytea,
 RETURNS SETOF record
 AS 'MODULE_PATHNAME', 'gin_leafpage_items'
 LANGUAGE C STRICT;
+
+
+CREATE FUNCTION pg_burn_multixact(num int4, size int4)
+RETURNS int4
+AS 'MODULE_PATHNAME', 'pg_burn_multixact'
+LANGUAGE C STRICT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index caacc10..c57f99d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6050,7 +6050,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		 * Create a new multixact with the surviving members of the previous
 		 * one, to set as new Xmax in the tuple.
 		 */
-		xid = MultiXactIdCreateFromMembers(nnewmembers, newmembers);
+		xid = MultiXactIdCreateFromMembers(nnewmembers, newmembers, false);
 		*flags |= FRM_RETURN_IS_MULTI;
 	}
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 377d084..cf43254 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -405,7 +405,7 @@ MultiXactIdCreate(TransactionId xid1, MultiXactStatus status1,
 	members[1].xid = xid2;
 	members[1].status = status2;
 
-	newMulti = MultiXactIdCreateFromMembers(2, members);
+	newMulti = MultiXactIdCreateFromMembers(2, members, false);
 
 	debug_elog3(DEBUG2, "Create: %s",
 				mxid_to_string(newMulti, 2, members));
@@ -471,7 +471,7 @@ MultiXactIdExpand(MultiXactId multi, TransactionId xid, MultiXactStatus status)
 		 */
 		member.xid = xid;
 		member.status = status;
-		newMulti = MultiXactIdCreateFromMembers(1, &member);
+		newMulti = MultiXactIdCreateFromMembers(1, &member, false);
 
 		debug_elog4(DEBUG2, "Expand: %u has no members, create singleton %u",
 					multi, newMulti);
@@ -523,7 +523,7 @@ MultiXactIdExpand(MultiXactId multi, TransactionId xid, MultiXactStatus status)
 
 	newMembers[j].xid = xid;
 	newMembers[j++].status = status;
-	newMulti = MultiXactIdCreateFromMembers(j, newMembers);
+	newMulti = MultiXactIdCreateFromMembers(j, newMembers, false);
 
 	pfree(members);
 	pfree(newMembers);
@@ -742,7 +742,7 @@ ReadNextMultiXactId(void)
  * NB: the passed members[] array will be sorted in-place.
  */
 MultiXactId
-MultiXactIdCreateFromMembers(int nmembers, MultiXactMember *members)
+MultiXactIdCreateFromMembers(int nmembers, MultiXactMember *members, bool nocache)
 {
 	MultiXactId multi;
 	MultiXactOffset offset;
@@ -761,7 +761,9 @@ MultiXactIdCreateFromMembers(int nmembers, MultiXactMember *members)
 	 * corner cases where someone else added us to a MultiXact without our
 	 * knowledge, but it's not worth checking for.)
 	 */
-	multi = mXactCacheGetBySet(nmembers, members);
+	multi = nocache ? InvalidMultiXactId :
+		mXactCacheGetBySet(nmembers, members);
+
 	if (MultiXactIdIsValid(multi))
 	{
 		debug_elog2(DEBUG2, "Create: in cache!");
@@ -834,7 +836,8 @@ MultiXactIdCreateFromMembers(int nmembers, MultiXactMember *members)
 	END_CRIT_SECTION();
 
 	/* Store the new MultiXactId in the local cache, too */
-	mXactCachePut(multi, nmembers, members);
+	if (!nocache)
+		mXactCachePut(multi, nmembers, members);
 
 	debug_elog2(DEBUG2, "Create: all done");
 
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index f1448fe..6213f8a 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -86,10 +86,11 @@ typedef struct xl_multixact_create
 extern MultiXactId MultiXactIdCreate(TransactionId xid1,
 				  MultiXactStatus status1, TransactionId xid2,
 				  MultiXactStatus status2);
+extern MultiXactId CreateMultiXactId(int nmembers, MultiXactMember *members, bool nocache);
 extern MultiXactId MultiXactIdExpand(MultiXactId multi, TransactionId xid,
 				  MultiXactStatus status);
 extern MultiXactId MultiXactIdCreateFromMembers(int nmembers,
-							 MultiXactMember *members);
+							 MultiXactMember *members, bool nocache);
 
 extern MultiXactId ReadNextMultiXactId(void);
 extern bool MultiXactIdIsRunning(MultiXactId multi, bool isLockOnly);
-- 
2.4.0.rc2.1.g3d6bc9a

0002-Lower-_freeze_max_age-minimum-values.patch (text/x-patch; charset=us-ascii)
From ceb5683e62bcae51689dc27e0764f822e26595f7 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 15 Jun 2015 19:12:52 +0200
Subject: [PATCH 2/3] Lower *_freeze_max_age minimum values.

The old minimum values are rather large, making it time consuming to
test related behaviour. Additionally the current limits, especially for
multixacts, can be problematic in space-constrained systems. 10000000
multixacts can contain a lot of members.

Since there's no good reason for the current limits, lower them a good
bit. Setting them to 0 would be a bad idea, triggering endless vacuums,
so still retain a limit.
---
 src/backend/utils/misc/guc.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 230c5cc..aec4adc 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2516,17 +2516,17 @@ static struct config_int ConfigureNamesInt[] =
 		},
 		&autovacuum_freeze_max_age,
 		/* see pg_resetxlog if you change the upper-limit value */
-		200000000, 100000000, 2000000000,
+		200000000, 100000, 2000000000,
 		NULL, NULL, NULL
 	},
 	{
-		/* see varsup.c for why this is PGC_POSTMASTER not PGC_SIGHUP */
+		/* see multixact.c for why this is PGC_POSTMASTER not PGC_SIGHUP */
 		{"autovacuum_multixact_freeze_max_age", PGC_POSTMASTER, AUTOVACUUM,
 			gettext_noop("Multixact age at which to autovacuum a table to prevent multixact wraparound."),
 			NULL
 		},
 		&autovacuum_multixact_freeze_max_age,
-		400000000, 10000000, 2000000000,
+		400000000, 10000, 2000000000,
 		NULL, NULL, NULL
 	},
 	{
-- 
2.4.0.rc2.1.g3d6bc9a

0003-Rework-the-way-multixact-truncations-work.patch (text/x-patch; charset=us-ascii)
From 482ac1f480cf9ffe8fe01f1ab94a95f9655b76bd Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 29 Jun 2015 21:47:42 +0200
Subject: [PATCH 3/3] Rework the way multixact truncations work.

The fact that multixact truncations are not WAL logged has caused a fair
share of problems. Amongst others it requires to do computations during
recovery while the database is not in a consistent state, delaying
truncations till checkpoints, and handling members being truncated, but
offset not.

We tried to put bandaids on lots of these issues over the last years,
but it seems time to change course. Thus this patch introduces WAL
logging for truncation, even in the back branches.

This allows:
1) to perform the truncation directly during VACUUM, instead of delaying it
   to the checkpoint.
2) to avoid looking at the offsets SLRU for truncation during recovery,
   we can just use the master's values.
3) simplify a fair amount of logic to keep in memory limits straight,
   this has gotten much easier

During the course of fixing this a bunch of bugs had to be fixed:
1) Data was not purged from memory in the members SLRU before deleting
   segments. This happened to be hard or impossible to hit due to the
   interlock between checkpoints and truncation.
2) find_multixact_start() relied on SimpleLruDoesPhysicalPageExist - but
   that doesn't work for offsets that haven't yet been flushed to
   disk. Flush out before running to fix. Not pretty, but it feels
   slightly safer to only make decisions based on on-disk state.
3) find_multixact_start() could be called concurrently with a truncation
   and thus fail. Via SetOffsetVacuumLimit() that could lead to a round
   of emergency vacuuming. The problem remains in
   pg_get_multixact_members(), but that's quite harmless.

To handle the case of an updated standby replaying WAL from a not-yet
upgraded primary we have to recognize that situation and use "old style"
truncation (i.e. looking at the SLRUs) during WAL replay. In contrast to
before this now happens in the startup process, when replaying a
checkpoint record, instead of the checkpointer. Doing this in the
restartpoint was incorrect, they can happen much later than the original
checkpoint, thereby leading to wraparound. It's also more in line to how
the WAL logging now works.

To avoid "multixact_redo: unknown op code 48" errors standbys should be
upgraded before primaries. This needs to be expressed clearly in the
release notes.

Backpatch to 9.3, where the use of multixacts was expanded. Arguably
this could be backpatched further, but there doesn't seem to be
sufficient benefit to outweigh the risk of applying a significantly
different patch there.
---
 src/backend/access/rmgrdesc/mxactdesc.c |  11 +
 src/backend/access/transam/multixact.c  | 601 ++++++++++++++++++--------------
 src/backend/access/transam/slru.c       |  83 ++++-
 src/backend/access/transam/xlog.c       |  53 +--
 src/backend/commands/vacuum.c           |   4 +-
 src/include/access/multixact.h          |  18 +-
 src/include/access/slru.h               |   4 +-
 src/include/storage/lwlock.h            |   3 +-
 src/tools/pgindent/typedefs.list        |   1 +
 9 files changed, 471 insertions(+), 307 deletions(-)

diff --git a/src/backend/access/rmgrdesc/mxactdesc.c b/src/backend/access/rmgrdesc/mxactdesc.c
index 572951e..e44e100 100644
--- a/src/backend/access/rmgrdesc/mxactdesc.c
+++ b/src/backend/access/rmgrdesc/mxactdesc.c
@@ -70,6 +70,14 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 		for (i = 0; i < xlrec->nmembers; i++)
 			out_member(buf, &xlrec->members[i]);
 	}
+	else if (info == XLOG_MULTIXACT_TRUNCATE_ID)
+	{
+		xl_multixact_truncate *xlrec = (xl_multixact_truncate *) rec;
+
+		appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
+						 xlrec->startTruncOff, xlrec->endTruncOff,
+						 xlrec->startTruncMemb, xlrec->endTruncMemb);
+	}
 }
 
 const char *
@@ -88,6 +96,9 @@ multixact_identify(uint8 info)
 		case XLOG_MULTIXACT_CREATE_ID:
 			id = "CREATE_ID";
 			break;
+		case XLOG_MULTIXACT_TRUNCATE_ID:
+			id = "TRUNCATE";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index cf43254..7dfac45 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -49,9 +49,7 @@
  * value is removed; the cutoff value is stored in pg_class.  The minimum value
  * across all tables in each database is stored in pg_database, and the global
  * minimum across all databases is part of pg_control and is kept in shared
- * memory.  At checkpoint time, after the value is known flushed in WAL, any
- * files that correspond to multixacts older than that value are removed.
- * (These files are also removed when a restartpoint is executed.)
+ * memory.  Whenever that minimum is advanced, the SLRUs are truncated.
  *
  * When new multixactid values are to be created, care is taken that the
  * counter does not fall within the wraparound horizon considering the global
@@ -83,6 +81,7 @@
 #include "postmaster/autovacuum.h"
 #include "storage/lmgr.h"
 #include "storage/pmsignal.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
@@ -153,6 +152,7 @@
 
 /* page in which a member is to be found */
 #define MXOffsetToMemberPage(xid) ((xid) / (TransactionId) MULTIXACT_MEMBERS_PER_PAGE)
+#define MXOffsetToMemberSegment(xid) (MXOffsetToMemberPage(xid) / SLRU_PAGES_PER_SEGMENT)
 
 /* Location (byte offset within page) of flag word for a given member */
 #define MXOffsetToFlagsOffset(xid) \
@@ -218,11 +218,12 @@ typedef struct MultiXactStateData
 	bool		oldestOffsetKnown;
 
 	/*
-	 * This is what the previous checkpoint stored as the truncate position.
-	 * This value is the oldestMultiXactId that was valid when a checkpoint
-	 * was last executed.
+	 * True if a multixact truncation WAL record was replayed since the last
+	 * checkpoint. This is used to trigger 'legacy truncations', i.e. truncate
+	 * by looking at the data directory during WAL replay, when the primary is
+	 * too old to generate truncation records.
 	 */
-	MultiXactId lastCheckpointedOldest;
+	bool		sawTruncationCkptCycle;
 
 	/* support for anti-wraparound measures */
 	MultiXactId multiVacLimit;
@@ -231,8 +232,7 @@ typedef struct MultiXactStateData
 	MultiXactId multiWrapLimit;
 
 	/* support for members anti-wraparound measures */
-	MultiXactOffset offsetStopLimit;
-	bool offsetStopLimitKnown;
+	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
 
 	/*
 	 * Per-backend data starts here.  We have two arrays stored in the area
@@ -362,12 +362,14 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 						MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static void DetermineSafeOldestOffset(MultiXactId oldestMXact);
 static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
 						 MultiXactOffset start, uint32 distance);
-static bool SetOffsetVacuumLimit(bool finish_setup);
+static bool SetOffsetVacuumLimit(void);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMZeroPageXlogRec(int pageno, uint8 info);
+static void WriteMTruncateXlogRec(Oid oldestMultiDB,
+					  MultiXactId startOff, MultiXactId endOff,
+					  MultiXactOffset startMemb, MultiXactOffset endMemb);
 
 
 /*
@@ -1100,7 +1102,7 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	 *----------
 	 */
 #define OFFSET_WARN_SEGMENTS	20
-	if (MultiXactState->offsetStopLimitKnown &&
+	if (MultiXactState->oldestOffsetKnown &&
 		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
 								 nmembers))
 	{
@@ -1140,7 +1142,7 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
 	}
 
-	if (MultiXactState->offsetStopLimitKnown &&
+	if (MultiXactState->oldestOffsetKnown &&
 		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit,
 								 nextOffset,
 								 nmembers + MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT * OFFSET_WARN_SEGMENTS))
@@ -2018,13 +2020,21 @@ StartupMultiXact(void)
 void
 TrimMultiXact(void)
 {
-	MultiXactId multi = MultiXactState->nextMXact;
-	MultiXactOffset offset = MultiXactState->nextOffset;
-	MultiXactId	oldestMXact;
+	MultiXactId nextMXact;
+	MultiXactOffset offset;
+	MultiXactId oldestMXact;
+	MultiXactId oldestMXactDB;
 	int			pageno;
 	int			entryno;
 	int			flagsoff;
 
+	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
+	nextMXact = MultiXactState->nextMXact;
+	offset = MultiXactState->nextOffset;
+	oldestMXact = MultiXactState->oldestMultiXactId;
+	oldestMXactDB = MultiXactState->oldestMultiXactDB;
+	MultiXactState->finishedStartup = true;
+	LWLockRelease(MultiXactGenLock);
 
 	/* Clean up offsets state */
 	LWLockAcquire(MultiXactOffsetControlLock, LW_EXCLUSIVE);
@@ -2032,20 +2042,20 @@ TrimMultiXact(void)
 	/*
 	 * (Re-)Initialize our idea of the latest page number for offsets.
 	 */
-	pageno = MultiXactIdToOffsetPage(multi);
+	pageno = MultiXactIdToOffsetPage(nextMXact);
 	MultiXactOffsetCtl->shared->latest_page_number = pageno;
 
 	/*
 	 * Zero out the remainder of the current offsets page.  See notes in
 	 * TrimCLOG() for motivation.
 	 */
-	entryno = MultiXactIdToOffsetEntry(multi);
+	entryno = MultiXactIdToOffsetEntry(nextMXact);
 	if (entryno != 0)
 	{
 		int			slotno;
 		MultiXactOffset *offptr;
 
-		slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
+		slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
 		offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 		offptr += entryno;
 
@@ -2094,12 +2104,11 @@ TrimMultiXact(void)
 
 	LWLockRelease(MultiXactMemberControlLock);
 
-	if (SetOffsetVacuumLimit(true) && IsUnderPostmaster)
-		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-	LWLockAcquire(MultiXactGenLock, LW_SHARED);
-	oldestMXact = MultiXactState->lastCheckpointedOldest;
-	LWLockRelease(MultiXactGenLock);
-	DetermineSafeOldestOffset(oldestMXact);
+	/*
+	 * Recompute limits as we are now fully started, we now can correctly
+	 * compute how far a members wraparound is away.
+	 */
+	SetMultiXactIdLimit(oldestMXact, oldestMXactDB);
 }
 
 /*
@@ -2268,8 +2277,19 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid)
 	 (errmsg("MultiXactId wrap limit is %u, limited by database with OID %u",
 			 multiWrapLimit, oldest_datoid)));
 
+	/*
+	 * Computing the actual limits is only possible once the data directory is
+	 * in a consistent state. There's no need to compute the limits while
+	 * still replaying WAL as no new multis can be created anyway. So we'll
+	 * only do further checks after TrimMultiXact() has been called.
+	 */
+	if (!MultiXactState->finishedStartup)
+		return;
+
+	Assert(!InRecovery);
+
 	/* Set limits for offset vacuum. */
-	needs_offset_vacuum = SetOffsetVacuumLimit(false);
+	needs_offset_vacuum = SetOffsetVacuumLimit();
 
 	/*
 	 * If past the autovacuum force point, immediately signal an autovac
@@ -2279,11 +2299,11 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid)
 	 * another iteration immediately if there are still any old databases.
 	 */
 	if ((MultiXactIdPrecedes(multiVacLimit, curMulti) ||
-		 needs_offset_vacuum) && IsUnderPostmaster && !InRecovery)
+		 needs_offset_vacuum) && IsUnderPostmaster)
 		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
 
 	/* Give an immediate warning if past the wrap warn point */
-	if (MultiXactIdPrecedes(multiWarnLimit, curMulti) && !InRecovery)
+	if (MultiXactIdPrecedes(multiWarnLimit, curMulti))
 	{
 		char	   *oldest_datname;
 
@@ -2351,27 +2371,35 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti,
 }
 
 /*
- * Update our oldestMultiXactId value, but only if it's more recent than
- * what we had.  However, even if not, always update the oldest multixact
- * offset limit.
+ * During WAL replay update our oldestMultiXactId value, but only if it's more
+ * recent than what we had.
  */
 void
 MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB)
 {
+	Assert(InRecovery);
+
 	if (MultiXactIdPrecedes(MultiXactState->oldestMultiXactId, oldestMulti))
+	{
+		/*
+		 * If there has been a truncation on the master, detected via a moving
+		 * oldestMulti, without a corresponding truncation record we know that
+		 * the primary is still running an older version of postgres that
+		 * doesn't yet log multixact truncations. So perform truncation
+		 * ourselves.
+		 */
+		if (!MultiXactState->sawTruncationCkptCycle)
+		{
+			ereport(LOG,
+					(errmsg("performing legacy multixact truncation, upgrade master")));
+			TruncateMultiXact(oldestMulti, oldestMultiDB, true);
+		}
+
 		SetMultiXactIdLimit(oldestMulti, oldestMultiDB);
-}
+	}
 
-/*
- * Update the "safe truncation point".  This is the newest value of oldestMulti
- * that is known to be flushed as part of a checkpoint record.
- */
-void
-MultiXactSetSafeTruncate(MultiXactId safeTruncateMulti)
-{
-	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
-	MultiXactState->lastCheckpointedOldest = safeTruncateMulti;
-	LWLockRelease(MultiXactGenLock);
+	/* only looked at in the startup process, no lock necessary */
+	MultiXactState->sawTruncationCkptCycle = false;
 }
 
 /*
@@ -2527,126 +2555,50 @@ GetOldestMultiXactId(void)
 }
 
 /*
- * Based on the given oldest MultiXactId, determine what's the oldest member
- * offset and install the limit info in MultiXactState, where it can be used to
- * prevent overrun of old data in the members SLRU area.
- */
-static void
-DetermineSafeOldestOffset(MultiXactId oldestMXact)
-{
-	MultiXactOffset oldestOffset;
-	MultiXactOffset nextOffset;
-	MultiXactOffset offsetStopLimit;
-	MultiXactOffset prevOffsetStopLimit;
-	MultiXactId		nextMXact;
-	bool			finishedStartup;
-	bool			prevOffsetStopLimitKnown;
-
-	/* Fetch values from shared memory. */
-	LWLockAcquire(MultiXactGenLock, LW_SHARED);
-	finishedStartup = MultiXactState->finishedStartup;
-	nextMXact = MultiXactState->nextMXact;
-	nextOffset = MultiXactState->nextOffset;
-	prevOffsetStopLimit = MultiXactState->offsetStopLimit;
-	prevOffsetStopLimitKnown = MultiXactState->offsetStopLimitKnown;
-	LWLockRelease(MultiXactGenLock);
-
-	/* Don't worry about this until after we've started up. */
-	if (!finishedStartup)
-		return;
-
-	/*
-	 * Determine the offset of the oldest multixact.  Normally, we can read
-	 * the offset from the multixact itself, but there's an important special
-	 * case: if there are no multixacts in existence at all, oldestMXact
-	 * obviously can't point to one.  It will instead point to the multixact
-	 * ID that will be assigned the next time one is needed.
-	 *
-	 * NB: oldestMXact should be the oldest multixact that still exists in the
-	 * SLRU, unlike in SetOffsetVacuumLimit, where we do this same computation
-	 * based on the oldest value that might be referenced in a table.
-	 */
-	if (nextMXact == oldestMXact)
-		oldestOffset = nextOffset;
-	else
-	{
-		bool		oldestOffsetKnown;
-
-		oldestOffsetKnown = find_multixact_start(oldestMXact, &oldestOffset);
-		if (!oldestOffsetKnown)
-		{
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
-						oldestMXact)));
-			return;
-		}
-	}
-
-	/* move back to start of the corresponding segment */
-	offsetStopLimit = oldestOffset - (oldestOffset %
-		(MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
-	/* always leave one segment before the wraparound point */
-	offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
-
-	/* if nothing has changed, we're done */
-	if (prevOffsetStopLimitKnown && offsetStopLimit == prevOffsetStopLimit)
-		return;
-
-	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
-	MultiXactState->offsetStopLimit = offsetStopLimit;
-	MultiXactState->offsetStopLimitKnown = true;
-	LWLockRelease(MultiXactGenLock);
-
-	if (!prevOffsetStopLimitKnown && IsUnderPostmaster)
-		ereport(LOG,
-				(errmsg("MultiXact member wraparound protections are now enabled")));
-	ereport(DEBUG1,
-			(errmsg("MultiXact member stop limit is now %u based on MultiXact %u",
-				offsetStopLimit, oldestMXact)));
-}
-
-/*
  * Determine how aggressively we need to vacuum in order to prevent member
  * wraparound.
  *
- * To determine the oldest multixact ID, we look at oldestMultiXactId, not
- * lastCheckpointedOldest.  That's because vacuuming can't help with anything
- * older than oldestMultiXactId; anything older than that isn't referenced
- * by any table.  Offsets older than oldestMultiXactId but not as old as
- * lastCheckpointedOldest will go away after the next checkpoint.
+ * To do so determine what's the oldest member offset and install the limit
+ * info in MultiXactState, where it can be used to prevent overrun of old data
+ * in the members SLRU area.
  *
  * The return value is true if emergency autovacuum is required and false
  * otherwise.
  */
 static bool
-SetOffsetVacuumLimit(bool finish_setup)
+SetOffsetVacuumLimit(void)
 {
-	MultiXactId	oldestMultiXactId;
+	MultiXactId oldestMultiXactId;
 	MultiXactId nextMXact;
-	bool		finishedStartup;
-	MultiXactOffset oldestOffset = 0;		/* placate compiler */
+	MultiXactOffset oldestOffset = 0;	/* placate compiler */
+	MultiXactOffset prevOldestOffset;
 	MultiXactOffset nextOffset;
 	bool		oldestOffsetKnown = false;
-	MultiXactOffset prevOldestOffset;
 	bool		prevOldestOffsetKnown;
+	MultiXactOffset offsetStopLimit = 0;
+
+	/*
+	 * NB: Have to prevent concurrent truncation, we might otherwise try to
+	 * lookup a oldestMulti that's concurrently getting truncated away.
+	 */
+	LWLockAcquire(MultiXactTruncationLock, LW_SHARED);
 
 	/* Read relevant fields from shared memory. */
 	LWLockAcquire(MultiXactGenLock, LW_SHARED);
 	oldestMultiXactId = MultiXactState->oldestMultiXactId;
 	nextMXact = MultiXactState->nextMXact;
 	nextOffset = MultiXactState->nextOffset;
-	finishedStartup = MultiXactState->finishedStartup;
-	prevOldestOffset = MultiXactState->oldestOffset;
 	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
+	prevOldestOffset = MultiXactState->oldestOffset;
+	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
-	/* Don't do this until after any recovery is complete. */
-	if (!finishedStartup && !finish_setup)
-		return false;
-
 	/*
-	 * If no multixacts exist, then oldestMultiXactId will be the next
-	 * multixact that will be created, rather than an existing multixact.
+	 * Determine the offset of the oldest multixact.  Normally, we can read
+	 * the offset from the multixact itself, but there's an important special
+	 * case: if there are no multixacts in existence at all, oldestMXact
+	 * obviously can't point to one.  It will instead point to the multixact
+	 * ID that will be assigned the next time one is needed.
 	 */
 	if (oldestMultiXactId == nextMXact)
 	{
@@ -2667,34 +2619,48 @@ SetOffsetVacuumLimit(bool finish_setup)
 		 */
 		oldestOffsetKnown =
 			find_multixact_start(oldestMultiXactId, &oldestOffset);
+
+		if (oldestOffsetKnown)
+			ereport(DEBUG1,
+					(errmsg("oldest MultiXactId member is at offset %u",
+							oldestOffset)));
+		else
+			ereport(LOG,
+					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
+							oldestMultiXactId)));
 	}
 
+	LWLockRelease(MultiXactTruncationLock);
+
 	/*
-	 * Except when initializing the system for the first time, there's no
-	 * need to update anything if we don't know the oldest offset or if it
-	 * hasn't changed.
+	 * If we can, compute limits (and install them in MultiXactState) to prevent
+	 * overrun of old data in the members SLRU area. We can only do so if the
+	 * oldest offset is known though.
 	 */
-	if (finish_setup ||
-		(oldestOffsetKnown && !prevOldestOffsetKnown) ||
-		(oldestOffsetKnown && prevOldestOffset != oldestOffset))
+	if (oldestOffsetKnown)
 	{
-		/* Install the new limits. */
-		LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
-		MultiXactState->oldestOffset = oldestOffset;
-		MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-		MultiXactState->finishedStartup = true;
-		LWLockRelease(MultiXactGenLock);
+		/* move back to start of the corresponding segment */
+		offsetStopLimit = oldestOffset - (oldestOffset %
+					  (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
 
-		/* Log the info */
-		if (oldestOffsetKnown)
-			ereport(DEBUG1,
-					(errmsg("oldest MultiXactId member is at offset %u",
-						oldestOffset)));
-		else
-			ereport(DEBUG1,
-					(errmsg("oldest MultiXactId member offset unknown")));
+		/* always leave one segment before the wraparound point */
+		offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
+
+		if (!prevOldestOffsetKnown && IsUnderPostmaster)
+			ereport(LOG,
+					(errmsg("MultiXact member wraparound protections are now enabled")));
+		ereport(DEBUG1,
+		(errmsg("MultiXact member stop limit is now %u based on MultiXact %u",
+				offsetStopLimit, oldestMultiXactId)));
 	}
 
+	/* Install the computed values */
+	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
+	MultiXactState->oldestOffset = oldestOffset;
+	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
+	MultiXactState->offsetStopLimit = offsetStopLimit;
+	LWLockRelease(MultiXactGenLock);
+
 	/*
 	 * If we failed to get the oldest offset this time, but we have a value
 	 * from a previous pass through this function, assess the need for
@@ -2721,7 +2687,7 @@ SetOffsetVacuumLimit(bool finish_setup)
  * boundary point, hence the name.  The reason we don't want to use the regular
  * 2^31-modulo arithmetic here is that we want to be able to use the whole of
  * the 2^32-1 space here, allowing for more multixacts that would fit
- * otherwise.  See also SlruScanDirCbRemoveMembers.
+ * otherwise.
  */
 static bool
 MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
@@ -2767,6 +2733,9 @@ MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
  *
  * Returns false if the file containing the multi does not exist on disk.
  * Otherwise, returns true and sets *result to the starting member offset.
+ *
+ * This function does not prevent concurrent truncation, so if that's
+ * required, the caller has to protect against that.
  */
 static bool
 find_multixact_start(MultiXactId multi, MultiXactOffset *result)
@@ -2777,9 +2746,21 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	int			slotno;
 	MultiXactOffset *offptr;
 
+	/* XXX: Remove || Startup after WAL page magic bump */
+	Assert(MultiXactState->finishedStartup || AmStartupProcess());
+
 	pageno = MultiXactIdToOffsetPage(multi);
 	entryno = MultiXactIdToOffsetEntry(multi);
 
+	/*
+	 * We need to flush out dirty data, so PhysicalPageExists can work
+	 * correctly, but SimpleLruFlush() is a pretty big hammer for that.
+	 * Alternatively we could add an in-memory version of page exists, but
+	 * find_multixact_start is called infrequently, and it doesn't seem bad to
+	 * flush buffers to disk before truncation.
+	 */
+	SimpleLruFlush(MultiXactOffsetCtl, true);
+
 	if (!SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno))
 		return false;
 
@@ -2885,65 +2866,6 @@ MultiXactMemberFreezeThreshold(void)
 	return multixacts - victim_multixacts;
 }
 
-/*
- * SlruScanDirectory callback.
- *		This callback deletes segments that are outside the range determined by
- *		the given page numbers.
- *
- * Both range endpoints are exclusive (that is, segments containing any of
- * those pages are kept.)
- */
-typedef struct MembersLiveRange
-{
-	int			rangeStart;
-	int			rangeEnd;
-} MembersLiveRange;
-
-static bool
-SlruScanDirCbRemoveMembers(SlruCtl ctl, char *filename, int segpage,
-						   void *data)
-{
-	MembersLiveRange *range = (MembersLiveRange *) data;
-	MultiXactOffset nextOffset;
-
-	if ((segpage == range->rangeStart) ||
-		(segpage == range->rangeEnd))
-		return false;			/* easy case out */
-
-	/*
-	 * To ensure that no segment is spuriously removed, we must keep track of
-	 * new segments added since the start of the directory scan; to do this,
-	 * we update our end-of-range point as we run.
-	 *
-	 * As an optimization, we can skip looking at shared memory if we know for
-	 * certain that the current segment must be kept.  This is so because
-	 * nextOffset never decreases, and we never increase rangeStart during any
-	 * one run.
-	 */
-	if (!((range->rangeStart > range->rangeEnd &&
-		   segpage > range->rangeEnd && segpage < range->rangeStart) ||
-		  (range->rangeStart < range->rangeEnd &&
-		   (segpage < range->rangeStart || segpage > range->rangeEnd))))
-		return false;
-
-	/*
-	 * Update our idea of the end of the live range.
-	 */
-	LWLockAcquire(MultiXactGenLock, LW_SHARED);
-	nextOffset = MultiXactState->nextOffset;
-	LWLockRelease(MultiXactGenLock);
-	range->rangeEnd = MXOffsetToMemberPage(nextOffset);
-
-	/* Recheck the deletion condition.  If it still holds, perform deletion */
-	if ((range->rangeStart > range->rangeEnd &&
-		 segpage > range->rangeEnd && segpage < range->rangeStart) ||
-		(range->rangeStart < range->rangeEnd &&
-		 (segpage < range->rangeStart || segpage > range->rangeEnd)))
-		SlruDeleteSegment(ctl, filename);
-
-	return false;				/* keep going */
-}
-
 typedef struct mxtruncinfo
 {
 	int			earliestExistingPage;
@@ -2967,6 +2889,35 @@ SlruScanDirCbFindEarliest(SlruCtl ctl, char *filename, int segpage, void *data)
 	return false;				/* keep going */
 }
 
+
+/*
+ * Delete member segments [oldest, oldestAlive)
+ */
+static void
+PerformMembersTruncation(MultiXactOffset oldestOffset, MultiXactOffset oldestAliveOffset)
+{
+	const int	maxsegment = MXOffsetToMemberSegment(MaxMultiXactOffset);
+	int			startsegment = MXOffsetToMemberSegment(oldestOffset);
+	int			endsegment = MXOffsetToMemberSegment(oldestAliveOffset);
+	int			segment = startsegment;
+
+	/*
+	 * Delete all the segments but the last one. The last segment can still
+	 * contain, possibly partially, valid data.
+	 */
+	while (segment != endsegment)
+	{
+		elog(DEBUG2, "truncating multixact members segment %x", segment);
+		SlruDeleteSegment(MultiXactMemberCtl, segment);
+
+		/* move to next segment, handling wraparound correctly */
+		if (segment == maxsegment)
+			segment = 0;
+		else
+			segment += 1;
+	}
+}
+
 /*
  * Remove all MultiXactOffset and MultiXactMember segments before the oldest
  * ones still of interest.
@@ -2979,32 +2930,60 @@ SlruScanDirCbFindEarliest(SlruCtl ctl, char *filename, int segpage, void *data)
  * and kept up to date as new pages are zeroed.
  */
 void
-TruncateMultiXact(void)
+TruncateMultiXact(MultiXactId frozenMulti, Oid minmulti_datoid, bool in_recovery)
 {
 	MultiXactId oldestMXact;
 	MultiXactOffset oldestOffset;
-	MultiXactId		nextMXact;
-	MultiXactOffset	nextOffset;
+	MultiXactId nextMXact;
+	MultiXactOffset nextOffset;
+	MultiXactOffset oldestAliveOffset;
 	mxtruncinfo trunc;
 	MultiXactId earliest;
-	MembersLiveRange range;
 
-	Assert(AmCheckpointerProcess() || AmStartupProcess() ||
-		   !IsPostmasterEnvironment);
+	/*
+	 * Need to allow being called in recovery for backward compatibility, when
+	 * an updated standby replays WAL generated by a non-updated primary.
+	 */
+	Assert(in_recovery || !RecoveryInProgress());
+	Assert(!in_recovery || AmStartupProcess());
+	Assert(in_recovery || MultiXactState->finishedStartup);
+
+	/*
+	 * We can only allow one truncation to happen at once. Otherwise parts of
+	 * members might vanish while we're doing lookups or similar. There's no
+	 * need to have an interlock with creating new multis or such, since those
+	 * are constrained by the limits (which only grow, never shrink).
+	 */
+	LWLockAcquire(MultiXactTruncationLock, LW_EXCLUSIVE);
 
 	LWLockAcquire(MultiXactGenLock, LW_SHARED);
-	oldestMXact = MultiXactState->lastCheckpointedOldest;
 	nextMXact = MultiXactState->nextMXact;
 	nextOffset = MultiXactState->nextOffset;
+	oldestMXact = MultiXactState->oldestMultiXactId;
 	LWLockRelease(MultiXactGenLock);
 	Assert(MultiXactIdIsValid(oldestMXact));
 
 	/*
+	 * Make sure to only attempt truncation if there are values to truncate
+	 * away. In normal processing values shouldn't go backwards, but there are
+	 * some corner cases (due to bugs) where that's possible.
+	 */
+	if (MultiXactIdPrecedesOrEquals(frozenMulti, oldestMXact))
+	{
+		LWLockRelease(MultiXactTruncationLock);
+		return;
+	}
+
+	/*
 	 * Note we can't just plow ahead with the truncation; it's possible that
 	 * there are no segments to truncate, which is a problem because we are
 	 * going to attempt to read the offsets page to determine where to
 	 * truncate the members SLRU.  So we first scan the directory to determine
 	 * the earliest offsets page number that we can read without error.
+	 *
+	 * XXX: It's also possible that the page that oldestMXact is on has
+	 * already been truncated away, and we crashed before updating
+	 * oldestMXact.
 	 */
 	trunc.earliestExistingPage = -1;
 	SlruScanDirectory(MultiXactOffsetCtl, SlruScanDirCbFindEarliest, &trunc);
@@ -3012,19 +2991,10 @@ TruncateMultiXact(void)
 	if (earliest < FirstMultiXactId)
 		earliest = FirstMultiXactId;
 
-	/*
-	 * If there's nothing to remove, we can bail out early.
-	 *
-	 * Due to bugs in early releases of PostgreSQL 9.3.X and 9.4.X,
-	 * oldestMXact might point to a multixact that does not exist.
-	 * Autovacuum will eventually advance it to a value that does exist,
-	 * and we want to set a proper offsetStopLimit when that happens,
-	 * so call DetermineSafeOldestOffset here even if we're not actually
-	 * truncating.
-	 */
+	/* If there's nothing to remove, we can bail out early. */
 	if (MultiXactIdPrecedes(oldestMXact, earliest))
 	{
-		DetermineSafeOldestOffset(oldestMXact);
+		LWLockRelease(MultiXactTruncationLock);
 		return;
 	}
 
@@ -3042,35 +3012,89 @@ TruncateMultiXact(void)
 	{
 		ereport(LOG,
 				(errmsg("oldest MultiXact %u not found, earliest MultiXact %u, skipping truncation",
-					oldestMXact, earliest)));
+						oldestMXact, earliest)));
+		LWLockRelease(MultiXactTruncationLock);
 		return;
 	}
 
 	/*
-	 * To truncate MultiXactMembers, we need to figure out the active page
-	 * range and delete all files outside that range.  The start point is the
-	 * start of the segment containing the oldest offset; an end point of the
-	 * segment containing the next offset to use is enough.  The end point is
-	 * updated as MultiXactMember gets extended concurrently, elsewhere.
+	 * Secondly compute up to where to truncate. Lookup the corresponding
+	 * member offset for frozenMulti for that.
 	 */
-	range.rangeStart = MXOffsetToMemberPage(oldestOffset);
-	range.rangeStart -= range.rangeStart % SLRU_PAGES_PER_SEGMENT;
+	if (frozenMulti == nextMXact)
+		oldestAliveOffset = nextOffset; /* there are NO MultiXacts */
+	else if (!find_multixact_start(frozenMulti, &oldestAliveOffset))
+	{
+		ereport(LOG,
+				(errmsg("supposedly still alive MultiXact %u not found, skipping truncation",
+						frozenMulti)));
+		LWLockRelease(MultiXactTruncationLock);
+		return;
+	}
 
-	range.rangeEnd = MXOffsetToMemberPage(nextOffset);
+	elog(DEBUG1, "performing multixact truncation: offsets [%u, %u),  members [%u, %u), member segments [%x to %x)",
+		 oldestMXact, frozenMulti,
+		 oldestOffset,
+		 oldestAliveOffset,
+		 MXOffsetToMemberSegment(oldestOffset),
+		 MXOffsetToMemberSegment(oldestAliveOffset));
 
-	SlruScanDirectory(MultiXactMemberCtl, SlruScanDirCbRemoveMembers, &range);
+	/*
+	 * Do truncation, and the WAL logging of the truncation, in a critical
+	 * section. That way offsets/members cannot get out of sync anymore, i.e.
+	 * once consistent the oldestMulti will always exist in members, even if
+	 * we crashed in the wrong moment.
+	 */
+	START_CRIT_SECTION();
 
-	/* Now we can truncate MultiXactOffset */
-	SimpleLruTruncate(MultiXactOffsetCtl,
-					  MultiXactIdToOffsetPage(oldestMXact));
+	/*
+	 * Prevent checkpoints from being scheduled concurrently. This is critical
+	 * because otherwise a truncation record might not be replayed after a
+	 * crash/basebackup, even though the state of the data directory would
+	 * require it.  It's not possible, and not needed, to do this during
+	 * recovery, when performing an old-style truncation, though, as the
+	 * startup process doesn't have a PGXACT entry.
+	 */
+	if (!in_recovery)
+	{
+		Assert(!MyPgXact->delayChkpt);
+		MyPgXact->delayChkpt = true;
+	}
 
+	/*
+	 * WAL-log the truncation - this has to be flushed before the truncation is
+	 * actually performed, for the reasons explained in TruncateCLOG().
+	 */
+	if (!in_recovery)
+		WriteMTruncateXlogRec(minmulti_datoid,
+							  oldestMXact, frozenMulti,
+							  oldestOffset, oldestAliveOffset);
 
 	/*
-	 * Now, and only now, we can advance the stop point for multixact members.
-	 * If we did it any sooner, the segments we deleted above might already
-	 * have been overwritten with new members.  That would be bad.
+	 * Update in-memory limits before performing the truncation, while inside
+	 * the critical section: Have to do it before truncation, to prevent
+	 * concurrent lookups of those values. Has to be inside the critical
+	 * section as otherwise a future call to this function would error out,
+	 * while looking up the oldest member in offsets, if our caller crashes
+	 * before updating the limits.
 	 */
-	DetermineSafeOldestOffset(oldestMXact);
+	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
+	MultiXactState->oldestMultiXactId = frozenMulti;
+	MultiXactState->oldestMultiXactDB = minmulti_datoid;
+	LWLockRelease(MultiXactGenLock);
+
+	/* First truncate members */
+	PerformMembersTruncation(oldestOffset, oldestAliveOffset);
+
+	/* Then offsets */
+	SimpleLruTruncate(MultiXactOffsetCtl,
+					  MultiXactIdToOffsetPage(frozenMulti));
+
+	if (!in_recovery)
+		MyPgXact->delayChkpt = false;
+
+	END_CRIT_SECTION();
+	LWLockRelease(MultiXactTruncationLock);
 }
 
 /*
@@ -3167,6 +3191,34 @@ WriteMZeroPageXlogRec(int pageno, uint8 info)
 }
 
 /*
+ * Write a TRUNCATE xlog record
+ *
+ * We must flush the xlog record to disk before returning --- see notes
+ * in TruncateMultiXact().
+ */
+static void
+WriteMTruncateXlogRec(Oid oldestMultiDB,
+					  MultiXactId startTruncOff, MultiXactId endTruncOff,
+				MultiXactOffset startTruncMemb, MultiXactOffset endTruncMemb)
+{
+	XLogRecPtr	recptr;
+	xl_multixact_truncate xlrec;
+
+	xlrec.oldestMultiDB = oldestMultiDB;
+
+	xlrec.startTruncOff = startTruncOff;
+	xlrec.endTruncOff = endTruncOff;
+
+	xlrec.startTruncMemb = startTruncMemb;
+	xlrec.endTruncMemb = endTruncMemb;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) (&xlrec), SizeOfMultiXactTruncate);
+	recptr = XLogInsert(RM_MULTIXACT_ID, XLOG_MULTIXACT_TRUNCATE_ID);
+	XLogFlush(recptr);
+}
+
+/*
  * MULTIXACT resource manager's routines
  */
 void
@@ -3248,6 +3300,47 @@ multixact_redo(XLogReaderState *record)
 			LWLockRelease(XidGenLock);
 		}
 	}
+	else if (info == XLOG_MULTIXACT_TRUNCATE_ID)
+	{
+		xl_multixact_truncate xlrec;
+		int			pageno;
+
+		memcpy(&xlrec, XLogRecGetData(record),
+			   SizeOfMultiXactTruncate);
+
+		pageno = MultiXactIdToOffsetPage(xlrec.endTruncOff);
+
+		elog(LOG, "replaying multixact truncation start: %u, %u, %x to %x",
+			 xlrec.startTruncMemb,
+			 xlrec.endTruncMemb,
+			 MXOffsetToMemberSegment(xlrec.startTruncMemb),
+			 MXOffsetToMemberSegment(xlrec.endTruncMemb));
+
+		/*
+		 * Advance the horizon values, so they're current at the end of
+		 * recovery.
+		 */
+		SetMultiXactIdLimit(xlrec.endTruncOff, xlrec.oldestMultiDB);
+
+		/* should not be required, but more than cheap enough */
+		LWLockAcquire(MultiXactTruncationLock, LW_EXCLUSIVE);
+
+		PerformMembersTruncation(xlrec.startTruncMemb, xlrec.endTruncMemb);
+
+		/*
+		 * During XLOG replay, latest_page_number isn't necessarily set up
+		 * yet; insert a suitable value to bypass the sanity test in
+		 * SimpleLruTruncate.
+		 */
+		MultiXactOffsetCtl->shared->latest_page_number = pageno;
+		SimpleLruTruncate(MultiXactOffsetCtl,
+						  MultiXactIdToOffsetPage(xlrec.endTruncOff));
+
+		LWLockRelease(MultiXactTruncationLock);
+
+		/* only looked at in the startup process, no lock necessary */
+		MultiXactState->sawTruncationCkptCycle = true;
+	}
 	else
 		elog(PANIC, "multixact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 5fcea11..90c7cf5 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -134,6 +134,7 @@ static int	SlruSelectLRUPage(SlruCtl ctl, int pageno);
 
 static bool SlruScanDirCbDeleteCutoff(SlruCtl ctl, char *filename,
 						  int segpage, void *data);
+static void SlruInternalDeleteSegment(SlruCtl ctl, char *filename);
 
 /*
  * Initialization of shared memory
@@ -1075,7 +1076,7 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
  * Flush dirty pages to disk during checkpoint or database shutdown
  */
 void
-SimpleLruFlush(SlruCtl ctl, bool checkpoint)
+SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
 {
 	SlruShared	shared = ctl->shared;
 	SlruFlushData fdata;
@@ -1096,11 +1097,11 @@ SimpleLruFlush(SlruCtl ctl, bool checkpoint)
 		SlruInternalWritePage(ctl, slotno, &fdata);
 
 		/*
-		 * When called during a checkpoint, we cannot assert that the slot is
-		 * clean now, since another process might have re-dirtied it already.
-		 * That's okay.
+		 * In some places (e.g. checkpoints), we cannot assert that the slot
+		 * is clean now, since another process might have re-dirtied it
+		 * already.  That's okay.
 		 */
-		Assert(checkpoint ||
+		Assert(allow_redirtied ||
 			   shared->page_status[slotno] == SLRU_PAGE_EMPTY ||
 			   (shared->page_status[slotno] == SLRU_PAGE_VALID &&
 				!shared->page_dirty[slotno]));
@@ -1210,8 +1211,14 @@ restart:;
 	(void) SlruScanDirectory(ctl, SlruScanDirCbDeleteCutoff, &cutoffPage);
 }
 
-void
-SlruDeleteSegment(SlruCtl ctl, char *filename)
+/*
+ * Delete an individual SLRU segment, identified by the filename.
+ *
+ * NB: This does not touch the SLRU buffers themselves, callers have to ensure
+ * they either can't yet contain anything, or have already been cleaned out.
+ */
+static void
+SlruInternalDeleteSegment(SlruCtl ctl, char *filename)
 {
 	char		path[MAXPGPATH];
 
@@ -1222,6 +1229,64 @@ SlruDeleteSegment(SlruCtl ctl, char *filename)
 }
 
 /*
+ * Delete an individual SLRU segment, identified by the segment number.
+ */
+void
+SlruDeleteSegment(SlruCtl ctl, int segno)
+{
+	SlruShared	shared = ctl->shared;
+	int			slotno;
+	char		path[MAXPGPATH];
+	bool		did_write;
+
+	/* Clean out any possibly existing references to the segment. */
+	LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+restart:
+	did_write = false;
+	for (slotno = 0; slotno < shared->num_slots; slotno++)
+	{
+		int			pagesegno = shared->page_number[slotno] / SLRU_PAGES_PER_SEGMENT;
+
+		if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
+			continue;
+
+		/* not the segment we're looking for */
+		if (pagesegno != segno)
+			continue;
+
+		/* If page is clean, just change state to EMPTY (expected case). */
+		if (shared->page_status[slotno] == SLRU_PAGE_VALID &&
+			!shared->page_dirty[slotno])
+		{
+			shared->page_status[slotno] = SLRU_PAGE_EMPTY;
+			continue;
+		}
+
+		/* Same logic as SimpleLruTruncate() */
+		if (shared->page_status[slotno] == SLRU_PAGE_VALID)
+			SlruInternalWritePage(ctl, slotno, NULL);
+		else
+			SimpleLruWaitIO(ctl, slotno);
+
+		did_write = true;
+	}
+
+	/*
+	 * Be extra careful and re-check. The IO functions release the control
+	 * lock, so new pages could have been read in.
+	 */
+	if (did_write)
+		goto restart;
+
+	snprintf(path, MAXPGPATH, "%s/%04X", ctl->Dir, segno);
+	ereport(DEBUG2,
+			(errmsg("removing file \"%s\"", path)));
+	unlink(path);
+
+	LWLockRelease(shared->ControlLock);
+}
+
+/*
  * SlruScanDirectory callback
  *		This callback reports true if there's any segment prior to the one
  *		containing the page passed as "data".
@@ -1249,7 +1314,7 @@ SlruScanDirCbDeleteCutoff(SlruCtl ctl, char *filename, int segpage, void *data)
 	int			cutoffPage = *(int *) data;
 
 	if (ctl->PagePrecedes(segpage, cutoffPage))
-		SlruDeleteSegment(ctl, filename);
+		SlruInternalDeleteSegment(ctl, filename);
 
 	return false;				/* keep going */
 }
@@ -1261,7 +1326,7 @@ SlruScanDirCbDeleteCutoff(SlruCtl ctl, char *filename, int segpage, void *data)
 bool
 SlruScanDirCbDeleteAll(SlruCtl ctl, char *filename, int segpage, void *data)
 {
-	SlruDeleteSegment(ctl, filename);
+	SlruInternalDeleteSegment(ctl, filename);
 
 	return false;				/* keep going */
 }
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4e37ad3..541cd3b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6287,7 +6287,6 @@ StartupXLOG(void)
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
 	SetCommitTsLimit(checkPoint.oldestCommitTs,
 					 checkPoint.newestCommitTs);
-	MultiXactSetSafeTruncate(checkPoint.oldestMulti);
 	XLogCtl->ckptXidEpoch = checkPoint.nextXidEpoch;
 	XLogCtl->ckptXid = checkPoint.nextXid;
 
@@ -6304,10 +6303,8 @@ StartupXLOG(void)
 	StartupReorderBuffer();
 
 	/*
-	 * Startup MultiXact.  We need to do this early for two reasons: one is
-	 * that we might try to access multixacts when we do tuple freezing, and
-	 * the other is we need its state initialized because we attempt
-	 * truncation during restartpoints.
+	 * Startup MultiXact, we need to do this early, to be able to replay
+	 * truncations.
 	 */
 	StartupMultiXact();
 
@@ -8465,12 +8462,6 @@ CreateCheckPoint(int flags)
 	END_CRIT_SECTION();
 
 	/*
-	 * Now that the checkpoint is safely on disk, we can update the point to
-	 * which multixact can be truncated.
-	 */
-	MultiXactSetSafeTruncate(checkPoint.oldestMulti);
-
-	/*
 	 * Let smgr do post-checkpoint cleanup (eg, deleting old files).
 	 */
 	smgrpostckpt();
@@ -8509,11 +8500,6 @@ CreateCheckPoint(int flags)
 	if (!RecoveryInProgress())
 		TruncateSUBTRANS(GetOldestXmin(NULL, false));
 
-	/*
-	 * Truncate pg_multixact too.
-	 */
-	TruncateMultiXact();
-
 	/* Real work is done, but log and update stats before releasing lock. */
 	LogCheckpointEnd(false);
 
@@ -8844,21 +8830,6 @@ CreateRestartPoint(int flags)
 	}
 
 	/*
-	 * Due to a historical accident multixact truncations are not WAL-logged,
-	 * but just performed everytime the mxact horizon is increased. So, unless
-	 * we explicitly execute truncations on a standby it will never clean out
-	 * /pg_multixact which obviously is bad, both because it uses space and
-	 * because we can wrap around into pre-existing data...
-	 *
-	 * We can only do the truncation here, after the UpdateControlFile()
-	 * above, because we've now safely established a restart point.  That
-	 * guarantees we will not need to access those multis.
-	 *
-	 * It's probably worth improving this.
-	 */
-	TruncateMultiXact();
-
-	/*
 	 * Truncate pg_subtrans if possible.  We can throw away all data before
 	 * the oldest XMIN of any running transaction.  No future transaction will
 	 * attempt to reference any pg_subtrans entry older than that (see Asserts
@@ -9218,9 +9189,14 @@ xlog_redo(XLogReaderState *record)
 		LWLockRelease(OidGenLock);
 		MultiXactSetNextMXact(checkPoint.nextMulti,
 							  checkPoint.nextMultiOffset);
+
+		/*
+		 * NB: This may perform multixact truncation when replaying WAL
+		 * generated by an older primary.
+		 */
+		MultiXactAdvanceOldest(checkPoint.oldestMulti,
+							   checkPoint.oldestMultiDB);
 		SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
-		SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
-		MultiXactSetSafeTruncate(checkPoint.oldestMulti);
 
 		/*
 		 * If we see a shutdown checkpoint while waiting for an end-of-backup
@@ -9310,14 +9286,17 @@ xlog_redo(XLogReaderState *record)
 		LWLockRelease(OidGenLock);
 		MultiXactAdvanceNextMXact(checkPoint.nextMulti,
 								  checkPoint.nextMultiOffset);
+
+		/*
+		 * NB: This may perform multixact truncation when replaying WAL
+		 * generated by an older primary.
+		 */
+		MultiXactAdvanceOldest(checkPoint.oldestMulti,
+							   checkPoint.oldestMultiDB);
 		if (TransactionIdPrecedes(ShmemVariableCache->oldestXid,
 								  checkPoint.oldestXid))
 			SetTransactionIdLimit(checkPoint.oldestXid,
 								  checkPoint.oldestXidDB);
-		MultiXactAdvanceOldest(checkPoint.oldestMulti,
-							   checkPoint.oldestMultiDB);
-		MultiXactSetSafeTruncate(checkPoint.oldestMulti);
-
 		/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
 		ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
 		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index baf66f1..c1433e9 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1134,11 +1134,11 @@ vac_truncate_clog(TransactionId frozenXID,
 		return;
 
 	/*
-	 * Truncate CLOG and CommitTs to the oldest computed value. Note we don't
-	 * truncate multixacts; that will be done by the next checkpoint.
+	 * Truncate CLOG, multixact and CommitTs to the oldest computed value.
 	 */
 	TruncateCLOG(frozenXID);
 	TruncateCommitTs(frozenXID, true);
+	TruncateMultiXact(minMulti, minmulti_datoid, false);
 
 	/*
 	 * Update the wrap limit for GetNewTransactionId and creation of new
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 6213f8a..7817546 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -71,6 +71,7 @@ typedef struct MultiXactMember
 #define XLOG_MULTIXACT_ZERO_OFF_PAGE	0x00
 #define XLOG_MULTIXACT_ZERO_MEM_PAGE	0x10
 #define XLOG_MULTIXACT_CREATE_ID		0x20
+#define XLOG_MULTIXACT_TRUNCATE_ID		0x30
 
 typedef struct xl_multixact_create
 {
@@ -82,6 +83,20 @@ typedef struct xl_multixact_create
 
 #define SizeOfMultiXactCreate (offsetof(xl_multixact_create, members))
 
+typedef struct xl_multixact_truncate
+{
+	Oid			oldestMultiDB;
+
+	/* to-be-truncated range of multixact offsets */
+	MultiXactId startTruncOff;	/* just for completeness' sake */
+	MultiXactId endTruncOff;
+
+	/* to-be-truncated range of multixact members */
+	MultiXactOffset startTruncMemb;
+	MultiXactOffset endTruncMemb;
+} xl_multixact_truncate;
+#define SizeOfMultiXactTruncate (sizeof(xl_multixact_truncate))
+
 
 extern MultiXactId MultiXactIdCreate(TransactionId xid1,
 				  MultiXactStatus status1, TransactionId xid2,
@@ -120,13 +135,12 @@ extern void MultiXactGetCheckptMulti(bool is_shutdown,
 						 Oid *oldestMultiDB);
 extern void CheckPointMultiXact(void);
 extern MultiXactId GetOldestMultiXactId(void);
-extern void TruncateMultiXact(void);
+extern void TruncateMultiXact(MultiXactId oldestMulti, Oid oldestMultiDB, bool inRecovery);
 extern void MultiXactSetNextMXact(MultiXactId nextMulti,
 					  MultiXactOffset nextMultiOffset);
 extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
 						  MultiXactOffset minMultiOffset);
 extern void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB);
-extern void MultiXactSetSafeTruncate(MultiXactId safeTruncateMulti);
 extern int	MultiXactMemberFreezeThreshold(void);
 
 extern void multixact_twophase_recover(TransactionId xid, uint16 info,
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index 9c7f019..f60e75b 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -143,14 +143,14 @@ extern int SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
 extern int SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno,
 						   TransactionId xid);
 extern void SimpleLruWritePage(SlruCtl ctl, int slotno);
-extern void SimpleLruFlush(SlruCtl ctl, bool checkpoint);
+extern void SimpleLruFlush(SlruCtl ctl, bool allow_redirtied);
 extern void SimpleLruTruncate(SlruCtl ctl, int cutoffPage);
 extern bool SimpleLruDoesPhysicalPageExist(SlruCtl ctl, int pageno);
 
 typedef bool (*SlruScanCallback) (SlruCtl ctl, char *filename, int segpage,
 											  void *data);
 extern bool SlruScanDirectory(SlruCtl ctl, SlruScanCallback callback, void *data);
-extern void SlruDeleteSegment(SlruCtl ctl, char *filename);
+extern void SlruDeleteSegment(SlruCtl ctl, int segno);
 
 /* SlruScanDirectory public callbacks */
 extern bool SlruScanDirCbReportPresence(SlruCtl ctl, char *filename,
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index cff3b99..6f0688c 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -135,8 +135,9 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
 #define CommitTsControlLock			(&MainLWLockArray[38].lock)
 #define CommitTsLock				(&MainLWLockArray[39].lock)
 #define ReplicationOriginLock		(&MainLWLockArray[40].lock)
+#define MultiXactTruncationLock		(&MainLWLockArray[41].lock)
 
-#define NUM_INDIVIDUAL_LWLOCKS		41
+#define NUM_INDIVIDUAL_LWLOCKS		42
 
 /*
  * It's a bit odd to declare NUM_BUFFER_PARTITIONS and NUM_LOCK_PARTITIONS
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 4b650d1..23ba334 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2749,6 +2749,7 @@ xl_invalid_page
 xl_invalid_page_key
 xl_multi_insert_tuple
 xl_multixact_create
+xl_multixact_truncate
 xl_parameter_change
 xl_relmap_update
 xl_replorigin_drop
-- 
2.4.0.rc2.1.g3d6bc9a
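
For readers following the patch, the wraparound-aware loop in PerformMembersTruncation() can be sketched in isolation. This is a simplified illustration, not code from the patch: sketch_walk_segments() and its parameters are hypothetical names, it records segment numbers instead of deleting files, and the maximum segment number is passed in as an assumption rather than derived from MaxMultiXactOffset.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: visit every segment in the half-open range
 * [start, end), wrapping from maxsegment back to 0.  Instead of deleting
 * anything, it stores the visited segment numbers in out[] (up to cap
 * entries) and returns how many segments were visited in total.
 */
static size_t
sketch_walk_segments(int start, int end, int maxsegment,
					 int *out, size_t cap)
{
	size_t		visited = 0;
	int			segment = start;

	/* both endpoints must be valid segment numbers, or the loop never ends */
	assert(start >= 0 && start <= maxsegment);
	assert(end >= 0 && end <= maxsegment);

	while (segment != end)
	{
		if (visited < cap)
			out[visited] = segment;
		visited++;

		/* move to next segment, wrapping around at the maximum */
		if (segment == maxsegment)
			segment = 0;
		else
			segment += 1;
	}

	return visited;
}
```

With a hypothetical maximum segment of 7, walking from segment 6 to segment 2 visits 6, 7, 0 and 1. The end segment itself is left alone, which matches the half-open [oldest, oldestAlive) range in the patch's comment.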

#7Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#6)
Re: Rework the way multixact truncations work

On Mon, Jun 29, 2015 at 3:48 PM, Andres Freund <andres@anarazel.de> wrote:

New version attached.

0002 looks good, but the commit message should perhaps mention the
comment fix. Or commit that separately.

Will look at 0003 next.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#8Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#7)
Re: Rework the way multixact truncations work

On Thu, Jul 2, 2015 at 11:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Will look at 0003 next.

+ appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",

I don't think we typically use this style for notating intervals.

         case XLOG_MULTIXACT_CREATE_ID:
             id = "CREATE_ID";
             break;
+        case XLOG_MULTIXACT_TRUNCATE_ID:
+            id = "TRUNCATE";
+            break;

If XLOG_MULTIXACT_CREATE_ID -> "CREATE_ID", then why not
XLOG_MULTIXACT_TRUNCATE_ID -> "TRUNCATE_ID"?

+ * too old to general truncation records.

s/general/generate/

+ MultiXactId oldestMXactDB;

Data type should be OID.

+     * Recompute limits as we are now fully started, we now can correctly
+     * compute how far a members wraparound is away.

s/,/:/ or something. This isn't particularly good English as written.

+     * Computing the actual limits is only possible once the data directory is
+     * in a consistent state. There's no need to compute the limits while
+     * still replaying WAL as no new multis can be created anyway. So we'll
+     * only do further checks after TrimMultiXact() has been called.

Multis can be and are created during replay. What this should really
say is that we have no choice about whether to create them or not: we
just have to replay whatever's there.

+ (errmsg("performing legacy multixact truncation,
upgrade master")));

This message needs work. I'm not sure exactly what it should say, but
I'm pretty sure that's not clear enough.

I seriously, seriously doubt that it is a good idea to perform the
legacy truncation from MultiXactAdvanceOldest() rather than
TruncateMultiXact(). The checkpoint hasn't really happened at that
point yet; you might truncate away stuff, then crash before the
checkpoint is complete, and then when you restart recovery, you've got
trouble. I think you should be very, very cautious about rejiggering
the order of operations here. The current situation is not good, but
casually rejiggering it can make things much worse.

-     * If no multixacts exist, then oldestMultiXactId will be the next
-     * multixact that will be created, rather than an existing multixact.
+     * Determine the offset of the oldest multixact.  Normally, we can read
+     * the offset from the multixact itself, but there's an important special
+     * case: if there are no multixacts in existence at all, oldestMXact
+     * obviously can't point to one.  It will instead point to the multixact
+     * ID that will be assigned the next time one is needed.

There's no need to change this; it means the same thing either way.

Generally, I think you've weakened the logic in SetOffsetVacuumLimit()
appreciably here. The existing code is careful never to set
oldestOffsetKnown false when it was previously true. Your rewrite
removes that property. Generally, I see no need for this function to
be overhauled to the extent that you have, and would suggest reverting
the changes that aren't absolutely required.
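The property described above (once the oldest offset is known, a later failed lookup must not make it unknown again) can be illustrated with a small stand-alone C sketch. This is not the code from multixact.c: the struct `OffsetLimitState` and the function `update_oldest_offset` are hypothetical names invented for this example, and locking is omitted entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant shared-memory fields. */
typedef struct OffsetLimitState
{
    uint32_t    oldestOffset;
    bool        oldestOffsetKnown;
} OffsetLimitState;

/*
 * Install the result of a fresh lookup.  'found' says whether the lookup
 * succeeded.  If it failed but a value was already known, keep the old
 * value instead of regressing to "unknown".
 */
void
update_oldest_offset(OffsetLimitState *state, bool found, uint32_t newOffset)
{
    assert(state != NULL);

    if (found)
    {
        state->oldestOffset = newOffset;
        state->oldestOffsetKnown = true;
    }
    /* else: leave any previously known value untouched */
}
```

With this shape, a caller that starts with an unknown offset, then learns a value, then suffers a failed lookup still ends up with the earlier value and `oldestOffsetKnown` set.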

I don't particularly like the fact that find_multixact_start() calls
SimpleLruFlush(). I think that's really a POLA violation: you don't
expect that a function that looks like a simple inquiry is going to go
do a bunch of unrelated I/O in the background. If somebody called
find_multixact_start() with any frequency, you'd be sad. You're just
doing it this way because of the context *in which you expect
find_multixact_start* to be called, which does not seem very
future-proof. I prefer Thomas's approach.

If TruncateMultiXact() fails to acquire MultiXactTruncationLock right
away, does it need to wait, or could it ConditionalAcquire and bail
out if the lock isn't obtained?
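As a rough illustration of the "try the lock and bail out" idea, here is a stand-alone C sketch. It does not use PostgreSQL's LWLock API; a C11 `atomic_flag` stands in for MultiXactTruncationLock, and `truncation_lock`, `truncated_up_to` and `try_truncate` are hypothetical names. The plain `>` comparison also ignores ID wraparound, which real code could not do.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins: a try-lock and the truncation horizon it guards. */
static atomic_flag truncation_lock = ATOMIC_FLAG_INIT;
static uint32_t truncated_up_to = 0;

/*
 * Try to truncate up to new_oldest.  Returns false without waiting if
 * somebody else currently holds the lock; the caller simply skips this
 * round and relies on a later call to catch up.
 */
bool
try_truncate(uint32_t new_oldest)
{
    /* 0 plays the role of an invalid ID in this sketch */
    assert(new_oldest != 0);

    /* test_and_set returns the previous state: true means already held */
    if (atomic_flag_test_and_set(&truncation_lock))
        return false;

    /* simplified: a plain comparison that ignores wraparound */
    if (new_oldest > truncated_up_to)
        truncated_up_to = new_oldest;

    atomic_flag_clear(&truncation_lock);
    return true;
}
```

The trade-off is that a skipped caller's horizon is lost until some later truncation runs with an equal or newer value.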

+     * Make sure to only attempt truncation if there's values to truncate
+     * away. In normal processing values shouldn't go backwards, but there's
+     * some corner cases (due to bugs) where that's possible.

I think this comment should be more detailed. Is that talking about
the same thing as this comment:

- * Due to bugs in early releases of PostgreSQL 9.3.X and 9.4.X,
- * oldestMXact might point to a multixact that does not exist.
- * Autovacuum will eventually advance it to a value that does exist,
- * and we want to set a proper offsetStopLimit when that happens,
- * so call DetermineSafeOldestOffset here even if we're not actually
- * truncating.

This comment seems to be saying there's a race condition:

+     * XXX: It's also possible that the page that oldestMXact is on has
+     * already been truncated away, and we crashed before updating
+     * oldestMXact.

But why is that an XXX? I think this is just a case of recovery
needing to tolerate redo of an action already redone.

I'm not convinced that it's a good idea to remove
lastCheckpointedOldest and replace it with nothing. It seems like a
very good idea to have two separate pieces of state in shared memory:

- The oldest offset that we think anyone might need to access to make
a visibility check for a tuple.
- The oldest offset that we still have on disk.

The latter would now need to be called something else, not
lastCheckpointedOldest, but I think it's still good to have it.
Otherwise, I don't see how you protect against the on-disk state
wrapping around before you finish truncating, and then maybe
truncation eats something that was busy getting reused. We might be
kind of hosed in that situation anyway, because TruncateMultiXact()
and some other places assume that circular comparisons will return
sensible values. But that could be fixed, and probably should be
fixed eventually.
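The remark about circular comparisons can be made concrete with a minimal C sketch of modulo-2^32 ordering. The helper names `circular_precedes` and `advance_oldest` are hypothetical and exist only for this illustration; the cast of an out-of-range value to `int32_t` is implementation-defined in C, although it behaves as two's-complement wraparound on mainstream compilers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * True if a logically precedes b when IDs wrap around at 2^32.
 * The answer is only meaningful while a and b are less than 2^31 apart.
 */
bool
circular_precedes(uint32_t a, uint32_t b)
{
    int32_t     diff = (int32_t) (a - b);

    return diff < 0;
}

/* Move *oldest forward to candidate, never backwards. */
void
advance_oldest(uint32_t *oldest, uint32_t candidate)
{
    assert(oldest != NULL);

    if (circular_precedes(*oldest, candidate))
        *oldest = candidate;
}
```

Once two values drift more than 2^31 apart the comparison silently flips, which is why code that relies on it needs a limit that stops ID consumption before that distance is reached.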

+ (errmsg("supposedly still alive MultiXact %u not
found, skipping truncation",

Maybe "cannot truncate MultiXact %u because it does not exist on disk,
skipping truncation"?

I think "frozenMulti" is a slightly confusing variable name and
deserves a comment. AIUI, that's the oldest multiXact we need to
keep. So it's actually the oldest multi that is NOT guaranteed to be
frozen. minMulti might be a better variable name, but a comment is
justified either way.

+     * Update in-memory limits before performing the truncation, while inside
+     * the critical section: Have to do it before truncation, to prevent
+     * concurrent lookups of those values. Has to be inside the critical
+     * section asotherwise a future call to this function would error out,
+     * while looking up the oldest member in offsets, if our caller crashes
+     * before updating the limits.

Missing space: asotherwise.

Who else might be concurrently looking up those values? Nobody else
can truncate while we're truncating, because we hold
MultiXactTruncationLock. And nobody else should be getting here from
looking up tuples, because if they are, we truncated too soon.

-     * Startup MultiXact.  We need to do this early for two reasons: one is
-     * that we might try to access multixacts when we do tuple freezing, and
-     * the other is we need its state initialized because we attempt
-     * truncation during restartpoints.
+     * Startup MultiXact, we need to do this early, to be able to replay
+     * truncations.

The period after "Startup MultiXact" was more correct than the comma
you've replaced it with.

Phew. That's all I see on a first read-through, but I've only spent a
couple of hours on this, so I might easily have missed some things.
But let me stop here, hit send, and see what you think of these
comments.

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#9Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#8)
Re: Rework the way multixact truncations work

On 2015-07-02 13:58:45 -0400, Robert Haas wrote:

On Thu, Jul 2, 2015 at 11:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Will look at 0003 next.

+ appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",

I don't think we typically use this style for notating intervals.

I don't think we really have a very consistent style for xlog messages -
this seems to describe the meaning accurately?

[several good points]

+ (errmsg("performing legacy multixact truncation,
upgrade master")));

This message needs work. I'm not sure exactly what it should say, but
I'm pretty sure that's not clear enough.

I seriously, seriously doubt that it is a good idea to perform the
legacy truncation from MultiXactAdvanceOldest() rather than
TruncateMultiXact().

But where should TruncateMultiXact() be called from? I mean, we could
move the logic from inside MultiXactAdvanceOldest() to some special case
in the replay routine, but what'd be the advantage?

The checkpoint hasn't really happened at that point yet; you might
truncate away stuff, then crash before the checkpoint is complete, and
then when you restart recovery, you've got trouble.

We're only talking about restartpoints here, right? And I don't see the
problem - we don't read the slru anymore until the end of recovery, and
the end of recovery can't happen before reaching the minimum recovery
location?

I think you should
be very, very cautious about rejiggering the order of operations here.
The current situation is not good, but casually rejiggering it can
make things much worse.

The current placement - as part of the restartpoint - is utterly broken
and unpredictable. There'll frequently be no restartpoints performed at
all (due to different checkpoint segments, slow writeout, or pending
actions). Because there's no careful timing of when this happens it's
much harder to understand the exact state in which the truncation
happens - I think moving it to a specific location during replay makes
things considerably easier.

Generally, I think you've weakened the logic in SetOffsetVacuumLimit()
appreciably here. The existing code is careful never to set
oldestOffsetKnown false when it was previously true. Your rewrite
removes that property. Generally, I see no need for this function to
be overhauled to the extent that you have, and would suggest reverting
the changes that aren't absolutely required.

A lot of that has to do with the fact that the whole business of
truncations happening during checkpoints is gone, so the split with
DetermineSafeOldestOffset() doesn't make sense anymore.

That oldestOffsetKnown is unset is wrong - the if (prevOldestOffsetKnown
&& !oldestOffsetKnown) block should be a bit earlier.

I don't particularly like the fact that find_multixact_start() calls
SimpleLruFlush(). I think that's really a POLA violation: you don't
expect that a function that looks like a simple inquiry is going to go
do a bunch of unrelated I/O in the background. If somebody called
find_multixact_start() with any frequency, you'd be sad. You're just
doing it this way because of the context *in which you expect
find_multixact_start* to be called, which does not seem very
future-proof. I prefer Thomas's approach.

I don't strongly care, but I do think it has some value to be sure about
the on-disk state for the current callers. I think it'd be a pretty odd
thing to call find_multixact_start() frequently.

If TruncateMultiXact() fails to acquire MultiXactTruncationLock right
away, does it need to wait, or could it ConditionalAcquire and bail
out if the lock isn't obtained?

That seems like premature optimization to me. And one that's not that
easy to do correctly - what if the current caller actually has a new,
lower, minimum mxid?

+     * XXX: It's also possible that the page that oldestMXact is on has
+     * already been truncated away, and we crashed before updating
+     * oldestMXact.

But why is that an XXX? I think this is just a case of recovery
needing to tolerate redo of an action already redone.

Should rather have been NB.

I'm not convinced that it's a good idea to remove
lastCheckpointedOldest and replace it with nothing. It seems like a
very good idea to have two separate pieces of state in shared memory:

- The oldest offset that we think anyone might need to access to make
a visibility check for a tuple.
- The oldest offset that we still have on disk.

The latter would now need to be called something else, not
lastCheckpointedOldest, but I think it's still good to have it.

Otherwise, I don't see how you protect against the on-disk state
wrapping around before you finish truncating, and then maybe
truncation eats something that was busy getting reused.

Unless I miss something the stop limits will prevent that from
happening? SetMultiXactIdLimit() is called only *after* the truncation
has finished?

+     * Update in-memory limits before performing the truncation, while inside
+     * the critical section: Have to do it before truncation, to prevent
+     * concurrent lookups of those values. Has to be inside the critical
+     * section asotherwise a future call to this function would error out,
+     * while looking up the oldest member in offsets, if our caller crashes
+     * before updating the limits.

Missing space: asotherwise.

Who else might be concurrently looking up those values? Nobody else
can truncate while we're truncating, because we hold
MultiXactTruncationLock. And nobody else should be getting here from
looking up tuples, because if they are, we truncated too soon.

pg_get_multixact_members(), a concurrent call to SetMultiXactIdLimit()
(SetOffsetLimit()->find_multixact_start()) from vac_truncate_clog().

Phew. That's all I see on a first read-through, but I've only spent a
couple of hours on this, so I might easily have missed some things.
But let me stop here, hit send, and see what you think of these
comments.

Thanks for the look so far!

Greetings,

Andres Freund


#10Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#9)
Re: Rework the way multixact truncations work

On Thu, Jul 2, 2015 at 2:28 PM, Andres Freund <andres@anarazel.de> wrote:

On 2015-07-02 13:58:45 -0400, Robert Haas wrote:

On Thu, Jul 2, 2015 at 11:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Will look at 0003 next.

+ appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",

I don't think we typically use this style for notating intervals.

I don't think we really have a very consistent style for xlog messages -
this seems to describe the meaning accurately?

Although I realize this is supposed to be interval notation, I'm not
sure everyone will immediately figure that out. I believe it has
created some confusion in the past. I'm not going to spend a lot of
time arguing with you about it, but I'd do something else, like
offsets from %u stop before %u, members %u stop before %u.
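To compare the two notations side by side, here is a small stand-alone C sketch. It is not the patch's code: the function names `desc_interval_style` and `desc_prose_style` are hypothetical, and plain `snprintf` is used instead of `appendStringInfo` so that it compiles outside the PostgreSQL tree.

```c
#include <assert.h>
#include <stdio.h>

/* Half-open interval style, as in the patch under review. */
int
desc_interval_style(char *buf, size_t len,
                    unsigned int start_off, unsigned int end_off,
                    unsigned int start_memb, unsigned int end_memb)
{
    assert(buf != NULL);
    return snprintf(buf, len, "offsets [%u, %u), members [%u, %u)",
                    start_off, end_off, start_memb, end_memb);
}

/* Spelled-out style, following the wording suggested in the reply above. */
int
desc_prose_style(char *buf, size_t len,
                 unsigned int start_off, unsigned int end_off,
                 unsigned int start_memb, unsigned int end_memb)
{
    assert(buf != NULL);
    return snprintf(buf, len,
                    "offsets from %u stop before %u, members %u stop before %u",
                    start_off, end_off, start_memb, end_memb);
}
```

Both variants carry the same four numbers; the difference is only whether the reader is expected to know that `[a, b)` excludes `b`.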

[several good points]

+ (errmsg("performing legacy multixact truncation,
upgrade master")));

This message needs work. I'm not sure exactly what it should say, but
I'm pretty sure that's not clear enough.

I seriously, seriously doubt that it is a good idea to perform the
legacy truncation from MultiXactAdvanceOldest() rather than
TruncateMultiXact().

But where should TruncateMultiXact() be called from? I mean, we could
move the logic from inside MultiXactAdvanceOldest() to some special case
in the replay routine, but what'd be the advantage?

I think you should call it from where TruncateMultiXact() is being
called from today. Doing legacy truncations from a different place
than we're currently doing them just gives us more ways to be wrong.

The checkpoint hasn't really happened at that point yet; you might
truncate away stuff, then crash before the checkpoint is complete, and
then when you restart recovery, you've got trouble.

We're only talking about restartpoints here, right? And I don't see the
problem - we don't read the slru anymore until the end of recovery, and
the end of recovery can't happen before reaching the minimum recovery
location?

You're still going to have to read the SLRU for as long as you are
doing legacy truncations, at least.

If TruncateMultiXact() fails to acquire MultiXactTruncationLock right
away, does it need to wait, or could it ConditionalAcquire and bail
out if the lock isn't obtained?

That seems like premature optimization to me. And one that's not that
easy to do correctly - what if the current caller actually has a new,
lower, minimum mxid?

Doesn't the next truncation just catch up? But sure, I agree this is
inessential (and maybe better left alone for now).

I'm not convinced that it's a good idea to remove
lastCheckpointedOldest and replace it with nothing. It seems like a
very good idea to have two separate pieces of state in shared memory:

- The oldest offset that we think anyone might need to access to make
a visibility check for a tuple.
- The oldest offset that we still have on disk.

The latter would now need to be called something else, not
lastCheckpointedOldest, but I think it's still good to have it.

Otherwise, I don't see how you protect against the on-disk state
wrapping around before you finish truncating, and then maybe
truncation eats something that was busy getting reused.

Unless I miss something the stop limits will prevent that from
happening? SetMultiXactIdLimit() is called only *after* the truncation
has finished?

Hmm, that might be, I'd have to reread the patch. The reason we
originally had it this way was because VACUUM was updating the limit
and then checkpoint was truncating, but now I guess vacuum + truncate
happen so close together that you might only need one value.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#11Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#10)
Re: Rework the way multixact truncations work

(quick answer, off now)

On 2015-07-05 14:20:11 -0400, Robert Haas wrote:

On Thu, Jul 2, 2015 at 2:28 PM, Andres Freund <andres@anarazel.de> wrote:

On 2015-07-02 13:58:45 -0400, Robert Haas wrote:

I seriously, seriously doubt that it is a good idea to perform the
legacy truncation from MultiXactAdvanceOldest() rather than
TruncateMultiXact().

But where should TruncateMultiXact() be called from? I mean, we could
move the logic from inside MultiXactAdvanceOldest() to some special case
in the replay routine, but what'd be the advantage?

I think you should call it from where TruncateMultiXact() is being
called from today. Doing legacy truncations from a different place
than we're currently doing them just gives us more ways to be wrong.

The problem with that is that the current location is just plain
wrong. Restartpoints can be skipped (due to different checkpoint segments
settings), may not happen at all (pending incomplete actions), and can
just be slowed down.

That's a currently existing bug that's easy to reproduce.

The checkpoint hasn't really happened at that point yet; you might
truncate away stuff, then crash before the checkpoint is complete, and
then when you restart recovery, you've got trouble.

We're only talking about restartpoints here, right? And I don't see the
problem - we don't read the slru anymore until the end of recovery, and
the end of recovery can't happen before reaching the minimum recovery
location?

You're still going to have to read the SLRU for as long as you are
doing legacy truncations, at least.

I'm not following. Sure, we read the SLRUs as we do today. But, in
contrast to the current positioning in recovery, with the patch they're
done at pretty much the same point on the standby as on the primary
today?

Greetings,

Andres Freund


#12Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#11)
Re: Rework the way multixact truncations work

On Sun, Jul 5, 2015 at 2:28 PM, Andres Freund <andres@anarazel.de> wrote:

(quick answer, off now)

On 2015-07-05 14:20:11 -0400, Robert Haas wrote:

On Thu, Jul 2, 2015 at 2:28 PM, Andres Freund <andres@anarazel.de> wrote:

On 2015-07-02 13:58:45 -0400, Robert Haas wrote:

I seriously, seriously doubt that it is a good idea to perform the
legacy truncation from MultiXactAdvanceOldest() rather than
TruncateMultiXact().

But where should TruncateMultiXact() be called from? I mean, we could
move the logic from inside MultiXactAdvanceOldest() to some special case
in the replay routine, but what'd be the advantage?

I think you should call it from where TruncateMultiXact() is being
called from today. Doing legacy truncations from a different place
than we're currently doing them just gives us more ways to be wrong.

The problem with that is that the current location is just plain
wrong. Restartpoints can be skipped (due to different checkpoint segments
settings), may not happen at all (pending incomplete actions), and can
just be slowed down.

That's a currently existing bug that's easy to reproduce.

You might be right; I haven't tested that.

On the other hand, in the common case, by the time we perform a
restartpoint, we're consistent: I think the main exception to that is
if we do a base backup that spans multiple checkpoints. I think that
in the new location, the chances that the legacy truncation is trying
to read inconsistent data is probably higher.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#13Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#12)
Re: Rework the way multixact truncations work

On July 5, 2015 8:50:57 PM GMT+02:00, Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, Jul 5, 2015 at 2:28 PM, Andres Freund <andres@anarazel.de> wrote:

(quick answer, off now)

On 2015-07-05 14:20:11 -0400, Robert Haas wrote:

On Thu, Jul 2, 2015 at 2:28 PM, Andres Freund <andres@anarazel.de> wrote:

On 2015-07-02 13:58:45 -0400, Robert Haas wrote:

I seriously, seriously doubt that it is a good idea to perform the
legacy truncation from MultiXactAdvanceOldest() rather than
TruncateMultiXact().

But where should TruncateMultiXact() be called from? I mean, we could
move the logic from inside MultiXactAdvanceOldest() to some special case
in the replay routine, but what'd be the advantage?

I think you should call it from where TruncateMultiXact() is being
called from today. Doing legacy truncations from a different place
than we're currently doing them just gives us more ways to be wrong.

The problem with that is that the current location is just plain
wrong. Restartpoints can be skipped (due to different checkpoint segments
settings), may not happen at all (pending incomplete actions), and can
just be slowed down.

That's a currently existing bug that's easy to reproduce.

You might be right; I haven't tested that.

On the other hand, in the common case, by the time we perform a
restartpoint, we're consistent: I think the main exception to that is
if we do a base backup that spans multiple checkpoints. I think that
in the new location, the chances that the legacy truncation is trying
to read inconsistent data is probably higher.

The primary problem isn't that we truncate too early, it's that we delay truncation on the standby in comparison to the primary by a considerable amount. All the while continuing to replay multi creations.

I don't see the difference wrt. consistency right now, but I don't have access to the code right now. I mean we *have* to do something while inconsistent. A start/stop backup can easily span a day or four.

Andres

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.


#14Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#13)
Re: Rework the way multixact truncations work

On Sun, Jul 5, 2015 at 3:16 PM, Andres Freund <andres@anarazel.de> wrote:

On the other hand, in the common case, by the time we perform a
restartpoint, we're consistent: I think the main exception to that is
if we do a base backup that spans multiple checkpoints. I think that
in the new location, the chances that the legacy truncation is trying
to read inconsistent data is probably higher.

The primary problem isn't that we truncate too early, it's that we delay truncation on the standby in comparison to the primary by a considerable amount. All the while continuing to replay multi creations.

I don't see the difference wrt. consistency right now, but I don't have access to the code right now. I mean we *have* to do something while inconsistent. A start/stop backup can easily span a day or four.

So, where are we with this patch?

In my opinion, we ought to do something about master and 9.5 before
beta, so that we're not doing *yet another* major release with unfixed
multixact bugs. Let's make the relevant truncation changes in master
and 9.5 and bump the WAL page magic, so that a 9.5alpha standby can't
be used with a 9.5beta master. Then, we don't need any of this legacy
truncation stuff at all, and 9.5 is hopefully in a much better state
than 9.4 and 9.3.

Now, that still potentially leaves 9.4 and 9.3 users hanging out to
dry. But we don't have a tremendous number of those people clamoring
about this, and if we get 9.5+ correct, then we can go and change the
logic in 9.4 and 9.3 later when, and if, we are confident that's the
right thing to do. I am still not altogether convinced that it's a
good idea, nor am I altogether convinced that this code is right.
Perhaps it is, and if we reach consensus on it, fine. But regardless of
that, we should not send a third major release to beta with the
current broken system unless there is really no viable alternative.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#15Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#14)
Re: Rework the way multixact truncations work

On 2015-09-21 10:31:17 -0400, Robert Haas wrote:

On Sun, Jul 5, 2015 at 3:16 PM, Andres Freund <andres@anarazel.de> wrote:

On the other hand, in the common case, by the time we perform a
restartpoint, we're consistent: I think the main exception to that is
if we do a base backup that spans multiple checkpoints. I think that
in the new location, the chances that the legacy truncation is trying
to read inconsistent data is probably higher.

The primary problem isn't that we truncate too early, it's that we delay truncation on the standby in comparison to the primary by a considerable amount. All the while continuing to replay multi creations.

I don't see the difference wrt. consistency right now, but I don't have access to the code right now. I mean we *have* to do something while inconsistent. A start/stop backup can easily span a day or four.

So, where are we with this patch?

Uh. I'd basically been waiting on further review and then forgot about
it.

In my opinion, we ought to do something about master and 9.5 before
beta, so that we're not doing *yet another* major release with unfixed
multixact bugs. Let's make the relevant truncation changes in master
and 9.5 and bump the WAL page magic, so that a 9.5alpha standby can't
be used with a 9.5beta master. Then, we don't need any of this legacy
truncation stuff at all, and 9.5 is hopefully in a much better state
than 9.4 and 9.3.

Hm.

Now, that still potentially leaves 9.4 and 9.3 users hanging out to
dry. But we don't have a tremendous number of those people clamoring
about this, and if we get 9.5+ correct, then we can go and change the
logic in 9.4 and 9.3 later when, and if, we are confident that's the
right thing to do. I am still not altogether convinced that it's a
good idea, nor am I altogether convinced that this code is right.
Perhaps it is, and if we reach consensus on it, fine.

To me the current logic is much worse than what's in the patch, so I
don't think that's the best way to go. But I'm not absolutely gung
ho on that.

But regardless of that, we should not send a third major release to
beta with the current broken system unless there is really no viable
alternative.

Agreed. I'll update the patch.

Greetings,

Andres Freund


#16Josh Berkus
josh@agliodbs.com
In reply to: Andres Freund (#1)
Re: Rework the way multixact truncations work

On 09/21/2015 07:36 AM, Andres Freund wrote:

On 2015-09-21 10:31:17 -0400, Robert Haas wrote:

So, where are we with this patch?

Uh. I'd basically been waiting on further review and then forgot about
it.

Does the current plan to never expire XIDs in 9.6 affect multixact
truncation at all?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


#17Andres Freund
andres@anarazel.de
In reply to: Josh Berkus (#16)
Re: Rework the way multixact truncations work

On 2015-09-21 10:30:59 -0700, Josh Berkus wrote:

On 09/21/2015 07:36 AM, Andres Freund wrote:

On 2015-09-21 10:31:17 -0400, Robert Haas wrote:

So, where are we with this patch?

Uh. I'd basically been waiting on further review and then forgot about
it.

Does the current plan to never expire XIDs in 9.6 affect multixact
truncation at all?

I doubt that it would, in any meaningful manner. Truncations will still need to
happen to contain space usage.

Besides, I'm pretty sceptical of shaping the design of bug fixes to suit
some unwritten feature we only know the highest level design of as of
yet.

Greetings,

Andres Freund


#18Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#7)
Re: Rework the way multixact truncations work

On 2015-07-02 11:52:04 -0400, Robert Haas wrote:

On Mon, Jun 29, 2015 at 3:48 PM, Andres Freund <andres@anarazel.de> wrote:

New version attached.

0002 looks good, but the commit message should perhaps mention the
comment fix. Or commit that separately.

I'm inclined to backpatch the applicable parts to 9.0 - seems pointless
to have differing autovacuum_freeze_max_age values and the current value
sucks for testing and space consumption there as well.

Greetings,

Andres Freund


#19Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#15)
3 attachment(s)
Re: Rework the way multixact truncations work

On 2015-09-21 16:36:03 +0200, Andres Freund wrote:

Agreed. I'll update the patch.

Here are updated patches against master. These include the "legacy"
truncation support. There are no meaningful functional differences in this
version except addressing the review comments that I agreed with, and a
fair amount of additional polishing.

I've not:
* removed legacy truncation support
* removed SimpleLruFlush() from find_multixact_start() - imo it's easier
to reason about the system when the disk state is in sync with the in
memory state.
* removed the interval syntax from debug messages and xlogdump - they're
a fair bit more concise and the target audience of those will be able
to figure it out.
* unsplit DetermineSafeOldestOffset & SetOffsetVacuumLimit - imo the
separate functions don't make sense anymore now that limits and
truncations aren't as separate anymore.

What I've tested is the following:
* continuous burning of multis, both triggered via members and offsets
* a standby keeping up when the primary is old
* a standby keeping up when the primary is new
* basebackups made while a new primary is under load
* verified that we properly PANIC when a truncation record is replayed
in an old standby.

Does anybody have additional tests in mind?

I plan to push 0002 fairly soon, it seemed to be pretty
uncontroversial. I'll then work tomorrow afternoon on producing branch
specific versions of 0003 and on producing 0004 removing all the legacy
stuff for 9.5 & master.

Greetings,

Andres Freund

Attachments:

0001-WIP-dontcommit-Add-functions-to-burn-multixacts.patch (text/x-patch; charset=us-ascii)
From b4a6fade2dc06c4658dbfd1bf1801b30d5c3e388 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 4 Jun 2015 19:38:32 +0200
Subject: [PATCH 1/3] WIP-dontcommit: Add functions to burn multixacts

This should live in its own module, but we don't have that yet.
---
 contrib/pageinspect/heapfuncs.c          | 43 ++++++++++++++++++++++++++++++++
 contrib/pageinspect/pageinspect--1.3.sql |  6 +++++
 src/backend/access/heap/heapam.c         |  2 +-
 src/backend/access/transam/multixact.c   | 15 ++++++-----
 src/include/access/multixact.h           |  3 ++-
 5 files changed, 61 insertions(+), 8 deletions(-)

diff --git a/contrib/pageinspect/heapfuncs.c b/contrib/pageinspect/heapfuncs.c
index 8d1666c..7a3aa14 100644
--- a/contrib/pageinspect/heapfuncs.c
+++ b/contrib/pageinspect/heapfuncs.c
@@ -29,6 +29,8 @@
 #include "funcapi.h"
 #include "utils/builtins.h"
 #include "miscadmin.h"
+#include "access/multixact.h"
+#include "access/transam.h"
 
 
 /*
@@ -223,3 +225,44 @@ heap_page_items(PG_FUNCTION_ARGS)
 	else
 		SRF_RETURN_DONE(fctx);
 }
+
+extern Datum
+pg_burn_multixact(PG_FUNCTION_ARGS);
+PG_FUNCTION_INFO_V1(pg_burn_multixact);
+
+Datum
+pg_burn_multixact(PG_FUNCTION_ARGS)
+{
+	int		rep = PG_GETARG_INT32(0);
+	int		size = PG_GETARG_INT32(1);
+	MultiXactMember *members;
+	MultiXactId ret;
+	TransactionId id = ReadNewTransactionId() - size;
+	int		i;
+
+	if (rep < 1)
+		elog(ERROR, "need to burn, burn, burn");
+
+	members = palloc(size * sizeof(MultiXactMember));
+	for (i = 0; i < size; i++)
+	{
+		members[i].xid = id++;
+		members[i].status = MultiXactStatusForShare;
+
+		if (!TransactionIdIsNormal(members[i].xid))
+		{
+			id = FirstNormalTransactionId;
+			members[i].xid = id++;
+		}
+	}
+
+	MultiXactIdSetOldestMember();
+
+	for (i = 0; i < rep; i++)
+	{
+		CHECK_FOR_INTERRUPTS();
+		ret = MultiXactIdCreateFromMembers(size, members, true);
+	}
+
+	PG_RETURN_INT64((int64) ret);
+}
diff --git a/contrib/pageinspect/pageinspect--1.3.sql b/contrib/pageinspect/pageinspect--1.3.sql
index a99e058..22f51bc 100644
--- a/contrib/pageinspect/pageinspect--1.3.sql
+++ b/contrib/pageinspect/pageinspect--1.3.sql
@@ -187,3 +187,9 @@ CREATE FUNCTION gin_leafpage_items(IN page bytea,
 RETURNS SETOF record
 AS 'MODULE_PATHNAME', 'gin_leafpage_items'
 LANGUAGE C STRICT;
+
+
+CREATE FUNCTION pg_burn_multixact(num int4, size int4)
+RETURNS int8
+AS 'MODULE_PATHNAME', 'pg_burn_multixact'
+LANGUAGE C STRICT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index bcf9871..e167684 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6099,7 +6099,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		 * Create a new multixact with the surviving members of the previous
 		 * one, to set as new Xmax in the tuple.
 		 */
-		xid = MultiXactIdCreateFromMembers(nnewmembers, newmembers);
+		xid = MultiXactIdCreateFromMembers(nnewmembers, newmembers, false);
 		*flags |= FRM_RETURN_IS_MULTI;
 	}
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 1933a87..34c5370 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -407,7 +407,7 @@ MultiXactIdCreate(TransactionId xid1, MultiXactStatus status1,
 	members[1].xid = xid2;
 	members[1].status = status2;
 
-	newMulti = MultiXactIdCreateFromMembers(2, members);
+	newMulti = MultiXactIdCreateFromMembers(2, members, false);
 
 	debug_elog3(DEBUG2, "Create: %s",
 				mxid_to_string(newMulti, 2, members));
@@ -473,7 +473,7 @@ MultiXactIdExpand(MultiXactId multi, TransactionId xid, MultiXactStatus status)
 		 */
 		member.xid = xid;
 		member.status = status;
-		newMulti = MultiXactIdCreateFromMembers(1, &member);
+		newMulti = MultiXactIdCreateFromMembers(1, &member, false);
 
 		debug_elog4(DEBUG2, "Expand: %u has no members, create singleton %u",
 					multi, newMulti);
@@ -525,7 +525,7 @@ MultiXactIdExpand(MultiXactId multi, TransactionId xid, MultiXactStatus status)
 
 	newMembers[j].xid = xid;
 	newMembers[j++].status = status;
-	newMulti = MultiXactIdCreateFromMembers(j, newMembers);
+	newMulti = MultiXactIdCreateFromMembers(j, newMembers, false);
 
 	pfree(members);
 	pfree(newMembers);
@@ -744,7 +744,7 @@ ReadNextMultiXactId(void)
  * NB: the passed members[] array will be sorted in-place.
  */
 MultiXactId
-MultiXactIdCreateFromMembers(int nmembers, MultiXactMember *members)
+MultiXactIdCreateFromMembers(int nmembers, MultiXactMember *members, bool nocache)
 {
 	MultiXactId multi;
 	MultiXactOffset offset;
@@ -763,7 +763,9 @@ MultiXactIdCreateFromMembers(int nmembers, MultiXactMember *members)
 	 * corner cases where someone else added us to a MultiXact without our
 	 * knowledge, but it's not worth checking for.)
 	 */
-	multi = mXactCacheGetBySet(nmembers, members);
+	multi = nocache ? InvalidMultiXactId :
+		mXactCacheGetBySet(nmembers, members);
+
 	if (MultiXactIdIsValid(multi))
 	{
 		debug_elog2(DEBUG2, "Create: in cache!");
@@ -836,7 +838,8 @@ MultiXactIdCreateFromMembers(int nmembers, MultiXactMember *members)
 	END_CRIT_SECTION();
 
 	/* Store the new MultiXactId in the local cache, too */
-	mXactCachePut(multi, nmembers, members);
+	if (!nocache)
+		mXactCachePut(multi, nmembers, members);
 
 	debug_elog2(DEBUG2, "Create: all done");
 
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index f1448fe..6213f8a 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -86,10 +86,11 @@ typedef struct xl_multixact_create
 extern MultiXactId MultiXactIdCreate(TransactionId xid1,
 				  MultiXactStatus status1, TransactionId xid2,
 				  MultiXactStatus status2);
+extern MultiXactId CreateMultiXactId(int nmembers, MultiXactMember *members, bool nocache);
 extern MultiXactId MultiXactIdExpand(MultiXactId multi, TransactionId xid,
 				  MultiXactStatus status);
 extern MultiXactId MultiXactIdCreateFromMembers(int nmembers,
-							 MultiXactMember *members);
+							 MultiXactMember *members, bool nocache);
 
 extern MultiXactId ReadNextMultiXactId(void);
 extern bool MultiXactIdIsRunning(MultiXactId multi, bool isLockOnly);
-- 
2.5.0.400.gff86faf

0002-Lower-_freeze_max_age-minimum-values.patch (text/x-patch; charset=us-ascii)
From a7b0af940ecb1140deee3582508fc0a2dd64e03b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 15 Jun 2015 19:12:52 +0200
Subject: [PATCH 2/3] Lower *_freeze_max_age minimum values.

The old minimum values are rather large, making it time consuming to
test related behaviour. Additionally the current limits, especially for
multixacts, can be problematic in space-constrained systems. 10000000
multixacts can contain a lot of members.

Since there's no good reason for the current limits, lower them a good
bit. Setting them to 0 would be a bad idea, triggering endless vacuums,
so still retain a limit.

While at it fix autovacuum_multixact_freeze_max_age to refer to
multixact.c instead of varsup.c.

Reviewed-By: Robert Haas
Discussion: CA+TgmoYmQPHcrc3GSs7vwvrbTkbcGD9Gik=OztbDGGrovkkEzQ@mail.gmail.com
Backpatch: back to 9.0 (in parts)
---
 src/backend/utils/misc/guc.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index fcba3c5..17053af 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2510,17 +2510,17 @@ static struct config_int ConfigureNamesInt[] =
 		},
 		&autovacuum_freeze_max_age,
 		/* see pg_resetxlog if you change the upper-limit value */
-		200000000, 100000000, 2000000000,
+		200000000, 100000, 2000000000,
 		NULL, NULL, NULL
 	},
 	{
-		/* see varsup.c for why this is PGC_POSTMASTER not PGC_SIGHUP */
+		/* see multixact.c for why this is PGC_POSTMASTER not PGC_SIGHUP */
 		{"autovacuum_multixact_freeze_max_age", PGC_POSTMASTER, AUTOVACUUM,
 			gettext_noop("Multixact age at which to autovacuum a table to prevent multixact wraparound."),
 			NULL
 		},
 		&autovacuum_multixact_freeze_max_age,
-		400000000, 10000000, 2000000000,
+		400000000, 10000, 2000000000,
 		NULL, NULL, NULL
 	},
 	{
-- 
2.5.0.400.gff86faf

0003-Rework-the-way-multixact-truncations-work.patch (text/x-patch; charset=us-ascii)
From 7d83bd78b7ca0bad52a6d54a996b8fe8c15b4f65 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 22 Sep 2015 15:17:09 +0200
Subject: [PATCH 3/3] Rework the way multixact truncations work.

The fact that multixact truncations are not WAL logged has caused a fair
share of problems. Among other things, it requires performing
computations during recovery while the database is not in a consistent
state, delaying truncations until checkpoints, and handling the case
where members have been truncated but offsets have not.

We tried to put bandaids on lots of these issues over the last years,
but it seems time to change course. Thus this patch introduces WAL
logging for truncation, even in the back branches.

This allows:
1) to perform the truncation directly during VACUUM, instead of delaying it
   to the checkpoint.
2) to avoid looking at the offsets SLRU for truncation during recovery,
   we can just use the master's values.
3) to simplify a fair amount of the logic that keeps the in-memory
   limits straight.

During the course of fixing this a bunch of bugs had to be fixed:
1) Data was not purged from the members SLRU's in-memory buffers before
   deleting segments. This happened to be hard or impossible to hit due
   to the interlock between checkpoints and truncation.
2) find_multixact_start() relied on SimpleLruDoesPhysicalPageExist - but
   that doesn't work for offsets that haven't yet been flushed to
   disk. Flush out before running to fix. Not pretty, but it feels
   slightly safer to only make decisions based on on-disk state.
3) find_multixact_start() could be called concurrently with a truncation
   and thus fail. Via SetOffsetVacuumLimit() that could lead to a round
   of emergency vacuuming. The problem remains in
   pg_get_multixact_members(), but that's quite harmless.

To handle the case of an updated standby replaying WAL from a not-yet
upgraded primary we have to recognize that situation and use "old style"
truncation (i.e. looking at the SLRUs) during WAL replay. In contrast to
before, this now happens in the startup process, when replaying a
checkpoint record, instead of in the checkpointer. Doing this in the
restartpoint was incorrect: restartpoints can happen much later than the
original checkpoint, thereby leading to wraparound. It's also more in
line with how the WAL logging now works.

To avoid "multixact_redo: unknown op code 48" errors standbys should be
upgraded before primaries. This needs to be expressed clearly in the
release notes.

Backpatch to 9.3, where the use of multixacts was expanded. Arguably
this could be backpatched further, but there doesn't seem to be
sufficient benefit to outweigh the risk of applying a significantly
different patch there.

Discussion: 20150621192409.GA4797@alap3.anarazel.de
Reviewed-By: Robert Haas, Alvaro Herrera, Thomas Munro
Backpatch: 9.3
---
 src/backend/access/rmgrdesc/mxactdesc.c  |  11 +
 src/backend/access/transam/multixact.c   | 651 ++++++++++++++++++-------------
 src/backend/access/transam/slru.c        |  83 +++-
 src/backend/access/transam/xlog.c        |  53 +--
 src/backend/commands/vacuum.c            |   4 +-
 src/backend/storage/lmgr/lwlocknames.txt |   1 +
 src/include/access/multixact.h           |  19 +-
 src/include/access/slru.h                |   4 +-
 src/tools/pgindent/typedefs.list         |   1 +
 9 files changed, 507 insertions(+), 320 deletions(-)

diff --git a/src/backend/access/rmgrdesc/mxactdesc.c b/src/backend/access/rmgrdesc/mxactdesc.c
index 572951e..5b8134f 100644
--- a/src/backend/access/rmgrdesc/mxactdesc.c
+++ b/src/backend/access/rmgrdesc/mxactdesc.c
@@ -70,6 +70,14 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 		for (i = 0; i < xlrec->nmembers; i++)
 			out_member(buf, &xlrec->members[i]);
 	}
+	else if (info == XLOG_MULTIXACT_TRUNCATE_ID)
+	{
+		xl_multixact_truncate *xlrec = (xl_multixact_truncate *) rec;
+
+		appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
+						 xlrec->startTruncOff, xlrec->endTruncOff,
+						 xlrec->startTruncMemb, xlrec->endTruncMemb);
+	}
 }
 
 const char *
@@ -88,6 +96,9 @@ multixact_identify(uint8 info)
 		case XLOG_MULTIXACT_CREATE_ID:
 			id = "CREATE_ID";
 			break;
+		case XLOG_MULTIXACT_TRUNCATE_ID:
+			id = "TRUNCATE_ID";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 34c5370..1c3cfbe 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -49,9 +49,7 @@
  * value is removed; the cutoff value is stored in pg_class.  The minimum value
  * across all tables in each database is stored in pg_database, and the global
  * minimum across all databases is part of pg_control and is kept in shared
- * memory.  At checkpoint time, after the value is known flushed in WAL, any
- * files that correspond to multixacts older than that value are removed.
- * (These files are also removed when a restartpoint is executed.)
+ * memory.  Whenever that minimum is advanced, the SLRUs are truncated.
  *
  * When new multixactid values are to be created, care is taken that the
  * counter does not fall within the wraparound horizon considering the global
@@ -83,6 +81,7 @@
 #include "postmaster/autovacuum.h"
 #include "storage/lmgr.h"
 #include "storage/pmsignal.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
@@ -109,6 +108,7 @@
 	((xid) / (MultiXactOffset) MULTIXACT_OFFSETS_PER_PAGE)
 #define MultiXactIdToOffsetEntry(xid) \
 	((xid) % (MultiXactOffset) MULTIXACT_OFFSETS_PER_PAGE)
+#define MultiXactIdToOffsetSegment(xid) (MultiXactIdToOffsetPage(xid) / SLRU_PAGES_PER_SEGMENT)
 
 /*
  * The situation for members is a bit more complex: we store one byte of
@@ -153,6 +153,7 @@
 
 /* page in which a member is to be found */
 #define MXOffsetToMemberPage(xid) ((xid) / (TransactionId) MULTIXACT_MEMBERS_PER_PAGE)
+#define MXOffsetToMemberSegment(xid) (MXOffsetToMemberPage(xid) / SLRU_PAGES_PER_SEGMENT)
 
 /* Location (byte offset within page) of flag word for a given member */
 #define MXOffsetToFlagsOffset(xid) \
@@ -220,11 +221,12 @@ typedef struct MultiXactStateData
 	bool		oldestOffsetKnown;
 
 	/*
-	 * This is what the previous checkpoint stored as the truncate position.
-	 * This value is the oldestMultiXactId that was valid when a checkpoint
-	 * was last executed.
+	 * True if a multixact truncation WAL record was replayed since the last
+	 * checkpoint. This is used to trigger 'legacy truncations', i.e. truncate
+	 * by looking at the data directory during WAL replay, when the primary is
+	 * too old to generate truncation records.
 	 */
-	MultiXactId lastCheckpointedOldest;
+	bool		sawTruncationInCkptCycle;
 
 	/* support for anti-wraparound measures */
 	MultiXactId multiVacLimit;
@@ -233,8 +235,7 @@ typedef struct MultiXactStateData
 	MultiXactId multiWrapLimit;
 
 	/* support for members anti-wraparound measures */
-	MultiXactOffset offsetStopLimit;
-	bool offsetStopLimitKnown;
+	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
 
 	/*
 	 * Per-backend data starts here.  We have two arrays stored in the area
@@ -364,12 +365,14 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 						MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static void DetermineSafeOldestOffset(MultiXactId oldestMXact);
 static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
 						 MultiXactOffset start, uint32 distance);
-static bool SetOffsetVacuumLimit(bool finish_setup);
+static bool SetOffsetVacuumLimit(void);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMZeroPageXlogRec(int pageno, uint8 info);
+static void WriteMTruncateXlogRec(Oid oldestMultiDB,
+					  MultiXactId startOff, MultiXactId endOff,
+					  MultiXactOffset startMemb, MultiXactOffset endMemb);
 
 
 /*
@@ -1102,7 +1105,7 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	 *----------
 	 */
 #define OFFSET_WARN_SEGMENTS	20
-	if (MultiXactState->offsetStopLimitKnown &&
+	if (MultiXactState->oldestOffsetKnown &&
 		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
 								 nmembers))
 	{
@@ -1142,7 +1145,7 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
 	}
 
-	if (MultiXactState->offsetStopLimitKnown &&
+	if (MultiXactState->oldestOffsetKnown &&
 		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit,
 								 nextOffset,
 								 nmembers + MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT * OFFSET_WARN_SEGMENTS))
@@ -2020,13 +2023,21 @@ StartupMultiXact(void)
 void
 TrimMultiXact(void)
 {
-	MultiXactId multi = MultiXactState->nextMXact;
-	MultiXactOffset offset = MultiXactState->nextOffset;
-	MultiXactId	oldestMXact;
+	MultiXactId nextMXact;
+	MultiXactOffset offset;
+	MultiXactId oldestMXact;
+	Oid			oldestMXactDB;
 	int			pageno;
 	int			entryno;
 	int			flagsoff;
 
+	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
+	nextMXact = MultiXactState->nextMXact;
+	offset = MultiXactState->nextOffset;
+	oldestMXact = MultiXactState->oldestMultiXactId;
+	oldestMXactDB = MultiXactState->oldestMultiXactDB;
+	MultiXactState->finishedStartup = true;
+	LWLockRelease(MultiXactGenLock);
 
 	/* Clean up offsets state */
 	LWLockAcquire(MultiXactOffsetControlLock, LW_EXCLUSIVE);
@@ -2034,20 +2045,20 @@ TrimMultiXact(void)
 	/*
 	 * (Re-)Initialize our idea of the latest page number for offsets.
 	 */
-	pageno = MultiXactIdToOffsetPage(multi);
+	pageno = MultiXactIdToOffsetPage(nextMXact);
 	MultiXactOffsetCtl->shared->latest_page_number = pageno;
 
 	/*
 	 * Zero out the remainder of the current offsets page.  See notes in
 	 * TrimCLOG() for motivation.
 	 */
-	entryno = MultiXactIdToOffsetEntry(multi);
+	entryno = MultiXactIdToOffsetEntry(nextMXact);
 	if (entryno != 0)
 	{
 		int			slotno;
 		MultiXactOffset *offptr;
 
-		slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
+		slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
 		offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 		offptr += entryno;
 
@@ -2096,12 +2107,11 @@ TrimMultiXact(void)
 
 	LWLockRelease(MultiXactMemberControlLock);
 
-	if (SetOffsetVacuumLimit(true) && IsUnderPostmaster)
-		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-	LWLockAcquire(MultiXactGenLock, LW_SHARED);
-	oldestMXact = MultiXactState->lastCheckpointedOldest;
-	LWLockRelease(MultiXactGenLock);
-	DetermineSafeOldestOffset(oldestMXact);
+	/*
+	 * Now that we are fully started we can accurately compute how far the
+	 * next members wraparound is away.
+	 */
+	SetMultiXactIdLimit(oldestMXact, oldestMXactDB);
 }
 
 /*
@@ -2270,8 +2280,20 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid)
 	 (errmsg("MultiXactId wrap limit is %u, limited by database with OID %u",
 			 multiWrapLimit, oldest_datoid)));
 
+	/*
+	 * Computing the actual limits is only possible once the data directory is
+	 * in a consistent state. There's no need to compute the limits while
+	 * still replaying WAL - no decisions about new multis are made even
+	 * though multixact creations might be replayed. So we'll only do further
+	 * checks after TrimMultiXact() has been called.
+	 */
+	if (!MultiXactState->finishedStartup)
+		return;
+
+	Assert(!InRecovery);
+
 	/* Set limits for offset vacuum. */
-	needs_offset_vacuum = SetOffsetVacuumLimit(false);
+	needs_offset_vacuum = SetOffsetVacuumLimit();
 
 	/*
 	 * If past the autovacuum force point, immediately signal an autovac
@@ -2281,11 +2303,11 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid)
 	 * another iteration immediately if there are still any old databases.
 	 */
 	if ((MultiXactIdPrecedes(multiVacLimit, curMulti) ||
-		 needs_offset_vacuum) && IsUnderPostmaster && !InRecovery)
+		 needs_offset_vacuum) && IsUnderPostmaster)
 		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
 
 	/* Give an immediate warning if past the wrap warn point */
-	if (MultiXactIdPrecedes(multiWarnLimit, curMulti) && !InRecovery)
+	if (MultiXactIdPrecedes(multiWarnLimit, curMulti))
 	{
 		char	   *oldest_datname;
 
@@ -2353,27 +2375,39 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti,
 }
 
 /*
- * Update our oldestMultiXactId value, but only if it's more recent than
- * what we had.  However, even if not, always update the oldest multixact
- * offset limit.
+ * Update our oldestMultiXactId value, but only if it's more recent than what
+ * we had.
+ *
+ * This may only be called during WAL replay.
  */
 void
 MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB)
 {
+	Assert(InRecovery);
+
 	if (MultiXactIdPrecedes(MultiXactState->oldestMultiXactId, oldestMulti))
+	{
+		/*
+		 * If there has been a truncation on the master, detected by seeing a
+		 * moving oldestMulti without a corresponding truncation record, we
+		 * know that the primary is still running an older version of postgres
+		 * that doesn't yet log multixact truncations. So perform truncation
+		 * ourselves.
+		 */
+		if (!MultiXactState->sawTruncationInCkptCycle)
+		{
+			ereport(LOG,
+					(errmsg("performing legacy multixact truncation"),
+					 errdetail("Legacy truncations are sometimes performed when replaying WAL from an older primary."),
+					 errhint("Upgrade the primary, it is susceptible to data corruption.")));
+			TruncateMultiXact(oldestMulti, oldestMultiDB, true);
+		}
+
 		SetMultiXactIdLimit(oldestMulti, oldestMultiDB);
-}
+	}
 
-/*
- * Update the "safe truncation point".  This is the newest value of oldestMulti
- * that is known to be flushed as part of a checkpoint record.
- */
-void
-MultiXactSetSafeTruncate(MultiXactId safeTruncateMulti)
-{
-	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
-	MultiXactState->lastCheckpointedOldest = safeTruncateMulti;
-	LWLockRelease(MultiXactGenLock);
+	/* only looked at in the startup process, no lock necessary */
+	MultiXactState->sawTruncationInCkptCycle = false;
 }
 
 /*
@@ -2529,126 +2563,50 @@ GetOldestMultiXactId(void)
 }
 
 /*
- * Based on the given oldest MultiXactId, determine what's the oldest member
- * offset and install the limit info in MultiXactState, where it can be used to
- * prevent overrun of old data in the members SLRU area.
- */
-static void
-DetermineSafeOldestOffset(MultiXactId oldestMXact)
-{
-	MultiXactOffset oldestOffset;
-	MultiXactOffset nextOffset;
-	MultiXactOffset offsetStopLimit;
-	MultiXactOffset prevOffsetStopLimit;
-	MultiXactId		nextMXact;
-	bool			finishedStartup;
-	bool			prevOffsetStopLimitKnown;
-
-	/* Fetch values from shared memory. */
-	LWLockAcquire(MultiXactGenLock, LW_SHARED);
-	finishedStartup = MultiXactState->finishedStartup;
-	nextMXact = MultiXactState->nextMXact;
-	nextOffset = MultiXactState->nextOffset;
-	prevOffsetStopLimit = MultiXactState->offsetStopLimit;
-	prevOffsetStopLimitKnown = MultiXactState->offsetStopLimitKnown;
-	LWLockRelease(MultiXactGenLock);
-
-	/* Don't worry about this until after we've started up. */
-	if (!finishedStartup)
-		return;
-
-	/*
-	 * Determine the offset of the oldest multixact.  Normally, we can read
-	 * the offset from the multixact itself, but there's an important special
-	 * case: if there are no multixacts in existence at all, oldestMXact
-	 * obviously can't point to one.  It will instead point to the multixact
-	 * ID that will be assigned the next time one is needed.
-	 *
-	 * NB: oldestMXact should be the oldest multixact that still exists in the
-	 * SLRU, unlike in SetOffsetVacuumLimit, where we do this same computation
-	 * based on the oldest value that might be referenced in a table.
-	 */
-	if (nextMXact == oldestMXact)
-		oldestOffset = nextOffset;
-	else
-	{
-		bool		oldestOffsetKnown;
-
-		oldestOffsetKnown = find_multixact_start(oldestMXact, &oldestOffset);
-		if (!oldestOffsetKnown)
-		{
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
-						oldestMXact)));
-			return;
-		}
-	}
-
-	/* move back to start of the corresponding segment */
-	offsetStopLimit = oldestOffset - (oldestOffset %
-		(MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
-	/* always leave one segment before the wraparound point */
-	offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
-
-	/* if nothing has changed, we're done */
-	if (prevOffsetStopLimitKnown && offsetStopLimit == prevOffsetStopLimit)
-		return;
-
-	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
-	MultiXactState->offsetStopLimit = offsetStopLimit;
-	MultiXactState->offsetStopLimitKnown = true;
-	LWLockRelease(MultiXactGenLock);
-
-	if (!prevOffsetStopLimitKnown && IsUnderPostmaster)
-		ereport(LOG,
-				(errmsg("MultiXact member wraparound protections are now enabled")));
-	ereport(DEBUG1,
-			(errmsg("MultiXact member stop limit is now %u based on MultiXact %u",
-				offsetStopLimit, oldestMXact)));
-}
-
-/*
  * Determine how aggressively we need to vacuum in order to prevent member
  * wraparound.
  *
- * To determine the oldest multixact ID, we look at oldestMultiXactId, not
- * lastCheckpointedOldest.  That's because vacuuming can't help with anything
- * older than oldestMultiXactId; anything older than that isn't referenced
- * by any table.  Offsets older than oldestMultiXactId but not as old as
- * lastCheckpointedOldest will go away after the next checkpoint.
+ * To do so determine what's the oldest member offset and install the limit
+ * info in MultiXactState, where it can be used to prevent overrun of old data
+ * in the members SLRU area.
  *
  * The return value is true if emergency autovacuum is required and false
  * otherwise.
  */
 static bool
-SetOffsetVacuumLimit(bool finish_setup)
+SetOffsetVacuumLimit(void)
 {
-	MultiXactId	oldestMultiXactId;
+	MultiXactId oldestMultiXactId;
 	MultiXactId nextMXact;
-	bool		finishedStartup;
-	MultiXactOffset oldestOffset = 0;		/* placate compiler */
+	MultiXactOffset oldestOffset = 0;	/* placate compiler */
+	MultiXactOffset prevOldestOffset;
 	MultiXactOffset nextOffset;
 	bool		oldestOffsetKnown = false;
-	MultiXactOffset prevOldestOffset;
 	bool		prevOldestOffsetKnown;
+	MultiXactOffset offsetStopLimit = 0;
+
+	/*
+	 * NB: Have to prevent concurrent truncation, we might otherwise try to
+	 * lookup a oldestMulti that's concurrently getting truncated away.
+	 */
+	LWLockAcquire(MultiXactTruncationLock, LW_SHARED);
 
 	/* Read relevant fields from shared memory. */
 	LWLockAcquire(MultiXactGenLock, LW_SHARED);
 	oldestMultiXactId = MultiXactState->oldestMultiXactId;
 	nextMXact = MultiXactState->nextMXact;
 	nextOffset = MultiXactState->nextOffset;
-	finishedStartup = MultiXactState->finishedStartup;
-	prevOldestOffset = MultiXactState->oldestOffset;
 	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
+	prevOldestOffset = MultiXactState->oldestOffset;
+	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
-	/* Don't do this until after any recovery is complete. */
-	if (!finishedStartup && !finish_setup)
-		return false;
-
 	/*
-	 * If no multixacts exist, then oldestMultiXactId will be the next
-	 * multixact that will be created, rather than an existing multixact.
+	 * Determine the offset of the oldest multixact.  Normally, we can read
+	 * the offset from the multixact itself, but there's an important special
+	 * case: if there are no multixacts in existence at all, oldestMXact
+	 * obviously can't point to one.  It will instead point to the multixact
+	 * ID that will be assigned the next time one is needed.
 	 */
 	if (oldestMultiXactId == nextMXact)
 	{
@@ -2669,39 +2627,45 @@ SetOffsetVacuumLimit(bool finish_setup)
 		 */
 		oldestOffsetKnown =
 			find_multixact_start(oldestMultiXactId, &oldestOffset);
+
+		if (oldestOffsetKnown)
+			ereport(DEBUG1,
+					(errmsg("oldest MultiXactId member is at offset %u",
+							oldestOffset)));
+		else
+			ereport(LOG,
+					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
+							oldestMultiXactId)));
 	}
 
+	LWLockRelease(MultiXactTruncationLock);
+
 	/*
-	 * Except when initializing the system for the first time, there's no
-	 * need to update anything if we don't know the oldest offset or if it
-	 * hasn't changed.
+	 * If we can, compute limits (and install them in MultiXactState) to prevent
+	 * overrun of old data in the members SLRU area. We can only do so if the
+	 * oldest offset is known though.
 	 */
-	if (finish_setup ||
-		(oldestOffsetKnown && !prevOldestOffsetKnown) ||
-		(oldestOffsetKnown && prevOldestOffset != oldestOffset))
+	if (oldestOffsetKnown)
 	{
-		/* Install the new limits. */
-		LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
-		MultiXactState->oldestOffset = oldestOffset;
-		MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-		MultiXactState->finishedStartup = true;
-		LWLockRelease(MultiXactGenLock);
+		/* move back to start of the corresponding segment */
+		offsetStopLimit = oldestOffset - (oldestOffset %
+					  (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
 
-		/* Log the info */
-		if (oldestOffsetKnown)
-			ereport(DEBUG1,
-					(errmsg("oldest MultiXactId member is at offset %u",
-						oldestOffset)));
-		else
-			ereport(DEBUG1,
-					(errmsg("oldest MultiXactId member offset unknown")));
+		/* always leave one segment before the wraparound point */
+		offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
+
+		if (!prevOldestOffsetKnown && IsUnderPostmaster)
+			ereport(LOG,
+					(errmsg("MultiXact member wraparound protections are now enabled")));
+		ereport(DEBUG1,
+		(errmsg("MultiXact member stop limit is now %u based on MultiXact %u",
+				offsetStopLimit, oldestMultiXactId)));
 	}
 
 	/*
 	 * If we failed to get the oldest offset this time, but we have a value
-	 * from a previous pass through this function, assess the need for
-	 * autovacuum based on that old value rather than automatically forcing
-	 * it.
+	 * from a previous pass through this function, use the old value rather
+	 * than automatically forcing it.
 	 */
 	if (prevOldestOffsetKnown && !oldestOffsetKnown)
 	{
@@ -2709,6 +2673,13 @@ SetOffsetVacuumLimit(bool finish_setup)
 		oldestOffsetKnown = true;
 	}
 
+	/* Install the computed values */
+	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
+	MultiXactState->oldestOffset = oldestOffset;
+	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
+	MultiXactState->offsetStopLimit = offsetStopLimit;
+	LWLockRelease(MultiXactGenLock);
+
 	/*
 	 * Do we need an emergency autovacuum?  If we're not sure, assume yes.
 	 */
@@ -2723,7 +2694,7 @@ SetOffsetVacuumLimit(bool finish_setup)
  * boundary point, hence the name.  The reason we don't want to use the regular
  * 2^31-modulo arithmetic here is that we want to be able to use the whole of
  * the 2^32-1 space here, allowing for more multixacts that would fit
- * otherwise.  See also SlruScanDirCbRemoveMembers.
+ * otherwise.
  */
 static bool
 MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
@@ -2769,6 +2740,9 @@ MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
  *
  * Returns false if the file containing the multi does not exist on disk.
  * Otherwise, returns true and sets *result to the starting member offset.
+ *
+ * This function does not prevent concurrent truncation, so if that's
+ * required, the caller has to protect against that.
  */
 static bool
 find_multixact_start(MultiXactId multi, MultiXactOffset *result)
@@ -2779,9 +2753,22 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	int			slotno;
 	MultiXactOffset *offptr;
 
+	/* XXX: Remove || AmStartupProcess() after WAL page magic bump */
+	Assert(MultiXactState->finishedStartup || AmStartupProcess());
+
 	pageno = MultiXactIdToOffsetPage(multi);
 	entryno = MultiXactIdToOffsetEntry(multi);
 
+	/*
+	 * We need to flush out dirty data, so PhysicalPageExists can work
+	 * correctly, but SimpleLruFlush() is a pretty big hammer for that.
+	 * Alternatively we could add an in-memory version of page exists, but
+	 * find_multixact_start is called infrequently, and it doesn't seem bad to
+	 * flush buffers to disk before truncation.
+	 */
+	SimpleLruFlush(MultiXactOffsetCtl, true);
+	SimpleLruFlush(MultiXactMemberCtl, true);
+
 	if (!SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno))
 		return false;
 
@@ -2887,65 +2874,6 @@ MultiXactMemberFreezeThreshold(void)
 	return multixacts - victim_multixacts;
 }
 
-/*
- * SlruScanDirectory callback.
- *		This callback deletes segments that are outside the range determined by
- *		the given page numbers.
- *
- * Both range endpoints are exclusive (that is, segments containing any of
- * those pages are kept.)
- */
-typedef struct MembersLiveRange
-{
-	int			rangeStart;
-	int			rangeEnd;
-} MembersLiveRange;
-
-static bool
-SlruScanDirCbRemoveMembers(SlruCtl ctl, char *filename, int segpage,
-						   void *data)
-{
-	MembersLiveRange *range = (MembersLiveRange *) data;
-	MultiXactOffset nextOffset;
-
-	if ((segpage == range->rangeStart) ||
-		(segpage == range->rangeEnd))
-		return false;			/* easy case out */
-
-	/*
-	 * To ensure that no segment is spuriously removed, we must keep track of
-	 * new segments added since the start of the directory scan; to do this,
-	 * we update our end-of-range point as we run.
-	 *
-	 * As an optimization, we can skip looking at shared memory if we know for
-	 * certain that the current segment must be kept.  This is so because
-	 * nextOffset never decreases, and we never increase rangeStart during any
-	 * one run.
-	 */
-	if (!((range->rangeStart > range->rangeEnd &&
-		   segpage > range->rangeEnd && segpage < range->rangeStart) ||
-		  (range->rangeStart < range->rangeEnd &&
-		   (segpage < range->rangeStart || segpage > range->rangeEnd))))
-		return false;
-
-	/*
-	 * Update our idea of the end of the live range.
-	 */
-	LWLockAcquire(MultiXactGenLock, LW_SHARED);
-	nextOffset = MultiXactState->nextOffset;
-	LWLockRelease(MultiXactGenLock);
-	range->rangeEnd = MXOffsetToMemberPage(nextOffset);
-
-	/* Recheck the deletion condition.  If it still holds, perform deletion */
-	if ((range->rangeStart > range->rangeEnd &&
-		 segpage > range->rangeEnd && segpage < range->rangeStart) ||
-		(range->rangeStart < range->rangeEnd &&
-		 (segpage < range->rangeStart || segpage > range->rangeEnd)))
-		SlruDeleteSegment(ctl, filename);
-
-	return false;				/* keep going */
-}
-
 typedef struct mxtruncinfo
 {
 	int			earliestExistingPage;
@@ -2969,6 +2897,52 @@ SlruScanDirCbFindEarliest(SlruCtl ctl, char *filename, int segpage, void *data)
 	return false;				/* keep going */
 }
 
+
+/*
+ * Delete member segments [oldest, newOldest)
+ */
+static void
+PerformMembersTruncation(MultiXactOffset oldestOffset, MultiXactOffset newOldestOffset)
+{
+	const int	maxsegment = MXOffsetToMemberSegment(MaxMultiXactOffset);
+	int			startsegment = MXOffsetToMemberSegment(oldestOffset);
+	int			endsegment = MXOffsetToMemberSegment(newOldestOffset);
+	int			segment = startsegment;
+
+	/*
+	 * Delete all the segments but the last one. The last segment can still
+	 * contain, possibly partially, valid data.
+	 */
+	while (segment != endsegment)
+	{
+		elog(DEBUG2, "truncating multixact members segment %x", segment);
+		SlruDeleteSegment(MultiXactMemberCtl, segment);
+
+		/* move to next segment, handling wraparound correctly */
+		if (segment == maxsegment)
+			segment = 0;
+		else
+			segment += 1;
+	}
+}
+
+/*
+ * Delete offsets segments [oldest, newOldest)
+ */
+static void
+PerformOffsetsTruncation(MultiXactId oldestMulti, MultiXactId newOldestMulti)
+{
+	/*
+	 * We step back one multixact to avoid passing a cutoff page that hasn't
+	 * been created yet in the rare case that oldestMulti would be the first
+	 * item on a page and oldestMulti == nextMulti.  In that case, if we
+	 * didn't subtract one, we'd trigger SimpleLruTruncate's wraparound
+	 * detection.
+	 */
+	SimpleLruTruncate(MultiXactOffsetCtl,
+				  MultiXactIdToOffsetPage(PreviousMultiXactId(newOldestMulti)));
+}
+
 /*
  * Remove all MultiXactOffset and MultiXactMember segments before the oldest
  * ones still of interest.
@@ -2979,27 +2953,54 @@ SlruScanDirCbFindEarliest(SlruCtl ctl, char *filename, int segpage, void *data)
  * xlog_redo() will already have called MultiXactAdvanceOldest().  Our
  * latest_page_number will already have been initialized by StartupMultiXact()
  * and kept up to date as new pages are zeroed.
+ *
+ * newOldestMulti is the oldest currently required multixact, newOldestMultiDB
+ * is one of the databases preventing newOldestMulti from increasing.
  */
 void
-TruncateMultiXact(void)
+TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB, bool in_recovery)
 {
-	MultiXactId oldestMXact;
+	MultiXactId oldestMulti;
+	MultiXactId nextMulti;
+	MultiXactOffset newOldestOffset;
 	MultiXactOffset oldestOffset;
-	MultiXactId		nextMXact;
-	MultiXactOffset	nextOffset;
+	MultiXactOffset nextOffset;
 	mxtruncinfo trunc;
 	MultiXactId earliest;
-	MembersLiveRange range;
 
-	Assert(AmCheckpointerProcess() || AmStartupProcess() ||
-		   !IsPostmasterEnvironment);
+	/*
+	 * Need to allow being called in recovery for backwards compatibility,
+	 * when an updated standby replays WAL generated by a non-updated primary.
+	 */
+	Assert(in_recovery || !RecoveryInProgress());
+	Assert(!in_recovery || AmStartupProcess());
+	Assert(in_recovery || MultiXactState->finishedStartup);
+
+	/*
+	 * We can only allow one truncation to happen at once. Otherwise parts of
+	 * members might vanish while we're doing lookups or similar. There's no
+	 * need to have an interlock with creating new multis or such, since those
+	 * are constrained by the limits (which only grow, never shrink).
+	 */
+	LWLockAcquire(MultiXactTruncationLock, LW_EXCLUSIVE);
 
 	LWLockAcquire(MultiXactGenLock, LW_SHARED);
-	oldestMXact = MultiXactState->lastCheckpointedOldest;
-	nextMXact = MultiXactState->nextMXact;
+	nextMulti = MultiXactState->nextMXact;
 	nextOffset = MultiXactState->nextOffset;
+	oldestMulti = MultiXactState->oldestMultiXactId;
 	LWLockRelease(MultiXactGenLock);
-	Assert(MultiXactIdIsValid(oldestMXact));
+	Assert(MultiXactIdIsValid(oldestMulti));
+
+	/*
+	 * Make sure to only attempt truncation if there's values to truncate
+	 * away. In normal processing values shouldn't go backwards, but there's
+	 * some corner cases (due to bugs) where that's possible.
+	 */
+	if (MultiXactIdPrecedesOrEquals(newOldestMulti, oldestMulti))
+	{
+		LWLockRelease(MultiXactTruncationLock);
+		return;
+	}
 
 	/*
 	 * Note we can't just plow ahead with the truncation; it's possible that
@@ -3007,6 +3008,9 @@ TruncateMultiXact(void)
 	 * going to attempt to read the offsets page to determine where to
 	 * truncate the members SLRU.  So we first scan the directory to determine
 	 * the earliest offsets page number that we can read without error.
+	 *
+	 * NB: It's also possible that the page that oldestMulti is on has already
+	 * been truncated away, and we crashed before updating oldestMulti.
 	 */
 	trunc.earliestExistingPage = -1;
 	SlruScanDirectory(MultiXactOffsetCtl, SlruScanDirCbFindEarliest, &trunc);
@@ -3014,19 +3018,10 @@ TruncateMultiXact(void)
 	if (earliest < FirstMultiXactId)
 		earliest = FirstMultiXactId;
 
-	/*
-	 * If there's nothing to remove, we can bail out early.
-	 *
-	 * Due to bugs in early releases of PostgreSQL 9.3.X and 9.4.X,
-	 * oldestMXact might point to a multixact that does not exist.
-	 * Autovacuum will eventually advance it to a value that does exist,
-	 * and we want to set a proper offsetStopLimit when that happens,
-	 * so call DetermineSafeOldestOffset here even if we're not actually
-	 * truncating.
-	 */
-	if (MultiXactIdPrecedes(oldestMXact, earliest))
+	/* If there's nothing to remove, we can bail out early. */
+	if (MultiXactIdPrecedes(oldestMulti, earliest))
 	{
-		DetermineSafeOldestOffset(oldestMXact);
+		LWLockRelease(MultiXactTruncationLock);
 		return;
 	}
 
@@ -3038,46 +3033,95 @@ TruncateMultiXact(void)
 	 * already checked that it doesn't precede the earliest MultiXact on
 	 * disk.  But if it fails, don't truncate anything, and log a message.
 	 */
-	if (oldestMXact == nextMXact)
+	if (oldestMulti == nextMulti)
 		oldestOffset = nextOffset;		/* there are NO MultiXacts */
-	else if (!find_multixact_start(oldestMXact, &oldestOffset))
+	else if (!find_multixact_start(oldestMulti, &oldestOffset))
 	{
 		ereport(LOG,
 				(errmsg("oldest MultiXact %u not found, earliest MultiXact %u, skipping truncation",
-					oldestMXact, earliest)));
+						oldestMulti, earliest)));
+		LWLockRelease(MultiXactTruncationLock);
 		return;
 	}
 
 	/*
-	 * To truncate MultiXactMembers, we need to figure out the active page
-	 * range and delete all files outside that range.  The start point is the
-	 * start of the segment containing the oldest offset; an end point of the
-	 * segment containing the next offset to use is enough.  The end point is
-	 * updated as MultiXactMember gets extended concurrently, elsewhere.
+	 * Secondly compute up to where to truncate. Lookup the corresponding
+	 * member offset for newOldestMulti for that.
 	 */
-	range.rangeStart = MXOffsetToMemberPage(oldestOffset);
-	range.rangeStart -= range.rangeStart % SLRU_PAGES_PER_SEGMENT;
-
-	range.rangeEnd = MXOffsetToMemberPage(nextOffset);
+	if (newOldestMulti == nextMulti)
+		newOldestOffset = nextOffset; /* there are NO MultiXacts */
+	else if (!find_multixact_start(newOldestMulti, &newOldestOffset))
+	{
+		ereport(LOG,
+				(errmsg("cannot truncate up toMultiXact %u because it does not exist on disk, skipping truncation",
+						newOldestMulti)));
+		LWLockRelease(MultiXactTruncationLock);
+		return;
+	}
 
-	SlruScanDirectory(MultiXactMemberCtl, SlruScanDirCbRemoveMembers, &range);
+	elog(DEBUG1, "performing multixact truncation: "
+		 "offsets [%u, %u), offset segments [%x, %x), "
+		 "members [%u, %u), member segments [%x, %x)",
+		 oldestMulti, newOldestMulti,
+		 MultiXactIdToOffsetSegment(oldestMulti),
+		 MultiXactIdToOffsetSegment(newOldestMulti),
+		 oldestOffset, newOldestOffset,
+		 MXOffsetToMemberSegment(oldestOffset),
+		 MXOffsetToMemberSegment(newOldestOffset));
 
 	/*
-	 * Now we can truncate MultiXactOffset.  We step back one multixact to
-	 * avoid passing a cutoff page that hasn't been created yet in the rare
-	 * case that oldestMXact would be the first item on a page and oldestMXact
-	 * == nextMXact.  In that case, if we didn't subtract one, we'd trigger
-	 * SimpleLruTruncate's wraparound detection.
+	 * Do truncation, and the WAL logging of the truncation, in a critical
+	 * section. That way offsets/members cannot get out of sync anymore, i.e.
+	 * once consistent the newOldestMulti will always exist in members, even if
+	 * we crashed in the wrong moment.
 	 */
-	SimpleLruTruncate(MultiXactOffsetCtl,
-				  MultiXactIdToOffsetPage(PreviousMultiXactId(oldestMXact)));
+	START_CRIT_SECTION();
 
 	/*
-	 * Now, and only now, we can advance the stop point for multixact members.
-	 * If we did it any sooner, the segments we deleted above might already
-	 * have been overwritten with new members.  That would be bad.
+	 * Prevent checkpoints from being scheduled concurrently. This is critical
+	 * because otherwise a truncation record might not be replayed after a
+	 * crash/basebackup, even though the state of the data directory would
+	 * require it.  It's not possible (startup process doesn't have a PGXACT
+	 * entry), and not needed, to do this during recovery, when performing an
+	 * old-style truncation, though. There the entire scheduling depends on
+	 * the replayed WAL records, which will be the same after a possible crash.
+	 */
+	if (!in_recovery)
+	{
+		Assert(!MyPgXact->delayChkpt);
+		MyPgXact->delayChkpt = true;
+	}
+
+	/* WAL log truncation */
+	if (!in_recovery)
+		WriteMTruncateXlogRec(newOldestMultiDB,
+							  oldestMulti, newOldestMulti,
+							  oldestOffset, newOldestOffset);
+
+	/*
+	 * Update in-memory limits before performing the truncation, while inside
+	 * the critical section: Have to do it before truncation, to prevent
+	 * concurrent lookups of those values. Has to be inside the critical
+	 * section as otherwise a future call to this function would error out,
+	 * while looking up the oldest member in offsets, if our caller crashes
+	 * before updating the limits.
 	 */
-	DetermineSafeOldestOffset(oldestMXact);
+	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
+	MultiXactState->oldestMultiXactId = newOldestMulti;
+	MultiXactState->oldestMultiXactDB = newOldestMultiDB;
+	LWLockRelease(MultiXactGenLock);
+
+	/* First truncate members */
+	PerformMembersTruncation(oldestOffset, newOldestOffset);
+
+	/* Then offsets */
+	PerformOffsetsTruncation(oldestMulti, newOldestMulti);
+
+	if (!in_recovery)
+		MyPgXact->delayChkpt = false;
+
+	END_CRIT_SECTION();
+	LWLockRelease(MultiXactTruncationLock);
 }
 
 /*
@@ -3174,6 +3218,34 @@ WriteMZeroPageXlogRec(int pageno, uint8 info)
 }
 
 /*
+ * Write a TRUNCATE xlog record
+ *
+ * We must flush the xlog record to disk before returning --- see notes in
+ * TruncateCLOG().
+ */
+static void
+WriteMTruncateXlogRec(Oid oldestMultiDB,
+					  MultiXactId startTruncOff, MultiXactId endTruncOff,
+				MultiXactOffset startTruncMemb, MultiXactOffset endTruncMemb)
+{
+	XLogRecPtr	recptr;
+	xl_multixact_truncate xlrec;
+
+	xlrec.oldestMultiDB = oldestMultiDB;
+
+	xlrec.startTruncOff = startTruncOff;
+	xlrec.endTruncOff = endTruncOff;
+
+	xlrec.startTruncMemb = startTruncMemb;
+	xlrec.endTruncMemb = endTruncMemb;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) (&xlrec), SizeOfMultiXactTruncate);
+	recptr = XLogInsert(RM_MULTIXACT_ID, XLOG_MULTIXACT_TRUNCATE_ID);
+	XLogFlush(recptr);
+}
+
+/*
  * MULTIXACT resource manager's routines
  */
 void
@@ -3255,6 +3327,49 @@ multixact_redo(XLogReaderState *record)
 			LWLockRelease(XidGenLock);
 		}
 	}
+	else if (info == XLOG_MULTIXACT_TRUNCATE_ID)
+	{
+		xl_multixact_truncate xlrec;
+		int			pageno;
+
+		memcpy(&xlrec, XLogRecGetData(record),
+			   SizeOfMultiXactTruncate);
+
+		elog(DEBUG1, "replaying multixact truncation: "
+			 "offsets [%u, %u), offset segments [%x, %x), "
+			 "members [%u, %u), member segments [%x, %x)",
+			 xlrec.startTruncOff, xlrec.endTruncOff,
+			 MultiXactIdToOffsetSegment(xlrec.startTruncOff),
+			 MultiXactIdToOffsetSegment(xlrec.endTruncOff),
+			 xlrec.startTruncMemb, xlrec.endTruncMemb,
+			 MXOffsetToMemberSegment(xlrec.startTruncMemb),
+			 MXOffsetToMemberSegment(xlrec.endTruncMemb));
+
+		/* should not be required, but more than cheap enough */
+		LWLockAcquire(MultiXactTruncationLock, LW_EXCLUSIVE);
+
+		/*
+		 * Advance the horizon values, so they're current at the end of
+		 * recovery.
+		 */
+		SetMultiXactIdLimit(xlrec.endTruncOff, xlrec.oldestMultiDB);
+
+		PerformMembersTruncation(xlrec.startTruncMemb, xlrec.endTruncMemb);
+
+		/*
+		 * During XLOG replay, latest_page_number isn't necessarily set up
+		 * yet; insert a suitable value to bypass the sanity test in
+		 * SimpleLruTruncate.
+		 */
+		pageno = MultiXactIdToOffsetPage(xlrec.endTruncOff);
+		MultiXactOffsetCtl->shared->latest_page_number = pageno;
+		PerformOffsetsTruncation(xlrec.startTruncOff, xlrec.endTruncOff);
+
+		LWLockRelease(MultiXactTruncationLock);
+
+		/* only looked at in the startup process, no lock necessary */
+		MultiXactState->sawTruncationInCkptCycle = true;
+	}
 	else
 		elog(PANIC, "multixact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 5fcea11..90c7cf5 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -134,6 +134,7 @@ static int	SlruSelectLRUPage(SlruCtl ctl, int pageno);
 
 static bool SlruScanDirCbDeleteCutoff(SlruCtl ctl, char *filename,
 						  int segpage, void *data);
+static void SlruInternalDeleteSegment(SlruCtl ctl, char *filename);
 
 /*
  * Initialization of shared memory
@@ -1075,7 +1076,7 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
  * Flush dirty pages to disk during checkpoint or database shutdown
  */
 void
-SimpleLruFlush(SlruCtl ctl, bool checkpoint)
+SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
 {
 	SlruShared	shared = ctl->shared;
 	SlruFlushData fdata;
@@ -1096,11 +1097,11 @@ SimpleLruFlush(SlruCtl ctl, bool checkpoint)
 		SlruInternalWritePage(ctl, slotno, &fdata);
 
 		/*
-		 * When called during a checkpoint, we cannot assert that the slot is
-		 * clean now, since another process might have re-dirtied it already.
-		 * That's okay.
+		 * In some places (e.g. checkpoints), we cannot assert that the slot
+		 * is clean now, since another process might have re-dirtied it
+		 * already.  That's okay.
 		 */
-		Assert(checkpoint ||
+		Assert(allow_redirtied ||
 			   shared->page_status[slotno] == SLRU_PAGE_EMPTY ||
 			   (shared->page_status[slotno] == SLRU_PAGE_VALID &&
 				!shared->page_dirty[slotno]));
@@ -1210,8 +1211,14 @@ restart:;
 	(void) SlruScanDirectory(ctl, SlruScanDirCbDeleteCutoff, &cutoffPage);
 }
 
-void
-SlruDeleteSegment(SlruCtl ctl, char *filename)
+/*
+ * Delete an individual SLRU segment, identified by the filename.
+ *
+ * NB: This does not touch the SLRU buffers themselves, callers have to ensure
+ * they either can't yet contain anything, or have already been cleaned out.
+ */
+static void
+SlruInternalDeleteSegment(SlruCtl ctl, char *filename)
 {
 	char		path[MAXPGPATH];
 
@@ -1222,6 +1229,64 @@ SlruDeleteSegment(SlruCtl ctl, char *filename)
 }
 
 /*
+ * Delete an individual SLRU segment, identified by the segment number.
+ */
+void
+SlruDeleteSegment(SlruCtl ctl, int segno)
+{
+	SlruShared	shared = ctl->shared;
+	int			slotno;
+	char		path[MAXPGPATH];
+	bool		did_write;
+
+	/* Clean out any possibly existing references to the segment. */
+	LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+restart:
+	did_write = false;
+	for (slotno = 0; slotno < shared->num_slots; slotno++)
+	{
+		int			pagesegno = shared->page_number[slotno] / SLRU_PAGES_PER_SEGMENT;
+
+		if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
+			continue;
+
+		/* not the segment we're looking for */
+		if (pagesegno != segno)
+			continue;
+
+		/* If page is clean, just change state to EMPTY (expected case). */
+		if (shared->page_status[slotno] == SLRU_PAGE_VALID &&
+			!shared->page_dirty[slotno])
+		{
+			shared->page_status[slotno] = SLRU_PAGE_EMPTY;
+			continue;
+		}
+
+		/* Same logic as SimpleLruTruncate() */
+		if (shared->page_status[slotno] == SLRU_PAGE_VALID)
+			SlruInternalWritePage(ctl, slotno, NULL);
+		else
+			SimpleLruWaitIO(ctl, slotno);
+
+		did_write = true;
+	}
+
+	/*
+	 * Be extra careful and re-check. The IO functions release the control
+	 * lock, so new pages could have been read in.
+	 */
+	if (did_write)
+		goto restart;
+
+	snprintf(path, MAXPGPATH, "%s/%04X", ctl->Dir, segno);
+	ereport(DEBUG2,
+			(errmsg("removing file \"%s\"", path)));
+	unlink(path);
+
+	LWLockRelease(shared->ControlLock);
+}
+
+/*
  * SlruScanDirectory callback
  *		This callback reports true if there's any segment prior to the one
  *		containing the page passed as "data".
@@ -1249,7 +1314,7 @@ SlruScanDirCbDeleteCutoff(SlruCtl ctl, char *filename, int segpage, void *data)
 	int			cutoffPage = *(int *) data;
 
 	if (ctl->PagePrecedes(segpage, cutoffPage))
-		SlruDeleteSegment(ctl, filename);
+		SlruInternalDeleteSegment(ctl, filename);
 
 	return false;				/* keep going */
 }
@@ -1261,7 +1326,7 @@ SlruScanDirCbDeleteCutoff(SlruCtl ctl, char *filename, int segpage, void *data)
 bool
 SlruScanDirCbDeleteAll(SlruCtl ctl, char *filename, int segpage, void *data)
 {
-	SlruDeleteSegment(ctl, filename);
+	SlruInternalDeleteSegment(ctl, filename);
 
 	return false;				/* keep going */
 }
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a87f09e..1ac1c05 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6330,7 +6330,6 @@ StartupXLOG(void)
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
 	SetCommitTsLimit(checkPoint.oldestCommitTs,
 					 checkPoint.newestCommitTs);
-	MultiXactSetSafeTruncate(checkPoint.oldestMulti);
 	XLogCtl->ckptXidEpoch = checkPoint.nextXidEpoch;
 	XLogCtl->ckptXid = checkPoint.nextXid;
 
@@ -6347,10 +6346,8 @@ StartupXLOG(void)
 	StartupReorderBuffer();
 
 	/*
-	 * Startup MultiXact.  We need to do this early for two reasons: one is
-	 * that we might try to access multixacts when we do tuple freezing, and
-	 * the other is we need its state initialized because we attempt
-	 * truncation during restartpoints.
+	 * Startup MultiXact. We need to do this early to be able to replay
+	 * truncations.
 	 */
 	StartupMultiXact();
 
@@ -8508,12 +8505,6 @@ CreateCheckPoint(int flags)
 	END_CRIT_SECTION();
 
 	/*
-	 * Now that the checkpoint is safely on disk, we can update the point to
-	 * which multixact can be truncated.
-	 */
-	MultiXactSetSafeTruncate(checkPoint.oldestMulti);
-
-	/*
 	 * Let smgr do post-checkpoint cleanup (eg, deleting old files).
 	 */
 	smgrpostckpt();
@@ -8552,11 +8543,6 @@ CreateCheckPoint(int flags)
 	if (!RecoveryInProgress())
 		TruncateSUBTRANS(GetOldestXmin(NULL, false));
 
-	/*
-	 * Truncate pg_multixact too.
-	 */
-	TruncateMultiXact();
-
 	/* Real work is done, but log and update stats before releasing lock. */
 	LogCheckpointEnd(false);
 
@@ -8887,21 +8873,6 @@ CreateRestartPoint(int flags)
 	}
 
 	/*
-	 * Due to a historical accident multixact truncations are not WAL-logged,
-	 * but just performed everytime the mxact horizon is increased. So, unless
-	 * we explicitly execute truncations on a standby it will never clean out
-	 * /pg_multixact which obviously is bad, both because it uses space and
-	 * because we can wrap around into pre-existing data...
-	 *
-	 * We can only do the truncation here, after the UpdateControlFile()
-	 * above, because we've now safely established a restart point.  That
-	 * guarantees we will not need to access those multis.
-	 *
-	 * It's probably worth improving this.
-	 */
-	TruncateMultiXact();
-
-	/*
 	 * Truncate pg_subtrans if possible.  We can throw away all data before
 	 * the oldest XMIN of any running transaction.  No future transaction will
 	 * attempt to reference any pg_subtrans entry older than that (see Asserts
@@ -9261,9 +9232,14 @@ xlog_redo(XLogReaderState *record)
 		LWLockRelease(OidGenLock);
 		MultiXactSetNextMXact(checkPoint.nextMulti,
 							  checkPoint.nextMultiOffset);
+
+		/*
+		 * NB: This may perform multixact truncation when replaying WAL
+		 * generated by an older primary.
+		 */
+		MultiXactAdvanceOldest(checkPoint.oldestMulti,
+							   checkPoint.oldestMultiDB);
 		SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
-		SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
-		MultiXactSetSafeTruncate(checkPoint.oldestMulti);
 
 		/*
 		 * If we see a shutdown checkpoint while waiting for an end-of-backup
@@ -9353,14 +9329,17 @@ xlog_redo(XLogReaderState *record)
 		LWLockRelease(OidGenLock);
 		MultiXactAdvanceNextMXact(checkPoint.nextMulti,
 								  checkPoint.nextMultiOffset);
+
+		/*
+		 * NB: This may perform multixact truncation when replaying WAL
+		 * generated by an older primary.
+		 */
+		MultiXactAdvanceOldest(checkPoint.oldestMulti,
+							   checkPoint.oldestMultiDB);
 		if (TransactionIdPrecedes(ShmemVariableCache->oldestXid,
 								  checkPoint.oldestXid))
 			SetTransactionIdLimit(checkPoint.oldestXid,
 								  checkPoint.oldestXidDB);
-		MultiXactAdvanceOldest(checkPoint.oldestMulti,
-							   checkPoint.oldestMultiDB);
-		MultiXactSetSafeTruncate(checkPoint.oldestMulti);
-
 		/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
 		ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
 		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 85b0483..698bb35 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1137,11 +1137,11 @@ vac_truncate_clog(TransactionId frozenXID,
 		return;
 
 	/*
-	 * Truncate CLOG and CommitTs to the oldest computed value. Note we don't
-	 * truncate multixacts; that will be done by the next checkpoint.
+	 * Truncate CLOG, multixact and CommitTs to the oldest computed value.
 	 */
 	TruncateCLOG(frozenXID);
 	TruncateCommitTs(frozenXID, true);
+	TruncateMultiXact(minMulti, minmulti_datoid, false);
 
 	/*
 	 * Update the wrap limit for GetNewTransactionId and creation of new
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index 96bbfe8..c557cb6 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -45,3 +45,4 @@ ReplicationSlotControlLock			37
 CommitTsControlLock					38
 CommitTsLock						39
 ReplicationOriginLock				40
+MultiXactTruncationLock				41
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 6213f8a..47ef38d 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -71,6 +71,7 @@ typedef struct MultiXactMember
 #define XLOG_MULTIXACT_ZERO_OFF_PAGE	0x00
 #define XLOG_MULTIXACT_ZERO_MEM_PAGE	0x10
 #define XLOG_MULTIXACT_CREATE_ID		0x20
+#define XLOG_MULTIXACT_TRUNCATE_ID		0x30
 
 typedef struct xl_multixact_create
 {
@@ -82,6 +83,21 @@ typedef struct xl_multixact_create
 
 #define SizeOfMultiXactCreate (offsetof(xl_multixact_create, members))
 
+typedef struct xl_multixact_truncate
+{
+	Oid			oldestMultiDB;
+
+	/* to-be-truncated range of multixact offsets */
+	MultiXactId startTruncOff;	/* just for completeness' sake */
+	MultiXactId endTruncOff;
+
+	/* to-be-truncated range of multixact members */
+	MultiXactOffset startTruncMemb;
+	MultiXactOffset endTruncMemb;
+} xl_multixact_truncate;
+
+#define SizeOfMultiXactTruncate (sizeof(xl_multixact_truncate))
+
 
 extern MultiXactId MultiXactIdCreate(TransactionId xid1,
 				  MultiXactStatus status1, TransactionId xid2,
@@ -120,13 +136,12 @@ extern void MultiXactGetCheckptMulti(bool is_shutdown,
 						 Oid *oldestMultiDB);
 extern void CheckPointMultiXact(void);
 extern MultiXactId GetOldestMultiXactId(void);
-extern void TruncateMultiXact(void);
+extern void TruncateMultiXact(MultiXactId oldestMulti, Oid oldestMultiDB, bool in_recovery);
 extern void MultiXactSetNextMXact(MultiXactId nextMulti,
 					  MultiXactOffset nextMultiOffset);
 extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
 						  MultiXactOffset minMultiOffset);
 extern void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB);
-extern void MultiXactSetSafeTruncate(MultiXactId safeTruncateMulti);
 extern int	MultiXactMemberFreezeThreshold(void);
 
 extern void multixact_twophase_recover(TransactionId xid, uint16 info,
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index 9c7f019..f60e75b 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -143,14 +143,14 @@ extern int SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
 extern int SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno,
 						   TransactionId xid);
 extern void SimpleLruWritePage(SlruCtl ctl, int slotno);
-extern void SimpleLruFlush(SlruCtl ctl, bool checkpoint);
+extern void SimpleLruFlush(SlruCtl ctl, bool allow_redirtied);
 extern void SimpleLruTruncate(SlruCtl ctl, int cutoffPage);
 extern bool SimpleLruDoesPhysicalPageExist(SlruCtl ctl, int pageno);
 
 typedef bool (*SlruScanCallback) (SlruCtl ctl, char *filename, int segpage,
 											  void *data);
 extern bool SlruScanDirectory(SlruCtl ctl, SlruScanCallback callback, void *data);
-extern void SlruDeleteSegment(SlruCtl ctl, char *filename);
+extern void SlruDeleteSegment(SlruCtl ctl, int segno);
 
 /* SlruScanDirectory public callbacks */
 extern bool SlruScanDirCbReportPresence(SlruCtl ctl, char *filename,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a037f81..0e149ea 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2750,6 +2750,7 @@ xl_invalid_page
 xl_invalid_page_key
 xl_multi_insert_tuple
 xl_multixact_create
+xl_multixact_truncate
 xl_parameter_change
 xl_relmap_update
 xl_replorigin_drop
-- 
2.5.0.400.gff86faf
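
The wraparound-aware loop in PerformMembersTruncation() in the patch above can be tried out in isolation. What follows is only a standalone sketch of that loop shape, not code from the patch: walk_segments and its callback parameter are hypothetical names, and max_segment stands in for MXOffsetToMemberSegment(MaxMultiXactOffset).

```c
#include <stddef.h>

/*
 * Hypothetical helper: visit every segment in the half-open range
 * [start, end), wrapping from max_segment back to 0.
 * Returns the number of segments visited; the callback may be NULL.
 */
static int
walk_segments(int start, int end, int max_segment, void (*visit) (int segno))
{
	int			segment = start;
	int			visited = 0;

	while (segment != end)
	{
		if (visit != NULL)
			visit(segment);
		visited++;

		/* step forward, wrapping around at the top of the range */
		segment = (segment == max_segment) ? 0 : segment + 1;
	}

	return visited;
}
```

With start == end the loop visits nothing, which matches the half-open interval: the segment containing the new oldest offset is always kept.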

#20Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#19)
Re: Rework the way multixact truncations work

On Tue, Sep 22, 2015 at 9:20 AM, Andres Freund <andres@anarazel.de> wrote:

On 2015-09-21 16:36:03 +0200, Andres Freund wrote:

Agreed. I'll update the patch.

Here's updated patches against master. These include the "legacy"
truncation support. There's no meaningful functional differences in this
version except addressing the review comments that I agreed with, and a
fair amount of additional polishing.

0002 looks fine.

Regarding 0003, I'm still very much not convinced that it's a good
idea to apply this to 9.3 and 9.4. This patch changes the way we do
truncation in those older releases; instead of happening at a
restartpoint, it happens when oldestMultiXid advances. I admit that I
don't see a specific way that that can go wrong, but there are so many
different old versions with slightly different multixact truncation
behaviors that it seems very hard to be sure that we're not going to
make things worse rather than better by introducing yet another
approach to the problem. I realize that you disagree and will
probably commit this to those branches anyway. But I want it to be
clear that I don't endorse that.

I wish more people were paying attention to these patches. These are
critical data-corrupting bugs, the code in question is very tricky,
it's been majorly revised multiple times, and we're revising it again.
And nobody except me and Andres is looking at this, and I'm definitely
not smart enough to get this all right.

Other issues:
- If SlruDeleteSegment fails in unlink(), shouldn't we at the very
least log a message? If that file is still there when we loop back
around, it's going to cause a failure, I think.
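
A minimal sketch of what reporting such a failure could look like, written as standalone C rather than against the PostgreSQL tree. The helper name delete_segment_file and the use of fprintf in place of ereport are assumptions for illustration, and treating ENOENT as harmless is only one possible policy.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: remove one segment file and report failures.
 * Returns 0 if the file is gone afterwards, -1 otherwise.
 */
static int
delete_segment_file(const char *path)
{
	if (unlink(path) == 0)
		return 0;

	/* A file that is already missing needs no further action. */
	if (errno == ENOENT)
		return 0;

	/* Anything else leaves the file behind, so say so. */
	fprintf(stderr, "could not remove file \"%s\": %s\n",
			path, strerror(errno));
	return -1;
}
```

Inside the backend the message would presumably go through ereport() at LOG or WARNING level instead of stderr.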

Assorted minor nitpicking:
- "happend" is misspelled in the commit message for 0003
- "in contrast to before" should have a comma after it, also in that
commit message
- "how far the next members wraparound is away" -> "how far away the
next members wraparound is"
- "seing" -> "seeing"
- "Upgrade the primary," -> "Upgrade the primary;"
- "toMultiXact" -> "to MultiXact"

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#20)
Re: Rework the way multixact truncations work

Robert Haas wrote:

Regarding 0003, I'm still very much not convinced that it's a good
idea to apply this to 9.3 and 9.4. This patch changes the way we do
truncation in those older releases; instead of happening at a
restartpoint, it happens when oldestMultiXid advances. I admit that I
don't see a specific way that that can go wrong, but there are so many
different old versions with slightly different multixact truncation
behaviors that it seems very hard to be sure that we're not going to
make things worse rather than better by introducing yet another
approach to the problem. I realize that you disagree and will
probably commit this to those branches anyway. But I want it to be
clear that I don't endorse that.

Noted. I am not sure about changing things so invasively either TBH.
The interactions of this stuff with other parts of the system are very
complicated and it's easy to make a mistake that goes unnoticed until
some weird scenario is run elsewhere. (Who would have thought that
things would fail when a basebackup takes 12 hours and you have
a custom preemptive tuple freeze script in crontab?)

I wish more people were paying attention to these patches. These are
critical data-corrupting bugs, the code in question is very tricky,
it's been majorly revised multiple times, and we're revising it again.
And nobody except me and Andres is looking at this, and I'm definitely
not smart enough to get this all right.

I'm also looking, and yes it's tricky.

Other issues:

It would be good to pgindent the code before producing back-branch
patches. I think some comments will get changed.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#22Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#20)
Re: Rework the way multixact truncations work

On 2015-09-22 13:38:58 -0400, Robert Haas wrote:

Regarding 0003, I'm still very much not convinced that it's a good
idea to apply this to 9.3 and 9.4. This patch changes the way we do
truncation in those older releases; instead of happening at a
restartpoint, it happens when oldestMultiXid advances.

The primary reason for doing that is that doing it at restartpoints is
simply *wrong*. Restartpoints aren't scheduled in sync with replay -
which means that a restartpoint can (will actually) happen long long
after the checkpoint from the primary has replayed. Which means that by
the time the restartpoint is performed it's actually not unlikely that
we've already filled all slru segments. Which is bad if we then fail
over/start up.

Aside from the more fundamental issue that restartpoints have to be
"asynchronous" with respect to the checkpoint record for performance
reasons, there's a bunch of additional reasons making this even more
likely to occur: differing checkpoint segments on the standby and
pending actions (which we got rid of in 9.5+, but ...)

I realize that you disagree and will probably commit this to those
branches anyway. But I want it to be clear that I don't endorse that.

I don't plan to commit/backpatch this over your objection.

I do think it'd be the better approach, and I personally think that
we're much more likely to introduce bugs if we backpatch this in a
year. Which I think we'll end up having to. The longer people run on
these branches, the more issues we'll see.

I wish more people were paying attention to these patches.

+many

Other issues:
- If SlruDeleteSegment fails in unlink(), shouldn't we at the very
least log a message? If that file is still there when we loop back
around, it's going to cause a failure, I think.

The existing unlink() call doesn't, that's the only reason I didn't add
a message there. I'm fine with adding a (LOG or WARNING?) message.

Greetings,

Andres Freund

#23Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#22)
Re: Rework the way multixact truncations work

On Tue, Sep 22, 2015 at 1:57 PM, Andres Freund <andres@anarazel.de> wrote:

On 2015-09-22 13:38:58 -0400, Robert Haas wrote:

Regarding 0003, I'm still very much not convinced that it's a good
idea to apply this to 9.3 and 9.4. This patch changes the way we do
truncation in those older releases; instead of happening at a
restartpoint, it happens when oldestMultiXid advances.

The primary reason for doing that is that doing it at restartpoints is
simply *wrong*. Restartpoints aren't scheduled in sync with replay -
which means that a restartpoint can (will actually) happen long long
after the checkpoint from the primary has replayed. Which means that by
the time the restartpoint is performed it's actually not unlikely that
we've already filled all slru segments. Which is bad if we then fail
over/start up.

1. It would be possible to write a patch that included ONLY the
changes needed to make that happen, and did nothing else. It would be
largely a subset of this. If we want to change 9.3 and 9.4, I
recommend we do that first, and then come back to the rest of this.

2. I agree that what we're doing right now is wrong. And I agree that
this fixes a real problem. But it seems to me to be quite possible,
even likely, that it will create other problems.

For example, suppose that there are files in the data directory that
precede oldestMultiXact. In the current approach, we'll remove those
because they're not in the range we expect to be used. But in this
approach we no longer remove everything we think shouldn't be there.
We remove exactly the stuff we think should go away. As a general
principle, that's clearly superior. But in the back-branches, it
creates a risk: a leftover old file that doesn't get removed the first
time through - for whatever reason - becomes a time bomb that will
explode on the next wraparound. I don't know that that will happen.
But I sure as heck don't know that it won't happen with any combination
of the variously broken 9.3.X releases we've put out there. Even if
you can prove that particular risk never materializes to your
satisfaction and mine, I will bet you a beer that there are other
possible hazards neither of us is foreseeing right now.
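
The concern above can be illustrated with a toy model. This is only a sketch under stated assumptions: a hypothetical 16-segment circular space (NSEGS), a boolean array standing in for the files on disk, and made-up function names. It is not how slru.c or the patch is written.

```c
#include <assert.h>
#include <stdbool.h>

#define NSEGS 16				/* hypothetical, tiny circular segment space */

/* True if seg lies in the circular half-open range [start, end). */
bool
seg_in_range(int seg, int start, int end)
{
	assert(seg >= 0 && seg < NSEGS);
	assert(start >= 0 && start < NSEGS);
	assert(end >= 0 && end < NSEGS);
	return (seg - start + NSEGS) % NSEGS < (end - start + NSEGS) % NSEGS;
}

/* Scan style: drop every file outside the live range [oldest, next). */
int
remove_outside_live_range(bool present[NSEGS], int oldest, int next)
{
	int			removed = 0;

	for (int seg = 0; seg < NSEGS; seg++)
	{
		if (present[seg] && !seg_in_range(seg, oldest, next))
		{
			present[seg] = false;
			removed++;
		}
	}
	return removed;
}

/* Exact style: drop only the files in [prev_oldest, new_oldest). */
int
remove_exact_range(bool present[NSEGS], int prev_oldest, int new_oldest)
{
	int			removed = 0;

	for (int seg = 0; seg < NSEGS; seg++)
	{
		if (present[seg] && seg_in_range(seg, prev_oldest, new_oldest))
		{
			present[seg] = false;
			removed++;
		}
	}
	return removed;
}
```

With files 1, 5, 6 and 7 present and a live range of [6, 8), the scan-style removal drops both 1 and 5. The exact removal of [5, 6) drops only 5 and leaves the stray file 1 in place, which is the leftover that could be hit again after a wraparound.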

I realize that you disagree and will probably commit this to those
branches anyway. But I want it to be clear that I don't endorse that.

I don't plan to commit/backpatch this over your objection.

I'm not in a position to demand that you take my advice, but I'm
telling you what I think as honestly as I know how.

To be clear, I am fully in favor of making these changes (without the
legacy truncation stuff) in 9.5 and master, bumping WAL page magic so
that we invalidate any 9.5 alpha standbys. I think it's a far more
solid approach than what we've got right now, and it clearly
eliminates a host of dangers. In fact, I think it would be a pretty
stupid idea not to make these changes in those branches. It would be
doubling down on a design we know can never be made robust.

But I do not have confidence that we can change 9.4 and especially 9.3
without knock-on consequences. You may have that confidence. I most
definitely do not. My previous two rounds in the boxing ring with
this problem convinced me that (1) it's incredibly easy to break
things with well-intentioned changes in this area, (2) it's
practically impossible to foresee everything that might go wrong with
some screwy combination of versions, and (3) early 9.3.X releases are
in much worse shape than early 9.4.X releases, to the point where
guessing what any given variable is going to contain on 9.3.X is
essentially throwing darts at the wall. That's an awfully challenging
environment in which to write a bullet-proof patch.

- If SlruDeleteSegment fails in unlink(), shouldn't we at the very
least log a message? If that file is still there when we loop back
around, it's going to cause a failure, I think.

The existing unlink() call doesn't, that's the only reason I didn't add
a message there. I'm fine with adding a (LOG or WARNING?) message.

Cool.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#24Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#23)
Re: Rework the way multixact truncations work

On 2015-09-22 14:52:49 -0400, Robert Haas wrote:

1. It would be possible to write a patch that included ONLY the
changes needed to make that happen, and did nothing else. It would be
largely a subset of this. If we want to change 9.3 and 9.4, I
recommend we do that first, and then come back to the rest of this.

I think that patch would be pretty much what I wrote.

To be correct you basically have to:
1) Never skip a truncation on the standby. Otherwise there might
already have been a wraparound and you read a completely wrong
offset.
2) Always perform truncations on the standby exactly the same moment (in
the replay stream) as on the primary. Otherwise there also can be a
wraparound.
3) Never read anything from an SLRU from the data directory while
inconsistent. In an inconsistent state we can read completely wrong
data. A standby can be inconsistent in many situations, including
crashes, restarts and fresh base backups.

To me these three together leave only the option to never read an SLRU's
contents on a standby. That only leaves minor changes in the patch that
could be removed afaics. I mean we could leave in
DetermineSafeOldestOffset() but it'd be doing pretty much the same as
SetOffsetVacuumLimit().

I think we put at least three layers of bandaids on this issue since
9.3.2, and each layer made things more complicated. We primarily did so
because of the compatibility and complexity concerns. I think that was a
bad mistake. We should have done it mostly right back then, and we'd be
better off now. If we continue with bandaids on the back branches while
having a fixed 9.5+ with significantly different behaviour we'll have a
hellish time fixing things in the back branches. And introduce more bugs
than this might introduce.

2. I agree that what we're doing right now is wrong. And I agree that
this fixes a real problem. But it seems to me to be quite possible,
even likely, that it will create other problems.

Possible. But I think those bugs will be just bugs and not more
fundamental architectural problems.

To be very clear: I'm scared of the idea of backpatching this. I'm more
scared of doing that myself. But even more I am scared of the current
state.

For example, suppose that there are files in the data directory that
precede oldestMultiXact. In the current approach, we'll remove those
because they're not in the range we expect to be used.

Hm. For offsets/ we continue to use SimpleLruTruncate() for truncation,
which scans the directory, so I don't see a problem. For members/ we
won't - but neither do we really today, see
SlruScanDirCbRemoveMembers(). So I don't think there'll be a significant
difference?

a leftover old file that doesn't get removed the first time through -
for whatever reason - becomes a time bomb that will explode on the
next wraparound. I don't know that that will happen.

We should be able to deal with that, otherwise recovery is pretty
borked. It can be a problem for the 'recovery from wrong oldest multi'
case, but that's the same today.

I will bet you a beer that there are other possible hazards neither of
us is foreseeing right now.

Right. I'm not dismissing that. I just think it's much more likely to be
handleable problems than the set we have today. It's incredibly hard to
get an accurate mental model of the combined behaviour & state of
primary and standby today. Even if we three have that today, I'm pretty
sure we won't in half a year. And sure as hell nearly nobody else will
have one.

- If SlruDeleteSegment fails in unlink(), shouldn't we at the very
least log a message? If that file is still there when we loop back
around, it's going to cause a failure, I think.

The existing unlink() call doesn't, that's the only reason I didn't add
a message there. I'm fine with adding a (LOG or WARNING?) message.

Cool.

Hm. When redoing a truncation during [crash] recovery that can cause a
host of spurious warnings if already done before. DEBUG1 to avoid
scaring users?
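
A minimal sketch of the logging trade-off under discussion, not code from the patch: treat a missing file (ENOENT) as the quiet, expected case when a truncation is redone, and report any other unlink() failure loudly. The name delete_segment_file and the SegDeleteResult codes are hypothetical, and fprintf() merely stands in for the server's own logging.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch; not PostgreSQL API. */
typedef enum
{
	SEG_DELETED,				/* file existed and was removed */
	SEG_ALREADY_GONE,			/* ENOENT: expected when redoing a truncation */
	SEG_DELETE_FAILED			/* any other errno: file may still be there */
} SegDeleteResult;

SegDeleteResult
delete_segment_file(const char *path)
{
	assert(path != NULL);

	if (unlink(path) == 0)
		return SEG_DELETED;

	if (errno == ENOENT)
	{
		/* Quiet, DEBUG1-like path: nothing to scare users about. */
		return SEG_ALREADY_GONE;
	}

	/* Loud, LOG/WARNING-like path: the segment could not be removed. */
	fprintf(stderr, "could not remove file \"%s\": %s\n",
			path, strerror(errno));
	return SEG_DELETE_FAILED;
}
```

Splitting on errno keeps replayed truncations silent while still surfacing the case raised earlier in the thread, where a segment that cannot be unlinked is still present the next time around.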

Greetings,

Andres Freund

#25Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#24)
Re: Rework the way multixact truncations work

On 2015-09-23 01:24:31 +0200, Andres Freund wrote:

I think we put at least three layers of bandaids on this issue since
9.3.2, and each layer made things more complicated.

2a9b01928f193f529b885ac577051c4fd00bd427 - Cope with possible failure of the oldest MultiXact to exist.
5bbac7ec1b5754043e073a45454e4c257512ce30 - Advance the stop point for multixact offset creation only at checkpoint.
9a28c3752c89ec01fb8b28bb5904c6d547507fda - Have multixact be truncated by checkpoint, not vacuum
215ac4ad6589e0f6a31cc4cd867aedba3cd42924 - Truncate pg_multixact/'s contents during crash recovery

At least these are closely related to the fact that truncation isn't WAL
logged. There are more that are tangentially related. We (primarily me,
as author of the chronologically first one) should have gone for a new WAL record
from the start. We've discussed that in at least three of the threads
around the above commits...

#26Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#25)
Re: Rework the way multixact truncations work

On Tue, Sep 22, 2015 at 7:45 PM, Andres Freund <andres@anarazel.de> wrote:

On 2015-09-23 01:24:31 +0200, Andres Freund wrote:

I think we put at least three layers of bandaids on this issue since
9.3.2, and each layer made things more complicated.

2a9b01928f193f529b885ac577051c4fd00bd427 - Cope with possible failure of the oldest MultiXact to exist.
5bbac7ec1b5754043e073a45454e4c257512ce30 - Advance the stop point for multixact offset creation only at checkpoint.
9a28c3752c89ec01fb8b28bb5904c6d547507fda - Have multixact be truncated by checkpoint, not vacuum
215ac4ad6589e0f6a31cc4cd867aedba3cd42924 - Truncate pg_multixact/'s contents during crash recovery

At least these are closely related to the fact that truncation isn't WAL
logged. There are more that are tangentially related. We (primarily me,
as author of the chronologically first one) should have gone for a new WAL record
from the start. We've discussed that in at least three of the threads
around the above commits...

I'm not disagreeing with any of that. I'm just disagreeing with you
on the likelihood that we're going to make things better vs. making
them worse. But, really, I've said everything I have to say about
this. You have a commit bit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#27Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#26)
Re: Rework the way multixact truncations work

On 2015-09-22 20:14:11 -0400, Robert Haas wrote:

I'm not disagreeing with any of that. I'm just disagreeing with you
on the likelihood that we're going to make things better vs. making
them worse. But, really, I've said everything I have to say about
this. You have a commit bit.

I'm not going to push/backpatch this to 9.3/9.4 without you being on
board. For that I think you're unfortunately too often right, and this
is too critical. But I'm also not going to develop an alternative
stopgap for those versions, since I have no clue how that'd end up being
better.

The only alternative proposal I have right now is to push this to
9.5/9.6 (squashed with a followup patch removing legacy truncations) and
then push the patch including legacy stuff to 9.3/4 after the next set
of releases.

Greetings,

Andres Freund

#28Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#19)
Re: Rework the way multixact truncations work

@@ -1210,8 +1211,14 @@ restart:;
(void) SlruScanDirectory(ctl, SlruScanDirCbDeleteCutoff, &cutoffPage);
}

-void
-SlruDeleteSegment(SlruCtl ctl, char *filename)
+/*
+ * Delete an individual SLRU segment, identified by the filename.
+ *
+ * NB: This does not touch the SLRU buffers themselves, callers have to ensure
+ * they either can't yet contain anything, or have already been cleaned out.
+ */
+static void
+SlruInternalDeleteSegment(SlruCtl ctl, char *filename)
{
char		path[MAXPGPATH];

@@ -1222,6 +1229,64 @@ SlruDeleteSegment(SlruCtl ctl, char *filename)
}

/*
+ * Delete an individual SLRU segment, identified by the segment number.
+ */
+void
+SlruDeleteSegment(SlruCtl ctl, int segno)

Is it okay to change the ABI of SlruDeleteSegment?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#29Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#28)
Re: Rework the way multixact truncations work

On 2015-09-23 10:29:09 -0300, Alvaro Herrera wrote:

/*
+ * Delete an individual SLRU segment, identified by the segment number.
+ */
+void
+SlruDeleteSegment(SlruCtl ctl, int segno)

Is it okay to change the ABI of SlruDeleteSegment?

I think so. Any previous user of the API is going to be currently broken
anyway due to the missing flushing of buffers.
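
As a rough sketch of what identifying a segment by number rather than by filename involves: the callee, not the caller, derives the path. The body of the new SlruDeleteSegment() is not shown in this message, so the function below is hypothetical, and the "%04X" file naming is an assumption about how SLRU segment files are named.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "<dir>/<segno as 4 hex digits>" into path.
 * Returns 0 on success, -1 if the result did not fit into pathlen bytes.
 * Hypothetical helper for illustration only.
 */
int
segment_path(char *path, size_t pathlen, const char *dir, int segno)
{
	int			n;

	assert(path != NULL && dir != NULL && segno >= 0);

	n = snprintf(path, pathlen, "%s/%04X", dir, (unsigned int) segno);
	return (n < 0 || (size_t) n >= pathlen) ? -1 : 0;
}
```

Under that assumption, segment 0x1A in pg_multixact/members maps to pg_multixact/members/001A. Callers then no longer need to know the naming scheme, which is what changes the function's signature.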

Andres

#30Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#19)
Re: Rework the way multixact truncations work

The comment on top of TrimMultiXact states that "no locks are needed
here", but then goes on to grab a few locks. I think we should remove
the comment, or rephrase it to state that we still grab them for
consistency or whatever; or perhaps even remove the lock acquisitions.
(I think the comment is still true: by the time TrimMultiXact runs,
we're out of recovery but not yet running, so it's not possible for
anyone to try to do anything multixact-related.)

I wonder if it would be cleaner to move the setting of finishedStartup
down to just before calling SetMultiXactIdLimit, instead of at the top
of the function.

It's a bit odd that SetMultiXactIdLimit has the "finishedStartup" test
so low. Why bother setting all those local variables only to bail out?
I think it would make more sense to just do it at the top. The only
thing you lose AFAICS is that elog(DEBUG1) message -- is that worth it?
Also, the fact that finishedStartup itself is read without a lock at
least merits a comment.

In MultiXactAdvanceOldest, the test for sawTruncationInCkptCycle seems
reversed?
    if (!MultiXactState->sawTruncationInCkptCycle)
surely we should be doing truncation if it's set?

Honestly, I wonder whether this message
    ereport(LOG,
            (errmsg("performing legacy multixact truncation"),
             errdetail("Legacy truncations are sometimes performed when replaying WAL from an older primary."),
             errhint("Upgrade the primary, it is susceptible to data corruption.")));
shouldn't rather be a PANIC. (The main reason not to, I think, is that
once you see this, there is no way to put the standby in a working state
without recloning).

I think the prevOldestOffsetKnown test in line 2667 ("if we failed to
get ...") is better expressed as an else-if of the previous "if" block.

I think the two "there are NO MultiXacts" cases in TruncateMultiXact
would benefit in readability from adding braces around the lone
statement (and moving the comment to the line prior).

If the find_multixact_start(oldestMulti) call in TruncateMultiXact
fails, what recourse does the user have? I wonder if the elog() should
be a FATAL instead of just LOG. It's not like it would work on a
subsequent run, is it?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#31Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#30)
Re: Rework the way multixact truncations work

On 2015-09-23 15:03:05 -0300, Alvaro Herrera wrote:

The comment on top of TrimMultiXact states that "no locks are needed
here", but then goes on to grab a few locks.

Hm. Yea. Although that was the case before.

It's a bit odd that SetMultiXactIdLimit has the "finishedStartup" test
so low. Why bother setting all those local variables only to bail
out?

Hm. Doesn't seem to matter much to me, but I can change it.

In MultiXactAdvanceOldest, the test for sawTruncationInCkptCycle seems
reversed?
    if (!MultiXactState->sawTruncationInCkptCycle)
surely we should be doing truncation if it's set?

No, that's correct. If there was a checkpoint cycle where oldestMulti
advanced without seeing a truncation record we need to perform a legacy
truncation.
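
The rule described here can be sketched as a small state machine. This is a toy model, not the patch itself: ReplayState and the function names are hypothetical stand-ins for MultiXactState and the redo paths, and the comparison relies on the usual conversion of an unsigned 32-bit difference to a signed one to cope with wraparound.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct
{
	uint32_t	oldest_multi;	/* oldest multixact known to the standby */
	bool		saw_truncation_in_cycle;	/* truncation record replayed
											 * since the last checkpoint? */
} ReplayState;					/* hypothetical stand-in for MultiXactState */

/* Wraparound-aware "a precedes b" on 32-bit multixact IDs. */
bool
multi_precedes(uint32_t a, uint32_t b)
{
	return (int32_t) (a - b) < 0;
}

/* Replaying a truncation WAL record from a new-style primary. */
void
replay_truncation_record(ReplayState *st)
{
	st->saw_truncation_in_cycle = true;
}

/*
 * Replaying a checkpoint record that carries the primary's oldestMulti.
 * Returns true when a legacy truncation would have to be performed.
 */
bool
replay_checkpoint(ReplayState *st, uint32_t oldest_multi)
{
	bool		legacy = false;

	if (multi_precedes(st->oldest_multi, oldest_multi))
	{
		/* oldestMulti moved, but nobody logged the truncation: old primary */
		legacy = !st->saw_truncation_in_cycle;
		st->oldest_multi = oldest_multi;
	}

	/* start a new checkpoint cycle */
	st->saw_truncation_in_cycle = false;
	return legacy;
}
```

The flag is therefore tested negated on purpose: a set flag means the primary already logged the truncation and the standby replayed it, so only an unset flag combined with an advancing oldestMulti triggers the legacy path.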

Honestly, I wonder whether this message
    ereport(LOG,
            (errmsg("performing legacy multixact truncation"),
             errdetail("Legacy truncations are sometimes performed when replaying WAL from an older primary."),
             errhint("Upgrade the primary, it is susceptible to data corruption.")));
shouldn't rather be a PANIC. (The main reason not to, I think, is that
once you see this, there is no way to put the standby in a working state
without recloning).

Huh? The behaviour in that case is still better than what we have in
9.3+ today (not delayed till the restartpoint). Don't see why that
should be a panic. That'd imo make it pretty much impossible to upgrade
a primary/standby pair where you normally upgrade the standby first?

This is all moot given Robert's objection to backpatching this to
9.3/4.

If the find_multixact_start(oldestMulti) call in TruncateMultiXact
fails, what recourse does the user have? I wonder if the elog() should
be a FATAL instead of just LOG. It's not like it would work on a
subsequent run, is it?

It currently only LOGs, I don't want to change that. In the cases where we
currently know it's possible to hit this, it should be fixed by the next
set of emergency autovacuums (which we trigger).

Thanks for the look,

Andres

#32Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#31)
Re: Rework the way multixact truncations work

Andres Freund wrote:

On 2015-09-23 15:03:05 -0300, Alvaro Herrera wrote:

Honestly, I wonder whether this message
    ereport(LOG,
            (errmsg("performing legacy multixact truncation"),
             errdetail("Legacy truncations are sometimes performed when replaying WAL from an older primary."),
             errhint("Upgrade the primary, it is susceptible to data corruption.")));
shouldn't rather be a PANIC. (The main reason not to, I think, is that
once you see this, there is no way to put the standby in a working state
without recloning).

Huh? The behaviour in that case is still better than what we have in
9.3+ today (not delayed till the restartpoint). Don't see why that
should be a panic. That'd imo make it pretty much impossible to upgrade
a primary/standby pair where you normally upgrade the standby first?

This is all moot given Robert's objection to backpatching this to
9.3/4.

I think we need to make a decision here. Is this a terribly serious
bug/misdesign that needs addressing? If so, we need to backpatch. If
not, then by all means let's leave it alone. I don't think it is a good
idea to leave it open if we think it's serious, which is what I think is
happening.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#33Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#32)
Re: Rework the way multixact truncations work

On 2015-09-23 15:57:02 -0300, Alvaro Herrera wrote:

I think we need to make a decision here. Is this a terribly serious
bug/misdesign that needs addressing?

Imo yes. Not sure about terribly, but definitely serious. It's several
data loss bugs in one package.

If so, we need to backpatch. If not, then by all means let's leave it
alone. I don't think it is a good idea to leave it open if we think
it's serious, which is what I think is happening.

Right, but I don't want to backpatch this over an objection, and it
doesn't seem like I have a chance to convince Robert that it'd be a good
idea. So it'll be 9.5+master for now.

Greetings,

Andres Freund

#34Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#30)
3 attachment(s)
Re: Rework the way multixact truncations work

Hi,

On 2015-09-23 15:03:05 -0300, Alvaro Herrera wrote:

I wonder if it would be cleaner to move the setting of finishedStartup
down to just before calling SetMultiXactIdLimit, instead of at the top
of the function.

Done. I don't think it makes much of a difference, but there's really no
reason not to change it.

It's a bit odd that SetMultiXactIdLimit has the "finishedStartup" test
so low. Why bother setting all those local variables only to bail
out?

But we do more than set local variables? We actually set some in-memory
limits (MultiXactState->multiVacLimit et al). What we can't do is
start up the members wraparound protection, because that requires
accessing the SLRUs.

Perhaps we should, independently of this patch really, rename
SetOffsetVacuumLimit() - it may be rather confusing that it actually is
about members/? The current name is correct, but also a bit
confusing. ComputeMembersVacuumLimits()?

In MultiXactAdvanceOldest, the test for sawTruncationInCkptCycle seems
reversed?
    if (!MultiXactState->sawTruncationInCkptCycle)
surely we should be doing truncation if it's set?

I wanted to add a comment explaining this, but the existing comment seems to
do a fair job of that:
    /*
     * If there has been a truncation on the master, detected by seeing a
     * moving oldestMulti, without a corresponding truncation record, we
     * know that the primary is still running an older version of postgres
     * that doesn't yet log multixact truncations. So perform the
     * truncation ourselves.
     */

I've done some additional comment smithing.

Attached is 0002 (prev 0003) including the legacy truncation support,
and 0003 removing that and bumping page magic. I'm slightly inclined to
commit them separately (to 9.5 & master) so that we have something to
backpatch from.

Greetings,

Andres Freund

Attachments:

0003-Remove-legacy-multixact-truncation-support.patch (text/x-patch; charset=us-ascii)
From b407a8ca516f7693d6a5166b06e7d9137c7c5d50 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 24 Sep 2015 15:50:01 +0200
Subject: [PATCH 3/3] Remove legacy multixact truncation support.

---
 src/backend/access/transam/multixact.c | 77 +++++++---------------------------
 src/backend/commands/vacuum.c          |  2 +-
 src/include/access/multixact.h         |  2 +-
 src/include/access/xlog_internal.h     |  2 +-
 4 files changed, 17 insertions(+), 66 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index d19b4c2..43f6e99 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -220,14 +220,6 @@ typedef struct MultiXactStateData
 	MultiXactOffset oldestOffset;
 	bool		oldestOffsetKnown;
 
-	/*
-	 * True if a multixact truncation WAL record was replayed since the last
-	 * checkpoint. This is used to trigger 'legacy truncations', i.e. truncate
-	 * by looking at the data directory during WAL replay, when the primary is
-	 * too old to generate truncation records.
-	 */
-	bool		sawTruncationInCkptCycle;
-
 	/* support for anti-wraparound measures */
 	MultiXactId multiVacLimit;
 	MultiXactId multiWarnLimit;
@@ -2384,28 +2376,7 @@ MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB)
 	Assert(InRecovery);
 
 	if (MultiXactIdPrecedes(MultiXactState->oldestMultiXactId, oldestMulti))
-	{
-		/*
-		 * If there has been a truncation on the master, detected by seeing a
-		 * moving oldestMulti, without a corresponding truncation record, we
-		 * know that the primary is still running an older version of postgres
-		 * that doesn't yet log multixact truncations. So perform the
-		 * truncation ourselves.
-		 */
-		if (!MultiXactState->sawTruncationInCkptCycle)
-		{
-			ereport(LOG,
-					(errmsg("performing legacy multixact truncation"),
-					 errdetail("Legacy truncations are sometimes performed when replaying WAL from an older primary."),
-					 errhint("Upgrade the primary, it is susceptible to data corruption.")));
-			TruncateMultiXact(oldestMulti, oldestMultiDB, true);
-		}
-
 		SetMultiXactIdLimit(oldestMulti, oldestMultiDB);
-	}
-
-	/* only looked at in the startup process, no lock necessary */
-	MultiXactState->sawTruncationInCkptCycle = false;
 }
 
 /*
@@ -2750,8 +2721,7 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	int			slotno;
 	MultiXactOffset *offptr;
 
-	/* XXX: Remove || AmStartupProcess() after WAL page magic bump */
-	Assert(MultiXactState->finishedStartup || AmStartupProcess());
+	Assert(MultiXactState->finishedStartup);
 
 	pageno = MultiXactIdToOffsetPage(multi);
 	entryno = MultiXactIdToOffsetEntry(multi);
@@ -2944,18 +2914,15 @@ PerformOffsetsTruncation(MultiXactId oldestMulti, MultiXactId newOldestMulti)
  * Remove all MultiXactOffset and MultiXactMember segments before the oldest
  * ones still of interest.
  *
- * On a primary this is called as part of vacuum (via
- * vac_truncate_clog()). During recovery truncation is normally done by
- * replaying truncation WAL records instead of this routine; the exception is
- * when replaying records from an older primary that doesn't yet generate
- * truncation WAL records. In that case truncation is triggered by
- * MultiXactAdvanceOldest().
+ * This is only called on a primary as part of vacuum (via
+ * vac_truncate_clog()). During recovery truncation is done by replaying
+ * truncation WAL records logged here.
  *
  * newOldestMulti is the oldest currently required multixact, newOldestMultiDB
  * is one of the databases preventing newOldestMulti from increasing.
  */
 void
-TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB, bool in_recovery)
+TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 {
 	MultiXactId oldestMulti;
 	MultiXactId nextMulti;
@@ -2965,13 +2932,8 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB, bool in_reco
 	mxtruncinfo trunc;
 	MultiXactId earliest;
 
-	/*
-	 * Need to allow being called in recovery for backwards compatibility,
-	 * when an updated standby replays WAL generated by a non-updated primary.
-	 */
-	Assert(in_recovery || !RecoveryInProgress());
-	Assert(!in_recovery || AmStartupProcess());
-	Assert(in_recovery || MultiXactState->finishedStartup);
+	Assert(!RecoveryInProgress());
+	Assert(MultiXactState->finishedStartup);
 
 	/*
 	 * We can only allow one truncation to happen at once. Otherwise parts of
@@ -3084,22 +3046,15 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB, bool in_reco
 	 * Prevent checkpoints from being scheduled concurrently. This is critical
 	 * because otherwise a truncation record might not be replayed after a
 	 * crash/basebackup, even though the state of the data directory would
-	 * require it.  It's not possible (startup process doesn't have a PGXACT
-	 * entry), and not needed, to do this during recovery, when performing an
-	 * old-style truncation, though. There the entire scheduling depends on
-	 * the replayed WAL records which be the same after a possible crash.
+	 * require it.
 	 */
-	if (!in_recovery)
-	{
-		Assert(!MyPgXact->delayChkpt);
-		MyPgXact->delayChkpt = true;
-	}
+	Assert(!MyPgXact->delayChkpt);
+	MyPgXact->delayChkpt = true;
 
 	/* WAL log truncation */
-	if (!in_recovery)
-		WriteMTruncateXlogRec(newOldestMultiDB,
-							  oldestMulti, newOldestMulti,
-							  oldestOffset, newOldestOffset);
+	WriteMTruncateXlogRec(newOldestMultiDB,
+						  oldestMulti, newOldestMulti,
+						  oldestOffset, newOldestOffset);
 
 	/*
 	 * Update in-memory limits before performing the truncation, while inside
@@ -3120,8 +3075,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB, bool in_reco
 	/* Then offsets */
 	PerformOffsetsTruncation(oldestMulti, newOldestMulti);
 
-	if (!in_recovery)
-		MyPgXact->delayChkpt = false;
+	MyPgXact->delayChkpt = false;
 
 	END_CRIT_SECTION();
 	LWLockRelease(MultiXactTruncationLock);
@@ -3369,9 +3323,6 @@ multixact_redo(XLogReaderState *record)
 		PerformOffsetsTruncation(xlrec.startTruncOff, xlrec.endTruncOff);
 
 		LWLockRelease(MultiXactTruncationLock);
-
-		/* only looked at in the startup process, no lock necessary */
-		MultiXactState->sawTruncationInCkptCycle = true;
 	}
 	else
 		elog(PANIC, "multixact_redo: unknown op code %u", info);
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 698bb35..6d55148 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1141,7 +1141,7 @@ vac_truncate_clog(TransactionId frozenXID,
 	 */
 	TruncateCLOG(frozenXID);
 	TruncateCommitTs(frozenXID, true);
-	TruncateMultiXact(minMulti, minmulti_datoid, false);
+	TruncateMultiXact(minMulti, minmulti_datoid);
 
 	/*
 	 * Update the wrap limit for GetNewTransactionId and creation of new
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 47ef38d..6ef8ba9 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -136,7 +136,7 @@ extern void MultiXactGetCheckptMulti(bool is_shutdown,
 						 Oid *oldestMultiDB);
 extern void CheckPointMultiXact(void);
 extern MultiXactId GetOldestMultiXactId(void);
-extern void TruncateMultiXact(MultiXactId oldestMulti, Oid oldestMultiDB, bool in_recovery);
+extern void TruncateMultiXact(MultiXactId oldestMulti, Oid oldestMultiDB);
 extern void MultiXactSetNextMXact(MultiXactId nextMulti,
 					  MultiXactOffset nextMultiOffset);
 extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 590bf17..5096c17 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -31,7 +31,7 @@
 /*
  * Each page of XLOG file has a header like this:
  */
-#define XLOG_PAGE_MAGIC 0xD086	/* can be used as WAL version indicator */
+#define XLOG_PAGE_MAGIC 0xD087	/* can be used as WAL version indicator */
 
 typedef struct XLogPageHeaderData
 {
-- 
2.5.0.400.gff86faf

0001-WIP-dontcommit-Add-functions-to-burn-multixacts.patch (text/x-patch; charset=us-ascii)
From f8cf132251623a2518f1ff479618bf7ed1363eb3 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 4 Jun 2015 19:38:32 +0200
Subject: [PATCH 1/3] WIP-dontcommit: Add functions to burn multixacts

This should live in its own module, but we don't have that yet.
---
 contrib/pageinspect/heapfuncs.c          | 43 ++++++++++++++++++++++++++++++++
 contrib/pageinspect/pageinspect--1.3.sql |  6 +++++
 src/backend/access/heap/heapam.c         |  2 +-
 src/backend/access/transam/multixact.c   | 15 ++++++-----
 src/include/access/multixact.h           |  3 ++-
 5 files changed, 61 insertions(+), 8 deletions(-)

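The member array that pg_burn_multixact() builds can be illustrated outside the server. The following is a minimal sketch, assuming 32-bit transaction IDs and that IDs 0 through 2 are reserved (as in transam.h); fill_member_xids() is a hypothetical helper written for this note, not something the patch adds. It shows how the loop starts `size` IDs before the next xid and jumps to FirstNormalTransactionId when the sequence wraps into the reserved range.

```c
#include <stdint.h>

/* Stand-ins for the PostgreSQL definitions (assumed to match transam.h). */
typedef uint32_t TransactionId;
#define FirstNormalTransactionId ((TransactionId) 3)
#define TransactionIdIsNormal(xid) ((xid) >= FirstNormalTransactionId)

/*
 * Hypothetical helper: fill out[0..size-1] with consecutive xids starting
 * at next_xid - size.  Unsigned subtraction may wrap around; whenever the
 * sequence reaches a reserved xid, restart at FirstNormalTransactionId.
 */
static void
fill_member_xids(TransactionId next_xid, int size, TransactionId *out)
{
	TransactionId id = next_xid - (TransactionId) size;
	int			i;

	for (i = 0; i < size; i++)
	{
		out[i] = id++;

		if (!TransactionIdIsNormal(out[i]))
		{
			/* skip the reserved xids 0..2 */
			id = FirstNormalTransactionId;
			out[i] = id++;
		}
	}
}
```

With next_xid = 2 and size = 4 the sequence begins just below the 32-bit wraparound point and continues at 3 once it would otherwise hit xid 0.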
diff --git a/contrib/pageinspect/heapfuncs.c b/contrib/pageinspect/heapfuncs.c
index 8d1666c..7a3aa14 100644
--- a/contrib/pageinspect/heapfuncs.c
+++ b/contrib/pageinspect/heapfuncs.c
@@ -29,6 +29,8 @@
 #include "funcapi.h"
 #include "utils/builtins.h"
 #include "miscadmin.h"
+#include "access/multixact.h"
+#include "access/transam.h"
 
 
 /*
@@ -223,3 +225,44 @@ heap_page_items(PG_FUNCTION_ARGS)
 	else
 		SRF_RETURN_DONE(fctx);
 }
+
+extern Datum
+pg_burn_multixact(PG_FUNCTION_ARGS);
+PG_FUNCTION_INFO_V1(pg_burn_multixact);
+
+Datum
+pg_burn_multixact(PG_FUNCTION_ARGS)
+{
+	int		rep = PG_GETARG_INT32(0);
+	int		size = PG_GETARG_INT32(1);
+	MultiXactMember *members;
+	MultiXactId ret;
+	TransactionId id = ReadNewTransactionId() - size;
+	int		i;
+
+	if (rep < 1)
+		elog(ERROR, "need to burn, burn, burn");
+
+	members = palloc(size * sizeof(MultiXactMember));
+	for (i = 0; i < size; i++)
+	{
+		members[i].xid = id++;
+		members[i].status = MultiXactStatusForShare;
+
+		if (!TransactionIdIsNormal(members[i].xid))
+		{
+			id = FirstNormalTransactionId;
+			members[i].xid = id++;
+		}
+	}
+
+	MultiXactIdSetOldestMember();
+
+	for (i = 0; i < rep; i++)
+	{
+		CHECK_FOR_INTERRUPTS();
+		ret = MultiXactIdCreateFromMembers(size, members, true);
+	}
+
+	PG_RETURN_INT64((int64) ret);
+}
diff --git a/contrib/pageinspect/pageinspect--1.3.sql b/contrib/pageinspect/pageinspect--1.3.sql
index a99e058..22f51bc 100644
--- a/contrib/pageinspect/pageinspect--1.3.sql
+++ b/contrib/pageinspect/pageinspect--1.3.sql
@@ -187,3 +187,9 @@ CREATE FUNCTION gin_leafpage_items(IN page bytea,
 RETURNS SETOF record
 AS 'MODULE_PATHNAME', 'gin_leafpage_items'
 LANGUAGE C STRICT;
+
+
+CREATE FUNCTION pg_burn_multixact(num int4, size int4)
+RETURNS int8
+AS 'MODULE_PATHNAME', 'pg_burn_multixact'
+LANGUAGE C STRICT;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index bcf9871..e167684 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6099,7 +6099,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
 		 * Create a new multixact with the surviving members of the previous
 		 * one, to set as new Xmax in the tuple.
 		 */
-		xid = MultiXactIdCreateFromMembers(nnewmembers, newmembers);
+		xid = MultiXactIdCreateFromMembers(nnewmembers, newmembers, false);
 		*flags |= FRM_RETURN_IS_MULTI;
 	}
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 1933a87..34c5370 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -407,7 +407,7 @@ MultiXactIdCreate(TransactionId xid1, MultiXactStatus status1,
 	members[1].xid = xid2;
 	members[1].status = status2;
 
-	newMulti = MultiXactIdCreateFromMembers(2, members);
+	newMulti = MultiXactIdCreateFromMembers(2, members, false);
 
 	debug_elog3(DEBUG2, "Create: %s",
 				mxid_to_string(newMulti, 2, members));
@@ -473,7 +473,7 @@ MultiXactIdExpand(MultiXactId multi, TransactionId xid, MultiXactStatus status)
 		 */
 		member.xid = xid;
 		member.status = status;
-		newMulti = MultiXactIdCreateFromMembers(1, &member);
+		newMulti = MultiXactIdCreateFromMembers(1, &member, false);
 
 		debug_elog4(DEBUG2, "Expand: %u has no members, create singleton %u",
 					multi, newMulti);
@@ -525,7 +525,7 @@ MultiXactIdExpand(MultiXactId multi, TransactionId xid, MultiXactStatus status)
 
 	newMembers[j].xid = xid;
 	newMembers[j++].status = status;
-	newMulti = MultiXactIdCreateFromMembers(j, newMembers);
+	newMulti = MultiXactIdCreateFromMembers(j, newMembers, false);
 
 	pfree(members);
 	pfree(newMembers);
@@ -744,7 +744,7 @@ ReadNextMultiXactId(void)
  * NB: the passed members[] array will be sorted in-place.
  */
 MultiXactId
-MultiXactIdCreateFromMembers(int nmembers, MultiXactMember *members)
+MultiXactIdCreateFromMembers(int nmembers, MultiXactMember *members, bool nocache)
 {
 	MultiXactId multi;
 	MultiXactOffset offset;
@@ -763,7 +763,9 @@ MultiXactIdCreateFromMembers(int nmembers, MultiXactMember *members)
 	 * corner cases where someone else added us to a MultiXact without our
 	 * knowledge, but it's not worth checking for.)
 	 */
-	multi = mXactCacheGetBySet(nmembers, members);
+	multi = nocache ? InvalidMultiXactId :
+		mXactCacheGetBySet(nmembers, members);
+
 	if (MultiXactIdIsValid(multi))
 	{
 		debug_elog2(DEBUG2, "Create: in cache!");
@@ -836,7 +838,8 @@ MultiXactIdCreateFromMembers(int nmembers, MultiXactMember *members)
 	END_CRIT_SECTION();
 
 	/* Store the new MultiXactId in the local cache, too */
-	mXactCachePut(multi, nmembers, members);
+	if (!nocache)
+		mXactCachePut(multi, nmembers, members);
 
 	debug_elog2(DEBUG2, "Create: all done");
 
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index f1448fe..6213f8a 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -86,10 +86,11 @@ typedef struct xl_multixact_create
 extern MultiXactId MultiXactIdCreate(TransactionId xid1,
 				  MultiXactStatus status1, TransactionId xid2,
 				  MultiXactStatus status2);
+extern MultiXactId CreateMultiXactId(int nmembers, MultiXactMember *members, bool nocache);
 extern MultiXactId MultiXactIdExpand(MultiXactId multi, TransactionId xid,
 				  MultiXactStatus status);
 extern MultiXactId MultiXactIdCreateFromMembers(int nmembers,
-							 MultiXactMember *members);
+							 MultiXactMember *members, bool nocache);
 
 extern MultiXactId ReadNextMultiXactId(void);
 extern bool MultiXactIdIsRunning(MultiXactId multi, bool isLockOnly);
-- 
2.5.0.400.gff86faf

0002-Rework-the-way-multixact-truncations-work.patch (text/x-patch; charset=us-ascii)
From 436fe7c732d22b1b53b85ddf6e7cf75c1e1ff685 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 22 Sep 2015 15:17:09 +0200
Subject: [PATCH 2/3] Rework the way multixact truncations work.

The fact that multixact truncations are not WAL logged has caused a fair
share of problems. Among other things, it requires performing
computations during recovery while the database is not in a consistent
state, delaying truncations until checkpoints, and handling the case
where members have been truncated but offsets have not.

We tried to put bandaids on many of these issues over the last few years,
but it seems time to change course. Thus this patch introduces WAL
logging for truncation, even in the back branches.

This allows us:
1) to perform the truncation directly during VACUUM, instead of delaying
   it to the checkpoint.
2) to avoid looking at the offsets SLRU for truncation during recovery;
   we can just use the master's values.
3) to simplify a fair amount of the logic that keeps the in-memory
   limits straight.

During the course of fixing this a bunch of bugs had to be fixed:
1) Data was not purged from the in-memory buffers of the members SLRU
   before deleting segments. This happened to be hard or impossible to
   hit due to the interlock between checkpoints and truncation.
2) find_multixact_start() relied on SimpleLruDoesPhysicalPageExist(),
   but that doesn't work for offsets that haven't yet been flushed to
   disk. The fix is to flush the SLRU before calling it. Not pretty, but
   it feels slightly safer to only make decisions based on on-disk state.
3) find_multixact_start() could be called concurrently with a truncation
   and thus fail. Via SetOffsetVacuumLimit() that could lead to a round
   of emergency vacuuming. The problem remains in
   pg_get_multixact_members(), but that's quite harmless.

To handle the case of an updated standby replaying WAL from a not-yet
upgraded primary we have to recognize that situation and use "old style"
truncation (i.e. looking at the SLRUs) during WAL replay. In contrast to
before, this now happens in the startup process, when replaying a
checkpoint record, instead of in the checkpointer. Doing this in the
restartpoint was incorrect: restartpoints can happen much later than the
original checkpoint, thereby leading to wraparound. It's also more in
line with how the WAL logging now works.

To avoid "multixact_redo: unknown op code 48" errors, standbys should be
upgraded before primaries. This needs to be expressed clearly in the
release notes.

WIP: Backpatch to 9.3, where the use of multixacts was expanded. Arguably
this could be backpatched further, but there doesn't seem to be
sufficient benefit to outweigh the risk of applying a significantly
different patch there.

Discussion: 20150621192409.GA4797@alap3.anarazel.de
Reviewed-By: Robert Haas, Alvaro Herrera, Thomas Munro
Backpatch: probably-not-9.3-but-9.5
---
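The "unknown op code 48" error mentioned in the commit message follows from how a redo routine dispatches on the record's info bits. Below is a minimal sketch, assuming the op code values 0x00, 0x10, 0x20 and 0x30 and an XLR_INFO_MASK of 0x0F (these are recalled from the headers, not quoted from this patch); redo_knows_record() is a hypothetical stand-in for the dispatch in multixact_redo(). A standby that has not been upgraded has no branch for the truncation record and would reach the PANIC path.

```c
#include <stdint.h>

/* Assumed values; the low four bits are reserved for the WAL machinery. */
#define XLR_INFO_MASK					0x0F
#define XLOG_MULTIXACT_ZERO_OFF_PAGE	0x00
#define XLOG_MULTIXACT_ZERO_MEM_PAGE	0x10
#define XLOG_MULTIXACT_CREATE_ID		0x20
#define XLOG_MULTIXACT_TRUNCATE_ID		0x30	/* decimal 48 */

/*
 * Hypothetical model of the redo dispatch: returns 1 if the record type is
 * handled, 0 if the real code would elog(PANIC, "unknown op code").
 * understands_truncate distinguishes an upgraded standby (1) from an old
 * one (0).
 */
static int
redo_knows_record(uint8_t xl_info, int understands_truncate)
{
	uint8_t		info = (uint8_t) (xl_info & ~XLR_INFO_MASK);

	if (info == XLOG_MULTIXACT_ZERO_OFF_PAGE ||
		info == XLOG_MULTIXACT_ZERO_MEM_PAGE ||
		info == XLOG_MULTIXACT_CREATE_ID)
		return 1;
	if (info == XLOG_MULTIXACT_TRUNCATE_ID)
		return understands_truncate;
	return 0;
}
```

This is why the upgrade order matters: an upgraded standby can replay WAL from an old primary, but an old standby cannot replay truncation records from an upgraded primary.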
 src/backend/access/rmgrdesc/mxactdesc.c  |  11 +
 src/backend/access/transam/multixact.c   | 700 ++++++++++++++++++-------------
 src/backend/access/transam/slru.c        |  83 +++-
 src/backend/access/transam/xlog.c        |  53 +--
 src/backend/commands/vacuum.c            |   4 +-
 src/backend/storage/lmgr/lwlocknames.txt |   1 +
 src/include/access/multixact.h           |  19 +-
 src/include/access/slru.h                |   4 +-
 src/tools/pgindent/typedefs.list         |   1 +
 9 files changed, 533 insertions(+), 343 deletions(-)

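The member stop limit that SetOffsetVacuumLimit() now installs is plain modular arithmetic on 32-bit offsets. The sketch below assumes the default 8kB block size, for which MULTIXACT_MEMBERS_PER_PAGE should be 1636 and SLRU_PAGES_PER_SEGMENT is 32; compute_offset_stop_limit() is a hypothetical standalone function mirroring the two steps in the patch (round down to the segment start, then back off one more segment).

```c
#include <stdint.h>

typedef uint32_t MultiXactOffset;

/* Assumed values for the default 8kB block size. */
#define MULTIXACT_MEMBERS_PER_PAGE	1636
#define SLRU_PAGES_PER_SEGMENT		32
#define MEMBERS_PER_SEGMENT \
	(MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT)

/*
 * Hypothetical helper: given the offset of the oldest member still needed,
 * return the offset at which new member allocation has to stop.
 */
static MultiXactOffset
compute_offset_stop_limit(MultiXactOffset oldestOffset)
{
	MultiXactOffset limit;

	/* move back to start of the corresponding segment */
	limit = oldestOffset - (oldestOffset % MEMBERS_PER_SEGMENT);

	/* always leave one segment before the wraparound point */
	limit -= MEMBERS_PER_SEGMENT;

	return limit;
}
```

Because the offsets are unsigned, an oldest offset inside the very first segment yields a limit one segment below 2^32, which is the intended wraparound behaviour rather than an error.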
diff --git a/src/backend/access/rmgrdesc/mxactdesc.c b/src/backend/access/rmgrdesc/mxactdesc.c
index 572951e..5b8134f 100644
--- a/src/backend/access/rmgrdesc/mxactdesc.c
+++ b/src/backend/access/rmgrdesc/mxactdesc.c
@@ -70,6 +70,14 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 		for (i = 0; i < xlrec->nmembers; i++)
 			out_member(buf, &xlrec->members[i]);
 	}
+	else if (info == XLOG_MULTIXACT_TRUNCATE_ID)
+	{
+		xl_multixact_truncate *xlrec = (xl_multixact_truncate *) rec;
+
+		appendStringInfo(buf, "offsets [%u, %u), members [%u, %u)",
+						 xlrec->startTruncOff, xlrec->endTruncOff,
+						 xlrec->startTruncMemb, xlrec->endTruncMemb);
+	}
 }
 
 const char *
@@ -88,6 +96,9 @@ multixact_identify(uint8 info)
 		case XLOG_MULTIXACT_CREATE_ID:
 			id = "CREATE_ID";
 			break;
+		case XLOG_MULTIXACT_TRUNCATE_ID:
+			id = "TRUNCATE_ID";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 34c5370..d19b4c2 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -49,9 +49,7 @@
  * value is removed; the cutoff value is stored in pg_class.  The minimum value
  * across all tables in each database is stored in pg_database, and the global
  * minimum across all databases is part of pg_control and is kept in shared
- * memory.  At checkpoint time, after the value is known flushed in WAL, any
- * files that correspond to multixacts older than that value are removed.
- * (These files are also removed when a restartpoint is executed.)
+ * memory.  Whenever that minimum is advanced, the SLRUs are truncated.
  *
  * When new multixactid values are to be created, care is taken that the
  * counter does not fall within the wraparound horizon considering the global
@@ -83,6 +81,7 @@
 #include "postmaster/autovacuum.h"
 #include "storage/lmgr.h"
 #include "storage/pmsignal.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
@@ -109,6 +108,7 @@
 	((xid) / (MultiXactOffset) MULTIXACT_OFFSETS_PER_PAGE)
 #define MultiXactIdToOffsetEntry(xid) \
 	((xid) % (MultiXactOffset) MULTIXACT_OFFSETS_PER_PAGE)
+#define MultiXactIdToOffsetSegment(xid) (MultiXactIdToOffsetPage(xid) / SLRU_PAGES_PER_SEGMENT)
 
 /*
  * The situation for members is a bit more complex: we store one byte of
@@ -153,6 +153,7 @@
 
 /* page in which a member is to be found */
 #define MXOffsetToMemberPage(xid) ((xid) / (TransactionId) MULTIXACT_MEMBERS_PER_PAGE)
+#define MXOffsetToMemberSegment(xid) (MXOffsetToMemberPage(xid) / SLRU_PAGES_PER_SEGMENT)
 
 /* Location (byte offset within page) of flag word for a given member */
 #define MXOffsetToFlagsOffset(xid) \
@@ -212,19 +213,20 @@ typedef struct MultiXactStateData
 	Oid			oldestMultiXactDB;
 
 	/*
-	 * Oldest multixact offset that is potentially referenced by a
-	 * multixact referenced by a relation.  We don't always know this value,
-	 * so there's a flag here to indicate whether or not we currently do.
+	 * Oldest multixact offset that is potentially referenced by a multixact
+	 * referenced by a relation.  We don't always know this value, so there's
+	 * a flag here to indicate whether or not we currently do.
 	 */
 	MultiXactOffset oldestOffset;
 	bool		oldestOffsetKnown;
 
 	/*
-	 * This is what the previous checkpoint stored as the truncate position.
-	 * This value is the oldestMultiXactId that was valid when a checkpoint
-	 * was last executed.
+	 * True if a multixact truncation WAL record was replayed since the last
+	 * checkpoint. This is used to trigger 'legacy truncations', i.e. truncate
+	 * by looking at the data directory during WAL replay, when the primary is
+	 * too old to generate truncation records.
 	 */
-	MultiXactId lastCheckpointedOldest;
+	bool		sawTruncationInCkptCycle;
 
 	/* support for anti-wraparound measures */
 	MultiXactId multiVacLimit;
@@ -233,8 +235,7 @@ typedef struct MultiXactStateData
 	MultiXactId multiWrapLimit;
 
 	/* support for members anti-wraparound measures */
-	MultiXactOffset offsetStopLimit;
-	bool offsetStopLimitKnown;
+	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
 
 	/*
 	 * Per-backend data starts here.  We have two arrays stored in the area
@@ -364,12 +365,14 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 						MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static void DetermineSafeOldestOffset(MultiXactId oldestMXact);
 static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
 						 MultiXactOffset start, uint32 distance);
-static bool SetOffsetVacuumLimit(bool finish_setup);
+static bool SetOffsetVacuumLimit(void);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMZeroPageXlogRec(int pageno, uint8 info);
+static void WriteMTruncateXlogRec(Oid oldestMultiDB,
+					  MultiXactId startOff, MultiXactId endOff,
+					  MultiXactOffset startMemb, MultiXactOffset endMemb);
 
 
 /*
@@ -1102,7 +1105,7 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	 *----------
 	 */
 #define OFFSET_WARN_SEGMENTS	20
-	if (MultiXactState->offsetStopLimitKnown &&
+	if (MultiXactState->oldestOffsetKnown &&
 		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
 								 nmembers))
 	{
@@ -1142,7 +1145,7 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
 	}
 
-	if (MultiXactState->offsetStopLimitKnown &&
+	if (MultiXactState->oldestOffsetKnown &&
 		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit,
 								 nextOffset,
 								 nmembers + MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT * OFFSET_WARN_SEGMENTS))
@@ -2013,20 +2016,24 @@ StartupMultiXact(void)
 
 /*
  * This must be called ONCE at the end of startup/recovery.
- *
- * We don't need any locks here, really; the SLRU locks are taken only because
- * slru.c expects to be called with locks held.
  */
 void
 TrimMultiXact(void)
 {
-	MultiXactId multi = MultiXactState->nextMXact;
-	MultiXactOffset offset = MultiXactState->nextOffset;
-	MultiXactId	oldestMXact;
+	MultiXactId nextMXact;
+	MultiXactOffset offset;
+	MultiXactId oldestMXact;
+	Oid			oldestMXactDB;
 	int			pageno;
 	int			entryno;
 	int			flagsoff;
 
+	LWLockAcquire(MultiXactGenLock, LW_SHARED);
+	nextMXact = MultiXactState->nextMXact;
+	offset = MultiXactState->nextOffset;
+	oldestMXact = MultiXactState->oldestMultiXactId;
+	oldestMXactDB = MultiXactState->oldestMultiXactDB;
+	LWLockRelease(MultiXactGenLock);
 
 	/* Clean up offsets state */
 	LWLockAcquire(MultiXactOffsetControlLock, LW_EXCLUSIVE);
@@ -2034,20 +2041,20 @@ TrimMultiXact(void)
 	/*
 	 * (Re-)Initialize our idea of the latest page number for offsets.
 	 */
-	pageno = MultiXactIdToOffsetPage(multi);
+	pageno = MultiXactIdToOffsetPage(nextMXact);
 	MultiXactOffsetCtl->shared->latest_page_number = pageno;
 
 	/*
 	 * Zero out the remainder of the current offsets page.  See notes in
 	 * TrimCLOG() for motivation.
 	 */
-	entryno = MultiXactIdToOffsetEntry(multi);
+	entryno = MultiXactIdToOffsetEntry(nextMXact);
 	if (entryno != 0)
 	{
 		int			slotno;
 		MultiXactOffset *offptr;
 
-		slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
+		slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
 		offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 		offptr += entryno;
 
@@ -2096,12 +2103,13 @@ TrimMultiXact(void)
 
 	LWLockRelease(MultiXactMemberControlLock);
 
-	if (SetOffsetVacuumLimit(true) && IsUnderPostmaster)
-		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-	LWLockAcquire(MultiXactGenLock, LW_SHARED);
-	oldestMXact = MultiXactState->lastCheckpointedOldest;
+	/* signal that we're officially up */
+	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
+	MultiXactState->finishedStartup = true;
 	LWLockRelease(MultiXactGenLock);
-	DetermineSafeOldestOffset(oldestMXact);
+
+	/* Now compute how far away the next members wraparound is. */
+	SetMultiXactIdLimit(oldestMXact, oldestMXactDB);
 }
 
 /*
@@ -2270,8 +2278,20 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid)
 	 (errmsg("MultiXactId wrap limit is %u, limited by database with OID %u",
 			 multiWrapLimit, oldest_datoid)));
 
+	/*
+	 * Computing the actual limits is only possible once the data directory is
+	 * in a consistent state. There's no need to compute the limits while
+	 * still replaying WAL - no decisions about new multis are made even
+	 * though multixact creations might be replayed. So we'll only do further
+	 * checks after TrimMultiXact() has been called.
+	 */
+	if (!MultiXactState->finishedStartup)
+		return;
+
+	Assert(!InRecovery);
+
 	/* Set limits for offset vacuum. */
-	needs_offset_vacuum = SetOffsetVacuumLimit(false);
+	needs_offset_vacuum = SetOffsetVacuumLimit();
 
 	/*
 	 * If past the autovacuum force point, immediately signal an autovac
@@ -2281,11 +2301,11 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid)
 	 * another iteration immediately if there are still any old databases.
 	 */
 	if ((MultiXactIdPrecedes(multiVacLimit, curMulti) ||
-		 needs_offset_vacuum) && IsUnderPostmaster && !InRecovery)
+		 needs_offset_vacuum) && IsUnderPostmaster)
 		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
 
 	/* Give an immediate warning if past the wrap warn point */
-	if (MultiXactIdPrecedes(multiWarnLimit, curMulti) && !InRecovery)
+	if (MultiXactIdPrecedes(multiWarnLimit, curMulti))
 	{
 		char	   *oldest_datname;
 
@@ -2353,27 +2373,39 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti,
 }
 
 /*
- * Update our oldestMultiXactId value, but only if it's more recent than
- * what we had.  However, even if not, always update the oldest multixact
- * offset limit.
+ * Update our oldestMultiXactId value, but only if it's more recent than what
+ * we had.
+ *
+ * This may only be called during WAL replay.
  */
 void
 MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB)
 {
+	Assert(InRecovery);
+
 	if (MultiXactIdPrecedes(MultiXactState->oldestMultiXactId, oldestMulti))
+	{
+		/*
+		 * If there has been a truncation on the master, detected by seeing a
+		 * moving oldestMulti, without a corresponding truncation record, we
+		 * know that the primary is still running an older version of postgres
+		 * that doesn't yet log multixact truncations. So perform the
+		 * truncation ourselves.
+		 */
+		if (!MultiXactState->sawTruncationInCkptCycle)
+		{
+			ereport(LOG,
+					(errmsg("performing legacy multixact truncation"),
+					 errdetail("Legacy truncations are sometimes performed when replaying WAL from an older primary."),
+					 errhint("Upgrade the primary, it is susceptible to data corruption.")));
+			TruncateMultiXact(oldestMulti, oldestMultiDB, true);
+		}
+
 		SetMultiXactIdLimit(oldestMulti, oldestMultiDB);
-}
+	}
 
-/*
- * Update the "safe truncation point".  This is the newest value of oldestMulti
- * that is known to be flushed as part of a checkpoint record.
- */
-void
-MultiXactSetSafeTruncate(MultiXactId safeTruncateMulti)
-{
-	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
-	MultiXactState->lastCheckpointedOldest = safeTruncateMulti;
-	LWLockRelease(MultiXactGenLock);
+	/* only looked at in the startup process, no lock necessary */
+	MultiXactState->sawTruncationInCkptCycle = false;
 }
 
 /*
@@ -2529,132 +2561,56 @@ GetOldestMultiXactId(void)
 }
 
 /*
- * Based on the given oldest MultiXactId, determine what's the oldest member
- * offset and install the limit info in MultiXactState, where it can be used to
- * prevent overrun of old data in the members SLRU area.
- */
-static void
-DetermineSafeOldestOffset(MultiXactId oldestMXact)
-{
-	MultiXactOffset oldestOffset;
-	MultiXactOffset nextOffset;
-	MultiXactOffset offsetStopLimit;
-	MultiXactOffset prevOffsetStopLimit;
-	MultiXactId		nextMXact;
-	bool			finishedStartup;
-	bool			prevOffsetStopLimitKnown;
-
-	/* Fetch values from shared memory. */
-	LWLockAcquire(MultiXactGenLock, LW_SHARED);
-	finishedStartup = MultiXactState->finishedStartup;
-	nextMXact = MultiXactState->nextMXact;
-	nextOffset = MultiXactState->nextOffset;
-	prevOffsetStopLimit = MultiXactState->offsetStopLimit;
-	prevOffsetStopLimitKnown = MultiXactState->offsetStopLimitKnown;
-	LWLockRelease(MultiXactGenLock);
-
-	/* Don't worry about this until after we've started up. */
-	if (!finishedStartup)
-		return;
-
-	/*
-	 * Determine the offset of the oldest multixact.  Normally, we can read
-	 * the offset from the multixact itself, but there's an important special
-	 * case: if there are no multixacts in existence at all, oldestMXact
-	 * obviously can't point to one.  It will instead point to the multixact
-	 * ID that will be assigned the next time one is needed.
-	 *
-	 * NB: oldestMXact should be the oldest multixact that still exists in the
-	 * SLRU, unlike in SetOffsetVacuumLimit, where we do this same computation
-	 * based on the oldest value that might be referenced in a table.
-	 */
-	if (nextMXact == oldestMXact)
-		oldestOffset = nextOffset;
-	else
-	{
-		bool		oldestOffsetKnown;
-
-		oldestOffsetKnown = find_multixact_start(oldestMXact, &oldestOffset);
-		if (!oldestOffsetKnown)
-		{
-			ereport(LOG,
-					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
-						oldestMXact)));
-			return;
-		}
-	}
-
-	/* move back to start of the corresponding segment */
-	offsetStopLimit = oldestOffset - (oldestOffset %
-		(MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
-	/* always leave one segment before the wraparound point */
-	offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
-
-	/* if nothing has changed, we're done */
-	if (prevOffsetStopLimitKnown && offsetStopLimit == prevOffsetStopLimit)
-		return;
-
-	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
-	MultiXactState->offsetStopLimit = offsetStopLimit;
-	MultiXactState->offsetStopLimitKnown = true;
-	LWLockRelease(MultiXactGenLock);
-
-	if (!prevOffsetStopLimitKnown && IsUnderPostmaster)
-		ereport(LOG,
-				(errmsg("MultiXact member wraparound protections are now enabled")));
-	ereport(DEBUG1,
-			(errmsg("MultiXact member stop limit is now %u based on MultiXact %u",
-				offsetStopLimit, oldestMXact)));
-}
-
-/*
  * Determine how aggressively we need to vacuum in order to prevent member
  * wraparound.
  *
- * To determine the oldest multixact ID, we look at oldestMultiXactId, not
- * lastCheckpointedOldest.  That's because vacuuming can't help with anything
- * older than oldestMultiXactId; anything older than that isn't referenced
- * by any table.  Offsets older than oldestMultiXactId but not as old as
- * lastCheckpointedOldest will go away after the next checkpoint.
+ * To do so determine what's the oldest member offset and install the limit
+ * info in MultiXactState, where it can be used to prevent overrun of old data
+ * in the members SLRU area.
  *
  * The return value is true if emergency autovacuum is required and false
  * otherwise.
  */
 static bool
-SetOffsetVacuumLimit(bool finish_setup)
+SetOffsetVacuumLimit(void)
 {
-	MultiXactId	oldestMultiXactId;
+	MultiXactId oldestMultiXactId;
 	MultiXactId nextMXact;
-	bool		finishedStartup;
-	MultiXactOffset oldestOffset = 0;		/* placate compiler */
+	MultiXactOffset oldestOffset = 0;	/* placate compiler */
+	MultiXactOffset prevOldestOffset;
 	MultiXactOffset nextOffset;
 	bool		oldestOffsetKnown = false;
-	MultiXactOffset prevOldestOffset;
 	bool		prevOldestOffsetKnown;
+	MultiXactOffset offsetStopLimit = 0;
+
+	/*
+	 * NB: Have to prevent concurrent truncation, we might otherwise try to
+	 * look up an oldestMulti that's concurrently getting truncated away.
+	 */
+	LWLockAcquire(MultiXactTruncationLock, LW_SHARED);
 
 	/* Read relevant fields from shared memory. */
 	LWLockAcquire(MultiXactGenLock, LW_SHARED);
 	oldestMultiXactId = MultiXactState->oldestMultiXactId;
 	nextMXact = MultiXactState->nextMXact;
 	nextOffset = MultiXactState->nextOffset;
-	finishedStartup = MultiXactState->finishedStartup;
-	prevOldestOffset = MultiXactState->oldestOffset;
 	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
+	prevOldestOffset = MultiXactState->oldestOffset;
+	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
-	/* Don't do this until after any recovery is complete. */
-	if (!finishedStartup && !finish_setup)
-		return false;
-
 	/*
-	 * If no multixacts exist, then oldestMultiXactId will be the next
-	 * multixact that will be created, rather than an existing multixact.
+	 * Determine the offset of the oldest multixact.  Normally, we can read
+	 * the offset from the multixact itself, but there's an important special
+	 * case: if there are no multixacts in existence at all, oldestMXact
+	 * obviously can't point to one.  It will instead point to the multixact
+	 * ID that will be assigned the next time one is needed.
 	 */
 	if (oldestMultiXactId == nextMXact)
 	{
 		/*
-		 * When the next multixact gets created, it will be stored at the
-		 * next offset.
+		 * When the next multixact gets created, it will be stored at the next
+		 * offset.
 		 */
 		oldestOffset = nextOffset;
 		oldestOffsetKnown = true;
@@ -2662,55 +2618,67 @@ SetOffsetVacuumLimit(bool finish_setup)
 	else
 	{
 		/*
-		 * Figure out where the oldest existing multixact's offsets are stored.
-		 * Due to bugs in early release of PostgreSQL 9.3.X and 9.4.X, the
-		 * supposedly-earliest multixact might not really exist.  We are
+		 * Figure out where the oldest existing multixact's offsets are
+		 * stored. Due to bugs in early release of PostgreSQL 9.3.X and 9.4.X,
+		 * the supposedly-earliest multixact might not really exist.  We are
 		 * careful not to fail in that case.
 		 */
 		oldestOffsetKnown =
 			find_multixact_start(oldestMultiXactId, &oldestOffset);
-	}
 
-	/*
-	 * Except when initializing the system for the first time, there's no
-	 * need to update anything if we don't know the oldest offset or if it
-	 * hasn't changed.
-	 */
-	if (finish_setup ||
-		(oldestOffsetKnown && !prevOldestOffsetKnown) ||
-		(oldestOffsetKnown && prevOldestOffset != oldestOffset))
-	{
-		/* Install the new limits. */
-		LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
-		MultiXactState->oldestOffset = oldestOffset;
-		MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
-		MultiXactState->finishedStartup = true;
-		LWLockRelease(MultiXactGenLock);
-
-		/* Log the info */
 		if (oldestOffsetKnown)
 			ereport(DEBUG1,
 					(errmsg("oldest MultiXactId member is at offset %u",
-						oldestOffset)));
+							oldestOffset)));
 		else
-			ereport(DEBUG1,
-					(errmsg("oldest MultiXactId member offset unknown")));
+			ereport(LOG,
+					(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
+							oldestMultiXactId)));
 	}
 
+	LWLockRelease(MultiXactTruncationLock);
+
 	/*
-	 * If we failed to get the oldest offset this time, but we have a value
-	 * from a previous pass through this function, assess the need for
-	 * autovacuum based on that old value rather than automatically forcing
-	 * it.
+	 * If we can, compute limits (and install them in MultiXactState) to
+	 * prevent overrun of old data in the members SLRU area. We can only do so
+	 * if the oldest offset is known, though.
 	 */
-	if (prevOldestOffsetKnown && !oldestOffsetKnown)
+	if (oldestOffsetKnown)
+	{
+		/* move back to start of the corresponding segment */
+		offsetStopLimit = oldestOffset - (oldestOffset %
+					  (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
+
+		/* always leave one segment before the wraparound point */
+		offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
+
+		if (!prevOldestOffsetKnown && IsUnderPostmaster)
+			ereport(LOG,
+					(errmsg("MultiXact member wraparound protections are now enabled")));
+		ereport(DEBUG1,
+		(errmsg("MultiXact member stop limit is now %u based on MultiXact %u",
+				offsetStopLimit, oldestMultiXactId)));
+	}
+	else if (prevOldestOffsetKnown)
 	{
+		/*
+		 * If we failed to get the oldest offset this time, but we have a
+		 * value from a previous pass through this function, use the old value
+		 * rather than automatically forcing it.
+		 */
 		oldestOffset = prevOldestOffset;
 		oldestOffsetKnown = true;
 	}
 
+	/* Install the computed values */
+	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
+	MultiXactState->oldestOffset = oldestOffset;
+	MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
+	MultiXactState->offsetStopLimit = offsetStopLimit;
+	LWLockRelease(MultiXactGenLock);
+
 	/*
-	 * Do we need an emergency autovacuum?  If we're not sure, assume yes.
+	 * Do we need an emergency autovacuum?	If we're not sure, assume yes.
 	 */
 	return !oldestOffsetKnown ||
 		(nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
@@ -2723,7 +2691,7 @@ SetOffsetVacuumLimit(bool finish_setup)
  * boundary point, hence the name.  The reason we don't want to use the regular
  * 2^31-modulo arithmetic here is that we want to be able to use the whole of
  * the 2^32-1 space here, allowing for more multixacts that would fit
- * otherwise.  See also SlruScanDirCbRemoveMembers.
+ * otherwise.
  */
 static bool
 MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
@@ -2769,6 +2737,9 @@ MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
  *
  * Returns false if the file containing the multi does not exist on disk.
  * Otherwise, returns true and sets *result to the starting member offset.
+ *
+ * This function does not prevent concurrent truncation, so if that's
+ * required, the caller has to protect against that.
  */
 static bool
 find_multixact_start(MultiXactId multi, MultiXactOffset *result)
@@ -2779,9 +2750,22 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
 	int			slotno;
 	MultiXactOffset *offptr;
 
+	/* XXX: Remove || AmStartupProcess() after WAL page magic bump */
+	Assert(MultiXactState->finishedStartup || AmStartupProcess());
+
 	pageno = MultiXactIdToOffsetPage(multi);
 	entryno = MultiXactIdToOffsetEntry(multi);
 
+	/*
+	 * We need to flush out dirty data, so SimpleLruDoesPhysicalPageExist can
+	 * work correctly, but SimpleLruFlush() is a pretty big hammer for that.
+	 * Alternatively we could add an in-memory version of page exists, but
+	 * find_multixact_start is called infrequently, and it doesn't seem bad to
+	 * flush buffers to disk before truncation.
+	 */
+	SimpleLruFlush(MultiXactOffsetCtl, true);
+	SimpleLruFlush(MultiXactMemberCtl, true);
+
 	if (!SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno))
 		return false;
 
@@ -2887,65 +2871,6 @@ MultiXactMemberFreezeThreshold(void)
 	return multixacts - victim_multixacts;
 }
 
-/*
- * SlruScanDirectory callback.
- *		This callback deletes segments that are outside the range determined by
- *		the given page numbers.
- *
- * Both range endpoints are exclusive (that is, segments containing any of
- * those pages are kept.)
- */
-typedef struct MembersLiveRange
-{
-	int			rangeStart;
-	int			rangeEnd;
-} MembersLiveRange;
-
-static bool
-SlruScanDirCbRemoveMembers(SlruCtl ctl, char *filename, int segpage,
-						   void *data)
-{
-	MembersLiveRange *range = (MembersLiveRange *) data;
-	MultiXactOffset nextOffset;
-
-	if ((segpage == range->rangeStart) ||
-		(segpage == range->rangeEnd))
-		return false;			/* easy case out */
-
-	/*
-	 * To ensure that no segment is spuriously removed, we must keep track of
-	 * new segments added since the start of the directory scan; to do this,
-	 * we update our end-of-range point as we run.
-	 *
-	 * As an optimization, we can skip looking at shared memory if we know for
-	 * certain that the current segment must be kept.  This is so because
-	 * nextOffset never decreases, and we never increase rangeStart during any
-	 * one run.
-	 */
-	if (!((range->rangeStart > range->rangeEnd &&
-		   segpage > range->rangeEnd && segpage < range->rangeStart) ||
-		  (range->rangeStart < range->rangeEnd &&
-		   (segpage < range->rangeStart || segpage > range->rangeEnd))))
-		return false;
-
-	/*
-	 * Update our idea of the end of the live range.
-	 */
-	LWLockAcquire(MultiXactGenLock, LW_SHARED);
-	nextOffset = MultiXactState->nextOffset;
-	LWLockRelease(MultiXactGenLock);
-	range->rangeEnd = MXOffsetToMemberPage(nextOffset);
-
-	/* Recheck the deletion condition.  If it still holds, perform deletion */
-	if ((range->rangeStart > range->rangeEnd &&
-		 segpage > range->rangeEnd && segpage < range->rangeStart) ||
-		(range->rangeStart < range->rangeEnd &&
-		 (segpage < range->rangeStart || segpage > range->rangeEnd)))
-		SlruDeleteSegment(ctl, filename);
-
-	return false;				/* keep going */
-}
-
 typedef struct mxtruncinfo
 {
 	int			earliestExistingPage;
@@ -2969,37 +2894,110 @@ SlruScanDirCbFindEarliest(SlruCtl ctl, char *filename, int segpage, void *data)
 	return false;				/* keep going */
 }
 
+
+/*
+ * Delete members segments [oldest, newOldest)
+ */
+static void
+PerformMembersTruncation(MultiXactOffset oldestOffset, MultiXactOffset newOldestOffset)
+{
+	const int	maxsegment = MXOffsetToMemberSegment(MaxMultiXactOffset);
+	int			startsegment = MXOffsetToMemberSegment(oldestOffset);
+	int			endsegment = MXOffsetToMemberSegment(newOldestOffset);
+	int			segment = startsegment;
+
+	/*
+	 * Delete all the segments but the last one. The last segment can still
+	 * contain, possibly partially, valid data.
+	 */
+	while (segment != endsegment)
+	{
+		elog(DEBUG2, "truncating multixact members segment %x", segment);
+		SlruDeleteSegment(MultiXactMemberCtl, segment);
+
+		/* move to next segment, handling wraparound correctly */
+		if (segment == maxsegment)
+			segment = 0;
+		else
+			segment += 1;
+	}
+}
+
+/*
+ * Delete offsets segments [oldest, newOldest)
+ */
+static void
+PerformOffsetsTruncation(MultiXactId oldestMulti, MultiXactId newOldestMulti)
+{
+	/*
+	 * We step back one multixact to avoid passing a cutoff page that hasn't
+	 * been created yet in the rare case that oldestMulti would be the first
+	 * item on a page and oldestMulti == nextMulti.  In that case, if we
+	 * didn't subtract one, we'd trigger SimpleLruTruncate's wraparound
+	 * detection.
+	 */
+	SimpleLruTruncate(MultiXactOffsetCtl,
+			   MultiXactIdToOffsetPage(PreviousMultiXactId(newOldestMulti)));
+}
+
 /*
  * Remove all MultiXactOffset and MultiXactMember segments before the oldest
  * ones still of interest.
  *
- * On a primary, this is called by the checkpointer process after a checkpoint
- * has been flushed; during crash recovery, it's called from
- * CreateRestartPoint().  In the latter case, we rely on the fact that
- * xlog_redo() will already have called MultiXactAdvanceOldest().  Our
- * latest_page_number will already have been initialized by StartupMultiXact()
- * and kept up to date as new pages are zeroed.
+ * On a primary this is called as part of vacuum (via
+ * vac_truncate_clog()). During recovery truncation is normally done by
+ * replaying truncation WAL records instead of this routine; the exception is
+ * when replaying records from an older primary that doesn't yet generate
+ * truncation WAL records. In that case truncation is triggered by
+ * MultiXactAdvanceOldest().
+ *
+ * newOldestMulti is the oldest currently required multixact, newOldestMultiDB
+ * is one of the databases preventing newOldestMulti from increasing.
  */
 void
-TruncateMultiXact(void)
+TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB, bool in_recovery)
 {
-	MultiXactId oldestMXact;
+	MultiXactId oldestMulti;
+	MultiXactId nextMulti;
+	MultiXactOffset newOldestOffset;
 	MultiXactOffset oldestOffset;
-	MultiXactId		nextMXact;
-	MultiXactOffset	nextOffset;
+	MultiXactOffset nextOffset;
 	mxtruncinfo trunc;
 	MultiXactId earliest;
-	MembersLiveRange range;
 
-	Assert(AmCheckpointerProcess() || AmStartupProcess() ||
-		   !IsPostmasterEnvironment);
+	/*
+	 * Need to allow being called in recovery for backwards compatibility,
+	 * when an updated standby replays WAL generated by a non-updated primary.
+	 */
+	Assert(in_recovery || !RecoveryInProgress());
+	Assert(!in_recovery || AmStartupProcess());
+	Assert(in_recovery || MultiXactState->finishedStartup);
+
+	/*
+	 * We can only allow one truncation to happen at once. Otherwise parts of
+	 * members might vanish while we're doing lookups or similar. There's no
+	 * need to have an interlock with creating new multis or such, since those
+	 * are constrained by the limits (which only grow, never shrink).
+	 */
+	LWLockAcquire(MultiXactTruncationLock, LW_EXCLUSIVE);
 
 	LWLockAcquire(MultiXactGenLock, LW_SHARED);
-	oldestMXact = MultiXactState->lastCheckpointedOldest;
-	nextMXact = MultiXactState->nextMXact;
+	nextMulti = MultiXactState->nextMXact;
 	nextOffset = MultiXactState->nextOffset;
+	oldestMulti = MultiXactState->oldestMultiXactId;
 	LWLockRelease(MultiXactGenLock);
-	Assert(MultiXactIdIsValid(oldestMXact));
+	Assert(MultiXactIdIsValid(oldestMulti));
+
+	/*
+	 * Make sure to only attempt truncation if there are values to truncate
+	 * away. In normal processing values shouldn't go backwards, but there are
+	 * some corner cases (due to bugs) where that's possible.
+	 */
+	if (MultiXactIdPrecedesOrEquals(newOldestMulti, oldestMulti))
+	{
+		LWLockRelease(MultiXactTruncationLock);
+		return;
+	}
 
 	/*
 	 * Note we can't just plow ahead with the truncation; it's possible that
@@ -3007,6 +3005,9 @@ TruncateMultiXact(void)
 	 * going to attempt to read the offsets page to determine where to
 	 * truncate the members SLRU.  So we first scan the directory to determine
 	 * the earliest offsets page number that we can read without error.
+	 *
+	 * NB: It's also possible that the page that oldestMulti is on has already
+	 * been truncated away, and we crashed before updating oldestMulti.
 	 */
 	trunc.earliestExistingPage = -1;
 	SlruScanDirectory(MultiXactOffsetCtl, SlruScanDirCbFindEarliest, &trunc);
@@ -3014,19 +3015,10 @@ TruncateMultiXact(void)
 	if (earliest < FirstMultiXactId)
 		earliest = FirstMultiXactId;
 
-	/*
-	 * If there's nothing to remove, we can bail out early.
-	 *
-	 * Due to bugs in early releases of PostgreSQL 9.3.X and 9.4.X,
-	 * oldestMXact might point to a multixact that does not exist.
-	 * Autovacuum will eventually advance it to a value that does exist,
-	 * and we want to set a proper offsetStopLimit when that happens,
-	 * so call DetermineSafeOldestOffset here even if we're not actually
-	 * truncating.
-	 */
-	if (MultiXactIdPrecedes(oldestMXact, earliest))
+	/* If there's nothing to remove, we can bail out early. */
+	if (MultiXactIdPrecedes(oldestMulti, earliest))
 	{
-		DetermineSafeOldestOffset(oldestMXact);
+		LWLockRelease(MultiXactTruncationLock);
 		return;
 	}
 
@@ -3035,49 +3027,104 @@ TruncateMultiXact(void)
 	 * the starting offset of the oldest multixact.
 	 *
 	 * Hopefully, find_multixact_start will always work here, because we've
-	 * already checked that it doesn't precede the earliest MultiXact on
-	 * disk.  But if it fails, don't truncate anything, and log a message.
+	 * already checked that it doesn't precede the earliest MultiXact on disk.
+	 * But if it fails, don't truncate anything, and log a message.
 	 */
-	if (oldestMXact == nextMXact)
-		oldestOffset = nextOffset;		/* there are NO MultiXacts */
-	else if (!find_multixact_start(oldestMXact, &oldestOffset))
+	if (oldestMulti == nextMulti)
+	{
+		/* there are NO MultiXacts */
+		oldestOffset = nextOffset;
+	}
+	else if (!find_multixact_start(oldestMulti, &oldestOffset))
 	{
 		ereport(LOG,
 				(errmsg("oldest MultiXact %u not found, earliest MultiXact %u, skipping truncation",
-					oldestMXact, earliest)));
+						oldestMulti, earliest)));
+		LWLockRelease(MultiXactTruncationLock);
 		return;
 	}
 
 	/*
-	 * To truncate MultiXactMembers, we need to figure out the active page
-	 * range and delete all files outside that range.  The start point is the
-	 * start of the segment containing the oldest offset; an end point of the
-	 * segment containing the next offset to use is enough.  The end point is
-	 * updated as MultiXactMember gets extended concurrently, elsewhere.
+	 * Secondly, compute up to where to truncate. Look up the corresponding
+	 * member offset for newOldestMulti for that.
 	 */
-	range.rangeStart = MXOffsetToMemberPage(oldestOffset);
-	range.rangeStart -= range.rangeStart % SLRU_PAGES_PER_SEGMENT;
-
-	range.rangeEnd = MXOffsetToMemberPage(nextOffset);
+	if (newOldestMulti == nextMulti)
+	{
+		/* there are NO MultiXacts */
+		newOldestOffset = nextOffset;
+	}
+	else if (!find_multixact_start(newOldestMulti, &newOldestOffset))
+	{
+		ereport(LOG,
+				(errmsg("cannot truncate up to MultiXact %u because it does not exist on disk, skipping truncation",
+						newOldestMulti)));
+		LWLockRelease(MultiXactTruncationLock);
+		return;
+	}
 
-	SlruScanDirectory(MultiXactMemberCtl, SlruScanDirCbRemoveMembers, &range);
+	elog(DEBUG1, "performing multixact truncation: "
+		 "offsets [%u, %u), offsets segments [%x, %x), "
+		 "members [%u, %u), members segments [%x, %x)",
+		 oldestMulti, newOldestMulti,
+		 MultiXactIdToOffsetSegment(oldestMulti),
+		 MultiXactIdToOffsetSegment(newOldestMulti),
+		 oldestOffset, newOldestOffset,
+		 MXOffsetToMemberSegment(oldestOffset),
+		 MXOffsetToMemberSegment(newOldestOffset));
 
 	/*
-	 * Now we can truncate MultiXactOffset.  We step back one multixact to
-	 * avoid passing a cutoff page that hasn't been created yet in the rare
-	 * case that oldestMXact would be the first item on a page and oldestMXact
-	 * == nextMXact.  In that case, if we didn't subtract one, we'd trigger
-	 * SimpleLruTruncate's wraparound detection.
+	 * Do truncation, and the WAL logging of the truncation, in a critical
+	 * section. That way offsets/members cannot get out of sync anymore, i.e.
+	 * once consistent the newOldestMulti will always exist in members, even
+	 * if we crashed at the wrong moment.
 	 */
-	SimpleLruTruncate(MultiXactOffsetCtl,
-				  MultiXactIdToOffsetPage(PreviousMultiXactId(oldestMXact)));
+	START_CRIT_SECTION();
 
 	/*
-	 * Now, and only now, we can advance the stop point for multixact members.
-	 * If we did it any sooner, the segments we deleted above might already
-	 * have been overwritten with new members.  That would be bad.
+	 * Prevent checkpoints from being scheduled concurrently. This is critical
+	 * because otherwise a truncation record might not be replayed after a
+	 * crash/basebackup, even though the state of the data directory would
+	 * require it.  It's not possible (startup process doesn't have a PGXACT
+	 * entry), and not needed, to do this during recovery, when performing an
+	 * old-style truncation, though. There the entire scheduling depends on
+	 * the replayed WAL records, which will be the same after a possible crash.
+	 */
+	if (!in_recovery)
+	{
+		Assert(!MyPgXact->delayChkpt);
+		MyPgXact->delayChkpt = true;
+	}
+
+	/* WAL log truncation */
+	if (!in_recovery)
+		WriteMTruncateXlogRec(newOldestMultiDB,
+							  oldestMulti, newOldestMulti,
+							  oldestOffset, newOldestOffset);
+
+	/*
+	 * Update in-memory limits before performing the truncation, while inside
+	 * the critical section: Have to do it before truncation, to prevent
+	 * concurrent lookups of those values. Has to be inside the critical
+	 * section as otherwise a future call to this function would error out,
+	 * while looking up the oldest member in offsets, if our caller crashes
+	 * before updating the limits.
 	 */
-	DetermineSafeOldestOffset(oldestMXact);
+	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
+	MultiXactState->oldestMultiXactId = newOldestMulti;
+	MultiXactState->oldestMultiXactDB = newOldestMultiDB;
+	LWLockRelease(MultiXactGenLock);
+
+	/* First truncate members */
+	PerformMembersTruncation(oldestOffset, newOldestOffset);
+
+	/* Then offsets */
+	PerformOffsetsTruncation(oldestMulti, newOldestMulti);
+
+	if (!in_recovery)
+		MyPgXact->delayChkpt = false;
+
+	END_CRIT_SECTION();
+	LWLockRelease(MultiXactTruncationLock);
 }
 
 /*
@@ -3174,6 +3221,34 @@ WriteMZeroPageXlogRec(int pageno, uint8 info)
 }
 
 /*
+ * Write a TRUNCATE xlog record
+ *
+ * We must flush the xlog record to disk before returning --- see notes in
+ * TruncateCLOG().
+ */
+static void
+WriteMTruncateXlogRec(Oid oldestMultiDB,
+					  MultiXactId startTruncOff, MultiXactId endTruncOff,
+				MultiXactOffset startTruncMemb, MultiXactOffset endTruncMemb)
+{
+	XLogRecPtr	recptr;
+	xl_multixact_truncate xlrec;
+
+	xlrec.oldestMultiDB = oldestMultiDB;
+
+	xlrec.startTruncOff = startTruncOff;
+	xlrec.endTruncOff = endTruncOff;
+
+	xlrec.startTruncMemb = startTruncMemb;
+	xlrec.endTruncMemb = endTruncMemb;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) (&xlrec), SizeOfMultiXactTruncate);
+	recptr = XLogInsert(RM_MULTIXACT_ID, XLOG_MULTIXACT_TRUNCATE_ID);
+	XLogFlush(recptr);
+}
+
+/*
  * MULTIXACT resource manager's routines
  */
 void
@@ -3255,6 +3330,49 @@ multixact_redo(XLogReaderState *record)
 			LWLockRelease(XidGenLock);
 		}
 	}
+	else if (info == XLOG_MULTIXACT_TRUNCATE_ID)
+	{
+		xl_multixact_truncate xlrec;
+		int			pageno;
+
+		memcpy(&xlrec, XLogRecGetData(record),
+			   SizeOfMultiXactTruncate);
+
+		elog(DEBUG1, "replaying multixact truncation: "
+			 "offsets [%u, %u), offsets segments [%x, %x), "
+			 "members [%u, %u), members segments [%x, %x)",
+			 xlrec.startTruncOff, xlrec.endTruncOff,
+			 MultiXactIdToOffsetSegment(xlrec.startTruncOff),
+			 MultiXactIdToOffsetSegment(xlrec.endTruncOff),
+			 xlrec.startTruncMemb, xlrec.endTruncMemb,
+			 MXOffsetToMemberSegment(xlrec.startTruncMemb),
+			 MXOffsetToMemberSegment(xlrec.endTruncMemb));
+
+		/* should not be required, but more than cheap enough */
+		LWLockAcquire(MultiXactTruncationLock, LW_EXCLUSIVE);
+
+		/*
+		 * Advance the horizon values, so they're current at the end of
+		 * recovery.
+		 */
+		SetMultiXactIdLimit(xlrec.endTruncOff, xlrec.oldestMultiDB);
+
+		PerformMembersTruncation(xlrec.startTruncMemb, xlrec.endTruncMemb);
+
+		/*
+		 * During XLOG replay, latest_page_number isn't necessarily set up
+		 * yet; insert a suitable value to bypass the sanity test in
+		 * SimpleLruTruncate.
+		 */
+		pageno = MultiXactIdToOffsetPage(xlrec.endTruncOff);
+		MultiXactOffsetCtl->shared->latest_page_number = pageno;
+		PerformOffsetsTruncation(xlrec.startTruncOff, xlrec.endTruncOff);
+
+		LWLockRelease(MultiXactTruncationLock);
+
+		/* only looked at in the startup process, no lock necessary */
+		MultiXactState->sawTruncationInCkptCycle = true;
+	}
 	else
 		elog(PANIC, "multixact_redo: unknown op code %u", info);
 }
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 5fcea11..90c7cf5 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -134,6 +134,7 @@ static int	SlruSelectLRUPage(SlruCtl ctl, int pageno);
 
 static bool SlruScanDirCbDeleteCutoff(SlruCtl ctl, char *filename,
 						  int segpage, void *data);
+static void SlruInternalDeleteSegment(SlruCtl ctl, char *filename);
 
 /*
  * Initialization of shared memory
@@ -1075,7 +1076,7 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
  * Flush dirty pages to disk during checkpoint or database shutdown
  */
 void
-SimpleLruFlush(SlruCtl ctl, bool checkpoint)
+SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
 {
 	SlruShared	shared = ctl->shared;
 	SlruFlushData fdata;
@@ -1096,11 +1097,11 @@ SimpleLruFlush(SlruCtl ctl, bool checkpoint)
 		SlruInternalWritePage(ctl, slotno, &fdata);
 
 		/*
-		 * When called during a checkpoint, we cannot assert that the slot is
-		 * clean now, since another process might have re-dirtied it already.
-		 * That's okay.
+		 * In some places (e.g. checkpoints), we cannot assert that the slot
+		 * is clean now, since another process might have re-dirtied it
+		 * already.  That's okay.
 		 */
-		Assert(checkpoint ||
+		Assert(allow_redirtied ||
 			   shared->page_status[slotno] == SLRU_PAGE_EMPTY ||
 			   (shared->page_status[slotno] == SLRU_PAGE_VALID &&
 				!shared->page_dirty[slotno]));
@@ -1210,8 +1211,14 @@ restart:;
 	(void) SlruScanDirectory(ctl, SlruScanDirCbDeleteCutoff, &cutoffPage);
 }
 
-void
-SlruDeleteSegment(SlruCtl ctl, char *filename)
+/*
+ * Delete an individual SLRU segment, identified by the filename.
+ *
+ * NB: This does not touch the SLRU buffers themselves, callers have to ensure
+ * they either can't yet contain anything, or have already been cleaned out.
+ */
+static void
+SlruInternalDeleteSegment(SlruCtl ctl, char *filename)
 {
 	char		path[MAXPGPATH];
 
@@ -1222,6 +1229,64 @@ SlruDeleteSegment(SlruCtl ctl, char *filename)
 }
 
 /*
+ * Delete an individual SLRU segment, identified by the segment number.
+ */
+void
+SlruDeleteSegment(SlruCtl ctl, int segno)
+{
+	SlruShared	shared = ctl->shared;
+	int			slotno;
+	char		path[MAXPGPATH];
+	bool		did_write;
+
+	/* Clean out any possibly existing references to the segment. */
+	LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
+restart:
+	did_write = false;
+	for (slotno = 0; slotno < shared->num_slots; slotno++)
+	{
+		int			pagesegno = shared->page_number[slotno] / SLRU_PAGES_PER_SEGMENT;
+
+		if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
+			continue;
+
+		/* not the segment we're looking for */
+		if (pagesegno != segno)
+			continue;
+
+		/* If page is clean, just change state to EMPTY (expected case). */
+		if (shared->page_status[slotno] == SLRU_PAGE_VALID &&
+			!shared->page_dirty[slotno])
+		{
+			shared->page_status[slotno] = SLRU_PAGE_EMPTY;
+			continue;
+		}
+
+		/* Same logic as SimpleLruTruncate() */
+		if (shared->page_status[slotno] == SLRU_PAGE_VALID)
+			SlruInternalWritePage(ctl, slotno, NULL);
+		else
+			SimpleLruWaitIO(ctl, slotno);
+
+		did_write = true;
+	}
+
+	/*
+	 * Be extra careful and re-check. The IO functions release the control
+	 * lock, so new pages could have been read in.
+	 */
+	if (did_write)
+		goto restart;
+
+	snprintf(path, MAXPGPATH, "%s/%04X", ctl->Dir, segno);
+	ereport(DEBUG2,
+			(errmsg("removing file \"%s\"", path)));
+	unlink(path);
+
+	LWLockRelease(shared->ControlLock);
+}
+
+/*
  * SlruScanDirectory callback
  *		This callback reports true if there's any segment prior to the one
  *		containing the page passed as "data".
@@ -1249,7 +1314,7 @@ SlruScanDirCbDeleteCutoff(SlruCtl ctl, char *filename, int segpage, void *data)
 	int			cutoffPage = *(int *) data;
 
 	if (ctl->PagePrecedes(segpage, cutoffPage))
-		SlruDeleteSegment(ctl, filename);
+		SlruInternalDeleteSegment(ctl, filename);
 
 	return false;				/* keep going */
 }
@@ -1261,7 +1326,7 @@ SlruScanDirCbDeleteCutoff(SlruCtl ctl, char *filename, int segpage, void *data)
 bool
 SlruScanDirCbDeleteAll(SlruCtl ctl, char *filename, int segpage, void *data)
 {
-	SlruDeleteSegment(ctl, filename);
+	SlruInternalDeleteSegment(ctl, filename);
 
 	return false;				/* keep going */
 }
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a87f09e..1ac1c05 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6330,7 +6330,6 @@ StartupXLOG(void)
 	SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
 	SetCommitTsLimit(checkPoint.oldestCommitTs,
 					 checkPoint.newestCommitTs);
-	MultiXactSetSafeTruncate(checkPoint.oldestMulti);
 	XLogCtl->ckptXidEpoch = checkPoint.nextXidEpoch;
 	XLogCtl->ckptXid = checkPoint.nextXid;
 
@@ -6347,10 +6346,8 @@ StartupXLOG(void)
 	StartupReorderBuffer();
 
 	/*
-	 * Startup MultiXact.  We need to do this early for two reasons: one is
-	 * that we might try to access multixacts when we do tuple freezing, and
-	 * the other is we need its state initialized because we attempt
-	 * truncation during restartpoints.
+	 * Startup MultiXact. We need to do this early to be able to replay
+	 * truncations.
 	 */
 	StartupMultiXact();
 
@@ -8508,12 +8505,6 @@ CreateCheckPoint(int flags)
 	END_CRIT_SECTION();
 
 	/*
-	 * Now that the checkpoint is safely on disk, we can update the point to
-	 * which multixact can be truncated.
-	 */
-	MultiXactSetSafeTruncate(checkPoint.oldestMulti);
-
-	/*
 	 * Let smgr do post-checkpoint cleanup (eg, deleting old files).
 	 */
 	smgrpostckpt();
@@ -8552,11 +8543,6 @@ CreateCheckPoint(int flags)
 	if (!RecoveryInProgress())
 		TruncateSUBTRANS(GetOldestXmin(NULL, false));
 
-	/*
-	 * Truncate pg_multixact too.
-	 */
-	TruncateMultiXact();
-
 	/* Real work is done, but log and update stats before releasing lock. */
 	LogCheckpointEnd(false);
 
@@ -8887,21 +8873,6 @@ CreateRestartPoint(int flags)
 	}
 
 	/*
-	 * Due to a historical accident multixact truncations are not WAL-logged,
-	 * but just performed everytime the mxact horizon is increased. So, unless
-	 * we explicitly execute truncations on a standby it will never clean out
-	 * /pg_multixact which obviously is bad, both because it uses space and
-	 * because we can wrap around into pre-existing data...
-	 *
-	 * We can only do the truncation here, after the UpdateControlFile()
-	 * above, because we've now safely established a restart point.  That
-	 * guarantees we will not need to access those multis.
-	 *
-	 * It's probably worth improving this.
-	 */
-	TruncateMultiXact();
-
-	/*
 	 * Truncate pg_subtrans if possible.  We can throw away all data before
 	 * the oldest XMIN of any running transaction.  No future transaction will
 	 * attempt to reference any pg_subtrans entry older than that (see Asserts
@@ -9261,9 +9232,14 @@ xlog_redo(XLogReaderState *record)
 		LWLockRelease(OidGenLock);
 		MultiXactSetNextMXact(checkPoint.nextMulti,
 							  checkPoint.nextMultiOffset);
+
+		/*
+		 * NB: This may perform multixact truncation when replaying WAL
+		 * generated by an older primary.
+		 */
+		MultiXactAdvanceOldest(checkPoint.oldestMulti,
+							   checkPoint.oldestMultiDB);
 		SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
-		SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
-		MultiXactSetSafeTruncate(checkPoint.oldestMulti);
 
 		/*
 		 * If we see a shutdown checkpoint while waiting for an end-of-backup
@@ -9353,14 +9329,17 @@ xlog_redo(XLogReaderState *record)
 		LWLockRelease(OidGenLock);
 		MultiXactAdvanceNextMXact(checkPoint.nextMulti,
 								  checkPoint.nextMultiOffset);
+
+		/*
+		 * NB: This may perform multixact truncation when replaying WAL
+		 * generated by an older primary.
+		 */
+		MultiXactAdvanceOldest(checkPoint.oldestMulti,
+							   checkPoint.oldestMultiDB);
 		if (TransactionIdPrecedes(ShmemVariableCache->oldestXid,
 								  checkPoint.oldestXid))
 			SetTransactionIdLimit(checkPoint.oldestXid,
 								  checkPoint.oldestXidDB);
-		MultiXactAdvanceOldest(checkPoint.oldestMulti,
-							   checkPoint.oldestMultiDB);
-		MultiXactSetSafeTruncate(checkPoint.oldestMulti);
-
 		/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
 		ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
 		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 85b0483..698bb35 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1137,11 +1137,11 @@ vac_truncate_clog(TransactionId frozenXID,
 		return;
 
 	/*
-	 * Truncate CLOG and CommitTs to the oldest computed value. Note we don't
-	 * truncate multixacts; that will be done by the next checkpoint.
+	 * Truncate CLOG, multixact and CommitTs to the oldest computed value.
 	 */
 	TruncateCLOG(frozenXID);
 	TruncateCommitTs(frozenXID, true);
+	TruncateMultiXact(minMulti, minmulti_datoid, false);
 
 	/*
 	 * Update the wrap limit for GetNewTransactionId and creation of new
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index 96bbfe8..c557cb6 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -45,3 +45,4 @@ ReplicationSlotControlLock			37
 CommitTsControlLock					38
 CommitTsLock						39
 ReplicationOriginLock				40
+MultiXactTruncationLock				41
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 6213f8a..47ef38d 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -71,6 +71,7 @@ typedef struct MultiXactMember
 #define XLOG_MULTIXACT_ZERO_OFF_PAGE	0x00
 #define XLOG_MULTIXACT_ZERO_MEM_PAGE	0x10
 #define XLOG_MULTIXACT_CREATE_ID		0x20
+#define XLOG_MULTIXACT_TRUNCATE_ID		0x30
 
 typedef struct xl_multixact_create
 {
@@ -82,6 +83,21 @@ typedef struct xl_multixact_create
 
 #define SizeOfMultiXactCreate (offsetof(xl_multixact_create, members))
 
+typedef struct xl_multixact_truncate
+{
+	Oid			oldestMultiDB;
+
+	/* to-be-truncated range of multixact offsets */
+	MultiXactId startTruncOff;	/* just for completeness' sake */
+	MultiXactId endTruncOff;
+
+	/* to-be-truncated range of multixact members */
+	MultiXactOffset startTruncMemb;
+	MultiXactOffset endTruncMemb;
+} xl_multixact_truncate;
+
+#define SizeOfMultiXactTruncate (sizeof(xl_multixact_truncate))
+
 
 extern MultiXactId MultiXactIdCreate(TransactionId xid1,
 				  MultiXactStatus status1, TransactionId xid2,
@@ -120,13 +136,12 @@ extern void MultiXactGetCheckptMulti(bool is_shutdown,
 						 Oid *oldestMultiDB);
 extern void CheckPointMultiXact(void);
 extern MultiXactId GetOldestMultiXactId(void);
-extern void TruncateMultiXact(void);
+extern void TruncateMultiXact(MultiXactId oldestMulti, Oid oldestMultiDB, bool in_recovery);
 extern void MultiXactSetNextMXact(MultiXactId nextMulti,
 					  MultiXactOffset nextMultiOffset);
 extern void MultiXactAdvanceNextMXact(MultiXactId minMulti,
 						  MultiXactOffset minMultiOffset);
 extern void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB);
-extern void MultiXactSetSafeTruncate(MultiXactId safeTruncateMulti);
 extern int	MultiXactMemberFreezeThreshold(void);
 
 extern void multixact_twophase_recover(TransactionId xid, uint16 info,
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index 9c7f019..f60e75b 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -143,14 +143,14 @@ extern int SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
 extern int SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno,
 						   TransactionId xid);
 extern void SimpleLruWritePage(SlruCtl ctl, int slotno);
-extern void SimpleLruFlush(SlruCtl ctl, bool checkpoint);
+extern void SimpleLruFlush(SlruCtl ctl, bool allow_redirtied);
 extern void SimpleLruTruncate(SlruCtl ctl, int cutoffPage);
 extern bool SimpleLruDoesPhysicalPageExist(SlruCtl ctl, int pageno);
 
 typedef bool (*SlruScanCallback) (SlruCtl ctl, char *filename, int segpage,
 											  void *data);
 extern bool SlruScanDirectory(SlruCtl ctl, SlruScanCallback callback, void *data);
-extern void SlruDeleteSegment(SlruCtl ctl, char *filename);
+extern void SlruDeleteSegment(SlruCtl ctl, int segno);
 
 /* SlruScanDirectory public callbacks */
 extern bool SlruScanDirCbReportPresence(SlruCtl ctl, char *filename,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a037f81..0e149ea 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2750,6 +2750,7 @@ xl_invalid_page
 xl_invalid_page_key
 xl_multi_insert_tuple
 xl_multixact_create
+xl_multixact_truncate
 xl_parameter_change
 xl_relmap_update
 xl_replorigin_drop
-- 
2.5.0.400.gff86faf

#35Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
Re: Rework the way multixact truncations work

Hi,

I pushed this to 9.5 and master, committing the xlog page magic bump
separately. To avoid using a magic value from master in 9.5 I bumped the
numbers by two in both branches.

Should this get a release note entry given that we're not (at least
immediately) backpatching this?

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#36Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#35)
Re: Rework the way multixact truncations work

Andres Freund <andres@anarazel.de> writes:

Should this get a release note entry given that we're not (at least
immediately) backpatching this?

I'll probably put something in when I update the release notes for beta1
(next week sometime); no real need to deal with it individually.

regards, tom lane

#37Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Andres Freund (#31)
Re: Rework the way multixact truncations work

On 9/23/15 1:48 PM, Andres Freund wrote:

Honestly, I wonder whether this message

    ereport(LOG,
            (errmsg("performing legacy multixact truncation"),
             errdetail("Legacy truncations are sometimes performed when replaying WAL from an older primary."),
             errhint("Upgrade the primary, it is susceptible to data corruption.")));
shouldn't rather be a PANIC. (The main reason not to, I think, is that
once you see this, there is no way to put the standby in a working state
without recloning).

Huh? The behaviour in that case is still better than what we have in
9.3+ today (not delayed till the restartpoint). Don't see why that
should be a panic. That'd imo make it pretty much impossible to upgrade
a pair of primary/standby where you normally upgrade the standby first?

IMHO just emitting a LOG for something this serious isn't enough; it should
at least be a WARNING.

I think the concern about upgrading a replica before the master is
valid; is there some way we could over-ride a PANIC when that's exactly
what someone is trying to do? Check for a special file maybe?

+ bool sawTruncationInCkptCycle;
What happens if someone downgrades the master, back to a version that no
longer logs truncation? (I don't think assuming that the replica will
need to restart if that happens is a safe bet...)

-	if (MultiXactIdPrecedes(oldestMXact, earliest))
+	/* If there's nothing to remove, we can bail out early. */
+	if (MultiXactIdPrecedes(oldestMulti, earliest))
  	{
-		DetermineSafeOldestOffset(oldestMXact);
+		LWLockRelease(MultiXactTruncationLock);
If/when this is backpatched, would it be safer to just leave this alone?
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com

#38Andres Freund
andres@anarazel.de
In reply to: Jim Nasby (#37)
Re: Rework the way multixact truncations work

On 2015-09-27 14:21:08 -0500, Jim Nasby wrote:

IMHO just emitting a LOG for something this serious isn't enough; it should
at least be a WARNING.

In postgres LOG, somewhat confusingly, is more severe than WARNING.
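
The ordering described here can be sketched in a few lines of C. This is a simplified, self-contained model and not PostgreSQL's own code: the enum values and the function name goes_to_server_log are invented for illustration. The assumption it encodes is that, when deciding what reaches the server log, LOG is ranked between ERROR and FATAL even though its numeric level sorts below WARNING.

```c
#include <stdbool.h>

/*
 * Illustrative severity levels.  The numeric values are assumptions, chosen
 * only so that LOG sorts numerically below WARNING.
 */
enum demo_level
{
	DEMO_DEBUG = 10,
	DEMO_LOG = 15,
	DEMO_NOTICE = 18,
	DEMO_WARNING = 19,
	DEMO_ERROR = 20,
	DEMO_FATAL = 21,
	DEMO_PANIC = 22
};

/*
 * Decide whether a message of severity "elevel" reaches the server log when
 * the configured threshold is "min_level".  LOG is special-cased so that it
 * behaves as if it sat between ERROR and FATAL.
 */
static bool
goes_to_server_log(int elevel, int min_level)
{
	if (elevel == DEMO_LOG)
		return min_level == DEMO_LOG || min_level <= DEMO_ERROR;
	if (min_level == DEMO_LOG)
		return elevel >= DEMO_FATAL;
	return elevel >= min_level;
}
```

Under this model, with the threshold set to ERROR a LOG message is still written to the server log while a WARNING is suppressed.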

I think the concern about upgrading a replica before the master is valid; is
there some way we could over-ride a PANIC when that's exactly what someone
is trying to do? Check for a special file maybe?

I don't understand this concern - that's just the situation we have in
all released branches today.

+ bool sawTruncationInCkptCycle;
What happens if someone downgrades the master, back to a version that no
longer logs truncation? (I don't think assuming that the replica will need
to restart if that happens is a safe bet...)

It'll just do legacy truncation again - without a restart on the
standby required.

-	if (MultiXactIdPrecedes(oldestMXact, earliest))
+	/* If there's nothing to remove, we can bail out early. */
+	if (MultiXactIdPrecedes(oldestMulti, earliest))
{
-		DetermineSafeOldestOffset(oldestMXact);
+		LWLockRelease(MultiXactTruncationLock);
If/when this is backpatched, would it be safer to just leave this alone?

What do you mean? This can't just be left alone in isolation.

#39Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Andres Freund (#38)
Re: Rework the way multixact truncations work

On 9/27/15 2:25 PM, Andres Freund wrote:

On 2015-09-27 14:21:08 -0500, Jim Nasby wrote:

IMHO just emitting a LOG for something this serious isn't enough; it should
at least be a WARNING.

In postgres LOG, somewhat confusingly, is more severe than WARNING.

Ahh, right. Which in this case stinks, because WARNING is a lot more
attention grabbing than LOG. :/

I think the concern about upgrading a replica before the master is valid; is
there some way we could over-ride a PANIC when that's exactly what someone
is trying to do? Check for a special file maybe?

I don't understand this concern - that's just the situation we have in
all released branches today.

There was discussion about making this a PANIC instead of a LOG, which I
think is a good idea... but then there'd need to be some way to not
PANIC if you were doing an upgrade.

+ bool sawTruncationInCkptCycle;
What happens if someone downgrades the master, back to a version that no
longer logs truncation? (I don't think assuming that the replica will need
to restart if that happens is a safe bet...)

It'll just do legacy truncation again - without a restart on the
standby required.

Oh, I thought once that was set it would stay set. NM.

-	if (MultiXactIdPrecedes(oldestMXact, earliest))
+	/* If there's nothing to remove, we can bail out early. */
+	if (MultiXactIdPrecedes(oldestMulti, earliest))
{
-		DetermineSafeOldestOffset(oldestMXact);
+		LWLockRelease(MultiXactTruncationLock);
If/when this is backpatched, would it be safer to just leave this alone?

What do you mean? This can't just be left alone in isolation.

I thought removing DetermineSafeOldestOffset was just an optimization,
but I guess I was confused.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com

#40Robert Haas
robertmhaas@gmail.com
In reply to: Jim Nasby (#39)
Re: Rework the way multixact truncations work

On Mon, Sep 28, 2015 at 5:47 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

There was discussion about making this a PANIC instead of a LOG, which I
think is a good idea... but then there'd need to be some way to not PANIC if
you were doing an upgrade.

I think you're worrying about a non-problem. This code has not been
back-patched prior to 9.5, and the legacy truncation code has been
removed in 9.5+. So it's a complete non-issue right at the moment.
If at some point we back-patch this further, then it potentially
becomes a live issue, but I would like to respectfully inquire what
exactly you think making it a PANIC would accomplish? There are a lot
of scary things about this patch, but the logic for deciding whether
to perform a legacy truncation is solid as far as I know.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#41Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Robert Haas (#40)
Re: Rework the way multixact truncations work

On 9/28/15 8:49 PM, Robert Haas wrote:

If at some point we back-patch this further, then it potentially
becomes a live issue, but I would like to respectfully inquire what
exactly you think making it a PANIC would accomplish? There are a lot
of scary things about this patch, but the logic for deciding whether
to perform a legacy truncation is solid as far as I know.

Maybe I'm confused, but I thought the whole purpose of this was to get
rid of the risk associated with that calculation in favor of explicit
truncation boundaries in the WAL log.

Even if that's not the case, ISTM that being big and in your face about
a potential data corruption bug is a good thing, as long as the DBA has
a way to "hit the snooze button".

Either way, I'm not going to make a fuss over it.

Just to make sure we're on the same page; Alvaro's original comment was:

Honestly, I wonder whether this message
    ereport(LOG,
            (errmsg("performing legacy multixact truncation"),
             errdetail("Legacy truncations are sometimes performed when replaying WAL from an older primary."),
             errhint("Upgrade the primary, it is susceptible to data corruption.")));
shouldn't rather be a PANIC. (The main reason not to, I think, is that
once you see this, there is no way to put the standby in a working state
without recloning).

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com

#42Andres Freund
andres@anarazel.de
In reply to: Jim Nasby (#41)
Re: Rework the way multixact truncations work

On 2015-09-28 21:48:00 -0500, Jim Nasby wrote:

On 9/28/15 8:49 PM, Robert Haas wrote:

If at some point we back-patch this further, then it potentially
becomes a live issue, but I would like to respectfully inquire what
exactly you think making it a PANIC would accomplish? There are a lot
of scary things about this patch, but the logic for deciding whether
to perform a legacy truncation is solid as far as I know.

Maybe I'm confused, but I thought the whole purpose of this was to get rid
of the risk associated with that calculation in favor of explicit truncation
boundaries in the WAL log.

Even if that's not the case, ISTM that being big and in your face about a
potential data corruption bug is a good thing, as long as the DBA has a way
to "hit the snooze button".

So we'd end up with a guc that everyone has to set while they
upgrade. That seems like a pointless hassle.

Greetings,

Andres Freund

#43Robert Haas
robertmhaas@gmail.com
In reply to: Jim Nasby (#41)
Re: Rework the way multixact truncations work

On Mon, Sep 28, 2015 at 10:48 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

Maybe I'm confused, but I thought the whole purpose of this was to get rid
of the risk associated with that calculation in favor of explicit truncation
boundaries in the WAL log.

Yes. But if the master hasn't been updated yet, then we still need to
do something based on a calculation.

Even if that's not the case, ISTM that being big and in your face about a
potential data corruption bug is a good thing, as long as the DBA has a way
to "hit the snooze button".

Panicking the standby because the master hasn't been updated does not
seem like a good thing to me in any way.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#44Joel Jacobson
joel@trustly.com
In reply to: Andres Freund (#19)
Re: Rework the way multixact truncations work

On Tue, Sep 22, 2015 at 3:20 PM, Andres Freund <andres@anarazel.de> wrote:

What I've tested is the following:
* continuous burning of multis, both triggered via members and offsets
* a standby keeping up when the primary is old
* a standby keeping up when the primary is new
* basebackups made while a new primary is under load
* verified that we properly PANIC when a truncation record is replayed
in an old standby.

Are these test scripts available somewhere?
I understand they might be undocumented and perhaps tricky to set up,
but I would be very interested in them anyway.
Do you think you could push them somewhere?

Thanks a lot for working on this!

#45Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#43)
Re: Rework the way multixact truncations work

Robert Haas wrote:

On Mon, Sep 28, 2015 at 10:48 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

Maybe I'm confused, but I thought the whole purpose of this was to get rid
of the risk associated with that calculation in favor of explicit truncation
boundaries in the WAL log.

Yes. But if the master hasn't been updated yet, then we still need to
do something based on a calculation.

Right.

Even if that's not the case, ISTM that being big and in your face about a
potential data corruption bug is a good thing, as long as the DBA has a way
to "hit the snooze button".

Panicking the standby because the master hasn't been updated does not
seem like a good thing to me in any way.

If we had a way to force the master to upgrade, I think it would be good
because we'd then have a mechanism to get rid of the legacy truncation code;
but as I said several messages ago, this doesn't actually work, which is
why I dropped the idea of panicking.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#46Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#22)
Re: Rework the way multixact truncations work

I'm several days into a review of this change (commits 4f627f8 and aa29c1c).
There's one part of the design I want to understand before commenting on
specific code. What did you anticipate to be the consequences of failing to
remove SLRU segment files that MultiXactState->oldestMultiXactId implies we
should have removed? I ask because, on the one hand, I see code making
substantial efforts to ensure that it truncates exactly as planned:

/*
* Do truncation, and the WAL logging of the truncation, in a critical
* section. That way offsets/members cannot get out of sync anymore, i.e.
* once consistent the newOldestMulti will always exist in members, even
* if we crashed in the wrong moment.
*/
START_CRIT_SECTION();

/*
* Prevent checkpoints from being scheduled concurrently. This is critical
* because otherwise a truncation record might not be replayed after a
* crash/basebackup, even though the state of the data directory would
* require it.
*/
Assert(!MyPgXact->delayChkpt);
MyPgXact->delayChkpt = true;
...
/*
* Update in-memory limits before performing the truncation, while inside
* the critical section: Have to do it before truncation, to prevent
* concurrent lookups of those values. Has to be inside the critical
* section as otherwise a future call to this function would error out,
* while looking up the oldest member in offsets, if our caller crashes
* before updating the limits.
*/

On the other hand, TruncateMultiXact() callees ignore unlink() return values:

On Tue, Sep 22, 2015 at 07:57:27PM +0200, Andres Freund wrote:

On 2015-09-22 13:38:58 -0400, Robert Haas wrote:

- If SlruDeleteSegment fails in unlink(), shouldn't we at the very
least log a message? If that file is still there when we loop back
around, it's going to cause a failure, I think.

The existing unlink() call doesn't, that's the only reason I didn't add
a message there. I'm fine with adding a (LOG or WARNING?) message.

Unlinking old pg_clog files is strictly an optimization. If you were to
comment out every unlink() call in slru.c, the only ill effect on CLOG is the
waste of disk space. Is the same true of MultiXact?

If there's anyplace where failure to unlink would cause a malfunction, I think
it would be around the use of SlruScanDirCbFindEarliest(). That function's
result becomes undefined if the range of pg_multixact/offsets segment files
present on disk spans more than about INT_MAX/2 MultiXactId.

Thanks,
nm

#47Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#46)
Re: Rework the way multixact truncations work

Hi,

On 2015-10-24 22:07:00 -0400, Noah Misch wrote:

I'm several days into a review of this change (commits 4f627f8 and
aa29c1c).

Cool!

There's one part of the design I want to understand before commenting on
specific code. What did you anticipate to be the consequences of failing to
remove SLRU segment files that MultiXactState->oldestMultiXactId implies we
should have removed? I ask because, on the one hand, I see code making
substantial efforts to ensure that it truncates exactly as planned:

[portion of TruncateMultiXact]

The reason we can't have checkpoints there is that checkpoints record
multixact values in the checkpoint record. If we crash-restart before
the truncation has finished we can end up in the situation that
->oldestMultiXactId doesn't exist. Which will trigger a round of
emergency vacuum at the next startup, not something that should happen
due to a concurrency problem.

We could instead update the in-memory values first, but that could lead
to other problems.

So the critical section/delaying of checkpoints is more about having the
on-disk state agree with the status data in the checkpoint/control file.

On Tue, Sep 22, 2015 at 07:57:27PM +0200, Andres Freund wrote:

On 2015-09-22 13:38:58 -0400, Robert Haas wrote:

- If SlruDeleteSegment fails in unlink(), shouldn't we at the very
least log a message? If that file is still there when we loop back
around, it's going to cause a failure, I think.

The existing unlink() call doesn't, that's the only reason I didn't add
a message there. I'm fine with adding a (LOG or WARNING?) message.

Note that I didn't add the warning after all, as it'd be too noisy
during repeated replay, as all the files would already be gone. We could
only emit it when the error is not ENOENT, if people prefer that.
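
The "only complain when the file was not already gone" idea could look roughly like the following. This is a hypothetical standalone helper, not the slru.c code: the name unlink_quiet_if_missing and the use of stderr are assumptions made for illustration. It relies only on POSIX unlink() reporting a missing file as errno ENOENT.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: remove "path" and report a problem only when the
 * failure is something other than "file does not exist".
 *
 * Returns 0 if the file is absent afterwards, -1 otherwise.
 */
static int
unlink_quiet_if_missing(const char *path)
{
	if (unlink(path) == 0)
		return 0;				/* removed just now */

	if (errno == ENOENT)
		return 0;				/* already gone, e.g. removed by an earlier replay */

	/* Any other failure (permissions, I/O error, ...) is worth reporting. */
	fprintf(stderr, "could not remove file \"%s\": %s\n",
			path, strerror(errno));
	return -1;
}
```

Replaying the same deletion twice then stays silent on the second pass, while a permissions problem would still be reported.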

Unlinking old pg_clog files is strictly an optimization. If you were to
comment out every unlink() call in slru.c, the only ill effect on CLOG is the
waste of disk space. Is the same true of MultiXact?

Well, multixacts are a lot larger than the other SLRUs, I think that
makes some sort of difference.

Thanks,

Andres

#48Josh Berkus
josh@agliodbs.com
In reply to: Robert Haas (#10)
Re: Rework the way multixact truncations work

On 10/27/2015 07:44 AM, Andres Freund wrote:

Unlinking old pg_clog files is strictly an optimization. If you were to
comment out every unlink() call in slru.c, the only ill effect on CLOG is the
waste of disk space. Is the same true of MultiXact?

Well, multixacts are a lot larger than the other SLRUs, I think that
makes some sort of difference.

And by "a lot larger" we're talking like 50X to 100X. I regularly see
pg_multixact directories larger than 1GB.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

#49Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#47)
Re: Rework the way multixact truncations work

On Tue, Oct 27, 2015 at 03:44:10PM +0100, Andres Freund wrote:

On 2015-10-24 22:07:00 -0400, Noah Misch wrote:

On Tue, Sep 22, 2015 at 07:57:27PM +0200, Andres Freund wrote:

On 2015-09-22 13:38:58 -0400, Robert Haas wrote:

- If SlruDeleteSegment fails in unlink(), shouldn't we at the very
least log a message? If that file is still there when we loop back
around, it's going to cause a failure, I think.

The existing unlink() call doesn't, that's the only reason I didn't add
a message there. I'm fine with adding a (LOG or WARNING?) message.

Note that I didn't add the warning after all, as it'd be too noisy
during repeated replay, as all the files would already be gone. We could
only emit it when the error is not ENOENT, if people prefer that.

Unlinking old pg_clog files is strictly an optimization. If you were to
comment out every unlink() call in slru.c, the only ill effect on CLOG is the
waste of disk space. Is the same true of MultiXact?

Well, multixacts are a lot larger than the other SLRUs, I think that
makes some sort of difference.

That helps; thanks. Your design seems good. I've located only insipid
defects. I propose to save some time by writing a patch series eliminating
them, which you could hopefully review. Does that sound good?

#50Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#49)
Re: Rework the way multixact truncations work

Hi,

On October 29, 2015 7:59:03 AM GMT+01:00, Noah Misch <noah@leadboat.com> wrote:

On Tue, Oct 27, 2015 at 03:44:10PM +0100, Andres Freund wrote:

On 2015-10-24 22:07:00 -0400, Noah Misch wrote:

On Tue, Sep 22, 2015 at 07:57:27PM +0200, Andres Freund wrote:

On 2015-09-22 13:38:58 -0400, Robert Haas wrote:

- If SlruDeleteSegment fails in unlink(), shouldn't we at the very
least log a message? If that file is still there when we loop back
around, it's going to cause a failure, I think.

The existing unlink() call doesn't, that's the only reason I didn't add
a message there. I'm fine with adding a (LOG or WARNING?) message.

Note that I didn't add the warning after all, as it'd be too noisy
during repeated replay, as all the files would already be gone. We could
only emit it when the error is not ENOENT, if people prefer that.

Unlinking old pg_clog files is strictly an optimization. If you were to
comment out every unlink() call in slru.c, the only ill effect on CLOG is the
waste of disk space. Is the same true of MultiXact?

Well, multixacts are a lot larger than the other SLRUs, I think that
makes some sort of difference.

That helps; thanks. Your design seems good. I've located only insipid
defects.

Great!

I propose to save some time by writing a patch series eliminating
them, which you could hopefully review. Does that sound good?

Yes, it does.

Andres

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.

#51Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#50)
1 attachment(s)
Re: Rework the way multixact truncations work

On Thu, Oct 29, 2015 at 08:46:52AM +0100, Andres Freund wrote:

On October 29, 2015 7:59:03 AM GMT+01:00, Noah Misch <noah@leadboat.com> wrote:

That helps; thanks. Your design seems good. I've located only insipid
defects.

Great!

I propose to save some time by writing a patch series eliminating
them, which you could hopefully review. Does that sound good?

Yes, it does.

I have pushed a stack of branches to https://github.com/nmisch/postgresql.git:

mxt0-revert - reverts commits 4f627f8 and aa29c1c
mxt1-disk-independent - see below
mxt2-cosmetic - update already-wrong comments and formatting
mxt3-main - replaces commit 4f627f8
mxt4-rm-legacy - replaces commit aa29c1c

The plan is to squash each branch into one PostgreSQL commit. In addition to
examining overall "git diff mxt2-cosmetic mxt3-main", I recommend reviewing
itemized changes and commit log entries in "git log -p --reverse --no-merges
mxt2-cosmetic..mxt3-main". In particular, when a change involves something
you discussed upthread or was otherwise not obvious, I put a statement of
rationale in the commit log.

+	 * Make sure to only attempt truncation if there's values to truncate
+	 * away. In normal processing values shouldn't go backwards, but there's
+	 * some corner cases (due to bugs) where that's possible.

Which bugs are those? I would like to include more detail if available.

If anything here requires careful study, it's the small mxt1-disk-independent
change, which I have also attached in patch form. That patch removes the
SlruScanDirCbFindEarliest() test from TruncateMultiXact(), which in turn makes
multixact control decisions independent of whether TruncateMultiXact() is
successful at unlinking segments. Today, one undeletable segment file can
cause TruncateMultiXact() to give up on truncation completely for a span of
hundreds of millions of MultiXactId. Patched multixact.c will, like CLOG,
make its decisions strictly based on the content of files it expects to exist.
It will no longer depend on the absence of files it hopes will not exist.

To aid in explaining the change's effects, I will define some terms. A
"logical wrap" occurs when no range of 2^31 integers covers the set of
MultiXactId stored in xmax fields. A "file-level wrap" occurs when there
exists a pair of pg_multixact/offsets segment files such that:

| segno_a * SLRU_PAGES_PER_SEGMENT * MULTIXACT_OFFSETS_PER_PAGE -
segno_b * SLRU_PAGES_PER_SEGMENT * MULTIXACT_OFFSETS_PER_PAGE | > 2^31
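
As a rough illustration of the inequality above, the check can be written as a small C function. The constants are assumptions: 32 pages per SLRU segment, and 2048 offsets per page (8192-byte blocks holding 4-byte MultiXactOffset entries). The function name is_file_level_wrap is made up for this sketch and does not exist in the tree.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values; a build with a different block size would differ. */
#define SLRU_PAGES_PER_SEGMENT		32
#define MULTIXACT_OFFSETS_PER_PAGE	2048

/*
 * Return true if the first MultiXactIds covered by the two offsets segments
 * are more than 2^31 apart, i.e. the pair forms a "file-level wrap" in the
 * sense defined above.
 */
static bool
is_file_level_wrap(uint32_t segno_a, uint32_t segno_b)
{
	const uint64_t per_segment =
		(uint64_t) SLRU_PAGES_PER_SEGMENT * MULTIXACT_OFFSETS_PER_PAGE;
	uint64_t	first_a = segno_a * per_segment;
	uint64_t	first_b = segno_b * per_segment;
	uint64_t	distance = (first_a > first_b) ? first_a - first_b
											   : first_b - first_a;

	return distance > (UINT64_C(1) << 31);
}
```

With these constants each segment covers 65536 MultiXactIds, so two segment files whose numbers differ by more than 32768 would form a file-level wrap.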

A logical wrap implies either a file-level wrap or missing visibility
metadata, but a file-level wrap does not imply other consequences. The
SlruScanDirCbFindEarliest() test is usually redundant with
find_multixact_start(), because MultiXactIdPrecedes(oldestMXact, earliest)
almost implies that find_multixact_start() will fail. The exception arises
when pg_multixact/offsets files compose a file-level wrap, which can happen
when TruncateMultiXact() fails to unlink segments as planned. When it does
happen, the result of SlruScanDirCbFindEarliest(), and therefore the computed
"earliest" value, is undefined. (This outcome is connected to our requirement
to use only half the pg_clog or pg_multixact/offsets address space at any one
time. The PagePrecedes callbacks for these SLRUs cease to be transitive if
more than half their address space is in use.)
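
The loss of transitivity can be shown with a minimal sketch, assuming a modulo-2^32 comparison of the same shape as the signed-difference test used for MultiXactIds. The function name wraparound_precedes is hypothetical, chosen for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Modulo-2^32 ordering: "a" precedes "b" if the difference, reinterpreted
 * as a signed 32-bit value, is negative.  The unsigned-to-signed conversion
 * is implementation-defined before C23 but wraps on mainstream compilers.
 */
static bool
wraparound_precedes(uint32_t a, uint32_t b)
{
	int32_t		diff = (int32_t) (a - b);

	return diff < 0;
}
```

Take a = 0, b = 0x60000000 and c = 0xC0000000, which together span more than half the address space. Then a precedes b and b precedes c, yet a does not precede c, so a scan that keeps "the earliest value seen so far" can return a different answer depending on the order in which it visits the values.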

The SlruScanDirCbFindEarliest() test can be helpful when a file-level wrap
coexists with incorrect oldestMultiXactId (extant xmax values define "correct
oldestMultiXactId"). If we're lucky with readdir() order, the test will block
truncation so we don't delete still-needed segments. I am content to lose
that, because (a) the code isn't reliable for (or even directed toward) that
purpose and (b) sites running on today's latest point releases no longer have
incorrect oldestMultiXactId.

Attachments:

mxact-trunc-disk-independent-v1.patch (text/plain; charset=us-ascii)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index cb12fc3..cb4a0cd 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -2945,29 +2945,6 @@ SlruScanDirCbRemoveMembers(SlruCtl ctl, char *filename, int segpage,
 	return false;				/* keep going */
 }
 
-typedef struct mxtruncinfo
-{
-	int			earliestExistingPage;
-} mxtruncinfo;
-
-/*
- * SlruScanDirectory callback
- *		This callback determines the earliest existing page number.
- */
-static bool
-SlruScanDirCbFindEarliest(SlruCtl ctl, char *filename, int segpage, void *data)
-{
-	mxtruncinfo *trunc = (mxtruncinfo *) data;
-
-	if (trunc->earliestExistingPage == -1 ||
-		ctl->PagePrecedes(segpage, trunc->earliestExistingPage))
-	{
-		trunc->earliestExistingPage = segpage;
-	}
-
-	return false;				/* keep going */
-}
-
 /*
  * Remove all MultiXactOffset and MultiXactMember segments before the oldest
  * ones still of interest.
@@ -2986,8 +2963,6 @@ TruncateMultiXact(void)
 	MultiXactOffset oldestOffset;
 	MultiXactId		nextMXact;
 	MultiXactOffset	nextOffset;
-	mxtruncinfo trunc;
-	MultiXactId earliest;
 	MembersLiveRange range;
 
 	Assert(AmCheckpointerProcess() || AmStartupProcess() ||
@@ -3001,49 +2976,20 @@ TruncateMultiXact(void)
 	Assert(MultiXactIdIsValid(oldestMXact));
 
 	/*
-	 * Note we can't just plow ahead with the truncation; it's possible that
-	 * there are no segments to truncate, which is a problem because we are
-	 * going to attempt to read the offsets page to determine where to
-	 * truncate the members SLRU.  So we first scan the directory to determine
-	 * the earliest offsets page number that we can read without error.
-	 */
-	trunc.earliestExistingPage = -1;
-	SlruScanDirectory(MultiXactOffsetCtl, SlruScanDirCbFindEarliest, &trunc);
-	earliest = trunc.earliestExistingPage * MULTIXACT_OFFSETS_PER_PAGE;
-	if (earliest < FirstMultiXactId)
-		earliest = FirstMultiXactId;
-
-	/*
-	 * If there's nothing to remove, we can bail out early.
-	 *
-	 * Due to bugs in early releases of PostgreSQL 9.3.X and 9.4.X,
-	 * oldestMXact might point to a multixact that does not exist.
-	 * Autovacuum will eventually advance it to a value that does exist,
-	 * and we want to set a proper offsetStopLimit when that happens,
-	 * so call DetermineSafeOldestOffset here even if we're not actually
-	 * truncating.
-	 */
-	if (MultiXactIdPrecedes(oldestMXact, earliest))
-	{
-		DetermineSafeOldestOffset(oldestMXact);
-		return;
-	}
-
-	/*
 	 * First, compute the safe truncation point for MultiXactMember. This is
 	 * the starting offset of the oldest multixact.
 	 *
-	 * Hopefully, find_multixact_start will always work here, because we've
-	 * already checked that it doesn't precede the earliest MultiXact on
-	 * disk.  But if it fails, don't truncate anything, and log a message.
+	 * Due to bugs in early releases of PostgreSQL 9.3.X and 9.4.X,
+	 * oldestMXact might point to a multixact that does not exist.  Call
+	 * DetermineSafeOldestOffset() to emit the message about disabled member
+	 * wraparound protection.  Autovacuum will eventually advance oldestMXact
+	 * to a value that does exist.
 	 */
 	if (oldestMXact == nextMXact)
 		oldestOffset = nextOffset;		/* there are NO MultiXacts */
 	else if (!find_multixact_start(oldestMXact, &oldestOffset))
 	{
-		ereport(LOG,
-				(errmsg("oldest MultiXact %u not found, earliest MultiXact %u, skipping truncation",
-					oldestMXact, earliest)));
+		DetermineSafeOldestOffset(oldestMXact);
 		return;
 	}
 
#52Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#51)
Re: Rework the way multixact truncations work

On November 8, 2015 12:54:07 AM PST, Noah Misch <noah@leadboat.com> wrote:

I have pushed a stack of branches to
https://github.com/nmisch/postgresql.git:

mxt0-revert - reverts commits 4f627f8 and aa29c1c
mxt1-disk-independent - see below
mxt2-cosmetic - update already-wrong comments and formatting
mxt3-main - replaces commit 4f627f8
mxt4-rm-legacy - replaces commit aa29c1c

The plan is to squash each branch into one PostgreSQL commit. In addition to
examining overall "git diff mxt2-cosmetic mxt3-main", I recommend reviewing
itemized changes and commit log entries in "git log -p --reverse --no-merges
mxt2-cosmetic..mxt3-main". In particular, when a change involves something
you discussed upthread or was otherwise not obvious, I put a statement of
rationale in the commit log.

I'm not following along right now - in order to make cleanups the plan is to revert a couple commits and then redo them prettified?

Andres
--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#53Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#52)
Re: Rework the way multixact truncations work

On Sun, Nov 08, 2015 at 11:11:42AM -0800, Andres Freund wrote:

On November 8, 2015 12:54:07 AM PST, Noah Misch <noah@leadboat.com> wrote:

I have pushed a stack of branches to
https://github.com/nmisch/postgresql.git:

mxt0-revert - reverts commits 4f627f8 and aa29c1c
mxt1-disk-independent - see below
mxt2-cosmetic - update already-wrong comments and formatting
mxt3-main - replaces commit 4f627f8
mxt4-rm-legacy - replaces commit aa29c1c

The plan is to squash each branch into one PostgreSQL commit. In addition to
examining overall "git diff mxt2-cosmetic mxt3-main", I recommend reviewing
itemized changes and commit log entries in "git log -p --reverse --no-merges
mxt2-cosmetic..mxt3-main". In particular, when a change involves something
you discussed upthread or was otherwise not obvious, I put a statement of
rationale in the commit log.

I'm not following along right now - in order to make cleanups the plan is to revert a couple commits and then redo them prettified?

Yes, essentially. Given the volume of updates, this seemed neater than
framing those updates as in-tree incremental development.

#54Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#53)
Re: Rework the way multixact truncations work

On November 8, 2015 11:52:05 AM PST, Noah Misch <noah@leadboat.com> wrote:

On Sun, Nov 08, 2015 at 11:11:42AM -0800, Andres Freund wrote:

On November 8, 2015 12:54:07 AM PST, Noah Misch <noah@leadboat.com> wrote:

I have pushed a stack of branches to
https://github.com/nmisch/postgresql.git:

mxt0-revert - reverts commits 4f627f8 and aa29c1c
mxt1-disk-independent - see below
mxt2-cosmetic - update already-wrong comments and formatting
mxt3-main - replaces commit 4f627f8
mxt4-rm-legacy - replaces commit aa29c1c

The plan is to squash each branch into one PostgreSQL commit. In addition to
examining overall "git diff mxt2-cosmetic mxt3-main", I recommend reviewing
itemized changes and commit log entries in "git log -p --reverse --no-merges
mxt2-cosmetic..mxt3-main". In particular, when a change involves something
you discussed upthread or was otherwise not obvious, I put a statement of
rationale in the commit log.

I'm not following along right now - in order to make cleanups the plan is to revert a couple commits and then redo them prettified?

Yes, essentially. Given the volume of updates, this seemed neater than
framing those updates as in-tree incremental development.

I don't like that plan. I don't have a problem doing that in some development branch somewhere, but I fail to see any benefit doing that in 9.5/master. It'll just make the history more convoluted for no benefit.

I'll obviously still review the changes.

Even for review it's not particularly convenient, because now the entirety of the large changes essentially needs to be reviewed anew, given they're not the same.

#55Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#54)
Re: Rework the way multixact truncations work

On Sun, Nov 08, 2015 at 11:59:33AM -0800, Andres Freund wrote:

On November 8, 2015 11:52:05 AM PST, Noah Misch <noah@leadboat.com> wrote:

On Sun, Nov 08, 2015 at 11:11:42AM -0800, Andres Freund wrote:

On November 8, 2015 12:54:07 AM PST, Noah Misch <noah@leadboat.com> wrote:

I have pushed a stack of branches to
https://github.com/nmisch/postgresql.git:

mxt0-revert - reverts commits 4f627f8 and aa29c1c
mxt1-disk-independent - see below
mxt2-cosmetic - update already-wrong comments and formatting
mxt3-main - replaces commit 4f627f8
mxt4-rm-legacy - replaces commit aa29c1c

The plan is to squash each branch into one PostgreSQL commit. In addition to
examining overall "git diff mxt2-cosmetic mxt3-main", I recommend reviewing
itemized changes and commit log entries in "git log -p --reverse --no-merges
mxt2-cosmetic..mxt3-main". In particular, when a change involves something
you discussed upthread or was otherwise not obvious, I put a statement of
rationale in the commit log.

I'm not following along right now - in order to make cleanups the plan is to revert a couple commits and then redo them prettified?

Yes, essentially. Given the volume of updates, this seemed neater than
framing those updates as in-tree incremental development.

I don't like that plan. I don't have a problem doing that in some development branch somewhere, but I fail to see any benefit doing that in 9.5/master. It'll just make the history more convoluted for no benefit.

I'll obviously still review the changes.

Cleanliness of history is precisely why I did it this way. If I had framed
the changes as in-tree incremental development, no one "git diff" command
would show the truncation rework or a coherent subset. To review the whole,
students of this code might resort to a cherry-pick of the repair commits onto
aa29c1c. That, too, proves dissatisfying; the history would nowhere carry a
finished version of legacy truncation support. A hacker opting to back-patch
in the future, as commit 4f627f8 contemplated, would need to dig through this
thread for the bits added in mxt3-main and removed in mxt4-rm-legacy.

The benefits may become clearer as you continue to review the branches.

Even for review it's not particularly convenient, because now the entirety of the large changes essentially needs to be reviewed anew, given they're not the same.

Agreed; I optimized for future readers, and I don't doubt this is less
convenient for you and for others already familiar with commits 4f627f8 and
aa29c1c. I published branches, not squashed patches, mostly because I think
the individual branch commits will facilitate your study of the changes. I
admit the cost to you remains high.

#56Peter Geoghegan
pg@heroku.com
In reply to: Noah Misch (#53)
Re: Rework the way multixact truncations work

On Sun, Nov 8, 2015 at 11:52 AM, Noah Misch <noah@leadboat.com> wrote:

I'm not following along right now - in order to make cleanups the plan is to revert a couple commits and then redo them prettified?

Yes, essentially. Given the volume of updates, this seemed neater than
framing those updates as in-tree incremental development.

I think that's an odd way of representing this work. I tend to
remember roughly when major things were committed even years later. An
outright revert should represent a total back out of the original
commit IMV. Otherwise, a git blame can be quite misleading. I can
imagine questioning my recollection, even when it is accurate, if only
because I don't expect this.

--
Peter Geoghegan

#57Noah Misch
noah@leadboat.com
In reply to: Peter Geoghegan (#56)
Re: Rework the way multixact truncations work

On Mon, Nov 23, 2015 at 11:44:45AM -0800, Peter Geoghegan wrote:

On Sun, Nov 8, 2015 at 11:52 AM, Noah Misch <noah@leadboat.com> wrote:

I'm not following along right now - in order to make cleanups the plan is to revert a couple commits and then redo them prettified?

Yes, essentially. Given the volume of updates, this seemed neater than
framing those updates as in-tree incremental development.

I think that's an odd way of representing this work. I tend to
remember roughly when major things were committed even years later. An
outright revert should represent a total back out of the original
commit IMV. Otherwise, a git blame can be quite misleading.

I think you're saying that "clearer git blame" is a more-important reason than
"volume of updates" for preferring an outright revert over in-tree incremental
development. Fair preference. If that's a correct reading of your message,
then we do agree on the bottom line.

#58Robert Haas
robertmhaas@gmail.com
In reply to: Noah Misch (#57)
Re: Rework the way multixact truncations work

On Fri, Nov 27, 2015 at 5:16 PM, Noah Misch <noah@leadboat.com> wrote:

On Mon, Nov 23, 2015 at 11:44:45AM -0800, Peter Geoghegan wrote:

On Sun, Nov 8, 2015 at 11:52 AM, Noah Misch <noah@leadboat.com> wrote:

I'm not following along right now - in order to make cleanups the plan is to revert a couple commits and then redo them prettified?

Yes, essentially. Given the volume of updates, this seemed neater than
framing those updates as in-tree incremental development.

I think that's an odd way of representing this work. I tend to
remember roughly when major things were committed even years later. An
outright revert should represent a total back out of the original
commit IMV. Otherwise, a git blame can be quite misleading.

I think you're saying that "clearer git blame" is a more-important reason than
"volume of updates" for preferring an outright revert over in-tree incremental
development. Fair preference. If that's a correct reading of your message,
then we do agree on the bottom line.

Hmm. I read Peter's message as agreeing with Andres rather than with
you. And I have to say I agree with Andres as well. I think it's
weird to back a commit out only to put a bunch of very similar stuff
back in. Even figuring out what you've actually changed here seems
rather hard. I couldn't get github to give me a diff showing your
changes vs. master.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#59Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#58)
Re: Rework the way multixact truncations work

On Tue, Dec 1, 2015 at 2:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Hmm. I read Peter's message as agreeing with Andres rather than with
you. And I have to say I agree with Andres as well. I think it's
weird to back a commit out only to put a bunch of very similar stuff
back in.

Your interpretation was correct. I think it's surprising to structure
things this way, especially since we haven't done things this way in
the past.

--
Peter Geoghegan

#60Noah Misch
noah@leadboat.com
In reply to: Robert Haas (#58)
Re: Rework the way multixact truncations work

On Tue, Dec 01, 2015 at 05:07:15PM -0500, Robert Haas wrote:

I think it's weird to back a commit out only to put a bunch of very similar
stuff back in.

I agree with that. If the original patches and their replacements shared 95%
of diff lines in common, we wouldn't be having this conversation. These
replacements redo closer to 50% of the lines, so the patches are not very
similar by verbatim line comparison. Even so, let's stipulate that my
proposal is weird. I'd rather be weird than lose one of the benefits I
articulated upthread, let alone all of them.

My post-commit review of RLS woke me up to the problems of gradually finishing
work in-tree. By the time I started that review in 2015-06, the base RLS
feature already spanned twenty-two commits. (That count has since more than
doubled.) Reviewing each commit threatened to be wasteful, because I would
presumably find matters already fixed later. I tried to cherry-pick the
twenty-two commits onto a branch, hoping to review the overall diff as "git
diff master...squash-rls", but that yielded complex merge conflicts. Where
the conflicts were too much, I reviewed entire files instead. (Granted, no
matter how this thread ends, I do not expect an outcome that opaque.) Hackers
have been too reticent to revert and redo defect-laden commits. If doing that
is weird today, let it be normal.

Even figuring out what you've actually changed here seems
rather hard. I couldn't get github to give me a diff showing your
changes vs. master.

If you, an expert in the 2015-09-26 commits, want to study the key changes I
made, I recommend perusing these outputs:

git remote add nmisch_github https://github.com/nmisch/postgresql.git
git fetch nmisch_github
git diff nmisch_github/mxt0-revert nmisch_github/mxt1-disk-independent
git log -p --reverse --no-merges nmisch_github/mxt2-cosmetic..nmisch_github/mxt3-main

For the overall diff vs. master that you sought:

git remote add community git://git.postgresql.org/git/postgresql.git
git remote add nmisch_github https://github.com/nmisch/postgresql.git
git fetch --multiple community nmisch_github
git diff community/master...nmisch_github/mxt4-rm-legacy

If anyone not an author or reviewer of the 2015-09-26 commits wishes to review
the work, don't read the above diffs; the intermediate states are
uninteresting. Read these four:

git remote add nmisch_github https://github.com/nmisch/postgresql.git
git fetch nmisch_github
git diff nmisch_github/mxt0-revert nmisch_github/mxt1-disk-independent
git diff nmisch_github/mxt1-disk-independent nmisch_github/mxt2-cosmetic
git diff nmisch_github/mxt2-cosmetic nmisch_github/mxt3-main
git diff nmisch_github/mxt3-main nmisch_github/mxt4-rm-legacy

Thanks,
nm

#61Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#60)
Re: Rework the way multixact truncations work

On 2015-12-02 09:57:19 -0500, Noah Misch wrote:

On Tue, Dec 01, 2015 at 05:07:15PM -0500, Robert Haas wrote:

I think it's weird to back a commit out only to put a bunch of very similar
stuff back in.

I agree with that. If the original patches and their replacements shared 95%
of diff lines in common, we wouldn't be having this conversation. These
replacements redo closer to 50% of the lines, so the patches are not very
similar by verbatim line comparison.

Which is a huge problem, because it makes it very hard to see what your
changes actually do. And a significant portion of the changes relative
to master aren't particularly critical. Which is easy to see if a
commit only changes comments, but harder if you see one commit reverting
things, and another redoing most of the same things.

Hackers have been too reticent to revert and redo defect-laden
commits. If doing that is weird today, let it be normal.

Why? Especially if reverting and redoing includes conflicts that mainly
increase the chance of accidental bugs.

git remote add community git://git.postgresql.org/git/postgresql.git
git remote add nmisch_github https://github.com/nmisch/postgresql.git
git fetch --multiple community nmisch_github
git diff community/master...nmisch_github/mxt4-rm-legacy

That's a nearly useless diff, because it includes a lot of other changes
(218 files changed, 2828 insertions(+), 8742 deletions(-)) made since
you published the changes. What kinda works is
git diff $(git merge-base community/master nmisch_github/mxt4-rm-legacy)..nmisch_github/mxt4-rm-legacy
which shows the diff to the version of master you start off from.

Review of the above diff:

@@ -2013,7 +2017,7 @@ TrimMultiXact(void)
{
MultiXactId nextMXact;
MultiXactOffset offset;
-	MultiXactId oldestMXact;
+	MultiXactId	oldestMXact;

That's a bit weird, given that nextMXact isn't indented...

@@ -2190,7 +2194,8 @@ MultiXactSetNextMXact(MultiXactId nextMulti,
-	/*
-	 * Computing the actual limits is only possible once the data directory is
-	 * in a consistent state. There's no need to compute the limits while
-	 * still replaying WAL - no decisions about new multis are made even
-	 * though multixact creations might be replayed. So we'll only do further
-	 * checks after TrimMultiXact() has been called.
-	 */
+	/* Before the TrimMultiXact() call at end of recovery, skip the rest. */
if (!MultiXactState->finishedStartup)
return;
-
Assert(!InRecovery);
-	/* Set limits for offset vacuum. */
+	/*
+	 * Setting MultiXactState->oldestOffset entails a find_multixact_start()
+	 * call, which is only possible once the data directory is in a consistent
+	 * state. There's no need for an offset limit while still replaying WAL;
+	 * no decisions about new multis are made even though multixact creations
+	 * might be replayed.
+	 */
needs_offset_vacuum = SetOffsetVacuumLimit();

I don't really see the benefit of this change. The previous location of
the comment is where we return early, so it seems appropriate to
document the reason there?

/*
@@ -2354,6 +2356,12 @@ MultiXactAdvanceNextMXact(MultiXactId minMulti,
debug_elog3(DEBUG2, "MultiXact: setting next multi to %u", minMulti);
MultiXactState->nextMXact = minMulti;
}
+
+	/*
+	 * MultiXactOffsetPrecedes() gives the wrong answer if nextOffset would
+	 * advance more than 2^31 between calls.  Since we get a call for each
+	 * XLOG_MULTIXACT_CREATE_ID, that should never happen.
+	 */

Independent comment improvement. Good idea though.

/*
- * Update our oldestMultiXactId value, but only if it's more recent than what
- * we had.
- *
- * This may only be called during WAL replay.
+ * Update our oldestMultiXactId value, but only if it's more recent than
+ * what we had.  This may only be called during WAL replay.
*/

Whatever?

void
MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB)
@@ -2544,14 +2550,13 @@ GetOldestMultiXactId(void)
static bool
SetOffsetVacuumLimit(void)
{
-	MultiXactId oldestMultiXactId;
+	MultiXactId	oldestMultiXactId;
MultiXactId nextMXact;
-	MultiXactOffset oldestOffset = 0;	/* placate compiler */
-	MultiXactOffset prevOldestOffset;
+	MultiXactOffset oldestOffset = 0;		/* placate compiler */
MultiXactOffset nextOffset;
bool		oldestOffsetKnown = false;
+	MultiXactOffset prevOldestOffset;
bool		prevOldestOffsetKnown;
-	MultiXactOffset offsetStopLimit = 0;

I don't see the benefit of the order changes here.

@@ -2588,40 +2590,50 @@ SetOffsetVacuumLimit(void)
else
{
/*
-		 * Figure out where the oldest existing multixact's offsets are
-		 * stored. Due to bugs in early release of PostgreSQL 9.3.X and 9.4.X,
-		 * the supposedly-earliest multixact might not really exist.  We are
+		 * Figure out where the oldest existing multixact's offsets are stored.
+		 * Due to bugs in early release of PostgreSQL 9.3.X and 9.4.X, the
+		 * supposedly-earliest multixact might not really exist.  We are
* careful not to fail in that case.
*/
oldestOffsetKnown =
find_multixact_start(oldestMultiXactId, &oldestOffset);
-
-		if (oldestOffsetKnown)
-			ereport(DEBUG1,
-					(errmsg("oldest MultiXactId member is at offset %u",
-							oldestOffset)));

That's imo a rather useful debug message.

-		else
+		if (!oldestOffsetKnown)
+		{
+			/* XXX This message is incorrect if prevOldestOffsetKnown. */
ereport(LOG,
(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
oldestMultiXactId)));
+		}
}

Hm, the XXX is a "problem" in all 9.3+ - should we just fix it everywhere?

LWLockRelease(MultiXactTruncationLock);

/*
-	 * If we can, compute limits (and install them MultiXactState) to prevent
-	 * overrun of old data in the members SLRU area. We can only do so if the
-	 * oldest offset is known though.
+	 * There's no need to update anything if we don't know the oldest offset
+	 * or if it hasn't changed.
*/

Is that really a worthwhile optimization?

-typedef struct mxtruncinfo
-{
- int earliestExistingPage;
-} mxtruncinfo;
-
-/*
- * SlruScanDirectory callback
- * This callback determines the earliest existing page number.
- */
-static bool
-SlruScanDirCbFindEarliest(SlruCtl ctl, char *filename, int segpage, void *data)
-{
- mxtruncinfo *trunc = (mxtruncinfo *) data;
-
- if (trunc->earliestExistingPage == -1 ||
- ctl->PagePrecedes(segpage, trunc->earliestExistingPage))
- {
- trunc->earliestExistingPage = segpage;
- }
-
- return false; /* keep going */
-}
-

That really seems like an independent change, deserving its own commit +
explanation. Just referring to "See mailing list submission notes for
rationale." makes understanding the change later imo much harder than
all the incremental commits you try to avoid.

/*
- * Decide which of two MultiXactMember page numbers is "older" for truncation
- * purposes.  There is no "invalid offset number" so use the numbers verbatim.
+ * Dummy notion of which of two MultiXactMember page numbers is "older".
+ *
+ * Due to the MultiXactOffsetPrecedes() specification, this function's result
+ * is meaningless unless the system is preserving less than 2^31 members.  It
+ * is adequate for SlruSelectLRUPage() guessing the cheapest slot to reclaim.
+ * Do not pass MultiXactMemberCtl to any of the functions that use the
+ * PagePrecedes callback in other ways.
*/
static bool
MultiXactMemberPagePrecedes(int page1, int page2)
@@ -3157,6 +3134,10 @@ MultiXactIdPrecedesOrEquals(MultiXactId multi1, MultiXactId multi2)
/*
* Decide which of two offsets is earlier.
+ *
+ * Avoid calling this function.  pg_multixact/members can preserve almost 2^32
+ * members at any given time, but this function is transitive only when the
+ * system is preserving less than 2^31 members.
*/
static bool
MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2)

As mentioned before, these really seem unrelated.
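
As an illustrative aside (not part of the reviewed patch), the following minimal sketch shows why a wraparound-style comparison stops being transitive once the live range exceeds 2^31 values, which is the limitation the added comment describes. The names DemoOffset, demo_offset_precedes, demo_is_transitive and demo_check are hypothetical; the code only models the signed-difference idiom and is not copied from multixact.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a 32-bit wrapping member offset. */
typedef uint32_t DemoOffset;

/*
 * Signed-difference comparison: "a precedes b" when b is fewer than 2^31
 * steps ahead of a, modulo 2^32.  The uint32 -> int32 conversion assumes the
 * usual two's-complement behavior.
 */
bool
demo_offset_precedes(DemoOffset a, DemoOffset b)
{
	int32_t		diff = (int32_t) (a - b);

	return diff < 0;
}

/* True unless "a < b" and "b < c" both hold while "a < c" does not. */
bool
demo_is_transitive(DemoOffset a, DemoOffset b, DemoOffset c)
{
	if (demo_offset_precedes(a, b) && demo_offset_precedes(b, c))
		return demo_offset_precedes(a, c);
	return true;				/* premise not met, nothing to violate */
}

/* Self-check covering a narrow window and a window wider than 2^31. */
void
demo_check(void)
{
	/* Span of 0x40000000 (< 2^31): the ordering behaves. */
	assert(demo_is_transitive(0u, 0x20000000u, 0x40000000u));

	/* Span of 0xC0000000 (> 2^31): each step holds, end-to-end does not. */
	assert(demo_offset_precedes(0u, 0x60000000u));
	assert(demo_offset_precedes(0x60000000u, 0xC0000000u));
	assert(!demo_offset_precedes(0u, 0xC0000000u));
}
```

Calling demo_check() exercises both cases. With three offsets spread over more than half of the 32-bit space, each adjacent pair compares as expected but the outer pair does not, so anything that sorts or bounds by such a comparison can give inconsistent answers.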

@@ -1237,24 +1235,25 @@ SlruDeleteSegment(SlruCtl ctl, int segno)
SlruShared shared = ctl->shared;
int slotno;
char path[MAXPGPATH];
- bool did_write;

/* Clean out any possibly existing references to the segment. */
LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
restart:
-	did_write = false;
for (slotno = 0; slotno < shared->num_slots; slotno++)
{
-		int			pagesegno = shared->page_number[slotno] / SLRU_PAGES_PER_SEGMENT;
+		int			pagesegno;

if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
continue;

/* not the segment we're looking for */
+ pagesegno = shared->page_number[slotno] / SLRU_PAGES_PER_SEGMENT;
if (pagesegno != segno)
continue;

-		/* If page is clean, just change state to EMPTY (expected case). */
+		/*
+		 * If page is clean, just change state to EMPTY (expected case).
+		 */
if (shared->page_status[slotno] == SLRU_PAGE_VALID &&
!shared->page_dirty[slotno])
{
@@ -1267,18 +1266,10 @@ restart:
SlruInternalWritePage(ctl, slotno, NULL);
else
SimpleLruWaitIO(ctl, slotno);
-
-		did_write = true;
-	}
-
-	/*
-	 * Be extra careful and re-check. The IO functions release the control
-	 * lock, so new pages could have been read in.
-	 */
-	if (did_write)
goto restart;
+	}

I don't think that's really a good idea - this way we restart after
every single page write, whereas currently we only restart after passing
through all pages once. In nearly all cases we'll only ever have to
retry once in total, because such old pages aren't usually going to
be reread/redirtied.
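
As an illustrative aside (not part of the reviewed patch), here is a toy model of the two restart strategies being compared. DemoSlot, demo_fill and both demo_purge_* functions are hypothetical. The model ignores locking, in-progress I/O, and pages being re-read after the lock is released, so it only counts how many slots each strategy inspects before the segment's pages are gone.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_NUM_SLOTS 8

/* Hypothetical, heavily simplified buffer slot. */
typedef struct DemoSlot
{
	int			segno;			/* segment the cached page belongs to */
	bool		in_use;			/* false models an EMPTY slot */
	bool		dirty;			/* page needs writing before eviction */
} DemoSlot;

/* Eight in-use slots; slots 2, 5 and 7 hold dirty pages of segment 3. */
void
demo_fill(DemoSlot *slots)
{
	int			i;

	for (i = 0; i < DEMO_NUM_SLOTS; i++)
	{
		slots[i].segno = 1;
		slots[i].in_use = true;
		slots[i].dirty = false;
	}
	slots[2].segno = slots[5].segno = slots[7].segno = 3;
	slots[2].dirty = slots[5].dirty = slots[7].dirty = true;
}

/* Restart the scan right after every write.  Returns slots inspected. */
int
demo_purge_restart_each_write(DemoSlot *slots, int nslots, int segno)
{
	int			inspected = 0;
	int			i;

restart:
	for (i = 0; i < nslots; i++)
	{
		inspected++;
		if (!slots[i].in_use || slots[i].segno != segno)
			continue;
		if (!slots[i].dirty)
		{
			slots[i].in_use = false;	/* clean page: just drop it */
			continue;
		}
		slots[i].dirty = false; /* models the write; lock released here */
		goto restart;
	}
	return inspected;
}

/* Finish the pass, then rescan only if something was written. */
int
demo_purge_restart_per_pass(DemoSlot *slots, int nslots, int segno)
{
	int			inspected = 0;
	bool		did_write;
	int			i;

	do
	{
		did_write = false;
		for (i = 0; i < nslots; i++)
		{
			inspected++;
			if (!slots[i].in_use || slots[i].segno != segno)
				continue;
			if (!slots[i].dirty)
			{
				slots[i].in_use = false;
				continue;
			}
			slots[i].dirty = false;
			did_write = true;
		}
	} while (did_write);
	return inspected;
}
```

With the layout from demo_fill and segno 3, the restart-per-write variant inspects 25 slots and the restart-per-pass variant inspects 16. Both leave slots 2, 5 and 7 empty. The gap grows with the number of dirty pages in the segment, which is the cost the paragraph above objects to.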

@@ -9216,10 +9212,8 @@ xlog_redo(XLogReaderState *record)
LWLockRelease(OidGenLock);
MultiXactSetNextMXact(checkPoint.nextMulti,
checkPoint.nextMultiOffset);
-
-		MultiXactAdvanceOldest(checkPoint.oldestMulti,
-							   checkPoint.oldestMultiDB);
SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+		SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
/*
* If we see a shutdown checkpoint while waiting for an end-of-backup
@@ -9309,17 +9303,13 @@ xlog_redo(XLogReaderState *record)
LWLockRelease(OidGenLock);
MultiXactAdvanceNextMXact(checkPoint.nextMulti,
checkPoint.nextMultiOffset);
-
-		/*
-		 * NB: This may perform multixact truncation when replaying WAL
-		 * generated by an older primary.
-		 */
-		MultiXactAdvanceOldest(checkPoint.oldestMulti,
-							   checkPoint.oldestMultiDB);
if (TransactionIdPrecedes(ShmemVariableCache->oldestXid,
checkPoint.oldestXid))
SetTransactionIdLimit(checkPoint.oldestXid,
checkPoint.oldestXidDB);
+		MultiXactAdvanceOldest(checkPoint.oldestMulti,
+							   checkPoint.oldestMultiDB);
+
/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
ControlFile->checkPointCopy.nextXid =
checkPoint.nextXid;

Why?

diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7c4ef58..e2b4f4c 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1136,9 +1136,6 @@ vac_truncate_clog(TransactionId frozenXID,
if (bogus)
return;

- /*
- * Truncate CLOG, multixact and CommitTs to the oldest computed value.
- */
TruncateCLOG(frozenXID);
TruncateCommitTs(frozenXID);
TruncateMultiXact(minMulti, minmulti_datoid);

Why? Sure, it's not a super important comment, but ...?

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 0dc4117..41e51cf 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2192,10 +2192,8 @@ GetOldestSafeDecodingTransactionId(void)
/*
* GetVirtualXIDsDelayingChkpt -- Get the VXIDs of transactions that are
- * delaying checkpoint because they have critical actions in progress.
- *
- * Constructs an array of VXIDs of transactions that are currently in commit
- * critical sections, as shown by having delayChkpt set in their PGXACT.
+ * delaying checkpoint because they have critical actions in progress, as
+ * shown by having delayChkpt set in their PGXACT.
* Returns a palloc'd array that should be freed by the caller.
* *nvxids is the number of valid entries.
@@ -2204,8 +2202,8 @@ GetOldestSafeDecodingTransactionId(void)
* the result is somewhat indeterminate, but we don't really care.  Even in
* a multiprocessor with delayed writes to shared memory, it should be certain
* that setting of delayChkpt will propagate to shared memory when the backend
- * takes a lock, so we cannot fail to see a virtual xact as delayChkpt if
- * it's already inserted its commit record.  Whether it takes a little while
+ * takes a lock, so we cannot fail to see a virtual xact as delayChkpt if it's
+ * already inserted its critical xlog record.  Whether it takes a little while
* for clearing of delayChkpt to propagate is unimportant for correctness.
*/

Seems unrelated, given that this is already used in
MarkBufferDirtyHint(). Don't get me wrong, I think the changes are a
good idea, but it's not really tied to the truncation changes.

- Andres

#62Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#61)
Re: Rework the way multixact truncations work

On Wed, Dec 02, 2015 at 04:46:26PM +0100, Andres Freund wrote:

On 2015-12-02 09:57:19 -0500, Noah Misch wrote:

Hackers have been too reticent to revert and redo defect-laden
commits. If doing that is weird today, let it be normal.

Why?

See my paragraph ending with the two sentences you quoted.

Especially if reverting and redoing includes conflicts that mainly
increase the chance of accidental bugs.

True. (That doesn't apply to these patches.)

git remote add community git://git.postgresql.org/git/postgresql.git
git remote add nmisch_github https://github.com/nmisch/postgresql.git
git fetch --multiple community nmisch_github
git diff community/master...nmisch_github/mxt4-rm-legacy

That's a nearly useless diff, because it includes a lot of other changes
(218 files changed, 2828 insertions(+), 8742 deletions(-)) made since
you published the changes.

Perhaps you used "git diff a..b", not "git diff a...b". If not, please send
the outputs of "git rev-parse community/master nmisch_github/mxt4-rm-legacy"
and "git --version".

@@ -2190,7 +2194,8 @@ MultiXactSetNextMXact(MultiXactId nextMulti,
-	/*
-	 * Computing the actual limits is only possible once the data directory is
-	 * in a consistent state. There's no need to compute the limits while
-	 * still replaying WAL - no decisions about new multis are made even
-	 * though multixact creations might be replayed. So we'll only do further
-	 * checks after TrimMultiXact() has been called.
-	 */
+	/* Before the TrimMultiXact() call at end of recovery, skip the rest. */
if (!MultiXactState->finishedStartup)
return;
-
Assert(!InRecovery);
-	/* Set limits for offset vacuum. */
+	/*
+	 * Setting MultiXactState->oldestOffset entails a find_multixact_start()
+	 * call, which is only possible once the data directory is in a consistent
+	 * state. There's no need for an offset limit while still replaying WAL;
+	 * no decisions about new multis are made even though multixact creations
+	 * might be replayed.
+	 */
needs_offset_vacuum = SetOffsetVacuumLimit();

I don't really see the benefit of this change. The previous location of
the comment is where we return early, so it seems appropriate to
document the reason there?

I made that low-importance change for two reasons. First, returning at that
point skips more than just setting a limit; it also skips autovacuum
signalling and wraparound warnings. Second, the function has just computed
mxid "actual limits", so branch mxt2-cosmetic made the comment specify that we
defer an offset limit, not any and all limits.

static bool
SetOffsetVacuumLimit(void)
{
-	MultiXactId oldestMultiXactId;
+	MultiXactId	oldestMultiXactId;
MultiXactId nextMXact;
-	MultiXactOffset oldestOffset = 0;	/* placate compiler */
-	MultiXactOffset prevOldestOffset;
+	MultiXactOffset oldestOffset = 0;		/* placate compiler */
MultiXactOffset nextOffset;
bool		oldestOffsetKnown = false;
+	MultiXactOffset prevOldestOffset;
bool		prevOldestOffsetKnown;
-	MultiXactOffset offsetStopLimit = 0;

I don't see the benefit of the order changes here.

I reacted the same way. Commit 4f627f8 reordered some declarations, which I
reverted when I refinished that commit as branch mxt3-main.

- if (oldestOffsetKnown)
- ereport(DEBUG1,
- (errmsg("oldest MultiXactId member is at offset %u",
- oldestOffset)));

That's imo a rather useful debug message.

The branches emit that message at the same times 4f627f8^ and earlier emit it.

-		else
+		if (!oldestOffsetKnown)
+		{
+			/* XXX This message is incorrect if prevOldestOffsetKnown. */
ereport(LOG,
(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
oldestMultiXactId)));
+		}
}

Hm, the XXX is a "problem" in all 9.3+ - should we just fix it everywhere?

I welcome a project to fix it.

LWLockRelease(MultiXactTruncationLock);

/*
-	 * If we can, compute limits (and install them MultiXactState) to prevent
-	 * overrun of old data in the members SLRU area. We can only do so if the
-	 * oldest offset is known though.
+	 * There's no need to update anything if we don't know the oldest offset
+	 * or if it hasn't changed.
*/

Is that really a worthwhile optimization?

I would neither remove that longstanding optimization nor add it from scratch
today. Branch commit 06c9979 restored it as part of a larger restoration to
the pre-4f627f8 structure of SetOffsetVacuumLimit().

-typedef struct mxtruncinfo
-{
- int earliestExistingPage;
-} mxtruncinfo;
-
-/*
- * SlruScanDirectory callback
- * This callback determines the earliest existing page number.
- */
-static bool
-SlruScanDirCbFindEarliest(SlruCtl ctl, char *filename, int segpage, void *data)
-{
- mxtruncinfo *trunc = (mxtruncinfo *) data;
-
- if (trunc->earliestExistingPage == -1 ||
- ctl->PagePrecedes(segpage, trunc->earliestExistingPage))
- {
- trunc->earliestExistingPage = segpage;
- }
-
- return false; /* keep going */
-}
-

That really seems like an independent change, deserving its own commit +
explanation.

Indeed. I explained that change at length in
/messages/by-id/20151108085407.GA1097830@tornado.leadboat.com,
including that it's alone on a branch (mxt1-disk-independent), to become its
own PostgreSQL commit.

[branch commit 89a7232]

I don't think that's really a good idea - this way we restart after
every single page write, whereas currently we only restart after passing
through all pages once. In nearly all cases we'll only ever have to
retry once in total, because such old pages aren't usually going to
be reread/redirtied.

Your improvement sounds fine, then. Would both SimpleLruTruncate() and
SlruDeleteSegment() benefit from it?

@@ -9216,10 +9212,8 @@ xlog_redo(XLogReaderState *record)
LWLockRelease(OidGenLock);
MultiXactSetNextMXact(checkPoint.nextMulti,
checkPoint.nextMultiOffset);
-
-		MultiXactAdvanceOldest(checkPoint.oldestMulti,
-							   checkPoint.oldestMultiDB);
SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
+		SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
/*
* If we see a shutdown checkpoint while waiting for an end-of-backup
@@ -9309,17 +9303,13 @@ xlog_redo(XLogReaderState *record)
LWLockRelease(OidGenLock);
MultiXactAdvanceNextMXact(checkPoint.nextMulti,
checkPoint.nextMultiOffset);
-
-		/*
-		 * NB: This may perform multixact truncation when replaying WAL
-		 * generated by an older primary.
-		 */
-		MultiXactAdvanceOldest(checkPoint.oldestMulti,
-							   checkPoint.oldestMultiDB);
if (TransactionIdPrecedes(ShmemVariableCache->oldestXid,
checkPoint.oldestXid))
SetTransactionIdLimit(checkPoint.oldestXid,
checkPoint.oldestXidDB);
+		MultiXactAdvanceOldest(checkPoint.oldestMulti,
+							   checkPoint.oldestMultiDB);
+
/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
ControlFile->checkPointCopy.nextXid =
checkPoint.nextXid;

Why?

master does not and will not have legacy truncation, so the deleted comment
does not belong in master. Regarding the SetMultiXactIdLimit() call:

commit 611a2ec
Author: Noah Misch <noah@leadboat.com>
AuthorDate: Sat Nov 7 15:06:28 2015 -0500
Commit: Noah Misch <noah@leadboat.com>
CommitDate: Sat Nov 7 15:06:28 2015 -0500

In xlog_redo(), believe a SHUTDOWN checkPoint.oldestMulti exactly.

It was so before this branch. This restores consistency with the
handling of nextXid, nextMulti and oldestMulti: we treat them as exact
for XLOG_CHECKPOINT_SHUTDOWN and as minima for XLOG_CHECKPOINT_ONLINE.
I do not know of a case where this definitely matters for any of these
counters. It might matter if a bug causes oldestXid to move forward
wrongly, causing it to then move backward later. (I don't know if
VACUUM does ever move oldestXid backward, but it's a plausible thing to
do if on-disk state fully agrees with an older value.) That example has
no counterpart for oldestMultiXactId, because any update first arrives
in an XLOG_MULTIXACT_TRUNCATE_ID record. Therefore, this commit is
probably cosmetic.
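
The exact-versus-minimum distinction in that commit message can be sketched in a few lines of C. This is only an illustration, not PostgreSQL source: `CheckpointKind`, `mxid_precedes()` and `replay_oldest_multi()` are hypothetical names, and the comparison is assumed to be the usual modular 32-bit "precedes" test.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t MultiXactId;

typedef enum
{
	CKPT_SHUTDOWN,				/* stands in for XLOG_CHECKPOINT_SHUTDOWN */
	CKPT_ONLINE					/* stands in for XLOG_CHECKPOINT_ONLINE */
} CheckpointKind;

/* Wraparound-aware "a is older than b" (hypothetical helper). */
static bool
mxid_precedes(MultiXactId a, MultiXactId b)
{
	return (int32_t) (a - b) < 0;
}

/*
 * Return the oldestMulti value to keep after replaying a checkpoint record.
 * A shutdown checkpoint is believed exactly; an online checkpoint only acts
 * as a minimum, so the value never moves backward.
 */
static MultiXactId
replay_oldest_multi(MultiXactId current, MultiXactId from_record,
					CheckpointKind kind)
{
	if (kind == CKPT_SHUTDOWN)
		return from_record;
	return mxid_precedes(current, from_record) ? from_record : current;
}
```

With an online checkpoint, a record value older than the current one is ignored; with a shutdown checkpoint, the record value always wins.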

- /*
- * Truncate CLOG, multixact and CommitTs to the oldest computed value.
- */
TruncateCLOG(frozenXID);
TruncateCommitTs(frozenXID);
TruncateMultiXact(minMulti, minmulti_datoid);

Why? Sure, it's not a super important comment, but ...?

Yeah, it scarcely matters either way. Commit 4f627f8 reduced this comment to
merely restating the code, so I removed it instead.

@@ -2204,8 +2202,8 @@ GetOldestSafeDecodingTransactionId(void)
* the result is somewhat indeterminate, but we don't really care.  Even in
* a multiprocessor with delayed writes to shared memory, it should be certain
* that setting of delayChkpt will propagate to shared memory when the backend
- * takes a lock, so we cannot fail to see a virtual xact as delayChkpt if
- * it's already inserted its commit record.  Whether it takes a little while
+ * takes a lock, so we cannot fail to see a virtual xact as delayChkpt if it's
+ * already inserted its critical xlog record.  Whether it takes a little while
* for clearing of delayChkpt to propagate is unimportant for correctness.
*/

Seems unrelated, given that this is already used in
MarkBufferDirtyHint(). Don't get me wrong, I think the changes are a
good idea, but it's not really tied to the truncation changes.

Quite so; its branch (one branch = one proposed PostgreSQL commit),
mxt2-cosmetic, contains no truncation changes. Likewise for the other
independent comment improvements you noted.

nm

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#63Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#62)
Re: Rework the way multixact truncations work

On 2015-12-03 04:38:45 -0500, Noah Misch wrote:

On Wed, Dec 02, 2015 at 04:46:26PM +0100, Andres Freund wrote:

Especially if reverting and redoing involves conflicts, that mainly
increases the chance of accidental bugs.

True. (That doesn't apply to these patches.)

Uh, it does. You had conflicts in your process, and it's hard to verify
that the re-applied patch is actually functionally identical given the
volume of changes. It's much easier to see what actually changes by
looking at iterative commits forward from the current state.

Sorry, but I really just want to see these changes as iterative patches
on top of 9.5/HEAD instead of this process. I won't revert the reversion
if you push it anyway, but I think it's a rather bad idea.

git remote add community git://git.postgresql.org/git/postgresql.git
git remote add nmisch_github https://github.com/nmisch/postgresql.git
git fetch --multiple community nmisch_github
git diff community/master...nmisch_github/mxt4-rm-legacy

That's a nearly useless diff, because it includes a lot of other changes
(218 files changed, 2828 insertions(+), 8742 deletions(-)) made since
you published the changes.

Perhaps you used "git diff a..b", not "git diff a...b".

Ah yes. Neat, didn't know that one.

@@ -2190,7 +2194,8 @@ MultiXactSetNextMXact(MultiXactId nextMulti,
-	/*
-	 * Computing the actual limits is only possible once the data directory is
-	 * in a consistent state. There's no need to compute the limits while
-	 * still replaying WAL - no decisions about new multis are made even
-	 * though multixact creations might be replayed. So we'll only do further
-	 * checks after TrimMultiXact() has been called.
-	 */
+	/* Before the TrimMultiXact() call at end of recovery, skip the rest. */
if (!MultiXactState->finishedStartup)
return;
-
Assert(!InRecovery);
-	/* Set limits for offset vacuum. */
+	/*
+	 * Setting MultiXactState->oldestOffset entails a find_multixact_start()
+	 * call, which is only possible once the data directory is in a consistent
+	 * state. There's no need for an offset limit while still replaying WAL;
+	 * no decisions about new multis are made even though multixact creations
+	 * might be replayed.
+	 */
needs_offset_vacuum = SetOffsetVacuumLimit();

I don't really see the benefit of this change. The previous location of
the comment is where we return early, so it seems appropriate to
document the reason there?

I made that low-importance change for two reasons. First, returning at that
point skips more than just setting a limit; it also skips autovacuum
signalling and wraparound warnings. Second, the function has just computed
mxid "actual limits", so branch mxt2-cosmetic made the comment specify that we
defer an offset limit, not any and all limits.

My question was more about the comment being after the "early return"
than about the content change, should have made that clearer. Can we
just move your comment up?

static bool
SetOffsetVacuumLimit(void)
{
-	MultiXactId oldestMultiXactId;
+	MultiXactId	oldestMultiXactId;
MultiXactId nextMXact;
-	MultiXactOffset oldestOffset = 0;	/* placate compiler */
-	MultiXactOffset prevOldestOffset;
+	MultiXactOffset oldestOffset = 0;		/* placate compiler */
MultiXactOffset nextOffset;
bool		oldestOffsetKnown = false;
+	MultiXactOffset prevOldestOffset;
bool		prevOldestOffsetKnown;
-	MultiXactOffset offsetStopLimit = 0;

I don't see the benefit of the order changes here.

I reacted the same way. Commit 4f627f8 reordered some declarations, which I
reverted when I refinished that commit as branch mxt3-main.

But the other changes are there, and in the history anyway. As the new
order isn't more meaningful than the current one...

- if (oldestOffsetKnown)
- ereport(DEBUG1,
- (errmsg("oldest MultiXactId member is at offset %u",
- oldestOffset)));

That's imo a rather useful debug message.

The branches emit that message at the same times 4f627f8^ and earlier emit it.

During testing I found it rather helpful if it was emitted regularly.

LWLockRelease(MultiXactTruncationLock);

/*
-	 * If we can, compute limits (and install them MultiXactState) to prevent
-	 * overrun of old data in the members SLRU area. We can only do so if the
-	 * oldest offset is known though.
+	 * There's no need to update anything if we don't know the oldest offset
+	 * or if it hasn't changed.
*/

Is that really a worthwhile optimization?

I would neither remove that longstanding optimization nor add it from scratch
today. Branch commit 06c9979 restored it as part of a larger restoration to
the pre-4f627f8 structure of SetOffsetVacuumLimit().

There DetermineSafeOldestOffset() did it unconditionally.

-SlruScanDirCbFindEarliest(SlruCtl ctl, char *filename, int segpage, void *data)

That really seems like an independent change, deserving its own commit +
explanation.

Indeed. I explained that change at length in
/messages/by-id/20151108085407.GA1097830@tornado.leadboat.com,
including that it's alone on a branch (mxt1-disk-independent), to become its
own PostgreSQL commit.

The comment there doesn't include the explanation...

[branch commit 89a7232]

I don't think that's really a good idea - this way we restart after
every single page write, whereas currently we only restart after passing
through all pages once. In nearly all cases we'll only ever have to
retry once in total, because such old pages aren't usually going to
be reread/redirtied.

Your improvement sounds fine, then. Would both SimpleLruTruncate() and
SlruDeleteSegment() benefit from it?

It probably makes sense to do it in SimpleLruTruncate too - but it does
additional checks as part of the restarts which aren't applicable for
DeleteSegment(), which is IIRC why I didn't also change it.
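
The two restart strategies being compared can be sketched as a toy model in C. This is an illustration under stated assumptions, not the slru.c code: `SlotState`, `purge_restart_per_write()` and `purge_restart_per_pass()` are hypothetical names, a "write" is modeled as simply marking the slot clean, and concurrency (the reason a restart is needed at all) is ignored. Each function returns how many slots it had to visit.

```c
#include <stddef.h>

typedef enum
{
	SLOT_EMPTY,
	SLOT_CLEAN,
	SLOT_DIRTY
} SlotState;

/* Restart the scan immediately after every page write. */
static int
purge_restart_per_write(SlotState *slots, size_t n)
{
	int			visited = 0;
	size_t		i;

restart:
	for (i = 0; i < n; i++)
	{
		visited++;
		if (slots[i] == SLOT_EMPTY)
			continue;
		if (slots[i] == SLOT_CLEAN)
		{
			slots[i] = SLOT_EMPTY;	/* clean page can be dropped */
			continue;
		}
		slots[i] = SLOT_CLEAN;	/* stand-in for writing the page out */
		goto restart;
	}
	return visited;
}

/* Finish the whole pass, then restart once if anything was written. */
static int
purge_restart_per_pass(SlotState *slots, size_t n)
{
	int			visited = 0;
	int			did_write;
	size_t		i;

	do
	{
		did_write = 0;
		for (i = 0; i < n; i++)
		{
			visited++;
			if (slots[i] == SLOT_EMPTY)
				continue;
			if (slots[i] == SLOT_CLEAN)
			{
				slots[i] = SLOT_EMPTY;
				continue;
			}
			slots[i] = SLOT_CLEAN;	/* stand-in for writing the page out */
			did_write = 1;
		}
	} while (did_write);
	return visited;
}
```

In this model the per-write variant revisits the leading slots once for each dirty page, while the per-pass variant needs at most one extra pass when no page is redirtied in between.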

Greetings,

Andres Freund


#64Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#63)
Re: Rework the way multixact truncations work

On Thu, Dec 03, 2015 at 07:03:21PM +0100, Andres Freund wrote:

On 2015-12-03 04:38:45 -0500, Noah Misch wrote:

On Wed, Dec 02, 2015 at 04:46:26PM +0100, Andres Freund wrote:

Especially if reverting and redoing involves conflicts, that mainly
increases the chance of accidental bugs.

True. (That doesn't apply to these patches.)

Uh, it does. You had conflicts in your process, and it's hard to verify
that the re-applied patch is actually functionally identical given the
volume of changes. It's much easier to see what actually changes by
looking at iterative commits forward from the current state.

Ah, we were talking about different topics after all. I was talking about
_merge_ conflicts in a reversion commit.

Sorry, but I really just want to see these changes as iterative patches
on top of 9.5/HEAD instead of this process. I won't revert the reversion
if you push it anyway, but I think it's a rather bad idea.

I hear you. I evaluated your request and judged that the benefits you cited
did not make up for the losses I cited. Should you wish to change my mind,
your best bet is to find defects in the commits I proposed. If I introduced
juicy defects, that discovery would lend much weight to your conjectures.

My question was more about the comment being after the "early return"
than about the content change, should have made that clearer. Can we
just move your comment up?

Sure, I will.

static bool
SetOffsetVacuumLimit(void)
{
-	MultiXactId oldestMultiXactId;
+	MultiXactId	oldestMultiXactId;
MultiXactId nextMXact;
-	MultiXactOffset oldestOffset = 0;	/* placate compiler */
-	MultiXactOffset prevOldestOffset;
+	MultiXactOffset oldestOffset = 0;		/* placate compiler */
MultiXactOffset nextOffset;
bool		oldestOffsetKnown = false;
+	MultiXactOffset prevOldestOffset;
bool		prevOldestOffsetKnown;
-	MultiXactOffset offsetStopLimit = 0;

I don't see the benefit of the order changes here.

I reacted the same way. Commit 4f627f8 reordered some declarations, which I
reverted when I refinished that commit as branch mxt3-main.

But the other changes are there, and in the history anyway. As the new
order isn't more meaningful than the current one...

Right. A revert+redo patch series can and should purge formatting changes
that did not belong in its predecessor commits. Alternate change delivery
strategies wouldn't do that.

- if (oldestOffsetKnown)
- ereport(DEBUG1,
- (errmsg("oldest MultiXactId member is at offset %u",
- oldestOffset)));

That's imo a rather useful debug message.

The branches emit that message at the same times 4f627f8^ and earlier emit it.

During testing I found it rather helpful if it was emitted regularly.

I wouldn't oppose a patch making it happen more often.

LWLockRelease(MultiXactTruncationLock);

/*
-	 * If we can, compute limits (and install them MultiXactState) to prevent
-	 * overrun of old data in the members SLRU area. We can only do so if the
-	 * oldest offset is known though.
+	 * There's no need to update anything if we don't know the oldest offset
+	 * or if it hasn't changed.
*/

Is that really a worthwhile optimization?

I would neither remove that longstanding optimization nor add it from scratch
today. Branch commit 06c9979 restored it as part of a larger restoration to
the pre-4f627f8 structure of SetOffsetVacuumLimit().

There DetermineSafeOldestOffset() did it unconditionally.

That is true; one won't be consistent with both. 06c9979 materially shortened
the final patch and eliminated some user-visible message emission changes.
Moreover, this is clearly a case of SetOffsetVacuumLimit() absorbing
DetermineSafeOldestOffset(), not vice versa.

-SlruScanDirCbFindEarliest(SlruCtl ctl, char *filename, int segpage, void *data)

That really seems like an independent change, deserving its own commit +
explanation.

Indeed. I explained that change at length in
/messages/by-id/20151108085407.GA1097830@tornado.leadboat.com,
including that it's alone on a branch (mxt1-disk-independent), to become its
own PostgreSQL commit.

The comment there doesn't include the explanation...

If you visit that URL, everything from "If anything here requires careful
study, it's the small mxt1-disk-independent change, which ..." to the end of
the message is my explanation of this change. What else would you like to
know about it?

[branch commit 89a7232]

I don't think that's really a good idea - this way we restart after
every single page write, whereas currently we only restart after passing
through all pages once. In nearly all cases we'll only ever have to
retry once in total, because such old pages aren't usually going to
be reread/redirtied.

Your improvement sounds fine, then. Would both SimpleLruTruncate() and
SlruDeleteSegment() benefit from it?

It probably makes sense to do it in SimpleLruTruncate too - but it does
additional checks as part of the restarts which aren't applicable for
DeleteSegment(), which is IIRC why I didn't also change it.

Understood. There's no rule that these two functions must look as similar as
possible, so I will undo 89a7232.

Thanks,
nm


#65Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#64)
Re: Rework the way multixact truncations work

On 2015-12-04 21:55:29 -0500, Noah Misch wrote:

On Thu, Dec 03, 2015 at 07:03:21PM +0100, Andres Freund wrote:

Sorry, but I really just want to see these changes as iterative patches
on top of 9.5/HEAD instead of this process. I won't revert the reversion
if you push it anyway, but I think it's a rather bad idea.

I hear you.

Not just me.

I evaluated your request and judged that the benefits you cited
did not make up for the losses I cited. Should you wish to change my mind,
your best bet is to find defects in the commits I proposed. If I introduced
juicy defects, that discovery would lend much weight to your conjectures.

I've absolutely no interest in "proving you wrong". And my desire to
review patches that are in a, in my opinion, barely reviewable format is
pretty low as well.

Andres


#66Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#65)
Re: Rework the way multixact truncations work

On Tue, Dec 8, 2015 at 6:43 AM, Andres Freund <andres@anarazel.de> wrote:

On 2015-12-04 21:55:29 -0500, Noah Misch wrote:

On Thu, Dec 03, 2015 at 07:03:21PM +0100, Andres Freund wrote:

Sorry, but I really just want to see these changes as iterative patches
on top of 9.5/HEAD instead of this process. I won't revert the reversion
if you push it anyway, but I think it's a rather bad idea.

I hear you.

Not just me.

I evaluated your request and judged that the benefits you cited
did not make up for the losses I cited. Should you wish to change my mind,
your best bet is to find defects in the commits I proposed. If I introduced
juicy defects, that discovery would lend much weight to your conjectures.

I've absolutely no interest in "proving you wrong". And my desire to
review patches that are in a, in my opinion, barely reviewable format is
pretty low as well.

I agree. Noah, it seems to me that you are offering a novel theory of
how patches should be submitted, reviewed, and committed, but you've
got three people, two of them committers, telling you that we don't
like that approach. I seriously doubt you're going to find anyone who
does. When stuff gets committed to the tree, people want to be
able to answer the question "what has just now changed?" and it is
indisputable that what you want to do here will make that harder.
That's not a one-time problem for Andres during the course of review;
that is a problem for every single person who looks at the commit
history from now until the end of time. I don't think you have the
right to force your proposed approach through in the face of concerted
opposition.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#67Noah Misch
noah@leadboat.com
In reply to: Robert Haas (#66)
Re: Rework the way multixact truncations work

On Tue, Dec 08, 2015 at 01:05:03PM -0500, Robert Haas wrote:

On Tue, Dec 8, 2015 at 6:43 AM, Andres Freund <andres@anarazel.de> wrote:

On 2015-12-04 21:55:29 -0500, Noah Misch wrote:

On Thu, Dec 03, 2015 at 07:03:21PM +0100, Andres Freund wrote:

Sorry, but I really just want to see these changes as iterative patches
on top of 9.5/HEAD instead of this process. I won't revert the reversion
if you push it anyway, but I think it's a rather bad idea.

I hear you.

Not just me.

I evaluated your request and judged that the benefits you cited
did not make up for the losses I cited. Should you wish to change my mind,
your best bet is to find defects in the commits I proposed. If I introduced
juicy defects, that discovery would lend much weight to your conjectures.

I've absolutely no interest in "proving you wrong". And my desire to
review patches that are in a, in my opinion, barely reviewable format is
pretty low as well.

I agree. Noah, it seems to me that you are offering a novel theory of
how patches should be submitted, reviewed, and committed, but you've
got three people, two of them committers, telling you that we don't
like that approach. I seriously doubt you're going to find anyone who
does.

Andres writing the patch that became commit 4f627f8 and you reviewing it were
gifts to Alvaro and to the community. Aware of that, I have avoided[1] saying
that I was shocked to see that commit's defects. Despite a committer-author
and _two_ committer reviewers, the patch was rife with wrong new comments,
omitted updates to comments it caused to become wrong, and unsolicited
whitespace churn. (Anyone could have missed the data loss bug, but these
collectively leap off the page.) This in beleaguered code critical to data
integrity. You call this thread's latest code a patch submission, but I call
it bandaging the tree after a recent commit that never should have reached the
tree. Hey, if you'd like me to post the traditional patch files, that's easy.
It would have been easier for me. I posted branches because it gives more
metadata to guide review. As for the choice to revert and redo ...

When stuff gets committed to the tree, people want to be
able to answer the question "what has just now changed?" and it is
indisputable that what you want to do here will make that harder.

I hope those who have not already read commit 4f627f8 will not waste time
reading it. They should instead ignore multixact changes from commit 4f627f8
through its revert. The 2015-09-26 commits have not appeared in a supported
release, and no other work has built upon them. They have no tenure. (I am
glad you talked the author out of back-patching; otherwise, 9.4.5 and 9.3.10
would have introduced a data loss bug.) Nobody reported a single defect
before my review overturned half the patch. A revert will indeed impose on
those who invested time to understand commit 4f627f8, but the silence about
its defects suggests the people understanding it number either zero or one.
Even as its author and reviewers, you would do better to set aside what you
thought you knew about this code.

That's not a one-time problem for Andres during the course of review;
that is a problem for every single person who looks at the commit
history from now until the end of time.

It's a service to future readers that no line of "git blame master <...>" will
refer to a 2015-09-26 multixact commit. Blame reports will instead refer to
replacement commits designed to be meaningful for study in isolation. If I
instead structured the repairs as you ask, the blame would find one of 4f627f8
or its various repair commits, none of which would be a self-contained unit of
development. What's to enjoy about discovering that history?

I don't think you have the
right to force your proposed approach through in the face of concerted
opposition.

That's correct; I do not have that right. Your objection still worries me.

nm

[1]: /messages/by-id/20151029065903.GC770464@tornado.leadboat.com


#68Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#67)
Re: Rework the way multixact truncations work

On 2015-12-09 09:43:19 -0500, Noah Misch wrote:

Aware of that, I have avoided[1] saying that I was shocked to see that
commit's defects. Despite a committer-author and _two_ committer
reviewers, the patch was rife with wrong new comments, omitted updates
to comments it caused to become wrong,

It's not like that patch wasn't posted for review for months...

and unsolicited whitespace churn.

Whitespace churn? The commit includes a pgindent run, because Alvaro
asked me to do that, but that just affected a handful of lines. If you
mean the variable ordering: given several variables were renamed anyway,
additionally putting them in an easier to understand order, seems rather
painless. If you mean 'pgindent immune' long lines - multixact.c is far
from the only one with those, and they're pretty harmless.

You call this thread's latest code a patch
submission, but I call it bandaging the tree after a recent commit
that never should have reached the tree.

Oh, for Christ's sake.

Hey, if you'd like me to
post the traditional patch files, that's easy. It would have been
easier for me.

You've been asked that, repeatedly. At least if you take 'traditional
patch files' to include traditional, iterative, patches on top of the
current tree.

I hope those who have not already read commit 4f627f8 will not waste time
reading it.

We have to, who knows what's hiding in there. Your git log even shows
that you had conflicts in your approach (83cb04 Conflicts:
src/backend/access/transam/multixact.c).

They should instead ignore multixact changes from commit 4f627f8
through its revert. The 2015-09-26 commits have not appeared in a supported
release, and no other work has built upon them.

They have no tenure.

Man.

(I am glad you talked the author out of back-patching; otherwise,
9.4.5 and 9.3.10 would have introduced a data loss bug.)

Isn't that a bug in a, as far as we know, impossible scenario? Unless I
miss something there's no known case where it's "expected" that
find_multixact_start() fails after initially succeeding? Sure, it sucks
that the bug survived review and that it was written in the first
place. But it not showing up during testing isn't meaningful, given it's
a should-never-happen scenario.

I'm actually kinda inclined to rip out the whole "previous pass" logic
altogether, and replace it with a PANIC. It's a hard to test,
should never happen, scenario. If it happens, things have already
seriously gone sour.

That's not a one-time problem for Andres during the course of review;
that is a problem for every single person who looks at the commit
history from now until the end of time.

It's a service to future readers that no line of "git blame master <...>" will
refer to a 2015-09-26 multixact commit.

And a disservice for everyone doing git log, or git blame for
intermediate states of the tree. The benefits for git blame are almost
nonexistent; not seeing a couple of newlines changed, or not seeing some
intermediate commits, isn't really important.

Blame reports will instead refer to
replacement commits designed to be meaningful for study in isolation. If I
instead structured the repairs as you ask, the blame would find one of 4f627f8
or its various repair commits, none of which would be a self-contained unit of
development.

So what? That's how development in general works. And how it actually
happened in this specific case.


#69Robert Haas
robertmhaas@gmail.com
In reply to: Noah Misch (#67)
Re: Rework the way multixact truncations work

On Wed, Dec 9, 2015 at 9:43 AM, Noah Misch <noah@leadboat.com> wrote:

Andres writing the patch that became commit 4f627f8 and you reviewing it were
gifts to Alvaro and to the community. Aware of that, I have avoided[1] saying
that I was shocked to see that commit's defects. Despite a committer-author
and _two_ committer reviewers, the patch was rife with wrong new comments,
omitted updates to comments it caused to become wrong, and unsolicited
whitespace churn. (Anyone could have missed the data loss bug, but these
collectively leap off the page.) This in beleaguered code critical to data
integrity. You call this thread's latest code a patch submission, but I call
it bandaging the tree after a recent commit that never should have reached the
tree. Hey, if you'd like me to post the traditional patch files, that's easy.
It would have been easier for me. I posted branches because it gives more
metadata to guide review. As for the choice to revert and redo ...

Yes, I'd like patch files, one per topic.

I wasn't very happy with the way that patch was written; it seemed
to me that it touched too much code and moved a lot of things around
unnecessarily, and I said so at the time. I would have preferred
something more incremental, and I asked for it and didn't get it.
Well, I'm not giving up: I'm asking for the same thing here. I didn't
think it was a good idea for Andres to rearrange that much code in a
single commit, because it was hard to review, and I don't think it's a
good idea for you to do it, either. To the extent that you found
bugs, I think that proves the point that large commits are hard to
review and small commits that change things just a little bit at a
time are the way to go.

I hope those who have not already read commit 4f627f8 will not waste time
reading it. They should instead ignore multixact changes from commit 4f627f8
through its revert. The 2015-09-26 commits have not appeared in a supported
release, and no other work has built upon them. They have no tenure. (I am
glad you talked the author out of back-patching; otherwise, 9.4.5 and 9.3.10
would have introduced a data loss bug.) Nobody reported a single defect
before my review overturned half the patch. A revert will indeed impose on
those who invested time to understand commit 4f627f8, but the silence about
its defects suggests the people understanding it number either zero or one.
Even as its author and reviewers, you would do better to set aside what you
thought you knew about this code.

I just don't find this a realistic model of how people use the git
log. Maybe you use it this way; I don't. I don't *want* git blame to
make it seem as if 4f627f8 is not part of the history. For better or
worse, it is. Ripping it out and replacing it monolithically will not
change that; it will only make the detailed history harder to
reconstruct, and I *will* want to reconstruct it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#70Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#68)
Re: Rework the way multixact truncations work

On Wed, Dec 9, 2015 at 10:41 AM, Andres Freund <andres@anarazel.de> wrote:

(I am glad you talked the author out of back-patching; otherwise,
9.4.5 and 9.3.10 would have introduced a data loss bug.)

Isn't that a bug in a, as far as we know, impossible scenario? Unless I
miss something there's no known case where it's "expected" that
find_multixact_start() fails after initially succeeding? Sure, it sucks
that the bug survived review and that it was written in the first
place. But it not showing up during testing isn't meaningful, given it's
a should-never-happen scenario.

If I correctly understand the scenario that you are describing, that
does happen - not for the same MXID, but for different ones. At least
the last time I checked, and I'm not sure if we've fixed this, it
could happen because the SLRU page that contains the multixact wasn't
flushed out of the SLRU buffers yet. But apart from that, it could
happen any time there's a gap in the sequence of files, and that sure
doesn't seem like a can't-happen situation. We know that, on 9.3,
there's definitely a sequence of events that leads to a 0000 file
followed by a gap followed by the series of files that are still live.
Given the number of other bugs we've fixed in this area, I would not
like to bet on that being the only scenario where this crops up. It
*shouldn't* happen, and as far as we know, if you start and end on a
version newer than 4f627f8 and aa29c1c, it won't. Older branches,
though, I wouldn't like to bet on.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#71Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#70)
Re: Rework the way multixact truncations work

On 2015-12-09 11:18:39 -0500, Robert Haas wrote:

If I correctly understand the scenario that you are describing, that
does happen - not for the same MXID, but for different ones. At least
the last time I checked, and I'm not sure if we've fixed this, it
could happen because the SLRU page that contains the multixact wasn't
flushed out of the SLRU buffers yet.

That should be fixed, with the brute force solution of flushing buffers
before searching for files on disk.

But apart from that, it could
happen any time there's a gap in the sequence of files, and that sure
doesn't seem like a can't-happen situation. We know that, on 9.3,
there's definitely a sequence of events that leads to a 0000 file
followed by a gap followed by the series of files that are still live.
Given the number of other bugs we've fixed in this area, I would not
like to bet on that being the only scenario where this crops up. It
*shouldn't* happen, and as far as we know, if you start and end on a
version newer than 4f627f8 and aa29c1c, it won't. Older branches,
though, I wouldn't like to bet on.

Ok, fair enough.

andres


#72Noah Misch
noah@leadboat.com
In reply to: Robert Haas (#69)
Re: Rework the way multixact truncations work

On Wed, Dec 09, 2015 at 11:08:32AM -0500, Robert Haas wrote:

On Wed, Dec 9, 2015 at 9:43 AM, Noah Misch <noah@leadboat.com> wrote:

I hope those who have not already read commit 4f627f8 will not waste time
reading it. They should instead ignore multixact changes from commit 4f627f8
through its revert. The 2015-09-26 commits have not appeared in a supported
release, and no other work has built upon them. They have no tenure. (I am
glad you talked the author out of back-patching; otherwise, 9.4.5 and 9.3.10
would have introduced a data loss bug.) Nobody reported a single defect
before my review overturned half the patch. A revert will indeed impose on
those who invested time to understand commit 4f627f8, but the silence about
its defects suggests the people understanding it number either zero or one.
Even as its author and reviewers, you would do better to set aside what you
thought you knew about this code.

I just don't find this a realistic model of how people use the git
log. Maybe you use it this way; I don't. I don't *want* git blame to
make it seem as if 4f627f8 is not part of the history. For better or
worse, it is.

I would like to understand how you use git, then. What's one of your models
of using "git log" and/or "git blame" in which you foresee the revert making
history less clear, not more clear?

By the way, it occurs to me that I should also make pg_upgrade blacklist the
range of catversions that might have data loss. No sense in putting ourselves
in the position of asking whether data files of a 9.9.3 cluster spent time in
a 9.5beta2 cluster.

Ripping it out and replacing it monolithically will not
change that; it will only make the detailed history harder to
reconstruct, and I *will* want to reconstruct it.

What's something that might happen six months from now and lead you to inspect
master or 9.5 multixact.c between 4f627f8 and its revert?


#73Andres Freund
andres@anarazel.de
In reply to: Noah Misch (#72)
Re: Rework the way multixact truncations work

On 2015-12-09 20:23:06 -0500, Noah Misch wrote:

By the way, it occurs to me that I should also make pg_upgrade blacklist the
range of catversions that might have data loss. No sense in putting ourselves
in the position of asking whether data files of a 9.9.3 cluster spent time in
a 9.5beta2 cluster.

I can't see any benefit in that. We're talking about a bug that, afaics,
needs another unknown bug to trigger (so find_multixact_start() fails),
and then very likely needs significant amounts of new multixacts
consumed, without a restart and without find_multixact_start()
succeeding later.

What I think would actually help for questions like this, is to add, as
discussed in some other threads, the following:
1) 'creating version' to pg_control
2) 'creating version' to each pg_class entry
3) 'last relation rewrite in version' to each pg_class entry
4) 'last full vacuum in version' to each pg_class entry

I think for this purpose 'version' should be something akin to
$catversion||$numericversion (int64 probably?) - that way development
branches and release branches are handled somewhat usefully.

I think that'd be useful, both from an investigatory perspective and
from a tooling perspective, because it'd allow reusing things like hint
bits.
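
As a rough illustration of the $catversion||$numericversion idea, one
possible packing into a single 64-bit value is sketched below. The helper
names, the choice of an unsigned 64-bit type, and the layout (catversion in
the high 32 bits) are assumptions made for this sketch; none of them exist
in PostgreSQL, and the thread does not settle on a layout.

```c
#include <stdint.h>

/*
 * Hypothetical helper: combine a catalog version and a numeric server
 * version into one 64-bit "creating version" value.  The catversion goes
 * in the high 32 bits so that packed values sort by catversion first and
 * by numeric version second.
 */
static inline uint64_t
pack_creating_version(uint32_t catversion, uint32_t version_num)
{
    return ((uint64_t) catversion << 32) | (uint64_t) version_num;
}

/* Hypothetical helper: extract the catalog version (high 32 bits). */
static inline uint32_t
creating_version_catversion(uint64_t packed)
{
    return (uint32_t) (packed >> 32);
}

/* Hypothetical helper: extract the numeric server version (low 32 bits). */
static inline uint32_t
creating_version_num(uint64_t packed)
{
    return (uint32_t) (packed & 0xFFFFFFFFu);
}
```

With this layout a plain integer comparison orders two packed values by
catversion and then by numeric version, which is one way to treat
development branches and release branches "somewhat usefully".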

Ripping it out and replacing it monolithically will not
change that; it will only make the detailed history harder to
reconstruct, and I *will* want to reconstruct it.

What's something that might happen six months from now and lead you to inspect
master or 9.5 multixact.c between 4f627f8 and its revert?

"Hey, what has happened to multixact.c lately? I'm investigating a bug,
and I wonder if it already has been fixed?", "Uh, what was the problem
with that earlier large commit?", "Hey, what has changed between beta2
and the final release?"...


#74Peter Geoghegan
pg@heroku.com
In reply to: Andres Freund (#73)
Re: Rework the way multixact truncations work

On Thu, Dec 10, 2015 at 12:34 AM, Andres Freund <andres@anarazel.de> wrote:

Ripping it out and replacing it monolithically will not
change that; it will only make the detailed history harder to
reconstruct, and I *will* want to reconstruct it.

What's something that might happen six months from now and lead you to inspect
master or 9.5 multixact.c between 4f627f8 and its revert?

"Hey, what has happened to multixact.c lately? I'm investigating a bug,
and I wonder if it already has been fixed?", "Uh, what was the problem
with that earlier large commit?", "Hey, what has changed between beta2
and the final release?"...

Quite.

I can't believe we're still having this silly discussion. Can we please move on?

--
Peter Geoghegan


#75Bert
biertie@gmail.com
In reply to: Peter Geoghegan (#74)
Re: Rework the way multixact truncations work

+1

On Thu, Dec 10, 2015 at 9:58 AM, Peter Geoghegan <pg@heroku.com> wrote:

On Thu, Dec 10, 2015 at 12:34 AM, Andres Freund <andres@anarazel.de>
wrote:

Ripping it out and replacing it monolithically will not
change that; it will only make the detailed history harder to
reconstruct, and I *will* want to reconstruct it.

What's something that might happen six months from now and lead you to inspect
master or 9.5 multixact.c between 4f627f8 and its revert?

"Hey, what has happened to multixact.c lately? I'm investigating a bug,
and I wonder if it already has been fixed?", "Uh, what was the problem
with that earlier large commit?", "Hey, what has changed between beta2
and the final release?"...

Quite.

I can't believe we're still having this silly discussion. Can we please
move on?

--
Peter Geoghegan


--
Bert Desmet
0477/305361

#76Robert Haas
robertmhaas@gmail.com
In reply to: Noah Misch (#72)
Re: Rework the way multixact truncations work

On Wed, Dec 9, 2015 at 8:23 PM, Noah Misch <noah@leadboat.com> wrote:

On Wed, Dec 09, 2015 at 11:08:32AM -0500, Robert Haas wrote:

On Wed, Dec 9, 2015 at 9:43 AM, Noah Misch <noah@leadboat.com> wrote:

I hope those who have not already read commit 4f627f8 will not waste time
reading it. They should instead ignore multixact changes from commit 4f627f8
through its revert. The 2015-09-26 commits have not appeared in a supported
release, and no other work has built upon them. They have no tenure. (I am
glad you talked the author out of back-patching; otherwise, 9.4.5 and 9.3.10
would have introduced a data loss bug.) Nobody reported a single defect
before my review overturned half the patch. A revert will indeed impose on
those who invested time to understand commit 4f627f8, but the silence about
its defects suggests the people understanding it number either zero or one.
Even as its author and reviewers, you would do better to set aside what you
thought you knew about this code.

I just don't find this a realistic model of how people use the git
log. Maybe you use it this way; I don't. I don't *want* git blame to
make it seem as if 4f627f8 is not part of the history. For better or
worse, it is.

I would like to understand how you use git, then. What's one of your models
of using "git log" and/or "git blame" in which you foresee the revert making
history less clear, not more clear?

Well, suppose I wanted to know what bugs were fixed between 9.5beta
and 9.5.0, for example. I mean, I'm going to run git log
src/backend/access/transam/multixact.c ... and the existing commits
are going to be there.

By the way, it occurs to me that I should also make pg_upgrade blacklist the
range of catversions that might have data loss. No sense in putting ourselves
in the position of asking whether data files of a 9.9.3 cluster spent time in
a 9.5beta2 cluster.

Maybe. But I think we could use a little more vigorous discussion of
that issue, since Andres doesn't seem to be very convinced by your
analysis, and I don't currently understand what you've fixed because I
can't, as mentioned several times, follow your patch stack.

Ripping it out and replacing it monolithically will not
change that; it will only make the detailed history harder to
reconstruct, and I *will* want to reconstruct it.

What's something that might happen six months from now and lead you to inspect
master or 9.5 multixact.c between 4f627f8 and its revert?

I don't have anything to add to what others have said in response
to this point, except this: the whole point of using a source code
management system is to tell you what changed and when. What you are
proposing to do makes it unusable for that purpose.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#77Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#76)
Re: Rework the way multixact truncations work

On 2015-12-10 08:55:54 -0500, Robert Haas wrote:

Maybe. But I think we could use a little more vigorous discussion of
that issue, since Andres doesn't seem to be very convinced by your
analysis, and I don't currently understand what you've fixed because I
can't, as mentioned several times, follow your patch stack.

The issue at hand is that the following block

    oldestOffsetKnown =
        find_multixact_start(oldestMultiXactId, &oldestOffset);

    ...
    else if (prevOldestOffsetKnown)
    {
        /*
         * If we failed to get the oldest offset this time, but we have a
         * value from a previous pass through this function, use the old value
         * rather than automatically forcing it.
         */
        oldestOffset = prevOldestOffset;
        oldestOffsetKnown = true;
    }

in SetOffsetVacuumLimit() fails to restore offsetStopLimit, which then
is set in shared memory:

    /* Install the computed values */
    LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
    MultiXactState->oldestOffset = oldestOffset;
    MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
    MultiXactState->offsetStopLimit = offsetStopLimit;
    LWLockRelease(MultiXactGenLock);

so, if find_multixact_start() failed - a "should never happen" occurrence
- we install a wrong stop limit. It does get 'repaired' upon the next
succeeding find_multixact_start() in SetOffsetVacuumLimit() or a restart
though.

Adding a 'prevOffsetStopLimit' and using it seems like a ~5 line patch.
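
To make the failure mode concrete, here is a minimal standalone model of
the logic described above. It is a sketch, not the real function: the
ModelState struct, model_set_offset_vacuum_limit(), and the
MODEL_STOP_MARGIN constant are hypothetical stand-ins, locking is omitted,
and the stop-limit arithmetic is a placeholder for whatever
SetOffsetVacuumLimit() actually computes.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t MultiXactOffset;

/* Hypothetical placeholder for the real stop-limit computation. */
#define MODEL_STOP_MARGIN 1000u

/* Simplified stand-in for the relevant MultiXactState fields. */
typedef struct ModelState
{
    MultiXactOffset oldestOffset;
    bool            oldestOffsetKnown;
    MultiXactOffset offsetStopLimit;
} ModelState;

/*
 * Model of one pass through SetOffsetVacuumLimit().  lookup_ok and
 * found_offset stand in for the result of find_multixact_start();
 * restore_stop_limit selects the proposed prevOffsetStopLimit fix.
 */
static void
model_set_offset_vacuum_limit(ModelState *state, bool lookup_ok,
                              MultiXactOffset found_offset,
                              bool restore_stop_limit)
{
    MultiXactOffset oldestOffset = 0;
    bool            oldestOffsetKnown = lookup_ok;
    MultiXactOffset offsetStopLimit = 0;

    /* Remember what the previous pass installed. */
    bool            prevOldestOffsetKnown = state->oldestOffsetKnown;
    MultiXactOffset prevOldestOffset = state->oldestOffset;
    MultiXactOffset prevOffsetStopLimit = state->offsetStopLimit;

    if (oldestOffsetKnown)
    {
        oldestOffset = found_offset;
        offsetStopLimit = found_offset - MODEL_STOP_MARGIN;
    }
    else if (prevOldestOffsetKnown)
    {
        /* Lookup failed: fall back to the previous pass's values. */
        oldestOffset = prevOldestOffset;
        oldestOffsetKnown = true;
        if (restore_stop_limit)
            offsetStopLimit = prevOffsetStopLimit;  /* the proposed fix */
    }

    /* Install the computed values (locking omitted in this model). */
    state->oldestOffset = oldestOffset;
    state->oldestOffsetKnown = oldestOffsetKnown;
    state->offsetStopLimit = offsetStopLimit;
}
```

In this model, a successful pass followed by a failed one leaves
offsetStopLimit at 0 when restore_stop_limit is false, while oldestOffset
keeps its earlier value; with restore_stop_limit set to true the previous
stop limit survives the failed lookup.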


#78Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#77)
Re: Rework the way multixact truncations work

On Thu, Dec 10, 2015 at 9:04 AM, Andres Freund <andres@anarazel.de> wrote:

On 2015-12-10 08:55:54 -0500, Robert Haas wrote:

Maybe. But I think we could use a little more vigorous discussion of
that issue, since Andres doesn't seem to be very convinced by your
analysis, and I don't currently understand what you've fixed because I
can't, as mentioned several times, follow your patch stack.

The issue at hand is that the following block

    oldestOffsetKnown =
        find_multixact_start(oldestMultiXactId, &oldestOffset);

    ...
    else if (prevOldestOffsetKnown)
    {
        /*
         * If we failed to get the oldest offset this time, but we have a
         * value from a previous pass through this function, use the old value
         * rather than automatically forcing it.
         */
        oldestOffset = prevOldestOffset;
        oldestOffsetKnown = true;
    }

in SetOffsetVacuumLimit() fails to restore offsetStopLimit, which then
is set in shared memory:

    /* Install the computed values */
    LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
    MultiXactState->oldestOffset = oldestOffset;
    MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
    MultiXactState->offsetStopLimit = offsetStopLimit;
    LWLockRelease(MultiXactGenLock);

so, if find_multixact_start() failed - a "should never happen" occurrence
- we install a wrong stop limit. It does get 'repaired' upon the next
succeeding find_multixact_start() in SetOffsetVacuumLimit() or a restart
though.

Adding a 'prevOffsetStopLimit' and using it seems like a ~5 line patch.

So let's do that, then.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#79Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#78)
Re: Rework the way multixact truncations work

On Thu, Dec 10, 2015 at 9:32 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Dec 10, 2015 at 9:04 AM, Andres Freund <andres@anarazel.de> wrote:

On 2015-12-10 08:55:54 -0500, Robert Haas wrote:

Maybe. But I think we could use a little more vigorous discussion of
that issue, since Andres doesn't seem to be very convinced by your
analysis, and I don't currently understand what you've fixed because I
can't, as mentioned several times, follow your patch stack.

The issue at hand is that the following block

    oldestOffsetKnown =
        find_multixact_start(oldestMultiXactId, &oldestOffset);

    ...
    else if (prevOldestOffsetKnown)
    {
        /*
         * If we failed to get the oldest offset this time, but we have a
         * value from a previous pass through this function, use the old value
         * rather than automatically forcing it.
         */
        oldestOffset = prevOldestOffset;
        oldestOffsetKnown = true;
    }

in SetOffsetVacuumLimit() fails to restore offsetStopLimit, which then
is set in shared memory:

    /* Install the computed values */
    LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
    MultiXactState->oldestOffset = oldestOffset;
    MultiXactState->oldestOffsetKnown = oldestOffsetKnown;
    MultiXactState->offsetStopLimit = offsetStopLimit;
    LWLockRelease(MultiXactGenLock);

so, if find_multixact_start() failed - a "should never happen" occurrence
- we install a wrong stop limit. It does get 'repaired' upon the next
succeeding find_multixact_start() in SetOffsetVacuumLimit() or a restart
though.

Adding a 'prevOffsetStopLimit' and using it seems like a ~5 line patch.

So let's do that, then.

Who is going to take care of this?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#80Noah Misch
noah@leadboat.com
In reply to: Robert Haas (#76)
Re: Rework the way multixact truncations work

On Thu, Dec 10, 2015 at 08:55:54AM -0500, Robert Haas wrote:

I don't have anything to add to what others have said in response
to this point, except this: the whole point of using a source code
management system is to tell you what changed and when. What you are
proposing to do makes it unusable for that purpose.

Based on your comments, I'm calling the patch series returned with feedback.
I built the series around the goal of making history maximally reviewable for
persons not insiders to commit 4f627f8. Having spent 90% of my 2015
PostgreSQL contribution time finding or fixing committed defects, my judgment
of how best to achieve that is no shout from the peanut gallery. (Neither is
your judgment.) In particular, I had in view two works, RLS and pg_audit,
that used the post-commit repair strategy you've advocated. But you gave me a
fair chance to make the case, and you stayed convinced that my repairs oppose
my goal. I can now follow your development of that belief, which is enough.


#81Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#79)
2 attachment(s)
Re: Rework the way multixact truncations work

Noah, Robert, All

On 2015-12-11 11:20:21 -0500, Robert Haas wrote:

On Thu, Dec 10, 2015 at 9:32 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Adding a 'prevOffsetStopLimit' and using it seems like a ~5 line patch.

So let's do that, then.

Who is going to take care of this?

Here's two patches:

1) The fix to SetOffsetVacuumLimit().

I've tested this by introducing a probabilistic "return false;" to
find_multixact_start(), and torturing postgres by burning through
billions of multixactids of various sizes. Behaves about as
bad^Wgood as without the induced failures; before the patch there
were moments of spurious warnings/errors when ->offsetStopLimit was
out of whack.

2) A subset of the comment changes from Noah's repository. Some of the
comment changes didn't make sense without the removal of
SlruScanDirCbFindEarliest(), a small number of other changes I couldn't
fully agree with.
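
For readers who want to try a similar torture test, the probabilistic
"return false;" mentioned under 1) could be approximated along the
following lines. This is only a sketch: inject_failure() and
lookup_with_injected_failure() are hypothetical names, rand() is used for
brevity, and the lookup is a stand-in rather than the real
find_multixact_start().

```c
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical test-only helper: returns true in roughly fail_percent
 * out of every 100 calls.  The caller is expected to seed rand() with
 * srand() if varied runs are wanted.
 */
static bool
inject_failure(unsigned fail_percent)
{
    return (unsigned) (rand() % 100) < fail_percent;
}

/*
 * Stand-in for a lookup such as find_multixact_start(): reports failure
 * when the injection fires and leaves *result untouched; otherwise it
 * "finds" the supplied value.
 */
static bool
lookup_with_injected_failure(unsigned fail_percent, unsigned value,
                             unsigned *result)
{
    if (inject_failure(fail_percent))
        return false;
    *result = value;
    return true;
}
```

A fail_percent of 0 disables the injection and 100 forces every lookup to
fail, which makes the fallback path easy to exercise deterministically
before switching to a small percentage for a long-running stress test.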

Noah, are you ok with pushing that subset of your changes? Is
"Slightly edited subset of a larger patchset by Noah Misch" an OK
attribution?

Noah, at first glance I think e50cca0ae ("Remove the
SlruScanDirCbFindEarliest() test from TruncateMultiXact().") is a good
idea. So I do encourage you to submit that as a separate patch.

Regards,

Andres

Attachments:

0001-Fix-bug-in-SetOffsetVacuumLimit-triggered-by-find_mu.patch (text/x-patch; charset=us-ascii)
From d26d41a96a7fc8dbe9827d55837875790b21ca1b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 12 Dec 2015 17:57:21 +0100
Subject: [PATCH 1/2] Fix bug in SetOffsetVacuumLimit() triggered by
 find_multixact_start() failure.

In case find_multixact_start() failed SetOffsetVacuumLimit() installed 0
into MultiXactState->offsetStopLimit. Unlike oldestOffset the
to-be-installed value was not restored in the error branch.

Luckily there are no known cases where find_multixact_start() will
return an error in 9.5 and above.

But if the bug is triggered, e.g. due to filesystem permission issues,
it'd be somewhat bad: GetNewMultiXactId() could continue allocating
mxids even if close to a wraparound, or it could erroneously stop
allocating mxids, even if no wraparound is looming.  Luckily the wrong
value would be corrected the next time SetOffsetVacuumLimit() is called,
or by a restart.

Reported-By: Noah Misch, although this is not his preferred fix
Backpatch: 9.5, where the bug was introduced as part of 4f627f
---
 src/backend/access/transam/multixact.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b66a2b6..d2619bd 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -2552,6 +2552,7 @@ SetOffsetVacuumLimit(void)
 	bool		oldestOffsetKnown = false;
 	bool		prevOldestOffsetKnown;
 	MultiXactOffset offsetStopLimit = 0;
+	MultiXactOffset prevOffsetStopLimit;
 
 	/*
 	 * NB: Have to prevent concurrent truncation, we might otherwise try to
@@ -2566,6 +2567,7 @@ SetOffsetVacuumLimit(void)
 	nextOffset = MultiXactState->nextOffset;
 	prevOldestOffsetKnown = MultiXactState->oldestOffsetKnown;
 	prevOldestOffset = MultiXactState->oldestOffset;
+	prevOffsetStopLimit = MultiXactState->offsetStopLimit;
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
@@ -2633,11 +2635,13 @@ SetOffsetVacuumLimit(void)
 	{
 		/*
 		 * If we failed to get the oldest offset this time, but we have a
-		 * value from a previous pass through this function, use the old value
-		 * rather than automatically forcing it.
+		 * value from a previous pass through this function, use the old
+		 * values rather than automatically forcing an emergency autovacuum
+		 * cycle again.
 		 */
 		oldestOffset = prevOldestOffset;
 		oldestOffsetKnown = true;
+		offsetStopLimit = prevOffsetStopLimit;
 	}
 
 	/* Install the computed values */
-- 
2.6.0.rc3

0002-Improve-Fix-comments-around-multixact-truncation.patch (text/x-patch; charset=us-ascii)
From addafc27e7b89187a3f55a2cbce136d58722c97b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 12 Dec 2015 17:53:41 +0100
Subject: [PATCH 2/2] Improve/Fix comments around multixact truncation.

My commits 4f627f8 ("Rework the way multixact truncations work.") and
aa29c1c ("Remove legacy multixact truncation support.") missed updating
a number of comments. Fix that. Additionally improve accuracy of a few
of the added comments.

Reported-By: Noah Misch
Author: Slightly edited subset of a larger patchset by Noah Misch
Backpatch: 9.5, which is the first branch to contain the above commits
---
 src/backend/access/heap/README.tuplock |  6 ++----
 src/backend/access/transam/multixact.c | 36 +++++++++++++++++++---------------
 src/backend/access/transam/xlog.c      |  4 ----
 3 files changed, 22 insertions(+), 24 deletions(-)

diff --git a/src/backend/access/heap/README.tuplock b/src/backend/access/heap/README.tuplock
index 10b8d78..f845958 100644
--- a/src/backend/access/heap/README.tuplock
+++ b/src/backend/access/heap/README.tuplock
@@ -102,10 +102,8 @@ the MultiXacts in them are no longer of interest.
 VACUUM is in charge of removing old MultiXacts at the time of tuple freezing.
 The lower bound used by vacuum (that is, the value below which all multixacts
 are removed) is stored as pg_class.relminmxid for each table; the minimum of
-all such values is stored in pg_database.datminmxid.  The minimum across
-all databases, in turn, is recorded in checkpoint records, and CHECKPOINT
-removes pg_multixact/ segments older than that value once the checkpoint
-record has been flushed.
+all such values is stored in pg_database.datminmxid.  VACUUM removes
+pg_multixact/ segments older than the minimum datminmxid across databases.
 
 Infomask Bits
 -------------
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index d2619bd..fdeb90b 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -257,7 +257,7 @@ typedef struct MultiXactStateData
 	 * we compute it (using nextMXact if none are valid).  Each backend is
 	 * required not to attempt to access any SLRU data for MultiXactIds older
 	 * than its own OldestVisibleMXactId[] setting; this is necessary because
-	 * the checkpointer could truncate away such data at any instant.
+	 * VACUUM could truncate away such data at any instant.
 	 *
 	 * The oldest valid value among all of the OldestMemberMXactId[] and
 	 * OldestVisibleMXactId[] entries is considered by vacuum as the earliest
@@ -271,8 +271,8 @@ typedef struct MultiXactStateData
 	 * the freezing point so computed is used as the new pg_class.relminmxid
 	 * value.  The minimum of all those values in a database is stored as
 	 * pg_database.datminmxid.  In turn, the minimum of all of those values is
-	 * stored in pg_control and used as truncation point for pg_multixact.  At
-	 * checkpoint or restartpoint, unneeded segments are removed.
+	 * stored in pg_control and used as truncation point for pg_multixact.
+	 * VACUUM removes unneeded segments.
 	 */
 	MultiXactId perBackendXactIds[FLEXIBLE_ARRAY_MEMBER];
 } MultiXactStateData;
@@ -662,8 +662,8 @@ MultiXactIdSetOldestMember(void)
  *
  * We set the OldestVisibleMXactId for a given transaction the first time
  * it's going to inspect any MultiXactId.  Once we have set this, we are
- * guaranteed that the checkpointer won't truncate off SLRU data for
- * MultiXactIds at or after our OldestVisibleMXactId.
+ * guaranteed that VACUUM won't truncate off SLRU data for MultiXactIds at
+ * or after our OldestVisibleMXactId.
  *
  * The value to set is the oldest of nextMXact and all the valid per-backend
  * OldestMemberMXactId[] entries.  Because of the locking we do, we can be
@@ -2207,9 +2207,7 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid)
 	/*
 	 * We pretend that a wrap will happen halfway through the multixact ID
 	 * space, but that's not really true, because multixacts wrap differently
-	 * from transaction IDs.  Note that, separately from any concern about
-	 * multixact IDs wrapping, we must ensure that multixact members do not
-	 * wrap.  Limits for that are set in DetermineSafeOldestOffset, not here.
+	 * from transaction IDs.
 	 */
 	multiWrapLimit = oldest_datminmxid + (MaxMultiXactId >> 1);
 	if (multiWrapLimit < FirstMultiXactId)
@@ -2253,7 +2251,7 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid)
 	if (multiVacLimit < FirstMultiXactId)
 		multiVacLimit += FirstMultiXactId;
 
-	/* Grab lock for just long enough to set the new limit values */
+	/* Grab lock for just long enough to set the new MultiXactId bounds */
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->oldestMultiXactId = oldest_datminmxid;
 	MultiXactState->oldestMultiXactDB = oldest_datoid;
@@ -2270,11 +2268,11 @@ SetMultiXactIdLimit(MultiXactId oldest_datminmxid, Oid oldest_datoid)
 			 multiWrapLimit, oldest_datoid)));
 
 	/*
-	 * Computing the actual limits is only possible once the data directory is
-	 * in a consistent state. There's no need to compute the limits while
-	 * still replaying WAL - no decisions about new multis are made even
-	 * though multixact creations might be replayed. So we'll only do further
-	 * checks after TrimMultiXact() has been called.
+	 * Setting MultiXactState->oldestOffset (in SetOffsetVacuumLimit())
+	 * entails a find_multixact_start() call, which is only possible once the
+	 * data directory is in a consistent state. There's no need for an offset
+	 * limit while still replaying WAL; no decisions about new multis are made
+	 * even though multixact creations might be replayed.
 	 */
 	if (!MultiXactState->finishedStartup)
 		return;
@@ -3069,8 +3067,8 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	 * the critical section: Have to do it before truncation, to prevent
 	 * concurrent lookups of those values. Has to be inside the critical
 	 * section as otherwise a future call to this function would error out,
-	 * while looking up the oldest member in offsets, if our caller crashes
-	 * before updating the limits.
+	 * while looking up the oldest member in offsets, given ereport(ERROR)
+	 * before our caller updates the limits.
 	 */
 	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 	MultiXactState->oldestMultiXactId = newOldestMulti;
@@ -3185,6 +3183,12 @@ WriteMZeroPageXlogRec(int pageno, uint8 info)
 /*
  * Write a TRUNCATE xlog record
  *
+ * Flushing the record to disk serves two critical roles.  Like it is with
+ * TruncateCLOG(), recent HEAP_FREEZE records must reach disk before we delete
+ * any MultiXactId they overwrote.  This flush before unlinking any segment
+ * ensures that a crash/basebackup and recovery will not make
+ * MultiXactState->oldestMultiXactId point to a missing segment.
+
  * We must flush the xlog record to disk before returning --- see notes in
  * TruncateCLOG().
  */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 71fc8ff..35d76de 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9319,10 +9319,6 @@ xlog_redo(XLogReaderState *record)
 		MultiXactAdvanceNextMXact(checkPoint.nextMulti,
 								  checkPoint.nextMultiOffset);
 
-		/*
-		 * NB: This may perform multixact truncation when replaying WAL
-		 * generated by an older primary.
-		 */
 		MultiXactAdvanceOldest(checkPoint.oldestMulti,
 							   checkPoint.oldestMultiDB);
 		if (TransactionIdPrecedes(ShmemVariableCache->oldestXid,
-- 
2.6.0.rc3

#82Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#81)
Re: Rework the way multixact truncations work

On Sat, Dec 12, 2015 at 12:02 PM, Andres Freund <andres@anarazel.de> wrote:

Noah, Robert, All

On 2015-12-11 11:20:21 -0500, Robert Haas wrote:

On Thu, Dec 10, 2015 at 9:32 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Adding a 'prevOffsetStopLimit' and using it seems like a ~5 line patch.

So let's do that, then.

Who is going to take care of this?

Here's two patches:

1) The fix to SetOffsetVacuumLimit().

I've tested this by introducing a probabilistic "return false;" to
find_multixact_start(), and torturing postgres by burning through
billions of multixactids of various sizes. Behaves about as
bad^Wgood as without the induced failures; before the patch there
were moments of spurious warnings/errors when ->offsetStopLimit was
out of whack.

I find the commit message you wrote a little difficult to read, and
propose the following version instead, which reads better to me:

Previously, if find_multixact_start() failed, SetOffsetVacuumLimit()
would install 0 into MultiXactState->offsetStopLimit. Luckily, there
are no known cases where find_multixact_start() will return an error
in 9.5 and above. But if it were to happen, for example due to
filesystem permission issues, it'd be somewhat bad:
GetNewMultiXactId() could continue allocating mxids even if close to a
wraparound, or it could erroneously stop allocating mxids, even if no
wraparound is looming. The wrong value would be corrected the next
time SetOffsetVacuumLimit() is called, or by a restart.

(I have no comments on the substance of either patch and have reviewed
the first one to a negligible degree - it doesn't look obviously wrong
- and the second one not at all.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
